Center for Computation and Visualization
Oscar - Historical Issues
  • Oscar: New jobs are slow to start

    • Resolved 3/13/2025, 3:07:23 PM

      This issue has been resolved

    • Update 3/13/2025, 3:07:23 PM

      This is fixed.

    • 2/6/2025, 1:57:20 PM

      Oscar Users, Please see the details below on the Oscar outage.

      Issue: New interactive jobs are slow to start.

      Impact: Users may need to wait several minutes before an interactive job starts.

      OIT is investigating this issue. Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely, CCV Research Technical Services

  • Oscar: New jobs cannot be submitted

    • Resolved 2/3/2025, 6:11:24 PM

      This issue has been resolved

    • 1/31/2025, 10:40:08 PM

      Oscar Users, Please see the details below on the Oscar outage.

      Issue: New jobs cannot be started on Oscar

      Impact:

      • Users cannot start new batch or interactive jobs
      • Users cannot start new Open OnDemand sessions
      • Users cannot check the status of existing jobs

      Instructions for users:

      Wait for a message from our team before submitting a new job.

      Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely, CCV Research Technical Services

  • Open OnDemand Outage

    • Resolved 1/29/2025, 2:13:10 PM

      This issue has been resolved

    • Update 1/29/2025, 2:13:11 PM

      The issue is resolved.

    • 1/29/2025, 1:25:45 PM

      There is an issue with the Open OnDemand server. New Open OnDemand (OOD) jobs cannot be launched, and existing jobs show an 'undetermined' status. We are working on the issue now. Please email support@ccv.brown.edu with any issues, questions or concerns.

  • Oscar Outage

    • Resolved 7/11/2024, 8:10:13 PM

      This issue has been resolved

    • Update 7/11/2024, 8:10:13 PM

      Closing this. The outage was resolved the very same day.

    • Investigating 6/18/2024, 12:19:05 PM

      Oscar Users,

      Please see the details below on the Oscar outage.

      Issue:

      Oscar services are impacted by the current power issue at a data center. We are investigating the issue now and will update with more details soon.

      Impact:

      Users cannot access Oscar using these services

      Instructions for users:

      Users can still connect to OOD from a web browser and launch interactive apps.

      We will update this email thread with more information later.

      Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely,

      CCV Research Technical Services

  • Open OnDemand is down

    • Resolved 5/5/2024, 12:14:41 AM

      This issue has been resolved

    • Investigating 5/4/2024, 11:01:47 PM

      Users are currently experiencing problems accessing Open OnDemand (ood.ccv.brown.edu). Please use SSH to access Oscar in the meantime.

  • Oscar Unavailable due to Campus-Wide Outage

    • Resolved 3/25/2024, 12:57:37 PM

      This issue has been resolved

    • Update 3/25/2024, 12:57:37 PM

      The issue has been resolved.

    • 3/24/2024, 11:34:42 PM

      Oscar users, Please see the details below on the current outage:

      Issue:

      • Users cannot connect to Oscar due to a campus-wide outage. This outage is affecting multiple services.

      Impact on Oscar:

      • Users cannot connect to Oscar through SSH, Open OnDemand, or file transfer services

      • Oscar jobs should remain unaffected, unless they need to access the internet

      We will update this email thread with more information later.

      Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely, CCV Research Technical Services

  • Oscar Outage - Cannot start SSH session

    • Resolved 3/15/2024, 5:33:35 PM

      This issue has been resolved

    • Update 3/15/2024, 5:33:15 PM

      Oscar Users,

      The issue has been resolved. However, some running jobs were requeued (the jobs were killed and restarted). The system is now stable, and you may resume submitting jobs.

      Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely,

      CCV Research Technical Services

    • Investigating 3/15/2024, 5:08:37 PM

      Oscar Users,

      Please see the details below on the current outage:

      Issue:

      Our engineers are investigating.

      Impact:

      Users will NOT be able to start an SSH session from these gateways:

      Open OnDemand remains unaffected

      All Slurm jobs are unaffected

      The file system is unaffected

      Instructions for Users:

      Use Open OnDemand to access Oscar

      We will update this email thread with more information later.

      Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely,

      CCV Research Technical Services

  • GPU Jobs Being Killed Due to a Scheduler Bug

    • Resolved 1/25/2024, 8:52:55 PM

      This issue has been resolved

    • Update 1/25/2024, 8:52:55 PM

      The scheduler change was completed on Friday, January 19, 2024 at 3:32 PM. No jobs were disrupted during the process. Let's please close these issues in a timely manner.

    • 1/18/2024, 9:00:45 PM

      Oscar Users,

      Please see the details below on GPU jobs that were killed and requeued:

      What is the issue: During the process of making configuration changes to the cluster, we have identified a scheduler bug: the scheduler kills and requeues GPU jobs when the scheduler configuration is updated.

      What is the impact: A requeued GPU job will restart with the same job ID

      Should you submit jobs? You can continue to submit jobs as usual, and they will proceed to run. We want to reassure you that before implementing any new configuration changes to the scheduler, we will provide a community-wide announcement to ensure that you have time to save your workloads.

      Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely, CCV Research Technical Services

  • Oscar Upgrade to RHEL9

    • Resolved 1/9/2024, 10:09:12 PM

      This issue has been resolved

    • 1/9/2024, 1:46:38 PM

      In January, CCV will conduct scheduled Oscar maintenance to upgrade the operating system and the module system. To facilitate the transition, CCV has set up a mini cluster with the upgraded OS and module system so that users can test their programs/jobs.

      • Users can connect to the mini RHEL/9.2 cluster by running the command ssh -X login009 on the current login nodes (login007 or login008). For more details, refer to this page.
      • Users can drop in to this Zoom meeting on Mondays and Wednesdays between 3 and 4pm before winter break for assistance with the new cluster.
      • We encourage users to test their programs/jobs on this cluster and reach out to us (support@ccv.brown.edu) for any help, issues or feedback.

      Please see below for details on the scheduled maintenance.

      Maintenance Window:

      • Note: Below is the estimated window. The exact dates and times are still tentative.
      • Start: 1/9/2024 5:00 am EST
      • End: 1/12/2024 5:00 pm EST

      Maintenance Description:

      • The OS will be upgraded from RHEL/7.2 to RHEL/9.2. For exact details, please refer to this documentation page.
      • We are increasing the CPU cores for priority-GPU accounts. Please see the link above.
      • The Oscar module system will be migrated to Lmod. For a comprehensive list of old and new modules, refer to this documentation page.

      Expected Impact:

      • All Oscar services will be unavailable during the downtime
      • Jobs which won’t complete by the beginning of the maintenance window won’t start and ‘myq/squeue’ will report (ReqNodeNotAvail, Reserved for maintenance)

      Instructions for Users

      • Users will need to resubmit jobs after the maintenance

      • Users will need to update their job submission scripts since Oscar module names/versions will be different in the new module system.

      • Locally installed packages/environments may not work after the maintenance due to the OS upgrade. We recommend that users reinstall and test their installed packages on the RHEL/9.2 mini cluster before the downtime.

  • Service Degradation on VSCode server

    • Resolved 11/20/2023, 7:58:31 PM

      This issue has been resolved

    • 11/16/2023, 12:19:09 AM

      Oscar HPC experienced a service degradation with the VS Code server this evening. Our automated security systems identified a potential threat on the VS Code host node, and all ongoing sessions were terminated as a result. Our Systems team has confirmed that the host is safe, and it is now operational again.

  • Internet Outage

    • Resolved 10/21/2023, 3:14:51 PM

      This issue has been resolved

    • Update 10/21/2023, 3:14:51 PM

      The issue is resolved.

    • Investigating 10/21/2023, 1:50:05 PM

      What’s the issue: The route to the Wide Area Network (WAN) is having an issue, resulting in no internet access to or from Oscar.

      Impact:

      • Routing to the internet is broken.
      • Users are unable to connect to Oscar (ssh.ccv.brown.edu or ood.ccv.brown.edu) unless they are connected to the Brown network or VPN.
      • Jobs and processes that need internet access, such as rsync, scp, git pull/push, file transfers, and singularity pull, are not working at the moment and will fail. Users need to resubmit their failed jobs.
      • Running and pending jobs that do not need external internet access are unaffected.
      • Globus transfers are not working.

      Current Status: Our engineers are currently working with the ISP to get this resolved as soon as possible.

      Please email support@ccv.brown.edu with any issues, questions or concerns.

  • Oscar - Open OnDemand Down

    • Resolved 6/26/2023, 9:26:43 PM

      This issue has been resolved

    • Update 6/26/2023, 9:26:43 PM

      Oscar Users,

      The Open OnDemand web portal is back in service.

      We will continue to monitor the website. If you are having trouble accessing it, please contact us at support@ccv.brown.edu.

      Sincerely, CCV Research Technical Services Team

    • 6/26/2023, 7:56:37 PM

      Oscar Users, Please see below regarding issues with Open OnDemand:

      What’s the issue: Open OnDemand (OOD) website is not accessible

      Impact:

      • Users are not able to access the OOD website/sessions
      • Slurm jobs are NOT impacted
      • All OOD app sessions are NOT impacted. They are just inaccessible right now.

      Instructions for Users:

      • Users are still able to access Oscar over SSH.
      • Please wait until we send out another announcement before accessing the OOD website.
      • Users can check the status at status.ccv.brown.edu

      We apologize for the inconvenience caused here. We are working on resolving this issue. Please reach out to us at support@ccv.brown.edu if you have any questions.

      Sincerely, CCV Research Technical Services

  • Scheduled Maintenance

    • Resolved 6/18/2023, 10:39:57 PM

      This issue has been resolved

    • Update 6/18/2023, 10:40:34 PM

      Oscar returned to service.

    • 6/12/2023, 3:33:29 PM

      OIT Planned Maintenance: Oscar High Performance Computing

      In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.

      Scheduled - Maintenance Window:

      • Start: 6/12/2023 5:00 am EST
      • End: 6/19/2023 10:00 am EST

      Maintenance Description: During this maintenance period, we will be improving the electrical infrastructure for the entire data center. We will also be migrating Oscar’s underlying storage system from IBM’s GPFS to a new all-flash system provided by Vast.

      Expected Impact: All Oscar services will be unavailable during the downtime. Jobs which won’t complete by the beginning of the maintenance window won’t start, and ‘myq/squeue’ will report (ReqNodeNotAvail, Reserved for maintenance).

      Users will need to resubmit jobs after the maintenance.

      Any existing batch/interactive jobs, Open OnDemand sessions, Terminal, Screen, and Tmux sessions will be terminated.

      Please email support@ccv.brown.edu with any issues, questions or concerns. We’ll be in touch again with additional instructions for the above changes (as appropriate).

  • Open OnDemand Server is Down

    • Resolved 4/20/2023, 3:30:37 PM

      This issue has been resolved

    • Update 4/20/2023, 3:30:37 PM

      The OOD server is back to service.

    • 4/20/2023, 3:19:54 PM

      The Open OnDemand server (ood.ccv.brown.edu) is down, and users are not able to connect to OOD. We are actively working on the issue. We apologize for the inconvenience.

  • GPFS Performance Degradation

    • Resolved 3/22/2023, 10:41:12 PM

      This issue has been resolved

    • Update 3/22/2023, 10:41:12 PM

      The failed controller has been reset. GPFS performance is back to normal.

    • Investigating 3/22/2023, 9:55:36 PM

      GPFS performance may temporarily decline due to a hardware malfunction on the storage server. The systems team is currently working on resolving the issue.

  • Transfer Server Outage for GUI Applications

    • Resolved 3/15/2023, 9:29:20 PM

      This issue has been resolved

    • Update 3/15/2023, 9:29:20 PM

      The transfer server has been returned to service. Users are now able to transfer files to/from Oscar using GUI applications.

    • 3/14/2023, 10:14:56 PM

      Users are not able to transfer files using GUI applications - FileZilla, Cyberduck, and WinSCP. We will fix the issue as soon as possible, and post additional updates here.

  • Slowness on Oscar file system

    • Resolved 4/28/2023, 3:19:49 PM

      This issue has been resolved

    • Update 4/28/2023, 3:18:28 PM

      We followed the vendor's instructions, and since then we have not been able to reproduce the slowness issue.

    • Update 3/29/2023, 3:10:11 PM

      We have successfully implemented the vendor's recommended changes, which have mitigated the issue. One of the key measures was deploying additional metadata servers to distribute the load more evenly. Our team is actively working with the vendor to identify the best long-term resolution for this issue.

    • 3/14/2023, 7:28:59 PM

      Several users have reported slowness in the file system while trying to open files on Oscar. We have contacted the vendors and we are still investigating this issue. We will post additional updates here.

  • Oscar Scheduled Maintenance

    • Resolved 1/12/2023, 3:05:26 PM

      This issue has been resolved

    • Update 1/12/2023, 3:05:26 PM

      The scheduled maintenance for Oscar is now complete. You should be able to connect to Oscar now.

      Instructions:

      MPI users need to rebuild MPI and MPI-enabled applications (please refer to this document for rebuilding MPI applications).

      A new dedicated node has been set up for VSCode remote-ssh connections.

      VSCode config must be updated (please refer to this document for new config)

      If you previously set up node1103, you may need to remove its old host identification (host key) from your SSH known_hosts file.

      The VNC server (desktop.ccv.brown.edu) has been retired. Use the Open OnDemand Desktop app instead.

      Help Session:

      To assist users with the transition, we have scheduled a drop-in Zoom session today. Feel free to drop in if you have any questions regarding Oscar.

      Time: Jan 12, 2023 (Thursday), 10am-12pm

      Zoom link: https://brown.zoom.us/j/92145839462

      Note:

      To prevent unexpected power fluctuations, we are gradually bringing nodes back into operation. As soon as a node becomes available, SLURM will begin allocating pending jobs to it.

    • 1/11/2023, 12:09:17 PM

      We are currently expecting the maintenance to be completed and services to be back online by January 12th at 10:00 am EST. Please note that this maintenance window is related to electrical work being performed. We will do our best to restore services as soon as possible.

  • Oscar - Login Nodes are not responding

    • Resolved 10/7/2022, 9:23:47 PM

      This issue has been resolved

    • Update 10/7/2022, 9:23:26 PM

      Oscar Users,

      This issue is now resolved. You should be able to connect to the login nodes. If you see any related issues, please reach out to us at support@ccv.brown.edu.

      Sincerely, CCV Research Technical Services Team

    • 10/7/2022, 7:55:22 PM

      Oscar Users, Please see the details below for the Oscar login nodes outage:

      What’s the issue: The login nodes are currently very slow or unresponsive.

      Impact:

      • Users will not be able to connect to the login nodes through SSH
      • This does NOT affect running jobs
      • All apps other than the terminal app on Open OnDemand are working correctly

      Alternatives: Use the OOD Desktop App or VNC Client to launch a desktop session and use the terminal inside that app to submit jobs, check job status, etc.

      We are working on resolving this issue. We will send you an update as soon as the issue is resolved.

  • Oscar - Transfer Server Outage

    • Resolved 8/18/2022, 7:43:01 PM

      This issue has been resolved

    • Update 8/18/2022, 7:43:01 PM

      The transfer server is back in service.

    • Investigating 8/18/2022, 6:10:36 PM

      Oscar users, Please see the details below for the transfer server outage:

      What’s the issue: Due to an issue on the transfer server, users will not be able to access the transfer server (transfer.ccv.brown.edu) to perform file transfers to and from Oscar. Users might be prompted for a password, but access to the server will be denied.

      Alternatives to the transfer server: Users can still transfer files using the gateway node (ssh.ccv.brown.edu), the SMB service, or Globus.

      We are actively working on returning the transfer server to service. We will send out an update once the transfer server is back up and running.

  • Oscar - File System Outage

    • Resolved 5/5/2022, 2:56:01 AM

      This issue has been resolved

    • Update 5/5/2022, 2:56:01 AM

      Oscar has been returned to full service. We applied the recommended vendor upgrades and are awaiting further analysis of the root cause of this issue. Previously running jobs and VNC sessions were terminated; please resubmit your jobs.

      We have a zoom bridge open until 12pm tonight if you encounter any issues (join zoom meeting).

      Thank you for your continued patience. Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely, CCV Research Technical Services

    • Update 5/4/2022, 11:13:38 PM

      Update: The vendor (IBM) has recommended upgrading the RPMs to the latest point release. We are in the process of deploying the upgrade across the cluster and are actively working to bring the cluster back online as soon as possible.

    • Update 5/4/2022, 3:40:25 PM

      Update: We are still actively working with the vendor (IBM) to mitigate this issue. We will provide regular updates via this thread.

      We apologize for the inconvenience.

      Sincerely, CCV Research Technical Services

    • 5/4/2022, 8:11:32 AM

      Please see below regarding the file-system outage on Oscar:

      What’s the issue?

      Due to an issue with the file-system:

      • Users will not be able to log in
      • All running/pending jobs are impacted

      What’s the cause?

      We have reopened the case with the vendor (IBM) and are actively working to identify the root cause and mitigate this issue as soon as possible.

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV Research Technical Services

  • Oscar - File System Outage

    • Resolved 4/26/2022, 2:31:55 AM

      This issue has been resolved

    • Update 4/26/2022, 2:31:54 AM

      This was resolved. Oscar has been returned to full service.

    • 4/25/2022, 1:37:45 PM

      What’s the issue?

      Due to an issue with the file-system:

      • Users will not be able to log in
      • All running/pending jobs are impacted

      What’s the cause?

      We are actively working with the vendor to identify the root cause. We will provide an update by noon tomorrow (April 26th).

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

  • Oscar Outage - Logins not working

    • Resolved 1/31/2022, 12:30:51 AM

      This issue has been resolved

    • Update 1/31/2022, 12:30:51 AM

      Oscar has been returned to service. All running SLURM jobs were affected; users will have to resubmit their jobs. This issue was caused by a failure in one of the storage volumes; there should not have been any data corruption or loss. We sincerely apologize for the inconvenience caused.

    • 1/30/2022, 11:44:13 PM

      Oscar Users,

      Please see below regarding the outage on Oscar:

      What’s the issue?

      Logins via SSH, VNC, VSCode, etc. are not working. Users might see the 'oscar2' prompt, but the current password will not be accepted.

      What’s the cause?

      We are still trying to identify the root cause and mitigate this. We apologize for the inconvenience. Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV Research Technical Services

  • Oscar - Intermittent GPFS File System Slowness

    • Resolved 1/25/2022, 7:55:55 PM

      This issue has been resolved

    • 1/21/2022, 4:42:39 PM

      Please see below regarding the intermittent file-system slowness on Oscar:

      What’s the issue? Due to an intermittent slowness issue with the file-system:

      • Users might notice slowness while logging in or performing tasks such as opening files
      • Running jobs might run slowly due to slow I/O
      • VSCode, SSH, VNC sessions may take longer than usual to start

      What’s the cause? We are working with the storage vendor to get this resolved. Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV Research Technical Services

  • Oscar is down for annual system maintenance

    • Resolved 1/17/2022, 6:31:46 AM

      This issue has been resolved

    • 1/11/2022, 2:10:17 PM

      Maintenance Window:

      • Start: 5:00 am, Tuesday, January 11, 2022
      • End: 4:00 pm, Friday, January 14, 2022

      Please note: If unanticipated issues occur, we expect to have them resolved by 8:00 am, Monday, January 17, 2022 at the latest.

      Maintenance Description: Performance upgrades to our GPFS storage cluster

      Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely, CCV Research Technical Services

  • Oscar - University License Server Disruption

    • Resolved 10/27/2021, 7:01:55 PM

      This issue has been resolved

    • 10/27/2021, 5:39:12 PM

      Please see below regarding the current license server disruption on Oscar:

      What’s the issue?

      • Commercial packages that require a license (e.g., MATLAB, Maple, ArcGIS) are currently inaccessible
      • MATLAB GUI & Command Line prompt can’t be launched. Any running jobs using these modules may crash or fail
      • This is due to license servers having issues

      What’s the cause? The appropriate team is working to identify the root cause of this problem.

      Please email support@ccv.brown.edu with any issues, questions or concerns.

  • CCV Oscar Outage

    • Resolved 8/25/2021, 11:36:42 PM

      This issue has been resolved

    • Update 8/25/2021, 11:36:42 PM

      Oscar Users,

      The cluster has been returned to service, and users should be able to log in now. Unfortunately, running jobs were terminated. Pending jobs will begin to run as more nodes become available.

      We apologize for the inconvenience. Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV Research Technical Services

    • 8/25/2021, 9:58:50 PM

      Oscar Users,

      Please see below regarding the current partial outage on Oscar:

      What’s the issue? Oscar is experiencing a partial outage, including:

      1. Users will not be able to log in via SSH, VNC, etc.
      2. Running/pending jobs may have been impacted
      3. Users will not be able to mount GPFS filesystem via SMB or CIFS

      What’s the cause? We are working to mitigate the issue and return the cluster to service.

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

  • Oscar Major Outage

    • Resolved 5/3/2021, 6:57:15 PM

      This issue has been resolved

    • Update 5/3/2021, 6:57:15 PM

      Oscar has been returned to service. No running or pending jobs were impacted. VNC sessions will resume from the same state.

      We apologize for the inconvenience. Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

    • 5/3/2021, 6:22:11 PM

      Please see below regarding the current outage on Oscar:

      What’s the issue? Oscar is experiencing a complete outage, including:

      1. Users will not be able to log in via SSH, VNC, etc.
      2. Users will not be able to submit new jobs
      3. Running jobs will NOT be impacted
      4. Users will not be able to mount GPFS filesystem via SMB or CIFS
      5. Globus, file-transfers are unavailable

      What’s the cause? We are working to mitigate the issue and return the cluster to service.

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

  • Oscar Partial Outage

    • Resolved 4/23/2021, 11:50:42 PM

      This issue has been resolved

    • Update 4/23/2021, 11:50:52 PM

      Issue is resolved.

    • 4/23/2021, 10:28:22 PM

      Please see below regarding the current outage on Oscar:

      What’s the issue? Oscar is experiencing issues with some SLURM-related commands (sbatch, sacct, etc.):

      1. Users will not be able to submit new jobs via sbatch
      2. Users will not be able to query the SLURM database via sacct, myq, myjobinfo etc.
      3. Running jobs are not impacted

      What’s the cause?

      We are working to mitigate the issue and return the cluster to service.

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

  • Oscar partial outage

    • Resolved 7/22/2021, 6:34:09 PM

      This issue has been resolved

    • Update 7/22/2021, 6:34:09 PM

      This is resolved. All services are operational.

    • Investigating 4/3/2021, 12:11:41 AM

      The following services are partially available at this time:

      • Transfer nodes (accessible from the internal Oscar network only; not available from outside networks)

      The following services are not available at this time:

      • SFTP & SSHFS to mount GPFS filesystem

      Thank you for your continued patience.

      CCV User Services

  • Oscar Major Outage

    • Resolved 4/2/2021, 11:44:11 PM

      This issue has been resolved

    • Update 4/2/2021, 11:44:11 PM

      Oscar has been returned to service. We encourage you to log in and start your workloads. Previously running jobs were terminated; users will have to resubmit their jobs.

      The following services are not available at this time:

      1. SMB & CIFS Mounts 
      2. Transfer Nodes
      3. Globus Transfer Service
      4. SFTP & SSHFS to mount GPFS filesystem

      Thank you for your continued patience. We will try to minimize future disruptions to services already restored.

      If you experience any issues this evening, please email support@ccv.brown.edu and we will respond in the morning.

      Sincerely, CCV User Services

    • 3/30/2021, 2:24:07 PM

      Please see below regarding the current outage on Oscar

      What’s the issue?

      Oscar is experiencing a complete outage, including:

      1. Users will not be able to log in via SSH, VNC, etc.
      2. All running jobs have been terminated
      3. Users will not be able to mount GPFS filesystem via SMB or CIFS
      4. Globus, file-transfers are unavailable.

      What’s the cause? We are working to mitigate the issue and return the cluster to service.

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

  • Oscar Outage - SMB/CIFS

    • Resolved 3/30/2021, 2:23:30 PM

      This issue has been resolved

    • 3/30/2021, 1:24:49 PM

      Please see below regarding the current outage on Oscar

      What’s the issue?

      Oscar is experiencing a complete outage, including:

      1. Users will not be able to log in via SSH, VNC, etc.
      2. All running jobs have been terminated
      3. Users will not be able to mount GPFS filesystem via SMB or CIFS
      4. Globus, file-transfers are unavailable.

      What’s the cause? We are working to mitigate the issue and return the cluster to service.

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

  • Oscar VNC Outage

    • Resolved 1/20/2021, 10:42:20 PM

      This issue has been resolved

    • Update 1/20/2021, 10:42:20 PM

    • 1/20/2021, 8:31:22 PM

      Please see below regarding the current VNC outage on Oscar:

      What’s the issue?

      • Reconnecting to an existing VNC session will not work 
      • New VNC sessions will not be able to start

      What’s the cause? We are working to identify the root cause of this problem. Expect an update soon.

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

      CCV User Services

  • Oscar MATLAB Service Disruption

    • Resolved 12/22/2020, 2:00:49 PM

      This issue has been resolved

    • Update 12/22/2020, 2:00:49 PM

      Oscar Users, 

      This is resolved now. The license server has been restored and is successfully serving MATLAB license requests. We apologize for the disruption.

      Regards, CCV User Services

    • 12/22/2020, 12:45:32 PM

      Please see below regarding the current MATLAB disruption on Oscar

      What’s the issue?

      • MATLAB is currently inaccessible from Oscar. Any running/scheduled MATLAB jobs will fail
      • MATLAB GUI & Command Line prompt can’t be launched
      • This is due to MATLAB license server having issues

      What’s the cause?

      • The appropriate team is working to identify the root cause of this problem. Expect an update soon

      Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely, CCV User Services

  • Oscar Outbound Network Outage

    • Resolved 12/22/2020, 12:44:18 PM

      This issue has been resolved

    • Update 12/22/2020, 12:44:18 PM

      Oscar Users,

      The network issue is now resolved.

      We appreciate your patience. Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

    • 12/22/2020, 5:36:31 AM

      Oscar Outbound Network Outage

      Oscar Users,

      Please see below regarding the current outbound network outage on Oscar:

      What’s the issue?

      • Oscar nodes cannot connect to outside networks and hosts.
      • This affects operations such as wget, cloning Git repositories, and any application calls that access the internet.

      What’s the cause?

      • We are working to identify the root cause of this problem. Expect an update soon

      Please email support@ccv.brown.edu with any issues, questions or concerns.

      Sincerely, CCV User Services

  • Oscar Slurm Issue (Runaway Jobs)

    • Resolved 12/18/2020, 6:08:47 AM

      This issue has been resolved

    • Update 12/18/2020, 6:08:47 AM

      This issue is now resolved. It was caused by a bug in the current version of Slurm, which is fixed in the new version we are upgrading to during the winter maintenance.

      We apologize for the inconvenience. 

      Sincerely,  CCV User Services

    • 12/18/2020, 3:26:41 AM

      Oscar Slurm Issue (Runaway Jobs)

      Oscar Users,

      Please see below regarding the current Slurm runaway jobs on Oscar:

      What’s the issue?

      • Completed Slurm jobs are not being cleared from the queue (runaway jobs). This means that even if your previous job has finished, slurmctld will still show it in the RUNNING state.
      • This affects users who have too many runaway jobs: their new jobs will remain pending in the queue due to QOS limits.

      What’s the cause?

      • We are working to identify the root cause of this problem. Expect an update soon

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

  • Intermittent GPFS Performance Issue

    • Resolved 12/16/2020, 1:59:10 PM

      This issue has been resolved

    • Update 12/16/2020, 1:59:10 PM

      This was resolved. We made significant configuration changes recommended by the vendor.

    • 12/7/2020, 10:04:23 PM

      Intermittent GPFS Performance Issue

      Please see below regarding the current GPFS file-system issue on Oscar.

      What’s the issue?

      • All OSCAR applications/modules, which reside in the GPFS runtime, are affected, and users might see slow application startup times.
      • This also affects the Data storage pool. Any jobs writing directly to /gpfs/data will notice slow performance.

      What’s the cause?

      • We are working with the vendor (IBM) to determine the root cause of this issue and will send an update when we know more.

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

  • Partial Network Outage in Oscar

    • Resolved 12/8/2020, 2:25:31 PM

      This issue has been resolved

    • Update 12/8/2020, 2:25:31 PM

      License servers are now accessible. We have addressed critical applications such as MATLAB, SAS, Cadence, Virtuoso, etc. If you encounter any issues with a specific application, please let us know.

      We appreciate your patience.

      Sincerely, CCV User Services

    • Update 12/8/2020, 2:01:12 AM

      The network issue is still not fully resolved; we are having problems with the license servers. Applications such as MATLAB and SAS that get their license configuration from a remote host will fail to launch. Any pending or running jobs that use licensed packages might fail.

      We are actively working to resolve this issue. Thank you for your patience.

      Sincerely, CCV User Services

    • Update 12/7/2020, 10:03:16 PM

      The network issue is now resolved. The traffic is now being routed via a different host.

      We appreciate your patience. Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

    • 12/7/2020, 7:20:06 PM

      Partial Network Outage in Oscar

      Please see below regarding the current network issue on Oscar.

      What’s the issue? This affects operations such as wget, cloning Git repositories, and any application calls that access the internet. Users can still log in to Oscar, and running or pending jobs are not affected.

      What’s the cause? One of the gateway servers suffered a hardware failure, so outbound internet traffic from Oscar is unavailable. We are working to determine the root cause of this issue and will send an update when we know more.

      Please email support@ccv.brown.edu with any issues, questions, or concerns.

      Sincerely, CCV User Services

  • Oscar Maintenance - Downtime 8:00 am, Tuesday, 1/5/21 - 11:59 pm Thursday, 1/7/21

    • Resolved 1/8/2021, 2:34:30 PM

      This issue has been resolved

    • Update 1/8/2021, 2:21:42 AM

      Oscar Users, Oscar maintenance is now complete and the cluster has been returned to full production. Please refer to our Discourse platform for more details.

    • 12/4/2020, 5:19:10 PM

      Oscar Maintenance - Downtime 8:00 am, Tuesday, 1/5/21 - 11:59 pm Thursday, 1/7/21

      Please click here for additional details regarding the upcoming CCV maintenance work:

  • Oscar Outage

    • Resolved 11/9/2020, 3:46:24 PM

      This issue has been resolved

    • Update 11/9/2020, 3:46:24 PM

    • 11/9/2020, 4:48:41 AM

      Oscar Outage

      The Oscar cluster is inaccessible from remote sites. Admins are looking into it. We will update soon.

  • GPFS Filesystem Performance Issues

    • Resolved 11/2/2020, 3:31:01 PM

      This issue has been resolved

    • Update 11/2/2020, 3:31:01 PM

      This issue has been resolved; file-system performance is back within acceptable limits.

    • 10/30/2020, 3:25:25 PM

      GPFS Filesystem Performance Issues

      File-system performance is degraded; some users might experience slowness while creating virtual environments or performing small-file I/O operations. We are investigating the issue.

  • SMB Maintenance - Thursday, August 13, 2020 at 8:00 am - 12:00 pm

    • Resolved 8/13/2020, 6:15:05 PM

      This issue has been resolved

    • Update 8/13/2020, 6:15:05 PM

      We have completed the SMB maintenance. Please reconnect your network drives (instructions). As a reminder, the secondary mount oscarcifs.ccv.brown.edu will no longer be available as of September 1, 2020. Please begin using the SMB service as soon as possible.

      As always, thank you for your patience. Please report any issues to support@ccv.brown.edu.

      Sincerely, CCV User Services

    • 8/13/2020, 1:25:23 PM

      SMB Maintenance - Thursday, August 13, 2020 at 8:00 am - 12:00 pm

      Who is impacted:

      • SMB users who access OSCAR locally via the network drive on Windows, Finder on Macs, or Linux

      What services are impacted:

      • SMB service
      • Note: the secondary mount oscarcifs.ccv.brown.edu will no longer be available as of September 1, 2020. Please begin using the SMB service as soon as possible.

  • Slurm partial outage

    • Resolved 8/5/2020, 8:06:36 PM

      This issue has been resolved

    • Update 8/5/2020, 8:06:36 PM

      The pending jobs have been cancelled, and SLURM is back to normal.

      A small subset of users encountered issues starting VNC sessions. To fix this:

      1. SSH into Oscar via Terminal.
      2. Run 'myq' and cancel any pending VNC jobs with 'scancel <jobid>'.
      3. Run the VNC client again; the session should now start.

      If the issue persists let us know at support@ccv.brown.edu. Thank you for your patience.

      Sincerely, CCV User Services

    • 8/4/2020, 8:47:32 PM

      Slurm partial outage

      What's the issue? SLURM commands (sbatch, squeue, sinfo, scancel, srun, etc.) are slow to respond. New job submissions will also be slow, and new VNC sessions are unable to start.

      What's the cause? A user submitted over 30,000 jobs, overloading the SLURM daemons.

      How will it be resolved? SLURM is canceling these jobs. We have also implemented the MaxJobPerUserLimit.

      When will it be resolved? Once the jobs are fully cancelled, SLURM should return to normal within the next few hours.

  • GPFS Performance Issue

    • Resolved 7/31/2020, 5:26:54 PM

      This issue has been resolved

    • Update 7/31/2020, 5:26:54 PM

      The GPFS filesystem has returned to normal service. All filesystem-level checks indicate that performance is within acceptable limits.

    • Update 7/16/2020, 1:23:15 PM

      Thursday July 16, 2020

      Update on GPFS filesystem performance: All OSCAR applications, which reside in the GPFS runtime, and user home directories have been moved to a faster disk. Users should expect significant improvements in application startup times.

      Up next: We are currently migrating the Scratch and Data volumes to faster disks and anticipate this will be completed within the next few days.

    • Update 7/14/2020, 8:15:09 PM

      Tuesday July 14, 2020

      What’s the issue?

      • Oscar’s GPFS file system is experiencing slow I/O performance

      What’s the cause? Two things contributed to this issue:

      • GPFS was recently upgraded but some components of the file system have not yet been moved to the faster disk
      • Poor utilization of the Oscar cluster by a small number of users (thus impacting the file system for all users)

      How will it be resolved?

      • GPFS metadata has been moved to the faster disk
      • The applications in GPFS runtime and the home directories are currently being moved to the faster disk
      • The Data and Scratch directories will also need to move to a faster disk
      • The small number of Oscar users who were impacting the file system had their jobs terminated and were asked to work with CCV User Services

      When will it be resolved?

      • Sometime within the window of July 14th-July 15th, we expect GPFS performance to be significantly better, as the migration of GPFS runtime and the home directories will be complete
      • The migration of Scratch and Data will begin the week of July 20th. More communication will follow regarding that work.

    • Update 7/14/2020, 4:25:37 PM

      Tuesday July 14, 2020

      We continue to make significant changes to the GPFS file system to improve its performance. Over the next few days, we’ll continue to migrate GPFS components to faster disk. As the migrations progress, we’ll monitor the situation closely and provide periodic updates.

      We recognize this issue has been very inconvenient and appreciate your patience as we mitigate the technical difficulties.

    • 7/13/2020, 4:53:16 PM

      GPFS Performance Issue

      Monday July 13, 2020

      The Oscar cluster began experiencing slow I/O performance on the new GPFS file system. To help mitigate this issue, we are moving some components to a faster pool of disks.

      We appreciate your patience as we troubleshoot this issue. We will send another update once it is resolved. Please email support@ccv.brown.edu with any questions.

  • OSCAR is down for filesystem maintenance.

    • Resolved 6/28/2020, 5:14:54 PM

      This issue has been resolved

    • Update 6/28/2020, 5:14:54 PM

      The Oscar cluster is back online. The GPFS file-system has been upgraded and the Slurm reservations have been released; the queued jobs started running yesterday around 8pm.

    • 6/22/2020, 12:51:24 PM

      OSCAR is down for filesystem maintenance.

      The CCV OSCAR System is down for filesystem maintenance during the week of June 22.

  • GPFS Filesystem Outage

    • Resolved 6/16/2020, 10:47:15 PM

      This issue has been resolved

    • Update 6/16/2020, 10:47:14 PM

      The file-system is back online. We have released the Slurm queues, and all services should be operational.

    • 6/16/2020, 8:10:04 PM

      GPFS Filesystem Outage

      The filesystem is slow to respond; operations such as running shell commands, starting applications, and data transfers are affected.

  • Login004 - No Response

    • Resolved 6/9/2020, 9:35:25 PM

      This issue has been resolved

    • Update 6/9/2020, 9:35:25 PM

      A user was running hundreds of Python processes, which ran out of memory and killed the GPFS daemon. Admins killed the processes and rebooted login004. The issue is fixed.

    • 6/9/2020, 8:57:41 PM

      Login004 - No Response

      Users could not log in to Oscar. The login004 node was not responding to SSH. Admins are investigating the issue.

  • Unscheduled Oscar Outage

    • Resolved 12/18/2019, 1:49:01 PM

      This issue has been resolved

    • Update 12/18/2019, 1:48:58 PM

      The Oscar cluster has been returned to normal service. Job queues have been enabled and job scheduling has been resumed.

      Possible impacts from outage:

      1. Running jobs might have been impacted if they were writing to /home
      2. Applications running inside VNC sessions might have crashed.

      Please check your job output files and active VNC sessions and let us know if there are any issues.

      We apologize for the disruption of service. Please report any issues to support@ccv.brown.edu.

    • 12/18/2019, 1:39:40 PM

      Unscheduled Oscar Outage

      The Oscar cluster began experiencing issues with the home filesystem around 5:00am. Admins are currently diagnosing the issue and are working to identify a fix. Job scheduling has been paused while this issue is being addressed.