Status of ESSC IT facilities

  • 18th May 2012, 18:30. The scheduled Grid Engine service restart on medusa has taken place with no impact on either interactive sessions or batch jobs.
  • 18th May 2012, 12:00. Two Rocks compute nodes have hung in the past few hours, compute-0-0 late last night and compute-0-2 a few minutes ago. This is likely to be related to the unusually heavy load the cluster is experiencing at the moment, and I am taking steps to reduce the impact of the large number of batch jobs that are currently running and waiting to run. I have been running a memory check on compute-0-0 this morning and found no errors, so I believe it is safe to reboot the nodes with the new scheduling policy in force.
  • 14th May 2012, 17:30. Sorry, I have not managed to stop the atmos volume after all, possibly because some of the servers are busy and slow to respond to the instruction. I will schedule another maintenance period tomorrow after the staff meeting.
    • 14th May 2012, 17:05. The "atmos" storage volume is off line while one of the storage servers is being worked on.
  • 13th May 2012, 23:00. The ESSC home directory backups have been working normally for the past few days, so I think it is safe to assume that the SCSI bus problem was solved by rebooting odin on Thursday.
    • 4th May 2012, 14:00. The backup tape library has stopped working and no backups will be performed until further notice. Therefore, all important documents and source files should be stored in University/ITS NetDrives rather than in ESSC home directories.
  • 10th May 2012, 14:15. Odin is booting up after the scheduled restart at 4PM. The restart is still in progress because odin took longer than expected to shut down.
  • 3rd May 2012, 15:45. Odin threw another wobbly this afternoon and is currently rebooting. Hanging applications and commands should return to normal shortly.
  • 2nd May 2012, 14:55. Compute-0-1 seems to be a casualty of the temporary storage volume outage earlier and is being restarted.
    • 2nd May 2012, 14:00. All volumes are back on line. Intermittent mounting problem seems to have been solved but expect occasional disruption until bdan12 is back on line again after hardware repair and re-sync.
    • 2nd May 2012, 13:20. I am attempting to stop the storage cluster volumes to carry out the scheduled maintenance operation (see IT News, 2nd May, Recent storage cluster disruption - UPDATE). All storage volumes should be considered at risk until further notice and may become unavailable at any time.
  • 19th April 2012, 12:30. Odin's data is back on line and all home directories and packages are accessible. There are still SCSI errors being reported so there might be more disruption until the fault can be located.
    • 18th April 2012, 21:10. I have not been able to repair odin so we will have to wait for Oracle to respond to the service request tomorrow morning. The data is completely inaccessible this time - no read or write access to ESSC home directories, packages or other data on odin is possible.
    • 18th April 2012, 20:00. There is another problem with odin preventing access to ESSC home directories. If the reboot currently in progress doesn't solve the problem please use the work-around described in the IT News update "More odin hardware maintenance - UPDATE" on 16th February.
  • 18th April 2012, 14:55. Saturn is working properly again and is safe to use. Compute-0-2 is being restarted and will be available for use again in a few minutes.
    • 18th April 2012, 13:30. Saturn is not working properly because of dependencies on Meteorology services that are not running as a result of the GU15 maintenance work. Please continue to use jupiter until all the Meteorology services are on line again.
    • 18th April 2012, 13:20. Behemoth, gorgon and saturn are running again now and can be used safely. The ClusterVision cluster nodes are starting up now and should be ready to use in a few minutes. There are no known problems with WebVPN or connections to ESSC servers from outside.
    • 18th April 2012, 10:30. There appears to be a problem with compute-0-2 so I have taken it off the list of compute nodes accessible via qlogin and qrsh. The problem affects storage cluster access and is likely to be related to yesterday evening's intermittent network disruption. I will try to fix the problem later on but the machine might have to be restarted. Please use other compute nodes for the time being.
    • 18th April 2012, 10:15. Saturn, behemoth, gorgon and the ClusterVision cluster are still off line following yesterday's scheduled power cut in the Meteorology area of the GU15 machine room. A problem with WebVPN has also been reported.
  • 17th April 2012, 15:45. The kmcolour2 printer is now back on line; I had the spare parts in stock after all.
    • 16th April 2012, 11:15. The "new" Konica Minolta printer in the computer lab (called kmcolour2 in CUPS) is off line while waiting for parts. Please use the Canon photocopier (cacolour) instead for the time being.
  • 5th April 2012, 17:50. Easter maintenance update: Saturn is running again in its new home. Please refer to the IT News for details of data access changes. I will wait until the long running interactive process on compute-0-2 has finished before restarting jupiter.
    • 5th April 2012, 00:34. Easter maintenance update: I have shut down saturn, and I am taking the opportunity to transfer it to a server in the GU15 server room where it can be more easily administered by Meteorology IT staff. The transfer is taking longer than expected so saturn might not be available again until tomorrow (Thursday the 5th April). I will leave jupiter running until saturn is available again.
    • 4th April 2012, 19:10. Easter Maintenance update: I have postponed the restart of saturn and jupiter until later this evening because there are two interactive jobs running on the Rocks cluster that will be killed by the restarts. Hopefully they will both finish before I have to restart jupiter and saturn. This situation can easily be avoided in future by using the "screen" command, or batch mode in Sun Grid Engine; see Rocks cluster section of wiki for details. I have not received the replacement drives for bdan12 so I will not be able to proceed with the server repair and re-synchronisation, and that means I won't be able to change the ownership of any data on the storage cluster.
  • 3rd April 2012, 13:45. All Rocks Cluster compute nodes are working again now.
    • 3rd April 2012, 13:15. Two compute nodes are off line while being restarted. These are compute-0-2 and compute-0-3. Currently only compute-0-0 and compute-0-1 are accessible but these are working normally.
    • 3rd April 2012, 12:50. The Rocks Cluster compute nodes have been having trouble accessing the storage cluster and only compute-0-1 is accessible at the moment. The others may need to be restarted.
  • 30th March 2012, 10:15. The black and white printer downstairs, hp43, is off line while waiting for a service. Please use cacolour or kmcolour2 instead for the time being.
  • 22nd March 2012, 13:30. One of the storage servers suffered an extraordinary triple drive failure yesterday evening, which left the RAID array in a very fragile state. I have taken it out of service while waiting for replacement drives and the storage cluster should be able to operate using replicated files on another server. However it is possible that there has been some damage caused while the drives were starting to fail with the server still connected to the cluster. This might not necessarily involve damage to files themselves although I can't rule out that possibility.
  • 21st March 2012, 9:30. The network problems have been resolved. Laptops and thin clients can now connect to the network but will need to be restarted.
  • 14th March 2012, 12:25. The ClusterVision cluster is working again now.
    • 13th March 2012, 19:40. All services and data volumes except the ClusterVision cluster are now back on line after the power cut this afternoon. Unfortunately I was not able to get back into the GU15 server room to investigate a problem with the cluster's Myrinet switch before it was closed for the day, but I should be able to get the cluster working again tomorrow morning.
  • 5th March 2012, 08:45. The "atmos" volume is back on line again now.
    • 4th March 2012, 13:45. The "atmos" storage volume is currently off line while a maintenance operation is being performed. It should be available again in a few hours.
  • 1st March 2012, 11:10. Odin is working normally again, and write access to ESSC home directories has been enabled.
    • 27th February 2012, 10:20. Bad news about odin I'm afraid. The Oracle field engineer's visit on Friday didn't solve the problem and it is still not safe to allow write access to the data.
    • 24th February 2012, 15:40. Odin's data will remain read-only while thorough I/O tests are carried out on the drives in the RAID array. If all the hardware passes the test and the test completes over the weekend then odin will be available for normal use some time on Monday. It will not be necessary to shut down odin this afternoon.
    • 22nd February 2012, 18:15. Odin's data remains read-only while investigations into the cause of the I/O errors continue. It wouldn't be true to say that I am confident in Oracle's ability to find out what's wrong with it any time soon. Please refer to the IT News page for details of the work-around I have implemented.
    • 13th February 2012, 16:30. Odin is officially very ill. Dominic from NERC-IST is coming tomorrow morning to help me sort it out. Oracle has also been notified in case it's a hardware issue. Unfortunately all access to odin's data is suspended until further notice, but other services such as Internet access and user authentication should be able to continue intermittently.
    • 13th February 2012, 9:45. Odin is running again, but the fault hasn't yet been identified so it might hang again without warning. Thin clients that are not connected to the network must be restarted to allow them to connect.
    • 12th February 2012, 23:25. Odin has just hung for the third time this weekend. I went in to restart it the first two times but it will now have to wait until Monday morning. It might need attention from NERC-IST and Oracle before it is stable again. In the meantime I will press Reading ITS to make some progress on migrating some of odin's key services to ITS servers; I can now justifiably say the matter is urgent.
  • 22nd February 2012, 20:15. The atmos volume has been mounted read-write again, and the temporary home directories have been transferred to their previous locations on that volume. The GlusterFS repair is still going on, but I have been told by the developers that it should be safe to allow write access. I have identified the likely cause of the errors and put in place measures to prevent the same thing from happening again. Please report any I/O errors or unusual behaviour to me immediately.
    • 21st February 2012, 18:25. Urgent maintenance to the atmos volume requires it to be mounted read-only. Users whose temporary home directories were in /glusterfs/atmos have been given new home directories in /glusterfs/tracks. Space in this volume is limited and is being shared by a number of users.
  • 10th February 2012, 11:15. Odin is working again now.
  • 9th February 2012, 13:10. The ClusterVision cluster is working again now.
    • 9th February 2012, 11:45. The ClusterVision cluster batch queues are currently disabled because several of the compute nodes need to be restarted. This will be done this afternoon when the jobs currently running have finished.
  • 24th January 2012, 13:50. The "xerox" printer upstairs is working again now. There is a spare maintenance kit in the store cupboard with the spare ink cartridges.
    • 19th January 2012, 11:45. The "xerox" printer upstairs has a problem and might need replacement parts. Please use hpcolour instead until xerox has been repaired, or use cacolour or hp43 downstairs for double sided printing.
  • 24th January 2012, 13:10. Gorgon is back on line again.
    • 24th January 2012, 11:55. Gorgon is currently off line while a hard drive is being replaced.
  • 22nd January 2012, 19:30. All storage volumes are back on line.
    • 22nd January 2012, 19:30. The "nemo" volume is now back on line, along with the following subvolumes: gorgon, pegasus, romulus and remus. Work on the "atmos" volume is still in progress.
    • 22nd January 2012, 15:30. The "nemo" and "atmos" storage volumes are still off line while filesystem conversion and system updates are being carried out. They should be available again this evening.
  • 22nd January 2012, 15:30. Jupiter is safe to use again after system updates.
    • 21st January 2012, 18:05. Jupiter is unavailable while system updates are being carried out, after which it will be restarted. Please don't try to log on again until a message appears here saying it is safe to do so.
  • 17th January 2012, 03:00. The "land" volume is back on line again.
    • 16th January 2012, 14:30. The "land" storage volume is off line while filesystem checks are being carried out after I/O errors were reported by some clients.

IT Status updates from 2011 and 2010

Topic revision: r107 - 18 May 2012 - 17:32:07 - DanBretherton
 