Networking Power Requirements to upgrade to 10Gbits/s

Gary Buhrmaster, Yang Wei, John Weisskopf, Ron Barrett, Charley Granieri, Ken Martell, Les Cottrell, Boris Ilinets, Karl Amrhein, Teresa Downey

Introduction

LHC USATLAS needs to upgrade the networking capacity of its servers at SLAC to 10Gbits/s. This is required to test and validate the performance of the SLAC tier2 site in preparation for the LHC turn-on later this year. The desired deadline is the end of January. The purpose of the first meeting on 1/7/2008 was to understand the current situation, evaluate what needs to be done and what can be done to enable 10Gbits/s to the SLAC core network, and identify the next steps. These notes are not meant to capture all the details of this lively meeting but rather to record the next steps.

 

The new 10Gbits/s equipment (8 switches) is already on-site and installed in the racks. The basic requirement for the networking equipment is to have both house power and UPS power (for backup) and to stay within the cutover limits. This is complicated by load balancing, staying within limits during cutover, harmonics, and UPS1 (the networking UPS) being 17 years old and past end of life.

Actions proposed from 1/7/08 Meeting

Attendees included all from the above list except Karl Amrhein and Teresa Downey.

 

The following actions will be taken in the short term to get UPS power for the first 4 new switch/routers and the Nokia firewall (for business services, which has the same power issues) so they can be brought up and running:

Turn off the UPS to the console and flora servers in the network row. This should free up a couple of amps. This will be done by Ron by close of business (COB) 1/9/08.

Rebalance the network equipment to see the effects.

Inventory network equipment to see what is connected to UPS and identify what can be turned off, removed from the UPS or moved to another UPS.  This will be done by Ken and Ron and will be ready by January 14th at 9:00am.

Verify that FARM13 is using house power only. John/Ron did this following the meeting.

Move FARM12 equipment to another switch (FARM16) so FARM12 can be removed. Some of the machines are in the klyster cluster administered by Bob Steel and others are the Kavli (Orange) servers administered by Stuart Marshall. The systems will need new IP addresses (re-IP) since they need Gbit/s. John will coordinate the move, to be completed on Tuesday 1/15/08. Wei will alert those affected and coordinate the re-IP. Teresa is prepared to do the CANDO updates; Neal is aware of the LSF needs. We will bring up a klyster and an Orange host in advance of the cutover to ensure the final cutover will be smooth.

Remove equipment from FARM01 (2 sulkies are critical so must be moved). They can be moved to FARM11 (in the same row) without a new IP address. John will coordinate; it will be done soon after Mike Hogaboom returns on January 22nd.

New nodes need to be added to FARM16 which will require re-addressing hosts. Wei will talk to systems to get a schedule for this.

Run the first border router on one side only (i.e. only house or only UPS), then run the second border router on the other power source. This will normally use the same amount of power but may help in case of a loss of power on one side. Gary & Charley will review this after the above items have been accomplished.

Charley will get the first 4 new switches (BORDER1 and 2, new CORE1 and 2) running on house power (assuming it is available) to provide burn in. Then they will be configured.

Longer term we have to address:

Bringing up the remaining 4 new switches.

Additional power will be needed for the 4 new switches. Boris can provide this from the 75KVA Power Management Module (PMM). That PMM was purchased (together with a 125KVA PMM) to provide power for moving the Windows systems to where the VAXen used to be (part of the floor replacement project). An executive decision will be required on this (i.e. who gets the PMM).

Creating a plan to provide sufficient clean power with backup for the network equipment. We should replace the 17-year-old UPS1, which is running hot, is introducing harmonics and could fail. Also consider whether to get lots of small UPSs to provide backup power.

Accomplishments/Questions

  • Tuesday:
    • Need to expedite a plan to move the systems on the farm09/10 switches to the (new) farm16 (in fact, that is what farm16 was put in place to do). We also need to move the ports from farm07, which had most of its systems turned off at one point. This will allow two or three additional old switches to be turned off. As a rough rule, turning off two to three old switches frees enough power for one new-style (much more power-hungry) switch. As with the moves of systems from the (temp) farm16 to the (real/new) farm16, this requires a reboot of the servers, along with some configuration/cabling moves. Wei will need to get the systems group's schedule for accomplishing these moves. This is a new action item for Wei.
    • Some load removed in Sacramento row. PP-UPS1 decreased to Ia = 80.5A, Ib = 65A, Ic = 60.1A. Unfortunately the phase imbalance is still the same and In = 32A. It would be very helpful to redistribute some load from phase A to the other 2 phases and lower In.
  • Wednesday:
    • Notified Ted Shab that we are getting close to having power for the new EPN2 firewall.
    • IP addresses for klyster and Orange machines allocated.
  • Thursday:
    • Sulky16, 19 have cables in place and have been moved from FARM01 to FARM11; FARM01 turned off.
    • Have green light from Bob Steel and Stuart Marshall for Tuesday move of Orange & Klyster machines from FARM12 to FARM16.
    • Karl Amrhein taking care of Orange cluster testing/Infiniband issues.
  • Friday:
    • All in-use equipment in 4 network rows labeled, power cables labeled at both ends of cable. Orange = UPS, white = house power. Needs to be entered into inventory.
    • Orange-nfs needs to be tested. It has an IP address (172.23.32.64) but is not pingable. Is it connected? Stuart Marshall agrees there is no need to pre-test with a single host in preparation for the Orange cluster move.
    • Ports on FARM16 assigned for Orange & Klysters
    • Follow-on meeting arranged for 1/15/08.
  • Monday:
    • Completed removal of equipment from Sacramento row. Boris checking load & temp of UPS1 cabinet in preparation for Row 8 energization.
    • Two new PMMs arrived.
    • Need to get network cables for first 4 new switches installed. Gary has assigned. Charley reviewed and sent email to Ron etc.
    • Need to move Nospam3’s second power source to house power. Ron will make it happen. It has been renamed from mailgate03; its console is acting up.
  • Tuesday:
    • Orange-nfs reported (via RT) connected (John) and switch configured (Charley). Karl does OS installation.
    • Second house power supply moved from old SWH-FARM16 to new SWH-FARM16; new SWH-FARM16 now has both house & UPS power.
    • FARM12 machines moved to FARM16. Klyster and Orange clusters back up (Lustre awaits Stuart Marshall). SWH-FARM12 shut down.
    • Spreadsheets of device, power source(s), location and notes created for 4 network rows.
    • Wei achieves file transfer of > 1 Gbits/s for 6 hours from BNL to SLAC.
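
The PP-UPS1 readings noted above (Ia = 80.5A, Ib = 65A, Ic = 60.1A, In = 32A) can be put in context with a quick calculation. The following is a minimal sketch, not a measurement procedure. It assumes purely sinusoidal currents with the same power factor on each phase, so that the three phase currents sit exactly 120 degrees apart and the neutral current is the magnitude of their phasor sum. The function names and the load-shift scenario are illustrative assumptions, not something agreed at the meeting.

```python
import cmath
import math


def neutral_current(ia, ib, ic):
    """Neutral current (A) for three phase currents 120 degrees apart.

    Assumption: ideal sinusoidal currents, identical power factor per phase,
    no harmonics. Real neutral current can be higher.
    """
    a = cmath.rect(ia, 0.0)
    b = cmath.rect(ib, math.radians(-120.0))
    c = cmath.rect(ic, math.radians(120.0))
    return abs(a + b + c)


def shift_from_phase_a(ia, ib, ic, amps):
    """Hypothetical rebalance: move `amps` off phase A, split evenly to B and C."""
    return ia - amps, ib + amps / 2.0, ic + amps / 2.0


# Readings quoted in the notes for PP-UPS1.
ia, ib, ic, measured_in = 80.5, 65.0, 60.1, 32.0

estimate = neutral_current(ia, ib, ic)
print(f"In from imbalance alone (ideal case): {estimate:.1f} A")
print(f"In measured:                          {measured_in:.1f} A")

# Illustrative only: effect of moving 10 A from phase A to the other phases.
rebalanced = shift_from_phase_a(ia, ib, ic, 10.0)
print(f"Ideal-case In after moving 10 A off phase A: {neutral_current(*rebalanced):.1f} A")
```

Under these idealized assumptions the imbalance alone accounts for only part of the measured 32A. The remainder may come from harmonic currents, which are already listed as a concern for UPS1, though this calculation by itself does not establish that.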

Action Items from meeting 1/16/08:

Attendees: Ken, Wei, Charley, Boris, Gary, Antonio, John, Ron, Richard, Les

 

John noted that his team will be under pressure to install Dells. Not sure of impact on 10G project.

 

  • EPN2 power is not a priority at this time; it has not been decided when to do it.
  • Move FARM16 old to FARM16 new.
    • They are close to each other, so we believe we can reuse cables. Ron will check.
    • Wei will schedule Re-IP, inform users, move.
    • Ron will move cables.
    • Scheduled tentatively for Friday 1/18/08.
  • Bring up 10G core
    • We have enough power.
    • Charley will take lead, before Jan 28th
  • There needs to be a 30 min outage of PPUPS1. This will require scheduling electricians. Probably in early Feb.
    • Move one power supply of each of RTR-BSDNET1 and 2 to house power. Charley will schedule this.
    • Boris, Charley and Ron will discuss how and when to make this happen and report on the plan by Tuesday 22 Jan.
  • Charley will verify FARM7 can move to FARM16.
  • Get update on next scheduled power outage.

Next meeting in 1 week to review progress on 10G move and FARM16, and look at FARM7. Boris may have jury duty; Monday is Martin Luther King’s birthday.

Accomplishments/Questions

  • Wednesday:
    • Halimede, RTR-MON powered off.
    • Ron updated and made Excel power spreadsheets available via Sharepoint
    • Wei contacts users of hosts on FARM16 to schedule outage
  • Thursday
    • Outage scheduled
  • Tuesday-Thursday
    • First four switches powered up. Now drawing 72A on A phase (load used to be 65.7A). Netdev met and is working on a plan to bring up the new core: first interconnect the switches (RTR-CORE1,2 and RTR-BORDER1,2 for redundancy), then connect up ESnet & Stanford, plus RTR-CORE1,2-OLD & FARM-CORE1. When done we can power down RTR-DMZ1 and RTR-CORE.
    • Completed outage to move machines from FARM16 old to new. Machines moved & restored to service except yakut12.
    • Next meeting scheduled for 2pm Friday. Next switches are FARM9 & 10 (fibre connections are tricky, and these are big outages) and FARM7 (lots of old cabling & random hosts). FARM7 can move to FARM16, which is probably easier than FARM9 & 10.
    • Identified the need to move Bbr-xfer12..17 from FARM09 to FARM16. Wei gets agreement from Wilko and Shirley to make the move on Monday at 10:30am. FARM16 old powered off.
    • Jean Pierre asks when there will be power for the new firewall, so he can replace the Nokia IP740 RTR-FW01, which has both house power and UPS and is advertised at 3A/100-120V. The new Nokia IP 1270 firewall draws 300W. During the cutover, both firewalls need to be connected for about 8 hours. Once the new firewall is in production the old RTR-FW01 and BSDNET1 can be removed. They are all in Walkabout Creek 2BK-03.
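
The firewall figures above mix units (3A at 100-120V for the old IP740, 300W for the new IP 1270), so a rough conversion helps when budgeting power for the 8 hour cutover. This is a minimal sketch under stated assumptions: a 120V supply, a power factor of 1.0, and nameplate values treated as upper bounds. The constant and function names are illustrative. BSDNET1 is left out because the notes give no rating for it.

```python
SUPPLY_VOLTS = 120.0  # assumption: nominal supply voltage, not stated in the notes


def watts_to_amps(watts, volts=SUPPLY_VOLTS, power_factor=1.0):
    """Convert a power rating to current. power_factor=1.0 is an assumption."""
    return watts / (volts * power_factor)


old_fw_amps = 3.0                   # Nokia IP740 RTR-FW01, advertised 3A/100-120V
new_fw_amps = watts_to_amps(300.0)  # Nokia IP 1270, stated 300W

cutover_amps = old_fw_amps + new_fw_amps  # both firewalls connected (~8 hours)
steady_state_amps = new_fw_amps           # after RTR-FW01 is removed

print(f"New firewall:        {new_fw_amps:.1f} A")
print(f"During cutover:      {cutover_amps:.1f} A")
print(f"After old FW removed:{steady_state_amps:.1f} A")
```

Actual draw is likely lower than these nameplate-based numbers, and how the load splits between house power and UPS depends on how each unit's supplies are cabled.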

Meeting Friday 1/25/08

In attendance: Len Moss, Chuck Boeheim, Randy Melen, John Bartelt, Boris Ilinets, Gary Buhrmaster, Shirley Melen, Wei Yang, Charley Granieri, Les Cottrell.

 

Wei needs a date when the ATLAS hosts will have a 10Gbps path to outside. He wants to participate in the ATLAS full dress rehearsal which happens in February. He is also on vacation in China from Feb 1 – 18.

Gary reported he will be doing link testing with ESnet on the evening of Jan 28th. He hopes after a couple of days to advertise BGP. He hopes to be able to report good news by Friday 1st February.

 

We looked at the requirements to move machines off FARM07, 09 and 10. The contents of these switches can be found via http://www.slac.stanford.edu/comp/net/mon-slaconly/lanmon/cathtml/switch-index.netmaster.html  Many hosts on these switches need to be moved, and each move requires opening a ticket, coordinating with users, setting a date, getting cabling in place, getting IP addresses and making the move.

  • BBR-XFER12..20 on FARM9 are being moved to FARM16 by Shirley on Monday 28th January.
  • Shirley will look after the Sulkys.
  • Len will look after the Yakuts, BBR-SIMUL, BBR-EVDISP, BLDLNX, and tentatively some of the OBJY hosts.
  • Neal will need to be involved in moving the GRIDDEVs and MORABs. The GRIs can be done at any time. I am not sure who will coordinate/do this.
  • John Bartelt will look after the GLASTLNX01-15 hosts. Two of them are known on the Internet and so will need extra care.
  • Yacek needs to be contacted for the DATADEVSOL and DATADEVLNX hosts. I am unclear who will lead this.

The next meeting will be on February 29th; Charley will organize.