Last quarter we optimistically added Bug 447696 (make talos machines rebootable) to the Release Engineering goals list. We wanted to solve this consistently repeated order of events:
- Talos box starts behaving oddly (low/high numbers, stopped reporting altogether, etc.), developer notices and files bug
- IT looks at bug, finds the Talos documentation incredibly complex and not confidence-inspiring
- IT hands bug to Release Eng
- Release Eng attempts to fix the machine without having to resort to a reboot
- Reboot deemed necessary, IT bug filed for manual reboot
- Machine rebooted, IT passes bug back to Release Eng
- Release Eng goes through manual configuration steps to get machine back to stable state
- Release Eng restarts buildbot slave
- Success!
The worst part is that all those steps with ‘Release Eng’ really end up meaning ‘Alice’. The Talos project suffers somewhat from having all of its knowledge centralized in a single human - and this human wants to take a holiday now and then.
Thankfully, we are now in a place where all active Talos boxes are fully rebootable. Go ahead - reboot whatever you want. It should come back up clean and ready to test. The new order of operations is thus:
- Talos box starts behaving oddly (low/high numbers, stopped reporting altogether, etc.), developer notices and files bug
- IT looks at bug and, whatever the problem is, reboots the machine
- Success!
We got here by having me carefully tease out the various configuration settings necessary on our five supported platforms (WinXP, Vista, Tiger, Leopard, Ubuntu) and then learning all about how startup automation works on each of them (batch files! plists! rc.local! etc, etc, etc). Once I had a plan of attack for each platform, I moved on to manually updating our 58 active Talos machines. Big thanks to John O’Duinn and Nick Thomas for helping out with the machine updates; it would have taken a lot longer without their assistance. You can dig around from the main bug (Bug 447696) to find all the various bits and pieces that it took to put this together. It was time-consuming. It was painful. It was totally worth it.
Did I mention that I’m going on holiday next month…?
You never mentioned that we are making you take vacations, lol
Good job Alice!
Comment by armenzg — October 8, 2008 @ 5:18 pm