Bug 463020 - Talos machines should be automatically rebooted periodically has finally been fully implemented and resolved fixed. Here’s what we know about Talos machines:
- The longer the uptime the less trustworthy the numbers
- Machines can stay up and report untrustworthy/garbage numbers for a long time before anyone notices - as long as the machine cycles green it is usually ignored
- Monitoring every active Talos box to see what machines need a reboot is complicated and difficult
Since the majority of the Talos work that comes across my desk ends up as reboot requests to solve machine number drift, the obvious solution is to start a system of auto-rebooting. It turned out that doing so wasn’t even that complicated, thanks to earlier work I’d done on the system to ensure that Talos machines come back from a reboot ready to test and auto-reconnect to their buildbot master. Thanks to catlee for the patch that put the last piece in place: a step that fires at the end of every test cycle on every Talos machine and starts a reboot.
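For anyone curious what a reboot-at-end-of-cycle step might look like, here is a minimal sketch. It is not catlee's actual patch or anything taken from the buildbot configuration; the function names `reboot_command` and `reboot_after_cycle` are made up for illustration, and it assumes the test account is allowed to restart the machine (passwordless sudo on Mac/Linux).

```python
import subprocess
import sys
from typing import List


def reboot_command(platform: str) -> List[str]:
    """Return a restart command for the given sys.platform value (hypothetical helper)."""
    if platform.startswith("win"):
        # Windows: restart (/r), force running apps to close (/f), no delay (/t 0).
        return ["shutdown", "/r", "/f", "/t", "0"]
    # Mac and Linux: restart immediately. Assumes passwordless sudo for shutdown.
    return ["sudo", "shutdown", "-r", "now"]


def reboot_after_cycle(dry_run: bool = True) -> List[str]:
    """Run as the last step of a test cycle; returns the command it chose."""
    command = reboot_command(sys.platform)
    if not dry_run:
        # The machine goes down here; rejoining the master on boot is handled
        # separately by whatever starts the buildbot slave at startup.
        subprocess.run(command, check=True)
    return command
```

The `dry_run` default is deliberate: calling `reboot_after_cycle()` only reports the command it would run, so the step can be checked on a box without actually taking it down.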
Now when looking at performance results you can rest assured that the machine that did the testing was in a clean state, newly rebooted and without any extraneous processes kicking around from previous busted browsers/crash reporters/spawned dialogs/whatever. No longer do we have to live in fear of Bug 419620 - qm-mini-vista04 (branch) numbers gradually rising and other such bugs resulting from long uptimes on Talos test boxes (say, Leopard Talos boxes showing odd spikes in results every 5-20 test cycles, or Tiger Talos boxes having runaway Terminal processes after a few dozen test cycles, etc.).
The nifty side effect is that the variance between results is also decreased - I see much smoother lines on every platform. The only downside here is that we do lose machines that fail to reboot. At the moment, I’m seeing about 5-6 machines fall over per week. This does increase the load on IT as the only solution is a manual kick and thus requires a trip to the colo. They’ve been real good sports about the increase in bug traffic brought about by this change.
Thanks again to catlee and IT for their help.
Isn’t there a risk that there will be a class of performance bug that will now never show up on the radar? Can we be sure that e.g. the “auto rising” of memory after a period of continued use is not our problem?
Comment by Håkan — January 6, 2009 @ 1:10 am
It’s unreasonable to attempt to track possible performance regressions based upon a side effect of the initial Talos system setup. If we are interested in tracking how the browser behaves on a system that has been up for a long time, or when re-using the same profile again and again, or in some other long-running metric, then we should design a system to test it that is consistent and repeatable. As it was, we couldn’t say how long a machine had been up or how many browsers it had tested (and if any of those browsers were regressed and did something weird to the system, all the worse) - essentially it gave us no information about the sort of thing that you are interested in.
I think that we need a robust and varied approach to performance. We need to design different types of tests to cover the different things that we are interested in. Attempting to fit everything under the Talos umbrella just isn’t the best route to that end.
Comment by alice — January 6, 2009 @ 5:57 pm