If you’ve been following along with Bug 480413 - design test to monitor browser shut down time, you’ll know that I attempted to roll out Tshutdown on the 18th. Unfortunately, there were a couple of bugs that weren’t discovered in staging, and I spent 6 or 7 hours on a beautiful Saturday afternoon attempting to fix them on the fly. Unable to make it work, I had to back it all out and fume.
With a clearer head I approached the bugs on Monday morning and resolved both of them. Then it was just a question of waiting for an appropriate downtime. Though I’m as excited about 3.5b4 as anyone, it really gets in the way of arranging Talos code changes. Finally, yesterday afternoon we were able to shut down everything and check things in.
I’m glad to say that this landing was successful. There’s a good chance that Tshutdown will clean up the issues in Bug 478603 - intermittent orange on Windows mozilla-central talos Ts and Tdhtml tests (“failed to initialize browser”), as we’ll start to record the longer shutdown times instead of freezing up and reporting orange.
It’ll be a few more days till we have enough data for Tshutdown to look like anything but a scatter graph, but you can check out the reported results at graphs-new.mozilla.org. There’s actually more to Tshutdown than a single test. You’ll see “Tp3 Shutdown”, “Tp3 Nochrome Shutdown”, “Tp3 Fast Shutdown” and “Ts Shutdown”. Basically, we record the shutdown time after running the given test. Measuring after Ts gives a ‘clean’ shutdown time, as we run Ts with an empty, new profile; measuring after Tp3 gives a ‘dirty’ shutdown time, as the browser has just completed cycling 10 times through 400 web pages. The post-Tp3 results will also show greater variance because we only run Tp3 once per full test cycle (the test does take a good hour or more to complete depending on the platform), so we only have that single value to report. With Ts we rapidly open and close the browser 20 times, so we have a data set that we can average to get a more consistent value.
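To make the variance point concrete, here is a rough sketch (not Talos code) that simulates reported shutdown times when each test cycle yields a single sample, as with Tp3, versus the mean of 20 samples, as with Ts. The 1000 ms mean, the 100 ms of noise, and the function names are all made up for illustration.

```python
import random
import statistics

# Hypothetical numbers, chosen only for illustration.
TRUE_SHUTDOWN_MS = 1000.0
NOISE_MS = 100.0


def simulate_cycle(rng, samples):
    """Return the value reported for one test cycle: the mean of `samples` timings."""
    timings = [rng.gauss(TRUE_SHUTDOWN_MS, NOISE_MS) for _ in range(samples)]
    return statistics.mean(timings)


def spread_of_reports(samples_per_cycle, cycles=200, seed=0):
    """Standard deviation of the reported values across many simulated cycles."""
    rng = random.Random(seed)
    reports = [simulate_cycle(rng, samples_per_cycle) for _ in range(cycles)]
    return statistics.stdev(reports)


# One sample per cycle (Tp3-style) versus 20 samples averaged (Ts-style).
tp3_style_spread = spread_of_reports(samples_per_cycle=1)
ts_style_spread = spread_of_reports(samples_per_cycle=20)

print(f"single sample per cycle: spread ~{tp3_style_spread:.1f} ms")
print(f"mean of 20 per cycle:    spread ~{ts_style_spread:.1f} ms")
```

Under these assumptions the averaged values scatter far less from cycle to cycle, which is why the Ts Shutdown graph should settle into a readable line sooner than the Tp3 ones.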
I’m very pleased to get the whole mess put to bed. There were non-threadsafe Python libraries (subprocess, I’m looking at you) to deal with, twisted banana errors (I kid you not), and a whole mess of timing issues. You can’t build what you don’t monitor and shutdown is an important part of our users’ experience - hopefully we’ll be able to start to trim down our shutdown time now that we are reporting results from all our active Talos boxes.
I’m a little late in posting anything about this, but Bug 468731 - talos testing of builds using sendchange is a big deal. I had initially thought that we wouldn’t be able to make use of Buildbot’s sendchange system due to the weird hacking that Talos does to the Buildbot change object to move around all the pieces of information required by Talos. Thankfully, catlee wasn’t nearly so pessimistic and found a way to make it work.
Mostly it sounds like a bunch of Buildbot nonsense that shouldn’t really interest anyone outside of the Release Engineering team, but it has huge benefits to sheriffs and developers. With Talos buildbot now supporting sendchange we can push builds through the Talos testing infrastructure provided only a correctly formatted link to the build in question. I’ve already used this to force a re-test of a build that had failed; it immediately failed again on different Talos test boxes and proved that the issue was in the build and not in Talos.
If a build is on stage and downloadable we can make Talos test it as many times as we want. We still don’t have the means to push any build we like through, but retesting pretty much any build on staging can be very useful when trying to narrow down a regression range or figure out if a regression is ‘real’. If you would like to retest a build, contact the Release Engineering team member on buildduty (identified as nick-buildduty on IRC) and they’ll get it going for you.
Thanks, catlee!
Bug 487329 - Graph server migration tracking is almost fixed and complete. What does this mean for people who use the graph server?
- graphs-new.mozilla.org will become graphs.mozilla.org
- The current graphs.mozilla.org will become graphs-old.mozilla.org
- No new data will be sent to graphs-old.mozilla.org
- graphs-old.mozilla.org will remain up so that older data can be viewed/searched
We’ve been seeding the new graph server with data for a month now and I think that the majority of people are already using it as their main means of viewing performance data; for the most part the switch over should be painless.
What may be slightly more controversial is that there is no current plan to migrate any data from the old graph server to the new. There are some very good reasons for this:
- Data in the old graph server was generated using throttled test slaves; we no longer throttle slaves, so the numbers would not be comparable
- We are close to rolling out a new Tp test page set; we did not test with this new page set before, so the numbers in the old graph server would not be comparable
- Most of the numbers on the old graph server were collected before we rolled out reboot-every-test-slave-post-every-test - the numbers have greater variance and there are large swaths of data that aren’t trustworthy or useful in any way (basically, long periods of time when the box in question was in serious need of a reboot)
Instead of banging our heads against shifting data from the old, poorly thought-out schema to the new, super fast schema, we’re going to consider designing a system whereby we can pick up old builds of interest and push them through the current testing harness. That way we save ourselves headaches and get data that is actually comparable to current results.
Once all the dependent bugs filed against graph server migration have been fixed we’ll roll all this out. I’m hoping that will be in the next week or two. Right now I’m more interested in whether anyone has strong feelings about which builds are ‘interesting’ enough that we should come up with the means to re-test them with our current test harness. Any favorites out there? Top ten?