This really shouldn’t be super secret, but it has come to my attention that some people don’t know about the compendium of Talos graph links.
It’s really hard to construct graphs on the graph server that are a) meaningful and b) correct - you have to know which talos machines are testing which code and if they are configured in such a way as to generate useful comparisons. While we are currently doing a lot of graph server infrastructure work, it is going to be a while before we can fix this overarching difficulty. I’ve been maintaining this list of graph links for some time and keeping them organized by branch, platform and test.
So, next time you are pushed to the brink of madness trying to figure out if there is a regression in Ts for mozilla-central on WinXP, please use the graph links and save your sanity.
This downtime will affect all build machines reporting to mozilla-central (Firefox waterfall).
The following fix will be applied:
The applied patch will affect all leak/bloat/codesighs values reported to the graph server as we will be aligning on bytes as the unit of measurement. A jump will be observed in the graphs for those values that are currently being recorded in kilobytes/megabytes.
Please let me know if there is any reason not to proceed with the downtime.
Fixes in this downtime include:
- Bug 443979 - talos collects memory information on mac in KB, on linux/win in Bytes. Talos collects various memory metrics (RSS, Private Bytes, Memory Set Size) to get a feel for how much space the browser is taking up. I realized a little while back that we are collecting values in bytes for linux & windows and in kilobytes on mac. This patch will get us to a state where everything is stored in bytes. The downside is that there is going to be a really big jump in the mac memory allocation graphs as we switch from kilobytes to bytes. Mark and I played around with doing a database update to make for a smooth transition, but this would end up requiring shutting down the tree for longer than is comfortable (potentially over a day without any mac Talos boxes reporting). So, we are going to live with the jump in the graphs to get this fixed.
- Bug 459598 - “medians” for individual pages in Tp don’t seem to be medians. This is a potentially large change, but I have high hopes that it won’t affect reported Talos numbers too drastically. The basic issue is that the pageloader extension that Talos uses to cycle through web page test sets has been sorting the list of numerical results as if they were strings. Orderings can end up like ('100', '1900', '900', '999', ...), throwing off the median result for a given set of numbers. I think that we are hitting mis-sortings pretty rarely, but we won’t know for sure until the fix is applied to the production boxes and new numbers are generated. The worst case scenario would be changes across the board to all Talos results. Whatever the new results are, they will be considered the new baseline for performance data.
- Bug 457885 - windows talos machines stopped testing anything for quite a few hours. The current way of determining if a new build is available to test is to scrape the tinderbox waterfall for builds marked ‘success’. That was a fine system when we only had a single build machine reporting to a single waterfall column. As we have expanded the number of builders reporting to a single column, Talos skips builds: a builder reports a successful build to the waterfall but, before Talos has a chance to process that change, another builder overwrites that build report with its own successful build data. To avoid this we are changing to a system where we monitor the ftp directories where builds are dropped. I’ve tested extensively on staging and this should be a smooth transition.
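The sorting problem behind Bug 459598 is easy to reproduce outside the pageloader. The snippet below is a minimal Python sketch, not the pageloader's own code; the timing values and both function names are made up for illustration. It shows how sorting results as strings picks a different middle element than sorting them as numbers.

```python
from statistics import median

# Hypothetical page-load timings in milliseconds, held as strings the way
# an unconverted result list might be.
raw_timings = ["100", "1900", "900", "999", "250"]


def median_of_string_sorted(values):
    """Reproduce the bug: sort as strings, then take the middle element."""
    ordered = sorted(values)  # lexicographic order: '100', '1900', '250', ...
    return int(ordered[len(ordered) // 2])


def median_of_numeric_sorted(values):
    """Convert to numbers first so the ordering is by magnitude."""
    return median(int(v) for v in values)


print(median_of_string_sorted(raw_timings))
print(median_of_numeric_sorted(raw_timings))
```

With this particular list the two approaches disagree badly. For lists where every value has the same number of digits they agree, which fits the expectation that mis-sortings are fairly rare.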
Firefox, Firefox3.0 & Mozilla1.8 will remain closed until I’m confident that these patches have applied correctly and that the reported Talos numbers are stable.
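For the build-detection change in Bug 457885, here is a rough Python sketch of why scraping a waterfall column can skip builds while watching a drop directory does not. Everything in it (the `WaterfallColumn` class, the build names, the `new_builds_from_listing` helper) is a hypothetical stand-in, not Talos or tinderbox code.

```python
class WaterfallColumn:
    """Hypothetical waterfall column: only the newest report is visible."""

    def __init__(self):
        self.latest_success = None

    def report_success(self, build_id):
        # A later report overwrites whatever was shown before.
        self.latest_success = build_id


def new_builds_from_listing(listing, already_seen):
    """Return builds in a directory listing that have not been seen yet."""
    return sorted(set(listing) - already_seen)


column = WaterfallColumn()
ftp_listing = []  # hypothetical names of builds in the drop directory

# Two builders finish between two Talos polls.
for build_id in ["build-0001", "build-0002"]:
    column.report_success(build_id)
    ftp_listing.append(build_id)

seen_by_scrape = [column.latest_success]
seen_by_listing = new_builds_from_listing(ftp_listing, already_seen=set())

print(seen_by_scrape)
print(seen_by_listing)
```

The scrape only ever sees the last build reported to the column, whereas the directory listing keeps every dropped build around until it has been processed.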
Here’s the short story. There was what appeared to be a performance regression across all machines on 9/26. While the usual culprit would be a change to the browser, the regression coincided with several changes made to the Talos system (both to test configuration and physical machine configuration). Investigating the regression proved complicated as browser, Talos code and Talos machine changes had to be taken into consideration. The surprise outcome was that it was actually the resolution to a Talos bug that caused a regression on 8/7, and that the performance change on 9/26 was simply a return to normalcy. For the whole story, read on.
This took a little while to untangle, so you’ll have to get some background from way back in August. I was working on Bug 432883 - talos config file unification and automation. This was a clean up project. We’d wandered into a state where every separate set of Talos boxes required its own configuration file, resulting in a ballooning number of config files and difficulties in rolling out updates/changes. I had created a scheme wherein there would be a single checked-in config file which would be custom altered by each set of Talos machines per test run. Everything worked great in testing and on 8/7 the change went live. There was no redness on any tree and I congratulated myself on a job well done.
Now we get to 9/26. It was a Friday afternoon and I was interested in getting some things done before the weekend. This is what I ended up doing that afternoon:
- Bug 457464 - New: Linux Tp3 numbers show machine configurations appearing to change. This bug had to do with the state of the linux Talos machines and whether or not the numbers were reliable. Talos boxes are created in sets of three and this bug came down to the set qm-plinux-trunk01/02/03. These boxes are designed to work from the same image, on the same hardware, running the same tests and should report numbers within 1% of each other. qm-plinux-trunk02 was reporting suspiciously higher numbers. I poked around and saw that it was spending a lot of time running trackerd. I decided that that process couldn’t be trusted and could be causing extra variance in the reported performance numbers as it could run at any time and affect cpu availability. So, I went through _all_ linux talos machines and turned it off.
- Bug 450666 - Increase intervals for long-running tests. Here we were trying to reduce the glut of data Talos gathers per test cycle. Mostly we build up memory/cpu information because we sample these metrics every second during the running of Tp. Tp can take 45 minutes to run and we would routinely gather over 1500 memory data points. This had become a concern in graph server work as the graph server database had some tables with over a billion rows. I had done some tests on my staging environment and determined that we could change from sampling every second to every 20th second without affecting the value of the data generated. It was a simple patch to the central Talos config file and I’d been holding onto it for a while and wanted to push it out.
- Now, as it happens, I noticed while looking at the config file that there was a problem with it. When we designed the Talos tests we had various discussions trying to figure out the ideal number of times to cycle through the Tp test page set to get the least amount of wobble in the recorded numbers. I did some tests and figured out that 5=high variance, 10=pretty consistent, 20=ideal, 50=no further improvement. We settled on 10 cycles as a reasonable balance between variance and time to execute the test (recall that we are talking about cycling through 400 test pages so, as is, 10 cycles can take a pretty long time). When I looked at the config file I realized that it was only set to cycle 5 times. Oops. I knew that I must have introduced this during my work on config file unification. I figured that the numbers must be showing greater variance and that it would be good to get back to smoother results to make it easier to notice smaller regressions. Here’s where I decided to piggyback the change back from 5 cycles to 10 cycles onto my change to metric sampling. I was making a config file update anyway, might as well lump them together. Bad Alice!
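To put the sampling change from Bug 450666 in rough numbers, here is a small Python calculation. It assumes a full 45-minute Tp run and counts samples for a single metric; the function name is just for illustration.

```python
def samples_per_run(run_seconds, interval_seconds):
    """Number of metric samples collected over one test run."""
    return run_seconds // interval_seconds


RUN_SECONDS = 45 * 60  # the 45-minute upper figure quoted for Tp

every_second = samples_per_run(RUN_SECONDS, 1)
every_20_seconds = samples_per_run(RUN_SECONDS, 20)

print(every_second, every_20_seconds)
```

Under those assumptions, sampling every 20th second cuts the rows stored per metric per run by a factor of 20.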
It only took till Sunday for Bug 457582 - performance metric changes between 1pm-2pm 2008-09-26 to be filed. The numbers on every Talos machine had changed on Friday afternoon. Some had gone up, some had gone down. Due to the number of alterations to Talos that I had made in the regression range (machine configuration, metric sampling, Tp cycles) we had a host of possible reasons to investigate. Everything came down to Bug 450401 - Tp regression from 4:21 8/7/2008. Turns out that back when I did the config file unification project and accidentally switched us from 10 Tp cycles to 5 I had caused a regression. The number of Tp cycles not only affects the variance of the numbers reported for tests, but also the basic test result. So, the change on that fateful Friday afternoon was actually resolving a regression from back on 8/7. Huzzah!
Now, it’s not very good that I caused confusion with my various tinkerings with the Talos system on 9/26, but it’s worse that I caused a regression on 8/7 resulting in wasted time attempting to find a bustage in the Firefox code base. There’s a few things that we can take away from this misadventure:
- Talos machine configuration changes during downtimes only
- Any Talos changes should be associated with monitoring of the generated performance numbers
- File separate bugs for separate patches (okay, we already know this, but turns out that I needed a refresher)
- Look for regressions in _both_ the browser and Talos code (for the most part this should be trivial because Talos changes are pretty rare)
Overall, I see this as growing pains in the Talos project. It’s been pretty easy for me to consider Talos my own personal sandbox to do with as I please. Considering that I’m the default owner of all the active Talos boxes it can be very tempting to just go in and start changing things when I think that they are broken instead of taking the time to file bugs and wait for appropriate downtimes.
Last quarter we optimistically added Bug 447696 (make talos machines rebootable) to the Release Engineering goals list. We wanted to solve this consistently repeated order of events:
- Talos box starts behaving oddly (low/high numbers, stopped reporting altogether, etc), developer notices and files bug
- IT looks at bug, Talos documentation is incredibly complex and doesn’t inspire confidence
- IT hands bug to Release Eng
- Release Eng attempts to fix the machine without having to resort to a reboot
- Reboot deemed necessary, IT bug filed for manual reboot
- Machine rebooted, IT passes bug back to Release Eng
- Release Eng goes through manual configuration steps to get machine back to stable state
- Release Eng restarts buildbot slave
- Success!
The worst of this is that all those steps with ‘Release Eng’ really end up meaning ‘Alice’. The Talos project suffers somewhat from having all of its knowledge centralized in a single human - and this human wants to take a holiday now and then.
Thankfully, we are now in a place where all active Talos boxes are fully rebootable. Go ahead - reboot whatever you want. It should come back up clean and ready to test. The new order of operations is thus:
- Talos box starts behaving oddly (low/high numbers, stopped reporting altogether, etc), developer notices and files bug
- IT looks at bug, whatever the problem is IT reboots the machine
- Success!
We got here by having me carefully tease out the various configuration settings necessary on our five supported platforms (WinXP, Vista, Tiger, Leopard, Ubuntu) and then learning all about how automation works under various frameworks (batch files! plists! rc.local! etc, etc, etc). Once I had created a plan of attack for each platform I moved on to manually updating our 58 active Talos machines. Big thanks to John O’Duinn and Nick Thomas for helping out with the machine updates; it would have taken a lot longer without their assistance. You can dig around from that main bug to all the various bits and pieces that it took to put this together. It was time consuming. It was painful. It was totally worth it.
Did I mention that I’m going on holiday next month…?