Here’s the short story. There was what appeared to be a performance regression across all machines on 9/26. While the usual culprit would be a change to the browser, the regression coincided with several changes made to the Talos system (both to test configuration and physical machine configuration). Investigating the regression proved complicated as browser, Talos code and Talos machine changes had to be taken into consideration. The surprise outcome was that it was actually the resolution to a Talos bug that caused a regression on 8/7, and that the performance change on 9/26 was simply a return to normalcy. For the whole story, read on.
This took a little while to untangle, so you’ll have to get some background from way back in August. I was working on Bug 432883 - talos config file unification and automation. This was a cleanup project. We’d wandered into a state where every separate set of Talos boxes required its own configuration file, resulting in a ballooning number of config files and making it difficult to roll out updates and changes. I had created a scheme wherein a single config file would be checked in and then customized by each set of Talos machines on every test run. Everything worked great in testing and on 8/7 the change went live. There was no redness on any tree and I congratulated myself on a job well done.
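To make the unification scheme concrete, here is a minimal sketch of one shared config customized per machine set. The key names, the browser path and the helper functions are all hypothetical and are not Talos's real schema. The second helper shows the kind of comparison that could make an unintended setting change stand out during a cleanup like this one.

```python
import copy

# Hypothetical shared settings; key names are illustrative, not Talos's schema.
BASE_CONFIG = {
    "tp_cycles": 10,
    "sample_interval_seconds": 1,
    "browser_path": None,  # filled in per machine set
}

# Hypothetical per-machine-set overrides, applied at the start of each run.
MACHINE_OVERRIDES = {
    "qm-plinux-trunk": {"browser_path": "/opt/firefox/firefox"},
}


def build_run_config(machine_set):
    """Copy the shared config and apply the overrides for one machine set."""
    config = copy.deepcopy(BASE_CONFIG)
    config.update(MACHINE_OVERRIDES.get(machine_set, {}))
    return config


def changed_keys(old, new):
    """Return the keys whose values differ between two configs."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

Comparing each old per-machine file against the output of `build_run_config` with something like `changed_keys` is one way an accidental change to a shared value could be spotted before going live.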
Now we get to 9/26. It was a Friday afternoon and I was interested in getting some things done before the weekend. This is what I ended up doing that afternoon:
- Bug 457464 - New: Linux Tp3 numbers show machine configurations appearing to change. This bug had to do with the state of the Linux Talos machines and whether or not the numbers were reliable. Talos boxes are created in sets of three and this bug came down to the set qm-plinux-trunk01/02/03. These boxes are designed to work from the same image, on the same hardware, running the same tests and should report numbers within 1% of each other. qm-plinux-trunk02 was reporting suspiciously higher numbers. I poked around and saw that it was spending a lot of time running trackerd. I decided that this process couldn’t be trusted and could be causing extra variance in the reported performance numbers, as it could run at any time and affect CPU availability. So, I went through _all_ Linux Talos machines and turned it off.
- Bug 450666 - Increase intervals for long-running tests. Here we were trying to reduce the glut of data Talos gathers per test cycle. Mostly we build up memory/cpu information because we sample these metrics every second during the running of Tp. Tp can take 45 minutes to run and we would routinely gather over 1500 memory data points. This had become a concern in graph server work as the graph server database had some tables with over a billion rows. I had done some tests on my staging environment and determined that we could change from sampling every second to every 20th second without affecting the value of the data generated. It was a simple patch to the central Talos config file and I’d been holding onto it for a while and wanted to push it out.
- Now, as it happens, I noticed while looking at the config file that there was a problem with it. When we designed the Talos tests we had various discussions trying to figure out the ideal number of times to cycle through the Tp test page set to get the least wobble in the recorded numbers. I did some tests and figured out that 5=high variance, 10=pretty consistent, 20=ideal, 50=no further improvement. We settled on 10 cycles as a reasonable balance between variance and time to execute the test (recall that we are talking about cycling through 400 test pages so, as is, 10 cycles can take a pretty long time). When I looked at the config file I realized that it was only set to cycle 5 times. Oops. I knew that I must have introduced this during my work on config file unification. I figured that the numbers must be showing greater variance and that it would be good to get back to smoother results to make it easier to notice smaller regressions. Here’s where I decided to piggyback the change from 5 cycles back to 10 cycles onto my change to metric sampling. I was making a config file update anyway, might as well lump them together. Bad Alice!
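For a sense of scale on the sampling change, here is a small back-of-the-envelope sketch. It assumes a full 45-minute Tp run and exactly one sample per interval. The function name is made up for illustration.

```python
def sample_count(run_seconds, interval_seconds):
    """Number of samples collected when sampling once per interval."""
    return run_seconds // interval_seconds


run_seconds = 45 * 60  # a 45-minute Tp run

per_second = sample_count(run_seconds, 1)       # 2700 samples
per_20_seconds = sample_count(run_seconds, 20)  # 135 samples
```

Under those assumptions a single metric drops from 2700 data points per run to 135, a twentyfold reduction in rows sent to the graph server.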
It only took until Sunday for Bug 457582 - performance metric changes between 1pm-2pm 2008-09-26 to be filed. The numbers on every Talos machine had shifted on Friday afternoon. Some had gone up, some had gone down. Given the number of alterations to Talos that I had made in the regression range (machine configuration, metric sampling, Tp cycles), we had a host of possible causes to investigate. Everything came down to Bug 450401 - Tp regression from 4:21 8/7/2008. It turns out that back when I did the config file unification project and accidentally switched us from 10 Tp cycles to 5, I had caused a regression. The number of Tp cycles affects not only the variance of the numbers reported for tests, but also the basic test result. So, the change on that fateful Friday afternoon was actually resolving a regression from back on 8/7. Huzzah!
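Why would the cycle count move the result itself and not just the variance? One possible mechanism, and this is purely an assumption for illustration, is that the first pass through the page set is slower than later passes. The timings and the function below are invented and do not describe how Talos actually computes its numbers.

```python
from statistics import mean


def reported_average(cycles, cold_ms=300.0, warm_ms=200.0):
    """Average page time when only the first cycle pays a warm-up cost.

    cold_ms and warm_ms are made-up values, not measured Talos data.
    """
    times = [cold_ms] + [warm_ms] * (cycles - 1)
    return mean(times)


five_cycles = reported_average(5)   # 220.0
ten_cycles = reported_average(10)   # 210.0
```

In this toy model the slow first cycle carries twice the weight at 5 cycles that it does at 10, so the average moves even though the browser is unchanged. Since some real numbers went up and others went down, a model like this can be at most part of the picture.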
Now, it’s not very good that I caused confusion with my various tinkerings with the Talos system on 9/26, but it’s worse that I caused a regression on 8/7, resulting in wasted time attempting to find bustage in the Firefox code base. There are a few things that we can take away from this misadventure:
- Talos machine configuration changes during downtimes only
- Any Talos changes should be associated with monitoring of the generated performance numbers
- File separate bugs for separate patches (okay, we already know this, but turns out that I needed a refresher)
- Look for regressions in _both_ the browser and Talos code (for the most part this should be trivial because Talos changes are pretty rare)
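The monitoring point could be as simple as comparing results from before and after a Talos change. Here is a minimal sketch. The function names are hypothetical, and the 1% threshold is an assumption borrowed from the tolerance expected between machines in the same set.

```python
from statistics import mean


def relative_shift(before, after):
    """Fractional change in the mean result across a Talos change."""
    baseline = mean(before)
    return (mean(after) - baseline) / baseline


def needs_investigation(before, after, threshold=0.01):
    """Flag a shift in either direction that exceeds the threshold."""
    return abs(relative_shift(before, after)) > threshold
```

For example, `needs_investigation([100, 101, 99], [110, 111, 109])` flags a 10% jump, while a run whose mean stays at 100 passes quietly. A check like this after the 8/7 rollout might have tied the shift to the config change much sooner.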
Overall, I see this as growing pains in the Talos project. It’s been pretty easy for me to consider Talos my own personal sandbox to do with as I please. Considering that I’m the default owner of all the active Talos boxes, it can be very tempting to just go in and start changing things when I think that they are broken instead of taking the time to file bugs and wait for appropriate downtimes.