happiest unalice ever

January 6, 2009

Talos on Mobile Performance Results Now Available

Filed under: mozilla, talos — alice @ 6:03 pm

A collection of Talos mobile performance graph links is now available. This is pretty much all due to the efforts of Aki. He’s taken Talos on mobile from a mere twinkle in Fennec’s eye to an actual system with multiple machines reporting consistent results with low variance. Aki is still working through the last hurdles to get Tp cycling, but the rest of the tests are now up and running.

Happy regression hunting!

January 5, 2009

Auto-Rebooting Talos Boxen Saves The World

Filed under: mozilla, talos — alice @ 6:01 pm

Bug 463020 - Talos machines should be automatically rebooted periodically has finally been fully implemented and resolved fixed.  Here’s what we know about Talos machines:

  1. The longer the uptime the less trustworthy the numbers
  2. Machines can stay up and report untrustworthy/garbage numbers for a long time before anyone notices - as long as the machine cycles green it is usually ignored
  3. Monitoring every active Talos box to see what machines need a reboot is complicated and difficult

Since the majority of the Talos work that comes across my desk ends up as reboot requests to solve machine number drift,  the obvious solution is to start a system of auto-rebooting.  It turned out that doing so wasn’t even that complicated due to earlier work I’ve done on the system to ensure that Talos machines come back from reboot ready to test and auto-reconnect to their buildbot master.  Thanks to catlee for the patch required to get that last piece in place that fires at the end of every test cycle on every talos machine and starts a reboot.
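A minimal sketch of the idea, not the actual patch (which is not reproduced here): a final step at the end of a test cycle could choose a reboot command based on the platform. The function name `reboot_command` and the exact commands are assumptions for illustration only.

```python
import sys

def reboot_command(platform_name=sys.platform):
    """Return a reboot command (as an argument list) for the given platform.

    Hypothetical helper: the commands are common defaults, not taken from
    the Talos or buildbot configuration.
    """
    if platform_name.startswith("win"):
        # Windows: restart immediately, forcing running applications to close.
        return ["shutdown", "/r", "/f", "/t", "0"]
    # Linux and Mac OS X: assumes the test user is allowed to reboot via sudo.
    return ["sudo", "reboot"]

# Only build the command here; running it would restart the machine.
command = reboot_command("linux")
```

In a real setup the command would be handed to whatever runs the last step of the test cycle, so that the reboot only starts once results have been reported.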

Now when looking at performance results you can rest assured that the machine that did the testing was in a clean state, newly rebooted and without any extraneous processes kicking around from previous busted browsers/crash reporters/spawned dialogs/whatever.  No longer do we have to live in fear of Bug 419620 - qm-mini-vista04 (branch) numbers gradually rising and other such bugs resulting from long uptimes on Talos test boxes (say, Leopard Talos boxes showing odd spikes in results every 5-20 test cycles, or Tiger Talos boxes having runaway Terminal processes after a few dozen test cycles, etc).

The nifty side effect is that the variance between results is also decreased - I see much smoother lines on every platform.  The only downside here is that we do lose machines that fail to reboot.  At the moment, I’m seeing about 5-6 machines fall over per week.  This does increase the load on IT as the only solution is a manual kick and thus requires a trip to the colo.  They’ve been real good sports about the increase in bug traffic brought about by this change.

Thanks again to catlee and IT for their help.

November 24, 2008

Lucky Number 90

Filed under: mozilla, talos — alice @ 6:51 pm

After the recent work done to bring up the new Firefox3.1 branch I did a count of active Talos boxes. As far as I can figure it breaks down like this:

  1. 7 testing 1.8
  2. 25 testing 1.9
  3. 23 testing 1.9.1
  4. 23 testing 1.9.2
  5. 9 testing Tracemonkey
  6. 3 testing try server builds

That would give a grand total of 90.  Since we are starting down the road of decommissioning the 7 machines assigned to 1.8 this is probably as high as we are going to get for quite some time - though I envision a future with a lot more infrastructure dedicated to the whole try system.

Recall that the first set of Talos mac mini WinXP boxes were pushed to the production waterfall to test Firefox 3 sometime during the week ending on December 7th, 2007 - or so says my status report for that week.  That would be from 3 machines to 90 in under a year.  Seems like a good time to reflect upon the breadth and depth that the whole Talos infrastructure now covers and also hope that we won’t be having to make any more orders for batches of 50 minis.  I don’t think that IT will ever talk to me again if I make them rack more minis.

October 28, 2008

The Super Secret List of Graph Links

Filed under: graphs, mozilla, talos — alice @ 11:17 am

This really shouldn’t be super secret, but it has come to my attention that some people don’t know about the compendium of Talos graph links.

It’s really hard to construct graphs on the graph server that are a) meaningful and b) correct - you have to know which talos machines are testing which code and if they are configured in such a way as to generate useful comparisons.  While we are currently doing a lot of graph server infrastructure work, it is going to be a while before we can fix this overarching difficulty.  I’ve been maintaining this list of graph links for some time and keeping them organized by branch, platform and test.

So, next time you are pushed to the brink of madness trying to figure out if there is a regression in Ts for mozilla-central on WinXP, please use the graph links and save your sanity.

October 24, 2008

Firefox Build Machine Downtime Tuesday, October 28th 9am-11am PST

Filed under: mozilla — alice @ 2:28 pm

This downtime will affect all build machines reporting to mozilla-central (Firefox waterfall).

The following fix will be applied:

The applied patch will affect all leak/bloat/codesighs values reported to the graph server as we will be aligning on bytes as the unit of measurement.  A jump will be observed in the graphs for those values that are currently being recorded in kilobytes/megabytes.
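To give a feel for the size of the jump, here is a small sketch of the unit conversion. It assumes binary units (1 KB = 1024 bytes); the names `UNIT_TO_BYTES` and `to_bytes` are made up for this example and are not part of the patch.

```python
# Multipliers for converting a reported value to bytes.
# Assumption: binary units (1 KB = 1024 bytes, 1 MB = 1024 KB).
UNIT_TO_BYTES = {"B": 1, "KB": 1024, "MB": 1024 * 1024}

def to_bytes(value, unit):
    """Convert a value recorded in the given unit to bytes."""
    return value * UNIT_TO_BYTES[unit]

# A value previously graphed as 512 (kilobytes) is graphed as 524288 (bytes).
print(to_bytes(512, "KB"))  # 524288
```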

Please let me know if there is any reason not to proceed with the downtime.

October 22, 2008

Details & Warnings: Talos downtime Friday, October 24th 9am-1pm PST

Filed under: mozilla, talos — alice @ 2:06 pm

Fixes in this downtime include:

  • Bug 443979 - talos collects memory information on mac in KB, on linux/win in Bytes.  Talos collects various memory metrics (RSS, Private Bytes, Memory Set Size) to get a feel for how much space the browser is taking up.  I realized a little while back that we are collecting values in bytes for linux & windows and in kilobytes on mac.  This patch will get us to a state where everything is stored in bytes.  The downside is that there is going to be a really big jump in the mac memory allocation graphs as we switch from kilobytes to bytes.  Mark and I played around with doing a database update to make for a smooth transition, but this would end up requiring shutting down the tree for longer than is comfortable (potentially over a day without any mac Talos boxes reporting).  So, we are going to live with the jump in the graphs to get this fixed.
  • Bug 459598 - “medians” for individual pages in Tp don’t seem to be medians. This is a potentially large change, but I have high hopes that it won’t affect reported Talos numbers too drastically.  The basic issue is that the pageloader extension that Talos uses to cycle through web page test sets has been sorting the list of numerical results as if they were strings.  Orderings can end up like (‘100’, ‘1900’, ‘900’, ‘999’, …), throwing off the median result for a given set of numbers.  I think that we are hitting mis-sortings pretty rarely, but we won’t know for sure until the fix is applied to the production boxes and new numbers are generated.  The worst case scenario would be changes across the board to all Talos results.  Whatever the new results will be they will be considered the new baseline for performance data.
  • Bug 457885 - windows talos machines stopped testing anything for quite a few hours.  The current way of determining if a new build is available to test is to scrape the tinderbox waterfall for builds marked ‘success’.  A fine system when we only had a single build machine reporting to a single waterfall column.  As we have expanded the number of builders reporting to a single column, Talos skips builds - a builder reports a successful build to the waterfall but, before Talos has a chance to process that change, another builder overwrites that build report with its own successful build data.  To avoid this we are changing to a system where we monitor the ftp directories where builds are dropped.  I’ve tested extensively on staging and this should be a smooth transition.
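The sorting problem from Bug 459598 is easy to reproduce outside the browser. The sketch below uses Python rather than the pageloader's own JavaScript, with made-up load times and a simplified `median` helper (an assumption, not the extension's actual calculation), to show how sorting numbers as strings shifts the middle value.

```python
def median(sorted_values):
    """Return the middle element (upper middle for even-length lists)."""
    return sorted_values[len(sorted_values) // 2]

# Made-up page load times in milliseconds, stored as strings.
raw = ["100", "1900", "900", "999", "85"]

as_strings = sorted(raw)           # lexicographic order
as_numbers = sorted(raw, key=int)  # numeric order

print(as_strings)          # ['100', '1900', '85', '900', '999']
print(median(as_strings))  # 85
print(as_numbers)          # ['85', '100', '900', '999', '1900']
print(median(as_numbers))  # 900
```

With these values the string sort reports 85 as the "median" where the numeric sort reports 900. When all results have the same number of digits the two orderings agree, which would fit with mis-sortings being rare.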

Firefox, Firefox3.0 & Mozilla1.8 will remain closed until I’m confident that these patches have applied correctly and that the reported Talos numbers are stable.
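The build-skipping problem from Bug 457885 can be sketched the same way. The build names and the `new_builds` helper below are hypothetical and this is not the Talos implementation; it only contrasts a single overwritable status slot with a directory listing that keeps every build visible.

```python
def new_builds(directory_listing, already_tested):
    """Return builds in the drop directory that have not been tested yet."""
    return sorted(set(directory_listing) - set(already_tested))

# Two builders finish close together.  A single waterfall column only
# exposes the most recent report, so build-1001 is never seen:
waterfall_column = "build-1002"

# A directory listing keeps both builds visible until they are processed:
ftp_listing = ["build-1000", "build-1001", "build-1002"]
tested = {"build-1000"}

print(new_builds(ftp_listing, tested))  # ['build-1001', 'build-1002']
```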

October 10, 2008

How I Caused the Talos Regression on 9/26 (and 8/7)

Filed under: mozilla, talos — alice @ 5:17 pm

Here’s the short story.  There was what appeared to be a performance regression across all machines on 9/26.  While the usual culprit would be a change to the browser, the regression coincided with several changes made to the Talos system (both to test configuration and physical machine configuration).  Investigating the regression proved complicated as browser, Talos code and Talos machine changes had to be taken into consideration.  The surprise outcome was that it was actually the resolution to a Talos bug that caused a regression on 8/7, and that the performance change on 9/26 was simply a return to normalcy.  For the whole story, read on.

This took a little while to untangle, so you’ll have to get some background from way back in August.  I was working on Bug 432883 - talos config file unification and automation.  This was a clean up project.  We’d wandered into a state where every separate set of Talos boxes required its own configuration file, resulting in a ballooning number of config files leading to difficulties in rolling out updates/changes.  I had created a scheme wherein there would be a single config file checked in which would be custom altered by each set of Talos machines per test run.  Everything worked great in testing and on 8/7 the change went live.  There was no redness on any tree and I congratulated myself on a job well done.

Now we get to 9/26.  It was a Friday afternoon and I was interested in getting some things done before the weekend.  This is what I ended up doing that afternoon:

  1. Bug 457464 - New: Linux Tp3 numbers show machine configurations appearing to change.  This bug had to do with the state of the linux Talos machines and whether or not the numbers were reliable.  Talos boxes are created in sets of three and this bug came down to the set qm-plinux-trunk01/02/03.  These boxes are designed to work from the same image, on the same hardware, running the same tests and should report numbers within 1% of each other.  qm-plinux-trunk02 was reporting suspiciously higher numbers.  I poked around and saw that it was spending a lot of time running trackerd.  I decided that that process couldn’t be trusted and could be causing extra variance in the reported performance numbers as it could run at any time and affect cpu availability.  So, I went through _all_ linux talos machines and turned it off.
  2. Bug 450666 - Increase intervals for long-running tests.  Here we were trying to reduce the glut of data Talos gathers per test cycle.  Mostly we build up memory/cpu information because we sample these metrics every second during the running of Tp.  Tp can take 45 minutes to run and we would routinely gather over 1500 memory data points.  This had become a concern in graph server work as the graph server database had some tables with over a billion rows.  I had done some tests on my staging environment and determined that we could change from sampling every second to every 20th second without affecting the value of the data generated.  It was a simple patch to the central Talos config file and I’d been holding onto it for a while and wanted to push it out.
  3. Now, as it happens, I noticed while looking at the config file that there was a problem with it.  When we designed the Talos tests we had various discussions trying to figure out the ideal number of times to cycle through the Tp test page set to get the least amount of wobble in the recorded numbers.  I did some tests and figured out that 5=high variance, 10=pretty consistent, 20=ideal, 50=no further improvement.  We settled on 10 cycles as a reasonable balance between variance and time to execute the test (recall that we are talking about cycling through 400 test pages so, as is, 10 cycles can take a pretty long time).  When I looked at the config file I realized that it was only set to cycle 5 times.  Oops.  I knew that I must have introduced this during my work on config file unification.  I figured that the numbers must be having greater variance and that it would be good to get back to smoother results to make it easier to notice smaller regressions.  Here’s where I decided to piggyback the change back from 5 cycles -> 10 cycles with my change to metric sampling.  I was making a config file update anyway, might as well lump them together.  Bad Alice!
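The effect of the sampling change in Bug 450666 can be sketched with some quick arithmetic. This assumes one sample per second over a full 45-minute Tp run, which is an upper bound rather than a measured figure. The real change was to the sampling interval in the config file; the slicing here only illustrates the reduction, and `downsample` is a made-up name.

```python
def downsample(samples, interval):
    """Keep every `interval`-th sample, starting with the first."""
    return samples[::interval]

# Assumption: one sample per second over a 45-minute Tp run.
per_second = list(range(45 * 60))  # 2700 placeholder samples
every_20th = downsample(per_second, 20)

print(len(per_second))  # 2700
print(len(every_20th))  # 135
```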

It only took till Sunday for Bug 457582 - performance metric changes between 1pm-2pm 2008-09-26 to be filed.  The numbers on every Talos machine had taken a change on Friday afternoon.  Some had gone up, some had gone down.  Due to the number of alterations to Talos that I had made in the regression range (machine configuration, metric sampling, Tp cycles) we had a host of possible reasons to investigate.  Everything came down to Bug 450401 - Tp regression from 4:21 8/7/2008.  Turns out that back when I did the config file unification project and accidentally switched us from Tp cycles 10->5 I had caused a regression.  The number of Tp cycles not only affects the variance of the numbers reported for tests, but also the basic test result.  So, the change on that fateful Friday afternoon was actually resolving a regression from back on 8/7.  Huzzah!

Now, it’s not very good that I caused confusion from my various tinkerings to the Talos system on 9/26, but it’s worse that I caused a regression on 8/7 resulting in wasted time attempting to find a bustage in the Firefox code base.  There are a few things that we can take away from this misadventure:

  • Talos machine configuration changes during downtimes only
  • Any Talos changes should be associated with monitoring of the generated performance numbers
  • File separate bugs for separate patches (okay, we already know this, but turns out that I needed a refresher)
  • Look for regressions in _both_ the browser and Talos code (for the most part this should be trivial because Talos changes are pretty rare)

Overall, I see this as growing pains in the Talos project.  It’s been pretty easy for me to consider Talos my own personal sandbox to do with as I please.  Considering that I’m the default owner of all the active Talos boxes it can be very tempting to just go in and start changing things when I think that they are broken instead of taking the time to file bugs and wait for appropriate downtimes.

October 8, 2008

Rebootable Talos Machines (and why that’s a good thing)

Filed under: mozilla, talos — alice @ 3:27 pm

Last quarter we optimistically added Bug 447696 (make talos machines rebootable) to the Release Engineering goals list.  We wanted to solve this consistently repeated order of events:

  1. Talos box starts behaving oddly (low/high numbers, stopped reporting altogether, etc), developer notices and files bug
  2. IT looks at bug, Talos documentation is incredibly complex and doesn’t inspire confidence
  3. IT hands bug to Release Eng
  4. Release Eng attempts to fix the machine without having to resort to a reboot
  5. Reboot deemed necessary, IT bug filed for manual reboot
  6. Machine rebooted, IT passes bug back to Release Eng
  7. Release Eng goes through manual configuration steps to get machine back to stable state
  8. Release Eng restarts buildbot slave
  9. Success!

The worst of this is that all those steps with ‘Release Eng’ really end up meaning ‘Alice’.  The Talos project suffers somewhat from having all of its knowledge centralized in a single human - and this human wants to take a holiday now and then.

Thankfully, we are now in a place where all active Talos boxes are fully rebootable.  Go ahead - reboot whatever you want.  It should come back up clean and ready to test.  The new order of operations is thus:

  1. Talos box starts behaving oddly (low/high numbers, stopped reporting altogether, etc), developer notices and files bug
  2. IT looks at bug, whatever the problem is IT reboots the machine
  3. Success!

We got here by having me carefully tease out the various configuration settings necessary on our five supported platforms (WinXP, Vista, Tiger, Leopard, Ubuntu) and then learning all about how automation works under various frameworks (batch files! plists! rc.local! etc, etc, etc).  Once I had created a plan of attack for each platform I moved on to manually updating our 58 active Talos machines.  Big thanks to John O’Duinn and Nick Thomas for helping out with the machine updates; it would have taken a lot longer without their assistance.  You can dig around from that main bug to all the various bits and pieces that it took to put this together.  It was time consuming.  It was painful. It was totally worth it.

Did I mention that I’m going on holiday next month…?

September 30, 2008

Standalone Talos, V1.3.1

Filed under: mozilla, talos — alice @ 4:14 pm

I recently updated Standalone Talos to version 1.3.1.  This version includes updates to Talos and Pageloader code along with some simple fixes:

  • Upon starting, Standalone Talos now warns the user that all open browser windows must be closed before testing can begin and then exits.  This was in response to the not terribly pleasant behavior described in Bug 454999 - Standalone talos kills all running Firefox processes.  Be warned that Talos enters the browser kill/clean up code after each individual test is run - so if you start a secondary Firefox during testing it will get summarily killed.
  • Removed some bad pages from the included web page test set used by Tp.  A couple of the included pages displayed Flash warnings and could kill the browser.  Since these problems were most likely introduced in the attempt to clean up the pages for testing purposes, this isn’t considered a ‘real’ browser crash and we’d rather just ignore it.
  • Better ability to recognize browser freeze up/crash.  Way back with Bug 416911 - per-test timeout in talos, code was added to Talos to monitor the browser for activity - if it doesn’t load a page in a given amount of time it is considered to be busted. This is finally making its way into Standalone Talos.
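The timeout idea in the last item can be sketched with Python's standard library. This is a simplification, not the Talos code: it puts a single time limit on a whole child process rather than watching individual page loads, and `run_with_timeout` is a made-up name.

```python
import subprocess
import sys

def run_with_timeout(command, timeout_seconds):
    """Run a command; return 'ok', 'failed', or 'timed out'."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timed out"
    return "ok" if result.returncode == 0 else "failed"

# Stand-in for a hung browser: a child process that sleeps for 5 seconds.
hung = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(hung, 0.5))  # timed out
```

A test harness would treat the "timed out" case as a busted browser and move on to cleanup instead of waiting forever.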

Currently, Standalone Talos is simply a zipped up package of code, manifest files and test pages.  In the future it would be smarter to have it checked into its own directory or have its own branch - as is, it only gets updated if I happen to remember that it has fallen out of date.  That said, if you see a fix go into Talos that hasn’t yet been included in a version of Standalone Talos feel free to enter a bug under component “Release Engineering: Talos”.
