happiest unalice ever

February 25, 2009

Double Sending Talos Results To New & Improved Graph Server

Filed under: graphs, mozilla, talos — alice @ 3:16 pm

We are inching closer and closer to completing the effort to rewrite the back end of the graph server.  This has included a whole new schema and rewriting all the scripts that interact with that schema.  Due to the major differences between the old and new schemas, data migration isn’t going to be easy.  While we do plan on moving some data over (say around interesting branch points, or for retired branches that we are no longer actively testing), we decided to ease into use of the new database.  This means that for the next few weeks, possibly a month, Talos boxes are going to be sending data first to graphs.mozilla.org and then immediately sending the same results to graphs-new.mozilla.org.  This should in no way affect the numbers collected by Talos, or impact the cycle time of Talos machines.  This gives us a few benefits:

  1. Stress test the new graph server.  We’ve had a staging version of the graph server with the new schema up and running for a while, but it only has 5 or 6 staging Talos boxes reporting to it.  We need to see what happens when 90 boxes try to report results all at once.
  2. A good chance to pre-populate the new graph server db.  When we had discussions about just migrating all data in the old db and forcing a switch over to the new graph server all in one shot we ended up talking on the order of 24-36 hours (or more) to convert and transfer all the data.  Just doing double send for a while is going to be far less intrusive and will let us work out the remaining data migration issues under a more reasonable time frame.
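
For anyone curious what double sending amounts to, here is a minimal sketch in Python. It is not the actual Talos code: the function names and the `send` callable are hypothetical, and swallowing a failure on the new server is my assumption about how to keep it from affecting test runs. Only the two server names come from this post.

```python
from typing import Callable, Dict, List

# Server names are from the post; everything else here is hypothetical.
PRIMARY = "graphs.mozilla.org"
SECONDARY = "graphs-new.mozilla.org"


def double_send(results: Dict[str, float],
                send: Callable[[str, Dict[str, float]], None]) -> List[str]:
    """Send results to the old server first, then try the new one.

    Returns the list of servers that accepted the results.
    """
    delivered = []
    # The existing server stays authoritative, so a failure here propagates.
    send(PRIMARY, results)
    delivered.append(PRIMARY)
    # Assumption: the new server is still being stress tested, so a failure
    # is logged and ignored rather than allowed to fail the test run.
    try:
        send(SECONDARY, results)
        delivered.append(SECONDARY)
    except Exception as error:
        print(f"ignoring failed send to {SECONDARY}: {error}")
    return delivered
```

Passing the transport in as `send` keeps the sketch independent of however results are really posted, and makes the ordering (old server first, new server second) easy to check.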

The main change that developers who monitor the various waterfalls will notice is that Talos columns will have the standard links to graphs followed by a second set of graph links which will be to the new graph server.  Expect this change to take place by the end of this week.  If this causes any undue confusion, or you find bugs in graphs-new.mozilla.org, feel free to drop by #perfomatic to provide feedback.

January 27, 2009

Firefox3.0 Unthrottled, Victory In Our Time?

Filed under: mozilla, talos — alice @ 6:43 pm

During a long downtime last Friday morning I unthrottled all the WinXP, Vista and Ubuntu boxes testing Firefox3.0.  I had forecast a 25-30 minute time savings in each full Talos test cycle but, alas, that was not to be.  After crunching the numbers, it looks like we save 10 minutes per full test cycle for each operating system.  Now, that was initially pretty disappointing.  After further consideration I think that 10 minutes would be a very welcome time savings to anyone babysitting the tree waiting on performance results; in fact, any decrease in Talos machine cycle time is pretty much a good thing.

Now that the machines have had some time to cycle and get us a base of results I went over the appropriate Firefox3.0 graph links to see how the numbers look. From examinations of Tp and Ts:

  • Ubuntu results slightly increased in variance post un-throttling
  • WinXP results slightly decreased in variance post un-throttling
  • Vista variance was unchanged

Given the time savings, and despite the slightly degraded Linux numbers, I believe that unthrottling should be rolled out across all Talos boxes testing all branches.  While it isn’t going to solve all our problems it’s a step in the direction of a better testing harness. Feel free to comment here or in the bug. I’d like to get as much feedback as I can, as this will change the reported results of a large number of our test boxes.

January 15, 2009

Standalone Talos V1.4

Filed under: mozilla, talos — alice @ 5:44 pm

Includes a few noteworthy fixes:

  1. Bug 459598 - “medians” for individual pages in Tp don’t seem to be medians
  2. Bug 398463 - talos runs tests in a different order than listed in the config file
  3. Bug 419367 - Standalone Talos should output results as it runs

Directions for installing and running Standalone Talos have been updated to point to the latest release.

January 9, 2009

Investigating Unthrottling Talos Boxes

Filed under: mozilla, talos — alice @ 3:12 pm

Way back in the day when the Talos project was just getting off the ground we had concerns about the granularity of JavaScript timers and their ability to accurately measure performance results. We quickly decided to go with a CPU throttling system, whereby we could slow down the testing and get repeatable test results (Bug 393940 - throttle mac mini XP speed down to something slower). Since that day we’ve had throttling on all Talos WinXP, Vista and Ubuntu boxes, and would have had on Tiger and Leopard if we’d managed to figure out a way to do it.

John Resig recently put together a very thorough blog post about the accuracy of JavaScript time. From his graphs you can see that Firefox 3 does a fine job of reporting to the millisecond timing, and it appears to have been doing so since Bug 363258 - bad millisecond resolution for (new Date).getTime() / Date.now() on Windows was fixed back in the summer of 2007. Now, I could be wrong here, but it was my understanding that the timing was still bad on Firefox 2, which Talos continued to test till just last month when the boxes were retired for that branch.

In either case, we now have good timing on Firefox 3 and we no longer have to worry about the Firefox 2 question so it’s time to examine if we want to keep throttling Talos boxes. There are several obvious pluses to not running throttled: faster test results, cross-comparison between performance results across all platforms (at the moment, we cannot compare Tiger/Leopard results to anything else as they are unthrottled), closer match between test environment and end users’ machines and a simpler set up for Talos boxes themselves. We will lose out on the ability to compare current results to any historic data that we have already collected - turning off throttling doesn’t simply halve the results so there isn’t any easy way to tell what a new unthrottled result “means” in comparison to an old throttled result.

I’ve started to take steps to get us to an unthrottled state (Bug 468680 - unthrottle talos winxp, vista and ubuntu boxes.). For now that just means turning off throttling on all of our Talos staging machines and letting it run for a while. In a week I’ll look over the numbers and see if we are getting consistent, low-variance results out of the newly unthrottled machines. If all that goes smoothly a downtime will be scheduled and the scripts on all the Talos machines that control throttling will be removed and a drop will be seen for all performance results on WinXP/Vista/Ubuntu.

Just for the record, this does break my heart just a little bit. Controlling throttling on Talos boxes has been one of the biggest headaches of the system. Turns out that modern operating systems don’t really want you to do anything to touch the CPU, and are quite happy to hide it very deep and ignore requests when they see fit. That throttling is on consistently across three operating systems and is correctly restarted on reboot is due to weeks of effort. Now that you have all shed a single tear for my loss I’ll get back to removing all my throttling scripts. Sunrise, sunset and all that.

January 6, 2009

Talos on Mobile Performance Results Now Available

Filed under: mozilla, talos — alice @ 6:03 pm

Collection of Talos mobile performance graph links now available. This is pretty much all due to the efforts of Aki. He’s taken Talos on mobile from a mere twinkle in Fennec’s eye to an actual system with multiple machines reporting consistent results with low variance. Aki is still working through the last hurdles to get Tp cycling, but the rest of the tests are now up and running.

Happy regression hunting!

January 5, 2009

Auto-Rebooting Talos Boxen Saves The World

Filed under: mozilla, talos — alice @ 6:01 pm

Bug 463020 - Talos machines should be automatically rebooted periodically has finally been fully implemented and resolved fixed.  Here’s what we know about Talos machines:

  1. The longer the uptime the less trustworthy the numbers
  2. Machines can stay up and report untrustworthy/garbage numbers for a long time before anyone notices - as long as the machine cycles green it is usually ignored
  3. Monitoring every active Talos box to see what machines need a reboot is complicated and difficult

Since the majority of the Talos work that comes across my desk ends up as reboot requests to solve machine number drift,  the obvious solution is to start a system of auto-rebooting.  It turned out that doing so wasn’t even that complicated due to earlier work I’ve done on the system to ensure that Talos machines come back from reboot ready to test and auto-reconnect to their buildbot master.  Thanks to catlee for the patch required to get that last piece in place that fires at the end of every test cycle on every talos machine and starts a reboot.

Now when looking at performance results you can rest assured that the machine that did the testing was in a clean state, newly rebooted and without any extraneous processes kicking around from previous busted browsers/crash reporters/spawned dialogs/whatever.  No longer do we have to live in fear of Bug 419620 - qm-mini-vista04 (branch) numbers gradually rising and other such bugs resulting from long uptimes on Talos test boxes (say, Leopard Talos boxes showing odd spikes in results every 5-20 test cycles, or Tiger Talos boxes having runaway Terminal processes after a few dozen test cycles, etc).

The nifty side effect is that the variance between results is also decreased - I see much smoother lines on every platform.  The only downside here is that we do lose machines that fail to reboot.  At the moment, I’m seeing about 5-6 machines fall over per week.  This does increase the load on IT as the only solution is a manual kick and thus requires a trip to the colo.  They’ve been real good sports about the increase in bug traffic brought about by this change.

Thanks again to catlee and IT for their help.

November 24, 2008

Lucky Number 90

Filed under: mozilla, talos — alice @ 6:51 pm

After the recent work done to bring up the new Firefox3.1 branch I did a count of active Talos boxes. As far as I can figure it breaks down like this:

  1. 7 testing 1.8
  2. 25 testing 1.9
  3. 23 testing 1.9.1
  4. 23 testing 1.9.2
  5. 9 testing Tracemonkey
  6. 3 testing try server builds

That would give a grand total of 90.  Since we are starting down the road of decommissioning the 7 machines assigned to 1.8 this is probably as high as we are going to get for quite some time - though I envision a future with a lot more infrastructure dedicated to the whole try system.

Recall that the first set of Talos mac mini WinXP boxes were pushed to the production waterfall to test Firefox 3 sometime during the week ending on December 7th, 2007 - or so says my status report for that week.  That would be from 3 machines to 90 in under a year.  Seems like a good time to reflect upon the breadth and depth that the whole Talos infrastructure now covers and also hope that we won’t be having to make any more orders for batches of 50 minis.  I don’t think that IT will ever talk to me again if I make them rack more minis.

October 28, 2008

The Super Secret List of Graph Links

Filed under: graphs, mozilla, talos — alice @ 11:17 am

This really shouldn’t be super secret, but it has come to my attention that some people don’t know about the compendium of Talos graph links.

It’s really hard to construct graphs on the graph server that are a) meaningful and b) correct - you have to know which talos machines are testing which code and if they are configured in such a way as to generate useful comparisons.  While we are currently doing a lot of graph server infrastructure work, it is going to be a while before we can fix this overarching difficulty.  I’ve been maintaining this list of graph links for some time and keeping them organized by branch, platform and test.

So, next time you are pushed to the brink of madness trying to figure out if there is a regression in Ts for mozilla-central on WinXP, please use the graph links and save your sanity.

October 22, 2008

Details & Warnings: Talos downtime Friday, October 24th 9am-1pm PST

Filed under: mozilla, talos — alice @ 2:06 pm

Fixes in this downtime include:

  • Bug 443979 - talos collects memory information on mac in KB, on linux/win in Bytes.  Talos collects various memory metrics (RSS, Private Bytes, Memory Set Size) to get a feel for how much space the browser is taking up.  I realized a little while back that we are collecting values in bytes for linux & windows and in kilobytes on mac.  This patch will get us to a state where everything is stored in bytes.  The downside is that there is going to be a really big jump in the mac memory allocation graphs as we switch from kilobytes to bytes.  Mark and I played around with doing a database update to make for a smooth transition, but this would end up requiring shutting down the tree for longer than is comfortable (potentially over a day without any mac Talos boxes reporting).  So, we are going to live with the jump in the graphs to get this fixed.
  • Bug 459598 - “medians” for individual pages in Tp don’t seem to be medians. This is a potentially large change, but I have high hopes that it won’t affect reported Talos numbers too drastically.  The basic issue is that the pageloader extension that Talos uses to cycle through web page test sets has been sorting the list of numerical results as if they were strings.  Orderings can end up like (‘100’, ‘1900’, ‘900’, ‘999’, …), throwing off the median result for a given set of numbers.  I think that we are hitting mis-sortings pretty rarely, but we won’t know for sure until the fix is applied to the production boxes and new numbers are generated.  The worst case scenario would be changes across the board to all Talos results.  Whatever the new results are, they will be considered the new baseline for performance data.
  • Bug 457885 - windows talos machines stopped testing anything for quite a few hours.  The current way of determining if a new build is available to test is to scrape the tinderbox waterfall for builds marked ‘success’.  That was a fine system when we only had a single build machine reporting to a single waterfall column.  As we have expanded the number of builders reporting to a single column, Talos has started skipping builds: a builder reports a successful build to the waterfall but, before Talos has a chance to process that change, another builder overwrites that build report with its own successful build data.  To avoid this we are changing to a system where we monitor the ftp directories where builds are dropped.  I’ve tested extensively on staging and this should be a smooth transition.
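
To make the median bug (Bug 459598) concrete, here is a small sketch of what sorting numbers as strings does. The pageloader extension is JavaScript, where the default array sort compares elements as strings; this illustration is in Python, the page times are made up, and `middle_value` is a hypothetical helper rather than the pageloader’s actual median code.

```python
def middle_value(values, key=None):
    """Sort the values and return the one in the middle position."""
    ordered = sorted(values, key=key)
    return ordered[len(ordered) // 2]


# Made-up page load times, held as strings the way the buggy code saw them.
page_times = ["900", "100", "1900", "999", "250"]

# String ordering: '100', '1900', '250', '900', '999'
buggy = middle_value(page_times)

# Numeric ordering: '100', '250', '900', '999', '1900'
fixed = middle_value(page_times, key=int)

print(buggy, fixed)  # 250 900
```

The mis-sort only bites when the values differ in digit count, which fits my suspicion that we are hitting it pretty rarely.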

Firefox, Firefox3.0 & Mozilla1.8 will remain closed until I’m confident that these patches have applied correctly and that the reported Talos numbers are stable.
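
As a rough illustration of why the build detection change in Bug 457885 should stop builds from being skipped, here is a sketch that contrasts the two approaches. The build names and the `new_builds` helper are hypothetical, and sorting by name assumes that build names sort chronologically, which may not hold for the real ftp directories.

```python
def new_builds(listing, already_tested):
    """Return builds present in the directory listing but not yet tested."""
    return sorted(set(listing) - set(already_tested))


# Two builders report to the same waterfall column between two Talos polls.
reports = ["build-a", "build-b"]

# Scraping the column only ever shows the most recent report,
# so build-a is never seen.
seen_by_scraping = [reports[-1]]

# A directory listing still contains every build that was dropped there.
seen_by_listing = new_builds(listing=reports, already_tested=[])

print(seen_by_scraping)  # ['build-b']
print(seen_by_listing)   # ['build-a', 'build-b']
```

The point is that a directory keeps history while a waterfall column only keeps its latest state, so polling frequency stops mattering.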

October 10, 2008

How I Caused the Talos Regression on 9/26 (and 8/7)

Filed under: mozilla, talos — alice @ 5:17 pm

Here’s the short story.  There was what appeared to be a performance regression across all machines on 9/26.  While the usual culprit would be a change to the browser, the regression coincided with several changes made to the Talos system (both to test configuration and physical machine configuration).  Investigating the regression proved complicated as browser, Talos code and Talos machine changes had to be taken into consideration.  The surprise outcome was that it was actually the resolution to a Talos bug that caused a regression on 8/7, and that the performance change on 9/26 was simply a return to normalcy.  For the whole story, read on.

This took a little while to untangle, so you’ll have to get some background from way back in August.  I was working on Bug 432883 - talos config file unification and automation.  This was a clean up project.  We’d wandered into a state where every separate set of Talos boxes required its own configuration file, resulting in a ballooning number of config files leading to difficulties in rolling out updates/changes.  I had created a scheme wherein there would be a single config file checked in which would be custom altered by each set of Talos machines per test run.  Everything worked great in testing and on 8/7 the change went live.  There was no redness on any tree and I congratulated myself on a job well done.

Now we get to 9/26.  It was a Friday afternoon and I was interested in getting some things done before the weekend.  This is what I ended up doing that afternoon:

  1. Bug 457464 - New: Linux Tp3 numbers show machine configurations appearing to change.  This bug had to do with the state of the linux Talos machines and whether or not the numbers were reliable.  Talos boxes are created in sets of three and this bug came down to the set qm-plinux-trunk01/02/03.  These boxes are designed to work from the same image, on the same hardware, running the same tests and should report numbers within 1% of each other.  qm-plinux-trunk02 was reporting suspiciously higher numbers.  I poked around and saw that it was spending a lot of time running trackerd.  I decided that that process couldn’t be trusted and could be causing extra variance in the reported performance numbers as it could run at any time and affect cpu availability.  So, I went through _all_ linux talos machines and turned it off.
  2. Bug 450666 - Increase intervals for long-running tests.  Here we were trying to reduce the glut of data Talos gathers per test cycle.  Mostly we build up memory/cpu information because we sample these metrics every second during the running of Tp.  Tp can take 45 minutes to run and we would routinely gather over 1500 memory data points.  This had become a concern in graph server work as the graph server database had some tables with over a billion rows.  I had done some tests on my staging environment and determined that we could change from sampling every second to every 20th second without affecting the value of the data generated.  It was a simple patch to the central Talos config file and I’d been holding onto it for a while and wanted to push it out.
  3. Now, as it happens, I noticed while looking at the config file that there was a problem with it.  When we designed the Talos tests we had various discussions trying to figure out the ideal number of times to cycle through the Tp test page set to get the least amount of wobble in the recorded numbers.  I did some tests and figured out that 5=high variance, 10=pretty consistent, 20=ideal, 50=no further improvement.  We settled on 10 cycles as a reasonable balance between variance and time to execute the test (recall that we are talking about cycling through 400 test pages so, as is, 10 cycles can take a pretty long time).  When I looked at the config file I realized that it was only set to cycle 5 times.  Oops.  I knew that I must have introduced this during my work on config file unification.  I figured that the numbers must be having greater variance and that it would be good to get back to smoother results to make it easier to notice smaller regressions.  Here’s where I decided to piggyback the change back from 5 cycles -> 10 cycles with my change to metric sampling.  I was making a config file update anyway, might as well lump them together.  Bad Alice!
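
To put rough numbers on the sampling change in item 2, here is a back-of-the-envelope sketch. It assumes a Tp run that takes the full 45 minutes and one data point per metric per sample, so the figures are upper bounds and not measured counts (in practice we were seeing “over 1500” points per run). The `sample_count` helper is hypothetical.

```python
def sample_count(duration_seconds, interval_seconds):
    """Number of samples taken when sampling once per interval."""
    return duration_seconds // interval_seconds


TP_DURATION = 45 * 60  # a 45 minute Tp run, in seconds

every_second = sample_count(TP_DURATION, 1)
every_20th_second = sample_count(TP_DURATION, 20)

print(every_second)       # 2700
print(every_20th_second)  # 135
```

That is a twentyfold reduction in rows per metric per run, which is what makes the change worthwhile for a database with tables already past a billion rows.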

It only took till Sunday for Bug 457582 - performance metric changes between 1pm-2pm 2008-09-26 to be filed.  The numbers on every Talos machine had changed on Friday afternoon.  Some had gone up, some had gone down.  Due to the number of alterations to Talos that I had made in the regression range (machine configuration, metric sampling, Tp cycles) we had a host of possible reasons to investigate.  Everything came down to Bug 450401 - Tp regression from 4:21 8/7/2008.  Turns out that back when I did the config file unification project and accidentally switched us from Tp cycles 10->5 I had caused a regression.  The number of Tp cycles not only affects the variance of the numbers reported for tests, but also the basic test result.  So, the change on that fateful Friday afternoon was actually resolving a regression from back on 8/7.  Huzzah!

Now, it’s not very good that I caused confusion from my various tinkerings to the Talos system on 9/26, but it’s worse that I caused a regression on 8/7 resulting in wasted time attempting to find a bustage in the Firefox code base.  There are a few things that we can take away from this misadventure:

  • Talos machine configuration changes during downtimes only
  • Any Talos changes should be associated with monitoring of the generated performance numbers
  • File separate bugs for separate patches (okay, we already know this, but turns out that I needed a refresher)
  • Look for regressions in _both_ the browser and Talos code (for the most part this should be trivial because Talos changes are pretty rare)

Overall, I see this as growing pains in the Talos project.  It’s been pretty easy for me to consider Talos my own personal sandbox to do with as I please.  Considering that I’m the default owner of all the active Talos boxes it can be very tempting to just go in and start changing things when I think that they are broken instead of going through the time to file bugs and wait for appropriate downtimes.
