happiest unalice ever

February 25, 2009

Double Sending Talos Results To New & Improved Graph Server

Filed under: graphs, mozilla, talos — alice @ 3:16 pm

We are inching closer and closer to completing the effort to rewrite the back end of the graph server.  This has included a whole new schema and rewriting all the scripts that interact with that schema.  Due to the major differences between the old and new schemas, data migration isn’t going to be easy.  While we do plan on moving some data over (say around interesting branch points, or for retired branches that we are no longer actively testing), we decided to ease into use of the new database instead of switching all at once.  This means that for the next few weeks, possibly a month, Talos boxes are going to be sending data first to graphs.mozilla.org and then immediately sending the same results to graphs-new.mozilla.org.  This should in no way affect the numbers collected by Talos, or impact the cycle time of Talos machines.  This gives us a few benefits:

  1. Stress test the new graph server.  We’ve had a staging version of the graph server with the new schema up and running for a while, but it only has 5 or 6 staging Talos boxes reporting to it.  We need to see what happens when 90 boxes try to report results all at once.
  2. A good chance to pre-populate the new graph server db.  When we discussed migrating all the data in the old db and forcing a switch over to the new graph server in one shot, the estimates came out on the order of 24-36 hours (or more) to convert and transfer all the data.  Just doing double send for a while is going to be far less intrusive and will let us work out the remaining data migration issues in a more reasonable time frame.
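
As a rough sketch of what double sending could look like (this is not the actual Talos code; `double_send`, the `post` callback and the way the host names are used here are all made up for illustration), the same payload goes to the old server first and then to the new one, and a failure on one server does not stop the send to the other:

```python
from typing import Callable, Dict, Sequence

# Host names come from the post; treating them as plain upload targets
# is an assumption made for this sketch.
GRAPH_SERVERS = ("graphs.mozilla.org", "graphs-new.mozilla.org")


def double_send(results: str,
                post: Callable[[str, str], None],
                servers: Sequence[str] = GRAPH_SERVERS) -> Dict[str, bool]:
    """Send the same results to every server, in order.

    `post(server, results)` is a hypothetical upload function supplied by
    the caller. Errors are caught per server so that trouble on one graph
    server cannot block the send to the other.
    """
    outcome = {}
    for server in servers:
        try:
            post(server, results)
            outcome[server] = True
        except Exception:
            outcome[server] = False
    return outcome
```

Passing the upload function in as a parameter keeps the sketch testable without touching a real server.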

The main change that developers who monitor the various waterfalls will notice is that Talos columns will have the standard links to graphs followed by a second set of graph links which will be to the new graph server.  Expect this change to take place by the end of this week.  If this causes any undue confusion, or you find bugs in graphs-new.mozilla.org, feel free to drop by #perfomatic to provide feedback.

January 27, 2009

Firefox3.0 Unthrottled, Victory In Our Time?

Filed under: mozilla, talos — alice @ 6:43 pm

During a long downtime last Friday morning I unthrottled all the WinXP, Vista and Ubuntu boxes testing Firefox3.0.  I had forecast a 25-30 minute time savings in each full Talos test cycle but, alas, that was not to be.  After crunching the numbers, it looks like we save 10 minutes per full test cycle for each operating system.  Now, that was initially pretty disappointing.  After further consideration I think that 10 minutes would be a very welcome time savings to anyone babysitting the tree waiting on performance results; in fact, any decrease in Talos machine cycle time is pretty much a good thing.

Now that the machines have had some time to cycle and get us a base of results I went over the appropriate Firefox3.0 graph links to see how the numbers look. From examinations of Tp and Ts:

  • Ubuntu results slightly increased in variance post un-throttling
  • WinXP results slightly decreased in variance post un-throttling
  • Vista variance was unchanged
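
For anyone who wants to run this kind of before/after comparison on their own numbers, here is a minimal sketch. It is not how the observations above were produced (those came from looking at the graphs), and the choice of coefficient of variation and the `tolerance` value are assumptions made for illustration. Since unthrottling shifts the absolute results, the spread is divided by the mean so that the two sets of runs stay comparable:

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean of the samples."""
    return stdev(samples) / mean(samples)


def compare_spread(before, after, tolerance=0.001):
    """Classify how relative spread changed between two sets of results.

    `tolerance` is an arbitrary threshold picked for this sketch: changes
    in the coefficient of variation smaller than it count as "unchanged".
    """
    delta = coefficient_of_variation(after) - coefficient_of_variation(before)
    if abs(delta) <= tolerance:
        return "unchanged"
    return "increased" if delta > 0 else "decreased"
```

Each argument is a list of results (for example, Ts values) with at least two entries, one list from before the change and one from after.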

Given the time savings, and even knowing that the Linux numbers are slightly degraded, I believe that unthrottling should be rolled out across all Talos boxes testing all branches.  While it isn’t going to solve all our problems it’s a step in the direction of a better testing harness. Feel free to comment here or in the bug. I’d like to get as much feedback as I can, as this will change the reported results of a large number of our test boxes.

January 21, 2009

Talos Firefox3.0 downtime Friday January 23rd, 9am-12pm PST

Filed under: mozilla — alice @ 2:58 pm

In this downtime:

After throttling has been removed from the Talos boxes, a drop in reported performance numbers will be observed across WinXP/Vista/Ubuntu; Leopard/Tiger results will be unaffected.

Please let me know if there is any reason not to proceed with this downtime as planned.

January 15, 2009

Standalone Talos V1.4

Filed under: mozilla, talos — alice @ 5:44 pm

Includes a few noteworthy fixes:

  1. Bug 459598 - “medians” for individual pages in Tp don’t seem to be medians
  2. Bug 398463 - talos runs tests in a different order than listed in the config file
  3. Bug 419367 - Standalone Talos should output results as it runs

Directions for installing and running Standalone Talos have been updated to point to the latest release.

January 12, 2009

Graph Server Data Migration Planning

Filed under: graphs, mozilla — alice @ 3:44 pm

Our beloved Perfomatic is undergoing a schema change. The initial design was put together when the Talos project was just getting off the ground. There’s a very large difference between 3 Talos boxes reporting and 90, and it shows in our 60 GB database. It has become so bloated, and anything interesting involves such painful joins between giant tables, that we mostly just leave it alone to run. We’d like to be able to branch out the graph server work to include dashboards, better statistical analysis and administrative features (removing corrupted data, etc.), but everything ends up being hampered by the database.

With this in mind the new schema was designed. It’s broken up into more tables and will greatly reduce the redundancy found in the old schema. It should also make it dead simple to do things like “what are the last 10 data points for test X on branch Y”.
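
To make the “last 10 data points” example concrete, here is a minimal sketch using SQLite. The table and column names are invented for illustration and are not the real graph server schema; the point is only that once branches and tests live in their own small tables, the question turns into one short query:

```python
import sqlite3


def make_schema(conn):
    """Create a small, hypothetical normalized schema (not the real one)."""
    conn.executescript("""
        CREATE TABLE branches (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE tests (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE test_runs (
            id INTEGER PRIMARY KEY,
            branch_id INTEGER REFERENCES branches(id),
            test_id INTEGER REFERENCES tests(id),
            run_time INTEGER,   -- timestamp of the run
            value REAL          -- reported result
        );
    """)


def last_points(conn, test, branch, limit=10):
    """Return the most recent (run_time, value) rows for a test on a branch."""
    rows = conn.execute(
        """
        SELECT r.run_time, r.value
        FROM test_runs r
        JOIN tests t ON t.id = r.test_id
        JOIN branches b ON b.id = r.branch_id
        WHERE t.name = ? AND b.name = ?
        ORDER BY r.run_time DESC
        LIMIT ?
        """,
        (test, branch, limit),
    )
    return rows.fetchall()
```

With a connection from `sqlite3.connect(":memory:")` and a few inserted rows, `last_points(conn, "Ts", "mozilla-central")` returns the newest results first.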

We are starting to put all the pieces together to make use of the new schema but there are some drawbacks:

  1. The format of links to graphs is changing: graph links that work on the old graph server will not work on the new one.  What does this mean for existing links in bugs?
  2. How much, if any, data can we migrate from the old graph server to the new?  The format within the database has changed significantly and the data will require a large amount of massaging to get it into the new one.  Is this effort worth it?
  3. If we are to migrate data, how long can we be without it while it gets pulled out of the old db, altered and re-assembled and then pushed into the new?

Bug 472176 - Migration procedure has been filed to work through issues with switching from the old database to the new.  What I really need is insight from people who work with a graph server on a daily basis.  What is the most important data that is really necessary to migrate?  If we were without data for a few days or a week while migration happened in the background (you would still have the currently reported numbers, just no historic data) would that be okay?  Would it be acceptable to migrate no data and just have the two setups running side by side, until there was enough data in the new one that the old setup could stop accepting new results and be kept alive only for looking at old numbers?

I’d love to get feedback on these questions during the weekly graph server meeting (Mondays, 11am PST); we’ll be discussing migration for the next few meetings as we get closer to being able to make the switch from old schema to new. If you can’t make the meeting time just join #perfomatic and talk with the graph server team directly, or comment in the migration bug.

January 9, 2009

Investigating Unthrottling Talos Boxes

Filed under: mozilla, talos — alice @ 3:12 pm

Way back in the day when the Talos project was just getting off the ground we had concerns about the granularity of JavaScript timers and their ability to accurately measure performance results. We quickly decided to go with a CPU throttling system, whereby we could slow down the testing and get repeatable test results (Bug 393940 - throttle mac mini XP speed down to something slower). Since that day we’ve had throttling on all Talos WinXP, Vista and Ubuntu boxes, and would have had on Tiger and Leopard if we’d managed to figure out a way to do it.

John Resig recently put together a very thorough blog post about the accuracy of JavaScript time. From his graphs you can see that Firefox 3 does a fine job of reporting timing to the millisecond, and it appears to have been doing so since Bug 363258 - bad millisecond resolution for (new Date).getTime() / Date.now() on Windows was fixed back in the summer of 2007. Now, I could be wrong here, but it was my understanding that the timing was still bad on Firefox 2, which Talos continued to test till just last month when the boxes were retired for that branch.

In either case, we now have good timing on Firefox 3 and we no longer have to worry about the Firefox 2 question, so it’s time to examine whether we want to keep throttling Talos boxes. There are several obvious pluses to not running throttled: faster test results, cross-comparison between performance results across all platforms (at the moment, we cannot compare Tiger/Leopard results to anything else as they are unthrottled), a closer match between the test environment and end users’ machines, and a simpler setup for the Talos boxes themselves. We will lose out on the ability to compare current results to any historic data that we have already collected - turning off throttling doesn’t simply halve the results, so there isn’t any easy way to tell what a new unthrottled result “means” in comparison to an old throttled result.

I’ve started to take steps to get us to an unthrottled state (Bug 468680 - unthrottle talos winxp, vista and ubuntu boxes.). For now that just means turning off throttling on all of our Talos staging machines and letting them run for a while. In a week I’ll look over the numbers and see if we are getting consistent, low-variance results out of the newly unthrottled machines. If all that goes smoothly, a downtime will be scheduled and the scripts that control throttling will be removed from all the Talos machines. After that, a drop will be seen in all performance results on WinXP/Vista/Ubuntu.

Just for the record, this does break my heart just a little bit. Controlling throttling on Talos boxes has been one of the biggest headaches of the system. Turns out that modern operating systems don’t really want you to do anything to touch the CPU, and are quite happy to hide it very deep and ignore requests when they see fit. That throttling is on consistently across three operating systems and is correctly restarted on reboot is due to weeks of effort. Now that you have all shed a single tear for my loss I’ll get back to removing all my throttling scripts. Sunrise, sunset and all that.

January 6, 2009

Talos on Mobile Performance Results Now Available

Filed under: mozilla, talos — alice @ 6:03 pm

Collection of Talos mobile performance graph links now available. This is pretty much all due to the efforts of Aki. He’s taken Talos on mobile from a mere twinkle in Fennec’s eye to an actual system with multiple machines reporting consistent results with low variance. Aki is still working through the last hurdles to get Tp cycling, but the rest of the tests are now up and running.

Happy regression hunting!

January 5, 2009

Auto-Rebooting Talos Boxen Saves The World

Filed under: mozilla, talos — alice @ 6:01 pm

Bug 463020 - Talos machines should be automatically rebooted periodically has finally been fully implemented and resolved fixed.  Here’s what we know about Talos machines:

  1. The longer the uptime the less trustworthy the numbers
  2. Machines can stay up and report untrustworthy/garbage numbers for a long time before anyone notices - as long as the machine cycles green it is usually ignored
  3. Monitoring every active Talos box to see what machines need a reboot is complicated and difficult

Since the majority of the Talos work that comes across my desk ends up as reboot requests to solve machine number drift, the obvious solution is to start a system of auto-rebooting.  It turned out that doing so wasn’t even that complicated due to earlier work I’ve done on the system to ensure that Talos machines come back from reboot ready to test and auto-reconnect to their buildbot master.  Thanks to catlee for the patch that put the last piece in place: a step that fires at the end of every test cycle on every Talos machine and starts a reboot.

Now when looking at performance results you can rest assured that the machine that did the testing was in a clean state, newly rebooted and without any extraneous processes kicking around from previous busted browsers/crash reporters/spawned dialogs/whatever.  No longer do we have to live in fear of Bug 419620 - qm-mini-vista04 (branch) numbers gradually rising and other such bugs resulting from long uptimes on Talos test boxes (say, Leopard Talos boxes showing odd spikes in results every 5-20 test cycles, or Tiger Talos boxes having runaway Terminal processes after a few dozen test cycles, etc).

The nifty side effect is that the variance between results is also decreased - I see much smoother lines on every platform.  The only downside here is that we do lose machines that fail to reboot.  At the moment, I’m seeing about 5-6 machines fall over per week.  This does increase the load on IT as the only solution is a manual kick and thus requires a trip to the colo.  They’ve been real good sports about the increase in bug traffic brought about by this change.

Thanks again to catlee and IT for their help.

November 24, 2008

Lucky Number 90

Filed under: mozilla, talos — alice @ 6:51 pm

After the recent work done to bring up the new Firefox3.1 branch I did a count of active Talos boxes. As far as I can figure it breaks down like this:

  1. 7 testing 1.8
  2. 25 testing 1.9
  3. 23 testing 1.9.1
  4. 23 testing 1.9.2
  5. 9 testing Tracemonkey
  6. 3 testing try server builds

That would give a grand total of 90.  Since we are starting down the road of decommissioning the 7 machines assigned to 1.8 this is probably as high as we are going to get for quite some time - though I envision a future with a lot more infrastructure dedicated to the whole try system.

Recall that the first set of Talos mac mini WinXP boxes were pushed to the production waterfall to test Firefox 3 sometime during the week ending on December 7th, 2007 - or so says my status report for that week.  That would be from 3 machines to 90 in under a year.  Seems like a good time to reflect upon the breadth and depth that the whole Talos infrastructure now covers and also hope that we won’t be having to make any more orders for batches of 50 minis.  I don’t think that IT will ever talk to me again if I make them rack more minis.

October 28, 2008

The Super Secret List of Graph Links

Filed under: graphs, mozilla, talos — alice @ 11:17 am

This really shouldn’t be super secret, but it has come to my attention that some people don’t know about the compendium of Talos graph links.

It’s really hard to construct graphs on the graph server that are a) meaningful and b) correct - you have to know which talos machines are testing which code and if they are configured in such a way as to generate useful comparisons.  While we are currently doing a lot of graph server infrastructure work, it is going to be a while before we can fix this overarching difficulty.  I’ve been maintaining this list of graph links for some time and keeping them organized by branch, platform and test.

So, next time you are pushed to the brink of madness trying to figure out if there is a regression in Ts for mozilla-central on WinXP, please use the graph links and save your sanity.
