happiest unalice ever

May 7, 2009

Standalone Talos V1.5

Filed under: mozilla, talos — alice @ 11:30 am

In this version:

See Standalone Talos documentation for download and installation instructions.

April 30, 2009

Tshutdown Goes Live

Filed under: mozilla, talos — alice @ 4:42 pm

If you’ve been following along with Bug 480413 - design test to monitor browser shut down time, you’ll know that I attempted to roll out Tshutdown on the 18th. Unfortunately, there were a couple of bugs that weren’t discovered in staging and I spent 6 or 7 hours on a beautiful Saturday afternoon attempting to fix them on the fly. Not able to make it work, I had to back it all out and fume.

With a clearer head I approached the bugs on Monday morning and resolved both of them. Then it was just a question of waiting for an appropriate downtime. Though I’m as excited about 3.5b4 as anyone, it really gets in the way of arranging Talos code changes. Finally, yesterday afternoon we were able to shut down everything and check things in.

I’m glad to say that this landing was successful. There’s a good chance that Tshutdown will clean up the issues in Bug 478603 - intermittent orange on Windows mozilla-central talos Ts and Tdhtml tests (“failed to initialize browser”), as we’ll start to record the longer shutdown times instead of freezing up and reporting orange.

It’ll be a few more days till we have enough data for Tshutdown to look like anything but a scatter graph, but you can check out the reported results at graphs-new.mozilla.org. There’s actually more to Tshutdown than a single test. You’ll see “Tp3 Shutdown”, “Tp3 Nochrome Shutdown”, “Tp3 Fast Shutdown” and “Ts Shutdown”. Basically, we record the shutdown time after running the given test. Shutdown after Ts gives a ‘clean’ shutdown time, as we run Ts with an empty, new profile; shutdown after Tp3 gives a ‘dirty’ shutdown time, as the browser has just completed cycling 10 times through 400 web pages. The post-Tp3 results will also show greater variance: we only run Tp3 once per full test cycle (the test does take a good hour or more to complete depending on the platform), so we only have that single value to report. With Ts we rapidly open and close the browser 20 times, so we have a data set that we can average to get a more consistent value.
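As a rough illustration of the idea (not the actual Talos code), here is a minimal Python sketch of timing a shutdown and averaging over repeated open/close cycles. The function names are hypothetical, an idle Python child process stands in for the browser, and `terminate()` stands in for however the harness really asks the browser to quit.

```python
import subprocess
import sys
import time
from statistics import mean


def time_shutdown(proc, timeout=60.0):
    """Ask a running process to exit and return how long it took, in ms."""
    start = time.monotonic()
    proc.terminate()            # stand-in for asking the browser to quit
    proc.wait(timeout=timeout)  # raises subprocess.TimeoutExpired on a hang
    return (time.monotonic() - start) * 1000.0


def launch_idle_child():
    """Hypothetical stand-in for launching the browser."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )


def ts_style_shutdown(cycles=20):
    """Open and close the process `cycles` times; return (mean, samples)."""
    samples = [time_shutdown(launch_idle_child()) for _ in range(cycles)]
    return mean(samples), samples
```

A Ts-style run reports the mean of all the samples, while a Tp3-style run would have only a single `time_shutdown` value to report, which is why its graph looks noisier.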

I’m very pleased to get the whole mess put to bed. There were non-threadsafe python libraries (subprocess, I’m looking at you) to deal with, twisted banana errors (I kid you not), and a whole mess of timing issues. You can’t build what you don’t monitor and shutdown is an important part of our users’ experience - hopefully we’ll be able to start to trim down our shutdown time now that we are reporting results from all our active Talos boxes.

April 23, 2009

Sheriffs Take Notice: We Can Retest Builds With Talos Sendchange

Filed under: mozilla, talos — alice @ 6:01 pm

I’m a little late in posting anything about this, but Bug 468731 - talos testing of builds using sendchange is a big deal. I had initially thought that we wouldn’t be able to make use of Buildbot’s sendchange system due to the weird hacking that Talos does to the Buildbot change object to move around all the pieces of information required by Talos. Thankfully, catlee wasn’t nearly so pessimistic and found a way to make it work.

Mostly it sounds like a bunch of Buildbot nonsense that shouldn’t really interest anyone outside of the Release Engineering team, but it has huge benefits to sheriffs and developers.  With Talos buildbot now supporting sendchange we can push builds through the Talos testing infrastructure provided only a correctly formatted link to the build in question.  I’ve already used this to force a re-test of a build that had failed; it immediately failed again on different Talos test boxes and proved that the issue was in the build and not in Talos.

If a build is on stage and downloadable we can make Talos test it as many times as we want.  We still don’t have the means to push any build we like through, but retesting pretty much any build on staging can be very useful when trying to narrow down a regression range or figure out if a regression is ‘real’.  If you are in a situation where you would like to retest a build contact the Release Engineering team member on buildduty (identified as nick-buildduty on irc) and they’ll get it going for you.
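For the curious, a request like this boils down to a `buildbot sendchange` call that carries the build link as the changed “file”. The Python sketch below only assembles such a command line. The master address, branch name, username and build URL are all made-up placeholders, and the flags are the ones I recall from Buildbot’s documentation of this era, so check `buildbot sendchange --help` on your own install before relying on them.

```python
def sendchange_command(master, branch, build_url, username="sendchange"):
    """Build an argv list for `buildbot sendchange` (assumed flag names)."""
    return [
        "buildbot", "sendchange",
        "--master", master,      # host:port of the Talos buildbot master
        "--branch", branch,      # branch the build came from
        "--username", username,  # who is asking for the retest
        build_url,               # the build link is passed as the changed file
    ]


# Hypothetical values, for illustration only.
command = sendchange_command(
    master="talos-master.example.com:9010",
    branch="mozilla-central",
    build_url="http://stage.example.com/builds/firefox-build.zip",
)
print(" ".join(command))
```

Nothing is executed here; the list could be handed to `subprocess.run` by whoever is on buildduty.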

Thanks, catlee!

April 22, 2009

Towards Switching To New Graph Server Full Time

Filed under: graphs, mozilla, talos — alice @ 4:53 pm

Bug 487329 - Graph server migration tracking is almost fixed and complete. What does this mean for people who use the graph server?

  1. graphs-new.mozilla.org will become graphs.mozilla.org
  2. The current graphs.mozilla.org will become graphs-old.mozilla.org
  3. No new data will be sent to graphs-old.mozilla.org
  4. graphs-old.mozilla.org will remain up so that older data can be viewed/searched

We’ve been seeding the new graph server with data for a month now and I think that the majority of people are already using it as their main means of viewing performance data; for the most part the switch over should be painless.

What may be slightly more controversial is that there is no current plan to migrate any data from the old graph server to the new.  There are some very good reasons for this:

  • Data in the old graph server was generated using throttled test slaves. We no longer throttle slaves, so the numbers would not be comparable
  • We are close to rolling out a new Tp test page set. We did not test with this new page set before, so the numbers in the old graph server would not be comparable
  • Most of the numbers on the old graph server were collected before we rolled out reboot-every-test-slave-post-every-test - the numbers have greater variance and there are large swaths of data that aren’t trustworthy or useful in any way (basically, long periods of time when the box in question was in serious need of a reboot)

Instead of banging our heads against shifting data from the old, poorly thought out schema to the new, super fast schema we’re going to consider designing a system whereby we can pick up old builds of interest and push them through the current testing harness. That way we save ourselves headaches and get data that is actually comparable to current results.

Once all the dependent bugs filed against graph server migration have been fixed we’ll roll all this out.  I’m hoping that that will be in the next week or two.  Right now I’m more interested in whether anyone has any strong feelings about what builds are ‘interesting’ enough that we should come up with the means to re-test them with our current test harness.  Any favorites out there?  Top ten?

February 25, 2009

Double Sending Talos Results To New & Improved Graph Server

Filed under: graphs, mozilla, talos — alice @ 3:16 pm

We are inching closer and closer to completing the effort to rewrite the back end of the graph server.  This has included a whole new schema and rewriting all the scripts that interact with that schema.  Due to the major differences between the old and new schemas, data migration isn’t going to be easy.  While we do plan on moving some data over (say around interesting branch points, or for retired branches that we are no longer actively testing), we figured that we would go for a plan where we would ease into use of the new database.  This means that for the next few weeks, possibly a month, Talos boxes are going to be sending data first to graphs.mozilla.org and then immediately sending the same results to graphs-new.mozilla.org.  This should in no way affect the numbers collected by Talos, or impact the cycle time of Talos machines.  This gives us a few benefits:

  1. Stress test the new graph server.  We’ve had a staging version of the graph server with the new schema up and running for a while, but it only has 5 or 6 staging Talos boxes reporting to it.  We need to see what happens when 90 boxes try to report results all at once.
  2. A good chance to pre-populate the new graph server db.  When we had discussions about just migrating all data in the old db and forcing a switch over to the new graph server all in one shot we ended up talking on the order of 24-36 hours (or more) to convert and transfer all the data.  Just doing double send for a while is going to be far less intrusive and will let us work out the remaining data migration issues under a more reasonable time frame.
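To make the double send concrete, here is a minimal Python sketch of the pattern, not the real Talos reporting code. The `post` callable is a hypothetical stand-in for whatever actually uploads results, and the choice to record a failure rather than abort is my assumption about how a flaky new server could be kept from affecting the old one.

```python
PRIMARY = "graphs.mozilla.org"
SECONDARY = "graphs-new.mozilla.org"


def double_send(results, post, servers=(PRIMARY, SECONDARY)):
    """Send the same results to each server in order.

    `post(server, results)` is a hypothetical upload function. A failure on
    one server is recorded but never stops the send to the others.
    """
    outcome = {}
    for server in servers:
        try:
            post(server, results)
            outcome[server] = True
        except Exception:
            outcome[server] = False
    return outcome


# Usage with a fake uploader that simulates the new server falling over.
received = []

def fake_post(server, results):
    if server == SECONDARY:
        raise IOError("new graph server overloaded")
    received.append((server, results))

status = double_send({"ts": 1234.5}, fake_post)
print(status)
```

The primary server still receives its data in the usage example, which is the property that matters while the new server is being stress tested.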

The main change that developers who monitor the various waterfalls will notice is that Talos columns will have the standard links to graphs followed by a second set of graph links which will be to the new graph server.  Expect this change to take place by the end of this week.  If this causes any undue confusion, or you find bugs in graphs-new.mozilla.org, feel free to drop by #perfomatic to provide feedback.

January 27, 2009

Firefox3.0 Unthrottled, Victory In Our Time?

Filed under: mozilla, talos — alice @ 6:43 pm

During a long downtime last Friday morning I unthrottled all the WinXP, Vista and Ubuntu boxes testing Firefox3.0.  I had forecast a 25-30 minute time savings in each full Talos test cycle but, alas, that was not to be.  After crunching the numbers, it looks like we save 10 minutes per full test cycle for each operating system.  Now, that was initially pretty disappointing.  After further consideration I think that 10 minutes would be a very welcome time savings to anyone babysitting the tree waiting on performance results; in fact, any decrease in Talos machine cycle time is pretty much a good thing.

Now that the machines have had some time to cycle and get us a base of results I went over the appropriate Firefox3.0 graph links to see how the numbers look. From examinations of Tp and Ts:

  • Ubuntu results slightly increased in variance post un-throttling
  • WinXP results slightly decreased in variance post un-throttling
  • Vista variance was unchanged
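A comparison like this can be sketched in a few lines of Python. This is only an illustration with made-up numbers, not the analysis actually run on the graph data. It uses the coefficient of variation (standard deviation divided by mean) because unthrottling lowers the absolute numbers, so raw standard deviations from before and after are not directly comparable; the 5% tolerance is an arbitrary assumption.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Relative spread of a set of results: stdev / mean."""
    return stdev(samples) / mean(samples)


def compare_variance(before, after, tolerance=0.05):
    """Classify the change in relative variance between two result sets."""
    cv_before = coefficient_of_variation(before)
    cv_after = coefficient_of_variation(after)
    change = (cv_after - cv_before) / cv_before
    if change > tolerance:
        return "increased"
    if change < -tolerance:
        return "decreased"
    return "unchanged"


# Hypothetical Ts results in ms, throttled and then unthrottled.
throttled = [100, 102, 98, 101, 99]
unthrottled = [50, 55, 45, 52, 48]
print(compare_variance(throttled, unthrottled))
```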

With the time savings, and knowing that the Linux numbers are slightly degraded, I believe that unthrottling should be rolled out across all Talos boxes testing all branches.  While it isn’t going to solve all our problems it’s a step in the direction of a better testing harness. Feel free to comment here or in the bug. I’d like to get as much feedback as I can, as this will change the reported results of a large number of our test boxes.

January 15, 2009

Standalone Talos V1.4

Filed under: mozilla, talos — alice @ 5:44 pm

Includes a few noteworthy fixes:

  1. Bug 459598 - “medians” for individual pages in Tp don’t seem to be medians
  2. Bug 398463 - talos runs tests in a different order than listed in the config file
  3. Bug 419367 - Standalone Talos should output results as it runs

Directions for installing and running Standalone Talos have been updated to point to the latest release.

January 9, 2009

Investigating Unthrottling Talos Boxes

Filed under: mozilla, talos — alice @ 3:12 pm

Way back in the day when the Talos project was just getting off the ground we had concerns about the granularity of JavaScript timers and their ability to accurately measure performance results. We quickly decided to go with a CPU throttling system, whereby we could slow down the testing and get repeatable test results (Bug 393940 - throttle mac mini XP speed down to something slower). Since that day we’ve had throttling on all Talos WinXP, Vista and Ubuntu boxes, and would have had on Tiger and Leopard if we’d managed to figure out a way to do it.

John Resig recently put together a very thorough blog post about the accuracy of JavaScript time. From his graphs you can see that Firefox 3 does a fine job of reporting timing accurate to the millisecond, and it appears to have been doing so since Bug 363258 - bad millisecond resolution for (new Date).getTime() / Date.now() on Windows was fixed back in the summer of 2007. Now, I could be wrong here, but it was my understanding that the timing was still bad on Firefox 2, which Talos continued to test till just last month when the boxes were retired for that branch.
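The usual way to check timer granularity is to spin on the clock until its value changes and note the size of the jump. The sketch below does this in Python rather than JavaScript, so it measures Python’s `time.time()` and not the browser’s `Date.now()`; it only illustrates the technique, and the function name is hypothetical.

```python
import time


def observed_tick(clock=time.time, samples=50):
    """Return the smallest non-zero step seen between clock readings, in ms."""
    smallest = float("inf")
    for _ in range(samples):
        start = clock()
        now = clock()
        while now == start:  # spin until the clock visibly advances
            now = clock()
        smallest = min(smallest, now - start)
    return smallest * 1000.0


print("smallest observed tick: %.6f ms" % observed_tick())
```

A clock that only updates every 15 or 16 ms would report a tick of about that size, which makes any test measured in single-digit milliseconds meaningless.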

In either case, we now have good timing on Firefox 3 and we no longer have to worry about the Firefox 2 question so it’s time to examine if we want to keep throttling Talos boxes. There are several obvious pluses to not running throttled: faster test results, cross-comparison between performance results across all platforms (at the moment, we cannot compare Tiger/Leopard results to anything else as they are unthrottled), closer match between test environment and end users’ machines and a more simplified set up for Talos boxes themselves. We will lose out on the ability to compare current results to any historic data that we have already collected - turning off throttling doesn’t simply halve the results so there isn’t any easy way to tell what a new unthrottled result “means” in comparison to an old throttled result.

I’ve started to take steps to get us to an unthrottled state (Bug 468680 - unthrottle talos winxp, vista and ubuntu boxes.). For now that just means turning off throttling on all of our Talos staging machines and letting it run for a while. In a week I’ll look over the numbers and see if we are getting consistent, low-variance results out of the newly unthrottled machines. If all that goes smoothly a downtime will be scheduled and the scripts on all the Talos machines that control throttling will be removed and a drop will be seen for all performance results on WinXP/Vista/Ubuntu.

Just for the record, this does break my heart just a little bit. Controlling throttling on Talos boxes has been one of the biggest headaches of the system. Turns out that modern operating systems don’t really want you to do anything to touch the CPU, and are quite happy to hide it very deep and ignore requests when they see fit. That throttling is on consistently across three operating systems and is correctly restarted on reboot is due to weeks of effort. Now that you have all shed a single tear for my loss I’ll get back to removing all my throttling scripts. Sunrise, sunset and all that.

January 6, 2009

Talos on Mobile Performance Results Now Available

Filed under: mozilla, talos — alice @ 6:03 pm

Collection of Talos mobile performance graph links now available. This is pretty much all due to the efforts of Aki. He’s taken Talos on mobile from a mere twinkle in Fennec’s eye to an actual system with multiple machines reporting consistent results with low variance. Aki is still working through the last hurdles to get Tp cycling, but the rest of the tests are now up and running.

Happy regression hunting!

January 5, 2009

Auto-Rebooting Talos Boxen Saves The World

Filed under: mozilla, talos — alice @ 6:01 pm

Bug 463020 - Talos machines should be automatically rebooted periodically has finally been fully implemented and resolved fixed.  Here’s what we know about Talos machines:

  1. The longer the uptime the less trustworthy the numbers
  2. Machines can stay up and report untrustworthy/garbage numbers for a long time before anyone notices - as long as the machine cycles green it is usually ignored
  3. Monitoring every active Talos box to see what machines need a reboot is complicated and difficult

Since the majority of the Talos work that comes across my desk ends up as reboot requests to solve machine number drift,  the obvious solution is to start a system of auto-rebooting.  It turned out that doing so wasn’t even that complicated due to earlier work I’ve done on the system to ensure that Talos machines come back from reboot ready to test and auto-reconnect to their buildbot master.  Thanks to catlee for the patch required to get that last piece in place that fires at the end of every test cycle on every talos machine and starts a reboot.
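In outline, the end-of-cycle step just picks the right reboot command for the platform and runs it. The Python sketch below is a guess at the shape of such a step, not catlee’s actual patch; the function names are hypothetical, and it defaults to a dry run so that nothing reboots unless a runner is passed in.

```python
import sys


def reboot_command(platform=sys.platform):
    """Return the argv for an immediate reboot on the given platform."""
    if platform.startswith("win"):
        return ["shutdown", "/r", "/t", "0"]
    # Linux and Mac OS X; assumes the test user may run shutdown via sudo.
    return ["sudo", "shutdown", "-r", "now"]


def end_of_cycle(run=None, platform=sys.platform):
    """Hypothetical final step of a test cycle: trigger a reboot.

    `run` is a callable such as subprocess.call; with the default of None
    the command is only returned, never executed.
    """
    command = reboot_command(platform)
    if run is not None:
        run(command)
    return command


print(end_of_cycle())  # dry run: shows the command without rebooting
```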

Now when looking at performance results you can rest assured that the machine that did the testing was in a clean state, newly rebooted and without any extraneous processes kicking around from previous busted browsers/crash reporters/spawned dialogs/whatever.  No longer do we have to live in fear of Bug 419620 - qm-mini-vista04 (branch) numbers gradually rising and other such bugs resulting from long uptimes on Talos test boxes (say, Leopard Talos boxes showing odd spikes in results every 5-20 test cycles, or Tiger Talos boxes having runaway Terminal processes after a few dozen test cycles, etc).

The nifty side effect is that the variance between results is also decreased - I see much smoother lines on every platform.  The only downside here is that we do lose machines that fail to reboot.  At the moment, I’m seeing about 5-6 machines fall over per week.  This does increase the load on IT as the only solution is a manual kick and thus requires a trip to the colo.  They’ve been real good sports about the increase in bug traffic brought about by this change.

Thanks again to catlee and IT for their help.
