happiest unalice ever

May 7, 2010

Universal Manifest Format For Unit Tests (and maybe Talos)

Filed under: mozilla, unit tests — alice @ 2:29 pm

I’ve just joined the Auto-Tools team, and they thought that they would start me on something fun!  Here’s the problem that I was presented with:

  • each type of unit test uses its own manifest file format
  • there is a different manifest file reader for each type of manifest
  • we hack each manifest separately when we want to expand functionality (like skipping a test per-branch, or adding a pref to a test)

Here are some examples:

So, some tests use Makefiles and some use lists and we don’t have a standardized means of adding extra information to any given test.  Some test lists are generated on the fly (say as built up in the web browser for mochitest), some are checked in and static.  Changing which tests mochitest and xpcshell export currently involves Makefile hacking, something that is pretty unpleasant.

Auto-tools proposes that we settle on a universal manifest format for all tests.  Most likely this would be JSON (we toyed with YAML, but that would add an extra dependency, and bug 517950 has requested that we remove YAML from Talos, so it doesn’t seem to have many fans).

We’ve collected proposed formats on the Universal Manifest Project Wiki.  One thing that we don’t want to do here is design in a bubble.  While there are benefits to the auto-tools team in terms of code re-use, centralizing bug fixes and such, the biggest consumers of these tests are developers.  We want to provide all these benefits without sacrificing developer productivity.  Our goal is to keep our test harnesses as simple and as easy-to-use as possible while making them extensible and flexible for whatever the future holds.  Feedback is both requested and highly appreciated.

May 7, 2009

Standalone Talos V1.5

Filed under: mozilla, talos — alice @ 11:30 am

In this version:

See Standalone Talos documentation for download and installation instructions.

April 30, 2009

Tshutdown Goes Live

Filed under: mozilla, talos — alice @ 4:42 pm

If you’ve been following along with Bug 480413 - design test to monitor browser shut down time you’ll know that I attempted to roll out Tshutdown on the 18th. Unfortunately, there were a couple of bugs that weren’t discovered in staging and I spent 6 or 7 hours on a beautiful Saturday afternoon attempting to fix them on the fly. Not able to make it work, I had to back it all out and fume.

With a clearer head I approached the bugs on Monday morning and resolved both of them. Then it was just a question of waiting for an appropriate downtime. Though I’m as excited about 3.5b4 as anyone, it really gets in the way of arranging Talos code changes. Finally, yesterday afternoon we were able to shut down everything and check things in.

I’m glad to say that this landing was successful. There’s a good chance that Tshutdown will clean up the issues in Bug 478603 - intermittent orange on Windows mozilla-central talos Ts and Tdhtml tests (“failed to initialize browser”), as we’ll start to record the longer shutdown times instead of freezing up and reporting orange.

It’ll be a few more days till we have enough data for Tshutdown to look like anything but a scatter graph, but you can check out the reported results at graphs-new.mozilla.org. There’s actually more to Tshutdown than a single test. You’ll see “Tp3 Shutdown”, “Tp3 Nochrome Shutdown”, “Tp3 Fast Shutdown” and “Ts Shutdown”. Basically, we record the shutdown time after running the given test. Post Ts results in a ‘clean’ shutdown time as we run Ts with an empty, new profile; post Tp3 results in a ‘dirty’ shutdown time as the browser has just completed cycling 10 times through 400 web pages. The post Tp3 results will also show greater variance because we only run Tp3 once per full test cycle (the test does take a good hour or more to complete depending on the platform), so we only have that single value to report. With Ts we rapidly open and close the browser 20 times, so we have a data set that we can average to get a more consistent value.
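Here is a quick sketch of why an averaged number is steadier than a single reported value.  The shutdown times are synthetic, drawn from a made-up distribution (mean 1000 ms, standard deviation 100 ms); they are not Talos results, and the helper name is hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def fake_shutdown_ms():
    """Synthetic shutdown time in ms; the distribution is an assumption."""
    return random.gauss(1000, 100)


# What each full test cycle would report, over 500 simulated cycles:
# either one raw value, or the mean of 20 runs.
single_value_reports = [fake_shutdown_ms() for _ in range(500)]
averaged_reports = [
    statistics.mean([fake_shutdown_ms() for _ in range(20)])
    for _ in range(500)
]

spread_single = statistics.stdev(single_value_reports)
spread_averaged = statistics.stdev(averaged_reports)
print("spread of single-value reports: %.1f ms" % spread_single)
print("spread of 20-run averages:      %.1f ms" % spread_averaged)
```

If the runs are independent, averaging 20 of them should shrink the spread by roughly the square root of 20, a bit more than four times.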

I’m very pleased to get the whole mess put to bed.  There were non-threadsafe python libraries (subprocess, I’m looking at you) to deal with, twisted banana errors (I kid you not), and a whole mess of timing issues.  You can’t build what you don’t monitor and shutdown is an important part of our users’ experience - hopefully we’ll be able to start to trim down our shutdown time now that we are reporting results from all our active Talos boxes.

April 23, 2009

Sheriffs Take Notice: We Can Retest Builds With Talos Sendchange

Filed under: mozilla, talos — alice @ 6:01 pm

I’m a little late in posting anything about this, but Bug 468731 - talos testing of builds using sendchange is a big deal. I had initially thought that we wouldn’t be able to make use of Buildbot’s sendchange system due to the weird hacking that Talos does to the Buildbot change object to move around all the pieces of information required by Talos.  Thankfully, catlee wasn’t nearly so pessimistic and found a way to make it work.

Mostly it sounds like a bunch of Buildbot nonsense that shouldn’t really interest anyone outside of the Release Engineering team, but it has huge benefits to sheriffs and developers.  With Talos buildbot now supporting sendchange we can push builds through the Talos testing infrastructure provided only a correctly formatted link to the build in question.  I’ve already used this to force a re-test of a build that had failed; it immediately failed again on different Talos test boxes and proved that the issue was in the build and not in Talos.

If a build is on stage and downloadable we can make Talos test it as many times as we want.  We still don’t have the means to push any build we like through, but retesting pretty much any build on staging can be very useful when trying to narrow down a regression range or figure out if a regression is ‘real’.  If you are in a situation where you would like to retest a build, contact the Release Engineering team member on buildduty (identified as nick-buildduty on irc) and they’ll get it going for you.

Thanks, catlee!

April 22, 2009

Towards Switching To New Graph Server Full Time

Filed under: graphs, mozilla, talos — alice @ 4:53 pm

Bug 487329 - Graph server migration tracking is almost fixed and complete. What does this mean for people who use the graph server?

  1. graphs-new.mozilla.org will become graphs.mozilla.org
  2. The current graphs.mozilla.org will become graphs-old.mozilla.org
  3. No new data will be sent to graphs-old.mozilla.org
  4. graphs-old.mozilla.org will remain up so that older data can be viewed/searched

We’ve been seeding the new graph server with data for a month now and I think that the majority of people are already using it as their main means of viewing performance data; for the most part the switch over should be painless.

What may be slightly more controversial is that there is no current plan to migrate any data from the old graph server to the new.  There are some very good reasons for this:

  • Data in the old graph server was generated by using throttled test slaves; we no longer throttle slaves, so the numbers would not be comparable
  • We are close to rolling out a new Tp test page set; we did not test with this new page set before, so the numbers in the old graph server would not be comparable
  • Most of the numbers on the old graph server were collected before we rolled out reboot-every-test-slave-post-every-test - the numbers have greater variance and there are large swaths of data that aren’t trustworthy or useful in any way (basically, long periods of time when the box in question was in serious need of a reboot)

Instead of banging our heads against shifting data from the old, poorly thought out schema to the new, super fast schema we’re going to consider designing a system whereby we can pick up old builds of interest and push them through the current testing harness. That way we save ourselves headaches and get data that is actually comparable to current results.

Once all the dependent bugs filed against graph server migration have been fixed we’ll roll all this out.  I’m hoping that that will be in the next week or two.  Right now I’m more interested in whether anyone has any strong feelings about what builds are ‘interesting’ enough that we should come up with the means to re-test them with our current test harness.  Any favorites out there?  Top ten?

Belorussian translation, added February 2011 by Martha Ruszkowski.

February 25, 2009

Double Sending Talos Results To New & Improved Graph Server

Filed under: graphs, mozilla, talos — alice @ 3:16 pm

We are inching closer and closer to completing the effort to rewrite the back end of the graph server.  This has included a whole new schema and rewriting all the scripts that interact with that schema.  Due to the major differences between the old and new schemas, data migration isn’t going to be easy.  While we do plan on moving some data over (say around interesting branch points, or for retired branches that we are no longer actively testing), we figured that we would go for a plan where we would ease into use of the new database.  This means that for the next few weeks, possibly a month, Talos boxes are going to be sending data first to graphs.mozilla.org and then immediately sending the same results to graphs-new.mozilla.org.  This should in no way affect the numbers collected by Talos, or impact the cycle time of Talos machines.  This gives us a few benefits:

  1. Stress test the new graph server.  We’ve had a staging version of the graph server with the new schema up and running for a while, but it only has 5 or 6 staging Talos boxes reporting to it.  We need to see what happens when 90 boxes try to report results all at once.
  2. A good chance to pre-populate the new graph server db.  When we had discussions about just migrating all data in the old db and forcing a switch over to the new graph server all in one shot we ended up talking on the order of 24-36 hours (or more) to convert and transfer all the data.  Just doing double send for a while is going to be far less intrusive and will let us work out the remaining data migration issues under a more reasonable time frame.
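The double send boils down to something like the sketch below.  The double_send helper, the result dictionary and the two stand-in senders are hypothetical, not Talos code, and catching a failure on one server so that the other still gets its data is an assumption about what would be desirable rather than a description of how Talos behaves.

```python
def double_send(results, senders):
    """Send the same results to each server, in the order given.

    `senders` is a list of (name, callable) pairs.  A failure on one
    server is recorded but does not stop the send to the next one.
    """
    outcome = {}
    for name, send in senders:
        try:
            send(results)
            outcome[name] = "ok"
        except Exception as err:
            outcome[name] = "failed: %s" % err
    return outcome


# Stand-ins for the real upload code, which is not shown in this post.
received_old = []


def send_to_old(results):
    received_old.append(results)


def send_to_new(results):
    # Simulate the new server falling over under load.
    raise IOError("new server is overloaded")


outcome = double_send(
    {"test": "ts", "values": [512, 498, 505]},
    [("graphs.mozilla.org", send_to_old),
     ("graphs-new.mozilla.org", send_to_new)],
)
print(outcome)
```

Sending to the old server first means that a problem on the new one, which is exactly what the stress test is meant to flush out, should not cost any data on the server people still rely on.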

The main change that developers who monitor the various waterfalls will notice is that Talos columns will have the standard links to graphs followed by a second set of graph links which will be to the new graph server.  Expect this change to take place by the end of this week.  If this causes any undue confusion, or you find bugs in graphs-new.mozilla.org, feel free to drop by #perfomatic to provide feedback.

January 27, 2009

Firefox3.0 Unthrottled, Victory In Our Time?

Filed under: mozilla, talos — alice @ 6:43 pm

During a long downtime last Friday morning I unthrottled all the WinXP, Vista and Ubuntu boxes testing Firefox3.0.  I had forecast a 25-30 minute time savings in each full Talos test cycle but, alas, that was not to be.  After crunching the numbers, it looks like we save 10 minutes per full test cycle for each operating system.  Now, that was initially pretty disappointing.  After further consideration I think that 10 minutes would be a very welcome time savings to anyone babysitting the tree waiting on performance results; in fact, any decrease in Talos machine cycle time is pretty much a good thing.

Now that the machines have had some time to cycle and get us a base of results I went over the appropriate Firefox3.0 graph links to see how the numbers look. From examinations of Tp and Ts:

  • Ubuntu results slightly increased in variance post un-throttling
  • WinXP results slightly decreased in variance post un-throttling
  • Vista variance was unchanged

With the time savings, and knowing that the Linux numbers are slightly degraded, I believe that unthrottling should be rolled out across all Talos boxes testing all branches.  While it isn’t going to solve all our problems it’s a step in the direction of a better testing harness. Feel free to comment here or in the bug. I’d like to get as much feedback as I can, as this will change the reported results of a large number of our test boxes.

January 21, 2009

Talos Firefox3.0 downtime Friday January 23rd, 9am-12pm PST

Filed under: mozilla — alice @ 2:58 pm

In this downtime:

After throttling has been removed from the Talos boxes, a drop in reported performance numbers will be observed across WinXP/Vista/Ubuntu - Leopard/Tiger results will be unaffected.

Please let me know if there is any reason not to proceed with this downtime as planned.

January 15, 2009

Standalone Talos V1.4

Filed under: mozilla, talos — alice @ 5:44 pm

Includes a few noteworthy fixes:

  1. Bug 459598 - “medians” for individual pages in Tp don’t seem to be medians
  2. Bug 398463 - talos runs tests in a different order than listed in the config file
  3. Bug 419367 - Standalone Talos should output results as it runs

Directions for installing and running Standalone Talos have been updated to point to the latest release.

January 12, 2009

Graph Server Data Migration Planning

Filed under: graphs, mozilla — alice @ 3:44 pm

Our beloved Perfomatic is undergoing a schema change. The initial design was put together when the Talos project was just getting off the ground. There’s a very large difference between 3 Talos boxes reporting and 90, and it shows in our 60 GB database. It has become so bloated, with anything interesting involving painful joins between giant tables, that we mostly just leave it alone to run. We’d like to be able to branch out the graph server work to include dashboards and better statistical analysis and administrative features (removing corrupted data, etc) but everything ends up being hampered by the database.

With this in mind the new schema was designed. It’s broken up into more tables and will greatly reduce the redundancy found in the old schema. It should also make it dead simple to do things like “what are the last 10 data points for test X on branch Y”.
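As a rough illustration of that kind of query, here is a sketch against an in-memory SQLite database.  The table and column names (tests, branches, test_runs, run_time, value) and the last_points helper are invented for the example; they are not the actual new schema, and the data points are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: small lookup tables plus one table of results.
conn.executescript("""
    CREATE TABLE tests (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE branches (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE test_runs (
        id INTEGER PRIMARY KEY,
        test_id INTEGER REFERENCES tests(id),
        branch_id INTEGER REFERENCES branches(id),
        run_time INTEGER,   -- seconds since the epoch
        value REAL
    );
""")
conn.execute("INSERT INTO tests VALUES (1, 'ts')")
conn.execute("INSERT INTO branches VALUES (1, 'mozilla-central')")

# 25 made-up data points, one per hour.
conn.executemany(
    "INSERT INTO test_runs (test_id, branch_id, run_time, value) "
    "VALUES (1, 1, ?, ?)",
    [(3600 * i, 500.0 + i) for i in range(25)],
)


def last_points(conn, test, branch, limit=10):
    """Return the newest `limit` (run_time, value) rows for a test on a branch."""
    return conn.execute(
        """
        SELECT r.run_time, r.value
        FROM test_runs AS r
        JOIN tests AS t ON t.id = r.test_id
        JOIN branches AS b ON b.id = r.branch_id
        WHERE t.name = ? AND b.name = ?
        ORDER BY r.run_time DESC
        LIMIT ?
        """,
        (test, branch, limit),
    ).fetchall()


points = last_points(conn, "ts", "mozilla-central")
print(points)
```

With test and branch names kept in their own small tables, the joins are against tiny lookup tables rather than between giant ones.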

We are starting to put all the pieces together to make use of the new schema but there are some drawbacks:

  1. The format of links to graphs is changing; graph links that work on the old graph server will not work on the new.  What does this mean for existing links in bugs?
  2. How much, if any, data can we migrate from the old graph server to the new?  The format within the database has changed significantly and will require a large amount of massaging to get it into the new.  Is this effort worth it?
  3. If we are to migrate data, how long can we be without it while it gets pulled out of the old db, altered and re-assembled and then pushed into the new?

Bug 472176 - Migration procedure has been filed to work through issues with switching from the old database to the new.  What I really need is insight from people who work with a graph server on a daily basis.  What is the most important data that is really necessary to migrate?  If we were without data for a few days or a week while migration happened in the background (you would still have the currently reported numbers, just no historic data) would that be okay?  Would it be acceptable to migrate no data and just have the two setups running side by side, until we felt that there was enough data in the new that the old setup would only be kept alive for looking at old numbers but no longer accepting new?

I’d love to get feedback on these questions during the weekly graph server meeting (Mondays, 11am PST); we’ll be discussing migration for the next few meetings as we get closer to being able to make the switch from old schema to new. If you can’t make the meeting time just join #perfomatic and talk with the graph server team directly, or comment in the migration bug.
