Bug 487329 - Graph server migration tracking is almost fixed. What does this mean for people who use the graph server?
- graphs-new.mozilla.org will become graphs.mozilla.org
- The current graphs.mozilla.org will become graphs-old.mozilla.org
- No new data will be sent to graphs-old.mozilla.org
- graphs-old.mozilla.org will remain up so that older data can be viewed/searched
We’ve been seeding the new graph server with data for a month now, and I think the majority of people are already using it as their main means of viewing performance data, so for the most part the switchover should be painless.
What may be slightly more controversial is that there is no current plan to migrate any data from the old graph server to the new. There are some very good reasons for this:
- Data in the old graph server was generated using throttled test slaves; we no longer throttle slaves, so the numbers would not be comparable
- We are close to rolling out a new Tp test page set; the old data was not collected with this page set, so the numbers in the old graph server would not be comparable
- Most of the numbers on the old graph server were collected before we started rebooting every test slave after every test, so they have greater variance and there are large swaths of data that aren’t trustworthy or useful in any way (basically, long periods of time when the box in question was in serious need of a reboot)
Instead of banging our heads against shifting data from the old, poorly thought out schema to the new, super fast schema we’re going to consider designing a system whereby we can pick up old builds of interest and push them through the current testing harness. That way we save ourselves headaches and get data that is actually comparable to current results.
Once all the dependent bugs filed against graph server migration have been fixed, we’ll roll all this out. I’m hoping that will be in the next week or two. Right now I’m more interested in whether anyone has strong feelings about which builds are ‘interesting’ enough that we should come up with the means to re-test them with our current test harness. Any favorites out there? Top ten?
We are inching closer and closer to completing the effort to rewrite the back end of the graph server. This has included a whole new schema and rewriting all the scripts that interact with that schema. Due to the major differences between the old and new schemas, data migration isn’t going to be easy. While we do plan on moving some data over (say, around interesting branch points, or for retired branches that we are no longer actively testing), we decided to ease into use of the new database. This means that for the next few weeks, possibly a month, Talos boxes are going to send data first to graphs.mozilla.org and then immediately send the same results to graphs-new.mozilla.org. This should in no way affect the numbers collected by Talos or the cycle time of Talos machines. This gives us a few benefits:
- Stress test the new graph server. We’ve had a staging version of the graph server with the new schema up and running for a while, but it only has 5 or 6 staging Talos boxes reporting to it. We need to see what happens when 90 boxes try to report results all at once.
- A good chance to pre-populate the new graph server db. When we discussed migrating all the data in the old db and forcing a switchover to the new graph server in one shot, we were looking at something on the order of 24-36 hours (or more) to convert and transfer all the data. Double sending for a while is going to be far less intrusive and will let us work out the remaining data migration issues on a more reasonable time frame.
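To make the double-send arrangement concrete, here is a minimal Python sketch. It assumes a hypothetical `send(server, results)` function standing in for however Talos actually uploads its results (that code isn't part of this post), and it illustrates only the ordering described above: old server first, new server second. Recording a failure on one server without blocking delivery to the other is an assumed design choice for the sketch, not something stated above.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for the real Talos upload step: posts one batch of
# results to one graph server and raises an exception on failure.
Sender = Callable[[str, dict], None]

# Order matters: the existing server first, then the new one.
GRAPH_SERVERS = ["graphs.mozilla.org", "graphs-new.mozilla.org"]


def double_send(results: dict, send: Sender,
                servers: Sequence[str] = GRAPH_SERVERS) -> Dict[str, str]:
    """Send the same results to each server in order.

    A failure on one server is recorded but does not stop delivery to the
    remaining servers. Returns a mapping of server -> "ok" or the error.
    """
    status: Dict[str, str] = {}
    for server in servers:
        try:
            send(server, results)
            status[server] = "ok"
        except Exception as exc:  # keep going; one bad server must not block the rest
            status[server] = f"error: {exc}"
    return status


if __name__ == "__main__":
    delivered: List[Tuple[str, dict]] = []

    def fake_send(server: str, payload: dict) -> None:
        # Simulate the new server timing out while the old one accepts data.
        if server == "graphs-new.mozilla.org":
            raise ConnectionError("timed out")
        delivered.append((server, payload))

    print(double_send({"test": "ts", "value": 1234.0}, fake_send))
    # {'graphs.mozilla.org': 'ok', 'graphs-new.mozilla.org': 'error: timed out'}
```

Because the same payload goes to both servers, the new database fills up with exactly the numbers the old one is getting, which is what makes the eventual switchover a matter of repointing links rather than converting data.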
The main change that developers who monitor the various waterfalls will notice is that Talos columns will have the standard links to graphs followed by a second set of graph links pointing to the new graph server. Expect this change to take place by the end of this week. If this causes any undue confusion, or you find bugs in graphs-new.mozilla.org, feel free to drop by #perfomatic to provide feedback.
Our beloved Perfomatic is undergoing a schema change. The initial design was put together when the Talos project was just getting off the ground. There’s a very large difference between 3 Talos boxes reporting and 90, and it shows in our 60 GB database. It has become so bloated, with anything interesting requiring painful joins between giant tables, that we mostly just leave it alone to run. We’d like to branch out the graph server work to include dashboards, better statistical analysis and administrative features (removing corrupted data, etc.), but everything ends up being hampered by the database.
With this in mind, we designed a new schema. It’s broken up into more tables and will greatly reduce the redundancy found in the old schema. It should also make it dead simple to answer questions like “what are the last 10 data points for test X on branch Y?”
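As an illustration of why a more normalized layout makes that kind of question easy, here is a toy schema in SQLite using Python’s standard `sqlite3` module. The table and column names (`branches`, `tests`, `test_runs`, `date_run`, `average`) are hypothetical and are not the actual Perfomatic schema; the point is only that “last 10 points for test X on branch Y” becomes two small joins plus an `ORDER BY ... LIMIT` that an index can serve.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical, simplified schema: names live in small lookup tables, and
# each result row refers to them by id instead of repeating the strings.
SCHEMA = """
CREATE TABLE branches (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE tests    (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE test_runs (
    id        INTEGER PRIMARY KEY,
    test_id   INTEGER NOT NULL REFERENCES tests(id),
    branch_id INTEGER NOT NULL REFERENCES branches(id),
    date_run  INTEGER NOT NULL,  -- Unix timestamp of the run
    average   REAL    NOT NULL   -- summary value reported for the run
);
-- Lets the lookup below filter by test/branch and walk date_run in order.
CREATE INDEX test_runs_lookup ON test_runs (test_id, branch_id, date_run);
"""


def last_n_points(conn: sqlite3.Connection, test: str, branch: str,
                  n: int = 10) -> List[Tuple[int, float]]:
    """Return the n most recent (date_run, average) pairs, newest first."""
    return conn.execute(
        """
        SELECT r.date_run, r.average
        FROM test_runs AS r
        JOIN tests    AS t ON t.id = r.test_id
        JOIN branches AS b ON b.id = r.branch_id
        WHERE t.name = ? AND b.name = ?
        ORDER BY r.date_run DESC
        LIMIT ?
        """,
        (test, branch, n),
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO branches (name) VALUES ('mozilla-central')")
    conn.execute("INSERT INTO tests (name) VALUES ('Ts')")
    for i in range(15):
        conn.execute(
            "INSERT INTO test_runs (test_id, branch_id, date_run, average) "
            "VALUES (1, 1, ?, ?)",
            (1000 + i, 500.0 + i),
        )
    print(last_n_points(conn, "Ts", "mozilla-central", 3))
```

In a single wide table, the same question means scanning rows that repeat the test and branch names on every data point, which is the kind of redundancy the new layout is meant to remove.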
We are starting to put all the pieces together to make use of the new schema but there are some drawbacks:
- The format of graph links is changing; links that work on the old graph server will not work on the new one. What does this mean for existing links in bugs?
- How much data, if any, can we migrate from the old graph server to the new? The format within the database has changed significantly, and getting the data into the new format will require a large amount of massaging. Is this effort worth it?
- If we are to migrate data, how long can we be without it while it gets pulled out of the old db, altered and re-assembled and then pushed into the new?
Bug 472176 - Migration procedure has been filed to work through issues with switching from the old database to the new. What I really need is insight from people who work with the graph server on a daily basis. What data is most important to migrate? If we were without data for a few days or a week while migration happened in the background (you would still have the currently reported numbers, just no historic data), would that be okay? Would it be acceptable to migrate no data and just run the two setups side by side until there was enough data in the new one, at which point the old setup would be kept alive only for looking at old numbers and would no longer accept new ones?
I’d love to get feedback on these questions during the weekly graph server meeting (Mondays, 11am PST); we’ll be discussing migration for the next few meetings as we get closer to being able to make the switch from old schema to new. If you can’t make the meeting time just join #perfomatic and talk with the graph server team directly, or comment in the migration bug.
This really shouldn’t be super secret, but it has come to my attention that some people don’t know about the compendium of Talos graph links.
It’s really hard to construct graphs on the graph server that are a) meaningful and b) correct - you have to know which Talos machines are testing which code and whether they are configured in such a way as to generate useful comparisons. While we are currently doing a lot of graph server infrastructure work, it is going to be a while before we can fix this overarching difficulty. I’ve been maintaining this list of graph links for some time and keeping them organized by branch, platform and test.
So, next time you are pushed to the brink of madness trying to figure out if there is a regression in Ts for mozilla-central on WinXP, please use the graph links and save your sanity.