Census 2: More Than Just A Pretty Graph
Benchmarks are hard, particularly for complex systems. As a result, the most hotly contested benchmarks tend not to be representative of what makes systems faster for real users. Does another 10% on TPC really matter to most web developers? And should we really pay any attention to how any JS VM does on synthetic language benchmarks?
Maybe.
These things matter only insofar as they represent end-user workloads and produce trustworthy findings. The first is much harder than the second, and end-to-end benchmarking is pretty much the only way to get there. As a result, sites like Tom's Hardware focus on application-level benchmarks while still publishing "low level" numbers. Venerable test suites like SPECint have even moved toward running "full stack" style benchmarks which may emphasize a particular workload but are broad enough to capture the wider system effects that matter in the real world.
Marketing departments also like small, easily digestible whole numbers. Saying something like "200% Faster!" sure sounds a lot better than "on a particular test which is part of a larger suite of tests, our system ran in X time vs. Y time for the competitor". Both may be true, but the second statement gives you some context. Ideally, even that statement would appear above an actual table of numbers or a graph. Numbers without context are lies waiting to be repeated.
With all of this said, James Ward's Census benchmark makes a valiant stab at a full-stack test of data loading and rendering performance for RIA technologies. Last month Jared dug further into the numbers and found the methodology wanting, but given some IP issues couldn't patch the sources himself. Since I wasn't encumbered in the same way, I thought I might as well try my hand at it, but after hours of attempting to get the sources to build, I finally gave up and decided to re-write the tests. The result is Census 2.
This re-write has several goals:
- Fairness. Tests need to be run multiple times to be at all representative, and systems not under direct test need to be factored out as much as possible. C2 does this by reducing the number of dependencies and by running each test at least 5 times, discarding outliers before reporting an average (see the sketch after this list). I've also worked to make sure that the tests put the best foot forward for all of the tested technologies.
- Hackability. Benchmarks like Census serve first as a way for decision makers to understand their options and second as a way for developers to know how they're doing. Making it trivial to add tests helps both audiences.
- Portability. The test suite should run nearly everywhere with a minimum of setup and fuss. This ensures that the largest number of people can benefit from the fairness and hackability of the tests.
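The post doesn't spell out the exact trimming rule, so here's a minimal sketch of the averaging approach in TypeScript: run each test at least five times, drop the extremes, and average the rest. The `run` callback and the drop-one-at-each-end policy are illustrative assumptions, not C2's actual API.

```typescript
// Run a test at least `runs` times, discard the highest and lowest
// timings, and report the mean of the remainder. Dropping exactly one
// outlier at each end is an assumption for illustration; C2's actual
// policy may differ.
async function trimmedMean(
  run: () => Promise<number>, // resolves to elapsed ms for one run
  runs = 5
): Promise<number> {
  const timings: number[] = [];
  for (let i = 0; i < runs; i++) {
    timings.push(await run());
  }
  timings.sort((a, b) => a - b);
  const kept = timings.slice(1, -1); // discard min and max
  return kept.reduce((sum, t) => sum + t, 0) / kept.length;
}
```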
The results so far have been instructive. On smaller data sets HTML wins hands-down for time-to-render, despite its disadvantage in over-the-wire size. For massive data sets, pagination saves even the most feature-packed of RIA Grids, allowing the Dojo Grid to best even XSLT and a more compact JSON syntax. Of similar interest is the delta in page cycle times between modern browsers and their predecessors. Flex can have a relatively even performance curve across host browsers, but the difference between today's browsers is simply stunning.
Given the lack of an out-of-the-box paginating data store for Flex, RIAs built on that stack seem beholden either to Adobe's LCDS licensing or to ad-hoc pagination built into apps by hand to get reasonable performance for data-rich business applications; a sketch of the pattern follows below. James Ward has already exchanged some mail with me on this topic and it's my hope that we can show how to do pagination in Flex without needing LCDS in the near future.
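For readers unfamiliar with the pattern: a paginating store fetches only the window of rows the grid can show and requests more on demand. Here's a rough TypeScript sketch of the idea; the `baseUrl` and `?start=&count=` query parameters are hypothetical stand-ins for whatever protocol a real backend speaks, not anything shipped by C2 or Flex.

```typescript
// A minimal paginating store: only the rows for the requested page
// are fetched, and pages are cached once retrieved. Everything named
// here is illustrative.
class PagingStore<T> {
  private pages = new Map<number, T[]>();

  constructor(private baseUrl: string, private pageSize = 50) {}

  async getPage(pageIndex: number): Promise<T[]> {
    const cached = this.pages.get(pageIndex);
    if (cached) return cached;
    const start = pageIndex * this.pageSize;
    const res = await fetch(
      `${this.baseUrl}?start=${start}&count=${this.pageSize}`
    );
    const rows: T[] = await res.json();
    this.pages.set(pageIndex, rows);
    return rows;
  }
}
```

The pattern is small, which is rather the point: absent something equivalent in the box, Flex shops end up re-implementing it by hand or paying for LCDS.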
The tests aren't complete. There's still work to do to get some of the SOAP and AMF tests working again. If you have ideas about how to get this done w/o introducing a gigantic hairball of a Java toolchain, I'm all ears. Also on the TODO list is an AppEngine app for recording and analyzing test runs so that we can say something interesting about performance on various browsers.
Census 2 is very much an open source project and so if you'd like to get your library or technology tested, please don't hesitate to send me mail or, better yet, attach patches to the Bug Tracker.
Update: I failed to mention earlier that one of the largest changes in C2 vs. Census is that we report full page cycle times. Instead of reporting just the "internal" timings of an RIA which has been fully bootstrapped, the full page times report the time from the start of page loading to when the output is responsive to user action. This keeps JavaScript frameworks (or even Flex) from omitting the price users pay to download their (often sizable) infrastructure from the reports. There's more work to do in reporting overall sizes and times (the "bandwidth" numbers don't report gzipped sizes, e.g.), but if you want the skinny on real performance, scroll down to the red bars. That's where the action is.
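As a rough illustration of what "full page cycle" means here, a harness can stamp a start time before pointing an iframe at the test page and stop the clock when the page reports its output is usable. The postMessage handshake and the "census-ready" message name below are assumptions about harness structure, not C2's actual implementation.

```typescript
// Harness side: record wall-clock start, load the test page in an
// iframe, and stop the clock when the page signals readiness.
function measurePageCycle(testUrl: string): Promise<number> {
  return new Promise((resolve) => {
    const start = Date.now();
    const frame = document.createElement("iframe");
    window.addEventListener("message", function onMsg(e: MessageEvent) {
      if (e.data === "census-ready") {
        window.removeEventListener("message", onMsg);
        frame.remove();
        resolve(Date.now() - start); // full page cycle, ms
      }
    });
    frame.src = testUrl;
    document.body.appendChild(frame);
  });
}

// Test-page side: signal readiness only once the data is rendered and
// the UI can respond to user action, not merely when onload fires:
//   parent.postMessage("census-ready", "*");
```

Because the clock starts before any framework code has been fetched, the measurement naturally charges each stack for the download and bootstrap cost of its own infrastructure.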