I think this has been voiced by other people, but I’m all for having a person who takes on the role of openplans testing master and is always looking at things from that perspective. Their job would include developing a multi-tiered strategy for refining and expanding unit, flunc and manual tests to increase the quality of our sites. Releases wouldn’t be deployed until they sign off on them.

Leaving testing to be everyone’s job is like leaving kitchen cleanliness to everyone… go and have a look.  Oh, that’s right, Wilfredo was here last night.  We need a Wilfredo who is charged with ensuring the quality of our product.

Filed January 31st, 2008 under Uncategorized

This week and last have been spent entirely on resolving issues with 0.9.7.7.

First I fixed the site error on listen archive views (#2049), which was pretty easy.

Normally when a fix is deployed, we say “Yay” and move on. But I’ve been wanting to try the Five Whys postmortem game, and this seems like a good opportunity. Something like this:

Five Whys, first attempt: Site error on listen archive views



1. Why were the views broken? Because a macro was changed to require some data, and not all the views that used that macro provided the data.

2. Why didn’t we notice those pages were broken? Because we had no automated tests for those pages.

3. Why didn’t we have tests for those views? Because we have no way to know what pages have flunc tests. (We don’t normally write unit tests for pages, at least not so far.)

4. Why don’t we have a way to know that? Because that would require at minimum something like a site coverage test that hits all pages on the site.

5. Why didn’t we find it in manual testing? Didn’t click on any of those I guess. I don’t know what manual testing was done.

From this I make Recommendation 1. Let’s find or write a spider that crawls the whole site and reports site errors.

As I mentioned on the mailing list, this could be as simple as a little scripting around wget that reports error codes like 404 or 5xx. There are zillions of problems that can’t be caught this way, but there are also many that can - and the effort involved is low. Running such a script on stage would have been enough in this case.
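To make the idea concrete, here is a rough sketch of such a spider. It uses plain Python from the standard library rather than scripting around wget, and every name in it (`crawl`, `http_fetch`, `LinkParser`, the 599 pseudo-status) is made up for illustration. It only follows `<a href>` links on the starting host and only issues GETs.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def http_fetch(url):
    """GET a url and return (status, body). Hypothetical default fetcher."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except HTTPError as err:
        return err.code, ""
    except URLError:
        return 599, ""  # made-up code meaning "could not connect"


def crawl(start_url, fetch=http_fetch, max_pages=1000):
    """Breadth-first crawl of one host; return {url: status} for error pages."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    errors = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        status, body = fetch(url)
        if status >= 400:
            errors[url] = status
            continue
        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return errors
```

The `fetch` argument is there so the crawl logic can be tried against a fake site before pointing it at a real one; `max_pages` is a crude guard against crawling forever.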

(The next step would be making it a code coverage tool for flunc - inspect the flunc tests and see what kinds of views it doesn’t hit (maybe matching on the last path segment?)… but that’s a lot more work, let’s say YAGNI.)

David pointed out a potential problem - unsafe GETs. We need to think about these. It’s not any different than e.g. Google crawling our site, though. Anyway…


Moving on to nastier stuff: #2055, the massive mailing list breakage that put us in a panic for a day or two, was hard because it seemed to occur spontaneously with nothing much useful in the logs. And after the lists were manually fixed, they broke themselves again a couple days later. We found some clues though, and I wrote two quick hack scripts to fix and to validate the problematic data. The latter I set up on an hourly cron job, hoping to narrow down possible causes. At the same time I added a rather egregious amount of logging to the listen code (full stack trace every time anything touches that data).

Nothing happened over the weekend; all the lists were still fine on Monday. Tuesday we got lucky. Ethan was looking for something else in the production logs and said aloud “What’s all this spew from listen?”. I looked and sure enough we’d just lost nearly all the mailing lists. And right before that, my logs showed that somebody deleted a project. The stack trace gave me some things to look at, and soon I spotted and fixed the faulty code.

Okay, let’s play again:

Five whys, take two: Lists are borken!

1. Why did the lists break? Because list featurelet deletion was implemented in a way that deleted all projects’ mailing lists.

2. Why did we deploy this broken code? Because we weren’t alerted by failing tests.

3. Why weren’t we alerted by failing tests? I don’t think we ran them before deployment; several of us noticed after deployment that both flunc and unit tests failed out of the box on a fresh build of 0.9.7.7.

4. Why didn’t we run the tests before deployment? It’s not part of our formal deployment checklist.

5. Why isn’t it part of our formal deployment checklist? We don’t have one.

From this I make my second recommendation: We need a formal deployment checklist, and one of the things on it needs to be: Run all tests against a totally fresh build of the release candidate code. (Fresh build because you want to rule out the possibility of tweaks in your sandbox that you forget to apply when doing the real build.)

But that’s not quite right in this case.

Let’s back up a bit and correct a problem in step 3:

3. Why weren’t we alerted by failing tests? Even if we’d run the tests, there were no tests that would have demonstrated a failure here.

4. Why weren’t there any? The tests verified that the intended effect was observed. They didn’t check to see if data for other projects was destroyed.

5. Why didn’t they check data for other projects?

Here I don’t have a good answer. The test author would have to think of checking that. It’s not something that normally leaps to my mind when thinking of likely failure modes. So maybe focusing on testing is running out of steam in this exercise.
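For what it’s worth, once somebody does think of it, the check itself is small. The sketch below is not the listen code or its tests: `ListStore` and `BuggyStore` are hypothetical in-memory stand-ins, and the point is only the shape of the second test, which compares everything that should survive a deletion against what actually survives.

```python
import unittest


class ListStore:
    """Hypothetical stand-in for wherever mailing lists are stored."""

    def __init__(self):
        self.lists = {}  # project id -> list of mailing list names

    def add_list(self, project, name):
        self.lists.setdefault(project, []).append(name)

    def delete_project_lists(self, project):
        # Intended behaviour: remove only this project's lists.
        self.lists.pop(project, None)


class BuggyStore(ListStore):
    """Same store with the kind of bug described above."""

    def delete_project_lists(self, project):
        self.lists.clear()  # wipes every project's lists


class DeleteListsTest(unittest.TestCase):
    store_class = ListStore  # swap in BuggyStore to watch the second test fail

    def setUp(self):
        self.store = self.store_class()
        self.store.add_list("alpha", "alpha-discussion")
        self.store.add_list("beta", "beta-discussion")

    def test_delete_removes_target_lists(self):
        # The "intended effect" check: passes for both stores.
        self.store.delete_project_lists("alpha")
        self.assertNotIn("alpha", self.store.lists)

    def test_delete_leaves_other_projects_alone(self):
        # The collateral-damage check: only the correct store passes.
        expected = dict(self.store.lists)
        del expected["alpha"]
        self.store.delete_project_lists("alpha")
        self.assertEqual(self.store.lists, expected)
```

The first test alone passes against both stores, which is roughly the situation described in step 4.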

At this point I sat down with rmarianski, who’d written the code in question. He was amenable to trying the “five whys” and here’s my synopsis of what he said:

Five whys, take three: Lists are borken!

1. Why did the lists break? Because list featurelet deletion was implemented in a way that deleted all projects’ mailing lists.

2. Why was it written that way? Because it was written too quickly without anyone else’s input.

3a. Why was it written too quickly? Because there was too much else going on at the same time.

3b. Why was it written without anyone else’s input? Because everybody else was busy and Rob didn’t want to bother anyone… again, because there was too much else going on.

4. Why was there too much else going on? Because we’d underestimated the amount of work involved in the release - there was still development on features and build scripts to do.

5. Why were we wrong about how much work was left? Because we don’t have a way to measure when a release is done.

Recommendation: Not sure. We might need ways to better estimate our status and remaining work.

Also it seems relevant here that XP says that the classic time vs. quality tradeoff is a mistake. Did we really need all those features that were worrying Rob? I think this is one case where XP is dead right: the real knob that you can actually adjust as needed is scope. Consider cutting scope first; if you can’t do that, postpone the release. Quality is not a knob.

A final thought.

At one point I thought the broken code wasn’t exercised in any tests at all. That wasn’t true. But while I thought it was, I wrote up the following recommendation: use code coverage tools where available to help identify gaps in testing.

Of course, the limitations of code-coverage tools are well known… but I still think it might be worth doing.

For opencore, you can already do:

zopectl test -s opencore --coverage=outputdir

More info in this doc. Unfortunately this makes the tests a lot slower (I just ran the full opencore suite in about 30 minutes). But maybe this could be done at least once per iteration? Put it on the checklist too? Should we mandate at least having every path touched somewhere in some test before we can say “we’re done”? That would at least mean that when you find a bug to fix, you’re less likely to need to write a whole lot of test setup from scratch.

(My test run came up with a couple hundred uncovered lines, by the way; some of them obviously spurious, some not.)

Filed January 30th, 2008 under Iteration Notes

Remember that strange “seek error” that would randomly happen on pages that tried to display project logos? We commented out the logos on openplans.org a while ago to make sure it didn’t take down our home page while we were still figuring out what was causing it. Well, it has been figured out, fixed, and is about to be released, thanks to a bunch of people. Here’s how:

In the course of testing 0.9.7.7 on stage against a live data set for the coming release, our new intern Ivan discovered that the project search for the letter ‘I’ was reliably causing the seek error.

It was a perfect opportunity to put in a pdb and finally figure out what was going on. After rmarianski and I poked inside the pdb to no avail, we called over slinkp, who soon noticed a strange mention of an OpenPage where an Image was expected. We poked around in the pdb more and, sure enough, one of the projects beginning with the letter I had created a wiki page named logo, and as a result, when Archetypes tried to fetch the image for the project’s logo, it ended up instead fetching the wiki page, trying to call ‘seek’ on something that was not a file, and thus generating the error.

The first step I took to fix the problem was to implement a custom getLogo accessor (instead of the one generated by Archetypes) which checks the type of the value returned to make sure it is actually an Image. To prevent the collision from happening in the future, I added the id ‘logo’ to the list of reserved ids (such as ‘lists’ and ‘tasks’) within projects so that attempting to create a wiki page called ‘logo’ creates one called ‘logo-page’ instead. While I was in there I noticed that ‘blog’ was also not a reserved id, so I added that too. Finally, on the live site, I renamed the two pages with id ‘logo’ (one in the isummit08 project and one in the nycstreets project) to ‘logo-page’, and emailed the project members to alert them. Fortunately no one on nycstreets.org has created a wiki page named ‘logo’.
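In outline, the two parts of the fix look something like the sketch below. This is not the opencore code: `Project`, `Image`, `WikiPage`, `get_logo` and `safe_page_id` are hypothetical stand-ins with no Archetypes involved, and applying the ‘-page’ suffix to every reserved id (not just ‘logo’) is an assumption.

```python
# Ids that wiki pages must not take; 'logo' and 'blog' are the new entries.
RESERVED_IDS = {"lists", "tasks", "logo", "blog"}


class Image:
    """Hypothetical stand-in for the real image type."""


class WikiPage:
    """Hypothetical stand-in for OpenPage."""


class Project:
    def __init__(self):
        # id -> object. In the real bug, looking up 'logo' could land on a
        # wiki page stored under the same id instead of the image.
        self.contents = {}

    def get_logo(self):
        """Return the logo only if it really is an Image, else None."""
        value = self.contents.get("logo")
        return value if isinstance(value, Image) else None

    def safe_page_id(self, requested_id):
        """Pick the id a new wiki page will actually get."""
        if requested_id in RESERVED_IDS:
            return requested_id + "-page"  # suffix for all ids is assumed
        return requested_id
```

The type check makes the accessor safe against pages that already collide, while the reserved id stops new collisions from being created.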

There was some discussion on the dev list that we should change the URLs of these reserved ids to include periods (e.g. “project.logo”) since periods are not allowed characters in wiki ids. It was also proposed that images no longer be stored as attributes on projects but rather as images within the project folders. I have ticketed these items and assigned them to myself, and will work on them when they become top priority.

Besides the project logo work, I checked in a few additions to fassembler to automatically bypass the zope failure-on-first-start issue and to automatically add an openplans site during the zeo build. I also added a task to the opencore and deliverance builds to verify that lxml built properly, as it was often not building properly on Macs due to linking errors. I hope to soon add another task to the opencore build to automatically populate the opencore_properties sheet with the correct values for wordpress_uri, tasktracker_uri, and twirlip_uri, because this is currently an oft-forgotten manual step in setting up a development rig.

I’m looking forward to getting our live sites running off fassembler-built stacks, with twirlip and cabochon in the mix, as well as the additional features and fixes in the new release.

Filed January 16th, 2008 under Iteration Notes

The people and projects on maps work for opencore is basically done. I’d still like to make the feeds faster (performance starts off slowish, around 1 second to get a feed of 10 projects, and degrades roughly linearly above 100 projects).

This took way too long.

Back in mid-November I thought I had a “few days” more work to do on this. I seem to have far surpassed the 90/90 rule.

What happened?

For one thing, there was over a full week of time off in there. But still, “a few days” does not translate to “thirty”.

I lost probably a few days total to debugging hell, sometimes on tasks that weren’t related to people-projects-maps. Also a day or two getting lost in space with the architecture astronauts.

And probably four days total working on unrelated stuff - diagnosing and occasionally fixing issues with the live site.

Then I took most of a day to get my branch mostly in sync with the trunk, and learned some annoying things about subversion in the process.

Then nearly two days to get everything deployed on dev. This took so long partly due to resolving issues with our code revealed along the way, partly due to this being my first time at doing a full stack build with fassembler, and first time deploying anything to flow at all. (We were playing fassembler guinea pig at the same time, with Ra walking me through it and fixing glitches as they came up).

Last time I actually measured time spent on overhead (meetings, email, IRC and the like), it worked out to about 30%. That was at my last job though; I should get a reading on that for TOPP. Assuming it’s still 30%, let’s say I spent about 9 days doing things other than coding.

That still leaves about 10-14 full work days doing “a couple days work”.  And looking back at the end result, it doesn’t look like anything that should have taken that long.  This suggests that A) my estimates still suck (this is not news), and B) I need to learn how to use my time more effectively. I’d like to take some time to go through my commit logs and see if I can retroactively figure out where all the time went and what I can do better on the next project.  Hopefully it’s more concrete and actionable than “don’t be a moron”.

Gotta run to the doctor’s office, see you all after lunch (maybe during).

Filed January 16th, 2008 under Iteration Notes

Here are more accurate stats than those I posted the other day. These were created by swapping in old backups of the database rather than by using approximation methods on our current database.

Again, “active” means something (a member, project or mailing list) that was used during the previous 30 days. “Dormant” means something that was used but not within the last 30 days. “Unused” means something that wasn’t used past its first day.
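Spelled out as code, the three categories might look like the sketch below. The function name, the use of a creation time and a last-used time, and the choice to test for “unused” before “active” (so that something created and touched only today still counts as unused) are all assumptions, not a description of how the stats were actually computed.

```python
from datetime import datetime, timedelta

ACTIVE_WINDOW = timedelta(days=30)  # "used during the previous 30 days"
FIRST_DAY = timedelta(days=1)       # "wasn't used past its first day"


def classify(created, last_used, now):
    """Return 'unused', 'active' or 'dormant' for one member, project or list."""
    if last_used - created <= FIRST_DAY:
        return "unused"   # assumed to take precedence over 'active'
    if now - last_used <= ACTIVE_WINDOW:
        return "active"
    return "dormant"


# Example: a project created in June 2007 and last touched in September 2007.
status = classify(datetime(2007, 6, 1), datetime(2007, 9, 1), datetime(2008, 1, 7))
```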

[Charts: members.png, projects.png, mailing-lists.png]

Note that the sudden jump in unused mailing lists is due to us creating a discussion list for each project.

Personally, I don’t think there’s much more, at this point, to be learned from these results. The bottom line is that our numbers are really low and they need to grow by several orders of magnitude. At least now, though, we have a framework in place to measure this growth.

Filed January 7th, 2008 under Iteration Notes
  • Configuration management
    • GenericSetup
    • portal properties for TaskTracker and Wordpress
  • Better performance through BTree based project objects
  • Relative project home page urls (prevent issues moving data across hosts)
  • Email address blacklist
  • Xinha fixes
  • Through-the-web catalog of versions, shown by sending the “openplans-versions” request variable
  • Project deletion
  • Task list notifications
    • notification management through account page
Filed January 7th, 2008 under Release