This week and last have been spent entirely on resolving issues with 0.9.7.7.
First I fixed the site error on listen archive views (#2049) which was pretty easy.
Normally when a fix is deployed, we say “Yay” and move on. But I’ve been wanting to try the Five Whys postmortem game, and this seems like a good opportunity. Something like this:
Five Whys, first attempt: Site error on listen archive views
1. Why were the views broken? Because a macro was changed to require some data, and not all the views that used that macro provided the data.
2. Why didn’t we notice those pages were broken? Because we had no automated tests for those pages.
3. Why didn’t we have tests for those views? Because we have no way to know what pages have flunc tests. (We don’t normally write unit tests for pages, at least not so far.)
4. Why don’t we have a way to know that? Because that would require at minimum something like a site coverage test that hits all pages on the site.
5. Why didn’t we find it in manual testing? Didn’t click on any of those I guess. I don’t know what manual testing was done.
From this I make Recommendation 1. Let’s find or write a spider that crawls the whole site and reports site errors.
As I mentioned on the mailing list, this could be as simple as a little scripting around wget that reports error codes like 404, 5xx. There are zillions of problems that can’t be caught this way, but there are also many that can - and the effort involved is low. Running such a script on stage would have been enough in this case.
(The next step would be making it a code coverage tool for flunc - inspect the flunc tests and see what kinds of views it doesn’t hit (maybe matching on the last path segment?)… but that’s a lot more work, let’s say YAGNI.)
David pointed out a potential problem - unsafe GETs. We need to think about these. It’s not any different than eg. Google crawling our site, though. Anyway…
Moving on to nastier stuff: #2055, the massive mailing list breakage that put us in a panic for a day or two, was hard because it seemed to occur spontaneously with nothing much useful in the logs. And after the lists were manually fixed, they broke themselves again a couple days later. We found some clues though, and I wrote two quick hack scripts to fix and to validate the problematic data. The latter I set up on an hourly cron job, hoping to narrow down possible causes. At the same time I added a rather egregious amount of logging to the listen code (full stack trace every time anything touches that data).
Nothing happened over the weekend; all the lists were still fine on Monday. Tuesday we got lucky. Ethan was looking for something else in the production logs and said aloud “What’s all this spew from listen?”. I looked and sure enough we’d just lost nearly all the mailing lists. And right before that, my logs showed that somebody deleted a project. The stack trace gave me some things to look at, and soon I spotted and fixed the faulty code.
Okay, let’s play again:
Five whys, take two: Lists are borken!
1. Why did the lists break? Because list featurelet deletion was implemented in a way that deleted all projects’ mailing lists.
2. Why did we deploy this broken code? Because we weren’t alerted by failing tests.
3. Why weren’t we alerted by failing tests? I don’t think we ran them before deployment; several of us noticed after deployment that both flunc and unit tests failed out of the box on a fresh build of 0.9.7.7.
4. Why didn’t we run the tests before deployment? It’s not part of our formal deployment checklist.
5. Why isn’t it part of our formal deployment checklist? We don’t have one.
From this I make my second recommendation: We need a formal deployment checklist, and one of the things on it needs to be: Run all tests against a totally fresh build of the release candidate code. (Fresh build because you want to rule out the possibility of tweaks in your sandbox that you forget to apply when doing the real build.)
But that’s not quite right in this case.
Let’s back up a bit and correct a problem in step 3:
3. Why weren’t we alerted by failing tests? Even if we’d run the tests, there were no tests that would have demonstrated a failure here.
4. Why weren’t there any? The tests verified that the intended effect was observed. They didn’t check to see if data for other projects was destroyed.
5. Why didn’t they check data for other projects?
Here I don’t have a good answer. The test author would have to think of checking that. It’s not something that normally leaps to my mind when thinking of likely failure modes. So maybe focusing on testing is running out of steam in this exercise.
At this point I sat down with rmarianski, who’d written the code in question. He was amenable to trying the “five whys” and here’s my synopsis of what he said:
Five whys, take three: Lists are borken!
1. Why did the lists break? Because list featurelet deletion was implemented in a way that deleted all projects’ mailing lists.
2. Why was it written that way? Because it was written too quickly without anyone else’s input.
3a. Why was it written too quickly? Because there was too much else going on at the same time.
3b. Why was it written without anyone else’s input? Because everybody else was busy and Rob didn’t want to bother anyone… again, because there was too much else going on.
4. Why was there too much else going on? Because we’d underestimated the amount of work involved in the release - there was still development on features and build scripts to do.
5. Why were we wrong about how much work was left? Because we don’t have a way to measure when a release is done.
Recommendation: Not sure. We might need ways to better estimate our status and remaining work.
Also it seems relevant here that XP says that the classic time vs. quality tradeoff is a mistake. Did we really need all those features that were worrying Rob? I think this is one case where XP is dead right: the real knob that you can actually adjust as needed is scope. Consider cutting scope first; if you can’t do that, postpone the release. Quality is not a knob.
A final thought.
At one point I thought the broken code wasn’t exercised in any tests at all. That wasn’t true. But while I thought it was, I wrote up the following recommendation: use code coverage tools where available to help identify gaps in testing.
Of course, the limitations of code-coverage tools are well known… but I still think it might be worth doing.
For opencore, you can already do:
zopectl test -s opencore --coverage=outputdir
More info in this doc. Unfortunately this makes the tests a lot slower (i just ran the full opencore suite in about 30 minutes). But maybe this could be done at least once per iteration? Put it on the checklist too? Should we mandate at least having every path touched somewhere in some test before we can say “we’re done”? That would at least mean that when you find a bug to fix, you’re less likely to need to write a whole lot of test setup from scratch.
(My test run came up with a couple hundred uncovered lines, by the way; some of them obviously spurious, some not.)