Saturday, March 22, 2008

Three ingredients to a better bug report

At work I've recently been going through lots of bug reports. It is part of my job to determine the priority of each defect that gets reported and whether we should still fix it in the upcoming release. With multiple testers trying to break the product in all possible ways and a deadline approaching rapidly, analyzing the newfound defects seems to take more and more time.


What did you see?


People entering defects can actually do a lot to reduce the time it takes to analyze their entries. You'd be amazed at the number of defects that say something along the lines:

  • "when I do A, B happens"

There are of course cases when this is all there is to say about a defect. For example:

  • "when I click the Ok button, the browser shows an internal apache error"

Granted, it would be more useful if the report said a bit more about the error message. But it is at least clear that an internal error message should not be shown to the user.

What did you expect to see?


Unfortunately things are not always so clear:

  • "when try to I delete the last item from the list, I get a message saying at least one item is required"

When I get an error report like this, I'm not sure what to do with it. Most likely there is an internal rule in the program that this list may never be empty. And the program seems to enforce this rule by giving a message when you try to delete the last remaining item. So there is a "business rule" in the program and the developer wrote code to enforce that rule. Where is the defect?

In cases like these I ask the person who entered the defect why they think this behavior is wrong. Typically I get an answer like:

  • "if I can't delete the selected item, the delete button should be disabled"

This added information makes things a lot clearer. So the tester didn't disagree with the fact that there should always be at least one item in the list, they just didn't agree with the way it was handled.

Why do you think your expectation is better than the current behavior?


But the above leaves me with a problem. There is a clear rule in the program that the list must never be empty. The programmer implemented this one way, someone else thought it should have been implemented another way.

In cases like these I ask the tester (or whoever reported the defect) to explain why they think their expectation is better than the current behavior. In the example we've used so far, the reason could be something like:

  • "clicking the button when there is only one item in the list will always show an error message - the delete action will never be performed. Buttons that don't lead to an action being executed should be disabled."

This is a clear - albeit somewhat abstract - description of the reason why the person expected the behavior to be different.

Prior art



In this example I doubt whether anyone will disagree with the reasoning of the tester. But there are many cases where someone will disagree. Especially the developer that implemented the functionality will tend to defend the way it works.

That's why I normally prefer the defect to point to other places where similar functionality is available in the way the tester prefers it. So in the example defect:

  • "in screens B and C we have a similar list and there the delete button is disabled if there is only one item remaining in the list"

This type of argument works especially well when the functionality in screens B and C has already been in a released version of the product. The users of the product have experienced the functionality and they will expect the new screen to behave in the same way.

If no similar functionality is available in the application, I often look for other programs that have similar functionality. On Windows the notepad application is one of my favorite examples. Everybody has it and the functionality as not substantially changed for at least a decade. Of course the functionality your program has might not be in notepad. In those cases I often refer to programs like Microsoft Office, Outlook, Firefox or the Google home-page. Not because I think these are perfect programs, but because they're so ubiquitous that most users accept them as a reference point for the behavior they expose.

Summary


So a bug report should at least contain the following ingredients:

  1. What did you see?

  2. What did you expect to see?

  3. Why do you think that 2 is better than 1?


Now if everyone starts filing their bug reports like that, I will have to spend a lot less time on analyzing them and can get back to fixing those defects sooner. Who knows... maybe we'll make that deadline after all.

Sunday, March 2, 2008

Finding primes using a grid of browsers

It's been a few months since I talked about my idea of combining the power of a lot of browsers into a so-called computing grid. Since writing that post I've been working on such a browser based grid. Progress has been slow, but steady. And I'm learning a lot in the process, so I'm not complaining at all.

My current prototype grid software has a server-side and a client-side part. Tasks and results get distributed over clients as needed. It's actually pretty cool to see one node start a task and another node finish it. But one thing that my grid lacks is a data storage service. Until I add such a service (in a scalable way) I am stuck with a grid that can only handle tasks like finding prime numbers. But even from such a classic task, there is still a lot to be learned.

For example one of the problems we'll need to solve for a browser based grid is the fact that the nodes in the grid are not dedicated to being part of the grid. Whereas people that download SETI@home at least consciously decide to contribute their computer cycles to the project, for a browser based grid this decision is less explicit. Of course this is part of the beauty of a using the web for this grid: it becomes very easy for people to join the grid. But it also means that we can't do certain things. Like eat up all available CPU cycles...

So how do we find prime numbers without eating up too much CPU time? As you probably recall from school: a prime number is a number that can only be divided by 1 and by itself. So the simplest possible algorithm to determine if a given number is a prime is something like:


function IsPrimeNumber(number) {
for (var i=2; i < number-1; i++) {
if (number % i == 0) {
return false;
}
}
return true;
}

As you can see this is quite a literal translation of the definition of primes we all learned. And although there are plenty of optimizations you can do to this algorithm, we'll stick to to this very simple version for now.

What happens if we execute this function on a node? It'll work fine for small numbers. But once the number goes beyond a certain threshold, the loop will start eating up quite some time. And since executing JavaScript in the browser is unrestricted but single-threaded, this means the script will eat up all CPU cycles for a noticeable amount of time. As I said before, that is something that I simply don't want to happen on this grid.

So we need to modify our algorithm to not eat up all CPU time. We could optimize it by for example skipping all even numbers above 2. But that would only double the maximum number we can check with this function before the user starts to notice. What we need is something that will make this function work without noticeable effect on the CPU for any number.

My approach to this has been to split the task into many sub-tasks. So instead of having one loop like in the version above, we split it into multiple loops that each run over a smaller range. Let's say we have the following helper function:

function IsDividableBySomethingInRange(number, start, end);

This function will return true if number can be divided by one of the numbers in the range [start, end>. With this helper function, we can rewrite our original IsPrimeNumber function to use ranges.

function IsPrimeNumber(number) {
var CHUNK_SIZE = 1000;
for (var i=0; i < number; i += CHUNK_SIZE) {
if (IsDividableBySomethingInRange(number, Math.max(i, 2), Math.min(i + CHUNK_SIZE, number-1)));
return false;
}
}
return true;
}

Now if we leave the algorithm like this, we are still executing the work in one go. To actually remove that problem we need to convert the IsDividableBySomethingInRange calls into sub-tasks, which can be executed on the grid separately. The IsPrimeNumber function/task then just has to wait until all its sub-tasks have completed.

The splitting of a task into sub-tasks that can be executed separately and independent of each other is a typical fork operation that I learned back in my college days. We would fork off some work to separate sub-threads, so it could be completed in parallel instead of executing each one in turn. Waiting for the sub-tasks is called a join operation. By creating a fork/join algorithm we're not just making task execution more deterministic, we're also improving parallelism by allowing the sub-tasks to run independently from each other.

So what we really wanted to create all along is a fork/join version of our IsPrimeNumber function/task.

Let's say that we have a fork operation that we can call to somehow fork off a function into a separate sub-task. And let's say that whenever such a sub-task is completed, it will call back into the original task with the result:

class IsPrimeNumber {
var number;
var result;
function IsPrimeNumber(number) {
this.number = number;
this.result = true;
}
function forkSubTasks() {
var CHUNK_SIZE = 1000;
for (var i=0; i < number; i += CHUNK_SIZE) {
fork IsDividableBySomethingInRange(number, Math.max(i, 2), Math.min(i + CHUNK_SIZE, number-1)));
}
}
function joinSubTask(subtask) {
if (subtask.result == true) {
this.result = false;
}
}
}

This is a grid-enabled fork/join version of our IsPrimeNumber task. When we execute this task with a very large number on a grid of nodes, the task will be forked into sub-tasks and each of those sub-tasks can be executed on any of the nodes. When a sub-task is completed its result can be joined back into the IsPrimeNumber, which will assure that the combined result is correct.

IsPrimeNumber(2137)
forkSubTasks
|
|--------->IsDividableBySomethingInRange(2137, 2, 1000)
| |
|------------------+-------------->IsDividableBySomethingInRange(2137, 1000, 2000)
| | |
|------------------+---------------------+---------------->IsDividableBySomethingInRange(2137, 2000, 2136)
| | |
joinSubTask | | |
| | | |
|<-----------------+ | |
| | |
|<---------------------------------------+ |
| |
|<---------------------------------------------------------------+
|
true<--------+

If you know a bit about Google's Map/Reduce algorithm (or the open-source Hadoop implementation of it) you will probably see the similarities between join and reduce.

There is still many things that we can improve here (not swamping the grid with all sub-tasks at once, keep track of the number of missing sub-tasks, etc). But essentially we now have a grid-enabled way of determining whether any given number is a prime.