Friday, April 17, 2009

mstsc /console doesn't work anymore

Summary: instead of calling mstsc /console /v:servername you should now use mstsc /admin /v:servername.


This post should really go under the category: what were they thinking?

In our company we use a product that for licensing reasons really wants to make sure that only a single instance of the application is running. One of the things the developers have done to ensure this is to not allow the application to be started in a remote desktop session. Otherwise you could install the program on a single machine, let multiple users initiate remote sessions on that machine and thus violate the license agreement. Now I'm not saying I agree with this type of licensing setup, but the software is pretty good and we want to make sure we're not doing anything illegal with it.

So what we came up with was setting up a server with the software and then asking developers needing the tool to connect to the console session on that server using Windows' Remote Desktop Connection. The console session is (from the little that I understand of it) the first session that is started on a machine. So even on a terminal services server, there is only one console session. Connecting to it is quite simple from a command prompt:

mstsc /console /v:machinename

The chances of two of our users needing access to the tool at the same time are pretty slim, so this setup actually worked for us quite well for the last year or so. Until last week that is.

Suddenly I was getting reports from one of my team members that he couldn't start the tool due to a licensing error. He checked and double checked, but seemed quite sure that he was using the correct command line. I walked up to the server myself and tried to start the tool - it started without problems. So I left the tool running for him and told him to try again. He did... and said the tool wasn't there when he connected!

Anyone with enough experience with multi-session machines knows that this means the remote desktop ended up in a different session than the one I started the application in. But how could that be? My team member clearly told mstsc to connect to the console session.

A few Google searches later, I found the answer. Now keep in mind: the problem occurred and was solved last week. But just writing about it here, still makes me angry - very angry with some people at Microsoft.

What happened is that a team at Microsoft found out that there was a security issue with the way terminal services worked. So they changed the logic of console vs. administrative sessions for security reasons. Nothing wrong with that - I appreciate security as much as the next guy.

So they now separate the real session 0 from the first GUI session. Sounds good. I am no security expert so I don't know if it helps, but I'll take their word for it. But they didn't stop at that. Somebody then also decided that the /console option of the remote desktop client should not be called /console anymore. After all, you might not always be able to connect to the console anymore - on newer servers it would connect you to the first, typically administrative, GUI session. So they relabeled the "console" session to the "administrative" session. Relabeling is not wrong - I appreciate clarity as much as the next guy.

So instead of calling mstsc /console /v:servername you should now use mstsc /admin /v:servername. All clear - nothing wrong with that.

All of this so far is fine and can be explained to anyone. But this of course leaves one decision to be made: what do you do with the existing /console option? What do you do with all those existing scripts that might already use mstsc /console in the expectation that they end up connected to the console session?

And this is where they did something that I don't understand, that I really, REALLY, REEEAAAALLLY don't understand. They decided to in some cases start ignoring the /console option. So with some versions of the remote desktop client, using mstsc /console will end you up in the console/admin session. And in other versions of the remote desktop client, using the same command will end you up in a completely normal session.

Even after a week of thinking about it, I still don't understand why somebody would decide to silently start ignoring /console, a command line switch that users might depend on. The switches of a command line tool are pretty much the API that you expose to the outside world. Whole build scripts have been built against command line tools based solely on the fact that the switches that are there will not change. Switches might be added, but nothing ever gets dropped. Except at Microsoft I guess. There some genius decided that in some cases they would start silently ignoring the /console option on the command line. See this post on the exact behavior in all cases or read this one for non-Windows-2008 servers. Don't expect to understand the logic after that, but at least it is clear that they've spent a lot of time on this change. And after last week: so have I.

I know not to expect an apologetic phone call from Redmond on this, nor do I really expect them to change the behavior of the mstsc client. Instead I'll just warn my fellow programmers and be on the lookout for remote desktop sessions that don't behave like I expect them to.

Ok... time to breathe in and breathe out, count to ten and do my exercises. And get back to work of course. Now what was I working on?

Thursday, December 25, 2008

Which keyboard is best on mobile phones?

It seems that mobile phones are the main place where UI experiments are still being done. One of the types of UI experiment that recently drew my attention was in keyboard design.

Palm Treo


I just got a Palm Treo from work. That Treo has a built-in QWERTY keyboard with horribly small keys. The keyboard is always available, together with the screen. And although that sounds convenient, it does mean that both the screen and the keyboard are small and the device is clunky.

GPhone


Compare this to the design of the first Google phone: the T-Mobile G1. When using this device as a phone, the keyboard is stowed away under the screen. When you want to access one of its many PDA functions, you fold away the screen and the mini QWERTY keys become available. So in phone mode the device is a lot smaller, but in PDA mode it is bigger.

iPhone


Lastly there's the iPhone solution: a virtual on-screen keyboard. Such on-screen keyboards were pioneered by Windows PDAs about a decade ago, but leave it to Apple to breathe stylish new life into an existing concept. The screen on your iPhone is big and can be used for normal operation. Until you need a keyboard - then the screen is split in two: the top section remains a normal screen and the bottom section becomes a virtual keyboard. The virtual keys are just as miniaturized as their physical counterparts, but they do miss the tactile feedback you get from the other devices. So while it looks endlessly better, inputting text on the iPhone is probably slower than on the others.

So what do you think? Which is the better way of inputting text on your mobile device? The Treo approach of an "always on" keyboard? The way the Google Phone hides the keyboard during normal operation? Or do you prefer the iPhone's virtual keyboard?

Saturday, October 25, 2008

Optimistic locking vs. pessimistic locking

If you have an education in computer science you typically learned about locking in databases. I was taught that the only safe way to update a database in a multi-user environment is to use a transaction:


  1. acquire a lock on the record

  2. read the current value from the record

  3. calculate the new value

  4. write the new value to the record

  5. release the lock


Simple. Reliable. Safe.

And horribly slow...

Imagine thousands of users concurrently hammering your database. Most of these users will be working on different records. So for most of them, the lock is completely useless. But you're still paying the price for acquiring this lock on every database update. That is why this type of locking is called a pessimistic lock. You assume that multiple people will want to update that same record and prevent them from doing so.
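To make those five steps a bit more concrete, here is a rough sketch of what the pessimistic flow could look like in code. The db.query and db.execute helpers are hypothetical (not a real API), SELECT ... FOR UPDATE is just one common way databases let you take a row lock, and the table is the stock-tracking one from the story below.

function decreaseStockPessimistic(db, productId, amount) {
  // sketch only: db.query/db.execute are hypothetical synchronous helpers
  db.execute("BEGIN TRANSACTION");
  // steps 1 and 2: lock the record and read the current value
  var row = db.query("SELECT StockCount FROM tblProducts WHERE ID = ? FOR UPDATE", [productId]);
  // step 3: calculate the new value
  var newCount = row.StockCount - amount;
  // step 4: write the new value to the record
  db.execute("UPDATE tblProducts SET StockCount = ? WHERE ID = ?", [newCount, productId]);
  // step 5: release the lock by committing the transaction
  db.execute("COMMIT");
}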

I thought this was the only way to do locking, until I encountered something completely different about 10 years ago. At my company at the time we used an internally built software application to track stock levels of our products. And that application used the pessimistic lock pattern described above. The application worked fine for some time, but as it got more users it started to have more and more performance problems. I was brought in as part of a team to solve those problems. I immediately went to work on some of the heaviest operations and was able to bring startup performance back to acceptable levels in a few days. But the updates to the database were still going as slow as ever.

That's when the technical lead of my team came up with an interesting suggestion. "Why don't we remove the database locks?" Of course the entire team was in shock. You couldn't remove the locks from the updates. That would mean that the data could become corrupt. That is what we told the tech lead: you need locks to keep your data from becoming corrupt. "So why don't we ensure that the data won't become corrupt when we update it?"

Nobody understood what he meant. Keep in mind that all of us were brought up with the concept of pessimistic locking. So the lead asked us to show him the queries that were being fired against the database:

  1. SELECT ID, StockCount FROM tblProducts WHERE ID=?

  2. application code calculates the new stock for the product

  3. UPDATE tblProducts SET StockCount=? WHERE ID=?


He thought about it for a while and then said "How about this?":

  1. SELECT ID, StockCount FROM tblProducts WHERE ID=?

  2. application code calculates the new stock for the product

  3. UPDATE tblProducts SET StockCount=? WHERE ID=? AND StockCount=?


"So we pass in both the new value for StockCount and the value that we expect to be there. That way the update will only be performed when the StockCount has not been modified by anyone else."

I can still remember the shock going through my mind. Does this mean that I don't need to lock the record before reading it? What will we do when the StockCount did actually change between our SELECT and our UPDATE?

All these questions were answered in the next few hours. If some other user had performed a conflicting UPDATE while our code was busy figuring out the new stock value, we'd have to perform the entire sequence above again. So the process turned more into a loop:

  • while the stock value has not been updated:


    1. SELECT ID, StockCount FROM tblProducts WHERE ID=?

    2. application code calculates the new stock for the product

    3. UPDATE tblProducts SET StockCount=? WHERE ID=? AND StockCount=?



And sure the code got a bit more complex. And definitely in the worst case, this code will send a lot more queries to the database than the one with pessimistic locking. After all, it will fire multiple SELECT and UPDATE statements. Potentially an unlimited number of them, because of the loop.
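In code the retry loop could look something like this - again just a sketch, assuming hypothetical db.query and db.execute helpers, where db.execute returns the number of rows it actually touched:

function decreaseStockOptimistic(db, productId, amount) {
  var updatedRows = 0;
  while (updatedRows == 0) {
    // 1. read the current value, without taking any lock
    var row = db.query("SELECT StockCount FROM tblProducts WHERE ID = ?", [productId]);
    // 2. calculate the new value
    var newCount = row.StockCount - amount;
    // 3. write it back, but only if nobody else changed the value in the meantime
    updatedRows = db.execute(
      "UPDATE tblProducts SET StockCount = ? WHERE ID = ? AND StockCount = ?",
      [newCount, productId, row.StockCount]);
  }
}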

But that is exactly why people call this the optimistic locking pattern. As I said earlier, most concurrent updates to a database are to different records. So most updates don't have a conflict. But with pessimistic locking all updates pay the price of acquiring a lock. So all updates pay a penalty for the small fraction of them that actually require the locking.

Optimistic locking is built on the assumption that most updates are without conflicts. When there is no conflict, there is no penalty. And only when there has been a conflicting update, do we go into the loop of retrying and pay the penalty. The penalty will be much higher than with pessimistic locking, but it will only be paid for the cases where it is needed.

So there you have it, the difference between optimistic and pessimistic locking and how I learned about them. Pick whichever one you think will work best for your project. Both of them are perfectly suitable for keeping your data safe. It all depends on the ratio of "updates with conflicts" vs. "updates without conflicts" for your application.

Sunday, July 13, 2008

Using PageRank to determine the most powerful people on LinkedIn

Last week I was reading the original paper by Larry Page and Sergey Brin on the PageRank algorithm. PageRank is an algorithm they came up with while at Stanford University to determine the relative authority of pages. And it is of course also the algorithm they then started using for a search engine at Stanford - a search engine they named Google.

If you just skip the math for a moment, the PageRank algorithm is surprisingly simple. The basic rules in determining the rank of a page are:
1) the links to a page determine its rank
2) a page spreads its rank over the links it contains

So if you have a page with three links on it, that page will contribute one third of its rank to each of those links. A page with a rank of 100 that has two links on it, will give 50 points to each of its targets.

A simplified PageRank calculation, showing that each page re-distributes its own PageRank over the pages it links to

One thing you may already have noticed is that a page's PageRank can't be determined in one go: you need to know the value of all incoming links to determine the rank of a page, which you in turn need to know to determine the value of all its outgoing links. According to the paper you can simply calculate all PageRanks in an iterative fashion, and they converge after a quite limited number of steps. I'll gladly take their word for it, since that puts the algorithm back into the hands of us programmers. We can iterate over large collections of data like no other.

A converged PageRank calculation, showing that repeated refinement will result in a stable network of PageRanks
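If you want to play with the idea yourself, the simplified iteration fits in a few lines of JavaScript. This is my own toy sketch of the two rules above - no damping factor, no handling of pages without outgoing links - not the formulation from the paper:

function calculatePageRanks(links, iterations) {
  // links maps each page to the array of pages it links to
  var pages = [];
  for (var page in links) {
    pages.push(page);
  }
  // start every page off with an equal share of the rank
  var ranks = {};
  for (var i = 0; i < pages.length; i++) {
    ranks[pages[i]] = 1.0 / pages.length;
  }
  for (var step = 0; step < iterations; step++) {
    var newRanks = {};
    for (var i = 0; i < pages.length; i++) {
      newRanks[pages[i]] = 0;
    }
    // rule 2: each page spreads its rank evenly over its outgoing links
    for (var i = 0; i < pages.length; i++) {
      var targets = links[pages[i]];
      for (var j = 0; j < targets.length; j++) {
        newRanks[targets[j]] += ranks[pages[i]] / targets.length;
      }
    }
    ranks = newRanks;
  }
  return ranks;
}

// toy example: A links to B and C, B links to C, C links back to A
var ranks = calculatePageRanks({ A: ["B", "C"], B: ["C"], C: ["A"] }, 50);

Run it on a handful of pages and you can watch the ranks settle down after a few dozen iterations.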

The PageRank algorithm is so simple that it not only works, it has turned the search engine world (and the internet in general) upside down. I'm quite sure that today's PageRank is determined in a much more complex fashion, but let's stick to this older and more simple model for now.

What else could we use PageRank for?

What would happen if you let PageRank loose on a completely different type of data? How about instead of taking all the hyperlinked documents on the web, we take a social network? So take the network of people at a site like LinkedIn, Facebook, Plaxo or Orkut and apply the same two basic rules:
1) a person's rank is determined by the incoming links
2) a person's rank is evenly divided over their outgoing links

So the more people link to you, the higher your rank. And if more people link to the people that link to you, your rank will get even higher.



If we let PageRank loose on a social network like this, what will happen? Who will get the highest PageRank? And what does the rank mean in such a scenario? Is your PageRank an indicator of the power of your social network? Will my boss indeed have a higher rank than me, as you would expect? And will his boss have an even higher rank?

There are of course differences between links on a social network site and the web in general. For example: in a social network like this the links between people are non-directed - when you are connected to someone, that person is also connected to you. This will likely affect the results, but I am too bad at maths to foresee how.

Basically I have no idea what the outcome of such an experiment will be. It just seems to me that it would be incredibly interesting to see what the numbers are and then determine what they mean. Much like Page and Brin did with their original PageRank: first let the algorithm loose on the data, then see what you can do with the results. If the algorithm feels intuitively good, the results are bound to be interesting.

Do you feel the same? Can this be an interesting experiment? Or is this complete nonsense and is PageRank not applicable to this scenario? And do the social network sites have APIs that will allow you to do this? Or is your data access limited to the people in your own network (as they always promise in their privacy statement) and can only someone working for Facebook/LinkedIn/Plaxo create such an application?

Saturday, March 22, 2008

Three ingredients to a better bug report

At work I've recently been going through lots of bug reports. It is part of my job to determine the priority of each defect that gets reported and whether we should still fix it in the upcoming release. With multiple testers trying to break the product in all possible ways and a deadline approaching rapidly, analyzing the newfound defects seems to take more and more time.


What did you see?


People entering defects can actually do a lot to reduce the time it takes to analyze their entries. You'd be amazed at the number of defects that say something along the lines of:

  • "when I do A, B happens"

There are of course cases when this is all there is to say about a defect. For example:

  • "when I click the Ok button, the browser shows an internal apache error"

Granted, it would be more useful if the report said a bit more about the error message. But it is at least clear that an internal error message should not be shown to the user.

What did you expect to see?


Unfortunately things are not always so clear:

  • "when try to I delete the last item from the list, I get a message saying at least one item is required"

When I get an error report like this, I'm not sure what to do with it. Most likely there is an internal rule in the program that this list may never be empty. And the program seems to enforce this rule by giving a message when you try to delete the last remaining item. So there is a "business rule" in the program and the developer wrote code to enforce that rule. Where is the defect?

In cases like these I ask the person who entered the defect why they think this behavior is wrong. Typically I get an answer like:

  • "if I can't delete the selected item, the delete button should be disabled"

This added information makes things a lot clearer. So the tester didn't disagree with the fact that there should always be at least one item in the list, they just didn't agree with the way it was handled.

Why do you think your expectation is better than the current behavior?


But the above leaves me with a problem. There is a clear rule in the program that the list must never be empty. The programmer implemented this one way, someone else thought it should have been implemented another way.

In cases like these I ask the tester (or whoever reported the defect) to explain why they think their expectation is better than the current behavior. In the example we've used so far, the reason could be something like:

  • "clicking the button when there is only one item in the list will always show an error message - the delete action will never be performed. Buttons that don't lead to an action being executed should be disabled."

This is a clear - albeit somewhat abstract - description of the reason why the person expected the behavior to be different.

Prior art



In this example I doubt whether anyone will disagree with the reasoning of the tester. But there are many cases where someone will disagree. Especially the developer that implemented the functionality will tend to defend the way it works.

That's why I normally prefer the defect to point to other places where similar functionality already works in the way the tester prefers. So in the example defect:

  • "in screens B and C we have a similar list and there the delete button is disabled if there is only one item remaining in the list"

This type of argument works especially well when the functionality in screens B and C has already been in a released version of the product. The users of the product have experienced the functionality and they will expect the new screen to behave in the same way.

If no similar functionality is available in the application, I often look for other programs that have similar functionality. On Windows the notepad application is one of my favorite examples. Everybody has it and its functionality has not substantially changed for at least a decade. Of course the functionality your program has might not be in notepad. In those cases I often refer to programs like Microsoft Office, Outlook, Firefox or the Google home-page. Not because I think these are perfect programs, but because they're so ubiquitous that most users accept them as a reference point for the behavior they exhibit.

Summary


So a bug report should at least contain the following ingredients:

  1. What did you see?

  2. What did you expect to see?

  3. Why do you think that 2 is better than 1?


Now if everyone starts filing their bug reports like that, I will have to spend a lot less time on analyzing them and can get back to fixing those defects sooner. Who knows... maybe we'll make that deadline after all.

Sunday, March 2, 2008

Finding primes using a grid of browsers

It's been a few months since I talked about my idea of combining the power of a lot of browsers into a so-called computing grid. Since writing that post I've been working on such a browser based grid. Progress has been slow, but steady. And I'm learning a lot in the process, so I'm not complaining at all.

My current prototype grid software has a server-side and a client-side part. Tasks and results get distributed over clients as needed. It's actually pretty cool to see one node start a task and another node finish it. But one thing that my grid lacks is a data storage service. Until I add such a service (in a scalable way) I am stuck with a grid that can only handle tasks like finding prime numbers. But even from such a classic task, there is still a lot to be learned.

For example, one of the problems we'll need to solve for a browser based grid is the fact that the nodes in the grid are not dedicated to being part of the grid. Whereas people that download SETI@home at least consciously decide to contribute their computer cycles to the project, for a browser based grid this decision is less explicit. Of course this is part of the beauty of using the web for this grid: it becomes very easy for people to join the grid. But it also means that we can't do certain things. Like eating up all available CPU cycles...

So how do we find prime numbers without eating up too much CPU time? As you probably recall from school: a prime number is a number that can only be divided by 1 and by itself. So the simplest possible algorithm to determine if a given number is a prime is something like:


function IsPrimeNumber(number) {
  // try every possible divisor; if any of them divides evenly, the number is not prime
  for (var i = 2; i < number - 1; i++) {
    if (number % i == 0) {
      return false;
    }
  }
  return true;
}

As you can see this is quite a literal translation of the definition of primes we all learned. And although there are plenty of optimizations you can do to this algorithm, we'll stick to this very simple version for now.

What happens if we execute this function on a node? It'll work fine for small numbers. But once the number goes beyond a certain threshold, the loop will start eating up quite some time. And since executing JavaScript in the browser is unrestricted but single-threaded, this means the script will eat up all CPU cycles for a noticeable amount of time. As I said before, that is something that I simply don't want to happen on this grid.

So we need to modify our algorithm to not eat up all CPU time. We could optimize it by for example skipping all even numbers above 2. But that would only double the maximum number we can check with this function before the user starts to notice. What we need is something that will make this function work without noticeable effect on the CPU for any number.
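To make that concrete: the standard trick in a browser is to do a small chunk of work and then hand control back to the browser with setTimeout before starting the next chunk. Something along these lines - the IsPrimeNumberAsync name and the callback style are my own illustration, not part of the grid code:

function IsPrimeNumberAsync(number, callback) {
  var CHUNK_SIZE = 1000;
  var i = 2;
  function checkChunk() {
    // check one chunk of possible divisors
    var end = Math.min(i + CHUNK_SIZE, number - 1);
    for (; i < end; i++) {
      if (number % i == 0) {
        callback(false);
        return;
      }
    }
    if (i < number - 1) {
      // more work to do: yield to the browser before the next chunk
      setTimeout(checkChunk, 0);
    } else {
      callback(true);
    }
  }
  checkChunk();
}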

My approach to this has been to split the task into many sub-tasks. So instead of having one big loop like in the original IsPrimeNumber, we split it into multiple loops that each run over a smaller range. Let's say we have the following helper function:

function IsDividableBySomethingInRange(number, start, end);

This function will return true if number can be divided by one of the numbers in the range [start, end) - that is, including start but excluding end. With this helper function, we can rewrite our original IsPrimeNumber function to use ranges.

function IsPrimeNumber(number) {
  var CHUNK_SIZE = 1000;
  for (var i = 0; i < number; i += CHUNK_SIZE) {
    if (IsDividableBySomethingInRange(number, Math.max(i, 2), Math.min(i + CHUNK_SIZE, number - 1))) {
      return false;
    }
  }
  return true;
}
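The helper itself is not spelled out in this post, but a minimal version that matches the description above could be as simple as:

function IsDividableBySomethingInRange(number, start, end) {
  // returns true if any value in [start, end) divides number evenly
  for (var i = start; i < end; i++) {
    if (number % i == 0) {
      return true;
    }
  }
  return false;
}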

Now if we leave the algorithm like this, we are still executing the work in one go. To actually remove that problem we need to convert the IsDividableBySomethingInRange calls into sub-tasks, which can be executed on the grid separately. The IsPrimeNumber function/task then just has to wait until all its sub-tasks have completed.

The splitting of a task into sub-tasks that can be executed separately and independently of each other is a typical fork operation that I learned back in my college days. We would fork off some work to separate sub-threads, so it could be completed in parallel instead of executing each part in turn. Waiting for the sub-tasks is called a join operation. By creating a fork/join algorithm we're not just making task execution more deterministic, we're also improving parallelism by allowing the sub-tasks to run independently from each other.

So what we really wanted to create all along is a fork/join version of our IsPrimeNumber function/task.

Let's say that we have a fork operation that we can call to somehow fork off a function into a separate sub-task. And let's say that whenever such a sub-task is completed, it will call back into the original task with the result:

class IsPrimeNumber {
  var number;
  var result;

  function IsPrimeNumber(number) {
    this.number = number;
    this.result = true;
  }

  function forkSubTasks() {
    var CHUNK_SIZE = 1000;
    for (var i = 0; i < this.number; i += CHUNK_SIZE) {
      fork IsDividableBySomethingInRange(this.number, Math.max(i, 2), Math.min(i + CHUNK_SIZE, this.number - 1));
    }
  }

  function joinSubTask(subtask) {
    // a sub-task that found a divisor in its range proves the number is not prime
    if (subtask.result == true) {
      this.result = false;
    }
  }
}

This is a grid-enabled fork/join version of our IsPrimeNumber task. When we execute this task with a very large number on a grid of nodes, the task will be forked into sub-tasks and each of those sub-tasks can be executed on any of the nodes. When a sub-task is completed its result can be joined back into the IsPrimeNumber, which will assure that the combined result is correct.

IsPrimeNumber(2137)
forkSubTasks
|
|--------->IsDividableBySomethingInRange(2137, 2, 1000)
| |
|------------------+-------------->IsDividableBySomethingInRange(2137, 1000, 2000)
| | |
|------------------+---------------------+---------------->IsDividableBySomethingInRange(2137, 2000, 2136)
| | |
joinSubTask | | |
| | | |
|<-----------------+ | |
| | |
|<---------------------------------------+ |
| |
|<---------------------------------------------------------------+
|
true<--------+

If you know a bit about Google's Map/Reduce algorithm (or the open-source Hadoop implementation of it) you will probably see the similarities between join and reduce.

There are still many things we can improve here (not swamping the grid with all sub-tasks at once, keeping track of the number of missing sub-tasks, etc.). But essentially we now have a grid-enabled way of determining whether any given number is a prime.

Saturday, February 2, 2008

Scrum: story points, ideal man days, real man weeks

My team completed its seventh sprint of a project. Once again all stories were accepted by the product owner.

While one of the team members was giving the sprint demo, I started looking in more detail at some of the numbers. Because with seven sprints behind us, we've gathered quite some data on the progress from sprint to sprint. That's the velocity, for XP practitioners.

Looking at the data



So far we've had 131 story points of functionality accepted by the product owner, so that's an average of 18-19 per sprint. The distribution has not really been all that stable though. Here's a chart showing the number of accepted story points per sprint:


Although it is a bit difficult to see the trend in sprints 1 to 5, we seemed to be going slightly upward. This is in line with what you'd expect in any project: as the team gets more used to the project and to each other, performance increases a bit.

The jump from sprint 5 to sprint 6 however is very clearly visible. This jump should come as no surprise when I tell you that our team was expanded from 3 developers to 5 developers in sprint 6 and 7. And as you can clearly see, those additional developers were contributing to the team velocity right from the start.

But how much does each developer contribute? To see that we divide the number of accepted story points per sprint by the number of developers in that sprint:

Apparently we've been pretty consistently implementing 5 story points per developer per sprint. There was a slight drop in sprint 6, which is also fairly typical when you add more developers to a project. But overall you can say that our velocity per developer has been pretty stable.

Given this stability it suddenly becomes a simple (but still interesting) exercise to try and project when the project will be completed. All you need in addition to the data from the previous sprints, is an indication of the total estimate of all stories on the product backlog. We've been keeping track of that number too, so plotting both the work completed vs. the total scope gives the following chart:

So it looks like we indeed will be finished with the project after one more sprint. That is of course, if the product owner doesn't all of a sudden change the scope. Or we find out that our initial estimates for the remaining stories were way off. After all: it's an agile project, so anything can happen.
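For anyone who wants to do the same projection without drawing a chart, the arithmetic fits in a few lines of code. The backlog total in the example is made up, since the actual number isn't in this post:

// how many sprints are left, assuming the velocity stays stable?
function sprintsRemaining(totalBacklogPoints, acceptedPoints, pointsPerSprint) {
  return Math.ceil((totalBacklogPoints - acceptedPoints) / pointsPerSprint);
}

// e.g. a hypothetical 150 points of total scope, 131 accepted, ~19 points per sprint:
// sprintsRemaining(150, 131, 19) == 1 more sprint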

Story points vs. ideal man days vs. real man weeks



Whenever I talk about this "number of story points per developer per sprint" to people on other projects, they inevitably ask the same question. What is a story point? The correct Scrum answer would be that it doesn't matter what unit it is. It's a story point and we do about five story points per developer per sprint.

But of course there is a different unit behind the story points. When our team estimates its stories, we ask ourselves the question: if I were locked into a room with no phone or other disturbances and a perfect development setup, after how many days would I have this story completed? So a story point is a so-called "ideal man day".

From the results so far we can see that apparently this is a pretty stable way to estimate the work required. And stability is most important, way more important than for example absolute correctness.

A classic project manager might take the estimate of the team (in ideal man days) and divide that by 5 to get to the ideal man weeks. Then divide by the number of people in the team to get to the number of weeks it should take the team to complete the work. And of course they'll add some time to the plan for "overhead", being the benevolent leaders that they are. This will give them a "realistic" deadline for the project. A deadline that somehow is never made, much to the surprise and outrage of the classic project manager.

I'm just a Scrum master on the project. So I don't set deadlines. And I don't get to be outraged when we don't make the deadline. All I can do is study the numbers and see what they tell me. And what they tell me for the current project is that the numbers are pretty stable. And that's the way I like it.

But there is a bit more you can do with the numbers. If you know that the developers in the team estimate in "ideal man days", you can also determine how many ideal man days fit into a real week. For that you need to know the length of a sprint.

Our team has settled on a sprint length of four weeks. That's the end-to-end time between the sprints. So four weeks after the end of sprint 3, we are at the end of sprint 4. In those four weeks, we have two "slack days". One of those is for the acceptance test and demo. The other is for the retro and planning of the next sprint.

So there are two days of overhead per sprint. But there is a lot more overhead during the sprint, so in calculations that span multiple sprints I tend to think of those two days as part of the sprint.

So a sprint is simply four weeks. And in a sprint a developer on average completes 5 story points, which is just another way of saying 5 ideal man days. So in a real week there are 1.25 ideal man days!

I just hope that our managers don't read this post. Because their initial reaction will be: "What? What are you doing the rest of the time? Is there any way we can improve this number? Can't you people just work harder?"

Like I said before: I don't believe in that logic. It's classic utilization-focused project management. It suggests that you should try to have perfect estimates and account for all variables so that you can come to a guaranteed delivery date. The problem with that is that it doesn't work! If there's anything that decades of software engineering management should have taught us, it is that there are too many unknown factors to get any kind of certainty on the deadline. So until we get more control of those variables, I'd much rather have a stable velocity than a high utilization.