On parsing integers: Jeft Stedfast's Beautiful Code

by Miguel de Icaza

While the world is debating what is the proper object oriented and syntactic decoration for another wrapper around getElementById we back in Mono-land were coping with a bug related to parsing integers.

In the CLI the original API to parse integers threw exceptions when an overflow was detected. So if you tried to parse a value that is too large to fit on an int, you would get an overflow exception. This is trivial to implement in C# because you just write the parser code more or less like this:

		int value, digit;
		...
		value = checked (value * 10 + digit);
		
	

If the result for the expression value * 10 + digit the JIT just throws. Beautiful.

This is great as long as your input is error-free, but if you are going to cope with lots of errors, you are going to be coping with a lot of exceptions. And throwing and catching exceptions is consideraly slower than returning an error code. So with .NET 2.0, a new family of methods was introduced, something like this: bool Int32.TryParse (string value, out int result).

The idea here is that instead of raising exceptions left and right we instead try to parse the value, and if the value overflows we just return an error condition.

A simple approach is to use a type that is bigger than the type you are parsing and you just check the boundaries for errors, for example, a byte could use an int:


		int value;

		value = value * 10 + digit;

		if (value < Byte.MinValue || value > Byte.MaxValue)
			// error.
	

The downside with this is that parsing 32-bit ints requires 64-bit longs. This alone is quite expensive as 64-bit operations on 32-bit machines take many more instructions, consume more registers and produce slow code for a very common case. For the 64-bit case, things are worse, since there is no 128-bit longs in .NET, which means that we have to either come up with something clever, or you need to do checked expressions (like before), use try/catch and return an error on the catch handler. Very suboptimal.

The ideal situation is to do the parsing with the data type at hand (use an int for int parsing, a long for long parsing) and catch the overflow before it causes a problem.

Without going into the embarrassing details (which anyone with an inclination to point fingers can do) recently we had to rewrite these routines and Jeff Stedfast came to the rescue with two beautiful routines: one to parse signed integers, and one to parse unsigned integers in C.

His routines are beautiful because they are able to parse a 32-bit int in a single loop, only using 32-bit ints; And 64-bit ints in a single loop, only using 64-bit longs.

His routines were for C code, but variations of these rewritten for C# are now part of our runtime (they will ship as part of Mono 1.2.6).

These routines have a very clever way of coping with overflows.

Posted on 13 Nov 2007