This week at the Microsoft PDC we introduced a new feature
in the Mono virtual machine that we have been working on
quietly and that will appear in our upcoming Mono 2.2 release
(due in early December).
I believe we are the first VM for managed code that
provides an object-oriented API to the underlying CPU SIMD
instructions.
In short, this means that developers will be able to use
the types in the Mono.Simd library and have those mapped
directly to efficient vector operations on hardware that
supports them.
With Mono.Simd, the core of a vector operation such as
updating the coordinates of an existing vector, as in the
following example, will go from 40-60 CPU instructions down to
4 or so SSE instructions:
    void Move (Vector4f [] pos, ref Vector4f delta)
    {
        // Add delta to every position in place.
        for (int i = 0; i < pos.Length; i++)
            pos [i] += delta;
    }
In C#, the += in the loop turns out to be a call into the
method Vector4f.operator + (Vector4f a, Vector4f b), which is
implemented like this:
    public static Vector4f operator + (Vector4f a, Vector4f b)
    {
        return new Vector4f (a.x+b.x, a.y+b.y, a.z+b.z, a.w+b.w);
    }
The core of the operation is inlined in the `Move' method
and it looks like this:
    movups (%eax),%xmm0
    movups (%edi),%xmm1
    addps  %xmm1,%xmm0
    movups %xmm0,(%eax)
You can see the details on the slides that I used at the
PDC and look at the changes in the generated assembly; they
are very large.
Ideally, once we tune the API based on our user feedback
and contributions, it should be brought to ECMA for
standardization. Hopefully we can get Microsoft to implement
the SIMD support as well so all .NET developers have access to
this.
Making Managed Code Viable for Gaming
Many developers have had to resort to C++ or assembly language
because managed languages did not provide the performance they
needed. We believe that we can bring the productivity gains
of managed languages to developers who need high-performance
applications.
But even if you want to keep using your hand-tuned C++ game
engine, the SIMD extensions will improve the performance of
your scripting code. You can accelerate your ray casting
operations by doing all the work in the managed world instead
of paying for a costly managed-to-unmanaged transition and
back.
You can avoid moving plenty of code from C# into C++ with
this new functionality.
Some SIMD Background
Most modern CPUs contain special instructions that are able
to perform arithmetic operations on multiple values at once.
For example, it is possible to add two 4-float vectors in one
pass, or to perform these arithmetic operations on 16 bytes at
a time.
These are usually referred to as SIMD (single instruction,
multiple data) instructions, and they started showing up in
CPUs a few years ago. On x86-class machines these instructions
were part of the MMX, 3DNow or SSEx extensions; on PowerPC they
are called AltiVec.
CPU manufacturers have been evolving the extensions, and
newer versions always include more functionality and expand on
the previous generations.
On x86 processors the SSE instructions use a new register
bank (the XMM registers) and can be configured to work on 16
bytes at a time using a number of possible combinations:
- byte-level operations on 16 elements.
- short-level operations on 8 elements.
- single-precision or 32-bit integer operations on 4
elements.
- double-precision or long-integer operations on 2
elements.
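To make these combinations concrete, here is a minimal sketch in plain C#. It does not use Mono.Simd, and the LaneViews class and its method names are made up for this illustration. It computes how many elements of each size fit in one 16-byte register and reinterprets the same 16 bytes as four floats:

```csharp
using System;

static class LaneViews
{
    const int RegisterBytes = 16;   // size of one XMM register

    // Number of elements of the given size that fit in one register.
    public static int LaneCount (int elementSizeInBytes)
    {
        return RegisterBytes / elementSizeInBytes;
    }

    // Reinterpret 16 raw bytes as four single-precision floats.
    public static float [] AsFloats (byte [] raw)
    {
        if (raw.Length != RegisterBytes)
            throw new ArgumentException ("expected exactly 16 bytes");
        var lanes = new float [4];
        Buffer.BlockCopy (raw, 0, lanes, 0, RegisterBytes);
        return lanes;
    }

    static void Main ()
    {
        Console.WriteLine ("byte lanes:   {0}", LaneCount (sizeof (byte)));
        Console.WriteLine ("short lanes:  {0}", LaneCount (sizeof (short)));
        Console.WriteLine ("float lanes:  {0}", LaneCount (sizeof (float)));
        Console.WriteLine ("double lanes: {0}", LaneCount (sizeof (double)));

        // Fill 16 bytes with the bit pattern of 1.5f repeated four times.
        var raw = new byte [RegisterBytes];
        for (int i = 0; i < 4; i++)
            Buffer.BlockCopy (BitConverter.GetBytes (1.5f), 0, raw, i * 4, 4);

        foreach (float f in AsFloats (raw))
            Console.WriteLine (f);
    }
}
```

The same 16 bytes could just as well be read as 8 shorts or 2 doubles; which view applies depends only on the instruction that operates on the register.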
The byte-level operations are useful, for example, when doing
image composition, scaling or format conversions. The
floating point operations are useful for 3D math or physics
simulations (for example when building video games).
Typically developers write the code in assembly language to
take advantage of this feature, or they use compiler-specific
intrinsic operations that map to these underlying
instructions.
The Idea
Unlike native code generated by a compiler, Common
Intermediate Language (CIL) or Java class files contain enough
semantic information from the original language that it is
very easy to build tools to compute code metrics (with tools
like NDepend), find bugs in the code (with tools like Gendarme
or FxCop), reconstruct the original program flow (with
libraries like Cecil.FlowAnalysis) or even decompile the code
and get back something relatively close to the original source
code.
With this rich information, virtual machines can tune code
when it is just-in-time compiled, either by generating code
that best fits the particular target system or by recompiling
the code on demand.
In the past we had proposed mechanisms to improve the
performance of specific code patterns or languages like Lisp
by creating special helper classes that are intimately linked
with the runtime.
As Mono continues to be used as a high-performance
scripting engine for games we were wondering how we could
better serve our gaming users.
During the Game Developer Conference early this year, we
had a chance to meet with Realtime Worlds, which is using
Mono as the foundation for their new work, and we wanted to
understand how we could help them be more effective.
One of the issues that came up was the performance of
Vector operations and how this could be optimized. We
discussed with them the possibility of providing an
object-oriented API that would map directly to the SIMD
hardware available on modern computers. Realtime Worlds
shared with us their needs in this space, and we promised that
we would look into this.
The Development
Our initial discussion with Realtime Worlds was in May,
and at the time we were working both towards Mono 2.0 and also
on a new code generation engine that would improve Mono's
performance.
The JIT engine that shipped with Mono 2.0 was not a great
place to start adding SIMD support, so we decided to postpone
this work until we switched Mono to the Linear IL
engine.
Rodrigo started work on a proof-of-concept implementation
for SIMD and after a weekend he managed to get the basics in
place and got a simple demo working.
Beyond the proof of concept, there was a lingering
question: was code using the vector operations going to be
noticeably faster than the regular code? We were afraid that
the register spill/reload would eclipse the benefits of using
the SIMD instructions or that our assumptions had been wrong.
Over the next few weeks the rest of the team (Zoltan, Paolo
and Mark) worked with Rodrigo to turn the prototype into
something that could both be integrated into Mono and execute
efficiently.
For example, with Mono 2.2 we will now align the stack
to a 16-byte boundary to improve performance for
stack-allocated Mono.Simd structures.
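The arithmetic behind a 16-byte boundary is simple. The following is only a sketch of that arithmetic in C#, not Mono's actual stack layout code; the Alignment class and the sample address are hypothetical:

```csharp
using System;

static class Alignment
{
    // Round an address down to the nearest multiple of 16 by
    // clearing its four lowest bits.
    public static long AlignDown16 (long address)
    {
        return address & ~15L;
    }

    // An address is 16-byte aligned when its four lowest bits are zero.
    public static bool IsAligned16 (long address)
    {
        return (address & 15L) == 0;
    }

    static void Main ()
    {
        long sp = 0x7ffc1234;              // hypothetical stack pointer value
        long aligned = AlignDown16 (sp);   // 0x7ffc1230

        Console.WriteLine ("0x{0:x} aligned: {1}", sp, IsAligned16 (sp));
        Console.WriteLine ("0x{0:x} aligned: {1}", aligned, IsAligned16 (aligned));
    }
}
```

A 16-byte structure that starts on such a boundary never straddles two 16-byte blocks of memory, which is what makes loading it into an XMM register cheaper.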
So far the reception from developers building games has
been very positive.
Although today we only support x86 up to SSE3 and some
SSE4, we will be expanding both the API and the reach of
our SIMD mapping based on our users' feedback. For example,
on other architectures we will map the operations to their own
SIMD instructions.
The API
The API lives in the Mono.Simd assembly and is available
today from our SVN Repository (browse the API or get a
tarball). You can also check our Mono.Simd documentation.
This assembly can be used in Mono or .NET and contains the
following hardware accelerated types (as of today):
Mono.Simd.Vector16b - 16 unsigned bytes
Mono.Simd.Vector16sb - 16 signed bytes
Mono.Simd.Vector2d - 2 doubles
Mono.Simd.Vector2l - 2 signed 64-bit longs
Mono.Simd.Vector2ul - 2 unsigned 64-bit longs
Mono.Simd.Vector4f - 4 floats
Mono.Simd.Vector4i - 4 signed 32-bit ints
Mono.Simd.Vector4ui - 4 unsigned 32-bit ints
Mono.Simd.Vector8s - 8 signed 16-bit shorts
Mono.Simd.Vector8us - 8 unsigned 16-bit shorts
The above are structs that occupy 16 bytes each, very
similar to equivalent types found in libraries like OpenTK.
Our library provides C# fallbacks for all of the
accelerated instructions. This means that if your code runs
on a machine that does not provide any SIMD support, or one of
the operations that you are using is not supported on your
machine, the code will continue to work correctly.
This also means that you can use the Mono.Simd API with
Microsoft's .NET on Windows to prototype and develop your
code, and then run it at full speed using Mono.
With every new generation of SIMD instructions, new
features are supported. To provide a seamless experience,
you can always use the same API and Mono will automatically
fall back to software implementations if the target processor
does not support the instructions.
For the sake of documentation, and to allow developers to
detect at runtime whether a particular method is hardware
accelerated, you can use the
Mono.Simd.SimdRuntime.IsMethodAccelerated method or look at
the [Acceleration] attribute on the methods.
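As a rough sketch of how such a runtime check might look, the program below asks whether Vector4f addition is accelerated and then runs the Move loop from the beginning of this post. It assumes that IsMethodAccelerated takes a type and a method name, that the C# operator + is looked up under its metadata name op_Addition, and that Vector4f exposes X, Y, Z and W properties; check the Mono.Simd documentation for the exact signatures. The AccelerationCheck class is made up for this example:

```csharp
using System;
using Mono.Simd;   // requires a reference to the Mono.Simd assembly

static class AccelerationCheck
{
    // Same loop as the Move example: add delta to every position in place.
    public static void Move (Vector4f [] pos, ref Vector4f delta)
    {
        for (int i = 0; i < pos.Length; i++)
            pos [i] += delta;
    }

    static void Main ()
    {
        // Assumption: operators are identified by their metadata names,
        // so operator + is "op_Addition".
        bool accelerated = SimdRuntime.IsMethodAccelerated (
            typeof (Vector4f), "op_Addition");
        Console.WriteLine ("Vector4f addition accelerated: {0}", accelerated);

        var pos = new Vector4f [] {
            new Vector4f (1, 2, 3, 4),
            new Vector4f (5, 6, 7, 8)
        };
        var delta = new Vector4f (1, 1, 1, 1);

        // Produces the same values whether or not SIMD is used.
        Move (pos, ref delta);

        foreach (Vector4f p in pos)
            Console.WriteLine ("{0} {1} {2} {3}", p.X, p.Y, p.Z, p.W);
    }
}
```

On a runtime or machine without SIMD support the check should report False, and the loop should still produce the same values through the C# fallbacks.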
The Speed Tests
When we were measuring the performance improvement of the
SIMD extensions we wrote our own home-grown tests and they
showed some nice improvements. But I wanted to implement a
real game workload and compare it to the non-accelerated case.
I picked a C++ implementation and did a straightforward
port to Mono.Simd without optimizing anything, to compare
SIMD against non-SIMD code. The result was surprising, as it
was even faster than the C++ version.
The tests are based on the C++ code from F# for Game
Development.
The source code for the above tests is available here.
I used the C++ version just because it piqued my curiosity.
If you use compiler-specific features in C++ to use SIMD
instructions you will likely improve the C++ performance
(please post the updated version and numbers if you do).
I would love to see whether Johann Deneux from the F# for Game
Development Blog could evaluate the performance of
Mono.Simd in his scenarios.
If you are curious and want to look at the assembly code
generated with or without the SIMD optimizations, run Mono
with the -v -v flags (yes, twice) and pass -O=simd or
-O=-simd to enable or disable them.
Presentation
You can watch the presentation to get some color on the
above discussion, check it in the Silverlight player, or get
it as PDF or PPTX.