Real-time graphics code is one of those areas of programming where you can never get enough performance. One such scenario was in an OpenGL application I’d been working on recently, where there was a lot of matrix multiplications going on in the inner core loop of the application.

I’d been using the excellent OpenTK library for the OpenGL support, which has some nice helper libraries that help with some of the common grunt work associated with developing these kinds of applications. One of these utility classes is a 4×4 matrix which includes all the standard methods you’d expect to find in such a class including one to multiply two matrices together. If you’re not familiar with matrix maths then don’t worry, all you need to know is that it is a quite computationally expensive operation involving a lot of floating point operations – **64 multiplications** and **48 additions** to be exact.

There are two variants of the method provided by the library; one as you’d normally expect taking in two *Matrix4* values and returning a new *Matrix4* result, and the second which takes two ref Parameters and returns the result as a third out parameter:

public static Matrix4 Mult(Matrix4 left, Matrix4 right) public static void Mult(ref Matrix4 left, ref Matrix4 right, out Matrix4 result)

The latter looks somewhat unusual but it’s quite useful since the *Matrix4* type is a value type and is by default copied each time it is used. The *ref* and *out* keywords instruct the compiler that the memory for the parameters and result are already allocated by the caller so no copies need to be made.

Internally the first method simply calls the second anyway, so the former is just a more natural form since that’s the standard way parameters and return values are passed.

In my code I was already using the latter, however a quick test of the performance of these two running one million iterations came out at:

Call | Duration |
---|---|

OpenTK.Matrix4.Mult(left, right) | 1,700ms |

OpenTK.Matrix4.Mult(ref left, ref right, out result) | 1,380ms |

It’s a little artificial to compare these two variants since the former calls the latter and can therefore only ever be slower, but at least we get a base against which performance improvements can be measured.

Taking a peek inside the latter revealed what it was up to internally:

Granted, there’s an awful lot going on in there, but the routine itself is pretty simple; just multiply a whole lot of numbers together, sum the results and store in an output structure. At a first glance there doesn’t seem to be much to optimise here, but looks can be deceiving.

My first impression was that the values from both matrices are being queried quite often (128 times), where as between the two matrices there are only a total of 32 unique values so each accessor is being invoked 4 times more than it needs to be. By pulling back all these values first then running the same calculations on the local values, we get the following version:

public static void MatrixMultCached(ref Matrix4 left, ref Matrix4 right, out Matrix4 result) { float lM11 = left.M11, lM12 = left.M12, lM13 = left.M13, lM14 = left.M14, lM21 = left.M21, lM22 = left.M22, lM23 = left.M23, lM24 = left.M24, lM31 = left.M31, lM32 = left.M32, lM33 = left.M33, lM34 = left.M34, lM41 = left.M41, lM42 = left.M42, lM43 = left.M43, lM44 = left.M44, rM11 = right.M11, rM12 = right.M12, rM13 = right.M13, rM14 = right.M14, rM21 = right.M21, rM22 = right.M22, rM23 = right.M23, rM24 = right.M24, rM31 = right.M31, rM32 = right.M32, rM33 = right.M33, rM34 = right.M34, rM41 = right.M41, rM42 = right.M42, rM43 = right.M43, rM44 = right.M44; result = new Matrix4( (((lM11 * rM11) + (lM12 * rM21)) + (lM13 * rM31)) + (lM14 * rM41), (((lM11 * rM12) + (lM12 * rM22)) + (lM13 * rM32)) + (lM14 * rM42), (((lM11 * rM13) + (lM12 * rM23)) + (lM13 * rM33)) + (lM14 * rM43), (((lM11 * rM14) + (lM12 * rM24)) + (lM13 * rM34)) + (lM14 * rM44), (((lM21 * rM11) + (lM22 * rM21)) + (lM23 * rM31)) + (lM24 * rM41), (((lM21 * rM12) + (lM22 * rM22)) + (lM23 * rM32)) + (lM24 * rM42), (((lM21 * rM13) + (lM22 * rM23)) + (lM23 * rM33)) + (lM24 * rM43), (((lM21 * rM14) + (lM22 * rM24)) + (lM23 * rM34)) + (lM24 * rM44), (((lM31 * rM11) + (lM32 * rM21)) + (lM33 * rM31)) + (lM34 * rM41), (((lM31 * rM12) + (lM32 * rM22)) + (lM33 * rM32)) + (lM34 * rM42), (((lM31 * rM13) + (lM32 * rM23)) + (lM33 * rM33)) + (lM34 * rM43), (((lM31 * rM14) + (lM32 * rM24)) + (lM33 * rM34)) + (lM34 * rM44), (((lM41 * rM11) + (lM42 * rM21)) + (lM43 * rM31)) + (lM44 * rM41), (((lM41 * rM12) + (lM42 * rM22)) + (lM43 * rM32)) + (lM44 * rM42), (((lM41 * rM13) + (lM42 * rM23)) + (lM43 * rM33)) + (lM44 * rM43), (((lM41 * rM14) + (lM42 * rM24)) + (lM43 * rM34)) + (lM44 * rM44)); }

Granted, the code looks a bit uglier but the performance boost is impressive for such a simple change:

Call | Duration |
---|---|

OpenTK.Matrix4.Mult(left, right) | 1,700ms |

OpenTK.Matrix4.Mult(ref left, ref right, out result) | 1,380ms |

EDais.MatrixMultCached(ref left, ref right, out result) | 560ms |

That’s a pretty impressive 2.5 times the performance for just pulling back a few values locally before doing exactly the same work on them!

By taking a closer look at the *Matrix4* structure itself, we can see something interesting about the way it’s stored:

public struct Matrix4 : IEquatable{ public Vector4 Row0; public Vector4 Row1; public Vector4 Row2; public Vector4 Row3; public Vector4 Column0 { get; } public Vector4 Column1 { get; } public Vector4 Column2 { get; } public Vector4 Column3 { get; } public float M11 { get; set; } public float M12 { get; set; } public float M13 { get; set; } public float M14 { get; set; } public float M21 { get; set; } public float M22 { get; set; } public float M23 { get; set; } public float M24 { get; set; } public float M31 { get; set; } public float M32 { get; set; } public float M33 { get; set; } public float M34 { get; set; } public float M41 { get; set; } public float M42 { get; set; } public float M43 { get; set; } public float M44 { get; set; } }

As you can see, its internal data is stored in four *Vector4* values (*Row0* through *Row3*), and there are helper accessor functions for column access (*Column0* through *Column3*) and individual component access (*M11* through *M44*). This means that every time the *M11* getter is called for example, the runtime has to make a call to the structure which goes off to the *Row0* value and pulls out it’s *.X* value. It may not seem like much, but let’s try going directly to the exposed data fields rather than via the properties:

public static void MatrixMultCachedRows(ref Matrix4 left, ref Matrix4 right, out Matrix4 result) { float lM11 = left.Row0.X, lM12 = left.Row0.Y, lM13 = left.Row0.Z, lM14 = left.Row0.W, lM21 = left.Row1.X, lM22 = left.Row1.Y, lM23 = left.Row1.Z, lM24 = left.Row1.W, lM31 = left.Row2.X, lM32 = left.Row2.Y, lM33 = left.Row2.Z, lM34 = left.Row2.W, lM41 = left.Row3.X, lM42 = left.Row3.Y, lM43 = left.Row3.Z, lM44 = left.Row3.W, rM11 = right.Row0.X, rM12 = right.Row0.Y, rM13 = right.Row0.Z, rM14 = right.Row0.W, rM21 = right.Row1.X, rM22 = right.Row1.Y, rM23 = right.Row1.Z, rM24 = right.Row1.W, rM31 = right.Row2.X, rM32 = right.Row2.Y, rM33 = right.Row2.Z, rM34 = right.Row2.W, rM41 = right.Row3.X, rM42 = right.Row3.Y, rM43 = right.Row3.Z, rM44 = right.Row3.W; result = new Matrix4( (((lM11 * rM11) + (lM12 * rM21)) + (lM13 * rM31)) + (lM14 * rM41), (((lM11 * rM12) + (lM12 * rM22)) + (lM13 * rM32)) + (lM14 * rM42), (((lM11 * rM13) + (lM12 * rM23)) + (lM13 * rM33)) + (lM14 * rM43), (((lM11 * rM14) + (lM12 * rM24)) + (lM13 * rM34)) + (lM14 * rM44), (((lM21 * rM11) + (lM22 * rM21)) + (lM23 * rM31)) + (lM24 * rM41), (((lM21 * rM12) + (lM22 * rM22)) + (lM23 * rM32)) + (lM24 * rM42), (((lM21 * rM13) + (lM22 * rM23)) + (lM23 * rM33)) + (lM24 * rM43), (((lM21 * rM14) + (lM22 * rM24)) + (lM23 * rM34)) + (lM24 * rM44), (((lM31 * rM11) + (lM32 * rM21)) + (lM33 * rM31)) + (lM34 * rM41), (((lM31 * rM12) + (lM32 * rM22)) + (lM33 * rM32)) + (lM34 * rM42), (((lM31 * rM13) + (lM32 * rM23)) + (lM33 * rM33)) + (lM34 * rM43), (((lM31 * rM14) + (lM32 * rM24)) + (lM33 * rM34)) + (lM34 * rM44), (((lM41 * rM11) + (lM42 * rM21)) + (lM43 * rM31)) + (lM44 * rM41), (((lM41 * rM12) + (lM42 * rM22)) + (lM43 * rM32)) + (lM44 * rM42), (((lM41 * rM13) + (lM42 * rM23)) + (lM43 * rM33)) + (lM44 * rM43), (((lM41 * rM14) + (lM42 * rM24)) + (lM43 * rM34)) + (lM44 * rM44)); }

Call | Duration |
---|---|

OpenTK.Matrix4.Mult(left, right) | 1,700ms |

OpenTK.Matrix4.Mult(ref left, ref right, out result) | 1,380ms |

EDais.MatrixMultCached(ref left, ref right, out result) | 560ms |

EDais.MatrixMultCachedRows(ref left, ref right, out result) | 370ms |

Another significant boost in performance by simply using a more efficient way of getting at the data.

Up until now we’ve only been looking at the reading of data, however there is also the corresponding side of it where the result is written to the out variable; perhaps there are some performance wins there?

Since the reading code was vastly improved by going directly to the exposed data fields, lets try the same thing for the writing. This version writes into the Rows directly rather than creating a whole new Matrix structure:

public static void MatrixMultCachedRowsWriteRows(ref Matrix4 left, ref Matrix4 right, out Matrix4 result) { float lM11 = left.Row0.X, lM12 = left.Row0.Y, lM13 = left.Row0.Z, lM14 = left.Row0.W, lM21 = left.Row1.X, lM22 = left.Row1.Y, lM23 = left.Row1.Z, lM24 = left.Row1.W, lM31 = left.Row2.X, lM32 = left.Row2.Y, lM33 = left.Row2.Z, lM34 = left.Row2.W, lM41 = left.Row3.X, lM42 = left.Row3.Y, lM43 = left.Row3.Z, lM44 = left.Row3.W, rM11 = right.Row0.X, rM12 = right.Row0.Y, rM13 = right.Row0.Z, rM14 = right.Row0.W, rM21 = right.Row1.X, rM22 = right.Row1.Y, rM23 = right.Row1.Z, rM24 = right.Row1.W, rM31 = right.Row2.X, rM32 = right.Row2.Y, rM33 = right.Row2.Z, rM34 = right.Row2.W, rM41 = right.Row3.X, rM42 = right.Row3.Y, rM43 = right.Row3.Z, rM44 = right.Row3.W; result.Row0 = new Vector4( (((lM11 * rM11) + (lM12 * rM21)) + (lM13 * rM31)) + (lM14 * rM41), (((lM11 * rM12) + (lM12 * rM22)) + (lM13 * rM32)) + (lM14 * rM42), (((lM11 * rM13) + (lM12 * rM23)) + (lM13 * rM33)) + (lM14 * rM43), (((lM11 * rM14) + (lM12 * rM24)) + (lM13 * rM34)) + (lM14 * rM44)); result.Row1 = new Vector4( (((lM21 * rM11) + (lM22 * rM21)) + (lM23 * rM31)) + (lM24 * rM41), (((lM21 * rM12) + (lM22 * rM22)) + (lM23 * rM32)) + (lM24 * rM42), (((lM21 * rM13) + (lM22 * rM23)) + (lM23 * rM33)) + (lM24 * rM43), (((lM21 * rM14) + (lM22 * rM24)) + (lM23 * rM34)) + (lM24 * rM44)); result.Row2 = new Vector4( (((lM31 * rM11) + (lM32 * rM21)) + (lM33 * rM31)) + (lM34 * rM41), (((lM31 * rM12) + (lM32 * rM22)) + (lM33 * rM32)) + (lM34 * rM42), (((lM31 * rM13) + (lM32 * rM23)) + (lM33 * rM33)) + (lM34 * rM43), (((lM31 * rM14) + (lM32 * rM24)) + (lM33 * rM34)) + (lM34 * rM44)); result.Row3 = new Vector4( (((lM41 * rM11) + (lM42 * rM21)) + (lM43 * rM31)) + (lM44 * rM41), (((lM41 * rM12) + (lM42 * rM22)) + (lM43 * rM32)) + (lM44 * rM42), (((lM41 * rM13) + (lM42 * rM23)) + (lM43 * rM33)) + (lM44 * rM43), (((lM41 * rM14) + (lM42 * rM24)) + (lM43 * rM34)) + (lM44 * rM44)); }

Call | Duration |
---|---|

OpenTK.Matrix4.Mult(left, right) | 1,700ms |

OpenTK.Matrix4.Mult(ref left, ref right, out result) | 1,380ms |

EDais.MatrixMultCached(ref left, ref right, out result) | 560ms |

EDais.MatrixMultCachedRows(ref left, ref right, out result) | 370ms |

EDais.MatrixMultCachedRowsWriteRows(ref left, ref right, out result) | 295ms |

A more modest speed improvement, but still an improvement.

The writing code is still having to construct new *Vector4*‘s to push into the result, so instead let’s try just writing directly into the structures that are already there:

public static void MatrixMultCachedRowsWriteVals(ref Matrix4 left, ref Matrix4 right, out Matrix4 result) { float lM11 = left.Row0.X, lM12 = left.Row0.Y, lM13 = left.Row0.Z, lM14 = left.Row0.W, lM21 = left.Row1.X, lM22 = left.Row1.Y, lM23 = left.Row1.Z, lM24 = left.Row1.W, lM31 = left.Row2.X, lM32 = left.Row2.Y, lM33 = left.Row2.Z, lM34 = left.Row2.W, lM41 = left.Row3.X, lM42 = left.Row3.Y, lM43 = left.Row3.Z, lM44 = left.Row3.W, rM11 = right.Row0.X, rM12 = right.Row0.Y, rM13 = right.Row0.Z, rM14 = right.Row0.W, rM21 = right.Row1.X, rM22 = right.Row1.Y, rM23 = right.Row1.Z, rM24 = right.Row1.W, rM31 = right.Row2.X, rM32 = right.Row2.Y, rM33 = right.Row2.Z, rM34 = right.Row2.W, rM41 = right.Row3.X, rM42 = right.Row3.Y, rM43 = right.Row3.Z, rM44 = right.Row3.W; result.Row0.X = (((lM11 * rM11) + (lM12 * rM21)) + (lM13 * rM31)) + (lM14 * rM41); result.Row0.Y = (((lM11 * rM12) + (lM12 * rM22)) + (lM13 * rM32)) + (lM14 * rM42); result.Row0.Z = (((lM11 * rM13) + (lM12 * rM23)) + (lM13 * rM33)) + (lM14 * rM43); result.Row0.W = (((lM11 * rM14) + (lM12 * rM24)) + (lM13 * rM34)) + (lM14 * rM44); result.Row1.X = (((lM21 * rM11) + (lM22 * rM21)) + (lM23 * rM31)) + (lM24 * rM41); result.Row1.Y = (((lM21 * rM12) + (lM22 * rM22)) + (lM23 * rM32)) + (lM24 * rM42); result.Row1.Z = (((lM21 * rM13) + (lM22 * rM23)) + (lM23 * rM33)) + (lM24 * rM43); result.Row1.W = (((lM21 * rM14) + (lM22 * rM24)) + (lM23 * rM34)) + (lM24 * rM44); result.Row2.X = (((lM31 * rM11) + (lM32 * rM21)) + (lM33 * rM31)) + (lM34 * rM41); result.Row2.Y = (((lM31 * rM12) + (lM32 * rM22)) + (lM33 * rM32)) + (lM34 * rM42); result.Row2.Z = (((lM31 * rM13) + (lM32 * rM23)) + (lM33 * rM33)) + (lM34 * rM43); result.Row2.W = (((lM31 * rM14) + (lM32 * rM24)) + (lM33 * rM34)) + (lM34 * rM44); result.Row3.X = (((lM41 * rM11) + (lM42 * rM21)) + (lM43 * rM31)) + (lM44 * rM41); result.Row3.Y = (((lM41 * rM12) + (lM42 * rM22)) + (lM43 * rM32)) + (lM44 * rM42); result.Row3.Z = (((lM41 * rM13) + (lM42 * rM23)) + (lM43 * rM33)) + (lM44 * rM43); result.Row3.W = (((lM41 * rM14) + (lM42 * rM24)) + (lM43 * rM34)) + (lM44 * rM44); }

Call | Duration |
---|---|

OpenTK.Matrix4.Mult(left, right) | 1,700ms |

OpenTK.Matrix4.Mult(ref left, ref right, out result) | 1,380ms |

EDais.MatrixMultCached(ref left, ref right, out result) | 560ms |

EDais.MatrixMultCachedRows(ref left, ref right, out result) | 370ms |

EDais.MatrixMultCachedRowsWriteRows(ref left, ref right, out result) | 295ms |

EDais.MatrixMultCachedRowsWriteVals(ref left, ref right, out result) | 195ms |

This time the results are even more striking, and the net result is an over 7 times speed improvement for doing nothing more than changing the way data is read and written – exactly the same calculations are being performed and exactly the same results are returned.

Another way of looking at this which will probably appeal more to graphics coders, is if the original code was running at 30 FPS, then if matrix multiplications were the only bottleneck this new version would be running at a little over 260 FPS! In reality though this will probably only add up to a reasonably small amount of the overall performance, but it all helps.

This optimization is now implemented in svn trunk, thanks!

I’m actually really surprised that the runtime can’t do these kind of inlining and parameter passing optimizations automatically. Especially the first change where you move the matrix elements into local variables – I don’t see what is preventing the JIT from recognising that the property is a simple accessor that will return the same value every time and caching the result?

Also, I’ve never liked the necessity of passing value types by ref. Surely the runtime could have some kind of ‘pass large value types by const ref’ optimization that’s automatic and invisible? The compiler could detect that a value type parameter is read-only in a method and mark the signature appropriately. And C++ compilers have been doing return value optimizations for a long time.

A JIT has more information available to do optimization with than a static compiler – but there isn’t even any control flow or polymorphism to worry about in this example so your changes are simple static optimizations that I hope native compilers would be doing off the bat.

Well done for getting such a massive speedup. But it just shows how disappointing it is that these kinds of micro-optimizations have to be done by hand.

Thankfully, this is easy to correct for. When computing matrices for pre-multiplication you simply need to transpose the matrix you’d have used for post-multiplication. This also makes sense since you are transposing the vectors, going from the 4×1 column vector to the 1×4 row vector.