Python managed Win32 GDI resources

A quick note before we begin: whilst this article specifically describes a way of writing cleaner and safer Win32 GDI code in Python, the same techniques can be applied to any API that requires manual resource management.

A task came up recently which required using some old Win32 GDI APIs for maintenance of a legacy application.  There are various examples online of using Python's ctypes library to access the Win32 APIs, however the code tends to be a straight port of C-style code that doesn't take advantage of any pythonic niceties.

A typical construct seen throughout GDI code is the create, use, destroy cycle, e.g:

from ctypes import windll
hDC = windll.gdi32.CreateCompatibleDC(0) # Create
# Use Device Context here
windll.gdi32.DeleteDC(hDC) # Destroy

One issue with this is that different GDI constructs use different clean-up calls.  For example, the handle returned by GetDC must be disposed of with ReleaseDC, whereas the handle created by CreateBitmap should be disposed of with DeleteObject.  Making sure the correct clean-up method is called for each resource type becomes harder and harder as the complexity of the code increases.

Since the functions are opaque calls into DLLs, there is no code hinting available when writing the code, which requires either an encyclopaedic knowledge of the Win32 API, or the GDI documentation permanently open on a second screen.

Additionally, there are calls like SelectObject, which is used to bind a GDI resource with a Device Context, that can create subtle potential for GDI leaks if not handled correctly.

For instance, you’ll often find code examples like this online:

hPen = windll.gdi32.CreatePen(PS_SOLID, 3, 0xFF0000)
windll.gdi32.SelectObject(hDC, hPen)
# Use pen here
windll.gdi32.DeleteObject(hPen) # Leak! Still selected into the DC, pen isn’t deleted!

The correct way to handle DC object selection is to retain the handle returned by the initial SelectObject call (which will typically be a GDI stock object on a fresh DC), then once the GDI object is no longer required, re-select that handle back into the DC before deleting it.

hPen = windll.gdi32.CreatePen(PS_SOLID, 3, 0xFF0000)
hOldPen = windll.gdi32.SelectObject(hDC, hPen)
# Use pen here
windll.gdi32.SelectObject(hDC, hOldPen) # De-select
windll.gdi32.DeleteObject(hPen) # Then delete

All this clean-up code can get quite verbose, which makes it more difficult to reason about what the code is doing in amongst all the resource management and obscure GDI hoops that must be jumped through.

Finally, error handling with this sort of code can become tricky and is itself a source of GDI leaks: unless the GDI calls are unwound in the correct order, resources can be left hanging around if an exception occurs.

So whilst waiting for inspiration to strike and getting on with writing another part of the application that dealt with reading files, I was writing some standard file access code e.g.:

with open("file.xyz", "r") as f:
	fileData = f.read()

It occurred to me that although I’d used the with statement to ensure that the file is properly closed in case of error, I’d never really investigated how it worked.  It turns out that it uses something called a context manager, which handles entry into and exit from a runtime context.

A quick check of the documentation reveals that implementing our own context manager is straightforward.  A rudimentary example would be:

class MyScope(object):
	def __enter__(self):
		print("Entering scope!")
	
	def __exit__(self, type, value, tb):
		print("Exiting scope!")

with MyScope():
	print("In scope")

This prints:

Entering scope!
In scope
Exiting scope!

Nice, so now we can execute code when entering and exiting a scope block.  What’s really nice, though, is that the exit code will still be called even if an error is raised, similar to a try / finally block:

with MyScope():
	raise Exception("nope")

This outputs:

Entering scope!
Exiting scope!
Exception: nope

As we can see, the exit block gets called before the exception is bubbled up to the caller.
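As an aside, the same behaviour can be had with less ceremony via the standard library’s contextlib.contextmanager decorator, which turns a generator function into a context manager.  A minimal equivalent of MyScope (the generator-based form is my own sketch, not from the article’s original code) might look like this:

```python
from contextlib import contextmanager

@contextmanager
def my_scope():
    print("Entering scope!")   # runs on entry, like __enter__
    try:
        yield                  # the body of the with block executes here
    finally:
        print("Exiting scope!")  # runs on exit, even if an exception was raised

with my_scope():
    print("In scope")
```

The try / finally around the yield is what guarantees the clean-up runs when an exception propagates out of the with block.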

So, with that little aside into context managers out of the way, is there a way these two concepts can be tied back together, and the narrative of this article brought back on track?  Thankfully, yes.

So back in GDI world, the familiar create, use, destroy cycle maps well onto the enter, use, exit flow of a context manager, so let’s try merging them together.  A new utility class can be built for each pair of calls: it takes the input parameters in its constructor, calls the creation function in __enter__, and the corresponding destruction function in __exit__:

class CreateCompatibleDC(object):
	"""Context manager for GDI32 CreateCompatibleDC function"""
	def __init__(self, hRefDC=0):
		self.hRefDC = hRefDC
	
	def __enter__(self):
		print("Create compatible DC")
		self.hDC = windll.gdi32.CreateCompatibleDC(self.hRefDC)
		return self.hDC
	
	def __exit__(self, type, value, tb):
		print("Delete DC")
		windll.gdi32.DeleteDC(self.hDC)

The class can now be utilised in the calling code like this:

with CreateCompatibleDC(0) as hDC:
	print("Use DC: %d" % hDC)

which outputs:

Create compatible DC
Use DC: -1157555815
Delete DC

Great, it looks like the calls are all being made in the right places, and the code is much simpler without all the boilerplate GDI stuff getting in the way.  The Python class also offers code completion in the IDE for the parameter(s) required, and it even cleans up properly on exceptions!

Of course, in any production code the print messages wouldn’t be used, they’re just there to show that the code is doing the right things at the right times.

This same technique can then be used for other calls too, for example:

class CreatePen(object):
	"""Context manager for GDI32 CreatePen function"""
	def __init__(self, style, width, colour):
		self.style = style
		self.width = width
		self.colour = colour
	
	def __enter__(self):
		print("Create pen")
		self.hPen = windll.gdi32.CreatePen(self.style, self.width, self.colour)
		return self.hPen
	
	def __exit__(self, type, value, tb):
		print("Delete pen")
		windll.gdi32.DeleteObject(self.hPen)

class SelectObject(object):
	"""Context manager for GDI32 SelectObject function"""
	def __init__(self, hDC, hObj):
		self.hDC = hDC
		self.hObj = hObj
	
	def __enter__(self):
		print("Select object")
		self.hOldObj = windll.gdi32.SelectObject(self.hDC, self.hObj)
		return self.hOldObj
	
	def __exit__(self, type, value, tb):
		print("Deselect object")
		windll.gdi32.SelectObject(self.hDC, self.hOldObj)

Now the code from the beginning of the article can be written more simply as:

with CreateCompatibleDC() as hDC:
	with CreatePen(PS_SOLID, 3, 0xFF0000) as hPen:
		with SelectObject(hDC, hPen):
			print("Use Pen: %d in DC: %d" % (hPen, hDC))

Note here that we’re now utilising the default value specified in the CreateCompatibleDC class constructor, and that the handle returned by the SelectObject call is ignored since it’s only useful internally for clean-up.

This outputs:

Create compatible DC
Create pen
Select object
Use Pen: -399500450 in DC: 553719962
Deselect object
Delete pen
Delete DC

Since this is starting to resemble a pyramid of doom, all those with statements can be collapsed into one using backslash line-continuation characters, and they are still executed in order even in the case of an error:

with CreateCompatibleDC() as hDC, \
	CreatePen(PS_SOLID, 3, 0xFF0000) as hPen, \
	SelectObject(hDC, hPen):
	raise Exception("testing")

This prints:

Create compatible DC
Create pen
Select object
Deselect object
Delete pen
Delete DC
Exception: testing

So as we can see, the clean-up code gets executed in order, so even if errors occur there is a guarantee that no GDI objects will get leaked.
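Taking the idea one step further, since every one of these wrappers follows the same create/destroy shape, a small factory function (a hypothetical helper of my own, not from any library) can generate them from a pair of callables using contextlib.contextmanager:

```python
from contextlib import contextmanager

def managed_resource(create, destroy):
    """Build a context manager from a matching create/destroy pair.
    `create` is called with the arguments given at the with statement,
    and `destroy` is guaranteed to run on the handle afterwards."""
    @contextmanager
    def scope(*args):
        handle = create(*args)
        try:
            yield handle
        finally:
            destroy(handle)
    return scope

# e.g. a DC wrapper could then be declared in one line:
# CompatibleDC = managed_resource(windll.gdi32.CreateCompatibleDC,
#                                 windll.gdi32.DeleteDC)
```

This keeps the pairing of each creation function with its correct clean-up function in exactly one place, which was the original problem being solved.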

Hopefully this technique can be utilised in your own code to help manage resources, clean up code and prevent unmanaged resource leaks.

Simulating 16-bit (65536) colour in Photoshop

A task recently came up which required graphic assets in 16-bit RGB 565 format for some legacy embedded hardware. In case you’re not familiar with the terminology, 565 refers to the number of bits used to store each colour channel: 5 for the Red and Blue channels, and 6 for the Green channel.

The reason for the unbalanced distribution of bits is that packing a pixel into 15 bits is inefficient since pixels don’t align on byte boundaries, so an extra bit is added in order to make each pixel byte-aligned. This additional bit is sometimes just wasted (xRGB 1555), sometimes used to provide a basic boolean opacity mask (ARGB 1555) and sometimes added to the green channel (RGB 565) since the photoreceptor cells in the human eye are more sensitive to the yellowish-green part of the spectrum.
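To make the bit layout concrete, here’s a small illustrative sketch (the function name is my own) of packing 8-bit RGB values into a 16-bit RGB 565 word by truncating each channel’s low bits:

```python
def pack_rgb565(r, g, b):
    """Pack 8-bit-per-channel RGB into one 16-bit word, laid out RRRRRGGGGGGBBBBB."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)
```

White packs to 0xFFFF, pure red to 0xF800, pure green to 0x07E0 and pure blue to 0x001F, showing the green channel spanning the middle 6 bits.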

The BMP file format supports 565 mode natively, but having to repeatedly save the design in order to preview the loss in colour resolution was a pretty poor workflow, so I came up with a reasonable way to simulate it live. It’s not pixel-perfect compared to what the Bitmap export module produces, but it’s pretty close.

This technique makes use of the Posterise adjustment layer, which simply clamps input values to a predefined number of levels. The Red and Blue channels are stored in 5 bits each, giving a maximum of 32 distinct values (2^5), whilst the Green channel gets the extra bit, doubling its possible values to 64 (2^6).
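Those same level counts can be checked numerically: quantising each channel to its 565 precision and expanding back to 8 bits approximates what the posterise layers do. This is a sketch under the assumption of simple bit truncation, with the high bits replicated into the low bits on expansion (the function name is my own):

```python
def preview_565(r, g, b):
    """Round-trip 8-bit RGB through RGB565 precision:
    32 levels for red/blue (5 bits), 64 levels for green (6 bits)."""
    r5, g6, b5 = r >> 3, g >> 2, b >> 3
    # Replicate the high bits into the low bits so 255 maps back to 255
    return ((r5 << 3) | (r5 >> 2),
            (g6 << 2) | (g6 >> 4),
            (b5 << 3) | (b5 >> 2))
```

Note that the actual rounding Photoshop and the Bitmap exporter apply may differ slightly, as discussed later.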

Photoshop Advanced Layer Blending
Locking the first Posterise filter to Red and Blue channels

By default the Posterise adjustment layers work across all channels equally, but by tweaking the advanced blending options of the layer they can be targeted to operate only on specific channels.

These controls can be found in the Blending options for the layer.


The trick therefore is to use 2 Posterise filters; the first should be set to 32 levels and set to operate on the Red and Blue channels, the second should be set to 64 levels and set to operate only on the Green channel.

Grouped posterise adjustment layers
The two posterise filters work in parallel to filter the channels separately

Grouping the two adjustment layers together effectively creates a custom togglable effect, which makes it easy to view either the original artwork or the simulated result by switching the visibility of the group.


Here are the results of the preview against a basic test pattern:

RGB888
Original test image

555Bitmap
RGB555 Bitmap

555Preview
RGB555 Preview

565Bitmap
RGB565 Bitmap

565Preview
RGB565 Preview

Here’s a side-by-side comparison of an enlarged segment of the test images which makes it easier to see the results:

Comparison
Side by side comparison of the different modes

The slight difference in the output most likely comes down to rounding methods; by the looks of it, the Bitmap output module rounds to the nearest integer whereas the Posterise filter rounds up, but it’s a near enough approximation.

You can grab the 16-bit preview PSD, which contains both the RGB555 and RGB565 previews.

Making OpenTK Matrix multiplications faster – An optimisation exercise in C#

Real-time graphics code is one of those areas of programming where you can never get enough performance. One such scenario was in an OpenGL application I’d been working on recently, where there were a lot of matrix multiplications going on in the inner loop of the application.
I’d been using the excellent OpenTK library for the OpenGL support, which includes some nice utility classes that take care of the common grunt work associated with developing these kinds of applications. One of these is a 4×4 matrix class, which includes all the standard methods you’d expect to find in such a class, including one to multiply two matrices together. If you’re not familiar with matrix maths then don’t worry; all you need to know is that it’s a quite computationally expensive operation involving a lot of floating point operations: 64 multiplications and 48 additions, to be exact.
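For reference, the row-times-column definition can be sketched in a few lines (a plain Python sketch, not the OpenTK code): each of the 16 output cells is a dot product costing 4 multiplications and 3 additions, which is where the 64 and 48 totals come from.

```python
def mat4_mult(a, b):
    """Multiply two 4x4 matrices given as lists of rows:
    16 cells x 4 mults = 64 multiplications, 16 cells x 3 adds = 48 additions."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]
```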

There are two variants of the method provided by the library; one as you’d normally expect, taking in two Matrix4 values and returning a new Matrix4 result, and a second which takes two ref parameters and returns the result as a third out parameter:

public static Matrix4 Mult(Matrix4 left, Matrix4 right)
public static void Mult(ref Matrix4 left, ref Matrix4 right, out Matrix4 result)

The latter looks somewhat unusual but it’s quite useful since the Matrix4 type is a value type and is by default copied each time it is used. The ref and out keywords instruct the compiler that the memory for the parameters and result are already allocated by the caller so no copies need to be made.
Internally the first method simply calls the second anyway, so the former is just a more natural form since that’s the standard way parameters and return values are passed.

In my code I was already using the latter, however a quick test of the performance of these two running one million iterations came out at:

Call                                                                    Duration
OpenTK.Matrix4.Mult(left, right)                                        1,700ms
OpenTK.Matrix4.Mult(ref left, ref right, out result)                    1,380ms

It’s a little artificial to compare these two variants since the former calls the latter and can therefore only ever be slower, but at least we get a base against which performance improvements can be measured.
Taking a peek inside the latter revealed what it was up to internally:

public static void Mult(ref Matrix4 left, ref Matrix4 right, out Matrix4 result) {
    result = new Matrix4(
        left.M11 * right.M11 + left.M12 * right.M21 + left.M13 * right.M31 + left.M14 * right.M41,
        left.M11 * right.M12 + left.M12 * right.M22 + left.M13 * right.M32 + left.M14 * right.M42,
        left.M11 * right.M13 + left.M12 * right.M23 + left.M13 * right.M33 + left.M14 * right.M43,
        left.M11 * right.M14 + left.M12 * right.M24 + left.M13 * right.M34 + left.M14 * right.M44,
        left.M21 * right.M11 + left.M22 * right.M21 + left.M23 * right.M31 + left.M24 * right.M41,
        left.M21 * right.M12 + left.M22 * right.M22 + left.M23 * right.M32 + left.M24 * right.M42,
        left.M21 * right.M13 + left.M22 * right.M23 + left.M23 * right.M33 + left.M24 * right.M43,
        left.M21 * right.M14 + left.M22 * right.M24 + left.M23 * right.M34 + left.M24 * right.M44,
        left.M31 * right.M11 + left.M32 * right.M21 + left.M33 * right.M31 + left.M34 * right.M41,
        left.M31 * right.M12 + left.M32 * right.M22 + left.M33 * right.M32 + left.M34 * right.M42,
        left.M31 * right.M13 + left.M32 * right.M23 + left.M33 * right.M33 + left.M34 * right.M43,
        left.M31 * right.M14 + left.M32 * right.M24 + left.M33 * right.M34 + left.M34 * right.M44,
        left.M41 * right.M11 + left.M42 * right.M21 + left.M43 * right.M31 + left.M44 * right.M41,
        left.M41 * right.M12 + left.M42 * right.M22 + left.M43 * right.M32 + left.M44 * right.M42,
        left.M41 * right.M13 + left.M42 * right.M23 + left.M43 * right.M33 + left.M44 * right.M43,
        left.M41 * right.M14 + left.M42 * right.M24 + left.M43 * right.M34 + left.M44 * right.M44);
}

Granted, there’s an awful lot going on in there, but the routine itself is pretty simple; just multiply a whole lot of numbers together, sum the results and store in an output structure. At a first glance there doesn’t seem to be much to optimise here, but looks can be deceiving.

My first impression was that the values from both matrices are being queried quite often (128 reads), whereas between the two matrices there are only 32 unique values in total, so each accessor is being invoked 4 times more than it needs to be. By pulling all these values into locals first and then running the same calculations on them, we get the following version:

public static void MatrixMultCached(ref Matrix4 left, ref Matrix4 right, out Matrix4 result) {
    float lM11 = left.M11, lM12 = left.M12, lM13 = left.M13, lM14 = left.M14,
        lM21 = left.M21, lM22 = left.M22, lM23 = left.M23, lM24 = left.M24,
        lM31 = left.M31, lM32 = left.M32, lM33 = left.M33, lM34 = left.M34,
        lM41 = left.M41, lM42 = left.M42, lM43 = left.M43, lM44 = left.M44,
        rM11 = right.M11, rM12 = right.M12, rM13 = right.M13, rM14 = right.M14,
        rM21 = right.M21, rM22 = right.M22, rM23 = right.M23, rM24 = right.M24,
        rM31 = right.M31, rM32 = right.M32, rM33 = right.M33, rM34 = right.M34,
        rM41 = right.M41, rM42 = right.M42, rM43 = right.M43, rM44 = right.M44;

    result = new Matrix4(
        (((lM11 * rM11) + (lM12 * rM21)) + (lM13 * rM31)) + (lM14 * rM41),
        (((lM11 * rM12) + (lM12 * rM22)) + (lM13 * rM32)) + (lM14 * rM42),
        (((lM11 * rM13) + (lM12 * rM23)) + (lM13 * rM33)) + (lM14 * rM43),
        (((lM11 * rM14) + (lM12 * rM24)) + (lM13 * rM34)) + (lM14 * rM44),
        (((lM21 * rM11) + (lM22 * rM21)) + (lM23 * rM31)) + (lM24 * rM41),
        (((lM21 * rM12) + (lM22 * rM22)) + (lM23 * rM32)) + (lM24 * rM42),
        (((lM21 * rM13) + (lM22 * rM23)) + (lM23 * rM33)) + (lM24 * rM43),
        (((lM21 * rM14) + (lM22 * rM24)) + (lM23 * rM34)) + (lM24 * rM44),
        (((lM31 * rM11) + (lM32 * rM21)) + (lM33 * rM31)) + (lM34 * rM41),
        (((lM31 * rM12) + (lM32 * rM22)) + (lM33 * rM32)) + (lM34 * rM42),
        (((lM31 * rM13) + (lM32 * rM23)) + (lM33 * rM33)) + (lM34 * rM43),
        (((lM31 * rM14) + (lM32 * rM24)) + (lM33 * rM34)) + (lM34 * rM44),
        (((lM41 * rM11) + (lM42 * rM21)) + (lM43 * rM31)) + (lM44 * rM41),
        (((lM41 * rM12) + (lM42 * rM22)) + (lM43 * rM32)) + (lM44 * rM42),
        (((lM41 * rM13) + (lM42 * rM23)) + (lM43 * rM33)) + (lM44 * rM43),
        (((lM41 * rM14) + (lM42 * rM24)) + (lM43 * rM34)) + (lM44 * rM44));
}

Granted, the code looks a bit uglier but the performance boost is impressive for such a simple change:

Call                                                                    Duration
OpenTK.Matrix4.Mult(left, right)                                        1,700ms
OpenTK.Matrix4.Mult(ref left, ref right, out result)                    1,380ms
EDais.MatrixMultCached(ref left, ref right, out result)                 560ms

That’s a pretty impressive 2.5 times the performance for just pulling back a few values locally before doing exactly the same work on them!

By taking a closer look at the Matrix4 structure itself, we can see something interesting about the way it’s stored:

public struct Matrix4 : IEquatable<Matrix4> {
    public Vector4 Row0;
    public Vector4 Row1;
    public Vector4 Row2;
    public Vector4 Row3;

    public Vector4 Column0 { get; }
    public Vector4 Column1 { get; }
    public Vector4 Column2 { get; }
    public Vector4 Column3 { get; }

    public float M11 { get; set; }
    public float M12 { get; set; }
    public float M13 { get; set; }
    public float M14 { get; set; }
    public float M21 { get; set; }
    public float M22 { get; set; }
    public float M23 { get; set; }
    public float M24 { get; set; }
    public float M31 { get; set; }
    public float M32 { get; set; }
    public float M33 { get; set; }
    public float M34 { get; set; }
    public float M41 { get; set; }
    public float M42 { get; set; }
    public float M43 { get; set; }
    public float M44 { get; set; }
}

As you can see, its internal data is stored in four Vector4 values (Row0 through Row3), with helper properties for column access (Column0 through Column3) and individual component access (M11 through M44). This means that every time the M11 getter is called, for example, the runtime has to make a call into the structure which goes off to the Row0 value and pulls out its .X value. It may not seem like much, but let’s try going directly to the exposed data fields rather than via the properties:

public static void MatrixMultCachedRows(ref Matrix4 left, ref Matrix4 right, out Matrix4 result) {
    float lM11 = left.Row0.X, lM12 = left.Row0.Y, lM13 = left.Row0.Z, lM14 = left.Row0.W,
        lM21 = left.Row1.X, lM22 = left.Row1.Y, lM23 = left.Row1.Z, lM24 = left.Row1.W,
        lM31 = left.Row2.X, lM32 = left.Row2.Y, lM33 = left.Row2.Z, lM34 = left.Row2.W,
        lM41 = left.Row3.X, lM42 = left.Row3.Y, lM43 = left.Row3.Z, lM44 = left.Row3.W,
        rM11 = right.Row0.X, rM12 = right.Row0.Y, rM13 = right.Row0.Z, rM14 = right.Row0.W,
        rM21 = right.Row1.X, rM22 = right.Row1.Y, rM23 = right.Row1.Z, rM24 = right.Row1.W,
        rM31 = right.Row2.X, rM32 = right.Row2.Y, rM33 = right.Row2.Z, rM34 = right.Row2.W,
        rM41 = right.Row3.X, rM42 = right.Row3.Y, rM43 = right.Row3.Z, rM44 = right.Row3.W;

    result = new Matrix4(
        (((lM11 * rM11) + (lM12 * rM21)) + (lM13 * rM31)) + (lM14 * rM41),
        (((lM11 * rM12) + (lM12 * rM22)) + (lM13 * rM32)) + (lM14 * rM42),
        (((lM11 * rM13) + (lM12 * rM23)) + (lM13 * rM33)) + (lM14 * rM43),
        (((lM11 * rM14) + (lM12 * rM24)) + (lM13 * rM34)) + (lM14 * rM44),
        (((lM21 * rM11) + (lM22 * rM21)) + (lM23 * rM31)) + (lM24 * rM41),
        (((lM21 * rM12) + (lM22 * rM22)) + (lM23 * rM32)) + (lM24 * rM42),
        (((lM21 * rM13) + (lM22 * rM23)) + (lM23 * rM33)) + (lM24 * rM43),
        (((lM21 * rM14) + (lM22 * rM24)) + (lM23 * rM34)) + (lM24 * rM44),
        (((lM31 * rM11) + (lM32 * rM21)) + (lM33 * rM31)) + (lM34 * rM41),
        (((lM31 * rM12) + (lM32 * rM22)) + (lM33 * rM32)) + (lM34 * rM42),
        (((lM31 * rM13) + (lM32 * rM23)) + (lM33 * rM33)) + (lM34 * rM43),
        (((lM31 * rM14) + (lM32 * rM24)) + (lM33 * rM34)) + (lM34 * rM44),
        (((lM41 * rM11) + (lM42 * rM21)) + (lM43 * rM31)) + (lM44 * rM41),
        (((lM41 * rM12) + (lM42 * rM22)) + (lM43 * rM32)) + (lM44 * rM42),
        (((lM41 * rM13) + (lM42 * rM23)) + (lM43 * rM33)) + (lM44 * rM43),
        (((lM41 * rM14) + (lM42 * rM24)) + (lM43 * rM34)) + (lM44 * rM44));
}

Call                                                                    Duration
OpenTK.Matrix4.Mult(left, right)                                        1,700ms
OpenTK.Matrix4.Mult(ref left, ref right, out result)                    1,380ms
EDais.MatrixMultCached(ref left, ref right, out result)                 560ms
EDais.MatrixMultCachedRows(ref left, ref right, out result)             370ms

Another significant boost in performance by simply using a more efficient way of getting at the data.

Up until now we’ve only been looking at reading the data, but there is also the corresponding write side, where the result is stored in the out variable; perhaps there are some performance wins there too?
Since the reading code was vastly improved by going directly to the exposed data fields, let’s try the same thing for the writing. This version writes into the rows directly rather than creating a whole new Matrix4 structure:

public static void MatrixMultCachedRowsWriteRows(ref Matrix4 left, ref Matrix4 right, out Matrix4 result) {
    float lM11 = left.Row0.X, lM12 = left.Row0.Y, lM13 = left.Row0.Z, lM14 = left.Row0.W,
        lM21 = left.Row1.X, lM22 = left.Row1.Y, lM23 = left.Row1.Z, lM24 = left.Row1.W,
        lM31 = left.Row2.X, lM32 = left.Row2.Y, lM33 = left.Row2.Z, lM34 = left.Row2.W,
        lM41 = left.Row3.X, lM42 = left.Row3.Y, lM43 = left.Row3.Z, lM44 = left.Row3.W,
        rM11 = right.Row0.X, rM12 = right.Row0.Y, rM13 = right.Row0.Z, rM14 = right.Row0.W,
        rM21 = right.Row1.X, rM22 = right.Row1.Y, rM23 = right.Row1.Z, rM24 = right.Row1.W,
        rM31 = right.Row2.X, rM32 = right.Row2.Y, rM33 = right.Row2.Z, rM34 = right.Row2.W,
        rM41 = right.Row3.X, rM42 = right.Row3.Y, rM43 = right.Row3.Z, rM44 = right.Row3.W;

    result.Row0 = new Vector4(
        (((lM11 * rM11) + (lM12 * rM21)) + (lM13 * rM31)) + (lM14 * rM41),
        (((lM11 * rM12) + (lM12 * rM22)) + (lM13 * rM32)) + (lM14 * rM42),
        (((lM11 * rM13) + (lM12 * rM23)) + (lM13 * rM33)) + (lM14 * rM43),
        (((lM11 * rM14) + (lM12 * rM24)) + (lM13 * rM34)) + (lM14 * rM44));

    result.Row1 = new Vector4(
        (((lM21 * rM11) + (lM22 * rM21)) + (lM23 * rM31)) + (lM24 * rM41),
        (((lM21 * rM12) + (lM22 * rM22)) + (lM23 * rM32)) + (lM24 * rM42),
        (((lM21 * rM13) + (lM22 * rM23)) + (lM23 * rM33)) + (lM24 * rM43),
        (((lM21 * rM14) + (lM22 * rM24)) + (lM23 * rM34)) + (lM24 * rM44));

    result.Row2 = new Vector4(
        (((lM31 * rM11) + (lM32 * rM21)) + (lM33 * rM31)) + (lM34 * rM41),
        (((lM31 * rM12) + (lM32 * rM22)) + (lM33 * rM32)) + (lM34 * rM42),
        (((lM31 * rM13) + (lM32 * rM23)) + (lM33 * rM33)) + (lM34 * rM43),
        (((lM31 * rM14) + (lM32 * rM24)) + (lM33 * rM34)) + (lM34 * rM44));

    result.Row3 = new Vector4(
        (((lM41 * rM11) + (lM42 * rM21)) + (lM43 * rM31)) + (lM44 * rM41),
        (((lM41 * rM12) + (lM42 * rM22)) + (lM43 * rM32)) + (lM44 * rM42),
        (((lM41 * rM13) + (lM42 * rM23)) + (lM43 * rM33)) + (lM44 * rM43),
        (((lM41 * rM14) + (lM42 * rM24)) + (lM43 * rM34)) + (lM44 * rM44));
}

Call                                                                    Duration
OpenTK.Matrix4.Mult(left, right)                                        1,700ms
OpenTK.Matrix4.Mult(ref left, ref right, out result)                    1,380ms
EDais.MatrixMultCached(ref left, ref right, out result)                 560ms
EDais.MatrixMultCachedRows(ref left, ref right, out result)             370ms
EDais.MatrixMultCachedRowsWriteRows(ref left, ref right, out result)    295ms

A more modest speed improvement, but still an improvement.
The writing code is still having to construct new Vector4 instances to push into the result, so instead let’s try writing directly into the fields that are already there:

public static void MatrixMultCachedRowsWriteVals(ref Matrix4 left, ref Matrix4 right, out Matrix4 result) {
    float lM11 = left.Row0.X, lM12 = left.Row0.Y, lM13 = left.Row0.Z, lM14 = left.Row0.W,
        lM21 = left.Row1.X, lM22 = left.Row1.Y, lM23 = left.Row1.Z, lM24 = left.Row1.W,
        lM31 = left.Row2.X, lM32 = left.Row2.Y, lM33 = left.Row2.Z, lM34 = left.Row2.W,
        lM41 = left.Row3.X, lM42 = left.Row3.Y, lM43 = left.Row3.Z, lM44 = left.Row3.W,
        rM11 = right.Row0.X, rM12 = right.Row0.Y, rM13 = right.Row0.Z, rM14 = right.Row0.W,
        rM21 = right.Row1.X, rM22 = right.Row1.Y, rM23 = right.Row1.Z, rM24 = right.Row1.W,
        rM31 = right.Row2.X, rM32 = right.Row2.Y, rM33 = right.Row2.Z, rM34 = right.Row2.W,
        rM41 = right.Row3.X, rM42 = right.Row3.Y, rM43 = right.Row3.Z, rM44 = right.Row3.W;

    result.Row0.X = (((lM11 * rM11) + (lM12 * rM21)) + (lM13 * rM31)) + (lM14 * rM41);
    result.Row0.Y = (((lM11 * rM12) + (lM12 * rM22)) + (lM13 * rM32)) + (lM14 * rM42);
    result.Row0.Z = (((lM11 * rM13) + (lM12 * rM23)) + (lM13 * rM33)) + (lM14 * rM43);
    result.Row0.W = (((lM11 * rM14) + (lM12 * rM24)) + (lM13 * rM34)) + (lM14 * rM44);
    result.Row1.X = (((lM21 * rM11) + (lM22 * rM21)) + (lM23 * rM31)) + (lM24 * rM41);
    result.Row1.Y = (((lM21 * rM12) + (lM22 * rM22)) + (lM23 * rM32)) + (lM24 * rM42);
    result.Row1.Z = (((lM21 * rM13) + (lM22 * rM23)) + (lM23 * rM33)) + (lM24 * rM43);
    result.Row1.W = (((lM21 * rM14) + (lM22 * rM24)) + (lM23 * rM34)) + (lM24 * rM44);
    result.Row2.X = (((lM31 * rM11) + (lM32 * rM21)) + (lM33 * rM31)) + (lM34 * rM41);
    result.Row2.Y = (((lM31 * rM12) + (lM32 * rM22)) + (lM33 * rM32)) + (lM34 * rM42);
    result.Row2.Z = (((lM31 * rM13) + (lM32 * rM23)) + (lM33 * rM33)) + (lM34 * rM43);
    result.Row2.W = (((lM31 * rM14) + (lM32 * rM24)) + (lM33 * rM34)) + (lM34 * rM44);
    result.Row3.X = (((lM41 * rM11) + (lM42 * rM21)) + (lM43 * rM31)) + (lM44 * rM41);
    result.Row3.Y = (((lM41 * rM12) + (lM42 * rM22)) + (lM43 * rM32)) + (lM44 * rM42);
    result.Row3.Z = (((lM41 * rM13) + (lM42 * rM23)) + (lM43 * rM33)) + (lM44 * rM43);
    result.Row3.W = (((lM41 * rM14) + (lM42 * rM24)) + (lM43 * rM34)) + (lM44 * rM44);
}

Call                                                                    Duration
OpenTK.Matrix4.Mult(left, right)                                        1,700ms
OpenTK.Matrix4.Mult(ref left, ref right, out result)                    1,380ms
EDais.MatrixMultCached(ref left, ref right, out result)                 560ms
EDais.MatrixMultCachedRows(ref left, ref right, out result)             370ms
EDais.MatrixMultCachedRowsWriteRows(ref left, ref right, out result)    295ms
EDais.MatrixMultCachedRowsWriteVals(ref left, ref right, out result)    195ms

This time the results are even more striking, and the net result is an over 7 times speed improvement for doing nothing more than changing the way data is read and written – exactly the same calculations are being performed and exactly the same results are returned.

Another way of looking at this, which will probably appeal more to graphics coders: if the original code was running at 30 FPS and matrix multiplications were the only bottleneck, this new version would be running at a little over 260 FPS! In reality the matrix maths will probably account for only a small share of the overall frame time, but it all helps.

Tiny wings procedural terrain generation

Tiny Wings is the casual game phenomenon sweeping the iPhone gaming landscape at the moment, featuring a rather interesting procedural terrain generation system that gives the game a different look each day. From what I can see the topology of the terrain stays pretty much constant; it’s just the texturing that changes on a daily basis.

Tiny wings terrain

Being a graphics geek with a speciality in procedural content generation, the procedural terrain feature secured my purchase, and like a geeky kid with a new toy I just wanted to take it apart and find out what made it tick. The engine creates beautifully textured undulating 2D landscapes like illustrations straight out of a Dr. Seuss book; indeed the entire frame seems to have a paper-like texture applied, which further hints at the illustration motif.

The profile of each mountain seems to follow a simple sine wave curve with varying sized vegetation/rock sprites underlaid perpendicular to the surface normal, however it was the texturing itself that intrigued me. It turns out that the renderer uses an elegantly simple approach, which is to simply displace a reasonably densely tessellated rectangle by the curve and just let the texture mapping perform its magic.
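The displacement step can be sketched in a few lines. Assuming a simple sine profile (the function and parameter names are my own, not from the game), each vertex column of the tessellated strip is shifted vertically by the curve while its texture coordinates stay untouched, which is what lets the striped texture flow over the hills:

```python
import math

def displace_strip(xs, base, amplitude, wavelength):
    """Vertical positions for a strip of vertex columns at horizontal
    positions xs, following a sine-wave mountain profile."""
    return [base + amplitude * math.sin(2 * math.pi * x / wavelength)
            for x in xs]
```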

Tiny wings flattened terrain

To illustrate this, here’s the same image as above but with the inverse transformation applied to turn it back into a flat rectangle. This can be accomplished in Adobe Photoshop using the distort filter and an appropriately generated cosine gradient curve.

(The small mountain in the distance is on a different parallax plane to the foreground mountains, and also appears to use the same texture and technique.)

From this perspective it’s easy to see what the underlying texture is up to, and a simple brightness gradient towards the top of the mountain finishes off the effect. It also appears that the paper texture is applied to the mountain at this stage and therefore is also distorted with the curve profile, as can be seen by the texture stretching in the red channel of the original image.

To further demonstrate this, a similar effect can be achieved in Photoshop by generating the original striped rectangle and applying the distort effect using an inverted copy of the original distort map.

Tiny wings terrain generation 1Tiny wings terrain generation 2Tiny wings terrain generation 3

Here a simple diagonal striped pattern is defined, then a brightness gradient and paper texture is applied over the top of that, and finally the striped layer is distorted using a sine wave curve.

The distortion map is effectively a gradient, but using a sine wave profile for the interpolation curve.  Unfortunately this isn’t possible using Photoshop’s in-built gradient tool (at least not to my knowledge), however you can pretty easily generate your own.  The map defines the mountain peaks in white, and the valleys in black – you could think of it as a depth map rendered from the sky.
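Although Photoshop’s built-in gradient tool can’t produce this directly, such a map is easy to generate yourself. Here’s a minimal sketch (my own code, one cosine cycle across the row, white at the central peak and black in the valleys at the edges):

```python
import math

def sine_gradient_row(width):
    """One row of a distortion map whose brightness follows a sine profile:
    0 (black) in the valleys at the edges, 255 (white) at the central peak."""
    return [int(round(127.5 * (1 - math.cos(2 * math.pi * x / (width - 1)))))
            for x in range(width)]
```

Repeating this row for every scanline (and tiling it horizontally for more peaks) gives the depth-map-from-the-sky style gradient described above.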

So it turns out that those wonderfully organic curves are just the interference pattern between a diagonal stripe and sine functions.  A cute technique to complement a simple yet addictive little game; if you haven’t already, check it out on the App Store.