Audio Middleware from a Programmer’s Perspective

Audio middleware like FMOD and Wwise is third-party software that lets you implement audio in a much simpler fashion than hard-coding it into the game engine itself. Besides being simpler, it also gives the user more advanced functionality (crossfading between sounds, dynamically changing pitch based on a game parameter, dynamically mixing the soundscape, and so on) through an understandable user interface and without the need to write code. With this, the sound design of the game becomes what the sound designer intends rather than what the programmer interprets from them, so the sound designer can take actual ownership of the soundscape. The real attraction for programmers is that middleware unlocks a bigger audio potential for the sound designer in a very time-efficient way and with much simpler code. If the game studio doesn’t have a dedicated audio tools programmer or a technical sound designer, middleware becomes invaluable. Every audio middleware has stronger and weaker points; in this article my main focus will be Wwise, as it applies most directly to the examples I’m going to give.
Wwise takes some effort and budget to adopt in a video game’s development (you need a pro license to ship a game with more than 200 sound assets). However, the benefits and the workload it saves make the initial cost more than worthwhile. Throughout this article I will focus on the benefits of Wwise for programmers: how it simplifies the coding process, what it brings for memory management and speaker routing, how it simplifies debugging sound, and how it eases porting to the different operating systems the game will run on. Finally, I will give an answer to which middleware is the most suitable for working with games.
First off, some of the programming challenges of implementing audio in a game can be summarized by the bullet points below.

It takes a lot of time and effort to set up a dynamic system that reacts to changes in the game and plays or changes sounds accordingly. Even dynamic effects like an echo on a sound need a certain degree of implementation.

A game keeps the processor and memory busy with rendering, artificial intelligence and the physics engine all running at the same time. Adding a dynamic sound system on top means optimizing the sound system as carefully as every other aspect of the game.

Debugging the sound system is less straightforward than debugging other parts of the game. Since the sound system will probably be modular, for most errors there is a good chance the program won’t throw an exception (other than assertions), and there is no visual representation the programmer can inspect on screen. The sound will simply not play, which increases the complexity of debugging.

Memory management is an important concern for the sounds in a game. To optimize it, the programmer needs to know which compression algorithms to use to decrease both the memory consumption and the hard disk usage with as little loss in the audio data as possible.

Let’s start the discussion with our first bullet point. To illustrate it, I will walk through two simple case studies showing what the programmer has to go through with and without Wwise.

Shuffle Behavior

First off, Wwise features random containers. Think of containers as parent objects with different functionalities. Any sound placed into a random container has a chance (which the sound designer can adjust) of playing whenever the container is triggered, with behavioral options to specify the randomization. One of these options is the shuffle behavior: each time the container is triggered, one of the sounds inside plays, and that sound then has zero chance of playing again until all the other sounds in the container have also played. Once every sound has been triggered, the shuffle resets and all the sounds become playable again. Now let’s look at the programmer’s workload for this behavior with Wwise. The programmer will not have access to the random container itself, but rather to an event that calls the appropriate sound asset, which the sound designer sets up inside Wwise. The programmer simply has to call the

AkSoundEngine.PostEvent(eventName, gameObject)

function to get the shuffle behavior they need from the random container. Since Wwise handles all the behaviors internally, the programmer can spend more time building the game rather than implementing this functionality for the sound designer.

If Wwise isn’t present, the programmer will have to code a shuffle behavior with an algorithm like the one below:

Set up an array or a list to hold the sounds

Randomly select each element and push it onto a stack

Whenever the sound is triggered, pop and play an element and check whether the stack is empty

If the stack is empty, fill it again
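As a sketch, the stack-based shuffle above might look like this in Python (the class and method names are hypothetical, and a real game would start playback of an audio asset instead of returning its name):

```python
import random

class ShuffleContainer:
    """Minimal sketch of a 'shuffle' random container.

    The sounds are shuffled onto a stack; each trigger pops one,
    so no sound repeats until every sound has played once.
    """
    def __init__(self, sounds):
        self.sounds = list(sounds)
        self.stack = []

    def _refill(self):
        # Copy and shuffle the full sound list to rebuild the stack.
        self.stack = self.sounds[:]
        random.shuffle(self.stack)

    def trigger(self):
        if not self.stack:       # all sounds played: reset the shuffle
            self._refill()
        return self.stack.pop()  # "play" the next shuffled sound
```

Each full cycle of triggers covers every sound exactly once before any repetition, which is exactly the shuffle guarantee described above.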

Another algorithm would be:

Set up linked list nodes holding sound pointers and their priority values

Create a linked list and populate it with the sounds, their initial priorities and their current priorities

Define a selection rule that favors the elements with higher priority values, and whenever a sound is chosen to play, set its priority to -1

Check whether all the sounds’ priorities are -1, and if so, reset the priorities to their initial values
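The priority-based variant could be sketched like this (hypothetical names again; a plain dictionary stands in for the linked list to keep the sketch short):

```python
import random

class PriorityShuffle:
    """Sketch of the priority-based shuffle: each sound carries a
    priority, a played sound's priority drops to -1, and once every
    priority is -1 they all reset to their initial values."""
    def __init__(self, sounds, initial_priority=1):
        self.initial = {s: initial_priority for s in sounds}
        self.current = dict(self.initial)

    def trigger(self):
        best = max(self.current.values())
        if best == -1:                        # everything played: reset
            self.current = dict(self.initial)
            best = max(self.current.values())
        # Favor the sounds with the highest remaining priority.
        choices = [s for s, p in self.current.items() if p == best]
        sound = random.choice(choices)
        self.current[sound] = -1              # exclude until the reset
        return sound
```

As the article notes, this avoids the extra stack at the cost of scanning all priorities on every trigger.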

These are just pseudocode algorithms, as actual code is outside the focus of this article, and tutorials for implementing them can be found online.

The second algorithm is more complex than the first, but it avoids the extra memory for a stack; in exchange it uses more processing power, as we need to traverse the list to check whether all priorities are set to -1.

Granted, these algorithms are not that complex to implement or improve and won’t take much of a programmer’s time. But keep in mind that this is a very simple case, which the sound designer can set up with a single button in Wwise.

Let’s look at a slightly more complex case to reinforce the idea.

Blend Container with Dynamic Modification of the Sounds

Our second case is blending four sounds with different intensities. Suppose we have four weather sounds for the game: light rain, medium rain, heavy rain, and an extra detail layer for heavy rain. We want a parameter that controls rain intensity (rain_intensity) and switches between the different weather sounds as it changes. We also don’t want abrupt switches between two intensities, so we will crossfade between the sounds (crossfading means fading one sound out while fading the other in, giving a gradual transition between the two). rain_intensity will run from a minimum of 0 to a maximum of 100. Suppose we want to crossfade between light and medium rain at rain_intensity 35, crossfade between medium and heavy rain at 75, and introduce the detail layer at 90. To make the layers sound a bit more dynamic, we will also increase the volume of the sound by 0.1 dB for each rain_intensity level.

For the programmer who has Wwise integrated, the required steps are as follows:

Call “AkSoundEngine.PostEvent(PlayWeather, Main_Camera)”. Here we assume the weather event is tied to the Main_Camera object. The looping of the sound is handled by the Wwise engine.

Whenever the rain_intensity value changes in the game, call “AkSoundEngine.UpdateRTPCValue(rain_intensity, desired_value, Main_Camera);”. The change in the sound itself is handled by the sound designer inside Wwise.

If the programmer doesn’t have Wwise integrated, the steps are more involved. Again, there are several solutions; we’ll start with a brute-force approach:

Start playing all four sounds at the appropriate volumes. For instance, if the rain_intensity value is 10, play the light rain layer at an audible level and the others silently.

Update the levels of the sounds whenever the rain_intensity parameter changes, bearing in mind that we also add 0.1 dB per rain_intensity level. The important part here is to ensure we don’t add the 0.1 dB boost to a layer that is currently silent.
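To make the brute-force steps concrete, here is a Python sketch of how the per-layer gains might be derived from rain_intensity. The crossfade points (35, 75) and the detail-layer entry point (90) come from the example above; the 10-unit linear crossfade window, the function names, and the use of linear 0-to-1 gains rather than dB are assumptions made purely for illustration:

```python
def layer_gains(rain_intensity, width=10.0):
    """Return linear gains (0..1) for the light/medium/heavy/detail
    layers, crossfading linearly over an assumed 10-unit window
    centered on each transition point."""
    def fade_out(x, center):   # 1 below the window, 0 above it
        return min(1.0, max(0.0, (center + width / 2 - x) / width))
    def fade_in(x, center):
        return 1.0 - fade_out(x, center)
    x = float(rain_intensity)
    return {
        "light":  fade_out(x, 35),
        "medium": fade_in(x, 35) * fade_out(x, 75),
        "heavy":  fade_in(x, 75),
        "detail": fade_in(x, 90),
    }

def intensity_boost_db(rain_intensity, gain):
    """+0.1 dB per intensity level, applied only to audible layers."""
    return 0.1 * rain_intensity if gain > 0.0 else 0.0
```

At rain_intensity 10 only the light layer is audible; at 35 the light and medium layers sit mid-crossfade; near 100 the heavy and detail layers dominate.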

This approach will work, but it means we need four separate sound objects for what is effectively one sound. Since weather isn’t the only sound in the game, the memory usage multiplies with every layer we add.

We can optimize the algorithm to use less memory, but doing so requires a bit more knowledge of audio analysis. The first thing to know is the Fast Fourier Transform (FFT), which lets us decompose a sound into sine waves, on the assumption that any sound can be represented as a sum of sine waves. We want this decomposition to ensure there are no clicks and pops when we add the sounds together. The FFT itself is outside the focus of this article, so we won’t delve into the algorithm; detailed explanations are widely available.

When the sounds are imported, check where their loudest peaks are. This step avoids long waiting times: the FFT is still an O(N log N) algorithm, some of our sounds can last more than two minutes, and analyzing every sound in full would have a huge cost.

Use the FFT to analyze the loudest parts of the sounds that we put into our blending object

Attenuate the frequencies where clipping would occur (the clipping threshold depends on the headroom of the system)

Add the sounds together at the levels dictated by the rain_intensity parameter and play the result

Whenever the rain_intensity parameter changes, calculate the extra volume from the rain_intensity * 0.1 formula and apply it to the sound object
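The “add the sounds together without clipping” steps can be illustrated with a simplified, broadband version of the idea. A real implementation would attenuate the individual frequency bands found by the FFT analysis; this sketch just rescales the summed mix when its peak would exceed the headroom:

```python
def mix_layers(layers, gains, headroom=1.0):
    """Sum per-layer sample buffers at their current gains, then scale
    the whole mix down if its peak would clip past the headroom.
    (Broadband stand-in for the per-frequency attenuation described
    in the article.)"""
    n = max(len(buf) for buf in layers)
    mix = [0.0] * n
    for buf, gain in zip(layers, gains):
        for i, sample in enumerate(buf):
            mix[i] += sample * gain
    peak = max(abs(s) for s in mix) or 1.0
    if peak > headroom:                  # would clip: pull the mix down
        scale = headroom / peak
        mix = [s * scale for s in mix]
    return mix
```

The guarantee is simply that the output never exceeds the headroom, no matter how many layers are summed.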

This algorithm will also work, and we end up with one sound object instead of four. However, even described at this high level, it is still complex. Bear in mind that the step about attenuating the appropriate frequencies alone may require the programmer to build an equalization tool. As we introduce more modules and systems into our audio engine, we also introduce more chances of running into problems while implementing the algorithm. We can solve those problems by adapting the algorithms accordingly, but it is safe to say this will take much more time and effort than integrating Wwise into our game engine and letting the sound designer have that functionality.

From these case studies we can see that integrating Wwise into our game engine decreases the complexity of the algorithms we need to write for a dynamic soundscape. As I said before, the dynamic soundscape tools that Wwise brings can be built from scratch; but if we look at the man-hours needed to build them, buying and integrating a middleware is actually much more cost-effective than having the programmers write the tools themselves.


We’ll also need to talk about the memory management tools that middleware brings to the table. One of Wwise’s optimization tools is the ability to limit the number of instances of each sound object. For instance, if we have 32 guns shooting at the same time, we can prioritize and limit the gunshots: the sound designer can tell Wwise to play only the loudest 16 and kill the other 16 so they have no memory impact. (Keep in mind that silencing the quietest sounds still takes processing power to decide which sounds are the appropriate ones; the kill-oldest or kill-newest options save on processing power too.) Since Wwise lets the sound designer set a limit on every sound instance, it can conserve a lot of memory.
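A sketch of the “play only the loudest N” policy, using hypothetical names, might look like this:

```python
import heapq

def audible_voices(instances, limit=16):
    """Given (loudness, voice_id) pairs for every playing instance,
    keep the `limit` loudest and mark the rest to be killed,
    mirroring the 'play loudest' limiting behavior."""
    keep = heapq.nlargest(limit, instances, key=lambda v: v[0])
    kept_ids = {vid for _, vid in keep}
    killed = [vid for _, vid in instances if vid not in kept_ids]
    return kept_ids, killed
```

With 32 simultaneous gunshots and a limit of 16, the 16 quietest voices land in the kill list, exactly the split described above.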

Another thing to consider is whether to stream a sound from the hard drive or play it directly from memory; this is one of the fundamental concerns of game audio. Suppose we have a piece of music that is 20 minutes long. Playing all of it from RAM would use up a lot of the memory we need for other aspects of the game and cause issues like bottlenecking, so the programmer may want to stream the audio from disk instead. Since the music isn’t tied to an animation or anything similar, the streaming delay caused by accessing the hard disk won’t be much of a problem. Keeping track of thousands of sound assets is a burden for the programmer, though, while the sound designer, who created the assets, is far more familiar with them; Wwise gives the sound designer an easy interface to manage streaming. To compensate for the latency that streaming from disk can introduce, Wwise can also preload (look ahead on) a sound from disk before playback starts (keep in mind that this option uses processing power). On top of this, Wwise has a built-in multi-positioning feature: if we have 20 torches in the game, Wwise lets us play one event for all of them and handle the blending and crossfading of the 20 sound instances through that single event, which sometimes saves additional memory.
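The stream-versus-memory decision described above boils down to a rule of thumb, sketched here with an illustrative threshold (the 10-second cutoff and the function name are assumptions, not a Wwise API):

```python
def should_stream(duration_s, sync_critical, threshold_s=10.0):
    """Rule-of-thumb sketch: stream long, non-sync-critical assets
    (music, ambiences) from disk; keep short or animation-synced
    sounds resident in memory."""
    return duration_s > threshold_s and not sync_critical
```

A 20-minute music track streams; a short gunshot or an animation-synced cue stays in RAM.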

Another concept that can help greatly with the total size of the audio files is data compression. Granted, most programmers who don’t work specifically on audio aren’t expected to know the different compression algorithms and how they affect sound quality. The great advantage of middleware is that these compression algorithms are already built into the program, which lets the audio designer worry about the quality-to-size ratio without any risk of compressing to a file format that won’t work in the game. This way, massive data compression (the Ogg Vorbis format, for instance, can give a 16:1 compression ratio without much audible loss of quality) can be done by the sound designer without the programmer having to check whether the file format works with the game. When we import sounds into Wwise, it also strips all the unneeded metadata from the audio files, which means less data usage per audio asset.


Another useful feature of middleware is the profiler. In this view, the sound designer can see in real time, for each sound event and sound instance, how much CPU the audio thread uses, the number of active streams, the total streaming bandwidth and the total memory usage. From here, the sound designer can spot and optimize the audio before it even reaches the programmer. A step further is to use the Wwise SoundFrame framework to build a test-environment editor so sound designers can test the audio inside the game engine without writing a single line of code. Such an editor can also let the sound designer place sound emitters and set up audio parameters inside the game editor without the programmer having to worry about it. Both the profiler and SoundFrame let programmers and sound designers debug the audio thread much more efficiently.


Two further challenges of designing audio for a game are localization and, for multiplatform games, porting. The dialogue system in Wwise makes localization much easier. Since Wwise supplies the programmer with soundbanks (which contain the events the programmer implements), the sound designer can easily build a separate soundbank for each language, leaving the other languages out of that bank, and the programmer simply loads the right build of the soundbank into the game. For porting to different platforms, Wwise applies the same idea at a bigger scale: it lets the sound designer compress the files differently for each port in a non-destructive manner. The reasoning is that each platform has different specifications and audio file formats depending on its hardware. For instance, we may want 8-bit files on the Android port for file size reasons while we use 16-bit files on PC, and Wwise lets the sound designer implement these decisions in a compact, unconfusing user interface. The sound designer specifies which platforms to build the soundbanks for, and can additionally convert each file to a different format for each build, ensuring all the sounds play correctly and are optimized for every platform.


The last but definitely not the least advantage I’m going to discuss is the dynamic audio functionality Wwise brings to the sound and the music of the game. On the sound design side, these functions let the sound designer randomize the sounds and have the soundscape adapt to changes in the game by binding them to game parameters. On the musical side, the interactive music manager puts the same power in the hands of the sound designer or composer to create more interesting, adaptable music. The composer does need to change their way of thinking (rather than writing a linear piece, they must think about how the music will adapt to different circumstances and compose smaller pieces that tie together accordingly), but the benefit of music that plays and adapts to different outcomes in the game can be massive for the gaming experience. When the sound and the music change and adapt to the game dynamically, the player experiences something more realistic or film-like.

One question to ask would be: do we really need this much dynamic variation in a game’s sounds? The answer depends on the game. For retro games, not having many variations worked and still works well for developers; but for modern games, we absolutely do need it. The important thing to keep in mind is that we perceive the world first with our eyes, but we make sense of it just as much with our ears, noses and other senses. Sound works subconsciously as much as consciously, and one of the main challenges of a video game as a whole is repetition. People can lose interest in a game if they always perform the same actions in it; similarly, hearing the same sounds over and over can bore players. A dynamic sound design cannot save a repetitive game design that gives players no new challenges, but having a dynamic soundscape is better than not having one for the longevity of the game in the market.


Even though this article focuses on the benefits of middleware, these tools come with a couple of limitations. The main one is customization. Whichever middleware you use, these programs were designed and written by other people; they are someone else’s ideas of how to solve the audio problem, and some of the solutions may not coincide with yours. This may be something small, like a feature requiring three clicks instead of one, or something crucial, like a better metering system. Either way, you will have to request the change and wait for the third party to implement it. Another limitation is that middleware has no support for platforms like HTML and Flash. This is not a big issue for most game developers, but making a website more interactive with a soundscape would be an interesting development.


One of the definitive questions on people’s minds will be: which middleware is the best? Unfortunately, the answer is that it depends on what you expect the middleware to cover. In this article I gave my examples mostly from Wwise, but that is purely personal, as I am more comfortable with it than with the others. Every middleware has advantages and disadvantages against the rest.

FMOD has a great free car engine system, whereas Wwise makes you pay for a plugin for it. FMOD also has no limitations on audio assets for indie developers, and its nesting and control of events makes mixing easier. Support for Steinberg plugins is another big benefit for sound designers. Wwise, for its part, has a great profiling system for debugging and optimizing, a more sophisticated dialogue system, a very easy path to localizing the game into other languages, and the HDR system. Both have different but similar ideas about dynamic music; if you want even more complexity in your music system, you might want to check out psai. If you are in the Unity game engine and don’t want to leave it for audio, you can use a free asset store plugin called Fabric. If you are working with a sound designer who likes to build their own visual programming patches, you might want to use PureData or Max/MSP; if they prefer to code their patches and tools, you might want to use SuperCollider. As you can see, there are many options and many variables to think about when choosing a middleware. The optimal answer is to choose something that both the sound designer and the programmer are comfortable with, that is sufficient for the job, and that fits the budget constraints.


To sum up, even from a programmer’s or a game designer’s perspective, middleware is a worthwhile investment, especially if you don’t have an audio tools programmer or an in-house engine with a great soundscape system already built in; and even if you do have a tools programmer, middleware may well be less expensive once you count the man-hours needed to develop the tools. From an audio perspective, it increases the sound designer’s implementation workload a little, but it massively decreases the programmer’s workload for building a highly dynamic and interactive soundscape. Integrating middleware into game engines, especially engines like Unity or Unreal, is uncomplicated even for inexperienced programmers and gives the sound designer far more to work with. Middleware also simplifies the debugging and mixing processes with easy-to-understand user interfaces and meters. All in all, even though middleware has some limitations depending on your engine and target platform, it is worth the budget spent on it for both big studios and indie developers, given how much it simplifies audio implementation and how broad its engine support is.