Use Mutation Testing to Improve Your Software Engineering Skills!
Published on October 19, 2023

Code coverage (the percentage of code executed when running your tests) is a great metric. However, coverage doesn’t tell you how good your tests are at picking up changes to your codebase, simply because it's a metric of quantity, not quality. If your tests aren’t well designed, changes can pass your unit tests but still break production.

Mutation testing is a great (and massively underrated) way to quantify how much you can trust your tests. Mutation tests work by changing your code in subtle ways and then applying your unit tests to these "mutated" versions of your code. If your tests fail, great! If they pass… it means your test wasn't good enough to detect this change.

In this video, Max from the Vonage Developer Experience team will show you how to get started with mutation testing in any language and how to integrate it into your CI/CD pipeline. Soon, catching mutant code will be a routine part of your release engineering process, and you’ll never look at penguins the same way again!

Below is the video transcript, as well as a few handy links.

You can sign up for a free developer account with Vonage here.

If you have any questions, please reach out to us on our Community Slack.


Hi. My name is Max, and I'm a developer advocate at Vonage. And now today, I wanna talk to you about mutation testing. I wanna tell you how I use it and I wanna tell you how you can use it as well. Before we start, I wanna mention the company that I work for briefly.

So we're a company called Vonage, and we do stuff with communications APIs, amongst other things. So a lot of the code I work with is things like sending SMSes or making voice calls, creating video chats, things like that. So lots of stuff to do with communication. And the reason I'm mentioning this is because I actually applied mutation testing to our code. I wanna show you how I did that, why that was a good choice for us, and I wanna show you how you can do it. But before I really start, I wanna introduce the real protagonist of this talk.

I am the speaker, but the real protagonist is actually Henry. Now this is Henry. You can see, hopefully, that he is an adorable little penguin. And the reason he's the real important character here is because as I was learning mutation testing to use it myself, I was using Henry as an analogy for a lot of things. You'll see why, because I'm gonna use him today as well to show you how mutation testing really works. Before we get started, let's just set some baselines.

So first of all, I just want you to think about this: have you heard of testing? I assume if you have clicked on this video, if you're looking at this, you probably know what testing is for software.

I'd also like you to think about code coverage. You may or may not know code coverage. Don't worry if you don't, because it's something that we'll definitely be mentioning. There's also mutation testing. You've heard of this, I assume, if you've seen the title of this video, but it's a really useful thing in its own right. And, I'm hoping that by the end of the video, you'll know an awful lot more about mutation testing and you'll feel comfortable to use it yourself. So if that sounds good to you, then let's continue!

What I'd like to do is just set a bit of a baseline. So I just want to mention testing, first of all. So really, just this big question: why do we write unit tests? What reasons do we have for actually writing tests for our code? And you might wanna pause this and have a think about this yourself. You might not. That's also fine.

In which case, I'll show you what I think it might involve. And here's what I've got. So you can write them to prove that your code works. You can write them for documentation reasons, to inspire confidence in your code, for regression testing, for refactoring, and for compliance reasons as well.

So basically, here's all the stuff that might encourage you to write a unit test. And it's great that we have this: it's great that we have a way of verifying that our code works, a way of documenting, and all of these great things. But there's a bit of a problem here as well. The problem is that you might start off with a small project, but that project can rapidly grow, and what it does can evolve over time. And then as you do refactors and things like that, you might miss code, you might skip sections, and you might not be testing all of your code anymore. And the issue here is that often we don't monitor our tests.

If we don't monitor our tests, we don't know what's wrong with any of the code that we have, because we can no longer rely on the tests to tell us, and we don't know whether those tests are going to help us. So hopefully, we can see what the challenge is here. I would like to say that it gets better, because we have code coverage, right? And if you're not sure what that is, I'll quickly explain. This is it in my words: "how much of the source code of a program is executed when the test suite is run". That's the way that I would explain it.

But let's actually stop talking about it and show you a real example of this in practice. This is one of the open-source SDKs I maintain. And, basically, it's a Python SDK to do API calls for different things. We have an API that I'm supporting in this SDK called our Messages API. And inside of that, you can send messages with SMS, MMS, WhatsApp, Facebook, things like that - lots of different channels. I got the code and I ran some code coverage metrics on it, so I could see which statements in the code I was testing, and which statements I was missing. You can see here that I've covered most of it, but there's a missing statement on line 23 and that line is one that tells me, oh, I didn't actually write a test for that.

In this case, I didn't try that type of authentication. And so what that means is that I can then go away and write a test for that. And so code coverage has already just improved my code, and that's awesome! What else can it do?

It can actually give you an overview of your entire project and all of the different statements in it - basically, how much coverage you've got. So this is what that looks like for me. And this seems great. There's really great potential here because first of all, it lets you write more tests and better tests.

It's also really easy and cheap to measure this stuff. It's not very computationally intensive just to run the code coverage and ask, "what did I cover?" The other good thing, really the best thing here, is that it shows what you didn't test, so you know: hey, I need to pay attention to this if I don't trust this code (which, really, you often shouldn't). That's really useful. And so code coverage really covers a lot.

So actually, I think maybe we're done here, thank you very much. It's been great to talk to you today. I'll see you again... except actually, there are some things which are not so good about this.

Well, first of all, code coverage can be a little bit misleading. It also doesn't guarantee the quality of your tests. So what I'm gonna do is show you an example of a very simple piece of Python code, here it is. And all I'm doing here is importing the requests module, that's the library I'm using.

Then I'm making an API call with that library. So all I'm doing is making a GET request to a specific URL, getting the response back, and returning the JSON of the response to the user. And that's fine. That's all good. On the next slide, though, what I've got is a test for this.

It actually has 100% coverage, but it's not very useful. We can see here, we've got something that calls the API. But what it's actually doing is calling the API and then just saying, if I called it, that's good. So if that code ran, we're good. But what it doesn't do is validate any of the things that could come back.

We might get a response back that's a 200. We might get a 400. We might get something else. We might get something we don't expect, and we don't handle any of those cases. And we don't test for any of those cases either. So actually, this test is not very useful to help you with that.
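To make that concrete, here's a rough sketch of the kind of code and test I'm describing (the URL and function names are made up for illustration; the point is that the test executes every line without checking anything about the response):

```python
import requests


def get_data():
    # Make a GET request and hand the parsed JSON straight back to the caller.
    # No handling of 4xx/5xx responses, timeouts, or unexpected payloads.
    response = requests.get("https://example.com/api/data")
    return response.json()


def test_get_data():
    # This gives 100% line coverage, but it only proves the code ran without
    # raising an exception - nothing about the response is actually asserted.
    get_data()
```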

So what I'd like to do is very quickly ask a question: have you ever written a piece of code, then written a test for it, not because that test was gonna be useful, but because you needed to make your code coverage a bit higher? I've done this, and a lot of people have too. If you haven't, great, I'm proud of you, but in reality, most of us have done this kind of thing.

What happens with code coverage often is we end up making it into a target rather than something that's supposed to give us insight. And this is actually a principle that has got a name. It's a real thing, and it's called Goodhart's Law. It's not just for test coverage, but the statement is essentially this: "When a metric becomes a target, it ceases to be a good metric".

This is really important, so I'll say it again: "When a metric becomes a target, it ceases to be a good metric". Now what do I mean by that? Well, what I mean is that we've taken code coverage, which is supposed to tell us about our code and our tests and how they work, but actually we've turned that into a number that we care about optimizing.

So rather than thinking about making better tests, we're thinking about optimizing a number, which is not as good. That leads to some questions. Like, how can we understand what our tests are really doing? How do we know if our tests are trustworthy? And actually, the best statement of this was thought of in around 100 AD by the Roman thinker Juvenal (after a fourth glass of wine), and the statement he came up with was "Who Watches the Watchers?". Who's looking after the people who should be looking after us? And in the same way, who's testing our tests?

I assert an answer for you today, and I assert that the answer is mutation testing. So you might have noticed that Henry has made a bit of a reappearance.

And that's awesome because, first of all, he's adorable, but more importantly, he's gonna help us now. So last year, when I started to learn about mutation testing, I was thinking an awful lot about how I could apply it to my code, and I was thinking about ways to conceptualize and understand it for myself. And the way that I did that is that I imagined my code, which basically makes lots of API calls and deals with messages, to be like a pigeon or a dove - a bird that can fly. And so what I can do in this analogy is tie a message to the bird's leg and let the bird go, and it will fly away and deliver my message.

So it will fly away and do what I need it to do. Now, the thing with penguins is that they are birds, so they meet that criterion, but as you may well know, they can't fly, and that makes them kinda angry. But more importantly, a mutant in this analogy is like a regular bird changed into a penguin, which now can't do that central thing I need, which is to fly. Let's see a real example using pictures of birds, because that's what we're doing today.

First of all, we start with some production code, which works as expected. Then we do some kind of mutation operation and create a mutant version of this code. So we have, for example, this function (this is Python): it just adds 2 numbers and returns the sum of those 2 numbers. Now, a mutated version of this might, for example, subtract the numbers, or it might add a constant, or it might return the two values added together as strings, things like that.

It might return nothing at all, or make any other sort of logical change that alters what that line of code returns.
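Here's a rough sketch of that function and the kinds of mutants I just described (illustrative only; the exact mutations generated will depend on the tool you use):

```python
# Original production code: works as expected.
def add(a, b):
    return a + b


# The kinds of mutant versions a mutation testing tool might generate:
def add_mutant_1(a, b):
    return a - b             # operator flipped: addition becomes subtraction


def add_mutant_2(a, b):
    return a + b + 1         # a constant sneaks into the result


def add_mutant_3(a, b):
    return str(a) + str(b)   # the two values concatenated as strings


def add_mutant_4(a, b):
    return None              # returns nothing at all
```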

So what do we do with mutation testing? We've created some mutants; what do we do next? Well, I take the mutants that we've got. I call these the Fab 4, for reasons that should be obvious, and essentially what we've got to do is run them against our test suite. So we take each mutant, and what we're essentially doing in this analogy is seeing if that mutant can fly - if it can pass the test it needs to pass to be able to fly away and carry that message. So Henry, our beautiful mutant penguin here, needs to try to fly.

So when we take this mutant, we'll run our test suite. And in the best-case scenario, our tests are going to fail. That's good, because it means we've exposed Henry for what he is, which is an adorable penguin - and that's awesome, because it means we're not going to let him go to production. We've caught him in our tests. Now, if the tests pass, that's a bad time, because what's happened is that we haven't noticed that Henry is a penguin and not a pigeon.

And he's able to fly. I mean, look at these wings, this penguin can lift a bus, right? Very impressive. But it means that this mutant could get into production, and that's what we don't want. So what does this give us?

And I assert that what this gives us is a way to evaluate the quality of our tests. That's what mutation testing is.

Let's talk about a couple of frameworks that you might want to think about. So with frameworks, there are various options for various languages; you know, every language has some version of this. In my case, I use mutmut, which is a Python-based mutation testing framework.

But if you're interested in other languages, there are things like Pitest for Java or Stryker for JavaScript, C#, things like that. So there are options for multiple languages, and if your language hasn't been mentioned, there's probably still something for it. Now, I'm not a professional doctor, financial adviser, teacher or anything like this - the value of your investment may go down as well as up. But most importantly, what I used myself is what I'll talk about.

So I'll talk about mutmut today, because that's what I used myself. And what I did was apply this mutation testing to my own SDK, and I'll show you how I did that right now. So first of all, I pip install mutmut, which is just the standard Python way of installing packages, and then I run it. That was it.
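For reference, this is roughly what that looks like locally (I was using mutmut 2.x, so treat the exact commands and output as a sketch for your version):

```bash
pip install mutmut   # install the mutation testing tool
mutmut run           # run the test suite once, then generate and check mutants
```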

Luckily, it has some sensible defaults, which I did not need to change. You may need to change what's on there depending on what you're doing. There's different configuration options that you can choose, but for me, this was fine. And when I ran this, what happened was that it basically told me what was going to happen. So it printed out this.

It printed out the fact that it was going to run the entire test suite first of all, to understand the timing and things like that, and then it would generate the mutants and check them. There are some different outcomes that can come up here. For example, we can catch a mutant; we can have a mutant time out, which is not good; we can have ones that look suspicious and that we maybe don't trust; and we can also have ones that survived. That last one is a really bad situation, because we know we didn't catch that mutant. So when I ran this, what we saw first of all was that it ran my test suite and then checked those mutants, and it could see that I caught 512 of them but didn't catch 170 of them. Now, just think about this.

Is that a good number or a bad number? We'll talk about that later, but for now, let's look at some mutants. So here's a simple one we were able to catch. Here's what we started off with: this is our Messages API class.

We've got some valid message channels that we can send messages via. And the mutant actually just changed one of those - it just changed one of the strings. So actually, we weren't able to send messages with SMS anymore. And so when we had a test to do that, it failed, and that's great.

That means we caught that one. Here's another one. I don't know if people watching this have used Pydantic, but Pydantic is a great validation library, and we use it in the SDK. Here's an example where we have a validator (this is Pydantic V1) that rounds up a number. But the mutated version removed that annotation, that decorator (depending on your language, you'll call it something different; we call them decorators in Python).

So the mutant removed that, which meant this code would never be called - so when I had a test for rounding a number, that test failed. And so we caught that one as well. This is good, because it means these are 2 mutated versions of the codebase, two little Henries, that we were able to catch and say, hey, you're a penguin, which is what we need.
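To illustrate the idea (this is not the SDK's actual code - the model and field names are made up, and it uses Pydantic V1 syntax), the situation looks something like this:

```python
from math import ceil

from pydantic import BaseModel, validator  # Pydantic V1-style validator


class MessageSettings(BaseModel):
    ttl: float

    # The mutant simply deleted this decorator, so the rounding logic was
    # never registered with the model and never ran.
    @validator("ttl")
    def round_up(cls, v):
        return ceil(v)


def test_ttl_is_rounded_up():
    # A test like this fails against the mutant, because the value is no
    # longer rounded - so the mutant gets caught.
    assert MessageSettings(ttl=1.2).ttl == 2
```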

So how do we see the mutants? Well, if you do mutmut show, it can give you a list of them. And you can, for example, name one - show number 1 shows the first mutant, and you're able to see it. So in our case, number one here, for example, changes the auth type, and of course that one was easy to catch. But the really interesting thing we can do is see the outputs of all of these tests.
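In mutmut's case, pulling those outputs up looks roughly like this (command names as in the mutmut 2.x release I was using; check the docs for your version):

```bash
mutmut results    # list the surviving mutants and other notable outcomes
mutmut show 58    # show the diff for a specific mutant by its ID
mutmut html       # write a browsable HTML report, grouped per source file
```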

So we have a report in HTML format. Depending on your language and what mutation testing software you're using, it'll be easier or harder to do this. For us, it's quite a simple interface, but essentially what it will do is it will show you all of the mutants that you didn't catch inside of each file. So let's look at some.

Here's Mutant 58, which we didn't catch, and you'll see that not all of these are created equal. So this mutant, you can see all we did here is we renamed the logger. And I think that logging is out of scope of my testing. And so for my money, I don't mind that this mutant got in, this was okay because I don't think this is what I should be testing.

So that's okay. So here's another example. This is Mutant 62. And inside of here, all we do is change the value of a constant. And, again, I don't think this is in the scope of what I want to test.

I don't think it's important to me to test whether a default value is set or not. That's not important to me in this context. But let's look at one that is more important, that I do care about. So here's one. This is Mutant 112.

And what I'm doing here is creating instances of all of the classes for all the APIs we use. You can see here that we have a Voice class. In the mutant version, we don't create a Voice instance, but our tests are still passing, even though we're no longer creating the object that holds all of our voice methods. So why is this happening?

Why are my tests not failing? What's the deal? Well, it turns out that the way we test in the SDK is that we don't call through the client object; we call the Voice API class directly. But our users will call it through the client, like this. And so actually, I should probably have a test for this.
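A test for that case might look something like the sketch below. These classes are toy stand-ins, not the SDK's real API - the point is just to assert through the client object, the way users actually call it:

```python
class Voice:
    """Toy stand-in for the SDK's Voice API class."""

    def create_call(self, params):
        ...


class Client:
    """Toy stand-in for the SDK's client object."""

    def __init__(self, api_key, api_secret):
        # The surviving mutant deleted a line like this one.
        self.voice = Voice()


def test_client_exposes_voice():
    client = Client("dummy-key", "dummy-secret")
    # Fails if the client stops instantiating the Voice class internally,
    # which is exactly what the surviving mutant did.
    assert isinstance(client.voice, Voice)
```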

So this one tells me something very important about my code that will improve the test quality, because I can write a test for this exact case, which is representative of what our users will do. So let's go back to this number, then. We caught a lot of mutants, but we also didn't catch quite a few. In fact, we actually only caught about 75% of these mutants.

We actually missed about 25%. And the question I have for you is, is this a good number? 75%. Is that good? And if you're watching this and shrugging, that's the right answer.

What's interesting here is that 100% doesn't make sense! Because there are cases I just don't care about, right? Like the logger, or like the constant that changes. I don't care about those. What I'm using this for isn't to get a 100% mutation score, because then all I've done is taken the problem with code coverage as a score and recreated it one level up. What I really care about is getting a good insight into my code.

Hopefully, I've convinced you that you might want to explore mutation testing a little bit yourself. I really hope I have, but if I haven't, don't worry, because what I'm gonna do now is show you how you can get started with mutation testing, and it's so simple that I think anybody who's already writing tests could go away and do this. First of all, some broad-strokes advice. Start locally: run it on your machine first. That's what I did - I just ran it locally on my machine. Start small: if you've got a bigger codebase or a bigger set of tests than me, you might want to start with a subset of those. You might also want to tweak for performance, so you might want to exclude specific tests that aren't relevant to you - for example, integration tests, which you might not care about as much here.

Alternatively, you might want to exclude parts of the code - for example, if you've auto-generated some code, you probably don't care about mutating that.
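With mutmut, that kind of narrowing down lives in a config file. As a sketch (option names as I remember them from mutmut 2.x, with placeholder paths - check the docs for your version and project layout):

```ini
# setup.cfg
[mutmut]
# Only mutate the package you care about, not generated or vendored code
paths_to_mutate=src/
# Point mutmut at your fast unit tests rather than slow integration tests
tests_dir=tests/unit/
# Stop each test run at the first failure so checking a mutant is cheaper
runner=python -m pytest -x
```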

Okay. So when you wanna run it off your machine, which you may well do, then this part is gonna really help you. First of all, I'm really happy with this penguin picture. I don't know why it exists in this way, but I'm very happy it does.

But why would you want to run it off your machine? I'll tell you. So testing takes some time, and you might want to use resources in the cloud rather than your own machine whenever you wanna run the test suite. This also means you can add it to your CI system, which is very useful.

It means that you can specify different platforms that you want to run on, different OSs, different versions of your language, different versions of your code. So let me show you this. I've told you why it might be an advantage. Let me actually just show you what I did and how I did it.

I applied this to my Python SDK. And what I did here was create a GitHub Action. You can do this in any CI system you like; I use GitHub Actions. I created something for mutation testing - a very, very simple piece of YAML. Essentially, what it does is let me manually choose to run the mutation tests for my test suite. And actually, it was a deliberate choice that I have to trigger it manually.

I don't want it to run automatically. You could have it run on push, but I chose not to. So when I run this, you can see what it will do is actually just complete that job. And what it will also do is give me that HTML report. It'll give me that as a run artifact I can download, and that way I can see exactly how my tests are performing and which mutants aren't being caught.

So how do I do it? Well, this is really the interesting question here. I'm gonna show you the YAML that I used. So again, I use GitHub Actions. That's what I use myself, but you can do this in any CI system you like - it's just some simple scripting. And honestly, what I'd also say is feel free to go to the SDK that I'm maintaining. The code is there. You can take the YAML file. And please use it yourself, you're very welcome to. If it gets you going with mutation testing, everybody wins. But either way, I'll show you the YAML right now. I'll explain what each part is doing and how it works.

First of all, you can see here that we've got a mutation test YAML. You can see that we're running on Ubuntu, and we're just running with one version of Python - that's fine. We can see here the different steps. So, we check out the code and then we set up Python.

Once we've done that, we install our dependencies, but we now include mutmut as a dependency. Then we actually run our mutation tests, which we do with mutmut run, and we use 2 flags there. We use --no-progress, which basically means that our output looks better when we read it back in our CI system. And we also use the --CI flag, which gives us a sensible error code back. I'll highlight that one because it was my only contribution to mutmut, but I'm still proud of it, so I'm gonna mention it. It's actually very useful here, because otherwise we don't get a sensible error code and that causes GitHub Actions to fail. Once we've done that, we generate that HTML output and upload it so we can download it from GitHub later. And that's it. It's 35 lines, that's all we do, and that's all good.
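For anyone who wants a starting point, a workflow along those lines looks roughly like the sketch below. The action versions, Python version, and install step are illustrative - the real file lives in the SDK repository, so please copy that one instead:

```yaml
name: Mutation Tests

on:
  workflow_dispatch:   # triggered manually, not on every push

jobs:
  mutation-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt mutmut
      - name: Run mutation tests
        run: mutmut run --no-progress --CI
      - name: Generate HTML report
        run: mutmut html
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: mutation-test-report
          path: html/
```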

What other concerns do we have about CI? What other things do you wanna think about? Well, first of all, you wanna think about manual versus automatic triggering. Personally, I like to run these things manually, because I don't want this to be part of a PR process where it needs to get a certain score to be approved.

I want to run this when I've added something new that might change things, or when I need an insight into my code. You can run it automatically if you want, but just be aware that you don't have to. We don't want to fall into Goodhart's Law all over again and turn the mutation score into yet another target. We also wanna think about maybe running on multiple OSs. For my code, I didn't really need to, because I don't expect the results to change much between OSs - we don't do anything OS-specific; we're mostly making API calls, so that's not a big deal for us. You might also want to run with multiple versions of your dependencies and things like that, to see if anything changes there for you as well.

So to summarize, mutation testing tests your tests. It helps you beat Goodhart's Law for code coverage. If you want to use it, I would say start small and local, but then once you're ready, run it in a CI system so you can get that stuff done asynchronously and you're not wasting your machine's time and resources. Finally, I just want to say that mutants are valuable, and they are wonderful as well. If we think about Henry and everything that he's given us - okay, he can't fly, he can't do the job we need the code to do, but what he has done is give us so much insight into our codebase that he's super wonderful. And so, like I said at the start of this presentation, you shouldn't fear mutants - you should love them.

Thank you very much. It's been really, really lovely to speak to you today. If you wanna reach out to me with any questions, please join the Vonage Community Slack. If you wanna see the Python SDK, please feel free to have a look at that as well. And if you want to make an account with Vonage and try our stuff, again, there's a link here as well.

So hopefully these things are useful to you. Thanks very much, and I will see you again another time. Cheers.

Max Kahan, Python Developer Advocate

Max is a Python Developer Advocate and Software Engineer who's interested in communications APIs, machine learning, developer experience and dance! His training is in Physics, but now he works on open-source projects and makes stuff to make developers' lives better.
