I was recently responsible for two service outages of a web API in a single day. This made me think about how to test the API before deploying updates to production. Evidently, the existing unit tests, code reviews and manual tests were not enough in some cases.

In Comes the Staging System

About 10 years ago I worked part-time at a company developing and maintaining a custom web forum. Back then we used a staging system to test new features before releases. The staging system was a copy of the production system with some dummy data in its database.

In retrospect, I did not like that this staging system sat idle until testers or developers performed some actions on it. While reading the book Release It! (which I can really recommend so far), I had the idea of simulating user traffic by constantly submitting API requests to the staging system. Automatically duplicating a small percentage of live user traffic might be an even better idea, but let’s start slowly.

In such a setup we would be running a staging system consisting of a single server. This server is monitored in the same way as the production system - except that all alerts remain on low priority. Nobody should be called in on the weekend just because the staging system is impaired.

Whenever a new version of the product is tagged, it will be automatically deployed to the staging system. If the error rate suddenly goes up, this will be noticed before the update is actually deployed to production. If no issues are found within a defined time frame (e.g. one hour or one day), the release can be pushed to production.
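The promotion rule described above could be sketched roughly like this. It is only an illustration: the names `Sample` and `may_promote`, the one-hour soak time and the 1% error threshold are assumptions made for this sketch, and how the counts are pulled out of the monitoring system is left open.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring sample from the staging system (hypothetical shape)."""
    seconds_since_deploy: float
    requests: int
    errors: int


def may_promote(samples, soak_seconds=3600.0, max_error_rate=0.01):
    """Return True if the release survived the soak period with few errors.

    Both thresholds are illustrative defaults, not recommendations.
    """
    if not samples:
        return False
    # The release must have been observed for the whole soak period.
    if max(s.seconds_since_deploy for s in samples) < soak_seconds:
        return False
    total_requests = sum(s.requests for s in samples)
    total_errors = sum(s.errors for s in samples)
    if total_requests == 0:
        # No traffic means no evidence, so do not promote.
        return False
    return total_errors / total_requests <= max_error_rate
```

A release that has only been observed for half an hour is held back regardless of its error rate, and so is a release that received no traffic at all.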

Collecting Actual Users’ Requests

So I started to think: how can we simulate our users’ traffic? Of course, I could write down some combinations of API parameters to use, but I would never be able to come up with the variety of parameters the actual users use. A much better approach would be to use real user requests! Even if we do not duplicate them on the fly (remember, we want to start with a simple goal), we could still log their requests and use them at a later point to submit staging requests.

A short experiment revealed that this is actually quite simple with Flask. We can use a before_request handler to capture the request and log it to any output. This will cost some time, but if the output target is fast enough it should not matter too much. If it does get too slow, maybe something can be done using an asynchronous worker thread, but I haven’t looked into this.
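For what it's worth, the worker-thread idea might look roughly like the following. This is an untested-in-production sketch using only the standard library; the class name `BackgroundLogger`, its `sink` parameter and the queue size are made up for illustration.

```python
import json
import queue
import threading


class BackgroundLogger:
    """Hands log records to a worker thread so the request handler need not wait."""

    def __init__(self, sink=print, max_pending=1000):
        self._sink = sink  # any callable taking one string (hypothetical parameter)
        self._queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, record: dict) -> bool:
        """Enqueue a record without blocking; drop it if the queue is full."""
        try:
            self._queue.put_nowait(record)
            return True
        except queue.Full:
            return False

    def _run(self):
        while True:
            record = self._queue.get()
            try:
                self._sink(json.dumps(record))
            except Exception:
                # A broken sink must not kill the worker thread.
                pass
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until all queued records have been handed to the sink."""
        self._queue.join()
```

The record has to be converted to a plain dict inside the request handler, because flask.request is bound to the current request context and cannot simply be passed to another thread. Dropping records when the queue is full is a deliberate choice here: for sampled traffic, losing a few log entries seems better than slowing down customers.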

In the log handler we can just dump all information we need into a JSON object and then forward it. For demonstration purposes I just printed it to stdout, but of course we could as well write it to a file, send it to a message queue or anything else. And if it’s sent to a message queue, it’s not difficult anymore to implement the (almost) live request duplication.

Of course, you need to make sure that only the production system logs requests. If the staging system also started logging requests and those were replayed as well, you’d get into an infinite loop. Since I did not want to log each and every request, I implemented a variable sample_rate in my proof-of-concept. If we set this to zero on the staging system, nothing will be logged there, because random.random() never returns a value below zero.

So here’s the result of my PoC. The app.py file:

import flask
import os

import sampler


app = flask.Flask(__name__)
app.config["DEBUG"] = True

sample_rate = float(os.getenv('SAMPLE_RATE', default=1))
request_sampler = sampler.RequestSampler(sample_rate)


@app.before_request
def sample_request():
    request_sampler.log_request(flask.request)


@app.route('/', methods=['POST'])
def home():
    return {}, 200, {'Content-Type': 'application/json; charset=utf-8'}

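
if __name__ == '__main__':
    # Only start the development server when the file is run directly,
    # not when it is imported (e.g. by a WSGI server or a test).
    app.run()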

And the sampler.py:

import base64
import flask
import json
import random


class RequestSampler:
    def __init__(self, sample_rate: float):
        self.sample_rate = sample_rate

    def log_request(self, request: flask.Request):
        if random.random() < self.sample_rate:
            # Adding an exception handler here might be a good idea,
            # we would not want to fail a customer's request just because
            # our request logging is broken - instead we should just log the
            # exception
            self._do_log_request(request)

    def _do_log_request(self, request: flask.Request):
        data = {
            'headers': dict(request.headers),
            'data': base64.b64encode(request.data).decode('ascii'),
            # to_dict(flat=False) keeps repeated keys such as ?a=1&a=2
            'args': request.args.to_dict(flat=False),
            'form': request.form.to_dict(flat=False),
            'path': request.path,
            'method': request.method,
        }
        print(json.dumps(data))
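
To close the loop, the logged lines could later be replayed against the staging system. The following is a minimal sketch using only the standard library, assuming one JSON object per line with the fields written by the sampler above. The function names `build_request` and `replay` and the staging URL in the usage note are hypothetical. The logged form fields are ignored here, so only the raw body and the query string are replayed.

```python
import base64
import json
import urllib.error
import urllib.parse
import urllib.request


def build_request(record: dict, base_url: str) -> urllib.request.Request:
    """Turn one logged record into a request against the staging system."""
    body = base64.b64decode(record['data'])
    # Skip headers that urllib sets itself or that belonged to the
    # original connection.
    skip = {'host', 'content-length'}
    headers = {k: v for k, v in record['headers'].items()
               if k.lower() not in skip}
    url = base_url.rstrip('/') + record['path']
    # doseq=True accepts single strings as well as lists of values.
    query = urllib.parse.urlencode(record.get('args') or {}, doseq=True)
    if query:
        url += '?' + query
    return urllib.request.Request(
        url, data=body or None, headers=headers, method=record['method'])


def replay(lines, base_url: str):
    """Send every logged request to base_url and yield the status codes."""
    for line in lines:
        if not line.strip():
            continue
        req = build_request(json.loads(line), base_url)
        try:
            with urllib.request.urlopen(req, timeout=10) as response:
                yield response.status
        except urllib.error.HTTPError as e:
            # 4xx and 5xx responses are exactly what we want to count.
            yield e.code
```

With the sampler output saved to a file, something like `for status in replay(open('requests.log'), 'http://localhost:5000'): print(status)` would submit the requests one by one. The staging system should run with SAMPLE_RATE=0 while doing so.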

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.