Overflow in unsigned arithmetic in C

This came up in work recently – and I got a bit confused about it, so I thought it was worth writing it down in case someone like me needs to look it up.

Let’s consider a real world example – we have a 16 bit counter that increments on every event X and which we sample periodically: at our last sampling was at value 0xFFFD and at our current sample is 0x03. How many incidents of X have occurred?

Ignoring the possibility of a ‘double rollover’ or anything similar, the answer is 6 (don’t forget the counter goes through 0). But how can we calculate this programmatically?

It turns out it’s very simple and this always works:

uint16_t numberEvents = currentSample - previousSample

In C the calculation (constrained to 16 bits) of 0x03 – 0xFFFD with unsigned arithmetic results in an overflow (we cannot calculate a negative number) but the language standard gives us a well-defined answer that works well.

The answer of -65530 is reinterpreted as modulo 0x10000 (65536) and this comes out at 6. To explain why think of a smaller number system – modulo 5 say. Then the number line (integers and integers mod 5) looks like this:

Integers: 10 9 8 7 6 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
Modulo 5: 0 4 3 2 1 0 4 3 2 1 0 4 3 2 1 0 4 3 2 1 0

Under this system we can see that 1 – 4 = -3 for integers but 2 mod 5 – which if this was a simple two bit counter would be the correct number of increments that had taken place.

New lackeyml DTD

I cannot imagine that this matters to anyone other than me right now, but for the record (this is taken from a hacked up version of Valgrind‘s Lackey tool):

        VG_(printf)("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        VG_(printf)("<!DOCTYPE lackeyml [\n");
        VG_(printf)("<!ELEMENT lackeyml (application?, (thread)*)>\n");
        VG_(printf)("<!ATTLIST lackeyml version CDATA #FIXED \"0.2\">\n");
        VG_(printf)("<!ATTLIST lackeyml xmlns CDATA #FIXED");
        VG_(printf)(" \"https://cartesianproduct.wordpress.com\">\n");
        VG_(printf)("<!ELEMENT application EMPTY>\n");
        VG_(printf)("<!ATTLIST application command CDATA #REQUIRED>\n");
        VG_(printf)("<!ELEMENT thread (instruction|store|load|modify)* >\n");
        VG_(printf)("<!ATTLIST thread tid CDATA #REQUIRED>\n");
        VG_(printf)("<!ELEMENT instruction EMPTY>\n");
        VG_(printf)("<!ATTLIST instruction address CDATA #REQUIRED>\n");
        VG_(printf)("<!ATTLIST instruction size CDATA #REQUIRED>\n");
        VG_(printf)("<!ELEMENT modify EMPTY>\n");
        VG_(printf)("<!ATTLIST modify address CDATA #REQUIRED>\n");
        VG_(printf)("<!ATTLIST modify size CDATA #REQUIRED>\n");
        VG_(printf)("<!ELEMENT store EMPTY>\n");
        VG_(printf)("<!ATTLIST store address CDATA #REQUIRED>\n");
        VG_(printf)("<!ATTLIST store size CDATA #REQUIRED>\n");
        VG_(printf)("<!ELEMENT load EMPTY>\n");
        VG_(printf)("<!ATTLIST load address CDATA #REQUIRED>\n");
        VG_(printf)("<!ATTLIST load size CDATA #REQUIRED>\n");
        VG_(printf)(" xmlns=\"https://cartesianproduct.wordpress.com\">\n");

Update: As originally published this code had an error in the name of the attribute for thread – I have fixed that now.

In continued praise of “Programming Pearls”

Algorithm City

I have read two books in the last year that have fundamentally changed the way I think about computers and programming – The Annotated Turing and Programming Pearls.

Programming Pearls in particular continues to inspire me in the way it makes you think about building better algorithms and using data structures better – and here’s a short tale about what I have just done (and why I did it) with my valext program.

Valext runs a program under step tracing to allow the detailed examination of patterns of memory use – interrogating the /proc/pid/pagemap at every step.

The way I was doing this was to query the /proc/pid/maps, find the blocks that were being mapped and then seek through /proc/pid/pagemap, reading off the page status.

For every page that was mapped I would build a data structure and stick it into a linked list

struct blocklist {
    uint64_t address;
    struct blocklist *nextblock;
for (i = startaddr; i < endaddr; i++)
    block = malloc(sizeof (struct blocklist));
    block->address = i;
    block->nextblock = NULL;
    if (*lastblock == NULL){
        *lastblock = block;
        *header = block;
    } else {
        (*lastblock)->nextblock = block;
        *lastblock = block;

But the problem with this code – an \mathcal{O}(n) algorithm – is that performance stinks. I was running it on a very underpowered laptop and got to about four days of wallclock time, in which time about 2 million steps had been taken.

So today I rewrote it with an essentially \mathcal{O}(1) algorithm – I allocate an array of arbitrary size (512 elements in current code) and fill that up – with a 0 value being the guard (ie marking the end). If more than 512 elements are needed then another 512 will be allocated and chained to the top and so on…

struct blockchain {
    int size;
    uint64_t *head;
    struct blockchain *tail;

uint64_t* blockalloc(int size)
    uint64_t *buf = calloc(size, sizeof(uint64_t));
    return buf;

struct blockchain *newchain(int size)
    struct blockchain *chain = NULL;
    chain = calloc(1, sizeof(struct blockchain));
    if (!chain)
        return NULL;
    chain->head = blockalloc(size);
    if (chain->head == NULL) {
        return NULL;
    chain->size = size;
    return chain;

/* recursively free the list */
void cleanchain(struct blockchain *chain)
    if (!chain)

/* set up a list */
struct blockchain* getnextblock(struct blockchain** chain,
struct blockchain** header, char* buf)
    int match, t = 0;
    uint64_t startaddr;
    uint64_t endaddr;
    uint64_t i;
    const char* pattern;
    regex_t reg;
    regmatch_t addresses[3];

    pattern = "^([0-9a-f]+)-([0-9a-f]+)";
    if (regcomp(&reg, pattern, REG_EXTENDED) != 0)
        goto ret;
    match = regexec(&reg, buf, (size_t)3, addresses, 0);
    if (match == REG_NOMATCH || match == REG_ESPACE)
         goto cleanup;
    startaddr = strtoul(&buf[addresses[1].rm_so], NULL, 16) >> PAGESHIFT;
    endaddr = strtoul(&buf[addresses[2].rm_so], NULL, 16) >> PAGESHIFT;
    if (*chain == NULL) {
        *chain = newchain(MEMBLOCK);
        if (*chain == NULL)
            goto cleanup;
        *header = *chain;
    for (i = startaddr; i < endaddr; i++)
        if (t >= MEMBLOCK) {
        struct blockchain *nxtchain = newchain(MEMBLOCK);
        if (!nxtchain)
             goto cleanup;
        (*chain)->tail = nxtchain;
        *chain = nxtchain;
        t = 0;
    (*chain)->head[t] = i;
    return *chain;

As I am only testing this code now I don’t know if it will be faster – though everything tells me that it should be (though I am not entirely clear if this memory allocation issue is the real bottleneck in the program or whether it is just that a Pentium III is a Pentium III and there is nothing much I can do about that).

What I do know is that if it was not for the inspiration of Programming Pearls I probably would not even have bothered trying.