Numerische Vergleiche ohne Überlauf

Wenn man Zahlen verarbeitet, die nicht aus vertrauenswürdigen Quellen stammen, sollte man immer damit rechnen, dass diese Zahlen nicht in den Datentypen passen, den man verwendet. Solche numerischen Überläufen können schnell zu subtilen Bugs oder sogar Sicherheitsproblemen führen.

I recently wrote code that checked the validity of JSON web tokens. JWTs may contain a claim ("exp") that specifies the expiry date of the token in seconds since the epoch (the time 00:00:00 UTC on 1 January 1970). I wrote that in Perl and it is possible that there are still Perls around that have an integer size of 32 bits.

But 32 bit integers for the epoch will overflow in January 2038 and then start again at -2147483648 (231). On a 32 bit system, the seconds since the epoch in 2039 will be negative, and so they are considered less than the current time (if you read this before January 2038).

Let's see how arbitrarily large decimal integers can be compared in C. Like most comparison functions, ours will return a negative value, when the first number is less than the second, a positive one when it is greater, and zero, when both numbers are equal.

The Naive Approach With strcmp()

The function strcmp() compares two strings lexically. And because the ASCII code of the number 0 is 48, while that of 9 is 57, why not just compare the two numbers as strings:

int
compare_integers(const char *num1, const char *num2)
{
    return strcmp(num1, num2);
}

But that has several issues:

  1. Integers may have a leading + or -.
  2. The string "23" is considered greater than "1303" because "2" is greater than "1" as a string.
  3. Integers may have gratuitous leading zeros which leads to more problems.

So that method will not work.

Checking Whether a String is an Integer

The first step should be to check whether the input strings are really integers. That is easy with regular expressions. In C it is a little bit more complicated:

static int
is_integer(const char *ptr)
{
    if (ptr[0] == '-') {
        ++ptr;
    } else if (ptr[0] == '+') {
        ++ptr;
    }

    if (ptr[0] == '\0') return 0;

    while (ptr[0] != '\0') {
        if (!isdigit(ptr[0])) return 0;

        ++ptr;
    }

    return 1;
}

First a possible leading - or + sign is skipped.

In line 10, we check whether the first character after the possible sign is a null byte which terminates strings in C. If that is the case, the "integer" was either an empty string or a lone + or - without any digits.

The while loop beginning in line 12 checks whether all remaining characters are decimal digits.

The Comparison Function

If both input strings are valid interpretations of integers, we can compare them.

The comparison is very simple if one number is positive and the other negative. So we first check for a possible + or - sign:

static int
compare_integers(const char *num1, const char *num2)
{
    char sign1 = '+', sign2 = '+';

    if (*num1 == '-' || *num1 == '+') {
        sign1 = *num1;
        ++num1;
    }

    if (*num2 == '-' || *num2 == '+') {
        sign2 = *num2;
        ++num2;
    }

    /* ... */
}

Numbers without a sign are positive. So we initialize the two variables with a + sign.

If either of the strings starts with a sign, we store it and skip the sign. In C, incrementing a pointer to a string has the effect that the first character is skipped.

while (num1[0] == '0') ++num1;
if (!num1[0]) --num1;

while (num2[0] == '0') ++num2;
if (!num2[0]) --num2;

Next we skip all leading zeros, again by incrementing the pointer to the string. But that would turn the string "0" into an empty string. So we undo the last skip, if the resulting string is empty.

Note: The if clause if (!num1[0]) checks, whether the string num1 is empty. If the first character of the string is a null byte, the string is terminated there, and that means it is empty.

There is one edge case that has to be solved. The numbers "-0" and "+0" are equal. We therefore change "-0" into "+0" before we start comparing:

if (sign1 == '-' && num1[0] == '0' && num1[1] == '\0')
    sign1 = '+';

if (sign2 == '-' && num2[0] == '0' && num2[1] == '\0')
    sign2 = '+';

This can be read as: if the sign is - and the first character is a 0 and the second character is the terminating null byte, change the minus into a plus sign.

Now that everything is normalized, there is an early exit, when the two signs differ:

if (sign1 == '-' && sign2 == '+') {
    return -1;
}

if (sign1 == '+' && sign2 == '-') {
    return -1;
}

Before we continue with the comparison, we have to handle the case that both numbers are negative. 100 is greater than 10 but -100 is less than 10. In other words, if both numbers are negative, the result of the comparison must be inverted, +1 must become -1 and vice versa. This can be easily achieved by multiplying the result with -1 later.

So we check whether the numbers are negative:

int result_sign = sign1 == '-' ? -1 : +1;

There is no need to check that sign2 is - because at this point we know that sign1 and sign2 are the same.

And now comes the actual trick! Even without converting the strings into real integers (which can cause overflows) it is trivial to decide which of the two strings represent the greater integer. We just compare the lengths of the strings:

ssize_t l1 = strlen(num1);
size_t l2 = strlen(num2);

if (l1 > l2) {
    return result_sign * +1;
} else if (l1 < l2) {
    return result_sign * -1;
}

A "longer" numbers is always greater than a "shorter" number, at least after a possible leading sign or gratuitous leading zeros have been removed. This is the case here.

And what if two numbers have equal length? Then you can just use the already mentioned function strcmp() and do a string comparison. "2304" is greater than "1303", both as a string and as a number:

return result_sign * strcmp(num1, num2);

By the way, since Sunday, September 9th 2001 at 01:46:40 AM UTC Unix timestamps have 10 digits. This will last until Saturday, November 20th 2286 05:46:39 PM UTC, leap seconds ignored. That means that for that period of time it will work to compare Unix timestamps as strings! But you are still a lousy developer, if you do so! 😁

Floating Point Numbers

Comparing floating point numbers without overflow is left as an exercice to the reader because it is simple. You first compare the part before the dot, and if that is equals you always compare the fractional parts as strings.

Source Code

The complete source code (download link) follows below. Save it as compare-integers.c and create a binary executable compare-integers from it with the command make compare-integers.

You can test it like ./compare-integers 1303 2304.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#include <sys/types.h>>
static int is_integer(const char *str);
static int compare_integers(const char *num1, const char *num2);

int
main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "Usage: %s INTEGER1 INTEGER2\n", argv[0]);
        exit(1);
    }

    if (!is_integer(argv[1])) {
        fprintf(stderr, "'%s' is not an integer.\n", argv[1]);
        exit(1);
    }

    if (!is_integer(argv[2])) {
        fprintf(stderr, "'%s' is not an integer.\n", argv[2]);
        exit(1);
    }

    int result = compare_integers(argv[1], argv[2]);

    if (result > 0) {
        printf("%s is greater than %s.\n", argv[1], argv[2]);
    } else if (result < 0) {
        printf("%s is less than %s.\n", argv[1], argv[2]);
    } else {
        printf("%s equals %s.\n", argv[1], argv[2]);
    }

    return 0;
}

static int
compare_integers(const char *num1, const char *num2)
{
    char sign1 = '+', sign2 = '+';

    if (*num1 == '-' || *num1 == '+') {
        sign1 = *num1;
        ++num1;
    }

    if (*num2 == '-' || *num2 == '+') {
        sign2 = *num2;
        ++num2;
    }

    while (num1[0] == '0') ++num1;
    if (!num1[0]) --num1;

    while (num2[0] == '0') ++num2;
    if (!num2[0]) --num2;

    if (sign1 == '-' && num1[0] == '0' && num1[1] == '\0')
        sign1 = '+';

    if (sign2 == '-' && num2[0] == '0' && num2[1] == '\0')
        sign2 = '+';

    if (sign1 == '-' && sign2 == '+') {
        return -1;
    }

    if (sign1 == '+' && sign2 == '-') {
        return -1;
    }

    int result_sign = sign1 == '-' ? -1 : +1;

    size_t l1 = strlen(num1);
    size_t l2 = strlen(num2);

    if (l1 > l2) {
        return result_sign * +1;
    } else if (l1 < l2) {
        return result_sign * -1;
    }

    return result_sign * strcmp(num1, num2);
}

static int
is_integer(const char *ptr)
{
    if (ptr[0] == '-') {
        ++ptr;
    } else if (ptr[0] == '+') {
        ++ptr;
    }

    if (ptr[0] == '\0') return 0;

    while (ptr[0] != '\0') {
        if (!isdigit(ptr[0])) return 0;

        ++ptr;
    }

    return 1;
}
Leave a comment
Diese Website verwendet Cookies und ähnliche Technologien, um gewisse Funktionalität zu ermöglichen, die Benutzbarkeit zu erhöhen und Inhalt entsprechend ihren Interessen zu liefern. Über die technisch notwendigen Cookies hinaus können abhängig von ihrem Zweck Analyse- und Marketing-Cookies zum Einsatz kommen. Sie können ihre Zustimmung zu den vorher erwähnten Cookies erklären, indem sie auf "Zustimmen und weiter" klicken. Hier können sie Detaileinstellungen vornehmen oder ihre Zustimmung - auch teilweise - mit Wirkung für die Zukunft zurücknehmen. Für weitere Informationen lesen sie bitte unsere Datenschutzerklärung.