Decoding Base64 in AWK

AWK?

I recently needed to make a program that was as easy to use and portable as possible for Unix-like systems. My first thought was to use the Bourne shell for this, but it turned out that I wanted to do more complex programming than plain shell scripting allows. I did a bit of research, and discovered that AWK is required on POSIX-compliant systems, and is available on nearly all Unix-like systems. There is a version of AWK included in BusyBox, so even embedded systems have a good chance of having AWK available.

Weirdly, I discovered that programming in AWK – while definitely not great compared to modern scripting languages like Python – is actually not that bad. Working with the language, you can see the evolution from AWK to Perl, and its influence on Python and JavaScript.

I will write up a few of the discoveries and techniques from working with AWK, starting with how to decode Base64.

Encoding Binary into ASCII

It used to be quite common on many systems to only be able to send ASCII data. Standards were defined based on this, and some component somewhere would break if non-ASCII data was used (either in processing, transmission, or storage). To work around this, techniques were developed to encode binary data using only ASCII.

While this sounds like some esoteric technology from the dawn of the age of computing, every time you send someone a photograph or document using e-mail, this sort of binary-to-ASCII encoding is probably being used.

ASCII only uses 7 bits, while modern systems typically break data up into 8-bit bytes. This means that an ASCII representation of the binary data will always take more space than the raw, binary version.

It is easy to imagine a very simple way of encoding, where you convert the binary to a string of 0s and 1s, so a byte with value 42 might look like “00101010”. This is of course very wasteful, and would make your ASCII version 8 times bigger than the binary version.

A better encoding might be hexadecimal, where the 42 might look like “2A”. Hexadecimal is very natural for programmers, and the resulting encoding is only 2 times bigger than the original. One drawback is that decoders may differ in whether they accept uppercase and lowercase digits, so “2a” might work in place of “2A” or it might not.
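
As a quick illustration of these two encodings, here is a small AWK sketch (not part of the decoder we will build later) that produces both for the byte value 42:

BEGIN {
    value = 42
    # build the string of 0s and 1s by testing each bit, high to low
    bits = ""
    for (i = 7; i >= 0; i--)
        bits = bits (int(value / 2 ^ i) % 2)
    print bits                # prints 00101010
    # hexadecimal is built into printf
    printf "%02X\n", value    # prints 2A
}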

Base64 Encoding

To make the encoding even more efficient, other schemes have been developed. The general idea is to define a mapping of ASCII characters onto groups of binary bits. The IETF has standardized a few of these, in RFC 3548. The basic idea of Base64 is documented like this:

   A 65-character subset of US-ASCII is used, enabling 6 bits to be
   represented per printable character.  (The extra 65th character, "=",
   is used to signify a special processing function.)

   The encoding process represents 24-bit groups of input bits as output
   strings of 4 encoded characters.  Proceeding from left to right, a
   24-bit input group is formed by concatenating 3 8-bit input groups.
   These 24 bits are then treated as 4 concatenated 6-bit groups, each
   of which is translated into a single digit in the base 64 alphabet.

To put it another way, in Base64 every 4 ASCII characters represent 3 binary bytes.
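
For example, the three bytes of the ASCII string “Man” (77, 97, 110) form one 24-bit group, which splits into the four 6-bit values 19, 22, 5, and 46, and the Base64 alphabet maps those to “TWFu”. In AWK arithmetic the conversion looks like this (a quick worked sketch, not part of the decoder we will build):

BEGIN {
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    n = 77 * 65536 + 97 * 256 + 110     # "M", "a", "n" as one 24-bit number
    c1 = int(n / 262144) % 64           # top 6 bits    -> 19
    c2 = int(n / 4096) % 64             #               -> 22
    c3 = int(n / 64) % 64               #               -> 5
    c4 = n % 64                         # bottom 6 bits -> 46
    print substr(alphabet, c1 + 1, 1) substr(alphabet, c2 + 1, 1) substr(alphabet, c3 + 1, 1) substr(alphabet, c4 + 1, 1)
    # prints: TWFu
}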

Base64 is the most efficient encoding possible if you want to use printable characters and stick with a whole number of bits per character. It is possible to encode using the entire ASCII alphabet, but that includes non-printable characters (anything in the range 0 to 31, and character 127) which often have special meaning to software (for example, character 0 often marks the end of a string in memory). It is also possible to encode using fractional bits – for example using 80 characters in the alphabet – but both encoding and decoding become more complex.

Decoding Base64 in AWK

If you have some binary data encoded in Base64, you will sometimes want to extract it and get at the binary version in AWK. One approach might be to call out to a separate program to decode the Base64 data, such as the base64 utility. This has a number of limitations. One basic problem is that not all systems have base64 installed; remember, the reason we are using AWK is portability!

An equally important limitation is that AWK may not be able to handle the resulting binary data itself. Except for gawk, modern AWK interpreters appear to use C strings internally, meaning that any zero byte (NUL) will cause problems:

$ gawk 'BEGIN { printf "\0hello\n" }' 
hello
$ mawk 'BEGIN { printf "\0hello\n" }' 
$ busybox awk 'BEGIN { printf "\0hello\n" }' 
$ original-awk 'BEGIN { printf "\0hello\n" }'

As you can see here, embedding a NUL character in a string in any AWK other than gawk means that the rest of the string is ignored.

We can work around this limitation in strings by decoding to an array of integer values rather than to a string. (We will have to do more work later if we want to output these values; this is discussed below.)

Setting up the Alphabet

Since each character has a specific numeric value, we will use an array to store these values and look them up at runtime. These values are documented in the RFC: the alphabet includes A to Z, a to z, 0 to 9, +, and /, plus the padding character =.

# create our lookup table
BEGIN {
    # load symbols based on the alphabet
    for (i=0; i<26; i++) {
        BASE64[sprintf("%c", i+65)] = i
        BASE64[sprintf("%c", i+97)] = i+26
    }
    # load our numbers
    for (i=0; i<10; i++) {
        BASE64[sprintf("%c", i+48)] = i+52
    }
    # and finally our two additional characters
    BASE64["+"] = 62
    BASE64["/"] = 63
    # also add in our padding character
    BASE64["="] = -1
}

We use sprintf() to convert the numbers into ASCII characters. We will end up with a table that contains values like:

BASE64["A"] = 0
BASE64["B"] = 1
   ...
BASE64["Z"] = 26
BASE64["a"] = 27
BASE64["b"] = 28
   ...
BASE64["z"] = 51
BASE64["0"] = 52
BASE64["1"] = 53
   ...

For example, by using this table, when we see the character “y” in a Base64-encoded string, we know that it converts to the number 50.

Pulling Out the Values

To perform the actual decoding, we go through the string and pull out 4 characters at a time. These get translated into 1, 2, or 3 bytes. The Base64 standard requires that the encoded string be padded to a multiple of 4 characters, so we keep going as long as we have at least 4 characters left, and stop when we run out.

        g0 = BASE64[substr(encoded, 1, 1)]
        g1 = BASE64[substr(encoded, 2, 1)]
        g2 = BASE64[substr(encoded, 3, 1)]
        g3 = BASE64[substr(encoded, 4, 1)]
        if (g0 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g0) >> "/dev/stderr"
            exit 1
        }
        if (g1 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g1) >> "/dev/stderr"
            exit 1
        }
        if (g2 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g2) >> "/dev/stderr"
            exit 1
        }
        if (g3 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g3) >> "/dev/stderr"
            exit 1
        }

In this code, we rely on AWK’s behavior of returning an empty string if an array element does not exist, and check for errors. Happily "" != 0 in AWK. (In this case we just output an error message and exit… depending on your application you may want more sophisticated error handling. Error handling is tricky in AWK since we do not have exceptions or the ability to return more than a single value, and the only pass-by-reference type that we have is an array. That’s a story for another blog post though…)

Bitwise, AWK Foolish

If we could do some bitwise operations, it would be straightforward to take these numeric values and combine them into bytes. Unfortunately AWK does not have any bitwise operators, although gawk does add some non-standard functions which support these. Luckily, we can get the same results with ordinary arithmetic.

Bitwise Operation      Operator   Equivalent   Example
Left shift             <<         *            (x << 3) == (x * 8)
Right shift            >>         /            (x >> 2) == int(x / 4)
AND (to clear bits)    &          %            (x & 255) == (x % 256)
OR (to set bits)       |          +            (x | 3) == (x + 3)

Note: The AND and OR replacements are not equivalent in the general case, but they work for our purposes: we only use AND to zero out high bits (which modulo by a power of two does), and we only use OR in places where the two values never have a 1 bit in the same position, so addition gives the same result.
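
If you have gawk available, you can check these equivalences (and the OR caveat) with its non-standard bitwise functions. This snippet is only a sanity check and is not needed by the decoder:

# run with gawk, which provides lshift(), rshift(), and() and or()
BEGIN {
    x = 42
    print lshift(x, 3), x * 8          # 336 336
    print rshift(x, 2), int(x / 4)     # 10 10
    print and(x, 255), x % 256         # 42 42
    print or(5, 3), 5 + 3              # 7 and 8: OR and + differ when bits overlap
}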

So for each set of 4 characters in the encoded string, we produce our resulting 3 bytes like this:

        # we don't have bit shifting in AWK, but we can achieve the same
        # results with multiplication, division, and modulo arithmetic
        # result[n++] = (g0 << 2) | (g1 >> 4)
        result[n++] = (g0 * 4) + int(g1 / 16)
        if (g2 != -1) {
            # result[n++] = ((g1 << 4) & 0xFF) | (g2 >> 2)
            result[n++] = ((g1 * 16) % 256) + int(g2 / 4)
            if (g3 != -1) {
                # result[n++] = ((g2 << 6) & 0xFF) | g3 
                result[n++] = ((g2 * 64) % 256) + g3
            }
        }

The comparisons with -1 allow us to handle the cases where we only have 2 or 3 characters encoded (that is, where the input was padded with “=”). Note that the code is not totally strict in checking for all of the possible ways to badly encode a value, but our goal is mainly to get the data out, so we assume that the encoder mostly knew what it was doing.
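
Putting the pieces together, the decoding logic can be wrapped up in a base64decode() function like the one called in the output code below. The exact loop structure and the whitespace stripping here are my own sketch of how this might look, not necessarily how the final program on GitHub does it:

function base64decode(encoded, result,    n, g0, g1, g2, g3) {
    n = 1
    # strip whitespace and line breaks that may be embedded in the input
    gsub(/[ \t\r\n]/, "", encoded)
    while (length(encoded) >= 4) {
        g0 = BASE64[substr(encoded, 1, 1)]
        g1 = BASE64[substr(encoded, 2, 1)]
        g2 = BASE64[substr(encoded, 3, 1)]
        g3 = BASE64[substr(encoded, 4, 1)]
        # (the error checks shown earlier are omitted here for brevity)
        result[n++] = (g0 * 4) + int(g1 / 16)
        if (g2 != -1) {
            result[n++] = ((g1 * 16) % 256) + int(g2 / 4)
            if (g3 != -1)
                result[n++] = ((g2 * 64) % 256) + g3
        }
        # move on to the next group of 4 characters
        encoded = substr(encoded, 5)
    }
    return n - 1    # the number of bytes decoded
}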

Outputting Binary Data in AWK

If your program needs to process the binary data somehow then having the data in an array is fine, and our work here is done. Enjoy your awesome new AWK program!

However, if you actually want to output binary data, then AWK’s inability to deal with binary data becomes a major problem. There are two issues:

  1. BusyBox AWK has no way to output a NUL character (character 0). This is investigated in this excellent Stack Overflow answer: https://stackoverflow.com/a/32302711.
  2. Unless care is taken, gawk will interpret non-ASCII characters based on the locale. This can be avoided by setting the environment properly, for example by LC_ALL=C, or by using the --characters-as-bytes flag.

If you don’t need to support BusyBox’s AWK, or can be sure that gawk will be run with the proper settings, then you can simply use sprintf("%c", value) and any byte can be output. Otherwise, we can work around this by using the POSIX-standard printf command (not the C library function, but the stand-alone executable).

Basically, we can make a formatted ASCII string that the printf command will then output as a binary value:

    # decode the current input line into an array of byte values
    base64decode($0, result)

    # build up a string of octal escapes for the printf command, flushing
    # it in chunks so the command line passed to system() stays small
    printf_str = ""
    for (i=1; i in result; i++) {
        printf_str = printf_str sprintf("\\%03o", result[i])
        if (length(printf_str) >= 1024) {
            system("printf '" printf_str "'")
            printf_str = ""
        }
        delete result[i]
    }
    system("printf '" printf_str "'")

It is somewhat ironic that we are writing a program that understands a format designed to get around binary limitations, and have to encode the output in a format to get around binary limitations. But we only need to do this if we want to output binary data; for processing, we can happily use our array without having to rely on external programs.
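
For completeness, here is a minimal sketch of the simpler output path mentioned above: if you can count on gawk running with LC_ALL=C (or with --characters-as-bytes), the decoded array can be printed directly and no external printf is needed.

    base64decode($0, result)

    # gawk can emit any byte value directly, including NUL
    for (i = 1; i in result; i++) {
        printf "%c", result[i]
        delete result[i]
    }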

Conclusion and Source

AWK is somewhat challenging to work with. In spite of that, it does have the advantage of maximum portability across Unix-like systems.

You can see the source at GitHub:

https://github.com/shane-kerr/AWK-base64decode


Benchmarking Python Heaps

I was doing some work on some software that needs a cache. This is pretty common in the DNS world, and indeed caches with a timeout have many uses. The basic operations are:

  • Add an entry with a key and a given expiration time
    This is the basic operation when we have a new entry.
  • Lookup an entry by some key
    This is what we do to see if an entry is already in the cache.
  • Update an entry to have a new expiration time
    When we want to refresh an item in the cache, for example when it has been used, we want to update the time when it is scheduled to expire.
  • Find or remove the entry for the next expiration time
    This operation is for when we want to clean items out of the cache. (Note that we can either remove items when they are scheduled to expire, or simply remove items when we are out of space. We need the same thing in both cases: find the item with the oldest expiration time.)

In Python, we have a built-in dictionary data structure which works fine for looking up an entry by a key. We also have the heapq module as part of the standard Python library, which implements a heap; this works fine for adding, removing, and updating based on the expiration time. Unfortunately, these two data structures don’t work so well together. The basic problem is that when an item changes position in the heap, the object does not know it, so if you find an entry via the dictionary, you have to search through the entire heap to remove it.
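
To make the problem concrete, here is a small sketch of my own (not code from any of the modules discussed here) of a timeout cache built from a dict plus heapq. Adding and popping the oldest entry are cheap, but refreshing an entry means a linear scan of the heap's underlying list:

import heapq
import time

entries = {}   # key -> (expiration time, key), for O(1) lookup
heap = []      # the same tuples, kept in heap order by expiration time

def add(key, ttl):
    entry = (time.monotonic() + ttl, key)
    entries[key] = entry
    heapq.heappush(heap, entry)

def refresh(key, ttl):
    old = entries.pop(key)
    heap.remove(old)        # O(n): we have no idea where the entry sits
    heapq.heapify(heap)     # and the list must be re-heapified afterwards
    add(key, ttl)

def pop_oldest():
    expiration, key = heapq.heappop(heap)
    del entries[key]
    return key, expiration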

I checked the always-useful Python Package Index (PyPI) and found that there is the HeapDict module there which does exactly what I want. Fantastic! Problem solved.
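
Here is how the same cache might look with HeapDict. This is only a sketch based on my reading of its dict-like interface (assigning sets the priority, and popitem() returns the entry with the lowest priority), so treat the details as assumptions:

import time
from heapdict import heapdict

cache = heapdict()

cache["example.com"] = time.monotonic() + 30   # add with an expiration time
cache["example.org"] = time.monotonic() + 60
cache["example.com"] = time.monotonic() + 90   # refresh: just assign again

key, expiration = cache.popitem()              # the entry expiring soonest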

But…

I couldn’t help wondering if maybe one could somehow use the built-in heapq and get better performance, if for no other reason than heapq is coded in C instead of pure Python. The trick would be to have a way to update each item whenever it changes position in the underlying list. I even coded up a list implementation that provided this functionality before I realized that I had not actually measured the performance to see if it made sense.

So I decided to benchmark not only the heapdict and heapq modules, but also all of the various other heap implementations in PyPI.

All benchmarks were run on a single core on my laptop. I used the perf module, which I saw presented at FOSDEM 2017.

The Tests

In order to benchmark the various heap implementations, we use each module's published API to measure the two fundamental operations:

  • Adding something to the heap.
  • Removing the smallest (or largest) item from the heap.

We do NOT check the operations:

  • Convert an unordered list into a heap.
  • Remove an arbitrary item from the heap.
  • Change the priority of an item on the heap.

Converting an unordered list into a heap may be interesting, but not so much for implementing a cache.

On a basic heap, both removing an arbitrary item and changing an item's priority require a linear search through the heap, so they are not the main strength of a heap implementation.

In the tests, the items that we add are single-element tuples, something like this:

import heapq

h = []
heapq.heappush(h, (42,))

The reason for this is that a heap is very rarely used to store plain numbers; much more often it is used to order objects of some kind. A single-element tuple should be about the cheapest non-numeric item that we can add.

The Modules

I did a simple search of PyPI using:

$ python3 -m pip search heap

I ended up with the following modules:

  • HeapDict
    This module acts as both a heap and a dictionary. It is pure Python.
  • heapq
    This is included in the Python standard library. It uses a C implementation if available, otherwise a Python version.
  • pyheapq
    This is a version I copied from the Python standard library, with the part that imports the C implementation removed. The idea is to make a fair comparison of the heapq algorithm against the other modules written in pure Python. Also, because the C implementation of heapq updates the list that represents the heap directly, you cannot use a UserList or another list with modified behavior; the pure Python version works as expected with such a list.
  • binaryheap
    This is a fairly straightforward Python implementation, which seems to be relatively new (8 months old).
  • heapqueue
    A Python implementation that appears to be under active development, with the last GitHub commit only a few days ago. Looking at the current GitHub code we see stubbed-out versions of both binomial and Fibonacci heaps, although those have not yet been implemented. So perhaps there are larger plans for this module.
  • fibonacci-heap-mod
    This module is a bit different from the others. It implements a Fibonacci heap, rather than a binary heap. This is a more expensive data structure, but has better run time for large heaps. It is a port of a Java implementation, so is a bit less Pythonic than the other modules – although this may actually result in better performance.

Note that the binaryheap and heapqueue modules may be more recent in GitHub than in PyPI. I did not spend any effort trying the latest versions. Also, I did some profiling and made some performance tweaks to the HeapDict module, and got about a 30% speed boost (you can find that in my GitHub fork); the benchmarks here are for the module available in PyPI.

Insertion Results

For insertion in a small heap, the results are as follows:

[chart: CPython, insertion into a 1,000-element heap]

We see the same behavior across all of the binary-heap implementations: inserting in ascending order is fastest, descending order is slowest, and random order is in the middle. The standard library’s C-based heapq is the fastest overall. We also see that the standard library’s Python-based version is the fastest of the pure-Python binary heaps, which shows us that it is an excellent implementation regardless of the language.

It is interesting to see that the Fibonacci heap works as advertised, with a constant insertion time regardless of the order.

For comparison, I also ran the same benchmark with PyPy, an alternate Python implementation written in RPython (a simplified Python). This includes a just-in-time (JIT) compiler, and generally executes much faster than CPython (the reference Python implementation):

[chart: PyPy, insertion into a 1,000-element heap]

One thing to note is that the heapq and pyheapq times are the same in PyPy. This is because on PyPy the Python interpreter is as fast as the C implementation, so only the Python version is used.

We see the same pattern in PyPy as in CPython: ascending is fast, descending is slow, random in between. The heap implementation in the standard library is the fastest. We also see that the Fibonacci heap has a fast, constant-time insertion.

What is different is that the Fibonacci heap is much faster at insertion than even the built-in heap implementation. This may be because you assign a numeric weight to each item in the heap, so comparisons are numeric and PyPy can optimize these better.

Now let’s look at insertion time for a larger heap, with 1 million elements:

[chart: CPython, insertion into a 1,000,000-element heap]

I kept the scale in microseconds so that you can compare the chart with the times for 1000 elements. Here are the results for PyPy:

[chart: PyPy, insertion into a 1,000,000-element heap]

The picture for 1K and 1M elements is basically the same, which is not unexpected but nice to see.

Removal Results

We run fewer tests for removal, for a couple of reasons. First, we assume that once items are in the heap, it does not matter in which order they were inserted. (This may be incorrect, but it is my assumption.) Second, the tests take longer to run because we need to build the heap before timing the removals. (There may be clever ways to combine an insertion and removal test, but I did not pursue this.)

Here are the results for 1000 elements in Python:

[chart: CPython, removal from a 1,000-element heap]

And for 1000 elements in PyPy:

[chart: PyPy, removal from a 1,000-element heap]

One surprising result here is that the standard library’s heapq module is about 2x faster in CPython than in PyPy.

Here are the results for 1 million elements in Python:

[chart: CPython, removal from a 1,000,000-element heap]

And here are the results for 1 million elements in PyPy:

[chart: PyPy, removal from a 1,000,000-element heap]

Just like with adding to the heap, the relative performance of the various implementations remains basically the same for heaps with 1 million elements as for heaps with 1000 elements.

Even though the heapq implementation is about 2x faster in stock Python, it is still slower than the Fibonacci heap in PyPy, which is the fastest in removal.

Conclusion

The heap implementation included in the Python standard library is well implemented and efficient; if you are able to use it, it is probably the best choice for most purposes.

The heap + dictionary module is handy, so if performance is not a major consideration it is certainly “good enough”. I have used it in the past for an A* path-finding implementation, and it worked great.

The other basic heap implementations in PyPI are very new, and may become faster or grow other features in the future.

I was impressed at the performance of the Fibonacci heap module. For a cache, it is the module that has the best overall performance in terms of runtime. There may be a size penalty, which can of course be important for a large cache. Anyone who wants to use this data structure may want to research this.

Indeed the performance of the Fibonacci heap module is good enough that it makes me wonder whether something like this should be added to the Python standard library.

Finally, use PyPy. 😉

References

You can find the code used to do the benchmarking here:

https://github.com/shane-kerr/heapbench

You can see a spreadsheet with the results that I got in the attached heapbench-results file.