Decoding Base64 in AWK

AWK?

I recently needed to make a program that was as easy-to-use and portable as possible for Unix-like systems. My first thought was to use Bourne shell for this, but it turned out that I wanted to do more complex programming than plain shell programming. I did a bit of research, and discovered that AWK is required on POSIX-complaint systems, and is available on nearly all Unix-like systems. There is a version of AWK included in BusyBox, so even embedded systems have a good chance of having AWK available.

Weirdly, I discovered that programming in AWK – while definitely not great compared to modern scripting languages like Python – is actually not that bad. Working with the language you can see the evolution from AWK to Perl and the influence on Python and JavaScript.

I will write up a few of the discoveries and techniques from working with AWK, starting with how to decode Base64.

Encoding Binary into ASCII

It used to be quite common on many systems to only be able to send ASCII data. Standards were defined based on this, and some component somewhere would break if non-ASCII data was used (either in processing, transmission, or storage). To work around this, techniques were developed to encode binary data using only ASCII.

While this sounds like some esoteric technology from the dawn of the age of computing, every time you send someone a photograph or document using e-mail, this sort of binary-to-ASCII encoding is probably being used.

ASCII only uses 7 bits, while modern systems typically break data up into 8-bit bytes. This means that an ASCII representation of the binary data will always take more space than the raw, binary version.

It is easy to imagine a very simple way of encoding, where you convert the binary to a string of 0 and 1, so a byte with value 42 might look like “00101010”. This is of course very wasteful, and would make your ASCII version 8 times bigger than the binary version.

A better encoding may be hexadecimal, where a the 42 might look like “2A”. Hexadecimal is very natural for programmers, and the resulting encoding is only 2 times bigger than the original. One drawback is that it may or may not work with uppercase and lowercase encoding; so “2a” might work in the place of “2A” or not.

Base64 Encoding

To have even more efficient encoding, other schemes have been developed. The general idea is to define a mapping of ASCII characters into binary bits. The IETF has standardized a few of these, in RFC 3548. The basic idea of Base64 is documented like this:

   A 65-character subset of US-ASCII is used, enabling 6 bits to be
   represented per printable character.  (The extra 65th character, "=",
   is used to signify a special processing function.)

   The encoding process represents 24-bit groups of input bits as output
   strings of 4 encoded characters.  Proceeding from left to right, a
   24-bit input group is formed by concatenating 3 8-bit input groups.
   These 24 bits are then treated as 4 concatenated 6-bit groups, each
   of which is translated into a single digit in the base 64 alphabet.

To put it another way, in Base64 each 4 ASCII characters represents 3 binary bytes.

Base64 is the most efficient encoding possible if you want to use printable characters and stick with a whole number of bits per character. It is possible to encode using the entire ASCII alphabet, but that includes non-printable characters (anything in the range 0 to 31, and character 127) which often have special meaning to software (for example, character 0 often marks the end of a string in memory). It is also possible to encode using fractional bits – for example using 80 characters in the alphabet – but both encoding and decoding becomes more complex.

Decoding Base64 in AWK

If you have some binary data encoded in Base64, you will sometimes want to extract it and get to the binary version in AWK. One simple approach might be to simply use a separate program to decode Base64 data, such as the base64 application. This has a number of limitations. One basic problem is that not all systems have base64 installed; remember the reason we are using AWK is for portability!

An equally important limitation is that AWK may not be able to handle the resulting binary data itself. Except for gawk, modern AWK interpreters appear to use C strings internally, meaning that any 0-byte will cause problems:

$ gawk 'BEGIN { printf "\0hello\n" }' 
hello
$ mawk 'BEGIN { printf "\0hello\n" }' 
$ busybox awk 'BEGIN { printf "\0hello\n" }' 
$ original-awk 'BEGIN { printf "\0hello\n" }'

As you can see here, encoding a NUL-terminator into any AWK except for gawk means that the rest of the string is ignored.

We can work around this limitations in strings by decoding to an array of integer values rather than to a string. (We will have to do more work later if we want to output these values. This is discussed below.)

Setting up the Alphabet

Since each character has a specific numeric value, we will use an array to store these values and then look it up during runtime. These values are documented in the RFC, and include A to Z, a to z, 0 to 9, +, -, and =.

# create our lookup table
BEGIN {
    # load symbols based on the alphabet
    for (i=0; i<26; i++) {
        BASE64[sprintf("%c", i+65)] = i
        BASE64[sprintf("%c", i+97)] = i+26
    }
    # load our numbers
    for (i=0; i<10; i++) {
        BASE64[sprintf("%c", i+48)] = i+52
    }
    # and finally our two additional characters
    BASE64["+"] = 62
    BASE64["/"] = 63
    # also add in our padding character
    BASE64["="] = -1
}

We use sprintf() to convert the numbers into ASCII characters. We will end up with a table that contains values like:

BASE64["A"] = 0
BASE64["B"] = 1
   ...
BASE64["Z"] = 26
BASE64["a"] = 27
BASE64["b"] = 28
   ...
BASE64["z"] = 51
BASE64["0"] = 52
BASE64["1"] = 53
   ...

For example by using this table when we see the character “y” in a Base64-encoded string, we know that this converts to the number 50.

Pulling Out the Values

To perform the actual decoding, we go through the string and pull out 4 characters at a time. These get translated into 1, 2, or 3 bytes. The Base64 standard requires that the encoded string be padded to a multiple of 4 characters, so we keep going as long as we have 4 left and then truncate if we hit the limit.

        g0 = BASE64[substr(encoded, 1, 1)]
        g1 = BASE64[substr(encoded, 2, 1)]
        g2 = BASE64[substr(encoded, 3, 1)]
        g3 = BASE64[substr(encoded, 4, 1)]
        if (g0 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g0) >> "/dev/stderr"
            exit 1
        }
        if (g1 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g1) >> "/dev/stderr"
            exit 1
        }
        if (g2 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g2) >> "/dev/stderr"
            exit 1
        }
        if (g3 == "") {
            printf("Unrecognized character %c in Base 64 encoded string\n",
                   g3) >> "/dev/stderr"
            exit 1
        }

In this code, we rely on AWK’s behavior of returning an empty string if an array element does not exist, and check for error. Happily "" != 0 in AWK. (In this case we just output an error message and exit… depending on your application you may want more sophisticated error handling. Error handling is tricky in AWK since we do not have exceptions or the ability to return more than a single value, and the only pass-by-reference type that we have is an array. That’s a story for another blog post though…)

Bitwise, AWK Foolish

If we could do some bitwise operations then it would be straightforward to take these numeric values and combine them into bytes. Unfortunately AWK does not have any bitwise operations, although gawk does add some non-standard functions which support these. Luckily we can perform these bitwise operations with normal mathematical calculations.

Bitwise Operation Operator Equivalent Example
Left shift << * (x << 3) == (x * 8)
Right shift >> / (x >> 2) == int(x / 4)
AND (to clear bits) & % (x & 255) == (x % 256)
OR (to set bits) | + (x | 3) == (x + 3)

Note: The AND and OR operations are not equivalent in the general case, but in our case where we want to zero-out bits and use OR operations where we always know that there are never two 1 bits, these work.

So for each set of 4 characters in the encoded string, we produce our resulting 3 bytes like this:

        # we don't have bit shifting in AWK, but we can achieve the same
        # results with multiplication, division, and modulo arithmetic
        # result[n++] = (g0 << 2) | (g1 >> 4)
        result[n++] = (g0 * 4) + int(g1 / 16)
        if (g2 != -1) {
            # result[n++] = ((g1 << 4) & 0xFF) | (g2 >> 2)
            result[n++] = ((g1 * 16) % 256) + int(g2 / 4)
            if (g3 != -1) {
                # result[n++] = ((g2 << 6) & 0xFF) | g3 
                result[n++] = ((g2 * 64) % 256) + g3
            }
        }

The comparisons with -1 allow us to handle the cases where we only have 2 or 3 characters encoded. Note that the code is not totally strict in checking for all possible ways to badly-encode a value – but our goal is mainly to get the data out not, so we assume that the coders mostly know what they are doing.

Outputting Binary Data in AWK

If your program needs to process the binary data somehow then having the data in an array is fine, and our work here is done. Enjoy your awesome new AWK program!

However, if you actually want to output binary data, then AWK’s inability to deal with binary data becomes a major problem. There are two issues:

  1. BusyBox AWK has no way to output a NUL character (character 0). This is investigated in this excellent Stack Overflow comment: https://stackoverflow.com/a/32302711.
  2. Unless care is taken, gawk will interpret non-ASCII characters based on the locale. This can be avoided by setting the environment properly, for example by LC_ALL=C, or by using the --characters-as-bytes flag.

If you don’t need to use BusyBox’s AWK or can be sure that gawk will be run under the proper setting, then you can simply use sprintf("%c", value) and any byte can be output. Otherwise, we can work around this by using the POSIX-standard printf command (not the system API call, but the executable program).

Basically, we can make a formatted ASCII string that the printf command will then output as a binary value:

    base64decode($0, result)

    printf_str = ""
    for (i=1; i in result; i++) {
        printf_str = printf_str sprintf("\\%03o", result[i])
        if (length(printf_str) >= 1024) {
            system("printf '" printf_str "'")
            printf_str = ""
        }
        delete result[i]
    }
    system("printf '" printf_str "'")

It is somewhat ironic that we are writing a program that understands a format designed to get around binary limitations, and have to encode the output in a format to get around binary limitations. But we only need to do this if we want to output binary data; for processing, we can happily use our array without having to rely on external programs.

Conclusion and Source

AWK is somewhat challenging to work with. In spite of that, it does have the advantage of maximum portability across Unix-like systems.

You can see the source at GitHub:

https://github.com/shane-kerr/AWK-base64decode

Advertisements