Paddinganography: hiding data in baseN-encoded strings


A couple of weeks ago, I discovered a steganography technique based on the padding of base64-encoded strings. The reasoning behind it goes as follows: when encoding a message using base64, each of the base64 characters represents 6 bits (as 26 = 64 possible values can be encoded on 6 bits); unfortunately, this means that if we were to encode one 8-bit extended ASCII character, we would need two base64 characters, as one would only cover 6 bits. We would then use 12 bits to represent one 8-bit character: this obviously means that out of those 12 bits, 4 are utterly useless. Now, this means that we can basically set these to any value we want without affecting the actual decoded string in any way, and in the process hide data in base64-encoded strings!

How is that related to padding? While padding is not strictly required to deduce how many 8-bit characters were encoded using base64 (or any base less than 256 = 28, really), padding is notably required in situations where base64-encoded strings are concatenated, so as to assess the integrity of the transferred data. But let's not delve into too much detail here; what matters is that if the string does not require any padding, then it means that the encoded string contains exactly the number of bits necessary for the encoding; hence, there are no bits left to hide anything! This also means that, as long as there is padding, it is possible to hide some data in the encoded string!


From this context, we can deduce the following elements:

  • this method should work for any base < 256, provided padding can occur representing 8-bit characters in this base;
    • consequently, it should not work for bases 2, 4, and 16;
    • it should work for bases 8, 32, 64 and 128.
  • a decryption method solely based on padding will work.

After scouring the Web for more information regarding this technique, I was surprised to find out that apart from two writeups of the same challenge, I couldn't find anything relevant.

The goal of this article is therefore to provide a clearer explanation of how this steganography technique works (especially how it can be decoded), and how it can be used with basically any baseN encoding scheme satisfying the following conditions:

  • it is used to encode characters of length L (in bits), L larger than bits (L should be known);
  • (i.e. padding can occur);

We would like to focus on the fact that the number of hidden bits in a message may be deduced by considering only the number of padding characters X, L and N.


Let's prove that the number of hidden bits embedded in a message can be deduced from the padding; pose E as the number of bits actually useful in the encoding, H as the hidden bits and P as the padding in bits. It is pretty easy to deduce that is the exact number of bits required so that the message is properly padded.

From this observation, we can pose the following equation: .

This directly follows:

$$\begin{align} E + H + P = lcm(log_2(N), L) &\Leftrightarrow E + H = lcm(log_2(N), L) - P\\ &\Leftrightarrow E + H \equiv lcm(log_2(N), L) - P \mod L\\ &\Leftrightarrow H \equiv -P \mod L &(\exists k, k'\in \mathbb{N}, E = Lk, lcm(log_2(N), L) = Lk')\\ &\Leftrightarrow H = -P \mod L &(H \in \{ 0, ..., L-1 \})\\ \end{align}$$

In practice, P is obtained by multiplying the number of padding characters X by the length in bits of each character: . Hence, H can be obtained from X, L and N!


Using the previous proof, I wrote the following script:

# -*- coding: utf-8 -*-
from gmpy import lcm
from math import log, ceil

# Providing the alphabets because I'm nice.
base64_alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
base32_alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
# basexxx…

def extract_data_from_message(message, base_charset=base64_alphabet, sep='\n', pad='=', L=8):
    split_message = message.split(sep)

    # Number of bits used to represent each value using the chosen charset.
    nb_bits = int(ceil(log(len(base_charset), 2)))

    # Length of a properly padded encoded string in bits.
    word_len = lcm(nb_bits, L)

    # Number of characters needed for a properly encoded string.
    nb_chars = word_len / nb_bits

    # Will contain the resulting binary string
    result = ''

    # Number of unused bits depending on the number of padding characters.
    ub_pad = {n: (word_len - n * nb_bits) % L for n in range(nb_chars)}

    for string in split_message:
        if len(string) % (nb_chars) != 0:
            print('Padding error: ' + string)

        # Length of the padding in bytes.
        padding = string.count(pad)

        if padding == 0:
            # No useless bits then…

        # Last encoding char of the string, containing useless bits.
        last_char = string[string.index(pad) - 1]

        # Binary value of the last character, left-padded with zeroes.
        bin_val = bin(base_charset.index(last_char))[2:].zfill(nb_bits)

        # Concatenating the bytes here
        result += bin_val[-ub_pad[padding]:]

    return ''.join([chr(int(result[i:i + 8], 2)) for i in range(0, len(result), 8)])

You can download an example.txt file from Olympic CTF 2014 to test the script with the default parameters. I do not plan on providing an encoding script; however, it would use the same idea as well.


I am usually not into steganography, but I found this technique to be quite different of all the others I encounter in CTFs. I felt it was a shame so little documentation was available on it, so here's my small contribution!

As it is the first time I am writing this kind of articles, I may have been unclear on several points; please let me know in the comments, and don't hesitate to DM me on Twitter if you feel my explanation is lacking.

comments powered by Disqus

Receive Updates