A couple of weeks ago, I discovered a steganography technique based on the padding of base64-encoded strings. The reasoning behind it goes as follows: when encoding a message using base64, each base64 character represents 6 bits (since $2^6 = 64$ possible values fit in 6 bits). Unfortunately, this means that to encode a single 8-bit extended ASCII character we need two base64 characters, as one would only cover 6 bits. We then use 12 bits to represent one 8-bit character, so 4 of those 12 bits carry no information. We can therefore set these 4 bits to any value we want without affecting the decoded string in any way, and in the process hide data in base64-encoded strings!
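To make the "4 useless bits" concrete, here is a minimal Python sketch. It assumes a lenient decoder: to my knowledge, the standard library's `base64.b64decode` ignores the leftover bits in its default mode, but stricter decoders may reject such non-canonical strings. The names `ALPHABET`, `canonical` and `variants` are only illustrative.

```python
import base64

# Standard base64 alphabet: the index of a character is its 6-bit value.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

# Canonical encoding of the single 8-bit character "A".
canonical = base64.b64encode(b"A").decode("ascii")

# The second character holds 2 useful bits followed by 4 unused bits,
# which are zero in the canonical encoding.
first, second = canonical[0], canonical[1]
base_index = ALPHABET.index(second)

# Build the 16 strings obtained by setting the 4 unused bits to every value.
variants = [first + ALPHABET[base_index | low_bits] + "==" for low_bits in range(16)]

# Assumption: the decoder silently drops the unused bits.
decoded = {base64.b64decode(v) for v in variants}

print(variants)
print(decoded)
```

If the assumption holds, all 16 strings decode to the same single character, which is exactly the room the technique exploits.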
How is that related to padding? While padding is not strictly required to deduce how many 8-bit characters were encoded using base64 (or any base less than $256 = 2^8$, really), it is notably required in situations where base64-encoded strings are concatenated, so as to assess the integrity of the transferred data. But let's not delve into too much detail here; what matters is that if the string does not require any padding, then the encoded string contains exactly the number of bits necessary for the encoding; hence, there are no bits left to hide anything! Conversely, as long as there is padding, it is possible to hide some data in the encoded string!
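As an illustration of hiding data in a padded string, here is a rough sketch for base64 only. The helpers `embed` and `extract` are hypothetical names of mine, not part of any library, and the sketch relies on the fact that in base64 one `=` leaves 2 spare bits and two `=` leave 4. It also assumes the decoder on the other side tolerates non-zero spare bits.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def embed(cover: bytes, value: int) -> str:
    """Encode `cover` in base64 and store `value` in the spare bits of the last data character."""
    encoded = base64.b64encode(cover).decode("ascii")
    pad = encoded.count("=")  # '=' only ever appears as padding
    if pad == 0:
        raise ValueError("no padding, hence no spare bits to hide anything")
    spare = 2 * pad  # base64 with 8-bit characters: 2 spare bits per '='
    if not 0 <= value < (1 << spare):
        raise ValueError(f"value does not fit in {spare} bits")
    body = encoded[: len(encoded) - pad]
    # The spare bits are zero in the canonical encoding, so OR-ing is enough.
    last = ALPHABET.index(body[-1]) | value
    return body[:-1] + ALPHABET[last] + "=" * pad


def extract(stego: str) -> int:
    """Read back the spare bits, using only the amount of padding to know how many there are."""
    pad = stego.count("=")
    spare = 2 * pad
    body = stego.rstrip("=")
    return ALPHABET.index(body[-1]) & ((1 << spare) - 1)


stego = embed(b"A", 0b1011)
print(stego, base64.b64decode(stego), extract(stego))
```

A longer hidden message would be spread over several padded strings (for instance one per line), each contributing 2 or 4 bits.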
From this context, we can deduce the following elements:
- this method should work for any base < 256, provided padding can occur when representing 8-bit characters in this base;
- consequently, it should not work for bases 2, 4, and 16;
- it should work for bases 8, 32, 64 and 128;
- a decoding method based solely on padding will work.
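The list above can be checked with a few lines of arithmetic. This is only a sketch: it assumes power-of-two bases and 8-bit characters, and `max_spare_bits` is a hypothetical helper name.

```python
def max_spare_bits(base: int, char_bits: int = 8) -> int:
    """Largest number of unused bits that can appear when encoding whole characters."""
    k = base.bit_length() - 1  # bits per encoded symbol, i.e. log2(base) for a power of two
    # n characters need ceil(n * char_bits / k) symbols; the leftover is (-n * char_bits) mod k.
    # The pattern repeats after at most k characters, so checking n = 1..k is enough.
    return max((-n * char_bits) % k for n in range(1, k + 1))


bases = [2, 4, 8, 16, 32, 64, 128]
usable = [b for b in bases if max_spare_bits(b) > 0]
unusable = [b for b in bases if max_spare_bits(b) == 0]

for b in bases:
    print(f"base {b:3d}: up to {max_spare_bits(b)} spare bit(s)")
```

Bases whose symbol width divides 8 never leave spare bits, which is why 2, 4 and 16 are ruled out.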
After scouring the Web for more information regarding this technique, I was surprised to find out that apart from two writeups of the same challenge, I couldn't find anything relevant.
The goal of this article is therefore to provide a clearer explanation of how this steganography technique works (especially how it can be decoded), and how it can be used with basically any baseN encoding scheme satisfying the following conditions:
- it is used to encode characters of length L (in bits), with L larger than $\log_2 N$ bits (L should be known);
- L is not a multiple of $\log_2 N$ (i.e. padding can occur).
We would like to focus on the fact that the number of hidden bits in a message may be deduced by considering only the number of padding characters X, L and N.
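As a quick sanity check of this claim, the sketch below tabulates the number of hidden bits for each possible number of padding characters X. It assumes that the encoding pads the output to a whole block of lcm(L, log2 N) bits, as the usual base64 and base32 encodings do; `hidden_bits_by_padding` is a hypothetical helper name.

```python
from math import lcm  # available from Python 3.9


def hidden_bits_by_padding(char_bits: int, base: int) -> dict[int, int]:
    """Map each possible number of padding characters X to the number of spare (hidden) bits."""
    k = base.bit_length() - 1          # bits per encoded symbol, log2(N) for a power of two
    block_bits = lcm(char_bits, k)     # assumed size of a fully padded block, in bits
    block_chars = block_bits // k      # symbols (data + padding) per block
    table = {}
    for n in range(1, block_bits // char_bits + 1):   # n characters in the last block
        data_chars = -((-n * char_bits) // k)         # ceil(n * char_bits / k)
        pad_chars = block_chars - data_chars
        table[pad_chars] = data_chars * k - n * char_bits
    return table


print(hidden_bits_by_padding(8, 64))
print(hidden_bits_by_padding(8, 32))
```

Under that assumption, each value of X corresponds to exactly one number of hidden bits, so a decoder only needs X, L and N.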
Let's prove that the number of hidden bits embedded in a message can be deduced from the padding. Let E be the number of bits actually useful in the encoding, H the number of hidden bits and P the padding in bits. It is pretty easy to deduce that $E + H + P$ is the exact number of bits required so that the message is properly padded.
This directly follows: