If you've picked base-10 (digits), base-38, or even base-45 for your data then you need to get it into that form. Base-64 is easy because that's exactly 6 bits per character: you work on 3 bytes of input at a time and produce exactly 4 characters of output. But 10, 38, and 45 aren't powers of two. You've got three options here.

The obvious conversion would be to treat the input as a bigint and repeatedly divmod by 10 (or 38, etc) to generate the encoding. If you have a bigint library to hand then it almost certainly has the functions for that, but it's a rather obnoxious (and quadratic) amount of computation, and a significant dependency. So you might be willing to waste a few bits to make things easier.
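Here's what that looks like in Python, where bigints are built in (a sketch: the function name, little-endian byte order, and least-significant-digit-first output are arbitrary choices of mine):

    import math

    def bytes_to_digits_bigint(data: bytes) -> list[int]:
        # A fixed-length encoding needs ceil(bits / log2(10)) digits.
        num_digits = math.ceil(len(data) * 8 / math.log2(10))
        n = int.from_bytes(data, "little")
        digits = []
        for _ in range(num_digits):
            # Each divmod walks every limb of n, so the loop is quadratic overall.
            n, digit = divmod(n, 10)
            digits.append(digit)
        return digits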
The next option is an encoding noted by djb that achieves similar efficiency but with less computation and no long division. I updated this post to include it, so it's covered in a section below.
The third option is to chunk the input and convert each chunk independently. Ideal input chunks would be 8 bytes or fewer, because nearly all environments will support a uint64 type and nearly all hardware can do a divmod on them. If you're using base-10 then there's going to be a function that can “print” a uint64 to digits for you, so let's take digits as an example. With a chunk size of two bytes you would get five digits. Each digit takes 3⅓ bits of space, so 16 input bits takes 16⅔ bits: 96% efficient. That's less than the 99.66% we can get with digits, for sure. But if you consider all chunk sizes from one to eight bytes, turning 7-byte chunks into 17 digits is 98.82% efficient. That's pretty good for the complexity savings.
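A sketch of the chunked version, with the same arbitrary conventions as above and assuming, for simplicity, that the input length is a multiple of seven:

    def bytes_to_digits_chunked(data: bytes) -> list[int]:
        # Each 7-byte chunk fits in a uint64 and becomes exactly 17 digits,
        # since 2^56 < 10^17. In C or Rust the inner loop would run on a
        # plain uint64; Python's ints stand in for it here.
        assert len(data) % 7 == 0
        digits = []
        for i in range(0, len(data), 7):
            n = int.from_bytes(data[i:i + 7], "little")
            for _ in range(17):
                n, digit = divmod(n, 10)
                digits.append(digit)
        return digits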

The NTRU Prime encoding
Above, I referenced an encoding that gets nearly all the space efficiency of the bigint-and-divmod method, but without the computational costs. This section is about that. It's taken from page 18 of the NTRU Prime NIST submission.
The motivating issue is this: if you have a whole byte then taking it mod 10 to generate a digit works fairly well. The digits 0–5 have probability 26/256 while 6–9 have probability 25/256. That's not uniform, so it doesn't encode the maximum amount of entropy, but it's 99.992% efficient, which is pretty good. But when you have a smaller range of input values the non-uniformity becomes significant, and so does the reduction in information transmitted.
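You can check that figure by comparing the entropy of the mod-10 distribution against the log₂(10) bits that a uniform digit would carry:

    import math

    # Digits 0-5 each appear with probability 26/256, digits 6-9 with 25/256.
    entropy = -(6 * (26 / 256) * math.log2(26 / 256) +
                4 * (25 / 256) * math.log2(25 / 256))
    print(entropy / math.log2(10))  # ≈ 0.99992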
The encoding in NTRU Prime takes input values (which can be larger than bytes) and combines pairs of them. It produces some output values from each pair but, once the non-uniformity gets uncomfortable, it pairs up the leftovers to increase the range. This repeats, forming a binary tree.
As a concrete example we'll use the Python code from page 18 and set M = [256…] (because our inputs are bytes), change the 256 and 255 to 10 and 9 (to extract digits, not bytes), and set limit to 1024. Our example input will be 419dd0ed371f44b7:
    Input bytes (base 10):   65   157   208   237   55   31   68   183
    Level 1 (mod 65536):   40257 → 7,5   60880 → 0,8   7991 → 1,9   46916 → 6,1
      leftovers:             402/656       608/656      79/656       469/656
    Level 2 (mod 430336):      399250 → 0,5,2             307743 → 3,4,7
      leftovers:                  399/431                     307/431
    Level 3 (mod 185761):                 132716 → 6,1,7
      final leftover:                     132/186 → 2,3,1
The input bytes are written (in base 10) along the top. Pairs of them are combined. Take the first pair: its value is 157×256 + 65 = 40257. That can be considered to be 40257 mod 65536 and, since there's a reasonable number of bits in there, two digits are extracted. Obviously 40257 mod 10 = 7, and 4025 mod 10 = 5. So the two digits are 7 and 5. That leaves 402 mod 656 (the range shrinks from 65536 to ⌈65536/100⌉ = 656 once two digits have been removed), and 656 is below the limit of 1024 that we set, so it's passed down to be combined at the next level. This continues down the tree: each time there's too little data left to extract another digit, the leftovers are passed down to be combined. At the bottom there's nothing else to combine with, so the final leftover value, 132 mod 186, is converted into the digits 2, 3, and 1. The ultimate output digits are concatenated from left-to-right, top-to-bottom.
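Here's a sketch of the encoder with those adaptations applied (digits instead of bytes, limit set to 1024). It's my reconstruction of the submission's recursive Encode function, not a verbatim copy:

    limit = 1024

    def encode(R, M):
        # R[i] is a value in the range [0, M[i]). Returns a list of digits;
        # within each extraction, digits come out least-significant first.
        if len(M) == 0:
            return []
        S = []
        if len(M) == 1:
            # Nothing left to pair with: squeeze the final leftover dry.
            r, m = R[0], M[0]
            while m > 1:
                S.append(r % 10)
                r, m = r // 10, (m + 9) // 10
            return S
        R2, M2 = [], []
        for i in range(0, len(M) - 1, 2):
            # Combine a pair into a single value r with range m.
            m, r = M[i] * M[i + 1], R[i] + M[i] * R[i + 1]
            # Extract digits until the remaining range falls below the limit.
            while m >= limit:
                S.append(r % 10)
                r, m = r // 10, (m + 9) // 10
            R2.append(r)
            M2.append(m)
        if len(M) % 2 == 1:
            # An odd element is passed down to the next level unchanged.
            R2.append(R[-1])
            M2.append(M[-1])
        return S + encode(R2, M2)

Running it on the example input reproduces the digits from the diagram:

    data = bytes.fromhex("419dd0ed371f44b7")
    print(encode(list(data), [256] * len(data)))
    # [7, 5, 0, 8, 1, 9, 6, 1, 0, 5, 2, 3, 4, 7, 6, 1, 7, 2, 3, 1]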
This achieves good encoding efficiency without repeated long division, and can be parallelised.