Reversible “random-looking” IDs with modular arithmetic
When we ship systems that assign sequential indices (patient records in biotech, time-series rows, or otherwise sensitive items in other sectors), we often don't want to leak ordering, even internally. The goal is a fast, reversible permutation that maps a small integer index \(i\) to a larger integer \(y\) that appears seemingly random, with no obvious way to recover the ordering, yet can always be inverted to recover \(i\).
We’d like to obfuscate this without a database lookup or permanent mapping table for several common reasons:
Privacy: hiding record order (e.g. patient #17 shouldn’t look like 17).
Distribution: shard or bucket by seemingly random IDs.
Traceability: unlike a one-way hash, we can always decode.
The standard solution that computer scientists learn for this type of problem is cryptography: RSA and similar techniques. For this job, those are computational overkill, as we can show easily.
Pick a number \(p\), larger than the dataset size.
Pick a multiplier \(a\) coprime to \(p\), i.e. with \(\text{gcd} (a,p) = 1\).
Find its modular inverse \(a^{-1}\), such that \(a a^{-1} \equiv 1 \pmod{p}\). (Use pow(a, -1, p) in Python ≥ 3.8.)
Then
\[ \text{encode}(n) = an \bmod p \qquad \text{decode}(h) = a^{-1} h \bmod p \]
Because \(a\) is invertible modulo \(p\), the mapping is a permutation of \(\{0, \dots, p-1\}\).
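For example, with the toy parameters used below (\(p = 97\), \(a = 37\), so \(a^{-1} = 21\)): \(\text{encode}(3) = 37 \cdot 3 \bmod 97 = 14\), and \(\text{decode}(14) = 21 \cdot 14 \bmod 97 = 294 \bmod 97 = 3\).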
CODE
# toy parameters: P larger than the dataset, A coprime to P
P = 97      # modulus
A = 37      # multiplier, coprime with P
AINV = 21   # modular inverse of A mod P (37*21 % 97 == 1)

def encode(n: int) -> int:
    return (n * A) % P

def decode(h: int) -> int:
    return (h * AINV) % P

for n in range(10):
    h = encode(n)
    print(f"{n:2d} → {h:2d} → {decode(h):2d}")
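For the first ten indices this prints:
 0 →  0 →  0
 1 → 37 →  1
 2 → 74 →  2
 3 → 14 →  3
 4 → 51 →  4
 5 → 88 →  5
 6 → 28 →  6
 7 → 65 →  7
 8 →  5 →  8
 9 → 42 →  9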
The encoded numbers “look random” and span the full range 0 … 96. Decoding faithfully recovers the originals.
The maximum ID is still \(p-1\); \(p\) should be chosen much larger than the dataset, to avoid conspicuous gaps. To cover larger domains like 32-bit IDs, we can scale up to the largest 32-bit prime, \(p = 4294967291\), and compute \(a^{-1}\) with pow(a, -1, p) (Python ≥ 3.8).
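A minimal sketch of the scaled-up parameters (we keep \(a = 37\) from the toy example purely for illustration; any multiplier coprime to \(p\) would do):
CODE
# scaled-up parameters for 32-bit IDs
P = 4294967291          # largest prime below 2**32
A = 37                  # multiplier; any A with gcd(A, P) == 1 works
AINV = pow(A, -1, P)    # modular inverse of A mod P (Python >= 3.8)
assert (A * AINV) % P == 1

# round trip still recovers every index below P
for n in (0, 1, 2, 10**6, P - 1):
    assert ((n * A) % P) * AINV % P == n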
Other good choices of parameters
To run this method, we need good choices of \(a\) and \(p\) that are coprime. The size of the dataset sets a lower bound on \(p\), which does not need to be prime. Given the way numbers are represented in binary, it typically makes sense to fix \(p\) at a large power of two like \(2^{32}\) (then any odd \(a\) is coprime to \(p\)) and examine choices for \(a\).
We want a choice \(a\) that scatters inputs evenly around the space. Such choices are well-known for common ID-space sizes.
There has been much written about what we suggest here, which is a simple calculation involving the golden ratio \(\phi\):¹
For \(p = 2^{32}\): \(a =\) 0x9e3779b9
For \(p = 2^{64}\): \(a =\) 0x9e3779b97f4a7c15
For \(p = 2^{128}\): \(a =\) 0x9e3779b97f4a7c15f39cc0605d396154
And so on: more trailing digits are calculated as the exponent on the power of two \(p\) grows. The sketch below reproduces the first two constants.
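A sketch of that calculation, using exact integer arithmetic to avoid float rounding (golden_multiplier is our own helper name, not a standard function):
CODE
from math import isqrt

def golden_multiplier(bits: int) -> int:
    """floor((phi - 1) * 2**bits), forced odd so it is coprime to 2**bits."""
    # phi - 1 = (sqrt(5) - 1) / 2, evaluated exactly on integers
    a = (isqrt(5 << (2 * bits)) - (1 << bits)) >> 1
    return a | 1  # Odd(...)

print(hex(golden_multiplier(32)))   # 0x9e3779b9
print(hex(golden_multiplier(64)))   # 0x9e3779b97f4a7c15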
Integer \(\to\) string: base 36
Often we want a compact, human‑friendly string token from a huge integer. A quick way to do this is to write the integer in base 36, with alphabet 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ.
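A minimal sketch (the helper names to_base36 and from_base36 are ours):
CODE
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base36(n: int) -> str:
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def from_base36(s: str) -> int:
    return int(s, 36)  # Python's int() already parses base 36

print(to_base36(2654435769))                       # "17WDRQX"
assert from_base36(to_base36(2654435769)) == 2654435769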
This obfuscation is cheaper than cryptography
There is an important distinction between this and cryptography: these are obfuscation techniques, not cryptography. Nothing here is cryptographically secure; anyone who finds out \(a\) and \(p\) can invert the mapping, whereas in a cryptographic scheme even that is computationally hard. If you face adversaries, prefer real cryptography; format-preserving encryption (FPE) schemes such as FF1/FF3 address exactly this type of problem over a target alphabet and length.
The extra computation that cryptography requires is immediately evident in a like-for-like comparison. We try both on some small integers that stand in for serial numbers.
CODE
#!pip install pycryptodome
import random, time
from Crypto.PublicKey import RSA

# toy params
P, A, AINV = 97, 37, 21
encode = lambda n: (n * A) % P
decode = lambda h: (h * AINV) % P

# RSA 1024-bit key
key = RSA.generate(1024)
e, d, n = key.e, key.d, key.n
rsa_enc = lambda m: pow(m, e, n)
rsa_dec = lambda c: pow(c, d, n)

N = 1000
nums = [random.randint(0, 10000) for _ in range(N)]

t0 = time.time(); [encode(x) for x in nums]; t1 = time.time()
t2 = time.time(); [rsa_enc(x) for x in nums]; t3 = time.time()

print(f"Toy scheme : {(t1-t0)*1e6/N:.2f} µs per op")
print(f"RSA-1024 : {(t3-t2)*1e6/N:.2f} µs per op")
Toy scheme : 0.09 µs per op
RSA-1024 : 21.97 µs per op
That is a difference of more than two orders of magnitude.
A cheap bijection that looks random: a small Feistel network
Modular inverses as written above work for all dataset sizes up to \(p\), and are very fast. However, consecutive inputs map to outputs separated by the fixed stride \(a \pmod{p}\), so an audience that sees many IDs may infer adjacency; the quick check below makes this concrete.
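A quick check with the toy parameters from above:
CODE
# consecutive indices always land a fixed stride A apart (mod P)
P, A = 97, 37
prev = 0
for n in range(1, 6):
    cur = (n * A) % P
    print(n, cur, (cur - prev) % P)  # third column is always 37
    prev = cur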
To break this obvious linearity, Feistel networks split the input into left and right halves and apply a bit-mixing function to the halves over a few rounds \(r = 1, \dots, M\).
Since they work iteratively, they are very powerful: the Feistel structure is invertible regardless of the "round function" \(F\), so \(F\) can be any function at all and need not be invertible itself. We'll use a tiny keyed multiply-add plus rotate to diffuse bits. The key is the one parameter of this function, so the overall network takes keys \(k_{1}, \dots, k_{M}\).
On round \(r\), the left and right halves are updated as follows:
\[\left( L, R \right) \leftarrow \left( R, L \oplus F(R, k_{r}) \right)\]
CODE
# 32-bit Feistel permutation with 4 rounds; change keys to rotate the mapping
K = (0xA511, 0x326B, 0x9E37, 0x7F4A)
C = 40503

def _F(r, k):
    # keyed multiply-add, then a 16-bit rotate left by 5
    r = (r * C + k) & 0xFFFF
    return ((r << 5) | (r >> 11)) & 0xFFFF

def enc32_feistel(x):
    L = (x >> 16) & 0xFFFF
    R = x & 0xFFFF
    for k in K:
        L, R = R, (L ^ _F(R, k)) & 0xFFFF
    return ((L << 16) | R) & 0xFFFFFFFF

def dec32_feistel(y):
    # undo the rounds in reverse key order
    L = (y >> 16) & 0xFFFF
    R = y & 0xFFFF
    for k in reversed(K):
        R0 = L
        L = (R ^ _F(R0, k)) & 0xFFFF
        R = R0
    return ((L << 16) | R) & 0xFFFFFFFF

# smoke test + sample
assert all(dec32_feistel(enc32_feistel(i)) == i for i in range(100000))
for i in range(10):
    y = enc32_feistel(i)
    print(f"{i:>2} -> {int(y)}")
In computational cost, this sits in between the simple modular-inverse obfuscation and full public-key encryption.
CODE
t5 = time.time(); [enc32_feistel(x) for x in nums]; t6 = time.time()
print(f"Feistel : {(t6-t5)*1e6/N:.1f} µs per op")
Feistel : 1.6 µs per op
We see this here: Feistel is roughly 10x slower than the modular-inverse method, and full cryptography roughly another 10x slower still; fewer rounds would of course narrow the first gap. In most practical data-management uses, there is plenty of call for the modular inverse as a quick and easy obfuscation method alongside the workhorse of public-key encryption, and rather less for Feistel networks.
Footnotes
The calculation is \(\text{Odd}(\lfloor (\phi - 1)\, p \rfloor)\): truncate to an integer and force the result odd, a common way to obtain a string of such bits. The idea is to count on the binary expansion of the irrational number \(\phi\) to look "random." ↩︎