Some useful hashing tricks

Indexing for numbering and obfuscation
Author

Akshay Balsubramani

Reversible “random-looking” IDs with modular arithmetic

When we ship systems that assign sequential indices (patient records, time-series rows, and similarly sensitive identifiers), we often don't want to leak ordering, even internally. The goal here is a fast, reversible permutation that maps a small integer index \(i\) to a larger integer \(y\) that looks unstructured, offers no obvious way to recover the ordering, and can be inverted to recover \(i\).

We’d like to obfuscate this without a database lookup or permanent mapping table for several common reasons:

  • Privacy: hiding record order (e.g. patient #17 shouldn’t look like 17).

  • Distribution: shard or bucket by seemingly random IDs.

  • Traceability: unlike a one-way hash, we can always decode.

The standard solution that computer scientists learn for this type of thing is cryptography, e.g. RSA. For mere obfuscation, that is computational overkill, as we show below.

The minimal approach

Integer \(\to\) integer: an affine permutation

The quickest approach uses modular arithmetic.

  • Pick a modulus \(p\) larger than the dataset size.

  • Pick a multiplier \(a\) coprime to \(p\), i.e. with \(\gcd(a, p) = 1\).

  • Find its modular inverse \(a^{-1}\), such that \(a a^{-1} \equiv 1 \pmod{p}\). (Use pow(a, -1, p) in Python ≥ 3.8.)

Then

\[ \text{encode}(n) = a n \;\text{mod}\; p \qquad \text{decode}(h) = a^{-1} h \;\text{mod}\; p \]

Because \(a\) is invertible modulo \(p\), the mapping is a permutation of \(\{0, \dots, p−1 \}\).

CODE
# toy parameters, for illustration
P = 97           # modulus: a prime larger than the dataset size
A = 37           # multiplier, coprime with P
AINV = 21        # modular inverse of A mod P (37*21 % 97 == 1)

def encode(n: int) -> int:
    return (n * A) % P

def decode(h: int) -> int:
    return (h * AINV) % P


for n in range(10):
    h = encode(n)
    print(f"{n:2d}  →  {h:2d}  →  {decode(h):2d}")
 0  →   0  →   0
 1  →  37  →   1
 2  →  74  →   2
 3  →  14  →   3
 4  →  51  →   4
 5  →  88  →   5
 6  →  28  →   6
 7  →  65  →   7
 8  →   5  →   8
 9  →  42  →   9

The encoded numbers “look random” and span the full range 0 … 96. Decoding faithfully recovers the originals.

The maximum ID is still \(p−1\); \(p\) should be chosen much larger than the dataset, to avoid conspicuous gaps. For real 32-bit IDs, we take the largest 32-bit prime \(p = 4294967291\) and compute \(a^{−1}\) with pow(a, -1, p) in Python ≥ 3.8.
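Scaled up, the same three lines work unchanged. A sketch with the 32-bit prime modulus (the multiplier below is an arbitrary choice of ours; since \(p\) is prime, any \(0 < a < p\) is coprime to it):

```python
# production-scale parameters: the largest prime below 2**32 as the modulus
P = 4294967291
A = 2654435761            # arbitrary multiplier; any 0 < A < P works since P is prime
AINV = pow(A, -1, P)      # modular inverse, Python >= 3.8

def encode(n: int) -> int:
    return (n * A) % P

def decode(h: int) -> int:
    return (h * AINV) % P

# consecutive small indices land far apart in [0, P)
assert all(decode(encode(n)) == n for n in range(100000))
print(encode(17))         # → 2175735027
```

The whole setup is two constants and one `pow` call; no table of issued IDs needs to be stored anywhere.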

Other good choices of parameters

To run this method, we need good choices of \(a\) and \(p\) that are coprime. The size of the dataset determines \(p\), and we don’t need it to be prime. Given the way numbers are represented in binary, it typically makes sense to set \(p\) to be a large fixed number like \(2^{32}\) and examine choices for \(a\).

We want a choice of \(a\) that scatters inputs evenly around the space. Such choices are well-known for common ID-space sizes.
Much has been written about the standard suggestion, which is a simple calculation involving the golden ratio \(\phi\):1

  • For \(p = 2^{32}\): \(a =\) 0x9e3779b9
  • For \(p = 2^{64}\): \(a =\) 0x9e3779b97f4a7c15
  • For \(p = 2^{128}\): \(a =\) 0x9e3779b97f4a7c15f39cc0605d396154

And so on, with more trailing digits appended as the exponent of the power of two \(p\) grows.
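The footnote's recipe can be carried out in exact integer arithmetic: \((\phi - 1)\, 2^{k} = \sqrt{5 \cdot 2^{2k-2}} - 2^{k-1}\), so math.isqrt gives the floor with no floating-point precision loss. A sketch (the function name is ours):

```python
import math

def golden_multiplier(bits: int) -> int:
    """floor((phi - 1) * 2**bits), forced odd so it is coprime to 2**bits."""
    # (phi - 1) * 2**bits == sqrt(5 * 2**(2*bits - 2)) - 2**(bits - 1), exactly
    a = math.isqrt(5 << (2 * bits - 2)) - (1 << (bits - 1))
    return a | 1   # Odd(.): force the low bit on

print(hex(golden_multiplier(32)))   # 0x9e3779b9
print(hex(golden_multiplier(64)))   # 0x9e3779b97f4a7c15
```

This reproduces the 32- and 64-bit constants above from first principles, and extends to any width.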

Integer \(\to\) string: base 36

Often we want a compact, human‑friendly string token from a huge integer. A quick way to do this is to write the integer in base 36, with alphabet 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ.
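A minimal sketch (helper names are ours); Python's int parses base 36 natively, so only the encoder needs writing:

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base36(n: int) -> str:
    # repeatedly peel off the lowest base-36 digit
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def from_base36(s: str) -> int:
    return int(s, 36)   # int() accepts bases up to 36

print(to_base36(4294967291))   # → '1Z141YZ'
assert from_base36(to_base36(4294967291)) == 4294967291
```

Seven characters comfortably cover the whole 32-bit ID space, versus ten decimal digits.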

This obfuscation is cheaper than cryptography

An important caveat: these are obfuscation techniques, not cryptography. Anyone who learns \(a\) and \(p\) can invert the mapping, unlike cryptographic schemes, where inversion is computationally hard even with that knowledge. If you face actual adversaries, cryptography is absolutely to be preferred; format‑preserving encryption (FPE) schemes like FF1/FF3 address exactly this problem over a target alphabet and length.

The extra computation that cryptography requires is immediately evident in a like-for-like comparison, run on some relatively small integers that stand in for serial numbers.

CODE
#!pip install pycryptodome

import random, time
from Crypto.PublicKey import RSA

# toy affine-permutation params from above
P, A, AINV = 97, 37, 21

encode  = lambda n: (n * A) % P
decode  = lambda h: (h * AINV) % P

# RSA 1024-bit key
key     = RSA.generate(1024)
e, d, n = key.e, key.d, key.n
rsa_enc = lambda m: pow(m, e, n)
rsa_dec = lambda c: pow(c, d, n)

N = 1000
nums = [random.randint(0, 10000) for _ in range(N)]

t0 = time.time(); [encode(x) for x in nums]; t1 = time.time()
t2 = time.time(); [rsa_enc(x) for x in nums]; t3 = time.time()

print(f"Toy scheme : {(t1-t0)*1e6/N:.2f} µs per op")
print(f"RSA-1024 : {(t3-t2)*1e6/N:.2f} µs per op")
Toy scheme : 0.09 µs per op
RSA-1024 : 21.97 µs per op

There is at least a 100x difference here.

A cheap bijection that looks random: a small Feistel network

Modular inverses as written above work for all dataset sizes up to \(p\), and are very fast. However, consecutive inputs map to outputs separated by the fixed stride \(a\), so if your audience sees many IDs, they may infer adjacency.

To break this obvious linearity, a Feistel network splits the input into left and right halves, and applies bit-mixing functions to the halves over a few rounds \(1, \dots, r, \dots, M\).

The construction is remarkably flexible: whatever function is applied each round (the “round function”), the network remains invertible, because each round only XORs one half with a function of the other. We’ll use a tiny keyed multiply‑add + rotate to diffuse bits. The key is the one parameter of this round function, so the overall network takes keys \(k_{1}, \dots, k_{r}, \dots, k_{M}\).

On round \(r\), the left and right halves are updated as follows:

\[\left( L, R \right) \leftarrow \left( R,\; L \oplus F(R, k_{r}) \right)\]

CODE
# 32-bit Feistel permutation with 4 rounds; change the keys to rotate the mapping
K = (0xA511, 0x326B, 0x9E37, 0x7F4A)   # one 16-bit key per round
C = 40503                              # 0x9E37: multiplier inside the round function

def _F(r, k):
    # keyed multiply-add, then rotate left by 5 within 16 bits
    r = (r * C + k) & 0xFFFF
    return ((r << 5) | (r >> 11)) & 0xFFFF

def enc32_feistel(x):
    L, R = (x >> 16) & 0xFFFF, x & 0xFFFF
    for k in K:
        L, R = R, L ^ _F(R, k)
    return (L << 16) | R

def dec32_feistel(y):
    L, R = (y >> 16) & 0xFFFF, y & 0xFFFF
    for k in reversed(K):
        L, R = R ^ _F(L, k), L    # undo one round
    return (L << 16) | R

# smoke test + sample
assert all(dec32_feistel(enc32_feistel(i)) == i for i in range(100000))
for i in range(10):
    print(f"{i:>2} -> {enc32_feistel(i)}")
 0 -> 72669228
 1 -> 3491548600
 2 -> 3497100227
 3 -> 2176217098
 4 -> 3941116497
 5 -> 3326608328
 6 -> 377525314
 7 -> 508084945
 8 -> 3240512433
 9 -> 3118050186

In the amount of computation it spends on obfuscation, the Feistel network occupies a place in between the simple modular-inverse obfuscation/“hash” and full public-key encryption.

CODE
t5 = time.time(); [enc32_feistel(x) for x in nums]; t6 = time.time()
print(f"Feistel : {(t6-t5)*1e6/N:.1f} µs per op")
Feistel : 1.6 µs per op

We see this in the timings: using Feistel instead of modular inverse methods costs a ~10x slowdown, and full cryptography a further ~10x on top of that. Of course, Feistel can be done with fewer rounds. In most practical data management, though, there is plenty of use for the modular inverse as a quick and easy obfuscation method to augment the workhorse of public-key encryption, and less use for Feistel networks.

Footnotes

  1. The calculation is \(\text{Odd} ((\phi - 1) p)\), rounding to the nearest odd integer, which is a common way to generate such a string of bits. The idea here is counting on the binary expansion of the irrational number \(\phi\) to look “random.” ↩︎
