You might already be able to get good acceleration with SSSE3, AVX2, or NEON, each of which has a byte-permutation instruction that takes 4-bit indices (`pshufb` on SSSE3, `vpshufb` on AVX2, `tbl` on NEON). The problem is that you're doing parallel lookups into many different tables, whereas the SSSE3 and NEON instructions do 16 lookups in parallel into the same 16-byte table (and AVX2 is, I think, two copies of the SSSE3 one side by side, one per 128-bit lane). So it's not as useful unless you're simulating the same grid on several different inputs for bulk testing, where every lane can share one table. It might still be faster than scalar, but I'm not sure.
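To make the mismatch concrete, here is a rough scalar model in Python of what the SSSE3-style shuffle does versus what a per-lane lookup would need. It is only a sketch of the semantics, not SIMD code: the function names and the sample tables are made up for illustration, and the model follows `pshufb` (low 4 bits of each index select the entry, a set high bit zeroes the lane).

```python
def pshufb_model(table, indices):
    """Scalar model of a 16-byte shuffle in the style of SSSE3 pshufb.

    All 16 lanes read from the SAME 16-entry table. The low 4 bits of each
    index byte pick the entry; if bit 7 of the index is set, the lane is 0.
    """
    assert len(table) == 16 and len(indices) == 16
    return [0 if i & 0x80 else table[i & 0x0F] for i in indices]


def per_lane_lookup(tables, indices):
    """What many-different-tables needs: lane k reads from tables[k]."""
    assert len(tables) == len(indices)
    return [t[i & 0x0F] for t, i in zip(tables, indices)]


# Hypothetical shared table: entry j holds j*j (fits in a byte for j < 16).
shared = [j * j for j in range(16)]
idx = [3, 0, 15, 7, 0x80, 1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13]
shared_result = pshufb_model(shared, idx)

# Hypothetical per-lane tables: lane k's table maps j to (k + j) mod 256.
tables = [[(k + j) & 0xFF for j in range(16)] for k in range(16)]
per_lane_result = per_lane_lookup(tables, [1] * 16)
```

The first call is a single shuffle. The second has no one-instruction equivalent with these shuffles, because each lane needs its own table. In the bulk-testing case the lanes would be 16 different inputs fed to the same cell, so the shared-table form applies again.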