>Compression after encryption is useless, as there should be NO recognizable patterns to exploit after the encryption.

Not to nitpick, but this is incorrect. An encrypted file can very well have something like 100 X's in a row, which the compression system could turn from XXXXXXXXXXXXXXXX... into (100 X's go here). Lousy example, I know, but it gets the point across.

It's also easy enough to test: just encrypt a file, then compress it. Sometimes it will be smaller. (Your reply indicated that compression is impossible in 100% of attempts.)



No. If the encrypted output is randomly distributed, as it should be, then the expected number of bits cannot be reduced through compression. If it successfully compresses some files, it will make some files larger. For random data, the change will be negligible with overwhelming probability, plus there will be overhead.


@DanBlake

That's a text file, which doesn't use the full range of a byte (0-255), so each character takes less than a byte.
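
As a rough illustration of that point (my own sketch, not from the thread): you can estimate the entropy per byte of plain text versus random bytes.

    import collections, math, os

    def bits_per_byte(data: bytes) -> float:
        """Empirical Shannon entropy of a byte string, in bits per byte."""
        counts = collections.Counter(data)
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    text = ("the quick brown fox jumps over the lazy dog " * 500).encode()
    print(bits_per_byte(text))               # well under 8 bits per byte
    print(bits_per_byte(os.urandom(20000)))  # very close to 8 bits per byte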


I'm happy to reproduce this in other formats. The contents are still random, which is the crux of the issue here.


The contents are not purely random. For example, I can predict with 100% certainty that any given byte in the file is not the value zero (ASCII NUL).

Try compressing the output of /dev/urandom on your nearest convenient UNIX-like system. If you figure out a way to reliably and significantly compress that, please report back.
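
For example, a quick equivalent of that test in Python (my own sketch; os.urandom plays the role of /dev/urandom):

    import gzip, os

    data = os.urandom(10 * 1024)                       # 10 KiB of random bytes
    compressed = gzip.compress(data, compresslevel=9)
    print(len(data), len(compressed))                  # the "compressed" version comes out larger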


My point was never that it was possible 100% of the time. Simply that it was possible some of the time.


On average, you will expand files, not compress them.

Also, the odds of finding even a single encrypted file of that length which can be compressed are lower than the odds of winning the lottery three times in a row. Trust me, you didn't just happen to stumble across a working example. You screwed up the test.


There are no real patterns in encrypted data. Only tiny exponentially rare streaks. This means that compressing is still useless, as originally claimed. You're arguing against something that wasn't said.


Using a random text generator: http://pasted.co/encrypted.txt (10,002 bytes)

Then compressed it with WinRAR: http://pasted.co/compressed.rar (7,743 bytes)

Roughly 20% compression isn't meaningless, so I fail to see why you are just giving a flat 'no' when it's obvious what you are saying is untrue.


You aren't performing your test properly. Encryption doesn't output random text; it outputs bytes of data (0-255). It should be obvious that compressing text is possible.

    $ dd if=/dev/urandom of=test bs=1k count=10
    $ cat test | gzip -9 > test.gz
    $ cat test | bzip2 -9 > test.bz2
    $ ls -al test*
    -rw-r--r-- 1 r1ch r1ch 10240 Jun 28 16:30 test
    -rw-r--r-- 1 r1ch r1ch 10739 Jun 28 16:31 test.bz2
    -rw-r--r-- 1 r1ch r1ch 10263 Jun 28 16:31 test.gz

Random data is not compressible.

    $ dd if=/dev/zero of=test bs=1k count=10
    $ openssl enc -in test -aes-256-ctr -out test.encrypted
    $ cat test.encrypted | gzip -9 > test.encrypted.gz
    $ cat test.encrypted | bzip2 -9 > test.encrypted.bz2
    $ ls -al test*
    -rw-r--r-- 1 r1ch r1ch 10240 Jun 28 16:32 test
    -rw-r--r-- 1 r1ch r1ch 10256 Jun 28 16:33 test.encrypted
    -rw-r--r-- 1 r1ch r1ch 10737 Jun 28 16:34 test.encrypted.bz2
    -rw-r--r-- 1 r1ch r1ch 10279 Jun 28 16:33 test.encrypted.gz

Encrypted zeroes are not compressible.


That's a base64 encoding of random data, which by definition has only 6 bits of entropy per byte. You should be able to compress it by about 25%, but you only managed 22.6%.
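
A quick sketch of where that 25% figure comes from (my own illustration, not from the thread):

    import base64, os, zlib

    raw = os.urandom(8 * 1024)
    b64 = base64.b64encode(raw)          # 6 bits of entropy per output byte

    compressed = zlib.compress(b64, 9)
    print(len(b64), len(compressed))     # at best roughly a 25% reduction
    print(len(compressed) >= len(raw))   # and essentially never below the original binary size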


Your random text isn't uniformly random - any byte that isn't a letter or number never shows up.


Here it is encrypted with AES256. The password is 'wasteoftime'.

http://pasted.co/wasteoftime.txt (4,732 bytes)

http://pasted.co/wasteoftime.rar (3,753 bytes)

Also, if there are issues with using the 'wrong' encryption, I feel that's kind of a straw-man argument. Please feel free to upload a file over 5MB which can't be reduced in size by any of the various compression tools.

Also, keep in mind that I never said it would be a huge benefit at all; I only said that SOME compression was possible SOME of the time.


That's not encrypted with AES256, that's encrypted with AES256 and then encoded with base64.

Do you understand the difference between base64 and binary data? Base64 only uses six bits per byte. It's designed to allow data to transit cleanly through places which only allow text, such as JSON. Binary data is eight bits per byte. Binary data is what encryption and compression algorithms output. Naturally, if you take data which only uses six bits per byte, and run it through a compression algorithm which is able to use eight bits per byte, you can achieve good compression. But this is illusory, because you expanded the original by 33% when you applied base64 in the first place! All your attempts at compression can be beaten by simply decoding the base64 into the original binary data.

It doesn't matter what magnitude of compression you specified. Random data cannot be compressed at all on average. This is a simple mathematical certainty with a straightforward proof. Encrypted data looks and acts like random data. If you find a way to reliably compress the output of a modern, secure encryption algorithm (not the base64 encoding, but the original binary) then you'll have found a massive security hole in it which will make you famous.


And as a note: It's fine to use base64 if you want. But then you also need to output the compressed file as base64. And you'll see that 3753 octets equate to 5004 base64 characters, and the file is actually significantly larger than it was before compression.
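
A quick check of that arithmetic (my own sketch; the 3,753 and 4,732 figures come from the posts above):

    import math

    compressed_bytes = 3753                           # size of the .rar above
    b64_chars = math.ceil(compressed_bytes / 3) * 4   # base64 emits 4 characters per 3 bytes
    print(b64_chars)                                  # 5004, larger than the 4,732-byte base64 original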


> Also, keep in mind that I never said it would be a huge benefit at all; I only said that SOME compression was possible SOME of the time.

This is a common misconception. The pigeonhole principle here tells us that any scheme that ever achieves some compression also achieves some expansion. The only reason compression algorithms work at all is because they are more likely to compress than they are to expand, because we know something about the probability distribution of the plaintexts.

The pigeonhole argument treats compression as a black box with input and output, and makes no assumption about how the compression works.

Your idea (to only compress some inputs and not others) doesn't change the fact that your algorithm can be treated as a black box with inputs and outputs. So the pigeonhole principle still applies. Many people have made this argument before, so we are very familiar with it, and very familiar with the reason why it is wrong.
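
To make that counting argument concrete (a small sketch of my own):

    # There are 2**n bit strings of length n, but only 2**n - 1 strings of all
    # shorter lengths combined, so no lossless scheme can shrink every input.
    n = 16
    longer = 2 ** n
    shorter = sum(2 ** k for k in range(n))   # lengths 0 .. n-1
    print(longer, shorter)                    # 65536 vs 65535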

FURTHERMORE, compression, by its very nature, works very poorly on data which is apparently uniformly distributed, as encrypted data is.

This is why you are wrong.


This difference is because your random text values are not random binary values.

You are trying to bring empirical evidence to an information-theoretic fight :(


No. If your encryption is not busted, then the entropy under /any/ model tends toward the maximum. If a model is able to predict with /any/ success, then your crypto is probably broken.

So yes, it is random, and 100 X's could appear in a row. But if your model can effectively compress that, then the model has to be wasting space for all the other sequences.

Let's just imagine that you are doing something like basic run-length encoding (GIF! Or JIF if you're wrong :) )

So 100 X's turns into (100, X), and you've compressed that part of the stream. Unfortunately everything else is likely (1, a), etc., so every other symbol/byte increases in size. If we look at the stream probabilities, the chance of a run of n copies of X is roughly P(X)^n -- very rough, I haven't been at uni for a long time :). You can do a bunch of reasonably simple math to verify it, but you eventually end up with an average symbol size of sum over x in X of -P(x) log P(x) (for a uniform distribution, the sum of -log(P(x))/|X|), where X is your set of possible symbols and each output token is a pair (n, x) with x in X.

For a given stream that is not encrypted, you may save some space representing the input bytes, but each symbol (n, x) has at /least/ 1 extra bit of information. Overall you can still save space.

But let's look at the encrypted data. Each input symbol is independent, so every sequence in the input is equally likely. You will typically have a run length of 1 (P = 0.5), 2 (P = 0.25), and so on.

So at least 50% of your output symbols will have at least one extra bit of information /added/. This means your output is by necessity bigger than the input.

I used RLE as an example here, but it's true for all compression schemes. This is a necessary property for compression to work: for any data that compresses under a given model, there must be some input that increases in size (or some such, again, it's been a long time since uni). I believe Wikipedia has an article on the pigeonhole principle that probably explains this better than I can.
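
Here's a tiny sketch of that RLE argument (my own toy byte-level RLE, not any real format):

    import os

    def rle(data: bytes):
        """Naive run-length encoding: list of [run_length, byte_value] pairs."""
        runs = []
        for b in data:
            if runs and runs[-1][1] == b and runs[-1][0] < 255:
                runs[-1][0] += 1
            else:
                runs.append([1, b])
        return runs

    def encoded_size(runs):
        return 2 * len(runs)   # one count byte plus one value byte per run

    repetitive = b"\x00" * 10000       # long runs: RLE shrinks this to ~80 bytes
    random_data = os.urandom(10000)    # runs are almost all length 1: RLE roughly doubles it

    print(len(repetitive), encoded_size(rle(repetitive)))
    print(len(random_data), encoded_size(rle(random_data)))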


> A encrypted file can very well have something like 100 X's in a row

Any decent encryption algorithm has output indistinguishable from random noise; the odds of 100 X's in a row in random data are 1 in 2^800; there are only about 10^80 particles in the universe (roughly 2^266), which is 534 base-2 orders of magnitude smaller than 2^800. Yes, every particle in the universe could be a universe of fundamental particles and you'd still never see 100 X's in a row in a decently-encrypted message.
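
A quick check of those numbers (my own arithmetic sketch):

    import math

    run_bits = 8 * 100                    # 100 specific bytes in a row: probability 2^-800
    particles_log2 = 80 * math.log2(10)   # ~10^80 particles is roughly 2^266
    print(run_bits, round(particles_log2), run_bits - round(particles_log2))   # 800 266 534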


Feel free to upload a 5MB file under your narrow restrictions and I will see if I can reduce it with one of the 20 different compression systems available to me.


Here you go: https://mikeash.com/tmp/randomfile.bin

If you manage to get that under 5MB by applying any tool which allows recovering the original data, I will be most interested to know how you did it.


Success. compress.sh:

    #!/bin/bash
    echo "#!/bin/bash"
    echo curl $1

./compress.sh https://mikeash.com/tmp/randomfile.bin > randomfile.compressedzomg

:D

/me hides


You're lucky I'm too lazy to edit a byte on the server and then ask you if it still matches.


It would. You would need to change the location of the file.


Fine, fine, you're^Wthey're lucky I'm too lazy to make it return new random data each time.


\o/ :D


The larger the random file, the _less_ likely you are to successfully compress it.

Everyone here is telling you that you are wrong, and this has even been clearly demonstrated through both mathematical reasoning and empirical tests. Quit digging :-)


> Quit digging :-)

On that note, sometimes we all get too attached to our own hard-won mistakes. We all need to remember that sometimes we should set aside pride and just say, "Whoops, my reasoning was unclear or I was misinformed, thanks for correcting my misstatement or misunderstanding."


Ooh, the encryption comment I just made reminded me of an example of compression potentially working on encrypted content.

AES-ECB mode will actually compress fairly well for some inputs (the canonical example being a bitmap image). The reason the compression can work? Because the crypto is broken. In the image case you can actually just visually (no maths or anything) see a large amount of the details of the source image.
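
A rough sketch of that effect (my own illustration; it assumes the third-party pycryptodome package for AES):

    import os, zlib
    from Crypto.Cipher import AES   # pycryptodome

    key = os.urandom(32)
    plaintext = bytes(16) * 65536   # 1 MiB of identical 16-byte blocks

    ecb = AES.new(key, AES.MODE_ECB).encrypt(plaintext)
    ctr = AES.new(key, AES.MODE_CTR, nonce=os.urandom(8)).encrypt(plaintext)

    # ECB encrypts identical plaintext blocks to identical ciphertext blocks, so
    # the "encrypted" output is still highly repetitive and compresses very well.
    print(len(zlib.compress(ecb, 9)))   # tiny
    print(len(zlib.compress(ctr, 9)))   # about 1 MiB or slightly more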



If there is structure in an encrypted file, then the encryption is broken. Otherwise the output is computationally indistinguishable from true random data, and no compression algorithm can compress it.

An additional point you need to consider: you can select between different models and compression programs (and some programs do do this), but you have to record which one you used in your output. This again increases the size of the output. It means that even a no-op compression algorithm (say cat or cp :D ) results in output that is bigger, because you need an initial byte to say you haven't tried to compress.
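
A small sketch of that bookkeeping cost (my own illustration):

    import os, zlib

    def compress_with_choice(data: bytes) -> bytes:
        """Pick stored or deflated, whichever is smaller, plus a one-byte header saying which."""
        deflated = zlib.compress(data, 9)
        if len(deflated) < len(data):
            return b"\x01" + deflated
        return b"\x00" + data   # stored as-is, but the header byte still costs us

    random_input = os.urandom(10 * 1024)
    print(len(random_input), len(compress_with_choice(random_input)))   # almost certainly one byte longer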

I think the better approach to what you're claiming is for you to provide an example that is compressed. Just provide the input, the key, the encryption algorithm (AES-GCM for preference, but seriously AES-CBC would work too), and the compression algorithm.


I don't have a handy place to stash 5MB of data publicly, but you can run the following in SBCL (or the equivalent in whatever language you prefer):

    ;; assumes the Ironclad library is loaded, e.g. via (ql:quickload :ironclad)
    (in-package :cl-user)
    (use-package :ironclad)
    (let ((data (make-array (* 5 1024 1024)
                            :element-type '(unsigned-byte 8)
                            :initial-element (char-code #\X))))
      (encrypt-in-place
       (make-cipher
        :aes
        :key (sb-ext:string-to-octets "YELLOW SUBMARINE")
        :mode :ctr
        :initialization-vector (make-array 16 :element-type '(unsigned-byte 8)))
       data)
      (with-open-file (tmp "/tmp/data" :direction :output
                           :element-type '(unsigned-byte 8))
        (write-sequence data tmp)))
That's a file of 5,242,880 'X' characters, encrypted with the key 'YELLOW SUBMARINE' and an all-zero initialization vector. You will not be able to compress it.


> It wont be a huge difference in file size, but it will be smaller.

It may be smaller; not all outputs will be smaller. This is true of all lossless compression algorithms:

https://en.wikipedia.org/wiki/Pigeonhole_principle


This is literally all I was saying, and people went on a downvote frenzy. People are literally saying that an encrypted output can NEVER be compressed. I was saying that it can (although not all the time!) and that the gains would be minimal. I've corrected the first post with the word 'may'.


You're answering a question that no one is interested in. The problem isn't "can any singular instance of an encrypted output be compressed". That's a trivial conclusion -- as it's theoretically completely random bits, of course some combinations of those bits are going to be compressible, maybe even highly compressible. It might be that the encrypted output is entirely a single repeated byte, in which case a 4-gigabyte message could be compressed to just a few bytes (a repeat count plus the byte value to repeat).

The real question is, can you reliably compress encrypted messages in general? And, given that the encryption is not broken, the answer is no.


In practice, with any real, widely used compression algorithm, encrypted output will never be compressed. You can generate 100,000 random binary files and run them through all the major compression tools and not one of them will compress by even a single bit.

In theory, some random outputs will be compressible by common tools (for example, a file that's all zeroes). However, the probability of encountering a random file that compresses more than the overhead from any common compression tool is vanishingly small.


> People are literally saying that an encrypted output can NEVER be compressed.

No they're not. They're saying that as a class, you cannot compress the data. That is a statement about what happens in aggregate. And the aggregate result of trying to compress securely-encrypted data is that you don't save bytes. Even just having the option of compression does not save bytes, because you have to store the choice.


Monkeys hitting random keys on a typewriter can write novels (although not all the time!)


Any one given input can be losslessly compressed down to 1 bit with the exact right compression algorithm. I think everyone here understands that. The argument is just that that's not really useful in the real world.


But, also by the same ideas, most encrypted texts will not be able to be compressed by more than a single bit.



