>Compression after encryption is useless, as there should be NO recognizable patterns to exploit after the encryption.
Not to nitpick, but this is incorrect. An encrypted file can very well have something like 100 X's in a row, which the compression system could turn from XXXXXXXXXXXXXXXX... into (100 X's go here) - lousy example, I know, but it gets the point across.
It's also easy enough to test: just encrypt a file, then compress it. Sometimes it will be smaller. (Your reply indicated that in 100% of attempts compression is impossible.)
No. If the encrypted output is randomly distributed, as it should be, then the expected number of bits cannot be reduced through compression. If it successfully compresses some files, it will make some files larger. For random data, the change will be negligible with overwhelming probability, plus there will be overhead.
The contents are not purely random. For example, I can predict with 100% certainty that any given byte in the file is not the value zero (ASCII NUL).
Try compressing the output of /dev/urandom on your nearest convenient UNIX-like system. If you figure out a way to reliably and significantly compress that, please report back.
On average, you will expand files, not compress them.
Also, the odds of finding even a single encrypted file of that length which can be compressed are lower than the odds of winning the lottery three times in a row. Trust me, you didn't just happen to stumble across a working example. You screwed up the test.
There are no real patterns in encrypted data. Only tiny exponentially rare streaks. This means that compressing is still useless, as originally claimed. You're arguing against something that wasn't said.
You aren't performing your test properly. Encryption doesn't output random text, it outputs bytes of data (0-255). It should be obvious that compressing text is possible.
$ dd if=/dev/urandom of=test bs=1k count=10
$ cat test | gzip -9 > test.gz
$ cat test | bzip2 -9 > test.bz2
$ ls -al test*
-rw-r--r-- 1 r1ch r1ch 10240 Jun 28 16:30 test
-rw-r--r-- 1 r1ch r1ch 10739 Jun 28 16:31 test.bz2
-rw-r--r-- 1 r1ch r1ch 10263 Jun 28 16:31 test.gz
Random data is not compressible.
$ dd if=/dev/zero of=test bs=1k count=10
$ openssl enc -in test -aes-256-ctr -out test.encrypted
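(If you then compress the result - note the openssl step will prompt for a passphrase unless you add something like -pass pass:example - you get roughly this; exact byte counts will vary:)
$ gzip -9 -c test.encrypted | wc -c    # a little over 10240, i.e. bigger than the all-zero input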
That's a base64 encoding of random data, which by definition has only 6 bits of entropy per byte. You should be able to compress it by 25%, but you only managed 22.6%.
Also, if there are issues with using the 'wrong' encryption, I feel that's kind of a straw man argument. Please feel free to upload a file over 5MB which can't be reduced in size through any of the various compression tools.
Also, keep in mind that I never said it would be a huge benefit at all- I only said that SOME compression was possible SOME of the time.
That's not encrypted with AES256, that's encrypted with AES256 and then encoded with base64.
Do you understand the difference between base64 and binary data? Base64 only uses six bits per byte. It's designed to allow data to transit cleanly through places which only allow text, such as JSON. Binary data is eight bits per byte. Binary data is what encryption and compression algorithms output. Naturally, if you take data which only uses six bits per byte, and run it through a compression algorithm which is able to use eight bits per byte, you can achieve good compression. But this is illusory, because you expanded the original by 33% when you applied base64 in the first place! All your attempts at compression can be beaten by simply decoding the base64 into the original binary data.
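A quick way to see this at the shell (file names are mine, and the exact sizes will vary a little):
$ dd if=/dev/urandom of=cipher bs=1k count=10    # stand-in for properly encrypted output
$ base64 cipher > cipher.b64                     # grows to roughly 13.7k (4/3 expansion plus newlines)
$ gzip -9 -c cipher.b64 | wc -c                  # shrinks back toward ~10.4k, still above the 10240-byte original
All the "compression" ever recovers is the overhead base64 added; the original binary is never beaten.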
It doesn't matter what magnitude of compression you specified. Random data cannot be compressed at all on average. This is a simple mathematical certainty with a straightforward proof. Encrypted data looks and acts like random data. If you find a way to reliably compress the output of a modern, secure encryption algorithm (not the base64 encoding, but the original binary) then you'll have found a massive security hole in it which will make you famous.
And as a note: It's fine to use base64 if you want. But then you also need to output the compressed file as base64. And you'll see that 3753 octets equate to 5004 base64 characters, and the file is actually significantly larger than it was before compression.
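(The conversion is just ceil(3753 / 3) * 4, which you can check with shell arithmetic:)
$ echo $(( (3753 + 2) / 3 * 4 ))    # 5004 base64 characters, before any line wrapping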
> Also, keep in mind that I never said it would be a huge benefit at all- I only said that SOME compression was possible SOME of the time.
This is a common misconception. The pigeonhole principle here tells us that any scheme that ever achieves some compression also achieves some expansion. The only reason compression algorithms work at all is because they are more likely to compress than they are to expand, because we know something about the probability distribution of the plaintexts.
The pigeonhole argument treats compression as a black box with input and output, and makes no assumption about how the compression works.
Your idea - to only compress some inputs and not others - doesn't change the fact that your algorithm can be treated as if it is a black box with inputs and outputs. So the pigeonhole principle still applies. Many people have made this argument before, so we are very familiar with it, and with the reason why it is wrong.
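Here's the counting argument scaled down to 16-bit inputs, which you can check with nothing but shell arithmetic:
$ echo $(( 2**16 ))       # 65536 distinct 16-bit inputs
$ echo $(( 2**16 - 1 ))   # 65535 distinct outputs of 15 bits or fewer (2^0 + 2^1 + ... + 2^15)
So at least one 16-bit input has to map to an output that is 16 bits or longer, no matter how clever the black box is.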
FURTHERMORE, compression, by its very nature, works very poorly on data which is apparently uniformly distributed, as encrypted data is.
No. If your encryption is not busted, then the entropy under /any/ model tends toward the maximum. If a model is able to predict with /any/ success, then your crypto is probably broken.
So yes, it is random, and 100 X's could appear in a row. But if your model can effectively compress that, then the model has to be wasting space for all the other sequences.
Let's just imagine that you are doing something like basic run-length coding (gif! Or jif if you're wrong :) )
So 100 X's turns into (100, X), and you've compressed that part of the stream. Unfortunately everything else is likely (1, a), etc., so every other symbol/byte increases in size. If we look at the stream probabilities we get something like P(a run of n X's) = P(X)^n -- very rough, I haven't been at uni for a long time :). You can do a bunch of reasonably simple math to verify it, but you eventually end up with an average symbol size of sum{x in X} -log(P(x)) / |X| (the entropy of a uniformly distributed source),
where X is your set of possible symbols and each output symbol is a pair (n, x) with x in X.
For a given stream that is not encrypted you may save some space representing the input bytes, but each symbol (n,X) has at /least/ 1 extra bit of information. Overall you can save space.
But let's look at the encrypted data. Each input symbol is independent, so every sequence of a given length is equally likely. For a binary alphabet you will have a run length of 1 with P=0.5, a run length of 2 with P=0.25, and so on.
So 50% of your output symbols will have at least one extra bit of information /added/. This means your output by necessity is bigger than the input.
I used RLE as an example here, but it's true for all compression schemes. This is a necessary property for compression to work: for any input that shrinks under a given model, there must be some input that grows (again, it's been a long time since uni). I believe Wikipedia has an article on the pigeonhole principle that explains this better than I can.
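If you want to eyeball the RLE behaviour at a shell, fold -w1 plus uniq -c is a crude stand-in for it (the input string is made up):
$ printf 'XXXXXXXXXXAB' | fold -w1 | uniq -c
     10 X
      1 A
      1 B
The run of X's shrinks nicely, but every run of length 1 now drags a count along with it, which is exactly the overhead described above.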
> A encrypted file can very well have something like 100 X's in a row
Any decent encryption algorithm has output indistinguishable from random noise; the odds of 100 X's in a row in random data are 1 in 2^800. There are only 10^80 particles in the universe (about 2^266), which is 534 base-2 orders of magnitude smaller than 2^800. Yes, every particle in the universe could be a universe of fundamental particles and you'd still never see 100 X's in a row in a decently-encrypted message.
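(The unit conversion is just logarithms, e.g. with bc:)
$ echo '800 * l(2) / l(10)' | bc -l    # ~240.8, so 2^800 is about 10^241
$ echo '80 * l(10) / l(2)' | bc -l     # ~265.8, so 10^80 particles is about 2^266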
Feel free to upload a 5MB file under your narrow restrictions and I will see if I can reduce it with one of the 20 different compression systems available to me.
The larger the random file, the _less_ likely you are to successfully compress it.
Everyone here is telling you that you are wrong, and have even clearly demonstrated this through both mathematical reasoning and empirical tests. Quit digging :-)
On that note, sometimes we all get too attached to our own hard-won mistakes. We all need to remember that sometimes we should set aside pride and just say, "Whoops, my reasoning was unclear or I was misinformed, thanks for correcting my misstatement or misunderstanding."
Ooh, the encryption comment I just made reminded me of an example of compression potentially working on encrypted content.
AES-ECB mode will actually compress fairly well for some inputs (the canonical example being a bitmap image). The reason the compression can work? Because the crypto is broken. In the image case you can actually just visually (no maths or anything) see a large amount of the details of the source image.
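You don't even need the image to see it. ECB turns identical plaintext blocks into identical ciphertext blocks, and gzip flattens the repetition (pass phrase and exact size here are made up for the sketch):
$ dd if=/dev/zero bs=1k count=10 | openssl enc -aes-128-ecb -pass pass:example -nosalt | gzip -9 | wc -c
That prints a few dozen bytes for roughly 10k of "ciphertext", because every 16-byte block is identical - exactly the structure a broken mode leaks.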
If there is structure in an encrypted file then the encryption is broken. It is mathematically indistinguishable from true random data. No compression algorithm can compress it.
An additional point you need to consider: You can select different models and compression programs (and some programs do exactly this), but you have to record which one you used in your output. This again increases the size of the output - it means that even a no-op compression algorithm (say cat or cp :D ) results in output that is bigger (you need an initial byte to say you haven't tried to compress).
I think the better approach to what you're claiming is for you to provide an example that is compressed. Just provide the input, the key, the encryption algorithm (AES-GCM for preference, but seriously AES-CBC would work too), and the compression algorithm.
That's a file of 5,242,880 'X' characters, encrypted with the key 'YELLOW SUBMARINE' and an all-zero initialization vector. You will not be able to compress it.
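For reference, something like this reproduces that kind of file (I'm guessing AES-128-CBC, since 'YELLOW SUBMARINE' is a 16-byte key; file names are mine):
$ head -c 5242880 /dev/zero | tr '\0' 'X' > xs.txt
$ openssl enc -aes-128-cbc -K 59454c4c4f57205355424d4152494e45 -iv 00000000000000000000000000000000 -in xs.txt -out xs.enc
$ gzip -9 -c xs.enc | wc -c    # comes out slightly larger than xs.enc, not smaller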
This is literally all I was saying and people went on a downvote frenzy. People are literally saying that an encrypted output can NEVER be compressed. I was saying that it can (although not all the time!) and that the gains would be minimal. I've corrected the first post with the word 'may'.
You're answering a question that no one is interested in. The problem isn't "can any singular instance of an encrypted output be compressed". That's a trivial conclusion -- as it's theoretically completely random bits, of course some combinations of those bits are going to be compressible, maybe even highly compressible. It might be that the encrypted output is entirely a single repeated byte, in which case a 4-gigabyte message could be compressed to a handful of bytes (an integer count plus the byte value to repeat).
The real question is, can you reliably compress encrypted messages in general? And, given that the encryption is not broken, the answer is no.
In practice, with any real, widely used compression algorithm, encrypted output will never be compressed. You can generate 100,000 random binary files and run them through all the major compression tools and not one of them will compress by even a single bit.
In theory, some random outputs will be compressible by common tools (for example, a file that's all zeroes). However, the probability of encountering a random file that compresses more than the overhead from any common compression tool is vanishingly small.
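You can scale that experiment down and watch it fail in real time; in practice none of these will print a number below 1024:
$ for i in $(seq 1 200); do dd if=/dev/urandom bs=1k count=1 2>/dev/null | gzip -9 | wc -c; done | sort -n | head -3
(Every line comes out a little over 1040 bytes: the 1024-byte input plus gzip's own framing.)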
> People are literally saying that an encrypted output can NEVER be compressed.
No they're not. They're saying that as a class, you cannot compress the data. That is a statement about what happens in aggregate. And the aggregate result of trying to compress securely-encrypted data is that you don't save bytes. Even just having the option of compression does not save bytes, because you have to store the choice.
Any one given input can be losslessly compressed down to 1 bit with the exact right compression algorithm. I think everyone here understands that. The argument is just that that's not really useful in the real world.
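In shell terms, that "exact right algorithm" for one particular file is a toy like the following (magic.bin and the flag byte are made up), and it pays for the trick by growing every other input by a byte:
compress() {
  if cmp -s "$1" magic.bin; then
    printf '\1'            # one byte meaning "this was the magic file"
  else
    printf '\0'; cat "$1"  # flag byte plus the original, so everything else grows
  fi
}
That flag byte is the "store the choice" cost mentioned above.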