Insecure Features in PDFs (2021) (web-in-security.blogspot.com)
89 points by todsacerdoti on Feb 26, 2024 | 30 comments


Though it barely mentions security, https://willcrichton.net/notes/portable-epubs/ is one of my favorite essays to be posted to HN in a while: https://news.ycombinator.com/item?id=39138042

I wonder how one would measure the costs and benefits (with a focus on security) of speeding up the gargantuan task of shifting to a better, well, portable document format, and of making that shift more security-driven. People are thinking big and want measurement; e.g., just posted: https://news.ycombinator.com/item?id=39514844 (the WH release on memory safety, https://www.whitehouse.gov/oncd/briefing-room/2024/02/26/pre...). Would it not make sense to be similarly ambitious and metrics-driven for this too?


PDFs aren’t a widespread malware vector in practice. Hence there is little pressure, and it’s much easier to enforce a profile like PDF/A than to migrate to a different document format.


* cough * cough *

You can embed a port scanner in JavaScript inside a PDF document via /JavaScript tags.

And many high-end IDS/IPS/NDS/XNS firewalls look into PDF documents for things like that.

https://opensource.adobe.com/dc-acrobat-sdk-docs/library/sam...


There is no contradiction to what I wrote.


In a lot of situations I use

pdftotext file.pdf - | nroff | less

pdftotext is from the Poppler developers (http://poppler.freedesktop.org) and Glyph & Cog, LLC

nroff is a common GNU utility on most Linux/Unix systems

(though I don't trust poppler utils to be secure)


Yeah, I wasn't referring to the security of tools used to process PDFs, including memory-safe rewrites, but that too! Instead, I meant whether incentivizing migration to a safer portable document format couldn't be justified and done in a similar way to how memory safety is being approached now.

Edit: re-reading, I guess your point is that you can use tools to extract text from PDFs and then read without worry. That brings up another super annoying thing about PDFs: it can be hard to extract text from them with high fidelity.


> can be hard to extract text from them with high fidelity.

Or impossible with scans. Usually I take a look at the file size as a guesstimate. Does anyone know of better FOSS OCR tools that run from the command line?


The irony doesn't escape me that a link entitled "portable epubs" loads infinitely with the ... ahem, helpful ... text of "To work around the bug, you either need to close any other tabs of this document (in Google Chrome), or try a different browser."

In practice, it's a simple JS kaboom that they didn't catch, because error handling is for n00bs


I've had to fix the outline/bookmark loop problem before. It is quite an unexpected problem. There are billions of documents out there and who knows how many different PDF generators or editors (writing a PDF is easy, reading a PDF is insanely difficult). I've had normal, non-malicious documents where the software trying to merge documents or move pages has caused the outline/bookmark tree to have a loop.
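For anyone who wants to guard against this, here is a minimal sketch of loop detection while walking an outline. It assumes a hypothetical in-memory model (a dict mapping object ids to dicts with optional "First" and "Next" entries, mirroring the /First and /Next keys of PDF outline items); it is not tied to any real PDF library, and the function name is made up for illustration.

```python
def find_outline_loop(items, root_id):
    """Return the id of the first outline item reached twice, or None.

    `items` is a hypothetical mapping: object id -> {"First": id, "Next": id}
    (either key may be missing), standing in for parsed outline dictionaries.
    """
    seen = set()
    stack = [root_id]
    while stack:
        obj_id = stack.pop()
        if obj_id is None or obj_id not in items:
            continue  # dangling or absent reference: nothing to follow
        if obj_id in seen:
            return obj_id  # reached twice: a loop (or a node shared by two parents)
        seen.add(obj_id)
        node = items[obj_id]
        # Push the sibling first so the child subtree is visited first.
        stack.append(node.get("Next"))
        stack.append(node.get("First"))
    return None


# A well-formed outline: root 1 -> child 2 -> sibling 3
ok = {1: {"First": 2}, 2: {"Next": 3}, 3: {}}
# A broken one: item 3's "Next" points back at item 2
bad = {1: {"First": 2}, 2: {"Next": 3}, 3: {"Next": 2}}

print(find_outline_loop(ok, 1))   # None
print(find_outline_loop(bad, 1))  # 2
```

Because the walk is iterative and remembers every id it has visited, it terminates on malformed input instead of recursing forever, which is the failure mode a loop would otherwise trigger.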


Why PostScript had to be Turing-complete makes no sense to me. Loops, code-execution, functions, it all seems so unnecessary for a markup and presentation language.


PostScript does more than markup and presentation; it is an entire 2D graphics engine. At one point PostScript served as the graphics substrate for the Sun NeWS and NeXT Display PostScript-based window systems. While I agree from a security standpoint that its Turing-completeness poses challenges, it also makes it easier to express certain constructs such as complex shapes and fonts programmatically.

I’m not a PostScript expert but I’ve been reading a lot about it recently. It’s a rather fascinating system for 2D graphics.


Turing completeness isn’t really a huge problem (and not included in that part of PDF anyway). PostScript in its printer application doesn’t really have I/O except for, well, the printer, and no raw pointers, so there’s a smaller surface than, say, JavaScript (no network access).


On its own, the Turing-completeness of PostScript is not a vulnerability, per se. But it does mean that anything that interprets or renders PostScript might take an unbounded (or infinite) amount of time to do so. So PostScript-handling software (even things like virus scanners) has to be coded to handle timing out and interrupting/cancelling PostScript interpretation. (PostScript/PDF is often the only format among the many "document formats" or "image formats" handled by such software that imposes this requirement, so when multi-format "reader" software first adds PostScript- or PDF-handling capability, it often finds itself suddenly needing to rearchitect from single-threaded, bounded-time blocking calls to parsers, to async, concurrency, timers, etc.)

That being said, in combination with exploitable vulnerabilities in particular PostScript parsers used in software like Acrobat, the Turing-completeness of PostScript makes it much harder to detect such exploits. 0days in PDF readers are "nice exploits to have", because a PDF virus can be coded such that it's represented in code inside the PDF in an unbounded number of different ways, foiling signature-based virus scanners.
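To make the timeout point concrete, here is a rough sketch of the simplest way to bound interpretation time: run the interpreter as a child process and kill it at a deadline. The helper name is hypothetical, and a Python busy loop stands in for a never-terminating PostScript program so the example runs without any interpreter installed; in real use the command list would be whatever interpreter command line you already use.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run `cmd`; return its exit code, or None if it exceeded the time limit."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=seconds,
        )
        return done.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        return None


# Stand-in for an endless program (assumption: not real PostScript).
looping = [sys.executable, "-c", "while True: pass"]
# Stand-in for a well-behaved document that finishes immediately.
quick = [sys.executable, "-c", "pass"]

print(run_with_timeout(quick, 30))   # exit code of the quick run
print(run_with_timeout(looping, 1))  # None: the deadline was hit
```

A separate process is a blunt tool, but it avoids having to make the parser itself interruptible, which is the rearchitecting cost described above.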


There's a ZMachine interpreter written in PostScript:

       gs -dNOSAFER zmachine.ps -- yourgame.z3


It was the early 80’s, probably the flexibility seemed nice, and the riskiness of putting a full programming language in your documents was probably not as obvious.

Thankfully we learned from them and didn’t repeat that mistake over and over again.


WASM?


Yes, and JavaScript before that. His last line was sarcasm.


PostScript is deprecated in PDF 2.0 and is not the source of the issues listed in TFA.


Well, if you think it's possible, try coming up with even the core of an architectural basis for

1. a declarative language for describing the same things PostScript describes,

2. which allows the rasterization of arbitrary shapes at arbitrary DPI (PostScript is DPI-oblivious — it's up to the printer what DPI it's printing at!);

3. and which also works for vector plotters, which will never rasterize the data you're sending at all, but will actually follow the Bézier curves, like the 2D version of 3D-printer GCODE;

4. and which enables the implementation of this rasterization and/or plotting on a variety of affordable hardware architectures in the 1980s, a time when CPU power wasn't too expensive but memory prices were at an absolute premium. So your printer might have had a CPU as powerful as your computer's in it, to crunch PostScript, but definitely wouldn't have had the memory to buffer a full rasterized page.

---

To put that last constraint another way: PostScript was designed to be rasterized in a way that enabled printers to do something much akin to "Racing the Beam" (https://www.youtube.com/watch?v=sJFnWZH5FXc).

In both the display-rendering and printing cases, this was done in 1980s hardware, because 1980s memory was too expensive for most systems to be dedicating it to hold a buffer to asynchronously pre-render into and then read from when drawing.

So instead, in both cases, you must render+rasterize in one motion, programmatically and extremely efficiently. And the obvious way to do this is by using a CPU with rasterization MMIO registers it can very quickly poke at, to change mode bits during the rendering process. A CPU whose ISA becomes, in effect, a Domain Specific Bytecode for procedurally generating raster-lines.

If printer vendors of the 1980s could have been expected to agree on a single such ISA, then chip vendors would have just made printer SoCs that conform to that ISA — and we'd have ended up with some kind of "vector-drawing abstract machine" bytecode (with real hardware impls in printer ASICs, but also virtual ones on PCs) rather than PostScript.

But as with RDBMS vendors in the 1980s, the printer vendors were all too invested in their own internal architectures to agree on what the low-level execution plan should look like.

And as with RDBMS vendors, the solution that all these 1980s manufacturers could get behind was a standard for a (theoretically) portable, text-based intermediate language: one that could be generated by computer software, transmitted to their system, and then, inside their system, compiled down to whatever internal representation allows the system to do things its own way.

In RDBMSes, this "intermediate language" was SQL. In printers: PostScript.


Given how well Preview.app and Safari work for viewing >99% of PDFs I actually encounter in the wild, this article makes Apple's engineering decisions look good.

It also confirms many suspicions I've had over the years that have led me to, e.g., running all PDFs from questionable sources through VirusTotal before viewing on platforms where I wouldn't normally run antivirus software.

The original article also confirms my suspicions that this step is inadequate:

> Because the Launch action can be considered as a dangerous feature, we conducted a large-scale evaluation of 294,586 PDF documents downloaded from the Internet, in order to research if there are any legitimate use cases at all. Of those documents, only 532 files (0.18%) contained a Launch action. While none of the files was classified as malicious according to the VirusTotal database, we conclude that the Launch action is rarely used in the wild and its support should be removed by PDF implementations as well as the standard.

Incidentally, the Launch action is still present in the most recent version of the PDF standard[1], with only OS-specific launch parameters deprecated (which include passing arguments to the launched executable, so eliminating deprecated features is still a significant security gain).
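As a rough illustration of what a first-pass check for such features might look like, here is a naive byte scan for the /Launch, /JavaScript, and /JS names. This is an assumption-laden sketch, not how the article's authors ran their evaluation: it does not parse the PDF, and it will miss names that are hex-escaped or that sit inside compressed object streams or encrypted content, so a zero count proves nothing. The function and constant names are made up for illustration.

```python
import re

# Names to look for (assumption: a small, illustrative list, not exhaustive).
RISKY_NAMES = (b"/Launch", b"/JavaScript", b"/JS")

# A PDF name ends at whitespace, a delimiter, or end of data, so require that
# the next byte is not a regular name character (keeps /JS from matching /JSFoo).
_NAME_END = rb"(?![^\x00\t\n\x0c\r ()<>\[\]{}/%])"


def count_risky_names(pdf_bytes):
    """Count literal occurrences of each risky name in raw PDF bytes."""
    counts = {}
    for name in RISKY_NAMES:
        pattern = re.escape(name) + _NAME_END
        counts[name.decode()] = len(re.findall(pattern, pdf_bytes))
    return counts


# Hand-written fragments standing in for uncompressed PDF objects.
sample = (
    b"<< /Type /Action /S /Launch /F (calc.exe) >>\n"
    b"<< /S /JavaScript /JS (app.alert(1)) >>"
)
print(count_risky_names(sample))
```

Anything that turns up here is only a reason to look closer with a real parser; it says nothing about whether the action is malicious.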

Finally, I'm both personally and professionally curious about how the non-DoS examples in this article may apply to non-viewer PDF tools and libraries like qpdf[2] and Ghostscript's original and recently reimplemented PDF interpreters[3].

[1] https://pdfa.org/resource/iso-32000-pdf/

(registration required, but at least the base standard is available at no cost; sadly, important incorporated standards like ISO 21757-1:2020 [ECMAScript for PDF] are not)

[2] https://qpdf.sourceforge.io

[3] https://ghostscript.com/blog/pdfi.html


> Evaluation: Out of 28 tested applications, 26 are vulnerable to at least one attack. (2021)


Is there some program which generates a safe subset of PDF? And/or can filter a PDF into that subset? pdf2ps -> ps2pdf?


Are there any sample documents for these types of attacks? I'd love to test my own service.


Most of our core desktop computer infrastructure was never designed with 24/7 high speed internet access in mind and it shows. I remember MS Office macros being a frequent source of security vulnerabilities as well.


Trying to design a perfect system is a pipe dream. There will always be interactions between components that weren't expected. The best we can do is fix the problems as they come and occasionally break some APIs to build better systems with hindsight, but backward compatibility is such an essential requirement for customers.


I completely agree, however Windows desktops are in particularly bad shape in this regard as they are the result of continuous evolution from the very first version of DOS with an attempt to keep as much backwards compatibility as possible every step of the way. Even the modern NT-based versions of Windows brought over a lot of bits from previous OSes. They were not a completely "clean" rewrite. For example, the Windows 3.1 program manager was still available on 32-bit versions of Windows XP and to this day you still can't name a file any of the following on Windows for compatibility with DOS special file names: CON, PRN, AUX, NUL, COM0, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, COM¹, COM², COM³, LPT0, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9, LPT¹, LPT², and LPT³. In fact you can't even name a file any of those with an extension, such as CON.TXT.
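If you generate file names from untrusted input, a check along these lines can catch the names listed above. It is a minimal sketch built only from that list and the "even with an extension" rule; the helper name is hypothetical, and real Windows behavior has more edge cases and varies by version, so treat it as a conservative filter rather than a definition.

```python
# Digits 0-9 plus the superscript digits from the list above.
_DIGITS = "0123456789\u00b9\u00b2\u00b3"

RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    prefix + digit for prefix in ("COM", "LPT") for digit in _DIGITS
}


def is_reserved_dos_name(filename):
    """Return True if `filename` (no directory part) uses a reserved device name.

    The part before the first dot is compared case-insensitively, so
    "CON", "con" and "CON.TXT" are all treated as reserved.
    """
    stem = filename.split(".", 1)[0]
    return stem.rstrip(" ").upper() in RESERVED


for name in ("CON.TXT", "con", "console.txt", "LPT10", "notes.con"):
    print(name, is_reserved_dos_name(name))
```

Only the part before the first dot matters here, so "notes.con" is fine while "CON.TXT" is not.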


But how do the special file names cause security issues? Or are they just an example of general cruft?


Mostly an example of cruft, although I do see the potential for this unexpected behavior to cause software to misbehave or crash.


There is a difference between a design flaw and an implementation flaw though. It seems like for PDFs, those are mostly design flaws, e.g. doing potentially malicious silent tasks in the background without the user's knowledge. Although it could be argued that the OS should be the component guarding against those types of attacks instead of the PDF implementation.


Comments welcome from Leonard Rosenthol (Adobe, fifteen years as PDF architect)?



