Decompiling 2024: A Year of Resurgance in Decompilation Research (mahaloz.re)
147 points by matt_d on Jan 30, 2025 | 61 comments


This was an informative article and I hope the author continues the series.

Regarding AI-assisted renaming of variables, the author calls this "a strict improvement over traditional decompilation." But looking at the example:

    struct IxpMsg {          struct Buffer {
        char* data;              uint8_t* buffer;
        char* pos;               uint8_t* pos;
        char* end;       =>      uint8_t* streamPos;
        _ixpuint size;           uint32_t bufferSize;
        _ixpuint mode;           uint32_t type;
    };                       }
        Ground Truth                ReSym (AI)
I am reluctant to allow the decompiler to influence my judgment about the meaning of variables. `streamPos` is not equivalent to `end`. Multiply the issue by 20 or 100 incorrect assumptions and it would severely cloud your understanding of the decompiled code.

Combining this with reasoning models that can justify their labels would be very helpful. UX improvements could also be made to indicate confidence or progressively disclose these assumptions.


It's a strict improvement over the technology before LLMs, which usually just assigns random cryptic symbols.


No, it's really not a strict improvement. A meaningless name like `v2` does at least convey that you, as the analyst, haven't understood the role of the variable well enough to rename it to something more fitting to its inferred purpose. If the LLM comes up with an "informative" variable name that is not very well-suited towards what it actually does, the name can waste your time by misleading you as to the role of the variable.


I think Ghidra could do better even without any LLM involved. Ghidra will define local variables like this:

SERVICE_TABLE_ENTRY* local_5c;

I wish it at least did something like:

SERVICE_TABLE_ENTRY* local_5c_pServiceTableEntry;

Oh yeah, there’s probably some plugin or Python script to do this. But I just dabble with Ghidra in my spare time.

It would be great if it tracked the origin of a variable/parameter name, and could show them in a different colour (or some other visual distinction) based on their origin. That way you could easily distinguish “name manually assigned by analyst” (probably correct) vs “name picked by some LLM” (much more tentative, could easily be a hallucination)


In my view one of the most pressing shortcomings of Ghidra is that it can't understand the lifetimes of multiple variables with overlapping stack addresses: https://github.com/NationalSecurityAgency/ghidra/issues/975

Ghidra does have an extensive scripting API, and I've used LLMs to help me write scripts to do bulk changes like you've described. But you would have to think about how you would ensure the name suffix is synchronized as you retype variables during your analysis.
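
As a starting point, here's a rough Jython sketch of that kind of bulk-suffix script against Ghidra's scripting API (untested; the suffix format, the regex for default names, and the sanitization are my own choices):

    # Rough Jython sketch, run from Ghidra's Script Manager: append a type-based
    # suffix to default local names, e.g. local_5c -> local_5c_SERVICE_TABLE_ENTRYPtr.
    import re
    from ghidra.program.model.symbol import SourceType
    from ghidra.util.exception import DuplicateNameException, InvalidInputException

    for func in currentProgram.getFunctionManager().getFunctions(True):  # True = iterate forward
        for var in func.getLocalVariables():
            name = var.getName()
            if not re.match(r"local_[0-9a-fA-F]+$", name):   # only touch default names
                continue
            display = var.getDataType().getDisplayName()
            tag = re.sub(r"[^A-Za-z0-9_]", "", display) + ("Ptr" if "*" in display else "")
            try:
                var.setName("{}_{}".format(name, tag), SourceType.USER_DEFINED)
            except (DuplicateNameException, InvalidInputException):
                pass                                          # collision or bad name: skip it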


Yeah, I don't know why they don't use something like SSA – make every line of code which performs an assignment create a new local variable.

Although I suppose when decompiling to C, you need to translate it out of SSA form when you encounter loops or backwards control flow.
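
Roughly this kind of renaming pass, at least for straight-line code (toy IR invented for illustration; loops and back edges, which need phi nodes, are deliberately left out):

    # Toy sketch: give every assignment its own SSA-style name. The "IR" is just
    # (dest, [source vars]) tuples; back edges/loops are out of scope here.
    def ssa_rename(stmts):
        version, out = {}, []
        for dst, srcs in stmts:
            srcs = ["{}_{}".format(v, version[v]) if v in version else v for v in srcs]
            version[dst] = version.get(dst, -1) + 1
            out.append(("{}_{}".format(dst, version[dst]), srcs))
        return out

    print(ssa_rename([("a", ["x"]), ("a", ["a", "y"]), ("b", ["a"])]))
    # [('a_0', ['x']), ('a_1', ['a_0', 'y']), ('b_0', ['a_1'])]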


I disagree with it as a blanket statement. It’s related to the problem of hallucinations in LLMs; sometimes they come up with plausible, misleading bullshit. The user quickly relies on the automatic labeling as a crutch, while exhausting mental effort trying to distinguish the bullshit from the accurate labels. The cryptic symbols do not convey meaning but consequently can’t mislead you.

I don’t reject the whole concept and am bullish on AI-assisted decompilation. But the UX needs to help the user have confidence in the results, just like source code-level static analyzers generate proofs.


> If you’ve ever talked to me in person, you’d know that I’m a disbeliever of AI replacing decompilers any time soon

Decompilation, seen as a translation problem, is by all means a job that suits AI methods. Give time to researchers to gather enough mappings between source code and machine code, get used to training large predictive models, and you shall see top notch decompilers that beat all engineered methods.


Yes and no.

My first priority for a decompiler is that the output is (mostly) correct. (I say mostly because there's lots of little niggling behavior you probably want to ignore, like representing a shift instruction as `a << b` over `a << (b & 0x1f)`). When the decompiler's output is incorrect, I can't trust it anymore, and I'm going to go straight back to the disassembly because I need to work with the correct output. And AI--especially LLMs--are notoriously bad at the "correct" part of translation.

If you look at decompilation as a multistep problem, the main steps are a) identify the function/data symbol boundaries, b) lift the functions to IR, c) recover type information (including calling convention for functions), d) recover high-level control flow, and e) recover variable names.

For step b, correctness is so critical that I'm wary of even trusting hand-generated tables for disassembly, since it's way too easy for someone to make a mistake copying something by hand. But on the other hand, this is something that can be machine-generated with something that is provably correct (see, e.g., https://cs.stanford.edu/people/eschkufz/docs/pldi_16.pdf). Sure, there's also a further step for recognizing higher-level patterns like manually-implemented-bswap, but that's basically "implement a peephole optimizer," and the state of the art for compilers these days is to use formally verifiable techniques for doing that.
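
For concreteness, here's the kind of shifts-and-masks idiom such a bswap peephole rule has to recognize (my own toy illustration; it only spot-checks the identity, whereas a verified rule proves it for all inputs):

    # A 32-bit byte swap written out as shifts and masks -- the idiom a peephole
    # rule would want to collapse into a single bswap. This only spot-checks the
    # identity on a few values.
    import struct

    def manual_bswap32(x):
        return ((x << 24) | ((x & 0xFF00) << 8) |
                ((x >> 8) & 0xFF00) | (x >> 24)) & 0xFFFFFFFF

    def reference_bswap32(x):
        return struct.unpack("<I", struct.pack(">I", x))[0]

    for v in (0x12345678, 0x00000000, 0xFFFFFFFF, 0xDEADBEEF):
        assert manual_bswap32(v) == reference_bswap32(v)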

For a lot of the other problems, if you instead categorize them as things where the AI being wrong doesn't make it incorrect, AI can be a valuable tool. For example, control flow structuring can be envisioned as identifying which branches are gotos (including breaks/continues/early returns), since a CFG that has no gotos is pretty trivial to structure. So if your actual AI portion is a heuristic engine for working that out, it's never going to generate wrong code, just unnecessarily complicated code.


> identifying which branches are gotos

Mmh. Yesterday I tried some LLM-augmented "analysis" on a 50-line C source, a function with a few gotos in it. Somehow all the "explanations" were roughly correct, except it completely ignored the gotos. I was using a deepseek-r1-...7b, ollama's default, probably too weak; but I don't believe other models would be 100% correct either.


>And AI--especially LLMs--are notoriously bad at the "correct" part of translation.

Can't you just compare the compiled binaries to see if they are the same? Is the issue that you don't have the full toolchain, so there are different outputs from the two compilers? Thinking about it, though, you could probably figure out which compiler was used from those same differences.


It can take quite a bit of engineering just to get the same source to produce the same results in many C or C++ toolchains. "Reproducible builds" require work; all sorts of trivial things like the length of pathnames can perturb the results. Not to mention having to have the same optimizer flags.

"Do these two binaries always behave the same for the same inputs" is practically an unsolvable problem in general. You can get fairly close with something like AFL (American fuzzy lop, a fuzzer and also a type of rabbit).

(Someone should really make an LLM bot that scans HN for instances of "just" and explains why you can't just do that; it's such a red flag word)


The expected outcome of using a LLM to decompile is a binary that is so wildly different from the original that they cannot even be compared.

If you only make mistakes very rarely and in places that don't cause cascading analysis mistakes, you can recover. But if you keep making mistakes all over the place and vastly misjudge the structure of the program over and over, the entire output is garbage.


That makes sense. So it can work for small functions but not an entire codebase, which is the goal. Does that sound correct? If so, is it useful for small functions (like, let's say I identify some sections of code I think are important because they modify some memory location), or is this not useful?


There are lots of parts of analysis that really matter for readability but aren't used as inputs to other analysis phases and thus mistakes are okay.

Things like function and variable names. Letting an LLM pick them would be perfectly fine, as long as you make sure the names are valid and not duplicates before outputting the final code (see the sketch at the end of this comment).

Or if there are several ways to display some really weird control flow structures, letting an LLM pick which to do would be fine.

Same for deciding what code goes in which files and what the filenames should be.

Letting the LLM comment the code as it comes out would work too; if the comments are misleading you can just ignore or remove them.
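
A minimal sketch of that validity/dedup pass (the identifier rules and the abridged C keyword set are illustrative, not tied to any particular decompiler):

    # Make LLM-suggested names safe to emit: legal identifiers, not keywords,
    # and no duplicates. The keyword set is abridged and purely illustrative.
    import re

    C_KEYWORDS = {"if", "for", "while", "return", "int", "char", "void", "struct"}

    def sanitize_names(suggestions):
        seen, result = set(), {}
        for var, name in suggestions.items():
            name = re.sub(r"\W", "_", name) or "var"
            if name[0].isdigit() or name in C_KEYWORDS:
                name = "_" + name
            candidate, i = name, 1
            while candidate in seen:
                candidate = "{}_{}".format(name, i)
                i += 1
            seen.add(candidate)
            result[var] = candidate
        return result

    print(sanitize_names({"v1": "buffer size", "v2": "buffer_size", "v3": "for"}))
    # {'v1': 'buffer_size', 'v2': 'buffer_size_1', 'v3': '_for'}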


No, but for verifying equivalence you could use some symbolic approach that is provably correct. The LLM could help there by making its output verifiable.
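
For tiny, loop-free fragments the check can be as simple as asking an SMT solver for a counterexample. A minimal sketch with Z3 (the two expressions are made-up stand-ins; real lifted code is far messier):

    # Minimal sketch with the Z3 SMT solver (pip install z3-solver): prove two
    # loop-free 32-bit expressions agree on every input, or get a counterexample.
    from z3 import BitVec, Solver, sat

    x = BitVec("x", 32)
    lifted     = x * 8       # stand-in for an expression lifted from the binary
    decompiled = x << 3      # stand-in for the candidate decompilation

    s = Solver()
    s.add(lifted != decompiled)          # look for any input where they disagree
    if s.check() == sat:
        print("not equivalent, counterexample:", s.model())
    else:
        print("equivalent for every 32-bit input")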


Program equivalence is undecidable, in general, but also in practice (in my experience) most interesting cases quickly escalate to require an unreasonable amount of compute. Personally, I think it is easier to produce correct-by-construction decompilation by applying sequences of known-correct transformations, rather than trying to reconstruct correctness a posteriori. So perhaps the LLM could produce such sequence of transforms rather than outputting the final decompiled program only.


Yes, something like this, the intermediate verified steps wouldn't have to be shown to the user.


You are right on a lot of things, but LLMs are the best bijective lens that humanity has ever discovered. They can invert functions we didn't think were invertible.

If given a mostly correct transform from binary back to code, how would we fix that?

Exactly!

Heuristics are dead.


> Heuristics are dead

I guess it is an example of a heuristic.


Absolutely. LLMs use methods that are not really provably solving the given problem, they just happen to work in most relevant cases. That's a heuristic.


http://www.incompleteideas.net/IncIdeas/BitterLesson.html

> Our methodology leverages state-of-the-art natural language processing techniques to systematically evaluate the evolution of research approaches in computer vision. The results reveal significant trends in the adoption of general-purpose learning algorithms and the utilization of increased computational resources. We discuss the implications of these findings for the future direction of computer vision research and its potential impact on broader artificial intelligence development.

https://arxiv.org/abs/2410.09649v1


I know that these algorithms are the best that exists for many tasks, but any non-deterministic non-provable algorithm is still technically a heuristic. Also, currently, the bitter lesson seems to be that more of the same runs into diminishing returns, contrary to "The Bitter Lesson"(tm). There will probably be a better AI architecture at some point, but "lol just add more GPUs / data / RAM" times are kind of over for now.


You are misinterpreting the bitter lesson, it only gets more bitter with time!

The arxiv paper is well written, worth a read.


I don't think that I misunderstood - it says that more computing power beats "hand-engineering". But more computing power doesn't seem to work that well anymore to improve AI performance - and there is not much more computing power coming in the next couple of years, not orders of magnitude more for free like it used to be. Case in point: the somewhat disappointing new nVidia generation.


> it says that more computing power beats "hand-engineering".

This is the simplistic tl;dr, but that isn't what it says. You should reread the essay and the linked arxiv paper.

It is much more than throwing more compute at a problem.


I agree with many other sentiments here that if it can replace decompilers, then surely it can replace compilers... which feels unlikely soon. So far, I've seen four end-to-end binary-to-code AI approaches, and none have had convincing results. Even those that crawled all of GitHub continue to have issues of making fake code, not understanding math, omitting portions of code, and (a personal irritant for me) being unable to map what address a line of decompilation came from.

However, I also acknowledge that AI can solve many pattern-based problems well. I think considerable value can be extracted from AI by focusing on micro decisions in the decompiler process, like variable types, as recent work has.


I'd feel a lot more comfortable in the prospects of AI if their big boosters weren't so gung-ho about it replacing absolutely everything. Compilers (and by extension decompilers) are one of the areas where we have the ability to have formal proofs of correctness [1]--and the fact that AI people seem to be willing to throw all of that away in favor of their maybe-correct-but-does-it-really-matter-if-it's-not tools is extremely distressing to me.

[1] And one of the big advances in compilers in the past decade or so is the fact that compilers are actually using these in practice!


> I'd feel a lot more comfortable in the prospects of AI if their big boosters weren't so gung-ho about it replacing absolutely everything. Compilers (and by extension decompilers) are one of the areas where we have the ability to have formal proofs of correctness [1]--and the fact that AI people seem to be willing to throw all of that away in favor of their maybe-correct-but-does-it-really-matter-if-it's-not tools is extremely distressing to me.

God yes. These people are infuriating; it's as if they've abandoned the concept of provable correctness, or never understood it in the first place, and replaced it with "looks good enough to me on a few examples". Some sort of gambler's fallacy where if you get a right answer once it doesn't matter how many wrong answers you get.


> Give time to researchers to gather enough mappings between source code and machine code, get used to training large predictive models, and you shall see top notch decompilers that beat all engineered methods.

Decompilation is about dependencies, which makes it a graph problem.

One such problem is boolean satisfiability, and this particular kind of problem is extremely important. It is also very easy to gather mappings between CNF instances and their solutions. Actually, randomization of standard benchmarks is now part of SAT competitions, AFAIK.

Have you seen any advances there using large predictive models?

Proper decompilation is even harder; it is much more like the halting problem than SAT. Imagine that there is a function that gets inlined and therefore specialized. One definitely wants the source for the original function and calls to it, not a listing of all specializations.

This moves us into the space of "inverse guaranteed optimization", and as such it requires approximating the solution of the halting problem.


> Decompilation, seen as a translation problem, is by all means a job that suits AI methods.

Compilation is also a translation problem, but I think many people would be leery of an LLM-based rustc or clang -- perhaps simply because they're more familiar with the complexities involved in compilation than they are with those involved in decompilation.

(Not to say it won't eventually happen in some form.)


LLMs are not deterministic, and I want deterministic builds from source code to assembly. I also do not want the LLM to arbitrarily change the functionality; I have no such guarantees.


Compilers aren't deterministic in the ways that people would think matter.

We will have LLM based compilers in the near future. Determinism is a property of the system, not the components.


"near" like never :)


You want to put money on it?

You set the criteria.


Let me know when an LLM compiler is licensed for use on critical infrastructure. Admittedly Java doesn't meet this bar (see "You acknowledge that Licensed Software is not designed or intended for use in the design, construction, operation or maintenance of any nuclear facility" https://www.oracle.com/technetwork/java/javase/downloads/jdk... )


I wouldn't be at all leery of LLM-based compilers (as long as their output is verified to be correct and reasonable).


Verification is the hard bit! That's why seL4 is so unique.


That's a different kind of verification entirely.


It's pattern matching, plain and simple, an area where AI excels. AI-driven decomp is absolutely on its way.


It's also perfect for RL because it can compile its output and check it against the input. It's a translation exercise where there's already a perfect machine translator in one direction.

It probably just hasn't happened because decompilation is not a particularly useful thing for the vast majority of people.
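
A loose sketch of that compile-and-compare loop as a reward signal (the gcc invocation, -O2, and the raw byte-similarity score are all simplifying assumptions; a real setup would compare normalized or lifted code rather than object-file bytes):

    # Compile the model's candidate C and score it against the target object code.
    import os, subprocess, tempfile

    def compile_to_object(c_source):
        with tempfile.TemporaryDirectory() as tmp:
            src, obj = os.path.join(tmp, "cand.c"), os.path.join(tmp, "cand.o")
            with open(src, "w") as f:
                f.write(c_source)
            subprocess.run(["gcc", "-O2", "-c", src, "-o", obj], check=True)
            with open(obj, "rb") as f:
                return f.read()

    def reward(candidate_source, target_object):
        try:
            produced = compile_to_object(candidate_source)
        except subprocess.CalledProcessError:
            return 0.0                                  # does not even compile
        matching = sum(a == b for a, b in zip(produced, target_object))
        return matching / float(max(len(produced), len(target_object)))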


Maybe in conjunction with a deterministic decompiler.

Precision wrt translation, especially when the translation is not 1-to-1, is not excellent with LLMs.

In fact, their lack of precision is what makes them so good at translating natural languages!


Let me parrot you, it's fun.

"It's pattern matching, plain and simple, an area where pattern matching algorithms excel. Pattern matching driven decomp absolutely leads"

Decompilation is a dependence graph problem; one can formulate decompilation as a graph transformation/rewrite. Neural networks are notoriously bad at graphs.


> Neural networks are notoriously bad at graphs.

AlphaFold is based on graph neural networks. The biggest issue is that we still do not know how to best encode graph problems in ways neural networks can exploit. Current graph neural network techniques exploit certain invariants but cannot distinguish between various similar graphs. And yet, they're still generating meaningful insights.


> AlphaFold is based on graph neural networks.

And what is the size of graphs processed by AlphaFold? How does it compare to the size of program dependence graph?

Here's an evaluation of a boolean satisfiability encoding of the protein folding problem: https://pmc.ncbi.nlm.nih.gov/articles/PMC7197060/

Boolean satisfiability solvers are used in satisfiability modulo theories solvers that, in turn, are used to prove various things about programs.


> Give time to researchers to gather enough mappings between source code and machine code, get used to training large predictive models, and you shall see top notch decompilers that beat all engineered methods.

Not anytime soon. There is more to a decompiler than assembly being converted to X language. File parsers, disassemblers, type reconstruction, etc. are all functionality that has to run before machine code can be converted into even the most basic decompiler output.


Typical obfuscation, sure. VMs, obfuscation, and everything in between are just noise to AI.


“Resurgence” not “resurgance”. I wanted to leave a comment in the article itself but it wants me to sign in with GitHub, which: yuk, so I’m commenting here instead.


Welp, that's a really sad typo... I've made a post-publication edit to the article now, but my shame is immortalized in Git: https://github.com/mahaloz/mahaloz.re/commit/90b760f53ef51b7...


Proof you didn't use AI? :)


Spell checkers existed before AI.


"Please answer this question but misspell about one percent of the words in the answer."


I've had surprisingly good results feeding Ghidra's decomp output to ChatGPT and having it simplify and explain it.

It seems to be quite capable of understanding what the original code would do.

For instance, I was feeding it some game decomp: a function looking for an entity in a 3D array of tiles.

It somehow inferred it was an array of tiles and that it was hunting for a specific entity.

None of the decomp I fed it had any variable/function names or comments, just the usual var1, var2, etc.

How did it know what the underlying code was doing?


I remember working on DCC, a decompiler for C created by Cristina Cifuentes in 1990. It felt like magic and the future, but it was incredibly difficult and interesting. I used it for decompiling firmware and it was hard to convince my boss that we needed it.


Decompilers aren’t just for security research; they’re a key part of data compression for software updates. Delta compressors make deltas between decompiled code. So an improvement in the mapping of decompiled files could yield as much as a 20x improvement in software update size.


I thought delta compressors operated on the binary directly. Can you provide an example of one which does actual decompilation/recompilation?


Pretty much only Microsoft's MSDelta does it. Google's Courgette only does a disassembly.


Don’t delta updates usually use some binary diffing algorithm? Even if they didn't, I can understand why they would make deltas using decompiled code and not disassembled code.


I love this use case! Do you have any public links acknowledging/mentioning/showing this use case? Including it in the Applications portion of the Dec Wiki would be great.


As a total beginner in this field, how can I begin experimenting with this?


The article covers several themes in decompilation, but for academic work in decompilation, just take some papers, study them, study their references, and try to reproduce the experiments, I guess. For the bare basics, you can get a disassembly of a random binary on Linux with `objdump -S`.



