Most examples I tried didn't work very well, but when they did work it was truly neat. The performance makes sense from a quick glance at the paper. The model represents programs as paths in the AST, which is not sufficient to reconstruct the semantics, but is a good "fingerprint" of a program for fuzzy retrieval tasks. That's the domain the authors wanted to target.
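For anyone curious what "paths in the AST" means concretely, here's a toy sketch using Python's stdlib `ast` module (with root-to-leaf paths rather than the leaf-to-leaf paths the paper actually uses): identifiers live in node attributes, not in the path structure, so renaming variables doesn't change the fingerprint.

```python
import ast

def node_paths(tree):
    """Collect root-to-leaf paths of AST node-type names.
    (code2vec uses leaf-to-leaf paths in Java ASTs; this is a
    simplified stand-in to show the idea.)"""
    paths = []
    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            paths.append(tuple(prefix))
        for child in children:
            walk(child, prefix)
    walk(tree, [])
    return paths

a = node_paths(ast.parse("def f(n):\n    return n % 2 == 0"))
b = node_paths(ast.parse("def g(m):\n    return m % 2 == 0"))
# Identical structure: variable names are attributes, not path elements.
assert a == b
```

That's also exactly why it's only a fingerprint: two structurally identical but semantically different programs can share paths.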
I wonder if there is really so much low hanging fruit still lying around, or if everybody who tried injecting some more domain knowledge into tools like this had quietly failed.
For example, the obvious way of building a distributed representation of, e.g., the simply-typed lambda-calculus (STLC) is by building a model. There are four local constraints that the model has to satisfy and the payoff is a representation that is invariant under program equivalence.
There are some complexity theoretic reasons why this cannot really work all the time (conversion in STLC is nonelementary), but even something that works in simple cases would be more robust than a statistical fingerprint that gets confused by the names of local variables...
That was a poor choice of words. Models of lambda calculus are invariant under beta-eta conversion, which is what I meant by program equivalence, but which is not the same thing as contextual equivalence.
Thus you get a representation invariant under computation. This remains decidable when you consider only normalizing programs as in STLC or related subsystems.
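To make "invariant under computation" concrete, here's a toy normalizer for de Bruijn-indexed lambda terms (my own sketch, assuming the input is strongly normalizing, as in STLC — on arbitrary untyped terms it can loop forever): beta-convertible terms get the same representation because both reduce to the same normal form.

```python
# Terms: ('var', i) with de Bruijn index i, ('lam', body), ('app', f, a)

def shift(t, d, c=0):
    """Shift free variables (index >= cutoff c) by d."""
    tag = t[0]
    if tag == 'var':
        return ('var', t[1] + d) if t[1] >= c else t
    if tag == 'lam':
        return ('lam', shift(t[1], d, c + 1))
    return ('app', shift(t[1], d, c), shift(t[2], d, c))

def subst(t, s, j=0):
    """Substitute s for variable j in t, adjusting indices."""
    tag = t[0]
    if tag == 'var':
        k = t[1]
        if k == j:
            return shift(s, j)            # s moved under j binders
        return ('var', k - 1) if k > j else t
    if tag == 'lam':
        return ('lam', subst(t[1], s, j + 1))
    return ('app', subst(t[1], s, j), subst(t[2], s, j))

def normalize(t):
    """Full beta-normalization (terminates on normalizing terms only)."""
    tag = t[0]
    if tag == 'var':
        return t
    if tag == 'lam':
        return ('lam', normalize(t[1]))
    f, a = normalize(t[1]), normalize(t[2])
    if f[0] == 'lam':
        return normalize(subst(f[1], a))
    return ('app', f, a)

I = ('lam', ('var', 0))            # λx. x
K = ('lam', ('lam', ('var', 1)))   # λx. λy. x
assert normalize(('app', ('app', K, I), I)) == I   # K I I reduces to I
```

The nonelementary blow-up mentioned above shows up here as `normalize` taking absurdly long on adversarial terms, even though it always terminates on well-typed STLC input.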
Herbrand equivalence is the best you can do (in general) if you are trying to say whether two variables have the same values at the same program points.
If you are willing to be probabilistically correct you can do better, but you will get wrong answers (and not know they are wrong).
That is likely okay for this application.
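A minimal illustration of the Herbrand view (a hypothetical value-numbering sketch over straight-line code, not any particular tool): operators are treated as uninterpreted symbols, so syntactically identical computations get the same value number, but algebraic facts like commutativity are invisible.

```python
def value_numbers(stmts):
    """Assign a value number to each destination variable.
    stmts: list of (dest, (op, arg, arg)) over string variable names.
    Operators are uninterpreted (Herbrand interpretation)."""
    table, env = {}, {}
    def vn(key):
        if key not in table:
            table[key] = len(table)   # fresh number for a new expression
        return table[key]
    for dest, (op, a, b) in stmts:
        env[dest] = vn((op,
                        env.get(a, vn(('in', a))),
                        env.get(b, vn(('in', b)))))
    return env

env = value_numbers([
    ('x', ('+', 'a', 'b')),
    ('y', ('+', 'a', 'b')),   # same Herbrand value as x
    ('z', ('+', 'b', 'a')),   # NOT detected: '+' is uninterpreted
])
assert env['x'] == env['y'] and env['x'] != env['z']
```

A probabilistic scheme (e.g. evaluating both expressions on random inputs) would happily merge `x` and `z` too, with the failure mode described above: occasionally wrong, silently.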
Nice! This could be a useful base for a tool to make code more DRY or higher quality.
One way would be an IDE extension that suggests a reference implementation of an algorithm if it finds code in your code base that resembles it with a high prediction score.
Or if it sees code duplication it could suggest a refactoring that factors out the common function.
I do wonder if some sort of AST normalization would improve the input signal. Example 8 on the website shows their system correctly identifying an isPrime function. However, some irrelevant perturbations can break it: if you swap the if-statement condition around from `n % i == 0` to `0 == n % i`, the proposed names are totally different and make no sense.
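A toy version of such a normalization pass (my own sketch using Python's `ast` module, not anything code2vec does): canonicalize symmetric comparisons so a constant always ends up on the right-hand side.

```python
import ast

class CanonicalizeCompare(ast.NodeTransformer):
    """Rewrite `0 == expr` as `expr == 0` (and likewise for !=),
    so both spellings produce the same AST before feature extraction."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        if (len(node.ops) == 1
                and isinstance(node.ops[0], (ast.Eq, ast.NotEq))
                and isinstance(node.left, ast.Constant)
                and not isinstance(node.comparators[0], ast.Constant)):
            node.left, node.comparators[0] = node.comparators[0], node.left
        return node

tree = ast.parse("0 == n % i")
tree = ast.fix_missing_locations(CanonicalizeCompare().visit(tree))
print(ast.unparse(tree))  # n % i == 0
```

A real normalizer would also want to handle commutative arithmetic, De Morgan'd conditions, loop direction, and so on, which is where it stops being obvious.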
If the task you want to solve is automatic function naming I definitely think normalizing would be an improvement. But I'm not sure it would be the right thing for all applications. Don't have any examples though.
Maybe, but in the past, we've detected plagiarism in a different way: just look at the assembly output of the compiled code. The compiler is a good normalizer that removes artificial differences like naming of variables and functions.
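The same trick works a level above assembly too: in CPython, compiled bytecode already erases local variable names, since locals are addressed by index. A small sketch (this catches renaming only, not real restructuring):

```python
def f(n):
    total = 0
    for i in range(n):
        total += i
    return total

def g(m):            # same logic, every local renamed
    acc = 0
    for j in range(m):
        acc += j
    return acc

# Locals compile to indices into co_varnames, so the raw bytecode
# of the two functions is byte-for-byte identical:
assert f.__code__.co_code == g.__code__.co_code
```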
Very cool idea, but other than the examples, I didn't have much luck. It wasn't able to guess a basic Fibonacci or FizzBuzz, and for isEven, it returned isOdd, which I guess is close, but still.
We just took the top-starred Java projects from GitHub, assuming that their source code and naming are of good quality.
These included projects like elasticsearch, hadoop-common and maven.
We assumed that their code and naming quality are worth learning, such that the learned patterns will transfer to other code and projects, because they are popular and actively maintained projects. So, as you can imagine, FizzBuzz is not that common there :)
You could probably dig up a pretty large number of high quality teaching repos to supplement those projects. I'm thinking of the types of repos that are included as supplements for MOOCs or reference books. Then you'll get some of the canonical gimmicky algorithms like fizzbuzz. But more importantly, you will get reference implementations of fundamental algorithms. Things like mergesort or binary tree search. While you are unlikely to see a straightforward implementation of those algorithms in a production repo, it wouldn't surprise me to see some of the core abstract patterns repeated over and over again because they are so fundamental to CS pedagogy. And if you pick sufficiently high quality repos, you (hopefully) won't be compromising the code quality metrics driving your selection of those top starred repos.
While I understand that, how can this work be used to, say, analyze novice/student coding behaviors? Furthermore, does selecting the GitHub repos bias code2vec toward specific design patterns?
To play devil's advocate, could I just run code2vec across different repos to see if they are "real enough"?
I like the library and just printed out the paper to read. Since my focus is more on learning to program, this sounds like a tool I'd love to be able to use for analysis, but I'm being told no.
Wouldn't that be code that someone uses to build applications that other people want to use? Nobody is sitting around going, "darn, we need a new open source FizzBuzz application and we need it today!"
This is a really cool demo. For common patterns like get, contains, ends with, etc, this could be helpful in code representation without having to go into the details.