Most examples I tried didn't work very well, but when they did work it was truly neat. The performance makes sense from a quick glance at the paper. The model represents programs as paths in the AST, which is not sufficient to reconstruct the semantics, but is a good "fingerprint" of a program for fuzzy retrieval tasks. That's the domain the authors wanted to target.
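For anyone curious what "paths in the AST" means concretely, here's a toy sketch using Python's stdlib `ast` module (with root-to-leaf paths rather than the leaf-to-leaf paths the paper actually uses): identifiers live in node attributes, not in the path structure, so renaming variables doesn't change the fingerprint.

```python
import ast

def node_paths(tree):
    """Collect root-to-leaf paths of AST node-type names.
    (code2vec uses leaf-to-leaf paths in Java ASTs; this is a
    simplified stand-in to show the idea.)"""
    paths = []
    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            paths.append(tuple(prefix))
        for child in children:
            walk(child, prefix)
    walk(tree, [])
    return paths

a = node_paths(ast.parse("def f(n):\n    return n % 2 == 0"))
b = node_paths(ast.parse("def g(m):\n    return m % 2 == 0"))
# Identical structure: variable names are attributes, not path elements.
assert a == b
```

That's also exactly why it's only a fingerprint: two structurally identical but semantically different programs can share paths.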
I wonder if there is really so much low hanging fruit still lying around, or if everybody who tried injecting some more domain knowledge into tools like this had quietly failed.
For example, the obvious way of building a distributed representation of, e.g., the simply-typed lambda-calculus (STLC) is by building a model. There are four local constraints that the model has to satisfy and the payoff is a representation that is invariant under program equivalence.
There are some complexity theoretic reasons why this cannot really work all the time (conversion in STLC is nonelementary), but even something that works in simple cases would be more robust than a statistical fingerprint that gets confused by the names of local variables...
That was a poor choice of words. Models of lambda calculus are invariant under beta-eta conversion, which is what I meant by program equivalence, but which is not the same thing as contextual equivalence.
Thus you get a representation invariant under computation. This remains decidable when you consider only normalizing programs as in STLC or related subsystems.
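To make "invariant under computation" concrete, here's a toy normalizer for de Bruijn-indexed lambda terms (my own sketch, assuming the input is strongly normalizing, as in STLC — on arbitrary untyped terms it can loop forever): beta-convertible terms get the same representation because both reduce to the same normal form.

```python
# Terms: ('var', i) with de Bruijn index i, ('lam', body), ('app', f, a)

def shift(t, d, c=0):
    """Shift free variables (index >= cutoff c) by d."""
    tag = t[0]
    if tag == 'var':
        return ('var', t[1] + d) if t[1] >= c else t
    if tag == 'lam':
        return ('lam', shift(t[1], d, c + 1))
    return ('app', shift(t[1], d, c), shift(t[2], d, c))

def subst(t, s, j=0):
    """Substitute s for variable j in t, adjusting indices."""
    tag = t[0]
    if tag == 'var':
        k = t[1]
        if k == j:
            return shift(s, j)            # s moved under j binders
        return ('var', k - 1) if k > j else t
    if tag == 'lam':
        return ('lam', subst(t[1], s, j + 1))
    return ('app', subst(t[1], s, j), subst(t[2], s, j))

def normalize(t):
    """Full beta-normalization (terminates on normalizing terms only)."""
    tag = t[0]
    if tag == 'var':
        return t
    if tag == 'lam':
        return ('lam', normalize(t[1]))
    f, a = normalize(t[1]), normalize(t[2])
    if f[0] == 'lam':
        return normalize(subst(f[1], a))
    return ('app', f, a)

I = ('lam', ('var', 0))            # λx. x
K = ('lam', ('lam', ('var', 1)))   # λx. λy. x
assert normalize(('app', ('app', K, I), I)) == I   # K I I reduces to I
```

The nonelementary blow-up mentioned above shows up here as `normalize` taking absurdly long on adversarial terms, even though it always terminates on well-typed STLC input.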
Herbrand equivalence is the best you can do (in general) if you are trying to say whether two variables have the same values at the same program points.
If you are willing to be probabilistically correct you can do better, but you will get wrong answers (and not know they are wrong).
That is likely okay for this application.
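A minimal illustration of the Herbrand view (a hypothetical value-numbering sketch over straight-line code, not any particular tool): operators are treated as uninterpreted symbols, so syntactically identical computations get the same value number, but algebraic facts like commutativity are invisible.

```python
def value_numbers(stmts):
    """Assign a value number to each destination variable.
    stmts: list of (dest, (op, arg, arg)) over string variable names.
    Operators are uninterpreted (Herbrand interpretation)."""
    table, env = {}, {}
    def vn(key):
        if key not in table:
            table[key] = len(table)   # fresh number for a new expression
        return table[key]
    for dest, (op, a, b) in stmts:
        env[dest] = vn((op,
                        env.get(a, vn(('in', a))),
                        env.get(b, vn(('in', b)))))
    return env

env = value_numbers([
    ('x', ('+', 'a', 'b')),
    ('y', ('+', 'a', 'b')),   # same Herbrand value as x
    ('z', ('+', 'b', 'a')),   # NOT detected: '+' is uninterpreted
])
assert env['x'] == env['y'] and env['x'] != env['z']
```

A probabilistic scheme (e.g. evaluating both expressions on random inputs) would happily merge `x` and `z` too, with the failure mode described above: occasionally wrong, silently.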
Nice! This could be a useful base for a tool to make code more DRY or higher quality.
One way would be an IDE extension that suggests a reference implementation of an algorithm if it finds code in your code base that resembles it with a high prediction score.
Or if it sees code duplication it could suggest a refactoring that factors out the common function.
I do wonder if some sort of AST normalization would improve the input signal. Example 8 on the website shows their system correctly identifying an isPrime function. However, some irrelevant perturbations can break it: if you swap the if-statement condition around from `n % i == 0` to `0 == n % i`, the proposed names are totally different and make no sense.
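A toy version of such a normalization pass (my own sketch using Python's `ast` module, not anything code2vec does): canonicalize symmetric comparisons so a constant always ends up on the right-hand side.

```python
import ast

class CanonicalizeCompare(ast.NodeTransformer):
    """Rewrite `0 == expr` as `expr == 0` (and likewise for !=),
    so both spellings produce the same AST before feature extraction."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        if (len(node.ops) == 1
                and isinstance(node.ops[0], (ast.Eq, ast.NotEq))
                and isinstance(node.left, ast.Constant)
                and not isinstance(node.comparators[0], ast.Constant)):
            node.left, node.comparators[0] = node.comparators[0], node.left
        return node

tree = ast.parse("0 == n % i")
tree = ast.fix_missing_locations(CanonicalizeCompare().visit(tree))
print(ast.unparse(tree))  # n % i == 0
```

A real normalizer would also want to handle commutative arithmetic, De Morgan'd conditions, loop direction, and so on, which is where it stops being obvious.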
If the task you want to solve is automatic function naming I definitely think normalizing would be an improvement. But I'm not sure it would be the right thing for all applications. Don't have any examples though.
Maybe, but in the past, we've detected plagiarism in a different way: just look at the assembly output of the compiled code. The compiler is a good normalizer that removes artificial differences like naming of variables and functions.
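The same trick works a level above assembly too: in CPython, compiled bytecode already erases local variable names, since locals are addressed by index. A small sketch (this catches renaming only, not real restructuring):

```python
def f(n):
    total = 0
    for i in range(n):
        total += i
    return total

def g(m):            # same logic, every local renamed
    acc = 0
    for j in range(m):
        acc += j
    return acc

# Locals compile to indices into co_varnames, so the raw bytecode
# of the two functions is byte-for-byte identical:
assert f.__code__.co_code == g.__code__.co_code
```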
Very cool idea, but other than the examples, I didn't have much luck. It wasn't able to guess a basic Fibonacci or FizzBuzz, and for isEven, it returned isOdd, which I guess is close, but still.
We just took the top-starred Java projects from GitHub, assuming that their source code and naming are of good quality.
These included projects like elasticsearch, hadoop-common and maven.
We assumed that their code and naming quality are worth learning, such that the learned patterns will transfer to other code and projects, because they are popular and actively maintained projects. So, as you can imagine, FizzBuzz is not that common there :)
You could probably dig up a pretty large number of high quality teaching repos to supplement those projects. I'm thinking of the types of repos that are included as supplements for MOOCs or reference books. Then you'll get some of the canonical gimmicky algorithms like fizzbuzz. But more importantly, you will get reference implementations of fundamental algorithms. Things like mergesort or binary tree search. While you are unlikely to see a straightforward implementation of those algorithms in a production repo, it wouldn't surprise me to see some of the core abstract patterns repeated over and over again because they are so fundamental to CS pedagogy. And if you pick sufficiently high quality repos, you (hopefully) won't be compromising the code quality metrics driving your selection of those top starred repos.
While I understand that, how can this work be used to, say, analyze novice/student coding behaviors? Furthermore, does selecting the GitHub repos bias code2vec toward specific design patterns?
To play devil's advocate, could I just run code2vec across different repos to see if they are "real enough"?
I like the library and just printed out the paper to read. Since my focus is more on learning to program, this sounds like a tool I'd love to be able to use for analysis, but I'm being told no.
Wouldn't that be code that someone uses to build applications that other people want to use? Nobody is sitting around going, "darn, we need a new open source FizzBuzz application and we need it today!"
This is a really cool demo. For common patterns like get, contains, ends with, etc, this could be helpful in code representation without having to go into the details.