Hacker News
Building a Language and Compiler for Machine Learning (julialang.org)
161 points by one-more-minute on Dec 4, 2018 | hide | past | favorite | 24 comments


As someone who works on merging differential equations and machine learning, I have found this kind of work essential for what I do. Pervasive AD that allows merging neural networks and diffeq solvers is letting us explore all kinds of new models and new problems. Sure, it doesn't impact vanilla machine learning all that much (though Zygote.jl does allow for a lot of optimizations that wouldn't be possible with tracing-based AD), but it definitely opens up a new wave of AI possibilities.


I thought I was having some kind of stroke or terrible deja-vu (didn't I read this comment earlier this morning?) until I realized you copy and pasted your comment from 20 hours ago on https://news.ycombinator.com/item?id=18594103


Yeah, the other thread caught on so I copy/pasted this post to the thread people were using. But now this thread caught on too... something weird happened but :shrug:.


There are a few things about Flux that bug me. It automatically assumes that I want to optimize matrix multiplications by parallelizing them across cores, which has Amdahl scaling, instead of parallelizing across samples in the batch, which has Gustafson scaling. It would probably help if batches and minibatches (or something like that) were datatypes, which they are not. Doing something like this would probably also help with distributing computation down the line.

I'm also not entirely sure what is going on under the hood with Tracker types, and the documentation is not that great, which became a problem when I was trying to chase down errors in something really custom I was doing.

I much prefer Knet's way of autodifferentiating, which is more intuitive to me, but Knet's layering doesn't feel as nice as Flux's.

I really wish GPU computation in Julia had different semantics: make the GPU a 'virtual compute node', accessible through the Distributed module with the same semantics as a totally separate node. That would really make async distributed batch processing a thing; the system could profile all the nodes in use and, if we really want to get fancy, use something like JuMP to make the best use of the processing power available to it.


Two of your issues are currently being worked on. The autodiff tracker stuff is temporary until the lower-overhead, almost invisible compiler-based AD mentioned in the blog post is fully ready. No custom types needed.

There are also various autobatching packages being developed.

Regarding the GPU semantics, wouldn't that be solved by simply using a distributed array of GPU arrays?

Since Flux is lightweight, generic, modular, and pure Julia, these things can be developed in third-party packages.


Can you point at the autobatching packages? What strategy are they taking? Will they recognize opportunities to combine compatible operations within a given function body into a batch? Does one need a cost model for merging and splitting data into batches?

Also, what does an approach like bucketing even look like for the approach that Julia is taking? The idea there of course is to have 'slop': to combine many similar examples whose tensor sizes differ by small amounts, and to carefully define all your primitive operations such that they can ignore the padding used to combine similar tensors into a uniform shape. Doing this requires awareness of the tensor sizes all the way back to the way you sample the training data, so I don't see how compiler magic can achieve the same performance as you get from bucketing.

Of course, bucketing becomes more complex for things like trees and graphs and other higher-level objects. And bucketing can, theoretically, introduce bias into your gradients, if there is any correlation between the gradient of an example and its tensor shape.
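To make the 'slop' idea concrete, here is a minimal Python sketch (the `bucket_and_pad` helper and the choice of rounding to a multiple of `slop` are hypothetical, not from any framework): each sequence's length is rounded up to a multiple of `slop`, and all sequences sharing that rounded width are padded into one bucket.

```python
from collections import defaultdict

def bucket_and_pad(seqs, slop=4, pad=0):
    """Group variable-length sequences into buckets whose width is the
    length rounded up to a multiple of `slop`, then pad each sequence
    to its bucket's uniform width."""
    buckets = defaultdict(list)
    for s in seqs:
        width = ((len(s) + slop - 1) // slop) * slop
        buckets[width].append(list(s) + [pad] * (width - len(s)))
    return dict(buckets)

batches = bucket_and_pad([[1], [1, 2, 3], [1, 2, 3, 4, 5], [7, 8]])
# lengths 1, 3, and 2 share the width-4 bucket; length 5 lands in the width-8 bucket
```

Each bucket can then be stacked into a uniform tensor, which is exactly where the primitive operations must be defined to ignore the `pad` entries.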


From the blogpost

"Automatic Batching

To get the most from these accelerators – which can have significant overheads per kernel launch, but scale very well over input size – it is common to batch programs, applying the forwards and backwards passes to multiple training examples at once. In simple cases, such as with convolutional nets, it’s simple to handle this by concatenating, say, 10 images along an extra batch dimension. But this task becomes much harder when dealing with variably-structured inputs, such as trees or graphs.

Most researchers address this by taking on the significant burden of batching code by hand. Different solutions have been proposed for different frameworks (DyNet, TensorFlow Fold), which heuristically try to batch some high level operations together when possible, but these typically either have their own usability issues or do not achieve the performance of hand-written code.

We suggest that this problem is identical to that of Single Program Multiple Data (SPMD) programming, which has been well-studied by the language and compiler community for decades, and becomes visible in more recent approaches to batching like matchbox. Indeed, it is very similar to the model of parallelism used by GPUs internally, and has been implemented as a compiler transform for the SIMD units of CPUs. Taking inspiration from this work, we are implementing the same transform in Julia to provide SPMD programming both for scalar SIMD units and for model-level batching. This allows us to reach the ideal of writing simple code that operates on individual samples, while still getting the best performance on modern hardware."


> wouldn't that be solved by simply using a distributed array of GPU arrays?

No, I don't want to necessarily have distributed GPUs, I want to treat a GPU as a distributed compute node. As in "the GPU is a remote machine that I can send julia code to" (this is how julia normally treats running on clusters, or even on multiple threads).


Can you elaborate on what you mean by "parallelizing those against cores" vs "parallelizing across samples in a batch"?


so for example, if you want to do

    M = [1 0 0 0
         0 1 0 0
         0 0 1 0
         0 0 0 1]
    v1 = [1, 2, 3, 4]  # column vector
    v2 = [4, 5, 6, 7]  # column vector
you can either do (M * v1, M * v2) on two cores as

    r1a = [1 0 0 0
           0 1 0 0] * v1  # core 1
    r1b = [0 0 1 0
           0 0 0 1] * v1  # core 2
    r2a = [1 0 0 0
           0 1 0 0] * v2  # core 1
    r2b = [0 0 1 0
           0 0 0 1] * v2  # core 2
then

    r1 = vcat(r1a, r1b)
    r2 = vcat(r2a, r2b)
OR, you could just have the whole model in each core:

    r1 = M * v1  # core 1
    r2 = M * v2  # core 2


Thanks. I'm more familiar with this being called model parallelism (exploiting parallelism in M) vs data parallelism (exploiting parallelism in v).
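Both splits compute the same thing, which a plain Python sketch (stdlib only, purely illustrative) can check: split M's rows across "cores" (model parallelism) or give each "core" the whole M and one sample (data parallelism).

```python
def matvec(M, v):
    """Dense matrix-vector product over nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

M = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
v1, v2 = [1, 2, 3, 4], [4, 5, 6, 7]

# Model parallelism: each "core" owns half of M's rows; concatenate the pieces.
top, bottom = M[:2], M[2:]
r1_model = matvec(top, v1) + matvec(bottom, v1)
r2_model = matvec(top, v2) + matvec(bottom, v2)

# Data parallelism: each "core" owns all of M and one sample.
r1_data = matvec(M, v1)
r2_data = matvec(M, v2)
```

Since M here is the identity, both layouts just return the input vectors; the difference between the two is purely in how work and synchronization are distributed.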


Thank you for the appropriate terminology. I'm not a professional in the field, so I learned something useful!!


how do you use AD for diffeqs? do you really have systems of diffeqs large enough that you need AD for evaluating a solution at a particular point? or do you need backprop for something that i can't imagine?


For parameter estimation. In parameter estimation, you have to evaluate the derivative of a cost function based on an ODE solution, which is usually something like the L2 norm between your numerical solution points and your data. The gradient of this cost function requires calculating the gradient of the solution with respect to the parameters, which can be done quite well via AD. We are finding that AD methods work better than the traditional sensitivity analysis methods in many cases.
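A toy, stdlib-only Python sketch of the idea (hand-rolled dual numbers pushed through an explicit Euler solver for dy/dt = -p*y): this illustrates forward-mode sensitivities only, not the actual machinery in the Julia packages.

```python
class Dual:
    """Minimal forward-mode AD number: a value plus its derivative w.r.t. one parameter."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der
    def _coerce(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._coerce(o)
        return Dual(self.val + o.val, self.der + o.der)
    __radd__ = __add__
    def __sub__(self, o):
        o = self._coerce(o)
        return Dual(self.val - o.val, self.der - o.der)
    def __mul__(self, o):
        o = self._coerce(o)
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)
    __rmul__ = __mul__

def solve(p, y0=1.0, dt=0.01, steps=100):
    """Explicit Euler for dy/dt = -p*y, recording the trajectory."""
    y, traj = Dual(y0), []
    for _ in range(steps):
        y = y - dt * (p * y)
        traj.append(y)
    return traj

def loss(p, data):
    """L2 cost between the numerical solution and the data points."""
    total = Dual(0.0)
    for y, d in zip(solve(p), data):
        r = y - d
        total = total + r * r
    return total

# Synthetic "data" generated from the true parameter p = 2.0.
data = [y.val for y in solve(Dual(2.0))]

# The gradient of the cost w.r.t. p is carried by AD through the whole solver.
grad_at_1_5 = loss(Dual(1.5, 1.0), data).der  # negative: pushes p up toward 2.0
grad_at_2_0 = loss(Dual(2.0, 1.0), data).der  # zero at the true parameter
```

A gradient descent loop on that derivative recovers p; the real libraries do the same thing with far better solvers and far lower AD overhead.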


>parameter estimation

you mean given some data that's modeled by an ODE you want to fit the ODE to the data (and therefore discover parameters of the ODE that would have produced that data) ?


Yes exactly.



I’m finding the ML work being done in Julia very refreshing. It feels like they are building things right from the ground up and the community is great to work with.


> [...] bake for fifteen minutes and out pops a fully-featured ML stack

where is logging, where is model storage and versioning, where is input data processing and normalizing, where is results processing?


lowbrow comment.

the hard part of ML stacks is AD and GPU support, not all of those other things (i'm sure there has been zero cutting-edge research done on better ways to log).


Yes, and unlike AD and GPU support, things like logging have nothing (special) to do with ML. Julia has both very nice logging and plenty of good serialisation options, all of which work nicely with the ML stack. It's entirely unnecessary to duplicate these tools just so they can be baked into a huge framework.


Well, I think there is something to be said there, though. The reason the Julia stack is nice is that Julia's standard logging tools can be used for logging in ML codes. Even other things like Julia's standard progress bars just work on ML codes. That's quite a surprising result. Tools which build a sub-language for graph building, like TensorFlow, have to build and document such tooling themselves. So newcomers to Julia will search the package documentation and package codes and find nothing. It is a confusing problem because the functionality exists, but no one thought to document its usage for this context, since it is just the standard Julia usage!


I agree that it shouldn't be called a fully featured ML stack. Still, the most important feature (optimized compilation of models) is handled quite well. Whenever I have to look at Tensorflow source code to understand how it works, I see an over-complicated system that is too far from the research papers (which makes it hard to work with it).


Hopefully one day Julia won't need a patched LLVM. That would also improve packaging in various distributions.



