MT-DNN Achieves Human Performance in General Language Understanding Benchmark

SmooL · on July 8, 2019

It seems that, since this was submitted, they are no longer the leader on the GLUE leaderboard: https://gluebenchmark.com/leaderboard/

yorwba · on July 8, 2019

In fact, the article was published on June 20, while the XLNet submission that dethroned them had already been made on June 19. I guess their publishing pipeline doesn't allow last-minute amendments.

dweekly · on July 8, 2019

The relentlessness of the pace of ML remains breathtaking.

Microsoft: We beat human performance and lapped everyone else!

XLNet: Hold my beer.

Code: https://github.com/zihangdai/xlnet

Description: https://towardsdatascience.com/what-is-xlnet-and-why-it-outp...

Paper: https://arxiv.org/pdf/1906.08237.pdf

m0zg · on July 8, 2019

Notably, this result would be very hard for Microsoft to achieve, or indeed even reproduce fully in-house, because it requires more memory than GPUs have. GitHub mentions that TPUs are pretty much table stakes to train this.

bitforger · on July 8, 2019

The memory constraint is certainly an issue, but I think it can be overcome with some software hacks (for a speed penalty, of course). For example, gradient checkpointing and gradient accumulation might help:

https://medium.com/tensorflow/fitting-larger-networks-into-m...

https://medium.com/huggingface/training-larger-batches-pract...
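To make the gradient accumulation idea concrete, here is a minimal sketch on a toy one-parameter linear model. It is not XLNet's training code; the data and the helper names `grad_mse` and `accumulated_grad` are made up for illustration. The point is only that summing gradients over small chunks, one chunk in memory at a time, gives the same gradient as one large batch. Gradient checkpointing is a separate trick and is not shown.

```python
# Gradient accumulation on a toy model y = w * x with mean squared error.
# All names and data here are hypothetical, chosen only to illustrate the idea.

def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    """Walk the data in small chunks and accumulate their gradients."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        chunk = data[start:start + micro_batch_size]
        # Weight each chunk by its size so the result is the full-batch mean.
        total += grad_mse(w, chunk) * len(chunk)
    return total / len(data)

data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
w = 0.5

full = grad_mse(w, data)                               # whole batch at once
accum = accumulated_grad(w, data, micro_batch_size=1)  # one sample at a time

print(abs(full - accum) < 1e-9)
```

The speed penalty shows up as the extra sequential passes: with a micro-batch of 1, a batch of 8 takes eight forward/backward passes before a single weight update.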

riku_iki · on July 8, 2019

> because it requires more memory than GPUs have

Could you elaborate? A TPU v3 unit has 16GB of memory, and the older V100 also has 16GB of memory. Plus, the TPU has extra memory overhead from mandatory tensor padding, which the GPU doesn't have.

m0zg · on July 9, 2019

The XLNet GitHub elaborates pretty well, no need to duplicate it here: https://github.com/zihangdai/xlnet

riku_iki · on July 9, 2019

There is no elaboration of your statement there. It says it is hard to reproduce the results on a single 16GB GPU, but they did not use a single TPU for their results either. They explicitly state that a large number of GPUs is required: "Therefore, a large number (ranging from 32 to 128, equal to batch_size) of GPUs are required to reproduce many results in the paper.", and they used around 200 TPUs themselves according to their publication.

m0zg · on July 9, 2019

Yes, it boils down to using either a single machine with multiple TPUs (and therefore doing things "the easy way" and relatively quickly) or having to use 128 GPUs (up to 8 per machine) and working with a single sample per GPU, really, really slowly. Given that a single model often requires dozens, if not hundreds of training runs to figure out hyperparameters that achieve a SOTA result, this means that with TPU you can do this, and with GPU you aren't even going to bother training something this big because it will take forever, and someone with a TPU will figure out something better by the time you're done. Which is what happened in this case, it looks like.
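For a rough sense of scale, here is a back-of-the-envelope sketch of the GPU side, assuming one sample per GPU and up to 8 GPUs per machine as described above. The function names are made up for illustration, and the batch sizes are the 32 to 128 range quoted from the README.

```python
import math

def gpus_needed(batch_size, samples_per_gpu=1):
    """GPUs required when each GPU can only hold samples_per_gpu samples."""
    return math.ceil(batch_size / samples_per_gpu)

def machines_needed(num_gpus, gpus_per_machine=8):
    """Machines required at a given GPU density (8 per machine assumed)."""
    return math.ceil(num_gpus / gpus_per_machine)

for batch_size in (32, 128):
    gpus = gpus_needed(batch_size)
    print(batch_size, gpus, machines_needed(gpus))
```

Under those assumptions, the largest batch size means coordinating 16 machines for every training run, which is where the hyperparameter search gets painful.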

mda · on July 8, 2019

I guess they can use TPUs with GCE.

daenz · on July 8, 2019

One of their test sentences was interesting:

>For example, the task provides the sentence: “The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.” If the word “feared” is selected, then “they” refers to the city council. If “advocated” is selected, then “they” presumably refers to the demonstrators.

aisofteng · on July 8, 2019

This sort of sentence is exactly the point of the article. See: https://en.m.wikipedia.org/wiki/Winograd_Schema_Challenge

p1esk · on July 8, 2019

Seems like this is largely due to improved Winograd Schema results. I wonder if those questions made it into the training set in some form.

TheIronYuppie · on July 8, 2019

Disclosure: I work at Azure on Machine Learning

Hi all! Please let me know if you have any questions - happy to direct them to the right people! Thanks!

(Email in my profile)