In fact, the article was published on June 20, while the XLNet submission that dethroned them had already been posted on June 19. I guess their publishing pipeline doesn't allow last-minute amendments.
Notably, this result would be very hard for Microsoft to achieve, or indeed even reproduce fully in-house, because it requires more memory than GPUs have. The GitHub repo mentions that TPUs are pretty much table stakes to train this.
The memory constraint is certainly an issue, but I think it can be overcome with some software hacks (at a speed penalty, of course). For example, gradient checkpointing and gradient accumulation might help:
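To make those two tricks concrete, here is a minimal pure-Python sketch. It is a toy, not XLNet's actual training code: the scalar "model", the helper names (`grad_mse`, `accumulated_grad`, `grad_input_checkpointed`) and the data are all made up for illustration. Gradient accumulation processes a large batch as several small micro-batches and sums their gradients, so peak memory depends on the micro-batch size. Gradient checkpointing keeps only some activations during the forward pass and recomputes the rest during the backward pass, trading compute for memory.

```python
import math

# --- Gradient accumulation (toy model: y_hat = w * x, MSE loss) ---

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Same gradient, computed one micro-batch at a time."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the full batch.
        total += grad_mse(w, mx, my) * len(mx) / n
    return total

# --- Gradient checkpointing (toy model: chain of x -> tanh(w_i * x)) ---

def forward_all(ws, x0):
    """Plain forward pass that stores every activation."""
    acts = [x0]
    for w in ws:
        acts.append(math.tanh(w * acts[-1]))
    return acts

def grad_input_full(ws, x0):
    """d(output)/d(input), using all stored activations."""
    acts = forward_all(ws, x0)
    g = 1.0
    for i in reversed(range(len(ws))):
        g *= ws[i] * (1.0 - acts[i + 1] ** 2)
    return g

def grad_input_checkpointed(ws, x0, every):
    """Same gradient, storing only every `every`-th activation."""
    checkpoints = {0: x0}
    x = x0
    for i, w in enumerate(ws):
        x = math.tanh(w * x)
        if (i + 1) % every == 0:
            checkpoints[i + 1] = x
    n = len(ws)
    g = 1.0
    # Walk the segments from last to first, recomputing each one's
    # activations from its checkpoint before backpropagating through it.
    for s in reversed(sorted(k for k in checkpoints if k < n)):
        e = min(s + every, n)
        seg = forward_all(ws[s:e], checkpoints[s])
        for j in reversed(range(e - s)):
            g *= ws[s + j] * (1.0 - seg[j + 1] ** 2)
    return g

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
full_batch = grad_mse(0.5, xs, ys)
micro_batched = accumulated_grad(0.5, xs, ys, micro_batch_size=2)

ws = [0.9, -1.1, 0.7, 1.3, -0.8]
g_full = grad_input_full(ws, 0.3)
g_ckpt = grad_input_checkpointed(ws, 0.3, every=2)

print(abs(full_batch - micro_batched) < 1e-9, abs(g_full - g_ckpt) < 1e-12)
```

In both cases the gradient matches the memory-hungry version; the price is extra steps (accumulation) or extra forward computation (checkpointing), which is the speed penalty mentioned above.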
Could you elaborate?
A TPU v3 unit has 16GB of memory, and the older V100 also has 16GB of memory. Plus, TPUs have extra memory consumption from mandatory tensor padding, which GPUs don't have.
There is no elaboration of your statement there.
It says it is hard to reproduce the results on a single 16GB GPU, but they didn't use a single TPU for their results either.
They explicitly state that a large number of GPUs is required: "Therefore, a large number (ranging from 32 to 128, equal to batch_size) of GPUs are required to reproduce many results in the paper.", and they used around 200 TPUs themselves according to their publication.
Yes, it boils down to either using a single machine with multiple TPUs (and therefore doing things "the easy way" and relatively quickly) or having to use 128 GPUs (up to 8 per machine) and working with a single sample per GPU, really, really slowly. Given that a single model often requires dozens, if not hundreds, of training runs to figure out hyperparameters that achieve a SOTA result, this means that with TPUs you can do this, and with GPUs you aren't even going to bother training something this big, because it will take forever and someone with a TPU will figure out something better by the time you're done. Which is what happened in this case, it looks like.
>For example, the task provides the sentence: “The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.” If the word “feared” is selected, then “they” refers to the city council. If “advocated” is selected, then “they” presumably refers to the demonstrators.