Hacker News
Attention Is All You Need (nips.cc)
225 points by espeed on Dec 16, 2017 | 30 comments


This paper has a lot of prerequisites to understand. A good paper to read first is the precursor to this paper, released a year earlier: https://arxiv.org/abs/1606.01933


I expect we'll be seeing many shakeups in what have (perhaps prematurely) become the established norms for NN architectures (CNNs and RNNs) over the next few years.

It's a great time to be alive!


See the Google blog post from last summer, https://research.googleblog.com/2017/08/transformer-novel-ne... It describes a novel, simplified architecture for sequences and translation.


Can somebody assist in breaking this down?


When you want to process a sequence of vectors while incorporating information from a second sequence, you use attention. Attention means you create a weighted sum of all vectors in the second sequence for each vector in the first sequence. You basically add a customized summary vector of the second sequence to each vector of the first.
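In code, a minimal sketch of that idea might look like this (numpy; dot-product scores with softmax normalization are one common choice, and all names here are just illustrative):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
        return e / e.sum(axis=axis, keepdims=True)

    def attend(first_seq, second_seq):
        """For each vector in first_seq, build a weighted summary of second_seq.

        first_seq:  (m, d) array
        second_seq: (n, d) array
        returns:    (m, d) array of summary vectors
        """
        scores = first_seq @ second_seq.T    # (m, n) similarity scores
        weights = softmax(scores, axis=-1)   # each row sums to 1
        return weights @ second_seq          # weighted sums of second_seq

    # toy usage: 3 vectors in the first sequence, 5 in the second, dimension 4
    first = np.random.randn(3, 4)
    second = np.random.randn(5, 4)
    print(attend(first, second).shape)       # (3, 4)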


It shows that multiple attention heads can replace CNNs and RNNs, creating a powerful new paradigm for neural nets.


It just seems so random. I understand that CNNs are better suited to processing pixels in a 2D grid; it sort of makes sense because images are in a 2D grid and the operations emulate the convolutions used in lots of image processing. I also understand that RNNs are used to process symbols, and their feedback mechanisms can emulate short-term memory.

Still... even though I understand what works and what doesn't, I don't really understand why RNNs are better at processing symbols and CNNs are better at processing images. They hold the same information and work the same way; they are just organized differently. It makes little sense.


Think of it this way: neural networks are a particular kind of formula. For instance, a 2x2 neural network takes 2 inputs, x and y, has 2 biases, bx and by, and 4 weights, w1 ... w4, and calculates 2 outputs, ox and oy. The formula it uses is:

ox = w1 * x + w3 * y + bx

oy = w2 * x + w4 * y + by

And then it learns how to change w1 ... w4, bx and by to get ox and oy to be more useful (less error). Is it so weird that a 2-layer network (which is the same formula, but it uses ox and oy as inputs to a second identical calculation to produce the final outputs) will produce better results, even with learning?
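As a sketch, here's that formula transcribed directly into numpy-free Python (the weight values are illustrative; note that real stacked layers also put a nonlinearity between the two calculations):

    def layer_2x2(x, y, w1, w2, w3, w4, bx, by):
        """The 2-in / 2-out formula from the comment above."""
        ox = w1 * x + w3 * y + bx
        oy = w2 * x + w4 * y + by
        return ox, oy

    # illustrative weights; training would adjust these to reduce error
    w = dict(w1=0.5, w2=-0.3, w3=0.8, w4=0.1, bx=0.0, by=0.2)

    # one layer
    ox, oy = layer_2x2(1.0, 2.0, **w)

    # a "2-layer" network: feed ox, oy into a second identical calculation
    # (real networks insert a nonlinearity, e.g. max(0, .), between the layers)
    ox2, oy2 = layer_2x2(ox, oy, **w)
    print(ox, oy, ox2, oy2)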

Even in a neural network as absurdly large as the human mind, it isn't just a bunch of wires connecting everything to everything. There are clear patterns, clear formulas that are dictated not by learning but by the architecture itself (e.g. CNNs in the eyes and visual cortex, and something much more like LSTMs in memory-heavy regions).

In some sense one might even say it doesn't make a difference. A fully connected network with the same fanout as a CNN can do everything a CNN can, and more. Likewise, a network that is simply presented with the last 50 timesteps can do strictly better than an LSTM or GRU RNN would.

But such a network would have much, much greater computational complexity: by a factor in the thousands at least, and in the case of those CNNs by a factor that is roughly the number of pixels in the image.

So they don't work better per se. In fact they are "worse" by the most important metric (error). But they are a good trade-off: LSTMs are thousands of times faster, with only slightly worse results for sequence labeling or production (i.e. text and audio comprehension). CNNs are millions of times faster at image comprehension than a fully connected network, and are only slightly worse at it.
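A rough back-of-the-envelope comparison of parameter counts makes the scale of that trade-off concrete (the numbers below are illustrative, not taken from any particular model):

    # Rough parameter-count comparison (illustrative, not from the paper).
    H, W, C_in, C_out = 224, 224, 3, 64   # a typical input image and feature count
    k = 3                                  # 3x3 convolution kernel

    # Fully connected: every input pixel/channel connects to every output cell
    fc_params = (H * W * C_in) * (H * W * C_out)

    # Convolution: one small shared kernel per (input channel, output channel) pair
    conv_params = k * k * C_in * C_out

    print(f"fully connected: {fc_params:,} weights")
    print(f"convolution:     {conv_params:,} weights")
    print(f"ratio: ~{fc_params / conv_params:,.0f}x")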


These are good points about computational complexity, but there's more to it: the NN architecture regularizes the representations that can be learned. A fully connected architecture may be fully general, but without powerful regularization you won't be able to train it to match the performance of an architecture which encodes some of the structure of the problem, even disregarding the computational load of the increased number of parameters.


I understand all your points, but they are just not satisfying. CNNs and RNNs are sufficiently different that I sort of understand. Maybe CNNs are more like your eyeball and RNNs are more like your inner brain. But LSTMs... really? There are so many variations of them, and they seem to be the product of trial and error, hooking nodes up in weird ways. I'm not saying they don't work, and I'm comfortable not knowing why particular weights do what they do, but it's hard to understand why some configurations work so much better than others for certain problems.


The way I took it is that CNNs use local context and are translation invariant (so if a certain combination of pixels is an eye, it'll be an eye anywhere in the image), while RNNs can decide which information to keep and in what way positioning affects the result.

So the thing CNNs are really good at is perfect for small images, but symbols already have that work done for them. When CNNs and RNNs are used together (e.g. image segmentation), it's almost as if the CNN is creating symbols and the RNN is processing them.
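A toy 1-D illustration of that translation-invariance point, in case it helps (all numbers made up):

    import numpy as np

    # A shared kernel slides over the input, so the same "detector" fires
    # wherever the pattern appears. Toy 1-D example.
    kernel = np.array([1.0, -1.0, 1.0])          # a tiny learned "eye detector"
    signal = np.zeros(10)
    signal[6:9] = [1.0, -1.0, 1.0]               # the pattern, placed at position 6

    responses = np.array([signal[i:i + 3] @ kernel for i in range(len(signal) - 2)])
    print(responses.argmax())                     # 6: found it, wherever it was placed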

Or, you know, maybe I'm far off. I'm in no way qualified to talk about this (just a fan).


Here's another thing to consider. The brain is processing "video" in what we might call 3.5 dimensions (2D images + time, with around 6 channels... plus sound). Image recognition with a CNN works on still images and drops the time component entirely. RNNs are about handling sequential data, discrete or not...

Meanwhile, the hardware in our heads is far more complex and capable than the models we're building. And we're not even sure how that hardware works, or what principles we can abstract out and simplify. A neuron receiving electrical and hormonal signals is far more complex than an LSTM cell... Does it need to be? Or are there biological requirements built in which we don't understand? And how many evolutionary accidents that never evolved away?


> I don't really understand why RNNs are better at processing symbols and CNNs are better at processing images

It's also possible to go the other way around. PixelRNN uses a kind of special convolution to encode and generate images, and CNNs have been used for text translation and speech synthesis, largely because they are faster.

The RNN and CNN contain prior knowledge about the domain. The RNN encodes the idea that prior elements influence the following elements in a sequence; CNNs encode the idea that nearby pixels are related. Both imply some kind of domain knowledge.

Multiple attention heads could potentially learn domain knowledge from data.


I'm seconding this. I could not find a good resource to understand what "attention" actually _is_.

(The next step for me would be to follow the citation trail to the original paper, but that might not be the best place to come to an understanding of the thing.)


Attention is just a weighted sum over a set of vectors, where the weights sum to one. The attention weights are usually produced by neural nets. The word "attention" might sound more grandiose than what it actually does.


That may be true on a mathematical level, but that's also the answer to just about any neural net question... it's all just a weighted sum. My understanding of "attention" on a higher level is the ability to concentrate more neurons on "important" areas of an image than less important ones.

An imperfect analogy is how the human visual system has better resolution at your eye-line's center than at its edges. In this analogy, your brain should not waste effort processing image details in your peripheral vision.


The key element is that we use neural nets to compute the attention weights, so attention itself is learnable.
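For instance, here's a minimal sketch where the "net" is just a single learned scoring vector (purely illustrative; real models use bigger score functions and would update W_score by backprop):

    import numpy as np

    # The attention weights come out of a (trainable) scorer followed by softmax,
    # so gradient descent can learn where to attend.
    rng = np.random.default_rng(0)
    W_score = rng.normal(size=(4,))        # parameters of the scorer (would be trained)

    vectors = rng.normal(size=(5, 4))      # 5 candidate vectors to attend over
    scores = vectors @ W_score             # one scalar score per vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax: non-negative, sums to 1
    summary = weights @ vectors            # the attended (weighted-sum) vector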


The other answers cover the math well, but I think the “why do you need attention?” question is worth addressing (it also answers the more engineering-y question of “how/when?”):

DNNs typically operate on fixed-size tensors (often with a variable batch size, which you can safely ignore). In order to incorporate a non-fixed-size tensor, you need some way of converting it into a fixed size. For example, processing a sentence of variable length into a single prediction value. You have many choices for methods of combining the tensors from each token in the sentence: max, min, mean, median, sum, and so on. Attention is a weighted mean, where the weights are computed from a query and a key and applied to a value. The query might represent something you know about the sentence or the context (“this is a sentence from a toaster review”), the key represents something you know about each token (“this is the word embedding tensor”), and the value is the tensor you want to use for the weighted mean.
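Here's a sketch of that pooling idea in numpy. I'm assuming scaled dot-product scoring (as in the Transformer); the names and shapes are just illustrative:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def pool_by_attention(query, keys, values):
        """Collapse a variable-length sequence into one fixed-size vector.

        query:  (d,)   what we know about the context (e.g. "toaster review")
        keys:   (n, d) one per token (e.g. word embeddings)
        values: (n, d) the tensors to average
        """
        scores = keys @ query / np.sqrt(query.shape[0])  # one scaled dot product per token
        weights = softmax(scores)                        # sums to 1 over the n tokens
        return weights @ values                          # (d,) weighted mean

    # sentences of different lengths all reduce to the same fixed-size vector
    d = 8
    query = np.random.randn(d)
    for n_tokens in (5, 12, 31):
        toks = np.random.randn(n_tokens, d)
        print(pool_by_attention(query, toks, toks).shape)  # always (8,)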


The distill.pub explanation is pretty good https://distill.pub/2016/augmented-rnns/


Reading the headline I thought the article would be about mindfulness, which would have been nice. Reading the article I was pleasantly surprised to find a different subject that I also enjoy. :)


Would this have implications for using ANNs on recursive structures (trees and graphs)? Their "position encoding" seems a little contrived, but may be amenable to a more complex positioning scheme (e.g. paths from a root node).
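For reference, the position encoding in the paper is a fixed sin/cos pattern added to each token embedding. A rough sketch (treat it as illustrative rather than a verified reimplementation):

    import numpy as np

    def sinusoidal_position_encoding(n_positions, d_model):
        """Fixed sin/cos position encoding in the style of the paper.

        Each position gets a d_model-sized vector; even indices use sin, odd use cos,
        with wavelengths forming a geometric progression.
        """
        pos = np.arange(n_positions)[:, None]             # (n_positions, 1)
        i = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2)
        angles = pos / np.power(10000.0, i / d_model)     # (n_positions, d_model/2)
        pe = np.zeros((n_positions, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    print(sinusoidal_position_encoding(50, 16).shape)     # (50, 16)

A tree or graph variant would presumably replace the scalar position with something derived from the structure, which is what the path-from-root idea above would amount to.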

Whilst there are "standard" approaches in computer vision ("CNNs applied to <foo>") and sequence processing ("LSTM RNNs applied to <foo>"), there doesn't seem to be any "standard" for variable-size, recursively-structured data. Sure, there are recursive ANNs, backpropagation-through-structure, etc., but they all seem like one-off inventions rather than accepted problem-solving tools.


Seq2Seq is kind of a standard, but it also strikes me as pretty hacky. The network has an encoder mode and a decoder mode: it reads until it finds an end-of-input signal, then switches to decode mode. This is how absolutely nothing works in nature.


Is this really significant? I'm not an NN kind of guy but I find it an interesting thing to follow from a distance. From the abstract, this sounds like an important paper. Is it?


This paper was uploaded to arXiv 6 months ago (June). With the fast pace of progress in translation over the last few years, it might be outdated already.


I wonder how capsule nets could evolve using an attention model like this.


Capsules basically do a kind of self-attention. But there the parent features compete for a coupling, not the child features.


I suggest changing the link from the .pdf to the web page: https://papers.nips.cc/paper/7181-attention-is-all-you-need

It's one click to get the pdf from there. But you also get a plain webpage with abstract, citation details, and so on, which you can't get back to from the PDF.

In general it's good to knock the ".pdf" off the end of all papers.nips.cc links. Similarly turn /pdf/ links on arXiv into /abs/ links, and replace "pdf?" in openreview.net links with "forum?".


Agreed. Plus, it's really annoying to open it on mobile: my phone starts downloading the PDF immediately, and then I have to manually delete the file later.



Abstract:

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.



