A good friend of mine shared an interesting post from the Computational Complexity blog, which asked why the output of a transformer needs to be sequential at all: why couldn't it also carry positional encodings, just as the input does?
To provide a bit more context, the Transformer architecture has no built-in notion of position: it attends to all of its inputs simultaneously, so on its own it is insensitive to the order of the tokens. For many applications, especially in NLP, this is a problem, since the position of a word carries a lot of information about it. To give the Transformer a way of detecting relative position, a position-dependent signal built from sine and cosine functions is added to the input embeddings, and because of its regularity the model can learn to pick it up.

The output of a Transformer, by contrast, is simply a sequence of tokens emitted in order. The blog post asks, very interestingly, whether instead of assuming that sequence you could have the output carry a positional factor as well, just like the input, and make that something else the model learns, with a postprocessor putting the tokens in the right order. This could conceivably help with longer-distance correlations in Transformers, and it is an intriguing idea that should be explored at some point, and probably will be before the year 2084.
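For readers who want to see the input-side mechanism concretely, here is a minimal sketch of the sinusoidal positional encoding from the original Transformer paper, written with NumPy. The function name and the toy embedding sizes are just illustrative, not anyone's production code:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of the standard sin/cos encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # shape (1, d_model)
    # Each pair of dimensions shares a frequency of 1 / 10000^(2i / d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                      # shape (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])           # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])           # odd dimensions: cosine
    return encoding

# The encoding is simply added to the token embeddings before the first layer.
embeddings = np.random.randn(16, 512) * 0.02              # toy token embeddings
inputs = embeddings + sinusoidal_positional_encoding(16, 512)
```

Because each position gets a distinct, smoothly varying pattern across the embedding dimensions, attention heads can learn to compare those patterns and recover relative offsets. The blog post's suggestion is essentially to let the output carry an analogous positional signal rather than relying on the order in which tokens are emitted.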