Specifically, the hidden state is now the weights $W$ of a model $f$. This model can be a linear model, a small neural network, or anything else. The output rule is simply:

$$z_t = f(x_t; W_t)$$

Intuitively, the output $z_t$ is the prediction made by the model with the updated weights $W_t$. The update rule is a step of gradient descent on some self-supervised loss $\ell$:

$$W_t = W_{t-1} - \eta \nabla \ell(W_{t-1}; x_t)$$

where $\eta$ is the learning rate. From a compression perspective, every heuristic needs to decide which inputs to remember and which to forget. Inputs that produce large gradients are remembered - intuitively, the inputs that make $W$ learn a lot.
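To make the two rules concrete, here is a minimal sketch in Python, assuming a linear inner model $f(x; W) = Wx$. The names `ttt_step` and `grad_loss`, and the example squared reconstruction loss, are illustrative assumptions rather than the paper's exact formulation; the next paragraph discusses the specific choice of $\ell$.

```python
import numpy as np

# Minimal sketch of the two rules, assuming a linear inner model f(x; W) = W @ x.
# grad_loss is a placeholder: any function returning d ell / d W at the current token.

def ttt_step(W, x, grad_loss, lr=0.1):
    W_new = W - lr * grad_loss(W, x)   # update rule: one gradient step on ell
    z = W_new @ x                      # output rule: z_t = f(x_t; W_t)
    return W_new, z

# Example loss (an assumption for illustration): ell(W; x) = ||W @ x - x||^2,
# whose gradient with respect to W is 2 (W x - x) x^T.
recon_grad = lambda W, x: 2.0 * np.outer(W @ x - x, x)

dim = 4
W = np.zeros((dim, dim))           # hidden state = weights of the inner model
x_t = np.random.randn(dim)         # current input token
W, z_t = ttt_step(W, x_t, recon_grad)
```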
One choice for $\ell$ is reconstructing $x_t$ itself. To make the learning problem non-trivial, the authors first process $x_t$ into a corrupted input $\tilde{x}_t$ and then optimize:

$$\ell(W; x_t) = \lVert f(\tilde{x}_t; W) - x_t \rVert^2$$

As with a denoising autoencoder, $f$ needs to discover correlations between the dimensions of $x_t$ in order to reconstruct it from partial information. As shown in the figure, gradient descent reduces $\ell$ but cannot reduce it to zero. As with RNN layers and self-attention, the algorithm that maps the input sequence $x_1, \dots, x_T$ to the output sequence $z_1, \dots, z_T$ can be programmed into the forward pass of a sequence modeling layer, using the hidden state, update rule, and output rule described above.
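The forward pass is just this update-then-predict recurrence run over the whole sequence. Below is a sketch with the corrupted-input reconstruction loss; the random zeroing used for corruption and the names `corrupt` and `ttt_forward` are assumptions for illustration, not the authors' exact corruption scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, keep=0.5):
    # Corrupt x_t by randomly zeroing dimensions (an assumed scheme), so that
    # reconstruction forces the inner model to exploit correlations between dimensions.
    mask = rng.random(x.shape) < keep
    return x * mask

def ttt_forward(xs, lr=0.1):
    # Map the input sequence x_1..x_T to the output sequence z_1..z_T.
    dim = xs.shape[1]
    W = np.zeros((dim, dim))                 # hidden state: weights of the inner model
    zs = []
    for x in xs:                             # sequential scan over the tokens
        x_tilde = corrupt(x)                 # corrupted view of the input
        resid = W @ x_tilde - x              # f(x_tilde; W) - x
        grad = 2.0 * np.outer(resid, x_tilde)  # gradient of ||f(x_tilde; W) - x||^2
        W = W - lr * grad                    # update rule: one gradient step
        zs.append(W @ x)                     # output rule: z_t = f(x_t; W_t)
    return np.stack(zs)

tokens = rng.standard_normal((16, 8))        # a toy sequence of 16 tokens, dim 8
outputs = ttt_forward(tokens)                # same length as the input sequence
```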
Even at test time, the new layer still trains a different sequence of weights $W_1, \dots, W_T$ for each input sequence. Therefore, the researchers call it a test-time training (TTT) layer. The forward pass of a TTT layer also has a corresponding backward pass. TTT layers have the same interface as RNN layers and self-attention, and can therefore be swapped into any larger neural network architecture. It is worth mentioning that training a network with TTT layers works the same way as training any other model: the same data, recipe, and objective (such as next-token prediction) can be used to optimize the parameters of the rest of the network.
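To illustrate the interface point, the sketch below wraps the same recurrence as a PyTorch module mapping a (batch, seq_len, dim) tensor to one of the same shape, so it could stand where a self-attention or RNN layer would. The class name, the learnable initial weights, the fixed inner learning rate, and the dropout-style corruption are assumptions for illustration; the point is that autograd supplies the corresponding backward pass, so the surrounding network trains with the usual next-token objective.

```python
import torch
from torch import nn

class TTTLinearLayer(nn.Module):
    """Sketch of a TTT-style layer with the usual sequence-layer interface."""

    def __init__(self, dim, inner_lr=0.1, keep=0.5):
        super().__init__()
        self.W0 = nn.Parameter(torch.zeros(dim, dim))  # assumed learnable initial hidden state
        self.inner_lr = inner_lr
        self.keep = keep

    def forward(self, x):                       # x: (batch, seq_len, dim)
        W = self.W0.expand(x.size(0), -1, -1)   # one weight matrix per sequence
        outputs = []
        for t in range(x.size(1)):
            x_t = x[:, t, :]                                    # (batch, dim)
            mask = (torch.rand_like(x_t) < self.keep).float()
            x_tilde = x_t * mask                                # corrupted input
            resid = torch.bmm(W, x_tilde.unsqueeze(-1)).squeeze(-1) - x_t
            grad = 2.0 * torch.bmm(resid.unsqueeze(-1), x_tilde.unsqueeze(1))
            W = W - self.inner_lr * grad                        # inner gradient step
            z_t = torch.bmm(W, x_t.unsqueeze(-1)).squeeze(-1)   # output rule
            outputs.append(z_t)
        return torch.stack(outputs, dim=1)      # (batch, seq_len, dim)

# The layer is differentiable end to end: the rest of the network around it can
# be optimized with the usual objective, e.g. cross-entropy on next-token prediction.
layer = TTTLinearLayer(dim=8)
y = layer(torch.randn(2, 16, 8))
y.sum().backward()                              # autograd provides the backward pass
```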