This compression has two consequences. The advantage is efficiency: because the hidden state has a fixed size, the processing time per token is constant. The disadvantage is that performance on long contexts is limited by how much information the hidden state can express, i.e., by its "expressiveness". The self-attention mechanism can also be understood from this perspective. The difference is that its hidden state, usually called the key-value (KV) cache, grows linearly with the context length t: it stores the entire context without compressing it, so it is very expressive, but its processing time also grows linearly with the length of the context.
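To make the trade-off concrete, here is a minimal sketch of the two hidden-state designs; the function names and shapes are illustrative assumptions, not code from any particular model.

```python
import numpy as np

def rnn_step(state, x, W_h, W_x):
    # Fixed-size recurrent state: constant work per token, but the whole
    # past must be squeezed into the `state` vector.
    return np.tanh(W_h @ state + W_x @ x)

def attention_step(kv_cache, q, k, v):
    # Key-value cache: nothing is compressed, the cache grows by one entry
    # per token, and every past entry is revisited at each step.
    kv_cache["K"].append(k)
    kv_cache["V"].append(v)
    K, V = np.stack(kv_cache["K"]), np.stack(kv_cache["V"])
    scores = K @ q / np.sqrt(len(q))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V
```

The first function does the same amount of work for every token; the second touches every cached key and value, so its per-token cost grows with the position in the sequence.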
So, to be both efficient and expressive in long contexts, a better “compression heuristic” is needed. Specifically, millions of tokens must be compressed into a hidden state that effectively captures their underlying structure and relationships. The researchers' key idea is to use self-supervised learning to compress the historical context x_1, …, x_t into a hidden state s_t. The approach is to treat the context as an unlabeled dataset and the state as a model: the hidden state is now equivalent to the weights W of a model f. This model f can be a linear model, a small neural network, or anything else.
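As a purely illustrative reading of "the state is a model", the sketch below defines two possible inner models f whose parameters would play the role of the hidden state; the dimensions and initializations are assumptions, not the authors' implementation.

```python
import numpy as np

def f_linear(x, state):
    # Linear inner model: the hidden state is a single weight matrix W.
    return state["W"] @ x

def f_mlp(x, state):
    # Small two-layer MLP: the hidden state is a dict of two weight matrices.
    h = np.maximum(state["W1"] @ x, 0.0)  # ReLU hidden layer
    return state["W2"] @ h

d, d_hidden = 64, 128
rng = np.random.default_rng(0)
state_linear = {"W": np.zeros((d, d))}
state_mlp = {
    "W1": 0.01 * rng.standard_normal((d_hidden, d)),
    "W2": 0.01 * rng.standard_normal((d, d_hidden)),
}
```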
The output rule is simply z_t = f(x_t; W_t). Intuitively, the output token z_t is the prediction the model makes for x_t with the updated weights W_t. The update rule is one step of gradient descent on some self-supervised loss ℓ: W_t = W_{t−1} − η ∇ℓ(W_{t−1}; x_t), where η is the learning rate. From a compression perspective, every heuristic must decide which inputs to remember and which to forget. This one remembers the inputs that produce large gradients: intuitively, the inputs from which it learns a lot. One choice for ℓ is reconstructing x_t itself. To make the learning problem non-trivial, the authors first corrupt x_t into x̃_t and then optimize ℓ(W; x_t) = ‖f(x̃_t; W) − x_t‖². As with a denoising autoencoder, f must discover the correlations between the dimensions of x_t in order to reconstruct it from partial information.
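Putting the output rule and the update rule together for the linear inner model above gives a minimal sketch like the one below. The dropout-style corruption, the learning rate, and the dimensions are illustrative choices, and the gradient of the squared reconstruction error is written out by hand; this is not the authors' released code.

```python
import numpy as np

def corrupt(x, rng, keep_prob=0.8):
    # Randomly zero out dimensions of x_t, so that reconstructing x_t forces
    # the model to exploit correlations between dimensions.
    return x * (rng.random(x.shape) < keep_prob)

def ttt_step(W, x_t, rng, lr=0.01):
    x_tilde = corrupt(x_t, rng)
    # Gradient of the loss ||W @ x_tilde - x_t||^2 with respect to W.
    residual = W @ x_tilde - x_t
    grad = 2.0 * np.outer(residual, x_tilde)
    W_new = W - lr * grad            # update rule: one gradient-descent step
    z_t = W_new @ x_t                # output rule: predict with updated weights
    return W_new, z_t

# Process a context token by token; W carries everything learned so far.
d = 64
rng = np.random.default_rng(0)
W = np.zeros((d, d))
for x_t in rng.standard_normal((16, d)):   # stands in for x_1, ..., x_t
    W, z_t = ttt_step(W, x_t, rng)
```

A token that the current state already reconstructs well produces a small residual and barely changes W, while a token the state cannot explain produces a large gradient step, which is exactly the "remember the inputs you learn a lot from" behavior described above.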