KVzap-mlp-Qwen3-8B
KVzap-mlp-Qwen3-8B is an NVIDIA research model that learns to predict KV-cache state directly from token embeddings, bypassing the attention computation for cached positions. It uses an MLP head on Qwen3-8B internals and is designed to accelerate inference by reducing the number of full attention forward passes. The approach is described in the KVzap paper (arXiv:2506.05345).