### Vulnerability Overview

This vulnerability involves the improper recycling of KV blocks in `FullAttention` models: recycled blocks were not guaranteed to be zeroed, so stale KV data from a previously freed block could survive into a new request. The defect spans the implementation of the `needs_kv_cache_zeroing` property in `kv_cache_interface.py` and its missing test coverage in `test_kv_cache_utils.py`.

### Impact Scope

- **Files**: `tests/vllm/core/test_kv_cache_utils.py` and `vllm/vllm/kv_cache_interface.py`
- **Property**: `needs_kv_cache_zeroing`
- **Impact**: Stale KV state in a recycled block can leak through the tail slots of a partially filled block, producing incorrect attention output (see the regression-test comment referencing #39146).

### Remediation Plan

1. **Test Cases**:
   - Test cases `test_needs_kv_cache_zeroing` and `test_sliding_only_needs_kv_cache_zeroing` were added to `test_kv_cache_utils.py` to verify the behavior of `needs_kv_cache_zeroing`.
   - They cover both the `FullAttention` and sliding-window-only configurations, ensuring zeroing is required exactly where stale data could otherwise be read.

2. **Code Implementation**:
   - `needs_kv_cache_zeroing` in `kv_cache_interface.py` was adjusted so that any configuration containing a `FullAttentionSpec` group (or Mamba layers) requests zeroing of recycled KV blocks:

   ```python
   @property
   def needs_kv_cache_zeroing(self) -> bool:
       return self.has_mamba_layers or any(
           type(g.kv_cache_spec) is FullAttentionSpec
           for g in self.kv_cache_groups
       )
   ```

### POC Code

The relevant POC (regression test) code:

```python
def test_needs_kv_cache_zeroing():
    # Regression test for #39146: FullAttention models must zero recycled
    # kv blocks to avoid stale KV leaking through partial-block tail slots.
    full_attention = KVCacheConfig(
        num_blocks=16,
        kv_cache_tensors=[],
        kv_cache_groups=[KVCacheGroupSpec(["layer_0"], new_kv_cache_spec())],
    )
    assert full_attention.needs_kv_cache_zeroing

    sliding_only = KVCacheConfig(
        num_blocks=16,
        kv_cache_tensors=[],
        kv_cache_groups=[
            KVCacheGroupSpec(
                ["layer_0"], new_sliding_window_spec(sliding_window=4)
            ),
        ],
    )
    assert not sliding_only.needs_kv_cache_zeroing
```
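To make the decision logic runnable outside of vLLM, here is a minimal self-contained sketch. The dataclass names mirror vLLM's `KVCacheConfig`, `KVCacheGroupSpec`, and spec classes but are simplified stand-ins for illustration, not the real interfaces:

```python
from dataclasses import dataclass

# Simplified stand-ins for vLLM's kv-cache spec classes (illustrative only).
class AttentionSpec: ...
class FullAttentionSpec(AttentionSpec): ...
class SlidingWindowSpec(AttentionSpec): ...

@dataclass
class KVCacheGroupSpec:
    layer_names: list
    kv_cache_spec: AttentionSpec

@dataclass
class KVCacheConfig:
    num_blocks: int
    kv_cache_groups: list
    has_mamba_layers: bool = False

    @property
    def needs_kv_cache_zeroing(self) -> bool:
        # Full-attention layers may read every slot of a block, so recycled
        # blocks must be zeroed to avoid stale KV in partial-block tails;
        # sliding-window-only configs never read beyond the window and can
        # skip the zeroing cost.
        return self.has_mamba_layers or any(
            type(g.kv_cache_spec) is FullAttentionSpec
            for g in self.kv_cache_groups
        )

full = KVCacheConfig(16, [KVCacheGroupSpec(["layer_0"], FullAttentionSpec())])
sliding = KVCacheConfig(16, [KVCacheGroupSpec(["layer_0"], SlidingWindowSpec())])
print(full.needs_kv_cache_zeroing, sliding.needs_kv_cache_zeroing)  # True False
```

Note the exact `type(...) is FullAttentionSpec` check: a subclass of `FullAttentionSpec` would not trigger zeroing, mirroring the strictness of the patched implementation.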