# [Bug]: KV Block Corruption in Base Scheduler, Non-deterministic Output at temperature=0 Without Prefix Caching #39146

## Vulnerability Overview

A KV block corruption bug has been discovered in the base scheduler of vLLM. With prefix caching disabled and `temperature=0`, identical prompts can produce completely different output sequences. The issue is most evident under concurrent requests and results in non-deterministic outputs.

## Impact Scope

- **Trigger conditions**: `temperature=0`, prefix caching disabled, concurrent requests.
- **Impact**: identical inputs may yield different outputs, undermining model stability and reliability.
- **Concurrency**: reproducible with 4–5 concurrent requests; any production deployment serving concurrent traffic may be affected.

## Fix Plan

1. **TOCTOU fix**: PR #37164 addresses the TOCTOU race condition in `get_computed_blocks()`, but has not yet been merged.
2. **Ruled-out causes** (this bug is independent of the above):
   - **No `--enable-prefix-caching`**: without APC, `get_computed_blocks()` is never called, so the TOCTOU race cannot be the cause.
   - **No shared prefix**: all request prompts are entirely unique, so no cached content is shared between requests.

## Reproduction Steps

1. **Start vLLM**:

   ```bash
   python -m vllm.entrypoints.openai.api_server \
       --model Qwen/Qwen2.5-0.5B-Instruct \
       --gpu-memory-utilization 0.95 \
       --max-model-len 32768
   ```

2. **Run the reproduction script**:

   ```bash
   python3 repro.py --base-url http://localhost:8000
   ```

## Related Findings

- **finding_00450 (primary, cleanest)**: 5 requests, no shared state, no cancellation operations.
- **finding_01410 (corroborating, same symptom)**: 21 concurrent requests, mixed sizes; block allocation order may be the trigger.
- **finding_00030 (related, cancel path)**: 5 requests canceled midway with 5 retries; likely the same underlying issue.

## Isolation Testing

- **Speculative decoding is not the cause**: re-running each trace with the speculative-decoding engine fully removed yields the same results.
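The contents of `repro.py` are not shown in this report, so the following is a hypothetical sketch of what such a script could do, not the attached file. Assumptions: the server exposes the OpenAI-compatible `/v1/completions` route, 5 concurrent identical requests suffice (as in finding_00450), and the prompt, helper names, and token count are invented for illustration.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def all_identical(outputs):
    """True when every completion matches the first one exactly."""
    return all(o == outputs[0] for o in outputs)

def first_divergence(a, b):
    """Index of the first differing character, or None if the strings match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

def complete(base_url, prompt):
    """One greedy completion via the OpenAI-compatible /v1/completions route."""
    body = json.dumps({
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": prompt,
        "max_tokens": 128,
        "temperature": 0,  # greedy decoding: outputs should be identical
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

def main(base_url="http://localhost:8000", n=5):
    """Fire n identical concurrent requests and compare the outputs."""
    prompt = "Explain the difference between a process and a thread."
    with ThreadPoolExecutor(max_workers=n) as pool:
        outputs = list(pool.map(lambda _: complete(base_url, prompt), range(n)))
    if all_identical(outputs):
        print("OK: all outputs identical")
    else:
        other = next(o for o in outputs if o != outputs[0])
        print(f"BUG: outputs diverge at character {first_divergence(outputs[0], other)}")

# Call main() only against a live server started as in step 1 above.
```

On a correct scheduler every run should print the OK line; against an affected server the divergence branch fires intermittently, which is why several runs may be needed.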
## Hypotheses

- **V1 scheduler block allocator**: does not track block state via a hash table, which may allow a block to be reused while a request still references it.
- **Block allocation order**: variations in the order blocks are allocated across requests lead to differing outputs.

## Related Files

- `primary_finding_00030_999829240.json`
- `second_corroboration_finding_00450_862114934.json`
- `cancel_retry_finding_01410_1760617970.json`
- `repro.py`

## Pre-submission Checks

- [x] I have searched for related issues and reviewed the FAQ at the bottom of the documentation page.
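To make the first hypothesis above concrete, here is a toy model of the suspected failure mode. This is invented for illustration, not vLLM code: a free-list allocator with no reference tracking hands a prematurely freed block to a second request, which then overwrites KV history the first request still depends on.

```python
class ToyBlockAllocator:
    """Free-list KV block allocator with no reference tracking (the
    hypothesized flaw); a correct allocator must track live references."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.storage = {}  # physical block id -> KV contents

    def alloc(self):
        return self.free.pop(0)

    def free_block(self, block_id):
        # Flaw being modeled: the block returns to the free list even if
        # another request still holds a reference to it.
        self.free.append(block_id)

alloc = ToyBlockAllocator(num_blocks=1)

# Request A writes its KV state into the only block.
blk_a = alloc.alloc()
alloc.storage[blk_a] = "KV of request A"

# A scheduling mistake frees A's block while A is still decoding.
alloc.free_block(blk_a)

# Request B is handed the same physical block and overwrites it.
blk_b = alloc.alloc()
alloc.storage[blk_b] = "KV of request B"

# A's next decode step now attends over B's KV state: silent corruption,
# which at temperature=0 surfaces as non-deterministic output.
print(blk_a == blk_b)        # True
print(alloc.storage[blk_a])  # KV of request B
```

This also suggests why block allocation order (the second hypothesis) matters: which request collides with which depends on the order blocks cycle through the free list under concurrency.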