# [Bug]: KV block corruption in base scheduler, non-deterministic output at temperature=0 without prefix caching #39146

## Vulnerability Overview

In the vLLM project, when the base scheduler is used and **prefix caching is not enabled**, KV cache blocks become corrupted under `temperature=0`. This leads to non-deterministic output: multiple runs with identical prompts yield inconsistent results. This issue differs from the previous TOCTOU race condition (#37076); it occurs within the `get_computed_blocks()` path and does not require shared prefix content.

## Impact Scope

- **Affected Component**: KV block allocation logic in the base scheduler.
- **Trigger Conditions**:
  - Prefix caching is disabled (`--enable-prefix-caching` is not set).
  - Concurrent requests (4-5 concurrent requests were sufficient to reproduce in testing).
  - Changes in memory pressure that alter the block allocation order.
- **Consequences**: In production deployments, concurrent requests may produce incorrect or inconsistent output.

## Remediation

- **Status**: Fixed.
- **Related PR**: #37164.
- **Fix Details**: Resolved the TOCTOU race condition in the KV block allocator, ensuring that `get_computed_blocks()` returns the correct set of computed blocks. (A hypothetical sketch of this check-then-use bug class appears at the end of this report.)

## Reproduction Code (POC)

**Step 1 — Start the vLLM server:**

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768
```

**Step 2 — Run the script (requires httpx):**

```bash
python3 repro.py --base-url http://localhost:8000
```

**Note**: Before running the script, ensure that `repro.py` is in the same directory as the required JSON files (`primary_finding_00030.json`, `second_corroborating_finding_00450.json`, `cancel_retry_finding_01410.json`). If those files are unavailable, a minimal sketch of an equivalent concurrency probe follows below.
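The original `repro.py` depends on the JSON finding files listed above and is not reproduced here. As a stand-in, the following is a minimal sketch of a probe in the same spirit: it fires several identical greedy requests in parallel against the Step 1 server and reports whether the completions diverge. The base URL and model name match the Step 1 command; the prompt, concurrency, and round counts are illustrative choices, not the issue's actual parameters.

```python
# repro_sketch.py -- minimal concurrency probe (NOT the original repro.py,
# which depends on JSON finding files that are not reproduced here).
import asyncio
from collections import Counter

import httpx

BASE_URL = "http://localhost:8000"    # matches the Step 1 server command
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # matches the Step 1 server command
PROMPT = "Explain the difference between a process and a thread."
CONCURRENCY = 5                       # 4-5 in-flight requests sufficed in testing
ROUNDS = 20

async def complete(client: httpx.AsyncClient) -> str:
    """Request one greedy (temperature=0) completion and return its text."""
    resp = await client.post("/v1/completions", json={
        "model": MODEL,
        "prompt": PROMPT,
        "max_tokens": 128,
        "temperature": 0,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

async def main() -> None:
    async with httpx.AsyncClient(base_url=BASE_URL, timeout=120.0) as client:
        outputs: Counter[str] = Counter()
        for _ in range(ROUNDS):
            # Identical prompts in flight simultaneously; at temperature=0
            # every completion should be byte-for-byte identical.
            batch = await asyncio.gather(
                *(complete(client) for _ in range(CONCURRENCY))
            )
            outputs.update(batch)
        if len(outputs) == 1:
            print("deterministic: all completions identical")
        else:
            print(f"NON-DETERMINISTIC: {len(outputs)} distinct completions")
            for text, count in outputs.most_common():
                print(f"  x{count}: {text[:60]!r}...")

if __name__ == "__main__":
    asyncio.run(main())
```

At `temperature=0`, more than one distinct completion across rounds is evidence of the inconsistency described above, though this simplified probe may need more rounds or added memory pressure to trigger it than the original script.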
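For readers unfamiliar with the bug class referenced in the Remediation section, here is a deliberately simplified, hypothetical illustration of a time-of-check/time-of-use (TOCTOU) race on block metadata and a locking discipline that closes it. The class, method, and field names are invented for illustration and do not mirror vLLM's actual scheduler or block-manager code.

```python
# toctou_sketch.py -- hypothetical illustration of a check-then-use race on
# KV block metadata; all names are invented and do NOT mirror vLLM internals.
import threading

class BlockPool:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._computed: dict[int, bytes] = {}  # block_id -> cached contents

    def get_computed_blocks_racy(self, block_ids: list[int]) -> list[bytes]:
        # BUG: the membership check and the read happen without the lock,
        # so another thread may evict or reassign a block in between
        # (time-of-check vs. time-of-use).
        hits = [bid for bid in block_ids if bid in self._computed]  # check
        return [self._computed[bid] for bid in hits]                # use

    def get_computed_blocks_fixed(self, block_ids: list[int]) -> list[bytes]:
        # FIX: perform check and use atomically under one lock, so the
        # snapshot of computed blocks cannot change mid-operation.
        with self._lock:
            return [self._computed[bid] for bid in block_ids
                    if bid in self._computed]

    def evict(self, block_id: int) -> None:
        # Runs on another thread under memory pressure; removing an entry
        # between the racy check and use above raises KeyError, or silently
        # returns stale data if block ids are recycled.
        with self._lock:
            self._computed.pop(block_id, None)
```

PR #37164 is described above only as resolving the race; holding one lock across both steps is a standard remedy for this pattern, not necessarily the exact change made there.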