Authors: Yuan He, Chengxing Xie
GitHub: https://github.com/strands-rl/strands-sglang
Page on Strands: https://strandsagents.com/docs/community/model-providers/sglang/
Example on Slime: https://github.com/THUDM/slime/tree/main/examples/strands_sglang
<aside>
🌟TL;DR: Existing agent scaffolds like Strands-Agents [1] make it easy to serve tool-using agents, but they face a key challenge: they operate on text (usually via an OpenAI-compatible endpoint), while RL training requires exact token IDs (token-in, token-out). This mismatch causes retokenization drift [2] — the tokens used for computing logprobs/gradients no longer match the tokens that were actually generated — leading to effectively off-policy updates and potentially unstable RL training. Strands-SGLang bridges this gap by extending Strands-Agents with SGLang's native endpoint [3] while keeping the same customizable agent loop.
</aside>
| Component | Agent scaffold / loop | Token-in/token-out |
|---|---|---|
| Strands-Agents | ✅ | ❌ (text-based) |
| SGLang | ❌ | ✅ |
| Strands-SGLang | ✅ | ✅ |
Most agent scaffolds provide a great agent loop (tool orchestration, iteration control, tracing), but their model interface is typically text-based. For RL training, text alone is insufficient: the training pipeline must consume the exact token-level trajectory produced by the backend.
If token IDs are reconstructed later by retokenizing the rendered text messages, retokenization drift can occur, making updates effectively off-policy and potentially destabilizing RL training.
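To see why retokenization is lossy, consider a minimal sketch with a toy greedy tokenizer (the vocabulary and functions below are hypothetical, purely for illustration): the same text admits multiple tokenizations, so `encode(decode(ids))` need not return the ids the model actually sampled.

```python
# Toy vocabulary: "hello" can be tokenized as ["he", "ll", "o"] or ["hello"].
VOCAB = {"he": 0, "ll": 1, "o": 2, "hello": 3}
INV = {v: k for k, v in VOCAB.items()}

def decode(ids):
    return "".join(INV[i] for i in ids)

def encode(text):
    # Greedy longest-match encoding, as many real tokenizers effectively do.
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"untokenizable: {text[i:]!r}")
    return ids

generated = [0, 1, 2]                      # the backend actually sampled "he", "ll", "o"
retokenized = encode(decode(generated))    # reconstructing ids from rendered text
print(generated, retokenized)              # [0, 1, 2] vs [3]: drift
```

Logprobs computed on `retokenized` no longer correspond to the tokens that were sampled, which is exactly the off-policy mismatch described above.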
Strands-SGLang addresses this by bridging both worlds: you keep the same agent loop for serving while producing training-ready trajectories by construction.
Strands-SGLang implements a new model class, SGLangModel, backed by SGLang’s native /generate endpoint, so you can reuse Strands’ agent loop while exposing RL-relevant internals. You run a normal Strands agent, but now you can directly read token-level artifacts from the model:
from transformers import AutoTokenizer
from strands import Agent
from strands_tools import calculator
from strands_sglang import SGLangModel
# Suppose Qwen3-8B is served at http://localhost:30000
model = SGLangModel(
    tokenizer=AutoTokenizer.from_pretrained("Qwen/Qwen3-8B"),
    base_url="http://localhost:30000",
)
agent = Agent(model=model, tools=[calculator])
model.reset()  # refresh the token manager
result = await agent.invoke_async("What is (25 * 17)^3 ?")
tm = model.token_manager
print("token_ids:", tm.token_ids)
print("loss_mask:", tm.loss_mask)
print("logprobs:", tm.logprobs)
This is the key point: the rollout is already in the form that RL training expects, and you don’t need to write any ad-hoc agent-loop glue code.
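As a sketch of how a trainer might consume such a trajectory, the snippet below computes the summed logprob over model-generated tokens only, using the loss mask to exclude prompt and tool-output tokens (the literal values are placeholders standing in for `tm.token_ids`, `tm.loss_mask`, and `tm.logprobs`).

```python
# Placeholder trajectory, mimicking the token manager's outputs:
token_ids = [101, 42, 7, 9, 102]
loss_mask = [0, 1, 1, 1, 0]        # 1 = model-generated token, 0 = prompt/tool output
logprobs  = [0.0, -0.5, -1.2, -0.3, 0.0]

# Sum of logprobs over generated tokens only -- the quantity whose gradient
# a policy-gradient update would scale by the advantage.
traj_logprob = sum(lp for lp, m in zip(logprobs, loss_mask) if m)
print(traj_logprob)  # -2.0
```

Because the ids, mask, and logprobs all come from the same generation pass, this loss is computed on exactly the tokens the policy sampled.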
We demonstrate the impact of maintaining token-in/token-out (TITO) using a math reasoning agent (with a Python execution tool) with a Qwen3-8B (thinking) backend.
As an ablation, we replace SGLangModel with OpenAIModel and reconstruct token IDs via retokenization. Without TITO, training collapsed before step 50, despite a similar initial reward increase.


[1] Strands Agents SDK: https://github.com/strands-agents/sdk-python
[2] https://blog.vllm.ai/2025/10/22/agent-lightning.html
[3] SGLang: https://docs.sglang.io/
[4] Slime: https://github.com/THUDM/slime/tree/main