A transparent HTTP proxy between your IDE and LLM providers. Lossless compression. Zero config. 50% savings.
The problem
Every API call ships repeated system prompts, bloated tool schemas, and stale conversation history. You’re paying per token for data the model doesn’t need.
As context windows fill, model quality degrades — the “lost in the middle” effect. More tokens doesn’t mean better results. Often it means worse.
Agentic workflows burn through context limits fast, forcing premature truncation or expensive summarization passes that lose information.
Token pricing is simple: more tokens, more cost. Without compression, every optimization you make elsewhere is undermined by raw token waste.
Pipeline
Each request passes through five independent stages. Every layer is toggleable. Together, they compound.
Replaces repeated token subsequences with dictionary-backed placeholders using LZ77-style meta-token compression. Fully lossless.
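The idea can be sketched in a few lines of Python. This is an illustrative character-level toy, not the proxy's implementation (which operates on token sequences); all names and parameters here are assumptions. The round trip is exact.

```python
def compress(text: str, min_len: int = 12, max_entries: int = 32):
    """Replace repeated substrings with dictionary-backed placeholders."""
    dictionary: dict[str, str] = {}
    for n in range(max_entries):
        # Count every substring of length min_len (O(n^2) toy scan).
        counts: dict[str, int] = {}
        for i in range(len(text) - min_len + 1):
            sub = text[i:i + min_len]
            if "\x00" not in sub:  # never re-count placeholder bytes
                counts[sub] = counts.get(sub, 0) + 1
        best = max(counts, key=counts.get, default=None)
        if best is None or counts[best] < 2:
            break  # nothing left that repeats
        key = f"\x00{n}\x00"  # sentinel placeholder, assumed absent from input
        dictionary[key] = best
        text = text.replace(best, key)
    return text, dictionary

def decompress(text: str, dictionary: dict[str, str]) -> str:
    # Placeholders never nest (sentinel-bearing substrings are skipped
    # during counting), so expansion order is arbitrary.
    for key, sub in dictionary.items():
        text = text.replace(key, sub)
    return text
```

A repeated system prompt compresses and reconstructs exactly:

```python
prompt = ("You are an expert coding assistant. " * 4) + "Summarize the diff."
small, d = compress(prompt)
assert decompress(small, d) == prompt and len(small) < len(prompt)
```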
Two-tool architecture: find_tool discovers relevant tools, call_tool loads schemas on demand. No more sending all schemas upfront.
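A minimal sketch of the two-tool pattern. find_tool and call_tool come from the description above; the registry contents and helper shapes are assumptions. The model only ever sees these two tools, and full schemas surface on demand.

```python
REGISTRY = {
    "get_weather": {
        "description": "Current weather for a city",
        "schema": {"type": "object",
                   "properties": {"city": {"type": "string"}}},
        "fn": lambda args: {"city": args["city"], "temp_c": 21},
    },
    "search_code": {
        "description": "Search the repository for a pattern",
        "schema": {"type": "object",
                   "properties": {"pattern": {"type": "string"}}},
        "fn": lambda args: ["src/main.rs:42"],
    },
}

def find_tool(query: str) -> list[dict]:
    """Stage 1: return only names and one-line descriptions, no schemas."""
    q = query.lower()
    return [{"name": name, "description": t["description"]}
            for name, t in REGISTRY.items()
            if q in name or q in t["description"].lower()]

def call_tool(name: str, arguments: dict):
    """Stage 2: the full schema is loaded only for the tool actually used."""
    return REGISTRY[name]["fn"](arguments)
```

Discovery costs a handful of tokens per match instead of a full schema per registered tool.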
Embeds each request and matches against cached responses using vector similarity. On a hit, the LLM is bypassed entirely.
Compresses multi-turn agent conversation histories through heuristic deduplication and tool result truncation.
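The two heuristics can be sketched as a single pass over the message list. This is a simplified illustration; the parameter names and message shape are assumptions.

```python
def compress_history(messages: list[dict], keep_last: int = 4,
                     tool_budget: int = 200) -> list[dict]:
    seen: set[tuple[str, str]] = set()
    out = []
    for i, msg in enumerate(messages):
        key = (msg["role"], msg["content"])
        if key in seen:
            continue  # drop exact duplicates (repeated boilerplate)
        seen.add(key)
        recent = i >= len(messages) - keep_last
        if (msg["role"] == "tool" and not recent
                and len(msg["content"]) > tool_budget):
            # Stale tool output rarely matters in full; keep a prefix.
            msg = {**msg, "content": msg["content"][:tool_budget] + "[truncated]"}
        out.append(msg)
    return out
```

Recent turns pass through untouched, so the model keeps full fidelity where it matters most.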
Negotiates a compact tab-separated output format instead of verbose JSON. Same data, dramatically fewer tokens out.
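To see the format gap concretely, compare the two encodings of the same rows (the row data here is invented for illustration):

```python
import json

rows = [
    {"file": "src/main.rs", "lines": 120, "status": "ok"},
    {"file": "src/lib.rs", "lines": 340, "status": "ok"},
    {"file": "tests/e2e.rs", "lines": 85, "status": "skip"},
]

verbose = json.dumps(rows)  # default: keys repeated on every row

# Same data as one header line plus tab-separated rows.
cols = list(rows[0])
compact = "\n".join(
    ["\t".join(cols)] +
    ["\t".join(str(r[c]) for c in cols) for r in rows])

assert len(compact) < len(verbose)
```

JSON repeats every key on every row; the tab-separated form states the keys once, and the savings grow with row count.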
Benefits
Your agent takes more steps before hitting context limits. No premature truncation, no forced summarization mid-task.
Register hundreds of MCP tools without burning context. Schemas load on demand, not upfront.
LLMs degrade as context fills. Lower effective token usage means the model performs better, even with the same real information.
You pay per token. Fitting more into less means the same work costs less. Typical savings of 50%+ on API spend.
Quickstart
Clients
Providers
Stats
$ lessloss stats
Architecture
Request-side compression only: responses stream through untouched.
No TLS termination: plain HTTP on localhost, HTTPS to upstream.
Tower middleware architecture: each layer independently toggleable.
Connection pooling: HTTP/2 multiplexing to upstream providers.
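The Tower-style layering can be sketched in a few lines (Python for brevity; the toy layers and handler are assumptions, not the proxy's real stack). A disabled layer drops out of the composed stack entirely rather than being checked per request.

```python
from typing import Callable

Handler = Callable[[str], str]

def layer(transform: Callable[[str], str], enabled: bool):
    """Wrap a handler with an optional request transform."""
    def wrap(inner: Handler) -> Handler:
        if not enabled:
            return inner  # disabled layers vanish from the stack
        return lambda request: inner(transform(request))
    return wrap

def upstream(request: str) -> str:
    # Stand-in for the HTTPS call to the provider.
    return f"response-to:{request}"

# Compose the stack; each stage is independently toggleable.
handler = upstream
for wrap in reversed([
    layer(str.strip, enabled=True),   # e.g. whitespace normalization
    layer(str.lower, enabled=False),  # disabled: skipped at zero cost
]):
    handler = wrap(handler)

print(handler("  Hello  "))  # prints "response-to:Hello"
```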