Compress token usage
with zero accuracy loss

49-57% token reduction

A transparent HTTP proxy between your IDE and LLM providers. Lossless compression. Zero config. 50% savings.

>50% token reduction
0% accuracy loss
<50ms overhead

The problem

The hidden tax on every LLM call

Redundant context

Every API call ships repeated system prompts, bloated tool schemas, and stale conversation history. You’re paying per token for data the model doesn’t need.

Context rot

As context windows fill, model quality degrades — the “lost in the middle” effect. More tokens don’t mean better results. Often they mean worse.

Agentic burn rate

Agentic workflows burn through context limits fast, forcing premature truncation or expensive summarization passes that lose information.

Cost scales linearly

Token pricing is simple: more tokens, more cost. Without compression, every optimization you make elsewhere is undermined by raw token waste.


Pipeline

Five layers of compression

Each request passes through five independent stages. Every layer is toggleable. Together, they compound.

LTSC
18-47% input reduction

Tools
70-98% tool token reduction

Cache
100% on cache hit

ACON
26-54% peak reduction

TOON
30-61% output reduction

LTSC: Replaces repeated token subsequences with dictionary-backed placeholders using LZ77-style meta-token compression. Fully lossless.

paper
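The core idea can be sketched in a few lines: find token subsequences that repeat, swap each occurrence for a short placeholder, and keep a dictionary so the substitution reverses exactly. This is an illustrative toy, not the proxy's actual implementation; `compress`, `decompress`, and the thresholds are invented for the example.

```python
# Toy dictionary-backed subsequence substitution in the spirit of
# LZ77-style meta-token compression. Names and thresholds are illustrative.
from collections import Counter

def compress(tokens, min_len=3, min_count=2):
    """Replace repeated token subsequences with placeholder meta-tokens."""
    grams = Counter(tuple(tokens[i:i + min_len])
                    for i in range(len(tokens) - min_len + 1))
    repeated = {g for g, c in grams.items() if c >= min_count}
    dictionary, ids = {}, {}   # placeholder -> original subsequence
    out, i = [], 0
    while i < len(tokens):
        gram = tuple(tokens[i:i + min_len])
        if gram in repeated:
            if gram not in ids:
                ids[gram] = f"\u00a7{len(ids)}"   # meta-token placeholder
                dictionary[ids[gram]] = list(gram)
            out.append(ids[gram])
            i += min_len
        else:
            out.append(tokens[i])
            i += 1
    return out, dictionary

def decompress(tokens, dictionary):
    """Expand placeholders back; the round trip is lossless."""
    out = []
    for t in tokens:
        out.extend(dictionary.get(t, [t]))
    return out

tokens = "you are a helpful assistant . you are a helpful assistant .".split()
packed, table = compress(tokens)
assert decompress(packed, table) == tokens   # lossless
assert len(packed) < len(tokens)             # and smaller
```

A repeated system prompt collapses to a couple of placeholders, which is where the input-side savings come from.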

Tools: Two-tool architecture: find_tool discovers relevant tools, call_tool loads schemas on demand. No more sending all schemas upfront.
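A minimal sketch of the pattern: the model only ever sees `find_tool` and `call_tool`; full schemas stay server-side until requested. The registry contents and matching logic here are made up for illustration.

```python
# Sketch of the two-tool pattern: instead of shipping every tool schema
# with each request, the model discovers tools and loads schemas on demand.
# Registry contents are invented for this example.
REGISTRY = {
    "get_weather": {"description": "Look up the current weather for a city",
                    "schema": {"city": "string"}},
    "create_issue": {"description": "Open an issue in the bug tracker",
                     "schema": {"title": "string", "body": "string"}},
}

def find_tool(query: str) -> list[str]:
    """Return names of tools whose description matches the query."""
    q = query.lower()
    return [name for name, meta in REGISTRY.items()
            if any(word in meta["description"].lower() for word in q.split())]

def call_tool(name: str) -> dict:
    """Load a single tool's schema on demand."""
    return REGISTRY[name]["schema"]

# The model first discovers, then loads only what it needs:
names = find_tool("weather")
schema = call_tool(names[0])
assert schema == {"city": "string"}
```

With hundreds of registered tools, only the one or two schemas actually used ever enter the context window.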

Cache: Embeds each request and matches against cached responses using vector similarity. On a hit, the LLM is bypassed entirely.

paper
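The mechanism in miniature: embed the request, compare against stored vectors, and return the cached response when similarity clears a threshold. The hashing "embedding" below stands in for a real embedding model; class and threshold are invented for the sketch.

```python
# Toy semantic cache: embed each request, compare against stored vectors,
# and skip the upstream call on a close-enough hit.
import hashlib, math

DIM = 64

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: hashed bag of words, normalized."""
    v = [0.0] * DIM
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        v[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.entries = []          # (vector, response)
        self.threshold = threshold

    def get(self, prompt):
        q = embed(prompt)
        for vec, response in self.entries:
            if cosine(q, vec) >= self.threshold:
                return response    # cache hit: LLM bypassed entirely
        return None

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

cache = SemanticCache()
cache.put("what is the capital of France", "Paris")
assert cache.get("what is the capital of France") == "Paris"
```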

ACON: Compresses multi-turn agent conversation histories through heuristic deduplication and tool result truncation.

paper
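The two heuristics named above can be sketched together: drop exact-duplicate messages and clip oversized tool results. The threshold and message shape are invented for the example.

```python
# Sketch of history compaction: heuristic deduplication plus tool result
# truncation. MAX_TOOL_CHARS is an invented threshold for illustration.
MAX_TOOL_CHARS = 200

def compact(history: list[dict]) -> list[dict]:
    seen = set()
    out = []
    for msg in history:
        key = (msg["role"], msg["content"])
        if key in seen:
            continue               # heuristic deduplication
        seen.add(key)
        content = msg["content"]
        if msg["role"] == "tool" and len(content) > MAX_TOOL_CHARS:
            content = content[:MAX_TOOL_CHARS] + " …[truncated]"
        out.append({"role": msg["role"], "content": content})
    return out

history = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "tool", "content": "x" * 5000},
    {"role": "system", "content": "You are a coding agent."},  # repeat
]
compacted = compact(history)
assert len(compacted) == 2
assert len(compacted[1]["content"]) < 5000
```

In long agent sessions, repeated instructions and verbose tool output dominate the history, which is why this layer targets peak context size.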

TOON: Negotiates a compact tab-separated output format instead of verbose JSON. Same data, dramatically fewer tokens out.

paper
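The saving is easy to see with uniform records: JSON repeats every key on every row, while a tab-separated layout states the columns once. The encoding details are the proxy's; this shows only the general idea with invented data.

```python
# Why tab-separated output beats verbose JSON for uniform records:
# keys are stated once as a header instead of repeated per row.
import json

records = [
    {"file": "auth.rs", "lines": 120, "status": "ok"},
    {"file": "db.rs",   "lines": 456, "status": "warn"},
]

def to_tsv(rows: list[dict]) -> str:
    cols = list(rows[0])
    lines = ["\t".join(cols)]
    lines += ["\t".join(str(r[c]) for c in cols) for r in rows]
    return "\n".join(lines)

verbose = json.dumps(records)
compact = to_tsv(records)
assert len(compact) < len(verbose)   # same data, fewer characters
```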

Benefits

What compression means for your workflow


Longer sessions

Your agent takes more steps before hitting context limits. No premature truncation, no forced summarization mid-task.


More tools, less waste

Register hundreds of MCP tools without burning context. Schemas load on demand, not upfront.


Better quality

LLMs degrade as context fills. Lower effective token usage means the model performs better, even with the same real information.

Lower cost

You pay per token. Fitting more into less means the same work costs less. Typical savings of 50%+ on API spend.


Quickstart

Three commands. That's it.

terminal
# 1. Start the proxy
lessloss start
# 2. Point your LLM client at lessloss
eval "$(lessloss init zsh)"
# 3. Use your tools as normal
claude "explain this codebase"
codex "fix the auth bug"
aider /add src/

Clients

OpenAI SDK · Anthropic SDK · Claude Code · Codex CLI · aider · Cursor · continue.dev · Gemini CLI

Providers

OpenAI · Anthropic · Google Gemini · Azure · Groq · Together AI · Ollama · OpenRouter · + any OpenAI-compatible API

Stats

Compression in action

live session

$ lessloss stats

Tokens saved:      847,291
Cache hits:        43
Avg compression:   0.47x
Pipeline latency:  <42ms
Sessions active:   12
$

Architecture

How it works

IDE / Agent → LTSC → Tools → Cache → ACON → TOON → LLM Provider

Request-side compression only

Responses stream through untouched

No TLS termination

Plain HTTP on localhost, HTTPS to upstream

Tower middleware architecture

Each layer independently toggleable

Connection pooling

HTTP/2 multiplexing to upstream providers
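The "each layer independently toggleable" point can be sketched as a Tower-style layered pipeline: every stage wraps the next handler, transforms only the request side, and is skipped when disabled. The plumbing below is a hypothetical illustration in Python, not the proxy's Rust internals.

```python
# Sketch of a Tower-style layered pipeline with independently toggleable
# stages. Layer names mirror the diagram; the wiring is invented.
from typing import Callable

Handler = Callable[[dict], dict]

def make_layer(name: str, transform: Callable[[dict], dict]):
    """Wrap the next handler; the transform runs on the request side only."""
    def layer(next_handler: Handler) -> Handler:
        def handle(request: dict) -> dict:
            return next_handler(transform(request))  # responses pass through
        return handle
    return layer

def build_pipeline(upstream: Handler, layers, enabled: dict) -> Handler:
    handler = upstream
    for name, transform in reversed(layers):
        if enabled.get(name, True):      # each layer independently toggleable
            handler = make_layer(name, transform)(handler)
    return handler

layers = [("ltsc",  lambda r: {**r, "ltsc": True}),
          ("cache", lambda r: {**r, "cache": True})]
upstream = lambda r: {"echo": r}

pipeline = build_pipeline(upstream, layers, {"cache": False})
assert pipeline({})["echo"] == {"ltsc": True}   # cache layer skipped
```

Building the chain from the innermost handler outward is what lets a disabled layer vanish entirely instead of becoming a no-op in the hot path.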