Most text-to-SQL demos work great on clean toy schemas with five tables and obvious column names. Real company databases have 200 tables, cryptic column names like usr_st_cd, and business logic baked into magic integers that nobody documented. That gap between demo and reality is what Kosh is trying to close.
Here's how the full stack fits together.
The model
The base model is Gemma 3n E4B, fine-tuned using QLoRA via Unsloth on birdsql/bird23-train-filtered - about 6,600 curated examples from the BIRD-SQL benchmark. BIRD is specifically designed around messy, multi-table queries with external business-logic hints, which made it the right dataset for this use case.
QLoRA keeps the base model frozen and trains small low-rank adapter matrices on top. In practice, this means training a 4B parameter model on a single A100 in Colab Pro without running out of memory. The training config uses rank 16 LoRA adapters attached to all attention and MLP projection layers, with gradient checkpointing to save another 30% VRAM.
model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    use_gradient_checkpointing='unsloth',
    target_modules=[
        'q_proj', 'k_proj', 'v_proj', 'o_proj',
        'gate_proj', 'up_proj', 'down_proj',
    ],
)
Three epochs, cosine learning rate schedule, effective batch size of 8. Total training time was around 60-90 minutes on the A100.
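For concreteness, those hyperparameters map onto the usual Hugging Face trainer fields. A minimal sketch - the learning rate and the per-device/accumulation split are my assumptions, only the epochs, schedule, and effective batch size of 8 are stated above:

```python
# Sketch of the run configuration. Values marked "assumption" are not from
# the post; only epochs, schedule, and effective batch size are stated.
training_args = dict(
    num_train_epochs=3,
    lr_scheduler_type='cosine',
    learning_rate=2e-4,               # assumption: a common QLoRA default
    per_device_train_batch_size=2,    # assumption: one of several splits
    gradient_accumulation_steps=4,    # that give the stated effective size
)

# Effective batch size = per-device batch x gradient accumulation steps.
effective_batch = (training_args['per_device_train_batch_size']
                   * training_args['gradient_accumulation_steps'])
print(effective_batch)  # → 8
```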
After training, I exported to GGUF format using q8_0 quantization - near-lossless quality at roughly a 5GB file - and loaded it into Ollama running on my local NUC. Sub-second query generation, zero cloud dependency, zero per-query API cost.
One thing that tripped me up: the Gemma processor expects the input wrapped in a list, not a bare string. Most tokenizer examples online don't do this, and the error message when you get it wrong is not helpful.
inputs = tokenizer(
    text=[prompt_text],  # needs to be a list
    return_tensors='pt',
    padding=True,
    truncation=True,
)
The schema pruning problem
This is the part I spent the most time on and the part that makes the biggest difference in output quality.
A naive text-to-SQL implementation dumps the entire database schema into the prompt and asks the model to figure it out. This fails in two ways. First, large schemas blow past the context window. Second, irrelevant tables actively confuse the model. When you ask "how many users churned last month," you don't want the model reasoning about your shipping_zones table.
Kosh solves this with semantic retrieval. When a database is connected, every table and column gets embedded using a sentence embedding model. When a query comes in, the question gets embedded and the top-k most semantically relevant schema elements are retrieved. A 200-table schema gets pruned down to 8-12 tables before it ever touches the model.
In practice this cuts prompt token overhead by around 60% and meaningfully reduces hallucination on ambiguous column names. The model works better with less context when that context is the right context.
Keyword and hybrid retrieval are also available as configurable options in the developer settings panel, but semantic is the default. Keyword matching struggles with natural language phrasing - "how many people left" won't match a column called churn_date unless the embeddings understand the semantic relationship.
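The retrieval step itself is a few lines of ranking logic. In the sketch below, the bag-of-words encoder is a deliberately dumb stand-in for the real sentence-embedding model (it is exactly the kind of keyword matcher that fails on "people left" vs churn_date); the top-k cosine ranking is the part that mirrors what Kosh does:

```python
import numpy as np

def toy_embed(text: str, vocab: list[str]) -> np.ndarray:
    """Bag-of-words stand-in for a real sentence-embedding model,
    here only so the pruning logic below runs on its own."""
    tokens = text.lower().split()
    vec = np.array([tokens.count(word) for word in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def prune_schema(question: str, tables: dict[str, str], k: int) -> list[str]:
    """Keep the k tables whose name + description is closest to the question."""
    docs = {name: f'{name} {desc}' for name, desc in tables.items()}
    vocab = sorted({w for text in [question, *docs.values()]
                    for w in text.lower().split()})
    q = toy_embed(question, vocab)
    ranked = sorted(docs, reverse=True,
                    key=lambda name: float(q @ toy_embed(docs[name], vocab)))
    return ranked[:k]

tables = {
    'users': 'users with signup date and churned flag',
    'shipping_zones': 'carrier rates by shipping zone',
    'orders': 'orders placed by users',
}
print(prune_schema('how many users churned last month', tables, k=2))
# → ['users', 'orders']
```

In Kosh the embeddings are computed once at connection time and only the question is embedded per query.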
Table annotations
Schema pruning handles the structural problem. Table annotations handle the semantic one.
No amount of fine-tuning can teach a model that status = 2 means a user churned in your specific system. That's company-specific logic that lives in someone's head or a Notion doc.
Kosh lets you annotate any table or column with plain English notes. "status=2 means churned." "revenue column is in cents, not dollars." These get injected into the context alongside the pruned schema on every query. The model factors them in automatically. It's persistent memory for your database's business logic.
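A sketch of what that injection could look like - the annotation keys and prompt layout here are hypothetical, not Kosh's actual format:

```python
# Hypothetical annotation store: "table.column" keys mapped to plain-English
# notes. In Kosh these persist per connected database.
annotations = {
    'users.status': 'status=2 means churned',
    'payments.revenue': 'revenue column is in cents, not dollars',
}

def build_context(pruned_schema: str, question: str) -> str:
    """Assemble the model prompt: pruned schema + business notes + question."""
    notes = '\n'.join(f'- {target}: {note}' for target, note in annotations.items())
    return (
        f'Schema:\n{pruned_schema}\n\n'
        f'Business notes:\n{notes}\n\n'
        f'Question: {question}'
    )
```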
Text-to-SQL vs text-to-Pandas
Not everything is cleanly expressible in SQL. Window functions get ugly fast. Certain statistical operations are painful. Python with Pandas handles these cases better.
The backend decides which path to take based on the query. Straightforward aggregations and filters go to SQL. Anything that looks like it needs complex reshaping or multi-step calculation gets routed to a Pandas script instead.
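Since the current router is keyword matching on the question (as noted at the end of this post), a toy version looks something like this - the hint list is my own guess, not Kosh's:

```python
# Toy SQL-vs-Pandas router: questions that look like complex reshaping or
# multi-step statistics go to Pandas, everything else to SQL.
PANDAS_HINTS = ('rolling', 'moving average', 'percentile', 'correlation',
                'pivot', 'regression', 'cumulative')

def route(question: str) -> str:
    q = question.lower()
    return 'pandas' if any(hint in q for hint in PANDAS_HINTS) else 'sql'

print(route('total revenue by month'))            # → sql
print(route('rolling 7-day average of signups'))  # → pandas
```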
This creates a security problem immediately.
The WASM sandbox
Executing AI-generated Python on a server is a bad idea. The model can generate code that reads files, makes network requests, or worse.
Kosh runs AI-generated Python inside Pyodide - Python compiled to WebAssembly, running in the browser. The generated script never touches the server. It executes client-side in an environment with no filesystem access and no network access. If the model hallucinates something malicious, it runs into a wall.
I went with WASM over server-side sandboxing because it's lightweight and the isolation guarantee comes from the browser's security model rather than from my own container configuration. Less surface area to get wrong. The tradeoff is that Pyodide has a cold-start cost on first load, but after that it's fast enough that users don't notice.
The self-healing query loop
SQL generation isn't perfect. The model sometimes produces queries with syntax errors or references to columns that got pruned from the context.
When a query fails execution, Kosh feeds the error message back to the model with the original question and asks it to fix the SQL. This loop runs up to three times before surfacing a failure to the user. Most errors resolve on the first retry. The UI surfaces an "Auto-repaired" badge when this happens so the result is transparent - the user can see the original generation wasn't clean.
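The shape of that loop, with hypothetical generate and execute callables standing in for the model call and the database driver:

```python
def run_with_repair(question, generate, execute, max_attempts=3):
    """Run generated SQL, feeding execution errors back to the model.

    `generate(question, error)` and `execute(sql)` are stand-ins for the
    model call and the database driver; names are illustrative.
    """
    sql = generate(question, error=None)
    last_error = None
    for attempt in range(max_attempts):
        try:
            rows = execute(sql)
            return rows, attempt > 0  # attempt > 0 → show the "Auto-repaired" badge
        except Exception as err:
            last_error = str(err)
            sql = generate(question, error=last_error)  # ask the model to fix it
    raise RuntimeError(f'query still failing after {max_attempts} attempts: {last_error}')
```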
The data layer
DuckDB sits underneath everything as a unified query engine. It virtualizes PostgreSQL, MySQL, SQLite, SAP HANA, CSV files, and Excel files into a single SQL interface. The same query generation logic works regardless of what the user connects.
It also means I don't have to write separate connectors and dialect translators for each database type. DuckDB handles the differences.
Infrastructure
The full stack runs in Docker Compose: PostgreSQL for app state, FastAPI backend, Next.js frontend. GitHub Actions handles CI/CD - a push to prod builds the image, pushes it to GHCR, runs Prisma migrations, and redeploys. A dedicated migrate container runs prisma migrate deploy before the app container starts, so the database schema is always in sync at deploy time.
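A hypothetical compose sketch of that ordering - service names and images are illustrative, not Kosh's actual config; the service_completed_successfully condition is the standard Compose way to gate the app on the migration container finishing:

```yaml
services:
  db:
    image: postgres:16
  migrate:
    image: ghcr.io/example/kosh-backend:latest   # illustrative image name
    command: prisma migrate deploy
    depends_on: [db]
  api:
    image: ghcr.io/example/kosh-backend:latest
    depends_on:
      migrate:
        condition: service_completed_successfully  # app waits for migrations
  web:
    image: ghcr.io/example/kosh-frontend:latest
    depends_on: [api]
```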
Database credentials are encrypted with AES-256-GCM before storage. Stripe webhooks are verified cryptographically before any billing state changes. These aren't interesting to build but they matter.
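For the credential encryption, a sketch using the cryptography package's AESGCM primitive - the library choice and key handling are assumptions; in production the key would come from a KMS or environment secret, not be generated in-process:

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumption: in practice this key is loaded from a secret store, not generated here.
key = AESGCM.generate_key(bit_length=256)

def encrypt_credential(secret: str) -> bytes:
    """AES-256-GCM encrypt, prefixing the fresh 96-bit nonce to the ciphertext."""
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, secret.encode(), None)

def decrypt_credential(blob: bytes) -> str:
    """Split off the nonce prefix and decrypt; raises if the blob was tampered with."""
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode()
```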
What I'd do differently
The semantic retrieval works well but embedding quality matters a lot. Generic sentence embeddings don't always capture SQL-specific semantics. Embeddings fine-tuned specifically on schema description tasks would probably do better.
The text-to-Pandas routing is also more heuristic than I'd like. Right now it uses keyword matching on the question. A small classifier trained on query types would be more reliable and less brittle.
Kosh is still in active development. If you want to try it, reach out.
