GroceryGPT is a small language model that's been gently nudged into knowing about fruits and vegetables — picking ripe ones, storing them, what to do with that lonely chayote. This page explains how it's built, how to run it, and how to test it yourself.
We start with an existing tiny model — a "small language model" called SmolLM2-135M. Think of it as a generally-educated recent grad: it knows English, can hold a conversation, but has no special expertise.
We then show it about 65 worked examples of how a grocery expert answers produce questions: picking, storing, substitutions. We don't retrain the whole model, which would need far more memory and compute. Instead we use a technique called LoRA that trains a tiny "adapter" alongside the frozen original. The adapter learns the new style and content; the base model stays untouched.
After training we merge the adapter back into the base, convert the result to the format Ollama uses (called GGUF), and load it into an Ollama server running in Docker. A simple HTML chat page talks to it. That's the whole stack.
Four moving parts in two phases. Build once, then run.
Runs once. Reads the dataset, fine-tunes a copy of the base model using LoRA, merges the trained adapter back in, converts the result to GGUF format, and saves it to a shared folder. Then exits. Takes 5–15 minutes on a modern CPU.
The model server. Loads the GGUF file into memory and exposes an HTTP API on port 11434. When asked a question it streams back the model's answer one token at a time. This is the workhorse that keeps running.
A tiny helper that runs once Ollama is healthy. It tells Ollama "register this GGUF as a model called grocery-slm", waits for confirmation, then exits. This gives us a clean declarative setup — no manual ollama create required.
An nginx server with two jobs: serve the chat HTML page and proxy any /api/* request through to Ollama. The proxy means the browser only talks to one origin, so there's no CORS configuration and no exposed Ollama port to worry about.
Imagine teaching someone a specialty without sending them back to school.
A pre-trained language model is like a person who's read millions of books. They know language, facts, reasoning. But they don't have opinions, expertise, or a specific style — they're a generalist.
Fine-tuning is the process of showing them many examples of how you want them to behave, until they pick up the pattern. Show them 65 examples of "helpful, concise produce advice" and they start producing helpful, concise produce advice.
The catch: updating every weight in a model is expensive. Even this small model has 135 million numbers, each adjusted slightly with every example, and larger models have billions. That's why we use LoRA.
"LoRA: train a tiny patch of new weights, leave the rest alone."
LoRA stands for Low-Rank Adaptation. Instead of touching the 135 million weights of the base model, we add a small set of ~700,000 extra weights — a "patch" — and only train those. The base stays frozen. After training, we mathematically merge the patch into the base so the result is a single regular model file.
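To make the "patch" idea concrete, here is a minimal sketch of the LoRA arithmetic on one tiny matrix. The shapes, rank, and alpha are made-up illustration values, not the project's training settings, and a real run would rely on a training library rather than hand-written lists.

```python
# Minimal LoRA sketch on one small weight matrix, using plain Python lists.
# Shapes, rank, and alpha are illustrative assumptions only.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_param_count(d_out, d_in, rank):
    """Trainable weights in one LoRA pair: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)

def merge_lora(W, A, B, alpha, rank):
    """Fold the adapter into the frozen weights: W' = W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Frozen 2x3 base weights and a rank-1 adapter (the "patch").
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]        # d_out x rank
A = [[0.5, 0.0, -1.0]]    # rank x d_in

merged = merge_lora(W, A, B, alpha=2, rank=1)
```

With these numbers `merged` is `[[2.0, 0.0, 0.0], [2.0, 1.0, -4.0]]`: a single ordinary matrix, which is why the merged model needs no adapter at inference time. The patch itself holds only `lora_param_count(2, 3, 1)` = 5 trainable values versus 6 in the base matrix; the saving becomes dramatic once the matrices are hundreds of rows wide.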
Every line in the dataset is one Q&A pair, formatted in the standard "chat completions" shape Hugging Face expects:
{
"messages": [
{"role": "user", "content": "How do I pick a ripe avocado?"},
{"role": "assistant", "content": "Gently squeeze the avocado in your palm — not with your fingertips, which bruise the flesh. A ripe Hass yields slightly to pressure but isn't mushy..."}
]
}
The training script wraps each example with a system message that defines the persona ("You are GroceryGPT..."), then teaches the model to predict the assistant's reply, given the question. After 3 passes through all 65 examples, it's picked up the pattern.
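The wrapping step can be sketched roughly as follows. This is not the project's training script: `SYSTEM_PROMPT` is a placeholder for the real persona text, `with_system` and `to_chatml` are hypothetical helper names, and the ChatML-style markers are an assumption (in practice the tokenizer's own chat template would do the rendering).

```python
# Sketch: prepend a persona system message and render one example as text.
# SYSTEM_PROMPT is a placeholder; the real persona text is not reproduced here.
SYSTEM_PROMPT = "You are GroceryGPT..."

def with_system(example, system_prompt=SYSTEM_PROMPT):
    """Return a new example dict with a system message placed first."""
    system = {"role": "system", "content": system_prompt}
    return {"messages": [system, *example["messages"]]}

def to_chatml(messages):
    """Render messages with <|im_start|>role ... <|im_end|> markers (assumed format)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

example = {"messages": [
    {"role": "user", "content": "How do I pick a ripe avocado?"},
    {"role": "assistant", "content": "Gently squeeze the avocado in your palm."},
]}

wrapped = with_system(example)
text = to_chatml(wrapped["messages"])
```

The original `example` is left untouched, so the same dataset can be re-wrapped with a different persona without reloading it.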
From raw Q&A text to a model running in Ollama, four steps.
Each step writes its output to a shared ./output/ folder so the next step can read it. The whole thing is wrapped in a single shell script that runs inside the trainer container, so you don't run these commands manually.
What happens between you typing "How do I pick a ripe mango?" and tokens streaming back.
Why streaming?
The model produces one token at a time. If we waited for the full reply we'd see nothing for several seconds, then a wall of text. By streaming, the user sees the answer appear as it's generated — much better feel.
Why nginx in the middle?
Without it the browser would have to call Ollama on a different port, hitting CORS issues. With nginx, the page and the API live at the same origin (localhost:8080).
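Going back to streaming for a moment, here is a small sketch of what consuming a token stream looks like. It assumes the response arrives as newline-delimited JSON, one object per line with a `message.content` fragment and a `done` flag, which is how Ollama's chat endpoint is understood to stream. The `fake_stream` lines are invented sample data, and `iter_tokens` is a hypothetical helper, not code from the chat page.

```python
import json

def iter_tokens(lines):
    """Yield text fragments from NDJSON chunks until a chunk reports done."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        piece = chunk.get("message", {}).get("content", "")
        if piece:
            yield piece
        if chunk.get("done"):
            break

# Invented sample chunks standing in for a live HTTP response body.
fake_stream = [
    '{"message": {"role": "assistant", "content": "Smell"}, "done": false}',
    '{"message": {"role": "assistant", "content": " the stem."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]

reply = "".join(iter_tokens(fake_stream))
```

A UI would append each yielded fragment to the chat bubble as it arrives instead of joining them at the end; joining is only done here to show the final text.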
Three commands. The first is slow (training); the rest are seconds.
docker compose --profile train run --rm trainer
Trains the LoRA adapter, merges it, exports to GGUF. The output lands in ./output/.
docker compose up -d
Brings up Ollama + the chat UI. Watch docker compose logs -f ollama-init to see the model register.
./scripts/smoke.sh
# then open http://localhost:8080
Hits the API directly, asks one question, prints the result. If this passes, you're good.
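If you'd rather poke the API from Python, an equivalent check might look like the sketch below. It is not the contents of scripts/smoke.sh. It assumes the nginx proxy at localhost:8080 forwards /api/generate to Ollama and that the model is registered as grocery-slm; the function names are hypothetical.

```python
import json
import urllib.request

def build_smoke_request(base_url="http://localhost:8080", model="grocery-slm"):
    """Build a non-streaming generate request with one produce question."""
    payload = {
        "model": model,
        "prompt": "How do I pick a ripe mango?",
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_smoke(timeout=60):
    """Send the request and return the reply text. Needs the stack running."""
    with urllib.request.urlopen(build_smoke_request(), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `print(run_smoke())` once the services are up should print a short answer; a connection error usually means the proxy or Ollama isn't ready yet.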
For a more thorough check, the tests/ folder has a pytest suite that hits Ollama and verifies behavior:
pip install -r tests/requirements.txt
pytest tests/ -v
A model is only as good as the questions you throw at it. Here's a structured walkthrough that takes about 10 minutes and exercises every important behavior.
Make sure every service is in its expected state before you start poking the UI.
docker compose ps
You should see ollama as running (healthy), ollama-init as exited (0), and webui as running. The trainer is fine being absent — it's already done its job.
Open http://localhost:8080. You're looking for:
⚠ If the dot is red ("ollama unreachable"), the proxy isn't reaching ollama — check docker compose logs ollama.
Click the "ripe avocado?" chip. Watch the assistant bubble:
⚠ If the whole reply appears at once after a long pause, the nginx proxy is buffering — verify proxy_buffering off is in nginx.conf.
Conversational context lives client-side. Test it with this exchange:
⚠ If the second answer is generic or asks "store what?", check that history in index.html is being appended on each turn.
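The idea behind client-side context is that the full history is resent on every turn. The page itself does this in JavaScript; the Python sketch below only illustrates the bookkeeping. The helper names and the follow-up question are hypothetical, and the payload shape assumes Ollama's chat endpoint.

```python
def record_turn(history, user_text, assistant_text):
    """Append one completed exchange to the running history."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})

def build_chat_payload(history, model="grocery-slm"):
    """Build the request body: the model sees every earlier message."""
    return {"model": model, "messages": list(history), "stream": True}

history = []
record_turn(history, "How do I pick a ripe mango?", "(first reply)")

# The follow-up only makes sense because the mango turn is still in history.
history.append({"role": "user", "content": "And how should I store it?"})
payload = build_chat_payload(history)
```

If history were reset between turns, `payload["messages"]` would contain only the follow-up, and the model would have no way to know what "it" refers to.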
The system prompt instructs the model to redirect off-topic questions. Try these:
A 135M model isn't perfect at this, and occasional leaks happen. What you want is for it to stay in character most of the time. If it always fails, the system prompt isn't being applied (check the Modelfile).
Ask about produce that wasn't in the training set. The model should still respond reasonably, drawing on its base knowledge in the trained style:
Answers won't be as polished as for trained items. That's expected: generalization is where small models trained on small datasets hit their limit.
On a typical modern laptop CPU you should see:
| Metric | Healthy range | Concerning |
|---|---|---|
| Time to first token | < 3 sec | > 10 sec |
| Tokens per second | 15 – 40 tok/s | < 5 tok/s |
| Full reply time | 5 – 20 sec | > 60 sec |
| Memory used by Ollama | ~500 MB – 1 GB | > 4 GB (something's wrong) |
Check docker stats grocery-ollama to watch memory and CPU live.
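To put a number on the tokens-per-second row, one option is to read the timing fields from the last chunk of a response. The sketch below assumes that chunk carries `eval_count` (tokens generated) and `eval_duration` (nanoseconds), as Ollama is understood to report them. The function names are hypothetical, and treating everything at or above 15 tok/s as healthy is one reading of the table.

```python
def tokens_per_second(final_chunk):
    """Generated tokens divided by generation time (eval_duration is in ns)."""
    seconds = final_chunk["eval_duration"] / 1e9
    return final_chunk["eval_count"] / seconds

def rate_speed(tok_per_sec):
    """Classify speed against the table: under 5 is concerning, 15+ is healthy."""
    if tok_per_sec < 5:
        return "concerning"
    if tok_per_sec >= 15:
        return "healthy"
    return "borderline"  # the table leaves 5 to 15 tok/s unclassified

# Invented sample values: 100 tokens generated in 4 seconds.
sample = {"eval_count": 100, "eval_duration": 4_000_000_000}
speed = tokens_per_second(sample)
verdict = rate_speed(speed)
```

With the sample values `speed` is 25.0 and `verdict` is "healthy".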
Curated prompts grouped by what they probe. Click any card to copy. Each card lists what a "good" answer should mention.
When you ask a test prompt, rate the answer on these axes:
Did it mention any expected term?
If "How do I store mushrooms?" doesn't mention paper bags or moisture, the model has missed the point.
Is what it said true?
A small model can confidently make things up. Sanity-check claims that sound suspiciously specific (numbers, dates, names).
Did it stay GroceryGPT?
Concise, warm, practical. Not a generic essay. Not breaking into code or unrelated topics.
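The first axis is the easiest one to automate. A minimal sketch, assuming a simple case-insensitive substring match is good enough; `expected_hits` is a hypothetical helper, and the sample answers are invented:

```python
def expected_hits(answer, expected_terms):
    """Return the expected terms that appear in the answer, ignoring case."""
    text = answer.lower()
    return [term for term in expected_terms if term.lower() in text]

terms = ["paper bag", "moisture"]
hits = expected_hits("Keep mushrooms in a Paper Bag in the fridge.", terms)
misses = expected_hits("Wash them and freeze them.", terms)
```

Here `hits` is `["paper bag"]` and `misses` is empty, so the second answer would fail the check. The other two axes, truthfulness and staying in character, still need a human read.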
Honest about limitations — knowing the failure modes is half the battle.
A 135M model has limited knowledge and will confidently invent numbers, dates, or scientific names. It might tell you watermelons need 47% humidity. They don't, and the model doesn't really know — it's pattern-matching.
Mitigation: treat specific numeric claims with skepticism. The trained Q&A material is reliable; ad-hoc generations less so.
Sometimes the model will answer an off-topic question instead of redirecting. Small models have a weaker grip on system prompts than large ones.
Mitigation: add more "redirect" examples to the dataset, lower temperature in the Modelfile (currently 0.6), or move up to SmolLM2-360M.
If you see the model start looping ("...store in a cool place. Store in a cool place. Store in a cool..."), temperature is too low or there's a sampling issue.
Mitigation: bump temperature to 0.7–0.8, or add repeat_penalty 1.1 to the Modelfile parameters.
The model says "you should consult a nutritionist" instead of actual advice — it's reverting to base-model defaults instead of the trained persona.
Mitigation: train more epochs (try EPOCHS=5), expand the dataset, or check that the LoRA actually merged successfully (look at the file size diff in ./output/merged/).
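Of these failure modes, looping is the one that can be spotted mechanically. Here is a rough sketch, assuming that "the same sentence three times in a row" is a reasonable definition of a loop; the threshold and the `has_repetition_loop` name are illustrative choices, not part of the project's test suite.

```python
import re

def has_repetition_loop(reply, min_repeats=3):
    """True if any sentence occurs min_repeats or more times consecutively."""
    # Split on sentence-ending punctuation and normalize case and whitespace.
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", reply) if s.strip()]
    run = 1
    for prev, cur in zip(sentences, sentences[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False

looping = has_repetition_loop(
    "Keep it dry. Store in a cool place. Store in a cool place. Store in a cool place."
)
normal = has_repetition_loop("Store in a cool place. Use within a week.")
```

`looping` is `True` and `normal` is `False`. A check like this could be run over a batch of replies after changing temperature or repeat_penalty to see whether the change helped.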
Every term you might bump into, in plain English.
<|im_start|>). Get this wrong and the model produces gibberish.