GroceryGPT — How it works

§ 01

The 60-second
version.

We start with an existing tiny model, a "small language model" called SmolLM2-135M. Think of it as a generally educated recent grad: it knows English and can hold a conversation, but has no special expertise.

We then show it about 65 worked examples of how a grocery expert answers produce questions: picking, storing, substitutions. We don't retrain the whole model, which would be slower and need more memory. Instead we use a technique called LoRA that trains a tiny "adapter" alongside the frozen original. The adapter learns the new style and content; the base model stays untouched.

After training we merge the adapter back into the base, convert the result to the format Ollama uses (called GGUF), and load it into an Ollama server running in Docker. A simple HTML chat page talks to it. That's the whole stack.

§ 02

The big picture.

Four moving parts in two phases. Build once, then run.

SERVICE 1

Trainer

Runs once. Reads the dataset, fine-tunes a copy of the base model using LoRA, merges the trained adapter back in, converts the result to GGUF format, and saves it to a shared folder. Then exits. Takes 5–15 minutes on a modern CPU.

SERVICE 2

Ollama

The model server. Loads the GGUF file into memory and exposes an HTTP API on port 11434. When asked a question it streams back the model's answer one token at a time. This is the workhorse that keeps running.

SERVICE 3

Ollama-init

A tiny helper that runs once Ollama is healthy. It tells Ollama "register this GGUF as a model called grocery-slm", waits for confirmation, then exits. This gives us a clean declarative setup — no manual ollama create required.

SERVICE 4

WebUI

An nginx server with two jobs: serve the chat HTML page and proxy any /api/* request through to Ollama. The proxy means the browser only talks to one origin, so there is no CORS configuration and no exposed Ollama port to worry about.

§ 03

What is "fine-tuning",
actually?

Imagine teaching someone a specialty without sending them back to school.

A pre-trained language model is like a person who's read millions of books. They know language, facts, reasoning. But they don't have opinions, expertise, or a specific style — they're a generalist.

Fine-tuning is the process of showing them many examples of how you want them to behave, until they pick up the pattern. Show them 65 examples of "helpful, concise produce advice" and they start producing helpful, concise produce advice.

The catch: updating every weight in the model is expensive. Even this small model has 135 million numbers, each adjusted slightly with every example, and larger models have billions. That's why we use LoRA.

"LoRA: train a tiny patch of new weights, leave the rest alone."

LoRA stands for Low-Rank Adaptation. Instead of touching the 135 million weights of the base model, we add a small set of ~700,000 extra weights — a "patch" — and only train those. The base stays frozen. After training, we mathematically merge the patch into the base so the result is a single regular model file.
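To make the "patch" idea concrete, here is a minimal sketch in plain Python. It assumes a single 4×4 weight matrix and a rank-1 patch; the sizes, the alpha scaling value, and the helper names are illustrative and not taken from the GroceryGPT training script. It shows why the patch is small (rank × (d_in + d_out) numbers instead of d_in × d_out) and what merging means: adding the scaled product B·A to the frozen matrix.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_count(d_out, d_in, rank):
    """Number of trainable weights LoRA adds for one d_out x d_in matrix."""
    return rank * (d_in + d_out)


def merge_lora(W, A, B, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Frozen base weights: a 4x4 identity matrix (16 numbers, never trained).
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Rank-1 patch: B is 4x1 and A is 1x4, so 8 trainable numbers in total.
B = [[0.5], [0.0], [0.0], [0.0]]
A = [[0.0, 2.0, 0.0, 0.0]]

merged = merge_lora(W, A, B, alpha=2, rank=1)

print(lora_param_count(4, 4, rank=1))  # 8 trainable weights vs. 16 frozen
print(merged[0])                       # [1.0, 2.0, 0.0, 0.0]
```

After the merge there is only one matrix again, with the same shape as the original, which is why the result can be saved as a regular model file.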

What an example looks like

Every line in the dataset is one Q&A pair, formatted in the standard "chat completions" shape Hugging Face expects (pretty-printed here for readability):

{
  "messages": [
    {"role": "user",      "content": "How do I pick a ripe avocado?"},
    {"role": "assistant", "content": "Gently squeeze the avocado in your palm — not with your fingertips, which bruise the flesh. A ripe Hass yields slightly to pressure but isn't mushy..."}
  ]
}

The training script wraps each example with a system message that defines the persona ("You are GroceryGPT..."), then teaches the model to predict the assistant's reply, given the question. After 3 passes through all 65 examples, it's picked up the pattern.
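The wrapping step can be sketched roughly as follows. This is an illustration rather than the actual training script: the function names are hypothetical, and the persona text is left elided exactly as quoted above. It prepends a system message to one dataset record, then separates what the model is given (system + user) from the reply it learns to predict.

```python
import json

# Persona text is elided in the source; only its opening words are known.
SYSTEM_PROMPT = "You are GroceryGPT..."


def add_system_message(record, system_prompt=SYSTEM_PROMPT):
    """Return a copy of one dataset record with a system message prepended."""
    system = {"role": "system", "content": system_prompt}
    return {"messages": [system] + list(record["messages"])}


def split_prompt_and_target(record):
    """Separate the messages the model sees from the reply it must predict."""
    *prompt, target = record["messages"]
    if target["role"] != "assistant":
        raise ValueError("last message must be the assistant reply")
    return prompt, target["content"]


# One dataset line, as it would appear on disk (a single JSON object).
line = json.dumps({
    "messages": [
        {"role": "user", "content": "How do I pick a ripe avocado?"},
        {"role": "assistant", "content": "Gently squeeze the avocado in your palm."},
    ]
})

wrapped = add_system_message(json.loads(line))
prompt, target = split_prompt_and_target(wrapped)

print([m["role"] for m in prompt])  # ['system', 'user']
print(target)
```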

§ 04

The build pipeline.

From raw Q&A text to a model running in Ollama, four steps.

Each step writes its output to a shared ./output/ folder so the next step can read it. The whole thing is wrapped in a single shell script that runs inside the trainer container — you don't run these commands manually.

§ 05

Anatomy of a chat reply.

What happens between you typing "How do I pick a ripe mango?" and tokens streaming back.

Why streaming?

The model produces one token at a time. If we waited for the full reply we'd see nothing for several seconds, then a wall of text. By streaming, the user sees the answer appear as it's generated — much better feel.
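A rough sketch of how a client can consume such a stream, written in Python for illustration (the real chat page does this in the browser). It assumes each line is a JSON object with a message.content fragment and a done flag, which is the shape of Ollama's chat streaming responses; the sample lines are hand-written, not captured output, and the function name is hypothetical.

```python
import json


def iter_reply_text(lines):
    """Yield text fragments from an NDJSON chat stream, one JSON object per line."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines between objects
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # the final object marks the end of the reply


sample_stream = [
    '{"message": {"role": "assistant", "content": "Pick"}, "done": false}',
    '{"message": {"role": "assistant", "content": " a mango"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]

reply = "".join(iter_reply_text(sample_stream))
print(reply)  # Pick a mango
```

Because the function is a generator, a UI can append each fragment to the chat bubble as soon as it arrives instead of waiting for the final object.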

Why nginx in the middle?

Without it the browser would have to call Ollama on a different port, hitting CORS issues. With nginx, the page and the API live at the same origin (localhost:8080).

§ 06

Running it yourself.

Three commands. The first is slow (training); the rest are seconds.

STEP 1

~5–15 min · downloads SmolLM2 once

Build the model

docker compose --profile train run --rm trainer

Trains the LoRA adapter, merges it, exports to GGUF. The output lands in ./output/.

STEP 2

~10–30 sec to be ready

Start the demo

docker compose up -d

Brings up Ollama + the chat UI. Watch docker compose logs -f ollama-init to see the model register.

STEP 3

~30 sec sanity check

Verify it works

./scripts/smoke.sh
# then open http://localhost:8080

Hits the API directly, asks one question, prints the result. If this passes, you're good.

BONUS

Run the automated test suite

For a more thorough check, the tests/ folder has a pytest suite that hits Ollama and verifies behavior:

pip install -r tests/requirements.txt
pytest tests/ -v

§ 07

Manual test playbook.

A model is only as good as the questions you throw at it. Here's a structured walkthrough that takes about 10 minutes and exercises every important behavior.

0

Pre-flight check

Make sure all four services are happy before you start poking the UI.

docker compose ps -a

You should see ollama as running (healthy), ollama-init as exited (0), and webui as running. The trainer is fine being absent — it's already done its job.

1

Visual check

Open http://localhost:8080. You're looking for:

✓ Page loads in under a second.
✓ Status indicator (top right) settles to a green dot with "model ready" within ~5 seconds.
✓ Four suggestion chips appear ("ripe avocado?", "fridge tomatoes?", etc.).
✓ Cursor is in the textarea on load.

⚠ If the dot is red ("ollama unreachable"), the proxy isn't reaching ollama — check docker compose logs ollama.

2

Streaming sanity

Click the "ripe avocado?" chip. Watch the assistant bubble:

✓ A "typing" dot animation appears immediately.
✓ Within ~2 seconds the dots are replaced by streaming text — words appearing in chunks, not all at once.
✓ The page auto-scrolls as the answer grows.
✓ Reply finishes in ~5–20 seconds depending on CPU.

⚠ If the whole reply appears at once after a long pause, the nginx proxy is buffering — verify proxy_buffering off is in nginx.conf.

3

Multi-turn memory

Conversational context lives client-side. Test it with this exchange:

Ask: "How do I pick a ripe avocado?" → expect tips about squeezing, color, stem nub.
Follow up: "And how do I store it once I cut it?" → expect storage advice (wrap, lemon juice, fridge).
The follow-up should reference avocados without you naming them again.

⚠ If the second answer is generic or asks "store what?", check that history in index.html is being appended on each turn.
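The client-side bookkeeping behind this test can be sketched as below. The real page keeps its history in JavaScript inside index.html; this Python version is only an illustration, and the function names and payload layout (model, messages, stream) are assumptions based on the chat request shape Ollama accepts. The point is that every request carries the whole conversation so far, so the follow-up question arrives together with the avocado turn.

```python
def build_chat_request(history, user_text, model="grocery-slm"):
    """Append the new user turn and return the payload for the next request."""
    history.append({"role": "user", "content": user_text})
    # Copy the list so later turns do not mutate a payload already sent.
    return {"model": model, "messages": list(history), "stream": True}


def record_reply(history, assistant_text):
    """Store the assistant's finished reply so the next turn can refer to it."""
    history.append({"role": "assistant", "content": assistant_text})


history = []
first = build_chat_request(history, "How do I pick a ripe avocado?")
record_reply(history, "Squeeze gently in your palm.")
second = build_chat_request(history, "And how do I store it once I cut it?")

print(len(first["messages"]), len(second["messages"]))  # 1 3
```

If the append in either function is skipped, the second payload contains only the follow-up question, which produces exactly the "store what?" symptom described above.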

4

Persona check

The system prompt instructs the model to redirect off-topic questions. Try these:

"Write me a Python function to reverse a string" → should politely redirect, not output code.
"What's the capital of France?" → should redirect, not say "Paris".
"Tell me a joke about cars" → should pivot back to produce.

A 135M model isn't perfect at this, and occasional leaks happen. What you want is for it to stay in character most of the time. If it always fails, the system prompt isn't being applied (check the Modelfile).

5

Generalization check

Ask about produce that wasn't in the training set. The model should still respond reasonably, drawing on its base knowledge in the trained style:

"How can I tell if a pear is ripe?" → expect tips about pressing near the neck/stem.
"How do I know if a pomegranate is ready?" → expect heaviness, firm leathery skin.
"What's the easiest way to peel garlic?" → expect smash-with-knife or shake-in-jar.

Answers won't be as polished as for trained items. That's expected: generalization is where small models with small datasets reach their limits.

6

Performance feel

On a typical modern laptop CPU you should see:

Metric	Healthy range	Concerning
Time to first token	< 3 sec	> 10 sec
Tokens per second	15 – 40 tok/s	< 5 tok/s
Full reply time	5 – 20 sec	> 60 sec
Memory used by Ollama	~500 MB – 1 GB	> 4 GB (something's wrong)

Check docker stats grocery-ollama to watch memory and CPU live.
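Tokens per second can also be computed from the timing counters Ollama reports in the final object of a response (eval_count, and eval_duration in nanoseconds). The sketch below assumes those two fields; the helper names, the "borderline" label for the gap between the two table columns, and the sample numbers are illustrative, not measurements.

```python
def tokens_per_second(eval_count, eval_duration_ns):
    """Generation speed from a token count and a duration in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)


def rate_speed(tok_per_sec):
    """Classify generation speed using the thresholds in the table above."""
    if tok_per_sec < 5:
        return "concerning"
    if tok_per_sec >= 15:
        return "healthy"
    return "borderline"  # between the two columns of the table


# Hypothetical final stream object: 120 tokens generated in 6 seconds.
final_chunk = {"done": True, "eval_count": 120, "eval_duration": 6_000_000_000}

speed = tokens_per_second(final_chunk["eval_count"], final_chunk["eval_duration"])
print(speed, rate_speed(speed))  # 20.0 healthy
```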

§ 08

Test prompt library.

Curated prompts grouped by what they probe. Click any card to copy. Each card lists what a "good" answer should mention.

How to score answers

When you ask a test prompt, rate the answer on these axes:

A · TOPICAL

Did it mention any expected term?

If "How do I store mushrooms?" doesn't mention paper bags or moisture, the model has missed the point.

B · CORRECT

Is what it said true?

A small model can confidently make things up. Sanity-check claims that sound suspiciously specific (numbers, dates, names).

C · IN-PERSONA

Did it stay GroceryGPT?

Concise, warm, practical. Not a generic essay. Not breaking into code or unrelated topics.
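Axis A is the easiest one to automate. A minimal sketch, assuming each prompt comes with a list of expected terms; the function name is hypothetical, and plain substring matching is a deliberately crude stand-in for human judgment. Axes B and C still need a person to read the answer.

```python
def is_topical(answer, expected_terms):
    """True if the answer mentions at least one expected term (case-insensitive)."""
    text = answer.lower()
    return any(term.lower() in text for term in expected_terms)


expected = ["paper bag", "moisture"]

good = "Keep mushrooms in a paper bag in the fridge so they can breathe."
vague = "Store them somewhere safe and enjoy them soon."

print(is_topical(good, expected))   # True
print(is_topical(vague, expected))  # False
```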

§ 09

What can go wrong.

Honest about limitations — knowing the failure modes is half the battle.

FAILURE 1

Hallucinated facts

A 135M model has limited knowledge and will confidently invent numbers, dates, or scientific names. It might tell you watermelons need 47% humidity. They don't, and the model doesn't really know — it's pattern-matching.

Mitigation: treat specific numeric claims with skepticism. The trained Q&A material is reliable; ad-hoc generations less so.

FAILURE 2

Persona leakage

Sometimes the model will answer an off-topic question instead of redirecting. Small models have a weaker grip on system prompts than large ones.

Mitigation: add more "redirect" examples to the dataset, lower temperature in the Modelfile (currently 0.6), or move up to SmolLM2-360M.

FAILURE 3

Repetitive output

If you see the model start looping ("...store in a cool place. Store in a cool place. Store in a cool..."), temperature is too low or there's a sampling issue.

Mitigation: bump temperature to 0.7–0.8, or add repeat_penalty 1.1 to the Modelfile parameters.

FAILURE 4

Generic, base-model answers

The model says "you should consult a nutritionist" instead of actual advice — it's reverting to base-model defaults instead of the trained persona.

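Mitigation: train more epochs (try EPOCHS=5), expand the dataset, or check that the LoRA actually merged successfully (the model in ./output/merged/ should answer differently from the untouched base model; file size alone won't tell you, because a merged model is about the same size as the base).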

§ 10

Glossary.

Every term you might bump into, in plain English.

SLM: Small Language Model. Same shape as ChatGPT under the hood, just much smaller — millions of parameters instead of billions. Trades accuracy for speed and the ability to run on a laptop.
Token: The unit a language model reads and writes — usually a word fragment of 2–6 letters. "watermelon" might be one token; "pomegranate" two or three. Models predict one token at a time, then feed it back in to predict the next.
LoRA: Low-Rank Adaptation. A way to fine-tune a model by adding a small "patch" of new weights instead of changing the original. Cheap, fast, and the patch can be merged in or kept separate.
GGUF: A file format for storing models efficiently. It's what llama.cpp and Ollama use. Supports various levels of "quantization" (lossy compression of the weights) for smaller files.
Modelfile: Ollama's recipe format. Says which GGUF to load, what chat template to use, what the system prompt is, and which parameters to set. Like a Dockerfile, but for models.
Chat template: The exact way messages are formatted into a string before the model sees them. Different model families use different templates (special tokens like <|im_start|>). Get this wrong and the model produces gibberish.
System prompt: A hidden message at the start of every conversation that defines the assistant's persona ("You are GroceryGPT..."). The user never sees it.
Temperature: A knob from 0 to ~2 that controls randomness. 0 = always pick the most likely next token (boring, repetitive). 1 = sample naturally. Higher = more creative, riskier. We use 0.6.
NDJSON streaming: Newline-Delimited JSON. Instead of one big response, the server sends one JSON object per line as it produces each token. The browser reads them as they arrive, not all at once.
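To illustrate the Temperature entry above, here is a small sketch of how the knob reshapes next-token probabilities. The three scores are made-up numbers, not real model output, and the function name is hypothetical. Scores are divided by the temperature before being turned into probabilities, so a lower temperature concentrates probability on the top candidate and a higher one spreads it out.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 means plain argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

cool = softmax_with_temperature(logits, 0.6)     # the setting used here
neutral = softmax_with_temperature(logits, 1.0)  # "sample naturally"
warm = softmax_with_temperature(logits, 1.5)

# The top token's share shrinks as the temperature rises.
print(round(cool[0], 2), round(neutral[0], 2), round(warm[0], 2))
```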
