2026-05-13

Replying to chatroom photos automatically with a single multimodal LLM

Replying to chatroom photos automatically with a single multimodal LLM

"Can I auto-reply to photo messages in my chatroom while keeping my operator tone?"

Yes, but it adds one layer of complexity over plain text replies. The model has to "see" the image and then write a response. Whether you split that across two models or run it in a single model changes both the operational overhead and the final tone quality.

Two paths, side by side

Same photo, two pipelines. Top is the legacy serial flow, bottom is the single-call multimodal path that Replyer uses from v0.13.0 onward.

Legacy: vision model + text LLM (serial)

photo
vision caption
~3GB
"two people at a cafe"
persona LLM
~5GB
text reply
RAM ~8GB+ residenttwo-stage serial calls

Current: single multimodal LLM + mmproj

photo
mmproj
~0.5GB
persona LLM
~5GB, same model
text reply
RAM ~5.5GBsingle call

Bottom line: a single multimodal LLM that handles text and image in one call wins on both tone and ops burden.

Where the two-model approach falls short

The classic flow people started with:

  1. User sends a photo.
  2. Vision model (Qwen2.5-VL etc.) generates a caption ("two people sitting in a cafe").
  3. Caption is passed to the persona-equipped text LLM as if it were a text message.
  4. Text LLM writes a tone-matched reply.

What goes wrong:

1. The caption model knows nothing about your persona

Vision captioners write generic descriptions, "cafe", "two people", "sunlight". Your operator-specific gaze ("this cafe looks chill", "those two look close") is gone.

2. Memory adds up

Vision model 3GB + text LLM 5GB = 8GB+ resident at once. On a 16GB laptop that crowds out everything else.

3. The caption strips signal

Subtle details, facial expressions, mood, props in the background, get compressed into 50 characters of caption text.

The single multimodal LLM path

Gemma 3 (4B / 12B / 27B) and Gemma 4 (E2B / E4B / 26B-A4B / 31B) both accept images when you also load an mmproj (multimodal projection) file.

Flow:

  1. User sends a photo.
  2. Persona system prompt + image + (optional text caption) goes to the LLM in a single call.
  3. LLM writes a persona-toned reply directly.

Token budget inside the context window

How photo tokens (256 each) eat into the same n_ctx. The difference between text-only and photo-heavy chatrooms in one chart.

Five photos in an 8192 window leave almost no headroom. Photo-heavy chatrooms should run n_ctx at 16384+.

What mmproj actually is

mmproj is short for "multimodal projection". It is the small neural network that bridges a vision encoder (image → visual tokens) with the LLM's text embedding space.

What llama.cpp does when you send an image:

  1. The user's image gets encoded as a base64 data URI.
  2. mmproj converts the image into ~256 visual tokens (count depends on the model).
  3. Those visual tokens sit in the same embedding space as text tokens.
  4. The LLM generates a response using the combined visual + text context.

Operator-side knobs

1. Confirm mmproj is downloaded

If you only grabbed the base LLM and skipped mmproj, photo messages will not get text responses. In Replyer's Settings, model cards with a "vision" badge mean mmproj is bundled and downloaded.

2. Slight latency bump

Image tokenization adds 2-5 seconds versus text-only requests.

3. Token budget (context window)

As the chart above shows, bump cfg.n_ctx to 16384+ for photo-heavy rooms.

4. Persona should describe how it reacts to images

Adding a one-line guide like "react briefly to photos, then connect to a personal experience" inside agents/casual_chat.yaml pays off in consistency.

Where this matters most

Photo-heavy chatrooms benefit most:

  • Hobby chatrooms (30-60% photos), cafes, workouts, gaming screenshots
  • Parenting / daily life chatrooms, mostly photo-based sharing
  • Travel / photography clubs, nearly every message is an image

Information chatrooms (stocks, real estate, news) sit under 5% photos. A plain text LLM is enough.

For the chatroom-type fit, see info chatroom verticals.

FAQ

Q. Gemma 3 or Gemma 4, which should I pick?

Gemma 4 (E2B / E4B) is the newer release with slightly better Korean and vision quality at the same size. Some older llama.cpp builds fail to load Gemma 4, so Replyer's default presets fall back to Gemma 3 or Qwen on environments where Gemma 4 cannot run.

Q. I was using Qwen2.5-VL + Gemma in two-model serial mode, how do I migrate?

Recent Replyer builds (v0.13.0+) removed the two-model serial flow in favor of a single multimodal LLM. Existing users migrate automatically.

Q. Replies to photos look off or unrelated, what is wrong?

Three usual causes:

  1. The persona prompt lacks a photo-response guide, add one line.
  2. The mmproj file is an older version, re-download from the model card.
  3. The context window is full and image tokens got truncated, bump n_ctx to 16384+.

Q. What about video or GIF messages?

Current multimodal LLMs only handle still images. Videos get reduced to the first frame or fall back to a generic text reply.

Q. What does each image cost in tokens?

For local LLMs (Replyer's default) the per-image cost is zero. See local LLM vs cloud API for the trade-off.

Q. Is mmproj also a GGUF file?

Yes. mmproj-f16.gguf or mmproj-q8_0.gguf, same format as base GGUF models but containing only the vision projection weights. f16 (~500MB) is recommended on typical laptops.

Next step

Grab the build for your OS from the Replyer download page and follow the usage manual for step-by-step setup.