Monthly auto-reply quality audit - six heuristics plus a 30-minute operator checklist, Replyer

"I've had auto-reply running for a month - how do I check it's still working as intended?"

Automation left running unsupervised gets riskier over time. Chatroom mood shifts, personas grow stale, and LLM output quality drifts, yet the consequences tend to surface only later, as member churn or reports. A 30-minute monthly audit is the cheapest insurance against that.

This post combines Replyer's six automated heuristics with operator-only checks into a single monthly routine.

Six automated heuristics

Replyer's Diagnostics → quality tab scans the last 30 days against six criteria. The operator just reads the results. The radar below contrasts the normal ceiling with two simulated operators.

Solid line is the normal ceiling (lower is better, 0-100 scoring). Dashed lines are simulated operator runs. Scores inside the normal ring pass.

1. No-reply rate

Share of incoming messages that auto-reply left unanswered while it was switched on. The normal level varies by chatroom, but a spike beyond what the Korean gate, hourly limits, or night gating would explain is a signal. Likely causes:

Persona triggers are too narrow and reject most messages.
LLM call timeouts are accumulating (model RAM pressure, etc.).
Chatroom topics drift outside the persona's scope.
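
One way to separate an expected no-reply rate from a suspicious one is to count only the misses that the gates do not explain. The following is a minimal sketch, assuming a hypothetical message log in which each incoming message records whether it was answered and, if not, why. The `Incoming` type and the skip-reason strings are illustrative names, not Replyer's actual export format.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical skip reasons that count as "explained" non-replies.
EXPECTED_SKIPS = {"korean_gate", "hourly_limit", "night_gate"}


@dataclass
class Incoming:
    replied: bool
    skip_reason: Optional[str] = None  # None when replied or when no reason was logged


def no_reply_rates(messages: list[Incoming]) -> tuple[float, float]:
    """Return (raw no-reply rate, unexplained no-reply rate) as fractions of all messages."""
    if not messages:
        return 0.0, 0.0
    missed = [m for m in messages if not m.replied]
    # Anything not covered by a known gate (timeouts, trigger rejects, no reason) is unexplained.
    unexplained = [m for m in missed if m.skip_reason not in EXPECTED_SKIPS]
    total = len(messages)
    return len(missed) / total, len(unexplained) / total
```

A rising raw rate with a flat unexplained rate usually just means the gates are doing more work; the unexplained rate is the one worth comparing month to month.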

2. Tone drift

How closely actual responses match the persona's tone guide (e.g. "short and direct"). Frequent mismatches mean the tone instructions are not reaching the LLM intact, or are being outweighed by something else in the prompt. Likely causes:

Few-shot examples contradict the tone guide.
Context window overflow truncates the system prompt.
A model swap (Gemma 4 → Gemma 3) changes the tone behavior.

3. Duplicate replies

Share of near-identical responses sent at different moments in the same chatroom. This is the bot tell that members catch fastest. Likely causes:

Persona has only one templated answer for a question pattern.
Multi-account ops with broken tone-slot mapping.
LLM temperature is too low (narrow variance).

For multi-account variance, see why multiple accounts produce duplicate tone.
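
To make "near-identical" concrete, here is a rough sketch that flags replies whose text is at least 90% similar to another reply in the same sample. It uses `difflib.SequenceMatcher` from the Python standard library; the 0.9 threshold and the `duplicate_share` function name are assumptions for illustration, and Replyer's own heuristic may measure similarity differently.

```python
from difflib import SequenceMatcher
from itertools import combinations


def duplicate_share(replies: list[str], threshold: float = 0.9) -> float:
    """Share of replies that are near-identical to at least one other reply."""
    if not replies:
        return 0.0
    # Normalize case and whitespace so trivial differences do not hide a duplicate.
    normalized = [" ".join(r.lower().split()) for r in replies]
    flagged = set()
    for (i, a), (j, b) in combinations(enumerate(normalized), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            flagged.update((i, j))
    return len(flagged) / len(replies)
```

The pairwise comparison is quadratic in the number of replies, which is fine for a monthly sample of a few hundred but would need a cheaper method for larger logs.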

4. Length anomalies

Share of responses with outlier length (too short or too long). Normal is 1-3 sentences. Outlier causes:

Frequent 5+ sentence replies → persona prompt missing a length guide.
Frequent 1-2 word replies → LLM output truncation (max_tokens too low).
Completely empty strings → LLM call error fallback.
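
The three outlier types above can be told apart with a few lines of code. This is a minimal sketch under stated assumptions: sentences are split on `.`, `!`, and `?`, which is crude and will miscount replies that end without punctuation, and `classify_length` is a hypothetical helper, not part of Replyer.

```python
import re


def classify_length(reply: str) -> str:
    """Label a reply as 'empty', 'too_short', 'too_long', or 'normal'."""
    text = reply.strip()
    if not text:
        return "empty"  # likely an LLM call error fallback
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(sentences) >= 5:
        return "too_long"  # persona prompt may be missing a length guide
    if len(text.split()) <= 2:
        return "too_short"  # possible truncation (max_tokens too low)
    # The text names 1-3 sentences as normal and 5+ as an outlier;
    # 4-sentence replies are left unflagged here.
    return "normal"
```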

5. Banned phrase detection

Share of responses that include banned terms or words the operator never uses. Likely causes:

Persona's hard-banned phrase list is empty.
Persona uses vocabulary outside the operator's voice (e.g. English acronyms the operator avoids).
Chatroom policy violations (ads, profanity, politics).
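
A banned-phrase scan is essentially a case-insensitive search against a list. The sketch below assumes the operator maintains that list by hand; `banned_hits` and `banned_share` are illustrative names. It matches whole words only, so a banned term with a suffix or particle attached directly to it would be missed.

```python
import re


def banned_hits(reply: str, banned: list[str]) -> list[str]:
    """Return the banned phrases that appear in the reply as whole words."""
    hits = []
    for phrase in banned:
        # Lookarounds keep "asap" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, reply, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


def banned_share(replies: list[str], banned: list[str]) -> float:
    """Share of replies containing at least one banned phrase."""
    if not replies:
        return 0.0
    return sum(1 for r in replies if banned_hits(r, banned)) / len(replies)
```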

6. Response latency

Average LLM call time. If it exceeds the auto-reply countdown, the auto mode stops working as intended. Causes:

Model size near RAM ceiling (memory pressure).
A full context window forces truncation and prompt recomputation on every call.
A nearly full disk leaves swap with no room, so memory paging slows down.
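
The comparison against the countdown can be sketched as follows. The function assumes you have a list of per-call latencies in seconds and know the countdown length; both inputs and the `latency_report` name are hypothetical. It also reports the share of individual calls over the countdown, because an acceptable average can hide a slow tail.

```python
from statistics import mean


def latency_report(latencies_s: list[float], countdown_s: float) -> dict:
    """Summarize LLM call times against the auto-reply countdown."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    avg = mean(latencies_s)
    over = sum(1 for t in latencies_s if t > countdown_s) / len(latencies_s)
    return {
        "average_s": avg,
        "share_over_countdown": over,
        "exceeds_countdown": avg > countdown_s,
    }
```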

Four operator-only checks

Things heuristics cannot catch. The operator looks at the live chatroom and decides.

1. Member reaction patterns

Share of auto-replies that get a follow-up reaction (emoji, reply, follow-on message). If the share is down from a month ago, replies are probably reading as less natural. How to check:

Take the latest 50 auto-replies, count member follow-ups.
Compare with 50 from a month ago.
If down, revisit the persona prompt or rate limits.
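
The month-over-month comparison is simple arithmetic once the follow-ups are tallied. A small sketch, assuming the operator has marked each sampled auto-reply as `True` (got a follow-up) or `False` by hand; the function names are illustrative.

```python
def follow_up_rate(got_follow_up: list[bool]) -> float:
    """Fraction of sampled auto-replies that drew a member reaction."""
    if not got_follow_up:
        raise ValueError("need at least one reply in the sample")
    return sum(got_follow_up) / len(got_follow_up)


def reaction_change(latest: list[bool], month_ago: list[bool]) -> float:
    """Change in follow-up rate; negative means fewer reactions than a month ago."""
    return follow_up_rate(latest) - follow_up_rate(month_ago)
```

With samples of 50, a single reply moves the rate by 2 percentage points, so small changes are likely noise rather than a trend.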

2. Persona prompt freshness

Personas should track chatroom drift. Is the persona written a month ago still aligned with current topics?

Are recent chatroom topics inside the persona's scope?
Do the few-shot examples still feel natural for today's vibe?
Has the operator's own voice shifted since?

3. Rate limit fit

Do per-hour / per-minute caps still match chatroom traffic?

Auto-replies pile up in queue and rarely fire → cap too low.
Auto-replies fire more often than members chat → cap too high.
Adjust as chatroom size and topic frequency change.
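
The two symptoms above can be turned into a rough check. This sketch assumes three counts for the same period are available: replies that entered the queue, replies actually sent, and messages written by members. The 0.5 fire ratio is an arbitrary illustrative threshold, and `rate_limit_fit` is a hypothetical helper.

```python
def rate_limit_fit(queued: int, fired: int, member_messages: int,
                   min_fire_ratio: float = 0.5) -> str:
    """Classify the cap as 'cap_too_low', 'cap_too_high', or 'ok'."""
    # Most queued replies never get sent: the cap is throttling too hard.
    if queued > 0 and fired / queued < min_fire_ratio:
        return "cap_too_low"
    # The bot sends more messages than the members write: the cap is too loose.
    if fired > member_messages:
        return "cap_too_high"
    return "ok"
```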

4. Night gating and skip probability

Does the operating-hours policy still match the chatroom's current rhythm?

Are night-time hours (e.g. 11pm-7am) aligned with chatroom dead hours?
Is the skip probability (let members talk among themselves) calibrated?
Adjust if the chatroom mood changed (quiet rooms → higher skip).

See the 24-hour chatroom night boundary for night handling.
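
One way to test whether the night window still lines up with dead hours is to measure how much member traffic falls inside it. The sketch below assumes a list of 24 hourly message counts (index 0 is midnight to 1am) gathered from the chatroom; the function names and the 23-to-7 default are illustrative. The window check handles ranges that wrap past midnight.

```python
def in_night_window(hour: int, start_hour: int = 23, end_hour: int = 7) -> bool:
    """True if hour falls in [start_hour, end_hour), allowing wrap past midnight."""
    if start_hour <= end_hour:
        return start_hour <= hour < end_hour
    return hour >= start_hour or hour < end_hour


def traffic_inside_window(hourly_counts: list[int],
                          start_hour: int = 23, end_hour: int = 7) -> float:
    """Share of member messages that arrive during the night window."""
    if len(hourly_counts) != 24:
        raise ValueError("expected 24 hourly counts")
    total = sum(hourly_counts)
    if total == 0:
        return 0.0
    inside = sum(c for h, c in enumerate(hourly_counts)
                 if in_night_window(h, start_hour, end_hour))
    return inside / total
```

If a large share of traffic lands inside the window, the chatroom's quiet hours have probably moved and the window is worth shifting.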

30-minute checklist

□ Diagnostics → quality tab, read the six heuristic results (5 min)
□ Note signals on no-reply / tone drift / duplication / length (5 min)
□ Activity page, scan latest 50 responses + member reactions (5 min)
□ Persona prompt last-edit date + check chatroom topic alignment (5 min)
□ Settings page, review rate limits / night gating / skip probability (3 min)
□ Note adjustments needed; apply quick ones immediately (5 min)
□ Schedule a separate session for big changes (persona rewrite, etc.) (2 min)

Signal → action matrix

Action mapping for each heuristic signal. The gauges below show normal / warn / risk bands (green/amber/red) with a simulated audit's current position (blue marker).

[Figure: heuristic gauges for a simulated audit. Green = normal, amber = watch, red = risk.]

Signal	Likely cause	Action
No-reply rate spike	Persona triggers too narrow	Loosen trigger patterns
No-reply rate spike	LLM timeouts	Check model / RAM
Frequent tone drift	Few-shot examples contradictory	Rewrite examples
Frequent tone drift	Context truncation	Increase n_ctx
Frequent duplicates	Temperature too low	Raise temperature (e.g. 0.7 → 0.85)
Frequent duplicates	Narrow persona answer space	Diversify few-shot examples
5+ sentence replies	Length guide missing	Add a length guide to the persona (e.g. "1-2 sentences")
1-2 word replies	max_tokens too low	Raise max_tokens (e.g. 256 → 512)
Banned phrase hits	Hard-banned list empty	Add forbidden terms
Slow responses	RAM pressure	Smaller model
Slow responses	Disk swap	Clean up disk or upgrade
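
For operators who keep audit notes in a script, the matrix translates directly into a lookup table. This is only a transcription of the rows above into a Python dictionary; the signal keys such as `no_reply_spike` are made-up identifiers, not names used by Replyer.

```python
# signal key -> list of (likely cause, action), copied from the matrix above
ACTIONS = {
    "no_reply_spike": [("Persona triggers too narrow", "Loosen trigger patterns"),
                       ("LLM timeouts", "Check model / RAM")],
    "tone_drift": [("Few-shot examples contradictory", "Rewrite examples"),
                   ("Context truncation", "Increase n_ctx")],
    "duplicates": [("Temperature too low", "0.7 -> 0.85"),
                   ("Narrow persona answer space", "Diversify few-shot examples")],
    "long_replies": [("Length guide missing", 'Persona: "1-2 sentences"')],
    "short_replies": [("max_tokens too low", "256 -> 512")],
    "banned_phrase_hits": [("Hard-banned list empty", "Add forbidden terms")],
    "slow_responses": [("RAM pressure", "Smaller model"),
                       ("Disk swap", "Clean up disk or upgrade")],
}


def suggest(signals: list[str]) -> list[tuple[str, str, str]]:
    """Return (signal, likely cause, action) rows for every signal that fired."""
    return [(s, cause, action)
            for s in signals
            for cause, action in ACTIONS.get(s, [])]
```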

What skipping audits gets you

Push the monthly audit to once every six months and watch the compounding damage:

Persona prompt drifts away from chatroom topic → naturalness erodes gradually → members churn quietly.
Duplicate response rate climbs → someone catches on and outs it to the whole chatroom.
Model memory use creeps up → the machine starts swapping one night, and the reply queue is hopelessly backed up by morning.
Night gating no longer matches the chatroom's actual quiet hours → replies fire when nobody's around and go silent when members are active.

A 30-minute monthly check catches each of these before it compounds.

FAQ

Q. Is monthly too frequent?

Depends on volume. Light ops (under 5 replies/hour) can drop to quarterly. Heavy ops (20+ replies/hour, multi-chatroom, multi-account) should run every 2 weeks.

Q. Are the six heuristics 100% accurate?

No. They are pattern detectors; the operator makes the final call. A "tone drift" hit might be intentional persona evolution. Treat heuristics as starting points, not verdicts.

Q. What's the priority order for fixes?

Three tiers:

Member safety - banned phrases or policy violations (act now).
Bot exposure risk - duplicates, tone drift (within 1 week).
Response quality - length anomalies, latency (within 1 month).

Exposure risk compounds quickly, so it ranks right after safety.
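
The three tiers can be applied mechanically to a list of findings. A minimal sketch, with hypothetical finding keys; the no-reply rate is not assigned a tier in the list above, so it is left out here rather than guessed.

```python
# Tier 0 = member safety, 1 = bot exposure risk, 2 = response quality
TIER = {
    "banned_phrases": 0,
    "policy_violation": 0,
    "duplicates": 1,
    "tone_drift": 1,
    "length_anomalies": 2,
    "latency": 2,
}
DEADLINE = {0: "act now", 1: "within 1 week", 2: "within 1 month"}


def prioritize(findings: list[str]) -> list[tuple[str, str]]:
    """Sort findings by tier and attach the deadline for each."""
    ordered = sorted(findings, key=lambda f: TIER[f])  # stable within a tier
    return [(f, DEADLINE[TIER[f]]) for f in ordered]
```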

Q. If all heuristics look clean, am I done?

Not quite - the four operator-only checks remain. Even with clean automated signals, chatroom-mood shifts often hide in member-reaction patterns that only a human can read.

Q. What if one operator runs 5+ chatrooms?

That is 30 min × 5 = 2 h 30 min in one sitting. Spread the audits across 5 weeks (one chatroom per week), or audit the two most active chatrooms monthly and the rest quarterly.

Q. What other metrics would be useful?

On the next-round candidate list:

Auto-counted member follow-up rate (emoji / reply / follow-on).
Per-persona acceptance rate in manual mode (operator approves vs rejects).
Per-chatroom auto-reply share (bot share of total messages).

Adding these would push the heuristic count to 8-10.

Q. After a big persona rewrite, do I need to re-run A/B?

Strongly recommended. Big rewrites can shift tone enough that you should validate before letting them loose. See persona A/B testing.

Q. Where do I store audit results?

A note in Notion / Apple Notes / wherever you keep ops notes, one page per month. Heuristic results + fixes + outcomes over time become a time-series of your chatroom ops. Replyer does not provide a built-in audit log page yet (on the candidate list).
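
If you would rather keep the record in a plain file than in a notes app, one JSON object per audit is enough to build the time series. This is a sketch of one possible record shape; the `AuditRecord` fields are assumptions chosen to mirror "heuristic results + fixes + outcomes", not a format Replyer reads or writes.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AuditRecord:
    month: str                 # e.g. "2025-01"
    chatroom: str
    scores: dict               # heuristic name -> score on the 0-100 scale
    fixes: list = field(default_factory=list)
    outcome: str = ""          # filled in at the next audit


def to_json_line(record: AuditRecord) -> str:
    """Serialize one audit as a single JSON line, keeping non-ASCII text readable."""
    return json.dumps(asdict(record), ensure_ascii=False)


def from_json_line(line: str) -> AuditRecord:
    """Rebuild an AuditRecord from a line produced by to_json_line."""
    return AuditRecord(**json.loads(line))
```

Appending each line to a file gives one row per audit, which is easy to load later for comparison across months.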

Next step

Grab the build for your OS from the Replyer download page and follow the usage manual for step-by-step setup.