The first version of Cruto's interview scoring used a generic "score this answer 0-100" prompt to GPT-4. It was usable — but it scored the wrong thing. It rewarded answers that sounded structured. It didn't reward answers that handled the underlying trade-off.
The fix was a rubric. Not the open-ended "rate clarity, depth, and structure" rubric you see in interview-prep books; those are too vague for an LLM to apply consistently. We built one with named axes, each with a positive and negative anchor, and we score each axis independently before combining the axis scores into one number.
The rubric (per persona)
- Domain accuracy — does the answer use the right primitives for the role's domain?
- Trade-off articulation — does the answer name what it's giving up, not just what it's choosing?
- Constraint sensitivity — when the constraint changed, did the answer change?
- Communication clarity — minimal hedging, structured ramp from premise to conclusion?
- Recovery from challenge — when the interviewer pushed back, did the answer adjust or repeat itself?
Each axis gets 0-20. Sum of the five axes → 0-100 score. The numeric output is published; the per-axis breakdown drives the debrief commentary.
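A minimal sketch of the aggregation step, assuming the five 0-20 axis scores are added together (five times their mean) to reach the 0-100 figure. The axis keys, `MAX_PER_AXIS`, and `total_score` are illustrative names, not Cruto's actual code.

```python
# Hypothetical axis keys mirroring the five rubric axes listed above.
AXES = (
    "domain_accuracy",
    "tradeoff_articulation",
    "constraint_sensitivity",
    "communication_clarity",
    "recovery_from_challenge",
)
MAX_PER_AXIS = 20  # each axis is scored 0-20


def total_score(axis_scores: dict[str, int]) -> int:
    """Combine independently scored axes into a single 0-100 score."""
    missing = set(AXES) - set(axis_scores)
    extra = set(axis_scores) - set(AXES)
    if missing or extra:
        raise ValueError(f"missing axes: {sorted(missing)}, unknown axes: {sorted(extra)}")
    for axis, value in axis_scores.items():
        if not 0 <= value <= MAX_PER_AXIS:
            raise ValueError(f"{axis} must be within 0-{MAX_PER_AXIS}, got {value}")
    # Five axes at 0-20 each add up to the 0-100 range.
    return sum(axis_scores[axis] for axis in AXES)
```

For example, axis scores of 15, 12, 18, 10, and 14 give a total of 69. The per-axis dictionary stays available for the debrief commentary.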
Why the per-persona variant matters
The HR Recruiter persona scores behavioral answers, so there "trade-off articulation" gets renamed "self-awareness about ambiguity." The CTO persona scores leadership answers, so there "constraint sensitivity" becomes "can you tell me what would have to be true for the opposite decision?"
The point is the rubric stays the same shape; the words on the axes match what the persona cares about. That keeps the scoring framework portable across personas without throwing away the structure.
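One way to keep the shape fixed while the wording varies is to store a base set of axis labels and a small per-persona override table. The sketch below assumes that layout; the persona keys, `BASE_LABELS`, `PERSONA_OVERRIDES`, and `labels_for` are hypothetical names for illustration.

```python
# Base axis labels; the keys define the rubric's fixed shape.
BASE_LABELS = {
    "domain_accuracy": "Domain accuracy",
    "tradeoff_articulation": "Trade-off articulation",
    "constraint_sensitivity": "Constraint sensitivity",
    "communication_clarity": "Communication clarity",
    "recovery_from_challenge": "Recovery from challenge",
}

# Per-persona wording changes, taken from the two examples in the text.
PERSONA_OVERRIDES = {
    "hr_recruiter": {
        "tradeoff_articulation": "Self-awareness about ambiguity",
    },
    "cto": {
        "constraint_sensitivity": (
            "Can you tell me what would have to be true for the opposite decision?"
        ),
    },
}


def labels_for(persona: str) -> dict[str, str]:
    """Return axis labels for a persona without changing the set of axes."""
    overrides = PERSONA_OVERRIDES.get(persona, {})
    unknown = set(overrides) - set(BASE_LABELS)
    if unknown:
        # An override may rename an axis but must never add a new one.
        raise ValueError(f"override for unknown axes: {sorted(unknown)}")
    return {**BASE_LABELS, **overrides}
```

Because overrides can only rename existing keys, every persona produces the same five axes, so the same scoring and aggregation code applies to all of them.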
Watermarking
One open question with any LLM-graded test: how do you detect a candidate using ChatGPT to draft answers? We watermark the test prompts with random tokens that identify the question as coming from Cruto. If an answer references the watermark, the prompt very likely round-tripped through ChatGPT, and we flag the cert as suspect.
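A rough sketch of the watermark check, under stated assumptions: the token format, the way it is appended to the prompt, and the function names are all invented here for illustration, since the post does not describe how the watermark is actually embedded.

```python
import secrets


def make_watermark(nbytes: int = 8) -> str:
    """Generate a random token that is unlikely to appear in an answer by chance."""
    # The "crt-" prefix is an assumed convention, not Cruto's real format.
    return "crt-" + secrets.token_hex(nbytes)


def watermark_prompt(question: str, token: str) -> str:
    """Attach the token to the question text (assumed embedding: a trailing tag)."""
    return f"{question}\n[ref:{token}]"


def answer_references_watermark(answer: str, token: str) -> bool:
    """Return True if the answer echoes the token, ignoring letter case."""
    return token.lower() in answer.lower()
```

A match is a signal to review, not proof: a candidate can strip the tag before pasting the prompt, so an answer without the token says nothing either way.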
None of this is perfect. But it's better than the generic scorer we started with, by enough that the debrief commentary is now the thing users actually quote back at us.