AI-Powered Report Cards
95%+ of reports generated without staff intervention beyond a quick review. Per-report processing time cut from 8–10 minutes to under 2.
The Problem
Goose is a white-label CRM for educational institutions. Schools using the platform needed to generate individualized report cards for every student — narrative-style comments, not just letter grades. The existing process was entirely manual: staff would spend 8–10 minutes per report, writing personalized assessments for each student across multiple subjects.
For a school with 500 students, that's roughly 80 hours of staff time per reporting cycle (500 reports × ~10 minutes each). The quality was also inconsistent — some reports were thoughtful, others were copy-pasted templates. Schools were asking for help, and the CRM had no answer.
The opportunity was clear: use AI to generate draft report cards from existing student data (grades, attendance, teacher notes), then let staff review and approve. The question was how to do it without losing the personal, human quality that parents expect.
My Role
I owned this feature end-to-end — from identifying the opportunity through to rollout and quality measurement. I defined the product spec, designed the AI pipeline architecture with engineering, wrote the prompt engineering strategy, and ran the pilot with three partner schools. No dedicated ML team — I was the PM translating business requirements into engineering-ready specs for our full-stack team.
Key Decisions
LLM generation with structured input, not free-form.
Rationale: The temptation was to feed raw teacher notes into an LLM and hope for good output. I pushed for a structured approach instead: we built an input schema that captured grades, behavioral observations, and teacher highlights per subject. The LLM then worked from structured data — not messy free-text.
Trade-off: This required more upfront work from our team (building the schema, migrating existing data) and slightly more structure from teachers when inputting data. But it dramatically improved output consistency and made the AI's behavior predictable and debuggable.
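To make the structured approach concrete, here is a minimal sketch of what such an input schema could look like, written as Python dataclasses. All field and class names are illustrative assumptions, not the actual Goose schema; the point is that the LLM consumes a predictable, serialized block rather than raw free-text notes.

```python
from dataclasses import dataclass, field

@dataclass
class SubjectRecord:
    # Illustrative fields only; the real schema is internal to the product.
    subject: str
    grade: str                      # e.g. "A-"
    attendance_pct: float           # 0-100
    behavioral_notes: list[str] = field(default_factory=list)
    teacher_highlights: list[str] = field(default_factory=list)

@dataclass
class ReportInput:
    student_name: str
    term: str
    subjects: list[SubjectRecord] = field(default_factory=list)

    def to_prompt_context(self) -> str:
        """Serialize the structured data into a stable, debuggable block
        that becomes the LLM's source of truth for this student."""
        lines = [f"Student: {self.student_name}", f"Term: {self.term}"]
        for s in self.subjects:
            lines.append(
                f"- {s.subject}: grade {s.grade}, attendance {s.attendance_pct:.0f}%"
            )
            for note in s.teacher_highlights:
                lines.append(f"  highlight: {note}")
        return "\n".join(lines)
```

Because every generation starts from the same serialized shape, a bad output can be traced back to its exact inputs, which is what makes the AI's behavior "predictable and debuggable."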
Human-in-the-loop review, not full automation.
Rationale: We could have shipped fully automated reports — the AI output was good enough. But parents trust report cards because a teacher wrote them. Removing the human entirely would have created a trust problem we couldn't recover from.
Trade-off: Staff still spend ~2 minutes per report reviewing and adjusting. That's 80% less time, not 100%. But the trust equation is intact — staff sign off on every report, and parents never question whether a human was involved.
Tone calibration per school, not one-size-fits-all.
Rationale: We tested early with a single prompt style and got immediate pushback — a Montessori school and a traditional prep school want fundamentally different tones. I designed a tone calibration step during onboarding: schools provide 3–5 example reports they consider "good," and we use those to tune the generation style.
Trade-off: Added complexity to onboarding (an extra step that takes ~15 minutes). But it eliminated the #1 objection from pilot schools and made the output feel native to each institution.
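Mechanically, tone calibration of this kind is often implemented as few-shot prompting: the school's 3–5 exemplar reports are placed ahead of the student data in the generation prompt. A sketch under that assumption (the prompt wording and function name are illustrative, not the production prompt):

```python
def build_prompt(school_examples: list[str], student_context: str) -> str:
    """Assemble a generation prompt: school-specific exemplar reports first
    (few-shot tone calibration), then the structured student data."""
    parts = [
        "You are drafting a narrative report card. Match the tone and style "
        "of the example reports below as closely as possible."
    ]
    for i, example in enumerate(school_examples, 1):
        parts.append(f"Example {i}:\n{example}")
    parts.append(f"Student data:\n{student_context}")
    parts.append("Draft the report now, using only facts from the student data.")
    return "\n\n".join(parts)
```

Keeping the exemplars per school means the same pipeline produces Montessori-style warmth for one customer and prep-school formality for another without any model changes.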
How I Measured Success
What I could measure: generation coverage and review time. 95%+ of reports went out without staff intervention beyond review, and per-report time dropped from 8–10 minutes to under 2 (an ~80% reduction).
What I couldn't measure (yet): Parent satisfaction with the AI-generated reports vs. the old manual ones. We designed a feedback mechanism (thumbs up/down on the parent portal) but hadn't collected enough data by the time I moved on. Early anecdotal signals were positive — schools reported fewer parent complaints about vague or unhelpful comments.
What I'd Do Differently
I'd invest in automated quality scoring earlier. We relied on manual review from school staff to catch bad outputs, but as volume scales, that doesn't hold. I'd build an automated rubric — checking for hallucinations, tone consistency, and factual accuracy against the input data — and flag reports that fail the check before they reach the reviewer.
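A first version of that rubric doesn't need a model at all. A minimal sketch (hypothetical function, illustrative thresholds) of cheap deterministic checks that flag a draft for extra scrutiny before it reaches the reviewer:

```python
def flag_for_review(draft: str, student_name: str, subjects: list[str]) -> list[str]:
    """Cheap pre-review checks against the structured input (illustrative,
    not a shipped rubric). Returns a list of reasons to escalate the draft."""
    flags = []
    # The draft should address the right student.
    if student_name.split()[0].lower() not in draft.lower():
        flags.append("student name missing from draft")
    # Every subject in the structured input should be covered.
    for subject in subjects:
        if subject.lower() not in draft.lower():
            flags.append(f"subject not covered: {subject}")
    # Narrative reports have a minimum useful length (threshold is a guess).
    if len(draft.split()) < 60:
        flags.append("draft too short for a narrative report")
    return flags
```

Checks like these don't catch subtle hallucinations, but they cost nothing per report and catch the embarrassing failures (wrong student, skipped subject) before a human ever sees them; an LLM-based consistency check can then run only on the drafts that pass.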
I'd also push harder for A/B testing the parent experience. We shipped the AI reports and hoped parents wouldn't notice (in a good way). A proper A/B test — some parents get the old reports, some get AI-generated — would have given us real data on whether the quality was genuinely equivalent.
What I Learned
AI features succeed when you make the AI invisible. Parents don't want to know a machine wrote their child's report card. They want a report that sounds like their child's teacher — thoughtful, specific, human. The entire product design was oriented around that insight: structured inputs for consistency, tone calibration for authenticity, human review for trust.
The second lesson: the hardest part of shipping AI as a PM isn't the model — it's the failure states. What happens when the AI generates something wrong? When a teacher disagrees with the output? When a parent notices a factual error? Designing for those cases took more time than designing the happy path, and it's where the product quality actually lives.