Gen-AI Notice Relevance Model

TL;DR

I led the product effort to design, deliver, and operate a Gen-AI notice-relevance system that automatically detects and hides irrelevant web notices from a high-volume daily feed. The system combines a high-precision spam classifier, an allowlist/blocklist onboarding process, an LLM-based judge for second-pass checks, UX filtering controls, and operational automations. By default the system hides roughly 30-40% of irrelevant notices, preserves auditability, and gives researchers easy discoverability and undo controls.

The Problem

We monitor ~50k public sources daily to find tax-relevant updates. The crawl + diff pipeline produces many notices every day — a large portion of that volume is noise (site chrome changes, navigation updates, weather text, date tweaks). Manually triaging this noise consumed valuable researcher time and delayed identification of true regulatory changes.

Objective

Remove noise from researcher work queues while preserving visibility and auditability. Enable safe automation with clear controls and easy recovery.

My Remit & Constraints

As Senior Product Manager, I was responsible for delivering a production-grade relevancy feature that:

  • Ships quickly with phased rollouts and measurable acceptance criteria
  • Scales across thousands of sources with a clear human→AI feedback loop
  • Operates safely in regulated workflows (no silent deletion of content)
  • Reduces triage load while keeping rare relevant items recoverable

Approach & Execution

1. Re-scope and Align Stakeholders

Convened a cross-functional cadence — research leads, data scientists, engineering, ops and QA — to document core problems, define acceptance criteria, and agree on operational policies (hide vs retain vs auto-complete). Captured representative noise patterns as a prioritized testbed.

2. Built a Hybrid Relevance Pipeline

Spam Classifier

A prompted LLM classifier tuned to label notices as 100% irrelevant with ~99.9% precision. Production runs on GPT-4o-mini; GPT-4.1-mini showed ~15-20% better recall in tests.

Content-Type & Tag Refinement

AI-suggested tags validated by researchers to improve classification precision and reduce label drift.

LLM Judge

A second LLM pass that double-checks 100%-irrelevant labels and routes uncertain cases to researchers.
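
The two-stage flow above — classifier first, judge as a double check — can be sketched as follows. This is a minimal illustration, not the production API: `classify` and `judge` stand in for the prompted LLM calls, and the 0-100 irrelevancy score is an assumption based on the scoring described later in this case study.

```python
def triage(notice, classify, judge):
    """Route a notice through the two-stage relevance pipeline.

    `classify` and `judge` are stand-ins for the LLM calls (e.g. prompted
    GPT-4o-mini); both are assumed to return an irrelevancy score in [0, 100].
    """
    score = classify(notice)      # first-pass spam classifier
    if score < 100:
        return "show", score      # anything not fully irrelevant stays visible
    # Second-pass LLM judge double-checks 100%-irrelevant labels
    if judge(notice) == 100:
        return "hide", 100        # both stages agree: hide by default
    return "review", score        # stages disagree: surface to researchers


# Toy stand-ins for the LLM calls (illustrative only)
always_spam = lambda n: 100
never_spam = lambda n: 40
print(triage("site footer changed", always_spam, always_spam))  # ('hide', 100)
print(triage("new VAT rate notice", never_spam, always_spam))   # ('show', 40)
```

The key design point is that a notice is only hidden when both stages independently agree it is 100% irrelevant; any disagreement falls back to human review.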

3. Designed UX & Researcher Controls

Led design for a conservative, transparent UX: 100%-irrelevant notices hidden by default with a toast explanation, a double range slider for score-visibility control, and filters to reveal or include all items while keeping hidden items discoverable for audit.

4. Operational Guardrails & Automation

Balanced automation and safety: auto-complete for unassigned 100% irrelevant notices after 2 days, 30-day retention for hidden items, and phased onboarding where new sources are blocklisted until validated.
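
As a rough sketch, the auto-complete and retention policies might be expressed as below. The day counts come from the text above; the field names (`score`, `assigned`, `hidden_at`) and the action labels are assumptions for illustration.

```python
from datetime import datetime, timedelta

AUTO_COMPLETE_AFTER = timedelta(days=2)   # unassigned 100%-irrelevant notices
RETENTION_PERIOD = timedelta(days=30)     # hidden items stay recoverable

def apply_guardrails(notice, now):
    """Return the operational action for a hidden notice.

    `notice` is a dict with hypothetical fields: 'score' (0-100 irrelevancy),
    'assigned' (bool), and 'hidden_at' (datetime when it was hidden).
    """
    age = now - notice["hidden_at"]
    if notice["score"] == 100 and not notice["assigned"] and age >= AUTO_COMPLETE_AFTER:
        return "auto-complete"   # clear it from the queue, keep the record
    if age >= RETENTION_PERIOD:
        return "purge"           # past the 30-day audit window
    return "retain"              # still recoverable by researchers

now = datetime(2024, 1, 31)
stale = {"score": 100, "assigned": False, "hidden_at": datetime(2024, 1, 28)}
print(apply_guardrails(stale, now))  # auto-complete
```

Note the ordering: auto-complete is checked before purge, so unassigned fully-irrelevant items are closed out well before the retention window expires.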

5. Phased Rollout & Monitoring

Deployed to UAT, ran targeted reviews, then rolled out to production in phases. Instrumented dashboards for irrelevancy rates, per-source performance and false-positive reports with continuous feedback loops for model refinement.

How It Works — User Flow

1. Daily Ingest & Scoring

System crawls configured sources, computes diffs, creates notices and assigns relevance/irrelevancy scores.

2. Inbox Landing

100% irrelevant notices hidden by default. Toast indicates behavior. Default slider range 0-80 shows non-100% items.

3. Filter Controls

The double range slider lets researchers broaden the view (0-100) or view only 100%-irrelevant items. Relevant items are unaffected.
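
A minimal sketch of the slider's filtering behavior, using the score bounds described above (the function and field names are illustrative, not the production implementation):

```python
def visible(notices, lo=0, hi=80):
    """Filter a feed to notices whose irrelevancy score falls in [lo, hi].

    The default 0-80 range excludes 100%-irrelevant items; widening to
    (0, 100) reveals everything, and (100, 100) shows only hidden items.
    """
    return [n for n in notices if lo <= n["irrelevancy"] <= hi]

feed = [{"id": 1, "irrelevancy": 0},
        {"id": 2, "irrelevancy": 85},
        {"id": 3, "irrelevancy": 100}]

print([n["id"] for n in visible(feed)])            # [1]        default view
print([n["id"] for n in visible(feed, 0, 100)])    # [1, 2, 3]  everything
print([n["id"] for n in visible(feed, 100, 100)])  # [3]        hidden items only
```

Because the filter acts only on the irrelevancy score, relevant items are never affected by slider changes, matching the guarantee stated above.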

4. Review & Undo

Hidden items accessible for 30 days. Researchers can recover false positives or feed them back for retraining.

5. Auto-Complete

Unassigned 100% irrelevant items auto-complete after 2 days.

6. Feedback Loop

Researchers flag misclassifications; examples used to refine prompts, tags and model retraining.

Key UX Decisions

Hidden by Default

100%-irrelevant notices are hidden in the To-Do inbox with a toast explaining why items disappeared.

Double Range Slider

Filters control which irrelevancy scores are visible (default 0-80 so 100% items are excluded).

30-Day Retention

Hidden items remain accessible for audit and recovery before permanent clearance.

Auto-Complete

Unassigned 100%-irrelevant notices are automatically marked as completed after 2 days.

Outcomes & Impact

  • ~30-40%: irrelevant notices hidden by default
  • 99.9%: precision on 100%-irrelevant labels
  • 30 days: retention for audit & recovery
  • 2 days: auto-complete window for unassigned items

  • Inbox reduction: System hides roughly 30-40% of irrelevant notices by default, substantially reducing triage workload.
  • Faster sourcing: Researchers spend less time reviewing noise and more on high-value content discovery.
  • Lower manual churn: Auto-complete and hide-by-default behavior reduced repetitive manual actions.
  • Trust & auditability: Retention & discoverability rules preserved visibility and allowed confident automation.

Challenges & Trade-offs

False Positives Risk

Hiding items risks missing rare relevant notices. Mitigated by retention, filters, and judge processes.

Definition Ambiguity

Relevance judgments can be subjective. Created labeled examples and aligned stakeholders for consistency.

Model Drift

Source changes required monitoring and a human review cadence to detect drift early.

Automation vs Transparency

Chose conservative defaults (hide, not delete) to build trust before expanding automation.

Key Learnings

1. Curate explicit examples early. Representative noise samples accelerate prompt & model tuning.

2. Hybrid rules + ML is pragmatic. Deterministic rules handle known noise; ML handles fuzzy cases.

3. Provide discoverability & undo. Users must be able to find and recover hidden items for trust.

4. Operational policies matter. Retention, auto-complete and exportability enable safe automation in regulated contexts.

5. The product includes the feedback loop. Continuous retraining and tag curation are part of the shipped product, not an afterthought.