Baker Street | Mycroft

Overview

Evidence before confidence theatre

Mycroft exists because teams often tweak prompts, retrieval, and orchestration without any stable release gate. It gives Baker Street and client teams a small, auditable harness for proving whether a change actually improved the workflow.

CLI runner for offline and model-backed evals
Dataset and config examples
Required-keyword and forbidden-keyword checks
Generated JSON and Markdown reports for review
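To make the keyword checks concrete, here is a minimal sketch of how a required-keyword and forbidden-keyword check could work. The class name, field names, and the case-insensitive substring matching are assumptions for illustration only; they are not taken from Mycroft's actual code or config schema.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordCheck:
    """Hypothetical check spec; field names are illustrative, not Mycroft's schema."""
    required: list[str] = field(default_factory=list)
    forbidden: list[str] = field(default_factory=list)


def evaluate_output(output: str, check: KeywordCheck) -> dict:
    """Return pass/fail plus the keywords that caused any failure."""
    # Assumption: matching is a case-insensitive substring test.
    text = output.lower()
    missing = [kw for kw in check.required if kw.lower() not in text]
    present = [kw for kw in check.forbidden if kw.lower() in text]
    return {
        "passed": not missing and not present,
        "missing_required": missing,
        "forbidden_found": present,
    }
```

Returning the offending keywords alongside the pass flag keeps each failure reviewable: a reader of the report can see why a case failed, not only that it did.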

Best Fit

Who this product is built for

Prompt or retrieval iterations that need hard evidence
Pilot relaunches with a measurable pass gate
Teams that want CI-friendly workflow evaluation
Operators comparing revisions before rollout

Artifacts

The outputs teams actually use

These are the delivery artifacts the repo is designed to produce, not just internal implementation details.

Artifact

Eval report

A release-review artifact with case results, pass rate, and tag breakdown.

Artifact

Dataset pack

Representative cases for the workflow slice that matters commercially.

Artifact

Threshold config

Explicit rules for what must pass before a change is accepted.
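The three artifacts above fit together roughly as follows: case results are aggregated into a pass rate and tag breakdown, then compared against explicit thresholds. This is a sketch under stated assumptions. The result shape (passed, tags) and the threshold keys min_pass_rate and min_tag_pass_rate are hypothetical names, not Mycroft's documented format.

```python
from collections import defaultdict


def summarize(results: list[dict]) -> dict:
    """Compute the overall pass rate and a per-tag pass rate from case results."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    by_tag = defaultdict(lambda: {"passed": 0, "total": 0})
    for r in results:
        for tag in r.get("tags", []):
            by_tag[tag]["total"] += 1
            by_tag[tag]["passed"] += int(r["passed"])
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "tags": {t: c["passed"] / c["total"] for t, c in by_tag.items()},
    }


def gate(summary: dict, thresholds: dict) -> list[str]:
    """Return threshold violations; an empty list means the gate passes."""
    failures = []
    # Assumption: a missing min_pass_rate defaults to requiring every case to pass.
    if summary["pass_rate"] < thresholds.get("min_pass_rate", 1.0):
        failures.append("overall pass rate below min_pass_rate")
    for tag, minimum in thresholds.get("min_tag_pass_rate", {}).items():
        # A tag with no cases counts as 0.0, so it fails any positive minimum.
        if summary["tags"].get(tag, 0.0) < minimum:
            failures.append(f"tag '{tag}' below {minimum}")
    return failures
```

In a CI job, a non-empty list from gate would typically translate into a non-zero exit code so that the change is blocked until the thresholds are met.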

Workflow

How it gets used in real delivery

Each product is designed to slot into a fixed-scope Baker Street engagement rather than sit as a disconnected side project.

Step 1

Define the representative case set and the thresholds that matter.

Step 2

Run offline or model-backed evals against the dataset.

Step 3

Use the generated report to decide whether the change is fit to release.
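For Step 3, the report can be thought of as the same data rendered twice: JSON for tooling and Markdown for human review. The sketch below assumes a hypothetical summary with passed and total counts and cases carrying an id and a passed flag; Mycroft's real report layout may differ.

```python
import json


def render_markdown(summary: dict, cases: list[dict]) -> str:
    """Render a short, reviewable Markdown report with one table row per case."""
    lines = [
        "# Eval report",
        "",
        f"Pass rate: {summary['passed']}/{summary['total']}",
        "",
        "| Case | Result |",
        "| --- | --- |",
    ]
    for case in cases:
        lines.append(f"| {case['id']} | {'pass' if case['passed'] else 'fail'} |")
    return "\n".join(lines) + "\n"


def render_json(summary: dict, cases: list[dict]) -> str:
    """Render the same data as JSON; sorted keys keep diffs between runs stable."""
    return json.dumps({"summary": summary, "cases": cases}, indent=2, sort_keys=True)
```

Keeping both renderings derived from one summary means the reviewer and the CI job are always looking at the same numbers.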

Next Move

Use it alongside the rest of the Baker Street system

Related package: Pilot Rescue

Move from product detail into the related package, workflow, or delivery stack page.

See Moriarty

GitHub repo

Product Intake

Need a release gate for a live AI workflow?

Mycroft fits when a team already has a workflow in motion but needs evidence on quality, latency, and failure conditions before expanding or relaunching that workflow.