Chasing Mythos-level Fusion in the open

2026-06-14

Source context: Open Fusion methodology.

We tried to push TrustedRouter Fusion toward Mythos- and Fable-class DRACO performance. The current target panel is GPT-5.5, Claude Opus 4.8, Kimi K2.7 Code, GLM 5.1, MiniMax M3, Gemini 3 Flash, and Gemini 3.1 Pro, with Opus 4.8 synthesizing the final answer and Gemini 3.1 Pro judging against DRACO criteria.

That exact run is not publishable yet. The main blocker is GPT-5.5 long-reasoning behavior on DRACO prompts: it can spend the entire completion budget on reasoning and return no usable answer. GLM 5.2 is not yet enabled for the current Z.AI account, so the reproducible run uses GLM 5.1 until a direct GLM 5.2 smoke test passes.
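
One way a harness could flag this failure mode before scoring is to classify each response as usable or empty. The sketch below is illustrative only: it assumes an OpenAI-style chat completion payload already parsed into a plain dict (`choices[0].message.content` and `finish_reason`), and the function name `classify_completion` is hypothetical rather than part of the TrustedRouter harness.

```python
def classify_completion(response: dict) -> str:
    """Classify a chat-completion-style dict (assumed shape, not the real harness type).

    Returns:
        "ok"              - the message has non-whitespace content
        "empty_truncated" - no content and the model stopped on the token limit,
                            which is consistent with reasoning using up the budget
        "empty"           - no content for any other reason
    """
    choices = response.get("choices") or []
    if not choices:
        return "empty"

    choice = choices[0]
    content = (choice.get("message") or {}).get("content") or ""
    if content.strip():
        return "ok"

    # finish_reason == "length" means generation hit the completion-token cap.
    if choice.get("finish_reason") == "length":
        return "empty_truncated"
    return "empty"
```

A runner could record `empty_truncated` separately from ordinary failures, so that a missing score is attributed to budget exhaustion rather than to answer quality.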

What actually ran

Run	Task slice	Result	Status
Current 7-model target	Non-financial DRACO pilot	No score	Waiting on GPT-5.5 long-reasoning handling
Available 6-model fallback	First completed non-financial DRACO task	19.85	Completed, far below target

The first fallback panel used Opus 4.8, Kimi K2.7 Code, GLM 5.1, MiniMax M3, Gemini 3 Flash, and Gemini 3.1 Pro. It completed one task before we stopped the pilot over speed and reliability problems. A score of 19.85 is far from the target, and we are not presenting it as a win.

What changed in the harness

GPT-5.5 eval calls now omit temperature and use max_completion_tokens.
Panel and final synthesis calls stream so long answers do not wait for full completion before parsing.
Analysis and judge calls stay non-streaming because they require structured JSON reliability.
The live runner now has explicit six-model and seven-model frontier Fusion configs behind a hard budget.
The recommended DRACO slice for this experiment is --task-filter non-financial.
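
The request-shaping rules above can be sketched as a small helper. This is a minimal illustration under stated assumptions, not the harness code: the function name `build_call_kwargs`, the stage names, the model id strings, and the `max_tokens` plus `temperature` defaults for non-GPT-5.5 models are all hypothetical. Only the GPT-5.5 rule (no temperature, `max_completion_tokens`) and the streaming split come from the list above.

```python
from typing import Any

# Stage names are assumptions made for this sketch.
STREAMING_STAGES = {"panel", "synthesis"}    # long answers, streamed
STRUCTURED_STAGES = {"analysis", "judge"}    # structured JSON, not streamed


def build_call_kwargs(model: str, stage: str, token_budget: int,
                      temperature: float = 0.0) -> dict[str, Any]:
    """Build request kwargs for one call, following the rules listed above."""
    if stage not in STREAMING_STAGES | STRUCTURED_STAGES:
        raise ValueError(f"unknown stage: {stage!r}")

    kwargs: dict[str, Any] = {
        "model": model,
        "stream": stage in STREAMING_STAGES,
    }

    if model.startswith("gpt-5.5"):  # hypothetical model id prefix
        # GPT-5.5 eval calls: omit temperature, use max_completion_tokens.
        kwargs["max_completion_tokens"] = token_budget
    else:
        # Assumed defaults for the other panel models.
        kwargs["max_tokens"] = token_budget
        kwargs["temperature"] = temperature
    return kwargs
```

For example, `build_call_kwargs("gpt-5.5", "panel", 4096)` yields a streaming request with no `temperature` key, while a `"judge"` call for any model is built with `stream` set to `False`.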

Next gates

The next clean run needs two fixes before any headline claim: make GPT-5.5 long-reasoning responses produce useful content through the attested gateway, and finish a 10-task non-financial DRACO pilot without task-level hangs. GLM 5.2 can replace GLM 5.1 later when Z.AI enables it for the account.
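
A per-task deadline is one way to keep a single hung task from stalling a pilot. The sketch below uses only the Python standard library; `run_with_timeout`, the status strings, and the `run_task` callable are hypothetical names for illustration. A timed-out thread cannot be killed from Python, so this only lets the runner move on; a real harness would also need request-level timeouts on the underlying calls.

```python
import concurrent.futures as cf
from typing import Any, Callable, Optional, Tuple


def run_with_timeout(run_task: Callable[[str], Any], task_id: str,
                     timeout_s: float) -> Tuple[str, Optional[Any]]:
    """Run one task with a deadline; return (status, result)."""
    # One executor per task, so a hung worker never blocks later tasks.
    executor = cf.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(run_task, task_id)
    try:
        return "completed", future.result(timeout=timeout_s)
    except cf.TimeoutError:
        return "timeout", None
    except Exception:
        # The task itself raised; record it instead of aborting the pilot.
        return "error", None
    finally:
        # Do not wait for a hung worker thread to finish.
        executor.shutdown(wait=False)


def run_pilot(task_ids: list, run_task: Callable[[str], Any],
              timeout_s: float) -> dict:
    """Run every task in order and record a status for each one."""
    return {tid: run_with_timeout(run_task, tid, timeout_s) for tid in task_ids}
```

Under this scheme a pilot counts as clean only if every task reports `completed`.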

This is the point of doing the work in the open. If TrustedRouter clears a Mythos/Fable-class target, the result should be reproducible from code, model ids, task filters, budget limits, and artifacts. Until then, the honest result is: not there yet.
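
As a rough illustration of what "reproducible from code, model ids, task filters, budget limits, and artifacts" could look like in practice, a run can be described by a small manifest written next to its outputs. Everything in this sketch is an assumption: the `RunManifest` class, its field names, the model id strings, the budget value and its unit, and the artifact path are placeholders, not the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the inputs needed to rerun a Fusion eval."""
    panel: tuple            # model ids in the panel (placeholder strings)
    synthesizer: str        # model that writes the final answer
    judge: str              # model that scores against DRACO criteria
    task_filter: str        # e.g. the value passed to --task-filter
    budget_limit: float     # hard budget; the unit is an assumption
    artifacts: tuple = ()   # paths to saved outputs (placeholders)

    def to_json(self) -> str:
        # sort_keys gives a stable serialization that is easy to diff.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Example with placeholder ids for the six-model fallback panel.
manifest = RunManifest(
    panel=("opus-4.8", "kimi-k2.7-code", "glm-5.1",
           "minimax-m3", "gemini-3-flash", "gemini-3.1-pro"),
    synthesizer="opus-4.8",
    judge="gemini-3.1-pro",
    task_filter="non-financial",
    budget_limit=25.0,  # placeholder value, not the real budget
    artifacts=("runs/example/results.json",),
)
```

Committing the output of `manifest.to_json()` alongside the artifacts would let a reader check which panel, filter, and budget produced a given score.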
