How we cluster stories: the methodology behind fact-based synthesis

Fact-based synthesis means that a machine-generated summary contains only statements that are substantiated by the underlying sources. It is the opposite of a freely formulated answer: every sentence is tied to facts that were previously extracted from the original items and verified.

The core problem of media intelligence is volume. A single event often prompts hundreds of items: some duplicates, some contradictory, some with a different focus. Distilling a reliable picture from them by hand takes hours. Naively handing the job to a generative AI risks fluent inventions.

Step 1: From items to stories

Before anything is summarised, related items have to be recognised. We bundle them into a story using three signals, considered together:

Semantic similarity: items that are close in content, measured by the vector distance of their meaning — not by mere word overlap.
Entity overlap: the same people, organisations and places point to the same event.
Temporal proximity: events unfold within a time window; an item from three weeks ago rarely belongs to the acute story.

Only the combination makes the clustering robust. Each signal on its own can mislead: entity overlap alone, for instance, would merge two unrelated stories about the same organisation. Together they produce a traceable grouping.
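
To make the interplay of the three signals concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not our production code: the Item fields, the weights, the seven-day decay window and the use of cosine similarity, Jaccard overlap and exponential decay are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Item:
    # All three fields are hypothetical names chosen for this sketch.
    embedding: list[float]   # vector representation of the item's meaning
    entities: set[str]       # people, organisations and places mentioned
    published: datetime      # publication time


def cosine(a: list[float], b: list[float]) -> float:
    """Semantic similarity as the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a: set[str], b: set[str]) -> float:
    """Entity overlap as shared entities divided by all entities."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def time_decay(a: datetime, b: datetime, window_days: float = 7.0) -> float:
    """Temporal proximity: 1.0 for simultaneous items, falling towards 0."""
    gap_days = abs((a - b).total_seconds()) / 86400
    return math.exp(-gap_days / window_days)


def same_story_score(x: Item, y: Item, weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted combination of the three signals (weights are assumptions)."""
    w_sem, w_ent, w_time = weights
    return (
        w_sem * cosine(x.embedding, y.embedding)
        + w_ent * jaccard(x.entities, y.entities)
        + w_time * time_decay(x.published, y.published)
    )
```

A pair of items would then be grouped into the same story when the combined score exceeds a threshold. Because the score is a sum of three named parts, each grouping decision can be explained by the contribution of each signal.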

Step 2: Extract and verify facts

Individual, verifiable statements are extracted from the items of a story. The decisive factor is corroboration: a statement that appears in only a single source carries less weight than one supported by several independent sources. This makes it possible to separate robust core statements from individual opinions and speculation before a single word is summarised.
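
Corroboration can be sketched as counting the distinct sources behind each statement. The following Python example is a simplified illustration: it assumes that statements have already been normalised to a comparable key and that each source identifier stands for an independent origin (two outlets reprinting the same agency report would need to share one identifier). The function names and the threshold of two sources are hypothetical.

```python
from collections import defaultdict


def corroboration(statements):
    """Count distinct sources per fact.

    statements: iterable of (fact_key, source_id) pairs. Both values are
    assumed to be normalised upstream; that step is not shown here.
    """
    sources = defaultdict(set)
    for fact, source in statements:
        sources[fact].add(source)  # a set ignores repeats from one source
    return {fact: len(ids) for fact, ids in sources.items()}


def split_by_support(statements, min_sources=2):
    """Separate corroborated core facts from single-source statements."""
    counts = corroboration(statements)
    core = sorted(f for f, n in counts.items() if n >= min_sources)
    weak = sorted(f for f, n in counts.items() if n < min_sources)
    return core, weak
```

Repeated mentions by the same source do not raise the count, so a statement only moves into the core set when a second independent source supports it.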

Fail loud instead of quietly wrong

If the substantiated facts are not enough, we would rather stop than guess. A missing synthesis is more honest than an invented one.
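
In code, failing loud can be as simple as raising an error instead of returning a best guess. The sketch below assumes a hypothetical minimum of three corroborated facts; the exception name and the threshold are illustrative choices, not a description of our actual limits.

```python
class InsufficientFactsError(Exception):
    """Raised when a story lacks enough corroborated facts to summarise."""


def require_enough_facts(core_facts, minimum=3):
    """Return the facts unchanged, or stop the pipeline with an explicit error."""
    if len(core_facts) < minimum:
        raise InsufficientFactsError(
            f"only {len(core_facts)} corroborated facts, need at least {minimum}"
        )
    return core_facts
```

The caller has to handle the error explicitly, so a thin fact base shows up as a visibly missing synthesis rather than as a plausible-sounding text.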

Step 3: Synthesis, tied to the facts

Only now is text written, and even here not freely. The synthesis is formulated solely from the previously verified facts and is then checked against them: for hallucinations, for unsubstantiated claims, for hedging. The result is a summary whose statements can be traced back to the original sources.
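
One way to picture the check is to require every sentence of the synthesis to cite the verified facts it rests on. The Python sketch below covers only that traceability part and is an assumption about how such a check could look: the data shape (sentence plus cited fact IDs) and the function name are hypothetical, and detecting hallucinated wording or hedging inside a sentence would need further steps not shown here.

```python
def unsupported_sentences(synthesis, verified_fact_ids):
    """Return the sentences that cannot be traced to verified facts.

    synthesis: list of (sentence, cited_fact_ids) pairs.
    verified_fact_ids: set of IDs for facts that passed verification.
    """
    problems = []
    for sentence, cited in synthesis:
        # A sentence fails if it cites nothing or cites an unknown fact.
        if not cited or not set(cited) <= verified_fact_ids:
            problems.append(sentence)
    return problems
```

An empty result means every sentence points to at least one verified fact and to no unverified ones; any other result would block publication of the synthesis.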

In doing so, we do not reproduce other parties' texts. The synthesis is an independent, fact-based condensation — not a stitched-together excerpt. This distinction matters to us both legally and editorially.

Why we do it this way

One could work faster and more cheaply: simply feed items into a language model and publish the result. We deliberately do not. For us, traceability is not a feature but the precondition for media intelligence to be trustworthy at all. It also increasingly aligns with what regulation demands, as we have written elsewhere about the AI Act.

Good methodology is invisible as long as it works. It only becomes visible in the trust you can place in the result.
