Project Report

Discourse Integrity Map Project

A portfolio submission for the CIP Research Fellowship · Alice Pateman-Brown · Prepared May 2026


Overview

DIM is a computational pipeline for mapping the quality and structure of public argument. Applied here to the UK and US wealth tax debate, it extracted 456 speaker claims from 17 debates, aggregated them into 37 canonical positions, and produced 390 rhetorical technique detections - surfacing where the two sides diverge, where latent agreement already exists, and what kind of disagreement underlies the debate.

Research question: can we identify common ground and point to the nature of disagreement outside of rhetoric, by mapping a debate space and aggregating speaker claims into canonical positions?

17 debates processed

456 speaker claims extracted

37 canonical positions
(incl. 5 shared ground)

390 rhetorical technique detections

Positions by stasis type: fact · value · policy · definition · procedure

Detections by stance: pro 197 · anti 105 · mixed · neutral

Key finding · Shared ground

Beneath the surface heat of the wealth tax debate, the two sides agree on more than they appear to. The Agora identifies five shared ground positions - claims that recur on both sides of the debate, including that some wealthy individuals do relocate in response to tax changes, that avoidance loopholes should be closed, and that design and thresholds are decisive. The real disagreement is about magnitude and mechanism, not about whether inequality is a problem or whether the tax system needs reform.

This is where deliberative design can start: not from opposed positions, but from the claims both sides already accept - and which currently get lost in the argument about the instrument.

Agora

Browse all 37 canonical positions, filterable by stance and stasis type

What might a deliberative platform look like?

Heatmap

390 rhetorical technique detections across 17 debates, broken down by stance

Do patterns of rhetorical technique vary systematically across political positions?

Corpus

Per-debate view with speaker claims and discourse profiles for all 17 debates

Where do the claims actually come from?

Debate

A multi-agent debate replay built from AI personas derived from the corpus

Can AI agents standing in for real positions model what genuine deliberation might look like?

Background

In brief

DIM grew from an MSc dissertation that validated the overall structure but exposed poor classifier quality - the redesign addresses that directly.
The core design principle: make the structure of argument visible without delivering a verdict on truth.
A central design goal was surfacing common ground - identifying where apparent opponents already agree, and using that as a foundation for more productive deliberation. This connects directly to CIP's interest in collective intelligence and tools that improve the quality of group reasoning rather than simply amplifying existing positions.
Scope narrowed to the UK/US wealth tax debate for the MVP; unit of analysis shifted from whole interventions to individual speaker claims.
Grounded in Aristotle's dialectic/eristic distinction - the difference between arguing to find truth and arguing to win.

This report accompanies a portfolio submission for the Collective Intelligence Project (CIP) Research Fellowship. The fellowship supports original research at the intersection of AI, democratic deliberation, and public reasoning about contested questions - a set of concerns that shaped my thinking long before I had vocabulary for them. Discovering CIP's work, alongside that of others such as Audrey Tang, gave me a framework for ideas I had been circling: that the quality of collective reasoning matters, that it can be measured, and that the design of the systems through which people argue and decide together is not a technical afterthought but a foundational political question - one that becomes more urgent as political polarisation deepens and as friction across social media, the wider media, and society continues to fragment shared epistemic ground. DIM is a computational pipeline for mapping the quality and structure of public argument: identifying where sides disagree, where latent agreement already exists, and what kind of disagreement is at stake. It is offered here as a working proof of concept that speaks directly to the fellowship's interest in communication and deliberation in policy settings, and in the design of tools that support more rigorous collective reasoning.

This project has roots in my MSc Data Science dissertation, in which I assembled a corpus of 853 UK and US immigration news articles collected via three APIs (GNews, NewsAPI, NewsData.io) and enriched with media bias metadata from four rating sources - Ad Fontes, MBFC, AllSides, and Ground News - as well as sentiment analysis via VADER and TextBlob. The overall aim of the project was to lay the foundations of a non-prescriptive media literacy tool: a system that could make the rhetorical construction of public argument visible without delivering a verdict on its truth, helping readers see not just what was being argued but how. To that end, I applied the SemEval-2020 Task 11 propaganda detection model - a publicly released RoBERTa-based pipeline trained to identify rhetorical techniques in news text - to this corpus. Due to significant compute constraints - the model ran on CPU only, making full-corpus inference prohibitively slow - span identification (SI) and technique classification (TC) were applied to a balanced sample of 70 articles rather than the full dataset. Both stages were successfully completed and merged, and the results integrated into a deployed front-end. What I found was that while the pipeline ran successfully, many of the propaganda technique labels it produced - classical logical fallacies such as "ad hominem" - were inaccurate or poorly grounded. The structure was solid and the design philosophy promising, but the project was let down by the quality of the model's output. That failure clarified what the project was actually asking: not a pipeline question, but a classifier question. The design principle - surface the structure of argument without prescribing conclusions - was worth pursuing. What was needed was a model that could be trusted to do it reliably. The deployment can be viewed at abrown1564.github.io/truthspace.

The DIM (Discourse Integrity Map) was envisaged as a follow-up to that project - an attempt to do what the dissertation had shown was worth doing, but with a classifier that could actually be trusted. The animating idea was a kind of "pulse-taking" of contemporary public discourse: a system capable of designating, for any given debate, whether its character was primarily dialectic or eristic - terms borrowed from Aristotle, who distinguished between argument conducted in pursuit of truth and argument conducted in pursuit of winning. Alongside this quality dimension, the original brief envisaged a visual map in the style of an Overton Window: showing where arguments cluster across the political landscape, how far they travel, and - weighted by reach - which low-quality arguments were achieving the widest audiences.

Once work began, however, the scope and nature of the project changed in two important ways. First, on topic: given the time and resource constraints of a fellowship application, I focused the MVP on a single issue - the UK and US public debate over wealth taxation. The issue is currently prominent in public discourse, primarily because of economist Gary Stevenson's success in encouraging discussion of wealth taxes as a method of reducing inequality in the UK. This decision traded breadth for depth, allowing me to build something analytically rigorous within a defined domain rather than something superficial across many. Second, and more significantly, the unit of analysis shifted. The original design treated the whole intervention - a video, a speech - as the atomic unit, with one quality score per piece of content visualised on a reach/quality scatter plot. What I actually built uses the individual speaker's claim as the unit instead: discrete argumentative moves extracted from transcripts, classified, and then aggregated upward into canonical positions that recur across multiple debates. This is a more granular and analytically richer approach, and it is what made the Agora possible.

The research question the project settled on was: can you identify common ground and point to the nature of disagreement outside of rhetoric, by mapping a debate space and aggregating speaker claims into canonical positions? The theoretical grounding for this draws on the eristic/dialectic distinction and its modern formalisation in pragma-dialectics (van Eemeren and Grootendorst), which operationalises good-faith argument as ten rules for critical discussion whose violations correspond to named fallacies. Pragma-dialectics represents a direction for deeper engagement with more time rather than a framework applied in full here. What DIM does in this version is more limited in scope, but still distinct from a fact-checker or bias rater: it does not ask whether a claim is true, but maps what is being argued, how often, and where positions converge and diverge - treating the structure of the debate as the object of analysis rather than the truth value of any individual claim within it.

What Was Built

In brief

A 5-stage LLM pipeline: ingest transcripts, extract claims, aggregate canonical positions, classify rhetorical techniques, assemble discourse profile.
456 claims extracted across 17 debates; aggregated into 37 canonical argument positions split into pro, anti, and shared ground.
390 rhetorical technique detections produced by classifying speaker turns against a curated fallacy taxonomy.
Deployed as a static site on GitHub Pages - no backend, all data pre-built into JSON files.

17 debates processed through the full pipeline

456 individual claims extracted

37 canonical argument positions aggregated

5 stasis types: fact, definition, value, policy, procedure

Stage 1 detect_units - segment transcript into argumentative units

Stage 2 skeletonize - extract claims, warrants, stasis points per unit

Stage 3 merge_discourse - merge into one unified discourse skeleton

Stage 4 classify_units - detect rhetorical / fallacious techniques

Stage 5 assemble_profile - aggregate detections into discourse profile

The pipeline has five stages. First, YouTube transcripts for 17 wealth-tax debates and monologues were ingested, diarised into speaker turns, and loaded into a SQLite database. Second, each debate was tagged with a discourse mode - monologic, dialogic, or multi-party - because certain fallacies are only structurally possible in interactive formats and should not be applied to a single speaker talking to camera. Third, speaker turns were processed by a large language model (Claude) to extract individual claims, each tagged with a stasis type (fact, definition, value, policy, or procedure; stasis, from the Greek for "a standing still", is the central point of contention in a debate), a debate stance, and a geographic context (UK or US). 456 individual claims were extracted across the 17 debates.
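The discourse-mode tagging in the second step can be pictured as a simple gate on which techniques a classifier is allowed to look for. The following is a minimal sketch, not the project's code: the technique ids and the `interactive_only` flags are hypothetical illustrations, and the report does not say which taxonomy entries are restricted to interactive formats.

```python
# Discourse modes named in the report.
MODES = ("monologic", "dialogic", "multi-party")

# Hypothetical technique ids mapped to an "interactive only" flag.
# True means the technique needs an interlocutor (assumed for illustration).
TECHNIQUES = {
    "hasty_generalisation": False,
    "ad_hominem_circumstantial": False,
    "responsiveness_failure": True,
}


def applicable_techniques(mode, techniques=TECHNIQUES):
    """Return the technique ids a classifier may apply for this discourse mode."""
    if mode not in MODES:
        raise ValueError(f"unknown discourse mode: {mode!r}")
    interactive = mode != "monologic"
    return sorted(
        name
        for name, interactive_only in techniques.items()
        if interactive or not interactive_only
    )


if __name__ == "__main__":
    for mode in MODES:
        print(mode, applicable_techniques(mode))
```

Filtering before classification, rather than discarding detections afterwards, keeps structurally impossible labels out of the prompt altogether.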

Those 456 claims were then aggregated into 37 canonical argument positions - the distinct, recurring arguments that appear and reappear across different debates - via a two-stage LLM pipeline. The first stage condensed the claims into canonical positions with metadata. The second mapped each position back to its source claims, source debates, and example quotes. Positions were divided into pro-wealth-tax, anti-wealth-tax, and shared ground.
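The second aggregation stage, mapping each position back to its sources, is essentially a grouping operation once every claim has been assigned to a position. The sketch below assumes that assignment already exists as a plain dictionary (in the project it was produced by the LLM); the function name, field names, and ids are hypothetical.

```python
from collections import defaultdict


def map_positions_to_sources(claims, assignments, max_quotes=2):
    """Group claims under their canonical position.

    claims:      {claim_id: {"debate_id": str, "text": str}}
    assignments: {claim_id: position_id}  (assumed to come from an earlier step)
    """
    grouped = defaultdict(list)
    for claim_id, position_id in assignments.items():
        grouped[position_id].append(claim_id)

    result = {}
    for position_id, claim_ids in grouped.items():
        claim_ids.sort()
        result[position_id] = {
            "claim_ids": claim_ids,
            "debates": sorted({claims[c]["debate_id"] for c in claim_ids}),
            "example_quotes": [claims[c]["text"] for c in claim_ids[:max_quotes]],
        }
    return result


if __name__ == "__main__":
    claims = {
        "c1": {"debate_id": "d1", "text": "Some wealthy people do leave."},
        "c2": {"debate_id": "d2", "text": "A few will relocate."},
        "c3": {"debate_id": "d1", "text": "Loopholes should be closed."},
    }
    assignments = {"c1": "relocation", "c2": "relocation", "c3": "loopholes"}
    print(map_positions_to_sources(claims, assignments))
```

Keeping claim ids on every position is what lets a reader trace a canonical position back to the exact quotes and debates behind it.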

The output is a static site hosted on GitHub Pages, with no backend - the pipeline stores all its data in a SQLite database, but the site itself reads from pre-built JSON files rather than querying a database at runtime. It includes a positions map filterable by side, stasis type, and context; a per-debate view; a heatmap of rhetorical technique detections; and a multi-agent debate replay built from AI agents assigned argumentative personas derived from the corpus.
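One way to get from a SQLite database to a backend-free site is to dump each table to a JSON file at build time. This is a sketch under assumptions: the table name, output directory, and `export_table` helper are hypothetical, and the report does not describe the project's actual export step.

```python
import json
import sqlite3
from pathlib import Path


def export_table(conn, table, out_dir):
    """Write every row of `table` to <out_dir>/<table>.json and return the path."""
    # Table names cannot be bound as SQL parameters, so restrict them to
    # plain identifiers before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [column[0] for column in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor]

    out_path = Path(out_dir) / f"{table}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE positions (id TEXT, stance TEXT)")
    conn.execute("INSERT INTO positions VALUES ('p1', 'pro')")
    print(export_table(conn, "positions", "build/data"))
```

The static pages then fetch these files directly, so nothing needs to query the database at runtime.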

The system was designed from the outset to be subject-agnostic and non-partisan, with the same analytical rubric applied to pro- and anti-wealth-tax arguments alike.

Methodology

In brief

The atomic unit is the individual speaker claim, tagged with stasis type (fact, definition, value, policy, procedure) and debate stance.
Fallacies were chosen because they are formally defined and structurally detectable - the kinds of epistemic failures invisible to fact-checkers.
The taxonomy was developed against real examples from live public discourse, which were then used directly as grounding material in LLM classification prompts.
A composite quality score was deliberately set aside: practical constraints limited the corpus, and encoding "good reasoning" as a single metric raises normative questions the system is not positioned to resolve.

The DIM is grounded in Aristotle's distinction between dialectic - argument conducted in pursuit of truth - and eristic - argument conducted in pursuit of winning. The modern formalisation of this distinction in pragma-dialectics (van Eemeren & Grootendorst) - which operationalises good-faith argument as ten rules for critical discussion whose violations correspond to named fallacies - represents a framework for closer engagement with more time; it is an influence and a direction rather than a methodology applied in full. What the system does in this version is ask a more tractable prior question: not whether an argument is conducted in good faith, but what is being argued, how the debate space is structured, and where positions coincide or diverge - independently of rhetoric.

The focus on logical fallacies specifically is deliberate. Fact-checkers operate on a true/false axis - they can tell you a statistic is wrong, but they cannot tell you that a valid statistic is being deployed in a context designed to mislead, or that a conclusion is being drawn from a premise that was never argued for, or that someone is attacking the person making an argument rather than the argument itself. These are structural failures in reasoning that are invisible to fact-checking but arguably constitute the most common and most damaging epistemic moves in contemporary public discourse. Fallacies are also, crucially, detectable: they have names, formal definitions, and centuries of documented structure. That is what makes the classification task tractable - you cannot build a classifier for "bad argument" in the abstract, but you can build one for ad hominem circumstantial or hasty generalisation because there is a precise definition of what each move looks like.

The original design intended to produce a single composite dialectic/eristic quality score per intervention - a mechanical derivation from classified outputs, reach-weighted by audience size, that would position each piece of content on a two-dimensional map of volume against quality. This direction was set aside for two reasons, one practical and one methodological. The practical reason: API credit constraints limited the pipeline to 17 of the 31 intended debates, producing a corpus too partial to support reliable aggregate scoring. The methodological reason, and the more significant one: working toward a quality score quickly revealed that applying a single composite metric to argument quality requires encoding normative judgements about what counts as good reasoning - judgements that are not as neutral as a mechanical derivation suggests. For a system designed to be non-partisan and non-prescriptive, that is a problem that deserves careful treatment rather than a quick fix. The quality score remains a potential direction; the claim-level substrate built here is what any such score would need to sit on top of.

What was built is the claim-level substrate: the atomic unit of analysis is the individual speaker claim rather than the intervention as a whole, tagged with a stasis type (fact, definition, value, policy, or procedure) and a debate stance. 456 claims were extracted across 17 debates and aggregated into 37 canonical positions via a two-stage LLM pipeline. Separately, 390 rhetorical technique detections were produced by classifying speaker turns against a curated fallacy taxonomy.
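The claim-level unit described above maps naturally onto a small validated record. The sketch below is illustrative only: the field names are hypothetical, and the set of stance labels is an assumption borrowed from the stance categories shown for detections (pro, anti, mixed, neutral).

```python
from dataclasses import dataclass

STASIS_TYPES = {"fact", "definition", "value", "policy", "procedure"}
STANCES = {"pro", "anti", "mixed", "neutral"}  # assumed label set
CONTEXTS = {"UK", "US"}


@dataclass(frozen=True)
class Claim:
    debate_id: str
    speaker: str
    text: str
    stasis: str
    stance: str
    context: str

    def __post_init__(self):
        # Reject labels outside the closed vocabularies so that a malformed
        # model output fails loudly instead of polluting later aggregation.
        if self.stasis not in STASIS_TYPES:
            raise ValueError(f"unknown stasis type: {self.stasis!r}")
        if self.stance not in STANCES:
            raise ValueError(f"unknown stance: {self.stance!r}")
        if self.context not in CONTEXTS:
            raise ValueError(f"unknown context: {self.context!r}")


if __name__ == "__main__":
    print(Claim("d1", "Speaker A", "Some wealthy individuals relocate.",
                "fact", "anti", "UK"))
```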

The taxonomy is the epistemic spine of the system. It draws from classical rhetorical theory, pragma-dialectical violation categories, and the modern argumentation literature, and includes a small number of coined terms for phenomena not yet formalised (or not yet identified) in existing literature - phantom quantification, normative laundering, mechanistic laundering, performed empiricism. These are flagged as coined rather than established. Every detection references a specific taxonomy entry, making the classification auditable and improvable: when a definition is refined, content can be reclassified.
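The auditability property, that every detection points at a taxonomy entry and can be revisited when a definition changes, can be sketched with two records and a version check. The `version` field and both helper functions are assumptions made for illustration; the report does not say how the project tracks definition changes, and the definition strings here are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonomyEntry:
    entry_id: str
    definition: str
    coined: bool      # True for coined terms, False for established ones
    version: int = 1  # hypothetical: bumped whenever the definition is refined


@dataclass(frozen=True)
class Detection:
    debate_id: str
    quote: str
    entry_id: str
    entry_version: int  # taxonomy version the label was produced against


def validate(detection, taxonomy):
    """Raise KeyError if a detection cites an entry that does not exist."""
    if detection.entry_id not in taxonomy:
        raise KeyError(f"no taxonomy entry named {detection.entry_id!r}")


def needs_reclassification(detections, taxonomy):
    """Return detections labelled against an older version of their entry."""
    return [
        d for d in detections
        if taxonomy[d.entry_id].version > d.entry_version
    ]


if __name__ == "__main__":
    taxonomy = {
        "phantom_quantification": TaxonomyEntry(
            "phantom_quantification", "(placeholder definition)", coined=True, version=2
        ),
    }
    old = Detection("d1", "(quoted span)", "phantom_quantification", entry_version=1)
    validate(old, taxonomy)
    print(needs_reclassification([old], taxonomy))
```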

Taxonomy development was grounded in real examples rather than constructed ones. Candidate entries were tested against observed rhetorical moves drawn from live public discourse - including a Spiked Online article analysed for technique density, and a public exchange in which Rory Stewart challenged Gary Stevenson's credentials as an economist during an interview with Zack Polanski, followed by Stevenson's rebuttal. These provided concrete, contested instances of credential-based dismissal, counter-framing, and asymmetric burden-of-proof moves - exactly the phenomena the coined entries were designed to capture. Those examples were then used directly in the classification prompts given to the LLMs, as grounding material for what each taxonomy entry was intended to detect in practice.

The system was designed explicitly to be non-partisan, with the same taxonomy and the same prompts applied to pro- and anti-wealth-tax arguments alike. The discourse quality score, if built, would be a mechanical derivation from classification output. A core design principle was that any challenge to a finding must engage with a specific classification against a specific taxonomy entry, not with the methodology in the abstract - which is why the provenance of each quote, debate, and label has been retained throughout.

What the Output Reveals

In brief

The two most frequent positions mirror each other: capital flight (anti, 33 claims) vs. rising inequality (pro, 30 claims).
Shared ground is more substantial than the debate's surface heat suggests - both sides agree on several empirical facts; the real dispute is about magnitude and design.
Most of the debate is stuck at factual and value stasis before it ever reaches policy - only 5 of 32 positions are policy claims.
A deliberative process that resolves the factual disputes first could find more common ground than current debate formats suggest.

Anti-wealth-tax · Fact

Wealthy individuals will relocate if taxed

Capital flight is the dominant factual stasis point of the debate - the most frequently recurring argument in the entire dataset.

frequency: 33 claims

Pro-wealth-tax · Fact

Rising inequality harms society

Notably a fact claim, not a value claim - suggesting the pro side leads with empirical framing rather than moral argument.

frequency: 30 claims

Even from a partial corpus, the Agora produces some analytically interesting results. These two positions are the core of the debate: one side leads with a consequence claim about what a wealth tax would cause, the other with a condition claim about what the absence of one is already causing.

The shared ground is more substantial than the debate's surface heat suggests. Both sides agree that some wealthy individuals do relocate in response to tax changes - the dispute is about magnitude, not the phenomenon itself. Both agree that the current tax system has problems, that avoidance loopholes should be closed, and that design and thresholds determine whether any wealth tax would be workable. This means the disagreement is not primarily about principle - it is about empirical facts (does capital flight happen at scale?) and policy design (what threshold, what assets, what enforcement mechanism?).

Stasis breakdown across 32 pro/anti positions: 14 fact claims, 10 value claims, 5 policy claims. Most of the debate is stuck at the factual and evaluative level before it ever reaches the question of what to actually do.
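The arithmetic behind this breakdown is easy to check. In the short sketch below, the "other" bucket is simply the remainder of the 32 positions; that it covers definition and procedure claims is an assumption, since the report does not give that split.

```python
TOTAL_POSITIONS = 32  # pro and anti positions only (37 minus 5 shared ground)

counts = {"fact": 14, "value": 10, "policy": 5}
# Whatever is not fact, value, or policy (assumed: definition and procedure).
counts["other"] = TOTAL_POSITIONS - sum(counts.values())

# Whole-number percentage share of each stasis type.
shares = {name: round(100 * n / TOTAL_POSITIONS) for name, n in counts.items()}

# Share of positions that sit at the factual or evaluative level.
pre_policy_share = (counts["fact"] + counts["value"]) / TOTAL_POSITIONS

if __name__ == "__main__":
    print(counts)
    print(shares)
    print(f"fact + value: {pre_policy_share:.0%}")
```

On these figures, three quarters of the pro/anti positions are fact or value claims, which is the basis for saying the debate stalls before it reaches policy.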

Stasis breakdown chart: Fact · Value · Policy · Other

A deliberative process designed to resolve the factual disputes first - through shared engagement with evidence - might find more policy common ground than the debate's current format suggests is possible.

Limitations

DIM is an exploratory research substrate; the methodology is intentionally open and evolving.

In brief

API cost cut the corpus to 17 of 31 intended debates - the Agora is partial and should not be read as a complete account of the debate landscape.
Several unprocessed debates came from high-prestige sources (LSE and comparable forums) whose absence is a real analytical loss.
Speaker diarisation errors limit speaker-level analysis, particularly at moments of overlap and interruption.
Any fallacy taxonomy encodes contestable normative judgements; deploying it as an evaluative tool would require co-design with the intended audience.

The most significant limitation in this demo was cost and dependency. The pipeline relies on Claude for claim extraction, classification, and aggregation. Processing cost grows linearly with corpus size, and in practice API credits were exhausted before the pipeline could be run across the full intended corpus. Of the 31 debates identified for inclusion, only 17 were processed. The remaining 14 were ingested but not classified. The Agora should not be read as a comprehensive account of the wealth tax debate landscape - it reflects a partial corpus, and many arguments are likely underrepresented or absent. The quality bar established here should be treated as a benchmark for a future open-source pipeline.

Among the unprocessed debates were several from high-prestige institutional sources - including LSE public events and comparable academic forums - which would likely have surfaced considerably more evidential rigour, longer chains of reasoning, and expert-level nuance than is currently visible in the corpus. Their absence is something that future work should resolve: debates of that kind are exactly where the system's ability to distinguish substantive from rhetorical argument would be most tested and most useful.

Automated speaker diarisation was also imperfect. YouTube's automatic captions frequently misattribute or merge speakers, and Whisper-based diarisation still struggles with overlapping speech and rapid turn-taking. This limits the pipeline's ability to reliably detect interruptions, speaker-specific fallacy patterns, or responsiveness failures - all of which are analytically important in dialogic and multi-party formats and formed part of the initial intended project brief. The inaccuracies were most visible precisely where these features matter most: moments of speaker overlap, sustained interruption, and negotiation of the floor.

In addition, any fallacy taxonomy encodes normative judgements about what constitutes good and bad reasoning. I have tried to ground the taxonomy in established, peer-reviewed traditions - principally Aristotelian rhetoric, pragma-dialectics, and the classical fallacy literature - rather than in ad hoc or politically motivated categories. The taxonomy also includes a small number of entries I have labelled "coined" - terms generated to describe real, distinct phenomena not yet formalised (or not yet identified) in the existing literature (for example, mechanistic_laundering, phantom_quantification, lexical_credentialism). But the selection of which fallacies to include, how to define them, and how to weight them inevitably reflects choices that could be contested. This is an inherent limitation of any discourse quality framework. Before the taxonomy is deployed as the foundation of a deliberative tool, careful work with the intended audience would be needed to surface and validate those normative assumptions, so that the standards of "good reasoning" it encodes reflect the deliberative values of the community it serves.

The frequency count assigned to each canonical position reflects the number of individual claims mapped to it by the LLM - not a count of speakers, debates, or reach-weighted instances. These counts indicate argumentative salience within the corpus as identified by the LLM under a strictly bounded prompt.
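The distinction between these ways of counting matters because they can diverge sharply for the same position. The sketch below uses invented toy data and a hypothetical `salience_counts` helper; it shows one position with a claim count of four that comes from only two speakers in two debates.

```python
def salience_counts(claims):
    """Count one position's claims three ways: by claim, speaker, and debate."""
    return {
        "claims": len(claims),
        "speakers": len({c["speaker"] for c in claims}),
        "debates": len({c["debate_id"] for c in claims}),
    }


if __name__ == "__main__":
    # Toy data: one speaker repeats the same argument three times in one debate.
    claims_for_position = [
        {"speaker": "A", "debate_id": "d1"},
        {"speaker": "A", "debate_id": "d1"},
        {"speaker": "A", "debate_id": "d1"},
        {"speaker": "B", "debate_id": "d2"},
    ]
    print(salience_counts(claims_for_position))
```

A high claim count can therefore reflect one insistent speaker rather than broad uptake, which is why the published frequencies are best read as salience, not prevalence.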

Further Work

In brief

Complete the 14 debates already ingested but not yet classified, including LSE and comparable institutional sources - the infrastructure exists, only API cost prevented it.
Rebuild the pipeline on open-source models to remove dependency on frontier APIs.
Add live ingestion so the system becomes a real-time discourse monitor rather than a retrospective one.
Integrate a formal dialectic/eristic scoring layer, connecting to Lewandowsky et al.'s research linking rhetorical patterns to radicalisation outcomes.
Use identified shared ground as deliberative launchpads - co-design starting points rather than debate findings.
Distributed audit architecture and human-in-the-loop annotation to improve classification rigour and build ground-truth data.

Open-source pipeline

The most significant near-term step is rebuilding the claim extraction and aggregation pipeline using open-source models, benchmarked against the quality of output produced here. The current pipeline establishes the ceiling; the future work is reaching that ceiling without frontier model dependency.

Live ingestion

The step change in the system's value comes when it can ingest new content continuously - a live feed of parliamentary debate, a public consultation portal, a social media stream - and update the argument landscape in near-real time. This shifts the tool from a retrospective analysis instrument to a live discourse monitor.

Dialectic/eristic scoring layer

A natural extension is to integrate a formal dialectic/eristic scoring layer. This connects to significant prior work: Lewandowsky and colleagues (Teitelbaum, Saltz, Lewandowsky and Simchon, 2025) have demonstrated empirical links between rhetorical features in social media content and real-world outcomes, including radicalisation and political violence [preprint] [talk]. A discourse quality score grounded in the dialectic/eristic axis, cross-referenced with reach metrics, could contribute meaningfully to that research direction.

From stasis points to deliberative launchpads

The platform is currently claim-focused: it surfaces positions, maps agreement and disagreement, and invites users to respond. But some positions - for instance, "rising inequality harms society", which the Agora classifies as a pro-wealth-tax fact claim - command near-universal assent outside the specific framing of a wealth tax debate. Within that debate, they become stasis points, contested or obscured by the surrounding argument. The more generative move would be to identify claims where latent agreement already exists across apparent opponents and treat those claims not as findings to display but as launchpads for a new deliberative process - one oriented toward co-designing solutions rather than continuing to argue about the instrument. The deliberative literature on integrative negotiation and principled agreement - Fisher and Ury on principled negotiation [book], Habermas on communicative rationality [book] - offers a theoretical basis for what that platform would need to do.

Distributed audit architecture

A more rigorous version of the pipeline could replace the current single-pass classifier with a distributed audit architecture. Rather than one model making a monolithic judgement across all taxonomy categories using a recursive prompting strategy, as is the current methodology, narrow specialist auditors could each be responsible for a single taxonomy entry or tightly scoped family, required to either return an evidence-backed detection or explicitly abstain. An arbiter layer - a larger, slower model - could then reconcile conflicting auditor outputs and assemble a final discourse profile. Small models act as sensors; larger models act as interpreters and verifiers.
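The shape of this architecture can be sketched without any model calls. In the toy version below, each auditor is a keyword matcher standing in for a small specialist model, and the arbiter is a threshold rule standing in for a larger reconciling model. All names, cues, and confidence values are hypothetical, and keyword matching is not a proposed detection method.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Finding:
    entry_id: str      # taxonomy entry the auditor is responsible for
    evidence: str      # the quoted span backing the detection
    confidence: float


# An auditor returns an evidence-backed Finding or None (explicit abstention).
Auditor = Callable[[str], Optional[Finding]]


def keyword_auditor(entry_id: str, cue: str, confidence: float) -> Auditor:
    """Toy stand-in for a specialist model scoped to a single taxonomy entry."""
    def audit(turn: str) -> Optional[Finding]:
        start = turn.lower().find(cue)
        if start == -1:
            return None  # abstain rather than guess
        return Finding(entry_id, turn[start:start + len(cue)], confidence)
    return audit


def arbitrate(findings: List[Finding], threshold: float = 0.5) -> List[Finding]:
    """Toy arbiter: keep the strongest finding per entry above the threshold."""
    best = {}
    for finding in findings:
        if finding.confidence < threshold:
            continue
        current = best.get(finding.entry_id)
        if current is None or finding.confidence > current.confidence:
            best[finding.entry_id] = finding
    return sorted(best.values(), key=lambda f: f.entry_id)


def audit_turn(turn: str, auditors: List[Auditor], threshold: float = 0.5):
    findings = []
    for auditor in auditors:
        finding = auditor(turn)
        if finding is not None:
            findings.append(finding)
    return arbitrate(findings, threshold)


if __name__ == "__main__":
    auditors = [
        keyword_auditor("ad_hominem_circumstantial", "because he is rich", 0.8),
        keyword_auditor("phantom_quantification", "studies show", 0.9),
    ]
    print(audit_turn("He only says that because he is rich.", auditors))
```

The useful property is the contract rather than the matching logic: each auditor must either quote its evidence or return nothing, so the arbiter never has to reconcile unsupported labels.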

Human-in-the-loop annotation

A meaningful improvement would be to allow human review of model-generated labels, routing each detection to three or more reviewers who independently assess whether the label is accurate. Aggregating their responses with an inter-rater agreement measure suited to more than two raters (e.g. Fleiss' kappa or Krippendorff's alpha; Cohen's kappa covers only pairs of raters), alongside the share of reviewers who agree on each detection, would produce a reliability signal that flags low-confidence labels for further review and gradually builds a ground-truth dataset that could be used to fine-tune future classifiers.
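Cohen's kappa is defined for two raters, so with three or more reviewers per detection a multi-rater statistic such as Fleiss' kappa is one option. The sketch below implements the standard Fleiss formula in plain Python and pairs it with a simple per-detection majority share. The label names and the `per_detection_agreement` helper are assumptions for illustration, and the code assumes every detection has the same number of reviewers.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels.

    Every item must carry the same number of ratings (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [[item.count(c) for c in categories] for item in ratings]

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Agreement expected by chance from overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        return 1.0  # every rating is the same label; kappa is undefined, treat as full agreement
    return (p_observed - p_expected) / (1 - p_expected)


def per_detection_agreement(item):
    """Share of reviewers who chose the majority label for one detection."""
    return max(item.count(label) for label in set(item)) / len(item)


if __name__ == "__main__":
    reviews = [
        ["accurate", "accurate", "inaccurate"],
        ["accurate", "inaccurate", "inaccurate"],
    ]
    print(fleiss_kappa(reviews))
    print([per_detection_agreement(item) for item in reviews])
```

Kappa summarises reliability across the whole reviewed set, while the per-detection share is what would route an individual label to further review.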

Completing the wealth tax corpus

The most immediate next step is processing the 14 debates already ingested but not yet classified - including several from high-prestige institutional sources such as LSE public events. Completing the corpus would substantially improve the Agora's coverage and is likely to surface more evidentially rigorous argument than is currently visible. This is the lowest-hanging fruit: the infrastructure exists, the content is already ingested, and the only constraint was API cost.

Multi-topic expansion

The system was designed to be subject-agnostic. Extending the corpus to other high-salience debates - net zero, immigration, NHS reform - would test the generalisability of the pipeline and the taxonomy, and allow cross-topic comparison of discourse quality patterns.