Discorra

About Frequency & Keyness Analysis

An overview of frequency and keyness scoring in Discorra — how token counts are normalized, how keyness is calculated, and how to interpret statistical significance between corpora.

Updated 9/12/2025

Frequency and keyness are the foundations of corpus comparison in Discorra.
They show you not only which words appear most often, but also which words are statistically overrepresented in one corpus relative to another.



What is frequency analysis?

Frequency analysis shows the most common tokens or words in a dataset.
It answers the simple but powerful question: What terms dominate this corpus?

Definition
Frequency counts show how often each token appears, normalized by dataset size.


What is keyness analysis?

Keyness goes further: it identifies which tokens are unusually frequent in one corpus compared to another.
It tells you not just what is common, but what is distinctive.

Definition
Keyness measures how strongly a token's frequency in one corpus departs from its frequency in another, typically using a log-likelihood or z-score statistic.


How frequency is calculated

Discorra applies normalization to ensure fair comparison:

  1. Tokenization: Text is split into tokens (words).
  2. Normalization: Raw counts are scaled to a rate per 10,000 tokens so that corpora of different sizes can be compared directly.
  3. Ranking: Tokens are sorted by descending frequency.
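The three steps above can be sketched in a few lines of Python. This is only an illustration, not Discorra's implementation: the function name `normalized_frequencies`, the regex tokenizer, and the alphabetical tie-breaking are all assumptions made for the example.

```python
import re
from collections import Counter

def normalized_frequencies(text, per=10_000):
    """Return (token, raw_count, rate per `per` tokens), most frequent first."""
    # Step 1: tokenization. A lowercase word regex is assumed here;
    # a production tokenizer will handle punctuation and Unicode differently.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return []
    counts = Counter(tokens)
    # Step 2: normalization. Scale each raw count to a rate per `per` tokens.
    rows = [(tok, n, n * per / total) for tok, n in counts.items()]
    # Step 3: ranking. Sort by descending count, ties broken alphabetically.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

rows = normalized_frequencies("the band played a song and the band played on")
for token, raw, rate in rows[:3]:
    print(f"{token}: {raw} occurrences, {rate:.0f} per 10,000 tokens")
```

Because the rate is a ratio of count to corpus size, a token that appears 20 times in a 10,000-token corpus and 200 times in a 100,000-token corpus gets the same normalized value in both.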

How keyness is calculated

Discorra uses statistical measures to highlight distinctiveness:

  1. Expected frequency: Calculates how often a token would appear in each corpus if it were distributed identically across both.
  2. Observed frequency: Counts how often the token actually appears in each corpus.
  3. Z-score or log-likelihood: Quantifies the gap between observed and expected counts, highlighting terms that appear more (or less) often than expected in one corpus relative to the other.
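As a rough sketch of the steps above, the following Python function computes the two-corpus log-likelihood statistic commonly used for keyness in corpus linguistics. It is an assumption that Discorra's scoring follows this exact form; the function name, parameter names, and the convention of attaching a sign to show direction are chosen for illustration only.

```python
import math

def log_likelihood_keyness(count_a, size_a, count_b, size_b):
    """Signed log-likelihood keyness of one token.

    count_a / count_b: observed occurrences in the target / comparison corpus.
    size_a / size_b:   total token counts of the two corpora.
    """
    if size_a <= 0 or size_b <= 0:
        raise ValueError("corpus sizes must be positive")
    total = count_a + count_b
    if total == 0:
        return 0.0
    # Expected counts if the token were spread evenly across both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    # Sum observed * ln(observed / expected); zero counts contribute nothing.
    ll = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    ll *= 2
    # Assumed sign convention: negative means relatively more frequent
    # in the comparison corpus.
    return ll if count_a / size_a >= count_b / size_b else -ll

print(log_likelihood_keyness(30, 1000, 10, 1000))
print(log_likelihood_keyness(10, 1000, 30, 1000))
```

The unsigned statistic is conventionally compared against chi-square critical values with one degree of freedom, so a magnitude above roughly 3.84 corresponds to p < 0.05. Whether Discorra applies that threshold is not stated here.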

⚠️ Note: High-frequency function words (such as "the" and "and") are filtered out by stopword handling, so results emphasize meaningful content words.


How to interpret results

  • High-frequency tokens: Show what dominates discussion, regardless of distinctiveness.
  • High keyness tokens: Indicate what is characteristic of one corpus versus another.
  • Negative keyness values: Suggest terms are more characteristic of the comparison corpus.

Example:
In a Jazz vs. Blues comparison:

  • Frequency shows that both corpora use "music", "song", and "band" often.
  • Keyness highlights "improvisation" (Jazz) and "heritage" (Blues) as distinctive.
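To make the sign of a keyness score concrete, here is a small sketch using a two-proportion z-score, one of the two statistics mentioned earlier. The token counts and corpus sizes are invented purely for illustration and are not real Jazz or Blues data; the function names are hypothetical and do not describe Discorra's API.

```python
import math

def z_score_keyness(count_a, size_a, count_b, size_b):
    """Two-proportion z-score: positive favors corpus A, negative favors B."""
    p_a = count_a / size_a
    p_b = count_b / size_b
    # Pooled proportion under the assumption of identical distributions.
    pooled = (count_a + count_b) / (size_a + size_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / size_a + 1 / size_b))
    if se == 0:
        return 0.0
    return (p_a - p_b) / se

def rank_by_keyness(counts_a, size_a, counts_b, size_b):
    """Score every token seen in either corpus, highest keyness first."""
    vocab = set(counts_a) | set(counts_b)
    scores = {
        tok: z_score_keyness(counts_a.get(tok, 0), size_a,
                             counts_b.get(tok, 0), size_b)
        for tok in vocab
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented toy counts (not real data), each corpus assumed to hold 10,000 tokens.
jazz = {"music": 50, "improvisation": 30, "heritage": 5}
blues = {"music": 50, "improvisation": 4, "heritage": 28}

ranked = rank_by_keyness(jazz, 10_000, blues, 10_000)
for token, z in ranked:
    print(f"{token:>14}  z = {z:+.2f}")
```

With these toy numbers, "improvisation" scores positive (characteristic of the Jazz corpus), "heritage" scores negative (characteristic of the Blues comparison corpus), and "music" scores zero because its rate is the same in both, even though it is the most frequent token overall.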

Why frequency & keyness matter

Frequency and keyness help you:

  • Identify core vocabulary shaping a domain
  • Surface unique terms that distinguish one group from another
  • Build the foundation for higher-level analyses (resonance, messaging, sentiment)
  • Support evidence-based content and brand strategy
