NLTK #2: Text Corpora

Accessing Text Corpora and Lexical Resources

Okan Yenigün
Dev Genius


A text corpus (plural: corpora) is a large and structured set of texts that are used for linguistic research and natural language processing tasks.


NLTK includes a selection of free books from Project Gutenberg in its Gutenberg Corpus.

from nltk.corpus import gutenberg
gutenberg.fileids()


"""
['austen-emma.txt',
'austen-persuasion.txt',
'austen-sense.txt',
'bible-kjv.txt',
'blake-poems.txt',
'bryant-stories.txt',
'burgess-busterbrown.txt',
'carroll-alice.txt',
'chesterton-ball.txt',
'chesterton-brown.txt',
'chesterton-thursday.txt',
'edgeworth-parents.txt',
'melville-moby_dick.txt',
'milton-paradise.txt',
'shakespeare-caesar.txt',
'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt',
'whitman-leaves.txt']
"""

# Let's get Emma by Jane Austen

emma = gutenberg.words('austen-emma.txt')

len(emma) # total number of word tokens

"""
192427
"""

len(gutenberg.raw("austen-emma.txt")) # total number of characters (including whitespace)

"""
887071
"""

NLTK also includes small samples of less formal text in its webtext corpus.

from nltk.corpus import webtext

for fileid in webtext.fileids():
    print(f"fileid: {fileid}\n{webtext.raw(fileid)[:20]}")

"""
fileid: firefox.txt
Cookie Manager: "Don
fileid: grail.txt
SCENE 1: [wind] [clo
fileid: overheard.txt
White guy: So, do yo
fileid: pirates.txt
PIRATES OF THE CARRI
fileid: singles.txt
25 SEXY MALE, seeks
fileid: wine.txt
Lovely delicate, fra
"""

# The NPS Chat corpus contains anonymized posts collected from instant-messaging chat rooms
from nltk.corpus import nps_chat

chatroom = nps_chat.posts('10-19-20s_706posts.xml')
print(chatroom[123])

"""
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
"""

The Brown Corpus, created at Brown University in 1961, was the first electronic corpus of English to contain over a million words. It includes text from 500 sources, organized by genre, with categories such as news and editorial.

Example document for each section of the Brown Corpus. Source: Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper

from nltk.corpus import brown
brown.categories()

"""
['adventure',
'belles_lettres',
'editorial',
'fiction',
'government',
'hobbies',
'humor',
'learned',
'lore',
'mystery',
'news',
'religion',
'reviews',
'romance',
'science_fiction']
"""

brown.words(categories='news')

"""
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
"""

import nltk

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre)
)
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

"""
can could may might must will
news 93 86 66 38 50 389
religion 82 59 78 12 54 71
hobbies 268 58 131 22 83 264
science_fiction 16 49 4 12 8 16
romance 74 193 11 51 45 43
humor 16 30 8 8 9 13
"""

nltk.ConditionalFreqDist() creates a conditional frequency distribution. A conditional frequency distribution is a collection of frequency distributions, each one for a different "condition". The condition often represents a category or group within the data.

cfd.tabulate() displays the frequencies as a table; here the counts of the modal verbs are conditioned on genre.
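The same distribution can also be visualized. A minimal sketch, assuming matplotlib is installed (plot() accepts the same conditions and samples arguments as tabulate()):

cfd.plot(conditions=genres, samples=modals)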

The Reuters Corpus contains 10,788 news documents totalling 1.3 million words, classified into 90 topics and split into a training set and a test set.

from nltk.corpus import reuters
print(reuters.categories())

"""
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
"""

print(reuters.categories('training/9865'))

"""
['barley', 'corn', 'grain', 'wheat']
"""

The Inaugural Address Corpus is a collection of the speeches given by Presidents of the United States at their inaugurations, used for linguistic and historical analysis.

from nltk.corpus import inaugural
print(inaugural.fileids())

"""
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt', '2013-Obama.txt', '2017-Trump.txt', '2021-Biden.txt']
"""

In NLTK, text corpora are organized into various structures to support diverse linguistic analyses:

  1. Unstructured Corpora: Basic collections of texts without specific organization.
  2. Categorized Corpora: Texts organized by categories such as genre or topic, exemplified by the Brown Corpus.
  3. Overlapping Categorizations: Corpora like the Reuters Corpus where texts belong to multiple categories, useful for multi-dimensional analysis.
  4. Temporal Corpora: Organized chronologically, such as the Inaugural Address Corpus, ideal for studying language evolution over time.
  5. Parallel Corpora: Include texts and their translations, crucial for comparative linguistics and machine translation studies.
  6. Annotated Corpora: Feature linguistic annotations like part-of-speech tags and syntactic trees, aiding in advanced NLP tasks.

Common structures for text corpora. Source: Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper

We can also load our own corpus into NLTK using PlaintextCorpusReader.

from nltk.corpus import PlaintextCorpusReader
corpus_root = 'folder'  # path to the directory containing your own text files
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()

"""
['sample_data.csv', 'text.txt', 'ticket.pdf']
"""

wordlists.words('text.txt')

"""
['I', 'have', 'some', 'instructions', 'here', '.', ...]
"""

A ConditionalFreqDist in NLTK is used to count frequencies of elements conditionally based on some criterion.

import nltk
from nltk.corpus import brown

# This list comprehension generates tuples where each tuple
# consists of a genre label and a word from that genre.
genre_word = [
    (genre, word)
    for genre in ['news', 'romance']
    for word in brown.words(categories=genre)
]
# [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand'), ...]
len(genre_word)

"""
170576
"""

genre_word[:4]

"""
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
"""

cfd = nltk.ConditionalFreqDist(genre_word)
cfd

"""
<ConditionalFreqDist with 2 conditions>
"""

cfd.conditions()

"""
['news', 'romance']
"""

cfd['news']

"""
FreqDist({'the': 5580, ',': 5188, '.': 4030, 'of': 2849, 'and': 2146, 'to': 2116, 'a': 1993, 'in': 1893, 'for': 943, 'The': 806, ...})
"""

The cfd.tabulate() method displays the frequency distribution in table format. In the example below:

  • conditions: Limits the display to the 'English' and 'German_Deutsch' data.
  • samples: Specifies the range of word lengths to include in the table (from 0 to 9).
  • cumulative=True: Indicates that the table should show cumulative frequencies. This means that each cell shows the total count of words with lengths up to and including that number (e.g., the count of words with length ≤ 3).

from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1')
)

cfd.tabulate(conditions=['English', 'German_Deutsch'], samples=range(10), cumulative=True)

"""
0 1 2 3 4 5 6 7 8 9
English 0 185 525 883 997 1166 1283 1440 1558 1638
German_Deutsch 0 171 263 614 717 894 1013 1110 1213 1275
"""

Lexical resources are databases or collections of information about words, their meanings, relationships, and usage in a language.

A lexical entry refers to a single, complete record in a lexical resource for a particular word or phrase.

The headword is the word under which a lexical entry is listed in a lexical resource. It’s essentially the main word being defined or described.

A lemma is a canonical form, or base form, of a word. In lexicography and NLP, a lemma represents a group of related forms of a word. For example, the lemma “run” includes forms like “runs,” “running,” and “ran.”
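NLTK's WordNetLemmatizer maps inflected forms back to their lemma. A minimal sketch (it assumes the WordNet data has been downloaded, and it needs the part of speech to handle verbs correctly):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('running', pos='v')  # 'run'
lemmatizer.lemmatize('ran', pos='v')      # 'run'
lemmatizer.lemmatize('runs', pos='n')     # 'run'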

Homonyms are words that share the same spelling and pronunciation but have different meanings.


NLTK includes wordlist corpora that consist of nothing but lists of words. These can be used for spell checking and for identifying uncommon or misspelled words.

import nltk

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)

print(unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))[:10])

"""
['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses', 'accents', 'accepting', 'accommodations']
"""

Stopwords are words that are filtered out before or during text processing because they carry little meaningful information on their own. They are typically very common, short function words such as "and", "the", "is", and "in". Because they occur so frequently and serve grammar rather than meaning, removing them helps the analysis focus on the content-bearing words.

from nltk.corpus import stopwords
print(stopwords.words('english'))

"""
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
"""

NLTK features the CMU Pronouncing Dictionary for American English, intended for use with speech synthesis systems.

entries = nltk.corpus.cmudict.entries()

for entry in entries[39943:39951]:
    print(entry)

"""
('explorer', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0'])
('explorers', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0', 'Z'])
('explores', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'Z'])
('exploring', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'IH0', 'NG'])
('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])
('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])
('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])
"""

WordNet is a semantically focused English dictionary, akin to a thesaurus but with a more complex structure.

Synonyms are words or phrases that have nearly the same meaning as another word or phrase in the same language. For example, the words “quick” and “fast” are synonyms because they convey a similar idea of speed.

A synset is a concept used primarily in linguistic databases like WordNet, and it stands for “set of synonyms”. A synset groups words or phrases that express the same concept but may have slight variations in context or usage. Each synset contains words that are interchangeable in some contexts but might differ slightly in connotation, usage, or emotional color.

from nltk.corpus import wordnet as wn
wn.synsets('fast')

"""
[Synset('fast.n.01'),
Synset('fast.v.01'),
Synset('fast.v.02'),
Synset('fast.a.01'),
Synset('fast.a.02'),
Synset('fast.a.03'),
Synset('fast.s.04'),
Synset('fast.s.05'),
Synset('debauched.s.01'),
Synset('flying.s.02'),
Synset('fast.s.08'),
Synset('firm.s.10'),
Synset('fast.s.10'),
Synset('fast.r.01'),
Synset('fast.r.02')]
"""

print(wn.synset('fast.n.01').lemma_names())

"""
['fast', 'fasting']
"""

wn.synset('fast.n.01').definition()

"""
'abstaining from food'
"""

WordNet organizes its lexical database in a structured and hierarchical manner, which is crucial for understanding the relationships between different words (lemmas) and their meanings (senses).

The WordNet Hierarchy. Source

Hypernyms are more general or abstract terms relative to a given synset. A hypernym of “car” is “vehicle”, meaning that every instance of a car is also an instance of a vehicle.

Hyponyms are more specific or concrete terms relative to a given synset. A hyponym of “bird” is “sparrow”, meaning that every sparrow is a bird, but not all birds are sparrows.

fast = wn.synset('fast.n.01')
types_of_fast = fast.hyponyms()
types_of_fast

"""
[Synset('diet.n.04'), Synset('hunger_strike.n.01'), Synset('ramadan.n.02')]
"""

fast.hypernyms()

"""
[Synset('abstinence.n.02')]
"""

Meronyms are words that denote a part of something. For instance, “wheel” is a meronym of “car” because a wheel is part of a car.

Holonyms are words that denote a whole that a part belongs to. Using the same example, “car” is a holonym of “wheel”.

Antonyms are words that have opposite meanings. For example, “hot” and “cold” are antonyms. While not a hierarchical relationship, it’s crucial for understanding semantic opposites within the database.

wn.synset('tree.n.01').part_meronyms()

"""
[Synset('burl.n.02'),
Synset('crown.n.07'),
Synset('limb.n.02'),
Synset('stump.n.01'),
Synset('trunk.n.01')]
"""

wn.synset('tree.n.01').member_holonyms()

"""
[Synset('forest.n.01')]
"""

wn.lemma('supply.n.02.supply').antonyms()

"""
[Lemma('demand.n.02.demand')]
"""

Semantic similarity measures how closely related two concepts are, for example by finding their lowest common hypernym in the hierarchy and measuring the path distance between them.

The path_similarity function calculates the shortest path distance between two synsets in the network and converts this distance into a similarity score.

right = wn.synset('right_whale.n.01')
minke = wn.synset('minke_whale.n.01')
right.path_similarity(minke)

"""
0.25
"""


Sources

https://www.nltk.org/book/ch02.html
