Vossian Antonomasia

Automatic extraction of Vossian antonomasia from large newspaper corpora.

(Shout-out to Gerardus Vossius, 1577–1649.)

View the Project on GitHub weltliteratur/vossanto

Some More Statistics

An “executable” version of this file is statistics.org.

Temporal Distribution

Let us check how Vossian Antonomasia (VA) is spread across the whole corpus:

echo "year articles cand wd wd+bl found true prec"
for year in $(seq 1987 2007); do
    echo $year \
         $(grep ^$year ../articles.tsv | cut -d' ' -f2) \
         $(zcat ../theof_${year}.tsv.gz | wc -l) \
         $(cat ../theof_${year}_wd.tsv | wc -l) \
         $(cat ../theof_${year}_wda_bl.tsv | wc -l) \
         $(../org.py -y ../README.org | grep ${year} | wc -l) \
         $(../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -y -c -b ../README.org | grep ${year} | awk -F'\t' '{if ($2 == "D" || $3 == "True") print;}' | wc -l)
done
year articles cand wd wd+bl found true prec  
1987 106104 641432 5236 131 129 95 73.6 0.90
1988 104541 637132 5074 143 141 88 62.4 0.84
1989 102818 625894 4922 151 148 104 70.3 1.01
1990 98812 614164 4890 142 140 105 75.0 1.06
1991 85135 512582 4189 154 154 103 66.9 1.21
1992 82685 493808 4442 152 152 103 67.8 1.25
1993 79200 480883 4338 167 167 121 72.5 1.53
1994 74925 464278 4038 164 164 112 68.3 1.49
1995 85392 500404 4636 162 162 124 76.5 1.45
1996 79077 497688 4250 186 186 133 71.5 1.68
1997 85396 515759 4561 173 173 134 77.5 1.57
1998 89163 571010 5333 243 243 180 74.1 2.02
1999 91074 585464 5375 189 189 136 72.0 1.49
2000 94258 602240 4750 231 231 172 74.5 1.82
2001 96282 587644 4512 210 209 163 78.0 1.69
2002 97258 597289 4992 231 229 177 77.3 1.82
2003 94235 590890 4749 219 216 165 76.4 1.75
2004 91362 571894 4702 192 191 153 80.1 1.67
2005 90004 562027 4680 208 207 162 78.3 1.80
2006 87052 561203 4786 221 221 169 76.5 1.94
2007 39953 260778 2276 101 101 76 75.2 1.90
sum 1854726 11474463 96731 3770 3753 2775 73.9 1.50
mean 88320 546403 4606 180 179 132 73.7 1.49

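The table shows, per year, the number of articles, the number of candidate phrases (cand), the candidates that remain after matching against Wikidata (wd) and after additionally applying a blacklist (wd+bl), the entries extracted from README.org (found), and those confirmed as VA by manual inspection (true). The prec column gives the precision in percent (true divided by found). The last column, which has no header, is the number of true VA per 1,000 articles.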
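The two derived columns can be recomputed from the raw counts. The following is a minimal sketch, assuming that prec is true/found in percent and that the final column is true per 1,000 articles; both assumptions agree with the rows above, but the function name `derived_columns` is made up for this illustration and is not part of the project.

```shell
# Recompute the derived columns of one table row from its raw counts.
# Assumption: prec = true / found * 100, last column = true / articles * 1000.
derived_columns() {
    # arguments: articles found true
    awk -v articles="$1" -v found="$2" -v true_va="$3" 'BEGIN {
        printf "%.1f %.2f\n", true_va / found * 100, true_va / articles * 1000
    }'
}

derived_columns 106104 129 95   # counts of the 1987 row
```

For the 1987 row this prints `73.6 0.90`, which matches the table.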

Let us plot some of the columns:

reset
set datafile separator "\t"

set xlabel "year"
set ylabel "frequency"
set grid linetype 1 linecolor 0
set yrange [0:*]
set y2range [0:100]
set y2label 'precision'
set y2tics
set key bottom right
set style fill solid 1

set term svg enhanced size 800,600 dynamic fname "Noto Sans, Helvetica Neue, Helvetica, Arial, sans-serif" fsize 16
#set out "nyt_vossantos_over_time.svg"
plot data using 1:6 with linespoints pt 6 title 'candidates',\
     data using 1:7 with linespoints pt 7 title 'VA',\
     data using 1:8 with lines            title 'precision' axes x1y2

# for arxiv paper
set term pdf enhanced lw 2
set out "nyt_vossantos_over_time.pdf"
replot

# for DSH paper
set term png enhanced size 2835,2126 font "Arial,40" lw 4
# set term png enhanced size 800,600 font "Arial,16" lw  2
set out "nyt_vossantos_over_time.png"
plot data using 1:6 with linespoints pt 6 ps 7 lc "black" title 'candidates',\
     data using 1:7 with linespoints pt 7 ps 7 lc "black" title 'VA',\
     data using 1:8 with lines                 lc "black" title 'precision' axes x1y2


# ---- relative values

set term svg enhanced size 800,600 dynamic fname "Noto Sans, Helvetica Neue, Helvetica, Arial, sans-serif" fsize 16
set out "nyt_vossantos_over_time_rel.svg"
set ylabel "frequency (per mille)"
set format y "%2.1f"

plot data using 1:($6/$2*1000) with linespoints pt 6 title 'candidates',\
     data using 1:($7/$2*1000) with linespoints pt 7 title 'VA',\
     data using 1:8            with lines            title 'precision' axes x1y2

# for arxiv paper
set term pdf enhanced lw 2
set out "nyt_vossantos_over_time_rel.pdf"
replot

set term png enhanced size 2835,2126 font "Arial,40" lw 4
# set term png enhanced size 800,600 font "Arial,16" lw  2
set out "nyt_vossantos_over_time_rel.png"
plot data using 1:($6/$2*1000) with linespoints pt 6 ps 7 lc "black" title 'candidates',\
     data using 1:($7/$2*1000) with linespoints pt 7 ps 7 lc "black" title 'VA',\
     data using 1:8            with lines                 lc "black" title 'precision' axes x1y2

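Absolute frequency: [figure: candidates and VA per year, with precision on the second axis]

Relative frequency: [figure: candidates and VA per 1,000 articles per year, with precision on the second axis]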

Top-40 VA Sources

Let us count the most frequent sources for Vossian Antonomasia:

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T ../README.org | sort | uniq -c | sort -nr | head -n40
count source
68 Michael Jordan
58 Rodney Dangerfield
36 Babe Ruth
32 Elvis Presley
31 Johnny Appleseed
23 Bill Gates
21 Pablo Picasso
21 Michelangelo
21 Donald Trump
21 Jackie Robinson
21 Madonna
20 P. T. Barnum
20 Tiger Woods
18 Martha Stewart
16 Henry Ford
16 William Shakespeare
16 Wolfgang Amadeus Mozart
15 Adolf Hitler
14 Greta Garbo
14 John Wayne
14 Mother Teresa
13 Napoleon
13 Ralph Nader
12 Leonardo da Vinci
12 Cal Ripken
12 Leo Tolstoy
12 Oprah Winfrey
12 Rosa Parks
12 Susan Lucci
11 Walt Disney
11 Willie Horton
11 Rembrandt
10 Albert Einstein
10 Thomas Edison
10 Mike Tyson
10 Julia Child
9 Ross Perot
9 Dennis Rodman
8 James Dean
8 Mikhail Gorbachev

Images of these sources, pulled from Wikidata via Property:P18 (one entity, Willie Horton, has no image in Wikidata): [image gallery of the remaining 39 sources, in the order of the list above]
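The ranking above relies on the common `sort | uniq -c | sort -nr | head` idiom. As a small sketch of how it behaves, the following wraps the idiom in a hypothetical helper `count_top` and runs it on made-up input lines (they are not corpus data); the final `awk` step only strips the padding that `uniq -c` puts in front of the counts.

```shell
# Count how often each input line occurs and list the most frequent first.
# The input lines below are made up for illustration.
count_top() {
    # argument: number of entries to keep
    sort | uniq -c | sort -nr | head -n "$1" | awk '{ $1 = $1; print }'
}

printf '%s\n' "source A" "source B" "source A" "source C" "source A" "source B" \
    | count_top 2
```

The first `sort` is needed because `uniq -c` only merges adjacent duplicates.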

Categories

online

Extract the categories of articles:

export PYTHONIOENCODING=utf-8
for year in $(seq 1987 2007); do
    ./nyt.py --category ../nyt_corpus_${year}.tar.gz \
        | sed -e "s/^nyt_corpus_//" -e "s/\.har\//\//" -e "s/\.xml\t/\t/" \
        | sort >> nyt_categories.tsv
done

Compute frequency distribution over all articles:

cut -d$'\t' -f2 nyt_categories.tsv | sort -S1G | uniq -c \
   | sed -e "s/^ *//" -e "s/ /\t/" | awk -F'\t' '{print $2"\t"$1}' \
                                          > nyt_categories_distrib.tsv

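Check the number of categories and list the most frequent ones:

echo articles $(wc -l < nyt_categories.tsv)
echo categories $(wc -l < nyt_categories_distrib.tsv)
echo ""
sort -t$'\t' -nrk2 nyt_categories_distrib.tsv | head
articles 1854726
categories 1580
Business 291982
Sports 160888
Opinion 134428
U.S. 89389
Arts 88460
World 79786
Style 65071
Obituaries 19430
Magazine 11464
Travel 10440

Note: the listing above was generated without -t$'\t', so sort split each line on whitespace and ranked multi-word categories by the wrong field. That is why, for example, “New York and Region” (221897 articles according to the table below) is missing from it.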

Collect the categories of the articles that contain VA and compare them with the distribution over all articles:

echo "VA" $(../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -T ../README.org | wc -l) articles $(wc -l < ../nyt_categories.tsv)
../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -T -f ../README.org | join ../nyt_categories.tsv - | sed "s/ /\t/" | awk -F'\t' '{print $2}' \
    | sort | uniq -c \
    | sed -e "s/^ *//" -e "s/ /\t/" | awk -F'\t' '{print $2"\t"$1}' \
    | join -t$'\t' -o1.2,1.1,2.2 - ../nyt_categories_distrib.tsv \
    | sort -nr | head -n20
VA 2646 category articles 1854726
336 12.7% Sports 160888 8.7%
334 12.6% Arts 88460 4.8%
290 11.0% New York and Region 221897 12.0%
237 9.0% Arts; Books 35475 1.9%
158 6.0% Movies; Arts 27759 1.5%
109 4.1% Business 291982 15.7%
102 3.9% Opinion 134428 7.2%
96 3.6% U.S. 89389 4.8%
95 3.6% Magazine 11464 0.6%
62 2.3% Style 65071 3.5%
61 2.3% Arts; Theater 13283 0.7%
46 1.7% World 79786 4.3%
39 1.5% Home and Garden; Style 13978 0.8%
32 1.2% Travel 10440 0.6%
31 1.2% Technology; Business 23283 1.3%
27 1.0%   42157 2.3%
25 0.9% Week in Review 17107 0.9%
25 0.9% Home and Garden 5546 0.3%
17 0.6% World; Washington 24817 1.3%
17 0.6% Style; Magazine 1519 0.1%
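The pipeline above emits only the VA count, the category, and the article count; the percentage columns in the table are not produced by it. A minimal sketch of how they could be derived, assuming each share is simply the value divided by the corresponding total from the header row (the helper name `add_shares` is hypothetical):

```shell
# Add percentage shares to "count<TAB>category<TAB>articles" rows.
# Assumption: share = value / total * 100; totals are passed as arguments.
add_shares() {
    # arguments: total number of VA, total number of articles
    awk -F'\t' -v va_total="$1" -v art_total="$2" '{
        printf "%d %.1f%% %s %d %.1f%%\n", $1, $1 / va_total * 100, $2, $3, $3 / art_total * 100
    }'
}

printf '336\tSports\t160888\n' | add_shares 2646 1854726
```

With the totals from the header row this prints `336 12.7% Sports 160888 8.7%`, the first row of the table.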

desks

Extract the desks of the articles:

export PYTHONIOENCODING=utf-8
for year in $(seq 1987 2007); do
    ./nyt.py --desk ../nyt_corpus_${year}.tar.gz \
        | sed -e "s/^nyt_corpus_//" -e "s/\.har\//\//" -e "s/\.xml\t/\t/" \
        | sort >> nyt_desks.tsv
done

Compute frequency distribution over all articles:

cut -d$'\t' -f2 nyt_desks.tsv | sort -S1G | uniq -c \
   | sed -e "s/^ *//" -e "s/ /\t/" | awk -F'\t' '{print $2"\t"$1}' \
                                          > nyt_desks_distrib.tsv

Check the number of desks and list the most frequent ones:

echo articles $(wc -l < nyt_desks.tsv)
echo categories $(wc -l < nyt_desks_distrib.tsv)
echo ""
sort -t$'\t' -nrk2 nyt_desks_distrib.tsv | head
articles 1854727
categories 398
Metropolitan Desk 237896
Financial Desk 206958
Sports Desk 174823
National Desk 143489
Editorial Desk 131762
Foreign Desk 129732
Classified 129660
Business/Financial Desk 112951
Society Desk 44032
Cultural Desk 40342

Collect the desks of the articles that contain VA and compare them with the distribution over all articles:

echo "VA" $(./org.py -T README.org | wc -l) articles $(wc -l < nyt_desks.tsv)
./org.py -T -f README.org | join nyt_desks.tsv - | sed "s/ /\t/" | awk -F'\t' '{print $2}' \
    | sort | uniq -c \
    | sed -e "s/^ *//" -e "s/ /\t/" | awk -F'\t' '{print $2"\t"$1}' \
    | join -t$'\t' -o1.2,1.1,2.2 - nyt_desks_distrib.tsv \
    | sort -nr | head -n20
VA 2764 desk articles 1854727
133 4.8% Sports Desk 174823 9.4%
77 2.8% Cultural Desk 40342 2.2%
68 2.5% Book Review Desk 32737 1.8%
61 2.2% National Desk 143489 7.7%
54 2.0% Financial Desk 206958 11.2%
51 1.8% Metropolitan Desk 237896 12.8%
46 1.7% Weekend Desk 18814 1.0%
38 1.4% Arts & Leisure Desk 6742 0.4%
35 1.3% Editorial Desk 131762 7.1%
31 1.1% Foreign Desk 129732 7.0%
31 1.1% Arts and Leisure Desk 27765 1.5%
25 0.9% Magazine Desk 25433 1.4%
25 0.9% Long Island Weekly Desk 20453 1.1%
22 0.8% Living Desk 6843 0.4%
19 0.7% Home Desk 8391 0.5%
15 0.5% Week in Review Desk 21897 1.2%
14 0.5% Style Desk 21569 1.2%
13 0.5% Styles of The Times 2794 0.2%
12 0.4%   6288 0.3%
9 0.3% Travel Desk 23277 1.3%

Sidenote: the desk names in the corpus metadata contain many errors and inconsistencies, so some desks are split across several spellings.
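For instance, the table above lists “Arts & Leisure Desk” and “Arts and Leisure Desk” as two separate desks. The following sketch shows one possible way to merge such variants in a `desk<TAB>count` file before ranking. The rule (replacing “&” with “and”) and the helper name `merge_desk_variants` are hypothetical and not part of the project; real clean-up would likely need more rules.

```shell
# Merge desk names that differ only in "&" versus "and" and sum their counts.
# This clean-up rule is a hypothetical example, not the project's method.
merge_desk_variants() {
    awk -F'\t' '{
        name = $1
        gsub(/ & /, " and ", name)   # "Arts & Leisure Desk" -> "Arts and Leisure Desk"
        total[name] += $2
    }
    END { for (name in total) printf "%s\t%d\n", name, total[name] }' | sort
}

printf 'Arts & Leisure Desk\t6742\nArts and Leisure Desk\t27765\nTravel Desk\t23277\n' \
    | merge_desk_variants
```

On these three rows the two spellings collapse into a single “Arts and Leisure Desk” entry with 34507 articles.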

Authors

Extract the authors of articles:

export PYTHONIOENCODING=utf-8
for year in $(seq 1987 2007); do
    ./nyt.py --author ../nyt_corpus_${year}.tar.gz \
        | sed -e "s/^nyt_corpus_//" -e "s/\.har\//\//" -e "s/\.xml\t/\t/" \
        | sort >> nyt_authors.tsv
done

Compute frequency distribution over all articles:

cut -d$'\t' -f2 nyt_authors.tsv | sort -S1G | uniq -c \
   | sed -e "s/^ *//" -e "s/ /\t/" | awk -F'\t' '{print $2"\t"$1}' \
                                          > nyt_authors_distrib.tsv

Check the number of authors and list the most frequent ones:

echo articles $(wc -l < nyt_authors.tsv)
echo categories $(wc -l < nyt_authors_distrib.tsv)
echo ""
sort -t$'\t' -nrk2 nyt_authors_distrib.tsv | head
articles 1854726
categories 30691
  961052
Elliott, Stuart 6296
Holden, Stephen 5098
Chass, Murray 4544
Pareles, Jon 4090
Brozan, Nadine 3741
Fabricant, Florence 3659
Kozinn, Allan 3654
Curry, Jack 3654
Truscott, Alan 3646

The author data requires clean-up: by far the most frequent value is the empty author field (961052 articles).

Collect the authors of the articles that contain VA and compare them with the distribution over all articles:

echo "VA" $(../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -T ../README.org | wc -l) articles $(wc -l < ../nyt_authors.tsv)
../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -T -f ../README.org | join ../nyt_authors.tsv - | sed "s/ /\t/" | awk -F'\t' '{print $2}' \
    | sort | uniq -c \
    | sed -e "s/^ *//" -e "s/ /\t/" | awk -F'\t' '{print $2"\t"$1}' \
    | join -t$'\t' -o1.2,1.1,2.2 - ../nyt_authors_distrib.tsv \
    | sort -nr | head -n20
VA 2646 author articles 1854726
411 15.5%   961052 51.8%
30 1.1% Holden, Stephen 5098 0.3%
29 1.1% Maslin, Janet 2874 0.2%
26 1.0% Vecsey, George 2739 0.1%
23 0.9% Sandomir, Richard 3140 0.2%
22 0.8% Ketcham, Diane 717 0.0%
20 0.8% Kisselgoff, Anna 2661 0.1%
19 0.7% Dowd, Maureen 1647 0.1%
19 0.7% Berkow, Ira 1704 0.1%
18 0.7% Kimmelman, Michael 1515 0.1%
17 0.6% Brown, Patricia Leigh 568 0.0%
16 0.6% Pareles, Jon 4090 0.2%
16 0.6% Chass, Murray 4544 0.2%
15 0.6% Smith, Roberta 2497 0.1%
15 0.6% Lipsyte, Robert 817 0.0%
15 0.6% Grimes, William 1368 0.1%
15 0.6% Barron, James 2188 0.1%
15 0.6% Anderson, Dave 2735 0.1%
14 0.5% Stanley, Alessandra 1437 0.1%
14 0.5% Haberman, Clyde 2492 0.1%

List of All VA Coined by the Two Top-Scoring Authors

Stephen Holden

# extract list of articles
for article in $(../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -T -f ../README.org \
       | join ../nyt_authors.tsv - | grep "Holden, Stephen" | cut -d' ' -f1 ); do
  grep "$article" ../README.org
done

Janet Maslin

# extract list of articles
for article in $(../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -T -f ../README.org \
             | join ../nyt_authors.tsv - | grep "Maslin, Janet" | cut -d' ' -f1 ); do
  grep "$article" ../README.org
done

Relative Frequency

The previous table ranks authors by the absolute number of VA they used. Naturally, authors who wrote more articles had more opportunities to use VA, so let us also compare relative usage: for each author we compute how many articles are needed, on average, to encounter one VA. The smaller this number, the more often the author uses VA; a value of 18 would mean that, on average, a VA occurs in every 18th article. To filter out authors who wrote for the NYT only occasionally, we consider only authors with at least 1000 articles.

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -T -f ../README.org \
    | join ../nyt_authors.tsv - | sed "s/ /\t/" | awk -F'\t' '{print $2}' \
    | sort | uniq -c \
    | sed -e "s/^ *//" -e "s/ /\t/" | awk -F'\t' '{print $2"\t"$1}' \
    | join -t$'\t' -o1.2,2.2,1.1 - ../nyt_authors_distrib.tsv \
    | awk -F$'\t' '{if ($2 >= 1000) printf "%3.1f\t%i\t%i\t%s\n", $2/$1, $1, $2, $3}' \
    | LC_NUMERIC=en_US.UTF-8 sort -n | head -n20
articles per VA VA articles author
84.2 18 1515 Kimmelman, Michael
86.7 19 1647 Dowd, Maureen
89.7 19 1704 Berkow, Ira
91.2 15 1368 Grimes, William
99.1 29 2874 Maslin, Janet
102.6 14 1437 Stanley, Alessandra
105.3 26 2739 Vecsey, George
111.4 11 1225 Strauss, Neil
112.6 10 1126 Scott, A O
112.9 10 1129 Rich, Frank
113.0 12 1356 Apple, R W Jr
132.5 12 1590 Longman, Jere
133.1 20 2661 Kisselgoff, Anna
136.5 23 3140 Sandomir, Richard
138.6 14 1940 Araton, Harvey
139.5 13 1814 Martin, Douglas
139.9 10 1399 Verhovek, Sam Howe
145.9 15 2188 Barron, James
146.0 8 1168 Gates, Anita
154.6 9 1391 Collins, Glenn

Modifiers

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -o -T ../README.org | sort | uniq -c | sort -nr | head -n30
count modifier
55 his day
33 his time
29 Japan
16 tennis
16 his generation
16 baseball
15 China
13 her time
13 her day
12 our time
11 the 1990’s
10 the Zulus
10 the 90’s
10 politics
10 hockey
10 Brazil
10 basketball
10 ballet
9 jazz
9 fashion
8 today
8 Israel
8 his era
8 hip-hop
8 golf
8 dance

Time

“Today”

Who are the sources for the modifier “… of today”?

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
    | grep "of\* /today/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
1 Shoeless Joe Jackson
1 Buck Rogers
1 Bill McGowan
1 William F. Buckley Jr.
1 Ralph Fiennes
1 Julie London
1 Jimmy Osmond
1 Harry Cohn

“His Day” or “His Time”

Who are the sources for the modifiers “… of his day”, “… of his time”, and “… of his generation”?

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
    | grep "of\* /his \(day\|time\|generation\)/" | awk -F'\t' '{print $2}' | sort | uniq -c  | sort -nr  | head
count source
3 Michael Jordan
2 Mike Tyson
2 Billy Martin
2 Dan Quayle
2 Arnold Schwarzenegger
2 Martha Stewart
2 Donald Trump
2 L. Ron Hubbard
2 Tiger Woods
1 Lawrence Taylor

“Her Day”

Who are the sources for the modifier “… of her day”?

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
    | grep "of\* /her day/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
1 Hilary Swank
1 Hillary Clinton
1 Marilyn Monroe
1 Judith Krantz
1 Lucia Pamela
1 Elizabeth Taylor
1 Imelda Marcos
1 Laurie Anderson
1 Nell Gwyn
1 Annie Leibovitz
1 Tara Reid
1 Madonna
1 Maria Callas

Country

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -o -T ../README.org \
      | sort | uniq -c | sort -nr | grep "Japan\|China\|Brazil\|Iran\|Israel\|Mexico\|India\|South Africa\|Spain\|South Korea\|Russia\|Poland\|Pakistan" | head -n13
count country
29 Japan
15 China
10 Brazil
8 Israel
7 Iran
7 India
4 South Africa
4 Mexico
3 Spain
3 South Korea
3 Russia
3 Poland
3 Pakistan

What are the sources for the modifier … ?

“Japan”

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
      | grep "of\* /Japan/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
5 Walt Disney
4 Bill Gates
2 Nolan Ryan
2 Frank Sinatra
1 Richard Perle
1 Thomas Edison
1 Cal Ripken
1 Walter Johnson
1 Andy Warhol
1 Pablo Picasso
1 William Wyler
1 Stephen King
1 Brad Pitt
1 Richard Avedon
1 P. D. James
1 Rem Koolhaas
1 Steve Jobs
1 Ralph Nader
1 Madonna
1 Jack Kerouac

“China”

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
     | grep "of\* /China/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
4 Barbara Walters
2 Jack Welch
1 Louis XIV of France
1 Oskar Schindler
1 Napoleon
1 Keith Haring
1 Mikhail Gorbachev
1 Donald Trump
1 Larry King
1 Ted Turner
1 Madonna

“Brazil”

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
      | grep "of\* /Brazil/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
1 Giuseppe Verdi
1 Jil Sander
1 Walter Reed
1 Lech Wałęsa
1 Jim Morrison
1 Bob Dylan
1 Elvis Presley
1 Scott Joplin
1 Larry Bird
1 Pablo Escobar

Sports

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -o -T ../README.org \
    | sort | uniq -c | sort -nr | grep "baseball\|basketball\|tennis\|golf\|football\|racing\|soccer\|sailing" | head -n7
count sports
16 tennis
16 baseball
10 basketball
8 golf
7 football
6 soccer
6 racing

Who are the sources for the modifier … ?

“Tennis”

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
    | grep "of\* /tennis/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
2 George Foreman
1 Tim McCarver
1 Pete Rose
1 Nolan Ryan
1 Crash Davis
1 Spike Lee
1 John Madden
1 Michael Jordan
1 John Wayne
1 George Hamilton
1 Michael Dukakis
1 Jackie Robinson
1 Babe Ruth
1 Dennis Rodman
1 Madonna

“Baseball”

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
    | grep "of\* /baseball/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
2 P. T. Barnum
2 Larry Bird
1 Clifford Irving
1 Mike Tyson
1 Thomas Dooley
1 Marco Polo
1 Pablo Picasso
1 Horatio Alger
1 Rodney Dangerfield
1 Michael Jordan
1 Alan Alda
1 Brandon Tartikoff
1 Howard Hughes
1 Thomas Jefferson

“Basketball”

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
    | grep "of\* /basketball/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
2 Babe Ruth
1 Joseph Stalin
1 Martin Luther King, Jr.
1 Pol Pot
1 Johnny Appleseed
1 Adolf Hitler
1 Bugsy Siegel
1 Elvis Presley
1 Chuck Yeager

“Golf”

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
    | grep "of\* /golf/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
2 Michael Jordan
2 Jackie Robinson
1 J. D. Salinger
1 James Brown
1 Marlon Brando
1 Babe Ruth

“Football”

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
    | grep "of\* /football/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
1 Ann Calvello
1 Bobby Fischer
1 Patrick Henry
1 Susan Lucci
1 Jackie Robinson
1 Babe Ruth
1 Rich Little

“Soccer”

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
    | grep "of\* /soccer/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
1 James Brown
1 Michael Jordan
1 Larry Brown
1 Derek Jeter
1 Ernie Banks
1 Magic Johnson

“Racing”

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -w -T -t -c ../README.org \
    | grep "of\* /racing/" | awk -F'\t' '{print $2}' | sort | uniq -c | sort -nr
count source
2 Rodney Dangerfield
1 John Madden
1 Bobo Holloman
1 Lou Gehrig
1 Wayne Gretzky

Culture

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -T -o ../README.org \
    | sort | uniq -c | sort -nr | grep "dance\|hip-hop\|jazz\|fashion\|weaving\|ballet\|the art world\|wine\|salsa"   | head -n8
count modifier
10 ballet
9 jazz
9 fashion
8 hip-hop
8 dance
7 the art world
4 wine
4 salsa

Michael Jordan

../org.py -T -l -o ../README.org | awk -F'\t' '{if ($1 == "Michael Jordan") print $2}' \
      | sort -u

the Michael Jordan of

Some Favourites

Complete List of Successfully Extracted VA

../org.py --ignore-source-ids fictional_humans_in_our_data_set.tsv -g -H -T ../README.org \
      | pandoc -f org -t markdown -o vossantos.md

The result is written to vossantos.md.