Lexical Investigations of the 1913 Webster's Unabridged

by David Dailey

(via Ralph Sutherland's OPTED, via Gutenberg, via Noah Webster)


Questions:

Look up the frequency of a word.
Look up the definition of the word.

Frequencies (within the dictionary) of the 100 most common words:

$ head -100 BigConc
133852 of
115472 a
113133 the
97527 to
89324 or
39595 as
39514 in
27267 and
19155 an
18634 by
15975 with
14320 one
14056 which
11955 for
11867 is
10786 from
8741 see
7763 who
7563 that
7117 on
7083 being
6449 having
6167 also
5890 pertaining
5731 it
5577 used
5333 not
5239 act
4994 be
4929 state
4759 etc
4621 any
4373 called
4259 at
4189 alt
4164 are
3520 into
3421 other
3295 like
3045 part
2969 same
2948 quality
2936 manner
2930 form
2915 its
2798 make
2793 hence
2578 small
2484 so
2307 under
2307 made
2299 place
2295 body
2261 water
2253 person
2250 some
2210 especially
2193 two
2039 out
2037 upon
1949 another
1931 without
1875 certain
1865 esp
1861 kind
1816 usually
1751 has
1737 genus
1722 up
1705 said
1663 but
1640 his
1596 anything
1595 species
1580 resembling
1577 when
1554 time
1542 between
1516 ones
1482 more
1460 have
1459 substance
1432 power
1423 over
1415 large
1410 parts
1393 something
1377 acid
1374 cause
1364 process
1361 order
1346 action
1338 color
1322 often
1297 consisting
1293 instrument
1286 containing
1282 their
1270 sometimes
1256 capable
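
The first question above, the frequency of a given word, can be answered straight from this listing. The following is a minimal sketch, assuming BigConc holds one "count word" pair per line as shown; word_freq is a hypothetical helper name, not part of the original scripts.

```shell
# word_freq WORD [FILE]
# Print the count recorded for WORD in a "count word" listing.
# Hypothetical helper; FILE defaults to BigConc in the current directory.
word_freq() {
    # Lowercase the query first, because the listing is all lowercase.
    awk -v w="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" \
        '$2 == w { print $1; found = 1; exit }
         END     { if (!found) print 0 }' \
        "${2:-BigConc}"
}
```

With the listing above, `word_freq water` would print 2261. A word that never occurs prints 0.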

Technical Notes
The OPTED project distributes the dictionary as one HTML file per letter, plus wb1913_new.html (size in bytes, then file name):

$ ls -l|awk '{print $5,$9}'

1178738 wb1913_a.html
985466 wb1913_b.html
1725819 wb1913_c.html
1077154 wb1913_d.html
788913 wb1913_e.html
752831 wb1913_f.html
549347 wb1913_g.html
654226 wb1913_h.html
807226 wb1913_i.html
135943 wb1913_j.html
130249 wb1913_k.html
602493 wb1913_l.html
905696 wb1913_m.html
972633 wb1913_new.html
310305 wb1913_n.html
444445 wb1913_o.html
1618704 wb1913_p.html
109964 wb1913_q.html
911289 wb1913_r.html
2203644 wb1913_s.html
992243 wb1913_t.html
322136 wb1913_u.html
307443 wb1913_v.html
462147 wb1913_w.html
18868 wb1913_x.html
47619 wb1913_y.html
46196 wb1913_z.html

Records in the files look like this (two lines from wb1913_q.html):

<P><B>Qua</B> (<I>conj.</I>) In so far as; in the capacity or character of; as.</P>
<P><B>Quab</B> (<I>n.</I>) An unfledged bird; hence, something immature or unfinished.</P>
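
Because each record starts with its headword in a B element, the second question above (looking up a definition) can be sketched with grep. This assumes that every record occupies a single line of its file and that the word contains no regular-expression metacharacters; define is a hypothetical helper name, not part of the original scripts.

```shell
# define WORD FILE...
# Print every record whose headword is WORD, with the HTML tags removed.
# Hypothetical helper; assumes one record per line.
define() {
    pattern="^<P><B>$1</B> "   # headword, then the space before the "(pos)" label
    shift
    # -i: headwords are capitalized in the files; -h: omit file names
    grep -i -h -e "$pattern" "$@" | sed 's/<[^>]*>//g'
}

# Example: define quab wb1913_q.html
```

For the record quoted above, the output line would read "Quab (n.) An unfledged bird; hence, something immature or unfinished." If a headword has several records, all of them are printed.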

The following script strips the HTML tags, drops parenthesized text (such as the part-of-speech label), removes punctuation, converts everything to lowercase, and then counts the occurrences of each word in the dictionary:

$ cat `ls`|strings|/homes/ddailey/programs/concDict >../BigConc

where the file /homes/ddailey/programs/concDict contains a single pipeline:

sed 's/<[/[:alpha:]]*>//g;s/[(].*[)]//g;s/[[:punct:]]//g;s/\ /\
/g' $1|awk '!/^$/ {print tolower($0)}'|sort -f|uniq -c|sort -nr

The first 100 lines of the resulting file, BigConc, are reproduced above.
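
For readers who want to follow the pipeline stage by stage, here is the same sequence of steps laid out one per line. It is a sketch rather than the original file: conc is a hypothetical name, `|` replaces `/` as the sed delimiter, and tr stands in for the substitution that turns each space into a newline.

```shell
# conc [FILE...]
# Word counts, most frequent first (hypothetical restatement of concDict).
# Reads standard input when no FILE is given.
conc() {
    # 1. delete tags such as <P>, </B>, <I>
    # 2. delete parenthesized text; ".*" is greedy, so everything from the
    #    first "(" to the last ")" on a line is removed
    # 3. delete the remaining punctuation
    sed -e 's|<[/[:alpha:]]*>||g' \
        -e 's|[(].*[)]||g' \
        -e 's|[[:punct:]]||g' "$@" |
    tr ' ' '\n' |                          # 4. one word per line
    awk '!/^$/ { print tolower($0) }' |    # 5. drop empty lines, lowercase
    sort -f | uniq -c | sort -nr           # 6. count, then rank by count
}

# Example: cat wb1913_*.html | strings | conc > ../BigConc
```

Run on the Quab record quoted earlier, this yields nine words (quab, an, unfledged, bird, hence, something, immature, or, unfinished), each with a count of 1. Two properties are worth keeping in mind when reading the counts: the headword is counted along with the words of its definition, and on any line with more than one parenthesized group the greedy match in step 2 also removes the text between the groups.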