1+ cmudict *
2+ kilgarriff *

1+ cmudict :
2+ 	curl http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b > cmudict
3+
4+ kilgarriff :
5+ 	curl http://www.kilgarriff.co.uk/BNClists/all.al.gz | gunzip > kilgarriff
6+
7+ # kilgarriff.num::
8+ # curl http://www.kilgarriff.co.uk/BNClists/all.num.gz | gunzip > kilgarriff.num

1+ source files
2+ ============
3+ kilgarriff:
4+ 20 the
5+ 10 read (past tense)
6+ 5 read (present tense)
7+ 1 zoo
8+
9+ cmudict:
10+ CAT K AE1 T
11+ READ R EH1 D
12+ READ(1) R IY1 D
13+ THE DH AH0
14+ ZOO Z UW1
15+
16+ discard extra pronunciations in cmudict:
17+ ========================================
18+ cmudict.10-first-only:
19+ CAT K AE1 T
20+ READ R EH1 D
21+ THE DH AH0
22+ ZOO Z UW1
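
A minimal sketch of the "first pronunciation only" step, assuming alternate pronunciations are marked with a `(n)` suffix on the headword as in the `READ(1)` example above. The function name `first_pronunciations` and the `sample` list are hypothetical and are not part of the commit.

```python
import re

# Matches a variant marker such as "(1)" at the end of a headword.
VARIANT = re.compile(r"\(\d+\)$")


def first_pronunciations(lines):
    """Yield only entries whose headword has no "(n)" variant suffix."""
    for line in lines:
        if not line.strip() or line.startswith(";;;"):
            continue  # skip blank lines and cmudict comment lines
        word = line.split()[0]
        if not VARIANT.search(word):
            yield line.rstrip("\n")


# Sample data copied from the cmudict example in these notes.
sample = [
    "CAT K AE1 T",
    "READ R EH1 D",
    "READ(1) R IY1 D",
    "THE DH AH0",
    "ZOO Z UW1",
]
first_only = list(first_pronunciations(sample))
```

Filtering on the suffix, rather than keeping the first line seen per word, avoids relying on the file being sorted.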
23+
24+ discard stress info in cmudict:
25+ ===============================
26+ cmudict.20-discard-stress:
27+ CAT K AE T
28+ READ R EH D
29+ THE DH AH
30+ ZOO Z UW
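
The stress-discarding step could look roughly like this, assuming stress is encoded as a trailing digit 0, 1, or 2 on vowel phonemes, as in the examples above. `discard_stress` is a hypothetical helper name.

```python
def discard_stress(line):
    """Strip trailing stress digits from every phoneme, leaving the word as-is."""
    word, *phonemes = line.split()
    stripped = [p.rstrip("012") for p in phonemes]
    return " ".join([word, *stripped])


# Sample data copied from the cmudict.10-first-only example in these notes.
sample = ["CAT K AE1 T", "READ R EH1 D", "THE DH AH0", "ZOO Z UW1"]
no_stress = [discard_stress(line) for line in sample]
```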
31+
32+ "squash" kilgarriff parts of speech together
33+ ============================================
34+ kilgarriff.10-squashed:
35+ 20 the
36+ 15 read
37+ 1 zoo
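
A sketch of the squashing step, assuming each line starts with a count followed by the word and that anything after the word (such as the part-of-speech note) can be ignored. The simplified line format is taken from the example above; `squash` is a hypothetical name.

```python
from collections import defaultdict


def squash(lines):
    """Sum counts over all entries that share the same word."""
    totals = defaultdict(int)
    for line in lines:
        # Only the first two fields matter; the rest describes the part of speech.
        count, word = line.split()[:2]
        totals[word] += int(count)
    return dict(totals)


# Sample data copied from the kilgarriff example in these notes.
sample = [
    "20 the",
    "10 read (past tense)",
    "5 read (present tense)",
    "1 zoo",
]
squashed = squash(sample)
```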
38+
39+ correlate
40+ =========
41+ correlated:
42+ 20 the DH AH
43+ 15 read R EH D
44+ 1 zoo Z UW
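
One way to sketch the correlation step is a dictionary lookup keyed on the upper-cased word, assuming frequency-list words are lower case and cmudict headwords are upper case as in the examples above. Words missing from either side (here `CAT`) are dropped. The name `correlate` and both input dictionaries are hypothetical.

```python
def correlate(word_counts, pronunciations):
    """Join word counts with pronunciations, most frequent word first."""
    rows = []
    for word, count in word_counts.items():
        phonemes = pronunciations.get(word.upper())
        if phonemes is not None:  # keep only words present in both sources
            rows.append((count, word, phonemes))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


# Sample data mirroring the squashed and stress-free examples in these notes.
word_counts = {"the": 20, "read": 15, "zoo": 1}
pronunciations = {
    "CAT": ["K", "AE", "T"],
    "READ": ["R", "EH", "D"],
    "THE": ["DH", "AH"],
    "ZOO": ["Z", "UW"],
}
correlated = correlate(word_counts, pronunciations)
```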
45+
46+
47+ --------------------------------------------------------------------------------
48+
49+
50+ q1: phoneme frequencies
51+ =======================
52+ phoneme_counts = defaultdict(int)
53+ for line in correlated:
54+     count, word, phonemes = ...
55+     for p in phonemes:
56+         phoneme_counts[p] += count
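
A runnable sketch of q1, assuming each correlated line is whitespace-separated as `count word phoneme...`, which is how the correlated example above is laid out. The parsing step that the notes leave as `...` is filled in here only as an assumption, and `phoneme_frequencies` is a hypothetical name.

```python
from collections import defaultdict


def phoneme_frequencies(correlated_lines):
    """Weight every phoneme occurrence by the frequency of its word."""
    phoneme_counts = defaultdict(int)
    for line in correlated_lines:
        count, word, *phonemes = line.split()
        for p in phonemes:
            phoneme_counts[p] += int(count)
    return dict(phoneme_counts)


# Sample data copied from the correlated example in these notes.
sample = ["20 the DH AH", "15 read R EH D", "1 zoo Z UW"]
counts = phoneme_frequencies(sample)
```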
57+
58+ q2: post-w phoneme frequencies
59+ ==============================
60+ def post_ws(phonemes):
61+     for (p, pnext) in zip(phonemes, phonemes[1:]):
62+         if p == "W":
63+             yield pnext
64+
65+ phoneme_counts = defaultdict(int)
66+ for line in correlated:
67+     count, word, phonemes = ...
68+     for p in post_ws(phonemes):
69+         phoneme_counts[p] += count
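
A runnable sketch of q2 under the same line-format assumption (`count word phoneme...`), comparing against the upper-case symbol `W` because the correlated example uses upper-case phonemes. The sample lines and their counts are made up for illustration, and `post_w_frequencies` is a hypothetical name.

```python
from collections import defaultdict


def post_ws(phonemes):
    """Yield each phoneme that immediately follows a W."""
    for p, pnext in zip(phonemes, phonemes[1:]):
        if p == "W":
            yield pnext


def post_w_frequencies(correlated_lines):
    """Count phonemes that follow W, weighted by word frequency."""
    phoneme_counts = defaultdict(int)
    for line in correlated_lines:
        count, word, *phonemes = line.split()
        for p in post_ws(phonemes):
            phoneme_counts[p] += int(count)
    return dict(phoneme_counts)


# Hypothetical sample lines; the counts are illustrative, not real corpus data.
sample = ["12 we W IY", "7 one W AH N", "3 sweet S W IY T", "1 zoo Z UW"]
post_w = post_w_frequencies(sample)
```

Pairing `phonemes` with `phonemes[1:]` means a word-final `W` has no successor and contributes nothing.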

1+ Sources:
2+ - https://cmloegcmluin.wordpress.com/2012/11/10/relative-frequencies-of-english-phonemes/
3+ - http://www.speech.cs.cmu.edu/cgi-bin/cmudict
4+ - http://www.kilgarriff.co.uk/bnc-readme.html
5+
6+ To do:
7+ . use more than just first pronunciation in cmudict
8+ . phone or phoneme?
9+ . manual error checking
10+ . reread cmloegcmluin