Commit 18c6f3b

initial commit with pseudocode for v0.1.0
0 parents  commit 18c6f3b

4 files changed, +89 -0 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+cmudict*
+kilgarriff*

Makefile

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+cmudict:
+	curl http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b > cmudict
+
+kilgarriff:
+	curl http://www.kilgarriff.co.uk/BNClists/all.al.gz | gunzip > kilgarriff
+
+# kilgarriff.num::
+# curl http://www.kilgarriff.co.uk/BNClists/all.num.gz | gunzip > kilgarriff.num

PSEUDOCODE

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+source files
+============
+kilgarriff:
+    20 the
+    10 read (past tense)
+    5 read (present tense)
+    1 zoo
+
+cmudict:
+    CAT K AE1 T
+    READ R EH1 D
+    READ(1) R IY1 D
+    THE DH AH0
+    ZOO Z UW1
+
+discard extra pronunciations in cmudict:
+========================================
+cmudict.10-first-only:
+    CAT K AE1 T
+    READ R EH1 D
+    THE DH AH0
+    ZOO Z UW1
+
+discard stress info in cmudict:
+===============================
+cmudict.20-discard-stress:
+    CAT K AE T
+    READ R EH D
+    THE DH AH
+    ZOO Z UW
+
+"squash" kilgarriff parts of speech together
+============================================
+kilgarriff.10-squashed:
+    20 the
+    15 read
+    1 zoo
+
+correlate
+=========
+correlated:
+    20 the DH AH
+    15 read R EH D
+    1 zoo Z UW
+
+
+--------------------------------------------------------------------------------
+
+
+q1: phoneme frequencies
+=======================
+phoneme_counts = defaultdict(int)
+for line in correlated:
+    count, word, phonemes = ...
+    for p in phonemes:
+        phoneme_counts[p] += count
+
+q2: post-w phoneme frequencies
+==============================
+def post_ws(phonemes):
+    for (p, pnext) in zip(phonemes, phonemes[1:]):
+        if p is `w`:
+            yield pnext
+
+phoneme_counts = defaultdict(int)
+for line in correlated:
+    count, word, phonemes = ...
+    for p in post_ws(phonemes):
+        phoneme_counts[p] += count
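
The preprocessing stages in PSEUDOCODE (first pronunciation only, discard stress, squash parts of speech, correlate) could be sketched in Python roughly as follows. This is a minimal sketch and not code from the commit: the function names are hypothetical, the kilgarriff input is assumed to be already parsed into (count, word) pairs because the real column layout is not shown here, and skipping lines that start with `;;;` assumes that is how cmudict marks comments. The sample data is the toy data from PSEUDOCODE.

```python
import re
from collections import defaultdict


def first_pronunciations(cmudict_lines):
    """Keep only the first pronunciation of each word (drops READ(1) etc.)."""
    prons = {}
    for line in cmudict_lines:
        # Assumption: blank lines and ';;;' comment lines carry no entries.
        if not line.strip() or line.startswith(";;;"):
            continue
        word, *phonemes = line.split()
        if "(" in word:  # alternate pronunciation such as READ(1)
            continue
        prons.setdefault(word, phonemes)
    return prons


def discard_stress(prons):
    """Strip the trailing stress digit from each phoneme, e.g. AE1 -> AE."""
    return {w: [re.sub(r"\d$", "", p) for p in ps] for w, ps in prons.items()}


def squash(kilgarriff_pairs):
    """Sum the counts of a word across its parts of speech."""
    totals = defaultdict(int)
    for count, word in kilgarriff_pairs:
        totals[word] += count
    return dict(totals)


def correlate(totals, prons):
    """Join counts with pronunciations; words absent from cmudict are dropped."""
    rows = [
        (count, word, prons[word.upper()])
        for word, count in totals.items()
        if word.upper() in prons
    ]
    return sorted(rows, key=lambda row: -row[0])


# Toy inputs copied from the "source files" section of PSEUDOCODE.
kilgarriff = [(20, "the"), (10, "read"), (5, "read"), (1, "zoo")]
cmudict = [
    "CAT K AE1 T",
    "READ R EH1 D",
    "READ(1) R IY1 D",
    "THE DH AH0",
    "ZOO Z UW1",
]

correlated = correlate(
    squash(kilgarriff),
    discard_stress(first_pronunciations(cmudict)),
)
```

On the toy data, `correlated` has the same three rows as the "correlate" section: the, read, zoo with counts 20, 15, 1. CAT is dropped because it has no kilgarriff count.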

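The q1 and q2 pseudocode in PSEUDOCODE could be made runnable along these lines. Again a sketch under assumptions, not the commit's code: the function names are hypothetical, the elided row parsing (`...`) is sidestepped by starting from already-parsed (count, word, phonemes) tuples, and the identity test on `w` is replaced by an equality test against the upper-case symbol `"W"`, on the assumption that phonemes keep cmudict's upper-case spelling.

```python
from collections import defaultdict

# Toy rows copied from the "correlate" section of PSEUDOCODE.
correlated = [
    (20, "the", ["DH", "AH"]),
    (15, "read", ["R", "EH", "D"]),
    (1, "zoo", ["Z", "UW"]),
]


def phoneme_frequencies(rows):
    """q1: weight every phoneme by the frequency of the word it occurs in."""
    counts = defaultdict(int)
    for count, _word, phonemes in rows:
        for p in phonemes:
            counts[p] += count
    return dict(counts)


def following(phonemes, target="W"):
    """Yield each phoneme that comes directly after `target`."""
    for p, pnext in zip(phonemes, phonemes[1:]):
        if p == target:  # equality, not identity
            yield pnext


def post_phoneme_frequencies(rows, target="W"):
    """q2: weighted counts of the phonemes that directly follow `target`."""
    counts = defaultdict(int)
    for count, _word, phonemes in rows:
        for p in following(phonemes, target):
            counts[p] += count
    return dict(counts)


q1 = phoneme_frequencies(correlated)
q2 = post_phoneme_frequencies(correlated)
```

The toy rows contain no W, so `q2` is empty here. Passing a different `target` reuses the same function for other phonemes.
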
README

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+Sources:
+- https://cmloegcmluin.wordpress.com/2012/11/10/relative-frequencies-of-english-phonemes/
+- http://www.speech.cs.cmu.edu/cgi-bin/cmudict
+- http://www.kilgarriff.co.uk/bnc-readme.html
+
+To do:
+. use more than just first pronunciation in cmudict
+. phone or phoneme?
+. manual error checking
+. reread cmloegcmluin
