phoneme-frequencies/PSEUDOCODE at 18c6f3b75c3dd6c39a354a22f378cd3b244323b6 · prendradjaja/phoneme-frequencies · GitHub

source files
============
  kilgarriff:
    20 the
    10 read (past tense)
    5 read (present tense)
    1 zoo

  cmudict:
    CAT  K AE1 T
    READ  R EH1 D
    READ(1)  R IY1 D
    THE  DH AH0
    ZOO  Z UW1

discard extra pronunciations in cmudict:
========================================
  cmudict.10-first-only:
    CAT  K AE1 T
    READ  R EH1 D
    THE  DH AH0
    ZOO  Z UW1
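
  A minimal sketch of this step, assuming alternate pronunciations are
  always marked with a numeric suffix such as "(1)", as in the sample
  above. The function name first_pronunciations is made up for
  illustration.

```python
import re

def first_pronunciations(lines):
    """Keep only the first listed pronunciation of each word."""
    kept = []
    for line in lines:
        word, phonemes = line.split(None, 1)
        # Assumption: alternates look like WORD(1), WORD(2), ...
        if re.search(r"\(\d+\)$", word):
            continue
        kept.append(f"{word}  {phonemes.strip()}")
    return kept

cmudict = [
    "CAT  K AE1 T",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
    "THE  DH AH0",
    "ZOO  Z UW1",
]
first_only = first_pronunciations(cmudict)
```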

discard stress info in cmudict:
===============================
  cmudict.20-discard-stress:
    CAT  K AE T
    READ  R EH D
    THE  DH AH
    ZOO  Z UW
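
  One possible way to drop the stress markers, assuming stress is always
  a trailing digit on a phoneme (AE1, AH0, ...). discard_stress is a
  hypothetical helper name.

```python
import re

def discard_stress(line):
    """Strip trailing stress digits from every phoneme on a line."""
    word, phonemes = line.split(None, 1)
    # Assumption: the stress marker is the only digit run ending a phoneme.
    stripped = [re.sub(r"\d+$", "", p) for p in phonemes.split()]
    return word + "  " + " ".join(stripped)

no_stress = [discard_stress(l) for l in ["CAT  K AE1 T", "THE  DH AH0"]]
```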

"squash" kilgarriff parts of speech together
============================================
  kilgarriff.10-squashed:
    20 the
    15 read
    1 zoo
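
  A sketch of the squashing step. It assumes the simplified line format
  shown under "source files" (a count, a word, and an optional
  parenthesised note); the real kilgarriff file may be laid out
  differently. The name squash is made up for illustration.

```python
import re
from collections import defaultdict

def squash(lines):
    """Sum counts across parts of speech, keyed by the bare word."""
    totals = defaultdict(int)
    for line in lines:
        count, rest = line.split(None, 1)
        # Assumption: the part-of-speech note is a trailing "(...)".
        word = re.sub(r"\s*\(.*\)\s*$", "", rest).strip()
        totals[word] += int(count)
    return dict(totals)

squashed = squash([
    "20 the",
    "10 read (past tense)",
    "5 read (present tense)",
    "1 zoo",
])
```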

correlate
=========
  correlated:
    20 the  DH AH
    15 read  R EH D
    1 zoo  Z UW
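
  The correlation is essentially a join on the word. A rough sketch,
  assuming words are matched case-insensitively and that a word missing
  from the pronunciation dictionary is simply skipped (the notes above do
  not say how that case is handled). correlate is a hypothetical name.

```python
def correlate(counts, pronunciations):
    """Join word counts with pronunciations, most frequent first."""
    rows = []
    for word, count in counts.items():
        phonemes = pronunciations.get(word.upper())
        if phonemes is None:
            continue  # assumption: drop words with no pronunciation
        rows.append((count, word, phonemes))
    rows.sort(key=lambda row: -row[0])
    return rows

counts = {"the": 20, "read": 15, "zoo": 1}
pronunciations = {
    "CAT": ["K", "AE", "T"],
    "READ": ["R", "EH", "D"],
    "THE": ["DH", "AH"],
    "ZOO": ["Z", "UW"],
}
correlated = correlate(counts, pronunciations)
```

  CAT has a pronunciation but no count, so it does not appear in the
  result.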


--------------------------------------------------------------------------------


q1: phoneme frequencies
=======================
  from collections import defaultdict

  phoneme_counts = defaultdict(int)
  for line in correlated:
    count, word, phonemes = ...
    for p in phonemes:
      phoneme_counts[p] += count
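
  A runnable version of the loop above on the sample "correlated" data.
  The parse helper is an assumption about how each line is split
  (whitespace-separated: count, word, then phonemes).

```python
from collections import defaultdict

correlated = [
    "20 the  DH AH",
    "15 read  R EH D",
    "1 zoo  Z UW",
]

def parse(line):
    # Assumption: fields are whitespace-separated, phonemes come last.
    count, word, *phonemes = line.split()
    return int(count), word, phonemes

phoneme_counts = defaultdict(int)
for line in correlated:
    count, word, phonemes = parse(line)
    for p in phonemes:
        # Weight each phoneme by how often its word occurs.
        phoneme_counts[p] += count
```

  Each phoneme is weighted by the frequency of the word it occurs in, so
  DH and AH each get 20 from "the".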

q2: post-w phoneme frequencies
==============================
  from collections import defaultdict

  def post_ws(phonemes):
    # yield every phoneme that immediately follows a W
    for (p, pnext) in zip(phonemes, phonemes[1:]):
      if p == 'W':
        yield pnext

  phoneme_counts = defaultdict(int)
  for line in correlated:
    count, word, phonemes = ...
    for p in post_ws(phonemes):
      phoneme_counts[p] += count
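
  The sample data above contains no W phonemes, so the sketch below uses
  invented rows purely for illustration (the counts are made up, not
  taken from kilgarriff). It also assumes phonemes are already split into
  a list and that W is written in upper case, as in cmudict.

```python
from collections import defaultdict

def post_ws(phonemes):
    """Yield every phoneme that immediately follows a W."""
    for p, pnext in zip(phonemes, phonemes[1:]):
        if p == "W":
            yield pnext

# Invented rows for illustration only: (count, word, phonemes).
rows = [
    (7, "we", ["W", "IY"]),
    (3, "what", ["W", "AH", "T"]),
    (20, "the", ["DH", "AH"]),
]

post_w_counts = defaultdict(int)
for count, word, phonemes in rows:
    for p in post_ws(phonemes):
        post_w_counts[p] += count
```

  A word ending in W contributes nothing, because zip stops at the
  shorter sequence and the final phoneme has no successor.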