Commit 8e0c6af

++readme
1 parent 0ebf06d commit 8e0c6af

4 files changed: 77 additions and 84 deletions

CHANGELOG

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
Changelog:
- v0.3.3: Add correlated_ipa_no_spaces

- v0.3.2: Add correlated_ipa to local_intermediate/

- v0.3.1: Various
  - Move *.py into scripts/
  - Add `make test`

- v0.3.0: Various
  - Add local copies of source data and results
  - Default to local copies
  - Move all data files into subdirectories
  - Add MIT license
  - Refactor: Move file paths out of Python and into Makefile

- v0.2.0: Translate ARPAbet to IPA

- v0.1.0: First steps; Frequencies generally and post-/w/
  - Processing:
    - Use only the first pronunciation in cmudict
    - Discard uncorrelateds entirely
    - No manual error checking etc
  - Results:
    - Q1: Frequencies of phonemes generally
    - Q2: Frequencies of phonemes post-/w/

README

Lines changed: 0 additions & 84 deletions
This file was deleted.

README.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
## Quick links
- [Words by count in BNC with pronunciations](local_intermediate/correlated_ipa_no_spaces)
- [Phonemes by frequency](local_target/q1_frequencies)
- [Phonemes by frequency post-/w/](local_target/q2_post_w_frequencies)

## Summary
An estimate of the relative frequencies of English phonemes, both
overall and in the position immediately following /w/.

## Methodology
Reproducing the work of Doug Blumeyer[1], I correlated the CMU
Pronouncing Dictionary ("CMUdict")[2] with Adam Kilgarriff's
unlemmatized frequency list[3] for the British National Corpus to
estimate phoneme frequencies overall. I extended this technique to
estimate post-/w/ phoneme frequencies as well.
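
The correlation described above can be sketched roughly as follows. This is a minimal illustration, not the scripts in this repository: the toy dictionaries and the names `phoneme_frequencies` and `relative` are assumptions made for the example. Each phoneme is weighted by the corpus count of the word it occurs in, and words with no pronunciation (the "uncorrelateds") are skipped.

```python
from collections import Counter

# Hypothetical toy inputs; the real data files are not reproduced here.
# word -> corpus count (stand-in for the BNC frequency list)
word_counts = {"we": 10, "swim": 2, "at": 5, "zzyzx": 1}
# word -> IPA phonemes (stand-in for CMUdict after ARPAbet-to-IPA translation)
pronunciations = {
    "we": ["w", "i"],
    "swim": ["s", "w", "ɪ", "m"],
    "at": ["æ", "t"],
}


def phoneme_frequencies(word_counts, pronunciations):
    """Return (overall, post_w) phoneme counts, weighted by word count."""
    overall = Counter()
    post_w = Counter()
    for word, count in word_counts.items():
        phonemes = pronunciations.get(word)
        if phonemes is None:
            continue  # uncorrelated word: no pronunciation available
        for i, phoneme in enumerate(phonemes):
            overall[phoneme] += count
            # Count the phoneme separately when it directly follows /w/.
            if i > 0 and phonemes[i - 1] == "w":
                post_w[phoneme] += count
    return overall, post_w


def relative(counter):
    """Convert raw counts into relative frequencies that sum to 1."""
    total = sum(counter.values())
    return {phoneme: n / total for phoneme, n in counter.items()}


overall, post_w = phoneme_frequencies(word_counts, pronunciations)
```

Calling `relative(overall)` or `relative(post_w)` then gives the two kinds of estimate named in the summary, for the toy data only.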

## Limitations
As Blumeyer notes, the source datasets have some limitations.
CMUdict conflates "schwa with the near-open central vowel" and
has "several noticeable errors." Kilgarriff's frequency list has
some formatting issues that make it hard to work with words
containing accents or apostrophes, including common contractions.
For now, I have ignored this issue entirely.

Blumeyer did manual error checking on several hundred of the
most common words. I have not done this.

CMUdict has multiple pronunciations for some words. For these
words, I used only the first pronunciation given. It's not clear
to me whether the multiple pronunciations are ordered in some
meaningful way or just arbitrarily.
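
As a rough sketch of the "first pronunciation only" rule, the snippet below keeps one pronunciation per word. It assumes the CMUdict 0.7b text layout as I understand it: comment lines start with `;;;`, and alternate pronunciations carry a suffix such as `(1)` on the word. The sample lines and the name `first_pronunciations` are made up for illustration and are not taken from this repository.

```python
import re

# Made-up sample in the assumed CMUdict 0.7b layout.
SAMPLE = """\
;;; comment line
READ  R EH1 D
READ(1)  R IY1 D
WE  W IY1
"""

# Assumption: alternates are marked with a numeric suffix, e.g. READ(1).
ALTERNATE = re.compile(r"\(\d+\)$")


def first_pronunciations(lines):
    """Map each word to its first listed pronunciation (ARPAbet symbols)."""
    first = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        word, *phones = line.split()
        if ALTERNATE.search(word):
            continue  # skip alternate pronunciations
        # setdefault keeps the earliest entry if a word somehow repeats.
        first.setdefault(word.lower(), phones)
    return first


prons = first_pronunciations(SAMPLE.splitlines())
```

Using more than the first pronunciation would mean keeping the suffixed entries and deciding how to split each word's count among them.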

## Other notes
While the Kilgarriff list is for the British National Corpus, a
quick inspection suggests that CMUdict gives American
pronunciations rather than British ones.

## References
- Doug Blumeyer, ["Relative Frequencies of English Phonemes"][blumeyer]
- [CMU Pronouncing Dictionary][cmudict] (Local copy at version 0.7b. Retrieved May 28, 2018.)
- Adam Kilgarriff, [word frequencies for the BNC][kilgarriff] (Local copy retrieved May 28, 2018.)

TODO

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
. at or before v1.0.0, include changelog details in tags too
. use more than just first pronunciation in cmudict
. <er> is one phoneme etc
. phone or phoneme?
. rename to phoneme-frequencies? (but really, is it phone or phoneme in this case?)
. manual error checking
. transcribe some of the uncorrelateds
. reread cmloegcmluin
x ARPAbet -> IPA
