++readme

prendradjaja · prendradjaja · commit 8e0c6af55291 · 2018-05-28T22:50:59.000-07:00
diff --git a/CHANGELOG b/CHANGELOG
@@ -0,0 +1,26 @@
+Changelog:
+- v0.3.3: Add correlated_ipa_no_spaces
+
+- v0.3.2: Add correlated_ipa to local_intermediate/
+
+- v0.3.1: Various
+  - Move *.py into scripts/
+  - Add `make test`
+
+- v0.3.0: Various
+  - Add local copies of source data and results
+  - Default to local copies
+  - Move all data files into subdirectories
+  - Add MIT license
+  - Refactor: Move file paths out of Python and into Makefile
+
+- v0.2.0: Translate ARPAbet to IPA
+
+- v0.1.0: First steps; Frequencies generally and post-/w/
+  - Processing:
+    - Use only the first pronunciation in cmudict
+    - Discard uncorrelateds entirely
+    - No manual error checking etc
+  - Results:
+    - Q1: Frequencies of phonemes generally
+    - Q2: Frequencies of phonemes post-/w/
diff --git a/README b/README
diff --git a/README.md b/README.md
@@ -0,0 +1,42 @@
+## Quick links
+- [Words by count in BNC with pronunciations](local_intermediate/correlated_ipa_no_spaces)
+- [Phonemes by frequency](local_target/q1_frequencies)
+- [Phonemes by frequency post-/w/](local_target/q2_post_w_frequencies)
+
+## Summary
+An estimate of the relative frequencies of English phonemes in
+general, and a separate estimate of the relative frequencies of
+the phonemes that follow /w/.
+
+## Methodology
+Reproducing the work of Doug Blumeyer[1], I correlated the CMU
+Pronouncing Dictionary ("CMUdict")[2] and Adam Kilgarriff's
+unlemmatized frequency list[3] for the British National Corpus to
+find phoneme frequencies generally. I extended this technique to
+estimate post-/w/ phoneme frequencies as well.
+
+## Limitations
+As Blumeyer notes, the source datasets have some limitations.
+CMUdict conflates "schwa with the near-open central vowel" and
+has "several noticeable errors." Kilgarriff's frequency list has
+some formatting issues that make it hard to work with words with
+accents and apostrophes, including common contractions. (At this
+time, I've completely ignored this issue.)
+
+Blumeyer did manual error checking on several hundred of the
+most common words. I have not done this.
+
+CMUdict has multiple pronunciations for some words. For these
+words, I used only the first pronunciation given. It's not clear
+to me whether the pronunciations are ordered in some meaningful
+way or listed arbitrarily.
+
+## Other notes
+While the Kilgarriff list is for the British National Corpus, a
+quick inspection suggests that CMUdict uses American
+pronunciations over British ones.
+
+## References
+- Doug Blumeyer, ["Relative Frequencies of English Phonemes"][blumeyer]
+- [CMU Pronouncing Dictionary][cmudict] (Local copy at version 0.7b. Retrieved May 28, 2018.)
+- Adam Kilgarriff, [word frequencies for the BNC][kilgarriff] (Local copy retrieved May 28, 2018.)
diff --git a/TODO b/TODO
@@ -0,0 +1,9 @@
+. at or before v1.0.0, include changelog details in tags too
+. use more than just first pronunciation in cmudict
+. <er> is one phoneme etc
+. phone or phoneme?
+. rename to phoneme-frequencies? (but really, is it phone or phoneme in this case?)
+. manual error checking
+. transcribe some of the uncorrelateds
+. reread cmloegcmluin
+x ARPAbet -> IPA
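
The method described in README.md (correlate CMUdict with the BNC frequency list, keep only the first pronunciation, discard uncorrelated words) could be sketched roughly as below. This is not the code in scripts/. The sample data, the function names, the stripping of stress digits, the weighting of each phoneme by its word's corpus count, and the reading of "post-/w/" as "immediately after /w/" are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical miniature inputs; the real source files are far larger.
# Assumed CMUdict layout: ";;;" comments, "WORD  PHONES...", and
# alternates marked as "WORD(1)".
CMUDICT_LINES = [
    ";;; comment line",
    "THE  DH AH0",
    "WAS  W AA1 Z",
    "WAS(1)  W AH1 Z",
    "WE  W IY1",
]
# Assumed (count, word) pairs taken from the frequency list.
WORD_COUNTS = [(10, "the"), (4, "was"), (3, "we"), (2, "café")]


def first_pronunciations(lines):
    """Map each lowercased word to its first listed pronunciation."""
    prons = {}
    for line in lines:
        if not line.strip() or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        if word.endswith(")"):
            continue  # alternate pronunciation such as WAS(1)
        # Assumption: drop ARPAbet stress digits (AA1 -> AA).
        prons.setdefault(word.lower(), [p.rstrip("012") for p in phones])
    return prons


def phoneme_frequencies(word_counts, prons):
    """Return (overall, post_w) phoneme counts weighted by word count."""
    overall, post_w = Counter(), Counter()
    for count, word in word_counts:
        phones = prons.get(word)
        if phones is None:
            continue  # uncorrelated word: discarded entirely
        for phone in phones:
            overall[phone] += count
        for prev, nxt in zip(phones, phones[1:]):
            if prev == "W":
                post_w[nxt] += count
    return overall, post_w


prons = first_pronunciations(CMUDICT_LINES)
overall, post_w = phoneme_frequencies(WORD_COUNTS, prons)
print(overall.most_common())
print(post_w.most_common())
```

In this toy input, "café" has no CMUdict entry and is dropped, which mirrors the "discard uncorrelateds entirely" step from the changelog.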