replace the bioguide scraper with one that can do a deep parse of the bioguide#304
5cf842f to a9cd8b7
This is great!
For a future feature, one of the things that I've thought could be easy is to include an education field. Many of the entries share the same phrasing:
Of course, there's a lot of nuance. Some bios mention admittance but not graduation, so the resulting schema would have to account for different educational outcomes and actions, as well as multiple schools and degrees. But my estimate is that there's a large amount of low-hanging fruit, i.e. "was graduated from [some college]", that could be filled out. I think the topic of educational background is fascinating, even beyond the obvious finding that Harvard by far has the most alumni in the federal legislative structure.
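As a rough illustration of that low-hanging fruit, here is a minimal Python sketch that pulls "graduated from" clauses out of a bio. It assumes clauses are separated by semicolons and that a year, when present, closes the clause. The function name, the output fields, and the sample text are all made up for illustration; they are not part of this PR or of the bioguide itself.

```python
import re

# Hypothetical pattern for the common "was graduated from <school>[, [in] <year>]" phrasing.
# The school is captured lazily so that an optional trailing year can be split off.
GRADUATED_RE = re.compile(
    r"^(?:was\s+)?graduated\s+from\s+(?P<school>.+?)"
    r"(?:,\s*(?:in\s+)?(?P<year>\d{4})\.?)?$"
)


def extract_education(bio_text):
    """Return one record per clause that matches the graduation phrasing."""
    records = []
    for clause in bio_text.split(";"):
        match = GRADUATED_RE.match(clause.strip())
        if match:
            year = match.group("year")
            records.append({
                "action": "graduated",  # other outcomes (e.g. "attended") are not handled here
                "institution": match.group("school"),
                "year": int(year) if year else None,
            })
    return records


# Invented sample text that only imitates the bioguide's phrasing.
sample = (
    "born in Springfield, Ill., January 1, 1900; attended the public schools; "
    "was graduated from Example College, Springfield, Ill., in 1921; "
    "graduated from Sample University Law School, 1924; lawyer, private practice."
)

for record in extract_education(sample):
    print(record)
```

Anything that doesn't match (admittance without graduation, multiple degrees in one clause) is simply skipped, which is where the nuance mentioned above would need a richer schema.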
…guide entries using a context free grammar
a9cd8b7 to 3946e14
This is a bit crazy, but it kind of works. The bioguide is written with a fairly consistent format. This commit replaces bioguide.py with a new deep parser for bioguide entries. It produces output like this:
for
The output is rough. There are lots of incorrect parses. In this case, one of the elections isn't recognized by the parser. Maybe some of these issues can be fixed. And the schema of the output is a little unpredictable because it's trying to handle a lot of cases in the input.
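To show the general idea of grammar-driven parsing, and why unrecognized clauses slip through, here is a toy recursive-descent sketch in Python. It is not the parser from this commit: the grammar, the function names, the output keys, and the sample sentence are all hypothetical and far simpler than what real bioguide entries need.

```python
import re

# Toy grammar (an assumption for illustration only):
#   election := "elected as a " WORD " to the " ordinals " Congress" ["es"]
#   ordinals := WORD (" and " WORD)*
WORD = re.compile(r"[A-Za-z-]+")


def parse_election(clause):
    """Return a dict if the whole clause matches the election rule, else None."""
    pos = 0

    def literal(text):
        # Consume an exact string at the current position, if present.
        nonlocal pos
        if clause.startswith(text, pos):
            pos += len(text)
            return True
        return False

    def word():
        # Consume one WORD token at the current position, if present.
        nonlocal pos
        match = WORD.match(clause, pos)
        if match is None:
            return None
        pos = match.end()
        return match.group()

    if not literal("elected as a "):
        return None
    party = word()
    if party is None or not literal(" to the "):
        return None
    first = word()
    if first is None:
        return None
    ordinals = [first]
    while literal(" and "):
        following = word()
        if following is None:
            return None
        ordinals.append(following)
    if not literal(" Congress"):
        return None
    literal("es")  # optional plural suffix
    if pos != len(clause):
        return None  # trailing text the grammar does not cover
    return {"type": "election", "party": party, "congresses": ordinals}


def parse_entry(text):
    """Split an entry into clauses; keep clauses the grammar cannot parse as raw text."""
    clauses = [c.strip() for c in text.rstrip(".").split(";")]
    return [parse_election(c) or {"type": "unparsed", "text": c} for c in clauses if c]


# Invented sample that only imitates the bioguide's phrasing.
sample = (
    "elected as a Republican to the Eightieth and Eighty-first Congresses; "
    "elected to the Senate in a special election; died in office."
)
for item in parse_entry(sample):
    print(item)
```

In this sketch the second clause is also an election, but it doesn't fit the one rule the grammar knows, so it falls through as unparsed text. Every phrasing variant needs its own rule, and each new rule adds another possible shape to the output.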
I've posted the complete output here:
https://www.govtrack.us/data/misc/bioguide-parsed.yaml (30 MB)