replace the bioguide scraper with one that can do a deep parse of the bioguide by JoshData · Pull Request #304 · unitedstates/congress-legislators

JoshData · 2015-08-02T22:36:03Z

This is a bit crazy, but it kind of works. The bioguide is written with a fairly consistent format. This commit replaces bioguide.py with a new deep parser for bioguide entries.

It produces output like this:

activities:
 [...snip...]
- text: Minister to England 1815-1817, assisted in concluding the convention of commerce
    with Great Britain
- date:
    end: 1825
    start: 1817
  text: Secretary of State in the Cabinet of President James Monroe
- date: 1829-03-03
  text: decision in the 1824 election of the President of the United States fell,
    according to the Constitution of the United States, upon the House of Representatives,
    as none of the candidates had secured a majority of the electors chosen by the
    states, and Adams, who stood second to Andrew Jackson in the electoral vote, was
    chosen and served from March 4, 1825, to
- date: 1834
  text: elected as a Republican to the U.S. House of Representatives for the Twenty-second
    and to the eight succeeding Congresses, becoming a Whig
- text: served from March 4, 1831, until his death
- text: chairman, Committee on Manufactures (Twenty-second through Twenty-sixth, and
    Twenty-eighth and Twenty-ninth Congresses), Committee on Indian Affairs (Twenty-seventh
    Congress), Committee on Foreign Affairs (Twenty-seventh Congress)
- date: 1834
  text: unsuccessful candidate for Governor of Massachusetts
- text: interment in the family burial ground at Quincy, Mass.
- text: subsequently reinterred in United First Parish Church
born:
  date: 1767-07-11
  location: Braintree, Mass.
died:
  date: 1848-02-23
  location: the U.S. Capitol Building, Washington, D.C.
elected:
- dates:
    end: 1808-06-08
    end-reason: resignation
    start: 1803-03-04
  elections:
  - how: elected
    party: Federalist
    type: senate
family-relations:
- relation: son
  to:
    name: John Adams
- relation: father
  to:
    name: Charles Francis Adams
- relation: brother-in-law
  to:
    name: William Stephens Smith
name: ADAMS, John Quincy
name-info: []
roles:
- state: MA
  type: Senator
- state: MA
  type: Representative
- ordinal: 6
  type: President of the United States

for

ADAMS, John Quincy, (son of John Adams, father of Charles Francis Adams, brother-in-law
of William Stephens Smith), a Senator and a Representative from Massachusetts and
6th President of the United States; born in Braintree, Mass., July 11, 1767; acquired
his early education in Europe at the University of Leyden; was graduated from Harvard
University in 1787; studied law; was admitted to the bar and commenced practice
in Boston, Mass.; appointed Minister to Netherlands 1794, Minister to Portugal 1796,
Minister to Prussia 1797, and served until 1801; commissioned to make a commercial
treaty with Sweden in 1798; elected to the Massachusetts State senate in 1802; unsuccessful
candidate for election to the U.S. House of Representatives in 1802; elected as
a Federalist to the United States Senate and served from March 4, 1803, until June
8, 1808, when he resigned, a successor having been elected six months early after
Adams broke with the Federalist party; Minister to Russia 1809-1814; member of the
commission which negotiated the Treaty of Ghent in 1814; Minister to England 1815-1817,
assisted in concluding the convention of commerce with Great Britain; Secretary
of State in the Cabinet of President James Monroe 1817-1825; decision in the 1824
election of the President of the United States fell, according to the Constitution
of the United States, upon the House of Representatives, as none of the candidates
had secured a majority of the electors chosen by the states, and Adams, who stood
second to Andrew Jackson in the electoral vote, was chosen and served from March
4, 1825, to March 3, 1829; elected as a Republican to the U.S. House of Representatives
for the Twenty-second and to the eight succeeding Congresses, becoming a Whig in
1834; served from March 4, 1831, until his death; chairman, Committee on Manufactures
(Twenty-second through Twenty-sixth, and Twenty-eighth and Twenty-ninth Congresses),
Committee on Indian Affairs (Twenty-seventh Congress), Committee on Foreign Affairs
(Twenty-seventh Congress); unsuccessful candidate for Governor of Massachusetts
in 1834; died in the U.S. Capitol Building, Washington, D.C., February 23, 1848;
interment in the family burial ground at Quincy, Mass.; subsequently reinterred
in United First Parish Church.

The output is rough, and there are lots of incorrect parses. In this case, the parser recognizes only the Senate election; the House election ends up under activities instead of elected. Maybe some of these issues can be fixed. The schema of the output is also a little unpredictable because it's trying to handle a lot of cases in the input.

I've posted the complete output here:

https://www.govtrack.us/data/misc/bioguide-parsed.yaml (30 MB)
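To illustrate how a phrase in the bioguide text relates to the `elected` dates in the sample output, here is a minimal sketch. It is not the grammar-based parser from this PR: it swaps in a plain regular expression, and `SERVED_RE`, `to_iso`, and `served_dates` are hypothetical names invented for the example.

```python
import re
from datetime import datetime

# Hypothetical pattern (not from the PR): matches "served from <date>, until <date>".
SERVED_RE = re.compile(
    r"served from (?P<start>[A-Z][a-z]+ \d{1,2}, \d{4}),? (?:until|to) "
    r"(?P<end>[A-Z][a-z]+ \d{1,2}, \d{4})"
)


def to_iso(text):
    """Convert 'March 4, 1803' to '1803-03-04' (assumes English month names)."""
    return datetime.strptime(text, "%B %d, %Y").date().isoformat()


def served_dates(clause):
    """Return start/end ISO dates for a 'served from ... until ...' clause, or None."""
    m = SERVED_RE.search(clause)
    if m is None:
        return None
    return {"start": to_iso(m.group("start")), "end": to_iso(m.group("end"))}


bio = (
    "elected as a Federalist to the United States Senate and served from "
    "March 4, 1803, until June 8, 1808, when he resigned; "
    "Minister to Russia 1809-1814"
)
# Bioguide entries separate clauses with semicolons.
clauses = [c.strip() for c in bio.split(";")]
print(served_dates(clauses[0]))
```

A single regex like this only covers one phrasing, which is part of why a grammar that handles many clause shapes is attractive for the full bioguide.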

dannguyen · 2015-08-03T17:05:03Z

This is great!

dannguyen · 2015-08-03T17:11:11Z

For a future feature, one of the things that I've thought could be easy is to include an education field. Many of the entries share the same phrasing:

was graduated from Harvard University in 1787; studied law; was admitted to the bar and commenced practice in Boston, Mass

Of course, there's a lot of nuance. Some bios mention admittance but not graduation, so the resulting schema would have to account for the different educational outcomes and actions, as well as multiple schools and degrees. But my estimate is that there's a large amount of low-hanging fruit, i.e. "was graduated from [some college]", that could be filled out. I think the topic of educational background is fascinating, even beyond the obvious finding that Harvard by far has the most alumni in the federal legislative structure.
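As a rough sketch of that low-hanging fruit, the "was graduated from [some college]" phrasing could be matched clause by clause. This is only an illustration under stated assumptions: `GRADUATED_RE` and `education_entries` are hypothetical names, the `school`/`year` output shape is made up for the example, and none of it is part of the parser in this PR.

```python
import re

# Hypothetical pattern: "[was] graduated from <school>[ in <year>]" at the end of a clause.
GRADUATED_RE = re.compile(
    r"(?:was )?graduated from (?P<school>.+?)(?: in (?P<year>\d{4}))?$"
)


def education_entries(bio):
    """Collect school (and year, when present) from semicolon-separated clauses."""
    entries = []
    for clause in bio.split(";"):
        m = GRADUATED_RE.search(clause.strip())
        if m:
            entry = {"school": m.group("school")}
            if m.group("year"):
                entry["year"] = int(m.group("year"))
            entries.append(entry)
    return entries


sample = (
    "was graduated from Harvard University in 1787; studied law; "
    "was admitted to the bar and commenced practice in Boston, Mass"
)
print(education_entries(sample))
```

Clauses such as "studied law" or "was admitted to the bar" are deliberately ignored here; covering them would need the richer schema described above.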

JoshData force-pushed the bioguide-deep-parse branch from 5cf842f to a9cd8b7 Compare August 2, 2015 22:36

JoshData mentioned this pull request Aug 2, 2015

BioGuide Scrape #296

Open

JoshData mentioned this pull request Aug 4, 2015

Correct term end dates in historical terms, per #7 #305

Merged

JoshData added 2 commits August 7, 2015 11:53

replace the bioguide scraper with one that can do a deep parse of bioguide entries using a context free grammar

d64986e

more work on the deep bioguide parser

3946e14

JoshData force-pushed the bioguide-deep-parse branch from a9cd8b7 to 3946e14 Compare August 7, 2015 15:53

JoshData mentioned this pull request Jun 16, 2018

Should we mine bioguide for biographical info? #572

Open

JoshData force-pushed the master branch from fab07bd to 14c497c Compare February 25, 2021 12:47

JoshData wants to merge 2 commits into main from bioguide-deep-parse
