Skip to content

mallory-jpg/sm-political-analysis

Repository files navigation

Social Media Political Analysis 🇺🇸

Springboard Open-ended Capstone

SM Political Analysis - 4

Objectives

Problems & Questions

How can we better develop educational materials to meet kids where they are?

  • Is it worth it to spend money to advertise to youth for political campaigns - are they engaging with current events?
  • What politics & policies are The Youth™ talking about & why?

Goals

  • to analyze how age/youth impacts political indoctrination and participation
  • to track social impacts of political events
  • to understand colloquial knowledge of political concepts

Overview:

  1. Use NewsAPI to find top news by day
  2. Parse news story title & article into individual words/phrases
  3. Count most important individual words & phrases
  4. Use top 3 most important words & phrases to search TikTok & Twitter
  5. Count number of tweets & TikToks mentioning key words & phrases

Installation & Use

config.ini should have the following layout and info:

[database]
  host = <hostname>
  database = <db name>
  user = <db username>
  password = <db password>

[newsAuth]
  api_key = <NewsAPI.org API key>

[tiktokAuth]
  s_v_web_id = <s_v_web_id>
  tt_web_id = <tt_web_id>

[twitterAuth]
  access_token = <access_token>
  access_token_secret = <access_token_secret>
  consumer_key = <consumer_key>
  consumer_secret = <consumer_secret>

Note: headers and keys/variables in config.ini file don't need to be stored as strings, but when calling them in program, enclose references with quotes

To scrape the web without getting blocked:

Clone the following url into your project directory using Git or checkout with SVN: https://github.com/tamimibrahim17/List-of-user-agents.git. These .txt files contain User Agents and are specified by browser (shout out Timam Ibrahim!). They will be randomized to avoid detection by web browsers.

To find your s_v_web_id & tt_web_id for TikTokAPI access:

  1. Go to the TikTok website & login
  2. If using Google Chrome, open the developer console
  3. Go to the 'Application' tab
  4. Find & click 'Cookies' in the left-hand panel →
  5. On the resulting screen, look for s_v_web_id and tt_web_id under the 'name' column

Find more information about .ini configuration files in Python documentation: https://docs.python.org/3/library/configparser.html

Exploratory Data Analysis

This step was guided by my anticipation that the data will be used for trend graphing, sentiment analysis, age inference, and correlation between user characteristics and extent of participation in responding to political events. With this in mind, I plan to optimize query speed via limiting storage by geographic location of both users and events to the US (though this categorization may be loose at times because of US involvement on the world stage). My PostgreSQL database will also be sharded by datetime, as the analytical window references 3 days before and 3 days after the political event of interest.

At this point, I've found that my GET requests for popular news articles return various formats which must be converted, unpacked, or otherwise transformed. Because of this added step, I opted to keep the article text in a separate table within the database. The article text in the news table will serve as data sources from which to extract our top 3 keywords (per article) using Term Frequency-Inverse Document Frequency (TF-IDF) calculations. These keywords will be applied to our searches for related TikToks and tweets. For some resources on this, check out TF-IDF in the real world and a step-by-step guide by Prachi Prakash.

With data from social media adding a pop-cultural context to political news, we inch closer to an understanding of TikTok and Twitter as novel forms of youth political engagement!

Getting Tweets

This project uses Tweepy's tweet search method to search for tweets within the past seven (7) days using the keywords produced from the .get_all_news() method. A separate Tweepy Stream Listener subclass catches tweets (statuses) that contain our keywords of interest as they are tweeted. The max stream rate for Twitter's API (upon which Tweepy is based) is 450.

Setting Up Kafka Streaming Application (Scala ==> Python)

application.conf file should look like:

com.ram.batch {
spark {
  app-name = <app name>
  master = "local"
  log-level = "INFO"
}
postgres { 
  url = "postgresql://localhost/<db name>"
  username = <db user>
  password = <db pwd>
}
}

Note: these are strings and must be enclosed in quotation marks.

  • Make sure you customize your connection string url to the database you use
  • The tweetStream Python class invoked in my pipeline includes a Kafka broker: self.producer = KakfaProducer(bootstrap_servers='localhost')
  • The broker & the Kafka streaming application are two separate files & entities

Scala applications require 'application'.sbt files that include the name of the app, the version, the scalaVersion, and library dependencies:

```
name := "Tweet Stream"

version := "1.0"

scalaVersion := "3.0.2"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1"
```
  • Find your Spark version by using spark-submit --version on the command line.

M1 Processor Issues

  • I had to find a compatible SDK to install sbt:
      curl -s "https://get.sdkman.io" | zsh 
      source "$HOME/.sdkman/bin/sdkman-init.sh" 
      sdk version
      sdk install java
      sdk install sbt
      sbt compile
    
  • Then finally run sbt compile on the command line from my project directory.

Common Tweet Streaming Issues:

  • Make sure any json file that is being used to store tweets is opened with the 'a' designator for 'append' or else each tweet will overwrite the last
  • If sbt won't compile, ensure that your .sbt file dependencies are the correct versions
  • Make sure your application directory mirrors what is found in the Spark documentation so it can compile properly.

Getting TikToks

This project uses Avilash Kumar's TikTokAPI. Refer to their GitHub for further information.

Common TikTok Streaming Issues:

  • TBA

Check out this project's slide deck ⤵ SM Political Analysis - 4 (2) SM Political Analysis - 4 (4) SM Political Analysis - 4 (7) SM Political Analysis - 4 (8) SM Political Analysis - 4 (9) SM Political Analysis - 4 (10) SM Political Analysis - 4 (11) SM Political Analysis - 4 (12)

About

Data pipeline to understand novel forms of Youth™ political engagement

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors