Automated Construction & Analysis of
Political Networks
via open government & media sources


Diego Garcia-Olano



November 7, 2016

Text, Nodes, Edges, & Narco Networks

Network: Undirected with Weighted Edges

Text: "Los Señores del Narco"

Nodes: Actors

Edges: a co-occurrence of two Actors in the text
within 200 characters of each other.

*Nodes are sized in relation to how many edges (i.e., relationships) they have
*Edges are sized by how many co-occurrences exist between two actors.

Fig. 1. Mexican narco-network constructed from the book "Los Señores del Narco" by Anabel Hernandez.
Jesús Espinal-Enríquez, J. Mario Siqueiros-García, Rodrigo García-Herrera, & Sergio Antonio Alcalá-Corona.
"A literature-based approach to a narco-network".  In Social Informatics, pages 97–101. Springer, 2014.


The Mystery That is Texas Voter Turnout

Population: 27 Million People

State & Federal Politicians: 250

composed of:

  - State Congress: 150 Representatives & 31 State Senators

  - U.S. Congress: 36 Representatives & 2 Senators

  - State Executive officials:
        Governor, Lieutenant Governor, Speaker of the House,
        Attorney General, Commissioners, etc.

  - State Judges: 9 Supreme Court & 9 Court of Criminal Appeals


Voter Turnout: 20% for the 2014 Governor election.

Texas House Congressional Districts Map.  150 districts total
http://www.whoyouelect.com/texas/texas-house-map.html

Given:
1. a list of Politicians
2. a list of relevant news sites

Construct, display & analyze the networks
around these Politicians


Case study:
  1. 246 Texas elected officials (from May 2015)
  2. 6 news sites that cover Texas politics

        the Austin American Statesman, the Dallas Morning News, the Houston Chronicle,
        the New York Times, the Texas Observer and the Texas Tribune


Presentation Outline


Case Study Results, Networks, Maps, Code: http://whoyouelect.com/texas
Slides: http://www.whoyouelect.com/texas/northeastern_nov2016/

Background & Related Work

Defining Big Words

We will be constructing undirected heterogeneous networks
with weighted edges in political contexts.

Heterogeneous refers to the existence of different node types that will compose the graph.  For instance, nodes (also referred to as "entities") in our system will be labeled as People, Organizations, Politicians, Locations, Bills, or Miscellaneous.

Edges represent article-text co-occurrences between two entities.
The weight of an edge is the number of co-occurrences found between two entities
according to some distance metric; in our case: "same sentence", "near",
"same article", or our proposed "combined" metric.

* We consider only one relationship type, so there is at most one edge between
a pair of nodes; as such, we are not in a multiplex context.
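To make the edge definition concrete, here is a minimal Python sketch (all names are hypothetical) of tallying the three co-occurrence counts once each sentence's entities are known:

from collections import Counter
from itertools import combinations

def count_cooccurrences(sentence_entities, near_window=3):
    """sentence_entities: list of entity-name sets, one set per sentence.
    Returns same-sentence, near (within 3 sentences), and same-article counts."""
    same_sent, near, same_art = Counter(), Counter(), Counter()
    positions = {}  # entity -> sentence indices where it appears
    for i, ents in enumerate(sentence_entities):
        for pair in combinations(sorted(ents), 2):
            same_sent[pair] += 1
        for e in ents:
            positions.setdefault(e, []).append(i)
    for a, b in combinations(sorted(positions), 2):
        for i in positions[a]:
            for j in positions[b]:
                if i == j:
                    continue  # already counted as same-sentence
                elif abs(i - j) <= near_window:
                    near[(a, b)] += 1
                else:
                    same_art[(a, b)] += 1
    return same_sent, near, same_art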

Of the prior works we cite,

  • Some rely on time-intensive, hand-crafted networks or
    use a point-in-time snapshot of a curated news corpus
  • Others rely on access to article-lookup systems that offer an impressive breadth of sources, but that constrain them to sources from those lists and carry an additional quality-assurance weakness
  • Some leveraged paid search-engine results, processed only the first twenty results per query, and used only the text snippet on Yahoo's results page rather than the full content of the article itself
  • We were unable to find any work that leveraged the publicly available search engines present in most news websites.  By utilizing this mechanism, our tool allows the flexibility to define a context in which to search and, by doing so, to curate the content to a user's exact needs.

Overview of WhoYouElect.com

http://www.whoyouelect.com/texas

Who is Eddie Rodriguez?




1. Ego "Inner" Network of a Politician

http://www.whoyouelect.com/texas/explorer-view.html?show=15&s=Eddie Rodriguez

or just click on "Inner Network" from the Table of Contents view: http://www.whoyouelect.com/texas/table-of-contents.html

URL parameters

s: name of Politician
show: only show edges with weight greater than or equal to show
dm: (optional) which distance metric to use. Possible values: ss, sn, all, comb, corresponding to Same Sentence, Same or Near, All Co-Occurrences, and the proposed Combined metric. Defaults to "all".
near_co: (optional) near-sentence coefficient in the Combined metric calculation
same_art_co: (optional) same-article coefficient in the Combined metric calculation
from: (optional) date from which to include articles found. Expected format: YYYY-MM-DD
to: (optional) end date for inclusion of articles. Expected format: YYYY-MM-DD, e.g. 2008-07-01
exclude: (optional) which news sources to exclude. Possible values: AAS, DMN, HC, NYT, TXOB, TXTRB. Use commas to exclude multiple.

Combined Metric

The proposed "combined" weight of an edge between two nodes is defined as:

w = BOOST × [ s + ( α × n ) + ( γ × a ) ]

where

BOOST = ( s + n + a ) / a
s = number of (s)ame sentence occurrences of the two nodes in the texts
α = coefficient for near-sentence occurrences.  Set to 0.5 by default.  URL parameter: near_co
n = number of (n)ear sentence occurrences ( within 3 sentences of one another )
γ = coefficient for same-article occurrences.  Set to 0.1 by default.  URL parameter: same_art_co
a = number of same (a)rticle occurrences ( outside of the 3-sentence distance )

BOOST is a boosting term that enhances relationships with more same- and near-sentence co-occurrences
and penalizes those which are largely composed of "same article" (far) co-occurrences.
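A minimal Python sketch of the metric as defined above (the guard for a = 0 is our own assumption, since the formula divides by a):

def combined_weight(s, n, a, near_co=0.5, same_art_co=0.1):
    """s: same-sentence, n: near-sentence, a: same-article co-occurrence counts."""
    base = s + near_co * n + same_art_co * a
    if a == 0:
        return base                  # nothing to penalize (assumption: skip the boost)
    boost = float(s + n + a) / a     # large when same/near-sentence counts dominate
    return boost * base

For example, combined_weight(4, 2, 10) gives BOOST = 16/10 = 1.6 and w = 1.6 × (4 + 1.0 + 1.0) = 9.6.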

A Couple Informal Network Science Definitions

Degree: number of edges a node has
Strength: the sum of the (weighted) edges of a node
Betweenness: number of shortest paths that pass through a node;
who is near the center of a community
PageRank: Google: you are important if you are friends with important people
Transitivity / Clustering Coefficient: how many of your friends are friends with each other; "connectedness"

Louvain: fast, nondeterministic "community detection" algorithm.
It is a greedy optimization method that attempts to optimize the "modularity" of a partition of the network, and runs in O(n log(n)).
Modularity: the overall quality of separation across all communities.
Networks with high modularity have dense connections between the nodes
within modules but sparse connections between nodes in different modules.
Conductance: fraction of edges leaving a particular community
Expansion: number of edges per node leaving a particular community
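For reference, these measures over a toy weighted graph in Python, using the networkx and python-louvain packages (the site itself runs jLouvain in the browser; the edges below are hypothetical):

import networkx as nx
import community as community_louvain  # the python-louvain package

G = nx.Graph()
G.add_weighted_edges_from([
    ("Eddie Rodriguez", "Austin", 12.0),   # hypothetical edges/weights
    ("Eddie Rodriguez", "HB 100", 4.5),
    ("Austin", "HB 100", 2.0),
    ("Austin", "Travis County", 1.0),
])

degree      = dict(G.degree())                 # number of edges per node
strength    = dict(G.degree(weight="weight"))  # sum of incident edge weights
betweenness = nx.betweenness_centrality(G)     # shortest-path centrality
pagerank    = nx.pagerank(G, weight="weight")  # importance via important neighbors
clustering  = nx.clustering(G)                 # fraction of neighbor pairs connected

partition  = community_louvain.best_partition(G)  # node -> community id (nondeterministic)
modularity = community_louvain.modularity(partition, G)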

2. Extended View of a Politician

http://www.whoyouelect.com/texas/communities-from-ncol.html?cl=25&t=15&s=Eddie Rodriguez

* Edges weighted according to the proposed "Combined" metric
* Placement of nodes/communities is calculated to maximize separation and clarity

or just click on "Extended Network" from the Table of Contents view

URL parameters

s: name of Politician
t: only show edges with weight greater than or equal to threshold t. Defaults to 15.
cl: number of communities to discover in the network. Defaults to 25.

3.1 AUTOMATED SUMMARIZATION OF COMMUNITIES


  • 1. Treat the articles of a given community collectively as a single corpus.
  • 2. Run an initial TF-IDF procedure to filter terms and reduce noise.
  • 3. Run Latent Dirichlet Allocation (LDA) over the filtered corpus to derive topics.


**We can weigh articles by their relative importance in the community.

Topic Modeling to summarize a single community

library("topicmodels")
library("tm")
k = 5           			  		#num of communities to look for
highend = 4000  			  		#want less than this many terms
lowend = 2000   			  		#want more than this many terms
tsvfile = "eddie_rodriguez-articles.tsv"  			   #community article texts
articlesForCommunity <- read.csv("lda/community-articles-lda.csv") #comm article meta data
colnames(articlesForCommunity) <- c("url","date","entities")
ngramtype <- 1
weighbyentity = T  
article_cutoff = 1
runTopicModelingOnCommunity(tsvfile,k,highend,lowend,
        articlesForCommunity,ngramtype,article_cutoff,weighbyentity)

runTopicModelingOnCommunity <- function( ... ){
	# .... load tsv to get article texts and put in JSS_papers list
	corpus <- Corpus(VectorSource(sapply(JSS_papers[, "text"], remove_HTML_markup))) 
	JSS_dtm <- DocumentTermMatrix(corpus, control = list(stopwords = TRUE, 
                    minWordLength = 3, removeNumbers = TRUE, removePunctuation = TRUE))
	term_tfidf <- tapply( JSS_dtm$v/row_sums(JSS_dtm)[JSS_dtm$i], JSS_dtm$j, mean) 
				         * log2( nDocs(JSS_dtm) / col_sums(JSS_dtm > 0))
	cutoffvals <- get_cutoffval_and_type(term_tfidf,highend,lowend)
	# .... filter corpus by cutoff vals ...
	jss_TM <- LDA(JSS_dtm, k = k, control = list(seed = SEED)) 	
	Topic <- topics(jss_TM)
	Terms <- terms(jss_TM)
	# ... generate most frequent topics of the articles and the terms for each one
}					

3.2 Summarization of a Politician via text mining

1. Treat the combined text of the articles a politician occurs in as a single corpus,
2. Create a document term frequency matrix where each row is an article, the columns
are terms that occur in the corpus & a cell is the # of times a term appears in an article.
3. Use this to calculate the TF-IDF (term frequency–inverse document frequency) of the corpus.
4. Using the TF-IDF values, we filter out terms that appear all the time & provide little distinguishing
information, or inversely those that occur very rarely and could be noise. It's an art.
5. Using the refined corpus, run Latent Dirichlet Allocation to uncover the “topics”
(ie, issues, latent concepts) within the articles for a single politician.

LDA assumes that words in documents are generated by a mixture of topics, with each topic itself being a distribution of words.  Here the number of topics is fixed initially, and after LDA is run, each article in the corpus of a politician is assigned a topic.  Each topic is a set of words ranked by the probability of a word occurring.
Because we know how many documents were assigned a given topic,
the most frequent topics serve to summarize a politician's "issues".

* For a more thorough explanation of LDA and approaches for assessing topic quality see:
http://chdoig.github.io/acm-sigkdd-topic-modeling/#/4/11
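The slides implement this pipeline in R with tm/topicmodels (see the listing above); for illustration, here is a rough scikit-learn sketch of steps 2-5, with toy inputs and an arbitrary TF-IDF cutoff (both assumptions; parameter names per current scikit-learn):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles = ["toy article about oil and fracking",
            "toy article about immigration and rights",
            "toy article about oil, rights and redistricting"]

# Steps 2-4: document-term matrix -> TF-IDF -> filter uninformative terms.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(articles)
mean_tfidf = np.asarray(X.mean(axis=0)).ravel()
vocab = [t for t, j in tfidf.vocabulary_.items() if mean_tfidf[j] > 0.05]  # cutoff is "an art"

# Step 5: LDA over raw counts restricted to the refined vocabulary.
counts = CountVectorizer(vocabulary=vocab).fit_transform(articles)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.argmax(axis=1))  # most likely topic per article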


LDA with 3 topics and 4 words per topic.
http://www.mdpi.com/1999-4893/5/4/469/htm

Some Possible Texas Words
alamo, farms, freedom, guns, immigration, marriage, obamacare, oil, rights, solar

civil rights, farmers markets, for fracking, gulf coast, pecan pie, redistricting map, school breakfast, wind farms, workers compensation

Which politician is
most associated with the word ... ?

whoyouelect.com/texas/politician-results.html
whoyouelect.com/texas/politician-results-bigrams.html
whoyouelect.com/texas/politician-results-trigrams.html

We ran LDA using 20 topics* over each politician's article set for single words, bi-grams (pairs of words),
and tri-grams separately, using a script we developed in R with the "tm" and "topicmodels" packages.

*based on performance of different topic-number values normally used in the literature (5, 10, 20, etc.)
  The optimal value is specific to each politician & the analysis one wants to present.
  As long as a sufficient number of topics is used, it has less overall effect than
  the thresholds used in determining which terms to keep in the refined corpus.

Media & Graph Analysis Tools

whoyouelect.com/texas/mediaresults.html
whoyouelect.com/texas/media-top-per-source.html
whoyouelect.com/texas/extendedresults.html
whoyouelect.com/texas/politician-relative-articles.html

Party District maps

federal-districts.html
texas-house-map.html
texas-senate-map.html

Media Coverage Maps

media-federal-districts.html
media-texas-house-map.html
media-texas-senate-map.html
all-media-maps.html

Automated Construction of Networks

General Overview of Graph Construction Process

http://openstates.org/api/v1/legislators/?state=tx&active=true
https://www.govtrack.us/api/v2/role?current=true&state=TX
http://www.sos.state.tx.us/elections/voter/elected.shtml


Languages, Libraries, and Databases
Python: general backend work
MongoDB: to store article texts and entity information
MITIE: MIT open source Named Entity Recognition tool
BeautifulSoup and Selenium libraries:
    Python web scrapers used to obtain articles.
    BS4 is for static web pages,
    while Selenium, using the PhantomJS webdriver,
    handles pages constructed dynamically by JavaScript
langdetect:
    open-source Python library for language detection
D3.js: network visualizations & maps
jQuery UI: some frontend interactivity functionality
jLouvain: javascript Louvain community detection
HTML5 web workers:
    asynchronous, non-blocking JS loading of data for graphs

3.1 Adding Politicians From Open Gov. Sources

  • 1. Texas Congress data was obtained from OpenStates.org for both active and inactive members
  • 2. Federal Congress data was obtained from GovTrack.us for current federal representatives
  • 3. Other Texas state officials data was obtained from the Secretary of State of Texas website via a script

This gives us the metadata for all the politicians.
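A minimal sketch of pulling this metadata from the endpoints shown earlier (the response field names, and the OpenStates apikey parameter, are assumptions about those APIs):

import requests

# Texas state legislators from OpenStates (v1 API; key assumed required)
state = requests.get(
    "http://openstates.org/api/v1/legislators/",
    params={"state": "tx", "active": "true", "apikey": "YOUR_KEY"},
).json()

# Current federal roles for Texas from GovTrack
federal = requests.get(
    "https://www.govtrack.us/api/v2/role",
    params={"current": "true", "state": "TX"},
).json().get("objects", [])

for leg in state:
    print(leg.get("full_name"), leg.get("chamber"), leg.get("district"), leg.get("party"))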

3.2 Adding News Sources

A reasonable mix of representative media sources on Texas politics.

3.3 Setup for Data Acquisition By Template Modification

Two template "web scraper" solutions are provided,
based on whether a news site renders its content statically or dynamically via JavaScript.

1.  The static version, based on BeautifulSoup
  - acquires data more quickly,
  - but can not handle dynamic content.

2.  The dynamic version, based on Selenium/PhantomJS
  - can handle static or dynamic content,
  - but is slower than the static solution.

Future work is planned to unify the approach into one template
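A condensed sketch of what the two templates boil down to (the CSS selector and function names are placeholders; PhantomJS is the webdriver named in the slides, and the Selenium API shown is circa-2016):

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

ARTICLE_SELECTOR = "div.article-body"   # placeholder; set per news site template

def static_article_text(url):
    """Fast path: server-rendered HTML via BeautifulSoup."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    node = soup.select_one(ARTICLE_SELECTOR)
    return node.get_text(" ", strip=True) if node else ""

def dynamic_article_text(url):
    """Slow path: let PhantomJS execute the page's JavaScript first."""
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        return driver.find_element_by_css_selector(ARTICLE_SELECTOR).text
    finally:
        driver.quit()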

3.4 Running and Storing
the Web Search Results for each Active Entity

This step calls the web scraper template for a politician.
For each news source:

  • download the list of article urls returned from its internal search engine for that politician
  • download full articles into JSON files
  • do language detection for the text of the article before importing JSON into MongoDB.
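A minimal sketch of the language detection and import step, using langdetect and pymongo (the database and collection names are hypothetical):

import json
from langdetect import detect
from pymongo import MongoClient

articles = MongoClient()["whoyouelect"]["articles"]  # hypothetical names

def import_article(json_path):
    with open(json_path) as f:
        article = json.load(f)
    article["lang"] = detect(article["text"])  # e.g. "en", "es"
    articles.insert_one(article)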


3.5 Processing article results per Active Entity

Take all the articles downloaded for a given politician, process and store them,
and construct graphs for the politician:
1. the Ego "Inner" Network, and
2. the Extended Network View

3.5 Processing article results ... Continued

Go through all the articles for a politician one by one and

  • 1. filter out empty articles, sports articles, and articles not containing the politician's name explicitly.

  • 2. Split the article into sentences, and run the MITIE Named Entity Recognition library over each one.
    This finds "entities" in each sentence and gives each a tag of "person", "location", "organization" or "misc".
    Additionally, check whether "person" tags are "politicians" using our entities DB, and whether any
    Congressional "bills" exist in the sentence using a heuristic.

  • 3. Run coreference resolution over all the entities found to get an additional
    dictionary of all distinct entities found in the article.

  • 4. From this and the tagged sentences, we then find and store all co-occurrences that occurred within
    the same sentence, within three sentences, or outside of that distance,
    for all entities in the dictionary.

  • 5. At this point, the article has been processed and we merge and save it locally.
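For step 2 above, a rough sketch of MITIE's Python bindings (the model path is the stock English model shipped with MITIE; exact API details may vary by version):

from mitie import named_entity_extractor, tokenize

ner = named_entity_extractor("MITIE-models/english/ner_model.dat")

def entities_in_sentence(sentence):
    tokens = tokenize(sentence)
    found = []
    for token_range, tag in ner.extract_entities(tokens):
        text = " ".join(tokens[i] for i in token_range)
        found.append((text, tag.lower()))  # e.g. ("Eddie Rodriguez", "person")
    return found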


3.6 Create network views

Using saved result objects and statistics, construct the Ego and Extended network data files
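These data files are, in essence, node/link JSON; a hypothetical sketch of the export, in the shape a D3 force layout typically consumes:

import json

def write_network_json(weights, path):
    """weights: dict mapping (entity_a, entity_b) -> combined edge weight."""
    nodes = sorted({e for pair in weights for e in pair})
    index = {name: i for i, name in enumerate(nodes)}
    data = {
        "nodes": [{"name": n} for n in nodes],
        "links": [{"source": index[a], "target": index[b], "value": w}
                  for (a, b), w in weights.items()],
    }
    with open(path, "w") as f:
        json.dump(data, f)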

Conclusions & Future Work

Contributions

We presented ...
1.  a tool that generates real-world political networks from user-provided lists of politicians and news sites.
2.  open source data for politician metadata, which can be easily extended to any US state.
3.  the Ego "Inner" and "Extended" graph visualizations.
4.  a proof-of-concept use of topic modeling for labeling communities in a politician's "extended" network.
5.  topic modeling over politician articles and various maps/tables of results.

Uses:
- voter education tool
- creating real-world networks for academic study (easily adapted to other US states)
- discovering potential news stories

System based entirely on open data & tech.
Does not require massive computational power or storage capabilities (all content downloaded/processed on a laptop)

Future Work

1.  an extensive statistical analysis of merged "extended" graphs obtained
2.  incorporation of city & local APIs for better resolution of elected officials, campaign funding APIs for influence tracking,
      congressional bill APIs, public health, socio-economic, and voting history APIs
3.  not all articles are equal, and as such weighting article relations differently is very important.
4.  better disambiguation of entities, use of alias lists, automated merging tools
5.  simplification of webscraping solution & refactoring code to handle "parties" more generally for non-US cases
6.  expanding/changing NER solution to provide for more language handling ("polyglot" python library)
7.  refactoring text-snippet solution for better scalability.
8.  developing mechanism for downloading, processing and adding new articles for existing politicians.
9.  assessing use of multiplex paradigm by introducing additional link types (“neighboring districts”,
      “author of bill”, “member of committee”, etc.) for more robust network analysis.
10.  leverage posteriors of LDA for better topic analysis/visualization
11.  relationship labeling/role discovery incorporation ( signed positive/negative edges when applicable )
12.  temporal community detection work for extended view
13.  event detection work on ego net

QUESTIONS ?
TELL YOUR TEXAS FRIENDS :)

Who You Elect Texas: whoyouelect.com/texas
Slides: whoyouelect.com/texas/northeastern_nov2016/

email: diegoolano@gmail.com
web: diegoolano.com
github: github.com/diegoolano
twitter: dgolano