Automated Construction & Analysis of
Political Networks
via open government & media sources


UPC MASTER'S DEFENSE
by Diego Garcia-Olano



Advisors: Marta Arias & Josep Lluís Larriba Pey

Introduction

Text, Nodes, Edges, & Narco Networks

Network: Undirected with Weighted Edges

Text: "Los Señores del Narco"

Nodes: Actors

Edges: a co-occurrence of two Actors exists
in the text within 200 characters of each other.

*Nodes are sized in relation to how many edges (i.e., relationships) they have.
*Edges are sized by how many co-occurrences exist between two actors.

Fig. 1. Mexican narco-network constructed from the book "Los Señores del Narco" by Anabel Hernandez.
Jesús Espinal-Enriquez, J. Mario Siqueiros-García, Rodrigo García-Herrera, & Sergio Antonio Alcalá-Corona.
"A literature-based approach to a narco-network".  In Social Informatics, pages 97–101. Springer, 2014.

The Mystery That is Texas

Population: 27 Million People

State Congress: House of Representatives & Senate
150 Representatives and 31 State Senators

Federal U.S. Congress: Texas has
36 Representatives and 2 Senators

Other elected Texas officials:
Governor, Lieutenant Governor, Speaker of the House,
Attorney General, Commissioners, etc.

State Level Judges:
9 Supreme Court and 9 Court of Criminal Appeals

Not to mention all the City and Locally elected politicians. Mayors, city councilmen, etc.

As of 2012, the GDP of Texas puts it in the top 20 richest countries of the world, just below Spain.

Texas House Congressional Districts Map.  150 districts total
http://www.whoyouelect.com/texas/texas-house-map.html

News Can Educate The Masses
...but it can beat them senseless as well.

In the last election for Governor of Texas,
the winner received 2.8 million votes from a pool of 16.7 million eligible voters
... and won by a landslide; by a margin of 20%!

How much would a citizen need to read to keep up with the politicians who represent them?
What about the biases of those news sources?

Could this lack of voter participation be due in part to actual information overload, or,
more likely, to the daunting prospect of gathering all the pertinent and trustworthy data in one spot in order to absorb it and make an informed decision?

We decided to build a tool which can...
aggregate the results of article searches for politicians
using relevant news sites encompassing a spectrum of viewpoints
to summarize and display the network of politicians, organizations, etc. around a politician
and present the prevalent themes which exist within the articles found for that politician.


To test our system, we use as a case study,
  1. a list of 246 currently active Texas elected officials and
  2. a list of 6 news sites that cover Texas politics in some way:

        the Austin American Statesman, the Dallas Morning News, the Houston Chronicle,
        the New York Times, the Texas Observer and the Texas Tribune


The list of politicians and news sites to use is completely up to the user!

Presentation Outline

  • Introduction
  • Background & Related Work
  • Overview of WhoYouElect.com
          Network Science Approach
             1. Individual Star Network of a Politician
             2. Extended Network of a Politician
          Text Mining Approach
             3. Automatic Summarization of a Politician
  • Automated Construction of Networks
  • Conclusions & Future Work
  • Who You Elect Texas Case Study Results, Graphs, Maps: http://whoyouelect.com/texas
    Slides: http://www.whoyouelect.com/texas/presentation

    Background & Related Work

    Defining Big Words

    We will be constructing undirected heterogeneous networks
    with weighted edges in political contexts.

    Heterogeneous refers to the existence of different node types composing the graph.  For instance, nodes (also referred to as "entities") in our system will be labeled as People, Organizations, Politicians, Locations, Bills, or Miscellaneous.

    Edges represent article text co-occurrences between two entities.
    The weight of an edge is the number of co-occurrences found between two entities
    according to some distance metric (in our case: "same sentence", "near",
    "same article", or our proposed "combined" metric).

    We consider only one relationship type, so there is at most one edge between
    a pair of nodes, and as such we are not in a multiplex context.

    Of the prior works cited in our work,

    • Some rely on time-intensive, hand-crafted networks or
      use a point-in-time snapshot of a curated news corpus
    • Others rely on access to article lookup systems that offer an impressive breadth of sources,
      but that constrain them to use only sources from that list and carry an additional quality-assurance weakness
    • Some leveraged paid-for search engine results, processed only the first twenty results for each query,
      and used only the snippet of text present in Yahoo's search results page rather than the full content of the article itself.
    • We were unable to find any work that leveraged the publicly available search engines present in most news websites.
      By utilizing this mechanism, our tool allows the flexibility to create a context in which to search and,
      by doing so, to curate the content to a user's exact needs.

    Information retrieval aspects aside,

    • The prior works all use natural language processing (NLP) to extract entities using either
      1. similarity metrics based on some combination of:
          entity co-occurrence,
          textual contexts of and shared between entities, &
          the correlation of entities and hyperlinks found in documents, OR
      2. topic modeling to infer relationships of interest.
    • Topic modeling approaches allow for relationships "across articles", but are very noisy.
      The use of TM for relationship labeling and as a complement for edge detection
      is interesting and left for future work.

    Overview of WhoYouElect.com

    http://www.whoyouelect.com/texas

    Who is Eddie Rodriguez?


    Table Of Contents View

    http://www.whoyouelect.com/texas/table-of-contents.html

    Maps Maps Maps

    http://www.whoyouelect.com/texas/federal-districts.html http://www.whoyouelect.com/texas/texas-house-map.html http://www.whoyouelect.com/texas/texas-senate-map.html


    Committee Assignments

    http://www.whoyouelect.com/texas/committees.html

    Listing of Committees: openstates.org/api/v1/committees/?state=tx
    Committee Members for a single Committee: http://openstates.org/api/v1/committees/TXC000010/

    1. Individual Star View of a Politician

    http://www.whoyouelect.com/texas/explorer-view.html?show=15&s=Eddie Rodriguez

    Pro-tip: It's way easier to just click on "Inner Network" from the Table of Contents view

    URL parameters

    s             name of Politician
    show          only show edges with weight greater than or equal to show
    dm            (optional) which distance metric to use. Possible values: ss, sn, all, comb,
                  corresponding to Same Sentence, Same or Near, All Co-Occurrences, Proposed Combined Metric. Defaults to "All".
    near_co       (optional) Near Sentence coefficient in the calculation of the Combined Metric
    same_art_co   (optional) Same Article coefficient in the calculation of the Combined Metric
    from          (optional) date from which to include articles found. Expected format: YYYY-MM-DD
    to            (optional) end date for inclusion of articles. Expected format: YYYY-MM-DD, e.g. 2008-07-01
    exclude       (optional) which news sources to exclude. Possible values: AAS, DMN, HC, NYT, TXOB, TXTRB. Use commas to exclude multiple.
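    For example, a (hypothetical) query combining several of these parameters:
    http://www.whoyouelect.com/texas/explorer-view.html?s=Eddie Rodriguez&show=10&dm=comb&from=2014-01-01&exclude=NYT
    would show Eddie Rodriguez's star network using the Combined metric, with only edges of weight 10 or more,
    articles from 2014 onward, and no New York Times results.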

    Combined Metric

    The proposed "combined" weight of an edge between two nodes is defined as:

    w = BOOST × [ s + ( α × n ) + ( γ × a ) ]
    where
    BOOST = ( s + n + a ) / a
    s = number of (s)ame sentence occurrences of the two nodes in the texts
    α = coefficient for near sentence occurrences.  Set to 0.5 by default.  URL parameter: near_co
    n = number of (n)ear sentence occurrences ( within 3 sentences of one another )
    γ = coefficient for same article occurrences.  Set to 0.1 by default.  URL parameter: same_art_co
    a = number of same (a)rticle occurrences ( outside of 3 sentence distance )

    BOOST is a boosting term to enhance relationships with more same and near sentence co-occurrences
    and penalize those which are largely composed of "same article" (far) co-occurrences.
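
    A minimal Python sketch of this weight calculation (illustrative only; the function name and the
    division-by-zero guard on the reconstructed BOOST term are assumptions, and the default
    coefficients mirror the near_co and same_art_co URL parameters above):

    def combined_weight(s, n, a, near_co=0.5, same_art_co=0.1):
        # s: same-sentence co-occurrences
        # n: near-sentence co-occurrences (within 3 sentences)
        # a: same-article co-occurrences (farther than 3 sentences apart)
        boost = (s + n + a) / float(max(a, 1))   # guard against a == 0 (assumption)
        return boost * (s + (near_co * n) + (same_art_co * a))

    # example: 4 same-sentence, 3 near, and 10 same-article co-occurrences
    print(combined_weight(4, 3, 10))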

    A Couple Informal Network Science Definitions

    Degree: number of edges a node has
    Strength: the sum of the (weighted) edges of a node
    Betweenness: number of shortest paths that pass through a node;
        indicates who is near the center of a community
    PageRank: (Google) you are important if you are friends with important people
    Transitivity / Clustering Coefficient: how many of your friends are friends with each other. "Connectedness."

    Louvain: fast nondeterministic "community detection" algorithm.
        A greedy optimization method that attempts to optimize the "modularity" of a partition of the network; runs in O(n log n).
    Modularity: how good the overall quality of separation is across all communities.
        Networks with high modularity have dense connections between the nodes
        within modules but sparse connections between nodes in different modules.
    Conductance: fraction of edges leaving a particular community
    Expansion: number of edges per node leaving a particular community
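
    These measures can also be computed offline with standard tooling. A small sketch on a toy graph,
    using Python's networkx and python-louvain packages (the tool itself runs jLouvain in the browser,
    so this is only an illustration):

    import networkx as nx
    import community as community_louvain   # the python-louvain package

    # toy undirected, weighted graph
    G = nx.Graph()
    G.add_weighted_edges_from([("A", "B", 5), ("B", "C", 2), ("C", "A", 1), ("C", "D", 4)])

    degree = dict(G.degree())                        # number of edges per node
    strength = dict(G.degree(weight="weight"))       # sum of weighted edges per node
    betweenness = nx.betweenness_centrality(G)       # shortest-path based centrality
    pagerank = nx.pagerank(G)
    clustering = nx.clustering(G)                    # per-node clustering coefficient

    partition = community_louvain.best_partition(G)          # Louvain communities
    modularity = community_louvain.modularity(partition, G)  # quality of the partition
    print(partition, modularity)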

    2. Extended View of a Politician

    http://www.whoyouelect.com/texas/communities-from-ncol.html?cl=25&t=15&s=Eddie Rodriguez

    Edges in the extended graph are all weighted according to the proposed "Combined" metric!
    Placement of nodes/communities is not based on a force-directed method,
    but instead is calculated to maximize separation and clarity.
    Pro-tip: It's way easier to just click on "Extended Network" from the Table of Contents view

    URL parameters

    s     name of Politician
    t     only show edges with weight greater than or equal to threshold t.  Defaults to 15.
    cl    number of communities to discover in the network.  Defaults to 25.

    Media & Graph Analysis Tools

    whoyouelect.com/texas/mediaresults.html
    whoyouelect.com/texas/media-top-per-source.html
    whoyouelect.com/texas/extendedresults.html
    whoyouelect.com/texas/politician-relative-articles.html

    Media Coverage Maps

    whoyouelect.com/texas/media-federal-districts.html
    whoyouelect.com/texas/media-texas-house-map.html
    whoyouelect.com/texas/media-texas-senate-map.html
    whoyouelect.com/texas/all-media-maps.html





    Who (Are These People) You Elect?

    3. AUTOMATED SUMMARIZATION OF POLITICIANS

    How can we get an overall picture of the political landscape of Texas
    given the data we obtained for our list of politicians?

    2 approaches
    1. Network-Centric approach based on merging all extended graphs
    2. Text Mining approach using information retrieval & topic modeling techniques

    1.  Determine the importance of politicians/entities via Network Analysis & centrality measures.
    This would produce interesting detailed insights into the political landscape as a whole,
    but would not provide a simple summary of the issues & topics surrounding each politician.

    2.  The Text Mining approach is not entity based & is a better complementary approach.

    *image from prior slide at http://www.whoyouelect.com/texas/table-of-everyone.html

    Summarization of a Politician via text mining

    1. Treat the combined text of the articles a politician occurs in as a single corpus,
    2. Create a document term frequency matrix where each row is an article, the columns
    are terms that occur in the corpus & a cell is the # of times a term appears in an article.
    3. Use this to calculate the TF-IDF (term frequency–inverse document frequency) of the corpus.
    4. Using the TF-IDF value we filter out terms that appear all the time & provide little distinguishing
    information, or, inversely, those that occur very rarely and could be noise. It's an art.
    5. Using the refined corpus, run Latent Dirichlet Allocation to uncover the “topics”
    (ie, issues, latent concepts) within the articles for a single politician.

    LDA assumes that words in documents are generated by a mixture of topics, with each topic itself being a distribution over words.  Here the number of topics is fixed initially, and after the model is run, each article in the corpus of a politician is assigned a topic.  Each topic is a set of words ranked by the probability of a word occurring.
    Because we know how many documents were assigned to each topic,
    the most frequent topics serve to summarize a politician's "issues".

    * For a more thorough explanation of LDA and approaches for assessing topic quality see:
    http://chdoig.github.io/acm-sigkdd-topic-modeling/#/4/11


    LDA with 3 topics and 4 words per topic.
    http://www.mdpi.com/1999-4893/5/4/469/htm

    Some Possible Texas Words
    alamo, farms, freedom, guns, immigration, marriage, obamacare, oil, rights, solar

    civil rights, farmers markets, for fracking, gulf coast, pecan pie, redistricting map, school breakfast, wind farms, workers compensation

    Which politician is
    most associated with the word ... ?

    whoyouelect.com/texas/politician-results.html
    whoyouelect.com/texas/politician-results-bigrams.html
    whoyouelect.com/texas/politician-results-trigrams.html

    We ran LDA using 20 topics* over each politician's article set for single words, bi-grams (pairs of words),
    and tri-grams separately, using a script we developed in R with the "tm" and "topicmodels" packages.

    *based on the performance of different topic number values normally used in the literature (5, 10, 20, etc.)
      The optimal value is specific to each politician & the analysis one wants to present.
      As long as a sufficient number of topics is used, it has less overall effect than
      the thresholds used in determining which terms to keep in the refined corpus.
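
    The actual script is in R (shown later under "Finally Some Code"). As a rough illustration only,
    the same unigram/bi-gram/tri-gram LDA run could be sketched in Python with scikit-learn; the file
    path is hypothetical and the TF-IDF filtering step is omitted for brevity:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # one article per line; hypothetical file built from a politician's article set
    articles = [line.strip() for line in open("eddie_rodriguez-articles.txt") if line.strip()]

    for ngram in (1, 2, 3):                            # single words, bi-grams, tri-grams
        vec = CountVectorizer(stop_words="english", ngram_range=(ngram, ngram))
        dtm = vec.fit_transform(articles)              # document-term matrix
        lda = LatentDirichletAllocation(n_components=20, random_state=0)
        lda.fit(dtm)
        terms = vec.get_feature_names_out()
        for topic in lda.components_:                  # print the top 5 terms of each topic
            print([terms[i] for i in topic.argsort()[-5:][::-1]])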

    AUTOMATED SUMMARIZATION OF COMMUNITIES

    Earlier, in the Community Analysis view, we saw details about the central figures and articles of a community.

    It would be useful however to have a way of automatically describing the community at a high level
    Then we could label all communities in this way and gain a global view for understanding & comparing them.

    One way to do that is by treating the articles of a given community collectively as a single corpus.
    We can then analyze the corpus using the same procedure we use to “summarize politicians”;
    namely an initial TF-IDF procedure to filter terms and reduce noise, followed by LDA to derive topics.

    The difference in this context though is that ...
    we know how many entities from a community occurred in each article in the community,
    and thus we can weigh these articles and their text by their relative importance in the community.

    Finally Some Code

    library("topicmodels")
    library("tm")
    k = 5           			  		#num of communities to look for
    highend = 4000  			  		#want less than this many terms
    lowend = 2000   			  		#want more than this many terms
    tsvfile = "eddie_rodriguez-articles.tsv"  			   #community article texts
    articlesForCommunity <- read.csv("lda/community-articles-lda.csv") #comm article meta data
    colnames(articlesForCommunity) <- c("url","date","entities")
    ngramtype <- 1
    weighbyentity = T  
    article_cutoff = 1
    runTopicModelingOnCommunity(tsvfile,k,highend,lowend,
            articlesForCommunity,ngramtype,article_cutoff,weighbyentity)
    
    runTopicModelingOnCommunity <- function( ... ){
    	# .... load tsv to get article texts and put in JSS_papers list
    	corpus <- Corpus(VectorSource(sapply(JSS_papers[, "text"], remove_HTML_markup))) 
    	JSS_dtm <- DocumentTermMatrix(corpus, control = list(stopwords = TRUE, 
                        minWordLength = 3, removeNumbers = TRUE, removePunctuation = TRUE))
    	term_tfidf <- tapply( JSS_dtm$v/row_sums(JSS_dtm)[JSS_dtm$i], JSS_dtm$j, mean) 
    				         * log2( nDocs(JSS_dtm) / col_sums(JSS_dtm > 0))
    	cutoffvals <- get_cutoffval_and_type(term_tfidf,highend,lowend)
    	# .... filter corpus by cutoff vals ...
    	jss_TM <- LDA(JSS_dtm, k = k, control = list(seed = SEED)) 	
    	Topic <- topics(jss_TM)
    	Terms <- terms(jss_TM)
    	# ... generate most frequent topics of the articles and the terms for each one
    }					

    Automated Construction of Networks

    General Overview of Graph Construction Process

    http://openstates.org/api/v1/legislators/?state=tx&active=true
    https://www.govtrack.us/api/v2/role?current=true&state=TX
    http://www.sos.state.tx.us/elections/voter/elected.shtml

    Languages, Libraries, and Databases
    Python: general backend work
    MongoDB: to store article texts and entity information
    MITIE: MIT open source Named Entity Recognition tool
    BeautifulSoup and Selenium libraries:
        Python web scrapers used to obtain articles.
        BS4 is for static web pages,
        while Selenium, using the PhantomJS webdriver,
        handles pages constructed dynamically by JavaScript
    langdetect:
        open source Python library for language detection
    D3.js: network visualizations & maps
    jQuery UI: some frontend interactivity functionality
    jLouvain: JavaScript Louvain community detection
    HTML5 web workers:
        asynchronous, non-blocking JS loading of data for graphs

    3.1 Adding Politicians From Open Gov. Sources

    Texas Congress data was obtained from OpenStates.org for both active and inactive members
    http://openstates.org/api/v1/legislators/?state=tx&active=true
    http://openstates.org/api/v1/legislators/?state=tx&active=false

    Federal Congress data was obtained from GovTrack.us for current federal representatives
    https://www.govtrack.us/api/v2/role?current=true&state=TX

    Other Texas state officials data was obtained from the Secretary of State of Texas website via a script
    http://www.sos.state.tx.us/elections/voter/elected.shtml

    In the end, this gives us the metadata for all the politicians we want to look into as JSON text documents, but this could have just as easily come from a CSV file that was manually created.

    These APIs were leveraged for their richness, ease of use, and reproducibility with other US states.
    Future plans include leveraging Google's Civic API for access to City and County level information
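
    As an illustration only (not the exact thesis scripts), fetching and flattening those listings might
    look like the following in Python; the "apikey" parameter and the response field names are assumptions
    about the OpenStates v1 and GovTrack v2 schemas:

    import json
    import requests

    state_legislators = requests.get(
        "http://openstates.org/api/v1/legislators/",
        params={"state": "tx", "active": "true", "apikey": "YOUR_KEY"}).json()

    federal_roles = requests.get(
        "https://www.govtrack.us/api/v2/role",
        params={"current": "true", "state": "TX"}).json().get("objects", [])

    # keep only the metadata fields we care about (field names assumed)
    politicians = [{"name": p.get("full_name"), "chamber": p.get("chamber"),
                    "district": p.get("district"), "party": p.get("party")}
                   for p in state_legislators]

    with open("texas_politicians.json", "w") as f:
        json.dump(politicians, f, indent=2)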

    3.2 Adding News Sources

    A subset of Texas newspapers with the highest circulation, the Dallas Morning News, the Houston Chronicle, and the Austin American Statesman, representing somewhat the spectrum of conservative, centrist, and progressive areas within Texas, were selected, along with two sites, the Texas Observer and the Texas Tribune, which focus on Texas politics and issues.

    In addition, the New York Times was selected to provide an outside context.

    This selection of sources was made with the intention of presenting a reasonable mix of representative media about the state of Texas. In the end, what is most important is that
    a user of the tool gets to select any news sources they wish to use.

    3.3 Setup for Data Acquisition By Template Modification

    Two template "web scraper" solutions are provider, and a technical user of the tool can modify the appropriate one to fit the needs of the site they wish to include.

    The distinction between the two is based on whether a news site renders its site content statically in conforming with good web standards or dynamically via Javascript which many do. The static version, based on BeautifulSoup, acquires data more quickly, but can not handle dynamic content, whereas the dynamic version, based on Selenium/PhantomJS can handle static or dynamic content, but goes slower than the static solution; thus both were used for time considerations since many articles were downloaded and processed for each news source for each politician.

    In our case study, half the sites used one template while half used the other. In its current state its functional, but requires probably too much from an end user of the tool and thus future work is planned to unify the approach into one template
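
    A minimal sketch of the two approaches (illustrative, not the exact thesis templates; the search URL
    and CSS selection are placeholders):

    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver

    def fetch_static(url):
        # static template: plain HTTP request parsed with BeautifulSoup
        html = requests.get(url, timeout=30).text
        return BeautifulSoup(html, "html.parser")

    def fetch_dynamic(url):
        # dynamic template: PhantomJS renders the JavaScript before we parse the page
        driver = webdriver.PhantomJS()
        try:
            driver.get(url)
            return BeautifulSoup(driver.page_source, "html.parser")
        finally:
            driver.quit()

    # e.g. pull article links from a (hypothetical) search-results page
    soup = fetch_static("http://example.com/search?q=Eddie+Rodriguez")
    links = [a["href"] for a in soup.select("a") if a.get("href", "").startswith("http")]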

    3.4 Running And storing
    the Web Search results for each Active Entity

    First, to avoid confusion: an Entity here is one Politician from our list.
    This step calls the web scraper template for a politician,
    which, for each news source, downloads a list of article URLs available from its internal search engine.
    It then downloads those articles separately and places them into JSON files,
    each of which is then "post-processed" to add the politician's name as an indexing mechanism for the result
    and to run language detection on the text of the article, adding it as a field,
    before importing it into MongoDB.
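
    A sketch of that post-processing step (the field names, database name, and file path are assumptions,
    not the exact schema used in the thesis):

    import json
    from langdetect import detect
    from pymongo import MongoClient

    db = MongoClient("localhost", 27017)["whoyouelect"]

    def postprocess_and_store(json_path, politician_name):
        with open(json_path) as f:
            article = json.load(f)
        article["politician"] = politician_name              # indexing mechanism
        text = article.get("text", "")
        article["lang"] = detect(text) if text else "unknown"  # e.g. "en" or "es"
        db.articles.insert_one(article)

    postprocess_and_store("articles/eddie_rodriguez_0001.json", "Eddie Rodriguez")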


    3.5 Processing article results per Active Entity

    On a high level, this takes all the articles downloaded for a given politician and processes them,
    and then stores the network data which will then be used to construct two graphs for the politician
    we will see later:
    1. the Individual Star Network, and
    2. the Extended Network View

    3.5 Processing article results ... Continued

    This step goes through all the articles for a politician one by one and
    1. filters out empty articles, sports articles, and articles not containing the politician's name explicitly.

    2. It then splits an article into sentences, and runs the MITIE Named Entity Recognition library over each one.
    MITIE finds "entities" in each sentence and gives each a tag of "person","location","organization" or "misc".
    We additionally check whether "person" tags are "politicians" using our entities DB, and whether any
    Congressional "bills" exist in the sentence (independent of the NER output) using a heuristic;
    if so, we tag them as "politician" or "bill".

    3. Then we do a sort of coreference resolution over all the entities found to give us an additional
    dictionary of all distinct entities found in the article.

    4. From this and the tagged sentences, we then find and store all co-occurrences that occurred within
    the same sentence, within three sentences, or outside of that distance,
    for all entities in the dictionary.

    5. At this point the article has been processed, and we merge and save it locally.
    Once all the articles are processed, we save the results and statistics on the results, and
    proceed to step 3.6, which creates two datasets for the aforementioned network views.
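
    A condensed sketch of the co-occurrence bookkeeping in step 4 (names are illustrative, not the exact
    thesis code; each mention is assumed to already carry the index of the sentence where it was found):

    from collections import defaultdict
    from itertools import combinations

    def cooccurrences(mentions, near_window=3):
        # mentions: list of (entity, sentence_index) pairs found in one article
        counts = defaultdict(lambda: {"same_sentence": 0, "near": 0, "same_article": 0})
        for (e1, s1), (e2, s2) in combinations(mentions, 2):
            if e1 == e2:
                continue
            pair = tuple(sorted((e1, e2)))
            dist = abs(s1 - s2)
            if dist == 0:
                counts[pair]["same_sentence"] += 1     # same sentence
            elif dist <= near_window:
                counts[pair]["near"] += 1              # within three sentences
            else:
                counts[pair]["same_article"] += 1      # elsewhere in the article
        return counts

    mentions = [("Eddie Rodriguez", 0), ("Austin", 0), ("Texas House", 2), ("Austin", 6)]
    print(dict(cooccurrences(mentions)))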

    Conclusions & Future Work

    Contributions

    We presented ...
    1.  a tool that generates real world political networks from user provided lists of politicians and news sites.
    2.  enriched with data obtained from open sources to provide structure via verified politician meta-data.
    3.  the Individual Star and "Extended" graph visualizations, tools & maps for the exploration of a politician's environment
    4.  a “Combined” distance metric to better assess the strength of relationships between actors in a graph.
    5.  automated summarization tools to extract the topics and issues characterizing individual politicians.
    6.  a proof-of-concept use of topic modeling for labeling communities in a politician’s “extended” network (not shown)

    Voters: a simple and trustworthy mechanism for voter education.
    Researchers & NGOs: construct and tailor real world graphs of their exact choosing to study.
    Journalists & NGOs: discover potentially new stories

    System based entirely on open data & tech.
    Does not require massive computational power or storage capabilities (all content downloaded/processed on a laptop)

    *  On a personal note, I grew up in Texas and had no idea who many of these politicians were before creating these tools,
    but now I'm able to quickly gain a fairly good idea of any politician here, provided that sufficient articles were processed for them.
    That is very powerful and quite useful, especially when attempting to understand a politician's history given our 24-hour news cycle.

    Future Work

    1.  an extensive statistical study of the merged "extended" graphs obtained
    2.  incorporation of city & local APIs for better resolution of elected officials, campaign funding APIs for influence tracking,
          congressional bill APIs, public health, socio-economic, and voting history APIs
    3.  not all articles are equal, and as such weighting article relations differently is very important.
    4.  better disambiguation of entities, use of alias lists, automated merging tools
    5.  simplification of webscraping solution & refactoring code to handle "parties" more generally for non-US cases
    6.  expanding NER solution to provide for more language handling (Catalan for instance)
    7.  refactoring text-snippet solution for better scalability.
    8.  developing mechanism for downloading, processing and adding new articles for existing politicians.
    9.  assessing use of multiplex paradigm by introducing additional link types (“neighboring districts”,
          “author of bill”, “member of committee”, etc.) for more robust network analysis.
    10.  leverage posteriors of LDA for better topic analysis, and similarly leverage stochastic community detection methods
    11.  relationship labeling/role discovery incorporation ( signed positive/negative edges when applicable)

    QUESTIONS ?
    TELL YOUR TEXAS FRIENDS :)

    Who You Elect Texas: whoyouelect.com/texas
    Slides: whoyouelect.com/texas/presentation

    email: diegoolano@gmail.com    web: diegoolano.com
    github: github.com/diegoolano    twitter: dgolano