Automated Construction & Analysis of
Political Networks
via open government & media sources

by Diego Garcia-Olano

Advisors: Marta Arias & Josep Lluís Larriba Pey


Text, Nodes, Edges, & Narco Networks

Network: Undirected with Weighted Edges

Text: "Los Señores del Narco"

Nodes: Actors

Edges: an edge exists when 2 Actors co-occur
in the text within 200 characters of each other.

*Nodes are sized in relation to how many edges (i.e., relationships) they have.
*Edges are sized by how many co-occurrences exist between the two actors.
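The co-occurrence rule above can be sketched in a few lines of Python; the sample text and actor names are purely illustrative, and a real pipeline would also need name disambiguation:

```python
from itertools import combinations

def cooccurrence_edges(text, actors, window=200):
    """Build weighted, undirected edges between actors whose mentions
    appear within `window` characters of each other."""
    # Record every character offset where each actor's name appears.
    positions = {actor: [] for actor in actors}
    for actor in actors:
        start = 0
        while (idx := text.find(actor, start)) != -1:
            positions[actor].append(idx)
            start = idx + 1
    # Edge weight = number of close mention pairs for the two actors.
    edges = {}
    for a, b in combinations(actors, 2):
        weight = sum(1 for i in positions[a] for j in positions[b]
                     if abs(i - j) <= window)
        if weight:
            edges[frozenset((a, b))] = weight
    return edges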

Fig. 1. Mexican narco-network constructed from the book "Los Señores del Narco" by Anabel Hernandez.
Jesús Espinal-Enriquez, J. Mario Siqueiros-García, Rodrigo García-Herrera, & Sergio Antonio Alcalá-Corona.
"A literature-based approach to a narco-network".  In Social Informatics, pages 97–101. Springer, 2014.

The Mystery That is Texas

Population: 27 Million People

State Congress: House of Representatives & Senate
150 Representatives and 31 State Senators

Federal U.S. Congress: Texas has
36 Representatives and 2 Senators

Other elected Texas officials:
Governor, Lieutenant Governor, Speaker of the House,
Attorney General, Commissioners, etc.

State Level Judges:
9 Supreme Court and 9 Court of Criminal Appeals

Not to mention all the City and Locally elected politicians. Mayors, city councilmen, etc.

As of 2012, the GDP of Texas puts it in the top 20 richest countries of the world, just below Spain.

Texas House Congressional Districts Map.  150 districts total

News Can Educate The Masses
...but it can beat them senseless as well.

In the last election for Governor of Texas,
the winner received 2.8 million votes from a pool of 16.7 million eligible voters
... and won by a landslide, with a margin of 20%!

How much would a citizen need to read to keep up with the politicians who represent them?
What about the biases of those news sources?

Could this lack of voter participation be due in part to actual information overload or,
more likely, the daunting prospect of gathering all the pertinent and trustworthy data in one spot in order to then absorb it and make an informed decision?

We decided to build a tool which can...
aggregate the results of article searches for politicians
using relevant news sites encompassing a spectrum of viewpoints
to summarize and display the network of politicians, organizations, etc. around a politician
and present the prevalent themes which exist within the articles found for that politician.

To test our system, we use as a case study,
  1. a list of 246 currently active Texas elected officials and
  2. a list of 6 news sites that cover Texas politics in some way:

        the Austin American Statesman, the Dallas Morning News, the Houston Chronicle,
        the New York Times, the Texas Observer and the Texas Tribune

The list of politicians and news sites to use is completely up to the user!

Presentation Outline

  • Introduction
  • Background & Related Work
  • Overview of
          Network Science Approach
             1. Individual Star Network of a Politician
             2. Extended Network of a Politician
          Text Mining Approach
             3. Automatic Summarization of a Politician
  • Automated Construction of Networks
  • Conclusions & Future Work
  • Who You Elect Texas Case Study Results, Graphs, Maps:

    Background & Related Work

    Defining Big Words

    We will be constructing undirected heterogeneous networks
    with weighted edges in political contexts.

    Heterogeneous refers to the existence of different node types that will compose the graph.  For instance, nodes (also referred to as "entities") in our system will be labeled as People, Organizations, Politicians, Locations, Bills, or Miscellaneous.

    Edges represent article text co-occurrences between two entities.
    The weight of an edge is the number of co-occurrences found between two entities
    according to some distance metric; in our case: "same sentence", "near",
    "same article", or our proposed "combined" metric.

    We consider only one relationship type, so there is at most one edge between
    a pair of nodes; as such, we are not in a multiplex context.

    Of the prior works cited in our work,

    • Some rely on time-intensive, hand-crafted networks or
      use a point-in-time snapshot of a curated news corpus.
    • Others rely on article lookup systems that offer an impressive breadth of sources, but that constrains them to use only sources from that list, which carries an additional quality-assurance weakness.
    • Some leveraged paid-for search engine results, processed only the first twenty results for each query, and then used only the snippet of text present in Yahoo's search results page as opposed to all the content within the actual article itself.
    • We were unable to find any work that leveraged the publicly available search engines present in most news websites.  By utilizing this mechanism, our tool allows the flexibility to create a context in which to search and, by doing so, to curate the content to a user's exact needs.

    Information retrieval aspects aside,

    • The prior works all use natural language processing (NLP) to extract entities using either
      1. similarity metrics based on some combination of:
          entity co-occurrence,
          textual contexts of and shared between entities, &
          the correlation of entities and hyperlinks found in documents, OR
      2. topic modeling to infer relationships of interest.
    • Topic modeling approaches allow for relationships "across articles", but are very noisy.
      The use of TM for relationship labeling and as a complement for edge detection
      is interesting and left for future work.

    Overview of

    Who is Eddie Rodriguez?

    Table Of Contents View

    Maps Maps Maps

    Committee Assignments

    Listing of Committees:
    Committee Members for a single Committee:

    1. Individual Star View of a Politician Rodriguez

    Pro-tip: It's much easier to just click on "Inner Network" from the Table of Contents view

    URL parameters

    s: name of Politician
    show: only show edges with weight greater than or equal to show
    dm (optional): which distance metric to use. Possible values: ss, sn, all, comb, corresponding to Same Sentence, Same or Near, All Co-Occurrences, and the proposed Combined Metric. Defaults to "All"
    near_co (optional): Near Sentence coefficient in calculation for Combined Metric
    same_art_co (optional): Same Article coefficient in calculation for Combined Metric
    from (optional): date from which to include articles found. Expected format: YYYY-MM-DD
    to (optional): end date for inclusion of articles. Expected format: YYYY-MM-DD, ex: 2008-07-01
    exclude (optional): which news sources to exclude. Possible values: AAS, DMN, HC, NYT, TXOB, TXTRB. Use commas to exclude multiple

    Combined Metric

    The proposed "combined" weight of an edge between two nodes is defined as:

    w = BOOST × [ s + ( α × n ) + ( γ × a ) ]
    BOOST = ( s + n + a ) / a
    s = number of (s)ame sentence occurrences of the two nodes in the texts
    α = coefficient for near sentence occurrences.  Set to 0.5 by default.  URL parameter: near_co
    n = number of (n)ear sentence occurrences ( within 3 sentences of one another )
    γ = coefficient for same article occurrences.  Set to 0.1 by default.  URL parameter: same_art_co
    a = number of same (a)rticle occurrences ( outside of 3 sentence distance )

    BOOST is a boosting term to enhance relationships with more same and near sentence co-occurrences
    and to penalize those which are largely composed of "same article" (far) co-occurrences.
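    A minimal Python sketch of the metric; the BOOST term is reconstructed here as (s + n + a) / a from the garbled slide formula, and the fallback to 1.0 when a = 0 is an assumption (the slide leaves that case unspecified):

```python
def combined_weight(s, n, a, near_co=0.5, same_art_co=0.1):
    """Proposed 'combined' edge weight between two entities.

    s: same-sentence co-occurrences
    n: near-sentence co-occurrences (within 3 sentences)
    a: same-article co-occurrences (beyond 3 sentences)
    near_co, same_art_co: the alpha and gamma coefficients (URL parameters).
    """
    base = s + (near_co * n) + (same_art_co * a)
    # BOOST rewards close co-occurrences and penalizes "same article" ones.
    # Assumption: with no far co-occurrences (a == 0), no penalty applies.
    boost = (s + n + a) / a if a else 1.0
    return boost * base
```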

    A Couple Informal Network Science Definitions

    Degree: number of edges a node has
    Strength: the sum of the (weighted) edges of a node
    Betweenness: number of shortest paths that pass through a node;
    who is near the center of a community
    PageRank (Google): you are important if you are friends with important people
    Transitivity / Clustering Coefficient: how many of your friends are friends with each other.  "Connectedness"
    Louvain: fast, nondeterministic "community detection" algorithm.
    It is a greedy optimization method that attempts to optimize the "modularity" of a partition of the network and runs in O(n log n).
    Modularity: how good the overall quality of separation is across all communities.
    Networks with high modularity have dense connections between the nodes
    within modules but sparse connections between nodes in different modules.
    Conductance: fraction of edges leaving a particular community
    Expansion: number of edges per node leaving a particular community
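    The first two measures can be computed directly from a weighted edge list; a small illustrative sketch (entity names hypothetical):

```python
from collections import defaultdict

def degree_and_strength(edges):
    """edges: dict mapping an undirected pair (a, b) -> weight."""
    degree = defaultdict(int)      # number of edges touching each node
    strength = defaultdict(float)  # sum of edge weights at each node
    for (a, b), w in edges.items():
        degree[a] += 1
        degree[b] += 1
        strength[a] += w
        strength[b] += w
    return dict(degree), dict(strength)

edges = {("Rodriguez", "Austin"): 12.0,
         ("Rodriguez", "HB 100"): 3.0,
         ("Austin", "HB 100"): 1.0}
deg, stren = degree_and_strength(edges)
```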

    2. Extended View of a Politician Rodriguez

    Edges in the extended graph are all weighted according to the proposed "Combined" metric!
    Placement of nodes/communities is not based on a force-directed method,
    but instead is calculated to maximize separation and clarity
    Pro-tip: It's much easier to just click on "Extended Network" from the Table of Contents view

    URL parameters

    s: name of Politician
    t: only show edges with weight greater than or equal to threshold t.  Defaults to 15.
    cl: number of communities to discover in the network.  Defaults to 25.

    Media & Graph Analysis Tools

    Media Coverage Maps

    Who (Are These People) You Elect?


    How to get an overall picture of the political landscape of Texas
    given the data we obtained for our list of politicians?

    2 approaches
    1. Network-Centric approach based on merging all extended graphs
    2. Text Mining approach using information retrieval & topic modeling techniques

    1.  Determine the importance of politicians/entities via Network Analysis & centrality measures.
    This would produce interesting detailed insights into the political landscape as a whole,
    but would not provide a simple summary of the issues & topics surrounding each politician.

    2.  The Text Mining approach is not entity based & is a better complementary approach.

    *image from prior slide

    Summarization of a Politician via text mining

    1. Treat the combined text of the articles a politician occurs in as a single corpus,
    2. Create a document term frequency matrix where each row is an article, the columns
    are terms that occur in the corpus & a cell is the # of times a term appears in an article.
    3. Use this to calculate the TF-IDF (term frequency–inverse document frequency) of the corpus.
    4. Using the TF-IDF value we filter out terms that appear all the time & provide little distinguishing
    information, or inversely those that occur very rarely and could be noise. It's an art.
    5. Using the refined corpus, run Latent Dirichlet Allocation to uncover the “topics”
    (ie, issues, latent concepts) within the articles for a single politician.
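    Steps 2-4 can be sketched in plain Python. This mirrors the normalized term frequency × log2 inverse-document-frequency filtering idea; the toy corpus and the 0.1 cutoff are illustrative:

```python
import math
from collections import Counter

def mean_tfidf(docs):
    """docs: list of token lists (one per article).
    Returns the mean TF-IDF score per term across the documents containing it."""
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each term
    tf_sums = Counter()    # sum of length-normalized term frequencies
    for tokens in docs:
        counts = Counter(tokens)
        for term, c in counts.items():
            doc_freq[term] += 1
            tf_sums[term] += c / len(tokens)
    return {t: (tf_sums[t] / doc_freq[t]) * math.log2(n_docs / doc_freq[t])
            for t in doc_freq}

docs = [["texas", "border", "border"], ["texas", "oil"], ["texas", "schools"]]
scores = mean_tfidf(docs)
# "texas" appears in every article, so its IDF (and score) is 0: filtered out.
keep = {t for t, s in scores.items() if s > 0.1}
```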

    LDA assumes that words in documents are generated by a mixture of topics, with each topic itself being a distribution over words.  Here the number of topics is fixed initially, and after LDA is run, each article in the corpus of a politician is assigned a topic.  Each topic is a set of words ranked by the probability of each word occurring.
    Because we know how many documents were assigned to a given topic,
    the most frequent topics serve to summarize a politician's "issues".

    * For a more thorough explanation of LDA and approaches for assessing topic quality see:

    LDA with 3 topics and 4 words per topic.

    Some Possible Texas Words
    alamo, farms, freedom, guns, immigration, marriage, obamacare, oil, rights, solar

    civil rights, farmers markets, for fracking, gulf coast, pecan pie, redistricting map, school breakfast, wind farms, workers compensation

    Which politician is
    most associated with the word ... ?

    We ran LDA using 20 topics* over each politician's article set for single words, bi-grams (pairs of words),
    and tri-grams separately, using a script we developed in R with the "tm" and "topicmodels" packages

    *based on performance of different topic number values normally used in the literature (5, 10, 20, etc.)
      The optimal value is specific to each politician & the analysis one wants to present.
      As long as a sufficient number of topics is used, it has less overall effect than
      the thresholds used in determining which terms to keep in the refined corpus.


    Before, in the Community Analysis View, we saw details about the central figures and articles of a community

    It would be useful, however, to have a way of automatically describing the community at a high level.
    Then we could label all communities in this way and gain a global view for understanding & comparing them.

    One way to do that is by treating the articles of a given community collectively as a single corpus.
    We can then analyze the corpus using the same procedure we use to “summarize politicians”;
    namely an initial TF-IDF procedure to filter terms and reduce noise, followed by LDA to derive topics.

    The difference in this context though is that ...
    we know how many entities from a community occurred in each article in the community,
    and thus we can weigh these articles and their text by their relative importance in the community.
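    A sketch of that weighting idea, assuming each article arrives as a (tokens, entity-count) pair; the names and counts are illustrative:

```python
from collections import Counter

def weighted_community_counts(articles):
    """articles: list of (tokens, n_community_entities) pairs.
    Scale each article's term counts by the share of community entities
    it mentions, so entity-rich articles dominate the community profile."""
    total_entities = sum(n for _, n in articles)
    counts = Counter()
    for tokens, n_entities in articles:
        weight = n_entities / total_entities
        for term in tokens:
            counts[term] += weight
    return counts

counts = weighted_community_counts([(["oil", "oil"], 3), (["schools"], 1)])
```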

    Finally Some Code

    library(tm)           # Corpus, DocumentTermMatrix
    library(topicmodels)  # LDA, topics(), terms()
    library(slam)         # row_sums, col_sums for sparse matrices

    k = 5            # num of communities to look for
    highend = 4000   # want fewer than this many terms
    lowend = 2000    # want more than this many terms
    tsvfile = "eddie_rodriguez-articles.tsv"                           # community article texts
    articlesForCommunity <- read.csv("lda/community-articles-lda.csv") # community article metadata
    colnames(articlesForCommunity) <- c("url","date","entities")
    ngramtype <- 1
    weighbyentity = T
    article_cutoff = 1
    runTopicModelingOnCommunity <- function( ... ){
        # .... load tsv to get article texts and put in JSS_papers list
        corpus <- Corpus(VectorSource(sapply(JSS_papers[, "text"], remove_HTML_markup)))
        JSS_dtm <- DocumentTermMatrix(corpus, control = list(stopwords = TRUE,
                        minWordLength = 3, removeNumbers = TRUE, removePunctuation = TRUE))
        # mean TF-IDF per term; the trailing * keeps this as one expression
        term_tfidf <- tapply(JSS_dtm$v / row_sums(JSS_dtm)[JSS_dtm$i], JSS_dtm$j, mean) *
                      log2(nDocs(JSS_dtm) / col_sums(JSS_dtm > 0))
        cutoffvals <- get_cutoffval_and_type(term_tfidf, highend, lowend)
        # .... filter corpus by cutoff vals ...
        jss_TM <- LDA(JSS_dtm, k = k, control = list(seed = SEED))
        Topic <- topics(jss_TM)
        Terms <- terms(jss_TM)
        # ... generate most frequent topics of the articles and the terms for each one
    }

    Automated Construction of Networks

    General Overview of Graph Construction Process

    Languages, Libraries, and Databases
    Python: general backend work
    MongoDB: to store article texts and entity information
    MITIE: MIT open source Named Entity Recognition tool
    BeautifulSoup and Selenium libraries:
        python webscrapers used in obtaining articles.
        BS4 is for static web pages,
        while Selenium, using the PhantomJS webdriver,
        handles pages constructed dynamically by javascript
    an open source python library is used for language detection
    D3.js: network visualizations & maps
    jQuery UI: some frontend interactivity functionality
    jLouvain: javascript Louvain community detection
    html5 webworkers:
        asynchronous, nonblocking JS load of data for graphs

    3.1 Adding Politicians From Open Gov. Sources

    Texas Congress data was obtained via an open government API, for both active and inactive members

    Federal Congress data was obtained via an open government API, for current federal representatives

    Other Texas state officials' data was obtained from the Secretary of State of Texas website via a script

    In the end, this gives us the metadata for all the politicians we want to look into as JSON text documents, but this could have just as easily come from a CSV file that was manually created.

    These APIs were leveraged for their richness, ease of use, and for reproducibility with other US states.
    Future plans include leveraging Google's Civic API for access to city- and county-level information

    3.2 Adding News Sources

    A subset of Texas newspapers with the highest circulation, the Dallas Morning News, the Houston Chronicle, and the Austin American Statesman, somewhat representing the spectrum of conservative, centrist, and progressive areas within Texas, was selected, along with two sites, the Texas Observer and the Texas Tribune, which focus on Texas politics and issues.

    In addition, the New York Times was selected to provide an outside context.

    This selection of sources was made with the intention of presenting a reasonable mix of representative media about the state of Texas. In the end, what's most important is that
    a user of the tool gets to select any news sources they wish to use.

    3.3 Setup for Data Acquisition By Template Modification

    Two template "web scraper" solutions are provided, and a technical user of the tool can modify the appropriate one to fit the needs of the site they wish to include.

    The distinction between the two is based on whether a news site renders its content statically, conforming with good web standards, or dynamically via Javascript, as many do. The static version, based on BeautifulSoup, acquires data more quickly but cannot handle dynamic content, whereas the dynamic version, based on Selenium/PhantomJS, can handle static or dynamic content but runs more slowly. Both were used for time considerations, since many articles were downloaded and processed for each news source for each politician.

    In our case study, half the sites used one template while half used the other. In its current state the setup is functional, but it probably requires too much from an end user of the tool, and thus future work is planned to unify the approach into one template.

    3.4 Running and Storing
    the Web Search Results for each Active Entity

    First, to avoid confusion: an Entity is one Politician from our list.
    This step calls the webscraper template for a politician,
    which, for each news source, downloads a list of article urls available from its internal search engine.
    It then downloads those articles separately and places them into JSON files,
    each of which is then "postprocessed" to add the politician's name as an indexing mechanism for the result
    and to run language detection on the text of the article, adding it as a field,
    before importing it into MongoDB.
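    The postprocessing step might look roughly like the following; the JSON field names and the detect_lang callable are illustrative, not the tool's actual schema:

```python
import json

def postprocess_article(raw_json, politician, detect_lang):
    """Add indexing and language fields to a downloaded article
    before it is imported into the database."""
    article = json.loads(raw_json)
    article["politician"] = politician               # indexing field
    article["lang"] = detect_lang(article.get("text", ""))
    return article

raw = '{"url": "http://example.com/a1", "text": "El gobernador habló hoy."}'
# A stub detector stands in for a real language-detection library.
doc = postprocess_article(raw, "Eddie Rodriguez", lambda text: "es")
```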

    3.5 Processing article results per Active Entity

    At a high level, this step takes all the articles downloaded for a given politician, processes them,
    and then stores the network data that will be used to construct two graphs for the politician,
    which we will see later:
    1. the Individual Star Network, and
    2. the Extended Network View

    3.5 Processing article results ... Continued

    This step goes through all the articles for a politician one by one and
    1. filters out empty articles, sports articles, and articles not containing the politician's name explicitly.

    2. It then splits an article into sentences, and runs the MITIE Named Entity Recognition library over each one.
    MITIE finds "entities" in each sentence and gives each a tag of "person","location","organization" or "misc".
    We additionally check whether "person" tags are "politicians" using our entities DB and whether any
    Congressional "bills" exist in the sentence (independent of the NER output) using a heuristic,
    and if so we tag them as "politician" or "bill".

    3. Then we do a sort of coreference resolution over all the entities found to give us an additional
    dictionary of all distinct entities found in the article.

    4. From this and the tagged sentences, we then find and store all co-occurrences that occurred within
    the same sentence, within three sentences, or outside of that distance,
    for all entities in the dictionary.

    5. At this point, the article has been processed, and we merge and save it locally.
    Once all the articles are processed, we save the results and statistics on the results and
    proceed to step 3.6, which creates two datasets for the aforementioned network views.
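    Step 4's distance classification can be sketched as follows, taking per-sentence entity sets as input (entity names illustrative):

```python
from itertools import combinations

def classify_cooccurrences(tagged_sentences, near_window=3):
    """tagged_sentences: one set of entities per sentence, in order.
    Counts same-sentence, near (within `near_window` sentences), and
    same-article (farther apart) co-occurrences for each entity pair."""
    locations = {}  # entity -> sentence indices where it appears
    for i, entities in enumerate(tagged_sentences):
        for e in entities:
            locations.setdefault(e, []).append(i)
    counts = {}
    for a, b in combinations(sorted(locations), 2):
        same = near = far = 0
        for i in locations[a]:
            for j in locations[b]:
                d = abs(i - j)
                if d == 0:
                    same += 1
                elif d <= near_window:
                    near += 1
                else:
                    far += 1
        counts[(a, b)] = {"same_sentence": same, "near": near, "same_article": far}
    return counts

sentences = [{"Rodriguez", "Austin"}, set(), {"Austin"}, set(), set(), {"Rodriguez"}]
pair_counts = classify_cooccurrences(sentences)
```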

    Conclusions & Future Work


    We presented ...
    1.  a tool that generates real-world political networks from user-provided lists of politicians and news sites,
    2.  enriched with data obtained from open sources to provide structure via verified politician metadata.
    3.  the Individual Star and "Extended" graph visualizations, tools & maps for the exploration of a politician's environment.
    4.  a "Combined" distance metric to better assess the strength of relationships between actors in a graph.
    5.  automated summarization tools to extract topics and issues characterizing individual politicians.
    6.  a proof-of-concept use of topic modeling for labeling communities in a politician's "extended" network (not shown).

    Voters: a simple and trustworthy mechanism for voter education.
    Researchers & NGOs: construct and tailor real world graphs of their exact choosing to study.
    Journalists & NGOs: discover potentially new stories

    System based entirely on open data & tech.
    Does not require massive computational power or storage capabilities (all content downloaded/processed on a laptop)

    *  On a personal note, I grew up in Texas and had no idea who many of these politicians were before creating these tools,
    but now I'm able to quickly gain a fairly good idea of any politician here, provided that sufficient articles were processed for them.
    That is very powerful and quite useful, especially when attempting to understand a politician's history given our 24-hour news cycle.

    Future Work

    1.  an extensive statistical study of the merged "extended" graphs obtained
    2.  incorporation of city & local APIs for better resolution of elected officials, campaign funding APIs for influence tracking,
          congressional bill APIs, public health, socio-economic, and voting history APIs
    3.  all articles aren't equal, and as such weighting article relations differently is very important.
    4.  better disambiguation of entities, use of alias lists, automated merging tools
    5.  simplification of webscraping solution & refactoring code to handle "parties" more generally for non-US cases
    6.  expanding NER solution to provide for more language handling (Catalan for instance)
    7.  refactoring text-snippet solution for better scalability.
    8.  developing mechanism for downloading, processing and adding new articles for existing politicians.
    9.  assessing use of multiplex paradigm by introducing additional link types (“neighboring districts”,
          “author of bill”, “member of committee”, etc.) for more robust network analysis.
    10.  leverage posteriors of LDA for better topic analysis, and similarly leverage stochastic community detection methods
    11.  relationship labeling/role discovery incorporation ( signed positive/negative edges when applicable)


    Who You Elect Texas:

    email: diegoolano@gmail.com
    web:
    github: dgolano