Visualizing Global Trade with Chord Graphs

This image is largely the result of me brushing up on my R skills while looking for an excuse to create a chord graph. The image below shows the imports and exports of the 50 countries with the highest reported trade values, according to Worldbank.org.

The first step was to download the data available above, and use an R script to combine ~200 .csv files into a single table:

library(plyr)
setwd("C:...")
trade <- ldply(.data = list.files(pattern = "*.CSV"),
               .fun = read.csv,
               header = FALSE,
               col.names = c("Reporter", "Partner", "Product categories",
                             "Indicator Type", "Indicator",
                             as.character(2018:1988)))

This single large table was filtered to include only the countries with the highest reported totals. I did not distinguish between imports and exports; I simply totaled the “2018” column by “Reporter” and kept the top 50 countries. From there, a matrix of trade relationships was built: the (i, j) entry represents exports from country i to country j. With an adjacency matrix, a list of country labels, and the chorddiag package, it’s fairly straightforward to create the graph:
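The aggregation step can be illustrated with a toy pandas sketch (the "Reporter", "Partner", and "2018" column names follow the combined table above; the values and the top-2 cutoff are made up for the example, standing in for the top-50 cutoff):

```python
import pandas as pd

# Toy stand-in for the combined trade table.
trade = pd.DataFrame({
    "Reporter": ["A", "A", "B", "B", "C"],
    "Partner":  ["B", "C", "A", "C", "A"],
    "2018":     [10.0, 5.0, 7.0, 3.0, 1.0],
})

# Keep the top-N reporters by total 2018 trade (top 50 in the post).
top = trade.groupby("Reporter")["2018"].sum().nlargest(2).index

# Build the exporter-by-importer matrix: entry (i, j) is exports from i to j.
matrix = (trade[trade["Reporter"].isin(top) & trade["Partner"].isin(top)]
          .pivot_table(index="Reporter", columns="Partner",
                       values="2018", aggfunc="sum", fill_value=0))
```

The actual filtering for the visualization was done in R before exporting `trade_matrix_top50.csv`; this is just the same idea in pandas.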

setwd("C:...")
trade_matrix <- as.matrix(read.csv("trade_matrix_top50.csv", header = FALSE))
trade_matrix <- matrix(as.numeric(trade_matrix), ncol = ncol(trade_matrix))
trade_matrix[is.na(trade_matrix)] <- 0

countries = c("United States","European Union","China","Germany","Hong Kong, China","Japan","Mexico","Canada","Korea, Rep.","France","Netherlands","United Kingdom","Belgium","Italy","Singapore","Other Asia, nes","United Arab Emirates","India","Saudi Arabia","Spain","Switzerland","Russian Federation","Australia","Poland","Malaysia","Thailand","Brazil","Czech Republic","Indonesia","Austria","Ireland","Sweden","Turkey","Norway","Hungary","Denmark","Portugal","Philippines","Slovak Republic","Chile","Romania","Kuwait","South Africa","Finland","Israel","Argentina","Qatar","Nigeria","Kazakhstan")

dimnames(trade_matrix) <- list(Exporters = countries,
 Importers = countries)

library(chorddiag)
# 27-color palette, recycled so that each country gets a color
palette27 <- c("#069668", "#e96b22", "#a0b099", "#fef48e", "#bef451",
               "#591489", "#155392", "#4b381d", "#992a13", "#51f310",
               "#788ce0", "#ce3bf6", "#fd1e6e", "#38f0ac", "#c0852a",
               "#c87ef8", "#0b5313", "#e5808e", "#25b8ea", "#0f144c",
               "#fbedf1", "#89aa0d", "#871550", "#ee0d0e", "#b0fff9",
               "#21a708", "#270fe2")
groupColors <- rep(palette27, length.out = length(countries))
export_graph <- chorddiag(trade_matrix, groupColors = groupColors, groupnamePadding = 10, showTicks = FALSE)

library(htmlwidgets)
saveWidget(export_graph, file=paste0( getwd(), "/chord_interactive.html"))

Sleep Cycle of a Newborn Baby

Between my interest in all things data related and my wife’s interest in keeping the baby on a schedule, this project was bound to happen eventually. We chose the Hatch Baby app to track important events (eating, sleeping, and “output”) before Myles was even born. Around the same time, I started tracking my own activity and I thought it might be fun to track the baby’s too.

Concentric rings represent days, with midnight at the top of the circle. Dark blue shading represents sleep, while grey represents awake time. Awake is the default, so some days (weeks 2-3) are erroneously devoid of “sleep” because we were too tired to record!

Images like this one have been floating around on sites like r/dataisbeautiful/, so I can’t claim the idea as my own. However, I still think there’s value in applying the same technique to my own data. The process was fairly straightforward.

  1. Collect data in the Hatch App.
  2. Export from Hatch to .csv file.
  3. Transform the time data to a usable format.
    1. Hatch exported times in a format that Python, R, Excel, and Google Sheets all disliked. The information was there, but it was not parsing cleanly.
    2. For speed (this was an afternoon project), I used Excel to parse a date/time from the existing data; if I were going to replicate the process, I would probably set up a Python function or a Knime flow.
  4. Import transformed .csv to Python with pandas package.
  5. Use matplotlib to place the data on a polar graph.
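Step 3 can be sketched in Python. This assumes the cleaned timestamps parse as ISO-like strings (the original Hatch format did not, which is why Excel was used); it converts each one into the fraction-of-day and day-index values the plotting code below expects as startTime/endTime and pyDate:

```python
from datetime import datetime

def to_plot_fields(stamp, origin):
    """Convert a timestamp string into a fraction of the day (0.0-1.0)
    and a day index relative to an origin date, for a polar plot."""
    t = datetime.fromisoformat(stamp)   # assumes ISO-like cleaned input
    frac = (t.hour * 3600 + t.minute * 60 + t.second) / 86400
    day = (t.date() - origin.date()).days
    return frac, day

# Hypothetical example: 6 AM, two days after the origin date.
origin = datetime(2020, 1, 1)
frac, day = to_plot_fields("2020-01-03T06:00:00", origin)
```

The fraction of the day maps directly to an angle (multiply by 2π), and the day index maps to the ring radius.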

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Import sleep data in .csv format.  This data has already been cleaned.
sleepData = pd.read_csv('sleep.csv')

# Change the resolution of our plot figure to a ridiculously high value
fig = plt.figure(dpi=2000)

# Change the plot to a polar coordinate system
ax = fig.add_subplot(111, polar=True)

# Loop through each of the baby's recorded naps
for _, nap in sleepData.iterrows():
    # For each nap, draw a grey arc from the start of the nap to the end of the nap.
    # The arc sits at a distance from the origin that corresponds to the date.
    theta = np.linspace(2 * np.pi * nap['startTime'], 2 * np.pi * nap['endTime'], 50)
    radius = np.ones(50) + 5 * (nap['pyDate'] + 50)
    ax.plot(theta, radius, color='lightslategrey', linewidth=.4, linestyle='-')

# Turn off the axes and set the background color.
ax.axes.get_xaxis().set_visible(False)
ax.axes.get_yaxis().set_visible(False)
ax.set(facecolor = "darkblue")

# Show the plot!
plt.show()

Graphing the Grift: Mapping the Communication Flow within Enron

I chose to perform an exploratory analysis of a dataset containing the emails of the now-infamous Enron corporation. Enron is perhaps the most famous example of corporate corruption and mismanagement, and I won’t dwell on the details of the company’s rise and fall. During the course of investigations by shareholders and government regulatory agencies, the content of the emails became a matter of public record. These emails give us the ability to track the flow of information in the company in its final months. This 1,100-node, 1,802-edge subset of the emails suggests two interesting ideas:

The full graph of the Enron corporation, as seen through email connections

1. It is obvious that the Chief of Staff at Enron, Steven Kean, was by far the most central communication figure at the company. It is less obvious that the vast majority of information passed through a narrow subset of employees.
2. Emails at Enron tended to flow “downward.” It was common for supervisors to send emails to their subordinates, but not very common for subordinates to email their supervisors.

The first visualization supports the claim that the vast majority of email traffic passed through a limited number of individuals. The metric best suited to show which individuals are good information brokers is “betweenness,” which counts the number of shortest paths between node pairs that pass through the node in question. In a flat organization, we would expect this number to remain low, as each node can set up ad-hoc relationships to pass information. In a highly tiered organization, betweenness values vary greatly, as information must flow through managers before it is eventually disseminated. Of the 1,100 nodes, only 74 have a betweenness of 24 or more. Filtering on this value eliminated over 94% of the nodes, leaving only the most central communication figures. This reveals that information flow at Enron passed through a few key individuals, while the majority of workers remained disconnected from departments other than their own.
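The graphs in the post were built in Gephi, but the betweenness filter can be sketched in networkx on a toy directed graph. Here a single "broker" node sits on every path between two halves of the organization, and the raw (unnormalized) path counts mirror the 24-path cutoff used above; node names and the cutoff of 4 are made up for the example:

```python
import networkx as nx

# Toy email graph: node "b" brokers all communication between the halves.
G = nx.DiGraph([("a1", "b"), ("a2", "b"), ("b", "c1"), ("b", "c2")])

# Betweenness counts the shortest paths that pass through each node;
# normalized=False gives raw path counts rather than fractions.
bc = nx.betweenness_centrality(G, normalized=False)

# Keep only the high-betweenness "information brokers".
brokers = [n for n, v in bc.items() if v >= 4]
```

On the real dataset, the same filter with a threshold of 24 keeps the 74 central nodes described above.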

Visual 1: Splitting the graph into two subsets based on betweenness clearly highlights key members of the organization

The second visualization is similar to the first, but instead of filtering on betweenness it uses the out-degree of each node to show which individuals sent a significant portion of the emails. Email is, by nature, a directed form of communication, so the degree of each node separates cleanly into in-degree (received) and out-degree (sent). When examining the distribution of these metrics, it became clear that the vast majority of employees received far more emails than they sent. Over 75% of the email traffic was sent from 42 nodes to the other 1,058. The remaining 25% of traffic was split, with one third (8% of the total) taking place among the 42 “managers” and two thirds (16% of the total) among lower-level employees. The sparsity of the edges connecting nodes at different levels of the company illustrates that most emails were transmissions from the leadership toward the followers.
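The in/out split is straightforward in networkx as well; a minimal sketch with a made-up "manager" node that sends three emails and receives one, and a hypothetical sender cutoff of 3:

```python
import networkx as nx

# Toy directed graph: one "manager" m emails three employees; one replies.
G = nx.DiGraph([("m", "e1"), ("m", "e2"), ("m", "e3"), ("e1", "m")])

# In a directed graph, degree splits into sent (out) and received (in).
senders = [n for n in G if G.out_degree(n) >= 3]   # heavy email senders
```

Filtering on out-degree rather than betweenness isolates the senders, which is exactly the split shown in Visual 2.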

Visual 2: Email readers versus email senders. The sparse senders mostly overlap with the high-betweenness individuals.

I chose a Fruchterman-Reingold layout for both of my visualizations because I felt it offered a well-balanced, consistent view of the information. The symmetry and even spacing between nodes allowed the information to be neatly presented without dense areas becoming overcrowded. The side-by-side display of employees vs. management was done to better illustrate the point that most employees were not a central part of the organization. It would be simple to say that we filtered 95% of the nodes when we took out those that sent fewer than 10 emails, but actually seeing how sparse the graph becomes is much more effective.
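The Fruchterman-Reingold layout is also available outside Gephi; in networkx it is exposed as spring_layout. A minimal sketch on a built-in example graph (the seed value is arbitrary):

```python
import networkx as nx

G = nx.karate_club_graph()

# spring_layout implements the Fruchterman-Reingold force-directed
# algorithm: nodes repel one another while edges pull endpoints together.
pos = nx.spring_layout(G, seed=42)   # seed makes the layout reproducible
```

The resulting dictionary maps each node to a 2-D position, which can then be passed to any drawing routine.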

The primary factors driving my choices of node and edge attributes were clarity and consistency. I wanted the nodes to be large enough to communicate key information, but small enough that the visualization did not appear crowded. This was especially true for node labels, where names had to be legible without overcrowding the image. In the left half of each visualization, I chose not to show name labels because the names added no context. In fact, because these individuals were all lower-level employees, omitting their names is somewhat representative of their actual role in the Enron scandal, which was largely blamed on the employees whose names appear in the right halves of the visualizations.

The final enhancements that I used to improve the aesthetics of the visualizations were curved edges, thicker edges, smaller node sizes, and labels that scaled with node size. These all helped make the visualizations more organized and clear. The curved edges prevented edges from overlapping with other edges or nearby nodes, while the thicker edges ensured that they would remain visible when the image was scaled down to fit on the page.

Graphs created in Gephi.

Data Source: Enron Email Dataset. William W. Cohen, 08 May 2015. https://www.cs.cmu.edu/~enron/