US Political Contributions 2010: Cluster Analysis


Over the past 10 years, I’ve spent much of my time pulling actionable insights out of data. Perhaps I shouldn’t admit this out loud, but I enjoy it.

I’ve recently been experimenting with an approach I’ve developed for identifying clusters of similar transactions within a data set. Seeing every group of clustered transactions within a data set enables one to identify the most interesting clusters, deep dive into why those transactions are clustered and devise tactics to get more clusters if they’re ‘good’ or fewer clusters if they’re ‘bad’.

The approach I’m experimenting with compares every record to every other record for large data sets. This is non-trivial because the number of comparisons grows quadratically with the size of the data set. Consider for example a data set with 3 million rows of data and 4 columns. To compare each record with every other record may require 36 trillion calculations (3M * 3M * 4). It’s easy to cut this number by more than half by not comparing each record to itself and by only doing the comparison in one direction, i.e. assuming that comparing A to B is the same as comparing B to A, but that still leaves almost 18 trillion calculations, and reducing that number further gets trickier.
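To make the arithmetic concrete, here is a minimal Python sketch of the two counts. The function name `comparison_counts` is my own for illustration, and it assumes one calculation per column per pair of records, as in the example above; it is not part of the clustering software itself.

```python
def comparison_counts(rows: int, columns: int) -> tuple[int, int]:
    """Return (naive, unique_pairs) counts of column-level calculations.

    Assumes one calculation per column for each pair of records.
    """
    # Naive: every record against every record, including itself,
    # in both directions.
    naive = rows * rows * columns
    # Skip self-comparisons and treat A-vs-B the same as B-vs-A.
    unique_pairs = rows * (rows - 1) // 2 * columns
    return naive, unique_pairs


naive, unique_pairs = comparison_counts(3_000_000, 4)
print(f"{naive:,}")         # 36,000,000,000,000
print(f"{unique_pairs:,}")  # 17,999,994,000,000
```

The second figure is just under half of the first, which is why dropping self-comparisons and duplicate directions only gets you so far: the count still grows with the square of the number of rows.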

I’ll discuss my approach in more detail over the coming posts but let’s take a look at some results:

The maps below show the political contributions made by individuals to political parties in the US in 2010. Let’s imagine that we are working for a particular political party and we want to increase the amount of contributions we receive from individuals. One way of doing this would be to identify groups of similar people who contribute large amounts of money, understand why those people are big contributors, and then attempt to create more of those groups of people across the country. To do this, we’ll look at the base data set, we’ll run it through the clustering algorithm, we’ll identify the top 20 clusters that we want to investigate further and devise tactics for replicating the ‘good’ clusters.

The underlying data set was created by Open Secrets, a fantastic group in the US collecting, cleansing and posting political data.

Full data set map

The above maps provide some interesting tidbits. For example, in 2010 year-to-date, Puerto Ricans contributed to Republicans, Democrats and Independents, whereas Hawaiians stuck with the major parties (note that the data set shows only direct contributions and excludes PAC contributions). Whilst this information is interesting, it is not particularly actionable and doesn’t help us achieve our goal of increasing the amount individuals contribute to our party.

Now let’s run the data set through the clustering algorithm and look at the clusters:

The following map shows clusters of contributions with a similarity rating of 98% that contain 10 or more transactions and contribute more than USD 10,000 in total to the coffers of the party (using v0.01a of the software).
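The filter described above, together with the top-20 ranking mentioned earlier, can be sketched in a few lines of Python. This is only an illustration of the selection step, not the clustering software: the `Cluster` class, its field names and the `select_clusters` function are hypothetical, the sample figures are made up, and I am assuming the 98% similarity rating is applied as a minimum.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical record for one cluster of similar contributions."""
    cluster_id: str
    party: str
    similarity: float      # similarity rating, 0.0 to 1.0
    amounts: list[float]   # one USD amount per transaction


def select_clusters(clusters, min_similarity=0.98, min_transactions=10,
                    min_total=10_000.0, top_n=20):
    """Keep clusters that pass all three thresholds, largest total first."""
    kept = [
        c for c in clusters
        if c.similarity >= min_similarity        # assumed to be a minimum
        and len(c.amounts) >= min_transactions   # 10 or more transactions
        and sum(c.amounts) > min_total           # more than USD 10,000
    ]
    # Rank by total contribution and keep only the top_n clusters.
    return sorted(kept, key=lambda c: sum(c.amounts), reverse=True)[:top_n]


# Made-up sample data, purely to exercise the filter.
sample = [
    Cluster("a", "D", 0.99, [2_000.0] * 10),   # passes: USD 20,000 total
    Cluster("b", "R", 0.97, [5_000.0] * 12),   # fails: similarity too low
    Cluster("c", "R", 0.98, [1_500.0] * 9),    # fails: only 9 transactions
    Cluster("d", "R", 0.985, [3_000.0] * 11),  # passes: USD 33,000 total
]
for cluster in select_clusters(sample):
    print(cluster.cluster_id, cluster.party, sum(cluster.amounts))
```

Keeping the thresholds as parameters makes it easy to loosen or tighten the filter and see how the map changes.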

This looks better. We can see some interesting information now and we can start asking some questions. What is the large Democrat cluster in LA? Are the Republican clusters in Florida replicable across the country or are they particular to the region? It looks like the Democrats in South Texas and Michigan are doing some amazing things. What are they?

But the first question we’ll ask is “What are the top 20 clusters by contribution amount and what tactic should we apply to replicate the top cluster?”

(To be continued…)

By the way, the reporting tool used to generate the above charts is Tableau. You can download the Tableau data set used in the above analysis as well as the CSV file here. Tableau has a free ‘reader’ version that you can use to explore the data further. If you have the full version of Tableau you can create your own charts using the data set. I’d appreciate seeing your findings.
