Step by Step: How to Cluster-match your Data with Deep Machine Learning

Apr 20, 2017

It’s been 20 years since IBM’s Deep Blue defeated grandmaster Garry Kasparov in chess. I remember watching the drama unfold from my helpdesk computer at my first job out of college, seeing each move in succession, animated in real time. It was a match for the ages and one of the first dynamic interactive events on the web.

Deep Blue had lost to Kasparov the year before but impressively beat him in the rematch. Essentially, Deep Blue won because it could live a lifetime of chess playing in minutes for every single move, calculate the best predictive output for the current board and feed itself more information on how to beat Kasparov with each counter. Today, your iPhone can handle more calculations than Deep Blue could in 1997. It’s time to put the power of machine learning to use!

I have some fake sales data that I’m putting together for a longer piece, but for now I’m going to return to our favorite subject, politics, to illustrate how to use data clustering to predict the next set of voters you want to go after.

Last time you might recall that we did some basic regression analysis, correlations and then created some models and ensembles for the overall picture of the election across the 3100+ counties in the United States. Today, we’re going to look at a subset of that data to see if we (by we I mean a big machine in the cloud!) can find clusters of data points that indicate a winning combination and then apply that to a larger set of the data.

Here’s what I did:

Download the excellent election data from Data.World or elsewhere to get the election results in CSV format. I have some proprietary data in my set so I won’t make my own file available yet.

Next, create an account on BigML.com if you haven’t already. Go to your dashboard, then upload your data to the sources there. (My file was called cc4.csv but you can use XLS too.) Be sure to put the “target” field last. In my case the last column in my .csv file was a field indicating which candidate “won” that county.
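If your file doesn't already have the target column at the end, you can reorder it before uploading. Here's a minimal sketch using only Python's standard library. The column names (county, winner, median_age) and the helper name move_field_last are made up for illustration; they are not from my cc4.csv.

```python
import csv
import io

def move_field_last(csv_text, target):
    """Return csv_text with the target column moved to the end."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if target not in reader.fieldnames:
        raise ValueError(f"no column named {target!r}")
    # Keep the original column order, then append the target column.
    fields = [f for f in reader.fieldnames if f != target] + [target]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(reader)
    return out.getvalue()

# Made-up sample rows standing in for the real county file.
sample = "county,winner,median_age\nA,Trump,44\nB,Clinton,37\n"
reordered = move_field_last(sample, "winner")
print(reordered)
```

For a real file you would read the text from disk and write the result back out before uploading it as a source.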

Click into the file and create a dataset:

Then click into that dataset and choose “Filter Dataset” and narrow the data down to just the battleground states: Colorado, Florida, Iowa, Michigan, Nevada, New Hampshire, North Carolina, Ohio, Pennsylvania, Virginia and Wisconsin. Go back to your original dataset and do the same but this time exclude the battleground states.

You should now have 3 datasets: 1) the temp dataset from the source data (we can ignore that or play with it later); 2) the battleground dataset and 3) the battleground excluded dataset.
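If you'd rather do the split on your own machine before uploading, the same filter can be sketched in a few lines of Python. This is only an illustration of the idea, not what BigML runs under the hood. It assumes a column named "state" holding full state names, and the sample rows are invented.

```python
import csv
import io

# The eleven battleground states named in the filter step above.
BATTLEGROUND = {
    "Colorado", "Florida", "Iowa", "Michigan", "Nevada", "New Hampshire",
    "North Carolina", "Ohio", "Pennsylvania", "Virginia", "Wisconsin",
}

def split_by_battleground(rows, state_field="state"):
    """Return (battleground_rows, other_rows) for an iterable of dict rows."""
    inside, outside = [], []
    for row in rows:
        (inside if row[state_field] in BATTLEGROUND else outside).append(row)
    return inside, outside

# Tiny made-up sample standing in for the real county file.
sample_csv = """county,state,winner
A,Ohio,Trump
B,Texas,Trump
C,Florida,Clinton
D,Minnesota,Clinton
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
battleground, others = split_by_battleground(rows)
print(len(battleground), "battleground rows,", len(others), "other rows")
```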

Now, the fun begins. Click into your battleground dataset and choose 1-Click Cluster:

The Cluster app on BigML.com will find groups of counties with similar results. Each circle marks the center of a group, called a “centroid”. For example, the big red circle has Trump at 65%; it shows a slightly older age demographic on average, a lower diversity index and non-urban strongholds.
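To get a feel for what a centroid is, here's a bare-bones k-means sketch in plain Python. I'm using it as a stand-in to show the idea; it is not BigML's implementation, which also handles things like scaling fields and categorical data. The six "counties" and their two fields (Trump vote share, median age) are invented numbers.

```python
import math

def mean_point(points):
    """Coordinate-wise average of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids so runs are repeatable."""
    centroids = list(points[:k])
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [mean_point(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

# Hypothetical counties as (Trump vote share, median age); values are made up.
counties = [
    (0.66, 44.0), (0.46, 37.0), (0.64, 45.0),
    (0.48, 36.0), (0.65, 43.0), (0.47, 38.0),
]
centroids = kmeans(counties, k=2)
for share, age in centroids:
    print(f"centroid: Trump share {share:.2f}, median age {age:.1f}")
```

Each centroid is simply the average of the counties assigned to it, which is why you can read it like a profile of a "typical" county in that group. Note that the unscaled age field dominates the distance here; a real tool would normalize the fields first.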

This one is interesting but it’s not going to help Trump win more counties the next time around. In politics you spend your time targeting the margins. These counties with a 65% Trump vote probably don’t need that much attention and I can readily identify them across the board. Let’s look at this other centroid to the left. Notice it has a slight Trump win with 47% of the vote and, while it notes this cluster is in the Deep South, the area is a medium metro.

Now our theory and our goal come into stark relief: are there OTHER counties outside of the battleground states that Trump could target that mirror this 47% centroid?

Batching a centroid allows us to map a dataset to the centroid and download the results. Below is a screenshot of how I mapped the battleground centroid to the non-battleground dataset I saved.
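Conceptually, a batch centroid just asks, for every row in the new dataset, "which centroid is this row closest to?" Here's a rough Python sketch of that idea. The centroid names, field names and values are all hypothetical, and the distance is plain Euclidean on two unscaled fields, which is simpler than what a real clustering service does.

```python
import math

def assign_centroids(rows, centroids, fields):
    """Label each row dict with the name of its nearest centroid."""
    labeled = []
    for row in rows:
        point = [float(row[f]) for f in fields]
        name = min(centroids, key=lambda n: math.dist(point, centroids[n]))
        labeled.append({**row, "centroid": name})
    return labeled

# Hypothetical centroids learned from the battleground dataset (made-up values).
centroids = {"strong_trump": (0.65, 44.0), "narrow_trump": (0.47, 37.0)}

# Hypothetical non-battleground counties, as they might come out of a CSV reader.
non_battleground = [
    {"county": "X", "trump_share": "0.45", "median_age": "36.5"},
    {"county": "Y", "trump_share": "0.70", "median_age": "46.0"},
]

labeled = assign_centroids(non_battleground, centroids, ["trump_share", "median_age"])
for row in labeled:
    print(row["county"], "->", row["centroid"])
```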

The resulting XLS file I downloaded showed 182 counties in non-battleground states which are similar to the 47% centroid. Trump already won over 100 of those so the 75 left are my target.
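Narrowing the download to a target list is a two-condition filter: keep the rows that match the centroid you care about, then drop the ones your candidate already won. A small sketch, assuming each row has "centroid" and "winner" fields (hypothetical names) and using invented rows rather than my actual results:

```python
def remaining_targets(labeled_rows, centroid_name, candidate):
    """Rows mapped to centroid_name that candidate did not win."""
    return [
        r for r in labeled_rows
        if r["centroid"] == centroid_name and r["winner"] != candidate
    ]

# Made-up rows standing in for the downloaded batch centroid file.
labeled_rows = [
    {"county": "P", "centroid": "narrow_trump", "winner": "Trump"},
    {"county": "Q", "centroid": "narrow_trump", "winner": "Clinton"},
    {"county": "R", "centroid": "strong_trump", "winner": "Trump"},
    {"county": "S", "centroid": "narrow_trump", "winner": "Clinton"},
]

targets = remaining_targets(labeled_rows, "narrow_trump", "Trump")
print([r["county"] for r in targets])
```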

Hillary eked out a win in Minnesota and Trump gave her a run for her money, just like he did in the unexpected battleground states of WI and MI. Here’s a sample of 5 counties in that state that Trump almost won:

Bottom line: using Machine Learning to find insights in clusters of data points can give you serious momentum in targeting other opportunities for wins. In politics and next… in marketing and sales!

— about the author:

Justin Hart is a senior executive consultant.
His primary objective: plumb the deep depths of cutting edge technologies and translate those into c-suite strategies to improve marketing and sales teams.
shorter version: mktg + bizdev + ai
Justin is a recognized industry speaker on modern marketing trends. He is currently working with several companies applying advanced tech tools like machine learning and artificial intelligence to business funnel basics.
You can find his work online at justinhart.biz.
Email Justin at justinhart.biz at gmail.
On twitter @justin_hart.
On Medium Justin Hart

Justin has over 20 years experience as a senior executive of established and start-up companies and even political campaigns (as senior digital director to the Mitt Romney campaign). He currently resides in Southern California.

Written by Justin Hart