What can be done nowadays with databases is remarkable. Well, remarkable probably isn’t the right word here. It doesn’t have the right flavour. One might want to mix a little bit of “sinister” in there. But I’ll stick with remarkable because, as usual, I get pissed when people get worked up over something without knowing how it works. So let’s make sure we are worked up over this for the right reasons.
I was pointed to this story by a good friend of mine. It’s about how Target, a major retail store in the United States, collects information on its clients and then uses this information to predict future purchasing behavior. Of course, it is not stated quite like that. The title is much more provocative: “How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did”. The article shows signs that the author didn’t understand how it was done, and he exaggerates a few key elements.
But still, this is a rather worrying example of the unexpected power of the techniques used in personalized advertising. Before we go on, let me remind you that personalized advertising is everywhere. Every time you buy a book on Amazon, buy a new TV season pass on iTunes, or “like” something on Facebook, the information you provide is used to give you recommendations. You have a “You might also like:” box that appears on your screen right after the purchase. This is not new at all. Amazon has been doing it for years. How can you accomplish something like that, and how can it reach the point where it becomes as intrusive as in the Target case?
I thought about how you might accomplish this, and given the details of the method provided by Andrew Pole, a statistician working for Target, I am reasonably confident that I know how. The techniques used are the same as the ones we use in particle physics. Welcome to the world of multi-variate analysis.
On one hand, you have a collection of clients on whom you have information such as their purchase history and basic demographics. On the other hand, you have a collection of particle collisions on which you have all the information your particle detector has managed to gather. In both cases, you have data points (clients, collisions), each of which has a specific value for different variables. For example, for each particle collision you may have the total amount of energy collected by the detector. For each client you may have the total number of packs of chewing gum they bought in the last two weeks.
Let’s say you work for Amazon, and your job is to figure out who you should advertise the Kindle to. Let’s say the only information you have on your clients is their purchase history, and whatever information they provide you for delivery. Is it possible that this will be enough to figure out who’s likely to buy a Kindle? Remember that this scenario is purely hypothetical, and that the plots I am going to show are completely made up.
First, you want to organize the information you have on your clients into variables. Such variables might be the total number of books they have bought, the number of books they bought in the last year, the ratio of romance books over science fiction books, the frequency at which they buy books, the frequency at which they change addresses. You can display these variables in histograms.
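To make that first step concrete, here is a minimal sketch of turning one client's raw history into variables. Everything in it is made up for illustration: the `Purchase` record, the `client_variables` function and the sample history are hypothetical names and data, not anything Amazon or Target actually uses. One liberty taken: instead of the raw romance-over-sci-fi ratio, the sketch uses the romance fraction among those two genres, which avoids dividing by zero for clients who never bought sci-fi.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Purchase:
    day: date
    genre: str  # e.g. "romance" or "sci-fi"


def client_variables(purchases, address_changes, today):
    """Summarize one client's raw history as a dict of variables."""
    total = len(purchases)
    last_year = sum(1 for p in purchases if (today - p.day).days <= 365)
    romance = sum(1 for p in purchases if p.genre == "romance")
    scifi = sum(1 for p in purchases if p.genre == "sci-fi")
    # Romance fraction among romance + sci-fi purchases, bounded in [0, 1].
    # None when the client bought neither genre.
    genre_total = romance + scifi
    romance_fraction = romance / genre_total if genre_total else None
    return {
        "total_books": total,
        "books_last_year": last_year,
        "romance_fraction": romance_fraction,
        "address_changes": address_changes,
    }


# Made-up history for a single client.
history = [
    Purchase(date(2010, 3, 1), "sci-fi"),
    Purchase(date(2011, 11, 20), "romance"),
    Purchase(date(2012, 1, 15), "sci-fi"),
]
variables = client_variables(history, address_changes=2, today=date(2012, 2, 20))
print(variables)
```

Each client becomes one row of numbers, exactly like each collision in a detector. Histograms are then just counts of how many clients fall in each range of a given variable.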
Is it possible to use this information in any way to figure out who is likely to want a Kindle and who is not? Well, just from the bare histogram, you can’t. You need to split the data in the histogram in two. You need to know who will buy a Kindle and who will not, and then show the histograms for each scenario. But you can’t possibly know this in advance: this is what you want to figure out! The trick is to look in the past.
Among your past clients, some bought Kindles and some didn’t. You can look at the number of books purchased by people who just bought a Kindle. If you do that, you find the following distribution in the histogram.
It becomes immediately apparent that people who buy Kindles tend to buy more books than the average buyer. You can already use this single variable to assign a score that will tell you how likely someone is to buy a Kindle. But is it possible to do any better? Is there any other variable in your database that may indicate that a client is likely to buy a Kindle? Let’s look at the ratio of romance books over science fiction books.
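Here is one way such a single-variable score could be built, as a rough sketch. The client list is invented, the bin width is an arbitrary choice, and `single_variable_score` is a hypothetical name. The idea is simply to fill two histograms of the number of books bought (all past clients, and past Kindle buyers only) and take their ratio bin by bin.

```python
from collections import Counter

# Made-up past clients: (total books bought, bought a Kindle?)
past_clients = [
    (2, False), (3, False), (4, False), (5, False), (6, False),
    (8, False), (12, False), (15, True), (18, False), (22, True),
    (25, True), (31, True), (35, False), (40, True), (45, True),
]

BIN_WIDTH = 10  # books per histogram bin (arbitrary choice)


def bin_index(n_books):
    return n_books // BIN_WIDTH


# Two histograms over the same variable: all clients, and Kindle buyers only.
all_hist = Counter(bin_index(n) for n, _ in past_clients)
buyer_hist = Counter(bin_index(n) for n, bought in past_clients if bought)


def single_variable_score(n_books):
    """Fraction of past clients in the same bin who bought a Kindle."""
    b = bin_index(n_books)
    if all_hist[b] == 0:
        return None  # no past clients in this bin, so no estimate
    return buyer_hist[b] / all_hist[b]


for n in (5, 14, 27, 70):
    print(n, single_variable_score(n))
```

With so few made-up clients the score jumps around from bin to bin. A real database would have enough entries per bin for the trend to be smooth, and empty bins (like the one for 70 books here) would be rarer.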
There is some interesting information in that plot. For instance, lots of people buy lots of sci-fi and no romance, while no one buys only romance and no sci-fi. (Remember, this is made up!) However, there isn’t any discrimination between Kindle buyers and other people. Let’s look at the number of address changes in the last 10 years.
There we have some more discrimination. The thing is, we know that the number of books someone buys and the frequency with which she moves to a new home don’t really have anything to do with each other. What you observe is that people who buy lots of books and move around a lot are very likely to buy a Kindle. However, it doesn’t necessarily mean that a Kindle buyer will do both of these things. How can you use both variables to find potential Kindle buyers and still take this into account?
The answer is multi-variate analysis. You may already have picked up that you may have a larger number of potentially useful variables in your database (not only these two), some of them giving you strong discrimination, some of them giving you weaker discrimination, some of them being correlated, some of them not correlated… A multi-variate technique allows you to take all of these variables, and as long as you tell it what values these variables took for past Kindle buyers, it will make the best of them. What you end up with is a score that you can then calculate for each client. If the score is high, you predict that the client is likely to buy a Kindle, and if the score is low, you predict that she isn’t likely to buy a Kindle in the near future. That score will be like a new variable that gives you the maximum amount of discrimination you can achieve with the data you have on your customers.
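To give a feel for what such a technique does, here is a small sketch using a Fisher linear discriminant, one of the simplest multi-variate methods and one that is common in particle physics. This is my choice for the illustration; I have no idea which method Target or Amazon actually uses. The clients are randomly generated with made-up means and spreads, and the two variables are the number of books bought and the number of address changes.

```python
import random

random.seed(0)


def make_clients(n, mean_books, mean_moves):
    """Made-up clients: (books bought, address changes), drawn independently."""
    return [(random.gauss(mean_books, 8.0), random.gauss(mean_moves, 1.5))
            for _ in range(n)]


# "Training" samples taken from the past: known buyers and known non-buyers.
buyers = make_clients(500, mean_books=35.0, mean_moves=4.0)
others = make_clients(500, mean_books=15.0, mean_moves=2.0)


def mean_vector(sample):
    n = len(sample)
    return (sum(x for x, _ in sample) / n, sum(y for _, y in sample) / n)


def scatter(sample, mean):
    """Sums of products of deviations from the mean (entries of a 2x2 matrix)."""
    sxx = sum((x - mean[0]) ** 2 for x, _ in sample)
    sxy = sum((x - mean[0]) * (y - mean[1]) for x, y in sample)
    syy = sum((y - mean[1]) ** 2 for _, y in sample)
    return sxx, sxy, syy


m_buy, m_oth = mean_vector(buyers), mean_vector(others)
a1, b1, c1 = scatter(buyers, m_buy)
a2, b2, c2 = scatter(others, m_oth)
a, b, c = a1 + a2, b1 + b2, c1 + c2  # within-class scatter [[a, b], [b, c]]

# Fisher direction: w = S_w^{-1} (m_buy - m_oth), with the 2x2 inverse by hand.
det = a * c - b * b
dx, dy = m_buy[0] - m_oth[0], m_buy[1] - m_oth[1]
w = ((c * dx - b * dy) / det, (a * dy - b * dx) / det)


def score(client):
    """Project a client onto the Fisher direction; higher is more buyer-like."""
    return w[0] * client[0] + w[1] * client[1]


mean_score_buyers = sum(score(c) for c in buyers) / len(buyers)
mean_score_others = sum(score(c) for c in others) / len(others)
print(mean_score_buyers, mean_score_others)
```

The two variables are collapsed into a single number per client, weighted by how much each one separates the two groups relative to its spread. This particular score is not confined between 0 and 1; it would need rescaling to be read that way. More elaborate techniques (neural networks, decision trees) follow the same pattern: learn from labelled past data, then output one score.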
I have taken a fairly innocuous example here with the Kindle. The case of Target and the pregnancy seems much more intrusive. There is an ethical problem here. Some people don’t see the difference between using a data-mining technique like multi-variate analysis and spying on people, but there is a very important difference. Spying will give you accurate information. It will give you certainty. You will know stuff about specific individuals. Data-mining will only give you a guess. It will never get you anywhere near the certainty you can obtain with spying.
For example, if you estimate that 70% of the people with a “Kindle buying” score above 0.9 will buy a Kindle in the next month, you know that 30% won’t. But there is no way to know who in particular. It is a fundamental property of multi-variate techniques that they can only ever pick up trends. They can never make definitive statements on single data points. I think this is a very important distinction and it should be brought into the discussion on the ethics of personalized advertising.
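The 70% example can be played out with a toy simulation. The numbers are made up: I assume 10,000 clients above the score cut, each buying with probability 0.7. The point is that the rate for the group is well determined while the outcome for any given client is a coin toss that the score knows nothing about.

```python
import random

random.seed(1)

# Made-up group: every client here has a score above 0.9, and by assumption
# each one buys a Kindle with probability 0.7, independently of the others.
N_ABOVE_CUT = 10_000
PURCHASE_RATE = 0.70

will_buy = [random.random() < PURCHASE_RATE for _ in range(N_ABOVE_CUT)]

# The group-level statement is precise...
observed_rate = sum(will_buy) / N_ABOVE_CUT
expected_non_buyers = round(N_ABOVE_CUT * (1 - PURCHASE_RATE))
print(observed_rate, expected_non_buyers)

# ...but all of these clients carry the same "likely buyer" label,
# including the roughly 3,000 who will not buy anything.
actual_non_buyers = N_ABOVE_CUT - sum(will_buy)
print(actual_non_buyers)
```

Nothing in the score distinguishes the clients who end up in `actual_non_buyers` from the rest. The analysis only ever sees the group.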
However, I am not done with personalized advertising yet. Is it right for companies to use the information the clients provide in this way? More precisely, is it right for a company to assume you are pregnant and send you advertising that is targeted at pregnant women? My answer to this question is no, but not because it violates the privacy of the clients. It does not (as long as the company collects only information you willingly give it, but that is another discussion). That is simply not how data-mining works.
I would oppose personalized advertising simply because it is wrong of a company to assume anything about its clients. It is a form of prejudice. A very informed prejudice, but a prejudice nonetheless, which results in a form of discrimination. Also, I have my reservations about companies reaching conclusions on their clients with multi-variate analyses, since they will usually be oblivious to how solid these conclusions actually are. They are not scientists, and their goal is to maximize profits, not to find the truth. The potential for discrimination here is horrendous. A company has no right to determine your needs for you. There is an ethical limit beyond which these techniques should not be used, and we are getting close to it. Let’s just hope our politicians are up to the challenge (*sigh*).