Cathy O’Neil found refuge in mathematics, despite being told by teachers (many of whom were women) that she didn’t need to do math because she was a girl.
Events like that drove O’Neil to defy that gender prejudice head-on. As a result, she is one of the most prominent mathematicians in the U.S. and the author of “Weapons of Math Destruction,” which uncovers deep data biases and shows how algorithms are only as equitable as the people who write them.
Having studied at Berkeley, Harvard, and MIT, taught math at Barnard, and then worked for a hedge fund just as the financial crisis began in 2007, O’Neil has a singular perspective on how we use data to either include or exclude Americans, often based on racial and gender biases.
Join Cathy O’Neil in Masters of Data Episode #46, “A Data Science Super Hero Fighting Creepy Algorithms,” where she challenges what we think data is doing and explains what it actually determines.
What’s Creepy About Algorithms?
Most consumers assume an invasive algorithm lies behind the push message that seems to know what they’ve purchased, how likely they are to purchase again, and where they are geographically. It’s that feeling of 24-hour surveillance that makes marketing messages ultra-creepy.
O’Neil has a different, and more disturbing, definition of creepy data. Specializing in predictive algorithms, she has studied how formulas look at historical data to guess what will happen in the future. The problem: what if the historical data is biased or inaccurate?
“Algorithms are opinions embedded in code,” O’Neil says, and predictive algorithms are never neutral. They embody the ideologies of those who write them.
You end up with racially biased facial recognition. You get a mindless random number generator used to determine (or mis-determine) the effectiveness of teachers in Washington, D.C. You get an endless stream of white male commentators on Fox News, and you get a pre-trial incarceration system in California that works against minorities. (Check out “Weapons of Math Destruction” for an in-depth look at these examples.)
Worst of all, most consumers (and voters) have no idea how they’re being scored on this data, or how that score will determine whether they’re “lucky” enough to get a loan, insurance, or a job.
Where Algorithm Writers Go Wrong
First, you have to decide what your definition of “success” is; that determines the data you’ll train your algorithm on. Then you check the algorithm against your historical data: did it give you the outcome you wanted? In other words, was the algorithm successful? The answer determines which people get tagged as likely or unlikely to match whatever you’re targeting.
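To make that workflow concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption, not O’Neil’s code: the “hired in the past” definition of success, the two features, and the scikit-learn model. It demonstrates the trap she describes: whatever historical pattern you label “success,” the model learns to reproduce.

```python
# A minimal sketch of the predictive-modeling workflow described above.
# All data and the "success" definition here are invented for illustration.
from sklearn.linear_model import LogisticRegression

# Step 1: define "success". Here we (questionably) define it as
# "was hired in the past" -- so any bias in past hiring becomes the target.
# Features per person: [years_experience, attended_elite_school]
X_history = [[5, 1], [6, 1], [4, 1], [7, 0], [8, 0], [3, 0]]
y_history = [1, 1, 1, 0, 0, 0]  # historical hiring decisions = "success"

# Step 2: train the algorithm on the historical data.
model = LogisticRegression().fit(X_history, y_history)

# Step 3: judge the model against history. It scores well -- but only
# because it faithfully reproduces the old pattern: elite school => hired.
print("accuracy on history:", model.score(X_history, y_history))

# Step 4: tag new people as likely/unlikely matches. A strong candidate
# without the elite credential gets tagged "unlikely" (predicted 0).
print("prediction for [9 yrs, no elite school]:", model.predict([[9, 0]]))
```

Nothing in the code is malicious; the bias rides in entirely on the historical labels.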
Once you define success and collect the appropriate data, you have two kinds of data modeling to choose from:
- Modeling you want to be right. For example, you want to profit from predicting market performance.
- Modeling you want to be wrong. Here, the incentive is to underestimate risk, especially when the focus is on short-term gains, as in the subprime mortgage debacle of 2007-2008, where there was no incentive to get the data right.
O’Neil was in the thick of it when the credit crisis started and soon grew disillusioned. One of her biggest takeaways was that “mathematical authority was part of this story.” Facts, accuracy, and data were at odds with what the market wanted: positive growth at any cost, even if that meant ignoring what the data said.
How the Creep Creeps into the Algorithm
In the episode, O’Neil acknowledges that many people start in a good place and try to build an algorithm that fixes a flaw or addresses an injustice: “I think what happens, in general, is people want to improve a system that they know is unfair because they have evidence that it’s unfair, and then their blind spot is that the data from the unfair system is itself incredibly bad.”
After counting clicks in the travel industry, O’Neil realized that though the modeling wasn’t as consequential as what fueled the subprime mortgage crisis, the travel models were still deeply flawed, leaving her to pick winners and losers based on biased demographic data.
But more troubling to her, data scientists (a title O’Neil disputes, since the field imposes no mandatory scientific process) use the same kind of data she was looking at to determine who gets a loan or how long people will be imprisoned. The method was rife with false positives and negatives, just like what happened in the credit crisis, where the mantra was: “If you guess wrong 40% of the time in finance, you still make money.”
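The arithmetic behind that mantra is worth spelling out. In this back-of-the-envelope sketch, the 40% error rate comes from the quote, but the payoff numbers are invented for illustration; the asymmetry between who profits and who absorbs the errors is the point:

```python
# Illustration of "wrong 40% of the time in finance, still make money".
# The payoff numbers are assumptions chosen to show the asymmetry.
win_rate = 0.60        # right 60% of the time
gain_per_win = 1.0     # profit when the guess is right
loss_per_miss = 0.5    # cost to the modeler when the guess is wrong

expected_value = win_rate * gain_per_win - (1 - win_rate) * loss_per_miss
print(f"expected value per bet: {expected_value:+.2f}")   # +0.40 -> profitable

# The people mis-scored by the model bear the other side of those errors,
# which is why the same error rate is intolerable in lending or sentencing.
print(f"share of decisions that were wrong: {1 - win_rate:.0%}")  # 40%
```

A model can be a money-maker and still be wrong about four people in ten; those four just aren’t the ones being paid.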
Predictive Data Doesn’t Have to Be Creepy, but It Needs to Be Representative
Not every company collecting data intends to wield it in a biased way. Take, for example, the concept of personalization. Consumers generally appreciate it when a brand knows their name and what they prefer to shop for, but to get beyond “first name, last name,” a brand needs vast amounts of data to reach the customer at the right time on the right channel.
There’s a big reason to get this right: Companies that can personalize the shopping experience for customers outperform the competition by 85% in sales growth and 25% in gross margin.
Furthermore, a RedPoint Global survey found that 63% of consumers expect a brand to know them and their shopping preferences, though there’s room for improvement: 34% are very frustrated when they’re sent an offer for a product they’ve already bought, and 31% of consumers complain that brands still fail to identify them as existing customers.
The same survey, though, found 54% of consumers are willing to share data with companies in exchange for a better shopping experience. As long as a brand is transparent about what data it’s collecting and what it’s doing with that data, many customers accept the trade.
Summary
Data is troubling when it’s being used to deprive Americans of their constitutional or basic human rights. O’Neil would prefer that data scientists and analysts take something like a Hippocratic Oath to do no harm. She has yet to see anyone really do this in a way that’s hard-core enough for her.
How biased data might impact black voters and other minorities this November is anyone’s guess, but data will definitely play an ever-increasing role in elections.