I am currently a fellow of the Fall 2016 Insight Data Science program. Insight is an intensive 7-week post-doctoral training program that helps bridge the gap between academia and industry data science. In order to highlight our transferable skills, we spent the first three weeks working on a business-oriented project. For my project, I teamed up with RedCarpet to identify fraud and credit risk in chat messages for e-commerce. I discovered that the syntax of an individual's chat messages is distinct enough that it can be used to identify a user who is creating multiple fake accounts. This project required me to implement a solution using TensorFlow, deep learning, and convolutional neural nets. The rest of this blog post outlines how I accomplished this.
The Problem With Introducing Credit In Developing Countries
Most Americans can agree that there are many advantages to having good credit. Credit allows people to invest in their future, whether that means buying a house, starting a business, or paying for college. For those who live paycheck-to-paycheck, credit is essential when unexpected expenses arise. Finally, paying with credit is a secure and convenient way to pay for online purchases.
Unfortunately, most of the world’s population has no access to credit. The problem is that without a consumer credit market in place, the data conventionally used to estimate credit scores, such as a FICO score, does not exist. Therefore, in order to expand credit to the developing world, credit companies must use alternative data sources to estimate a new user’s credit limit. Taking India as a test case, business-to-consumer e-commerce is expected to rise to over $100 billion by the year 2020. Cash-on-delivery is the dominant payment method, with 83% of Indian e-consumers saying that they have used CoD within the past 6 months. When it comes to credit, then, the average Indian consumer is greatly under-served.
RedCarpet: Bringing Credit to India
RedCarpet is a mobile-based credit company operating in India that issues microloans to consumers for e-commerce. RedCarpet has now partnered with Insight three times to improve its predictive algorithms. David Pappano (Insight Data Science Fellow, Spring 2016) used social connectivity data to predict micro-credit limits. Eliza Guseva (Insight Data Science Fellow, Summer 2016) used a random forest to estimate the likelihood that new users would default based on features in their chat messages. My project builds off of Eliza’s, but with the specific goal of implementing a deep learning solution. This project was heavily inspired by recent developments in applying deep learning to natural language processing (NLP) problems. Specifically, the question was whether the syntax of a person’s text messages could help RedCarpet predict either the likelihood that an individual would default or whether a specific transaction was fraudulent. Given RedCarpet’s existing code base, it was very important to them that the project use deep learning and be implemented in TensorFlow.
Training a Neural Net on Syntax
So, what does it mean to classify users based on their syntax? Let's say that I just sent RedCarpet a copy of my checkbook to verify my identity, and now I want to purchase a pair of sneakers on Amazon. They then ask me which address to send the sneakers to, and I answer:
>> the address is on the checkbook
Now, if it were my mom buying the sneakers, she might write something like:
>> Please send the item to the address on the checkbook. Thank you!
Now, a user in India might write something like:
>> chek book pe hai address current location ka .
The average English-speaking person can read these sentences and immediately identify differences in their syntax. RedCarpet records its chat history with each user, and from this history, we can train neural nets to learn the syntactic fingerprint of each user.
Which deep learning architecture should I apply to this problem? There are currently three standard neural net architectures used in NLP: recursive, recurrent, and convolutional neural nets (CNNs). In a 2014 article, Yoon Kim demonstrates that CNNs are just as effective as, or better than, other deep learning methods at sentence classification. Denny Britz has a nice blog post introducing convolutional neural nets for NLP, including a TensorFlow implementation of Kim's CNN for NLP. For my particular problem, there are several advantages to using a convolutional neural net. First, the convolution in CNN implies that the neural net is training on word relationships, i.e. syntax, by design. Second, neural nets automatically select which features to train on, so that as the RedCarpet user base grows, the algorithm remains relevant. And finally, by adopting this method, I don’t have to worry about users with unconventional, non-English usage patterns. Using Denny Britz's code base as a springboard, I then applied a CNN to my particular application.
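To give a sense of the architecture, here is a minimal sketch of a Kim-style text CNN, written with TensorFlow's Keras API rather than the lower-level TensorFlow code I actually adapted from Denny Britz's repository. The vocabulary size, sequence length, filter sizes, and number of filters below are illustrative placeholders, not the values used in the project.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative hyperparameters (not the project's actual settings).
vocab_size = 20000        # distinct tokens in the chat corpus
seq_len = 100             # messages padded/truncated to this many tokens
embed_dim = 128           # dimensionality of the learned word embeddings
filter_sizes = [3, 4, 5]  # convolve over 3-, 4-, and 5-word windows
num_filters = 100         # feature maps per filter size

inputs = layers.Input(shape=(seq_len,), dtype="int32")
x = layers.Embedding(vocab_size, embed_dim)(inputs)

# One convolution + max-pool branch per window size, then concatenate.
pooled = []
for k in filter_sizes:
    conv = layers.Conv1D(num_filters, k, activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(conv))
features = layers.concatenate(pooled)
features = layers.Dropout(0.5)(features)

# Two classes, e.g. "likely fraudster" vs. "ordinary user".
outputs = layers.Dense(2, activation="softmax")(features)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The key design choice is that each filter slides over short windows of adjacent words, so the features the network learns are, by construction, about word order and co-occurrence, which is exactly the syntax signal we want.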
Using a CNN to Estimate Credit Risk
I first tried testing my code by dividing the existing user base into users who paid on time and users who defaulted. Below is a plot of the late fraction as a function of the number of times borrowed.
There's an obvious over-density of points for users who have never been late on their payments. Great! You can also see that the late fraction can only take as many distinct values as the number of times borrowed. However, no matter where I drew the cutoff separating the late payers from the on-time payers, I couldn't get my neural net to learn the classification. RedCarpet is a relatively young startup, and with about 1500 users who have borrowed at least once, its user base is still small. While 1500 data points might be sufficient to train neural nets for some applications, identifying credit risk from language is a subtle problem that requires more data. I'm sure many of us know sets of siblings who, having grown up in similar circumstances, should have very similar language patterns but have widely varying spending habits. The more subtle the problem, the more data a neural net needs to train on.
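For concreteness, here is a minimal sketch of the labeling step described above, assuming a table of loan records; the column names (`user_id`, `was_late`), the toy rows, and the 0.2 cutoff are hypothetical, not RedCarpet's actual schema or threshold.

```python
import pandas as pd

# Hypothetical loan records: one row per loan, with a flag for late repayment.
loans = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "was_late": [0, 0, 1, 0, 0, 1],
})

# Late fraction = number of late repayments / number of times borrowed.
per_user = loans.groupby("user_id")["was_late"].agg(["sum", "count"])
per_user["late_fraction"] = per_user["sum"] / per_user["count"]

# Label users as risky above some cutoff; I tried several cutoffs here.
cutoff = 0.2
per_user["risky"] = (per_user["late_fraction"] > cutoff).astype(int)
```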
Using a CNN to Identify Fraud
A rule of thumb in machine learning is that if a human expert can easily classify a small set of data, then, given enough data, there's a good chance you can train a machine to make the same classification on larger data sets. I am unaware of any human expert who has rigorously demonstrated the ability to identify credit risk from an individual's speech patterns. However, I'm sure most of us could distinguish our sibling's chat history from our parents' chat history. Once I discovered that my CNN was much better at training on individual syntax patterns, I realized that this tool is more readily applied to identifying fraud.
Yay. Let's go fight fraud! Wait a minute: RedCarpet hasn't labelled any of its transactions as fraudulent. As I discovered while doing this project, because companies want to prevent fraud before it occurs, simulating fraud is actually common practice in industry.
Let's try to think like a potential fraudster. The RedCarpet credit limit is very small, typically less than 50 USD, so if a fraudster is going to make a profit, they’ll probably be using several false identities. If a fraudster opens several fraudulent accounts, then, unless they're an NLP expert themselves, each chat message will carry the syntactic fingerprint of the individual who typed it. If I can match the fingerprints, then I can find the fraudster.
I then tried to figure out how to simulate this using the data I had. Below is a histogram of the number of words in each user's chat message history.
As you can see, this is a log-normal distribution. If I take the users with the longest chat histories, I can chop each one into pieces whose lengths are close to the median length of the bulk of the user chat histories. This simulates a fraudster creating multiple false identities. I then train my neural network to distinguish the simulated fraudulent users from the conventional ones, reserving a smaller set of samples on which to evaluate the performance of my method. Once the network is trained, I can see whether or not my algorithm identifies a new user as a likely fraudster.
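Here is a minimal sketch of that simulation step, assuming chat histories are available as lists of words keyed by user ID. The tiny `histories` dictionary, the "long history" rule of four times the median, and the chunking helper are purely illustrative stand-ins for the real data and thresholds.

```python
import numpy as np

# Purely illustrative chat histories: {user_id: list of words}.
# In the real project these come from RedCarpet's chat logs.
histories = {
    "user_a": "the address is on the checkbook".split(),
    "user_b": "please send the item to the address on the checkbook thank you".split(),
    "user_c": "chek book pe hai address current location ka".split(),
    "user_d": "ok bhai please send item fast to my address on checkbook".split() * 40,
}

lengths = np.array([len(words) for words in histories.values()])
median_len = int(np.median(lengths))

def split_into_fake_accounts(words, target_len):
    """Chop one long chat history into chunks of roughly target_len words,
    each chunk standing in for one fake account opened by the same person."""
    n_chunks = max(2, len(words) // target_len)
    return [list(chunk) for chunk in np.array_split(words, n_chunks)]

samples, labels = [], []
for user_id, words in histories.items():
    if len(words) > 4 * median_len:           # treat as a "long history" user
        for chunk in split_into_fake_accounts(words, median_len):
            samples.append(chunk)
            labels.append(1)                  # simulated fraudster account
    else:
        samples.append(words)
        labels.append(0)                      # ordinary user
```

The chunks from a single long history all carry one person's syntactic fingerprint, which is what makes them a reasonable stand-in for multiple fake accounts opened by the same fraudster.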
In the figure below, you can see that my neural net performs very well if the sample is balanced between the two categories: the area under the blue curve is greater than 0.95 for both the ROC curve and the precision-recall curve. However, in a realistic scenario where the fraud rate is low, there's an obvious trade-off between precision and recall. In other words, if the fraud rate is low, RedCarpet needs to decide whether it wants only the hits where we're highly confident we found a match, or all of the hits where there might be a match, in which case a human would then review each flagged example individually to determine whether it's really fraud or a false positive.
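For reference, here is a minimal sketch of how these two curves can be computed with scikit-learn; the project's actual evaluation code may differ, and `y_true` and `y_score` below are tiny made-up arrays standing in for the held-out labels and the CNN's predicted fraud probabilities.

```python
from sklearn.metrics import roc_curve, precision_recall_curve, auc

# Made-up held-out labels (1 = simulated fraudster) and CNN fraud scores.
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]

fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)

print("ROC AUC: %.2f, PR AUC: %.2f" % (roc_auc, pr_auc))
```

Sweeping the decision threshold over `y_score` is exactly the trade-off described above: a high threshold gives high precision but misses some fraudsters, while a low threshold catches more of them at the cost of more false positives for a human to review.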
Conclusion
At this point, I hope that I’ve convinced you that CNNs are effective tools for classifying text based on syntax. My method is particularly well suited to detecting individual fraudsters with a known syntactic fingerprint. And given the high precision of my results, I would prioritize this project for production.