A TensorFlow Tutorial: Email Classification.
This code/post was written in conjunction with Michael Capizzi. Sections of the original code on which this is based were written with Joe Meyer.

Update: November 2, 2017 - New script for raw text feature extraction: read_corpus.py.
Update: March 8, 2017 - Now supports TensorFlow 1.0.

Quick Start.
You can get the code and data discussed in this post (as well as presentation slides from the Tucson Data Science Meetup) by cloning the following repo:

Dependencies.
Once you have the code and data, you can run a training session and get some output with the following:

Introduction.
This tutorial is meant for those who want to get to know the Flow of TensorFlow. Ideally, you already know some of the Tensor of TensorFlow. That is, in this tutorial we aren’t going to go deep into any of the linear algebra, calculus, and statistics used in machine learning. Don’t worry, though: if you don’t have that background, you should still be able to follow along. If you’re interested in learning more about the math, there are plenty of good places to get an introduction to the algorithms used in machine learning. This tutorial from Stanford University about artificial neural nets is especially good. We’re going to be using a simple logistic regression classifier here, but many of the concepts are the same.

Email Classification.
To ground this tutorial in a real-world application, we decided to use a common beginner problem from Natural Language Processing (NLP): email classification. The idea is simple - given an email you’ve never seen before, determine whether that email is Spam or not (aka Ham). For us humans, this is a pretty easy thing to do. If you open an email and see the words “Nigerian prince” or “weight-loss magic”, you don’t need to read the rest of the email because you already know it’s Spam. While this task is easy for humans, it’s much harder to write a program that can correctly classify an email as Spam or Ham.
You could collect a list of words you think are highly correlated with Spam emails, give that list to the computer, and tell the computer to check every email for those words. If the computer finds a word from the list in an email, that email gets classified as Spam. If it finds none of those words, the email gets classified as Ham. Sadly, this simple approach doesn’t work well in practice. There are lots of Spam words you will miss, and some of the Spam words in your list will also occur in regular, Ham emails. Not only will this approach work poorly, it will also take you a long time to compose a good list of Spam words by hand.

So, why don’t we do something a little smarter by using machine learning? Instead of telling the program which words we think are important, let’s let the program learn which words are actually important. To tackle this problem, we start with a collection of sample emails (i.e. a text corpus). In this corpus, each email has already been labeled as Spam or Ham. Because we make use of these labels in the training phase, this is a supervised learning task: we are (in a sense) supervising the program as it learns what Spam emails look like and what Ham emails look like.

During the training phase, we present these emails and their labels to the program. For each email, the program says whether it thinks the email is Spam or Ham. After the program makes a prediction, we tell it what the label of the email actually was. The program then changes its configuration so as to make a better prediction the next time around. This process is repeated until either the program can’t do any better or we get impatient and tell it to stop.

On to the Script.
Our script begins by importing a few needed dependencies (Python packages and modules). If you want to see where these packages get used, just do a CTRL+F search for them in the script. If you want to learn what the packages are, just do a Google search for them. Next, we have some code for importing the data for our Spam and Ham emails. For the sake of this tutorial, we have pre-processed the emails into an easy-to-work-with format.
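Before going further, it may help to see the training loop described earlier (predict, compare with the true label, adjust, repeat) in a few lines of code. What follows is only a minimal sketch in plain Python, not the TensorFlow script from this post. The toy emails, the function names (`featurize`, `train`, `classify`), the word-presence features, and the learning rate are all assumptions made up for illustration.

```python
import math

# Hypothetical toy corpus (not the post's data): 1 = Spam, 0 = Ham.
TRAIN = [
    ("nigerian prince needs your help", 1),
    ("weight-loss magic pills", 1),
    ("win money now", 1),
    ("meeting agenda for monday", 0),
    ("lunch on friday", 0),
    ("project report attached", 0),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


# One feature per distinct word seen in the training emails.
VOCAB = sorted({word for text, _ in TRAIN for word in tokenize(text)})
INDEX = {word: i for i, word in enumerate(VOCAB)}


def featurize(text):
    """Word-presence vector: 1.0 if a vocabulary word occurs, else 0.0."""
    vec = [0.0] * len(VOCAB)
    for word in tokenize(text):
        if word in INDEX:  # words outside the vocabulary are ignored
            vec[INDEX[word]] = 1.0
    return vec


def predict_proba(weights, bias, vec):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, vec))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, learning_rate=0.5):
    """Repeat: predict, compare with the label, nudge the weights."""
    weights = [0.0] * len(VOCAB)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            vec = featurize(text)
            p = predict_proba(weights, bias, vec)  # the program's guess
            error = p - label                      # how wrong the guess was
            for i, x in enumerate(vec):            # adjust the "configuration"
                weights[i] -= learning_rate * error * x
            bias -= learning_rate * error
    return weights, bias


def classify(weights, bias, text):
    """Label an email Spam when the predicted probability is at least 0.5."""
    p = predict_proba(weights, bias, featurize(text))
    return "Spam" if p >= 0.5 else "Ham"


weights, bias = train(TRAIN)
```

After training, `classify(weights, bias, "win money now")` can be used to label an email. The point is that nobody hand-picked the Spam words: the weights attached to each word were learned from the labeled examples.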