How to Write an Email Miner in Python
- 1). Open a terminal session and type python -V (capital V) at the prompt to check which version of Python you have. The script in step 2 uses Python 3 syntax (print() as a function, input(), and decoding of bytes), and current releases of NLTK and PyYAML support Python 3. Visit the Python Package Index (PyPI); find and download the PyYAML and NLTK packages. Unzip or untar them. Change your directory to the PyYAML directory. At the command line prompt, type: sudo python setup.py install. It should look like this:
My-Computer:PyYAML-3.2.0 Me$ sudo python setup.py install
You will be prompted for a password. Type it and press Return. Follow this procedure for every Python package you install.
- 2). Download mail messages for parsing with the following lines of code:
#!/usr/local/bin/python
import poplib, getpass, sys, mailconfig
mailserver = mailconfig.popservername
mailuser = mailconfig.popusername
mailpasswd = getpass.getpass('Password for %s?' % mailserver)
server = poplib.POP3(mailserver)
server.user(mailuser)
server.pass_(mailpasswd)
print(server.getwelcome())
msgCount, msgBytes = server.stat()
print('There are', msgCount, 'mail messages in', msgBytes, 'bytes')
print(server.list())
print('-' * 80)
input('[Press Enter key]')
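for i in range(msgCount):
    hdr, message, octets = server.retr(i+1)    # message is a list of byte strings
    for line in message: print(line.decode())  # print the message line by line
    print('-' * 80)
    if i < msgCount - 1:
        input('[Press Enter key]')             # pause before the next message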
This script connects to your POP3 email server using the server name and user name stored in mailconfig (a small module you write yourself that defines popservername and popusername), prompts you for your password, reports how many messages are on the server, and prints each one.
- 3). Mine your email messages by converting each message to a string, a native Python data type that can be searched with Python's string methods, its regular expression engine, and the Natural Language Toolkit:
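hdr, message, octets = server.retr(1)                      # fetch the first message (POP3 numbers messages from 1)
server.quit()                                              # close the connection once the message is in memory
s = b'\n'.join(message).decode('utf-8', errors='replace')  # join the byte lines into one string
from email.parser import Parser
import nltk
import re
- 4). Mine the first message for any information of interest. Discover how many words are in that message by entering the following command: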
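>>> len(s.split())
It will return an integer value for the number of whitespace-separated words (len(s) alone would count characters). To find every occurrence of the word mortgage in context, build an NLTK Text object from the words of the message and call its concordance method:
>>> nltk.Text(re.findall(r'\w+', s)).concordance('mortgage')
This prints every occurrence of the word mortgage together with the words around it; very useful for detectives investigating mortgage fraud.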
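Steps 3 and 4 can also be tried without a mail server. The following is a minimal sketch, assuming a made-up sample message in the same shape that poplib's retr() returns (a list of byte strings). The names sample_lines, message_text, body_of and sentences_with are hypothetical helpers for illustration, and the sentence search uses a plain regular expression from the standard library rather than NLTK's concordance.

```python
import re
from email.parser import Parser

# Hypothetical sample message, shaped like the list of byte strings
# that poplib's retr() returns for one message.
sample_lines = [
    b"From: alice@example.com",
    b"To: bob@example.com",
    b"Subject: Loan paperwork",
    b"",
    b"Hi Bob. The mortgage documents arrived today.",
    b"Please sign them. Call me about the second mortgage rate!",
]

def message_text(lines):
    """Join raw message lines and decode them into one string."""
    return b"\n".join(lines).decode("utf-8", errors="replace")

def body_of(raw_text):
    """Parse the headers and return (subject, body) for a simple,
    non-multipart message."""
    msg = Parser().parsestr(raw_text)
    return msg["Subject"], msg.get_payload()

def sentences_with(text, word):
    """Return the sentences containing `word` as a whole word,
    ignoring case. Sentences are split naively on ., ! and ?."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

subject, body = body_of(message_text(sample_lines))
word_count = len(body.split())           # whitespace-separated words in the body
hits = sentences_with(body, "mortgage")  # sentences that mention the keyword

print("Subject:", subject)
print("Words in body:", word_count)
for sentence in hits:
    print(sentence)
```

Parsing the message first keeps the header lines out of the word count and the search. The naive sentence splitter will stumble on abbreviations such as "Mr.", so treat it as a starting point rather than a replacement for a real tokenizer.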