Extracting nouns with Python

This is some real Python 101, so bear with me.

I've never worked with Python, except for one Saturday afternoon when I left a conference early and worked through the first two parts of the Django tutorial in a fog of considerable confusion. But for someone with aspirations of working with data, Python is clearly the future (or so I'm told by people who seem to know what they're about), and a project at work came along that presented a chance to explore Python's powerful Natural Language Toolkit (NLTK) module - or at least to scratch its surface.

"Natural language" here means language encountered in the wild, language as humans use it day-to-day - language that from a computer's point of view is unruly, messy, and illogical. Python's NLTK module gives programmers the tools to parse and interpret natural language as data, to draw insights from raw text.

My assignment was this: given a large chunk of raw text, extract only the nouns, both common and proper.

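import nltk

# Needs NLTK's tokenizer and tagger data, which nltk.download() can fetch.
essays = u"""text here"""
tokens = nltk.word_tokenize(essays)  # split the raw text into word tokens
tagged = nltk.pos_tag(tokens)        # list of (word, tag) pairs
# Keep common and proper nouns, singular and plural.
nouns = [word for word, pos in tagged
         if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
downcased = [x.lower() for x in nouns]
joined = " ".join(downcased).encode('utf-8')

# joined is UTF-8 encoded bytes, so open the file in binary mode.
with open("output.txt", "wb") as output:
    output.write(joined)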

Import the Natural Language Toolkit module. Tokenize the raw text and then tag each token with its part of speech (noun, verb, adjective, and so on). Keep each token tagged as a common or proper noun, singular or plural (the tags NN, NNS, NNP, and NNPS). Make everything lowercase, and then join the noun tokens into a string, separated by spaces. Finally, write the result into a new file called output.txt.
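To look at the filtering step on its own, here is a minimal sketch that skips the tokenizer and tagger and starts from a hand-written list of (word, tag) pairs. The sample sentence, the NOUN_TAGS name, and the extract_nouns helper are all made up for illustration, and the tags are typed by hand rather than produced by NLTK.

```python
# Hand-written (word, tag) pairs standing in for nltk.pos_tag() output.
# The sentence and its tags are illustrative, not real tagger output.
tagged = [
    (u"Alice", "NNP"),
    (u"bought", "VBD"),
    (u"three", "CD"),
    (u"Books", "NNS"),
    (u"in", "IN"),
    (u"Portland", "NNP"),
]

# Penn Treebank noun tags: common and proper, singular and plural.
NOUN_TAGS = ("NN", "NNS", "NNP", "NNPS")


def extract_nouns(pairs):
    """Return the lowercased words whose tag marks them as nouns."""
    return [word.lower() for word, pos in pairs if pos in NOUN_TAGS]


nouns = extract_nouns(tagged)
joined = u" ".join(nouns)
print(joined)  # alice books portland
```

Checking membership in a tuple of tags does the same job as the chain of `or` comparisons, and it leaves one obvious place to add or remove a tag later.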

Simple as this is, it took a long time to get working because the terminal kept throwing errors related to text encoding. I'm still unclear on the details, but in short: in the version of Python that comes packaged on Macs, 2.7.x, a plain string literal is a string of bytes, and Unicode text has to be handled explicitly, which is why there's a seemingly random "u" in front of the """text here""" string. In Python 3, the version under active development, strings are Unicode by default. (Every explanation of the difference that I've read online runs to the thousands of words, so this paragraph is likely simplified to the point of uselessness.)
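A tiny sketch of the text-versus-bytes distinction, as far as I understand it. The sample word is arbitrary and is written with an escape so the file itself stays plain ASCII. The snippet should behave the same way under 2.7 and 3.

```python
# A Unicode string: four characters, the last one an accented "e".
text = u"caf\u00e9"

# Encoding turns the text into bytes; in UTF-8 the accented "e" takes two.
encoded = text.encode("utf-8")

# Decoding with the same encoding gets the original text back.
decoded = encoded.decode("utf-8")

print(len(text))        # 4
print(len(encoded))     # 5
print(decoded == text)  # True
```

The errors tend to show up when the two kinds of string get mixed, or when bytes are decoded with the wrong encoding, so one approach is to decode on the way in, work with Unicode text throughout, and encode only when writing out.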

I'm excited to explore more of NLTK's capabilities. Next up might be a vocabulary frequency distribution.