Natural language processing

We have learned how to work with data organized as an array of numbers, using the numpy package, or more generally as a table of numeric and text columns, using pandas. We have also learned a bit about working with image files. The other major data format that you are likely to encounter in many computer programming tasks is natural, human-language text.

In computing, people often refer to the languages that human beings use for everyday communication, such as English or Pirahã, as ‘natural’ languages. This distinguishes them from programming languages like Python, and helps avoid confusion, for example when asking a computer person ‘What language do you use at work?’ Using computers to analyze natural language is often called ‘Natural Language Processing’ (NLP).

Computers still have huge difficulty responding to natural language appropriately. Here we will introduce a few tools and techniques that can help if we need to write a computer program that takes natural language text as its input. These techniques are useful in many common programming tasks:

  • translating between natural languages

  • classifying texts by topic

  • identifying particular tones or styles, for example rude or offensive posts in a forum

  • creating chatbots or generating automated responses to customer queries

  • … and even some genuinely useful things like establishing the authorship of anonymous works of literature

Real-world natural language processing tasks like these are usually quite complex, and require a ‘pipeline’ of multiple techniques. As usual, we will only cover the basics. After our short introduction, you should know at least where to start if you later come to work with natural language text data.

Objective

Here is our toy task for this lesson.

Let’s imagine we have landed a job as an editor at a pretentious elitist newspaper, and we would like to write a program that helps us identify all the sentences in a text that violate one of the woefully misguided rules of the newspaper’s style handbook. Let’s take three example rules:

  • A preposition (such as for, to, etc.) is not a good thing to end a sentence with.

  • To boldly place an intervening word between the to and the verb in an infinitive is forbidden.

  • And you shouldn’t begin a sentence with a conjunction (such as and, but, etc.)

This isn’t a completely unrealistic example. Talking right is something that many people still get very excited about.

So that we have a large example text to work with, let’s load the classic go-to text for all NLP examples ever, Moby Dick:

import os

md = open(os.path.join('data', 'melville-moby_dick.txt')).read()

print(md[:433])
[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

String methods again

Let’s dive right in and just take a naive stab at the first rule, about not having a preposition at the end of a sentence. Ignoring for a moment any initial processing, such as splitting up the full text into sentences, we can first write a function to determine whether or not a single sentence violates the rule.

Remember the main ingredients of a function:

  • The name of the function. This function’s job is just to say ‘yes’ or ‘no’. For such functions, a good choice of name is some abbreviated form of a ‘yes/no’ question. We can go with endswith_preposition.

  • Its input arguments. This is easy. The single input argument is a string containing the sentence.

  • What steps it carries out (the ‘body’ of the function). There is room for some variation and choice of strategy here. But a simple start would be to split the sentence into words, get the final one, remove any punctuation characters, then compare it to a list of prepositions.

  • The return value. We don’t actually want to return the printed answer 'yes' or 'no'. If our function is to be useful as part of a bigger program, it needs to return the computer versions of ‘yes’ and ‘no’, a boolean, with True for ‘yes the sentence violates the rule’ and False for ‘no it does not’.

# Punctuation characters to strip from the end of the final word.
punctuation = ' .,?!:;'
# A first (and far from complete) list of prepositions.
prepositions = ['around', 'about', 'at', 'by', 'down', 'from', 'in', 'of', 'on', 'out', 'to', 'up', 'with']

def endswith_preposition(sentence):
    words = sentence.split()
    final_word = words[-1].strip(punctuation)
    return final_word.lower() in prepositions

Let’s give our function a quick test.

endswith_preposition('This is the sort of mindless pedantry I just cannot put up with!')
True
endswith_preposition('This is the sort of mindless pedantry up with which I just cannot put!')
False

It works. But you can probably already spot a few deficiencies. For example, there are other punctuation characters that might end a sentence.

endswith_preposition('I said "This is the sort of mindless pedantry I just cannot put up with!"')
False

Regular expressions

Some of the difficulties of working with natural language text can be handled using a particular programming language whose sole purpose is to specify patterns of text characters to search for. Pieces of code written in this language are known as regular expressions, sometimes abbreviated to ‘regex’.

There is a third-party Python package for working with regular expressions. Let’s import it so that we can see a few examples of regular expressions in action.

import regex

(Note that there is also a module in Python’s standard library called re, which provides the same basic functions. But regex supports several features that re does not, so I recommend working with regex instead.)

The regex package provides various functions, but the one that we will use here is findall(), which takes two arguments, a regular expression giving a pattern of text, and a string in which to search for that pattern. The function returns a list of all the matches.

Regular expressions are a programming language all of their own. We won’t cover here all of the rules of regular expressions. But the basic idea is that a regular expression contains normal text plus certain special characters that have particular meanings. Normal text characters simply specify that exact piece of text. So for example the regular expression cat will match all the occurrences of that exact sequence of characters.

text = 'That big fat cat on the flat mat is my cat Pat.'

regex.findall('cat', text)
['cat', 'cat']

There are lots of special characters or combinations of characters that can be used in regular expressions to specify more abstract matches. For example, what if we want any word that ends in ‘at’? A range of alphabetic characters enclosed in square brackets [] means ‘any one of these characters’. So a first stab at finding all the words ending in ‘at’ might be:

regex.findall('[a-z]at', text)
['hat', 'fat', 'cat', 'lat', 'mat', 'cat']

This gets us all the cases of exactly one letter followed by ‘at’, but this results in some ‘chopped off’ words. The special character + means ‘one or more occurrences of the preceding pattern’, so we can add this in to also make sure we get the full words with more than one letter before the ‘at’.

regex.findall('[a-z]+at', text)
['hat', 'fat', 'cat', 'flat', 'mat', 'cat']

Regular expressions draw a distinction between upper and lower case letters. So to get the words with initial uppercase letters as well, we need to add the range of uppercase letters to our [] sequence.

regex.findall('[a-zA-Z]+at', text)
['That', 'fat', 'cat', 'flat', 'mat', 'cat', 'Pat']

Regular expressions may also contain certain character sequences that represent a whole ‘class’ of characters. These special sequences always begin with the backslash \. For example, the special sequence \w matches ‘word’ characters, which covers all letters and digits, plus the underscore _. This is commonly useful for breaking a text into words while leaving out punctuation. (Since the backslash also has a special meaning in normal Python strings, it is best to write regular expressions as ‘raw’ strings, prefixed with r, so that Python passes the backslash on unchanged.)

regex.findall(r'\w+', text)
['That',
 'big',
 'fat',
 'cat',
 'on',
 'the',
 'flat',
 'mat',
 'is',
 'my',
 'cat',
 'Pat']

There is lots more that can be done with regular expressions. You can read about the full set of special regex characters on the Python documentation page for regular expressions.
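
For a small taste, here are two more special characters in action, using the text variable from above: | means ‘or’, matching either of the alternative patterns on its two sides, and ^ anchors a match to the very start of the string.

regex.findall('cat|mat', text)
['cat', 'mat', 'cat']
regex.findall('^That', text)
['That']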

There is also an interactive regex testing site, where you can enter a regular expression plus some example text and check whether your regular expression has any matches in the text. Try it out.

For now, let’s apply the little that we have learned about regular expressions to improve our endswith_preposition() function.

def endswith_preposition(sentence):
    words = regex.findall(r'\w+', sentence)
    final_word = words[-1]
    return final_word.lower() in prepositions

endswith_preposition('I said "This is the sort of mindless pedantry I just cannot put up with!"')
True

nltk

But we still have a problem with our list of prepositions. It is far from complete.

endswith_preposition("Now that's the kind of pedantry I can get behind!")
False

Regular expressions aren’t going to be of much help fixing this second limitation of our function. We could of course just include more prepositions in our list. But there are a lot of prepositions that we would need to take into account.

As we have seen a few times before, when we encounter what seems like a difficult but common problem, there may be some tools already out there that can help save us from re-inventing the wheel. There are lots of Python packages for NLP, which apply various techniques ranging from simple string processing like in our example function above, to something more like artificial intelligence.

We will look at just one NLP package, called nltk, which stands for ‘Natural Language Toolkit’. The tools provided by nltk are a bit more complex than our example function, but still simple enough to be suitable for an introduction to NLP. Indeed, nltk was originally created to teach linguists how to do computing with Python, and it accompanies a book about Python and NLP, which is free to read online.

nltk is included in the default Anaconda installation, so if you have Anaconda then you will have it already. Let’s import it.

import nltk

Dealing with natural language often requires quite a lot of extra data. For example, we might need pre-compiled lists of word types, a database of grammatical rules, and so on. nltk includes various extra data files of this kind, but it does not install them all by default, in order to save space for users who do not need all of it. So the first time you use nltk you may find that you need to download extra components. For simplicity, I will simply download them all here so that we definitely have everything that we need.

nltk.download('all', quiet=True)
True

There are a lot of extra data files in the full package. If you prefer to pick and choose which extra components of nltk to download, call the downloader function with no arguments, and a window will pop up allowing you to browse components:

nltk.download()

Now we are ready to make use of nltk’s pre-compiled knowledge of word types in order to label words as prepositions.

Tokenization

The first step is to split our text into words. We already managed this above using the basic Python string method split(). But to see how this can be achieved with nltk, let’s use nltk’s own function for this, word_tokenize().

In NLP, the term ‘token’ refers to an instance of some meaningful unit of language. Most tokens are words, but not all. Sometimes we might wish to consider punctuation as meaningful, in which case a punctuation character is a token. And sometimes we might consider a word to be meaningful only as part of a group of words but not on its own, in which case that group of words is a token (one example of this latter case is names of people or places, such as ‘Los Angeles’; in everyday English, ‘Los’ on its own is not really a token).

The process of splitting a text into tokens is called ‘tokenization’. We can see some of the differences between this process and simple splitting by comparing the two.

sentence = "I'm ready!Let's go"

sentence.split()
["I'm", "ready!Let's", 'go']
nltk.word_tokenize(sentence)
['I', "'m", 'ready', '!', 'Let', "'s", 'go']

We see that word_tokenize() handles contractions like I’m as two tokens, and also deals correctly with typing errors such as missing spaces after punctuation.

Now let’s tokenize one of our example sentences from above.

sentence = "Now that's the kind of pedantry I can get behind!"

tokens = nltk.word_tokenize(sentence)

tokens
['Now',
 'that',
 "'s",
 'the',
 'kind',
 'of',
 'pedantry',
 'I',
 'can',
 'get',
 'behind',
 '!']

POS tagging

Once we have our tokens, the next step is to ‘tag’ the tokens as belonging to linguistic categories like ‘noun’, ‘verb’, ‘preposition’, etc. These categories are often called ‘parts of speech’ (abbreviated to POS). The nltk function for assigning parts of speech to tokens is pos_tag().

There are many different conventions concerning how to name and abbreviate the different possible parts of speech. These different conventions are known as ‘tagsets’. An additional argument can be used to specify the tagset to use. We will use a simplified ‘universal’ tagset here.

pos = nltk.pos_tag(tokens, tagset='universal')

pos
[('Now', 'ADV'),
 ('that', 'DET'),
 ("'s", 'VERB'),
 ('the', 'DET'),
 ('kind', 'NOUN'),
 ('of', 'ADP'),
 ('pedantry', 'NOUN'),
 ('I', 'PRON'),
 ('can', 'VERB'),
 ('get', 'VERB'),
 ('behind', 'ADP'),
 ('!', '.')]

We get a list of tuples, in each of which the token is the first entry and its POS tag is the second entry.

If you are wondering what the tags mean, you can see a table of them on the nltk website. Here are the ones that we are interested in for our task:

  • ADP : adposition (which to a linguistics blockhead like me is basically the same thing as a preposition)

  • . : punctuation

We need to remove any punctuation tokens, in order to make sure that the last item in our list is the last actual word and not any final punctuation mark. Then we need to check whether the final item has been tagged ‘ADP’.

So here is our updated function, which includes a little extra trick to reverse the list of tokens and then go through them to find the first non-punctuation token.

def endswith_preposition(sentence):
    tokens = nltk.word_tokenize(sentence)
    pos = nltk.pos_tag(tokens, tagset='universal')
    # Work backwards from the end of the sentence until we reach
    # the first token that is not punctuation.
    for x in reversed(pos):
        if x[1] != '.':
            return x[1] == 'ADP'
    return False

endswith_preposition("Now that's the kind of pedantry I can get behind!")
True

Success!

Sentences

Now we are ready to apply our function to the text of Moby Dick. There is an nltk function for splitting a text into sentences. We need to apply this function, then go through the sentences and check them with our own endswith_preposition() function. To keep the output brief, we will only print the first few matches.

counter = 0

for sentence in nltk.sent_tokenize(md):
    if endswith_preposition(sentence):
        print(sentence, '\n')
        counter = counter + 1
    if counter == 5:
        break
"No, Sir, 'tis a Right Whale," answered Tom; "I saw his sprout; he
threw up a pair of as pretty rainbows as a Christian would wish to
look at. 

so near! 

They must get just as nigh
the water as they possibly can without falling in. 

Tell me that. 

Again, I always go to sea as a sailor, because they make a point of
paying me for my trouble, whereas they never pay passengers a single
penny that I ever heard of. 

Our solution to the task works, more or less. But we can see here already some of the difficulties of using computers to process natural language. For example, the ‘that’ in ‘Tell me that’ is not really being used as a preposition.
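
If you are curious why our function nevertheless flags this sentence, you can inspect the tags that the tagger assigns yourself (depending on your nltk version, you may see slightly different tags for some tokens):

nltk.pos_tag(nltk.word_tokenize('Tell me that.'), tagset='universal')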

To see some more NLP in action, take a look at the example program pedantry.py, which implements functions for checking the three prescriptive grammar rules from our example task, as well as an overall function for searching a text for any violations of the rules.
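
To give a rough idea of how the other rules could be approached with the same tokenize-then-tag pattern, here is one possible check for the third rule. This is just a minimal sketch for illustration, not the actual code from pedantry.py; it assumes the ‘CONJ’ tag that the universal tagset uses for conjunctions.

def startswith_conjunction(sentence):
    tokens = nltk.word_tokenize(sentence)
    pos = nltk.pos_tag(tokens, tagset='universal')
    # Work forwards from the start of the sentence until we find
    # the first token that is not punctuation.
    for word, tag in pos:
        if tag != '.':
            return tag == 'CONJ'
    return False

Just as with endswith_preposition(), the tagger will occasionally be wrong about whether a word really is being used as a conjunction, so a check like this will produce some false positives.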

spacy

As we have seen, even a fairly simple NLP task like finding sentences that match a certain grammatical pattern can be very tricky. In natural language, context matters a great deal. A word may have different meanings or grammatical roles depending on the surrounding words. Simple techniques like regular expressions, and the basic NLP tools provided by nltk, will usually not be enough for real-world NLP problems. For many such problems, the current cutting-edge solutions rely on various machine learning techniques, which involve first ‘training’ a fairly complex program at the task. This is the topic of the next lesson.

One of the most popular Python packages for applying machine learning specifically to NLP is called spacy. The spacy package provides pre-trained computer ‘models’ of language, which can then be applied to piece apart the structure of a natural language text. This package is too big and too complex for our introduction here, but if you would like to learn more you can see examples and tutorials on the spacy website.
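
Just to give a flavor of what working with spacy looks like, here is a minimal sketch of tokenizing and tagging a sentence. It assumes you have installed spacy and downloaded its small English model, for example with python -m spacy download en_core_web_sm.

import spacy

# Load a pre-trained English model (must be downloaded beforehand).
nlp = spacy.load('en_core_web_sm')

# Processing a text gives back a sequence of tokens,
# each carrying its own part-of-speech tag.
doc = nlp("Now that's the kind of pedantry I can get behind!")

for token in doc:
    print(token.text, token.pos_)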

Exercise

English contains many compound adjectives and adverbs in which two or more words are joined by a hyphen and the final word is a verb in the past tense. For example:

  • old-fashioned

  • absent-minded

  • grey-headed

  • open-mouthed

  • so-called

  • good-natured

  • one-armed

Write a program that finds occurrences of these compound phrases in a text. Apply it to the text of Moby Dick and print out any compound phrases that occur three or more times in the text. You should find that it prints the seven example phrases above.

The documentation for nltk is not always so easy to navigate, so here is a hint:

Earlier on, we used the ‘universal’ tagset for tagging words with parts of speech. This tagset is useful for simple distinctions between verbs, nouns, adjectives, and so on. The default tagset for pos_tag() provides more detailed tags that distinguish between verbs in the past and present tense. If you use pos_tag() without the tagset argument, you will get these more detailed tags.

For example:

phrase = 'The cat sat on the mat. The cat sits on the mat.'
tokens = nltk.word_tokenize(phrase)

nltk.pos_tag(tokens)
[('The', 'DT'),
 ('cat', 'NN'),
 ('sat', 'VBD'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('mat', 'NN'),
 ('.', '.'),
 ('The', 'DT'),
 ('cat', 'NN'),
 ('sits', 'VBZ'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('mat', 'NN'),
 ('.', '.')]

The ‘VBD’ tag represents a verb in the past tense, as you can see for the word ‘sat’ here, so this is the tag you will need to look for to identify hyphenated phrases that end in a past-tense verb.