Unix pipelines for basic spelling error detection
Introduction
We can of course write programs to do most anything we want, but often the Unix command line has everything we need to perform a series of useful operations without writing a line of code. In my Applied NLP class today, I show how one can get a high-confidence dictionary out of a body of raw text with a series of Unix pipes, and I’m posting that here so students can refer back to it later and see some pointers to other useful Unix resources.
Note: for help with any of the commands, just type “man <command>” at the Unix prompt.
Checking for spelling errors
We are working on automated spelling correction as an in-class exercise, with a particular emphasis on the following sentence:

This Facebook app shows that she is there favorite acress in tonw

So, this has a contextual spelling error (there), an error that could plausibly be an English word but isn't (acress), and an error that violates English sound patterns (tonw).
One of the key ingredients for spelling correction is a dictionary of words known to be valid in the language. Let’s assume we are working with English here. On most Unix systems, you can pick up an English dictionary in /usr/share/dict/words, though the words you find may vary from one platform to another. If you can’t find anything there, there are many word lists available online, e.g. check out the Wordlist project for downloads and links. We can easily use the dictionary and Unix to check for words in the above sentence that don’t occur in the dictionary. First, save the sentence to a file.
$ echo 'This Facebook app shows that she is there favorite acress in tonw' > sentence.txt
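As a quick aside, you can also check how big the dictionary is and spot-check a single word with grep before building the full pipeline. The size and contents of /usr/share/dict/words vary by platform, so the output is omitted here:

$ wc -l /usr/share/dict/words
$ grep -c '^acress$' /usr/share/dict/words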
Next, we need to get the unique word types (rather than tokens) in sorted lexicographic order. The following Unix pipeline accomplishes this.
$ cat sentence.txt | tr ' ' '\n' | sort | uniq > words.txt
To break it down:
- The cat command spills the file to standard output.
- The tr command “translates” all spaces to new lines. So, this gives us one word per line.
- The sort command sorts the lines lexicographically.
- The uniq command removes adjacent duplicate lines, so each type appears just once. (This doesn’t do anything for this particular sentence, but I’m putting it in there in case you try other sentences that have multiple tokens of the type “the”, for example.)
You can see these effects by doing each in turn, building up the pipeline incrementally.
$ cat sentence.txt
This Facebook app shows that she is there favorite acress in tonw
$ cat sentence.txt | tr ' ' '\n'
This
Facebook
app
shows
that
she
is
there
favorite
acress
in
tonw
$ cat sentence.txt | tr ' ' '\n' | sort
Facebook
This
acress
app
favorite
in
is
she
shows
that
there
tonw
We can now use the comm command to compare the file words.txt and the dictionary. It produces three columns of output: the first gives the lines found only in the first file, the second gives the lines found only in the second file, and the third gives the lines common to both. So, the first column has what we need, because those are words in our sentence that are not found in the dictionary. Here’s the command to get that.
$ comm -23 words.txt /usr/share/dict/words
Facebook
This
acress
app
shows
tonw
The -23 options indicate we should suppress columns 2 and 3 and show only column 1. If we just use -2, we get the words in the sentence with the non-dictionary words on the left and the dictionary words on the right (try it).
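Concretely, that variant is just the following (output not shown here, since it depends on your system’s dictionary):

$ comm -2 words.txt /usr/share/dict/words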
The problem of course is that any word list will have gaps. This dictionary doesn’t have more recent terms like Facebook and app. It also doesn’t have upper-case This; you can ignore case with comm’s -i option (available on BSD/macOS, though GNU comm lacks it) and this problem goes away. It doesn’t have shows either, since shows is an inflected form of the verb stem show. We could fix this with some morphological analysis, but instead of that, let’s go the lazy route and just grab a larger list of words.
Extracting a high-confidence dictionary from a corpus
Raw text often contains spelling errors, but errors don’t tend to happen with very high frequency, so we can often get pretty good expanded word lists by computing frequencies of word types on lots of text and then applying reasonable cutoffs. (There are much more refined methods, but this will suffice for current purposes.)
First, let’s get some data. The Open American National Corpus has just released v3.0.0 of its Manually Annotated Sub-Corpus (MASC), which you can get from this link.
– http://www.anc.org/masc/MASC-3.0.0.tgz
Do the following to get it and set things up for further processing:
$ mkdir masc
$ cd masc
$ wget http://www.anc.org/masc/MASC-3.0.0.tgz
$ tar xzf MASC-3.0.0.tgz
(If you don’t have wget, you can just download the MASC file in your browser and then move it over.)
Next, we want all the text from the data/written directory. The find command is very handy for this.
$ find data/written -name '*.txt' -exec cat {} \; > all-written.txt
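As a side note, if there are many files, the + form of -exec batches lots of file names into each cat invocation, which is a bit faster; the resulting concatenated text is the same:

$ find data/written -name '*.txt' -exec cat {} + > all-written.txt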
To see how much is there, use the wc command.
$ wc all-written.txt
  43061  400169 2557685 all-written.txt
So, there are about 43k lines and 400k tokens. That’s a bit small for what we are trying to do, but it will suffice for the example.
Again, I’ll build up a Unix pipeline to extract the high-confidence word types from this corpus. I’ll use the head command to show just part of the output at each stage.
Here are the raw contents.
$ cat all-written.txt | head
I can't believe I wrote all that last year.
Acephalous
Friday, 07 May 2010
Now, get one word per line.
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | head
I
can
t
believe
I
wrote
all
that
last
The tr translator is used very crudely here: basically, anything that is not an ASCII letter is turned into a newline. The -cs options tell tr to take the complement (opposite) of the ‘A-Za-z’ set and to squeeze repeated newlines in the output (e.g. “A42,” becomes “A” followed by a single newline rather than three).
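To see exactly what that does, you can try it on a toy string (just an illustration, not from the corpus):

$ echo 'A42, B7' | tr -cs 'A-Za-z' '\n'
A
B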
Next, we sort and uniq, as above, except that we use the -c option to uniq so that it produces counts.
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | head
      1
    737 A
     22 AA
      1 AAA
      1 AAF
      1 AAPs
     21 AB
      3 ABC
      1 ABDULWAHAB
      1 ABLE
Because the MASC corpus includes tweets and blogs and other unedited text, we don’t trust words that have low counts, e.g. four or fewer tokens of that type. We can use awk to filter those out.
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $2 }' | head
A
AA
AB
ACR
ADDRESS
ADP
ADPNP
AER
AIG
ALAN
Awk makes it easy to process lines of text, and gives you the first field ($1), the second ($2), and so on. There’s much more you can do, but this shows how you can conditionally output some information from each line using awk.
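For instance, here’s a small sketch (not from the original post) that uses awk’s END block to total up how many word types and word tokens the corpus contains; the actual numbers depend on your copy of the data, so they are omitted:

$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ types += 1; tokens += $1 } END { print types, tokens }'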
You can of course change the threshold. You can also turn all words to lower-case by inserting another tr call into the pipe, e.g.:
$ cat all-written.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sort | uniq -c | awk '{ if($1>8) print $2 }' | head
a
aa
ab
abandoned
abbey
ability
able
abnormal
abnormalities
aboard
It all comes down to what you need out of the text.
Combining and using the dictionaries
Let’s do the check on the sentence above, but using both the standard dictionary and the one derived from MASC. Run the following command first.
$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $2 }' > /tmp/masc_vocab.txt
Then in the directory where you saved words.txt, do the following.
$ cat /usr/share/dict/words /tmp/masc_vocab.txt | sort | uniq > big_vocab.txt
$ comm -23 words.txt big_vocab.txt
acress
tonw
Ta-da! The MASC corpus provided us with enough examples of other words that This, Facebook, app, and shows are no longer detected as errors. Of course, detecting there as an error is much more difficult and requires language models and more.
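If you find yourself re-running this check on new sentences, it’s easy to bundle the steps into a tiny script. Here’s a minimal sketch, assuming big_vocab.txt has already been built as above; the name check_spelling.sh is just made up for illustration:

#!/bin/sh
# check_spelling.sh (hypothetical helper): print tokens from the file given as $1
# that are not found in big_vocab.txt (both comm inputs must be sorted).
tr ' ' '\n' < "$1" | sort | uniq | comm -23 - big_vocab.txt

Run it as, for example, sh check_spelling.sh sentence.txt.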
Conclusion
Learn to use the Unix command line! This post is just a start into many cool things you can do with Unix pipes. Here are some other resources:
- Unix for Poets (the classic resource by Ken Church)
- Data-Intensive Linguistics book draft by Chris Brew and Marc Moens (never published, but has a great section on using Unix for language processing).
- Slides I did on using Unix to count words in texts and compute language model probabilities
Happy (Unix) hacking!
Reference: Unix pipelines for basic spelling error detection from our JCG partner Jason Baldridge at the Bcomposes blog.