Download the file “ATaleOfTwoCities.txt”. This file contains the full text of Charles Dickens’s novel “A Tale of Two Cities”, as downloaded from Project Gutenberg. Using the nltk.tokenize module, calculate the total number of sentences in this novel.
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Read the whole novel into a single string
novel = open("ATaleOfTwoCities.txt", encoding='utf-8')
fullText = novel.read()
Building on the code that you have written for exercise 1, try to calculate the average number of words per sentence. Hint: calculate both the total number of words and the total number of sentences, and divide these two numbers.
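One possible way to approach the average is sketched below. It assumes that punctuation tokens should not count as words, which the exercise does not specify. The helper names `is_word`, `average_words_per_sentence` and `tokenize_novel` are illustrative and not part of NLTK; `tokenize_novel` needs NLTK and its sentence tokenizer data to be installed.

```python
def is_word(token):
    """Treat a token as a word if it contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


def average_words_per_sentence(tokenized_sentences):
    """Average word count over a list of sentences, each given as a list of tokens."""
    if not tokenized_sentences:
        return 0.0
    total_words = sum(
        sum(1 for tok in sent if is_word(tok)) for sent in tokenized_sentences
    )
    return total_words / len(tokenized_sentences)


def tokenize_novel(path):
    """Split a text file into sentences, and each sentence into tokens."""
    # Imported here so the helpers above can be used without NLTK installed
    from nltk.tokenize import sent_tokenize, word_tokenize

    with open(path, encoding="utf-8") as novel:
        full_text = novel.read()
    return [word_tokenize(sent) for sent in sent_tokenize(full_text)]
```

With the novel in the working directory, `average_words_per_sentence(tokenize_novel("ATaleOfTwoCities.txt"))` would give the average. Dropping the `is_word` filter counts every token, including punctuation, which yields a somewhat higher figure.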
The code below can be used to apply part-of-speech tagging to the text file that you have opened. Use this code to count the total number of nouns, adjectives and adverbs in “A Tale of Two Cities”. Write all the adjectives to a separate text file named “adjectives.txt”. N.B. All the tags used in the Penn Treebank project can be found here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
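import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

novel = open("ATaleOfTwoCities.txt", encoding='utf-8')
fullText = novel.read()

sentences = sent_tokenize(fullText)
for sent in sentences:
    words = word_tokenize(sent)
    # pos_tag returns a list of (word, tag) tuples
    tags = nltk.pos_tag(words)
    for t in tags:
        # t[0] is the word, t[1] is its Penn Treebank tag
        print(t[0] + " => " + t[1] + "\n")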
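The counting step could be sketched as follows. It relies on the fact that the Penn Treebank noun tags all start with NN, the adjective tags with JJ and the adverb tags with RB. Counting proper nouns (NNP, NNPS) as nouns is an assumption, and the function names are illustrative rather than part of NLTK. The functions expect the (word, tag) tuples produced by `nltk.pos_tag`.

```python
from collections import Counter

# Penn Treebank tag prefixes for the three word classes of interest
WORD_CLASSES = {"noun": "NN", "adjective": "JJ", "adverb": "RB"}


def count_word_classes(tagged_tokens):
    """Count nouns, adjectives and adverbs in a list of (word, tag) tuples."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        for name, prefix in WORD_CLASSES.items():
            if tag.startswith(prefix):
                counts[name] += 1
    return counts


def collect_adjectives(tagged_tokens):
    """Return the words tagged as adjectives (JJ, JJR, JJS), in order."""
    return [word for word, tag in tagged_tokens if tag.startswith("JJ")]


def write_lines(path, lines):
    """Write one item per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
```

Inside the loop over sentences, the tuples in `tags` can be appended to one long list; calling `write_lines("adjectives.txt", collect_adjectives(all_tags))` on that list (here given the hypothetical name `all_tags`) then produces the requested file.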
Find all the fragments in “A Tale of Two Cities” that contain the verb “to go”, regardless of its inflection.
import re
import string
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

lm = WordNetLemmatizer()

novel = open("ATaleOfTwoCities.txt", encoding='utf-8')
for line in novel:
    words = word_tokenize(line)
    # tags is a list of (word, tag) tuples for this line
    tags = nltk.pos_tag(words)
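A minimal sketch of the matching step follows, assuming that a "fragment" is one tagged line or sentence and that only tokens tagged as verbs (Penn Treebank tags starting with VB) should be compared. The function names are illustrative. The lemmatizer is passed in as a function so that the matching logic does not depend on a particular library; `wordnet_verb_lemmatizer` wraps NLTK's `WordNetLemmatizer` and needs the WordNet data to be installed.

```python
def contains_verb(tagged_tokens, lemma, lemmatize):
    """True if any verb token (tag VB*) lemmatizes to the given lemma."""
    return any(
        tag.startswith("VB") and lemmatize(word.lower()) == lemma
        for word, tag in tagged_tokens
    )


def find_fragments(tagged_fragments, lemma, lemmatize):
    """Return the fragments, joined back into strings, that use the verb."""
    return [
        " ".join(word for word, _tag in fragment)
        for fragment in tagged_fragments
        if contains_verb(fragment, lemma, lemmatize)
    ]


def wordnet_verb_lemmatizer():
    """Build a verb lemmatizer function from NLTK's WordNetLemmatizer."""
    # Imported here so the functions above can be used without NLTK installed
    from nltk.stem import WordNetLemmatizer

    lm = WordNetLemmatizer()
    # pos="v" tells the lemmatizer to treat the word as a verb
    return lambda word: lm.lemmatize(word, pos="v")
```

Collecting the `tags` list of every line into one list and calling `find_fragments(all_tags, "go", wordnet_verb_lemmatizer())` would then return the matching fragments; `all_tags` is a hypothetical name for that list. A lemmatizer is used rather than a stemmer because an irregular form such as "went" shares no stem with "go".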