Perl and R exercises

Visualising NLP data


Download the folder NLP.zip

1. In this folder, you will find a program named pos.pl. This program can add part-of-speech tags to the texts in a corpus and count the occurrences of these different tags. It makes use of the tags defined in the Penn Treebank tag set. Try to answer the following questions about the texts in your corpus; a short R sketch for checking the tag counts follows the list.

  1. How many adjectives are there in the different texts? Consider adjectives in the comparative and in the superlative as well.
  2. How many verbs are there in the texts in your corpus? Note that verbs may have received different tags: VB (Verb, base form), VBD (Verb, past tense), VBG (Gerund or present participle), VBN (Past participle), etc.
  3. What are the most frequent personal pronouns in your corpus?
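
The program pos.pl reports these counts itself, but you can also cross-check them in R. The sketch below assumes that the tagged output has been saved to a file named tagged.txt (a name chosen here for illustration), in the word\TAG format shown in exercise 2.

# Count the Penn Treebank tags in a tagged text. The file name
# "tagged.txt" is an assumption; use the actual output of pos.pl.
text <- paste( readLines("tagged.txt"), collapse = " " )

# Extract every tag: a backslash followed by capitals (and "$").
tags <- regmatches( text, gregexpr( "\\\\[A-Z$]+", text ) )[[1]]
tags <- sub( "\\\\", "", tags )

# Frequencies of all the tags, from most to least frequent.
sort( table(tags), decreasing = TRUE )

# All adjectives: JJ (base form), JJR (comparative), JJS (superlative).
sum( tags %in% c( "JJ", "JJR", "JJS" ) )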

2. Adjust the file pos.pl in such a way that it can find all fragments that contain the following grammatical construction: [Noun] [preposition] [article or personal pronoun] [Noun]. Examples:

roots\NNS of\IN the\DT world\NN
Judge\NNP of\IN the\DT stars\NNS
linnet\NN from\IN the\DT leaf\NN
dancer\NN from\IN the\DT dance\NN
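
One way to approach this exercise is to capture the whole construction in a single regular expression. The fragment below is a sketch only: it assumes the tagged text is in tagged.txt, it treats possessive pronouns (PRP$) as the pronoun slot, and it uses R with perl = TRUE, so that the same expression can be adapted inside pos.pl.

# Find [Noun] [preposition] [article or pronoun] [Noun] sequences.
# "tagged.txt" is again an assumed file name.
text <- paste( readLines("tagged.txt"), collapse = " " )

pattern <- "\\S+\\\\NNP?S? \\S+\\\\IN \\S+\\\\(DT|PRP\\$) \\S+\\\\NNP?S?"
regmatches( text, gregexpr( pattern, text, perl = TRUE ) )[[1]]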

3. First, run a text through the UCREL English semantic tagger. Copy and paste the result into a text file. Next, use the program usas.pl to find the most frequent semantic categories.

The full USAS tag set can be found here.

4. Use the program sentimentAnalysis.pl to count the tokens that have either a positive or a negative connotation.
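
The program sentimentAnalysis.pl performs this count for you, but the fragment below sketches the underlying idea in R. All three file names are hypothetical; substitute the word lists and the corpus file you are actually working with.

# Count tokens with a positive or a negative connotation.
# The three file names below are assumptions.
positive <- readLines( "positive-words.txt" )
negative <- readLines( "negative-words.txt" )

text   <- tolower( paste( readLines("corpus.txt"), collapse = " " ) )
tokens <- unlist( strsplit( text, "[^a-z']+" ) )

sum( tokens %in% positive )   # tokens with a positive connotation
sum( tokens %in% negative )   # tokens with a negative connotation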

Analysing most frequent words

The exercises in this section mention many different files. They can be downloaded individually, but they have also been made available as a single zipped folder.


5. Using the file tokeniser.pl as a basis, write a .csv file named “mfw.csv”, which lists the 100 most frequent words in E.M. Forster’s A Room with a View. Next, try to create a bar chart in R to represent the frequencies that were calculated. Use the following steps:

  • In your code editor (e.g. Notepad++ or Brackets), create a file named barChart_mfw.R. Copy and paste the code below. The full script may also be downloaded from the file repository.
    N.B. The very first line in this code indicates the folder on your computer which contains the .csv file. If the data is in another directory, you need to change the path given to the setwd() function.
setwd("P:\\My Documents\\DTDP")

library(ggplot2)

# Read the frequency list and keep only the 30 most frequent words.
mfw <- read.csv( "mfw.csv" )
mfw <- mfw[1:30 , ]

# Fix the factor levels, so that the bars keep the order of the file.
mfw$word <- factor(mfw$word, levels = mfw$word)

p <- ggplot(mfw, aes( x = word, y = frequency ) ) + geom_bar(stat = 'identity' )
p <- p + coord_flip()
print(p)
  • Open the R application.
  • Run this code in the R program by choosing File > Source R code
  • Also try to change the colour of the bar chart by adding the fill parameter to geom_bar(), as follows:
ggplot(mfw, aes( x = word, y = frequency ) ) +
  geom_bar(stat = "identity" , fill = "darkred" )

You can also select a different colour, using a colour picker, for instance.

You can also vary the colour of the bars according to the first character of each word, as follows:

ggplot(mfw, aes( x = word , y = frequency , fill = substr( word , 1, 1 ) )) +
  geom_bar(stat = "identity" )

What does the function coord_flip() do?

6. If you have created a .csv file containing word frequencies, you can create a word cloud in R using the following code:

setwd("P:\\My Documents\\DTDP")

library(wordcloud)

mfw <- read.csv( "mfw.csv" )

wordcloud( mfw$word, mfw$frequency , min.freq = 3, max.words = 300,
           colors = c( "#4180f4", "#5642f4" , "#5642f4" ),
           random.order = FALSE, rot.per = 0.50 )

You can save yourself some typing by downloading this R code from the file repository.

Also try to experiment with this code by working with different values for parameters such as rot.per (the proportion of words that are rotated), random.order (the way in which the terms are positioned within the graph), or min.freq (the minimum frequency a word needs to have in order to be plotted).
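
The call below, for instance, lowers the frequency threshold and rotates roughly one word in ten; the values are arbitrary and only meant as a starting point.

# Same data, different settings; the values are arbitrary.
wordcloud( mfw$word, mfw$frequency , min.freq = 1, max.words = 150,
           colors = c( "#4180f4", "#5642f4" ),
           random.order = TRUE, rot.per = 0.10 )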

7. The file tokeniser.pl counts the frequencies of all words, including those of very frequent words such as “the”, “a” or “at”. The program tokeniser_stopwords.pl makes use of a list of stopwords to ignore all occurrences of such common words. First, create a new .csv file using this program, and, second, generate a new bar chart and a new word cloud, using the same R files which you used for exercises 5 and 6.

8. The Perl program distribution.pl builds on the tokenisation applications that you have worked with so far. This code first divides the full text into smaller segments. Next, it calculates the frequencies of a given search term within each of these segments. These frequencies can be visualised using the R code below. The code can also be downloaded here. The diagram produced by this code gives an impression of the way in which a term is distributed over a text. Use this code to visualise the distribution of the word “whale” across Melville’s novel Moby Dick.

library(ggplot2)
setwd("P:\\My Documents\\DTDP")

d <- read.csv( "mfw_segments.csv" )

p <- ggplot( d , aes( x = segment , y = frequency )) + geom_line( color = "#4286f4" )

print(p)


Collocation analysis and type-token ratios


9. Collocation analyses can be performed using the file collocation.pl. Visualise the result as a word cloud; a possible starting point is sketched below.
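
The sketch assumes that collocation.pl writes its counts to a two-column .csv file; the file name collocations.csv and the column names word and frequency are guesses, so check the script’s actual output first.

library(wordcloud)

# File and column names are assumptions; adjust them to the
# actual output of collocation.pl.
colloc <- read.csv( "collocations.csv" )
wordcloud( colloc$word, colloc$frequency , min.freq = 2,
           max.words = 200, random.order = FALSE )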

10. Use the files ttr.pl and ttr.r to visualise the type-token ratios of the texts in this sample corpus.

Line 37 of this file specifies the path to the various files in your directory. On a Windows machine, the hierarchy between folders and files is represented using backslashes, while Mac and Linux machines use the forward slash. When you use this file on a Mac or on a Linux computer, replace the two backslashes with a single forward slash.
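
In the R scripts above, for example, the same working directory is set as follows on the two types of machine (the Mac/Linux path is just an example):

# Windows: backslashes must be doubled inside R strings.
setwd("P:\\My Documents\\DTDP")

# Mac or Linux: a single forward slash.
setwd("/Users/yourname/DTDP")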


Readability analyses


11. The Perl program “readability.pl” can be used to calculate the number of sentences and the number of syllables in a group of texts. These counts can be used, in turn, to calculate readability metrics, such as the Flesch-Kincaid index. The file “presidents.csv” contains such data about a corpus containing the full transcripts of inauguration speeches delivered by US presidents. Create visualisations of this data set to clarify some noteworthy characteristics of these speeches. Additionally, use the file “metadata.csv” to explore differences between, for instance, Republican and Democratic presidents.
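
As a first step, the sketch below draws a box plot of one readability score per party. The column names president, fleschKincaid and party are guesses; inspect the headers of both .csv files and adjust the names accordingly.

library(ggplot2)
setwd("P:\\My Documents\\DTDP")

# Column names are assumptions; check the headers of both files.
scores   <- read.csv( "presidents.csv" )
metadata <- read.csv( "metadata.csv" )
d <- merge( scores, metadata, by = "president" )

p <- ggplot( d, aes( x = party, y = fleschKincaid ) ) + geom_boxplot()
print(p)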


Distinctive vocabulary


12. The program “unique.pl” can be used to find the words that are unique to the various texts in your corpus. In other words, it creates lists of words that occur only once within the corpus, and hence within one single document. Use this program to find the unique words in the inauguration speeches of US presidents.


Analyses based on term-document matrices


13. The program “tdm.pl” can be used to create a term-document matrix on the basis of the texts in your corpus. The CSV file that results from this program can be visualised in R as a scatter plot, via Principal Component Analysis. It is also possible to calculate the so-called Euclidean distances between the various texts. These distances can be visualised effectively as a dendrogram. Experiment with these techniques, using the texts in your corpus as a basis; the sketch below offers one possible starting point.
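
The fragment below assumes that tdm.pl produces a .csv file with one row per term and one column per text, and that the first column contains the terms; the file name tdm.csv is a guess.

setwd("P:\\My Documents\\DTDP")

# The file name and its layout (terms as rows, texts as columns)
# are assumptions; adjust them to the actual output of tdm.pl.
tdm <- read.csv( "tdm.csv", row.names = 1 )
m   <- t( as.matrix(tdm) )   # one row per text

# Principal Component Analysis: the texts on the first two components.
pca <- prcomp(m)
plot( pca$x[ , 1], pca$x[ , 2], type = "n", xlab = "PC1", ylab = "PC2" )
text( pca$x[ , 1], pca$x[ , 2], labels = rownames(m) )

# Euclidean distances between the texts, shown as a dendrogram.
plot( hclust( dist( m, method = "euclidean" ) ) )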