Visualisation in R

N.B. Data files and code that can help you to do exercises 1 to 3 can be downloaded here

N.B.2. To do these exercises, you first need to install the R packages “ggplot2” and “wordcloud”. In RStudio, you can do this via Tools > Install Packages.

Exercise 1

Using the code that you have produced for Coding Challenge 3, create a bar chart in R which visualises the type-token ratios of all the texts in a given corpus. Use the R code below.


library(ggplot2)

# read the type-token ratios and plot them as a bar chart
data <- read.csv("data.csv")

# choose any fill colour you like, e.g. with a colour picker
plot <- ggplot(data, aes(x = title, y = ttr)) + geom_bar(stat = "identity", fill = "#4180f4")

print(plot)

The fill parameter determines the colour of the bars; you can select a different colour using a colour picker, for instance.

You can also vary the colour of the bars according to the first character of each word, as follows:

ggplot(mfw, aes(x = type, y = frequency, fill = substr(type, 1, 1))) + geom_bar(stat = "identity")

What does the function coord_flip() do?
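One way to find out is to simply add it to the bar chart from Exercise 1 and compare the results. A minimal sketch, reusing the data.csv file from above:

library(ggplot2)

data <- read.csv("data.csv")

# the same bar chart as above, with coord_flip() added at the end
plot <- ggplot(data, aes(x = title, y = ttr)) + geom_bar(stat = "identity", fill = "#4180f4") + coord_flip()

print(plot)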


Exercise 2

Write a program in Python which can calculate the following metrics for each of the texts in your corpus:

  • Total number of words
  • Total number of sentences

Next, use R/ggplot2 to make a bar chart which visualises the average number of words per sentence (i.e. the total number of words divided by the total number of sentences).
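The sketch below is one possible starting point. It assumes (hypothetically) that your Python program has written a file named sentences.csv with the columns ‘title’, ‘words’ and ‘sentences’, so that the average sentence length is simply words / sentences:

library(ggplot2)

d <- read.csv("sentences.csv")

# average number of words per sentence = total words / total sentences
plot <- ggplot(d, aes(x = title, y = words / sentences)) + geom_bar(stat = "identity", fill = "#4180f4")

print(plot)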


Exercise 3

Create a program in Python which can count all occurrences of the following POS categories for each text in your corpus:

  1. Adjectives; try to consider adjectives in the comparative and in the superlative as well.
  2. Verbs; Note that verbs may have received different tags: VB (Verb, base form), VBD (Verb, past tense), VBG (Gerund or present participle), VBN (Past participle), etc.
  3. Adverbs

Use R/ggplot2 to make a number of scatter plots which compare the use of these different POS categories.

Use the code below as a basis.



library(ggplot2)

d <- read.csv("nltk_data.csv")

# scatter plot of relative adverb vs. adjective frequencies, with one labelled point per text
p <- ggplot(d, aes(x = adverbs / tokens, y = adjectives / tokens, label = title)) + geom_point(size = 4) + geom_text(col = "black", hjust = -0.2, size = 4) + xlab("") + ylab("")

print(p)



Exercise 4

Create a program in Python which can calculate the 50 most frequent words in one of the texts in your corpus. Visualise these words as a wordcloud.

If you have created a .csv file containing word frequencies (in columns ‘type’ and ‘frequency’), you can create a word cloud in R using the following code:


library(wordcloud)

mfw <- read.csv("mfw.csv")

# draw a word cloud from the types and their frequencies
wordcloud(mfw$type, mfw$frequency, min.freq = 3, max.words = 300, colors = c("#4180f4", "#5642f4", "#5642f4"), random.order = FALSE, rot.per = 0.50)

You can save yourself some typing by downloading this R code from the file repository.

Also try to experiment with this code by using different values for parameters such as rot.per (the proportion of words that are rotated 90 degrees), random.order (whether the terms are placed in random order or in order of decreasing frequency), or min.freq (words with a frequency below this value are not plotted).
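For instance, a variant with fewer rotated words, random placement and a higher frequency threshold might look as follows (the values here are just examples to experiment with):

library(wordcloud)

mfw <- read.csv("mfw.csv")

# fewer rotated words, random placement, higher frequency cut-off
wordcloud(mfw$type, mfw$frequency, min.freq = 5, max.words = 100, colors = c("#4180f4", "#5642f4"), random.order = TRUE, rot.per = 0.25)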