Coding challenges

Coding challenge 1


Create a program in Perl which can produce basic statistical information about the poem “The Love Song of J. Alfred Prufrock” by T.S. Eliot. The program must produce a file called “data.txt”. This file should contain the following sentences:

This text consists of X lines. 
It consists of Y characters in total. 
The average number of characters per line is Z.

X, Y and Z evidently need to be replaced with the actual numbers which have been calculated by the program.

To calculate the number of characters in a string, you can make use of the length() function. It works as follows:

print length( "Let us go then, you and I," ) ; 
 # this prints the length of the string (its number of characters)
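The whole challenge can then be sketched roughly as follows. This is a minimal sketch, not the official solution; the filename "prufrock.txt" is an assumption about how you have stored the poem.

```perl
# Sketch: the poem is assumed to be stored in a file named "prufrock.txt".
open( my $in, '<', 'prufrock.txt' ) or die "Cannot read prufrock.txt: $!";
open( my $out, '>', 'data.txt' ) or die "Cannot create data.txt: $!";

my $lines = 0;
my $chars = 0;

while ( my $line = <$in> ) {
    chomp $line;
    $lines++;
    # add the number of characters on this line to the running total
    $chars += length( $line );
}

my $average = $chars / $lines;

print $out "This text consists of $lines lines.\n";
print $out "It consists of $chars characters in total.\n";
print $out "The average number of characters per line is $average.\n";

close( $in );
close( $out );
```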

The solution can be found here.

Coding challenge 2


Create a program in Perl which can extract lines from P.B. Shelley’s Complete Poems with either of the following characteristics:

  1. Lines that contain either one of the following words: “belief”, “believe”, “believing”, “believes”, “believed” or “believer”.
  2. Lines which contain the words “the” and “light”, either directly in sequence, or with a maximum of two other words in between them. For example: the light, the aereal golden light, the grey light

You can create one long list with examples of both types of lines.

The different words in the first question can of course be supplied in full, but, since these different words all share the same root (“belie”), it is more efficient to create a single pattern in which only the final characters vary. In exercise 11, you saw that it was possible to search for both the singular and the plural of “leaf” using the following pattern:

/lea(f|ves)/

In this example, the brackets (or parentheses) following “lea” are used to create a cluster of different options. These options are separated by the pipe character (“|”). You can use a similar solution to find all the different word forms which are mentioned in the first question.
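Applied to the word forms of the first question, such a cluster could look as follows. This is one possible pattern, not the only correct one; the case-insensitive modifier /i is an assumption about how you want to treat capitalised occurrences.

```perl
# One possible pattern: the shared root "belie", followed by a cluster
# of the varying endings, separated by pipe characters.
if ( $line =~ /\bbelie(f|ve|ving|ves|ved|ver)\b/i ) {
    print $line, "\n";
}
```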

For the second question, you need to specify that there can optionally be one or two words in between two other words. Making a pattern for the sequence “the light” is fairly straightforward:

/\bthe light\b/

The challenge is to change this simple expression in such a way that it matches fragments such as “the bright light” or “the strong yellow light” as well. If there is another word in between the two words which are given, this additional string consists of the characters in this word (so: find a pattern which can represent ‘a word’ in general), followed by a space. This combination (the ‘word’ pattern followed by the space) needs to be placed within parentheses, so that it forms a larger cluster of characters. Finally, you also need to add quantifiers to indicate how often this additional pattern may occur. Just as a reminder: quantifiers generally have the following form: {n,m}. Such a quantifier indicates that the pattern occurs at least ‘n’ times, and maximally ‘m’ times.
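Putting these steps together, the pattern could be sketched as follows. Here \w+ stands in for ‘a word’ in general; this is one way to represent a word, and other patterns are possible.

```perl
# (\w+ ){0,2} : a word followed by a space, occurring zero, one or two
# times in between "the" and "light".
if ( $line =~ /\bthe (\w+ ){0,2}light\b/ ) {
    print $line, "\n";
}
```

This matches “the light” (zero intervening words), “the bright light” (one) and “the strong yellow light” (two).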

The solution can be found here.

Coding challenge 3


Create a list of the 50 most frequent words in E.M. Forster’s A Room with a View (or, alternatively, any of the texts in your own corpus). Write the result to a file named “mfw.csv”. The list must contain the various types and the number of times these types occur within the text. These two values must be separated by a comma. The list must be sorted by frequency, in descending order.

To make sure that you only print the 50 most common words, you can work with a variable which counts the number of lines that have already been printed. To break out of a loop, you can use the command “last”. The usage of this command is illustrated in the example below. N.B. The “foreach” in this sample code corresponds to the foreach that you use in your own code to navigate through the hash that contains all frequencies.

$i = 0 ; 

foreach my $item ( keys %longList ) {
    $i++ ; 
    if ( $i == 10 ) {
        last ;
    }
}

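For this challenge, the counter needs to be combined with a foreach that sorts the hash by frequency. The sketch below assumes that a hash named %freq has already been filled with the word frequencies; the variable names are assumptions.

```perl
# Sketch: print the 50 most frequent words, sorted by frequency, descending.
# %freq is assumed to contain the word frequencies (type => count).
open( my $csv, '>', 'mfw.csv' ) or die "Cannot create mfw.csv: $!";

my $i = 0;
foreach my $word ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
    print $csv "$word,$freq{$word}\n";
    $i++;
    if ( $i == 50 ) {
        last;
    }
}

close( $csv );
```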

Solution to this exercise



Coding challenge 4


The file ttr.r can be used in combination with the accompanying Perl code to make a bar chart showing the type-token ratios of the texts in your corpus. The current code disregards the fact that the texts may have different lengths, however. Try to adjust the Perl code in such a way that it only counts the types and tokens in the first 3000 words of your texts (or fewer words, if the shortest text in your corpus does not contain 3000 words).

Submit the new code, together with a bar chart that you can create using ttr.r.

N.B. The easiest way to solve this challenge is to work with a counter variable which keeps track of the number of words that have been read within the section that is headed by while(<IN>). The %freq hash needs to stop counting types when this counter variable has reached the value 3000.
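Such a counter could be worked into the existing while loop roughly as follows. This is a sketch under the assumption that the file handle is named IN and that words are separated by whitespace; the variable names are assumptions.

```perl
my %freq;
my $tokens = 0;

while ( <IN> ) {
    foreach my $word ( split( /\s+/, lc( $_ ) ) ) {
        # stop adding words to %freq once 3000 tokens have been counted
        last if $tokens == 3000;
        $freq{ $word }++;
        $tokens++;
    }
    # also leave the while loop once the limit has been reached
    last if $tokens == 3000;
}
```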

Solution to this exercise 



Coding challenge 5


Download the file data.csv. This file contains data about 150 novels in the English language. Create visualisations of this data set which can help to answer the following questions:

  1. Do female writers use longer sentences? And is their vocabulary more or less repetitive? Create a scatter plot which displays both the average length of sentences (tokens / nrSentences) and the type-token ratios. Use the variable typesFirst3000 to calculate the type-token ratios. Additionally, add a color parameter in the aes() section of ggplot(), which takes its values from the gender variable.
  2. How many books are there in total from the 18th, 19th and 20th century? And how many of these were written by female authors? Create a stacked bar chart which indicates both the number of books published in these different centuries and the number of books written by male and female authors. Follow the steps below:
    1. Create a bar chart (a histogram) which indicates the total number of books in the three centuries
    2. To the aes() section of the plot, add a fill parameter. This fill parameter must reflect the counts for male and female authors
    3. In the geom_bar() function, add the parameter position = "dodge"
  3. Create a top 20 list of the books that have the longest sentences. Display your results as a bar chart. Use the code below to make this specific selection:


d <- read.csv("data.csv")

d$sentenceLength <- ( d$tokens / d$nrSentences )

d <- d[ order( -d$sentenceLength ) , ]

top20 <- d[1:20,]

Note that this code creates a new data frame, named top20. This is a filtered version of the full d data frame that is created at the beginning. This new data frame only contains data about the 20 books with the longest sentences. This second data frame, named top20, must also be used in the ggplot() function which creates the bar chart.
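The plots for questions 1 and 3 could be sketched along the following lines. The column names tokens, nrSentences, typesFirst3000 and gender are taken from the instructions above; the column name title, used to label the bars, is an assumption about the data set, as is dividing typesFirst3000 by 3000 to obtain the type-token ratio.

```r
library( ggplot2 )

d <- read.csv( "data.csv" )
d$sentenceLength <- d$tokens / d$nrSentences
# assumption: type-token ratio = number of types in the first 3000 tokens
d$ttr <- d$typesFirst3000 / 3000

# Question 1: scatter plot of sentence length against type-token ratio,
# coloured by the gender of the author.
ggplot( d, aes( x = sentenceLength, y = ttr, color = gender ) ) +
    geom_point()
ggsave( "scatterplot.png" )

# Question 3: bar chart of the 20 books with the longest sentences.
d <- d[ order( -d$sentenceLength ), ]
top20 <- d[ 1:20, ]
ggplot( top20, aes( x = reorder( title, sentenceLength ), y = sentenceLength ) ) +
    geom_bar( stat = "identity" ) +
    coord_flip()
ggsave( "barchart.png" )
```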

Submit a Word file in which you paste the graphs that you have created in R, together with the full code that you used to make these visualisations.

To save images in R, you can make use of the ggsave() function, as follows:

ggsave( "barchart.png" )
# this saves the most recently created plot to the file "barchart.png";
# the dimensions can be set via the optional width and height parameters


Solution to this assignment