Perl Exercises

Basics

 

Printing

1. Print the text “It works” (or any other short string).

2. Write a program which produces the following output:

This is the first line.
This is the second line.
This line contains a        tab.
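
A minimal sketch of one way to produce this output: the character sequence \n starts a new line and \t inserts a tab.

print "This is the first line.\n" ;
print "This is the second line.\n" ;
print "This line contains a\ttab.\n" ;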


Making Calculations

3. Create two variables and assign a numerical value to each of them. Next, print their sum, their difference and their product.

4. How many seconds are there in seven weeks? And how many hours?

5. Given an exchange rate of 1 euro to 1.0719 dollars, how many American dollars can you buy for 150 euros?

6. What is the equivalent in Fahrenheit of 32 degrees Celsius? Use the following formula for this conversion: F = 9/5 * C + 32
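
As a starting point for questions 3 to 6, the sketch below shows how values can be stored in variables and combined in calculations; the example values are arbitrary, and the temperature conversion follows the formula given above.

$first = 12 ;
$second = 5 ;

# Basic arithmetic with two variables.
$sum = $first + $second ;
$difference = $first - $second ;
$product = $first * $second ;
print $sum . "\n" ;
print $difference . "\n" ;
print $product . "\n" ;

# Conversion from Celsius to Fahrenheit, using F = 9/5 * C + 32.
$celsius = 32 ;
$fahrenheit = 9 / 5 * $celsius + 32 ;
print $fahrenheit . "\n" ;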

 

Reading a file and writing to a file

 

7. Create a Perl application which reads the text file “shelley.txt” (the file is provided in the DTDP file repository) and prints all of its lines.

To read a text file in Perl, you can use the code below.

open ( IN , "shelley.txt") ;
 
 while(<IN>) {
 
 print $_ ; 
 
 } 
 close ( IN ) ;


 

8. Create a program that can count the total number of lines in the file “shelley.txt”. Write the output of this program to a file named “out.txt”.

The code below illustrates how a program can write text to a file.

open ( OUT , ">out.txt") ; 

print OUT "Hello world!" ;

close ( OUT ) ;
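
One possible way to combine the two snippets above for question 8 is to count the lines while reading “shelley.txt” and to write the total to “out.txt” afterwards; a minimal sketch:

$count = 0 ;

open ( IN , "shelley.txt" ) or die "Can't open file!" ;
while ( <IN> ) {
    # Increase the counter by one for every line that is read.
    $count++ ;
}
close ( IN ) ;

open ( OUT , ">out.txt" ) or die "Can't create file!" ;
print OUT "Number of lines: " . $count . "\n" ;
close ( OUT ) ;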

Solutions to these exercises

 

 

Regular expressions

 

9. Create an application in Perl which prints all the lines from Shelley’s Complete Poems that contain a given keyword (suggestions: “fire”, “rain”, “moon”, “storm”, “time”).
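
A minimal sketch for this question, assuming the keyword is stored in a variable (here “fire”); the binding operator =~ tests whether the current line matches the pattern.

$keyword = "fire" ;

open ( IN , "shelley.txt" ) or die "Can't open file!" ;
while ( <IN> ) {
    # Print the line only if it contains the keyword.
    if ( $_ =~ /$keyword/ ) {
        print $_ ;
    }
}
close ( IN ) ;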

10. Using the text file “shelley.txt”, try to find

  • lines containing either the word “sun” or the word “moon”. Use a single regular expression to identify these lines (see the sketch after this list).
  • lines which contain either the singular or the plural form of “star”.
  • lines which contain a question mark.
  • lines ending in the character sequence “ain”.
  • lines which contain at least two words that begin with “br”.
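
As an illustration of the first item in this list, the vertical bar combines two alternatives in a single pattern; the remaining patterns can be built along the same lines.

# Matches lines containing either "sun" or "moon".
if ( $_ =~ /sun|moon/ ) {
    print $_ ;
}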

11. Download the file concordance.pl. The file can be used to create concordances. Open this file and experiment with different values for the regular expression on line 19 of the program.

Try to find, for instance

  • words beginning with “br”
  • either the singular or the plural form of “leaf”.
  • all words ending in “ly”. The regular expression that you create could possibly be used in an algorithm which extracts adverbs. Which additional difficulties would need to be solved? (See the sketch after this list.)
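
As a hint for the last item in this list, a pattern for words ending in “ly” could look as follows. Note that it also matches words such as “only”, “family” or “ugly”, which are not adverbs, so some additional filtering would be needed.

# Matches a whole word ending in "ly".
if ( $_ =~ /\b([a-z]+ly)\b/i ) {
    print $1 . "\n" ;
}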

 

12. Download the file “Ulysses.txt”. It is the full text of James Joyce’s novel Ulysses. Write regular expressions to retrieve text fragments with the following characteristics:

  • Text fragments containing a year (e.g. the sentence “What reflection concerning the irregular sequence of dates 1884, 1885, 1886, 1888, 1892, 1893, 1904 did Bloom make before their arrival at their destination?”). One possible pattern is sketched after this list.
  • Text fragments in which Joyce chose the dramatic form, or, more specifically, lines which begin with the name of a speaker in capitals, followed directly by a colon.
  • Lines which consist of fewer than 30 characters.
  • Lines containing surnames beginning with “O” followed by an apostrophe (e.g. O’Connell, O’Brien).
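
As a sketch for the first item above, a year such as 1884 or 1904 can be matched as the digit 1, followed by 8 or 9, followed by two further digits; the \b boundaries make sure that the match is a separate number.

# Matches a four-digit year between 1800 and 1999.
if ( $_ =~ /\b1[89]\d\d\b/ ) {
    print $_ ;
}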

Solutions to these exercises

 

Arrays and hashes

13. Create an array that lists the days of the week, using the code below.

@week = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday") ;

Next, add some code to do the following (a sketch covering some of these steps follows the list):

  • Print the fourth item in the list. N.B. The first item in the list has index 0.
  • Print the last item in the list. N.B. The last item in an array can be accessed easily using the index -1.
  • Print all the items of the array. Each item must be printed on a separate line.
  • Building on the code produced for the previous question, show only those weekdays whose names start with a “t” or an “s”. N.B. Use a regular expression in your code.
  • When you place the keyword scalar in front of an array, this produces the number of items in the array. Use this method to print the number of items in the array.
  • Print all the days of the week together with their index in the array. N.B. To do this, it may be useful to introduce a new variable which counts the number of times the program loops through the array.
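
A short sketch covering some of these steps (the item with index 3 is the fourth item, and index -1 refers to the last item):

# The fourth and the last item of the array.
print $week[3] . "\n" ;
print $week[-1] . "\n" ;

# Print every day together with its index in the array.
$i = 0 ;
foreach $day ( @week ) {
    print $i . "\t" . $day . "\n" ;
    $i++ ;
}

# The number of items in the array.
print scalar( @week ) . "\n" ;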

14. Use the code below to create a hash that contains information about European countries and their capitals.

%capitals = ( "Italy"=>"Rome", 
"Luxembourg"=>"Luxembourg", 
"Belgium"=> "Brussels", 
"Denmark"=>"Copenhagen", "
Finland"=>"Helsinki", 
"France" => "Paris", 
"Slovakia"=>"Bratislava", 
"Slovenia"=>"Ljubljana", 
"Germany" => "Berlin", 
"Greece" => "Athens", 
"Ireland"=>"Dublin", 
"Netherlands"=>"Amsterdam", 
"Portugal"=>"Lisbon", 
"Spain"=>"Madrid", 
"Sweden"=>"Stockholm", 
"United Kingdom"=>"London", 
"Cyprus"=>"Nicosia", 
"Lithuania"=>"Vilnius", 
"Czech Republic"=>"Prague", 
"Estonia"=>"Tallin", 
"Hungary"=>"Budapest", 
"Latvia"=>"Riga", 
"Malta"=>"Valetta", 
"Austria" => "Vienna", 
"Poland"=>"Warsaw") ;
  • For each item in the hash, print the sentence “the capital of X is Y” (see the sketch after this list).
  • Building on the code produced for the previous question, create a list in which the sentences are sorted alphabetically by the names of the countries.
  • Also produce a list which is sorted according to the names of the capitals.
  • Count the number of countries that start with the letter “L”.
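
A sketch for the first two items in this list: keys returns all the country names, and sort places them in alphabetical order before the loop prints the sentences.

# Print one sentence per country, sorted by country name.
foreach $country ( sort keys %capitals ) {
    print "the capital of " . $country . " is " . $capitals{$country} . "\n" ;
}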

Solutions to these exercises

  

Tokenisation and frequency counts

15. Write an application in Perl which can divide the file “forster-samples.txt” into its individual words. You can use the code below as a basis.

 

use strict ;
use warnings ;

my ( @words , %freq ) ;

open ( FILE , "forster-samples.txt" ) or die "Can't open file!" ;
open ( OUT , ">words.txt" ) or die "Can't create file!" ;

while ( <FILE> ) {

    # Split each line on whitespace to obtain the individual words.
    @words = split( /\s+/ , $_ ) ;

    foreach my $w ( @words ) {

        print OUT $w . "\n" ;

        # Count how often each word occurs, in lower case.
        if ( $w =~ /(\w+)/ ) {
            $freq{ lc($1) }++ ;
        }
    }
}
close ( FILE ) ;

# Uncomment the line below to add the words and their frequencies,
# ordered from the most frequent to the least frequent, to the output.
foreach my $t ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {

    #print OUT $t . "\t" . $freq{$t} . "\n" ;

}

close ( OUT ) ;

 

16. In his textbook Practical Text Mining with Perl, Roger Bilisoly proposes that word segmentation algorithms can make use of the following regular expression:

/(([a-zA-Z']+-)*[a-zA-Z']+)/

One disadvantage of this regular expression, however, is that it does not remove single quotes preceding or following words. To remove such quotes, the following expression may be used:

/(([a-zA-Z]+['-])*[a-zA-Z]+)/

This improved pattern was proposed by Anne van Engelen (DTDP class 2016/2017).

Use this regular expression to print a new list of all the types in the file “forster-samples.txt”, together with their frequencies.
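
One way to use this pattern is to let it replace the \w+ match in the code given under question 15, so that hyphenated words are kept together and stray quotes are left out; a sketch of the adjusted inner loop:

foreach my $w ( @words ) {
    # The improved pattern keeps hyphenated words together and
    # ignores quotes placed before or after a word.
    if ( $w =~ /(([a-zA-Z]+['-])*[a-zA-Z]+)/ ) {
        $freq{ lc($1) }++ ;
    }
}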

17. Calculate both the number of types and the number of tokens in E.M. Forster’s A Room with a View. Next, calculate the type-token ratio in this novel.
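
If the word frequencies have been collected in a hash such as %freq above, the number of types and tokens, and hence the type-token ratio, can be derived from that hash; a minimal sketch:

# Each key of %freq is a type; the sum of the values is the number of tokens.
my $types = scalar( keys %freq ) ;
my $tokens = 0 ;
foreach my $t ( keys %freq ) {
    $tokens += $freq{$t} ;
}

print "Types: " . $types . "\n" ;
print "Tokens: " . $tokens . "\n" ;
print "Type-token ratio: " . ( $types / $tokens ) . "\n" ;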

18. Open the Glasgow list of stop words:  http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words.

Save the list as “stopwords.txt”.

Next, add the code below to the program that you have created for question 17. This code can be used to remove stop words from frequency lists.

my ( @stopwords ) ;
my $stopwords = "" ;

# Read the stop words into an array, skipping empty lines.
open ( ST , "stopwords.txt" ) or die "Can't read file!" ;
while ( <ST> ) {
    chomp() ;
    if ( $_ =~ /./ ) {
        push( @stopwords , $_ ) ;
    }
}
close ( ST ) ;

# Combine all the stop words into a single alternation pattern.
$stopwords = join( "|" , @stopwords ) ;
$stopwords = lc( $stopwords ) ;

# Returns 1 if the word that is passed in is a stop word, and 0 otherwise.
sub isstopword($) {

    my $text = shift ;

    if ( lc($text) =~ /\b($stopwords)\b/ ) {
        return 1 ;
    } else {
        return 0 ;
    }
}

Create a frequency list for Forster’s novel Howards End (consisting of types and their frequencies; separate the types and the numbers of occurrences with a tab). The list must be ordered in descending order; the most frequent types must be shown first.
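
A sketch of how the isstopword subroutine can be combined with the frequency loop from question 15, so that stop words are left out of the descending frequency list:

foreach my $t ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
    # Skip the word if it occurs in the stop word list.
    unless ( isstopword( $t ) ) {
        print OUT $t . "\t" . $freq{$t} . "\n" ;
    }
}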

19. Download the file “Moretti.txt” from the DTDP file repository. This file is a full text version of the book Distant Reading, a collection of essays by Franco Moretti. Using the Perl program “concordance.pl”, which can also be found in the file repository, try to find passages which contain some of the following words:

  • close reading (as a unit)
  • distant / distance
  • computer / computing
  • digital / digitisation
  • reading / read / reads
  • collaboration / collaborative
  • data
  • fact / facts / factual / empirical / objective

What do the KWIC lists reveal about Moretti’s ideas, or about the historical development in Moretti’s ideas?

What are the 30 most frequent words in Moretti’s book?

 

Solutions to these exercises

  

Natural Language Processing

Download the folder NLP.zip.

20. First, use the program pl, which you can find in the NLP folder, to add part-of-speech tags to the texts in a corpus that you have created. Next, try to answer the following questions about one of the texts in this corpus.

  1. How many adjectives are there in this text? Try to consider adjectives in the comparative and in the superlative as well. (A sketch for counting the adjective tags follows the N.B. below.)
  2. How many nouns are there in this text? Try to consider both singular and plural nouns.
  3. What are the most frequent personal pronouns?

N.B. An overview of all the codes that are defined in the Penn Treebank list can be found here.
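
A sketch for the first question, assuming the tagged text has been saved to a file (the name “tagged.txt” below is only an example) and that the tags follow the word\TAG format illustrated under question 21; JJ, JJR and JJS are the Penn Treebank tags for adjectives in the positive, comparative and superlative.

my $adjectives = 0 ;

open ( TAGGED , "tagged.txt" ) or die "Can't open file!" ;
while ( <TAGGED> ) {
    # Count every adjective tag (JJ, JJR or JJS) on the line.
    while ( $_ =~ /\\JJ[RS]?\b/g ) {
        $adjectives++ ;
    }
}
close ( TAGGED ) ;

print "Number of adjectives: " . $adjectives . "\n" ;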

21. Find all lines that contain the following grammatical construction: [Noun] [preposition] [article or personal pronoun] [Noun]. Examples:

roots\NNS of\IN the\DET world\NN
Judge\NNP of\IN the\DET stars\NNS
linnet\NN from\IN the\DET leaf\NN
dancer\NN from\IN the\DET dance\NN
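
A sketch of one possible pattern for this construction, assuming the word\TAG format shown in the examples above; NN, NNS, NNP and NNPS cover the noun tags, IN the preposition, and DET or PRP the article or the personal pronoun.

# [Noun] [preposition] [article or personal pronoun] [Noun]
if ( $_ =~ /\w+\\NNP?S?\s+\w+\\IN\s+\w+\\(DET|PRP)\s+\w+\\NNP?S?/ ) {
    print $_ ;
}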

22. First, run a text through the UCREL English semantic tagger. Copy and paste the result into a text file. Next, use the program usas.pl to find the most frequent semantic categories.

The full USAS tag set can be found here.