The exercises below make use of a number of R packages. If these are not yet installed on your computer, install them first using the following commands:
install.packages("ggplot2", dependencies=TRUE)
install.packages("wordcloud", dependencies=TRUE)
Basic R Exercises
1. Download the data set “nobel.csv”. This data collection contains data about all Nobel laureates between 1901 and 2016. It was acquired via Kaggle. Save this file in your DTDP folder, and make sure that your DTDP folder is also the Working Directory in R.
- Which variables are used in this data set? Use the str() function. Another possibility is to use the colnames() function.
- Print the first 10 rows of this .csv file, using head()
- Show all the values of the column Birth.Country
- What is the earliest year in the data set? Use the min() function.
- Try to subset your data frame. For instance, make a new data frame which contains data about Nobel prizes won by people born in the Netherlands.
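The functions mentioned in this exercise can be tried out on a small stand-in data frame before you load the real file. The sketch below is only an illustration: the three rows are made up, and only the column names Year, Category and Birth.Country are taken from the exercises. With the real data you would start from n <- read.csv("nobel.csv") instead.

```r
# Tiny stand-in for nobel.csv (hypothetical rows, for illustration only)
n <- data.frame(
  Year = c(1901, 1903, 1953),
  Category = c("Chemistry", "Physics", "Physics"),
  Birth.Country = c("Netherlands", "France", "Netherlands"),
  stringsAsFactors = FALSE
)

str(n)            # structure: variable names and their types
colnames(n)       # variable names only
head(n, 10)       # first 10 rows (here there are only 3)
n$Birth.Country   # all values of a single column

# Earliest year in the data
earliest <- min(n$Year)
print(earliest)

# New data frame with only the rows for people born in the Netherlands
nl <- n[n$Birth.Country == "Netherlands", ]
print(nl)
```

The same subset can also be written as subset(n, Birth.Country == "Netherlands").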
2. Download the data set “imdb.csv”. This data collection contains data about ca. 5000 movies, “scraped” from the IMDB website. It was acquired via Kaggle. This data set has values for the following variables:
Using this data set, create visualisations which can be used to answer the questions below. Concise descriptions of the main functions and parameters in ggplot2 can be found in the ggplot cheat sheet.
- Which languages are spoken in the movies that are included in the data set?
- Which languages are spoken apart from English? Create a bar chart, with horizontal bars.
- What were the budgets of all the French movies made after 2000? What were the most expensive movies?
- Which movies made in or after 2012 have had the best box office results? If the movies are sorted by revenue (as given in ‘gross’), which movies belong to the highest 5%?
- How many movies have been made in France, Germany, the United Kingdom and Canada? What are the genres of these movies? Create a chart with bars in different colours: the colours must reflect the different genres. N.B. The existing column genres contains multiple genres. To select the genre that is mentioned first, use the following code:
m$genreMain <- sub( "[|].+" , "" , m$genres )
- What are the genres of the movies made in the United Kingdom? Create a stacked bar chart.
- Create a line chart which can display the number of films that were made on an annual basis.
- Is there a relation between the budget of a movie and the gross revenue? Focus on movies made in the United States (USA), and on movies with a budget of 50 mln and higher. Create a scatter plot showing both the budgets and the gross box office results. The colour of the points must reflect the main genres of these movies.
- When you limit your analysis to all French movies, is there a relation between budget and appreciation? Create a scatter plot which displays the values for ‘budget’ and ‘imdb_score’.
- How did the movie budgets develop since 2010?
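Several of the questions above follow the same pattern: filter the data frame, then hand the result to ggplot. The sketch below illustrates that pattern on a made-up stand-in for imdb.csv. The columns genres and gross are named in the exercises; the column names language and title_year are assumptions, so check them against the output of colnames() on the real file.

```r
library(ggplot2)

# Hypothetical stand-in for imdb.csv (invented rows; language and
# title_year are assumed column names)
m <- data.frame(
  language = c("English", "French", "French", "German", "English", "Spanish"),
  title_year = c(2012, 2013, 2001, 2014, 2015, 2012),
  genres = c("Action|Adventure", "Drama", "Comedy|Romance",
             "Drama|War", "Action|Sci-Fi", "Comedy"),
  gross = c(500, 20, 35, 10, 900, 15),
  stringsAsFactors = FALSE
)

# Keep only the genre that is mentioned first
m$genreMain <- sub("[|].+", "", m$genres)

# Movies made in or after 2012 whose gross lies in the highest 5%
recent <- m[m$title_year >= 2012, ]
cutoff <- quantile(recent$gross, 0.95, na.rm = TRUE)
top <- recent[recent$gross >= cutoff, ]
print(top)

# Horizontal bar chart of the languages other than English
other <- m[m$language != "English", ]
p <- ggplot(other, aes(x = language)) +
  geom_bar() +
  coord_flip()
print(p)
```

In the real file some values of gross may be missing, in which case it can help to remove those rows first, for instance with recent[!is.na(recent$gross), ].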
3. Using the data set “nobel.csv”, which you downloaded for question 1, create visualisations in ggplot2 that can help to answer the questions below.
- Show the countries of all the Laureates of the Nobel Prize for Literature, together with the number of prizes per country. Display this information as a bar chart.
- Create a top 10 of countries which have produced the most Nobel Laureates.
- Create a scatter plot showing the ages of the laureates in the Peace category. Use the following code to calculate the ages:
n$YOB <- sub( "[-].*", "", n$Birth.Date )
n$YOB <- as.numeric( n$YOB )
n$Age <- n$Year - n$YOB
- Create scatter plots showing the ages of Nobel Laureates in all categories. Use facet_wrap, as follows: facet_wrap( ~ Category , ncol = 2 )
- Create a stacked bar chart showing the categories and the number of Nobel Prizes won by people born in the Netherlands.
- Create a line chart which indicates the historical development in Nobel Prizes won by organisations based in the USA.
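The age calculation and the facet_wrap call given above can be combined as in the following sketch. The four rows are invented for illustration; the column names Year, Category and Birth.Date are the ones used in the exercises, and the sketch assumes that Birth.Date starts with a four-digit year followed by a hyphen.

```r
library(ggplot2)

# Hypothetical stand-in for nobel.csv (invented rows)
n <- data.frame(
  Year = c(1950, 1960, 1970, 1980),
  Category = c("Peace", "Peace", "Physics", "Physics"),
  Birth.Date = c("1900-05-01", "1920-11-23", "1925-02-14", "1940-08-30"),
  stringsAsFactors = FALSE
)

# Year of birth: everything before the first hyphen
n$YOB <- as.numeric(sub("[-].*", "", n$Birth.Date))
# Approximate age in the year of the award
n$Age <- n$Year - n$YOB

# One scatter plot per category, arranged in two columns
p <- ggplot(n, aes(x = Year, y = Age)) +
  geom_point() +
  facet_wrap(~ Category, ncol = 2)
print(p)
```

To show the Peace category only, filter first with n[n$Category == "Peace", ] and leave out the facet_wrap line.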
Topic Modelling in R
4. This topic modelling exercise is based on data acquired via the API of the New York Times. Using the “Articles” API, data was collected about articles written about “The Netherlands” between 01/01/2015 and 01/04/2015. The full dataset can be downloaded here.
The file “nyt.csv” contains metadata about all the articles.
- Create a line diagram (using geom_freqpoly) which shows the total number of articles that have been written about the Netherlands on a weekly basis. This diagram can partly be based on the following code:
nyt$date <- as.Date( nyt$date )
nyt$ym <- format( nyt$date , format = "%y-%m-%d" )
p <- ggplot( nyt , aes(x= ym , group = 1 )) + geom_freqpoly( size = 1 ) + theme(axis.text.x=element_text(angle=-90))
- Create a bar chart which gives information on the section in which these articles have appeared. This information can be found in the column section_name.
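For the weekly line diagram, one possible approach is to keep the dates as Date values and let geom_freqpoly do the counting. On a Date axis the binwidth is measured in days, so a value of 7 yields one count per week. The sketch below uses six invented dates in place of nyt.csv; only the column name date is taken from the exercise.

```r
library(ggplot2)

# Hypothetical stand-in for nyt.csv: one row per article
nyt <- data.frame(
  date = c("2015-01-05", "2015-01-06", "2015-01-14",
           "2015-02-02", "2015-02-03", "2015-02-20"),
  stringsAsFactors = FALSE
)
nyt$date <- as.Date(nyt$date)

# binwidth = 7 groups the articles into bins of seven days
p <- ggplot(nyt, aes(x = date)) +
  geom_freqpoly(binwidth = 7) +
  theme(axis.text.x = element_text(angle = -90))
print(p)
```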
5. Use the file “tm.R” to create a topic model of the texts in a given corpus. To be able to run this code, you first need to install the following packages:
install.packages("tm", dependencies=TRUE)
install.packages("topicmodels", dependencies=TRUE)
The name of the folder containing all of these texts must be mentioned in the function setwd(). The program works most effectively if you supply the FULL PATH to this folder.
The R code creates a data frame named terms which lists the words that have been associated with the various topics. Can you label these topics?
Using the variable topics2, create a bar chart which displays the number of articles about each topic.
Create a dot chart (using geom_point) which clarifies the general distribution of the topics.
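The structure of topics2 depends on the code in “tm.R”, which is not reproduced here. The sketch below therefore rests on an assumption: that there is one row per article, with a column holding the number of the topic assigned to that article. The column names article and topic are hypothetical, and the real variable may be organised differently.

```r
library(ggplot2)

# Assumption: one row per article with its assigned topic number.
# The real topics2 produced by tm.R may look different.
topics2 <- data.frame(
  article = paste0("article", 1:6),
  topic = c(1, 2, 2, 3, 2, 1)
)

# Number of articles per topic
counts <- table(topics2$topic)
print(counts)

# The same counts as a bar chart; factor() keeps the topics discrete
p <- ggplot(topics2, aes(x = factor(topic))) +
  geom_bar() +
  xlab("topic")
print(p)
```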
Network visualisations can be created using the “igraph” package, among many other packages. The files in the zipped folder “iGraph.zip” illustrate the way in which this package may be used. The file “network.R” creates a network of related books, based on data extracted from Goodreads.
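As a rough idea of what such a network script involves, the sketch below builds a small graph from a list of book pairs. It is not the code from “network.R”: the book titles and the column names from and to are invented, and it assumes that the igraph package has been installed.

```r
library(igraph)

# Hypothetical edge list: each row links two related books
edges <- data.frame(
  from = c("Book A", "Book A", "Book B", "Book C"),
  to   = c("Book B", "Book C", "Book C", "Book D"),
  stringsAsFactors = FALSE
)

# Build an undirected graph; every distinct title becomes a node
g <- graph_from_data_frame(edges, directed = FALSE)

# Draw the network with slightly larger nodes and smaller labels
plot(g, vertex.size = 30, vertex.label.cex = 0.8)
```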
Mapping in R
Maps can be created using a variety of packages, including “ggmap”, “RgoogleMaps” and “maps”. The packages “ggmap” and “RgoogleMaps” are still under development, and their code is not fully compatible with all versions of R and ggplot2. For this reason, this exercise only makes use of the “maps” package.
The packages that are necessary to do these exercises can be installed using the following commands:
install.packages("maps", dependencies = TRUE )
install.packages("mapdata" , dependencies = TRUE )
The mapdata package includes about 2 million polygons representing geographic areas, including all the countries. The data set was compiled by the World Data Bank. This dataset is referred to as ‘worldHires’.
The following command plots data at the highest level, and this is essentially a plot of the entire world.
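A minimal sketch of such a world plot, assuming that the maps and mapdata packages have been installed, passes only the name of the data set to map():

```r
library(maps)
library(mapdata)   # provides the 'worldHires' data set

# With no region given, map() draws every polygon in the data set
map('worldHires')
```

Because the data set is large, drawing the full map may take a few seconds.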
To map individual countries, you need to give the name of the country as the second parameter of the map() function. The mapdata package includes a large list of pre-defined polygons which can be referenced using geographic terms. This list includes the following names.
N.B. These country names are not case sensitive.
map('worldHires' , 'France')
map('worldHires' , 'USA')
You will notice that, when you draw a map of France, this country will be shown on a very small scale. This is because R draws all French territories, including areas such as French Guiana, Guadeloupe, Martinique and Réunion.
You can limit the area that is shown on the plot by making use of the xlim and ylim parameters. These parameters set the range of longitudes and latitudes to be shown in the plot.
map('worldHires' , 'France' , xlim=c( -5 , 10 ), ylim=c( 41 , 51 ) , col="#D7DCDE" , fill = TRUE )
map('worldHires' , 'UK' , xlim=c( -11 , 3 ), ylim=c( 49 , 61 ) , col="#D7DCDE" , fill = TRUE )
You can also generate maps using only the xlim and ylim parameters, without naming a country:
map('worldHires' , xlim=c( -11 , 3 ), ylim=c( 49 , 61 ) , col="#D7DCDE" , fill = TRUE )
Locations of interest can be plotted onto the map using a csv file that gives information about the longitude and the latitude of these locations. These points can be drawn onto the map using the points() function. Labels can be added using text().
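The sketch below shows how points() and text() may be combined with the map of France used earlier. The three cities are stand-ins for the contents of a csv file, and the column names name, lon and lat are hypothetical. With a real file you would create the data frame with read.csv() instead.

```r
library(maps)
library(mapdata)

# Hypothetical locations; in practice these would be read from a csv
# file with columns for the name, longitude and latitude
loc <- data.frame(
  name = c("Paris", "Lyon", "Marseille"),
  lon  = c(2.35, 4.84, 5.37),
  lat  = c(48.86, 45.76, 43.30)
)

# Base map of mainland France
map('worldHires', 'France', xlim = c(-5, 10), ylim = c(41, 51),
    col = "#D7DCDE", fill = TRUE)

# Longitude is the x value and latitude the y value
points(loc$lon, loc$lat, pch = 19, col = "red")

# pos = 4 places each label to the right of its point
text(loc$lon, loc$lat, labels = loc$name, pos = 4, cex = 0.8)
```

Note that points() and text() add to the plot that is currently open, so they must be called after map().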