Scraping Data From Wikipedia With R

Data analysis requires data on which we can perform analysis and find insights. We don't always get clean, open data for our analysis, so we need to create or find it ourselves. In this post, I describe how to extract data from a website using R.

Author: Diwash Shrestha

Published: Nov. 5, 2019
Nepal is a Himalayan nation with beautiful mountains, ancient cultures and traditions, and a rich history. Nepal is celebrating the year 2020 as "Tourism Year", targeting 2 million international tourist arrivals. You can learn more about the #VisitNepal2020 campaign.

I wanted to find out the history of international tourist arrivals in Nepal and found the data on Wikipedia. I wanted to extract that data, analyse it, and create some visualizations and reports. This post highlights how I scraped the data using R's rvest package. rvest is an R package that makes it easy to scrape data from the web.

Let's Start

The figure above shows the Wikipedia page from which I will extract the data.

I will import the rvest package which makes it easy to scrape (or harvest) data from HTML web pages.


library(rvest)      # scraping HTML pages
library(stringr)    # string cleaning
library(tidyverse)  # data wrangling (dplyr, purrr, readr, ...)

We need the link of the page from which we want to scrape the data. I opened the wiki webpage; I will scrape the table shown below.

I will use the read_html() function, which reads and parses the HTML page.


wikipage <- read_html("https://en.wikipedia.org/wiki/Tourism_in_Nepal")
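read_html() returns the parsed page as an xml_document. A quick sanity check is to pull out the page title (a minimal sketch; the printed value depends on the live page):


wikipage %>%
  html_node("title") %>%
  html_text()
# e.g. "Tourism in Nepal - Wikipedia"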

Data Extraction

We need the CSS selector of the table from which we want to scrape the data. I opened the inspect view by right-clicking on the webpage and selecting Inspect. For the table, the CSS selector is "table.wikitable".
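To confirm that the selector matches what we expect, we can count the nodes it returns (a sketch; the count depends on the live page, which contained at least two wikitable tables when I scraped it):


wikipage %>%
  html_nodes("table.wikitable") %>%
  length()
# e.g. 2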

I passed the CSS selector to the html_nodes() function and piped the result to the html_table() function using the pipe operator (%>%). The code below extracts the arrivals table as a data frame.


table <- wikipage %>%
  html_nodes("table.wikitable") %>%
  html_table(header = TRUE)
table <- table[[1]]

# add the table to a data frame
tourist_df <- as.data.frame(table)

I will shorten the column names so that the data frame is easier to work with.


names(tourist_df) <- c("year","tourist_number","per_change")
year   tourist_number   per_change
<int>  <chr>            <chr>
1993   293,567          -12.2%
1994   326,531          +11.2%
1995   363,395          +11.3%
1996   393,613          +8.3%
1997   421,857          +7.2%
1998   463,684          +9.9%
1999   491,504          +6.0%
2000   463,646          -5.7%
2001   361,237          -22.1%
2002   275,468          -23.7%

The tourist_number column has "," as a thousands separator, which makes it a string. I removed the "," using str_remove_all() from the stringr package (str_remove() removes only the first match) and stripped the "%" sign from per_change.


# remove the thousands separator and the percent sign
tourist_df$tourist_number <- str_remove_all(tourist_df$tourist_number, ",")
tourist_df$per_change <- str_remove(tourist_df$per_change, "%")

As the extracted data were strings, I converted the columns to integer and numeric types.


tourist_df$tourist_number <- as.integer(tourist_df$tourist_number)
tourist_df$per_change <- as.numeric(tourist_df$per_change)
tourist_df$year <- as.integer(tourist_df$year)
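As an aside, parse_number() from readr (attached with the tidyverse) handles the separator, the sign and the type conversion in one step; a minimal illustration on literal values:


parse_number("293,567")   # 293567
parse_number("-12.2%")    # -12.2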

This is the final data frame after cleaning.

year   tourist_number   per_change
<int>  <int>            <dbl>
1993   293567           -12.2
1994   326531           11.2
1995   363395           11.3
1996   393613           8.3
1997   421857           7.2
1998   463684           9.9
1999   491504           6.0
2000   463646           -5.7
2001   361237           -22.1
2002   275468           -23.7

In the same way, I can extract the arrivals-by-country table. Both tables match the same selector "table.wikitable", so html_nodes() returns them together and html_table() keeps them as a list in the table object. The arrivals-by-country data is the second element, so I extract it with index 2.


table <- wikipage %>%
  html_nodes("table.wikitable") %>%
  html_table(header = TRUE)
table <- table[[2]]

# add the table to a data frame
con_tour_df <- as.data.frame(table)

I renamed the columns of the data frame.


names(con_tour_df) <- c("Rank", "Country", 2013, 2014, 2015, 2016, 2017)
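Note that the year columns now have purely numeric names, which are non-syntactic in R and must be wrapped in backticks when referenced later (a small illustration of my own, not part of the original analysis):


# select a single year column by its non-syntactic name
con_tour_df %>% select(`2017`)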
Rank   Country          2013      2014      2015     2016      2017
<int>  <chr>            <chr>     <chr>     <chr>    <chr>     <chr>
1      India            160,832   118,249   75,124   135,343   180,974
2      China            104,664   104,005   66,984   123,805   113,173
3      United States    79,146    53,645    42,687   49,830    47,355
4      United Kingdom   51,058    46,295    29,730   36,759    35,688
5      Sri Lanka        45,361    57,521    44,367   37,546    32,736
6      Thailand         39,154    26,722    32,338   33,422    40,969
7      South Korea      34,301    25,171    18,112   23,205    19,714
8      Australia        33,371    25,507    16,619   24,516    20,469
9      Myanmar          30,852    25,769    21,631   N/A       N/A
10     Germany          29,918    23,812    16,405   18,028    22,263

This is how the data looks after extracting it with rvest.

I removed the "," from every column using map() from the purrr package together with str_remove_all() from stringr.


# map() applies str_remove_all() to each column; as_tibble() rebuilds the data frame
con_tour_df <- con_tour_df %>% map(str_remove_all, ",") %>% as_tibble()
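The same step can also be written in a more dplyr-flavoured way (a sketch; mutate_all() was the current idiom at the time of writing):


# apply str_remove_all() to every column at once
con_tour_df <- con_tour_df %>% mutate_all(str_remove_all, ",")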

This is the final data frame after the cleaning step.

Rank   Country          2013     2014     2015    2016     2017
<chr>  <chr>            <chr>    <chr>    <chr>   <chr>    <chr>
1      India            160832   118249   75124   135343   180974
2      China            104664   104005   66984   123805   113173
3      United States    79146    53645    42687   49830    47355
4      United Kingdom   51058    46295    29730   36759    35688
5      Sri Lanka        45361    57521    44367   37546    32736
6      Thailand         39154    26722    32338   33422    40969
7      South Korea      34301    25171    18112   23205    19714
8      Australia        33371    25507    16619   24516    20469
9      Myanmar          30852    25769    21631   N/A      N/A
10     Germany          29918    23812    16405   18028    22263
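All columns are still character at this point. If numeric year columns are needed for further analysis, type_convert() from readr (attached with the tidyverse) can guess the types; treating the "N/A" strings as missing values is my own assumption here, a minimal sketch:


# guess column types, treating "N/A" as missing (my assumption)
con_tour_df <- con_tour_df %>% type_convert(na = c("", "NA", "N/A"))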

Finally, I have extracted the data from Wikipedia. I will analyse this data and share the insights in the next post.

Feel free to send me your feedback and suggestions regarding this post!
