WEB SCRAPING USING R: REALITY TV RATINGS

Well, Im a fan of IMDB. Its such a rich source of potential data, from user ratings, episodes, user reviews etc. I came across some awesome R code from this nice blog site Stat of Mind here, which scraped IMDB user ratings and tried to explore their relationship with actual show ratings.

From there I added my own code to graph the UserRatings over time, split by series and accompanied by a grid.arrange formatted second graph showing the number of UserVotes for each episode.

What it shows is that each season is rated differently. Over the first 15 seasons, user ratings slowly declined, with the exception of Season 7 ‘Pearl Islands’ generally considered one of the best ever. From seasons 15-20. ratings improved, to a peak for the h’Heroes v Villains’ season – featuring a full casts of all-stars and two of the most jaw-dropping tribal councils to date. Seasons 21-22 then saw a massive drop. These seasons coincided with the concept of bringing veteran players (Rob Mariano, Russel Hantz, Benjamin ‘Coach’ Wade and Oscar ‘Ozzy’ Lusth) back for their 3rd or 4th attempts with a group of newbie players. The veterans had an experience advantage and all of them made it deep except for Russel Hantz (Mariano won, Wade was second and Lusth was 4th).

After a lull from seasons 23-27, season 28 ‘Brains v Brawn v Beauty’ was seen as the best ever, with winner Tony Vlachos being one of the most memorable to date.

R Code:

<script>
'scrape_ratings' <- function(url)
{
  require(XML)
  # get HTML of url
  doc <- htmlParse(url)

  # find all tables in webpage
  tables <- readHTMLTable(doc)

  # find largest table and return as dataframe
  nrows <- unlist(lapply(tables, function(t) dim(t)&#91;1&#93;))
  df <- tables&#91;&#91;which.max(nrows)&#93;&#93;

  return(df)
}

# IMDB id of Survivor is "tt0239195"
#url <- 'http://www.imdb.com/title/tt0319931/epdate' #american idol
url <- 'http://www.imdb.com/title/tt0251497/epdate' #big brother
#url <- 'http://www.imdb.com/title/tt0239195/epdate'
series.ratingsA <- scrape_ratings(url)
series.ratings=series.ratingsA

###Data Cleaning
colnames(series.ratings)&#91;1&#93;="Ep_Number"

series.ratings$EpisodeNumber <- as.integer(rownames(series.ratings))
series.ratings$UserVotes <- as.numeric(series.ratings$UserVotes)
series.ratings$UserRatings=as.numeric(levels(series.ratings$UserRating))&#91;series.ratings$UserRating&#93;
series.ratings$s <- lapply(strsplit(as.character(series.ratings$Ep_Number), ''), function(x) which(x == "."))
series.ratings$s=as.numeric(series.ratings$s)-1
series.ratings$season <- as.factor(as.numeric(substr(series.ratings$Ep_Number,1,series.ratings$s)))
#series.ratings$season <- paste("Season", substr(series.ratings$Ep_Number,1,series.ratings$s))
series.ratings$Ep_Number = gsub('&#91;&#91;:alpha:&#93;&#93;', '', series.ratings$Ep_Number)

#PLots
library(ggplot2)
library(gridExtra)

p<-ggplot(series.ratings, aes(x=EpisodeNumber, y=UserRatings, color=season)) +
  #labs(title=main="Survivor Episode Ratings from IMDB") +
  geom_point(shape=1) +
  scale_colour_hue(l=50) + # Use a slightly darker palette than normal
  geom_smooth()    # Don't add shaded confidence region
  #+ guides(col = guide_legend(nrow = 15))
p <- p+labs(title="Big Brother Episode Ratings from IMDB")
p <-p+guides(col = guide_legend(nrow = 15))

#p+guides(col = guide_legend(nrow = 15))

#marginal density of x - plot on top
plot_bottom <-ggplot(series.ratings, aes(x=EpisodeNumber, y=UserVotes)) +
  #labs(title=main="Survivor Episode Ratings from IMDB") +
  geom_point(shape=1) +
  scale_colour_hue(l=50) + # Use a slightly darker palette than normal
  geom_smooth() 

#placeholder plot - prints nothing at all
empty <- ggplot()+geom_point(aes(1,1), colour="white") +
  theme(
    plot.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank()
  )

#arrange the plots together, with appropriate height and width for each row and column
#fullscreen size
grid.arrange(p, arrangeGrob(plot_bottom, empty, ncol=2,widths=c(11,1)),
                    ncol=1,  heights=c(3,1))
#halfscreen size
grid.arrange(p, arrangeGrob(plot_bottom, empty, ncol=2,widths=c(7,1)),
             ncol=1,  heights=c(3,1))

</script>