A Small Web Scraping Project in R

Dane Van Domelen
4 min readNov 17, 2019

--

Get S&P 500 tickers on a user-specified date

Motivation

The first step to developing a trading strategy is finding some historical data. Before you can do that, you have to decide on a “pool” of funds to include.

For example, Vanguard offers commission-free trades on their ETF’s, so you might consider developing a strategy based just on that pool of ~75 ETF’s. I did that once; it was an anti-momentum strategy, and no, it didn’t work.

Another idea is to use the 500 individual companies in the S&P 500. I think 500 is a reasonable number, and there are pros to working with big, mostly recognizable companies. It seems like there must be profitable strategies to discover in historical S&P data. Buy the newcomers, buy the hottest stock in each sector, pair-trade the newest stock to outperform the oldest, etc.

It’s easy to get a list of S&P 500 tickers as of today. There’s a Wikipedia page, which you could scrape, although it would probably be easier to just copy-paste into Excel and then load into R (or Python, etc.).

Problem is, anytime you start with a current list of stocks, you open the door to survivorship bias. None of the companies currently in the S&P have suffered catastrophic losses over the past few years, otherwise they wouldn’t be still be in the S&P.

What we really need is a list of S&P tickers at the start of the historical period we want to use to develop our strategy. This is doable thanks to Wikipedia’s Revision history, and for a one-time thing we could just navigate to the right date and copy-paste the table. But it’d be nice to have a function to do it automatically, and at least some other folks seem to be interested.

Pseudocode

Basically, the project involves two parts: (1) given a user-specified date, find the right URL; and (2) given such a URL, scrape the Wikipedia table.

The function should work something like this:

get_sp500_tickers = function(date = Today) {1. Use lookup table to find URL of closest revision on/before 'date'
2. Scrape table from that URL
3. Do basic formatting/cleaning, if necessary
4. Return list of ~500 ticker symbols
}

Development

Given URL, get tickers

I started by trying to figure out how to scrape the current Wikipedia page. After some Googling, I found a tidyverse tool: Hadley Wickham’s rvest package.

This is the code I came up with:

url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
df <- url %>%
read_html() %>%
html_node("table") %>%
html_table()

The result is a data frame, but all we need is the ticker symbols from the first column. Here’s what it looks like:

tickers <- df[[1]]
head(tickers)
[1] "MMM" "ABT" "ABBV" "ABMD" "ACN" "ATVI"

I also confirmed that the above code also works with a much older version of the page (Jan. 4, 2006).

Some basic cleaning

I noticed that Wikipedia uses periods in ticker symbols like BF.B and BRK.B, while Yahoo! Finance uses dashes (BF-B, BRK-B). The popular R package for downloading historical data, quantmod, uses Yahoo! Finance, so I think it’s a good idea to replace the .’s with -’s. That’s easy:

gsub("[.]", "-", x = tickers)

Given date, get URL

The other thing we have to figure out is how to go from a user-specified date to the appropriate page from Wikipedia’s Revision history.

I found a solution that isn’t particularly elegant, but that works. If you click around on the Revision history, you can display 500 revisions (date and URL) at a time:

I just copied the first URL, clicked “older 500” and copied that one, and so on for a total of 4 pages.

For each page, to scrape all of the links and underlying URL’s:

url.page1 <- "https://en.wikipedia.org/w/index.php?title=List_of_S%26P_500_companies&offset=&limit=500&action=history"page1 <- url.page1 %>%
read_html()
linktext <- page1 %>%
html_nodes("a") %>%
html_text()
urls <- page1 %>%
html_nodes("a") %>%
html_attr("href")
lookup <- data.frame(Date = linktext, URL = urls,
stringsAsFactors = FALSE)

There are a total of 3,888 links/URL’s, but we only want to keep the ones with dates in their names. This particular page only has 2018 and 2019 revisions, so we can run:

lookup <- lookup[grep("2019|2018", lookup$linktext), ]

Both columns of our lookup table actually need a bit of cleanup yet, as you can see here:

head(lookup$Date, 1)
[1] "17:26, 9 November 2019"
head(lookup$URL, 1)
[1] "/w/index.php?title=List_of_S%26P_500_companies&oldid=925366750"

Which we can accomplish with:

lookup$Date <- as.Date(sapply(lookup$Date, function(x) {
strsplit(x, ", ")[[1]][2])
}, format = "%d %b %Y")
lookup$URL <- paste("https://en.wikipedia.org", lookup$URL,
sep = "")

I ran the same code for the other 3 pages as well (not shown here), obtaining a 2-column lookup table which we can use in our function.

Putting it together

Now that we have the constituent pieces, the R function isn’t hard to write:

get_sp500_tickers <- function(date = Sys.Date()) {  # Get most recent date on or before 'date'
df <- lookup %>%
dplyr::filter(Date <= as.Date(date)) %>%
dplyr::slice(1)
# Get ticker symbols
tickers <- (df$URL %>%
read_html() %>%
html_node("table") %>%
html_table())[[1]]
# Replace .'s with -'s for compatibility with Yahoo! Finance
gsub("[.]", "-", x = tickers)
}

Demo

Here’s a short video showing that the function works as expected:

Having confirmed it works, I added it to my R package stocks, which is available on GitHub.

I’ll end with a simple example of something practical you could do with this new function. To plot compound annualized growth rate vs. max drawdown for 10 randomly selected S&P stocks starting on Jan. 1, 2010:

tickers <- get_sp500_tickers("2010-01-01")
plot_metrics(formula = cagr ~ mdd, tickers = sample(tickers, 10),
from = "2010-01-01")

--

--