
Scraping HTML Table Data in Clojure for Profit

A guest post by Timothy Pratley, who currently works for Tideworks Technology as a Development Manager building Traffic Control software for logistics clients including SSA Marine, CSX, and BNSF.

There is a vast amount of useful data on the Web in HTML tables. When you are ready to work with this data in your favorite programming language, you might consider manually copying and pasting it into files, then writing custom readers, and so forth. Follow along in this post to discover an easier way to pull the data into Clojure.

We will start by investigating the effects of PE and Shiller PE on market movements. The data is available in tables at multpl.com. In order to process the data we first need to screen scrape it. The full source code used in this post is available here: https://github.com/timothypratley/tablescrape. Although it isn’t required, you may also want to read Timothy’s Stock Simulation in Clojure post.

NOTE: The statements made in this post are not investment advice.

Getting the data

The data we are interested in working with is here: http://www.multpl.com/table?f=m.

Using the browser to view the page source, you can see the table we are most interested in:

   <table id="datatable">

The enlive library allows us to use CSS select patterns to pull out the element we want (the data table):

(select (html-resource
         (java.net.URL. "http://www.multpl.com/table?f=m"))
        [:table#datatable])
=> A great big sequence of maps with nested sequences and maps inside it
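
These examples assume enlive’s functions have been referred into the current namespace; a minimal sketch (the namespace name here is just for illustration):

(ns tablescrape.core
  (:require [net.cgrand.enlive-html :refer [html-resource select]]))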

First off, let’s temporarily store the result so we can examine it more closely:

(def t
  (select (html-resource
           (java.net.URL. "http://www.multpl.com/table?f=m"))
          [:table#datatable]))

Now let’s pretty-print it to the console so that we can view the structure:

(clojure.pprint/pprint t)

I have edited the output for brevity. If you try this for yourself you will see there is a lot more printed out, but the indenting makes the structure clear:

({:tag :table,
  :content
  ({:tag :tr,
    :content
    ({:tag :td, :content ("Jul 1, 2013")}
     {:tag :td, :content ("19.16")})
...

For every TR (table row) we want to get every TD (table data):

(def rows
  (for [tr (select t [:tr])
        :let [row (select tr [:td])]
        :when (seq row)]
    row))
(clojure.pprint/pprint rows)
=>
(({:tag :td, :content ("Jul 1, 2013")}
  {:tag :td, :content ("19.16")})
 ({:tag :td, :content ("Jun 1, 2013")}
  {:tag :td, :content ("18.46")})
...

We have the rows of data, so now we need to process them into something more convenient.

Extracting the data series

To get at the actual data, we extract the text content of each cell:

(defn contents [x]
  (map (comp first :content) x))
(map contents rows)
=> (("Aug 30, 2013" "18.62") ("Jul 1, 2013" "19.16")

We want dates and numbers, not strings, so we will need to parse them. We define three functions:

  1. date-parser
  2. parse-number
  3. parse-money

The index values are not money, but they do contain commas. We need to strip those out, hence the parse-money function, which ignores non-numeric symbols.
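
The post uses these parsers without showing their definitions; here is a minimal sketch consistent with how they are called below (the implementations in the tablescrape repo may differ):

(defn date-parser
  "Returns a function that parses a date string using the given pattern."
  [fmt]
  (fn [s]
    (.parse (java.text.SimpleDateFormat. fmt) s)))

(defn parse-number [s]
  (Double/parseDouble s))

(defn parse-money
  "Drops anything that is not a digit, decimal point, or minus sign,
  so \"1,234.56\" and \"$1,234.56\" both parse to 1234.56."
  [s]
  (parse-number (.replaceAll ^String s "[^0-9.-]" "")))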

At this point we have everything we need to extract the data by parsing “dates” and “money”. But can we make a generic table scraper that will handle different data types? Of course! To do so we will need to pass in a vector of parsing functions that correspond to the table data elements.

We want to specify:

[(date-parser "MMM dd, yyyy") parse-money]

And then call those functions for every element in the row. But to support ignoring columns (by specifying a nil function in that position) we will need to check for the existence of a function. This is easiest to express with a list comprehension:

(defn- parse-fs [fs ds]
  (for [[f d] (map vector fs ds)
        :when f]
    (f d)))

This pairs the functions with data-elements, but only calls the function to produce a result when the function is not nil.
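
For example, a nil in the first position skips the date column (using parse-money from the sketch above on a sample row):

(parse-fs [nil parse-money] ["Jul 1, 2013" "19.16"])
;=> (19.16)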

To parse a table given some parsing functions, we call parse-fs on every row:

(defn- parse-table
  [table fs]
  (for [tr (select table [:tr])
        :let [row (contents (select tr [:td]))]
        :when (seq row)]
    (parse-fs fs row)))

Now we make a function that takes the URL for the page we want to scrape, the table selector, and the parse functions:

(defn scrape-table
  [url selector fs]
  (parse-table
   (select
    (html-resource (java.net.URL. url))
    selector)
   fs))

(scrape-table "http://www.multpl.com/table?f=m"
              [:table#datatable]
              [(date-parser "MMM dd, yyyy") parse-money])
=> ((# 18.62)
    (# 19.16)
    ...

We have our data (each # is a parsed date object, elided in the printed output), and a generic way of scraping any table.

Processing to Determine Performance Given Price to Earnings

Given that we want to pull several tables from multpl, I added a convenience function that takes the route to the table we want. Let’s pull out the S&P 500 price as sap, the PE ratio as pe, and the Shiller PE as spe. PE is the price-to-earnings ratio, one measure investors use to judge whether a security is “cheap” or “expensive”. The PE ratio is essentially the inverse of the earnings yield, so a high earnings yield, which is generally a good thing, corresponds to a low PE ratio.
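
A sketch of that convenience function, built on scrape-table (the repo’s version may differ):

(defn multpl
  "Scrapes a date/value table from multpl.com given the page route."
  [route]
  (scrape-table (str "http://www.multpl.com/" route)
                [:table#datatable]
                [(date-parser "MMM dd, yyyy") parse-money]))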

(def sap (multpl "s-p-500-price/table?f=m"))
(def pe (multpl "table?f=m"))
(def spe (multpl "shiller-pe/table?f=m"))
(map count [sap pe spe])
=> (1712 1712 1592)

The first thing to notice is that the series are of different lengths. This makes sense because the Shiller PE is calculated using average inflation-adjusted earnings from the previous 10 years, so its series starts 120 months later (1712 - 1592 = 120, exactly ten years of monthly data). If we find the minimum count m and take the same number of values from each series, they should be aligned. We should, however, check that the dates match up:

(def m (apply min (map count [sap pe spe])))
(apply = (map #(take m (map first %)) [sap pe spe]))
=> true

Yes they do. So now we plot all three on a chart to see what they look like:

[Figure 1: the S&P 500 price, PE, and Shiller PE series overlaid]

Unfortunately this does not give us much insight into the predictive power of PE ratios. We need to dig a little further. One thing we notice is that the Shiller PE is less volatile than the plain PE, as we would expect, because it averages over a longer period. This is more obvious if we examine the distributions explicitly:

[Figure 2: distribution of PE values]

[Figure 3: distribution of Shiller PE values]

What we want to know is, if we invest when the PE is low, will we make money? To explore this question we must calculate the gains made when investing at low and high PE ratios.

Let’s define a function called series that strips off the dates, keeping just the values:

(series sap)
=> (153.5 151.55 153.27 151.19 ...
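
Each datum is a date/value pair, so series only needs to keep the second element of each (a sketch; the repo’s version may differ):

(defn series [data]
  (map second data))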

From this we can calculate the gains made from investing on that particular date:

(def ss (series sap))
(def monthly-gains (map / (rest ss) ss))
(def yearly-gains (map / (drop 12 ss) ss))
(def five-year-gains (map / (drop 60 ss) ss))
(mean five-year-gains)
=> 1.175603693981667

Values are monthly, and there are 60 months in 5 years, so we divide the future price by the current price to get the gain. (Note this assumes the series runs in chronological order, oldest first; if the scrape returns the most recent value first, reverse it before computing gains.)

To make sure we are on the right track, let us look at how those gains are distributed:

[Figure 4: distribution of monthly gains]

[Figure 5: distribution of yearly gains]

[Figure 6: distribution of five-year gains]

These distributions are in line with what we would expect from holding an investment for a set period. Over a five-year investment you rarely lose half your money and rarely double it, but most likely earn a return of 0.7 to 1.5 times your investment. So far so good. Now we want to group the returns by the PE at the time of investment.

We find the division points for separating each series into eight equal-sized bins like so:

(def bins [0.125 0.25 0.375 0.5 0.625 0.75 0.875])
(quantile (series pe) :probs bins)
=> (9.34 11.63 13.44 14.91 16.43 17.925 20.62)
(quantile (series spe) :probs bins)
=> (9.68 11.69 13.9 15.89 17.65 19.86 22.89)

This tells us that 12.5% of the recorded Shiller PE ratios fall below 9.68; a further 12.5% fall between 9.68 and 11.69; a further 12.5% between 11.69 and 13.9, and so on. These divisions will be our “bins” for evaluating the gains made by investing when the PE falls inside one of these quantiles. We could select any set of bins we wanted; there is nothing special about these. They simply define the PE ranges whose subsequent investment performance we want to compare.

In the charts below I have adjusted each return to an Annual Percentage Rate, so that we can compare 1-month gains with 30-year gains.
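
The conversion is along these lines, where apr is a hypothetical helper (a sketch, assuming a gain expressed as a ratio over a holding period in months): raise the gain to the power of 12 divided by the number of months, then express it as a percentage.

(defn apr
  "Converts a gain ratio over a holding period of some months
  to an annual percentage rate."
  [gain months]
  (* 100.0 (- (Math/pow gain (/ 12.0 months)) 1.0)))

(apr 1.1756 60)
;=> roughly 3.29 (a 17.56% five-year gain is about 3.3% per year)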

[Figure 7: annualized returns by PE bin and holding period]

[Figure 8: annualized returns by Shiller PE bin and holding period]

The periods are in months, and there does seem to be an effect here. High PE values do lead to negative performance, and low PE values do lead to positive returns. But the effect is very, very mild. Before we rush off to our brokers to start placing bets, we should take a look at the possible outcomes, not just the average returns. To do this we will plot the 12 month returns for the upper and lower bins:

[Figure 9: distributions of 12-month returns for the lowest and highest PE bins]

The most important thing to note is that the distributions of the first and last bin performance overlap significantly! This tells us that even though the average case suggests there is a real effect, anything can happen. The market can go in the opposite direction to the PE indicator.

[Figure 10: distributions of 12-month returns for the lowest and highest Shiller PE bins]

Note how the Shiller version also has a wide overlap.

Conclusion

In this post we found an easy and generic way to scrape data from HTML tables using enlive, and we were able to pull in some financial data series and plot them with Incanter. Clojure is an excellent language for general programming and data transformation, making it convenient to perform this kind of ad hoc analysis. There are no restrictions on how you gather or process your data, and Clojure has many convenient libraries that each do one thing very well.

Be sure to look at the Clojure resources that you can find in Safari Books Online.

Safari Books Online has the content you need

Clojure Inside Out is a video where you’ll not only learn how to tackle practical problems with this functional language, but you’ll learn how to think in Clojure—and why you should want to. Neal Ford (software architect and meme wrangler at ThoughtWorks) and Stuart Halloway (CEO of Relevance, Inc.) show you what makes programming with Clojure swift, surgical, and accurate.
Clojure Programming helps you learn the fundamentals of Clojure with examples relating it to the languages you already know, whether you’re focused on data modeling, concurrency and parallelism, web programming, or statistics and data analysis.
The Joy of Clojure goes beyond the syntax and shows how to write fluent, idiomatic Clojure code. You will learn to approach programming challenges from a functional perspective and master the Lisp techniques that make Clojure so elegant and efficient. This book will help you think about problems the “Clojure way,” and recognize when you simply need to change the way you program.
Practical Clojure is the first definitive reference for the Clojure language, providing both an introduction to functional programming in general and a more specific introduction to Clojure’s features. This book demonstrates the use of the language through examples, including features such as STM and immutability, which may be new to programmers coming from other languages.

About the author

Timothy Pratley currently works for Tideworks Technology as a Development Manager building Traffic Control software for logistics clients including SSA Marine, CSX, and BNSF. He can be reached at timothypratley@gmail.com.

About Safari Books Online

Safari Books Online is an online learning library that provides access to thousands of technical, engineering, business, and digital media books and training videos. Get the latest information on topics like Windows 8, Android Development, iOS Development, Cloud Computing, HTML5, and so much more – sometimes even before the book is published or on bookshelves. Learn something new today with a free subscription to Safari Books Online.