3.9 STOXPRED: Stock Market Prediction As A Service

Far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the Galaxy lies a small unregarded yellow sun.

Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue-green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.

This planet has — or rather had — a problem, which was this: most of the people living on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small green pieces of paper, which is odd because it wasn’t the small green pieces of paper that were unhappy.

Douglas Adams, The Hitch Hiker’s Guide to the Galaxy

Valuable services on the Internet are usually not implemented as mobile agents. There are much simpler ways of implementing services. All Unix systems provide, for example, the cron service. Unix system users can write a list of tasks to be done each day, each week, twice a day, or just once. The list is entered into a file named crontab. For example, to distribute a newsletter on a daily basis this way, use cron for calling a script each day early in the morning:

# run at 8 am on weekdays, distribute the newsletter
0 8 * * 1-5   $HOME/bin/daily.job >> $HOME/log/newsletter 2>&1

The script first looks for interesting information on the Internet, assembles it in a nice form and sends the results via email to the customers.

The following is an example of a primitive newsletter on stock market prediction. It is a report which first tries to predict the change of each share in the Dow Jones Industrial Index for the particular day. Then it mentions some especially promising shares as well as some shares which look remarkably bad on that day. The report ends with the usual disclaimer which tells every child not to try this at home and hurt anybody.

Good morning Uncle Scrooge,

This is your daily stock market report for Monday, October 16, 2000.
Here are the predictions for today:

        AA      neutral
        GE      up
        JNJ     down
        MSFT    neutral
        …
        UTX     up
        DD      down
        IBM     up
        MO      down
        WMT     up
        DIS     up
        INTC    up
        MRK     down
        XOM     down
        EK      down
        IP      down

The most promising shares for today are these:

        INTC            http://biz.yahoo.com/n/i/intc.html

The stock shares to avoid today are these:

        EK              http://biz.yahoo.com/n/e/ek.html
        IP              http://biz.yahoo.com/n/i/ip.html
        DD              http://biz.yahoo.com/n/d/dd.html
        …

The script as a whole is rather long. In order to ease the pain of studying other people’s source code, we have broken the script up into meaningful parts which are invoked one after the other. The basic structure of the script is as follows:

BEGIN {
  Init()
  ReadQuotes()
  CleanUp()
  Prediction()
  Report()
  SendMail()
}

The earlier parts store data in variables and arrays which are subsequently used by later parts of the script. The Init() function first checks if the script is invoked correctly (without any parameters). If not, it informs the user of the correct usage. What follows are preparations for the retrieval of the historical quote data. The ticker symbols of the 30 stocks are stored in the array name, and the current date is stored in day, month, and year.

All users who are separated from the Internet by a firewall and have to direct their Internet accesses to a proxy must supply the name of the proxy to this script with the ‘-v Proxy=name’ option, and with ‘-v ProxyPort=number’ if the proxy does not listen on port 80. Without these options, the script connects directly to chart.yahoo.com on port 80, which should suffice for most users.

function Init() {
  if (ARGC != 1) {
    print "STOXPRED - daily stock share prediction"
    print "IN:\n    no parameters, nothing on stdin"
    print "PARAM:\n    -v Proxy=MyProxy -v ProxyPort=80"
    print "OUT:\n    commented predictions as email"
    print "JK 09.10.2000"
    exit
  }
  # Remember ticker symbols from Dow Jones Industrial Index
  StockCount = split("AA GE JNJ MSFT AXP GM JPM PG BA HD KO \
    SBC C HON MCD T CAT HWP MMM UTX DD IBM MO WMT DIS INTC \
    MRK XOM EK IP", name);
  # Remember the current date as the end of the time series
  day   = strftime("%d")
  month = strftime("%m")
  year  = strftime("%Y")
  if (Proxy     == "")  Proxy     = "chart.yahoo.com"
  if (ProxyPort ==  0)  ProxyPort = 80
  YahooData = "/inet/tcp/0/" Proxy "/" ProxyPort
}
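
The proxy defaults at the end of Init() can be tried in isolation. The following is a minimal sketch, not part of the original script: the helper name Endpoint() is hypothetical, and it only builds the special file name that gawk would later open. It does not touch the network.

```awk
# Sketch: build the gawk TCP special file name the way Init() does.
# Endpoint() is a hypothetical helper, not part of the original script.
function Endpoint(proxy, port) {
  if (proxy == "") proxy = "chart.yahoo.com"   # default host from Init()
  if (port  == 0)  port  = 80                  # default HTTP port
  return "/inet/tcp/0/" proxy "/" port
}

BEGIN {
  # Proxy and ProxyPort stay unset unless given with -v on the command line
  print Endpoint(Proxy, ProxyPort)
}
```

Saved under a file name of your choice, say endpoint.awk, the sketch can be run with ‘gawk -v Proxy=MyProxy -v ProxyPort=8080 -f endpoint.awk’ to see how the two options change the result.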

There are two really interesting parts in the script. One is the function which reads the historical stock quotes from an Internet server. The other is the one that does the actual prediction. In the following function we see how the quotes are read from the Yahoo server. The data which comes from the server is in CSV format (comma-separated values):

Date,Open,High,Low,Close,Volume
9-Oct-00,22.75,22.75,21.375,22.375,7888500
6-Oct-00,23.8125,24.9375,21.5625,22,10701100
5-Oct-00,24.4375,24.625,23.125,23.50,5810300

Each line holds the values for one trading day; the columns are separated by commas and contain the kind of data described in the header (first) line. First, gawk is instructed to separate columns by commas (‘FS = ","’). In the loop that follows, a connection to the Yahoo server is first opened, then a download takes place, and finally the connection is closed. All this happens once for each ticker symbol. In the body of this loop, an Internet address is built up as a string according to the rules of the Yahoo server. The start and end dates share the same day and month: the end date is today, and the start date lies exactly one year earlier. All the action is initiated within the printf command which transmits the request for data to the Yahoo server.

In the inner loop, the server’s data is first read and then scanned line by line. Only lines which have six columns and the name of a month in the first column contain relevant data. This data is stored in the two-dimensional array quote; one dimension being time, the other being the ticker symbol. During retrieval of the first stock’s data, the calendar names of the time instances are stored in the array days because we need them later.

function ReadQuotes() {
  # Retrieve historical data for each ticker symbol
  FS = ","
  for (stock = 1; stock <= StockCount; stock++) {
    URL = "http://chart.yahoo.com/table.csv?s=" name[stock] \
          "&a=" month "&b=" day   "&c=" year-1 \
          "&d=" month "&e=" day   "&f=" year \
          "&g=d&q=q&y=0&z=" name[stock] "&x=.csv"
    # pass URL as an argument so it is never interpreted as a format
    printf("GET %s HTTP/1.0\r\n\r\n", URL) |& YahooData
    while ((YahooData |& getline) > 0) {
      if (NF == 6 && $1 ~ /Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec/) {
        if (stock == 1)
          days[++daycount] = $1;
        quote[$1, stock] = $5
      }
    }
    close(YahooData)
  }
  FS = " "
}
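
The line filter used in ReadQuotes() can be tried without a network connection. The following is a minimal sketch under stated assumptions: ParseQuotes() is a hypothetical helper that takes the server's answer as one string, and the sample text is the CSV excerpt shown earlier. It uses split() instead of changing FS, but applies the same test (six columns, a month name in the first one).

```awk
# Sketch: apply the ReadQuotes() line filter to canned CSV text.
# ParseQuotes() and its arguments are hypothetical; no network is used.
function ParseQuotes(text, stock,    lines, n, i, f) {
  n = split(text, lines, "\n")
  for (i = 1; i <= n; i++) {
    # keep only lines with six columns and a month name in the date
    if (split(lines[i], f, ",") == 6 &&
        f[1] ~ /Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec/) {
      if (stock == 1)
        days[++daycount] = f[1]      # remember the dates of the first stock
      quote[f[1], stock] = f[5]      # fifth column is the closing price
    }
  }
}

BEGIN {
  sample = "Date,Open,High,Low,Close,Volume\n" \
           "9-Oct-00,22.75,22.75,21.375,22.375,7888500\n" \
           "6-Oct-00,23.8125,24.9375,21.5625,22,10701100\n" \
           "5-Oct-00,24.4375,24.625,23.125,23.50,5810300\n"
  ParseQuotes(sample, 1)
  for (d = 1; d <= daycount; d++)
    print days[d], quote[days[d], 1]
}
```

The header line is skipped because its first column contains no month name, so only the three data lines end up in quote.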

Now that we have the data, it can be checked once again to make sure that no individual stock is missing or invalid, and that all the stock quotes are aligned correctly: a day is kept only if every stock has a quote for it. Furthermore, we renumber the time instances. The most recent day gets day number 1 and all other days get consecutive numbers. All quotes are rounded to the nearest whole number in US Dollars.

function CleanUp() {
  # clean up time series; eliminate incomplete data sets
  for (d = 1; d <= daycount; d++) {
    for (stock = 1; stock <= StockCount; stock++)
      if (! ((days[d], stock) in quote))
          stock = StockCount + 10
    if (stock > StockCount + 1)
        continue
    datacount++
    for (stock = 1; stock <= StockCount; stock++)
      data[datacount, stock] = int(0.5 + quote[days[d], stock])
  }
  delete quote
  delete days
}
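
To see the alignment step at work, here is a minimal sketch on made-up data for two stocks and three days, one of which lacks a quote for the second stock. Align() is a hypothetical variant of CleanUp() that uses a flag instead of the loop-counter trick; the quotes are invented for illustration.

```awk
# Sketch: align two toy time series the way CleanUp() does.
# Align() is a hypothetical variant; all data below is made up.
function Align(    d, stock, complete) {
  for (d = 1; d <= daycount; d++) {
    complete = 1
    for (stock = 1; stock <= StockCount; stock++)
      if (! ((days[d], stock) in quote))
        complete = 0                 # some stock lacks a quote on this day
    if (! complete)
      continue                       # drop the whole day
    datacount++
    for (stock = 1; stock <= StockCount; stock++)
      data[datacount, stock] = int(0.5 + quote[days[d], stock])
  }
}

BEGIN {
  StockCount = 2
  daycount = split("9-Oct-00 6-Oct-00 5-Oct-00", days, " ")
  quote["9-Oct-00", 1] = 22.375; quote["9-Oct-00", 2] = 57.5
  quote["6-Oct-00", 1] = 22                 # stock 2 is missing here
  quote["5-Oct-00", 1] = 23.5;   quote["5-Oct-00", 2] = 56.25
  Align()
  for (d = 1; d <= datacount; d++)
    print d, data[d, 1], data[d, 2]
}
```

Only two of the three days survive, and they are renumbered 1 and 2 with the most recent day first.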

Now we have arrived at the second really interesting part of the whole affair. What we present here is a very primitive prediction algorithm: If a stock fell yesterday, assume it will also fall today; if it rose yesterday, assume it will rise today. (Feel free to replace this algorithm with a smarter one.) If a stock changed in the same direction on two consecutive days, this is an indication which should be highlighted. Two-day advances are stored in hot and two-day declines in avoid.

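The rest of the function is a sanity check. It counts the number of correct predictions in relation to the total number of predictions one could have made in the year before. A prediction counts as correct when a stock moved in the same direction (up, down, or unchanged) on two consecutive days.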

function Prediction() {
  # Predict each ticker symbol by prolonging yesterday's trend
  for (stock = 1; stock <= StockCount; stock++) {
    if         (data[1, stock] > data[2, stock]) {
      predict[stock] = "up"
    } else if  (data[1, stock] < data[2, stock]) {
      predict[stock] = "down"
    } else {
      predict[stock] = "neutral"
    }
    if ((data[1, stock] > data[2, stock]) && (data[2, stock] > data[3, stock]))
      hot[stock] = 1
    if ((data[1, stock] < data[2, stock]) && (data[2, stock] < data[3, stock]))
      avoid[stock] = 1
  }
  # Do a plausibility check: how many predictions proved correct?
  for (s = 1; s <= StockCount; s++) {
    for (d = 1; d <= datacount-2; d++) {
      if         (data[d+1, s] > data[d+2, s]) {
        UpCount++
      } else if  (data[d+1, s] < data[d+2, s]) {
        DownCount++
      } else {
        NeutralCount++
      }
      if (((data[d, s]  > data[d+1, s]) && (data[d+1, s]  > data[d+2, s])) ||
          ((data[d, s]  < data[d+1, s]) && (data[d+1, s]  < data[d+2, s])) ||
          ((data[d, s] == data[d+1, s]) && (data[d+1, s] == data[d+2, s])))
        CorrectCount++
    }
  }
}
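
The prediction rule and the plausibility check can be followed by hand on a short series. The following is a minimal sketch on made-up closing prices for a single stock; Trend() and HitRate() are hypothetical helpers that restate the comparisons made in Prediction().

```awk
# Sketch: trend-following rule and its hit rate on one made-up series.
# Trend() and HitRate() are hypothetical helpers, not from the script.
function Trend(newer, older) {
  if (newer > older) return "up"
  if (newer < older) return "down"
  return "neutral"
}

# series[1] is the most recent close, series[n] the oldest one
function HitRate(series, n,    d, hits, total) {
  for (d = 1; d <= n - 2; d++) {
    total++
    # the forecast for day d repeats the trend seen on day d+1
    if (Trend(series[d], series[d+1]) == Trend(series[d+1], series[d+2]))
      hits++
  }
  return total ? int(100 * hits / total) : 0
}

BEGIN {
  n = split("25 24 22 23 23 21", close_price, " ")
  print "forecast for today:", Trend(close_price[1], close_price[2])
  print "hit rate:", HitRate(close_price, n) "%"
}
```

With six closing prices there are four days for which a forecast can be checked. Returning 0 when no forecast can be checked is a choice made for this sketch; it avoids a division by zero on very short series.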

At this point the hard work has been done: the array predict contains the predictions for all the ticker symbols. It is up to the function Report() to find some nice words to present the desired information.

function Report() {
  # Generate report
  report =        "\nThis is your daily "
  report = report "stock market report for "strftime("%A, %B %d, %Y")".\n"
  report = report "Here are the predictions for today:\n\n"
  for (stock = 1; stock <= StockCount; stock++)
    report = report "\t" name[stock] "\t" predict[stock] "\n"
  for (stock in hot) {
    if (HotCount++ == 0)
      report = report "\nThe most promising shares for today are these:\n\n"
    report = report "\t" name[stock] "\t\thttp://biz.yahoo.com/n/" \
      tolower(substr(name[stock], 1, 1)) "/" tolower(name[stock]) ".html\n"
  }
  for (stock in avoid) {
    if (AvoidCount++ == 0)
      report = report "\nThe stock shares to avoid today are these:\n\n"
    report = report "\t" name[stock] "\t\thttp://biz.yahoo.com/n/" \
      tolower(substr(name[stock], 1, 1)) "/" tolower(name[stock]) ".html\n"
  }
  report = report "\nThis sums up to " HotCount+0 " winners and " AvoidCount+0
  report = report " losers. When using this kind\nof prediction scheme for"
  report = report " the 12 months which lie behind us,\nwe get " UpCount
  report = report " 'ups' and " DownCount " 'downs' and " NeutralCount
  report = report " 'neutrals'. Of all\nthese " UpCount+DownCount+NeutralCount
  report = report " predictions " CorrectCount " proved correct next day.\n"
  report = report "A success rate of "\
             int(100*CorrectCount/(UpCount+DownCount+NeutralCount)) "%.\n"
  report = report "Random choice would have produced a 33% success rate.\n"
  report = report "Disclaimer: Like every other prediction of the stock\n"
  report = report "market, this report is, of course, complete nonsense.\n"
  report = report "If you are stupid enough to believe these predictions\n"
  report = report "you should visit a doctor who can treat your ailment."
}
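
The two loops over hot and avoid in Report() differ only in their heading. As a minimal sketch of how that part could be factored out, the following uses two hypothetical helpers, NewsURL() and ListShares(); the URL pattern is the one used in Report(), and the three ticker symbols are just a shortened example list.

```awk
# Sketch: build one highlighted section of the report.
# NewsURL() and ListShares() are hypothetical helpers.
function NewsURL(symbol,    s) {
  s = tolower(symbol)
  return "http://biz.yahoo.com/n/" substr(s, 1, 1) "/" s ".html"
}

# Return heading plus one line per stock in set, or "" if set is empty
function ListShares(heading, set,    stock, text) {
  for (stock in set)
    text = text "\t" name[stock] "\t\t" NewsURL(name[stock]) "\n"
  return (text == "") ? "" : "\n" heading "\n\n" text
}

BEGIN {
  split("AA GE INTC", name, " ")
  hot[3] = 1                         # pretend INTC rose two days in a row
  printf "%s", ListShares("The most promising shares for today are these:", hot)
}
```

As in the original loops, the order of the listed shares is whatever order ‘for (stock in set)’ happens to deliver.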

The function SendMail() goes through the list of customers and opens a pipe to the mail command for each of them. Each one receives an email message with a proper subject heading and is addressed by full name.

function SendMail() {
  # send report to customers
  customer["uncle.scrooge@ducktown.gov"] = "Uncle Scrooge"
  customer["more@utopia.org"           ] = "Sir Thomas More"
  customer["spinoza@denhaag.nl"        ] = "Baruch de Spinoza"
  customer["marx@highgate.uk"          ] = "Karl Marx"
  customer["keynes@the.long.run"       ] = "John Maynard Keynes"
  customer["bierce@devil.hell.org"     ] = "Ambrose Bierce"
  customer["laplace@paris.fr"          ] = "Pierre Simon de Laplace"
  for (c in customer) {
    # the trailing space separates the quoted subject from the address
    MailPipe = "mail -s 'Daily Stock Prediction Newsletter' " c
    print "Good morning " customer[c] "," | MailPipe
    print report "\n.\n" | MailPipe
    close(MailPipe)
  }
}
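
Before mailing real customers, it can help to look at the command line that would be handed to the shell. The following is a minimal dry-run sketch: MailCommand() is a hypothetical helper, it assumes a mail program that accepts a subject with -s, and nothing is actually sent.

```awk
# Sketch: build the mail command for one recipient without sending anything.
# MailCommand() is a hypothetical helper; it assumes a mail(1) that accepts -s.
function MailCommand(subject, address) {
  # single quotes keep subject and address as one shell word each;
  # this simple quoting assumes neither contains a single quote
  return "mail -s '" subject "' '" address "'"
}

BEGIN {
  customer["uncle.scrooge@ducktown.gov"] = "Uncle Scrooge"
  for (c in customer) {
    cmd = MailCommand("Daily Stock Prediction Newsletter", c)
    print "would run: " cmd
    print "first line: Good morning " customer[c] ","
  }
}
```

To send for real, the returned string would be used as the target of ‘print ... | cmd’ and then closed with close(cmd), just as SendMail() does with MailPipe.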

Be patient when running the script by hand. Retrieving the data for all the ticker symbols and sending the emails may take several minutes to complete, depending upon network traffic and the speed of the available Internet link. The quality of the prediction algorithm is likely to be disappointing. Try to find a better one. Should you find one with a success rate of more than 50%, please tell us about it! It is only for the sake of curiosity, of course. :-)