3.2 GETURL: Retrieving Web Pages

GETURL is a versatile building block for shell scripts that need to retrieve files from the Internet. It takes a web address as its only command-line argument and tries to retrieve the contents at that address. The contents are printed to standard output, and the header is printed to /dev/stderr. A surrounding shell script could analyze the contents and extract the text or the links, and an ASCII browser could be written around GETURL. More interestingly, web robots are straightforward to write on top of it. Several programs of the same name that do the same job can be found on the Internet; they are usually much more complex internally and at least 10 times as big.
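
As an illustration of such post-processing, here is a minimal sketch of a link extractor that reads a page on standard input. It is not part of GETURL: the function name extract_links() and the file names used below are hypothetical, and the pattern only recognizes double-quoted href attributes that sit on a single line.

```awk
# links.awk - hypothetical helper, not part of GETURL.
# extract_links() stores every double-quoted href value found in
# html into links[1..n] and returns n.
function extract_links(html, links,    n, attr) {
    n = 0
    while (match(html, /[Hh][Rr][Ee][Ff][ \t]*=[ \t]*"[^"]*"/)) {
        attr = substr(html, RSTART, RLENGTH)
        html = substr(html, RSTART + RLENGTH)   # continue after this match
        sub(/^[^"]*"/, "", attr)                # drop: href="
        sub(/"$/, "", attr)                     # drop the closing quote
        links[++n] = attr
    }
    return n
}

# Print the links found on each input line, one per output line.
{
    n = extract_links($0, found)
    for (i = 1; i <= n; i++)
        print found[i]
}
```

Assuming GETURL is stored in a file named geturl.awk and a proxy is reachable under the name MyProxy, the two programs could be combined as: gawk -v Proxy=MyProxy -f geturl.awk http://www.example.com/ 2>/dev/null | gawk -f links.awk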

First, GETURL checks that it was called with exactly one web address. It then checks whether the user named a proxy server in the variable Proxy (and, optionally, its port in ProxyPort). By default, the local machine (127.0.0.1, port 80) is assumed to act as the proxy. GETURL uses the GET method to access the web page unless the name of a different method, such as HEAD, is passed in the variable Method. With the HEAD method, the user receives the header but not the body of the page:

BEGIN {
  if (ARGC != 2) {
    print "GETURL - retrieve Web page via HTTP 1.0"
    print "IN:\n    the URL as a command-line parameter"
    print "PARAM(S):\n    -v Proxy=MyProxy"
    print "OUT:\n    the page content on stdout"
    print "    the page header on stderr"
    print "JK 16.05.1997"
    print "ADR 13.08.2000"
    exit
  }
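  URL = ARGV[1]; ARGV[1] = ""    # remove the URL from the file arguments
  if (Proxy     == "")  Proxy     = "127.0.0.1"
  if (ProxyPort ==  0)  ProxyPort = 80
  if (Method    == "")  Method    = "GET"
  # Special file name for a TCP client connection to the proxy
  HttpService = "/inet/tcp/0/" Proxy "/" ProxyPort
  # A blank line ends both the request and the response header
  ORS = RS = "\r\n\r\n"
  print Method " " URL " HTTP/1.0" |& HttpService
  # The first record is the header; every later record belongs to the body
  HttpService                      |& getline Header
  print Header > "/dev/stderr"
  while ((HttpService |& getline) > 0)
    printf "%s", $0
  close(HttpService)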
}
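
A surrounding script that captures the header may want the numeric status before deciding what to do with the body. The following is a minimal sketch, assuming the header starts with the usual "HTTP/1.x code reason" line; http_status() is a hypothetical helper and not part of GETURL.

```awk
# http_status() returns the numeric status code from the first line
# of an HTTP response header, or 0 if that line is not recognized.
function http_status(header,    lines, parts) {
    split(header, lines, /\r?\n/)               # header lines end in CRLF or LF
    if (split(lines[1], parts, " ") < 2 || parts[1] !~ /^HTTP\//)
        return 0
    return parts[2] + 0                         # force a numeric result
}

BEGIN {
    # prints 404
    print http_status("HTTP/1.0 404 Not Found\r\nServer: demo")
}
```

Returning 0 for anything unrecognized lets the caller treat a malformed or empty header the same way as a failed request.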

This program can be changed as needed, but be careful with the last lines. Make sure that binary data is not corrupted in transit by additional line breaks. Even as written, the byte sequence "\r\n\r\n" disappears whenever it occurs in the body, because it serves as the record separator and printf writes only the record text. A quick fix for this is easy to get wrong.
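
One way to keep the separator, sketched here under the assumption that gawk is used (RT is a gawk extension): after each record is read, RT holds the text that actually matched RS, so $0 followed by RT reproduces the input. read_exact() is a hypothetical helper that demonstrates this on the output of an arbitrary command; in GETURL the analogous change would be printf "%s%s", $0, RT in the final loop.

```awk
# Gawk-only sketch; read_exact() is a hypothetical helper, not part of GETURL.
# It returns everything cmd writes, with every record terminator kept.
function read_exact(cmd,    data, saved_rs) {
    saved_rs = RS
    RS = "\r\n\r\n"                 # same separator GETURL uses
    data = ""
    while ((cmd | getline) > 0)
        data = data $0 RT           # RT is "" after a final unterminated record
    close(cmd)
    RS = saved_rs                   # restore the caller's separator
    return data
}
```

For binary data it may also be necessary to run gawk with -b (--characters-as-bytes) or in the C locale, so that bytes are not interpreted as multibyte characters.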