3.6 STATIST: Graphing a Statistical Distribution

gawk_statist

In the HTTP server examples we’ve shown thus far, we never present an image to the browser and its user. Presenting images is one task. Generating images that reflect some user input and presenting these dynamically generated images is another. In this section, we use GNUPlot for generating .png, .ps, or .gif files.13

The program we develop takes the statistical parameters of two samples and computes the t-test statistics. As a result, we get the probabilities that the means and the variances of both samples are the same. In order to let the user check plausibility, the program presents an image of the distributions. The statistical computation follows Numerical Recipes in C: The Art of Scientific Computing by William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Since gawk does not have a built-in function for the computation of the beta function, we use the ibeta() function of GNUPlot. As a side effect, we learn how to use GNUPlot as a sophisticated calculator. The comparison of means is done as in tutest, paragraph 14.2, page 613, and the comparison of variances is done as in ftest, page 611 in Numerical Recipes.

As usual, we take the site-independent code for servers and append our own functions SetUpServer() and HandleGET():

function SetUpServer() {
  TopHeader = "<HTML><title>Statistics with GAWK</title>"
  TopDoc = "<BODY>\
   <h2>Please choose one of the following actions:</h2>\
   <UL>\
    <LI><A HREF=" MyPrefix "/AboutServer>About this server</A></LI>\
    <LI><A HREF=" MyPrefix "/EnterParameters>Enter Parameters</A></LI>\
   </UL>"
  TopFooter  = "</BODY></HTML>"
  GnuPlot    = "gnuplot 2>&1"
  m1=m2=0;    v1=v2=1;    n1=n2=10
}

Here, you see the menu structure that the user sees. Later, we will see how the program structure of the HandleGET() function reflects the menu structure. What is missing here is the link for the image we generate. In an event-driven environment, request, generation, and delivery of images are separated.

Notice the way we initialize the GnuPlot command string for the pipe. By default, GNUPlot outputs the generated image via standard output, as well as the results of print(ed) calculations via standard error. The redirection causes standard error to be mixed into standard output, enabling us to read results of calculations with getline. By initializing the statistical parameters with some meaningful defaults, we make sure the user gets an image the first time he uses the program.

Following is the rather long function HandleGET(), which implements the contents of this service by reacting to the different kinds of requests from the browser. Before you start playing with this script, make sure that your browser supports JavaScript and that it also has this option switched on. The script uses a short snippet of JavaScript code for delayed opening of a window with an image. A more detailed explanation follows:

function HandleGET() {
  if (MENU[2] == "AboutServer") {
    Document  = "This is a GUI for a statistical computation.\
      It compares means and variances of two distributions.\
      It is implemented as one GAWK script and uses GNUPLOT."
  } else if (MENU[2] == "EnterParameters") {
    Document = ""
    if ("m1" in GETARG) {     # are there parameters to compare?
      Document = Document "<SCRIPT LANGUAGE=\"JavaScript\">\
        setTimeout(\"window.open(\\\"" MyPrefix "/Image" systime()\
         "\\\",\\\"dist\\\", \\\"status=no\\\");\", 1000); </SCRIPT>"
      m1 = GETARG["m1"]; v1 = GETARG["v1"]; n1 = GETARG["n1"]
      m2 = GETARG["m2"]; v2 = GETARG["v2"]; n2 = GETARG["n2"]
      t = (m1-m2)/sqrt(v1/n1+v2/n2)
      df = (v1/n1+v2/n2)*(v1/n1+v2/n2)/((v1/n1)*(v1/n1)/(n1-1) \
           + (v2/n2)*(v2/n2) /(n2-1))
      if (v1>v2) {
          f = v1/v2
          df1 = n1 - 1
          df2 = n2 - 1
      } else {
          f = v2/v1
          df1 = n2 - 1
          df2 = n1 - 1
      }
      print "pt=ibeta(" df/2 ",0.5," df/(df+t*t) ")"  |& GnuPlot
      print "pF=2.0*ibeta(" df2/2 "," df1/2 "," \
            df2/(df2+df1*f) ")"                    |& GnuPlot
      print "print pt, pF"                         |& GnuPlot
      RS="\n"; GnuPlot |& getline; RS="\r\n"    # $1 is pt, $2 is pF
      print "invsqrt2pi=1.0/sqrt(2.0*pi)"          |& GnuPlot
      print "nd(x)=invsqrt2pi/sd*exp(-0.5*((x-mu)/sd)**2)" |& GnuPlot
      print "set term png small color"             |& GnuPlot
      #print "set term postscript color"           |& GnuPlot
      #print "set term gif medium size 320,240"    |& GnuPlot
      print "set yrange[-0.3:]"                    |& GnuPlot
      print "set label 'p(m1=m2) =" $1 "' at 0,-0.1 left"  |& GnuPlot
      print "set label 'p(v1=v2) =" $2 "' at 0,-0.2 left"  |& GnuPlot
      print "plot mu=" m1 ",sd=" sqrt(v1) ", nd(x) title 'sample 1',\
        mu=" m2 ",sd=" sqrt(v2) ", nd(x) title 'sample 2'" |& GnuPlot
      print "quit"                                         |& GnuPlot
      GnuPlot |& getline Image
      while ((GnuPlot |& getline) > 0)
          Image = Image RS $0
      close(GnuPlot)
    }
    Document = Document "\
    <h3>Do these samples have the same Gaussian distribution?</h3>\
    <FORM METHOD=GET> <TABLE BORDER CELLPADDING=5>\
    <TR>\
    <TD>1. Mean    </TD>
    <TD><input type=text name=m1 value=" m1 " size=8></TD>\
    <TD>1. Variance</TD>
    <TD><input type=text name=v1 value=" v1 " size=8></TD>\
    <TD>1. Count   </TD>
    <TD><input type=text name=n1 value=" n1 " size=8></TD>\
    </TR><TR>\
    <TD>2. Mean    </TD>
    <TD><input type=text name=m2 value=" m2 " size=8></TD>\
    <TD>2. Variance</TD>
    <TD><input type=text name=v2 value=" v2 " size=8></TD>\
    <TD>2. Count   </TD>
    <TD><input type=text name=n2 value=" n2 " size=8></TD>\
    </TR>                   <input type=submit value=\"Compute\">\
    </TABLE></FORM><BR>"
  } else if (MENU[2] ~ "Image") {
    Reason = "OK" ORS "Content-type: image/png"
    #Reason = "OK" ORS "Content-type: application/x-postscript"
    #Reason = "OK" ORS "Content-type: image/gif"
    Header = Footer = ""
    Document = Image
  }
}

As usual, we give a short description of the service in the first menu choice. The third menu choice shows us that generation and presentation of an image are two separate actions. While the latter takes place quite instantly in the third menu choice, the former takes place in the much longer second choice. Image data passes from the generating action to the presenting action via the variable Image that contains a complete .png image, which is otherwise stored in a file. If you prefer .ps or .gif images over the default .png images, you may select these options by uncommenting the appropriate lines. But remember to do so in two places: when telling GNUPlot which kind of images to generate, and when transmitting the image at the end of the program.

Looking at the end of the program, the way we pass the ‘Content-type’ to the browser is a bit unusual. It is appended to the ‘OK’ of the first header line to make sure the type information becomes part of the header. The other variables that get transmitted across the network are made empty, because in this case we do not have an HTML document to transmit, but rather raw image data to contain in the body.

Most of the work is done in the second menu choice. It starts with a strange JavaScript code snippet. When first implementing this server, we used a short ‘"<IMG SRC=" MyPrefix "/Image>"’ here. But then browsers got smarter and tried to improve on speed by requesting the image and the HTML code at the same time. When doing this, the browser tries to build up a connection for the image request while the request for the HTML text is not yet completed. The browser tries to connect to the gawk server on port 8080 while port 8080 is still in use for transmission of the HTML text. The connection for the image cannot be built up, so the image appears as “broken” in the browser window. We solved this problem by telling the browser to open a separate window for the image, but only after a delay of 1000 milliseconds. By this time, the server should be ready for serving the next request.

But there is one more subtlety in the JavaScript code. Each time the JavaScript code opens a window for the image, the name of the image is appended with a timestamp (systime()). Why this constant change of name for the image? Initially, we always named the image Image, but then the Netscape browser noticed the name had not changed since the previous request and displayed the previous image (caching behavior). The server core is implemented so that browsers are told not to cache anything. Obviously HTTP requests do not always work as expected. One way to circumvent the cache of such overly smart browsers is to change the name of the image with each request. These three lines of JavaScript caused us a lot of trouble.

The rest can be broken down into two phases. At first, we check if there are statistical parameters. When the program is first started, there usually are no parameters because it enters the page coming from the top menu. Then, we only have to present the user a form that he can use to change statistical parameters and submit them. Subsequently, the submission of the form causes the execution of the first phase because now there are parameters to handle.

Now that we have parameters, we know there will be an image available. Therefore we insert the JavaScript code here to initiate the opening of the image in a separate window. Then, we prepare some variables that will be passed to GNUPlot for calculation of the probabilities. Prior to reading the results, we must temporarily change RS because GNUPlot separates lines with newlines. After instructing GNUPlot to generate a .png (or .ps or .gif) image, we initiate the insertion of some text, explaining the resulting probabilities. The final ‘plot’ command actually generates the image data. This raw binary has to be read in carefully without adding, changing, or deleting a single byte. Hence the unusual initialization of Image and completion with a while loop.

When using this server, it soon becomes clear that it is far from being perfect. It mixes source code of six scripting languages or protocols:

After all this work, the GNUPlot image opens in the JavaScript window where it can be viewed by the user.

It is probably better not to mix up so many different languages. The result is not very readable. Furthermore, the statistical part of the server does not take care of invalid input. Among others, using negative variances causes invalid results.


Footnotes

(13)

Due to licensing problems, the default installation of GNUPlot disables the generation of .gif files. If your installed version does not accept ‘set term gif’, just download and install the most recent version of GNUPlot and the GD library by Thomas Boutell. Otherwise you still have the chance to generate some ASCII-art style images with GNUPlot by using ‘set term dumb’. (We tried it and it worked.)