15.15 QUICK CLUSTER

QUICK CLUSTER var_list
      [/CRITERIA=CLUSTERS(k) [MXITER(max_iter)] [CONVERGE(epsilon)] [NOINITIAL] [NOUPDATE]]
      [/MISSING={EXCLUDE,INCLUDE} {LISTWISE, PAIRWISE}]
      [/PRINT={INITIAL} {CLUSTER}]
      [/SAVE[=[CLUSTER[(membership_var)]] [DISTANCE[(distance_var)]]]]

The QUICK CLUSTER command performs k-means clustering on the dataset. This is useful when you wish to allocate cases into clusters of similar values and you already know the number of clusters.

The minimum specification is ‘QUICK CLUSTER’ followed by the names of the variables which contain the cluster data. Normally you will also want to specify /CRITERIA=CLUSTERS(k) where k is the number of clusters. If this is not specified, then k defaults to 2.

If you use /CRITERIA=NOINITIAL then a naive algorithm is used to select the initial clusters. This runs faster but yields less well-separated initial clusters, and hence possibly an inferior final result.

QUICK CLUSTER uses an iterative algorithm to select the cluster centers. The subcommand /CRITERIA=MXITER(max_iter) sets the maximum number of iterations. During classification, PSPP will continue iterating until max_iter iterations have been done or the convergence criterion (see below) is fulfilled. The default value of max_iter is 2.

If, however, you specify /CRITERIA=NOUPDATE then after selecting the initial centers, no further update to the cluster centers is done. In this case, max_iter, if specified, is ignored.

The subcommand /CRITERIA=CONVERGE(epsilon) is used to set the convergence criterion. The value of the convergence criterion is epsilon times the minimum distance between the initial cluster centers. Iteration stops when the mean cluster distance between one iteration and the next is less than the convergence criterion. The default value of epsilon is zero.
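
The interaction between max_iter and epsilon can be illustrated with a small Python sketch. This is not PSPP's implementation. The function names are hypothetical, and "mean cluster distance between one iteration and the next" is read here as the mean distance that the centers move, which is an assumption.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def convergence_criterion(initial_centers, epsilon):
    """epsilon times the minimum distance between the initial centers."""
    min_sep = min(
        dist(a, b)
        for i, a in enumerate(initial_centers)
        for b in initial_centers[i + 1:]
    )
    return epsilon * min_sep


def kmeans(points, initial_centers, max_iter=2, epsilon=0.0):
    """Iterate until max_iter is reached or the centers move less than
    the convergence criterion (assumed reading, see the text above)."""
    centers = [tuple(c) for c in initial_centers]
    criterion = convergence_criterion(centers, epsilon)
    iterations = 0
    while iterations < max_iter:
        # Assign every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; an empty
        # group keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        # Mean distance moved by the centers in this iteration.
        shift = sum(dist(a, b) for a, b in zip(centers, new_centers)) / len(centers)
        centers = new_centers
        iterations += 1
        if shift < criterion:
            break
    return centers, iterations
```

Under this reading, an epsilon of zero gives a criterion of zero, which a non-negative distance can never be "less than", so the loop always runs for max_iter iterations.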

The MISSING subcommand determines the handling of missing values. If INCLUDE is set, then user-missing values are considered at their face value and not as missing values. If EXCLUDE is set, which is the default, user-missing values are excluded as well as system-missing values.

If LISTWISE is set, then the entire case is excluded from the analysis whenever any of the clustering variables contains a missing value. If PAIRWISE is set, then a case is considered missing only if all the clustering variables contain missing values. Otherwise it is clustered on the basis of the non-missing values. The default is LISTWISE.
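
The following Python sketch illustrates how the MISSING settings described above might select cases. It is an illustration only, not PSPP code. System-missing values are represented as None, user-missing values as members of a set, and all function names are hypothetical. Computing the distance over only the non-missing coordinates is an assumption about what "on the basis of the non-missing values" means.

```python
import math


def is_missing(value, user_missing, policy="EXCLUDE"):
    """System-missing (None) is always missing; user-missing values
    count as missing only under EXCLUDE."""
    if value is None:
        return True
    return policy == "EXCLUDE" and value in user_missing


def usable_cases(cases, user_missing=frozenset(), policy="EXCLUDE", mode="LISTWISE"):
    """Return the cases that take part in the analysis."""
    kept = []
    for case in cases:
        flags = [is_missing(v, user_missing, policy) for v in case]
        if mode == "LISTWISE":
            keep = not any(flags)   # drop the case if any value is missing
        elif mode == "PAIRWISE":
            keep = not all(flags)   # drop the case only if every value is missing
        else:
            raise ValueError("mode must be LISTWISE or PAIRWISE")
        if keep:
            kept.append(case)
    return kept


def partial_distance(case, center, user_missing=frozenset(), policy="EXCLUDE"):
    """Distance over the non-missing coordinates only (assumed behaviour)."""
    return math.sqrt(sum(
        (v - m) ** 2
        for v, m in zip(case, center)
        if not is_missing(v, user_missing, policy)
    ))
```

For example, with cases (1, 2), (None, 3) and (None, None), LISTWISE keeps only the first case, while PAIRWISE keeps the first two.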

The PRINT subcommand requests additional output to be printed. If INITIAL is set, then the initial cluster memberships will be printed. If CLUSTER is set, the cluster memberships of the individual cases are displayed (potentially generating lengthy output).

You can specify the subcommand SAVE to ask that each case’s cluster membership and the Euclidean distance between the case and its cluster center be saved to new variables in the active dataset. To save the cluster membership use the CLUSTER keyword and to save the distance use the DISTANCE keyword. Each keyword may optionally be followed by a variable name in parentheses to specify the new variable which is to contain the saved parameter. If no variable name is specified, then PSPP will create one.
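
As an illustration of the two quantities that SAVE stores, the Python sketch below computes, for each case, the number of its nearest cluster and the Euclidean distance to that cluster's center. The function name is hypothetical, and numbering clusters from 1 is an assumption made for the example, not something stated above.

```python
import math


def save_columns(cases, centers):
    """Return (membership, distance) lists, one entry per case."""
    membership, distance = [], []
    for case in cases:
        # Euclidean distance from this case to every cluster center.
        dists = [math.dist(case, c) for c in centers]
        nearest = min(range(len(centers)), key=dists.__getitem__)
        membership.append(nearest + 1)   # assumed 1-based cluster numbers
        distance.append(dists[nearest])
    return membership, distance
```

With centers (0, 0) and (10, 0), the case (3, 4) would be assigned to cluster 1 at distance 5, and the case (10, 1) to cluster 2 at distance 1.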