5.2.1 Identifying incorrect data

Data from real sources is rarely error-free. PSPP has a number of procedures that can help identify data which might be incorrect.

The DESCRIPTIVES command (see DESCRIPTIVES) is used to generate simple linear statistics for a dataset. It is also useful for identifying potential problems in the data. The example file physiology.sav contains a number of physiological measurements of a sample of healthy adults selected at random. However, the data entry clerk made a number of mistakes when entering the data. Example 5.2 illustrates the use of DESCRIPTIVES to screen this data and identify the erroneous values.

     PSPP> get file='/usr/local/share/pspp/examples/physiology.sav'.
     PSPP> descriptives sex, weight, height.

Output:

     DESCRIPTIVES.  Valid cases = 40; cases with missing value(s) = 0.
     +--------#--+-------+-------+-------+-------+
     |Variable# N|  Mean |Std Dev|Minimum|Maximum|
     #========#==#=======#=======#=======#=======#
     |sex     #40|    .45|    .50|    .00|   1.00|
     |height  #40|1677.12| 262.87| 179.00|1903.00|
     |weight  #40|  72.12|  26.70| -55.60|  92.07|
     +--------#--+-------+-------+-------+-------+

Example 5.2: Using the DESCRIPTIVES command to display simple summary information about the data. In this case, the results show unexpectedly low values in the Minimum column, suggesting incorrect data entry.

In the output of Example 5.2, the most interesting column is the minimum value. The weight variable has a minimum value of less than zero, which is clearly erroneous. Similarly, the height variable's minimum value of 179 seems to be very low. In fact, it is more than 5 standard deviations from the mean, and is a seemingly bizarre height for an adult person. We can examine the data in more detail with the EXAMINE command (see EXAMINE), as Example 5.3 shows.

In Example 5.3 you can see that the lowest value of height is 179 (which we suspect to be erroneous), but the second lowest is 1598, which we know from the DESCRIPTIVES command is within 1 standard deviation of the mean. Similarly, the weight variable has a lowest value which is negative, but a plausible value for the second lowest. This suggests that the two extreme values are outliers and probably represent data entry errors.

[... continue from Example 5.2]
     PSPP> examine height, weight /statistics=extreme(3).

Output:

     #===============================#===========#=======#
     #                               #Case Number| Value #
     #===============================#===========#=======#
     #Height in millimetres Highest 1#         14|1903.00#
     #                              2#         15|1884.00#
     #                              3#         12|1801.65#
     #                     ----------#-----------+-------#
     #                       Lowest 1#         30| 179.00#
     #                              2#         31|1598.00#
     #                              3#         28|1601.00#
     #                     ----------#-----------+-------#
     #Weight in kilograms   Highest 1#         13|  92.07#
     #                              2#          5|  92.07#
     #                              3#         17|  91.74#
     #                     ----------#-----------+-------#
     #                       Lowest 1#         38| -55.60#
     #                              2#         39|  54.48#
     #                              3#         33|  55.45#
     #===============================#===========#=======#

Example 5.3: Using the EXAMINE command to see the extreme values of the data for different variables. Cases 30 and 38 contain values very much lower than the rest of the data. They are possibly erroneous.
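
The standard-deviation comparisons made in this section can be checked by hand. The following is a minimal sketch in Python (not PSPP syntax, and not part of PSPP itself) that computes how many standard deviations each of the two lowest values lies from the mean. The numbers are copied from the output of Example 5.2 and Example 5.3. The helper name z_score and the dictionary layout are illustrative assumptions.

```python
# Means and standard deviations copied from the DESCRIPTIVES output (Example 5.2).
stats = {
    "height": {"mean": 1677.12, "std_dev": 262.87},
    "weight": {"mean": 72.12, "std_dev": 26.70},
}

# The two lowest values per variable, copied from the EXAMINE output (Example 5.3).
lowest = {
    "height": [179.00, 1598.00],
    "weight": [-55.60, 54.48],
}


def z_score(value, mean, std_dev):
    """Return the signed distance of value from mean, in standard deviations."""
    return (value - mean) / std_dev


# Hypothetical helper structure: z-scores for the lowest and second-lowest values.
z_scores = {
    var: [z_score(v, **stats[var]) for v in values]
    for var, values in lowest.items()
}

for var, scores in z_scores.items():
    for value, z in zip(lowest[var], scores):
        print(f"{var:6s} {value:8.2f}  z = {z:6.2f}")
```

The lowest height and weight lie roughly 5.7 and 4.8 standard deviations below their means, while the second-lowest values are both well within 1 standard deviation. Note that the means and standard deviations used here were themselves computed with the suspect cases included, so the figures are only a rough guide.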