
Stats (Statistical Summary)

Syntax:
    stats {<ranges>} 'filename' {matrix | using N{:M}} {name 'prefix'} {{no}output}

This command prepares a statistical summary of the data in one or two columns of a file. The using specifier is interpreted in the same way as for plot commands. See plot for details on the index, every, and using directives. Data points are filtered against both xrange and yrange before analysis; see set xrange. The summary is printed to the screen by default. Output can be redirected to a file by first issuing the command set print, or suppressed altogether with the nooutput option.
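The output options described above can be sketched as follows. This is a minimal illustration, assuming a gnuplot version that supports inline datablocks (5.0 or later); the datablock $DATA, its values, and the file name stats_summary.txt are hypothetical.

```gnuplot
# Inline datablock used in place of a data file (hypothetical sample values)
$DATA << EOD
1 10
2 20
3 30
4 40
EOD

# Default: print the summary of column 2 to the screen
stats $DATA using 2

# Send the summary to a file instead (file name is hypothetical)
set print "stats_summary.txt"
stats $DATA using 2
set print            # restore the default print destination

# Limit the analysis to 15 <= y <= 35 and suppress the printed summary
set yrange [15:35]
stats $DATA using 2 nooutput
print STATS_records, STATS_outofrange
set yrange [*:*]     # restore autoscaling
```

With nooutput the statistics are still stored in the STATS_ variables, so they remain available to later commands.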

In addition to printed output, the program stores the individual statistics into three sets of variables. The first set of variables reports how the data is laid out in the file:

STATS_records   total number of in-range data records
STATS_outofrange   number of records filtered out by range limits
STATS_invalid   number of invalid/incomplete/missing records
STATS_blank   number of blank lines in the file
STATS_blocks   number of indexable datablocks in the file
STATS_columns   number of data columns in the first row of data

The second set reports properties of the in-range data from a single column. This column is treated as y. If the y axis is autoscaled then no range limits are applied. Otherwise only values in the range [ymin:ymax] are considered.

If two columns are analysed jointly by a single stats command, the suffix "_x" or "_y" is appended to each variable name. For example, STATS_min_x is the minimum value found in the first column, while STATS_min_y is the minimum value found in the second column. In this case points are filtered by testing against both xrange and yrange.

STATS_min   minimum value of in-range data points
STATS_max   maximum value of in-range data points
STATS_index_min   index i for which data[i] == STATS_min
STATS_index_max   index i for which data[i] == STATS_max
STATS_mean   mean value of the in-range data points
STATS_stddev   population standard deviation of the in-range data
STATS_ssd   sample standard deviation of the in-range data
STATS_lo_quartile   value of the lower (1st) quartile boundary
STATS_median   median value
STATS_up_quartile   value of the upper (3rd) quartile boundary
STATS_sum   sum
STATS_sumsq   sum of squares
STATS_skewness   skewness of the in-range data points
STATS_kurtosis   kurtosis of the in-range data points
STATS_adev   mean absolute deviation of the in-range data
STATS_mean_err   standard error of the mean value
STATS_stddev_err   standard error of the standard deviation
STATS_skewness_err   standard error of the skewness
STATS_kurtosis_err   standard error of the kurtosis
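As an illustration of how these variables are named and how they relate to one another, the sketch below analyses first one column and then two. It assumes inline datablock support; $DATA and its values are hypothetical. The consistency checks use the textbook relations mean = sum/N and sample deviation = population deviation * sqrt(N/(N-1)), which are assumed here to match the definitions in the table.

```gnuplot
# Hypothetical sample data: x in column 1, y in column 2
$DATA << EOD
1 2.0
2 4.5
3 5.5
4 8.0
EOD

# One column: variable names carry no suffix
stats $DATA using 2 nooutput
N = STATS_records
print "mean = ", STATS_mean, "  check: ", STATS_sum / N
# Population and sample standard deviations differ by a factor sqrt(N/(N-1))
print "ssd  = ", STATS_ssd, "  check: ", STATS_stddev * sqrt(N / (N - 1.0))

# Two columns: the same statistics now carry the suffixes _x and _y
stats $DATA using 1:2 nooutput
print "x from ", STATS_min_x, " to ", STATS_max_x
print "y from ", STATS_min_y, " to ", STATS_max_y
```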

The third set of variables is only relevant to analysis of two data columns.

STATS_correlation   sample correlation coefficient between x and y values
STATS_slope   A corresponding to a linear fit y = Ax + B
STATS_slope_err   uncertainty of A
STATS_intercept   B corresponding to a linear fit y = Ax + B
STATS_intercept_err   uncertainty of B
STATS_sumxy   sum of x*y
STATS_pos_min_y   x coordinate of a point with minimum y value
STATS_pos_max_y   x coordinate of a point with maximum y value
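One possible use of the two-column variables is to overlay the linear fit on the data. The following is a sketch only, assuming inline datablock support; $DATA, its values, and the function name f are hypothetical.

```gnuplot
# Hypothetical sample data: x in column 1, y in column 2
$DATA << EOD
1 2.1
2 3.9
3 6.2
4 7.8
EOD

stats $DATA using 1:2 nooutput

# Straight line built from the stored slope and intercept
f(x) = STATS_slope * x + STATS_intercept

set title sprintf("y = %.3f x + %.3f   (r = %.3f)", \
                  STATS_slope, STATS_intercept, STATS_correlation)

# Vertical line at the x position where the largest y value was found
set arrow 1 from STATS_pos_max_y, graph 0 to STATS_pos_max_y, graph 1 nohead

plot $DATA using 1:2 with points title "data", f(x) title "linear fit"
```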

When matrix is specified, all matrix entries are included in the analysis. The matrix dimensions are saved in the variables STATS_size_x and STATS_size_y.
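A minimal matrix sketch, assuming that stats accepts an inline datablock with the matrix keyword in the same way that plot does; $M and its values are hypothetical.

```gnuplot
# Hypothetical 2-row by 3-column matrix
$M << EOD
1 2 3
4 5 6
EOD

stats $M matrix nooutput
print "matrix size: ", STATS_size_x, " by ", STATS_size_y
print "min, max, mean: ", STATS_min, STATS_max, STATS_mean
```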

It may be convenient to track the statistics from more than one file or data column in parallel. The name option causes the default prefix "STATS" to be replaced by a user-specified string. For example, the mean value of column 2 data from two different files could be compared by

    stats "file1.dat" using 2 name "A"
    stats "file2.dat" using 2 name "B"
    if (A_mean < B_mean) {...}

The keyword columnheader or function columnheader(N) can be used to generate the prefix from the contents of the first row of a data file:
    do for [COL=5:8] { stats 'datafile' using COL name columnheader }

The index reported in STATS_index_xxx corresponds to the value of pseudo-column 0 ($0) in plot commands. I.e. the first point has index 0, the last point has index N-1.
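Because the stored index matches pseudo-column 0, it can be used to pick out a single point in a later plot. A sketch under the assumption of inline datablock support; $DATA and its values are hypothetical. Counting from 0 as described above, the smallest value (1.2) is point 2 and the largest (9.8) is point 3.

```gnuplot
# Hypothetical single-column data
$DATA << EOD
3.0
7.5
1.2
9.8
4.4
EOD

stats $DATA using 1 nooutput
print "minimum at point ", STATS_index_min, ", maximum at point ", STATS_index_max

# Draw all points, then redraw only the point whose $0 equals the stored index
plot $DATA using 0:1 with linespoints title "data", \
     $DATA using 0:($0 == STATS_index_max ? $1 : NaN) with points pt 7 ps 2 title "maximum"
```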

Data values are sorted to find the median and quartile boundaries. If the total number of points N is odd, the median is the value of point (N+1)/2 in the sorted list, counting from 1. If N is even, the median is reported as the mean of sorted points N/2 and (N+2)/2. Equivalent treatment is used for the quartile boundaries.
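The median rule can be tried on two small data sets. This sketch assumes inline datablock support; $ODD, $EVEN, and their values are hypothetical. Sorted, the first set is 1 3 5 7 9, so the rule selects point (5+1)/2 = 3, giving 5. The second set sorts to 1 3 5 7, so the rule averages points 2 and 3, giving (3+5)/2 = 4.

```gnuplot
# N = 5 (odd)
$ODD << EOD
7
1
5
3
9
EOD

# N = 4 (even)
$EVEN << EOD
7
1
5
3
EOD

stats $ODD using 1 nooutput
print "N = 5: median = ", STATS_median

stats $EVEN using 1 nooutput
print "N = 4: median = ", STATS_median
```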

For an example of using the stats command to annotate a subsequent plot, see the stats.dem demo:

http://www.gnuplot.info/demo/stats.html

The current implementation does not allow analysis if either the X or Y axis is set to log-scaling. This restriction may be removed in a later version.
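One way to work within this restriction, offered as a suggestion and not as documented behaviour, is to run stats while both axes are linear and enable log-scaling only for the plot. If statistics of the logarithm are wanted, the transformation can be applied in the using expression. $DATA, its values, and the prefix LOGY are hypothetical.

```gnuplot
# Hypothetical sample data spanning several decades in y
$DATA << EOD
1 10
2 100
3 1000
EOD

unset logscale                 # make sure both axes are linear for stats
stats $DATA using 1:2 nooutput

# Statistics of log10(y), stored under a separate prefix
stats $DATA using (log10($2)) name "LOGY" nooutput
print "mean of log10(y) = ", LOGY_mean

set logscale y                 # log scaling is enabled only for the plot
set label 1 sprintf("mean y = %g", STATS_mean_y) at graph 0.05, graph 0.9
plot $DATA using 1:2 with linespoints title "data"
```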


2017-05-24