The proc_means function generates summary statistics for selected variables on the input dataset. The variables are identified on the var parameter. The statistics to perform are identified on the stats parameter. Results are displayed in the viewer interactively and returned from the function.

proc_means(
  data,
  var = NULL,
  stats = c("n", "mean", "std", "min", "max"),
  output = NULL,
  by = NULL,
  class = NULL,
  options = NULL,
  titles = NULL
)

Arguments

data

The input data frame for which to calculate summary statistics. This parameter is required.

var

The variable(s) to calculate summary statistics for. If no variables are specified, summary statistics will be generated for all numeric variables on the input data frame.

stats

A vector of summary statistics keywords. Valid keywords are: "css", "clm", "cv", "kurt", "kurtosis", "lclm", "mean", "median", "mode", "min", "max", "n", "nmiss", "nobs", "p1", "p5", "p10", "p20", "p25", "p30", "p40", "p50", "p60", "p70", "p75", "p80", "p90", "p95", "p99", "q1", "q3", "qrange", "range", "skew", "skewness", "std", "stddev", "stderr", "sum", "uclm", "uss", and "vari". For hypothesis testing, the function supports "t", "prt", "probt", and "df". Default statistics are: "n", "mean", "std", "min", and "max".

output

Whether or not to return datasets from the function. Valid values are "out", "none", and "report". Default is "out", which produces dataset output specifically designed for programmatic use. The "none" option will return NULL instead of a dataset or list of datasets. The "report" keyword returns the datasets from the interactive report, which may be different from the standard output. The output parameter also accepts the data shaping keywords "long", "stacked", and "wide". The shaping keywords control the structure of the output data. See the Data Shaping section for additional details. Note that multiple output keywords may be passed as a character vector. For example, to produce both a report dataset and a "long" output dataset, use the parameter output = c("report", "out", "long").

by

An optional by group. If you specify a by group, the input data will be subset on the by variable(s) prior to performing any statistics.

class

The class parameter is similar to the by parameter, but the output is different. By groups will create completely separate tables, while class groups will be continued in the same table. When a by and a class are both specified, the class will be nested in the by.

options

A vector of optional keywords. Valid values are: "alpha =", "completetypes", "maxdec =", "noprint", "notype", "nofreq", "nonobs", "nway". The "notype", "nofreq", and "nonobs" keywords will turn off columns on the output datasets. The "alpha = " option will set the alpha value for confidence limit statistics. The default is a 95% confidence level (alpha = 0.05). The "maxdec = " option sets the maximum number of decimal places displayed on report output. The "nway" option returns only the highest type values.

titles

A vector of one or more titles to use for the report output.

Value

Normally, the requested summary statistics are shown interactively in the viewer, and output results are returned as a data frame. If the request produces multiple data frames, they will be returned in a list. You may then access individual datasets from the list. The interactive report can be turned off using the "noprint" option, and the output datasets can be turned off using the "none" keyword on the output parameter.

Details

The proc_means function is for analysis of continuous variables. Data is passed in on the data parameter. The desired statistics are specified using keywords on the stats parameter. The function can segregate data into groups using the by and class parameters. There are also options to determine whether and what results are returned.

Interactive Output

By default, proc_means results will be immediately sent to the viewer as an HTML report. This functionality makes it easy to get a quick analysis of your data. To turn off the interactive report, pass the "noprint" keyword to the options parameter.

The titles parameter allows you to set one or more titles for your report. Pass these titles as a vector of strings.

The exact datasets used for the interactive report can be returned as a list. To return these datasets as a list, pass the "report" keyword on the output parameter. This list may in turn be passed to proc_print to write the report to a file.

Dataset Output

Dataset results are also returned from the function by default. If the results are a single dataset, a single data frame will be returned. If there are multiple results, a list of data frames will be returned.

The output datasets generated are optimized for data manipulation. The column names have been standardized, and additional variables may be present to help with data manipulation. For example, the by variable will always be named "BY", and the class variable will always be named "CLASS". In addition, data values in the output datasets are intentionally not rounded or formatted to give you the most accurate statistical results.

Statistics Keywords

The following statistics keywords can be passed on the stats parameter. Normally, each statistic will be contained in a separate column and the column name will be the same as the statistic keyword. You may pass statistic keywords as a quoted vector of strings, or an unquoted vector using the v() function.

  • css: Corrected Sum of Squares.

  • clm, lclm, uclm: Confidence limits for the mean. The clm keyword requests both limits, lclm the lower limit only, and uclm the upper limit only.

  • cv: Coefficient of Variation.

  • kurt/kurtosis: The Kurtosis is a description of the distribution tails. It requires at least 4 complete observations.

  • mean: The arithmetic mean.

  • median: The median.

  • mode: The mode of the target variable.

  • min, max: The minimum and maximum values of the target variable.

  • n: The number of non-missing observations.

  • nmiss: The number of missing observations.

  • nobs: The number of observations, whether missing or non-missing.

  • p1 - p99: Percentiles. The supported keywords are p1, p5, p10, p20, p25, p30, p40, p50, p60, p70, p75, p80, p90, p95, and p99.

  • qrange, q1, q3: The first quartile (q1), the third quartile (q3), and the interquartile range (qrange), which is the difference between q3 and q1.

  • range: Difference between the minimum and maximum values.

  • skew/skewness: A measure of distribution skewness. It requires at least 3 complete observations.

  • std/stddev: Standard deviation.

  • stderr: Standard error.

  • sum: The sum of variable values.

  • uss: Uncorrected sum of squares.

  • vari: The variance.
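
As a rough illustration of what several of these keywords measure, the base R sketch below computes them directly for one iris variable. It is not the function's internal code. In particular, expressing cv as a percentage of the mean is an assumption borrowed from the usual SAS convention, and the variable names are chosen only for this example.

```r
# Base R versions of several summary statistics, for illustration only.
x <- iris$Petal.Length

n       <- sum(!is.na(x))          # n: number of non-missing observations
css     <- sum((x - mean(x))^2)    # css: corrected sum of squares
uss     <- sum(x^2)                # uss: uncorrected sum of squares
vari    <- var(x)                  # vari: sample variance, equal to css / (n - 1)
std     <- sd(x)                   # std: standard deviation
std_err <- std / sqrt(n)           # stderr: standard error of the mean
cv      <- 100 * std / mean(x)     # cv: assumed here to be a percentage of the mean
```

The std value computed this way agrees with the STD shown for Petal.Length in the Examples section.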

The function supports the following keywords to perform hypothesis testing:

  • t: Student's t statistic.

  • prt/probt: A two-tailed p-value for the Student's t statistic.

  • df: Degrees of freedom for the Student's t statistic.
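
The sketch below shows one way these three values relate, using base R only. It assumes a one-sample test of the null hypothesis that the mean is zero, which is the convention in SAS PROC MEANS; the text above does not state which null hypothesis proc_means uses, so treat that as an assumption.

```r
# One-sample t test against a mean of zero (assumed null hypothesis).
x <- iris$Petal.Width

n      <- sum(!is.na(x))
df     <- n - 1                           # df: degrees of freedom
t_stat <- mean(x) / (sd(x) / sqrt(n))     # t: Student's t statistic
prt    <- 2 * pt(-abs(t_stat), df)        # prt/probt: two-tailed p-value

# Cross-check against the built-in test
check <- t.test(x, mu = 0)
```

Here check$statistic, check$p.value, and check$parameter hold the same three quantities as computed by t.test.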

Options

The proc_means function recognizes the following options. Options may be passed as a quoted vector of strings, or an unquoted vector using the v() function.

  • alpha = : The "alpha = " option will set the alpha value for confidence limit statistics. Set the alpha as a decimal value between 0 and 1. For example, you can set a 90% confidence limit as alpha = 0.1.

  • completetypes: The "completetypes" option will generate all combinations of the class variable, even if there is no data present for a particular level. Combinations will be distinguished by the TYPE variable. To use the "completetypes" option, define the class variable(s) as a factor.

  • maxdec = : The "maxdec = " option will set the maximum number of decimal places displayed on report output. For example, you can set 4 decimal places as follows: maxdec = 4. Default is 7 decimal places. This option will not round any values on the output dataset.

  • nofreq, nonobs: Turns off the FREQ column on the output datasets.

  • noprint: Whether to print the interactive report to the viewer. By default, the report is printed to the viewer. The "noprint" option will inhibit printing.

  • notype: Turns off the TYPE column on the output dataset.

  • nway: Returns only the highest level TYPE combination. By default, the function returns all TYPE combinations.
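
To make the effect of the alpha option concrete, the following base R sketch computes t-based confidence limits for the mean at two alpha levels. It is an illustration rather than the function's own code, and the helper conf_limits is a hypothetical name. The formula is an assumption, although at alpha = 0.05 it reproduces the overall LCLM and UCLM values for Petal.Length shown in the Examples section.

```r
# Confidence limits for the mean at a given alpha, using the t distribution.
x       <- iris$Petal.Length
n       <- length(x)
std_err <- sd(x) / sqrt(n)

# Hypothetical helper: returns lower and upper limits for a given alpha
conf_limits <- function(alpha) {
  half_width <- qt(1 - alpha / 2, df = n - 1) * std_err
  c(LCLM = mean(x) - half_width, UCLM = mean(x) + half_width)
}

cl95 <- conf_limits(0.05)   # default, 95% limits: about 3.4732 and 4.0428
cl90 <- conf_limits(0.10)   # alpha = 0.1, 90% limits: a narrower interval
```

A larger alpha gives a lower confidence level and therefore a narrower interval around the same mean.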

TYPE and FREQ Variables

The TYPE and FREQ variables appear on the output dataset by default.

The FREQ variable contains a count of the number of input rows/observations that were included in the statistics for that output row. The FREQ count can be different from the N statistic. The FREQ count is a count of the number of rows/observations, while the N statistic is a count of non-missing values. These counts can be different if you have missing values in your data. If you want to remove the FREQ column from the output dataset, use the "nofreq" option.
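
The difference between the two counts can be seen with a few lines of base R. The small data frame below is made up for this illustration.

```r
# Five rows, two of which have a missing value for x
dat <- data.frame(x = c(1.5, NA, 3.0, 4.5, NA))

freq  <- nrow(dat)            # what FREQ counts: all rows, here 5
n     <- sum(!is.na(dat$x))   # what N counts: non-missing values, here 3
nmiss <- sum(is.na(dat$x))    # what NMISS counts: missing values, here 2
```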

The TYPE variable identifies combinations of class categories, and the function produces summary statistics for each of those combinations. For example, with a single class variable the output dataset normally contains TYPE 0, which summarizes all class categories together, and TYPE 1, which summarizes each class category separately. If there are multiple classes, there will be multiple TYPE values for each level of class combinations. If you do not want to show the various type combinations, use the "nway" option. If you want to remove the TYPE column from the output dataset, use the "notype" option.
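
The base R sketch below mimics the idea behind TYPE for a single class variable: one overall row (TYPE 0) plus one row per class category (TYPE 1). It only imitates the layout for the mean of one variable and is not the function's implementation. The last line shows the kind of subset the "nway" option is described as returning.

```r
# TYPE 0: statistics over all class categories together
type0 <- data.frame(CLASS = NA_character_, TYPE = 0,
                    MEAN = mean(iris$Petal.Length))

# TYPE 1: statistics for each class category
by_class <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
type1 <- data.frame(CLASS = as.character(by_class$Species), TYPE = 1,
                    MEAN = by_class$Petal.Length)

all_types <- rbind(type0, type1)

# Keep only the highest TYPE value, as "nway" is described as doing
nway_only <- all_types[all_types$TYPE == max(all_types$TYPE), ]
```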

Using Factors

There are two occasions when you may want to define the class variable(s) as a factor. One is for sorting/ordering, and the other is for obtaining zero-counts on sparse data.

To order the class categories in the means output, define the class variable as a factor in the desired order. The function will then retain that order for the class categories in the output dataset and report.

You may also wish to define the class variable as a factor if you are dealing with sparse data and some of the class categories are not present in the data. To ensure these categories are displayed with zero-counts, define the class variable as a factor and use the "completetypes" option.
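
The reason a factor helps in both cases can be seen in base R, independent of proc_means. A character vector only knows about the values that are present, while a factor carries its full set of levels in a declared order. The sample values below are made up for this sketch.

```r
# Sparse data: no "versicolor" rows are present
species_chr <- c("virginica", "setosa", "virginica")

# Declare all categories in the desired order
species_fct <- factor(species_chr,
                      levels = c("virginica", "versicolor", "setosa"))

table(species_chr)            # only the categories present, sorted alphabetically
counts <- table(species_fct)  # every level in declared order, zero for versicolor
```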

Data Shaping

The output dataset produced by the "out" keyword can be shaped in different ways. These shaping options allow you to decide whether the data should be returned long and skinny, or short and wide. The shaping options can reduce the amount of data manipulation necessary to get the statistics into the desired form. The shaping options are as follows:

  • long: Transposes the output datasets so that statistics are in rows and variables are in columns.

  • stacked: Requests that output datasets be returned in "stacked" form, such that both statistics and variables are in rows.

  • wide: Requests that output datasets be returned in "wide" form, such that statistics are across the top in columns, and variables are in rows. This shaping option is the default.
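
The three layouts can be pictured with a small base R sketch that builds each shape by hand for two variables and two statistics. This is only an illustration of the shapes described above. The VALUE column name in the stacked layout is hypothetical, and the datasets actually returned by proc_means may carry additional columns such as TYPE and FREQ.

```r
vars       <- c("Petal.Length", "Petal.Width")
stat_names <- c("N", "MEAN")

# wide: variables in rows, statistics in columns
wide <- data.frame(
  VAR  = vars,
  N    = unname(sapply(iris[vars], function(x) sum(!is.na(x)))),
  MEAN = unname(sapply(iris[vars], mean))
)

# long: statistics in rows, variables in columns
long <- data.frame(STAT = stat_names, t(wide[stat_names]), row.names = NULL)
names(long)[-1] <- wide$VAR

# stacked: both variables and statistics in rows
stacked <- data.frame(
  VAR   = rep(wide$VAR, each = length(stat_names)),
  STAT  = rep(stat_names, times = nrow(wide)),
  VALUE = as.vector(t(as.matrix(wide[stat_names])))  # hypothetical column name
)
```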

Examples

# Turn off printing for CRAN checks
options("procs.print" = FALSE)

# Default statistics on iris
res1 <- proc_means(iris)

# View results
res1
#   TYPE FREQ          VAR   N     MEAN       STD MIN MAX
# 1    0  150 Sepal.Length 150 5.843333 0.8280661 4.3 7.9
# 2    0  150  Sepal.Width 150 3.057333 0.4358663 2.0 4.4
# 3    0  150 Petal.Length 150 3.758000 1.7652982 1.0 6.9
# 4    0  150  Petal.Width 150 1.199333 0.7622377 0.1 2.5

# Default statistics with by
res2 <- proc_means(iris,
                   by = Species)
# View results
res2
#            BY TYPE FREQ          VAR  N  MEAN       STD MIN MAX
# 1      setosa    0   50 Sepal.Length 50 5.006 0.3524897 4.3 5.8
# 2      setosa    0   50  Sepal.Width 50 3.428 0.3790644 2.3 4.4
# 3      setosa    0   50 Petal.Length 50 1.462 0.1736640 1.0 1.9
# 4      setosa    0   50  Petal.Width 50 0.246 0.1053856 0.1 0.6
# 5  versicolor    0   50 Sepal.Length 50 5.936 0.5161711 4.9 7.0
# 6  versicolor    0   50  Sepal.Width 50 2.770 0.3137983 2.0 3.4
# 7  versicolor    0   50 Petal.Length 50 4.260 0.4699110 3.0 5.1
# 8  versicolor    0   50  Petal.Width 50 1.326 0.1977527 1.0 1.8
# 9   virginica    0   50 Sepal.Length 50 6.588 0.6358796 4.9 7.9
# 10  virginica    0   50  Sepal.Width 50 2.974 0.3224966 2.2 3.8
# 11  virginica    0   50 Petal.Length 50 5.552 0.5518947 4.5 6.9
# 12  virginica    0   50  Petal.Width 50 2.026 0.2746501 1.4 2.5

# Specified variables, statistics, and options
res3 <- proc_means(iris,
                   var = v(Petal.Length, Petal.Width),
                   class = Species,
                   stats = v(n, mean, std, median, qrange, clm),
                   options = nofreq,
                   output = long)
# View results
res3
#         CLASS TYPE   STAT Petal.Length Petal.Width
# 1        <NA>    0      N  150.0000000 150.0000000
# 2        <NA>    0   MEAN    3.7580000   1.1993333
# 3        <NA>    0    STD    1.7652982   0.7622377
# 4        <NA>    0 MEDIAN    4.3500000   1.3000000
# 5        <NA>    0 QRANGE    3.5000000   1.5000000
# 6        <NA>    0   LCLM    3.4731854   1.0763533
# 7        <NA>    0   UCLM    4.0428146   1.3223134
# 8      setosa    1      N   50.0000000  50.0000000
# 9      setosa    1   MEAN    1.4620000   0.2460000
# 10     setosa    1    STD    0.1736640   0.1053856
# 11     setosa    1 MEDIAN    1.5000000   0.2000000
# 12     setosa    1 QRANGE    0.2000000   0.1000000
# 13     setosa    1   LCLM    1.4126452   0.2160497
# 14     setosa    1   UCLM    1.5113548   0.2759503
# 15 versicolor    1      N   50.0000000  50.0000000
# 16 versicolor    1   MEAN    4.2600000   1.3260000
# 17 versicolor    1    STD    0.4699110   0.1977527
# 18 versicolor    1 MEDIAN    4.3500000   1.3000000
# 19 versicolor    1 QRANGE    0.6000000   0.3000000
# 20 versicolor    1   LCLM    4.1264528   1.2697993
# 21 versicolor    1   UCLM    4.3935472   1.3822007
# 22  virginica    1      N   50.0000000  50.0000000
# 23  virginica    1   MEAN    5.5520000   2.0260000
# 24  virginica    1    STD    0.5518947   0.2746501
# 25  virginica    1 MEDIAN    5.5500000   2.0000000
# 26  virginica    1 QRANGE    0.8000000   0.5000000
# 27  virginica    1   LCLM    5.3951533   1.9479453
# 28  virginica    1   UCLM    5.7088467   2.1040547