The proc_means function generates summary statistics for selected variables on the input dataset. The variables are identified on the var parameter. The statistics to perform are identified on the stats parameter. Results are displayed in the viewer interactively and returned from the function.
proc_means(
  data,
  var = NULL,
  stats = c("n", "mean", "std", "min", "max"),
  output = NULL,
  by = NULL,
  class = NULL,
  options = NULL,
  titles = NULL
)
data: The input data frame for which to calculate summary statistics. This parameter is required.
var: The variable(s) to calculate summary statistics for. If no variables are specified, summary statistics will be generated for all numeric variables on the input data frame.
stats: A vector of summary statistics keywords. Valid keywords are: "css", "clm", "cv", "kurt", "kurtosis", "lclm", "mean", "median", "mode", "min", "max", "n", "nmiss", "nobs", "p1", "p5", "p10", "p20", "p25", "p30", "p40", "p50", "p60", "p70", "p75", "p80", "p90", "p95", "p99", "q1", "q3", "qrange", "range", "skew", "skewness", "std", "stddev", "stderr", "sum", "uclm", "uss", and "vari". For hypothesis testing, the function supports "t", "prt", "probt", and "df". Default statistics are: "n", "mean", "std", "min", and "max".
output: Whether or not to return datasets from the function. Valid values are "out", "none", and "report". Default is "out", which produces dataset output specifically designed for programmatic use. The "none" option returns NULL instead of a dataset or list of datasets. The "report" keyword returns the datasets from the interactive report, which may be different from the standard output. The output parameter also accepts the data shaping keywords "long", "stacked", and "wide". The shaping keywords control the structure of the output data. See the Data Shaping section for additional details. Note that multiple output keywords may be passed on a character vector. For example, to produce both a report dataset and a "long" output dataset, use the parameter output = c("report", "out", "long").
by: An optional by group. If you specify a by group, the input data will be subset on the by variable(s) prior to performing any statistics.
class: The class parameter is similar to the by parameter, but the output is different. By groups will create completely separate tables, while class groups will be continued in the same table. When a by and a class are both specified, the class will be nested in the by.
options: A vector of optional keywords. Valid values are: "alpha =", "completetypes", "maxdec =", "noprint", "notype", "nofreq", "nonobs", "nway". The "notype", "nofreq", and "nonobs" keywords will turn off columns on the output datasets. The "alpha = " option will set the alpha value for confidence limit statistics. The default is 95% (alpha = 0.05). The "maxdec = " option sets the maximum number of decimal places displayed on report output. The "nway" option returns only the highest type values.
titles: A vector of one or more titles to use for the report output.
Normally, the requested summary statistics are shown interactively in the viewer, and output results are returned as a data frame. If the request produces multiple data frames, they will be returned in a list. You may then access individual datasets from the list. The interactive report can be turned off using the "noprint" option, and the output datasets can be turned off using the "none" keyword on the output parameter.
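As an illustration of these two switches, the sketch below assumes the package is installed and loadable under the name procs (inferred from the "procs.print" option used in the examples). It first turns off the interactive report with "noprint", and then also turns off the returned datasets with "none". Treat it as a minimal sketch rather than a complete reference.

```r
# Assumption: the package is installed and is named "procs".
library(procs)

# Suppress the interactive report but keep the returned dataset
res <- proc_means(iris, options = "noprint")

# Suppress both the report and the returned dataset
nothing <- proc_means(iris, options = "noprint", output = "none")

is.data.frame(res)   # a single result comes back as a data frame
is.null(nothing)     # "none" returns NULL
```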
The proc_means function is for analysis of continuous variables. Data is passed in on the data parameter. The desired statistics are specified using keywords on the stats parameter. The function can segregate data into groups using the by and class parameters. There are also options to determine whether and what results are returned.
By default, proc_means results will be immediately sent to the viewer as an HTML report. This functionality makes it easy to get a quick analysis of your data. To turn off the interactive report, pass the "noprint" keyword to the options parameter.
The titles parameter allows you to set one or more titles for your report. Pass these titles as a vector of strings.
The exact datasets used for the interactive report can be returned as a list. To return these datasets as a list, pass the "report" keyword on the output parameter. This list may in turn be passed to proc_print to write the report to a file.
Dataset results are also returned from the function by default. If the results are a single dataset, a single data frame will be returned. If there are multiple results, a list of data frames will be returned.
The output datasets generated are optimized for data manipulation. The column names have been standardized, and additional variables may be present to help with data manipulation. For example, the by variable will always be named "BY", and the class variable will always be named "CLASS". In addition, data values in the output datasets are intentionally not rounded or formatted to give you the most accurate statistical results.
The following statistics keywords can be passed on the stats parameter. Normally, each statistic will be contained in a separate column and the column name will be the same as the statistic keyword. You may pass statistic keywords as a quoted vector of strings, or an unquoted vector using the v() function.
css: Corrected Sum of Squares.
clm, lclm, uclm: Confidence limits for the mean. The "clm" keyword requests both limits, "lclm" the lower limit, and "uclm" the upper limit.
cv: Coefficient of Variation.
kurt/kurtosis: Kurtosis, a description of the distribution tails. It requires at least 4 complete observations.
mean: The arithmetic mean.
median: The median.
mode: The mode of the target variable.
min, max: The minimum and maximum values of the target variable.
n: The number of non-missing observations.
nmiss: The number of missing observations.
nobs: The number of observations, whether missing or non-missing.
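p1 - p99: Percentiles. The supported keywords are p1, p5, p10, p20, p25, p30, p40, p50, p60, p70, p75, p80, p90, p95, and p99.
qrange, q1, q3: The first quartile (q1), the third quartile (q3), and the interquartile range (qrange), which is the difference between them.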
range: Difference between the minimum and maximum values.
skew/skewness: A measure of distribution skewness. It requires at least 3 complete observations.
std/stddev: Standard deviation.
stderr: Standard error.
sum: The sum of variable values.
uss: Uncorrected sum of squares.
vari: The variance.
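As a rough cross-check of a few of the keywords above, the sketch below recomputes them in base R for iris$Petal.Length. The formulas are the standard textbook definitions and are assumed, not confirmed, to match what proc_means computes internally; in particular, expressing cv as a percentage is an assumption.

```r
# Base R sketch: recompute selected statistics for one variable.
# Formulas are standard definitions, assumed to match proc_means.
x <- iris$Petal.Length
x <- x[!is.na(x)]            # statistics use non-missing values only

n   <- length(x)             # n
mn  <- mean(x)               # mean
std <- sd(x)                 # std
css <- sum((x - mn)^2)       # css: corrected sum of squares
uss <- sum(x^2)              # uss: uncorrected sum of squares
cv  <- 100 * std / mn        # cv: assumed to be reported as a percent
se  <- std / sqrt(n)         # stderr

# Two-sided confidence limits for the mean at alpha = 0.05
alpha <- 0.05
half  <- qt(1 - alpha / 2, df = n - 1) * se
lclm  <- mn - half
uclm  <- mn + half

round(c(lclm = lclm, uclm = uclm), 7)
```

The limits computed this way agree with the LCLM and UCLM values that the examples report for Petal.Length across all species (3.4731854 and 4.0428146).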
The function supports the following keywords to perform hypothesis testing:
t: Student's t statistic.
prt/probt: A two-tailed p-value for the Student's t statistic.
df: Degrees of freedom for the Student's t statistic.
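The three hypothesis testing keywords can be related to each other with a short base R sketch. It assumes a one-sample test of the null hypothesis that the mean is zero, which is the usual convention for this kind of summary procedure but is not stated in this document, so treat the null value as an assumption.

```r
# Base R sketch of a one-sample t test of H0: mean = 0 (assumed null).
x <- iris$Sepal.Width
x <- x[!is.na(x)]

n      <- length(x)
df     <- n - 1                          # df
t_stat <- mean(x) / (sd(x) / sqrt(n))    # t
prt    <- 2 * pt(-abs(t_stat), df)       # prt / probt (two-tailed)

# Cross-check against R's built-in one-sample test
chk <- t.test(x, mu = 0)

c(t = t_stat, df = df, prt = prt)
```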
The proc_means function recognizes the following options. Options may be passed as a quoted vector of strings, or an unquoted vector using the v() function.
alpha = : The "alpha = " option will set the alpha value for confidence limit statistics. Set the alpha as a decimal value between 0 and 1. For example, you can set a 90% confidence limit as alpha = 0.1.
completetypes: The "completetypes" option will generate all combinations of the class variable, even if there is no data present for a particular level. Combinations will be distinguished by the TYPE variable. To use the "completetypes" option, define the class variable(s) as a factor.
maxdec = : The "maxdec = " option will set the maximum number of decimal places displayed on report output. For example, you can set 4 decimal places as follows: maxdec = 4. Default is 7 decimal places. This option will not round any values on the output dataset.
nofreq, nonobs: Turns off the FREQ column on the output datasets.
noprint: Whether to print the interactive report to the viewer. By default, the report is printed to the viewer. The "noprint" option will inhibit printing.
notype: Turns off the TYPE column on the output dataset.
nway: Returns only the highest level TYPE combination. By default, the function returns all TYPE combinations.
The TYPE and FREQ variables appear on the output dataset by default.
The FREQ variable contains a count of the number of input rows/observations that were included in the statistics for that output row. The FREQ count can be different from the N statistic. The FREQ count is a count of the number of rows/observations, while the N statistic is a count of non-missing values. These counts can be different if you have missing values in your data. If you want to remove the FREQ column from the output dataset, use the "nofreq" option.
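The difference between the two counts can be seen with a small base R sketch. The vector is made-up illustration data, and the variable names are hypothetical; the code only mirrors the definitions given above and does not call proc_means.

```r
# Base R sketch: why FREQ and N can differ when values are missing.
x <- c(4.2, NA, 5.1, 3.8, NA, 6.0)   # made-up data with two missing values

freq  <- length(x)         # rows considered, which is what FREQ counts
n     <- sum(!is.na(x))    # non-missing values, the "n" statistic
nmiss <- sum(is.na(x))     # missing values, the "nmiss" statistic

c(FREQ = freq, N = n, NMISS = nmiss)
# FREQ is 6, N is 4, NMISS is 2
```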
The TYPE variable identifies combinations of class categories, and produces summary statistics for each of those combinations. For example, the output dataset normally produces statistics for TYPE 0, which covers all class categories together, and TYPE 1, which covers each class category separately. If there are multiple classes, there will be multiple TYPE values for each level of class combinations. If you do not want to show the various type combinations, use the "nway" option. If you want to remove the TYPE column from the output dataset, use the "notype" option.
There are some occasions when you may want to define the class variable(s) as a factor. One occasion is for sorting/ordering, and the other is for obtaining zero-counts on sparse data.
To order the class categories in the means output, define the class variable as a factor in the desired order. The function will then retain that order for the class categories in the output dataset and report.
You may also wish to define the class variable as a factor if you are dealing with sparse data and some of the class categories are not present in the data. To ensure these categories are displayed with zero-counts, define the class variable as a factor and use the "completetypes" option.
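The two properties of factors that this section relies on, a fixed level order and levels that exist without any data, can be shown in base R. The dose values below are hypothetical illustration data, and the sketch does not call proc_means.

```r
# Base R sketch: a factor fixes category order and keeps empty levels.
dose <- c("low", "high", "low", "high", "low")   # made-up class values

# Declare every expected level in the desired order; "medium" has no rows
dose_f <- factor(dose, levels = c("low", "medium", "high"))

levels(dose_f)   # "low" "medium" "high", not alphabetical
table(dose_f)    # low 3, medium 0, high 2
```

A plain character vector would carry neither the custom order nor the empty "medium" level, which is why the class variable needs to be a factor in both cases described above.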
The output dataset produced by the "out" keyword can be shaped in different ways. These shaping options allow you to decide whether the data should be returned long and skinny, or short and wide. The shaping options can reduce the amount of data manipulation necessary to get the statistics into the desired form. The shaping options are as follows:
long: Transposes the output datasets so that statistics are in rows and variables are in columns.
stacked: Requests that output datasets be returned in "stacked" form, such that both statistics and variables are in rows.
wide: Requests that output datasets be returned in "wide" form, such that statistics are across the top in columns, and variables are in rows. This shaping option is the default.
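To compare the three shapes side by side, the sketch below requests each one for the same two variables. It assumes the package is installed under the name procs, and it only inspects the column names of each result rather than asserting a full layout, since the "stacked" layout is not reproduced in this document.

```r
# Assumption: the package is installed and is named "procs".
library(procs)
options("procs.print" = FALSE)   # same switch used in the examples

vars <- c("Petal.Length", "Petal.Width")

wide    <- proc_means(iris, var = vars, output = "wide")     # default shape
long    <- proc_means(iris, var = vars, output = "long")
stacked <- proc_means(iris, var = vars, output = "stacked")

names(wide)      # statistics in columns, one row per variable
names(long)      # variables in columns, one row per statistic
names(stacked)   # both statistics and variables in rows
```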
# Turn off printing for CRAN checks
options("procs.print" = FALSE)
# Default statistics on iris
res1 <- proc_means(iris)
# View results
res1
#   TYPE FREQ          VAR   N     MEAN       STD MIN MAX
# 1    0  150 Sepal.Length 150 5.843333 0.8280661 4.3 7.9
# 2    0  150  Sepal.Width 150 3.057333 0.4358663 2.0 4.4
# 3    0  150 Petal.Length 150 3.758000 1.7652982 1.0 6.9
# 4    0  150  Petal.Width 150 1.199333 0.7622377 0.1 2.5
# Default statistics with by
res2 <- proc_means(iris,
                   by = Species)
# View results
res2
#            BY TYPE FREQ          VAR  N  MEAN       STD MIN MAX
# 1      setosa    0   50 Sepal.Length 50 5.006 0.3524897 4.3 5.8
# 2      setosa    0   50  Sepal.Width 50 3.428 0.3790644 2.3 4.4
# 3      setosa    0   50 Petal.Length 50 1.462 0.1736640 1.0 1.9
# 4      setosa    0   50  Petal.Width 50 0.246 0.1053856 0.1 0.6
# 5  versicolor    0   50 Sepal.Length 50 5.936 0.5161711 4.9 7.0
# 6  versicolor    0   50  Sepal.Width 50 2.770 0.3137983 2.0 3.4
# 7  versicolor    0   50 Petal.Length 50 4.260 0.4699110 3.0 5.1
# 8  versicolor    0   50  Petal.Width 50 1.326 0.1977527 1.0 1.8
# 9   virginica    0   50 Sepal.Length 50 6.588 0.6358796 4.9 7.9
# 10  virginica    0   50  Sepal.Width 50 2.974 0.3224966 2.2 3.8
# 11  virginica    0   50 Petal.Length 50 5.552 0.5518947 4.5 6.9
# 12  virginica    0   50  Petal.Width 50 2.026 0.2746501 1.4 2.5
# Specified variables, statistics, and options
res3 <- proc_means(iris,
                   var = v(Petal.Length, Petal.Width),
                   class = Species,
                   stats = v(n, mean, std, median, qrange, clm),
                   options = nofreq,
                   output = long)
# View results
res3
#         CLASS TYPE   STAT Petal.Length Petal.Width
# 1        <NA>    0      N  150.0000000 150.0000000
# 2        <NA>    0   MEAN    3.7580000   1.1993333
# 3        <NA>    0    STD    1.7652982   0.7622377
# 4        <NA>    0 MEDIAN    4.3500000   1.3000000
# 5        <NA>    0 QRANGE    3.5000000   1.5000000
# 6        <NA>    0   LCLM    3.4731854   1.0763533
# 7        <NA>    0   UCLM    4.0428146   1.3223134
# 8      setosa    1      N   50.0000000  50.0000000
# 9      setosa    1   MEAN    1.4620000   0.2460000
# 10     setosa    1    STD    0.1736640   0.1053856
# 11     setosa    1 MEDIAN    1.5000000   0.2000000
# 12     setosa    1 QRANGE    0.2000000   0.1000000
# 13     setosa    1   LCLM    1.4126452   0.2160497
# 14     setosa    1   UCLM    1.5113548   0.2759503
# 15 versicolor    1      N   50.0000000  50.0000000
# 16 versicolor    1   MEAN    4.2600000   1.3260000
# 17 versicolor    1    STD    0.4699110   0.1977527
# 18 versicolor    1 MEDIAN    4.3500000   1.3000000
# 19 versicolor    1 QRANGE    0.6000000   0.3000000
# 20 versicolor    1   LCLM    4.1264528   1.2697993
# 21 versicolor    1   UCLM    4.3935472   1.3822007
# 22  virginica    1      N   50.0000000  50.0000000
# 23  virginica    1   MEAN    5.5520000   2.0260000
# 24  virginica    1    STD    0.5518947   0.2746501
# 25  virginica    1 MEDIAN    5.5500000   2.0000000
# 26  virginica    1 QRANGE    0.8000000   0.5000000
# 27  virginica    1   LCLM    5.3951533   1.9479453
# 28  virginica    1   UCLM    5.7088467   2.1040547