The proc_freq function generates frequency statistics.
It can be used interactively for data exploration,
and can also produce dataset output for further analysis.
The function can perform one-way and two-way frequencies. Two-way
frequencies are produced as a cross-tabulation by default. There
are many options to control the generated tables. The function will return
requested tables in a named list.
proc_freq(
data,
tables = NULL,
output = NULL,
by = NULL,
weight = NULL,
options = NULL,
titles = NULL
)
data: The input data frame to perform frequency calculations on. Input data as the first parameter makes this function pipe-friendly.
tables: The variable or variables to perform frequency counts on. The table specifications are passed as a vector of strings. For one-way frequencies, simply pass the variable name. For two-way tables, pass the desired combination of variables separated by a star (*) operator. The parameter does not accept SAS® style grouping syntax. All cross combinations should be listed explicitly. If the table request is named, the name will be used as the list item name on the return list of tables. See "Example 3" for an illustration of how to name an output table.
output: Whether or not to return datasets from the function. Valid
values are "out", "none", and "report". Default is "out". This parameter
also accepts the data shaping options "long", "stacked", and "wide". See
the Data Shaping section for a description of these options. Multiple
output keywords may be passed in a character vector. For example,
to produce both a report dataset and a "long" output dataset,
use the parameter output = c("report", "out", "long").
by: An optional by group. The parameter accepts a vector of one or more variable names. When this parameter is set, data will be subset for each by group, and tables will be generated for each subset.
weight: An optional weight parameter. This parameter is passed as a variable name to use for the weight. If a weight variable is indicated, the weighted value will be summed to calculate the frequency counts.
options: The options desired for the function.
Options are passed to the parameter as a vector of quoted strings. You may
also use the v() function to pass unquoted strings.
The following options are available:
"chisq", "crosstab", "fisher", "list", "missing",
"nlevels", "nocol",
"nocum", "nofreq", "nopercent", "noprint",
"nonobs", "norow", "nosparse", "notable", "outcum". See
the Options section for a description of these options.
titles: A vector of titles to assign to the interactive report.
The function will return all requested datasets by default. This is
equivalent to the output = "out" option. To return the datasets
as created for the interactive report, pass the "report" output option. If
no output datasets are desired, pass the "none" output option. If a
single dataset is requested, the function
will return a single dataset. If multiple datasets are requested, the function
will return a list of datasets. The type of data frame returned will
correspond to the type of data frame passed in on the data parameter.
If the input data is a tibble, the output data will be a
tibble. If the input data is a Base R data frame, the output data will be
a Base R data frame.
The proc_freq function generates frequency statistics
for one-way and two-way tables. Data is passed in on the data
parameter. The desired frequencies are specified on the tables parameter.
By default, proc_freq results will
be immediately sent to the viewer as an HTML report. This functionality
makes it easy to get a quick analysis of your data with very little
effort. To turn off the interactive report, pass the "noprint" keyword
to the options parameter or set options("procs.print" = FALSE).
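As a small illustration of the global switch mentioned above, the sketch below turns the report off for a block of work and then restores the previous setting. It relies only on base R `options()`; the option name "procs.print" is taken from the text above, and the save-and-restore pattern is a suggestion rather than something the package requires.

```r
# Save the current setting while switching the interactive report off.
# options() returns the previous values invisibly.
old <- options("procs.print" = FALSE)

# Confirm the switch took effect.
printing_off <- isFALSE(getOption("procs.print"))
print(printing_off)

# ... calls to proc_freq() would go here and produce no viewer output ...

# Restore whatever was set before (unsets the option if it was not set).
options(old)
```

For a single call, passing "noprint" on the options parameter has the same effect without touching global state.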
The titles parameter allows you to set one or more titles for your
report. Pass these titles as a vector of strings.
If the frequency variables have a label assigned, that label will be used in the report output. This feature gives you some control over the column headers in the final report.
The exact datasets used for the interactive output can be returned as a list.
To return these datasets as a list, pass
the "report" keyword on the output parameter. This list may in
turn be passed to proc_print to write the report to a file.
The proc_freq function returns output datasets.
If you are requesting only one table, a single
data frame will be returned. If you request multiple tables, a list of data
frames will be returned.
By default, the list items are named according to the
strings specified on the tables parameter. You may control
the names of the returned results by using a named vector on the
tables parameter.
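The naming rule can be seen with a short sketch. The first part uses only base R to show what a named table request looks like as a character vector. The second part calls proc_freq the same way the examples on this page do, and is guarded so that it only runs when the procs package is installed. The variable names `tbls` and `res` are illustrative.

```r
# A table request vector in which only the third element is named.
tbls <- c("Hair", "Eye", Cross = "Hair * Eye")

# Unnamed elements carry an empty name; the named one carries "Cross".
print(names(tbls))

# Optional: run the real function if the package is available.
if (requireNamespace("procs", quietly = TRUE)) {
  library(procs)
  options("procs.print" = FALSE)  # suppress the viewer report

  df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

  res <- proc_freq(df,
                   tables = v(Hair, Eye, Cross = Hair * Eye),
                   weight = Freq)

  # Inspect the names of the returned list items.
  print(names(res))
}
```

Naming the two-way request avoids having to index the result with a backquoted name such as `Hair * Eye`.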
The standard output datasets are optimized for data manipulation. Column names have been standardized, and additional variables may be present to help with data manipulation. For instance, the by variable will always be named "BY", and the frequency category will always be named "CAT". In addition, data values in the output datasets are neither rounded nor formatted, so that you get the most accurate statistical results.
Normally the proc_freq function counts each row in the
input data equally. In some cases, however, each row in the data
can represent multiple observations, and rows should not be treated
equally. In these cases, use the weight parameter. The parameter
accepts a variable/column name to use as the weighted value. If the
weight parameter is used, the function will sum the weighted values
instead of counting rows.
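The difference between counting rows and summing a weight can be reproduced in base R. This is a sketch of the idea only, not the package's internal code. It uses the same HairEyeColor data as the examples on this page, where the Freq column holds the number of people each row represents.

```r
# Same sample data as the examples: 32 rows, one per Hair/Eye/Sex combination.
df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Unweighted: each row counts once, so every Hair level gets the same count.
unweighted <- table(df$Hair)
print(unweighted)

# Weighted: sum the Freq column within each Hair level.
weighted <- xtabs(Freq ~ Hair, data = df)
print(weighted)

# Percent of the weighted total, unrounded.
pct <- 100 * weighted / sum(weighted)
print(pct)
```

The weighted counts agree with the CNT column shown for Hair in Example 1.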
You may request that frequencies be separated into by groups using the
by parameter. The parameter accepts one or more variable names
from the input dataset. When this parameter is assigned, the data
will be subset by the "by" variable(s) before frequency counts are
calculated. On the interactive report, the by groups will appear in
separate tables. On the output dataset, the by groups will be identified
by additional columns.
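The subset-then-count behavior can be sketched in base R with split(). This is an illustration of the concept under the assumption that each by group is processed independently; it is not the package's implementation. The column names BY, CAT, N, and CNT mirror the output shown in Example 3.

```r
df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Split the data by the by variable, then compute weighted counts per subset.
groups <- split(df, df$Sex)

by_freq <- do.call(rbind, lapply(names(groups), function(g) {
  cnt <- xtabs(Freq ~ Hair, data = groups[[g]])
  data.frame(BY  = g,
             CAT = names(cnt),
             N   = sum(cnt),          # group total
             CNT = as.vector(cnt),    # weighted count per category
             stringsAsFactors = FALSE)
}))

# Percent is relative to the by group total, not the overall total.
by_freq$PCT <- 100 * by_freq$CNT / by_freq$N
rownames(by_freq) <- NULL

print(by_freq)
```

Note that the percentages sum to 100 within each by group.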
The options parameter accepts a vector of options. Normally, these
options must be quoted. But you may pass them unquoted using the v()
function. For example, you can request the number of category levels
and the Chi-Square statistic like this: options = v(nlevels, chisq).
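To make the quoted/unquoted equivalence concrete, here is a minimal stand-in for an unquoted-vector helper. The function `v_sketch` is hypothetical and written only for this illustration; the real v() function may differ in its details. The sketch captures its arguments without evaluating them and converts each one to a string.

```r
# Hypothetical stand-in for an unquoted vector helper (not the real v()).
v_sketch <- function(...) {
  # Capture the arguments as unevaluated expressions.
  exprs <- as.list(substitute(list(...)))[-1]
  # Turn each expression into its text form, keeping any names supplied.
  vapply(exprs, function(e) paste(deparse(e), collapse = ""), character(1))
}

# Unquoted options become the same character vector as the quoted form.
opts <- v_sketch(nlevels, chisq)
print(identical(opts, c("nlevels", "chisq")))

# Names and two-way requests are preserved as text.
tbls <- v_sketch(Hair, Cross = Hair * Eye)
print(tbls)
```

Under this reading, options = v(nlevels, chisq) and options = c("nlevels", "chisq") request the same thing.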
Below are all the available options and a description of each:
crosstab: Two-way output tables are a list style by default. If you want a crosstab style, pass the "crosstab" option.
list: Two-way interactive tables are a crosstab style by default. If you want a list style two-way table, pass the "list" option.
missing: Normally, missing values are not counted and not shown on frequency tables. The "missing" option allows you to treat missing (NA) values as normal values, so that they are counted and shown on the frequency table. Missing levels will appear on the table as a single dot (".").
nlevels: The "nlevels" option will display the number of unique values for each variable in the frequency table. These levels are generated as a separate table that appears on the report, and will also be output from the function as a separate dataset.
nocol: Two-way cross tabulation tables include column percents by default. To turn them off, pass the "nocol" option.
nocum: Whether to include the cumulative frequency and percent columns on one-way, interactive tables. These columns are included by default. To turn them off, pass the "nocum" option.
nofreq: The "nofreq" option will remove the frequency column from one-way and two-way tables.
nopercent: The "nopercent" option will remove the percent column from one-way and two-way tables.
noprint: Whether to print the interactive report to the viewer. By default, the report is printed to the viewer. The "noprint" option will inhibit printing.
nonobs: Whether to include the number of observations "N" column on the output and interactive tables. By default, the N column will be included. The "nonobs" option turns it off.
norow: Whether to include the row percentages on two-way crosstab tables. The "norow" option will turn them off.
nosparse/sparse: Whether to include categories for which there are no frequency counts. Zero-count categories will be included by default, which is the "sparse" option. If the "nosparse" option is present, zero-count categories will be removed.
notable: Whether to include the frequency table in the output dataset list. Normally, the frequency table is included. The "notable" option excludes it. You may want to exclude the frequency table in some cases, for instance, if you only want the Chi-Square statistic.
outcum: Whether to include the cumulative frequency and percent on output frequency tables. By default, these columns are not included. The "outcum" option will include them.
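Two of the options above, "missing" and "outcum", are easy to picture with base R equivalents. The sketch below is an analogy only and does not call the package: the first part shows how counting changes when NA is treated as a regular category, and the second part shows how cumulative frequency and cumulative percent columns are derived. The column names CNT, PCT, CUMSUM, and CUMPCT follow the output shown in Example 1.

```r
# "missing": by default NA values are left out of the counts.
x <- c("a", "b", NA, "a")
without_na <- table(x)
with_na    <- table(x, useNA = "ifany")
print(without_na)
print(with_na)

# "outcum": cumulative columns are running totals of count and percent.
df  <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)
cnt <- xtabs(Freq ~ Hair, data = df)

one_way <- data.frame(CAT = names(cnt),
                      CNT = as.vector(cnt),
                      stringsAsFactors = FALSE)
one_way$PCT    <- 100 * one_way$CNT / sum(one_way$CNT)
one_way$CUMSUM <- cumsum(one_way$CNT)
one_way$CUMPCT <- cumsum(one_way$PCT)

print(one_way)
```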
In addition to the above options, the options parameter accepts
some statistics options. The following keywords will generate
additional tables of specialized statistics. These statistics
options are only available on two-way tables:
chisq: Requests that the Chi-square statistics be produced.
fisher: Requests that Fisher's exact test statistics be produced.
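For readers who want to cross-check the Chi-Square output, the same statistic can be computed with base R's chisq.test() on the weighted two-way table. This is an independent check using the stats package, not a description of how proc_freq computes it. The column names CHISQ, CHISQ.DF, and CHISQ.P are copied from the output in Example 2.

```r
df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Weighted Hair by Eye table (4 x 4).
tab <- xtabs(Freq ~ Hair + Eye, data = df)

# Pearson Chi-Square test of independence.
ct <- chisq.test(tab)

chisq_out <- data.frame(CHISQ    = unname(ct$statistic),
                        CHISQ.DF = unname(ct$parameter),
                        CHISQ.P  = ct$p.value)
print(chisq_out)
```

The statistic and degrees of freedom should agree with the `chisq:Hair * Eye` dataset shown in Example 2.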
There are some occasions when you may want to define the tables variable
or by variables as a factor. One occasion is for sorting/ordering,
and the other is for obtaining zero-counts on sparse data.
To order the frequency categories in the frequency output, define the
tables variable as a factor in the desired order. The function will
then retain that order for the frequency categories in the output dataset
and report.
You may also wish to
define the tables variable as a factor if you are dealing with sparse data
and some of the frequency categories are not present in the data. To ensure
these categories are displayed with zero-counts, define the tables variable
or by variable as a factor and use the "sparse" option. Note
that the "sparse" option is actually the default.
If you do not want to
show the zero-count categories on a variable that is defined as a factor,
pass the "nosparse" keyword on the options parameter.
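Both uses of factors, ordering and zero-counts, follow from how R factors carry their level set. The base R sketch below shows the idea with table(); it is an analogy for the "sparse" and "nosparse" behavior described above rather than a call into the package. The variable `size` and its levels are made up for the illustration.

```r
# Raw values: "medium" never occurs in the data.
size <- c("small", "large", "small", "small")

# Define the levels explicitly, in the order they should be reported.
f <- factor(size, levels = c("small", "medium", "large"))

# Levels are kept in the declared order, and the unused level shows a zero.
sparse_counts <- table(f)
print(sparse_counts)

# Dropping unused levels removes the zero-count category.
nosparse_counts <- table(droplevels(f))
print(nosparse_counts)
```

Without the factor definition, the categories would sort alphabetically and "medium" would not appear at all.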
By default, the proc_freq function returns an output dataset of
frequency results. If running interactively, the function also prints
the frequency results to the viewer. As described above, the output
dataset can be somewhat different than the dataset sent to the viewer.
The output parameter allows you to choose which datasets to return.
There are three choices:
"out", "report", and "none". The "out" keyword returns the default output
dataset. The "report" keyword returns the dataset(s) sent to the viewer. You
may also pass "none" if you don't want any datasets returned from the function.
In addition, the output dataset produced by the "out" keyword can be shaped in different ways. These shaping options allow you to decide whether the data should be returned long and skinny, or short and wide. The shaping options can reduce the amount of data manipulation necessary to get the frequencies into the desired form. The shaping options are as follows:
long: Transposes the output datasets so that statistics are in rows and frequency categories are in columns.
stacked: Requests that output datasets be returned in "stacked" form, such that both statistics and frequency categories are in rows.
wide: Requests that output datasets be returned in "wide" form, such that statistics are across the top in columns, and frequency categories are in rows. This shaping option is the default.
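The three shapes can be pictured with a small base R example. The sketch builds a "wide" frequency table by hand and then rearranges it into the other two layouts as described above. The column names STAT and VALUE are assumptions made for this illustration; the columns in the actual output datasets may be named differently.

```r
# "wide": categories in rows, statistics in columns (the default shape).
wide <- data.frame(CAT = c("Black", "Blond", "Brown", "Red"),
                   CNT = c(108, 127, 286, 71),
                   stringsAsFactors = FALSE)
wide$PCT <- 100 * wide$CNT / sum(wide$CNT)

# "long": statistics in rows, categories in columns.
long <- data.frame(STAT = c("CNT", "PCT"),
                   t(wide[, c("CNT", "PCT")]),
                   stringsAsFactors = FALSE)
names(long)[-1] <- wide$CAT
rownames(long) <- NULL

# "stacked": both categories and statistics in rows, one value per row.
stacked <- data.frame(CAT   = rep(wide$CAT, times = 2),
                      STAT  = rep(c("CNT", "PCT"), each = nrow(wide)),
                      VALUE = c(wide$CNT, wide$PCT),
                      stringsAsFactors = FALSE)

print(wide)
print(long)
print(stacked)
```

Choosing the shape at the source can remove a later transpose or pivot step.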
For summary statistics, see proc_means. To pivot
or transpose the data coming from proc_freq, see proc_transpose.
library(procs)
# Turn off printing for CRAN checks
options("procs.print" = FALSE)
# Create sample data
df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)
# Assign labels
labels(df) <- list(Hair = "Hair Color",
                   Eye = "Eye Color",
                   Sex = "Sex at Birth")
# Example #1: One way frequencies on Hair and Eye color with weight option.
res <- proc_freq(df,
                 tables = v(Hair, Eye),
                 options = outcum,
                 weight = Freq)
# View result data
res
# $Hair
# VAR CAT N CNT PCT CUMSUM CUMPCT
# 1 Hair Black 592 108 18.24324 108 18.24324
# 2 Hair Blond 592 127 21.45270 235 39.69595
# 3 Hair Brown 592 286 48.31081 521 88.00676
# 4 Hair Red 592 71 11.99324 592 100.00000
#
# $Eye
# VAR CAT N CNT PCT CUMSUM CUMPCT
# 1 Eye Blue 592 215 36.31757 215 36.31757
# 2 Eye Brown 592 220 37.16216 435 73.47973
# 3 Eye Green 592 64 10.81081 499 84.29054
# 4 Eye Hazel 592 93 15.70946 592 100.00000
# Example #2: Two-way crosstabulation table with Chi-Square statistic
res <- proc_freq(df, tables = Hair * Eye,
                 weight = Freq,
                 options = v(crosstab, chisq))
# View result data
res
#$`Hair * Eye`
# Category Statistic Blue Brown Green Hazel Total
#1 Black Frequency 20.000000 68.000000 5.0000000 15.000000 108.00000
#2 Black Percent 3.378378 11.486486 0.8445946 2.533784 18.24324
#3 Black Row Pct 18.518519 62.962963 4.6296296 13.888889 NA
#4 Black Col Pct 9.302326 30.909091 7.8125000 16.129032 NA
#5 Blond Frequency 94.000000 7.000000 16.0000000 10.000000 127.00000
#6 Blond Percent 15.878378 1.182432 2.7027027 1.689189 21.45270
#7 Blond Row Pct 74.015748 5.511811 12.5984252 7.874016 NA
#8 Blond Col Pct 43.720930 3.181818 25.0000000 10.752688 NA
#9 Brown Frequency 84.000000 119.000000 29.0000000 54.000000 286.00000
#10 Brown Percent 14.189189 20.101351 4.8986486 9.121622 48.31081
#11 Brown Row Pct 29.370629 41.608392 10.1398601 18.881119 NA
#12 Brown Col Pct 39.069767 54.090909 45.3125000 58.064516 NA
#13 Red Frequency 17.000000 26.000000 14.0000000 14.000000 71.00000
#14 Red Percent 2.871622 4.391892 2.3648649 2.364865 11.99324
#15 Red Row Pct 23.943662 36.619718 19.7183099 19.718310 NA
#16 Red Col Pct 7.906977 11.818182 21.8750000 15.053763 NA
#17 Total Frequency 215.000000 220.000000 64.0000000 93.000000 592.00000
#18 Total Percent 36.317568 37.162162 10.8108108 15.709459 100.00000
# $`chisq:Hair * Eye`
# CHISQ CHISQ.DF CHISQ.P
# 1 138.2898 9 2.325287e-25
# Example #3: By variable with named table request
res <- proc_freq(df, tables = v(Hair, Eye, Cross = Hair * Eye),
                 by = Sex,
                 weight = Freq)
# View result data
res
# $Hair
# BY VAR CAT N CNT PCT
# 1 Female Hair Black 313 52 16.61342
# 2 Female Hair Blond 313 81 25.87859
# 3 Female Hair Brown 313 143 45.68690
# 4 Female Hair Red 313 37 11.82109
# 5 Male Hair Black 279 56 20.07168
# 6 Male Hair Blond 279 46 16.48746
# 7 Male Hair Brown 279 143 51.25448
# 8 Male Hair Red 279 34 12.18638
#
# $Eye
# BY VAR CAT N CNT PCT
# 1 Female Eye Blue 313 114 36.421725
# 2 Female Eye Brown 313 122 38.977636
# 3 Female Eye Green 313 31 9.904153
# 4 Female Eye Hazel 313 46 14.696486
# 5 Male Eye Blue 279 101 36.200717
# 6 Male Eye Brown 279 98 35.125448
# 7 Male Eye Green 279 33 11.827957
# 8 Male Eye Hazel 279 47 16.845878
#
# $Cross
# BY VAR1 VAR2 CAT1 CAT2 N CNT PCT
# 1 Female Hair Eye Black Blue 313 9 2.8753994
# 2 Female Hair Eye Black Brown 313 36 11.5015974
# 3 Female Hair Eye Black Green 313 2 0.6389776
# 4 Female Hair Eye Black Hazel 313 5 1.5974441
# 5 Female Hair Eye Blond Blue 313 64 20.4472843
# 6 Female Hair Eye Blond Brown 313 4 1.2779553
# 7 Female Hair Eye Blond Green 313 8 2.5559105
# 8 Female Hair Eye Blond Hazel 313 5 1.5974441
# 9 Female Hair Eye Brown Blue 313 34 10.8626198
# 10 Female Hair Eye Brown Brown 313 66 21.0862620
# 11 Female Hair Eye Brown Green 313 14 4.4728435
# 12 Female Hair Eye Brown Hazel 313 29 9.2651757
# 13 Female Hair Eye Red Blue 313 7 2.2364217
# 14 Female Hair Eye Red Brown 313 16 5.1118211
# 15 Female Hair Eye Red Green 313 7 2.2364217
# 16 Female Hair Eye Red Hazel 313 7 2.2364217
# 17 Male Hair Eye Black Blue 279 11 3.9426523
# 18 Male Hair Eye Black Brown 279 32 11.4695341
# 19 Male Hair Eye Black Green 279 3 1.0752688
# 20 Male Hair Eye Black Hazel 279 10 3.5842294
# 21 Male Hair Eye Blond Blue 279 30 10.7526882
# 22 Male Hair Eye Blond Brown 279 3 1.0752688
# 23 Male Hair Eye Blond Green 279 8 2.8673835
# 24 Male Hair Eye Blond Hazel 279 5 1.7921147
# 25 Male Hair Eye Brown Blue 279 50 17.9211470
# 26 Male Hair Eye Brown Brown 279 53 18.9964158
# 27 Male Hair Eye Brown Green 279 15 5.3763441
# 28 Male Hair Eye Brown Hazel 279 25 8.9605735
# 29 Male Hair Eye Red Blue 279 10 3.5842294
# 30 Male Hair Eye Red Brown 279 10 3.5842294
# 31 Male Hair Eye Red Green 279 7 2.5089606
# 32 Male Hair Eye Red Hazel 279 7 2.5089606