The proc_freq function generates frequency statistics. It can be used interactively for data exploration, and it can also produce dataset output for further analysis. The function can perform one-way and two-way frequencies. On the interactive report, two-way frequencies are produced as a cross-tabulation by default. There are many options to control the generated tables. When more than one table is requested, the function will return the requested tables in a named list.

proc_freq(
  data,
  tables = NULL,
  output = NULL,
  by = NULL,
  weight = NULL,
  options = NULL,
  titles = NULL
)

Arguments

data

The input data frame to perform frequency calculations on. Input data as the first parameter makes this function pipe-friendly.

tables

The variable or variables to perform frequency counts on. The table specifications are passed as a vector of strings. For one-way frequencies, simply pass the variable name. For two-way tables, pass the desired combination of variables separated by a star (*) operator. The parameter does not accept SAS® style grouping syntax. All cross combinations should be listed explicitly. If the table request is named, the name will be used as the list item name on the return list of tables. See "Example 3" for an illustration on how to name an output table.

output

Whether or not to return datasets from the function. Valid values are "out", "none", and "report". Default is "out". This parameter also accepts the data shaping options "long", "stacked", and "wide". See the Data Shaping section for a description of these options. Multiple output keywords may be passed on a character vector. For example, to produce both a report dataset and a "long" output dataset, use the parameter output = c("report", "out", "long").

by

An optional by group. The parameter accepts a vector of one or more variable names. When this parameter is set, data will be subset for each by group, and tables will be generated for each subset.

weight

An optional weight parameter. This parameter is passed as a variable name to use for the weight. If a weight variable is indicated, the weighted value will be summed to calculate the frequency counts.

options

The options desired for the function. Options are passed to the parameter as a vector of quoted strings. You may also use the v() function to pass unquoted strings. The following options are available: "chisq", "crosstab", "fisher", "list", "missing", "nlevels", "nocol", "nocum", "nofreq", "nopercent", "noprint", "nonobs", "norow", "nosparse", "notable", "outcum". See the Options section for a description of these options.

titles

A vector of titles to assign to the interactive report.

Value

The function will return all requested datasets by default. This is equivalent to the output = "out" option. To return the datasets as created for the interactive report, pass the "report" output option. If no output datasets are desired, pass the "none" output option. If a single dataset is requested, the function will return a single dataset. If multiple datasets are requested, the function will return a list of datasets. The type of data frame returned will correspond to the type of data frame passed in on the data parameter. If the input data is a tibble, the output data will be a tibble. If the input data is a Base R data frame, the output data will be a Base R data frame.

Details

The proc_freq function generates frequency statistics for one-way and two-way tables. Data is passed in on the data parameter. The desired frequencies are specified on the tables parameter.

Report Output

By default, proc_freq results will be immediately sent to the viewer as an HTML report. This functionality makes it easy to get a quick analysis of your data with very little effort. To turn off the interactive report, pass the "noprint" keyword to the options parameter or set options("procs.print" = FALSE).

The titles parameter allows you to set one or more titles for your report. Pass these titles as a vector of strings.

If the frequency variables have a label assigned, that label will be used in the report output. This feature gives you some control over the column headers in the final report.

The exact datasets used for the interactive output can be returned as a list. To return these datasets as a list, pass the "report" keyword on the output parameter. This list may in turn be passed to proc_print to write the report to a file.
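As a minimal sketch of the report controls described above, the calls below suppress the viewer for a single call with "noprint" and then request the report datasets with a title. The sample data frame and the title text are illustrative choices, not part of the function.

```r
library(procs)

# Sample data: one row per Hair/Eye/Sex combination, with a Freq count column
df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Suppress the viewer report for this call only
res <- proc_freq(df, tables = "Hair", weight = Freq, options = "noprint")

# Return the datasets built for the report, with a report title assigned
rpt <- proc_freq(df, tables = "Hair", weight = Freq,
                 output = "report",
                 options = "noprint",
                 titles = "Hair Color Frequencies")
```

To turn off the viewer for a whole session rather than one call, set options("procs.print" = FALSE) once at the top of the script.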

Data Frame Output

The proc_freq function returns output datasets. If you are requesting only one table, a single data frame will be returned. If you request multiple tables, a list of data frames will be returned.

By default, the list items are named according to the strings specified on the tables parameter. You may control the names of the returned results by using a named vector on the tables parameter.

The standard output datasets are optimized for data manipulation. Column names have been standardized, and additional variables may be present to help with data manipulation. For instance, the by variable will always be named "BY", and the frequency category will always be named "CAT". In addition, data values in the output datasets are neither rounded nor formatted, so that you get the most accurate statistical results.
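The following sketch illustrates how the return value changes with the number of tables requested, and how a named vector on the tables parameter controls the list item names. The names HairFreq and EyeFreq are hypothetical labels chosen for this illustration.

```r
library(procs)

# Turn off the interactive report
options("procs.print" = FALSE)

df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Two named table requests: the names become the list item names
res <- proc_freq(df,
                 tables = c(HairFreq = "Hair", EyeFreq = "Eye"),
                 weight = Freq)
names(res)

# Standardized column names make the results easy to subset
res$HairFreq[, c("CAT", "CNT", "PCT")]

# A single table request is expected to come back as a single data frame
one <- proc_freq(df, tables = "Hair", weight = Freq)
is.data.frame(one)
```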

Frequency Weight

Normally the proc_freq function counts each row in the input data equally. In some cases, however, each row in the data can represent multiple observations, and rows should not be treated equally. In these cases, use the weight parameter. The parameter accepts a variable/column name to use as the weighted value. If the weight parameter is used, the function will sum the weighted values instead of counting rows.
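To make the difference concrete, here is a small sketch that runs the same table with and without a weight, and then cross-checks the weighted counts against a sum computed by hand in base R. The tapply call is only a verification aid for this illustration and is not how the function is implemented.

```r
library(procs)

# Turn off the interactive report
options("procs.print" = FALSE)

# 32 rows: one per Hair/Eye/Sex combination, with the count stored in Freq
df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Without a weight, each row counts once
unweighted <- proc_freq(df, tables = "Hair")

# With a weight, the Freq values are summed within each category
weighted <- proc_freq(df, tables = "Hair", weight = Freq)

# Base R cross-check: sum Freq within each Hair category
by_hand <- tapply(df$Freq, df$Hair, sum)

unweighted[, c("CAT", "CNT")]
weighted[, c("CAT", "CNT")]
by_hand
```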

By Groups

You may request that frequencies be separated into by groups using the by parameter. The parameter accepts one or more variable names from the input dataset. When this parameter is assigned, the data will be subset by the "by" variable(s) before frequency counts are calculated. On the interactive report, the by groups will appear in separate tables. On the output dataset, the by groups will be identified by additional columns.
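As a short sketch of working with by-group output, the code below requests a one-way table by Sex and then pulls out a single group using the standardized "BY" column. The subsetting step is plain base R and is shown only as one possible way to use the result.

```r
library(procs)

# Turn off the interactive report
options("procs.print" = FALSE)

df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# One-way frequencies on Hair, calculated separately for each Sex
res <- proc_freq(df, tables = "Hair", by = Sex, weight = Freq)

# The by group is identified by the BY column on the output dataset
female <- res[res$BY == "Female", ]
female
```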

Options

The options parameter accepts a vector of options. Normally, these options must be quoted. But you may pass them unquoted using the v() function. For example, you can request the number of category levels and the Chi-Square statistic like this: options = v(nlevels, chisq).

Below are all the available options and a description of each:

  • crosstab: Two-way output tables are a list style by default. If you want a crosstab style, pass the "crosstab" option.

  • list: Two-way interactive tables are a crosstab style by default. If you want a list style two-way table, pass the "list" option.

  • missing: Normally, missing values are not counted and not shown on frequency tables. The "missing" option allows you to treat missing (NA) values as normal values, so that they are counted and shown on the frequency table. Missing levels will appear on the table as a single dot (".").

  • nlevels: The "nlevels" option will display the number of unique values for each variable in the frequency table. These levels are generated as a separate table that appears on the report, and will also be output from the function as a separate dataset.

  • nocol: Two-way cross tabulation tables include column percents by default. To turn them off, pass the "nocol" option.

  • nocum: Whether to include the cumulative frequency and percent columns on one-way, interactive tables. These columns are included by default. To turn them off, pass the "nocum" option.

  • nofreq: The "nofreq" option will remove the frequency column from one-way and two-way tables.

  • nopercent: The "nopercent" option will remove the percent column from one-way and two-way tables.

  • noprint: Whether to print the interactive report to the viewer. By default, the report is printed to the viewer. The "noprint" option will inhibit printing.

  • nonobs: Whether to include the number of observations "N" column on the output and interactive tables. By default, the N column will be included. The "nonobs" option turns it off.

  • norow: Whether to include the row percentages on two-way crosstab tables. The "norow" option will turn them off.

  • nosparse/sparse: Whether to include categories for which there are no frequency counts. Zero-count categories will be included by default, which is the "sparse" option. If the "nosparse" option is present, zero-count categories will be removed.

  • notable: Whether to include the frequency table in the output dataset list. Normally, the frequency table is included. The "notable" option excludes it. You may want to exclude the frequency table in some cases, for instance, if you only want the Chi-Square statistic.

  • outcum: Whether to include the cumulative frequency and percent on output frequency tables. By default, these columns are not included. The "outcum" option will include them.
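The sketch below combines a few of the options above to show their effect on the output datasets. It assumes the column names shown in the Examples section (N, CUMSUM, CUMPCT, Category, Statistic) and is meant as an illustration rather than a complete tour of the options.

```r
library(procs)

# Turn off the interactive report
options("procs.print" = FALSE)

df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Add cumulative columns and drop the N column on a one-way table
res <- proc_freq(df, tables = "Hair", weight = Freq,
                 options = v(outcum, nonobs))
names(res)

# Two-way output is list style by default
lst <- proc_freq(df, tables = "Hair * Eye", weight = Freq)

# Request crosstab style for the two-way output instead
ct <- proc_freq(df, tables = "Hair * Eye", weight = Freq,
                options = "crosstab")

names(lst)
names(ct)
```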

Statistics Options

In addition to the above options, the options parameter accepts some statistics options. The following keywords will generate additional tables of specialized statistics. These statistics options are only available on two-way tables:

  • chisq: Requests that the Chi-square statistics be produced.

  • fisher: Requests that the Fisher's exact statistics be produced.
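As a sanity check on the "chisq" option, the sketch below requests the Chi-Square statistic for a two-way table and compares it with the Pearson test from base R's chisq.test on the same weighted cross-tabulation. Looking up the statistics dataset by matching "chisq" in the list names is an assumption based on the naming shown in the Examples section.

```r
library(procs)

# Turn off the interactive report
options("procs.print" = FALSE)

df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Two-way table with the Chi-Square statistic requested
res <- proc_freq(df, tables = "Hair * Eye", weight = Freq,
                 options = "chisq")

# Pick out the statistics dataset from the returned list
chi <- res[[grep("chisq", names(res))[1]]]

# Base R cross-check on the same weighted cross-tabulation
base_chi <- chisq.test(xtabs(Freq ~ Hair + Eye, data = df))

# Compare the two statistics side by side
c(procs = chi$CHISQ, base = unname(base_chi$statistic))
```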

Using Factors

There are two occasions when you may want to define the tables variable or by variables as a factor. One is for sorting/ordering, and the other is for obtaining zero-counts on sparse data.

To order the frequency categories in the frequency output, define the tables variable as a factor in the desired order. The function will then retain that order for the frequency categories in the output dataset and report.

You may also wish to define the tables variable as a factor if you are dealing with sparse data and some of the frequency categories are not present in the data. To ensure these categories are displayed with zero-counts, define the tables variable or by variable as a factor and use the "sparse" option. Note that the "sparse" option is actually the default.

If you do not want to show the zero-count categories on a variable that is defined as a factor, pass the "nosparse" keyword on the options parameter.
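Both uses of factors can be sketched as follows. The level order and the choice to drop the "Red" rows are arbitrary and made only for illustration. The second part runs unweighted so that each remaining row counts once.

```r
library(procs)

# Turn off the interactive report
options("procs.print" = FALSE)

df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Ordering: define Hair as a factor with the desired level order
df$Hair <- factor(df$Hair, levels = c("Red", "Brown", "Blond", "Black"))
res <- proc_freq(df, tables = "Hair", weight = Freq)
as.character(res$CAT)

# Sparse data: remove every "Red" row; the factor keeps "Red" as a level
no_red <- df[df$Hair != "Red", ]

# Default ("sparse") behavior versus the "nosparse" option
sparse   <- proc_freq(no_red, tables = "Hair")
nosparse <- proc_freq(no_red, tables = "Hair", options = "nosparse")

sparse
nosparse
```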

Data Shaping

By default, the proc_freq function returns an output dataset of frequency results. If running interactively, the function also prints the frequency results to the viewer. As described above, the output dataset can be somewhat different from the dataset sent to the viewer. The output parameter allows you to choose which datasets to return. There are three choices: "out", "report", and "none". The "out" keyword returns the default output dataset. The "report" keyword returns the dataset(s) sent to the viewer. You may also pass "none" if you don't want any datasets returned from the function.

In addition, the output dataset produced by the "out" keyword can be shaped in different ways. These shaping options allow you to decide whether the data should be returned long and skinny, or short and wide. The shaping options can reduce the amount of data manipulation necessary to get the frequencies into the desired form. The shaping options are as follows:

  • long: Transposes the output datasets so that statistics are in rows and frequency categories are in columns.

  • stacked: Requests that output datasets be returned in "stacked" form, such that both statistics and frequency categories are in rows.

  • wide: Requests that output datasets be returned in "wide" form, such that statistics are across the top in columns, and frequency categories are in rows. This shaping option is the default.
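A quick way to see the three shapes is to request the same table with each shaping keyword and compare the dimensions of the results. This is a sketch only: the shaping keyword is passed together with "out", following the pattern shown for the output parameter, and no particular column layout is assumed.

```r
library(procs)

# Turn off the interactive report
options("procs.print" = FALSE)

df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Default shape: statistics in columns, categories in rows
wide <- proc_freq(df, tables = "Hair", weight = Freq)

# Statistics in rows, categories in columns
long <- proc_freq(df, tables = "Hair", weight = Freq,
                  output = c("out", "long"))

# Both statistics and categories in rows
stacked <- proc_freq(df, tables = "Hair", weight = Freq,
                     output = c("out", "stacked"))

# Compare the shapes of the three results
dim(wide)
dim(long)
dim(stacked)
```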

See also

For summary statistics, see proc_means. To pivot or transpose the data coming from proc_freq, see proc_transpose.

Examples

library(procs)

# Turn off printing for CRAN checks
options("procs.print" = FALSE)

# Create sample data
df <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)

# Assign labels
labels(df) <- list(Hair = "Hair Color",
                   Eye = "Eye Color",
                   Sex = "Sex at Birth")

# Example #1: One way frequencies on Hair and Eye color with weight option.
res <- proc_freq(df,
                 tables = v(Hair, Eye),
                 options = outcum,
                 weight = Freq)

# View result data
res
# $Hair
#    VAR   CAT   N CNT      PCT CUMSUM    CUMPCT
# 1 Hair Black 592 108 18.24324    108  18.24324
# 2 Hair Blond 592 127 21.45270    235  39.69595
# 3 Hair Brown 592 286 48.31081    521  88.00676
# 4 Hair   Red 592  71 11.99324    592 100.00000
#
# $Eye
#   VAR   CAT   N CNT      PCT CUMSUM    CUMPCT
# 1 Eye  Blue 592 215 36.31757    215  36.31757
# 2 Eye Brown 592 220 37.16216    435  73.47973
# 3 Eye Green 592  64 10.81081    499  84.29054
# 4 Eye Hazel 592  93 15.70946    592 100.00000

# Example #2: Two-way crosstabulation table with Chi-Square statistic
res <- proc_freq(df, tables = Hair * Eye,
                     weight = Freq,
                     options = v(crosstab, chisq))

# View result data
res
# $`Hair * Eye`
#    Category Statistic       Blue      Brown      Green     Hazel     Total
# 1     Black Frequency  20.000000  68.000000  5.0000000 15.000000 108.00000
# 2     Black   Percent   3.378378  11.486486  0.8445946  2.533784  18.24324
# 3     Black   Row Pct  18.518519  62.962963  4.6296296 13.888889        NA
# 4     Black   Col Pct   9.302326  30.909091  7.8125000 16.129032        NA
# 5     Blond Frequency  94.000000   7.000000 16.0000000 10.000000 127.00000
# 6     Blond   Percent  15.878378   1.182432  2.7027027  1.689189  21.45270
# 7     Blond   Row Pct  74.015748   5.511811 12.5984252  7.874016        NA
# 8     Blond   Col Pct  43.720930   3.181818 25.0000000 10.752688        NA
# 9     Brown Frequency  84.000000 119.000000 29.0000000 54.000000 286.00000
# 10    Brown   Percent  14.189189  20.101351  4.8986486  9.121622  48.31081
# 11    Brown   Row Pct  29.370629  41.608392 10.1398601 18.881119        NA
# 12    Brown   Col Pct  39.069767  54.090909 45.3125000 58.064516        NA
# 13      Red Frequency  17.000000  26.000000 14.0000000 14.000000  71.00000
# 14      Red   Percent   2.871622   4.391892  2.3648649  2.364865  11.99324
# 15      Red   Row Pct  23.943662  36.619718 19.7183099 19.718310        NA
# 16      Red   Col Pct   7.906977  11.818182 21.8750000 15.053763        NA
# 17    Total Frequency 215.000000 220.000000 64.0000000 93.000000 592.00000
# 18    Total   Percent  36.317568  37.162162 10.8108108 15.709459 100.00000
#
# $`chisq:Hair * Eye`
#      CHISQ CHISQ.DF      CHISQ.P
# 1 138.2898        9 2.325287e-25

# Example #3: By variable with named table request
res <- proc_freq(df, tables = v(Hair, Eye, Cross = Hair * Eye),
                 by = Sex,
                 weight = Freq)

# View result data
res
# $Hair
#       BY  VAR   CAT   N CNT      PCT
# 1 Female Hair Black 313  52 16.61342
# 2 Female Hair Blond 313  81 25.87859
# 3 Female Hair Brown 313 143 45.68690
# 4 Female Hair   Red 313  37 11.82109
# 5   Male Hair Black 279  56 20.07168
# 6   Male Hair Blond 279  46 16.48746
# 7   Male Hair Brown 279 143 51.25448
# 8   Male Hair   Red 279  34 12.18638
#
# $Eye
#       BY VAR   CAT   N CNT       PCT
# 1 Female Eye  Blue 313 114 36.421725
# 2 Female Eye Brown 313 122 38.977636
# 3 Female Eye Green 313  31  9.904153
# 4 Female Eye Hazel 313  46 14.696486
# 5   Male Eye  Blue 279 101 36.200717
# 6   Male Eye Brown 279  98 35.125448
# 7   Male Eye Green 279  33 11.827957
# 8   Male Eye Hazel 279  47 16.845878
#
# $Cross
#        BY VAR1 VAR2  CAT1  CAT2   N CNT        PCT
# 1  Female Hair  Eye Black  Blue 313   9  2.8753994
# 2  Female Hair  Eye Black Brown 313  36 11.5015974
# 3  Female Hair  Eye Black Green 313   2  0.6389776
# 4  Female Hair  Eye Black Hazel 313   5  1.5974441
# 5  Female Hair  Eye Blond  Blue 313  64 20.4472843
# 6  Female Hair  Eye Blond Brown 313   4  1.2779553
# 7  Female Hair  Eye Blond Green 313   8  2.5559105
# 8  Female Hair  Eye Blond Hazel 313   5  1.5974441
# 9  Female Hair  Eye Brown  Blue 313  34 10.8626198
# 10 Female Hair  Eye Brown Brown 313  66 21.0862620
# 11 Female Hair  Eye Brown Green 313  14  4.4728435
# 12 Female Hair  Eye Brown Hazel 313  29  9.2651757
# 13 Female Hair  Eye   Red  Blue 313   7  2.2364217
# 14 Female Hair  Eye   Red Brown 313  16  5.1118211
# 15 Female Hair  Eye   Red Green 313   7  2.2364217
# 16 Female Hair  Eye   Red Hazel 313   7  2.2364217
# 17   Male Hair  Eye Black  Blue 279  11  3.9426523
# 18   Male Hair  Eye Black Brown 279  32 11.4695341
# 19   Male Hair  Eye Black Green 279   3  1.0752688
# 20   Male Hair  Eye Black Hazel 279  10  3.5842294
# 21   Male Hair  Eye Blond  Blue 279  30 10.7526882
# 22   Male Hair  Eye Blond Brown 279   3  1.0752688
# 23   Male Hair  Eye Blond Green 279   8  2.8673835
# 24   Male Hair  Eye Blond Hazel 279   5  1.7921147
# 25   Male Hair  Eye Brown  Blue 279  50 17.9211470
# 26   Male Hair  Eye Brown Brown 279  53 18.9964158
# 27   Male Hair  Eye Brown Green 279  15  5.3763441
# 28   Male Hair  Eye Brown Hazel 279  25  8.9605735
# 29   Male Hair  Eye   Red  Blue 279  10  3.5842294
# 30   Male Hair  Eye   Red Brown 279  10  3.5842294
# 31   Male Hair  Eye   Red Green 279   7  2.5089606
# 32   Male Hair  Eye   Red Hazel 279   7  2.5089606