The proc_reg function performs a regression for one or more models. The model(s) are passed on the model parameter, and the input dataset is passed on the data parameter. The stats parameter allows you to request additional statistics, similar to the model options in SAS. The by parameter allows you to subset the data into groups and run the model on each group. The weight parameter lets you assign a weight to each observation in the dataset. The output and options parameters provide additional customization of the results.

proc_reg(
  data,
  model,
  by = NULL,
  stats = NULL,
  output = NULL,
  weight = NULL,
  options = NULL,
  titles = NULL
)

Arguments

data

The input data frame for which to perform the regression analysis. This parameter is required.

model

A model for the regression to be performed. The model can be specified using either R syntax or SAS syntax. model = var1 ~ var2 + var3 is an example of R style model syntax. If you wish to pass multiple models using R syntax, pass them in a list. For SAS syntax, pass the model as a quoted string: model = "var1 = var2 var3". To pass multiple models using SAS syntax, pass them as a vector of strings. By default, the models will be named "MODEL1", "MODEL2", etc. If you want to name your model, pass it as a named list or named vector.

by

An optional by group. If you specify a by group, the input data will be subset on the by variable(s) prior to performing the regression. For multiple by variables, pass them as a quoted vector of variable names. You may also pass them unquoted using the v function.
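As a rough illustration of what by-group processing means, the following base R sketch subsets the data on a grouping variable and fits the same model to each subset. It is not the proc_reg implementation. The grouping variable here is a hypothetical one built only for this example.

```r
# Base R sketch of by-group processing; not the procs implementation.
dat <- cars

# Hypothetical grouping variable, used only for this illustration
dat$group <- ifelse(seq_len(nrow(dat)) %% 2 == 0, "Group A", "Group B")

# Subset the data on the by variable, then fit the same model to each subset
fits <- lapply(split(dat, dat$group), function(d) lm(dist ~ speed, data = d))

# Collect one row of coefficients per by group
by_coefs <- do.call(rbind, lapply(fits, coef))
by_coefs
```

Each row of by_coefs corresponds to one by group, which mirrors the one-row-per-group layout of the proc_reg output dataset.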

stats

Optional statistics keywords. Valid values are "adjrsq", "clb", "est", "edf", "hcc", "hccmethod", "mse", "p", "press", "rsquare", "sse", "spec", "seb", and "table". A single keyword may be passed with or without quotes. Pass multiple keywords either as a quoted vector, or unquoted vector using the v() function. These statistics keywords largely correspond to the options on the "model" statement in SAS. Most of them control which statistics are added to the interactive report. Some keywords control statistics on the output dataset. See the Statistics Keywords section for details on the purpose and target of each keyword.

output

Whether or not to return datasets from the function. Valid values are "out", "none", and "report". Default is "out", and will produce dataset output specifically designed for programmatic use. The "none" option will return a NULL instead of a dataset or list of datasets. The "report" keyword returns the datasets from the interactive report, which may be different from the standard output. Note that some statistics are only available on the interactive report. The output parameter also accepts data shaping keywords "long", "stacked", and "wide". These shaping keywords control the structure of the output data. See the Data Shaping section for additional details. Note that multiple output keywords may be passed on a character vector. For example, to produce both a report dataset and a "long" output dataset, use the parameter output = c("report", "out", "long").

weight

The name of a variable to use as a weight for each observation. A common choice of weight is the inverse of each observation's variance.
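The effect of observation weights can be sketched in base R with lm(). This is an illustration of weighted least squares in general and not the proc_reg implementation. The weight variable w and the assumption that the error variance grows with speed are hypothetical.

```r
# Sketch of weighted least squares in base R; illustrative only.
dat <- cars

# Hypothetical assumption: the error variance is proportional to speed,
# so the inverse-variance weight is 1 / speed
dat$w <- 1 / dat$speed

# lm() minimises sum(w * residual^2) when weights are supplied
fit_w <- lm(dist ~ speed, data = dat, weights = w)

# The same estimates from the weighted normal equations (X'WX) b = X'Wy
X <- cbind(Intercept = 1, speed = dat$speed)
W <- diag(dat$w)
b <- solve(t(X) %*% W %*% X, t(X) %*% W %*% dat$dist)

coef(fit_w)
drop(b)
```

Both approaches give the same coefficients. Observations with larger weights pull the fitted line closer to themselves.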

options

A vector of optional keywords. Valid values are: "alpha =", "edf", "noprint", "outest", "outseb", "press", "rsquare", and "tableout". The "alpha = " option will set the alpha value for confidence limit statistics. The default is alpha = 0.05, which gives 95% confidence limits. The "noprint" option turns off the interactive report. For other options, see the Options section for explanations of each.

titles

A vector of one or more titles to use for the report output.

Value

Normally, the requested regression statistics are shown interactively in the viewer, and output results are returned as a data frame. If you request "report" datasets, they will be returned as a list. You may then access individual datasets from the list using dollar sign ($) syntax. The interactive report can be turned off using the "noprint" option. The output dataset can be turned off using the "none" keyword on the output parameter. If the output dataset is turned off, the function will return a NULL.

Details

The proc_reg function is a general purpose regression function. It produces a dataset output by default, and, when working in RStudio, also produces an interactive report. The function has many convenient options for what statistics are produced and how the analysis is performed. All statistical output from proc_reg matches SAS.

A model may be specified using R model syntax or SAS model syntax. To use SAS syntax, the model statement must be quoted. To pass multiple models using R syntax, pass them to the model parameter in a list. To pass multiple models using SAS syntax, pass them to the model parameter as a vector of strings.
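The relationship between the two syntaxes can be sketched with a small helper. The sas_to_formula function below is hypothetical. It is not part of the package and is not how proc_reg parses models. It only shows how a SAS-style string corresponds to an R formula.

```r
# Hypothetical helper: convert a SAS-style model string to an R formula.
# Illustration only; this is not how procs parses models.
sas_to_formula <- function(x) {
  parts <- strsplit(x, "=", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2)
  lhs <- trimws(parts[1])
  # Independent variables are separated by whitespace in SAS syntax
  rhs <- strsplit(trimws(parts[2]), "\\s+")[[1]]
  reformulate(rhs, response = lhs)
}

f1 <- sas_to_formula("dist = speed")
f2 <- sas_to_formula("var1 = var2 var3")

# A named vector of SAS-style models maps to a named list of R-style models
sas_models <- c(mod1 = "dist = speed", mod2 = "speed = dist")
r_models <- lapply(sas_models, sas_to_formula)
r_models
```

The names on the vector carry over to the list, which corresponds to the documented way of naming models in either syntax.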

Interactive Output

By default, proc_reg results will be sent to the viewer as an HTML report. This functionality makes it easy to get a quick analysis of your data. To turn off the interactive report, pass the "noprint" keyword to the options parameter.

The titles parameter allows you to set one or more titles for your report. Pass these titles as a vector of strings.

The exact datasets used for the interactive report can be returned as a list. To return these datasets, pass the "report" keyword on the output parameter. This list may in turn be passed to proc_print to write the report to a file.

Dataset Output

Dataset results are also returned from the function by default. proc_reg typically returns a single dataset. The columns and rows on this dataset can change depending on the keywords passed to the stats and options parameters.

The default output dataset is optimized for data manipulation. The column names have been standardized, and additional variables may be present to help with data manipulation. The data values in the output dataset are intentionally not rounded or formatted to give you the most accurate numeric results.

You may also request the datasets used in the interactive report. To request these datasets, pass the "report" option to the output parameter. Each report dataset will be named according to the category of statistical results. There are four standard categories: "NObs", "ANOVA", "FitStatistics", and "ParameterEstimates". When the "spec" statistics option is passed, the function will also return a "SpecTest" dataset containing the results of White's test. If the "p" option is present, the "OutputStatistics" and "ResidualStatistics" tables will be included.

If you don't want any datasets returned, pass the "none" option on the output parameter.

Statistics Keywords

The following statistics keywords can be passed on the stats parameter. You may pass statistic keywords as a quoted vector of strings, or an unquoted vector using the v() function. An individual statistics keyword can be passed without quoting.

  • adjrsq: Adds adjusted r-square value to the output dataset.

  • clb: Requests confidence limits be added to the interactive report.

  • edf: Includes the number of regressors, the error degrees of freedom, and the model r-square in the output dataset.

  • est: Requests an output dataset of parameter estimates and optional model fit summary statistics. This statistics option is the default.

  • hcc: The "hcc" statistics keyword requests that heteroscedasticity-consistent standard errors of the parameter estimates be sent to the interactive report.

  • hccmethod=: When the "hcc" option is present, the "hccmethod=" option specifies the type of method to use. Valid values are 0 and 3.

  • mse: Computes the mean squared error for each model and adds it to the output dataset.

  • p: Computes predicted and residual values and sends them to a separate table on the interactive report.

  • press: Includes the predicted residual sum of squares (PRESS) statistic in the output dataset.

  • rsquare: Includes the r-square statistic in the output dataset. The "rsquare" option has the same effect as the "edf" option.

  • seb: Outputs the standard errors of the parameter estimates to the output dataset. These values will be identified as type "SEB".

  • spec: Adds the "White's test" table to the interactive output. This test determines whether the first and second moments of the model are correctly specified.

  • sse: Adds the error sum of squares to the output dataset.

  • table: The "table" keyword is used to send standard errors, t-statistics, p-values, and confidence limits to the output dataset. The standard errors, t-statistics, and p-values are identified by types "STDERR", "T", and "PVALUE". The confidence limits are identified by "LxxB" and "UxxB", where "xx" is the confidence level in percentage terms (for example, "L95B" and "U95B" when alpha = 0.05). The "table" keyword on the stats parameter performs the same functions as the "tableout" option on the options parameter.
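Several of the fit statistics named above can be reproduced in base R, which may help when checking results. The sketch below uses the standard textbook formulas on the built-in cars data. It is not the proc_reg implementation, and the variable names are chosen only for this example.

```r
# Base R sketch of the fit statistics behind several keywords.
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)
n <- nrow(cars)
p <- length(coef(fit))                 # parameters, including the intercept

sse <- sum(res^2)                      # "sse": error sum of squares
edf <- n - p                           # error degrees of freedom
mse <- sse / edf                       # "mse": mean squared error
sst <- sum((cars$dist - mean(cars$dist))^2)
rsq <- 1 - sse / sst                   # "rsquare"
adjrsq <- 1 - (1 - rsq) * (n - 1) / edf  # "adjrsq"

# "press": sum of squared leave-one-out prediction errors,
# computed from the ordinary residuals and the hat values
press <- sum((res / (1 - hatvalues(fit)))^2)

round(c(SSE = sse, EDF = edf, MSE = mse, RSQ = rsq,
        ADJRSQ = adjrsq, PRESS = press), 4)
```

For the model dist ~ speed, these values should agree with the RSQ and ADJRSQ columns in Example 3 and the PRESS column in Example 6.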

Options

The proc_reg function recognizes the following options. Options may be passed as a quoted vector of strings, or an unquoted vector using the v() function.

  • alpha = : The "alpha = " option will set the alpha value for confidence limit statistics. Set the alpha as a decimal value between 0 and 1. For example, you can set a 90% confidence limit as alpha = 0.1.

  • edf: Includes the number of regressors, the error degrees of freedom, and the model r-square in the output dataset.

  • noprint: Turns off printing of the interactive report to the viewer. By default, the report is printed to the viewer. You may also inhibit printing globally by setting the package print option to FALSE: options("procs.print" = FALSE).

  • outest: The "outest" option is used to request the parameter estimates and model fit summary statistic be sent to the output dataset. The parameter estimates are identified as type "PARMS" on the output dataset. The "outest" dataset is the default output dataset, and this option does not normally need to be passed.

  • outseb: The "outseb" option is used to request the standard errors be sent to the output dataset. The standard errors will be added as a new row identified by type "SEB". This request can also be made by passing the "seb" keyword to the stats parameter.

  • press: Includes the predicted residual sum of squares (PRESS) statistic in the output dataset.

  • rsquare: Includes the r-square statistic in the output dataset. The "rsquare" option has the same effect as the "edf" option.

  • tableout: The "tableout" option is used to send standard errors, t-statistics, p-values, and confidence limits to the output dataset. The standard errors, t-statistics, and p-values are identified by types "STDERR", "T", and "PVALUE". The confidence limits are identified by "LxxB" and "UxxB", where "xx" is the confidence level in percentage terms (for example, "L95B" and "U95B" when alpha = 0.05). The "tableout" option on the options parameter performs the same functions as the "table" keyword on the stats parameter.
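The link between the alpha value and the confidence limit rows can be sketched in base R. The code below computes t-based limits for the parameter estimates and labels them in the "LxxB" and "UxxB" style. It is an illustration under the usual linear model assumptions and is not the proc_reg implementation.

```r
# Base R sketch: confidence limits for parameter estimates at a given alpha.
alpha <- 0.05                          # the documented default
fit <- lm(dist ~ speed, data = cars)

est <- coef(fit)
se <- sqrt(diag(vcov(fit)))            # standard errors of the estimates
tcrit <- qt(1 - alpha / 2, df = df.residual(fit))

limits <- rbind(est - tcrit * se, est + tcrit * se)

# Label the rows by confidence level: 100 * (1 - alpha)
pct <- round(100 * (1 - alpha))
rownames(limits) <- c(paste0("L", pct, "B"), paste0("U", pct, "B"))
limits
```

With alpha = 0.05 the rows are labelled "L95B" and "U95B", and the values should agree with the corresponding rows in Example 5. Setting alpha to 0.1 would give "L90B" and "U90B" with narrower limits.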

Data Shaping

The output datasets produced by the function can be shaped in different ways. These shaping options allow you to decide whether the data should be returned long and skinny, or short and wide. The shaping options can reduce the amount of data manipulation necessary to get the data into the desired form. The shaping options are as follows:

  • long: Transposes the output datasets so that statistics are in rows and variables are in columns.

  • stacked: Requests that output datasets be returned in "stacked" form, such that both statistics and variables are in rows.

  • wide: Requests that output datasets be returned in "wide" form, such that statistics are across the top in columns, and variables are in rows. This shaping option is the default.

These shaping options are passed on the output parameter. For example, to return the data in "long" form, use output = "long".
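The three layouts described above can be pictured with a small hand-built table. The sketch below follows the definitions in this section only. The column names VAR, STAT, and VALUE are hypothetical and are not the names proc_reg uses. The numbers are the parameter estimates and standard errors from Example 3.

```r
# Illustration of the three layouts; column names are hypothetical.

# "wide": variables in rows, statistics across the top in columns
wide <- data.frame(VAR = c("Intercept", "speed"),
                   EST = c(-17.579095, 3.932409),
                   STDERR = c(6.7584402, 0.4155128))

# "long": statistics in rows, variables in columns
long <- data.frame(STAT = c("EST", "STDERR"),
                   Intercept = c(wide$EST[1], wide$STDERR[1]),
                   speed = c(wide$EST[2], wide$STDERR[2]))

# "stacked": both variables and statistics in rows, one value per row
stacked <- data.frame(VAR = rep(wide$VAR, times = 2),
                      STAT = rep(c("EST", "STDERR"), each = 2),
                      VALUE = c(wide$EST, wide$STDERR))

wide
long
stacked
```

All three tables hold the same four values. The choice of shape only affects how much reshaping is needed afterwards.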

Examples

# Turn off printing for CRAN checks
options("procs.print" = FALSE)

# Prepare sample data
set.seed(123)
dat <- cars
samplecar <- sample(c(TRUE, FALSE), nrow(cars), replace=TRUE, prob=c(0.6, 0.4))
dat$group <- ifelse(samplecar, "Group A", "Group B")

# Example 1: R Model Syntax
res1 <- proc_reg(dat, model = dist ~ speed)

# View Results
res1
#    MODEL  TYPE DEPVAR     RMSE Intercept    speed dist
# 1 MODEL1 PARMS   dist 15.37959 -17.57909 3.932409   -1

# Example 2: SAS Model Syntax
res2 <- proc_reg(dat, model = "dist = speed")

# View Results
res2
#    MODEL  TYPE DEPVAR     RMSE Intercept    speed dist
# 1 MODEL1 PARMS   dist 15.37959 -17.57909 3.932409   -1

# Example 3: Report Output
res3 <- proc_reg(dat, model = dist ~ speed, output = "report")

# View Results
res3
# $NObs
#                         LABEL NOBS
# 1 Number of Observations Read   50
# 2 Number of Observations Used   50
#
# $ANOVA
#             LABEL DF    SUMSQ     MEANSQ     FVAL        PROBF
# 1           Model  1 21185.46 21185.4589 89.56711 1.489919e-12
# 2           Error 48 11353.52   236.5317       NA           NA
# 3 Corrected Total 49 32538.98         NA       NA           NA
#
# $FitStatistics
#       RMSE DEPMEAN  COEFVAR       RSQ    ADJRSQ
# 1 15.37959   42.98 35.78312 0.6510794 0.6438102
#
# $ParameterEstimates
#        PARM DF        EST    STDERR         T        PROBT
# 1 Intercept  1 -17.579095 6.7584402 -2.601058 1.231882e-02
# 2     speed  1   3.932409 0.4155128  9.463990 1.489919e-12

# Example 4: By variable
res4 <- proc_reg(dat, model = dist ~ speed, by = group)

# View Results
res4
#        BY  MODEL  TYPE DEPVAR     RMSE  Intercept    speed dist
# 1 Group A MODEL1 PARMS   dist 15.35049 -24.888326 4.275357   -1
# 2 Group B MODEL1 PARMS   dist 15.53676  -8.705547 3.484381   -1

# Example 5: "tableout" Option
res5 <- proc_reg(dat, model = dist ~ speed, options = "tableout")

# View Results
res5
#    MODEL   TYPE DEPVAR     RMSE    Intercept        speed dist
# 1 MODEL1  PARMS   dist 15.37959 -17.57909489 3.932409e+00   -1
# 2 MODEL1 STDERR   dist 15.37959   6.75844017 4.155128e-01   NA
# 3 MODEL1      T   dist 15.37959  -2.60105800 9.463990e+00   NA
# 4 MODEL1 PVALUE   dist 15.37959   0.01231882 1.489919e-12   NA
# 5 MODEL1   L95B   dist 15.37959 -31.16784960 3.096964e+00   NA
# 6 MODEL1   U95B   dist 15.37959  -3.99034018 4.767853e+00   NA

# Example 6: Multiple Models plus Statistics Keywords
res6 <- proc_reg(dat, model = list(mod1 = dist ~ speed,
                                   mod2 = speed ~ dist),
                 stats = v(press, seb))

# View Results
res6
#   MODEL  TYPE DEPVAR      RMSE      PRESS   Intercept      speed        dist
# 1  mod1 PARMS   dist 15.379587 12320.2708 -17.5790949  3.9324088 -1.00000000
# 2  mod1   SEB   dist 15.379587         NA   6.7584402  0.4155128 -1.00000000
# 3  mod2 PARMS  speed  3.155753   526.2665   8.2839056 -1.0000000  0.16556757
# 4  mod2   SEB  speed  3.155753         NA   0.8743845 -1.0000000  0.01749448