The proc_reg function performs a regression for one or more models. The
model(s) are passed on the model parameter, and the input dataset is passed
on the data parameter. The stats parameter allows you to request additional
statistics, similar to the model options in SAS. The by parameter allows you
to subset the data into groups and run the model on each group. The weight
parameter lets you assign a weight to each observation in the dataset. The
output and options parameters provide additional customization of the results.
proc_reg(
  data,
  model,
  by = NULL,
  stats = NULL,
  output = NULL,
  weight = NULL,
  options = NULL,
  titles = NULL
)
data: The input data frame for which to perform the regression analysis. This parameter is required.
model: A model for the regression to be performed. The model can be specified using either R syntax or SAS syntax. model = var1 ~ var2 + var3 is an example of R style model syntax. If you wish to pass multiple models using R syntax, pass them in a list. For SAS syntax, pass the model as a quoted string: model = "var1 = var2 var3". To pass multiple models using SAS syntax, pass them as a vector of strings. By default, the models will be named "MODEL1", "MODEL2", etc. If you want to name your models, pass them as a named list or named vector.
by: An optional by group. If you specify a by group, the input data will be subset on the by variable(s) prior to performing the regression. For multiple by variables, pass them as a quoted vector of variable names. You may also pass them unquoted using the v() function.
stats: Optional statistics keywords. Valid values are "adjrsq", "clb", "est", "edf", "hcc", "hccmethod", "mse", "p", "press", "rsquare", "sse", "spec", "seb", and "table". A single keyword may be passed with or without quotes. Pass multiple keywords either as a quoted vector, or as an unquoted vector using the v() function. These statistics keywords largely correspond to the options on the "model" statement in SAS. Most of them control which statistics are added to the interactive report. Some keywords control statistics on the output dataset. See the Statistics Keywords section for details on the purpose and target of each keyword.
output: Whether or not to return datasets from the function. Valid values are "out", "none", and "report". The default is "out", which produces dataset output specifically designed for programmatic use. The "none" option returns NULL instead of a dataset or list of datasets. The "report" keyword returns the datasets from the interactive report, which may be different from the standard output. Note that some statistics are only available on the interactive report. The output parameter also accepts the data shaping keywords "long", "stacked", and "wide". These shaping keywords control the structure of the output data. See the Data Shaping section for additional details. Note that multiple output keywords may be passed in a character vector. For example, to produce both a report dataset and a "long" output dataset, use the parameter output = c("report", "out", "long").
weight: The name of a variable to use as a weight for each observation. The weight is commonly provided as the inverse of each variance.
options: A vector of optional keywords. Valid values are: "alpha =", "edf", "noprint", "outest", "outseb", "press", "rsquare", and "tableout". The "alpha = " option sets the alpha value for confidence limit statistics. The default is 95% (alpha = 0.05). The "noprint" option turns off the interactive report. For the other options, see the Options section.
titles: A vector of one or more titles to use for the report output.
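The inverse-variance convention mentioned for the weight parameter can be sketched in base R. This sketch uses lm() rather than proc_reg, and the variance model (var_est) and the column name wt are made up purely for illustration.

```r
# Base R sketch of inverse-variance weighting (not a proc_reg call).
dat <- cars

# Hypothetical assumption: the error variance grows with speed squared
dat$var_est <- dat$speed^2

# The weight is the inverse of each observation's variance
dat$wt <- 1 / dat$var_est

# Weighted least squares, with an unweighted fit for comparison
fit_w <- lm(dist ~ speed, data = dat, weights = wt)
fit_u <- lm(dist ~ speed, data = dat)

coef(fit_w)
coef(fit_u)
```

With proc_reg, the analogous step would presumably be to pass the name of the weight column (here the hypothetical wt) on the weight parameter.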
Normally, the requested regression statistics are shown interactively
in the viewer, and output results are returned as a data frame.
If you request "report" datasets, they will be returned as a list.
You may then access individual datasets from the list using dollar sign
($) syntax.
The interactive report can be turned off using the "noprint" option.
The output dataset can be turned off using the "none" keyword on the
output parameter. If the output dataset is turned off, the function
will return NULL.
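As a minimal sketch of the dollar sign access described above, assuming the procs package is installed: the list element names and the RSQ column are taken from the report output shown in the examples on this page.

```r
library(procs)

# Suppress the interactive report for this sketch
options("procs.print" = FALSE)

# Request the report datasets; the result is a list of data frames
res <- proc_reg(cars, model = dist ~ speed, output = "report")

# List the available datasets, then pull individual ones out with $
names(res)
res$ParameterEstimates
res$FitStatistics$RSQ
```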
The proc_reg function is a general purpose regression function. It
produces a dataset output by default, and, when working in RStudio,
also produces an interactive report. The function has many convenient options
for what statistics are produced and how the analysis is performed. All
statistical output from proc_reg matches SAS.
A model may be specified using R model syntax or SAS model syntax. To use
SAS syntax, the model statement must be quoted. To pass multiple models using
R syntax, pass them to the model parameter in a list. To pass multiple
models using SAS syntax, pass them to the model parameter as a vector
of strings.
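The two ways of passing multiple models can be sketched as follows, assuming the procs package is installed. The model names fwd and rev are arbitrary labels chosen for this illustration.

```r
library(procs)

# Suppress the interactive report for this sketch
options("procs.print" = FALSE)

# R syntax: multiple models go in a (named) list of formulas
res_r <- proc_reg(cars, model = list(fwd = dist ~ speed,
                                     rev = speed ~ dist))

# SAS syntax: multiple models go in a (named) vector of quoted strings
res_sas <- proc_reg(cars, model = c(fwd = "dist = speed",
                                    rev = "speed = dist"))

# The names supplied above should appear in the MODEL column
res_r$MODEL
res_sas$MODEL
```

Leaving the names off should fall back to the default "MODEL1", "MODEL2" naming.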
By default, proc_reg results will be sent to the viewer as an HTML report.
This functionality makes it easy to get a quick analysis of your data. To
turn off the interactive report, pass the "noprint" keyword to the options
parameter.
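A short sketch of the two ways to suppress the report, assuming the procs package is installed: the first call turns the report off for a single call, and the second uses the global procs.print option that also appears in the examples on this page.

```r
library(procs)

# Per call: pass the "noprint" keyword on the options parameter
res1 <- proc_reg(cars, model = dist ~ speed, options = "noprint")

# Globally: set the package print option, keeping the old value to restore
old <- options("procs.print" = FALSE)
res2 <- proc_reg(cars, model = dist ~ speed)
options(old)
```

In both cases the output dataset is still returned; only the viewer report is suppressed.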
The titles parameter allows you to set one or more titles for your
report. Pass these titles as a vector of strings.
The exact datasets used for the interactive report can be returned as a list.
To return these datasets, pass the "report" keyword on the output parameter.
This list may in turn be passed to proc_print to write the report to a file.
Dataset results are also returned from the function by default.
proc_reg typically returns a single dataset. The columns and rows
on this dataset can change depending on the keywords passed
to the stats and options parameters.
The default output dataset is optimized for data manipulation. The column names have been standardized, and additional variables may be present to help with data manipulation. The data values in the output dataset are intentionally not rounded or formatted to give you the most accurate numeric results.
You may also request to return the datasets used in the interactive report.
To request these datasets, pass the "report" option to the output parameter.
Each report dataset will be named according to the category of statistical
results. There are four standard categories: "NObs", "ANOVA",
"FitStatistics", and "ParameterEstimates". When the "spec" statistics option
is passed, the function will also return a "SpecTest" dataset containing
the results of White's test. If the "p" option is present, the
"OutputStatistics" and "ResidualStatistics" tables will be included.
If you don't want any datasets returned, pass the "none" option on the
output parameter.
The following statistics keywords can be passed on the stats parameter.
You may pass statistics keywords as a quoted vector of strings, or as an
unquoted vector using the v() function.
An individual statistics keyword can be passed without quoting.
adjrsq: Adds adjusted r-square value to the output dataset.
clb: Requests confidence limits be added to the interactive report.
edf: Adds the number of regressors, the error degrees of freedom, and the model r-square to the output dataset.
est: Requests an output dataset of parameter estimates and optional model fit summary statistics. This statistics option is the default.
hcc: The "hcc" statistics keyword requests that heteroscedasticity-consistent standard errors of the parameter estimates be sent to the interactive report.
hccmethod=: When the "hcc" option is present, the "hccmethod=" option specifies the type of method to use. Valid values are 0 and 3.
mse: Computes the mean squared error for each model and adds to the output dataset.
p: Computes predicted and residual values and sends to a separate table on the interactive report.
press: Includes the predicted residual sum of squares (PRESS) statistic in the output dataset.
rsquare: Includes the r-square statistic in the output dataset. The "rsquare" option has the same effect as the "edf" option.
seb: Outputs the standard errors of the parameter estimates to the output dataset. These values will be identified as type "SEB".
spec: Adds the "White's test" table to the interactive output. This test determines whether the first and second moments of the model are correctly specified.
sse: Adds the error sum of squares to the output dataset.
table: The "table" keyword is used to send standard errors, t-statistics, p-values, and confidence limits to the output dataset. These additional statistics are identified by types "STDERR", "T", and "PVALUE". The confidence limits are identified by "LxxB" and "UxxB", where "xx" is the confidence level in percentage terms (for example, "L95B" and "U95B" for the default alpha of 0.05). The "table" keyword on the stats parameter performs the same function as the "tableout" option on the options parameter.
The proc_reg function recognizes the following options. Options may
be passed as a quoted vector of strings, or as an unquoted vector using the
v() function.
alpha = : The "alpha = " option will set the alpha value for confidence limit statistics. Set the alpha as a decimal value between 0 and 1. For example, you can set a 90% confidence limit as alpha = 0.1.
edf: Adds the number of regressors, the error degrees of freedom, and the model r-square to the output dataset.
noprint: Whether to print the interactive report to the viewer. By default, the report is printed to the viewer. The "noprint" option will inhibit printing. You may inhibit printing globally by setting the package print option to false: options("procs.print" = FALSE).
outest: The "outest" option is used to request that the parameter estimates and model fit summary statistics be sent to the output dataset. The parameter estimates are identified as type "PARMS" on the output dataset. The "outest" dataset is the default output dataset, and this option does not normally need to be passed.
outseb: The "outseb" option is used to request that the standard errors be sent to the output dataset. The standard errors will be added as a new row identified by type "SEB". This request can also be made by passing the "seb" keyword to the stats parameter.
press: Includes the predicted residual sum of squares (PRESS) statistic in the output dataset.
rsquare: Includes the r-square statistic in the output dataset. The "rsquare" option has the same effect as the "edf" option.
tableout: The "tableout" option is used to send standard errors, t-statistics, p-values, and confidence limits to the output dataset. These additional statistics are identified by types "STDERR", "T", and "PVALUE". The confidence limits are identified by "LxxB" and "UxxB", where "xx" is the confidence level in percentage terms (for example, "L95B" and "U95B" for the default alpha of 0.05). The "tableout" option on the options parameter performs the same function as the "table" keyword on the stats parameter.
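The relationship between alpha and the confidence limit rows can be cross-checked in base R. This sketch uses lm() and confint() rather than proc_reg. It reproduces the limits that the "tableout" example on this page reports as "L95B" and "U95B", and shows which labels a 90% limit (alpha = 0.1) would correspond to.

```r
# Base R cross-check of the confidence limits (not a proc_reg call)
fit <- lm(dist ~ speed, data = cars)

# Default alpha = 0.05: these match the L95B / U95B rows in the
# "tableout" example (-31.16785 to -3.99034 for the intercept)
ci95 <- confint(fit, level = 0.95)
ci95

# alpha = 0.1 gives 90% limits
alpha <- 0.1
ci <- confint(fit, level = 1 - alpha)

# Build the corresponding row labels from the confidence level
labels <- paste0(c("L", "U"), round((1 - alpha) * 100), "B")
colnames(ci) <- labels
t(ci)
```

The transposed result puts one row per limit and one column per parameter, which is the same orientation as the default proc_reg output dataset.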
The output datasets produced by the function can be shaped in different ways. These shaping options allow you to decide whether the data should be returned long and skinny, or short and wide. The shaping options can reduce the amount of data manipulation necessary to get the data into the desired form. The shaping options are as follows:
long: Transposes the output datasets so that statistics are in rows and variables are in columns.
stacked: Requests that output datasets be returned in "stacked" form, such that both statistics and variables are in rows.
wide: Requests that output datasets be returned in "wide" form, such that statistics are across the top in columns, and variables are in rows. This shaping option is the default.
These shaping options are passed on the output parameter. For example,
to return the data in "long" form, use output = "long".
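To compare the three shapes side by side, a sketch along these lines may help, assuming the procs package is installed. The exact columns of each result are not shown here because they depend on the shape requested.

```r
library(procs)

# Suppress the interactive report for this sketch
options("procs.print" = FALSE)

# Same model, three shapes ("wide" is the default)
res_wide    <- proc_reg(cars, model = dist ~ speed, output = "wide")
res_long    <- proc_reg(cars, model = dist ~ speed, output = "long")
res_stacked <- proc_reg(cars, model = dist ~ speed, output = "stacked")

# Inspect how the rows and columns differ between the shapes
dim(res_wide)
dim(res_long)
dim(res_stacked)
```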
# Turn off printing for CRAN checks
options("procs.print" = FALSE)
# Prepare sample data
set.seed(123)
dat <- cars
samplecar <- sample(c(TRUE, FALSE), nrow(cars), replace=TRUE, prob=c(0.6, 0.4))
dat$group <- ifelse(samplecar, "Group A", "Group B")
# Example 1: R Model Syntax
res1 <- proc_reg(dat, model = dist ~ speed)
# View Results
res1
# MODEL TYPE DEPVAR RMSE Intercept speed dist
# 1 MODEL1 PARMS dist 15.37959 -17.57909 3.932409 -1
# Example 2: SAS Model Syntax
res2 <- proc_reg(dat, model = "dist = speed")
# View Results
res2
# MODEL TYPE DEPVAR RMSE Intercept speed dist
# 1 MODEL1 PARMS dist 15.37959 -17.57909 3.932409 -1
# Example 3: Report Output
res3 <- proc_reg(dat, model = dist ~ speed, output = report)
# View Results
res3
# $NObs
# LABEL NOBS
# 1 Number of Observations Read 50
# 2 Number of Observations Used 50
#
# $ANOVA
# LABEL DF SUMSQ MEANSQ FVAL PROBF
# 1 Model 1 21185.46 21185.4589 89.56711 1.489919e-12
# 2 Error 48 11353.52 236.5317 NA NA
# 3 Corrected Total 49 32538.98 NA NA NA
#
# $FitStatistics
# RMSE DEPMEAN COEFVAR RSQ ADJRSQ
# 1 15.37959 42.98 35.78312 0.6510794 0.6438102
#
# $ParameterEstimates
# PARM DF EST STDERR T PROBT
# 1 Intercept 1 -17.579095 6.7584402 -2.601058 1.231882e-02
# 2 speed 1 3.932409 0.4155128 9.463990 1.489919e-12
# Example 4: By variable
res4 <- proc_reg(dat, model = dist ~ speed, by = group)
# View Results
res4
# BY MODEL TYPE DEPVAR RMSE Intercept speed dist
# 1 Group A MODEL1 PARMS dist 15.35049 -24.888326 4.275357 -1
# 2 Group B MODEL1 PARMS dist 15.53676 -8.705547 3.484381 -1
# Example 5: "tableout" Option
res5 <- proc_reg(dat, model = dist ~ speed, options = tableout)
# View Results
res5
# MODEL TYPE DEPVAR RMSE Intercept speed dist
# 1 MODEL1 PARMS dist 15.37959 -17.57909489 3.932409e+00 -1
# 2 MODEL1 STDERR dist 15.37959 6.75844017 4.155128e-01 NA
# 3 MODEL1 T dist 15.37959 -2.60105800 9.463990e+00 NA
# 4 MODEL1 PVALUE dist 15.37959 0.01231882 1.489919e-12 NA
# 5 MODEL1 L95B dist 15.37959 -31.16784960 3.096964e+00 NA
# 6 MODEL1 U95B dist 15.37959 -3.99034018 4.767853e+00 NA
# Example 6: Multiple Models plus Statistics Keywords
res6 <- proc_reg(dat, model = list(mod1 = dist ~ speed,
                                   mod2 = speed ~ dist),
                 stats = v(press, seb))
# View Results
res6
# MODEL TYPE DEPVAR RMSE PRESS Intercept speed dist
# 1 mod1 PARMS dist 15.379587 12320.2708 -17.5790949 3.9324088 -1.00000000
# 2 mod1 SEB dist 15.379587 NA 6.7584402 0.4155128 -1.00000000
# 3 mod2 PARMS speed 3.155753 526.2665 8.2839056 -1.0000000 0.16556757
# 4 mod2 SEB speed 3.155753 NA 0.8743845 -1.0000000 0.01749448