The proc_sort function sorts a dataset according to the variables passed on the by parameter. If no parameters are passed on the by parameter, it will sort by all variables. The direction of the sort is controlled with the order parameter. Use the nodupkey option to eliminate duplicate rows from the dataset, and the keep parameter to subset columns. The parameters will accept either quoted or unquoted values.

proc_sort(
  data,
  by = NULL,
  keep = NULL,
  order = "ascending",
  options = NULL,
  as.character = FALSE,
  na.sort = NULL,
  where = NULL
)

Arguments

data

The input data to sort.

by

A vector of variables to sort by.

keep

A vector of variables on the output data to keep. All other variables will be dropped.

order

The sort order of the variables on the by parameter. Valid values are 'ascending' or 'descending'. These values may also be abbreviated to 'asc', 'desc', 'a', or 'd'. You may pass a vector of order values equal to the number of variables on the by parameter. Default is 'ascending' for all by variables.

options

Any options desired for the sort. Available options are 'dupkey' and 'nodupkey'. The 'nodupkey' option removes duplicate rows from the sorted dataset. The 'dupkey' option removes unique rows from the sorted dataset.

as.character

If TRUE, will cast any factors in the 'by' parameter to character. Default is FALSE. This parameter is included because it is common to use factors for sorting in R, but you may not want to keep the variable as a factor. This parameter therefore allows you to use the factor for the sort, but then convert back to a character once the sort is complete.

na.sort

An option that determines how to sort NA values. Valid values are NULL, "first", "last", and "sas". Values may be quoted or unquoted. The "sas" value will sort NAs first for ascending keys, and last for descending keys. The NA sort style may be controlled globally with options("procs.na.sort" = <value>). Any values passed on the local parameter will override the global setting. The default value is NULL, which defers to the global setting. If the global setting is NULL, the parameter will default to "last".

where

An expression to filter the rows before the sort takes place. Use the expression function to define the where clause.

Value

The sorted dataset. If a data frame was input, a data frame will be output. If a tibble was input, a tibble will be output.

Options

Below are the available options for the proc_sort function:

  • dupkey: This option keeps duplicate rows and discards unique rows. Duplicate rows will be identified by the key variables listed on the by parameter if passed. This option is the opposite of 'nodupkey'.

  • nodupkey: Removes duplicate rows following the sort. Duplicate rows will be identified by the key variables listed on the by parameter if passed. Otherwise, the function will dedupe on all variables returned.

Missing Values

Missing values are handled in R differently than they are handled in SAS. In R, missing values (NA) are sorted last, and most sorting functions in R give you an option to sort it first if desired.

SAS, on the other hand, considers a missing as the smallest value. If the sort is ascending, it will sort the missing values first. If the sort is descending, it will sort the missing values last.

The na.sort parameter on proc_sort gives you a choice of how NAs are sorted. By default, the function sorts in the standard R manner. If you are trying to replicate a SAS sort, set the value to "sas".

Note that this parameter only sorts NA values in a manner similar to SAS. It does not guarantee that the entire sort will match SAS. There can still be differences in the sort caused by formats, character set mismatch, or other reasons.

If you want to always use the SAS style sort, you can set it as a global option, like this: options("procs.na.sort" = "sas").

Examples

# Prepare data subset
dat <- data.frame(HairEyeColor, stringsAsFactors = FALSE)[1:32 %% 4 == 1, ]

# View data
dat
#     Hair   Eye    Sex Freq
# 1  Black Brown   Male   32
# 5  Black  Blue   Male   11
# 9  Black Hazel   Male   10
# 13 Black Green   Male    3
# 17 Black Brown Female   36
# 21 Black  Blue Female    9
# 25 Black Hazel Female    5
# 29 Black Green Female    2

# Sort by Frequency
res1 <- proc_sort(dat, by = Freq)

# View results
res1
#    Hair   Eye    Sex Freq
# 29 Black Green Female    2
# 13 Black Green   Male    3
# 25 Black Hazel Female    5
# 21 Black  Blue Female    9
# 9  Black Hazel   Male   10
# 5  Black  Blue   Male   11
# 1  Black Brown   Male   32
# 17 Black Brown Female   36

# Sort by Frequency descending
res2 <- proc_sort(dat, by = Freq, order = d)

# View results
res2
#     Hair   Eye    Sex Freq
# 17 Black Brown Female   36
# 1  Black Brown   Male   32
# 5  Black  Blue   Male   11
# 9  Black Hazel   Male   10
# 21 Black  Blue Female    9
# 25 Black Hazel Female    5
# 13 Black Green   Male    3
# 29 Black Green Female    2

# Get unique combinations of Eye and Sex
res3 <- proc_sort(dat, keep = v(Eye, Sex), options = nodupkey)

# View results
res3
#      Eye    Sex
# 1  Brown   Male
# 17 Brown Female
# 5   Blue   Male
# 21  Blue Female
# 9  Hazel   Male
# 25 Hazel Female
# 13 Green   Male
# 29 Green Female