Tag Archives: REG

Linear Model for Regression

Linear Regression model is a class of very important statistical methods to learn from data. In this post, I will cover the main regression algorithms that are introduced in Chapter 3 of Elements of Statistical Learning. Interested readers should consult the book for theoretical background while this post will mainly focus on the implementation of the algorithms using SAS/BASE & SAS/STAT. Following the book, in the demonstration below, we use the Prostate Cancer data, which can be downloaded from the book’s website @ here.

First, we analyze the correlation of predictors and plot the Matrix style correlation plot. This can be easily done using the following SAS code:

proc corr data=prostate
          plot=(matrix  scatterplot);
      var lweight age lbph  svi
          lcp gleason pgg45;

PROC CORR tells SAS to conduct correlation analysis and DATA= option says the analysis uses data set named “prostate”. The OUTP= option tells SAS to output Pearson correlation matrix to a SAS data called “correlation”. Note the “P” stands for “Pearson”. You can also request Spearman correlation matrix by specifying OUTS= option. The PLOT= option tells SAS what type of graphics to be shown. Here we request both the Matrix style plot and Scatter plot. Since we have more than 2 variables to be analyzed, SAS will produce multiple scatter plot for each pair combination of the variables indicated in VAR statement.

In the book, the predictors are standardized to have unit variance. This can be easily done using the following SAS code:

proc standard data=prostate
     var lpsa lcavol lweight age lbph svi
         lcp gleason pgg45;

PROC STANDARD tells SAS to standardize the data. Like PROC CORR, DATA= option indicates which SAS data to be used and OUT= option asks SAS to output all data to a data set called prostate_std after standization. The STD=1 option simply request to have all variables listed in the VAR statement to have unit variance after standardization. If you also want to have 0 mean then add in MEAN=0 option as well.

It is usually a good practice to split the data into 2 pieces, one is used for model training while the other one is used for model evaluation. It is not necessary to physically split the data but instead just flag a subset observations to be training sample and the others be validation sample. The following code randomly select 67 and set the rest 30 as testing, by invoking PROC SURVEYSELECT. The user have to specify OUTALL option to keep all observations and flag those being selected as training while non-flagged as testing data.

proc surveyselect data=prostate_std

DATA= option request the SAS data set to be sampled from and OUT= option tells SAS where to output the data. The OUTALL option is necessary here because it keeps all observations in the output data set, otherwise only those selected will be output. SAMPLESIZE= option means 67 observations should be sampled. Instead of specifying how many observations to be sampled, you can instead specify what portion of observations to be sampled by using SAMPRATE= option. The value should be between 0 and 1 when explained as fraction, or 0 and 100 when it is explained as percentage. The METHOD= option request Systematic Random Sampling (SRS) method.

To do the linear regression, simply submit the following code:

ods output ANOVA=FitStat;
proc reg data=prostate_std  ;
     model lpsa = lcavol  lweight age lbph  
                  svi lcp gleason pgg45;
     where selected=1;

All major variable selection methods mentioned in Chapter 3 of Elements of Statistical Learning are available in SAS, and I will explain them one by one.

1. Best subset selection.
To request best subset selection in SAS, you use PROC REG and specify the SELECTION=CP | RSQUARE | ADJRSQ option in the MODEL statement. “|” means “or”, so that you specify one of the trio (CP, RSQUARE, ADJRSQ), any two combination or all of the three. However, there are some details need attention. If SELECTION=CP or SELECTION=ADJRSQ is specified, the BEST= option specifies the maximum number of subset models to be displayed or output to the OUTEST= data set. For SELECTION=RSQUARE, the BEST= option requests the maximum number of subset models for each size. If the BEST= option is used without the B option (displaying estimated regression coefficients), the variables in each MODEL are listed in order of inclusion, instead of the order in which they appear in the MODEL statement. If the BEST= option is omitted and the number of regressors is less than 11, all possible subsets are evaluated. If the BEST= option is omitted and the number of regressors is greater than 10, the number of subsets selected is, at most, equal to the number of regressors. A small value of the BEST= option greatly reduces the CPU time required for large problems.
SAS can output key statistics during the selection process to a data set via ODS OUTPUT statement so to facilitate analysis. It is better to sort this data by NumInModel, the number of variables in model in each step, to facilitate human reading.

ods output SubsetSelSummary=SelectionSummary;
proc reg data=prostate ;
     model lpsa = lcavol  lweight age lbph 
           svi lcp gleason pgg45 
           /selection=cp  rsquare  adjrsq 
proc sort data=SelectionSummary 
     by NumInModel;

2. Backward, forward, stepwise selection.
These automatic variable selection methods are readily available for linear regression for a couple of SAS procedures: PROC REG and PROC GLMSELECT. The usage is the same, specifying SELECTION=BACKWARD (or FORWARD, STEPWISE) in the MODEL statement in those two procedures. In PROC REG, you can further specify DETAILS option in the MODEL statement to obtain more information about the variable selection, while in PROC GLMSELECT, you can specify the (STOP=NONE) sub-option for SELECT= option to ask SAS conduct model evaluation even if the best model has already appeared during the model selection process.

3. Shrinkage methods.
3.1 Ridge Regression.
Ridge regression is handled by PROC REG in SAS and all you need to do is by specifying RIDGE= option the PROC REG statement, like below:

proc reg data=prostate  
         ridge=0.01 to 0.9 by 0.01  
      model lpsa = lcavol  lweight age lbph  
                   svi lcp gleason pgg45 ;

The OUTEST= option will output parameter estimates, including estimate corresponding to the non-regularized model and those corresponding to each of the ridge parameter generated by “RIDGE=0.01 TO 0.9 BY 0.01” option.

3.2 LASSO Regression.
LASSO is supported by PROC GLMSELECT procedure, and all you need to do is to specify SELECTION=LASSO in the MODEL statement.

proc glmselect data=prostate_std
               plots=coefficient(stepaxis=normb unpack);
      model lpsa = lcavol  lweight age lbph  
                   svi lcp gleason pgg45 
                   /selection=lasso stop=9;

SAS will stop when the appropriate model is found, therefore, in order to explore the full path of coefficients, it is necessary to either specify STOP=p+1 , where p is the number of covariates, option like above or use this specification: SELECTION=LASSO(STOP=NONE).

3.3. Principal Component Regression (PCR) and Partial Least Square (PLS)
Both these regression models are supported in PROC PLS. To request PCR, simply use METHOD=PCR option in the PROC PLS statement while when METHOD=SIMPLS is specified, PLS is executed.

proc pls data=prostate_std  
      model lpsa = lcavol  lweight age lbph
                   svi lcp gleason pgg45

proc pls data=prostate_std 
      model lpsa = lcavol  lweight age lbph  
                   svi lcp gleason pgg45 

PROC PLS does not automatically show or output parameter estimates, therefore you have to be explicit that SOLUTION option is used in the MODEL statement.

3.4 Elastic Net.
Elastic Net for linear regression can be implemented in several ways and glmnet R package used gradient coordinate descending. This is not easy to implement in SAS but in the original paper, Zou and Hastie explored the so called Naïve Elastic Net method where a ridge regression is requested using the data augmentation method, then LASSO is executed on the augmented data set. The Elastic net solution is obtained by scaling back from the solution from Naïve Elastic Net. This process is illustrated below with a \lambda2 parameter set to \sqrt(1000):

%let covars=lcavol  lweight age lbph  
            gleason pgg45   lcp svi ;
%let depvar=lpsa;
proc standard data=prostate 
              mean=0  std=1;
     var &covars;
proc standard data=prostate_std
     var &depvar;

%let lambda2=%sysfunc(sqrt(1000));
%let denum=%sysfunc(sqrt(1001));
data expand;
     array _v[8] &covars  
                 (8*0); /* 8 predictors */
     retain &depvar 0;
     do id=1 to dim(_v);
        if id>=2 then _v[id-1]=0;
        drop id;

proc append base=prostate_std   

data prostate_std;
     set prostate_std;
     array _v[8]  &covars ;
     do j=1 to dim(_v); _v[j]=_v[j]/&denum; end;
     drop j;

In this step, the data is normalized like the original paper suggested: standardize covariates and center dependent variable. Then the augmented data is created with pseudo diagonal data be appended to the normalized data. Finally, all data are scaled so that the total variance of each covariate is 1.

Then it is ready to apply LASSO regression to this new data to obtain naïve elastic net solution:

ods graphics on;
ods output SelectionSummary=SelectionSummary;
proc glmselect data=prostate_std 
               plots(stepaxis=normb  unpack)=all  ;
     model lpsa = lcavol  lweight age lbph  svi 
                  lcp gleason pgg45
ods graphics off;
ods select all;

Note some details. First, in the MODEL statement, the STOP=NONE sub-option is used with the SELECTION=LASSO option to fully explore the path of coefficients and DETAILS=SUMMARY is requested to output key metrics in selection process. Second, PLOTS= option is specified in PROC GLMSELECT statement to obtain the graphic representation of coefficient path, and STEPAXIS= sub-option indicates the X-axis of the figure is shrinkage factor in each step.