Linear Models for Classification

Classification is one of the most important tasks in statistical learning. Classification refers to algorithms that identify which category an observation belongs to. The algorithms are trained on a training data set with known membership and then applied to a new set of observations to map them to the most likely category. In this post, we will discuss the so-called linear models for classification, where the decision boundaries among categories are linear.

Linear regression of an indicator matrix. To generate the indicator matrix from the target variable, the easiest and most generic way is PROC GLMMOD: use the target variable as the independent variable here, because you simply need the design matrix based on the target variable's levels. Also, GLMMOD by default displays all design points, which is a lot of output when the target variable has many levels, so it is advisable to use the NOPRINT option to suppress the displayed output.

proc glmmod  data=sashelp.cars  noprint
             outparm=ParameterMapping
             outdesign=train_indicator;
      class type;
      model  MSRP = type /noint;
run;

In this example, the CARS data set in the SASHELP library is used. The OUTPARM= option tells PROC GLMMOD to output the parameter mapping table; because the output design matrix only contains generic variable names such as COL1, COL2, …, you need this table to identify which level corresponds to which column. The OUTDESIGN= option tells PROC GLMMOD where to put the design matrix. Be aware that OUTDESIGN= only outputs the design points and the target variable from the MODEL statement; therefore, it is necessary to merge the design matrix back to the original data:

data train_indicator;
     merge train_indicator sashelp.cars;
run;

Notice that you specify the target variable name in the CLASS statement and as the independent variable in the MODEL statement. You can use any appropriate numeric variable as a pseudo-target variable; here we used MSRP, but variables such as MPG_City or Length work just as well. Also note that the NOINT option is specified in the MODEL statement, because otherwise COL1 will always be the unit intercept term, which you don't need in an indicator matrix.

After the design matrix is merged with all necessary data you need for modeling, you can now conduct linear regression against the indicator matrix using PROC REG. Take a look at the following code:

proc sql noprint;
      select cats("Col", _COLNUM_) into :depvars 
              separated by " "
      from   ParameterMapping;
quit;
%put &depvars;

/* --> now Regression */
proc reg data = train_indicator 
         outest = beta;
    model  &depvars = MSRP  MPG_City  MPG_Highway  
                       Invoice  Cylinders  Wheelbase;
    output out = pred  p=p1-p6;   
           /* if you specify more than the number of 
              indicators, the extras will be ignored */
run;
quit;

The PROC SQL step extracts the names of the indicator columns. While we know the names are as simple as COL1, COL2, …, this is an easy and generic way to build the list of dependent variables. You can then put the whole set of variable names into PROC REG, because it accepts more than one dependent variable. Let's have a closer look at PROC REG.

OUTEST=beta asks PROC REG to output the parameter estimates to a SAS data set named beta. Because &depvars has multiple elements (COL1, COL2, …, COL6 in this case), there will be 6 observations in the beta data set, each corresponding to the model for one of the dependent variables.
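Because beta is a TYPE=EST data set, you can reuse it later to score new data with PROC SCORE. A minimal sketch (the output data set name scored is an assumption; the scored variables take their names from the six models, Col1 through Col6 here):

```sas
/* TYPE=PARMS tells PROC SCORE that beta holds regression      */
/* parameter estimates from PROC REG's OUTEST= data set.       */
/* One scored variable is produced per row (model) in beta.    */
proc score data=train_indicator score=beta out=scored type=parms;
     var MSRP MPG_City MPG_Highway Invoice Cylinders Wheelbase;
run;
```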

On the other hand, if you want PROC REG to score the data points and output predictions for each category, you can simply use the OUTPUT statement as shown above, specifying P=p1-p6. Here you have to be explicit about how many prediction variables you want; each corresponds to one category in the dependent variable's indicator matrix. In the example above, since there are 6 categories, you simply specify 6 variables (using the variable list p1-p6). If you specify fewer than 6, say only 3, then only the predictions for the first three categories are output. If, instead, you specify more than 6 prediction variables, SAS is smart enough to output only 6, while issuing a WARNING message in the log.
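To turn the six fitted values into a class assignment, one simple rule is to assign each observation to the category whose fitted value is largest. A minimal sketch (pred and p1-p6 come from the example above; predcat is a hypothetical variable holding the index of the winning column):

```sas
data classified;
     set pred;
     array p[6] p1-p6;
     /* start with category 1, then keep the index of the largest fit */
     predcat = 1;
     do j = 2 to 6;
        if p[j] > p[predcat] then predcat = j;
     end;
     drop j;
run;
```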

[to be continued…]


Linear Model for Regression

Linear regression models are a very important class of statistical methods for learning from data. In this post, I will cover the main regression algorithms introduced in Chapter 3 of Elements of Statistical Learning. Interested readers should consult the book for theoretical background; this post focuses on implementing the algorithms using SAS/BASE and SAS/STAT. Following the book, the demonstrations below use the Prostate Cancer data, which can be downloaded from the book's website.

First, we analyze the correlation of predictors and plot the Matrix style correlation plot. This can be easily done using the following SAS code:

proc corr data=prostate  outp=correlation
          plots=(matrix  scatter);
      var lweight age lbph  svi
          lcp gleason pgg45;
run;

PROC CORR tells SAS to conduct a correlation analysis and the DATA= option says the analysis uses the data set named "prostate". The OUTP= option tells SAS to output the Pearson correlation matrix to a SAS data set called "correlation"; the "P" stands for "Pearson". You can also request the Spearman correlation matrix by specifying the OUTS= option. The PLOTS= option tells SAS what type of graphics to produce; here we request both the matrix-style plot and scatter plots. Since more than 2 variables are analyzed, SAS will produce a scatter plot for each pairwise combination of the variables in the VAR statement.

In the book, the predictors are standardized to have unit variance. This can be easily done using the following SAS code:

proc standard data=prostate  out=prostate_std  std=1;
     var lpsa lcavol lweight age lbph svi
         lcp gleason pgg45;
run;

PROC STANDARD tells SAS to standardize the data. As in PROC CORR, the DATA= option indicates which SAS data set to use, and the OUT= option asks SAS to output all data after standardization to a data set called prostate_std. The STD=1 option simply requests that all variables listed in the VAR statement have unit variance after standardization. If you also want them to have mean 0, add the MEAN=0 option as well.

It is usually good practice to split the data into 2 pieces: one used for model training and the other for model evaluation. It is not necessary to physically split the data; instead, just flag a subset of observations as the training sample and the rest as the validation sample. The following code randomly selects 67 observations as training and leaves the remaining 30 as testing, by invoking PROC SURVEYSELECT. You have to specify the OUTALL option to keep all observations, with those selected flagged as training and the others as testing data.

proc surveyselect data=prostate_std  out=prostate_std
                  outall  samplesize=67  method=srs;
run;

The DATA= option specifies the SAS data set to be sampled from and the OUT= option tells SAS where to output the data. The OUTALL option is necessary here because it keeps all observations in the output data set; otherwise only the selected ones would be output. The SAMPLESIZE= option means 67 observations should be sampled. Instead of specifying how many observations to sample, you can specify what portion of observations to sample with the SAMPRATE= option; the value should be between 0 and 1 when interpreted as a fraction, or between 0 and 100 when interpreted as a percentage. The METHOD=SRS option requests the Simple Random Sampling method.
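For example, the same split could be requested as a fraction rather than a count (the rate 0.69 and the output name prostate_flagged below are only illustrative; 0.69 roughly corresponds to 67 of the 97 observations):

```sas
/* SAMPRATE= takes a fraction between 0 and 1 (or a percentage) */
proc surveyselect data=prostate_std  out=prostate_flagged
                  outall  samprate=0.69  method=srs;
run;
```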

To do the linear regression, simply submit the following code:

ods output ANOVA=FitStat;
proc reg data=prostate_std;
     model lpsa = lcavol  lweight age lbph  
                  svi lcp gleason pgg45;
     where selected=1;
run;
quit;

All major variable selection methods mentioned in Chapter 3 of Elements of Statistical Learning are available in SAS, and I will explain them one by one.

1. Best subset selection.
To request best subset selection in SAS, you use PROC REG and specify the SELECTION=CP | RSQUARE | ADJRSQ option in the MODEL statement. "|" means "or", so you can specify one of the trio (CP, RSQUARE, ADJRSQ), any combination of two, or all three. However, some details need attention. If SELECTION=CP or SELECTION=ADJRSQ is specified, the BEST= option specifies the maximum number of subset models to be displayed or output to the OUTEST= data set. For SELECTION=RSQUARE, the BEST= option requests the maximum number of subset models for each size. If the BEST= option is used without the B option (displaying estimated regression coefficients), the variables in each MODEL are listed in order of inclusion, instead of the order in which they appear in the MODEL statement. If the BEST= option is omitted and the number of regressors is less than 11, all possible subsets are evaluated. If the BEST= option is omitted and the number of regressors is greater than 10, the number of subsets selected is, at most, equal to the number of regressors. A small value of the BEST= option greatly reduces the CPU time required for large problems.
SAS can output key statistics from the selection process to a data set via the ODS OUTPUT statement to facilitate analysis. It is better to sort this data set by NumInModel, the number of variables in the model at each step, for easier reading.

ods output SubsetSelSummary=SelectionSummary;
proc reg data=prostate ;
     model lpsa = lcavol  lweight age lbph 
           svi lcp gleason pgg45 
           /selection=cp  rsquare  adjrsq;
run;
quit;
proc sort data=SelectionSummary;
     by NumInModel;
run;

2. Backward, forward, stepwise selection.
These automatic variable selection methods are readily available for linear regression in a couple of SAS procedures: PROC REG and PROC GLMSELECT. The usage is the same: specify SELECTION=BACKWARD (or FORWARD, STEPWISE) in the MODEL statement of either procedure. In PROC REG, you can further specify the DETAILS option in the MODEL statement to obtain more information about the variable selection, while in PROC GLMSELECT you can specify the (STOP=NONE) sub-option of the SELECTION= option to ask SAS to continue the model evaluation even after the best model has already appeared during the selection process.
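As a sketch, a stepwise run in PROC REG on the training portion of the prostate data might look like the following (the SLE=/SLS= entry and stay significance levels of 0.15 are illustrative choices, not values from the text):

```sas
proc reg data=prostate_std;
     model lpsa = lcavol lweight age lbph
                  svi lcp gleason pgg45
                  /selection=stepwise sle=0.15 sls=0.15 details;
     /* restrict to the flagged training sample from PROC SURVEYSELECT */
     where selected=1;
run;
quit;
```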

3. Shrinkage methods.
3.1 Ridge Regression.
Ridge regression is handled by PROC REG in SAS; all you need to do is specify the RIDGE= option in the PROC REG statement, like below:

proc reg data=prostate  outest=ridge_est
         ridge=0.01 to 0.9 by 0.01;
      model lpsa = lcavol  lweight age lbph  
                   svi lcp gleason pgg45 ;
run;
quit;

The OUTEST= option will output the parameter estimates, including the estimates for the non-regularized model and those corresponding to each ridge parameter value generated by the RIDGE=0.01 TO 0.9 BY 0.01 option.

3.2 LASSO Regression.
LASSO is supported by the PROC GLMSELECT procedure; all you need to do is specify SELECTION=LASSO in the MODEL statement.

proc glmselect data=prostate_std
               plots=coefficient(stepaxis=normb unpack);
      model lpsa = lcavol  lweight age lbph  
                   svi lcp gleason pgg45 
                   /selection=lasso(stop=9);
run;

SAS will stop when the appropriate model is found; therefore, in order to explore the full path of coefficients, it is necessary either to specify STOP=p+1, where p is the number of covariates, as above, or to use the specification SELECTION=LASSO(STOP=NONE).

3.3. Principal Component Regression (PCR) and Partial Least Square (PLS)
Both of these regression models are supported by PROC PLS. To request PCR, simply use the METHOD=PCR option in the PROC PLS statement; when METHOD=SIMPLS is specified, PLS is executed.

proc pls data=prostate_std  method=pcr;
      model lpsa = lcavol  lweight age lbph
                   svi lcp gleason pgg45 / solution;
run;

proc pls data=prostate_std  method=simpls;
      model lpsa = lcavol  lweight age lbph  
                   svi lcp gleason pgg45 / solution;
run;

PROC PLS does not automatically show or output parameter estimates; therefore you have to explicitly specify the SOLUTION option in the MODEL statement.

3.4 Elastic Net.
Elastic Net for linear regression can be implemented in several ways; the glmnet R package uses coordinate descent, which is not easy to implement in SAS. In the original paper, however, Zou and Hastie explored the so-called naive elastic net, where a ridge regression is set up via data augmentation and LASSO is then executed on the augmented data set. The elastic net solution is obtained by scaling back the naive elastic net solution. This process is illustrated below with the ridge parameter lambda2 set to 1000, so the augmented observations carry sqrt(1000) and the final scaling factor is sqrt(1+1000) = sqrt(1001):

%let covars=lcavol  lweight age lbph  
            gleason pgg45   lcp svi ;
%let depvar=lpsa;
proc standard data=prostate  out=prostate_std
              mean=0  std=1;
     var &covars;
run;
proc standard data=prostate_std  out=prostate_std
              mean=0;
     var &depvar;
run;

%let lambda2=%sysfunc(sqrt(1000));
%let denum=%sysfunc(sqrt(1001));
data expand;
     array _v[8] &covars  
                 (8*0); /* 8 predictors */
     retain &depvar 0;
     do id=1 to dim(_v);
        if id>=2 then _v[id-1]=0;
        _v[id]=&lambda2;
        output;
     end;
     drop id;
run;

proc append base=prostate_std  data=expand  force;
run;

data prostate_std;
     set prostate_std;
     array _v[8]  &covars ;
     do j=1 to dim(_v); _v[j]=_v[j]/&denum; end;
     drop j;
run;

In this step, the data is normalized as the original paper suggests: standardize the covariates and center the dependent variable. Then the augmented data set is created, with pseudo diagonal observations appended to the normalized data. Finally, all data are scaled so that the total variance of each covariate is 1.

Then it is ready to apply LASSO regression to this new data to obtain naïve elastic net solution:

ods graphics on;
ods output SelectionSummary=SelectionSummary;
proc glmselect data=prostate_std 
               plots(stepaxis=normb  unpack)=all;
     model lpsa = lcavol  lweight age lbph  svi 
                  lcp gleason pgg45
                  /selection=lasso(stop=none)  details=summary;
run;
ods graphics off;
ods select all;
ods select all;

Note some details. First, in the MODEL statement, the STOP=NONE sub-option is used with the SELECTION=LASSO option to fully explore the coefficient path, and DETAILS=SUMMARY is requested to output key metrics of the selection process. Second, the PLOTS= option is specified in the PROC GLMSELECT statement to obtain a graphical representation of the coefficient path, and the STEPAXIS= sub-option indicates that the X axis of the figure is the shrinkage factor at each step.

SAS Recipes for Statistical Learning

This blog records my progress on my book Statistical Learning using SAS.

This book will mimic the style of the famous Numerical Recipes in C and its successors, concentrating on stable, robust, and efficient SAS code for modern data mining algorithms, mostly following the line of Elements of Statistical Learning. Most importantly, the SAS components involved will be constrained to SAS/BASE + SAS/STAT so that the functionality is readily accessible to the majority of analysts, especially at small-to-medium marketing companies.