Classification is an important task in statistical learning. It refers to algorithms that identify which category an observation belongs to. The algorithms are trained on a data set with known category membership and then applied to new observations to map each one to its most likely category. In this post, we will discuss the so-called linear models for classification, where the decision boundaries between categories are linear.
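As a point of reference for the first method below, here is the usual textbook way of writing linear regression of an indicator matrix. The symbols are assumptions introduced for this sketch and do not appear in the SAS code: N observations, K categories, X the N×(p+1) matrix of predictors with a leading column of ones, and Y the N×K indicator matrix. The block assumes the amsmath package for `cases` and `\text`.

```latex
\[
Y \in \{0,1\}^{N \times K}, \qquad
y_{ik} =
\begin{cases}
1 & \text{if observation } i \text{ belongs to category } k, \\
0 & \text{otherwise}
\end{cases}
\]
\[
\hat{B} = (X^{\top} X)^{-1} X^{\top} Y, \qquad
\hat{f}(x)^{\top} = (1, x^{\top})\,\hat{B}, \qquad
\hat{G}(x) = \arg\max_{k} \hat{f}_k(x)
\]
```

Each column of Y is regressed on the same predictors, and a new observation is assigned to the category whose fitted value is largest.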
Linear regression of an indicator matrix. The easiest and most generic way to generate an indicator matrix from a target variable is PROC GLMMOD. List the target variable in the CLASS statement and use it as the independent variable in the MODEL statement, because all you need from GLMMOD is the design matrix built from the levels of the target variable. By default GLMMOD displays every design point, which is a lot of output, especially when the target variable has many levels, so it is advisable to specify the NOPRINT option to suppress the displayed output.
proc glmmod data=sashelp.cars outparm=ParameterMapping
            outdesign=train_indicator(keep=col:) noprint;
   class type;
   model MSRP = type / noint;
run;
In this example, the CARS data set in the SASHELP library is used. The OUTPARM= option tells PROC GLMMOD to output a parameter mapping table. This is needed because the output design matrix only contains generic variable names such as COL1, COL2, …, so to identify which level corresponds to which column you have to keep that mapping handy. The OUTDESIGN= option tells PROC GLMMOD where to put the design matrix. Be aware that the OUTDESIGN= data set contains only the design columns and the dependent variable from the MODEL statement (and the KEEP=col: data set option used here retains only the design columns), so it is necessary to merge the design matrix back with the original data:
/* No BY statement: this is a one-to-one merge that pairs observations by position */
data train_indicator;
   merge train_indicator sashelp.cars;
run;
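Before moving on, it can help to check what the two data sets look like. The following is only a small inspection sketch that uses the data set names created above; the OBS=5 limit is an arbitrary choice.

```sas
/* Which level of Type does each COLn column represent? */
proc print data=ParameterMapping noobs;
run;

/* Spot-check a few merged rows: exactly one of the COLn columns should be 1 */
proc print data=train_indicator(obs=5);
   var Type col:;
run;
```

If the indicator columns do not line up with the Type values in the second listing, the one-to-one merge has paired the wrong rows and should be revisited.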
Notice that the target variable is listed in the CLASS statement and appears as a covariate on the right-hand side of the MODEL statement. Any numeric variable can serve as the pseudo-dependent variable; here we used MSRP, but MPG_City, Length, etc. would work just as well. Also note that the NOINT option is specified in the MODEL statement; otherwise COL1 would always be the intercept column of ones, which you don’t need in an indicator matrix.
After the design matrix is merged with the data you need for modeling, you can regress the indicator matrix on the predictors using PROC REG. Take a look at the following code:
proc sql noprint;
   select cats("Col", _COLNUM_) into :depvars separated by " "
   from ParameterMapping;
quit;
%put &depvars;

/* --> now Regression */
proc reg data=train_indicator outest=beta noprint;
   model &depvars = MSRP MPG_City MPG_Highway Invoice Cylinders Wheelbase;
   output out=pred p=p1-p6; /* if you specify more than the number of indicators, extra will be ignored */
run;
quit;
The PROC SQL step extracts the names of the indicator columns. Although we know the names are simply COL1, COL2, …, this is an easy and generic way to build the list of dependent variables. You can put the whole list on the left-hand side of the MODEL statement because PROC REG accepts more than one dependent variable. Let’s have a closer look at PROC REG.
OUTEST=beta asks PROC REG to write the parameter estimates to a SAS data set named beta. Because &depvars has multiple elements (COL1, COL2, …, COL6 in this case), there will be 6 observations in the beta data set, each corresponding to the model for one of the dependent variables.
If you also want PROC REG to score the data points and output predictions for each category, you can use the OUTPUT statement as shown above, specifying p=p1-p6. Here you have to be explicit about how many prediction variables you want; each one corresponds to one category (one column) of the indicator matrix, in the order the dependent variables are listed. In the example above there are 6 categories, so you specify 6 variables using the variable list p1-p6. If you specify fewer than 6, say only 3, then only the predictions for the first three categories will be output. If, instead, you specify more than 6 prediction variables, SAS is smart enough to output only 6 of them while issuing a WARNING message in the log.
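The fitted values p1-p6 can then be turned into a predicted category by picking, for each observation, the column with the largest fitted value. The step below is a minimal sketch of that idea and is not part of the original code: the data set name classified and the variables pred_col, best and k are hypothetical names chosen for illustration, and the array size of 6 is hard-coded to match this example.

```sas
/* Sketch: assign each observation to the indicator column with the largest fitted value */
data classified;
   set pred;
   array p{6} p1-p6;            /* fitted values, one per indicator column */
   best = .;                    /* a missing value sorts below every number in SAS */
   pred_col = .;
   do k = 1 to dim(p);
      if p{k} > best then do;
         best = p{k};
         pred_col = k;          /* k refers to COLk in ParameterMapping */
      end;
   end;
   drop k best;
run;

/* Cross-tabulate the actual Type against the predicted column number */
proc freq data=classified;
   tables Type*pred_col / norow nocol nopercent;
run;
```

Observations whose fitted values are all missing (for example, because a predictor is missing) keep a missing pred_col, and PROC FREQ leaves them out of the table by default. Use the ParameterMapping data set to translate each pred_col value back into a level of Type.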
[to be continued...]