We now want to find a regression model that reproduces the original data as closely as possible with as few predictors as possible. At the same time, we want to smooth out noise in the original norm data, which can stem from the random sampling process or from violations of representativeness. This is done with the 'bestModel' function. You can use this function in two different ways: If you specify R_{adjusted}^{2}, the regression function that meets this requirement with the smallest number of predictors is selected. Alternatively, you can specify a fixed number of predictors. In that case, the model that achieves the highest R_{adjusted}^{2} with this number of predictors is selected. To select the best model, cNORM uses the 'regsubsets' function from the 'leaps' package. As we do not know beforehand how well the data can be modeled, we start with the default values (k = 4 and R_{adjusted}^{2} = .99):

model <- bestModel(normData)

The function prints the following result:

Final solution: 3

R-Square Adj. amounts to 0.990812753080448

Final regression model: raw ~ L3 + L1A1 + L3A3

Beta weights are accessible via 'model$coefficients':

  (Intercept)            L3          L1A1          L3A3
-1.141915e+01  2.085614e-05  1.651765e-01 -5.911150e-07

Regression formula:

[1] "raw ~ -11.4191452286606 + (2.08561440071111e-05*L3) + (0.165176532450589*L1A1) + (-5.9111502682762e-07*L3A3)"

Use 'plotSubset(model)' to inspect model fit

Fine! The selected model already exceeds the predefined threshold of R_{adjusted}^{2} = .99 with only three predictors (plus intercept). The 'bestModel' function also returns the coefficients and the complete regression formula, which, as specified, captures more than 99% of the variance in the data set.
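To make the printed regression formula more tangible, the following minimal sketch evaluates it by hand in base R. It assumes the usual cNORM naming of the terms, i.e. L3 = L^3, L1A1 = L * A and L3A3 = L^3 * A^3, where L is the norm score and A the explanatory variable. The helper name predict_raw_manual and the input values (L = 50, A = 3) are chosen purely for illustration and are not part of cNORM.

```r
# Coefficients copied from the printed regression formula above.
b <- c(intercept = -11.4191452286606,
       L3        = 2.08561440071111e-05,
       L1A1      = 0.165176532450589,
       L3A3      = -5.9111502682762e-07)

# Hypothetical helper: expected raw score for norm score L and explanatory
# value A, assuming L3 = L^3, L1A1 = L * A and L3A3 = L^3 * A^3.
predict_raw_manual <- function(L, A) {
  unname(b["intercept"] +
         b["L3"]   * L^3 +
         b["L1A1"] * L * A +
         b["L3A3"] * L^3 * A^3)
}

# Illustrative input values only.
raw_at_50 <- predict_raw_manual(L = 50, A = 3)
print(raw_at_50)
```

In practice, the fitted model object should be used for predictions; the sketch only shows how the three terms and the intercept combine into a raw score.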

If you want to have a look at the selection procedure, all the information is available in 'model$subsets'. The variable selection process per step is listed in 'outmat' and 'which'. There, you can find R^{2}, R_{adjusted}^{2}, Mallows's C_{p} and BIC. This information can be printed with the following command:

printSubset(model)

Furthermore, the change in R_{adjusted}^{2} and in other information criteria (Mallows's C_{p} or BIC) as a function of the number of predictors (with fixed k) can also be inspected graphically. Use the following command to do this:

plotSubset(model, type = 0)

By default, the figure displays R_{adjusted}^{2} as a function of the number of predictors. Alternatively, you can plot log-transformed Mallows's C_{p} (type = 1) or BIC (type = 2) as a function of R_{adjusted}^{2}.

The figure shows that the default value of R_{adjusted}^{2} = .99 is already achieved with only three predictors. Including further predictors leads only to small increases in R_{adjusted}^{2} or small decreases in Mallows's C_{p}. Where the dots are close together, including further predictors is of little use. To avoid overfitting, a model with as few predictors as possible should therefore be selected from this area.
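The reasoning behind this reading of the figure can be sketched in a few lines of base R. The adjusted R^{2} values below are made up for illustration and are not cNORM output. The sketch picks the smallest number of predictors that reaches the threshold and computes the gain from each additional predictor.

```r
# Illustrative adjusted R^2 values for models with 1 to 6 predictors
# (made-up numbers, not cNORM output).
adj_r2 <- c(0.9521, 0.9812, 0.9908, 0.9915, 0.9919, 0.9921)

# Smallest number of predictors that reaches the required adjusted R^2.
r2_threshold <- 0.99
n_by_threshold <- which(adj_r2 >= r2_threshold)[1]

# Gain in adjusted R^2 from each additional predictor; very small gains
# correspond to the region where the dots in the plot lie close together.
gain <- diff(adj_r2)

print(n_by_threshold)
print(round(gain, 4))
```

With these illustrative numbers, the gains after the third predictor are all below .001, which is the pattern that argues against adding further terms.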

The model with three predictors seems to be suitable. Nevertheless, the model found in this way must still be tested for plausibility using the means described in Model Validation. Above all, it is necessary to determine the limits of model validity. If a model turns out to be suboptimal after this model check, R_{adjusted}^{2}, the number of predictors or, if necessary, k should be chosen differently.
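If the model check suggests a different specification, the selection can be repeated with other settings. The following lines are a minimal sketch, assuming that the installed cNORM version provides the 'R2' and 'terms' arguments of 'bestModel' and that normData was created with 'prepareData' from the elfe dataset. The concrete values (.95 and 4) are arbitrary examples, not recommendations.

```r
library(cNORM)

# Assumption: normData is prepared as in the data preparation step.
normData <- prepareData(elfe)

# Option 1: require a different adjusted R^2; the model with the smallest
# number of predictors that reaches this value is selected.
model_r2 <- bestModel(normData, R2 = 0.95)

# Option 2: fix the number of predictors instead (here: four terms).
model_terms <- bestModel(normData, terms = 4)

# Inspect the percentile curves of the alternative model for plausibility.
plotPercentiles(normData, model_terms)
```

Each alternative model should go through the same plausibility checks as the original one before it is used to generate norm tables.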

Indeed, the aim of the modeling process is not to capture the maximum variance in the observed data, but to retrieve models that predict the (usually unknown) population distribution. Fitting the model too closely to the training data is likely to result in overfitting. To avoid this and to estimate how well the fitting generalizes, you need to cross-validate the modeling. cNORM helps in selecting the number of terms for the model by running a repeated cross validation with 80 percent of the data as training data and 20 percent as validation data. The cases are drawn randomly but stratified by norm group. Successive models are retrieved with an increasing number of terms, and the RMSE of the raw scores (fitted by the regression model) is plotted for the training, the validation and the complete dataset. In addition to this analysis on the raw score level, it is possible to estimate the mean norm score reliability and crossfit measures.
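The idea of a stratified 80/20 split with an RMSE comparison can be sketched in base R as follows. This is not the implementation used by cNORM. The helper names stratified_split and rmse, the toy data and the simple linear model are assumptions made only to illustrate the principle.

```r
# Hypothetical helper: draw a random split within each group, so that every
# group contributes the same share of cases to the training data.
stratified_split <- function(data, group_col, train_share = 0.8) {
  idx_by_group <- split(seq_len(nrow(data)), data[[group_col]])
  train_idx <- unlist(lapply(idx_by_group, function(idx) {
    n_train <- round(length(idx) * train_share)
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  list(train      = data[train_idx, , drop = FALSE],
       validation = data[-train_idx, , drop = FALSE])
}

# Root mean square error between observed and predicted values.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Toy data (simulated, not the elfe dataset): four groups with 50 cases each.
set.seed(1)
toy <- data.frame(group = rep(1:4, each = 50), x = runif(200))
toy$raw <- 5 + 3 * toy$group + 2 * toy$x + rnorm(200)

parts <- stratified_split(toy, "group")
fit <- lm(raw ~ group + x, data = parts$train)

rmse_train <- rmse(parts$train$raw, predict(fit, parts$train))
rmse_val   <- rmse(parts$validation$raw, predict(fit, parts$validation))
print(c(train = rmse_train, validation = rmse_val))
```

Repeating the split several times and averaging the errors gives a more stable picture than a single split, which is the motivation for the repeated cross validation described above.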

A CROSSFIT higher than 1 is a sign of overfitting. Values lower than 1 indicate an underfit due to a suboptimal modeling procedure, i.e. the method may not have captured all the variance of the observed data it could possibly capture. Values around 1 are ideal, as long as the raw score RMSE is low and the norm score validation R^{2} reaches high levels. As a suggestion for real psychometric tests:

- Use visual inspection of the percentiles with plotPercentiles or plotPercentileSeries
- Combine the visual inspection of the percentiles with a repeated cross validation (e.g. 10 repetitions)
- Focus on a low raw score RMSE and a high norm score R^{2} in the validation dataset, and avoid a number of terms with a high overfit (e.g. crossfit > 1.1).

The following example was generated on the basis of the *elfe* dataset with 10 repetitions and up to a maximum number of 10 terms in the regression function:

data <- prepareData(elfe)

cnorm.cv(data, repetitions = 10, max = 10)

The results support the decision in the example above to include three terms in the regression function (+ Intercept). Adding more terms neither leads to a lower raw score RMSE nor to an increase in norm score R^{2} in the validation data. Including more terms simply results in an overfit. As long as there are no intersecting percentile curves in *plotPercentiles*, it is advisable to stay with that number of terms.
