cNORM - Data Preparation


  1. Representativeness of the data base
  2. Determining the norm sample size
  3. Data preparation in R
  4. Ranking: Retrieving Percentiles and Norm Scores
  5. Computing powers and interactions
  6. In a single step

Representativeness of the data base

The starting point for standardization should always be a representative sample. Establishing representativeness is one of the most difficult tasks in test construction and must therefore be carried out with appropriate care. First, it is important to identify those variables that systematically covary with the variable to be measured. For school performance and intelligence tests, these are, for example, the type of school, the federal state, the socio-economic background, etc. Caution: Increasing the sample size only benefits the quality of the standardization if the covariates are not systematically distorted. For example, it would be useless or even counterproductive to enlarge a sample that was collected in only a single type of school or only a single region.

Determining the norm sample size

The appropriate sample size cannot be quantified in a definitive way, but depends on how well the test (or scale) must differentiate in the extreme sections of the norm scale. In many countries, for example, it is common (although not always reasonable) to differentiate between IQ < 70 and IQ > 70 in order to diagnose developmental disabilities and to choose the appropriate school type or track. An IQ test used for school placement must therefore be able to identify a deviation of 2 SD or more from the population mean as reliably as possible. If, on the other hand, the diagnosis of a reading/spelling disorder is required, a deviation of 1.5 SD from the population mean is generally sufficient according to DSM-5. As a rule of thumb for determining the ideal sample size, it can be stated that the uncertainty of standardization increases particularly in those performance ranges that are rarely represented in the standardization sample. (This applies not only to the nonparametric method presented here, but in principle to all parametric standardization methods as well.) For example, in a representative random sample of N = 100, the probability that there is not a single child with an IQ below 70 is about 10%. For a sample size of N = 200, this probability decreases to 1%. Doubling the sample size thus notably improves the reliability of the norm scores in ranges markedly deviating from the scale mean.
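These probabilities can be reproduced with a few lines of base R (a quick check, assuming an IQ scale with M = 100 and SD = 15):

```r
# Probability that a single randomly drawn child has an IQ below 70
# (i.e., more than 2 SD below the population mean)
p <- pnorm(70, mean = 100, sd = 15)   # about 0.023

# Probability that a representative sample contains NO such child
(1 - p)^100   # about 0.10 for N = 100
(1 - p)^200   # about 0.01 for N = 200
```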

Data preparation in R

If a sufficiently large and representative sample has been established (cases with missing values should be excluded), the data must first be imported. It is advisable to start with a simply structured data object of type data.frame that contains only numeric variables without value labels. It is also favorable to label the measured raw scores with the variable name "raw", as this is the default specification in cNORM. All variable names can be defined individually, but must then be specified as function parameters. The explanatory variable in psychometric performance tests is usually age; we therefore refer to this variable as "age". In fact, however, the explanatory variable is not necessarily age: a training or schooling duration or other explanatory variables can also be included in the modeling. It must, however, be an interval-scaled (or, as the case may be, dichotomous) variable. Finally, a grouping variable is required to divide the explanatory variable into smaller standardization groups (e.g., grades or age groups). The method is relatively robust against changes in the granularity of the group subdivision. For example, the result of the standardization depends only marginally on whether one chooses half-year or full-year gradations (see A. Lenhard, Lenhard, Suggate & Segerer, 2016). The more the variable to be measured covaries with the explanatory variable (e.g., a fast development over age in an intelligence test), the more groups should be formed beforehand to capture the trajectories adequately. By default, we assign the variable name "group" to the grouping variable.
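To illustrate, a minimal input object could look like this (fictitious values; only the default column names 'age', 'group' and 'raw' are taken from the conventions described above):

```r
# Minimal sketch of the expected data structure:
# numeric columns only, no value labels, default variable names
normData <- data.frame(
  age   = c(6.2, 6.4, 7.1, 7.3, 8.0, 8.2),  # continuous explanatory variable
  group = c(6,   6,   7,   7,   8,   8),    # discrete grouping variable
  raw   = c(12,  15,  19,  22,  27,  30)    # measured raw scores
)
```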

If, when using cNORM, you initially only have a continuous age variable available, you must recode it into a discrete grouping variable. The following code could be helpful (another possibility is the 'rankBySlidingWindow' function described below):

# Creates a grouping variable for a fictitious age variable
# for children age 2 to 18. That way, the age variable is recoded
# into a discrete group variable, each group comprising a year.

data$group <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
  [findInterval(data$age, c(-Inf, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
  13, 14, 15, 16, 17))]

Of course, it is also possible to use a data set for which standard scores already exist for individual age groups. A continuously distributed age variable is not necessary in this case. When using RStudio, data can easily be imported from other statistical environments using the import function.

For demonstration purposes, cNORM includes a cleaned data set from a German test standardization (ELFE 1-6, W. Lenhard & Schneider, 2006, subtest sentence comprehension), which will be used to demonstrate the method. Another large (but unrepresentative) data set stems from the adaptation of a vocabulary test to the German language (PPVT-4, A. Lenhard, Lenhard, Segerer & Suggate, 2015). For biometric modeling, the package includes a large CDC data set (N > 45,000) with growth curves from age 2 to 25 (weight, height, BMI; CDC, 2012), and for macroeconomic and sociological data, the World Bank data on mortality and life expectancy at birth from 1960 to 2017. You can retrieve information on the data sets by typing ?elfe, ?ppvt, ?CDC, ?life or ?mortality on the R console. To load the data sets, please use the following code:

# Loads the package cNORM
library(cNORM)

# Copies the data set "elfe" from the environment into the object 'normData'

normData <- elfe

# Or similarly for the "ppvt"

normData <- ppvt

# And finally the data for the Body Mass Index
# Please specify 'bmi' as the 'raw' variable in this case in later analyses

normData <- CDC

# Displays the first lines of the data
head(normData)
As you can see, there is no age variable in the data set "elfe", only a person ID, a raw score and a grouping variable. In this case, the grouping variable also serves as a continuous explanatory variable, since children were only examined at the very beginning and at the exact middle of the school year during the test standardization. For example, the value 2.0 means that the children were at the beginning of the second school year, while the value 2.5 means that they were examined in the middle of the second school year. Another possibility would have been to examine children throughout the entire school year. In that case, the duration of schooling would have to be entered as a continuous explanatory variable. To build the grouping variable, the first and second half of each school year could, for example, be aggregated into one group respectively.

In the "elfe" data set there are seven groups with 200 cases each, i.e. a total of 1400 cases. With the help of the psych package, descriptive data can be displayed in groups if desired (optional):

# Install and load the psych package
install.packages("psych", dependencies = TRUE)
library(psych)

# Display descriptive results by group
describeBy(normData, group = "group")

Ranking: Retrieving Percentiles and Norm Scores

The next step is to rank each person within each group using the rankByGroup function. The function returns percentiles and also performs a normal-rank transformation, returning T-Scores (M = 50, SD = 10) by default. In principle, our mathematical method also works without the normal-rank transformation, i.e., it could theoretically be carried out with the percentiles alone. This is useful, for example, if you want to enter a variable that deviates extremely from the normal distribution or follows a completely different distribution. For most psychological or physical scales, however, the distributions are still sufficiently similar to the normal distribution even with strong bottom and ceiling effects. In these cases, the normal-rank transformation usually increases the model fit and facilitates the further processing of the data. In addition to T-Scores, the standard scores can also be expressed as z- or IQ-Scores. You can also choose between different ranking methods (RankIt, Blom, van der Waerden, Tukey, Levenbach, Filliben, Yu & Huang). However, we will stick to T-Scores and RankIt, which are preset by default:

# Determine percentiles by group

normData <- rankByGroup(elfe, group = "group")

To change the ranking method, please specify a method index with 1 = Blom (1958), 2 = Tukey (1949), 3 = van der Waerden (1952), 4 = RankIt (default), 5 = Levenbach (1953), 6 = Filliben (1975) and 7 = Yu & Huang (2001). The standard score can be specified as T-Score, IQ-Score, z-Score or by means of a double vector of M and SD, e.g. scale = c(10, 3) for Wechsler subtest scaled scores. The grouping variable can be deactivated by setting group = FALSE; the normal-rank transformation is then applied to the entire sample.

Please note that there is another function for determining the rank that works without discrete grouping variables. The rank of each individual subject is then estimated on the basis of the continuous explanatory variable using a sliding window, whose width can be specified individually. In the case of a continuous age variable, the specification width = 0.5 means, for example, that the window is half a year wide. As a consequence, the rank of a test person is based on all participants who are no more than 3 months younger or older than the person in question, i.e., the group spans a total of 6 months.
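The windowing logic can be sketched in base R with hypothetical ages (the actual ranking is done internally by 'rankBySlidingWindow'):

```r
# For a window of width 0.5 around a test person aged 7.2 years,
# only participants at most 0.25 years (3 months) younger or older count
age    <- c(6.90, 6.99, 7.10, 7.30, 7.44, 7.46, 7.80)
center <- 7.2
width  <- 0.5

inWindow <- abs(age - center) <= width / 2
age[inWindow]   # 6.99 7.10 7.30 7.44
```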

# Estimation of norm scores via a sliding window of width 0.5

normData2 <- rankBySlidingWindow(data = elfe, age = "group", raw = "raw",
  width = 0.5)

The rankBySlidingWindow function also offers functionality to automatically build a grouping variable and determine the group means:

# Estimation of norm scores via a sliding window of width 1.
# A grouping variable with 14 distinct groups is built automatically,
# and the mean age is assigned to each group.

normData2 <- rankBySlidingWindow(ppvt, age = "age", width = 1, nGroup = 14)

Please note that the 'rankBySlidingWindow' function only makes sense if the age variable is actually continuous. In the 'elfe' data set, the variable 'group' serves as both a continuous explanatory variable and a discrete grouping variable. Therefore, in this specific case, the function 'rankBySlidingWindow' yields the same standard scores as the function 'rankByGroup'.

Both ranking functions ('rankBySlidingWindow' and 'rankByGroup') add at least two additional columns, namely 'percentile' and 'normValue'. In addition, descriptive information about each group is added, namely n, m, md and sd.

Descriptive results are only necessary under certain circumstances. The creation of these variables can be deactivated via the parameter 'descriptives'.

Computing powers and interactions

At this point, where many test developers already stop the standardization, the actual modeling process begins. A function is determined that expresses the raw score as a function of the latent person parameter l and the explanatory variable. In the following, we will refer to the latter variable as 'a'. In the 'elfe' example, we use the discrete variable 'group' for a. If an additional continuous age variable is available, it should be used as 'a' instead because of its higher precision.

To retrieve the mathematical model, all powers of the variables 'l' and 'a' up to a certain exponent k must be computed. Subsequently, all interactions between these powers must be calculated by simple multiplication. As a rule of thumb, k > 5 leads to overfitting. In general, k = 4 or even k = 3 is already sufficient to model human performance data with adequate precision. Please use the following function for the calculation:

# Calculation of powers and interactions up to k = 4

normData <- computePowers(normData, k = 4, norm = "normValue", age = "group")

The data set now contains 24 new variables (k² + 2k), namely all powers of l (L1, L2, L3, L4), all powers of a (A1, A2, A3, A4) and all products of these power variables (L1A1, L1A2, L1A3, ... L4A3, L4A4).
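That the number of new variables equals k² + 2k can be illustrated in base R (fictitious values for l and a; 'computePowers' performs this computation for every case in the data set):

```r
k <- 4
l <- 0.8   # fictitious latent person parameter
a <- 2.5   # fictitious explanatory variable (e.g. group)

L  <- l^(1:k)                 # the k powers L1 ... L4
A  <- a^(1:k)                 # the k powers A1 ... A4
LA <- as.vector(outer(L, A))  # the k * k interactions L1A1 ... L4A4

length(c(L, A, LA))           # k^2 + 2*k = 24 variables
```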

In a single step

cNORM also includes the convenience method 'prepareData', which performs both the ranking and the subsequent calculation of powers in one step. Note that the data set must contain the variables 'group' and 'raw' for this purpose, unless specified otherwise.

# Combines the functions 'rankByGroup' and 'computePowers'

normData <- prepareData(normData)

# The variable names can be provided, if necessary
# In the example, the CDC dataset on BMI growth is used

data.bmi <- prepareData(CDC, group="group", raw="bmi", age="age")