Parametric Modeling with cNorm - Mathematical Derivation of the Sinh-Arcsinh Distribution

Parametric Modeling with cNORM - Mathematical Derivation of the Sinh-Arcsinh Distribution

The Sinh-Arcsinh distribution (shash distribution) is defined by a transformation of a standard normal variable. If Y follows a standard normal distribution, Y ~ N(0,1), then the ShaSh-distributed variable X is generated by:

\[X = \mu + \sigma \cdot \sinh\left(\frac{\text{arcsinh}(Y) - \epsilon}{\delta}\right)\]

This transformation allows the resulting variable X to have its location, scale, skewness, and tail weight controlled by the following four parameters:

\(\mu\) (mu): location parameter, shifts the distribution horizontally (similar to mean)
\(\sigma\) (sigma): scale parameter (\(\sigma > 0\)), controls the spread of the distribution (similar to standard deviation)
\(\epsilon\) (epsilon): skewness parameter (\(\epsilon = 0\) for symmetry, \(\epsilon > 0\) for right skew, \(\epsilon < 0\) for left skew)
\(\delta\) (delta): tail weight parameter (\(\delta = 1\) produces normal-like tails, \(\delta > 1\) produces heavier tails, \(\delta < 1\) produces lighter tails)

The probability density function is given by:

\[f(x|\mu,\sigma,\epsilon,\delta) = \frac{\delta}{\sigma\sqrt{2\pi}} \cdot \frac{\cosh(\delta \cdot \text{arcsinh}(z) + \epsilon)}{\sqrt{1 + z^2}} \cdot \exp\left(-\frac{1}{2}[\sinh(\delta \cdot \text{arcsinh}(z) + \epsilon)]^2\right)\]

where \(z = \frac{x - \mu}{\sigma}\). The cumulative distribution function (CDF) does not have a closed-form expression but can be computed numerically. For a given value x, the CDF is:

\[F(x|\mu,\sigma,\epsilon,\delta) = P(X \leq x) = \Phi[\sinh(\delta \cdot \text{arcsinh}(z) + \epsilon)]\]

where \(\Phi\) is the standard normal CDF and \(z = (x - \mu)/\sigma\). The quantile function (inverse CDF) can be expressed as:

\[Q(p|\mu,\sigma,\epsilon,\delta) = \mu + \sigma \cdot \sinh\left(\frac{\text{arcsinh}(\Phi^{-1}(p)) - \epsilon}{\delta}\right)\]

Modeling the Sinh-Arcsinh over Age

In cNORM's shash implementation, we model the four parameters as polynomial functions of the explanatory variable (e.g., age). Please note, that in the default setting, the tail weight parameter \(\delta\) is held constant across age, though. It can be adjusted to reflect population characteristics, e.g. by increasing it to values \(\delta > 1\) for heterogeneous samples or \(\delta < 1\) for homogeneous samples. By setting the delta_degree parameter d, \(\delta\) is as well modeled as a polynomial across age. If used, it is advisable to keep the delta_degree parameter low (i.e., not higher than 2), to avoid overfitting. Age is standardized as: \[\text{age}_{std} = \frac{\text{age} - \overline{\text{age}}}{\text{SD}(\text{age})}\] for numerical stability during optimization. Specifically, the parameters are modeled as:

\[\mu(\text{age}_{std}) = \mu_0 + \mu_1 \text{age}_{std} + \mu_2 \text{age}_{std}^2 + ... + \mu_m \text{age}_{std}^m\]

\[\log(\sigma(\text{age}_{std})) = \sigma_0 + \sigma_1 \text{age}_{std} + \sigma_2 \text{age}_{std}^2 + ... + \sigma_s \text{age}_{std}^s\]

\[\epsilon(\text{age}_{std}) = \epsilon_0 + \epsilon_1 \text{age}_{std} + \epsilon_2 \text{age}_{std}^2 + ... + \epsilon_e \text{age}_{std}^e\]

\[\log(\delta(\text{age}_{std})) = \delta_0 + \delta_1 \text{age}_{std} + \delta_2 \text{age}_{std}^2 + ... + \delta_d \text{age}_{std}^d\]

where m, s, e, and d are the degrees of the polynomials for the respective parameters. We use the logarithm of \(\sigma\) and \(\delta\) to ensure that they are positive, as required by the distribution. This transformation also helps in stabilizing the variance and improving the optimization process. Parameters are estimated using maximum likelihood estimation. The log-likelihood function for N observations is:

\[L(\theta | X, Age) = \sum_{i=1}^N \log[f(X_i | \mu(Age_{std,i}), \sigma(Age_{std,i}), \epsilon(Age_{std,i}), \delta(Age_{std,i}))]\]

where \(\theta\) represents the latent variable, \(X_i\) is the raw score and \(Age_{std,i}\) is the standardized age for the i-th observation. The data fitting is performed using numerical optimization techniques, specifically the L-BFGS-B (Limited-memory BFGS) algorithm of the 'optim' function of the stats package of the R plattform. It is a quasi-Newton method for solving large nonlinear optimization problems with simple bounds. By approximating the Hessian matrix, it simultaneously determines the coefficients of the regression equations for all parameters that maximize the log-likelihood, thus providing the best fit to the observed data.

References

Jones, M. C., & Pewsey, A. (2009). Sinh-arcsinh distributions. Biometrika, 96(4), 761-780.

Back to parametric modeling

Site menu: