Weighted Marginal Maximum Likelihood Regression Estimation

This document describes weighted Marginal Maximum Likelihood (MML) estimation for student test data in the Dire package. The model treats abilities as a latent variable and estimates a linear model in the latent space. In this framework, failing to account for the measurement variance will bias the regression estimates. The Dire package can directly estimate these models in the latent space. Aditionally, it can generate palausible values (PVs) which allow for unbiased estimation using the same formulas as multiple imputation (Mislevy et al.).

For simplicity, focusing first on a univariate model where there is only one latent ability measured, the student test data are assumed to have been generated by an Item Response Theory (IRT) model, where student i has ability θi. The probability of a student getting an item correct—for a dichotomous item—increases in ability but decreases in item difficulty. Put together, the likelihood of student i’s responss to items (Ri) is a function of the student’s latent ability in the construct (θi), the item parameters of each item(P) where the probability function in the product is showin in the appendix and depends on the IRT model used for the item. it is possible that a student will not have been shown an item and then Rij is missing. The missing response does not change the probability. This can be thought of as ${\rm Pr}(R_{ih}| \theta_{i}, \bm{P}_h)$ being 1 for all values of θi and Ph.

This latent trait is then modeled as a function of a row vector of explanatory variables Xi so that. Where

This creates two sets of likelihoods for person i, one associated with the covariates (Xi) and another associated with the observed item responses (Ri). The likelihood for the student is the product of these two. where f(θi) is the density of student is latent ability θi. Substituting the formula for above, The term in the integral is the convolusion of two functions and there is no closed form solution. Dire follows the AM statistical software (Cohen & Jiang 1999) and uses fixed quadrature points set by the user.

In addition, Dire allows for the construct being scored to not be a simple univariate construct but to incorporate several correlated subscales or constructs. If these are labeled from one to J then each student has a vector θi with components θij. This leads to J latent regression equations and requires a J-demensional integral. where Rijh, Phj, and βj all have a subscript j added because there is now a set of them for each subscale j, ϵi is now a vector with elements ϵij, and Σ is the covariance matrix for the ϵ vector, which allows for covariance between constructs at the student level in that when Σjj is positive than a student with a higher score on construct j, conditional on Xi will also be expected to have a higher score on construct j.

This problem suffers from the curse of dimensionality. However, as is pointed out by Cohen and Jiang (1999) this is identical to seemingly unrelated regression (SUR) and can be solved by fitting βj and σj once for each subscale and then identifying each of the ${ J \choose 2}$ covariance terms in a pairwise fashion.

This comes from the fact that the multivariate normal can be decomposed into two multivariate normals, one conditional on the other. First re-writing, eq with the normal rewritten as f(ϵi) the multivariate normal can be decomposed into the product of two distributions, an unconditional (univariate-)normal and a condional (potentially multivariate when J > 2) normal. Using the first integration dimension, without loss of generality, this becomes where f1(ϵi1) is the normal density function for a single variable and f−1(ϵi, −1|ϵi1) is the density function of the other variables, conditional on ϵi1. This can then be rearanged to Changing to the numerical quadrature representation

this is additively seperable—the max with respect to β1 can be found only with data for the first construct. The other constructs can be relabeled and each can be maximized in this way. Similarly, the variances can be independently found in this way.

The Dire package helps analyists estimate coefficients in two potentially useful ways. First the parameters in the regression can be directly estimated, where the β values are used from the above model fit. Second, Dire can generate plausible values from the fitted likelihood surface.

The direct estimation method has the advantage of using the latent space to identify coefficients. However, it can be time consuming to fit a latent parameter model.

The plausible value approach has two advantages. First, it can be used to fit a saturated model (with all possible coefficients of interest) and then other models that use subsets or linear combinations of those coefficients can be fit to the plausible values. Second, doing this saves time relative to fitting a direct estimation model per regression specification.

This document first describes how parameters are estimated in the latest space. The third section describes the variance estimator. The subsequent section describes estimation of degrees of freedom. Finally, it describes plausible value generation. Each section starts by describing the univariate case and then describes the case with J potentially correlated constructs or subscales. Additionally, while the introduction has not mentioned weights, they are incorporated in the subsequent sections.

Parameter Estimation

Starting with the single construct MML model for test data for n individuals, conditional on a set of parameters for a set of H test items, the likelihood of a regression equation is where is the likelihood of the regression parameters β with full sample weights wi conditional on item score matrix R, student covariate matrix X, and item parameter data P; σ2 is the variance of the regression residual; θi is the ith student’s latent ability measure that is being integrated out; ${\rm Pr}(\bm{R}_{ih}|\theta_i, \bm{P}_h)$ is the probability of individual i’s score on test item h, conditional on the student’s ability and item parameters Ph—see the appendix for example forms of ${\rm Pr}(\bm{R}_{ih}|\theta_i, \bm{P}_h)$.

The integral is evaluated using the trapezoid rule at quadrature points tq and quadrature weights δ so that where δ is the distance between any two uniformly spaced quadrature points so that δ = tq + 1 − tq for any q that is at least one and less than Q. The range and value of Q parameterize the quadrature, and its accuracy and should be varied to ensure convergence. The advantage of the trapezoidal rule is that the fixed quadrature points allow the values of the probability (the portion inside the product) to be calculated once per student.

The log-likelihood is given by Note that δ can be removed for optimization, and its presence adds ${\rm log}(\delta) \sum \bm{w}_i$ to the log-likelihood.

Composite Score Estimation

When the outcome of interest is composite scores, the parameters are estimated by separately estimating the coefficients for each subscale (βj for subscale j) and then calculating the composite scores (βc) using subscale weights (ωj).

The full covariance matrix for the residuals (ϵ vector) is then

The likelihood is then where |Σ(jj)| is the determinant of Σ(jj), and the residual term is defined as Notice that the parameters βj, βj, σj2, and σj2 are taken from the subscale estimation, so the only parameter not fixed is the covariance term σij.

Variance Estimation

Starting with the univariate case, estimating variance of the parameters β can be done in one of several ways.

The inverse Hessian matrix is a consistent estimator when the estimator of β is consistent (Green, 2003, p. 520): This variance is returned when the variance method is set to consistent or left as the default.

A class of variance estimators typically called “sandwich” or “robust” variance estimators allow for variation in the residual and are of the form where V is an estimate of the variance of the summed score function (Binder, 1983).

For a convenience sample, we provide two robust estimators. First, the so-called robust (Huber or Huber-White) variance estimator uses

Second, for the cluster robust case, the partial derivatives are summed within the cluster so that where there are n clusters, indexed by c, and the partial derivatives are summed within the group of which there are nc members:

Finally, Dire impelments the survey sampling method called the Taylor series method and uses the same formula as Eq. , but V is the estimate of the variance of the score vector (Binder, 1983). Our implementation assumes a two-stage design with na primary sampling units (PSUs) in stratum a and summed across the A strata according to where Va is a variance estimate for stratum a and is defined by where sp is the sum of the weighted (or pseudo-) score vector that includes all units in PSU p in stratum a and $\bar{\bm{s}}_a$ is the (unweighted) mean of the sp terms in stratum a so that

When a stratum has only one PSU, Va is undefined. The best approach is for the analyst to adjust the strata and PSU identifiers, in a manner consistent with the sampling approach, to avoid singleton strata. Two simpler, automated, but less defensible options are available in Dire. First, the strata with single PSUs can be dropped from the variance estimation, yielding an underestimate of the variance.

The second option is for the singleton stratum to use the overall mean of sp in place of a. So, where the sum is across all PSUs, and n is the number of PSUs across all strata. Then, for each singleton stratum, Eq.  becomes where the value 2 is used in place of $\frac{n_a}{n_a-1}$, which is undefined when na = 1. This option can underestimate the variance but is thought to more likely overestimate it.

While not advisable, it is possible to assume information equality where the hessian (eq. ) and the score vector based covariance (eq. ) are equal. When a user sets gradientHessian=TRUE the Dire package uses this short hand in calculating any of the above results. Using this option is necessary to get agreement with the AM software on variance terms.

Univariate degrees of freedom

For the Taylor series estimator, the degrees of freedom estimator also uses the Welch-Satterthwaite (WS) degrees of freedom estimate (dofWS). The WS weights require an estimate of the degrees of variance per independent group. For a clustered sample, that is available at the stratum.

Following Binder (1983) and Cohen (2002), the contribution cs to the degrees of freedom from stratum s is defined as zuj from the section, “Estimation of Standard Errors of Weighted Means When Plausible Values Are Not Present, Using the Taylor Series Method,” where u indexes the PSUs in the stratum, of which there are ns, and ws is the stratum weight, or the sum, in that stratum, of all the unit’s full sample weights. Using the cs values, the degrees of freedom is which uses the formula of Satterthwaite (1946), assuming one degree of freedom per stratum.

Composite Score Variances

In the case of composite scores the variance becomes more complex and only one method is supported, Taylor series.

The log-likelihood of composite scores is additively separable, the covariances (including the variances) can be calculated in two steps using Eq. . First, the covariance matrix of ξ is formed, and then the composite covariance terms are estimated as the variance of a linear combination of the elements of ξ.

In the first step, any of the methods in the section “Variance Estimation” are applied to Eq. , treating ξ in the same fashion Eq.  treats β. This step results in a block diagonal inverse Hessian matrix, with a block for each subscale, and a potentially dense matrix for V. Each matrix is square and has S ⋅ (ζ + 1) rows and columns, where ζ is the number of elements in the regression formula (each subscale), to which one is added for the σ terms.

This step results in the following matrix:

For the second step, the composite coefficient then has an ith variance term of where ξci is the composite coefficient for the ith coefficient, and ei is the vector of weights arranged such that The covariance between two terms, i and j, is a simple extension which uses the definition,

A simple example may help clarify. Imagine a composite score composed of two subscales, 1 and 2, with weights ω1 = 0.4 and ω2 = 0.6. Supposed a user is interested in a regression of the form Then the regression in Eq.  would be fit once for subscale 1 and once for subscale 2; the first fit would yield estimated values {a1, b1, σ1}, and the second fit would yield {a2, b2, σ2}. The estimated value, for example, ac, would be ac = 0.4 ⋅ α1 + 0.6 ⋅ α2. By stacking the estimates together, the covariance matrix can then be estimated and will result in a matrix $\bm{\Omega} \equiv {\rm Var}(\bm{\beta})$ from Eq.  that has six rows and six columns. Using the vector it can easily be confirmed that ac = e1Tξ, so ${\rm Var}(a_c)= \bm{e}_1^T \bm{\Omega} \bm{e}_1$.

Composite degrees of freedom

References

Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292.

Black, P. E. (2019). Big-O notation. In P. E. Black (Ed.), Dictionary of algorithms and data structures. Washington, DC: National Institute of Standards and Technology. Retrieved from https://www.nist.gov/dads/HTML/bigOnotation.html

Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across nominal subgroups. Journal of the American Statistical Association, 94(448), 1035–1044.

Green, W. H. (2003). Econometric analysis Upper Saddle River, NJ: Prentice Hall.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium of Mathematical Statistics and Probability, Vol. I: Statistics (pp. 221–233). Berkeley, CA: University of California Press.

Johnson, S. G. (2010). Notes on the convergence of trapezoidal-rule quadrature. Retrieved from https://math.mit.edu/~stevenj/trapezoidal.pdf

McCullagh, P. & Nelder, J. A. (1989). Generalized linear models. (2nd ed.). London, UK: Chapman & Hall/CRC.

NAEP. (2008). The generalized partial credit model [NAEP Technical Documentation Website]. Retrieved from https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_models_gen.aspx.

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.

Appendix. Test Probability Density Functions

For all cases scored as either correct or incorrect, we use the three parameter logit (3PL) model: where gj is the guessing parameter, aj is the discrimination factor, dj is the item difficulty, and D is a constant, usually set to 1.7, to map the θi and dj terms to a probit-like space; this term is applied by tradition.

When a two parameter logit (2PL) is used, Eq.  is modified to omit gj (effectively setting it to zero):

When a Rasch model is used, Eq.  is further modified to set all aj to a single a, and D is set to one.

The Graded Response Model (GRM) has a probability density that generalizes an ordered logit (McCullagh & Nelder, 1989): Here the parameters Pj are the cut points dcj, where d0j = −∞ and dC + 1, j = ∞. In the first term on the right side of Eq. , the subscript Rij on dRij, j indicates it is the cut point associated with the response level to item j for person i, whereas the last subscript (j) indicates that it is the d term for item j. In the second term, the cut point above that cut point is used.

The Generalized Partial Credit Model (GPCM) has a probability density that generalizes a multinomial logit (McCullagh & Nelder, 1989) where c indexes cut points, of which there are C, and j indexes the item.

The GPCM equation has an indeterminancy because all dj terms could increase and make the values of the probability the same. We can solve the indeterminacy in several ways.

NAEP (2008) uses a mean difficulty (bj), and the dj values are then given by where the δjc values are estimated so that $0=\sum_{c=1}^{C} \delta_{jc}$. In this package, when the polyParamTab has an itemLocation, it serves as b. When there is no itemLocation, the package uses the δ values directly

When a Partial Credit Model (PCM) is used, and the value of D is set to one, whereas aj is again shared across all items. So