This document describes weighted
Marginal Maximum Likelihood (MML) estimation for student test data in
the Dire
package. The model treats abilities as a latent
variable and estimates a linear model in the latent space. In this
framework, failing to account for the measurement variance will bias the
regression estimates. The Dire
package can directly
estimate these models in the latent space. Aditionally, it can generate
palausible values (PVs) which allow for unbiased estimation using the
same formulas as multiple imputation (Mislevy et al.).
For simplicity, focusing first on a univariate model where there is only one latent ability measured, the student test data are assumed to have been generated by an Item Response Theory (IRT) model, where student i has ability θi. The probability of a student getting an item correct—for a dichotomous item—increases in ability but decreases in item difficulty. Put together, the likelihood of student i’s responss to items (Ri) is a function of the student’s latent ability in the construct (θi), the item parameters of each item(P) where the probability function in the product is showin in the appendix and depends on the IRT model used for the item. it is possible that a student will not have been shown an item and then Rij is missing. The missing response does not change the probability. This can be thought of as ${\rm Pr}(R_{ih}| \theta_{i}, \bm{P}_h)$ being 1 for all values of θi and Ph.
This latent trait is then modeled as a function of a row vector of explanatory variables Xi so that. Where
This creates two sets of likelihoods for person i, one associated with the
covariates (Xi) and
another associated with the observed item responses (Ri). The
likelihood for the student is the product of these two. where f(θi)
is the density of student is
latent ability θi. Substituting
the formula for above, The term in the integral is the convolusion of
two functions and there is no closed form solution. Dire
follows the AM statistical software (Cohen & Jiang 1999) and uses
fixed quadrature points set by the user.
In addition, Dire allows for the construct being scored to not be a simple univariate construct but to incorporate several correlated subscales or constructs. If these are labeled from one to J then each student has a vector θi with components θij. This leads to J latent regression equations and requires a J-demensional integral. where Rijh, Phj, and βj all have a subscript j added because there is now a set of them for each subscale j, ϵi is now a vector with elements ϵij, and Σ is the covariance matrix for the ϵ vector, which allows for covariance between constructs at the student level in that when Σjj′ is positive than a student with a higher score on construct j, conditional on Xi will also be expected to have a higher score on construct j′.
This problem suffers from the curse of dimensionality. However, as is pointed out by Cohen and Jiang (1999) this is identical to seemingly unrelated regression (SUR) and can be solved by fitting βj and σj once for each subscale and then identifying each of the ${ J \choose 2}$ covariance terms in a pairwise fashion.
This comes from the fact that the multivariate normal can be decomposed into two multivariate normals, one conditional on the other. First re-writing, eq with the normal rewritten as f(ϵi) the multivariate normal can be decomposed into the product of two distributions, an unconditional (univariate-)normal and a condional (potentially multivariate when J > 2) normal. Using the first integration dimension, without loss of generality, this becomes where f1(ϵi1) is the normal density function for a single variable and f−1(ϵi, −1|ϵi1) is the density function of the other variables, conditional on ϵi1. This can then be rearanged to Changing to the numerical quadrature representation
this is additively seperable—the max with respect to β1 can be found only with data for the first construct. The other constructs can be relabeled and each can be maximized in this way. Similarly, the variances can be independently found in this way.
The Dire
package helps analyists estimate coefficients
in two potentially useful ways. First the parameters in the regression
can be directly estimated, where the β values are used from the above
model fit. Second, Dire
can generate plausible values from
the fitted likelihood surface.
The direct estimation method has the advantage of using the latent space to identify coefficients. However, it can be time consuming to fit a latent parameter model.
The plausible value approach has two advantages. First, it can be used to fit a saturated model (with all possible coefficients of interest) and then other models that use subsets or linear combinations of those coefficients can be fit to the plausible values. Second, doing this saves time relative to fitting a direct estimation model per regression specification.
This document first describes how parameters are estimated in the latest space. The third section describes the variance estimator. The subsequent section describes estimation of degrees of freedom. Finally, it describes plausible value generation. Each section starts by describing the univariate case and then describes the case with J potentially correlated constructs or subscales. Additionally, while the introduction has not mentioned weights, they are incorporated in the subsequent sections.
Starting with the single construct MML model for test data for n individuals, conditional on a set of parameters for a set of H test items, the likelihood of a regression equation is where ℒ is the likelihood of the regression parameters β with full sample weights wi conditional on item score matrix R, student covariate matrix X, and item parameter data P; σ2 is the variance of the regression residual; θi is the ith student’s latent ability measure that is being integrated out; ${\rm Pr}(\bm{R}_{ih}|\theta_i, \bm{P}_h)$ is the probability of individual i’s score on test item h, conditional on the student’s ability and item parameters Ph—see the appendix for example forms of ${\rm Pr}(\bm{R}_{ih}|\theta_i, \bm{P}_h)$.
The integral is evaluated using the trapezoid rule at quadrature points tq and quadrature weights δ so that where δ is the distance between any two uniformly spaced quadrature points so that δ = tq + 1 − tq for any q that is at least one and less than Q. The range and value of Q parameterize the quadrature, and its accuracy and should be varied to ensure convergence. The advantage of the trapezoidal rule is that the fixed quadrature points allow the values of the probability (the portion inside the product) to be calculated once per student.
The log-likelihood is given by Note that δ can be removed for optimization, and its presence adds ${\rm log}(\delta) \sum \bm{w}_i$ to the log-likelihood.
When the outcome of interest is composite scores, the parameters are estimated by separately estimating the coefficients for each subscale (βj for subscale j) and then calculating the composite scores (βc) using subscale weights (ωj).
The full covariance matrix for the residuals (ϵ vector) is then
The likelihood is then where |Σ(jj′)| is the determinant of Σ(jj′), and the residual term is defined as Notice that the parameters βj, βj′, σj2, and σj′2 are taken from the subscale estimation, so the only parameter not fixed is the covariance term σij.
Starting with the univariate case, estimating variance of the parameters β can be done in one of several ways.
The inverse Hessian matrix is a consistent estimator when the
estimator of β is
consistent (Green, 2003, p. 520): This variance is returned when the
variance method is set to consistent
or left as the
default.
A class of variance estimators typically called “sandwich” or “robust” variance estimators allow for variation in the residual and are of the form where V is an estimate of the variance of the summed score function (Binder, 1983).
For a convenience sample, we provide two robust estimators. First,
the so-called robust
(Huber or Huber-White) variance
estimator uses
Second, for the cluster robust
case, the partial
derivatives are summed within the cluster so that where there are n′ clusters, indexed by
c, and the partial derivatives
are summed within the group of which there are nc members:
Finally, Dire
impelments the survey sampling method
called the Taylor series
method and uses the same formula
as Eq. , but V is the
estimate of the variance of the score vector (Binder, 1983). Our
implementation assumes a two-stage design with na primary
sampling units (PSUs) in stratum a and summed across the A strata according to where Va is a
variance estimate for stratum a and is defined by where sp is the sum of
the weighted (or pseudo-) score vector that includes all units in PSU
p in stratum a and $\bar{\bm{s}}_a$ is the (unweighted) mean of
the sp terms
in stratum a so that
When a stratum has only one PSU, Va is
undefined. The best approach is for the analyst to adjust the strata and
PSU identifiers, in a manner consistent with the sampling approach, to
avoid singleton strata. Two simpler, automated, but less defensible
options are available in Dire
. First, the strata with
single PSUs can be dropped from the variance estimation, yielding an
underestimate of the variance.
The second option is for the singleton stratum to use the overall mean of sp in place of s̄a. So, where the sum is across all PSUs, and n′ is the number of PSUs across all strata. Then, for each singleton stratum, Eq. becomes where the value 2 is used in place of $\frac{n_a}{n_a-1}$, which is undefined when na = 1. This option can underestimate the variance but is thought to more likely overestimate it.
While not advisable, it is possible to assume information equality
where the hessian (eq. ) and the score vector based covariance (eq. )
are equal. When a user sets gradientHessian=TRUE
the
Dire
package uses this short hand in calculating any of the
above results. Using this option is necessary to get agreement with the
AM software on variance terms.
For the Taylor series estimator, the degrees of freedom estimator also uses the Welch-Satterthwaite (WS) degrees of freedom estimate (dofWS). The WS weights require an estimate of the degrees of variance per independent group. For a clustered sample, that is available at the stratum.
Following Binder (1983) and Cohen (2002), the contribution cs to the degrees of freedom from stratum s is defined as zuj from the section, “Estimation of Standard Errors of Weighted Means When Plausible Values Are Not Present, Using the Taylor Series Method,” where u indexes the PSUs in the stratum, of which there are ns, and ws is the stratum weight, or the sum, in that stratum, of all the unit’s full sample weights. Using the cs values, the degrees of freedom is which uses the formula of Satterthwaite (1946), assuming one degree of freedom per stratum.
In the case of composite scores the variance becomes more complex and only one method is supported, Taylor series.
The log-likelihood of composite scores is additively separable, the covariances (including the variances) can be calculated in two steps using Eq. . First, the covariance matrix of ξ is formed, and then the composite covariance terms are estimated as the variance of a linear combination of the elements of ξ.
In the first step, any of the methods in the section “Variance Estimation” are applied to Eq. , treating ξ in the same fashion Eq. treats β. This step results in a block diagonal inverse Hessian matrix, with a block for each subscale, and a potentially dense matrix for V. Each matrix is square and has S ⋅ (ζ + 1) rows and columns, where ζ is the number of elements in the regression formula (each subscale), to which one is added for the σ terms.
This step results in the following matrix:
For the second step, the composite coefficient then has an ith variance term of where ξci is the composite coefficient for the ith coefficient, and ei is the vector of weights arranged such that The covariance between two terms, i and j, is a simple extension which uses the definition,
A simple example may help clarify. Imagine a composite score composed of two subscales, 1 and 2, with weights ω1 = 0.4 and ω2 = 0.6. Supposed a user is interested in a regression of the form Then the regression in Eq. would be fit once for subscale 1 and once for subscale 2; the first fit would yield estimated values {a1, b1, σ1}, and the second fit would yield {a2, b2, σ2}. The estimated value, for example, ac, would be ac = 0.4 ⋅ α1 + 0.6 ⋅ α2. By stacking the estimates together, the covariance matrix can then be estimated and will result in a matrix $\bm{\Omega} \equiv {\rm Var}(\bm{\beta})$ from Eq. that has six rows and six columns. Using the vector it can easily be confirmed that ac = e1Tξ, so ${\rm Var}(a_c)= \bm{e}_1^T \bm{\Omega} \bm{e}_1$.
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292.
Black, P. E. (2019). Big-O notation. In P. E. Black (Ed.), Dictionary of algorithms and data structures. Washington, DC: National Institute of Standards and Technology. Retrieved from https://www.nist.gov/dads/HTML/bigOnotation.html
Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across nominal subgroups. Journal of the American Statistical Association, 94(448), 1035–1044.
Green, W. H. (2003). Econometric analysis Upper Saddle River, NJ: Prentice Hall.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium of Mathematical Statistics and Probability, Vol. I: Statistics (pp. 221–233). Berkeley, CA: University of California Press.
Johnson, S. G. (2010). Notes on the convergence of trapezoidal-rule quadrature. Retrieved from https://math.mit.edu/~stevenj/trapezoidal.pdf
McCullagh, P. & Nelder, J. A. (1989). Generalized linear models. (2nd ed.). London, UK: Chapman & Hall/CRC.
NAEP. (2008). The generalized partial credit model [NAEP Technical Documentation Website]. Retrieved from https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_models_gen.aspx.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
For all cases scored as either correct or incorrect, we use the three parameter logit (3PL) model: where gj is the guessing parameter, aj is the discrimination factor, dj is the item difficulty, and D is a constant, usually set to 1.7, to map the θi and dj terms to a probit-like space; this term is applied by tradition.
When a two parameter logit (2PL) is used, Eq. is modified to omit gj (effectively setting it to zero):
When a Rasch model is used, Eq. is further modified to set all aj to a single a, and D is set to one.
The Graded Response Model (GRM) has a probability density that generalizes an ordered logit (McCullagh & Nelder, 1989): Here the parameters Pj are the cut points dcj, where d0j = −∞ and dC + 1, j = ∞. In the first term on the right side of Eq. , the subscript Rij on dRij, j indicates it is the cut point associated with the response level to item j for person i, whereas the last subscript (j) indicates that it is the d term for item j. In the second term, the cut point above that cut point is used.
The Generalized Partial Credit Model (GPCM) has a probability density that generalizes a multinomial logit (McCullagh & Nelder, 1989) where c indexes cut points, of which there are C, and j indexes the item.
The GPCM equation has an indeterminancy because all dj terms could increase and make the values of the probability the same. We can solve the indeterminacy in several ways.
NAEP (2008) uses a mean difficulty (bj), and the
dj values
are then given by where the δjc
values are estimated so that $0=\sum_{c=1}^{C}
\delta_{jc}$. In this package, when the polyParamTab
has an itemLocation
, it serves as b
. When
there is no itemLocation
, the package uses the δ values directly
When a Partial Credit Model (PCM) is used, and the value of D is set to one, whereas aj is again shared across all items. So