Introduction to Analyzing NCES Data Using EdSurvey

Overview of the EdSurvey Package

The EdSurvey package is designed to help users analyze data from the National Center for Education Statistics (NCES), including the National Assessment of Educational Progress (NAEP) datasets. Due to the scope and complexity of these datasets, special statistical methods are required for analysis. EdSurvey provides functions to perform analyses that account for both complex sample survey designs and the use of plausible values.
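
To make the idea of plausible values concrete, the following base R sketch shows the standard way estimates are combined across plausible values: compute the statistic once per plausible value, average the results, and use the spread among them as the imputation variance. The data are simulated and every name here (`pv`, `w`, `theta`) is hypothetical; this illustrates the general technique and is not EdSurvey's internal code.

```r
# Toy illustration only: simulated data, not NAEP, and not EdSurvey internals.
set.seed(42)
n <- 200                                  # hypothetical number of students
M <- 5                                    # plausible values per student
w <- runif(n, 0.5, 2)                     # hypothetical sampling weights
theta <- rnorm(n, mean = 275, sd = 35)    # unobserved latent proficiency

# Each plausible value is a noisy draw around the latent proficiency
pv <- sapply(seq_len(M), function(m) theta + rnorm(n, sd = 10))

# Step 1: compute the weighted mean separately for each plausible value
pv_means <- apply(pv, 2, function(x) weighted.mean(x, w))

# Step 2: the point estimate is the average of the per-PV estimates
estimate <- mean(pv_means)

# Step 3: the imputation variance reflects disagreement among plausible values
imputation_var <- (1 + 1 / M) * var(pv_means)

estimate
imputation_var
```

The imputation variance is only one component of the total variance; the sampling variance from the complex survey design is added to it.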

The EdSurvey package also seamlessly takes advantage of the LaF package to read in data only when it is required for an analysis. Users with computers that lack sufficient memory to load the entire NAEP datasets can still perform analyses without having to write special code to access only the relevant variables. This is all handled by the EdSurvey package behind the scenes, without requiring additional work by the user.

Brief demo

First, install EdSurvey and its helper package tidyEdSurvey, which supports tidyverse integration.

install.packages(c("EdSurvey", "tidyEdSurvey"))

This will also install several other packages, so the process may take a few minutes.

The user can then load the EdSurvey package.

library(EdSurvey)

NCES provides the NAEP Primer, which includes demo NAEP data and is installed automatically with EdSurvey as the NAEPprimer package. The following lines read in the primer data and display summary information about the anonymized NAEP data.

naep_primer <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
naep_primer
## edsurvey.data.frame for 2005 NAEP National - Primer (Mathematics; Grade
##   8) in USA
## Dimensions: 17606 rows and 303 columns.
## 
## There is 1 full sample weight in this edsurvey.data.frame:
##   'origwt' with 62 JK replicate weights (the default).
## 
## 
## There are 6 subject scale(s) or subscale(s) in this
##   edsurvey.data.frame:
## 'num_oper' subject scale or subscale with 5 plausible values.
## 
## 'measurement' subject scale or subscale with 5 plausible values.
## 
## 'geometry' subject scale or subscale with 5 plausible values.
## 
## 'data_anal_prob' subject scale or subscale with 5 plausible values.
## 
## 'algebra' subject scale or subscale with 5 plausible values.
## 
## 'composite' subject scale or subscale with 5 plausible values (the
##   default).
## 
## 
## Omitted Levels: 'Multiple', 'NA', and 'Omitted'
## 
## Default Conditions:
## tolower(rptsamp) == "reporting sample"
## Achievement Levels:
## Mathematics:
## Basic: 262.00
## Proficient: 299.00
## Advanced: 333.00
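
Because data are read only when needed, a user who wants an ordinary data frame can request just the variables of interest. The sketch below uses EdSurvey's getData function with its `varnames` argument; the object name `small_df` is hypothetical, and the comments about extra columns describe typical default behavior rather than a guarantee.

```r
library(EdSurvey)

# Same NAEP Primer file used in this demo
naep_primer <- readNAEP(path = system.file("extdata/data", "M36NT2PM.dat",
                                           package = "NAEPprimer"))

# Request only the variables needed for an analysis
small_df <- getData(data = naep_primer,
                    varnames = c("composite", "pared", "origwt"))

# The result is an ordinary data.frame. It typically has more columns than
# names requested: "composite" expands to its plausible values, and the
# jackknife replicate weights accompany "origwt".
class(small_df)
colnames(small_df)
```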

One of the subject scales is composite, which is a scaled score. To calculate weighted summary statistics for this score, use EdSurvey’s summary2 function. The summary statistics are weighted by origwt, which is the default weight:

summary2("composite", data=naep_primer)
## Estimates are weighted using the weight variable 'origwt'
##    Variable     N Weighted N   Min.  1st Qu.   Median     Mean  3rd Qu.    Max.
## 1 composite 16915   16932.46 126.11 251.9626 277.4784 275.8892 301.1827 404.184
##        SD NA's Zero weights
## 1 36.5713    0            0

The output shows that the weighted mean is 275.8892 and the standard deviation (SD) is 36.5713.

If a user is interested in parents’ education levels, the searchSDF function can find the appropriate variable. The searchSDF function searches the dataset for variables that match a given string, helping the user identify relevant variables for analysis.

searchSDF(string="parent", data=naep_primer)
##   variableName                                      Labels fileFormat
## 1        pared Parental education level (from 2 questions)    Student

The variable is pared. The edsurveyTable function shows the distribution of this variable and how it relates to test scores.

edsurveyTable(composite ~ pared, data=naep_primer)
## 
## Formula: composite ~ pared 
## 
## Plausible values: 5
## jrrIMax: 1
## Weight variable: 'origwt'
## Variance method: jackknife
## JK replicates: 62
## full data n: 17606
## n used: 16328
## 
## 
## Summary Table:
##                pared    N    WTD_N       PCT   SE(PCT)     MEAN  SE(MEAN)
##  Did not finish H.S. 1280 1414.508  8.453085 0.3770753 260.7158 1.3274437
##       Graduated H.S. 3091 3179.318 18.999564 0.4926714 265.1290 1.0170015
##   Some ed after H.S. 2905 2962.733 17.705257 0.3732471 279.0351 0.9194085
##    Graduated college 7265 7240.987 43.272050 0.8528701 287.9227 1.0704070
##         I Don't Know 1787 1936.089 11.570043 0.4133281 256.8176 1.3438868
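
The output above reports a jackknife variance method with 62 replicates. As a rough illustration of how a paired jackknife works, the base R sketch below builds 62 hypothetical pairs of sampling units, forms each replicate by zeroing the weights of one unit in a pair and doubling the other, and sums the squared deviations of the replicate estimates from the full-sample estimate. All data and names are simulated assumptions; this is a simplified sketch, not EdSurvey's implementation.

```r
# Toy illustration only: simulated data and hypothetical names.
set.seed(1)
n_pairs  <- 62    # one jackknife replicate per pair of sampling units
per_unit <- 10    # hypothetical students per sampling unit

pair <- rep(seq_len(n_pairs), each = 2 * per_unit)         # pair id per student
unit <- rep(rep(1:2, each = per_unit), times = n_pairs)    # unit 1 or 2 in pair

# Scores share a unit-level shift, so students in a unit are correlated
unit_effect <- rnorm(2 * n_pairs, sd = 8)
score <- 275 + unit_effect[(pair - 1) * 2 + unit] + rnorm(length(pair), sd = 30)
w <- runif(length(pair), 0.5, 2)                            # full-sample weights

full_est <- weighted.mean(score, w)

# Replicate r: drop unit 1 of pair r and double the weights of unit 2
rep_est <- vapply(seq_len(n_pairs), function(r) {
  w_r <- w
  w_r[pair == r & unit == 1] <- 0
  w_r[pair == r & unit == 2] <- 2 * w_r[pair == r & unit == 2]
  weighted.mean(score, w_r)
}, numeric(1))

# Jackknife variance: sum of squared deviations from the full-sample estimate
jk_var <- sum((rep_est - full_est)^2)
jk_se  <- sqrt(jk_var)
jk_se
```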

To simplify the analysis, the variable can be recoded into broader categories. Here, the categories are collapsed into “less than HS”, “HS”, and “any after HS”, and all remaining responses (“I Don't Know”, “Omitted”, and “Multiple”) are assigned to “unknown”. The user can then check that the variable was recoded as intended.

naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Did not finish H.S.", "less than HS", "unknown")
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Graduated H.S.", "HS", naep_primer$pared_recode)
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Some ed after H.S.", "any after HS", naep_primer$pared_recode)
naep_primer$pared_recode <- ifelse(naep_primer$pared %in% "Graduated college", "any after HS", naep_primer$pared_recode)
# the tidyEdSurvey package allows this call to with() to work
require(tidyEdSurvey)
## Loading required package: tidyEdSurvey
## tidyEdSurvey v0.1.3
## A package for using 'dplyr' and 'ggplot2' with student level data in an edsurvey.data.frame. To work with teacher or school level data, see ?EdSurvey::getData
## 
## Attaching package: 'tidyEdSurvey'
## The following object is masked from 'package:base':
## 
##     attach
with(naep_primer, table(pared_recode, pared))
##               pared
## pared_recode   Did not finish H.S. Graduated H.S. Some ed after H.S.
##   HS                             0           3091                  0
##   any after HS                   0              0               2905
##   less than HS                1280              0                  0
##   unknown                        0              0                  0
##               pared
## pared_recode   Graduated college I Don't Know Omitted Multiple
##   HS                           0            0       0        0
##   any after HS              7265            0       0        0
##   less than HS                 0            0       0        0
##   unknown                      0         1787     577       10
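
The same collapse can also be written with a named lookup vector, which keeps the mapping in one place. The sketch below runs on a small hypothetical character vector rather than the NAEP data, so it can be tried without EdSurvey; the names `lookup` and `pared_toy` are illustrative assumptions.

```r
# Hypothetical stand-in for the pared variable, one value per response category
pared_toy <- c("Did not finish H.S.", "Graduated H.S.", "Some ed after H.S.",
               "Graduated college", "I Don't Know", "Omitted", "Multiple")

# Mapping from original categories to collapsed categories
lookup <- c("Did not finish H.S." = "less than HS",
            "Graduated H.S."      = "HS",
            "Some ed after H.S."  = "any after HS",
            "Graduated college"   = "any after HS")

# Categories missing from the lookup come back as NA and become "unknown"
pared_recode_toy <- unname(lookup[pared_toy])
pared_recode_toy[is.na(pared_recode_toy)] <- "unknown"

# Cross-tabulate to confirm each original level landed where intended
table(pared_recode_toy, pared_toy)
```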

Once recoded, the new variable can be used in a regression. The lm.sdf function fits a linear model, using weights and variance estimates appropriate for the data.

lm1 <- lm.sdf(composite ~ pared_recode, data=naep_primer)
summary(lm1)
## 
## Formula: composite ~ pared_recode
## 
## Weight variable: 'origwt'
## Variance method: jackknife
## JK replicates: 62
## Plausible values: 5
## jrrIMax: 1
## full data n: 17606
## n used: 16915
## 
## Coefficients:
##                              coef       se        t    dof  Pr(>|t|)    
## (Intercept)              265.1290   1.0170 260.6967 25.227 < 2.2e-16 ***
## pared_recodeany after HS  20.2131   1.1150  18.1278 69.674 < 2.2e-16 ***
## pared_recodeless than HS  -4.4132   1.5935  -2.7695 62.168  0.007393 ** 
## pared_recodeunknown       -8.3422   1.5871  -5.2564 41.282 4.816e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Multiple R-squared: 0.1053

The summary output provides estimates of the regression coefficients, standard errors, t-values, degrees of freedom, and p-values, allowing users to assess the relationship between composite scores and parental education levels.
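
Because pared_recode is categorical, the intercept is the estimated mean for the reference category (“HS”), and each coefficient is a difference from that category. The short sketch below, which only re-uses the numbers printed in the regression output, converts the coefficients back into group means; the vector names are hypothetical.

```r
# Coefficients copied from the lm.sdf summary output
coefs <- c(intercept    = 265.1290,
           any_after_HS = 20.2131,
           less_than_HS = -4.4132,
           unknown      = -8.3422)

# Mean for each group = intercept (the "HS" reference group) + its coefficient
group_means <- c(
  "HS"           = coefs[["intercept"]],
  "any after HS" = coefs[["intercept"]] + coefs[["any_after_HS"]],
  "less than HS" = coefs[["intercept"]] + coefs[["less_than_HS"]],
  "unknown"      = coefs[["intercept"]] + coefs[["unknown"]]
)
round(group_means, 4)
```

The “less than HS” value reproduces the mean of 260.7158 that edsurveyTable reported for “Did not finish H.S.”, which is a useful check that the recode and the model agree.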

Following the regression analysis, a Wald test can be used to determine whether the entire set of coefficients associated with pared_recode is statistically significant.

waldTest(lm1, "pared_recode")
## Wald test:
## ----------
## H0:
## pared_recodeany after HS = 0
## pared_recodeless than HS = 0
## pared_recodeunknown = 0
## 
## Chi-square test:
## X2 = 603.4, df = 3, P(> X2) = 0.0
## 
## F test:
## W = 194.7, df1 = 3, df2 = 60, P(> W) = 0

Two versions of the Wald test are shown here; the user can decide which is applicable to their situation. Generally, the F-test is considered valid, while the chi-square test is applicable under more restrictive conditions. The p-values for both tests are nearly zero and so are displayed as zero.
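
To see how small these p-values are before rounding, the upper-tail probabilities can be recomputed in base R from the statistics and degrees of freedom printed in the Wald test output. This sketch only re-uses those printed (rounded) numbers, so the results are approximate; the variable names are hypothetical.

```r
# Test statistics and degrees of freedom copied from the waldTest output
chisq_stat <- 603.4
chisq_df   <- 3
f_stat     <- 194.7
f_df1      <- 3
f_df2      <- 60

# Upper-tail probabilities under each reference distribution
p_chisq <- pchisq(chisq_stat, df = chisq_df, lower.tail = FALSE)
p_f     <- pf(f_stat, df1 = f_df1, df2 = f_df2, lower.tail = FALSE)

# Scientific notation shows the values that were displayed as zero
format(c(chi_square = p_chisq, F = p_f), scientific = TRUE)
```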

International data

EdSurvey also supports analysis of international datasets, including those from the International Association for the Evaluation of Educational Achievement (IEA) and the Organisation for Economic Co-operation and Development (OECD). This includes studies such as the Trends in International Mathematics and Science Study (TIMSS) and the Programme for International Student Assessment (PISA). The following example uses TIMSS to examine the association between parents’ highest education level and math test scores in North America:

downloadTIMSS("~/EdSurveyData/", years=2015)
timss_NA15 <- readTIMSS("~/EdSurveyData/TIMSS/2015/", countries=c("usa", "can"), grade=8)
searchSDF(c("parent", "education"), data=timss_NA15)
edsurveyTable(data=timss_NA15, mmat ~ bsdgedup)

Now, the same analysis using PISA data:

downloadPISA("~/EdSurveyData/", years=2015)
pisa_NA15 <- readPISA("~/EdSurveyData/PISA/2015/", countries=c("usa", "can", "mex"))
searchSDF(c("parent", "education"), data=pisa_NA15)
edsurveyTable(data=pisa_NA15, math ~ hisced)

EdSurvey offers many other functions, including mixed models (mixed.sdf), gap analysis (gap), correlation analysis (cor.sdf), achievement level analysis (achievementLevels), direct estimation (mml.sdf), percentiles (percentile), logit/probit analysis (logit.sdf/probit.sdf), and quantile regression (rq.sdf).
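
As a brief sketch of two of these functions, the example below returns to the NAEP Primer data and computes selected percentiles of the composite score and the gap in mean scores between two groups. It assumes the primer's dsex variable with levels “Male” and “Female”; the object names `pct` and `sex_gap` are hypothetical, and output is omitted here.

```r
library(EdSurvey)

# NAEP Primer data, read as in the demo above
naep_primer <- readNAEP(path = system.file("extdata/data", "M36NT2PM.dat",
                                           package = "NAEPprimer"))

# 10th, 50th, and 90th percentiles of the composite score
pct <- percentile(variable = "composite", percentiles = c(10, 50, 90),
                  data = naep_primer)
pct

# Gap in mean composite score between two groups
# (assumes dsex has the levels "Male" and "Female")
sex_gap <- gap(variable = "composite", data = naep_primer,
               groupA = dsex == "Male", groupB = dsex == "Female")
sex_gap
```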

Book

For further information about installing, using, and understanding the statistical methodology in EdSurvey, please see Analyzing NCES Data Using EdSurvey: A User’s Guide.

Publications

Bailey, P., Lee, M., Nguyen, T., & Zhang, T. (2020). Using EdSurvey to Analyse PIAAC Data. In Maehler, D., & Rammstedt, B. (Eds.), Large-Scale Cognitive Assessment (pp. 209-237). Springer, Cham. https://link.springer.com/content/pdf/10.1007/978-3-030-47515-4_9.pdf