Title: | Analysis of NCES Education Survey and Assessment Data |
---|---|
Description: | Read in and analyze functions for education survey and assessment data from the National Center for Education Statistics (NCES) <https://nces.ed.gov/>, including National Assessment of Educational Progress (NAEP) data <https://nces.ed.gov/nationsreportcard/> and data from the International Assessment Database: Organisation for Economic Co-operation and Development (OECD) <https://www.oecd.org/>, including Programme for International Student Assessment (PISA), Teaching and Learning International Survey (TALIS), Programme for the International Assessment of Adult Competencies (PIAAC), and International Association for the Evaluation of Educational Achievement (IEA) <https://www.iea.nl/>, including Trends in International Mathematics and Science Study (TIMSS), TIMSS Advanced, Progress in International Reading Literacy Study (PIRLS), International Civic and Citizenship Study (ICCS), International Computer and Information Literacy Study (ICILS), and Civic Education Study (CivEd). |
Authors: | Paul Bailey [aut, cre] , Ahmad Emad [aut], Huade Huo [aut] , Michael Lee [aut] , Yuqi Liao [aut] , Alex Lishinski [aut] , Trang Nguyen [aut] , Qingshu Xie [aut], Jiao Yu [aut], Ting Zhang [aut] , Eric Buehler [aut] , Sun-joo Lee [aut], Blue Webb [aut] , Thomas Fink [aut] , Sinan Yavuz [aut] , Emmanuel Sikali [pdr], Claire Kelley [ctb], Jeppe Bundsgaard [ctb], Ren C'deBaca [ctb], Anders Astrup Christensen [ctb] |
Maintainer: | Paul Bailey <[email protected]> |
License: | GPL-2 |
Version: | 4.0.8 |
Built: | 2024-10-28 23:23:49 UTC |
Source: | https://github.com/american-institutes-for-research/edsurvey |
The EdSurvey package uses appropriate methods for analyzing NCES datasets with a small memory footprint. Existing system control files, included with the data, are used to read in and format the data for further processing.
To get started using EdSurvey, see the vignettes for tutorials and the statistical methodologies. Use vignette("introduction", package="EdSurvey") to see the vignettes.
The package provides functions called readNAEP, readCivEDICCS, readICILS, readPIAAC, readPIRLS, read_ePIRLS, readPISA, readTALIS, readTIMSS, readTIMSSAdv, and readECLS_K2011 to read in NCES datasets.
The functions achievementLevels, cor.sdf, edsurveyTable, summary2, lm.sdf, logit.sdf, mixed.sdf, rq.sdf, percentile, and gap can then be used to analyze data.
For advanced users, getData extracts the data of interest as a data frame for further processing.
Maintainer: Paul Bailey [email protected] (ORCID)
Authors:
Ahmad Emad
Huade Huo (ORCID)
Michael Lee (ORCID)
Yuqi Liao (ORCID)
Alex Lishinski (ORCID)
Trang Nguyen (ORCID)
Qingshu Xie
Jiao Yu
Ting Zhang (ORCID)
Eric Buehler (ORCID)
Sun-joo Lee
Blue Webb (ORCID)
Thomas Fink (ORCID)
Sinan Yavuz (ORCID)
Other contributors:
Emmanuel Sikali [email protected] [project director]
Claire Kelley [contributor]
Jeppe Bundsgaard [contributor]
Ren C'deBaca [contributor]
Anders Astrup Christensen [contributor]
Returns achievement levels using weights and variance estimates appropriate for the edsurvey.data.frame.
achievementLevels(
  achievementVars = NULL,
  aggregateBy = NULL,
  data,
  cutpoints = NULL,
  returnDiscrete = TRUE,
  returnCumulative = FALSE,
  weightVar = NULL,
  jrrIMax = 1,
  dropOmittedLevels = TRUE,
  defaultConditions = TRUE,
  recode = NULL,
  returnNumberOfPSU = FALSE,
  returnVarEstInputs = FALSE,
  omittedLevels = deprecated()
)
achievementVars |
character vector indicating variables to be included in the achievement
levels table, potentially with a subject scale or subscale. When the subject
scale or subscale is omitted, the default subject scale or subscale is
used. You can find the default composite scale and all subscales using the
function |
aggregateBy |
character vector specifying variables by which to aggregate achievement levels. The percentage
column sums up to 100 for all levels of all variables specified here. When set to the
default of |
data |
an |
cutpoints |
numeric vector indicating cutpoints. Set to standard NAEP cutpoints for Basic, Proficient, and Advanced by default. |
returnDiscrete |
logical indicating if discrete achievement levels should be returned. Defaults to TRUE. |
returnCumulative |
logical indicating if cumulative achievement levels should be returned. Defaults to FALSE. |
weightVar |
character string indicating the weight variable to use.
Only the name of the
weight variable needs to be included here, and any
replicate weights will be automatically included.
When this argument is |
jrrIMax |
a numeric value. When using the jackknife variance estimation method, the default estimation option, |
dropOmittedLevels |
a logical value. When set to the default value ( |
defaultConditions |
a logical value. When set to the default value of |
recode |
a list of lists to recode variables. Defaults to NULL. |
returnNumberOfPSU |
a logical value set to |
returnVarEstInputs |
a logical value set to |
omittedLevels |
this argument is deprecated. Use dropOmittedLevels. |
The achievementLevels function applies appropriate weights and the variance estimation method for each edsurvey.data.frame, with several arguments for customizing the aggregation and output of the analysis results. Namely, by using these optional arguments, users can choose to generate the percentage of students performing at each achievement level (discrete), generate the percentage of students performing at or above each achievement level (cumulative), calculate the percentage distribution of students by achievement level (discrete or cumulative) and selected characteristics (specified in aggregateBy), and compute the percentage distribution of students by selected characteristics within a specific achievement level.
The details of the methods are shown in the vignette titled Statistical Methods Used in EdSurvey in “Estimation of Weighted Percentages When Plausible Values Are Present” and are used to calculate all cumulative and discrete probabilities.
When the requested achievement levels are discrete (returnDiscrete = TRUE), the percentage is the percentage of students (within the categories specified in aggregateBy) whose scores lie in the range $[\mathrm{cutPoints}_i, \mathrm{cutPoints}_{i+1})$, $i = 0, 1, \ldots, n$. cutPoints is the vector of score thresholds provided by the user, with $\mathrm{cutPoints}_0$ taken to be 0. cutPoints are set to NAEP standard cutpoints for achievement levels by default.
To aggregate by a specific variable, for example, dsex, specify dsex in aggregateBy and all other variables in achievementVars. To aggregate by subscale, specify the name of the subscale (e.g., num_oper) in aggregateBy and all other variables in achievementVars.
When the requested achievement levels are cumulative (returnCumulative = TRUE), the percentage is the percentage of students (within the categories specified in aggregateBy) whose scores lie in the range $[\mathrm{cutPoints}_i, \infty)$, $i = 1, 2, \ldots, n-1$. The first and last categories are the same as defined for discrete levels.
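The discrete and cumulative definitions can be sketched in base R for a single score per student. This is only an illustration of the interval logic: the scores, weights, and cutpoint values below are hypothetical, and the sketch ignores plausible values and variance estimation, which achievementLevels handles.

```r
# Hypothetical inputs (not NAEP data): one score per student,
# a sampling weight per student, and named cutpoints.
scores    <- c(180, 215, 250, 262, 290, 305, 340)
weights   <- c(1.0, 2.0, 1.5, 1.0, 2.5, 1.0, 1.0)
cutpoints <- c(Basic = 214, Proficient = 249, Advanced = 282)

# Discrete levels: half-open intervals [lower, upper), with the lowest
# bound taken to be 0 and the top interval unbounded above.
breaks <- c(0, cutpoints, Inf)
labels <- c(paste("Below", names(cutpoints)[1]), names(cutpoints))
level  <- cut(scores, breaks = breaks, labels = labels, right = FALSE)

# Weighted percentage of students in each discrete level
discrete <- 100 * tapply(weights, level, sum) / sum(weights)

# Cumulative levels: weighted percentage scoring at or above each cutpoint
cumulative <- sapply(cutpoints, function(cp) {
  100 * sum(weights[scores >= cp]) / sum(weights)
})

print(discrete)
print(cumulative)
```

The discrete percentages sum to 100, while each cumulative percentage is the sum of the discrete percentages at that level and above.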
The method used to calculate the standard error of the percentages is described in the vignette titled
Statistical Methods Used in EdSurvey
in the sections “Estimation of the Standard Error of Weighted Percentages When Plausible Values Are Present, Using the Jackknife Method”
and “Estimation of the Standard Error of Weighted Percentages When Plausible Values Are Not Present, Using the Taylor Series Method.”
For “Estimation of the Standard Error of Weighted Percentages When Plausible Values Are Present, Using the Jackknife Method,”
the value of jrrIMax
sets the value of .
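As a rough sketch of how a jackknife sampling variance and an imputation variance can be combined for a weighted percentage, consider the following base R example. Everything in it is an assumption made for illustration: the simulated plausible values, the delete-a-group replicate weights, the multiplier gamma, and the reading of jrrIMax as the number of plausible values used for the jackknife term. The vignette remains the authoritative description of what EdSurvey computes.

```r
set.seed(42)
n <- 200  # hypothetical number of students
M <- 5    # hypothetical number of plausible values
J <- 10   # hypothetical number of jackknife replicate groups

# Simulated plausible values (n x M) and full-sample weights
pv <- matrix(rnorm(n * M, mean = 250, sd = 35), nrow = n, ncol = M)
w  <- runif(n, min = 0.5, max = 2)

# Delete-a-group replicate weights: drop one group, reweight the others
grp  <- rep_len(seq_len(J), n)
repW <- sapply(seq_len(J), function(j) ifelse(grp == j, 0, w * J / (J - 1)))

# Statistic of interest: weighted percentage at or above a cutpoint
pctAbove <- function(score, wt, cut = 282) {
  100 * sum(wt[score >= cut]) / sum(wt)
}

# Point estimate: average of the statistic over all plausible values
estByPV <- apply(pv, 2, pctAbove, wt = w)
est     <- mean(estByPV)

# Jackknife (sampling) variance, averaged over the first jrrIMax
# plausible values; gamma is an assumed multiplier for this design
jrrIMax <- 1
gamma   <- (J - 1) / J
Vjrr <- mean(sapply(seq_len(min(jrrIMax, M)), function(m) {
  repEst <- apply(repW, 2, function(rw) pctAbove(pv[, m], rw))
  gamma * sum((repEst - estByPV[m])^2)
}))

# Imputation variance with the (1 + 1/M) factor from Rubin (1987)
Vimp <- (1 + 1 / M) * var(estByPV)

se <- sqrt(Vjrr + Vimp)
print(c(estimate = est, se = se))
```

Raising jrrIMax in this sketch averages the jackknife term over more plausible values, at the cost of more computation.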
A list containing up to two data frames, one for discrete achievement levels (when returnDiscrete is TRUE) and one for cumulative achievement levels (when returnCumulative is TRUE). Each data.frame contains the following columns:
Level |
one row for each level of the specified achievement cutpoints |
Variables in achievementVars |
one column for each variable in |
Percent |
the percentage of students at or above each achievement level aggregated as specified by |
StandardError |
the standard error of the percentage, accounting for the survey sampling methodology. See the vignette titled Statistical Methods Used in EdSurvey. |
N |
the number of observations in the incoming data (the
number of rows when |
wtdN |
the weighted number of observations in the data |
nPSU |
the number of PSUs at or above each achievement level aggregated as specified by |
Huade Huo, Ahmad Emad, and Trang Nguyen
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

# discrete achievement levels
achievementLevels(achievementVars=c("composite"), aggregateBy=NULL, data=sdf)

# discrete achievement levels with a different subscale
achievementLevels(achievementVars=c("num_oper"), aggregateBy=NULL, data=sdf)

# cumulative achievement levels
achievementLevels(achievementVars=c("composite"), aggregateBy=NULL, data=sdf,
                  returnCumulative=TRUE)

# cumulative achievement levels with a different subscale
achievementLevels(achievementVars=c("num_oper"), aggregateBy=NULL, data=sdf,
                  returnCumulative=TRUE)

# achievement levels as independent variables, by sex aggregated by composite
achievementLevels(achievementVars=c("composite", "dsex"), aggregateBy="composite",
                  data=sdf, returnCumulative=TRUE)

# achievement levels as independent variables, by sex aggregated by sex
achievementLevels(achievementVars=c("composite", "dsex"), aggregateBy="dsex",
                  data=sdf, returnCumulative=TRUE)

# achievement levels as independent variables, by race aggregated by race
achievementLevels(achievementVars=c("composite", "sdracem"), aggregateBy="sdracem",
                  data=sdf, returnCumulative=TRUE)

# use customized cutpoints
achievementLevels(achievementVars=c("composite"), aggregateBy=NULL, data=sdf,
                  cutpoints = c("Customized Basic" = 200,
                                "Customized Proficient" = 300,
                                "Customized Advanced" = 400))

# use recode to change values for specified variables:
achievementLevels(achievementVars=c("composite", "dsex", "b017451"),
                  aggregateBy = "dsex", sdf,
                  recode=list(b017451=list(from=c("Never or hardly ever",
                                                  "Once every few weeks",
                                                  "About once a week"),
                                           to="Infrequently"),
                              b017451=list(from=c("2 or 3 times a week", "Every day"),
                                           to="Frequently")))

## End(Not run)
Function to coerce a light.edsurvey.data.frame to a data.frame.
## S3 method for class 'light.edsurvey.data.frame'
as.data.frame(x, ...)
x |
a |
... |
other arguments to be passed to |
a data.frame
Trang Nguyen
Implements cbind and rbind for the light.edsurvey.data.frame class. It takes a sequence of vector, matrix, data.frame, or light.edsurvey.data.frame arguments and combines them by columns or rows, respectively.
## S3 method for class 'light.edsurvey.data.frame'
cbind(..., deparse.level = 1)

## S3 method for class 'light.edsurvey.data.frame'
rbind(..., deparse.level = 1)
... |
one or more objects of class |
deparse.level |
integer determining under which circumstances column and row names are built from the actual arguments. See |
Because cbind and rbind are standard generic functions that do not use method dispatch, we set this function as generic, which means it overwrites base::cbind and base::rbind on loading. If none of the specified elements are of class light.edsurvey.data.frame, the function will revert to the standard base method. However, to be safe, you might want to explicitly use base::cbind when needed after loading the package.
The returned object will contain attributes only from the first light.edsurvey.data.frame object in the call to cbind.light.edsurvey.data.frame.
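The fallback behavior and the "attributes from the first object" rule can be illustrated with a small stand-in. This is a sketch of the general pattern, not the package's implementation: the function cbindDemo, the class light.demo, and the survey attribute are all hypothetical names introduced here.

```r
# Hypothetical wrapper: falls back to base::cbind unless at least one
# argument carries the (made-up) class "light.demo".
cbindDemo <- function(..., deparse.level = 1) {
  args    <- list(...)
  isLight <- vapply(args, inherits, logical(1), what = "light.demo")
  if (!any(isLight)) {
    # no special objects: behave exactly like the base function
    return(base::cbind(..., deparse.level = deparse.level))
  }
  # strip the special class so base::cbind treats inputs as data frames
  plain <- lapply(args, function(a) {
    if (inherits(a, "light.demo")) class(a) <- setdiff(class(a), "light.demo")
    a
  })
  out <- do.call(base::cbind, plain)
  # carry over the attribute from the FIRST special object only
  first <- args[[which(isLight)[1]]]
  attr(out, "survey") <- attr(first, "survey")
  class(out) <- c("light.demo", "data.frame")
  out
}

a <- structure(data.frame(x = 1:2), class = c("light.demo", "data.frame"),
               survey = "first survey")
b <- structure(data.frame(y = 3:4), class = c("light.demo", "data.frame"),
               survey = "second survey")

combined <- cbindDemo(a, b)
fallback <- cbindDemo(1:2, 3:4)
```

Here combined keeps the survey attribute of a and ignores that of b, while fallback is the ordinary matrix that base::cbind would return.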
a matrix-like object like matrix or data.frame. Returns a light.edsurvey.data.frame if there is at least one light.edsurvey.data.frame in the list of arguments.
Trang Nguyen, Michael Lee, and Paul Bailey
Diagnostic plots for regressions can become too dense to interpret. This function helps by adding a contour plot over the points to allow the density of points to be seen, even when an area is entirely covered in points.
contourPlot(
  x,
  y,
  m = 30L,
  xrange,
  yrange,
  xkernel,
  ykernel,
  nlevels = 9L,
  densityColors = heat.colors(nlevels),
  pointColors = "gray",
  ...
)
x |
numeric vector of the |
y |
numeric vector of the |
m |
integer value of the number of |
xrange |
numeric vector of length two indicating |
yrange |
numeric vector of length two indicating |
xkernel |
numeric indicating the standard deviation of Normal
|
ykernel |
numeric indicating the standard deviation of Normal
|
nlevels |
integer with the number of levels of the contour plot |
densityColors |
colors to use, specified as in |
pointColors |
color for the plot points |
... |
additional arguments to be passed to a plot call that generates the scatter plot and the contour plot |
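One way to build this kind of overlay in base R is to evaluate a Normal-kernel density on an m-by-m grid and draw its contours over the scatter plot. The sketch below shows that general idea only; the simulated data and the kernel standard deviations are assumptions, and it is not meant to reproduce contourPlot's internals.

```r
set.seed(1)
# Hypothetical data standing in for fitted values and residuals
x <- rnorm(500)
y <- 0.5 * x + rnorm(500)

m       <- 30L  # grid points per axis
xkernel <- 0.3  # assumed kernel standard deviation along x
ykernel <- 0.3  # assumed kernel standard deviation along y

gx <- seq(min(x), max(x), length.out = m)
gy <- seq(min(y), max(y), length.out = m)

# Kernel weight of every data point at every grid coordinate
Kx <- outer(gx, x, function(g, xi) dnorm(g, mean = xi, sd = xkernel))
Ky <- outer(gy, y, function(g, yi) dnorm(g, mean = yi, sd = ykernel))

# Density at grid node (i, j): average over points of the kernel product
dens <- (Kx %*% t(Ky)) / length(x)

# Scatter plot in gray with density contours drawn on top
if (!interactive()) grDevices::pdf(NULL)
plot(x, y, col = "gray", pch = 16, xlab = "x", ylab = "y")
contour(gx, gy, dens, nlevels = 9L, col = heat.colors(9L), add = TRUE)
if (!interactive()) invisible(grDevices::dev.off())
```

Because the contours come from the density rather than from the points themselves, they stay readable in regions where the points completely overlap.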
Yuqi Liao and Paul Bailey
## Not run: 
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
lm1 <- lm.sdf(formula=composite ~ pared * dsex + sdracem, data=sdf)
# plot the results
contourPlot(x=lm1$fitted.values,
            y=lm1$residuals[,1], # use only the first plausible value
            m=30,
            xlab="fitted values",
            ylab="residuals",
            main="Figure 1")
# add a line indicating where the residual is zero
abline(0,0)

## End(Not run)
Computes the correlation of two variables on an edsurvey.data.frame, a light.edsurvey.data.frame, or an edsurvey.data.frame.list. The correlation accounts for plausible values and the survey design.
cor.sdf(
  x,
  y,
  data,
  method = c("Pearson", "Spearman", "Polychoric", "Polyserial"),
  weightVar = "default",
  reorder = NULL,
  dropOmittedLevels = TRUE,
  defaultConditions = TRUE,
  recode = NULL,
  condenseLevels = TRUE,
  fisherZ = if (match.arg(method) %in% "Pearson") {
    TRUE
  } else {
    FALSE
  },
  jrrIMax = Inf,
  verbose = TRUE,
  omittedLevels = deprecated()
)
x |
a character variable name from the |
y |
a character variable name from the |
data |
an |
method |
a character string indicating which correlation coefficient (or covariance) is to be computed. One of "Pearson" (default), "Spearman", "Polychoric", or "Polyserial". |
weightVar |
character indicating the weight variable to use. See Details section in |
reorder |
a list of variables to reorder. Defaults to NULL. |
dropOmittedLevels |
a logical value. When set to the default value of |
defaultConditions |
a logical value. When set to the default value of |
recode |
a list of lists to recode variables. Defaults to NULL. |
condenseLevels |
a logical value. When set to the default value of
|
fisherZ |
for standard error and mean calculations, set to |
jrrIMax |
a numeric value; when using the jackknife variance estimation method, the default estimation option, |
verbose |
a logical value. Set to |
omittedLevels |
this argument is deprecated. Use dropOmittedLevels. |
The getData arguments and recode.sdf may be useful. (See Examples.)
The correlation methods are calculated as described in the documentation for the wCorr package; see browseVignettes(package="wCorr").
When method is set to polyserial, all x arguments are assumed to be continuous and all y arguments are assumed to be discrete. Therefore, be mindful of variable selection because a poor choice may result in calculations taking a very long time to complete.
The Fisher Z-transformation is both a variance stabilizing and normalizing transformation for the Pearson correlation coefficient (Fisher, 1915). The transformation takes the inverse hyperbolic tangent of the correlation coefficients and then calculates all variances and confidence intervals. These are then transformed back to the correlation space (values between -1 and 1, inclusive) using the hyperbolic tangent function. The Taylor series approximation (or delta method) is applied for the standard errors.
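The transformation and back-transformation can be sketched in a few lines of base R. The correlation and standard error below are hypothetical values, and the interval uses a Normal quantile purely for illustration; cor.sdf may use a different reference distribution.

```r
# Hypothetical inputs: a Pearson correlation and its standard error
# in the transformed (atanh) space
r   <- 0.60
Zse <- 0.05

# Fisher Z-transformation: inverse hyperbolic tangent
z <- atanh(r)

# 95% interval computed in the transformed space (Normal quantile assumed)
zCI <- z + c(-1, 1) * qnorm(0.975) * Zse

# Back-transform to the correlation space with the hyperbolic tangent
rCI <- tanh(zCI)

# Delta method: d tanh(z) / dz = 1 - tanh(z)^2 = 1 - r^2
se_r <- Zse * (1 - r^2)

print(rCI)
print(se_r)
```

The back-transformed interval always stays inside [-1, 1] and is asymmetric around r, which is the practical benefit of working in the transformed space.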
An edsurvey.cor that has print and summary methods. The class includes the following elements:
correlation |
numeric estimated correlation coefficient |
Zse |
standard error of the correlation ( |
correlates |
a vector of length two showing the columns for which the correlation coefficient was calculated |
variables |
|
order |
a list that shows the order of each variable |
method |
the type of correlation estimated |
Vjrr |
the jackknife component of the variance estimate. For Pearson, in the atanh space. |
Vimp |
the imputation component of the variance estimate. For Pearson, in the atanh space. |
weight |
the weight variable used |
npv |
the number of plausible values used |
njk |
the number of the jackknife replicates used |
n0 |
the original number of observations |
nUsed |
the number of observations used in the analysis—after any conditions and any listwise deletion of missings is applied |
se |
the standard error of the correlation, in the correlation ([-1,1]) space |
ZconfidenceInterval |
the confidence interval of the correlation in the transformation space |
confidenceInterval |
the confidence interval of the correlation in the correlation ([-1,1]) space |
transformation |
the name of the transformation used when calculating standard errors |
Paul Bailey; relies heavily on the wCorr package, written by Ahmad Emad and Paul Bailey
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4), 507–521.
cor and weightedCorr
## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# for two categorical variables any of the following work
c1_pears <- cor.sdf(x="b017451", y="b003501", data=sdf, method="Pearson",
                    weightVar="origwt")
c1_spear <- cor.sdf(x="b017451", y="b003501", data=sdf, method="Spearman",
                    weightVar="origwt")
c1_polyc <- cor.sdf(x="b017451", y="b003501", data=sdf, method="Polychoric",
                    weightVar="origwt")

c1_pears
c1_spear
c1_polyc

# for categorical variables, users can either keep the original numeric levels of the variables
# or condense the levels (default)
# the following call condenses the levels of the variable 'c046501'
cor.sdf(x="c046501", y="c044006", data=sdf)

# the following call keeps the original levels of the variable 'c046501'
cor.sdf(x="c046501", y="c044006", data=sdf, condenseLevels = FALSE)

# these take awhile to calculate for large datasets, so limit to a subset
sdf_dnf <- subset(sdf, b003601 == 1)

# for a categorical variable and a scale score any of the following work
c2_pears <- cor.sdf(x="composite", y="b017451", data=sdf_dnf, method="Pearson",
                    weightVar="origwt")
c2_spear <- cor.sdf(x="composite", y="b017451", data=sdf_dnf, method="Spearman",
                    weightVar="origwt")
c2_polys <- cor.sdf(x="composite", y="b017451", data=sdf_dnf, method="Polyserial",
                    weightVar="origwt")

c2_pears
c2_spear
c2_polys

# recode two variables
cor.sdf(x="c046501", y="c044006", data=sdf, method="Spearman", weightVar="origwt",
        recode=list(c046501=list(from="0%",to="None"),
                    c046501=list(from=c("1-5%", "6-10%", "11-25%", "26-50%",
                                        "51-75%", "76-90%", "Over 90%"),
                                 to="Between 0% and 100%"),
                    c044006=list(from=c("1-5%", "6-10%", "11-25%", "26-50%",
                                        "51-75%", "76-90%", "Over 90%"),
                                 to="Between 0% and 100%")))

# reorder two variables
cor.sdf(x="b017451", y="sdracem", data=sdf, method="Spearman", weightVar="origwt",
        reorder=list(sdracem=c("White", "Hispanic", "Black", "Asian/Pacific Island",
                               "Amer Ind/Alaska Natv", "Other"),
                     b017451=c("Every day", "2 or 3 times a week", "About once a week",
                               "Once every few weeks", "Never or hardly ever")))

# recode two variables and reorder
cor.sdf(x="pared", y="b013801", data=subset(sdf, !pared %in% "I Don\'t Know"),
        method="Spearman", weightVar = "origwt",
        recode=list(pared=list(from="Some ed after H.S.", to="Graduated H.S."),
                    pared=list(from="Graduated college", to="Graduated H.S."),
                    b013801=list(from="0-10", to="Less than 100"),
                    b013801=list(from="11-25", to="Less than 100"),
                    b013801=list(from="26-100", to="Less than 100")),
        reorder=list(b013801=c("Less than 100", ">100")))

## End(Not run)
Returns the dimensions of an edsurvey.data.frame or an edsurvey.data.frame.list.
## S3 method for class 'edsurvey.data.frame'
dim(x)
x |
an |
For an edsurvey.data.frame, returns a numeric vector of length two, with the first element being the number of rows and the second element being the number of columns.
For an edsurvey.data.frame.list, returns a list of length two, where the first element is named nrow and is a numeric vector containing the number of rows for each element of the edsurvey.data.frame.list. The second element is named ncol and is the number of columns for each element. This is done so that the nrow and ncol functions return meaningful results, even if nonstandard.
Paul Bailey
Calculates the degrees of freedom for a statistic (or of a contrast between two statistics) based on the jackknife and imputation variance estimates.
DoFCorrection(
  varEstA,
  varEstB = varEstA,
  varA,
  varB = varA,
  method = c("WS", "JR")
)
varEstA |
the |
varEstB |
similar to the |
varA |
a character that names the statistic in the |
varB |
a character that names the statistic in the |
method |
a character that is either "WS" or "JR" |
This calculation happens under the notion that statistics have little variance within strata, and some strata will contribute fewer than a full degree of freedom.
The functions are not vectorized, so both varA and varB must contain exactly one variable name.
The method used to compute the degrees of freedom is in the vignette titled Statistical Methods Used in EdSurvey section “Estimation of Degrees of Freedom.”
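To give a feel for why some replicates "contribute fewer than a full degree of freedom," here is a sketch of the general Welch-Satterthwaite approximation applied to jackknife replicate estimates. It assumes the "WS" option refers to this approximation; the helper wsDoF and the replicate values are hypothetical, and the package's own calculation, which also involves plausible values, is described in the vignette.

```r
# Hypothetical helper: Welch-Satterthwaite degrees of freedom from
# jackknife replicate estimates and the full-sample estimate
wsDoF <- function(repEst, fullEst) {
  v <- (repEst - fullEst)^2  # each replicate's variance contribution
  sum(v)^2 / sum(v^2)
}

# Made-up replicate estimates with uneven contributions
fullEst <- 2.0
repEst  <- c(2.1, 1.9, 2.3, 1.8, 2.0, 2.2)
dofUneven <- wsDoF(repEst, fullEst)

# When every replicate contributes equally, the result equals the
# number of replicates
dofEven <- wsDoF(c(1, 3, 1, 3), 2)

print(dofUneven)
print(dofEven)
```

When a few replicates dominate the variance, the approximation returns a value well below the number of replicates, which in turn widens confidence intervals and raises p-values.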
numeric; the estimated degrees of freedom
Paul Bailey
Johnson, E. G., & Rust, K. F. (1992). Population inferences and variance estimation for NAEP data. Journal of Educational Statistics, 17, 175–190.
## Not run: 
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))
lm1 <- lm.sdf(formula=composite ~ dsex + b017451, data=sdf, returnVarEstInputs=TRUE)
summary(lm1)
# this output agrees with summary of lm1 coefficient for dsex
DoFCorrection(lm1$varEstInputs, varA="dsexFemale", method="JR")

# second example, a covariance term requires more work
# first, estimate the covariance between two regression coefficients
# note that the variable names are parallel to what they are called in lm1 output
covFEveryDay <- varEstToCov(lm1$varEstInputs,
                            varA="dsexFemale",
                            varB="b017451Every day",
                            jkSumMultiplier=
                              EdSurvey:::getAttributes(data=sdf, attribute="jkSumMultiplier"))
# second, find the difference and the SE of the difference
# Var(A - B) = Var(A) + Var(B) - 2*Cov(A, B), so square the SEs before combining
se <- sqrt(lm1$coefmat["dsexFemale","se"]^2 +
           lm1$coefmat["b017451Every day","se"]^2 -
           2*covFEveryDay)
# third, calculate the t-statistic
tv <- (coef(lm1)["dsexFemale"] - coef(lm1)["b017451Every day"])/se
# fourth, calculate the p-value, which requires the estimated degrees of freedom
dofFEveryDay <- DoFCorrection(lm1$varEstInputs,
                              varA="dsexFemale",
                              varB="b017451Every day",
                              method="JR")
# finally, the p-value
2*(1-pt(abs(tv), df=dofFEveryDay))

## End(Not run)
Uses an Internet connection to download ePIRLS data. Data come from timssandpirls.bc.edu zip files. This function works for 2016 data.
download_ePIRLS(root, years = c(2016), cache = FALSE, verbose = TRUE)
root |
a character string indicating the directory where the ePIRLS data should be stored. Files are placed in a subdirectory named ePIRLS/[year]. |
years |
an integer vector of the assessment years to download. The only valid year is 2016. |
cache |
a logical value set to process and cache the text (.txt) version of files. This takes a very long time but saves time for future uses of the data. Default value is FALSE. |
verbose |
a logical value to either print or suppress status message output. The default value is TRUE. |
Tom Fink
## Not run: 
# root argument will vary by operating system conventions
download_ePIRLS(years=2016, root = "~/")

# cache=TRUE will download then process the datafiles
download_ePIRLS(years=2016, root = "~/", cache = TRUE)

# set verbose=FALSE for silent output
# if year not specified, download all years
download_ePIRLS(root="~/", verbose = FALSE)

## End(Not run)
Provides instructions to download CivED or ICCS data to be processed in readCivEDICCS.
downloadCivEDICCS(years = c(1999, 2009, 2016))
years |
an integer vector indicating the study year. Valid years are 1999, 2009, and 2016. |
Tom Fink
## Not run: 
# view instructions to manually download study data
downloadCivEDICCS()

## End(Not run)
## Not run: # view instructions to manually download study data downloadCivEDICCS() ## End(Not run)
Uses an Internet connection to download ECLS_K data. Data come from nces.ed.gov zip files. This function works for 1998 and 2011 data.
downloadECLS_K(root, years = c(1998, 2011), cache = FALSE, verbose = TRUE)
root |
a character string indicating the directory where the ECLS_K data should be stored. Files are placed in a subdirectory named ECLS_K/[year]. |
years |
an integer vector of the assessment years to download. Valid years are 1998 and 2011. |
cache |
a logical value set to process and cache the text (.txt) version of files. This takes a very long time but saves time for future uses of the data. Default value is FALSE. |
verbose |
a logical value to either print or suppress status message output. The default value is TRUE. |
Beginning with the ECLS_K 2011 Study Grade 5 data files, the ChildK5p.zip source data file is a DEFLATE64 compressed zip file. This means that the user must manually extract the contained childK5p.dat file using an external zip program capable of handling the DEFLATE64 zip format, as existing R functions are unable to handle this zip format natively.
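One way to script the manual extraction step is to call an external archiver from R. The sketch below assumes that the 7-Zip command-line tool (7z) is installed and on the search path. The helper name extract_deflate64 and the example file locations are hypothetical, so adjust them to wherever the zip file was actually saved.

```r
# Hypothetical helper (not part of EdSurvey): build, and optionally run,
# a 7-Zip extraction command for a DEFLATE64 archive.
extract_deflate64 <- function(zipFile, outDir, exe = Sys.which("7z"), run = TRUE) {
  zipFile <- path.expand(zipFile)
  outDir <- path.expand(outDir)
  # "x" extracts with full paths; "-o<dir>" (no space) sets the output folder;
  # "-y" answers yes to overwrite prompts
  args <- c("x", shQuote(zipFile), paste0("-o", shQuote(outDir)), "-y")
  # only call the external program when it was found and the archive exists
  if (run && nzchar(exe) && file.exists(zipFile)) {
    system2(exe, args)
  }
  invisible(args)
}

# Example paths are assumptions; run = FALSE only builds the arguments
args <- extract_deflate64("~/ECLS_K/2011/ChildK5p.zip", "~/ECLS_K/2011", run = FALSE)
```

Any other archiver that supports DEFLATE64 would work equally well; only the command and its arguments would change.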
Tom Fink
readECLS_K1998 and readECLS_K2011
## Not run:
# root argument will vary by operating system conventions
downloadECLS_K(years=c(1998, 2011), root = "~/")
# cache=TRUE will download then process the datafiles
downloadECLS_K(years=c(1998, 2011), root = "~/", cache = TRUE)
# set verbose=FALSE for silent output
# if year not specified, download all years
downloadECLS_K(root="~/", verbose = FALSE)
## End(Not run)
Uses an Internet connection to download ELS data. Data come from nces.ed.gov zip files. This function works for 2002 data.
downloadELS(root, years = c(2002), cache = FALSE, verbose = TRUE)
root |
a character string indicating the directory where the ELS data should be stored. Files are placed in a subdirectory named ELS/[year]. |
years |
an integer vector of the assessment years to download. Valid year is 2002 only. |
cache |
a logical value set to process and cache the text (.txt) version of files. This takes a very long time but saves time for future uses of the data. Default value is FALSE. |
verbose |
a logical value to either print or suppress status message output. The default value is TRUE. |
Tom Fink
## Not run:
# root argument will vary by operating system conventions
downloadELS(years=2002, root = "~/")
# cache=TRUE will download then process the datafiles
downloadELS(years=2002, root = "~/", cache = TRUE)
# set verbose=FALSE for silent output
# if year not specified, download all years
downloadELS(root="~/", verbose = FALSE)
## End(Not run)
Uses an Internet connection to download HSLS data. Data come from nces.ed.gov zip files. This function works for 2009 data.
downloadHSLS(root, years = c(2009), cache = FALSE, verbose = TRUE)
root |
a character string indicating the directory where the HSLS data should be stored. Files are placed in a subdirectory named HSLS/[year]. |
years |
an integer vector of the assessment years to download. Valid year is 2009 only. |
cache |
a logical value set to process and cache the text (.txt) version of files. This takes a very long time but saves time for future uses of the data. Default value is FALSE. |
verbose |
a logical value to either print or suppress status message output. The default value is TRUE. |
Tom Fink
## Not run:
# root argument will vary by operating system conventions
downloadHSLS(root = "~/", years=2009)
# set verbose=FALSE for silent output
# if year not specified, download all years
downloadHSLS(root="~/", verbose = FALSE)
## End(Not run)
Provides instructions to download ICILS data to be processed in readICILS.
downloadICILS(years = c(2013, 2018))
years |
an integer vector indicating the study year. Valid years are 2013 and 2018. |
Tom Fink
## Not run:
# view instructions to manually download study data
downloadICILS()
## End(Not run)
Provides instructions to download the public-use National Household Education Survey (NHES) data in SPSS (*.sav) format for use with the readNHES function. The data originate from the NCES Online Codebook zip files. This function works for data from the years 1991, 1993, 1995, 1996, 1999, 2001, 2003, 2005, 2007, 2012, 2016, and 2019.
downloadNHES( years = c(1991, 1993, 1995, 1996, 1999, 2001, 2003, 2005, 2007, 2012, 2016, 2019) )
years |
an integer vector of the assessment years. Valid years are 1991, 1993, 1995, 1996, 1999, 2001, 2003, 2005, 2007, 2012, 2016, and 2019. The instructions are the same for each year; this argument is used for reference only. |
The NHES data files are additionally available from the NHES data product page. However, the data files provided at that page do not include all available years of data, and contain inconsistent data file formats.
Tom Fink
## Not run:
# view instructions to manually download NHES data
downloadNHES()
## End(Not run)
Uses an Internet connection to download PIAAC data to a computer. Data come from the OECD website.
downloadPIAAC(root, cycle = 1, cache = FALSE, verbose = TRUE)
root |
a character string indicating the directory where the PIAAC data should be stored. Files are placed in a folder named PIAAC/cycle [cycle number]. |
cycle |
a numeric value indicating the assessment cycle to download. Valid cycle is 1 only. |
cache |
a logical value set to process and cache the text (.txt) version of files. This takes a very long time but saves time for future uses of the data. Default value is FALSE. |
verbose |
a logical value to either print or suppress status message output. The default value is TRUE. |
Eric Buehler, Paul Bailey, Trang Nguyen, and Yuqi Liao
## Not run:
# download all available data for PIAAC cycle 1 to the "~/PIAAC/cycle 1" folder
# root argument will vary by operating system conventions
downloadPIAAC(root="~/")
## End(Not run)
Uses an Internet connection to download PIRLS data. Data come from timssandpirls.bc.edu zip files. This function works for 2001, 2006, 2011, 2016, and 2021 data.
downloadPIRLS( root, years = c(2001, 2006, 2011, 2016, 2021), cache = FALSE, verbose = TRUE )
root |
a character string indicating the directory where the PIRLS data should be stored. Files are placed in a subdirectory named PIRLS/[year]. |
years |
an integer vector of the assessment years to download. Valid years are 2001, 2006, 2011, 2016, and 2021. |
cache |
a logical value set to process and cache the text (.txt) version of files. This takes a very long time but saves time for future uses of the data. Default value is FALSE. |
verbose |
a logical value to either print or suppress status message output. The default value is TRUE. |
Tom Fink
## Not run:
# root argument will vary by operating system conventions
downloadPIRLS(years=c(2006, 2011), root = "~/")
# cache=TRUE will download then process the datafiles
downloadPIRLS(years=2011, root = "~/", cache = TRUE)
# set verbose=FALSE for silent output
# if year not specified, download all years
downloadPIRLS(root="~/", verbose = FALSE)
## End(Not run)
Uses an Internet connection to download PISA data to a computer. Data come from the OECD website.
downloadPISA( root, years = c(2000, 2003, 2006, 2009, 2012, 2015, 2018, 2022), database = c("INT", "CBA", "FIN"), cache = FALSE, verbose = TRUE )
root |
a character string indicating the directory where the PISA data should be stored. Files are placed in a folder named PISA/[year]. |
years |
an integer vector of the assessment years to download. Valid years are 2000, 2003, 2006, 2009, 2012, 2015, 2018, and 2022. |
database |
a character vector to indicate which database to download from. For 2012, three databases are available (INT, CBA, and FIN). |
cache |
a logical value set to process and cache the text (.txt) version of files. This takes a very long time but saves time for future uses of the data. Default value is FALSE. |
verbose |
a logical value to either print or suppress status message output. The default value is TRUE. |
The function uses download.file to download files from provided URLs. Some machines might require a different user agent in HTTP(S) requests. If the download gives an error or behaves unexpectedly (e.g., a zip file cannot be unzipped or a data file is significantly smaller than expected), users can change the HTTPUserAgent option to find a value that works for their machines. One common alternative is
options(HTTPUserAgent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0")
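Because options() changes a session-wide setting, it can be convenient to apply the alternative user agent only for the duration of one call and then restore the previous value. The following is a minimal sketch of that pattern in base R. The wrapper with_user_agent is a hypothetical helper, not an EdSurvey function.

```r
# Hypothetical wrapper: evaluate `expr` with a temporary HTTPUserAgent,
# then restore whatever value was set before.
with_user_agent <- function(userAgent, expr) {
  old <- options(HTTPUserAgent = userAgent)  # options() returns the old value
  on.exit(options(old), add = TRUE)          # restore even if expr fails
  force(expr)                                # expr is evaluated lazily, i.e., here
}

ua <- "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0"
before <- getOption("HTTPUserAgent")
seen <- with_user_agent(ua, getOption("HTTPUserAgent"))
after <- getOption("HTTPUserAgent")

# A download call would take the place of getOption(), for example:
# with_user_agent(ua, downloadPISA(root = "~/", years = 2015))
```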
Beginning in the 2018 data files, the SPSS_STU_COG.zip source data file is a DEFLATE64 compressed zip file. This means that the user must manually extract the contained CY07_MSU_STU_COG.sav file using an external zip program capable of handling the DEFLATE64 zip format, as existing R functions are unable to handle this zip format natively.
Yuqi Liao, Paul Bailey, and Trang Nguyen
readPISA, download.file, options
## Not run:
# download PISA 2012 data (for all three databases)
downloadPISA(years = 2012, database = c("INT","CBA","FIN"), root="~/")
# download PISA 2009, 2012, and 2015 data (International Database only)
# to ~/PISA/2009, ~/PISA/2012, and ~/PISA/2015 folders, respectively
downloadPISA(years = c(2009,2012,2015), root="~/")
## End(Not run)
Provides instructions to download PISA YAFS data to be processed in readPISA_YAFS.
downloadPISA_YAFS(years = c(2016))
years |
an integer vector indicating the study year. Valid year is 2016 only. |
Tom Fink
## Not run:
# view instructions to manually download study data
downloadPISA_YAFS()
## End(Not run)
Provides instructions to download School Survey on Crime and Safety (SSOCS) data in SAS (*.sas7bdat) format for use with the readSSOCS function. The data originate from the SSOCS Data Products website at nces.ed.gov. This function works for the following school year datasets: 2000 (1999–2000), 2004 (2003–2004), 2006 (2005–2006), 2008 (2007–2008), 2010 (2009–2010), 2016 (2015–2016), and 2018 (2017–2018).
downloadSSOCS(years = c(2000, 2004, 2006, 2008, 2010, 2016, 2018))
years |
an integer vector of the study years to download. Valid years are as follows: 2000, 2004, 2006, 2008, 2010, 2016, 2018 (see description). The instructions are the same for each year; this argument is for reference only. |
The years parameter value is shortened to the ending year of the school year (e.g., 2006 refers to the 2005–2006 school year data). Manually downloading the data files is required to fulfill the data usage agreement.
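The ending-year convention can be made concrete with a one-line mapping from a years value to its school-year label. The helper below is hypothetical and is shown only to illustrate the naming rule.

```r
# Hypothetical helper: label each SSOCS `years` value with its school year
ssocs_school_year <- function(years) {
  paste0(years - 1, "-", years)
}

labels <- ssocs_school_year(c(2000, 2006, 2018))
labels
```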
Tom Fink
## Not run:
# see instructions for downloading SSOCS Data
downloadSSOCS()
## End(Not run)
Uses an Internet connection to download TALIS data. Data come from OECD TALIS site international zip files. This function works for 2008, 2013, and 2018 data.
downloadTALIS(root, years = c(2008, 2013, 2018), cache = FALSE, verbose = TRUE)
root |
a character string indicating the directory where the TALIS data should be stored. Files are placed in a subdirectory named TALIS/[year]. |
years |
a numeric value indicating the assessment year. Available years are 2008, 2013, and 2018. |
cache |
a logical value set to process and cache the text (.txt) version of files. This takes a very long time but saves time for future uses of the data. Default value is FALSE. |
verbose |
a logical value to either print or suppress status message output. The default value is TRUE. |
Tom Fink and Trang Nguyen
## Not run:
# root argument will vary by operating system conventions
downloadTALIS(root = "~/", years = 2018)
# cache=TRUE will download then process the datafiles
downloadTALIS(root = "~/", years = 2013, cache = TRUE)
# set verbose=FALSE for silent output
# if year not specified, download all years
downloadTALIS(root="~/", verbose = FALSE)
## End(Not run)
Uses an Internet connection to download TIMSS data. Data come from timssandpirls.bc.edu zip files. This function works for 2003, 2007, 2011, 2015, and 2019 data.
downloadTIMSS( root, years = c(2003, 2007, 2011, 2015, 2019), cache = FALSE, verbose = TRUE )
root |
a character string indicating the directory where the TIMSS data should be stored. Files are placed in a subdirectory named TIMSS/[year]. |
years |
an integer vector of the assessment years to download. Valid years are 2003, 2007, 2011, 2015, and 2019. |
cache |
a logical value set to process and cache the text (.txt) version of files. This takes a very long time but saves time for future uses of the data. Default value is FALSE. |
verbose |
a logical value to either print or suppress status message output. The default value is TRUE. |
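Because files are placed in TIMSS/[year] under root, the folders to pass to a later read call can be computed from the same root and years values. The sketch below only builds the expected paths. It does not download anything, and the variable names are illustrative.

```r
# Build the folders that a download into `root` is described as using
root <- path.expand("~")
years <- c(2011, 2015)
timssDirs <- file.path(root, "TIMSS", as.character(years))

# each element is the path for one assessment year, e.g. <root>/TIMSS/2011
timssDirs
```

The same pattern applies to the other download functions that document a [study]/[year] subdirectory.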
Tom Fink
## Not run:
# root argument will vary by operating system conventions
downloadTIMSS(years=c(2019, 2015, 2011), root = "~/")
# cache=TRUE will download then process the datafiles
downloadTIMSS(years=2015, root = "~/", cache = TRUE)
# set verbose=FALSE for silent output
# if year not specified, download all years
downloadTIMSS(root="~/", verbose = FALSE)
## End(Not run)
Uses an Internet connection to download TIMSS Advanced data. Data come from timssandpirls.bc.edu zip files. This function works for 1995, 2008, and 2015 data.
downloadTIMSSAdv( root, years = c(1995, 2008, 2015), cache = FALSE, verbose = TRUE )
root |
a character string indicating the directory where the TIMSS Advanced data should be stored. Files are placed in a subdirectory named TIMSSAdv/[year]. |
years |
an integer vector of the assessment years to download. Valid years are 1995, 2008, and 2015. |
cache |
a logical value set to process and cache the text (.txt) version of files. This takes a very long time but saves time for future uses of the data. Default value is FALSE. |
verbose |
a logical value to either print or suppress status message output. The default value is TRUE. |
Tom Fink
## Not run:
# root argument will vary by operating system conventions
downloadTIMSSAdv(years=c(2008, 2015), root = "~/")
# cache=TRUE will download then process the datafiles
downloadTIMSSAdv(years=2015, root = "~/", cache = TRUE)
# set verbose=FALSE for silent output
# if year not specified, download all years
downloadTIMSSAdv(root="~/", verbose = FALSE)
## End(Not run)
Draw plausible values from an mml fit
## S3 method for class 'sdf'
drawPVs(x, npv = 5L, pvVariableNameSuffix = "_dire", data, stochasticBeta = FALSE, construct = NULL, ...)
x |
a fit from a call to mml.sdf |
npv |
integer indicating the number of plausible values to draw |
pvVariableNameSuffix |
suffix to append to the name of the new plausible values |
data |
an |
stochasticBeta |
logical when |
construct |
the construct to draw PVs for |
... |
additional parameters |
Two new classes in EdSurvey are described in this section: the edsurvey.data.frame and light.edsurvey.data.frame. The edsurvey.data.frame class stores metadata about survey data, and data are stored on the disk (via the LaF package), allowing gigabytes of data to be used easily on a machine otherwise inappropriate for manipulating large datasets. The light.edsurvey.data.frame is typically generated by the getData function and stores the data in a data.frame. Both classes use attributes to manage metadata and allow for correct statistics to be used in calculating results; the getAttributes function acts as an accessor for these attributes, whereas setAttributes acts as a mutator for the attributes. As a convenience, edsurvey.data.frame implements the $ function to extract a variable.
edsurvey.data.frame(
  userConditions, defaultConditions, dataList = list(), weights, pvvars,
  subject, year, assessmentCode, dataType, gradeLevel, achievementLevels,
  omittedLevels, survey, country, psuVar, stratumVar, jkSumMultiplier,
  recodes = NULL, validateFactorLabels = FALSE, forceLower = TRUE,
  reqDecimalConversion = TRUE, fr2Path = NULL, dim0 = NULL,
  cacheDataLevelName = NULL
)
## S3 method for class 'edsurvey.data.frame'
x$i
## S3 replacement method for class 'edsurvey.data.frame'
x$name <- value
## S4 method for signature 'edsurvey.data.frame,ANY'
x %in% table
## S4 method for signature 'edsurvey.data.frame.list,ANY'
x %in% table
getAttributes(data, attribute = NULL, errorCheck = TRUE)
setAttributes(data, attribute, value)
getPSUVar(data, weightVar = attributes(getAttributes(data, "weights"))[["default"]])
getStratumVar(data, weightVar = attributes(getAttributes(data, "weights"))[["default"]])
userConditions |
a list of user conditions that includes subsetting or recoding conditions |
defaultConditions |
a list of default conditions that often are set for each survey |
dataList |
a list of |
weights |
a list that stores information regarding weight variables. See Details. |
pvvars |
a list that stores information regarding plausible values. See Details. |
subject |
a character that indicates the subject domain of the given data |
year |
a character or numeric that indicates the year of the given data |
assessmentCode |
a character that indicates the code of the assessment.
Can be |
dataType |
a character that indicates the unit level of the main data.
Examples include |
gradeLevel |
a character that indicates the grade level of the given data |
achievementLevels |
a list of achievement-level categories and cutpoints |
omittedLevels |
a list of default omitted levels for the given data |
survey |
a character that indicates the name of the survey |
country |
a character that indicates the country of the given data |
psuVar |
a character that indicates the PSU sampling unit variable. Ignored when weights have |
stratumVar |
a character that indicates the stratum variable. Ignored when weights have |
jkSumMultiplier |
a numeric value of the jackknife coefficient (used in calculating the jackknife replication estimation) |
recodes |
a list of variable recodes of the given data |
validateFactorLabels |
a Boolean that indicates whether the |
forceLower |
a Boolean; when set to |
reqDecimalConversion |
a Boolean; when set to |
fr2Path |
a character file location for NAEP assessments to identify the location of the codebook file in |
dim0 |
numeric vector of length two. To speed construction, the dimensions of the data can be provided |
cacheDataLevelName |
a character value set to match the named element in the |
x |
an |
i |
a character, the column name to extract |
name |
a character vector of the column to edit |
value |
outside of the assignment context, new value of the given |
table |
an |
data |
an |
attribute |
a character, name of an attribute to get or set |
errorCheck |
logical; see Details |
weightVar |
a character indicating the full sample weights. Required in |
The weights list has an element named after each weight variable name that is a list with elements jkbase and jksuffixes. The jkbase variable is a single character indicating the jackknife replicate weight base name, whereas jksuffixes is a vector with one element for each jackknife replicate weight. When the two are pasted together, they should form the complete set of the jackknife replicate weights. The weights argument also can have an attribute that is the default weight. If the primary sampling unit and stratum variables change by weight, they also can be defined on the weight list as psuVar and stratumVar. When this option is used, it overrides the psuVar and stratumVar on the edsurvey.data.frame, which can be left blank. A weight must define only one of psuVar and stratumVar.
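The shape of the weights list described above can be sketched in base R. The variable names used here (origwt, srwt) and the number of replicates are assumptions chosen for illustration; the actual values come from each survey's control files.

```r
# Illustrative weights list; names and replicate count are assumptions
wts <- list(
  origwt = list(
    jkbase = "srwt",                    # replicate weight base name
    jksuffixes = sprintf("%02d", 1:62)  # one suffix per replicate weight
  )
)
# the default weight is stored as an attribute named "default"
attr(wts, "default") <- "origwt"

# pasting jkbase and jksuffixes together gives the replicate weight names
w <- wts[[attr(wts, "default")]]
repWeights <- paste0(w$jkbase, w$jksuffixes)
head(repWeights, 3)
```

The "default" attribute name matches the getPSUVar and getStratumVar usage, which read attributes(getAttributes(data, "weights"))[["default"]].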
The pvvars list has an element for each subject or subscale score that has plausible values. Each element is a list with a varnames element that indicates the column names of the plausible values and an achievementLevel element that is a named vector of the achievement-level cutpoints.
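A pvvars element can be sketched the same way. The column names and cutpoints below are illustrative assumptions, and the rule that a score at or above a cutpoint falls in that level is also an assumption of this sketch rather than a statement about how EdSurvey assigns levels.

```r
# Illustrative pvvars list; column names and cutpoints are assumptions
pvvars <- list(
  composite = list(
    varnames = paste0("mrpcm", 1:5),  # one column per plausible value
    achievementLevel = c(Basic = 262, Proficient = 299, Advanced = 333)
  )
)

# classify some made-up scores against the named cutpoints
cuts <- pvvars$composite$achievementLevel
scores <- c(250, 262, 310, 340)
lvl <- c("Below Basic", names(cuts))[findInterval(scores, cuts) + 1]
lvl
```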
An edsurvey.data.frame implements a unique data caching mechanism that allows users to create and merge data columns for flexibility. This cache object is a single data.frame that is an element in the edsurvey.data.frame. To accommodate studies with complex data models, the cache can only support one data level at this time. The cacheDataLevelName parameter indicates which named element in the dataList the cache corresponds to. The default value cacheDataLevelName = NULL will set the first item in the dataList as the cache level for an edsurvey.data.frame.
An object of class edsurvey.data.frame with the following elements:
Elements that store data connections and data codebooks
dataList |
a |
Elements that store sample design and default subsetting information of the given survey data
userConditions |
a list containing all user conditions, set using the |
defaultConditions |
the default subsample conditions |
weights |
a list containing the weights. See Details. |
stratumVar |
a character that indicates the default strata identification variable name in the data. Often used in Taylor series estimation. |
psuVar |
a character that indicates the default PSU (sampling unit) identification variable name in the data. Often used in Taylor series estimation. |
pvvars |
a list containing the plausible values. See Details. |
achievementLevels |
default achievement cutoff scores and names. See Details. |
omittedLevels |
the levels of the factor variables that will be omitted from the |
Elements that store descriptive information of the survey
survey |
the type of survey data |
subject |
the subject of the data |
year |
the year of assessment |
assessmentCode |
the assessment code |
dataType |
the type of data (e.g., |
gradeLevel |
the grade of the dataset contained in the |
Elements used in mml.sdf
dichotParamTab |
IRT item parameters for dichotomous items in a data frame |
polyParamTab |
IRT item parameters for polytomous items in a data frame |
adjustedData |
IRT item parameter adjustment information in a data frame |
testData |
IRT transformation constants in a data frame |
scoreCard |
item scoring information in a data frame |
scoreDict |
generic scoring information in a data frame |
scoreFunction |
a function that turns the variables with items in them into numeric scores |
edsurvey.data.frame is an object that stores connection to data on the disk along with important survey sample design information.
edsurvey.data.frame.list is a list of edsurvey.data.frame objects. It often is used in trend or cross-regional analysis in the gap function. See edsurvey.data.frame.list for more information on how to create an edsurvey.data.frame.list. Users also can refer to the vignette titled Using EdSurvey for Trend Analysis for examples.
Besides the edsurvey.data.frame class, the EdSurvey package also implements the light.edsurvey.data.frame class, which can be used by both EdSurvey and non-EdSurvey functions. More particularly, light.edsurvey.data.frame is a data.frame that has basic survey and sample design information (i.e., plausible values and weights), which will be used for variance estimation in analytical functions. Because it also is a base R data.frame, users can apply base R functions for data manipulation. See the vignette titled Using the getData Function in EdSurvey for more examples.
Many functions will remove attributes from a data frame, such as a light.edsurvey.data.frame, and the rebindAttributes function can add them back.
Users can get a light.edsurvey.data.frame object by using the getData method with addAttributes=TRUE.
Extracting a column from an edsurvey.data.frame
Users can extract a column from an edsurvey.data.frame object using $ or [] like a normal data frame.
Extracting and updating attributes of an object of class edsurvey.data.frame or light.edsurvey.data.frame
Users can use the getAttributes method to extract any attribute of an edsurvey.data.frame or a light.edsurvey.data.frame. The errorCheck parameter has a default value of TRUE, which throws an error if an attribute is not found. Setting errorCheck = FALSE will suppress error checking and return NULL if an attribute can't be found. A light.edsurvey.data.frame will not have attributes related to data connection because data have already been read in memory.
If users want to update an attribute (e.g., omittedLevels), they can use the setAttributes method.
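The errorCheck behavior described above can be imitated on a plain data.frame to show what a caller should expect. The toy accessor below is an illustration only. It is not the EdSurvey implementation, and get_attr is a hypothetical name.

```r
# Toy accessor on a plain data.frame; NOT the EdSurvey implementation
get_attr <- function(data, attribute, errorCheck = TRUE) {
  value <- attr(data, attribute, exact = TRUE)
  if (is.null(value) && errorCheck) {
    stop("Attribute not found: ", attribute)
  }
  value
}

df <- data.frame(x = 1:3)
attr(df, "omittedLevels") <- c("Multiple", NA, "Omitted")

found <- get_attr(df, "omittedLevels")
# errorCheck = FALSE returns NULL for a missing attribute
missingQuiet <- get_attr(df, "notThere", errorCheck = FALSE)
# the default (errorCheck = TRUE) raises an error, caught here for the demo
missingLoud <- tryCatch(get_attr(df, "notThere"), error = function(e) "error")
```

With the real getAttributes, the errorCheck = FALSE form is useful when code needs to test for an optional attribute without wrapping the call in tryCatch.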
Tom Fink, Trang Nguyen, and Paul Bailey
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))
# run a base R function on a column of edsurvey.data.frame
table(sdf$dsex)
# assignment
table(sdf$b013801)
sdf$books <- ifelse(sdf$b013801 %in% c("0-10", "11-25"), "0-25 books", "26+ books")
table(sdf$books, sdf$b013801)
# extract default omitted levels of NAEP primer data
getAttributes(data=sdf, attribute="omittedLevels")
#[1] "Multiple" NA "Omitted"
# update default omitted levels of NAEP primer data
sdf <- setAttributes(data=sdf, attribute="omittedLevels", value=c("Multiple", "Omitted", NA, "(Missing)"))
getAttributes(data=sdf, attribute="omittedLevels")
#[1] "Multiple" "Omitted" NA "(Missing)"
## End(Not run)
The edsurvey.data.frame.list function creates an edsurvey.data.frame.list object from a series of edsurvey.data.frame objects. append.edsurvey.data.frame.list creates an edsurvey.data.frame.list from two edsurvey.data.frame or edsurvey.data.frame.list objects.
An edsurvey.data.frame.list is useful for looking at data, for example, across time or graphically, and reduces repetition in function calls. The user may specify a variable that varies across the edsurvey.data.frame objects that is then included in further output.
edsurvey.data.frame.list(datalist, cov = NULL, labels = NULL) append.edsurvey.data.frame.list(sdfA, sdfB, labelsA = NULL, labelsB = NULL)
datalist |
a list of |
cov |
a character vector that indicates what varies across
the |
labels |
a character vector that specifies labels. Must be the
same length
as |
sdfA |
an |
sdfB |
an |
labelsA |
a character vector that specifies |
labelsB |
a character vector that specifies |
The edsurvey.data.frame.list
can be used in place of an
edsurvey.data.frame
in function calls, and results are returned
for each of the component edsurvey.data.frame
objects, with the
organization of the results varying by the particular method.
An edsurvey.data.frame.list
can be created from several
edsurvey.data.frame
objects that are related;
for example, all are NAEP mathematics assessments but have one or more
differences (e.g., they are all from different years).
Another example could be data from multiple countries for an
international assessment.
When cov
and labels
are both missing, edsurvey.data.frame.list
attempts to guess what variables may be varying and uses those. When there are no
varying covariates, generic labels are automatically generated.
edsurvey.data.frame.list
returns an edsurvey.data.frame.list
with
elements
datalist |
a list of |
covs |
a character vector of key variables that vary within
the |
append.edsurvey.data.frame.list
returns an edsurvey.data.frame.list
with
elements
datalist |
a list of |
covs |
a character vector of key variables that vary within
the |
Paul Bailey, Huade Huo
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

# NOTE: the following code would not normally have to be run but is used here
# to generate demo data.
# Specifically, make subsets of sdf by the scrpsu variable,
# "Scrambled PSU and school code"
sdfA <- subset(sdf, scrpsu %in% c(5,45,56))
sdfB <- subset(sdf, scrpsu %in% c(75,76,78))
sdfC <- subset(sdf, scrpsu %in% 100:200)
sdfD <- subset(sdf, scrpsu %in% 201:300)

# construct an edsurvey.data.frame.list from these four data sets
sdfl <- edsurvey.data.frame.list(datalist=list(sdfA, sdfB, sdfC, sdfD),
                                 labels=c("A locations", "B locations",
                                          "C locations", "D locations"))

# alternative method of building
sdfl2 <- sdfA + sdfB + sdfC

# check contents
sdfA %in% sdfl
# note %in% checks by survey (NAEP 2005 Math for sdf,
# sdfA, sdfB, sdfC, and sdfD) not by subset, so this also returns TRUE
sdfD %in% sdfl2

# this shows how these datasets will be described
sdfl$covs

# get the gaps between Male and Female for each data set
gap1 <- gap(variable="composite", data=sdfl, dsex=="Male", dsex=="Female")
gap1

# combine sdfA and sdfB
sdfl1a <- edsurvey.data.frame.list(datalist=list(sdfA, sdfB),
                                   labels=c("A locations", "B locations"))

# combine sdfC and sdfD
sdfl1b <- edsurvey.data.frame.list(datalist=list(sdfC, sdfD),
                                   labels=c("C locations", "D locations"))

# append to make sdfl3 the same as sdfl
sdfl3 <- append.edsurvey.data.frame.list(sdfA=sdfl1a, sdfB=sdfl1b)
identical(sdfl, sdfl3) #TRUE

# append to make sdfl4 the same as sdfl
sdfl4 <- append.edsurvey.data.frame.list(
  append.edsurvey.data.frame.list(sdfA=sdfl1a, sdfB=sdfC, labelsB = "C locations"),
  sdfD,
  labelsB = "D locations")
identical(sdfl, sdfl4) #TRUE

# show label deconflicting
downloadTIMSS(root="~/", years=c(2011, 2015))
t11 <- readTIMSS(path="~/TIMSS/2011", countries = c("fin", "usa"), gradeLvl = 4)
t15 <- readTIMSS(path="~/TIMSS/2015", countries = c("fin", "usa"), gradeLvl = 4)
# these would not be unique
t11$covs
t15$covs
# resulting values include year now
t11_15 <- append.edsurvey.data.frame.list(sdfA=t11, sdfB=t15)
t11_15$covs
## End(Not run)
Returns a summary table (as a data.frame
)
that shows the number of students, the percentage of students, and the mean
value of the outcome (or left-hand side) variable by the
predictor (or right-hand side) variable(s).
edsurveyTable( formula, data, weightVar = NULL, jrrIMax = 1, pctAggregationLevel = NULL, returnMeans = TRUE, returnSepct = TRUE, varMethod = c("jackknife", "Taylor"), drop = FALSE, dropOmittedLevels = TRUE, defaultConditions = TRUE, recode = NULL, returnVarEstInputs = FALSE, omittedLevels = deprecated() )
formula |
object of class |
data |
object of class |
weightVar |
character string indicating the weight variable to use.
Note that only the name of the
weight variable needs to be included here, and any
replicate weights will be automatically included.
When this argument is |
jrrIMax |
a numeric value; when using the jackknife variance estimation method, the default estimation option, |
pctAggregationLevel |
the percentage variable sums up to 100 for the first
|
returnMeans |
a logical value; set to |
returnSepct |
set to |
varMethod |
a character set to |
drop |
a logical value. When set to the default value of |
dropOmittedLevels |
a logical value. When set to the default value of |
defaultConditions |
a logical value. When set to the default value of |
recode |
a list of lists to recode variables. Defaults to |
returnVarEstInputs |
a logical value set to |
omittedLevels |
this argument is deprecated. Use |
This method can be used to generate a simple one-way, two-way, or n-way table with unweighted and weighted n values and percentages. It also can calculate the average of the subject scale or subscale for students at each level of the cross-tabulation table.
A detailed description of all statistics is given in the vignette titled Statistical Methods Used in EdSurvey.
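The columns of such a table can be illustrated with a small base R sketch. This is not how edsurveyTable is implemented: it ignores plausible values, replicate weights, and standard errors, and the toy data frame and its column names (score, wt) are invented for illustration. It only shows what an unweighted count, a weighted count, a weighted percentage, and a weighted mean look like for a one-way table.

```r
# Toy data: hypothetical scores, group labels, and sampling weights
toy <- data.frame(
  dsex  = c("Male", "Male", "Female", "Female", "Female"),
  score = c(270, 290, 280, 300, 260),
  wt    = c(1, 3, 2, 2, 2)
)

# Split the rows by the right-hand side variable
groups <- split(toy, toy$dsex)

# One row per level: unweighted n, weighted n, and weighted mean
tab <- data.frame(
  dsex  = names(groups),
  N     = vapply(groups, nrow, integer(1)),
  WTD_N = vapply(groups, function(g) sum(g$wt), numeric(1)),
  MEAN  = vapply(groups, function(g) weighted.mean(g$score, g$wt), numeric(1)),
  row.names = NULL
)

# Weighted percentage of each level; sums to 100 across the table
tab$PCT <- 100 * tab$WTD_N / sum(tab$WTD_N)
print(tab)
```

In real use, edsurveyTable also reports SE(PCT) and SE(MEAN), which depend on the survey design and cannot be reproduced from a simple weighted summary like this one.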
A table with the following columns:
RHS levels |
one column for each right-hand side variable. Each row regards students who are at the levels shown in that row. |
N |
count of the number of students in the survey in the |
WTD_N |
the weighted N count of students in the survey in |
PCT |
the percentage of students at the aggregation level specified by |
SE(PCT) |
the standard error of the percentage, accounting
for the survey sampling methodology. When |
MEAN |
the mean assessment score for units in the |
SE(MEAN) |
the standard error of the |
When returnVarEstInputs
is TRUE
, two additional elements are
returned. These are meanVarEstInputs
and pctVarEstInputs
and
regard the MEAN
and PCT
columns, respectively. These two
objects can be used for calculating covariances with
varEstToCov
.
Paul Bailey and Ahmad Emad
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# create a table that shows only the breakdown of dsex
edsurveyTable(formula=composite ~ dsex, data=sdf, returnMeans=FALSE, returnSepct=FALSE)

# create a table with composite scores by dsex
edsurveyTable(formula=composite ~ dsex, data=sdf)

# add a second variable
edsurveyTable(formula=composite ~ dsex + b017451, data=sdf)

# add a third variable, do not omit any levels
edsurveyTable(formula=composite ~ dsex + b017451 + b003501, data=sdf,
              dropOmittedLevels=FALSE)

# same three variables, do not omit any levels, change aggregation level
edsurveyTable(formula=composite ~ dsex + b017451 + b003501, data=sdf,
              dropOmittedLevels=FALSE, pctAggregationLevel=0)
edsurveyTable(formula=composite ~ dsex + b017451 + b003501, data=sdf,
              dropOmittedLevels=FALSE, pctAggregationLevel=1)
edsurveyTable(formula=composite ~ dsex + b017451 + b003501, data=sdf,
              dropOmittedLevels=FALSE, pctAggregationLevel=2)

# variance estimation using the Taylor series
edsurveyTable(formula=composite ~ dsex + b017451 + b003501, data=sdf, varMethod="Taylor")
## End(Not run)
Produces the LaTeX code and compiles to a PDF file from the edsurveyTable
results.
edsurveyTable2pdf( data, formula, caption = NULL, filename = "", toCSV = "", returnMeans = TRUE, estDigits = 2, seDigits = 3 )
data |
the result of a call to |
formula |
a formula of the form |
caption |
character vector of length one or two containing the table's caption or title.
If the length is two, the second item is the “short caption” used when LaTeX generates
a |
filename |
a character string containing filenames and paths. By default ( |
toCSV |
a character string containing filenames and paths of .csv table output.
|
returnMeans |
a logical value set to |
estDigits |
an integer indicating the number of decimal places to be used for estimates. Negative values are allowed. See Details. |
seDigits |
an integer indicating the number of decimal places to be used for standard errors. Negative values are allowed. |
Rounding to a negative number of digits means rounding to a power of 10,
so, for example, estDigits = -2
rounds estimates to the nearest hundred.
For more details, see the vignette titled
Producing LaTeX
Tables From edsurveyTable
Results With edsurveyTable2pdf
.
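The effect of a negative digits value can be checked with base R's round function, which follows the same convention. The values below are made up for illustration; this sketch does not call edsurveyTable2pdf.

```r
# Hypothetical estimates to be rounded
est <- c(1234.567, 98765.4321)

# Positive digits: keep two decimal places
rounded_decimals <- round(est, digits = 2)

# Negative digits: round to the nearest hundred (10^2)
rounded_hundreds <- round(est, digits = -2)

print(rounded_decimals)
print(rounded_hundreds)
```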
Huade Huo
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

# create a table with composite scores by dsex and b017451
est1 <- edsurveyTable(formula=composite ~ dsex + b017451, data=sdf)

# create a table with csv output
edsurveyTable2pdf(data=est1, formula=b017451~dsex, toCSV="C:/example table.csv",
                  filename="C:/example table.pdf", returnMeans=FALSE)

# create a pdf file using the default subject scale or subscale
# and keep two digits for estimates and three digits for SE after decimal point
edsurveyTable2pdf(data=est1, formula=b017451~dsex, returnMeans=TRUE,
                  estDigits=2, seDigits=3)

# create a pdf file using the percentage of students at the
# aggregation level specified by pctAggregationLevel
# output will be saved as "C:/example table.pdf"
edsurveyTable2pdf(data=est1, formula=b017451~dsex, filename="C:/example table.pdf",
                  returnMeans=FALSE)
## End(Not run)
Prep variables for calls to mml.sdf
es_recode(data, vars, from, to)
data |
a data frame to be edited |
vars |
variables to modify |
from |
recode from these levels |
to |
recode to this level, or levels |
If to
is length 1, then all variables in vars
are recoded from every from
to the level of to
.
When to
is the same length as from
then the ith level of from
is recoded
to the ith level of to
.
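The two recoding rules can be sketched in base R on a single character vector. The helper recode_sketch below is hypothetical and is not the es_recode implementation; it only mirrors the described behavior of a length-one to versus a to with the same length as from.

```r
# Hypothetical helper illustrating the two recoding rules (not es_recode itself)
recode_sketch <- function(x, from, to) {
  stopifnot(length(to) == 1 || length(to) == length(from))
  # A single target level is reused for every level in `from`
  to <- rep(to, length.out = length(from))
  x <- as.character(x)
  # Position of each value in `from`, or NA when the value is not recoded
  idx <- match(x, from)
  hit <- !is.na(idx)
  x[hit] <- to[idx[hit]]
  x
}

x <- c("Omitted", "Multiple", "Yes", "No")

# `to` has length 1: every level in `from` becomes "Missing"
many_to_one <- recode_sketch(x, from = c("Omitted", "Multiple"), to = "Missing")

# `to` has the same length as `from`: the ith level maps to the ith target
one_to_one <- recode_sketch(x, from = c("Yes", "No"), to = c("1", "0"))

print(many_to_one)
print(one_to_one)
```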
the data with each variable in vars
recoded from the levels in from
to the level or levels in to
Applies rounding rules
es_round( object, round_n = getOption("EdSurvey_round_n_function"), round_pop_n = getOption("EdSurvey_round_pop_n_function"), round_est = getOption("EdSurvey_round_est_function"), round_est_se = getOption("EdSurvey_round_est_se_function"), round_pct = getOption("EdSurvey_round_pct_function"), round_pct_se = getOption("EdSurvey_round_pct_se_function"), round_specific_element = NULL, ... )
object |
the object (usually the result of an analysis function) to be rounded |
round_n |
function used to round sample n-sizes |
round_pop_n |
function used to round weighted n-sizes, these are also called population size estimates |
round_est |
function used to round estimates; examples include means and percentiles of scores, as well as regression coefficients |
round_est_se |
function used to round standard errors of estimates |
round_pct |
function used to round percentages |
round_pct_se |
function used to round the standard errors of percentages |
round_specific_element |
a list of rounding functions, the function is applied to elements with that name. See Examples |
... |
additional arguments passed to methods |
Rounds every statistic that is a function of data, including the header and tables.
the object is returned, with relevant elements rounded
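The idea of a named list of rounding functions, where each function is applied to the element with the matching name, can be sketched in base R. The factory round_to and the toy result table are assumptions made for illustration; round_to only stands in for the kind of function a user might pass, and the loop is not the es_round implementation.

```r
# Hypothetical factory: returns a function that rounds to a fixed number of digits
round_to <- function(digits) {
  function(x) round(x, digits)
}

# Toy result table with invented values
res <- data.frame(
  N    = c(1523, 1498),
  PCT  = c(50.4137, 49.5863),
  MEAN = c(277.2284, 275.0121)
)

# Named list of rounding functions, one per element to be rounded
rules <- list(PCT = round_to(1), MEAN = round_to(0))

# Apply each function to the column with the same name; other columns are untouched
for (nm in intersect(names(rules), names(res))) {
  res[[nm]] <- rules[[nm]](res[[nm]])
}
print(res)
```

Passing functions rather than digit counts keeps the rule flexible: a caller could supply a function that rounds counts to the nearest ten, for example, without any change to the loop.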
Paul Bailey
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# by default uses jackknife variance method using replicate weights
es1 <- edsurveyTable(formula=composite ~ dsex + b017451, data=sdf)

# turn on rounding by default
options(EdSurvey_round_output= TRUE)
es1

# turn off rounding by default
options(EdSurvey_round_output= FALSE)

# request rounding for this output
print(es1, use_es_round=TRUE)

# round, then print
# round the PCT column to one digit
es_round(es1, round_specific_element=list(PCT=roundn(1)))
## End(Not run)
Compares the average levels of a variable between two groups that potentially share members.
gap( variable, data, groupA = "default", groupB = "default", percentiles = NULL, achievementLevel = NULL, achievementDiscrete = FALSE, stDev = FALSE, targetLevel = NULL, weightVar = NULL, jrrIMax = 1, varMethod = c("jackknife"), dropOmittedLevels = TRUE, defaultConditions = TRUE, recode = NULL, referenceDataIndex = 1, returnVarEstInputs = FALSE, returnSimpleDoF = FALSE, returnSimpleN = FALSE, returnNumberOfPSU = FALSE, noCov = FALSE, pctMethod = c("unbiased", "symmetric", "simple"), includeLinkingError = FALSE, omittedLevels = deprecated() )
variable |
a character indicating the variable to be compared, potentially with a subject scale or subscale |
data |
an |
groupA |
an expression or character expression that defines a condition for the subset.
This subset will be compared to |
groupB |
an expression or character expression that defines a condition for the subset.
This subset will be compared to |
percentiles |
a numeric vector. The |
achievementLevel |
the achievement level(s) at which percentages should be calculated |
achievementDiscrete |
a logical indicating if the achievement level
specified in the |
stDev |
a logical, set to |
targetLevel |
a character string. When specified, calculates the gap in
the percentage of students at
|
weightVar |
a character indicating the weight variable to use. See Details. |
jrrIMax |
a numeric value; when using the jackknife variance estimation method, the default estimation option, |
varMethod |
deprecated parameter, |
dropOmittedLevels |
a logical value. When set to the default value of
|
defaultConditions |
a logical value. When set to the default value
of |
recode |
a list of lists to recode variables. Defaults to |
referenceDataIndex |
a numeric used only when the |
returnVarEstInputs |
a logical value; set to |
returnSimpleDoF |
a logical value set to |
returnSimpleN |
a logical value set to |
returnNumberOfPSU |
a logical value set to |
noCov |
set the covariances to zero in result |
pctMethod |
a character that is one of |
includeLinkingError |
a logical value set to |
omittedLevels |
this argument is deprecated. Use |
This function calculates the gap between groupA
and groupB
(which
may be omitted to indicate the full sample). The gap is
calculated for one of four statistics:
The mean score gap (in the score
variable) identified in the variable
argument.
This is the default. The means and their standard errors are
calculated using the methods
described in the lm.sdf
function documentation.
The gap between respondents at
the percentiles specified in the percentiles
argument.
This is returned when the percentiles
argument is
defined. The mean and standard error are computed as described in the
percentile
function documentation.
The gap in the percentage of
students at (when achievementDiscrete
is TRUE
) or at
or above (when achievementDiscrete
is FALSE
) a
particular achievement level. This is used when the
achievementLevel
argument is defined. The mean and standard error
are calculated as described in the achievementLevels
function documentation.
The gap in the percentage of
respondents responding at targetLevel
to
variable
. This is used when targetLevel
is
defined. The mean and standard error are calculated as described in
the edsurveyTable
function documentation.
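The difference between the "at" and "at or above" achievement-level percentages can be made concrete with a base R sketch. The scores, weights, group labels, and cut scores below are all invented, and the calculation ignores plausible values and standard errors; it is only meant to show which respondents each definition counts.

```r
# Invented scores, unit weights, and group membership
scores <- c(200, 230, 250, 262, 270, 285, 300, 340)
wts    <- rep(1, length(scores))
group  <- c("A", "A", "A", "A", "B", "B", "B", "B")

# Hypothetical cut scores, ordered from lowest to highest level
cuts <- c(Basic = 214, Proficient = 249, Advanced = 282)

# Cumulative definition: weighted percentage at or above each cut score
at_or_above <- vapply(
  cuts,
  function(cp) 100 * sum(wts[scores >= cp]) / sum(wts),
  numeric(1)
)

# Discrete definition: at or above a cut score but below the next one
upper <- c(cuts[-1], Inf)
discrete <- mapply(
  function(lo, hi) 100 * sum(wts[scores >= lo & scores < hi]) / sum(wts),
  cuts, upper
)

# Gap between groups A and B in the percentage at or above Proficient
pct_at_or_above <- function(s, w, cut) 100 * sum(w[s >= cut]) / sum(w)
gap_proficient <-
  pct_at_or_above(scores[group == "A"], wts[group == "A"], cuts[["Proficient"]]) -
  pct_at_or_above(scores[group == "B"], wts[group == "B"], cuts[["Proficient"]])

print(at_or_above)
print(discrete)
print(gap_proficient)
```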
The return type depends on if the class of the data
argument is an
edsurvey.data.frame
or an edsurvey.data.frame.list
. Both
include the call (called call
), a list called labels
,
an object named percentage
that shows the percentage in groupA
and groupB
, and an object
that shows the gap called results
.
The labels include the following elements:
definition |
the definitions of the groups |
nFullData |
the n-size for the full dataset (before applying the definition) |
nUsed |
the n-size for the data after the group is subsetted and other restrictions (such as omitted values) are applied |
nPSU |
the number of PSUs used in calculation–only returned when
|
The percentages are computed according to the vignette titled Statistical Methods Used in EdSurvey in the section “Estimation of Weighted Percentages When Plausible Values Are Not Present.” The standard errors are calculated according to “Estimation of the Standard Error of Weighted Percentages When Plausible Values Are Not Present, Using the Jackknife Method.” Standard errors of differences are calculated as the square root of the typical variance formula

$$\mathrm{Var}(A - B) = \mathrm{Var}(A) + \mathrm{Var}(B) - 2\,\mathrm{Cov}(A, B)$$

where the covariance term is calculated as described in the vignette titled Statistical Methods Used in EdSurvey in the section “Estimation of Covariances.” These degrees of freedom are available only with the jackknife variance estimation. The degrees of freedom used for hypothesis testing are always set to the number of jackknife replicates in the data.
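As a worked illustration of the standard error of a difference, the following base R sketch plugs invented numbers into the variance of a difference, Var(A) + Var(B) - 2 Cov(A, B), and then forms a two-sided t-test. The estimates, standard errors, covariance, and the degrees of freedom value of 62 are all assumptions chosen for the example, not output from gap.

```r
# Invented group estimates and their standard errors
est_a <- 277.2; se_a <- 1.1
est_b <- 275.0; se_b <- 1.0

# Invented covariance between the two estimates (groups may share members)
cov_ab <- 0.3

# Assumed degrees of freedom, e.g. a number of jackknife replicates
dof <- 62

# Difference and its standard error: sqrt(Var(A) + Var(B) - 2 * Cov(A, B))
diff_ab  <- est_a - est_b
diff_var <- se_a^2 + se_b^2 - 2 * cov_ab
diff_se  <- sqrt(diff_var)

# Two-sided p-value for the hypothesis that the difference is zero
t_stat  <- diff_ab / diff_se
p_value <- 2 * pt(-abs(t_stat), df = dof)

print(c(diff = diff_ab, se = diff_se, t = t_stat, p = p_value))
```

A positive covariance shrinks the standard error of the difference, which is why ignoring it (as noCov = TRUE does by setting covariances to zero) generally changes the reported standard error.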
the data argument is an edsurvey.data.frame
When the data
argument is an edsurvey.data.frame
,
gap
returns an S3 object of class gap
.
The percentage
object is a numeric vector with the following elements:
pctA |
the percentage of respondents in |
pctAse |
the standard error on the percentage of respondents in
|
dofA |
degrees of freedom appropriate for a t-test involving |
pctB |
the percentage of respondents in |
pctBse |
the standard error on the percentage of respondents in
|
dofB |
degrees of freedom appropriate for a t-test involving |
diffAB |
the value of |
covAB |
the covariance of |
diffABse |
the standard error of |
diffABpValue |
the p-value associated with the t-test used
for the hypothesis test that |
dofAB |
degrees of freedom used in calculating
|
The results
object is a numeric data frame with the following elements:
estimateA |
the mean estimate of |
estimateAse |
the standard error of |
dofA |
degrees of freedom appropriate for a t-test involving |
estimateB |
the mean estimate of |
estimateBse |
the standard error of |
dofB |
degrees of freedom appropriate for a t-test involving |
diffAB |
the value of |
covAB |
the covariance of |
diffABse |
the standard error of |
diffABpValue |
the p-value associated with the t-test used
for the hypothesis test that |
dofAB |
degrees of freedom used for the t-test on |
If the gap was in achievement levels or percentiles and more
than one percentile or achievement level is requested,
then an additional column
labeled percentiles
or achievementLevel
is included
in the results
object.
When results
has a single row and when returnVarEstInputs
is TRUE
, the additional elements varEstInputs
and
pctVarEstInputs
also are returned. These can be used for calculating
covariances with varEstToCov
.
the data argument is an edsurvey.data.frame.list
When the data
argument is an edsurvey.data.frame.list
,
gap
returns an S3 object of class gapList
.
The results
object in the edsurveyResultList
is
a data.frame
. Each row regards a particular dataset from the
edsurvey.data.frame.list
, and a reference dataset is dictated by
the referenceDataIndex
argument.
The percentage
object is a data.frame
with the following elements:
covs |
a data frame with a column for each column in the |
... |
all elements in the |
diffAA |
the difference in |
covAA |
the covariance of |
diffAAse |
the standard error for |
diffAApValue |
the p-value associated with the t-test used
for the hypothesis test that |
diffBB |
the difference in |
covBB |
the covariance of |
diffBBse |
the standard error for |
diffBBpValue |
the p-value associated with the t-test used
for the hypothesis test that |
diffABAB |
the value of |
covABAB |
the covariance of |
diffABABse |
the standard error for |
diffABABpValue |
the p-value associated with the t-test used
for the hypothesis test that |
The results
object is a data.frame
with the following elements:
... |
all elements in the |
diffAA |
the value of |
covAA |
the covariance of |
diffAAse |
the standard error for |
diffAApValue |
the p-value associated with the t-test used
for the hypothesis test that |
diffBB |
the value of |
covBB |
the covariance of |
diffBBse |
the standard error for |
diffBBpValue |
the p-value associated with the t-test used
for the hypothesis test that |
diffABAB |
the value of |
covABAB |
the covariance of |
diffABABse |
the standard error for |
diffABABpValue |
the p-value associated with the t-test used
for the hypothesis test that |
sameSurvey |
a logical value indicating if this line uses the same
survey as the reference line. Set to |
Paul Bailey, Trang Nguyen, and Huade Huo
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# find the mean score gap in the primer data between males and females
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female")

# find the score gap of the quartiles in the primer data between males and females
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female",
    percentiles=50)
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female",
    percentiles=c(25, 50, 75))

# find the percent proficient (or higher) gap in the primer data between males and females
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female",
    achievementLevel=c("Basic", "Proficient", "Advanced"))

# find the discrete achievement level gap--this is harder to interpret
gap(variable="composite", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female",
    achievementLevel="Proficient", achievementDiscrete=TRUE)

# find the percent talk about studies at home (b017451) never or hardly
# ever gap in the primer data between males and females
gap(variable="b017451", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female",
    targetLevel="Never or hardly ever")

# example showing how to compare multiple levels
gap(variable="b017451", data=sdf, groupA=dsex=="Male", groupB=dsex=="Female",
    targetLevel="Infrequently",
    recode=list(b017451=list(from=c("Never or hardly ever",
                                    "Once every few weeks",
                                    "About once a week"),
                             to=c("Infrequently"))))

# make subsets of sdf by scrpsu, "Scrambled PSU and school code"
sdfA <- subset(sdf, scrpsu %in% c(5,45,56))
sdfB <- subset(sdf, scrpsu %in% c(75,76,78))
sdfC <- subset(sdf, scrpsu %in% 100:200)
sdfD <- subset(sdf, scrpsu %in% 201:300)

sdfl <- edsurvey.data.frame.list(datalist=list(sdfA, sdfB, sdfC, sdfD),
                                 labels=c("A locations", "B locations",
                                          "C locations", "D locations"))

gap(variable="composite", data=sdfl, groupA=dsex=="Male", groupB=dsex=="Female",
    percentiles=c(50))
## End(Not run)

## Not run:
# example showing using linking error with gap
# load Grade 4 math data
# requires NAEP RUD license with these files in the folder the user is currently in
g4math2015 <- readNAEP("M46NT1AT.dat")
g4math2017 <- readNAEP("M48NT1AT.dat")
g4math2019 <- readNAEP("M50NT1AT.dat")

# make an edsurvey.data.frame.list from math grade 4 2015, 2017, and 2019 data
g4math <- edsurvey.data.frame.list(datalist=list(g4math2019, g4math2017, g4math2015),
                                   labels = c("2019", "2017", "2015"))

# gap analysis with linking error in variance estimation across surveys
gap(variable="composite", data=g4math, groupA=dsex=="Male", groupB=dsex=="Female",
    includeLinkingError=TRUE)
gap(variable="composite", data=g4math, groupA=dsex=="Male", groupB=dsex=="Female",
    percentiles = c(10, 25), includeLinkingError=TRUE)
gap(variable="composite", data=g4math, groupA=dsex=="Male", groupB=dsex=="Female",
    achievementDiscrete = TRUE, achievementLevel=c("Basic", "Proficient", "Advanced"),
    includeLinkingError=TRUE)
## End(Not run)
Retrieves the IRT item variable names associated with construct names for use with the mml.sdf function.
getAllItems(sdf, construct = NULL)
sdf |
an |
construct |
a character value (or vector) for which to return the associated item variable names. Default value is |
a character vector of the item names associated with the values in construct.
If construct is a vector, the item names for all of those constructs are returned. Use getAllItems with getData when creating a light.edsurvey.data.frame; see the example for use.
Tom Fink, Sun-Joo Lee, Eric Buehler, and Paul Bailey
## Not run: 
# TIMSS example
t15 <- readTIMSS(path="~/TIMSS/2015", "usa", 4)
showPlausibleValues(data=t15) # view constructs in console

# ensure we have all data needed for mml.sdf on light.edsurvey.data.frame
# must be specified ahead of time. the 'getAllItems' function makes this easy
mathItems <- getAllItems(sdf=t15, construct="mmat") # get mathematics items
sciItems <- getAllItems(sdf=t15, construct="ssci") # get science items
allItems <- getAllItems(sdf=t15, construct=NULL) # NULL (the default), not the string "NULL"
wgtVar <- "totwgt"
psustr <- c(getPSUVar(t15, wgtVar), getStratumVar(t15, wgtVar))
lsdf <- getData(data=t15, varnames=c("ROWID", "mmat", mathItems, psustr, wgtVar),
                dropOmittedLevels=FALSE, addAttributes=TRUE) # builds light.edsurvey.data.frame

# as a light.edsurvey.data.frame all elements must be present
mml.sdf(formula=mmat ~ 1, data=lsdf, weightVar="totwgt")

# as edsurvey.data.frame elements retrieved automatically for user
mml.sdf(formula=mmat ~ 1, data=t15, weightVar="totwgt")

# NAEP example
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))
allItems <- getAllItems(sdf=sdf, construct=NULL)
algebraItems <- getAllItems(sdf=sdf, construct="algebra")
## End(Not run)
Reads in selected columns to a data.frame
or a
light.edsurvey.data.frame
. On an edsurvey.data.frame
,
the data are stored on disk.
getData(
  data,
  varnames = NULL,
  drop = FALSE,
  dropUnusedLevels = TRUE,
  dropOmittedLevels = TRUE,
  defaultConditions = TRUE,
  formula = NULL,
  recode = NULL,
  includeNaLabel = FALSE,
  addAttributes = FALSE,
  returnJKreplicates = TRUE,
  omittedLevels = deprecated()
)
data |
an |
varnames |
a character vector of variable names that will be returned.
When both |
drop |
a logical value. When set to the default value of |
dropUnusedLevels |
a logical value. When set to the default value of
|
dropOmittedLevels |
a logical value. When set to the default value of
|
defaultConditions |
a logical value. When set to the default value of
|
formula |
a |
recode |
a list of lists to recode variables. Defaults to |
includeNaLabel |
a logical value to indicate if |
addAttributes |
a logical value set to |
returnJKreplicates |
a logical value indicating if JK replicate weights
should be returned. Defaults to |
omittedLevels |
this argument is deprecated. Use |
By default, an edsurvey.data.frame
does not have data read
into memory until getData
is called and returns a data frame.
This structure allows EdSurvey
to have a minimal memory footprint.
To keep the footprint small, you need to limit varnames
to just
the necessary variables.
There are two methods of attaching survey attributes to a data.frame
to make it usable by the functions in the EdSurvey
package (e.g., lm.sdf
):
(a) setting the addAttributes argument to TRUE in the call to getData
or (b) appending the attributes to the data frame with rebindAttributes.
When getData
is called, it returns a data frame. Setting the
addAttributes
argument to TRUE
adds the survey attributes and
changes the resultant data.frame
to a light.edsurvey.data.frame
.
Alternatively, a data.frame
can be coerced into a light.edsurvey.data.frame
using rebindAttributes
. See Examples in the rebindAttributes
documentation.
If both formula
and varnames
are populated, the
variables on both will be included.
See the vignette titled
Using the getData
Function in EdSurvey
for long-form documentation on this function.
When addAttributes
is FALSE
, getData
returns a
data.frame
containing data associated with the requested
variables. When addAttributes
is TRUE
, getData
returns a
light.edsurvey.data.frame
.
Tom Fink, Paul Bailey, and Ahmad Emad
## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# get two variables, without weights
df <- getData(data=sdf, varnames=c("dsex", "b017451"))
table(df)

# example of using recode
df2 <- getData(data=sdf, varnames=c("dsex", "t088301"),
               recode=list(t088301=list(from=c("Yes, available","Yes, I have access"),
                                        to=c("Yes")),
                           t088301=list(from=c("No, have no access"),
                                        to=c("No"))))
table(df2)

# when readNAEP is called on a data file, it appends a default
# condition to the edsurvey.data.frame. You can see these conditions
# by printing the sdf
sdf

# As per the default condition specified, getData restricts the data to only
# Reporting Sample. This behavior can be changed as follows:
df2 <- getData(data=sdf, varnames=c("dsex", "b017451"), defaultConditions = FALSE)
table(df2)

# similarly, the default behavior of omitting certain levels specified
# in the edsurvey.data.frame can be changed as follows:
df2 <- getData(data=sdf, varnames=c("dsex", "b017451"), dropOmittedLevels = FALSE)
table(df2)

# omittedLevels can also be edited with setAttributes()
# here, the omitted level "Multiple" is removed from the list
sdfIncludeMultiple <- setAttributes(data=sdf, attribute="omittedLevels",
                                    value=c(NA, "Omitted"))
# check that it was set
getAttributes(data=sdfIncludeMultiple, attribute="omittedLevels")
# notice that dropOmittedLevels is TRUE, removing NA and "Omitted" still
dfIncludeMultiple <- getData(data=sdfIncludeMultiple, varnames=c("dsex", "b017451"))
table(dfIncludeMultiple)

# the variable "c052601" is from the school-level data file; merging is handled automatically.
# returns a light.edsurvey.data.frame using addAttributes=TRUE argument
gddat <- getData(data=sdf, varnames=c("composite", "dsex", "b017451","c052601"),
                 addAttributes = TRUE)
class(gddat)
# look at the first few lines
head(gddat)

# get a selection of variables, recode using ifelse, and reappend attributes
# with rebindAttributes so that it can be used with EdSurvey analysis functions
df0 <- getData(data=sdf, varnames=c("composite", "dsex", "b017451", "origwt"))
df0$sex <- ifelse(df0$dsex=="Male", "boy", "girl")
df0 <- rebindAttributes(data=df0, attributeData=sdf)

# getting all the data can use up all the memory and is generally a bad idea
df0 <- getData(data=sdf, varnames=colnames(sdf),
               dropOmittedLevels=FALSE, defaultConditions=FALSE)
## End(Not run)
This function returns a data.frame object that defines the NHES Survey Codes and survey parameters that are compatible with the readNHES function.
The resulting data.frame object is useful for user reference or other advanced techniques.
getNHES_SurveyInfo()
Any changes or modifications to the data.frame
object will not change the behavior of readNHES
.
This function should be treated only as a read-only source of information.
Tom Fink
readNHES
, viewNHES_SurveyCodes
## Not run: 
# retrieves the NHES survey meta-data to a data.frame
surveyInfo <- getNHES_SurveyInfo()

# View the survey data where the year is equal to 2016 in RStudio
View(subset(surveyInfo, surveyInfo$Year==2016))
## End(Not run)
Gets the set of variables on an edsurvey.data.frame
, a light.edsurvey.data.frame
, or
an edsurvey.data.frame.list
associated with the given subject or subscale.
getPlausibleValue(var, data)
var |
a character vector naming the subject scale or subscale |
data |
an |
This function will return a set of plausible value names for variables that
hasPlausibleValue
returns as true.
a character vector of the set of variable names for the plausible values
Michael Lee and Paul Bailey
showPlausibleValues
, updatePlausibleValue
## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))
getPlausibleValue(var="composite", data=sdf)
## End(Not run)
Returns the jackknife replicate weights on an edsurvey.data.frame
, a light.edsurvey.data.frame
, or
an edsurvey.data.frame.list
associated with a weight variable.
getWeightJkReplicates(var, data)
var |
character indicating the name of the weight variable for which the jackknife replicate weights are desired |
data |
an |
a character vector of the jackknife replicate weights
Michael Lee and Paul Bailey
## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))
getWeightJkReplicates(var="origwt", data=sdf)
## End(Not run)
Fits a logit or probit model that uses weights and variance estimates
appropriate for the edsurvey.data.frame, the light.edsurvey.data.frame, or the edsurvey.data.frame.list.
glm.sdf(formula, family = binomial(link = "logit"), data, weightVar = NULL,
        relevels = list(), varMethod = c("jackknife", "Taylor"), jrrIMax = 1,
        dropOmittedLevels = TRUE, defaultConditions = TRUE, recode = NULL,
        returnNumberOfPSU = FALSE, returnVarEstInputs = FALSE,
        omittedLevels = deprecated())

logit.sdf(formula, data, weightVar = NULL, relevels = list(),
          varMethod = c("jackknife", "Taylor"), jrrIMax = 1,
          dropOmittedLevels = TRUE, defaultConditions = TRUE, recode = NULL,
          returnNumberOfPSU = FALSE, returnVarEstInputs = FALSE,
          omittedLevels = deprecated())

probit.sdf(formula, data, weightVar = NULL, relevels = list(),
           varMethod = c("jackknife", "Taylor"), jrrIMax = 1,
           dropOmittedLevels = TRUE, defaultConditions = TRUE, recode = NULL,
           returnNumberOfPSU = FALSE, returnVarEstInputs = FALSE,
           omittedLevels = deprecated())
formula |
a |
family |
the |
data |
an |
weightVar |
character indicating the weight variable to use (see Details).
The |
relevels |
a list; used to change the contrasts from the default treatment contrasts to the treatment contrasts with a chosen omitted group. The name of each element should be the variable name, and the value should be the group to be omitted. |
varMethod |
a character set to “jackknife” or “Taylor” that indicates the variance estimation method to be used. See Details. |
jrrIMax |
the |
dropOmittedLevels |
a logical value. When set to the default value of |
defaultConditions |
a logical value. When set to the default value of |
recode |
a list of lists to recode variables. Defaults to |
returnNumberOfPSU |
a logical value set to |
returnVarEstInputs |
a logical value set to |
omittedLevels |
this argument is deprecated. Use |
This function implements an estimator that correctly handles left-hand side
variables that are logical, allows for survey sampling weights, and estimates
variances using the jackknife replication or Taylor series.
The vignette titled
Statistical Methods Used in EdSurvey
describes estimation of the reported statistics and how it depends on varMethod
.
The coefficients are estimated using the sample weights according to the section “Estimation of Weighted Means When Plausible Values Are Not Present” or the section “Estimation of Weighted Means When Plausible Values Are Present,” depending on if there are assessment variables or variables with plausible values in them.
How the standard errors of the coefficients are estimated depends on the presence of plausible values (assessment variables), but once they are obtained, the t statistic is given by
$$t = \frac{\hat{\beta}}{\sqrt{\mathrm{var}(\hat{\beta})}}$$
where $\hat{\beta}$ is the estimated coefficient and $\mathrm{var}(\hat{\beta})$ is the variance of that estimate.
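As a minimal base R sketch of the t statistic described above (the coefficient divided by the square root of its variance), the following uses made-up numbers. The values of `beta_hat` and `var_beta` are hypothetical and are not output from glm.sdf.

```r
# Hypothetical estimates for illustration only; not produced by glm.sdf
beta_hat <- c(intercept = -0.42, dsexFemale = 0.15)    # estimated coefficients
var_beta <- c(intercept = 0.0025, dsexFemale = 0.0036) # variances of those estimates

se <- sqrt(var_beta)     # standard error is the square root of the variance
t_stat <- beta_hat / se  # t statistic for each coefficient
print(t_stat)
```

In practice these quantities would come from a fitted model (for example, the coef and se elements described under the return value) rather than being typed in by hand.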
logit.sdf
and probit.sdf
are included for convenience only;
they give the same results as a call to glm.sdf
with the binomial family
and the link function named in the function call (logit or probit).
By default, glm.sdf fits a logistic regression when family is not set,
so glm.sdf and logit.sdf are expected to give the same results in that case.
Other types of generalized linear models are not supported.
All variance estimation methods are shown in the vignette titled
Statistical Methods Used in EdSurvey.
When the predicted
value does not have plausible values and varMethod
is set to
jackknife
, the variance of the coefficients
is estimated according to the section
“Estimation of Standard Errors of Weighted Means When
Plausible Values Are Not Present, Using the Jackknife Method.”
When plausible values are present and varMethod
is set to
jackknife
, the
variance of the coefficients is estimated according to the section
“Estimation of Standard Errors of Weighted Means When
Plausible Values Are Present, Using the Jackknife Method.”
When the predicted
value does not have plausible values and varMethod
is set to
Taylor
, the variance of the coefficients
is estimated according to the section
“Estimation of Standard Errors of Weighted Means When
Plausible Values Are Not Present, Using the Taylor Series Method.”
When plausible values are present and varMethod
is set to
Taylor
, the
variance of the coefficients is estimated according to the section
“Estimation of Standard Errors of Weighted Means When
Plausible Values Are Present, Using the Taylor Series Method.”
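The jackknife idea referenced above can be sketched in base R on simulated data. This is a simplified illustration under stated assumptions, not the EdSurvey implementation: the data, the 20 replicate groups, and the delete-one-group replicate weights are all made up, and real survey files ship their own replicate weights. The exact formulas used by the package are in the vignette cited above.

```r
set.seed(1)
n <- 200
dat <- data.frame(
  y   = rbinom(n, 1, 0.5),       # simulated binary outcome
  x   = rnorm(n),                # simulated predictor
  w   = runif(n, 0.5, 2),        # simulated full-sample weight
  grp = rep(1:20, each = 10)     # hypothetical replicate groups
)

# Weighted logit coefficients for a given weight vector.
# quasibinomial avoids warnings about non-integer weights.
fit_coef <- function(weights) {
  d <- dat
  d$wt <- weights
  coef(glm(y ~ x, family = quasibinomial(link = "logit"), data = d, weights = wt))
}

full <- fit_coef(dat$w)  # estimate with the full-sample weight

# One replicate per group: set that group's weight to zero and refit
reps <- sapply(1:20, function(g) fit_coef(ifelse(dat$grp == g, 0, dat$w)))

# Sum of squared deviations of the replicate estimates from the full estimate
v_jrr  <- rowSums((reps - full)^2)
se_jrr <- sqrt(v_jrr)
print(se_jrr)
```

The point of the sketch is the shape of the calculation: the model is refit once per replicate weight, and the spread of those refits around the full-sample estimate gives the sampling variance.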
An edsurveyGlm
with the following elements:
call |
the function call |
formula |
the formula used to fit the model |
coef |
the estimates of the coefficients |
se |
the standard error estimates of the coefficients |
Vimp |
the estimated variance caused by uncertainty in the scores (plausible value variables) |
Vjrr |
the estimated variance from sampling |
M |
the number of plausible values |
nPSU |
the number of PSUs used in the calculation |
varm |
the variance estimates under the various plausible values |
coefm |
the values of the coefficients under the various plausible values |
coefmat |
the coefficient matrix (typically produced by the summary of a model) |
weight |
the name of the weight variable |
npv |
the number of plausible values |
njk |
the number of the jackknife replicates used |
varMethod |
always |
varEstInputs |
when |
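To show how a sampling variance and a plausible-value (imputation) variance might be combined into one standard error, here is a minimal base R sketch. It assumes the common multiple-imputation combining rule, total variance = sampling variance + (1 + 1/M) times the between-plausible-value variance. Whether the Vimp element already includes the (1 + 1/M) factor is not stated here, so treat that as an assumption; all numbers are hypothetical.

```r
# Hypothetical estimates of one coefficient, one per plausible value (M = 5)
coefm <- c(2.10, 2.25, 1.95, 2.05, 2.15)
M <- length(coefm)

coef_hat   <- mean(coefm)  # reported coefficient: average over plausible values
v_between  <- var(coefm)   # between-plausible-value variance
v_sampling <- 0.04         # hypothetical sampling (jackknife) variance

# Assumed combining rule: sampling variance plus inflated between-PV variance
v_total <- v_sampling + (1 + 1 / M) * v_between
se <- sqrt(v_total)
print(c(coef = coef_hat, se = se))
```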
Of the common hypothesis tests for joint parameter testing, only the Wald
test is widely used with plausible values and sample weights. As such, it
replaces, if imperfectly, the Akaike Information Criterion (AIC), the
likelihood ratio test, chi-squared, and analysis of variance (ANOVA, including F-tests).
See waldTest
or
the vignette titled
Methods and Overview of Using EdSurvey for Running Wald Tests.
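For intuition about what a Wald test of a joint hypothesis computes, the following base R sketch evaluates the textbook chi-square form on made-up inputs. It is not a description of what waldTest does internally, and the coefficient vector `b` and covariance matrix `V` are hypothetical.

```r
# Hypothetical coefficients under test and their covariance matrix
b <- c(0.30, -0.20)
V <- matrix(c(0.010, 0.002,
              0.002, 0.020), nrow = 2, byrow = TRUE)

# Wald statistic for H0: all tested coefficients are zero
W <- as.numeric(t(b) %*% solve(V) %*% b)

# Under H0 the statistic is compared with a chi-square distribution
# whose degrees of freedom equal the number of tested coefficients
p_value <- pchisq(W, df = length(b), lower.tail = FALSE)
print(c(W = W, p = p_value))
```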
Paul Bailey
## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# by default uses the jackknife variance method using replicate weights
table(sdf$b013801)
# create a binary variable for 26 or more books
sdf$b013801_26more <- ifelse(sdf$b013801 %in% c("26-100", ">100"), yes = 1, no = 0)
# compare the multiple categorical and binary variable for accuracy
table(sdf$b013801, sdf$b013801_26more)
logit1 <- logit.sdf(formula=b013801_26more ~ dsex + b017451, data=sdf)
# use summary to get detailed results
summary(logit1)

# Taylor series variance estimation
logit1t <- logit.sdf(formula=b013801_26more ~ dsex + b017451, data=sdf,
                     varMethod="Taylor")
summary(logit1t)

# when using ifelse for PVs, use the ifelse in the formula call. PVs contain multiple variables
logit2 <- logit.sdf(formula=ifelse(composite >= 300, yes = 1, no = 0) ~ dsex + b013801,
                    data=sdf)
summary(logit2)

# note this recoding of composite must be done in the formula
logit3 <- glm.sdf(formula=I(composite >= 300) ~ dsex + b013801, data=sdf,
                  family=quasibinomial(link="logit"))
# Wald test for joint hypothesis that all coefficients in b013801 are zero
waldTest(model=logit3, coefficients="b013801")
summary(logit3)

# use plausible values as predictors in a generalized linear regression model
# ifelse function converts the selected categories to 1 and all the others including
# Multiple and Omitted levels to 0
sdf$AlgebraClass <- ifelse(sdf$m815701 %in% c('Algebra I (1-yr crs)', '1st yr 2-yr Algeb I',
                                              '2nd yr 2-yr Algeb I', 'Algebra II'), 1, 0)
table(sdf$m815701, sdf$AlgebraClass)
logit4 <- logit.sdf(formula = AlgebraClass ~ algebra, weightVar = 'origwt', data = sdf)
summary(logit4)

# alternatively, same analyses can be executed using the I() function with
# dropOmittedLevels = FALSE
logit5 <- logit.sdf(I(m815701 %in% c('Algebra I (1-yr crs)', '1st yr 2-yr Algeb I',
                                     '2nd yr 2-yr Algeb I', 'Algebra II')) ~ algebra,
                    weightVar = 'origwt', data = sdf, dropOmittedLevels = FALSE)
summary(logit5)
## End(Not run)
Returns a value indicating if this variable has associated plausible values in an edsurvey.data.frame
, a light.edsurvey.data.frame
, or an edsurvey.data.frame.list
.
hasPlausibleValue(var, data)
var |
a character indicating the variable in question |
data |
an |
This function returns TRUE
only when the variable passed to it is the name for a set of plausible values but
not if it is an individual plausible value from such a set. Thus, on the NAEP Primer, composite
has plausible
values (and so TRUE
would be returned by this function), but any of the plausible values or variable names defined in
the actual data (such as "mrpcm1"
or "dsex"
) are not.
a Boolean (or vector when var
is a vector) indicating if each element of var
has
plausible values associated with it
Michael Lee and Paul Bailey
## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# TRUE
hasPlausibleValue(var="composite", data=sdf)

# FALSE
hasPlausibleValue(var="dsex", data=sdf)
## End(Not run)
Returns logical values indicating whether a vector of variables is a weight for an edsurvey.data.frame
, a light.edsurvey.data.frame
, or an edsurvey.data.frame.list
.
isWeight(var, data)
var |
a character vector of variables |
data |
an |
Note that this function returns TRUE
only when the var
element is the name of the weight used
for making estimates but not if it is one of the individual jackknife replicates.
a logical vector of values indicating if each element of var
is a weight
Michael Lee and Paul Bailey
## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# TRUE
isWeight(var="origwt", data=sdf)

# FALSE
isWeight(var="dsex", data=sdf)
## End(Not run)
Retrieve the levels and labels of a variable from an edsurvey.data.frame
, a light.edsurvey.data.frame
, or an edsurvey.data.frame.list
.
levelsSDF(varnames, data, showOmitted = TRUE, showN = TRUE)
varnames |
a vector of character strings to search for in the database connection object ( |
data |
an |
showOmitted |
a Boolean indicating if omitted levels should be shown |
showN |
a Boolean indicating if (unweighted) n-sizes should be shown for each response level |
Michael Lee and Paul Bailey
## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# search variables in the sdf
levelsSDF(varnames="pared", data=sdf)

# search multiple variables
levelsSDF(varnames=c("pared","ell3"), data=sdf)

# search multiple variables in a light.edsurvey.data.frame with recodes
df2 <- getData(data=sdf, varnames=c("dsex", "t088301"),
               recode=list(t088301=list(from=c("Yes, available","Yes, I have access"),
                                        to=c("Yes")),
                           t088301=list(from=c("No, have no access"),
                                        to=c("No"))),
               addAttributes=TRUE)
levelsSDF(varnames=c("dsex","t088301"), data=df2)
## End(Not run)
Fits a linear model that uses weights and variance estimates appropriate for the data.
lm.sdf(formula, data, weightVar = NULL, relevels = list(),
       varMethod = c("jackknife", "Taylor"), jrrIMax = 1,
       dropOmittedLevels = TRUE, defaultConditions = TRUE, recode = NULL,
       returnVarEstInputs = FALSE, returnNumberOfPSU = FALSE,
       standardizeWithSamplingVar = FALSE, verbose = TRUE,
       omittedLevels = deprecated())
formula |
a |
data |
an |
weightVar |
a character indicating the weight variable to use (see Details).
The |
relevels |
a list. Used to change the contrasts from the default treatment contrasts to the treatment contrasts with a chosen omitted group (the reference group). The name of each element should be the variable name, and the value should be the group to be omitted (the reference group). |
varMethod |
a character set to “jackknife” or “Taylor” that indicates the variance estimation method to be used. See Details. |
jrrIMax |
a numeric value; when using the jackknife variance estimation method, the default estimation option, |
dropOmittedLevels |
a logical value. When set to the default value of |
defaultConditions |
a logical value. When set to the default value of |
recode |
a list of lists to recode variables. Defaults to |
returnVarEstInputs |
a logical value set to |
returnNumberOfPSU |
a logical value set to |
standardizeWithSamplingVar |
a logical value indicating if the standardized coefficients
should have the variance of the regressors and outcome measured
with sampling variance. Defaults to |
verbose |
logical; indicates whether a detailed printout should display during execution |
omittedLevels |
this argument is deprecated. Use |
This function implements an estimator that correctly handles left-hand side variables that are either numeric or plausible values, allows for survey sampling weights, and estimates variances using the jackknife replication method or the Taylor series method (see varMethod). The vignette titled Statistical Methods Used in EdSurvey describes estimation of the reported statistics.
Regardless of the variance estimation, the coefficients are estimated using the sample weights according to the sections “Estimation of Weighted Means When Plausible Values Are Not Present” or “Estimation of Weighted Means When Plausible Values Are Present,” depending on if there are assessment variables or variables with plausible values in them.
How the standard errors of the coefficients are estimated depends on the
value of varMethod and the presence of plausible values (assessment variables),
but once they are obtained, the t statistic is given by
$$t = \frac{\hat{\beta}}{\sqrt{\mathrm{var}(\hat{\beta})}}$$
where $\hat{\beta}$ is the estimated coefficient and $\mathrm{var}(\hat{\beta})$ is
the variance of that estimate.
The coefficient of determination (R-squared value) is similarly estimated by averaging the R-squared values across the plausible values.
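The averaging across plausible values described above can be sketched in base R with simulated data. This is an illustration only: the five plausible values in `pvs`, the predictor `x`, and the weight `w` are simulated, and stats::lm with a weights argument stands in for the package's own estimator.

```r
set.seed(42)
n <- 300
x <- rnorm(n)           # simulated predictor
w <- runif(n, 0.5, 2)   # simulated sampling weight

# Five simulated plausible values for one hypothetical score
pvs <- sapply(1:5, function(m) 250 + 10 * x + rnorm(n, sd = 30))

# Fit the same weighted regression once per plausible value
fits <- lapply(1:5, function(m) lm(pvs[, m] ~ x, weights = w))

coefm    <- sapply(fits, coef)   # one column of coefficients per plausible value
coef_hat <- rowMeans(coefm)      # reported coefficients: average across plausible values
r2       <- mean(sapply(fits, function(f) summary(f)$r.squared))  # average R-squared

print(coef_hat)
print(r2)
```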
Standardized regression coefficients can be returned in a call to summary
,
by setting the argument src
to TRUE
. See Examples.
By default, the standardized coefficients are calculated using standard
deviations of the variables themselves, including averaging the standard
deviation across any plausible values. When standardizeWithSamplingVar
is set to TRUE
, the variance of the standardized coefficient is
calculated similarly to that of a regression coefficient and therefore includes the
sampling variance in the variance estimate of the outcome variable.
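As a rough illustration of the default standardization described above (scaling a coefficient by the standard deviations of the variables), the following base R sketch uses simulated, unweighted data with a single predictor. It is not the package's code; the weighting, plausible-value averaging, and sampling-variance option are deliberately left out.

```r
set.seed(7)
x <- rnorm(100, mean = 50, sd = 10)      # simulated predictor
y <- 3 + 0.5 * x + rnorm(100, sd = 5)    # simulated outcome

fit <- lm(y ~ x)
b <- coef(fit)[["x"]]                    # unstandardized slope

# Standardized coefficient: slope rescaled by sd(x) / sd(y)
b_std <- b * sd(x) / sd(y)
print(b_std)
```

With one predictor, the standardized slope equals the correlation between x and y, which makes a convenient sanity check for the calculation.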
All variance estimation methods are shown in the vignette titled
Statistical Methods Used in EdSurvey.
When varMethod is set to the jackknife and the predicted value does not have plausible values, the variance of the coefficients is estimated according to the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Not Present, Using the Jackknife Method.”
When plausible values are present and varMethod is jackknife, the variance of the coefficients is estimated according to the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Present, Using the Jackknife Method.”
When plausible values are not present and varMethod is Taylor, the variance of the coefficients is estimated according to the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Not Present, Using the Taylor Series Method.”
When plausible values are present and varMethod is Taylor, the variance of the coefficients is estimated according to the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Present, Using the Taylor Series Method.”
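To give a feel for the jackknife case without plausible values, the following is a generic paired-jackknife sketch on simulated data. Everything here is an assumption made for illustration: the zone/PSU layout, the way replicate weights are built (one PSU zeroed, the other doubled), and the variable names. Real survey files ship their own replicate weights, and the vignette remains the authority on what EdSurvey computes.

```r
set.seed(1)
n_pairs <- 20                         # hypothetical number of jackknife zones
dat <- data.frame(
  zone = rep(1:n_pairs, each = 10),   # zone (a pair of PSUs)
  psu  = rep(rep(1:2, each = 5), times = n_pairs),
  x    = rnorm(n_pairs * 10)
)
dat$y <- 1 + 2 * dat$x + rnorm(nrow(dat))
dat$w <- runif(nrow(dat), 0.5, 1.5)   # full-sample weight

# full-sample weighted estimate
b0 <- coef(lm(y ~ x, data = dat, weights = w))

# one replicate per zone: zero out PSU 1 of that zone, double PSU 2
b_rep <- sapply(1:n_pairs, function(j) {
  wj <- dat$w
  in_zone <- dat$zone == j
  wj[in_zone & dat$psu == 1] <- 0
  wj[in_zone & dat$psu == 2] <- 2 * wj[in_zone & dat$psu == 2]
  coef(lm(y ~ x, data = dat, weights = wj))
})

# jackknife variance: squared deviations from the full-sample estimate, summed
v_jrr <- rowSums((b_rep - b0)^2)
se_jrr <- sqrt(v_jrr)
```

Each column of b_rep holds the coefficients under one replicate weight, so the row sums give one variance per coefficient.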
An edsurvey.lm with the following elements:
call |
the function call |
formula |
the formula used to fit the model |
coef |
the estimates of the coefficients |
se |
the standard error estimates of the coefficients |
Vimp |
the estimated variance from uncertainty in the scores (plausible value variables) |
Vjrr |
the estimated variance from sampling |
M |
the number of plausible values |
varm |
the variance estimates under the various plausible values |
coefm |
the values of the coefficients under the various plausible values |
coefmat |
the coefficient matrix (typically produced by the summary of a model) |
r.squared |
the coefficient of determination |
weight |
the name of the weight variable |
npv |
the number of plausible values |
jrrIMax |
the |
njk |
the number of the jackknife replicates used; set to |
varMethod |
one of |
residuals |
residuals from the average regression coefficients |
PV.residuals |
residuals from the by plausible value coefficients |
PV.fitted.values |
fitted values from the by plausible value coefficients |
B |
imputation variance covariance matrix, before multiplication by (M+1)/M |
U |
sampling variance covariance matrix |
rbar |
average relative increase in variance; see van Buuren (2012, eq. 2.29) |
nPSU |
number of PSUs used in calculation |
n0 |
number of rows on an |
nUsed |
number of observations with valid data and weights larger than zero |
data |
data used for the computation |
Xstdev |
standard deviations of regressors, used for computing standardized
regression coefficients when |
varSummary |
the result of running |
varEstInputs |
when |
standardizeWithSamplingVar |
when |
Of the common hypothesis tests for joint parameter testing, only the Wald test is widely used with plausible values and sample weights. As such, it replaces, if imperfectly, the Akaike information criterion (AIC), the likelihood ratio test, chi-squared, and analysis of variance (ANOVA, including F-tests). See waldTest or the vignette titled Methods and Overview of Using EdSurvey for Running Wald Tests.
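For readers unfamiliar with the Wald test, here is a minimal base R sketch of a joint test that several coefficients are zero. It uses an ordinary lm fit and vcov on mtcars purely for illustration. With survey data the covariance matrix would instead come from the jackknife or Taylor series estimate, and waldTest handles those details.

```r
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# the coefficients tested jointly: all levels of factor(cyl)
idx <- grep("^factor\\(cyl\\)", names(coef(fit)))
b <- coef(fit)[idx]
V <- vcov(fit)[idx, idx]

# Wald statistic for H0: all selected coefficients are zero
W <- as.numeric(t(b) %*% solve(V) %*% b)

# chi-squared reference distribution with one degree of freedom per restriction
p_chisq <- pchisq(W, df = length(b), lower.tail = FALSE)
```

For an ordinary least squares fit, W divided by the number of restrictions equals the usual F statistic from comparing nested models, which is a convenient sanity check.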
Paul Bailey
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
van Buuren, S. (2012). Flexible imputation of missing data. New York, NY: CRC Press.
Weisberg, S. (1985). Applied linear regression (2nd ed.). New York, NY: Wiley.
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# by default uses jackknife variance method using replicate weights
lm1 <- lm.sdf(formula=composite ~ dsex + b017451, data=sdf)
lm1

# the summary function displays detailed results
summary(lm1)

# to show standardized regression coefficients
summary(lm1, src=TRUE)

# to specify a variance method, use varMethod
lm2 <- lm.sdf(formula=composite ~ dsex + b017451, data=sdf, varMethod="Taylor")
lm2
summary(lm2)

# use relevel to set a new omitted category
lm3 <- lm.sdf(formula=composite ~ dsex + b017451, data=sdf, relevels=list(dsex="Female"))
summary(lm3)
# test of a simple joint hypothesis
waldTest(lm3, "b017451")

# use recode to change values for specified variables
lm4 <- lm.sdf(formula=composite ~ dsex + b017451, data=sdf,
              recode=list(b017451=list(from=c("Never or hardly ever",
                                              "Once every few weeks",
                                              "About once a week"),
                                       to=c("Infrequently")),
                          b017451=list(from=c("2 or 3 times a week","Every day"),
                                       to=c("Frequently"))))
# Note: "Infrequently" is the dropped level for the recoded b017451
summary(lm4)

# use plausible values as predictors in a linear regression model
lm5 <- lm.sdf(formula=algebra ~ dsex + geometry, data=sdf)
lm5
summary(lm5)
## End(Not run)
Takes a data.frame or a light.edsurvey.data.frame and merges it with an edsurvey.data.frame into its internal data cache.
## S3 method for class 'edsurvey.data.frame'
merge(x, y, by = "id", by.x = by, by.y = by, ...)
x |
a |
y |
either a |
by |
the column name(s) to perform the data merge operation. If differing column names between the |
by.x |
the column name(s) to perform the data merge operation for the |
by.y |
the column name(s) to perform the data merge operation for the |
... |
arguments passed to merge, note that |
a merged dataset of the same object type as x. For edsurvey.data.frame objects, the resulting merged data are stored in the object's internal data cache.
Tom Fink
## Not run:
# read in NAEP primer data
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
lsdf <- getData(data=sdf, varnames=c("dsex", "b017451"), addAttributes = TRUE)
df <- data.frame(dsex = c("Male","Female"), dsex2 = c("Boy","Girl"))

# merging an edsurvey.data.frame with a data.frame/light.edsurvey.data.frame
# returns an edsurvey.data.frame object
sdf2 <- merge(sdf, df, by = "dsex")
table(sdf2$dsex2)

# merging a light.edsurvey.data.frame with a data.frame
# returns a light.edsurvey.data.frame object
merged_lsdf <- merge(lsdf,df, by = "dsex")
class(merged_lsdf) # "light.edsurvey.data.frame" "data.frame"
head(merged_lsdf) # shows merge results

# merging behaves similarly to base::merge
df2 <- data.frame(dsex = c("Male","Female"), b017451 = c(1,2))
merged_lsdf2 <- merge(lsdf,df2, by = "dsex")
names(merged_lsdf2) # "dsex" "b017451.x" "b017451.y"
head(merged_lsdf2) # shows merge results
## End(Not run)
More verbose merge function
mergev( x, y, by = NULL, by.x = NULL, by.y = NULL, all.x = NULL, all.y = NULL, all = FALSE, order = c("sort", "unsorted", "x", "y"), fast = FALSE, merge.type.colname = "merge.type", return.list = FALSE, verbose = TRUE, showWarnings = TRUE, ... )
x |
first data.frame to merge, same as in |
y |
second data.frame to merge, same as in |
by |
character vector of column names to merge by. When |
by.x |
character vector of column names on |
by.y |
character vector of column names on |
all.x |
logical value indicating if unmerged rows from |
all.y |
logical value indicating if unmerged rows from |
all |
logical value indicating if unmerged rows from |
order |
character string from "sort", "unsorted", "x", and "y".
Specifies the order of the output. Setting this to "sort"
gives the same result as |
fast |
logical value indicating if |
merge.type.colname |
character indicating the column name of the resulting merge type column. See description. |
return.list |
logical value indicating if the merged data.frame and verbose output should be returned as elements of a list. Defaults to FALSE where the function simply returns a data.frame. |
verbose |
logical value indicating if output should be reported. Defaults to TRUE. Useful for testing. |
showWarnings |
logical value to output warning messages (TRUE) or suppress (FALSE). Defaults to TRUE. |
... |
additional parameters passed to merge. |
This is a wrapper for the base package merge function that prints out verbose information about the merge, including the merge type (one/many to one/many), the overlapping column names that will have suffixes applied, the number of rows and the number of unique keys that are in each dataset and in the resulting dataset.
Also gives more detailed errors when, for example, the columns named in the by argument are not on the x or y data.frames.
Depends on the value of return.list.
When return.list is FALSE, returns a data.frame.
When return.list is TRUE, returns a list with two elements. The first is the same data.frame result. The second is a list with the values that were printed out. Elements include merge.type, with two elements, each "one" or "many", indicating the merge type for x and y, respectively; inBoth, the list of column names in both merged data.frames; and merge.matrix, the matrix printed out by this function.
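The merge type and overlapping-column information described above can be approximated with base R alone. This sketch is not the mergev implementation. The helper key_type and the toy data frames are hypothetical and only illustrate how a "one" versus "many" classification might be derived from the key columns.

```r
x <- data.frame(id = c(1, 2, 2, 3), a = c("p", "q", "r", "s"))
y <- data.frame(id = c(1, 2, 4),    b = c(10, 20, 40))
by <- "id"

# "one" if the key identifies rows uniquely, otherwise "many"
key_type <- function(df, by) {
  if (anyDuplicated(df[, by, drop = FALSE]) > 0) "many" else "one"
}
merge_type <- c(x = key_type(x, by), y = key_type(y, by))

# column names present in both inputs
in_both <- intersect(names(x), names(y))

# outer merge keeps unmatched rows from both sides
merged <- merge(x, y, by = by, all = TRUE)
```

Here x has a repeated key and y does not, so the merge is many-to-one, and the outer merge keeps the unmatched keys 3 and 4.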
Fits a linear weighted mixed-effects model.
mixed.sdf( formula, data, weightVars = NULL, weightTransformation = TRUE, recode = NULL, defaultConditions = TRUE, tolerance = 0.01, nQuad = NULL, verbose = 0, family = NULL, centerGroup = NULL, centerGrand = NULL, fast = FALSE, ... )
formula |
a |
data |
an |
weightVars |
character vector indicating weight variables for
corresponding levels to use. The |
weightTransformation |
a logical value to indicate whether the function
should standardize weights before using it in the
multilevel model. If set to |
recode |
a list of lists to recode variables. Defaults to |
defaultConditions |
a logical value. When set to the default value of
|
tolerance |
deprecated, no effect |
nQuad |
deprecated, no effect |
verbose |
an integer; when set to |
family |
this argument is deprecated; please use the |
centerGroup |
a list in which the name of each element is the name of the aggregation level,
and the element is a formula of variable names to be group mean centered. For example, to group mean center
gender and age within the group student: |
centerGrand |
a formula of variable names to be grand mean centered. For example, to center the
variable education by overall mean of education: |
fast |
deprecated, no effect |
... |
other potential arguments to be used in |
This function uses the mix call in the WeMix package to fit mixed models.
When the outcome does not have plausible values, the variance estimates taken directly from the mix function are used; these account for covariance at the top level of the model specified by the user.
When the outcome has plausible values, the coefficients are estimated in the same way as in lm.sdf, that is, averaged across the plausible values. In addition, the variance of the coefficients is estimated as the sum of the variance estimate from the mix function and the imputation variance. The formula for the imputation variance is, again, the same as for lm.sdf, with the same estimators as in the vignette titled Statistical Methods Used in EdSurvey. In the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Present, Using the Jackknife Method,” in the formula for $V_{imp}$, the variance and estimates of the variance components are estimated with the same formulas as the regression coefficients.
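The combination across plausible values can be sketched as follows. The per-plausible-value estimates and variances are invented numbers. Averaging the model-based variances across plausible values is an assumption of this sketch, while the (M + 1)/M scaling of the between-value variance follows the description of the B element of lm.sdf results.

```r
# Hypothetical per-plausible-value results for one coefficient (M = 5)
coef_m <- c(1.9, 2.1, 2.0, 2.2, 1.8)             # estimate under each plausible value
var_m  <- c(0.040, 0.050, 0.045, 0.050, 0.040)   # model-based variance under each

M <- length(coef_m)
coef_bar <- mean(coef_m)   # point estimate: average across plausible values

# imputation variance: between-value variance scaled by (M + 1) / M
B <- var(coef_m)
v_imp <- (M + 1) / M * B

# total variance: model-based variance plus imputation variance
v_total <- mean(var_m) + v_imp
se <- sqrt(v_total)
```

With these numbers the imputation variance is 0.03 and the total variance is 0.075, so the uncertainty in the scores contributes a substantial share of the standard error.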
A mixedSdfResults
object with the following elements:
call |
the original call used in |
formula |
the formula used to fit the model |
coef |
a vector of coefficient estimates |
se |
a vector with the standard error estimates of the coefficients and the standard error of the variance components |
vars |
estimated variance components of the model |
levels |
the number of levels in the model |
ICC |
the intraclass correlation coefficient of the model |
npv |
the number of plausible values |
ngroups |
a |
n0 |
the number of observations in the original data |
nused |
the number of observations used in the analysis |
model.frame |
the data used in the model |
If the formula does not involve plausible values, the function will return the following additional elements:
lnlf |
the likelihood function |
lnl |
the log-likelihood of the model |
If the formula involves plausible values, the function will return the following additional elements:
Vimp |
the estimated variance from uncertainty in the scores |
Vjrr |
the estimated variance from sampling |
Paul Bailey, Trang Nguyen, and Claire Kelley
Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4), 805–827.
## Not run:
# save TIMSS 2015 data to ~/TIMSS/2015
downloadTIMSS(root="~/", years=2015)
fin <- readTIMSS(path="~/TIMSS/2015", countries="fin", gradeLvl=4)

# uses all plausible values
mix1 <- mixed.sdf(formula=mmat ~ itsex + (1|idschool), data = fin,
                  weightVars=c("totwgt","schwgt"), weightTransformation=FALSE)
summary(mix1)

# uses only one plausible value
mix2 <- mixed.sdf(formula=asmmat01 ~ itsex + (1|idschool), data = fin,
                  weightVars=c("totwgt","schwgt"), weightTransformation=FALSE)
summary(mix2)
## End(Not run)
Prepare IRT parameters and score items and then estimate a linear model with direct estimation.
mml.sdf( formula, data, weightVar = NULL, dropOmittedLevels = TRUE, composite = TRUE, verbose = 0, multiCore = FALSE, numberOfCores = NULL, minNode = -4, maxNode = 4, Q = 34, idVar = NULL, returnMmlCall = FALSE, omittedLevels = deprecated() )
formula |
a |
data |
an |
weightVar |
a character indicating the weight variable to use.
The |
dropOmittedLevels |
a logical value. When set to the value of |
composite |
logical; for a NAEP composite, setting to |
verbose |
logical; indicates whether a detailed printout should display during execution, only for NAEP data. |
multiCore |
allows the |
numberOfCores |
the number of cores to be used when using |
minNode |
numeric; minimum integration point in direct estimation; see |
maxNode |
numeric; maximum integration point in direct estimation; see |
Q |
integer; number of integration points per student used when integrating over the levels of the latent outcome construct. |
idVar |
a variable that is used to explicitly define the name of the student identifier
variable to be used from |
returnMmlCall |
logical; when |
omittedLevels |
this argument is deprecated. Use |
Typically, models are fit with NAEP data using plausible values to integrate out the uncertainty in the measurement of individual student outcomes. When direct estimation is used, the measurement error is integrated out explicitly using Q quadrature points. See documentation for mml in the Dire package.
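To illustrate what integrating over Q quadrature points means, the sketch below evaluates a marginal likelihood for one hypothetical student on a grid running from minNode to maxNode. The equally spaced grid, the two-parameter logistic item model, the standard normal prior, and all item parameters are assumptions made for illustration and are not taken from the Dire package.

```r
min_node <- -4; max_node <- 4; Q <- 34   # defaults shown in the usage above
nodes <- seq(min_node, max_node, length.out = Q)
width <- nodes[2] - nodes[1]

# hypothetical student: three dichotomous items under a 2PL model
a <- c(1.0, 1.2, 0.8)    # discriminations (made up)
d <- c(-0.5, 0.0, 0.5)   # difficulties (made up)
resp <- c(1, 1, 0)       # scored responses

# likelihood of the observed responses at each node
lik <- sapply(nodes, function(theta) {
  p <- plogis(a * (theta - d))
  prod(p^resp * (1 - p)^(1 - resp))
})

# marginal likelihood: likelihood times prior density, summed over the grid
prior <- dnorm(nodes)
marginal <- sum(lik * prior * width)
```

More nodes give a finer approximation of the integral at the cost of more computation per student.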
The scoreDict helps turn response categories that are not simple item responses, such as Not Reached and Multiple, into values coded as inputs for the mml function in Dire. How mml treats these values depends on the test. For NAEP, for a dichotomous item, 8 is scored as the same proportion correct as the guessing parameter for that item, 0 is an incorrect response, an NA does not change the student's score, and 1 is correct. TIMSS does not require a scoreDict.
An mml.sdf object, which is the outcome from mml.sdf, with the following elements:
mml |
an object containing information from the |
scoreDict |
the scoring used in the |
.
itemMapping |
the item mapping used in the |
.
Cohen, J., & Jiang, T. (1999). Comparison of partially measured latent traits across nominal subgroups. Journal of the American Statistical Association, 94(448), 1035–1044. https://doi.org/10.2307/2669917
## Not run:
## Direct Estimation with NAEP
# Load data
sdfNAEP <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# Inspect scoring guidelines
defaultNAEPScoreCard()
# example output:
#         resCat pointMult pointConst
# 1     Multiple         8          0
# 2  Not Reached        NA         NA
# 3      Missing        NA         NA
# 4      Omitted         8          0
# 5    Illegible         0          0
# 6 Non-Rateable         0          0
# 7     Off Task         0          0

# Run NAEP model, warnings are about item codings
mmlNAEP <- mml.sdf(formula=algebra ~ dsex + b013801, data=sdfNAEP, weightVar='origwt')

# Call with Taylor
summary(mmlNAEP, varType="Taylor", strataVar="repgrp1", PSUVar="jkunit")

## Direct Estimation with TIMSS
# Load data
downloadTIMSS("~/", year=2015)
sdfTIMSS <- readTIMSS(path="~/TIMSS/2015", countries="usa", grade = "4")

# Run TIMSS model, warnings are about item codings
mmlTIMSS <- mml.sdf(formula=mmat ~ itsex + asbg04, data=sdfTIMSS, weightVar='totwgt')

# Call with Taylor
summary(mmlTIMSS, varType="Taylor", strataVar="jkzone", PSUVar="jkrep")
## End(Not run)
Fits a multivariate linear model that uses weights and variance estimates appropriate for the edsurvey.data.frame.
mvrlm.sdf( formula, data, weightVar = NULL, relevels = list(), jrrIMax = 1, dropOmittedLevels = TRUE, defaultConditions = TRUE, recode = NULL, returnVarEstInputs = FALSE, estMethod = "OLS", verbose = TRUE, omittedLevels = deprecated() )
formula |
a |
data |
an |
weightVar |
character indicating the weight variable to use (see Details).
The |
relevels |
a list. Used to change the contrasts from the default treatment contrasts to treatment contrasts with a chosen omitted group (the reference group). To do this, the user puts an element on the list with the same name as a variable to change contrasts on and then makes the value for that list element equal to the value that should be the omitted group (the reference group). |
jrrIMax |
a numeric value; when using the jackknife variance estimation method, the default estimation option, |
dropOmittedLevels |
a logical value. When set to the default value of |
defaultConditions |
a logical value. When set to the default value of |
recode |
a list of lists to recode variables. Defaults to |
returnVarEstInputs |
a logical value. Set to |
estMethod |
a character value indicating which estimation method to use.
Default is |
verbose |
logical; indicates whether a detailed printout should display during execution |
omittedLevels |
this argument is deprecated. Use |
This function implements an estimator that correctly handles multiple left-hand side variables that are either numeric or plausible values, allows for survey sampling weights, and estimates variances using the jackknife replication method. The vignette titled Statistical Methods Used in EdSurvey describes estimation of the reported statistics.
The coefficients are estimated using the sample weights according to the section “Estimation of Weighted Means When Plausible Values Are Not Present” or the section “Estimation of Weighted Means When Plausible Values Are Present,” depending on whether there are assessment variables or variables with plausible values in them.
The coefficient of determination (R-squared value) is similarly estimated by finding the average R-squared using the sample weights for each set of plausible values.
All variance estimation methods are shown in the vignette titled Statistical Methods Used in EdSurvey.
When the predicted value does not have plausible values, the variance of the coefficients is estimated according to the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Not Present, Using the Jackknife Method.”
When plausible values are present, the variance of the coefficients is estimated according to the section “Estimation of Standard Errors of Weighted Means When Plausible Values Are Present, Using the Jackknife Method.”
For more information on the specifics of multivariate regression, see the vignette titled Methods and Overview of Using EdSurvey for Multivariate Regression.
An edsurvey.mvrlm with elements:
call |
the function call |
formula |
the formula used to fit the model |
coef |
the estimates of the coefficients |
se |
the standard error estimates of the coefficients |
Vimp |
the estimated variance caused by uncertainty in the scores (plausible value variables) |
Vjrr |
the estimated variance caused by sampling |
M |
the number of plausible values |
varm |
the variance estimates under the various plausible values |
coefm |
the values of the coefficients under the various plausible values |
coefmat |
the coefficient matrix (typically produced by the summary of a model) |
r.squared |
the coefficient of determination |
weight |
the name of the weight variable |
npv |
the number of plausible values |
njk |
the number of the jackknife replicates used |
varEstInputs |
When |
residuals |
residuals for each of the PV models |
fitted.values |
model fitted values |
residCov |
residual covariance matrix for dependent variables |
residPV |
residuals for each dependent variable |
inputs |
coefficient estimation input matrices |
n0 |
full data n |
nUsed |
n used for model |
B |
imputation variance-covariance matrix, before multiplication by (M+1)/M |
U |
sampling variance-covariance matrix |
Alex Lishinski and Paul Bailey
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# use | symbol to separate dependent variables in the left-hand side of formula
mvrlm.fit <- mvrlm.sdf(formula=algebra | geometry ~ dsex + m072801, jrrIMax = 5, data = sdf)

# print method returns coefficients, as does coef method
mvrlm.fit
coef(mvrlm.fit)

# for more detailed results, use summary:
summary(mvrlm.fit)

# details of model can also be accessed through components of the returned object; for example:

# coefficients (one column per dependent variable)
mvrlm.fit$coef
# coefficient table with standard errors and p-values (1 table per dependent variable)
mvrlm.fit$coefmat
# R-squared values (one per dependent variable)
mvrlm.fit$r.squared
# residual covariance matrix
mvrlm.fit$residCov

# dependent variables can have plausible values or not (or a combination)
mvrlm.fit <- mvrlm.sdf(formula=composite | mrps22 ~ dsex + m072801, data = sdf, jrrIMax = 5)
summary(mvrlm.fit)

mvrlm.fit <- mvrlm.sdf(formula=algebra | geometry | measurement ~ dsex + m072801, data = sdf, jrrIMax = 5)
summary(mvrlm.fit)

mvrlm.fit <- mvrlm.sdf(formula=mrps51 | mrps22 ~ dsex + m072801, data = sdf, jrrIMax = 5)
summary(mvrlm.fit)

# hypotheses about coefficient restrictions can also be tested using the Wald test
mvr <- mvrlm.sdf(formula=algebra | geometry ~ dsex + m072801, data = sdf)
hypothesis <- c("geometry_dsexFemale = 0", "algebra_dsexFemale = 0")

# test statistics based on the F and chi-squared distribution are available
linearHypothesis(model=mvr, hypothesis = hypothesis, test = "F")
linearHypothesis(model=mvr, hypothesis = hypothesis, test = "Chisq")
## End(Not run)
Converts coefficients from an edsurveyGlm logit regression model to odds ratios.
oddsRatio(model, alpha = 0.05)
model |
an |
alpha |
the alpha level for the confidence interval |
An oddsRatio.edsurveyGlm object with the following elements:
OR |
odds ratio coefficient estimates |
2.5% |
lower bound 95% confidence interval |
97.5% |
upper bound 95% confidence interval |
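The conversion from logit coefficients to odds ratios can be sketched with a base R glm fit on mtcars. This is an unweighted illustration. Using a normal quantile for the interval is an assumption of the sketch, and oddsRatio may use a different reference distribution and survey-based standard errors.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial())
alpha <- 0.05

est <- coef(fit)               # coefficients on the log-odds scale
se <- sqrt(diag(vcov(fit)))    # their standard errors
z <- qnorm(1 - alpha / 2)      # normal quantile (an assumption, see lead-in)

# exponentiate the estimate and the interval endpoints
or_table <- cbind(
  OR    = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)
or_table
```

Because the interval is built on the log-odds scale and then exponentiated, it is not symmetric around the odds ratio.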
Takes an AM dct file and formats it for use with the mml method as paramTab.
parseNAEPdct(dct, mml = TRUE)
dct |
a file location from which to read the |
mml |
a logical for if the paramTab is being used in |
a data.frame in a format suitable for use with mml as a paramTab.
Sun-Joo Lee
Parses an SPSS Syntax Script (.sps) file to return information relating to fixed-width data files.
parseScript_SPSS( spsFilePath, verbose = FALSE, outputFormat = c("data.frame"), encoding = getOption("encoding") )
spsFilePath |
a character value of the file path to the SPSS script to parse. |
verbose |
a logical value indicating whether to print parsing activity to the console. Default value is |
outputFormat |
a named argument to indicate which output format the resulting object should be. See details for information on each format.
Currently, |
encoding |
a character value to indicate the encoding specification that is used by |
Not currently exported. In the future, this could potentially be made into a separate R package. This parseScript_SPSS function should be used 100 Old or previous SPSS script parsers should be slowly transitioned to use this function when possible to maximize code reuse.
The SPSS syntax script parser is focused on gathering details for use with fixed-width data files. This function scans for the following SPSS commands:
FILE HANDLE
DATA LIST
VARIABLE LABEL
VALUE LABEL
MISSING VALUE
The outputFormat
specified will determine the result object returned. This function currently supports the following formats.
data.frame
variableName - The variable name as defined in the script
Start - The start number index of the variable defined for the fixed-width format layout
End - The end number index of the variable defined for the fixed-width format layout
Width - The length of how many columns the variable uses in the fixed-width format layout
Attributes - Any SPSS attributes that are defined in the DATA LIST command. This is typically only for field formatting.
RecordNumber - Some fixed-width data files are considered "multi-line" where one record of data can span multiple rows in the file. The RecordNumber indicates which line the variable is assigned.
Labels - The descriptive label associated with the variable name to give more detail or context.
labelValues - For categorical variables a stored value will typically be assigned a longer label/definition. This string identifies these mappings. The '^' symbol is used to delimit each individual label value. Then additionally, the '=' is used to split the value from the left side of the '=' symbol, and the remaining right-hand side of '=' is the text label for that value.
dataType - A best-guess of the data type (either 'numeric' or 'character') without actually examining the data-file.
missingValues - If a MISSING VALUE clause is included in the script this will list the values that are considered 'Missing'. If multiple values specified, they will be delimited by a ';' (semi-colon) symbol.
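The delimiters described above can be unpacked with base R. The following is a minimal sketch, not part of EdSurvey: the helper name parseLabelValues and the sample strings are hypothetical, and it assumes that only the first '=' in each pair separates the value from its label.

```r
# Hypothetical helper (not an EdSurvey function): turn a labelValues string
# such as "1=Yes^2=No" into a two-column data.frame.
parseLabelValues <- function(x) {
  # '^' separates the individual value/label pairs
  pairs <- strsplit(x, "^", fixed = TRUE)[[1]]
  # split each pair at the first '=' only, so a label may itself contain '='
  pos <- as.integer(regexpr("=", pairs, fixed = TRUE))
  data.frame(
    value = substr(pairs, 1, pos - 1),
    label = substr(pairs, pos + 1, nchar(pairs)),
    stringsAsFactors = FALSE
  )
}

# made-up example strings in the documented layout
lv <- parseLabelValues("1=Yes^2=No^9=Omitted=skipped")
print(lv)

# missingValues uses ';' as its delimiter
missingVals <- strsplit("8;9", ";", fixed = TRUE)[[1]]
print(missingVals)
```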
returns an object containing information specified by the outputFormat
argument.
Tom Fink
Calculates the percentiles of a numeric variable in an
edsurvey.data.frame
, a light.edsurvey.data.frame
,
or an edsurvey.data.frame.list
.
percentile(
  variable,
  percentiles,
  data,
  weightVar = NULL,
  jrrIMax = 1,
  varMethod = c("jackknife", "Taylor"),
  alpha = 0.05,
  dropOmittedLevels = TRUE,
  defaultConditions = TRUE,
  recode = NULL,
  returnVarEstInputs = FALSE,
  returnNumberOfPSU = FALSE,
  pctMethod = c("symmetric", "unbiased", "simple"),
  confInt = TRUE,
  dofMethod = c("JR", "WS"),
  omittedLevels = deprecated()
)
variable |
the character name of the variable for which percentiles are computed, typically a subject scale or subscale |
percentiles |
a numeric vector of percentiles in the range of 0 to 100 (inclusive) |
data |
an |
weightVar |
a character indicating the weight variable to use. |
jrrIMax |
a numeric value; when using the jackknife variance estimation method, the default estimation option, |
varMethod |
a character set to |
alpha |
a numeric value between 0 and 1 indicating the confidence level.
An |
dropOmittedLevels |
a logical value. When set to the default value of
|
defaultConditions |
a logical value. When set to the default value
of |
recode |
a list of lists to recode variables. Defaults to
|
returnVarEstInputs |
a logical value set to |
returnNumberOfPSU |
a logical value set to |
pctMethod |
one of “unbiased”, “symmetric”, “simple”; unbiased produces a weighted median unbiased percentile estimate, whereas simple uses a basic formula that matches previously published results. Symmetric uses a more basic formula but requires that the percentile is symmetric to multiplying the quantity by negative one. |
confInt |
a Boolean indicating if the confidence interval should be returned |
dofMethod |
passed to |
omittedLevels |
this argument is deprecated. Use |
Percentiles, their standard errors, and confidence intervals are calculated according to the vignette titled Statistical Methods Used in EdSurvey. The standard errors and confidence intervals are based on separate formulas and assumptions.
The Taylor series variance estimation procedure is not relevant to percentiles because percentiles are not continuously differentiable.
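To make the idea of a weighted percentile concrete, here is a generic base R sketch. It is not one of the pctMethod estimators and does not reproduce EdSurvey's standard errors or handling of plausible values; the function name weightedPercentile and the toy data are assumptions made for illustration. It returns the smallest observed value whose cumulative share of the total weight reaches the requested percentile.

```r
# Illustrative only: a basic weighted percentile, not EdSurvey's estimator.
weightedPercentile <- function(x, w, p) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  # cumulative share of the total weight at each sorted value
  cumShare <- cumsum(w) / sum(w)
  vapply(p / 100, function(q) {
    idx <- which(cumShare >= q)[1]
    # guard against floating-point round-off at the 100th percentile
    if (is.na(idx)) idx <- length(x)
    x[idx]
  }, numeric(1))
}

# toy data: the last observation carries most of the weight
x <- c(10, 20, 30, 40)
w <- c(1, 1, 1, 5)
res <- weightedPercentile(x, w, c(25, 50))
print(res)
```

Because the heavily weighted observation pulls the cumulative weight share up late, the weighted median here is the largest value, which an unweighted median would not give.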
The return type depends on whether the class of the data
argument is an
edsurvey.data.frame
or an edsurvey.data.frame.list
.
The data argument is an edsurvey.data.frame
When the data
argument is an edsurvey.data.frame
,
percentile
returns an S3 object of class percentile
.
This is a data.frame
with typical attributes (names
,
row.names
, and class
) and additional attributes as follows:
n0 |
number of rows on |
nUsed |
number of observations with valid data and weights larger than zero |
nPSU |
number of PSUs used in the calculation |
call |
the call used to generate these results |
The columns of the data.frame
are as follows:
percentile |
the percentile of this row |
estimate |
the estimated value of the percentile |
se |
the jackknife standard error of the estimated percentile |
df |
degrees of freedom |
confInt.ci_lower |
the lower bound of the confidence interval |
confInt.ci_upper |
the upper bound of the confidence interval |
nsmall |
the number of units with more extreme results, averaged across plausible values |
When the confInt
argument is set to FALSE
, the confidence
intervals are not returned.
The data argument is an edsurvey.data.frame.list
When the data
argument is an edsurvey.data.frame.list
,
percentile
returns an S3 object of class percentileList
.
This is a data.frame with a call
attribute.
The columns in the data.frame
are identical to those in the previous
section, but there also are columns from the edsurvey.data.frame.list
.
covs |
a column for each column in the |
When returnVarEstInputs
is TRUE
, an attribute
varEstInputs
also is returned that includes the variance estimate
inputs used for calculating covariances with varEstToCov
.
Paul Bailey
Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. American Statistician, 50, 361–365.
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

# get the median of the composite
percentile(variable="composite", percentiles=50, data=sdf)

# get several percentiles
percentile(variable="composite", percentiles=c(0,1,25,50,75,99,100), data=sdf)

# build an edsurvey.data.frame.list
sdfA <- subset(sdf, scrpsu %in% c(5,45,56))
sdfB <- subset(sdf, scrpsu %in% c(75,76,78))
sdfC <- subset(sdf, scrpsu %in% 100:200)
sdfD <- subset(sdf, scrpsu %in% 201:300)

sdfl <- edsurvey.data.frame.list(datalist=list(sdfA, sdfB, sdfC, sdfD),
                                 labels=c("A locations", "B locations",
                                          "C locations", "D locations"))

# this shows how these datasets will be described:
sdfl$covs

percentile(variable="composite", percentiles=50, data=sdfl)
percentile(variable="composite", percentiles=c(25, 50, 75), data=sdfl)
## End(Not run)
Prints details of discrete and cumulative achievement levels
calculated using weights and variance
estimates appropriate for the edsurvey.data.frame
.
## S3 method for class 'achievementLevels'
print(
  x,
  printCall = TRUE,
  printDiscrete = TRUE,
  printCumulative = TRUE,
  use_es_round = getOption("EdSurvey_round_output"),
  ...
)
x |
an |
printCall |
a logical value; by default ( |
printDiscrete |
a logical value; by default ( |
printCumulative |
a logical value; by default ( |
use_es_round |
a logical value; use the EdSurvey rounding functions before printing |
... |
these arguments are not passed anywhere and are included only for compatibility |
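The use_es_round default in the usage above is read from an R option through getOption. The sketch below uses only base R to show how such an option can be set, read, and restored. The option name comes from the usage; how EdSurvey interprets its value is not shown here.

```r
# set the option for this session, keeping the previous value
old <- options(EdSurvey_round_output = TRUE)

# this is the expression the print method uses as its default
current <- getOption("EdSurvey_round_output")
print(current)

# restore whatever was set before
options(old)
```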
Huade Huo and Ahmad Emad
Prints metadata regarding an edsurvey.data.frame
or an edsurvey.data.frame.list
## S3 method for class 'edsurvey.data.frame'
print(
  x,
  printColnames = FALSE,
  use_es_round = getOption("EdSurvey_round_output"),
  round_n = getOption("EdSurvey_round_n_function"),
  ...
)
x |
an |
printColnames |
a logical value; set to |
use_es_round |
a logical; round the output per |
round_n |
function used to round sample n-sizes. See |
... |
these arguments are not passed anywhere and are included only for compatibility |
Michael Lee and Paul Bailey
Prints labels and a results vector of a gap analysis.
## S3 method for class 'gap'
print(
  x,
  ...,
  printPercentage = TRUE,
  use_es_round = getOption("EdSurvey_round_output")
)

## S3 method for class 'gapList'
print(x, ..., printPercentage = TRUE)
x |
an |
... |
these arguments are not passed anywhere and are included only for compatibility |
printPercentage |
a logical value set to |
use_es_round |
use the EdSurvey rounding methods for gap |
Paul Bailey
Opens a connection to an ePIRLS data file and
returns an edsurvey.data.frame
with
information about the file and data.
read_ePIRLS(path, countries, forceReread = FALSE, verbose = TRUE)
path |
a character value to the full directory path to the ePIRLS extracted SPSS (.sav) set of data |
countries |
a character vector of the country/countries to include using
the three-letter ISO country code.
A list of country codes can be found on Wikipedia at
https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes
or other online sources. Consult the ePIRLS User Guide to help determine what countries
are included within a specific testing year of ePIRLS.
To select all countries, use a wildcard value of |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value to either print or suppress status message output.
The default value is |
Reads in the unzipped files downloaded from the ePIRLS international database(s) using the IEA Study Data Repository. Data files require the SPSS data file (.sav) format using the default filenames.
An ePIRLS edsurvey.data.frame
includes three distinct data levels:
student
school
teacher
When the getData
function is called using an ePIRLS edsurvey.data.frame
,
the requested data variables are inspected, and it handles any necessary data merges automatically.
The school
data always will be returned merged to the student
data, even if only school
variables are requested.
If teacher
variables are requested by the getData
call, it will cause teacher
data to be merged.
A student
can be linked to many teachers
, which varies widely between countries.
Please note that calling the dim
function for an ePIRLS edsurvey.data.frame
will result in
the row count as if the teacher
dataset was merged.
This row count will be considered the full data N
of the edsurvey.data.frame
, even if no teacher
data were included in an analysis.
The column count returned by dim
will be the count of unique column variables across all three data levels.
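The effect of the teacher merge on the row count can be seen with a toy one-to-many merge in base R. This is a sketch with made-up data frames and hypothetical column names (studentId, teacherId, score); it does not call EdSurvey or use real ePIRLS variables.

```r
# three students, each with one row
students <- data.frame(studentId = c(1, 2, 3),
                       score = c(510, 480, 530))

# student-teacher links: a student can be linked to several teachers
links <- data.frame(studentId = c(1, 1, 2, 3, 3, 3),
                    teacherId = c(10, 11, 10, 12, 13, 14))

# merging teacher links onto students repeats each student row per teacher
merged <- merge(students, links, by = "studentId")

nStudents <- nrow(students)
nMerged <- nrow(merged)
print(c(students = nStudents, merged = nMerged))
```

The merged row count is the larger of the two, which is the count that dim is described as reporting.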
an edsurvey.data.frame
for a single specified country or an
edsurvey.data.frame.list
if multiple countries are specified
Tom Fink
readNAEP
, readTIMSS
, getData
, and download_ePIRLS
## Not run:
usa <- read_ePIRLS("~/ePIRLS/2016", countries = c("usa"))
gg <- getData(data=usa, varnames=c("itsex", "totwgt", "erea"))
head(gg)
edsurveyTable(formula=erea ~ itsex, data=usa)
## End(Not run)
Opens a connection to the Beginning Teacher Longitudinal Study (BTLS) waves 1 through 5 data file and
returns an edsurvey.data.frame
with
information about the file and data.
readBTLS(dat_FilePath, spss_FilePath, verbose = TRUE)
dat_FilePath |
a character value to the full path of the BTLS fixed-width (.dat) data file |
spss_FilePath |
a character value to the full path of the SPSS syntax file to process the |
verbose |
a logical value that will determine if you want verbose output while the |
Reads the spss_FilePath
file to parse the dat_FilePath
to an edsurvey.data.frame
.
There is no cached data because the dat_FilePath
format already is in fixed-width format.
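As a rough illustration of how a layout parsed from a syntax file determines the columns of a fixed-width file, the sketch below reads a tiny made-up file with base R's read.fwf. The layout, variable names, and data are hypothetical, and readBTLS does not necessarily work this way internally.

```r
# hypothetical layout: start and end positions of each variable
layout <- data.frame(variableName = c("id", "score", "grp"),
                     Start = c(1, 4, 7),
                     End = c(3, 6, 7),
                     stringsAsFactors = FALSE)
# with inclusive start and end positions, the width is End - Start + 1
layout$Width <- layout$End - layout$Start + 1

# write a tiny fixed-width file to a temporary location
tmp <- tempfile(fileext = ".dat")
writeLines(c("001 85A", "002 92B"), tmp)

# read it back using the widths from the layout
dat <- read.fwf(tmp, widths = layout$Width,
                col.names = layout$variableName,
                colClasses = c("character", "numeric", "character"))
unlink(tmp)
print(dat)
```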
an edsurvey.data.frame
for the BTLS waves 1 to 5 longitudinal dataset.
Tom Fink
readECLS_K2011
, readNAEP
, and getData
## Not run:
fld <- "~/EdSurveyData/BTLS"
datPath <- file.path(fld, "ASCII Data File", "BTLS2011_12.dat")
spsPath <- file.path(fld, "Input Syntax for Stata and SPSS", "BTLS2011_12.sps")

# read in the data to an edsurvey.data.frame
btls <- readBTLS(datPath, spsPath, verbose = TRUE)
dim(btls)
## End(Not run)
Opens a connection to an ICCS (2009, 2016) or CivEd (1999) data file and
returns an edsurvey.data.frame
with
information about the file and data.
readCivEDICCS(
  path,
  countries,
  dataSet = c("student", "teacher"),
  gradeLvl = c("8", "9", "12"),
  forceReread = FALSE,
  verbose = TRUE
)
path |
a character value of the full directory to the ICCS/CivED extracted SPSS (.sav) set of data |
countries |
a character vector of the country/countries to include using
the three-letter International Organization for Standardization (ISO) country code.
A list of country codes can be found on Wikipedia at
https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes
or other online sources. Consult the ICCS/CivED User Guide to help determine what countries
are included within a specific testing year of ICCS/CivED.
To select all countries, use a wildcard value of |
dataSet |
a character value of either |
gradeLvl |
a character value of the grade level to return
|
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value to either print or suppress status message output.
The default value is |
Reads in the unzipped files downloaded from the international database(s) using the IEA Study Data Repository. Data files require the SPSS data file (.sav) format using the default filenames.
When using the getData
function with a CivED or ICCS study edsurvey.data.frame
,
the requested data variables are inspected, and it handles any necessary data merges automatically.
The school
data always will be returned merged to the student
data, even if only school
variables are requested.
If a 1999 CivED Grade 8 edsurvey.data.frame
with teacher
data variables is requested by the getData
call,
it will cause teacher
data to be merged.
Many students
can be linked to many teachers
, which varies widely between countries,
and not all countries contain teacher
data.
Calling the dim
function for a CivED 1999 Grade 8 edsurvey.data.frame
will result in the row count as if the teacher
dataset was merged.
This row count will be considered the full data N
of the edsurvey.data.frame
, even if no teacher
data were included in an analysis.
The column count returned by dim
will be the count of unique column variables across all data levels.
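The column count described above, the number of unique variables across data levels, can be mimicked in base R. The variable names below are hypothetical placeholders for each level, not the actual CivED or ICCS columns.

```r
# hypothetical column names at each data level; linking variables repeat
studentCols <- c("studentId", "schoolId", "weight", "score")
schoolCols <- c("schoolId", "schoolSize")
teacherCols <- c("studentId", "teacherId", "yearsExperience")

# columns shared between levels are counted once
nUniqueCols <- length(unique(c(studentCols, schoolCols, teacherCols)))
print(nUniqueCols)
```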
an edsurvey.data.frame
for a single specified country or an edsurvey.data.frame.list
if multiple countries are specified
Tom Fink
readNAEP
, readTIMSS
, getData
, and downloadCivEDICCS
## Not run:
eng <- readCivEDICCS("~/ICCS/2009/", countries = c("eng"),
                     gradeLvl = 8, dataSet = "student")
gg <- getData(data=eng, varnames=c("famstruc", "totwgts", "civ"))
head(gg)
edsurveyTable(formula=civ ~ famstruc, data=eng)
## End(Not run)
Opens a connection to an ECLS-B data file and
returns an edsurvey.data.frame
with
information about the file and data.
readECLS_B(
  path = getwd(),
  filename,
  layoutFilename,
  forceReread = FALSE,
  verbose = TRUE
)
path |
a character value to the full directory path(s) to the ECLS-B extracted fixed-width-format (.dat) set of data files |
filename |
a character value of the name of the fixed-width-file (.dat) data file in the specified |
layoutFilename |
a character value of the filename of either the ASCII text (.txt) layout file of the |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value that will determine if you want verbose output while the |
Reads in the unzipped files downloaded from the ECLS-B longitudinal Database.
An edsurvey.data.frame
for the ECLS-B longitudinal dataset.
Trang Nguyen
Opens a connection to an ECLS–K 1998 data file and
returns an edsurvey.data.frame
with
information about the file and data.
readECLS_K1998(
  path = getwd(),
  filename = "eclsk_98_99_k8_child_v1_0.dat",
  layoutFilename = "Layout_k8_child.txt",
  forceReread = FALSE,
  verbose = TRUE
)
path |
a character value to the full directory path(s) to the ECLS–K-extracted fixed-width-format (.dat) set of data files |
filename |
a character value of the name of the fixed-width (.dat)
data file in the specified |
layoutFilename |
a character value of the filename of either the ASCII
(.txt) layout file of the |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value that will determine if you want verbose output while the |
Reads in the unzipped files downloaded from the ECLS–K 1998 longitudinal dataset(s) to an edsurvey.data.frame
. The ECLS–K 1998–99 study consisted of
three distinct separate datasets that cannot be combined: (1) Child Grades K–8 Data, (2) School Base-Year Data, and (3) Teacher Base-Year Data.
The filename
and layoutFilename
arguments default to the corresponding Child K–8 default filenames.
an edsurvey.data.frame
for the ECLS–K 1998 longitudinal dataset
Tom Fink
readECLS_K2011
, readNAEP
, getData
, downloadECLS_K
## Not run:
# read-in student file with defaults
eclsk_df <- readECLS_K1998(path="~/ECLS_K/1998") # using defaults
d <- getData(data=eclsk_df, varnames=c("childid", "gender", "race"))
summary(d)
## End(Not run)

## Not run:
# read-in with parameters specified
eclsk_df <- readECLS_K1998(path = "~/ECLS_K/1998",
                           filename = "eclsk_98_99_k8_child_v1_0.dat",
                           layoutFilename = "Layout_k8_child.txt",
                           verbose = TRUE,
                           forceReread = FALSE)
## End(Not run)
Opens a connection to an ECLS–K 2011 data file and
returns an edsurvey.data.frame
with
information about the file and data.
readECLS_K2011(
  path = getwd(),
  filename = "childK5p.dat",
  layoutFilename = "ECLSK2011_K5PUF.sps",
  forceReread = FALSE,
  verbose = TRUE
)
path |
a character value to the full directory path(s) to the ECLS–K 2010–11 extracted fixed-width-format (.dat) set of data files |
filename |
a character value of the name of the fixed-width (.dat) data file in the specified |
layoutFilename |
a character value of the filename of either the ASCII (.txt) layout file of the |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value that will determine if you want verbose output while the |
Reads in the unzipped files downloaded from the ECLS–K 2010–11 longitudinal dataset.
an edsurvey.data.frame
for the ECLS–K 2010–11 longitudinal dataset
Tom Fink
readECLS_K1998
, readNAEP
, getData
, and downloadECLS_K
## Not run:
# read-in student file with defaults
eclsk_df <- readECLS_K2011(path="~/ECLS_K/2011") # using defaults
d <- getData(data=eclsk_df, varnames=c("childid", "c1hgt1", "c1wgt1"))
summary(d)
## End(Not run)

## Not run:
# read-in with parameters specified
eclsk_df <- readECLS_K2011(path = "~/ECLS_K/2011",
                           filename = "childK5p.dat",
                           layoutFilename = "ECLSK2011_K5PUF.sps",
                           forceReread = FALSE,
                           verbose = TRUE)
## End(Not run)
Opens a connection to an ELS data file and
returns an edsurvey.data.frame
with
information about the file and data.
readELS(
  path = getwd(),
  filename = "els_02_12_byf3pststu_v1_0.sav",
  wgtFilename = ifelse(filename == "els_02_12_byf3pststu_v1_0.sav",
    "els_02_12_byf3stubrr_v1_0.sav", NA),
  forceReread = FALSE,
  verbose = TRUE
)
path |
a character value to the directory path of the extracted set of data files and layout files. |
filename |
a character value of the name of the SPSS (.sav) data file
in the specified |
wgtFilename |
a character value of the name of the associated balanced
repeated replication (BRR) weight SPSS (.sav) data file
in the specified |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value that will determine if you want verbose output
while the |
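The wgtFilename default in the usage above depends on filename. The sketch below wraps that same ifelse expression in a small function (the wrapper name defaultWgt is hypothetical) to show what it evaluates to for the default student file and for another file.

```r
# the default expression from the readELS usage, wrapped for illustration
defaultWgt <- function(filename) {
  ifelse(filename == "els_02_12_byf3pststu_v1_0.sav",
         "els_02_12_byf3stubrr_v1_0.sav", NA)
}

# default student file: the BRR weight file is selected
studentWgt <- defaultWgt("els_02_12_byf3pststu_v1_0.sav")
print(studentWgt)

# any other file name: no weight file is assumed
schoolWgt <- defaultWgt("els_02_12_byf1sch_v1_0.sav")
print(schoolWgt)
```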
Reads in the unzipped files downloaded from the ELS longitudinal dataset(s)
to an edsurvey.data.frame
. The ELS 2002 study consisted of
four distinct separate datasets that cannot be combined:
Student: base year through follow-up three (default)
School: base year through follow-up one
Institution: follow-up two
Institution: follow-up three
an edsurvey.data.frame
for the ELS longitudinal dataset
Tom Fink
readECLS_K2011
, readNAEP
, getData
, and downloadECLS_K
## Not run:
# read-in student file including weight file as default
els_df <- readELS("~/ELS/2002") # student level with weights
d <- getData(data=els_df, varnames=c("stu_id", "bysex", "bystlang"))
summary(d)

# read-in with parameters specified (student level with weights)
els_wgt_df <- readELS(path = "~/ELS/2002",
                      filename = "els_02_12_byf3pststu_v1_0.sav",
                      wgtFilename = "els_02_12_byf3stubrr_v1_0.sav",
                      verbose = TRUE,
                      forceReread = FALSE)

# read-in with parameters specified (school level, no separate weight replicate file)
els_sch_df <- readELS(path = "~/ELS/2002",
                      filename = "els_02_12_byf1sch_v1_0.sav",
                      wgtFilename = NA,
                      verbose = TRUE,
                      forceReread = FALSE)
## End(Not run)
Opens a connection to a High School & Beyond 1980–1986 Senior cohort data file and
returns an edsurvey.data.frame
with
information about the file and data.
readHSB_Senior(
  HSR8086_PRI_FilePath,
  HSR8086_SASSyntax_Path,
  forceReread = FALSE,
  verbose = TRUE
)
HSR8086_PRI_FilePath |
a character value to the main study-derived
analytical data file (HSR8086_REV.PRI).
Located within the |
HSR8086_SASSyntax_Path |
a character value to the SAS syntax file for
parsing the |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value that will determine if you want verbose output
while the |
Reads in the specified HSR8086_SASSyntax_Path
file to parse
the HSR8086_PRI_FilePath
file.
A cached data file and metadata file will be saved in the same
directory and filename as the HSR8086_PRI_FilePath
file,
having new file extensions of .txt and .meta, respectively.
Please note the original source repcode
variable has been split
into two variables named repcode_str
for the stratum value
and repcode_psu
for the primary sampling unit (PSU) value in the resulting
cache data.
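The cache file locations described above can be worked out from the data file path with base R. This is a sketch under the assumption that the cache files keep the data file's base name unchanged and only swap the extension; the exact names EdSurvey writes may differ.

```r
# example path to the data file (same layout as in the examples)
priPath <- file.path("~", "HSB", "SENIOR", "REVISED_ASCII", "HSR8086_REV.PRI")

# drop the .PRI extension and append the two cache extensions
basePath <- tools::file_path_sans_ext(priPath)
cacheFiles <- paste0(basePath, c(".txt", ".meta"))
print(cacheFiles)
```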
an edsurvey.data.frame
for the HS&B Senior 1980–1986 longitudinal dataset
Tom Fink
readECLS_K2011
, readNAEP
, and getData
## Not run:
wrkFld <- "~/HSB/SENIOR"

dataPath <- file.path(wrkFld, "REVISED_ASCII", "HSR8086_REV.PRI")
sasPath <- file.path(wrkFld, "SAS_EXTRACT_LOGIC", "HSBsr_READ_HSR8086.SAS")

# with verbose output as default
hsbSR <- readHSB_Senior(dataPath, sasPath)

# silent output
hsbSR <- readHSB_Senior(dataPath, sasPath, verbose = FALSE)

# force cache update
hsbSR <- readHSB_Senior(dataPath, sasPath, forceReread = TRUE)
## End(Not run)
Opens a connection to a High School & Beyond 1980–1992 Sophomore cohort data file and
returns an edsurvey.data.frame
with
information about the file and data.
readHSB_Sophomore(
  HSO8092_PRI_FilePath,
  HSO8092_SASSyntax_Path,
  forceReread = FALSE,
  verbose = TRUE
)
HSO8092_PRI_FilePath |
a character value to the main study-derived
analytical data file (HSO8092_REV.PRI).
Located within the |
HSO8092_SASSyntax_Path |
a character value to the SAS syntax file for
parsing the |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value that will determine if you want verbose
output while the |
Reads in the specified HSO8092_SASSyntax_Path
file to parse
the HSO8092_PRI_FilePath
file.
A cached data file and metadata file will be saved in the same
directory and filename as the HSO8092_PRI_FilePath
file,
having new file extensions of .txt and .meta, respectively.
Please note the original source repcode
variable has been split
into two variables named repcode_str
for the stratum value
and repcode_psu
for the primary sampling unit (PSU) value in the resulting cache data.
an edsurvey.data.frame
for the HS&B Sophomore 1980–1992 longitudinal dataset
Tom Fink
readECLS_K2011
, readNAEP
, and getData
## Not run:
wrkFld <- "~/HSB/SOPHOMORE"

dataPath <- file.path(wrkFld, "REVISED_ASCII", "HSO8092_REV.PRI")
sasPath <- file.path(wrkFld, "SAS_EXTRACT_LOGIC", "HSBso_READ_HSO8092.SAS")

# with verbose output as default
hsbSO <- readHSB_Sophomore(dataPath, sasPath)

# silent output
hsbSO <- readHSB_Sophomore(dataPath, sasPath, verbose = FALSE)

# force cache update
hsbSO <- readHSB_Sophomore(dataPath, sasPath, forceReread = TRUE)
## End(Not run)
Opens a connection to an HSLS data file and
returns an edsurvey.data.frame
with
information about the file and data.
readHSLS(
  path = getwd(),
  filename = "hsls_16_student_v1_0.sav",
  wgtFilename = NA,
  forceReread = FALSE,
  verbose = TRUE
)
path |
a character value to the full directory path(s) to the HSLS extracted SPSS (.sav) set of data files |
filename |
a character value of the name of the SPSS (.sav) datafile to be read |
wgtFilename |
a character value of the name of the associated BRR
weight SPSS (.sav) data file in the specified |
forceReread |
a logical value to force a rereading of all processed data.
The default value of |
verbose |
a logical value set to |
Reads in the unzipped files downloaded from the HSLS longitudinal dataset.
an edsurvey.data.frame
for the HSLS longitudinal dataset
The SPSS (.sav) format is currently preferred over the fixed-width-format (.dat) ASCII file format because of value label issues identified with the ASCII layout specifications.
Tom Fink
readECLS_K2011
, readNAEP
, and getData
## Not run:
# use function default values at working directory
hsls <- readHSLS("~/HSLS/2009")

# specify parameters with verbose output
hsls <- readHSLS(path="~/HSLS/2009",
                 filename = "hsls_16_student_v1_0.sav",
                 forceReread = FALSE,
                 verbose = TRUE)

# specify parameters silent output
hsls <- readHSLS(path="~/HSLS/2009",
                 filename = "hsls_16_student_v1_0.sav",
                 forceReread = FALSE,
                 verbose = FALSE)

# for restricted-use student data, replicate weights stored in separate file
hslsRUD <- readHSLS(path="~/HSLS/2009",
                    filename = "hsls_16_student_v1_0.sav",
                    wgtFilename = "hsls_16_student_BRR_v1_0.sav",
                    forceReread = FALSE,
                    verbose = TRUE)
## End(Not run)
Opens a connection to the High School Transcript Study (HSTS) data files for 2019.
Returns an edsurvey.data.frame
with
information about the file and data.
readHSTS(
  dataFilePath = getwd(),
  spssPrgPath = dataFilePath,
  year = c("2019"),
  verbose = TRUE
)
dataFilePath |
a character value to the root directory path of extracted set of ASCII data files (.txt or .dat file extension).
|
spssPrgPath |
a character value to the directory path of where the extracted set of .sps program files are located.
The data file and associated SPSS program filenames *must match* (having different file extensions) to determine which files are associated together.
|
year |
a character value to indicate the year of the dataset. Only one year is supported for a single |
verbose |
a logical value that will determine if you want verbose output while the |
The HSTS data has a complex structure and unique characteristics, all handled internally within EdSurvey.
The structure allows for automatic dynamic linking across the various data 'levels' based on the requested variables. The student data level is the primary analysis unit.
Dynamic linking for variables that include both tests and transcript level details will result in an error, as they cannot be simultaneously returned in a single call.
Situations may arise where the analyst must derive variables for analysis. See the documentation for the merge and $<- functions for more detail. All merge operations are done at the student level (the main analysis unit).
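As a rough illustration of why derived variables need to be collapsed to the student level before merging, the base R sketch below aggregates several transcript records per student into one row and then joins the result to a student table. The data frames and column names (`studentID`, `credits`, `grade`) are invented for illustration and are not HSTS variable names; with real data the final step would go through the merge method for an edsurvey.data.frame instead.

```r
# Hypothetical student-level table: one row per student
students <- data.frame(studentID = c(1, 2, 3),
                       grade = c(12, 12, 12))

# Hypothetical transcript-level table: many rows per student
transcripts <- data.frame(studentID = c(1, 1, 2, 2, 2),
                          credits = c(1.0, 0.5, 1.0, 1.0, 0.5))

# Collapse transcript records to one row per student
totalCredits <- aggregate(credits ~ studentID, data = transcripts, FUN = sum)

# Join at the student level; student 3 has no transcript rows and gets NA
merged <- merge(students, totalCredits, by = "studentID", all.x = TRUE)
print(merged)
```

The result keeps exactly one row per student, which is the shape a student-level merge expects.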
File Layout for HSTS 2019:
School (school.dat) - School level variables.
School Catalog (catalog.dat) - Catalog variables joined to School data. Variables renamed to begin with SchCat_ to distinguish from Transcript Catalog. Cannot be merged with any Student data.
Student (student.dat) - Student level variables. Primary analysis unit, all merged/cached data must be at this level.
NAEP Math (naepmath.dat) - Subset of students containing NAEP Math variables. Variables begin with math_ to ensure they are unique from the NAEP Science variables.
NAEP Science (naepsci.dat) - Subset of students containing NAEP Science variables. Variables begin with sci_ to ensure they are unique from the NAEP Math variables.
Tests (tests.dat) - Students may have many test records. Contains ACT/SAT testing score details for students. Cannot be merged together with any Transcript or Transcript Catalog data.
Transcripts (trnscrpt.dat) - Students may have many transcript records. Contains transcript level details. Cannot be merged together with Test data.
Transcript Catalog (catalog.dat) - Each transcript record is associated to a catalog record for giving context to the transcript record. 2019 uses SCED codes for categorizing courses.
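The filename-matching requirement between the data files and the SPSS program files can be checked before calling readHSTS. The sketch below is a plain base R check and not part of EdSurvey. It creates empty placeholder files in a temporary directory (named after the layout above, with `tests.dat` deliberately left without a program file) and reports which data files lack an .sps file with the same base name.

```r
# Create placeholder files in a temporary directory (illustration only)
demoDir <- file.path(tempdir(), "hsts_demo")
dir.create(demoDir, showWarnings = FALSE)
file.create(file.path(demoDir, c("student.dat", "student.sps",
                                 "school.dat", "school.sps",
                                 "tests.dat")))

# Base names (without extension) of the data files and SPSS program files
datFiles <- list.files(demoDir, pattern = "\\.(dat|txt)$")
spsFiles <- list.files(demoDir, pattern = "\\.sps$")
datBase <- tools::file_path_sans_ext(datFiles)
spsBase <- tools::file_path_sans_ext(spsFiles)

# Data files with no matching .sps program file
unmatched <- datFiles[!datBase %in% spsBase]
print(unmatched)
```

An empty result would suggest that every data file has a program file to pair with.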
an edsurvey.data.frame for the HSTS dataset.
Tom Fink
showCodebook, searchSDF, edsurvey.data.frame, merge.edsurvey.data.frame, and getData
Opens a connection to an ICILS data file residing on the disk and returns an edsurvey.data.frame with information about the file and data.
readICILS( path, countries, dataSet = c("student", "teacher"), forceReread = FALSE, verbose = TRUE )
path |
a character value to the full directory path to the ICILS extracted SPSS (.sav) set of data |
countries |
a character vector of the country/countries to include using
the three-letter ISO country code.
A list of country codes can be found on Wikipedia at
https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes
or other online sources. Consult the ICILS User Guide
to help determine what countries
are included within a specific testing year of ICILS.
To select all countries, use a wildcard value of |
dataSet |
a character value of either |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value to either print or suppress status message output.
The default value is |
Reads in the unzipped files downloaded from the ICILS international dataset(s) using the IEA Study Data Repository. Data files require the SPSS data file (.sav) format using the default filenames.
an edsurvey.data.frame for a single specified country or an edsurvey.data.frame.list if multiple countries are specified
Tom Fink and Jeppe Bundsgaard (updated for 2018 and 2023)
readNAEP, readTIMSS, and getData
## Not run: 
pol <- readICILS("~/ICILS/2013", countries = "pol", dataSet = "student")
gg <- getData(data=pol, varnames=c("idstud", "cil", "is1g18b"))
head(gg)
edsurveyTable(formula=cil ~ is1g18b, pol)

## End(Not run)
Opens a connection to a NAEP data file residing on the disk. Returns an edsurvey.data.frame with information about the file and data.
readNAEP( path, defaultWeight = "origwt", defaultPvs = "composite", omittedLevels = c("Multiple", NA, "Omitted"), frPath = NULL, xmlPath = NULL )
path |
a character value indicating the full filepath location and name of the (.dat) data file |
defaultWeight |
a character value that indicates the default weight
specified in the resulting |
defaultPvs |
a character value that indicates the default plausible value
specified in the resulting |
omittedLevels |
a character vector indicating which factor levels/labels
should be excluded. When set to the default value of
|
frPath |
a character value indicating the file location of the |
xmlPath |
a character value indicating the file path of the |
The frPath file layout information will take precedence over the xmlPath file when the xmlPath is not explicitly set, or when the xmlPath file cannot be located.
The readNAEP function includes both scaled scores and theta scores, with the latter having names ending in _theta.
When a NAEP administration includes linking error variables, those variables are included and end in _linking. When present, simply use the _linking version of a variable to get a standard error estimate that includes linking error.
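A simple way to spot these variables is to filter a vector of variable names by suffix. The sketch below uses base R on a made-up name vector; the names are placeholders rather than the contents of any real NAEP file, and with an actual edsurvey.data.frame the names would come from the data itself (for instance through searchSDF).

```r
# Placeholder variable names (not taken from a real NAEP file)
varNames <- c("composite", "composite_theta", "composite_linking",
              "algebra", "algebra_theta", "dsex")

# Theta-score variables end in "_theta"
thetaVars <- grep("_theta$", varNames, value = TRUE)

# Linking-error variants end in "_linking"
linkingVars <- grep("_linking$", varNames, value = TRUE)

print(thetaVars)
print(linkingVars)
```

Anchoring the pattern with `$` avoids matching names that merely contain the suffix somewhere in the middle.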
This function supports the following NAEP data products:
Main NAEP
Long-Term Trend NAEP (LTT)
Monthly School Survey Linking Study (MSS)
COVID Data Hub School Linking Study
School and Teacher Questionnaire Special Study (STQ)
A table outlining the differences between the Main NAEP and Long-Term Trend (LTT) datasets can be found on the NAEP Nation's Report Card website.
For the School and Teacher Questionnaire Special Study (STQ), the School level data can be analyzed independently, or merged together with the Teacher level data. The chosen variables will dynamically link the data when applicable.
Some School records may not have any Teacher records and thus the dimensions of the resulting edsurvey.data.frame may not match the total teacher record count.
An edsurvey.data.frame for a NAEP data file.
Tom Fink and Ahmad Emad
## Not run: 
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
sdf

# To read in an NCES file first set the directory to the /Data subfolder,
# then read in the appropriate .dat file:
setwd("location/of/Data")
sdf <- readNAEP(path="M36NT2PM.dat")

# Or read in the .dat file directly through the folder pathway:
sdf <- readNAEP(path="location/of/Data/M36NT2PM.dat")

## End(Not run)
Opens a connection to a National Household Education Survey (NHES) data file and returns an edsurvey.data.frame with information about the file and data.
readNHES(savFiles, surveyCode = "auto", forceReread = FALSE, verbose = TRUE)
savFiles |
a character vector to the full file path(s) to the NHES extracted SPSS (*.sav) data files. |
surveyCode |
a character vector of the |
forceReread |
a logical value to force a rereading of all processed data.
The default value of |
verbose |
a logical value that defaults to |
Reads in the unzipped public-use files downloaded from the NCES Online Codebook (https://nces.ed.gov/datalab/onlinecodebook) in SPSS (*.sav) format.
Other sources of NHES data, such as restricted-use files or other websites, may require additional conversion steps to generate the required SPSS data format and/or explicitly setting the surveyCode parameter.
an edsurvey.data.frame if only one NHES file is specified for the savFiles argument, or an edsurvey.data.frame.list if multiple files are passed to the savFiles argument
Tom Fink
downloadNHES, getNHES_SurveyInfo, and viewNHES_SurveyCodes
## Not run: 
rootPath <- "~/"

#get instructions for obtaining NHES data
downloadNHES()

#get SPSS *.sav file paths of all NHES files for 2012 and 2016
filesToImport <- list.files(path = file.path(rootPath, "NHES", c(2012, 2016)),
                            pattern="\\.sav$", full.names = TRUE, recursive = TRUE)

#import all files to edsurvey.data.frame.list object
esdfList <- readNHES(savFiles = filesToImport, surveyCode = "auto",
                     forceReread = FALSE, verbose = TRUE)

viewNHES_SurveyCodes() #view NHES survey codes in console

#get the full file path to the 2016 ATES NHES survey
path_ates2016 <- list.files(path = file.path(rootPath, "NHES", "2016"),
                            pattern=".*ates.*[.]sav$", full.names = TRUE)

#explicitly setting the surveyCode parameter (if required)
esdf <- readNHES(savFiles = path_ates2016, surveyCode = "ATES_2016",
                 forceReread = FALSE, verbose = TRUE)

#search for variables in the edsurvey.data.frame
searchSDF(string="sex", data=esdf)

## End(Not run)
Opens a connection to a PIAAC data file and returns an edsurvey.data.frame with information about the file and data.
readPIAAC( path, countries, forceReread = FALSE, verbose = TRUE, usaOption = "12_14" )
path |
a character value to the full directory path to the PIAAC .csv files and Microsoft Excel codebook |
countries |
a character vector of the country/countries to include
using the three-letter ISO country code. A list of country
codes can be found in the PIAAC codebook or
https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes.
If files are downloaded using |
forceReread |
a logical value to force rereading of all processed data.
Defaults to |
verbose |
a logical value that will determine if you want verbose
output while the function is running to indicate the progress.
Defaults to |
usaOption |
a character value of |
Reads in the unzipped .csv files downloaded from the PIAAC dataset using the OECD repository (https://www.oecd.org/skills/piaac.html). Users can use downloadPIAAC to download all required files automatically.
an edsurvey.data.frame for a single specified country or an edsurvey.data.frame.list if multiple countries are specified
Trang Nguyen
Organisation for Economic Co-operation and Development. (2016). Technical report of the survey of adult skills (PIAAC) (2nd ed.). Paris, France: Author. Retrieved from https://www.oecd.org/skills/piaac/PIAAC_Technical_Report_2nd_Edition_Full_Report.pdf
getData and downloadPIAAC
## Not run: 
# the following call returns an edsurvey.data.frame to PIAAC for Canada
can <- readPIAAC("~/PIAAC/Cycle 1/", countries = "can")

# extract a data.frame with a few variables
gg <- getData(data=can, varnames=c("c_d05","ageg10lfs"))
head(gg)

# conduct an analysis on the edsurvey.data.frame
edsurveyTable(formula=~ c_d05 + ageg10lfs, data = can)

# There are two years of USA data for round 1: 2012-2014 and 2017.
# The user must specify which USA year they want with the optional "usaOption" argument.
# Otherwise, the read function will return USA 2012-2014. See "?readPIAAC" for more info.

# read in USA 2012-2014
usa12 <- readPIAAC("~/PIAAC/Cycle 1", countries = "usa", usaOption="12_14")
# read in USA 2017
usa17 <- readPIAAC("~/PIAAC/Cycle 1", countries = "usa", usaOption="17")

# if reading in all PIAAC data, the user can still specify the USA option.
# Otherwise, by default 2012-2014 will be used when reading in all PIAAC data.
all_piaac <- readPIAAC("~/PIAAC/Cycle 1", countries = "*", usaOption="17")

## End(Not run)
Opens a connection to a PIRLS data file and returns an edsurvey.data.frame with information about the file and data.
readPIRLS(path, countries, forceReread = FALSE, verbose = TRUE)
path |
a character value to the full directory path to the PIRLS extracted SPSS (.sav) set of data |
countries |
a character vector of the country/countries to include using
the three-letter ISO country code.
A list of country codes can be found on Wikipedia at
https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes
or other online sources. Consult the PIRLS User Guide
to help determine what countries
are included within a specific testing year of PIRLS.
To select all countries, use a wildcard value of |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value to either print or suppress status message output.
The default value is |
Reads in the unzipped files downloaded from the PIRLS international database(s) using the IEA Study Data Repository. Data files require the SPSS data file (.sav) format using the default filenames.
A PIRLS edsurvey.data.frame includes three distinct data levels:
student
school
teacher
When the getData function is called using a PIRLS edsurvey.data.frame, the requested data variables are inspected, and any necessary data merges are handled automatically.
The school data will always be returned merged to the student data, even if only school variables are requested.
If teacher variables are requested by the getData call, the teacher data will also be merged.
Many students can be linked to many teachers, and the extent of this linkage varies widely between countries.
Please note that calling the dim function for a PIRLS edsurvey.data.frame will return the row count as if the teacher dataset were merged.
This row count will be considered the full data N of the edsurvey.data.frame, even if no teacher data were included in an analysis.
The column count returned by dim will be the count of unique column variables across all three data levels.
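To see why a row count based on the teacher merge can exceed the number of students, consider the base R sketch below. It uses invented identifiers and a hypothetical linkage table in which one student is taught by two teachers; it only mimics the effect and does not reproduce how EdSurvey performs the merge internally.

```r
students <- data.frame(studentID = 1:3)

# Hypothetical linkage: student 2 is taught by two teachers
links <- data.frame(studentID = c(1, 2, 2, 3),
                    teacherID = c(10, 10, 11, 11))

merged <- merge(students, links, by = "studentID")

nrow(students)  # number of students
nrow(merged)    # rows after linking students to teachers
```

Student 2 appears twice in the merged table, so the linked row count is larger than the student count.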
an edsurvey.data.frame for a single specified country or an edsurvey.data.frame.list if multiple countries are specified
Tom Fink
readNAEP, readTIMSS, getData, and downloadPIRLS
## Not run: 
nor <- readPIRLS("~/PIRLS/2011", countries = c("nor"))
gg <- getData(data=nor, varnames=c("itsex", "totwgt", "rrea"))
head(gg)
edsurveyTable(formula=rrea ~ itsex, nor)

## End(Not run)
Opens a connection to a PISA data file and returns an edsurvey.data.frame with information about the file and data.
readPISA( path, database = c("INT", "CBA", "FIN"), countries, cognitive = c("score", "response", "none"), forceReread = FALSE, verbose = TRUE )
path |
a character vector to the full directory path(s) to the PISA-extracted fixed-width files and SPSS control files (.txt). |
database |
a character to indicate a selected database. Must be one of
|
countries |
a character vector of the country/countries to include using the
three-letter ISO country code. A list of country codes can be found
in the PISA codebook or https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes.
If files are downloaded using |
cognitive |
one of |
forceReread |
a logical value to force rereading of all processed data.
Defaults to |
verbose |
a logical value that will determine if you want verbose
output while the function is running to indicate progress.
Defaults to |
Reads in the unzipped files downloaded from the PISA database using the OECD Repository (https://www.oecd.org/pisa.html). Users can use downloadPISA to download all required files.
Student questionnaire files (with weights and plausible values) are used as
main files, which are then
merged with cognitive, school, and parent files (if available).
The average first-time processing time for one year and one database for all countries is 10–15 minutes. If forceReread is set to FALSE, subsequent calls to this function will take only 5–10 seconds.
For the PISA 2000 study, please note that the study weights are subject specific. Each weight has different adjustment factors for reading, mathematics, and science based on its original subject source file. For example, the w_fstuwt_read weight is associated with the reading subject data file. Special care must be used to select the correct weight based on your specific analysis. See the OECD documentation for further details. Use the showWeights function to see all three student level subject weights:
w_fstuwt_read = Reading (default)
w_fstuwt_scie = Science
w_fstuwt_math = Mathematics
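One way to avoid pairing a weight with the wrong subject in PISA 2000 analyses is to keep the mapping in a small lookup and select the weight by subject. The helper below, `weightForSubject`, is a hypothetical base R convenience and not an EdSurvey function; the name it returns could then be supplied wherever an analysis function accepts a weight variable.

```r
# Subject-specific student weights for PISA 2000, as listed above
pisa2000Weights <- c(reading = "w_fstuwt_read",
                     science = "w_fstuwt_scie",
                     mathematics = "w_fstuwt_math")

# Hypothetical helper: return the weight name for a subject
weightForSubject <- function(subject) {
  subject <- match.arg(subject, names(pisa2000Weights))
  unname(pisa2000Weights[subject])
}

weightForSubject("mathematics")
```

Because `match.arg` rejects unknown subjects, a misspelled subject name raises an error instead of silently falling back to the default reading weight.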
an edsurvey.data.frame for a single specified country or an edsurvey.data.frame.list if multiple countries are specified
Tom Fink, Trang Nguyen, Paul Bailey, and Yuqi Liao
Organisation for Economic Co-operation and Development. (2017). PISA 2015 technical report. Paris, France: OECD Publishing. Retrieved from https://www.oecd.org/pisa/data/2015-technical-report.html
getData and downloadPISA
## Not run: 
# the following call returns an edsurvey.data.frame to
# PISA 2012 International Database for Singapore
sgp2012 <- readPISA(path = "~/PISA/2012", database = "INT", countries = "sgp")

# extract a data.frame with a few variables
gg <- getData(sgp2012, c("cnt","read","w_fstuwt"))
head(gg)

# conduct an analysis on the edsurvey.data.frame
edsurveyTable(formula=read ~ st04q01 + st20q01, data = sgp2012)

## End(Not run)
Opens a connection to the Programme for International Student Assessment (PISA) YAFS 2016 data file and returns an edsurvey.data.frame with information about the file and data.
readPISA_YAFS( datPath = file.path(getwd(), "PISA_YAFS2016_Data.dat"), spsPath = file.path(getwd(), "PISA_YAFS2016_SPSS.sps"), esdf_PISA2012_USA = NULL )
datPath |
a character value of the file location where the data file (.dat) file is saved. |
spsPath |
a character value of the file location where the SPSS (.sps) script file is saved to parse the |
esdf_PISA2012_USA |
(optional) an |
Reads in the unzipped files for the PISA YAFS. The PISA YAFS dataset is a follow-up study of a subset of the students who participated in the PISA 2012 USA study. It can be analyzed on its own as a singular dataset or optionally merged with the PISA 2012 USA data, in which case there will be two sets of weights in the merged dataset (the default PISA YAFS weights and the PISA 2012 USA weights).
An edsurvey.data.frame for the PISA YAFS dataset if the esdf_PISA2012_USA parameter is NULL. If the PISA 2012 USA edsurvey.data.frame is specified for the esdf_PISA2012_USA parameter, then the result is an edsurvey.data.frame allowing analysis of the combined dataset.
Tom Fink
## Not run: 
#Return an edsurvey.data.frame for only the PISA YAFS dataset.
#Either omit, or set the esdf_PISA2012_USA to a NULL value.
yafs <- readPISA_YAFS(datPath = "~/PISA YAFS/2016/PISA_YAFS2016_Data.dat",
                      spsPath = "~/PISA YAFS/2016/PISA_YAFS2016_SPSS.sps",
                      esdf_PISA2012_USA = NULL)

#If wanting to analyze the PISA YAFS dataset in conjunction with the PISA 2012
#United States of America (USA) dataset, it should be read in first to an edsurvey.data.frame.
#Then pass the resulting edsurvey.data.frame as a parameter for the
#esdf_PISA2012_USA argument. No other edsurvey.data.frames are supported.
usa2012 <- readPISA("~/PISA/2012", database = "INT", countries = "usa")

yafs <- readPISA_YAFS(datPath = "~/PISA YAFS/2016/PISA_YAFS2016_Data.dat",
                      spsPath = "~/PISA YAFS/2016/PISA_YAFS2016_SPSS.sps",
                      esdf_PISA2012_USA = usa2012)

head(yafs)

## End(Not run)
Opens a connection to a School Survey on Crime and Safety (SSOCS) data file and returns an edsurvey.data.frame, or an edsurvey.data.frame.list if multiple files are specified, with information about the file(s) and data.
readSSOCS(sasDataFiles, years, forceReread = FALSE, verbose = TRUE)
sasDataFiles |
a character vector to the full SAS (*.sas7bdat) data file path(s) you wish to read.
If multiple paths are specified as a vector, it will return an |
years |
an integer vector of the year associated with the index position of the |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value to either print or suppress status message output.
The default value is |
Reads in the unzipped files downloaded from the SSOCS Data Products website in SAS format. Other sources of SSOCS data, such as restricted-use data or other websites, may require additional conversion steps to generate the required SAS format.
An edsurvey.data.frame if one data file is specified or an edsurvey.data.frame.list if multiple files are specified in the sasDataFiles parameter.
For the readSSOCS function, value label information is stored and retrieved automatically within the EdSurvey package (based on the year parameter), as the SAS files contain only raw data values.
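Because value labels are looked up by year, each entry of years needs to line up with the file at the same position in sasDataFiles. One way to keep the two vectors aligned is to derive the years from the file names. The sketch below assumes the public-use naming pattern seen in the example for this function (`pu_ssocs18.sas7bdat`) and a two-digit year in the 2000s; both are assumptions, and the paths are placeholders.

```r
# Placeholder file paths following the assumed "pu_ssocsYY" naming pattern
sasDataFiles <- c("~/SSOCS/2018/pu_ssocs18.sas7bdat",
                  "~/SSOCS/2016/pu_ssocs16.sas7bdat")

# Extract the two-digit year from each file name, in file order
twoDigit <- sub("^pu_ssocs([0-9]{2})\\.sas7bdat$", "\\1", basename(sasDataFiles))
years <- 2000L + as.integer(twoDigit)

# Each year now sits at the same index as its file
data.frame(file = basename(sasDataFiles), year = years)
```

Files that do not follow the assumed pattern would need their years supplied by hand.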
Tom Fink
downloadSSOCS and getData
## Not run: 
#download SSOCS data for years 2016 and 2018
downloadSSOCS(years = c(2016, 2018))

rootPath <- "~/" # may need to change this

#get SAS *.sas7bdat file paths of all SSOCS files for 2016 and 2018
filesToImport <- list.files(path = file.path(rootPath, "SSOCS", c(2016, 2018)),
                            pattern="\\.sas7bdat$", full.names = TRUE)

#import all files to edsurvey.data.frame.list object
esdfList <- readSSOCS(sasDataFiles = filesToImport, years = c(2016, 2018),
                      forceReread = FALSE, verbose = TRUE)

#reading in the 2018 to an edsurvey.data.frame object
esdf <- readSSOCS(sasDataFiles = file.path(rootPath, "SSOCS/2018/pu_ssocs18.sas7bdat"),
                  years = 2018, forceReread = FALSE, verbose = TRUE)

#search for variables in the edsurvey.data.frame containing the word 'bully'
searchSDF(string="bully", data=esdf)

## End(Not run)
Opens a connection to a TALIS data file and returns an edsurvey.data.frame with information about the file and data.
readTALIS( path, countries, isced = c("b", "a", "c"), dataLevel = c("teacher", "school"), forceReread = FALSE, verbose = TRUE )
path |
a character vector to the full directory path(s) to the TALIS SPSS files (.sav) |
countries |
a character vector of the country/countries to include using the
three-letter ISO country code. A list of country codes can be found in
the TALIS codebook, or you can use
https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes.
You can use |
isced |
a character value that is one of |
dataLevel |
a character value that indicates which data level to be used. It can be |
forceReread |
a logical value to force rereading of all processed data. Defaults to |
verbose |
a logical value that will determine if you want verbose output while the function is running to indicate the progress.
Defaults to |
Reads in the unzipped files downloaded from the TALIS database using the OECD Repository (https://www.oecd.org/education/talis.html).
If dataLevel is set to teacher, it treats the teacher data file as the main dataset and merges school data into teacher data for each country automatically. Use this option to analyze just teacher variables, or both teacher and school level variables together.
If dataLevel is set to school, it uses only the school data file (no teacher data will be available).
an edsurvey.data.frame for a single specified country or an edsurvey.data.frame.list if multiple countries are specified
Paul Bailey, Tom Fink, and Trang Nguyen
Organisation for Economic Co-operation and Development. (2018). TALIS 2018 technical report. Retrieved from https://www.oecd.org/education/talis/TALIS_2018_Technical_Report.pdf
getData and downloadTALIS
## Not run: 
#TALIS 2018 - school level data for all countries
talis18 <- readTALIS(path = "~/TALIS/2018", isced = "b",
                     dataLevel = "school", countries = "*")

#unweighted summary
result <- summary2(data=talis18, variable="tc3g01", weightVar = "")

#print usa results to console
result$usa

# the following call returns an edsurvey.data.frame to TALIS 2013
# for US teacher-level data at secondary level
usa2013 <- readTALIS(path = "~/TALIS/2013", isced = "b",
                     dataLevel = "teacher", countries = "usa")

# extract a data.frame with a few variables
gg <- getData(usa2013, c("tt2g05b", "tt2g01"))
head(gg)

# conduct an analysis on the edsurvey.data.frame
edsurveyTable(formula=tt2g05b ~ tt2g01, data = usa2013)

## End(Not run)
Opens a connection to a TIMSS data file and returns an edsurvey.data.frame with information about the file and data.
readTIMSS( path, countries, gradeLvl = c("4", "8", "4b", "8b"), forceReread = FALSE, verbose = TRUE )
path |
a character vector to the full directory path(s) to the TIMSS extracted SPSS (.sav) set of data |
countries |
a character vector of the country/countries to include using
the three-letter ISO country code.
A list of country codes can be found on Wikipedia at
https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes
or other online sources. Consult the TIMSS User Guide
documentation to help determine what countries
are included within a specific testing year of TIMSS and
for country code definitions.
To select all countries available, use a wildcard value of |
gradeLvl |
a character value to indicate the specific grade level you wish to return
|
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value to either print or suppress status message output.
The default value is |
Reads in the unzipped files downloaded from the TIMSS international database(s) using the IEA Study Data Repository. Data files require the SPSS data file (.sav) format using the default filenames.
A TIMSS edsurvey.data.frame
includes three distinct data levels:
student
school
teacher
When the getData function is called using a TIMSS edsurvey.data.frame, the requested variables are inspected and any necessary data merges are handled automatically.
The school data will always be returned merged to the student data, even if only school variables are requested.
If teacher variables are requested by the getData call, the teacher data are merged as well.
Many students can be linked to many teachers, and the extent of this linkage varies widely between countries.
Please note that calling the dim function on a TIMSS edsurvey.data.frame returns the row count as if the teacher dataset were merged.
This row count is considered the full data N of the edsurvey.data.frame, even if no teacher data are included in an analysis.
The column count returned by dim is the count of unique column variables across all three data levels.
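The effect of the teacher merge on the row count can be illustrated with a small base R sketch. The data frames and column names below are hypothetical toy data, not TIMSS variables, and the code does not use EdSurvey itself; it only shows why a many-to-many student-teacher link produces more rows than there are students.

```r
# Hypothetical toy data; column names are illustrative, not TIMSS variables
students <- data.frame(idstud = c(1, 2, 3),
                       idschool = c(10, 10, 20))
schools <- data.frame(idschool = c(10, 20),
                      urban = c("yes", "no"))
# student-teacher links: student 1 has two teachers, students 2 and 3 have one each
links <- data.frame(idstud = c(1, 1, 2, 3),
                    idteach = c(100, 101, 100, 102))

# school data merge onto students without changing the row count
studentSchool <- merge(students, schools, by = "idschool")
nrow(studentSchool)

# teacher links duplicate student rows: one row per student-teacher pair
withTeachers <- merge(studentSchool, links, by = "idstud")
nrow(withTeachers)
```

In this sketch the second row count is the analogue of what dim reports, because it counts student-teacher pairs rather than students.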
Beginning with TIMSS 2015, a numeracy
dataset was designed to assess
mathematics at the end of the primary school cycle
for countries where most children are still developing fundamental mathematics skills.
The numeracy
dataset is handled automatically for the user and is
included within the fourth-grade dataset gradeLvl=4
.
Most numeracy
countries have a 4th grade
dataset in addition
to their numeracy
dataset, but some do not.
For countries that have both a numeracy
and a 4th grade
dataset,
the two datasets are combined into one edsurvey.data.frame
for that country.
Data variables missing from either dataset are kept, with NA
values
inserted for the dataset records where that variable did not exist.
Data variables common to both datasets are kept as a single data variable,
with records retaining their original values from the source dataset.
Consult the TIMSS User Guide for further information.
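The combination rule described above (variables found in only one dataset are kept and filled with NA, shared variables stay as a single column) can be sketched in base R. The two data frames and the helper addMissing are hypothetical and only illustrate the idea; they are not how the package stores or merges the files.

```r
# Hypothetical toy data standing in for the two source datasets
grade4 <- data.frame(idstud = 1:2, score = c(510, 480), g4only = c("a", "b"))
numeracy <- data.frame(idstud = 3:4, score = c(420, 455), numOnly = c(TRUE, FALSE))

# hypothetical helper: add any column missing from a data frame, filled with NA
addMissing <- function(df, allCols) {
  for (col in setdiff(allCols, names(df))) {
    df[[col]] <- NA
  }
  df[, allCols, drop = FALSE]
}

allCols <- union(names(grade4), names(numeracy))
# shared columns keep their original values; dataset-specific columns get NA
combined <- rbind(addMissing(grade4, allCols), addMissing(numeracy, allCols))
combined
```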
For the TIMSS 2019 study, a bridge study was conducted to help compute adjustment factors between the electronic test format and the paper-and-pencil format. The bridge study is considered separate from the normal TIMSS 2019 study. The gradeLvl parameter includes a "4b" option for the Grade 4 bridge study files and an "8b" option for the Grade 8 bridge study files.
an edsurvey.data.frame
for a single specified country or an edsurvey.data.frame.list
if multiple countries are specified
Tom Fink
readNAEP
, getData
, and downloadTIMSS
## Not run:
# single country specified
fin <- readTIMSS(path="~/TIMSS/2015", countries = c("fin"), gradeLvl = 4)
gg <- getData(data=fin, varnames=c("asbg01", "totwgt", "srea"))
head(gg)
edsurveyTable(formula=srea ~ asbg01, fin)

# multiple countries returned as edsurvey.data.frame.list, specify all countries with '*' argument
timss2011 <- readTIMSS(path="~/TIMSS/2011", countries="*", gradeLvl = 8, verbose = TRUE)
# print out edsurvey.data.frame.list covariates
timss2011$covs
## End(Not run)
Opens a connection to a TIMSS Advanced data file and
returns an edsurvey.data.frame
with
information about the file and data.
readTIMSSAdv( path, countries, subject = c("math", "physics"), forceReread = FALSE, verbose = TRUE )
path |
a character vector of the full directory path to the extracted TIMSS Advanced SPSS (.sav) data files |
countries |
a character vector of the country/countries to include using
the three-letter ISO country code (e.g., "swe").
A list of country codes can be found on Wikipedia at
https://en.wikipedia.org/wiki/ISO_3166-1#Current_codes
or other online sources. Consult the TIMSS Advanced User Guide to help determine what countries
are included within a specific testing year of TIMSS Advanced.
To select all countries, use a wildcard value of "*". |
subject |
a character value to indicate if you wish to import the |
forceReread |
a logical value to force rereading of all processed data.
The default value of |
verbose |
a logical value to either print or suppress status message output.
The default value is TRUE. |
Reads in the unzipped files downloaded from the TIMSS Advanced international database(s) using the IEA Study Data Repository. The data files must be in SPSS (.sav) format and must keep their default filenames.
A TIMSS Advanced edsurvey.data.frame
includes three distinct data levels:
student
school
teacher
When the getData function is called using a TIMSS Advanced edsurvey.data.frame, the requested variables are inspected and any necessary data merges are handled automatically.
The school data will always be returned merged to the student data, even if only school variables are requested.
If teacher variables are requested by the getData call, the teacher data are merged as well.
Many students can be linked to many teachers, and the extent of this linkage varies widely between countries.
Please note that calling the dim function on a TIMSS Advanced edsurvey.data.frame returns the row count as if the teacher dataset were merged.
This row count is considered the full data N of the edsurvey.data.frame, even if no teacher data are included in an analysis.
The column count returned by dim is the count of unique column variables across all three data levels.
an edsurvey.data.frame
for a single specified country or an edsurvey.data.frame.list
if multiple countries are specified
Tom Fink
readNAEP
, readTIMSS
, getData
, and downloadTIMSSAdv
## Not run:
swe <- readTIMSSAdv("~/TIMSSAdv/2015", countries = c("swe"), subject = "math")
gg <- getData(data=swe, varnames=c("itsex", "totwgt", "malg"))
head(gg)
edsurveyTable(formula=malg ~ itsex, swe)
## End(Not run)
Many R functions strip attributes from data frame objects. This
function assigns the attributes from the attributeData
argument
to the data frame in the data
argument.
rebindAttributes(data, attributeData)
data |
a |
attributeData |
an |
a data.frame
with a class of a light.edsurvey.data.frame
containing
all elements of data and the attributes (except
names
and row.names
) from attributeData
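The idea behind this function can be shown with plain base R objects. The sketch below is only an illustration under stated assumptions: the class name my.light.frame, the weights attribute, and the helper rebind are all hypothetical, and the package's own implementation may differ.

```r
# a data frame carrying a custom attribute and an extra class (both hypothetical)
original <- data.frame(x = 1:3, y = c(2, 4, 6))
attr(original, "weights") <- "totwgt"
class(original) <- c("my.light.frame", "data.frame")

# rebuilding a data frame from columns returns a plain data.frame
# without the custom attribute
stripped <- data.frame(x = original$x, z = original$x * 2)
is.null(attr(stripped, "weights"))

# hypothetical helper: copy every attribute except names and row.names
rebind <- function(data, attributeData) {
  keep <- setdiff(names(attributes(attributeData)), c("names", "row.names"))
  for (a in keep) {
    attr(data, a) <- attr(attributeData, a)
  }
  data
}

restored <- rebind(stripped, original)
class(restored)
```

Skipping names and row.names matters because the modified data usually has different columns or rows than the object the attributes come from.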
Paul Bailey and Trang Nguyen
## Not run:
require(dplyr)
# edsurveyHome is assumed to be a user-defined path to the root data folder
PISA2012 <- readPISA(path = paste0(edsurveyHome, "PISA/2012"),
                     database = "INT", countries = "ALB", verbose=TRUE)
ledf <- getData(data = PISA2012,
                varnames = c("cnt", "oecd", "w_fstuwt", "st62q04", "st62q11", "st62q13", "math"),
                dropOmittedLevels = FALSE, addAttributes = TRUE)
omittedLevels <- c('Invalid', 'N/A', 'Missing', 'Miss', 'NA', '(Missing)')
for (i in c("st62q04", "st62q11", "st62q13")) {
  ledf[,i] <- factor(ledf[,i], exclude=omittedLevels)
  ledf[,i] <- as.numeric(ledf[,i])
}

# after applying some dplyr functions, the "light.edsurvey.data.frame" becomes just "data.frame"
# the pipe passes the modified data as the first argument (data),
# so the attribute source is named explicitly
PISA2012_ledf <- ledf %>%
  rowwise() %>%
  mutate(avg_3 = mean(c(st62q04, st62q11, st62q13), na.rm = TRUE)) %>%
  ungroup() %>%
  rebindAttributes(attributeData=PISA2012) # could also be called with ledf
class(PISA2012_ledf) # again, a light.edsurvey.data.frame
lma <- lm.sdf(formula=math ~ avg_3, data=PISA2012_ledf)
summary(lma)

PISA2012_ledf <- ledf %>%
  rowwise() %>%
  mutate(avg_3 = mean(c(st62q04, st62q11, st62q13), na.rm = TRUE)) %>%
  ungroup() %>%
  rebindAttributes(attributeData=ledf) # return attributes and make a light.edsurvey.data.frame
# again a light.edsurvey.data.frame
lma <- lm.sdf(formula=math ~ avg_3, data=PISA2012_ledf)
summary(lma)
## End(Not run)
Recodes variables in an edsurvey.data.frame
,
a light.edsurvey.data.frame
, or an edsurvey.data.frame.list
.
recode.sdf(x, recode)
x |
an |
recode |
a list of recoding rules. See Examples for the format of recoding rules. |
an object of the same class as x
with the recode
added to it
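The shape of a recoding rule list, as used in the Examples, can be tried out on an ordinary factor. The helper applyRecode below is hypothetical and works on a plain data.frame; it only mimics the rule format (one named entry per rule, each with from levels and a to level) and is not the package's implementation.

```r
# recoding rules in the format used by the Examples: one entry per rule,
# named by variable, each with `from` levels and a single `to` level
rules <- list(itsex = list(from = c("MALE"), to = c("BOY")),
              itsex = list(from = c("FEMALE"), to = c("GIRL")))

# hypothetical helper: apply the rules to a plain data.frame of factors
applyRecode <- function(df, rules) {
  for (i in seq_along(rules)) {
    v <- names(rules)[i]
    lev <- levels(df[[v]])
    # relabel every level listed in `from` with the `to` label
    lev[lev %in% rules[[i]]$from] <- rules[[i]]$to
    levels(df[[v]]) <- lev
  }
  df
}

d <- data.frame(itsex = factor(c("MALE", "FEMALE", "FEMALE")))
recoded <- applyRecode(d, rules)
table(recoded$itsex)
```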
Trang Nguyen and Paul Bailey
## Not run:
# filepath argument will vary by operating system conventions
usaG4.15 <- readTIMSS(path="~/TIMSS/2015", "usa", 4)
d <- getData(usaG4.15, "itsex")
summary(d) #show details: MALE/FEMALE
usaG4.15 <- recode.sdf(usaG4.15,
                       recode = list(itsex=list(from=c("MALE"), to=c("BOY")),
                                     itsex=list(from=c("FEMALE"), to=c("GIRL"))))
d <- getData(usaG4.15, "itsex") #apply recode
summary(d) #show details: BOY/GIRL
## End(Not run)
Renames variables in an edsurvey.data.frame
,
a light.edsurvey.data.frame
, or an edsurvey.data.frame.list
.
This function is often used when users want to conduct a gap analysis across
years but the variable names differ between the two years of data.
rename.sdf(x, oldnames, newnames, avoid_duplicated = TRUE)
x |
an |
oldnames |
a character vector of old variable names |
newnames |
a character vector of new variable names to replace the corresponding old names |
avoid_duplicated |
a logical value to indicate whether to avoid renaming the
variable if the corresponding new name already exists in the data.
Defaults to TRUE. |
All variable names are coerced to lowercase to comply with
the EdSurvey
standard.
an object of the same class as x
with new variable names
Trang Nguyen
## Not run:
usaG4.15 <- readTIMSS(path="~/TIMSS/2015", "usa", 4)
usaG4.15.renamed <- rename.sdf(x=usaG4.15,
                               oldnames=c("itsex", "mmat"),
                               newnames=c("gender", "math_overall"))
lm1 <- lm.sdf(formula=math_overall ~ gender, data = usaG4.15.renamed)
summary(lm1)
## End(Not run)
rounding helper
roundn(n)
n |
round to this level |
a function that rounds to n
Paul Bailey
rounding helper for NCES
roundNCES(n)
n |
the value to be rounded; accepts a vector |
the rounded value
Paul Bailey
Fits a quantile regression model that uses weights and variance estimates appropriate for the data.
rq.sdf( formula, data, tau = 0.5, weightVar = NULL, relevels = list(), jrrIMax = 1, dropOmittedLevels = TRUE, defaultConditions = TRUE, recode = NULL, returnNumberOfPSU = FALSE, omittedLevels = deprecated(), ... )
formula |
a |
data |
an |
tau |
the quantile to be estimated. The value must be between 0 and 1; the default is 0.5. |
weightVar |
a character indicating the weight variable to use.
The |
relevels |
a list. Used to change the contrasts from the default treatment contrasts to the treatment contrasts with a chosen omitted group (the reference group). The name of each element should be the variable name, and the value should be the group to be omitted (the reference group). |
jrrIMax |
when using the jackknife variance estimation method, the default estimation option, |
dropOmittedLevels |
a logical value. When set to the default value of |
defaultConditions |
a logical value. When set to the default value of |
recode |
a list of lists to recode variables. Defaults to |
returnNumberOfPSU |
a logical value set to |
omittedLevels |
this argument is deprecated. Use |
... |
additional parameters passed from |
The function computes an estimate of the tau
-th conditional quantile function of the response,
given the covariates, as specified by the formula argument. Like lm.sdf()
, the
function presumes a linear specification for the quantile regression model (i.e., that the
formula defines a model that is linear in parameters). Unlike lm.sdf()
, the jackknife is the
only applicable variance estimation method used by the function.
For further details on quantile regression models and how they are implemented in R, see Koenker
and Bassett (1978), Koenker (2005), and the vignette from the quantreg
package—
accessible by vignette("rq",package="quantreg")
—on which this function is
built.
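As a rough illustration of what a weighted quantile fit optimizes, the base R sketch below minimizes the weighted check loss of Koenker and Bassett (1978) for an intercept-only model. The helpers checkLoss and weightedQuantile are hypothetical names introduced here; the package itself builds on quantreg and handles covariates, plausible values, and replicate weights, none of which are shown.

```r
# check (pinball) loss: tau * u for u >= 0 and (tau - 1) * u for u < 0
checkLoss <- function(u, tau) {
  u * (tau - (u < 0))
}

# weighted tau-th quantile of y: the observed value that minimizes the
# weighted check loss (an intercept-only quantile regression)
weightedQuantile <- function(y, w, tau) {
  loss <- vapply(y, function(q) sum(w * checkLoss(y - q, tau)), numeric(1))
  y[which.min(loss)]
}

y <- 1:10
w <- rep(1, 10)
weightedQuantile(y, w, tau = 0.75)

# a heavily weighted observation pulls the median toward itself
weightedQuantile(c(1, 2, 3), w = c(1, 1, 10), tau = 0.5)
```

Searching only over observed values is enough here because the check loss is piecewise linear in the candidate quantile, so a minimum always occurs at a data point.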
For further details on how left-hand side variables, survey sampling weights, and estimated
variances are correctly handled, see lm.sdf
or the vignette titled
Statistical Methods Used in EdSurvey.
An edsurvey.rq
with the following elements:
call |
the function call |
formula |
the formula used to fit the model |
tau |
the quantile to be estimated |
coef |
the estimates of the coefficients |
se |
the standard error estimates of the coefficients |
Vimp |
the estimated variance from uncertainty in the scores (plausible value variables) |
Vjrr |
the estimated variance from sampling |
M |
the number of plausible values |
varm |
the variance estimates under the various plausible values |
coefm |
the values of the coefficients under the various plausible values |
coefmat |
the coefficient matrix (typically produced by the summary of a model) |
weight |
the name of the weight variable |
npv |
the number of plausible values |
njk |
the number of the jackknife replicates used; set to |
rho |
the mean value of the objective function across the plausible values |
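The relationship between coefm, Vimp, Vjrr, and se can be sketched with Rubin's (1987) combining rules. The numbers below are made up, and the sketch assumes that Vimp already includes the (1 + 1/M) factor and that the standard error is the square root of Vjrr plus Vimp; consult the Statistical Methods Used in EdSurvey vignette for the definitions the package actually uses.

```r
# hypothetical inputs for a single coefficient
coefm <- c(2.10, 2.30, 2.20, 2.40, 2.00)  # estimate under each plausible value
Vjrr <- 0.04                               # sampling variance (e.g., from the jackknife)

M <- length(coefm)
# point estimate: average over the plausible values
est <- mean(coefm)
# imputation variance: between-plausible-value variance with the (1 + 1/M) factor
Vimp <- (1 + 1 / M) * var(coefm)
# total variance is the sum of the two components
se <- sqrt(Vjrr + Vimp)
c(estimate = est, Vimp = Vimp, se = se)
```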
Trang Nguyen, Paul Bailey, and Yuqi Liao
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279–292.
Johnson, E. G., & Rust, K. F. (1992). Population inferences and variance estimation for NAEP data. Journal of Education Statistics, 17(2), 175–190.
Koenker, R. W., & Bassett, G. W. (1978). Regression quantiles. Econometrica, 46, 33–50.
Koenker, R. W. (2005). Quantile regression. Cambridge, UK: Cambridge University Press.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# conduct quantile regression at a given tau value (by default, tau is set to be 0.5)
rq1 <- rq.sdf(formula=composite ~ dsex + b017451, data=sdf, tau = 0.8)
summary(rq1)
## End(Not run)
Score assessments
scoreDefault(edf, polyParamTab, dichotParamTab, scoreDict)
edf |
the data |
polyParamTab |
see |
dichotParamTab |
see |
scoreDict |
a data frame; see Details. |
The default scorer scores the columns of edf identified by polyParamTab$ItemID and dichotParamTab$ItemID, using a crosswalk in scoreDict.
The scoreDict is a data frame in long format with columns key, answer, and score.
Within the item identified by key, the function maps each answer to its score.
a data frame with the columns in the scoreDict
key
column mapped from answer
to score
.
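A long-format crosswalk of this kind can be applied with match in base R. The item names, answers, and the helper scoreItems below are hypothetical; the sketch only illustrates the key/answer/score lookup and ignores the parameter tables that the real function also takes.

```r
# hypothetical crosswalk in long format: one row per (item, answer) pair
scoreDict <- data.frame(
  key = c("item1", "item1", "item2", "item2", "item2"),
  answer = c("Correct", "Incorrect", "Full", "Partial", "None"),
  score = c(1, 0, 2, 1, 0)
)

# hypothetical responses, one column per item
edf <- data.frame(item1 = c("Correct", "Incorrect", "Correct"),
                  item2 = c("Partial", "Full", "None"))

# within each item, look up the score for every answer in the crosswalk
scoreItems <- function(edf, scoreDict) {
  for (item in unique(scoreDict$key)) {
    dict <- scoreDict[scoreDict$key == item, ]
    edf[[item]] <- dict$score[match(edf[[item]], dict$answer)]
  }
  edf
}

scored <- scoreItems(edf, scoreDict)
scored
```

Answers that do not appear in the crosswalk for an item come back as NA in this sketch, which makes unmapped responses easy to spot.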
Paul Bailey and Tom Fink
Scoring TIMSS data
scoreTIMSS(edf, polyParamTab, dichotParamTab, scoreCard = NULL)
edf |
a TIMSS |
polyParamTab |
a dataframe containing IRT parameters for all polytomous items in |
dichotParamTab |
a dataframe containing IRT parameters for all dichotomous items in |
scoreCard |
unused |
This function scores TIMSS data.
For multiple choice items, correct answers are assigned 1 point, and incorrect answers are assigned 0 points.
For constructed response items, correct answers are assigned 2 points, partially correct answers are assigned 1 point,
and incorrect answers are assigned 0 points. For both types of items, "NOT REACHED" and "OMITTED OR INVALID" are assigned 0 points.
These defaults can be changed by modifying the scoreDict
columns pointMult
and pointConst
, respectively.
scored edf
Calculate the standard deviation of a numeric variable in an edsurvey.data.frame
.
SD( data, variable, weightVar = NULL, jrrIMax = 1, varMethod = "jackknife", dropOmittedLevels = TRUE, defaultConditions = TRUE, recode = NULL, targetLevel = NULL, jkSumMultiplier = getAttributes(data, "jkSumMultiplier"), returnVarEstInputs = FALSE, omittedLevels = deprecated() )
data |
an |
variable |
character vector of variable names |
weightVar |
character weight variable name. Default is the default weight of |
jrrIMax |
a numeric value; when using the jackknife variance estimation method, the default estimation option, |
varMethod |
deprecated parameter; |
dropOmittedLevels |
a logical value. When set to |
defaultConditions |
a logical value. When set to the default value of
|
recode |
a list of lists to recode variables. Defaults to |
targetLevel |
a character string. When specified, calculates the gap in
the percentage of students at
|
jkSumMultiplier |
when the jackknife variance estimation method—or
balanced repeated replication (BRR)
method—multiplies the final jackknife variance estimate by a value,
set |
returnVarEstInputs |
a logical value set to |
omittedLevels |
this argument is deprecated. Use |
a list object with elements:
mean |
the mean assessment score for |
std |
the standard deviation of the |
stdSE |
the standard error of the |
df |
the degrees of freedom of the |
varEstInputs |
the variance estimate inputs used for calculating covariances with |
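For orientation, one common definition of a weighted standard deviation is sketched below in base R. This is an assumption for illustration only: the helper weightedSD is hypothetical, it uses the population form (dividing by the sum of the weights), and it says nothing about how the package handles plausible values or estimates stdSE.

```r
# weighted standard deviation, population form (divides by the sum of weights)
weightedSD <- function(x, w) {
  mu <- sum(w * x) / sum(w)           # weighted mean
  sqrt(sum(w * (x - mu)^2) / sum(w))  # root of the weighted mean squared deviation
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
w <- rep(1, length(x))
weightedSD(x, w)

# a weight of 3 acts like three copies of the same observation
weightedSD(c(1, 3), w = c(1, 3))
weightedSD(c(1, 3, 3, 3), w = rep(1, 4))
```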
Paul Bailey and Huade Huo
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

# get standard deviation for Male's composite score
SD(data = subset(sdf, dsex == "Male"), variable = "composite")

# get several standard deviations
# build an edsurvey.data.frame.list
sdfA <- subset(sdf, scrpsu %in% c(5,45,56))
sdfB <- subset(sdf, scrpsu %in% c(75,76,78))
sdfC <- subset(sdf, scrpsu %in% 100:200)
sdfD <- subset(sdf, scrpsu %in% 201:300)
sdfl <- edsurvey.data.frame.list(datalist=list(sdfA, sdfB, sdfC, sdfD),
                                 labels=c("A locations", "B locations",
                                          "C locations", "D locations"))
# this shows how these datasets will be described:
sdfl$covs

# SD results for each survey
SD(data = sdfl, variable = "composite")
# SD results more compactly and with comparisons
gap(variable="composite", data=sdfl, stDev=TRUE, returnSimpleDoF=TRUE)
## End(Not run)
Retrieves variable names and labels for an edsurvey.data.frame
,
a light.edsurvey.data.frame
, or an edsurvey.data.frame.list
using character string matching.
searchSDF(string, data, fileFormat = NULL, levels = FALSE)
string |
a vector of character strings to search for in the database connection object ( |
data |
an |
fileFormat |
a character vector indicating the data source to search for variables.
The default |
levels |
a logical value; set to |
a data.frame
that shows the variable names, labels,
and levels (if applicable) from an edsurvey.data.frame
or a light.edsurvey.data.frame
based on a matching character string
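The difference between an OR pattern and a vector of search strings can be sketched with grepl on a made-up codebook. The codebook, its column names, and the helper searchCodebook are hypothetical, and the sketch assumes case-insensitive matching against both names and labels, which may not match the package's behavior in every detail.

```r
# hypothetical codebook: variable names and labels
codebook <- data.frame(
  variableName = c("v1", "v2", "v3"),
  Labels = c("Books in home", "Computer at home", "Value of math"),
  stringsAsFactors = FALSE
)

# keep rows where every search string matches the name or the label
searchCodebook <- function(string, codebook) {
  text <- paste(codebook$variableName, codebook$Labels)
  hits <- rep(TRUE, nrow(codebook))
  for (s in string) {
    hits <- hits & grepl(s, text, ignore.case = TRUE)
  }
  codebook[hits, ]
}

searchCodebook("book|value", codebook)       # rows matching either string
searchCodebook(c("book", "home"), codebook)  # rows matching both strings
```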
Michael Lee and Paul Bailey
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

# search both the student and school files by a character string
searchSDF(string="book", data=sdf)

# use the `|` (OR) operator to search several strings simultaneously
searchSDF(string="book|home|value", data=sdf)

# use a vector of strings to search for variables that contain multiple strings,
# such as both "book" and "home"
searchSDF(string=c("book","home"), data=sdf)

# search only the student files by a character string
searchSDF(string="algebra", data=sdf, fileFormat="student")

# search both the student and school files and return a glimpse of levels
searchSDF(string="value", data=sdf, levels=TRUE)

# save the search as an object to return a full data.frame of search
ddf <- searchSDF(string="value", data=sdf, levels=TRUE)
ddf
## End(Not run)
add item response theory data necessary to use mml.sdf
on NAEP data
setNAEPScoreCard(data, dctPath = NULL)
data |
a NAEP |
dctPath |
a file location that points to the location of a NAEP |
a NAEP edsurvey.data.frame
with updated attributes
## Not run:
datFP <- "~/NAEP_Folder/Data/M50NT3AT.dat"
sdf <- readNAEP(path=datFP)

# how to set NAEP mml attributes
# if readNAEP does not detect them automatically
dctFP <- "~/NAEP_Folder/AM/M50NT3AT.dct"
sdf <- setNAEPScoreCard(data=sdf, dctPath=dctFP)
## End(Not run)
Retrieves variable names, variable labels, and value labels for an
edsurvey.data.frame
, a light.edsurvey.data.frame
, or an edsurvey.data.frame.list
.
showCodebook( data, fileFormat = NULL, labelLevels = FALSE, includeRecodes = FALSE )
data |
an |
fileFormat |
a character string indicating the data source to search for variables.
The default |
labelLevels |
a logical value; set to |
includeRecodes |
a logical value; set to |
a data.frame
that shows the variable names, variable labels, value labels,
value levels (if applicable), and the file format data source from an edsurvey.data.frame
, a light.edsurvey.data.frame
,
or an edsurvey.data.frame.list
Michael Lee and Paul Bailey
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

# search both the student and school files, returning levels for variable values
showCodebook(data=sdf, fileFormat=c("student","school"), labelLevels = TRUE, includeRecodes = FALSE)

# return codebook information for the student file, excluding variable value levels,
# including recoded variables
sdf <- recode.sdf(sdf, recode = list(dsex = list(from = c("Male"), to = c("MALE"))))
showCodebook(data=sdf, fileFormat=c("student"), labelLevels = FALSE, includeRecodes = TRUE)

# return codebook information for the student and school files, excluding variable
# value levels and including recoded variables
showCodebook(data=sdf, fileFormat=c("student","school"), labelLevels = FALSE, includeRecodes = TRUE)

# return codebook information for all codebooks in an edsurvey.data.frame; commonly use View()
View(showCodebook(data=sdf))
## End(Not run)
Retrieves a summary of the achievement level cutpoints for a
selected study represented in an
edsurvey.data.frame
, a light.edsurvey.data.frame
,
or an edsurvey.data.frame.list
.
showCutPoints(data)
data |
an |
If there are achievement levels defined, prints one line per subject scale. Each line names the subject and then shows the cut point for each achievement level.
Michael Lee and Paul Bailey
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

# show the cut points
showCutPoints(data=sdf)
## End(Not run)
Prints a summary of the subject scale or subscale and the associated variables for their
plausible values for an edsurvey.data.frame
, a light.edsurvey.data.frame
, or an edsurvey.data.frame.list
.
showPlausibleValues(data, verbose = FALSE)
data |
an |
verbose |
a logical value; set to |
Michael Lee and Paul Bailey
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

# show the plausible values
showPlausibleValues(data=sdf, verbose=TRUE)
## End(Not run)
Prints a summary of the weights in an edsurvey.data.frame
, a light.edsurvey.data.frame
, or an edsurvey.data.frame.list
.
showWeights(data, verbose = FALSE)
data |
an |
verbose |
a logical value; set to TRUE to print the complete list of jackknife replicate weights associated with each full sample weight; otherwise, prints only the full sample weights |
Michael Lee and Paul Bailey
## Not run:
# read in the example data (generated, not real student data)
sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

# show the weights
showWeights(data=sdf, verbose=TRUE)
## End(Not run)
Suggests weights for ECLS-K:2011 data based on the input variables.
suggestWeights( varnames = NULL, data, showAllWeightSuggestions = FALSE, verbose = FALSE )
varnames |
character vector indicating variables to be included in the weight suggestion. |
data |
an |
showAllWeightSuggestions |
a logical value. When set to |
verbose |
a logical value to either print or suppress status message output. |
suggestWeights
provides one additional way to assist researchers in deciding which weight to use for analyses.
This function finds the intersection of the possible weights for the variables provided and ranks the weights in this intersection
based on the number of components each weight adjusts for.
The best weight would adjust for each and every source used and only those sources. However, for many analyses, there will be no weight that adjusts for nonresponse to all the sources of data that are included and for only those sources. When no weight corresponds exactly to the combination of components included in the desired analysis, researchers might prefer to use a weight that includes nonresponse adjustments for more components than they are using in their analysis if that weight also includes nonresponse adjustments for the components they are using.
Researchers should always consult their research questions for optimal weight choice.
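The intersect-then-rank logic described above can be sketched in base R. Everything in this block is hypothetical: the variable names, the weight names, the component counts, and the helper suggest. In particular, ranking from fewest to most adjusted components is an assumption made here for illustration, and the package's actual ordering may differ.

```r
# hypothetical lookup: which weights are valid for each variable
validWeights <- list(
  varA = c("w1", "w2", "w3"),
  varB = c("w2", "w3"),
  varC = c("w2", "w3", "w4")
)
# hypothetical number of components each weight adjusts for
nComponents <- c(w1 = 1, w2 = 3, w3 = 2, w4 = 4)

suggest <- function(varnames) {
  # weights that are valid for every requested variable
  candidates <- Reduce(intersect, validWeights[varnames])
  # assumption: rank candidates from fewest to most adjusted components
  candidates[order(nComponents[candidates])]
}

suggest(c("varA", "varB", "varC"))
```

Here w2 and w3 are the only weights shared by all three variables, and w3 is listed first because it adjusts for fewer components than w2.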
A list of weight variables. The first one is the most appropriate choice.
Huade Huo
Tourangeau, K., Nord, C., Le, T., Sorongon, A.G., Hagedorn, M.C., Daly, P., and Najarian, M. (2015). Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011), User's Manual for the ECLS-K:2011 Kindergarten Data File and Electronic Codebook, Public Version (NCES 2015-074). U.S. Department of Education. Washington, DC: National Center for Education Statistics.
Tourangeau, K., Nord, C., Le, T., Wallner-Allen, K., Hagedorn, M.C., Leggitt, J., and Najarian, M. (2015). Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011), User's Manual for the ECLS-K:2011 Kindergarten-First Grade Data File and Electronic Codebook, Public Version (NCES 2015-078). U.S. Department of Education. Washington, DC: National Center for Education Statistics.
Tourangeau, K., Nord, C., Le, T., Wallner-Allen, K., Vaden-Kiernan, N., Blaker, L. and Najarian, M. (2017). Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011) User's Manual for the ECLS-K:2011 Kindergarten-Second Grade Data File and Electronic Codebook, Public Version (NCES 2017-285). U.S. Department of Education. Washington, DC: National Center for Education Statistics.
Tourangeau, K., Nord, C., Le, T., Wallner-Allen, K., Vaden-Kiernan, N., Blaker, L. and Najarian, M. (2018). Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011) User's Manual for the ECLS-K:2011 Kindergarten-Third Grade Data File and Electronic Codebook, Public Version (NCES 2018-034). U.S. Department of Education. Washington, DC: National Center for Education Statistics.
Tourangeau, K., Nord, C., Le, T., Wallner-Allen, K., Vaden-Kiernan, N., Blaker, L. and Najarian, M. (2018). Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011) User's Manual for the ECLS-K:2011 Kindergarten-Fourth Grade Data File and Electronic Codebook, Public Version (NCES 2018-032). U.S. Department of Education. Washington, DC: National Center for Education Statistics.
Tourangeau, K., Nord, C., Le, T., Wallner-Allen, K., Vaden-Kiernan, N., Blaker, L. and Najarian, M. (2019). Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011) User's Manual for the ECLS-K:2011 Kindergarten-Fifth Grade Data File and Electronic Codebook, Public Version (NCES 2019-051). U.S. Department of Education. Washington, DC: National Center for Education Statistics.
    ## Not run:
    # read in the ECLS-K:2011 data file with parameters specified
    eclsk11 <- readECLS_K2011(path=file.path("~/", "ECLS_K", "2011"),
                              filename = "childK5p.dat",
                              layoutFilename = "ECLSK2011_K5PUF.sps",
                              verbose = FALSE)

    # suggest weight for individual variable
    suggestWeights(varnames="x8mscalk5", data=eclsk11)

    # suggest weight for multiple variables
    suggestWeights(varnames=c("x8mscalk5", "x_chsex_r", "x12sesl"), data=eclsk11)

    ## End(Not run)
Summarizes edsurvey.data.frame variables.
summary2( data, variable, weightVar = attr(getAttributes(data, "weights"), "default"), dropOmittedLevels = FALSE, omittedLevels = deprecated() )
data |
an |
variable |
character vector of variable names |
weightVar |
character weight variable name. Default is the default weight of |
dropOmittedLevels |
a logical value. When set to |
omittedLevels |
this argument is deprecated. Use |
summary of weighted or unweighted statistics of a given variable in an edsurvey.data.frame
For categorical variables, the summary results are a crosstab of all variables and include the following:
[variable name] |
level of the variable in the column name that the row regards. There is one column per element of |
N |
number of cases for each category. Weighted N also is produced if users choose to produce weighted statistics. |
Percent |
percentage of each category. Weighted percent also is produced if users choose to produce weighted statistics. |
SE |
standard error of the percentage statistics |
For continuous variables, the summary results are by variable and include the following:
Variable |
name of the variable the row regards |
N |
total number of cases (both valid and invalid cases) |
Min. |
smallest value of the variable |
1st Qu. |
first quartile of the variable |
Median |
median value of the variable |
Mean |
mean of the variable |
3rd Qu. |
third quartile of the variable |
Max. |
largest value of the variable |
SD |
standard deviation or weighted standard deviation |
NA's |
number of |
Zero weights |
number of zero weight cases if users choose to produce weighted statistics |
If the weight option is chosen, the function produces weighted percentiles and a weighted standard deviation. Refer to the vignette titled Statistical Methods Used in EdSurvey and the vignette titled Methods Used for Estimating Percentiles in EdSurvey for how the function calculates these statistics (with and without plausible values).
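To make the weighted columns concrete, here is a minimal base R sketch of a weighted mean and a weighted standard deviation, together with the NA and zero-weight counts. The data are made up, and the variance formula shown (weighted squared deviations divided by the sum of the weights) is one common definition assumed for illustration; the estimator EdSurvey actually uses is the one documented in the vignettes.

```r
# Made-up scores and weights; one missing score and one zero weight
x <- c(250, 300, 275, NA, 310)
w <- c(1.5, 2.0, 0, 1.0, 0.5)

nNA      <- sum(is.na(x))   # cases reported under "NA's"
nZeroWgt <- sum(w == 0)     # cases reported under "Zero weights"

# Use only cases with a valid score and a positive weight
keep <- !is.na(x) & w > 0
xk <- x[keep]
wk <- w[keep]

# Weighted mean and (assumed) weighted standard deviation
wMean <- sum(wk * xk) / sum(wk)
wSD   <- sqrt(sum(wk * (xk - wMean)^2) / sum(wk))

c(mean = wMean, sd = wSD, NAs = nNA, zeroWeights = nZeroWgt)
```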
Paul Bailey and Trang Nguyen
    ## Not run:
    # read in the example data (generated, not real student data)
    sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

    # print out summary of weighted statistics of a continuous variable
    summary2(data=sdf, variable="composite")

    # print out summary of weighted statistics of a variable, including omitted levels
    # (dropOmittedLevels replaces the deprecated omittedLevels argument)
    summary2(data=sdf, variable="b017451", dropOmittedLevels = FALSE)

    # make a crosstab
    summary2(data=sdf, variable=c("b017451", "dsex"), dropOmittedLevels = FALSE)

    # print out summary of unweighted statistics of a variable
    summary2(data=sdf, variable="composite", weightVar = NULL)

    ## End(Not run)
Changes the name used to refer to a set of plausible values from oldVar to newVar in an edsurvey.data.frame, a light.edsurvey.data.frame, or an edsurvey.data.frame.list.
updatePlausibleValue(oldVar, newVar, data)
oldVar |
a character value indicating the existing name of the variable |
newVar |
a character value indicating the new name of the variable |
data |
an |
an object of the same class as the data argument, with the name of the plausible value updated from oldVar to newVar
Michael Lee and Paul Bailey
getPlausibleValue and showPlausibleValues
    ## Not run:
    # read in the example data (generated, not real student data)
    sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package="NAEPprimer"))

    # get the PVs before
    showPlausibleValues(data=sdf)

    # rename the plausible value and show the PVs after
    sdf2 <- updatePlausibleValue(oldVar="composite", newVar="overall", data=sdf)
    showPlausibleValues(data=sdf2)

    # the new name can be used in a regression
    lm1 <- lm.sdf(formula=overall ~ b017451, data=sdf2)
    summary(lm1)

    ## End(Not run)
When the variance of a derived statistic (e.g., a difference between two statistics) is required, the covariance between the two underlying statistics must be calculated. This function uses results generated by various functions (e.g., lm.sdf) to find the covariance between two statistics.
varEstToCov( varEstA, varEstB = varEstA, varA, varB = varA, jkSumMultiplier, returnComponents = FALSE )
varEstA |
a list of two |
varEstB |
a list of two |
varA |
a character that names the statistic in the |
varB |
a character that names the statistic in the |
jkSumMultiplier |
when the jackknife variance estimation method—or
balanced repeated replication (BRR)
method—multiplies the final jackknife variance estimate by a value,
set |
returnComponents |
set to |
This function is not vectorized, so varA and varB must each contain exactly one variable name.
The method used to compute the covariance is in the vignette titled Statistical Methods Used in EdSurvey.
The method used to compute the degrees of freedom is in the vignette titled Statistical Methods Used in EdSurvey in the section “Estimation of Degrees of Freedom.”
a numeric value; the jackknife covariance estimate. If returnComponents is TRUE, returns a vector of length three: V is the variance estimate, Vsamp is the sampling component of the variance, and Vimp is the imputation component of the variance.
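The role of the covariance in the variance of a difference can be illustrated with a toy jackknife calculation in base R. The replicate estimates below are made up, and the formulas are the generic jackknife sums of cross-products scaled by a multiplier; this sketch ignores the imputation component and is not the package's implementation, which is described in the vignette.

```r
# Full-sample estimates of two statistics, A and B (made-up numbers)
a0 <- 2.0
b0 <- 5.0

# Made-up replicate estimates, one per jackknife replicate
aRep <- c(2.1, 1.9, 2.2, 1.8)
bRep <- c(5.2, 4.9, 5.1, 4.8)

# Assumed multiplier applied to the jackknife sums
jkSumMultiplier <- 1

# Generic jackknife variance and covariance (sampling component only)
varA  <- jkSumMultiplier * sum((aRep - a0)^2)
varB  <- jkSumMultiplier * sum((bRep - b0)^2)
covAB <- jkSumMultiplier * sum((aRep - a0) * (bRep - b0))

# SE(A - B) = sqrt(var(A) + var(B) - 2 * cov(A, B))
seDiff <- sqrt(varA + varB - 2 * covAB)

# Cross-check: jackknife the difference directly
seDirect <- sqrt(jkSumMultiplier * sum(((aRep - bRep) - (a0 - b0))^2))
c(seDiff = seDiff, seDirect = seDirect)
```

The two standard errors agree, which shows why omitting the covariance term would misstate the precision of the difference whenever the two statistics are correlated.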
Paul Bailey
    ## Not run:
    # read in the example data (generated, not real student data)
    sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

    # estimate a regression
    lm1 <- lm.sdf(formula=composite ~ dsex + b017451, data=sdf, returnVarEstInputs=TRUE)
    summary(lm1)

    # estimate the covariance between two regression coefficients
    # note that the variable names are parallel to what they are called in lm1 output
    jkSumMultiplier <- getAttributes(data=sdf, attribute="jkSumMultiplier")
    covFEveryDay <- varEstToCov(varEstA=lm1$varEstInputs,
                                varA="dsexFemale",
                                varB="b017451Every day",
                                jkSumMultiplier=jkSumMultiplier)

    # the estimated difference between the two coefficients
    # note: unname prevents output from being named after the first coefficient
    unname(coef(lm1)["dsexFemale"] - coef(lm1)["b017451Every day"])

    # the standard error of the difference
    # uses the formula SE(A-B) = sqrt(var(A) + var(B) - 2*cov(A,B))
    sqrt(lm1$coefmat["dsexFemale", "se"]^2 +
         lm1$coefmat["b017451Every day", "se"]^2 -
         2 * covFEveryDay)

    ## End(Not run)
This function prints to the console the defined NHES survey codes that are compatible with the readNHES function. Typically, a user will need to set these codes manually only if the 'auto' survey parameter is unable to identify the correct survey type, or in other unusual situations.
viewNHES_SurveyCodes()
Tom Fink
readNHES, getNHES_SurveyInfo
    ## Not run:
    # print the NHES survey information to the console for quick reference
    viewNHES_SurveyCodes()

    ## End(Not run)
Tests on coefficient(s) of edsurveyGlm and edsurveyLm models.
waldTest(model, coefficients, H0 = NULL)
model |
an |
coefficients |
coefficients to be tested, by name or position in
|
H0 |
reference values to test coefficients against, default = 0 |
When plausible values are present, likelihood ratio tests cannot be used. However, the Wald test can be used to test estimated parameters in a model, with the null hypothesis being that one or more parameters are equal to specified values. In the default case, where the null hypothesis value of each parameter is 0, a test that fails to reject the null hypothesis indicates that removing the variables from the model will not substantially harm the fit of that model. Alternative null hypothesis values also can be specified with the H0 argument. See Examples.
Coefficients to test can be specified by an integer (or integer vector) corresponding to the order of coefficients in the summary output. Coefficients also can be specified using a character vector, to specify coefficient names to test. The name of a factor variable can be used to test all levels of that variable.
This test produces both chi-square and F-tests; their calculation is described in the vignette titled Methods and Overview of Using EdSurvey for Running Wald Tests.
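As a rough illustration of the chi-square form of a Wald test, the following base R sketch evaluates the textbook statistic W = (Rb - h)' (R Sigma R')^{-1} (Rb - h), where R selects the tested coefficients and h holds the null values. The coefficients and covariance matrix are made up, and this is not EdSurvey's implementation; the package's exact chi-square and F calculations for survey data are those described in the vignette.

```r
# Made-up coefficient estimates and covariance matrix (not from a real model)
beta  <- c("(Intercept)" = 1.0, x1 = 0.5, x2 = -0.2)
Sigma <- diag(c(0.04, 0.01, 0.04))

# Test the second and third coefficients against null values of 0
idx <- c(2, 3)
H0  <- c(0, 0)

# Hypothesis matrix: one row per tested coefficient
R <- diag(length(beta))[idx, , drop = FALSE]

# Wald chi-square statistic and its p-value
d <- R %*% beta - H0
W <- as.numeric(t(d) %*% solve(R %*% Sigma %*% t(R)) %*% d)
pValue <- pchisq(W, df = length(idx), lower.tail = FALSE)

c(W = W, df = length(idx), p = pValue)
```

Here R plays the same role as the hypoMatrix element, and Sigma the role of the coefficient covariance matrix, in the returned object.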
An edsurveyWaldTest object with the following elements:
Sigma |
coefficient covariance matrix |
coefficients |
indices of the coefficients tested |
H0 |
null hypothesis values of coefficients tested |
result |
result object containing the values of the chi-square and F-tests |
hypoMatrix |
hypothesis matrix used for the Wald Test |
Alex Lishinski and Paul Bailey
Diggle, P. J., Liang, K.-Y., & Zeger, S. L. (1994). Analysis of longitudinal data. Oxford, UK: Clarendon Press.
Draper, N. R., & Smith, H. (1998). Applied regression analysis. New York, NY: Wiley.
Fox, J. (1997). Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: SAGE.
Institute for Digital Research and Education. (n.d.). FAQ: How are the likelihood ratio, Wald, and LaGrange multiplier (score) tests different and/or similar? Los Angeles: University of California at Los Angeles. Retrieved from [https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-are-the-likelihood-ratio-wald-and-lagrange-multiplier-score-tests-different-andor-similar/](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-are-the-likelihood-ratio-wald-and-lagrange-multiplier-score-tests-different-andor-similar/)
Korn, E., & Graubard, B. (1990). Simultaneous testing of regression coefficients with complex survey data: Use of Bonferroni t statistics. The American Statistician, 44(4), 270–276.
    ## Not run:
    # read in the example data (generated, not real student data)
    sdf <- readNAEP(path=system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

    # example with glm model
    myLogit <- logit.sdf(formula=dsex ~ b017451 + b003501, data = sdf, returnVarEstInputs = TRUE)

    # single coefficient integer
    waldTest(model = myLogit, coefficients = 2)

    # set of coefficients integer vector
    waldTest(model = myLogit, coefficients = 2:5)

    # specify levels of factor variable to test
    waldTest(myLogit, c("b017451Every day", "b017451About once a week"))

    # specify all levels of factor variable to test
    waldTest(myLogit, "b017451")

    # example with lm model
    fit <- lm.sdf(formula=composite ~ dsex + b017451, data = sdf, returnVarEstInputs = TRUE)
    waldTest(model = fit, coefficients = "b017451")

    # examples with alternative (nonzero) null hypothesis values
    waldTest(model = myLogit, coefficients = 2, H0 = 0.5)
    waldTest(model = myLogit, coefficients = 2:5, H0 = c(0.5, 0.6, 0.7, 0.8))
    waldTest(model = myLogit, coefficients = "b017451", H0 = c(0.5, 0.6, 0.7, 0.8))
    waldTest(model = myLogit,
             coefficients = c("b017451Every day", "b017451About once a week"),
             H0 = c(0.1, 0.2))

    ## End(Not run)