Research Info
What Are Large-Scale Databases?
A large-scale database is a collection of information from a variety of sources, typically stored electronically in a systematic or structured manner. Large-scale databases allow a user to access specific information about a particular individual, event, or phenomenon, and they also allow a user to analyze the information to answer a number of research questions. In the field of special education, large-scale databases typically contain information about students, teachers, parents, and schools. Some examples of large-scale databases used in special education research are described in Levine, Marder, & Wagner (2004); Newman (2007); and Wagner, Cadwallader, Garza, & Cameto (2004).
Typically, large-scale databases are structured in specific ways. Most follow the relational model, which represents all information as multiple related tables, each consisting of rows and columns. Relational databases represent relationships through values common to more than one table.
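As a minimal sketch of how a relational model links tables through a shared value, the example below builds two small tables in memory and joins them on a common column. The table names, column names, and values are hypothetical and are not drawn from any actual large-scale database.

```python
import sqlite3

# Hypothetical tables and values, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schools  (school_id INTEGER PRIMARY KEY, locale TEXT);
    CREATE TABLE students (student_id INTEGER PRIMARY KEY,
                           school_id INTEGER,
                           disability TEXT);
    INSERT INTO schools  VALUES (10, 'urban'), (20, 'rural');
    INSERT INTO students VALUES (1, 10, 'learning disability'),
                                (2, 10, 'behavioral disorder'),
                                (3, 20, 'learning disability');
""")

# school_id appears in both tables; the join relates each student
# to a school through that shared value.
rows = conn.execute("""
    SELECT s.student_id, s.disability, sc.locale
    FROM students AS s
    JOIN schools AS sc ON s.school_id = sc.school_id
    ORDER BY s.student_id
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Each student row carries only a school identifier; school-level information is stored once in its own table and is attached at analysis time through the join.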
Large-scale databases have been in existence in the United States since the 1920s, and large-scale data collection dates back some 200 years before that. Earlier data collection in Europe and Canada was census-focused and periodic in nature. The United States pioneered a number of studies with children. Longitudinal designs have frequently been used to track large-scale trends over time. Recent technological advances and more sophisticated statistical techniques have enabled macro- and micro-level analyses and the examination of causal paths of association in the data.
Characteristics of Large-Scale Databases
Secondary data
Most large-scale databases consist of sources of data collected by others and archived in some form.
Secondary data are relatively inexpensive to use and are frequently used to answer basic research questions. Research conducted using secondary sources often leads to the development of primary research studies.
A variety of research questions
Data from large-scale databases allow users to answer many types of research questions.
Fosters the use of longitudinal designs
Longitudinal designs allow for repeated measurements over time and, as such, can better capture developmental changes as they occur. This paints a more accurate picture of how a multitude of variables may affect personal and/or social change over time.
Longitudinal designs also provide better insight than cross-sectional designs into causal relations between variables. As such, they can bridge the gap between quantitative and qualitative epistemologies.
The three most commonly used longitudinal designs are:
Repeated cross-sectional (trend) studies, which are conducted regularly with different samples.
Prospective longitudinal (panel) studies, which repeatedly interview the same subjects over time. This is the preferred design.
Retrospective longitudinal studies (event history or duration data), in which subjects are asked to remember or reconstruct certain life events.
Internal Validity
Due to the large number of participants, statistical power is enhanced.
Subjects with similar characteristics can be added to prevent confounding due to attrition. This "freshening" or "refreshening" of participants is a common technique used in large-scale designs.
By examining changes over time, the threat of maturation can be explained, or controlled for.
Multiple data collection methods with established and tested instruments are often used.
External Validity
The use of stratified, multistage designs, oversampling, and a complex series of weights can facilitate generalization to a nationally representative set of participants.
Types of questions that can be answered
Questions involving group means; comparisons between groups, individuals, or institutions; inter- and intra-individual changes over time on variables or measures; and social trends over time are some examples of appropriate uses of large-scale data analysis. Analytic strategies common to quantitative analysis are suitable.
Analytic Strategies
Univariate descriptive, bivariate, regression, ANOVA, MANOVA, log-linear, and logistic regression analyses can be used. To better demonstrate causality, advanced techniques such as Structural Equation Modeling (SEM), Hierarchical Linear Modeling (HLM), and Growth Curve Analysis have recently gained favor.
Initial Steps
Identify whether your question can appropriately be answered using large data sets. Some considerations:
Are you interested in examining the differences between two different subsets of a population on a particular variable or measure? For example:
To what extent do students with learning disabilities and students with behavioral disorders differ on 6th grade reading achievement?
Are there differences in the employment status of adults with disabilities based on ethnicity?
Are you interested in examining the differences between or within individuals over a specified period of time?
Do parent perceptions of their child's education program change from 1st grade to 8th grade?
Does a change in family income affect student school performance from one year until the next?
Identify who provided the data. This is important for determining the context in which the data were generated. For example, parents are most knowledgeable about aspects of a student's non-school experience, while school personnel are most knowledgeable about classroom-specific characteristics, school programs, and the academic functioning of students. Student-generated data can report perceptions about both household and school characteristics. Data are often collected from multiple sources. Some possibilities:
Students
Parents
Teachers
Principals
Identify how the data were collected. Some possibilities:
Telephone survey/interview using legitimate skip logic.
Surveys/questionnaires by mail.
In-person interview with a specific protocol.
Direct assessment
Rating scales
Observations
Identify instruments used. Database designers should provide specific and thorough information about what instruments were used, the reliability and validity of the instruments and procedures for conducting telephone surveys and interviews.
Missing data. Determine the extent of missing data and what this potentially means for your analysis.
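One simple way to gauge the extent of missing data is to compute the proportion of missing values for each variable before any analysis. The sketch below assumes that missing values are stored as None; the variable names and values are hypothetical. Actual databases often use special numeric codes for missing values, which would need to be recoded first.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"reading_score": 512,  "family_income": 42000, "parent_rating": 3},
    {"reading_score": None, "family_income": 38000, "parent_rating": None},
    {"reading_score": 498,  "family_income": None,  "parent_rating": 4},
    {"reading_score": 530,  "family_income": 51000, "parent_rating": None},
]

def missing_rates(rows):
    """Return the proportion of rows with a missing value, per variable."""
    variables = rows[0].keys()
    return {v: sum(r[v] is None for r in rows) / len(rows) for v in variables}

rates = missing_rates(records)
for variable, rate in rates.items():
    print(f"{variable}: {rate:.0%} missing")
```

Variables with high missing rates may call for a different analytic plan or a missing-data strategy before proceeding.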
Key Terminology
Weights/weighting: A design or analytical technique that adjusts for complex survey designs or oversampling so that the sample is representative of the larger population. Since the default procedures in common statistical packages (SPSS, SAS) treat any dataset as though it were constructed through a simple random sample and ignore multistage cluster sampling, it is important to know how the data were constructed. Weights are commonly provided for various units of analysis in large datasets to adjust for design effects. Weighting is also a way to compensate for missing data.
Oversampling: A technique used to obtain adequate sample sizes for relatively small sub-populations within a larger population. This often results in the sub-population being disproportionately represented in the raw data. Weights are then used to adjust for this so that estimates accurately represent the larger population.
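The interplay between oversampling and weights can be illustrated with a small sketch. It assumes a hypothetical population of 900 individuals in one subgroup and 100 in another, with equal numbers sampled from each, so the smaller subgroup is oversampled. Each weight is the number of population members that one sampled person represents. All numbers are invented, and weights supplied with actual databases are typically more complex than this.

```python
# Hypothetical population sizes for two subgroups (strata).
population_sizes = {"A": 900, "B": 100}

# Hypothetical sample of (subgroup, score): five from each subgroup,
# so subgroup B is oversampled relative to its share of the population.
sample = [
    ("A", 70), ("A", 72), ("A", 68), ("A", 71), ("A", 69),
    ("B", 50), ("B", 52), ("B", 48), ("B", 51), ("B", 49),
]

# Count how many people were sampled from each subgroup.
sample_sizes = {}
for stratum, _ in sample:
    sample_sizes[stratum] = sample_sizes.get(stratum, 0) + 1

# Design weight: population members represented by each sampled person.
weights = {s: population_sizes[s] / sample_sizes[s] for s in population_sizes}

unweighted_mean = sum(score for _, score in sample) / len(sample)
weighted_mean = (
    sum(weights[s] * score for s, score in sample)
    / sum(weights[s] for s, _ in sample)
)

print("unweighted mean:", unweighted_mean)
print("weighted mean:  ", weighted_mean)
```

The unweighted mean treats both subgroups as equally common, while the weighted mean restores each subgroup to its share of the population.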
Wave: A data collection point.
Legitimate skip: In a computer-generated phone or written questionnaire, certain questions are automatically skipped depending on the answer to a previous question. For example, "If you answered NO to receiving free and reduced-price lunch, skip to question number 15." This generates missing data for the skipped questions, which must be addressed before conducting an analysis.
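Before analysis, it can help to separate legitimate skips from questions that a respondent should have answered but did not. The sketch below assumes a hypothetical gate question with YES/NO answers and a single follow-up question, with None standing for a blank answer. The labels and function name are illustrative only.

```python
LEGITIMATE_SKIP = "legitimate skip"
ITEM_NONRESPONSE = "item non-response"
ANSWERED = "answered"

def classify_followup(gate_answer, followup_answer):
    """Classify a follow-up item given the answer to its gate question.

    gate_answer: "YES" or "NO" (hypothetical coding of the gate question).
    followup_answer: the recorded answer, or None if the item is blank.
    """
    if followup_answer is not None:
        return ANSWERED
    if gate_answer == "NO":
        # The questionnaire routed the respondent past this item.
        return LEGITIMATE_SKIP
    # The respondent was asked the item but gave no answer.
    return ITEM_NONRESPONSE

print(classify_followup("NO", None))
print(classify_followup("YES", None))
print(classify_followup("YES", "reduced price"))
```

Keeping the two kinds of blanks distinct prevents legitimately skipped items from being counted or imputed as if they were refusals.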
CAPI: Computer-Assisted Personal Interviewing
CATI: Computer-Assisted Telephone Interviewing
Standard Error: A measure of the uncertainty in an estimate. It acknowledges that any population estimate calculated from a sample will only approximate the true value for that population.
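For a mean estimated from a simple random sample, the standard error is commonly computed as the sample standard deviation divided by the square root of the sample size. The sketch below uses that textbook formula with hypothetical scores. It does not account for clustering or weights, so for complex survey designs the appropriate standard error would generally be larger and should come from design-aware procedures.

```python
import math
import statistics

# Hypothetical scores from a simple random sample (illustrative values only).
scores = [4, 8, 6, 5, 7]

sample_mean = statistics.mean(scores)
sample_sd = statistics.stdev(scores)   # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(len(scores))

print(f"mean = {sample_mean}, standard error = {standard_error:.3f}")
```

A smaller standard error indicates that the sample estimate is likely to lie closer to the true population value.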
Missing Data: Missing data is a common nuisance in social science research. The main problems are the threats posed to both internal validity (statistical power) and external validity (ability to generalize to target population). An essential component to analyzing large-scale data is to address the issue of missing data by applying appropriate strategies. When analyzing large-scale databases, missing data primarily occurs due to item non-response, unit non-response or wave non-response (Ruspini, 2002).
Item non-response: Occurs when a respondent who answered other parts of the survey does not answer certain items that he or she should have answered. Item non-response also appears in legitimate skips, where non-response is expected.
Unit non-response: Occurs when a sample member provides no data through refusal or non-contact.
Wave non-response: Occurs when data for a member are completely missing for at least one wave but present for one or more of the other waves. Non-response may be due to illness or maturation.
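The three kinds of non-response above can be told apart mechanically once the data are arranged by wave. The sketch below assumes a hypothetical layout in which each sample member has one entry per wave: None when no data were collected in that wave, or a dictionary of item answers in which None marks an unanswered item. A member can show more than one kind of non-response; this sketch reports only the broadest one.

```python
def classify_member(waves):
    """Classify a sample member's pattern of non-response.

    waves: list with one entry per wave. Each entry is None (no data for
    that wave) or a dict mapping item names to answers (None = unanswered).
    """
    present = [w for w in waves if w is not None]
    if not present:
        return "unit non-response"      # no data in any wave
    if len(present) < len(waves):
        return "wave non-response"      # missing at least one whole wave
    if any(answer is None for w in present for answer in w.values()):
        return "item non-response"      # some items blank within a wave
    return "complete"

# Hypothetical members observed over two waves.
members = {
    "m1": [{"q1": 3, "q2": 4}, {"q1": 2, "q2": 5}],
    "m2": [{"q1": 3, "q2": None}, {"q1": 2, "q2": 5}],
    "m3": [{"q1": 1, "q2": 2}, None],
    "m4": [None, None],
}

patterns = {name: classify_member(w) for name, w in members.items()}
for name, pattern in patterns.items():
    print(name, pattern)
```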
Freshening: A process that introduces into the study new students who would not have been available in the previous wave(s) of data collection. These are individuals who may have been at school outside the United States in the previous wave(s) or who were in a different grade in the previous wave(s). Freshening allows a specific wave to be representative of the population, for example of all 12th grade students in the United States. The data for that wave can then be analyzed as cross-sectional data, in addition to longitudinal data.
Refreshening: Adding new respondents to replace those who drop out of the study.
Simple Random Sampling: Every individual in the population has an equal chance of being selected for the study.
Stratified Random Sampling: Subgroups (strata) are identified and individuals are selected within each so that the proportion of each subgroup in the sample equals its proportion in the population.
Cluster Random Sampling: Instead of selecting randomly at the individual level, this method selects groups randomly, for example classrooms.
Two Stage Stratified Sampling: Groups (clusters) are selected at random in proportion to the subgroup population, and then individuals within each cluster are selected at random. (Typically identified in large-scale database sampling plans.)
Convenience Sampling: A group of individuals who are readily available for the study.
Purposive Sampling: Individuals are selected by the researchers because of certain characteristics that the researcher believes will provide the greatest information.
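The probability-based sampling methods above can be compared in a short sketch. It assumes a hypothetical population of four classrooms with five students each, two of whom in every classroom belong to a subgroup labeled "LD". The function names are illustrative, and the two-stage version is simplified: it selects classrooms at random and then students within them, without the stratification step of an actual sampling plan.

```python
import random

rng = random.Random(0)  # fixed seed so that a run is repeatable

# Hypothetical population: 4 classrooms x 5 students; 2 per classroom are "LD".
population = [
    {"id": c * 5 + i, "classroom": c, "group": "LD" if i < 2 else "other"}
    for c in range(4) for i in range(5)
]

def stratified_sample(units, key, n, rng):
    """Sample each subgroup in proportion to its share of the population."""
    strata = {}
    for u in units:
        strata.setdefault(u[key], []).append(u)
    chosen = []
    for members in strata.values():
        k = round(n * len(members) / len(units))  # rounding may shift the total slightly
        chosen.extend(rng.sample(members, k))
    return chosen

def cluster_sample(units, key, n_clusters, rng):
    """Select whole groups at random and keep every member of each."""
    clusters = sorted({u[key] for u in units})
    picked = set(rng.sample(clusters, n_clusters))
    return [u for u in units if u[key] in picked]

def two_stage_sample(units, key, n_clusters, n_per_cluster, rng):
    """Select groups at random, then individuals at random within each."""
    clusters = sorted({u[key] for u in units})
    chosen = []
    for c in rng.sample(clusters, n_clusters):
        members = [u for u in units if u[key] == c]
        chosen.extend(rng.sample(members, n_per_cluster))
    return chosen

simple = rng.sample(population, 5)                       # simple random sample
stratified = stratified_sample(population, "group", 10, rng)
clustered = cluster_sample(population, "classroom", 2, rng)
two_stage = two_stage_sample(population, "classroom", 2, 3, rng)

print(len(simple), len(stratified), len(clustered), len(two_stage))
```

In the stratified sample the subgroup makes up the same 40 percent share as in the population, whereas the cluster and two-stage samples are confined to the classrooms that happened to be drawn.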
Restricted Use vs. Public Use Data
When accessing a large-scale database for analysis, the data are often made available in two modes: restricted and non-restricted (public). The non-restricted or public data are available to anyone and can either be found through a website (e.g., www.NLTS2.org) or requested from the organization on CD. The difference between public and restricted data lies in what the data contain. Although restricted data do not include direct identifying information about the respondent, they contain enough information that a researcher could potentially identify the individual or school. In addition, certain questions may be removed from the public database to prevent identification (for example, a small population that could lead to identifying an individual). An application must be submitted to obtain the restricted data, and either the individual or the organization must hold a site license. The guidelines for obtaining these data are very strict, so additional research and time must be allotted.
Generally, the non-restricted or public data will serve the majority of researchers. In the case of NLTS2, SEELS, and many of the NCES data sets, data tables have been made available online for those looking to conduct more descriptive data analysis. If your research requires a further level of analysis, then obtaining the restricted CDs for these databases would be necessary.
