Longitudinal Data Analysis, Panel Data Analysis

CHRISTIANE GRILL
University of Vienna, Austria
Longitudinal or panel data is a special type of pooled data which consists of a cross-section of units (e.g., countries, firms, households, individuals) for which there exist
repeated observations over time. Consequently, observations in panel data involve at
least two dimensions: a cross-sectional dimension and a time-series dimension. Panel
data may be generated by pooling time-series observations across units. Longitudinal or
panel data analysis refers to the statistical analysis of such datasets. In general, methods
associated with the terms panel or longitudinal analysis focus on short panels, for which
the number of observed units (N) is large and the number of repeated observations over
time (T) is small. In contrast, methods under the umbrella of time-series cross-section
focus on long panels, for which N is rather small compared to a relatively large T.
Examples
Longitudinal or panel data has become widely available to empirical researchers. Well-known examples of U.S. panel data are the Panel Study of Income Dynamics (PSID)
or the National Longitudinal Surveys of Labor Market Experience (NLS). The PSID
conducted by the University of Michigan collects annual economic information from
a representative national sample of about 6,000 U.S. families and 15,000 individuals.
Its datasets contain over 5,000 variables. The NLS contains five separate longitudinal
databases covering distinct segments of the labor force. Its measured variables focus on the
supply side of the labor market. The most well-known socioeconomic panels in Europe
are, among others, the German Socio-Economic Panel (GSOEP), the British Household
Panel Survey (BHPS), and the Dutch Socio-Economic Panel. On a European level, the
EU statistics on income and living conditions (EU-SILC) collect data on income distribution and social inclusion in the European Union (EU). The EU-SILC nowadays contains longitudinal data on topics such as poverty, housing, education, or health from all
EU member states as well as Iceland, Norway, Switzerland, and Turkey. Panel designs
are also prominent in the field of electoral studies. For instance, the American National
Election Studies (ANES)—established in 1948—conducts national surveys of voters in
the United States before and after every presidential election. Its international counterparts are, for example, the British Election Study (BES), the Dutch Parliamentary
Election Study (DPES), and the Brazilian Electoral Panel Study (BEPS). Besides the
field of political communication, panel designs are on the rise in various research areas
of communication science (e.g., children, adolescents and media use, environmental
communication, health communication, and public relations).
The International Encyclopedia of Communication Research Methods. Jörg Matthes (General Editor),
Christine S. Davis and Robert F. Potter (Associate Editors).
© 2017 John Wiley & Sons, Inc. Published 2017 by John Wiley & Sons, Inc.
DOI: 10.1002/9781118901731.iecrm0134
Benefits and limitations of panel data
Some of the benefits and limitations of panel data for statistical analysis include the
inference of causal propositions, the ability to control for heterogeneity or the existence
of heteroscedasticity and serial correlations (Frees, 2004).
One of the key advantages of panel data is that it provides the opportunity to thoroughly analyze causal propositions. While cross-sectional data allows observations of covariances and therefore does not, strictly speaking, allow drawing conclusions about causality, panel data allows analyzing whether a change in an input
precedes a change in the outcome. In other words, panel data allows observations on
shifts of responses as reactions to an input. For instance, the analysis of cross-sectional
data might reveal a significantly positive relation between media exposure and being
politically informed. However, the analysis does not provide evidence on the cause-and-effect relationship. In contrast, the analysis of longitudinal or panel data might reveal
that increased media exposure causes heightened levels of being politically informed.
Another benefit of using panel data relates to the fact that its datasets are by
nature much larger since the data consists of multiple observations on the same units
over time. The large number of data points increases the degrees of freedom, and
results in more variability and less collinearity among the measured variables than in
cross-sectional designs. Hence, these characteristics overall improve the efficiency of
estimates, and thus, allow more accurate inferences of model parameters. For example,
turnout in national elections and public support for the government to be elected
may be highly correlated for annual time-series observations for a given country. By
stacking or pooling these observations across different countries, the variation in the
data is increased and collinearity reduced. As a result, researchers obtain more reliable
model estimates and are able to test more sophisticated behavioral models using less
restrictive assumptions.
Another advantage of panel data is the possibility to control for individual heterogeneity. In many datasets, subjects (i.e., units) are unlike one another, that is, they are
heterogeneous. In cross-sectional regression analysis, models ascribe the uniqueness
of subjects to a disturbance term. In contrast, longitudinal data allows modeling this
uniqueness. Due to the large number of observations in panel data, researchers are able
to incorporate subject-specific parameters, and hence, are able to control for heterogeneity of individuals. Not controlling for these unobserved individual specific effects
would result in biased estimates. For example, children’s media use is regressed on various individual attributes, such as peer interactions, family structures, gender, race, and
so on. But the error term may still include unobserved individual characteristics, such
as family lifestyle, which are correlated with some of the regressors, such as family
structure. By using panel data, one can difference the data over time and eliminate the
unobserved individual specific effect of family lifestyle.
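The differencing argument can be illustrated with a small simulation. This is a minimal sketch under assumed values, not an analysis of real data: the panel size, the true coefficient of 2.0, and the strength of the correlation between the regressor and the unit effect are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 500, 4, 2.0   # hypothetical short panel: many units, few waves

# Unobserved, time-constant unit effect (think "family lifestyle").
mu = rng.normal(size=(N, 1))
# The regressor is correlated with the unit effect, which biases pooled OLS.
x = 0.8 * mu + rng.normal(size=(N, T))
y = beta * x + mu + 0.5 * rng.normal(size=(N, T))

def slope(x, y):
    """OLS slope of y on x (with an intercept) for one regressor."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc ** 2).sum())

# Pooled OLS ignores the unit effect, which stays in the error term.
pooled = slope(x.ravel(), y.ravel())

# First differences within each unit: the time-constant effect cancels out.
dx, dy = np.diff(x, axis=1), np.diff(y, axis=1)
first_diff = slope(dx.ravel(), dy.ravel())

print(f"pooled OLS slope:       {pooled:.3f}")
print(f"first-difference slope: {first_diff:.3f}  (true value {beta})")
```

In this setup the pooled slope is pushed away from the true value because the unit effect is left in the error term, whereas the differenced regression does not depend on it at all.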
Moreover, panel datasets are better suited to study complex issues of dynamic
behavior. They can, therefore, be utilized to study dynamics of change with the help of
more complicated behavioral models and hypotheses. In doing so, panel data analysis
uncovers dynamic relationships between the dependent and independent variables.
For example, with cross-section data one can estimate the turnout rate in elections at
a particular point in time. Repeated cross-sections show how this proportion changes
over time. But, only panel data allows estimating what proportion of those who voted
in one election also voted in another election.
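The turnout example can be made concrete with a toy calculation. The five respondents and their voting records below are invented for illustration only.

```python
# Hypothetical two-wave panel: respondent -> (voted in election 1, voted in election 2),
# coded 1 = voted, 0 = abstained.
panel = {
    "r1": (1, 1),
    "r2": (1, 0),
    "r3": (0, 1),
    "r4": (1, 1),
    "r5": (0, 0),
}

# What (repeated) cross-sections can show: the turnout rate at each point in time.
turnout_1 = sum(v[0] for v in panel.values()) / len(panel)   # 3 of 5 voted
turnout_2 = sum(v[1] for v in panel.values()) / len(panel)   # 3 of 5 voted

# What only panel data can show: among election-1 voters, who voted again?
voters_1 = [v for v in panel.values() if v[0] == 1]
repeat_rate = sum(v[1] for v in voters_1) / len(voters_1)    # 2 of 3 voted again

print(turnout_1, turnout_2, round(repeat_rate, 3))
```

Aggregate turnout is identical in both elections here, yet a third of the first-election voters did not vote again. That individual-level turnover is invisible without repeated observations on the same units.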
Limitations of panel data encompass problems in the design, data collection,
and data management. Specifically, these problems include problems of coverage
(incomplete account of the units of interest), measurement errors (caused by unclear
questions, memory errors, deliberate distortion, or interviewer effects), nonresponse
(due to the lack of cooperation among units), recall, or frequency of interviewing.
In particular, with panel data, a key concern is that observations for each unit at
every wave may not be possible. In this situation, the nonobservance of these units
in future waves would be missing completely at random (MCAR). In this case, data
could be analyzed by complete-case analysis (analyzing only cases for which all
waves are observed), or available-data analysis (i.e., methods which do not require
response vectors of equal length). A more serious cause of missing data in panels is
attrition (i.e., dropout, panel mortality). Whereas data missing due to censoring are
nearly always MCAR, data missing due to dropout may not fit this criterion. The
failure to re-interview the interested units may result in a selection bias if the attrition
is correlated with substantively relevant characteristics. For instance, the dropout
in a panel survey on environmental behavior might be related to an individual’s
disinterest in environmental protection. If the data are missing at random, imputation
methods may yield unbiased estimates of model quantities. An alternative is multiple
random imputation, which models the probability of missingness and matches missing
observations with observed observations, which have similar probabilities of being
missing. Another strategy, which indirectly remedies the problem of panel attrition,
is to refresh the sample by adding new observations toward the end of the study.
These new observations are called a refreshment sample. This sample allows adjusting
for panel effects. Another alternative is to provide rotating panels. In rotating panel
designs a part of the sample is replaced at each subsequent point in time. In doing so,
rotating designs reduce respondent burden and also provide an opportunity to refresh
the sample with units that better reflect the targeted units of interest.
Another pitfall of panel data is heteroscedasticity. One of the most important
assumptions of classical linear regression models is that the variance of each disturbance term is some constant number. This is the assumption of homoscedasticity or
equal variance. But if the unmodeled variance differs from one individual to the next,
heteroscedasticity is present in the panel data. Heteroscedasticity does not result in
biased estimators, but these estimators no longer have minimum variance. These
problems can be solved by utilizing a generalized least squares (GLS) estimator that
allows for unique variances among individuals.
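The following sketch illustrates both claims by simulation: OLS stays centered on the true slope under unit-specific error variances, but a GLS estimator that weights each observation by the inverse of its error variance is less variable. It assumes the unit variances are known, which is a simplification (in practice they would have to be estimated), and all numbers are invented.

```python
import numpy as np

def wls(X, y, sigma2):
    """GLS for a diagonal error covariance: weight each row by 1 / sigma2."""
    w = 1.0 / sigma2
    XtWX = X.T @ (w[:, None] * X)
    XtWy = X.T @ (w * y)
    return np.linalg.solve(XtWX, XtWy)

rng = np.random.default_rng(1)
N, T = 50, 5
unit_sd = rng.uniform(0.2, 3.0, size=N)   # assumed known, unit-specific error s.d.
sigma2 = np.repeat(unit_sd ** 2, T)       # one variance per unit, repeated over waves

ols_draws, gls_draws = [], []
for _ in range(300):                      # repeated samples from the same design
    x = rng.normal(size=N * T)
    y = 1.0 + 2.0 * x + rng.normal(size=N * T) * np.sqrt(sigma2)
    X = np.column_stack([np.ones(N * T), x])
    ols_draws.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    gls_draws.append(wls(X, y, sigma2)[1])

ols_draws, gls_draws = np.array(ols_draws), np.array(gls_draws)
print("mean slope  OLS / GLS:", ols_draws.mean(), gls_draws.mean())
print("s.d. slope  OLS / GLS:", ols_draws.std(), gls_draws.std())
```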
Moreover, panel data might suffer from an endogeneity bias. An endogeneity bias
occurs if the mean responses vary cross-sectionally via unobserved unique means and
these differences are not modeled, and therewith, left in the error term. As a result, any
cross-sectionally varying covariate will correlate with the error term. In other words,
independent variables correlate with the error term.
Serial correlation is a further drawback of panel data. It refers to the fact that
repeated observations on the same units are highly correlated
and thereby violate the assumption of uncorrelated errors. These correlations might
result from panel conditioning, as a unit’s response is influenced by prior interviews.
Serial correlations tend to be large and positive, but diminish as the time between
the observations increases.
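The decaying pattern can be reproduced with simulated errors. The sketch below assumes a first-order autoregressive error process with a coefficient of 0.6. That is only one of several processes that produce this pattern, and it is chosen here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, rho = 2000, 6, 0.6   # hypothetical panel and autocorrelation

# Stationary AR(1) errors with unit variance, generated separately for each unit.
e = np.empty((N, T))
e[:, 0] = rng.normal(size=N)
for t in range(1, T):
    e[:, t] = rho * e[:, t - 1] + np.sqrt(1 - rho ** 2) * rng.normal(size=N)

def lag_corr(e, k):
    """Correlation between errors of the same unit that are k waves apart."""
    a, b = e[:, :-k].ravel(), e[:, k:].ravel()
    return float(np.corrcoef(a, b)[0, 1])

corrs = [lag_corr(e, k) for k in (1, 2, 3)]
print([round(c, 2) for c in corrs])   # positive, and shrinking with the lag
```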
Linear longitudinal data models
Many of the longitudinal data applications that appear in the literature are based on
linear model theory. Hence, this contribution is devoted to these linear longitudinal data
models. However, nonlinear models represent an area of recent development. Nonlinear
models refer to instances where the distribution of the response cannot be reasonably
approximated using a normal curve.
While there exists consensus that longitudinal data is best suited for making causal
inferences, there has not been as much consensus on the best methods for analyzing
such data. There exist many traditions for analyzing panel data: While economics and
political science traditionally analyze trends, and thus, aim at modeling the level of the
dependent variable (Y), social, behavioral, or educational scientists are often concerned
with assessing individual changes, and hence, aim at modeling changes of the dependent
variable (ΔY). Overall, there exist several estimation techniques to address one or more
of the previously outlined pitfalls of panel data. The most prominent linear panel data
models are (i) the fixed-effects model, and (ii) the random-effects model, both of which
are applied to model the level of the dependent variable. As to the most prominent
approaches in order to model the change of the dependent variable, (iii) the lagged
dependent variable approach, and (iv) the change score method are most frequently
used (Andress, Golsch, & Schmidt, 2013).
The fixed-effects model
Whenever scholars aim at modeling the level of the dependent variable (Y), the fixed- or the random-effects model is preferably applied. The terminology of fixed- and
random-effects model has caused quite some misunderstanding and confusion in the
past since the terms fixed and random effects have multiple meanings. While a fixed
effect relates to any model quantity estimated, a random effect relates to any parameter that is unique to the individual but can be predicted separately. In line with the
classic view, a fixed-effects model treats unobserved differences between units as fixed
parameters whereas a random-effects model associates these differences with random
variables. From a more modern perspective, these two approaches are distinguished
by the assumptions these models make about the association between observed and
unobserved variables. While in a fixed-effects model, the unobserved variables can have
any association with the observed variables, in a random-effects model observed and
unobserved variables are uncorrelated. As a side note, the random-effects model is considered a special case of the mixed-effects model (Allison, 2009). Both approaches—the
fixed- and the random-effects model—are able to overcome unit effects and the endogeneity bias of panel data.
In general, most panel applications are based on a simple regression with an error disturbance term, such as the following model:
$$y_{it} = x_{it}'\beta + \mu_i + v_{it}, \qquad i = 1, \dots, N;\ t = 1, \dots, T \tag{1}$$
where $y_{it}$ is the dependent variable for the $i$th individual at time $t$, $x_{it}$ is a vector of
observations on $k$ explanatory variables, $\beta$ is a $k \times 1$ vector of unknown coefficients, $\mu_i$ is
an unobserved individual specific effect, and $v_{it}$ is a zero mean random error disturbance
term with variance $\sigma_v^2$.
If $\mu_i$ (i.e., the unobserved individual specific effect) in Equation 1 refers to fixed
parameters to be estimated, this model is called a fixed-effects (FE) model. An FE model
assumes that the individual-specific effect is a random variable, which can correlate with
the explanatory variables. Moreover, the model assumes that time-varying explanatory
variables are not perfectly collinear and that they have non-zero within-variance. A fixed-effects
model is typically estimated with least squares dummy variables (LSDV). This
approach estimates the model by utilizing ordinary least squares (OLS) and includes
dummy variables for each unit (N – 1) in order to be able to estimate the individual
invariant effects. This in turn leads to a large loss in degrees of freedom and can aggravate
multicollinearity among regressors. Although this approach is computationally
simple and accounts for a known source of variance in the model specification, the
unit dummies are perfectly collinear with any variable that varies only cross-sectionally,
that is, any variable that is constant over time within each unit.
Consequently, the LSDV approach excludes any time-invariant covariates from the model.
Moreover, if the number of observed units is large (especially relative to the number of
waves), estimating an LSDV model is inefficient.
In this case, an alternative fixed-effects estimator to LSDV is the within estimator. By
applying this approach, the dependent variable and all covariates are introduced as deviations from the unit’s mean of the variable into the model. In doing so, the within estimator avoids estimating unique intercepts for each unit. Most importantly, the within
estimator produces the same coefficient estimates as the LSDV approach does. However,
the within estimator is not able to estimate the effects of any time-invariant variables
such as gender, race, or religion. These variables are eliminated in both the LSDV and
the within estimator approach. Consequently, the main disadvantage of the fixed-effects
model is that it cannot accommodate time-invariant covariates (Hsiao, 2003).
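The equivalence of the two fixed-effects estimators, and the loss of time-invariant covariates, can be checked numerically. The sketch below uses simulated data with invented parameter values. It includes all N unit dummies without a separate intercept, which spans the same space as an intercept plus N – 1 dummies.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 30, 4
unit = np.repeat(np.arange(N), T)            # unit index for each row (long format)
mu = rng.normal(size=N)[unit]                # unobserved unit effect
x = 0.5 * mu + rng.normal(size=N * T)        # time-varying regressor
y = 1.5 * x + mu + rng.normal(size=N * T)

# LSDV: OLS of y on x plus one dummy column per unit.
D = (unit[:, None] == np.arange(N)[None, :]).astype(float)
b_lsdv = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]

def demean(v):
    """Subtract each unit's own mean from its observations."""
    means = np.bincount(unit, weights=v) / np.bincount(unit)
    return v - means[unit]

# Within estimator: OLS on deviations from unit means, no dummies needed.
xd, yd = demean(x), demean(y)
b_within = float(xd @ yd / (xd @ xd))

# A time-invariant covariate (e.g., a 0/1 attribute) has no within-unit variation left.
z = rng.integers(0, 2, size=N)[unit].astype(float)
z_within = demean(z)

print(b_lsdv, b_within)          # identical up to floating-point error
print(np.abs(z_within).max())    # 0.0: nothing left to estimate an effect from
```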
The random-effects model
If $\mu_i$ (i.e., the unobserved individual specific effect) in Equation 1 relates to independent
random variables with zero mean and a constant variance $\sigma_\mu^2$, this model is called
a random-effects (RE) model. A random-effects model assumes that the individual
specific effect is a random disturbance at the individual level that is uncorrelated
with the explanatory variables. Furthermore, this model assumes that the regressors
have a non-zero variance. More importantly, the random-effects model allows the
inclusion of time-invariant covariates in the model specification, which is the most
apparent difference between the fixed- and random-effects model. The random-effects
model can be estimated by generalized least squares (GLS), which amounts to a least squares
regression on suitably transformed data. This model is characterized by a compound symmetry covariance structure
and specifically recognizes that repeated observations covary. Therefore, the model
includes a term that forces all repeated measures to correlate at a constant level with
each other. By allowing time-invariant covariates, the random-effects model avoids the
inefficiency problem of LSDV. Consequently, the random-effects model is the most
practical option for short panels as well as for any models, for which time-invariant
covariates are to be estimated. On the downside, this model assumes that unit effects
are independent of covariates. But if the unit effects yet correlate with any covariate,
the estimates of the random-effects model are biased (Hsiao, 2003).
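The compound symmetry structure and the GLS transformation can be written out for a balanced panel with T waves. The composite error $u_{it}$, the unit means $\bar{y}_i$, $\bar{x}_i$, $\bar{u}_i$, and the weight $\theta$ are notation introduced here for illustration. The derivation assumes that $\mu_i$ and $v_{it}$ are uncorrelated with each other.

```latex
\begin{aligned}
u_{it} &= \mu_i + v_{it}, \\
\operatorname{Var}(u_{it}) &= \sigma_\mu^2 + \sigma_v^2, \\
\operatorname{Cov}(u_{it}, u_{is}) &= \sigma_\mu^2 \quad (t \neq s), \\
\operatorname{Corr}(u_{it}, u_{is}) &= \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_v^2}, \\
\theta &= 1 - \sqrt{\frac{\sigma_v^2}{\sigma_v^2 + T\sigma_\mu^2}}, \\
y_{it} - \theta\bar{y}_i &= (x_{it} - \theta\bar{x}_i)'\beta + (u_{it} - \theta\bar{u}_i).
\end{aligned}
```

Every pair of repeated measures on the same unit thus shares the same correlation, and GLS is equivalent to OLS on the partially demeaned data in the last line. With $\theta = 0$ this reduces to pooled OLS. As $\theta$ approaches 1 it approaches the within estimator, which is why time-invariant covariates survive the transformation only when $\theta < 1$.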
The lagged dependent variable approach
Since panel data provides the ability to accommodate temporal trends, researchers
are frequently interested in studying changes of the dependent variable (ΔY). To that
end, longitudinal datasets are characterized by a relatively large number of observed
units compared to a relatively small number of observations over time. In order to
model the changes of the dependent variable and to draw causal inferences about
dynamics, the lagged dependent variable approach and the change score method are
widely prominent.
The underlying idea of the lagged dependent variable approach or regressor variable
method is that while controlling for the dependent variable Y at a prior point in time
($y_{i,t-1}$), one or more independent variables ($x_{it}$) cause a change in the dependent variable
Y at a subsequent point in time ($y_{it}$). As already outlined, serial correlation represents
a major pitfall of panel data. One solution to overcome this problem is the inclusion of
such a lagged term of the dependent variable as a covariate. This term accounts for serial
correlation and makes the remaining errors independent. Specifically, in this approach
$y_{it}$ is regressed on both $x_{it}$ and $y_{i,t-1}$. Such a regression model, which might include one
or more lagged values of the dependent variable among its explanatory variables, is also
called an autoregressive model, also known as dynamic model, in the form of:
$$y_{it} = \alpha + x_{it}'\beta + \gamma y_{i,t-1} + v_{it} \tag{2}$$
Specifically, this regression (Equation 2) models the time path of the dependent
variable in relation to its past value(s). The lagged dependent variable approach
provides appropriate measures for studying causality in longitudinal designs. The
obtained parameters are interpreted in terms of predicting change. It is an appealing
approach for modeling dynamics from a practical viewpoint. Whereas some scholars
argue that a lagged response most effectively accounts for unit effects and serial
correlations, other scholars argue that this approach might yield an endogeneity
bias if the lagged term does not eliminate all serial correlations (Finkel, 1995).
Although the application of this approach is justified with only two waves of data,
the usage of a lagged dependent variable costs one wave of data, which means that
the first wave of observations cannot be modeled by this approach. Consequently, the
lagged dependent variable approach is usually used for studying dynamics in long
panels.
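A two-wave version of Equation 2 can be estimated with ordinary least squares. The sketch below uses simulated data with invented coefficients (0.3, 0.8, and 0.6) and a single predictor. It is an illustration of the regression setup, not a recommendation for a particular estimator.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 1000                                   # hypothetical number of panel members

y1 = rng.normal(size=N)                    # outcome at wave 1, i.e. the lagged term
x = 0.5 * y1 + rng.normal(size=N)          # predictor, correlated with the wave-1 outcome
y2 = 0.3 + 0.8 * x + 0.6 * y1 + rng.normal(size=N)   # outcome at wave 2

# Regress the wave-2 outcome on an intercept, the predictor, and the lagged outcome.
X = np.column_stack([np.ones(N), x, y1])
alpha, beta, gamma = np.linalg.lstsq(X, y2, rcond=None)[0]

print(f"alpha={alpha:.2f}  beta={beta:.2f}  gamma={gamma:.2f}")
```

Note that the wave-1 outcome appears only on the right-hand side, which is the sense in which the approach costs one wave of data.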
The change score method
Another option to model changes in the dependent variable is the change score method
or difference score method. This approach utilizes the change or difference score
between two observations over time as the dependent variable. Specifically, in this
method $y_{it} - y_{i,t-1}$ is regressed on the independent variables, in the form of:
$$y_{it} - y_{i,t-1} = \alpha + x_{it}'\beta + v_{it} \tag{3}$$
There exist two major objections against the use of change scores, which therefore promote the lagged dependent variable approach. Firstly, change scores tend to be much
less reliable measures than the individual variables. If the measures of the dependent
variables at the various points in time have a reliability of less than 1.00, and thus, do
not perfectly assess the measurement concept, the reliability and the validity of the
difference score are lower than those of the separate scores. Secondly, change scores are frequently
negatively correlated with the score of the dependent variable at t − 1. This negative
correlation is often substantial. Consequently, if there exists a relation between an
independent variable and the difference score, it remains unclear whether a change
in the independent variable has caused this relation or whether it merely reflects the
relationship between the independent variable and the dependent variable at t − 1. One solution might be to incorporate
$y_{i,t-1}$ into the regression (Equation 3) as a control variable so that the relation between
the difference score and the independent variable is adjusted for confounded effects.
Interestingly, this approach is rarely applied or discussed in the literature (Dalecki &
Willits, 1991).
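The link between Equations 2 and 3 can be made explicit with a little algebra. The coefficient $\delta$ on the added control, and the symbols $\sigma^2$ and $\rho$ for the common variance of the two waves and their correlation, are notation introduced here. The last line assumes equal variances in both waves.

```latex
\begin{aligned}
y_{it} - y_{i,t-1} = \alpha + x_{it}'\beta + v_{it}
  &\iff y_{it} = \alpha + x_{it}'\beta + 1 \cdot y_{i,t-1} + v_{it}, \\
y_{it} - y_{i,t-1} = \alpha + x_{it}'\beta + \delta\, y_{i,t-1} + v_{it}
  &\iff y_{it} = \alpha + x_{it}'\beta + (1 + \delta)\, y_{i,t-1} + v_{it}, \\
\operatorname{Cov}(y_{it} - y_{i,t-1},\; y_{i,t-1})
  &= \operatorname{Cov}(y_{it}, y_{i,t-1}) - \operatorname{Var}(y_{i,t-1})
   = (\rho - 1)\,\sigma^2 \le 0.
\end{aligned}
```

The first line shows that the change score model is Equation 2 with $\gamma$ fixed at 1. The second shows that adding $y_{i,t-1}$ as a control reproduces Equation 2 with $\gamma = 1 + \delta$ and the same $\beta$. The third shows why change scores correlate negatively with the initial score whenever the two waves are less than perfectly correlated.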
Recommendations for the analysis of panel data
Although panel data offer many advantages to study causal propositions, the power of
panel or longitudinal data analysis largely depends on the compatibility of the assumptions of the respective statistical models with the generated data. Otherwise, choosing
the wrong analytical method might result in misleading inferences.
Whether the fixed- or random-effects model is better suited for modeling the level
of the dependent variable in longitudinal data depends on the assumption researchers
make about the correlation between the individual specific error and the regressors.
If one assumes that there exists no correlation between the error term and the regressors, the random-effects model is the appropriate choice. In contrast, if a correlation
between the error term and the regressors is assumed, then the fixed-effects model is
better suited for the analysis. Bearing this fundamental difference in mind, the following
observations might provide additional guidelines: If T is large and N is small, fixed-effects models might be preferable. If N is large and T is small and the units are considered to be random drawings, random-effects models might be preferable (Andress
et al., 2013).
The best-known test for deciding whether to use a fixed-effects or a random-effects model is the Hausman test. This test aims at detecting whether the unit effects
are indeed uncorrelated with any input variables. The respective null hypothesis
postulates that the unobserved individual specific effects do not correlate with the
independent variables. The basic idea is that since the fixed-effects transformation
eliminates the effects of the unobserved individual specific effects from the model
specification, the fixed-effects estimator is consistent regardless of whether there exist
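correlations between the specific effects and the input variables. If the null hypothesis
is true, both estimators are consistent, but the random-effects estimator is efficient.
If the null hypothesis is false, only the fixed-effects estimator is consistent, so a large
difference between the two sets of estimates speaks against the random-effects model.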
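In its usual form, the Hausman statistic compares the two coefficient vectors, weighted by the difference of their estimated covariance matrices, and is referred to a chi-square distribution with as many degrees of freedom as coefficients compared. The sketch below computes the statistic from estimates that are assumed to be available already. The function name `hausman` and all numbers are hypothetical.

```python
import numpy as np

def hausman(b_fe, b_re, v_fe, v_re):
    """Hausman statistic for coefficients estimated by both FE and RE.

    b_fe, b_re: coefficient vectors on the time-varying regressors.
    v_fe, v_re: their estimated covariance matrices.
    Returns the statistic and its degrees of freedom.
    """
    d = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    v_diff = np.asarray(v_fe, dtype=float) - np.asarray(v_re, dtype=float)
    stat = float(d @ np.linalg.solve(v_diff, d))
    return stat, d.size

# Made-up estimates for two time-varying regressors.
b_fe = [1.50, -0.40]
b_re = [1.20, -0.35]
v_fe = [[0.010, 0.000], [0.000, 0.004]]
v_re = [[0.006, 0.000], [0.000, 0.003]]

stat, df = hausman(b_fe, b_re, v_fe, v_re)
print(stat, df)   # about 25 on 2 degrees of freedom with these made-up numbers
```

With 2 degrees of freedom the 5% critical value of the chi-square distribution is about 5.99, so these invented estimates would lead to rejecting the null hypothesis. In finite samples the difference of the covariance matrices is not guaranteed to be positive definite, in which case the statistic computed this way is not meaningful.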
Fixed-effects models are frequently applied in randomized experiments since these
models increase efficiency and reduce bias (Allison, 2009). As a rule of thumb, these
models are preferably used to make inferences about the sample whereas random-effects models are generally applied to draw conclusions about the larger population
(Allison, 1994).
In order to model the change of the dependent variable and to assess dynamics, the
lagged dependent variable approach and the change score method are strongly recommended. The lagged dependent variable approach is frequently used in experimental
research. This approach is able to remedy possible imbalances of randomization procedures when the assignment of subjects to treatment categories has resulted in groups
that are significantly different regarding the dependent variable of interest.
Even though there exist major objections to the use of change scores, this method
might turn out to be superior to the lagged dependent variable approach if the independent variable is temporally subsequent to the dependent variable and uncorrelated
with the transient component of the dependent variable (Allison, 1990).
SEE ALSO: Panel Research Methods; Regression Analysis, Linear; Time-Series Analysis
References
Allison, P. D. (1990). Change scores as dependent variables in regression analysis. Sociological
Methodology, 20(1), 93–114. doi:10.2307/271083
Allison, P. D. (1994). Using panel data to estimate the effects of events. Sociological Methods &
Research, 23(2), 174–199. doi:10.1177/0049124194023002002
Allison, P. D. (2009). Fixed effects regression models. Thousand Oaks, CA: SAGE.
Andress, H. J., Golsch, K., & Schmidt, A. W. (2013). Applied panel data analysis for economic and
social surveys. Berlin/Heidelberg: Springer.
Dalecki, M., & Willits, F. K. (1991). Examining change using regression analysis: Three
approaches compared. Sociological Spectrum, 11(2), 127–145. doi:10.1080/02732173.1991.9981960
Finkel, S. E. (1995). Causal analysis with panel data. Thousand Oaks, CA: SAGE.
Frees, E. W. (2004). Longitudinal and panel data: Analysis and applications in the social sciences.
New York: Cambridge University Press.
Hsiao, C. (2003). Analysis of panel data (2nd ed.). Cambridge, UK/New York: Cambridge University Press.
Further reading
Beck, N., & Katz, J. N. (1995). What to do (and not to do) with time-series cross-section data.
American Political Science Review, 89(3), 634–647. doi:10.2139/ssrn.1658640
Beck, N., & Katz, J. N. (2011). Modeling dynamics in time-series-cross-section political economy
data. Annual Review of Political Science, 14, 331–352. doi:10.1146/annurev-polisci-071510-103222
Gillespie, D. F., & Streeter, C. L. (1994). Fitting regression models to research questions
for analyzing change in nonexperimental research. Social Work Research, 18(4), 239–245.
doi:10.1093/swr/18.4.239
Gujarati, D. N. (2003). Basic econometrics (4th ed.). Boston: McGraw-Hill.
Hamaker, E. L., Kuiper, R. M., & Grasman, R. P. (2015). A critique of the cross–lagged panel
model. Psychological Methods, 20(1), 102–116. doi:10.1037/a0038889
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. Cambridge,
MA/London: MIT Press.
Christiane Grill is a researcher at the Department of Communication at the University of Vienna. Her focus of research is on political offline and online communication
with a particular emphasis on EU politics and EU elections. Moreover, her research is
dedicated to media reception and its effects on public opinion. Dr. Grill is also interested in the development of empirical methods in social sciences and within this realm
published the paper “Clarifying and Expanding the Use of Confirmatory Factor Analysis in Journalism and Mass Communication Research” together with Lance Holbert in
Journalism & Mass Communication Quarterly in 2015.