For those of us who are logarithmically challenged, Vivalt's "How Much Can We Generalize from Impact Evaluations?" is a daunting read. The topic is highly relevant: the use of impact evaluations (and specifically Randomized Controlled Trials as the 'gold standard') has been growing in the evaluation of development interventions, yet a common counter-argument has been about the ability to generalize the results to other contexts. The paper does try to address that question; unfortunately (or perhaps expectedly, given the trend of mathematical jargon in the social sciences), it is buried under equations, tables, regressions and the like, meaning that a topic of great importance for development practitioners gets lost in the medium. I wouldn't be surprised if many of my colleagues gave up at the abstract:
"Impact
evaluations aim to predict the future, but they are rooted in particular
contexts and results may not generalize across settings. I founded an
organization to systematically collect and synthesize impact evaluations results
on a wide variety of interventions in development. These data allow me to
answer this and other questions across a wide variety of interventions. I
examine whether results predict each other and whether variance in results can
be explained by program characteristics, such as who is implementing them,
where they are being implemented, the scale of the program, and what methods
are used. I find that when regressing an estimate on the hierarchical
Bayesian meta-analysis result formed from all other studies on the same
intervention-outcome combination, the result is significant with a
coefficient of 0.6-0.7, though the R-squared is very low. The
program implementer is the main source of heterogeneity in results, with
government-implemented programs faring worse than and being poorly predicted by
the smaller studies typically implemented by academic/NGO research teams, even
controlling for sample size. I then turn to examine specification
searching and publication bias, issues which could affect generalizability and
are also important for research credibility. I demonstrate that these
biases are quite small; nevertheless, to address them, I discuss a mathematical
correction that could be applied before showing that randomized control trials
(RCTs) are less prone to this type of bias and exploiting them as a robustness
check."
Those brave (or foolish) enough to have read on would probably take away a few points:
- Academic/NGO scalability into government programmes is not a given. Given that many of our programmes, especially those with governance components, claim eventual government takeover/transfer (a favorite explanation for the sustainability section of proposals), an even more nuanced view of future impact is needed. Alternatively, further study is needed to find out how consistency of impact can be ensured.
- Heterogeneity. A fancy word for 'results all over the place'. Not only because of the programme implementer, as seen above, but also because there is a high degree of variation even within studies. I believe this points to the fact that we may not know what the underlying factors affecting the intervention are (making the control of variables much harder). Other likely culprits are the general statistical underpowering of studies, whose sample sizes may not be large enough to smooth out results (this is a hunch, as I am in no position to prove it or debate it), and, of course, context.
- Impact evaluations do have some predictive power in development. We would be in trouble if the result had been negative! (For the curious, the sketch below shows in toy form what the 'results predict each other' exercise from the abstract looks like.)
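To make the key exercise from the abstract a bit more concrete for the equation-averse, here is a minimal sketch of the "does one result predict the others?" idea. To be clear, this is not Vivalt's code or data: the paper uses a hierarchical Bayesian meta-analysis, whereas this sketch substitutes a simple inverse-variance-weighted leave-one-out average, and the intervention-outcome combinations and effect sizes are entirely made up for illustration.

```python
# Toy sketch of the "do results predict each other?" exercise described in the abstract.
# Caveat: the paper uses a hierarchical Bayesian meta-analysis; here each study is
# compared against a plain inverse-variance-weighted average of all *other* studies
# on the same (hypothetical) intervention-outcome combination. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)

# Fake effect-size estimates and standard errors for three intervention-outcome combos.
# The extra noise term stands in for context-driven heterogeneity across studies.
true_effects = {"cct_enrollment": 0.20, "deworming_attendance": 0.05, "microcredit_income": 0.02}
studies = []  # (combo, estimate, standard_error)
for combo, mu in true_effects.items():
    for _ in range(8):
        se = rng.uniform(0.03, 0.10)
        est = rng.normal(mu, se) + rng.normal(0, 0.05)  # sampling error + heterogeneity
        studies.append((combo, est, se))

# For each study, form the leave-one-out inverse-variance-weighted mean of the other
# studies on the same combo, then regress the study's estimate on that mean.
y, x = [], []
for i, (combo, est, se) in enumerate(studies):
    others = [(e, s) for j, (c, e, s) in enumerate(studies) if j != i and c == combo]
    weights = np.array([1 / s**2 for _, s in others])
    loo_mean = np.average([e for e, _ in others], weights=weights)
    y.append(est)
    x.append(loo_mean)

y, x = np.array(y), np.array(x)
slope, intercept = np.polyfit(x, y, 1)   # OLS of each estimate on its leave-one-out mean
r2 = np.corrcoef(x, y)[0, 1] ** 2
print(f"slope ~ {slope:.2f}, intercept ~ {intercept:.3f}, R^2 ~ {r2:.2f}")
```

A slope near 1 with a high R-squared would mean that existing results on an intervention strongly predict a new one; the paper's reported slope of 0.6-0.7 with a very low R-squared means there is some signal, but a lot of unexplained scatter around it.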