Life After P-hacking
Okay. My name is Joe Simmons. Thanks so much for inviting me; I really appreciate the chance to be here to talk to you. I'm from the University of Pennsylvania. The work I'm talking about today is all in collaboration with Leif Nelson, who's at Berkeley, and Uri Simonsohn, who's kind of at the University of Pennsylvania and kind of leaving for Barcelona to live a better life. So anyway, these are my collaborators throughout, so everything I say, they endorse as well.

Okay. So I'm talking about life after p-hacking. Let's start with some basics and then work our way through it. We start with this idea that the definition of a true finding in science is replication: does the finding replicate under specifiable conditions? If it doesn't, we say it's false; if it does, we say it's true. Of course, the way we operationally define true findings in our papers is to say that true findings are those that are not due to chance, and in most fields (not only the social sciences, but biology, medicine) we rely on a threshold of p < .05: assuming a null of no relationship between variables, there's less than a 5% chance of observing data this extreme. So basically, by adopting this threshold we're saying that for each individual study (not necessarily for each paper, but for each study) we're accepting a false positive rate of 5%. We don't want our false positive rate to be much higher than 5%.

But there's reason to believe that our false positive rate is higher than 5%; that it is too high. In my field of psychology, there's lots of evidence that things don't replicate under specifiable conditions. Here are some examples. There's evidence that behavioral priming does not replicate under specifiable conditions; that priming of social distance does not replicate; that priming of money does not replicate. That ego depletion, at least in some forms, does not replicate under specifiable conditions. That the effects of the weather on life satisfaction do not necessarily replicate. There's lots of evidence that some embodied cognition work does not replicate. There's evidence that fluency effects do not necessarily replicate; this one is an important one for me, because one of the findings those authors failed to replicate is my own finding. I have a paper from 2006 with 14 studies in it. Twelve of those studies are true, two of them are probably not true, and one of those two is a target of that replication paper. I believe them, and I don't believe me anymore; I think that was also a false positive result. There's also evidence that the facial feedback hypothesis, as operationalized by Strack et al., doesn't replicate; power poses seem not to replicate; and of course ESP seems not to replicate. I always think it's crazy that I'm an author on a paper showing that ESP is not true. Of course it's not true.

And then, of course, in 2015 the Open Science Collaboration published a paper in which they report the results of systematic replication attempts of a hundred findings in psychology, and a number of those replication attempts failed. Now, exactly how many failed depends on how you calculate failure to replicate. The way we prefer to calculate it, at least 25% of them failed, many were inconclusive, and only a small percentage definitely replicated. But either way, even if you want to use that benchmark, 25% of our findings failing to replicate is way, way, way too many. Our science can't be doing that.
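A quick aside on that 5% baseline before getting to the causes: if you run honest, fully pre-planned tests on pure noise, about 5% of them come out significant at p < .05. A minimal simulation (my own sketch, not from the talk; the sample size and number of simulations are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_cell, alpha = 10_000, 20, 0.05

false_positives = 0
for _ in range(n_sims):
    # Two groups drawn from the SAME distribution: any "effect" is chance.
    a = rng.normal(size=n_per_cell)
    b = rng.normal(size=n_per_cell)
    _, p = stats.ttest_ind(a, b)
    false_positives += p < alpha

print(false_positives / n_sims)  # ≈ 0.05
```

That 5% is what the p < .05 rule promises when every analysis is planned in advance; the question in the talk is how far actual practice drifts above it.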
What's causing all the failures to replicate? There are a number of potential culprits. An uncommon cause is fraud. Fraud is out there; we've seen it, Uri's caught it and gotten people to resign because of it, but it's not very common, and I'm certainly not going to talk about it today.

There are other culprits that are common, but we don't think they're consequential enough to be the thing we should necessarily be focusing on first. For example, a lot of people focus on this idea that we tend to file-drawer failed studies: we only report the results of studies that are statistically significant, and we don't report the results of studies that aren't. I think this is a problem, but if this is all that people are doing, it's a really hard way to sustain a career, because it is very costly to run individual studies, and as long as you're writing up lots of multi-study papers, this is not going to be hugely consequential on its own. It's also really hard to solve. So we think this is not a really big problem, and we also think it's a hard problem to solve.

Another common problem is innocent errors. Errors are everywhere in published findings; there are papers documenting the number of errors. But in most cases those errors tend to be relatively inconsequential. We should still solve this problem, and it is easy to solve: you just require people to post their data. All of a sudden people become a lot more careful when they do that, and innocent errors go by the wayside. So we should try to solve it, but it's not really the main culprit behind these failures to replicate. We think the main culprit, accounting for most of it (not all of it, but the vast majority), is something called p-hacking.

P-hacking is just the simple idea that when you get your data set, you try out many analyses, and you report the analysis or analyses that are statistically significant and you don't report the other ones. So it's like file-drawering, except instead of having to go out and run lots of different studies, you can just run one study and then analyze many different things, and by doing that you can increase your false positive rate a great deal. This alone, as I can show you, can basically make your false positive rate a hundred percent. So it's really bad.

I want to really hammer home the idea that p-hacking is not something that bad or immoral people do. It is something that all human beings are going to do if they are researchers and they have a data set where it's not exactly clear how to perfectly analyze the data. It's an inevitable consequence of how most people do research, including me. To really hammer that home, here's a different definition of p-hacking: run at least one unplanned analysis that you might report as though its p-value is valid. That's it. If you're going to run one unplanned analysis and not tell us that it was unplanned, then you are p-hacking. And so I think p-hacking is something that almost every researcher engages in almost all the time. If I accuse you of being a p-hacker, I'm not saying you're a bad person; I'm saying you're not a perfect planner.

Some common ways of p-hacking: you can stop data collection early, so you have a flexible sample size. You can try out different measures, and different ways of scoring those measures. You can try different combinations of conditions. You can try different combinations of covariates. You can exclude participants or trials; that one's really consequential. You can analyze different subgroups (economists like to do this), and this is the best, and by "best" I mean worst, way to p-hack; it's a really effective way to get things to be significant. And you can transform your data to get p < .05.

To not p-hack, you either have to not care at all about how your results will come out, or you have to do two things. First, you have to plan all the details of your critical analysis in advance: exactly how you're going to score your dependent variable, exactly how you're going to deal with outliers and inattentive participants, exactly which covariates you're going to include and how you're going to score them, exactly what regression or ANOVA you're going to run. You have to figure all of that out. And second, you have to remember. A lot of times you go to write up your paper two years after you ran study one, and you go back and reanalyze study one, now with a different hypothesis in mind, and you try out different analyses when you do that. So you not only have to be a perfect planner, you have to remember what you planned. This is why I say that p-hacking is more or less inevitable for most people most of the time, and that includes yours truly: if I do not perfectly plan my analyses in advance and write them down, I will p-hack my studies, even though I've been on a crusade against this for seven years. So almost all social scientists p-hack almost all the time.

And p-hacking is consequential. We first discovered this ourselves when we ran simulations looking at what p-hacking can do, so let me share the results of some of those simulations. Suppose you p-hack just a little bit: you collect two dependent variables that are correlated at .5, and you let yourself look at one variable, the other variable, or the average of the two. Then your false positive rate goes from 5% to 9%. It almost doubles. That's not great. If you also give yourself flexibility in sample size (and this is not a lot of flexibility: collecting 20 observations per cell and, if it's not significant, running ten more per cell, not peeking after every observation and not doing this repeatedly), then it goes up to 14%. If you allow yourself a covariate (call it gender; it can be whatever you like; one covariate and its interaction term), then your false positive rate goes to 31%. If you allow yourself flexibility in which conditions you analyze, it goes up to 60%. And if you allow yourself to drop one of two studies, it goes up to 84%.

And these simulations, we think, are extremely conservative. This is only two measures you're choosing between; this is only one covariate; this is not simulating what happens with subgroups, where the analyses are independent, which is really bad; it is not simulating what happens with outliers, which can also be really, really bad. If you don't care about the direction of the effect, you can easily get this to 100%, no problem. So it's really bad, and once we ran the simulations back in 2010, we became convinced that this is the real problem, the big thing we have to solve. Yes, there are these other problems too, but this is the one we should be focused on first; once we solve this one, then we can solve the other ones, which might be a bit harder.

Okay. So some of our results don't replicate, largely because researchers are p-hacking. Now what I want to do is walk through the various solutions that have been proposed, including some of the ones that Gwen highlighted earlier, and talk about whether those solutions are any good. I'll walk through six proposed solutions; some will take longer than others, so don't be scared by how long the first one takes.

Proposed solution number one is death to p-values, which Gwen already alluded to. The idea here, I think, the logic is something like: there's a big problem in research; researchers have been using p-values; therefore, get rid of p-values. That is not a logically valid argument. Just because everything's bad and we've been using p-values doesn't mean that p-values are the cause of everything being bad. And yet I think that's the conclusion lots of people draw. So I'm not going to be in favor of this, and we can talk through it. This comes in a couple of different flavors. The first flavor I'll talk about is reliance on Bayesian statistics: some people say we need to abandon null hypothesis significance testing and instead move to more Bayesian approaches.
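Before getting into the Bayesian papers, it's worth making those p-hacking simulations concrete. The very first step (two dependent variables correlated at .5; report whichever of DV1, DV2, or their average comes out significant) takes only a few lines to reproduce. This is my own rough reconstruction, not the original simulation code, and the cell size of 20 is my choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n, rho = 10_000, 20, 0.5
cov = [[1, rho], [rho, 1]]  # two DVs correlated at .5

hits = 0
for _ in range(n_sims):
    # The null is true: both conditions come from the same bivariate distribution.
    a = rng.multivariate_normal([0, 0], cov, size=n)
    b = rng.multivariate_normal([0, 0], cov, size=n)
    ps = [stats.ttest_ind(a[:, 0], b[:, 0]).pvalue,      # DV1 alone
          stats.ttest_ind(a[:, 1], b[:, 1]).pvalue,      # DV2 alone
          stats.ttest_ind(a.mean(1), b.mean(1)).pvalue]  # average of the DVs
    hits += min(ps) < 0.05  # report whichever analysis "worked"

print(hits / n_sims)
```

Under the null, picking the best of those three tests nearly doubles the nominal 5% rate, in line with the 9% figure above; layering on the other flexibilities compounds it further.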
There's a nice paper by Dienes (2011); I suggest you read it, I think it's a really good paper. And there's also this paper by Wagenmakers and colleagues in which they reanalyze the data from the notorious Bem ESP paper. If you were to read that paper, it might persuade you that a reliance on Bayesian statistics instead of null hypothesis significance testing might help solve the problem, because what the paper reports is that if you do Bayesian t-tests instead of regular t-tests, a lot of Bem's results are no longer significant, or no longer pass some threshold that they care about. And that's obviously good news, because we don't want Bem's results to be significant; they're not true.

The problem with that, though, is the following. Under the Bayesian approach they don't, of course, rely on p < .05; they rely on a different threshold. So they're still relying on a threshold, because you kind of have to rely on thresholds. Humans need to know, when they read papers, "do I believe it or do I not believe it?" You can't walk away saying "there's a 14% chance this one is true and a 62% chance that one is true"; that's not the way we process information. So I'm actually in favor of thresholds, and all thresholds have to be arbitrary by definition, so we can't pick on a threshold just because it's arbitrary. In the Bayesian approach, they rely on a Bayes factor greater than 3 as their threshold. Okay, fine; I'm not going to explain what a Bayes factor is here. But it turns out that although this looks completely different from null hypothesis significance testing, it is not different. Running a Bayesian t-test where your threshold is a Bayes factor greater than 3 is nearly identical to running a regular t-test with your alpha level set to .01. The reason lots of Bem's results don't hold up under the Bayesian analysis is that lots of Bem's p-values are not less than .01. If you instead switched to a null hypothesis significance testing regime where your alpha level is .01, most of Bem's results don't hold up either. It's the same.
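You can see that near-equivalence numerically. Below is a sketch of the default (JZS) Bayes factor for a one-sample t-test, following the integral in Rouder et al. (2009) with a Cauchy prior scale of √2/2. This is my own hedged implementation (not checked against the BayesFactor package, and the n = 50 is my choice), but it illustrates the point: at exactly p = .05 the Bayes factor is well below 3, and the t that first reaches BF = 3 corresponds to a two-sided p several times smaller than .05, in the rough neighborhood of .01 (the exact value depends on n and the prior scale).

```python
import numpy as np
from scipy import stats, integrate, optimize

def jzs_bf10(t, n, r=np.sqrt(2) / 2):
    """Default Bayes factor (alternative vs. null) for a one-sample t-test.

    Follows the Rouder et al. JZS form: delta ~ Cauchy(0, r) under H1,
    expressed as a normal prior with inverse-gamma(1/2, 1/2) mixing weight g.
    """
    nu = n - 1
    like_null = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    def integrand(g):
        if g < 1e-12:          # guard: weight underflows to 0 near g = 0
            return 0.0
        c = 1 + n * r**2 * g
        return (c ** -0.5
                * (1 + t**2 / (c * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))

    like_alt, _ = integrate.quad(integrand, 0, np.inf)
    return like_alt / like_null

n = 50
t_05 = stats.t.ppf(0.975, n - 1)   # the t that gives exactly p = .05
print(jzs_bf10(t_05, n))           # well below 3

# Find the t where BF10 = 3, and the two-sided p it corresponds to.
t_bf3 = optimize.brentq(lambda t: jzs_bf10(t, n) - 3, 1.0, 6.0)
p_bf3 = 2 * stats.t.sf(t_bf3, n - 1)
print(t_bf3, p_bf3)                # p well below .05, toward .01
```

So the BF > 3 rule behaves like a stricter alpha, which is exactly the talk's point: it's a relabeled threshold, not a different kind of safeguard.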
Because the Bayesian approach is basically identical to the orthodox approach, it's not going to help with this problem. Just like you can p-hack a p-value, you can Bayes-hack a Bayes factor, and so the Bayesian approaches will not help with p-hacking at all. They are not a solution to this problem.

We have a blog that we call Data Colada, and Uri wrote a post about this a few years ago. What he did is simulate p-hacking, comparing reliance on a t-test with p < .01 as the threshold versus reliance on a Bayesian t-test where the threshold is a Bayes factor greater than 3, and looking at the false positive rates associated with these various researcher procedures. If there's no p-hacking, they're both at 1% (there are some small differences here just due to noise in the simulations, but they're both 1%). If we simulate data peeking, the false positive rate goes up from 1%, which is what it should be with an alpha of .01, to something a little higher than that. If we then add choosing among related dependent variables, it goes up, but it's the same across the Bayesian and the null hypothesis significance testing approaches. If you then add dropping a condition, they both go up, and it's the same. And if you add dropping outliers, they both go up, and it's the same (that should say 20.8% up there). So the reliance on Bayesian statistics here is doing nothing to get rid of the false positive problem; it's just a different way to do it.

And I'm not going to say that you shouldn't do Bayesian statistics. In fact, the next sentence on my slides says the opposite: it is okay to prefer Bayesian statistics to p-values. You might like the logic of Bayesian statistics more than you like the logic of p-values, and that is fine; you can go ahead and do that. But it is also okay to prefer p-values to Bayesian statistics. This is a preference. The key thing to keep in mind is that what's not okay is to believe that using Bayesian statistics is a solution to this problem. It is not a solution to this problem. If you don't like the reliance on p-values and want to use Bayes factors instead, go for it, but don't be duped into thinking it's going to solve the replication crisis, if you want to call it that. Bayesian statistics are not going to help solve this problem.

Okay, what about other approaches? Bayesian approaches have been proposed, and there's also this paper by Geoff Cumming, published in Psychological Science a couple of years ago, that's had a lot of influence. What he writes is that we need to shift from reliance on null hypothesis significance testing to estimation and other preferred techniques: "the new statistics" refers to recommended practices including estimation based on effect sizes, confidence intervals, and meta-analysis. So basically: don't rely on p-values; instead report just effect sizes and confidence intervals, and then do meta-analysis. I'm going to focus on effect sizes and confidence intervals first, and on meta-analysis second.

When we're talking about just reporting confidence intervals, I want to walk through an example of what this would look like. It's just chance that I'm using this particular paper; I'm not using it for any specific reason except that we recently discussed it in our journal club and I kind of liked it, and it reports confidence intervals, so it's good for my example. In this paper, the researchers are testing kind of a cool hypothesis: they're looking at whether motivation on a task differs by how people are incentivized. In one condition, every time you solve a puzzle you get one point, and when you solve the next puzzle you get another point, and so on; that's a linear pattern of points. In the other condition, the first time you solve a puzzle you get one point, the next time you get four points, the next time you get nine points; that's an accelerating pattern. So they're basically asking: are accelerating rewards more motivating than linear rewards? It's a cool hypothesis, and I believe the results of this paper. Let me be clear about that: I'm not picking on it because I think it's not true. I'm actually picking on it because I think it is true.

So here's what they report: participants exerted more effort (i.e., they entered the target word more times) when seeing an accelerating number than when seeing no number at all. Here's one of the comparisons they report, here's the effect size they report, a Cohen's d of 0.5, and there's the 95% confidence interval for the mean difference. Now, if I were to ask you (you're not miked up, so I can't) what you're looking for when you look at this interval, what people say is that they're looking to see whether it contains zero. And if that's what you're doing, then you are doing null hypothesis significance testing. That's what most people are doing. In fact, Uri wrote a post about this as well a while back: he did a systematic analysis of the reliance on reporting confidence intervals for mediation analysis, and he looked at how people interpreted those confidence intervals. In the 10 out of 10 cases that he looked at, every single researcher just cared about whether or not the interval contains zero. If you're doing that, it's the exact same thing as relying on null hypothesis significance testing with p < .05, so for one thing, there's no plausible mechanism by which this is going to solve the problem of p-hacking.

And it means you're not reporting p-values. Look, I don't know about you, but do you want to know the p-value? I do. I want to know the p-value here, because it provides information. The p-value in this case (and I cherry-picked it; most of the evidence in this paper is way stronger than this) is, thank God, reported in the paper, right there, and to me it's informative: it suggests that in this particular study the finding is not super strong. Again, they've run other studies where the p-values are .001, so I do believe the finding. But I want to know the p-value; it provides information. And as I mentioned, there's no plausible mechanism by which an exclusive reliance on effect sizes and confidence intervals yields reductions in p-hacking. So if you want to also report effect sizes and confidence intervals, go for it, but I personally want to see the p-value as well. It's just an extra couple of characters, so you can keep adding it to your papers and it won't hurt anything.

Okay, so that's proposed solution number one. Proposed solution number two, which I'll also spend considerable time on (and then we'll get faster), is a reliance on meta-analytic thinking. This has been very popular recently, especially within psychology.
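To make the confidence-interval point concrete before moving on: "does the 95% CI contain zero" and "is p < .05" are literally the same decision, as long as the interval and the test are built from the same standard error. A quick check (my own sketch; sample sizes and the range of true effects are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
agree = True
for _ in range(1000):
    a = rng.normal(0, 1, size=25)
    b = rng.normal(rng.uniform(-1, 1), 1, size=25)

    t, p = stats.ttest_ind(a, b)  # pooled-variance (Student) t-test

    # 95% CI for the mean difference, from the same pooled standard error.
    diff = a.mean() - b.mean()
    df = len(a) + len(b) - 2
    sp2 = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / df
    se = np.sqrt(sp2 * (1 / len(a) + 1 / len(b)))
    half = stats.t.ppf(0.975, df) * se
    lo, hi = diff - half, diff + half

    # "CI excludes zero" and "p < .05" are the same decision rule.
    agree &= (p < 0.05) == (lo > 0 or hi < 0)

print(agree)  # True: the two rules never disagree
```

Since the two rules can never disagree, reading a CI only for whether it covers zero is null hypothesis significance testing under a different notation.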
And I do want to mention that the work in this section is done with Uri and Leif, but also with Joachim Vosgerau, who's the first author on the paper that we recently submitted about this, so he gets a lot of credit for what we're doing here.

Okay, so a number of researchers have proposed reliance on meta-analysis as a way to combat the replicability crisis. Geoff Cumming said that in the quote I talked about earlier, and there's also this paper by Blakeley McShane and his colleague on reliance on single-paper meta-analysis. We're going to use the term "internal meta-analysis" here, but it's basically the same idea: say you run five studies in your paper; at the end of the paper, you statistically combine those studies to report what the overall effect is. So that's the idea. I want to walk you through an example of what internal meta-analysis looks like, and then I want to talk about the really severe problems that it causes.
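Mechanically, an internal meta-analysis is simple, which is part of its appeal: inverse-variance-weight the per-study effect sizes and test the combined estimate. Here's a bare-bones fixed-effect version (my sketch, with made-up illustrative numbers; real meta-analyses add random effects and more):

```python
import numpy as np
from scipy import stats

def fixed_effect_meta(d, v):
    """Combine per-study effect sizes d with variances v (fixed-effect model)."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    w = 1 / v                           # inverse-variance weights
    d_meta = np.sum(w * d) / np.sum(w)  # pooled effect size
    se_meta = np.sqrt(1 / np.sum(w))
    z = d_meta / se_meta
    p = 2 * stats.norm.sf(abs(z))       # two-sided p for the pooled effect
    return d_meta, z, p

# Hypothetical example: five small studies, none significant on its own,
# yet the pooled estimate is significant.
d = [0.20, 0.14, 0.23, 0.17, 0.10]
v = [0.02] * 5                          # roughly the variance of d at n ~ 100
print(fixed_effect_meta(d, v))
```

Note what the example already hints at: a pile of individually unconvincing studies can pool into an impressive-looking combined p-value, which is exactly the property the next example exploits.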
So here's a paper by Tuk et al., published a couple of years ago. The hypothesis is not that important here. What they did in this paper is report the results of 18 studies, and at the end of the paper they meta-analyze those studies, and on the basis of that meta-analysis they say you can believe our effect. Now, you probably can't read this, so let me highlight two of the studies for you. This last column here is the p-value for the individual study. Study 5 and study 9 are the only two studies that reach conventional levels of significance, at p = .046 and p = .038; the other sixteen studies are not statistically significant. You can see the studies mapped over here. So if you look at the evidence on its own, there's no one study that jumps out as really strong evidence. But if you do the meta-analysis, man, it looks good: an effect size of 0.22, with a really high Z value and a tiny p-value. You look at that p-value and you say, I should definitely believe this thing.

Now, there are some problems with internal meta-analysis; some really big problems. The first problem is that meta-analysis exacerbates the effects of p-hacking and file-drawering. Just to think it through: imagine I have these eighteen studies (or even just five; it doesn't have to be eighteen), and for every one I p-hack just a tiny bit. I'm not even p-hacking to get p < .05; I'm just p-hacking to get the right sign. If I'm doing that for every individual study, each individual study on its own is not going to be significant, but once I aggregate across them, I am going to find significance. So basically, what internal meta-analysis does is take what's not really a problem (really, really tiny amounts of p-hacking) and turn it into a gigantic problem.

Let me show you the results of some simulations that prove this to be the case. In the simulations on the next slide, we'll be testing a directional hypothesis, so our baseline false positive rate is two and a half percent, not five percent. In this graph I'm going to show you some blue bars and some red bars. The way to read it is as follows: if you're p-hacking in such a way that your single-study false positive rate is what the blue bar says, then your ten-study meta-analytic false positive rate is going to be what the red bar says. So let me show you the first one. Again, we're starting with a baseline false positive rate of two and a half percent. What this says is: if you p-hack in such a way that you increase your false positive rate from two and a half percent to 4.9 percent, which is not a lot, not that consequential, and you do that for ten different studies on average and then combine them, your meta-analytic false positive rate is going to be 38 percent. That's really bad. If instead you're p-hacking each study so that your false positive rate is 7.2 percent, your meta-analytic false positive rate is going to be 81 percent. And if it's 9.5 percent, then you're all the way up at 96 percent. Meta-analysis is being offered as a solution to the problem; instead it's going to make the problem so much worse. Whereas the Bayesian thing does nothing, this is like burning everything to the ground.

So internal meta-analysis is really, really, really bad, and we should not allow people to do internal meta-analysis in their papers in order to conclude that an effect exists. I will talk about how internal meta-analysis can be useful, but we shouldn't rely on it to say that an effect exists; it really exacerbates the false positive rate. We've just written a paper on this, and I'd be happy to send it to anyone who's interested. And that's not even the whole problem.
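The amplification is easy to see in a rough reconstruction (mine, not their exact simulation): p-hack each null study just enough to get the right sign, say by picking the better of two independent dependent variables, then combine ten such studies with Stouffer's method. The per-study false positive rate barely moves, but the meta-analytic rate explodes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n_studies = 5_000, 10
z_crit = stats.norm.ppf(0.975)  # directional test at the 2.5% level

single_hits, meta_hits = 0, 0
for _ in range(n_sims):
    # Each study: the null is true; the "p-hack" is keeping whichever of two
    # independent DVs shows the more positive (predicted-direction) effect.
    z = rng.standard_normal((n_studies, 2)).max(axis=1)
    single_hits += np.sum(z > z_crit)                    # per-study false positives
    meta_hits += z.sum() / np.sqrt(n_studies) > z_crit   # Stouffer-combined test

print(single_hits / (n_sims * n_studies))  # ~.05: barely inflated per study
print(meta_hits / n_sims)                  # ~.4: huge meta-analytic rate
```

In this sketch each study's false positive rate lands around 4.9%, and the ten-study combined rate lands around 40%, the same flavor of blow-up as the 38% figure above: the tiny directional bias in every study never cancels out, so pooling compounds it instead of averaging it away.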
Now, the reason people propose reliance on internal meta-analysis is that it's supposed to solve the file drawer problem. The worry is that researchers aren't motivated to empty their file drawers, but if they're allowed to do an internal meta-analysis at the end of their studies, they will empty them out. And again, as I mentioned before, I don't think that's a huge problem, at least not the one we should be focused on, and internal meta-analysis makes the problem of p-hacking so much worse. So our claim is this: when we require individual studies to be significant, file-drawering is not really a big problem, but ironically, in the world of internal meta-analysis, it is a big problem. It's trying to solve a problem that doesn't really exist (or at least isn't a big one), and in so doing makes that very problem worse.

For example, let's just do some math. Say you're looking to publish a five-study paper, you're studying an effect that's not true (it's nonexistent), and you're willing to file-drawer half of your studies. Now, obviously people aren't going to do this consciously unless they're bad people; they're not going to laugh maniacally while they leave half their studies in the file drawer and publish the other half. Instead, they're going to rationalize, and this is so easy to do if you're running studies with different designs: they'll rationalize why some studies are good studies that belong in the meta-analysis and some studies are bad studies that don't. In fact, that is the case in real life; that's why people can rationalize it. Some studies do have bad designs, are confounded, and should not be part of the meta-analysis. So after the fact you can say, "look, that was a really dumb way to measure it; I'm going to keep that study out of the meta-analysis." This is not a person whose goal is "hahaha, I'm only going to report half of my findings"; this is a person who's able to rationalize in such a way that they only report half their findings.

So if this is what you're doing, and you're not in the land of internal meta-analysis, and you're not p-hacking, then you have a 1 in 40,000 chance of getting five studies to be significant in the predicted direction. This is why we don't think file-drawering is the problem we should all be focused on right now: if you're writing a multi-study paper, it's really hard to get file-drawering to do what you need it to do, especially when you care about the direction of the effect. But if instead you just need your five-study meta-analysis to be significant, now you have a 13,000 in 40,000 chance. So basically, your false positive rate really increases. Another way to put it: you are 13,000 times more likely to get your false-positive meta-analysis published than you are to get your paper published otherwise. So this is another version of the problem: it makes the problem of p-hacking worse, and it makes the problem of file-drawering worse.

You might say, well, aren't there procedures to correct for selective reporting in meta-analysis? Presumably there are. There's trim-and-fill, which has 3,800 citations, but it does not work; no one actually tested whether it works in real life until we tested it. We wrote a Data Colada post about it, and we've also presented data in papers, basically showing that it does not work to correct for selective reporting, even in simulations, let alone in the real world. There's also a test called the excessive significance test; I'm not going to get into it because of time, but basically this test only tells you whether there is selective reporting. It doesn't actually correct for the magnitude of selective reporting, so it's not necessarily useful. We have developed a technique, which I'm not going to describe in detail here, called p-curve. We think p-curve is pretty good at correcting for selective reporting, but as input, p-curve can only take statistically significant results; that's the way p-curve works. So it only corrects for selective reporting among a subset of published findings, i.e., those that were statistically significant. In the example I presented before, it's not going to be useful at all, because only two of the meta-analyzed studies are significant, and you need at least a handful of significant studies for p-curve to do its thing. So I don't think that's going to solve the problem either, in many of these cases.

So again: we think p-hacking is a big but tractable problem, and it's tractable because of the solutions I'll talk about later that actually do work. But internal meta-analysis makes it bigger and intractable, and file-drawering is not a consequential problem until you try to fix it using internal meta-analysis, ironically.

So that's the first problem with internal meta-analysis. The second problem is: how on earth do you falsify a meta-analysis? Let's say you do have a false-positive meta-analysis. How do you show that it's false positive? We thought about this because two of my colleagues, as you're going to see, tried to do it and realized, wait, we don't know how this can even be done. There are a couple of approaches you can take. One idea is to try to replicate the study containing the strongest evidence. For example, if we return to the Tuk et al. study, you could look at study 9, the study with the smallest p-value (you could instead look at the study with the biggest effect size, whatever you want to count as the strongest evidence). So you could say, okay, let's focus on study 9 and try to replicate it. Leif Nelson and Joachim Vosgerau tried three times to replicate study 9, using a combined n of 3,082. The authors had reported an effect size of d = 0.45; the pooled data from the replications suggested the effect was not there, or if it's there, it's really, really, really tiny, certainly much different from the 0.45 that was reported. And then they were like, well, now what? They didn't replicate study 9, but what do you do with that? Because the authors can say, we have 17 other studies. So how do you know what to conclude about the overall meta-analysis on the basis of these failures to replicate? It's not obvious. One idea, if you're in the meta-analytic spirit, is what some meta-analysts would say: just add the replication data to the meta-analysis. But if you do that, it's
Still going to be wild, significant, because the meta-analysis, is made up of all of those other studies, that may, or may not have been selectively, reported, I don't want to suggest by the way that tuck it out selectively reported stuff I don't know I do know that study nine doesn't doesn't, replicate that's all I know.
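To see why adding failed replications barely moves an internal meta-analysis, here is a minimal fixed-effect sketch. The per-cell sample size (n = 50) and the simple fixed-effect model with the usual large-sample variance of Cohen's d are my assumptions for illustration; the talk doesn't specify either.

```python
import math

def fixed_effect_meta(studies):
    """Fixed-effect meta-analysis of Cohen's d from two-cell studies.
    studies: list of (d, n_per_cell) tuples."""
    num = den = 0.0
    for d, n in studies:
        # large-sample variance of d for two groups of size n
        var = 2.0 / n + d * d / (4.0 * n)
        w = 1.0 / var
        num += w * d
        den += w
    d_pooled = num / den
    z = d_pooled * math.sqrt(den)  # pooled d divided by its standard error
    return d_pooled, z

originals = [(0.45, 50)] * 18     # 18 studies, all d = 0.45 (hypothetical n)
replications = [(0.0, 125)] * 18  # 18 null replications at 2.5x the sample size

d_orig, z_orig = fixed_effect_meta(originals)
d_all, z_all = fixed_effect_meta(originals + replications)
print(f"originals only:         d = {d_orig:.2f}, z = {z_orig:.1f}")
print(f"with null replications: d = {d_all:.2f}, z = {z_all:.1f}")
```

Even after piling on 18 zero-effect replications, each at 2.5 times the original sample size, the pooled estimate here only drops to about d ≈ 0.13 with z ≈ 5 — still comfortably "significant."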
Another. Idea is to try to replicate all the studies now, for eighteen studies you're probably never gonna do that but if let's say it has five studies or something like that maybe you would try to do that of, course no one will but, but. Let's, say we replicated, all of their studies using, their sample sizes and we got 18, effects equal to zero when, you add those to the meta-analysis. It's still significant, right. So it's really hard to get rid of this thing because you you already have all of these, biases. Baked into it um even. If we replicated, every study using, what simonson recommends for applications, using 2.5, times the sample size of the original study and every, effect was zero we'd, get that we'd still get we. Still get the thing to be significant, so it's. Not obvious how you falsify, an internal meta-analysis, or a meta-analysis, ., so. Meta-analysis. Makes, it much easier to publish false positives, and it makes it prohibitively, difficult to, correct them and that combination is absolutely, deadly so. So. One thing we we, wrote I think this is still in our paper but, I certainly endorse it we. Recommend to never draw inferences about the existence of an effect from a meta-analysis. Including. An external meta-analysis. Instead. Researchers should have mass evidence by running individual, studies that are well-designed highly, powered and demonstrably, replicable. Effect, supported only by studies that lack any of these essential features should not yet be believed no matter how many of those studies there are scientific. Knowledge advances, one replicable, study at a time. When. 
You think about there, are effects I mean I teach an entire class on managerial, decision making and mostly I walk through experiments, that I and effects that I believe in why do I believe in those effects right so why do I believe in prospect, theory or in framing effects or in preference projection, I believe, in them because when, I run studies on those things they they work right I believe in them because the original studies are highly powered and they're they are obviously replicable, I don't believe in them because somebody Mehta analyzed the preference projection, literature, we don't need meta-analysis, to tell us whether or not, and effect is true or, false we're, doing we're doing well without it so. But isn't meta-analysis, good for something I actually do, think it is good for something and I think it's good for something that it's sort of the opposite of what people use it for it's. Good for exploration so. I really like when people conduct. Meta analyses, and you, can look at each study and what the effect size is and then, when you see there's variability across, the studies and you map it out and you do the nice forest plot you, can basically say wow that's interesting these studies show a big effect up here these studies show the, opposite or a small effect down here I wonder what the difference is between them and from, there you can't draw conclusions, of course you can just come up with hypotheses, and then you can test those hypotheses, in subsequent, confirmatory. Studies. But I do think it is sometimes, really good to go through and systematically. Look at how effect sizes differ across studies and try to draw conclusions about that we, should be doing lots of exploration, we just should be calling it exploration, that's, the idea okay so, so so, solutions one and two or took, awhile but the rest will be the. Rest would be quicker proposed. Solution number three as was mentioned by Gwen and in her introduction. There's. 
Been a recent solution, to lower the alpha level to P equals 0.005. Here's, the paper and the, many many co-authors. I was. Asked to be an author on this and declined for reasons that I will get, to you right now. So, the. The problem, here, is not, that, this wouldn't help I think it probably would. Help the. Problem here is that we. Have always been very interested in solutions. That are practical. And that are, easy for researchers, to adopt and I, think there's no way in, the world we are ever going to do this and I.
Think We're not going to do it for, one thing I don't think it's gonna help that much especially if you're doing internal meta-analysis, super easy to get 0.005. But. But. Here's what they say in the paper okay, for. A wide range of common statistical tests transitioning. From a p-value threshold. Of 0.05 to 0.005. While, maintaining 80 percent power would. Require an increase in sample sizes. Of about 70%. We. Have tried over the past eight years just. To get people to properly, power their studies for, P. Less than 0.05, and, basically. They throw things at us when we tell them how many subjects, they actually need and these. Guys are saying you need to increase that amount by. 70%, so. For example under the P less than 0.05, regime, I've. Discovered this in my research if if. There's a if there's truly a 10 percentage, point difference between, two conditions, so one condition is has 45, percent the other has 55, percent you, know many subjects, per cell you need to get eighty percent power you. Need six hundred you. Need to run 1200, people to detect a 10 percentage point difference which is gigantic, right, and so, you already need 1200, subjects for a two cell design that, probably no one will publish because they want to see the moderation in which case you have to multiply it by four okay, now. These guys are saying yeah what Joe said but 70 percent more I don't. Think that's I just don't think practically, people are gonna do it and so I don't think we should be chasing solutions, that politically. Are not going to be adopted and, I don't think it's going to help that much compared, to the other easier, to adopt solutions, that I'm going to talk about later we can fix this in a better way in other words that's the idea, so. So. The solution basically, the punchline here because it will make pee hacking harder to do lowering the alpha level to 0.005 will probably help so long as you don't meta-analysis. 
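The 70% figure is easy to check with a standard two-proportion power calculation. This is a textbook normal-approximation sketch, not the speaker's own calculation — it actually gives a somewhat smaller per-cell n than the 600 quoted from memory above, so his scenario may have assumed a different test, but the ratio between the two alpha levels is the point.

```python
import math
from statistics import NormalDist

def n_per_cell(p1, p2, alpha, power=0.80):
    """Per-cell n for a two-sided two-proportion z-test (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * var / (p1 - p2) ** 2)

n_05 = n_per_cell(0.45, 0.55, alpha=0.05)    # threshold p < .05
n_005 = n_per_cell(0.45, 0.55, alpha=0.005)  # threshold p < .005
print(n_05, n_005, round(n_005 / n_05, 2))   # the ratio comes out near 1.7
```

Whatever baseline you start from, moving the threshold from 0.05 to 0.005 at fixed power inflates the required sample by roughly the 70% the paper claims.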
But I think the costs of it outweigh the benefits. Okay, now let's talk about the good solutions.

The first good solution I'll talk about comes from a paper we wrote in 2011 called "False-Positive Psychology." In that paper we not only wanted to document the problems associated with p-hacking; we also tried very hard to come up with a solution that no one could possibly object to. It turns out we were very naive about that, but the solution we came up with is that journals should require authors to disclose what they did in their studies. People should have to report the measures they collected, the manipulations they did — all of that stuff. The idea is that by doing this, peer review can do its thing. In the absence of disclosure, peer reviewers can't see whether people had a chance to p-hack. But once you disclose that, actually, I measured five similar things and only one of them is significant, the reviewer can potentially sniff that out and say, you know what, I really want you to replicate that, because it looks like you might have been capitalizing on chance here. So that's the idea: authors must report all measures, manipulations, data exclusions, and how they arrived at their sample size.

In our original paper we wanted to demonstrate how this could be helpful, so we ran two studies, which we p-hacked, designed to show evidence for a completely impossible finding. The impossible finding that we found evidence for is that listening to a particular song changed how old you are. That's not true. And we were able to do it in a way where our false positive rate was essentially a hundred percent. And we got both studies to work — and we only ran two studies.

According to the standards of the day, we would write up Study 2 this way: twenty undergraduates, drawn from the same subject pool as in Study 1, listened to either "When I'm Sixty-Four" by the Beatles or "Kalimba," a song that came with the Windows 7 operating system. In an ostensibly unrelated task, they then indicated their birth date and their father's age, to control for variation in baseline age across participants. An ANCOVA revealed the predicted effect: people were nearly a year and a half younger after listening to "When I'm Sixty-Four" rather than "Kalimba," and you see the p-value there.

According to the standards of 2011, we could just write it up like that, and it would be believable, even though it sounds completely crazy. But if instead we had to disclose everything we did, the write-up would be different — and it should be different, because we didn't just do all that. Instead, we'd have to say 34 undergraduates, and we'd have to say 34 because we actually dropped a condition: the "Hot Potato" condition, by The Wiggles. We also conducted our analyses after every session of approximately ten participants, and we did not determine the data-collection termination rule in advance, so we p-hacked that way. We also didn't just measure father's age as a covariate; we measured a bunch of things that we thought might be associated with age or feelings of age — some of these we were just having fun with. But you can see we had lots of opportunities to look for different covariates. We also proposed in our paper that maybe when you use covariates, you should have to report what happens when you don't use them. And if you do that, you find out that if you don't control for father's age here, there's no effect. For whatever reason, in this study, just by chance, father's age was the magic ticket to get this effect to be significant. Of course, if we tried to do this again and p-hacked it again, it wouldn't be father's age; it would be something else. But if we wrote up the study this way, there's no way it could ever be published. People would ask, why are you controlling for father's age and not mother's age, for example? Also, once you think about it, controlling for father's age makes no sense. So peer reviewers would not allow this to be published, and that's exactly right. And so we thought: to give peer reviewers a shot at detecting this kind of thing, we have to require authors to disclose what they did in their studies, and journals should have these rules.

A lot of journals have not done this, and that's been disappointing, but some of them have. Psychological Science has definitely taken the lead here. Eric Eich, the former editor, implemented these requirements at Psych Science back in 2013; he wrote a nice editorial called "Business Not as Usual," and now you have to disclose all of this when you are submitting to Psychological Science. So that's good. But, to my knowledge, none of the top APA journals require any of this. Lots of the marketing journals — I'm more in the field of marketing than in management — do require it. So I think that's bad.

Disclosure is imperfect, of course; it's not the perfect solution. It does hinge on reviewers' educated guesses about what analyses were attempted. But we think it's practically costless, it's obviously consistent with what it means to be a scientist — you should report everything that you did — and without it, it's impossible for reviewers to detect p-hacking. So we think this is a necessary thing that we should be doing, and it's pretty costless, so we should do it tomorrow, or even today.
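To put a number on the "five similar measures" problem, here is a toy simulation — not the authors' actual demonstration; the group sizes and the assumption of independent measures are mine. Each simulated experiment has no true effect; the only "p-hack" is testing several dependent variables and claiming success if any one of them is significant.

```python
import math
import random

random.seed(1)

def any_dv_significant(n_dvs, n_per_cell=25, crit=1.96):
    """One null experiment: two groups, n_dvs independent outcome measures,
    each compared with a z-test (sigma = 1 known, for simplicity).
    Returns True if any measure reaches |z| > crit."""
    se = math.sqrt(2.0 / n_per_cell)
    for _ in range(n_dvs):
        m1 = sum(random.gauss(0, 1) for _ in range(n_per_cell)) / n_per_cell
        m2 = sum(random.gauss(0, 1) for _ in range(n_per_cell)) / n_per_cell
        if abs(m1 - m2) / se > crit:
            return True
    return False

sims = 1000
rates = {k: sum(any_dv_significant(k) for _ in range(sims)) / sims
         for k in (1, 5)}
print(rates)  # ~0.05 with one DV; roughly 1 - 0.95**5 = 0.23 with five
```

A reviewer who sees all five measures disclosed can ask which one was the planned analysis; a reader of a write-up that reports only the winning measure cannot.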
Okay. So if disclosure is helpful and a good solution, we think that pre-registration is the cure. Pre-registration, we think, is where the field needs to go, and it's the thing that's really going to prevent p-hacking going forward, or at least stop most of it. The idea here is that unless an investigation is explicitly exploratory, authors who are collecting new data should pre-register their critical analyses ahead of time. This gets a little complicated when we're analyzing existing data sets — we could talk about that in the Q&A if you'd like — but at least when you're collecting new data, this is a pretty easy thing to do.

Now, circa 2011, at least in psychology, the approximate number of people who were pre-registering their studies was zero. No one had ever pre-registered; we had never pre-registered. We didn't even propose it in our original paper as a solution, because we thought there was no way people would do this — it seemed so different from what people had done before. Over time, pre-registration has sort of gained steam, definitely not only because of us; lots of people have been promoting the idea. But when we started to think about how we were going to do this in our own laboratories, we realized that we didn't really know how to pre-register, so it seemed like a kind of hard thing to do. And obviously, anything that's hard to do, people are not going to want to do. So we decided to create a website, a platform, that allows researchers to easily pre-register their studies. We launched a website called AsPredicted.org. What's nice about this website is that it's super easy to use. You don't even need to remember a password; you just do everything via your email address, and you approve things when you get emails about them. Basically, you create a time-stamped document, and that time-stamped document remains private forever, until you hit the "make it public" button. We've recently added a feature where you can make an anonymized version of it public for the peer-review process, so that you can show reviewers your pre-registrations during review without revealing who you are. And everything gets saved forever, so you don't have to worry about that: it's automatically stored in the Web Archive, and the Web Archive will keep it forever. If the Web Archive ever breaks down, we're basically in Armageddon anyway, so we never have to worry about that — and it's also backed up elsewhere. You can trust that all your data there is totally safe. And it generates a short PDF, usually one page; it could be two pages, depending on how you answer the questions. Now, probably the best feature of the whole thing is that the way you do it is it just asks you nine questions, and you answer them, so you know exactly what information you should be providing in your pre-registration. Here are the nine — well, there are eight questions listed; the ninth question is "provide a title," so the ninth question is pretty easy. You have to indicate whether you have collected data for the study already. You're not allowed to say yes; you are allowed to say "it's complicated," and then you can explain it.
You then have to specify the research question that you're investigating. By the way, you don't have to make a prediction in your pre-registration. The purpose of pre-registration is not to show that you can predict the future; it's to show that you didn't p-hack your analysis. Many times, because I don't want to look like an idiot later, I don't write hypotheses in my pre-registrations; I write them in the form of research questions — we are testing whether X affects Y — so that if it doesn't, I don't look stupid later on. That's all you have to do. You can also frame it as hypotheses if you want to; my students sometimes like to do that. They're overconfident. But it's fine.

Then: describe the key dependent variables. The key word here is "key" — you don't describe all of your measures. This is a pre-registration where you're just pre-registering which of your analyses are confirmatory. All of your exploratory analyses have no business being in this document; all of your exploratory measures, exploratory conditions, exploratory covariates have no business being in this document. This is just a document for the analysis or analyses that count, so that I'll know the other analyses were exploratory. So you specify the key dependent variables and exactly how they will be measured or scored. You describe your manipulations: how many conditions, and which conditions participants will be assigned to. Number five is the really critical one: specify exactly which analyses you will conduct. "We're going to do an OLS regression where we regress our DV, as stated in question three, on our IV, and it will be coded this way." You want to be very specific here about exactly how you're going to do it. Describe exactly how you're going to deal with outliers or other exclusions. Describe what your sample size is going to be and how it will be determined; you don't have to report a power analysis or anything like that, and you don't have to say why — you just have to say what it's going to be. And then there's "anything else you would like to pre-register?" — a catch-all category for "well, I want to tell you a little bit about this other weird situation we're going to have."

So it's pretty easy. We've recently written a blog post on Data Colada that's basically called "How to Properly Pre-Register Your Study"; it's number 64. There we give guidelines for what should be included in, and what should be excluded from, pre-registrations. We're very big on the fact that pre-registration has to be easy for people to do, and it also has to be easy for people to read. One problem with the way pre-registrations are done in medicine, for example: if you go on ClinicalTrials.gov, some of the pre-registrations are 100 pages long. No one's checking those things, and in fact there have been lots of studies looking at whether the published findings that were pre-registered on ClinicalTrials.gov actually adhered to their analysis plans, and in many cases they did not. That's not surprising, because the editors are way too busy to read through hundred-page documents. So we're really big on this: you should only include in your pre-registration the stuff that pertains to your confirmatory test. You don't need to say why you're investigating the thing, you don't need to talk about the other exploratory analyses you're doing, etc.

Some people have wrong notions about pre-registration, so let me disabuse you. Pre-registration does not mean that, once you're through with your study, you only run your pre-registered analysis and that's it. Pre-registration does not preclude exploration. In fact, all it does is make clear which analyses were exploratory and which were confirmatory. For me, since I'm chronically worried about p-hacking, if anything it makes me feel better about exploring, because I know that when I'm exploring, I'm not going to mistakenly say that an exploratory analysis was confirmatory — this keeps me honest. So when I get my data sets, we run our pre-registered analysis, but then we run lots of other analyses too, and if we find anything interesting there, the next study can be about those interesting exploratory
findings that we found. So you should still do the exploratory stuff; of course pre-registration does not preclude that at all. It also does not mean that you only publish results that were as predicted. As I mentioned before, the key aspect of pre-registration is not the prediction you make but the analysis or analyses that you commit to. You could say in your paper, "we predicted this, but we didn't find it," and I think most people would be fine with that. Or you could, as I mentioned, not even make a prediction, and then you don't have to do that at all.

The other pitch is that I think pre-registration will improve your life as a researcher. I say this because we started pre-registering three years ago, and life is better for me and my students. If my students were here, they would tell you just how great pre-registration is. In fact, I recently couldn't make it to a conference where I was supposed to give a talk on pre-registration, because of a family emergency, and they stepped in and gave the talk, because they're so excited about pre-registration. They just love it, and they love it for a number of practical reasons. One thing they love is that it actually promotes exploration, by giving you the freedom to do all kinds of weird things without being suspected of p-hacking. We are about to run a study where we want to exclude half the data after we run the study, and that would look so weird if we just did that and didn't pre-register it — you should rightly accuse us of p-hacking, because it's such a crazy thing to do. I'm not going to get into it, but there's a really weird reason why we want to get rid of half the data. Now we feel totally free to run that study, because we're going to pre-register, in advance, what we're going to do. There was another case where we were running a set of items for a study, and we were really confident the effect would work on one set of items, but there was another set of items where we weren't sure it would work. In the old days, I would have told my students: you can't run those other items, because we can't exclude them later — it'll look like we're p-hacking — so we should only run the items we are confident will work. But now, in the land of pre-registration: oh yeah, include those items; pre-register that we're only going to analyze these items, but now we can see what happens with the others and learn something, and next time we can adjust our pre-registration accordingly, depending on whether we want to include those items or not. So it gives you the freedom to do all kinds of stuff — you can include measures you wouldn't have included before, measures other people might have thought were suspicious. I really think it's great for exploration; it's the opposite of what people often think. Second, planning your analysis in a detailed way forces you to scrutinize your design before you run the study. Many times, at the stage of pre-registration where we are writing up, in question number five, exactly how we're going to specify our analysis, we realize: oh, this is not the right study to be running — we don't know how we're going to specify our analysis, or we're missing a control condition, or something. So it just makes you more careful, and the students like that. A third thing they would put on the slide — I just talked to them about this recently, so I didn't have it here earlier — is that they like it because, when it comes time to write up their studies, they've basically already written them up: they can just go back to the pre-registration, remember exactly what the key analyses were, and do a lot of copying and pasting. So it's also good for record-keeping, which I hadn't thought of, but they really appreciate it. And on AsPredicted, again, you can keep these things private forever if you want to. I know that some people can game that; we're not that worried, because we don't think many people will try to game it in that way. But we want it that way because now there's literally no downside to trying it: you can try pre-registration and keep it private forever. Just give it a shot. I think you'll actually like it, and I think other researchers would like it as well once they try it out, because I do think it'll improve your life.

Okay, and then the final solution will take two seconds. The final solution, of course, is to conduct exact replications. The way to establish that a finding is replicable, or true, is to actually replicate it — and of course, if your replication is exact, you may as well pre-register it. As an associate editor for a couple of journals, when results look p-hacked but they're kind of interesting and I'm not really sure whether they're p-hacked, I don't just tell the authors that they are bad people who are p-hacking. Instead I say: okay, your finding seems statistically kind of weak to me; can you please run a pre-registered replication? I think that's a really good editorial policy, and in fact, in a number of these cases, I've been surprised: I get the revision back, they have run the pre-registered replication, and they get a different finding. Their paper is now about a different thing than the original paper was, because it turns out their original finding was not true when they actually ran the pre-registered replication — but they're still trying to publish a version of it. What I do with that depends very much on the case. But having people conduct exact replications and pre-register them is the ultimate way to figure out whether things are true, and I think this should be incorporated into the editorial process.

So, the summary: don't do meta-analytic thinking; that's really bad. Lowering the p-value threshold to 0.005 is not going to help: too high a cost, and I think the benefit will be small. Disclosure: we should definitely be doing it, though it's not perfect. Pre-registration is the cure, and exact replications will probably be the other cure. Just some references and plugs: our blog is Data Colada; the pre-registration piece is number 64; we have this thing called p-curve that you might want to try out — it's kind of neat. And, as was already mentioned, I want to plug our recent paper where we review a lot of this, called "Psychology's Renaissance." Okay, thank you very much; I appreciate it.