This is a draft post which will hopefully be expanded on in the future.
While analyzing some data for a research project, I noticed that effect A seemed much harder to get a significant result for than a closely related effect B. I wanted to know whether A really was harder to “prove” than B, or whether we should be suspicious of our significant B effects since their A counterparts weren't significant. Intuitively, it seemed like A could be harder to prove than B, but I wanted to make a more solid argument than “seems right to me!” Unfortunately, my statistics education and a few web searches did not offer much advice on how to compare significance levels for different tests. This article describes how I went about trying to answer this question and the conclusions I reached.
To make this question more accessible, I simplified it to: “Is it easier to tell what color a ball is or if it is red?”
As it turns out, this is slightly too simplified, but it's simple enough to pose to people online and get good speculative answers and ideas that I had not considered.
Some set about describing how a color detector might work; it seems the answer depends on the technology used for the detector.
Others took an information-theoretic approach: whether a ball is red is 1 bit of information, but what color it is is more than 1 bit if you have more than two color possibilities.
Philosophy has a plethora of interesting questions to ask about what it means to “be a color”; these are well beyond my ability to answer.
We can also discuss this in terms of how reliable the tests are, how many are performed, and what conditions they're performed under.
A hypothesis testing perspective suggests we test a hypothesis for each color, or perhaps with clever hypothesis design we can test
All these ideas, as well as a late-night trip to Wikipedia, led me to a solution that's convincing enough, at least for me. We'll start with a more explicit statistical “experiment” that describes the data we want to analyze. Next, we'll ask two questions of this data and explore how we might answer them. This leads us to an answer for our original question, “how do we say whether effect A is harder to prove than effect B?” Finally, we'll draw connections to some statistical measures of serial dependence in time-series data.
Say we have balls, one for each of
If this seems too abstract for you, we can switch the scenario up: suppose we have cards with letters written on them and are using some image recognition software to identify the letter. This scenario allows for the situation where we have multiple cards with 'A' written on them in different handwriting. It also permits the image recognition software to be nondeterministic, so repeat detections of the same card may produce different results!
Suppose we want to know whether the color detector is right; that is, if it's given a red ball, does it report red?
And so on for each color.
If
However, we don't actually know any of these values; we just have the observations
To measure how right the detector is, we define the random variable
A relevant fact is that a Bernoulli distribution
This is all well and good, but suppose we let Wittgenstein or Quine loose in the lab before the experiment and they explained referential inscrutability to the color detector. Now the color detector still works or doesn't work, but it also might have mixed up its labels for colors so that, for example, it calls “green” what we'd call “red”. Can we tell if it's just making stuff up or if there is some method to its madness — a mapping from its outputs to the “right” answers? In short, can we tell whether it could be right?
We can even relax this question a little by letting the detector, say, report “green” for both “red” and “yellow” balls.
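To make the difference between the two questions concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of the original experiment: the observations are made up, the function names are hypothetical, and the mapping is taken to run from true colors to whatever label the detector most often reports for them (which allows the relaxed case where two colors share a label).

```python
from collections import Counter, defaultdict

def agreement_as_labelled(observations):
    """Question 1: how often does the reported label equal the true color?"""
    return sum(truth == report for truth, report in observations) / len(observations)

def most_consistent_mapping(observations):
    """Question 2 (sketch): for each true color, pick the label the detector
    reports most often. Two colors may share a label (the relaxed version)."""
    reports_by_truth = defaultdict(Counter)
    for truth, report in observations:
        reports_by_truth[truth][report] += 1
    return {truth: counts.most_common(1)[0][0]
            for truth, counts in reports_by_truth.items()}

def agreement_under_mapping(observations, mapping):
    """How often does the detector report the label the mapping predicts?"""
    return sum(mapping[truth] == report for truth, report in observations) / len(observations)

# Made-up (true color, reported label) pairs for a detector with scrambled labels.
observations = (
    [("red", "green")] * 9 + [("red", "blue")] * 1
    + [("yellow", "green")] * 8 + [("yellow", "yellow")] * 2
    + [("blue", "red")] * 10
)

mapping = most_consistent_mapping(observations)
print("agreement as labelled:  ", agreement_as_labelled(observations))
print("most consistent mapping:", mapping)
print("agreement under mapping:", agreement_under_mapping(observations, mapping))
```

In this toy data the detector is almost never "right" as labelled, yet it is highly consistent once a mapping is allowed. Note that the mapping is chosen after looking at the data, so the second agreement figure is optimistic by construction; that is one informal intuition for why the second question might need a stronger effect before we believe it.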
This question is equivalent to asking whether there is a mapping
As an aside, from a formal logic perspective, we can see that question 2 is more complex to ask.
Question 1 asks if
We can split this up into
As an example, let
Normalize the
To summarize: if we want to know whether the detector is right, the significance test critical values are taken from the
As shown in Figure 1, the more degrees of freedom in the distribution, the longer the
The critical values are lower for the first question than for the second, assuming we have at least 3 colors to pick between.
For the above example, compare the CDF for
Which is to say: it takes a less extreme effect to determine that a detector is right. Determining whether it might be right or if it's just talking nonsense requires a stronger effect.[4]
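The effect of degrees of freedom on critical values can be checked numerically. The sketch below assumes the test statistics follow chi-squared distributions, which is my reading of the argument rather than something stated explicitly here, and the degrees of freedom in the loop are hypothetical placeholders, not the ones derived for either question. It uses only the Python standard library: the CDF comes from the power series for the regularized lower incomplete gamma function, and the critical value is found by bisection.

```python
import math

def chi2_cdf(x, dof):
    """P(X <= x) for X ~ chi-squared(dof), via the power series for the
    regularized lower incomplete gamma function P(dof/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:      # all terms are positive, so this converges
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_critical_value(dof, alpha=0.05):
    """Value x such that P(X > x) = alpha, found by bisection on the CDF."""
    target = 1.0 - alpha
    lo, hi = 0.0, float(dof)
    while chi2_cdf(hi, dof) < target:  # grow the bracket until it holds the answer
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, dof) < target:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical degrees of freedom, chosen only to show the trend.
for dof in (1, 2, 3, 9):
    print(f"dof={dof}: critical value at alpha=0.05 is {chi2_critical_value(dof):.3f}")
```

The critical value grows with the degrees of freedom, so a statistic drawn from the distribution with more degrees of freedom has to be larger before it counts as significant at the same level.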
All this play in the statistical ball pit was not just for fun; it was motivated by questions about statistics that you won't find so easily in a Chuck E. Cheese. My research has led me into the realm of Categorical Time Series Analysis. A time series is a sequence of observations; for instance, the outside temperature measured every hour. These are used all over the place to study weather, heart rhythms, stock market performance, and so on. Our data happens to be categorical, meaning that it takes discrete values with no inherent ordering. Categorical models appear in a variety of places; for instance, studies of sleep phases and bird calls. While real-valued data can be averaged, sorted from least to greatest, and run through all kinds of functions, the operations that are allowed for categorical data are much more restricted. Since we can't compute the average, median, or expected value of a categorical distribution, the statistical techniques for it are a bit different from their real-valued counterparts.
We'll focus on measures of categorical serial dependence, which is analogous to autocorrelation for real-valued time series.
These measures ask, “does knowing part of a time series help predict the rest?” or, “how repetitive is this time series?”
We'll ask these for a particular time offset, called lag: given an observation at time
Unsigned serial dependence is a measure of how repetitive a dataset is: if we always see
Notice that wherever we have signed serial dependence, we also have unsigned serial dependence. However, the reverse is not true. Unsigned serial dependence is harder to prove precisely because it encompasses more possible relationships among the data.
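As a rough illustration of the two kinds of dependence, the sketch below computes one signed and one unsigned measure at a given lag for a toy series. It assumes the measures are Cohen's kappa (signed) and Cramér's v (unsigned) in their usual categorical-time-series forms, estimated from overall category frequencies; the function name and the toy series are hypothetical, and this is not necessarily the exact estimator used in the research described here.

```python
import math
from collections import Counter

def lagged_dependence(series, lag):
    """Return (Cohen's kappa, Cramér's v) for a categorical series at a lag >= 1.
    Marginal probabilities are estimated from the whole series."""
    n_pairs = len(series) - lag
    pair_counts = Counter(zip(series[:-lag], series[lag:]))
    marginals = {c: k / len(series) for c, k in Counter(series).items()}

    # Signed: how much more often does a value repeat than chance predicts?
    p_repeat = sum(k for (a, b), k in pair_counts.items() if a == b) / n_pairs
    p_chance = sum(p * p for p in marginals.values())
    kappa = (p_repeat - p_chance) / (1.0 - p_chance)

    # Unsigned: how far is the joint distribution of (x_t, x_{t+lag})
    # from the product of the marginals, for any pairing of categories?
    chi = 0.0
    for a, pa in marginals.items():
        for b, pb in marginals.items():
            p_ab = pair_counts.get((a, b), 0) / n_pairs
            chi += (p_ab - pa * pb) ** 2 / (pa * pb)
    v = math.sqrt(chi / (len(marginals) - 1))
    return kappa, v

# Toy series that strictly alternates between two categories.
series = list("AB" * 50)
for lag in (1, 2):
    kappa, v = lagged_dependence(series, lag)
    print(f"lag {lag}: kappa={kappa:.3f}, v={v:.3f}")
```

For the alternating series, kappa is -1 at lag 1 (values never repeat) and +1 at lag 2 (values always repeat), while v stays at about 1 for both lags, since the series is fully predictable either way.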
We can relate the questions asked of our colored balls data to serial dependence measures at lag
Signed serial dependence corresponds to the first question:
Common measures for signed serial dependence are Cohen's
Unsigned serial dependence corresponds to the second question: “perfect insight” corresponds to picking a mapping
The Bernoulli trial statistics developed here don't exactly follow the properties one would desire of serial dependence measures, which is why we use Cramér's
I am not a statistician, as you can probably tell. This argument features a few handwaved bits which I would be very interested in shoring up. If you have thoughts on how to improve this argument or if you spot a hole in it, I'd like to hear them! Also, if there are articles written about related topics, please send them my way. Write me an email, or if you got this through social media, replies there are good too.
Thanks to Canageek, parenthetical, cpsdqs, cephie, mithrandir, hex, and crystalmoon for their input!
1. If
2. If
3. Given random variables
4. There is some handwaving here; namely, we assume the effect of scaling to standard normal distributions is approximately equivalent for both questions. Whether this is true is somewhat difficult to show; fortunately, it has little effect on the overall conclusion, which concerns measures which empirically produce similar-scale values when evaluated on the same dataset.