## How to determine whether one frequency measure is better than the other?

In our research comparing various frequency measures, we usually look at the correlations between the frequency measures and word processing times (e.g., lexical decision times) and we go for the frequency measure with the highest correlation. However, increasingly reviewers (and editors) request to see a p-value when we recommend one frequency measure over another.

As long as we are dealing with megastudy data comprising tens of thousands of observations, there is little point in testing whether the difference between two measures is statistically significant, as differences as small as .02 are likely to be statistically “significant” (p < .05!). However, when we only have small-scale studies at our disposal, things are different and reviewers are right to ask for statistical confirmation.

**Hotelling-Williams test for dependent correlations**

The test recommended for differences in correlations that are themselves intercorrelated (as is the case for various frequency measures) is the Hotelling-Williams test (Steiger, 1980). You can find the test in several R-packages, but it is reasonably simple to implement one yourself. The figure shows the equation you need. For instance, when the SUBTLEX log frequency correlates .75 with 240 lexical decision times and the Celex log frequency .69 while both log frequency measures have a correlation of .84, then r12 = .75, r13 = .69, r23 = .84, N = 240, t = 2.4934, df = 237, p = .0133. You find an Excel file here that does the calculations for you.
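If you prefer code to a spreadsheet, the calculation can be sketched in a few lines of Python. This is a minimal sketch, assuming the usual Williams form of the t statistic as given in Steiger (1980) with df = N - 3. The function names `hotelling_williams` and `t_two_sided_p` are our own, and the p-value is obtained by simple numerical integration of the t density so that only the standard library is needed.

```python
import math


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value of Student's t via Simpson integration of the density."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def hotelling_williams(r12, r13, r23, n):
    """Test H0: rho12 == rho13 when both correlations share variable 1.

    r12, r13: correlations of the two predictors with the criterion
    r23:      correlation between the two predictors
    n:        number of observations
    """
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23  # |R|
    r_bar = (r12 + r13) / 2
    denom = 2 * (n - 1) / (n - 3) * det + r_bar**2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)
    df = n - 3
    return t, df, t_two_sided_p(t, df)


# The SUBTLEX vs. Celex example from the text
t, df, p = hotelling_williams(0.75, 0.69, 0.84, 240)
print(f"t = {t:.4f}, df = {df}, p = {p:.4f}")
```

With the example values from the text, this should reproduce the t, df, and p reported above up to rounding.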

**The Vuong-test and Clarke-test for non-nested models**

The Hotelling-Williams test is fine as long as you are dealing with simple correlations. This is a limitation in frequency research because the relationship between word processing times and log frequency is not linear, but levels off at high word frequencies. We capture this aspect by running nonlinear regression analyses (either with polynomials or restricted cubic splines). Then, we have R²-values rather than r-values. For instance, for the above data we would have something like R² = .59 for the SUBTLEX log frequencies, and R² = .51 for the Celex log frequencies (i.e., a few percent above the squared values of the linear correlations). Is the difference between these two values still significant?

The test usually recommended here is the Vuong test (Vuong, 1989). It is based on a comparison of the log-likelihoods of the two models. The calculations are rather complicated, but the test is available in several R-packages, such as games, pscl, spatcounts, or ZIGP (be careful, some require the models to be estimated with the glm-function, others with the lm-function). Clarke (2007) reviewed the Vuong test and found it to be conservative for small N. That is, the test is less likely to yield statistical significance than is warranted. Clarke (2007) proposed an alternative nonparametric test that is claimed not to be conservative.
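To give an idea of what the two tests compute, here is a minimal Python sketch for the simplest case: two linear regressions with one predictor each and Gaussian errors. It is an illustration under stated assumptions, not a reimplementation of any of the R-packages. Both models have the same number of parameters, so the correction for model complexity is left out, and the function names and the simulated data are our own.

```python
import math
import random


def pointwise_loglik(y, x):
    """Per-observation Gaussian log-likelihoods of the OLS fit y = a + b*x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / n  # maximum-likelihood variance
    return [-0.5 * (math.log(2 * math.pi * sigma2) + e * e / sigma2)
            for e in resid]


def vuong_and_clarke(y, x1, x2):
    """Compare two non-nested models that have equal numbers of parameters."""
    n = len(y)
    ll1, ll2 = pointwise_loglik(y, x1), pointwise_loglik(y, x2)
    m = [l1 - l2 for l1, l2 in zip(ll1, ll2)]  # pointwise log-likelihood ratios
    mean_m = sum(m) / n
    sd_m = math.sqrt(sum((mi - mean_m) ** 2 for mi in m) / n)
    z = math.sqrt(n) * mean_m / sd_m            # Vuong statistic
    p_vuong = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    # Clarke: sign test on the pointwise differences (median 0 under H0)
    b = sum(mi > 0 for mi in m)
    tail = sum(math.comb(n, k) for k in range(min(b, n - b) + 1)) / 2**n
    p_clarke = min(1.0, 2 * tail)
    return z, p_vuong, b, p_clarke


# Hypothetical data: y depends on x1; x2 is a noisier correlate of x1
rng = random.Random(1)
n = 240
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.84 * v + math.sqrt(1 - 0.84**2) * rng.gauss(0, 1) for v in x1]
y = [0.75 * v + rng.gauss(0, 0.6) for v in x1]

z, p_v, b, p_c = vuong_and_clarke(y, x1, x2)
print(f"Vuong z = {z:.3f}, p = {p_v:.4f}")
print(f"Clarke: model 1 better for {b} of {n} observations, p = {p_c:.4f}")
```

A positive z favours the first model. The Vuong test looks at the mean of the pointwise log-likelihood differences, whereas the Clarke test only looks at their signs.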

To test the usefulness of the Vuong and Clarke tests for word frequency research, we ran Monte Carlo simulations of likely scenarios. Each simulation was based on 10K datasets. Per dataset we generated normally distributed variables X, Y, and Z that had the following theoretical intercorrelations (these were the same between all three variables): .0, .2, .4, or .6. We additionally varied the number of data triplets: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, or 5120. For each set, we calculated the obtained intercorrelations between the variables and tested whether the correlation between X and Y was significantly different from the correlation between X and Z according to the Hotelling-Williams test, the Vuong test, and the Clarke test. For the sake of simplicity, we only present the percentage of tests for which p < .05 and p < .10.
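The logic of such a simulation can be sketched as follows for the Hotelling-Williams test. This is not the code we used, only a minimal standard-library Python illustration. It assumes that equally intercorrelated normal variables are built from one shared factor, it uses 2,000 rather than 10K datasets to keep the run short, and all function names are our own.

```python
import math
import random


def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def williams_t(r12, r13, r23, n):
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * (n - 1) / (n - 3) * det + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)


def t_two_sided_p(t, df, steps=200):
    """Two-sided p-value of Student's t via Simpson integration (steps even)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    pdf = lambda x: math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2 * total * h / 3)


def rejection_rate(rho, n, n_datasets, alpha, seed=0):
    """Share of datasets in which r(X,Y) and r(X,Z) differ at level alpha."""
    rng = random.Random(seed)
    shared, unique = math.sqrt(rho), math.sqrt(1 - rho)
    hits = 0
    for _ in range(n_datasets):
        x, y, z = [], [], []
        for _ in range(n):
            f = rng.gauss(0, 1)  # shared factor: every pair correlates rho
            x.append(shared * f + unique * rng.gauss(0, 1))
            y.append(shared * f + unique * rng.gauss(0, 1))
            z.append(shared * f + unique * rng.gauss(0, 1))
        t = williams_t(pearson(x, y), pearson(x, z), pearson(y, z), n)
        hits += t_two_sided_p(t, n - 3) < alpha
    return hits / n_datasets


for rho in (0.0, 0.4):
    rate = rejection_rate(rho, n=40, n_datasets=2000, alpha=0.05)
    print(f"rho = {rho}: {100 * rate:.1f}% of tests with p < .05")
```

Because X, Y, and Z are generated in the same way, every significant result is a false alarm, so a well-behaved test should return a percentage close to the nominal alpha level.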

If the test works well, we expect 5% of the tests to be significant at the .05 level and 10% of the tests to be significant at the .10 level (given that both correlations were generated with the same algorithm and, hence, were assumed to be equivalent at the population level). This was exactly what we obtained with the Hotelling-Williams test, as you can see here. In line with Clarke’s observations, the Vuong test was conservative. Surprisingly, this was not the case for the smallest sample size (N = 10), nor when the variables were intercorrelated with each other (as is the case for frequency measures). The Vuong test was particularly conservative when the theoretical correlations between X, Y, and Z were 0. For correlations of .4 and .6, the Vuong test was no longer conservative.

In contrast, Clarke’s test was way too liberal, in particular for large sample sizes and intercorrelated variables. In the worst cases, it returned more than 50% significance for a situation in which no differences in correlations were expected. Hence, there is not much you can conclude from a significant Clarke test for the question we are addressing (unless you want to impress reviewers and editors without statistical sophistication who insist on seeing “reassuring” p-values).

Thus far we have only used the Vuong and Clarke tests for situations in which the better Hotelling-Williams test applies as well. As indicated above, we need the Vuong or Clarke test more for situations in which more complicated models are compared to each other. Therefore, we checked how well these tests would perform when instead of linear regression we used restricted cubic splines with 3 knots (which allow you to capture the floor effect at high word frequencies). For comparison purposes we also calculated the Hotelling-Williams test on the correlations. The results were reassuring: The introduction of nonlinear regression did not lead to an unwarranted increase in significant tests, as you can see here.
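For readers unfamiliar with restricted cubic splines: with 3 knots, the regression contains the log frequency itself plus one nonlinear term. The sketch below shows that term in its standard truncated-power form, written in Python. The knot positions in the example are hypothetical, the function names are our own, and some software rescales the nonlinear term, so coefficients may differ from package output even when the fitted curve is the same.

```python
def rcs3_nonlinear_term(x, knots):
    """Nonlinear basis term of a restricted cubic spline with 3 knots.

    The term is 0 below the first knot and linear above the last knot,
    which is what makes the spline 'restricted'.
    """
    t1, t2, t3 = knots

    def cube_plus(u):
        # truncated cubic: u**3 for u > 0, otherwise 0
        return u**3 if u > 0 else 0.0

    return (
        cube_plus(x - t1)
        - cube_plus(x - t2) * (t3 - t1) / (t3 - t2)
        + cube_plus(x - t3) * (t2 - t1) / (t3 - t2)
    )


def rcs3_design_row(x, knots):
    """[intercept, linear term, nonlinear term] for one log-frequency value."""
    return [1.0, x, rcs3_nonlinear_term(x, knots)]


# Hypothetical knots on a log-frequency scale
knots = (1.0, 3.0, 5.0)
for log_freq in (0.5, 2.0, 4.0, 6.0):
    print(log_freq, rcs3_design_row(log_freq, knots))
```

Regressing the processing times on the two non-constant columns gives a curve that can bend between the knots but stays linear in the tails, so a model for SUBTLEX and a model for Celex built this way still have the same number of parameters.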

All in all, the Hotelling-Williams test is the best test for comparing dependent correlations. The Vuong test is a good alternative, unless there is very little correlation between the variables. The Clarke test is less useful for our purposes, because it will often return significance when this is not warranted.

Clarke, K.A. (2007). A simple distribution-free test for nonnested model selection. *Political Analysis, 15*, 347-363.

Steiger, J.H. (1980). Tests for comparing elements of a correlation matrix. *Psychological Bulletin, 87*, 245-251.

Vuong, Q.H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. *Econometrica, 57*, 307-333.