# Reliability {#reliability}
## Classical Test Theory {#ctt}
### Overview {#cttOverview}
To understand the concept of reliability, it is helpful to understand *classical test theory* (CTT), which is also known as "true score theory."\index{classical test theory}\index{true score}
Classical test theory is one of various measurement theories, in addition to [item response theory](#irt) and [generalizability theory](#gTheory).\index{classical test theory}\index{item response theory}\index{generalizability theory}
CTT has been the predominant measurement theory through the history of psychology.\index{classical test theory}
CTT is a theory of how test scores relate to a construct.\index{classical test theory}
A [*construct*](#constructs) is the concept or characteristic that a measure is intended to assess.\index{construct}
Assume you take a measurement of the same object multiple times (i.e., repeated measures).
For example, you assess the mass of the same rock multiple times.
However, you obtain different estimates of the rock's mass each time.
There are multiple possible explanations for this observation.
One possible explanation could be that the rock is changing in its mass, which would be consistent with the idea proposed by the Greek philosopher Heraclitus that nothing is stable and the world is in flux.
An alternative possibility, however, is that the rock is stable in its mass but the measurements are jittery—that is, they include error.\index{measurement error}
Based on the possibility that the rock is stable and that differences in scores across time reflect measurement error, CTT proposes the true score formula in Equation \@ref(eq:ctt):\index{classical test theory}\index{true score}\index{measurement error}\index{true score!formula}
$$
\begin{aligned}
X &= T + e \\
\text{observed score} &= \text{true score} + \text{measurement error}
\end{aligned}
(\#eq:ctt)
$$
$X$ is the observed measurement or score, $T$ is the classical (or psychometric) "true score," and $e$ is the measurement error (i.e., error score).\index{classical test theory}\index{observed score}\index{true score}\index{measurement error}
This formula is depicted visually in the form of a path diagram in Figure \@ref(fig:classicalTestTheory).\index{classical test theory}\index{true score!formula}\index{true score!formula!path diagram}\index{path analysis}
```{r classicalTestTheory, out.width = "10%", fig.align = "center", fig.cap = "Classical Test Theory Formula in a Path Diagram.", echo = FALSE}
knitr::include_graphics("./Images/classicalTestTheory.png")
```
It is important to distinguish between the classical true score and the Platonic true score.\index{classical test theory}\index{true score}
The *Platonic true score* is the truth, and it does not depend on measurement.\index{true score}
The Platonic true score is an abstract notion because it is not directly observable and is based on Platonic ideals and theories of the construct and what a person's "true" level is on the construct.\index{true score}
In CTT, we attempt to approximate the Platonic true score with the classical true score, ($T$).\index{classical test theory}\index{true score}
If we took infinite repeated observations (and the measurements had no carryover effect), the average score approaches the classical true score, $T$.\index{classical test theory}\index{true score}
That is, $\overline{X} \rightarrow T$ as the number of observations $\rightarrow \infty$.\index{classical test theory}\index{true score}
CTT attempts to partition variance into different sources.\index{classical test theory}
*Variance* is an index of scores' variability, i.e., the degree to which scores differ.\index{variance}
Variance is defined as the average squared deviation from the mean, as in Equation \@ref(eq:variance):\index{variance}
\begin{equation}
\sigma^2_X = E[(X - \mu)^2]
(\#eq:variance)
\end{equation}
According to CTT, any observed measurement includes both true score (construct) variance ($\sigma^2_T$) and error variance ($\sigma^2_e$).\index{classical test theory}\index{observed score}\index{true score}\index{measurement error}
Given the true score formula ($X = T + e$), this means that their variance is as follows (see Equation \@ref(eq:observedScoreVariance)):\index{classical test theory}\index{true score}\index{true score!formula}
$$
\begin{aligned}
\sigma^2_X &= \sigma^2_T + \sigma^2_e \\
\text{observed score variance} &= \text{true score variance} + \text{error variance}
\end{aligned}
(\#eq:observedScoreVariance)
$$
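To make this variance decomposition concrete, below is a small simulation sketch (with arbitrary, hypothetical values for the true score and error standard deviations) in which observed scores are generated as the sum of true scores and random error; the observed score variance approximately equals the sum of the true score variance and the error variance.

```{r}
set.seed(52242)
nSim <- 10000

trueScoreSim <- rnorm(nSim, mean = 50, sd = 10) # T
errorSim <- rnorm(nSim, mean = 0, sd = 5)       # e (random error)
observedScoreSim <- trueScoreSim + errorSim     # X = T + e

var(observedScoreSim)                           # observed score variance
var(trueScoreSim) + var(errorSim)               # true score variance + error variance
```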
Nevertheless, the *classical true score*, $T$, is the expected value and is not necessarily the same thing as the Platonic true score [@Borsboom2003; @Klein1969] because the expected value would need to be entirely valid/accurate (i.e., it would need to be the construct score) for it to be the Platonic true score.\index{classical test theory}\index{true score}
The expected score could also be influenced by [systematic sources of error](#systematicError) such as other constructs, which would not fall into the error portion of the CTT formula because, as described below, CTT assumes that error is [random](#randomError) (not [systematic](#systematicError)).\index{classical test theory}\index{measurement error}\index{measurement error!systematic error}\index{measurement error!random error}
The distinctions between construct score, (classical) true score, and observed score, in addition to [validity](#validity), reliability, [systematic error](#systematicError), and [random error](#randomError) are depicted in Figure \@ref(fig:reliabilityValidityError).\index{classical test theory}\index{measurement error}\index{measurement error!systematic error}\index{measurement error!random error}
```{r reliabilityValidityError, out.width = "100%", fig.align = "center", fig.cap = "Distinctions Between Construct Score, True Score, and Observed Score, in Addition to Reliability, Validity, Systematic Error, and Random Error. (Adapted from W. Joel Schneider.)", fig.scap = "Distinctions Between Construct Score, True Score, and Observed Score, in Addition to Reliability, Validity, Systematic Error, and Random Error.", echo = FALSE}
knitr::include_graphics("./Images/ReliabilityvsValidity-03.png")
```
The true score formula is theoretically useful, but it is not practically useful because it is an under-identified equation and we do not know the values of $T$ or $e$ based on knowing the observed score ($X$).\index{classical test theory}\index{observed score}\index{true score}\index{measurement error}
For instance, if we obtain an observed score of 10, our formula is $10 = T + e$, and we do not know what the true score or error is.\index{classical test theory}\index{observed score}\index{true score}\index{measurement error}
As a result, CTT makes several simplifying assumptions so we can estimate how stable (reliable) or noisy the measure is and what proportion of the observed score reflects true score versus measurement error.\index{classical test theory}\index{observed score}\index{true score}\index{measurement error}\index{reliability}
1. $E(e) = 0$
The first assumption of CTT is that the expected value of the error (i.e., error scores) is zero.\index{classical test theory}\index{measurement error}
Basically, the error component of the observed scores is expected to be random with a mean of zero.\index{classical test theory}\index{observed score}\index{measurement error}
The likelihood that the observed score is an overestimate of $T$ is assumed to be the same as the likelihood that the observed score is an underestimate of $T$.\index{classical test theory}\index{observed score}\index{measurement error}
In other words, the distribution of error scores above $T$ is assumed to be the same as the distribution of error scores below $T$.\index{classical test theory}\index{true score}\index{measurement error}
In reality, this assumption is likely false in many situations.
For instance, social desirability bias is a [systematic error](#systematicError) where people rate themselves as better than they actually are; thus, social desirability results in biasing scores in a particular direction across respondents, so such an error would not be entirely random.\index{classical test theory}\index{measurement error}\index{measurement error!systematic error}\index{measurement error!random error}
But using the assumption that the expected value of $e$ is zero also informs that the expected value of the observed score ($X$) equals the expected value of the true score ($T$), as in Equation \@ref(eq:expectedValueObservedScore):\index{classical test theory}\index{observed score}\index{true score}\index{measurement error}
$$
\begin{aligned}
E(X) &= E(T + e) \\
&= E(T) + E(e) \\
&= E(T) + 0 \\
&= E(T)
\end{aligned}
(\#eq:expectedValueObservedScore)
$$
2. $r_{T,e} = 0$
The second assumption of CTT is that the correlation between $T$ and $e$ is zero—that is, people's true scores are uncorrelated with the error around their measurement (i.e., people's error scores).\index{classical test theory}\index{true score}\index{measurement error}
However, this assumption is likely false in many situations.
For instance, one can imagine that, on a paper-and-pencil intelligence test, scores may have greater error for respondents with lower true scores and may have less error for respondents with higher true scores.\index{classical test theory}\index{observed score}\index{true score}\index{measurement error}
3. $r_{e_1, e_2} = 0$
The third assumption of CTT is that the error is uncorrelated across time—that is, people's error scores at time 1 ($e_1$) are not associated with their error scores at time 2 ($e_2$).\index{classical test theory}\index{measurement error}
However, this assumption is also likely false in many situations.
For example, if some people have a high social desirability bias at time 1, they are likely to also have a high social desirability bias at time 2.
That is, the error around measurements of participants is likely to be related across time.\index{classical test theory}\index{measurement error}
These three assumptions are implicit in the path analytic diagram in Figure \@ref(fig:reliabilityPathDiagram1), which depicts the CTT approach to understanding reliability of a measure across two time points.\index{classical test theory}\index{reliability!path diagram}\index{path analysis}
```{r reliabilityPathDiagram1, out.width = "30%", fig.align = "center", fig.cap = "Reliability of a Measure Across Two Time Points, as Depicted in a Path Diagram.", echo = FALSE}
knitr::include_graphics("./Images/Reliability_1.png")
```
In [path analytic](#pathAnalysis-sem) (and [structural equation modeling](#sem)) language, rectangles represent variables we observe, and circles represent latent (i.e., unobserved) variables.\index{path analysis}\index{structural equation modeling}
The observed scores at time 1 ($X_1$) and time 2 ($X_2$) are entities we observe, so they are represented by rectangles.\index{observed score}
We do not directly observe the true scores ($T_1, T_2$) and error scores ($e_1, e_2$), so they are considered latent entities and are represented by circles.\index{true score}\index{measurement error}
Single-headed arrows indicate regression paths, where conceptually, one variable is thought to influence another variable.
As the model depicts, the observed scores are thought to be influenced both by true scores and by error scores.\index{observed score}
We also expect the true scores at time 1 ($T_1$) and time 2 ($T_2$) to be correlated, so we have a covariance path, as indicated by a double-headed arrow.\index{true score}
A *covariance* is an unstandardized index of the strength of association between two variables.\index{covariance}
Because a covariance is unstandardized, its scale depends on the scale of the variables.\index{covariance}
The covariance between two variables is the average product of their deviations from their respective means, as in Equation \@ref(eq:covariance):\index{covariance}
\begin{equation}
\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]
(\#eq:covariance)
\end{equation}
The covariance of a variable with itself is equivalent to its variance, as in Equation \@ref(eq:covarianceVariance):\index{covariance}\index{variance}
$$
\begin{aligned}
\sigma_{XX} &= E[(X - \mu_X)(X - \mu_X)] \\
&= E[(X - \mu_X)^2] \\
&= \sigma^2_X
\end{aligned}
(\#eq:covarianceVariance)
$$
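As a quick numerical check of these definitions (using arbitrary simulated variables), the sketch below computes the average product of deviations by hand, compares it to R's `cov()` (which uses the sample, $n - 1$, denominator), and verifies that the covariance of a variable with itself equals its variance.

```{r}
set.seed(52242)
xSim <- rnorm(1000)
ySim <- 0.5 * xSim + rnorm(1000)

mean((xSim - mean(xSim)) * (ySim - mean(ySim))) # average product of deviations (population formula)
cov(xSim, ySim)                                 # sample covariance (n - 1 denominator)
cov(xSim, xSim)                                 # covariance of a variable with itself...
var(xSim)                                       # ...equals its variance
```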
By contrast, a *correlation* is a standardized index of the strength of association between two variables.\index{correlation}
Because a correlation is standardized (fixed between [−1,1]), its scale does not depend on the scales of the variables.\index{correlation}
In this figure, no other parameters (regressions, covariances, or means) are specified, so the following are implicit in the diagram:\index{path analysis}
- $E(e) = 0$
- $r_{T,e} = 0$
- $r_{e_1, e_2} = 0$
The factor loadings reflect the magnitude that the latent factor influences the observed variable.\index{structural equation modeling!factor loading}
In this case, the true scores influence the observed scores with a magnitude of $\sqrt{r_{xx}}$, which is known as the *index of reliability*.\index{classical test theory}\index{true score}\index{observed score}\index{reliability!index of reliability}
The index of reliability is the theoretical estimate of the correlation between the true scores and the observed scores.\index{reliability!index of reliability}\index{true score}\index{observed score}
This is depicted in Figure \@ref(fig:reliabilityPathDiagram2).\index{reliability!path diagram}
```{r reliabilityPathDiagram2, out.width = "30%", fig.align = "center", fig.cap = "Reliability of a Measure Across Two Time Points, as Depicted in a Path Diagram; Includes the Index of Reliability.", echo = FALSE}
knitr::include_graphics("./Images/Reliability_2.png")
```
We can use path tracing rules to estimate the reliability of the measure, where the reliability of the measure, i.e., the *coefficient of reliability* ($r_{xx}$), is estimated as the correlation between the observed score at time 1 ($x_1$) and the observed score at time 2 ($x_2$).\index{path analysis!path tracing rules}\index{reliability!coefficient of reliability}\index{observed score}
According to path tracing rules [@Pearl2013], the correlation between $x_1$ and $x_2$ is equal to the sum of the standardized coefficients of all the routes through which $x_1$ and $x_2$ are connected.\index{path analysis!path tracing rules}
The contribution of a given route to the correlation between $x_1$ and $x_2$ is equal to the product of all standardized coefficients on that route that link $x_1$ and $x_2$ that move in the following directions: (a) forward (e.g., $T_1$ to $x_1$) or (b) backward once and then forward (e.g., $x_1$ to $T_1$ to $T_2$ to $x_2$).\index{path analysis!path tracing rules}
Path tracing does not allow moving forward and then backward—that is, it does not allow retracing (e.g., $e$ to $x_1$ to $T_1$) in the same route.\index{path analysis!path tracing rules}
It also does not allow passing more than one curved arrow (covariance path) or through the same variable twice in the same route.\index{path analysis!path tracing rules}
Once you know the contribution of each route to the correlation, you can calculate the total correlation between the two variables as the sum of the contribution of each route.\index{path analysis!path tracing rules}
Therefore, using one route, we can calculate the association between $x_1$ and $x_2$ as in Equation \@ref(eq:associationBetweenTwoVars):\index{path analysis!path tracing rules}
$$
\begin{aligned}
r_{x_1,x_2} &= \sqrt{r_{xx}} \times r_{T_1,T_2} \times \sqrt{r_{xx}} \\
&= r_{T_1,T_2} \times r_{xx} \\
&= \text{correlation of true scores across time} \times \text{reliability}
\end{aligned}
(\#eq:associationBetweenTwoVars)
$$
When dealing with a stable construct, we would assume that the correlation between true scores across time is 1.0: $r_{T_1,T_2} = 1.0$, as depicted in Figure \@ref(fig:reliabilityPathDiagram3).\index{true score}\index{reliability!path diagram}\index{construct!stability}
```{r reliabilityPathDiagram3, out.width = "30%", fig.align = "center", fig.cap = "Reliability of a Measure of a Stable Construct Across Two Time Points, as Depicted in a Path Diagram.", echo = FALSE}
knitr::include_graphics("./Images/Reliability_3.png")
```
Then, to calculate the association between $x_1$ and $x_2$ of a stable construct, we can use path tracing rules as in Equation \@ref(eq:reliabilityCoefficient):\index{path analysis!path tracing rules}\index{reliability!coefficient of reliability}
$$
\begin{aligned}
r_{x_1,x_2} &= \sqrt{r_{xx}} \times r_{T_1,T_2} \times \sqrt{r_{xx}} \\
&= \sqrt{r_{xx}} \times 1 \times \sqrt{r_{xx}} \\
&= r_{xx} \\
&= \text{coefficient of reliability}
\end{aligned}
(\#eq:reliabilityCoefficient)
$$
That is, for a stable construct (i.e., whose true scores are perfectly correlated across time; $r_{T_1,T_2} = 1.0$), we estimate reliability as the correlation between the observed scores at time 1 ($x_1$) and the observed scores at time 2 ($x_2$).\index{classical test theory}\index{reliability!coefficient of reliability}\index{observed score}\index{construct!stability}
This is known as [*test–retest reliability*](#testRetest-reliability).\index{reliability!test–retest}
We therefore assume that the extent to which the correlation between $x_1$ and $x_2$ is less than one reflects measurement error (an unstable measure), rather than people's changes in their true score on the construct (an unstable construct).\index{classical test theory}\index{reliability!coefficient of reliability}\index{measurement error}\index{true score}\index{construct!stability}
As described above, the reliability coefficient ($r_{xx}$) is the association between a measure and itself over time or with another measure in the domain.\index{classical test theory}\index{reliability!coefficient of reliability}
By contrast, the *reliability index* ($r_{xT}$) is the correlation between observed scores on a measure and the true scores [@Nunnally1994].\index{classical test theory}\index{reliability!index of reliability}
The reliability index is the square root of the reliability coefficient, as in Equation \@ref(eq:reliabilityIndex).\index{classical test theory}\index{reliability!index of reliability}\index{reliability!coefficient of reliability}
$$
\begin{aligned}
r_{xT} &= \sqrt{r_{xx}} \\
&= \text{index of reliability}
\end{aligned}
(\#eq:reliabilityIndex)
$$
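The simulation sketch below (using arbitrary, hypothetical values) illustrates these two quantities for a stable construct: the correlation between the observed scores at the two timepoints approximates the coefficient of reliability ($r_{xx}$), and the correlation between the observed scores and the true scores approximates the index of reliability ($\sqrt{r_{xx}}$).

```{r}
set.seed(52242)
nSim <- 10000

trueScoreSim <- rnorm(nSim, mean = 50, sd = 10)       # stable construct: same true score at both timepoints
x1Sim <- trueScoreSim + rnorm(nSim, mean = 0, sd = 5) # observed scores at time 1
x2Sim <- trueScoreSim + rnorm(nSim, mean = 0, sd = 5) # observed scores at time 2

cor(x1Sim, x2Sim)        # coefficient of reliability (r_xx): approx. 100/(100 + 25) = .80
cor(x1Sim, trueScoreSim) # index of reliability: approx. sqrt(.80) = .89
sqrt(cor(x1Sim, x2Sim))  # square root of the coefficient approximates the index
```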
### Four CTT Measurement Models {#cttMeasurementModels}
There are four primary measurement models in CTT [@Graham2006a]:\index{classical test theory!measurement models}
1. [parallel](#cttParallel)\index{classical test theory!measurement models!parallel}
1. [tau-equivalent](#cttTauEquivalent)\index{classical test theory!measurement models!tau equivalent}
1. [essentially tau-equivalent](#cttEssentiallyTauEquivalent)\index{classical test theory!measurement models!essentially tau equivalent}
1. [congeneric](#cttCongeneric)\index{classical test theory!measurement models!congeneric}
#### Parallel {#cttParallel}
The parallel measurement model is the most stringent measurement model for use in estimating reliability.\index{classical test theory!measurement models!parallel}
In CTT, a measure is considered parallel if the true scores and error scores are equal across items.\index{classical test theory!measurement models!parallel}
That is, the items must be unidimensional and assess the same construct, on the same scale, with the same degree of precision, and with the same amount of error [@Graham2006a].\index{unidimensional}\index{reliability!precision}\index{measurement error}
Items are expected to have the same strength of association (i.e., factor loading or discrimination) with the construct.\index{structural equation modeling!factor loading}\index{item response theory!item discrimination}
#### Tau-Equivalent {#cttTauEquivalent}
The tau ($\tau$)-equivalent measurement model is the same as the parallel measurement model, except error scores are allowed to differ across items.\index{classical test theory!measurement models!tau equivalent}\index{classical test theory!measurement models!parallel}
That is, a measure is tau-equivalent if the items are unidimensional and assess the same construct, on the same scale, with the same degree of precision, but with possibly different amounts of error [@Graham2006a].\index{classical test theory!measurement models!tau equivalent}\index{unidimensional}\index{reliability!precision}\index{measurement error}
In other words, true scores are equal across items but each item is allowed to have unique error scores.\index{true score}\index{measurement error}
Items are expected to have the same strength of association with the construct.
Variance that is unique to a specific item is assumed to be error variance.\index{measurement error}
#### Essentially Tau-Equivalent {#cttEssentiallyTauEquivalent}
The essentially tau ($\tau$)-equivalent model is the same as the tau-equivalent measurement model, except items are allowed to differ in their precision.\index{classical test theory!measurement models!essentially tau equivalent}\index{classical test theory!measurement models!tau equivalent}
That is, a measure is essentially tau-equivalent if the items are unidimensional and assess the same construct, on the same scale, but with possibly different degrees of precision, and with possibly different amounts of error [@Graham2006a].\index{classical test theory!measurement models!essentially tau equivalent}\index{unidimensional}\index{reliability!precision}\index{measurement error}
The essentially tau-equivalent model allows item true scores to differ by a constant that is unique to each pair of variables.\index{classical test theory!measurement models!essentially tau equivalent}\index{true score}
The magnitude of the constant reflects the degree of imprecision; it influences an item's mean but not its variance or its covariances with other items.\index{reliability!precision}
Items are expected to have the same strength of association with the construct.
#### Congeneric {#cttCongeneric}
The congeneric measurement model is the least restrictive measurement model.\index{classical test theory!measurement models!congeneric}
The congeneric measurement model is the same as the essentially tau ($\tau$)-equivalent model, except items are allowed to differ in their scale.\index{classical test theory!measurement models!congeneric}\index{classical test theory!measurement models!essentially tau equivalent}
That is, a measure is congeneric if the items are unidimensional and assess the same construct, but possibly on a different scale, with possibly different degrees of precision, and with possibly different amounts of error [@Graham2006a].\index{classical test theory!measurement models!congeneric}\index{scale}\index{reliability!precision}\index{measurement error}
Items are not expected to have the same strength of association with the construct.
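These measurement models can be expressed as increasingly constrained confirmatory factor analysis models. Below is a minimal `lavaan` sketch using `lavaan`'s built-in `HolzingerSwineford1939` example data (not data from this chapter) that compares a congeneric model (free loadings), a tau-equivalent-style model (equal loadings), and a parallel-style model (equal loadings and equal error variances); a fuller treatment would also attend to item intercepts/means.

```{r}
library("lavaan")

# Congeneric model: loadings (and error variances) are free to differ across items
congenericModel <- "visual =~ x1 + x2 + x3"

# Tau-equivalent-style model: loadings constrained to be equal across items
tauEquivalentModel <- "visual =~ 1*x1 + 1*x2 + 1*x3"

# Parallel-style model: equal loadings and equal error variances
parallelModel <- "
  visual =~ 1*x1 + 1*x2 + 1*x3
  x1 ~~ theta*x1
  x2 ~~ theta*x2
  x3 ~~ theta*x3
"

congenericFit <- cfa(congenericModel, data = HolzingerSwineford1939)
tauEquivalentFit <- cfa(tauEquivalentModel, data = HolzingerSwineford1939)
parallelFit <- cfa(parallelModel, data = HolzingerSwineford1939)

# Likelihood ratio tests of the nested models; a significant difference suggests
# that the stricter model's equality constraints do not hold for these items
anova(parallelFit, tauEquivalentFit, congenericFit)
```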
## Measurement Error {#measurementError}
Measurement error is the difference between the measured (observed) value and the true value.\index{measurement error}\index{observed score}\index{true score}
All measurements come with uncertainty and measurement error.\index{measurement error}
Even a measure of something as simple as whether someone is dead has error.\index{measurement error}
There are two main types of measurement error: [systematic](#systematicError) (nonrandom) error and [unsystematic](#randomError) (random) error.\index{measurement error}\index{measurement error!systematic error}\index{measurement error!random error}
In addition, measurement error can be [within-person](#withinPersonError), [between-person](#betweenPersonError), or both.\index{measurement error}\index{measurement error!within person}\index{measurement error!between person}
### Systematic (Nonrandom) Error {#systematicError}
An example of *systematic error* is depicted in Figure \@ref(fig:systematicError).\index{measurement error!systematic error}
Systematic error is error that has a consistent influence on scores for a person or across the sample.\index{measurement error!systematic error}
An error is systematic if the error always occurs, with the same value, when using the measure in the same way and in the same case.\index{measurement error!systematic error}
An example of a systematic error is a measure that consistently assesses constructs other than the construct the measure was designed to assess.\index{measurement error!systematic error}
For instance, if a test written in English to assess math skills is administered in a nonnative English-speaking country, some portion of the scores will reflect variance attributable to English reading skills rather than the construct of interest (math skills).
Other examples of systematic error include response styles or subjective, idiosyncratic judgments by a rater—for instance, if the rater's judgments are systematically harsh or lenient.\index{measurement error!systematic error}
A systematic error affects the average score (i.e., it introduces bias), which makes both the group-level estimates and the measurements for an individual less accurate.\index{measurement error!systematic error}
As depicted in Figure \@ref(fig:systematicError), systematic error does not affect the variability of the scores but it does affect the mean of the scores, so the person-level mean and group-level mean are less accurate.\index{measurement error!systematic error}
In other words, a systematic error leads to a biased estimate of the average.\index{measurement error!systematic error}\index{bias}
However, multiple systematic errors may simultaneously coexist and can operate in the same direction (exacerbating the effects of bias) or in opposite directions (hiding the extent of bias).\index{measurement error!systematic error}\index{bias}
```{r systematicError, echo = FALSE, results = "hide", out.width = "100%", fig.align = "center", fig.cap = "Systematic Error."}
library("tidyverse")
library("viridis")
sampleSize <- 1000
set.seed(52242)
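# Simulate a raw measure; then create versions with systematic error (a constant shift)
# and random error (zero-mean noise) added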
rawMeasure <- data.frame(score = rnorm(sampleSize), measure = "rawMeasure")
addSystematicError <- data.frame(score = rawMeasure$score + 1, measure = "addSystematicError")
addRandomError <- data.frame(score = rawMeasure$score + rnorm(sampleSize), measure = "addRandomError")
errorList <- list(rawMeasure, addSystematicError, addRandomError)
errorData <- bind_rows(errorList)
ggplot(errorData %>% filter(measure != "addRandomError"), aes(x = score, fill = measure)) +
geom_density(alpha = 0.25) +
scale_fill_viridis_d(name = "",
breaks = c("rawMeasure", "addSystematicError"),
labels = c("Raw Measure", "Measure with Systematic Error Added")) +
theme_bw() +
theme(legend.position = "top")
```
### Unsystematic (Random) Error {#randomError}
An example of *unsystematic (random) error* is depicted in Figure \@ref(fig:unsystematicError).\index{measurement error!random error}
Random error occurs due to chance.\index{measurement error!random error}
For instance, a random error could arise from a participant being fatigued on a particular testing day or from a participant getting lucky in guessing the correct answer.\index{measurement error!random error}
Random error does not have consistent effects for a person or across the sample, and it may vary from one observation to another.\index{measurement error!random error}
Random error does not (systematically) affect the average, i.e., the group-level estimate—random error affects only the variability around the average (noise).\index{measurement error!random error}
However, random error makes measurements for an individual less accurate.\index{measurement error!random error}
A large number of observations of the same construct cancels out random error but does not cancel out systematic error.\index{measurement error!random error}\index{measurement error!systematic error}
As depicted in Figure \@ref(fig:unsystematicError), random error does not affect the mean of the scores, but it does increase the variability of the scores.\index{measurement error!random error}
In other words, the group-level mean is still accurate, but individuals' scores are less precise.\index{measurement error!random error}\index{reliability!precision}
```{r unsystematicError, echo = FALSE, results = "hide", out.width = "100%", fig.align = "center", fig.cap = "Random Error."}
ggplot(errorData %>% filter(measure != "addSystematicError"), aes(x = score, fill = measure)) +
geom_density(alpha = 0.25) +
scale_fill_viridis_d(name = "",
breaks = c("rawMeasure", "addRandomError"),
labels = c("Raw Measure", "Measure with Random Error Added")) +
theme_bw() +
theme(legend.position = "top")
```
### Within-Person Error {#withinPersonError}
Consider two data columns, one column for participants' scores at time 1 and another column for participants' scores at time 2.\index{measurement error!within person}
Within-person error occurs within a particular person.\index{measurement error!within person}
Adding within-person error to such a data set would mean adding noise ($e$) within the given row (or rows) for the relevant participant(s).\index{measurement error!within person}
By contrast, adding between-person error would mean adding noise ($e$) across the rows, within a column, as described in the next section.\index{measurement error!between person}
### Between-Person Error {#betweenPersonError}
Between-person error occurs across the sample.\index{measurement error!between person}
You could add between-person random error to a variable by adding error across the rows, within a column.\index{measurement error!between person}
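As a toy sketch (with made-up numbers), consider a small data frame of time 1 and time 2 scores: within-person error is added within a row (a particular participant), whereas between-person error is added down a column (across participants).

```{r}
scoresExample <- data.frame(
  time1 = c(50, 52, 48, 55, 47),
  time2 = c(50, 52, 48, 55, 47))

# Within-person error: add error within the row(s) for particular participant(s)
# (here, hypothetical error added to participant 2's scores at both timepoints)
withinPersonExample <- scoresExample
withinPersonExample[2, ] <- withinPersonExample[2, ] + c(3, -2)

# Between-person error: add error across the rows, within a column
# (here, hypothetical random error added to all participants' time 2 scores)
set.seed(52242)
betweenPersonExample <- scoresExample
betweenPersonExample$time2 <- betweenPersonExample$time2 + rnorm(5, mean = 0, sd = 3)
```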
### Types of Measurement Error {#measurementErrorTypes}
There are four nonmutually exclusive types of measurement error: within-person random error, within-person systematic error, between-person random error, and between-person systematic error.\index{measurement error}\index{measurement error!systematic error}\index{measurement error!random error}\index{measurement error!within person}\index{measurement error!between person}
The four types of [measurement error](#measurementError) are depicted in Figure \@ref(fig:measurementErrorTypes), as adapted from @Willett2012.\index{measurement error}\index{measurement error!systematic error}\index{measurement error!random error}\index{measurement error!within person}\index{measurement error!between person}
```{r measurementErrorTypes, out.width = "100%", fig.align = "center", fig.cap = "Types of Measurement Error.", echo = FALSE}
knitr::include_graphics("./Images/MeasurementErrorTypes.png")
```
#### Within-Person Random Error {#withinPersonRandomError}
Adding within-person random error would involve adding random noise ($e$) to the given row (or rows) for the relevant participant(s).\index{measurement error!random error}\index{measurement error!within person}
This could reflect momentary fluctuations in the assessment for a specific person.\index{measurement error!random error}\index{measurement error!within person}
When adding within-person random error, the person's and group's measurements show no bias, i.e., there is no consistent increase or decrease in the scores from time 1 to time 2 (at least with a sample size large enough to cancel out the random error, according to the law of large numbers).\index{measurement error!random error}\index{measurement error!within person}\index{bias}
A person's average approximates their true score if many repeated measurements are taken.\index{measurement error!random error}\index{measurement error!within person}
A group's average approximates the sample's mean true score, especially when averaging the repeated measures across time.\index{measurement error!random error}\index{measurement error!within person}
The influence of within-person random error is depicted in Figure \@ref(fig:withinPersonRandomError).\index{measurement error!random error}\index{measurement error!within person}
```{r withinPersonRandomError, out.width = "100%", fig.align = "center", fig.cap = "Within-Person Random Error.", echo = FALSE}
knitr::include_graphics("./Images/Error_Within-PersonRandom.png")
```
#### Within-Person Systematic Error {#withinPersonSystematicError}
Adding within-person systematic error would involve adding systematic noise ($e$), i.e., the same value across columns (timepoints), to the given row (or rows) for the relevant participant(s).\index{measurement error!systematic error}\index{measurement error!within person}
These are within-person effects that are consistent across time.\index{measurement error!systematic error}\index{measurement error!within person}
For example, social desirability bias is high for some people and low for others.\index{measurement error!systematic error}\index{measurement error!within person}\index{bias!social desirability}
Another instance in which within-person systematic error could exist is when one or more people consistently misinterpret a particular question.\index{measurement error!systematic error}\index{measurement error!within person}
Within-person systematic error increases person-level bias because the person's mean shows a greater difference from their true score.\index{measurement error!systematic error}\index{measurement error!within person}\index{bias}\index{individual level}
The influence of within-person systematic error is depicted in Figure \@ref(fig:withinPersonSystematicError).\index{measurement error!systematic error}\index{measurement error!within person}
```{r withinPersonSystematicError, out.width = "100%", fig.align = "center", fig.cap = "Within-Person Systematic Error.", echo = FALSE}
knitr::include_graphics("./Images/Error_Within-PersonSystematic.png")
```
#### Between-Person Random Error {#betweenPersonRandomError}
Adding between-person random error at time 2 would involve adding random noise ($e$) across the rows, within the column.\index{measurement error!random error}\index{measurement error!between person}
Between-person random error would result in less accurate scores at the person level but would not result in bias at the group level.\index{measurement error!random error}\index{measurement error!between person}
At a given timepoint, it results in overestimates of the person's true score for some people and underestimates for other people (i.e., there is no consistent pattern across the sample).\index{measurement error!random error}\index{measurement error!between person}
Thus, the group average approximates the sample's mean true score (at least with a sample size large enough to cancel out the random error, according to the law of large numbers).\index{measurement error!random error}\index{measurement error!between person}
In addition, the average of repeated measurements of the person's score would approximate the person's true score.\index{measurement error!random error}\index{measurement error!between person}
However, the group's variance is inflated.\index{measurement error!random error}\index{measurement error!between person}
The influence of between-person random error is depicted in Figure \@ref(fig:betweenPersonRandomError).\index{measurement error!random error}\index{measurement error!between person}
```{r betweenPersonRandomError, out.width = "100%", fig.align = "center", fig.cap = "Between-Person Random Error.", echo = FALSE}
knitr::include_graphics("./Images/Error_Between-PersonRandom.png")
```
#### Between-Person Systematic Error {#betweenPersonSystematicError}
Adding between-person systematic error at time 2 would involve adding systematic noise ($e$) across the rows, within the column.\index{measurement error!systematic error}\index{measurement error!between person}
Between-person systematic error results from within-person error that tends to be in the same direction (negative or positive) across participants.\index{measurement error!systematic error}\index{measurement error!between person}
For instance, this could reflect an influence with a shared effect across subjects.\index{measurement error!systematic error}\index{measurement error!between person}
For example, social desirability leads to a positive group-level bias when people rate their socially desirable attributes.\index{measurement error!systematic error}\index{measurement error!between person}\index{bias!social desirability}
Another example would be when a research assistant enters values wrong at time 2 (e.g., adding 10 to all participants' scores).\index{measurement error!systematic error}\index{measurement error!between person}
Between-person systematic error increases bias because it results in a greater group mean difference from the group's mean true score.\index{measurement error!systematic error}\index{measurement error!between person}\index{bias}\index{group level}
The influence of between-person systematic error is depicted in Figure \@ref(fig:betweenPersonSystematicError).\index{measurement error!systematic error}\index{measurement error!between person}
```{r betweenPersonSystematicError, out.width = "100%", fig.align = "center", fig.cap = "Between-Person Systematic Error.", echo = FALSE}
knitr::include_graphics("./Images/Error_Between-PersonSystematic.png")
```
### Summary {#measurementErrorSummary}
In sum, all types of measurement error (whether [systematic](#systematicError) or [random](#randomError)) lead to less accurate scores for an individual.\index{measurement error}\index{measurement error!systematic error}\index{measurement error!random error}\index{validity}\index{individual level}
But different kinds of error have different implications.\index{measurement error}
[Systematic](#systematicError) and [random](#randomError) error have different effects on accuracy at the group level.\index{measurement error!systematic error}\index{measurement error!random error}\index{validity}\index{group level}
[Systematic error](#systematicError) leads to less accurate estimates at the group level, whereas [random error](#randomError) does not.\index{measurement error!systematic error}\index{measurement error!random error}\index{validity}\index{group level}
CTT assumes that all error is [random](#randomError).\index{classical test theory}\index{measurement error}\index{measurement error!random error}
According to CTT, as the number of measurements approaches infinity, the mean of the measurements gets closer to the true score, because the [random errors](#randomError) cancel each other out.\index{classical test theory}\index{true score}\index{measurement error!random error}
With more measurements, we reduce our uncertainty and increase our precision.\index{classical test theory}\index{reliability!precision}
According to CTT, if we take many measurements and the average of the measurements is 10, we have some confidence that the true score $(T) \approx 10$.\index{classical test theory}\index{true score}
In reality, however, error for a given measure likely includes both [systematic](#systematicError) and [random](#randomError) error.\index{measurement error}\index{measurement error!systematic error}\index{measurement error!random error}
## Overview of Reliability {#overview-reliability}
The "Standards for Educational and Psychological Testing" set the standard for educational and psychological assessment and are jointly published by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education.\index{Standards for Educational and Psychological Testing}
According to the "Standards" [@AERA2014, p. 33], reliability is the "consistency of the scores across instances of the testing procedure."\index{Standards for Educational and Psychological Testing}\index{reliability!overview}
In this book, we define reliability as how much repeatability, consistency, and precision the scores from a measure have.\index{reliability!overview}\index{reliability!repeatability}\index{reliability!consistency}\index{reliability!precision}
Reliability ($r_{xx}$) has been defined mathematically as the proportion of observed score variance ($\sigma^2_X$) that is attributable to true score variance ($\sigma^2_T$), as in Equation \@ref(eq:reliabilityRatio):\index{reliability!overview}\index{true score}\index{observed score}
$$
\begin{aligned}
r_{xx} &= \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_e} \\
&= \frac{\sigma^2_T}{\sigma^2_X} \\
&= \frac{\text{true score variance}}{\text{observed score variance}} \\
\end{aligned}
(\#eq:reliabilityRatio)
$$
An alternative formulation is that reliability ($r_{xx}$) is the lack of error variance or the degree to which observed scores are correlated with true scores or uncorrelated with error scores.\index{reliability!overview}\index{observed score}\index{measurement error}
In CTT, reliability can be conceptualized in four primary ways, as depicted in Figure \@ref(fig:conceptualizingReliability) [@Furr2017].\index{reliability!overview}\index{classical test theory}
```{r conceptualizingReliability, out.width = "100%", fig.align = "center", fig.cap = "Four Different Ways of Conceptualizing Reliability.", echo = FALSE}
knitr::include_graphics("./Images/conceptualizingReliability.png")
```
However, we cannot *calculate* reliability because we cannot measure the true score component of an observation.\index{reliability!overview}\index{true score}
Therefore, we *estimate* reliability (the coefficient of reliability) based on the relation between two observations of the same measure (for test–retest reliability) or using various other estimates of reliability.\index{reliability!overview}\index{reliability!coefficient of reliability}\index{reliability!test–retest}
The coefficient of reliability can depend on several factors.\index{reliability!overview}\index{reliability!coefficient of reliability}
Reliability is inversely related to the amount of [measurement error](#measurementError).\index{reliability!overview}\index{measurement error}
The coefficient of reliability, like correlation, depends on the degree of spread (variability) of the scores.\index{reliability!coefficient of reliability}\index{correlation}\index{variability}
If the scores at one or both time points show restricted range, the scores will show a weaker association and coefficient of reliability, as shown in Figure \@ref(fig:restrictedRange).\index{reliability!coefficient of reliability}\index{correlation}\index{restricted range}
An ideal way to estimate the reliability of a measure would be to take a person and repeatedly measure them many times to get an estimate of their true score and each measurement's deviation from their true score, and to do this for many people.\index{reliability!overview}\index{true score}
However, this is rarely done in practice.
Instead of taking one person and repeatedly measuring them many times, we typically estimate reliability by taking many people and measuring each of them twice.\index{reliability!overview}
This is a shortcut to estimate reliability, but even this shorter approach is not often done.\index{reliability!overview}
In short, researchers rarely estimate the [test–retest reliability](#testRetest-reliability) of the measures they use.\index{reliability!test–retest}
Reliability can also be related to the number of items in the measure.\index{reliability!overview}
In general, the greater the number of items, the more reliable the measure (assuming the items assess the same construct) because we are averaging out [random error](#randomError).\index{reliability!overview}\index{reliability!number of items}\index{measurement error!random error}
We never see the true scores or the error scores, so we cannot compute reliability—we can only *estimate* it from the observed scores.\index{reliability!overview}\index{true score}\index{measurement error}\index{observed score}
This estimate gives a probabilistic answer about the reliability of the measure, rather than an absolute answer.\index{reliability!overview}
### How Reliable Is Reliable Enough? {#reliableEnough}
As described by @Nunnally1994, how reliable a measure should be depends on the proposed uses.\index{reliability}
If it is early in the research process, and the focus is on group-level inferences (e.g., associations or group differences), modest reliability (e.g., .70) may be sufficient and save time and money.\index{reliability}\index{individual level}
Then, the researcher can see what the associations would be when disattenuated for unreliability, as described in Section \@ref(effectOfMeasurementErrorOnAssociations) of the chapter on [validity](#validity).\index{measurement error!disattenuation of association}
If the disattenuated associations are promising, it may be worth increasing the reliability of the measure.\index{reliability}\index{measurement error!disattenuation of association}
Associations are only weakly attenuated above a reliability of .80, so achieving a reliability coefficient of .80 may be an appropriate target for basic research.\index{reliability}
However, when making decisions about individual people from their score on a measure, reliability and precision are more important (than when making group-level inferences) because small differences in scores can lead to different decisions.\index{reliability}\index{reliability!precision}\index{individual level}
@Nunnally1994 recommend that measures have at least a reliability of .90 and—when making important decisions about individual people—that measures preferably have a reliability of .95 or higher.\index{reliability}\index{individual level}
Nevertheless, they also note that one should not switch to a less [valid](#validity) measure merely because it is more reliable.\index{reliability}\index{validity}
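For context, the sketch below applies Spearman's correction for attenuation, $r_{\text{true}} = \frac{r_{\text{observed}}}{\sqrt{r_{xx} \cdot r_{yy}}}$, to a hypothetical observed correlation to show how much the estimated true-score association changes at various levels of reliability.

```{r}
observedAssociation <- .30            # hypothetical observed correlation between two measures

reliabilityX <- c(.70, .80, .90, .95) # hypothetical reliability of measure X
reliabilityY <- .80                   # hypothetical reliability of measure Y

# Spearman's correction for attenuation:
# estimated true-score correlation = observed correlation / sqrt(rxx * ryy)
disattenuatedAssociation <- observedAssociation / sqrt(reliabilityX * reliabilityY)

data.frame(reliabilityX, reliabilityY, observedAssociation, disattenuatedAssociation)
```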
### Standard Error of Measurement (SEM) {#standardErrorOfMeasurement}
The estimate of reliability gives a general idea of the degree of uncertainty you have about a person's true score given their observed score.\index{reliability!standard error of measurement}\index{uncertainty}\index{standard error of measurement!zzzzz@\igobble|seealso{reliability}}
From this, we can estimate the standard error of measurement, which estimates the extent to which an observed score deviates from a true score.\index{reliability!standard error of measurement}\index{observed score}\index{true score}
The standard error of measurement indicates the typical distance of the observed score from the true score.\index{reliability!standard error of measurement}\index{observed score}\index{true score}
The formula for the standard error of measurement is in Equation \@ref(eq:standardErrorOfMeasurement):\index{reliability!standard error of measurement}
\begin{equation}
\text{standard error of measurement (SEM)} = \sigma_x \sqrt{1 - r_{xx}}
(\#eq:standardErrorOfMeasurement)
\end{equation}
where $\sigma_x$ represents the standard deviation of scores.
Thus, the standard error of measurement is directly related to the reliability of the measure.\index{reliability!standard error of measurement}\index{reliability}
The higher the reliability, the lower the standard error of measurement.\index{reliability!standard error of measurement}\index{reliability}
The standard error of measurement as a function of reliability of the measure and the standard deviation of scores is depicted in Figure \@ref(fig:standardErrorOfMeasurementPlot).\index{reliability!standard error of measurement}\index{reliability}
```{r standardErrorOfMeasurementPlot, echo = FALSE, results = "hide", out.width = "100%", fig.align = "center", fig.height = 4, fig.cap = "Standard Error of Measurement as a Function of Reliability."}
library("ggplot2")
standardErrorOfMeasurementData <- expand.grid(reliability = seq(0, 1, by = 0.01),
sd = c(1, 3, 5, 10))
standardErrorOfMeasurementData$sem <- standardErrorOfMeasurementData$sd * sqrt(1 - standardErrorOfMeasurementData$reliability)
ggplot(standardErrorOfMeasurementData, aes(reliability, sem, group = as.factor(sd), color = as.factor(sd))) +
geom_line(linewidth = 1.5) +
theme_bw() +
xlab("Reliability") +
ylab("Standard Error of Measurement") +
scale_color_viridis_d(name = "Standard Deviation of Scores") +
theme(legend.justification = c(1,1), legend.position = c(1,1)) +
theme(legend.background = element_rect(color = "black", fill = "white", linetype = "solid"))
```
The derivation of the SEM (from W. Joel Schneider) is in Equation \@ref(eq:standardErrorOfMeasurementDerivation):\index{reliability!standard error of measurement}
$$
\begin{aligned}
\text{Remember, based on } X = T + e: && \sigma^2_X &= \sigma^2_T + \sigma^2_e \\
\text{Solve for }\sigma^2_T: && \sigma^2_T &= \sigma^2_X - \sigma^2_e \\
\text{Remember:} && r_{xx} &= \frac{\sigma^2_T}{\sigma^2_X} \\
\text{Substitute for } \sigma^2_T: && &= \frac{\sigma^2_X - \sigma^2_e}{\sigma^2_X} \\
\text{Multiply by } \sigma^2_X: && \sigma^2_X \cdot r_{xx} &= \sigma^2_X - \sigma^2_e \\
\text{Solve for } \sigma^2_e: && \sigma^2_e &= \sigma^2_X - \sigma^2_X \cdot r_{xx} \\
\text{Factor out } \sigma^2_X: && \sigma^2_e &= \sigma^2_X (1 - r_{xx}) \\
\text{Take the square root:} && \sigma_e &= \sigma_X \sqrt{1 - r_{xx}}
\end{aligned}
(\#eq:standardErrorOfMeasurementDerivation)
$$
The SEM is equivalent to the standard deviation of [measurement error](#measurementError) ($e$) [@Lek2018a], as in Equation \@ref(eq:standardErrorOfMeasurementAlternative):\index{reliability!standard error of measurement}
$$
\begin{aligned}
\text{standard error of measurement (SEM)} &= \sigma_x \sqrt{1 - r_{xx}} \\
&= \sqrt{\sigma_x^2} \sqrt{1 - r_{xx}} \\
&= \sqrt{\sigma_x^2(1 - r_{xx})} \\
&= \sqrt{\sigma_x^2 - \sigma_x^2 \cdot r_{xx}} \\
&= \sqrt{\sigma_x^2 - \sigma_x^2 \frac{\sigma^2_T}{\sigma_x^2}} \\
&= \sqrt{\sigma_x^2 - \sigma^2_T} \\
&= \sqrt{\sigma^2_e} \\
&= \sigma_e \\
\end{aligned}
(\#eq:standardErrorOfMeasurementAlternative)
$$
Around 95% of scores would be expected to fall within $\pm 2$ SEMs of the true score (or, more precisely, within $\pm `r qnorm(.975)`$ SEMs of the true score).\index{reliability!standard error of measurement}\index{true score}
In other words, 95% of the time, the true score is expected to fall within $\pm 2$ SEMs of the observed score.\index{reliability!standard error of measurement}\index{true score}\index{observed score}
Given an observed score of $X = 15$ and $\text{SEM} = 2$, the 95% confidence interval of the true score is [11, 19].\index{reliability!standard error of measurement}\index{true score}\index{observed score}
So if a person gets a score of 15 on the measure, 95% of the time, their true score is expected to fall within 11–19.\index{reliability!standard error of measurement}\index{true score}\index{observed score}
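As a quick sketch of the preceding example, the SEM and the 95% confidence interval around the observed score can be computed directly from the standard deviation of scores and the reliability coefficient; the standard deviation and reliability below are hypothetical values chosen so that the SEM equals 2.

```{r}
observedScoreExample <- 15
scoreSD <- 10             # hypothetical standard deviation of scores
reliabilityExample <- .96 # hypothetical reliability, chosen so that the SEM equals 2

semExample <- scoreSD * sqrt(1 - reliabilityExample)
semExample # standard error of measurement = 2

# 95% confidence interval around the observed score
observedScoreExample + c(-1, 1) * qnorm(.975) * semExample
```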
We provide an empirical example of estimating the SEM in Section \@ref(sem-reliability).\index{reliability!standard error of measurement}
Based on the preceding discussion, consider the characteristics of measures that make them more useful from a reliability perspective.\index{reliability}
A useful measure would show wide variation across people (individual differences), so we can more accurately estimate its reliability.\index{reliability}\index{variability}
And we would expect a useful measure to show consistency, stability, precision, and reliability of scores.\index{reliability!consistency}\index{reliability!stability}\index{reliability!repeatability}\index{reliability!precision}\index{consistency!zzzzz@\igobble|seealso{reliability}}\index{repeatability!zzzzz@\igobble|seealso{reliability}}\index{precision!zzzzz@\igobble|seealso{reliability}}
## Getting Started {#gettingStarted-reliability}
### Load Libraries {#loadLibraries-reliability}
```{r}
library("petersenlab") #to install: install.packages("remotes"); remotes::install_github("DevPsyLab/petersenlab")
library("psych")
library("blandr")
library("MBESS")
library("lavaan")
library("semTools")
library("psychmeta")
library("irrCAC")
library("irrICC")
library("irrNA")
library("gtheory")
library("performance")
library("MOTE")
library("tidyverse")
library("tinytex")
library("knitr")
library("rmarkdown")
library("bookdown")
```
### Prepare Data {#prepareData-reliability}
#### Simulate Data {#simulateData-reliability}
For reproducibility, we set the seed below.\index{simulate data}
Using the same seed will yield the same answer every time.
There is nothing special about this particular seed.
```{r}
sampleSize <- 100
set.seed(52242)
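# Continuous ratings from three raters; raters 2 and 3 add increasing amounts of random error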
rater1continuous <- rnorm(n = sampleSize, mean = 50, sd = 10)
rater2continuous <- rater1continuous +
rnorm(
n = sampleSize,
mean = 0,
sd = 4)
rater3continuous <- rater2continuous +
rnorm(
n = sampleSize,
mean = 0,
sd = 8)
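# Binary (categorical) ratings from three raters; raters 2 and 3 each have 10 randomly
# selected cases recoded (to 0 and 1, respectively)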
rater1categorical <- sample(
c(0,1),
size = sampleSize,
replace = TRUE)
rater2categorical <- rater1categorical
rater3categorical <- rater1categorical
rater2categorical[
sample(1:length(rater2categorical),
size = 10,
replace = FALSE)] <- 0
rater3categorical[
sample(1:length(rater3categorical),
size = 10,
replace = FALSE)] <- 1
time1 <- rnorm(n = sampleSize, mean = 50, sd = 10)
time2 <- time1 + rnorm(n = sampleSize, mean = 0, sd = 4)
time3 <- time2 + rnorm(n = sampleSize, mean = 0, sd = 8)
item1 <- rnorm(n = sampleSize, mean = 50, sd = 10)
item2 <- item1 + rnorm(n = sampleSize, mean = 0, sd = 4)
item3 <- item2 + rnorm(n = sampleSize, mean = 0, sd = 8)
item4 <- item3 + rnorm(n = sampleSize, mean = 0, sd = 12)
Person <- as.factor(rep(1:6, each = 8))
Occasion <- Rater <-
as.factor(rep(1:2, each = 4, times = 6))
Item <- as.factor(rep(1:4, times = 12))
Score <- c(
9,9,7,4,9,8,5,5,9,8,4,6,
6,5,3,3,8,8,6,2,8,7,3,2,
9,8,6,3,9,6,6,2,10,9,8,7,
8,8,9,7,6,4,5,1,3,2,3,2)
```
#### Add Missing Data {#addMissingData-reliability}
Adding missing data to the dataframes makes the examples more realistic (real-life data typically include missing values) and helps you get in the habit of writing code that accounts for missing data.
```{r}
rater1continuous[c(5,10)] <- NA
rater2continuous[c(10,15)] <- NA
rater3continuous[c(10)] <- NA
rater1categorical[c(5,10)] <- NA
rater2categorical[c(10,15)] <- NA
rater3categorical[c(10)] <- NA
time1[c(5,10)] <- NA
time2[c(10,15)] <- NA
time3[c(10)] <- NA
item1[c(5,10)] <- NA
item2[c(10,15)] <- NA
item3[c(10)] <- NA
item4[c(10)] <- NA
```
#### Combine Data Into Dataframe {#combineData-reliability}
```{r}
mydata <- data.frame(
subid = 1:sampleSize,
rater1continuous, rater2continuous, rater3continuous,
rater1categorical, rater2categorical, rater3categorical,
time1, time2, time3,
item1, item2, item3, item4)
pio_cross_dat <- data.frame(Person, Item, Score, Occasion)
```
## Types of Reliability {#typesOfReliability}
Reliability is not one thing.\index{reliability!types of}
There are several types of reliability.\index{reliability!types of}
In this book, we focus on [test–retest](#testRetest-reliability), [inter-rater](#interrater-reliability), [intra-rater](#intrarater-reliability), [parallel-forms](#parallelForms-reliability), and [internal consistency](#internalConsistency-reliability) reliability.\index{reliability!types of}
Moreover, one can estimate reliability at different levels (e.g., person-level versus group-level) [@Geldhof2014; @Hove2022].\index{reliability!types of}
### Test–Retest Reliability {#testRetest-reliability}
Test–retest reliability is defined as the consistency of scores across time.
Typically, this is based on a two-week retest interval.\index{reliability!test–retest}\index{test–retest reliability!zzzzz@\igobble|seealso{reliability}}
The intent of a two-week interval between the original testing and the retest is to allow enough time to pass to reduce any carryover effects from the original testing, while not allowing so much time to pass that the person's level on the construct (i.e., their true score) would change.\index{reliability!test–retest}\index{carryover effect}
A carryover effect is an effect of an earlier testing occasion (or experimental condition) that carries over to influence the participant's behavior at a later time.\index{reliability!test–retest}\index{carryover effect}
Examples of carryover effects resulting from repeated measurement can include fatigue, boredom, learning (practice effects), etc.\index{carryover effect}
Another potential issue is that [measurement error](#measurementError) can be correlated across the two measurements.\index{measurement error}
Test–retest reliability controls for transient error and random response error.\index{reliability!test–retest}
If the construct is not stable across time (i.e., people's true scores change), test–retest reliability is not relevant because the CTT approach to estimating reliability assumes that the true scores are perfectly correlated across time (see Section \@ref(ctt)).\index{reliability!test–retest}\index{construct!stability}
The length of the optimal retest interval depends on the construct of interest.\index{reliability!test–retest}
For a construct in which people's levels change rapidly, a shorter retest interval may be appropriate.\index{reliability!test–retest}
But one should pay attention to ways to reduce potential carryover effects.\index{reliability!test–retest}\index{carryover effect}
By contrast, if the retest interval is too long, people's levels on the construct may change during that span.\index{reliability!test–retest}\index{construct!stability}
If people's levels on the construct change from test to retest, we can no longer assume that the true scores are perfectly correlated across time, which would violate CTT assumptions for estimating test–retest reliability of a measure.\index{reliability!test–retest}\index{construct!stability}\index{true score}\index{classical test theory}
The longer the retest interval, the smaller the observed association between scores across time will tend to be.\index{reliability!test–retest}
For weak associations obtained from a lengthy retest interval, it can be difficult to determine how much of this weak association reflects measurement unreliability versus people's change in their levels on the construct.\index{reliability!test–retest}\index{measurement error}
Thus, when conducting studies to evaluate test–retest reliability, it is important to consider the length of the retest interval and ways to reduce carryover effects.\index{reliability!test–retest}\index{carryover effect}
#### Coefficient of Stability (and Coefficient of Dependability) {#stability}
The coefficient of stability is the most widely used index when reporting the test–retest reliability of a measure.\index{reliability!test–retest}\index{reliability!test–retest!coefficient of stability}
It is estimated using a Pearson correlation of the scores at time 1 with the scores at time 2.\index{reliability!test–retest}\index{reliability!test–retest!coefficient of stability}
That is, the coefficient of stability assesses the stability of individual differences (i.e., rank-order stability).\index{reliability!test–retest}\index{reliability!test–retest!coefficient of stability}
The Pearson correlation is called the coefficient of stability when the length of the retest interval (the delay between test and retest) is on the order of days or weeks.\index{reliability!test–retest}\index{reliability!test–retest!coefficient of stability}
If the retest occurs almost at the same time as the original test (e.g., a 45-minute delay), the Pearson correlation is called the *coefficient of dependability* [@Revelle2019].\index{reliability!test–retest}\index{reliability!test–retest!coefficient of dependability}
We estimate the coefficient of stability below:\index{reliability!test–retest}\index{reliability!test–retest!coefficient of stability}
```{r}
cor.test(x = mydata$time1, y = mydata$time2)
cor(mydata[,c("time1","time2","time3")],
use = "pairwise.complete.obs")
```
```{r, include = FALSE}
corValue <- cor.test(
x = mydata$time1,
y = mydata$time2)$estimate
```
The *r* value of $`r apa(corValue, 2, leading = FALSE)`$ indicates a strong positive association.\index{correlation}
Figure \@ref(fig:testRetestScatterplot) depicts a scatterplot of the time 1 scores on the x-axis and time 2 scores on the y-axis.
(ref:testRetestScatterplotCaption) Test–Retest Reliability Scatterplot. The black line is the best-fitting linear line. The red line is a locally weighted scatterplot smoothing (LOWESS) line, which uses nonparametric estimation of the best fit.
```{r testRetestScatterplot, out.width = "100%", fig.align = "center", class.source = "fold-hide", fig.cap = "(ref:testRetestScatterplotCaption)", fig.scap = "Test–Retest Reliability Scatterplot."}
plot(mydata$time1, mydata$time2,
main = substitute(
paste(italic(r), " = ", x, sep = ""),
list(x = apa(corValue, 2, leading = FALSE))))
abline(lm(time2 ~ time1, data = mydata), col = "black")
mydataNoMissing <- na.omit(mydata[,c("time1","time2")])
lines(lowess(mydataNoMissing$time1, mydataNoMissing$time2),
col = "red")
```
##### Considerations About the Correlation Coefficient {#correlationConsiderations}
The correlation coefficient ranges from −1.0 to +1.0.\index{correlation}
The correlation coefficient ($r$) tells you two things: (1) the direction of the association (positive or negative) and (2) the magnitude of the association.\index{correlation}
If the correlation coefficient is positive, the association is positive.\index{correlation}
If the correlation coefficient is negative, the association is negative.\index{correlation}
If the association is positive, as X increases, Y increases (or conversely, as X decreases, Y decreases).\index{correlation}
If the association is negative, as X increases, Y decreases (or conversely, as X decreases, Y increases).\index{correlation}
The smaller the absolute value of the correlation coefficient (i.e., the closer the $r$ value is to zero), the weaker the association and the flatter the slope of the best-fit line in a scatterplot.\index{correlation}
The larger the absolute value of the correlation coefficient (i.e., the closer the absolute value of the $r$ value is to one), the stronger the association and the steeper the slope of the best-fit line in a scatterplot.\index{correlation}
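As a brief aside, when both variables are standardized (i.e., converted to *z* scores), the slope of the best-fitting line equals the correlation coefficient; in raw units, the slope also depends on the variables' standard deviations.\index{correlation}
The following minimal sketch, using complete cases from the simulated data, illustrates this:

```{r}
# Use complete cases so the standardized slope and the correlation match exactly
completeCases <- na.omit(mydata[, c("time1", "time2")])

coef(lm(scale(time2) ~ scale(time1), data = completeCases))[2]  # standardized slope
cor(completeCases$time1, completeCases$time2)                   # same value
```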
See Figure \@ref(fig:rangeOfCorrelations) for a range of different correlation coefficients and what some example data may look like for each direction and strength of association.\index{correlation}
```{r, include = FALSE}
#Simulate data with specified correlation in relation to an existing variable (https://stats.stackexchange.com/a/313138/20338; archived at https://perma.cc/S26F-QSW3)
complement <- function(y, rho, x){
if (missing(x)) x <- rnorm(length(y)) # Optional: supply a default if `x` is not given
y.perp <- residuals(lm(x ~ y, na.action = na.omit))
rho * sd(y.perp) * y + y.perp * sd(y, na.rm = TRUE) * sqrt(1 - rho^2)
}
correlations <- data.frame(criterion = rnorm(1000))
correlations$v1 <- complement(correlations$criterion, -1)
correlations$v2 <- complement(correlations$criterion, -.9)
correlations$v3 <- complement(correlations$criterion, -.8)
correlations$v4 <- complement(correlations$criterion, -.7)
correlations$v5 <- complement(correlations$criterion, -.6)
correlations$v6 <- complement(correlations$criterion, -.5)
correlations$v7 <- complement(correlations$criterion, -.4)
correlations$v8 <- complement(correlations$criterion, -.3)
correlations$v9 <- complement(correlations$criterion, -.2)
correlations$v10 <-complement(correlations$criterion, -.1)
correlations$v11 <-complement(correlations$criterion, 0)
correlations$v12 <-complement(correlations$criterion, .1)
correlations$v13 <-complement(correlations$criterion, .2)
correlations$v14 <-complement(correlations$criterion, .3)
correlations$v15 <-complement(correlations$criterion, .4)
correlations$v16 <-complement(correlations$criterion, .5)
correlations$v17 <-complement(correlations$criterion, .6)
correlations$v18 <-complement(correlations$criterion, .7)
correlations$v19 <-complement(correlations$criterion, .8)
correlations$v20 <-complement(correlations$criterion, .9)
correlations$v21 <-complement(correlations$criterion, 1)
```
```{r rangeOfCorrelations, echo = FALSE, results = "hide", out.width = "100%", fig.align = "center", fig.height = 12, fig.cap = "Correlation Coefficients."}
par(mfrow = c(7,3), mar = c(1, 0, 1, 0))
# Plot each simulated variable (v1–v21; intended correlations from -1.0 to +1.0
# in steps of .1) against the criterion, with the best-fitting linear line
# (black) and a LOWESS line (red)
for (i in 1:21) {
  v <- correlations[[paste0("v", i)]]
  plot(correlations$criterion, v, xaxt = "n", yaxt = "n", xlab = "", ylab = "",
       main = substitute(
         paste(italic(r), " = ", x, sep = ""),
         list(x = round(cor.test(
           x = correlations$criterion,
           y = v)$estimate, 2))))
  abline(lm(v ~ criterion, data = correlations), col = "black")
  lines(lowess(correlations$criterion, v), col = "red")
}
dev.off() #par(mfrow = c(1,1))
```
Keep in mind that the Pearson correlation examines the strength of the *linear* association between two variables.\index{correlation}
If the association between two variables is nonlinear, the Pearson correlation provides the strength of the linear trend and may not provide a meaningful index of the strength of the association between the variables.\index{correlation}
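As a brief, hypothetical illustration, a strong but purely curvilinear association can yield a Pearson correlation near zero:\index{correlation}

```{r}
# Hypothetical data: yNonlinear is a (noisy) quadratic function of xNonlinear
set.seed(1)
xNonlinear <- seq(-3, 3, length.out = 100)
yNonlinear <- xNonlinear^2 + rnorm(100, mean = 0, sd = 0.5)

cor(xNonlinear, yNonlinear)  # near zero, despite a strong curvilinear association
```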
For instance, Anscombe's quartet includes four sets of data that have nearly identical basic descriptive statistics (see Tables \@ref(tab:anscombe) and \@ref(tab:anscombeStats)), including the same bivariate correlation, yet their distributions differ markedly and their associations take very different forms (see Figure \@ref(fig:anscombeQuartet)).\index{correlation}
Table: (\#tab:anscombe) Anscombe's Quartet.
| x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
|----|-------|----|------|----|-------|----|------|
| 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 14 | 9.96 | 14 | 8.1 | 14 | 8.84 | 8 | 7.04 |
| 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 4 | 4.26 | 4 | 3.1 | 4 | 5.39 | 19 | 12.5 |
| 12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
| 7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
| 5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |
Table: (\#tab:anscombeStats) Descriptive Statistics of Anscombe's Quartet.
| Property | Value |
|------------------------------|--------------|
| Sample size | 11 |
| Mean of X | 9.0 |
| Mean of Y | ~7.5 |
| Variance of X | 11.0 |
| Variance of Y | ~4.1 |
| Equation of regression line | Y = 3 + 0.5X |
| Standard error of slope | 0.118 |
| t-statistic (slope vs. zero) | 4.24 |
| Sum of squares of X | 110.0 |
| Regression sum of squares | 27.50 |
| Residual sum of squares of Y | 13.75 |
| Correlation coefficient | .816 |
| Coefficient of determination | .67 |
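The statistics above can be verified directly from R's built-in `anscombe` data set; for example, the following brief sketch confirms that all four x–y pairs share (nearly) the same correlation:\index{correlation}

```{r}
# Correlation for each of the four Anscombe pairs (all approximately .816)
sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
```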
```{r anscombeQuartet, echo = FALSE, results = "hide", out.width = "100%", fig.align = "center", fig.height = 6, fig.width = 6, fig.cap = "Anscombe's Quartet."}
anscombe
par(mfrow = c(2,2))
plot(anscombe$x1, anscombe$y1, xlab = "x", ylab = "y", xlim = c(0,20), ylim = c(0,15),
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = anscombe$x1, y = anscombe$y1)$estimate, 2))))
abline(lm(y1 ~ x1, data = anscombe), col = "black", lty = 2)
#lines(lowess(anscombe$x1, anscombe$y1), col = "red")
plot(anscombe$x2, anscombe$y2, xlab = "x", ylab = "y", xlim = c(0,20), ylim = c(0,15),
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = anscombe$x2, y = anscombe$y2)$estimate, 2))))
abline(lm(y2 ~ x2, data = anscombe), col = "black", lty = 2)
#lines(lowess(anscombe$x2, anscombe$y2), col = "red")
plot(anscombe$x3, anscombe$y3, xlab = "x", ylab = "y", xlim = c(0,20), ylim = c(0,15),
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = anscombe$x3, y = anscombe$y3)$estimate, 2))))
abline(lm(y3 ~ x3, data = anscombe), col = "black", lty = 2)
#lines(lowess(anscombe$x3, anscombe$y3), col = "red")
plot(anscombe$x4, anscombe$y4, xlab = "x", ylab = "y", xlim = c(0,20), ylim = c(0,15),
main = substitute(paste(italic(r), " = ", x, sep = ""), list(x = round(cor.test(x = anscombe$x4, y = anscombe$y4)$estimate, 2))))
abline(lm(y4 ~ x4, data = anscombe), col = "black", lty = 2)
#lines(lowess(anscombe$x4, anscombe$y4), col = "red")
```
As an index of correlation, the coefficient of stability (and dependability) has important weaknesses.\index{reliability!test–retest}\index{reliability!test–retest!coefficient of stability}\index{reliability!test–retest!coefficient of dependability}\index{correlation}
It is a form of *relative* reliability rather than *absolute* reliability: It examines the consistency of scores across time relative to variation across people.\index{reliability!test–retest}\index{reliability!relative}\index{reliability!absolute}\index{correlation}
Higher stability coefficients reflect greater stability of individual differences—not greater stability in people's absolute level on the measure.\index{reliability!test–retest}\index{reliability!test–retest!coefficient of stability}\index{reliability!test–retest!coefficient of dependability}\index{correlation}
This is a major limitation.
Figure \@ref(fig:relativeReliability) depicts two example data sets that show strong relative reliability (a strong coefficient of stability) but poor absolute reliability based on inconsistency in people's absolute level across time.\index{reliability!test–retest}\index{reliability!relative}\index{reliability!absolute}\index{reliability!test–retest!coefficient of stability}\index{reliability!test–retest!coefficient of dependability}\index{correlation}
(ref:relativeReliabilityCaption) Hypothetical Data Demonstrating Good Relative Reliability Despite Poor Absolute Reliability. The figure depicts two fictional data sets (black and red circles), which both exhibit a similar linear association. The line of best fit is the solid line in the graph and is the same for both data sets, but the black circles sit much closer to the line than the red circles, leading to a much higher coefficient of stability ($r = .99$ and $.84$, respectively). However, neither set of circles falls on the line of complete agreement, represented by the dashed line in the graph. If the circles fall on the line of complete agreement, it indicates that the measure's scores show consistency in absolute terms across time. Thus, although the measures show strong relative reliability, they show poor absolute reliability. (Figure reprinted from @Vaz2013, Figure 1, p. 3. Vaz, S., Falkmer, T., Passmore, A. E., Parsons, R., & Andreou, P. (2013). The case for using the repeatability coefficient when calculating test–retest reliability. *PLoS ONE*, *8*(9), e73990. [https://doi.org/10.1371/journal.pone.0073990](https://doi.org/10.1371/journal.pone.0073990))
```{r relativeReliability, out.width = "80%", fig.align = "center", fig.cap = "(ref:relativeReliabilityCaption)", fig.scap = "Hypothetical Data Demonstrating Good Relative Reliability Despite Poor Absolute Reliability.", echo = FALSE}
knitr::include_graphics("./Images/relative-reliability.png")
```
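To make the distinction concrete, here is a minimal sketch using the simulated data and a hypothetical constant shift: adding a constant to the time 2 scores leaves the Pearson correlation (and thus the coefficient of stability) unchanged, but it lowers an index of absolute agreement, such as a single-rater, absolute-agreement intraclass correlation coefficient (ICC):\index{reliability!relative}\index{reliability!absolute}

```{r}
# Hypothetical constant shift added to the time 2 scores
time2shifted <- mydata$time2 + 10

# The Pearson correlation (relative reliability) is unchanged by the shift
cor(mydata$time1, mydata$time2, use = "pairwise.complete.obs")
cor(mydata$time1, time2shifted, use = "pairwise.complete.obs")

# Absolute agreement (see the ICC2 row, a single-rater absolute-agreement ICC)
# is lowered by the shift
psych::ICC(na.omit(cbind(mydata$time1, mydata$time2)), lmer = FALSE)$results
psych::ICC(na.omit(cbind(mydata$time1, time2shifted)), lmer = FALSE)$results
```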
Another limitation of the coefficient of stability (and dependability) is that it is sensitive to outliers.\index{reliability!test–retest}\index{reliability!test–retest!coefficient of stability}\index{reliability!test–retest!coefficient of dependability}\index{correlation}
Additionally, if there is little or no variation across people (i.e., few individual differences) in scores on a given measure, the coefficient of stability will be low relative to the true reliability, because correlation coefficients tend to be attenuated in the presence of a restricted range; depending on the purpose, it may therefore not be a useful index of reliability.\index{reliability!test–retest}\index{reliability!test–retest!coefficient of stability}\index{reliability!test–retest!coefficient of dependability}\index{restricted range}\index{correlation}
For more information see Section \@ref(reliabilityParadox) on the reliability paradox in the chapter on [cognitive](#cognition) assessment.\index{reliability!paradox}
See Figure \@ref(fig:restrictedRange) for an example of how a correlation coefficient tends to be attenuated when the range of one or both of the variables has restriction of range.\index{restricted range}\index{correlation}
Here is the correlation with and without restriction of range on $x$ (i.e., $x$ is restricted to values between 55 and 65):\index{restricted range}\index{correlation}
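A minimal sketch of the computation is shown below (the figure that follows visualizes the same comparison):\index{restricted range}\index{correlation}

```{r}
# Restriction of range: keep only cases whose time 1 scores fall between 55 and 65
restrictedData <- subset(mydata, time1 >= 55 & time1 <= 65)

cor(mydata$time1, mydata$time3,
    use = "pairwise.complete.obs")  # full range
cor(restrictedData$time1, restrictedData$time3,
    use = "pairwise.complete.obs")  # restricted range (typically attenuated)
```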
```{r restrictedRange, fig.cap = "Example of Correlation With (Right Panel) and Without (Left Panel) Range Restriction. Filled black points represent the points in common across the two scatterplots.", fig.scap = "Example of Correlation With and Without Range Restriction.", out.width = "100%", fig.align = "center", fig.width = 10, fig.height = 5, echo = FALSE}
par(mfrow = c(1,2))
par(mar = c(4, 0, 3, 0),
oma = c(0, 4, 0, 0))
#Plot with no restricted range
plot(
mydata$time1, mydata$time3,
main = substitute(
paste("No range restriction: ", italic(r), " = ", x, sep = ""),
list(x = apa(cor.test(x = mydata$time1,
y = mydata$time3)$estimate, 2, leading = FALSE))),
xlab = "time1",