To evaluate the ability of a multisensory fitness tracker, the Jawbone UP3 (JB3), to quantify and classify sleep in patients with suspected central disorders of hypersomnolence.
This study included 43 patients who completed polysomnography (PSG) and a Multiple Sleep Latency Test (MSLT) with concurrent wrist-worn JB3 and Actiwatch 2 (AW2) recordings for comparison. Mean differences in nocturnal sleep architecture variables were compared using Bland-Altman analysis. Sensitivity, specificity, and accuracy were derived for both devices relative to PSG. Ability of the JB3 to detect sleep onset rapid eye movement periods (SOREMPs) during MSLT naps was also quantified.
JB3 demonstrated a significant overestimation of total sleep time (39.6 min, P < .0001) relative to PSG, but performed comparably to AW2. Although the ability of the JB3 to detect epochs of sleep was relatively good (sensitivity = 0.97), its ability to distinguish light, deep, and REM sleep was poor. Similarly, the JB3 did not correctly identify a single SOREMP during any MSLT nap opportunity.
The JB3 did not accurately quantify or classify sleep in patients with suspected central disorders of hypersomnolence, and was particularly poor at identifying REM sleep. Thus, this device cannot be used as a surrogate for PSG or MSLT in the assessment of patients with suspected central disorders of hypersomnolence.
Cook JD, Prairie ML, Plante DT. Ability of the multisensory Jawbone UP3 to quantify and classify sleep in patients with suspected central disorders of hypersomnolence: a comparison against polysomnography and actigraphy. J Clin Sleep Med. 2018;14(5):841–848.
Current Knowledge/Study Rationale: Consumers are increasingly using commercially available fitness trackers to quantify and classify sleep despite limited empiric data assessing their performance characteristics. These devices may also have clinical utility, particularly in the evaluation of patients with suspected central disorders of hypersomnolence, for which total sleep duration and REM sleep instability are important diagnostic considerations.
Study Impact: Results demonstrate the Jawbone UP3 does not accurately measure sleep duration or identify sleep stages during polysomnography and Multiple Sleep Latency Testing. Thus, this device should not be used for diagnostic purposes in patients with suspected central disorders of hypersomnolence.
Central disorders of hypersomnolence (CDH), most notably narcolepsy and idiopathic hypersomnia (IH), are characterized by excessive daytime sleepiness not caused by disrupted nocturnal sleep or circadian rhythm disorders.1,2 The current standard assessment for patients with suspected CDH is an in-laboratory, nocturnal polysomnography (PSG) followed by a daytime Multiple Sleep Latency Test (MSLT).2–5 This in-laboratory evaluation assists in identifying patients with pathological sleep propensity during the day as well as distinguishing narcolepsy from IH.2–4 Historically, the mean sleep onset latency (MSL) during the MSLT has been used to identify patients with possible CDH, whereas the presence or absence of two or more sleep onset rapid eye movement periods (SOREMPs) has been used to distinguish narcolepsy from IH.2,4
In recognition that IH often is characterized by excessive sleep duration without pathological sleep propensity on the MSLT (ie, MSL below 8 minutes),6 the International Classification of Sleep Disorders, Third Edition (ICSD-3) additionally allows for the diagnosis of IH in the context of the objective finding of total sleep time (TST) greater than or equal to 660 minutes measured by either 24-hour PSG (performed after correction for chronic sleep deprivation) or wrist actigraphy (7 days with unrestricted sleep), regardless of MSL during the MSLT. Notably, the use of actigraphic recordings to quantify excessive sleep duration in IH still requires validation, and the type and specifications of the actigraphic monitors used for this purpose are not elucidated in current diagnostic nosology.1
The evolution of commercially available, wrist-worn fitness trackers has provided a low-cost, field-based, and user-friendly alternative to standard actigraphy for the estimation of sleep outside of laboratory conditions.7–10 These mass-marketed devices are continuing to gain popularity in both general and patient populations, and may have utility in the clinical assessment of patients.11 Furthermore, these trackers typically provide user access to cloud-based platforms, which facilitates continuous data collection and may be particularly useful to researchers investigating longitudinal sleep patterns.7 Given their potential clinical and research utility, it is important to empirically evaluate these devices against PSG in the quantification and staging of sleep, in order to elucidate their diagnostic capabilities and limitations.
Previous investigations of single-sensor, accelerometer-based fitness trackers in adolescents,8,9,12 healthy young adults,10 middle-aged women,13 and adult psychiatric populations7 have consistently demonstrated an overestimation of TST relative to PSG. Similar to standard actigraphy, which also relies on a single-sensor accelerometer, these devices have displayed high sensitivity (ability to detect true sleep) and poor specificity (ability to detect true wake) when compared against PSG.14 However, to our knowledge, no previous investigation has evaluated a single-sensor, accelerometer-based fitness tracker in patients with suspected CDH. Thus, given the potential use of these devices to quantify TST in suspected CDH, evaluation of the capabilities of these devices in this patient population is a crucial step before considering them as viable tools for clinical or research implementation.
Recent technological advances have moved wrist-worn fitness trackers from a single sensor to a multisensory approach that pairs accelerometry with heart rate detection. The Jawbone UP3 (JB3; Jawbone, San Francisco, California, United States) is one such device that advertises the ability to estimate sleep duration and efficiency while classifying sleep into various stages. Although other single and multisensor trackers attempt to classify sleep into variations of “light” and “deep” sleep,12,15 the JB3 purports the ability to classify sleep as light sleep (LS), deep sleep (DS), and rapid eye movement (REM) sleep.16 To our knowledge, no previous investigations have been performed evaluating the ability of a wrist-worn multisensory tracker to detect REM sleep. Given the importance of TST and REM sleep in the diagnosis and delineation of CDH, this device may have particular utility for these patients. Furthermore, this device's relatively low cost and longitudinal capabilities may help circumvent the test-retest limitations of the MSLT and need for more cost-effective measures in CDH assessment.17
Thus, this investigation aimed to evaluate the ability of the multisensory JB3 to quantify and classify sleep against PSG and a standard wrist-worn actigraph, the Actiwatch 2 (AW2; Philips Respironics, Murrysville, Pennsylvania, United States), in clinical patients with suspected CDH.
Participants, Inclusion/Exclusion Criteria, and Study Design
As part of a larger study examining novel assessment measures of excessive daytime sleepiness, 57 consecutively recruited clinical patients referred to undergo PSG and MSLT assessment at Wisconsin Sleep, the sleep laboratory and clinic of the University of Wisconsin-Madison, were considered for this analysis. After obtaining informed consent, participants completed the Epworth Sleepiness Scale (ESS), and wore the AW2 and JB3 concurrently on their nondominant wrist for the duration of in-laboratory testing. Participants were later excluded from analyses if they demonstrated moderate or worse sleep-disordered breathing during their nocturnal PSG (apnea-hypopnea index [AHI] ≥ 15 events/h), which triggered a split-night protocol and the application of positive airway pressure therapy, subsequently cancelling the MSLT. Final diagnoses were determined by post hoc chart review and ICSD-3 criteria.1
All participants provided informed consent and this study was approved by the Health Sciences Institutional Review Board of the University of Wisconsin-Madison.
In-Laboratory, Nocturnal PSG and MSLT Assessment Procedures
PSG data were collected following American Academy of Sleep Medicine parameters using a standard six-channel electroencephalographic montage (unless the referring physician requested additional electroencephalography derivations) paired with other recording sensors including electrooculogram, submental electromyogram, electrocardiogram, bilateral tibial electromyogram, respiratory inductance plethysmography, pulse oximetry, and a position sensor (Alice Sleepware; Philips Respironics, Murrysville, Pennsylvania, United States). A registered sleep technologist, blinded to the JB3 and AW2 staging output, staged all PSG recordings using 30-second epochs according to American Academy of Sleep Medicine criteria.18 Participants were allowed to sleep ad libitum, remaining minimally disturbed throughout the night and not awakened at a prescribed time the following morning. Lights-on time was determined by the participant's stated desire to terminate the nocturnal sleep period upon awakening. PSG and accelerometer data were collected within a local network of computers that were time synchronized to an external atomic clock through frequent restart.
Upon awakening, participants continued to wear the JB3 and AW2 through the MSLT. Daytime naps were scored by a registered sleep technologist, using the same standards utilized for nocturnal PSG sleep staging. Number of nap opportunities depended on the time of participant awakening from the nocturnal PSG, and thus could vary among individuals; however, whenever possible four to five naps were obtained.
Nocturnal PSG Data Analysis
PSG was considered the gold standard measure of sleep duration, continuity, and staging. PSG lights-off and lights-on times were used as the start and end points for the JB3 and AW2 rest periods to maintain consistency.7,9 The following sleep variables were calculated for PSG, JB3, and AW2: total sleep time (TST; total duration of all epochs of sleep during period of time in bed), sleep onset latency (SOL; time from lights-off to the first epoch of sleep), wake after sleep onset (WASO; total duration of wake time after sleep onset), and sleep efficiency (SE; equal to TST divided by total time in bed). The following supplementary sleep variables were calculated strictly for PSG and JB3 comparisons: rapid eye movement sleep latency (REML; time from first sleep epoch to first epoch of rapid eye movement sleep), total rapid eye movement sleep (TRT; total duration of rapid eye movement sleep during period of time in bed), light sleep (LS; total duration of staged light sleep during period of time in bed), and deep sleep (DS; total duration of staged deep sleep during period of time in bed). PSG LS was a summation of the initial two non-rapid eye movement (NREM) sleep stages (stage N1 + stage N2), and PSG DS was equal to stage N3 sleep.19
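The four core variables defined above can be derived directly from a sequence of scored epochs. The following is a minimal sketch, not the study's MATLAB scripts: the function name `sleep_summary`, the `"W"` wake label, and the handling of a recording with no sleep are illustrative assumptions.

```python
from typing import Dict, Sequence

EPOCH_MIN = 0.5  # one 30-second epoch, expressed in minutes


def sleep_summary(epochs: Sequence[str]) -> Dict[str, float]:
    """Summarize a lights-off to lights-on recording scored in 30-s epochs.

    Each epoch is "W" (wake) or any other label, which is treated as sleep.
    Returns TST, SOL, and WASO in minutes and SE as a percentage.
    """
    is_sleep = [stage != "W" for stage in epochs]
    time_in_bed = len(epochs) * EPOCH_MIN
    if not any(is_sleep):
        # Assumption: with no scored sleep, SOL spans the whole period.
        return {"TST": 0.0, "SOL": time_in_bed, "WASO": 0.0, "SE": 0.0}
    onset = is_sleep.index(True)  # index of the first sleep epoch
    tst = sum(is_sleep) * EPOCH_MIN
    sol = onset * EPOCH_MIN
    # Wake epochs from sleep onset through lights-on
    waso = sum(not s for s in is_sleep[onset:]) * EPOCH_MIN
    se = 100.0 * tst / time_in_bed
    return {"TST": tst, "SOL": sol, "WASO": waso, "SE": se}


summary = sleep_summary(["W", "W", "N1", "N2", "W", "N2", "R", "W"])
print(summary)
```

Under these definitions, SOL, TST, and WASO sum to total time in bed, which offers a quick consistency check on any device export.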
AW2 data were collected as 30-second epochs and were analyzed utilizing the medium wake/sleep threshold (value = 40) with 5-minute immobility time for sleep onset/offset, which has been shown to produce the most accurate output, relative to PSG.20 JB3 data were manually extracted in 1-minute epochs (the smallest epoch length available) from the Jawbone UP Android Application version 4.24 (Jawbone, San Francisco, California, United States) after the device synchronized with the interface through Bluetooth integration. In order to compare 1-minute JB3 epochs with PSG and AW2, each JB3 epoch was split into two equivalent 30-second epochs to correspond with the PSG and AW2 30-second epoch values.7,10 Individual participant JB3 sleep variables were tabulated utilizing in-house custom MATLAB (Mathworks; Natick, Massachusetts, United States) scripts.
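The epoch alignment step amounts to repeating each 1-minute JB3 label twice. A minimal sketch follows; the function name `split_minute_epochs` and the label strings are hypothetical and do not reflect the Jawbone export format.

```python
from typing import List, Sequence


def split_minute_epochs(minute_epochs: Sequence[str]) -> List[str]:
    """Duplicate each 1-minute label into two consecutive 30-second labels."""
    return [label for label in minute_epochs for _ in range(2)]


aligned = split_minute_epochs(["W", "LS", "REM"])
print(aligned)
```

One consequence of this approach is that the device can never register a stage change at the half-minute boundary, so some epoch-level disagreement with PSG is built into the comparison.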
Bland-Altman analysis21 was utilized to calculate mean differences across sleep variables for each comparison of interest (JB3 versus PSG; JB3 versus AW2; AW2 versus PSG). Further analyses explored the overall congruency between individually staged epochs among the devices. All sleep (AS) sensitivity (ability to correctly detect PSG-scored sleep epochs), AS specificity (ability to correctly detect PSG-scored wake epochs), and AS accuracy (ability to correctly detect PSG-scored sleep and wake epochs) were calculated for both the JB3 and AW2. Further epoch comparisons calculating JB3 sensitivity, specificity, and accuracy for LS, DS, and REM sleep were also conducted. Exploratory analyses also examined epoch-by-epoch performance for the AW2 and JB3 relative to PSG within diagnostic categories.
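The epoch-by-epoch measures can be illustrated with a short sketch. It assumes that PSG stages are first mapped to the device's categories (N1 and N2 to LS, N3 to DS, as defined earlier) and that stage-specific measures are computed one-versus-rest across all epochs, wake included; the source does not state the latter, and the names `PSG_TO_DEVICE` and `epoch_metrics` are illustrative.

```python
from typing import Dict, Optional, Sequence

# Mapping of PSG stages onto the JB3 categories described in the text
PSG_TO_DEVICE = {"W": "W", "N1": "LS", "N2": "LS", "N3": "DS", "R": "REM"}


def epoch_metrics(psg: Sequence[str], device: Sequence[str],
                  target: Optional[str] = None) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy of a device relative to PSG.

    target=None scores all sleep (any non-wake label is positive);
    otherwise the named stage ("LS", "DS", or "REM") is positive and
    every other label, including wake, is negative (an assumption).
    """
    if len(psg) != len(device):
        raise ValueError("PSG and device must have the same number of epochs")
    reference = [PSG_TO_DEVICE[stage] for stage in psg]

    def positive(label: str) -> bool:
        return label != "W" if target is None else label == target

    tp = fp = tn = fn = 0
    for ref, dev in zip(reference, device):
        if positive(ref) and positive(dev):
            tp += 1
        elif positive(ref):
            fn += 1
        elif positive(dev):
            fp += 1
        else:
            tn += 1

    nan = float("nan")
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "accuracy": (tp + tn) / total if total else nan,
    }


psg = ["W", "N1", "N2", "N3", "R", "W"]
jb3 = ["LS", "LS", "LS", "DS", "LS", "W"]
print(epoch_metrics(psg, jb3))                # all sleep
print(epoch_metrics(psg, jb3, target="REM"))  # REM versus everything else
```

Because sleep epochs usually far outnumber wake epochs overnight, accuracy is dominated by sensitivity, so a device can score high accuracy while detecting wake poorly.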
Epoch-by-epoch comparisons were executed utilizing MATLAB with all other statistical analyses performed using JMP Pro 11 (SAS; Cary, North Carolina, United States). Alpha equaled 0.05 for statistical significance for all comparisons. Results are presented as mean ± standard deviation unless otherwise noted.
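For readers unfamiliar with the Bland-Altman approach, the sketch below computes the conventional quantities: the mean difference (bias) and the 95% limits of agreement, taken as bias ± 1.96 standard deviations of the paired differences. It is an illustration only; the study's analyses were run in JMP, and the function name `bland_altman` and the example values are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def bland_altman(device: Sequence[float],
                 reference: Sequence[float]) -> Dict[str, float]:
    """Bias and 95% limits of agreement for paired measurements.

    Differences are device minus reference, so a positive bias means
    the device overestimates relative to the reference.
    """
    if len(device) != len(reference):
        raise ValueError("one device value is needed per reference value")
    if len(device) < 2:
        raise ValueError("at least two paired measurements are needed")
    diffs = [d - r for d, r in zip(device, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return {
        "bias": bias,
        "sd": sd,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
    }


# Hypothetical TST values in minutes for three participants
result = bland_altman([400, 420, 450], [380, 390, 400])
print(result)
```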
MSLT Data Analysis
MSLT analysis was aimed specifically at the capability of the JB3 to correctly identify SOREMPs during daytime naps. Similar to methods used for overnight data, JB3 nap periods were defined by MSLT lights-off and lights-on and JB3 data were manually extracted using the Jawbone UP Android Application after device synchronization through Bluetooth integration. True positives (JB3 correctly detecting a nap with SOREMPs), false positives (JB3 incorrectly detecting a nap with SOREMPs), true negatives (JB3 correctly detecting a nap without SOREMPs), and false negatives (JB3 incorrectly detecting a nap without SOREMPs) were determined for each nap. Because most MSLT naps do not contain SOREMPs, sensitivity adjusted for repeated measures within the same subject was the primary outcome of interest.22
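The per-nap classification can be sketched as follows. This is a simplified, pooled calculation: it does not apply the adjustment for repeated measures within subjects described above, and the function names and example values are hypothetical.

```python
from typing import Dict, Sequence


def classify_naps(psg_soremp: Sequence[bool],
                  jb3_soremp: Sequence[bool]) -> Dict[str, int]:
    """Count nap-level outcomes, treating PSG as the reference standard."""
    if len(psg_soremp) != len(jb3_soremp):
        raise ValueError("one JB3 result is needed per PSG-scored nap")
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, detected in zip(psg_soremp, jb3_soremp):
        if truth and detected:
            counts["TP"] += 1
        elif truth:
            counts["FN"] += 1
        elif detected:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


def pooled_sensitivity(counts: Dict[str, int]) -> float:
    """Unadjusted sensitivity; NaN when no nap contained a SOREMP."""
    with_soremp = counts["TP"] + counts["FN"]
    return counts["TP"] / with_soremp if with_soremp else float("nan")


# Hypothetical example: four naps, two with PSG-scored SOREMPs
counts = classify_naps([True, False, False, True], [False, True, False, False])
print(counts, pooled_sensitivity(counts))
```

Sensitivity depends only on naps that truly contain a SOREMP, which is why it remains informative even though such naps are a small minority.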
Fourteen participants were excluded from final analyses because of the presence of moderate-to-severe sleep-disordered breathing (n = 3), JB3 device malfunction (n = 10), and MSLT cancellation due to external circumstances (n = 1). The 43 participants included in final analyses consisted of 29 females and 14 males (mean age = 33.3 ± 11.0 years). Average ESS score was 13.4 ± 3.9, and the overall AHI was 1.7 ± 2.7 events/h. Of the 43 patients included in final analyses, the following diagnoses were made: narcolepsy in 3 (2 with narcolepsy type 1, 1 with narcolepsy type 2), IH in 13, organic/unspecified hypersomnia in 18, mild obstructive sleep apnea in 6 (AHI range 7.1–11.3 events/h), and hypersomnolence related to another condition in 4 (1 shift-work sleep disorder, 1 behaviorally induced insufficient sleep, and 2 medical disorders). The arousal index associated with periodic limb movements of sleep was less than 5 events/h for all participants.
Overall results including sleep variables quantified from each measure, mean differences resulting from Bland-Altman analyses, and epoch-by-epoch comparisons relative to PSG are summarized in Table 1.
Sleep variable values, mean differences, and epoch comparisons.
JB3 Versus PSG and AW2
When the JB3 was compared to the gold standard PSG, JB3 significantly overestimated TST (mean difference of 39.6 minutes, P < .0001) and SE (mean difference of 6.78%, P < .0001), while significantly underestimating WASO (mean difference of −34.3 minutes, P < .0001). Although not reaching statistical significance, JB3 did underestimate SOL (mean difference of −5.13 minutes, P = .17) relative to PSG. Further sleep variable comparisons did not demonstrate significant differences for JB3 REML, TRT, LS, or DS, relative to PSG; however, trends toward overestimation of TRT and DS by the JB3 were observed. Corresponding Bland-Altman plots are presented in Figure 1.
Bland-Altman comparison of JB3 and PSG.
Bland-Altman plots presenting the mean difference values of the JB3 and PSG on the y-axis against PSG values on the x-axis. Horizontal, solid red line denotes the average mean difference with dotted lines representing 95% confidence interval. DS = deep sleep, JB3 = Jawbone UP3, LS = light sleep, PSG = polysomnography, REML = rapid eye movement sleep latency, SE = sleep efficiency, SOL = sleep onset latency, TRT = total rapid eye movement sleep time, TST = total sleep time, WASO = wake after sleep onset.
When compared epoch-by-epoch against PSG, the JB3 sensitivity was relatively good for AS (0.97 ± 0.04), but poor for REM sleep (0.30 ± 0.18), DS (0.49 ± 0.25), and LS (0.60 ± 0.11). The JB3 specificity was relatively good for REM sleep (0.81 ± 0.10) and DS (0.88 ± 0.08), but poor for AS (0.39 ± 0.19) and LS (0.52 ± 0.12). Overall, the JB3 accuracy was relatively good for AS (0.87 ± 0.08) and DS (0.82 ± 0.07), moderate for REM sleep (0.72 ± 0.09), and poor for LS (0.56 ± 0.08).
When the JB3 was compared to AW2, these devices did not have significantly different estimations of TST, SE, or WASO (Table 1). JB3 did significantly overestimate SOL relative to AW2 (mean difference of 7.78 minutes, P = .001); however, JB3 SOL estimates were closer to PSG than those of the AW2. Corresponding Bland-Altman plots are presented in Figure 2.
Bland-Altman comparison of JB3 and AW2.
Bland-Altman plots presenting the mean difference values of the JB3 and AW2 on the y-axis against the mean average values of the JB3 and AW2 on the x-axis. Horizontal, solid red line denotes the average mean difference with dotted lines representing 95% confidence interval. AW2 = Actiwatch 2, JB3 = Jawbone UP3, SE = sleep efficiency, SOL = sleep onset latency, TST = total sleep time, WASO = wake after sleep onset.
AW2 Versus PSG
When the AW2 was compared to the gold standard PSG, AW2 significantly overestimated TST (mean difference of 43.9 minutes, P < .0001) and SE (mean difference of 7.51%, P < .0001), while significantly underestimating SOL (mean difference of −12.9 minutes, P = .0006) and WASO (mean difference of −33.9 minutes, P < .0001). Corresponding Bland-Altman plots are presented in Figure 3.
Bland-Altman comparison of AW2 and PSG.
Bland-Altman plots presenting the mean difference values of the AW2 and PSG on the y-axis against PSG values on the x-axis. Horizontal, solid red line denotes the average mean difference with dotted lines representing 95% confidence interval. AW2 = Actiwatch 2, PSG = polysomnography, SE = sleep efficiency, SOL = sleep onset latency, TST = total sleep time, WASO = wake after sleep onset.
When compared epoch-by-epoch against PSG, the AW2 displayed relatively good AS sensitivity (0.97 ± 0.02) and accuracy (0.87 ± 0.08), with poor specificity (0.31 ± 0.14).
JB3 and AW2 Performance by Diagnosis
Epoch-by-epoch performance characteristics of the JB3 and AW2 relative to PSG within diagnostic categories are presented in Table 2. No significant differences in sensitivity, specificity, or accuracy were observed between groups after correction for multiple comparisons. One patient with obstructive sleep apnea additionally carried a diagnosis of REM sleep behavior disorder, and unlike other participants with obstructive sleep apnea, remained on fixed continuous positive airway pressure throughout the PSG recording. Exclusion of this participant from analyses did not substantially alter findings.
Epoch-by-epoch performance characteristics by diagnosis.
JB3 SOREMP Detection During Daytime Naps
From the 43 participants, there were 167 total naps for comparison between the JB3 and PSG. Of those 167 naps, PSG identified 9 naps with SOREMPs and JB3 detected 18 naps with SOREMPs. However, none of the JB3-detected SOREMPs were congruent with the PSG-identified SOREMPs, resulting in zero true positives, 143 true negatives, 18 false positives, and 9 false negatives. In the absence of a true positive SOREMP detection by the JB3, the sensitivity of the device to detect SOREMPs during MSLT naps was zero.
To our knowledge, this is the first investigation to evaluate the ability of a multisensory, wrist-worn sleep tracking device, the JB3, to quantify and classify sleep in adult patients with suspected CDH, and as such may have significant implications for both clinical care and research within this population. The results of our investigation demonstrate that the JB3 has some utility in estimating TST in patients with suspected CDH; however, the ability to classify sleep, particularly REM sleep, is quite limited when compared to PSG.
The primary results of our study indicate that the JB3, when compared to PSG, significantly overestimates TST and SE, while significantly underestimating WASO. Consistent with previous investigations of wrist-worn sleep tracking devices, the JB3 also displayed an inability to accurately identify wake epochs (AS specificity = 0.39) when compared against PSG.7,8,13 Although a recent investigation by Toon and colleagues14 that evaluated the inaugural version of the Jawbone UP in children with suspected sleep-disordered breathing demonstrated higher specificity (0.66) than observed here, this marginally better performance may have been due to the use of a different generation of device, differing study populations, changes in proprietary algorithms used to detect sleep, or some other methodological difference between studies. Regardless, in our current investigation, the JB3 was also not able to accurately classify sleep into REM sleep (REM sleep sensitivity = 0.30), deep sleep (DS sensitivity = 0.49), and light sleep (LS sensitivity = 0.60) relative to PSG. The JB3's inability to accurately detect REM sleep was further evidenced by its very poor performance in the detection of SOREMPs during MSLT naps.
Interestingly, the JB3 performed comparably to a standard actigraph, the AW2, in estimation of TST, WASO, and SE. Comparability in performance between the JB3 and AW2 was further substantiated by epoch-by-epoch comparisons relative to PSG, where the JB3 and AW2 demonstrated similar sensitivity, specificity, and accuracy. These results align with previous studies showing similarities in capabilities between standard actigraphy and wrist-worn sleep tracking devices,7,14 and suggest that wrist-worn sleep tracking devices may have some utility as a lower-cost alternative to standard actigraphs for field-based estimates of sleep duration. The comparability in assessment of TST between JB3 and standard actigraphy within this population of patients with suspected CDH is noteworthy given the potential use of actigraphy as a means to objectively quantify excessive sleep duration in IH.1 Our results suggest that regardless of whether the JB3 or AW2 is utilized to quantify sleep duration, both devices are likely to overestimate sleep duration relative to PSG. However, these results cannot be extended to conclude that the JB3 can be substituted for a standard actigraph in clinical care, as there may be other important differences in device performance (eg, failure rate, longitudinal performance, etc.) that must be considered before they are utilized for diagnostic purposes. Thus, further research is indicated to assess these performance characteristics between devices, and further refine the parameters and algorithms utilized by wrist-worn accelerometers to quantify TST in patients with CDH. Unfortunately, the rapid evolution of fitness trackers coupled with the proprietary algorithms employed by commercial entities that produce these devices represents an ongoing and significant barrier to research in this sphere.
There are limitations of our study that merit discussion. First, our findings may not generalize equally to both sexes, as participants in this investigation were predominantly female. Second, our findings may not extend to patients with other sleep fragmenting disorders not evaluated in this investigation (eg, moderate-to-severe sleep apnea, periodic limb movement disorder) as the amount of wake and/or movement time that occurs during a sleep period can have sizeable effects on device performance. Third, in a similar manner, it is not clear from this study how the JB3 would perform in persons with REM sleep behavior disorder; however, given the very limited ability of the JB3 to accurately detect REM sleep observed in this investigation, it seems quite unlikely that the device's performance would be improved in patients who lack atonia during REM sleep. Fourth, our study was not designed (or powered) to detect differences in JB3 performance between varying diagnoses. Because prior studies have suggested that actigraphy can aid in distinguishing between narcolepsy and IH,23 future research using the JB3, and other wearable trackers, as diagnostic tools to distinguish among the CDH is indicated. Fifth, only 1 night of data was utilized for this investigation, and thus the results do not reflect the longitudinal capabilities of the JB3 as a sleep-tracking instrument. Finally, our results cannot be extended to other sleep tracking devices, or different generations of the same model, as technological and algorithmic alterations may result in different performance capabilities.
In summary, this investigation demonstrates that the JB3 cannot serve as a surrogate for gold-standard PSG in the quantification and classification of sleep in patients with suspected CDH. However, this multisensory, wrist-worn fitness tracker does demonstrate comparable performance characteristics to a standard actigraph, particularly in the estimation of total sleep duration during a single night of sleep. The consistently evolving frontier of wearable sleep trackers coupled with the proprietary technologies that underlie them presents a major hurdle for researchers attempting to elucidate the capabilities and limitations of these devices. As devices such as the JB3 continue to progress and evolve, it would seem particularly important to either improve REM detection algorithms that depend on heart rate and/or accelerometer data, or possibly incorporate other forms of physiological measurements (eg, portable electroencephalography, arterial tonometry) to improve device performance.24,25 For the foreseeable future, ongoing empiric evaluation that elucidates the capabilities and limitations of these emerging devices in diverse patient populations will remain an area of high value for researchers, clinicians, and consumers of these products.
All authors have seen and approved this manuscript. This work was supported by a grant from the American Sleep Medicine Foundation. Dr. Plante also receives research support from NIMH (K23MH099234). None of the funding sources had any further role in the study design, data collection, analysis and interpretation of the data, and the decision to submit the paper for publication. The authors report no conflicts of interest.