
Validation of a Consumer Sleep Wearable Device With Actigraphy and Polysomnography in Adolescents Across Sleep Opportunity Manipulations



Study Objectives:

To compare the quality and consistency in sleep measurement of a consumer wearable device and a research-grade actigraph with polysomnography (PSG) in adolescents.


Methods:

Fifty-eight healthy adolescents (aged 15–19 years; 30 males) underwent overnight PSG while wearing both a Fitbit Alta HR and a Philips Respironics Actiwatch 2 (AW2) for 5 nights with either 5 hours or 6.5 hours time in bed (TIB) and for 4 nights with 9 hours TIB. AW2 data were evaluated using two different wake and immobility thresholds. Discrepancies in estimated total sleep time (TST) and wake after sleep onset (WASO) between devices and PSG, as well as epoch-by-epoch agreements in sleep/wake classification, were assessed. Fitbit-generated sleep staging was compared to PSG.


Results:

Fitbit and AW2 under default settings similarly underestimated TST and overestimated WASO (TST: AW2 medium setting [M10] ≤ 38 minutes, Fitbit ≤ 47 minutes; WASO: M10 ≤ 38 minutes, Fitbit ≤ 42 minutes). AW2 at the high motion threshold setting provided readings closest to PSG (TST: ≤ 12 minutes; WASO: ≤ 18 minutes). Sensitivity for detecting sleep was ≥ 0.90 for both wearable devices and further improved to 0.95 with the high threshold (H5) setting for the AW2. Wake detection specificity was highest in Fitbit (≥ 0.88), followed by the AW2 at M10 (≥ 0.80) and H5 thresholds (≤ 0.73). In addition, Fitbit inconsistently estimated stage N1 + N2 sleep depending on TIB and underestimated stage N3 sleep (21–46 minutes), but was comparable to PSG for rapid eye movement sleep. Fitbit sensitivity values for the detection of N1 + N2, N3, and rapid eye movement sleep were ≥ 0.68, ≥ 0.50, and ≥ 0.72, respectively.


Conclusions:

A consumer-grade wearable device can measure sleep duration as well as a research actigraph. However, sleep staging would benefit from further refinement before these methods can be reliably used for adolescents.

Clinical Trial Registration:

Title: The Cognitive and Metabolic Effects of Sleep Restriction in Adolescents; Identifier: NCT03333512


Lee XK, Chee NIYN, Ong JL, Teo TB, van Rijn E, Lo JC, Chee MWL. Validation of a consumer sleep wearable device with actigraphy and polysomnography in adolescents across sleep opportunity manipulations. J Clin Sleep Med. 2019;15(9):1337–1346.


Current Knowledge/Study Rationale: Consumer sleep trackers are an attractive alternative to expensive research actigraphs for measuring sleep. However, validation studies in adolescent populations are limited and typically conducted on only 1 night of sleep. We compared a consumer sleep/activity tracker and a research-grade actigraph with polysomnography (PSG) over different sleep opportunities and across multiple nights.

Study Impact: Sleep estimation was comparable between the consumer wearable device and research-grade actigraphy on default settings. Both underestimated sleep duration compared to PSG. Sleep estimation improved in the research actigraph by adjusting sensitivity to motion. With data-driven customization, consumer wearable devices could replace research actigraphs for large-scale total sleep measurement. Sleep staging still lags behind PSG and needs further work, particularly for assessment of stage N3 sleep.


Sleep is increasingly recognized as important for health and well-being. Alongside this growing awareness, wearable devices that incorporate accelerometers have proliferated on a massive scale. Annual global sales, estimated at under 25 million in 2014, are expected to reach 125 million by the end of 2018.1 Growth in sales of smartwatches in particular has been even more dramatic, leaping from 5 million to 80 million over the same interval. Originally intended to track physical activity, many wearable devices now incorporate algorithms that provide outputs on sleep.2–4

Although polysomnography (PSG) remains the gold standard for quantifying sleep, wrist actigraphy based on movement (where the absence of motion implies sleep) is an inexpensive and widely available proxy for estimating sleep in nonlaboratory settings. Actigraphy is well suited for large-scale longitudinal surveys of personal and/or community sleep habits and of how these influence health and well-being. It has been well validated against PSG in adults and is widely adopted in research and clinical settings.5–7 To date, expensive “research-grade” actigraphs remain the mainstay in scientific studies, influenced by mixed reports about the reliability,8–12 particularly the accuracy of sleep detection, of earlier consumer devices. However, constant advancements, including the use of heart rate variability (HRV) measurement to estimate sleep stages,13,14 and recent reports of good agreement with research devices15,16 motivate a detailed re-evaluation of consumer-grade devices.

The current report has several features that may better inform the feasibility of using consumer-grade activity trackers to estimate sleep in research studies. First, we collected multinight sleep data per individual across three levels of sleep opportunity (5 hours, 6.5 hours, and 9 hours), concurrently comparing a relatively new consumer wearable device that incorporates heart rate measurements to augment sleep/wake classification (Fitbit Alta HR, Fitbit Inc., San Francisco, California) with a research actigraph (Actiwatch 2, Philips Respironics Inc., Pittsburgh, Pennsylvania). Both devices were referenced to PSG sleep measurement. Second, we focused on an adolescent sample. Although actigraphy tends to overestimate sleep in adults, some studies have found underestimation of sleep in adolescents.17–19 To examine how sensitivity to motion might affect appropriate sleep detection, we used two different motion sensitivity settings to evaluate sleep. Finally, we assessed how well the consumer wearable device could stage adolescent sleep.



Participants consisted of 58 adolescents aged 15 to 19 years (mean ± standard deviation [SD]: 16.6 ± 0.94 years; 30 males) who took part in a study examining the cognitive and metabolic effects of sleep restriction in adolescents. They were recruited through social media, online advertisements, talks, and word of mouth. Consent was obtained from both participants and their legal guardians. Participants had no known health conditions or sleep disorders, were not habitual short sleepers (self-reported total sleep time [TST] < 5 hours on weekdays concurrent with ≤ 1 hour of weekend sleep extension), and did not travel across more than two time zones in the month prior to the study.

Study Protocol

Participants underwent a 14-night evaluation (Figure S1 in the supplemental material) in a boarding school under quasi-laboratory conditions. The study protocol was approved by the Institutional Review Board of the National University of Singapore and was conducted in accordance with the principles of the Declaration of Helsinki. The sleep schedule was designed to simulate one-and-a-half cycles of shortened sleep on weekdays and extended sleep on weekends in adolescents. Students who were randomized into the continuous (n = 29) and the split (n = 29) sleep groups did not significantly differ in demographic or habitual sleep characteristics (Table S1 in the supplemental material). During the 2 baseline and 4 recovery nights, all participants had a 9-hour (11:00 pm to 8:00 am) sleep opportunity. On 8 manipulation nights, participants in the continuous sleep group had a 6.5-hour (12:15 am to 6:45 am) nocturnal sleep opportunity, whereas those in the split sleep group had a 5-hour (1:00 am to 6:00 am) nocturnal sleep opportunity plus a 1.5-hour (2:00 pm to 3:30 pm) afternoon nap following the night of restricted sleep. Actigraphy and Fitbit data were recorded throughout the protocol. PSG data were available for 9 nights: 2 baseline nights, 5 sleep restriction nights, and 2 recovery nights (Figure S1, asterisks). All devices were synchronized to a common Internet time server to ensure proper alignment of time-stamped data. Because the first baseline night was an adaptation night, its data were not analyzed.


Electroencephalography (EEG) was performed using a SOMNOtouch recorder (SOMNOmedics GmbH, Randersacker, Germany) on two channels (C3 and C4 in the international 10-20 system). Contralateral mastoids were used as references. Electrodes placed at Cz and Fpz were used as common reference and ground electrodes respectively. Electrooculography (EOG) and submental electromyography (EMG) were also used. Impedance was kept below 5 kΩ for EEG and 10 kΩ for EOG and EMG electrodes. Signals were sampled at 256 Hz and filtered between 0.2 and 35 Hz for EEG, and between 0.2 and 10 Hz for EOG.

Sleep periods were set according to the times of lights on and off. Sleep stages were automatically scored in 30-second epochs using an in-house algorithm20 in conjunction with the FASST toolbox (∼phillips/FASST.html), and were visually reviewed by trained technicians following the criteria set by The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications to ensure accuracy of the staging.21

TST was computed by totaling epochs of stage N1, N2, N3, and rapid eye movement (REM) sleep, whereas wake after sleep onset (WASO) was defined by the sum of epochs scored as wake after the first stage N1 or N2 sleep epoch (because Fitbit does not distinguish between these two sleep stages). For sleep staging comparisons with Fitbit, PSG epochs classified as stage N1 and N2 sleep were categorized as “light sleep” and stage N3 sleep PSG epochs as “deep sleep.”
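As a minimal sketch of the definitions above (not the authors' scoring code), the following computes TST and WASO from a list of 30-second PSG stage labels and collapses PSG stages into the Fitbit categories. The label strings (`"W"`, `"N1"`, `"light"`, and so on) and the function names are assumptions chosen for illustration.

```python
EPOCH_MIN = 0.5  # each epoch is 30 seconds, as in the study

SLEEP_STAGES = {"N1", "N2", "N3", "REM"}

# Mapping used for comparisons with Fitbit: N1 and N2 become "light", N3 becomes "deep".
PSG_TO_FITBIT = {"W": "wake", "N1": "light", "N2": "light", "N3": "deep", "REM": "rem"}


def tst_and_waso(stages):
    """Return (TST, WASO) in minutes for one night of per-epoch PSG labels."""
    tst = sum(1 for s in stages if s in SLEEP_STAGES) * EPOCH_MIN
    # Sleep onset is the first epoch scored N1 or N2, following the text.
    onset = next((i for i, s in enumerate(stages) if s in ("N1", "N2")), None)
    if onset is None:
        return tst, 0.0
    waso = sum(1 for s in stages[onset:] if s == "W") * EPOCH_MIN
    return tst, waso


def to_fitbit_labels(stages):
    """Collapse PSG stages into the wake/light/deep/rem categories."""
    return [PSG_TO_FITBIT[s] for s in stages]


night = ["W", "W", "N1", "N2", "W", "N3", "REM", "W"]
print(tst_and_waso(night))      # TST and WASO in minutes
print(to_fitbit_labels(night))
```

Note that, under this definition, wake epochs between the final awakening and lights on also count toward WASO because they fall inside the scheduled sleep period.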


Participants wore an Actiwatch 2 (AW2) on their nondominant hand. Data were collected in 30-second epochs, with sleep periods manually defined by lights on and off times, matching those of PSG. Data were processed using Actiware (version 6.0.7, Philips Respironics Inc., Pittsburgh, Pennsylvania), using two wake threshold and immobility settings. The default setting utilizes a medium wake threshold (40 counts per epoch) with 10 immobile minutes (M10) for sleep onset and end. In addition, an optimized setting employing a high wake threshold (80 counts per epoch) and 5 immobile minutes (H5) was also included as a comparison based on prior findings suggesting increased movement during sleep in adolescents.18,22 TST was computed by summing all sleep epochs within the sleep periods, whereas WASO was defined by summing all wake epochs after the first sleep epoch.

Consumer Wearable Device

During the protocol, each participant wore a Fitbit Alta HR, hereafter simply referred to as Fitbit, on their nondominant hand. This device tracks motion and HRV via accelerometers and optical plethysmography respectively in 30-second epochs. A proprietary classification algorithm utilizing accelerometry and HRV signals23 classifies epochs into wake, or one of three sleep stages: light, deep, or REM sleep. Data were then wirelessly transferred to a smartphone application, and batch extracted using a third-party data management platform (Fitabase, San Diego, California).

Given that Fitbit sleep periods are automatically defined, measures were taken to ensure that time in bed (TIB) was identical across all three instruments. Specifically, if Fitbit sleep onset or offset timings occurred between scheduled lights off and lights on timings, wake epochs were inserted at the beginning and/or the end of the record. Conversely, if sleep onset or offset timings occurred outside of scheduled timings, wake periods at the beginning and/or the end of the record were truncated. TST was computed by summing the duration of all Light, Deep, and REM epochs within each sleep period, whereas WASO was defined by summing all wake epochs after the first epoch of sleep.
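The padding and truncation step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: epochs are indexed on a shared clock, `record_start` and `lights_off` are hypothetical epoch indices, and anything outside the scheduled window is simply dropped.

```python
def align_to_tib(labels, record_start, lights_off, n_tib_epochs, wake="wake"):
    """Force a device record onto the scheduled time-in-bed window.

    labels        -- per-epoch labels of the device record
    record_start  -- epoch index at which the device record begins
    lights_off    -- epoch index of scheduled lights off
    n_tib_epochs  -- number of epochs between lights off and lights on
    """
    offset = record_start - lights_off
    if offset >= 0:
        # Record starts after lights off: insert wake epochs at the beginning.
        aligned = [wake] * offset + list(labels)
    else:
        # Record starts before lights off: truncate the early epochs.
        aligned = list(labels)[-offset:]
    if len(aligned) < n_tib_epochs:
        # Record ends before lights on: insert wake epochs at the end.
        aligned += [wake] * (n_tib_epochs - len(aligned))
    # Record ends after lights on: truncate the late epochs.
    return aligned[:n_tib_epochs]


print(align_to_tib(["light", "deep", "rem"], record_start=2, lights_off=0, n_tib_epochs=6))
```

After alignment every instrument contributes the same number of epochs per night, which is what makes the epoch-by-epoch comparisons possible.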

Data Analysis

SPSS 24.0 (IBM Corp., Armonk, New York) was used for all statistical analyses. First, to investigate whether the discrepancy from PSG in TST and WASO estimates differed across nights within each TIB condition (5 hours, 6.5 hours, and 9 hours), a general linear mixed model was performed with device setting (M10, H5, and Fitbit) and measurement night as factors. No significant interaction effects between device setting and night were found (P ≥ .10), indicating that biases were similar in magnitude across all measurement nights. Subsequent analyses were thus performed on intrasubject averaged TST and WASO.

Next, one-sample t tests against zero were used to determine if estimations of TST and WASO by devices were significantly biased, that is, overestimated or underestimated, relative to PSG. In addition, for each TIB condition, a repeated-measures analysis of variance (ANOVA) was conducted for TST and WASO separately to compare differences in biases between device settings, followed by post hoc t tests to discern significant pairwise differences. Furthermore, for each TIB condition, Bland-Altman plots24 for TST were generated by plotting the bias of each device setting from PSG against the TST averaged across the device setting and PSG. Similar Bland-Altman plots were created for WASO per TIB condition. To determine if bias magnitudes were proportional to the TST or WASO measure averaged across the device setting and PSG, simple linear regression was performed. In a secondary analysis, we also investigated the effects of sex on TST estimates for each TIB condition (supplemental material). As differential effects only occurred in the 5-hour TIB condition and in actigraphy, we did not perform further sex-related analyses.
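The bias, its one-sample t statistic against zero, and the Bland-Altman limits of agreement can be illustrated with a short standard-library sketch. The analyses in the study were run in SPSS; the function name and the toy numbers below are assumptions, and the P value (which needs the t distribution) is left out.

```python
import math
import statistics


def bland_altman(device, psg):
    """Summarize device-minus-PSG differences for paired per-participant values (minutes)."""
    diffs = [d - p for d, p in zip(device, psg)]
    means = [(d + p) / 2 for d, p in zip(device, psg)]  # x-axis of the plot
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    n = len(diffs)
    return {
        "bias": bias,
        "sd": sd,
        "loa": (bias - 1.96 * sd, bias + 1.96 * sd),  # limits of agreement
        "t": bias / (sd / math.sqrt(n)),  # one-sample t against zero
        "df": n - 1,
        "means": means,
        "diffs": diffs,
    }


# Hypothetical TST values in minutes, for illustration only.
result = bland_altman(device=[400, 410, 390, 420], psg=[430, 440, 425, 445])
print(result["bias"], result["loa"], result["t"])
```

With differences defined as device minus PSG, a negative bias corresponds to underestimation by the device.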

Finally, to quantify the agreement in sleep-wake categorization between actigraphy/Fitbit and PSG, epoch-by-epoch (EBE) analyses were conducted for deriving three agreement measures: accuracy (ability to correctly classify epochs), sensitivity (ability to detect sleep), and specificity (ability to detect wake) for each device setting. Repeated-measures ANOVAs for each EBE agreement measure were similarly conducted within each TIB condition to compare differences in agreement performance between different device settings and were also followed up with post hoc t tests. Also, to quantify the agreement in sleep staging between Fitbit and PSG, for each sleep stage (light sleep, deep sleep, and REM), an EBE analysis was conducted, and a Bland-Altman plot was generated to show the duration discrepancies between instruments against the average duration assessed with Fitbit and PSG.
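The three sleep-wake agreement measures follow directly from the definitions above, with sleep treated as the positive class. This is a minimal sketch; the function name and the boolean encoding (True for sleep) are assumptions.

```python
def ebe_metrics(device_sleep, psg_sleep):
    """Epoch-by-epoch agreement with PSG as reference; True means the epoch is sleep."""
    pairs = list(zip(device_sleep, psg_sleep))
    tp = sum(1 for d, p in pairs if d and p)          # sleep scored as sleep
    tn = sum(1 for d, p in pairs if not d and not p)  # wake scored as wake
    fp = sum(1 for d, p in pairs if d and not p)      # wake scored as sleep
    fn = sum(1 for d, p in pairs if not d and p)      # sleep scored as wake
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,  # ability to detect sleep
        "specificity": tn / (tn + fp) if (tn + fp) else nan,  # ability to detect wake
    }


psg = [True, True, True, True, False, False, False, True, True, True]
dev = [True, True, False, True, False, True, False, True, True, True]
print(ebe_metrics(dev, psg))
```

Because sleep epochs far outnumber wake epochs in a night, accuracy is dominated by sensitivity, which is why specificity is reported separately.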


Fifty-seven participants contributed to the final sample, as one participant in the continuous sleep group dropped out. In addition, data loss from technical issues with Fitbit (58 records), Actiwatch 2 (2 records), and PSG (12 records), and the exclusion of 11 outlier recordings (> 3 SDs), resulted in a final sample consisting of 386 nights of data common to all devices, with each participant contributing between 3 and 7 nights of data.

PSG-determined sleep architecture for the final sample is provided in Table 1. Results of one-sample t tests used to determine the significance of device setting-PSG biases are summarized in Table 2. Bland-Altman plots representing device setting-PSG biases for sleep-wake and sleep-stage analyses are presented in Figure 1 and Figure 2, respectively. Results of simple linear regression used to investigate proportional biases are summarized in Table 3. Confusion matrices for sleep-wake and sleep-stage classification are presented in Table 4 and Table 5, respectively, whereas EBE agreement metrics are provided in Table 6.

Table 1 Polysomnography-determined sleep architecture.

Table 2 Biases of each device setting from polysomnography, grouped by TIB condition.

Figure 1: Bland-Altman plots for total sleep time and wake after sleep onset.

Bland-Altman plots, in minutes, of (A) total sleep time and (B) wake after sleep onset. Red, green, and blue points represent data collected from the 5-hour, 6.5-hour, and 9-hour time in bed conditions respectively. Solid lines and bolded numbers represent the mean biases of each recording, whereas dashed lines and regular numbers represent 1.96 standard deviation limits of agreement. H5 = Actiwatch 2 high wake threshold with 5 immobile minutes for sleep onset and end, M10 = Actiwatch 2 medium wake threshold with 10 immobile minutes for sleep onset and end, PSG = polysomnography.

Figure 2: Bland-Altman plots for sleep stages.

Bland-Altman plots, in minutes, of (A) stage N1 + N2 sleep (light sleep), (B) stage N3 sleep (deep sleep), and (C) REM sleep. Red, green, and blue points represent data collected from the 5-hour, 6.5-hour, and 9-hour time in bed conditions, respectively. Solid lines and bold numbers represent the mean biases of each recording, whereas dashed lines and regular numbers represent 1.96 standard deviation limits of agreement. H5 = Actiwatch 2 high wake threshold with 5 immobile minutes for sleep onset and end, M10 = Actiwatch 2 medium wake threshold with 10 immobile minutes for sleep onset and end, PSG = polysomnography, REM = rapid eye movement.

Table 3 Proportional biases associated with sleep duration observed in each device setting and grouped by TIB condition.

Table 4 Confusion matrices of each device setting by TIB group.

Table 5 Confusion matrices of Fitbit sleep staging by TIB group.

Table 6 EBE agreement metrics, referenced to PSG, of each device setting grouped by TIB condition.

Actiwatch 2 M10 Versus PSG

M10 significantly underestimated TST and overestimated WASO in all TIB conditions. M10 underestimated TST by an average of 24 to 38 minutes (t ≥ 8.19, P < .001; Table 2). WASO duration was overestimated by an average of 22 to 38 minutes (t ≥ 10.14, P < .001). EBE comparisons indicated comparable agreement across all TIBs. Sleep-wake discrimination accuracy was excellent (0.89 to 0.90; Table 6), with sensitivities ranging from 0.90 to 0.91, and good specificities from 0.80 to 0.86.

Actiwatch 2 H5 Versus PSG

H5 showed better agreement with PSG but still underestimated TST and overestimated WASO in all TIB conditions. TST was underestimated by an average of 7 to 12 minutes (t ≥ 3.49, P ≤ .002; Table 2) while WASO was overestimated by an average of 11 to 18 minutes (t ≥ 8.19, P < .001). Sleep-wake accuracies ranged from 0.93 to 0.94 (Table 6). Sensitivity was 0.95 across all TIBs, whereas specificities were acceptable, ranging from 0.64 to 0.73.

Fitbit Versus PSG

Fitbit significantly underestimated TST and overestimated WASO across all TIB conditions. TST was underestimated by an average of 24 to 47 minutes (t ≥ 15.62, P < .001; Table 2), whereas WASO was overestimated by an average of 21 to 41 minutes (t ≥ 15.14, P < .001). EBE comparisons indicated excellent accuracy and sensitivity of around 0.90 across all TIB conditions (Table 6). Specificity was between 0.88 and 0.90.

Concerning sleep-stage classification, biases depended on the sleep stage and TIB condition examined. Fitbit overestimated stage N1 + N2 sleep (light sleep) by an average (SD) of 9.9 (19.7) minutes (t = 2.70, P = .012) in the 5-hour TIB condition; did not differ significantly from PSG in the 6.5-hour TIB condition (t = 0.62, P = .54); and underestimated stage N1 + N2 sleep by an average (SD) of 20.7 (35.8) minutes (t = 4.36, P < .001) in the 9-hour TIB condition. The device consistently underestimated stage N3 sleep (deep sleep) duration in all TIB conditions, by an average of 21.5 to 46.4 minutes (t ≥ 8.19, P < .001). No significant differences were observed for REM sleep (t ≤ 1.54, P ≥ .13) in any TIB condition.

EBE comparisons of Fitbit’s sleep staging algorithm indicated average accuracy of 0.68 to 0.71 for stage N1 + N2 sleep (light sleep), 0.50 to 0.64 for stage N3 sleep (deep sleep), and 0.72 to 0.74 for REM sleep. Confusion matrices (Table 5) indicate that, on average, misclassifications of PSG stage N1 + N2 sleep occurred mostly either as REM sleep (0.10 to 0.13) or wake (0.11 to 0.13); misclassifications of PSG stage N3 sleep occurred mostly as light sleep (0.31 to 0.43), and misclassifications of PSG REM sleep occurred mostly as light sleep (0.15 to 0.16).
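Proportions of this kind come from a confusion matrix normalized by PSG stage, so that each row sums to 1. The sketch below shows one way to build it; the stage labels, function name, and toy data are assumptions for illustration and do not reproduce the values in Table 5.

```python
from collections import Counter

STAGES = ["wake", "light", "deep", "rem"]


def confusion_proportions(psg, fitbit):
    """Row-normalized confusion matrix: for each PSG stage, the share of its
    epochs that the device assigned to each category."""
    counts = Counter(zip(psg, fitbit))
    matrix = {}
    for true in STAGES:
        row_total = sum(counts[(true, pred)] for pred in STAGES)
        matrix[true] = {
            pred: (counts[(true, pred)] / row_total if row_total else float("nan"))
            for pred in STAGES
        }
    return matrix


psg = ["deep"] * 4 + ["light"] * 4 + ["rem"] * 2
fitbit = ["deep", "deep", "light", "light",
          "light", "light", "light", "wake",
          "rem", "light"]
matrix = confusion_proportions(psg, fitbit)
print(matrix["deep"])   # share of PSG deep-sleep epochs per device label
```

The diagonal entry of each row is the per-stage agreement, and the off-diagonal entries show where the misclassified epochs went.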

Comparison Among Actiwatch 2 M10, Actiwatch 2 H5, and Fitbit

Magnitudes of device setting-PSG biases for TST and WASO (Table 2), and EBE agreement metrics (Table 6), between M10, H5, and Fitbit were compared. All ANOVAs were significant (TST: F ≥ 86.11, P < .001; WASO: F ≥ 54.61, P < .001; accuracy: F ≥ 33.03, P < .001; sensitivity: F ≥ 79.84, P < .001; specificity: F ≥ 33.80, P < .001) for all TIB conditions examined.

Post hoc pairwise comparisons indicated that H5 had on average significantly less TST underestimation than M10 by 17 to 27 minutes (t ≥ 13.14, P < .001), and Fitbit by 17 to 35 minutes (t ≥ 13.14, P < .001) across all TIBs. Additionally, H5 had on average significantly less WASO overestimation than M10 by 11 to 20 minutes (t ≥ 9.59, P < .001), and Fitbit by 10 to 24 minutes (t ≥ 10.07, P < .001).

EBE analyses indicated that H5 had on average small but significantly higher sleep-wake classification accuracies than M10 by 0.03 (t ≥ 7.02, P < .001), and Fitbit by 0.02 to 0.04 (t ≥ 6.35, P < .001) across all TIBs. H5 also had on average small but significantly higher sensitivity values than M10 by 0.04 to 0.05 (t ≥ 10.71, P < .001), and Fitbit by 0.04 to 0.05 (t ≥ 12.06, P < .001) across all TIBs. However, this came at a cost of lower specificity values: H5 was lower than M10 by 0.13 to 0.15 (t ≥ 7.30, P < .001), and lower than Fitbit by 0.16 to 0.24 (t ≥ 6.30, P < .001).

M10 and Fitbit underestimated TST comparably in the 5-hour TIB condition (t = 0.17, P = .87). Fitbit underestimated TST more than M10 in the 6.5-hour (mean [SD] difference = 5.0 [12.6] minutes, t = 2.11, P = .044) and 9-hour conditions (mean [SD] difference = 8.7 [19.4] minutes, t = 3.40, P = .001). However, M10 and Fitbit had similar overestimations of WASO across all TIB conditions (t ≤ 1.91, P ≥ .06).

Sleep-wake accuracies of M10 and Fitbit did not significantly differ in the 6.5-hour and 9-hour TIB conditions (t ≤ 1.18, P ≥ .25). Fitbit was only slightly more accurate than M10 in the 5-hour TIB condition (mean [SD] = 0.01 [0.02]; t = 2.45, P = .019). Sensitivity values of M10 and Fitbit were comparable across all TIBs (t ≤ 1.73, P ≥ .09). Finally, although specificity was higher in Fitbit than in M10 in both the 5-hour (mean [SD] = 0.08 [0.17]; t = 2.65, P = .013) and 9-hour TIB conditions (mean [SD] = 0.07 [0.12]; t = 4.56, P < .001), the two had comparable performance in the 6.5-hour TIB condition (t = 1.99, P = .06).

Proportional Biases

Across all TIBs, M10 demonstrated significant increases in TST and WASO estimation biases with increasing sleep durations (Table 3). The amount of underestimation increased by 0.40 to 0.94 minute per minute of TST, whereas the amount of overestimation increased by 0.75 to 1.31 minutes per minute of WASO (F ≥ 10.72, P ≤ .003). H5 and Fitbit demonstrated a similar relationship for TST only in the 6.5-hour TIB condition (H5: B = 0.65 minutes, F = 17.18, P < .001; Fitbit: B = 0.69 minutes, F = 12.01, P = .002); no other TIB condition demonstrated significant relationships (F ≤ 3.98, P ≥ .06). However, H5 and Fitbit demonstrated increasing estimation biases for WASO across all TIBs, with H5 biases increasing by 0.32 to 0.80 minutes (F ≥ 6.76, P ≤ .012) and Fitbit biases increasing by 0.31 to 0.86 minutes (F ≥ 4.46, P ≤ .039) per minute of WASO. There was generally no significant relationship between the amount of bias by Fitbit and sleep stage duration across all TIBs (F ≤ 1.56, P ≥ .223). The exception to this was a decrease in the magnitude of stage N3 sleep (deep sleep) bias by 1.03 minutes in the 6.5-hour TIB condition (F = 12.33, P = .002) and by 0.84 minutes in the 9-hour TIB condition (F = 9.81, P = .003) per minute of stage N3 sleep (deep sleep).
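The slopes reported here come from regressing each device setting's bias on the measure averaged across device and PSG. A minimal ordinary-least-squares sketch is shown below; the function name and the toy numbers are assumptions, the sign convention (device minus PSG) is a choice made for this example, and significance testing (the F statistic) is omitted.

```python
def proportional_bias(means, diffs):
    """Ordinary least-squares fit of bias (diffs) on the device/PSG average (means).

    Returns (slope, intercept, r_squared). The slope is the change in bias,
    in minutes, per additional minute of the averaged measure.
    """
    n = len(means)
    mx = sum(means) / n
    my = sum(diffs) / n
    sxx = sum((x - mx) ** 2 for x in means)
    sxy = sum((x - mx) * (y - my) for x, y in zip(means, diffs))
    syy = sum((y - my) ** 2 for y in diffs)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy ** 2) / (sxx * syy) if syy else float("nan")
    return slope, intercept, r_squared


# Hypothetical averaged TST (minutes) and device-minus-PSG differences.
slope, intercept, r2 = proportional_bias([300, 350, 400, 450], [-20, -25, -30, -35])
print(slope, intercept, r2)
```

With differences defined as device minus PSG, a negative slope for TST means that underestimation grows as sleep duration increases.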


We assessed how well a contemporary consumer-grade wearable device assessed sleep compared to a research-grade actigraph and PSG. At default settings, both Fitbit Alta HR and AW2 performed comparably. Both devices systematically underestimated sleep in adolescents by an average of approximately 30 minutes. This underestimation increased when sleep opportunity was lengthened, likely as a result of reduced sleep efficiency (greater wake within a given sleep opportunity) and a tendency of these devices to overestimate wakefulness. Reducing motion sensitivity during sleep in the AW2 device yielded TST and WASO measurements closer to those obtained with PSG, but at the expense of slightly worsened detection of wakefulness. Fitbit estimation of sleep stages was good for stage N1 and N2 sleep, as well as REM sleep, but stage N3 sleep was systematically underestimated.

Consumer-Grade Actigraphy Has Caught Up With Research-Grade Devices

A key finding of the current work is that at default settings, and for the assessment of adolescent sleep, the Fitbit Alta HR performed comparably with the research-grade AW2, which costs about three times as much. Additionally, Fitbit readouts showed less deviation in TST measurement relative to PSG than Actiwatch 2 at default (medium sensitivity) settings. Fitbit demonstrated the best wake specificities of all device settings considered across the different TIB conditions, and this likely reflects the benefit of incorporating heart rate sensing into the classification of sleep and wake. Current findings document a clear advance of consumer wearable devices for the purpose of measuring sleep relative to prior comparisons between consumer and research devices. An additional advantage of Fitbit devices is that sleep data can be wirelessly synchronized by participants to a data cloud, allowing monitoring of data as they are collected. This saves the time otherwise spent physically downloading data through a proprietary dock one unit at a time, as with conventional research devices.

Actigraphy Underestimates Sleep of Healthy Adolescents

Currently, there are conflicting data about the accuracy of actigraphy for assessing adolescent sleep. In two studies, actigraphy underestimated sleep in adolescents,18,19 whereas another found either correct estimation or overestimation of sleep by actigraphy in older adolescents, depending on device sensitivity settings.16 In addition, at least two sleep diary + actigraphy studies have shown significant underestimation of adolescent sleep with actigraphy,22,25 but neither had PSG confirmation of sleep duration.

The current work shows actigraphy to clearly underestimate sleep and to overestimate WASO in healthy older adolescents studied over multiple nights with PSG and over different sleep opportunity durations. A likely reason for sleep underestimation relates to greater movement compared to adults during healthy adolescent sleep.18,22 Inclusion of data from a clinical population may have masked such increased movement during sleep in an earlier study, as patients tend to move less.16

Estimation biases increased with TIB duration. Underestimation of TST was more pronounced at 9-hour TIB than at 5-hour TIB. Similarly, overestimation of WASO was larger with longer TIB. In our sample, sex effects were not sufficiently consistent across different sleep schedules to merit correction. This information regarding estimation biases as a function of adolescence and TIB allows finer-grained customization of sleep evaluation using actigraphy and could make for better estimates of TST and WASO in future consumer wearable devices. The value of “tuning” sleep detection to the patient is illustrated in the comparison between M10 and H5 (lower sensitivity to motion) settings in Actiwatch devices, the latter giving rise to superior accuracy of sleep detection with some tradeoff in the form of poorer detection of wakefulness.

Fitbit Sleep Staging

REM sleep estimation by Fitbit was accurate on average across all TIB conditions considered. However, stage N3 sleep (deep sleep) was consistently underestimated. Our findings replicate those of de Zambotti and colleagues.11 Stage N3 sleep underestimation was more pronounced in shorter TIB conditions as compared to the longer 9-hour TIB condition. The estimation of stage N1 + N2 sleep (light sleep) was also affected by sleep duration: it was overestimated in the 5-hour TIB condition and underestimated in the 9-hour TIB condition.

EBE comparisons between Fitbit and PSG shed some light on the device's overall sleep staging performance. Stage N1 + N2 and REM sleep demonstrated accuracies of approximately 70% across all TIB conditions. Stage N3 sleep classification accuracies were much poorer, especially at the shorter recording durations of 5-hour and 6.5-hour TIBs. Contributing to these errors, stage N3 sleep epochs, similar to REM sleep epochs, were most commonly misclassified as light sleep. As sleep restriction has been shown to cause an increase in sympathetic activity evidenced by alterations to HRV,26 this could have affected accurate staging.

Strengths and Limitations

A key strength of the current study is the evaluation of more than 50 healthy adolescents over multiple nights in carefully controlled settings and with concurrent PSG. The consumer wearable device tested belongs to a new generation of devices in which physiological sensors other than motion (eg, heart rate, skin temperature, conductance) are used to differentiate sleep and wake. A fuller evaluation of newer consumer-grade wearable devices would need to test other devices and include young adults as well as older persons in the study sample to confirm our suggestion regarding the utility of age and sleep duration customization of sleep measurement to improve accuracy. We are also unable to comment on how this particular wearable device would perform in persons with medical conditions. In addition, because of the limited number of EEG channels afforded by the PSG setup, only C3 and C4 derivations were recorded. This could have affected the scoring of (1) N1 sleep onset based on occipital alpha rhythm attenuation, as well as (2) the amount of N3 recorded, as signals from the central electrodes are typically less prominent than those recorded from the frontal electrodes.

Notably, although many previous reports were motivated by investigators seeking to use actigraphy in clinical settings, the rapid growth in adoption of consumer wearable devices is driven by aspirations to improve personal health and well-being in mostly healthy persons. The favorable price-to-performance ratio of these new devices makes them very attractive for large-scale longitudinal biobank-type studies, where sleep is increasingly recognized as a health variable that should be tracked and analyzed when creating models of healthy lifestyles.


In healthy adolescents, a new generation of consumer-grade wearable activity/sleep trackers exemplified by the Fitbit Alta HR generates sleep/wake data that are comparable to those from a well-known research actigraph at default settings, which costs about three times as much. Age and sleep opportunity should be considered as variables for tuning the performance of such wearable devices for sleep/wake estimation in the future. Wearable device sleep staging, although somewhat adequate for detecting stage N1 + N2 and REM sleep, significantly underestimates stage N3 sleep, and consumers should be made aware of this point to allay anxiety when comparing their sleep stages to PSG-based norms.


All authors have reviewed and approved the manuscript. All adolescent participants and their legal guardians provided written informed consent. This work was approved by the Institutional Review Board of the National University of Singapore, and was supported by the National Medical Research Council, Singapore (NMRC/StaR/015/2013) and the Far East Organization. Alta HR units were provided under an unrestricted gift from Fitbit. The authors report no conflicts of interest.



AW2, Actiwatch 2
H5, Actiwatch high wake threshold with 5 immobile minutes for sleep onset and end
HRV, heart rate variability
M10, Actiwatch medium wake threshold with 10 immobile minutes for sleep onset and end
REM, rapid eye movement
TIB, time in bed
TST, total sleep time
WASO, wake after sleep onset

