Ecologists and conservation biologists often rely upon occupancy or count-based surveys to monitor species population status and trends, determine habitat associations, and examine species responses to environmental change (MacKenzie et al. 2006). In bird studies these surveys are based largely upon the detection of acoustic cues (Matsuoka et al. 2014), which often account for greater than 90% of the detections in the data (e.g., Alquezar and Machado 2015). In particular, point count surveys (during which all birds heard or seen during a specified duration of time are recorded) are one of the most frequently used survey techniques in ornithology (Matsuoka et al. 2014), and autonomous recording units (ARUs) are increasingly being used for such studies (Shonfield and Bayne 2017). Despite their common use, there is fairly widespread recognition that point counts are prone to detection errors, and there has been a concerted effort to adopt field and statistical methods to deal with false absences in point count data (e.g., MacKenzie et al. 2002, 2006, Royle and Link 2006). More recently, increased recognition that species misidentifications are also relatively common in acoustic survey data (Simons et al. 2007, 2009, Farmer et al. 2012, Miller et al. 2012, Chambert et al. 2015, Banner et al. 2018) has prompted the development of statistical and sampling methods to account for misidentification (false positives; Royle and Link 2006, Miller et al. 2011, 2012, Ferguson et al. 2015, Clement 2016, Banner et al. 2018).
Both false negative (FN) and false positive (FP) errors lead to misclassification of site occupancy (Nichols 2019): an FN error leads to misclassification of the site as unoccupied, whereas an FP error leads both to misclassification of the site as unoccupied for the species that was actually present and to misclassification of the site as occupied for the species it was mistaken for. These are important errors, and accurate estimation of site occupancy (ψ) in the presence of both FP and FN errors requires additional data beyond those normally collected in occupancy studies (Miller et al. 2011, 2012, 2013). Specifically, additional data are required to allow explicit modeling of the FP detection process (Miller et al. 2011, 2012, 2013, Chambert et al. 2015). Occupancy models designed to account for species identification error hold promise for correcting biases in point count surveys (Royle and Link 2006, Simons et al. 2007, 2009, Miller et al. 2011, Farmer et al. 2012, Ferguson et al. 2015), although there is still discussion about the best approach. For example, Ferguson et al. (2015) discuss estimation issues related to both the Miller et al. (2011) and Royle and Link (2006) models, and suggest further testing is required. The multiple detection state model (MDSM) of Miller et al. (2011) relies on confirmation of species occupancy for all or a subset of sites using more intensive surveys, and is the approach used for site-level confirmation designs (Chambert et al. 2015). The MDS model assigns observations to two states, “certain detections” and “uncertain detections.” This approach allows the explicit assignment of data to different detection states (0 = no detection, 1 = uncertain, and 2 = certain), and is potentially appropriate for ARU-based songbird studies.
Confirmation by independent interpreters does not conform strictly to the MDS model requirement for “certain detection,” so the validity of applying the MDS model where identification certainty is based on such confirmation requires investigation of bias through simulation of known occupancy, detection, and FP rates. It is also necessary to test for bias in the model estimates across a broad range of observation, FP, and detection rates.
Multiple visits and more intensive methods to account for FN and FP errors, respectively, impose additional logistical sampling constraints and costs. Furthermore, a recent design optimization study by Clement (2016) suggests survey/analysis strategies that can efficiently account for both FP and FN error using additional surveys, but if FP errors are not present these additional surveys unnecessarily increase cost. Thus, efficient estimation of FP and FN error will have important implications for optimal species occupancy monitoring. The relatively recent advent of commercially available programmable ARUs provides the opportunity to obtain sufficient numbers of repeat samples while only requiring field staff to physically access a site twice, i.e., for deployment and retrieval of the ARUs, and thus has relatively low marginal costs for adding samples, i.e., recordings (Rempel et al. 2014). In addition, recordings from ARUs can be archived, allowing confirmation of identifications through use of multiple interpreters, and can provide additional confirmation aids such as spectrograms and digital signatures (Hobson et al. 2002, Acevedo et al. 2009, Rempel et al. 2014, Knight et al. 2017). Such recordings are conducive to implementing statistical methods to simultaneously model both FN and FP detection errors (Chambert et al. 2015).
In our study we first simulated songbird data as would be collected from an ARU and interpreted by two people listening to the same species song/call, then mimicked the introduction of detection error, misidentification error, and species confirmation. We then evaluated the validity of the MDS model for this type of data by assessing bias and precision across a range of model estimates for occupancy (ψ), detection rate (d), and FP rate. We then investigated the relative effects of correcting for only FN error in modeling occupancy (ψ) versus correcting for both FN and FP error, using ARU recordings of forest birds in three ecoregions across northern Ontario. We explored the % change in modeled occupancy rates across survey conditions as a function of detection rate, number of confirmed observations, and naïve occupancy rate. Finally, we discuss ARU-based monitoring and analysis strategies for cost-effectively correcting for these errors.
We conducted our study in the Boreal Shield ecozone of Ontario, Canada (Fig. 1). Study area 1 (SA1) lies within the Lake Abitibi ecoregion (ecoregion 3E) of eastern Ontario, whereas study areas 2 (SA2) and 3 (SA3) are both partly in the Lake Wabigoon ecoregion (ecoregion 4S), with SA2 also straddling the Lake St. Joseph ecoregion (ecoregion 3S) and SA3 straddling the Pigeon River ecoregion (ecoregion 4W; Crins et al. 2009). SA3 and the southern portions of SA1 and SA2 fall within Bird Conservation Region (BCR) 12 (Boreal Hardwood Transition), while the northern portions of SA1 and SA2 fall within BCR 8 (Boreal Softwood Shield; North American Bird Conservation Initiative 2000). SA2 and SA3 are generally drier with more extensive areas of exposed bedrock, while SA1 has extensive wet, low-lying areas with organic soils and lowland forest species, as well as a more variable fire cycle and more precipitation (Crins et al. 2009). The forest cover of all study areas is dominated by mixed, coniferous, and sparse forest (more sparse forest in SA2) with some deciduous forest, mainly white spruce (Picea glauca), black spruce (Picea mariana), jack pine (Pinus banksiana), balsam fir (Abies balsamea), tamarack (Larix laricina), eastern white cedar (Thuja occidentalis), trembling aspen (Populus tremuloides), balsam poplar (Populus balsamifera), and white birch (Betula papyrifera); SA3 has a mixture of Boreal and Great Lakes-St. Lawrence Forest tree species, e.g., red pine (Pinus resinosa), white pine (Pinus strobus), red maple (Acer rubrum), yellow birch (Betula alleghaniensis), black ash (Fraxinus nigra), white elm (Ulmus americana), and American beech (Fagus grandifolia). SA1 was sampled in 2013, SA2 in 2014, and SA3 in 2015.
Audio recordings were collected using a repeat sampling design. Automated recording units were scheduled to capture five 10-minute recordings per day, including three during the dawn chorus (half an hour before sunrise, at sunrise, and half an hour after sunrise) and two in the evening (half an hour before sunset and one hour after sunset), within each study area. A combination of Song Meter™ SM1, SM2, and SM2+ recording units, set as similarly as possible (at 22,050 Hz, with 45 dB microphone gain and low pass filter at ~160 Hz for SM1, and +48 dB gain and ~180 Hz low pass filter for SM2/2+), were used to capture recordings in Wildlife Acoustics Audio Compression (.WAC) format. Recordings were selected for interpretation across a 4- to 6-day window within the songbird breeding season (late May to early July), assuming that no new territories would be established or abandoned within the analysis period and that the birds present and vocalizing were within their home range. We randomly selected six good quality (i.e., little wind, rain, or traffic noise) dawn chorus recordings for each site, and two additional evening recordings for the wetland sites only (one from each evening time period), for a total of 117, 115, and 114 sites in SA1, SA2, and SA3, respectively. Each recording was independently interpreted twice; interpreter A processed all of the recordings for all study areas, while the duplicate interpretation of recordings in SA1 was split between two interpreters (B1 and B2) and duplicate interpretation for SA2 and SA3 was completed by a single interpreter (B1). Because interpreters and their respective identification skills differed between study areas, we treated models independently of study area, and did not average model results by species across study areas. We term these groupings of species by study area “observation sets.”
We instructed interpreters to listen to 10-minute recordings and record all detectable and identifiable species. If species identification for a particular audio cue was uncertain, the interpreter was instructed not to record that instance but instead continue with the interpretation, assuming that if a species truly occupies a site then it is likely it would be detected again at another time during the recording sequence.
FP modeling requires assignment of certain detections, i.e., observations where there is little doubt that a true detection occurred given the site was occupied. Identification was based on interpretation of the recorded acoustic cues, and in the MDS model we used interpretation from a second interpreter to “confirm” species presence at a site. Species presence was assigned a “0” if not recorded, a “1” if only the primary interpreter (A) recorded an observation, and a “2” if both the primary and secondary (B) interpreters confirmed hearing that species. This analysis was repeated so that the secondary interpreter (B) was treated as the primary observer. For each site one interpreter was randomly selected as the primary observer, and those results were used to calculate naïve occupancy estimates and to estimate model-averaged ψ. We used this approach to avoid double counting where interpreters did not agree. For example, if interpreter A was selected as primary observer for the site, and identified ALFL, BWWA, and CHSP, whereas interpreter B identified ALFL, BWWA, and DEJU, then this would indicate the presence of three species, with ALFL and BWWA confirmed and the third species, CHSP, unconfirmed. Alternatively, combining results from the two interpreters rather than assigning a primary interpreter would incorrectly indicate the presence of four species, with ALFL and BWWA as confirmed, but now with two species, CHSP and DEJU, as unconfirmed.
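The primary/secondary coding rule can be sketched as follows. This is a minimal illustration in Python; the function name and set-based representation are ours, not from the original analysis (which was carried out in R), and the species codes reuse the worked example above.

```python
def detection_states(primary, secondary, species_pool):
    """Assign MDS detection states for one recording.

    0 = not detected by the primary interpreter
    1 = detected by the primary interpreter only (uncertain)
    2 = detected by both interpreters (confirmed)

    Species heard only by the secondary interpreter are NOT added as
    detections, which avoids double counting where interpreters disagree.
    """
    states = {}
    for sp in species_pool:
        if sp in primary and sp in secondary:
            states[sp] = 2
        elif sp in primary:
            states[sp] = 1
        else:
            states[sp] = 0
    return states

# Worked example from the text: primary (A) hears ALFL, BWWA, CHSP;
# secondary (B) hears ALFL, BWWA, DEJU.
pool = {"ALFL", "BWWA", "CHSP", "DEJU"}
s = detection_states({"ALFL", "BWWA", "CHSP"}, {"ALFL", "BWWA", "DEJU"}, pool)
```

Under this rule, ALFL and BWWA receive state 2, CHSP state 1, and DEJU state 0, matching the three-species outcome described above.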
The MDS model requires a subset of observations to be certain, but unless the bird is in-hand the identification is always subject to error, so although confirmation by two interpreters does not guarantee certain identification it can be used as an estimate. Our approach of using two-interpreter confirmation to establish certain detection may introduce bias in the model, so we evaluated potential bias using simulated observations that reflect the process by which error is introduced into empirical data through interpretation of recorded audio cues. We simulated true occupancy, FN detection error, and FP identification error (see Appendices 1 and 2 for spreadsheets with simulation code). Based on analysis of our empirical data, we found detection histories from interpreters 1 and 2 were correlated with an R of 0.88, so we simulated this level of correlation when introducing FN detection error, i.e., a species was not detected in a recording because of an unclear audio cue, even though the species occupied the site. We then introduced FP identification error, where a species was detected in a recording even though it did not occupy the site, and where misidentification is related to clarity of the audio cue. Conceptually, an audio cue may be unclear because of ambient background noise, because it is a rare variant of song used by the species, because the volume of the song is very low, or because the recording only caught a piece of the full song.
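The simulation process described above can be sketched as follows. This is a simplified Python illustration, not the spreadsheet code in Appendices 1 and 2; the copy-with-probability-rho device is one crude way to induce correlated interpreter histories, and all parameter defaults are illustrative.

```python
import random

def simulate_histories(n_sites=200, k=6, psi=0.5, d=0.4, fp=0.02,
                       rho=0.88, seed=42):
    """Simulate paired interpreter detection histories for one species.

    psi: true occupancy; d: per-recording detection probability on
    occupied sites; fp: per-recording false positive probability on
    unoccupied sites; rho: probability that interpreter B's outcome
    simply copies interpreter A's (a crude device to induce the
    observed correlation between interpreters, e.g., both hearing the
    same clear or unclear cue).

    Returns parallel lists of K-length 0/1 histories for interpreters
    A and B, plus the true occupancy state z of each site.
    """
    rng = random.Random(seed)
    hist_a, hist_b, z = [], [], []
    for _ in range(n_sites):
        occupied = rng.random() < psi
        a, b = [], []
        for _ in range(k):
            p = d if occupied else fp  # detection vs. misidentification
            ya = 1 if rng.random() < p else 0
            if rng.random() < rho:
                yb = ya                 # shared outcome
            else:
                yb = 1 if rng.random() < p else 0  # independent draw
            a.append(ya)
            b.append(yb)
        hist_a.append(a)
        hist_b.append(b)
        z.append(int(occupied))
    return hist_a, hist_b, z

ha, hb, z = simulate_histories()
# Naive occupancy for interpreter A: fraction of sites with any detection.
naive = sum(1 for h in ha if any(h)) / len(ha)
```

Pairing the two simulated histories then yields the 0/1/2 detection states used to fit the MDS model.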
Correction for FP error was modeled using the MDS model for two detection states (Hines 2006, Miller et al. 2011), which corrects ψ estimates for both detection and misidentification error (where the species is detected but the site is unoccupied). Occupancy analyses were completed in R (v.3.4.2) using the occuFP function in the package unmarked (v.0.12-2; Fiske and Chandler 2011). Data were coded as 0, 1, and 2 (type 3 data only in occuFP) to represent no detection, uncertain detection, and confirmed detection, respectively, triggering the model to run the multiple detection state model and produce estimates for occupancy (ψ), true positive detection probability (model parameter p11, hereafter termed d), false positive detection probability (model parameter p10, hereafter termed FP), and the probability that a true positive detection was designated as confirmed (model parameter b).
For field data we generated three estimates of site occupancy: the raw naïve estimate, an estimate from the model accounting for FN error only (occu function in the unmarked R package), and an estimate from the model accounting for both FN and FP error (MDSM). Based on previous studies (Rempel et al. 2014) we selected four detection covariates that we hypothesized would capture diurnal, seasonal, noise, and sound transmission effects on detection rate: time since sunrise/sunset (TSSR), days since spring (15 May; DSS), recording quality (RQ), and percent hardwood (PHW), respectively; RQ was ordinal and varied from 1 to 4 (with 1 representing the lowest and 4 the highest quality recordings). These detection covariates were included because some birds adjust singing frequency as the morning progresses (TSSR; Sólymos et al. 2018), some birds adjust singing frequency as the season progresses (DSS; Sólymos et al. 2018), sound can be impeded in stands with heavy deciduous cover (PHW; Yip et al. 2017), and the interpreter can miss hearing a song where heavy background noise occurs in the recording (RQ). AIC model selection was used to select the best supported model from our a priori detection hypotheses (Appendix 3). All covariates were modeled as linear functions.
For simulated data we evaluated bias and precision of model estimates using assigned levels of true occupancy rate, detection rate (d), FP rate, and probability of confirmed detection (b) that reflected the range of conditions found in our empirical data. We also compared three versus six repeat recordings (K). For the simulated data we evaluated bias and precision by producing side by side box plots to compare the true values to the modeled estimates, and by calculating % bias as:
% bias = 100 × (ψ − true occupancy) / true occupancy

where ψ is the estimated occupancy under the MDS model, and true occupancy is based on the known simulated occupancy values.
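The % bias calculation is straightforward; a minimal Python sketch (the function name is ours):

```python
def percent_bias(psi_hat, psi_true):
    """Percent bias of a modeled occupancy estimate relative to the
    known (simulated) true occupancy."""
    return 100.0 * (psi_hat - psi_true) / psi_true

# Example: a modeled estimate of 0.6 against a true occupancy of 0.5
# is a +20% bias.
pb = percent_bias(0.6, 0.5)
```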
For our field recordings we derived ψ estimates for a total of 166 observation sets, comprising 64 species across the 3 study areas (54, 54, and 56 unique species in SA1, SA2, and SA3, respectively). Two of the 64 species did not have sufficient data in some study areas to model ψ and were removed from the analysis. For field data we calculated % change in the occupancy estimate as:
% change = 100 × (ψ − naïve) / naïve

where ψ is the occupancy estimate under the MDS model and naïve is the uncorrected occupancy rate. After graphical inspection of % change versus detection rate (d), it was clear that a major change in slope and variance occurred near the x-axis origin, so we fit piecewise (segmented) regression (Muggeo 2003), using the R package segmented (Muggeo 2008), to determine whether the relationship between % change and d showed evidence for breakpoints, and to estimate separate (segmented) slopes if confirmed. Some % change values were > 1000%, so % change was first scaled to its maximum value using the R function rangeScale, and then transformed using the arcsine of the square root.
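The % change metric and the pre-regression transformation can be sketched as follows. This is a Python illustration only; the original analysis used the R function rangeScale, and we assume here that scaling means dividing the absolute % changes by their maximum, as described in the text.

```python
import math

def percent_change(psi_hat, naive):
    """Percent change from the naive occupancy rate to the MDS estimate."""
    return 100.0 * (psi_hat - naive) / naive

def range_scale(values):
    """Scale absolute values to [0, 1] by their maximum (an assumed
    stand-in for the R rangeScale step described in the text)."""
    vmax = max(abs(v) for v in values)
    return [abs(v) / vmax for v in values]

def arcsine_sqrt(scaled):
    """Arcsine-square-root transform of proportions in [0, 1]."""
    return [math.asin(math.sqrt(v)) for v in scaled]

# Illustrative (MDS estimate, naive rate) pairs.
pairs = [(0.45, 0.5), (0.2, 0.5), (0.9, 0.3)]
changes = [percent_change(p, n) for p, n in pairs]
transformed = arcsine_sqrt(range_scale(changes))
```

The transformed values, bounded between 0 and π/2, were then used as the response in the segmented regression against d.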
To better understand the frequency with which interpreter error occurs among a broad range of interpreter skill levels, an online identification survey was created to document FP rates among species and interpreter expertise. Respondents were asked to identify the correct species using only acoustic cues, where for each of the 20 questions, interpreters were asked to identify the species from a multiple-choice answer set (four to five similar-sounding species). We only used verified song clips obtained from the Macaulay Library at the Cornell Lab of Ornithology in our identification quiz. We solicited 225 participants for the survey to sample a wide range in years of experience and skill in identifying species by audio cues. For each audio interpretation question in the survey, respondents were asked to make their best educated guess as to the identity of the species in the recording; however, if a respondent indicated that they were unable to make an educated guess of the species identity, this was coded as an FN and excluded from the estimation of FP rate. If the respondent selected the correct species, this was coded as a true positive (TP). Identification accuracy was hypothesized to be a function of experience and ability, so we created three categorical variables representing answers to the yes/no questions “have you had greater than 5 years’ experience identifying birds by song,” “have you had your interpretation ability tested,” and “do you believe that your interpretation ability is sufficient for providing valid identifications.” We also asked respondents to rate their confidence in identifying species commonly found in the ecoregions associated with the test (Boreal and Great Lakes-St. Lawrence) on a scale from 1 to 5, and this variable was treated as a continuous covariate. Models were estimated using the Generalized Linear Models procedure in SPSS, with the Poisson loglinear option selected.
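The coding of quiz responses into TP, FP, and FN, and the resulting FP-rate estimate over attempted identifications, can be sketched as follows (a Python illustration with example species codes; the original modeling was run in SPSS):

```python
def score_response(selected, correct):
    """Code one quiz response: 'TP' for a correct pick, 'FP' for an
    incorrect pick, 'FN' when the respondent could not make an
    educated guess (represented here by None)."""
    if selected is None:
        return "FN"
    return "TP" if selected == correct else "FP"

def fp_rate(responses):
    """FP rate over identifications actually attempted (TP + FP);
    FN responses are excluded, as described in the text."""
    attempted = [r for r in responses if r != "FN"]
    if not attempted:
        return None
    return sum(1 for r in attempted if r == "FP") / len(attempted)

# Example: four responses, one of which is a declined guess.
codes = [score_response(s, c) for s, c in
         [("CHSP", "CHSP"), ("YEWA", "CSWA"), (None, "DEJU"), ("ALFL", "ALFL")]]
rate = fp_rate(codes)  # one FP out of three attempted answers
```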
Similar to determining the optimal number of repeat visits (K) needed to achieve a desired probability of detecting a species given its presence (p* = 1 − (1 − p)^K), we can reformulate this expression and solve for the number of interpreters that should be used to reach a desired probability of “confirmed detection” (b*), where b* = 1 − (1 − b)^L (i.e., the probability of a confirmed detection given L additional interpreters), L is the number of additional interpreters, and b is the probability that a detection is classified as confirmed given that the site is occupied and the species was detected. Note that where only one additional interpreter is used (i.e., L = 1), b* = b, and our “confirmed detection” is an approximation of “certain detection.” We can then reformulate the equation for b* to solve for L under a target value of b* (e.g., Sewell et al. 2012), as follows:
L = log(1 − b*) / log(1 − b)

where L is the number of interpreters needed to achieve a target b*, and b is the estimated probability of “confirmed detection” from the occupancy model correcting for both false positives and false negatives. The parameter b was calculated using the occuFP routine in the package unmarked. Here, we estimate L by setting b* to achieve an 85% chance that species identification at a site will be “confirmed” by the primary interpreter using L additional interpreters.
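Solving for L and rounding up to a whole number of additional interpreters can be sketched as (function name is ours):

```python
import math

def interpreters_needed(b, b_star=0.85):
    """Number of additional interpreters L needed so that the
    probability of confirmed detection reaches b_star, given the
    per-interpreter confirmation probability b:
        L = log(1 - b_star) / log(1 - b), rounded up."""
    return math.ceil(math.log(1 - b_star) / math.log(1 - b))

# Example: at the Group 0 mean b of 0.79, two additional interpreters
# already exceed an 85% chance of confirmation.
n = interpreters_needed(0.79)
```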
Our simulations revealed that the model estimates from the MDS model had low bias, with estimated ψ and associated variance similar to true (known) occupancy rates across a broad range of occupancy and FP rates (Fig. 2 C-F; Table 1). As well, across this same range the model accurately estimated the survey parameters d, FP rate, probability of “confirmed detection” (b), and associated variances. However, at a low occupancy rate of 0.2 and higher FP (p10) rates of approximately ≥ 0.02, the model produced biased estimates of ψ, with modeled ψ about twice the true occupancy rate (Fig. 2 A and B). Slightly biased results also occurred at an occupancy rate of 0.5 where FP was higher, at 0.07 (Fig. 3 D); however, bias disappeared at the lower FP rate of 0.02 (Fig. 3 C). For all other simulations where true occupancy was ≥ 0.2 and FP was < 0.02, the modeled estimates were unbiased (Table 1, Appendix 4). The other survey parameters, d, FP, and b, were relatively unbiased across the full spectrum of simulation conditions.
In addition to effects of FP and occupancy rate, we also found a strong effect of detection rate on bias and precision of ψ: at low detection rates (d = 0.1), estimates of ψ were orders of magnitude higher than the true occupancy rate (Fig. 3 A, C, and E). Bias was also found at a d of 0.2 where true occupancy rate was only 0.1, but at this same detection rate bias largely disappeared when true occupancy increased to 0.5 (Fig. 3 F; Appendix 4).
With the number of repeat observations (“recordings”) set at six, the estimates of variance were unbiased across this broad range of conditions, with box plots revealing a similar range in the true variance of the simulated data and the estimated variance produced by the models. To explore the effect of number of repeat observations (recordings) on bias and precision, we compared simulations using three versus six repeat observations, and FP rates of 2.4% versus 6.7% (keeping occupancy rate at 0.5 and detection rate at 0.4). Under both FP rates we found little effect on magnitude of bias in ψ, but the variance of the estimate increased as repeat observations decreased (Fig. 4).
As true or naïve occupancy rate decreases, the number of confirmed observations also decreases, and this depression in sample size possibly contributes to poor model performance; thus, it appears from our simulation that bias is related to naïve occupancy rate, number of confirmed observations, d, and FP rates. To help assign risk of bias to our field evaluation data we developed a simple empirical logistic regression equation to relate these four variables to known bias in the simulation results, where the moderate/high risk class was defined as bias ≥ 40% (Table 1).
This simple model had 100% classification accuracy for assigning observations in Table 1 to the associated high/low bias category (with p < 0.001, omnibus test of model coefficients). This model was then applied to the field data to assign a category of low versus high risk of bias.
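Applying such a logistic risk model can be sketched as follows. This is a Python illustration only; the coefficients below are hypothetical placeholders because the fitted values are not reproduced in the text, and only the model's four inputs (naïve occupancy, d, number of confirmed observations, and FP rate) come from the study.

```python
import math

# Hypothetical coefficients, for illustration only; the paper's fitted
# logistic regression coefficients are not reproduced here. Signs are
# chosen so that low occupancy, low d, few confirmations, and high FP
# all push toward higher risk, consistent with the simulation results.
COEF = {"intercept": 6.0, "naive": -20.0, "d": -10.0,
        "n_confirmed": -0.3, "fp": 60.0}

def bias_risk(naive, d, n_confirmed, fp, coef=COEF):
    """Probability that an observation set falls in the moderate/high
    bias-risk class (bias >= 40%) under a logistic model of naive
    occupancy, detection rate (d), number of confirmed observations,
    and FP rate."""
    eta = (coef["intercept"] + coef["naive"] * naive + coef["d"] * d
           + coef["n_confirmed"] * n_confirmed + coef["fp"] * fp)
    return 1.0 / (1.0 + math.exp(-eta))

# Under these illustrative coefficients, a well-sampled set scores as
# low risk and a sparse, error-prone set as high risk.
low = bias_risk(naive=0.5, d=0.5, n_confirmed=30, fp=0.01)
high = bias_risk(naive=0.1, d=0.1, n_confirmed=3, fp=0.07)
```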
Initial analysis of our field recordings revealed a strong break-point in the % change between our MDS model estimates of ψ and the naïve rate among observation sets with a combination of very low detection rates, occupancy rates, and/or FP rates. To explore these relationships further we used segmented regression analysis and found detection rate explained a significant proportion of the variance in the % change in ψ (adjusted r² = 0.679), with the regression supporting a breakpoint in the relationship at d = 0.149 (SE 0.011; Fig. 5A). As expected, the % change in ψ decreased as d increased, suggesting the model operates properly where d is above the 0.149 threshold (Fig. 5A).
Through segmented regression we also found break-points in the relationships between naïve occupancy, i.e., observation rate, and % change in ψ (Fig. 5B), and between number of confirmed observations and % change (Fig. 5C). Naïve occupancy rates < 0.15 and number of confirmed observations < 7.6 were associated with greatly increased variability in the estimate of ψ (P < 0.001 for both models). Evaluation of the simulated data found that bias in the estimate is highest where naïve occupancy < 0.2 (Fig. 3), which is similar to the break-point of 0.15 identified from the field recordings. Low occupancy rates are often associated with a low number of confirmed observations, which may have a direct influence on poor model performance. These results support the simulation assessment that certain combinations of low observation rate, low number of confirmed observations, low d, and high FP result in high risk of bias for ψ, so we applied our simulation-based equation 4 to assign bias risk categories based on combinations of these parameters, where Group 0 had low risk of bias and Group 1 high risk of bias in ψ (Table 2). We assigned Group 2 to those observation data sets where the model was not able to converge or provide valid confidence limits for the estimates. Most Group 2 models also had some combination of low naïve occupancy, number of confirmed observations, and d, and high FP rate.
The mean % change (absolute value) from naïve to ψ for the 80 Group 0 observation sets (low risk of bias) was 11.75% (SE 9.98%; median % change of 9.20%), with about 84% of those changes being negative (Table 3, Fig. 6). The mean (SE) of detection rate (d), FP error rate, and confirmed detection rate (b) were 0.50 (0.12), 0.014 (0.013), and 0.79 (0.13), respectively. As expected, ψ estimates from models accounting for only FN error increased relative to naïve estimates. However, it was less obvious what would happen when both FN and FP errors were modeled; under the MDS model, 84% of ψ estimates were lower than the naïve estimates (Fig. 6). Clearly, modeling FP errors had a stronger effect on estimate correction than modeling FN errors.
Observation sets in Group 1 were classed as having moderate to high risk of bias, although note that this simply assigns risk and is not a statistical assessment of bias. The % change in ψ for these observation sets was highly variable, skewed, and in some cases extremely large, so we presented only nonparametric median estimates of central tendency. For Group 1 observation sets the median % change (absolute value) from naïve to ψ was 58.5% (Table 3) and ranged from 3.3% to 1666%. Relative to the Group 0 models, these observation sets were associated with a higher false positive rate (FP) and lower detection rate (d), naïve occupancy rate, number of recordings with confirmed observations, and number of sites that were confirmed to have the species present (Table 3). For the 47 observation sets in Group 2, where models were unable to converge and/or provide valid confidence limits and standard errors, the median % change (absolute value) from naïve to ψ was 72.5% (Table 3) and ranged from 0.48% to 2699%. Most of these models had very low numbers of confirmed observations, with a median of only 11 confirmed recordings and five confirmed sites, although there may be additional unresolved reasons why the models did not converge or estimate properly. The results from these models were excluded from further evaluation.
The online song identification survey included 225 respondents from across North America, including Alberta, Saskatchewan, Ontario, Quebec, Newfoundland, and northern U.S. states, and provided a broader evaluation of the prevalence of misidentification (FP error) among interpreters. The overall Poisson model was significant (omnibus test of fitted model against intercept-only model; LLR χ²(13) = 169.27, P < 0.001), with four factors associated with a higher number of correct answers (TP). Identification confidence with the species group used in the survey, ranging from 1 (not confident at all) to 5 (quite confident), was the strongest predictor (Exp(β) = 1.151, CI = 0.785-1.01; P < 0.001). Having greater than five years of experience in identifying birds by sound had the second largest effect size (Exp(β) = 0.888, SE = 0.0625) but was marginally nonsignificant (P = 0.059). Two other factors, having had interpretation ability tested and self-assessing ability as sufficient, were also significant, with Exp(β) = 0.849, P < 0.001 and Exp(β) = 0.803, P < 0.001, respectively.
Misidentification rate, i.e., probability the respondent picked the incorrect species, decreased dramatically from the group with the lowest identification confidence (0.749) to the highest (0.308; confidence = 1 and 5, respectively; Fig. 7). Based on these results we created an “expert” subgroup (N = 28), classed as those with the highest level (5) of self-assessed confidence. However, even among this expert group misidentification rates could be high for certain species.
We identified confusion groups of species based on the answers selected by the “expert” group in the online survey (Table 4), and found that this group always correctly identified Rose-breasted Grosbeak (Pheucticus ludovicianus), but misidentification rates for other species ranged from a high of 0.75 (Yellow Warbler [Setophaga petechia]) to a low of 0.071 (Boreal Chickadee [Poecile hudsonicus] and Black-throated Blue Warbler [Setophaga caerulescens]; Table 4). Misidentification rates also differed between two American Redstart (Setophaga ruticilla) song types, with higher misidentification for the higher pitched song (type 2, syllable pattern more similar to Bay-breasted Warbler [Setophaga castanea] or Black-and-white Warbler [Mniotilta varia]).
From our field study we calculated L to estimate the minimum number of observers, additional to the primary interpreter, that would be required to achieve an 85% probability of confirmed detection (b*). For all but two species in our 115 observation sets, L ranged from 1 to 9 (Table 2). Chipping Sparrow (Spizella passerina) and Yellow Warbler had low values of b, and would require 21 and 17 additional observers, respectively, to achieve an 85% probability that any detection would be a confirmed detection. Chipping Sparrow, Dark-eyed Junco (Junco hyemalis), Yellow Warbler, American Redstart, Chestnut-sided Warbler (Setophaga pensylvanica), and Wilson’s Warbler (Cardellina pusilla) were among the most inconsistently identified species. Of the 115 observation sets, 90% would require between one and three interpreters (additional to the primary interpreter) to reach an 85% probability of confirmed detection, while the remainder would require more than three interpreters (Table 2). Many of the most problematic species in the field survey also had high rates of misidentification in the online survey, including Chestnut-sided Warbler, Yellow Warbler, Dark-eyed Junco, Chipping Sparrow, American Redstart, and Bay-breasted Warbler (Table 4).
Misidentification has been largely ignored as a source of observational error in acoustic-based studies; whereas early occupancy modeling focused on nondetection (FN) error leading to misclassification of a site as unoccupied, more recent attention has been directed toward misidentification (FP) error leading to misclassification of the state of site occupancy (Nichols 2019). In this study we explored the validity and performance of using an existing model to correct for FP error resulting from misidentification of interpreted songbird recordings that are typical of those collected using ARUs.
Our simulation study evaluated bias and precision of occupancy estimates from the Miller et al. (2011) MDS model, and our results support the position that this model can effectively correct for both FN and FP errors across a broad range of survey observation and detection rates, and is appropriate for the type of two-interpreter confirmation used in this study. The model also produces unbiased occupancy estimates across the range of three to six repeat recordings, although variance in the estimates increases with only three repeat recordings. In our study we used confirmation by two interpreters to estimate identification “certainty” of observations, and it was not known whether this approach was appropriate and met the conditions for valid use of the MDS model. In our simulation of FN and FP error we closely mimicked the process by which these errors could be introduced into data sets, and also allowed for situations where identity was confirmed by both interpreters but where the identification was ultimately incorrect.
There are other models that may also be appropriate for modeling FP error, and further work is needed to evaluate them. Royle and Link (2006) developed the original approach to accounting for false positive detections using a binomial mixture model, but this approach has limitations that hinder its successful implementation (Miller et al. 2011). Cook and Hartley (2018) found that using double and multiple observer occupancy models (Huggins’s closed capture data type) resulted in positively biased occupancy estimates and failed to account for observer error. In contrast, we modeled observer error using the Miller et al. (2011) MDS model and found that, under appropriate conditions of observation rate and detection rate, this approach corrected for positively biased occupancy estimates by lowering the modeled estimate relative to estimates based on correction of FN error alone.
The MDS model, however, did not perform well under very low observation and detection rate conditions, where there was significant bias and high variability in the estimates. This is a well-known problem among occupancy models; for example, Miller et al. (2011) and McKelvey et al. (2008) found that the largest bias occurred when ψ was low, and that as a species becomes rarer the proportion of FP observations relative to true observations increases. Our simulation results revealed that in such cases even the naïve occupancy rate can be significantly biased. We developed a simple model to assign risk of bias to an observation data set based on a combination of naïve occupancy, detection rate, and FP rate. Our analysis, however, also suggested that a low number of confirmed observations, not just the rate of unconfirmed observations, may be a more important driver of poor model performance. The relationship between observation rates and model bias needs to be evaluated through more extensive simulation across a broader range of observation and detection rate conditions.
For the field data sets with low risk of bias (Group 0), we found that under the Miller et al. (2011) MDS model, FP error had a greater impact on corrected ψ than FN error. Similar impacts of FP error have been reported by others (McClintock et al. 2010a, Chambert et al. 2015, Ferguson et al. 2015, Clement 2016), suggesting that bias resulting from ignoring even a small number of FP errors can often be very large. There is particular concern regarding observation error when using data collected by the public (McKelvey et al. 2008, Miller et al. 2013); however, results from our online survey provide a coarse evaluation of how experience and ability affect misidentification, and suggest that FP errors occur across the full range of expertise; even the most skilled and experienced interpreters showed substantial FP error, and variability between interpreters, for certain species. Likewise, Simons et al. (2009) designed a series of field experiments using audio broadcasting to assess the factors affecting detection probabilities on auditory counts, and concluded that direct estimates of detection rate, including error related to misidentification, should accompany all analyses of avian point count data because measurement error on auditory point counts is substantial.
As expected, we found that corrections for FN always increased ψ relative to the naïve estimate, but that in 84% of cases corrections that modeled both FP and FN brought ψ below the naïve estimate. Although this result may be somewhat study specific, it nonetheless flags the importance of accounting for FP error. After accounting for both FP and FN error, the mean deviation from the uncorrected naïve occupancy rate was about 11.5%. The effect of FP error on ψ was often substantially greater than the effect of FN error. We estimated occupancy from six recordings (i.e., “visits”) per site, so FN error is inherently low in our study because of the multiple opportunities to identify a species; but if even one audio cue is misinterpreted among the recordings, an FP error occurs. These results suggest that surveys that correct for FN but not FP error will tend to inflate ψ estimates and introduce significant bias. There may be an interaction between the magnitude of inflation and the number of visits to a site, as the correction for FN error generally decreases with higher numbers of site visits.
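The reason FN-only correction always raises ψ, and why the correction shrinks with more visits, can be seen with an idealized constant-detection model (a textbook simplification, not our fitted model): with per-visit detection probability p and K visits, the expected naïve occupancy is ψ multiplied by p* = 1 - (1 - p)^K, so the FN-corrected estimate divides the naïve rate by p* ≤ 1.

```python
def fn_corrected(naive, p, K):
    """FN-only correction under a constant per-visit detection
    probability p and K visits (illustrative values below):
    divide the naive occupancy rate by the probability of at
    least one detection at an occupied site."""
    p_star = 1 - (1 - p) ** K          # p* approaches 1 as K grows
    return naive / p_star

# With p = 0.3 and K = 6 visits, p* is already about 0.88, so the
# upward FN correction is modest; with K = 1 it would be much larger.
print(fn_corrected(0.40, 0.3, 6))
print(fn_corrected(0.40, 0.3, 1))
```

Because p* approaches 1 as K increases, the upward FN adjustment shrinks with more recordings per site, while any downward FP adjustment does not, which is consistent with the pattern we observed.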
Overestimation of species occurrence could have consequences for species conservation; for example, it could delay identification of species in decline, because listing criteria are based on trends in estimated population size and occupancy (Camaclang et al. 2015, Loehle and Sleep 2015). Thus, overestimating species occurrence risks delaying conservation efforts, potentially leading to longer species recovery times and decreased likelihood of success (Taylor et al. 2005).
Thompson et al. (2017) suggest a minimum naïve occupancy rate of > 0.05 for conducting occupancy analysis; however, this might be too restrictive in some situations. Given the need for management and conservation prescriptions for rare species, it is particularly important to account for observational error when occupancy or detection rate is low (McKelvey et al. 2008, McClintock et al. 2010a, Miller et al. 2011). Our model to estimate L, the number of additional observers required to correct for FP error, could also be considered when selecting focal species for a study using ARU data. However, for species that are important to survey, but whose detection and occupancy rates are inherently low or whose misidentification rates are inherently high, changes in the survey protocol may increase detection rate and the number of confirmed observations. For example, we found that time of day and date can affect detection probability, so changes in the timing of the survey may increase the opportunities to hear a bird vocalize and identify it accurately.
Stratified sampling to increase the number of sites in habitat used by rare species, or by species that vocalize infrequently, may increase the number of encounters with the species and provide more opportunities for detection and accurate identification. It could also increase the number of sites confirmed to have the species present. These types of survey efforts may help reduce bias and increase precision in the occupancy estimate. Where this is not possible, it should be recognized that even a few misidentifications could inflate the naïve occupancy estimate by orders of magnitude, and prudence should be exercised in reporting possibly biased estimates.
Additionally, the conditional survey design of Specht et al. (2017) may be useful for cost-effectively generating improved correction factors for both FN and FP errors. In a conditional survey design applicable to interpretation of ARU recordings, one or two recordings from all sites would initially be interpreted, and if the rare species is detected in these initial recordings by at least one of the two interpreters, then additional recordings are interpreted. Optionally, additional interpreters can be assigned to those sites to improve estimation of the FP error rate. This could similarly be done for species with high rates of misidentification. Estimates of the optimal number of listeners could be used to select the number of additional interpreters, and their efforts could be directed to those sites containing species for which estimates of L were > 2. To reduce cost, interpreters could generate audio clips of species with high FP error rates, and multiple interpreters or crowd sourcing (Wimmer et al. 2013) could be used to classify only the clips. These approaches provide options to cost-effectively improve correction for FN and FP error by focusing observation effort on those few sites where at least one interpreter has detected rare or problematic species in at least one or two recordings. Banner et al. (2018) developed an observation confirmation model and workflow that provides flexibility in establishing confirmation at the observation level and could reduce the cost and effort required for this type of FP modeling.
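The conditional allocation of interpretation effort described above can be sketched as a simple triage step. This is a hypothetical illustration of the workflow, not an implementation from Specht et al. (2017); the function name, data structures, and the cap on extra interpreters are all assumptions.

```python
def conditional_workflow(initial_detections, L_estimates, max_extra=3):
    """Sketch of a conditional replicate design for ARU interpretation:
    interpret the remaining recordings only at sites where the initial
    screen (one or two recordings) detected a species, and assign extra
    interpreters to species whose estimated L exceeds 2.
    initial_detections: site -> set of species detected in the screen.
    L_estimates: species -> estimated number of additional observers L."""
    plan = {}
    for site, species_set in initial_detections.items():
        if not species_set:
            continue                       # no follow-up where screen was empty
        extra = {sp: min(L_estimates.get(sp, 0), max_extra)
                 for sp in species_set if L_estimates.get(sp, 0) > 2}
        plan[site] = {"interpret_remaining": True,
                      "extra_interpreters": extra}
    return plan

# Hypothetical example: one site screens positive for a problem species.
plan = conditional_workflow({"site_01": {"CSWA"}, "site_02": set()},
                            {"CSWA": 5})
```

In this sketch the budget for additional interpretation is spent only at screened-positive sites, which is the cost-saving logic of the conditional design.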
Reducing the FP rate may also be accomplished through additional interpreter training, especially for species identified as having a high FP rate. Note, however, that Miller et al. (2012) found only marginal reduction in errors from additional instruction or training. Farmer et al. (2012) found that rates of nondetection and FP error vary with species rarity and interpreter skill, and that interpreter confidence (declaration of certainty) was not a reliable measure of these errors. In our online study, however, we found interpreter confidence to be a useful predictor of FP error. We also found that high misidentification rates occurred for interpreters with years of experience and high confidence, so other approaches, such as changes to survey design, may be required to sufficiently reduce FP error.
Our interpreters were instructed to record a species only if they felt confident in the identification, as recommended in previous studies (McClintock et al. 2010b, Farmer et al. 2012). Suggestions to reduce errors during data collection include recording detailed detection evidence to indicate observation confidence, such as call type, i.e., song versus call (Farmer et al. 2012), and detection method, i.e., visual versus auditory (Simons et al. 2009, Miller et al. 2011), or discounting detections beyond specific distance thresholds, depending on species and recording conditions (McClintock et al. 2010b). Additional suggestions for reducing misidentification during recording interpretation include the following: providing knowledge of the habitat in which the recording was made, e.g., field conditions, habitat information, site photos, range maps, etc.; using spectrograms and computer-based identification resources (Knight and Bayne 2018); providing examples of species songs; and quizzing potential interpreters to assess their ability (Rempel et al. 2014). In this study FP error was very high for Blue Jay (Cyanocitta cristata) and Chestnut-sided Warbler; use of spectrogram aids and digital detection algorithms might be particularly helpful whenever species known or estimated to have high FP rates are encountered. Given the high potential for FP error, study designs should include a request for song “type or voucher specimens” so errors can be corrected in later years.
Approaches to correcting FP error that include nonauditory data or experimental control data might also be considered. Chambert et al. (2015) devised an occupancy modeling approach that involves confirmation of individual observations using additional biological samples, e.g., DNA, or indirect records such as photographs or acoustic recordings from which identification can be confirmed. Alternatively, Chambert et al. (2015) developed a calibration design in which controlled experiments were used to evaluate the detection process, and these data were then used to model the FP detection process within the occupancy model. Van Wilgenburg et al. (2017) integrated automated recording unit and human observer point count data to take advantage of the relative merits of each survey type and reduce bias in estimated avian counts and/or density; associated recordings from this approach could also be used to estimate FP rates for human observer counts. Guillera-Arroita et al. (2017) demonstrated a two-stage model requiring at least two sources of extra information, e.g., records from a survey method with no possibility of false detections and a calibration experiment, to reliably estimate occupancy. These authors also developed a method within a Bayesian approach to set bounds on false detection rates based on prior knowledge, and we suggest that data from experiments such as our online interpretation experiment would provide a convenient framework for deriving such informative priors. Recently, progress has been made on using various classification and machine learning approaches for recognition of animal sounds, including hidden Markov models, neural networks, deep learning, and support vector machines (e.g., Cai et al. 2007, Ranjard and Ross 2008, Weninger and Schuller 2011, Stowell et al. 2019). Although automated recognizers are typically applied to an entire recording, often with less than impressive results (e.g., Venier et al. 2017), recognizers could perhaps be more effectively applied to isolated song clips to provide additional evidence of songbird identity, and this may be particularly useful when using ARU based surveys to cost-effectively reduce misidentification error. Indeed, classification scores (Knight and Bayne 2018) from automated classification algorithms could be used as aids to help interpreters decide between alternative species or songs, and spectrogram-based tools for visualizing and listening to recordings that suggest possible species identifications are in development (E. M. Bayne, personal communication).
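One way classification scores could serve as a decision aid is a simple triage rule over song clips: accept clips the recognizer scores very highly, queue mid-range clips for a second interpreter, and discard the rest. This is a hypothetical sketch; the thresholds, function name, and data format are assumptions, not part of any cited method.

```python
def triage_clips(clip_scores, review_threshold=0.5, accept_threshold=0.9):
    """Route song clips by recognizer score (thresholds illustrative):
    >= accept_threshold  -> accept identification,
    >= review_threshold  -> queue for a second interpreter,
    otherwise            -> reject."""
    accepted, review, rejected = [], [], []
    for clip, score in clip_scores:
        if score >= accept_threshold:
            accepted.append(clip)
        elif score >= review_threshold:
            review.append(clip)
        else:
            rejected.append(clip)
    return accepted, review, rejected

# Hypothetical scores for three clips of a high-FP-rate species:
a, r, x = triage_clips([("clip_1", 0.95), ("clip_2", 0.60), ("clip_3", 0.10)])
```

Such a rule would concentrate scarce interpreter time on the ambiguous middle band of scores, where misidentification is most likely.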
Study objectives and conditions will of course vary, but the cost of modeling FP error in addition to FN error from data collected using ARUs will in many cases be focused on funding additional interpretations rather than on sampling additional sites or collecting more repeat recordings. As discussed above, the quality and effectiveness of our two-observer MDS model design could be improved by adding a third interpreter to confirm identification in only those sites where the initial two interpreters disagreed, which may improve the approximation of “certain identification” required by the MDS model. Regardless of approach, additional resources to reveal major identification problems and to correct for FP error (in addition to FN) should be incorporated into study designs. Identifying and correcting species misidentification should reduce the potential introduction of bias into estimates of ψ that could otherwise impede conservation efforts.
In this study we have evaluated a framework (MDSM) for modeling FN and FP error, and have provided suggestions for survey design and interpretation protocols for ARU data (including analysis guidelines) that should aid in simultaneous modeling of both FP and FN errors to reduce the potential introduction of bias in the estimates of ψ. We encourage others to employ similar approaches over a broader range of ecological conditions to examine the generality of this proposed approach.
We thank Stephen Gullage, George Holborn, and Jeff Robinson for their interpretation of songbird recordings, the 225 respondents who took the online bird identification survey, and the reviewers for their very constructive comments.
Acevedo, M. A., C. J. Corrada-Bravo, H. Corrada-Bravo, L. J. Villanueva-Rivera, and T. M. Aide. 2009. Automated classification of bird and amphibian calls using machine learning: a comparison of methods. Ecological Informatics 4(4):206-214. https://doi.org/10.1016/j.ecoinf.2009.06.005
Alquezar, R. D., and R. B. Machado. 2015. Comparisons between autonomous acoustic recordings and avian point counts in open woodland savanna. Wilson Journal of Ornithology 127(4):712-723. https://doi.org/10.1676/14-104.1
Banner, K. M., K. M. Irvine, T. J. Rodhouse, W. J. Wright, R. M. Rodriguez, and A. R. Litt. 2018. Improving geographically extensive acoustic survey designs for modeling species occurrence with imperfect detection and misidentification. Ecology and Evolution 8:6144-6156. https://doi.org/10.1002/ece3.4162
Cai, J., D. Ee, B. Pham, P. Roe, and J. Zhang. 2007. Sensor network for the monitoring of ecosystem: bird species recognition. Pages 293-298 in M. Palaniswami, S. Marusic, and Y. W. Law, editors. 2007 3rd International Conference on Intelligent Sensors, Sensor Networks and Information Processing. https://doi.org/10.1109/ISSNIP.2007.4496859
Camaclang, A. E., M. Maron, T. G. Martin, and H. P. Possingham. 2015. Current practices in the identification of critical habitat for threatened species. Conservation Biology 29(2):482-492. https://doi.org/10.1111/cobi.12428
Chambert, T., D. A. W. Miller, and J. D. Nichols. 2015. Modeling false positive detections in species occurrence data under different study designs. Ecology 96(2):332-339. https://doi.org/10.1890/14-1507.1
Clement, M. J. 2016. Designing occupancy studies when false-positive detections occur. Methods in Ecology and Evolution 7:1538-1547. https://doi.org/10.1111/2041-210X.12617
Cook, A., and S. Hartley. 2018. Efficient sampling of avian acoustic recordings: intermittent subsamples improve estimates of single species prevalence and total species richness. Avian Conservation and Ecology 13(1):21. https://doi.org/10.5751/ACE-01221-130121
Crins, W. J., P. A. Gray, P. W. Uhlig, and M. C. Wester. 2009. The ecosystems of Ontario, Part 1: Ecozones and ecoregions. Technical Report. Ontario Ministry of Natural Resources, Peterborough, Ontario, Canada.
Farmer, R. G., M. L. Leonard, and A. G. Horn. 2012. Observer effects and avian-call-count survey quality: rare-species biases and overconfidence. Auk 129(1): 76-86. https://doi.org/10.1525/auk.2012.11129
Ferguson, P. F. B., M. J. Conroy, and J. Hepinstall-Cymerman. 2015. Occupancy models for data with false positive and false negative errors and heterogeneity across sites and surveys. Methods in Ecology and Evolution 6(12):1395-1406. https://doi.org/10.1111/2041-210X.12442
Fiske, I., and R. Chandler. 2011. Unmarked: an R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software 43(10):1-23. https://doi.org/10.18637/jss.v043.i10
Guillera-Arroita, G., J. J. Lahoz-Monfort, A. R. van Rooyen, A. R. Weeks, and R. Tingley. 2017. Dealing with false positive and false negative errors about species occurrence at multiple levels. Methods in Ecology and Evolution 8(9):1081-1091. https://doi.org/10.1111/2041-210X.12743
Hines, J. E. 2006. PRESENCE: software to estimate patch occupancy and related parameters. U.S. Geological Survey, Patuxent Wildlife Research Center, Laurel, Maryland, USA.
Hobson, K. A., R. S. Rempel, H. Greenwood, B. Turnbull, and S. L. VanWilgenburg. 2002. Acoustic surveys of birds using electronic recordings: new potential from an omnidirectional microphone system. Wildlife Society Bulletin 30(3):709-720.
Knight, E. C., and E. M. Bayne. 2018. Classification threshold and training data affect the quality and utility of focal species data processed with automated audio-recognition software. Bioacoustics. https://doi.org/10.1080/09524622.2018.1503971
Knight, E. C., K. C. Hannah, G. Foley, C. Scott, R. M. Brigham, and E. Bayne. 2017. Recommendations for acoustic recognizer performance assessment with application of five common automated signal recognition programs. Avian Conservation and Ecology 12(2):14. https://doi.org/10.5751/ACE-01114-120214
Loehle, C., and D. J. H. Sleep. 2015. Use and application of range mapping in assessing extinction risk in Canada. Wildlife Society Bulletin 39(3):658-663. https://doi.org/10.1002/wsb.574
MacKenzie, D. I., J. D. Nichols, G. B. Lachman, S. Droege, J. A. Royle, and C. A. Langtimm. 2002. Estimating site occupancy rates when detection probabilities are less than one. Ecology 83(8):2248-2255. https://doi.org/10.1890/0012-9658(2002)083[2248:ESORWD]2.0.CO;2
MacKenzie, D. I., J. D. Nichols, J. A. Royle, K. H. Pollock, L. L. Bailey, and J. E. Hines. 2006. Occupancy estimation and modeling: inferring patterns and dynamics of species occurrence. Academic Press, Amsterdam, The Netherlands.
Matsuoka, S. M., L. C. Mahon, C. M. Handel, P. Sólymos, E. M. Bayne, P. C. Fontaine, and C. J. Ralph. 2014. Reviving common standards in point-count surveys for broad inference across studies. Condor 116:599-608. https://doi.org/10.1650/CONDOR-14-108.1
McClintock, B. T., L. L. Bailey, K. H. Pollock, and T. R. Simons. 2010a. Unmodeled observation error induces bias when inferring patterns and dynamics of species occurrence via aural detections. Ecology 91(8):2446-2454. https://doi.org/10.1890/09-1287.1
McClintock, B. T., L. L. Bailey, K. H. Pollock, and T. R. Simons. 2010b. Experimental investigation of observation error in anuran call surveys. Journal of Wildlife Management 74(8):1882-1893. https://doi.org/10.2193/2009-321
McKelvey, K. S., K. B. Aubry, and M. K. Schwartz. 2008. Using anecdotal occurrence data for rare or elusive species: the illusion of reality and a call for evidentiary standards. BioScience 58(6):549-555. https://doi.org/10.1641/B580611
Miller, D. A., J. D. Nichols, B. T. McClintock, E. H. C. Grant, L. L. Bailey, and L. A. Weir. 2011. Improving occupancy estimation when two types of observational error occur: non-detection and species misidentification. Ecology 92(7):1422-1428. https://doi.org/10.1890/10-1396.1
Miller, D. A. W., J. D. Nichols, J. A. Gude, L. N. Rich, K. M. Podruzny, J. E. Hines, and M. S. Mitchell. 2013. Determining occurrence dynamics when false positives occur: estimating the range dynamics of wolves from public survey data. PLoS ONE 8(6):e65808. https://doi.org/10.1371/journal.pone.0065808
Miller, D. A. W., L. A. Weir, B. T. McClintock, E. H. C. Grant, L. L. Bailey, and T. R. Simons. 2012. Experimental investigation of false positive errors in auditory species occurrence surveys. Ecological Applications 22(5):1665-1674. https://doi.org/10.1890/11-2129.1
Muggeo, V. M. 2003. Estimating regression models with unknown break-points. Statistics in Medicine 22(19):3055-3071. https://doi.org/10.1002/sim.1545
Muggeo, V. M. 2008. Segmented: an R package to fit regression models with broken-line relationships. R News 8(1):20-25.
Nichols, J. D. 2019. Confronting uncertainty: contributions of the wildlife profession to the broader scientific community. Journal of Wildlife Management 83(3):519-533. https://doi.org/10.1002/jwmg.21630
North American Bird Conservation Initiative. 2000. North American Bird Conservation Initiative: bird conservation region descriptions. U.S. Fish and Wildlife Service, Washington, D.C., USA.
Ranjard, L., and H. A. Ross. 2008. Unsupervised bird song syllable classification using evolving neural networks. Journal of the Acoustical Society of America 123(6):4358-4368. https://doi.org/10.1121/1.2903861
Rempel, R. S., J. M. Jackson, and J. N. Robinson. 2014. Acoustic monitoring and assessment of forest songbirds: sample design, analysis methods, and observation error. CNFER Technical Report TR-012. Ontario Ministry of Natural Resources, Centre for Northern Forest Ecosystem Research, Thunder Bay, Ontario, Canada.
Royle, J. A., and W. A. Link. 2006. Generalized site occupancy models allowing for false positive and false negative errors. Ecology 87(4):835-841. https://doi.org/10.1890/0012-9658(2006)87[835:GSOMAF]2.0.CO;2
Sewell, D., G. Guillera-Arroita, R. A. Griffiths, and T. J. Beebee. 2012. When is a species declining? Optimizing survey effort to detect population changes in reptiles. PLoS ONE 7(8):e43387. https://doi.org/10.1371/journal.pone.0043387
Shonfield, J., and E. M. Bayne. 2017. Autonomous recording units in avian ecological research: current use and future applications. Avian Conservation and Ecology 12(1):14. https://doi.org/10.5751/ACE-00974-120114
Simons, T. R., M. W. Alldredge, K. H. Pollock, and J. M. Wettroth. 2007. Experimental analysis of the auditory detection process on avian point counts. Auk 124(3):986-999. https://doi.org/10.1642/0004-8038(2007)124[986:EAOTAD]2.0.CO;2
Simons, T. R., K. H. Pollock, J. M. Wettroth, M. W. Alldredge, K. Pacifici, and J. Brewster. 2009. Sources of measurement error, misclassification error, and bias in auditory avian point count data. Pages 237-254 in D. L. Thomson, E. G. Cooch, and M. J. Conroy, editors. Modeling demographic processes in marked populations. Springer Science + Business Media, New York, New York, USA. https://doi.org/10.1007/978-0-387-78151-8_10
Sólymos, P., S. M. Matsuoka, D. Stralberg, N. K. Barker, and E. M. Bayne. 2018. Phylogeny and species traits predict bird detectability. Ecography 41(10):1595-1603. https://doi.org/10.1111/ecog.03415
Specht, H. M., H. T. Reich, F. Iannarilli, M. R. Edwards, S. P. Stapleton, M. D. Weegman, M. K. Johnson, B. J. Yohannes, and T. W. Arnold. 2017. Occupancy surveys with conditional replicates: an alternative sampling design for rare species. Methods in Ecology and Evolution 8:1725-1734. https://doi.org/10.1111/2041-210X.12842
Stowell, D., M. D. Wood, H. Pamuła, Y. Stylianou, and H. Glotin. 2019. Automatic acoustic detection of birds through deep learning: the first Bird Audio Detection challenge. Methods in Ecology and Evolution 10(3):368-380. https://doi.org/10.1111/2041-210X.13103
Taylor, M. F. J., K. F. Suckling, and J. J. Rachlinski. 2005. The effectiveness of the Endangered Species Act: a quantitative analysis. BioScience 55(4):360-367. https://doi.org/10.1641/0006-3568(2005)055[0360:TEOTES]2.0.CO;2
Thompson, S. J., C. M. Handel, and L. B. McNew. 2017. Autonomous acoustic recorders reveal complex patterns in avian detection probability. Journal of Wildlife Management 81(7):1228-1241. https://doi.org/10.1002/jwmg.21285
Van Wilgenburg, S. L., P. Sólymos, K. J. Kardynal, and M. D. Frey. 2017. Paired sampling standardizes point count data from humans and acoustic recorders. Avian Conservation and Ecology 12(1):13. https://doi.org/10.5751/ACE-00975-120113
Venier, L. A., M. J. Mazerolle, A. Rodgers, K. A. McIlwrick, S. Holmes, and D. Thompson. 2017. Comparison of semiautomated bird song recognition with manual detection of recorded bird song samples. Avian Conservation and Ecology 12(2):2. https://doi.org/10.5751/ACE-01029-120202
Weninger, F., and B. Schuller. 2011. Audio recognition in the wild: static and dynamic classification on a real-world database of animal vocalizations. Pages 337-340 in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP.2011.5946409
Wimmer, J., M. Towsey, B. Planitz, I. Williamson, and P. Roe. 2013. Analysing environmental acoustic data through collaboration and automation. Future Generation Computer Systems 29(2):560-568. https://doi.org/10.1016/j.future.2012.03.004
Yip, D., L. Leston, E. M. Bayne, P. Sólymos, and A. Grover. 2017. Experimentally derived detection distances from audio recordings and human observers enable integrated analysis of point count data. Avian Conservation and Ecology 12(1):11. https://doi.org/10.5751/ACE-00997-120111