Monitoring migration timing in remote habitats: assessing the value of extended duration audio recording

ABSTRACT. Because birds are frequently detected by sound, autonomous audio recorders (called automated recording units or ARUs) are now an established tool in addition to in-person observations for monitoring the status and trends of bird populations. ARUs have been evaluated and applied during breeding seasons, and to monitor the nocturnal flight calls of migrating birds. However, birds behave differently during migration than during the breeding season. Here we present a method for using ARUs to monitor land birds during the migration period in remote habitats. We conducted in-person point counts next to continuously recording ARUs, and compared estimates of the number of species detected and focal species relative abundance from point counts and ARUs. We used a desk-based audio bird survey method for processing audio recordings, which does not require automated species identification algorithms. We tested two methods of using extended duration ARU recording: surveying consecutive minutes and surveying randomly selected minutes. Desk-based surveys using randomly selected minutes from extended duration ARU recordings performed similarly to point counts, and better than desk-based surveys using consecutive minutes from ARU recordings. Surveying randomly selected minutes from ARUs provided estimates of relative abundance that were strongly correlated with estimates from point counts and successfully showed the increase in abundance associated with migration timing. Randomly selected minutes also provided estimates of the number of species present that were comparable to estimates from point counts. Our results suggest that ARUs are an effective way to track migration timing and intensity in remote or seasonally inaccessible habitat during spring migration. Additional testing is needed to determine the efficacy of our methods during fall migration, and at more southerly latitudes.
We recommend that desk-based surveys use randomly sampled minutes from extended duration ARU recordings, rather than using consecutive minutes from recordings. Our methods can be immediately applied by researchers with the skills to conduct point counts, with no additional expertise necessary in automated species identification algorithms.


INTRODUCTION
Conserving bird populations requires knowledge of bird distribution and habitat use at all stages of their life cycle, including during breeding, migration, and non-breeding periods (Sherry and Holmes 1995). Monitoring birds' habitat use during migration is a necessary component of conservation plans for migratory birds. Historically, researchers have primarily relied on in-person observations including mist-netting (Peach et al. 1996) and point counts (Ralph et al. 1995) for migration monitoring, but because birds are frequently detected by sound, audio recording technology offers opportunities to expand monitoring techniques. Here we present a method for using audio recorders to monitor the timing of migration in remote or seasonally inaccessible habitats.
Determining how best to monitor bird abundance and diversity in remote habitat remains a challenge. The climatic variation between winter and summer in high latitude continental regions increases the challenges associated with accessing remote areas during spring migration. Significant annual snow accumulation in the winter, followed by rapid melting as temperature increases in spring, makes unpaved roads impassable for a period of weeks in much of northern North America, typically overlapping the time period when migrant bird species begin to arrive in the region in spring. For example, in the State of Michigan, in the United States, many roads are closed to vehicles and unmaintained from November to April (Michigan Transportation Fund Act 1981). Developing monitoring protocols that can be implemented despite poor traveling conditions is a way to fill in gaps in knowledge of northern forest birds and birds in similarly remote habitats.
Autonomous recording units (ARUs) are programmable audio recorders that can be deployed in the field for long time periods to efficiently maximize the spatial and temporal extent of monitoring. Passive acoustic monitoring is widely used in ecology to monitor and study vocalizing organisms; ARUs have been deployed to study bats (Tuneu-Corral et al. 2020), whales (Baumgartner et al. 2019), invertebrates (Penone et al. 2013), amphibians (Dutilleux and Curé 2020), and birds (Shonfield and Bayne 2017). ARUs are also used to evaluate the success of conservation programs (Shonfield and Bayne 2017). Current challenges for implementing passive acoustic monitoring include the availability of reference sound libraries, minimizing errors in species identification, and determining the relationship between acoustic index values and their associated real-world underlying parameters (Gibb et al. 2019), as well as accounting for differences in the sampling detection space when deploying recorders at different sites and in different configurations (Darras et al. 2016). ARU deployments are frequently limited by both battery life and data storage capabilities, but rapid advances are currently being made in deploying fully autonomous systems that are solar powered and can automatically transmit data (Sethi et al. 2018).
Point-count surveys are the most commonly used bird monitoring protocol for long-term study sites (Ralph et al. 1995, Rosenstock et al. 2002), but ARUs are now viewed as a viable supplement to point counts, especially during the breeding season when birds vocalize frequently (Furnas and Callas 2015, Klingbeil and Willig 2015, Shonfield and Bayne 2017, Darras et al. 2018, Darras et al. 2019). Many researchers have compared ARUs and point counts in terms of their estimates of species richness and relative abundance or occupancy (Haselmayer and Quinn 2000, Campbell and Francis 2011, Tegeler et al. 2012, La and Nudds 2016), including in temperate forest (Klingbeil and Willig 2015). However, none of these studies (including the 23 studies reviewed in Darras et al.'s [2018] meta-analysis) compared point counts and ARUs during migration. Birds behave and vocalize differently during migration than during the breeding season (Rappole and Warner 1976, Morse 1991). Testing and refining migration-specific monitoring techniques for ARUs is therefore necessary to understand how data from ARUs compare to data from in-person observations.
ARUs are currently used during migration to record the flight calls of nocturnally migrating species. They are deployed to track the abundance of migrants as they move through an area and can provide helpful information about migratory flyway locations, migration phenology, and relative abundance of migrants (Evans and Rosenburg 2000, Farnsworth et al. 2004, Sanders and Mennill 2014). Understanding how migrating birds use remote habitats during the migratory period is a different challenge and requires different methods. Determining how birds are distributed, the relative abundance and species richness, and the timing of arrival and departure from remote areas during migration are all important research questions for applied conservation. To take advantage of the large volume of data generated by continuously recording ARUs, researchers are actively developing methods for automated identification of vocalizing organisms (Salamon et al. 2016, Gibb et al. 2019, Cramer et al. 2020). Applying automated detection algorithms requires extensive calibration time and expertise in signal processing systems (Priyadarshani et al. 2018b). Even with extensive algorithm training, detection precision can still be low for some species (Ruff et al. 2020). We present a method that can be implemented by anyone with the skills to conduct point counts, and that does not rely on machine learning for species identification and data processing. Because applications of ARUs surveying diurnal habitat use during the migratory period have been under-explored in the literature thus far, we demonstrated and assessed an immediately applicable monitoring technique.
We compared data from ARU surveys to in-person point count surveys during spring migration in the northern Great Lakes region of the United States. Our goal was to understand how ARUs could be applied to monitor diurnal habitat use during migration by examining whether ARUs could provide estimates of relative abundance and number of species that are comparable to estimates from in-person surveys. We asked the following questions. 1) What are the differences between the number of species detected using point counts and using ARUs? 2) Can ARUs give estimates of relative abundance for focal species that are correlated with estimates of relative abundance from point counts? 3) Can randomly sampling from extended duration audio recordings provide better estimates of focal species relative abundance or the number of species detected than consecutive minutes of audio recording?

METHODS
We conducted in-person point counts alongside continuously recording ARUs on the southern shore of Lake Superior during two months at the start of spring migration. We compared both raw data and model-based estimates of the number of species detected and focal species relative abundance from point counts and ARUs. Our sampling scheme targeted diurnal land birds using the peninsular habitat during the migratory period. The sampled community consisted largely of passerine species that breed in forested habitat in North America, including Canada and Northern Michigan.

Study site
We conducted field work in a 2.7 km² area on the Point Abbaye peninsula in Baraga County, Michigan, USA (Fig. 1). Surveys took place from 2 April to 22 May 2019 and were conducted daily unless prevented by weather conditions. Field work was designed to coincide with the arrival and peak relative abundance of early-season migrating birds. Point Abbaye juts into the southern part of Lake Superior and comprises the western border of Keweenaw Bay. Habitat included forested wetland, upland hardwood, and hardwood forest disturbed by recent logging activity. We selected survey sites randomly across the study area using the R programming language and the rgdal, geosphere, rgeos, sp, maptools, and spatstat packages (Pebesma and Bivand 2005, Bivand et al. 2013, Baddeley et al. 2015, Bivand et al. 2018, Bivand and Rundel 2018, Bivand and Lewin-Koh 2019, Hijmans 2019, R Core Team 2020). We conducted a pilot study in 2018 to test our protocols and evaluate the accessibility of our randomly selected survey locations. See Appendix 1 for details about pilot year surveys, and survey site and date selection.

Automated recording units
Birds were recorded using three SWIFT bioacoustic recorder rugged units (Cornell Lab of Ornithology, Ithaca, NY, USA) and one AudioMoth bioacoustic recorder that was housed in a thin plastic bag for light weather proofing (Hill et al. 2018, Open Acoustic Devices, Southampton, UK). SWIFT units used a built-in PUI Audio brand omni-directional microphone. The AudioMoth unit used an analog micro-electro-mechanical systems (MEMS) microphone. We refer to both the SWIFT and AudioMoth units as "automated recording units" (ARUs). ARUs recorded at a sampling rate of 48 kHz and saved recordings as uncompressed .WAV files. The microphone gain was set to "mid-high" for the AudioMoth unit and 35 dB for the SWIFT units. The signal to noise ratio reported by device manufacturers is approximately 58 dB for the SWIFT units and approximately 44 dB for the AudioMoth unit.

Field survey methods
ARUs recorded continuously for five hours each day, beginning within 10 minutes of local sunrise time (United States Naval Observatory 2016). The field technician manually re-programmed recorders approximately once per week to adjust for changing sunrise times. ARUs were attached to trees less than 0.6 m in diameter, and were placed 1.5-2 m above the ground (Darras et al. 2018). The SWIFT omni-directional microphones were always oriented downward to prevent precipitation landing directly on the microphone. After the five hour recording period ended each day, ARUs were moved to new locations for the next day's samples, thereby rotating the ARU and point count samples through all 18 survey locations approximately every five days. The sampling order for the points was chosen randomly, and ARUs were deployed to the randomly selected point locations each day.
Point counts were conducted daily next to each ARU during the five hour recording period. Point counts involved recording all birds seen and heard at an unlimited distance during a stationary, 10-minute count. The technician noted wind speed, precipitation level, and non-bird noise level, which included both anthropogenic noise like boats and planes, as well as frogs and other taxa. We did not survey in high wind or heavy precipitation. See Appendix 1 for detailed point count protocols.

Desk-based audio surveys
We conducted desk-based audio bird surveys by listening to ARU recordings played through headphones on a laptop computer in the lab after the end of the field season. We tested three types of desk-based audio surveys: 1) we listened to a recording of the 10 consecutive minutes during which the in-person point count was conducted; 2) we listened to 22 minutes from each recorder, selected randomly from the five hour recording duration; 3) we sampled a subset of 10 of the 22 random minutes (without listening to those minutes again). Our goal was to compare each of these desk-based ARU survey methods to in-person point count observations. For each audio file, the desk-based survey technician noted the identity of each bird species that vocalized, the type of vocalization, and the 30-second time intervals in which each species vocalized. We discarded randomly sampled minutes that contained a human voice. While listening, we viewed spectrograms of the recording in Audacity (Audacity Team 2019). Detailed protocols for completing desk-based audio surveys can be found in Appendix 1, and a completed data sheet from a desk-based survey is shown in Fig. A1.6.
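The random-minute selection described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure actually used in the study: the function name, the representation of minutes as offsets from the start of the recording, and the seed are all hypothetical, and the sketch does not handle the replacement of minutes discarded because they contained a human voice.

```python
import random

def draw_random_minutes(window_minutes=300, n_listen=22, n_subset=10, seed=None):
    """Pick minutes to survey from one recorder-morning (hypothetical helper).

    Minutes are offsets from the start of the five-hour (300-minute)
    recording. Returns the minutes to listen to and a random subset of them.
    """
    rng = random.Random(seed)
    # Sample without replacement so that no minute is surveyed twice.
    listened = sorted(rng.sample(range(window_minutes), n_listen))
    # The 10-minute survey reuses data from the 22 minutes already heard.
    subset = sorted(rng.sample(listened, n_subset))
    return listened, subset

listened, subset = draw_random_minutes(seed=1)
print("minutes to listen to:", listened)
print("subset of 10:", subset)
```

Drawing the subset from the minutes already surveyed mirrors the statement that the 10-minute sample was taken without listening to those minutes again.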

Indices for observed number of species and relative abundance
We summed the number of unique species detected (S) separately using each survey type: 10-minute in-person point counts (S_p), 10 consecutive-minute ARU surveys (S_10C), 10 random-minute ARU surveys (S_10R), and 22 random-minute ARU surveys (S_22R) (Table 1). We calculated a value of S for each individual survey on each day, resulting in three or four values of S for each survey type on each day. We created an index of daily relative abundance (A) for individual species using each survey type (Table 1). Our relative abundance indices were: the mean observed number of individuals per point count (A_p); the proportion of 30-second intervals with a vocalization calculated by surveying n minutes in sections of 10 consecutive minutes (A_nC); the proportion of 30-second intervals with a vocalization calculated by surveying n minutes in sections of one minute chosen randomly from the five hour survey window (A_nR). To reduce the number of zero relative abundance counts in our data, we calculated indices by grouping all surveys of each type for each day, so there was a single value for each relative abundance index on each day. Abundance index abbreviations (Table 1) indicate the number of minutes surveyed per day. For example, three samples of 10 consecutive minutes per day were aggregated for an index of A_30C, indicating that 30 minutes total were sampled for each day.
Note that while data from 16 and 17 April appear on plots and in results, ARU malfunctions on those dates made the number of sampled minutes, n, different for those two dates for some of our relative abundance indices. See Appendix 1 for detailed discussion of sample size on these dates. Because these dates coincided with an important arrival period of migrants into the study area, we did not exclude them from our analyses.
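As a minimal sketch of how these daily indices could be computed, the Python below represents each 30-second interval as the set of species heard in it. The data structure, function names, species codes, and all numbers are hypothetical and chosen only to make the arithmetic visible; they are not values from the study.

```python
def interval_proportion(intervals, species):
    """Proportion of 30-second intervals in which `species` vocalized.

    `intervals` pools every interval surveyed on one day for one survey
    type, with each interval given as a set of species codes.
    """
    if not intervals:
        raise ValueError("no intervals surveyed on this day")
    hits = sum(1 for detected in intervals if species in detected)
    return hits / len(intervals)

def mean_count(counts):
    """Mean number of individuals per point count on one day (A_p)."""
    return sum(counts) / len(counts)

# Hypothetical day: 3 recorders x 10 random minutes x 2 intervals = 60 intervals.
day_intervals = [{"WIWR"}] * 12 + [{"WIWR", "GCKI"}] * 3 + [set()] * 45

a_30r_wren = interval_proportion(day_intervals, "WIWR")     # 15 of 60 intervals
a_30r_kinglet = interval_proportion(day_intervals, "GCKI")  # 3 of 60 intervals
a_p_wren = mean_count([2, 1, 0])  # three point counts on the same day

print(a_30r_wren, a_30r_kinglet, a_p_wren)
```

Pooling all intervals for a day before dividing gives a single value per index per day, as described above.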

Observed number of species
To determine whether the survey type (S_p, S_10C, S_10R, S_22R) significantly influenced the number of species detected, we modeled the number of species detected using a generalized linear mixed model (GLMM) with a Poisson error distribution and log link function, using the lme4 package in R (Bates et al. 2015, R Core Team 2020). Our fixed effects were survey type, first and second degree terms for day of year, wind, rain, noise, and interactions for the day of year terms and survey type, and for rain and survey type. We also used day of year as a random effect; we expected that the number of species detected by all surveys on each day would be strongly correlated, regardless of survey location or survey type. More information about our GLMM can be found in Appendix 1.
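In equation form, the model described above might be written as follows. This is a sketch under assumptions: the symbols, the indexing (survey i on day j), and the use of single terms w, r, and n for wind, rain, and noise are illustrative notation rather than the specification reported in Appendix 1, where some covariates may be coded as multi-level factors.

```latex
\begin{aligned}
S_{ij} &\sim \mathrm{Poisson}(\lambda_{ij}) \\
\log \lambda_{ij} &= \beta_0 + \beta_{\mathrm{type}[i]}
  + \beta_1 d_j + \beta_2 d_j^2
  + \beta_w w_i + \beta_r r_i + \beta_n n_i \\
&\quad + \gamma_{1,\mathrm{type}[i]}\, d_j
  + \gamma_{2,\mathrm{type}[i]}\, d_j^2
  + \delta_{\mathrm{type}[i]}\, r_i + u_j \\
u_j &\sim \mathcal{N}(0, \sigma_u^2)
\end{aligned}
```

Here d_j is day of year, the gamma and delta terms are the interactions of survey type with the day of year terms and with rain, and u_j is the random intercept for day of year.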

Relative abundance
We compared relative abundance estimates from A_p to each of the three desk-based audio survey types (A_30C, A_30R, A_66R) for all species that were detected at least twice using each survey type.
To illustrate these analyses, we present detailed results for two example species, Regulus satrapa (Golden-crowned Kinglet) and Troglodytes hiemalis (Winter Wren). Winter Wrens were abundant in the survey area, and vocalized frequently and loudly during early spring, representing the "best case" scenario for detectability on ARU recordings. Golden-crowned Kinglets were abundant in the survey area, but vocalized quietly (though regularly) during early spring, and so represent a greater challenge for detection using ARUs.
For each relative abundance model, we fitted boosted regression trees (BRTs) (Friedman 2001, Elith et al. 2008). We assessed whether the observed values from the three desk-based audio survey indices (A_30C, A_30R, A_66R) were correlated with the observed values from A_p, and whether the predicted values from the models trained with desk-based audio survey indices were correlated with the predicted values from models trained with A_p for all focal species using Spearman's rank correlation coefficient. To assess whether the correlation coefficients for all species varied based on which abundance index pair was used and on whether coefficients were calculated from observed or predicted values, we compared linear mixed models with and without an interaction between value type and index pair using a likelihood ratio test.

RESULTS
Between 2 April and 22 May 2019, we were able to survey on 37 days. During that time, we conducted 137 in-person point counts. We recorded 130 simultaneous 10-minute periods with ARUs (when a human observer was also present conducting a point count) and 124 periods of 22 randomly selected minutes from each morning (five hours of recording). All four audio recorders experienced occasional malfunctions; more details about these malfunctions can be found in Appendix 1. A complete list of the species detected by each survey method is in Table A1.1.

Observed number of species
A Chi-square ANOVA comparing our full model to a null model with survey type removed showed that survey type (the S-index used) had a significant effect on the number of species detected (χ²_{9, 21} = 247, p < 0.0001). We detected a similar number of species using S_10R as we did using S_p (Fig. 2; Table 2; change in the log of the number of species detected = -0.065, 95% CI [-0.2; 0.08], p = 0.3888). Using S_22R, we detected significantly more species than by using S_p (Fig. 2; Table 2; change in the log of the number of species detected = 0.305, 95% CI [0.17; 0.44], p < 0.0001). We detected fewer species using S_10C than using S_p (Fig. 2; Table 2; change in the log of the number of species detected = -0.614, 95% CI [-0.78; -0.44], p < 0.0001). Listening to randomly selected rather than consecutive minutes eliminated the gap in number of species detected between 10-minute point counts and 10-minute ARU surveys (Fig. 2). Day of year had a significant effect on the number of species detected (Table 2), with more species expected later in the migration season (Fig. 2).
Fig. 2. See Table 1 for a description of the species richness index abbreviations. Listening to 10 random minutes of data from an ARU (S_10R) allowed for detection of the same number of species as a 10 consecutive minute in-person point count (S_p). Increasing survey effort to 22 random minutes of ARU data (S_22R) increased the number of species detected to above the number of species detected by in-person point counts.
Fig. 3. Abundance indices (Table 1) calculated from: (a) and (e) three 10-minute in-person point counts, (b) and (f) three samples of 10 consecutive minutes of audio recordings from automated recording units (ARUs), (c) and (g) three samples of 10 randomly selected minutes of audio recordings from ARUs, (d) and (h) three samples of 22 randomly selected minutes of audio recordings from ARUs.
Chi-square ANOVA showed that the overall effect of wind was not significant (χ²_{19, 21} = 1.44, p = 0.48) nor was the overall effect of precipitation (χ²_{18, 21} = 3.45, p = 0.32). The overall effect of noise was significant (χ²_{19, 21} = 10.3, p = 0.005). The interaction between survey type and day of year was not significant (Table 2), providing no evidence of a difference in the effect of survey method on the observed number of species over the course of the survey season.
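Because the model uses a log link, the reported changes in the log of the number of species detected can be read as multiplicative effects by exponentiating them. The short Python sketch below does this for the three estimates quoted above; the dictionary keys are just labels. Note that the model also includes interactions with survey type, so these ratios are best read as approximate contrasts with point counts at the reference values of the other covariates, not as exact overall effects.

```python
import math

# Estimated change in log(number of species) relative to point counts (S_p),
# copied from the results reported above.
log_effects = {"S_10C": -0.614, "S_10R": -0.065, "S_22R": 0.305}

# exp() converts an additive effect on the log scale into a ratio.
ratios = {name: math.exp(beta) for name, beta in log_effects.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} times the number of species detected by S_p")
# S_10C: 0.54, S_10R: 0.94, S_22R: 1.36
```

On this reading, 22 random minutes yielded roughly a third more species than a point count, and 10 consecutive minutes yielded roughly half as many.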

Relative abundance models
BRT models of relative abundance over time differed in how well they showed the initial period of absence, and the increase in relative abundance corresponding with the arrival of migrant birds in our study area, depending on the survey method used (Fig. 3, Fig. A2.1). The general pattern of initial absence followed by arrival of migrants can be seen in both the raw data and the model predictions of relative abundance for A_p, A_30R, and A_66R for Winter Wrens (Fig. 3a, c, and d) and for A_p and A_66R for Golden-crowned Kinglets (Fig. 3e and h). The same general pattern is visible for A_p and A_66R for at least 12 additional species (Fig. A2.1). For most species, including our two example species, the observed relative abundance indices from ARU surveys were positively correlated with the observed relative abundance index from point counts (Fig. 4, Fig. 5, Table A2.1), indicating that the relative abundance proxies we calculated using ARUs are comparable to relative abundance estimates from in-person observations. For all species, correlations for predicted values of A_p and the three desk-based survey methods (A_30C, A_30R, A_66R) were generally higher than correlations of the observed index values (Fig. 6, Table 3, Table A2.1). For correlations calculated with observed abundance index values, A_p appeared to be most strongly correlated with A_30C; however, correlations calculated with predicted abundance index values were stronger between A_p and the two random-minute indices (A_30R and A_66R) (Fig. 6, Table 3). The strongest median correlation value was between predicted values of A_p and A_30R, though predicted values of A_p and A_66R were only slightly less correlated (Fig. 6, Table 3). The strong correlation between predicted values of the relative abundance indices indicates that our models found the same underlying signal regardless of whether training data were from ARUs or point counts.
Fig. 4. Abundance indices (Table 1) for Winter Wren. See Table A2.1 for Spearman's correlation coefficients. Abundance indices for ARUs (the proportion of 30-second intervals with a vocalization) are correlated with the abundance index from point counts (mean number of individuals observed per count per day). Note that axis scales vary by abundance index; absolute values are less important here than the relationship between observations. Photo: "Winter Wren" by ilouque, used under license CC BY 2.0. Cropped from original.
Fig. 5. Abundance indices (Table 1) for Golden-crowned Kinglet. See Table A2.1 for Spearman's correlation coefficients. Abundance indices for ARUs (the proportion of 30-second intervals with a vocalization) are correlated with the abundance index from point counts (mean number of individuals observed per count per day). Note that axis scales vary by abundance index; absolute values are less important here than the relationship between observations. Photo: "Golden-crowned Kinglet" by Laura Gooch, used under license CC BY-NC-SA 2.0. Cropped from original.
Fig. 6. Spearman's rank correlation coefficients between abundance indices for all species (Table A2.1). Boxes show the middle 50% of the data, and the horizontal line shows the median value in each box (Table 3). Correlations between predicted index values (right) were higher than correlations between observed index values (left).
Table 3. Median of Spearman's rank correlation coefficient values for each combination of abundance indices for all species that were detected at least twice by both methods (A_p, A_30C: n = 25 species; A_p, A_30R: n = 28 species; A_p, A_66R: n = 30 species).
The interaction between value type (observed or predicted) and index pair (A_p and A_30C, A_p and A_30R, or A_p and A_66R) was significant according to a likelihood ratio test comparing linear mixed models with and without the interaction term (L = 7.85, df = 1, p = 0.01). This indicates that the correlation values depended on the value type and indices used.
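The Spearman rank correlations reported above compare two daily index series for the same species. A minimal, dependency-free Python sketch of that calculation is given below, using average ranks for ties. The function names and the two example series are hypothetical and are not data from the study; in practice a statistics library would normally be used instead.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical daily index values for one species over six survey days.
a_p = [0.0, 0.0, 0.3, 1.0, 1.7, 2.0]          # mean individuals per count
a_30r = [0.00, 0.02, 0.05, 0.20, 0.18, 0.35]  # proportion of intervals
rho = spearman(a_p, a_30r)
print(round(rho, 3))
```

Because the two indices are on different scales (counts versus proportions), a rank-based coefficient is a natural choice: it depends only on whether the two series rise and fall together.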

DISCUSSION
Our results showed that ARUs recording for an extended duration can be as effective as in-person point counts for monitoring vocal migrating land birds in high latitude remote habitats during spring migration. The number of species detected by randomly sampling minutes from ARU recordings was similar to, or higher than, the number of species detected by point counts. Relative abundance models trained with ARU data showed the increase in relative abundance indicating the arrival of migrants at the study site, suggesting that ARUs can be used to track migration phenology in remote habitats for vocal species.

Sampling random rather than consecutive minutes from ARU recordings
Data from randomly selected minutes of ARU recordings detected more species and produced modeled relative abundance estimates that better showed the expected seasonal pattern of migration timing than data from consecutive minutes of ARU recordings. There are two likely explanations for this. First, randomly selected minutes are less temporally autocorrelated than consecutive minutes. For example, during a 10-minute in-person point count, little new information is gained during the seventh minute of the survey compared to what was collected during the sixth minute of the survey; a Winter Wren singing near the end of the sixth minute of a point count survey will likely still be singing in the beginning of the seventh minute. By selecting minutes randomly from across the five-hour survey window, the temporal correlation between each successive minute that is analyzed is minimized. Second, during the migration season, birds may move more within the study area than they would during the breeding season, when they have established a territory. The community of birds within the immediate detection radius of an observer (either a person or a recording ARU) may therefore change over the course of five hours. Wimmer et al. (2013) found that randomly selected minutes from extended duration recording provided better estimates of species richness than consecutive minute in-person surveys during the breeding season. Our findings lend support to the conclusion that using randomly selected minutes provides a more complete sample of the birds using a spatial location over the entire course of the survey window than consecutively sampled minutes.
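The autocorrelation argument above can be illustrated with a toy simulation. The Python sketch below rests on strong simplifying assumptions that are not drawn from the study's data: each species vocalizes in a single contiguous 15-minute bout placed at random in a 300-minute window, detection is perfect within a bout, and all parameter values are arbitrary.

```python
import random

def simulate(n_species=20, window=300, bout=15, n_minutes=10,
             trials=2000, seed=0):
    """Mean species detected by consecutive vs. random minutes (toy model)."""
    rng = random.Random(seed)
    total_consecutive = 0
    total_random = 0
    for _ in range(trials):
        # Minutes during which each species is vocalizing.
        starts = [rng.randrange(window - bout + 1) for _ in range(n_species)]
        active = [set(range(s, s + bout)) for s in starts]
        # One block of consecutive minutes at a random start time.
        first = rng.randrange(window - n_minutes + 1)
        consecutive = set(range(first, first + n_minutes))
        # The same number of minutes scattered across the window.
        scattered = set(rng.sample(range(window), n_minutes))
        total_consecutive += sum(1 for a in active if a & consecutive)
        total_random += sum(1 for a in active if a & scattered)
    return total_consecutive / trials, total_random / trials

mean_consecutive, mean_random = simulate()
print(f"consecutive minutes: {mean_consecutive:.1f} species on average")
print(f"random minutes:      {mean_random:.1f} species on average")
```

Under these assumptions, a block of consecutive minutes overlaps only the few bouts that happen to fall near it, whereas scattered minutes intersect bouts throughout the morning. The size of the gap depends on the assumed bout length and should not be read as a prediction for real recordings.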
For in-person point counts, travel to a survey site takes up a major portion of the total time invested, so site visits are typically limited to once per day. With ARUs, no such constraints exist; it is possible to do multiple short-duration surveys from many locations over the course of one day without additional travel and field work logistics. We recommend that studies using ARUs during migration randomly sample recordings of short periods of time (e.g., one-minute recordings) from a defined survey window relevant to the study question (e.g., the five hours following sunrise for passerines in temperate forest or twilight to dawn for crepuscular and nocturnal species). Our study focused on migration, but we recommend that studies using ARUs to monitor birds during wintering or breeding seasons (e.g., Wimmer et al. 2013) also consider using randomly selected minutes.

Effects of wind, rain, and noise on ARU surveys
We did not detect an effect of either wind or rain in our model of the number of species detected. However, because we controlled for adverse weather conditions during our field surveys by not deploying ARUs on rainy or windy days, the number of high wind values in our data was low, as was the number of rainy survey days. We noted anecdotally that the wind values recorded in person for a survey day did not always correlate with the amount of wind heard while conducting our desk-based audio surveys; we speculate that wind direction in relation to the microphone may make a difference in how much wind is actually picked up by the ARU. Given that wind and rain have an effect on the detectability of birds in the study system (Ralph et al. 1995), they remain important predictors to include, despite not appearing significant in our model.
Interpreting the significance of the noise variable is challenging, because we used the variable to describe all non-avian noise in the environment, which could include waves, airplanes, and frogs. We suspect that the overall significance of the noise variable may be due to frogs. Future studies may want to consider distinguishing between other vocalizing taxa and surrounding environmental noise, as ARUs can be used to simultaneously sample multiple taxa (e.g., crickets and bats; Newson et al. 2017). Background noise in ARU recordings can impede the ability of human listeners or automatic identification algorithms to identify bird calls and songs (Priyadarshani et al. 2018b). We addressed background noise in ARU recordings both in the experimental design and in statistical analysis: we did not deploy ARUs during high winds or during heavy precipitation, and we included wind speed and noise as "nuisance" covariates in statistical models. However, most uses of ARUs will deploy ARUs for much longer periods of time (weeks or months), and will not have the option of avoiding recording during strong wind and precipitation. Most uses of ARUs will therefore require an audio cleaning step to identify and deal with sections of recordings with high amounts of background noise.

Estimating relative abundance despite imperfect detection of birds
Estimates of abundance are more useful than estimates of occurrence for prioritizing conservation resources at dynamic temporal scales, such as during migration (Johnston et al. 2015). ARUs do not solve the problem of how to estimate true abundance during migration. We accounted for variation in detectability by controlling for observer effort and weather variables, but we recognize that imperfect detection, and the possibility of vocal behavior changing over time, means that the number of individuals detected is not necessarily a good estimate of the number of individuals present (MacKenzie and Kendall 2002). Hierarchical models that account for imperfect detection (MacKenzie et al. 2002, Kéry andRoyle 2016) rely on assumptions about population closure that may be badly violated during migration, when birds are only present in stopover habitat for short periods of time. The period in which we can reasonably assume population closure for our study area during migration may be as short as several hours or as long as several days, depending on weather conditions. Therefore disentangling true occupancy or abundance from detectability is difficult, whether using traditional in-person survey methods or ARUs. The current standard in studies that examine abundance during the migration period is to account for detectability by controlling for effort and weather (Johnston et al. 2015).
It is possible that individual birds' vocalizations may increase over the spring migration period, as birds prepare for the breeding season. Changes in vocalization behavior over the migration season could confound our estimates of relative abundance. One possible avenue for dealing with these issues is the method proposed by Metcalf et al. (2019), which used ARU data and dynamic occupancy models that allowed detectability and occupancy to vary over short timescales relevant to migration. We believe increases in relative abundance shown by our models reflect real increases in relative abundance associated with the arrival of these species in the study area, demonstrated by the abrupt arrival of focal species apparent in both raw data and model predictions (Fig. 3, Fig. A2.1) and by the moderate to strong correlations between results using A_p (relative abundance from point counts) and results using our ARU relative abundance indices for many focal species. ARUs can therefore provide valuable information about migration phenology, comparable to the information obtained by in-person surveys, even if estimating true abundance remains challenging.

Desk-based ARU surveys are generalizable to many species
The pattern of absence followed by arrival is visible for both A_p and A_66R (and often A_30R) for many species in addition to our example species (Fig. A2.1). This suggests that our methods are generalizable to many vocal birds in this region. The model predictions from random-minute ARU indices were more strongly correlated with A_p than were predictions from the consecutive minute ARU index (Fig. 6, Table 3). In contrast, raw observed values from A_30C were more strongly correlated with observed values from A_p than were the random-minute indices (A_30R and A_66R). However, the goal of the abundance models was to produce similar summary conclusions from the data (i.e., similar out of sample predictions of relative abundance), not to reproduce the raw data values. Therefore, the random-minute ARU indices provided better estimates of relative abundance than did the consecutive minute ARU index.
Using correlation between point count and ARU relative abundance indices is an imperfect measure of ARU index performance. It works well when there is a strong directional trend in relative abundance, as we see with arriving migrant species. However, some vocal resident species (e.g., Common Raven, CORA, Fig. A2.1) showed no directional trend in relative abundance over time. Though the BRT models successfully showed the same overall trends for A_p, A_30R, and A_66R, the correlation coefficients were low (Table A2.1). Alternative measures for comparing models might more accurately describe how ARU surveys compare to point counts for all species, not only those that show strong directional trends in relative abundance.

Adjusting ARU methods for different study systems
Future studies might consider increasing ARU survey effort beyond our maximum of 66 randomly selected minutes per day.
We were able to model relative abundance for more species with A_66R (n = 30) than with A_30R (n = 28). The rate at which we detected new species when sampling additional random minutes slowed notably before 22 minutes of sampling were reached, suggesting that few new species would be detected by additional sampling, except in early April (Fig. A1.7). The optimum number of minutes to sample will likely depend on the study system and season.
We limited our survey window to the first five hours after sunrise in order to maintain similarity to common point count protocols. We limited our survey effort to 66 random minutes per day because we wanted to keep the technician workload for desk-based audio surveys similar to the time invested for point counts; when researchers do not have to travel between survey locations they may invest a higher percentage of their time listening to recordings. Future studies need not be bound to these constraints. ARUs can be used more flexibly when researchers are not concerned with direct comparison between in-person and ARU survey methods. Similarly, while we placed ARUs at about head height because we were making direct comparisons to point counts, one could choose an optimal height to mount ARUs based on either behavioral characteristics of focal species or consideration of sound transmission and attenuation (Priyadarshani et al. 2018a). For example, researchers wishing to monitor canopy-dwelling forest birds may wish to place ARUs higher in order to better target those species.

Are ARUs useful across the entire migration route?
Our study site and survey timing represent the northern end of the spring migratory journey, and therefore may represent a best-case scenario for ARU use during spring migration. Indeed, more southerly sites and the fall migratory period may present less favorable conditions for monitoring with ARUs, because many species may not vocalize as reliably when they are farther from their breeding grounds or moving away from their breeding grounds. We were unable to differentiate between individual birds using our study site as a stopover location before moving on to more northerly breeding grounds and those that would eventually establish a breeding territory locally. Future studies could evaluate the applicability of ARUs in migration stopover specifically by replicating this study farther south, where many of the species we detected will stop over but not breed. Further research is necessary to determine how far south these methods are applicable during spring, and whether they will work during fall migration.
Future studies using ARUs to monitor bird migration may wish to take advantage of ARUs' unique ability to scale research in ways that may be infeasible or prohibitively expensive for in-person field work. ARUs can increase the amount of data collected without increasing the costs associated with technician-hours in the field (Williams et al. 2018). For example, ARUs could be deployed in dense, small-scale networks to examine microhabitat use in stopover regions. Alternatively, they could be deployed on a latitudinal gradient covering hundreds or thousands of kilometers to examine how vocal behavior changes over the spring migration period as birds approach their breeding grounds.

Conclusion
Applying the methods described here can facilitate an increase in survey effort in difficult-to-access habitats in high latitude forests during migration. Temporal variation in accessibility in these habitats is dramatic, as unpaved roads typically turn from snow to slush to impassable mud before hardening into reliably dry surfaces in early summer. ARUs can eliminate many of the restrictive logistics and safety concerns for researchers interested in monitoring spring migration. Our method of using desk-based surveys of randomly selected minutes from ARUs can be used by any researcher with the skills to conduct point counts. Researchers can set up ARUs during winter conditions when access to study sites over snow is relatively easy (e.g., using snowmobiles, skis or snowshoes), and revisit to collect the audio data once conditions have stabilized in late spring. Our methods for using ARU data to model relative abundance of focal species and the number of species present during migration can be immediately applied to increase monitoring effort in logistically difficult regions.

Pilot year surveys; site selection
We selected survey sites randomly across the study area with the criteria that points were at least 100 m away from the shoreline, and at least 300 m apart, to ensure ARUs recorded non-overlapping areas (Klingbeil and Willig 2015). During the pilot year of surveys in 2018, nineteen points were initially selected and tested (eighteen chosen randomly as previously mentioned, and one point selected by hand near the tip of the peninsula). Three of those initial nineteen points were dropped for the 2019 season, due to difficult access, posted private property signs, and smaller effective survey area (because of proximity to water) relative to the other points. For the 2019 field season, the survey area was expanded from 2.5 km² to 2.7 km² to include an adjacent Keweenaw Land Trust property, and two additional survey points were added.

Date selection
We selected survey dates by reviewing historic observations for four early season migrant would allow us to catch the peak in the daily number of observations reported to eBird for each of our target species. This survey period did not encompass the entirety of spring migration for all migrant species in the region, but was designed to capture the peak of abundance during migration for our focal migrant species.

Point count protocols
The field technician announced aloud the beginning and end of each point count, as well as the date, time, location name, and geographic coordinates, so that this information could be recorded on the ARU, as well as on the technician's data sheet. Each species observed was noted on a data sheet, including the number of individuals seen, the bearing of the first individual or group detected (relative to the direction the observer was facing), the detection method (call, song, woodpecker drum or visual), the distance from the observer (in three distance bands of 0-25 m, 26-50 m, or 50+ m), and the minute of the survey in which the species was first detected (0-9). The observer also noted cloud cover (0-33%, 34-66%, 67-100%), precipitation (Dry, Fog/Haze, Drizzle, Rain/Snow), Beaufort wind scale rating (0-5) (Beaufort 1805), and non-bird noise level for each point count (0-4). We did not survey if the wind was greater than force 5, or in heavy, continuous precipitation.

Desk-based survey protocols
No more than five hours of desk-based audio surveys were conducted in a single day, and all audio recordings were listened to at full speed. The technician was allowed a maximum of 15 minutes to listen to each 10 minute recording, during which time they could pause, rewind or replay the audio file, and could look up and play songs or calls from any external resource they felt might be helpful, excluding any kind of automated identification program. Because of the difficulty of deciding what constitutes a single "vocalization" from species with different songs and calls, we did not attempt to count the number of vocalizations. While conducting desk-based audio surveys, we viewed spectrograms of the recording in Audacity (Audacity Team 2019). Spectrograms were viewed in gray scale, with a minimum frequency of 0 kHz and a maximum frequency of 15 kHz. The gain (brightness) of the spectrogram was 20 dB, while the range (contrast) was 80 dB. Frequency gain was 0 dB/dec. Window size was 256, with window type Hanning and a zero padding factor of 1.
To process the data from the 10 consecutive minute counts, we clipped the audio recording of each 10 minute point count from its larger audio file, excluding voice announcements about the location, date and time of recording and including only the "begin point count" and "end point count" announcements from the survey technician. Because the field survey technician also conducted the desk-based audio bird surveys, the second author anonymized the audio recorder file names so that the survey technician could not see or hear the dates and locations of the audio recordings. This reduced the possibility that memories of particular days or locations would influence the data collected during the desk-based audio survey. In the anonymized file names we included an indicator of the two week period in which the point count took place ("early" or "late" in April or May) because information about season is used by bird observers to inform their mental list of "possible" species, and this information would be available to a technician conducting desk-based audio bird surveys in practical applications.
In order to ensure that the technician's desk-based audio survey species identifications were reproducible, we duplicated 20% of the 10 consecutive minute recordings, and assigned new anonymous names to the duplicated recordings, so that the technician listened to that data twice. After data entry and de-anonymization, we compared the species detected in each duplicated recording.
The desk-based survey process used for listening to 24 random minutes was similar to the process used for 10 consecutive minute recordings. Selection of the 24 random minutes was done in R using the 'warbleR' package and work flow (Araya-Salas and Smith-Vidaurre 2017, R Core Team 2020). We wanted to analyze a minimum of 20 random minutes without anthropogenic disturbance, so we selected and clipped 24 minute-long segments from each day's audio for each recorder. We subsequently discarded any clip that contained a human voice, but did not discard clips containing other possibly anthropogenic noise (e.g. footsteps), because we did not have a clear way to distinguish between human and wild animal sounds. We also did not control for distant anthropogenic noises such as vehicles and planes. We created a new sample of 24 random minutes to select from each ARU on each survey day, so that we did not select the same minutes from each day or ARU. We only selected 24 random minutes from an ARU on days when the unit recorded the full five-hour survey window.
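The random-minute selection described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R/warbleR workflow; the function names, the seed, and the two flagged minutes are hypothetical, and the five-hour window is represented simply as minute offsets 0 to 299.

```python
import random


def select_random_minutes(window_minutes=300, n_select=24, seed=None):
    """Draw n_select distinct minute offsets from one recording window.

    window_minutes=300 stands for the five-hour survey window; a fresh
    draw (here, a different seed) would be used for each ARU and day.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(window_minutes), n_select))


def drop_flagged(minutes, has_human_voice):
    """Discard minutes that a listener flagged as containing a human voice."""
    return [m for m in minutes if m not in has_human_voice]


# Hypothetical example: 24 minutes drawn, two of them flagged by the listener.
minutes = select_random_minutes(seed=1)
flagged = {minutes[0], minutes[5]}
kept = drop_flagged(minutes, flagged)
```

Drawing a few more minutes than the target leaves a buffer, so that discarding clips with human voices still leaves at least the minimum number of usable minutes.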

We listened to randomly selected minutes using the procedures outlined above for the 10 consecutive minute desk-based audio surveys, but allowing for a maximum of 50 minutes to listen to each set of 24 random minutes. This allowed for approximately the same effective listening time for the audio files (50% more than the length of the original file), but included extra time for file management (opening and closing the audio files in Audacity). After discarding minutes that contained a human voice, we had a sample of 22 random minutes from each ARU on each survey day.

Sample size on April 16th and 17th
The number of surveys per day over the course of the season varied based on local conditions.
We deployed at least three ARUs every day, and deployed a fourth ARU on days when insufficient weatherproofing was not likely to interfere with recording efforts. We considered all survey days with at least three in-person point counts, and five-hour recording windows from at least three ARUs, a "complete" survey day. On two days (April 16th and 17th) we failed to capture a complete survey day due to ARU SWIFT03 malfunctioning. On April 16th, we also were unable to conduct three in-person point counts, and conducted only two counts, one alongside the functionally recording SWIFT01, and one next to the malfunctioning SWIFT03. SWIFT02 successfully recorded the full five-hour survey window on April 16th, but no point count was conducted there.
Because these dates coincided with an important arrival period of migrants into the study area, we did not drop them from our analyses. Because we modeled species richness per count rather than per day, our species richness models were unaffected by the anomalies described above. However, because we aggregated our abundance indices per day, it is important to note that the abundance indices for April 16th and 17th differ from those for the other survey days (see Box 1 for description of abundance indices). On the 16th of April, the abundance index for point counts is the mean number of individuals detected per count, but averaged between only two point counts instead of the usual three counts. On that date, the abundance index for consecutive minute ARU counts was A_10C, rather than A_30C. On April 17th, the abundance index for consecutive minute ARU counts was A_20C, rather than A_30C. We opted to leave these dates in our models with reduced survey effort, rather than remove them. To compensate for the malfunctioning third recorder, we selected an additional 11 random minutes from each of the functional recorders so that we do have abundance indices of A_66R and A_30R for those dates, but they are sampled from only two ARUs instead of three ARUs.
For days when four ARUs were deployed, we listened to 22 random minutes from each ARU (88 minutes total), but standardized effort to 66 minutes and 30 minutes per day for the A_66R and A_30R indices, respectively. To do this, we randomly selected 66 and 30 minutes from the entire survey day, which may include data from all four of the ARUs.
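As a rough illustration of this effort standardization, the sketch below pools the surveyed minutes from four recorders and subsamples them to fixed daily totals. It is written in Python rather than the authors' R code; the recorder labels and minute values are placeholders, and drawing the 66-minute and 30-minute samples independently is an assumption, since the text does not say whether one was nested in the other.

```python
import random


def pool_minutes(minutes_by_aru):
    """Flatten {aru_id: [minute, ...]} into a list of (aru_id, minute) pairs."""
    return [(aru, m) for aru, mins in minutes_by_aru.items() for m in mins]


def subsample_effort(pooled, n_minutes, seed=None):
    """Randomly keep n_minutes surveyed minutes from the pooled daily sample."""
    rng = random.Random(seed)
    return rng.sample(pooled, n_minutes)


# Placeholder day: four ARUs, 22 surveyed minutes each (88 in total).
day = {f"ARU{i}": list(range(22)) for i in range(1, 5)}
pooled = pool_minutes(day)

minutes_66 = subsample_effort(pooled, 66, seed=0)  # effort for the 66-minute index
minutes_30 = subsample_effort(pooled, 30, seed=1)  # effort for the 30-minute index
```

Sampling from the pooled list, rather than dropping one recorder, keeps every ARU eligible to contribute to the standardized daily sample.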

Generalized Linear Mixed Model methods
We included an interaction between day of year and survey type because we anticipated that the effect of survey type might change over time, for example if increased numbers of species or individuals later in migration made distinguishing identifiable sounds on the audio recordings more difficult. We included an interaction between rain and survey type because we expected that even small amounts of rain hitting the ARUs might impair our ability to detect birds on desk-based audio surveys more than a similar amount of precipitation during point counts. Because the number of observations for some values of the categorical weather variables was small, we pooled levels as follows: wind was binned into categories 0-1, 2, or 3+; rain was binned into "wet" and "dry" conditions; noise was binned as 0, 1, or ≥2. We centered and scaled all continuous variables. We did not perform model selection but rather included all variables in the final model due to an a priori expectation that all variables were relevant to the study system. We assigned the weather variables noted in person during point counts to all the desk-based audio surveys from the same unit on the same day. While weather conditions may have changed slightly over the course of the survey window, the observed weather conditions from the point count represent our best estimate of the conditions at each survey location on each survey day.
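The pooling and scaling steps can be illustrated with a short Python sketch. The function names are hypothetical. Two details are assumptions that the text does not settle: every precipitation category other than "Dry" is treated as "wet", and the top noise bin is taken to cover levels 2 and above. Scaling here divides by the sample standard deviation.

```python
from statistics import mean, stdev


def bin_wind(beaufort):
    """Pool Beaufort ratings into the categories 0-1, 2, or 3+."""
    if beaufort <= 1:
        return "0-1"
    if beaufort == 2:
        return "2"
    return "3+"


def bin_rain(precipitation):
    """Assumption: anything other than 'Dry' counts as wet."""
    return "dry" if precipitation == "Dry" else "wet"


def bin_noise(level):
    """Assumption: the top bin pools noise levels 2 and above."""
    return str(level) if level <= 1 else "2+"


def center_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


# Placeholder day-of-year values for a handful of surveys.
day_of_year = [95, 100, 110, 121, 130]
scaled_day = center_scale(day_of_year)
```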

Differences between individual ARUs
To investigate whether there was a discernible difference between our individual ARUs, we ran a Poisson GLM for all 10 consecutive minute ARU surveys that included all the predictor variables as described above and in the main text, and an additional "ARU ID" variable that identified which ARU was used on each survey. Chi-square ANOVA comparing this model to a null model without ARU ID showed no significant difference at the 0.05 level (χ²(3, 118) = 6.96, p = 0.07). Based on this result, to conserve degrees of freedom, we did not include ARU ID as a predictor variable in our final GLMM.

Boosted Regression Tree (BRT) relative abundance models
We fit BRTs with a Laplace distribution and an absolute loss function, which is more robust than an RMSE loss function to data with long-tailed distributions (Hastie et al. 2009). We ran BRTs with an interaction depth of one, a minimum of one observation per node, and a bag fraction of 0.8. To optimize the number of trees and shrinkage parameter for our boosted regression tree models (BRTs), we set the cross validation parameter built into the gbm function to ten, and looked at graphs of the cross-validation test error for ten iterations of our model using the function gbm.perf in the gbm package (Greenwell et al. 2019, R Core Team 2020). We aimed to optimize the shrinkage parameter to grow at least 1000 trees before models started to overfit (indicated by increasing cross-validation test error) (Elith et al. 2008). In some cross-validation folds, the test error increased immediately, after fitting only one tree (Fig. A1.5b). In these cases, we tested the smallest shrinkage parameter recommended by Elith et al. (2008), which was 0.0001. We believe that the immediate overfitting is likely due to the large number of zeros in the data, and we think it unlikely that we would be able to tune parameters to build good models with these data; the problem is with the data rather than with the tuning of the model parameters (see e.g. the large number of observed zeros in Fig. 3B, F). This supports our overall conclusion that using consecutive minute ARU recordings is less useful for assessing the relative abundance of migrant species than using randomly selected minutes from ARU recordings.
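The logic of reading a tree count off cross-validation error curves can be sketched as follows. This Python sketch is not the gbm.perf function, and the authors chose tree counts by inspecting graphs rather than by an automated rule. The function names and error values are made up for illustration.

```python
def best_n_trees(cv_errors):
    """Return the tree count that minimizes mean cross-validation test error.

    cv_errors[f][t] is the test error for fold f after fitting t + 1 trees.
    """
    n_steps = len(cv_errors[0])
    mean_error = [
        sum(fold[t] for fold in cv_errors) / len(cv_errors)
        for t in range(n_steps)
    ]
    best_index = min(range(n_steps), key=mean_error.__getitem__)
    return best_index + 1, mean_error


def overfits_immediately(fold_errors):
    """True when a fold's test error is never lower than after the first tree."""
    return fold_errors.index(min(fold_errors)) == 0


# Made-up error curves for three folds and four trees.
curves = [
    [1.0, 0.8, 0.7, 0.75],
    [1.2, 0.9, 0.85, 0.9],
    [0.5, 0.6, 0.7, 0.8],  # error rises from the first tree onward
]
n_trees, mean_curve = best_n_trees(curves)
```

A fold like the third one, where the error only rises, corresponds to the immediate overfitting pattern described above; lowering the shrinkage parameter stretches the curves out but does not remove that pattern if the data themselves are the limiting factor.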
To ensure that our results were not solely based on our choice of modeling method, we also modeled relative abundance using generalized additive models (GAMs) because they allowed specification of a negative binomial error distribution which we suspected might fit our data well. We tested GAMs using two of our four abundance indices: A_p and A_30C. GAMs were fit with a thin plate spline, with the number of knots optimized at k = -1. GAMs were fit using the 'gam' function in the 'mgcv' package (Wood 2003, Wood 2011, Wood 2017). As with BRTs, we fit GAMs using 200 iterations of five-fold temporal block cross validation, which used blocks of three consecutive days (Fig. S3). To compare GAMs and BRTs, we calculated a mean Root Mean Square Error (RMSE) of all 1000 model iterations.
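A minimal Python sketch of temporal block cross-validation follows, to make the fold construction concrete. It is not the authors' R code. The function names are hypothetical, and shuffling the blocks and dealing them round-robin into folds is an assumption about how blocks were assigned.

```python
import random


def temporal_block_folds(days, block_len=3, n_folds=5, seed=None):
    """Group consecutive days into blocks, then assign each block to a fold.

    Returns {day: fold_index}. All days in a block share one fold, so
    temporally adjacent days are not split between training and test data.
    """
    rng = random.Random(seed)
    blocks = [days[i:i + block_len] for i in range(0, len(days), block_len)]
    order = list(range(len(blocks)))
    rng.shuffle(order)
    fold_of_day = {}
    for position, block_index in enumerate(order):
        for day in blocks[block_index]:
            fold_of_day[day] = position % n_folds
    return fold_of_day


def split_by_fold(fold_of_day, test_fold):
    """Withhold one fold as test data and train on the remaining folds."""
    train = sorted(d for d, f in fold_of_day.items() if f != test_fold)
    test = sorted(d for d, f in fold_of_day.items() if f == test_fold)
    return train, test


days = list(range(1, 38))  # survey days 1..37
folds = temporal_block_folds(days, seed=42)
train_days, test_days = split_by_fold(folds, test_fold=0)
```

Repeating the assignment with different seeds gives the repeated iterations; 200 repeats of five folds yields the 1000 fitted models over which RMSE was averaged.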

Details about final sample size
One SWIFT unit recorded 36 five-hour survey days, two SWIFT units recorded 35 five-hour survey days, and the AudioMoth unit recorded 24 five-hour survey days, for a total of 650 hours recorded by ARUs. Because of ARU malfunctions, on some days ARUs did not record the full five-hour period (as described for the 16th and 17th of April), but we were able to manually turn on the units for the 10-minute period during the in-person point count. Therefore, we recorded 130 10 consecutive minute periods with ARUs (during which a human observer was present conducting a simultaneous point count) on 37 survey days, but only 124 periods of 22 randomly selected minutes on 36 survey days.
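As a check on the recording total, the arithmetic can be written out, reading "two SWIFT units recorded 35 five-hour survey days" as 35 days for each of the two units:

```latex
\begin{align*}
\text{recorder-days} &= 36 + (2 \times 35) + 24 = 130,\\
\text{hours recorded} &= 130 \times 5\,\text{h} = 650\,\text{h}.
\end{align*}
```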

Analysis of duplicated recordings
The duplicated recordings had perfect agreement about the occurrence of Winter Wren, indicating that we can have high confidence in Winter Wren identification from ARU surveys. We intended to use Krippendorff's alpha (Krippendorff 2013) to assess agreement about detections of focal species on duplicated desk-based audio surveys of 10 consecutive minute counts. However, we did not have enough detections of Golden-crowned Kinglet in our duplicated recordings to calculate a value for Krippendorff's alpha; we therefore do not have an estimate of the reliability of identification of Kinglets on ARU recordings. We encourage future researchers to give consideration to listener agreement when using ARU data.

GAM results
GAMs and BRTs performed similarly for estimating relative abundance trends across both survey types on which GAMs were tested based on evaluating Root Mean Square Error (RMSE) (Table S2), and gave qualitatively similar models of the change in bird abundance during the migration season. This indicated that the choice of modeling method and error distribution did not have a large effect on the results of our model, and we ultimately chose BRTs following Johnston et al.'s (2015) methods for modeling abundance.

Table A2.1. Predictions of A_p and the random minute indices (A_30R and A_66R) are strongly correlated, indicating that the model is finding the same signal regardless of whether training data are from ARUs or point counts. Photo: "Golden-crowned Kinglet" by Laura Gooch, used under license CC BY-NC-SA 2.0. Cropped from original.

Figure A1.3: Schematic of temporal block cross-validation used for fitting and testing abundance models. Days (1, 2, 3, ..., 37) were grouped into blocks of three consecutive days. Each block of three days was assigned to one of five cross-validation folds (colors, panel A). Abundance models were fitted by withholding data from days in one fold (e.g. the orange fold) and using data from days in the other four folds as training data (B). The performance of the model was evaluated based on how well it predicted data from days in the test fold (C).

Figure A1.4: Test error for 10 cross validation folds for a Boosted Regression Tree model, for each of four abundance indices for Winter Wren relative abundance models. Colored lines show the cross-validation fold for which the error was calculated, and the black vertical bar shows the number of trees we chose for each abundance index. Plots a, c and d all show test error for a shrinkage rate of 0.0005; plot b shows test error for a shrinkage rate of 0.0001. For models that appeared to overfit immediately (b), we believe the problem was with the data rather than with the tuning parameters (see discussion in Appendix 1).

Figure A1.5: Test error for 10 cross validation folds for a Boosted Regression Tree model, for each of four abundance indices for Golden-crowned Kinglet relative abundance models. Colored lines show the cross-validation fold for which the error was calculated, and the black vertical bar shows the number of trees we chose for each abundance index. Plots a and d both show test error for a shrinkage rate of 0.0005; plots b and c show test error for a shrinkage rate of 0.0001. For models that appeared to overfit immediately (b, c), we believe the problem was with the data rather than with the tuning parameters (see discussion in Appendix 1).

(Table A2.1). Four-letter species codes correspond to the species common names found in Table A2.1. Blank plots indicate that there was not enough data from that abundance index to fit a model.