Introduction

Beta diversity, defined as the variation in species composition among sites in a geographic region of interest [@Legendre2005AnaBet], is an essential measure to describe the organization of biodiversity through space. Total beta diversity within a community can be partitioned into local contributions to beta diversity (LCBD) [@Legendre2013BetDiv], which allow the identification of sites with exceptional species composition, hence unique biodiversity and potential conservation value. Sites with unique community composition often differ from those with high species richness, possibly as they harbour rare species or help maintain beta diversity [@daSilva2018DisCor; @Heino2017UnrCor; @Landeiro2018SpeLow]. Hence, focusing on uniqueness can prove helpful as a complementary approach to species richness [@Heino2017ExpSpe; @daSilva2018DisCor; @Yao2021EcoUni; @Dubois2020EcoUni]. However, the use of LCBD indices is currently limited in two ways. First, LBCD indices are typically used on data collected over local or regional scales with relatively few sites, for example, on fish communities at intervals along a river or stream [@Legendre2013BetDiv]. Second, LCBD calculation methods require complete information on community composition; thus, they are inappropriate for partially sampled sites (e.g., where data for some species are missing or uncertain) and cannot directly provide assessments for unsampled ones. Accordingly, this method is of limited use to identify areas with exceptional biodiversity in regions with sparse sampling. However, predictive approaches offer an opportunity to overcome such limitations, as computational methods often uncover novel ecological insights from existing data [@Poisot2019DatSyn], including in lesser-known locations and on larger spatial scales.

Species distribution models (SDMs) [@Guisan2005PreSpe] can bring a new perspective to LCBD studies by filling in gaps in community composition data to perform analyses on broader scales. Single-species SDMs aim at predicting the distribution of a species in unsampled locations based on information (such as environmental data) from sampled locations with reported occurrences. Many approaches allow going from single-species SDMs to a whole community on which to evaluate community-level metrics, yet their relevance has not been explicitly evaluated for ecological uniqueness and LCBD indices. The most straightforward approach is stacked distribution models (S-SDMs) [@Ferrier2006SpaMod; @Guisan2011SesNew]. Single-species SDMs are first performed separately, then combined to form a community prediction on which community-level analyses can be applied. S-SDMs tend to overestimate species richness [@Dubuis2011PreSpa; @DAmen2015UsiSpe; @Zurell2020TesSpe], which could result from thresholding the probabilities into presence-absence data before stacking the species distributions [@Calabrese2014StaSpe]. Summing the occurrence probabilities without applying a threshold is an alternative [@Calabrese2014StaSpe], but it may limit some analyses as it does not return species identities for every site [@Zurell2020TesSpe], as is required with LCBD calculations. In comparison, joint species distribution models (JSDMs)[@Pollock2014UndCoo] try to improve predictions by incorporating species co-occurrence or shared environmental responses into the models. However, these models do not always improve community-level predictions compared to S-SDMs [@Zurell2020TesSpe]. Spatially explicit species assemblage modelling (SESAM) [@Guisan2011SesNew], hierarchical modelling of species communities (HMSC) [@Ovaskainen2017HowMak], and Bayesian networks (BN) [@Staniczenko2017LinMac] are other alternatives that could yield better community predictions than S-SDMs. On the other hand, they add methodological and computational overload, impeding their use for broad spatial extents. Moreover, their relevance for community prediction is often validated against extensive work on species richness. By comparison, ecological uniqueness and LCBD indices have rarely been used in predictive frameworks. Therefore, S-SDMs may prove an appropriate first step to establish some prediction baselines.

Combining LCBD indices with a predictive approach through SDMs will allow measuring uniqueness over broader spatial extents, across continuous landscapes, and on a higher number of sites than what has previously been studied. LCBD scores have typically been used at local or regional scales with relatively few sites [up to 60 sites on extents covering 10 km to 400 km, @Legendre2013BetDiv; @daSilva2014LocReg; @Heino2017UnrCor; @Heino2017ExpSpe]. Some studies did use the measure over broader, near-continental extents [@Yang2015ComSim; @Poisot2017HosPar; @Taranu2020LarMul], but the total number of sites in these studies were relatively small (maximum 51 sites). Recent studies also investigated LCBD and beta diversity on sites distributed in contiguous grids or as pixels, hence uniform sampling intervals and no spatial gaps, but these did not cover large extents and a high number of sites [up to 1250 sites and 6 km^2^, @Tan2017HowBet; @Tan2019UndPro; @Legendre2019SpaTem; @DAntraccoli2020MorSpe]. Two recent studies have, however, adopted promising predictive approaches on regional extents. First, @Niskanen2017DriHig predicted LCBD values of plant communities (and three other diversity measures) on a continuous scale and a high number of sites (> 25,000) using Boosted Regression Trees (BRTs). However, they modelled the diversity measures directly after calculating them on a smaller number of sampled sites. Second, @Vasconcelos2018ExpImp used ecological niche models (ENMs) to predict anurans ecological niches according to actual and forecasted environmental conditions, then calculated the LCBD values on the predictions to identify biodiversity hotspots. Using this approach, predicted LCBD values are calculated in a way closer to the original formulation. This development of predictive techniques is exciting, especially as it could be pushed a step further to continental extents, a higher number of sites, and more species occurrences using SDMs and massive data sources. Still, it should be accompanied by an investigation of the determinant of ecological uniqueness in such conditions.

Measuring ecological uniqueness from LCBD indices over broad spatial extents and spatially continuous data also raises the question of which sites will be identified as exceptional and for what reason. The method intends that sites stand out and receive a high LCBD score whenever they display an exceptional community composition, be it a unique assemblage of species with high conservation value or a community richer or poorer than others in the region [@Legendre2013BetDiv]. Both the original study and many of the later empirical ones have shown a negative relationship between LCBD scores and species richness [@Legendre2013BetDiv; @daSilva2014LocReg; @Heino2017UnrCor; @Heino2017ExpSpe], although other studies observed both negative and positive relationships at different sites [@Kong2017SpaVar] or quadrats [@Yao2021EcoUni]. Some studies showed that the direction of the relationship is related to the percentage of rare species in the community [@daSilva2018DisCor; @Yao2021EcoUni]. However, beta diversity and species rarity are both concepts that depend on scale. For instance, total beta diversity increases with spatial extent [@Barton2013SpaSca] and varies because of higher environmental heterogeneity and sampling of different local species pools [@Heino2015ComAna]. Therefore, the LCBD-richness relationship and the effect of rare species on LCBD values should be investigated over broad spatial extents, as they might not be constant across scales.

Here, we examined whether species distribution models (SDMs) can be combined with local contributions to beta diversity (LCBD) to assess ecological uniqueness over broader spatial extents. We also investigated the effect of scale changes on beta diversity quantification. We first predicted species distributions on continental scales using extended occurrence data from eBird and Bayesian additive regression trees (BARTs). We then quantified uniqueness with the LCBD measure for both predicted and observed data. Next, we examined the site-wise difference using direct comparison, a spatial autocorrelation test, and generalized linear regression. We then investigated the relationship between uniqueness and species richness for different regions and scales and according to the proportion of rare species.

Methods

Occurrence data

We used occurrence data from eBird [@Sullivan2009EbiCit] downloaded through the eBird Basic Data set from June 2019 [@eBirdBasicDataset2019VerEbd]. We restricted our analyses to the New World warbler family (Parulidae) in North America (Canada, the United States, Mexico). eBird is a semi-structured citizen science data set, meaning that observations are reported as checklists of species detected in an observation run [@Johnston2020AnaGui]. Observers can explicitly specify that their checklist contains all species they could detect and identify during a sampling event, in which case it is labelled as a "complete checklist." Using complete checklists instead of regular ones allows researchers to infer non-detections in locations where detection efforts occurred, which offers performance gains in species distribution models [@Johnston2020AnaGui]. Therefore, we selected the data from the complete checklists only. Our final data set comprised 62 warbler species and 22,974,330 observations from 9,103,750 checklists. Warblers are a diverse group with many species, are popular among birders given their charismatic aspect, and are widely distributed in various habitats across North America.

Environmental data

Our environmental data consisted of climatic data from WorldClim 2.1 [@Fick2017Wor2N] and land cover data from the Copernicus Global Land Service [@Buchhorn2019CopGlo]. We restricted these data to a spatial extent comprised between -145.0 and -50.0 degrees of longitude and between 20.0 and 75.0 degrees of latitude. First, the WorldClim data consist of spatially interpolated monthly climate data for global land areas. We used the standard bioclim variables [@Booth2014BioFir] from WorldClim 2.1, which represent annual trends, ranges, and extremes of temperature and precipitation, but selected only 8 out of the 19 ones to avoid redundancy (bio1, bio2, bio5, bio6, bio12, bio13, bio14, bio15). We downloaded the data at a resolution of 10 arcminutes (around 18 km^2^ at the equator), the coarsest resolution available, which should mitigate potential imprecision in the eBird data regarding the extent of the sampled areas in each observation checklist. Moreover, some studies have argued that coarser resolutions lead to less overestimation of species richness and better identification of bird biodiversity hotspots given the patchiness of observation data [@Hurlbert2007SpeRic]. We acknowledge that using an arcminutes-based resolution means that the surface area of our pixels will not be equal depending on the latitude.

Second, the Copernicus data are a set of variables representing ten land cover classes (e.g., crops, trees, urban areas) and measured as a percentage of land cover. The data are only available at a finer resolution of 100 m. We coarsened them to the same ten arcminute resolution as the WorldClim data by averaging the pixels' cover fraction values. We removed two variables (moss and snow) from our predictive models as their cover fraction was 0% on all sites with warbler observations.

Species distribution models

We converted the occurrence data to a presence-absence format compatible with community analyses. We considered every pixel from our ten arcminutes environmental layers as a site and then verified, for each species, if there was a single observation in every site. Finally, we recorded the outcome as a binary value: present (1) if a species was ever recorded in a site and absent (0) if it was not. Complete checklists help ensure that these zeros represent non-detections, rather than the species not being reported; hence we considered them as absence data, similar to @Johnston2020AnaGui, although we recognize that there exists a doubt on whether these truly represent non-detections.

We predicted species distribution data on continuous scales from our presence-absence data using Bayesian Additive Regression Trees (BARTs) [@Chipman2010BarBay], a classification and regression trees method recently suggested for species distribution modelling [@Carlson2020EmbSpe]. BARTs are based on an ensemble of trees, similarly to Boosted Regression Trees and Random Forest, but follow a sum-of-trees model and a Bayesian framework. Trees are first constrained as weak learners by priors regarding structure and nodes, then updated through an iterative Bayesian backfitting Markov Chain Monte Carlo (MCMC) algorithm which ultimately generates a posterior distribution of predicted classification probabilities [@Chipman2010BarBay; @Carlson2020EmbSpe]. In the context of species distribution modelling, BARTs showed high performance when compared to other predictive algorithms [@Konowalik2021EvaMet; @Tytar2021AssHab]. We first performed BARTs separately for all species and estimated the probability of occurrence in all the sites of our region of interest using the posterior median. We then converted the results to a binary outcome according to the threshold that maximized the True Skill Statistic (TSS) for each species, as suggested by @Carlson2020EmbSpe.

Quantification of ecological uniqueness

We used the method of @Legendre2013BetDiv to quantify compositional uniqueness from overall beta diversity for both the observed and predicted data. First, we assembled the presence-absence data by site to form two site-by-species community matrices, one from observed data, called $Y$ (39,024 sites by 62 species), and one from predicted data, called $\hat Y$ (99,382 sites by 62 species). Next, we measured species richness per site as the number of species present. Finally, we removed the sites without any species from the predicted matrix $\hat Y$, for a new total of 85,526 sites (this was unnecessary for the observed matrix $Y$). We then applied the Hellinger transformation to both matrices in order to compute beta diversity from the community composition data [@Legendre2013BetDiv]. We measured total beta diversity as the variance of each community matrix and calculated the local contributions to beta diversity (LCBD), which quantify how much a specific site (a row in each matrix) contributes to the overall variance in the community [@Legendre2013BetDiv]. High LCBD values indicate a unique community composition, while low values indicate a more common species set. We note that our LCBD values, which add up to 1 because the values are divided by the total sum-of-squares of the data matrix, were very low given the high number of sites in both $Y$ and $\hat Y$. However, the relative difference between the scores in one set matters more than the absolute value to differentiate their uniqueness.

Comparison of observed and predicted values

We performed three verification to compare the richness and uniqueness estimates obtained from our predicted distributions to those obtained with the eBird occurrence data. First, we performed a direct comparison by subtracting the richness and LCBD estimates obtained from $Y$ (the observed data) from the estimates obtained from $\hat{Y}$ (the predicted data). To do so, we used the richness estimates as-is but modified the LCBD values to achieve a non-biased comparison, given that the values were initially calculated on sets of different lengths. Therefore, we recomputed the LCBD scores only for the sites for which we had occurrences in both $Y$ and $\hat{Y}$, which mostly corresponded to the sites in $Y$, minus a few sites where the SDMs predicted no species occurrence. We then plotted the richness and LCBD differences to examine their spatial distributions. Second, we performed the modified t test from @Clifford1989AssSig to assess the correlation between the observed and predicted estimates and test for spatial association. We performed the test separately for the richness and the LCBD estimates. Third, we performed Generalized Linear Models between the observed and predicted estimates and plotted the deviance residuals to examine their spatial distribution. We used a negative binomial regression with a log link function for the richness estimates and a beta regression with a logit link function for the LCBD values, similar to @Heino2017ExpSpe and @Yao2021EcoUni.

Investigation of regional and scaling variation

To investigate possible regional and scaling effects, we recalculated LCBD values on various subregions at different locations and scales. First, we selected two subregions of equivalent size (20.0 longitude degrees by 10.0 latitude degrees) with contrasting richness profiles and corresponding to different ecoregions to verify if the relationship between species richness and LCBD values was similar. The first subregion was in the Northeast (longitude between -80.0 and -60.0, latitude between 40.0 and 50.0), was mostly species-rich (for both the observed and predicted data), and corresponded to the Eastern Temperate Forests level I ecoregion [@CommissionforEnvironmentalCooperation1997EcoReg]. The second subregion was in the Southwest (longitude between -120.0 and 100.0, latitude between 30.0 and 40.0), was mostly species-poor, and covered Mediterranean California, North American Deserts, Temperate Sierras, and Southern Semi-Arid Highlands ecoregions [@CommissionforEnvironmentalCooperation1997EcoReg]. Second, we recalculated the LCBD indices at three different extents, starting with a focus on the Northeast subregion and progressively extending the extent to encompass the Southwest subregion. We did these two verifications with both the observed and predicted data but only illustrate the results with the predicted data as both were qualitatively similar.

Proportion of rare species

We investigated the effect of the proportion of rare species in the community on the direction of the relationship between species richness and LCBD values in our Northeast and Southwest subregions. Following @DeCaceres2012VarTre and @Yao2021EcoUni, we classified species as rare when they occurred in less than 40% of the sites in each subregion. We calculated the proportion of rare species for every site. We then grouped the sites for both subregions depending on whether they were part of an ascending or a descending portion in the LCBD-richness relationship. Given that the relationship sometimes displays a curvilinear form with a positive quadratic term [@Heino2017ExpSpe; @Tan2019UndPro], we separated the ascending and descending portions based on the species richness at the site with the lowest LCBD value (using the median richness if there were multiple sites). This value corresponds to the inflection point of the relationships. For example, the lowest LCBD value was 7.045e-05 in the Northeast subregion and the corresponding richness was 23. All the sites with more than 23 species were assigned to the ascending portion, and all the sites with 23 species or fewer were assigned to the descending portion. In the Southwest subregion, the lowest LCBD value and its corresponding richness were 5.438e-05 and 12, respectively. We then mapped the ascending and descending groups to view their spatial distribution. We also examined the distribution of the rare species proportions in both groups using a kernel density estimation plot. Similar to our previous verification, we performed this analysis with both observed and predicted data but once again only illustrate the results with the predicted data as both were qualitatively similar.

Software

We used Julia v1.6.1 [@Bezanson2017JulFre] for most of the project and R v4.1.0 [@RCoreTeam2021RLan] for some specific steps. We used the Julia package SimpleSDMLayers.jl [@Dansereau2021SimJl] as the basic framework for our analyses, to download the WorldClim 2.1 data, and to map our results through the package's integration of Plots.jl. We also used StatsPlots.jl to produce the kernel density estimation plots in our rare species analysis. We computed the LCBD indices with our own function implemented in Julia, whose results were verified by comparison to the beta.div function from the package adespatial [@Dray2021AdeMul] in R. We used the R packages auk [@Strimas-Mackey2018AukEbi] to extract and manipulate eBird data, embarcadero [@Carlson2020EmbSpe] to perform the BART models, vegan [@Oksanen2019VegCom] to apply the Hellinger transformations, and SpatialPack [@Vallejos2020SpaRel] to perform the modified t test (with the function modified.ttest) from @Clifford1989AssSig. We used MASS [@Venables2002ModApp] and betareg [@Cribari-Neto2010BetReg] to perform the negative binomial and beta regressions, respectively. We also used GDAL [@GDAL/OGRcontributors2021GdaOgr] to coarsen the Copernicus land cover data. All the scripts required to reproduce the analyses are archived on Zenodo (https://doi.org/10.5281/zenodo.6024392).

Results

Species distribution models generate relevant community predictions

Species richness from observation data ([@Fig:maps]a) was higher on the East Coast and lower on the West Coast, with many unsampled patches in the North, South, and Central West. Richness results from SDM data ([@Fig:maps]b) displayed higher richness on the East Coast and sites with few or no species up north and in the Central West. There was no clear latitudinal gradient in richness but rather an East-West one. Landmarks such as the Rockies and croplands in the Central West (which should be species-poor habitats) were notably visible on the maps, separating the East and West. LCBD scores from observation data ([@Fig:maps]c) were low on the East Coast and higher on the border of sampled sites in the Central West. They were also higher in the North and in the South where observations were sparser. Results from SDM predictions were qualitatively similar ([@Fig:maps]d), with lower LCBD values in the East and more unique sites in the Central West, Central Mexico, and some Northern regions. There was no clear latitudinal gradient, and the East-West contrast, while present, was less clear than on the richness maps.

{#fig:maps}

The modified t test of @Clifford1989AssSig showed a high correlation between the observed and predicted estimates of richness and uniqueness, as well as a statistically significant spatial association between the values. For species richness, the correlation coefficient was 0.777, the F-statistic was 20.007, and the p-value was 6.093e-04. For LCBD scores, the correlation coefficient was 0.518, the F-statistic was 40.083, and the p-value was 5.528e-09.

The difference between the observed and predicted estimates (predicted richness - observed richness and predicted LCBD - observed LCBD) showed opposite geographic distributions for species richness and ecological uniqueness ([@Fig:comparison-combined]). Predicted richness estimates were higher than observed estimates on the East Coast, on the West Coast and in Mexico but were lower than observed estimates in the Central West ([@Fig:comparison-combined]a). Predicted LCBD estimates, on the other hand, were lower than observed estimates on the East Coast and higher in the Central West ([@Fig:comparison-combined]b). Regression residuals showed similar geographic distributions to their corresponding difference values ([@Fig:residuals-combined]).

{#fig:comparison-combined}

{#fig:residuals-combined}

Uniqueness displays regional variation as two distinct profiles

The relationship between LCBD values and species richness displayed contrasting profiles in species-rich and species-poor regions ([@Fig:subareas]). In the species-rich Northeast region , LCBD scores displayed a mostly decreasing relationship with species richness, with a slightly curvilinear form and increase of values for very rich sites. The sites with the highest LCBD values were the species-poor sites while the species-rich sites displayed scores. The Southwest subarea showed a different relationship with a sharper initial decline and a larger increase as richness reached 20 species. The sites with the highest LCBD values were the poorest in terms of species richness, as in the Northeast region, but the species-rich sites were proportionally more unique in the Southwest region. Total beta diversity was higher in the Southwest subregion (0.417) than in the Northeast (0.179), indicating higher compositional differences between the sites.

{#fig:subareas}

Uniqueness depends on the scale on which it is measured

The LCBD-richness relationship showed important variation when scaling up and changing the region's extent ([@Fig:extents]). For smaller extents, starting with a species-rich region, the relationship is well defined, mostly decreasing but notably curvilinear (with a lesser increase for richness values higher than the median). However, as the extent increases and progressively reaches species-poor regions, the relationship broadens, displays more variance, and loses its curvilinear aspect while keeping a decreasing form. Total beta diversity was higher when increasing the spatial extent, going from 0.121 to 0.284 and 0.687.

{#fig:extents}

Uniqueness depends on the proportion of rare species

The proportion of rare species per site differed depending on the classification in the ascending or descending portions of the LCBD-richness relationship ([@Fig:rarespecies]). The proportion of rare species was higher in the sites corresponding to the ascending portions of the relationships ([shown in @Fig:subareas]) than in the sites corresponding to the descending portions for both subregions. The classification of the sites in the two portions showed a clear latitudinal gradient in the Northeast subregion, while it was distributed in patches in the Southwest subregion ([@Fig:rarespecies]).

{#fig:rarespecies}

Discussion

Our results showed a decreasing relationship between species richness and LCBD values on broad spatial extents ([@Fig:extents]c) but also highlighted that the exact form of this relationship varies depending on the region and the spatial extent on which it is measured. Our species-rich Northeast subregion ([@Fig:subareas]a) showed a decreasing relationship, very similar to previous studies, and slightly curvilinear, as described by @Heino2017ExpSpe and @Tan2019UndPro. This result for warbler species is in line with the original study on fish communities [@Legendre2013BetDiv] and with following ones on insect metacommunities [@daSilva2014LocReg; @Heino2017UnrCor; @Heino2017ExpSpe], dung beetles [@daSilva2018DisCor; @daSilva2020CanTax], aquatic beetles [@Heino2019KniPat], stream macroinvertebrates [@Sor2018UniSam], stream diatoms [@Vilmi2017EcoUni], multi-trophic pelagic food webs (phytoplankton, zooplankton, fish) [@Taranu2020LarMul], temperate forest trees [@Tan2019UndPro], mammals [@daSilva2020CanTax], wetland birds [@deDeus2020AviBet], and various phylogenetic groups (plants, lizards, mites, anurans, mesoinvertebrates) [@Landeiro2018SpeLow]. However, it was originally argued that the negative relationship was not general or obligatory [@Legendre2013BetDiv]. Different LCBD-richness relationships have also been observed, with both positive and negative relationships for different sites or taxonomic groups in some studies [@Kong2017SpaVar; @Teittinen2017LocGeo], as well as a negative relationship with the number of common species but a positive relationship with the number of rare species [@Qiao2015BetDiv].

Our results further show that the relationship may depend on the region's richness profile, as the relationship was different in our species-poor Southwest subregion, with a sharper initial decrease ([@Fig:subareas]b). Therefore, the curvilinear form may depend on how pronounced the contrast is between the region's median richness and its richest ecologically feasible sites. The increasing part of the curvilinear form for higher richness values was also more pronounced in our results ([@Fig:subareas]a,b; [@Fig:extents]c) than in previous studies [e.g, @Tan2019UndPro], which reinforces the idea that the relationship and its curvilinear form may vary depending on the region.

The variation in the LCBD-richness relationship when extending the study extent showed that the uniqueness patterns highlighted are not necessarily the same depending on the scale on which it is used ([@Fig:extents]). The relationship progressively lost its clear definition and curvilinear form as the East and West profiles merged, creating a new distinct profile with more variation and no curvilinear form. Therefore, aggregating too many different sites might possibly mask some patterns of uniqueness in species-rich sites. Total beta diversity, on the other hand, showed the variation expected from previous studies, increasing with spatial extent ([@Fig:extents]) [@Barton2013SpaSca; @Heino2015ComAna]. Its value was high at the continental scale (0.687) but lower than what has been observed in some studies [e.g., 0.80 on macroinvertebrate communities in the Lower Mekong Basin by @Sor2018UniSam].

Our results confirm that the proportion of rare species in the community may affect the direction of the relationship between species richness and ecological uniqueness ([@Fig:rarespecies]). @daSilva2018DisCor suggested that the proportion of rare and common species in the communities determines whether the relationship will be negative, non-significant, or positive. @Yao2021EcoUni showed an association between the direction of the relationship and the proportion of rare species, with sites with a lower proportion (between 60% and 75% in their case) displaying a negative relationship and sites with a higher proportion (around 85%) showing a positive one. Our results further show that sites associated with a positive relationship within a curvilinear one tended to have a higher rare species proportion ([@Fig:rarespecies]). This also implies that the proportion of rare species was higher in species-rich sites than in species-poor ones in both our Northeast and Southwest subregions. Further work should attempt to disentangle the effects of the rare species proportion and the region's richness profile.

Our results showed that SDM models provide richness and uniqueness predictions highly correlated to the occurrence data while filling gaps in poorly sampled regions ([@Fig:maps]). The results showed a statistically significant spatial association between predicted and observed estimates despite correcting for autocorrelation using the modified t-test from @Clifford1989AssSig. A positive autocorrelation on large distances indicates aggregates or structures repeating through space [@Legendre1989SpaPat]. This is consistent with our results, as the distribution of richness and uniqueness values was visibly spatially structured in both our observed and predicted data ([@Fig:maps]). Nonetheless, it is possible that the autocorrelation in the predicted values could represent an artifact of the predictive models (capturing the spatial structure from the environmental variables, for example), and might not represent the true autocorrelation expected for the uniqueness estimates. Further work could verify this by quantitatively comparing the autocorrelation and spatial structures in the observed and predicted uniqueness estimates.

Predicted values also tended to underestimate uniqueness in species-rich regions and overestimate it in species-poor ones, with the opposite trend for species richness ([@Fig:comparison-combined; @Fig:residuals-combined]). Overprediction of richness using S-SDMs was reported previously [@Dubuis2011PreSpa; @DAmen2015UsiSpe; @Zurell2020TesSpe]. No comparable baseline exists for predictions of LCBD values, as our study is the first to compare LCBD estimates from observed and predicted data in such a way. However, some studies showed that LCBD distributions were spatially structured across sampling sites [@daSilva2018DisCor]. On the other hand, the spatial structure in our results did not exactly concord with the one reported by @Heino2019KniPat, who showed a negative relationship between LCBD values and latitude for diving beetles communities in Northern Europe. In comparison, our LCBD scores increased both in the North and South ([@Fig:maps]), hence did not strictly increase with latitude, and also showed a clear East-West gradient. Overall, our distribution results ([@Fig:maps; @Fig:comparison-combined; @Fig:residuals-combined]) also have implications for conservation, as they confirm that species richness and ecological uniqueness measured from LCBD values may conflict and highlight different potential hotspots [@Dubois2020EcoUni; @Yao2021EcoUni], thus reinstating the need to protect both with complementary strategies.

Our predictions for regions with sparse sampling are of interest as they allow a quantitative evaluation, however imperfect, for sites where we would otherwise have no information. Our SDMs also offered relevant LCBD predictions using eBird, arguably one of the largest presence-absence data sets available (when using its complete checklist system), showing the measure's potential on such massive data. Together, the potential to generate uniqueness predictions in new locations and through massive data opens new opportunities for LCBD analyses on extended spatial scales and on a broader diversity of taxons. An interesting way forward would be to test these results using more advanced community assembling techniques than S-SDMs. The use of SESAM [@Guisan2011SesNew] with probabilistic SDMs, probability ranking, and species richness predictions as macroecological constraints returns high site-level prediction accuracy [@Zurell2020TesSpe] and would be compatible with presence-absence LCBD calculations. The use of probabilistic stacks rather than binary ones [@Calabrese2014StaSpe] could also constitute a novel way to calculate LCBD indices. Both these procedures should reduce the richness deviation we observed, and it would be interesting to verify if this can also be the case with LCBD values. An ensemble of SDM algorithms could also be used to capture a broader range of possible outcomes for the LCBD predictions. However, given the high performance of BARTs in model comparisons [@Konowalik2021EvaMet; @Tytar2021AssHab] and the large extent we covered, we do not believe the changes to the LCBD gradients would be strong enough to affect our results in a meaningful way.

This study showed how ecological uniqueness can be measured over broad spatial extents, including for regions with sparse sampling, and how scale changes may affect beta diversity quantification. It is the first study to assess the relevance of local contributions to beta diversity calculated on the output of species distribution models. It is also the first to compare the relationship between LCBD values and species richness for different regions and spatial extents. First, our results showed that the negative relationship often observed between species richness and LCBD scores can take different forms depending on the richness profile of the regions on which it is measured. Therefore, species-rich and species-poor regions may display different ways to be unique. Second, the negative relationship was not constant when varying the spatial study extent and may be less clearly defined at broad scales when contrasting regional relationships are present. The broad-scale uniqueness profile might then be completely distinct from the regional profiles constituting it. Finally, species distribution models offer a promising way to generate uniqueness predictions on broad spatial extents and could prove useful to identify beta diversity hotspots in unsampled locations on large spatial scales, which could be important targets for conservation purposes.

Acknowledgments

We acknowledge that this study was conducted on land within the traditional unceded territory of the Saint Lawrence Iroquoian, Anishinabewaki, Mohawk, Huron-Wendat, and Omàmiwininiwak nations. We received financial support from the Fonds de recherche du Québec - Nature et technologie (FRQNT) and the Computational Biodiversity Science and Services (BIOS^2^) NSERC CREATE training program. We thank Élise Filotas and Anne-Lise Routier for their helpful comments on this manuscript.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

manuscript.md

manuscript.md

Introduction

Methods

Occurrence data

Environmental data

Species distribution models

Quantification of ecological uniqueness

Comparison of observed and predicted values

Investigation of regional and scaling variation

Proportion of rare species

Software

Results

Species distribution models generate relevant community predictions

Uniqueness displays regional variation as two distinct profiles

Uniqueness depends on the scale on which it is measured

Uniqueness depends on the proportion of rare species

Discussion

Acknowledgments

References

Files

manuscript.md

Latest commit

History

manuscript.md

File metadata and controls

Introduction

Methods

Occurrence data

Environmental data

Species distribution models

Quantification of ecological uniqueness

Comparison of observed and predicted values

Investigation of regional and scaling variation

Proportion of rare species

Software

Results

Species distribution models generate relevant community predictions

Uniqueness displays regional variation as two distinct profiles

Uniqueness depends on the scale on which it is measured

Uniqueness depends on the proportion of rare species

Discussion

Acknowledgments

References