Coastlines potentially harbor a large part of litter entering the oceans, such as plastic waste. The relative importance of the physical processes that influence the beaching of litter is still largely unknown. Here, we investigate the beaching of litter by analyzing a data set of litter gathered along the Dutch North Sea coast during extensive beach cleanup efforts between the years 2014 and 2019. This data set is unique in the sense that data are gathered consistently over various years by many volunteers (a total of 14 000) on beaches that are quite similar in substrate (sandy). This makes the data set valuable to identify which environmental variables play an important role in the beaching process and to explore the variability of beach litter concentrations. We investigate this by fitting a random forest machine learning regression model to the observed litter concentrations. We find that tides play an especially important role, where increasing tidal variability and tidal height lead to less litter found on beaches. Relatively straight and exposed coastlines appear to accumulate more litter. The regression model indicates that transport of litter through the marine environment is also important in explaining beach litter variability. By understanding which processes cause the accumulation of litter on the coast, recommendations can be given for more effective removal of litter from the marine environment, such as organizing beach cleanups during low tides at exposed coastlines. We estimate that 16 500–31 200 kg (95 % confidence interval) of litter is located along the 365 km of Dutch North Sea coastline.

The accelerated release of mismanaged plastic waste into the global ocean gives rise to the need for effective cleanup strategies

In addition, the plastic concentrations found on beaches are generally higher compared to other environmental compartments, such as the surface water or the seafloor

Although the benefits of beach cleanups are well known, the location and timing of these activities are often not optimized.

However, the relative importance of the various physical processes involved and how these can be parameterized so far remains unknown

In addition to the study by

In order to make data-driven methods work, relatively large and consistent data sets are necessary, but most observational data sets are sparse. Beach cleanups and citizen science initiatives can potentially provide valuable information for scientific studies on marine pollution

Here, we will build upon past data-driven studies by using an unprecedented data set obtained from beach cleanup efforts organized along the Dutch North Sea coast between 2014 and 2019. The number of participants (about 14 000), person hours (about 84 000 h), the length of beach sampled (about 1400 km), and the fact that all beaches sampled were similar in substrate (sandy) make this data set unique and well suited to data-driven methods. Furthermore, a large set of explanatory variables will be created based on environmental conditions and modeled transport of marine litter. We will fit a random forest regression model to the observed litter concentrations as a function of these explanatory variables and investigate which ones are important predictors of the variability in beach litter. This gives a better understanding of marine pollution and yields a predictive model that could increase the efficacy of future cleanup efforts.

Since 2013 the North Sea Foundation, a Dutch environmental non-governmental organization (NGO) advocating the protection and sustainable use of the North Sea marine ecosystem, has organized the national Boskalis Beach Cleanup Tour. During this tour, every year in August the entire Dutch North Sea coast is cleaned up by volunteers. It is the largest cleanup campaign in the Netherlands. The tour is divided into stages along the North Sea coast, each 8–10 km long. The midway points of all stages are plotted in Fig.

During the first three editions (2013–2015), the tour was organized over a period of a month, with one stage per day. From 2016 on, the tour took 15 d, with simultaneous cleaning of two stages per day. One cleanup team started on the Wadden Island Schiermonnikoog (the easternmost cross in Fig.

At each stage, all litter present on the beach was collected in plastic bags and weighed. The weighing of the collected litter was done using analogue and/or digital scales (during the stage or at the end of the stage) and carried out by one of the members of the cleanup team. Most of the litter found was plastic (estimated percentage between 80 %–90 % in terms of numbers). The years over which weights of collected litter are available for each stage are plotted in Fig.

To get an impression of the mean environmental conditions along the Dutch North Sea coast, the mean surface currents are plotted in Fig.

Locations of the midway points for each cleanup tour stage (black crosses) and dates showing for which year data are available (the colored squares). For stages with multiple data points per year, different stretches of beach were cleaned (e.g., once on the northern side and once on the southern side). Also plotted are the mean surface currents (arrows)

Different sources of marine litter exist, such as mismanagement of waste near the coast, input from rivers, or fishing gear which is lost at sea. The litter is then transported through the environment and can eventually end up on beaches, influenced by various factors such as ocean currents and winds. However, how all of these variables combined influence the beaching of litter is unknown. A regression model is used here to relate various environmental variables to the observed litter concentrations. We will assess whether it is possible to use the regression model to make predictions about the amount of beached litter and, if so, which environmental variables are important predictors to take into account.

For the environmental variables, three classes of data are used. First of all, hydrodynamic data (ocean currents, ocean surface waves, tides) and wind data are used (Sect.

Numerical model data are used to specify the state of the sea and wind around the beach cleanup locations, as these factors have been found to likely play a role in the accumulation of beach litter

Information about ocean surface currents (

An overview of the numerical hydrodynamic and wind data used to derive the variables for the regression analysis. The data set name, temporal and spatial resolution, data assimilated into the numerical models, and corresponding references are presented.

While data on the sea state and wind might explain the litter accumulating on beaches to some extent, it misses information on possible sources of litter and how this litter is transported through the marine environment. We therefore include estimates of beached litter fluxes in our analysis based on Lagrangian particle simulations.

Using the OceanParcels Lagrangian ocean analysis framework

We use the same approach as in

A beaching timescale

Each virtual particle starts with a unit mass. For each time step that a virtual particle spends near the coast, a fraction of its mass is lost due to the beaching process. This means that as

Input scenarios used to seed virtual litter particles in the Lagrangian simulations. Riverine input is indicated by the green circles, the amount of fishing hours is shown in blue, and the coastal mismanaged plastic waste density is shown in red. Note the log scale used for all input scenarios. While all rivers from

Coastal orientation, geometry, and substrate are likely to influence the amount of litter that actually beaches on coastlines

The Natural Earth data set is used here at a

Normal vectors to the coastline (denoted by

Dot products are calculated for vector fields (e.g., current velocity) with respect to the coastline normal vectors to quantify how much a vector points onshore (positive dot product) or offshore (negative dot product). An example is presented in Fig.
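The onshore/offshore projection described above can be illustrated with a minimal sketch. It assumes the coastline normal is a vector pointing from sea towards land, so that a positive dot product means onshore; the function name `onshore_component` and the sample velocities are hypothetical and only serve as an illustration.

```python
import numpy as np

def onshore_component(u, v, normal):
    """Project a vector field (u, v) onto a coastline normal.

    Assumption: `normal` points from sea towards land, so positive
    values are onshore and negative values are offshore.
    """
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)  # make sure the normal has unit length
    return np.asarray(u, dtype=float) * n[0] + np.asarray(v, dtype=float) * n[1]

# Hypothetical example: land lies to the east, so the normal is (1, 0).
u = np.array([0.5, -0.2, 0.0])   # eastward velocity component (m/s)
v = np.array([0.1, 0.3, -0.4])   # northward velocity component (m/s)
onshore = onshore_component(u, v, normal=(1.0, 0.0))
print(onshore)
```

In this example only the eastward component contributes, since the northward component runs parallel to the coast.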

The coastal normal vectors are also used to estimate the misalignment between the numerical model coastline and the high resolution coastline. In Fig.

Finally, the coastline length per grid cell is estimated. For each cell of the numerical model, we take the coastline segments within the given cell and calculate their total length. Since coastlines show fractal behavior

Illustration of the methodology used to calculate the directional variables. Panel

Information about spatial variability of beached litter can be useful for cleanup campaigns to target areas that are likely to be the most polluted. One might expect that cleanup locations close to each other show more similar litter concentrations compared to locations that are further apart. Furthermore, it is important for modeling studies to know the subgrid-scale variability that is not captured by the (discrete) numerical data

We will quantify the spatial variability of litter found on the coast as a function of the separation distance between the different cleanup locations using an empirical variogram. To compute the empirical variogram, all pairs of measurements within a certain distance of each other are compared, defined by

We calculate the empirical variogram on the log

Measured litter concentrations are subject to both spatial and temporal variability. To remove temporal variability as much as possible from the empirical variance estimates, we only use data pairs within a certain time separation. Decreasing the time separation window reduces the effect of the temporal variability but also reduces the number of available data pairs. We use a time separation of 3 d here, for which it was found that there are still enough available data pairs to compute the empirical variogram.
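As an illustration of the pair selection described above, the following sketch computes an empirical variogram with a time separation window. It assumes the standard semivariance estimator (half the mean squared difference per lag bin), a one-dimensional along-coast position in km, and log10-transformed concentrations; these choices, the function name `empirical_variogram`, and the synthetic data are assumptions made for the example rather than the exact procedure of this study.

```python
import numpy as np

def empirical_variogram(x_km, t_days, z, bin_edges_km, max_dt_days=3.0):
    """Semivariance per lag bin, using only pairs measured close in time."""
    x_km = np.asarray(x_km, dtype=float)
    t_days = np.asarray(t_days, dtype=float)
    z = np.asarray(z, dtype=float)

    # All unique pairs (i < j) of measurements
    i, j = np.triu_indices(len(z), k=1)
    lag = np.abs(x_km[i] - x_km[j])
    dt = np.abs(t_days[i] - t_days[j])
    sq_diff = (z[i] - z[j]) ** 2

    # Keep only pairs within the time separation window
    keep = dt <= max_dt_days
    lag, sq_diff = lag[keep], sq_diff[keep]

    n_bins = len(bin_edges_km) - 1
    gamma = np.full(n_bins, np.nan)       # NaN where a bin has no pairs
    counts = np.zeros(n_bins, dtype=int)
    for k in range(n_bins):
        in_bin = (lag >= bin_edges_km[k]) & (lag < bin_edges_km[k + 1])
        counts[k] = in_bin.sum()
        if counts[k] > 0:
            gamma[k] = 0.5 * sq_diff[in_bin].mean()
    return gamma, counts

# Synthetic example: 60 measurements along a 300 km coast over 15 days
rng = np.random.default_rng(1)
x = rng.uniform(0, 300, size=60)
t = rng.integers(0, 15, size=60)
conc = rng.lognormal(mean=3.0, sigma=1.0, size=60)  # kg per km, made up
gamma, counts = empirical_variogram(x, t, np.log10(conc), np.arange(0, 301, 25))
```

Returning the pair counts alongside the semivariance makes the trade-off visible: a narrower time window removes more temporal variability but leaves fewer pairs per lag bin.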

The variables described in Sect.

An overview of the features is given in Table

For the scalar features, we look at

We calculate a number of features derived from the tidal height

The total coastline length within a given radius is calculated (

The number of participants for each stage is used as a feature (

For the directional features, we calculate the dot product of the Stokes drift, wind, ocean currents, and tides with respect to the coastline normal vector (

Finally, the total fluxes of beached litter from the Lagrangian particle simulations are given as features from fisheries (

An overview of the machine learning features used. For each set of variables in each column, derived quantities are calculated, e.g., the maximum, sum, or mean, over the given radius and lead time. Directional features are dot products of a given vector field with respect to the coastline normal vector

The features and corresponding response (the measured amount of litter in kg km

In total we have 342 features from all variable, radius, and lead time combinations. There are a total of 175 measured litter concentrations. The large number of features in comparison to the measurements makes it difficult to interpret the feature importance and could lead to overfitting. Therefore,

Some features correlate because they are derived from the same variable but for a different radius or lead time. However, we do not know a priori which of these radii and lead times are the most appropriate predictors for the beached litter quantities. For example, litter concentrations might be influenced by long-term processes, there may be a slow increase to the standing stock of litter on the beach, or the concentrations simply could be better predicted by conditions on the day leading up to the cleanup stage. Since we do not know whether these factors play a role, we let the algorithm select the most appropriate variables. Features that are highly correlated will be assigned to clusters. We use hierarchical Ward linkage clustering for this, based on Spearman rank–order correlations
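The clustering step can be sketched as follows, using SciPy's Ward linkage on a distance derived from Spearman rank correlations. The distance definition (one minus the absolute correlation), the cut-off value `threshold`, and the synthetic feature matrix are assumptions made for illustration; they are not taken from this study.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def cluster_features(X, threshold=0.5):
    """Assign correlated columns of X to clusters (needs >= 3 columns)."""
    corr = spearmanr(X).correlation        # (n_features, n_features) matrix
    corr = (corr + corr.T) / 2.0           # enforce exact symmetry
    np.fill_diagonal(corr, 1.0)
    dist = 1.0 - np.abs(corr)              # assumed distance measure
    linkage = hierarchy.ward(squareform(dist, checks=False))
    # Cut the dendrogram at a (hypothetical) distance threshold
    return hierarchy.fcluster(linkage, t=threshold, criterion="distance")

# Synthetic example: features 0/1 and 2/3 are near-duplicates of each other
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.05 * rng.normal(size=200),
    base[:, 1],
    base[:, 1] + 0.05 * rng.normal(size=200),
])
labels = cluster_features(X)
print(labels)
```

Features derived from the same variable at different radii or lead times would be expected to fall into one cluster in this way, after which a single representative per cluster can be selected.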

Nested 5-fold cross-validation is used for optimal feature selection from the clusters and to assess the model performance on a test data set. In the outer loop, we use 80 % of the data to train the model and use the remaining 20 % to test the model performance. This is repeated for each fold, i.e., 5 times. In the inner loop, 80 % of the training data (i.e., 64 % of the total data) are used to train the model and 20 % (i.e., 16 % of the total data) are used to calculate the importance of the features; this process is also repeated 5 times. Since in the inner loop none of the test data are used to train the model, we do not overpredict the model performance
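A minimal sketch of the nested split is given below, using scikit-learn. The outer loop holds out 20 % of the data for testing; the inner loop splits the remaining training data 80/20 to rank features. Ranking by permutation importance on the inner validation fold, keeping a fixed number of features (`n_keep`), and the synthetic data are assumptions for this example; the feature selection per cluster used in this study is not reproduced here.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import KFold

def nested_cv(X, y, n_keep=3, seed=0):
    """Return out-of-fold predictions from nested 5-fold cross-validation."""
    outer = KFold(n_splits=5, shuffle=True, random_state=seed)
    y_pred = np.empty_like(y, dtype=float)
    for train_idx, test_idx in outer.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]

        # Inner loop: only training data are used to rank the features
        inner = KFold(n_splits=5, shuffle=True, random_state=seed)
        importance = np.zeros(X.shape[1])
        for fit_idx, val_idx in inner.split(X_tr):
            model = RandomForestRegressor(n_estimators=50, random_state=seed)
            model.fit(X_tr[fit_idx], y_tr[fit_idx])
            result = permutation_importance(
                model, X_tr[val_idx], y_tr[val_idx],
                n_repeats=5, random_state=seed,
            )
            importance += result.importances_mean

        # Keep the highest-ranked features and refit on all training data
        selected = np.argsort(importance)[::-1][:n_keep]
        final = RandomForestRegressor(n_estimators=50, random_state=seed)
        final.fit(X_tr[:, selected], y_tr)
        y_pred[test_idx] = final.predict(X[test_idx][:, selected])
    return y_pred

# Synthetic example with 175 samples; only the first two features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(175, 10))
y = 2.0 * X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=175)
y_pred = nested_cv(X, y)
r, _ = pearsonr(y, y_pred)
print(f"out-of-fold Pearson r = {r:.2f}")
```

Because the test fold of each outer iteration is never seen during feature ranking or model fitting, the resulting correlation is not inflated by the selection step.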

The regression model shows reasonable correspondence with the measured litter concentrations, where the Pearson correlation coefficient (

In Fig.

Scatterplot of the observed log-transformed litter quantities (

In Fig.

The coastline length in the neighborhood of the cleanup stage (

Results suggest that transport of marine litter is important to take into account, as the third and seventh most important features are beaching fluxes from the Lagrangian model simulations from fishing activity and coastal mismanaged waste, respectively. These features implicitly contain information about various hydrodynamic variables and sources of litter, explaining why these are ranked above most other scalar and directional features related to wind, currents, and waves. It is also interesting that they are all ranked above the nearby fishing activity (

Finally, the dot product of

Changes in predictive capability are relatively small when leaving out the Lagrangian model simulation features; see Fig.

It is estimated that the number of participants taking part in the tour does not have a large influence on the amount of litter that is found; see Appendix

Box plots for the feature Gini importance values from the random forest regression algorithm. Only the top 10 features are plotted here; an overview of all features can be found in Appendix

Having the full set of 66 feature clusters is not necessary for predictive capability. In Fig.

The two principal components based on the five most important features (see Fig.

To assess which length scales are important for the spatial variability of beached litter, we calculate the empirical variogram for different lag distances. Spatial variability remains relatively constant for lag distances up to about 100 km, with a mean of

Interestingly, some periodic behavior seems to be present with a length scale of about 25 km. One possible explanation could be the typical spacing of the Dutch islands and peninsulas. As shown in the previous section, coastline orientation likely plays an important role in the amount of observed litter. This effect can also present itself in the variogram with, for example, measurements in sheltered areas (e.g., coves) being more correlated with each other compared to nearby exposed locations (e.g., headlands).

The grid sizes used for our numerical data range from about 7 km (the surface current data) to about 20 km (the wind data). This means that the variance at and below these length scales is not captured by the numerical data. The variance calculated for lag distances up to 20 km is quite substantial (

Variogram calculated for the log

The random forest regression model can be used to extrapolate how much litter is likely to be beached along the entire Dutch coastline. First, a regression model is trained using the top eight features listed in Fig.

For each section, the litter concentrations (in kg km

We find a total of 16 500–31 200 kg litter along the Dutch North Sea coastline based on the 95 % confidence interval. It must be noted that this only accounts for the visible litter on the beach surface. The cleanup efforts are likely to miss a substantial amount of beached litter that is buried in beach sediment or located at the back of the beach (e.g., in vegetation). This was also noted, for example, in

The total amount of litter gathered during the cleanup campaigns and the total number of kilometers sampled per year are presented in Table

Mean litter concentrations over the month of August in the years 2014–2019 extrapolated to the entire Dutch coastline.

Using data from beach cleanup efforts in the Netherlands for the years 2014–2019, we analyzed which variables are important for predicting litter on beaches and what spatial variability this litter has. In order to do this, we fitted a regression model to the observed litter quantities as a function of variables related to wind, waves, currents, tides, coastal geometry, and simulated oceanic transport. We find that tides play an important role, where increasing tidal variability and increasing tidal maximum lead to less observed litter on beaches. Other important variables are whether the local orientation of a beach corresponds to the large-scale coastline orientation and the total nearby coastal length, which can both be seen as measures of how exposed a beach is. These factors are likely explanations for why the observed litter quantities are relatively low in the southwestern part of the Netherlands compared to the other parts. Additionally, transport of litter through the marine environment is seen as important to take into account by the regression model. Rivers, fishing activity, and mismanaged plastic waste along coastlines were taken into account as possible sources of litter in the transport model, where the regression analysis attributed relatively high importance to litter originating from fishing activity. This is in line with findings in

We compute that spatial variability of the observed litter concentrations is substantial on length scales less than 10 km, causing model

Estimating the spatial variability of beached litter can give us information for efficient monitoring of pollution. It can be used to constrain estimates of litter concentrations based on observations elsewhere. We found that the variance for lag distances smaller than 125 km is relatively constant around

For future studies on quantifying beach litter variability, it would be interesting to segment the beach cleanup tours into smaller stretches. One idea would be to organize some stages where the litter quantities are weighed per 1 km, 100 m, or even shorter stretches. This way it would be possible to estimate the variance at sub-kilometer scales.

Future studies could further investigate the causal relations between the variables seen as important predictors by the regression model and the litter concentrations found on beaches. This is especially the case for tides, which constitute the two most important features in the regression model (see Fig.

It should be investigated how the results found here generalize to other geographic regions, and how the importance of explanatory variables varies globally. The model itself cannot be directly used for other geographic regions since the features used to train the algorithm are specific to the region of interest. The model is likely to perform poorly when making extrapolations for conditions not present in the training data. As an example, the substrate of beaches is likely to have a large impact on litter concentrations

It is necessary to further investigate the effect of regular cleaning of beaches by municipalities and other volunteer groups or individuals. This effect was left out of this analysis because these data were unavailable. It is likely that it is mainly the beaches near densely populated areas that are regularly cleaned. Since data on population density have been included in the features, it is possible that this effect is taken into account by the regression model, but further analysis is necessary. Furthermore, effects of tourism can be taken into account in the future when these data are available, as this affects the local population density seasonally.

Regarding effective cleanup of beaches, it is recommended to perform beach cleanups during low tide, preferably in a week around the neap tide, when the tidal variability is lower. If limited resources are available, one can focus on exposed shorelines, which generally accumulate more litter. Additionally, more litter can be expected on relatively straight shorelines compared to more irregular geometries where litter is distributed over longer stretches of beach. We saw no effect from the number of participants per beach cleanup tour on the amount of gathered litter, with an average of 77 participants per tour. One possible improvement to clean up more litter could therefore be to spread out participants over different stages, avoiding parts of the beach being inspected multiple times.

Figures

Modeled

Modeled

Overview of the total amount of litter gathered per year during the beach cleanup tours.

A complete overview of the Gini importance for all features is presented in Fig.

Gini importance overview of all features. Labels are colored according to the feature categories in Table

A scatterplot of the measured litter concentrations versus the predicted values is presented in Fig.

Scatterplot of the observed litter quantities (

A complete overview of the feature Gini importance values corresponding to the cases without Lagrangian model features is presented in Fig.

Gini importance overview when not taking into account the Lagrangian model features, where labels are colored according to the feature categories in Table

It is not necessary to include all 66 feature clusters for predictive capability of the model. In Fig.

The effect of the number of included features on the Pearson correlation coefficient

In Fig.

Analysis where some of the feature categories have been left out. The top 10 features have been used without the Lagrangian model features (see Fig.

As mentioned in the main text, the number of participants is not seen as important in terms of the Gini importance. The number of participants is correlated with the population density in the neighborhood of the stage and is therefore assigned to the same feature cluster as the population density; for more details see Appendix

Gini importance overview when not using nearby population densities as features, which separates the effect of the number of participants per cleanup stage. In this case, it is the 28th most important feature.

The general effect of some features was described in the main text, such as the fact that an increasing tidal variability, and misalignment of the high resolution coastline with respect to the numerical model coastline (

Features which show relatively robust relations are related to tidal height, where an increasing variability and a higher maximum decrease the predicted litter concentrations. The effect for

Illustrated effect of the 12 most important features (

Correlated features are put into clusters using hierarchical Ward linkage clustering

Dendrogram used to construct the feature clusters.

Pipeline to train and test the random forest regression model. Nested

Code used to conduct the experiment and to create all of the figures and the beach cleanup data from Stichting De Noordzee are available at

MLAK designed and conducted the study, with initial data analysis from EvS and steering and discussion from EvS, SLY, MB, and HAD. Curation of the beach cleanup tour data was done by MB. All authors contributed to the manuscript.

Marijke Boonstra is employed by the North Sea Foundation. All other authors declare no competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The North Sea Foundation thanks all volunteers who participated in the Beach Cleanup Tour. We would also like to thank all sponsors and partners that make the Beach Cleanup Tour possible.

This work was supported through funding from the European Research Council (ERC) under the European Union Horizon 2020 research and innovation programme (grant agreement no. 715386). Funding was provided to Stefanie L. Ypma by the Galapagos Conservation Trust and the Evolution Education Trust, Pathways to Sustainability, and the K.F. Hein Fonds.

This paper was edited by Oliver Zielinski and reviewed by three anonymous referees.