Passive acoustic monitoring (PAM) is commonly used to obtain year-round continuous data on marine soundscapes harboring valuable information on species distributions or ecosystem dynamics. This continuously increasing amount of data requires highly efficient automated analysis techniques in order to exploit the full potential of the available data. Here, we propose a benchmark, which consists of a public dataset, a well-defined task and evaluation procedure to develop and test automated analysis techniques. This benchmark focuses on the special case of detecting animal vocalizations in a real-world dataset from the marine realm. We believe that such a benchmark is necessary to monitor the progress in the development of new detection algorithms in the field of marine bioacoustics. We ultimately use the proposed benchmark to test three detection approaches, namely ANIMAL-SPOT, Koogu and a simple custom sequential convolutional neural network (CNN), and report performances. We report the performance of the three detection approaches in a blocked cross-validation fashion with 11 site-year blocks for a multi-species detection scenario in a large marine passive acoustic dataset. Performance was measured with three simple metrics (i.e., true classification rate, noise misclassification rate and call misclassification rate) and one combined fitness metric, which allocates more weight to the minimization of false positives created by noise. Overall, ANIMAL-SPOT performed the best with an average fitness metric of 0.6, followed by the custom CNN with an average fitness metric of 0.57 and finally Koogu with an average fitness metric of 0.42. The presented benchmark is an important step to advance in the automatic processing of the continuously growing amount of PAM data that are collected throughout the world's oceans. To ultimately achieve usability of developed algorithms, the focus of future work should be laid on the reduction of the false positives created by noise. |