Department of Aerospace Engineering
AENGM0032 Research Project 2022-2023

AN INVESTIGATION INTO THE USE OF CONVOLUTIONAL NEURAL NETWORKS FOR WILDLIFE IDENTIFICATION

Bertie Auricchio
Department of Aerospace Engineering, University of Bristol, Queen's Building, University Walk, Bristol, BS8 1TR, UK

ABSTRACT

Seal populations in the United Kingdom are crucial indicators of the health of the entire ecosystem, and accurate population estimates are essential for effective management and conservation efforts. Currently, individual recognition of seals is achieved through invasive methods such as tagging, or through aerial surveys, which are expensive, disruptive, and potentially dangerous. In this study, convolutional neural networks are implemented and tested for individual recognition of seals, with a focus on the impact of the training dataset and model architecture on the efficacy of these models. It was found that the quality of the training dataset is of paramount importance to the quality of model predictions, both for in-set classification and open-set classification. It was also shown that wider networks achieve faster inference times on CPU, as well as lower error rates for in-set classification. Finally, it was shown that deeper networks perform better for open-set classification of novel species, while wider networks perform better for open-set classification of novel individuals.

Keywords: Ecology, Computer vision, Neural networks, Data science, Open-set recognition

1 INTRODUCTION

The United Kingdom is home to two seal species: the grey seal (Halichoerus grypus) and the common seal (Phoca vitulina), which serve as crucial indicators of the health of the entire UK ecosystem. In the last century, however, overhunting and disease brought population numbers down to record lows[1]. At the start of the 20th century, the grey seal population dwindled to as few as 500 individuals. Nevertheless, recent conservation efforts and the Protection of Seals Act 1970[2] have helped increase the grey seal population to over 120,000, representing 95% of the European population and 40% of the global population[3].

Estimates of population numbers and distributions are essential for answering questions related to community, ecosystem function, population dynamics, and behavioural ecology[4], which is vital in determining how best to manage human interaction with the animals, or in targeting priority communities and sites for support[5]. Often, this information must be obtained by keeping track of specific individuals, in order to measure metrics such as abundance, life expectancy and migration. Currently, most individual recognition is achieved through invasive methods such as applying tags or marks to the animal's body, which have the downside of impacting the animal's natural behaviour and relationship to others. Furthermore, they do not last for the entire duration of the animal's life. Aerial surveys are another particularly useful tool for population management and are typically conducted in crewed helicopters or fixed-wing aircraft, which are expensive, disruptive and dangerous: 66% of work-related fatalities among wildlife workers between 1937 and 2000 were aviation-related[6]. Several alternatives[7] have been suggested, including remote sensing techniques and the use of satellites[8]. However, even high-resolution satellite imagery is not suited to observing smaller animals, and the identification of individual seals is certainly not feasible from space.
Weather and cloud cover pose further problems for satellite observation. UAVs have been suggested[9] as a promising solution, but an aerial survey would produce an amount of data far too large to be processed manually, requiring an automated classification pipeline.

Certain species, such as seals, have unique natural patterns that can be used much like a fingerprint to visually identify individuals. This provides a more cost-effective and less intrusive method of population management, provides an identification method that lasts for the whole lifetime of the seal, and is advantageous for studying threatened and endangered animals[10]. The downside of manual visual recognition is that it is extremely time- and labour-intensive, requiring the agreement of two expert surveyors and a manual lookup of the seal's features in a database. It is clear that an automated seal identification pipeline would greatly boost the feasibility of visual recognition.

Ever since Krizhevsky et al. [11] demonstrated the effectiveness of convolutional neural networks for computer vision applications, CNNs have been a staple of modern computer vision tasks. Their recent introduction in ecology has proven successful for tasks such as object detection, classification, segmentation[12], and recognition of individuals[13, 14, 15]. Additionally, the advent of citizen science, the involvement of volunteers in ecological research[16, 17], has led to an influx of data on a scale never before possible. This, however, leads to large class imbalance and often low-quality images. Large-scale citizen science projects such as iNaturalist[18] and the North Carolina Candid Critters[19] have been the focus of computer vision research due to these difficulties. A dataset collected by volunteers at the Cornwall Seal Group Research Trust[20] (CSGRT) is the focus of this research project.

One field of research that is more rarely found in the literature is that of open-set recognition (OSR). In ecology especially, for a model to be deployed in real-world scenarios, where it is impossible to entirely predict the classes that the model will be exposed to, the model must be able to flag inputs from outside of the training set as unknowns. This is a difficult challenge because, during training, models are incentivised to be as confident as possible in a single output in order to minimise the training loss[21]. In this project, open-set detection performance on both novel species and novel individuals is analysed, where novel individual detection is a much harder task due to the smaller number of differentiating features.

In most deep-learning literature, little thought is given to the compute power available, as most approaches are tailored towards high-end datacentres with powerful GPUs. In reality, conservationists are likely to be using simple laptops without access to capable GPUs, so model inference is likely to be performed on a CPU. Another key factor considered in this project is therefore the CPU inference time of the models.

1.1 Aims and Objectives

The key aims of this report are as follows:

• Investigate how the training dataset and architecture of a convolutional neural network impact three criteria: CPU inference time, classification accuracy and open-set performance.
• Build an understanding of how to implement and design neural networks for the first time.
All of the work presented in this paper was built from scratch in PyTorch unless otherwise specified. With the findings from this project, it is hoped that future implementations of neural networks for wildlife classification purposes will have a starting point in understanding how to optimise performance for this task. The key objectives are as follows:

• Develop CNN models with a range of architectures, and train them on a benchmark dataset as well as the CSGRT dataset.
• Assess the performance of these models against the specified criteria, and provide justifications for observed trends.
• Provide methodologies to improve upon the observed performance.

2 NEURAL NETWORKS

Figure 1 shows a representation of a single neuron in a fully-connected (FC) neural network. As shown, the neuron takes a vector of inputs, x, and multiplies them by a weight vector w, summing them together before adding a scalar bias b. The scalar input z can therefore be given as z = x · w + b. In order to capture non-linearities in the system, the scalar input to the neuron is passed through some activation function f_act to give the neuron activation a. The operation of a single neuron can therefore be given as a = f_act(x · w + b).

Figure 1: A single neuron. (Inputs x_1, x_2, x_3 are multiplied by weights w_1, w_2, w_3, summed with the bias b, and passed through the activation function f_act to produce the output y.)

A neural network is simply a network of these neurons, where each layer is fed the activations of the previous layer. The relationship between neuron activations in successive layers can be summarised in matrix form as[22]:

a^l = f_act(W^l a^{l-1} + b^l)    (1)

2.1 Optimisation

A neural network can be seen as simply a function approximator with many parameters θ ∈ R^m that need to be optimised to solve argmin_θ L(y, ŷ; θ). This is done through gradient descent, stepping the parameters in the direction of the negative gradient[22]:

θ′ = θ − η∇_θ L    (2)

Instead of stepping the parameters for each individual input, stochastic gradient descent (SGD) is used, where an average value of each gradient ∂L/∂θ_i is found for a batch of size n, and the parameters are then stepped according to this average gradient and the learning rate η. This enables fewer steps to be taken by the optimiser, speeding up the training process, and provides an implicit regularisation[23] that prevents optimisation to minima that do not generalise well.

θ′ = θ − (η/n) Σ_{j=1}^{n} ∇_θ L_j    (3)

The gradients ∇_θ L are found using an algorithm known as backpropagation (Appendix A), which utilises the chain rule and reuses previously calculated gradients to maximise the efficiency of calculation.

2.2 Loss Function and Activation Functions

The activation function is a vital part of the structure of a neural network, introducing non-linearities into the system and hence the ability of the network to develop complex representations. Without it, the layers in the neural network could simply be composed into a single linear operation. Typically, the logistic sigmoid and ReLU (rectified linear unit) are the most commonly seen activation functions for deep learning applications[24]. Recent work in computer vision has been very successful in using the ReLU due to its constant gradient for positive inputs, meaning gradient saturation is not experienced at large activations. A downside of the ReLU is that its gradient is zero for negative activations, which can stop the affected neurons from learning, and hence the so-called Leaky ReLU[25], which retains a small gradient for negative inputs, can also be used.
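This behaviour is easy to verify directly in PyTorch. The short sketch below (with arbitrarily chosen input values) shows the zero ReLU gradient for negative inputs, and the small non-zero slope retained by the Leaky ReLU:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)

# ReLU: zero gradient for negative inputs (the "dying ReLU" problem)
F.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1., 1.])

x.grad = None
# Leaky ReLU keeps a small slope for negative inputs
F.leaky_relu(x, negative_slope=0.01).sum().backward()
print(x.grad)  # tensor([0.0100, 0.0100, 1.0000, 1.0000])
```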
For many multi-class classification purposes, it is desired for the model to output a discrete confidence distribution s_i = p_model(ŷ = i | x; θ). The softmax (Equation 4) is used for this purpose due to two important properties: firstly, like any probability distribution it sums to 1, and secondly it heavily weighs the greatest input, hence the name softmax.

s_i = softmax(y_i) = e^{y_i} / Σ_j e^{y_j}    (4)

The loss function most commonly associated with multi-class classification problems in machine learning is the negative log likelihood: L = −E_{x∼ŷ}[log p_model(s | x; θ)] = −Σ_i ŷ_i log s_i. Minimising the negative log likelihood can be viewed as minimising the dissimilarity between the target distribution ŷ and the model's output confidence distribution (the softmax output), measuring this dissimilarity as the KL divergence. Since the target ŷ is a one-hot vector corresponding to a single class at index j, all terms except the one for s_j vanish from the loss function, giving:

L = −log(s_j)    (5)

The logarithm and the exponential of the softmax cancel each other out, giving a linear relationship between the loss function and the output of the neural network. This reduces the problem of learning slowdown in the network[26].

2.3 Convolutional Neural Networks

Convolutional layers are a method of extracting high-level and low-level features from an image and allow for translational invariance, meaning features can be recognised regardless of their location in the image. A kernel slides over the entire height and width of an input image and, for each location, finds the cross-correlation (known here as a convolution) between the kernel values and the values in the region that the filter is acting on. This value is then deposited into the new feature map. As shown by Figure 2, this means that a single kernel can produce an entirely new feature map, thereby greatly reducing the number of parameters required in the network. In a single convolutional layer, a number of different kernels are used, leading to a number of feature maps being generated. Each convolutional layer can therefore be viewed as an H × W × N_f tensor, where N_f is the number of feature maps in that layer and H and W are the height and width of the feature maps.

Figure 2: A single convolutional operation. (A 3 × 3 kernel K slides over the input image I, producing the feature map I ∗ K.)

Other parameters that can be changed for a convolutional layer include:

• Stride: Most convolutions have a stride of 1, meaning the kernel moves by one pixel after every convolutional step. However, downsampling can be achieved by increasing the stride, and upsampling can be achieved by using fractional strides.
• Padding: In order to ensure that no downsampling occurs, padding can be used so that blank pixels are added to the border of the feature map being operated on.

Another important layer is the pooling layer, which is used to subsample, i.e. reduce the size of, feature maps. Pooling layers function similarly to convolutions, but instead take the average or maximum value within the kernel window.

2.4 Residual Blocks

Although, hypothetically, adding more layers to a neural network should increase its accuracy, He et al. [27] have shown that adding more layers to a network counter-intuitively reduces its performance. As a solution to this problem, they introduced the residual block.
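A minimal PyTorch sketch of a basic two-layer residual block, of the kind described below, is given here; the channel count is an illustrative placeholder rather than the exact implementation used in this project:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 conv layers with a skip connection: out = act(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                                # the skip connection
        out = torch.relu(self.bn1(self.conv1(x)))   # layer 1
        out = self.bn2(self.conv2(out))             # layer 2, i.e. F(x)
        return torch.relu(out + identity)           # F(x) + x

block = BasicResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))               # shape preserved
```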
The reasoning behind the residual block is that, as the network becomes deeper, deep layers eventually only approximate the output of the previous layers, X_{n+1} ≈ X_n, as the representation is already well learned, so only small tweaks are needed.

Figure 3: Residual block. (The input x passes through two layers to give F(x), while a skip connection carries the identity; the two are summed to give the output activation of F(x) + x.)

The degradation of accuracy shown in standard CNNs demonstrates the difficulty that standard convolutional layers have in approximating the identity function. By using residuals, the output of the convolutional layers becomes X_{n+1} = X_n + F(X_n), and hence the identity function becomes much easier to learn. The result of implementing these residual blocks is that much deeper networks can be made, allowing for a greater number of features and therefore a better representation of the dataset. Note that Figure 3 can be represented as [n_1 × n_1, f_1; n_2 × n_2, f_2], where n_i refers to the kernel size of each layer and f_i refers to the number of features in each layer.

2.5 Regularisation

Goodfellow et al. [22] define regularisation as a 'modification introduced into the learning algorithm intended to decrease the generalisation error but not its training error'. Many different strategies exist to reduce generalisation error, including norm penalties, dropout, data augmentation and batch normalisation. In this project, data augmentation and batch normalisation are used; they are described below.

As deep neural networks learn, the distribution of activations in each layer changes over time. This is known as internal covariate shift, and leads to a slowdown in training. Another issue with traditional deep learning is the vanishing gradient problem where, due to the way backpropagation works, earlier layers experience very small gradients as networks become deeper, causing significant training slowdown. As a solution to these problems, Ioffe et al. [28] suggested a technique known as batch normalisation. In this technique, the input vector to each layer is standardised across the whole batch, and then scaled and shifted by γ and β, which are parameters to be learned. An exponential moving average of µ and σ² is maintained during training so that they can be estimated at test time.

Another method of regularisation that proved highly effective during this project was data augmentation, a technique which generates many more images from the original data. Data augmentation is effective at reducing overfitting and increases model generalisation, as the training images never look quite the same. Data augmentation operations include translations, cropping, shearing and colour shifting[11], and are usually applied to mini-batches before feeding them into the model during training.

3 METHODOLOGY

3.1 Datasets

Two datasets of different difficulties were investigated. CUB200-2011 is an easier fine-grained dataset of 200 bird classes with approximately 60 images per class. It is well curated, consisting entirely of high-definition images where the subject of each image is well lit and in focus. CUB200-2011 is not representative of the datasets that would typically be used for animal identification; however, it serves as a benchmark for how model architecture impacts model performance when an ideal dataset is used. For the more realistic case, a dataset of individual seals was provided by the Cornwall Seal Group Research Trust (CSGRT), consisting of over 1000 individuals with class sizes ranging between 1 and 100.
These class disparities are illustrated in Appendix B, where the long-tailed nature of the seals dataset is shown clearly. In order to perform open-set recognition, the datasets were split into two: a known set K = {(x_i, ŷ_i)}_{i=1}^{N} and an unknown set U = {(x_i, ŷ_i)}_{i=1}^{M}. Further splits of D_train ∪ D_val ∪ D_test = K were made. When making the datasets, stratified sampling was used so that each class has the same proportion of images in each split. Each split is then stored separately in .csv format with image file paths and their corresponding labels:

\cub\train.csv
089.Hooded_Merganser/Hooded_Merganser_0086_796780.jpg, 89
009.Brewer_Blackbird/Brewer_Blackbird_0014_2679.jpg, 9
143.Caspian_Tern/Caspian_Tern_0011_146058.jpg, 143
106.Horned_Puffin/Horned_Puffin_0079_100847.jpg, 106
007.Parakeet_Auklet/Parakeet_Auklet_0075_795981.jpg, 7
116.Chipping_Sparrow/Chipping_Sparrow_0041_108370.jpg, 116
057.Rose_breasted_Grosbeak/Rose_Breasted_Grosbeak_0114_39770.jpg, 57
...

CUB200-2011 served as the novel species recognition task, while CSGRT served as the novel individual benchmark.

3.2 Data pre-processing

Pre-processing is a key step in maximising the efficacy of the neural network. First, the images are normalised. This is achieved simply by finding the mean value and standard deviation of each channel across the whole dataset; each value is then scaled as (X − µ)/σ, fitting the values to the standard normal distribution. As mentioned in Section 2.5, image augmentation is an effective regularisation method. The RandAugment[29] augmentation method was selected and implemented in PyTorch. It consists of several image transformations such as random cropping, rotations, solarising, perspective changes, Gaussian blurs and more. Two parameters are changed: n, the number of consecutive transformations to apply to a single image, and m, the magnitude of the transforms. n was selected to be 2 for both datasets, while m required more fine-tuning: it has been empirically shown that the optimal magnitude scales with network depth and width[29]. Figure 4 illustrates the pre-processing stage graphically.

Figure 4: Image pre-processing. (Raw; normalised; normalised + RandAugment.)

3.3 Open-set testing

While many methods of open-set recognition exist[30], the maximum softmax probability (MSP) was used to determine whether a sample image belongs to the training distribution during testing. The maximum softmax probability provides a much simpler method of testing while performing similarly to more advanced methods, and is often used as a baseline for OSR[31, 32]. Moreover, with some tweaks, it has been shown to outperform more complex methods in some cases[33]. As is standard in most OSR literature, the threshold-free area under the receiver operating characteristic curve (AUROC) is used as the main metric for evaluating open-set performance, as it provides a score independent of the softmax probability threshold used. At test time, the softmax probabilities and associated labels are extracted from the model, and true positive and false positive rates are calculated for 1000 thresholds. A numerical integration is then performed to give the AUROC.
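This evaluation reduces to a short routine. The sketch below (with randomly generated placeholder scores standing in for real model outputs) computes the AUROC from MSP scores by sweeping 1000 thresholds and integrating with the trapezium rule:

```python
import numpy as np

def auroc_from_msp(known_msp, unknown_msp, n_thresholds=1000):
    """AUROC for known-vs-unknown detection from maximum softmax probabilities."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    # TPR: fraction of known samples accepted; FPR: fraction of unknowns accepted
    tpr = np.array([(known_msp >= t).mean() for t in thresholds])
    fpr = np.array([(unknown_msp >= t).mean() for t in thresholds])
    # sort by FPR and integrate TPR over FPR with the trapezium rule
    order = np.argsort(fpr)
    tpr, fpr = tpr[order], fpr[order]
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# placeholder scores: known samples tend to receive higher MSP values
rng = np.random.default_rng(0)
known = rng.beta(8, 2, size=500)
unknown = rng.beta(4, 4, size=500)
print(f"AUROC = {auroc_from_msp(known, unknown):.3f}")
```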
3.4 Model

In order to experiment with a wide range of architectures, a total of 8 models were implemented in PyTorch, based on the ResNet[27] architecture. They each consist of five convolutional stages, each made up of several residual blocks (see Section 2.4), followed by an average pool into a fully-connected network and finally a softmax layer.

Table 1 shows the baseline ResNet architecture. The number of residual blocks per stage is given for each model, and downsampling is performed in the first block of conv3, conv4 and conv5, with a stride of 2. As shown, ResNet18 and ResNet34 use basic 2-layer residual blocks, while deeper ResNets use 3-layer bottleneck blocks. These bottleneck blocks use a 1 × 1 kernel to initially reduce the number of channels before performing the expensive 3 × 3 convolution, and then use another 1 × 1 convolution to project back to the original shape, reducing the overall number of parameters while keeping a large number of layers.

Table 1: The ResNet architecture[27]. Building blocks are shown in brackets as [kernel size, filters], with the number of stacked blocks alongside.

layer name | output size | 18-layer | 34-layer | 50-layer | 101-layer | 152-layer
conv1      | 112×112     | 7×7, 64, stride 2 (all variants)
conv2_x    | 56×56       | 3×3 max pool, stride 2 (all variants), then:
           |             | [3×3, 64; 3×3, 64]×2 | [3×3, 64; 3×3, 64]×3 | [1×1, 64; 3×3, 64; 1×1, 256]×3 | [1×1, 64; 3×3, 64; 1×1, 256]×3 | [1×1, 64; 3×3, 64; 1×1, 256]×3
conv3_x    | 28×28       | [3×3, 128; 3×3, 128]×2 | [3×3, 128; 3×3, 128]×4 | [1×1, 128; 3×3, 128; 1×1, 512]×4 | [1×1, 128; 3×3, 128; 1×1, 512]×4 | [1×1, 128; 3×3, 128; 1×1, 512]×8
conv4_x    | 14×14       | [3×3, 256; 3×3, 256]×2 | [3×3, 256; 3×3, 256]×6 | [1×1, 256; 3×3, 256; 1×1, 1024]×6 | [1×1, 256; 3×3, 256; 1×1, 1024]×23 | [1×1, 256; 3×3, 256; 1×1, 1024]×36
conv5_x    | 7×7         | [3×3, 512; 3×3, 512]×2 | [3×3, 512; 3×3, 512]×3 | [1×1, 512; 3×3, 512; 1×1, 2048]×3 | [1×1, 512; 3×3, 512; 1×1, 2048]×3 | [1×1, 512; 3×3, 512; 1×1, 2048]×3
           | 1×1         | average pool, 1000-d fc, softmax (all variants)
FLOPs      |             | 1.8×10⁹ | 3.6×10⁹ | 3.8×10⁹ | 7.6×10⁹ | 11.3×10⁹

For this project, several changes were made to these baseline model architectures. Firstly, a ResNet10 model was introduced, consisting of 1 residual block per stage. The fully-connected layer was also modified so that it contains as many outputs as the number of classes of each dataset. Finally, a width scaling factor k was implemented, which multiplies the number of filters in each layer, hence widening the network. In this project, models are referred to as ResNetn-k, where n refers to the number of layers in the model. In order to analyse the effect of increasing depth, ResNet10 through to ResNet101 were developed, as well as ResNet18 with width scalings of 1.5, 2, and 2.5 to analyse the effect of increasing width. Due to the limited compute resources available, scaling depth and width simultaneously was unfeasible and would have led to unacceptable inference times.

3.5 Model Training

For each dataset, each model was trained for a total of 160 epochs at a batch size of between 32 and 128, depending on memory usage. Due to the limited compute power available locally, larger models were trained on Google Colab[34] GPUs. The learning rates used for SGD were scheduled according to the OneCycle learning rate schedule by Smith et al. [35], who showed it to achieve significantly greater performance in fewer epochs than a traditional scheduler. The optimised hyperparameters used for each dataset are given in Table 2.
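In PyTorch, this schedule is available as torch.optim.lr_scheduler.OneCycleLR. A minimal sketch of how it might be wired into a training loop is shown below; the model, the steps-per-epoch count and the peak learning rate (taken from the CUB200-2011 setting in Table 2) are illustrative placeholders:

```python
import torch

model = torch.nn.Linear(10, 5)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# ramp the learning rate up to max_lr, then anneal it back down
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=160, steps_per_epoch=100)

for epoch in range(160):
    for step in range(100):
        # ... forward pass and loss.backward() would go here ...
        optimizer.step()
        scheduler.step()                             # advance once per batch
        optimizer.zero_grad()
```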
Table 2: Optimal hyperparameters

Hyperparameter  | CUB200-2011 | CSGRT
Image size      | 448         | 448
Peak η          | 1 × 10⁻³    | 5 × 10⁻⁴
RandAugment m   | 30          | 25
RandAugment n   | 2           | 2
Label smoothing | 0.3         | 0

4 RESULTS AND DISCUSSION

4.1 Inference time

Figure 5 shows the variation of inference time on the CPU with depth and width scaling. There is a clear decrease in inference time for the width-scaled networks, despite their significant increase in parameters.

Figure 5: Inference times. (CPU inference time in ms against number of parameters, for width-scaled and depth-scaled models from ResNet10 to ResNet101.)

The reason for this is that the features within a layer can be computed as a single large tensor calculation, whereas deeper models require each layer to be computed sequentially.

4.1.1 Pruning performance

Due to the importance of low inference time on CPUs in this project, other methods of further increasing inference performance were investigated. A widely used method is pruning[36], which involves systematically removing parts of the model to improve inference performance. Two pruning algorithms were implemented in PyTorch: structured L1 pruning and unstructured L1 pruning (a sketch of both is given at the end of this subsection). The L1 unstructured algorithm operates by calculating the L1 norm of each weight and setting the lowest weights to 0 according to the pruning percentage. As shown by Figure 6b, unstructured pruning enables large pruning percentages with minimal degradation of accuracy, but in practice no speed-up of the model occurs, since each layer still needs to be computed. The structured pruning algorithm, on the other hand, calculates the total L1 norm of each entire structure (neurons, channels and filters) and drops the lowest structures according to the pruning percentage. This results in an acceleration of the model but, as shown in Figure 6a, leads to a severe reduction in accuracy.

Figure 6: Degradation of accuracy with increasing pruning percentage. a) Structured L1 pruning. b) Unstructured L1 pruning. Increasing opacity represents increasing model parameters.

As shown in Figure 6, wider models respond much better to pruning than deeper models, consistently maintaining higher accuracy at greater pruning percentages. This suggests that they would be a better choice for deployment on low-powered devices or devices with limited memory.
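Both algorithms are available in torch.nn.utils.prune; the sketch below uses a standalone convolutional layer as a placeholder for a full model:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(64, 64, kernel_size=3)        # placeholder layer

# Unstructured L1: zero the 30% of individual weights with the smallest
# L1 norm. The layer's shape is unchanged, so no CPU speed-up is obtained.
prune.l1_unstructured(conv, name="weight", amount=0.3)

# Structured L1: zero the 30% of whole output channels (dim=0) with the
# smallest L1 norm; removing these channels entirely is what yields the
# acceleration.
prune.ln_structured(conv, name="weight", amount=0.3, n=1, dim=0)

prune.remove(conv, "weight")                    # make the masks permanent
print(f"zeroed fraction: {(conv.weight == 0).float().mean():.2f}")
```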
4.2 Classification accuracy

4.2.1 Baseline: CUB200-2011 dataset

Machine learning has traditionally faced a trade-off between bias and variance. High-variance models are able to represent the training set well but run the risk of overfitting and modelling random noise, leading to poor generalisation. In contrast, high-bias models are typically simpler and more generalisable, but may fail to capture important regularities in the training set, resulting in underfitting. Recent research[37] has challenged the conventional bias-variance trade-off and shown that deep neural networks of sufficient capacity can interpolate the training data while still generalising well, despite being over-parameterised. This counter-intuitive phenomenon is known as double descent and is shown in Figure 7a for width scaling.

Interestingly, this phenomenon is not observed in depth scaling, where increasing the depth of the model results in a decrease in performance, seen in the U-shaped depth-scaling curve. Nichani et al. [38] also noticed this and, after studying the bias-variance trade-off, showed that deeper models have decreasing bias and increasing variance, as is observed in traditional machine learning methods. It was shown that increasing depth in linear convolutional networks leads to linear operators of decreasing L2 norm, which encourages optimisation towards minima that do not generalise.

Figure 7: Effect of width and depth scaling on a) test error and b) generalisation gap. (CUB200-2011)

Another explanation for why the performance of models decreases with depth is that, as the gradient flows through the network, there is nothing to force it through the residual block weights, so a block can avoid learning anything during training; it is therefore possible that only a few blocks learn useful representations, or that many blocks share very little information and make only a small contribution to the final goal. This problem is formulated by Zagoruyko et al. [39] as the problem of diminishing feature reuse. Some works have attempted to address this problem, such as Huang et al. [40], who introduced the concept of 'stochastic depth', a technique which randomly drops residual blocks (bypassing them through their identity mappings) for each mini-batch in order to force the network to learn.

The generalisation performance of the models is shown in Figure 7b. As previously discussed, deeper models tend to exhibit a substantial decrease in generalisation performance, while wider models demonstrate more consistent performance.

4.2.2 CSGRT dataset

The test error rate of each model trained on the CSGRT dataset is plotted in Figure 9. As shown, there is a significant difference in performance between the benchmark dataset and the CSGRT dataset, due both to the difficulty of the task and the limited size of the training data. The optimal network depth is reduced to 18 layers instead of 50. Brigato et al. [41] corroborate this finding, observing a similar trend in which datasets with fewer samples performed better with smaller models. This is likely due to over-parameterisation.

An open-source[42] implementation of Grad-CAM[43] was used to analyse each model. Grad-CAM uses the gradients flowing into the final convolutional layer to produce a coarse localisation map highlighting the regions of the image that are important for classification. The generated localisation maps are overlaid on the original image in Figure 8. Important regions are shown in red. Extracted features are also overlaid on top, with blue representing important features.
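While the project used the reference implementation[42], the underlying idea is compact enough to sketch from scratch; in the illustration below, the model, input and target layer are all placeholders:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=40).eval()       # placeholder model
acts, grads = {}, {}
layer = model.layer4                           # final convolutional stage

layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 448, 448)                # placeholder input image
logits = model(x)
logits[0, logits.argmax()].backward()          # gradient of the top class

weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP of gradients
cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted feature maps
cam = F.interpolate(cam[None], size=x.shape[2:], mode="bilinear")[0]
cam = (cam - cam.min()) / (cam.max() - cam.min())    # normalise to [0, 1]
```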
Figure 8: Grad-CAM visualisation of a correctly classified image (Class ID=3). (CSGRT; columns: original, ResNet18, ResNet50, ResNet18-2.5.)

Figure 8 shows an example that was correctly classified by all three models shown. It is interesting to see how the regions of importance change with model architecture. ResNet18 achieved the highest accuracy of these models, and its region of importance is located entirely over the region where the seal's pattern is most prevalent. As shown, depth scaling leads to a contraction of the region of importance while maintaining precision on the seal's patterns. Meanwhile, width scaling leads to an expansion of the region of importance, including patterns that are not related to the seal, such as the waves in the background. This trend was observed for most images that were analysed.

Figure 9: Effect of width and depth scaling on test error. (CSGRT)

In order to further investigate the cause of this increase in test error with dataset difficulty, the accuracy of each class was plotted as a bar chart, shown in Figure 10. As shown, a significant disparity in accuracy was observed across classes, with some classes achieving 100% accuracy and others 0%.

Figure 10: Bar chart representing the classification accuracy of each class. (CSGRT)

Further investigation of low-accuracy classes showed that deteriorated class accuracy was strongly correlated with low-resolution images, low contrast of the fur pattern, and images where the fur pattern was occluded from view. Furthermore, some classes contained individuals with widely varying appearances between images, due to the fact that seals moult their fur every year[44]. As expected, these classes had a larger observed error rate. Examples of incorrectly identified images are included in Figures 11 and 12.

Figure 11: Grad-CAM visualisation of an image with a low-contrast fur pattern (Class ID=23). (CSGRT)

Figure 12: Grad-CAM visualisation of a low-resolution image (Class ID=27). (CSGRT)

As shown in both examples, the ability of the model to identify important regions of the image is greatly reduced by low resolution and low-contrast fur patterns. Examples of classes with high intra-class appearance variation are shown in Figure 13. From these images it is clear that some degree of dataset curation is required, as it is practically infeasible for any neural network to identify these images as being the same individual.

Figure 13: Examples of classes with high intra-class variation in appearance (Class IDs 40 and 21). Each row represents a single individual. (CSGRT)

4.3 Open-set performance

4.3.1 Baseline: CUB200-2011 dataset

While it is true that a neural network of infinite width with only a single hidden layer is a universal approximator[45, 46], it has been shown that the expressive power of deeper networks is substantially greater than that of shallower networks of finite width. Eldan et al. [47] demonstrated that there exist functions expressible by three-layer networks that cannot be expressed by two-layer networks of feasible finite width, and Guliyev et al. [46] demonstrated that polynomials are exponentially easier to express with deeper networks than with wider ones. Studies have shown[48] that wider networks are better at 'memorisation' of random noise, while deeper networks are more effective at extracting more abstract features, due to the larger number of non-linearities present in deeper networks.

Figure 15 shows that scaling depth leads to increased open-set performance over scaling width. This suggests that the greater expressivity of deeper networks enables them to identify useful semantic differences between individuals rather than relying on pure memorisation, hence increasing their performance in open-set recognition.
For example, in the CUB200-2011 dataset, water-dwelling birds are likely to be photographed with some water in the background, so models that rely more heavily on memorisation are likely to use the presence of water in the background as evidence for the species of bird, when this is a spurious correlation[49]. This was shown previously in Figure 8 but is made clearer in Figure 14, where the deeper models are shown to be using key features on the body of the albatross, such as the face, beak and tail shape, while the wider ResNet18-2.5 is shown to use the background as an important feature.

Figure 14: Demonstration of increased spurious correlations in wider networks. (CUB200-2011; columns: original, ResNet18, ResNet101, ResNet18-2.5.)

Figure 15: Effect of depth and width scaling on open-set performance. (CUB200-2011)

4.3.2 CSGRT dataset

Figure 16 shows the variation of AUROC with depth and width scaling for the CSGRT dataset. Rather than following the same trends as the better-curated CUB200-2011 dataset shown previously, the open-set performance appears to be strongly tied to the classification accuracy. This is likely because the low quality and highly fine-grained nature of the dataset acts as a bottleneck, preventing deeper models from extracting any extra meaningful features from the images. Because this is a novel individual recognition task, there are far fewer semantic differences between classes than in novel species recognition; in this case, it therefore seems that memorisation leads to better results.

Figure 16 also shows the open-set performance when a separate dataset, the Beach Litter[19] dataset, was used as the unseen dataset in testing. The Beach Litter dataset was selected as it contains similar backgrounds to the CSGRT dataset. As shown, deeper networks experienced a greater increase in performance relative to wider networks, again suggesting that the deeper networks 'understand' the important features of what a seal is rather than simply memorising patterns.

Figure 16: Effect of depth and width scaling on open-set performance, with both CSGRT and Beach Litter as the unseen set. (CSGRT)

Figure 17 shows the relationship between AUROC and classification accuracy for each class in CSGRT. An R² value of 0.663 was calculated, showing a moderate-to-strong correlation between the two variables. This suggests that the same factors that gave the dataset a poor classification accuracy are also responsible for the deterioration of open-set performance.

Figure 17: Scatter plot of AUROC against classification accuracy for each class. R² = 0.663. (CSGRT)

5 CONCLUSION

Overall, it was found that, for the tasks analysed in this project, wider networks are capable of achieving higher classification accuracies and lower inference times. Meanwhile, due to greater semantic differences, deeper models performed better in open-set recognition of novel species. For the open-set detection of novel individuals, similar trends to in-set classification were observed, with greater memorisation ability of the network leading to better results.
Importantly, it was observed that the quality of the training images was a key factor in determining the performance of the models, with the best-quality classes performing 9x better than the worst classes.

5.1 Limitations

This investigation was limited to basic residual networks, applying simple scaling as the only change in architecture. More advanced architectures such as InceptionV3[50], MobileNets[51] and EfficientNets[52] have been developed, making use of more advanced techniques such as depthwise separable convolutions[51], inverted bottlenecks[53] and squeeze-and-excite layers[54]. It is not immediately obvious that the findings of this report translate directly to these networks. Furthermore, the datasets used were very limited in size: CSGRT only had a training set of around 1500 images. A typical citizen science dataset would have several orders of magnitude more images, so the results of this investigation may not apply to significantly larger datasets. Finally, very small classes were removed from the training data during this investigation. It is likely that including small classes would significantly change the observed results.

6 FUTURE WORK

It is recommended that datasets are curated to some degree, either through an automated pipeline or manually, before being used as training data. Furthermore, current work with triplet losses[55] and siamese neural networks for image similarity[56] seems very promising and may lead to results that outperform those of standard CNN classification, especially in few-shot applications[57]. Facial recognition has proven highly successful for bears[58], and may provide another alternative that is more robust to changes in fur patterns than the methodology used here. Additionally, other OSR methods such as ODIN[59], OpenMax[30] and distance-based methods such as the Mahalanobis distance[60] and ARPL[33, 61] may prove more successful than the maximum softmax probability used here.

REFERENCES

[1] Dave Thompson et al. "The status of harbour seals (Phoca vitulina) in the UK". In: Aquatic Conservation: Marine and Freshwater Ecosystems 29.S1 (2019).
[2] Seals. https://www.gov.uk/government/publications/protected-marine-species/seals.
[3] The Wildlife Trusts. Grey Seal. https://www.wildlifetrusts.org/wildlife-explorer/marine/marine-mammals-and-sea-turtles/grey-seal.
[4] Stefan Schneider et al. Past, Present, and Future Approaches Using Computer Vision for Animal Re-Identification from Camera Trap Data. arXiv:1811.07749 [cs]. Nov. 2018.
[5] A. C. Seymour et al. "Automated detection and enumeration of marine wildlife using unmanned aircraft systems (UAS) and thermal imagery". In: Scientific Reports 7.1 (2017), p. 45127. issn: 2045-2322.
[6] D. Blake Sasse. "Job-Related Mortality of Wildlife Workers in the United States, 1937-2000". In: Wildlife Society Bulletin (1973-2006) 31.4 (2003), pp. 1015–1020. issn: 00917648, 19385463.
[7] Ned Horning et al. Remote sensing for ecology and conservation: a handbook of techniques. Oxford University Press, 2010.
[8] Nathalie Pettorelli. Satellite remote sensing and the management of wild species and habitats. May 2019.
[9] Jarrod C. Hodgson et al. "Precision wildlife monitoring using unmanned aerial vehicles". In: Scientific Reports 6.1 (Mar. 2016), p. 22574. issn: 2045-2322.
[10] Marcella Kelly.
"Computer-Aided Photograph Matching in Studies Using Individual Identification: An Example from Serengeti Cheetahs". In: Journal of Mammalogy 82 (May 2001), pp. 440–449.
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems. Ed. by F. Pereira et al. Vol. 25. Curran Associates, Inc., 2012.
[12] Jason Parham et al. "An Animal Detection Pipeline for Identification". In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). 2018, pp. 1075–1083.
[13] Yongliang Qiao et al. "Individual Cattle Identification Using a Deep Learning Based Framework". In: IFAC-PapersOnLine 52.30 (Jan. 2019), pp. 318–323.
[14] Roberto Sacchi et al. "Photographic identification in reptiles: A matter of scales". In: Amphibia-Reptilia 31 (Nov. 2010), pp. 489–502.
[15] Keren Levy, Amit Lerner, and Nadav Shashar. "Mate choice and body pattern variations in the Crown Butterflyfish Chaetodon paucifasciatus (Chaetodontidae)". In: Biology Open 3.12 (Nov. 2014), pp. 1245–1251. issn: 2046-6390.
[16] Caroline Moussy et al. "A quantitative global review of species population monitoring". In: Conservation Biology 36.1 (2022).
[17] Laura Oleniacz and North Carolina State University. Citizen science study captures 2.2M wildlife images in NC.
[18] Grant Van Horn et al. "The iNaturalist Challenge 2017 Dataset". In: CoRR abs/1707.06642 (2017).
[19] North Carolina Candid Critters. https://www.nccandidcritters.org/.
[20] Seal Identification. http://www.cornwallsealgroup.co.uk/seal-identification/.
[21] Hongxin Wei et al. Mitigating Neural Network Overconfidence with Logit Normalization. arXiv:2205.09310 [cs]. June 2022.
[22] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.
[23] Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic Gradient Descent as Approximate Bayesian Inference. arXiv:1704.04289 [cs, stat]. Jan. 2018.
[24] Tomasz Szandala. "Review and Comparison of Commonly Used Activation Functions for Deep Neural Networks". In: Bio-inspired Neurocomputing. Ed. by Akash Kumar Bhoi et al. Vol. 903. Studies in Computational Intelligence. Singapore: Springer Singapore, 2021, pp. 203–224. isbn: 9789811554940, 9789811554957.
[25] Chiyuan Zhang et al. Understanding deep learning requires rethinking generalization. Feb. 2017.
[26] Michael A. Nielsen. "Neural Networks and Deep Learning". Determination Press, 2015.
[27] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: CoRR abs/1512.03385 (2015).
[28] Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: CoRR abs/1502.03167 (2015).
[29] Ekin D. Cubuk et al. RandAugment: Practical automated data augmentation with a reduced search space. arXiv:1909.13719 [cs]. Nov. 2019.
[30] Atefeh Mahdavi and Marco Carvalho. "A Survey on Open Set Recognition". In: 2021 IEEE Fourth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE). arXiv:2109.00893 [cs]. Dec. 2021, pp. 37–44.
[31] Dan Hendrycks et al. "Scaling Out-of-Distribution Detection for Real-World Settings". In: CoRR abs/1911.11132 (2019).
[32] Dan Hendrycks and Kevin Gimpel.
"A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks". In: CoRR abs/1610.02136 (2016).
[33] Sagar Vaze et al. "Open-Set Recognition: A Good Closed-Set Classifier is All You Need". In: CoRR abs/2110.06207 (2021).
[34] Google Colaboratory. https://colab.research.google.com/.
[35] Leslie N. Smith and Nicholay Topin. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. arXiv:1708.07120 [cs, stat]. May 2018.
[36] Tailin Liang et al. Pruning and Quantization for Deep Neural Network Acceleration: A Survey. arXiv:2101.09671 [cs]. June 2021.
[37] Preetum Nakkiran et al. Deep Double Descent: Where Bigger Models and More Data Hurt. arXiv:1912.02292 [cs, stat]. Dec. 2019.
[38] Eshaan Nichani, Adityanarayanan Radhakrishnan, and Caroline Uhler. Increasing Depth Leads to U-Shaped Test Risk in Over-parameterized Convolutional Networks. Oct. 2020.
[39] Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. arXiv:1605.07146 [cs]. June 2017.
[40] Gao Huang et al. Deep Networks with Stochastic Depth. arXiv:1603.09382 [cs]. July 2016.
[41] L. Brigato and L. Iocchi. A Close Look at Deep Learning with Small Data. arXiv:2003.12843 [cs, stat]. Oct. 2020.
[42] Francesco Saverio Zuppichini. Mirror. Mar. 2023.
[43] Ramprasaath R. Selvaraju et al. "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization". In: International Journal of Computer Vision 128.2 (Feb. 2020). issn: 0920-5691, 1573-1405.
[44] Grey Seal - Moulting season — North Wales Wildlife Trust. https://www.northwaleswildlifetrust.org.uk/blog/living-seas/grey-seal-moulting-season. Dec. 2021.
[45] Allan Pinkus. "Approximation theory of the MLP model in neural networks". In: Acta Numerica 8 (Jan. 1999), pp. 143–195. issn: 1474-0508, 0962-4929.
[46] Namig J. Guliyev and Vugar E. Ismailov. "Approximation capability of two hidden layer feedforward neural networks with fixed weights". In: Neurocomputing 316 (Nov. 2018), pp. 262–269. issn: 09252312.
[47] Ronen Eldan and Ohad Shamir. The Power of Depth for Feedforward Neural Networks. arXiv:1512.03965 [cs, stat]. May 2016.
[48] Heng-Tze Cheng et al. Wide & Deep Learning for Recommender Systems. arXiv:1606.07792 [cs, stat]. June 2016.
[49] Shiori Sagawa et al. An Investigation of Why Overparameterization Exacerbates Spurious Correlations. arXiv:2005.04345 [cs, stat]. Aug. 2020.
[50] Christian Szegedy et al. Rethinking the Inception Architecture for Computer Vision. Dec. 2015.
[51] Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861 [cs]. Apr. 2017.
[52] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946 [cs, stat]. Sept. 2020.
[53] Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv:1801.04381 [cs]. Mar. 2019.
[54] Andrew Howard et al. Searching for MobileNetV3. arXiv:1905.02244 [cs]. Nov. 2019.
[55] Elad Hoffer and Nir Ailon. Deep metric learning using Triplet network. arXiv:1412.6622 [cs, stat]. Dec. 2018.
[56] Matthijs Douze et al. The 2021 Image Similarity Dataset and Challenge. arXiv:2106.09672 [cs]. Feb. 2022.
[57] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. "Siamese Neural Networks for One-shot Image Recognition".
[58] BearID Project. https://bearresearch.org/. Sept. 2022.
[59] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. arXiv:1706.02690 [cs, stat]. Aug. 2020.
[60] Jie Ren et al. A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection. arXiv:2106.09022 [cs]. June 2021.
[61] Ziheng Xia et al. Adversarial Motorial Prototype Framework for Open Set Recognition. arXiv:2108.04225 [cs]. July 2021.

A BACKPROPAGATION ALGORITHM

As mentioned, the gradients ∇_θ L are found using backpropagation, which utilises the chain rule and reuses previously calculated gradients to maximise the efficiency of calculation. Returning to Figure 1, through the chain rule it is easy to show that the gradient of the cost C with respect to the input z_j^L of each neuron in the final layer L is given by:

∂C/∂z_j^L = (∂C/∂a_j^L) f′_act(z_j^L)    (6)

Or in matrix form:

∂C/∂z^L = ∇_a C ⊙ f′_act(z^L)    (7)

From this ∂C/∂z^L term, again using the chain rule, the expressions for the gradient of the loss function with respect to the biases and weights in the final layer L can be shown to be:

∂C/∂b_j^l = ∂C/∂z_j^l    (8)

∂C/∂w_jk^l = a_k^{l−1} ∂C/∂z_j^l    (9)

Finally, the ∂C/∂z^L term can be propagated back into the network with the matrix expression (writing δ^{l+1} = ∂C/∂z^{l+1}):

∂C/∂z^l = ((w^{l+1})^T δ^{l+1}) ⊙ f′_act(z^l)    (10)

The basic algorithm behind backpropagation consists of:

• Feed an input x through the network.
• Calculate the loss function for each output y_j = a_j^L.
• Calculate ∂C/∂z_j^L.
• Calculate the gradients with respect to the network parameters (b^L, w^L) at the final layer L.
• Backpropagate ∂C/∂z^L to the previous layer.

B CLASS SIZE HISTOGRAMS

The class size histograms for both datasets are given in Figure 18. As illustrated, the more difficult CSGRT dataset has a significant disparity in class size, while CUB200-2011 is more uniform, with the majority of classes containing 60 images.

Figure 18: Class size histograms for a) CSGRT and b) CUB200-2011.