Department of Aerospace Engineering
AENGM0032 Research Project
2022-2023
AN INVESTIGATION INTO THE USE OF CONVOLUTIONAL NEURAL
NETWORKS FOR WILDLIFE IDENTIFICATION
Bertie Auricchio
Department of Aerospace Engineering, University of Bristol, Queen’s Building, University
Walk, Bristol, BS8 1TR, UK
ABSTRACT
Seal populations in the United Kingdom are crucial indicators of the health of the entire
ecosystem, and accurate population estimates are essential for effective management and
conservation efforts. Currently, individual recognition of seals is achieved through invasive
methods such as tagging or aerial surveys, which are expensive, disruptive, and potentially
dangerous. In this study, convolutional neural networks are implemented and tested for the
use of individual recognition of seals, with a focus on the impact of training dataset and
model architecture on the efficacy of these models. It was found that the quality of the
training dataset is of paramount importance to the quality of model predictions, both for
in-set classification and open-set classification. It was also shown that wider networks have
better inference times on CPU, as well as lower error rates for in-set classification. Finally,
it was shown that deeper networks perform better for open-set classification of novel species
while wider networks perform better in open-set classification of novel individuals.
Keywords: Ecology, Computer vision, Neural networks, Data science, Open-set recognition
1 INTRODUCTION
The United Kingdom is home to two seal species: the grey seal (Halichoerus grypus) and the
common seal (Phoca vitulina), which serve as crucial indicators of the health of the entire UK
ecosystem. However, in the last century overhunting and disease brought the population numbers
down to record lows[1]. At the start of the 20th century, the grey seal population dwindled to as
few as 500 individuals. Nevertheless, recent conservation efforts and the Protection of Seals Act
1970[2] have helped increase the grey seal population to over 120,000, representing 95% of the
European population and 40% of the global population[3]. Estimations for population numbers
and distributions are essential for answering questions related to community, ecosystem function,
population dynamics, and behavioural ecology[4], which is vital in determining how best to
manage human interaction with the animals, or in targeting priority communities and sites for
support[5]. Often, this information must be obtained by keeping track of specific individuals,
in order to measure metrics such as abundance, life expectancy and migration. Currently, most
individual recognition is achieved through invasive methods such as applying tags or marks to
the animal’s body, which have the downside of impacting the animal’s natural behaviour and
relationship to others. Furthermore, they do not last for the entire duration of the animal's life.
Aerial surveys are another particularly useful tool for population management and are typically
conducted in people-carrying helicopters or fixed-wing aircraft, which are expensive, disruptive
and dangerous. 66% of work-related fatalities among wildlife workers between 1937 and 2000
were aviation-related[6]. Several alternatives[7] have been suggested, including remote sensing
techniques and the use of satellites[8]. However, even high-resolution satellite imagery is not
suited to smaller animal observations and the identification of individual seals is certainly not
feasible from space. Weather and cloud cover pose further problems for satellite observation.
UAVs have been suggested[9] as a promising solution, but an aerial survey would produce an amount of data far too large to be processed manually, requiring an automated classification pipeline.
Certain species, such as seals, have unique natural patterns that can be used similarly to a fingerprint to visually identify individuals. This provides a more cost-effective and less exploitative
method of population management, provides an identification method that lasts for the whole
lifetime of the seal, and is advantageous for studying threatened and endangered animals[10].
The downside of manual visual recognition is that it is extremely time-consuming and labour-intensive,
requiring the agreement of two expert surveyors, and manually looking up features of the seal
in a database. It is clear that an automated seal identification pipeline would greatly boost the
feasibility of visual recognition.
Ever since Krizhevsky et al. [11] demonstrated the effectiveness of convolutional neural networks
for computer vision applications, CNNs have been a staple of modern computer vision tasks.
Their recent introduction in ecology has proven successful for tasks such as object detection, classification, segmentation[12], and recognition of individuals[13, 14, 15]. Additionally, the advent of citizen science, the involvement of volunteers in ecological research[16, 17], has led to an influx of data on a scale that was never before possible. This, however, leads to large class imbalances and often low-quality images. Large-scale citizen science projects such as iNaturalist[18]
and the North Carolina Candid Critters[19] have been the focus of computer vision research
due to these difficulties. A dataset has been collected by volunteers at the Cornwall Seal Group
Research Trust[20] (CSGRT) and this will be the focus of this research project.
One field of research that is more rarely found in the literature is that of open-set recognition.
Especially in ecology, in order for a model to be deployed in real-world scenarios, where it is impossible to entirely predict the classes that the model will be exposed to, the model must be able to distinguish inputs from outside of the training set as unknowns. This is a
difficult challenge as during training, models are incentivised to be as confident as possible in a
single output in order to minimise the training loss[21]. In this project, the open-set detection
performance on novel species and novel individuals will be analysed, where novel individual
detection is a much harder task due to the smaller number of differentiating features.
In most deep-learning literature, little thought is given to the compute power available, as most approaches are tailored towards high-end datacentres with powerful GPUs.
In reality, conservationists are likely to be using simple laptops with no access to decent GPUs,
and so the inference of the models is likely to be done on a CPU. Due to this, another key factor
to be considered is the CPU inference time of the models used.
1.1 Aims and Objectives
The key aims of this report are as follows:
• Investigate how the training dataset and architecture of a convolutional neural network
impacts three parameters: the CPU inference time, classification accuracy and open-set
performance.
• Build an understanding of how to implement and design neural networks for the first time.
All of the work presented in this paper was built from scratch in PyTorch unless otherwise
specified.
With the findings from this project, it is hoped that future implementations of neural networks
for wildlife classification purposes will have a starting point in understanding how to optimise
performance for this task.
The key objectives are as follows:
• Develop CNN models with a range of architectures, and train them on a benchmark model
as well as the CSGRT dataset.
• Assess the performance of these models with the specified criteria, and provide justifications for observed trends.
• Provide methodologies to improve upon the observed performance.
2 NEURAL NETWORKS
Figure 1 shows a representation of a single neuron in a fully-connected (FC) neural network.
As shown, the neuron takes a vector of inputs, x, and multiplies them by a weights vector w,
summing them together before adding a scalar bias b. The scalar input z can therefore be given
as z = x · w + b. In order to capture non-linearities in the system, the scalar input to the neuron
is passed through some activation function fact to give the neuron activation a. The operation
of a single neuron can therefore be given as a = fact (x · w + b).
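As a concrete illustration of this operation, the following minimal sketch evaluates a single neuron in PyTorch; the input, weight and bias values are arbitrary placeholders and a ReLU is used as an example activation function.

    import torch

    # Minimal sketch of the single-neuron operation a = f_act(x . w + b),
    # using a ReLU as the activation function. Values are placeholders.
    x = torch.tensor([0.5, -1.2, 3.0])   # inputs x1..x3
    w = torch.tensor([0.8, 0.1, -0.4])   # weights w1..w3
    b = torch.tensor(0.2)                # scalar bias

    z = torch.dot(x, w) + b              # scalar input z = x . w + b
    a = torch.relu(z)                    # neuron activation a = f_act(z)
    print(a)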
Figure 1: A single neuron
A neural network is simply a network of these neurons, where each layer is fed the activations
of the previous layer. The relationship between neuron activations in successive layers can be
summarised in matrix form as[22]:

a^l = f_act(W^l a^{l−1} + b^l)    (1)

2.1 Optimisation
A neural network can be seen as simply a function approximator with many parameters θ ∈ R^m that need to be optimised for arg min_θ L(y, ŷ; θ). This is done through gradient descent, stepping the parameters in the direction of the negative gradient[22]:

θ′ = θ − η ∇_θ L    (2)
Instead of stepping the gradients for each input, stochastic gradient descent is used, where an
average value of each gradient ∂L/∂θi is found for a batch of size n, and then the parameters
are stepped according to this average gradient and the learning rate, η. This enables fewer
steps to be taken by the optimiser, speeding up the training process and provides an implicit
regularisation[23] that prevents optimisation to minima that do not generalise well.
θ′ = θ − (η/n) Σ_{j=1}^{n} ∇_θ L_j    (3)
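One step of this mini-batch update can be sketched in code as follows; the model, loss function and batch are arbitrary placeholders (in practice torch.optim.SGD performs the same update).

    import torch

    # Minimal sketch of a single stochastic gradient descent step over a
    # mini-batch, following Equation (3). Model and data are placeholders.
    model = torch.nn.Linear(10, 3)
    loss_fn = torch.nn.CrossEntropyLoss()        # averages the loss over the batch
    eta = 1e-3                                   # learning rate

    x = torch.randn(32, 10)                      # mini-batch of n = 32 inputs
    y = torch.randint(0, 3, (32,))               # target class indices

    loss = loss_fn(model(x), y)                  # (1/n) * sum_j L_j
    loss.backward()                              # gradients averaged over the batch

    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad                    # theta' = theta - eta * grad
            p.grad = None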
The gradients ∇θ L are found using an algorithm known as backpropagation (Appendix A), which
utilises the chain rule and reuses previously calculated gradients to maximise the efficiency of
calculation.
2.2 Loss Function and Activation Functions
The activation function is a vital part of the structure of a neural network, introducing nonlinearities into the system and hence the ability of the network to develop complex representations. Otherwise, the layers in the neural network could simply be composed into a single linear
operation. Typically, the logistic sigmoid and ReLU (rectified linear unit) non-linearities are
the most commonly seen activation functions for deep learning applications[24]. Recent work
in computer vision has been very successful in using the ReLU due to its constant gradient for positive activations, meaning gradient saturation is not experienced at high activations. A downside of the ReLU is that the gradient becomes zero for negative activations, and hence the so-called Leaky ReLU[25] can also be used.
For many multi-class classification purposes, it is desired for the model to output a discrete
confidence distribution si = pmodel (ŷ = i|x; θ). The softmax (Equation 4) is used for this
purpose due to two important properties it has: firstly, like any probability distribution it sums
to 1, and secondly it heavily weighs the greatest input, hence the name softmax.
s_i = softmax(y_i) = e^{y_i} / Σ_j e^{y_j}    (4)
The loss function most commonly associated with multi-class classification problems in machine learning is the negative log likelihood: L = −E_{X∼ŷ}[log p_model(s|x; θ)] = −Σ ŷ log p_model(s|x; θ). Minimising the negative log likelihood can be viewed as minimising the dissimilarity between the target distribution ŷ and the model's output confidence distribution (the softmax output), measuring this dissimilarity as the KL divergence. Since the target ŷ is a one-hot vector corresponding to a single class at index j, all outputs except y_j are excluded from the loss function, giving:

L = −log(s_j)    (5)
The logarithm and exponential of the softmax cancel each other out, giving a linear relationship
between the loss function and the output of the neural network. This reduces the problem of
learning slowdown in the network[26].
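The relationship between Equations (4) and (5) can be sketched as follows; the logit values are placeholders, and the built-in cross-entropy is shown only to illustrate that PyTorch fuses the two steps.

    import torch
    import torch.nn.functional as F

    # Minimal sketch relating Equations (4) and (5): the softmax turns raw
    # outputs (logits) into a confidence distribution, and the loss is the
    # negative log of the confidence assigned to the true class j.
    logits = torch.tensor([2.0, -1.0, 0.5])      # raw outputs y_i for 3 classes
    target = torch.tensor(0)                     # true class index j

    s = F.softmax(logits, dim=0)                 # Equation (4): sums to 1
    loss_manual = -torch.log(s[target])          # Equation (5): L = -log(s_j)

    # In practice PyTorch fuses log-softmax and NLL for numerical stability.
    loss_builtin = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
    print(loss_manual.item(), loss_builtin.item())   # the two values agree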
2.3 Convolutional Neural Networks
Convolutional layers are a method of extracting high-level and low-level features from an image
and allow for translational invariance, meaning features can be recognised regardless of their
location in the image. A kernel slides over the entire height and width of an input image and,
for each location, finds the cross-correlation (known here as a convolution) between the kernel
values and the values in the region that the filter is acting on. This value is then deposited
as a value in the new feature map. As shown by Figure 2, this means that a single kernel
can produce an entirely new feature map, thereby greatly reducing the necessary number of
parameters required in the network.
In a single convolutional layer, a number of different kernels are used, leading to a number of feature maps being generated. Each convolutional layer can therefore be viewed as a H × W × N_f tensor, where N_f is the number of feature maps in that layer and H and W are the height and width of the feature map.
Figure 2: A single convolutional operation
Other parameters that can be changed for a convolutional layer include:
• Stride: Most convolutions have a stride of 1, meaning the kernel moves by one pixel after
every convolutional step. However, downsampling can be achieved by increasing the stride
and upsampling can be achieved by using fractional strides.
• Padding: In order to ensure that no downsampling occurs, padding can be added so that
blank pixels are added to the border of the feature map being operated on.
Another important layer is the pooling layer, which is used to subsample, i.e. reduce the size of, feature maps. Pooling layers function similarly to convolutions, but instead take the average or maximum value within the kernel.
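The effect of these parameters on the feature map size can be sketched as follows; the channel counts and kernel sizes are illustrative placeholders rather than the layers used in this project.

    import torch
    import torch.nn as nn

    # Minimal sketch of convolution stride/padding and pooling. The channel
    # counts and kernel sizes are arbitrary placeholders.
    x = torch.randn(1, 3, 224, 224)                               # one RGB image

    conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)  # padding keeps 224x224
    down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1) # stride 2 downsamples
    pool = nn.MaxPool2d(kernel_size=2)                           # subsampling via max pooling

    f = conv(x)      # -> (1, 64, 224, 224): 64 feature maps
    f = down(f)      # -> (1, 64, 112, 112)
    f = pool(f)      # -> (1, 64, 56, 56)
    print(f.shape)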
2.4 Residual Blocks
Despite the fact that, hypothetically, adding more layers to a neural network should increase
its accuracy, He et al. [27] have shown that adding more layers to a network counter-intuitively
reduces its performance. As a solution to this problem, they introduced the residual block. The
reasoning behind the residual block is that, as the network becomes deeper, deep layers eventually only need to approximate the previous layer's output, X_{n+1} ≈ X_n, as the representation is already well learned, so only small tweaks are needed.
Figure 3: Residual Block
The degradation of accuracy shown in standard CNNs demonstrates the difficulty that standard convolutional layers have in approximating the identity function. By using residuals, the output of the convolutional layers becomes X_{n+1} = X_n + F(X_n), and hence the identity function becomes much easier to learn. The result of implementing these residual blocks is that much deeper networks can be made, allowing for a greater number of features and therefore a better representation of the dataset. Note that Figure 3 can be represented as [n_1 × n_1, f_1; n_2 × n_2, f_2], where n refers to the kernel size of each layer and f refers to the number of features in each layer.
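A minimal sketch of the basic two-layer residual block of Figure 3 is given below, following the structure used by ResNet18/34; it assumes that the input and output shapes match (no downsampling), and is not the exact implementation used in this project.

    import torch
    import torch.nn as nn

    # Minimal sketch of a basic residual block (Figure 3).
    class BasicBlock(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            identity = x                          # skip connection
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))       # F(x)
            return self.relu(out + identity)      # X_{n+1} = X_n + F(X_n)

    block = BasicBlock(64)
    print(block(torch.randn(1, 64, 56, 56)).shape)   # shape preserved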
2.5 Regularisation
Goodfellow et al. [22] define regularisation as a ‘modification introduced into the learning algorithm intended to decrease the generalisation error but not its training error’. Many different
strategies exist to reduce generalisation error, including norm penalties, dropout, data augmentation and batch normalisation. In this project, data augmentation and batch normalisation are
used. They are described below.
As deep neural networks learn, the distribution of activations in each layer change over time. This
is known as covariate shift, and leads to a slowdown in training. Another issue with traditional
deep learning is the vanishing gradient problem, where, due to the way backpropagation works,
earlier layers experience very small gradients as networks become deeper, causing significant
training slowdown. As a solution to these problems, Ioffe et al. [28] suggested a technique
known as batch normalisation. In this technique, the input vector to each layer is standardised
across the whole batch, and then scaled and shifted by γ and β, which are parameters to be
learned. An exponential moving average of µ and σ 2 is implemented during training so that
they can be roughly estimated at test time.
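The batch normalisation transform itself can be sketched as follows; the batch size, feature count and γ/β initialisation are placeholders, and this corresponds to what nn.BatchNorm1d computes in training mode.

    import torch

    # Minimal sketch of the batch normalisation transform: standardise each
    # feature over the batch, then scale and shift by the learned gamma, beta.
    x = torch.randn(32, 64)                     # batch of 32 activations, 64 features
    gamma = torch.ones(64)                      # learnable scale
    beta = torch.zeros(64)                      # learnable shift
    eps = 1e-5

    mu = x.mean(dim=0)                          # per-feature batch mean
    var = x.var(dim=0, unbiased=False)          # per-feature batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)    # standardised activations
    y = gamma * x_hat + beta                    # scaled and shifted output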
Another method of regularisation that proved highly effective during this project was data augmentation, a technique which generates many more images from the original data. Data augmentation is effective at reducing overfitting and increases model generalisation as the training
images never look quite the same. Data augmentation operations include translations, cropping,
shearing and colour shifting[11], and are usually done on mini-batches before feeding them into
the model during training.
3 METHODOLOGY

3.1 Datasets
Two datasets of different difficulties were investigated. CUB200-2011 is an easier fine-grained dataset of 200 bird classes with approximately 60 images per class. It is well curated, consisting entirely of high-definition images where the subject of each image is well lit and in focus. CUB200-2011 is not representative of the datasets that would typically be used for animal identification,
however, it serves as a benchmark for how model architecture impacts model performance when
an ideal dataset is used. For the more realistic case, a dataset of individual seals was provided
by the Cornwall Seal Group Research Trust (CSGRT ), consisting of over 1000 individuals of
class sizes ranging between 1 and 100. These class disparities are illustrated in Appendix B,
where the long-tailed nature of the seals dataset is shown clearly.
In order to perform open-set recognition, the datasets were split into two: a known set K = {(x_i, ŷ_i)}_{i=1}^{N} and an unknown set U = {(x_i, ŷ_i)}_{i=1}^{M}. Further splits of D_train ∪ D_val ∪ D_test = K were made. When making the datasets, stratified sampling was used so that each class has the same proportion of images in each split. Each split is then stored separately in .csv format with image file paths and their corresponding labels:
\cub\train.csv
089.Hooded_Merganser/Hooded_Merganser_0086_796780.jpg, 89
009.Brewer_Blackbird/Brewer_Blackbird_0014_2679.jpg, 9
143.Caspian_Tern/Caspian_Tern_0011_146058.jpg, 143
106.Horned_Puffin/Horned_Puffin_0079_100847.jpg, 106
007.Parakeet_Auklet/Parakeet_Auklet_0075_795981.jpg, 7
116.Chipping_Sparrow/Chipping_Sparrow_0041_108370.jpg, 116
057.Rose_breasted_Grosbeak/Rose_Breasted_Grosbeak_0114_39770.jpg, 57
...
CUB200-2011 served as a novel species recognition task while CSGRT served as the novel
individual benchmark.
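A stratified split of this kind could be produced as sketched below, assuming a dataframe of image paths and labels for the known set; the file names and the 70/15/15 proportions are illustrative assumptions, not the exact splits used in this project.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Minimal sketch of a stratified train/val/test split written to .csv files.
    # 'known_set.csv' and the split proportions are hypothetical placeholders.
    df = pd.read_csv("known_set.csv", names=["path", "label"])

    train, rest = train_test_split(df, test_size=0.30, stratify=df["label"], random_state=0)
    val, test = train_test_split(rest, test_size=0.50, stratify=rest["label"], random_state=0)

    for name, split in [("train", train), ("val", val), ("test", test)]:
        split.to_csv(f"cub/{name}.csv", index=False, header=False)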
3.2 Data pre-processing
Pre-processing is a key step in maximising the efficacy of the neural network. First, the images are normalised. This is achieved simply by finding the mean value and standard deviation of each channel across the whole dataset; each value is then scaled as (X − µ)/σ, fitting the values to the standard normal distribution.
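The per-channel statistics can be computed as sketched below, assuming a dataset object that yields (C, H, W) float tensors of equal size; the batch size is a placeholder.

    import torch
    from torch.utils.data import DataLoader

    # Minimal sketch of computing per-channel mean and standard deviation over
    # a dataset of image tensors, for use in (X - mu) / sigma normalisation.
    def channel_stats(dataset):
        loader = DataLoader(dataset, batch_size=64)
        n, mean, sq_mean = 0, torch.zeros(3), torch.zeros(3)
        for images, _ in loader:
            n += images.shape[0]
            mean += images.mean(dim=(2, 3)).sum(dim=0)
            sq_mean += (images ** 2).mean(dim=(2, 3)).sum(dim=0)
        mean /= n
        std = (sq_mean / n - mean ** 2).sqrt()
        return mean, std

    # mean, std = channel_stats(train_set)
    # normalise = torchvision.transforms.Normalize(mean.tolist(), std.tolist())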
As mentioned in Section 2.5, image augmentation is an effective regularisation method. The
RandAugment[29] augmentation method was selected and implemented in PyTorch. It consists
of several image transformations such as random cropping, rotations, solarizing, perspective
changes, Gaussian blurs and more. Two parameters are changed: n, the number of consecutive transformations to apply to a single image, and m, the magnitude of the transforms. n was selected to be 2 for both datasets, while m requires more careful tuning. It has been empirically
shown that the optimal magnitude scales with network depth and width[29]. Figure 4 illustrates
the pre-processing stage graphically.
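A training pipeline of this kind could be assembled as sketched below, using torchvision's RandAugment implementation rather than the from-scratch version used in this project; n maps to num_ops and m to magnitude, and the mean/std values are placeholders for the dataset statistics described above.

    import torchvision.transforms as T

    # Minimal sketch of the training transform pipeline (CSGRT settings n=2, m=25).
    train_transform = T.Compose([
        T.Resize((448, 448)),
        T.RandAugment(num_ops=2, magnitude=25),   # n = 2, m = 25
        T.ToTensor(),
        T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.25, 0.25, 0.25]),  # placeholder stats
    ])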
Figure 4: Image pre-processing (raw, normalised, normalised + RandAugment)
3.3 Open-set testing
While many methods of open-set recognition exist[30], the maximum softmax probability was
used in order to determine whether a sample image belongs to the training dataset during
testing. Using the maximum softmax probability provides a much more simple method of
testing while performing similarly to more advanced methods and is often used as a baseline
for OSR[31, 32]. Moreover, with some tweaks, it has been shown to outperform more complex
methods in some cases[33]. As is standard in most OSR literature, the threshold-free area under
the receiver-operator curve is used as the main metric for evaluating open-set performance, as
it provides a score independent of the softmax probability threshold used. At test time, the softmax probabilities and associated labels are extracted from the model, and true positive and false positive rates are calculated for 1000 thresholds. A numerical integration is then performed to provide the AUROC.
3.4 Model
In order to experiment with a wide range of architectures, a total of 8 models were implemented
in PyTorch, based on the ResNet[27] architecture. They each consist of 5 convolutional layers,
each consisting of several residual blocks (see Section 2.4), followed by an average pool into a
fully-connected network and finally a softmax layer. Table 1 shows the baseline ResNet architecture. The number of residual blocks per layer is provided for each model, and downsampling
is performed in the first blocks of conv3, conv4 and conv5, with a stride length of 2.
As shown, ResNet18 and ResNet34 use basic 2-layer residual blocks while deeper ResNets use
3-layer bottleneck blocks. These bottleneck blocks use a 1 × 1 kernel to initially reduce the
number of channels before performing the expensive 3 × 3 convolution, and then use another
1 × 1 convolution to project it back into the original shape, reducing the overall number of
parameters while keeping a large number of layers.
Table 1: The ResNet Architecture[27]

layer name | output size | 18-layer | 34-layer | 50-layer | 101-layer | 152-layer
conv1      | 112×112     | 7×7, 64, stride 2 (all variants)
conv2_x    | 56×56       | 3×3 max pool, stride 2 (all variants), followed by:
           |             | [3×3, 64; 3×3, 64]×2 | [3×3, 64; 3×3, 64]×3 | [1×1, 64; 3×3, 64; 1×1, 256]×3 | [1×1, 64; 3×3, 64; 1×1, 256]×3 | [1×1, 64; 3×3, 64; 1×1, 256]×3
conv3_x    | 28×28       | [3×3, 128; 3×3, 128]×2 | [3×3, 128; 3×3, 128]×4 | [1×1, 128; 3×3, 128; 1×1, 512]×4 | [1×1, 128; 3×3, 128; 1×1, 512]×4 | [1×1, 128; 3×3, 128; 1×1, 512]×8
conv4_x    | 14×14       | [3×3, 256; 3×3, 256]×2 | [3×3, 256; 3×3, 256]×6 | [1×1, 256; 3×3, 256; 1×1, 1024]×6 | [1×1, 256; 3×3, 256; 1×1, 1024]×23 | [1×1, 256; 3×3, 256; 1×1, 1024]×36
conv5_x    | 7×7         | [3×3, 512; 3×3, 512]×2 | [3×3, 512; 3×3, 512]×3 | [1×1, 512; 3×3, 512; 1×1, 2048]×3 | [1×1, 512; 3×3, 512; 1×1, 2048]×3 | [1×1, 512; 3×3, 512; 1×1, 2048]×3
           | 1×1         | average pool, 1000-d fc, softmax (all variants)
FLOPs      |             | 1.8×10^9 | 3.6×10^9 | 3.8×10^9 | 7.6×10^9 | 11.3×10^9
For this project, several changes were made to these baseline model architectures. Firstly, a ResNet10 model was also introduced, consisting of 1 residual block per layer. The fully-connected layer was also modified so that it contains as many outputs as the number of classes of each dataset. Finally, a width scaling factor k was implemented, which multiplies the number of filters in each layer, hence widening the network. In this project, models are referred to as ResNetn-k, where n refers to the number of layers in the model.

In order to analyse the effect of increasing model depth, ResNet10 through to ResNet101 were developed, as well as ResNet18 with width scalings of 1.5, 2, and 2.5 to analyse the effect of increasing width. Due to the limited compute resources available, scaling depth and width simultaneously was unfeasible and would have led to unacceptable inference times.
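The width scaling factor k could be applied as sketched below, where the base channel counts of each ResNet stage are multiplied by k before the network is constructed; the exact rounding behaviour is an assumption and the full from-scratch model is not reproduced here.

    # Minimal sketch of applying a width scaling factor k to the ResNet stages.
    def stage_widths(k: float):
        base = [64, 128, 256, 512]                  # conv2_x .. conv5_x widths
        return [int(round(c * k)) for c in base]

    print(stage_widths(1.0))    # ResNet18-1:   [64, 128, 256, 512]
    print(stage_widths(2.5))    # ResNet18-2.5: [160, 320, 640, 1280]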
3.5 Model Training

For each dataset, each model was trained for a total of 160 epochs at a batch size of between 32 and 128, depending on memory usage. Due to the limited compute power available locally,
larger models were trained on Google Colab[34] GPUs. The learning rates used for SGD were
scheduled according to the OneCycle learning rate schedule by Smith et al. [35], who showed it
to achieve significantly greater performance in fewer epochs than a traditional scheduler.
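A training loop using SGD with the OneCycle schedule could be sketched as follows; model, train_loader and loss_fn are assumed to have been defined as described above, and the peak learning rate shown corresponds to the CUB200-2011 value in Table 2.

    import torch

    # Minimal sketch of SGD with the OneCycle learning rate schedule.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=1e-3, epochs=160, steps_per_epoch=len(train_loader))

    for epoch in range(160):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()        # OneCycle is stepped once per batch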
Table 2: Optimal hyperparameters

Hyperparameter   | CUB200-2011 | CSGRT
Image size       | 448         | 448
Peak η           | 1 × 10^−3   | 5 × 10^−4
RandAugment m    | 30          | 25
RandAugment n    | 2           | 2
Label smoothing  | 0.3         | 0

4 RESULTS AND DISCUSSION

4.1 Inference time
Figure 5 shows the variation of inference time on the CPU with depth and width scaling. There
is a clear decrease in inference time for the width-scaled networks, despite the significant increase
in parameters.
Figure 5: Inference times
This is because the features within a layer can be computed as large tensor calculations, whereas deeper models require each layer to be computed sequentially.
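The exact timing procedure is not described here in detail; one simple way to measure single-image CPU inference time is sketched below, with the input resolution matching Table 2.

    import time
    import torch

    # Minimal sketch of timing a single-image forward pass on the CPU,
    # averaged over repeated runs after a warm-up pass.
    @torch.no_grad()
    def cpu_inference_ms(model, runs: int = 50):
        model = model.eval().to("cpu")
        x = torch.randn(1, 3, 448, 448)          # one image at the training resolution
        model(x)                                  # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        return 1000 * (time.perf_counter() - start) / runs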
4.1.1 Pruning performance
Due to the importance of low inference time on CPUs in this project, other methods of further
increasing inference performance were investigated. A widely used method is pruning[36], which
involves systematically removing parts of the model to improve inference performance. Two
pruning algorithms were implemented in PyTorch: structured L1 pruning and unstructured L1
pruning. The L1 unstructured algorithm operates by calculating the L1 norms of each weight
and setting the lowest weights to 0 according to the pruning percentage. As shown by Figure 6b,
unstructured pruning enables large pruning percentages with minimal degradation of accuracy,
but in practice no speed-up of the model occurs, since each layer still needs to be computed.
The structured pruning algorithm, on the other hand, calculates the total L1 norm of each entire
structure (neurons, channels and filters) and drops the lowest structures according to pruning
percentage. This results in an acceleration of the model, but as shown in Figure 6a, leads to a
severe reduction in accuracy.
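Both algorithms are available in PyTorch's pruning utilities; a minimal sketch of applying them to the convolutional layers of a model is given below (the pruning amount and the choice to prune only Conv2d layers are illustrative assumptions).

    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Minimal sketch of the two L1 pruning algorithms described above.
    # Unstructured pruning zeroes individual weights by L1 norm; structured
    # pruning removes whole filters (dim=0) ranked by L1 norm.
    def apply_l1_pruning(model: nn.Module, amount: float, structured: bool):
        for module in model.modules():
            if isinstance(module, nn.Conv2d):
                if structured:
                    prune.ln_structured(module, name="weight", amount=amount, n=1, dim=0)
                else:
                    prune.l1_unstructured(module, name="weight", amount=amount)
                prune.remove(module, "weight")   # make the pruning permanent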
Figure 6: Degradation of accuracy with increasing pruning percentage. a) Structured L1 pruning b) Unstructured L1 pruning. Increasing opacity represents increasing model parameters.
As shown in Figure 6, wider models respond much better to pruning than deeper models,
consistently maintaining higher accuracy at greater pruning percentages. This suggests that
they would be a better choice for deployment on low-powered devices or devices with limited
memory.
4.2 Classification accuracy

4.2.1 Baseline - CUB200-2011 dataset
Machine learning has traditionally faced a trade-off between bias and variance. High variance
models are able to represent the training set well but run the risk of overfitting and modelling
random noise, leading to poor generalization. In contrast, high bias models are typically simpler
and more generalisable, but may fail to capture important regularities in the training set, resulting in underfitting. Recent research[37] has challenged the conventional bias-variance trade-off
and shown that deep neural networks of sufficient capacity can interpolate the training data while still generalising well, despite being over-parametrised. This counter-intuitive phenomenon is known as double descent
and is shown in Figure 7a for width scaling. Interestingly, this phenomenon is not observed in
depth scaling, where increasing the depth of the model results in a decrease in performance,
seen in the U-shaped curve of the depth scaling.
Figure 7: Effect of width and depth scaling on a) Test error b) Generalization gap. (CUB200-2011)

Nichani et al. [38] also noticed this and, after studying the bias-variance trade-off, showed that deeper models have reducing bias and increasing variance, as is observed in traditional machine learning methods. It was shown that increasing depth in linear convolutional networks
leads to linear operators of decreasing L2 norm, which causes non-generalising optimisations.
Another explanation for why the performance of models decreases with depth is that, as the gradient flows through the network, there is nothing to force it to pass through the residual block weights, so blocks can avoid learning anything during training. It is therefore possible that only a few blocks learn useful representations, or that many blocks share very little information and contribute little to the final goal. This problem is formulated by Zagoruyko et al. [39] as
the problem of diminishing feature reuse. Some works have attempted to address this problem,
such as Huang et al. [40], which introduced the concept of ‘stochastic depth’, a technique which
involves randomly eliminating identity mappings for each batch in order to force the network
to learn. The generalisation performance of the models is shown in Figure 7b. As previously
discussed, deeper models tend to exhibit a substantial decrease in generalization performance,
while wider models demonstrate a more consistent performance.
4.2.2 CSGRT dataset
The test error rate of each model trained on the CSGRT dataset is plotted in Figure 9. As shown,
there is a significant difference in performance between the benchmark dataset and the CSGRT
dataset. This is both due to the difficulty of the task and the limited size of the training data.
The optimal network depth is reduced to 18 layers instead of 50. Brigato et al. [41] corroborate this finding, observing a similar trend in that datasets with fewer samples performed better with
smaller models. This is likely due to over-parameterisation. An open-source[42] implementation
of Grad-CAM[43] was used to analyse each model. Grad-CAM uses the gradients flowing into
the final convolutional layer to produce a coarse localisation map highlighting the important
regions in the image for classification. The generated localisation maps are overlaid on the original image in Figure 8. Important regions are shown in red. Extracted features are also overlaid on top, with blue representing important features.
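The core of the Grad-CAM computation can be sketched with forward and backward hooks as below; this is an illustrative outline rather than the open-source implementation[42] actually used, and model.layer4 is assumed to be the final convolutional stage of a ResNet-style model.

    import torch
    import torch.nn.functional as F

    # Minimal sketch of Grad-CAM: gradients and activations of the final conv
    # layer are captured with hooks, and gradient-weighted activations form
    # the localisation map.
    def grad_cam(model, image, class_idx):
        acts, grads = {}, {}
        layer = model.layer4                              # assumed last conv stage
        h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
        h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

        score = model(image.unsqueeze(0))[0, class_idx]
        model.zero_grad()
        score.backward()
        h1.remove(); h2.remove()

        weights = grads["g"].mean(dim=(2, 3), keepdim=True)      # global-average gradients
        cam = F.relu((weights * acts["a"]).sum(dim=1))            # weighted sum of feature maps
        cam = F.interpolate(cam.unsqueeze(0), size=image.shape[1:], mode="bilinear")
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)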
Figure 8: Grad-CAM visualisation of a correctly classified image (Class ID=3); panels show the original image, ResNet18, ResNet50 and ResNet18-2.5. (CSGRT)
Figure 8 shows an example that was correctly classified by all three models shown. It is interesting to see how the regions of importance change with model architecture. ResNet18 achieved
the highest accuracy of these models, and the region of importance is clearly entirely located
over the region where the seal’s pattern is most prevalent. As shown, depth scaling leads to
a contraction of the region of importance, while maintaining precision on the seal’s patterns.
Meanwhile, width scaling leads to an expansion of the region of importance, including patterns
that are not related to the seal, such as the waves in the background. This trend was observed
for most images that were analysed.
Figure 9: Effect of width and depth scaling on test error. (CSGRT)
In order to further investigate the cause of this increase in test error due to dataset difficulty, a
bar chart was plotted for the accuracy of each class. It is shown in Figure 10 below. As shown,
a significant disparity in accuracy was observed across classes, with some classes achieving 100%
accuracy and others achieving 0%.
Figure 10: Bar chart representing classification accuracy of each class. (CSGRT)
Further investigation of low accuracy classes showed that deteriorated class accuracy was strongly
correlated with low-resolution images, low contrast of fur pattern and images where the fur pattern was occluded from view. Furthermore, some classes contained individuals with widely
varying appearances between images, due to the fact that seals molt their fur every year[44].
As expected, these classes had a larger observed error rate. Examples of incorrectly identified
images are included in Figures 11 and 12 below.
Figure 11: Grad-CAM visualisation of an image with a low-contrast fur pattern (Class ID=23). (CSGRT)

Figure 12: Grad-CAM visualisation of a low-resolution image (Class ID=27). (CSGRT)
As shown in both examples, the ability of the model to identify important regions of the image
is greatly reduced with low resolution and low contrast fur patterns. Examples of classes with
high intra-class appearance variation are shown in Figure 13. From these images it is clear that
some curation of the dataset is required, as it is practically infeasible for any
neural network to identify these images as being the same individual.
Figure 13: Examples of classes with high intra-class variation in appearance (Class IDs 40 and 21). Each row represents a single individual. (CSGRT)
4.3 Open-set performance

4.3.1 Baseline - CUB200-2011 dataset
While it is true that a neural network of an infinite width with only a single hidden layer is
a universal approximator[45, 46], it has been shown that the expressive power of deeper networks is substantially greater than that of shallower networks of finite width. Eldan et al. [47]
demonstrated that there exist functions expressible by three-layer networks that cannot be expressed by two-layer networks of finite width, and Guliyev et al. [46] demonstrated that polynomials are exponentially easier to express with deeper networks than with wider ones. Studies
have shown[48] that wider networks are better at ‘memorisation’ of random noise, while deeper
networks are more effective at extracting more abstract features, due to the larger number of
non-linearities present in deeper networks.
Figure 15 shows that scaling depth leads to increased performance over scaling width. This
shows that the greater expressivity of deeper networks enables them to identify useful semantic
differences between individuals rather than relying on pure memorisation, hence increasing their
performance in open-set recognition. For example, in the CUB200-2011 dataset, water-dwelling birds are likely to be photographed with some water in the background, and so models that more
heavily rely on memorisation are likely to use the presence of water in the background as evidence
for the species of bird, when this is a spurious correlation[49]. This was shown previously in
Figure 8 but is made more clear in Figure 14, where deeper models are shown to be using key
features on the body of the albatross, such as the face, beak and tail shape, while the wider
ResNet18-2.5 is shown to use the background as an important feature.
Figure 14: Demonstration of increased spurious correlations in wider networks; panels show the original image, ResNet18, ResNet101 and ResNet18-2.5. (CUB200-2011)
Figure 15: Effect of depth and width scaling on open-set performance. (CUB200-2011)
4.3.2 CSGRT dataset
Figure 16 shows the variation of AUROC with depth and width scaling for the CSGRT dataset.
Rather than following the same trends as the better curated CUB200-2011 dataset shown previously, the open-set performance seems to be strongly tied to the classification accuracy. This
is likely due to the fact that the low quality and highly fine-grained nature of the dataset acts
as a bottleneck preventing deeper models from extracting any extra meaningful features from
the images. Due to the fact that this is a novel individual recognition task, there are far fewer
semantic differences between classes than in novel species recognition. In this case it therefore
seems that memorisation leads to better results.
Figure 16 also shows the open-set performance when a separate dataset, the Beach Litter [19]
dataset was used as the unseen dataset in testing. The Beach Litter dataset was selected as it
contains similar backgrounds to the CSGRT dataset. As shown, deeper networks experienced a
greater increase in performance with respect to wider networks, again suggesting that the deeper
networks ‘understand’ the important features of what a seal is rather than simply memorising
patterns.
Figure 16: Effect of depth and width scaling on open-set performance, tested against both the CSGRT unknown set and the Beach Litter dataset. (CSGRT)
Figure 17: Scatter plot of AUROC and classification accuracy for each class. R² = 0.663. (CSGRT)
Figure 17 shows the relationship between AUROC and classification accuracy for each class in
CSGRT. An R² value of 0.663 was calculated, showing a moderate-to-strong correlation between
the two variables. This suggests that the same factors that gave the dataset a poor classification
accuracy are also responsible for the deterioration of open-set performance.
5 CONCLUSION
Overall, it was found that, for the tasks analysed in this project, wider networks are capable of achieving higher classification accuracies and lower inference times. Meanwhile, due to greater semantic differences between classes, deeper models performed better in open-set recognition of novel species. For the open-set detection of novel individuals, similar trends to in-set classification were observed, with greater memorisation ability of the network leading to better results. Importantly, it was observed that the quality of the training images was a key factor in determining the performance of the models, with the best quality classes performing 9x better than the worst classes.
5.1 Limitations
This investigation was limited to basic residual networks, applying simple scaling as the only
changes in architecture. More advanced architectures such as InceptionV3[50], MobileNets[51]
and EfficientNets[52] have been developed, making use of more advanced techniques such as
depthwise separable convolutions[51], inverted bottlenecks[53] and squeeze and excite layers[54].
It is not immediately obvious that the findings of this report translate directly to these networks.
Furthermore, the datasets used were very limited in size - CSGRT only had a training set of
around 1500 inputs. A typical citizen science dataset would have several orders of magnitude
more images than this and so the results of this investigation may not apply to these significantly
larger datasets. Finally, very small classes were removed from the training data during this
investigation. It is likely that including small classes would introduce significant changes in
observed results.
6 FUTURE WORK
It is recommended that datasets are curated to some degree, either through an automated
pipeline or manually, before being used as training data. Furthermore, current work with triplet
losses[55] and siamese neural networks for image similarity[56] applications seems very promising
and may lead to results that outperform those of a standard CNN classification, especially in
few-shot applications[57]. Facial recognition has proven highly successful for bears[58], and may
provide another alternative that is more robust to changes in fur patterns than the methodology
used here. Additionally, other OSR methods such as ODIN[59], OpenMax[30] and distance-based methods such as the Mahalanobis distance[60] and ARPL[33, 61] may prove more successful
than the Maximum Softmax Probability used here.
REFERENCES
[1]
Dave Thompson et al. “The status of harbour seals (Phoca vitulina) in the UK”. In:
Aquatic Conservation: Marine and Freshwater Ecosystems 29.S1 (2019).
[2] Seals. https://www.gov.uk/government/publications/protected-marine-species/
seals.
[3]
The Wildlife Trusts. Grey Seal. https://www.wildlifetrusts.org/wildlife-explorer/
marine/marine-mammals-and-sea-turtles/grey-seal.
[4]
Stefan Schneider et al. Past, Present, and Future Approaches Using Computer Vision for
Animal Re-Identification from Camera Trap Data. arXiv:1811.07749 [cs]. Nov. 2018.
[5]
A. C. Seymour et al. “Automated detection and enumeration of marine wildlife using
unmanned aircraft systems (UAS) and thermal imagery”. In: Scientific Reports 7.1 (2017),
p. 45127. issn: 2045-2322.
[6]
D. Blake Sasse. “Job-Related Mortality of Wildlife Workers in the United States, 1937-2000”. In: Wildlife Society Bulletin (1973-2006) 31.4 (2003). Publisher: [Wiley, Wildlife
Society], pp. 1015–1020. issn: 00917648, 19385463.
[7]
Ned Horning et al. Remote sensing for ecology and conservation: a handbook of techniques.
Oxford University Press, 2010.
[8]
Nathalie Pettorelli. Satellite remote sensing and the management of wild species and habitats. May 2019.
[9]
Jarrod C. Hodgson et al. “Precision wildlife monitoring using unmanned aerial vehicles”.
en. In: Scientific Reports 6.1 (Mar. 2016). Number: 1 Publisher: Nature Publishing Group,
p. 22574. issn: 2045-2322.
[10]
Marcella Kelly. “Computer-Aided Photograph Matching in Studies Using Individual Identification: An Example from Serengeti Cheetahs”. In: Journal of Mammalogy 82 (May
2001), pp. 440–449.
[11]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with
Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing
Systems. Ed. by F. Pereira et al. Vol. 25. Curran Associates, Inc., 2012.
[12]
Jason Parham et al. “An Animal Detection Pipeline for Identification”. In: 2018 IEEE
Winter Conference on Applications of Computer Vision (WACV). 2018, pp. 1075–1083.
[13]
Yongliang Qiao et al. “Individual Cattle Identification Using a Deep Learning Based
Framework”. en. In: IFAC-PapersOnLine 52.30 (Jan. 2019), pp. 318–323.
[14]
Roberto Sacchi et al. “Photographic identification in reptiles: A matter of scales”. In:
Amphibia-Reptilia 31 (Nov. 2010), pp. 489–502.
[15]
Keren Levy, Amit Lerner, and Nadav Shashar. “Mate choice and body pattern variations
in the Crown Butterfly fish Chaetodon paucifasciatus (Chaetodontidae)”. eng. In: Biology
Open 3.12 (Nov. 2014), pp. 1245–1251. issn: 2046-6390.
[16]
Caroline Moussy et al. “A quantitative global review of species population monitoring”.
In: Conservation Biology 36.1 (2022).
[17]
Laura Oleniacz and North Carolina State University. Citizen science study captures 2.2M
wildlife images in NC. en.
[18]
Grant Van Horn et al. “The iNaturalist Challenge 2017 Dataset”. In: CoRR abs/1707.06642
(2017). eprint: 1707.06642.
[19] North Carolina Candid Critters. https://www.nccandidcritters.org/.
[20] Seal Identification. http://www.cornwallsealgroup.co.uk/seal-identification/.
[21]
Hongxin Wei et al. Mitigating Neural Network Overconfidence with Logit Normalization.
arXiv:2205.09310 [cs]. June 2022.
[22]
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http : / / www .
deeplearningbook.org. MIT Press, 2016.
[23]
Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic Gradient Descent
as Approximate Bayesian Inference. arXiv:1704.04289 [cs, stat]. Jan. 2018.
[24]
Tomasz Szandala. “Review and Comparison of Commonly Used Activation Functions for
Deep Neural Networks”. en. In: Bio-inspired Neurocomputing. Ed. by Akash Kumar Bhoi
et al. Vol. 903. Series Title: Studies in Computational Intelligence. Singapore: Springer
Singapore, 2021, pp. 203–224. isbn: 9789811554940 9789811554957.
[25]
Chiyuan Zhang et al. Understanding deep learning requires rethinking generalization. Feb.
2017.
[26]
Michael A. Nielsen. “Neural Networks and Deep Learning”. en. In: (2015). Publisher:
Determination Press.
[27]
Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385
(2015). eprint: 1512.03385.
[28]
Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network
Training by Reducing Internal Covariate Shift”. In: CoRR abs/1502.03167 (2015). eprint:
1502.03167.
[29]
Ekin D. Cubuk et al. RandAugment: Practical automated data augmentation with a reduced
search space. arXiv:1909.13719 [cs]. Nov. 2019.
[30]
Atefeh Mahdavi and Marco Carvalho. “A Survey on Open Set Recognition”. In: 2021
IEEE Fourth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE). arXiv:2109.00893 [cs]. Dec. 2021, pp. 37–44.
[31]
Dan Hendrycks et al. “Scaling Out-of-Distribution Detection for Real-World Settings”. In:
CoRR abs/1911.11132 (2019). eprint: 1911.11132.
[32]
Dan Hendrycks and Kevin Gimpel. “A Baseline for Detecting Misclassified and Out-ofDistribution Examples in Neural Networks”. In: CoRR abs/1610.02136 (2016). eprint:
1610.02136.
[33]
Sagar Vaze et al. “Open-Set Recognition: A Good Closed-Set Classifier is All You Need”.
In: CoRR abs/2110.06207 (2021).
[34] Google Colaboratory. en. https://colab.research.google.com/.
[35]
Leslie N. Smith and Nicholay Topin. Super-Convergence: Very Fast Training of Neural
Networks Using Large Learning Rates. arXiv:1708.07120 [cs, stat]. May 2018.
[36]
Tailin Liang et al. Pruning and Quantization for Deep Neural Network Acceleration: A
Survey. arXiv:2101.09671 [cs]. June 2021.
[37]
Preetum Nakkiran et al. Deep Double Descent: Where Bigger Models and More Data Hurt.
arXiv:1912.02292 [cs, stat]. Dec. 2019.
[38]
Eshaan Nichani, Adityanarayanan Radhakrishnan, and Caroline Uhler. Increasing Depth
Leads to U-Shaped Test Risk in Over-parameterized Convolutional Networks. Oct. 2020.
[39]
Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. arXiv:1605.07146 [cs].
June 2017.
[40]
Gao Huang et al. Deep Networks with Stochastic Depth. arXiv:1603.09382 [cs]. July 2016.
[41]
L. Brigato and L. Iocchi. A Close Look at Deep Learning with Small Data. arXiv:2003.12843
[cs, stat]. Oct. 2020.
[42]
Francesco Saverio Zuppichini. Mirror. original-date: 2018-10-11T16:11:17Z. Mar. 2023.
[43]
Ramprasaath R. Selvaraju et al. “Grad-CAM: Visual Explanations from Deep Networks
via Gradient-based Localization”. In: International Journal of Computer Vision 128.2
(Feb. 2020). issn: 0920-5691, 1573-1405.
[44] Grey Seal - Moulting season — North Wales Wildlife Trust. https://www.northwaleswildlifetrust.org.uk/blog/living-seas/grey-seal-moulting-season. Dec. 2021.
[45]
Allan Pinkus. “Approximation theory of the MLP model in neural networks”. en. In:
Acta Numerica 8 (Jan. 1999). Publisher: Cambridge University Press, pp. 143–195. issn:
1474-0508, 0962-4929.
[46]
Namig J. Guliyev and Vugar E. Ismailov. “Approximation capability of two hidden layer
feedforward neural networks with fixed weights”. In: Neurocomputing 316 (Nov. 2018),
pp. 262–269. issn: 09252312.
[47]
Ronen Eldan and Ohad Shamir. The Power of Depth for Feedforward Neural Networks.
arXiv:1512.03965 [cs, stat] version: 4. May 2016.
[48]
Heng-Tze Cheng et al. Wide & Deep Learning for Recommender Systems. arXiv:1606.07792
[cs, stat]. June 2016.
[49]
Shiori Sagawa et al. An Investigation of Why Overparameterization Exacerbates Spurious
Correlations. arXiv:2005.04345 [cs, stat]. Aug. 2020.
[50]
Christian Szegedy et al. Rethinking the Inception Architecture for Computer Vision. Dec.
2015.
[51]
Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile
Vision Applications. arXiv:1704.04861 [cs]. Apr. 2017.
[52]
Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional
Neural Networks. arXiv:1905.11946 [cs, stat]. Sept. 2020.
[53]
Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv:1801.04381
[cs]. Mar. 2019.
[54]
Andrew Howard et al. Searching for MobileNetV3. arXiv:1905.02244 [cs]. Nov. 2019.
[55]
Elad Hoffer and Nir Ailon. Deep metric learning using Triplet network. arXiv:1412.6622
[cs, stat]. Dec. 2018.
[56]
Matthijs Douze et al. The 2021 Image Similarity Dataset and Challenge. arXiv:2106.09672
[cs]. Feb. 2022.
[57]
Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. “Siamese Neural Networks for
One-shot Image Recognition”. en. In: ().
[58] BearID Project. en-US. https://bearresearch.org/. Sept. 2022.
[59]
Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing The Reliability of Out-of-distribution
Image Detection in Neural Networks. arXiv:1706.02690 [cs, stat]. Aug. 2020.
[60]
Jie Ren et al. A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection.
arXiv:2106.09022 [cs]. June 2021.
[61]
Ziheng Xia et al. Adversarial Motorial Prototype Framework for Open Set Recognition.
arXiv:2108.04225 [cs]. July 2021.
A BACKPROPAGATION ALGORITHM
As mentioned, the gradients ∇_θ L are found using backpropagation, which utilises the chain rule and reuses previously calculated gradients to maximise the efficiency of calculation. Returning to Figure 1, through the chain rule it is easy to show that the gradient of the cost with respect to the input to the neuron, z, can be given as:

∂C/∂z_j^L = (∂C/∂a_j^L) f′_act(z_j^L)    (6)

Or in matrix form:

∂C/∂z^L = ∇_a C ⊙ f′_act(z^L)    (7)

From this ∂C/∂z^L term, again using the chain rule, the expressions for the gradient of the loss function with respect to the biases and weights in the final layer L can be shown to be:

∂C/∂b_j^l = ∂C/∂z_j^l    (8)

∂C/∂w_jk^l = a_k^(l−1) ∂C/∂z_j^l    (9)

Finally, the ∂C/∂z^l term can be propagated back into the network with the matrix expression:

∂C/∂z^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ′(z^l)    (10)
The basic algorithm behind backpropagation consists of the following steps (a minimal sketch in code is given below):
• Feed an input x through the network.
• Calculate the loss function for each output y_j = a_j^L.
• Calculate ∂C/∂z_j^L.
• Calculate the gradients of the network parameters (b^L, w^L) at the final layer L.
• Backpropagate ∂C/∂z_j^L to the previous layer.
B CLASS SIZE HISTOGRAMS
The class size histograms for both datasets are given in Figure 18. As illustrated, the more difficult CSGRT dataset has a significant disparity in class size, while CUB200-2011 is more uniform, with the majority of classes containing 60 images.
Figure 18: Class histograms for a) CSGRT b) CUB200-2011