Audio Engineering Society
Convention Paper
Presented at the 120th Convention
2006 May 20–23
Paris, France
This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration
by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request
and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org.
All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the
Journal of the Audio Engineering Society.
SPATIAL SOUND LOCALIZATION MODEL
USING NEURAL NETWORK
R. Venegas (1), M. Lara (2), R. Correa (3), and S. Floody (4)

(1) Departamento de Acústica, Universidad Tecnológica de Chile, Santiago, Ñuñoa, 779-0569, Chile
[email protected]

(2) Departamento de Acústica, Universidad Tecnológica de Chile, Santiago, Ñuñoa, 779-0569, Chile
[email protected]

(3) Departamento de Física, Universidad Tecnológica Metropolitana, Santiago, Macul, Chile
[email protected]

(4) Departamento de Acústica, Universidad Tecnológica de Chile, Santiago, Ñuñoa, 779-0569, Chile
[email protected]
ABSTRACT
This work presents the design, implementation and training of a spatial sound localization model for broadband sound in an anechoic environment, inspired by the human auditory system and implemented using artificial neural networks. The data were acquired experimentally. The model consists of a nonlinear transformer with one module for ITD, ILD and ISLD extraction and a second module, a neural network, that estimates the sound source position in elevation and azimuth angle. A comparative study of the model performance using three different filter banks and a sensitivity analysis of the neural network inputs are also presented. The average error is 2.3°. This work was supported by the FONDEI fund of the Universidad Tecnológica de Chile.
1. INTRODUCTION

Spatial sound localization is the ability to identify the spatial location of a particular sound and, eventually, to associate it with the sound source that generated it. Two perspectives exist for studying this perceptual phenomenon: cognition theory and the natural sciences [1].

This work focuses principally on the natural sciences perspective, which includes physics, physiology and psychology. We use information obtained in these fields to build a computational model inspired by them. The objective is to emulate a human ability in a machine, not to explain or validate the human perceptual phenomenon.
Spatial sound localization is, in general, a non-unique problem; one of the main reasons is that each person has a different physiology and consequently a unique Head Related Transfer Function (HRTF) [2, 3, 4, 5].

The use of Computational Intelligence techniques, such as artificial neural networks with their learning ability, offers a powerful tool for solving the sound localization problem [6, 7, 8, 9] and, at the same time, provides a specific methodology for its resolution.

The data for the system training were measured in the anechoic chamber of the Universidad Tecnológica de Chile. The measurement procedure and results are presented in reference [10].

A spatial sound localization model for broadband sound in an anechoic environment is presented. The model is a nonlinear transformer whose inputs are the sounds received by each ear and whose output is the estimated sound source position in azimuth and elevation angle. The model is constituted by two modules. The first module extracts the ITD, ILD and ISLD; with these cues, the feature vector for each direction is obtained. The second module is a three-layer feedforward neural network whose inputs are the feature vectors; it estimates the sound source position.

A comparative study of the configuration performance using three different filter banks for the ISLD calculation is presented. The filter banks were: 18 Equivalent Rectangular Bandwidth Patterson-Holdsworth auditory filters (for simplicity, the ERB filter or ERB Bank Filter from now on) [14, 15, 16], 18 filters of the Lyon passive cochlear model (the Lyon filter or Lyon Bank Filter) [16, 17, 18] and 12 third-octave filters [19].

An error measure for the neural network configurations applied to the sound localization problem [20], the search process for an adequate configuration, and a sensitivity analysis of the inputs of the best configuration are also presented. The sensitivity analysis consists in analyzing how many inputs to the neural network deliver results similar to those of the neural net with 20 inputs, and how the error varies as the number of inputs diminishes. The whole process was implemented in the Matlab environment.
2. REFERENCE SYSTEM

For the present research the elevation angle varies from -90° to 90° (down to up) and the azimuth angle varies from 0° to 360° (right to left). This reference system is adopted in order to avoid discontinuities and redundancies in the absolute values of the neural network target codification.

The principal cues for sound localization in azimuth are the Interaural Time Differences (ITD) and the Interaural Level Differences (ILD) [1, 11, 12]. For elevation, the Interaural Spectral Level Differences (ISLD), the energy differences per frequency band between the left and right ear arrival signals, obtained by filtering those signals with a filter bank, give relevant information for its identification [1, 11, 13].

Figure 1 Reference System
3. SOUNDS AND MEASUREMENT PROCEDURE

Table 1 shows the sounds measured for the model construction; 1708 sounds were measured in total.

Sound   Description
01      Logarithmic swept sine without emphasis, 1 [s]
02      White noise, 100 [ms]
03      White noise, 250 [ms]
04      White noise, 500 [ms]

Table 1 Sounds used in the model construction
An artificial head was placed at the center of the anechoic chamber, in a fixed position during the whole measurement procedure. The sound source (a loudspeaker) was placed at 1.2 meters from the reference system's origin. The measurements for the training set were made in 5° azimuth steps at 0° elevation, in 10° azimuth steps for elevation angles between -60° and 60° (in 15° elevation steps), and in 20° azimuth steps at ±75° elevation. The test set was built from measurements in random directions. The sampling frequency was 44.1 kHz. Additionally, the Head Related Transfer Functions (HRTF) were obtained from a logarithmic sweep. For more details and results see references [10, 21].
4. MODEL DESCRIPTION

Figure 2 shows a scheme of the model.

Figure 2 Block Diagram of the model

4.1. Interaural Time Differences - ITD

The Interaural Time Differences are the differences in the arrival time of the sound at the two ears. The method used to calculate the ITD is based on the one proposed by Jeffress [22]: the cross-correlation between the signals of both ears is used, since the position of its maximum is proportional to the ITD [22, 23]. Mathematically:

$$C_{x_R x_L}[m] = \begin{cases} \sum_{n=1}^{N-m-1} x_R[m+n]\, x_L[n], & m \geq 0 \\ C_{x_R x_L}[-m], & m < 0 \end{cases} \quad (1)$$

$$\mathrm{ITD} = \arg\max_m \left( C_{x_R x_L}[m] \right) - \mathrm{nint}\!\left( \frac{2N-1}{2} \right) \quad (2)$$

where $x_R[n]$ is the right ear signal, $x_L[n]$ is the left ear signal, $N$ is the length of the signal in samples, and $\mathrm{nint}(\cdot)$ is the nearest integer function. Figure 3 shows the ITD for the logarithmic sweep (see Table 1).
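As a concrete illustration, a minimal NumPy sketch of eqs. (1)-(2) follows. The authors implemented the whole process in the Matlab environment, so this is not their code, and the sign convention of the returned lag is an assumption:

```python
import numpy as np

def itd_samples(x_right, x_left):
    """ITD in samples from the peak of the cross-correlation,
    a sketch of eqs. (1)-(2); the sign convention is assumed."""
    n = len(x_right)
    c = np.correlate(x_right, x_left, mode="full")  # lags -(N-1) .. N-1
    return int(np.argmax(c)) - (n - 1)              # subtract the centre index
```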
4.2. Interaural Level Differences - ILD

The Interaural Level Differences are the differences in the level of the sound arriving at the two ears [1]:

$$\mathrm{ILD} = 10 \log \left( \frac{\sum_{n=0}^{N} \left( x_R[n] \right)^2}{\sum_{n=0}^{N} \left( x_L[n] \right)^2} \right) \quad (3)$$

Figure 4 shows the ILD for the logarithmic sweep (see Table 1).
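A minimal sketch of eq. (3), again illustrative NumPy rather than the authors' Matlab implementation:

```python
import numpy as np

def ild_db(x_right, x_left):
    """ILD in dB as in eq. (3): ratio of the total signal
    energies at the right and left ears."""
    return 10.0 * np.log10(np.sum(x_right**2) / np.sum(x_left**2))
```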
4.3. Interaural Spectral Level Differences - ISLD

The Interaural Spectral Level Differences are the energy differences per frequency band between the left and right ear arrival signals, obtained by filtering these signals with a filter bank. The procedure to calculate the ISLD is described by equations 4 to 9.

$$X^{(R,L)}(f) = 20 \log \left| \mathrm{FFT}\!\left( x^{(R,L)}[n] \right) \right| \quad (4)$$

After obtaining the signal spectrum with an FFT at 1 Hz resolution, a smoothing process is applied using a moving average (eq. 5); if the smoothing is not strong, it does not influence localization [24]. For this work BW = 160 [Hz]; an analysis of this parameter is presented in reference [21].

$$X_s^{(R,L)}(f) = \frac{1}{BW} \sum_{f_p = f - \frac{BW}{2}}^{f + \frac{BW}{2}} X^{(R,L)}(f_p) \quad (5)$$

with $f \in [f_0, f_1]$; $f_0 > 80$ [Hz]; $f_1 < 21.97$ [kHz]; $f_s = 44.1$ [kHz].
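Eqs. (4)-(5) can be sketched as follows; zero-padding the FFT to fs points to obtain 1 Hz bins is an assumption about the implementation:

```python
import numpy as np

def smoothed_spectrum_db(x, fs=44100, bw=160):
    """Eqs. (4)-(5): log-magnitude spectrum on a ~1 Hz grid,
    smoothed by a BW-wide moving average (BW bins = BW Hz)."""
    spec = 20 * np.log10(np.abs(np.fft.rfft(x, n=fs)) + 1e-12)
    kernel = np.ones(bw) / bw
    return np.convolve(spec, kernel, mode="same")
```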
Figure 3 ITD for the logarithmic sweep, in [samples]
Figure 4 ILD for the logarithmic sweep, in [dB]
The smoothed spectra of the two ears are filtered with a filter bank. In this paper three different filter banks are used: the ERB Patterson-Holdsworth Auditory Bank Filter [14, 15, 16], the Lyon Bank Filter [16, 17, 18] and the Third Octave Bank Filter [19]. Figures 5 to 7 show the frequency responses of the three filter banks.

Only filters with central frequency greater than 1.25 kHz were used. The purpose of this process is to offer significant information for elevation detection; for frequencies below 1.5 kHz, localization is dominated by the ITD and ILD [1]. Finally, the ISLD is obtained from equations 6, 7 and 8.
$$Y_{(k,F)}^{(R,L)}(f) = X_s^{(R,L)}(f) + 20 \log \left| BF_{(k,F)}(f) \right| \quad (6)$$

$$E_F^{(R,L)}(k) = 10 \log \left( \frac{1}{BW_{(k,F)}} \sum_{f = f_L(k,F)}^{f_H(k,F)} 10^{\, 0.1 \, Y_{(k,F)}^{(R,L)}(f)} \right) \quad (7)$$

$$\mathrm{ISLD}_F(k) = E_F^{(R)}(k) - E_F^{(L)}(k) \quad (8)$$

where:
F: filter bank {ERB, Lyon or Third Octave}.
$BF_{(k,F)}(f)$: magnitude of the frequency response of the kth filter of filter bank F.
$BW_{(k,F)}$: bandwidth of the kth filter of filter bank F.
$E_F^{(R,L)}(k)$: energy per frequency band of the right (R) or left (L) ear signal filtered with filter bank F.
$f_L(k,F)$: low cut-off frequency of the kth filter of filter bank F.
$f_H(k,F)$: high cut-off frequency of the kth filter of filter bank F.
For the ERB and Lyon filter banks, k = 1:18; for the Third Octave filter bank, k = 1:12.
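A sketch of eqs. (6)-(8); the layout of `bands` (per-filter log-magnitude responses with cut-off frequencies) is an assumed data structure for illustration:

```python
import numpy as np

def band_energies_db(spec_db, freqs, bands):
    """Eqs. (6)-(7): weight the smoothed spectrum by each filter's
    log-magnitude response, then average the power over its band.
    'bands' is a list of (gain_db, f_low, f_high) triples."""
    energies = []
    for gain_db, f_lo, f_hi in bands:
        y = spec_db + gain_db                          # eq. (6)
        sel = (freqs >= f_lo) & (freqs <= f_hi)
        energies.append(10 * np.log10(np.mean(10 ** (0.1 * y[sel]))))  # eq. (7)
    return np.array(energies)

# eq. (8): isld = band_energies_db(spec_r, f, bands) - band_energies_db(spec_l, f, bands)
```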
5. FEATURE VECTOR

For all database elements, equations 1 to 8 were applied to obtain the ITD, ILD and ISLD; from these cues the feature vector is constructed. The feature vector is a function of the sound source spatial location:

$$V(\theta, \varphi) = \left[ \, \mathrm{ISLD} \;\; \mathrm{ILD} \;\; \mathrm{ITD} \, \right]_{k \times 1} \quad (9)$$

with k = 20 for the ISLD calculated using the ERB and Lyon filter banks, and k = 14 for the ISLD calculated using the Third Octave filter bank.

The feature vector components are the neural network inputs. In order to normalize these inputs, the average of the maximum absolute values over the sounds in the database was calculated, after the ISLD extraction, independently for each of the three filter banks. All inputs were normalized, which yielded three different training and test sets.
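Eq. (9) and the normalization step can be sketched as follows; `norms`, holding the precomputed averages of the maximum absolute values, is assumed to be available:

```python
import numpy as np

def feature_vector(isld, ild, itd, norms):
    """Eq. (9): stack the ISLD bands (18 for ERB/Lyon, 12 for
    Third Octave) with the ILD and ITD, then normalize each
    component by the database-wide averages in 'norms'."""
    v = np.concatenate([np.atleast_1d(isld), [ild, itd]])
    return v / norms
```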
6. NEURAL NETWORKS

A complete theoretical and practical development of neural networks is presented in references [25, 26].

The normalized feature vector components are the neural network inputs. The neural network targets are the azimuth and elevation angles, normalized to the interval [0, 1] for azimuth and [-1, 1] for elevation.

The neural networks are trained with features extracted from measured sounds, not with HRTFs [9] or with features obtained by convolving different signals with HRTFs [6]. The reason is that the influence of the measurement signal level on sound localization, reported by Jost [27] for artificial heads and by Vliegen [28] for human beings, is not captured completely by the auralization process; moreover, the background noise effect is either ignored or assumed constant. To avoid these problems, it is more realistic to extract the features from experimental data for the neural network training.
Figures 8 to 10 show the initial training set for the ISLD calculated with the ERB, Lyon and Third Octave Bank Filters, respectively. The initial training set contained 79.7% of the database elements (1361 directions) and the test set contained 20.3% (347 directions).
Figure 5 Frequency Response of ERB Bank Filter
Figure 6 Frequency Response of Lyon Bank Filter
Figure 7 Frequency Response of Third Octave Bank Filter
Figure 8 Initial Training Set – ISLD calculated using ERB Bank Filter
Figure 9 Initial Training Set – ISLD calculated using Lyon Bank Filter
Figure 10 Initial Training Set – ISLD calculated using Third Octave Bank Filter
It is important to analyze the difference between the training sets. Figure 11 shows the absolute difference between the initial training sets whose inputs were calculated using the ERB and Lyon Bank Filters. From input number 10 onward the differences are significant. The periodic structure is due to the nature of the sounds (broadband sounds). After the feature extraction process the inputs are similar; however, for the non-normalized ISLD, differences of around 5 [dB] were found for some directions [21].

Figure 11 Absolute differences between initial training sets calculated using ERB and Lyon Bank Filter

The neural networks studied were three-layer feedforward nets with 20 neurons in the input layer, 13 neurons in the hidden layer and 2 neurons in the output layer (20-13-2), plus 20-15-2 and 20-17-2, with sigmoidal hyperbolic tangent (tansig), sigmoidal logarithmic (logsig) and linear (purelin) transfer functions, respectively. Configurations 20-13-2, 20-15-2 and 20-17-2 with sigmoidal logarithmic (logsig), sigmoidal hyperbolic tangent (tansig) and linear (purelin) transfer functions, respectively, were also studied.
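For concreteness, the 20-15-2 tansig-logsig-purelin configuration could be written as the following Keras sketch. The authors used the Matlab Neural Network toolbox, so this is only an illustration: tanh, sigmoid and linear stand in for tansig, logsig and purelin, and reading "20-15-2" as three processing layers fed by the 20 inputs is an interpretation:

```python
import tensorflow as tf

# Illustrative sketch of the 20-15-2 tansig-logsig-purelin net.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                      # normalized feature vector
    tf.keras.layers.Dense(20, activation="tanh"),     # tansig
    tf.keras.layers.Dense(15, activation="sigmoid"),  # logsig
    tf.keras.layers.Dense(2, activation="linear"),    # purelin: azimuth, elevation
])
model.compile(optimizer="adam", loss="mse")  # trainbfg (BFGS) has no Keras analogue
```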
Each configuration was trained three independent times for each training set (inputs calculated with the different filter banks), with random weight initialization.

The training algorithms used in this investigation are presented in Table 2.

Algorithm                                        Name
Quasi-Newton Backpropagation                     trainbfg
Scaled Conjugate Gradient                        trainscg
Conjugate Gradient with Powell/Beale Restarts    traincgb
Fletcher-Powell Conjugate Gradient               traincgf
Polak-Ribière Conjugate Gradient                 traincgp
One-Step Secant                                  trainoss

Table 2 Training algorithms

7. ERROR MEASURES

A global error measure [20] is described by equations 10 to 13.
$$E^{(TR,TE)}(\theta,\varphi) = \left| y^{(TR,TE)}(\theta,\varphi) - t^{(TR,TE)}(\theta,\varphi) \right| \quad (10)$$

$$Ec^{(TR,TE)}(\theta,\varphi) = \frac{1}{H} \sum_{i=1}^{H} E_i^{(TR,TE)}(\theta,\varphi) \quad (11)$$

$$Ep^{(TR,TE)} = \frac{1}{2} \left[ Et^{(TR,TE)}(\theta) + Et^{(TR,TE)}(\varphi) \right] \quad (12)$$

$$Eg = \frac{1}{2} \left[ Ep^{(TR)} + Ep^{(TE)} \right] \quad (13)$$

where:
$E$: mean localization error for $\theta$ and $\varphi$ in training (TR) or test (TE).
$\theta$: azimuth angle.
$\varphi$: elevation angle.
$y$: neural network output for $\theta$ and $\varphi$.
$t$: target for $\theta$ and $\varphi$.
$H$: number of trainings, H = 3.
$Ec$: average error of the net configuration for $\theta$ and $\varphi$ in training (TR) or test (TE).
$Ep$: total error in training (TR) or test (TE) of the net configuration.
$Eg$: global error.
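A sketch of eqs. (10)-(13) as reconstructed above; the array layout, and reading Et as the mean of Ec over the directions of a set, are assumptions:

```python
import numpy as np

def global_error(E_train, E_test):
    """E_* hold absolute errors |y - t| with shape (H, N, 2):
    H = 3 training runs, N directions, (azimuth, elevation)."""
    def ep(E):
        ec = E.mean(axis=0)           # eq. (11): average over the H runs
        et = ec.mean(axis=0)          # assumed Et: mean over directions
        return 0.5 * (et[0] + et[1])  # eq. (12)
    return 0.5 * (ep(E_train) + ep(E_test))  # eq. (13)
```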
Another error measure, used to obtain the surface of configurations, is the average Euclidean distance between the neural network output and the target:

$$D = \frac{1}{N} \sum_{i=1}^{N} \sqrt{ \left( y_\theta - t_\theta \right)_i^2 + \left( y_\varphi - t_\varphi \right)_i^2 } \quad (14)$$

where N is the number of elements.
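Eq. (14) transcribes directly to NumPy (an illustrative sketch):

```python
import numpy as np

def avg_euclidean_deg(y, t):
    """Eq. (14): average Euclidean distance in degrees per direction;
    y and t are (N, 2) arrays of (azimuth, elevation)."""
    return float(np.mean(np.linalg.norm(y - t, axis=1)))
```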
8. RESULTS AND DISCUSSION

A heuristic search of the configuration was made. The first test focused on the number of neurons in the hidden layer and the transfer functions of the input and hidden layers. Figures 12 to 17 present the global error for configurations with tansig-logsig-purelin and logsig-tansig-purelin transfer functions.

For this problem, neural nets with tansig-logsig-purelin transfer functions always present better performance than neural nets with logsig-tansig-purelin transfer functions, independently of the number of neurons in the hidden layer and of the training algorithm.

The increase in performance is not significant when the number of neurons in the hidden layer grows. A greater number of neurons in the hidden layer makes the weight surface more complex, with many local minima, and consequently makes the convergence of the training process more difficult and increases the computational cost [25, 26, 29]. Subsequent analysis was therefore implemented using the 20-15-2 tansig-logsig-purelin configuration.

In order to reduce the error, the test set elements with large errors were included in the training set.

Figures 18, 19 and 20 show the global error of the 20-15-2 tansig-logsig-purelin configuration for different training and test sets. The number of elements in the training and test sets is expressed as a percentage of the total number of database elements (1708).
Figure 12 Global Error – ISLD calculated using ERB Bank Filter – Neural Net Configuration with tansig-logsig-purelin transfer functions.
Figure 13 Global Error – ISLD calculated using ERB Bank Filter – Neural Net Configuration with logsig-tansig-purelin transfer functions.
Figure 14 Global Error – ISLD calculated using Lyon Bank Filter – Neural Net Configuration with tansig-logsig-purelin transfer functions.
Figure 15 Global Error – ISLD calculated using Lyon Bank Filter – Neural Net Configuration with logsig-tansig-purelin transfer functions.
Figure 16 Global Error – ISLD calculated using Third Octave Bank Filter – Neural Net Configuration with tansig-logsig-purelin transfer functions.
Figure 17 Global Error – ISLD calculated using Third Octave Bank Filter – Neural Net Configuration with logsig-tansig-purelin transfer functions.
Figure 18 Global error – ISLD calculated using ERB Bank Filter – 20-15-2 tansig-logsig-purelin for different
Training/Test set percentages.
Figure 19 Global error – ISLD calculated using Lyon Bank Filter – 20-15-2 tansig-logsig-purelin for different
Training/Test set percentages.
Figure 20 Global error – ISLD calculated using Third Octave Bank Filter – 20-15-2 tansig-logsig-purelin for
different Training/Test set percentages
The reference for the analysis of the training/test set element percentages was the training and test sets calculated with the ERB Bank Filter. In the first modification, 18 test set elements were included in the ERB training set; for the Lyon and Third Octave sets, 18 test elements were likewise moved into the respective training sets. However, the elements included in each training set were totally different, since the largest errors in the neural net response vary according to the training set. This phenomenon is due to the ISLD, specifically because the central frequencies of the filters are different and consequently the ISLD gives better information for certain directions. In this way the results are comparable in terms of the training/test set element percentages.

When the number of training set elements is increased, the global error diminishes, but beyond four modifications the improvement is no longer significant.

The best training algorithm for this problem is Quasi-Newton Backpropagation (trainbfg); see Figures 12 to 20.
The space of configurations is a high-dimensional space composed of the neural network parameters (neurons, layers, transfer functions, aggregation methods, connections, etc.). In this paper a surface of configurations is constructed using all the configurations studied, the training algorithms and an error measure. To obtain the best configuration, the argument that minimizes this surface is found. The error measure used is the average Euclidean distance (eq. 14).
Figures 21 and 22 show the surface of configurations for the test set and training set, respectively; Figure 23 shows the surface of configurations considering the test and training sets as a single set. (See Appendix A for the neural net configuration descriptions.)

Table 3 presents the parameters and characteristics of the best model.

Neurons:                  20-15-2
Transfer functions:       sigmoidal hyperbolic tangent (tansig), sigmoidal logarithmic (logsig), linear (purelin)
Inputs (feature vector):  ISLD calculated using ERB Bank Filter (18 filters), ILD, ITD
Training set:             1433 directions (total = 1708)
Test set:                 275 directions (total = 1708)
Training algorithm:       Quasi-Newton Backpropagation
Error measure (average Euclidean distance) [°/direction]:
                          Test set 3.9; Training set 1.5; Training + Test set 2.3

Table 3 Parameters and characteristics of the best model

Table 4 presents the median of the average Euclidean distances for the configurations trained with the ERB, Lyon and Third Octave training sets (42 configurations for each filter bank). This confirms that the best algorithm for this problem is Quasi-Newton Backpropagation. On the other hand, the performance of the ERB and Lyon configurations is better than that of the Third Octave configurations. The performance difference between the ERB and Lyon configurations is not significant; from this we can assert the robustness of the neural network for solving the broadband sound localization problem: the inputs are different (see Figure 11) and the final results are similar.

A sensitivity analysis of the best configuration (see Table 3) was implemented. It consists in analyzing how many inputs to the neural network deliver results similar to the best configuration, and how the error varies as the number of inputs diminishes.

Each configuration was trained three independent times with random weight initialization.
Training     ERB Bank Filter          Lyon Bank Filter         Third Octave Bank Filter
Algorithm    Test  Train  Tr+Te       Test  Train  Tr+Te       Test   Train  Tr+Te
trainbfg    11.15   2.04   3.82      11.20   2.37   4.06      13.12   3.55   5.37
trainscg    11.97   5.23   6.54      11.61   5.59   6.75      14.37   7.58   8.89
traincgb    12.75   5.75   7.10      13.01   6.77   7.96      14.69   8.71   9.85
traincgf    12.73   6.66   7.83      12.39   7.17   8.18      14.98   9.74  10.76
traincgp    13.47   7.52   8.66      13.46   7.98   9.02      16.05  10.58  11.63
trainoss    15.80  10.21  11.28      15.17  11.16  11.94      17.66  13.81  14.56

All values are medians of the average Euclidean distance in [°/(direction·configuration)].

Table 4 Median of the average Euclidean distances for configurations trained using the ERB, Lyon and Third Octave Bank Filters
Figure 21 Surface of Configurations for Test Set – All configurations.
Figure 22 Surface of Configurations for Training Set – All configurations
Figure 23 Surface of Configurations for Training and Test Set as a single set – All configurations.
Figure 24 Sensitivity analysis – Median of Average Euclidean error vs number of inputs
The inputs of the first training were the ITD, ILD and the ISLD calculated using 17 ERB filters; the second training used the ITD, ILD and the ISLD calculated using 16 ERB filters, and so on. The training with three inputs used the ITD, ILD and the ISLD calculated only with the greatest central frequency of the ERB filter bank (see Figure 5). The inputs of the last training were only the ITD and ILD. Figure 24 shows the median of the average Euclidean distance versus the number of inputs.
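This loop can be sketched as follows; `train_and_score`, `features` and `targets` are hypothetical placeholders, and the column layout follows eq. (9):

```python
import numpy as np

def train_and_score(x, t):
    """Hypothetical placeholder: train the 20-15-2 net three times
    and return the median average Euclidean distance."""
    raise NotImplementedError

def sensitivity_curve(features, targets):
    """ISLD bands are dropped from the low-frequency end (the highest
    ERB centre frequency is kept last), until only ITD and ILD remain."""
    errors = {}
    for kept in range(17, -1, -1):                    # 17 ISLD bands ... 0
        cols = list(range(18 - kept, 18)) + [18, 19]  # ISLD tail + ILD, ITD
        errors[kept + 2] = train_and_score(features[:, cols], targets)
    return errors
```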
9. CONCLUSIONS

This work described the design, implementation and training of a spatial sound localization model for broadband sound in an anechoic environment, based on interaural time, level and spectral level differences and implemented using neural networks.

A total of 804 configurations, including the sensitivity analysis, were studied.

The sensitivity analysis confirms that 20 inputs are necessary to obtain an error of approximately 4° on the test set. Considering the test and training sets as a single set, from 14 inputs onward the average error is around 4°; this result is comparable with the best performance of the Third Octave neural net configuration. However, this interesting coincidence needs more study: more tests are necessary to quantify the real influence of the ISLD components, also considering combinations of the inputs.

The signals for training and test were a logarithmic sweep and white noise of different lengths. The system architecture, due to the nature of the sounds, is invariant to the length of the signal.

The final model takes 1 [s] on a personal computer without exclusive dedication.

An error measure was presented; according to it, the increase in performance is not significant when the hidden layer has more than 15 neurons. The 20 neurons in the input layer initially came from a heuristic search, and the sensitivity analysis confirms this choice.

The best training algorithm for this problem is Quasi-Newton Backpropagation (trainbfg).
From Table 4, the performance of the ERB and Lyon configurations is better than that of the Third Octave configurations. The average performance difference between the ERB and Lyon configurations is not significant; from this we can assert the robustness of the neural network for solving the sound localization problem for broadband sound: with different inputs, the final results are similar.

The number of neural network inputs is smaller than in other works [7, 8] and, on the basis of the results obtained by the model, proves to be sufficient.

The average error of the final model, measured as the average Euclidean distance, is 2.3 [°/direction]. This figure does not include the experimental error, estimated at 3° [21]; in practice, the real estimated average error is 5.3°. The error is comparable to those presented by other authors in models and measurements of human sound localization for broadband sounds [30, 31, 12].

Neural networks offer a powerful tool for modeling, and for generating resolution methodologies, for non-unique problems or for problems whose mathematical model, in the classic sense, is very complex or has not yet been formulated.

The principal limitations of the presented model are that it was tested only with broadband sound and only in an anechoic environment. Future work will address these limitations and include a more accurate sensitivity analysis to quantify the real influence of each input.
10. ACKNOWLEDGEMENTS

This work was supported by the Fondo de Desarrollo de la Investigación FONDEI of the Universidad Tecnológica de Chile.
11. REFERENCES

[1] J. Blauert, "Spatial Hearing: The Psychophysics of Human Sound Localization", MIT Press, 2nd ed., Cambridge, MA, USA, 1983.

[2] V. Algazi, P. Diventi, R. Duda, "Subject dependent Transfer Function in Spatial Hearing", in Proc. 1997 IEEE MWSCAS Conference, 1997.

[3] S. Carlile, D. Pralong, "The location-dependent nature of perceptually salient features of the human head-related transfer functions", J. Acoust. Soc. Am., Vol. 95, No. 6, pp. 3445-3459, 1994, June.

[4] V. Algazi, R. Duda, D. Thompson, C. Avendano, "The CIPIC HRTF Database", Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 99-102, Mohonk Mountain House, New Paltz, NY, 2001, October 21-24.

[5] E. M. Wenzel, M. Arruda, D. J. Kistler, F. L. Wightman, "Localization using nonindividualized head-related transfer functions", J. Acoust. Soc. Am., Vol. 94, No. 1, pp. 111-123, 1993, July.

[6] T. R. Anderson, J. A. Janko, R. H. Gilkey, "An artificial neural network model of human sound localization", J. Acoust. Soc. Am., Vol. 92, No. 4, p. 2298, 1992, October.

[7] J. Backman, M. Karjalainen, "Modelling of human directional and spatial hearing using neural networks", in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'93), Vol. 1, pp. 125-128, Minneapolis, Minnesota, USA, 1993, April.

[8] C. Jin, M. Schenkel, S. Carlile, "Neural system identification model of human sound localization", J. Acoust. Soc. Am., Vol. 108, No. 3, Pt. 1, pp. 1215-1235, 2000, September.

[9] D. Nandy, J. Ben-Arie, "Neural models for auditory localization based on spectral cues", Neurological Research, Vol. 23, No. 5, pp. 489-500, 2000, July.

[10] R. Venegas, M. Lara, R. Correa, S. Floody, "Medición de HRTF y extracción de características significativas para localización sonora", Semacus 2005, Universidad Pérez Rosales, 2005, October.

[11] D. Begault, "3-D Sound for Virtual Reality and Multimedia", NASA Ames Research Center, Moffett Field, California, 2000.
[12] K. Martin, "A computational model of spatial hearing", MSc. dissertation, Massachusetts Institute of Technology (MIT), 1995.

[13] S. Roffler, R. Butler, "Factors that influence the localization of sound in the vertical plane", J. Acoust. Soc. Am., Vol. 43, No. 6, pp. 1255-1259, 1968.

[14] R. D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, H. Allerhand, "Complex sounds and auditory images", in Auditory Physiology and Perception, (Eds.) Cazals, Y., Demany, K., and Horner, K., Pergamon, Oxford, 1992.

[15] M. Slaney, "An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank", Apple Technical Report No. 35, Advanced Technology Group, Apple Computer, Inc., Cupertino, CA, 1993.

[16] M. Slaney, "Auditory Toolbox Version 2", Technical Report #1998-010, Interval Research Corporation, 1998.

[17] R. F. Lyon, "A computational model of filtering, detection, and compression in the cochlea", in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Paris, France, 1982, May.

[18] M. Slaney, "Lyon's Cochlear Model", Technical Report #13, Apple Computer, Inc., 1988.

[19] IEC 61260, Electroacoustics – Octave-band and fractional octave-band filters, 1995.

[20] R. Venegas, M. Lara, R. Correa, S. Floody, "Modelo de localización sonora espacial 2D para sonidos de banda ancha", EMFIMIN 2005, Universidad de Santiago, 2005, November.

[21] R. Venegas, M. Lara, R. Correa, S. Floody, "Diseño, Implementación y Entrenamiento de un sistema de Identificación Sonora Espacial Neurodifuso", Technical Report, Proyecto FONDEI, Universidad Tecnológica de Chile, 2005.

[22] L. A. Jeffress, "A place theory of sound localization", J. Comp. Physiol. Psychol., Vol. 41, pp. 35-39, 1948.

[23] W. Lindemann, "Extension of a binaural cross-correlation model by contralateral inhibition. I. Simulation of lateralization by contralateral inhibition", J. Acoust. Soc. Am., Vol. 80, pp. 1608-1622, 1986, December.

[24] A. Kulkarni, H. S. Colburn, "Role of spectral detail in sound-source localization", Nature, Vol. 396 (6713), pp. 747-749, 1998, December.

[25] N. Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", A Bradford Book, MIT Press, Cambridge, MA, 1998.

[26] M. Gupta, L. Jin, N. Homma, "Static and Dynamic Neural Networks", IEEE Press, John Wiley and Sons, 2003.

[27] A. Jost, D. Begault, "Observed Effects of HRTF measurement signal level", presented at the AES 21st International Conference, St. Petersburg, Russia, 2002, June.

[28] J. Vliegen, J. Van Opstal, "The influence of duration and level on human sound localization", J. Acoust. Soc. Am., Vol. 115, No. 4, pp. 1705-1713, 2004, April.

[29] S. Lawrence, C. L. Giles, A. C. Tsoi, "What Size Neural Network Gives Optimal Generalization? Convergence Properties of Backpropagation", Technical Report UMIACS-TR-96-22 and CS-TR-3617, Institute for Advanced Computer Studies, University of Maryland, 1996.

[30] R. Butler, "Monaural and binaural localization of noise bursts vertically in the median sagittal plane", J. Aud. Res., Vol. 3, pp. 245-246, 1969.

[31] J. Hebrank, D. Wright, "Are two ears necessary for median plane localization?", J. Acoust. Soc. Am., Vol. 56, pp. 935-938, 1974, September.
12. APPENDIX A

Table 5 shows all the neural network configurations and their parameters. Each training process is considered a different configuration because random weight initializations were used.
Index (ERB / Lyon / Third Octave)   Neurons   Transfer Function        % Training / Test Elements
1:3   / 43:45 / 85:87               20-13-2   tansig-logsig-purelin    79.7 / 20.3
4:6   / 46:48 / 88:90               20-13-2   logsig-tansig-purelin    79.7 / 20.3
7:9   / 49:51 / 91:93               20-15-2   tansig-logsig-purelin    79.7 / 20.3
10:12 / 52:54 / 94:96               20-15-2   logsig-tansig-purelin    79.7 / 20.3
13:15 / 55:57 / 97:99               20-17-2   tansig-logsig-purelin    79.7 / 20.3
16:18 / 58:60 / 100:102             20-17-2   logsig-tansig-purelin    79.7 / 20.3
19:21 / 61:63 / 103:105             20-15-2   tansig-logsig-purelin    80.9 / 19.1
22:24 / 64:66 / 106:108             20-15-2   tansig-logsig-purelin    81.6 / 18.4
25:27 / 67:69 / 109:111             20-15-2   tansig-logsig-purelin    83.3 / 16.7
28:30 / 70:72 / 112:114             20-15-2   tansig-logsig-purelin    83.9 / 16.1
31:33 / 73:75 / 115:117             20-15-2   tansig-logsig-purelin    84.2 / 15.8
34:36 / 76:78 / 118:120             20-15-2   tansig-logsig-purelin    84.5 / 15.5
37:39 / 79:81 / 121:123             20-15-2   tansig-logsig-purelin    84.8 / 15.2
40:42 / 82:84 / 124:126             20-15-2   tansig-logsig-purelin    85.2 / 14.8

Indices 1:42 use the ERB Bank Filter, 43:84 the Lyon Bank Filter and 85:126 the Third Octave Bank Filter for the ISLD calculation.

Table 5 Neural network configurations and their parameters (related to the x axis in Figures 21 to 23). Parameters of the sensitivity analysis configurations are not included.