Computational Statistics and Data Analysis 56 (2012) 854–873
Consistency of support vector machines using additive kernels for
additive models
Andreas Christmann, Robert Hable ∗
University of Bayreuth, Department of Mathematics, D-95440 Bayreuth, Germany
Article info
Article history:
Available online 19 April 2011
Keywords:
Support vector machine
SVM
Additive model
Consistency
Robustness
Kernel
Abstract
Support vector machines (SVMs) are special kernel based methods and have been among
the most successful learning methods for more than a decade. SVMs can informally be
described as kinds of regularized M-estimators for functions and have demonstrated
their usefulness in many complicated real-life problems. During the last few years a
great part of the statistical research on SVMs has concentrated on the question of how
to design SVMs such that they are universally consistent and statistically robust for
nonparametric classification or nonparametric regression purposes. In many applications,
some qualitative prior knowledge of the distribution P or of the unknown function f to be
estimated is present or a prediction function with good interpretability is desired, such that
a semiparametric model or an additive model is of interest. The question of how to design
SVMs by choosing the reproducing kernel Hilbert space (RKHS) or its corresponding kernel
to obtain consistent and statistically robust estimators in additive models is addressed. An
explicit construction of such RKHSs and their kernels, which will be called additive kernels,
is given. SVMs based on additive kernels will be called additive support vector machines.
The use of such additive kernels leads, in combination with a Lipschitz continuous loss
function, to SVMs with the desired properties for additive models. Examples include
quantile regression based on the pinball loss function, regression based on the ϵ-insensitive
loss function, and classification based on the hinge loss function.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
Kernel methods such as support vector machines have been among the most successful learning methods for more than
a decade; see Schölkopf and Smola (2002), Cucker and Zhou (2007) and Steinwart and Christmann (2008). Examples include
classification or regression models where we have an input space X, an output space Y, some unknown probability measure
P on X × Y, and an unknown function f : X → R which describes the quantity of interest, e.g. the conditional quantile
curve, of the conditional distributions P(·|x), x ∈ X. Support vector machines can informally be described as kinds of
regularized M-estimators for functions (see e.g. Steinwart and Christmann (2008, p. 390f)), and have demonstrated their
usefulness in many complicated high-dimensional real-life problems. Besides several other nice features, one key argument
for using SVMs has been the so-called ‘‘kernel trick’’ (Schölkopf et al., 1998), which decouples the SVM optimization problem
from the domain of the samples, thus making it possible to use SVMs on virtually any input space X. This flexibility is in
strong contrast to the case for more classical learning methods from both machine learning and nonparametric statistics,
which almost always require input spaces X ⊂ Rd . As a result, kernel methods have been successfully used in various
application areas that were previously infeasible for machine learning methods. We mention just three examples. (i) SVMs
have been used to analyze histogram data and colored images, where the input values are probability measures; see, e.g.,
Hein and Bousquet (2005), Sriperumbudur et al. (2009) and Christmann and Steinwart (2010). (ii) SVMs have been used for
text classification and web mining; see, e.g., Joachims (2002) and Lafferty and Lebanon (2005). (iii) SVMs based on kernels
from computational biology
have been used for trees and graphs; see the book by Schölkopf et al. (2004).

For a data set $D_n = \big((x_1, y_1), \ldots, (x_n, y_n)\big)$, the empirical SVM is defined as
$$f_{L,D_n,\lambda} := \arg\inf_{f \in H} \frac{1}{n}\sum_{i=1}^{n} L\big(x_i, y_i, f(x_i)\big) + \lambda \|f\|_H^2, \qquad (1)$$
where ∥·∥H denotes the Hilbert space norm of H. That is, SVMs are based on three key components: (i) a convex loss function
L : X × Y × R → [0, ∞) used to measure the quality of the prediction f (x), (ii) a reproducing kernel Hilbert space (RKHS)
H of functions f : X → R used to specify the set of functions over which the expected loss is minimized, and (iii) the
regularization term λ ∥f ∥2H used to reduce the danger of overfitting and to guarantee the existence of a unique SVM even if L
is not strictly convex. The RKHS is often implicitly defined by specifying a kernel k : X × X → R. Details about the definition
of SVMs and some examples will be given in Section 2.
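To make (1) concrete, the following R sketch (ours, for illustration only) evaluates the regularized empirical risk for a candidate function written as a kernel expansion f = Σ_i α_i k(·, x_i); by the reproducing property, f(x_j) is then the j-th entry of Kα and ∥f∥²_H = αᵀKα, where K is the kernel matrix, and the loss is assumed not to depend on x:

# Regularized empirical risk in (1) for f = sum_i alpha_i k(., x_i):
# f(x_j) = (K alpha)_j and ||f||_H^2 = alpha' K alpha.
reg_emp_risk <- function(alpha, K, y, loss, lambda) {
  f_values <- as.vector(K %*% alpha)                      # f(x_1), ..., f(x_n)
  mean(loss(y, f_values)) + lambda * drop(t(alpha) %*% K %*% alpha)
}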
During the last few years a great part of the statistical research on SVMs has concentrated on the central question of how
to choose the loss function L, the RKHS H or its kernel k, and sequences of regularization parameters λn to guarantee that
SVMs are universally consistent and statistically robust for classification and regression purposes. In a nutshell, it turned
out in a purely nonparametric setup that SVMs based on the combination of a Lipschitz continuous loss function and a
bounded continuous kernel with a dense and separable RKHS are universally consistent with desirable statistical robustness
properties for any probability measure P from which we observed the data set; see, e.g., Steinwart and Christmann (2008)
and Christmann et al. (2009) for details. Here we only mention the combination of a Gaussian RBF kernel with the Lipschitz
continuous pinball loss function for nonparametric quantile regression; see Sections 2 and 4.
Although a nonparametric approach is often the best choice in practice due to the lack of prior knowledge on P, a
semiparametric approach or an additive model (Friedman and Stuetzle, 1981; Hastie and Tibshirani, 1990) can also be
valuable. Sometimes prior knowledge on P is known from previous experiments or from some theory specifying in some
sense how the data set was generated. Another example is that we may be interested, for practical reasons, only in functions
f which offer a nice interpretation because an interpretable prediction function can be crucial if the prediction f (x) has to be
explicable to clients. This can be the case, for example, if the prediction is the expected claim amount of a client and these
predictions are the basis for the construction of an insurance tariff.
Here we will mainly consider additive models, although models with a multiplicative structure or even additive and
multiplicative components can also be of interest and can be treated in an analogous manner. More precisely, for some s ∈ N, the input space X is split up into s non-empty spaces according to
$$X = X_1 \times \cdots \times X_s \qquad (2)$$
and only additive functions f : X → R of the form
$$f(x_1, \ldots, x_s) = f_1(x_1) + \cdots + f_s(x_s), \qquad x_j \in X_j,$$
are considered, where fj : Xj → R for j ∈ {1, . . . , s}.
To the best of our knowledge, there are currently no results on consistency and statistical robustness published on SVMs
based on kernels designed for such additive models. Of course, one can use one of the purely nonparametric SVMs described
above, but the hope is that SVMs based on kernels especially designed for such situations may offer better results.
In this paper we therefore address the question of how to design specific SVMs for additive models. The main goal of this
paper is to give an explicit construction principle for kernels – and thus for their RKHSs – and to show that such SVMs are
consistent under weaker assumptions than the usual assumptions of purely nonparametric SVMs. Furthermore, we mention
that the combination of a bounded additive kernel with a Lipschitz continuous loss function leads to statistically robust SVMs
for additive models. Examples include SVMs in additive models for quantile regression based on the pinball loss function,
for regression based on the ϵ-insensitive loss function, and for classification based on the hinge loss function.
The rest of the paper is organized as follows. In Section 2 we collect some known results on loss functions, kernels and
their RKHSs, and on support vector machines. These results are needed to state our results on consistency and statistical
robustness of SVMs for additive models in Section 3. Our main theoretical result is Theorem 3 which shows consistency
of SVMs based on additive kernels. Although results on the rates of convergence or the computation of the corresponding
constants for these rates are far beyond the scope of this paper, our numerical examples given in Section 4.1 will demonstrate
that SVMs based on kernels designed for additive models can outperform standard nonparametric SVMs if the assumption
of an additive model is valid. Section 4.2 contains a small simulation study comparing SVMs based on additive kernels for
quantile regression with a recently proposed boosting approach. In Section 4.3, we apply both approaches to a real-life data
set and compare the results. Section 5 contains the discussion. Some details on how to compute SVMs on the basis of the
pinball loss function for additive models are given in Appendix A. All proofs are given in Appendix B.
2. Background on support vector machines
Let X be a complete separable metric space and let Y be a closed subset of R. We will always use the corresponding
Borel-σ -algebras. The set of all probability measures on the Borel-σ -algebra of X × Y is denoted by M1 (X × Y). The
random input variables X1 , . . . , Xn take their values in X and the random output variables Y1 , . . . , Yn take their values in Y.
It is assumed that (X1 , Y1 ), . . . , (Xn , Yn ) are independent and identically distributed according to some unknown probability
measure P ∈ M1 (X × Y). Since Y ⊂ R is closed, P can be split into the marginal distribution PX on X and the conditional
distribution P(· | x) of Y given X = x. We mention that this covers of course the special case where Yi given Xi = xi is
distributed with median f (xi ) and heteroscedastic error term ϵi g (xi ), where f and g are unknown functions from X to R.
The goal is to find a good prediction function f : X → R such that f (x) estimates the quantity of interest of the conditional
distributions of Y given X = x (e.g. the conditional quantile or mean). The quality of an estimate t = f (x) is measured
through a loss function
$$L : X \times Y \times \mathbb{R} \to [0, \infty), \qquad (x, y, t) \mapsto L(x, y, t).$$
It is assumed that L is measurable and L(x, y, y) = 0 for every (x, y) ∈ X × Y—that is, the loss is zero if the prediction t equals the actual value y of the output variable. In addition, we make the standard assumption that
$$L(x, y, \cdot) : \mathbb{R} \to [0, \infty), \qquad t \mapsto L(x, y, t)$$
is convex for every (x, y) ∈ X × Y and that additionally the following uniform Lipschitz property is fulfilled for some real number |L|1 ∈ (0, ∞):
$$\sup_{(x,y) \in X \times Y} |L(x, y, t) - L(x, y, t')| \le |L|_1 \cdot |t - t'| \qquad \forall t, t' \in \mathbb{R}. \qquad (3)$$
We restrict our attention to such Lipschitz continuous loss functions because the use of loss functions which are not Lipschitz
continuous (such as the least squares loss which is only locally Lipschitz continuous on unbounded domains) usually conflicts
with robustness; see, e.g., Steinwart and Christmann (2008, Section 10.4).
The quality of a (measurable) predictor f : X → R is measured through the risk
$$\mathcal{R}_{L,P}(f) = \int_{X \times Y} L\big(x, y, f(x)\big)\, P\big(d(x, y)\big).$$
By different choices of Y and the loss function L, different purposes are covered by this setup—e.g. binary classification for Y = {−1; +1} and the hinge loss
$$L_{\mathrm{hinge}}(x, y, t) := \max\{0, 1 - yt\},$$
regression for Y = R and, for ϵ > 0, the ϵ-insensitive loss
$$L_\epsilon(x, y, t) := \max\{0, |y - t| - \epsilon\},$$
and quantile regression for Y = R and the pinball loss
$$L_\tau(x, y, t) := \begin{cases} (\tau - 1)(y - t), & \text{if } y - t < 0, \\ \tau (y - t), & \text{if } y - t \ge 0, \end{cases} \qquad (4)$$
where τ ∈ (0, 1).
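For illustration, these three loss functions can be transcribed directly into R (our sketch, vectorized over y and t; the argument names simply mirror the formulas above):

hinge_loss <- function(y, t) pmax(0, 1 - y * t)                        # Y = {-1, +1}
eps_insensitive_loss <- function(y, t, eps) pmax(0, abs(y - t) - eps)  # eps > 0
pinball_loss <- function(y, t, tau) ifelse(y - t < 0, (tau - 1) * (y - t), tau * (y - t))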
An optimal predictor is a measurable function f∗ : X → R which attains the minimal risk, called the Bayes risk,
$$\mathcal{R}^*_{L,P} = \inf_{\substack{f : X \to \mathbb{R} \\ \text{measurable}}} \mathcal{R}_{L,P}(f). \qquad (5)$$
The optimal predictor in a set F of measurable functions f : X → R is a function f∗ ∈ F which attains the minimal risk
$$\mathcal{R}^*_{L,P,\mathcal{F}} = \inf_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f).$$
For example, the goal of quantile regression is to estimate a conditional quantile function, i.e., a function f*τ,P : X → R such that
$$P\big((-\infty, f^*_{\tau,P}(x)] \,\big|\, x\big) \ge \tau \quad \text{and} \quad P\big([f^*_{\tau,P}(x), \infty) \,\big|\, x\big) \ge 1 - \tau$$
for the quantile τ ∈ (0, 1). If f*τ,P ∈ F, then the conditional quantile function f*τ,P attains the minimal risk R*_{Lτ,P,F} for the pinball loss Lτ (with parameter τ), so quantile regression can be done by trying to minimize the risk R_{Lτ,P} in F.
One way to build a nonparametric predictor f is to use a support vector machine
$$f_{L,P,\lambda} := \arg\inf_{f \in H} \mathcal{R}_{L,P}(f) + \lambda \|f\|_H^2, \qquad (6)$$
where H is a reproducing kernel Hilbert space (RKHS) of a measurable kernel k : X × X → R, and λ > 0 is a regularization
parameter used to reduce the danger of overfitting; see e.g., Vapnik (1998), Schölkopf and Smola (2002) or Steinwart and
Christmann (2008) for details.
The reproducing property of k states that, for all f ∈ H and all x ∈ X,
$$f(x) = \langle f, \Phi(x)\rangle_H,$$
where Φ : X → H, x ↦ k(·, x), denotes the canonical feature map. A kernel k is called bounded if
$$\|k\|_\infty := \sup_{x \in X} \sqrt{k(x, x)} < \infty.$$
Using the reproducing property and $\|\Phi(x)\|_H = \sqrt{k(x, x)}$, we obtain the well-known inequalities
$$\|f\|_\infty \le \|k\|_\infty \|f\|_H \qquad (7)$$
and
$$\|\Phi(x)\|_\infty \le \|k\|_\infty \|\Phi(x)\|_H \le \|k\|_\infty^2 \qquad (8)$$
for all f ∈ H and all x ∈ X. We mention, as an example of a bounded kernel, the popular Gaussian radial basis function (GRBF) kernel defined by
$$k_\gamma(x, x') = \exp\big(-\gamma^{-2} \|x - x'\|_{\mathbb{R}^d}^2\big), \qquad x, x' \in X, \qquad (9)$$
where γ is some positive constant and X ⊂ Rd. This kernel leads to a large RKHS which is dense in L1(µ) for all probability measures µ on Rd. We will also consider the polynomial kernel
$$k_{m,c}(x, x') = \big(\langle x, x'\rangle_{\mathbb{R}^d} + c\big)^m, \qquad x, x' \in X,$$
where m ∈ (0, ∞), c ∈ (0, ∞) and X ⊂ Rd. The dot kernel is a special polynomial kernel with c = 0 and m = 1. The polynomial kernel is bounded if and only if X is bounded.
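For concreteness, a small base R sketch (ours, illustration only) of the GRBF kernel (9), the polynomial kernel, and the corresponding kernel matrix for a sample stored row-wise in a matrix X:

grbf_kernel <- function(x, xp, gamma) exp(-sum((x - xp)^2) / gamma^2)  # eq. (9)
poly_kernel <- function(x, xp, m, c) (sum(x * xp) + c)^m               # polynomial kernel
kernel_matrix <- function(X, kern, ...) {
  # plain double loop for clarity; vectorize for larger samples
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) K[i, j] <- kern(X[i, ], X[j, ], ...)
  K
}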
Of course, the regularized risk
$$\mathcal{R}^{\mathrm{reg}}_{L,P,\lambda}(f) := \mathcal{R}_{L,P}(f) + \lambda \|f\|_H^2$$
is in general not computable, because P is unknown. However, the empirical distribution
$$\mathrm{D}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(x_i, y_i)}$$
corresponding to the data set $D_n = \big((x_1, y_1), \ldots, (x_n, y_n)\big)$ can be used as an estimator of P. Here δ(xi,yi) denotes the Dirac distribution in (xi, yi). If we replace P by Dn in (6), we obtain the regularized empirical risk $\mathcal{R}^{\mathrm{reg}}_{L,\mathrm{D}_n,\lambda}(f)$ and the empirical SVM fL,Dn,λ. Furthermore, we need analogous notions where (xi, yi) is replaced by the random variables (Xi, Yi). Thus, we define
$$\mathrm{D}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(X_i, Y_i)}.$$
Then, for every ω ∈ Ω, Dn(ω) is the empirical distribution corresponding to the data set (X1(ω), Y1(ω)), . . . , (Xn(ω), Yn(ω)) and, accordingly, $\mathcal{R}^{\mathrm{reg}}_{L,\mathrm{D}_n,\lambda}(f)$ denotes the mapping Ω → R, ω ↦ $\mathcal{R}^{\mathrm{reg}}_{L,\mathrm{D}_n(\omega),\lambda}(f)$, and $f_{L,\mathrm{D}_n,\lambda}$ denotes the mapping Ω → H, ω ↦ $f_{L,\mathrm{D}_n(\omega),\lambda}$.
Support vector machines fL,P,λ need not exist for every probability measure P ∈ M1(X × Y), but they exist and are unique for every data set; for Lipschitz continuous loss functions it is sufficient for the existence of fL,P,λ that $\int L(x, y, 0)\, P\big(d(x, y)\big) < \infty$. This condition may be violated by heavy-tailed distributions P and, in this case, it is possible that RL,P(f) = ∞ for every f ∈ H.
In order to enlarge the applicability of support vector machines to heavy-tailed distributions, the following extension has been developed in Christmann et al. (2009). Following an idea already used by Huber (1967) for M-estimates in parametric models, a shifted loss function L∗ : X × Y × R → R is defined by
$$L^*(x, y, t) = L(x, y, t) - L(x, y, 0) \qquad \forall (x, y, t) \in X \times Y \times \mathbb{R}.$$
Then, define, similarly to the original loss function L, the L∗-risk by
$$\mathcal{R}_{L^*,P}(f) = \int L^*\big(x, y, f(x)\big)\, P\big(d(x, y)\big)$$
and the regularized L∗-risk by
$$\mathcal{R}^{\mathrm{reg}}_{L^*,P,\lambda}(f) = \mathcal{R}_{L^*,P}(f) + \lambda \|f\|_H^2$$
for every f ∈ H. In complete analogy to (6), we define the support vector machine based on the shifted loss function L∗ by
$$f_{L,P,\lambda} := \arg\inf_{f \in H} \mathcal{R}_{L^*,P}(f) + \lambda \|f\|_H^2. \qquad (10)$$
If the support vector machine fL,P,λ defined by (6) exists, we have seemingly defined fL,P,λ in two different ways now. However, the two definitions coincide in this case, and the following theorem summarizes some basic results of Christmann et al. (2009).
Theorem 1. Let L be a convex and Lipschitz continuous loss function and let k be a bounded kernel. Then, for every P ∈ M1(X × Y) and every λ ∈ (0, ∞), there exists a unique SVM fL,P,λ ∈ H which minimizes $\mathcal{R}^{\mathrm{reg}}_{L^*,P,\lambda}$, i.e.
$$\mathcal{R}_{L^*,P}(f_{L,P,\lambda}) + \lambda\|f_{L,P,\lambda}\|_H^2 = \inf_{f \in H} \mathcal{R}_{L^*,P}(f) + \lambda\|f\|_H^2.$$
If the support vector machine fL,P,λ defined by (6) exists, then the two definitions (6) and (10) coincide.
3. Support vector machines for additive models
3.1. The model and assumptions
As described in the previous section, the goal is to minimize the risk f → RL,P (f ) in a set F of functions f : X → R. In
this article, we assume an additive model. Accordingly, let
X = X1 × · · · × Xs
where X1 , . . . , Xs are non-empty sets. For every j ∈ {1, . . . , s}, let Fj be a set of functions fj : Xj → R. Then, we only
consider functions f : X → R of the form
$$f(x_1, \ldots, x_s) = f_1(x_1) + \cdots + f_s(x_s) \qquad \forall (x_1, \ldots, x_s) \in X_1 \times \cdots \times X_s$$
for f1 ∈ F1, . . . , fs ∈ Fs. Thus,
$$\mathcal{F} := \{ f_1 + \cdots + f_s : f_j \in F_j,\ 1 \le j \le s \}. \qquad (11)$$
In (11), we have identified fj with the map X → R, (x1, . . . , xs) → fj(xj).
Such additive models can be treated using support vector machines in a very natural way. For every j ∈ {1, . . . , s}, choose a kernel kj on Xj with RKHS Hj. Then, the space of functions
$$H := \{ f_1 + \cdots + f_s : f_j \in H_j,\ 1 \le j \le s \}$$
is an RKHS on X = X1 × · · · × Xs with additive kernel k = k1 + · · · + ks such that kj : Xj × Xj → R, j ∈ {1, . . . , s}, are kernels and
$$k\big((x_1, \ldots, x_s), (x'_1, \ldots, x'_s)\big) = k_1(x_1, x'_1) + \cdots + k_s(x_s, x'_s)$$
for every xj , x′j ∈ Xj , j ∈ {1, . . . , s}; see Theorem 2. In this way, SVMs can be used to fit additive models and SVMs enjoy at
least three appealing features. First, it is guaranteed that the predictor has the assumed additive structure (x1 , . . . , xs ) →
f1 (x1 ) + · · · + fs (xs ). Second, it is possible to still use the standard SVM machinery including the kernel trick (Schölkopf and
Smola, 2002, Section 2) and implementations of SVMs—just by selecting a kernel k = k1 + · · · + ks . Third, the possibility
of choosing different kernels k1 , . . . , ks offers a great flexibility. For example, take s = 2 and let k1 be a GRBF kernel on
Rd1 and k2 be a GRBF kernel on Rd2 . Since the RKHS of a Gaussian kernel is an infinite dimensional function space, we
get nonparametric estimates of f1 and f2 . As a second example, consider a semiparametric model with f = f1 + f2 where
f1 : x1 → f1 (x1 ) is assumed to be a polynomial function of order at most m and f2 : x2 → f2 (x2 ) may be some complicated
function. Then, this semiparametric model can be treated by simply taking a polynomial kernel on Rd1 for k1 and a GRBF
kernel on Rd2 for k2 . This can be used, for example, in order to model changes in space (for d1 ≤ 3 and x1 specifying the
location) or in time (for d1 = 1 and x1 specifying the point in time).
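Following the s = 2 examples just discussed, the construction can be sketched in a few lines of R (ours, illustration only; it reuses the kernel helpers from the Section 2 sketch and assumes the observations of the two parts are stored row-wise in matrices X1 and X2 with kernel parameters gamma1 and gamma2 chosen beforehand). The kernel matrix of the additive kernel is simply the sum of the blockwise kernel matrices:

# nonparametric additive kernel: GRBF on X1 plus GRBF on X2
K_add  <- kernel_matrix(X1, grbf_kernel, gamma = gamma1) +
          kernel_matrix(X2, grbf_kernel, gamma = gamma2)
# semiparametric variant: polynomial kernel of order 2 on X1, GRBF kernel on X2
K_semi <- kernel_matrix(X1, poly_kernel, m = 2, c = 1) +
          kernel_matrix(X2, grbf_kernel, gamma = gamma2)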
Theorem 2. For every j ∈ {1, . . . , s}, let Xj be a non-empty set and
$$k_j : X_j \times X_j \to \mathbb{R}, \qquad (x_j, x'_j) \mapsto k_j(x_j, x'_j),$$
be a kernel with corresponding RKHS Hj. Define k = k1 + · · · + ks. That is,
$$k\big((x_1, \ldots, x_s), (x'_1, \ldots, x'_s)\big) = k_1(x_1, x'_1) + \cdots + k_s(x_s, x'_s)$$
for every xj, x′j ∈ Xj, j ∈ {1, . . . , s}. Then, k is a kernel on X = X1 × · · · × Xs with RKHS
$$H := \{ f_1 + \cdots + f_s : f_j \in H_j,\ 1 \le j \le s \}$$
and the norm of H, given in (B.2), fulfills
$$\|f_1 + \cdots + f_s\|_H^2 \le \|f_1\|_{H_1}^2 + \cdots + \|f_s\|_{H_s}^2 \qquad \forall f_1 \in H_1, \ldots, f_s \in H_s. \qquad (12)$$
If not otherwise stated, we make the following assumptions throughout the rest of the paper although some of the results
are also valid under more general conditions.
Main assumptions.
(i) For every j ∈ {1, . . . , s}, the set Xj is a complete, separable metric space; kj is a continuous and bounded kernel on Xj
with RKHS Hj . Furthermore, k = k1 + · · · + ks denotes the kernel on X = X1 × · · · × Xs defined in Theorem 2 and H
denotes its RKHS.
(ii) The subset Y ⊂ R is closed.
(iii) The loss function L is convex and fulfills the uniform Lipschitz continuity (3) with Lipschitz constant |L|1 ∈ (0, ∞). In
addition, L(x, y, y) = 0 for every (x, y) ∈ X × Y.
Note that every closed subset of Rd is a complete, separable metric space. We restrict ourselves to Lipschitz continuous loss
functions and continuous and bounded kernels because it was shown earlier that these assumptions are necessary in order
to ensure good robustness properties; see e.g. Steinwart and Christmann (2008, Section 10). The condition L(x, y, y) = 0 is
quite natural and practically always fulfilled—it means that the loss of a correct prediction is 0. Our assumptions cover many
of the most interesting cases. In particular, the hinge loss (classification), the ϵ-insensitive loss (regression) and the pinball loss (quantile regression) fulfill all assumptions. Many commonly used kernels are continuous. In addition, the Gaussian kernel is always bounded, whereas the linear kernel and all polynomial kernels are bounded if and only if Xj is bounded. From the
assumption that the kernels kj are continuous and bounded on Xj , it follows that the kernel k = k1 + · · · + ks is continuous
and bounded on X.
3.2. Consistency
SVMs are called universally consistent if the risk of the SVM estimator fL,Dn,λn converges, for all probability measures P, in probability to the Bayes risk (see (5)), i.e.
$$\mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) \xrightarrow{\;P\;} \mathcal{R}^*_{L^*,P} \qquad (n \to \infty). \qquad (13)$$
In order to obtain universal consistency of SVMs, it is necessary to choose a kernel with a large RKHS. Accordingly for most
known results about universal consistency of SVMs, it is a crucial assumption that the RKHS is dense in C (X) where X is a
compact metric space (see e.g. Steinwart, 2001) or, at least, that the RKHS is dense in Lq (PX ) for some q ∈ [1, ∞). However,
this assumption is violated if we want to fit an additive model given by
$$\mathcal{F} = \{ f_1 + \cdots + f_s : f_j \in F_j,\ 1 \le j \le s \}$$
with correspondingly additive SVMs because the RKHS H = H1 + · · · + Hs , which is generated by the additive kernel, is not
dense in the whole space Lq (PX ) – apart from entirely degenerate cases. Therefore, existing results on universal consistency
are not applicable here. Nevertheless, additive SVMs are consistent in additive models as shown in Theorem 3. Instead of
the assumption, inappropriate in this context, that H is dense in the whole space Lq(PX), we only assume that Hj is dense in Fj for every j ∈ {1, . . . , s}.
As usual, $\mathcal{L}_q(\mu)$ denotes the set of all q-integrable real-valued functions with respect to some measure µ and $L_q(\mu)$ denotes the set of all equivalence classes in $\mathcal{L}_q(\mu)$.
We can now state our first main result.
Theorem 3. Let the main assumptions (i)–(iii) be valid. Let P ∈ M1(X × Y) be such that
$$H_j \subset F_j \subset L_1(P_{X_j}), \qquad 1 \le j \le s,$$
and let Hj be dense in Fj with respect to ∥ · ∥L1(PXj). Then, for every sequence (λn)n∈N ⊂ (0, ∞) such that limn→∞ λn = 0 and limn→∞ λ²n n = ∞,
$$\mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) \longrightarrow \mathcal{R}^*_{L^*,P,\mathcal{F}} \qquad (n \to \infty)$$
in probability.
In general, it is not clear whether convergence of the risks implies convergence of the SVM fL,Dn ,λn . However, the following
theorem will show such a convergence for quantile regression in an additive model—under the condition that the quantile
function fτ∗,P actually lies in F = F1 + · · · + Fs . In order to formulate this result, we define
$$d_0(f, g) = \int \min\{1, |f - g|\}\, dP_X,$$
where f , g : X → R are arbitrary measurable functions. It is known that d0 is a metric describing convergence in probability.
We can now state our second main result.
Theorem 4. Let the main assumptions (i)–(iii) be valid. Let P ∈ M1(X × Y) be such that
$$H_j \subset F_j \subset L_1(P_{X_j}) \qquad \forall j \in \{1, \ldots, s\}$$
and Hj is dense in Fj with respect to ∥ · ∥L1(PXj). Let τ ∈ (0, 1) and assume that the quantile function f*τ,P is PX-almost surely unique and that f*τ,P ∈ F. Then, for the pinball loss function L = Lτ and for every sequence (λn)n∈N ⊂ (0, ∞) such that limn→∞ λn = 0 and limn→∞ λ²n n = ∞,
$$d_0\big(f_{L,D_n,\lambda_n}, f^*_{\tau,P}\big) \longrightarrow 0 \qquad (n \to \infty)$$
in probability.
3.3. Robustness
During the last few years some general results on the statistical robustness properties of SVMs have been shown. Many
of these results are directly applicable to SVMs for additive models if the kernel is bounded and continuous (or at least
measurable) and the loss function is convex and Lipschitz continuous. We summarize some of the most important robustness
properties of additive SVMs in the following. For brevity we only give upper bounds for the bias and the Bouligand influence
function for SVMs, which are both even applicable for non-smooth loss functions like the pinball loss for quantile regression,
and refer the reader to Christmann et al. (2009) and Steinwart and Christmann (2008, Chap. 10) for results on the classical
influence function proposed by Hampel (1968, 1974) and to Hable and Christmann (2011) for the qualitative robustness of
SVMs.
Define the function
$$T : M_1(X \times Y) \to H, \qquad T(P) := f_{L,P,\lambda}, \qquad (14)$$
which maps each probability distribution to its SVM. In robust statistics we are interested in smooth and bounded functions
T , because this will give us stable SVMs within small neighborhoods of P. If an appropriately chosen derivative of T (P) is
bounded, then we expect the value of T (Q) to be close to the value of T (P) for distributions Q in a small neighborhood of P.
The next result shows that the H-norm of the difference of two SVMs increases with respect to the mixture proportion
ε ∈ (0, 1) at most linearly in gross-error neighborhoods. The norm of total variation of a signed measure µ is denoted by
∥µ∥M .
Theorem 5 (Bounds for Bias). If the main assumptions (i)–(iii) are valid, then we have, for all λ > 0, all ε ∈ [0, 1], and all probability measures P and Q on X × Y, that
$$\|T(Q) - T(P)\|_\infty \le c \cdot \|P - Q\|_M, \qquad (15)$$
$$\|T\big((1-\varepsilon)P + \varepsilon Q\big) - T(P)\|_\infty \le c\, \|P - Q\|_M \cdot \varepsilon, \qquad (16)$$
where $c = \frac{1}{\lambda}\|k\|_\infty^2 |L|_1$.
Because of (8), there are analogous bias bounds of SVMs with respect to the norm in H if we replace c by $\tilde{c} := \frac{1}{\lambda}\|k\|_\infty |L|_1$.
While Hampel’s influence function is related to a Gâteaux derivative which is linear, the Bouligand influence function is
related to the Bouligand derivative which needs only to be positive homogeneous. Because this weak derivative is less known
in statistics, we would like to recall its definition. Let E1 and E2 be normed linear spaces. A function f : E1 → E2 is called
positive homogeneous if f (α x) = α f (x) for all α ≥ 0 and for all x ∈ E1 . If U is an open subset of E1 , then a function f : U → E2
is called Bouligand differentiable at a point x0 ∈ U if there exists a positive homogeneous function ∇^B f(x0) : U → E2 such that
$$\lim_{h \to 0} \frac{\big\| f(x_0 + h) - f(x_0) - \nabla^B f(x_0)(h) \big\|_{E_2}}{\|h\|_{E_1}} = 0$$
(see Robinson (1991)).
The Bouligand influence function (BIF) of the map T : M1(X × Y) → H for a distribution P in the direction of a distribution Q ̸= P was defined by Christmann and Van Messem (2008) as the function BIF(Q; T, P) ∈ H satisfying
$$\lim_{\varepsilon \downarrow 0} \frac{\big\| T\big((1-\varepsilon)P + \varepsilon Q\big) - T(P) - \varepsilon\,\mathrm{BIF}(Q; T, P) \big\|_H}{\varepsilon} = 0. \qquad (17)$$
Note that the BIF is a special Bouligand derivative,
$$\lim_{\|\varepsilon(Q-P)\| \to 0} \frac{\big\| T\big(P + \varepsilon(Q - P)\big) - T(P) - \varepsilon\,\mathrm{BIF}(Q; T, P) \big\|_H}{\|\varepsilon(Q - P)\|} = 0,$$
due to the fact that Q and P are fixed, and it is independent of the norm on M1 (X × Y). The partial Bouligand derivative
with respect to the third argument of L∗ is denoted by ∇₃^B L∗(x, y, t). The BIF shares with Hampel's influence function the interpretation that it measures the impact of an infinitesimally small amount of contamination of the original distribution P in
the direction of Q on the quantity of interest T (P). It is thus desirable that the function T has a bounded BIF. It is known that
existence of the BIF implies the existence of the IF and in this case they are equal. The next result shows that, under some
conditions, the Bouligand influence function of SVMs exists and is bounded; see Christmann et al. (2009) for more related
results.
Theorem 6 (Bouligand Influence Function). Let the main assumptions (i)–(iii) be valid, but assume that X is a complete separable normed linear space.¹ Let P, Q ∈ M1(X × Y). Let L be the pinball loss function Lτ with τ ∈ (0, 1) or let L be the ϵ-insensitive loss function Lϵ with ϵ > 0. Assume that for all δ > 0 there exist positive constants ξP, ξQ, cP, and cQ such that for all t ∈ R with |t − fL,P,λ(x)| ≤ δ∥k∥∞ the following inequalities hold for all a ∈ [0, 2δ∥k∥∞] and x ∈ X:
$$P\big([t, t+a] \,\big|\, x\big) \le c_P\, a^{1+\xi_P} \quad \text{and} \quad Q\big([t, t+a] \,\big|\, x\big) \le c_Q\, a^{1+\xi_Q}. \qquad (18)$$
Then the Bouligand influence function BIF(Q; T, P) of T(P) := fL,P,λ exists, is bounded, and equals
$$\frac{1}{2\lambda}\Big( \mathbb{E}_P\big[\nabla_3^B L^*(X, Y, f_{L,P,\lambda}(X))\,\Phi(X)\big] - \mathbb{E}_Q\big[\nabla_3^B L^*(X, Y, f_{L,P,\lambda}(X))\,\Phi(X)\big] \Big). \qquad (19)$$
Note that the Bouligand influence function of the SVM only depends on Q via the second term in (19). The interpretation
of the condition (18) is that the probability that Y given x is in some small interval around the SVM is essentially at most
proportional to the length of the interval to some power greater than 1.
For the pinball loss function, the BIF given in (19) simplifies to
$$\frac{1}{2\lambda}\int_X \Big( P\big((-\infty, f_{L,P,\lambda}(x)] \,\big|\, x\big) - \tau \Big)\,\Phi(x)\, P_X(dx) \;-\; \frac{1}{2\lambda}\int_X \Big( Q\big((-\infty, f_{L,P,\lambda}(x)] \,\big|\, x\big) - \tau \Big)\,\Phi(x)\, Q_X(dx). \qquad (20)$$
The BIF of the SVM based on the pinball loss function can hence be interpreted as the difference of the integrated and $\frac{1}{2\lambda}\Phi(x)$-weighted differences between the estimated quantile level and the desired quantile level τ.
Recall that the BIF is a special Bouligand derivative and thus positive homogeneous in h = ε(Q − P). If the BIF exists, we then immediately obtain
$$f_{L,(1-\alpha\varepsilon)P+\alpha\varepsilon Q,\lambda} - f_{L,P,\lambda} = T(P + \alpha h) - T(P) = \alpha\, \mathrm{BIF}(Q; T, P) + o(\alpha h) = \alpha\big( T(P + h) - T(P) + o(h) \big) + o(\alpha h) = \alpha\big( f_{L,(1-\varepsilon)P+\varepsilon Q,\lambda} - f_{L,P,\lambda} \big) + o\big(\alpha\varepsilon(Q - P)\big) \qquad (21)$$
for all α ≥ 0. Hence this equation gives a nice approximation of the asymptotic bias term f_{L,(1−ε)P+εQ,λ} − f_{L,P,λ} if we consider the amount αε of contamination instead of ε.
1 Bouligand derivatives are only defined in normed linear spaces, e.g., for X ⊂ Rd a linear subspace.
4. Some numerical results
In this section some numerical results for finite sample sizes will be given to complement our theoretical results on
properties of SVMs for additive models. The goals of this section are threefold.
In Section 4.1, the asymptotic results of Section 3.2 are illustrated with some simulated data sets. The example visualizes
our result that SVMs based on an additive kernel converge to the true quantile function if the true quantile function is
additive. The additive SVM is compared to a standard SVM based on a classical d-dimensional Gaussian RBF kernel, for
which convergence is already well-known. It is illustrated that there can be a real benefit from using additive SVMs if the
assumption of an additive model is valid.
In Section 4.2 the results from a small simulation study will be given to compare the finite sample behavior of SVMs
based on additive kernels with a boosting approach using gamboost for additive quantile regression proposed by Schmid
and Hothorn (2008).
In Section 4.3 we apply SVMs based on an additive kernel and this boosting approach for additive quantile regression for
a real-life data set containing rent standard prices for dwellings in the German city Munich.
4.1. Some simulations to show the influence of the sample size
Let us consider the following situation of median regression. We have two independent input variables X1 and X2
each with a uniform distribution on the interval [0, 1], and, given x = (x1, x2), the output variable Y has a Cauchy distribution (and thus not even the first moment exists) with center f(x1, x2) := f1(x1) + f2(x2), where f1(x1) := 7 + 5x1² and f2(x2) := sin(5x2) cos(17x2). We used Cauchy error terms instead of the classical Gaussian error terms to check the
finite sample robustness properties of the SVM. Hence the true function f that we would like to estimate with SVMs has
an additive structure, where the first function is a polynomial of order 2 and the second function is a smooth and bounded
function but not a polynomial. Please note that here X = [0, 1]2 is bounded whereas Y = R is unbounded. As X is bounded,
even a polynomial kernel on X is bounded, which is not true for unbounded input spaces. We simulated three data sets of
this type with sample sizes n = 500, n = 2000, and n = 10, 000. We compare the exact function f with three SVMs fL,D,λn
fitted with the three data sets, where we use the pinball loss function with τ = 0.5 because we are interested in median
regression.
• Nonparametric SVM. We use an SVM based on the two-dimensional GRBF kernel k defined in (9) to fit f in a totally
nonparametric manner.
• Nonparametric additive SVM. We use an SVM based on the kernel k = k1 + k2 where k1 and k2 are one-dimensional GRBF
kernels.
• Semiparametric additive SVM. We use an SVM based on the kernel k = k1 + k2 where k1 is a polynomial kernel of order 2
for fitting the function f1 and k2 is a one-dimensional GRBF kernel for fitting the function f2 .
Our interest in these examples was in illustrating that SVMs are consistent in all of these situations. In order to stay in line
with the theoretical results obtained in Section 3.2, the kernel parameter γ = 2 was fixed and the regularization parameter
was set to λn = a · n^{−0.45} for the rather arbitrarily chosen a = 0.05. Note that the asymptotic results hold for any fixed γ
and a.
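The data-generating step of this experiment can be sketched in a few lines of R (ours; the seed and the shown sample size are arbitrary choices for illustration):

set.seed(1)
n  <- 2000
x1 <- runif(n); x2 <- runif(n)            # independent uniform covariates on [0, 1]
f1 <- function(x) 7 + 5 * x^2
f2 <- function(x) sin(5 * x) * cos(17 * x)
y  <- f1(x1) + f2(x2) + rcauchy(n)        # Cauchy errors; median of Y given x is f1(x1) + f2(x2)
lambda_n <- 0.05 * n^(-0.45)              # regularization parameter lambda_n = a * n^(-0.45), a = 0.05
gamma    <- 2                             # fixed GRBF kernel parameter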
From Figs. 1–3 we can draw the following conclusions for this special situation.
(i) If the additive model is valid, all three SVMs give comparable and reasonable results if the sample size n is large enough
even for Cauchy distributed error terms; see Fig. 1. This is in good agreement with the theoretical results derived in
Section 3.
(ii) If the sample size is small to moderate and if the assumed additive model is valid, then both SVMs based on kernels
especially designed for additive models show better results than the standard two-dimensional GRBF kernel; see Figs. 2
and 3.
(iii) The difference between the nonparametric additive SVM and semiparametric additive SVM was somewhat surprisingly
small for all three sample sizes, although the true function had the very special structure which favors the
semiparametric additive SVM.
4.2. Some simulations for comparing SVMs with gamboost
In the following situation, we consider quantile regression for the quantile levels τ = 0.5 (median regression) and
τ = 0.9.
The model. We considered the additive model
$$Y_i = f_1(x_{i1}) + f_2(x_{i2}) + \varepsilon_i, \qquad \varepsilon_i \sim_{\text{i.i.d.}} \text{Cauchy}(0, 1),$$
so the resulting quantile function was given by
$$f^*_\tau(x_{i1}, x_{i2}) = f_1(x_{i1}) + f_2(x_{i2}) + F^{-1}_{\text{Cauchy}(0,1)}(\tau),$$
where $F^{-1}_{\text{Cauchy}(0,1)}(\tau)$ denotes the τ-quantile of the standard Cauchy distribution. For the simulated data sets, the functions were given by
$$f_1(x_{i1}) = 7 + 5x_{i1}^2 \quad \text{and} \quad f_2(x_{i2}) := \sin(5x_{i2})\cos(17x_{i2}).$$

Fig. 1. Quantile regression using SVMs and the pinball loss function with τ = 0.5. Model: Y |(x1, x2) ∼ f1(x1) + f2(x2) + Cauchy errors, where f1(x1) := 7 + 5x1² and f2(x2) := sin(5x2) cos(17x2), and x1 and x2 are observations of independent and identically uniformly distributed random variables on the interval [0, 1]. The regularization parameter is λn = 0.05 n^{−0.45}, and the kernel parameter of the Gaussian RBF kernel is γ = 2. Upper left subplot: true function f(x1, x2) = f1(x1) + f2(x2). Upper right subplot: SVM fit based on the GRBF kernel k on X = R². Lower left subplot: SVM fit based on the sum of two one-dimensional GRBF kernels. Lower right subplot: SVM fit based on the sum of a one-dimensional polynomial kernel on R and a one-dimensional GRBF kernel.
In each case, we simulated 100 data sets for the sample sizes n = 500 and n = 2000 and for the quantiles τ = 0.5 and
τ = 0.9. The covariates xi1 and xi2 were independently drawn from the uniform distribution on the interval [0, 1].
Estimation. For each of the data sets, we estimated the conditional quantiles by additive SVMs as presented in Section 3
and by a boosting approach for additive quantile regression (function gamboost from the R-package ‘‘mboost’’); see also
Schmid and Hothorn (2008).
In the case of the additive SVM, we used the additive kernel k = k1 +k2 where k1 and k2 are one-dimensional GRBF kernels
kγ1 and kγ2 . As the parameters γ1 , γ2 > 0 in the GRBF kernels act as scale parameters for the covariates xi1 and xi2 , they were
estimated separately from the covariates, by the robust scale estimator Qn (function Qn from the R-package ‘‘robustbase’’);
see Rousseeuw and Croux (1993). In accordance with Theorem 3 (consistency of additive SVMs) the regularization parameter
λn > 0 was chosen as
$$\lambda_n = a \cdot n^{-0.45}.$$
The hyperparameter a > 0 was chosen within the values
0.4, 0.2, 0.1, 0.05, 0.025, 0.012, 0.006, 0.004, 0.003, 0.0025, 0.002, 0.00175, 0.0015, 0.0013, 0.0012, 0.0011, 0.0010,
0.0009, 0.0008, 0.0007, 0.0006, 0.0005, 0.0004, 0.0003, 0.0002, 0.0001, 0.00005
in a data-driven way by a fivefold cross-validation.
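A sketch of this tuning step in R (ours; how exactly the Qn estimate is turned into the kernel parameter is not spelled out above, so setting γj equal to Qn of the j-th covariate is only one plausible reading):

library(robustbase)                        # provides the robust scale estimator Qn()
gamma1 <- Qn(x1)                           # assumed: gamma_j taken as Qn of covariate j
gamma2 <- Qn(x2)
a_grid <- c(0.4, 0.2, 0.1, 0.05, 0.025, 0.012, 0.006, 0.004, 0.003, 0.0025, 0.002,
            0.00175, 0.0015, 0.0013, 0.0012, 0.0011, 0.0010, 0.0009, 0.0008, 0.0007,
            0.0006, 0.0005, 0.0004, 0.0003, 0.0002, 0.0001, 0.00005)
lambda_grid <- a_grid * n^(-0.45)          # candidates for lambda_n; a chosen by fivefold CV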
Fig. 2. Quantile regression using SVMs and the pinball loss function with τ = 0.5. The same situation as in Fig. 1, but with smaller sample size.
In the case of boosting, following Fenske et al. (2009, Section 3.2), we used cubic penalized spline base-learners with
second-order difference penalty, 20 inner knots, and three degrees of freedom, and fixed the step size at ν = 0.1. With
these settings, only the number of iterations mstop remained as a hyperparameter which was chosen within the values
100, 200, 300, 400, 500, 1000, 2000, 3000, 5000, 7000, 9000, 13000, 17000, 21000, 25000, 30000, 35000, 40000
in a data-driven way by a fivefold cross-validation.
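A corresponding gamboost call might look as follows (our sketch; the authors' exact call is not reproduced in the paper, and mstop would subsequently be selected by fivefold cross-validation from the grid above, e.g. via cvrisk):

library(mboost)
dat <- data.frame(y = y, x1 = x1, x2 = x2)
fit_boost <- gamboost(
  y ~ bbs(x1, degree = 3, knots = 20, differences = 2, df = 3) +
      bbs(x2, degree = 3, knots = 20, differences = 2, df = 3),
  data    = dat,
  family  = QuantReg(tau = 0.5),           # pinball loss with tau = 0.5 (median regression)
  control = boost_control(mstop = 1000, nu = 0.1))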
Fig. 3. Quantile regression using SVMs and the pinball loss function with τ = 0.5. The same situation as in Fig. 2, but with an even smaller sample size.

Performance results. The evaluation of the estimates was done on test data sets consisting of 10,000 data points. That is, for every run k ∈ {1, . . . , 100}, we got the empirical risk of the additive SVM, the boosting algorithm, and the true conditional quantile function:
$$\mathcal{R}^{(k)}_{\text{emp}}(f^{(k)}_\star) = \frac{1}{10000} \sum_{i=1}^{10000} L_\tau\big(y_i^{(k)}, f^{(k)}_\star(x_i^{(k)})\big) \qquad \text{for } f^{(k)}_\star \in \big\{f^{(k)}_{\text{SVM}}, f^{(k)}_{\text{boost}}, f^{(k)}_{\text{true}}\big\},$$
where Lτ denotes the pinball loss function. Note that it does not make sense to compare the empirical risks $\mathcal{R}^{(k)}_{\text{emp}}$. As we considered errors from the standard Cauchy distribution, the empirical risk does not converge to the true risk because the ''true risk'' is equal to infinity. However, according to the L∗-trick (see Section 2), it is possible to compare differences of the risks
$$\mathcal{R}^{(k)}_{\text{emp}}(f^{(k)}_\star) - \mathcal{R}^{(k)}_{\text{emp}}(f^{(k)}_{\text{true}}) \qquad \text{for } f^{(k)}_\star \in \big\{f^{(k)}_{\text{SVM}}, f^{(k)}_{\text{boost}}\big\}.$$
As a measure of the performance, we calculated the average over the 100 runs
$$\Delta\text{-risk} := \frac{1}{100}\sum_{k=1}^{100} \Big( \mathcal{R}^{(k)}_{\text{emp}}(f^{(k)}_\star) - \mathcal{R}^{(k)}_{\text{emp}}(f^{(k)}_{\text{true}}) \Big) \qquad \text{for } f^{(k)}_\star \in \big\{f^{(k)}_{\text{SVM}}, f^{(k)}_{\text{boost}}\big\}$$
and also the standard deviation (sd) over the 100 runs. Similarly, we calculated the average bias over the 100 runs,
$$\text{bias} := \frac{1}{100}\sum_{k=1}^{100} \frac{1}{10000}\sum_{i=1}^{10000} \Big( f^{(k)}_\star(x_i^{(k)}) - f_{\text{true}}(x_i^{(k)}) \Big) \qquad \text{for } f^{(k)}_\star \in \big\{f^{(k)}_{\text{SVM}}, f^{(k)}_{\text{boost}}\big\},$$
the standard deviation (sd) of the bias over the 100 runs, the average of the square root of the mean squared errors (sqrMSE) over the 100 runs,
$$\text{sqrMSE} := \frac{1}{100}\sum_{k=1}^{100} \sqrt{ \frac{1}{10000}\sum_{i=1}^{10000} \Big( f^{(k)}_\star(x_i^{(k)}) - f_{\text{true}}(x_i^{(k)}) \Big)^2 } \qquad \text{for } f^{(k)}_\star \in \big\{f^{(k)}_{\text{SVM}}, f^{(k)}_{\text{boost}}\big\},$$
and its standard deviation (sd) over the 100 runs.
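In R, these evaluation quantities amount to a few lines per run (our sketch; pred, pred_true and y denote the predictions of a fit, the values of the true quantile function, and the responses on a test set, all names being ours):

pinball_loss <- function(y, t, tau) ifelse(y - t < 0, (tau - 1) * (y - t), tau * (y - t))
emp_risk   <- function(y, pred, tau) mean(pinball_loss(y, pred, tau))
delta_risk <- function(y, pred, pred_true, tau)
  emp_risk(y, pred, tau) - emp_risk(y, pred_true, tau)   # L*-trick: only differences are compared
bias_run   <- function(pred, pred_true) mean(pred - pred_true)
sqrmse_run <- function(pred, pred_true) sqrt(mean((pred - pred_true)^2))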
The results are shown in Table 1 and the corresponding boxplots are shown in Fig. 4. It can be seen that the performance
results are similar for the two methods. However, with respect to the criterion sqrMSE, boosting performs slightly better in
this example. The standard deviations (see Table 1) and the boxplots (Fig. 4) indicate that the two methods cope with the
heavy-tailed error distribution similarly well in this example. Only for τ = 0.9 and n = 500 did the boosting algorithm
produce severe outliers. The SVM algorithm did not produce such severe outliers, as was to be expected from our theoretical
results on the statistical robustness of such SVMs.
Computational costs. As usual, the computational costs increased with increasing sample size. However, while the
computational time increased only moderately in the case of SVMs (17 h for n = 500 and 25 h for n = 2000 for 100
runs each with fivefold cross-validation), the computational time considerably increased in the case of boosting (26 h for
n = 500 and 87 h for n = 2000, for 100 runs each with fivefold cross-validation). In the case of boosting, the need for a large working
space was a limiting factor: approximately 0.8 GB for n = 500 and approximately 1.7 GB for n = 2000 were needed. In
contrast, the working space was no problem in the case of SVMs for such sample sizes, because less than 10 kB were needed
for n = 500 and n = 2000. The computations were done on a dual-core PC with 3.0 GHz and 3 GB RAM.
Table 1
Results of the simulations with 100 runs.

                          SVM              gamboost
τ = 0.5, n = 500
  ∆-risk (sd)             0.026 (0.005)    0.021 (0.008)
  bias (sd)               0.001 (0.079)   −0.010 (0.087)
  sqrMSE (sd)             0.415 (0.034)    0.371 (0.073)
τ = 0.5, n = 2000
  ∆-risk (sd)             0.017 (0.002)    0.005 (0.002)
  bias (sd)               0.004 (0.034)   −0.003 (0.044)
  sqrMSE (sd)             0.333 (0.015)    0.178 (0.034)
τ = 0.9, n = 500
  ∆-risk (sd)             0.022 (0.013)    0.018 (0.014)
  bias (sd)               0.176 (0.498)   −0.212 (0.472)
  sqrMSE (sd)             1.164 (0.421)    0.958 (0.374)
τ = 0.9, n = 2000
  ∆-risk (sd)             0.008 (0.004)    0.007 (0.003)
  bias (sd)               0.095 (0.207)   −0.005 (0.229)
  sqrMSE (sd)             0.731 (0.176)    0.653 (0.137)
Fig. 4. Boxplots for the simulations with 100 runs.
We mention that some analogous simulations were done for SVMs where instead of homoscedastic error terms some
heteroscedasticity was used. These results are not shown here because they are qualitatively similar to those shown above.
4.3. A real-life example: rent standard prices
Let us now consider a real-life example of the rent standard for dwellings in the large German city Munich. Many German
cities compose so-called rent standards to make a decision making instrument available to tenants, landlords, renting
advisory boards, and experts. Such rent standards can in particular be used for the determination of the local comparative rent, i.e. the net rent as a function of the dwelling size, year of construction of the house, geographical information, etc.

Table 2
Results for the rent standard data set.

                SVM              gamboost
τ = 0.5
  Risk (sd)     1.512 (1.189)    1.517 (1.189)
  β̂1            1.424            1.312
  β̂2            3.524            3.680
τ = 0.9
  Risk (sd)     0.670 (0.559)    0.671 (0.563)
  β̂1            1.604            1.447
  β̂2            5.215            4.448

For the construction of a rent standard, a representative random sample is drawn from all households and questionnaires
are used to determine the relevant information by trained interviewers. Fahrmeir et al. (2007) described such a data set
consisting of n = 3082 rent prices in Munich, which is one of the largest cities in Germany. The data set is publicly available
at http://www.regressionbook.org/. Fahrmeir et al. (2007) fitted the following additive model
price = f1 (size) + f2 (year) + β0 + β1 region1 + β2 region2 + error,
where the following variables were used:
price: net rent price per square meter in DM (1 € ≈ 1.96 DM);
size: size in square meters of the dwelling (between 20 and 160);
year: year (between 1918 and 1997);
region1 : good residential area (0 = no, 1 = yes);
region2 : best residential area (0 = no, 1 = yes).
Hence region1 and region2 are dummy variables with respect to a standard residential area. Fahrmeir et al. (2007) used a
special spline method for estimating the functions f1 and f2 .
For purposes of illustration of the SVMs with additive kernels investigated in the present paper, we used a nonparametric additive SVM for median regression. More precisely, we used the pinball loss function with τ = 0.5 and the kernel
$$k(x, x') = \sum_{j=1}^{4} k_j(x_j, x'_j), \qquad (22)$$
where x = (x1, x2, x3, x4) ∈ R⁴, x′ = (x′1, x′2, x′3, x′4) ∈ R⁴, and
k1 : R × R → R is a Gaussian RBF kernel with parameter γ1 for size,
k2 : R × R → R is a Gaussian RBF kernel with parameter γ2 for year,
k3 : R × R → R is a dot kernel for region1,
k4 : R × R → R is a dot kernel for region2.
The determination of the kernel parameters γ1 and γ2 for the kernels k1 and k2 and the determination of the regularization
parameter λ of the SVM were done in the same manner as in Section 4.2, i.e. via the robust scale estimator Qn (Rousseeuw
and Croux, 1993) and fivefold cross-validation. The whole data set for sample size n = 3082 was split randomly into a
training data set of size ntrain = 2000 and a test data set of size ntest = 1082.
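A sketch of the kernel matrix (22) for the training part of this data set (ours; it reuses the kernel helpers from the Section 2 sketch, assumes the four covariates are stored in a numeric matrix Xtrain with columns named size, year, region1 and region2, and again reads the Qn estimates directly as the kernel parameters, which is only one plausible reading of the tuning step above):

library(robustbase)
dot_kernel <- function(x, xp) sum(x * xp)                 # dot kernel for the dummy variables
K <- kernel_matrix(Xtrain[, "size",    drop = FALSE], grbf_kernel, gamma = Qn(Xtrain[, "size"])) +
     kernel_matrix(Xtrain[, "year",    drop = FALSE], grbf_kernel, gamma = Qn(Xtrain[, "year"])) +
     kernel_matrix(Xtrain[, "region1", drop = FALSE], dot_kernel) +
     kernel_matrix(Xtrain[, "region2", drop = FALSE], dot_kernel)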
Overall, a support vector machine based on the additive kernel given above and the boosting approach using gamboost
yielded very similar results for this particular data set; see Table 2 and Fig. 5.
The slope parameters were estimated via the support vector machine by β̂1 = 1.42 for a good residential area
(region1 = 1) and β̂2 = 3.52 for the best residential area (region2 = 1). Similar results were obtained by gamboost;
see Table 2. Hence, we obtain, apart from these level shifts, the same surfaces, as shown in Fig. 5, for dwellings located in
good or in the best residential areas. The upper subplots in Fig. 5 show the estimated median net rent price for one square
meter depending on the size of the dwelling and the year of the construction for a dwelling in a standard residential area.
These subplots show that the median of the net rent prices per square meter is fairly stable for construction years up to
1960, but a more or less linear increase is visible for newer buildings. The upper subplots also show that the median of the
net rent prices per square meter is especially high for dwellings of size less than 80 m², that the price is nearly constant for sizes between 80 and 140 m², and that a slight increase of the square meter price seems to occur for even larger dwellings.
The shapes of the surfaces for the conditional τ = 0.9 quantiles are given in the lower subplots and are quite similar to the
ones for the estimated median net rent prices. However, the lower subplots may give an indication for two moderate peaks
for the 90% quantile net rent prices for dwellings of size 50 m² and 100 m², respectively.
We would like to mention that we used this real-life example just for illustration purposes, but nevertheless our results
are in good agreement with the more detailed statistical analysis of this data set made by Fahrmeir et al. (2007) who used
spline methods.
Fig. 5. Plot for the fitted additive model for the rent standard data set based on a nonparametric additive SVM for quantile regression, i.e., pinball loss
functions with τ = 0.50 and τ = 0.90, respectively. The surface gives the estimated median (upper) or 90% quantile (lower) net rent price of one square
meter depending on the size of the dwelling and the year of the construction for a standard residential area, i.e., region1 = region2 = 0.
5. Discussion
Support vector machines belong to the class of modern statistical machine learning methods based on kernels. The
success of SVMs is partly based on the kernel trick which makes SVMs usable even for abstract input spaces, their universal
consistency, that they are the solutions of a well-posed problem in Hadamard’s sense, their statistical robustness with
respect to small model violations, and the existence of fast numerical algorithms. During the last decade there has been
considerable research on these topics. To obtain universal consistency, one needs a sufficiently large reproducing kernel Hilbert space H, which is why many SVMs of practical interest are based on Hilbert spaces of infinite dimension. Due to the
no-free-lunch theorem (Devroye, 1982), there exists in general no uniform rate of convergence of SVMs on the set of all
probability measures.
Although such a nonparametric approach is often the best choice in practice due to the lack of prior knowledge of the
unknown probability measure P, a semiparametric approach or an additive model (Friedman and Stuetzle, 1981; Hastie and
Tibshirani, 1990) can also be valuable for at least two reasons. (i) In some applications some weak knowledge on P or on
the unknown function f to be estimated, say the conditional quantile curve, is known, e.g. f is known to be bounded or at
least integrable. (ii) For practical reasons, we may be interested only in functions f which offer a nice interpretation from an
applied point of view, although there might be a measurable function with a smaller risk, because an interpretable prediction
function can be crucial in some applications. An important class of statistical models whose predictions are relatively easy to interpret is the class of additive models.
Therefore, support vector machines for additive models were treated in this paper and some results on their consistency
and statistical robustness properties were derived.
Additive models have already been treated by several other methods. Currently, there is considerable interest in the
boosting of additive models; see, e.g., De Bock et al. (2010), Ferraty and Vieu (2009), and Shafik and Tutz (2009). In addition
to the backfitting algorithm of Hastie and Tibshirani (1990) (gam in the R-package ‘gam’), there is also a number of methods
which are based on penalized splines:
• gam in the R-package ‘mgcv’—see Wood (2006);
• hgam in the R-package ‘hgam’—see Meier et al. (2009);
• vgam in the R-package ‘vgam’—see Yee and Wild (1996);
and on boosting:
• GAMBoost in the R-package ‘GAMBoost’—see Tutz and Binder (2007);
• gamboost in the R-package ‘mboost’—see Schmid and Hothorn (2008).
In addition, Koenker (2005, Section 7) specifically considers quantile regression for additive models (rqss in the R-package
‘quantreg’). Apart from gam in the R-package ‘mgcv’, all of these algorithms can only deal with additive models
y = f1 (x1 ) + · · · + fs (xs ) + ε
such that the covariates x1 , . . . , xs are univariate or at most bivariate; gam in the R-package ‘mgcv’ and additive SVMs
(using ‘mySVM’) can deal with covariates xj ∈ Rpj of any dimension pj ∈ N. In addition, additive SVMs can also be
used for abstract input spaces Xj as we have only assumed in Section 3 that Xj is any complete, separable metric space.
Such situations occur, e.g., in the analysis of colored images, text classification, web mining and signal processing; see e.g.
Christmann and Steinwart (2010) and the references cited therein. In the case of any abstract input spaces, standard software implementations of SVMs can still be used because additive SVMs can simply be obtained by choosing a suitable kernel k or, correspondingly, a suitable kernel matrix $K = \big(k(x_i, x_j)\big) \in \mathbb{R}^{n \times n}$, where
$$k(x_i, x_j) := k_1(x_{1,i}, x_{1,j}) + \cdots + k_s(x_{s,i}, x_{s,j}) \qquad \text{for all } i, j \in \{1, \ldots, n\}.$$
In particular, any implementation which allows for passing the kernel matrix K (instead of just choosing a kernel function
among a set of predefined kernel functions) is suitable; e.g. use ksvm in the R-package ‘kernlab’ and fix the arguments x = K
and kernel = "matrix". Note that the calculation of the kernel matrix is the only step where the complexity of Xj and
the number s of additive terms matters. Since the kernel matrix has to be calculated once only, the complexity of Xj and
the number s of additive terms essentially have hardly any influence on the computational costs—at least, as long as the
sample size is not too large and so the kernel matrix can still be stored. Accordingly, additive SVMs can also be used for
high-dimensional additive models.
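Following the recipe just described, a minimal kernlab sketch is given below (ours, not the authors' setup: the paper's own computations used mySVM, see Appendix A, because ksvm does not provide the pinball loss, so the call illustrates the kernel-matrix interface with ε-insensitive regression; K denotes a precomputed n × n additive kernel matrix, y the responses, and C and epsilon are arbitrary illustrative values):

library(kernlab)
K_add <- as.kernelMatrix(K)                # K: precomputed n x n additive kernel matrix
fit   <- ksvm(K_add, y, kernel = "matrix", type = "eps-svr", C = 1, epsilon = 0.1)
# For new data, predict() expects the kernel matrix between the test points and the
# support vectors, e.g. predict(fit, as.kernelMatrix(K_test[, SVindex(fit), drop = FALSE])).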
In addition to these computational advantages, additive SVMs have a sound theoretical justification: we have shown
in Section 3.2 that additive SVMs are consistent for additive models and we have explained in Section 3.3 that the good
robustness properties of SVMs carry over to additive SVMs without further ado. However, as additive SVMs are a special
case of general SVMs, general disadvantages of SVMs also carry over.
One of the most important topics of recent research on support vector machines is the investigation of rates of
convergence for special cases; see e.g. Steinwart and Scovel (2007), Caponnetto and De Vito (2007), Blanchard et al. (2008),
Steinwart et al. (2009) and Mendelson and Neeman (2010). Xu and Chen (2008) and Steinwart and Christmann (2009)
considered convergence rates of SVMs under mixing conditions. Accordingly, it may also be worthwhile to investigate the
rates of convergence of SVMs based on kernels designed for additive models. Our simple numerical examples in Section 4.1
seem to indicate that there might be some gain with respect to the rate of convergence or with respect to smaller constants
describing the rate of convergence. However, this is far beyond the scope of this paper.
Finally, we would like to mention the well-known fact that not only is the sum of s kernels a kernel but also the product
of s kernels is a kernel. Hence it seems to be possible to derive results similar to those given here for additive models also
for multiplicative models or even models with additive and multiplicative terms.
Appendix A. Some computational details
Support vector machines can be computed with many different software tools, which often allow the definition of new kernels or the direct specification of the kernel matrix $K := \big(k(x_i, x_j)\big) \in \mathbb{R}^{n \times n}$. We used the implementation
mySVM developed by Rüping (2000), who also gives a description of the parameters and options of his program, for the
computation of SVMs for classification, regression, and distribution estimation problems. This implementation is based on
SVMlight developed by Joachims (1999). The software mySVM can be used for SVMs based on the pinball loss for quantile
regression. The parameters and numerical options for the specific SVM are submitted to mySVM via an external file. A typical
example for the external files that we used to fit an SVM based on the pinball loss function with τ = 0.9 and the kernel
defined in (22) as the sum of two GRBF kernels and two linear kernels in order to analyze the rent standard data set is given
below. The definition of the kernel starts with @kernel. The options type sum_aggregation and number_parts 4
specify that an additive kernel consisting of four parts will be used. The following lines specify these individual kernels. After
the line @parameters, the loss function and some options for specifying numerical parameters are given; e.g., the options
epsilon=0, L+ = 1 − τ , and L- = τ can be used to specify an SVM based on the pinball loss function for quantile level τ .
@kernel
type sum_aggregation
number_parts 4
range 1 1
range 2 2
range 3 3
range 4 4
@kernel
type radial
gamma 0.581
@kernel
type radial
gamma 0.626
@kernel
type dot
@kernel
type dot
@parameters
C 5.097
scale
epsilon 0.00
L+ 0.1
L- 0.9
kernel_cache 2000
max_iterations 1000000
working_set_size 2
convergence_epsilon 0.0001
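For readers who prefer a programmatic view, the following Python sketch constructs a kernel matrix of the form specified by the mySVM file above, namely the sum of two Gaussian RBF kernels with bandwidth parameters gamma = 0.581 and 0.626 acting on the first two input variables and two linear kernels acting on the third and fourth input variables, together with the pinball loss for τ = 0.9 that motivates the options L+ = 0.1 and L- = 0.9. The parameterization exp(−γ(u − v)²) of the radial kernel, the column layout of the toy data, and all function names are assumptions made only for this illustration; the sketch is not the authors' implementation.

import numpy as np

def rbf_1d(u, v, gamma):
    # Gaussian RBF kernel on a single input variable: exp(-gamma * (u - v)^2)
    return np.exp(-gamma * (u[:, None] - v[None, :]) ** 2)

def linear_1d(u, v):
    # linear kernel on a single input variable: u * v
    return u[:, None] * v[None, :]

def additive_kernel(X, Z, gammas=(0.581, 0.626)):
    # additive kernel of Appendix A: two GRBF parts (columns 0 and 1)
    # plus two linear parts (columns 2 and 3), as in the mySVM file above
    K = rbf_1d(X[:, 0], Z[:, 0], gammas[0])
    K += rbf_1d(X[:, 1], Z[:, 1], gammas[1])
    K += linear_1d(X[:, 2], Z[:, 2])
    K += linear_1d(X[:, 3], Z[:, 3])
    return K

def pinball_loss(y, f, tau=0.9):
    # pinball loss for quantile level tau; for tau = 0.9 the slopes are
    # 0.9 on positive residuals and 0.1 on negative residuals
    r = y - f
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

# toy data with four input variables as a stand-in for the rent standard data
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
K = additive_kernel(X, X)                 # n x n kernel matrix, here 10 x 10
print(K.shape, pinball_loss(np.array([1.0, -2.0]), np.array([0.0, 0.0])))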
Appendix B. Proofs
Proof of Theorem 2. First fix any j ∈ {1, . . . , s} and define the mapping k̃j : X × X → R via
\[
  \tilde{k}_j\bigl((x_1, \ldots, x_s), (x'_1, \ldots, x'_s)\bigr) = k_j(x_j, x'_j)
\]
for every (x1 , . . . , xs ) ∈ X and (x′1 , . . . , x′s ) ∈ X. Accordingly, for every fj ∈ Hj , define f̃j : X → R via
\[
  \tilde{f}_j(x_1, \ldots, x_s) = f_j(x_j) \qquad \forall\, (x_1, \ldots, x_s) \in X.
\]
Then, it is easy to see that
\[
  \tilde{H}_j = \bigl\{ \tilde{f}_j : X \to \mathbb{R} \,:\, f_j \in H_j \bigr\}
\]
is a Hilbert space with inner product and norm given by
\[
  \langle \tilde{f}_j, \tilde{h}_j \rangle_{\tilde{H}_j} = \langle f_j, h_j \rangle_{H_j}
  \qquad\text{and}\qquad
  \|\tilde{f}_j\|_{\tilde{H}_j} = \|f_j\|_{H_j}
  \tag{B.1}
\]
for every fj ∈ Hj and hj ∈ Hj . Hence, for every x = (x1 , . . . , xs ) ∈ X, we get k̃j (·, x) ∈ H̃j and
\[
  \tilde{f}_j(x) = f_j(x_j) = \langle f_j, k_j(\cdot, x_j) \rangle_{H_j} = \langle \tilde{f}_j, \tilde{k}_j(\cdot, x) \rangle_{\tilde{H}_j}
  \qquad \forall\, f_j \in H_j,
\]
where the last equality follows from (B.1) and the definition of k̃j . That is, k̃j is a reproducing kernel and H̃j is its RKHS.
Next, it follows from Berlinet and Thomas-Agnan (2004, Section 4.1) that k = k̃1 + · · · + k̃s is a reproducing kernel on X
with RKHS H = H̃1 + · · · + H̃s and norm
\[
  \|f\|_H^2
  = \min_{\substack{f = \tilde{f}_1 + \cdots + \tilde{f}_s \\ \tilde{f}_1 \in \tilde{H}_1, \ldots, \tilde{f}_s \in \tilde{H}_s}}
    \Bigl( \|\tilde{f}_1\|_{\tilde{H}_1}^2 + \cdots + \|\tilde{f}_s\|_{\tilde{H}_s}^2 \Bigr)
  \overset{\text{(B.1)}}{=}
  \min_{\substack{f = f_1 + \cdots + f_s \\ f_1 \in H_1, \ldots, f_s \in H_s}}
    \Bigl( \|f_1\|_{H_1}^2 + \cdots + \|f_s\|_{H_s}^2 \Bigr).
  \tag{B.2}
\]
Using the reduced notation f1 + · · · + fs instead of f̃1 + · · · + f̃s , inequality (12) follows.
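As a small worked illustration of the norm formula (B.2), added here only for the reader's convenience and not part of the original proof, take s = 2 and the one-dimensional linear kernels $k_j(x_j, x'_j) = x_j x'_j$, so that $H_j = \{x_j \mapsto c\,x_j : c \in \mathbb{R}\}$ with $\|c\,\mathrm{id}\|_{H_j} = |c|$. The additive kernel is then
\[
  k\bigl((x_1, x_2), (x'_1, x'_2)\bigr) = x_1 x'_1 + x_2 x'_2,
\]
i.e. the ordinary linear kernel on $\mathbb{R}^2$. For $f(x_1, x_2) = 2x_1 + 3x_2$ the decomposition $f = \tilde{f}_1 + \tilde{f}_2$ with $\tilde{f}_1 \in \tilde{H}_1$ and $\tilde{f}_2 \in \tilde{H}_2$ is unique, because the two spaces intersect only in the zero function, so (B.2) gives
\[
  \|f\|_H^2 = \|2\,\mathrm{id}\|_{H_1}^2 + \|3\,\mathrm{id}\|_{H_2}^2 = 4 + 9 = 13,
\]
which agrees with the familiar norm $\|(2, 3)\|_2^2 = 13$ of the corresponding linear functional on $\mathbb{R}^2$.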
In order to prove Theorem 3, the following proposition is needed. It provides conditions on Hj and Fj under which the
minimal risk over H = H1 + · · · + Hs is equal to the minimal risk over the larger F = F1 + · · · + Fs .
Proposition 7. Let the main assumptions (i)–(iii) be valid. Let P ∈ M1 (X × Y) be such that
\[
  H_j \subset F_j \subset L_1(P_{X_j}) \qquad \forall\, j \in \{1, \ldots, s\}
\]
and Hj is dense in Fj with respect to $\|\cdot\|_{L_1(P_{X_j})}$. Then,
\[
  R^{*}_{L^{\star},P,H} := \inf_{f \in H} R_{L^{\star},P}(f) = R^{*}_{L^{\star},P,F}.
  \tag{B.3}
\]
Proof of Proposition 7. According to the definitions, it only remains to prove $R^{*}_{L^{\star},P,H} \le R^{*}_{L^{\star},P,F}$. To this end, take any f ∈ F
and any ε > 0. Then, by assumption there are functions fj ∈ Fj , j ∈ {1, . . . , s},
such that f = f1 + · · · + fs and, for every j ∈ {1, . . . , s}, there is an hj ∈ Hj such that
\[
  \|h_j - f_j\|_{L_1(P_{X_j})} < \frac{\varepsilon}{s \cdot |L|_1}.
  \tag{B.4}
\]
Hence, for h = h1 + · · · + hs ∈ H,
\[
  \bigl| R_{L^{\star},P}(h) - R_{L^{\star},P}(f) \bigr|
  \le \int \bigl| L(x, y, h(x)) - L(x, y, f(x)) \bigr| \, P\bigl(d(x, y)\bigr)
  \le |L|_1 \int \bigl| h(x) - f(x) \bigr| \, P_X(dx)
  \le |L|_1 \sum_{j=1}^{s} \int \bigl| h_j(x_j) - f_j(x_j) \bigr| \, P_{X_j}(dx_j)
  \overset{\text{(B.4)}}{<} \varepsilon.
\]

Proof of Theorem 3. To avoid handling too many constants, let us assume that ∥k∥∞ = 1. According to (7), this implies
∥f ∥∞ ≤ ∥f ∥H for all f ∈ H. Now we use the Lipschitz continuity of L to obtain, for all g ∈ H,
\[
  \bigl| R_{L^{\star},P}(f_{L,P,\lambda_n}) - R_{L^{\star},P}(g) \bigr|
  \le \int \bigl| L(x, y, f_{L,P,\lambda_n}(x)) - L(x, y, g(x)) \bigr| \, P\bigl(d(x, y)\bigr)
  \le |L|_1 \int \bigl| f_{L,P,\lambda_n}(x) - g(x) \bigr| \, P_X(dx)
  \le |L|_1 \int \| f_{L,P,\lambda_n} - g \|_\infty \, P_X(dx)
  \le |L|_1 \, \| f_{L,P,\lambda_n} - g \|_H.
  \tag{B.5}
\]
Let Φ denote the canonical feature map which corresponds to the kernel k. According to Christmann et al. (2009, Theorem
7), for every n ∈ N, there is a bounded, measurable function hn : X × Y → R such that
\[
  \|h_n\|_\infty \le |L|_1
  \tag{B.6}
\]
and, for every Q ∈ M1 (X × Y),
\[
  \bigl\| f_{L,P,\lambda_n} - f_{L,Q,\lambda_n} \bigr\|_H
  \le \lambda_n^{-1} \bigl\| \mathbb{E}_P h_n \Phi - \mathbb{E}_Q h_n \Phi \bigr\|_H.
  \tag{B.7}
\]
Fix any ε ∈ (0, 1) and define
\[
  B_n := \Bigl\{ D_n \in (X \times Y)^n : \bigl\| \mathbb{E}_P h_n \Phi - \mathbb{E}_{D_n} h_n \Phi \bigr\|_H \le \varepsilon \lambda_n \Bigr\},
  \tag{B.8}
\]
where Dn denotes the empirical distribution of the data set Dn . Then, (B.5), (B.7) and (B.8) yield
\[
  \bigl| R_{L^{\star},P}(f_{L,P,\lambda_n}) - R_{L^{\star},P}(f_{L,D_n,\lambda_n}) \bigr| \le |L|_1 \varepsilon
  \qquad \forall\, D_n \in B_n.
  \tag{B.9}
\]
Now let us turn to the probability $P^n(B_n)$. By use of Hoeffding’s inequality, we will show that
\[
  \lim_{n \to \infty} P^n(B_n) = 1.
  \tag{B.10}
\]
To this end, we first observe that λn n1/2 → ∞ implies that λn ε ≥ n−1/2 for all sufficiently large n ∈ N. Moreover, (B.6) and
our assumption ∥k∥∞ = 1 yield ∥hn Φ ∥∞ ≤ |L|1 . Define
\[
  a_n := |L|_1^{-1} \varepsilon \lambda_n
  \qquad\text{and}\qquad
  \xi_n := \frac{3\, |L|_1^{-2} \varepsilon^2 \lambda_n^2 n}{8 \bigl( |L|_1^{-1} \varepsilon \lambda_n + 3 \bigr)}
        = \frac{3\, a_n^2 n}{8 (a_n + 3)}
\]
and note that, for sufficiently large n,
\[
  \frac{\sqrt{2 \xi_n} + 1}{\sqrt{n}} + \frac{4 \xi_n}{3n}
  = \frac{a_n}{2} \cdot \sqrt{\frac{3}{a_n + 3}} + \frac{1}{\sqrt{n}} + \frac{a_n}{2} \cdot \frac{a_n}{a_n + 3}
  < \frac{a_n}{2} + \frac{1}{\sqrt{n}} + \frac{a_n}{2} \cdot \frac{1}{3}
  < a_n = |L|_1^{-1} \varepsilon \lambda_n.
  \tag{B.11}
\]
Consequently, Hoeffding’s inequality in Hilbert spaces (see Steinwart and Christmann, 2008, Corollary 6.15) yields for
B = |L|1 the bound
\begin{align*}
  P^n(B_n)
  &= P^n\Bigl( \bigl\{ D \in (X \times Y)^n : \| \mathbb{E}_P h_n \Phi - \mathbb{E}_D h_n \Phi \|_H \le \varepsilon \lambda_n \bigr\} \Bigr) \\
  &\overset{\text{(B.11)}}{\ge} P^n\Bigl( \bigl\{ D \in (X \times Y)^n : \| \mathbb{E}_P h_n \Phi - \mathbb{E}_D h_n \Phi \|_H \le \frac{B (\sqrt{2 \xi_n} + 1)}{\sqrt{n}} + \frac{4 B \xi_n}{3n} \bigr\} \Bigr) \\
  &\ge 1 - \exp\Bigl( - \frac{3}{8} \cdot \frac{\varepsilon^2 \lambda_n^2 n / |L|_1^2}{\varepsilon \lambda_n / |L|_1 + 3} \Bigr)
   = 1 - \exp\Bigl( - \frac{3}{8} \cdot \frac{\varepsilon^2 \lambda_n^2 n}{(\varepsilon \lambda_n + 3 |L|_1) \, |L|_1} \Bigr)
\end{align*}
for all sufficiently large values of n. Now (B.10) follows from λn → 0 and λn n1/2 → ∞.
According to (B.9) and (B.10),
\[
  R_{L^{\star},P}(f_{L,P,\lambda_n}) - R_{L^{\star},P}(f_{L,D_n,\lambda_n}) \longrightarrow 0 \qquad (n \to \infty)
\]
in probability. Note that
\begin{align*}
  \bigl| R_{L^{\star},P}(f_{L,D_n,\lambda_n}) - R^{*}_{L^{\star},P,F} \bigr|
  &\le \bigl| R_{L^{\star},P}(f_{L,D_n,\lambda_n}) - R^{*}_{L^{\star},P,H} \bigr| + \bigl| R^{*}_{L^{\star},P,H} - R^{*}_{L^{\star},P,F} \bigr| \\
  &\overset{\text{(B.3)}}{\le} \bigl| R_{L^{\star},P}(f_{L,D_n,\lambda_n}) - R_{L^{\star},P}(f_{L,P,\lambda_n}) \bigr| + \bigl| R_{L^{\star},P}(f_{L,P,\lambda_n}) - R^{*}_{L^{\star},P,H} \bigr|.
  \tag{B.12}
\end{align*}
As shown above, the first term in (B.12) converges in probability to 0. Therefore, it only remains to prove that the second
term converges to 0. To this end, define, for every f ∈ H, the affine linear function
\[
  A^{*}_f : \mathbb{R} \to \mathbb{R}, \qquad
  \lambda \mapsto R_{L^{\star},P}(f) + \lambda \|f\|_H^2 - R^{*}_{L^{\star},P,H}.
\]
Then, a continuity result for the pointwise infimum of a family of affine functions (see e.g. Steinwart and Christmann,
2008, A.6.4) yields
\[
  \lim_{n \to \infty} \inf_{f \in H} A^{*}_f(\lambda_n) = \inf_{f \in H} A^{*}_f(0).
\]
However, according to the definitions,
\[
  \inf_{f \in H} A^{*}_f(\lambda_n)
  = R_{L^{\star},P}(f_{L,P,\lambda_n}) + \lambda_n \|f_{L,P,\lambda_n}\|_H^2 - R^{*}_{L^{\star},P,H}
  \qquad \forall\, n \in \mathbb{N}
\]
and $\inf_{f \in H} A^{*}_f(0) = 0$. Hence,
\[
  0 \le \limsup_{n \to \infty} \Bigl( R_{L^{\star},P}(f_{L,P,\lambda_n}) - R^{*}_{L^{\star},P,H} \Bigr)
  \le \limsup_{n \to \infty} \Bigl( \inf_{f \in H} A^{*}_f(\lambda_n) - \inf_{f \in H} A^{*}_f(0) \Bigr) = 0.
\]
Proof of Theorem 4. Since the quantile function $f^{*}_{\tau,P}$ attains the minimal risk $R^{*}_{L^{\star},P}$ for the pinball loss L = Lτ (Koenker,
2005, Section 1.3), the assumption $f^{*}_{\tau,P} \in F$ implies $R^{*}_{L^{\star},P,F} = R^{*}_{L^{\star},P}$. Hence, an application of Theorem 3 yields
\[
  R_{L^{\star},P}(f_{L,D_n,\lambda_n}) \longrightarrow R^{*}_{L^{\star},P} \qquad (n \to \infty)
  \tag{B.13}
\]
in probability. It is shown in Christmann et al. (2009, Corollary 31) that, for all sequences $(f_n)_{n \in \mathbb{N}}$ of measurable functions
$f_n : X \to \mathbb{R}$,
\[
  R_{L^{\star},P}(f_n) \longrightarrow R^{*}_{L^{\star},P}
  \quad\text{implies}\quad
  d_0\bigl(f_n, f^{*}_{\tau,P}\bigr) \longrightarrow 0.
\]
This proves Theorem 4 in the following way. According to the characterization of convergence in probability by means of
almost surely convergent subsequences (Dudley, 2002, Theorem 9.2.1), it follows from (B.13) that, for every subsequence of
$R_{L^{\star},P}(f_{L,D_n,\lambda_n})$, $n \in \mathbb{N}$, there is a further subsequence which converges almost surely to $R^{*}_{L^{\star},P}$. Hence, according to the cited
result (Christmann et al., 2009, Corollary 31), for every subsequence of
\[
  d_0\bigl(f_{L,D_n,\lambda_n}, f^{*}_{\tau,P}\bigr), \qquad n \in \mathbb{N},
\]
there is a further subsequence which converges almost surely to 0. That is, $d_0\bigl(f_{L,D_n,\lambda_n}, f^{*}_{\tau,P}\bigr) \to 0$ in probability.
References
Berlinet, A., Thomas-Agnan, C., 2004. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Boston, MA.
Blanchard, G., Bousquet, O., Massart, P., 2008. Statistical performance of support vector machines. The Annals of Statistics 36 (2), 489–531.
Caponnetto, A., De Vito, E., 2007. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics 7 (3), 331–368.
Christmann, A., Steinwart, I., 2010. Universal kernels on non-standard input spaces. In: Lafferty, J., Williams, C.K.I., Shawe-Taylor, J., Zemel, R., Culotta, A.
(Eds.), Advances in Neural Information Processing Systems, vol. 23. pp. 406–414.
Christmann, A., Van Messem, A., 2008. Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning
Research 9, 915–936.
Christmann, A., Van Messem, A., Steinwart, I., 2009. On consistency and robustness properties of support vector machines for heavy-tailed distributions.
Statistics and Its Interface 2, 311–327.
Cucker, F., Zhou, D.X., 2007. Learning Theory. An Approximation Theory Viewpoint. Cambridge University Press, Cambridge.
De Bock, K., Coussement, K., Van den Poel, D., 2010. Ensemble classification based on generalized additive models. Computational Statistics and Data
Analysis 54, 1535–1546.
Devroye, L.P., 1982. Any discrimination rule can have an arbitrarily bad probability of error for finite sample size. IEEE Transactions on Pattern Analysis and
Machine Intelligence 4, 154–157.
Dudley, R.M., 2002. Real Analysis and Probability. Cambridge University Press, Cambridge.
Fahrmeir, L., Kneib, T., Lang, S., 2007. Regression: Modell, Methode und Anwendungen. Springer, Berlin.
Fenske, N., Kneib, T., Hothorn, T., 2009. Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression.
http://epub.ub.uni-muenchen.de/10510/.
Ferraty, F., Vieu, P., 2009. Additive prediction and boosting for functional data. Computational Statistics and Data Analysis 53, 1400–1413.
Friedman, J.H., Stuetzle, W., 1981. Projection pursuit regression. Journal of the American Statistical Association 76, 817–823.
Hable, R., Christmann, A., 2011. On qualitative robustness of support vector machines. Journal of Multivariate Analysis 102, 993–1007.
Hampel, F.R., 1968. Contributions to the theory of robust estimation. Unpublished Ph.D. Thesis, Department of Statistics, University of California, Berkeley.
Hampel, F.R., 1974. The influence curve and its role in robust estimation. Journal of the American Statistical Association 69, 383–393.
Hastie, T., Tibshirani, R., 1990. Generalized Additive Models. Chapman and Hall, London.
Hein, M., Bousquet, O., 2005. Hilbertian metrics and positive definite kernels on probability measures. In: Ghahramani, Z., Cowell, R. (Eds.), AISTATS.
pp. 136–143.
Huber, P.J., 1967. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proc. 5th Berkeley Symp. vol. 1, pp. 221–233.
Joachims, T., 1999. Making large-scale SVM learning practical. In: Schölkopf, B., Burges, C., Smola, A. (Eds.), Advances in Kernel Methods–Support Vector
Learning. MIT Press, Cambridge, MA, pp. 41–56.
Joachims, T., 2002. Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers, Boston.
Koenker, R., 2005. Quantile Regression. Cambridge University Press, Cambridge.
Lafferty, J., Lebanon, G., 2005. Diffusion kernels on statistical manifolds. Journal of Machine Learning Research 6, 129–163.
Meier, L., van de Geer, S., Bühlmann, P., 2009. High-dimensional additive modeling. The Annals of Statistics 37 (6B), 3779–3821.
Mendelson, S., Neeman, J., 2010. Regularization in kernel learning. The Annals of Statistics 38 (1), 526–565.
Robinson, S.M., 1991. An implicit-function theorem for a class of nonsmooth functions. Mathematics of Operations Research 16, 292–309.
Rousseeuw, P.J., Croux, C., 1993. Alternatives to the median absolute deviation. Journal of the American Statistical Association 88 (424), 1273–1283.
Rüping, S., 2000. mySVM-Manual. Department of Computer Science, University of Dortmund. www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM.
Schmid, M., Hothorn, T., 2008. Boosting additive models using component-wise P-splines. Computational Statistics and Data Analysis 53 (2), 298–311.
Schölkopf, B., Smola, A.J., 2002. Learning with Kernels. Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.
Schölkopf, B., Smola, A.J., Müller, K.-R., 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299–1319.
Schölkopf, B., Tsuda, K., Vert, J.P., 2004. Kernel Methods in Computational Biology. MIT Press, Cambridge, MA.
Shafik, N., Tutz, G., 2009. Boosting nonlinear additive autoregressive time series. Computational Statistics and Data Analysis 53, 2453–2464.
Sriperumbudur, B., Fukumizu, K., Gretton, A., Lanckriet, G., Schölkopf, B., 2009. Kernel choice and classifiability for RKHS embeddings of probability
distributions. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (Eds.), Advances in Neural Information Processing Systems,
vol. 22. pp. 1750–1758.
Steinwart, I., 2001. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research 2, 67–93.
Steinwart, I., Christmann, A., 2008. Support Vector Machines. Springer, New York.
Steinwart, I., Christmann, A., 2009. Fast learning from non-i.i.d. observations. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (Eds.),
Advances in Neural Information Processing Systems, vol. 22. pp. 1768–1776.
Steinwart, I., Hush, D., Scovel, C., 2009. Optimal rates for regularized least squares regression. In: Proceedings of the 22nd Conference on Learning Theory
(COLT 2009).
Steinwart, I., Scovel, C., 2007. Fast rates for support vector machines using Gaussian kernels. The Annals of Statistics 35 (2), 575–607.
Tutz, G., Binder, H., 2007. Boosting ridge regression. Computational Statistics and Data Analysis 51 (12), 6044–6059.
Vapnik, V.N., 1998. Statistical Learning Theory. John Wiley & Sons, New York.
Wood, S.N., 2006. Generalized Additive Models. Chapman & Hall/CRC, Boca Raton, FL.
Xu, Y.-L., Chen, D.-R., 2008. Learning rates of regularized regression for exponentially strongly mixing sequence. Journal of Statistical Planning and Inference
138 (7), 2180–2189.
Yee, T.W., Wild, C.J., 1996. Vector generalized additive models. Journal of the Royal Statistical Society. Series B. Methodological 58 (3), 481–493.