Computational Statistics and Data Analysis 56 (2012) 854–873
Consistency of support vector machines using additive kernels for
additive models
Andreas Christmann, Robert Hable ∗
University of Bayreuth, Department of Mathematics, D-95440 Bayreuth, Germany
Article info
Article history:
Available online 19 April 2011
Keywords:
Support vector machine
SVM
Additive model
Consistency
Robustness
Kernel
Abstract
Support vector machines (SVMs) are special kernel based methods and have been among
the most successful learning methods for more than a decade. SVMs can informally be
described as kinds of regularized M-estimators for functions and have demonstrated
their usefulness in many complicated real-life problems. During the last few years a
great part of the statistical research on SVMs has concentrated on the question of how
to design SVMs such that they are universally consistent and statistically robust for
nonparametric classification or nonparametric regression purposes. In many applications,
some qualitative prior knowledge of the distribution P or of the unknown function f to be
estimated is present or a prediction function with good interpretability is desired, such that
a semiparametric model or an additive model is of interest. The question of how to design
SVMs by choosing the reproducing kernel Hilbert space (RKHS) or its corresponding kernel
to obtain consistent and statistically robust estimators in additive models is addressed. An
explicit construction of such RKHSs and their kernels, which will be called additive kernels,
is given. SVMs based on additive kernels will be called additive support vector machines.
The use of such additive kernels leads, in combination with a Lipschitz continuous loss
function, to SVMs with the desired properties for additive models. Examples include
quantile regression based on the pinball loss function, regression based on the ϵ-insensitive
loss function, and classification based on the hinge loss function.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
Kernel methods such as support vector machines have been among the most successful learning methods for more than
a decade; see Schölkopf and Smola (2002), Cucker and Zhou (2007) and Steinwart and Christmann (2008). Examples include
classification or regression models where we have an input space X, an output space Y, some unknown probability measure
P on X × Y, and an unknown function f : X → R which describes the quantity of interest, e.g. the conditional quantile
curve, of the conditional distributions P(·|x), x ∈ X. Support vector machines can informally be described as kinds of
regularized M-estimators for functions (see e.g. Steinwart and Christmann (2008, p. 390f)), and have demonstrated their
usefulness in many complicated high-dimensional real-life problems. Besides several other nice features, one key argument
for using SVMs has been the so-called ‘‘kernel trick’’ (Schölkopf et al., 1998), which decouples the SVM optimization problem
from the domain of the samples, thus making it possible to use SVMs on virtually any input space X. This flexibility is in
strong contrast to the case for more classical learning methods from both machine learning and nonparametric statistics,
which almost always require input spaces X ⊂ Rd . As a result, kernel methods have been successfully used in various
application areas that were previously infeasible for machine learning methods. We mention just three examples. (i) SVMs
have been used to analyze histogram data and colored images, where the input values are probability measures; see, e.g.,
Hein and Bousquet (2005), Sriperumbudur et al. (2009) and Christmann and Steinwart (2010). (ii) SVMs have been used for
text classification and web mining; see, e.g., Joachims (2002) and Lafferty and Lebanon (2005). (iii) SVMs based on kernels
from computational biology
have been used for trees and graphs; see the book by Schölkopf et al. (2004).

For a data set $D_n = \big((x_1, y_1), \ldots, (x_n, y_n)\big)$, the empirical SVM is defined as
$$f_{L,D_n,\lambda} := \arg\inf_{f \in H} \frac{1}{n}\sum_{i=1}^{n} L\big(x_i, y_i, f(x_i)\big) + \lambda \|f\|_H^2, \qquad (1)$$
where ∥·∥H denotes the Hilbert space norm of H. That is, SVMs are based on three key components: (i) a convex loss function
L : X × Y × R → [0, ∞) used to measure the quality of the prediction f (x), (ii) a reproducing kernel Hilbert space (RKHS)
H of functions f : X → R used to specify the set of functions over which the expected loss is minimized, and (iii) the
regularization term λ ∥f ∥2H used to reduce the danger of overfitting and to guarantee the existence of a unique SVM even if L
is not strictly convex. The RKHS is often implicitly defined by specifying a kernel k : X × X → R. Details about the definition
of SVMs and some examples will be given in Section 2.
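To make (1) concrete, the following R sketch (ours, for illustration only) evaluates the regularized empirical risk for a candidate function written as a kernel expansion f = Σ_i α_i k(·, x_i); by the reproducing property, f(x_j) is then the j-th entry of Kα and ∥f∥²_H = αᵀKα, where K is the kernel matrix, and the loss is assumed not to depend on x:

# Regularized empirical risk in (1) for f = sum_i alpha_i k(., x_i):
# f(x_j) = (K alpha)_j and ||f||_H^2 = alpha' K alpha.
reg_emp_risk <- function(alpha, K, y, loss, lambda) {
  f_values <- as.vector(K %*% alpha)                      # f(x_1), ..., f(x_n)
  mean(loss(y, f_values)) + lambda * drop(t(alpha) %*% K %*% alpha)
}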
During the last few years a great part of the statistical research on SVMs has concentrated on the central question of how
to choose the loss function L, the RKHS H or its kernel k, and sequences of regularization parameters λn to guarantee that
SVMs are universally consistent and statistically robust for classification and regression purposes. In a nutshell, it turned
out in a purely nonparametric setup that SVMs based on the combination of a Lipschitz continuous loss function and a
bounded continuous kernel with a dense and separable RKHS are universally consistent with desirable statistical robustness
properties for any probability measure P from which we observed the data set; see, e.g., Steinwart and Christmann (2008)
and Christmann et al. (2009) for details. Here we only mention the combination of a Gaussian RBF kernel with the Lipschitz
continuous pinball loss function for nonparametric quantile regression; see Sections 2 and 4.
Although a nonparametric approach is often the best choice in practice due to the lack of prior knowledge on P, a
semiparametric approach or an additive model (Friedman and Stuetzle, 1981; Hastie and Tibshirani, 1990) can also be
valuable. Sometimes prior knowledge on P is known from previous experiments or from some theory specifying in some
sense how the data set was generated. Another example is that we may be interested, for practical reasons, only in functions
f which offer a nice interpretation because an interpretable prediction function can be crucial if the prediction f (x) has to be
explicable to clients. This can be the case, for example, if the prediction is the expected claim amount of a client and these
predictions are the basis for the construction of an insurance tariff.
Here we will mainly consider additive models, although models with a multiplicative structure or even additive and
multiplicative components can also be of interest and can be treated in an analogous manner. More precisely, for some s ∈ N, the input space X is split up into s non-empty spaces according to
$$X = X_1 \times \cdots \times X_s \qquad (2)$$
and only additive functions f : X → R of the form
$$f(x_1, \ldots, x_s) = f_1(x_1) + \cdots + f_s(x_s), \qquad x_j \in X_j,$$
are considered, where fj : Xj → R for j ∈ {1, . . . , s}.
To the best of our knowledge, there are currently no results on consistency and statistical robustness published on SVMs
based on kernels designed for such additive models. Of course, one can use one of the purely nonparametric SVMs described
above, but the hope is that SVMs based on kernels especially designed for such situations may offer better results.
In this paper we therefore address the question of how to design specific SVMs for additive models. The main goal of this
paper is to give an explicit construction principle for kernels – and thus for their RKHSs – and to show that such SVMs are
consistent under weaker assumptions than the usual assumptions of purely nonparametric SVMs. Furthermore, we mention
that the combination of a bounded additive kernel with a Lipschitz continuous loss function leads to statistically robust SVMs
for additive models. Examples include SVMs in additive models for quantile regression based on the pinball loss function,
for regression based on the ϵ-insensitive loss function, and for classification based on the hinge loss function.
The rest of the paper is organized as follows. In Section 2 we collect some known results on loss functions, kernels and
their RKHSs, and on support vector machines. These results are needed to state our results on consistency and statistical
robustness of SVMs for additive models in Section 3. Our main theoretical result is Theorem 3 which shows consistency
of SVMs based on additive kernels. Although results on the rates of convergence or the computation of the corresponding
constants for these rates are far beyond the scope of this paper, our numerical examples given in Section 4.1 will demonstrate
that SVMs based on kernels designed for additive models can outperform standard nonparametric SVMs if the assumption
of an additive model is valid. Section 4.2 contains a small simulation study comparing SVMs based on additive kernels for
quantile regression with a recently proposed boosting approach. In Section 4.3, we apply both approaches to a real-life data
set and compare the results. Section 5 contains the discussion. Some details on how to compute SVMs on the basis of the
pinball loss function for additive models are given in Appendix A. All proofs are given in Appendix B.
2. Background on support vector machines
Let X be a complete separable metric space and let Y be a closed subset of R. We will always use the corresponding
Borel-σ -algebras. The set of all probability measures on the Borel-σ -algebra of X × Y is denoted by M1 (X × Y). The
random input variables X1 , . . . , Xn take their values in X and the random output variables Y1 , . . . , Yn take their values in Y.
It is assumed that (X1 , Y1 ), . . . , (Xn , Yn ) are independent and identically distributed according to some unknown probability
measure P ∈ M1 (X × Y). Since Y ⊂ R is closed, P can be split into the marginal distribution PX on X and the conditional
distribution P(· | x) of Y given X = x. We mention that this covers of course the special case where Yi given Xi = xi is
distributed with median f (xi ) and heteroscedastic error term ϵi g (xi ), where f and g are unknown functions from X to R.
The goal is to find a good prediction function f : X → R such that f (x) estimates the quantity of interest of the conditional
distributions of Y given X = x (e.g. the conditional quantile or mean). The quality of an estimate t = f (x) is measured
through a loss function
$$L : X \times Y \times \mathbb{R} \to [0, \infty), \qquad (x, y, t) \mapsto L(x, y, t).$$
It is assumed that L is measurable and L(x, y, y) = 0 for every (x, y) ∈ X × Y—that is, the loss is zero if the prediction t equals the actual value y of the output variable. In addition, we make the standard assumption that
$$L(x, y, \cdot) : \mathbb{R} \to [0, \infty), \qquad t \mapsto L(x, y, t)$$
is convex for every (x, y) ∈ X × Y and that additionally the following uniform Lipschitz property is fulfilled for some real number |L|1 ∈ (0, ∞):
$$\sup_{(x,y) \in X \times Y} |L(x, y, t) - L(x, y, t')| \le |L|_1 \cdot |t - t'| \qquad \forall t, t' \in \mathbb{R}. \qquad (3)$$
We restrict our attention to such Lipschitz continuous loss functions because the use of loss functions which are not Lipschitz
continuous (such as the least squares loss which is only locally Lipschitz continuous on unbounded domains) usually conflicts
with robustness; see, e.g., Steinwart and Christmann (2008, Section 10.4).
The quality of a (measurable) predictor f : X → R is measured through the risk
$$\mathcal{R}_{L,P}(f) = \int_{X \times Y} L\big(x, y, f(x)\big)\, P\big(d(x, y)\big).$$
By different choices of Y and the loss function L, different purposes are covered by this setup—e.g. binary classification for Y = {−1; +1} and the hinge loss
$$L_{\mathrm{hinge}}(x, y, t) := \max\{0, 1 - yt\},$$
regression for Y = R and, for ϵ > 0, the ϵ-insensitive loss
$$L_\epsilon(x, y, t) := \max\{0, |y - t| - \epsilon\},$$
and quantile regression for Y = R and the pinball loss
$$L_\tau(x, y, t) := \begin{cases} (\tau - 1)(y - t), & \text{if } y - t < 0, \\ \tau (y - t), & \text{if } y - t \ge 0, \end{cases} \qquad (4)$$
where τ ∈ (0, 1).
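For illustration, these three loss functions can be transcribed directly into R (our sketch, vectorized over y and t; the argument names simply mirror the formulas above):

hinge_loss <- function(y, t) pmax(0, 1 - y * t)                        # Y = {-1, +1}
eps_insensitive_loss <- function(y, t, eps) pmax(0, abs(y - t) - eps)  # eps > 0
pinball_loss <- function(y, t, tau) ifelse(y - t < 0, (tau - 1) * (y - t), tau * (y - t))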
An optimal predictor is a measurable function f∗ : X → R which attains the minimal risk, called the Bayes risk,
$$\mathcal{R}^*_{L,P} = \inf_{\substack{f : X \to \mathbb{R} \\ \text{measurable}}} \mathcal{R}_{L,P}(f). \qquad (5)$$
The optimal predictor in a set F of measurable functions f : X → R is a function f∗ ∈ F which attains the minimal risk
$$\mathcal{R}^*_{L,P,\mathcal{F}} = \inf_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f).$$
For example, the goal of quantile regression is to estimate a conditional quantile function, i.e., a function f*τ,P : X → R such that
$$P\big((-\infty, f^*_{\tau,P}(x)] \,\big|\, x\big) \ge \tau \quad \text{and} \quad P\big([f^*_{\tau,P}(x), \infty) \,\big|\, x\big) \ge 1 - \tau$$
for the quantile τ ∈ (0, 1). If f*τ,P ∈ F, then the conditional quantile function f*τ,P attains the minimal risk R*_{Lτ,P,F} for the pinball loss Lτ (with parameter τ), so quantile regression can be done by trying to minimize the risk R_{Lτ,P} in F.
One way to build a nonparametric predictor f is to use a support vector machine
$$f_{L,P,\lambda} := \arg\inf_{f \in H} \mathcal{R}_{L,P}(f) + \lambda \|f\|_H^2, \qquad (6)$$
where H is a reproducing kernel Hilbert space (RKHS) of a measurable kernel k : X × X → R, and λ > 0 is a regularization
parameter used to reduce the danger of overfitting; see e.g., Vapnik (1998), Schölkopf and Smola (2002) or Steinwart and
Christmann (2008) for details.
The reproducing property of k states that, for all f ∈ H and all x ∈ X,
$$f(x) = \langle f, \Phi(x)\rangle_H,$$
where Φ : X → H, x ↦ k(·, x), denotes the canonical feature map. A kernel k is called bounded if
$$\|k\|_\infty := \sup_{x \in X} \sqrt{k(x, x)} < \infty.$$
Using the reproducing property and $\|\Phi(x)\|_H = \sqrt{k(x, x)}$, we obtain the well-known inequalities
$$\|f\|_\infty \le \|k\|_\infty \|f\|_H \qquad (7)$$
and
$$\|\Phi(x)\|_\infty \le \|k\|_\infty \|\Phi(x)\|_H \le \|k\|_\infty^2 \qquad (8)$$
for all f ∈ H and all x ∈ X. We mention, as an example of a bounded kernel, the popular Gaussian radial basis function (GRBF) kernel defined by
$$k_\gamma(x, x') = \exp\big(-\gamma^{-2} \|x - x'\|_{\mathbb{R}^d}^2\big), \qquad x, x' \in X, \qquad (9)$$
where γ is some positive constant and X ⊂ Rd. This kernel leads to a large RKHS which is dense in L1(µ) for all probability measures µ on Rd. We will also consider the polynomial kernel
$$k_{m,c}(x, x') = \big(\langle x, x'\rangle_{\mathbb{R}^d} + c\big)^m, \qquad x, x' \in X,$$
where m ∈ (0, ∞), c ∈ (0, ∞) and X ⊂ Rd. The dot kernel is a special polynomial kernel with c = 0 and m = 1. The polynomial kernel is bounded if and only if X is bounded.
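For concreteness, a small base R sketch (ours, illustration only) of the GRBF kernel (9), the polynomial kernel, and the corresponding kernel matrix for a sample stored row-wise in a matrix X:

grbf_kernel <- function(x, xp, gamma) exp(-sum((x - xp)^2) / gamma^2)  # eq. (9)
poly_kernel <- function(x, xp, m, c) (sum(x * xp) + c)^m               # polynomial kernel
kernel_matrix <- function(X, kern, ...) {
  # plain double loop for clarity; vectorize for larger samples
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) K[i, j] <- kern(X[i, ], X[j, ], ...)
  K
}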
Of course, the regularized risk
$$\mathcal{R}^{\mathrm{reg}}_{L,P,\lambda}(f) := \mathcal{R}_{L,P}(f) + \lambda \|f\|_H^2$$
is in general not computable, because P is unknown. However, the empirical distribution
$$\mathrm{D}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(x_i, y_i)}$$
corresponding to the data set $D_n = \big((x_1, y_1), \ldots, (x_n, y_n)\big)$ can be used as an estimator of P. Here δ(xi,yi) denotes the Dirac distribution in (xi, yi). If we replace P by Dn in (6), we obtain the regularized empirical risk $\mathcal{R}^{\mathrm{reg}}_{L,\mathrm{D}_n,\lambda}(f)$ and the empirical SVM fL,Dn,λ. Furthermore, we need analogous notions where (xi, yi) is replaced by the random variables (Xi, Yi). Thus, we define
$$\mathrm{D}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(X_i, Y_i)}.$$
Then, for every ω ∈ Ω, Dn(ω) is the empirical distribution corresponding to the data set (X1(ω), Y1(ω)), . . . , (Xn(ω), Yn(ω)) and, accordingly, $\mathcal{R}^{\mathrm{reg}}_{L,\mathrm{D}_n,\lambda}(f)$ denotes the mapping Ω → R, ω ↦ $\mathcal{R}^{\mathrm{reg}}_{L,\mathrm{D}_n(\omega),\lambda}(f)$, and $f_{L,\mathrm{D}_n,\lambda}$ denotes the mapping Ω → H, ω ↦ $f_{L,\mathrm{D}_n(\omega),\lambda}$.
Support vector machines fL,P,λ need not exist for every probability measure P ∈ M1(X × Y), but they exist and are unique for every data set; for Lipschitz continuous loss functions it is sufficient for the existence of fL,P,λ that $\int L(x, y, 0)\, P\big(d(x, y)\big) < \infty$. This condition may be violated by heavy-tailed distributions P and, in this case, it is possible that RL,P(f) = ∞ for every f ∈ H.
In order to enlarge the applicability of support vector machines to heavy-tailed distributions, the following extension has been developed in Christmann et al. (2009). Following an idea already used by Huber (1967) for M-estimates in parametric models, a shifted loss function L∗ : X × Y × R → R is defined by
$$L^*(x, y, t) = L(x, y, t) - L(x, y, 0) \qquad \forall (x, y, t) \in X \times Y \times \mathbb{R}.$$
Then, define, similarly to the original loss function L, the L∗-risk by
$$\mathcal{R}_{L^*,P}(f) = \int L^*\big(x, y, f(x)\big)\, P\big(d(x, y)\big)$$
and the regularized L∗-risk by
$$\mathcal{R}^{\mathrm{reg}}_{L^*,P,\lambda}(f) = \mathcal{R}_{L^*,P}(f) + \lambda \|f\|_H^2$$
for every f ∈ H. In complete analogy to (6), we define the support vector machine based on the shifted loss function L∗ by
$$f_{L,P,\lambda} := \arg\inf_{f \in H} \mathcal{R}_{L^*,P}(f) + \lambda \|f\|_H^2. \qquad (10)$$
If the support vector machine fL,P,λ defined by (6) exists, we have seemingly defined fL,P,λ in two different ways now. However, the two definitions coincide in this case, and the following theorem summarizes some basic results of Christmann et al. (2009).
Theorem 1. Let L be a convex and Lipschitz continuous loss function and let k be a bounded kernel. Then, for every P ∈ M1(X × Y) and every λ ∈ (0, ∞), there exists a unique SVM fL,P,λ ∈ H which minimizes $\mathcal{R}^{\mathrm{reg}}_{L^*,P,\lambda}$, i.e.
$$\mathcal{R}_{L^*,P}(f_{L,P,\lambda}) + \lambda\|f_{L,P,\lambda}\|_H^2 = \inf_{f \in H} \mathcal{R}_{L^*,P}(f) + \lambda\|f\|_H^2.$$
If the support vector machine fL,P,λ defined by (6) exists, then the two definitions (6) and (10) coincide.
3. Support vector machines for additive models
3.1. The model and assumptions
As described in the previous section, the goal is to minimize the risk f → RL,P (f ) in a set F of functions f : X → R. In
this article, we assume an additive model. Accordingly, let
X = X1 × · · · × Xs
where X1 , . . . , Xs are non-empty sets. For every j ∈ {1, . . . , s}, let Fj be a set of functions fj : Xj → R. Then, we only
consider functions f : X → R of the form
$$f(x_1, \ldots, x_s) = f_1(x_1) + \cdots + f_s(x_s) \qquad \forall (x_1, \ldots, x_s) \in X_1 \times \cdots \times X_s$$
for f1 ∈ F1, . . . , fs ∈ Fs. Thus,
$$\mathcal{F} := \{ f_1 + \cdots + f_s : f_j \in F_j,\ 1 \le j \le s \}. \qquad (11)$$
In (11), we have identified fj with the map X → R, (x1, . . . , xs) → fj(xj).
Such additive models can be treated using support vector machines in a very natural way. For every j ∈ {1, . . . , s}, choose a kernel kj on Xj with RKHS Hj. Then, the space of functions
$$H := \{ f_1 + \cdots + f_s : f_j \in H_j,\ 1 \le j \le s \}$$
is an RKHS on X = X1 × · · · × Xs with additive kernel k = k1 + · · · + ks such that kj : Xj × Xj → R, j ∈ {1, . . . , s}, are kernels and
$$k\big((x_1, \ldots, x_s), (x'_1, \ldots, x'_s)\big) = k_1(x_1, x'_1) + \cdots + k_s(x_s, x'_s)$$
for every xj , x′j ∈ Xj , j ∈ {1, . . . , s}; see Theorem 2. In this way, SVMs can be used to fit additive models and SVMs enjoy at
least three appealing features. First, it is guaranteed that the predictor has the assumed additive structure (x1 , . . . , xs ) →
f1 (x1 ) + · · · + fs (xs ). Second, it is possible to still use the standard SVM machinery including the kernel trick (Schölkopf and
Smola, 2002, Section 2) and implementations of SVMs—just by selecting a kernel k = k1 + · · · + ks . Third, the possibility
of choosing different kernels k1 , . . . , ks offers a great flexibility. For example, take s = 2 and let k1 be a GRBF kernel on
Rd1 and k2 be a GRBF kernel on Rd2 . Since the RKHS of a Gaussian kernel is an infinite dimensional function space, we
get nonparametric estimates of f1 and f2 . As a second example, consider a semiparametric model with f = f1 + f2 where
f1 : x1 → f1 (x1 ) is assumed to be a polynomial function of order at most m and f2 : x2 → f2 (x2 ) may be some complicated
function. Then, this semiparametric model can be treated by simply taking a polynomial kernel on Rd1 for k1 and a GRBF
kernel on Rd2 for k2 . This can be used, for example, in order to model changes in space (for d1 ≤ 3 and x1 specifying the
location) or in time (for d1 = 1 and x1 specifying the point in time).
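Following the s = 2 examples just discussed, the construction can be sketched in a few lines of R (ours, illustration only; it reuses the kernel helpers from the Section 2 sketch and assumes the observations of the two parts are stored row-wise in matrices X1 and X2 with kernel parameters gamma1 and gamma2 chosen beforehand). The kernel matrix of the additive kernel is simply the sum of the blockwise kernel matrices:

# nonparametric additive kernel: GRBF on X1 plus GRBF on X2
K_add  <- kernel_matrix(X1, grbf_kernel, gamma = gamma1) +
          kernel_matrix(X2, grbf_kernel, gamma = gamma2)
# semiparametric variant: polynomial kernel of order 2 on X1, GRBF kernel on X2
K_semi <- kernel_matrix(X1, poly_kernel, m = 2, c = 1) +
          kernel_matrix(X2, grbf_kernel, gamma = gamma2)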
Theorem 2. For every j ∈ {1, . . . , s}, let Xj be a non-empty set and
$$k_j : X_j \times X_j \to \mathbb{R}, \qquad (x_j, x'_j) \mapsto k_j(x_j, x'_j),$$
be a kernel with corresponding RKHS Hj. Define k = k1 + · · · + ks. That is,
$$k\big((x_1, \ldots, x_s), (x'_1, \ldots, x'_s)\big) = k_1(x_1, x'_1) + \cdots + k_s(x_s, x'_s)$$
for every xj, x′j ∈ Xj, j ∈ {1, . . . , s}. Then, k is a kernel on X = X1 × · · · × Xs with RKHS
$$H := \{ f_1 + \cdots + f_s : f_j \in H_j,\ 1 \le j \le s \}$$
and the norm of H, given in (B.2), fulfills
$$\|f_1 + \cdots + f_s\|_H^2 \le \|f_1\|_{H_1}^2 + \cdots + \|f_s\|_{H_s}^2 \qquad \forall f_1 \in H_1, \ldots, f_s \in H_s. \qquad (12)$$
If not otherwise stated, we make the following assumptions throughout the rest of the paper although some of the results
are also valid under more general conditions.
Main assumptions.
(i) For every j ∈ {1, . . . , s}, the set Xj is a complete, separable metric space; kj is a continuous and bounded kernel on Xj
with RKHS Hj . Furthermore, k = k1 + · · · + ks denotes the kernel on X = X1 × · · · × Xs defined in Theorem 2 and H
denotes its RKHS.
(ii) The subset Y ⊂ R is closed.
(iii) The loss function L is convex and fulfills the uniform Lipschitz continuity (3) with Lipschitz constant |L|1 ∈ (0, ∞). In
addition, L(x, y, y) = 0 for every (x, y) ∈ X × Y.
Note that every closed subset of Rd is a complete, separable metric space. We restrict ourselves to Lipschitz continuous loss
functions and continuous and bounded kernels because it was shown earlier that these assumptions are necessary in order
to ensure good robustness properties; see e.g. Steinwart and Christmann (2008, Section 10). The condition L(x, y, y) = 0 is
quite natural and practically always fulfilled—it means that the loss of a correct prediction is 0. Our assumptions cover many
of the most interesting cases. In particular, the hinge loss (classification), the ϵ-insensitive loss (regression) and the pinball loss (quantile regression) fulfill all assumptions. Many commonly used kernels are continuous. In addition, the Gaussian kernel is always bounded, whereas the linear kernel and all polynomial kernels are bounded if and only if Xj is bounded. From the
assumption that the kernels kj are continuous and bounded on Xj , it follows that the kernel k = k1 + · · · + ks is continuous
and bounded on X.
3.2. Consistency
SVMs are called universally consistent if the risk of the SVM estimator fL,Dn,λn converges, for all probability measures P, in probability to the Bayes risk (see (5)), i.e.
$$\mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) \xrightarrow{\;P\;} \mathcal{R}^*_{L^*,P} \qquad (n \to \infty). \qquad (13)$$
In order to obtain universal consistency of SVMs, it is necessary to choose a kernel with a large RKHS. Accordingly for most
known results about universal consistency of SVMs, it is a crucial assumption that the RKHS is dense in C (X) where X is a
compact metric space (see e.g. Steinwart, 2001) or, at least, that the RKHS is dense in Lq (PX ) for some q ∈ [1, ∞). However,
this assumption is violated if we want to fit an additive model given by
$$\mathcal{F} = \{ f_1 + \cdots + f_s : f_j \in F_j,\ 1 \le j \le s \}$$
with correspondingly additive SVMs because the RKHS H = H1 + · · · + Hs , which is generated by the additive kernel, is not
dense in the whole space Lq (PX ) – apart from entirely degenerate cases. Therefore, existing results on universal consistency
are not applicable here. Nevertheless, additive SVMs are consistent in additive models as shown in Theorem 3. Instead of
the assumption, inappropriate in this context, that H is dense in the whole space Lq(PX), we only assume that Hj is dense in Fj for every j ∈ {1, . . . , s}.
As usual, $\mathcal{L}_q(\mu)$ denotes the set of all q-integrable real-valued functions with respect to some measure µ and $L_q(\mu)$ denotes the set of all equivalence classes in $\mathcal{L}_q(\mu)$.
We can now state our first main result.
Theorem 3. Let the main assumptions (i)–(iii) be valid. Let P ∈ M1(X × Y) be such that
$$H_j \subset F_j \subset L_1(P_{X_j}), \qquad 1 \le j \le s,$$
and let Hj be dense in Fj with respect to ∥ · ∥L1(PXj). Then, for every sequence (λn)n∈N ⊂ (0, ∞) such that limn→∞ λn = 0 and limn→∞ λ²n n = ∞,
$$\mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) \longrightarrow \mathcal{R}^*_{L^*,P,\mathcal{F}} \qquad (n \to \infty)$$
in probability.
In general, it is not clear whether convergence of the risks implies convergence of the SVM fL,Dn ,λn . However, the following
theorem will show such a convergence for quantile regression in an additive model—under the condition that the quantile
function fτ∗,P actually lies in F = F1 + · · · + Fs . In order to formulate this result, we define
$$d_0(f, g) = \int \min\{1, |f - g|\}\, dP_X,$$
where f , g : X → R are arbitrary measurable functions. It is known that d0 is a metric describing convergence in probability.
We can now state our second main result.
Theorem 4. Let the main assumptions (i)–(iii) be valid. Let P ∈ M1(X × Y) be such that
$$H_j \subset F_j \subset L_1(P_{X_j}) \qquad \forall j \in \{1, \ldots, s\}$$
and Hj is dense in Fj with respect to ∥ · ∥L1(PXj). Let τ ∈ (0, 1) and assume that the quantile function f*τ,P is PX-almost surely unique and that f*τ,P ∈ F. Then, for the pinball loss function L = Lτ and for every sequence (λn)n∈N ⊂ (0, ∞) such that limn→∞ λn = 0 and limn→∞ λ²n n = ∞,
$$d_0\big(f_{L,D_n,\lambda_n}, f^*_{\tau,P}\big) \longrightarrow 0 \qquad (n \to \infty)$$
in probability.
3.3. Robustness
During the last few years some general results on the statistical robustness properties of SVMs have been shown. Many
of these results are directly applicable to SVMs for additive models if the kernel is bounded and continuous (or at least
measurable) and the loss function is convex and Lipschitz continuous. We summarize some of the most important robustness
properties of additive SVMs in the following. For brevity we only give upper bounds for the bias and the Bouligand influence
function for SVMs, which are both even applicable for non-smooth loss functions like the pinball loss for quantile regression,
and refer the reader to Christmann et al. (2009) and Steinwart and Christmann (2008, Chap. 10) for results on the classical
influence function proposed by Hampel (1968, 1974) and to Hable and Christmann (2011) for the qualitative robustness of
SVMs.
Define the function
$$T : M_1(X \times Y) \to H, \qquad T(P) := f_{L,P,\lambda}, \qquad (14)$$
which maps each probability distribution to its SVM. In robust statistics we are interested in smooth and bounded functions
T , because this will give us stable SVMs within small neighborhoods of P. If an appropriately chosen derivative of T (P) is
bounded, then we expect the value of T (Q) to be close to the value of T (P) for distributions Q in a small neighborhood of P.
The next result shows that the H-norm of the difference of two SVMs increases with respect to the mixture proportion
ε ∈ (0, 1) at most linearly in gross-error neighborhoods. The norm of total variation of a signed measure µ is denoted by
∥µ∥M .
Theorem 5 (Bounds for Bias). If the main assumptions (i)–(iii) are valid, then we have, for all λ > 0, all ε ∈ [0, 1], and all probability measures P and Q on X × Y, that
$$\|T(Q) - T(P)\|_\infty \le c \cdot \|P - Q\|_M, \qquad (15)$$
$$\|T\big((1-\varepsilon)P + \varepsilon Q\big) - T(P)\|_\infty \le c\, \|P - Q\|_M \cdot \varepsilon, \qquad (16)$$
where $c = \frac{1}{\lambda}\|k\|_\infty^2 |L|_1$.
Because of (8), there are analogous bias bounds of SVMs with respect to the norm in H if we replace c by $\tilde{c} := \frac{1}{\lambda}\|k\|_\infty |L|_1$.
While Hampel’s influence function is related to a Gâteaux derivative which is linear, the Bouligand influence function is
related to the Bouligand derivative which needs only to be positive homogeneous. Because this weak derivative is less known
in statistics, we would like to recall its definition. Let E1 and E2 be normed linear spaces. A function f : E1 → E2 is called
positive homogeneous if f (α x) = α f (x) for all α ≥ 0 and for all x ∈ E1 . If U is an open subset of E1 , then a function f : U → E2
is called Bouligand differentiable at a point x0 ∈ U if there exists a positive homogeneous function ∇^B f(x0) : U → E2 such that
$$\lim_{h \to 0} \frac{\big\| f(x_0 + h) - f(x_0) - \nabla^B f(x_0)(h) \big\|_{E_2}}{\|h\|_{E_1}} = 0$$
(see Robinson (1991)).
The Bouligand influence function (BIF) of the map T : M1(X × Y) → H for a distribution P in the direction of a distribution Q ̸= P was defined by Christmann and Van Messem (2008) as the function BIF(Q; T, P) ∈ H satisfying
$$\lim_{\varepsilon \downarrow 0} \frac{\big\| T\big((1-\varepsilon)P + \varepsilon Q\big) - T(P) - \varepsilon\,\mathrm{BIF}(Q; T, P) \big\|_H}{\varepsilon} = 0. \qquad (17)$$
Note that the BIF is a special Bouligand derivative,
$$\lim_{\|\varepsilon(Q-P)\| \to 0} \frac{\big\| T\big(P + \varepsilon(Q - P)\big) - T(P) - \varepsilon\,\mathrm{BIF}(Q; T, P) \big\|_H}{\|\varepsilon(Q - P)\|} = 0,$$
due to the fact that Q and P are fixed, and it is independent of the norm on M1 (X × Y). The partial Bouligand derivative
with respect to the third argument of L∗ is denoted by ∇₃^B L∗(x, y, t). The BIF shares with Hampel's influence function the interpretation that it measures the impact of an infinitesimally small amount of contamination of the original distribution P in
the direction of Q on the quantity of interest T (P). It is thus desirable that the function T has a bounded BIF. It is known that
existence of the BIF implies the existence of the IF and in this case they are equal. The next result shows that, under some
conditions, the Bouligand influence function of SVMs exists and is bounded; see Christmann et al. (2009) for more related
results.
Theorem 6 (Bouligand Influence Function). Let the main assumptions (i)–(iii) be valid, but assume that X is a complete separable normed linear space.¹ Let P, Q ∈ M1(X × Y). Let L be the pinball loss function Lτ with τ ∈ (0, 1) or let L be the ϵ-insensitive loss function Lϵ with ϵ > 0. Assume that for all δ > 0 there exist positive constants ξP, ξQ, cP, and cQ such that for all t ∈ R with |t − fL,P,λ(x)| ≤ δ∥k∥∞ the following inequalities hold for all a ∈ [0, 2δ∥k∥∞] and x ∈ X:
$$P\big([t, t+a] \,\big|\, x\big) \le c_P\, a^{1+\xi_P} \quad \text{and} \quad Q\big([t, t+a] \,\big|\, x\big) \le c_Q\, a^{1+\xi_Q}. \qquad (18)$$
Then the Bouligand influence function BIF(Q; T, P) of T(P) := fL,P,λ exists, is bounded, and equals
$$\frac{1}{2\lambda}\Big( \mathbb{E}_P\big[\nabla_3^B L^*(X, Y, f_{L,P,\lambda}(X))\,\Phi(X)\big] - \mathbb{E}_Q\big[\nabla_3^B L^*(X, Y, f_{L,P,\lambda}(X))\,\Phi(X)\big] \Big). \qquad (19)$$
Note that the Bouligand influence function of the SVM only depends on Q via the second term in (19). The interpretation
of the condition (18) is that the probability that Y given x is in some small interval around the SVM is essentially at most
proportional to the length of the interval to some power greater than 1.
For the pinball loss function, the BIF given in (19) simplifies to
$$\frac{1}{2\lambda}\int_X \Big( P\big((-\infty, f_{L,P,\lambda}(x)] \,\big|\, x\big) - \tau \Big)\,\Phi(x)\, P_X(dx) \;-\; \frac{1}{2\lambda}\int_X \Big( Q\big((-\infty, f_{L,P,\lambda}(x)] \,\big|\, x\big) - \tau \Big)\,\Phi(x)\, Q_X(dx). \qquad (20)$$
The BIF of the SVM based on the pinball loss function can hence be interpreted as the difference of the integrated and $\frac{1}{2\lambda}\Phi(x)$-weighted differences between the estimated quantile level and the desired quantile level τ.
Recall that the BIF is a special Bouligand derivative and thus positive homogeneous in h = ε(Q − P). If the BIF exists, we then immediately obtain
$$f_{L,(1-\alpha\varepsilon)P+\alpha\varepsilon Q,\lambda} - f_{L,P,\lambda} = T(P + \alpha h) - T(P) = \alpha\, \mathrm{BIF}(Q; T, P) + o(\alpha h) = \alpha\big( T(P + h) - T(P) + o(h) \big) + o(\alpha h) = \alpha\big( f_{L,(1-\varepsilon)P+\varepsilon Q,\lambda} - f_{L,P,\lambda} \big) + o\big(\alpha\varepsilon(Q - P)\big) \qquad (21)$$
for all α ≥ 0. Hence this equation gives a nice approximation of the asymptotic bias term f_{L,(1−ε)P+εQ,λ} − f_{L,P,λ} if we consider the amount αε of contamination instead of ε.
1 Bouligand derivatives are only defined in normed linear spaces, e.g., for X ⊂ Rd a linear subspace.
4. Some numerical results
In this section some numerical results for finite sample sizes will be given to complement our theoretical results on
properties of SVMs for additive models. The goals of this section are threefold.
In Section 4.1, the asymptotic results of Section 3.2 are illustrated with some simulated data sets. The example visualizes
our result that SVMs based on an additive kernel converge to the true quantile function if the true quantile function is
additive. The additive SVM is compared to a standard SVM based on a classical d-dimensional Gaussian RBF kernel, for
which convergence is already well-known. It is illustrated that there can be a real benefit from using additive SVMs if the
assumption of an additive model is valid.
In Section 4.2 the results from a small simulation study will be given to compare the finite sample behavior of SVMs
based on additive kernels with a boosting approach using gamboost for additive quantile regression proposed by Schmid
and Hothorn (2008).
In Section 4.3 we apply SVMs based on an additive kernel and this boosting approach for additive quantile regression for
a real-life data set containing rent standard prices for dwellings in the German city Munich.
4.1. Some simulations to show the influence of the sample size
Let us consider the following situation of median regression. We have two independent input variables X1 and X2
each with a uniform distribution on the interval [0, 1], and, given x = (x1, x2), the output variable Y has a Cauchy distribution (and thus not even the first moment exists) with center f(x1, x2) := f1(x1) + f2(x2), where f1(x1) := 7 + 5x1² and f2(x2) := sin(5x2) cos(17x2). We used Cauchy error terms instead of the classical Gaussian error terms to check the
finite sample robustness properties of the SVM. Hence the true function f that we would like to estimate with SVMs has
an additive structure, where the first function is a polynomial of order 2 and the second function is a smooth and bounded
function but not a polynomial. Please note that here X = [0, 1]2 is bounded whereas Y = R is unbounded. As X is bounded,
even a polynomial kernel on X is bounded, which is not true for unbounded input spaces. We simulated three data sets of
this type with sample sizes n = 500, n = 2000, and n = 10, 000. We compare the exact function f with three SVMs fL,D,λn
fitted with the three data sets, where we use the pinball loss function with τ = 0.5 because we are interested in median
regression.
• Nonparametric SVM. We use an SVM based on the two-dimensional GRBF kernel k defined in (9) to fit f in a totally
nonparametric manner.
• Nonparametric additive SVM. We use an SVM based on the kernel k = k1 + k2 where k1 and k2 are one-dimensional GRBF
kernels.
• Semiparametric additive SVM. We use an SVM based on the kernel k = k1 + k2 where k1 is a polynomial kernel of order 2
for fitting the function f1 and k2 is a one-dimensional GRBF kernel for fitting the function f2 .
Our interest in these examples was in illustrating that SVMs are consistent in all of these situations. In order to stay in line
with the theoretical results obtained in Section 3.2, the kernel parameter γ = 2 was fixed and the regularization parameter
was set to λn = a · n^{−0.45} for the rather arbitrarily chosen a = 0.05. Note that the asymptotic results hold for any fixed γ
and a.
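The data-generating step of this experiment can be sketched in a few lines of R (ours; the seed and the shown sample size are arbitrary choices for illustration):

set.seed(1)
n  <- 2000
x1 <- runif(n); x2 <- runif(n)            # independent uniform covariates on [0, 1]
f1 <- function(x) 7 + 5 * x^2
f2 <- function(x) sin(5 * x) * cos(17 * x)
y  <- f1(x1) + f2(x2) + rcauchy(n)        # Cauchy errors; median of Y given x is f1(x1) + f2(x2)
lambda_n <- 0.05 * n^(-0.45)              # regularization parameter lambda_n = a * n^(-0.45), a = 0.05
gamma    <- 2                             # fixed GRBF kernel parameter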
From Figs. 1–3 we can draw the following conclusions for this special situation.
(i) If the additive model is valid, all three SVMs give comparable and reasonable results if the sample size n is large enough
even for Cauchy distributed error terms; see Fig. 1. This is in good agreement with the theoretical results derived in
Section 3.
(ii) If the sample size is small to moderate and if the assumed additive model is valid, then both SVMs based on kernels
especially designed for additive models show better results than the standard two-dimensional GRBF kernel; see Figs. 2
and 3.
(iii) The difference between the nonparametric additive SVM and semiparametric additive SVM was somewhat surprisingly
small for all three sample sizes, although the true function had the very special structure which favors the
semiparametric additive SVM.
4.2. Some simulations for comparing SVMs with gamboost
In the following situation, we consider quantile regression for the quantile levels τ = 0.5 (median regression) and
τ = 0.9.
The model. We considered the additive model
$$Y_i = f_1(x_{i1}) + f_2(x_{i2}) + \varepsilon_i, \qquad \varepsilon_i \sim_{\text{i.i.d.}} \text{Cauchy}(0, 1),$$
so the resulting quantile function was given by
$$f^*_\tau(x_{i1}, x_{i2}) = f_1(x_{i1}) + f_2(x_{i2}) + F^{-1}_{\text{Cauchy}(0,1)}(\tau),$$
where $F^{-1}_{\text{Cauchy}(0,1)}(\tau)$ denotes the τ-quantile of the standard Cauchy distribution. For the simulated data sets, the functions were given by
$$f_1(x_{i1}) = 7 + 5x_{i1}^2 \quad \text{and} \quad f_2(x_{i2}) := \sin(5x_{i2})\cos(17x_{i2}).$$

Fig. 1. Quantile regression using SVMs and the pinball loss function with τ = 0.5. Model: Y |(x1, x2) ∼ f1(x1) + f2(x2) + Cauchy errors, where f1(x1) := 7 + 5x1² and f2(x2) := sin(5x2) cos(17x2), and x1 and x2 are observations of independent and identically uniformly distributed random variables on the interval [0, 1]. The regularization parameter is λn = 0.05 n^{−0.45}, and the kernel parameter of the Gaussian RBF kernel is γ = 2. Upper left subplot: true function f(x1, x2) = f1(x1) + f2(x2). Upper right subplot: SVM fit based on the GRBF kernel k on X = R². Lower left subplot: SVM fit based on the sum of two one-dimensional GRBF kernels. Lower right subplot: SVM fit based on the sum of a one-dimensional polynomial kernel on R and a one-dimensional GRBF kernel.
In each case, we simulated 100 data sets for the sample sizes n = 500 and n = 2000 and for the quantiles τ = 0.5 and
τ = 0.9. The covariates xi1 and xi2 were independently drawn from the uniform distribution on the interval [0, 1].
Estimation. For each of the data sets, we estimated the conditional quantiles by additive SVMs as presented in Section 3
and by a boosting approach for additive quantile regression (function gamboost from the R-package ‘‘mboost’’); see also
Schmid and Hothorn (2008).
In the case of the additive SVM, we used the additive kernel k = k1 +k2 where k1 and k2 are one-dimensional GRBF kernels
kγ1 and kγ2 . As the parameters γ1 , γ2 > 0 in the GRBF kernels act as scale parameters for the covariates xi1 and xi2 , they were
estimated separately from the covariates, by the robust scale estimator Qn (function Qn from the R-package ‘‘robustbase’’);
see Rousseeuw and Croux (1993). In accordance with Theorem 3 (consistency of additive SVMs) the regularization parameter
λn > 0 was chosen as
$$\lambda_n = a \cdot n^{-0.45}.$$
The hyperparameter a > 0 was chosen within the values
0.4, 0.2, 0.1, 0.05, 0.025, 0.012, 0.006, 0.004, 0.003, 0.0025, 0.002, 0.00175, 0.0015, 0.0013, 0.0012, 0.0011, 0.0010,
0.0009, 0.0008, 0.0007, 0.0006, 0.0005, 0.0004, 0.0003, 0.0002, 0.0001, 0.00005
in a data-driven way by a fivefold cross-validation.
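A sketch of this tuning step in R (ours; how exactly the Qn estimate is turned into the kernel parameter is not spelled out above, so setting γj equal to Qn of the j-th covariate is only one plausible reading):

library(robustbase)                        # provides the robust scale estimator Qn()
gamma1 <- Qn(x1)                           # assumed: gamma_j taken as Qn of covariate j
gamma2 <- Qn(x2)
a_grid <- c(0.4, 0.2, 0.1, 0.05, 0.025, 0.012, 0.006, 0.004, 0.003, 0.0025, 0.002,
            0.00175, 0.0015, 0.0013, 0.0012, 0.0011, 0.0010, 0.0009, 0.0008, 0.0007,
            0.0006, 0.0005, 0.0004, 0.0003, 0.0002, 0.0001, 0.00005)
lambda_grid <- a_grid * n^(-0.45)          # candidates for lambda_n; a chosen by fivefold CV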
Fig. 2. Quantile regression using SVMs and the pinball loss function with τ = 0.5. The same situation as in Fig. 1, but with smaller sample size.
In the case of boosting, following Fenske et al. (2009, Section 3.2), we used cubic penalized spline base-learners with
second-order difference penalty, 20 inner knots, and three degrees of freedom, and fixed the step size at ν = 0.1. With
these settings, only the number of iterations mstop remained as a hyperparameter which was chosen within the values
100, 200, 300, 400, 500, 1000, 2000, 3000, 5000, 7000, 9000, 13000, 17000, 21000, 25000, 30000, 35000, 40000
in a data-driven way by a fivefold cross-validation.
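A corresponding gamboost call might look as follows (our sketch; the authors' exact call is not reproduced in the paper, and mstop would subsequently be selected by fivefold cross-validation from the grid above, e.g. via cvrisk):

library(mboost)
dat <- data.frame(y = y, x1 = x1, x2 = x2)
fit_boost <- gamboost(
  y ~ bbs(x1, degree = 3, knots = 20, differences = 2, df = 3) +
      bbs(x2, degree = 3, knots = 20, differences = 2, df = 3),
  data    = dat,
  family  = QuantReg(tau = 0.5),           # pinball loss with tau = 0.5 (median regression)
  control = boost_control(mstop = 1000, nu = 0.1))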
Fig. 3. Quantile regression using SVMs and the pinball loss function with τ = 0.5. The same situation as in Fig. 2, but with an even smaller sample size.

Performance results. The evaluation of the estimates was done on test data sets consisting of 10,000 data points. That is, for every run k ∈ {1, . . . , 100}, we got the empirical risk of the additive SVM, the boosting algorithm, and the true conditional quantile function:
$$\mathcal{R}^{(k)}_{\text{emp}}(f^{(k)}_\star) = \frac{1}{10000} \sum_{i=1}^{10000} L_\tau\big(y_i^{(k)}, f^{(k)}_\star(x_i^{(k)})\big) \qquad \text{for } f^{(k)}_\star \in \big\{f^{(k)}_{\text{SVM}}, f^{(k)}_{\text{boost}}, f^{(k)}_{\text{true}}\big\},$$
where Lτ denotes the pinball loss function. Note that it does not make sense to compare the empirical risks $\mathcal{R}^{(k)}_{\text{emp}}$. As we considered errors from the standard Cauchy distribution, the empirical risk does not converge to the true risk because the ''true risk'' is equal to infinity. However, according to the L∗-trick (see Section 2), it is possible to compare differences of the risks
$$\mathcal{R}^{(k)}_{\text{emp}}(f^{(k)}_\star) - \mathcal{R}^{(k)}_{\text{emp}}(f^{(k)}_{\text{true}}) \qquad \text{for } f^{(k)}_\star \in \big\{f^{(k)}_{\text{SVM}}, f^{(k)}_{\text{boost}}\big\}.$$
As a measure of the performance, we calculated the average over the 100 runs
$$\Delta\text{-risk} := \frac{1}{100}\sum_{k=1}^{100} \Big( \mathcal{R}^{(k)}_{\text{emp}}(f^{(k)}_\star) - \mathcal{R}^{(k)}_{\text{emp}}(f^{(k)}_{\text{true}}) \Big) \qquad \text{for } f^{(k)}_\star \in \big\{f^{(k)}_{\text{SVM}}, f^{(k)}_{\text{boost}}\big\}$$
and also the standard deviation (sd) over the 100 runs. Similarly, we calculated the average bias over the 100 runs,
$$\text{bias} := \frac{1}{100}\sum_{k=1}^{100} \frac{1}{10000}\sum_{i=1}^{10000} \Big( f^{(k)}_\star(x_i^{(k)}) - f_{\text{true}}(x_i^{(k)}) \Big) \qquad \text{for } f^{(k)}_\star \in \big\{f^{(k)}_{\text{SVM}}, f^{(k)}_{\text{boost}}\big\},$$
the standard deviation (sd) of the bias over the 100 runs, the average of the square root of the mean squared errors (sqrMSE) over the 100 runs,
$$\text{sqrMSE} := \frac{1}{100}\sum_{k=1}^{100} \sqrt{ \frac{1}{10000}\sum_{i=1}^{10000} \Big( f^{(k)}_\star(x_i^{(k)}) - f_{\text{true}}(x_i^{(k)}) \Big)^2 } \qquad \text{for } f^{(k)}_\star \in \big\{f^{(k)}_{\text{SVM}}, f^{(k)}_{\text{boost}}\big\},$$
and its standard deviation (sd) over the 100 runs.
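In R, these evaluation quantities amount to a few lines per run (our sketch; pred, pred_true and y denote the predictions of a fit, the values of the true quantile function, and the responses on a test set, all names being ours):

pinball_loss <- function(y, t, tau) ifelse(y - t < 0, (tau - 1) * (y - t), tau * (y - t))
emp_risk   <- function(y, pred, tau) mean(pinball_loss(y, pred, tau))
delta_risk <- function(y, pred, pred_true, tau)
  emp_risk(y, pred, tau) - emp_risk(y, pred_true, tau)   # L*-trick: only differences are compared
bias_run   <- function(pred, pred_true) mean(pred - pred_true)
sqrmse_run <- function(pred, pred_true) sqrt(mean((pred - pred_true)^2))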
The results are shown in Table 1 and the corresponding boxplots are shown in Fig. 4. It can be seen that the performance
results are similar for the two methods. However, with respect to the criterion sqrMSE, boosting performs slightly better in
this example. The standard deviations (see Table 1) and the boxplots (Fig. 4) indicate that the two methods cope with the
heavy-tailed error distribution similarly well in this example. Only for τ = 0.9 and n = 500 did the boosting algorithm
produce severe outliers. The SVM algorithm did not produce such severe outliers, as was to be expected from our theoretical
results on the statistical robustness of such SVMs.
Computational costs. As usual, the computational costs increased with increasing sample size. However, while the
computational time increased only moderately in the case of SVMs (17 h for n = 500 and 25 h for n = 2000 for 100
runs each with fivefold cross-validation), the computational time considerably increased in the case of boosting (26 h for
n = 500 and 87 h for n = 2000, for 100 runs each with fivefold cross-validation). In the case of boosting, the need for a large working
space was a limiting factor: approximately 0.8 GB for n = 500 and approximately 1.7 GB for n = 2000 were needed. In
contrast, the working space was no problem in the case of SVMs for such sample sizes, because less than 10 kB were needed
for n = 500 and n = 2000. The computations were done on a dual-core PC with 3.0 GHz and 3 GB RAM.
Table 1
Results of the simulations with 100 runs.

                          SVM              gamboost
τ = 0.5, n = 500
  ∆-risk (sd)             0.026 (0.005)    0.021 (0.008)
  bias (sd)               0.001 (0.079)   −0.010 (0.087)
  sqrMSE (sd)             0.415 (0.034)    0.371 (0.073)
τ = 0.5, n = 2000
  ∆-risk (sd)             0.017 (0.002)    0.005 (0.002)
  bias (sd)               0.004 (0.034)   −0.003 (0.044)
  sqrMSE (sd)             0.333 (0.015)    0.178 (0.034)
τ = 0.9, n = 500
  ∆-risk (sd)             0.022 (0.013)    0.018 (0.014)
  bias (sd)               0.176 (0.498)   −0.212 (0.472)
  sqrMSE (sd)             1.164 (0.421)    0.958 (0.374)
τ = 0.9, n = 2000
  ∆-risk (sd)             0.008 (0.004)    0.007 (0.003)
  bias (sd)               0.095 (0.207)   −0.005 (0.229)
  sqrMSE (sd)             0.731 (0.176)    0.653 (0.137)
Fig. 4. Boxplots for the simulations with 100 runs.
We mention that some analogous simulations were done for SVMs where instead of homoscedastic error terms some
heteroscedasticity was used. These results are not shown here because they are qualitatively similar to those shown above.
4.3. A real-life example: rent standard prices
Let us now consider a real-life example of the rent standard for dwellings in the large German city Munich. Many German
cities compose so-called rent standards to make a decision making instrument available to tenants, landlords, renting
advisory boards, and experts. Such rent standards can in particular be used for the determination of the local comparative rent, i.e. the net rent as a function of the dwelling size, year of construction of the house, geographical information, etc.

Table 2
Results for the rent standard data set.

                SVM              gamboost
τ = 0.5
  Risk (sd)     1.512 (1.189)    1.517 (1.189)
  β̂1            1.424            1.312
  β̂2            3.524            3.680
τ = 0.9
  Risk (sd)     0.670 (0.559)    0.671 (0.563)
  β̂1            1.604            1.447
  β̂2            5.215            4.448

For the construction of a rent standard, a representative random sample is drawn from all households and questionnaires
are used to determine the relevant information by trained interviewers. Fahrmeir et al. (2007) described such a data set
consisting of n = 3082 rent prices in Munich, which is one of the largest cities in Germany. The data set is publicly available
at http://www.regressionbook.org/. Fahrmeir et al. (2007) fitted the following additive model
price = f1 (size) + f2 (year) + β0 + β1 region1 + β2 region2 + error,
where the following variables were used:
price: net rent price per square meter in DM (1 € ≈ 1.96 DM);
size: size in square meters of the dwelling (between 20 and 160);
year: year (between 1918 and 1997);
region1 : good residential area (0 = no, 1 = yes);
region2 : best residential area (0 = no, 1 = yes).
Hence region1 and region2 are dummy variables with respect to a standard residential area. Fahrmeir et al. (2007) used a
special spline method for estimating the functions f1 and f2 .
For purposes of illustration of the SVMs with additive kernels investigated in the present paper, we used a nonparametric additive SVM for median regression. More precisely, we used the pinball loss function with τ = 0.5 and the kernel
$$k(x, x') = \sum_{j=1}^{4} k_j(x_j, x'_j), \qquad (22)$$
where x = (x1, x2, x3, x4) ∈ R⁴, x′ = (x′1, x′2, x′3, x′4) ∈ R⁴, and
k1 : R × R → R is a Gaussian RBF kernel with parameter γ1 for size,
k2 : R × R → R is a Gaussian RBF kernel with parameter γ2 for year,
k3 : R × R → R is a dot kernel for region1,
k4 : R × R → R is a dot kernel for region2.
The determination of the kernel parameters γ1 and γ2 for the kernels k1 and k2 and the determination of the regularization
parameter λ of the SVM were done in the same manner as in Section 4.2, i.e. via the robust scale estimator Qn (Rousseeuw
and Croux, 1993) and fivefold cross-validation. The whole data set for sample size n = 3082 was split randomly into a
training data set of size ntrain = 2000 and a test data set of size ntest = 1082.
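A sketch of the kernel matrix (22) for the training part of this data set (ours; it reuses the kernel helpers from the Section 2 sketch, assumes the four covariates are stored in a numeric matrix Xtrain with columns named size, year, region1 and region2, and again reads the Qn estimates directly as the kernel parameters, which is only one plausible reading of the tuning step above):

library(robustbase)
dot_kernel <- function(x, xp) sum(x * xp)                 # dot kernel for the dummy variables
K <- kernel_matrix(Xtrain[, "size",    drop = FALSE], grbf_kernel, gamma = Qn(Xtrain[, "size"])) +
     kernel_matrix(Xtrain[, "year",    drop = FALSE], grbf_kernel, gamma = Qn(Xtrain[, "year"])) +
     kernel_matrix(Xtrain[, "region1", drop = FALSE], dot_kernel) +
     kernel_matrix(Xtrain[, "region2", drop = FALSE], dot_kernel)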
Overall, a support vector machine based on the additive kernel given above and the boosting approach using gamboost
yielded very similar results for this particular data set; see Table 2 and Fig. 5.
The slope parameters were estimated via the support vector machine by β̂1 = 1.42 for a good residential area
(region1 = 1) and β̂2 = 3.52 for the best residential area (region2 = 1). Similar results were obtained by gamboost;
see Table 2. Hence, we obtain, apart from these level shifts, the same surfaces, as shown in Fig. 5, for dwellings located in
good or in the best residential areas. The upper subplots in Fig. 5 show the estimated median net rent price for one square
meter depending on the size of the dwelling and the year of the construction for a dwelling in a standard residential area.
These subplots show that the median of the net rent prices per square meter is fairly stable for construction years up to
1960, but a more or less linear increase is visible for newer buildings. The upper subplots also show that the median of the
net rent prices per square meter is especially high for dwellings of size less than 80 m², that the price is nearly constant for sizes between 80 and 140 m², and that a slight increase of the square meter price seems to occur for even larger dwellings.
The shapes of the surfaces for the conditional τ = 0.9 quantiles are given in the lower subplots and are quite similar to the
ones for the estimated median net rent prices. However, the lower subplots may give an indication for two moderate peaks
for the 90% quantile net rent prices for dwellings of size 50 m² and 100 m², respectively.
We would like to mention that we used this real-life example just for illustration purposes, but nevertheless our results
are in good agreement with the more detailed statistical analysis of this data set made by Fahrmeir et al. (2007) who used
spline methods.
Fig. 5. Plot for the fitted additive model for the rent standard data set based on a nonparametric additive SVM for quantile regression, i.e., pinball loss
functions with τ = 0.50 and τ = 0.90, respectively. The surface gives the estimated median (upper) or 90% quantile (lower) net rent price of one square
meter depending on the size of the dwelling and the year of the construction for a standard residential area, i.e., region1 = region2 = 0.
5. Discussion
Support vector machines belong to the class of modern statistical machine learning methods based on kernels. The
success of SVMs is partly based on the kernel trick which makes SVMs usable even for abstract input spaces, their universal
consistency, that they are the solutions of a well-posed problem in Hadamard’s sense, their statistical robustness with
respect to small model violations, and the existence of fast numerical algorithms. During the last decade there has been
considerable research on these topics. To obtain universal consistency, one needs a sufficiently large reproducing kernel Hilbert space H, which is why many SVMs of practical interest are based on Hilbert spaces of infinite dimension. Due to the
no-free-lunch theorem (Devroye, 1982), there exists in general no uniform rate of convergence of SVMs on the set of all
probability measures.
Although such a nonparametric approach is often the best choice in practice due to the lack of prior knowledge of the
unknown probability measure P, a semiparametric approach or an additive model (Friedman and Stuetzle, 1981; Hastie and
Tibshirani, 1990) can also be valuable for at least two reasons. (i) In some applications some weak knowledge on P or on
the unknown function f to be estimated, say the conditional quantile curve, is known, e.g. f is known to be bounded or at
least integrable. (ii) For practical reasons, we may be interested only in functions f which offer a nice interpretation from an
applied point of view, although there might be a measurable function with a smaller risk, because an interpretable prediction
function can be crucial in some applications. An important class of statistical models whose predictions are relatively easy to interpret is the class of additive models.
Therefore, support vector machines for additive models were treated in this paper and some results on their consistency
and statistical robustness properties were derived.
Additive models have already been treated by several other methods. Currently, there is considerable interest in the
boosting of additive models; see, e.g., De Bock et al. (2010), Ferraty and Vieu (2009), and Shafik and Tutz (2009). In addition
to the backfitting algorithm of Hastie and Tibshirani (1990) (gam in the R-package ‘gam’), there is also a number of methods
which are based on penalized splines:
• gam in the R-package ‘mgcv’—see Wood (2006);
• hgam in the R-package ‘hgam’—see Meier et al. (2009);
• vgam in the R-package ‘vgam’—see Yee and Wild (1996);
and on boosting:
• GAMBoost in the R-package ‘GAMBoost’—see Tutz and Binder (2007);
• gamboost in the R-package ‘mboost’—see Schmid and Hothorn (2008).
In addition, Koenker (2005, Section 7) specifically considers quantile regression for additive models (rqss in the R-package
‘quantreg’). Apart from gam in the R-package ‘mgcv’, all of these algorithms can only deal with additive models
y = f1 (x1 ) + · · · + fs (xs ) + ε
such that the covariates x1 , . . . , xs are univariate or at most bivariate; gam in the R-package ‘mgcv’ and additive SVMs
(using ‘mySVM’) can deal with covariates xj ∈ Rpj of any dimension pj ∈ N. In addition, additive SVMs can also be
used for abstract input spaces Xj as we have only assumed in Section 3 that Xj is any complete, separable metric space.
Such situations occur, e.g., in the analysis of colored images, text classification, web mining and signal processing; see e.g.
Christmann and Steinwart (2010) and the references cited therein. In the case of any abstract input spaces, standard software implementations of SVMs can still be used because additive SVMs can simply be obtained by choosing a suitable kernel k or, correspondingly, a suitable kernel matrix $K = \big(k(x_i, x_j)\big) \in \mathbb{R}^{n \times n}$, where
$$k(x_i, x_j) := k_1(x_{1,i}, x_{1,j}) + \cdots + k_s(x_{s,i}, x_{s,j}) \qquad \text{for all } i, j \in \{1, \ldots, n\}.$$
In particular, any implementation which allows for passing the kernel matrix K (instead of just choosing a kernel function
among a set of predefined kernel functions) is suitable; e.g. use ksvm in the R-package ‘kernlab’ and fix the arguments x = K
and kernel = "matrix". Note that the calculation of the kernel matrix is the only step where the complexity of Xj and
the number s of additive terms matters. Since the kernel matrix has to be calculated once only, the complexity of Xj and
the number s of additive terms essentially have hardly any influence on the computational costs—at least, as long as the
sample size is not too large and so the kernel matrix can still be stored. Accordingly, additive SVMs can also be used for
high-dimensional additive models.
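Following the recipe just described, a minimal kernlab sketch is given below (ours, not the authors' setup: the paper's own computations used mySVM, see Appendix A, because ksvm does not provide the pinball loss, so the call illustrates the kernel-matrix interface with ε-insensitive regression; K denotes a precomputed n × n additive kernel matrix, y the responses, and C and epsilon are arbitrary illustrative values):

library(kernlab)
K_add <- as.kernelMatrix(K)                # K: precomputed n x n additive kernel matrix
fit   <- ksvm(K_add, y, kernel = "matrix", type = "eps-svr", C = 1, epsilon = 0.1)
# For new data, predict() expects the kernel matrix between the test points and the
# support vectors, e.g. predict(fit, as.kernelMatrix(K_test[, SVindex(fit), drop = FALSE])).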
In addition to these computational advantages, additive SVMs have a sound theoretical justification: we have shown
in Section 3.2 that additive SVMs are consistent for additive models and we have explained in Section 3.3 that the good
robustness properties of SVMs carry over to additive SVMs without further ado. However, as additive SVMs are a special
case of general SVMs, general disadvantages of SVMs also carry over.
One of the most important topics of recent research on support vector machines is the investigation of rates of
convergence for special cases; see e.g. Steinwart and Scovel (2007), Caponnetto and De Vito (2007), Blanchard et al. (2008),
Steinwart et al. (2009) and Mendelson and Neeman (2010). Xu and Chen (2008) and Steinwart and Christmann (2009)
considered convergence rates of SVMs under mixing conditions. Accordingly, it may also be worthwhile to investigate the
rates of convergence of SVMs based on kernels designed for additive models. Our simple numerical examples in Section 4.1
seem to indicate that there might be some gain with respect to the rate of convergence or with respect to smaller constants
describing the rate of convergence. However, this is far beyond the scope of this paper.
Finally, we would like to mention the well-known fact that not only is the sum of s kernels a kernel but also the product
of s kernels is a kernel. Hence it seems to be possible to derive results similar to those given here for additive models also
for multiplicative models or even models with additive and multiplicative terms.
Appendix A. Some computational details
Support vector machines can be computed with many different software tools, which often allow the definition of new kernels or the direct specification of the kernel matrix $K := \big(k(x_i, x_j)\big) \in \mathbb{R}^{n \times n}$. We used the implementation
mySVM developed by Rüping (2000), who also gives a description of the parameters and options of his program, for the
computation of SVMs for classification, regression, and distribution estimation problems. This implementation is based on
SVMlight developed by Joachims (1999). The software mySVM can be used for SVMs based on the pinball loss for quantile
regression. The parameters and numerical options for the specific SVM are submitted to mySVM via an external file. A typical
example for the external files that we used to fit an SVM based on the pinball loss function with τ = 0.9 and the kernel
defined in (22) as the sum of two GRBF kernels and two linear kernels in order to analyze the rent standard data set is given
below. The definition of the kernel starts with @kernel. The options type sum_aggregation and number_parts 4
specify that an additive kernel consisting of four parts will be used. The following lines specify these individual kernels. After
the line @parameters, the loss function and some options for specifying numerical parameters are given; e.g., the options
epsilon=0, L+ = 1 − τ , and L- = τ can be used to specify an SVM based on the pinball loss function for quantile level τ .
@kernel
type sum_aggregation
number_parts 4
range 1 1
range 2 2
range 3 3
range 4 4
@kernel
type radial
gamma 0.581
@kernel
type radial
gamma 0.626
@kernel
type dot
@kernel
type dot
@parameters
C 5.097
scale
epsilon 0.00
L+ 0.1
L- 0.9
kernel_cache 2000
max_iterations 1000000
working_set_size 2
convergence_epsilon 0.0001
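For readers who prefer a programmatic view, the following Python sketch constructs a kernel matrix of the form specified by the mySVM file above, namely the sum of two Gaussian RBF kernels with bandwidth parameters gamma = 0.581 and 0.626 acting on the first two input variables and two linear kernels acting on the third and fourth input variables, together with the pinball loss for τ = 0.9 that motivates the options L+ = 0.1 and L- = 0.9. The parameterization exp(−γ(u − v)²) of the radial kernel, the column layout of the toy data, and all function names are assumptions made only for this illustration; the sketch is not the authors' implementation.

import numpy as np

def rbf_1d(u, v, gamma):
    # Gaussian RBF kernel on a single input variable: exp(-gamma * (u - v)^2)
    return np.exp(-gamma * (u[:, None] - v[None, :]) ** 2)

def linear_1d(u, v):
    # linear kernel on a single input variable: u * v
    return u[:, None] * v[None, :]

def additive_kernel(X, Z, gammas=(0.581, 0.626)):
    # additive kernel of Appendix A: two GRBF parts (columns 0 and 1)
    # plus two linear parts (columns 2 and 3), as in the mySVM file above
    K = rbf_1d(X[:, 0], Z[:, 0], gammas[0])
    K += rbf_1d(X[:, 1], Z[:, 1], gammas[1])
    K += linear_1d(X[:, 2], Z[:, 2])
    K += linear_1d(X[:, 3], Z[:, 3])
    return K

def pinball_loss(y, f, tau=0.9):
    # pinball loss for quantile level tau; for tau = 0.9 the slopes are
    # 0.9 on positive residuals and 0.1 on negative residuals
    r = y - f
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

# toy data with four input variables as a stand-in for the rent standard data
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
K = additive_kernel(X, X)                 # n x n kernel matrix, here 10 x 10
print(K.shape, pinball_loss(np.array([1.0, -2.0]), np.array([0.0, 0.0])))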
Appendix B. Proofs
Proof of Theorem 2. First fix any j ∈ {1, . . . , s} and define the mapping k̃j : X × X → R via
\[
  \tilde{k}_j\bigl((x_1, \ldots, x_s), (x'_1, \ldots, x'_s)\bigr) = k_j(x_j, x'_j)
\]
for every (x1 , . . . , xs ) ∈ X and (x′1 , . . . , x′s ) ∈ X. Accordingly, for every fj ∈ Hj , define f̃j : X → R via
\[
  \tilde{f}_j(x_1, \ldots, x_s) = f_j(x_j) \qquad \forall\, (x_1, \ldots, x_s) \in X.
\]
Then, it is easy to see that
\[
  \tilde{H}_j = \bigl\{ \tilde{f}_j : X \to \mathbb{R} \,:\, f_j \in H_j \bigr\}
\]
is a Hilbert space with inner product and norm given by
\[
  \langle \tilde{f}_j, \tilde{h}_j \rangle_{\tilde{H}_j} = \langle f_j, h_j \rangle_{H_j}
  \qquad\text{and}\qquad
  \|\tilde{f}_j\|_{\tilde{H}_j} = \|f_j\|_{H_j}
  \tag{B.1}
\]
for every fj ∈ Hj and hj ∈ Hj . Hence, for every x = (x1 , . . . , xs ) ∈ X, we get k̃j (·, x) ∈ H̃j and
\[
  \tilde{f}_j(x) = f_j(x_j) = \langle f_j, k_j(\cdot, x_j) \rangle_{H_j} = \langle \tilde{f}_j, \tilde{k}_j(\cdot, x) \rangle_{\tilde{H}_j}
  \qquad \forall\, f_j \in H_j,
\]
where the last equality follows from (B.1) and the definition of k̃j . That is, k̃j is a reproducing kernel and H̃j is its RKHS.
Next, it follows from Berlinet and Thomas-Agnan (2004, Section 4.1) that k = k̃1 + · · · + k̃s is a reproducing kernel on X
with RKHS H = H̃1 + · · · + H̃s and norm
\[
  \|f\|_H^2
  = \min_{\substack{f = \tilde{f}_1 + \cdots + \tilde{f}_s \\ \tilde{f}_1 \in \tilde{H}_1, \ldots, \tilde{f}_s \in \tilde{H}_s}}
    \Bigl( \|\tilde{f}_1\|_{\tilde{H}_1}^2 + \cdots + \|\tilde{f}_s\|_{\tilde{H}_s}^2 \Bigr)
  \overset{\text{(B.1)}}{=}
  \min_{\substack{f = f_1 + \cdots + f_s \\ f_1 \in H_1, \ldots, f_s \in H_s}}
    \Bigl( \|f_1\|_{H_1}^2 + \cdots + \|f_s\|_{H_s}^2 \Bigr).
  \tag{B.2}
\]
Using the reduced notation f1 + · · · + fs instead of f̃1 + · · · + f̃s , inequality (12) follows.
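As a small worked illustration of the norm formula (B.2), added here only for the reader's convenience and not part of the original proof, take s = 2 and the one-dimensional linear kernels $k_j(x_j, x'_j) = x_j x'_j$, so that $H_j = \{x_j \mapsto c\,x_j : c \in \mathbb{R}\}$ with $\|c\,\mathrm{id}\|_{H_j} = |c|$. The additive kernel is then
\[
  k\bigl((x_1, x_2), (x'_1, x'_2)\bigr) = x_1 x'_1 + x_2 x'_2,
\]
i.e. the ordinary linear kernel on $\mathbb{R}^2$. For $f(x_1, x_2) = 2x_1 + 3x_2$ the decomposition $f = \tilde{f}_1 + \tilde{f}_2$ with $\tilde{f}_1 \in \tilde{H}_1$ and $\tilde{f}_2 \in \tilde{H}_2$ is unique, because the two spaces intersect only in the zero function, so (B.2) gives
\[
  \|f\|_H^2 = \|2\,\mathrm{id}\|_{H_1}^2 + \|3\,\mathrm{id}\|_{H_2}^2 = 4 + 9 = 13,
\]
which agrees with the familiar norm $\|(2, 3)\|_2^2 = 13$ of the corresponding linear functional on $\mathbb{R}^2$.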
In order to prove Theorem 3, the following proposition is needed. It provides conditions on Hj and Fj under which the
minimal risk over H = H1 + · · · + Hs is equal to the minimal risk over the larger F = F1 + · · · + Fs .
Proposition 7. Let the main assumptions (i)–(iii) be valid. Let P ∈ M1 (X × Y) be such that
\[
  H_j \subset F_j \subset L_1(P_{X_j}) \qquad \forall\, j \in \{1, \ldots, s\}
\]
and Hj is dense in Fj with respect to $\|\cdot\|_{L_1(P_{X_j})}$. Then,
\[
  R^{*}_{L^{\star},P,H} := \inf_{f \in H} R_{L^{\star},P}(f) = R^{*}_{L^{\star},P,F}.
  \tag{B.3}
\]
Proof of Proposition 7. According to the definitions, it only remains to prove $R^{*}_{L^{\star},P,H} \le R^{*}_{L^{\star},P,F}$. To this end, take any f ∈ F
and any ε > 0. Then, by assumption there are functions fj ∈ Fj , j ∈ {1, . . . , s},
such that f = f1 + · · · + fs and, for every j ∈ {1, . . . , s}, there is an hj ∈ Hj such that
\[
  \|h_j - f_j\|_{L_1(P_{X_j})} < \frac{\varepsilon}{s \cdot |L|_1}.
  \tag{B.4}
\]
Hence, for h = h1 + · · · + hs ∈ H,
\[
  \bigl| R_{L^{\star},P}(h) - R_{L^{\star},P}(f) \bigr|
  \le \int \bigl| L(x, y, h(x)) - L(x, y, f(x)) \bigr| \, P\bigl(d(x, y)\bigr)
  \le |L|_1 \int \bigl| h(x) - f(x) \bigr| \, P_X(dx)
  \le |L|_1 \sum_{j=1}^{s} \int \bigl| h_j(x_j) - f_j(x_j) \bigr| \, P_{X_j}(dx_j)
  \overset{\text{(B.4)}}{<} \varepsilon.
\]

Proof of Theorem 3. To avoid handling too many constants, let us assume that ∥k∥∞ = 1. According to (7), this implies
∥f ∥∞ ≤ ∥f ∥H for all f ∈ H. Now we use the Lipschitz continuity of L to obtain, for all g ∈ H,
\[
  \bigl| R_{L^{\star},P}(f_{L,P,\lambda_n}) - R_{L^{\star},P}(g) \bigr|
  \le \int \bigl| L(x, y, f_{L,P,\lambda_n}(x)) - L(x, y, g(x)) \bigr| \, P\bigl(d(x, y)\bigr)
  \le |L|_1 \int \bigl| f_{L,P,\lambda_n}(x) - g(x) \bigr| \, P_X(dx)
  \le |L|_1 \int \| f_{L,P,\lambda_n} - g \|_\infty \, P_X(dx)
  \le |L|_1 \, \| f_{L,P,\lambda_n} - g \|_H.
  \tag{B.5}
\]
Let Φ denote the canonical feature map which corresponds to the kernel k. According to Christmann et al. (2009, Theorem
7), for every n ∈ N, there is a bounded, measurable function hn : X × Y → R such that
\[
  \|h_n\|_\infty \le |L|_1
  \tag{B.6}
\]
and, for every Q ∈ M1 (X × Y),
\[
  \bigl\| f_{L,P,\lambda_n} - f_{L,Q,\lambda_n} \bigr\|_H
  \le \lambda_n^{-1} \bigl\| \mathbb{E}_P h_n \Phi - \mathbb{E}_Q h_n \Phi \bigr\|_H.
  \tag{B.7}
\]
Fix any ε ∈ (0, 1) and define
\[
  B_n := \Bigl\{ D_n \in (X \times Y)^n : \bigl\| \mathbb{E}_P h_n \Phi - \mathbb{E}_{D_n} h_n \Phi \bigr\|_H \le \varepsilon \lambda_n \Bigr\},
  \tag{B.8}
\]
where Dn denotes the empirical distribution of the data set Dn . Then, (B.5), (B.7) and (B.8) yield
\[
  \bigl| R_{L^{\star},P}(f_{L,P,\lambda_n}) - R_{L^{\star},P}(f_{L,D_n,\lambda_n}) \bigr| \le |L|_1 \varepsilon
  \qquad \forall\, D_n \in B_n.
  \tag{B.9}
\]
Now let us turn to the probability $P^n(B_n)$. By use of Hoeffding’s inequality, we will show that
\[
  \lim_{n \to \infty} P^n(B_n) = 1.
  \tag{B.10}
\]
To this end, we first observe that λn n1/2 → ∞ implies that λn ε ≥ n−1/2 for all sufficiently large n ∈ N. Moreover, (B.6) and
our assumption ∥k∥∞ = 1 yield ∥hn Φ ∥∞ ≤ |L|1 . Define
\[
  a_n := |L|_1^{-1} \varepsilon \lambda_n
  \qquad\text{and}\qquad
  \xi_n := \frac{3\, |L|_1^{-2} \varepsilon^2 \lambda_n^2 n}{8 \bigl( |L|_1^{-1} \varepsilon \lambda_n + 3 \bigr)}
        = \frac{3\, a_n^2 n}{8 (a_n + 3)}
\]
and note that, for sufficiently large n,
\[
  \frac{\sqrt{2 \xi_n} + 1}{\sqrt{n}} + \frac{4 \xi_n}{3n}
  = \frac{a_n}{2} \cdot \sqrt{\frac{3}{a_n + 3}} + \frac{1}{\sqrt{n}} + \frac{a_n}{2} \cdot \frac{a_n}{a_n + 3}
  < \frac{a_n}{2} + \frac{1}{\sqrt{n}} + \frac{a_n}{2} \cdot \frac{1}{3}
  < a_n = |L|_1^{-1} \varepsilon \lambda_n.
  \tag{B.11}
\]
Consequently, Hoeffding’s inequality in Hilbert spaces (see Steinwart and Christmann, 2008, Corollary 6.15) yields for
B = |L|1 the bound
\begin{align*}
  P^n(B_n)
  &= P^n\Bigl( \bigl\{ D \in (X \times Y)^n : \| \mathbb{E}_P h_n \Phi - \mathbb{E}_D h_n \Phi \|_H \le \varepsilon \lambda_n \bigr\} \Bigr) \\
  &\overset{\text{(B.11)}}{\ge} P^n\Bigl( \bigl\{ D \in (X \times Y)^n : \| \mathbb{E}_P h_n \Phi - \mathbb{E}_D h_n \Phi \|_H \le \frac{B (\sqrt{2 \xi_n} + 1)}{\sqrt{n}} + \frac{4 B \xi_n}{3n} \bigr\} \Bigr) \\
  &\ge 1 - \exp\Bigl( - \frac{3}{8} \cdot \frac{\varepsilon^2 \lambda_n^2 n / |L|_1^2}{\varepsilon \lambda_n / |L|_1 + 3} \Bigr)
   = 1 - \exp\Bigl( - \frac{3}{8} \cdot \frac{\varepsilon^2 \lambda_n^2 n}{(\varepsilon \lambda_n + 3 |L|_1) \, |L|_1} \Bigr)
\end{align*}
for all sufficiently large values of n. Now (B.10) follows from λn → 0 and λn n1/2 → ∞.
According to (B.9) and (B.10),
\[
  R_{L^{\star},P}(f_{L,P,\lambda_n}) - R_{L^{\star},P}(f_{L,D_n,\lambda_n}) \longrightarrow 0 \qquad (n \to \infty)
\]
in probability. Note that
\begin{align*}
  \bigl| R_{L^{\star},P}(f_{L,D_n,\lambda_n}) - R^{*}_{L^{\star},P,F} \bigr|
  &\le \bigl| R_{L^{\star},P}(f_{L,D_n,\lambda_n}) - R^{*}_{L^{\star},P,H} \bigr| + \bigl| R^{*}_{L^{\star},P,H} - R^{*}_{L^{\star},P,F} \bigr| \\
  &\overset{\text{(B.3)}}{\le} \bigl| R_{L^{\star},P}(f_{L,D_n,\lambda_n}) - R_{L^{\star},P}(f_{L,P,\lambda_n}) \bigr| + \bigl| R_{L^{\star},P}(f_{L,P,\lambda_n}) - R^{*}_{L^{\star},P,H} \bigr|.
  \tag{B.12}
\end{align*}
As shown above, the first term in (B.12) converges in probability to 0. Therefore, it only remains to prove that the second
term converges to 0. To this end, define, for every f ∈ H, the affine linear function
\[
  A^{*}_f : \mathbb{R} \to \mathbb{R}, \qquad
  \lambda \mapsto R_{L^{\star},P}(f) + \lambda \|f\|_H^2 - R^{*}_{L^{\star},P,H}.
\]
Then, a continuity result for the pointwise infimum of a family of affine functions (see e.g. Steinwart and Christmann,
2008, A.6.4) yields
\[
  \lim_{n \to \infty} \inf_{f \in H} A^{*}_f(\lambda_n) = \inf_{f \in H} A^{*}_f(0).
\]
However, according to the definitions,
\[
  \inf_{f \in H} A^{*}_f(\lambda_n)
  = R_{L^{\star},P}(f_{L,P,\lambda_n}) + \lambda_n \|f_{L,P,\lambda_n}\|_H^2 - R^{*}_{L^{\star},P,H}
  \qquad \forall\, n \in \mathbb{N}
\]
and $\inf_{f \in H} A^{*}_f(0) = 0$. Hence,
\[
  0 \le \limsup_{n \to \infty} \Bigl( R_{L^{\star},P}(f_{L,P,\lambda_n}) - R^{*}_{L^{\star},P,H} \Bigr)
  \le \limsup_{n \to \infty} \Bigl( \inf_{f \in H} A^{*}_f(\lambda_n) - \inf_{f \in H} A^{*}_f(0) \Bigr) = 0.
\]
Proof of Theorem 4. Since the quantile function $f^{*}_{\tau,P}$ attains the minimal risk $R^{*}_{L^{\star},P}$ for the pinball loss L = Lτ (Koenker,
2005, Section 1.3), the assumption $f^{*}_{\tau,P} \in F$ implies $R^{*}_{L^{\star},P,F} = R^{*}_{L^{\star},P}$. Hence, an application of Theorem 3 yields
\[
  R_{L^{\star},P}(f_{L,D_n,\lambda_n}) \longrightarrow R^{*}_{L^{\star},P} \qquad (n \to \infty)
  \tag{B.13}
\]
in probability. It is shown in Christmann et al. (2009, Corollary 31) that, for all sequences $(f_n)_{n \in \mathbb{N}}$ of measurable functions
$f_n : X \to \mathbb{R}$,
\[
  R_{L^{\star},P}(f_n) \longrightarrow R^{*}_{L^{\star},P}
  \quad\text{implies}\quad
  d_0\bigl(f_n, f^{*}_{\tau,P}\bigr) \longrightarrow 0.
\]
This proves Theorem 4 in the following way. According to the characterization of convergence in probability by means of
almost surely convergent subsequences (Dudley, 2002, Theorem 9.2.1), it follows from (B.13) that, for every subsequence of
$R_{L^{\star},P}(f_{L,D_n,\lambda_n})$, $n \in \mathbb{N}$, there is a further subsequence which converges almost surely to $R^{*}_{L^{\star},P}$. Hence, according to the cited
result (Christmann et al., 2009, Corollary 31), for every subsequence of
\[
  d_0\bigl(f_{L,D_n,\lambda_n}, f^{*}_{\tau,P}\bigr), \qquad n \in \mathbb{N},
\]
there is a further subsequence which converges almost surely to 0. That is, $d_0\bigl(f_{L,D_n,\lambda_n}, f^{*}_{\tau,P}\bigr) \to 0$ in probability.
References
Berlinet, A., Thomas-Agnan, C., 2004. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Boston, MA.
Blanchard, G., Bousquet, O., Massart, P., 2008. Statistical performance of support vector machines. The Annals of Statistics 36 (2), 489–531.
Caponnetto, A., De Vito, E., 2007. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics 7 (3), 331–368.
Christmann, A., Steinwart, I., 2010. Universal kernels on non-standard input spaces. In: Lafferty, J., Williams, C.K.I., Shawe-Taylor, J., Zemel, R., Culotta, A.
(Eds.), Advances in Neural Information Processing Systems, vol. 23. pp. 406–414.
Christmann, A., Van Messem, A., 2008. Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning
Research 9, 915–936.
Christmann, A., Van Messem, A., Steinwart, I., 2009. On consistency and robustness properties of support vector machines for heavy-tailed distributions.
Statistics and Its Interface 2, 311–327.
Cucker, F., Zhou, D.X., 2007. Learning Theory. An Approximation Theory Viewpoint. Cambridge University Press, Cambridge.
De Bock, K., Coussement, K., Van den Poel, D., 2010. Ensemble classification based on generalized additive models. Computational Statistics and Data
Analysis 54, 1535–1546.
Devroye, L.P., 1982. Any discrimination rule can have an arbitrarily bad probability of error for finite sample size. IEEE Transactions on Pattern Analysis and
Machine Intelligence 4, 154–157.
Dudley, R.M., 2002. Real Analysis and Probability. Cambridge University Press, Cambridge.
Fahrmeir, L., Kneib, T., Lang, S., 2007. Regression: Modell, Methode und Anwendungen. Springer, Berlin.
Fenske, N., Kneib, T., Hothorn, T., 2009. Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression.
http://epub.ub.uni-muenchen.de/10510/.
Ferraty, F., Vieu, P., 2009. Additive prediction and boosting for functional data. Computational Statistics and Data Analysis 53, 1400–1413.
Friedman, J.H., Stuetzle, W., 1981. Projection pursuit regression. Journal of the American Statistical Association 76, 817–823.
Hable, R., Christmann, A., 2011. On qualitative robustness of support vector machines. Journal of Multivariate Analysis 102, 993–1007.
Hampel, F.R., 1968. Contributions to the theory of robust estimation. Unpublished Ph.D. Thesis, Department of Statistics, University of California, Berkeley.
Hampel, F.R., 1974. The influence curve and its role in robust estimation. Journal of the American Statistical Association 69, 383–393.
Hastie, T., Tibshirani, R., 1990. Generalized Additive Models. Chapman and Hall, London.
Hein, M., Bousquet, O., 2005. Hilbertian metrics and positive definite kernels on probability measures. In: Ghahramani, Z., Cowell, R. (Eds.), AISTATS.
pp. 136–143.
Huber, P.J., 1967. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proc. 5th Berkeley Symp. vol. 1, pp. 221–233.
Joachims, T., 1999. Making large-scale SVM learning practical. In: Schölkopf, B., Burges, C., Smola, A. (Eds.), Advances in Kernel Methods–Support Vector
Learning. MIT Press, Cambridge, MA, pp. 41–56.
Joachims, T., 2002. Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers, Boston.
Koenker, R., 2005. Quantile Regression. Cambridge University Press, Cambridge.
Lafferty, J., Lebanon, G., 2005. Diffusion kernels on statistical manifolds. Journal of Machine Learning Research 6, 129–163.
Meier, L., van de Geer, S., Bühlmann, P., 2009. High-dimensional additive modeling. The Annals of Statistics 37 (6B), 3779–3821.
Mendelson, S., Neeman, J., 2010. Regularization in kernel learning. The Annals of Statistics 38 (1), 526–565.
Robinson, S.M., 1991. An implicit-function theorem for a class of nonsmooth functions. Mathematics of Operations Research 16, 292–309.
Rousseeuw, P.J., Croux, C., 1993. Alternatives to the median absolute deviation. Journal of the American Statistical Association 88 (424), 1273–1283.
Rüping, S., 2000. mySVM-Manual. Department of Computer Science, University of Dortmund. www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM.
Schmid, M., Hothorn, T., 2008. Boosting additive models using component-wise P-splines. Computational Statistics and Data Analysis 53 (2), 298–311.
Schölkopf, B., Smola, A.J., 2002. Learning with Kernels. Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.
Schölkopf, B., Smola, A.J., Müller, K.-R., 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299–1319.
Schölkopf, B., Tsuda, K., Vert, J.P., 2004. Kernel Methods in Computational Biology. MIT Press, Cambridge, MA.
Shafik, N., Tutz, G., 2009. Boosting nonlinear additive autoregressive time series. Computational Statistics and Data Analysis 53, 2453–2464.
Sriperumbudur, B., Fukumizu, K., Gretton, A., Lanckriet, G., Schölkopf, B., 2009. Kernel choice and classifiability for RKHS embeddings of probability
distributions. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (Eds.), Advances in Neural Information Processing Systems,
vol. 22. pp. 1750–1758.
Steinwart, I., 2001. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research 2, 67–93.
Steinwart, I., Christmann, A., 2008. Support Vector Machines. Springer, New York.
Steinwart, I., Christmann, A., 2009. Fast learning from non-i.i.d. observations. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (Eds.),
Advances in Neural Information Processing Systems, vol. 22. pp. 1768–1776.
Steinwart, I., Hush, D., Scovel, C., 2009. Optimal rates for regularized least squares regression. In: Proceedings of the 22nd Conference on Learning Theory
(COLT 2009).
Steinwart, I., Scovel, C., 2007. Fast rates for support vector machines using Gaussian kernels. The Annals of Statistics 35 (2), 575–607.
Tutz, G., Binder, H., 2007. Boosting ridge regression. Computational Statistics and Data Analysis 51 (12), 6044–6059.
Vapnik, V.N., 1998. Statistical Learning Theory. John Wiley & Sons, New York.
Wood, S.N., 2006. Generalized Additive Models. Chapman & Hall/CRC, Boca Raton, FL.
Xu, Y.-L., Chen, D.-R., 2008. Learning rates of regularized regression for exponentially strongly mixing sequence. Journal of Statistical Planning and Inference
138 (7), 2180–2189.
Yee, T.W., Wild, C.J., 1996. Vector generalized additive models. Journal of the Royal Statistical Society. Series B. Methodological 58 (3), 481–493.