# Principal support vector machines for linear and nonlinear sufficient dimension reduction


The Annals of Statistics, 2011, Vol. 39, No. 6, 3182–3210. DOI: 10.1214/11-AOS932. © Institute of Mathematical Statistics, 2011.

PRINCIPAL SUPPORT VECTOR MACHINES FOR LINEAR AND NONLINEAR SUFFICIENT DIMENSION REDUCTION

By Bing Li, Andreas Artemiou and Lexin Li

Pennsylvania State University, Michigan Technological University and North Carolina State University

We introduce a principal support vector machine (PSVM) approach that can be used for both linear and nonlinear sufficient dimension reduction. The basic idea is to divide the response variables into slices and use a modified form of support vector machine to find the optimal hyperplanes that separate them. These optimal hyperplanes are then aligned by the principal components of their normal vectors. It is proved that the aligned normal vectors provide an unbiased, √n-consistent, and asymptotically normal estimator of the sufficient dimension reduction space. The method is then generalized to nonlinear sufficient dimension reduction using the reproducing kernel Hilbert space. In that context, the aligned normal vectors become functions, and it is proved that they are unbiased in the sense that they are functions of the true nonlinear sufficient predictors. We compare PSVM with other sufficient dimension reduction methods by simulation and in real data analysis, and through both comparisons firmly establish its practical advantages.

1. Introduction. With the increase of computer power in storing and processing data, high-dimensional data have become increasingly prevalent across many disciplines. The demand for effective methods to extract useful information from such data has led inevitably to dimension reduction, an area that has undergone tremendous development during the past two decades. Let X be a p-dimensional predictor and Y be a response variable.
Received October 2010; revised September 2011. Supported in part by NSF Grants DMS-07-04621, DMS-08-06058 and DMS-11-06668. AMS 2000 subject classifications: 62-09, 62G08, 62H12. Key words and phrases: contour regression, invariant kernel, inverse regression, principal components, reproducing kernel Hilbert space, support vector machine. This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2011, Vol. 39, No. 6, 3182–3210. This reprint differs from the original in pagination and typographic detail.

In its classical form, sufficient dimension reduction (SDR) [Li (1991, 1992), Cook and Weisberg (1991), Cook (1998)] identifies a small number of linear combinations of predictors that can replace the original predictor vector X without loss of information on the conditional distribution of Y given X. In other words, the objective is to find a p × d (d < p) matrix η such that the following conditional independence holds:

(1) Y ⫫ X | ηᵀX.

In this relation, the identifiable parameter is the subspace spanned by the columns of η, rather than η itself. The intersection of all subspaces satisfying (1), provided it itself satisfies (1), is called the central subspace, and is denoted by S_{Y|X} [Cook (1994)]. Cook (1996) and Yin, Li and Cook (2008) showed that S_{Y|X} uniquely exists under very mild conditions. Thus, we assume its existence throughout this article. Many methods have been proposed for this problem since the publication of the original works. See, for example, Cook and Li (2002), Xia et al. (2002), Yin and Cook (2002), Fung et al. (2002), Li, Zha and Chiaromonte (2005), Cook and Ni (2005), Li and Wang (2007), Li and Dong (2009).

A more general sufficient dimension reduction problem, as formulated in Cook (2007), is to seek an arbitrary function φ : ℝᵖ → ℝʳ such that

(2) Y ⫫ X | φ(X).
We refer to this problem as nonlinear sufficient dimension reduction, and any one-to-one function of φ(X) as the nonlinear sufficient predictor. Several recent works pioneered estimation procedures for nonlinear dimension reduction of this type, including Wu (2008), Wu, Liang and Mukherjee (2008), Wang (2008) and Yeh, Huang and Lee (2009), by extending sliced inverse regression [SIR; Li (1991)] from different angles.

In this paper, we propose a sufficient dimension reduction method, to be called the principal support vector machine (PSVM), that can extract the sufficient predictors in both problems (1) and (2). Let (X₁, Y₁), …, (Xₙ, Yₙ) be a sample of (X, Y). The basic idea of PSVM is to divide X₁, …, Xₙ into several slices according to the values of the responses, and then use the support vector machine [SVM; Vapnik (1998)] to find the optimal hyperplanes that separate these slices. The optimal hyperplanes are then aligned by applying principal component analysis to their normal vectors. We show that the principal components are, in fact, an unbiased estimator of the central subspace S_{Y|X}. This idea is then extended to the nonlinear dimension reduction problem (2) via the reproducing kernel Hilbert space [RKHS; Aronszajn (1950), Hsing and Ren (2009)]. In this context, the normal vectors of the linear case become functions in the RKHS. It is shown that the normal functions thus derived are functions of φ in the general problem (2). This is, to our knowledge, the first result of this type.

Our proposal is noticeably different from the existing SDR methods in the following respects. First, PSVM is developed under, and for, a unified framework of linear and nonlinear sufficient dimension reduction. Such a standpoint allows us to formulate some theoretical properties, such as unbiasedness, more rigorously and generally than previous works.
Second, PSVM improves the accuracy of sufficient dimension reduction, for the following reason. It is well known that a regression surface is more accurately estimated at the center of the data cloud than at the outskirts. However, an inverse regression based method, such as SIR, tends to downweight the slice means near the center due to their shorter lengths. Since PSVM relies on separating hyperplanes rather than slice means, it makes better use of the central portion of the data than inverse regression. This improvement is clearly demonstrated in our numerical studies. Finally, PSVM establishes a firm connection between sufficient dimension reduction and the acclaimed machine learning technique, the support vector machine, both of which have been extensively used in high-dimensional data analysis. This combination brings fresh insights and further advances to both subjects. Along with the theoretical development of PSVM, we develop a more complete asymptotic theory for SVM than previously given, and introduce the notion of an invariant kernel for SVM.

Meanwhile, we expect some inherent advantages of SVM to benefit sufficient dimension reduction estimation. For instance, SVM tends to be more robust against outliers than a typical moment-based method, because the separating hyperplanes are largely determined by the support vectors lying in the interior of the data cloud; as a result, an observation far away from the data cloud has less influence on SVM than on a typical moment-based estimator. In this sense, SVM behaves more like a median than a mean. SVM is also expected to help address several challenging issues facing the existing SDR methods, such as the small-n-large-p problem and the presence of categorical predictors. However, due to limited space these potential advantages cannot be fully discussed within this paper. Some of them, such as robustness and categorical predictors, are further explored in Artemiou (2010).

The rest of the paper is organized as follows.
In Section 2, we illustrate the basic idea of PSVM by examples and figures, and give intuitions about why it works. In Section 3, we formally introduce the linear PSVM and study its population-level properties in terms of its unbiasedness as an estimator of the central subspace. In Section 4, we develop the estimation procedures for the linear PSVM, and describe how to implement it using standard SVM packages. In Section 5, we generalize the linear PSVM to the kernel PSVM to solve the nonlinear sufficient dimension reduction problem, and establish its unbiasedness in this general setting. In Section 6, we develop an algorithm to implement the kernel PSVM, and introduce the notion of an invariant kernel. In Section 7, we study the asymptotic properties of the linear PSVM estimator. In Section 8, we compare the linear and kernel PSVM with other dimension reduction methods in finite samples by simulation. In Section 9, we apply it to analyze a data set concerning the recognition of vowels, and make further comparisons in the practical setting. All the proofs are given in a supplementary document published online by The Annals of Statistics.

2. Principal support vector machine: The basic idea. The idea of the principal support vector machine arises from an interplay of several ideas: sliced inverse regression, support vector machine, and contour regression [Li, Zha and Chiaromonte (2005)]. In this section, we illustrate this idea by two simple examples that cover both linear and nonlinear dimension reduction. Throughout this paper, X represents a random vector; X_r represents the rth component of X; X_i represents the ith random vector from a sample X₁, …, Xₙ; and X_{ir} represents the rth component of X_i. First, consider the regression model

(3) Y = f(X₁ + 2X₂) + ε, where ε ⫫ (X₁, X₂).
This is a linear sufficient dimension reduction problem, in which the central subspace is spanned by (1, 2)ᵀ ∈ ℝ². Note that the contours of the regression function are the sets {(x₁, x₂) : x₁ + 2x₂ = c}, which are uniquely associated with the vector (1, 2)ᵀ. Based on this intuition, Li, Zha and Chiaromonte (2005) introduced contour regression, which estimates the contour directions by the directions in X that are aligned with the smallest increments in Y. Here, we propose to identify the contours by the separating hyperplanes derived from the support vector machine as applied to different slices of X, formed according to the values of Y. Let S₁ = {X_i : Y_i < c} and S₂ = {X_i : Y_i > c} for some constant c. We use SVM to obtain the optimal separating hyperplane of S₁ and S₂, and repeat the process to obtain several hyperplanes. Intuitively, the normals of these hyperplanes are roughly aligned with the directions in which the regression surface varies, which are the directions that form the central subspace. We use the principal components of these normals to estimate the central subspace. A related idea is Loh (2002), who proposed to divide each individual predictor according to the mean of Y and assess the importance of that predictor by its degree of separation.

As an illustration, we generate 100 observations from model (3), where f is taken to be the identity mapping. We divide X₁, …, X₁₀₀ into 4 slices according to the 25th, 50th and 75th sample quantiles of Y₁, …, Yₙ, as indicated in Figure 1 by differently colored dots. Application of SVM between these slices yields three hyperplanes, represented by the solid lines in the right panel, which closely resemble the contours derived from the true model, as shown in the left panel. Clearly, the normals of the three hyperplanes give a close estimate of the central subspace.

Fig. 1. Linear contours for model Y = X₁ + 2X₂ + ε. Left panel: true contours; right panel: contours based on linear SVM.
Contour levels are three evenly spaced sample quantiles of Y₁, …, Yₙ. The sample size is n = 100.

We can apply the same idea to nonlinear sufficient dimension reduction. Let

(4) Y = f(X₁ + X₂²) + ε,

where f is an unknown function. The contours of this function are of the form {(x₁, x₂) : x₁ + x₂² = c}, which are no longer hyperplanes in ℝ². However, if we map x to a higher-dimensional space of functions of x that is rich enough to contain x₁ + x₂², then the contours become hyperplanes again. We can apply SVM at that level to find the optimal hyperplanes, and then map them back to the x-space to extract the nonlinear predictor. Usually, in conjunction with mapping a low-dimensional regressor to a high-dimensional regressor, a Tikhonov-type regularization is applied, so that the overfitting tendency of the increased dimension is counteracted by the regularization.

As in the linear case, we generate 100 observations from model (4) and use the same set of quantiles to slice the response. The curves in the left panel of Figure 2 are the true contours computed from the function y = x₁ + x₂². Those in the right panel are obtained by first applying kernel SVM (with the Gaussian radial basis) to find hyperplanes in the high-dimensional feature space and then mapping them back to ℝ². Clearly, any function of (x₁, x₂) that generates the contours in the right panel would closely resemble the true predictor x₁ + x₂², modulo a monotone transformation.

Fig. 2. Nonlinear contours for model Y = X₁ + X₂² + ε. Left panel: true contours; right panel: contours based on kernel SVM with Gaussian radial kernel. Contour levels are three evenly spaced sample quantiles of Y₁, …, Yₙ. The sample size is n = 100.

3. PSVM for linear sufficient dimension reduction. We first develop PSVM for linear sufficient dimension reduction. We begin with a population-level formulation of SVM, since it is usually described at the sample level, which is not the best way to set up our problem.
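Before turning to the formal development, the linear illustration of Section 2 can be reproduced in a few lines. The sketch below is a Python/numpy stand-in for the R/kernlab route the paper actually uses: a crude subgradient solver (our own, hypothetical helper, not the paper's code) fits a soft-margin linear SVM between the two slices cut at the median of Y, and the normal vector of the resulting hyperplane is checked against the true direction (1, 2)ᵀ. The noise level 0.1 is our choice.

```python
import numpy as np

def svm_normal(X, labels, lam=1.0, lr=0.01, n_iter=2000):
    """Normal vector of a soft-margin linear SVM, fitted by subgradient
    descent on the hinge-loss form of the SVM objective.  A crude
    stand-in for a real SVM solver such as kernlab or libsvm."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # center the predictors
    psi, t = np.zeros(p), 0.0
    for _ in range(n_iter):
        viol = labels * (Xc @ psi - t) < 1.0     # points with active hinge
        g_psi = 2.0 * psi - lam * (labels[viol, None] * Xc[viol]).sum(0) / n
        g_t = lam * labels[viol].sum() / n
        psi, t = psi - lr * g_psi, t - lr * g_t
    return psi

# Two slices S1, S2 cut at the median of Y; model (3) with f the identity.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)
labels = np.where(Y > np.median(Y), 1.0, -1.0)

psi = svm_normal(X, labels)
direction = psi / np.linalg.norm(psi)
true_dir = np.array([1.0, 2.0]) / np.sqrt(5.0)
print(abs(direction @ true_dir))                 # close to 1
```

As the text describes, the normal of the separating hyperplane is closely aligned with the direction spanning the central subspace, even though no regression function is ever estimated.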
For now, assume Y to be a binary random variable taking values −1 and 1. The soft-margin SVM is defined through the following optimization: minimize, among (ψ, t, ξ) ∈ ℝᵖ × ℝ × ℝⁿ,

(5) ψᵀψ + (λ/n) Σᵢ₌₁ⁿ ξᵢ, subject to ξᵢ ≥ 0, Yᵢ[ψᵀ(Xᵢ − X̄) − t] ≥ 1 − ξᵢ, i = 1, …, n,

where λ is a positive constant often referred to as the "cost." See Vapnik [(1998), page 411] for the intuitions behind this construction. If (ψ*, t*, ξ*) is the solution to (5), then the set {x : ψ*ᵀ(x − X̄) = t*} is the optimal hyperplane that separates {Xᵢ : Yᵢ = −1} and {Xᵢ : Yᵢ = 1}.

Although this representation defines the algorithm, it does not tell us what objective function is minimized at the population level. To see things more clearly, we first carry out the optimization for a fixed (ψ, t). This amounts to minimizing Σᵢ₌₁ⁿ ξᵢ subject to ξᵢ ≥ max{0, 1 − Yᵢ[ψᵀ(Xᵢ − X̄) − t]}. The optimal solution is ξᵢ* = {1 − Yᵢ[ψᵀ(Xᵢ − X̄) − t]}⁺, where a⁺ = max(a, 0). Substituting ξ* into (5), we have

(6) ψᵀψ + (λ/n) Σᵢ₌₁ⁿ {1 − Yᵢ[ψᵀ(Xᵢ − X̄) − t]}⁺.

This corresponds to the following objective function at the population level:

(7) ψᵀψ + λE{1 − Y[ψᵀ(X − EX) − t]}⁺.

The hyperplane that minimizes this criterion can be viewed as that which best separates the conditional distributions of X | Y = −1 and X | Y = 1. Jiang, Zhang and Cai (2008) used a slight variation of representation (7) to derive the asymptotic distribution of SVM. We also use a representation similar to (7), but with two important modifications, as we describe below.

Return now to the sufficient dimension reduction problem (1), where Y is an arbitrary random variable (in particular, it can be either continuous or categorical). Let Ω_Y be the support of Y, and let A₁ and A₂ be disjoint subsets of Ω_Y. Let Ỹ be the discrete random variable defined by

(8) Ỹ = I(Y ∈ A₂) − I(Y ∈ A₁).

We introduce the following objective function for linear SDR:

(9) L(ψ, t) = ψᵀΣψ + λE{1 − Ỹ[ψᵀ(X − EX) − t]}⁺,

where Σ = var(X). Compared with (7), we have made two modifications.
First, we allow Ỹ to take the value 0, so that we can use a pair of disjoint subsets that are not a partition of Ω_Y. Second, we have inserted Σ in the first term of (9). This is so that the objective function transforms in a desired manner. We will return to this point in Section 6.

We now establish the unbiasedness of the normal vector of the optimal separating hyperplane in SVM as an estimator of the central subspace. Let Fₙ be the empirical distribution based on the sample (X₁, Y₁), …, (Xₙ, Yₙ), F₀ be the true distribution of (X, Y), and T be a statistic that can be expressed as a matrix-valued function of the distribution of (X, Y). In our context, we say T(Fₙ) is an unbiased estimator of S_{Y|X} if it satisfies

(10) span[T(F₀)] ⊆ S_{Y|X}.

Theorem 1. Suppose E(X | ηᵀX) is a linear function of ηᵀX, where η is as defined in (1). If (ψ*, t*) minimizes the objective function (9) among all (ψ, t) ∈ ℝᵖ × ℝ, then ψ* ∈ S_{Y|X}.

The linearity condition on E(X | ηᵀX) in the theorem is well known and generally assumed in the SDR literature. See, for example, Li and Duan (1989), Li (1991) and Li and Dong (2009). It implies

(11) E(ψᵀX | ηᵀX) = ψᵀP_η(Σ)ᵀX,

where P_η(Σ) is the projection matrix η(ηᵀΣη)⁻¹ηᵀΣ [Cook (1998)]. It is satisfied when X is elliptically symmetric [Eaton (1986)], and is approximately satisfied when p is large [Hall and Li (1993)]. Interestingly, as we show in Section 5, this assumption is no longer needed for the unbiasedness in the more general setting of nonlinear sufficient dimension reduction.

Here we note that, though Theorem 1 is far from a trivial generalization, the type of argument used in the proof is somewhat standard in the SDR literature. See, for example, Li and Duan (1989), Cook (1998) and Cook and Li (2002). It is possible to extend the above theorem to more general objective functions.
For example, the theorem still holds if a ↦ a⁺ in the objective function is replaced by any convex function u(a).

4. Estimation procedure for linear PSVM.

4.1. Estimation. We propose two ways to generate the set of pairs of slices for PSVM. One, which we call "left versus right" (LVR), repeatedly divides the predictors into two groups according to a set of cutting points for the response. The other, which we call "one versus another" (OVA), partitions the predictors into several slices and pairs up all possible slices. We summarize the estimation procedure as follows.

1. Compute the sample mean X̄ and the sample variance matrix Σ̂.

2. (LVR) Let q_r, r = 1, …, h − 1, be h − 1 dividing points. For example, they can be equally spaced sample percentiles of {Y₁, …, Yₙ}. Let

(12) Ỹᵢʳ = I(Yᵢ > q_r) − I(Yᵢ < q_r),

and let (ψ̂_r, t̂_r), r = 1, …, h − 1, be the minimizer of

(13) ψᵀΣ̂ψ + λEₙ{1 − Ỹʳ[(X − X̄)ᵀψ − t]}⁺.

2′. (OVA) Apply SVM to each pair of slices from the h slices. More specifically, let q₀ = min{Y₁, …, Yₙ} and q_h = max{Y₁, …, Yₙ}. For each (r, s) satisfying 1 ≤ r < s ≤ h, let

Ỹᵢʳˢ = I(q_{s−1} < Yᵢ ≤ q_s) − I(q_{r−1} < Yᵢ ≤ q_r).

Let (ψ̂_{rs}, t̂_{rs}) be the minimizer of the objective function ψᵀΣ̂ψ + λEₙ{1 − Ỹʳˢ[(X − X̄)ᵀψ − t]}⁺.

3. Let v̂₁, …, v̂_d be the d leading eigenvectors of either one of the matrices

(14) Mₙ = Σ_{r=1}^{h−1} ψ̂_r ψ̂_rᵀ or Mₙ = Σ_{r=1}^{h−1} Σ_{s=r+1}^{h} ψ̂_{rs} ψ̂_{rs}ᵀ.

We use the subspace spanned by (v̂₁, …, v̂_d) to estimate S_{Y|X}.

Based on our experience, LVR works best when the response is a continuous variable, where Y being larger or smaller has a concrete physical meaning; OVA works best when the response is categorical, where the values of Y are simply labels of classes, such as the different vowels in our example in Section 9. Our numerical studies also suggest that the estimation results are not overly sensitive to the choice of the number of slices h, though a larger h often works better.
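The LVR steps above can be sketched end to end. This is again a numpy sketch with a subgradient stand-in (our own helper names) for the package-based SVM solver discussed next; note the Σ̂ in the quadratic term of (13), which is the modification that distinguishes PSVM from the standard SVM objective.

```python
import numpy as np

def minimize_13(X, ytil, lam=1.0, lr=0.01, n_iter=2000):
    """Subgradient descent on (13): psi' Sigma_hat psi + lam * E_n(hinge).
    A crude stand-in for a package-based SVM solver."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)              # sample variance matrix
    psi, t = np.zeros(p), 0.0
    for _ in range(n_iter):
        viol = ytil * (Xc @ psi - t) < 1.0       # active hinge terms
        g_psi = 2.0 * Sigma @ psi - lam * (ytil[viol, None] * Xc[viol]).sum(0) / n
        g_t = lam * ytil[viol].sum() / n
        psi, t = psi - lr * g_psi, t - lr * g_t
    return psi

def linear_psvm_lvr(X, Y, h=4, lam=1.0):
    """Steps 2-3 of the LVR procedure: one SVM per dividing point q_r,
    then the leading eigenvectors of M_n in (14)."""
    cuts = np.quantile(Y, np.arange(1, h) / h)   # h - 1 dividing points
    psis = [minimize_13(X, np.where(Y > q, 1.0, -1.0), lam) for q in cuts]
    M = sum(np.outer(v, v) for v in psis)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, ::-1]                         # decreasing eigenvalue order

# d = 1 example: the central subspace is spanned by (1, 2, 0, 0)'.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
Y = X[:, 0] + 2.0 * X[:, 1] + 0.2 * rng.normal(size=300)
V = linear_psvm_lvr(X, Y)
b = np.array([1.0, 2.0, 0.0, 0.0]) / np.sqrt(5.0)
print(abs(V[:, 0] @ b))                          # close to 1
```

The leading eigenvector of Mₙ recovers the spanning direction of the central subspace, while the remaining eigenvalues are comparatively negligible.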
Standard packages for SVM minimize the objective function (6) instead of (13). However, they can be modified to suit our procedure. Let ζ = Σ̂^{1/2}ψ and Zᵢ = Σ̂^{−1/2}(Xᵢ − X̄). Then (13) becomes

(15) ζᵀζ + λEₙ{1 − Ỹʳ[Zᵀζ − t]}⁺.

We can apply standard packages to minimize (15) to obtain ζ̂, whose transformation Σ̂^{−1/2}ζ̂ is the desired minimizer of (13). We use the kernlab package in R to solve problem (15). See Karatzoglou and Meyer (2006) for an exposition of this package.

4.2. Order determination. Estimating the dimension d of the central subspace is a vital ingredient of sufficient dimension reduction estimation. Here, we propose a cross-validated BIC procedure [Schwarz (1978)] for this purpose. The BIC component of this procedure is an extension of a criterion introduced by Wang and Yin (2008), and is also related to Zhu, Miao and Peng (2006). We refer to this combined procedure as CVBIC.

Let Mₙ be one of the matrices in (14), and let λᵢ(Mₙ) be its ith largest eigenvalue. Let Gₙ(k) = Σᵢ₌₁ᵏ λᵢ(Mₙ) − c₁(n)c₂(k), where c₁(n) is a sequence of positive numbers or random variables that converge (in probability) to 0, and c₂(k) is a nonrandom increasing function of k. Let d̂ be the maximizer of Gₙ(k) over k ∈ {0, …, p}. In Section 7, we show that P(d̂ = d) → 1. The standard choices of c₁(n) and c₂(k) are c₁(n) ∝ n^{−1/2} log(n) and c₂(k) = k, so that the penalty term is c₀ n^{−1/2} log(n) k, where c₀ > 0 is a constant (or random variable) of order O(1) [or O_P(1)]. Since the eigenvalues λᵢ(Mₙ) may differ for different problems, it is sensible to make c₀ comparable to their magnitude. One reasonable choice is to make c₀ proportional to λ₁(Mₙ), leading to the following BIC-type criterion:

(16) Σᵢ₌₁ᵏ λᵢ(Mₙ) − a λ₁(Mₙ) n^{−1/2} log(n) k.

We now turn to the choice of a. Though this choice does not affect the consistency of d̂, it does affect its finite-sample performance. Moreover, from our experience this choice is also sensitive to p, d, and the regression model.
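The criterion (16) itself is simple to compute once the eigenvalues of Mₙ are in hand; only the constant a requires care. A minimal sketch (the function name is ours, and the spectrum below is synthetic, chosen to have a clear gap after the second eigenvalue):

```python
import numpy as np

def order_bic(eigvals, n, a):
    """Maximize (16): the sum of the k largest eigenvalues of M_n minus
    the penalty a * lambda_1 * n^{-1/2} * log(n) * k, over k in {0, ..., p}."""
    eigvals = np.sort(np.asarray(eigvals))[::-1]
    pen = a * eigvals[0] * np.log(n) / np.sqrt(n)
    G = [eigvals[:k].sum() - pen * k for k in range(len(eigvals) + 1)]
    return int(np.argmax(G))

# A spectrum with two dominant eigenvalues: the criterion picks d_hat = 2.
d_hat = order_bic([5.0, 3.0, 0.01, 0.005], n=500, a=0.1)
print(d_hat)  # 2
```

In the CVBIC procedure described next, this maximization is run on the training-set matrix for each candidate a, and a is chosen by test-set misclassification.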
For these reasons, it is important to have a systematic way of choosing a. The SVM used in our setting naturally suggests cross-validation, because the former provides a set of labels to validate. We outline the CVBIC procedure as follows, using LVR as an illustration.

First, divide the data into a training set and a testing set, denoted by

{(X₁, Y₁), …, (X_{n₁}, Y_{n₁})}, {(X̃₁, Ỹ₁), …, (X̃_{n₂}, Ỹ_{n₂})}.

Apply the PSVM to the training set with dividing points q₁, …, q_{h−1} to obtain a set of normal vectors ψ̂₁, …, ψ̂_{h−1}. Let M_{n₁} = Σ_{r=1}^{h−1} ψ̂_r ψ̂_rᵀ. Second, for a fixed a, maximize the criterion (16), with Mₙ replaced by M_{n₁}, to obtain an integer k̂. Let v̂₁, …, v̂_{k̂} be the k̂ leading eigenvectors of M_{n₁}, and transform the testing predictors X̃ᵢ to X̃ᵢ* = (v̂₁, …, v̂_{k̂})ᵀX̃ᵢ, i = 1, …, n₂. Third, let Lᵢ = I(Ỹᵢ > q_r) − I(Ỹᵢ < q_r) be the true label of Ỹᵢ in the testing set. Apply SVM to (X̃₁*, L₁), …, (X̃_{n₂}*, L_{n₂}) to predict L̂₁, …, L̂_{n₂}. Repeat this process for all dividing points and record the total number of misclassifications. The optimal a is the one that minimizes the total number of misclassifications. Finally, substitute the optimal a into (16) and maximize it again using the full data to estimate d. In Section 8.3, we investigate the numerical performance of CVBIC under a variety of combinations of p, d, n and regression models.

4.3. Special features of linear PSVM. As we conclude the exposition of the linear PSVM, we mention some special features of this method. One is that it shares a similar limitation with SIR when dealing with regression functions that are symmetric about the origin. If the regression function is f(‖X‖), then all slices of the form {Xᵢ : Yᵢ ∈ S} are roughly concentric spheres in ℝᵖ, which no hyperplane in ℝᵖ can separate. However, as we shall see in Sections 5 and 8, this is remedied by the kernel PSVM, because when mapped into the higher-dimensional feature space the slices become linearly separable again.
Another is that, when dealing with asymmetric regression functions, the linear PSVM tends to work better than SIR, for the following reason. Recall that SIR is based, roughly, on the principal components of the slice mean vectors of the form E(X | Y ∈ S) − E(X), where S is an interval in Ω_Y. This means that it downweights the slice means near the center of the data cloud, where the Euclidean norm of E(X | Y ∈ S) − E(X) is smaller. However, it is well known that the regression function E(Y | X) tends to be more accurately estimated near the center of the data cloud [see, e.g., Kutner, Nachtsheim and Neter (2004), Section 2.4]. In comparison, the linear PSVM relies on the normals of the separating hyperplanes of the slices, which does not downweight the data near the center. As we will see from our simulation studies in Section 8, this brings substantial improvement to the estimate. We should point out, however, that there is an important exception. As shown in Cook (2007) and Cook and Forzani (2008), under the assumption that Y has a finite support and X | Y has a conditional multivariate normal distribution where var(X | Y) is independent of Y, SIR is the maximum likelihood estimate of the central subspace. In this case, no regular estimate can be more efficient than SIR. The mentioned advantage of linear PSVM applies mainly to the forward regression setting, where the conditional distribution of X | Y is typically non-Gaussian.

5. Kernel PSVM for nonlinear dimension reduction. In this section, we extend the PSVM to nonlinear sufficient dimension reduction as defined by (2). We first develop the objective function by generalizing the linear PSVM objective function (9), and then establish the unbiasedness of the proposed nonlinear PSVM estimator.

Before proceeding further, we note that the function φ in relation (2) is not unique in the strict sense, but is unique modulo injective transformations.
Again, the situation is parallel to the linear sufficient dimension reduction problem (1), where ηᵀX is only unique modulo injective linear transformations. Any injective linear transformation of ηᵀX is an equivalent linear predictor, because it does not change the linear subspace. Likewise, for nonlinear SDR, any injective transformation of φ is an equivalent sufficient predictor, because it does not change the conditional independence (2).

Let H be a Hilbert space of functions of X. In analogy to the linear objective function (9), consider Λ : H × ℝ → ℝ⁺ defined by

(17) Λ(ψ, t) = var[ψ(X)] + λE[1 − Ỹ(ψ(X) − Eψ(X) − t)]⁺,

where Ỹ is as defined in (8). To see that this is indeed a generalization of (9), consider the bilinear form from H × H to ℝ defined by b(f₁, f₂) = cov[f₁(X), f₂(X)]. Under the assumption that the mapping

(18) H → L₂(P_X), f ↦ f

is continuous, the bilinear form b induces a bounded and self-adjoint operator Σ : H → H such that ⟨f₁, Σf₂⟩_H = b(f₁, f₂), where ⟨·, ·⟩_H is the inner product in H. See, for example, Conway (1990), Theorem 2.2, and Fukumizu, Bach and Jordan (2004). The objective function (17) can now be rewritten as

(19) Λ(ψ, t) = ⟨ψ, Σψ⟩_H + λE[1 − Ỹ(ψ(X) − Eψ(X) − t)]⁺.

Thus, Λ(ψ, t) is a generalization of L(ψ, t), with the matrix Σ replaced by the operator Σ, the linear function ψᵀX replaced by an arbitrary function ψ in H, and the inner product in ℝᵖ replaced by the inner product in H. For the usual kernel SVM, the population-level objective function is

⟨ψ, ψ⟩_H + λE[1 − Ỹ(ψ(X) − Eψ(X) − t)]⁺.

Comparing with (19), we see a modification parallel to the linear case. The significance of this modification is further discussed in Section 6.

We now establish that, if (ψ*, t*) is the minimizer of Λ(ψ, t), then ψ* is necessarily a function of the sufficient predictor φ(X) in the nonlinear problem (2). This is a generalization of the notion of unbiasedness in the linear setting.
Our definition of unbiasedness (10) in the linear sufficient dimension reduction setting is equivalent to

(20) [T(F₀)]ᵀX is a linear function of ηᵀX.

It is the statement (20) that is more readily generalized to the nonlinear sufficient dimension reduction setting: we simply require ψ to be a function of the sufficient predictor φ(X) in (2). The following definition makes this notion rigorous. For a generic random element U, let σ{U} denote the σ-field generated by U.

Definition 1. A function ψ ∈ H is unbiased for nonlinear sufficient dimension reduction (2) if it has a version that is measurable with respect to σ{φ(X)}.

The reason that we only require a version of ψ to be measurable σ{φ(X)} is that the L₂-metric ignores sets of measure zero.

Theorem 2. Suppose the mapping (18) is continuous and:
1. H is a dense subset of L₂(P_X);
2. Y ⫫ X | φ(X).
If (ψ*, t*) minimizes (19) among all (ψ, t) ∈ H × ℝ, then ψ*(X) is unbiased.

Condition 1 is satisfied by some commonly used reproducing kernel Hilbert spaces. For example, if H is a reproducing kernel Hilbert space based on the Gaussian radial basis, then the collection of functions {c + g : c ∈ ℝ, g ∈ H} is dense in L₂(P_X). See Fukumizu, Bach and Jordan (2009). It is important to note that in this more general setting we no longer require any linearity assumption resembling the one assumed in Theorem 1. In contrast, the kernel sliced inverse regression developed by Wu (2008) and Wu, Liang and Mukherjee (2008), and the functional sliced inverse regression of Hsing and Ren (2009), all require a version of the linearity condition to hold in the reproducing kernel Hilbert space.

The notion of unbiasedness for sufficient dimension reduction is more akin to Fisher consistency than to unbiasedness in the classical setting.
While unbiasedness in the classical setting can exclude many useful statistics, Fisher consistency often guarantees correct asymptotic behavior without putting undue restrictions on the estimator. Moreover, an estimator that is not Fisher consistent is clearly undesirable, because it is guaranteed not to converge to the true parameter. For these reasons unbiasedness for linear SDR is a useful criterion, even though some useless estimators (such as 0) are unbiased. Unbiasedness for nonlinear SDR plays the parallel role, except that it only requires the estimator to be an arbitrary, rather than a linear, function of the true predictor. This relaxation also allows us to establish the unbiasedness of PSVM without invoking the linearity condition.

Theorem 2 assumes that Λ(ψ, t) attains its minimum in H × ℝ. We think this is a reasonable assumption for the following reasons. As shown below, Λ(ψ, t) is lower semicontinuous with respect to the weak topology in H × ℝ. Since any closed, bounded, and convex set in a Hilbert space is compact with respect to the weak topology [Weidmann (1980), Theorem 4.25; Conway (1990), Corollary V.1.5], by the generalized Weierstrass theorem [Kurdila and Zabarankin (2005), Section 7.3], Λ(ψ, t) attains its minimum within any such set in H × ℝ. The next proposition establishes this fact. Let H′ be the Hilbert space H × ℝ endowed with the inner product ⟨(ψ₁, t₁), (ψ₂, t₂)⟩ = ⟨ψ₁, ψ₂⟩_H + t₁t₂.

Proposition 1. If H is an RKHS with its kernel κ satisfying Eκ(X, X) < ∞, then Λ(ψ, t) is lower semicontinuous with respect to the weak topology in H′, and attains its minimum in any closed, bounded, and convex set in H′.

6. Estimation of kernel PSVM and invariant kernel. The purpose of this section is twofold. First, because we have modified ⟨ψ, ψ⟩_H to ⟨ψ, Σψ⟩_H in the kernel SVM objective function, we can no longer use the standard SVM packages to solve for ψ*.
Therefore, we reformulate the minimization of Λ(ψ, t) as a quadratic programming problem that can be solved by available computer packages. Second, in deriving this quadratic programming problem, we gain more insight into the meaning and significance of this modification. As we shall see, by replacing ⟨ψ, ψ⟩_H by ⟨ψ, Σψ⟩_H, we are in effect making SVM invariant with respect to the marginal distribution of X. Intuitively, since we are using SVM to make inference about the conditional distribution of Y | X, it is plausible that the procedure should not depend on the marginal distribution of X.

Let H̃ be a linear space of functions from Ω_X to ℝ spanned by Fₙ = {ψ₁, …, ψ_k}. The choice of these functions will be discussed later, but it will ensure that Eₙ[ψᵢ(X)] = 0; that is, each basis function is centered by its sample mean. Let

(21) Ψ = ( ψ₁(X₁) ⋯ ψ_k(X₁) ; ⋮ ⋱ ⋮ ; ψ₁(Xₙ) ⋯ ψ_k(Xₙ) ).

Then the sample version of the objective function (19) is

(22) Λₙ(c, t) = n⁻¹cᵀΨᵀΨc + λn⁻¹ Σᵢ₌₁ⁿ [1 − Ỹᵢ(ψ(Xᵢ)ᵀc − t)]⁺,

where ψ(Xᵢ) = (ψ₁(Xᵢ), …, ψ_k(Xᵢ))ᵀ and c ∈ ℝᵏ. We minimize Λₙ(c, t) among all (c, t). In the following, y = (Ỹ₁, …, Ỹₙ)ᵀ and α, β, ξ ∈ ℝⁿ. The symbol ≤, applied to vectors, represents componentwise inequality. The symbol ⊙ represents the Hadamard product between matrices. For a matrix A of full column rank, P_A is the projection A(AᵀA)⁻¹Aᵀ. The symbols 0 and 1 represent, respectively, the n-dimensional vectors whose entries are 0 and 1.

Theorem 3. If c* minimizes Λₙ(c, t) over ℝᵏ, then c* = ½(ΨᵀΨ)⁻¹Ψᵀ(y ⊙ α*), where α* is the solution to the quadratic programming problem:

(23) maximize 1ᵀα − ¼(α ⊙ y)ᵀP_Ψ(α ⊙ y), subject to 0 ≤ α ≤ λ1, αᵀy = 0.

Note that the quadratic programming problem (23) differs from that of the standard kernel SVM, where the projection P_Ψ is replaced by the kernel matrix Kₙ = {κ(Xᵢ, Xⱼ) : i, j = 1, …, n} for some positive definite bivariate mapping κ : Ω_X × Ω_X → ℝ.
The kernel matrix $K_n$ uniquely determines the sample estimate of the covariance operator $\Sigma$, which bears the information about the shape of the marginal distribution of $X$. By replacing $K_n$ with $P_\Psi$, we are, in effect, removing the information about $X$. For this reason we call the matrix $P_\Psi$ an invariant kernel.

For the function class $\mathcal{H}$, we use the reproducing kernel Hilbert space based on the mapping $\kappa$. Common choices of $\kappa$ include the polynomial kernel $\kappa(x_1, x_2) = (x_1^{\mathsf T}x_2 + c)^r$, where $r$ is a positive integer, and the Gaussian radial kernel $\kappa(x_1, x_2) = e^{-\gamma\|x_1 - x_2\|^2}$, where $\gamma > 0$. Let
$$\mathcal{H}_\kappa = \{c_0 + c_1\kappa(\cdot, X_1) + \cdots + c_n\kappa(\cdot, X_n): c_0, \dots, c_n \in \mathbb{R}\} \qquad (24)$$
with inner product specified by $\langle\kappa(\cdot, a), \kappa(\cdot, b)\rangle = \kappa(a, b)$. In the standard kernel SVM, it is a common practice to use all functions in $\mathcal{H}_\kappa$ as $\mathcal{H}$. However, the invariant nature of our kernel, $P_\Psi$, determines that we cannot use all those functions, because if we did then $P_\Psi$ would become nearly an identity matrix (note that if $P_\Psi$ were an identity matrix then the objective function in (23) would become independent of $X_1, \dots, X_n$). We instead use the principal functions of the linear operator $\Sigma_n$, as defined by $\langle\psi_1, \Sigma_n\psi_2\rangle = \mathrm{cov}_n[\psi_1(X), \psi_2(X)]$, as our basis $\mathcal{F}_n$. Here $\mathrm{cov}_n(\cdot,\cdot)$ denotes sample covariance. Let $Q_n = I_n - J_n/n$, where $I_n$ is the $n \times n$ identity matrix and $J_n$ is the $n \times n$ matrix whose entries are 1. The next proposition tells us how to find the eigenfunctions of $\Sigma_n$. Its proof is easy and omitted.

Proposition 2. Let $w = (w_1, \dots, w_n)^{\mathsf T}$, $\psi_w = \sum_{i=1}^n w_i[\kappa(\cdot, X_i) - E_n\kappa(\cdot, X)]$. The following statements are equivalent:
1. $w$ is an eigenvector of the matrix $Q_n K_n Q_n$ with eigenvalue $\lambda$;
2. $\psi_w$ is an eigenfunction of the operator $\Sigma_n$ with eigenvalue $\lambda/n$.
If $\lambda \ne 0$, then either statement implies $(\psi_w(X_1), \dots, \psi_w(X_n)) = \lambda w^{\mathsf T}$.

Although the eigenvectors of $Q_n K_n Q_n$ and the eigenfunctions of $\Sigma_n$ are similar objects, it is the latter that can be evaluated at any $x$, not just the observed $X_1, \dots, X_n$. This property is important for prediction.
Essentially, we use the first $k$ eigenfunctions $\phi_1, \dots, \phi_k$ of $\Sigma_n$ as the functions in $\mathcal{F}_n$. This is equivalent to using $\{a_1\phi_1, \dots, a_k\phi_k\} = \{\psi_1, \dots, \psi_k\}$ for any nonzero $a_1, \dots, a_k$. We choose $a_i$ to satisfy $a_i(\phi_i(X_1), \dots, \phi_i(X_n))^{\mathsf T} = w_i$, where $w_i$ is the eigenvector of $Q_nK_nQ_n$ corresponding to its $i$th eigenvalue $\lambda_i$. Thus $a_i = 1/\lambda_i$. With this choice, $\Psi$ is simply $(w_1, \dots, w_k)$. The choice of the number of basis functions, $k$, should allow sufficient flexibility but not be as large as $n$; our experience indicates that a choice of $k$ in the range $n/3 \sim 2n/3$ works well.

We summarize the kernel PSVM estimation procedure as follows.
1. (Optional) Marginally standardize $X_1, \dots, X_n$. Let $\hat\mu_r$ and $\hat\sigma_r^2$ be the sample mean and sample variance of $X_{1r}, \dots, X_{nr}$. Reset $X_{ir}$ to be $(X_{ir} - \hat\mu_r)/\hat\sigma_r$. The purpose of this step is to make the kernel $\kappa$ treat the different components of $X_i$ more or less equally. This step can be omitted if the components of $X_i$ have similar variances.
2. Choose a kernel $\kappa$ and the number of basis functions $k$ (say $k = n/2$). Compute $\Psi = (w_1, \dots, w_k)$ and $P_\Psi$ from $Q_nK_nQ_n$.
3. Divide the sample according to LVR or OVA, each yielding a set of slices. For each pair of slices, solve the quadratic programming problem in Theorem 3 using the $P_\Psi$ computed in step 2. This gives coefficient vectors $c_1^*, \dots, c_{\tilde h}^* \in \mathbb{R}^k$, where $\tilde h = h-1$ for LVR and $\tilde h = \binom{h}{2}$ for OVA.
4. Compute the first $d$ eigenvectors, $v_1, \dots, v_d$, of the matrix $\sum_{s=1}^{\tilde h} c_s^* c_s^{*\mathsf T}$. Denote the $r$th component of $v_s$ as $v_{sr}$.
5. The $s$th sufficient predictor evaluated at $x$ is $v_{s1}\psi_1(x) + \cdots + v_{sk}\psi_k(x)$, where $\psi_r(x) = \lambda_r^{-1}\sum_{i=1}^n w_{ri}[\kappa(x, X_i) - E_n\kappa(x, X)]$. If step 1 is used, then $x$ should be marginally standardized by the $\hat\mu_r$ and $\hat\sigma_r$ computed in that step.

Many computing packages are available to solve the quadratic programming problem in step 3. We use the ipop program in the kernlab package in R. See Karatzoglou et al. (2004).
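As a concrete illustration of steps 1 and 2 above, the basis $\Psi = (w_1, \dots, w_k)$ and the invariant kernel $P_\Psi$ can be computed from the doubly centered Gram matrix $Q_nK_nQ_n$. The following is a minimal numerical sketch; the function name `psvm_basis` is ours, not from any package, and we use Python rather than the R code of the paper for brevity.

```python
import numpy as np

def psvm_basis(X, k, gamma):
    """Sketch of steps 1-2: the basis Psi = (w_1, ..., w_k) and the invariant
    kernel P_Psi from the doubly centered Gaussian Gram matrix Q_n K_n Q_n."""
    n = X.shape[0]
    # Gaussian radial kernel matrix K_n
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    Q = np.eye(n) - np.ones((n, n)) / n        # centering matrix Q_n = I - J_n/n
    QKQ = Q @ K @ Q
    vals, vecs = np.linalg.eigh(QKQ)            # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]            # top-k eigenvectors w_1, ..., w_k
    Psi = vecs[:, idx]
    lam = vals[idx]
    # invariant kernel: projection onto the column space of Psi
    P = Psi @ np.linalg.solve(Psi.T @ Psi, Psi.T)
    return Psi, P, lam
```

The returned $P_\Psi$ is the projection $\Psi(\Psi^{\mathsf T}\Psi)^{-1}\Psi^{\mathsf T}$ that replaces the kernel matrix in the quadratic program (23).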
If the Gaussian radial kernel is used in step 2, then we recommend choosing $\gamma$ as
$$\gamma = 1\Big/\Big[\frac{2}{n(n-1)}\sum_{j=2}^{n}\sum_{i<j}\|X_i - X_j\|\Big]^2. \qquad (25)$$
Alternatively, we can use the population version of the above quantity,
$$\gamma = 1/(E\|X - X'\|)^2, \qquad (26)$$
where $X$ and $X'$ are independent $N(0, I_p)$ random vectors. This quantity can be easily evaluated by Monte Carlo. In Section 8, we use (26) for large-scale simulations to avoid repeated evaluations of (25), whereas in Section 9 we use (25) for the real data analysis, where it needs to be calculated only once. Some authors recommend the sample median in (25). See Gretton et al. (2005) and Fukumizu, Bach and Jordan (2009). This does not make a significant difference in our examples.

7. Asymptotic analysis of linear PSVM. In this section, we give a comprehensive asymptotic analysis of the linear PSVM estimator introduced in Sections 3 and 4. This is developed in three parts. First, we derive the influence function for the normal vector $\psi$ based on two slices. In this part, we employ some asymptotic properties of SVM developed recently by Jiang, Zhang and Cai (2008). In the second part, we derive the asymptotic distribution of the linear PSVM estimator, $(v_1, \dots, v_d)$, defined in Section 4.1. In the third part, we establish the consistency of the order determination criterion introduced in Section 4.2.

7.1. Influence function for support vector machine. The asymptotic results of Jiang, Zhang and Cai (2008) are largely applicable here except in three places: our SVM involves an additional $\Sigma$; our $\lambda$ is fixed but the $\lambda$ in their paper depends on $n$; they did not derive the explicit form of the hessian matrix (and hence the asymptotic variance), but we are interested in the explicit asymptotic distribution. The first two points are minor but the third needs nontrivial additional work. We only consider the case where $\tilde Y$ is defined through a partition $\{A_1, A_2\}$ of $\Omega_Y$. Thus, our results only apply to the LVR scheme.
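The Monte Carlo evaluation of the population tuning constant (26) above is straightforward. A small sketch follows; the helper name `gamma_population` is ours.

```python
import numpy as np

def gamma_population(p, n_mc=200000, seed=0):
    """Monte Carlo evaluation of (26): gamma = 1 / (E ||X - X'||)^2
    for independent X, X' ~ N(0, I_p)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_mc, p))
    Xp = rng.standard_normal((n_mc, p))
    mean_dist = np.linalg.norm(X - Xp, axis=1).mean()
    return 1.0 / mean_dist**2
```

For $p = 10, 20, 30$ this reproduces the values 0.0526, 0.0257 and 0.0169 quoted in Section 8.2.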
The asymptotic analysis for the OVA scheme can be carried out similarly, and is omitted.

We first develop some notation. Let $\theta = (\psi^{\mathsf T}, t)^{\mathsf T}$, $Z = (X^{\mathsf T}, Y)^{\mathsf T}$, $X^* = (X^{\mathsf T}, -1)^{\mathsf T}$, $\Sigma^* = \mathrm{diag}(\Sigma, 0)$. Then
$$\psi^{\mathsf T}\Sigma\psi + \lambda[1 - Y(\psi^{\mathsf T}X - t)]_+ = \theta^{\mathsf T}\Sigma^*\theta + \lambda(1 - \theta^{\mathsf T}X^*Y)_+. \qquad (27)$$
We denote this function by $m(\theta, Z)$. Let $\Omega_Z$ be the support of $Z$ and let $h: \Theta \times \Omega_Z \to \mathbb{R}$ be a function of $(\theta, Z)$. Let $D_\theta$ denote the $(p+1)$-dimensional column vector of differential operators $(\partial/\partial\theta_1, \dots, \partial/\partial\theta_{p+1})^{\mathsf T}$. The next theorem gives the gradient of the support vector machine objective function $E[m(\theta, Z)]$.

Theorem 4. Suppose, for each $y = -1, 1$, the distribution of $X|Y = y$ is dominated by the Lebesgue measure and $E(\|X\|^2) < \infty$. Then
$$D_\theta E[m(\theta, Z)] = (2\psi^{\mathsf T}\Sigma, 0)^{\mathsf T} - \lambda E[X^*Y I(1 - \theta^{\mathsf T}X^*Y > 0)]. \qquad (28)$$

We now present the hessian matrix of the support vector machine, which leads to the asymptotic variance of $\hat\theta$. To our knowledge, this is the first time that the asymptotic variance is explicitly given. This result is then used to derive the asymptotic distribution of the linear PSVM estimator.

Theorem 5. Suppose $X$ has a convex and open support and its conditional distributions given $Y = 1$ and $Y = -1$ are dominated by the Lebesgue measure. Suppose, moreover:
1. for any linearly independent $\psi, \delta \in \mathbb{R}^p$, $y = -1, 1$, and $u \in \mathbb{R}$, the following function is continuous: $v \mapsto E(X^* \mid \psi^{\mathsf T}X = u, \delta^{\mathsf T}X = v, Y = y)\, f_{\psi^{\mathsf T}X|\delta^{\mathsf T}X, Y}(u|v, y)$;
2. for any $i = 1, \dots, p$, and $y = -1, 1$, there is a nonnegative function $c_i(v, y)$ with $E[c_i(V, Y)|Y] < \infty$ such that $|E(X_i \mid \psi^{\mathsf T}X = u, \delta^{\mathsf T}X = v, Y = y)|\, f_{\psi^{\mathsf T}X|\delta^{\mathsf T}X, Y}(u|v, y) \le c_i(v, y)$;
3. there is a nonnegative function $c_0(v, y)$ with $E[c_0(V, Y)|Y] < \infty$ such that $f_{\psi^{\mathsf T}X|\delta^{\mathsf T}X, Y}(u|v, y) \le c_0(v, y)$.
Then the function $\theta \mapsto D_\theta E[m(\theta, Z)]$ is differentiable in all directions with derivative matrix
$$2\,\mathrm{diag}(\Sigma, 0) + \lambda \sum_{y} P(Y = y)\, f_{\psi^{\mathsf T}X|Y}(t + y|y)\, E(X^*X^{*\mathsf T} \mid \psi^{\mathsf T}X = t + y).$$
Furthermore, if the function $(\psi, t) \mapsto f_{\psi^{\mathsf T}X|Y}(t + y|y)\, E(X^*X^{*\mathsf T} \mid \psi^{\mathsf T}X = t + y)$ is continuous, then $D_\theta E[m(\theta, Z)]$ is jointly differentiable with respect to $\theta$.
Joint differentiability and directional differentiability are sometimes referred to as Fréchet differentiability and Gâteaux differentiability. The latter is generally weaker than the former. In a finite-dimensional space, having continuous directional derivatives in all directions implies joint differentiability [Bickel et al. (1993), page 453]. The next theorem gives the influence function for the support vector machine.

Theorem 6. If the conditions in Theorems 4 and 5 are satisfied, then
$$\hat\theta = \theta_0 - H^{-1}\{(2\psi_0^{\mathsf T}\Sigma, 0)^{\mathsf T} - \lambda E_n[X^*Y I(1 - Y\theta_0^{\mathsf T}X^* > 0)]\} + o_p(n^{-1/2}),$$
where $H$ is the hessian matrix given by Theorem 5.

The proof is similar to that of Jiang, Zhang and Cai (2008) and is omitted. Alternatively, one can prove it by applying Theorem 5.23 of van der Vaart (1998).

7.2. Asymptotic distribution of $(v_1, \dots, v_d)$. Consider a fixed division point $q_r$, where $r \in \{1, \dots, h-1\}$. Let $\tilde Y^r$ be as defined in (12), and $Z^r = (X^{\mathsf T}, \tilde Y^r)^{\mathsf T}$. Let $\theta_{0r} = (\psi_{0r}^{\mathsf T}, t_{0r})^{\mathsf T}$ be the minimizer of $E[m(\theta, Z^r)]$, and $\hat\theta_r = (\hat\psi_r^{\mathsf T}, \hat t_r)^{\mathsf T}$ the minimizer of $E_n[m(\theta, Z^r)]$. Let $H_r$ be the hessian matrix of $E[m(\theta, Z^r)]$, and let $F_r$ be the first $p$ rows of $H_r^{-1}$. By Theorem 6,
$$\hat\psi_r = \psi_{0r} - E_n[s_r(\theta_{0r}, Z^r)] + o_p(n^{-1/2}), \qquad (29)$$
where $s_r(\theta, Z^r) = F_r[(2\psi^{\mathsf T}\Sigma, 0)^{\mathsf T} - \lambda X^*\tilde Y^r I(1 - \tilde Y^r\theta^{\mathsf T}X^* > 0)]$. Let
$$M_n = \sum_{r=1}^{h-1}\hat\psi_r\hat\psi_r^{\mathsf T}, \qquad M_0 = \sum_{r=1}^{h-1}\psi_{0r}\psi_{0r}^{\mathsf T}.$$
For a matrix $A \in \mathbb{R}^{r_1 \times r_2}$, let $K_{r_1, r_2} \in \mathbb{R}^{r_1 r_2 \times r_1 r_2}$ be the commutation matrix defined by the relation $K_{r_1, r_2}\mathrm{vec}(A) = \mathrm{vec}(A^{\mathsf T})$. See Magnus and Neudecker (1979). Two properties of $K_{r_1, r_2}$ that will prove useful for our purpose are that $K_{r_1, r_2}^{\mathsf T} = K_{r_2, r_1}$ and that, for any $B \in \mathbb{R}^{r_3 \times r_4}$,
$$A \otimes B = K_{r_1, r_3}(B \otimes A)K_{r_4, r_2}. \qquad (30)$$
We now present the asymptotic distribution of $M_n$.

Theorem 7. Under the assumptions in Theorems 4 and 5, $\sqrt n\,\mathrm{vec}(M_n - M_0)$ converges to a multivariate normal with mean $0$ and variance
$$(I_{p^2} + K_{p,p})\sum_{r=1}^{h-1}\sum_{t=1}^{h-1}(\psi_{0r}\psi_{0t}^{\mathsf T} \otimes \Lambda_{rt})(I_{p^2} + K_{p,p}),$$
where $\Lambda_{rt} = E[s_r(\theta_{0r}, Z^r)s_t^{\mathsf T}(\theta_{0t}, Z^t)]$.

This result leads directly to the asymptotic distribution of $\hat V = (v_1, \dots, v_d)$.
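The commutation matrix used in Theorem 7 and in property (30) can be constructed elementwise from its defining relation $K_{r_1,r_2}\mathrm{vec}(A) = \mathrm{vec}(A^{\mathsf T})$. A short sketch (the helper name is ours):

```python
import numpy as np

def commutation_matrix(r1, r2):
    """Commutation matrix K_{r1,r2}: K vec(A) = vec(A^T) for A in R^{r1 x r2},
    where vec stacks columns."""
    K = np.zeros((r1 * r2, r1 * r2))
    for i in range(r1):
        for j in range(r2):
            # vec(A) places A[i, j] at position j*r1 + i;
            # vec(A^T) places it at position i*r2 + j.
            K[i * r2 + j, j * r1 + i] = 1.0
    return K
```

With this construction one can check numerically both $K_{r_1,r_2}^{\mathsf T} = K_{r_2,r_1}$ and the identity (30).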
Since, by Theorem 1, $\mathrm{span}(M_0) \subseteq \mathcal{S}_{Y|X}$, we have $\mathrm{rank}(M_0) \le d$. We make the working assumption that $\mathrm{rank}(M_0) = d$. This means we exclude the situations where the regression surface is symmetric about the origin. Since $M_0$ is positive semidefinite, it has the spectral decomposition $UDU^{\mathsf T}$, where $U$ is a $p \times d$ matrix whose columns are the eigenvectors of $M_0$ corresponding to nonzero eigenvalues, and $D$ is a $d \times d$ diagonal matrix with diagonal elements being the nonzero eigenvalues. The following corollary is a direct consequence of Theorem 7 and Bura and Pfeiffer (2008). Its proof is omitted.

Corollary 1. Under the assumptions in Theorems 4 and 5 and $\mathrm{rank}(M_0) = d$, $\sqrt n\,\mathrm{vec}(\hat V - V_0) \to N(0, \Gamma)$, where $\Gamma$ is the $pd \times pd$ matrix
$$(D^{-1}U^{\mathsf T} \otimes I_p)(I_{p^2} + K_{p,p})\sum_{r=1}^{h-1}\sum_{t=1}^{h-1}(\psi_{0r}\psi_{0t}^{\mathsf T} \otimes \Lambda_{rt})(I_{p^2} + K_{p,p})(UD^{-1} \otimes I_p).$$

It is possible to refine the PSVM estimator by introducing weights to $M_n$. Take the LVR scheme for example. Let $\hat\Psi = (\hat\psi_1, \dots, \hat\psi_{h-1})$. Let $A$ be an $(h-1) \times (h-1)$ matrix. Rather than working with $M_n$, we could base the spectral decomposition on a weighted matrix $M_n(A) = \hat\Psi A \hat\Psi^{\mathsf T}$. Let $\hat V(A) = (v_1(A), \dots, v_d(A))$ be the first $d$ eigenvectors of $M_n(A)$. One way to determine the optimal $A$ is by minimizing a real-valued monotone function (say the trace) of the asymptotic variance matrix of $\mathrm{vec}[\hat V(A)]$, which can be extracted from the asymptotic distribution. This type of argument was used in Li (2000, 2001) to construct optimal estimating equations. Alternatively, one can develop an optimal procedure using the minimum distance approach introduced by Cook and Ni (2005). We leave these to future research.

7.3. Consistency of the BIC-type criterion. In the following, we say a sequence of random variables $W_n$ converges in probability to infinity ($W_n \xrightarrow{P} \infty$) if, for any $K > 0$, $\lim_{n\to\infty} P(|W_n| > K) = 1$. Let $\hat d$ be the maximizer of $G_n(k)$ over $\{0, \dots, p\}$ as defined in Section 4.2.

Theorem 8.
Suppose $P(c_1(n) > 0) = 1$, $c_1(n) \to 0$, $n^{1/2}c_1(n) \to \infty$, and $c_2(k)$ is an increasing function of $k$. Under the conditions in Theorems 1, 4 and 5 and $\mathrm{rank}(M_0) = d$, we have $\lim_{n\to\infty} P(\hat d = d) = 1$.

Note that we have again made the working assumption $\mathrm{rank}(M_0) = d$. However, even when this assumption is violated the theorem still holds with $d$ replaced by the rank of $M_0$.

8. Simulation studies. In this section, we compare the linear and kernel PSVM with four other methods based on the idea of inverse regression: SIR, the sliced average variance estimator [SAVE; Cook and Weisberg (1991)], directional regression [DR; Li and Wang (2007)], and kernel sliced inverse regression [KSIR; Wu (2008)]. We also investigate the performance of the CVBIC for order determination.

8.1. Linear dimension reduction. We use the following models:
Model I: $Y = X_1/[0.5 + (X_2 + 1)^2] + \sigma\varepsilon$,
Model II: $Y = X_1(X_1 + X_2 + 1) + \sigma\varepsilon$,
Model III: $Y = (X_1^2 + X_2^2)^{1/2}\log(X_1^2 + X_2^2)^{1/2} + \sigma\varepsilon$,
where $X \sim N(0, I_p)$, $p = 10, 20, 30$, $\varepsilon \sim N(0, 1)$ and $\sigma = 0.2$. The sample size $n$ is taken to be 100. The first two models, which are taken from Li (1991), are asymmetric about 0, but the last one is symmetric about 0. As we have discussed in Section 4.3, linear PSVM, like SIR, does not work when the regression surface is symmetric about 0. The first two examples show how the linear PSVM compares with other methods in the situations where it works. The purpose of the last model is to provide a benchmark of error when it fails, so that we can gauge how the kernel PSVM improves the situation in the next comparison.
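The three simulation designs above are easy to reproduce. A minimal sketch follows; the function `make_model` is a hypothetical helper of ours, not part of the paper's code.

```python
import numpy as np

def make_model(model, n=100, p=10, sigma=0.2, seed=None):
    """Generate (X, Y) from Models I-III of Section 8.1:
    X ~ N(0, I_p), eps ~ N(0, 1), sigma = 0.2."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    eps = rng.standard_normal(n)
    x1, x2 = X[:, 0], X[:, 1]
    if model == "I":
        y = x1 / (0.5 + (x2 + 1) ** 2) + sigma * eps
    elif model == "II":
        y = x1 * (x1 + x2 + 1) + sigma * eps
    elif model == "III":
        r = np.sqrt(x1**2 + x2**2)
        y = r * np.log(r) + sigma * eps
    else:
        raise ValueError(model)
    return X, y
```

Only the first two coordinates of $X$ enter the mean function, so the central subspace is spanned by $e_1$ and $e_2$ in each model.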
Table 1. Estimated means and simulation standard errors (in parentheses) of the distance measure (31), and mean computation times (in seconds), of linear sufficient dimension reduction methods

Model   p    SIR          SAVE         DR           Linear PSVM
I       10   0.84 (0.22)  1.55 (0.19)  1.02 (0.23)  0.65 (0.17)
        20   1.14 (0.18)  1.93 (0.05)  1.32 (0.17)  0.93 (0.16)
        30   1.31 (0.14)  1.96 (0.03)  1.48 (0.11)  1.17 (0.14)
II      10   1.20 (0.27)  1.43 (0.16)  1.17 (0.23)  0.85 (0.25)
        20   1.51 (0.19)  1.72 (0.15)  1.46 (0.14)  1.26 (0.23)
        30   1.67 (0.16)  1.84 (0.12)  1.63 (0.12)  1.58 (0.17)
III     10   1.80 (0.13)  0.87 (0.21)  0.85 (0.20)  1.65 (0.16)
        20   1.89 (0.08)  1.46 (0.20)  1.45 (0.20)  1.85 (0.10)
        30   1.93 (0.05)  1.72 (0.12)  1.71 (0.12)  1.93 (0.05)
Time         0.03         0.01         0.03         0.16

To evaluate the performance of each method, we use the distance measure suggested by Li, Zha and Chiaromonte (2005). Specifically, let $\mathcal{S}_1$ and $\mathcal{S}_2$ be two subspaces of $\mathbb{R}^p$. Then
$$\mathrm{dist}(\mathcal{S}_1, \mathcal{S}_2) = \|P_{\mathcal{S}_1} - P_{\mathcal{S}_2}\|, \qquad (31)$$
where $P_{\mathcal{S}_1}$ and $P_{\mathcal{S}_2}$ are the orthogonal projections onto $\mathcal{S}_1$ and $\mathcal{S}_2$, and $\|\cdot\|$ is a matrix norm. In the following the Frobenius norm is used.

For SAVE and DR, we use $h = 4$ slices, and for SIR, we use $h = 8$ slices, each slice having roughly the same number of points. Our choices of $h$ are in line with the usual practice in the SDR literature for such a sample size. For methods such as SAVE and DR that involve the second-order inverse moment, $h$ is suggested to be chosen smaller than for methods such as SIR which only involve the first-order inverse moment [Li and Zhu (2007)]. For the linear PSVM, the cost $\lambda$ is taken to be 1. The number of division points ($q_r$) is 20. We have tried some other numbers of division points and obtained very similar results. In general, our experience suggests that a relatively large number of dividing points is preferable. The results are presented in Table 1. The entries are of the form $a(b)$, where $a$ is the mean and $b$ is the standard deviation of the distance criterion (31) calculated from 200 simulated samples.
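The distance measure (31) with the Frobenius norm can be computed directly from basis matrices of the two subspaces; a minimal sketch (helper name ours):

```python
import numpy as np

def subspace_distance(B1, B2):
    """Frobenius-norm distance (31) between the column spaces of B1 and B2:
    dist(S1, S2) = ||P_S1 - P_S2||_F, with P_B = B (B^T B)^{-1} B^T."""
    P1 = B1 @ np.linalg.solve(B1.T @ B1, B1.T)
    P2 = B2 @ np.linalg.solve(B2.T @ B2, B2.T)
    return np.linalg.norm(P1 - P2, "fro")
```

The measure depends only on the spanned subspaces, so it is invariant to the choice of basis within each estimated space.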
The last row in Table 1 records the CPU time (in seconds) each method uses for Model I with $p = 10$ (on a Dell OptiPlex 745 desktop computer with speed 2.66 GHz). Table 1 shows that the linear PSVM consistently performs better than the other methods in all cases for Models I and II. The intuition behind this improvement is explained in Section 4.3. Also, as expected, the linear PSVM and SIR do not perform well for Model III because of the symmetry of the regression function. However, as we will see in the next comparison, this defect is no longer present in the kernel PSVM. The linear PSVM requires more computing time than the classical methods, mainly because it needs to process more dividing points, and, for each dividing point, the full data (rather than a slice of data) are processed.

8.2. Nonlinear dimension reduction. As we have mentioned, Model III is symmetric about 0, and the linear PSVM fails. To a certain degree, the shape of the regression surface of Model II is also symmetric about 0. We now use these two models to investigate the performance of the kernel PSVM for nonlinear sufficient dimension reduction. In terms of linear dimension reduction, Model III has two sufficient predictors, $X_1, X_2$, but in terms of nonlinear dimension reduction, it has only one sufficient predictor, $(X_1^2 + X_2^2)^{1/2}$, or any monotone function of it. The kernel PSVM is designed to recover a monotone transformation of $(X_1^2 + X_2^2)^{1/2}$ without having to assume any regression model. In doing so, it solves two problems at one stroke: further reducing the dimension from 2 to 1, and avoiding the difficulty of SIR in dealing with symmetric responses. To illustrate the idea, in Figure 3 we present the 2-D and 3-D scatter plots for $Y$ versus the nonlinear and linear predictors obtained by different methods.
The upper left panel is the 2-D scatter plot of $Y$ versus the true nonlinear predictor $(X_1^2 + X_2^2)^{1/2}$; the upper right panel is the 2-D scatter plot of $Y$ versus the first kernel PSVM predictor; the lower panels are 3-D scatter plots of $Y$ versus the first two predictors from SAVE and DR. We can see that all three methods capture the right shape of the regression function, but the kernel PSVM only requires one predictor and its sufficient plot appears sharper (bearing in mind that the upper right panel only has to resemble a monotone transformation of the upper left panel).

To make a more precise comparison, we need to design a new criterion that can compare one nonlinear predictor with two linear predictors; the criterion (31) is no longer suitable for this purpose. Since the nonlinear sufficient predictor estimates a monotone function of $(X_1^2 + X_2^2)^{1/2}$, we use the absolute value of Spearman's correlation to measure their closeness, which is invariant under monotone transformations [Kutner, Nachtsheim and Neter (2004), page 87]. To measure the closeness between two linear predictors and the true nonlinear predictor $(X_1^2 + X_2^2)^{1/2}$, let $(U_{11}, \dots, U_{1n})$, $(U_{21}, \dots, U_{2n})$ represent the two linear predictors obtained by SAVE or DR. These predictors estimate linear combinations of $X_{1i}, X_{2i}$ but do not specify $X_{1i}, X_{2i}$ themselves. We therefore regress $T_i = X_{1i}^2 + X_{2i}^2$ on $\{(1, U_{1i}, U_{2i}, U_{1i}^2, U_{2i}^2): i = 1, \dots, n\}$. If $U_{1i}$ and $U_{2i}$ are (linearly independent) linear combinations of $X_{1i}$ and $X_{2i}$, then this regression is guaranteed to recover the true predictor $T_i$ regardless of the specific form of the linear combinations. Let $\hat T_1, \dots, \hat T_n$ be the

Fig. 3. Comparison between linear and nonlinear sufficient dimension reduction methods.
Upper left panel: true nonlinear predictor $\sqrt{X_1^2 + X_2^2}$ versus $Y$; upper right panel: first (nonlinear) PSVM predictor versus $Y$; lower left panel: first two SAVE predictors versus $Y$; lower right panel: first two DR predictors versus $Y$.

fitted responses of this regression. We use the absolute value of Spearman's correlation between $\hat T_i$ and $T_i$ to measure the performance of SAVE and DR. We compute these numbers for 200 simulation samples, and tabulate their means and standard deviations in Table 2. Note that larger numbers represent better performance, and all numbers are between 0 and 1. The SAVE and DR estimators are computed in exactly the same way as in the linear dimension reduction comparison. For the kernel PSVM, the cost is 1, the number of division points is still 20, the kernel is the Gaussian radial basis, and the number of principal eigenfunctions of $\Sigma_n$ is taken to be 60. The parameter $\gamma$ is calculated by (26), which is approximately 0.0526, 0.0257 and 0.0169 for $p = 10, 20, 30$, respectively. We see that the kernel PSVM actually performs better than SAVE and DR, even though it uses only one predictor. It also performs better than KSIR. Moreover, the accuracy of the kernel PSVM remains reasonably high for larger $p$, where the accuracies of SAVE, DR, and KSIR drop considerably.

Table 2. Estimated means and simulation standard errors (in parentheses) of Spearman correlations of linear and nonlinear sufficient dimension reduction

          Model II                                              Model III
p     SAVE         DR           KSIR         KPSVM        SAVE         DR           KSIR         KPSVM
10    0.53 (0.13)  0.67 (0.11)  0.88 (0.07)  0.92 (0.02)  0.79 (0.09)  0.79 (0.08)  0.89 (0.05)  0.90 (0.02)
20    0.37 (0.14)  0.53 (0.09)  0.68 (0.17)  0.86 (0.03)  0.56 (0.11)  0.57 (0.11)  0.59 (0.18)  0.81 (0.03)
30    0.30 (0.13)  0.43 (0.10)  0.55 (0.23)  0.83 (0.04)  0.47 (0.11)  0.48 (0.11)  0.42 (0.21)  0.77 (0.04)

8.3. Estimation of structural dimension. We now investigate the performance of the CVBIC order-determination procedure for a variety of combinations of $(p, d, n)$.
We still use Models I and II, but, to include different $d$, we add the following models, which both have $d = 1$:
Model IV: $Y = X_1/[0.5 + (X_1 + 1)^2] + \sigma\varepsilon$,
Model V: $Y = X_1(2X_1 + 1) + \sigma\varepsilon$.
These are derived from Models I and II by replacing $X_2$ by $X_1$. We apply CVBIC in conjunction with PSVM to Models I, II, IV and V, with $(d, n, p)$ ranging over the set $\{1, 2\} \times \{200, 300, 400, 500\} \times \{10, 20, 30\}$. The training and testing sample sizes are $n_1 = n_2 = n/2$. We take 20 dividing points $q_r$ as equally-spaced sample quantiles of $Y_1, \dots, Y_{n_1}$. As a comparison we also apply the order-determination procedure for SIR based on Theorem 5.1 of Li (1991) with significance level $\alpha = 0.05$. The results are presented in Table 3, where the entries are the percentage of correct estimation of $d$ out of 200 simulated samples for each of the 48 combinations of (model, $p$, $n$). Table 3 shows that CVBIC works very well, with the percentage of correct estimation reaching as high as 100% for a sample size of 200 (training sample size 100). In almost all cases, PSVM compares favorably with SIR for order determination. Also clear from the table is the trend of increasing accuracy for both methods as $n$ increases.

Table 3. Rate of correct order determination by SIR and PSVM (in %)

                     n = 200     n = 300     n = 400     n = 500
Model  d   p     SIR  PSVM   SIR  PSVM   SIR  PSVM   SIR  PSVM
I      2   10     92    96    92   100    97   100    98   100
           20     80    82    95    96    96   100    94   100
           30     65    54    92    94    94    98    96   100
II     2   10     67    80    86    85    97    90    98    94
           20     36    64    66    84    85    86    96    84
           30     22    32    55    77    73    84    88    80
IV     1   10     39    82    60    84    71    93    75    95
           20     28    78    42    74    54    76    68    80
           30     15    78    33    84    44    78    60    80
V      1   10     93   100    96   100    96   100    97   100
           20     94    98    96   100    96   100    96   100
           30     96    98    96    99    96   100    92   100

9. Application and further discussions. We now compare the kernel PSVM with SIR, SAVE, and DR in a real data analysis concerning the recognition of vowels. The data can be found in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets). The response variable $Y$ is a categorical variable with 11 levels, representing different vowel sounds. The predictor $X$ is a 10-dimensional vector describing the features of a sound. For clear presentation, we select only three vowels: the sounds in heed, head and hud, with training and testing sample sizes of 144 and 126, respectively.

For each dimension reduction method, we find a set of sufficient predictors from the training data, and evaluate them at the testing set, resulting in a sufficient plot for the testing set. Given that the testing data are independent of the training data from which the sufficient predictors are derived, the degree of separation of the vowels in the sufficient plot objectively reflects the discriminating power of a dimension reduction method. The four scatter plots in Figure 4 present the first two predictors found by SIR (upper left panel), SAVE (upper right panel), DR (lower left panel), and the kernel PSVM (lower right panel). For the kernel PSVM, the OVA scheme is used. The basis functions are the first 40 eigenfunctions of the operator $\Sigma_n$ derived from the Gaussian radial kernel, whose parameter $\gamma$ is calculated by (25). The cost $\lambda$ is 1. We have varied the number of eigenfunctions (from 10 to 60) and the cost (from 0.5 to 20), but they do not seem to result in a significant difference in the degree of separation in the test data. From Figure 4, we see that the kernel PSVM achieves much better separation of the three vowels in the test data than the other three methods. The second best performer is DR, followed by SIR and SAVE.
It is also interesting to note that the various degrees of separation are also reflected in the sufficient plots; that is, the distance between heed and hud is larger than those between heed and head, and between head and hud.

We would like to comment that classification, though important, is not the sole purpose of sufficient dimension reduction, and that linear and nonlinear sufficient dimension reduction have their own strengths in reducing, discriminating, visualizing, and interpreting high-dimensional data.

Fig. 4. First two predictors based on SIR, SAVE, DR, and the kernel PSVM plotted for the vowel recognition testing data set. Green, red and blue colors indicate the vowel sounds in heed, head, hud.

To illuminate the point, consider an example where variation, rather than location, is the differentiating characteristic. Let $Y$ be a Bernoulli variable with $P(Y = 1) = P(Y = 0) = 1/2$ and $(X|Y = y) \sim N(0, \mathrm{diag}(\sigma^2(y), \sigma^2(y), 1, \dots, 1))$, where $\sigma^2(0) = 1$ and $\sigma^2(1) = 10$. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a sample from this model, where $n = 200$ and $p = 10$. For simplicity, we fix the number of cases of $Y = 1$ at $n/2$, because this has no bearing on our problem. In this case, the central subspace is $\mathrm{span}(e_1, e_2)$, where $e_i = (0, \dots, 1, \dots, 0)^{\mathsf T}$ with the 1 occupying the $i$th position. We apply SAVE and the kernel PSVM and the results are shown in Figure 5, where the top panel shows the scatter plot of the true sufficient predictors $X_1$ and $X_2$, the lower left panel shows the first two SAVE predictors, and the lower right panel shows the boxplot of a single kernel PSVM predictor.

Fig. 5. Variation as the differentiating characteristic. Blue o represents the $Y = 0$ cases and red + represents the $Y = 1$ cases.

Since for a single variable we cannot produce a scatter plot, for clarity we use a boxplot to represent the predictor. The value of the kernel PSVM predictor is represented by the height in the boxplot; the two boxes represent the two groups. All three plots are based on the testing data. What is interesting is that the kernel PSVM in some sense "translates" the difference in variation into a difference in location. The intuitive reason is that there is a quadratic (and hence variance) component in the kernel mapping, but in the mapped high-dimensional space the variance component is treated as an augmented part of the feature vector [as in $(x, x^2)$]. Of course this is only a simplification of the situation: there is still significant difference in variation in the kernel PSVM predictor for the two groups. In this case, linear dimension reduction methods such as SAVE have a definite advantage, both for their clear separation of variation and for their good interpretability.

Fig. 6. Degrees of separation by SAVE (upper panels) and kernel PSVM (lower panels) for higher dimensions: p = 60 (left panels), p = 80 (middle panels) and p = 100 (right panels).

In the meantime, this example also shows that the kernel PSVM is capable of differentiating variation, to a degree comparable to SAVE, but its interpretability is not as direct as SAVE's. Another desirable feature of the kernel PSVM is that its accuracy is more stable than the classical methods as the dimension $p$ increases. Figure 6 shows the sufficient predictors derived from SAVE and kernel PSVM for $p = 60, 80, 100$ (from left to right).
The upper panels are the scatter plots of the first two SAVE predictors, and the lower panels are the boxplots representing the single kernel PSVM predictor. Again, all plots are based on testing data. We see that SAVE gradually loses its discriminating power as $p$ is increased to 100, whereas the discriminating power of the kernel PSVM remains reasonably strong.

Acknowledgments. We are very grateful to three referees and an Associate Editor, whose many useful comments and suggestions greatly broadened and deepened an earlier version of this work.

REFERENCES

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337-404. MR0051437
Artemiou, A. A. (2010). Topics on supervised and unsupervised dimension reduction. Ph.D. thesis, Pennsylvania State Univ., University Park, PA.
Bickel, P., Klaassen, C. A. J., Ritov, Y. and Wellner, J. (1993). Efficient and Adaptive Inference in Semi-Parametric Models. Johns Hopkins Univ. Press, Baltimore.
Bura, E. and Pfeiffer, R. (2008). On the distribution of the left singular vectors of a random matrix and its applications. Statist. Probab. Lett. 78 2275-2280. MR2462662
Conway, J. B. (1990). A Course in Functional Analysis, 2nd ed. Graduate Texts in Mathematics 96. Springer, New York. MR1070713
Cook, R. D. (1994). Using dimension-reduction subspaces to identify important inputs in models of physical systems. In Proc. Section on Physical and Engineering Sciences 18-25. Amer. Statist. Assoc., Alexandria, VA.
Cook, R. D. (1996). Graphics for regressions with a binary response. J. Amer. Statist. Assoc. 91 983-992. MR1424601
Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley, New York. MR1645673
Cook, R. D. (2007). Fisher lecture: Dimension reduction in regression. Statist. Sci. 22 1-26. MR2408655
Cook, R. D. and Forzani, L. (2008). Principal fitted components for dimension reduction in regression. Statist. Sci. 23 485-501.
MR2530547
Cook, R. D. and Li, B. (2002). Dimension reduction for conditional mean in regression. Ann. Statist. 30 455-474. MR1902895
Cook, R. D. and Ni, L. (2005). Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. J. Amer. Statist. Assoc. 100 410-428. MR2160547
Cook, R. D. and Weisberg, S. (1991). Discussion of "Sliced inverse regression for dimension reduction," by K.-C. Li. J. Amer. Statist. Assoc. 86 316-342.
Eaton, M. L. (1986). A characterization of spherical distributions. J. Multivariate Anal. 20 272-276. MR0866075
Fukumizu, K., Bach, F. R. and Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 5 73-99. MR2247974
Fukumizu, K., Bach, F. R. and Jordan, M. I. (2009). Kernel dimension reduction in regression. Ann. Statist. 37 1871-1905. MR2533474
Fung, W. K., He, X., Liu, L. and Shi, P. (2002). Dimension reduction based on canonical correlation. Statist. Sinica 12 1093-1113. MR1947065
Gretton, A., Bousquet, O., Smola, A. and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In 16th International Conference on Algorithmic Learning Theory (S. Jain, H. U. Simon and E. Tomita, eds.). Lecture Notes in Computer Science 3734 63-77. Springer, Berlin. MR2255909
Hall, P. and Li, K.-C. (1993). On almost linearity of low-dimensional projections from high-dimensional data. Ann. Statist. 21 867-889. MR1232523
Hsing, T. and Ren, H. (2009). An RKHS formulation of the inverse regression dimension-reduction problem. Ann. Statist. 37 726-755. MR2502649
Jiang, B., Zhang, X. and Cai, T. (2008). Estimating the confidence interval for prediction errors of support vector machine classifiers. J. Mach. Learn. Res. 9 521-540. MR2417245
Karatzoglou, A. and Meyer, D. (2006). Support vector machines in R. J. Stat. Softw. 15 9.
Karatzoglou, A., Smola, A., Hornik, K. and Zeileis, A. (2004).
Kernlab — an S4 package for kernel methods in R. J. Stat. Software 11 9. KuRDlLA, A. J. and Zabarankin, M. (2005). Convex Functional Analysis. Birkhauser, Basel. MR2145612 KuTNER, M. H., Nachtsheim, C. J. and Neter, J. (2004). Applied Linear Regression Models, 4th ed. McGraw-Hill/Irwin, Boston. Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86 316-342. MR1137117 Li, K.-C. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. J. Amer. Statist. Assoc. 87 1025- 1039. MR1209564 Li, B. (2000). Nonparametric estimating equations based on a penalized information cri- terion. Canad. J. Statist. 28 621-639. MR1793116 Li, B. (2001). On quasi likelihood equations with non-parametric weights. Scand. J. Stat. 28 577-602. MR1876501 Li, B. and Dong, Y. (2009). Dimension reduction for nonelliptically distributed predic- tors. Ann. Statist. 37 1272-1298. MR2509074 Li, K.-C. and Duan, N. (1989). Regression analysis under link violation. Ann. Statist. 17 1009-1052. MR1015136 Li, B. and Wang, S. (2007). On directional regression for dimension reduction. J. Amer. Statist. Assoc. 102 997-1008. MR2354409 Li, B., Zha, H. and Chiaromonte, F. (2005). Contour regression: A general approach to dimension reduction. Ann. Statist. 33 1580-1616. MR2166556 Li, Y. and Zhu, L.-X. (2007). Asymptotics for sliced average variance estimation. Ann. Statist. 35 41-69. MR2332268 LOH, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statist. Sinica 12 361-386. MR1902715 Magnus, J. R. and Neudecker, H. (1979). The commutation matrix: Some properties and applications. Ann. Statist. 7 381-394. MR0520247 SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. MR0468014 VAN DER Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series m Statistical and Probabilistic Mathematics 3. Cambridge Univ. 
Press, Cambridge. MR1652247 Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York. MR1641250 Wang, Y. (2008). Nonlinear dimension reduction in feature space. Ph.D. thesis, Pennsyl- vania State Univ., University Park, PA. Wang, Q. and Yin, X. (2008). A nonlinear multi-dimensional variable selection method for high dimensional data: Sparse MAVE. Comput. Statist. Data Anal. 52 4512-4520. MR2432477 Weidmann, J. (1980). Linear Operators in Hilbert Spaces. Graduate Texts m Mathematics 68. Springer, New York. MR0566954 Wu, H.-M. (2008). Kernel sliced inverse regression with applications to classification. J. Comput. Graph. Statist. 17 590-610. MR2528238 Wu, Q., Liang, F. and Mukherjee, S. (2008). Regularized sliced inverse regression for kernel models. Technical report, Duke Univ., Durham, NC. XiA, Y., Tong, H., Li, W. K. and Zhu, L.-X. (2002). An adaptive estimation of dimen- sion reduction space. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 363-410. MR1924297 Yeh, Y.-R., Huang, S.-Y. and Lee, Y.-Y. (2009). Nonlinear dimension reduction with kernel sliced inverse regression. IEEE Transactions on Knowledge and Data Engineering 21 1590-1603. 30 B. LI, A. ARTEMIOU AND L. LI Yin, X. and Cook, R. D. (2002). Dimension reduction for the conditional fcth moment in regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 159-175. MR1904698 Yin, X., Li, B. and Cook, R. D. (2008). Successive direction extraction for estimating the central subspace in a multiple-index regression. J. Multivariate Anal. 99 1733-1757. MR2444817 Zhu, L., Miao, B. and Peng, H. (2006). On sliced inverse regression with high- dimensional covariates. J. Amer. Statist. Assoc. 101 630-643. MR2281245 B. Li Department of Statistics Pennsylvania State University 326 Thomas Building University Park, Pennsylvania 16802 USA E-MAIL; bing@stat.psu.cdu A. 
Artemiou Department of Mathematical Sciences Michigan Technological University Fisher Hall, Room 306 1400 Townsend Drive Houghton, Michigan 49931 USA E-MAIL: aarteniio@mtu.cdu L. Li Department of Statistics North Carolina State University 2311 Stinson Drive Campus Box 8203 Raleigh, North Carolina 27695-8203 USA E-MAIL: li@stat.ncsu.edu