{"title": "Algorithmic Stability and Generalization Performance", "book": "Advances in Neural Information Processing Systems", "page_first": 196, "page_last": 202, "abstract": null, "full_text": "Algorithmic Stability and Generalization \n\nPerformance \n\nOlivier Bousquet \n\nCMAP \n\nEcole Polytechnique \n\nF-91128 Palaiseau cedex \n\nFRANCE \n\nAndre Elisseeff'\" \nBarnhill Technologies \n6709 Waters A venue \nSavannah, GA 31406 \n\nUSA \n\nbousquet@cmapx.polytechnique\u00b7fr \n\nandre@barnhilltechnologies.com \n\nAbstract \n\nWe present a novel way of obtaining PAC-style bounds on the gen(cid:173)\neralization error of learning algorithms, explicitly using their stabil(cid:173)\nity properties. A stable learner is one for which the learned solution \ndoes not change much with small changes in the training set. The \nbounds we obtain do not depend on any measure of the complexity \nof the hypothesis space (e.g. VC dimension) but rather depend on \nhow the learning algorithm searches this space, and can thus be \napplied even when the VC dimension is infinite. We demonstrate \nthat regularization networks possess the required stability property \nand apply our method to obtain new bounds on their generalization \nperformance. \n\n1 \n\nIntroduction \n\nA key issue in computational learning theory is to bound the generalization error of \nlearning algorithms. Until recently, most of the research in that area has focused on \nuniform a-priori bounds giving a guarantee that the difference between the training \nerror and the test error is uniformly small for any hypothesis in a given class. \nThese bounds are usually expressed in terms of combinatorial quantities such as VC(cid:173)\ndimension. In the last few years, researchers have tried to use more refined quantities \nto either estimate the complexity of the search space (e.g. covering numbers [1]) \nor to use a posteriori information about the solution found by the algorithm (e.g. \nmargin [11]). 
There exist other approaches, such as observed VC dimension [12], but all are concerned with structural properties of the learning systems. In this paper we present a novel way of obtaining PAC bounds for specific algorithms, explicitly using their stability properties. The notion of stability, introduced by Devroye and Wagner [4] in the context of classification for the analysis of the leave-one-out error, and further refined by Kearns and Ron [8], is used here in the context of regression in order to get bounds on the empirical error rather than the leave-one-out error. This method has the nice advantage of providing bounds that do not depend on any complexity measure of the search space (e.g. VC dimension or covering numbers) but rather on the way the algorithm searches this space. In that respect, our approach can be related to Freund's [7], where the estimated size of the subset of the hypothesis space actually searched by the algorithm is used to bound its generalization error. However, Freund's result depends on a complexity term which we do not have here, since we do not look separately at the hypotheses considered by the algorithm and their risk.

*This work was done while the author was at Laboratoire ERIC, Universite Lumiere Lyon 2, 5 avenue Pierre Mendes-France, F-69676 Bron cedex, FRANCE.

The paper is structured as follows: the next section introduces the notation and the definition of stability used throughout the paper. Section 3 presents our main result as a PAC-like theorem. In Section 4 we prove that regularization networks are stable and apply the main result to obtain bounds on their generalization ability. A discussion of the results is presented in Section 5.

2 Notations and Definitions

X and Y being respectively an input and an output space, we consider a learning set S = {z_1 = (x_1, y_1), ..., z_m = (x_m, y_m)} of size m in Z = X × Y drawn i.i.d.
from an unknown distribution D. A learning algorithm is a function A from Z^m into Y^X mapping a learning set S onto a function f_S from X to Y. To avoid complex notations, we consider only deterministic algorithms. It is also assumed that the algorithm A is symmetric with respect to S, i.e. for any permutation of the elements of S, f_S yields the same result. Furthermore, we assume that all functions are measurable and all sets are countable, which does not limit the interest of the results presented here.

The empirical error of a function f measured on the training set S is

R_m(f) = (1/m) Σ_{i=1}^m c(f, z_i),

where c : Y^X × (X × Y) → R_+ is a cost function. The risk or generalization error can be written as

R(f) = E_{z~D}[c(f, z)].

The study we describe here intends to bound the difference between empirical and generalization error for specific algorithms. More precisely, our goal is to bound, for any ε > 0, the term

P_{S~D^m}[|R_m(f_S) − R(f_S)| > ε].   (1)

Usually, learning algorithms cannot output just any function in Y^X but rather pick a function f_S in a set F ⊆ Y^X representing the structure, architecture or model. Classical VC theory deals with structural properties and aims at bounding the following quantity:

P_{S~D^m}[sup_{f∈F} |R_m(f) − R(f)| > ε].   (2)

This applies to any algorithm using F as a hypothesis space, and a bound on this quantity directly implies a similar bound on (1). However, classical bounds require the VC dimension of F to be finite and do not use information about algorithmic properties. For a given set F, there exist many ways to search it, which may yield different performance. For instance, multilayer perceptrons can be learned by a simple back-propagation algorithm or combined with a weight-decay procedure. The outcome of the algorithm belongs in both cases to the same set of functions, although the performance can be different.
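To make these definitions concrete, here is a minimal numerical sketch (illustrative only, not from the paper: the squared-loss cost, the fixed linear predictor f and the Gaussian data distribution D are all assumptions) computing the empirical error R_m(f) and a Monte Carlo estimate of the risk R(f):

```python
import numpy as np

rng = np.random.default_rng(0)

def c(f, z):
    """Cost c(f, z) = (f(x) - y)^2 for a sample z = (x, y) (squared loss)."""
    x, y = z
    return (f(x) - y) ** 2

def empirical_error(f, S):
    """R_m(f) = (1/m) * sum_i c(f, z_i) over the training set S."""
    return float(np.mean([c(f, z) for z in S]))

def risk_estimate(f, draw, n=100_000):
    """Monte Carlo estimate of the risk R(f) = E_{z~D}[c(f, z)]."""
    return float(np.mean([c(f, draw()) for _ in range(n)]))

def draw():
    """One sample z = (x, y) from an (assumed) distribution D: y = 2x + noise."""
    x = rng.normal()
    return (x, 2.0 * x + 0.1 * rng.normal())

f = lambda x: 2.0 * x                 # a fixed predictor, here the noiseless target
S = [draw() for _ in range(50)]       # training set of size m = 50

emp = empirical_error(f, S)
risk = risk_estimate(f, draw)
print(emp, risk)  # both close to the noise variance 0.01
```

For a fixed predictor the two quantities simply concentrate around the same value; the point of the paper's bounds is to control their difference for the data-dependent output f_S of a learning algorithm.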
3 Main Result

VC theory was initially motivated by empirical risk minimization (ERM), in which case the uniform bounds on the quantity (2) give tight error bounds. Intuitively, the empirical risk minimization principle relies on a uniform law of large numbers: because it is not known in advance where the minimum of the empirical risk will be, it is necessary to study the difference between empirical and generalization error for all possible functions in F. If we do not consider this minimum but instead focus on the outcome of a learning algorithm A, we may know a little more about what kind of functions will be obtained. This limits the possibilities and restricts the supremum over all functions in F to the possible outcomes of the algorithm. An algorithm which always outputs the null function, for instance, does not need to be studied via a uniform law of large numbers.

Let us introduce a notation for modified training sets: if S = {z_1, ..., z_{i−1}, z_i, z_{i+1}, ..., z_m} denotes the initial training set, then S^i denotes the training set after z_i has been replaced by a different training example z'_i, that is, S^i = {z_1, ..., z_{i−1}, z'_i, z_{i+1}, ..., z_m}. Now, we define a notion of stability for regression.

Definition 1 (Uniform stability) Let S = {z_1, ..., z_m} be a training set and A a symmetric algorithm. We say that A is β-stable if, for any S ∈ Z^m, any i ∈ {1, ..., m} and any replacement example z'_i ∈ Z,

∀z ∈ Z, |c(f_S, z) − c(f_{S^i}, z)| ≤ β.   (3)

Theorem 2 Let A be a β-stable algorithm whose cost function is bounded by M, i.e. 0 ≤ c(f_S, z) ≤ M. Then for any ε > 0 and any m ≥ 8M/ε², we have

P_{S~D^m}[|R_m(f_S) − R(f_S)| > ε] ≤ (64Mmβ + 8M²)/(mε²),   (4)

and for any m ≥ 1,

P_{S~D^m}[|R_m(f_S) − R(f_S)| > ε + β] ≤ 2 exp(−mε²/(2(mβ + M)²)).   (5)

Notice that this theorem gives tight bounds when the stability β is of order 1/m. It will be proved in the next section that regularization networks satisfy this requirement.

In order to prove Theorem 2, one has to study the random variable X = R(f_S) − R_m(f_S), which can be done using two different approaches.
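Before turning to the proofs, it may help to see the definition in action: β-stability can be probed numerically by replacing a single training point and measuring the largest induced change in loss. A hedged sketch (not from the paper; the one-dimensional ridge regression, the data, and the finite probe grid standing in for the supremum over z are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_fit(S, lam):
    """1-D ridge regression: minimize (1/m) sum_i (w*x_i - y_i)^2 + lam * w^2."""
    x, y = S[:, 0], S[:, 1]
    m = len(S)
    w = (np.sum(x * y) / m) / (np.sum(x * x) / m + lam)
    return lambda t: w * t

def stability_gap(S, i, z_new, lam, probe):
    """max over a grid of probe points z of |c(f_S, z) - c(f_{S^i}, z)|."""
    f_S = ridge_fit(S, lam)
    S_i = S.copy()
    S_i[i] = z_new                       # S^i: z_i replaced by z'_i
    f_Si = ridge_fit(S_i, lam)
    losses = lambda f: (f(probe[:, 0]) - probe[:, 1]) ** 2
    return np.max(np.abs(losses(f_S) - losses(f_Si)))

m = 200
x = rng.uniform(-1, 1, m)
S = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=m)])
probe = np.column_stack([np.linspace(-1, 1, 101), np.zeros(101)])

gap_small_m = stability_gap(S[:50], 0, np.array([1.0, -1.0]), lam=0.1, probe=probe)
gap_large_m = stability_gap(S, 0, np.array([1.0, -1.0]), lam=0.1, probe=probe)
print(gap_small_m, gap_large_m)  # the gap shrinks as m grows
```

The gap shrinking roughly like 1/m is exactly the regime in which Theorem 2 gives tight bounds.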
The first one (corresponding to the exponential inequality) uses a classical martingale inequality and is detailed below. The second one is a bit more technical and requires standard proof techniques such as symmetrization. Here we only briefly sketch this proof and refer the reader to [5] for more details.

Proof of inequality (5). We use the following theorem:

Theorem 3 (McDiarmid [9]) Let Y_1, ..., Y_n be n i.i.d. random variables taking values in a set A, and assume that F : A^n → R satisfies, for 1 ≤ i ≤ n,

sup |F(y_1, ..., y_n) − F(y_1, ..., y_{i−1}, y'_i, y_{i+1}, ..., y_n)| ≤ c_i.

Then

P[|F(Y_1, ..., Y_n) − E[F(Y_1, ..., Y_n)]| ≥ ε] ≤ 2 exp(−2ε² / Σ_{i=1}^n c_i²).

In order to apply Theorem 3, we have to bound the expectation of X. We begin with a useful lemma:

Lemma 1 For any symmetric learning algorithm we have, for all 1 ≤ i ≤ m,

E_{S~D^m}[R(f_S) − R_m(f_S)] = E_{S,z'_i~D^{m+1}}[c(f_S, z'_i) − c(f_{S^i}, z'_i)].

Proof: Notice that

E_{S~D^m}[R_m(f_S)] = (1/m) Σ_{j=1}^m E_{S~D^m}[c(f_S, z_j)] = E_{S~D^m}[c(f_S, z_i)], ∀i ∈ {1, ..., m},

by symmetry and the i.i.d. assumption. Now, by simply renaming z_i as z'_i, we get

E_{S~D^m}[R_m(f_S)] = E_{S^i~D^m}[c(f_{S^i}, z'_i)] = E_{S,z'_i~D^{m+1}}[c(f_{S^i}, z'_i)],

which, with the observation that

E_{S~D^m}[R(f_S)] = E_{S,z'_i~D^{m+1}}[c(f_S, z'_i)],

concludes the proof.

Using the above lemma and the fact that A is β-stable, we easily get

E_{S~D^m}[R(f_S) − R_m(f_S)] ≤ E_{S,z'_i~D^{m+1}}[β] = β.

We now have to compute the constants c_i appearing in Theorem 3. We have

|R(f_S) − R(f_{S^i})| ≤ E_{z~D}[|c(f_S, z) − c(f_{S^i}, z)|] ≤ β

and

|R_m(f_S) − R_m(f_{S^i})| ≤ (1/m) Σ_{j≠i} |c(f_S, z_j) − c(f_{S^i}, z_j)| + (1/m) |c(f_S, z_i) − c(f_{S^i}, z'_i)| ≤ β + 2M/m,

so that we can take c_i = 2β + 2M/m. Theorem 3 applied to R(f_S) − R_m(f_S), together with the bound on the expectation above, then gives inequality (5).

Sketch of the proof of inequality (4). Recall Chebyshev's inequality:

P[|X| ≥ ε] ≤ E[X²]/ε²,   (6)

for any random variable X. In order to apply this inequality, we have to bound E[X²].
This can be done by a reasoning similar to the one used for the expectation. The calculations are however more involved and we do not describe them here; for more details, see [5]. The result is the following:

E_S[X²] ≤ M²/m + 8Mβ,

which with (6) gives inequality (4) and concludes the proof.

4 Stability of Regularization Networks

4.1 Definitions

Regularization networks were introduced into machine learning by Poggio and Girosi [10]. The relationship between these networks and Support Vector Machines, as well as their Bayesian interpretation, makes them very attractive. We consider a training set S = {(x_1, y_1), ..., (x_m, y_m)} with x_i ∈ R^d and y_i ∈ R, that is, we are in the regression setting. The regularization network technique consists in finding a function f : R^d → R in a space H which minimizes the following functional:

C(f) = (1/m) Σ_{i=1}^m (f(x_i) − y_i)² + λ ||f||_H²,   (7)

where ||f||_H denotes the norm in the space H. In this framework, H is chosen to be a reproducing kernel Hilbert space (RKHS), which is basically a functional space endowed with a dot product (1). An RKHS is defined by a kernel function, that is, a symmetric function k : R^d × R^d → R, which we will assume to be bounded by a constant K in what follows (2). In particular, the following property will hold:

∀x, |f(x)| ≤ √K ||f||_H.   (8)

4.2 Stability study

In this section, we show that regularization networks are, furthermore, stable as soon as λ is not too small.

Theorem 4 For regularization networks with kernel bounded by K and cost satisfying (f(x) − y)² ≤ M, for any δ > 0, each of the following holds with probability at least 1 − δ over the draw of S:

R(f_S) ≤ R_m(f_S) + 4MK/(mλ) + M (4K/λ + 1) √(2 ln(2/δ)/m)   (9)

and

R(f_S) ≤ R_m(f_S) + 2M √((64K/λ + 2)/(mδ)).   (10)

Proof: Let us denote by f_S the minimizer of C. Let us define the corresponding functional for the modified training set S^i:

C^i(f) = (1/m) Σ_{j≠i} (f(x_j) − y_j)² + (1/m)(f(x'_i) − y'_i)² + λ ||f||_H².

Let f_{S^i} be the minimizer of C^i and let g denote the difference f_{S^i} − f_S.
By simple algebra, we have for t ∈ [0, 1]

C(f_S) − C(f_S + tg) = −(2t/m) Σ_{j=1}^m (f_S(x_j) − y_j) g(x_j) − 2tλ ⟨f_S, g⟩ + t² Λ(g),

where Λ(g), which is not written explicitly here, is the factor of t². Similarly, we have

C^i(f_{S^i}) − C^i(f_{S^i} − tg) = (2t/m) Σ_{j≠i} (f_{S^i}(x_j) − y_j) g(x_j) + (2t/m)(f_{S^i}(x'_i) − y'_i) g(x'_i) + 2tλ ⟨f_{S^i}, g⟩ + t² Λ^i(g).

By optimality, we have

C(f_S) − C(f_S + tg) ≤ 0 and C^i(f_{S^i}) − C^i(f_{S^i} − tg) ≤ 0;

thus, summing these inequalities, dividing by t/m and letting t → 0, we get

2 Σ_{j≠i} g(x_j)² − 2(f_S(x_i) − y_i) g(x_i) + 2(f_{S^i}(x'_i) − y'_i) g(x'_i) + 2mλ ||g||_H² ≤ 0,

which gives

mλ ||g||_H² ≤ (f_S(x_i) − y_i) g(x_i) − (f_{S^i}(x'_i) − y'_i) g(x'_i) ≤ 2√M √K ||g||_H,

using (8). We thus obtain

||f_{S^i} − f_S||_H ≤ 2√(MK)/(mλ)   (11)

and also

∀x, y: |(f_S(x) − y)² − (f_{S^i}(x) − y)²| ≤ 2√M |f_S(x) − f_{S^i}(x)| ≤ 4MK/(mλ).

We have thus proved that the minimization of C is a 4MK/(mλ)-stable procedure, which allows us to apply Theorem 2.

(1) We do not detail here the properties of such a space and refer the reader to [2, 3] for additional details.
(2) Once again, we do not give full detail of the definition of appropriate kernel functions and refer the reader to [3].

4.3 Discussion

Both inequalities are of interest, since the ranges in which they are tight differ. Indeed, (10) has a poor dependence on δ, which makes it deteriorate when high confidence is sought. On the other hand, (9) can give high-confidence bounds but becomes looser when λ is small.

Moreover, results of Evgeniou et al. [6] indicate that the optimal dependence of λ on m is obtained for λm = O(ln ln m). If we plug this into the above bounds, we notice that (9) does not converge as m → ∞.
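To see how the right-hand sides of these bounds behave as functions of λ, here is an illustrative sketch (not the authors' code; the Gaussian kernel, the data, the grid and the loss bound M = 4 are all assumptions). It fits the minimizer of the regularized functional in its standard kernel ridge regression form and evaluates the Chebyshev-style bound of Theorem 2 with a stability constant β = 4MK/(mλ) of the form derived in Section 4:

```python
import numpy as np

rng = np.random.default_rng(2)

def krr_fit(X, y, lam, k):
    """Minimize (1/m) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2 over the RKHS of k.
    The minimizer is f(x) = sum_i alpha_i k(x, x_i) with
    alpha = (G + lam * m * I)^{-1} y, where G is the Gram matrix."""
    m = len(X)
    G = k(X[:, None], X[None, :])
    alpha = np.linalg.solve(G + lam * m * np.eye(m), y)
    return lambda x: k(x[:, None], X[None, :]) @ alpha

k = lambda a, b: np.exp(-(a - b) ** 2)  # Gaussian kernel, bounded by K = 1
K, M, delta = 1.0, 4.0, 0.05            # M: assumed bound on the squared loss

X = rng.uniform(-1, 1, 100)
y = np.sin(3 * X) + 0.1 * rng.normal(size=100)
m = len(X)

def bound_rhs(lam):
    """Empirical error plus the deviation term obtained by setting the
    Chebyshev-based tail bound of Theorem 2 equal to delta,
    with beta = 4*M*K/(m*lam) plugged in."""
    f = krr_fit(X, y, lam, k)
    R_m = np.mean((f(X) - y) ** 2)
    beta = 4 * M * K / (m * lam)
    eps = np.sqrt((64 * M * m * beta + 8 * M**2) / (m * delta))
    return R_m + eps

lams = np.logspace(-3, 1, 30)
best = min(lams, key=bound_rhs)
print(best, bound_rhs(best))
```

Minimizing this right-hand side over λ is a bound-driven alternative to validation-set model selection. With the small sample and loose constants used here the deviation term dominates, so the minimizer sits at the large end of the grid; the snippet only shows the mechanics of the trade-off.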
It may be conjectured that the poor estimate of the variance coming from the martingale method in McDiarmid's inequality is responsible for this lack of convergence, but a finer analysis is required to fully understand this phenomenon.

One of the interests of these results is to provide a means of choosing the parameter λ by minimizing the right-hand side of the inequality. Usually, λ is determined with a validation set: some of the data is not used during learning, and λ is chosen so that the error of f_S over the validation set is minimized. The drawback of this approach is that it reduces the amount of data available for learning.

5 Conclusion and future work

We have presented a new approach to obtaining bounds on the generalization performance of learning algorithms which makes use of specific properties of these algorithms. The bounds we obtain do not depend on the complexity of the hypothesis class but on a measure of how stable the algorithm's output is with respect to changes in the training set.

Although this work has focused on regression, we believe that it can be extended to classification, in particular by making the stability requirement less demanding (e.g. stability on average instead of uniform stability). Future work will also aim at finding other algorithms that are stable or can be appropriately modified to exhibit the stability property. Finally, a promising application of this work is the model selection problem, where one has to tune parameters of the algorithm (e.g. λ and the parameters of the kernel for regularization networks). Instead of using cross-validation, one could measure how stability is influenced by the various parameters of interest and plug these measures into Theorem 2 to derive bounds on the generalization error.

Acknowledgments

We would like to thank G. Lugosi, S. Boucheron and O.
Chapelle for interesting discussions on stability and concentration inequalities. Many thanks to A. Smola and to the anonymous reviewers who helped improve the readability.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence and learnability. Journal of the ACM, 44(4):615-631, 1997.

[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.

[3] M. Atteia. Hilbertian Kernels and Spline Functions. Studies in Computational Mathematics 4. North-Holland, 1992.

[4] L. P. Devroye and T. J. Wagner. Distribution-free performance bounds for potential function rules. IEEE Trans. on Information Theory, 25(5):202-207, 1979.

[5] A. Elisseeff. A study about algorithmic stability and its relation to generalization performances. Technical report, Laboratoire ERIC, Univ. Lyon 2, 2000.

[6] T. Evgeniou, M. Pontil, and T. Poggio. A unified framework for regularization networks and support vector machines. Technical Memo AIM-1654, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, December 1999.

[7] Y. Freund. Self bounding learning algorithms. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98), pages 247-258, New York, July 24-26, 1998. ACM Press.

[8] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427-1453, 1999.

[9] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148-188. Cambridge University Press, Cambridge, 1989.

[10] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, 1990.

[11] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony.
A framework for structural risk minimization. In Proc. 9th Annu. Conf. on Comput. Learning Theory, pages 68-76. ACM Press, New York, NY, 1996.

[12] J. Shawe-Taylor and R. C. Williamson. Generalization performance of classifiers in terms of observed covering numbers. In Paul Fischer and Hans Ulrich Simon, editors, Proceedings of the 4th European Conference on Computational Learning Theory (Eurocolt-99), volume 1572 of LNAI, pages 274-284, Berlin, March 29-31, 1999. Springer.
", "award": [], "sourceid": 1854, "authors": [{"given_name": "Olivier", "family_name": "Bousquet", "institution": null}, {"given_name": "Andr\u00e9", "family_name": "Elisseeff", "institution": null}]}