TKL introduction

What is Kernel Learning?

Kernel methods for classification and regression (and Support Vector Machines (SVMs) in particular) require the selection of a kernel. Kernel Learning (KL) algorithms automate this task by finding the kernel \(k \in K\) that optimizes an achievable metric such as the soft margin (for classification).

To understand how the choice of \(K\) influences performance and robustness, three properties were proposed to characterize the set \(K\): tractability, density, and universality.

  • \(K\) is tractable if \(K\) is convex (or, preferably, a linear variety), implying the KL problem is solvable using convex optimization.
  • The set \(K\) has the density property if, for any \(\epsilon > 0\) and any positive kernel \(\hat k\), there exists a \(k \in K\) such that:
\[\|k - \hat k\| \le \epsilon\]
  • \(K\) has the universal property if every \(k \in K\) is universal, ensuring the classifier/predictor will perform arbitrarily well on large sets of training data.

More about these properties can be found here. Moreover, the Tessellated Kernels (TKs) were shown to have all three properties.

The general optimization problem can be formulated as

\[\min_{k \in K} \max_{\alpha \in A} -\frac{1}{2} \sum\limits_{i=1}^m\sum\limits_{j=1}^m \alpha_i \alpha_j k(x_i, x_j) + \kappa(\alpha)\]

where

\[\kappa(\alpha) = \begin{cases} -\epsilon \sum\nolimits_{i=1}^m |\alpha_i| + \sum\nolimits_{i=1}^m y_i\alpha_i, &~\text{if the task is regression,}\\ \sum\nolimits_{i=1}^m \alpha_i, &~\text{if the task is classification.} \end{cases}\]

The optimization set \(A\) also depends on the task:

\[A =\begin{cases} \{\alpha \in \mathbb{R}^m\;:\; \sum\nolimits_{i=1}^m \alpha_i = 0, \; \alpha_i \in [-C, C]\}, &~\text{if the task is regression,}\\ \{\alpha \in \mathbb{R}^m\; : \; \sum\nolimits_{i=1}^m \alpha_iy_i = 0,\; 0 \leq \alpha_i \leq C \}, &~\text{if the task is classification.} \end{cases}\]
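For illustration only, here is a minimal NumPy sketch of the objective and feasible set above; the function names, signatures, and the `eps` default are hypothetical and not part of the package.

```python
import numpy as np

def dual_objective(K, alpha, y, task="classification", eps=0.1):
    """Evaluate -1/2 sum_ij alpha_i alpha_j K_ij + kappa(alpha) for a fixed kernel matrix K."""
    quad = -0.5 * alpha @ K @ alpha
    if task == "regression":
        # kappa(alpha) = -eps * sum_i |alpha_i| + sum_i y_i alpha_i
        kappa = -eps * np.sum(np.abs(alpha)) + np.sum(y * alpha)
    else:
        # kappa(alpha) = sum_i alpha_i
        kappa = np.sum(alpha)
    return quad + kappa

def in_feasible_set(alpha, y, C, task="classification", tol=1e-8):
    """Check membership in the set A defined above."""
    if task == "regression":
        return abs(np.sum(alpha)) <= tol and np.all(np.abs(alpha) <= C + tol)
    return abs(np.sum(alpha * y)) <= tol and np.all((alpha >= -tol) & (alpha <= C + tol))
```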

This formulation raises two main questions:

  1. How to choose \(K\).
  2. How to find the optimal solution.

Tessellated Kernels

Tessellated Kernels (TKs) are a new class of kernel functions that can be represented in the following way:

\[K := \left\{ k ~\middle|~ k(x,y)= \int_{X} N(z,x)^T P N(z,y)\, dz, \; P \succeq 0 \right\}\]

where

\[N^d_{T}(z,x) = \begin{bmatrix}Z_d(z,x)I(z-x) \\ Z_d(z,x) I(x-z) \end{bmatrix} \;\text{ and }\; I(z) = \begin{cases} 1 & \quad z \ge 0\\ 0 & \quad \text{otherwise.}\\ \end{cases}\]
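To see where the name "tessellated" comes from, here is a small worked example (an illustration only, assuming \(Z_0(z,x) \equiv 1\), which is not defined above): take \(n = 1\), degree \(d = 0\), and \(X = [a,b]\), so that \(N^0_T(z,x) = \begin{bmatrix} I(z-x) \\ I(x-z)\end{bmatrix}\). Writing \(P = \begin{bmatrix} P_{11} & P_{12} \\ P_{21} & P_{22}\end{bmatrix} \succeq 0\), the integral evaluates to

\[k(x,y) = \int_a^b N^0_T(z,x)^T P\, N^0_T(z,y)\, dz = P_{11}\,(b - \max\{x,y\}) + P_{12}\max\{0,\, y-x\} + P_{21}\max\{0,\, x-y\} + P_{22}\,(\min\{x,y\} - a),\]

so the kernel is a different affine function of \((x,y)\) on each of the regions \(x \le y\) and \(x \ge y\): the domain \(X \times X\) is tessellated into cells on which \(k\) is polynomial.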

In the general case, after integration and some lengthy calculation, we obtain the final result:

\[k(x,y)=\sum_{i,j=1}^q \left[ Q_{i,j}\, g_{i,j}(x,y) + R_{i,j}\,t_{i,j}(x,y) + R^T_{i,j}\,t_{i,j}(y,x) + S_{i,j}\, h_{i,j}(x,y) \right]\]

where

\[g_{i,j}(x,y) := x^{\delta_i}y^{\delta_j}\, T(p^*(x,y),\, b,\, \gamma_{i,j} + 1)\]
\[t_{i,j}(x,y) := x^{\delta_{i}}y^{\delta_{j}}\, T(x,\, b,\, \gamma_{i,j} + 1) - g_{i,j}(x,y)\]
\[h_{i,j}(x,y) := x^{\delta_{i}}y^{\delta_{j}}\, T(a,\, b,\, \gamma_{i} + \gamma_{j} + 1) - g_{i,j}(x,y) - t_{i,j}(x,y) - t_{i,j}(y,x)\]
\[p^*(x,y)_i = \max \{x_i,\, y_i \}\]
\[T(x,y,\zeta) = \prod_{j=1}^n \left( \frac{y_j^{\zeta_j}}{\zeta_j}-\frac{x_j^{\zeta_j}}{\zeta_j}\right).\]
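As a sanity check on these formulas, the following sketch evaluates the closed-form kernel in plain NumPy. It is not the package implementation: the exponent arrays `delta` and `gamma` (standing in for \(\delta_i\) and \(\gamma_{i,j}\); the same array is used for the exponent appearing in \(h_{i,j}\)) are assumed to be supplied by the caller, and the data is assumed to lie in a box \([a,b]\) with \(a \ge 0\).

```python
import numpy as np

def T(x, y, zeta):
    """T(x, y, zeta) = prod_j (y_j^zeta_j / zeta_j - x_j^zeta_j / zeta_j)."""
    return np.prod(y ** zeta / zeta - x ** zeta / zeta)

def p_star(x, y):
    """p*(x, y)_i = max{x_i, y_i}."""
    return np.maximum(x, y)

def tk_kernel(x, y, a, b, delta, gamma, Q, R, S):
    """Closed-form TK kernel k(x, y) built from g, t and h as above.

    x, y    : points in R^n, shape (n,)
    a, b    : corners of the box X = [a, b], shape (n,)
    delta   : (q, n) array of monomial exponents delta_i
    gamma   : (q, q, n) array of exponents gamma_{i,j}
    Q, R, S : (q, q) blocks taken from the matrix P
    """
    q = delta.shape[0]
    val = 0.0
    for i in range(q):
        for j in range(q):
            zeta = gamma[i, j] + 1
            mono_xy = np.prod(x ** delta[i]) * np.prod(y ** delta[j])  # x^{delta_i} y^{delta_j}
            mono_yx = np.prod(y ** delta[i]) * np.prod(x ** delta[j])  # y^{delta_i} x^{delta_j}
            g_xy = mono_xy * T(p_star(x, y), b, zeta)                  # g_{i,j}(x, y)
            g_yx = mono_yx * T(p_star(x, y), b, zeta)                  # g_{i,j}(y, x)
            t_xy = mono_xy * T(x, b, zeta) - g_xy                      # t_{i,j}(x, y)
            t_yx = mono_yx * T(y, b, zeta) - g_yx                      # t_{i,j}(y, x)
            h_xy = mono_xy * T(a, b, zeta) - g_xy - t_xy - t_yx        # h_{i,j}(x, y)
            val += Q[i, j] * g_xy + R[i, j] * t_xy + R[j, i] * t_yx + S[i, j] * h_xy
    return val
```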

We have implemented this kernel so that you can use it directly; please see kernel functions.

Optimization

We now present an efficient implementation of the Frank-Wolfe (FW) algorithm, based on our paper; for more details, see An Efficient FW Algorithm for TKL.

  1. Initialize \(P_0=I\), \(k=0\), \(\alpha_0 = \arg OPT\_A(P_0)\);
  2. while \(OPT\_P(\alpha_k) - OPT\_A(P_k) \ge \epsilon\)
    1. \[\alpha_k = \arg OPT\_A(P_k)\]
    2. \[S_k = \arg OPT\_P(\alpha_k)\]
    3. \[\gamma_k = \arg\min_{\gamma \in [0,1]} OPT\_A(P_k+\gamma(S_k-P_k))\]
    4. \[P_{k+1} = P_k + \gamma_k(S_k - P_k), k = k + 1\]
  3. end while.

where \(OPT\_P\) and \(OPT\_A\) are defined as

\[OPT\_P(\alpha) = \min_{P \succeq 0} -\frac{1}{2} \sum\limits_{i=1}^m\sum\limits_{j=1}^m \alpha_i \alpha_j k(P, x_i, x_j) + \kappa(\alpha)\]

and

\[OPT\_A(P) = \max_{\alpha \in A} -\frac{1}{2} \sum\limits_{i=1}^m\sum\limits_{j=1}^m \alpha_i \alpha_j k(P, x_i, x_j) + \kappa(\alpha)\]
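To make the loop concrete, here is a minimal Python sketch of steps 1-3. It assumes the caller supplies solvers for the two subproblems, `opt_A(P)` (e.g. an SVM/SVR dual solver for the kernel defined by \(P\)) and `opt_P(alpha)`, plus a line search for step 2.3; these names are placeholders rather than the package API.

```python
def frank_wolfe_tkl(opt_A, opt_P, line_search, P0, eps=1e-3, max_iter=100):
    """Frank-Wolfe loop mirroring the pseudocode above.

    opt_A(P)                 -> (alpha, value) : solves OPT_A for the kernel defined by P
    opt_P(alpha)             -> (S, value)     : solves OPT_P for fixed alpha, returning the minimizer S
    line_search(P, S, alpha) -> gamma          : step size minimizing OPT_A(P + gamma (S - P))
    """
    P = P0
    alpha, obj_A = opt_A(P)                  # alpha_0 = arg OPT_A(P_0)
    for _ in range(max_iter):
        S, obj_P = opt_P(alpha)              # S_k = arg OPT_P(alpha_k)
        if obj_P - obj_A < eps:              # stop once OPT_P(alpha_k) - OPT_A(P_k) < eps
            break
        gamma = line_search(P, S, alpha)     # gamma_k from the line search in step 2.3
        P = P + gamma * (S - P)              # P_{k+1} = P_k + gamma_k (S_k - P_k)
        alpha, obj_A = opt_A(P)              # alpha_{k+1} = arg OPT_A(P_{k+1})
    return P, alpha
```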