6.5. 在线极速学习机

6.5.1. 简介

6.5.2. OSELM 原理

下面介绍 在线极速学习机 (Online Sequential Extreme Learning Machine, OS-ELM) 原理.

考虑如下 ELM 问题

\[\bm{H} \boldsymbol{\beta}=\bm{T} \]

其最小二乘解为

\[\hat{\boldsymbol{\beta}}=\left(\bm{H}^{T} \bm{H}\right)^{-1} \bm{H}^{T} \bm{T} \]

如果 \(\bm{H}^{T} \bm{H}\) 趋于奇异, 可以选择较小的隐层神经元节点数 \(L\) , 或者增加样本数量.
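下面用一段极简的 NumPy 示意代码演示上述批量最小二乘解的计算(其中随机隐层权重 W、偏置 b、sigmoid 激活以及各矩阵规模均为本文之外假设的设定; 数值上用 lstsq 求解比显式计算 \(\left(\bm{H}^{T}\bm{H}\right)^{-1}\) 更稳定):

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设的问题规模: N 个样本, n 维输入, L 个隐层节点, m 维输出
N, n, L, m = 200, 5, 20, 2
X = rng.standard_normal((N, n))
T = rng.standard_normal((N, m))

# ELM 隐层: 随机生成输入权重 W 与偏置 b, 训练过程中保持固定
W = rng.standard_normal((n, L))
b = rng.standard_normal(L)
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # sigmoid 激活, H 为 N×L 矩阵

# 最小二乘解: 等价于 (HᵀH)⁻¹HᵀT, 用 lstsq 求解在数值上更稳定
beta, *_ = np.linalg.lstsq(H, T, rcond=None)
print(np.linalg.norm(H @ beta - T))  # 训练残差
```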

给定初始训练集 \({\mathbb S}_0 = \{({\bm x}_i, {\bm t}_i)\}_{i=1}^{N_0}\) , 其中, \(N_0\) 为样本数, 且 \(N_0 > L\) . 从而有初始解

\[{\bm \beta_{0}}=\bm{K}_{0}^{-1} \bm{H}_{0}^{T} \bm{T}_{0}, \;\bm{K}_{0}=\bm{H}_{0}^{T} \bm{H}_{0} \]

假设又获得了新的训练集 \({\mathbb S}_1 = \{({\bm x}_i, {\bm t}_i)\}_{i=N_0+1}^{N_0+N_1}\) , 其中 \(N_1\) 为新增样本数目, 则问题变为最小化

\[\left\| \left[ \begin{array}{l}{\bm{H}_{0}} \\ {\bm{H}_{1}}\end{array}\right] {\bm \beta}-\left[ \begin{array}{l}{\bm{T}_{0}} \\ {\bm{T}_{1}}\end{array}\right] \right\| \]

其解为:

\[\boldsymbol{\beta}_{1}=\bm{K}_{1}^{-1} \left[ \begin{array}{c}{\bm{H}_{0}} \\ {\bm{H}_{1}}\end{array}\right]^{T} \left[ \begin{array}{c}{\bm{T}_{0}} \\ {\bm{T}_{1}}\end{array}\right], \; \bm{K}_{1}=\left[ \begin{array}{l}{\bm{H}_{0}} \\ {\bm{H}_{1}}\end{array}\right]^{T} \left[ \begin{array}{l}{\bm{H}_{0}} \\ {\bm{H}_{1}}\end{array}\right] \]

又因为

\[\begin{aligned} \bm{K}_{1} &=\left[ \begin{array}{cc}{\bm{H}_{0}^{T}} & {\bm{H}_{1}^{T}}\end{array}\right] \left[ \begin{array}{l}{\bm{H}_{0}} \\ {\bm{H}_{1}}\end{array}\right] \\ &=\bm{K}_{0}+\bm{H}_{1}^{T} \bm{H}_{1} \end{aligned} \]

\[\begin{aligned} \left[ \begin{array}{c}{\bm{H}_{0}} \\ {\bm{H}_{1}}\end{array}\right]^{T} \left[ \begin{array}{c}{\bm{T}_{0}} \\ {\bm{T}_{1}}\end{array}\right] &=\bm{H}_{0}^{T} \bm{T}_{0}+\bm{H}_{1}^{T} \bm{T}_{1} \\ &=\bm{K}_{0} \bm{K}_{0}^{-1} \bm{H}_{0}^{T} \bm{T}_{0}+\bm{H}_{1}^{T} \bm{T}_{1} \\ &=\bm{K}_{0} {\bm \beta_{0}}+\bm{H}_{1}^{T} \bm{T}_{1} \\ &=\left(\bm{K}_{1}-\bm{H}_{1}^{T} \bm{H}_{1}\right) {\bm \beta_{0}}+\bm{H}_{1}^{T} \bm{T}_{1} \\ &=\bm{K}_{1} {\bm \beta_{0}}-\bm{H}_{1}^{T} \bm{H}_{1} \boldsymbol{\beta}_{0}+\bm{H}_{1}^{T} \bm{T}_{1} \end{aligned} \]

所以, \({\boldsymbol \beta}_1\) 可以表示成 \({\boldsymbol \beta}_0\) 的函数

\[\begin{aligned} {\bm \beta_{1}} &=\bm{K}_{1}^{-1} \left[ \begin{array}{c}{\bm{H}_{0}} \\ {\bm{H}_{1}}\end{array}\right]^{T} \left[ \begin{array}{c}{\bm{T}_{0}} \\ {\bm{T}_{1}}\end{array}\right] \\ &=\bm{K}_{1}^{-1}\left(\bm{K}_{1} \boldsymbol{\beta}_{0}-\bm{H}_{1}^{T} \bm{H}_{1} \boldsymbol{\beta}_{0}+\bm{H}_{1}^{T} \bm{T}_{1}\right) \\ &=\boldsymbol{\beta}_{0}+\bm{K}_{1}^{-1} \bm{H}_{1}^{T}\left(\bm{T}_{1}-\bm{H}_{1} {\bm \beta_{0}}\right) \end{aligned} \]

其中, \(\bm{K}_{1}=\bm{K}_{0}+\bm{H}_{1}^{T} \bm{H}_{1}\) . 依次类推, 可以得到权重 \(\bm \beta\) 的迭代更新公式

(6.6)\[\begin{aligned} \bm{K}_{k+1} &=\bm{K}_{k}+\bm{H}_{k+1}^{T} \bm{H}_{k+1} \\ {\bm \beta}_{k+1} &={\bm \beta}_{k}+\bm{K}_{k+1}^{-1} \bm{H}_{k+1}^{T}\left(\bm{T}_{k+1}-\bm{H}_{k+1} \boldsymbol{\beta}_{k}\right), \end{aligned} \]

其中, \(\bm{K}_{k} \in {\mathbb R}^{L \times L}\) , \(\bm{H}_{k} \in {\mathbb R}^{N_k\times L}\) , \({\bm \beta}_k \in {\mathbb R}^{L \times m}\) , \({\bm T}_k \in {\mathbb R}^{N_k\times m}\) , \(k=1,2,\cdots\) . 根据输出权重的迭代更新公式.6.6 可知, 每次更新权重都需要计算一次 \(L \times L\) 矩阵 \(\bm{K}_{k+1}\) 的逆.
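可以用如下示意代码数值验证式.6.6: 递推得到的 \({\bm \beta}_1\) 与在堆叠数据上直接求批量最小二乘解的结果一致(各矩阵均为随机生成, 规模为假设值):

```python
import numpy as np

rng = np.random.default_rng(1)
L, m, N0, N1 = 10, 2, 50, 8  # 假设的规模, 满足 N0 > L

H0, T0 = rng.standard_normal((N0, L)), rng.standard_normal((N0, m))
H1, T1 = rng.standard_normal((N1, L)), rng.standard_normal((N1, m))

# 初始解: β0 = K0⁻¹ H0ᵀ T0
K0 = H0.T @ H0
beta0 = np.linalg.solve(K0, H0.T @ T0)

# 式 (6.6): K1 = K0 + H1ᵀH1, β1 = β0 + K1⁻¹ H1ᵀ (T1 − H1 β0)
K1 = K0 + H1.T @ H1
beta1 = beta0 + np.linalg.solve(K1, H1.T @ (T1 - H1 @ beta0))

# 与堆叠数据上的批量最小二乘解对比
H, T = np.vstack([H0, H1]), np.vstack([T0, T1])
beta_batch, *_ = np.linalg.lstsq(H, T, rcond=None)
print(np.allclose(beta1, beta_batch))  # True
```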

由 Sherman-Morrison-Woodbury 等式.7.2 (参见 矩阵论常用总结 小结) 知

\[\begin{aligned} \bm{K}_{k+1}^{-1} &=\left(\bm{K}_{k}+\bm{H}_{k+1}^{T} \bm{H}_{k+1}\right)^{-1} \\ &=\bm{K}_{k}^{-1}-\bm{K}_{k}^{-1} \bm{H}_{k+1}^{T}\left(\bm{I}+\bm{H}_{k+1} \bm{K}_{k}^{-1} \bm{H}_{k+1}^{T}\right)^{-1} \bm{H}_{k+1} \bm{K}_{k}^{-1}. \end{aligned} \]

若记 \({\bm P}_{k+1} = {\bm K}^{-1}_{k+1}\) , 则有如下迭代公式

(6.7)\[\begin{aligned} \bm{P}_{k+1} &=\bm{P}_{k}-\bm{P}_{k} \bm{H}_{k+1}^{T}\left(\bm{I}+\bm{H}_{k+1} \bm{P}_{k} \bm{H}_{k+1}^{T}\right)^{-1} \bm{H}_{k+1} \bm{P}_{k} \\ \boldsymbol{\beta}_{k+1} &=\boldsymbol{\beta}_{k}+\bm{P}_{k+1} \bm{H}_{k+1}^{T}\left(\bm{T}_{k+1}-\bm{H}_{k+1} \boldsymbol{\beta}_{k}\right) \end{aligned} \]

其中, \({\bm I}\in{\mathbb R}^{N_{k+1}\times N_{k+1}}\) 为单位矩阵. 当 \(N_{k+1} < L\) 时, 式.6.7 可以减小计算量.
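下面是式.6.7 按块更新的一个示意实现, 并数值验证递推得到的 \(\bm{P}\) 与直接求逆的结果一致(函数名 oselm_chunk_update 及各矩阵规模均为假设):

```python
import numpy as np

def oselm_chunk_update(P, beta, H_new, T_new):
    """式 (6.7): 按块更新 P = K⁻¹ 与输出层权重 β."""
    # 仅需对 (I + H P Hᵀ) 这个 N×N 矩阵求逆(这里用 solve 代替),
    # 当块大小 N < L 时比直接求 L×L 的 K⁻¹ 计算量更小
    S = np.eye(H_new.shape[0]) + H_new @ P @ H_new.T
    P = P - P @ H_new.T @ np.linalg.solve(S, H_new @ P)
    beta = beta + P @ H_new.T @ (T_new - H_new @ beta)
    return P, beta

# 数值验证: P 的递推结果与直接计算 (K0 + H1ᵀH1)⁻¹ 一致
rng = np.random.default_rng(2)
L, m, N0, N1 = 10, 2, 50, 4
H0, T0 = rng.standard_normal((N0, L)), rng.standard_normal((N0, m))
H1, T1 = rng.standard_normal((N1, L)), rng.standard_normal((N1, m))
P0 = np.linalg.inv(H0.T @ H0)
beta0 = P0 @ H0.T @ T0
P1, beta1 = oselm_chunk_update(P0, beta0, H1, T1)
print(np.allclose(P1, np.linalg.inv(H0.T @ H0 + H1.T @ H1)))  # True
```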

若样本以 one-by-one 的形式输入 OSELM, 则更新公式.6.7 简化为

(6.8)\[\begin{array}{l}{\bm{P}_{k+1}=\bm{P}_{k}-\frac{\bm{P}_{k} \bm{h}_{k+1} \bm{h}_{k+1}^{T} \bm{P}_{k}}{1+\bm{h}_{k+1}^{T} \bm{P}_{k} \bm{h}_{k+1}}} \\ {{\bm \beta}_{k+1}={\bm \beta}_{k}+\bm{P}_{k+1} \bm{h}_{k+1}\left(\bm{t}_{k+1}^{T}-\bm{h}_{k+1}^{T} {\bm \beta}_{k}\right)}\end{array}. \]
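one-by-one 情形下更新只涉及向量运算, 完全不需要矩阵求逆. 一个示意实现如下(这里 h 为隐藏层输出列向量对应的一维数组, t 为该样本的目标向量, 接口形式为假设):

```python
import numpy as np

def oselm_one_by_one(P, beta, h, t):
    """式 (6.8): 单样本 (rank-1) 更新, 无需任何矩阵求逆.

    P: (L, L) 对称矩阵, beta: (L, m), h: (L,), t: (m,)
    """
    Ph = P @ h                                   # P_k h_{k+1}
    P = P - np.outer(Ph, Ph) / (1.0 + h @ Ph)    # P 对称, 故 PhhᵀP = outer(Ph, Ph); 分母为标量
    beta = beta + np.outer(P @ h, t - h @ beta)  # 修正项 P_{k+1} h (tᵀ − hᵀβ_k)
    return P, beta
```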

上述在线训练算法总结如下:


输入: 样本序列 \(\mathbb{S}_{0}=\left\{\left(\bm{x}_{i}, \bm{t}_{i}\right)\right\}_{i=1}^{N_{0}}\) , \({\mathbb S}_{1}=\left\{\left(\bm{x}_{i}, \bm{t}_{i}\right)\right\}_{i=N_{0}+1}^{N_{0}+N_{1}}\) , \(\cdots\) , \({\mathbb S}_{k+1}=\left\{\left(\bm{x}_{i}, \bm{t}_{i}\right)\right\}_{i=\sum_{j=0}^k N_{j}+1}^{\sum_{j=0}^{k+1} N_{j}}\) , \(k=1, 2, \cdots, K-1\) .

输出: 更新后的输出层权重 \({\bm \beta}_{K}\)

  1. 初始训练(\(k=0\), \(N_0 > L\)):

    计算初始样本集 \({\mathbb S}_0\) 的隐藏层输出 \({\bm H}_0\) , 估计初始权重 \({\bm \beta}_0 = \bm{P}_{0} \bm{H}_{0}^{T} \bm{T}_{0}\) , 其中 \(\bm{P}_{0}=\left(\bm{H}_{0}^{T} \bm{H}_{0}\right)^{-1}\) , \(\bm{T}_{0}=\left[\bm{t}_{1}, \dots, \bm{t}_{N_{0}}\right]^{T}\) .

  2. 序列学习(\(k=0, 1, \cdots, K-1\), \(N_{k+1}\) 为第 \(k+1\) 块的样本数):

    • 若 \(N_{k+1} > 1\) , 计算第 \(k+1\) 个样本集 \({\mathbb S}_{k+1}\) 的隐藏层输出 \({\bm H}_{k+1}\), 并根据 式.6.7 更新权重;

    • 若 \(N_{k+1} = 1\) , 计算第 \(k+1\) 个样本的隐藏层输出 \({\bm h}_{k+1}\), 并根据 式.6.8 更新权重;

    当 \(k+1 = K\) 时停止训练, 并输出权重 \({\bm \beta}_{K}\) .
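按照上述流程, 下面给出一个把初始训练与序列学习串联起来的端到端示意实现(隐层采用随机 sigmoid 节点, 数据块为随机生成, 仅用于演示式.6.7 的更新流程, 并非唯一写法):

```python
import numpy as np

rng = np.random.default_rng(3)
n, L, m = 5, 20, 2               # 假设的输入维数, 隐层节点数, 输出维数
W = rng.standard_normal((n, L))  # 随机输入权重, 训练中固定
b = rng.standard_normal(L)       # 随机偏置

def hidden(X):
    """计算随机 sigmoid 隐层输出 H."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

# 1. 初始训练: 取 N0 > L 个样本
N0 = 100
X0, T0 = rng.standard_normal((N0, n)), rng.standard_normal((N0, m))
H0 = hidden(X0)
P = np.linalg.inv(H0.T @ H0)  # P0 = (H0ᵀH0)⁻¹
beta = P @ H0.T @ T0          # β0

# 2. 序列学习: 每次到达一块 N 个样本, 按式 (6.7) 更新
for _ in range(10):
    N = int(rng.integers(1, 8))
    Xk, Tk = rng.standard_normal((N, n)), rng.standard_normal((N, m))
    Hk = hidden(Xk)
    S = np.eye(N) + Hk @ P @ Hk.T
    P = P - P @ Hk.T @ np.linalg.solve(S, Hk @ P)
    beta = beta + P @ Hk.T @ (Tk - Hk @ beta)

print(beta.shape)  # (L, m), 即最终输出层权重 β_K
```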