29 数据降维与可视化（t-SNE）——数无形时少直觉

作者: 7125messi | 来源:发表于2018-06-25 09:48 被阅读16次

29 数据降维与可视化（t-SNE）——数无形时少直觉
t-SNE原理总览
sc-RNA-seq数据分析|| UMAP与t-NSE的区别
UMAP的初步了解及与t-SNE的比较
93-非监督学习之PCA线性降维
机器学习算法：UMAP 深入理解
数据降维与可视化工具——t-SNE实践教程
高维数据可视化Visualizing data using t-
从Word2vec可视化算法t-SNE谈起
数形结合思想的力量

t-SNE(t-distributed stochastic neighbor embedding)是用于降维的一种机器学习算法,又叫“T 分布随机近邻嵌入算法（t-SNE）”，是由 Laurens van der Maaten 和 Geoffrey Hinton在08年提出来。此外，t-SNE 是一种非线性降维算法，非常适用于高维数据降维到2维或者3维，进行可视化。

t-SNE是由SNE(Stochastic Neighbor Embedding, SNE; Hinton and Roweis, 2002)发展而来。我们先介绍SNE的基本原理，之后再扩展到t-SNE。最后再看一下t-SNE的实现以及一些优化。

详细内容请参考作者的网站

http://lvdmaaten.github.io/tsne/

t-SNE是目前来说效果最好的数据降维与可视化方法，但是它的缺点也很明显，比如：占内存大，运行时间长。
但是，当我们想要对高维数据进行分类，又不清楚这个数据集有没有很好的可分性（即同类之间间隔小，异类之间间隔大），可以通过t-SNE投影到2维或者3维的空间中观察一下。
如果在低维空间中具有可分性，则数据是可分的；
如果在高维空间中不具有可分性，可能是数据不可分，也可能仅仅是因为不能投影到低维空间。

t-distributed Stochastic Neighbor Embedding(t-SNE)
t-SNE（TSNE）将数据点之间的亲和力转换为概率。
原始空间中的亲和度由高斯联合概率表示，嵌入空间的亲和度由“学生t分布”表示。这使得t-SNE对局部结构特别敏感，并且与现有技术相比具有一些其他优点：

在单个图上显示多个尺度的结构
可以显示多类的、不同的、在高维空间上交错的或聚合的数据
显示图像中相同类的数据点的分布更发散

虽然Isomap，LLE和variants等数据降维和可视化方法，最适合展开单个连续的低维的manifold，但是t-SNE主要是关注数据的局部结构，并且对于下图所示的S曲线（不同颜色的图像表示不同类别的数据）。正是这种关注数据局部特征的方法，更有利于manifold数据的可视化。

manifold：可以称之为流形数据。像绳结一样的数据，虽然在高维空间中可分，但是在人眼所看到的低维空间中，绳结中的绳子是互相重叠的不可分的。

通过原始空间和嵌入空间的联合概率的Kullback-Leibler（KL）散度来评估可视化效果的好坏，也就是说用有关KL散度的函数作为loss函数，然后通过梯度下降最小化loss函数，最终获得收敛结果。注意，有关KL散度的loss函数不是凸函数，即具有不同初始值的多次运行将收敛于KL散度函数的局部最小值中，以致获得不同的结果。因此，尝试不同的随机数种子(Python中可以通过设置seed来获得不同的随机分布)有时候是有用的，并选择具有最低KL散度值的结果。

使用t-SNE的缺点是：

t-SNE的计算复杂度很高，在数百万个样本数据集中可能需要几个小时，而PCA可以在几秒钟或几分钟内完成Barnes-Hut t-SNE方法（下面讲）限于二维或三维嵌入。
算法是随机的，具有不同种子的多次实验可以产生不同的结果。虽然选择loss最小的结果就行，但可能需要多次实验以选择超参数。
全局结构未明确保留。这个问题可以通过PCA初始化点（使用init ='pca'）来缓解。

一个简单的例子，输入4个3维的数据，然后通过t-SNE降维称2维的数据。

import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
tsne = TSNE(n_components=2)
tsne.fit_transform(X)
print(tsne.embedding_)

'''输出
[[   3.17274952 -186.43092346]
 [  43.70787048 -283.6920166 ]
 [ 100.43157196 -145.89025879]
 [ 140.96669006 -243.15138245]]'''

手写数字的降维可视化

t-SNE将8*8即64维的数据降维成2维，并在平面图中显示，这里只选取了0-5，6个手写数字。

# coding='utf-8'
"""t-SNE对手写数字进行可视化"""
from time import time
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.manifold import TSNE


def get_data():
    digits = datasets.load_digits(n_class=6)
    data = digits.data
    label = digits.target
    n_samples, n_features = data.shape
    return data, label, n_samples, n_features


def plot_embedding(data, label, title):
    x_min, x_max = np.min(data, 0), np.max(data, 0)
    data = (data - x_min) / (x_max - x_min)

    fig = plt.figure()
    ax = plt.subplot(111)
    for i in range(data.shape[0]):
        plt.text(data[i, 0], data[i, 1], str(label[i]),
                 color=plt.cm.Set1(label[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})
    plt.xticks([])
    plt.yticks([])
    plt.title(title)
    return fig


def main():
    data, label, n_samples, n_features = get_data()
    print('Computing t-SNE embedding')
    tsne = TSNE(n_components=2, init='pca', random_state=0)
    t0 = time()
    result = tsne.fit_transform(data)
    fig = plot_embedding(result, label,
                         't-SNE embedding of the digits (time %.2fs)'
                         % (time() - t0))
    plt.show(fig)


if __name__ == '__main__':
    main()

1.SNE

1.1基本原理

SNE是通过仿射(affinitie)变换将数据点映射到概率分布上，主要包括两个步骤：

SNE构建一个高维对象之间的概率分布，使得相似的对象有更高的概率被选择，而不相似的对象有较低的概率被选择。
SNE在低维空间里在构建这些点的概率分布，使得这两个概率分布之间尽可能的相似。
我们看到t-SNE模型是非监督的降维，他跟kmeans等不同，他不能通过训练得到一些东西之后再用于其它数据（比如kmeans可以通过训练得到k个点，再用于其它数据集，而t-SNE只能单独的对数据做操作，也就是说他只有fit_transform，而没有fit操作）

1.2 SNE原理推导

2.t-SNE

尽管SNE提供了很好的可视化方法，但是他很难优化，而且存在”crowding problem”(拥挤问题)。后续中，Hinton等人又提出了t-SNE的方法。与SNE不同，主要如下:

使用对称版的SNE，简化梯度公式
低维空间下，使用t分布替代高斯分布表达两点之间的相似度

t-SNE在低维空间下使用更重长尾分布的t分布来避免crowding问题和优化问题。在这里，首先介绍一下对称版的SNE，之后介绍crowding问题，之后再介绍t-SNE。

2.1 Symmetric SNE

2.2 Crowding问题

2.3 t-SNE

2.4 算法过程

t-sne_optimise.gif

t-sne_optimise.gif插图绘制

import numpy as np
from numpy.linalg import norm
from matplotlib import pyplot as plt
plt.style.use('ggplot')

def sne_crowding():
    npoints = 1000 # 抽取1000个m维球内均匀分布的点
    plt.figure(figsize=(20, 5))
    for i, m in enumerate((2, 3, 5, 8)):
        # 这里模拟m维球中的均匀分布用到了拒绝采样，
        # 即先生成m维立方中的均匀分布，再剔除m维球外部的点
        accepts = []
        while len(accepts) < 1000:
            points = np.random.rand(500, m)
            accepts.extend([d for d in norm(points, axis=1)
                            if d <= 1.0]) # 拒绝采样
        accepts = accepts[:npoints]
        ax = plt.subplot(1, 4, i+1)
        if i == 0:
            ax.set_ylabel('count')
        if i == 2:

            ax.set_xlabel('distance')
        ax.hist(accepts, bins=np.linspace(0., 1., 50))
        ax.set_title('m=%s' %m)
    plt.savefig("./images/sne_crowding.png")

    x = np.linspace(0, 4, 100)
    ta = 1 / (1 + np.square(x))
    tb = np.sum(ta) - 1
    qa = np.exp(-np.square(x))
    qb = np.sum(qa) - 1

def sne_norm_t_dist_cost():
    plt.figure(figsize=(8, 5))
    plt.plot(qa/qb, c="b", label="normal-dist")
    plt.plot(ta/tb, c="g", label="t-dist")
    plt.plot((0, 20), (0.025, 0.025), 'r--')
    plt.text(10, 0.022, r'$q_{ij}$')
    plt.text(20, 0.026, r'$p_{ij}$')

    plt.plot((0, 55), (0.005, 0.005), 'r--')
    plt.text(36, 0.003, r'$q_{ij}$')
    plt.text(55, 0.007, r'$p_{ij}$')

    plt.title("probability of distance")
    plt.xlabel("distance")
    plt.ylabel("probability")
    plt.legend()
    plt.savefig("./images/sne_norm_t_dist_cost.png")

if __name__ == '__main__':
    sne_crowding()
    sne_norm_t_dist_cost()

2.5 不足

主要不足有四个:

主要用于可视化，很难用于其他目的。比如测试集合降维，因为他没有显式的预估部分，不能在测试集合直接降维；比如降维到10维，因为t分布偏重长尾，1个自由度的t分布很难保存好局部特征，可能需要设置成更高的自由度。
t-SNE倾向于保存局部特征，对于本征维数(intrinsic dimensionality)本身就很高的数据集，是不可能完整的映射到2-3维的空间
t-SNE没有唯一最优解，且没有预估部分。如果想要做预估，可以考虑降维之后，再构建一个回归方程之类的模型去做。但是要注意，t-sne中距离本身是没有意义，都是概率分布问题。
训练太慢。有很多基于树的算法在t-sne上做一些改进

3 t-sne的完整代码实现

# coding utf-8
'''
代码参考了作者Laurens van der Maaten的开放出的t-sne代码, 并没有用类进行实现,主要是优化了计算的实现
'''
import numpy as np


def cal_pairwise_dist(x):
    '''计算pairwise 距离, x是matrix
    (a-b)^2 = a^w + b^2 - 2*a*b
    '''
    sum_x = np.sum(np.square(x), 1)
    dist = np.add(np.add(-2 * np.dot(x, x.T), sum_x).T, sum_x)
    return dist


def cal_perplexity(dist, idx=0, beta=1.0):
    '''计算perplexity, D是距离向量，
    idx指dist中自己与自己距离的位置，beta是高斯分布参数
    这里的perp仅计算了熵，方便计算
    '''
    prob = np.exp(-dist * beta)
    # 设置自身prob为0
    prob[idx] = 0
    sum_prob = np.sum(prob)
    perp = np.log(sum_prob) + beta * np.sum(dist * prob) / sum_prob
    prob /= sum_prob
    return perp, prob


def seach_prob(x, tol=1e-5, perplexity=30.0):
    '''二分搜索寻找beta,并计算pairwise的prob
    '''

    # 初始化参数
    print("Computing pairwise distances...")
    (n, d) = x.shape
    dist = cal_pairwise_dist(x)
    pair_prob = np.zeros((n, n))
    beta = np.ones((n, 1))
    # 取log，方便后续计算
    base_perp = np.log(perplexity)

    for i in range(n):
        if i % 500 == 0:
            print("Computing pair_prob for point %s of %s ..." %(i,n))

        betamin = -np.inf
        betamax = np.inf
        perp, this_prob = cal_perplexity(dist[i], i, beta[i])

        # 二分搜索,寻找最佳sigma下的prob
        perp_diff = perp - base_perp
        tries = 0
        while np.abs(perp_diff) > tol and tries < 50:
            if perp_diff > 0:
                betamin = beta[i].copy()
                if betamax == np.inf or betamax == -np.inf:
                    beta[i] = beta[i] * 2
                else:
                    beta[i] = (beta[i] + betamax) / 2
            else:
                betamax = beta[i].copy()
                if betamin == np.inf or betamin == -np.inf:
                    beta[i] = beta[i] / 2
                else:
                    beta[i] = (beta[i] + betamin) / 2

            # 更新perb,prob值
            perp, this_prob = cal_perplexity(dist[i], i, beta[i])
            perp_diff = perp - base_perp
            tries = tries + 1
        # 记录prob值
        pair_prob[i,] = this_prob
    print("Mean value of sigma: ", np.mean(np.sqrt(1 / beta)))
    return pair_prob


def pca(x, no_dims = 50):
    ''' PCA算法
    使用PCA先进行预降维
    '''
    print("Preprocessing the data using PCA...")
    (n, d) = x.shape
    x = x - np.tile(np.mean(x, 0), (n, 1))
    l, M = np.linalg.eig(np.dot(x.T, x))
    y = np.dot(x, M[:,0:no_dims])
    return y


def tsne(x, no_dims=2, initial_dims=50, perplexity=30.0, max_iter=1000):
    """Runs t-SNE on the dataset in the NxD array x
    to reduce its dimensionality to no_dims dimensions.
    The syntaxis of the function is Y = tsne.tsne(x, no_dims, perplexity),
    where x is an NxD NumPy array.
    """

    # Check inputs
    if isinstance(no_dims, float):
        print("Error: array x should have type float.")
        return -1
    if round(no_dims) != no_dims:
        print("Error: number of dimensions should be an integer.")
        return -1

    # 初始化参数和变量
    x = pca(x, initial_dims).real
    (n, d) = x.shape
    initial_momentum = 0.5
    final_momentum = 0.8
    eta = 500
    min_gain = 0.01
    y = np.random.randn(n, no_dims)
    dy = np.zeros((n, no_dims))
    iy = np.zeros((n, no_dims))
    gains = np.ones((n, no_dims))

    # 对称化
    P = seach_prob(x, 1e-5, perplexity)
    P = P + np.transpose(P)
    P = P / np.sum(P)
    # early exaggeration
    P = P * 4
    P = np.maximum(P, 1e-12)

    # Run iterations
    for iter in range(max_iter):
        # Compute pairwise affinities
        sum_y = np.sum(np.square(y), 1)
        num = 1 / (1 + np.add(np.add(-2 * np.dot(y, y.T), sum_y).T, sum_y))
        num[range(n), range(n)] = 0
        Q = num / np.sum(num)
        Q = np.maximum(Q, 1e-12)

        # Compute gradient
        PQ = P - Q
        for i in range(n):
            dy[i,:] = np.sum(np.tile(PQ[:,i] * num[:,i], (no_dims, 1)).T * (y[i,:] - y), 0)

        # Perform the update
        if iter < 20:
            momentum = initial_momentum
        else:
            momentum = final_momentum
        gains = (gains + 0.2) * ((dy > 0) != (iy > 0)) + (gains * 0.8) * ((dy > 0) == (iy > 0))
        gains[gains < min_gain] = min_gain
        iy = momentum * iy - eta * (gains * dy)
        y = y + iy
        y = y - np.tile(np.mean(y, 0), (n, 1))
        # Compute current value of cost function
        if (iter + 1) % 100 == 0:
            if iter > 100:
                C = np.sum(P * np.log(P / Q))
            else:
                C = np.sum( P/4 * np.log( P/4 / Q))
            print("Iteration ", (iter + 1), ": error is ", C)
        # Stop lying about P-values
        if iter == 100:
            P = P / 4
    print("finished training!")
    return y


if __name__ == "__main__":
    # Run Y = tsne.tsne(X, no_dims, perplexity) to perform t-SNE on your dataset.
    X = np.loadtxt("mnist2500_X.txt")
    labels = np.loadtxt("mnist2500_labels.txt")
    Y = tsne(X, 2, 50, 20.0)
    from matplotlib import pyplot as plt
    plt.scatter(Y[:,0], Y[:,1], 20, labels)
    plt.show()

29 数据降维与可视化（t-SNE）——数无形时少直觉
t-SNE(t-distributed stochastic neighbor embedding)是用于降维的一...
t-SNE原理总览
引言：t-SNE 是一种非线性降维算法，非常适用于高维数据降维到2维或者3维，进行可视化。是单细胞测序高分文章中常...
sc-RNA-seq数据分析|| UMAP与t-NSE的区别
UMAP 将会与tsne一样作为高纬数据可视化的利器，并且优于tsne. UMAP 与 t-SNE 均是非线性降维...
UMAP的初步了解及与t-SNE的比较
降维是机器学习中的可视化和理解高维数据的强大工具。t-SNE是最广泛使用的可视化技术之一，但其性能在大型数据集中会...
93-非监督学习之PCA线性降维
为何要降维？方便可视化探索；减轻维度诅咒；缓解共线性。降维方法：PCA, t-SNE, UMAP, SOM, LL...
机器学习算法：UMAP 深入理解
导读降维是机器学习从业者可视化和理解大型高维数据集的常用方法。最广泛使用的可视化技术之一是 t-SNE[http...
数据降维与可视化工具——t-SNE实践教程
姓名：张国帅学号：19021110506 【嵌牛导读】t-SNE是一种集降维与可视化于一体的技术，它是基于SNE...
高维数据可视化Visualizing data using t-
t-SNE高维数据可视化（python）这篇文章非常好，贴出来的代码，直接可正确运行。 t-SNE算法理解：An ...
从Word2vec可视化算法t-SNE谈起
刚好最近经常看一些word2vec的文章，在最后往往看到作者说用t-SNE可视化结果，也即把高维度的数据降维并可视...
数形结合思想的力量
数形结合思想在解决数学问题的意义与价值数与形，本是相倚依，焉能分作两边飞，数无形时少直觉，形少数时...