文本分类器

总结几个常用的文本分类器，从数学原理到效果测评。

一、常用文本分类器

1、朴素贝叶斯分类器

$P(Y|X)=\frac{P(Y)P(X|Y)}{P(X)}$

P(Y|X)是已知X发生后Y的条件概率，也由于得自X的取值而被称作Y的后验概率。

P(Y)是Y的先验概率（或边缘概率）。之所以称为”先验”是因为它不考虑任何X方面的因素。在文本分类中，Y代表文本所属类别，X代表出现的文本。分类器解决的问题是，寻找最大的P(Y|X)，即Y在给定X下的最大后验概率。

注意公式中的分母部分，在给定X条件下，与Y无关，所以只需要比较分子部分P(Y)P(X|Y)即可。其中P(X|Y)称为似然项。假设短文本长度为N（X的维度）,词表大小为M（X每一维的取值），类别数为K(Y的取值)。整体的组合复杂度为$ K*M^{N} $

复杂度过高，所以引入条件独立性假设，各个词之间独立出现，不会相互影响。这样似然项可以改写为P((X1,X2,…Xm)|Y) ,如果不考虑位置，复杂度降为K*M。在实际使用中，还要考虑由于数据不足导致的某个词在某个类别下没有出现的情况，加入拉普拉斯平滑项。

举一个实际的例子，出自信息检索导论

类别：短文本

类别	短文本
China	Chinese Beijing Chinese
China	Chinese Chinese Shanghai
China	Chinese Macao
Japan	Tokyo Japanese Chinese
?	Chinese Chinese Chinese Tokyo Japanese

根据前四条训练数据，判断最后一条数据所属类别。首先求各个类别的先验概率：

P(China)=3/4，P(Japan)1/4

然后统计加入拉普拉斯平滑后的各个词在各个类别下的后验概率：

测试数据在各个类别下概率计算： $\begin{aligned} P(China|Chinese Chinese Chinese Tokyo Japanese) &=P(China)P(Chinese|China)^{3}P(Tokyo|China)P(Japanese|China)\\ &=\frac{3}{4}\ast (\frac{3}{7})^{3}\ast \frac{1}{14}\ast \frac{1}{14} \end{aligned}$

$\begin{aligned} P(Japan|Chinese Chinese Chinese Tokyo Japanese)&=P(Japan)P(Chinese|Japan)^{3}P(Tokyo|Japan)P(Japanese|Japan)\\ &=\frac{1}{4}\ast (\frac{2}{9})^{3}\ast \frac{2}{9}\ast \frac{2}{9} \end{aligned}$

测试数据在China类下的概率为0.0003，在Japan类的概率为0.0001，所以会分到China类下。

2、fastText

fastText是facebook提出的一个文本分类器，其特点是实现简单，训练速度快。在普通cpu上，fastText可以在10分钟内训练10亿条数据，在1分钟内预测50万条数据分类（类别种类30万）。论文原文可以在这里下载。

fastText的网络结构和word2vec的CBOW十分相似，只是把中间词换成了类别标签。其网络结构如下：

fastText的目标函数可以写为：

$Obj.=min(-\frac{1}{N}\sum_{n=1}^{N}y_{n}log(f(BAx_{n})))$

其中，x为输入词的向量均值，y为分类的label，应该是一个多维one hot编码向量。整个网络只有简单的三层结构。

从目标函数可以看出，算法是利用极大似然估计法，求解参数，使得全体样本正确分类的概率最大，最后的output层可以添加softmax激活函数。

但是，这个简单的模型却有一个问题，那就是如果需要分类的类别过多，会使得模型每轮迭代需要更新的参数过多，影响训练速度。

作者采用了和CBOW一样的思想，对输出层的label进行huffman编码，利用hierarchical softmax方法，降低模型计算复杂度。

假设一共有k个类别，hidden输出向量维度为h(一般取100)，原训练方式每次迭代需要更新的参数为O(kh)，使用hierarchical softmax后，更新的需要更新的参数将为O(hlog(k))，注意，模型整体的参数总数并没有减少，只是每次迭代需要更新的参数减少了。

因为模型采用的是词袋模型，没有考虑词之间的先后顺序，在实际使用中，可采用n-gram模型，增加局部词的空间信息，一般情况下使用3-gram。

最终，作者在两类任务上(情感分型和标签预测)给出了测评。最终结果显示，与其他的方法相比，fastText在准确率上与其他方法基本持平，但是在训练时间上有十分明显的优势，这也是这个模型最大的亮点-fast。

3、textCNN

textCNN也是一个经典的base文本分类器，在2014年由New York大学的Yoon Kim提出，主要思想是利用CNN网络和预训练好的词向量进行文本分类。论文可以在这里下载。

textCNN提出的网络模型（其中一个）如下：

在论文中，作者共提出了四个模型变种，上图显示的是最复杂的一个，CNN-multichannel，这里的multichannel指的是第一层的两个通道。具体四个模型如下：

CNN-rand
CNN-static
CNN-non-statistic
CNN-multichannel

下面详细解释下上面的模型：

第一层是文本表示层或者叫输入层，是一个nk2的张量。其中n为词的个数，k为每个词向量对应的维数。第一层还有两个通道，分别对应静态和非静态通道，静态的意思是，词向量在模型训练过程中，不进行更新，非静态的是指模型训练过程中，更新词向量，模型在初始化阶段，两个通道的值是完全相同的。

第二层就是卷积层，卷积核的形状为h*k，注意，这里的h为词向量的维度，也就是说，经过卷积层，输出张量的维度会变为(n-h+1,1,m)，m为卷积核的个数。

第三层为pooling层，每个卷积核的输出分别进行max-pooling操作，输出变为m维向量。

最后一层是全连接层+softmax函数，用于对分类进行预测。

在训练过程中，除了更新模型的参数，也会更新输入层的非静态通道对应的词向量。这就是CNN-multichannel。

其余三个模型基本与这个模型一致，只不过输入变为单通道。

CNN-rand: 输入层的词向量并不是预先训练好的，而是随机生成的。在模型训练过程中，同时更新参数和词向量。
CNN-static: 输入层词向量是预先训练好的，在模型训练过程中，只更新参数，词向量保持不变。
CNN-non-statistic: 输入词向量是预先训练好的，在模型训练过程中，更新词向量。

第一个模型CNN-rand可以算是个基础的base版，在四个模型中效果是最差的，这也从侧面反应了这种网络结构不适合直接训练word-embedding。

第二个模型CNN-static使用预先训练的词向量，但是不更新词向量，训练过程是最简单的，其效果也不错。

第三个模型CNN-non-statistic与第二个模型的唯一区别是，在训练过程中更新词向量，其效果有了明显的提高。

第四个模型CNN-multichannel，在输入层采用了双通道，一个更新词向量，一个不更新词向量。作者交代这样做的目的是，为了对抗模型三中可能产生的过拟合。但是从最终的结果上，模型4和模型3的效果基本持平，没有明显提升。实际使用时，可以考虑使用模型3就行了。

最后放一个四个模型的测评效果图。

从实验结果可以看出，在不同的数据集上，textCNN都有不俗的表现。

最终，作者对模型进行了几点总结：

输出层的dropout在正则化上非常有用，可以有效对抗过拟合，经评估，dropout提升了模型效果2%~4%。
高质量的embedding非常有用，对模型效果起着决定性作用。
使用卷积+embedding就可以达到这种效果，说明embedding在NLP领域都有非常重要的作用。

二、代码实现和效果测评

所有代码都可以在这里找到。

分别实现的naive bayes,fast-text和text-cnn的rand版本，在一份新闻数据集上进行测试。

1、数据集介绍

为了实现文本分类，我在网上搜索了一些数据集，最终决定采用20 Newsgroups这个数据集，关于这个数据集的详细介绍可以看这里。相关的数据也可以在我的baidu云下载。原作者关于这份数据集做了简单的类别分布统计，如下图：

各个类别的数据分布均匀，每个类别的训练数据和测试数据大概按4：3分布也算比较合理，数据总量含笑，只有不到2万，可以方便我们快速进行实验。数据的格式如下：

新闻类别1 tab 新闻内容
新闻类别1 tab 新闻内容
新闻类别2 tab 新闻内容
。。。
新闻类别20 tab 新闻内容

2、naive bayes分类器

naive bayes分类器的原理在之前已经介绍过了，训练过程主要是根据训练数据得到一些统计值，具体如下：

每个类别的先验概率
每个词在每个类别下的后验概率

同时，使用naive bayes分类器还要注意laplace平滑和存储计算概率的log值而不是原值，log值累加可以防止原值累成产生的越界。

首先引入math库，用于计算log，并定义一些统计变量。

# coding: utf8
import math
total_train_num = 0 # 总样本数
label_num_dict = {} # 每个类别样本数
word_label_num_dict = {} # 每个词在每个类别下样本数
label_word_count_dict = {} # 每个类别所有样本总词数
word_set = set() # 单词表，用于laplace平滑

定义好变量后，我们可以读入训练数据，进行统计。

train_data = '../data/20ng-train-no-stop.txt'
with open(train_data) as f:
    for line in f:
        line = line.strip()
        line_items = line.split('\t')
        if len(line_items) != 2 or line_items[0] == '' or line_items[1] == '':
            continue
        total_train_num += 1
        label = line_items[0]
        words = line_items[1].split()
        label_num_dict[label] = label_num_dict.get(label, 0) + 1
        for word in words:
            word_set.add(word)
            word_label_num_dict[word+'_'+label] = word_label_num_dict.get(word+'_'+label, 0) + 1
            label_word_count_dict[label] = label_word_count_dict.get(label, 0) + 1

为了防止某些词在某些类别中一次都没有出现的特殊情况，进行laplace平滑。

# for laplace smoothing
for word in word_set:
    for label in label_num_dict:
        word_label_num_dict[word+'_'+label] = word_label_num_dict.get(word+'_'+label, 0) + 1
        label_word_count_dict[label] = label_word_count_dict.get(label, 0) + 1

至此，所有的统计工作已经完成，接下来可以在测试数据上进行测试了。为了方便，我们定义一个函数，用于计算一篇文档属于某个类别的概率。

def get_label_prob(label, words_list):
    # priority prob
    p = math.log(label_num_dict.get(label) / total_train_num)
    for word in words_list:
        if word not in word_set:
            continue
        p += math.log((word_label_num_dict.get(word+'_'+label) / label_word_count_dict[label]))
    return p

这里可以根据words_list中的词，计算这些词属于类别label的概率。代码为了简单易懂，没有进行效率上的优化。其实每个词属于每个类别的概率log值应该存下来，不应该重复计算，不过我们的数据集很小，这点计算量不是问题，可以忽略不计。

最后，我们可以在测试数据上进行测试了。

test_data = '../data/20ng-test-no-stop.txt'
test_total = 0
correct_num = 0
with open(test_data) as f:
    for line in f:
        line = line.strip()
        line_items = line.split('\t')
        if len(line_items) != 2 or line_items[0] == '' or line_items[1] == '':
            continue
        test_total += 1
        real_label = line_items[0]
        words = line_items[1].split()
        predict_list = []
        for label in label_num_dict:
            predict_list.append([label, get_label_prob(label, words)])
        predict_label = max(predict_list, key=lambda x:x[1])[0]
        if predict_label == real_label:
            correct_num += 1

print(correct_num / test_total)
# output：0.815

至此，一个简单的navie bayes分类器就完成了。在20个类别上的平均准确率达到81.5%。这只是一个base版的分类器，还有很多可以改进的地方。除了前面提到的效率问题，还可以加入n-gram信息，考虑相邻词语的组合。在训练数据中，停用词并没有去除的很好，我们可以考虑每个词的tf-idf值，对于每篇文章，使用tf-idf值靠前的词也许效果会更好。

3. fast-text分类器

fast-text和text-cnn都是基于神经网络的分类器，本文使用tensorflow构建这些分类器。

首先，需要引入依赖包，定义一些常量。

# coding: utf8

import tensorflow as tf
import collections
import numpy as np
import time

epoches = 10 # 训练数据有限，数据过10遍
words_length = 300 # 数据维度
n_words = 60000 # 词总数
batch_size = 30
embedding_size = 10
learning_rate = 0.01
train_data_path = '../data/20ng-train-no-stop.txt'

因为训练数据有限，我们把epoches设置为10，表示所有的训练数据会过10遍。words_length设置为300，当某条训练数据长度超过300时会进行截断，不足300会通过占位符在后面进行补充，训练数据平均长度为140。词的总数设置为6000，不考虑一些十分低频的词，只考虑词频top 6000的词。在训练数据中，总词数在6000到7000之间。

接下来，我们要对原始数据进行处理。我们把训练数据从文件中读出，存入一个list中，注意，这种方法只适用于训练数据很少的情况，如果训练数据很大，这么做不仅效率低下，而且容易内存溢出。

train_data_path = '../data/20ng-train-no-stop.txt'
raw_data_record_list = []
words_list = []
label_set = set()
with open(train_data_path) as f:
    for line in f:
        line = line.strip()
        line_items = line.split('\t')
        if len(line_items) != 2 or line_items[0] == '' or line_items[1] == '':
            continue
        label = line_items[0]
        words = line_items[1].split()
        if len(words) > words_length:
            words = words[:words_length]
        while len(words) < words_length:
            words.append('&')
        words_list.extend(words)
        label_set.add(label)
        raw_data_record_list.append([words, label])

读取训练数据后，我们需要对词进行编号，转换成神经网络能处理的形式。这里需要注意的是，在编号时，一般默认把词频最高的词编在前面，起始编号0为占位符的编号。在编好号后，也要注意把字典存储下来，在测试时使用。

# build dictionary
count = collections.Counter(words_list).most_common(n_words)
word_dict = {}
# every word with unique id
for word, _ in count:
    word_dict[word] = len(word_dict)

with open('./model/word_dict', 'w') as fw:
    for word, ids in word_dict.items():
        fw.write(word + '\t' + str(ids) + '\n')

label_dict = {}
for label in label_set:
    label_dict[label] = len(label_dict)

with open('./model/label_dict', 'w') as fw:
    for word, ids in label_dict.items():
        fw.write(word + '\t' + str(ids) + '\n')

data_record_list = []
for data in raw_data_record_list:
    words = [word_dict.get(k, 0) for k in data[0]]
    labels = label_dict[data[1]]
    data_record_list.append(([words, labels]))

我们使用place_holder的方式给神经网络提供数据，所以通过一个迭代器来提供数据。

# generate train data
def generate_batch(batch_size, data_record_list):
    for i in range(0, len(data_record_list), batch_size):
        batch = [k[0] for k in data_record_list[i:i+batch_size]]
        labels = [k[1] for k in data_record_list[i:i+batch_size]]
        yield batch, labels

下面就是运行图的构建和训练了

# build graph
vocabulary_size = len(word_dict)
num_classes = len(label_dict)
graph = tf.Graph()
with graph.as_default():
    # input data
    with tf.name_scope('inputs'):
        train_inputs = tf.placeholder(tf.int64, shape=[None, words_length], name='inputs')
        train_labels = tf.placeholder(tf.int64, shape=[None], name='labels')
    with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        with tf.name_scope('embeddings'):
            embeddings = tf.Variable(
                tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
            embed = tf.reduce_mean(tf.nn.embedding_lookup(embeddings, train_inputs), 1)
        input_layer = embed
        logits = tf.layers.dense(
            inputs=input_layer,
            units=num_classes,
            activation=None,
        )
        predictions = tf.argmax(logits, axis=-1, name='predictions')
        correct_predictions = tf.equal(predictions, train_labels)
        accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
        mean_loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=train_labels,
                logits=logits,
            ),
        )
        tf.summary.scalar('loss', mean_loss)
        tf.summary.scalar('accuracy', accuracy)
        train_step = tf.train.AdamOptimizer(learning_rate).minimize(mean_loss, global_step=tf.train.get_global_step())
        summary_op = tf.summary.merge_all()
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)
    summary_writer = tf.summary.FileWriter('./events', sess.graph)
    num = 0
    start_time = time.time()
    for k in range(epoches):
        np.random.shuffle(data_record_list)
        batch_generator = generate_batch(batch_size, data_record_list)
        for batch, labels in batch_generator:
            num += 1
            loss, _, _summary = sess.run([mean_loss, train_step, summary_op], feed_dict={train_inputs: batch, train_labels: labels})
            summary_writer.add_summary(_summary, num)
    end_time = time.time()
    print('total time:{}'.format(end_time - start_time))
    saver = tf.train.Saver()
    saver.save(sess, './model/fast_text.model')

以上只是训练模型的代码，使用测试集进行测试的代码可以查看github。最终，模型在测试集上的的准确率为80%左右。

下图展示了训练过程中，loss值的变化。从图中可以看错，经过3.5k轮的训练，loss已经趋近于0，说明模型在训练集上拟合的很好。

4.text-cnn

不同于fast-text和naive bayes，text-cnn的网络结构要相对复杂，使用了不同尺寸的的卷积层，保留了局部的结构信息。而fast-text和naive-bayes都没有考虑任何结构信息，只有加入n-gram特征才有结构信息。由于结构的复杂性，text-cnn如果想要更好的拟合，就需要更多的训练数据。text-cnn有四种结构，为方便快速进行测试，我们使用text-cnn-rand版，即随机初始化word embedding，而不使用训练好的的embedding，embedding随着训练更新。

首先，我们同样引入依赖的各种包，定义一些常量。

注意：这里与fast-text有一个重要区别，就是words_length的长度仅仅设置为30。

在fast-text中，我们介绍过，所有训练数据平均词数为140个，为了方便网络处理，我们设置了个大概2倍的数值300作为默认长度，对于超过300个词的训练数据进行截断，对于不足300个词的训练数据在后面通过占位符补充到300个。

在开始进行实验的时候，我同样把words-length设置为300，但是效果极差，因为有相当一部分训练数据在后面补充了大量的占位符。对于fast-text，占位符会和其余非占位符求平均后作为特征进行训练，但是text-cnn会更多考虑局部信息，导致卷积层滑动到后部的特征在这些数据上全部一样（都是占位符），无法有效的训练拟合。

在将words-length值设置成30后，每条训练数据只用前30个词，效果提高的非常明显，模型可以迅速收敛。对于不同的网络结构，数据的处理方式还是有很大的影响的，不可照搬。个人认为，相比于fast-text，text-cnn更适合处理长文本，并且停用词对于text-cnn这种考虑局部信息的词更敏感，因为卷积核的尺寸是固定的，通过去除停用词，可以使更多的有用信息集中在卷积核的窗格内。

# coding: utf8

import tensorflow as tf
import collections
import numpy as np
import time

epoches = 20
words_length = 30
n_words = 60000
batch_size = 30
embedding_size = 100
learning_rate = 0.1

接下来，我们用与fast-text类似的方式处理训练数据。

train_data_path = '../data/20ng-train-no-stop.txt'
raw_data_record_list = []
words_list = []
label_set = set()
with open(train_data_path) as f:
    for line in f:
        line = line.strip()
        line_items = line.split('\t')
        if len(line_items) != 2 or line_items[0] == '' or line_items[1] == '':
            continue
        label = line_items[0]
        words = line_items[1].split()
        if len(words) > words_length:
            words = words[:words_length]
        while len(words) < words_length:
            words.append('&')
        words_list.extend(words)
        label_set.add(label)
        raw_data_record_list.append([words, label])

# build dictionary
count = collections.Counter(words_list).most_common(n_words)
word_dict = {}
# every word with unique id
for word, _ in count:
    word_dict[word] = len(word_dict)

reverse_word_dict = dict((v,k) for k,v in word_dict.items())

with open('./model/word_dict', 'w') as fw:
    for word, ids in word_dict.items():
        fw.write(word + '\t' + str(ids) + '\n')

label_dict = {}
for label in label_set:
    label_dict[label] = len(label_dict)

reverse_label_dict = dict((v,k) for k,v in label_dict.items())

with open('./model/label_dict', 'w') as fw:
    for word, ids in label_dict.items():
        fw.write(word + '\t' + str(ids) + '\n')

data_record_list = []
for data in raw_data_record_list:
    words = [word_dict.get(k, 0) for k in data[0]]
    labels = label_dict[data[1]]
    data_record_list.append([words, labels])

# generate train data
def generate_batch(batch_size, data_record_list):
    for i in range(0, len(data_record_list), batch_size):
        batch = [k[0] for k in data_record_list[i:i+batch_size]]
        labels = [k[1] for k in data_record_list[i:i+batch_size]]
        yield batch, labels

如果发现模型不能按照预定的方式工作，首先应该检查训练数据本身是否有问题，其次是训练数据的读取是否有问题。更详细的的检查措施，可以看我的这篇博文——神经网络调优。

我就是在模型效果不符合预期时，检查训练数据到模型这一步（generator输出）发现的问题。

接下来就是构建训练图，进行训练了。

# build graph
vocabulary_size = len(word_dict)
num_classes = len(label_dict)
graph = tf.Graph()
filter_sizes = list(range(3,6))
num_filters = 8
with graph.as_default():
    # input data
    input_x = tf.placeholder(tf.int64, [None, words_length], name="input_x")
    input_y = tf.placeholder(tf.int64, [None], name="input_y")
    dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
    with tf.device('/cpu:0'), tf.name_scope("embedding"):
        W = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0),
            name="W")
        embedded_chars = tf.nn.embedding_lookup(W, input_x)
        embedded_chars_expanded = tf.expand_dims(embedded_chars, -1)
        def get_pool(filters, size):
            conv = tf.layers.conv2d(
                inputs=embedded_chars_expanded,
                filters=filters,
                kernel_size=[size, embedding_size],
                padding='valid',
                activation=tf.nn.relu,
                use_bias=True,
            )
            pool = tf.layers.max_pooling2d(
                inputs=conv,
                pool_size=[words_length-size+1, 1],
                strides=1,
            )
            return pool
        pool2 = get_pool(6, 2)
        pool3 = get_pool(6, 3)
        pool4 = get_pool(6, 4)
        pool5 = get_pool(5, 5)
        pool6 = get_pool(5, 6)
        pool7 = get_pool(5, 7)
        pool = tf.concat(
            values=[pool2, pool3, pool4, pool5, pool6, pool7],
            axis=3
        )
        pool = tf.reshape(pool, [-1, pool.shape[3]])
        logits = tf.layers.dense(
            inputs=pool,
            units=num_classes,
        )
        predictions = tf.argmax(logits, axis=-1, name='predictions')
        losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels = input_y,
                logits = logits
                )
        loss = tf.reduce_mean(losses)
        correct_predictions = tf.equal(predictions, input_y)
        accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
        tf.summary.scalar('loss', loss)
        tf.summary.scalar('acccuray', accuracy)
        optimizer = tf.train.AdamOptimizer(1e-4)
        grads_and_vars = optimizer.compute_gradients(loss)
        train_op = optimizer.apply_gradients(grads_and_vars)
        summary_op = tf.summary.merge_all()
        init = tf.global_variables_initializer()
        sess = tf.Session()
        sess.run(init)
        summary_writer = tf.summary.FileWriter('./events', sess.graph)
        index = 0
        for k in range(epoches):
            np.random.shuffle(data_record_list)
            batch_generator = generate_batch(batch_size, data_record_list)
            for x_batch, y_batch in batch_generator:
                index += 1
                feed_dict = {
                    input_x: x_batch,
                    input_y: y_batch,
                    dropout_keep_prob: 0.9
                }
                _, _loss, _accuracy, _summary = sess.run(
                        [train_op, loss, accuracy, summary_op],
                    feed_dict)
                summary_writer.add_summary(_summary, index)

因为论文中，并没有明确的卷积核的尺寸（宽度固定为词向量的维度，高度不确定，对应一次处理词的个数）和个数，我根据实际情况，分别使用高度为2到7的卷积核，对应深度前三个为6层，后三个为5层。这样的选择不一定是最合理的，可以进行更多的尝试。

训练过程中，我们可以通过tensorboard观察每个batch上的准确率和损失，从上图中可以看出，随着训练轮数增加，准确率逐渐提高至80%，损失逐渐降低至1.0左右，而准确率的上升和损失的下降并没有明显减缓的趋势，说明模型还没有充分的训练。

三、总结

至此，关于naive bayes、fast-text和text-cnn三个常用的文本分类器就介绍完了，但是，还有很多内容需要进一步探究。

TODO LIST:

naive bayes使用n-gram效果测评。
fast-text准确率只有80%，能否继续提高，是否受到过多占位符的影响。
text-cnn只用到每条训练数据前30个词，大部分信息被丢弃，是否可以更好的解决。本文只测评了rand，对于static等其余三个模型，而论文中rand的效果是最差的，其余三个模型能提高多少？
进一步剔除停用词，是否对效果有帮助。
如果只使用tf-idf得分靠前的词，是否对效果有帮助？