前言

跟随着之前提到的教程（可以看一看这篇内容：【NLP】学习笔记），我做完了项目一，对于NLP整个概念有了一个非常浅显的理解

跟着作业的进度，我完成了一个最简单的中文分词工具，并对于分词这一项工作有了更深一步的理解，因此决定将这一部分的方法总结出来，因此就有了这篇内容。

一、分词工具简介

分词，是在做自然语言处理工作的第一步，即将给定的输入内容适当分割成为一系列的单词（即词序列）这一过程。

如给定输入：今天的天气非常好，经过分词工具的处理，可能会输出的结果为：[“今天”, “的”, “天气”, “非常”, “好”]，这便是一个分词工具所具有的相关职责。

对于不同语言，分词方式是不同的，就如同在基于自然语言处理技术对书籍句段进行情感分析进而推荐适合的阅读背景音乐中所说：对于汉语、日语等，这类没有明确的词界标记的语言，其对应标记化的过程不像英语那样琐碎，但是需要进行分词。

而分词的时候，这类没有明确的词界标记（即不是使用空格作为分界），因此分割时需要使用其他方法。

在本文中，首先使用给定的综合类中文词库.xlsx作为分词工具所依赖的词典库。

其次，分别采用前项最大匹配原则、前项最小匹配原则、后项最大匹配原则和后向最小匹配原则四种方式，根据给定词典进行分词，根据给定单词的概率，计算出所有分割结果可能性最高的结果作为最终结果进行输出

二、数据准备和下载

代码下载方式：点击这里

此项目需要的数据：

综合类中文词库.xlsx：包含了中文词，当做词典来用
以变量的方式提供了部分unigram概率 word_prob

例如：给定词典=[我们学习人工智能人工智能未来是]，另外我们给定unigram概率：p(我们)=0.25, p(学习)=0.15, p(人工)=0.05, p(智能)=0.1, p(人工智能)=0.2, p(未来)=0.1, p(是)=0.15

三、分词工具编写

1、导入相应的包

1 2	import pandas as pd import numpy as np

2、导入文件作为词典数据

1
2
3

path = "./data/综合类中文词库.xlsx"
data_frame = pd.read_excel(path, header = None)
dic_word_list = data_frame[data_frame.columns[0]].tolist() # 保存所有单词的list作为词典

3、构建单词概率的字典

以下是每一个单词出现的概率。为了问题的简化，我们只列出了一小部分单词的概率。

dic_words =  dic_word_list   # 保存词典库中读取的单词

word_prob = {"北京":0.03,"的":0.08,"天":0.005,"气":0.005,"天气":0.06,"真":0.04,"好":0.05,"真好":0.04,"啊":0.01,"真好啊":0.02, 
             "今":0.01,"今天":0.07,"课程":0.06,"内容":0.06,"有":0.05,"很":0.03,"很有":0.04,"意思":0.06,"有意思":0.005,"课":0.01,
             "程":0.005,"经常":0.08,"意见":0.08,"意":0.01,"见":0.005,"有意见":0.02,"分歧":0.04,"分":0.02, "歧":0.005}

在这里没有出现的的单词但是出现在词典里的，统一把概率设置成为0.00001

比如 p(“学院”)=p(“概率”)=…0.00001

for key in word_prob.keys(): # 将词典中没有的单词添加进词典
    if key not in dic_words:
        dic_words.append(key)
for item in dic_words: # 将没有给定概率的单词写入概率词典，这里使用词典类型存储以加速访问
    word_prob.setdefault(item, 0.00001)

4、编写分词方法

(1).前项最大匹配原则

def split_word_with_dic_front_max(dic=[], input_str=""):
    '''基本描述
    前向最大匹配原则

    详细描述

    Args:
        dic (list): 给定的单词词典
        input_str (str): 输入的待分割内容

    Returns:
        segments: 返回匹配结果的list
    '''
    input_str_tmp = input_str
    segments = []
    while(input_str_tmp!=""):
        tmp_length = 0
        tmp_dic_content = ""
        for word in dic:
            if(word == input_str_tmp[:len(word)]):
                if(tmp_length < len(word)):
                    tmp_dic_content = word
                    tmp_length = len(word)
        segments.append(tmp_dic_content)
        input_str_tmp = input_str_tmp[tmp_length:]
    return segments

def split_word_with_dic_front_max(dic=[], input_str=""):
    '''前项最大分词'''
    input_str_tmp = input_str
    segments = []
    while(input_str_tmp!=""):
        for i in range(len(input_str_tmp), -1, -1):
            word = input_str_tmp[:i]
            if word in dic:
                segments.append(word)
                input_str_tmp = input_str_tmp[len(word):]
                break
    return segments

(2).前项最小匹配原则

def split_word_with_dic_front_min(dic=[], input_str=""):
    '''基本描述
    前向最小匹配原则

    详细描述

    Args:
        dic (list): 给定的单词词典
        input_str (str): 输入的待分割内容

    Returns:
        segments: 返回匹配结果的list
    '''
    input_str_tmp = input_str # 做临时变量用于切分
    segments = [] # 存储分割结果用于返回
    while(input_str_tmp!=""):
        tmp_length = len(input_str)
        tmp_dic_content = "" # 临时变量存储分割出的单词
        for word in dic: # 遍历字典
            if(word == input_str_tmp[:len(word)]):
                if(tmp_length > len(word)): # 查找最长的匹配单词
                    tmp_dic_content = word
                    tmp_length = len(word)
        segments.append(tmp_dic_content)
        input_str_tmp = input_str_tmp[tmp_length:]
    return segments

def split_word_with_dic_front_min(dic=[], input_str=""):
    '''前项最小分词'''
    input_str_tmp = input_str
    segments = []
    while(input_str_tmp!=""):
        for i in range(len(input_str_tmp)):
            word = input_str_tmp[:i] if len(input_str_tmp) != 1 else input_str_tmp[0] # 防止为空时找不到索引而循环
            if word in dic:
                segments.append(word)
                input_str_tmp = input_str_tmp[len(word):]
                break
    return segments

(3).后项最大匹配原则

def split_word_with_dic_back_max(dic=[], input_str=""):
    '''基本描述
    后向最大匹配原则

    详细描述

    Args:
        dic (list): 给定的单词词典
        input_str (str): 输入的待分割内容

    Returns:
        segments[::-1]: 返回匹配结果的list
    '''
    input_str_tmp = input_str
    segments = []
    while(input_str_tmp!=""):
        tmp_length = 0
        tmp_dic_content = ""
        for word in dic:
            if(word == input_str_tmp[-len(word):]):
                if(tmp_length < len(word)):
                    tmp_dic_content = word
                    tmp_length = len(word)
        segments.append(tmp_dic_content)
        input_str_tmp = input_str_tmp[:-tmp_length]
    return segments[::-1]

def split_word_with_dic_back_max(dic=[], input_str=""):
    '''后项最大分词'''
    input_str_tmp = input_str
    segments = []
    while(input_str_tmp!=""):
        for i in range(len(input_str_tmp)):
            word = input_str_tmp[i:]
            if word in dic:
                segments.append(word)
                input_str_tmp = input_str_tmp[:-len(word)]
                break
    return segments[::-1]

(4).后项最小匹配原则

def split_word_with_dic_back_min(dic=[], input_str=""):
    '''基本描述
    后向最小匹配原则

    详细描述

    Args:
        dic (list): 给定的单词词典
        input_str (str): 输入的待分割内容

    Returns:
        segments[::-1]: 返回匹配结果的list
    '''
    input_str_tmp = input_str
    segments = []
    while(input_str_tmp!=""):
        tmp_length = len(input_str)
        tmp_dic_content = ""
        for word in dic:
            if(word == input_str_tmp[-len(word):]):
                if(tmp_length > len(word)):
                    tmp_dic_content = word
                    tmp_length = len(word)
        segments.append(tmp_dic_content)
        input_str_tmp = input_str_tmp[:-tmp_length]
    return segments[::-1]

(5).调用分词方法

def split_word(dic=[], input=""):
    '''基本描述
    后向最小匹配原则

    详细描述

    Args:
        dic (list): 给定的单词词典
        input_str (str): 输入的待分割内容

    Returns:
        tmp_result: 存储着分词结果的list
    '''
    tmp_result = []
    tmp_result.append(split_word_with_dic_front_max(dic, input)) # 调用前项最大分词
    tmp_result.append(split_word_with_dic_back_max(dic, input)) # 调用前项最小分词
    tmp_result.append(split_word_with_dic_front_min(dic, input)) # 调用后项最大分词
    tmp_result.append(split_word_with_dic_back_min(dic, input)) # 调用后项最小分词
    return tmp_result

def split_word_with_dic_back_min(dic=[], input_str=""):
    '''后项最小分词'''
    input_str_tmp = input_str
    segments = []
    while(input_str_tmp!=""):
        for i in range(len(input_str_tmp), -1, -1):
            word = input_str_tmp[i:]
            if word in dic:
                segments.append(word)
                input_str_tmp = input_str_tmp[:-len(word)]
                break
    return segments[::-1]

5、构建计算分词概率的方法

下边概率使用的是负的log值，目的是降低计算时结果溢出的可能性

def get_split_probability_use_segment(split_word_segment=[]):
    '''
    递归计算分词结果的概率
    '''
    sum_result = 0
    for seg in split_word_segment:
        sum_result -= np.log(word_prob.get(seg))
    return sum_result
def get_split_probability(split_word_segments=[[], ]):
    '''
    基本描述
    根据传入的分词结果计算出概率最高的分词结果并返回

    详细描述

    Args:
        split_word_segments (list): 分词结果

    Returns:
        split_word_segments[max_index]: 最优的分词结果
        sum: 最优分词结果的分数
    '''
    index = 0
    max_index = 0
    sum = get_split_probability_use_segment(split_word_segments[0])
    for segment in split_word_segments:
        tmp = get_split_probability_use_segment(segment)
        if(sum>tmp): 
            sum = tmp
            max_index = index
        index += 1
    return split_word_segments[max_index], sum

6、整合各个方法以供调用

#  分数（10）
## TODO 请编写word_segment_naive函数来实现对输入字符串的分词
def word_segment_naive(input_str):
    """
    input_str: 输入字符串   输入格式：“今天天气好”
    best_segment: 最好的分词结果  输出格式：["今天"，"天气"，"好"]
    """

    # 第一步： 计算所有可能的分词结果，要保证每个分完的词存在于词典里，这个结果有可能会非常多。 
    segments = split_word(dic_words, input_str)  # 存储所有分词的结果。如果次字符串不可能被完全切分，则返回空列表(list)
                   # 格式为：segments = [["今天"，“天气”，“好”],["今天"，“天“，”气”，“好”],["今“，”天"，“天气”，“好”],...]
    
    #第二步：循环所有的分词结果，并计算出概率最高的分词结果，并返回
    best_segment, best_score = get_split_probability(segments)
    
    return best_segment

四、代码测试

1
2
3

print(word_segment_naive("北京的天气真好啊"))
print(word_segment_naive("今天的课程内容很有意思"))
print(word_segment_naive("经常有意见分歧"))

根据上述测试的输出结果，可以得到输出结果

1
2
3

['北京', '的', '天气', '真好啊']
['今天', '的', '课程', '内容', '很有', '意思']
['经常', '有意见', '分歧']

五、代码优化

上述代码中，虽然可以解决分词的问题，但是在处理的过程中，分为两个步骤，首先是进行分词，根据分词的结果计算每种结果的可能性，这样无疑是增大了计算的资源开销，有没有一种方式是可以一步到位的呢？

因此优化从这一点着手，将两步变成一步。

优化时，我们使用著名的一种动态规划算法——维特比算法（viterbi）

首先，我们需要有一个词典库，并且给定词典中每个词出现的概率，根据概率计算其对应的负log值。

然后，根据上述词典和概率，以及给定的输入句子建立一个带权有向图（DirectedGraph）。

最后，在构建好的带权有向图中找到一条权重最小的路径，即为最好的单词划分方法。

1、导入相应的包

1 2	import pandas as pd import numpy as np

2、导入文件作为词典数据

1
2
3

path = "./data/综合类中文词库.xlsx"
data_frame = pd.read_excel(path, header = None)
dic_word_list = data_frame[data_frame.columns[0]].tolist() # 保存所有单词的list作为词典

3、构建单词概率的字典

以下是每一个单词出现的概率。为了问题的简化，我们只列出了一小部分单词的概率。

dic_words =  dic_word_list   # 保存词典库中读取的单词

word_prob = {"北京":0.03,"的":0.08,"天":0.005,"气":0.005,"天气":0.06,"真":0.04,"好":0.05,"真好":0.04,"啊":0.01,"真好啊":0.02, 
             "今":0.01,"今天":0.07,"课程":0.06,"内容":0.06,"有":0.05,"很":0.03,"很有":0.04,"意思":0.06,"有意思":0.005,"课":0.01,
             "程":0.005,"经常":0.08,"意见":0.08,"意":0.01,"见":0.005,"有意见":0.02,"分歧":0.04,"分":0.02, "歧":0.005}

在这里没有出现的的单词但是出现在词典里的，统一把概率设置成为0.00001

比如 p(“学院”)=p(“概率”)=…0.00001

for key in word_prob.keys(): # 将词典中没有的单词添加进词典
    if key not in dic_words:
        dic_words.append(key)
for item in dic_words: # 将没有给定概率的单词写入概率词典，这里使用词典类型存储以加速访问
    word_prob.setdefault(item, 0.00001)

4、编写分词方法

例如待分割的句子为：经常有意见分歧

根据上述描述方法，可以构建如图的带权有向图

(1).构建存储带权有向图的方法

该带权有向图可以使用邻接表或者逆邻接表的方式进行存储，这里提供两种方式，但是对于后续计算过程的使用方便，决定采取逆邻接表的方式。

def set_graph_set_log(input_str, word_prob = word_prob):
    '''
    邻接表
    未计算log值，找到后计算对应log值
    '''
    str_len = len(input_str)
    master_list = [{} for _ in range(str_len)] # 初始化含有定点的主列表
    window = 1 # 初始化窗口大小
    start_position = 0 # 窗口起始位置
    while(window <= str_len):
        end_position = start_position + window - 1 # 窗口结束为止
        split_str = input_str[start_position:end_position] # 当前查找的字符串
        log_value = word_prob.get(split_str)# 找到当前字符串对应的log值
        if(None != log_value): # 值存在则插入字典
            print(split_str)
            master_list[start_position][end_position + 1] = -np.log(log_value)
        if(str_len > end_position): start_position += 1# 向后一位进行检测
        else:
            window += 1 # 增大窗口大小
            start_position = 0
        
    return master_list

def set_graph_back_set_log(input_str, word_prob = word_prob):
    '''
    逆邻接表
    未计算log值，找到后计算对应log值
    '''
    str_len = len(input_str)
    master_list = [{} for _ in range(str_len + 1)] # 初始化含有定点的主列表
    window = 1 # 初始化窗口大小
    start_position = 0 # 窗口起始位置
    while(window <= str_len):
        end_position = start_position + window - 1 # 窗口结束为止
        split_str = input_str[start_position:end_position] # 当前查找的字符串
        log_value = word_prob.get(split_str)# 找到当前字符串对应的log值
        if(None != log_value): # 值存在则插入字典
            master_list[end_position][start_position] = -np.log(log_value)
        if(str_len > end_position): start_position += 1# 向后一位进行检测
        else:
            window += 1 # 增大窗口大小
            start_position = 0
        
    return master_list

(2).构建根据邻接表求最短路径的方法

def get_total_weight(list_input, index):
    """
    递归求出当前节点对应的前项最小值
    """
    if(list_input[index][0] == 0 and list_input[index][1] == 1): return 0
    else:
#         print(list_input[index][1])
        return list_input[index][0] + get_total_weight(list_input, list_input[index][1])
def calc_min_weight_path_back(master_list):
    '''
    逆邻接表求最短路径
    '''
    len_master_list = len(master_list)
#     tmp_min_list = [[None, None]] * len_master_list # 初始化给定长度的数组保存最短路径
    
    tmp_min_list = [[None] * 2 for row in range(len_master_list)] # 初始化给定长度的数组保存最短路径
    tmp_min_list[0] = [0, 1] # 初始化第一个节点默认代表 最小值、来源
    for index in range(1, len_master_list):
        for key, value in master_list[index].items():
            if(tmp_min_list[index][0] == None):
                tmp_min_list[index][0] = value
                tmp_min_list[index][1] = key
            else:
                tmp_sum1 = get_total_weight(tmp_min_list, index) # 原来的结果
                tmp_sum2 = get_total_weight(tmp_min_list, tmp_min_list[key][1]) + value # 待插入的结果
                if(tmp_sum2 < tmp_sum1): # 新插入的值更小
                    tmp_min_list[index][0] = value
                    tmp_min_list[index][1] = key
    
    return tmp_min_list

(3).根据求得最短路径切分

def get_split_by_viterbi(min_path_list, input_str):
    min_path = [ i[1] for i in min_path_list]
    tmp_list =[]
    last_position = len(input_str)    
    position = min_path[-1]
    while(position != 0):
        tmp_list.append(input_str[position:last_position])
        last_position = position
        position = min_path[position]
    tmp_list.append(input_str[position:last_position])
    return tmp_list[::-1]

(4).编写调用入口方法

def word_segment_viterbi(input_str):
    """
    1. 基于输入字符串，词典，以及给定的unigram概率来创建DAG(有向图）。
    2. 编写维特比算法来寻找最优的PATH
    3. 返回分词结果
    
    input_str: 输入字符串   输入格式：“今天天气好”
    best_segment: 最好的分词结果  输出格式：["今天"，"天气"，"好"]
    """
    
    # 第一步：根据词典，输入的句子，以及给定的unigram概率来创建带权重的有向图（Directed Graph）
    graph = set_graph_back_set_log(input_str)
    
    # 第二步： 利用维特比算法来找出最好的PATH， 这个PATH是P(sentence)最大或者 -log P(sentence)最小的PATH。
    min_path_list = calc_min_weight_path_back(graph)
    
    # 第三步： 根据最好的PATH, 返回最好的切分
    best_segment = get_split_by_viterbi(min_path_list, input_str)

    return best_segment

5、代码测试

1
2
3

print (word_segment_viterbi("北京的天气真好啊"))
print (word_segment_viterbi("今天的课程内容很有意思"))
print (word_segment_viterbi("经常有意见分歧"))

根据上述测试的输出结果，可以得到输出结果

1
2
3

['北京', '的', '天气', '真好啊']
['今天', '的', '课程', '内容', '很', '有意思']
['经常', '有意见', '分歧']

六、对比结果

['北京', '的', '天气', '真好啊']
- ['今天', '的', '课程', '内容', '很有', '意思']
+ ['今天', '的', '课程', '内容', '很', '有意思']
['经常', '有意见', '分歧']

七、参考链接

【邻接表的python实现】: https://www.cnblogs.com/chay/articles/10259422.html

【数据结构：图的存储结构之邻接表】: https://blog.csdn.net/jnu_simba/article/details/8866844

八、注释

本文涉及到的所有数据及和代码均可在NLPLearning-CNWordSegmentation 这个仓库中下载到

【NLP】从零开始搭建一个中文分词工具

前言

一、分词工具简介

二、数据准备和下载

三、分词工具编写

1、导入相应的包

2、导入文件作为词典数据

3、构建单词概率的字典

4、编写分词方法

(1).前项最大匹配原则

(2).前项最小匹配原则

(3).后项最大匹配原则

(4).后项最小匹配原则

(5).调用分词方法

5、构建计算分词概率的方法

6、整合各个方法以供调用

四、代码测试

五、代码优化

1、导入相应的包

2、导入文件作为词典数据

3、构建单词概率的字典

4、编写分词方法

(1).构建存储带权有向图的方法

(2).构建根据邻接表求最短路径的方法

(3).根据求得最短路径切分

(4).编写调用入口方法

5、代码测试

六、对比结果

七、参考链接

八、注释