Text Mining Analysis of E-commerce Review Data

Posted on 2019-08-14 Edited on 2019-08-15

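Background

E-commerce platforms hold massive amounts of unstructured text: product descriptions, user reviews, search queries, customer inquiries, and so on. This text reflects product characteristics and also captures what users need and how they experience the product. Mining it in depth makes it possible to pinpoint where products and services fall short.

User reviews reveal what users care about and what they are dissatisfied with in products and services. For sentiment analysis, reviews can be divided into positive and negative. At a finer granularity, negative reviews can also be classified by business stage, which makes it easier to identify the stage that needs continued improvement.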
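As a rough illustration of the two levels of classification, here is a minimal sketch. It assumes a 1 to 5 star rating and uses the same thresholds as the code later in this post (4 and above is positive, 3 is neutral, below 3 is negative). The stage names and keyword lists are hypothetical and chosen only for the example; real reviews would need Chinese keywords or a trained classifier.

```python
# Hypothetical business stages and keywords, for illustration only.
STAGE_KEYWORDS = {
    "logistics": ["delivery", "courier", "shipping"],
    "product quality": ["broken", "defective", "fake"],
    "customer service": ["refund", "support", "reply"],
}


def sentiment_from_score(score):
    """Map a star rating (assumed 1-5) to a coarse sentiment label."""
    if score >= 4:
        return "positive"
    if score >= 3:
        return "neutral"
    return "negative"


def tag_stages(review):
    """Return every stage whose keyword list matches the review text."""
    text = review.lower()
    return [
        stage
        for stage, words in STAGE_KEYWORDS.items()
        if any(word in text for word in words)
    ]


reviews = [
    (5, "Fast delivery, works great"),
    (1, "Courier was late and the box arrived broken"),
    (2, "No reply from support about my refund"),
]

for score, text in reviews:
    label = sentiment_from_score(score)
    # Only negative reviews are broken down by business stage.
    stages = tag_stages(text) if label == "negative" else []
    print(label, stages)
```

Keyword matching is only a baseline; it is easy to inspect, but it misses anything the keyword lists do not anticipate.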

Data

Word segmentation dictionary:

E-commerce review data:

Code implementation

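'''
Text classification of e-commerce user reviews
'''
import jieba
import gensim
import numpy
import scipy.sparse     # a bare "import scipy" does not reliably expose scipy.sparse
import sklearn.metrics  # likewise for sklearn.metrics, used for the reports below
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB

# Load the stop-word list into a set, so that the membership test below
# matches whole words rather than substrings of one long concatenated string
stop = set()
with open('stopwords.txt','r',encoding='gbk',errors='ignore') as s:
    for line in s:
        line = line.strip()
        if line:
            stop.add(line)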


# Read the reviews and build the features and labels
dataList = [] # features: one list of tokens per review
tagList = [] # labels
Count = 0
with open('1578698_content.txt','r',encoding='utf-8') as fobjRead:
    for row in fobjRead:
        if Count >= 5000:
            break
        score = int(row[2]) # assumes the rating digit is the third character of each line
        if score >= 4:
            flag = 1 # positive
        elif score >= 3:
            flag = 2 # neutral, skipped below
        else:
            flag = 3 # negative
        if flag in [1,3]:
            content = row.strip("\n").split(':#:')[1].replace(' ','')
            # Preprocessing: segment, drop stop words, deduplicate
            wordList = jieba.cut(content, cut_all=False)
            termsAll = list(set([term for term in wordList if term not in stop]))
            dataList.append(termsAll)
            tagList.append(str(flag))
            Count = Count + 1

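# Build bag-of-words feature vectors
wordDict = gensim.corpora.Dictionary(dataList)
corpus = [wordDict.doc2bow(doc) for doc in dataList]

# Convert the feature vectors into a sparse matrix that sklearn can consume
data = []
rows = []
cols = []
line_count = 0
for line in corpus:
    for elem in line:
        rows.append(line_count)
        cols.append(elem[0])
        data.append(elem[1])
    line_count = line_count + 1
# Pass the shape explicitly so that trailing empty documents still get a row
matrix = scipy.sparse.csr_matrix((data,(rows,cols)), shape=(line_count, len(wordDict))).toarray()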
rarray = numpy.random.random(size=line_count)
#print(matrix)
#print(rarray)
'''
[[1 1 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
[0.26456867 0.31671304 0.24109927 ... 0.68329533 0.81079462 0.91965377]
'''

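# Split into training and test sets: a sample whose random draw is below 0.8
# goes to the training set until its class quota (500 each) is full;
# every other sample goes to the test set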
train_set = []
train_tag = []
test_set = []
test_tag = []
totalCount = sum([500,500])
#print(totalCount)
posCount, negCount = [500,500]
posNow, negNow =0, 0
recordCount = 0
for i in range(line_count):
    if rarray[i] < 0.8 and (posNow + negNow) < totalCount:
        if tagList[i] == "1" and posNow < posCount:
            train_set.append(matrix[i,:])
            train_tag.append(tagList[i])
            posNow = posNow + 1
        elif tagList[i] == "3" and negNow < negCount:
            train_set.append(matrix[i,:])
            train_tag.append(tagList[i])
            negNow = negNow + 1
        else:
            test_set.append(matrix[i,:])
            test_tag.append(tagList[i])
    else:
        test_set.append(matrix[i,:])
        test_tag.append(tagList[i])
del matrix
del rarray

print(train_set)
print(train_tag)
print('--------------------------------------------')
print(test_set)
print(test_tag)

# Modeling

# Decision tree
clf = DecisionTreeClassifier()
clf.fit(train_set, train_tag)
clf_predict_test = clf.predict(test_set)
print(sklearn.metrics.classification_report(test_tag,clf_predict_test))
'''
              precision    recall  f1-score   support

           1       0.74      0.52      0.61      2918
           3       0.28      0.50      0.36      1082

    accuracy                           0.52      4000
   macro avg       0.51      0.51      0.48      4000
weighted avg       0.61      0.52      0.54      4000
'''

# Naive Bayes
clf1 = BernoulliNB()
clf1.fit(train_set, train_tag)
clf1_predict_test = clf1.predict(test_set)
print(sklearn.metrics.classification_report(test_tag,clf1_predict_test))
'''
             precision    recall  f1-score   support

           1       0.74      0.36      0.49      2918
           3       0.28      0.65      0.39      1082

    accuracy                           0.44      4000
   macro avg       0.51      0.51      0.44      4000
weighted avg       0.61      0.44      0.46      4000

'''
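The split above caps the training set at 500 samples per class and sends everything else to the test set, which is why the reports show 4000 test samples with a roughly 3:1 class imbalance. A more conventional alternative is a stratified split, which keeps each label's share about the same in both parts. The sketch below uses only the standard library; the function name `stratified_split` and the toy labels are assumptions made for illustration, and sklearn's `train_test_split` with its `stratify` argument offers the same idea ready-made.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each label's share similar in both."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels in the same "1" (positive) / "3" (negative) encoding as above.
labels = ["1"] * 30 + ["3"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] == "3" for i in test_idx), "negative samples in the test set")
```

The returned indices can be used to select rows of the feature matrix and entries of the label list.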

Common NLP Development Toolkits

Posted on 2019-08-09

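1. NumPy

NumPy is an open-source numerical computing package for Python. It includes: 1) a powerful N-dimensional array object (ndarray); 2) a mature library of (broadcasting) functions; 3) tools for integrating C/C++ and Fortran code; 4) practical linear algebra, Fourier transform, and random number generation functions. NumPy is even more convenient when combined with SciPy, whose scipy.sparse module handles sparse matrices.

conda install numpy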
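To make the list above concrete, here is a small sketch touching the array object, broadcasting, linear algebra, and random numbers. The term-count matrix is made up for the example.

```python
import numpy as np

# Made-up term counts: 2 documents x 3 terms.
counts = np.array([[2, 0, 1],
                   [0, 3, 1]], dtype=float)

# Broadcasting: a (2, 3) array divided by a (2, 1) column of row totals.
row_totals = counts.sum(axis=1, keepdims=True)
freqs = counts / row_totals

# Linear algebra: Euclidean norm of each row.
norms = np.linalg.norm(counts, axis=1)

# Random numbers: three uniform draws from a seeded generator.
rng = np.random.default_rng(0)
draws = rng.random(3)

print(freqs)
print(norms)
print(draws.shape)
```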

2. NLTK

NLTK (Natural Language Toolkit) is one of the most widely used Python libraries in NLP.

conda install nltk

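3. Gensim

Gensim is a free Python library with a low memory footprint and a simple interface that can automatically extract semantic topics from documents. It includes many unsupervised learning algorithms, such as TF-IDF, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Hierarchical Dirichlet Processes (HDP).
Gensim also supports models such as Word2Vec and Doc2Vec.

conda install gensim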

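4. TensorFlow

TensorFlow is Google's second-generation machine learning system, developed as the successor to DistBelief. It can be applied across machine learning and deep learning tasks such as speech recognition and image recognition. TensorFlow is an open-source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the edges represent the multidimensional data arrays (tensors) that flow between them. Its flexible architecture lets you run computation on a variety of platforms, such as one or more CPUs (or GPUs) in a desktop, servers, and mobile devices. TensorFlow was originally developed by researchers and engineers on the Google Brain team (part of Google's Machine Intelligence research organization) for machine learning and deep neural network research, but the system is general enough to apply to many other computational domains.

conda install tensorflow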

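5. jieba

jieba ("结巴") is a widely used Chinese word segmentation tool with the following features:
1) Three segmentation modes: accurate mode, full mode, and search engine mode
2) Part-of-speech tagging, plus Tokenize, which returns each word's start and end position in the original text
3) Custom dictionaries can be added
4) Code compatible with both Python 2 and 3
5) Supports multiple languages, and both Simplified and Traditional Chinese
Project page: https://github.com/fxsjy/jieba

pip install jieba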

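6. Stanford NLP

Stanford NLP provides a set of natural language analysis tools. Given text, it can produce the base forms of words and their parts of speech; recognize names of companies, people, and so on; normalize dates, times, and numeric quantities; mark up sentence structure in terms of grammatical form and word dependencies; indicate which names refer to the same entity; indicate sentiment; and extract open relations from the text. Its features:
1) An integrated language analysis toolkit
2) Fast, reliable analysis of arbitrary text
3) High overall quality of text analysis
4) Support for several major languages
5) Easy-to-use interfaces for several programming languages
6) Simple deployment as a web service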

Installation

Installing the Python version of Stanford NLP
• 1) Install the Stanford NLP Python wrapper: pip install stanfordcorenlp
• 2) Download Stanford CoreNLP:
https://stanfordnlp.github.io/CoreNLP/download.html
• 3) Download the Chinese model jar: http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar
• 4) Put the downloaded stanford-chinese-corenlp-2018-02-27-models.jar into the unzipped Stanford CoreNLP folder, and optionally rename that folder to stanfordnlp
• 5) Load the model in Python:
• from stanfordcorenlp import StanfordCoreNLP
• nlp = StanfordCoreNLP(r'path', lang='zh')
For example:
nlp = StanfordCoreNLP(r'/home/kuo/NLP/module/stanfordnlp/', lang='zh')

Test

# -*- coding: utf-8 -*-
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(r'/home/kuo/NLP/module/stanfordnlp/', lang='zh')

# For each non-empty line of news.txt, write its NER tags and POS tags as word/tag pairs
with open('news.txt','r',encoding='utf8') as fin, \
     open('ner.txt','w',encoding='utf8') as fner, \
     open('pos_tag.txt','w',encoding='utf8') as ftag:
    for line in fin:
        line = line.strip()
        if len(line) < 1:
            continue
        fner.write(" ".join([each[0]+"/"+each[1] for each in nlp.ner(line) if len(each)==2])+"\n")
        ftag.write(" ".join([each[0]+"/"+each[1] for each in nlp.pos_tag(line) if len(each)==2])+"\n")
nlp.close()  # shut down the CoreNLP server started by the wrapper
print("okkkkk")

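7. HanLP

HanLP is a Java toolkit made up of a collection of models and algorithms, with the goal of bringing natural language processing into production use. HanLP is feature-complete, efficient, cleanly architected, trained on up-to-date corpora, and customizable. Features: Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, keyword extraction, new word discovery, phrase extraction, automatic summarization, text classification, pinyin conversion, and Simplified/Traditional Chinese conversion.

HanLP environment setup

• 1. Install Java: I installed Java 1.8
• 2. Install JPype:
> conda install -c conda-forge jpype1
> [or] pip install jpype1
• 3. Test whether the installation succeeded: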
from jpype import *
startJVM(getDefaultJVMPath(), "-ea")
java.lang.System.out.println("Hello World")
shutdownJVM()

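HanLP installation

• 1. Download the hanlp.jar package: https://github.com/hankcs/HanLP
• 2. Download data.zip: from https://github.com/hankcs/HanLP/releases, get
http://hanlp.linrunsoft.com/release/data-for-1.7.0.zip and unzip the data package.
• 3. Configuration file
• Sample configuration file: hanlp.properties
• The configuration file tells HanLP where the data package is; only the first line needs changing: root=usr/home/HanLP/
• For example, if the data directory is /Users/hankcs/Documents/data, then root=/Users/hankcs/Documents/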

Test

#-*- coding:utf-8 -*-
from jpype import *

startJVM(getDefaultJVMPath(), "-Djava.class.path=/home/kuo/NLP/module/hanlp/hanlp-1.6.2.jar:/home/kuo/NLP/module/hanlp",
         "-Xms1g",
         "-Xmx1g") # Start the JVM; on Linux the class-path separator is a colon (:) rather than a semicolon (;)

print("=" * 30 + "HanLP segmentation" + "=" * 30)
HanLP = JClass('com.hankcs.hanlp.HanLP')
# Chinese word segmentation
print(HanLP.segment('你好，欢迎在Python中调用HanLP的API'))
print("-" * 70)

The Essence of Clustering

Posted on 2019-08-03 Edited on 2019-08-14

The essence of clustering

TF-IDF term weighting and cosine distance
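The note above only names the two ingredients. One plausible reading is that documents are turned into TF-IDF vectors and then grouped by cosine distance, so that documents sharing distinctive terms end up close together. The sketch below works under that assumption, using only the standard library; the toy documents and the unsmoothed IDF formula log(N / df) are choices made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["phone", "battery", "fast", "charge"],
    ["phone", "battery", "drains", "quickly"],
    ["shirt", "fabric", "soft"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])
sim_02 = cosine_similarity(vecs[0], vecs[2])

# Cosine distance is 1 - similarity: a smaller distance means more similar documents.
print("distance(doc0, doc1) =", 1 - sim_01)
print("distance(doc0, doc2) =", 1 - sim_02)
```

The two phone reviews share terms and are closer to each other than either is to the shirt review, which is the signal a clustering algorithm would group on.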

Hello World

Posted on 2019-07-31

Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment