caomaocao's Home
3f3f773baac43e4d4040876c0f849f3dbcff71df-www.caomaocao.com
2018-03-08T12:24:00Z
A Modern Python Development Environment Setup (pyenv + pipenv)
2018-03-08T12:24:00Z
python-kai-fa-huan-jing-pei-zhi-sheng-chan-huan-jing-fa-bu-liu-cheng
caomaocao's Home
<h2 id="toc_0" class="h16">Introduction</h2>
<p class="md_block">
<span class="md_line md_line_start md_line_end">pyenv + pipenv is currently a popular combination for setting up Python project development environments. It supports switching between multiple Python versions, builds per-project virtual environments, and is convenient to use.</span>
</p>
<h4 id="toc_1" class="h16">What is pyenv?</h4>
<p class="md_block">
<span class="md_line md_line_start md_line_end">pyenv is a Python version manager. It can change the global Python version, install multiple Python versions side by side, set per-directory Python versions, and create and manage virtual python environments. Everything happens at the user level; no sudo is required.</span>
</p>
<h4 id="toc_2" class="h16">What is pipenv?</h4>
<p class="md_block md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_start">Pipenv is a dependency manager for Python projects. While pip can install Python packages, Pipenv is still the recommended tool, because it is a higher-level tool that simplifies the common cases of dependency management.<br /></span>
<span class="md_line md_line_end">Its main features include:</span>
</p>
<ol>
<li class="md_li"><span>Automatically finds the project root by locating the Pipfile.
</span></li>
<li class="md_li"><span>Automatically generates a Pipfile and Pipfile.lock if they do not exist.
</span></li>
<li class="md_li"><span>Automatically creates a virtualenv in the project's .venv directory. (This location can of course be changed by setting WORKON_HOME.)
</span></li>
<li class="md_li"><span>Automatically updates the Pipfile as packages are installed and removed.
</span></li>
<li class="md_li"><span>Automatically updates pip.
</span></li>
</ol>
<p class="md_block">
<span class="md_line md_line_start md_line_end">There are still some pitfalls when using them together; these are covered in the setup walkthrough below.</span>
</p>
<h2 id="toc_3" class="h16">Development Environment Setup</h2><h3 id="toc_4" class="h16">1. Install git</h3>
<p class="md_block">
<span class="md_line md_line_start md_line_end">On a freshly provisioned machine, even git is missing...</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="c1"># requires root privileges</span>
<span class="n">yum</span> <span class="n">update</span>
<span class="n">yum</span> <span class="n">install</span> <span class="n">git</span>
</pre></div>
<!--block_code_end--><h3 id="toc_5" class="h16">2. Install pyenv</h3>
<p class="md_block">
<span class="md_line md_line_start md_line_end">Switch to your home directory:</span>
</p>
<pre><code>cd ~
git clone https://github.com/pyenv/pyenv.git .pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
exec $SHELL</code></pre>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start">If your shell is bash, append these lines to <code>~/.bashrc</code>; for zsh, use <code>~/.zshrc</code> instead.<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/j9i8p51v2da03uknda8cta9x/image.png" alt="image.png-62.4kB" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">For detailed usage, see the <a class="md_compiled" href="https://github.com/pyenv/pyenv">official pyenv documentation</a></span>
</p>
<h3 id="toc_6" class="h16">3. Install Python</h3>
<pre><code>pyenv install 3.6.3 -v</code></pre>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start">An error occurs:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/bv07gqwbfr5gebxn50yo5b3g/image.png" alt="image.png-26.9kB" title="" ><br /></span>
<span class="md_line img_before only_img_before">Generally, an SSL error on the connection like this is a situation peculiar to networks in mainland China; you know why.<br /></span>
<span class="md_line md_line_end">pyenv supports installing Python offline: the log contains the source URL for the Python tarball. Download it manually into the $PYENV_ROOT/cache directory, creating cache yourself if it does not exist.</span>
</p>
<pre><code>cd $PYENV_ROOT
mkdir cache
wget https://www.python.org/ftp/python/3.6.3/Python-3.6.3.tgz
# or: scp the local Python tarball to the server
pyenv install 3.6.3 -v</code></pre>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">Still an error. The log shows the offline path was never taken: pyenv is still trying to download Python from python.org, so something must be off.</span>
</p>
<pre><code>vi /home/app/.pyenv/plugins/python-build/share/python-build</code></pre>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/247k5zeph8aoay5nxqysx8lf/image.png" alt="image.png-51.9kB" title="" ><br /></span>
<span class="md_line md_line_dom_embed md_line_with_image img_before only_img_before"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/p7kyy5i63qj6frrupu23p47l/image.png" alt="image.png-15.9kB" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">Note: the log points at the .tgz download, yet then reports that the .tar.gz failed to download; python-build looks for the .tar.gz name in the cache.</span>
</p>
<pre><code>cd $PYENV_ROOT/cache
mv Python-3.6.3.tgz Python-3.6.3.tar.gz # rename</code></pre>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start">Do not decompress and re-compress the tarball: the md5 checksum would change and the install would fail.<br /></span>
<span class="md_line md_line_end">Install again:</span>
</p>
<pre><code>pyenv install 3.6.3 -v
pyenv global 3.6.3
pyenv rehash</code></pre>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">Python is installed successfully. By the same procedure, you can install several Python versions on the machine and switch between them.</span>
</p>
<h3 id="toc_7" class="h16">4. Install pipenv</h3>
<pre><code>pip install pipenv --user
export PATH="$HOME/.local/bin:$PATH"</code></pre>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start">If pip installs very slowly, that again is a local network issue; switch to a domestic pip mirror.<br /></span>
<span class="md_line">Now pipenv is ready to use:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/g7t8ezcgcp214iez7lnwv4ds/image.png" alt="image.png-164.3kB" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">For detailed usage, see the <a class="md_compiled" href="https://github.com/pypa/pipenv">official pipenv documentation</a></span>
</p>
<h3 id="toc_8" class="h16">5. Create a virtual environment</h3>
<pre><code>mkdir test_project
cd test_project
pipenv install --python 3.6.3
pipenv shell</code></pre>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/xpp7efv6g3s6nfauifi88vo6/image.png" alt="image.png-91.6kB" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">As shown above, specifying Python 3.6.3 after the pipenv command makes it automatically find the interpreter we installed with pyenv and link to it. If the requested version is not installed, pipenv reports an error, and we go install that version with pyenv first. Use pipenv shell to enter the virtual environment, analogous to Anaconda's source activate environment and virtualenv's source bin/activate</span>
</p>
<h3 id="toc_9" class="h16">6. Change the pip source</h3>
<p class="md_block">
<span class="md_line md_line_start">Two files appear in the project root: Pipfile and Pipfile.lock. Do not edit the latter; it is updated automatically.<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/gcnic39bz4fka3sx8gsqum6b/image.png" alt="image.png-15.6kB" title="" ><br /></span>
<span class="md_line img_before only_img_before">Replace the url with the Aliyun mirror <code>"https://mirrors.aliyun.com/pypi/simple"</code>, or whichever mirror you prefer.<br /></span>
<span class="md_line md_line_end">After that we can use <code>pipenv install *</code> to install the packages this project needs</span>
</p>
<h3 id="toc_10" class="h16">7. PyCharm integration</h3>
<p class="md_block">
<span class="md_line md_line_start">During development we use PyCharm. Since version 2018.1 it supports pipenv out of the box: opening the project locates the Pipfile automatically and then indexes the third-party packages, which is quite convenient. The configuration is shown below<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/p7pg48mzr0yosf03yte4fwm5/image.png" alt="image.png-38.9kB" title="" ><br /></span>
<span class="md_line md_line_dom_embed md_line_with_image img_before only_img_before md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/ka1b8wrhq25mevq8v0jdderx/image.png" alt="image.png-51.9kB" title="" ></span>
</p>
<h2 id="toc_11" class="h16">Production Environment Setup</h2><h3 id="toc_12" class="h16">1. Release workflow</h3><h3 id="toc_13" class="h16">2. Package installation</h3><h3 id="toc_14" class="h16">3. Cluster-wide installation</h3>
Applications of Topic Models in Product Recommendation
2017-04-24T14:27:00Z
topic_modelzai-shang-pin-tui-jian-ling-yu-de-ying-yong
caomaocao's Home
<p class="md_block">
<span class="md_line md_line_start">Train a topic model on the text corpus of products within the same category; yes, it is just LDA. Text preprocessing and training are skipped here, since there is plenty of material online (we mainly use Python's Gensim package), and for the math and intuition behind LDA see the wiki or other references, which surely explain it better than I could. This post is mainly about how we apply the topic-model technique LDA at my company. After training we end up with three tables.<br /></span>
<span class="md_line">Table 1: the word-to-topic assignments of each product's text<br /></span>
<span class="md_line">spid: word1:topic_id, word2:topic_id, ..., wordn:topic_id<br /></span>
<span class="md_line">You can picture it as a sparse matrix keyed by spid.<br /></span>
<span class="md_line">Table 2: the product-to-topic probability table<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/ljzjp5lrkuk0qznkr68yy95g/QQ20180709-235449.png" alt="QQ20180709-235449.png-18.2kB" title="" ><br /></span>
<span class="md_line img_before only_img_before">Table 3: the topic-to-word table<br /></span>
<span class="md_line">The n in wordn is the vocabulary size; each row shows the probability of every word under that topic<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/f82r7s0oqqnks6niiw5ivrl5/QQ20180709-235459.png" alt="QQ20180709-235459.png-20.1kB" title="" ></span>
</p>
<h2 id="toc_0" class="h16">1. Finding similar products</h2>
<p class="md_block md_block_as_opening md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_start md_line_end">There are many ways to build low-dimensional text representations: the earlier LSA, and in the last five or six years word2vec, GloVe, and other embeddings that rose with deep learning. Table 2 above is itself a low-dimensional representation of the product text, and with it we can compute product-to-product similarity. The computation is much the same across text vector representations: cosine distance, KL divergence, edit distance, and so on. Our procedure for product-to-product similarity on the topic model:</span>
</p>
<ol>
<li class="md_li"><span>Read Table 2.
</span></li>
<li class="md_li"><span>For each product's topic distribution, compute the distance between its topic vector and every other product's:
<p class="md_block">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/rylvt8tsgjiccr2spkycinw9/image.png" alt="image.png-7.6kB" title="" ></span>
</p>
</span></li>
<li class="md_li"><span>Take the topN most similar products as similar items, which can then be used for recommendation.
</span></li>
</ol>
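The three steps above can be sketched in a few lines of NumPy. This is an illustrative sketch only, assuming Table 2 has been loaded into a plain dict mapping spid to its topic probability vector; the names `item_topics` and `topn_similar` are mine, not from our production code.

```python
import numpy as np

def topn_similar(item_topics, spid, n=8):
    """Rank all other items by cosine similarity of their topic
    vectors (Table 2) to the given item's vector."""
    target = np.asarray(item_topics[spid], dtype=float)
    scores = []
    for other, vec in item_topics.items():
        if other == spid:
            continue
        v = np.asarray(vec, dtype=float)
        sim = float(target @ v / (np.linalg.norm(target) * np.linalg.norm(v)))
        scores.append((other, sim))
    # highest similarity first; keep the topN as candidate recommendations
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]
```

Any of the other distances mentioned above (KL divergence, for instance) would slot into the same loop.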
<p class="md_block">
<span class="md_line md_line_start">Example:<br /></span>
<span class="md_line">spid:530819928411 title: "可步茶叶 2015限量版冰岛百年古树普洱茶生茶 一提(七饼)500元" <br /></span>
<span class="md_line"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/kq3k4gcshwcfcke8qbehq9c4/image.png" alt="image.png-44.3kB" title="" ><br /></span>
<span class="md_line">The example uses only product titles, just to get results quickly, and the topic count is set low, at only 20. Short text is actually not a great fit for LDA, and choosing the number of topics is an art of its own; more on that another time.<br /></span>
<span class="md_line"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/m0fujexptgw1oxj8gnrlvzy1/image_1asjq8ucp1h8l7c1b41keu63876.png" alt="image" title="" ><br /></span>
<span class="md_line md_line_end">These are the top 8 similar products for this item; from a recommender-system point of view, the results look quite good judging by the titles alone.</span>
</p>
<h2 id="toc_1" class="h16">2. Tagging products</h2>
<ol>
<li class="md_li"><span>Input a product's category and its tokenized text.
</span></li>
<li class="md_li"><span>Read the category's Table 2 and get the product's highest-probability topic.
</span></li>
<li class="md_li"><span>Read the category's Table 3 and get each word's probability weight under that topic.
</span></li>
<li class="md_li"><span>Take the topN words of that topic as the product's tags.
</span></li>
</ol>
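A minimal sketch of steps 1-4, assuming Tables 2 and 3 are loaded into plain dicts; the function and argument names here are illustrative:

```python
def tag_item(item_topics, topic_words, spid, n=3):
    """Tag an item with the topN words of its dominant topic.
    item_topics: spid -> topic probability vector (Table 2).
    topic_words: topic_id -> {word: probability} (Table 3)."""
    vec = item_topics[spid]
    best_topic = max(range(len(vec)), key=lambda t: vec[t])  # step 2
    words = topic_words[best_topic]                          # step 3
    ranked = sorted(words, key=words.get, reverse=True)      # step 4
    return ranked[:n]
```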
<p class="md_block">
<span class="md_line md_line_start">Example:<br /></span>
<span class="md_line">Reusing the example from section 1: spid:530819928411 title: "可步茶叶 2015限量版冰岛百年古树普洱茶生茶 一提(七饼)500元" <br /></span>
<span class="md_line md_line_end">Topic 11 carries the largest weight for this product, and within that topic the three words "生茶" (raw tea), "古树" (ancient tree), and "普洱茶" (pu-erh tea) have the largest probability weights, so the product gets these three tags.</span>
</p>
<h2 id="toc_2" class="h16">3. LDA and LR recommendation</h2>
<p class="md_block md_block_as_opening md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_start">Logistic regression is cheap to compute and highly interpretable, and it is a mature tool in online advertising CTR prediction. For product recommendation, we take the LDA representation of the product text as features and feed them into LR. A concrete example:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/3dubcjkf6ynzre17yvlouzz1/image.png" alt="image.png-696kB" title="" ></span>
</p>
<ol>
<li class="md_li"><span>The products are the top two rows of a Taobao dress-category list page, mixing sponsored (直通车) and organic search results. Assume all 8 slots influence the user equally (which of course is not true....). From the click log, only product 4 was clicked:
</span></li>
</ol>
<p class="md_block md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/2uscj3o6wi94bqw5ysrynsck/image.png" alt="image.png-61.2kB" title="" ></span>
</p>
<ol>
<li class="md_li"><span>Read the category's Table 2 to get the topic vectors of these 8 products:
</span></li>
</ol>
<p class="md_block md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/6lsmd31w72n70bb1t5nydiul/image.png" alt="image.png-82.2kB" title="" ></span>
</p>
<ol>
<li class="md_li"><span>The 0/1 click outcomes from step 1 are exactly the labels Y for LR training, and the products' topic-vector representations are X. Throw them into LR and we get each topic's weight on whether a product gets clicked: <code>title_score = w1 * topic_1 + w2 * topic_2 + w3 * topic_3 +...+ w20*topic_20</code>
</span></li>
<li class="md_li"><span>Score the product pool with the formula above and recommend the topN to the user.
</span></li>
</ol>
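Steps 3 and 4 can be sketched with a hand-rolled gradient-ascent logistic regression in NumPy. In practice you would use a proper LR implementation; everything here, names included, is illustrative.

```python
import numpy as np

def fit_lr(X, y, lr=0.5, steps=2000):
    """Fit logistic-regression weights by plain gradient ascent on the
    log-likelihood. X holds one topic vector per impression, y the 0/1 clicks."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))  # predicted click probability
        w += lr * X.T @ (y - p) / len(y)  # average gradient
    return w

def rank_candidates(w, candidates):
    """Score each candidate's topic vector as title_score = w . topics
    and return the spids sorted best-first."""
    scored = {spid: float(np.asarray(vec) @ w) for spid, vec in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)
```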
<p class="md_block md_block_as_opening md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_start md_line_end">Since we have the click log, it is natural to tag users as well, because we have the topic vectors of the products they clicked! The procedure:</span>
</p>
<ol>
<li class="md_li"><span>Collect the user's click log and rank products by the user's click-through rate on them.
</span></li>
<li class="md_li"><span>Via the category's Table 2, take the top-3 topics of each product the user clicked and accumulate their probabilities. Here we borrow the Hacker News decay function <code>Score = (P-1) / (T+2)^G</code>, where P is the clicks, T is the age in days, and G is 1.8, copied straight from HN. The accumulated topic probabilities decay by this formula.
</span></li>
<li class="md_li"><span>When the user visits the site again, take the topics they are most interested in from their profile, then hit the inverted index of Table 2 to fetch popular products to recommend.
</span></li>
</ol>
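The decay in step 2 is just the Hacker News formula applied to the accumulated topic weight (G = 1.8 copied straight from HN; the function name is mine):

```python
def decayed_score(points, age_days, gravity=1.8):
    """Hacker News-style decay: Score = (P - 1) / (T + 2)^G.
    `points` is the accumulated click/topic weight, `age_days`
    the age of the click in days."""
    return (points - 1) / (age_days + 2) ** gravity
```

Older clicks contribute less and less, so the profile drifts toward the user's recent interests.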
<h2 id="toc_3" class="h16">4. Mining category featured words</h2>
<p class="md_block md_block_as_opening md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_start md_line_end">Given the text corpus of a category's products, we want to know which words matter, which best represent topic n; bidding on those words in sponsored search (直通车) should perform well. So look at the category's Table 3: transpose it into an inverted index and we can see which word contributes most to each topic. Those words we treat as the category's featured words. The procedure:</span>
</p>
<ol>
<li class="md_li"><span>Read the category's Table 3 and transpose it to get word -> topic relations, representing each word as a vector <code>wc = [wc0, wc1, wc2,...,wcn]</code>, where wc1 is word w's probability under topic_1.
</span></li>
<li class="md_li"><span>Compute each word's average probability across all topics, which acts as a probability "white noise" baseline: <code>wu = [1/n,1/n,...,1/n]</code>, where n is the number of topics.
</span></li>
<li class="md_li"><span>Iterate over the category vocabulary, compute the difference between wc and wu, and sort it in descending order; the topN words are the category's featured words.
</span></li>
</ol>
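A sketch of the featured-word computation. The write-up above leaves open exactly how to reduce the wc - wu difference vector to one score, so scoring each word by its largest deviation from the uniform baseline is my assumption here, as are the names:

```python
import numpy as np

def featured_words(word_topic, topn=10):
    """word_topic: word -> [p(w|topic_0), ..., p(w|topic_k-1)],
    i.e. the transposed Table 3. Compare each word's vector wc with
    the uniform 'white noise' vector wu = [1/n, ..., 1/n] and keep
    the words deviating from it the most."""
    scores = {}
    for word, probs in word_topic.items():
        wc = np.asarray(probs, dtype=float)
        wu = np.full_like(wc, 1.0 / len(wc))
        scores[word] = float(np.max(wc - wu))  # one possible reduction
    return sorted(scores, key=scores.get, reverse=True)[:topn]
```

A word spread evenly across topics scores near zero, while a word concentrated in one topic scores high, which matches the intuition in the text.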
<h2 id="toc_4" class="h16">5. Scoring product text quality</h2>
<p class="md_block md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_start md_line_end">Note: this algorithm mostly follows other people's write-ups. Given the compute capacity of our machines and the implementation complexity, we never put it in production, but the approach is well worth recording; maybe it will go live some day.</span>
</p>
<ol>
<li class="md_li"><span>Multiply the topic's probability in the document by the word's probability under that topic.
</span></li>
<li class="md_li"><span>Compute the average topic probability vector over all product texts in the category,
</span></li>
</ol>
<p class="md_block">
<span class="md_line md_line_start">References:<br /></span>
<span class="md_line md_line_dom_embed"><a class="md_compiled" href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">LDA wiki</a><br /></span>
<span class="md_line md_line_dom_embed"><a class="md_compiled" href="http://www.flickering.cn/tag/lda/">Tencent's 火光摇曳 (Flickering) blog</a><br /></span>
<span class="md_line md_line_dom_embed md_line_end"><a class="md_compiled" href="https://www.epubit.com/book/detail/23066;jsessionid=67C9919DC7146DC36479647147750F03">LDA漫游指南 (A Wanderer's Guide to LDA)</a></span>
</p>
A Preprocessing Template for Text Classification Tasks
2017-04-04T02:21:00Z
test
caomaocao's Home
<h1 id="toc_0" class="h16">Text Classification Template</h1>
<p class="md_block md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_start">Text classification is a very common NLP task. Sentiment analysis of public opinion is a binary case, while category prediction for e-commerce text or news articles is multi-class. There are many algorithms, from the statistics-based Naive Bayes, to linear discriminative models such as LR, to the nonlinear SVM; what is used most at the moment, and what effectively reduces feature engineering, is the deep neural network (DNN).<br /></span>
<span class="md_line">This post does not cover model selection or tuning; it summarizes the methods shared across the text classification tasks in my work, as a foundation for the modeling part.<br /></span>
<span class="md_line md_line_end">A complete text classification task usually includes the following steps:</span>
</p>
<ol>
<li class="md_li"><span>Define the data-cleaning rules.
</span></li>
<li class="md_li"><span>Read the text.
</span></li>
<li class="md_li"><span>Build the vocabulary.
</span></li>
<li class="md_li"><span>Split into train/dev/test sets.
</span></li>
<li class="md_li"><span>Build the model.
</span></li>
<li class="md_li"><span>Evaluate.
</span></li>
<li class="md_li"><span>Tune.
</span></li>
</ol>
<p class="md_block">
<span class="md_line md_line_start">Note that steps 1, 2, 3, 4, and 6 are common to every task, so we extract them into a text classification template; afterwards we only need to care about model selection and tuning.<br /></span>
<span class="md_line md_line_end">There is no settled verdict on which text classification algorithm is best. Pick a baseline first, then swap algorithms and see whether the results improve. In practice, I usually feed TF-IDF features into LR as the baseline: TF-IDF is the most intuitive of text features, and LR is highly interpretable. Then I build a DNN with Keras and measure how much it improves over the linear LR model.</span>
</p>
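As a side note on the baseline features, TF-IDF can be computed in a dozen lines of plain Python. This is only a sketch for intuition; in real work I would use sklearn's TfidfVectorizer or gensim:

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: a list of tokenized documents. Returns one {word: tf-idf}
    dict per document, using raw term frequency and idf = log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: one count per doc
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weighted.append({w: (c / total) * math.log(n / df[w])
                         for w, c in tf.items()})
    return weighted
```

A word that appears in every document gets idf = log(1) = 0, which is exactly why TF-IDF downweights the function words mentioned below.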
<h3 id="toc_1" class="h16">Data Cleaning</h3>
<p class="md_block">
<span class="md_line md_line_start md_line_end">Strip the junk symbols from the input doc, then segment it with jieba. Note that we do not need to remove stop words, because 1. structures like LSTM and CNN may be able to capture the effect of function words such as "的得地"; 2. we can have the vocabulary filter out words that appear in more than n% of documents, or words that appear fewer than m times.</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="kn">import</span> <span class="nn">jieba</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="n">etl_regex</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">r"[\s+\!\/_,$%^*(+</span><span class="se">\"</span><span class="s2">\']+|[+——!,。?、~@#¥%……&*():]+"</span><span class="p">)</span>
<span class="c1"># strip symbols</span>
<span class="k">def</span> <span class="nf">delete_symbol</span><span class="p">(</span><span class="n">content</span><span class="p">):</span>
<span class="n">content</span> <span class="o">=</span> <span class="n">etl_regex</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s1">''</span><span class="p">,</span> <span class="n">content</span><span class="p">)</span>
<span class="k">return</span> <span class="n">content</span>
<span class="k">def</span> <span class="nf">clean_doc</span><span class="p">(</span><span class="n">doc</span><span class="p">,</span> <span class="n">vocab</span><span class="p">):</span>
    <span class="n">tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">word</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">></span> <span class="mi">0</span><span class="p">,</span> <span class="nb">map</span><span class="p">(</span><span class="n">delete_symbol</span><span class="p">,</span> <span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">doc</span><span class="p">,</span> <span class="n">cut_all</span><span class="o">=</span><span class="bp">True</span><span class="p">)))]</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">tokens</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">vocab</span><span class="p">]</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
<span class="k">return</span> <span class="n">tokens</span>
</pre></div>
<!--block_code_end--><h3 id="toc_2" class="h16">Build the Vocabulary</h3>
<p class="md_block">
<span class="md_line md_line_start">The vocabulary is just a Counter: iterate over the cleaned token lists to get a Counter of the form {word1:count, word2:count, ... , wordn:count}. Iterating over the Counter, we can filter out words that occur fewer than n times, and also drop words that occur more than m times across the texts, similar in spirit to TF-IDF.<br /></span>
<span class="md_line md_line_end">Using a raw Counter is fairly primitive; you can instead from gensim.corpora import Dictionary, feed it the tokenized two-dimensional list, and use its filtering methods, which is more convenient.</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="kn">from</span> <span class="nn">os</span> <span class="kn">import</span> <span class="n">listdir</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="k">def</span> <span class="nf">add_doc_to_vocab</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">vocab</span><span class="p">):</span>
<span class="n">doc</span> <span class="o">=</span> <span class="n">load_doc</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
    <span class="n">tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">jieba</span><span class="o">.</span><span class="n">lcut</span><span class="p">(</span><span class="n">delete_symbol</span><span class="p">(</span><span class="n">doc</span><span class="p">))</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">w</span><span class="p">)</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span>  <span class="c1"># no vocab filtering while the vocab is still being built</span>
<span class="n">vocab</span><span class="o">.</span><span class="n">update</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">process_docs</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">vocab</span><span class="p">):</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">listdir</span><span class="p">(</span><span class="n">directory</span><span class="p">):</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">filename</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.txt'</span><span class="p">):</span>
<span class="k">continue</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">directory</span> <span class="o">+</span> <span class="s1">'/'</span> <span class="o">+</span> <span class="n">filename</span>
<span class="n">add_doc_to_vocab</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>
<span class="n">vocab</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">()</span>
<span class="n">process_docs</span><span class="p">(</span><span class="s1">'pos/'</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>
<span class="n">process_docs</span><span class="p">(</span><span class="s1">'neg/'</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>
<span class="n">min_occurrence</span> <span class="o">=</span> <span class="mi">2</span>  <span class="c1"># drop words seen fewer than 2 times</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span><span class="n">c</span> <span class="ow">in</span> <span class="n">vocab</span><span class="o">.</span><span class="n">items</span><span class="p">()</span> <span class="k">if</span> <span class="n">c</span> <span class="o">>=</span> <span class="n">min_occurrence</span><span class="p">]</span>
<span class="c1"># --- alternative: use gensim's Dictionary ---</span>
<span class="kn">from</span> <span class="nn">gensim.corpora</span> <span class="kn">import</span> <span class="n">Dictionary</span>
<span class="n">dictionary</span> <span class="o">=</span> <span class="n">Dictionary</span><span class="p">(</span><span class="n">texts</span><span class="p">)</span>
<span class="n">dictionary</span><span class="o">.</span><span class="n">filter_extremes</span><span class="p">(</span><span class="n">no_below</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">no_above</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">keep_n</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
</pre></div>
<!--block_code_end--><h3 id="toc_3" class="h16">Read the Text</h3><div class="codehilite code_lang_python highlight"><pre><span></span><span class="kn">from</span> <span class="nn">os</span> <span class="kn">import</span> <span class="n">listdir</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">array</span>
<span class="k">def</span> <span class="nf">load_doc</span><span class="p">(</span><span class="n">filename</span><span class="p">):</span>
<span class="n">text</span> <span class="o">=</span> <span class="s2">""</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s2">"r"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s2">"utf-8"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">fp</span><span class="p">:</span>
<span class="n">text</span> <span class="o">+=</span> <span class="n">line</span>
<span class="k">return</span> <span class="n">text</span>
<span class="k">def</span> <span class="nf">process_docs</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">vocab</span><span class="p">,</span> <span class="n">is_train</span><span class="p">):</span>
<span class="n">documents</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">listdir</span><span class="p">(</span><span class="n">directory</span><span class="p">):</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">is_train</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">filename</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.txt'</span><span class="p">):</span>
<span class="k">continue</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">directory</span> <span class="o">+</span> <span class="s1">'/'</span> <span class="o">+</span> <span class="n">filename</span>
<span class="n">doc</span> <span class="o">=</span> <span class="n">load_doc</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">clean_doc</span><span class="p">(</span><span class="n">doc</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>
<span class="n">documents</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
<span class="k">return</span> <span class="n">documents</span>
<span class="k">def</span> <span class="nf">load_clean_dataset</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="n">is_train</span><span class="p">):</span>
<span class="n">neg</span> <span class="o">=</span> <span class="n">process_docs</span><span class="p">(</span><span class="s1">'txt_sentoken/neg'</span><span class="p">,</span> <span class="n">vocab</span><span class="p">,</span> <span class="n">is_train</span><span class="p">)</span>
<span class="n">pos</span> <span class="o">=</span> <span class="n">process_docs</span><span class="p">(</span><span class="s1">'txt_sentoken/pos'</span><span class="p">,</span> <span class="n">vocab</span><span class="p">,</span> <span class="n">is_train</span><span class="p">)</span>
<span class="n">docs</span> <span class="o">=</span> <span class="n">neg</span> <span class="o">+</span> <span class="n">pos</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="mi">0</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">neg</span><span class="p">))]</span> <span class="o">+</span> <span class="p">[</span><span class="mi">1</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">pos</span><span class="p">))])</span>
<span class="k">return</span> <span class="n">docs</span><span class="p">,</span> <span class="n">labels</span>
</pre></div>
<!--block_code_end--><h3 id="toc_4" class="h16">Encode the Data</h3>
<p class="md_block">
<span class="md_line md_line_start md_line_end">A DNN's input must be fixed-length vectors: inputs longer than the defined length are truncated, and shorter ones are padded at the front or back.</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="kn">from</span> <span class="nn">keras.preprocessing.text</span> <span class="kn">import</span> <span class="n">Tokenizer</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.sequence</span> <span class="kn">import</span> <span class="n">pad_sequences</span>
<span class="k">def</span> <span class="nf">create_tokenizer</span><span class="p">(</span><span class="n">lines</span><span class="p">):</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">()</span>
<span class="n">tokenizer</span><span class="o">.</span><span class="n">fit_on_texts</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
<span class="k">return</span> <span class="n">tokenizer</span>
<span class="c1"># integer encode and pad documents</span>
<span class="k">def</span> <span class="nf">encode_docs</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">max_length</span><span class="p">,</span> <span class="n">docs</span><span class="p">):</span>
<span class="n">encoded</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span>
<span class="n">padded</span> <span class="o">=</span> <span class="n">pad_sequences</span><span class="p">(</span><span class="n">encoded</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">max_length</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="s1">'post'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">padded</span>
</pre></div>
<!--block_code_end--><h3 id="toc_5" class="h16">Build the Model</h3>
<p class="md_block">
<span class="md_line md_line_start md_line_end">This is where you get to do whatever you want.</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="k">def</span> <span class="nf">define_model</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">max_length</span><span class="p">):</span>
<span class="k">pass</span>
</pre></div>
<!--block_code_end--><h3 id="toc_6" class="h16">Evaluate</h3>
<p class="md_block">
<span class="md_line md_line_start md_line_end">The test set comes into play: run the same preprocessing pipeline on it, load it into memory, and call the Keras model's built-in evaluate(), which returns the loss and accuracy on the test set.</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="n">Xtest</span> <span class="o">=</span> <span class="n">encode_docs</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">max_length</span><span class="p">,</span> <span class="n">test_docs</span><span class="p">)</span>
<span class="n">test_docs</span><span class="p">,</span> <span class="n">ytest</span> <span class="o">=</span> <span class="n">load_clean_dataset</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
<span class="n">_</span><span class="p">,</span> <span class="n">acc</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">Xtest</span><span class="p">,</span> <span class="n">ytest</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</pre></div>
<!--block_code_end--><h3 id="toc_7" class="h16">Full Pipeline</h3><div class="codehilite code_lang_python highlight"><pre><span></span><span class="c1"># build the vocabulary</span>
<span class="n">vocab</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">()</span>
<span class="n">process_docs</span><span class="p">(</span><span class="s1">'pos/'</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>
<span class="n">process_docs</span><span class="p">(</span><span class="s1">'neg/'</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>
<span class="c1"># read the training text</span>
<span class="n">train_docs</span><span class="p">,</span> <span class="n">ytrain</span> <span class="o">=</span> <span class="n">load_clean_dataset</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span>
<span class="c1"># create the tokenizer</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">create_tokenizer</span><span class="p">(</span><span class="n">train_docs</span><span class="p">)</span>
<span class="n">vocab_size</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">word_index</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">max_length</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">split</span><span class="p">())</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">train_docs</span><span class="p">])</span>
<span class="c1"># encode the texts as network input</span>
<span class="n">Xtrain</span> <span class="o">=</span> <span class="n">encode_docs</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">max_length</span><span class="p">,</span> <span class="n">train_docs</span><span class="p">)</span>
<span class="c1"># define the model</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">define_model</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">max_length</span><span class="p">)</span>
<span class="c1"># train</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">Xtrain</span><span class="p">,</span> <span class="n">ytrain</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">_</span><span class="p">,</span> <span class="n">acc</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">Xtrain</span><span class="p">,</span> <span class="n">ytrain</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># evaluate on the test set</span>
<span class="n">test_docs</span><span class="p">,</span> <span class="n">ytest</span> <span class="o">=</span> <span class="n">load_clean_dataset</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
<span class="n">Xtest</span> <span class="o">=</span> <span class="n">encode_docs</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">max_length</span><span class="p">,</span> <span class="n">test_docs</span><span class="p">)</span>
<span class="n">_</span><span class="p">,</span> <span class="n">acc</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">Xtest</span><span class="p">,</span> <span class="n">ytest</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</pre></div>
<!--block_code_end-->
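As a dependency-free illustration of what the encoding step does — map each token to an integer id from the training vocabulary, then pad every sequence to max_length — here is a sketch. The helper names are mine, not from the post; the post's `encode_docs` presumably does the same thing via Keras's `Tokenizer` and `pad_sequences`.

```python
def build_word_index(docs):
    # 0 is reserved for padding, so ids start at 1
    index = {}
    for doc in docs:
        for token in doc.split():
            if token not in index:
                index[token] = len(index) + 1
    return index

def encode_and_pad(word_index, max_length, docs):
    encoded = []
    for doc in docs:
        ids = [word_index[t] for t in doc.split() if t in word_index]
        # truncate long sequences, pad short ones with 0
        ids = ids[:max_length] + [0] * max(0, max_length - len(ids))
        encoded.append(ids)
    return encoded

train_docs = ["great fun movie", "terrible boring plot"]
word_index = build_word_index(train_docs)
max_length = max(len(d.split()) for d in train_docs)
Xtrain = encode_and_pad(word_index, max_length, train_docs)
```

Unknown tokens in test documents are simply dropped, which mirrors how a tokenizer fitted only on training docs behaves.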
Starting a Flask app from the command line
2017-01-24T16:31:00Z
2018-01-25
caomaocao的家
<p class="md_block">
<span class="md_line md_line_start md_line_end">A Flask service that can compute transaction stats for a given shop id at any time, and can also run the computation over all shops at a scheduled time.</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span>POST parameters:
<span class="n">flag</span><span class="p">:</span> <span class="c1"># True = all shops, False = a specific shop</span>
<span class="n">shop_id</span><span class="p">:</span><span class="c1"># may be empty; required when flag=False</span>
</pre></div>
<!--block_code_end-->
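A Flask-free sketch of the validation that this POST contract implies; the function name and error message are hypothetical, not from the actual service.

```python
def parse_calc_request(body):
    # body: the parsed POST payload, e.g. a dict from request.get_json()
    flag = body.get("flag", False)
    shop_id = body.get("shop_id")
    if flag:
        return ("all", None)            # compute every shop
    if not shop_id:
        # flag=False means a single shop, so an id is mandatory
        raise ValueError("shop_id is required when flag is False")
    return ("one", shop_id)             # compute one shop
```

Inside a real route handler this would run before kicking off the computation, turning a malformed body into a 400 response.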
<p class="md_block">
<span class="md_line md_line_start">The service accepts a POST body to kick off a computation at any time, and receives a full-run flag in the middle of the night. I happened to come across the Click package, which crushes the argparse module I used to rely on, so let's make this Flask service both accept POSTs and run on a schedule via crontab with a given parameter.<br /></span>
<span class="md_line md_line_end">The original Flask service:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="kn">from</span> <span class="nn">flask</span> <span class="kn">import</span> <span class="n">Flask</span><span class="p">,</span> <span class="n">request</span><span class="p">,</span> <span class="n">jsonify</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">Flask</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
    <span class="n">app</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s1">'0.0.0.0'</span><span class="p">,</span> <span class="n">debug</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">9965</span><span class="p">)</span>
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">The official Click example:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="kn">import</span> <span class="nn">click</span>
<span class="nd">@click.command</span><span class="p">()</span>
<span class="nd">@click.option</span><span class="p">(</span><span class="s1">'--count'</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">'Number of greetings.'</span><span class="p">)</span>
<span class="nd">@click.option</span><span class="p">(</span><span class="s1">'--name'</span><span class="p">,</span> <span class="n">prompt</span><span class="o">=</span><span class="s1">'Your name'</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">'The person to greet.'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">hello</span><span class="p">(</span><span class="n">count</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">count</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="s1">'Hello </span><span class="si">%s</span><span class="s1">!'</span> <span class="o">%</span> <span class="n">name</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
    <span class="n">hello</span><span class="p">()</span>
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start">option is the most basic decorator: it reads a value from the command line and passes it to the function. In the example above, besides the option's name we can also set:<br /></span>
<span class="md_line">default: the option's default value<br /></span>
<span class="md_line">help: the option's description<br /></span>
<span class="md_line">type: the option's type, e.g. string, int, float<br /></span>
<span class="md_line">prompt: if the option is missing on the command line, prompt the user for it<br /></span>
<span class="md_line md_line_end">nargs: how many values the option accepts</span>
</p>
<p class="md_block">
<span class="md_line md_line_start md_line_end">The main function should not be decorated with @click directly, so the Flask app and the all-shops computation are pulled out into a single function that Click decorates:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="nd">@click.command</span><span class="p">()</span>
<span class="nd">@click.option</span><span class="p">(</span><span class="s1">'--mode'</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s2">"server"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="n">click</span><span class="o">.</span><span class="n">Choice</span><span class="p">([</span><span class="s2">"server"</span><span class="p">,</span> <span class="s2">"client"</span><span class="p">]),</span> <span class="n">help</span><span class="o">=</span><span class="s2">"client run all shops, server receive specific seller_id"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">mode</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">mode</span> <span class="o">==</span> <span class="s2">"server"</span><span class="p">:</span>
        <span class="n">app</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s1">'0.0.0.0'</span><span class="p">,</span> <span class="n">debug</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">9965</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">calc_all_shops</span><span class="p">()</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
    <span class="n">run</span><span class="p">()</span>
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">To explain: Click decorates run(), which dispatches on the command-line parameter mode. If mode is server, it acts as the Flask service and receives the shop ids to compute; if mode=client, the branch inside run() fires and calls calc_all_shops(). mode only accepts server or client; anything else prints the help message. Run it, though, and it errors out:</span>
</p>
<div class="codehilite code_lang_shell highlight"><pre><span></span>Traceback <span class="o">(</span>most recent call last<span class="o">)</span>:
...
RuntimeError: Click will abort further execution because Python <span class="m">3</span> was
configured to use ASCII as encoding <span class="k">for</span> the environment. Either switch
to Python <span class="m">2</span> or consult http://click.pocoo.org/python3/ <span class="k">for</span>
mitigation steps.
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">Click apparently doesn't get along perfectly with Python 3; the official docs explain this at http://click.pocoo.org/5/python3/ and also offer a workaround:</span>
</p>
<div class="codehilite code_lang_shell highlight"><pre><span></span><span class="nb">export</span> <span class="nv">LC_ALL</span><span class="o">=</span>en_US.utf-8
<span class="nb">export</span> <span class="nv">LANG</span><span class="o">=</span>en_US.utf-8
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start">If you debug in PyCharm, remember to add these under Edit Configurations, or the error will keep coming back:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="/_image/2018-01-25/00-33-17.jpg" alt="Image" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">With local debugging done, how do we deploy this service? A production Flask app must not run in debug mode, nor on Flask's built-in web server, so bring in gunicorn:</span>
</p>
<pre><code>nohup gunicorn main_server:app -b 0.0.0.0:9965 -w 4 &</code></pre>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">Run the all-shops computation at 03:01 every night:</span>
</p>
<pre><code>1 3 * * * cd ~/project_dir; nohup ~/anaconda2/envs/python35/bin/python main_server.py --mode client >> allshop.log 2>&1 &</code></pre>
<!--block_code_end-->
A toy search engine in Python
2016-12-26T13:30:00Z
sou-suo-yin-qing-toyshi-xian-by-python
caomaocao的家
<h2 id="toc_0" class="h16">1. Inverted index</h2>
<pre><code>1. First column (the term dictionary):
    1. Skip list
    2. Hash table: fine if everything is Chinese, but tokens like cookies collide badly; not disk-friendly
    3. B+ tree, leaves store mmap offsets; since leaves are chained in order, it suits multi-way merging
    4. Trie
2. Second column (the postings): inverted file lives on disk, mmapped into memory</code></pre>
<!--block_code_end--><h2 id="toc_1" class="h16">2. Text relevance</h2><h2 id="toc_2" class="h16">3. Building the inverted index</h2>
<pre><code>Scan once to gather statistics and precompute, then build the index
1. Single full pass:
    1. Iterate over documents, assigning doc_ids
    2. Tokenize the current doc into terms term1, term2, ..., termn
    3. Build a {term: doc_id_list} dict
    4. After all documents, we have the big {term: doc_id_list} dict
    5. Write the big dict to a file and build a B+ tree over {term: file offset}
2. Batched merging:
    1. Reserve a fixed-size buffer in memory
    2. Tokenize the current doc into terms term1, term2, ..., termn
    3. Append {term_1: doc_id}, {term_2: doc_id}, ... to the in-memory data, kept sorted by term
    4. Repeat the two steps above; when the buffer fills, flush it to disk and clear it
    5. Merge the on-disk files: build each term's postings list, append it to the inverted file, and store that term's offset in the dictionary
3. Combined:
    1. Write every n documents into one segment using the first method
    2. At query time, search the segments together and merge the results
    3. Merge segments once there are too many</code></pre>
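The single-pass build in method 1 can be sketched in a few lines of Python; the helper name is mine, not from the notes, and the dictionary/offset machinery is omitted.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: list of token lists; a doc's position in the list is its doc_id
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        for term in set(tokens):        # one posting per (term, doc)
            index[term].append(doc_id)  # doc_ids come out ascending
    return dict(index)

docs = [["apple", "iphone"], ["apple", "watch"], ["iphone", "case"]]
index = build_inverted_index(docs)
```

Because documents are scanned in doc_id order, every postings list is already ascending, which is exactly what the intersection step in section 7 relies on.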
<!--block_code_end--><h2 id="toc_3" class="h16">4. Building the forward index (used for filtering)</h2>
<pre><code>1. Based on the B+ tree
2. For a chosen field, keep the forward index as a list: [doc_1_price, doc_2_price, doc_3_price, ..., doc_n_price]
    1. Get doc_id_list from the inverted index
    2. Filter against this 1-D list, keeping docs whose price qualifies</code></pre>
<!--block_code_end--><h2 id="toc_4" class="h16">5. Index management</h2>
<pre><code>1. Different fields support different behaviors: exact match, filtering (<, >, ranges, etc.)
2. A field class handles field management: add(), query(), filter(), get()</code></pre>
<!--block_code_end--><h2 id="toc_5" class="h16">6. Segment management (a segment is a slice of the index)</h2>
<pre><code>1. Merging segments
    1. Merge the inverted parts: B+ trees are sorted, so merge-sort them
    2. Merge the forward parts: iterate and append
2. Policy: persist a segment for every 100k new documents; once there are 5 segments, merge them all into one</code></pre>
<!--block_code_end--><h2 id="toc_6" class="h16">7. The index layer: create, delete, update, query</h2>
<pre><code>1. Delete: bitmap; find the doc_id and set bit doc_id in the bitmap to mark the doc deleted
2. Insert: generate the _id from random number + timestamp + local IP (MAC) address
3. Query:
    1. Query analysis/rewriting; at its simplest, tokenization
    2. doc_ids in postings are kept ascending precisely for this step: intersecting several postings lists
        1. Optimization: take the shortest postings list as the base and probe each of its elements against the other lists
            1. If an element is missing from any list, it cannot be in the final result, so skip it immediately
            2. If any postings list is exhausted, we can also stop early
    3. Forward index: filtering happens here</code></pre>
<!--block_code_end-->
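The intersection optimization in step 3.2 can be sketched as follows, probing the shortest list's elements against the others with binary search (the sorted lists make `bisect` applicable); the function names are mine.

```python
from bisect import bisect_left

def _contains(sorted_list, doc_id):
    # membership test on an ascending postings list via binary search
    i = bisect_left(sorted_list, doc_id)
    return i < len(sorted_list) and sorted_list[i] == doc_id

def intersect_postings(postings_lists):
    # take the shortest list as the base; a doc absent from any other
    # list cannot be in the final result, so it is skipped immediately
    if not postings_lists:
        return []
    lists = sorted(postings_lists, key=len)
    base, others = lists[0], lists[1:]
    return [doc_id for doc_id in base
            if all(_contains(lst, doc_id) for lst in others)]
```

A production version would also walk the other lists with a cursor (the "exhausted list" early exit in the notes) instead of restarting the binary search each time.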
A large text file processing walkthrough
2016-10-12T03:15:00Z
2016-12-12
caomaocao的家
<p class="md_block">
<span class="md_line md_line_start md_line_end">In data mining work, the most time-consuming, most brain-draining step has to be data preprocessing.</span>
</p>
<p class="md_block">
<span class="md_line md_line_start">The database dump format:<br /></span>
<span class="md_line">category  item_id  source_url  title<br /></span>
<span class="md_line">The full product dump .txt is huge: about 300 million rows, 20 GB. Our category taxonomy (the category tree), fully expanded, has about 15,000+ categories, and we want to run topic models over the titles within each category. So preprocessing first has to split these 300 million rows into per-category .txt files.<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/u6e38lxqy2dcj6zsogw6xtx7/image_1aseh179suoe15q4ab8c64qmb13.png" alt="Image 1" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">At first glance, the awk + sed text-processing combo should do the job.</span>
</p>
<div class="codehilite code_lang_shell highlight"><pre><span></span>awk -F<span class="s1">'\t'</span> <span class="s1">'{print $2"\t"$3"\t"$4 >> $1".txt"}'</span> all_cid.log
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start">Split on tabs, appending each row's item id, site, and title to its category's .txt.<br /></span>
<span class="md_line md_line_end">One run later: it crawls along at roughly 2 MB/s, which would take forever. The slowness is presumably because every line read reopens the corresponding category txt for appending.</span>
</p>
<p class="md_block">
<span class="md_line md_line_start">Clearly, sorting by category first and then writing each category's rows in one batch should work much better. With my shallow algorithms knowledge, merge-sorting this big file in 500 MB chunks was still too slow for my taste, so it was time for the heavy artillery: MapReduce.<br /></span>
<span class="md_line">A colleague pointed out that writing a dedicated JAR and running it on the Hadoop cluster is heavy-handed for a one-off text job; isn't there a quicker MapReduce?<br /></span>
<span class="md_line md_line_end">There is: the shell MapReduce template</span>
</p>
<div class="codehilite code_lang_shell highlight"><pre><span></span>cat *.txt <span class="p">|</span> map step <span class="p">|</span> sort <span class="p">|</span> reduce step
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">The Map and Reduce steps can be implemented in any language; I went with Python here.</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="kn">import</span> <span class="nn">sys</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="p">:</span>
    <span class="n">line_list</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"</span><span class="se">\t</span><span class="s2">"</span><span class="p">)</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">line_list</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">4</span><span class="p">:</span>
        <span class="k">continue</span>
    <span class="n">cid</span> <span class="o">=</span> <span class="n">line_list</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">print</span> <span class="s2">"</span><span class="si">%s</span><span class="se">\t</span><span class="si">%s</span><span class="se">\t</span><span class="si">%s</span><span class="se">\t</span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">cid</span><span class="p">,</span> <span class="n">line_list</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">line_list</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">line_list</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span>
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start">The Map step is simple: read what <code>cat</code> streams in via <code>sys.stdin</code>, drop the dirty rows, and emit KV pairs; the key here is the category id, the value is the title, source, and id.<br /></span>
<span class="md_line">The sort in the middle is left to the shell.<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/l66e9s8ykjjr19vxd2al7b5j/image_1aseij6mvbho1kjsts01m0k6dc2n.png" alt="Image 2" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">The key work happens in the Reduce step:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span>current_cid = None
tmp_list = []
for line in sys.stdin:
    line_list = line.strip().split("\t")
    cid = line_list[0]
    if current_cid == cid:
        tmp_list.append("\t".join(line_list[1:]))
    else:
        if current_cid:
            with open("./data/%s.txt" % current_cid, "a") as fp:
                for i in tmp_list:
                    fp.write("%s\n" % (i))
        current_cid = cid
        tmp_list = ["\t".join(line_list[1:])]
# flush the final category's buffer
if current_cid:
    with open("./data/%s.txt" % current_cid, "a") as fp:
        for i in tmp_list:
            fp.write("%s\n" % (i))
</pre></div>
<!--block_code_end-->
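Since the reducer's input arrives sorted by category id, the same grouping can be written more compactly with itertools.groupby; this is a sketch of the idea, not the code actually used, and the callback-based `write` is my own framing.

```python
from itertools import groupby

def reduce_lines(lines, write):
    # lines: sorted "cid\tid\tsite\ttitle" rows
    # write(cid, rows): callback that persists one category's rows
    rows = (line.rstrip("\n").split("\t", 1) for line in lines)
    for cid, group in groupby(rows, key=lambda kv: kv[0]):
        write(cid, [rest for _, rest in group])

# collect groups into a dict instead of writing files, to show the shape
out = {}
reduce_lines(
    ["1623\ta\tx\tt1\n", "1623\tb\ty\tt2\n", "9\tc\tz\tt3\n"],
    lambda cid, vals: out.setdefault(cid, []).extend(vals),
)
```

groupby only groups adjacent equal keys, which is exactly why the shell `sort` stage is required in front of it.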
<p class="md_block">
<span class="md_line md_line_start">We get:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/soinwc8a6x2r8ukv1iqz2d1l/image_1asehh921u6ga381tdp1lvol711t.png" alt="Image 3" title="" ><br /></span>
<span class="md_line img_before only_img_before">And the data inside <code>1623.txt</code>:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/zap07gzhpj9wv34vae071xyq/image_1asehhtl435a1hrv44m193n7ju2a.png" alt="Image 4" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">Done. Oh, and when processing data at this scale, don't forget to log:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="kn">import</span> <span class="nn">logging</span>
<span class="n">logging_format_str</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"</span><span class="si">%(levelname)-8s</span><span class="s2"> </span><span class="si">%(asctime)s</span><span class="s2"> </span><span class="si">%(filename)s</span><span class="s2">:</span><span class="si">%(lineno)d</span><span class="s2">"</span>
<span class="s2">" ] </span><span class="si">%(message)s</span><span class="s2">"</span><span class="p">)</span>
<span class="n">logging</span><span class="o">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">format</span><span class="o">=</span><span class="n">logging_format_str</span><span class="p">,</span> <span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)</span>
<span class="k">if</span> <span class="n">count</span> <span class="o">%</span> <span class="mi">1000000</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
    <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">"</span><span class="si">%s</span><span class="s2"> processed cid:</span><span class="si">%s</span><span class="s2">"</span><span class="o">%</span><span class="p">(</span><span class="n">count</span><span class="p">,</span> <span class="n">cid</span><span class="p">))</span>
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">This MapReduce over the whole dataset stays within an hour, which is plenty. With even more data you'd move to the Hadoop cluster; Hadoop's Streaming tool isn't limited to Java, so anything that can operate on Map and Reduce streams can use the cluster. Very convenient. A configuration template:</span>
</p>
<div class="codehilite code_lang_shell highlight"><pre><span></span>hadoop jar HADOOP_HOME/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar \
-D mapred.job.name="job name" \
-input HDFS_INPUT_PATH \
-output HDFS_OUTPUT_PATH \
-mapper mapper.py \
-reducer reducer.py \
-file ./mapper.py \
-file ./reducer.py
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">Submit it, and the elephantine Hadoop cluster chugs into action.</span>
</p>
Applying word2vec embeddings to same-product retrieval for 3C products
2016-05-05T06:32:00Z
word2vecci-xiang-liang-zai-3cchan-pin-tong-kuan-jian-suo-zhong-de-ying-yong
caomaocao的家
<h2 id="toc_0" class="h16">1. Data source</h2>
<p class="md_block">
<span class="md_line md_line_start">3C product titles were collected by category and model. Brand, model, and title are all tokenized, with English lowercased; the data looks like this:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/4nbm5upiwhmecyc3ktwefwtr/9AE65849-3BE2-472D-BB2F-54FCE09C0BA9.png" alt="9AE65849-3BE2-472D-BB2F-54FCE09C0BA9.png-137kB" title="" ></span>
</p>
<h2 id="toc_1" class="h16">2. Training</h2>
<p class="md_block md_block_as_opening md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_start md_line_end">word2vec has a Java implementation, deeplearning4j, and a Python implementation, gensim. The architecture here: train a model per category with gensim, load those models when the online service initializes, and compute similarities with deeplearning4j to produce the matches. The reasons:</span>
</p>
<ol>
<li class="md_li"><span>deeplearning4j takes very long to train. It can be accelerated, but on Windows that means the Visual Studio C++ compiler plus quite a bit of setup, so I didn't bother.
</span></li>
</ol>
<p class="md_block md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/cmzl3c9f9lz4t9wsnc8zb9ps/QQ%E6%88%AA%E5%9B%BE20160505162634.png" alt="QQ截图20160505162634.png-5.1kB" title="" ></span>
</p>
<ol>
<li class="md_li"><span>On the same data (100,507 titles in the phone category), the Python version trains in 1 minute versus 7 minutes for the Java version, which makes tuning deeplearning4j's training parameters a pain.
</span></li>
<li class="md_li"><span>Most critically, the resulting accuracy is worrying (likely a training-parameter issue); see the figure below, same training set:
</span></li>
</ol>
<p class="md_block">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/v4j96cmtymove8p89jsz0az2/QQ%E6%88%AA%E5%9B%BE20160505162706.png" alt="QQ截图20160505162706.png-364.9kB" title="" ></span>
</p>
<h3 id="toc_2" class="h16">Parameter tuning</h3><div class="codehilite code_lang_python highlight"><pre><span></span><span class="n">model</span> <span class="o">=</span> <span class="n">Word2Vec</span><span class="p">(</span><span class="n">LineSentence</span><span class="p">(</span><span class="s2">"./brand_title/title_token_</span><span class="si">%s</span><span class="s2">.txt"</span><span class="o">%</span><span class="p">(</span><span class="n">cid</span><span class="p">)),</span> <span class="n">size</span><span class="o">=</span><span class="mi">400</span><span class="p">,</span> <span class="n">window</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">min_count</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">workers</span><span class="o">=</span><span class="n">multiprocessing</span><span class="o">.</span><span class="n">cpu_count</span><span class="p">(),</span> <span class="n">sg</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">sg=1 selects the skip-gram algorithm, which is more sensitive to rare words; for more similar-word recall, choose CBOW (sg=0, gensim's default). size is the dimensionality of the word vectors. window is the context span: window=5 means looking at 5-b words before the target and b words after (b random in 0..5). Because the corpus (titles) is so short, too large a window hurts the results; in the end CBOW with window=3 was used. min_count=5 ignores extremely rare words appearing fewer than 5 times.</span>
</p>
<h2 id="toc_3" class="h16">3. Matching</h2>
<ol>
<li class="md_li"><span>Search the title to get the brand/model gid
</span></li>
<li class="md_li"><span>Fetch that gid's match_words:
</span></li>
</ol>
<p class="md_block md_has_block_below md_has_block_below_ol">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/jmxxqiaj93pxp1z3gypjko68/QQ%E6%88%AA%E5%9B%BE20160427175111.png" alt="QQ截图20160427175111.png-8.3kB" title="" ></span>
</p>
<ol>
<li class="md_li"><span>Tokenize the title, load the model word2vec produced, and compute the similarity between every title token and the match_words.
</span></li>
<li class="md_li"><span>If each of the match_words has a close token among the title tokens, the Solr retrieval result is deemed correct.
</span></li>
</ol>
<p class="md_block">
<span class="md_line md_line_start md_line_end">The original problem lay in the verification step: crudely checking whether the match_words literally appear among the title tokens ignores semantic links, e.g. iphone <-> 苹果, where the title token (苹果) clearly has a strong connection to the match_word (iphone). Now the cosine distance between word vectors expresses pairwise similarity, capturing those semantic links.</span>
</p>
<p class="md_block">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/6hjswyisxngmo5o1fwrqzb7f/QQ%E6%88%AA%E5%9B%BE20160505162133.png" alt="QQ截图20160505162133.png-274.2kB" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">Rule: a token counts as related to a match_word when their similarity exceeds 0.35; a token counts as strongly associated with the brand/model when its similarities to the match_words sum to more than 0.35 * the number of match_words.</span>
</p>
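The rule can be expressed directly in code; the similarity function is stubbed with a toy table here (in the real system it would be the per-category word2vec model's similarity), and the function name is mine.

```python
def strongly_associated(token, match_words, sim, threshold=0.35):
    # a token is strongly associated with the brand/model when the sum of
    # its similarities to all match_words exceeds threshold * len(match_words)
    total = sum(sim(token, w) for w in match_words)
    return total > threshold * len(match_words)

# toy similarity table standing in for model.similarity()
sims = {("apple", "iphone"): 0.96, ("apple", "5s"): 0.48, ("apple", "iphone5s"): 0.76,
        ("unicom", "iphone"): 0.06, ("unicom", "5s"): 0.20, ("unicom", "iphone5s"): 0.01}
sim = lambda a, b: sims.get((a, b), 0.0)
match_words = ["iphone", "5s", "iphone5s"]
```

With these numbers, "apple" sums to 2.20 > 1.05 and passes, while "unicom" sums to 0.27 and fails, matching the worked example below.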
<p class="md_block">
<span class="md_line md_line_start">For example, take the title: 国行现货【送膜+壳】Apple/苹果iPhone5s移动联通4G手机分期购<br /></span>
<span class="md_line">Apple <-> iphone 0.96<br /></span>
<span class="md_line">Apple <-> 5s 0.48<br /></span>
<span class="md_line">Apple <-> iphone5s 0.76<br /></span>
<span class="md_line">0.96 + 0.48 + 0.76 > 3 * 0.35, so Apple is strongly associated with iPhone5S.<br /></span>
<span class="md_line">联通 <-> iphone 0.06<br /></span>
<span class="md_line">联通 <-> 5s 0.20<br /></span>
<span class="md_line">联通 <-> iphone5s 0.01<br /></span>
<span class="md_line">So "联通" (Unicom) has little to do with iPhone5s.<br /></span>
<span class="md_line md_line_end">When at least one token among the title's tokens is strongly associated with the brand/model keywords, the Solr result is deemed correct and the brand/model is returned.</span>
</p>
<h2 id="toc_4" class="h16">4. Code structure</h2>
<p class="md_block">
<span class="md_line md_line_start md_line_end">The code computing the similarity between a token and the brand/model keywords:</span>
</p>
<div class="codehilite code_lang_java highlight"><pre><span></span><span class="kt">double</span> <span class="n">sim</span> <span class="o">=</span> <span class="n">word2vecConfig</span><span class="o">.</span><span class="na">getCidVectors</span><span class="o">().</span><span class="na">get</span><span class="o">(</span><span class="n">categoryCode</span><span class="o">).</span><span class="na">similarity</span><span class="o">(</span><span class="n">word</span><span class="o">.</span><span class="na">toLowerCase</span><span class="o">(),</span> <span class="n">matchModelList</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">i</span><span class="o">));</span>
</pre></div>
<!--block_code_end-->
Predicting a product's category from its title
2016-01-28T06:32:00Z
lei-mu-yu-pan-diao-can
caomaocao的家
<h2 id="toc_0" class="h16">Category taxonomy</h2>
<p class="md_block">
<span class="md_line md_line_start">After deduplication and merging, child through parent, there are 645 product categories.<br /></span>
<span class="md_line md_line_end">Liblinear was chosen as the classifier; compared with its sibling Libsvm, it is far faster.</span>
</p>
<h2 id="toc_1" class="h16">Program structure</h2>
<p class="md_block">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start md_line_end"><img class="md_compiled " src="/_image/类目预判调参/训练类目分类器.png" alt="" title="" ></span>
</p>
<h2 id="toc_2" class="h16">Parameter tuning</h2>
<p class="md_block">
<span class="md_line md_line_start md_line_end">Cross-validating the training over all parent categories is very slow, so 178 categories were sampled at random, with 1,000 titles per category for training and 100 for testing.</span>
</p>
<h3 id="toc_3" class="h16">Choosing training parameters</h3><h4 id="toc_4" class="h16">Comparing solvers</h4>
<table>
<thead>
<tr>
<th>Solver</th>
<th style="text-align: right">Time (s)</th>
<th style="text-align: center">Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>L2-regularized L2-loss support vector classification</td>
<td style="text-align: right">91</td>
<td style="text-align: center">86.426% (14606/16900)</td>
</tr>
<tr>
<td>support vector classification by Crammer and Singer</td>
<td style="text-align: right">34</td>
<td style="text-align: center">85.9704% (14529/16900)</td>
</tr>
<tr>
<td>L2-regularized logistic regression</td>
<td style="text-align: right">113</td>
<td style="text-align: center">87.0828% (14717/16900)</td>
</tr>
</tbody>
</table>
<p class="md_block">
<span class="md_line md_line_start md_line_end">Logistic regression performs best but is very slow when training on all categories; Crammer and Singer's SVM trains quickly, but I haven't read its implementation and don't know it well; L2 regularization guards effectively against overfitting, so L2-regularized L2-loss support vector classification it is.</span>
</p>
<h4 id="toc_5" class="h16">Training Parameters</h4>
<p class="md_block">
<span class="md_line md_line_start">cost of constraint (C):<br /></span>
<span class="md_line">Five-fold cross-validation on the average hit rate, searching C upward from log2C = -23, i.e. C close to 0. As the figure shows, the Cross Validation Accuracy peaks at 87.16% when log2C = -2, i.e. C = 0.25.<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="/_image/类目预判调参/3A.png" alt="" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">stopping criterion (eps): 0.1, not tuned further.</span>
</p>
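The C search described above is a one-dimensional grid search on a log2 scale. A minimal Python sketch, where `cross_val_accuracy` is a hypothetical callback that would run Liblinear's five-fold cross-validation and return the average hit rate:

```python
def best_c(cross_val_accuracy, log2c_range=range(-23, 11)):
    """Grid-search C = 2^k over a log2 grid and keep the value with the
    best cross-validation accuracy (cross_val_accuracy is hypothetical)."""
    scores = {2.0 ** k: cross_val_accuracy(2.0 ** k) for k in log2c_range}
    return max(scores, key=scores.get)
```

Starting the scan at log2C = -23 mirrors the range used above.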
<h3 id="toc_6" class="h16">Confusion Matrix</h3>
<p class="md_block">
<span class="md_line md_line_start">A confusion matrix is produced from the test results. With this many categories, a full categories-by-categories matrix is unreadable, so instead each category's accuracy is output together with the categories its titles are most often confused with.<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="/_image/类目预判调参/123.png" alt="" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">As shown: 7 categories, including 290601 (ZIPPO/瑞士军刀/眼镜-瑞士军刀) and 350301 (大家电-洗衣机), hit 100% on the test set. Sorting the rest from lowest hit rate upward: 50000802 (玩具/模型/动漫/早教/益智-其它玩具) hits only 48%, with 6 of its titles confused into category 2512 (汽车/用品/配件/改装-汽车影音/车用电子/电器-车用电子/电器-蓝牙检测仪) and 4 into 50016434 (居家日用/婚庆/创意礼品-创意礼品), its second most frequent confusion (confusions listed from most to least frequent).</span>
</p>
<p class="md_block">
<span class="md_line md_line_start md_line_end">How can categories with a very low hit rate be fixed? 1. Double the number of training titles for that category. 2. Merge mutually confused categories: if cid1 hits only 50% and its top confusion cid2 absorbs close to the other 50%, while cid2 likewise hits about 50% and is mostly confused with cid1, the two categories are effectively the same and should be merged.</span>
</p>
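The per-category breakdown above can be computed directly from the (labeled, predicted) pairs in the test output. A sketch:

```python
from collections import Counter, defaultdict

def per_class_report(pairs):
    """pairs: iterable of (true_cid, predicted_cid) from the test run.
    Returns {cid: (hit_rate, most_confused_cid_or_None)}."""
    totals, hits, confused = Counter(), Counter(), defaultdict(Counter)
    for true_cid, pred_cid in pairs:
        totals[true_cid] += 1
        if true_cid == pred_cid:
            hits[true_cid] += 1
        else:
            confused[true_cid][pred_cid] += 1
    report = {}
    for cid in totals:
        top = confused[cid].most_common(1)
        report[cid] = (hits[cid] / totals[cid], top[0][0] if top else None)
    return report
```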
<h2 id="toc_7" class="h16">Training</h2>
<p class="md_block">
<span class="md_line md_line_start">Counts:<br /></span>
<span class="md_line md_line_end">Of the 645 categories, 628 contain more than 3,000 titles (the threshold).</span>
</p>
<table>
<thead>
<tr>
<th>Training titles per category</th>
<th style="text-align: right">Cost</th>
<th style="text-align: right">Time (s)</th>
<th style="text-align: center">Hit rate</th>
</tr>
</thead>
<tbody>
<tr><td>1000</td><td style="text-align: right">0.0315</td><td style="text-align: right">360</td><td style="text-align: center">73.09% (443661/607000)</td></tr>
<tr><td>1000</td><td style="text-align: right">0.25</td><td style="text-align: right">845</td><td style="text-align: center">74.19% (450330/607000)</td></tr>
<tr><td>1000</td><td style="text-align: right">1</td><td style="text-align: right">1264</td><td style="text-align: center">73.2% (444364/607000)</td></tr>
<tr><td>2000</td><td style="text-align: right">0.0315</td><td style="text-align: right">1456</td><td style="text-align: center">74.39% (451565/607000)</td></tr>
<tr><td>2000</td><td style="text-align: right">0.25</td><td style="text-align: right">1808</td><td style="text-align: center">75.76% (459859/607000)</td></tr>
<tr><td>2000</td><td style="text-align: right">1</td><td style="text-align: right">2512</td><td style="text-align: center">75.24% (456692/607000)</td></tr>
<tr><td>5000</td><td style="text-align: right">0.0315</td><td style="text-align: right">3975</td><td style="text-align: center">76.14% (462163/607000)</td></tr>
<tr><td>5000</td><td style="text-align: right">0.25</td><td style="text-align: right">4590</td><td style="text-align: center">77.4% (469429/607000)</td></tr>
<tr><td>5000</td><td style="text-align: right">1</td><td style="text-align: right">7622</td><td style="text-align: center">77.27% (469025/607000)</td></tr>
</tbody>
</table>
<p class="md_block">
<span class="md_line md_line_start">Clearly, more training samples means longer training and higher accuracy; for a fixed training-set size, C = 0.25 gives the highest hit rate at a moderate cost in time, making it the most efficient choice. The parameters for training on the full data are therefore:<br /></span>
<span class="md_line">Category filter threshold: more than 3,000 titles<br /></span>
<span class="md_line">Training titles per category: 5000<br /></span>
<span class="md_line">Test titles per category: 1000<br /></span>
<span class="md_line">C: 0.25<br /></span>
<span class="md_line md_line_end">eps: 0.1</span>
</p>
Implementing Product Title Category Classification
2016-01-26T03:22:00Z
shang-pin-biao-ti-lei-mu-pan-duan-de-shi-xian
caomaocao的家
<p class="md_block">
<span class="md_line md_line_start md_line_end">Using the existing library of categorized product titles, Liblinear builds a model that classifies the titles of products whose category is unknown. This is an engineering write-up; for the choice of classifier, the parameter settings, and benchmarks, see my other post <a class="md_compiled" href="http://www.caomaocao.com/post/lei-mu-yu-pan-diao-can">Product Title Category Prediction: Parameter Tuning</a>.</span>
</p>
<h2 id="toc_0" class="h16">Project Layout:</h2>
<ul>
<li class="md_li"><span>Packages
</span></li>
</ul>
<p class="md_block md_has_block_below md_has_block_below_ul">
<span class="md_line md_line_start">train: training entry point<br /></span>
<span class="md_line">category: category pruning and mapping<br /></span>
<span class="md_line">model: data structures for the training and test sets<br /></span>
<span class="md_line">examiner: evaluation of the trained model<br /></span>
<span class="md_line md_line_end">predict: category prediction for a single product title</span>
</p>
<ul>
<li class="md_li"><span>Resources
</span></li>
</ul>
<p class="md_block md_has_block_below md_has_block_below_ul">
<span class="md_line md_line_start">vector/*: category title files; each file is named after a category code and contains that category's segmented titles<br /></span>
<span class="md_line">category_code_collect.txt: list of useful categories<br /></span>
<span class="md_line md_line_end">category_map.txt: category mapping table</span>
</p>
<ul>
<li class="md_li"><span>Outputs
</span></li>
</ul>
<p class="md_block">
<span class="md_line md_line_start">model: the trained model; see the liblinear documentation for the details<br /></span>
<span class="md_line">test_percent.txt: predicted/labeled category for every test title, plus the overall hit rate<br /></span>
<span class="md_line md_line_end">test_data.txt: the test title set</span>
</p>
<h2 id="toc_1" class="h16">Program Structure</h2>
<p class="md_block">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/oseegnz2wf369tn6hn8voevz/%E8%AE%AD%E7%BB%83%E7%B1%BB%E7%9B%AE%E5%88%86%E7%B1%BB%E5%99%A8.png" alt="训练类目分类器.png-22.1kB" title="" ></span>
</p>
<h3 id="toc_2" class="h16">1. Category Pruning and Mapping</h3>
<p class="md_block">
<span class="md_line md_line_dom_embed md_line_with_image md_line_start md_line_end"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/kena0o8t60y9fjuxbm7s0xjy/QQ%E6%88%AA%E5%9B%BE20160112115159.png" alt="QQ截图20160112115159.png-5.5kB" title="" ></span>
</p>
<h3 id="toc_3" class="h16">2. Loading Category Titles into the Training and Test Sets</h3>
<p class="md_block">
<span class="md_line md_line_start"><code>FeatureNode</code> is Liblinear's basic unit; it carries an index and a value. <code>ArrayList<ArrayList<FeatureNode>> x_trainDataMatrix</code> holds the training-set titles and <code>ArrayList<Double> trainLabelList</code> holds the training-set cids. For example:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/yvj17p6if2jzj48exymw4xil/5000.png" alt="5000.png-5.8kB" title="" ><br /></span>
<span class="md_line img_before only_img_before">is the title file for category 500019279, which we store as:<br /></span>
<span class="md_line">500019279 114344:1 127329:1 146946:1<br /></span>
<span class="md_line">500019279 4522:1 35322:1 47670:1 75745:1 94559:1 114320:1 120807:1 162660:1<br /></span>
<span class="md_line md_line_end">Note: each FeatureNode list must be sorted by ascending index. </span>
</p>
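The encoding step above can be sketched in Python; the `dictionary` argument is a hypothetical token-to-feature-index map (the post's real lookup table has 171,045 entries):

```python
def to_sparse_line(cid, tokens, dictionary):
    """Encode one segmented title as a Liblinear training line:
    '<cid> <index>:1 <index>:1 ...' with indices sorted ascending.
    dictionary is a hypothetical token -> feature-index map."""
    indices = sorted({dictionary[t] for t in tokens if t in dictionary})
    return ' '.join([str(cid)] + ['%d:1' % i for i in indices])
```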
<h3 id="toc_4" class="h16">3. Setting the Classification Parameters</h3><div class="codehilite code_lang_java highlight"><pre><span></span><span class="n">struct</span> <span class="n">parameter</span>
<span class="o">{</span>
<span class="kt">int</span> <span class="n">solver_type</span><span class="o">;</span> <span class="c1">// solver selection</span>
<span class="kt">double</span> <span class="n">eps</span><span class="o">;</span> <span class="c1">// stopping criteria </span>
<span class="kt">double</span> <span class="n">C</span><span class="o">;</span> <span class="c1">// cost of constraint</span>
<span class="kt">int</span> <span class="n">nr_weight</span><span class="o">;</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">weight_label</span><span class="o">;</span>
<span class="kt">double</span><span class="o">*</span> <span class="n">weight</span><span class="o">;</span>
<span class="kt">double</span> <span class="n">p</span><span class="o">;</span>
<span class="o">};</span>
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">L2-regularized L2-loss support vector classification was selected, with C = 0.25 and eps = 0.1 (eps = 0.01 also gave good results).</span>
</p>
<h3 id="toc_5" class="h16">4. Setting the Training Problem</h3><div class="codehilite code_lang_java highlight"><pre><span></span><span class="n">struct</span> <span class="n">problem</span>
<span class="o">{</span>
<span class="kt">int</span> <span class="n">l</span><span class="o">;</span> <span class="c1">// number of training instances, x_trainDataMatrix.size()</span>
<span class="kt">int</span> <span class="n">n</span><span class="o">;</span> <span class="c1">// number of features (dictionary size + 1 for the bias): 171045 + 1</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">y</span><span class="o">;</span> <span class="c1">// training labels, trainLabelList</span>
<span class="n">struct</span> <span class="n">feature_node</span> <span class="o">**</span><span class="n">x</span><span class="o">;</span> <span class="c1">// features as a 2-D array, x_trainDataMatrix</span>
<span class="kt">double</span> <span class="n">bias</span><span class="o">;</span>
<span class="o">};</span>
</pre></div>
<!--block_code_end--><h3 id="toc_6" class="h16">5. Training and Saving the Model</h3><div class="codehilite code_lang_java highlight"><pre><span></span><span class="n">aModel</span> <span class="o">=</span> <span class="n">Linear</span><span class="o">.</span><span class="na">train</span><span class="o">(</span><span class="n">aProblem</span><span class="o">,</span> <span class="n">aParameter</span><span class="o">);</span>
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start">Save the resulting model to a file for later evaluation:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/q7uaiv9arxajh0rs221avbr6/QQ%E6%88%AA%E5%9B%BE20160112144252.png" alt="QQ截图20160112144252.png-17.3kB" title="" ><br /></span>
<span class="md_line img_before only_img_before md_line_end">model format:</span>
</p>
<div class="codehilite code_lang_java highlight"><pre><span></span><span class="n">solver_type</span><span class="o">;</span> <span class="c1">// solver type</span>
<span class="n">nr_class</span><span class="o">;</span> <span class="c1">// number of classes</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">label</span><span class="o">;</span> <span class="c1">// class labels (category codes)</span>
<span class="kt">double</span> <span class="n">bias</span><span class="o">;</span> <span class="c1">// bias</span>
<span class="kt">int</span> <span class="n">nr_feature</span><span class="o">;</span> <span class="c1">// number of features</span>
<span class="kt">double</span> <span class="o">*</span><span class="n">w</span><span class="o">;</span> <span class="c1">// feature weights</span>
</pre></div>
<!--block_code_end--><h3 id="toc_7" class="h16">6. Evaluation</h3><h4 id="toc_8" class="h16">1. Single-Title Prediction</h4>
<p class="md_block">
<span class="md_line md_line_start">title: "苹果手机iphone6s 全新"<br /></span>
<span class="md_line">Segmented: 苹果 手机 iphone 全新<br /></span>
<span class="md_line">Dictionary lookup: 苹果:136631 手机:76330 iphone:11969 全新:39527<br /></span>
<span class="md_line">Assembled into the title vector aNodeList (sorted by index):<br /></span>
<span class="md_line md_line_end">11969:1 39527:1 76330:1 136631:1</span>
</p>
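The lookup-and-assemble step can be sketched in Python; `dictionary` is again a hypothetical token-to-index map, and out-of-dictionary tokens are simply dropped:

```python
def title_to_nodes(tokens, dictionary):
    """Build the sorted (index, value) feature-node list for one
    segmented title, as Liblinear requires (ascending index order).
    dictionary is a hypothetical token -> feature-index map."""
    return sorted((dictionary[t], 1) for t in set(tokens) if t in dictionary)
```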
<div class="codehilite code_lang_java highlight"><pre><span></span><span class="kt">double</span> <span class="n">result</span> <span class="o">=</span> <span class="n">Linear</span><span class="o">.</span><span class="na">predict</span><span class="o">(</span><span class="n">aModel</span><span class="o">,</span> <span class="n">aNodeList</span><span class="o">);</span>
</pre></div>
<!--block_code_end--><h4 id="toc_9" class="h16">2. Model Evaluation</h4><div class="codehilite code_lang_java highlight"><pre><span></span><span class="n">DefaultModelExaminer</span><span class="o">.</span><span class="na">calcTestData</span><span class="o">(</span><span class="n">TrainTestDataInfo</span> <span class="n">dataInfo</span><span class="o">,</span> <span class="n">Model</span> <span class="n">aModel</span><span class="o">)</span>
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start">Read the test set and write the hit rate to test_percent.txt:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="http://static.zybuluo.com/caomaocao/nhbx5wrui7vv2g03vdljlz5e/123.png" alt="123.png-6.9kB" title="" ><br /></span>
<span class="md_line img_before only_img_before">Format: labeled cid/predicted cid, then the title vector<br /></span>
<span class="md_line md_line_end">hit count/test-set size</span>
</p>
Solving Python 2.* Chinese Encoding Problems in a Project
2016-01-20T12:25:00Z
zai-xiang-mu-zhong-jie-jue-python2.-zhong-wen-bian-ma-wen-ti
caomaocao的家
<p class="md_block">
<span class="md_line md_line_start md_line_end">Detect search actions in user behavior logs, and from them extract the search keywords.</span>
</p>
<h2 id="toc_0" class="h16">Extracting Search Keywords</h2><h3 id="toc_1" class="h16">Reading the Logs</h3>
<p class="md_block">
<span class="md_line md_line_start">Given the huge volume of user behavior logs, Hadoop Streaming does the initial extraction, following the MapReduce model.<br /></span>
<span class="md_line">Map output:<br /></span>
<span class="md_line">user id, site type, search keyword, occurrences before the redirect, occurrences after the redirect<br /></span>
<span class="md_line">Reduce output:<br /></span>
<span class="md_line md_line_end">user id, site type,</span>
</p>
<h3 id="toc_2" class="h16">Search Engine Regular Expressions:</h3>
<p class="md_block">
<span class="md_line md_line_start">The top six search engines by pv over the sampled period:<br /></span>
<span class="md_line">baidu, haosou, sougou, bing, google, youdao<br /></span>
<span class="md_line md_line_end">Capturing the keywords users search on these six engines should cover 95% of the need. The regex:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">r'(google.+?q=|baidu.+?wd=|baidu.+?kw=|baidu.+?word=|haosou.+?q=|youdao.+?q=|sogou.+?query=|bing.+?q=)([^&]*)'</span><span class="p">)</span>
</pre></div>
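Applied to a sample URL, the pattern above captures the still percent-encoded keyword in group 2 (the Baidu URL below is a made-up example):

```python
import re

# The search-engine keyword pattern from the post; group(2) captures the
# raw (still percent-encoded) keyword, stopping at the next '&'.
pattern = re.compile(r'(google.+?q=|baidu.+?wd=|baidu.+?kw=|baidu.+?word=|'
                     r'haosou.+?q=|youdao.+?q=|sogou.+?query=|bing.+?q=)([^&]*)')
m = pattern.search('https://www.baidu.com/s?wd=%E8%8B%B9%E6%9E%9C&rsv_spt=1')
keyword = m.group(2) if m else None
```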
<!--block_code_end--><h3 id="toc_3" class="h16">Shopping Site Search Fields</h3>
<p class="md_block">
<span class="md_line md_line_start md_line_end">From inspecting the search-request URLs of the e-commerce sites, the following regex:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">r'(list\.tmall\.com/search_product\.htm\?q=|s\.taobao\.com/search\?q=|search\.jd\.com/Search\?keyword=|search\.suning\.com/|search\.gome\.com\.cn/search\?question=)([^&]*)'</span><span class="p">)</span>
</pre></div>
<!--block_code_end--><h4 id="toc_4" class="h16">Chinese Encoding inside URLs</h4>
<p class="md_block">
<span class="md_line md_line_start">For example, the Taobao search url for “苹果手机6p”: <code>https://s.taobao.com/search?q=%E8%8B%B9%E6%9E%9C%E6%89%8B%E6%9C%BA6p&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.7724922.8452-taobao-item.2&initiative_id=tbindexz_20150814</code><br /></span>
<span class="md_line">The UTF-8 encoding of that Unicode keyword is: <code>\xE8\x8B\xB9\xE6\x9E\x9C\xE6\x89\x8B\xE6\x9C\xBA\x36\x70</code><br /></span>
<span class="md_line">and its percent-escaped UTF-8 form inside the URL:<br /></span>
<span class="md_line md_line_dom_embed"><code>%E8%8B%B9%E6%9E%9C%E6%89%8B%E6%9C%BA6p</code><br /></span>
<span class="md_line md_line_end">So, to recover the keyword, the percent-escaped form must first be turned back into plain UTF-8 bytes, using python's urllib package:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="n">keyword_utf8</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">unquote</span><span class="p">(</span><span class="n">keyword_url_utf8</span><span class="p">)</span>
</pre></div>
<!--block_code_end-->
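On Python 3 the same call lives in `urllib.parse`; decoding the keyword captured above:

```python
from urllib.parse import unquote, unquote_plus

# Percent-decode the keyword from the Taobao URL above (Python 3).
print(unquote('%E8%8B%B9%E6%9E%9C%E6%89%8B%E6%9C%BA6p'))               # 苹果手机6p
# In query strings '+' stands for a space between keywords:
print(unquote_plus('%E8%BF%9E%E8%A1%A3%E8%A3%99+%E6%B3%A2%E7%82%B9'))  # 连衣裙 波点
```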
<p class="md_block">
<span class="md_line md_line_start md_line_end">At this point a utf-8 keyword displays correctly in a terminal configured for utf-8; the real situation, however, is messier.</span>
</p>
<p class="md_block">
<span class="md_line md_line_start">A quick refresher on Chinese encodings: Chinese text may be encoded as GB2312/GBK, or as Unicode serialized in UTF-8 or UTF-16. So how do we know which encoding the keyword pulled out of the url actually uses, so that it can be decoded correctly?<br /></span>
<span class="md_line">In the vast majority of normal searches the keyword in the url is UTF-8, but there are odd cases, such as strange redirects into taobao search, where the keyword comes out GBK-encoded, and Tmall encodes its keywords in GBK. So no reliable table mapping site and situation to keyword encoding can be drawn up; the only option is to inspect the bytes and guess the most likely encoding.<br /></span>
<span class="md_line md_line_end">Python's chardet package, a port of Mozilla's character-set detector, guesses the most likely encoding of a byte string. Usage:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="n">chardet</span><span class="o">.</span><span class="n">detect</span><span class="p">(</span><span class="n">keyword_in_url</span><span class="p">)</span>
</pre></div>
<!--block_code_end-->
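If pulling in chardet is not an option, a cruder but deterministic fallback, matching the logic of the branching code below, is to try the candidate encodings in a fixed order (a Python 3 sketch operating on bytes):

```python
def decode_keyword(raw_bytes):
    """Try UTF-8 first, then GBK; return None when neither fits.
    A deterministic stand-in for the chardet-based branching."""
    for enc in ('utf-8', 'gbk'):
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    return None
```

Order matters here: GBK will happily decode almost any byte pair, so UTF-8 must be tried first. And as the mojibake incident later in this post shows, any such guessing can still be fooled.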
<p class="md_block">
<span class="md_line md_line_start">It returns the most likely encoding of the string together with a confidence value; based on that, UTF-8, GBK, and ASCII strings are each decoded accordingly.<br /></span>
<span class="md_line md_line_end">In code:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="n">keyword_str</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">unquote</span><span class="p">(</span><span class="n">keyword_str</span><span class="p">)</span>
<span class="n">keyword_in_ref</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">unquote</span><span class="p">(</span><span class="n">keyword_in_ref</span><span class="p">)</span>
<span class="k">if</span> <span class="n">chardet</span><span class="o">.</span><span class="n">detect</span><span class="p">(</span><span class="n">keyword_in_ref</span><span class="p">)[</span><span class="s1">'encoding'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">'utf-8'</span><span class="p">:</span>
<span class="n">keyword_in_ref</span> <span class="o">=</span> <span class="n">keyword_in_ref</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'+'</span><span class="p">,</span><span class="s1">' '</span><span class="p">)</span>
<span class="k">print</span> <span class="n">guid</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="n">web_type_ref</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="n">keyword_in_ref</span> <span class="o">+</span> <span class="s1">'</span><span class="se">\t</span><span class="s1">'</span> <span class="o">+</span> <span class="s1">'1'</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="s1">'0'</span>
<span class="k">pass</span>
<span class="k">elif</span> <span class="n">chardet</span><span class="o">.</span><span class="n">detect</span><span class="p">(</span><span class="n">keyword_in_ref</span><span class="p">)[</span><span class="s1">'encoding'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">'ascii'</span><span class="p">:</span>
<span class="n">keyword_in_ref</span> <span class="o">=</span> <span class="n">keyword_in_ref</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'+'</span><span class="p">,</span><span class="s1">' '</span><span class="p">)</span>
<span class="k">print</span> <span class="n">guid</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="n">web_type_ref</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="n">keyword_in_ref</span> <span class="o">+</span> <span class="s1">'</span><span class="se">\t</span><span class="s1">'</span> <span class="o">+</span> <span class="s1">'1'</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="s1">'0'</span>
<span class="k">pass</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">keyword_in_ref</span> <span class="o">=</span> <span class="n">keyword_in_ref</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'+'</span><span class="p">,</span><span class="s1">' '</span><span class="p">)</span>
<span class="n">keyword_in_ref</span> <span class="o">=</span> <span class="n">keyword_in_ref</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s1">'gbk'</span><span class="p">)</span>
<span class="k">print</span> <span class="n">guid</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="n">web_type_ref</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="n">keyword_in_ref</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">)</span> <span class="o">+</span> <span class="s1">'</span><span class="se">\t</span><span class="s1">'</span> <span class="o">+</span> <span class="s1">'1'</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="s1">'0'</span>
<span class="k">except</span> <span class="ne">UnicodeDecodeError</span><span class="p">:</span>
<span class="k">pass</span>
</pre></div>
<!--block_code_end-->
<p class="md_block">
<span class="md_line md_line_start md_line_end">Users habitually separate search terms with spaces, e.g. 连衣裙 波点 becomes <code>%E8%BF%9E%E8%A1%A3%E8%A3%99+%E6%B3%A2%E7%82%B9</code>: the plus sign stands for the space, so before converting back to Chinese the plus signs must be replaced with spaces.</span>
</p>
<h2 id="toc_5" class="h16">Persisting the Search Keywords</h2>
<p class="md_block">
<span class="md_line md_line_start">An error came up while importing into Mongodb.<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="/_image/2016-01-21/22-41-07.jpg" alt="Image" title="" ><br /></span>
<span class="md_line img_before only_img_before">Fortunately the message is clear: the Mongodb input contained bytes that are not valid utf-8. Locating the bad record:<br /></span>
<span class="md_line">some user id%密封箱 食品%密封箱%气袋%酸奶杯%气垫床医用%气垫床%酸奶机˫᪸˾�<br /></span>
<span class="md_line">The last keyword is mojibake.<br /></span>
<span class="md_line">The simplest way to get mojibake is to decode bytes of one encoding with the rules of another: a UTF-8 byte string decoded as GBK will most likely come out as obscure characters or garbage. Tracking the keyword back to the original log record:<br /></span>
<span class="md_line">the keyword in the ref field is 8 bytes long, so it is clearly GBK (4 characters at 2 bytes each); in fact it is the GBK encoding of “双岐杆菌”. Yet according to chardet:<br /></span>
<span class="md_line md_line_dom_embed md_line_with_image"><img class="md_compiled " src="/_image/2016-01-21/22-42-35.jpg" alt="Image" title="" ><br /></span>
<span class="md_line img_before only_img_before">it was UTF-8 with 87.6% confidence, so my program decoded it as UTF-8 and produced the mojibake.<br /></span>
<span class="md_line">Workaround:<br /></span>
<span class="md_line md_line_end">Improving chardet's detection algorithm is out of scope for now, so the fallback is to catch the exception raised when importing into mongodb:</span>
</p>
<div class="codehilite code_lang_python highlight"><pre><span></span><span class="k">try</span><span class="p">:</span>
<span class="n">collection</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span> <span class="c1"># the Mongodb import step (collection/doc are placeholders)</span>
<span class="k">except</span> <span class="n">bson</span><span class="o">.</span><span class="n">errors</span><span class="o">.</span><span class="n">InvalidStringData</span><span class="p">:</span>
<span class="k">pass</span>
</pre></div>
<!--block_code_end-->