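I have been playing with scrapy lately. One day, while browsing Douban, I came across an activity: #我的欲望清单 (My Wish List). I got curious about what everyone's wish lists look like, so I decided to send out a crawler to collect every reply to the activity, then extract the keywords, count their frequencies, and generate a word cloud. The code has been uploaded to doubanOnlineAnalyzer.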

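Crawling the activity replies

Installing scrapy

First, we need to install scrapy. I will not go through the installation here; see the platform-specific installation guide.

Creating a project

Next, we need to create a project. Go to the directory where you want to keep the project and run the following command:

scrapy startproject doubanScrapy

When the command finishes, it will have created a directory with the following contents: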

doubanScrapy/
    scrapy.cfg
    doubanScrapy/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

## Defining the Item

Next, we define the container that holds the scraped data. (If you use django, you will notice that an Item looks a lot like a django model.) Edit doubanScrapy/doubanScrapy/items.py:

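import scrapy

# The class name must match the name imported by the spider
class DoubanOnlineItem(scrapy.Item):
    link = scrapy.Field()     # link to the reply
    content = scrapy.Field()  # content of the reply
    author = scrapy.Field()   # author of the reply
    date = scrapy.Field()     # date of the reply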

## Writing the spider

Create a new file doubanOnline.py under doubanScrapy/doubanScrapy/spiders/ and edit it as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.exceptions import CloseSpider

from doubanScrapy.items import DoubanOnlineItem


class doubanOnlineSpider(scrapy.Spider):
    name = "doubanOnline"
    allowed_domains = ["douban.com"]

    def __init__(self, online1st=None, *args, **kwargs):
        super(doubanOnlineSpider, self).__init__(*args, **kwargs)
        self.start_urls = [online1st]  # start_urls is set from the command-line argument

    def parse(self, response):
        # Parse the reply with XPath and CSS selectors
        item = DoubanOnlineItem()
        item['link'] = response.url
        content = response.xpath('//blockquote[@class="photo-text"]/p/text()').extract()
        item['content'] = content[0].strip() if content else ""
        item['author'] = response.xpath('//div[@class="photo-ft"]/a/text()').extract()[0]
        item['date'] = response.css('div[class="photo-ft"]::text').extract()[-1].strip()[3:]
        yield item

        # The page counter is split on "/" into the current and total parts
        counter = response.xpath('//span[@class="ll"]/text()').extract()[0]
        cur, total = counter.split("/")
        print("total: %s, current: %s" % (total[1:-1], cur[1:-1]))
        if cur[1:-1] == total[1:-1]:  # reached the last page, stop
            raise CloseSpider('------------------ End Search! ------------------')
        next_url = response.xpath('//a[@name="next_photo"]/@href').extract()[0]
        yield scrapy.Request(next_url, callback=self.parse)
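The last-page check in the spider relies on slicing fixed character positions out of the counter text. A slightly more forgiving way to do the same check could look like the sketch below. It is only an illustration: the helper names `parse_counter` and `is_last_page` are hypothetical, and the sample counter text is an assumption about what the page shows, not something verified against Douban.

```python
import re


def parse_counter(text):
    """Return (current, total) as integers from a 'current / total' counter string."""
    numbers = re.findall(r"\d+", text)
    if len(numbers) < 2:
        raise ValueError("no 'current / total' pair found in %r" % text)
    return int(numbers[0]), int(numbers[1])


def is_last_page(text):
    """True when the current index has reached the total."""
    current, total = parse_counter(text)
    return current >= total


# Hypothetical counter text; the real markup may differ.
print(parse_counter("第3张 / 共120张"))
```

Pulling out the digits with a regular expression keeps working if the surrounding characters or whitespace change, which fixed slices such as `[1:-1]` would not.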

## Changing the settings

Some sites take measures against crawlers, so we need to change the crawler's settings to make it look like an ordinary browser. Edit doubanScrapy/doubanScrapy/settings.py and set:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
DOWNLOAD_DELAY = 2

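## Running the spider

Running the spider is simple: cd into the project directory and run the following command:

scrapy crawl doubanOnline -a online1st=[first page of the Douban online activity] -o onlineac.json

Notes:

* doubanOnline here is the name defined in the spider.
* -a passes an argument to the spider. Here we need to give it the first page of an online activity.
* -o uses the feed export and sets the name of the generated json file. This way there is no need for a Pipeline: the items yielded by the spider are saved directly in JSON format.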
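One thing to watch when reading the exported file back: scrapy picks the export format from the file extension, so a `.json` file holds a single JSON array, while a `.jl` (JSON Lines) file holds one object per line. The sketch below is a minimal loader that accepts either layout. The function name `load_items` is hypothetical and it uses only the standard library.

```python
import json


def load_items(path):
    """Load scraped items from a JSON array file or a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if not text:
        return []
    if text.startswith("["):
        # Whole file is one JSON array
        return json.loads(text)
    # Otherwise treat every non-empty line as its own JSON object
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Calling `load_items("onlineac.json")` would then return a list of dictionaries with the `link`, `content`, `author` and `date` keys defined in the Item.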

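Side notes

By default, Python's json.dumps escapes non-ASCII characters, producing output such as "\u4e2d\u6587". To output the Chinese characters themselves, set the ensure_ascii parameter to False, as in this snippet:

json.dumps({'text':"中文"},ensure_ascii=False,indent=2)

A JSON file generated by scrapy's feed export has the same problem: any Chinese text in it is written as \uXXXX escape sequences. One way around this is to patch the scrapy source, as described in the post on fixing escaped Chinese text in scrapy's JSON feed export.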
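To see that the escaped and the readable forms carry the same data, here is a small standard-library check. It is only a demonstration of `ensure_ascii`, not part of the crawler.

```python
import json

record = {"text": "中文"}

escaped = json.dumps(record)                       # default: non-ASCII is escaped
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as they are

print(escaped)   # {"text": "\u4e2d\u6587"}
print(readable)  # {"text": "中文"}

# Both strings decode back to the same dictionary
assert json.loads(escaped) == json.loads(readable) == record
```

So the escaped file is still valid and loads correctly; it is just hard to read by eye. If your scrapy version supports the `FEED_EXPORT_ENCODING` setting, adding `FEED_EXPORT_ENCODING = 'utf-8'` to settings.py should give readable output without touching the scrapy source, though I have not checked this against the version used here.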

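Keyword visualization

The process is: word segmentation -> word frequency counting -> word cloud generation

Word segmentation

Most third-party word segmentation modules are designed for English. For Chinese, we need to bring out the big gun: jieba.

Installing jieba only takes: pip install jieba. Using it is just as simple. jieba supports three segmentation modes: full mode, accurate mode, and search engine mode. Here we use full mode, like this:

linewc = jieba.cut(resp["content"], cut_all=True)

## Word frequency counting

Since we have already saved the content of every reply, all we need to do is tally the segmentation results into a dictionary with words as keys and occurrence counts as values. Putting the first and second steps together gives the following code: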

import io
import json

import jieba


def _genWordCount(filename, nonsense=()):
    ''' Segment the text of every reply and count how often each word occurs.
        filename - name of the file generated by scrapy; this code expects
                   one JSON object per line.
        nonsense - words to exclude from the count.
    '''
    wc = {}  # dictionary with words as keys and occurrence counts as values
    with io.open(filename, "r", encoding="utf-8") as f:
        for line in f:
            resp = json.loads(line)  # parse the line as JSON
            linewc = jieba.cut(resp["content"], cut_all=True)  # segment into words
            for item in linewc:
                if item in nonsense:  # skip the words listed in nonsense
                    continue
                if item.strip():  # ignore empty and whitespace-only tokens
                    wc[item] = wc.get(item, 0) + 1  # count
    return wc
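The counting step on its own can also be written with `collections.Counter` from the standard library. The sketch below assumes the text has already been segmented into lists of tokens (for example by jieba); the function name `count_words` and the sample tokens are made up for illustration.

```python
from collections import Counter


def count_words(token_lists, stopwords=()):
    """Count tokens across several token lists, skipping blanks and stopwords."""
    stop = set(stopwords)
    counts = Counter()
    for tokens in token_lists:
        counts.update(t for t in tokens if t.strip() and t not in stop)
    return counts


# Hypothetical, already-segmented replies
replies = [["我", "想", "旅行"], ["想", " ", "旅行"]]
print(count_words(replies, stopwords=["我"]))
```

A `Counter` behaves like the plain dictionary used above, so it can be passed on to the later steps unchanged.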

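## Word cloud generation

PyTagCloud can generate the word cloud for us. First, install PyTagCloud:

pip install pytagcloud

* Note: you may also need to install simplejson.

Since we usually only care about the most frequent words, we need to sort the frequency dictionary and then display only the top most frequent words together with their frequencies.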
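Picking the most frequent words can be sketched with the standard library alone. The helper name `top_words` and the sample dictionary are hypothetical; `heapq.nlargest` is used here as one possible way to avoid sorting the whole dictionary when only a few entries are needed.

```python
import heapq
from operator import itemgetter


def top_words(wc, top=50):
    """Return the `top` (word, count) pairs with the highest counts, highest first."""
    return heapq.nlargest(top, wc.items(), key=itemgetter(1))


# Hypothetical frequency dictionary
print(top_words({"旅行": 5, "读书": 3, "睡觉": 1}, top=2))
```

For a dictionary of a few thousand words, a full `sorted(...)` followed by a slice is just as good; the result has the same shape either way.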

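Also, pytagcloud does not ship with a Chinese font, so Chinese text will not be rendered. We therefore need to download a Chinese font file (any Chinese font you like; I chose Microsoft YaHei, with the file name yahei.ttf). Put the font file under Lib\site-packages\pytagcloud\fonts in the python installation directory, then add a record to the fonts.json file in that same directory: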

[
	{
		"name": "MSyh",
		"ttf": "yahei.ttf",
		"web": "none"
	},
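If you would rather not edit fonts.json by hand, the record can be added with a few lines of standard-library code. This is a sketch under assumptions: it treats fonts.json as a JSON array of objects with `name`, `ttf` and `web` keys, as in the fragment above, and the function name `register_font` is hypothetical. You have to pass in the path to your own fonts.json.

```python
import json


def register_font(fonts_json_path, name, ttf_filename):
    """Add a font record to fonts.json unless one with the same name exists.

    Returns True if a record was added, False if it was already there.
    """
    with open(fonts_json_path, encoding="utf-8") as f:
        fonts = json.load(f)
    if any(font.get("name") == name for font in fonts):
        return False
    # Same shape as the hand-written record: name, ttf file, no web font
    fonts.insert(0, {"name": name, "ttf": ttf_filename, "web": "none"})
    with open(fonts_json_path, "w", encoding="utf-8") as f:
        json.dump(fonts, f, indent=2, ensure_ascii=False)
    return True
```

Running it twice is harmless, since the second call finds the existing record and leaves the file alone.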

With that in place, Chinese text is displayed correctly. The complete code for generating the word cloud image is as follows:

from operator import itemgetter

from pytagcloud import create_tag_image, make_tags
#import random


def genWdCloud(wc, top=50, SIZE1=1, SIZE2=100, pngname='doubanOnlineActivity.png'):
    ''' Generate a word cloud image from the word frequencies.
        wc - dictionary mapping each word to its frequency
        top - number of most frequent words to display
        SIZE1 - minimum font size of the words
        SIZE2 - maximum font size of the words
        pngname - file name of the generated image
    '''
    # Sort by the dictionary values (the frequencies), highest first
    swc = sorted(wc.items(), key=itemgetter(1), reverse=True)
    tags = make_tags(swc[:top],  # keep only the top most frequent words
                     minsize=SIZE1,
                     maxsize=SIZE2,
                     #colors=random.choice(COLOR_SCHEMES.values())
                     )
    create_tag_image(tags,
                     pngname,
                     background=(0, 0, 0, 255),
                     size=(900, 600),
                     fontname="MSyh")

You will find an image in the current directory with the name given by pngname.

Next steps

The HTML output of pytagcloud
How can this be extended?

