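Python Crawlers and Data Mining
If you plan to write a crawler in Python, these are the modules worth studying:
1 The built-in urllib and urllib2 libraries, for fetching data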
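A minimal sketch of fetching a page with the standard library. urllib2 exists only in Python 2; in Python 3 its functionality lives in urllib.request, which is what this sketch assumes. The helper name fetch, the User-Agent string, and the default timeout are illustrative choices, not something from the original notes.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw bytes of the response body for url."""
    # Some sites reject the default Python User-Agent, so send an explicit one.
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "example-crawler/0.1"},  # hypothetical value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

fetch returns undecoded bytes. Working out which encoding those bytes are in is a separate step, and it is what the remaining tools in this list help with.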
2 BeautifulSoup, for cleaning up the data
http://www.crummy.com/software/BeautifulSoup/
Encoding rules
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:
1 An encoding you pass in as the fromEncoding argument to the soup constructor.
2 An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
3 An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
4 An encoding sniffed by the chardet library, if you have it installed.
5 UTF-8
6 Windows-1252
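The priority order above can be illustrated with a small standard-library sketch. This is not Beautiful Soup's actual implementation: it only covers steps 1, 3 (reduced to a byte-order-mark check), 5 and 6, and it leaves out the in-document declaration (step 2) and chardet (step 4). The names candidate_encodings and to_unicode are made up for this example.

```python
import codecs

# Byte-order marks for the simplified step 3. The UTF-32 marks come first
# because the UTF-32-LE mark begins with the same bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def candidate_encodings(data, explicit=None):
    """Yield encodings to try, in priority order."""
    if explicit:                 # step 1: encoding supplied by the caller
        yield explicit
    for bom, name in _BOMS:      # step 3 (simplified): byte-order mark
        if data.startswith(bom):
            yield name
    yield "utf-8"                # step 5
    yield "windows-1252"         # step 6


def to_unicode(data, explicit=None):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidate_encodings(data, explicit):
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode anyway, replacing bytes that have no mapping.
    return data.decode("windows-1252", errors="replace"), "windows-1252"
```

With this ordering, GBK bytes passed without an explicit encoding are not valid UTF-8, so they fall through to Windows-1252 and come out as mojibake. That is why passing the encoding explicitly matters for Chinese pages.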
You can pass the fromEncoding argument when constructing BeautifulSoup:
soup = BeautifulSoup(gbk_html, fromEncoding="gbk")  # gbk_html: raw GBK-encoded bytes of the page
3 The Python chardet library, for detecting character encodings
http://chardet.feedparser.org/download/
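A small sketch of asking chardet for a guess. chardet is a third-party package, so the helper below (guess_encoding, a name made up for this example) returns None when it is not installed rather than failing.

```python
def guess_encoding(data):
    """Return chardet's best guess for data (bytes), or None if unavailable."""
    try:
        import chardet  # third-party; may not be installed
    except ImportError:
        return None
    # detect() returns a dict that includes 'encoding' and 'confidence' keys.
    result = chardet.detect(data)
    return result["encoding"]
```

The guess is statistical, so it tends to be more reliable on longer inputs. Checking the 'confidence' value before trusting the result is a reasonable precaution.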
4 selenium, a more powerful option
