Posts tagged 'spider'
When reposting, please credit the original source and author. Author: lostsnow. http://www.lsproc.com/blog/python_spider/

#coding=utf-8
import sys
import urllib2
import gzip
import StringIO

# Page URL
url = "http://china.toocle.com/company/show/pdetail--1000436--10532651.html"
# Page encoding
page_encode = "gbk"

# Ask the server for a gzip-compressed response
request = urllib2.Request(url)
request.add_header("Accept-encoding", "gzip")
usock = urllib2.urlopen(request)
page = usock.read()

# Decompress the page if the server gzipped it
if usock.headers.get('content-encoding', None) == 'gzip':
    page = gzip.GzipFile(fileobj=StringIO.StringIO(page)).read()

# Convert to unicode (gbk/utf8)
if not isinstance(page, unicode):
    page = unicode(page, page_encode)

print(page)

-- [...]
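The excerpt above relies on Python 2 modules (urllib2, StringIO) and the unicode type, none of which exist in Python 3. As a rough sketch only, the same three steps (request with gzip allowed, decompress if needed, decode to text) might look like this in Python 3. The function names fetch_page and decode_page are hypothetical and not from the original post, and the default "gbk" charset is simply carried over from the excerpt.

```python
import gzip
import urllib.request


def decode_page(raw, content_encoding=None, charset="gbk"):
    """Undo gzip compression if the server applied it, then decode bytes to str.

    Hypothetical helper, not part of the original post.
    """
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_page(url, charset="gbk"):
    """Request a page, telling the server gzip is acceptable, and return its text.

    Hypothetical helper; the caller supplies the charset, as in the excerpt.
    """
    request = urllib.request.Request(url)
    request.add_header("Accept-Encoding", "gzip")
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        content_encoding = response.headers.get("Content-Encoding")
    return decode_page(raw, content_encoding, charset)
```

Keeping the decompress-and-decode step in its own function means it can be checked without network access, for example by passing it bytes produced with gzip.compress. Calling fetch_page with the URL from the excerpt would require that the page is still reachable.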
Category: Program&Database
