How to scrape news websites with Python 3

Source: www.zuowenzhai.com  Author: Editor  Date: 2024-06-02

How to scrape specific links from a web page with Python 3

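The usual approach is to pull the relevant links out of the page with a regular expression,
then fetch the content of each of those URLs.
The urllib.request library is normally used for the fetching.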
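
A minimal sketch of that approach, using only the standard library. The link pattern and the sample HTML are made up for illustration, and extract_links and fetch are hypothetical helper names, not part of any library.

```python
import re
import urllib.request

# Hypothetical pattern: absolute links to pages under /article/ on example.com.
ARTICLE_LINK = re.compile(r'href="(https?://example\.com/article/\w+)"')


def extract_links(html):
    """Return the distinct article links found in html, in order of appearance."""
    seen = []
    for url in ARTICLE_LINK.findall(html):
        if url not in seen:
            seen.append(url)
    return seen


def fetch(url, timeout=10):
    """Download url and decode the body as UTF-8 (an assumed encoding)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demonstration on a made-up page fragment.
sample = '''
<a href="https://example.com/article/101">First</a>
<a href="https://example.com/about">About</a>
<a href="https://example.com/article/102">Second</a>
<a href="https://example.com/article/101">First again</a>
'''
links = extract_links(sample)
print(links)
```

On a real site, calling fetch(link) for each extracted link would return that page's HTML for further processing.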

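Method 1

Find the XHR requests the page makes and work out the actual request parameters. Then call that request directly to get the search results it returns, and tidy up the data.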
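
A rough sketch of Method 1, assuming the XHR seen in the browser's developer tools is a GET request that returns JSON. The endpoint, the parameter names (q, page) and the response fields (items, title, url) are all hypothetical; a real site will use its own.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, as it might appear in the browser's Network tab (XHR filter).
API_URL = "https://example.com/api/news/search"


def build_request(keyword, page=1):
    """Build the GET request that the page itself would send for a search."""
    query = urllib.parse.urlencode({"q": keyword, "page": page})
    return urllib.request.Request(
        API_URL + "?" + query,
        headers={
            "User-Agent": "Mozilla/5.0",
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def parse_items(body):
    """Keep only the fields of interest from the (assumed) JSON response."""
    data = json.loads(body)
    return [(item["title"], item["url"]) for item in data.get("items", [])]


def search(keyword, page=1):
    """Send the request and return the tidied results."""
    with urllib.request.urlopen(build_request(keyword, page), timeout=10) as resp:
        return parse_items(resp.read().decode("utf-8"))


# Offline check with a made-up response body.
sample_body = '{"items": [{"title": "Hello", "url": "https://example.com/a/1", "extra": 0}]}'
print(build_request("python").full_url)
print(parse_items(sample_body))
```

Calling the data endpoint directly avoids parsing HTML at all, which is why this method is usually tried first when a page loads its content with JavaScript.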

Method 2

Simulate a real browser, for example with the Selenium module.
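
A sketch of Method 2, assuming Selenium 4 and a local Chrome installation are available. fetch_rendered_html is a hypothetical helper name; waiting for the body tag is only a placeholder condition, and a real scraper would wait for the element that holds the news content.

```python
def fetch_rendered_html(url, wait_seconds=10):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so that this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the page body exists before reading the source.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned HTML can be parsed in the same way as a page fetched with urllib.request, for example html = fetch_rendered_html("https://example.com/"). This is slower than Method 1 but works when the request parameters are hard to reproduce.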

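八爪鱼采集器 is a web data collection tool that can help you scrape news site data quickly and efficiently. The general steps for scraping a news site with Python 3 are as follows:

1. Import the required libraries, such as requests and BeautifulSoup.
2. Use requests to send an HTTP request and get the HTML source of the news site.
3. Use BeautifulSoup to parse the HTML source and extract the news data you need.
4. Based on the page structure of the site, use CSS selectors or XPath expressions to locate and extract the headline, body, publication time and other fields.
5. Save the extracted data to a local file or a database for later analysis and use.

Note that scraping with Python must comply with the applicable laws and with the site's terms of use, and must not put excessive load on the site. Some news sites also take anti-scraping measures, and you may need countermeasures to deal with them.

If you want a quicker and more convenient way to collect news site data, 八爪鱼采集器 is recommended. It offers smart detection and custom collection rules, which help you collect news data quickly and accurately. 八爪鱼 news collection covers more than 100,000 information sources across the web, collects on the order of a million records per day, and can sync results to an enterprise database within seconds. See the official website for details.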
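
The five steps above can be sketched roughly as follows, assuming requests and beautifulsoup4 are installed. The CSS selectors (h1.title, span.time, div.content p) describe a made-up page layout and would need to be adapted to the real site; the function names are hypothetical.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Steps 1-2: request the page and return its HTML."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    response.encoding = response.apparent_encoding  # guess the page encoding
    return response.text


def parse_news(html):
    """Steps 3-4: extract title, time and body with CSS selectors (assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1.title").get_text(strip=True)
    published = soup.select_one("span.time").get_text(strip=True)
    paragraphs = [p.get_text(strip=True) for p in soup.select("div.content p")]
    return {"title": title, "time": published, "content": "\n".join(paragraphs)}


def save_news(news, path):
    """Step 5: append one article to a UTF-8 text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{news['title']}\n{news['time']}\n{news['content']}\n\n")


# Offline demonstration on a made-up page.
sample_html = """
<html><body>
  <h1 class="title">Sample headline</h1>
  <span class="time">2024-06-02</span>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""
news = parse_news(sample_html)
print(news)
```

For a live page the calls would be chained as save_news(parse_news(fetch_html(url)), "news.txt"), ideally with a pause between requests to keep the load on the site low.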

Requirement:

Scrape news from a portal site and save each article's title, author, time and content to a local txt file.

Python modules used:

import re  # regular expressions
import bs4  # Beautiful Soup 4, the HTML parsing module
import urllib.request  # network access (urllib2 in Python 2)
import News  # user-defined news article structure

In Python 3 the built-in open() takes an encoding argument and strings are Unicode by default, so the codecs module and the reload(sys) / sys.setdefaultencoding('utf-8') workaround needed in Python 2 are no longer required.

bs4 is not part of the standard library and has to be installed separately. For how to install it, see: "Installing Python whl packages with pip from the Windows command line".

Program:

# coding=utf-8
import re  # regular expressions
import urllib.request  # network access (urllib2 in Python 2)

import bs4  # Beautiful Soup 4, the HTML parsing module
import News  # user-defined news article structure

# Link matching rule for article pages (raw string, so \w is not a string escape)
pattern = re.compile(r'http://\w+\.baijia\.baidu\.com/article/\w+')


# Collect all article links from the home page
def GetAllUrl(home):
    html = urllib.request.urlopen(home).read().decode('utf8')
    soup = bs4.BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a', href=pattern):
        url_set.add(link['href'])


# Crawl queued links until none remain or MaxNewsCount articles are saved
def GetNews():
    global NewsCount  # global count of saved articles
    while url_set and NewsCount < MaxNewsCount:
        # Take the next link and mark it as crawled
        url = url_set.pop()
        url_old.add(url)
        try:
            # Download and parse the page
            html = urllib.request.urlopen(url).read().decode('utf8')
            soup = bs4.BeautifulSoup(html, 'html.parser')

            # Queue the article links on this page that have not been crawled yet
            for link in soup.find_all('a', href=pattern):
                if link['href'] not in url_old:
                    url_set.add(link['href'])

            # Extract the article fields
            article = News.News()
            article.url = url  # URL
            page = soup.find('div', {'id': 'page'})
            article.title = page.find('h1').get_text()  # title
            info = page.find('div', {'class': 'article-info'})
            article.author = info.find('a', {'class': 'name'}).get_text()  # author
            article.date = info.find('span', {'class': 'time'}).get_text()  # date
            article.about = page.find('blockquote').get_text()  # summary
            pnode = page.find('div', {'class': 'article-detail'}).find_all('p')
            article.content = ''
            for node in pnode:  # one line per paragraph
                article.content += node.get_text() + '\n'

            SaveNews(article)
        except Exception as e:
            # Skip pages that fail to download or lack the expected layout
            print(e)
        else:
            print(article.title)
            NewsCount += 1
            print(NewsCount)


# Append one article to the output file
def SaveNews(article):
    # The two field separators were lost in the source; "\t" is an assumption
    file.write("【" + article.title + "】" + "\t")
    file.write(article.author + "\t" + article.date + "\n")
    file.write(article.content + "\n" + "\n")


url_set = set()  # links waiting to be crawled
url_old = set()  # links already crawled

NewsCount = 0
MaxNewsCount = 3

home = 'http://baijia.baidu.com/'  # starting point

GetAllUrl(home)

# Open the output file in append mode; the path is kept as given in the source
with open(r"D:\est.txt", "a+", encoding="utf-8") as file:
    GetNews()  # returns once enough articles are saved or no links remain

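News article structure

# coding: utf-8
# Article class definition (saved as News.py so that "import News" works)
class News:
    def __init__(self):
        self.url = None  # page URL
        self.title = None  # headline
        self.author = None  # author name
        self.date = None  # publication time
        self.about = None  # summary
        self.content = None  # body text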

The program keeps a count of the number of articles it has scraped.

(Editor: 顾馨楠)