python怎么爬取网站上的数据库 python爬取网页内容并保存到数据库

开发者资讯
2025-05-13
编辑

　　在数据时代，数据分析越来越受到重视，而数据的获取则成为了数据分析中重要的一环。Python作为一种强大的编程语言，提供了丰富的库和工具来实现网页数据的爬取与存储。小编将详细介绍如何使用Python爬取网页内容，并将其保存到数据库中，以MySQL和MongoDB为例进行说明。

　　一、准备工作

　　1. 安装必要的库

　　首先，需要安装一些Python库来帮助我们完成爬取和存储任务。常用的库包括requests、BeautifulSoup、pymysql(用于MySQL)和pymongo(用于MongoDB)。

　　pip install requests beautifulsoup4 pymysql pymongo

　　2. 数据库准备

　　MySQL

　　创建数据库和表

　　CREATE DATABASE baby_info;

　　USE baby_info;

　　CREATE TABLE mamawang_info (

　　id bigint(20) NOT NULL AUTO_INCREMENT,

　　title varchar(255) DEFAULT NULL,

　　href varchar(255) DEFAULT NULL,

　　content text,

　　imgs varchar(255) DEFAULT NULL,

　　PRIMARY KEY (id)

　　) ENGINE=InnoDB AUTO_INCREMENT=627 DEFAULT CHARSET=utf8;

　　连接数据库

　　import pymysql.cursors

　　connect = pymysql.Connect(

　　host='localhost',

　　port=3306,

　　user='root',

　　passwd='admin',

　　db='baby_info',

　　charset='utf8'

　　)

　　MongoDB

　　连接数据库

　　import pymongo

　　myclient = pymongo.MongoClient('localhost', 27017)

　　mydb = myclient['webpages']

　　dblist = myclient.list_database_names()

　　if "webpages" in dblist:

　　print("该数据库存在")

　　mycol = mydb['gov.publicity']

360截图20250426224640574.png

　　二、爬取网页内容

　　1. 使用requests模块获取网页源代码

　　import requests

　　url = 'http://www.mama.cn/z/t1183/'

　　headers = {

　　'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'

　　}

　　response = requests.get(url, headers=headers)

　　2. 使用BeautifulSoup解析网页

　　from bs4 import BeautifulSoup

　　soup = BeautifulSoup(response.text, "html.parser")

　　div = soup.find(class_='list-left')

　　3. 提取所需数据

　　# 示例：提取文章标题和链接

　　articles = div.find_all('a')

　　for article in articles:

　　title = article.get_text()

　　href = article['href']

　　print(title, href)

　　三、将数据保存到数据库

　　1. 保存到MySQL

　　def save_to_mysql(title, href, content, imgs):

　　cursor = connect.cursor()

　　sql = "ｉｎｓｅｒｔ INTO mamawang_info (title, href, content, imgs) VALUES (%s, %s, %s, %s)"

　　cursor.execute(sql, (title, href, content, imgs))

　　connect.commit()

　　cursor.close()

　　# 示例调用

　　save_to_mysql('示例标题', 'http://example.com', '示例内容', 'http://example.com/image.jpg')

　　2. 保存到MongoDB

　　def save_to_mongodb(title, href, content, imgs):

　　mycol.insert_one({

　　'title': title,

　　'href': href,

　　'content': content,

　　'imgs': imgs

　　})

　　# 示例调用

　　save_to_mongodb('示例标题', 'http://example.com', '示例内容', 'http://example.com/image.jpg')

　　四、完整示例代码

　　以下是一个完整的示例代码，展示了如何从妈妈网爬取文章数据并保存到MySQL数据库中。

　　import requests

　　from bs4 import BeautifulSoup

　　import pymysql.cursors

　　# 连接数据库

　　connect = pymysql.Connect(

　　host='localhost',

　　port=3306,

　　user='root',

　　passwd='admin',

　　db='baby_info',

　　charset='utf8'

　　)

　　def get_one_page():

　　url = 'http://www.mama.cn/z/t1183/'

　　headers = {

　　'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'

　　}

　　response = requests.get(url, headers=headers)

　　soup = BeautifulSoup(response.text, "html.parser")

　　div = soup.find(class_='list-left')

　　articles = div.find_all('a')

　　for article in articles:

　　title = article.get_text()

　　href = article['href']

　　save_to_mysql(title, href, '', '')

　　def save_to_mysql(title, href, content, imgs):

　　cursor = connect.cursor()

　　sql = "ｉｎｓｅｒｔ INTO mamawang_info (title, href, content, imgs) VALUES (%s, %s, %s, %s)"

　　cursor.execute(sql, (title, href, content, imgs))

　　connect.commit()

　　cursor.close()

　　if __name__ == '__main__':

　　get_one_page()

　　connect.close()

　　五、注意事项

　　反爬虫机制：许多网站都有反爬虫机制，可以通过设置请求头、使用代理等方式来应对。

　　数据清洗：爬取的数据可能包含不需要的信息，需要进行清洗和整理。

　　法律问题：确保有权访问和使用数据，遵守网站规则和隐私政策。

　　通过以上步骤，你可以使用Python实现从网页爬取数据并将其保存到数据库中。这不仅有助于数据的存储和管理，也为后续的数据分析和可视化提供了基础。

微信分享

上一篇：python未响应怎么办 python程序未响应的问题

下一篇：idea怎么运行单个java文件 idea运行单个java文

猜你喜欢

python怎么爬取网站上的数据库 python爬取网页内容并保存到数据库

猜你喜欢

阅读排行

如何在Python中实现Web抓取?Python Web抓取教程

Python如何输出换行?Python如何输出有颜色的字

HTML 图像映射教程，实现复杂的图像交互

windows下安装openssl流程

Python中如何处理异常?

热门标签

随便看看

怎么使用js实现动画效果?js实现持续的动画效果怎么样

python安装完了还要安装什么 python安装需要配置环境吗

python为什么要创建虚拟环境 python创建虚拟环境的命令

JavaScript实现拖拽的方法是什么?

数据存储的性能优化方法有哪些?