Scraping Data from the National Medical Products Administration (NMPA) [★★]

By yesmore on 2021-07-23
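  • A crawler that sends parameters in a POST request
  • Crawling sub-links (following each list entry to its detail record)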
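
To make the first point concrete: when a dict is passed as `data` to `requests.post`, it is sent as a form-encoded body. The sketch below shows roughly what that body looks like for the list endpoint used in this post. `build_list_payload` is a hypothetical helper introduced only for illustration, and the field names are the ones that appear in the script in this post; it uses only the standard library, so it runs without touching the network.

```python
from urllib.parse import urlencode


def build_list_payload(page, page_size=15):
    """Return the form fields for one page of the list endpoint (hypothetical helper)."""
    return {
        'on': 'true',
        'page': str(page),
        'pageSize': str(page_size),
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': '',
    }


# urlencode produces the key=value&key=value form used by a form-encoded POST body.
body = urlencode(build_list_payload(2))
print(body)
```

Only the `page` field changes between requests, which is why the script can walk the list by looping over page numbers.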
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import json

import requests

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }

    id_list = []        # IDs of the companies
    all_data_list = []  # detail records for every company

    # 1. Collect the IDs of the companies from the paginated list endpoint
    url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
    for page in range(1, 6):
        # Fetch the company list for the first 5 pages
        # Form parameters sent in the POST body
        data = {
            'on': 'true',
            'page': str(page),
            'pageSize': '15',
            'productName': '',
            'conditionType': '1',
            'applyname': '',
            'applysn': '',
        }
        json_ids = requests.post(url=url, headers=headers, data=data).json()
        for dic in json_ids['list']:
            # Save every ID on this page to the list
            id_list.append(dic['ID'])

    # 2. Fetch the detail record for each company
    post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
    for company_id in id_list:
        data = {
            'id': company_id
        }
        detail_json = requests.post(url=post_url, headers=headers, data=data).json()
        print(detail_json, '\n-------------ending-----------')
        all_data_list.append(detail_json)

    # 3. Persist all_data_list to disk; the with block closes the file
    with open('./allData.json', 'w', encoding='utf-8') as fp:
        json.dump(all_data_list, fp=fp, ensure_ascii=False)
    print('\n\n****************************Done***************************')
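
The two-stage flow (list pages first, then one detail request per ID) can also be sketched as two small functions that take the HTTP call as an argument. This is only one possible arrangement, not the way the script above is written. `collect_ids`, `collect_details`, and `fake_post` are hypothetical names, and `fake_post` returns made-up records so the sketch runs offline; the `list` key and `ID` field are the ones the script above reads.

```python
def collect_ids(post, list_url, pages):
    """Stage 1: walk the list pages and gather every company ID."""
    ids = []
    for page in pages:
        payload = {'on': 'true', 'page': str(page), 'pageSize': '15',
                   'productName': '', 'conditionType': '1',
                   'applyname': '', 'applysn': ''}
        for item in post(list_url, payload).get('list', []):
            ids.append(item['ID'])
    return ids


def collect_details(post, detail_url, ids):
    """Stage 2: request one detail record per ID."""
    return [post(detail_url, {'id': company_id}) for company_id in ids]


def fake_post(url, data):
    """Hypothetical stand-in for the server; returns invented data."""
    if 'getXkzsList' in url:
        page = data['page']
        return {'list': [{'ID': 'p%s-a' % page}, {'ID': 'p%s-b' % page}]}
    return {'id': data['id'], 'name': 'company ' + data['id']}


ids = collect_ids(fake_post, 'portalAction.do?method=getXkzsList', range(1, 3))
details = collect_details(fake_post, 'portalAction.do?method=getXkzsById', ids)
print(ids)
print(details[0])
```

Against the real site, `post` could be a thin wrapper such as `lambda url, data: requests.post(url, headers=headers, data=data, timeout=10).json()`, assuming the endpoints still respond as they did when this post was written.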
