Today's Hot List (今日热榜): https://tophub.today/
Scraped data and save format:
After scraping, the data is saved as .txt files:
Partial contents:
Source code and comments:
import requests
from bs4 import BeautifulSoup


def download_page(url):
    # Send a browser-like User-Agent with the request
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    try:
        r = requests.get(url, timeout=30, headers=headers)
        r.raise_for_status()  # treat HTTP error codes as failures
        return r.text
    except requests.RequestException as e:
        # Return None so the caller does not try to parse an error message as HTML
        print("please inspect your url or setup:", e)
        return None


def get_content(html, tag):
    output = "    Rank: {}\n    Title: {}\n    Heat: {}\n    Link: {}\n    ------------\n"
    output2 = "Platform: {}    List type: {}    Last updated: {}\n------------\n"
    soup = BeautifulSoup(html, 'html.parser')
    con = soup.find('div', attrs={'class': 'bc-cc'})
    if con is None:  # page layout changed or the request was blocked
        return
    con_list = con.find_all('div', class_="cc-cd")
    for i in con_list:
        author = i.find('div', class_='cc-cd-lb').get_text()  # platform name
        time = i.find('div', class_='i-h').get_text()  # last update time
        link = i.find('div', class_='cc-cd-cb-l').find_all('a')  # all entry links
        gender = i.find('span', class_='cc-cd-sb-st').get_text()  # list type
        save_txt(tag, output2.format(author, gender, time))
        # Write each entry directly under its platform header
        for k in link:
            num = k.find('span', class_='s').get_text()    # rank
            title = k.find('span', class_='t').get_text()  # title
            hot = k.find('span', class_='e').get_text()    # heat
            save_txt(tag, output.format(num, title, hot, k['href']))


def save_txt(tag, *args):
    # Append every string to <tag>.txt, opening the file once per call
    with open(tag + '.txt', 'a', encoding='utf-8') as f:
        for i in args:
            f.write(i)


def main():
    # General, tech, entertainment, community, shopping, finance
    page = ['news', 'tech', 'ent', 'community', 'shopping', 'finance']
    for tag in page:
        url = 'https://tophub.today/c/{}'.format(tag)
        html = download_page(url)
        if html is None:  # skip categories that failed to download
            continue
        get_content(html, tag)


if __name__ == '__main__':
    main()
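The save step can be tried without touching the network. The following is a minimal sketch, not part of the original script: the names RECORD, save_records and demo, the sample rows, and the example.com links are all hypothetical and used only for illustration. It appends records in a rank/title/heat/link layout similar to the one above, keeping a single file handle open for the whole batch.

```python
import os
import tempfile

# Hypothetical record layout, modeled on the script's output template
RECORD = "Rank: {}\nTitle: {}\nHeat: {}\nLink: {}\n------------\n"


def save_records(path, records):
    """Append (rank, title, heat, link) tuples to path and return the count written."""
    count = 0
    # Open the file once in append mode instead of reopening it per record
    with open(path, "a", encoding="utf-8") as f:
        for rank, title, heat, link in records:
            f.write(RECORD.format(rank, title, heat, link))
            count += 1
    return count


def demo():
    """Write two made-up records to a temporary file and return its contents."""
    sample = [
        ("1", "Example headline", "12.3k", "https://example.com/a"),
        ("2", "Another headline", "9.8k", "https://example.com/b"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "news.txt")
        save_records(path, sample)
        with open(path, encoding="utf-8") as f:
            return f.read()
```

Calling demo() returns the text that would end up in the .txt file, which makes it easy to check the layout before running the scraper against the live site.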