[Python] Web Crawler -- Web Page Parsing - Jialin

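The first step in extracting content from a particular web page is understanding how that page is put together.

Python 3's urllib.parse module provides the urlparse function, which splits a URL into its components.

For reference: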

https://docs.python.org/3.5/library/urllib.parse.html?highlight=urlparse#urllib.parse.urlparse

Take a search for the keyword "anello" on an auction site as an example.

This is the URL shown after the search:

http://search.ruten.com.tw/search/s000.php?enc=u&searchfrom=indexbar&k=anello&t=0

Clicking "page 2" of the search results changes the URL to:

http://search.ruten.com.tw/search/s000.php?enc=u&searchfrom=indexbar&k=anello&t=0&p=2

Clicking "page 5" of the search results changes the URL to:

http://search.ruten.com.tw/search/s000.php?enc=u&searchfrom=indexbar&k=anello&t=0&p=5

Comparing these URLs shows that the last parameter (p) changes as the user moves from page to page.

Below, urlparse is used to parse the URL.
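A minimal sketch of this step, assuming Python 3 and the standard urllib.parse module. The variable names url and result are illustrative choices, not taken from the original post:

```python
from urllib.parse import urlparse

# The "page 5" search URL from the example above.
url = ("http://search.ruten.com.tw/search/s000.php"
       "?enc=u&searchfrom=indexbar&k=anello&t=0&p=5")

# urlparse splits the URL into scheme, netloc, path, params, query, fragment.
result = urlparse(url)
print(result)
```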

Running it prints:

ParseResult(scheme='http', netloc='search.ruten.com.tw', path='/search/s000.php', params='', query='enc=u&searchfrom=indexbar&k=anello&t=0&p=5', fragment='')
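Since only the p parameter changes between pages, the query string can also be split into individual parameters and rebuilt for any page number. The sketch below is one possible way to do that with parse_qs, urlencode, and urlunparse from the standard library. The helper name page_url is hypothetical, and it assumes the site keeps using p as the page parameter:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# First-page search URL from the example above (no p parameter yet).
base_url = ("http://search.ruten.com.tw/search/s000.php"
            "?enc=u&searchfrom=indexbar&k=anello&t=0")

def page_url(url, page):
    """Return `url` with its p query parameter set to `page` (hypothetical helper)."""
    parts = urlparse(url)
    # parse_qs maps each parameter name to a list of its values.
    params = parse_qs(parts.query)
    params["p"] = [str(page)]
    # doseq=True turns the value lists back into plain key=value pairs.
    new_query = urlencode(params, doseq=True)
    return urlunparse(parts._replace(query=new_query))

for page in (2, 5):
    print(page_url(base_url, page))
```

A crawler could loop over a range of page numbers with this helper instead of concatenating strings by hand.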

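Next, take the movie charts on a portal site as an example.

This is the URL of the movie chart page: https://tw.movies.yahoo.com/chart.html

Click each chart category in turn, for example the yearly box office chart, the weekly box office champion chart, the US box office chart, and so on.

Watching the address bar shows that the parameter at the end of the URL changes, as follows: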

https://tw.movies.yahoo.com/chart.html?cate=year

https://tw.movies.yahoo.com/chart.html?cate=week

https://tw.movies.yahoo.com/chart.html?cate=us

Parsing the last of these URLs with the same urlparse code as in the previous example prints:

ParseResult(scheme='https', netloc='tw.movies.yahoo.com', path='/chart.html', params='', query='cate=us', fragment='')
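To compare the three chart URLs side by side, each one can be parsed and its cate value pulled out of the query string. This is a small sketch under the assumption that every chart URL carries exactly one cate parameter; the list name chart_urls is illustrative:

```python
from urllib.parse import urlparse, parse_qs

# The three chart URLs listed above.
chart_urls = [
    "https://tw.movies.yahoo.com/chart.html?cate=year",
    "https://tw.movies.yahoo.com/chart.html?cate=week",
    "https://tw.movies.yahoo.com/chart.html?cate=us",
]

categories = []
for url in chart_urls:
    parts = urlparse(url)
    # parse_qs returns a list per key; take the single cate value.
    cate = parse_qs(parts.query)["cate"][0]
    categories.append(cate)
    print(parts.netloc, parts.path, cate)
```

The host and path stay the same for all three; only the cate value differs.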

Next, splitlines() can be used to read the source of a given page line by line.
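A rough sketch of how this might look, assuming the page is downloaded with urllib.request.urlopen and decoded as UTF-8 (both are assumptions, as are the helper names fetch_source and first_lines). To keep the example runnable without network access, the demonstration uses a short inline sample, and the live call is left as a comment:

```python
from urllib.request import urlopen

def fetch_source(url, encoding="utf-8"):
    """Download a page and return its source as text (encoding is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")

def first_lines(source, count=15):
    """Return the first `count` lines of the page source."""
    return source.splitlines()[:count]

# Offline demonstration on a small inline sample instead of a live page.
sample = "<!DOCTYPE html>\n<html>\n<head>\n<title>demo</title>\n</head>\n</html>"
for number, line in enumerate(first_lines(sample, 3), start=1):
    print(number, line)

# For a live page (requires network access):
# source = fetch_source("https://tw.movies.yahoo.com/chart.html")
# for number, line in enumerate(first_lines(source), start=1):
#     print(number, line)
```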

The printed output can be checked by opening the same page in a browser and viewing its source: lines 1 to 15 really were captured.

Comments and corrections are welcome =)

Jialin
