Web crawling) Web crawling with Colab

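Back in the first semester I couldn't follow any of this,

but now I feel like I'm starting to get it, at least a little.

After looking through several sites,

I tried pulling the popular posts

of Monthly Design (월간 디자인), a magazine I love.

I'm so proud of myself, hehe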

Things I newly learned

that look useful

1) find_all returns a value of type

bs4.element.ResultSet,

which behaves aaaalmost exactly like a list

(indexing and slicing both work),

so it's really handy.
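
To see what "almost like a list" means in practice, here is a minimal sketch on a made-up HTML snippet (the markup and class name are invented for illustration, not taken from the Monthly Design page). Indexing, slicing, and len() work on the ResultSet itself, but tag-level things like ['href'] only work on the individual elements inside it, which may be one reason indexing seems not to work at first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example
html = """
<div class="post_list">
  <a href="/post/1">First</a>
  <a href="/post/2">Second</a>
  <a href="/post/3">Third</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')   # ResultSet, even when only one element matches
first = soup.find('a')       # a single Tag (or None when nothing matches)

print(type(links).__name__)            # ResultSet
print(len(links))                      # 3
print(links[0]['href'])                # index first, then read the attribute
print([a['href'] for a in links[1:]])  # slicing works too

# links['href'] would raise an error: the attribute lives on each Tag,
# not on the ResultSet that holds the Tags.
```

So the pattern is: pick one element out of the ResultSet (by index or in a for loop), then ask that element for its attribute or text.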

2) .get_text() or .text

This one is pretty neat.

When you scrape HTML from a site, the text comes wrapped in tags,

but this pulls out just the text.

Looks really useful.
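
A small sketch of the difference, assuming a made-up one-line snippet (not the real page markup). .text gives back everything between the tags, whitespace included, while get_text() takes optional arguments to tidy it up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example
html = '<p class="title"> <b>Monthly</b> Design <span>issue</span> </p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p')

print(repr(p.text))                 # all text inside the tag, whitespace included
print(p.get_text(strip=True))       # MonthlyDesignissue  (pieces stripped, then joined)
print(p.get_text(' ', strip=True))  # Monthly Design issue  (joined with a space)
```

For titles scraped from a real page, the last form is usually the one that prints cleanly.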

Colab link: click

Code:

# Setting up Beautiful Soup
# Ways to get the HTML: 1) open a local html file / 2) fetch the page source with urllib / 3) fetch it with the requests module
# Going with method 2)

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://post.naver.com/my.nhn?memberNo=34550514') as response:
  html = response.read()

soup = BeautifulSoup(html, 'html.parser')

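# Grabbing elements by tag and attribute
# 1) find_all('tag name', {'attribute name': 'value', ...}) returns every match
# 2) find('tag name', {'attribute name': 'value', ...}) returns only the first match, even when several elements match

# First grab) collect the hrefs into a list
# This one was really confusing. Unlike find, find_all returns a 'bs4.element.ResultSet'.
# It prints just like a list. Even with a single match it looks like a list, unlike find. Indexing and slicing work too. I still don't know why it wouldn't work for me, though.
# The ['href'] part still doesn't make sense to me either!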

a = soup.find('div', {'class': 'beset_post_list'}).find_all('a')
link_list = []
for i in a:
  link_list.append(i['href'])

# Second grab) get the titles
title = soup.find('div', {'class': 'beset_post_list'}).find_all('p', {'class': 'tit ell'})

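# Print the first and second grabs together) reference for get_text / text: https://crazyj.tistory.com/201
# zip pairs each title with its link and stops at the shorter list, so a count mismatch cannot raise an IndexError
for i, (t, link) in enumerate(zip(title, link_list), start=1):
  print(f'\nPost #{i}: {t.text}, link: https://post.naver.com{link}')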
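
On the ['href'] puzzle from the comments above: a Tag behaves a bit like a dictionary of its HTML attributes, so i['href'] simply reads the href attribute of that a tag. Here is a minimal sketch, again on invented markup (the /viewer/1 path is a placeholder, not a real post). urljoin from the standard library is shown as one alternative to gluing the domain on by hand.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example
html = '<a href="/viewer/1" class="link">Post one</a><a>No link here</a>'
soup = BeautifulSoup(html, 'html.parser')
with_href, without_href = soup.find_all('a')

print(with_href.attrs)           # the tag's attributes as a dict
print(with_href['href'])         # /viewer/1
print(without_href.get('href'))  # None  (['href'] would raise KeyError here)

# Build an absolute URL from the site root and the relative href
full_url = urljoin('https://post.naver.com', with_href['href'])
print(full_url)                  # https://post.naver.com/viewer/1
```

Using .get('href') instead of ['href'] is a safer choice when some a tags on the page might not carry a link at all.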

Orissine (오리씨네) blog