Lesson 14: Big Data and Visualization - Crawling + Visualization

Author

최규빈

Published

January 10, 2024

Lecture video playlist: https://youtu.be/playlist?list=PLQqh36zP38-xHvsBJCEyhWyj1V3U0qtQJ&si=A6MudgUp7ETE0g6z
# !pip install yfinance
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import yfinance as yf

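Visualizing stock data with yfinance

A. Crawling + data cleaning

- Yahoo Finance: https://finance.yahoo.com/

Apple: 'AAPL'

Figure: Apple ticker symbol

Samsung Electronics: '005930.KS'

Figure: Samsung Electronics ticker symbol

- Code for crawling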

symbols = ['AMZN','AAPL','GOOG','MSFT','NFLX','NVDA','TSLA']
start = '2020-01-01'
end = '2024-01-10'
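# auto_adjust=False keeps the separate 'Adj Close' column used below
# (newer yfinance releases may adjust prices by default and omit it).
df = yf.download(symbols, start=start, end=end, auto_adjust=False)
df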
[*********************100%%**********************]  7 of 7 completed
Price Adj Close Close ... Open Volume
Ticker AAPL AMZN GOOG MSFT NFLX NVDA TSLA AAPL AMZN GOOG ... NFLX NVDA TSLA AAPL AMZN GOOG MSFT NFLX NVDA TSLA
Date
2020-01-02 73.152657 94.900497 68.368500 154.779526 329.809998 59.744045 28.684000 75.087502 94.900497 68.368500 ... 326.100006 59.687500 28.299999 135480400 80580000 28132000 22622100 4485800 23753600 142981500
2020-01-03 72.441460 93.748497 68.032997 152.852249 325.899994 58.787777 29.534000 74.357498 93.748497 68.032997 ... 326.779999 58.775002 29.366667 146322800 75288000 23728000 21116200 3806900 20538400 266677500
2020-01-06 73.018677 95.143997 69.710503 153.247330 335.829987 59.034313 30.102667 74.949997 95.143997 69.710503 ... 323.119995 58.080002 29.364668 118387200 81236000 34646000 20813700 5663100 26263600 151995000
2020-01-07 72.675285 95.343002 69.667000 151.850098 330.750000 59.749020 31.270666 74.597504 95.343002 69.667000 ... 336.470001 59.549999 30.760000 108872000 80898000 30054000 21634100 4703200 31485600 268231500
2020-01-08 73.844345 94.598503 70.216003 154.268829 339.260010 59.861088 32.809334 75.797501 94.598503 70.216003 ... 331.489990 59.939999 31.580000 132079200 70160000 30560000 27746500 7104500 27710800 467164500
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2024-01-03 184.250000 148.470001 140.360001 370.600006 470.260010 475.690002 238.449997 184.250000 148.470001 140.360001 ... 467.320007 474.850006 244.979996 58414500 49425500 18974300 23083500 3443700 32089600 121082600
2024-01-04 181.910004 144.570007 138.039993 367.940002 474.670013 479.980011 237.929993 181.910004 144.570007 138.039993 ... 472.980011 477.670013 239.250000 71983600 56039800 18253300 20901500 3636500 30653500 102629300
2024-01-05 181.179993 145.240005 137.389999 367.750000 474.059998 490.970001 237.490005 181.179993 145.240005 137.389999 ... 476.500000 484.619995 236.860001 62303300 45124800 15433200 20987000 2612500 41456800 92379400
2024-01-08 185.559998 149.100006 140.529999 374.690002 485.029999 522.530029 240.449997 185.559998 149.100006 140.529999 ... 473.890015 495.119995 236.139999 59144500 46757100 17645300 23134000 3675800 64251000 85166600
2024-01-09 185.139999 151.369995 142.559998 375.790009 482.089996 531.400024 234.960007 185.139999 151.369995 142.559998 ... 475.529999 524.010010 238.110001 42841800 43812600 19579700 20830000 3526800 77310000 96705700

1012 rows × 42 columns

B. Visualization

df.loc[:,'Adj Close'].plot.line(backend='plotly')
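
The selection `df.loc[:, 'Adj Close']` relies on the two-level column index that `yf.download` returned above (price field on top, ticker below). The following is a minimal sketch on a hand-built frame, so it runs without network access. The numbers are a few values copied from the output above, and the names `toy`, `adj`, and `aapl` are illustrative only.

```python
import pandas as pd

# Two-level columns mimicking the (Price, Ticker) layout shown above.
columns = pd.MultiIndex.from_product(
    [["Adj Close", "Close"], ["AAPL", "AMZN"]], names=["Price", "Ticker"]
)
dates = pd.to_datetime(["2020-01-02", "2020-01-03"])
toy = pd.DataFrame(
    [[73.152657, 94.900497, 75.087502, 94.900497],
     [72.441460, 93.748497, 74.357498, 93.748497]],
    index=dates, columns=columns,
)

# Selecting a top-level label drops that level: one column per ticker remains.
adj = toy.loc[:, "Adj Close"]
print(list(adj.columns))

# xs() selects along the lower level instead: every price field for one ticker.
aapl = toy.xs("AAPL", axis=1, level="Ticker")
print(list(aapl.columns))
```

Because the result of the first selection has one column per ticker, `.plot.line` draws one line per ticker.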

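Visualizing birth rates

A. Crawling + data cleaning

- The low birth rate problem in South Korea

ref: https://ko.wikipedia.org/wiki/대한민국의_저출산

- We want to read the 5th table from the URL above.

  • 5th table: number of births by province and metropolitan city (시도)
df_lst = pd.read_html('https://ko.wikipedia.org/wiki/%EB%8C%80%ED%95%9C%EB%AF%BC%EA%B5%AD%EC%9D%98_%EC%A0%80%EC%B6%9C%EC%82%B0')
df = df_lst[4]  # read_html returns a list of tables; index 4 is the 5th table
df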
지역/연도[6] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
0 서울 93266 91526 93914.000 84066.000 83711.000 83005 75.536 65389 58074 53.673 47400 45531
1 부산 27415 27759 28673.000 25831.000 26190.000 26645 24906.000 21480 19152 17049.000 15100 14446
2 대구 20557 20758 21472.000 19340.000 19361.000 19438 18298.000 15946 14400 13233.000 11200 10661
3 인천 25752 20758 21472.000 25560.000 25786.000 25491 23609.000 20445 20087 18522.000 16000 14947
4 광주 13979 13916 14392.000 12729.000 12729.000 12441 11580.000 10120 9105 8364.000 7300 7956
5 대전 14314 14808 15279.000 14099.000 13962.000 13774 12436.000 10851 9337 8410.000 7500 7414
6 울산 11432 11542 12160.000 11330.000 11556.000 11732 10910.000 9381 8149 7539.000 6600 6127
7 세종 - - 1054.000 1111.000 1344.000 2708 3297.000 3504 3703 3819.000 3500 3570
8 경기 121753 122027 124746.000 112129.000 112.169 113495 105643.000 94088 83198 83.198 77800 76139
9 강원 12477 12408 12426.000 10980.000 10662.000 10929 10058.000 9958 8351 8283.000 7800 7357
10 충북 14670 14804 15139.000 13658.000 13366.000 13563 12742.000 11394 10586 9333.000 8600 8190
11 충남 20.242 20.398 20.448 18.628 18200.000 18604 17302.000 15670 14380 13228.000 11900 10984
12 전북 16100 16175 16238.000 14555.000 14231.000 14087 12698.000 11348 10001 8971.000 8200 7745
13 전남 16654 16612 16990.000 15401.000 14817.000 15061 13980.000 12354 11238 10832.000 9700 8430
14 경북 23700 24250 24635.000 22206.000 22062.000 22310 20616.000 17957 16079 14472.000 12900 12045
15 경남 32203 32536 33211.000 29504.000 29763.000 29537 27138.000 23849 21224 19250.000 16800 15562
16 제주 5657 5628 5992.000 5328.000 5526.000 5600 5494.000 5037 4781 4500.000 4000 3728
17 전국 470171 471265 484550.000 436455.000 435435.000 438420 406243.000 357771 326822 302676.000 272400 260562

The labels are kept in Korean because the code below refers to them. 지역/연도 = region/year; 서울 = Seoul, 부산 = Busan, 대구 = Daegu, 인천 = Incheon, 광주 = Gwangju, 대전 = Daejeon, 울산 = Ulsan, 세종 = Sejong, 경기 = Gyeonggi, 강원 = Gangwon, 충북 = Chungbuk, 충남 = Chungnam, 전북 = Jeonbuk, 전남 = Jeonnam, 경북 = Gyeongbuk, 경남 = Gyeongnam, 제주 = Jeju, 전국 = nationwide.

B. Visualization 1: nationwide number of births

# Convert every cell to a number ('-' means no data and becomes 0),
# drop the last row (전국, the nationwide total), then sum each year over the regions.
# DataFrame.map replaces the deprecated applymap (pandas 2.1+).
df.set_index('지역/연도[6]')\
.map(lambda x: 0 if x == '-' else float(x))\
.iloc[:-1,:]\
.sum(axis=0)\
.plot.line(backend='plotly')
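
The chain above does three things: it turns every cell into a number (treating `-` as 0), drops the last row (전국) so the nationwide total is not counted twice, and sums each year column. The following is a minimal sketch on a four-row subset hand-copied from the table above. The frame `small` and the helper `to_number` are illustrative, and the sketch uses `Series.map` per column, which should behave the same across pandas versions.

```python
import pandas as pd

# Three region rows plus the nationwide row, columns 2010-2012, copied from the table above.
small = pd.DataFrame(
    {
        "지역/연도[6]": ["울산", "세종", "제주", "전국"],
        "2010": ["11432", "-", "5657", "470171"],
        "2011": ["11542", "-", "5628", "471265"],
        "2012": [12160.0, 1054.0, 5992.0, 484550.0],
    }
)

def to_number(x):
    """Treat the '-' placeholder as 0 and everything else as a float."""
    return 0.0 if x == "-" else float(x)

# Apply the conversion cell by cell, one column at a time.
cleaned = small.set_index("지역/연도[6]").apply(lambda col: col.map(to_number))

# Drop the last row (전국) before summing so the total is not counted twice.
yearly_sum = cleaned.iloc[:-1, :].sum(axis=0)
print(yearly_sum)
```

The sums here cover only the three sample regions, so they are not nationwide totals.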

C. Visualization 2: number of births by region (line)

# Transpose so the years become the index and each region becomes a column,
# then keep the columns from 서울 through 제주 (everything except 전국).
df.set_index('지역/연도[6]')\
.map(lambda x: 0 if x == '-' else float(x)).T\
.loc[:,'서울':'제주']\
.plot.line(backend='plotly')
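
Transposing with `.T` puts the years on the index and the regions on the columns, which is the layout `.plot.line` needs to draw one line per region. The label slice `'서울':'제주'` includes both endpoints, so it keeps every region and drops only the trailing 전국 column. The following is a small sketch with values copied from the table above; the frame `wide` is illustrative.

```python
import pandas as pd

# Regions on the rows, years on the columns, as in the scraped table.
wide = pd.DataFrame(
    {"2020": [47400, 4000, 272400], "2021": [45531, 3728, 260562]},
    index=["서울", "제주", "전국"],
)

by_year = wide.T                         # rows: years, columns: regions
regions = by_year.loc[:, "서울":"제주"]  # label slices include the end label

print(list(by_year.index))
print(list(regions.columns))
```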

D. Visualization 3: number of births by region (area)

- Is there a visualization that suitably combines the information in visualizations 1 and 2?

# Same data as visualization 2, drawn as an area plot.
df.set_index('지역/연도[6]')\
.map(lambda x: 0 if x == '-' else float(x)).T\
.loc[:,'서울':'제주']\
.plot.area(backend='plotly')
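
An area plot stacks the regional series. Each band's thickness is one region's count (the information in visualization 2), while the upper edge of the top band is the sum over regions (the information in visualization 1). The stacking can be sketched as a cumulative sum across columns. The three regions below are a subset copied from the table above, so the totals are subset totals only, and the names `subset`, `band_tops`, and `total` are illustrative.

```python
import pandas as pd

# Years on the rows, three sample regions on the columns.
subset = pd.DataFrame(
    {"울산": [6600, 6127], "세종": [3500, 3570], "제주": [4000, 3728]},
    index=["2020", "2021"],
)

# Upper boundary of each stacked band = running total across the columns.
band_tops = subset.cumsum(axis=1)

# The top of the last band coincides with the row-wise total.
total = subset.sum(axis=1)

print(band_tops)
print(total)
```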

E. Fixing visualizations 1, 2, 3

- Fixing visualization 1: some cells, such as 75.536 for 서울 in 2016, appear to have lost a factor of 1000 when the table was parsed, so values below 1000 are multiplied by 1000 before plotting.

# Second map: rescale cells that look like they are expressed in thousands.
df.set_index('지역/연도[6]')\
.map(lambda x: 0 if x == '-' else float(x))\
.map(lambda x: x*1000 if x<1000 else x)\
.iloc[:-1,:]\
.sum(axis=0)\
.plot.line(backend='plotly')
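
A few cells in the table (for example 20.242 for 충남 in 2010) look like counts whose thousands separator was read as a decimal point, which is why multiplying values below 1000 by 1000 is a plausible repair. The sketch below applies that rule to the 2010 column, hand-copied from the table above, and compares the regional sum with the 전국 row. The helper names and the final rounding are additions for illustration, not part of the original code.

```python
import pandas as pd

# The 2010 column for the 17 regions, as strings, copied from the table above.
regions_2010 = pd.Series(
    {
        "서울": "93266", "부산": "27415", "대구": "20557", "인천": "25752",
        "광주": "13979", "대전": "14314", "울산": "11432", "세종": "-",
        "경기": "121753", "강원": "12477", "충북": "14670", "충남": "20.242",
        "전북": "16100", "전남": "16654", "경북": "23700", "경남": "32203",
        "제주": "5657",
    }
)
national_2010 = 470171  # the 전국 row for 2010

def to_number(x):
    """Treat the '-' placeholder as 0 and everything else as a float."""
    return 0.0 if x == "-" else float(x)

def rescale(x):
    """Assume anything below 1000 is expressed in thousands."""
    return x * 1000 if x < 1000 else x

raw_sum = regions_2010.map(to_number).sum()
# round() removes floating-point noise introduced by the multiplication.
fixed_sum = regions_2010.map(to_number).map(rescale).round().sum()

print(raw_sum, fixed_sum, national_2010)
```

The rule only works here because no genuine count in the table is below 1000 (the smallest is 1054 for 세종 in 2012). On other data a threshold like this could silently inflate small values.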

- Fixing visualization 2

df.set_index('지역/연도[6]')\
.map(lambda x: 0 if x == '-' else float(x))\
.map(lambda x: x*1000 if x<1000 else x).T\
.loc[:,'서울':'제주']\
.plot.line(backend='plotly')

- Fixing visualization 3

df.set_index('지역/연도[6]')\
.map(lambda x: 0 if x == '-' else float(x))\
.map(lambda x: x*1000 if x<1000 else x).T\
.loc[:,'서울':'제주']\
.plot.area(backend='plotly')