Lecture videos

- (1/6): Boxplot: the Jeonbuk High School example (is the mean a good measure?)

- (2/6): Boxplot basics

- (3/6): plotly

- (4/6): Histogram

- (5/6): Comparing two histograms by overlaying them

- (6/6): Homework

import

import matplotlib.pyplot as plt 
import numpy as np

boxplot

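Jeonbuk High School example: is the mean a reasonable measure?

- Two teachers teach statistics at Jeonbuk High School. For convenience, call them Teacher A and Teacher B. Suppose the mean statistics score of Teacher A's class is 79.1 and the mean score of Teacher B's class is 78.3.

- Decision: students taught by Teacher A are, on average, likely to be stronger.

y1=[75,75,76,76,77,77,79,79,79,98] # scores of the students taught statistics by Teacher A
y2=[76,76,77,77,78,78,80,80,80,81] # scores of the students taught statistics by Teacher B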

np.mean(y1), np.mean(y2)

(79.1, 78.3)

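- The mean is higher for class A (the class taught by Teacher A). However, the overall mean is pulled up by the one student who scored 98; apart from that student, the scores in class B are generally higher than those in class A.

- Comparing distributions is more informative than simply comparing means. One useful way to look at a distribution is the boxplot.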
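The pull of the single 98 can be checked numerically. The following is a minimal sketch using the same scores; the names `a_scores`, `b_scores`, and `mean_a_trimmed` are illustrative and not part of the original notebook. It compares the mean with the median and recomputes the mean of class A with the top score dropped.

```python
import numpy as np

# Same scores as in the example above; variable names are illustrative.
a_scores = [75, 75, 76, 76, 77, 77, 79, 79, 79, 98]  # class A
b_scores = [76, 76, 77, 77, 78, 78, 80, 80, 80, 81]  # class B

mean_a, mean_b = np.mean(a_scores), np.mean(b_scores)
median_a, median_b = np.median(a_scores), np.median(b_scores)

# Mean of class A after dropping its highest score (the 98)
mean_a_trimmed = np.mean(sorted(a_scores)[:-1])

print("means:  ", mean_a, mean_b)      # class A has the higher mean
print("medians:", median_a, median_b)  # class B has the higher median
print("class A mean without the top score:", mean_a_trimmed)
```

The ordering of the two classes flips when the median is used, which is the reason to look at the whole distribution rather than at the mean alone.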

plt.boxplot(y1)

{'whiskers': [<matplotlib.lines.Line2D at 0x7fc9291b5040>,
  <matplotlib.lines.Line2D at 0x7fc9291b53d0>],
 'caps': [<matplotlib.lines.Line2D at 0x7fc9291b5760>,
  <matplotlib.lines.Line2D at 0x7fc9291b5af0>],
 'boxes': [<matplotlib.lines.Line2D at 0x7fc92b216c70>],
 'medians': [<matplotlib.lines.Line2D at 0x7fc9291b5eb0>],
 'fliers': [<matplotlib.lines.Line2D at 0x7fc9291bf280>],
 'means': []}

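Boxplot of class A
The single isolated point is the score of 98.
The red-orange line inside the box is the median (not the mean).
The remaining points all lie between 70 and 80.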

plt.boxplot(y2)

{'whiskers': [<matplotlib.lines.Line2D at 0x7fc92911b550>,
  <matplotlib.lines.Line2D at 0x7fc92911b8e0>],
 'caps': [<matplotlib.lines.Line2D at 0x7fc92911bca0>,
  <matplotlib.lines.Line2D at 0x7fc929127070>],
 'boxes': [<matplotlib.lines.Line2D at 0x7fc92911b1c0>],
 'medians': [<matplotlib.lines.Line2D at 0x7fc929127400>],
 'fliers': [<matplotlib.lines.Line2D at 0x7fc929127790>],
 'means': []}

Boxplot of class B

- Boxplots can be drawn side by side as follows.

plt.boxplot([y1,y2])

{'whiskers': [<matplotlib.lines.Line2D at 0x7fc929089550>,
  <matplotlib.lines.Line2D at 0x7fc9290898e0>,
  <matplotlib.lines.Line2D at 0x7fc929095e80>,
  <matplotlib.lines.Line2D at 0x7fc92909e250>],
 'caps': [<matplotlib.lines.Line2D at 0x7fc929089c70>,
  <matplotlib.lines.Line2D at 0x7fc929095040>,
  <matplotlib.lines.Line2D at 0x7fc92909e5e0>,
  <matplotlib.lines.Line2D at 0x7fc92909e970>],
 'boxes': [<matplotlib.lines.Line2D at 0x7fc9290891c0>,
  <matplotlib.lines.Line2D at 0x7fc929095af0>],
 'medians': [<matplotlib.lines.Line2D at 0x7fc9290953d0>,
  <matplotlib.lines.Line2D at 0x7fc92909ed00>],
 'fliers': [<matplotlib.lines.Line2D at 0x7fc929095760>,
  <matplotlib.lines.Line2D at 0x7fc9290ac0d0>],
 'means': []}

- It is not a beautiful graph, but it is good enough for this purpose.
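If the default tick labels 1 and 2 are hard to read, the classes can be labelled explicitly. This is a sketch only: the names `a_scores`, `b_scores`, and `result` are illustrative, and the `Agg` backend line is there just so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

a_scores = [75, 75, 76, 76, 77, 77, 79, 79, 79, 98]  # class A
b_scores = [76, 76, 77, 77, 78, 78, 80, 80, 80, 81]  # class B

fig, ax = plt.subplots()
# Assigning the return value keeps the dict of Line2D objects from being printed
result = ax.boxplot([a_scores, b_scores])

# Boxes are drawn at x = 1 and x = 2 by default; relabel those ticks
ax.set_xticks([1, 2])
ax.set_xticklabels(["A", "B"])
ax.set_xlabel("class")
ax.set_ylabel("score")
```

The returned dict can also be inspected: `result["fliers"][0].get_ydata()` holds the points drawn as outliers for the first box.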

What is a boxplot?

- ref: https://github.com/mGalarnyk/Python_Tutorials/blob/master/Statistics/boxplot/box_plot.ipynb

np.random.seed(916170)

# connection path is here: https://stackoverflow.com/questions/6146290/plotting-a-line-over-several-graphs
mu, sigma = 0, 1 # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)

fig, axes = plt.subplots(nrows = 1, ncols = 1, figsize=(10, 5))

# rectangular box plot
bplot = axes.boxplot(s,
                vert=False,
                patch_artist=True, 
                showfliers=True, # This would show outliers (the remaining .7% of the data)
                positions = [0],
                boxprops = dict(linestyle='--', linewidth=2, color='Black', facecolor = 'red', alpha = .4),
                medianprops = dict(linestyle='-', linewidth=2, color='Yellow'),
                whiskerprops = dict(linestyle='-', linewidth=2, color='Blue', alpha = .4),
                capprops = dict(linestyle='-', linewidth=2, color='Black'),
                flierprops = dict(marker='o', markerfacecolor='green', markersize=10,
                  linestyle='none', alpha = .4),
                widths = .3,
                zorder = 1)   

axes.set_xlim(-4, 4)
plt.xticks(fontsize = 14)

axes.set_yticks([])
axes.annotate(r'',
            xy=(-.73, .205), xycoords='data',
            xytext=(.66, .205), textcoords='data',
            arrowprops=dict(arrowstyle="|-|",
                            connectionstyle="arc3")
            );

axes.text(0, .25, "Interquartile Range \n(IQR)",  horizontalalignment='center', fontsize=18)
axes.text(0, -.21, r"Median", horizontalalignment='center', fontsize=16);
axes.text(2.65, -.15, "\"Maximum\"", horizontalalignment='center', fontsize=18);
axes.text(-2.65, -.15, "\"Minimum\"", horizontalalignment='center', fontsize=18);
axes.text(-.68, -.24, r"Q1", horizontalalignment='center', fontsize=18);
axes.text(-2.65, -.21, r"(Q1 - 1.5*IQR)", horizontalalignment='center', fontsize=16);
axes.text(.6745, -.24, r"Q3", horizontalalignment='center', fontsize=18);
axes.text(.6745, -.30, r"(75th Percentile)", horizontalalignment='center', fontsize=12);
axes.text(-.68, -.30, r"(25th Percentile)", horizontalalignment='center', fontsize=12);
axes.text(2.65, -.21, r"(Q3 + 1.5*IQR)", horizontalalignment='center', fontsize=16);

axes.annotate('Outliers', xy=(2.93,0.015), xytext=(2.52,0.20), fontsize = 18,
            arrowprops={'arrowstyle': '->', 'color': 'black', 'lw': 2},
            va='center');

axes.annotate('Outliers', xy=(-3.01,0.015), xytext=(-3.41,0.20), fontsize = 18,
            arrowprops={'arrowstyle': '->', 'color': 'black', 'lw': 2},
            va='center');

fig.tight_layout()
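The quantities annotated in the figure (Q1, Q3, IQR, and the two fences Q1 - 1.5*IQR and Q3 + 1.5*IQR) can be computed directly. The sketch below applies them to the class A scores from the earlier example; the variable names are illustrative. With matplotlib's default setting, each whisker ends at the most extreme observation that still lies inside the fences, and anything outside is drawn as an outlier.

```python
import numpy as np

# Class A scores from the earlier example
scores = np.array([75, 75, 76, 76, 77, 77, 79, 79, 79, 98])

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Fences: the '"Minimum"' and '"Maximum"' labels in the figure
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = scores[(scores >= lower_fence) & (scores <= upper_fence)]
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

# Whiskers stop at the most extreme observations inside the fences
whisker_low, whisker_high = inside.min(), inside.max()

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

For these scores only the 98 falls outside the fences, which matches the single isolated point in the class A boxplot.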

plotly

!pip install plotly 
!pip install ipywidgets
!pip install jupyter-dash
!pip install dash 
!pip install pandas

import plotly.express as px 
import pandas as pd
from IPython.display import HTML

A=pd.DataFrame({'score':y1,'class':['A']*len(y1)})
B=pd.DataFrame({'score':y2,'class':['B']*len(y2)})

df=pd.concat([A,B],ignore_index=True)

df

fig=px.box(data_frame=df, x='class',y='score')
HTML(fig.to_html(include_plotlyjs='cdn',include_mathjax=False))

histogram

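What is a histogram?

- A plot in which the x-axis shows intervals (bins) of a variable and the y-axis shows how many observations fall in each interval.

- For example: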

plt.hist(np.random.normal(loc=0, scale=1, size=1000000))

(array([2.40000e+01, 1.21000e+03, 2.13380e+04, 1.39948e+05, 3.51614e+05,
        3.39662e+05, 1.27147e+05, 1.80510e+04, 9.87000e+02, 1.90000e+01]),
 array([-5.05590169, -4.03792348, -3.01994528, -2.00196708, -0.98398887,
         0.03398933,  1.05196753,  2.06994573,  3.08792394,  4.10590214,
         5.12388034]),
 <BarContainer object of 10 artists>)
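The first two arrays returned by `plt.hist` are the bin counts and the bin edges. The counting itself can be reproduced with `np.histogram`; the sketch below uses a small made-up data set (`values` is illustrative, not from the lecture) so that the result can be checked by hand.

```python
import numpy as np

# Toy data, made up for illustration
values = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Four equal-width bins on [1, 5]: [1,2), [2,3), [3,4), [4,5]
counts, edges = np.histogram(values, bins=4, range=(1, 5))

print(counts)  # number of observations per bin
print(edges)   # bin boundaries; always one more edge than bins
```

Every bin is half-open on the right except the last one, which also includes its right edge, so the value 5 is counted in the final bin.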

Jeonbuk High School example

- Measure of central tendency: a value that represents the center of a distribution. Examples include the mean and the median.

https://en.wikipedia.org/wiki/Central_tendency

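- We have seen that the mean is not always a good measure of central tendency.

- Under certain assumptions, however, the mean is a good measure of central tendency.

np.random.seed(43052)
y1=np.random.normal(loc=0,scale=1,size=10000) # think of these as the statistics scores of class A at Jeonbuk High School.
y2=np.random.normal(loc=0.5,scale=1,size=10000) # think of these as the statistics scores of class B at Jeonbuk High School.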

np.mean(y1), np.mean(y2)

(-0.011790879905079434, 0.4979147460611458)

(np.mean(y2)-np.mean(y1)).round(3)

0.51

plt.boxplot([y1,y2])

{'whiskers': [<matplotlib.lines.Line2D at 0x7fc92668e9d0>,
  <matplotlib.lines.Line2D at 0x7fc92668ed90>,
  <matplotlib.lines.Line2D at 0x7fc926627370>,
  <matplotlib.lines.Line2D at 0x7fc926627700>],
 'caps': [<matplotlib.lines.Line2D at 0x7fc92661c160>,
  <matplotlib.lines.Line2D at 0x7fc92661c4f0>,
  <matplotlib.lines.Line2D at 0x7fc926627a90>,
  <matplotlib.lines.Line2D at 0x7fc926627e20>],
 'boxes': [<matplotlib.lines.Line2D at 0x7fc92668e640>,
  <matplotlib.lines.Line2D at 0x7fc92661cfa0>],
 'medians': [<matplotlib.lines.Line2D at 0x7fc92661c880>,
  <matplotlib.lines.Line2D at 0x7fc9266351f0>],
 'fliers': [<matplotlib.lines.Line2D at 0x7fc92661cc10>,
  <matplotlib.lines.Line2D at 0x7fc926635580>],
 'means': []}

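The two distributions have almost the same shape; it looks as if the left plot was copied, pasted on the right, and then shifted.
In this situation it is reasonable to claim that $\text{score of class B} \approx \text{score of class A} + 0.51$.

- How do we know the data are normally distributed? $\to$ Draw a histogram and check whether it is bell-shaped.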
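One way to make the "copy, paste, and shift" impression concrete is to compare several percentiles of the two samples: if B is roughly A plus a constant, every percentile gap should be close to the gap between the means. This is a rough sketch under the same simulation settings; the names `a`, `b`, `shift`, and `gaps` are illustrative.

```python
import numpy as np

np.random.seed(43052)
a = np.random.normal(loc=0, scale=1, size=10000)    # class A
b = np.random.normal(loc=0.5, scale=1, size=10000)  # class B

shift = np.mean(b) - np.mean(a)

# Gap between matching percentiles of the two samples
probs = [10, 25, 50, 75, 90]
gaps = np.percentile(b, probs) - np.percentile(a, probs)

for p, g in zip(probs, gaps):
    print(f"{p}th percentile gap: {g:.3f}")
print(f"mean gap: {shift:.3f}")
```

If the gaps differed a lot across percentiles, the two distributions would differ in shape as well as location, and a single shift would not summarize the difference well.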

plt.hist(y1,bins=50)

(array([  1.,   1.,   3.,   0.,   1.,   4.,   5.,  12.,  14.,  26.,  32.,
         52.,  67.,  89., 144., 171., 238., 282., 325., 378., 489., 492.,
        561., 635., 652., 636., 626., 606., 573., 539., 475., 444., 350.,
        250., 232., 172., 137.,  80.,  58.,  47.,  30.,  23.,  17.,  12.,
          9.,   4.,   4.,   0.,   1.,   1.]),
 array([-4.12186916, -3.96068404, -3.79949892, -3.6383138 , -3.47712868,
        -3.31594356, -3.15475844, -2.99357332, -2.8323882 , -2.67120308,
        -2.51001796, -2.34883284, -2.18764772, -2.0264626 , -1.86527748,
        -1.70409236, -1.54290724, -1.38172212, -1.220537  , -1.05935188,
        -0.89816676, -0.73698164, -0.57579652, -0.4146114 , -0.25342628,
        -0.09224116,  0.06894396,  0.23012908,  0.3913142 ,  0.55249932,
         0.71368444,  0.87486956,  1.03605468,  1.1972398 ,  1.35842492,
         1.51961004,  1.68079516,  1.84198028,  2.0031654 ,  2.16435052,
         2.32553564,  2.48672076,  2.64790588,  2.809091  ,  2.97027612,
         3.13146124,  3.29264636,  3.45383148,  3.6150166 ,  3.77620172,
         3.93738684]),
 <BarContainer object of 50 artists>)

plt.hist(y2,bins=50)

(array([  1.,   0.,   3.,   2.,   4.,   5.,   5.,  10.,  16.,  25.,  33.,
         56.,  74., 116., 119., 152., 244., 272., 351., 362., 438., 509.,
        531., 621., 624., 690., 636., 571., 564., 514., 462., 402., 356.,
        297., 233., 184., 144., 113.,  80.,  55.,  38.,  34.,  21.,  18.,
          4.,   3.,   2.,   4.,   1.,   1.]),
 array([-3.5752867 , -3.4164866 , -3.2576865 , -3.0988864 , -2.9400863 ,
        -2.7812862 , -2.6224861 , -2.463686  , -2.3048859 , -2.1460858 ,
        -1.9872857 , -1.8284856 , -1.6696855 , -1.5108854 , -1.3520853 ,
        -1.1932852 , -1.0344851 , -0.875685  , -0.7168849 , -0.5580848 ,
        -0.3992847 , -0.2404846 , -0.0816845 ,  0.0771156 ,  0.2359157 ,
         0.3947158 ,  0.5535159 ,  0.712316  ,  0.87111611,  1.02991621,
         1.18871631,  1.34751641,  1.50631651,  1.66511661,  1.82391671,
         1.98271681,  2.14151691,  2.30031701,  2.45911711,  2.61791721,
         2.77671731,  2.93551741,  3.09431751,  3.25311761,  3.41191771,
         3.57071781,  3.72951791,  3.88831801,  4.04711811,  4.20591821,
         4.36471831]),
 <BarContainer object of 50 artists>)
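Alongside the visual check, a rough numerical complement is to see what fraction of the sample lies within one and two standard deviations of the mean; for a normal distribution these are about 68% and 95%. This is only an informal sketch, not a formal normality test, and the names `sample`, `within_1`, and `within_2` are illustrative.

```python
import numpy as np

np.random.seed(43052)
sample = np.random.normal(loc=0, scale=1, size=10000)

# Standardize with the sample's own mean and standard deviation
z = (sample - sample.mean()) / sample.std()

within_1 = np.mean(np.abs(z) < 1)  # expected to be near 0.68 for normal data
within_2 = np.mean(np.abs(z) < 2)  # expected to be near 0.95 for normal data

print(f"within 1 sd: {within_1:.3f}")
print(f"within 2 sd: {within_2:.3f}")
```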

plt.hist([y1,y2],bins=200)

(array([[  1.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   1.,   1.,   1.,
           0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   1.,   1.,   2.,
           1.,   0.,   1.,   1.,   3.,   4.,   4.,   2.,   2.,   6.,   4.,
           1.,   4.,   7.,   8.,   9.,  11.,   5.,   9.,   9.,  14.,  12.,
          16.,  11.,   9.,  18.,  25.,  30.,  22.,  18.,  28.,  29.,  39.,
          40.,  41.,  37.,  42.,  48.,  56.,  58.,  49.,  80.,  62.,  62.,
          91.,  78.,  75.,  82.,  89.,  81., 106.,  85.,  89., 126., 125.,
         106., 142., 141., 121., 121., 135., 154., 166., 146., 125., 169.,
         160., 170., 172., 162., 161., 161., 193., 146., 186., 170., 166.,
         197., 152., 149., 167., 173., 158., 155., 156., 153., 152., 137.,
         151., 147., 126., 141., 125., 139., 117., 116., 135., 118.,  93.,
         115.,  99.,  78.,  91.,  77.,  63.,  81.,  52.,  83.,  53.,  61.,
          49.,  46.,  46.,  47.,  45.,  26.,  48.,  31.,  27.,  27.,  20.,
          17.,  22.,  15.,  15.,  14.,  14.,  15.,  10.,   8.,  13.,   7.,
           5.,   8.,   6.,   6.,   6.,   2.,   4.,   9.,   3.,   3.,   6.,
           2.,   1.,   4.,   2.,   2.,   2.,   2.,   0.,   1.,   1.,   2.,
           2.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,
           0.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
           0.,   0.],
        [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
           0.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,
           1.,   1.,   0.,   1.,   0.,   1.,   1.,   1.,   0.,   3.,   1.,
           2.,   1.,   1.,   1.,   2.,   1.,   2.,   2.,   1.,   6.,   1.,
           6.,   3.,   7.,   7.,   5.,   6.,  10.,   5.,   7.,  16.,   9.,
          11.,  13.,  28.,  21.,  16.,  20.,  27.,  25.,  32.,  33.,  28.,
          31.,  31.,  39.,  42.,  34.,  43.,  44.,  64.,  56.,  80.,  64.,
          74.,  77.,  69.,  82.,  92., 101.,  99.,  81.,  90., 115., 110.,
         106., 108., 127., 127., 138., 145., 138., 121., 159., 135., 145.,
         156., 186., 158., 164., 172., 182., 147., 194., 178., 176., 195.,
         162., 182., 164., 164., 163., 145., 150., 143., 155., 144., 142.,
         161., 141., 146., 129., 115., 132., 120., 118., 128., 103.,  88.,
         111., 104.,  97.,  82.,  77.,  83.,  83.,  80.,  77.,  67.,  58.,
          48.,  47.,  54.,  50.,  43.,  36.,  43.,  33.,  33.,  42.,  29.,
          24.,  28.,  19.,  22.,  16.,  18.,  14.,  14.,  11.,  10.,   9.,
           7.,  12.,  10.,   8.,   8.,   9.,   4.,   7.,   4.,   6.,   3.,
           8.,   1.,   1.,   1.,   0.,   1.,   0.,   2.,   1.,   0.,   2.,
           0.,   0.,   2.,   2.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,
           0.,   1.]]),
 array([-4.12186916, -4.07943623, -4.03700329, -3.99457035, -3.95213741,
        -3.90970448, -3.86727154, -3.8248386 , -3.78240567, -3.73997273,
        -3.69753979, -3.65510685, -3.61267392, -3.57024098, -3.52780804,
        -3.4853751 , -3.44294217, -3.40050923, -3.35807629, -3.31564335,
        -3.27321042, -3.23077748, -3.18834454, -3.1459116 , -3.10347867,
        -3.06104573, -3.01861279, -2.97617986, -2.93374692, -2.89131398,
        -2.84888104, -2.80644811, -2.76401517, -2.72158223, -2.67914929,
        -2.63671636, -2.59428342, -2.55185048, -2.50941754, -2.46698461,
        -2.42455167, -2.38211873, -2.33968579, -2.29725286, -2.25481992,
        -2.21238698, -2.16995405, -2.12752111, -2.08508817, -2.04265523,
        -2.0002223 , -1.95778936, -1.91535642, -1.87292348, -1.83049055,
        -1.78805761, -1.74562467, -1.70319173, -1.6607588 , -1.61832586,
        -1.57589292, -1.53345998, -1.49102705, -1.44859411, -1.40616117,
        -1.36372824, -1.3212953 , -1.27886236, -1.23642942, -1.19399649,
        -1.15156355, -1.10913061, -1.06669767, -1.02426474, -0.9818318 ,
        -0.93939886, -0.89696592, -0.85453299, -0.81210005, -0.76966711,
        -0.72723417, -0.68480124, -0.6423683 , -0.59993536, -0.55750243,
        -0.51506949, -0.47263655, -0.43020361, -0.38777068, -0.34533774,
        -0.3029048 , -0.26047186, -0.21803893, -0.17560599, -0.13317305,
        -0.09074011, -0.04830718, -0.00587424,  0.0365587 ,  0.07899164,
         0.12142457,  0.16385751,  0.20629045,  0.24872338,  0.29115632,
         0.33358926,  0.3760222 ,  0.41845513,  0.46088807,  0.50332101,
         0.54575395,  0.58818688,  0.63061982,  0.67305276,  0.7154857 ,
         0.75791863,  0.80035157,  0.84278451,  0.88521744,  0.92765038,
         0.97008332,  1.01251626,  1.05494919,  1.09738213,  1.13981507,
         1.18224801,  1.22468094,  1.26711388,  1.30954682,  1.35197976,
         1.39441269,  1.43684563,  1.47927857,  1.52171151,  1.56414444,
         1.60657738,  1.64901032,  1.69144325,  1.73387619,  1.77630913,
         1.81874207,  1.861175  ,  1.90360794,  1.94604088,  1.98847382,
         2.03090675,  2.07333969,  2.11577263,  2.15820557,  2.2006385 ,
         2.24307144,  2.28550438,  2.32793732,  2.37037025,  2.41280319,
         2.45523613,  2.49766906,  2.540102  ,  2.58253494,  2.62496788,
         2.66740081,  2.70983375,  2.75226669,  2.79469963,  2.83713256,
         2.8795655 ,  2.92199844,  2.96443138,  3.00686431,  3.04929725,
         3.09173019,  3.13416313,  3.17659606,  3.219029  ,  3.26146194,
         3.30389487,  3.34632781,  3.38876075,  3.43119369,  3.47362662,
         3.51605956,  3.5584925 ,  3.60092544,  3.64335837,  3.68579131,
         3.72822425,  3.77065719,  3.81309012,  3.85552306,  3.897956  ,
         3.94038894,  3.98282187,  4.02525481,  4.06768775,  4.11012068,
         4.15255362,  4.19498656,  4.2374195 ,  4.27985243,  4.32228537,
         4.36471831]),
 <a list of 2 BarContainer objects>)
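Passing a list to `plt.hist` draws the bars of the two groups next to each other within each bin. To overlay them instead, one option is to call `hist` twice with shared bin edges and some transparency. The following is a sketch under the same simulation settings; `a`, `b`, and `edges` are illustrative names, and the `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(43052)
a = np.random.normal(loc=0, scale=1, size=10000)    # class A
b = np.random.normal(loc=0.5, scale=1, size=10000)  # class B

# Shared bin edges (50 bins) so the bars of both groups line up
edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 51)

fig, ax = plt.subplots()
counts_a, _, _ = ax.hist(a, bins=edges, alpha=0.5, label="A")
counts_b, _, _ = ax.hist(b, bins=edges, alpha=0.5, label="B")
ax.set_xlabel("score")
ax.set_ylabel("count")
ax.legend()
```

Using one set of edges for both calls matters: if each call chose its own bins, the bar widths would differ and the heights would not be directly comparable.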

seaborn

import seaborn as sns

A=pd.DataFrame({'score':y1,'class':['A']*len(y1)})
B=pd.DataFrame({'score':y2,'class':['B']*len(y2)})
df=pd.concat([A,B],ignore_index=True)

sns.histplot(df,x='score',hue='class')

<AxesSubplot:xlabel='score', ylabel='Count'>

plotnine

from plotnine import *

ggplot(df)+geom_histogram(aes(x='score',fill='class'),position='identity',alpha=0.5)

/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/stat_bin.py:95: PlotnineWarning: 'stat_bin()' using 'bins = 84'. Pick better value with 'binwidth'.

<ggplot: (8781336505688)>

plotly

- For an interactive graph, visit the plotly website and adapt suitable example code.

import plotly.figure_factory as ff
import numpy as np

hist_data=[y1,y2]
group_labels=['A','B']

fig = ff.create_distplot(hist_data, group_labels,
                         bin_size=.2, show_rug=False)
HTML(fig.to_html(include_plotlyjs='cdn',include_mathjax=False))

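Homework

(1) Set the random seed with your own student ID, for example np.random.seed(202043052).

(2) Generate two samples of 100,000 normal random numbers each and store them as y1 and y2.

y1: mean 0, standard deviation 1
y2: mean 1, standard deviation 1

(3) Use plotly to draw the two histograms overlaid.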

Output of df (the combined data frame built in the plotly section):

	score	class
0	75	A
1	75	A
2	76	A
3	76	A
4	77	A
5	77	A
6	79	A
7	79	A
8	79	A
9	98	A
10	76	B
11	76	B
12	77	B
13	77	B
14	78	B
15	78	B
16	80	B
17	80	B
18	80	B
19	81	B
