{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 08wk-1: Pandas (1)\n",
        "\n",
        "최규빈  \n",
        "2023-04-24\n",
        "\n",
        "<a href=\"https://colab.research.google.com/github/guebin/PP2023/blob/main/posts/02_DataScience/2023-04-24-8wk-1.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" style=\"text-align: left\"></a>\n",
        "\n",
        "# Lecture videos\n",
        "\n",
        "> youtube:\n",
        "> <https://youtube.com/playlist?list=PLQqh36zP38-xqAT5XH-YhYj1s2WQWhKE8>\n",
        "\n",
        "# import"
      ],
      "id": "5a2e8960-4c43-4363-a5ea-62552fdfb484"
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd"
      ],
      "id": "277503d6-2d68-4226-b5dd-96cbf5ce6819"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Why pandas was developed\n",
        "\n",
        "## Extracting part of the data: why use pandas?\n",
        "\n",
        "`-` Example 1: sometimes we want to access data by position (index) and sometimes by key."
      ],
      "id": "d7cff516-79ea-4395-a856-13c1d5dce7ef"
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {},
      "outputs": [],
      "source": [
        "np.random.seed(43052)\n",
        "att = np.random.choice(np.arange(10,21)*5,20)\n",
        "rep = np.random.choice(np.arange(5,21)*5,20)\n",
        "mid = np.random.choice(np.arange(0,21)*5,20)\n",
        "fin = np.random.choice(np.arange(0,21)*5,20)\n",
        "key = ['2022-12'+str(s) for s in np.random.choice(np.arange(300,501),20,replace=False)]"
      ],
      "id": "b1f403c2-f03a-414c-b65b-c9bdbf5c39e9"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "What if we want the attendance score of the student with ID `2022-12363`?\n",
        "\n",
        "(Solution 1) – store the data in a dict and look it up"
      ],
      "id": "f3f10e87-1e6b-465b-90e2-4f79d2b15f1f"
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {},
      "outputs": [],
      "source": [
        "dct = {'att':{key[i]:att[i] for i in range(20)}, \n",
        "       'rep':{key[i]:rep[i] for i in range(20)}, \n",
        "       'mid':{key[i]:mid[i] for i in range(20)}, \n",
        "       'fin':{key[i]:fin[i] for i in range(20)}}\n",
        "#dct"
      ],
      "id": "2996e09b"
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {},
      "outputs": [],
      "source": [
        "dct['att']['2022-12363']"
      ],
      "id": "1c219caf"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Solution 2) – store the data in an ndarray and look it up"
      ],
      "id": "be2d6abf-5859-4a5a-8fc7-aafb6aa94130"
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {},
      "outputs": [],
      "source": [
        "arr = np.array([att,rep,mid,fin,key]).T\n",
        "arr"
      ],
      "id": "9845acd6-2adb-44dc-97ca-e45fade83f83"
    },
    {
      "cell_type": "code",
      "execution_count": 91,
      "metadata": {},
      "outputs": [],
      "source": [
        "arr[arr[:,-1] == '2022-12363',0] # hard-to-read code"
      ],
      "id": "2a409934"
    },
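    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Drawbacks of (Solution 2) compared with (Solution 1)**\n",
        "\n",
        "-   You have to remember that the last column of `arr` is the student ID\n",
        "    and the first column is `att`.\n",
        "-   Every value has been coerced to a string dtype.\n",
        "-   The code is hard to read, because it accesses values by position.\n",
        "\n",
        "`-` Summary: extracting information hash-style (by key) is sometimes\n",
        "useful, and it is usually the better choice. (In fact, the numpy array was\n",
        "not designed for looking up information; it was designed to support\n",
        "mathematical operations on matrices and vectors.)\n",
        "\n",
        "`-` Wish: I understand that hash-style lookup is useful $\\to$ but\n",
        "sometimes I want to pull out information numpy-style, and I want to view\n",
        "the data like a spreadsheet (like a matrix) rather than as a dictionary\n",
        "$\\to$ hence the development of pandas.\n",
        "\n",
        "## I want to organize data as a table, like Excel\n",
        "\n",
        "(Method 1) – numpy"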
      ],
      "id": "1989bd13-099b-4ef6-9535-c0f7d0d3f6e4"
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {},
      "outputs": [],
      "source": [
        "arr"
      ],
      "id": "878db530"
    },
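    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A side note on the array above, as a minimal sketch: one ndarray holds a\n",
        "single dtype, so stacking integer scores with string IDs stores every\n",
        "entry as a string. The names `scores`, `ids`, `mixed` and `restored` are\n",
        "hypothetical toy data, not the `att`/`key` values used above."
      ],
      "id": "added-dtype-note-md"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "# hypothetical toy data: integer scores and string student IDs\n",
        "scores = np.array([50, 65, 80])\n",
        "ids = ['id-a', 'id-b', 'id-c']\n",
        "\n",
        "# one ndarray holds a single dtype, so the integers become strings\n",
        "mixed = np.array([scores, ids]).T\n",
        "print(mixed.dtype.kind)  # 'U' denotes a unicode string dtype\n",
        "\n",
        "# numeric work requires casting the score column back to integers\n",
        "restored = mixed[:, 0].astype(np.int64)\n",
        "print(restored.mean())"
      ],
      "id": "added-dtype-note-code"
    },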
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Method 2) – pandas with a nested dict"
      ],
      "id": "dd3f9699-b5b2-41c2-bc35-42f66a102d54"
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {},
      "outputs": [],
      "source": [
        "df = pd.DataFrame(dct)\n",
        "df.head()"
      ],
      "id": "59980ba2"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Method 3) – pandas with index"
      ],
      "id": "e127125e-f0f7-4d8f-a9da-b5080b688afe"
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {},
      "outputs": [],
      "source": [
        "df = pd.DataFrame({'att':att,'rep':rep,'mid':mid,'fin':fin},index=key)\n",
        "df.head()"
      ],
      "id": "518c6966"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## I want to pull out information by hashing (like a dictionary)\n",
        "\n",
        "`-` Example 1: print the attendance scores (what works for a dictionary also works for pandas)"
      ],
      "id": "24add66b-fec1-444f-a833-15e2b82c37dc"
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {},
      "outputs": [],
      "source": [
        "# dct['att']\n",
        "df['att']"
      ],
      "id": "a27cace6"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Example 2: print the attendance score of student ID `2022-12380`"
      ],
      "id": "8996cc5a-4f6f-4721-a8f2-6d045746ba43"
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "metadata": {},
      "outputs": [],
      "source": [
        "#dct['att']['2022-12380']\n",
        "df['att']['2022-12380']"
      ],
      "id": "82a11888"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## I want positional indexing to be supported as well (like a list or numpy array)\n",
        "\n",
        "`-` Example 1: print the final exam score of the first student."
      ],
      "id": "6ea5eebe-f227-400e-8f26-eb1d89ea9259"
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.iloc[0,-1]"
      ],
      "id": "c66b2edf"
    },
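    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "-   Crash course: the special accessor `iloc` lets you pull elements out\n",
        "    of df with numpy-style indexing.\n",
        "\n",
        "> Think of df as dictionary-like and df.iloc as numpy-like.\n",
        "\n",
        "`-` Example 2: extract the scores of the odd-numbered students (the 1st,\n",
        "3rd, 5th, … students, which correspond to positions 0,2,4,…)."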
      ],
      "id": "bacf9d8b-3027-4284-ac53-b8c65c20c4d3"
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.iloc[::2,:]"
      ],
      "id": "2bac90df"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Example 3: print the scores of the last 3 students."
      ],
      "id": "885aafcc-81b2-4e17-a19b-1c15d5d8a8ea"
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.iloc[-3:,:]"
      ],
      "id": "3ecf1288"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Example 4: print only the last 2 columns for the last 3 students."
      ],
      "id": "2f2ea529-23c0-4754-b4f6-45ef5edcaaa3"
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.iloc[-3:,-2:]"
      ],
      "id": "4067303f"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Ultimate goal: a data type that supports both hashing and indexing\n",
        "\n",
        "`-` Example 1: print `fin` for the students with `mid >= 20 and att < 60`\n",
        "\n",
        "(Method 1) query\n",
        "\n",
        "-   database style"
      ],
      "id": "2767db28-cb01-4019-bf51-e788dbb42ba1"
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.query('mid>=20 and att<60')['fin']"
      ],
      "id": "de85f240"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Method 2) numpy"
      ],
      "id": "6cc873b4-7315-4ecb-8e95-806778de05c1"
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {},
      "outputs": [],
      "source": [
        "arr[(arr[:,2].astype(dtype=np.int64) >= 20) & (arr[:,0].astype(dtype=np.int64) < 60),3]"
      ],
      "id": "60b14318"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Example 2: compute the mean attendance score of the students whose\n",
        "midterm score is lower than their final exam score."
      ],
      "id": "f1eb1d59-8a29-4781-ae6a-7dd1fe5a57dc"
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.query('mid<fin')['att'].mean()"
      ],
      "id": "c3b342a2"
    },
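    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The same kind of filter can also be written with a boolean mask inside\n",
        "`loc` instead of a query string. A minimal sketch on a hypothetical\n",
        "four-row frame; `toy`, its values and its row labels are assumptions, not\n",
        "the data used above."
      ],
      "id": "added-mask-vs-query-md"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "# hypothetical toy frame with the same column names as above\n",
        "toy = pd.DataFrame({'att': [50, 90, 55, 100],\n",
        "                    'mid': [10, 30, 45, 80],\n",
        "                    'fin': [20, 25, 70, 60]},\n",
        "                   index=['s1', 's2', 's3', 's4'])\n",
        "\n",
        "# database style: the condition is written as a string\n",
        "by_query = toy.query('mid>=20 and att<60')['fin']\n",
        "\n",
        "# mask style: build a boolean Series, then pick rows and one column with loc\n",
        "mask = (toy['mid'] >= 20) & (toy['att'] < 60)\n",
        "by_mask = toy.loc[mask, 'fin']\n",
        "\n",
        "print(by_query.equals(by_mask))"
      ],
      "id": "added-mask-vs-query-code"
    },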
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Studying pandas: step 1\n",
        "\n",
        "## Declaring a DataFrame\n",
        "\n",
        "`-` Method 1: build it from a dictionary."
      ],
      "id": "758c3c09-9167-4778-9cab-d81d8c9034bd"
    },
    {
      "cell_type": "code",
      "execution_count": 31,
      "metadata": {},
      "outputs": [],
      "source": [
        "pd.DataFrame({'att':[30,40,50],'mid':[50,60,70]})"
      ],
      "id": "58e85fc1-8f39-42fa-b6e4-dbb0a09ec1d4"
    },
    {
      "cell_type": "code",
      "execution_count": 32,
      "metadata": {},
      "outputs": [],
      "source": [
        "pd.DataFrame({'att':(30,40,50),'mid':(50,60,70)})"
      ],
      "id": "3dffb65c-e243-4e82-b4f8-3e492659d2e0"
    },
    {
      "cell_type": "code",
      "execution_count": 33,
      "metadata": {},
      "outputs": [],
      "source": [
        "pd.DataFrame({'att':np.array([30,40,50]),'mid':np.array([50,60,70])})"
      ],
      "id": "1cc108d1-4373-4fcd-a001-dcf35b7711e7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Method 2: build it from a 2-D ndarray."
      ],
      "id": "d2ad4a5a-a1e7-4630-91cd-6d051d16941f"
    },
    {
      "cell_type": "code",
      "execution_count": 34,
      "metadata": {},
      "outputs": [],
      "source": [
        "np.arange(2*3).reshape(2,3)"
      ],
      "id": "a43cbec4-5c2e-474a-ab44-f01b8ce059b1"
    },
    {
      "cell_type": "code",
      "execution_count": 35,
      "metadata": {},
      "outputs": [],
      "source": [
        "pd.DataFrame(np.arange(2*3).reshape(2,3))"
      ],
      "id": "421eecf9-f0b0-4800-9aac-1642a0e3ed16"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Naming the columns\n",
        "\n",
        "`-` Method 1: when the frame is built from a dictionary, the dictionary\n",
        "keys automatically become the column names."
      ],
      "id": "a82d932e-3974-4cd0-bf71-4a6772d443b7"
    },
    {
      "cell_type": "code",
      "execution_count": 36,
      "metadata": {},
      "outputs": [],
      "source": [
        "pd.DataFrame({'att':np.array([30,40,50]),'mid':np.array([50,60,70])})"
      ],
      "id": "ed9c5e8c-2ea6-4978-8f3f-00fcedc72eb1"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Method 2: use the `columns` option of pd.DataFrame()"
      ],
      "id": "5a77c184-4dff-4b90-bcf4-1d4a16f54ad5"
    },
    {
      "cell_type": "code",
      "execution_count": 37,
      "metadata": {},
      "outputs": [],
      "source": [
        "pd.DataFrame(np.arange(2*3).reshape(2,3),columns=['X1','X2','X3'])"
      ],
      "id": "01f53545-dc4d-4a1f-a35d-1f80489c3766"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Method 3: overwrite df.columns with the desired column names (1)"
      ],
      "id": "86a7bf98-81da-4d8b-9487-132c89d10030"
    },
    {
      "cell_type": "code",
      "execution_count": 38,
      "metadata": {},
      "outputs": [],
      "source": [
        "df=pd.DataFrame(np.arange(2*3).reshape(2,3))\n",
        "df"
      ],
      "id": "a98f0d8e-6a37-47d1-a0a0-40ce92ab80e5"
    },
    {
      "cell_type": "code",
      "execution_count": 41,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.columns = ['X1','X2','X3']\n",
        "df"
      ],
      "id": "8316b758"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Method 4: overwrite df.columns with the desired column names (2)"
      ],
      "id": "5ad6d34c-0fc7-468b-b21c-6279260a6bff"
    },
    {
      "cell_type": "code",
      "execution_count": 42,
      "metadata": {},
      "outputs": [],
      "source": [
        "df=pd.DataFrame(np.arange(2*3).reshape(2,3))\n",
        "df"
      ],
      "id": "6ae75f62-5f43-4c26-bc51-fe4fc0745a45"
    },
    {
      "cell_type": "code",
      "execution_count": 43,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.columns = pd.Index(['X1','X2','X3'])"
      ],
      "id": "85192d69-16cd-435d-b422-a01417455db9"
    },
    {
      "cell_type": "code",
      "execution_count": 44,
      "metadata": {},
      "outputs": [],
      "source": [
        "df"
      ],
      "id": "a5111023-929a-4c91-99f7-5f6a64a88ab5"
    },
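    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Method 4 is easier for the computer to interpret than Method 3 (= it can\n",
        "prevent unnecessary errors or warning messages).\n",
        "\n",
        "## Naming the rows\n",
        "\n",
        "`-` Method 1: with a nested dict, the keys of the inner dicts automatically become the row names."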
      ],
      "id": "cefc3b88-2f13-4d0e-b2ae-44c03991bc21"
    },
    {
      "cell_type": "code",
      "execution_count": 45,
      "metadata": {},
      "outputs": [],
      "source": [
        "pd.DataFrame({'att':{'guebin':30, 'iu':40, 'hynn':50} , 'mid':{'guebin':5, 'iu':45, 'hynn':90}})"
      ],
      "id": "b2255740-022e-4c64-bb31-8883129d22a1"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Method 2: use the `index` option of pd.DataFrame()"
      ],
      "id": "ea3ad88d-4a58-46bd-960f-df2855dcb899"
    },
    {
      "cell_type": "code",
      "execution_count": 46,
      "metadata": {},
      "outputs": [],
      "source": [
        "pd.DataFrame({'att':[30,40,50] , 'mid':[5,45,90]},index=['guebin','iu','hynn'])"
      ],
      "id": "d14ec901"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Method 3: overwrite df.index"
      ],
      "id": "68ad989b-e63b-4f0b-8a80-f95793358a34"
    },
    {
      "cell_type": "code",
      "execution_count": 47,
      "metadata": {},
      "outputs": [],
      "source": [
        "df=pd.DataFrame({'att':[30,40,50] , 'mid':[5,45,90]})\n",
        "df"
      ],
      "id": "681326f9-7ef2-471d-a2a1-3d34c2d5f1ac"
    },
    {
      "cell_type": "code",
      "execution_count": 48,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.index = pd.Index(['guebin','iu','hynn'])\n",
        "#df.index = ['guebin','iu','hynn'] <- this also runs\n",
        "df"
      ],
      "id": "fa1eb50f-f6dc-45b6-b991-db2bfea6b9e5"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Method 4: set the index with df.set_index()"
      ],
      "id": "19ee6467-ed67-4124-97ee-3c94666a6c30"
    },
    {
      "cell_type": "code",
      "execution_count": 49,
      "metadata": {},
      "outputs": [],
      "source": [
        "df=pd.DataFrame({'att':[30,40,50] , 'mid':[5,45,90]})\n",
        "df"
      ],
      "id": "2ee8fe37-8c57-4fb1-8fcd-2062b8da232e"
    },
    {
      "cell_type": "code",
      "execution_count": 50,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.set_index(pd.Index(['guebin','iu','hynn']))"
      ],
      "id": "e606bc12-7be5-4669-bcf9-a0532a81d78e"
    },
    {
      "cell_type": "code",
      "execution_count": 51,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.set_index(['guebin','iu','hynn']) # KeyError: a flat list is read as column names"
      ],
      "id": "cdad9b7f-51ad-45f5-959f-0775b12af959"
    },
    {
      "cell_type": "code",
      "execution_count": 52,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.set_index([['guebin','iu','hynn']]) # wrapping the list in one more pair of brackets avoids the error"
      ],
      "id": "065d3428-1954-4bb4-a3b2-5da90d18b427"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "-   However, this style of code is not recommended.\n",
        "\n",
        "## Type, len, shape, and the loop variable of a for statement"
      ],
      "id": "4822f655-f828-48c2-9747-c4bbb6635af1"
    },
    {
      "cell_type": "code",
      "execution_count": 53,
      "metadata": {},
      "outputs": [],
      "source": [
        "df = pd.DataFrame({'att':[30,40,50],'mid':[5,45,90]})\n",
        "df"
      ],
      "id": "6844676c-00d5-4e62-a293-42eb9aded5b3"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` type"
      ],
      "id": "e42cb9e8-aecb-45d9-ba20-94bbebc6c1f4"
    },
    {
      "cell_type": "code",
      "execution_count": 54,
      "metadata": {},
      "outputs": [],
      "source": [
        "type(df)"
      ],
      "id": "967025d5-afdc-4f66-9f28-64e958c95a7d"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` len"
      ],
      "id": "404bd932-2761-4d5d-b96f-a82847157292"
    },
    {
      "cell_type": "code",
      "execution_count": 55,
      "metadata": {},
      "outputs": [],
      "source": [
        "len(df) # number of rows"
      ],
      "id": "f32db27b-bead-48a1-b12d-adbe734b3294"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` shape"
      ],
      "id": "8bb1c816-da9a-45be-bf24-e6fdf8b3c7fc"
    },
    {
      "cell_type": "code",
      "execution_count": 56,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.shape "
      ],
      "id": "c8d254e7-ec13-4ccd-a1b0-43fff9805d1c"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` loop variable of a for statement"
      ],
      "id": "5c0cc7cb-3d60-4635-8cba-5b3d4f69371a"
    },
    {
      "cell_type": "code",
      "execution_count": 57,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "att\n",
            "mid"
          ]
        }
      ],
      "source": [
        "for k in df:\n",
        "    print(k) # just like a dictionary"
      ],
      "id": "ced5d6bf-1cf1-4aaa-bf07-db716fcbfb0a"
    },
    {
      "cell_type": "code",
      "execution_count": 58,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "att\n",
            "mid"
          ]
        }
      ],
      "source": [
        "for k in {'att':[30,40,50],'mid':[5,45,90]}: \n",
        "    print(k)"
      ],
      "id": "a7880a2b-524a-4c03-8020-14fd67c7cc19"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Note:** df really does behave a lot like a dictionary."
      ],
      "id": "39454947-3710-42fa-9328-ab0a692ef65f"
    },
    {
      "cell_type": "code",
      "execution_count": 59,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.keys()"
      ],
      "id": "2c52a755-5a57-4a9f-8e12-7853a42a2017"
    },
    {
      "cell_type": "code",
      "execution_count": 62,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "att\n",
            "mid"
          ]
        }
      ],
      "source": [
        "for k,v in df.items():\n",
        "    print(k)"
      ],
      "id": "70887c85-c05d-4c14-91bb-de6cb89739fa"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## pd.Series\n",
        "\n",
        "`-` If a 2-D ndarray corresponds to pd.DataFrame, then a 1-D ndarray\n",
        "corresponds to pd.Series."
      ],
      "id": "f647cfeb-af77-4b61-9195-dde90959b4b8"
    },
    {
      "cell_type": "code",
      "execution_count": 63,
      "metadata": {},
      "outputs": [],
      "source": [
        "a=pd.Series(np.random.randn(10))\n",
        "a"
      ],
      "id": "904dd1bc-3faa-4b95-9a3b-3ca2e7fdacf0"
    },
    {
      "cell_type": "code",
      "execution_count": 64,
      "metadata": {},
      "outputs": [],
      "source": [
        "type(a)"
      ],
      "id": "1778a071-3be4-4036-b5c0-3cb4c1a193b7"
    },
    {
      "cell_type": "code",
      "execution_count": 65,
      "metadata": {},
      "outputs": [],
      "source": [
        "len(a)"
      ],
      "id": "2c6970e0-8fb1-4a84-83ed-472855a3d956"
    },
    {
      "cell_type": "code",
      "execution_count": 66,
      "metadata": {},
      "outputs": [],
      "source": [
        "a.shape"
      ],
      "id": "189f7e43-adab-40c7-a26c-0086bb35d713"
    },
    {
      "cell_type": "code",
      "execution_count": 67,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.10617283591748639\n",
            "0.7237590624253404\n",
            "0.21798967912700873\n",
            "0.1940223087322443\n",
            "-0.6889899757985083\n",
            "-0.3516696436204985\n",
            "0.9909329773184973\n",
            "1.2121468150185186\n",
            "-0.6089654373693767\n",
            "0.03254898346416765"
          ]
        }
      ],
      "source": [
        "for value in a: \n",
        "    print(value)"
      ],
      "id": "c3764463-9ba4-4de8-a3d4-795404e4c3b0"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Studying pandas: step 2\n",
        "\n",
        "`-` Data"
      ],
      "id": "cfbea947-c573-4eaa-b3a4-7911e8e6bf5c"
    },
    {
      "cell_type": "code",
      "execution_count": 68,
      "metadata": {},
      "outputs": [],
      "source": [
        "np.random.seed(43052)\n",
        "att = np.random.choice(np.arange(10,21)*5,20)\n",
        "rep = np.random.choice(np.arange(5,21)*5,20)\n",
        "mid = np.random.choice(np.arange(0,21)*5,20)\n",
        "fin = np.random.choice(np.arange(0,21)*5,20)\n",
        "key = ['2022-12'+str(s) for s in np.random.choice(np.arange(300,501),20,replace=False)]"
      ],
      "id": "f32a956a-7ab3-4507-ae3f-ec540253602a"
    },
    {
      "cell_type": "code",
      "execution_count": 69,
      "metadata": {},
      "outputs": [],
      "source": [
        "df = pd.DataFrame({'att':att,'rep':rep,'mid':mid,'fin':fin},index=key)\n",
        "df.head()"
      ],
      "id": "a57f961b-4e90-4c85-8284-476a03042411"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Selecting columns\n",
        "\n",
        "`-` Method 1: `df[]` + column name, list of column names"
      ],
      "id": "3c65b5c0-7901-4891-bce9-1296fb49f986"
    },
    {
      "cell_type": "code",
      "execution_count": 87,
      "metadata": {},
      "outputs": [],
      "source": [
        "# df['att'] # column name\n",
        "# df[['att']] # list of column names\n",
        "# df[['att','rep']] # list of column names"
      ],
      "id": "1b7a5fa9"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Method 2: `df.iloc[:,]` + integer, list of integers, range, slicing,\n",
        "striding, list of bools"
      ],
      "id": "7320e350-169d-4907-b4cc-0d2eab491260"
    },
    {
      "cell_type": "code",
      "execution_count": 88,
      "metadata": {},
      "outputs": [],
      "source": [
        "# df.iloc[:,0] # integer\n",
        "# df.iloc[:,[0]] # list of integers\n",
        "# df.iloc[:,[0,1]] # list of integers\n",
        "# df.iloc[:,range(2)] # range\n",
        "# df.iloc[:,-2:] # slicing\n",
        "# df.iloc[:,1::2] # striding\n",
        "# df.iloc[:,[True,True,False,False]] # list of bools"
      ],
      "id": "d9520919"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Method 3: `df.loc[:,]` + column name, list of column names, slicing by\n",
        "column name ($\\star$), striding by column name ($\\star$), list of bools"
      ],
      "id": "41fdfa89-ddfb-4b74-a4b4-da93f110c81c"
    },
    {
      "cell_type": "code",
      "execution_count": 89,
      "metadata": {},
      "outputs": [],
      "source": [
        "# df.loc[:,'att'] # column name\n",
        "# df.loc[:,['att']] # list of column names\n",
        "# df.loc[:,['att','rep']] # list of column names\n",
        "# df.loc[:,'rep':'mid'] # slicing by column name\n",
        "# df.loc[:,'rep'::2] # striding by column name\n",
        "# df.loc[:,[True,False,False,True]] # list of bools"
      ],
      "id": "372a1c69"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Selecting rows\n",
        "\n",
        "**Here the code reads best if you think of `df` as a nested list.**\n",
        "\n",
        "`-` Method 1: `df.iloc[]` + integer, list of integers, range, slicing,\n",
        "striding, list of bools"
      ],
      "id": "e68a1c98-07d9-4607-aad2-5051004d3ea9"
    },
    {
      "cell_type": "code",
      "execution_count": 75,
      "metadata": {},
      "outputs": [],
      "source": [
        "# df.iloc[0] # integer\n",
        "# df.iloc[[0]] # list of integers\n",
        "# df.iloc[[0,1]] # list of integers\n",
        "# df.iloc[range(2)] # range\n",
        "# df.iloc[-2:] # slicing\n",
        "# df.iloc[1::2] # striding\n",
        "# df.iloc[[True]+[False]*19] # list of bools\n",
        "# df.iloc[list(df['att']>70)] # list of bools"
      ],
      "id": "1e4d6efc"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Here the code reads best if you think of `df` as a 2-D array.**\n",
        "\n",
        "`-` Method 1: `df.iloc[,:]` + integer, list of integers, range, slicing,\n",
        "striding, list of bools"
      ],
      "id": "6f885951-11f8-4a71-8559-742b2f390fae"
    },
    {
      "cell_type": "code",
      "execution_count": 76,
      "metadata": {},
      "outputs": [],
      "source": [
        "# df.iloc[0,:] # integer\n",
        "# df.iloc[[0],:] # list of integers\n",
        "# df.iloc[[0,1],:] # list of integers\n",
        "# df.iloc[range(2),:] # range\n",
        "# df.iloc[-2:,:] # slicing\n",
        "# df.iloc[1::2,:] # striding\n",
        "# df.iloc[[True]+[False]*19,:] # list of bools\n",
        "# df.iloc[list(df['att']>70),:] # list of bools"
      ],
      "id": "9ab01095"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Method 2: `df.loc[,:]` + index label, list of index labels, slicing by\n",
        "index label ($\\star$), striding by index label ($\\star$), list of bools"
      ],
      "id": "02900870-9b63-4bbe-8406-5970d171b515"
    },
    {
      "cell_type": "code",
      "execution_count": 77,
      "metadata": {},
      "outputs": [],
      "source": [
        "# df.loc['2022-12380',:] # index label\n",
        "# df.loc[['2022-12380','2022-12370'],:] # list of index labels\n",
        "# df.loc['2022-12452':,:] # slicing by index label\n",
        "# df.loc['2022-12380'::3,:] # striding by index label\n",
        "# df.loc[list(df['att']>70),:] # list of bools\n",
        "# df.loc[df['att']>70,:] # bool Series"
      ],
      "id": "1fe348b3"
    },
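    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One detail worth checking for the starred items: a label slice with `loc`\n",
        "includes its end label, whereas a positional slice with `iloc` excludes\n",
        "its stop position. A minimal sketch on a hypothetical frame; `toy` and\n",
        "its row labels are assumptions, not the data used above."
      ],
      "id": "added-label-slice-md"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "# hypothetical toy frame with string row labels\n",
        "toy = pd.DataFrame({'att': [50, 90, 55, 100]},\n",
        "                   index=['s1', 's2', 's3', 's4'])\n",
        "\n",
        "by_label = toy.loc['s1':'s3', :]   # label slice: the end label 's3' is included\n",
        "by_position = toy.iloc[0:2, :]     # positional slice: position 2 is excluded\n",
        "strided = toy.loc['s2'::2, :]      # start at label 's2', then take every 2nd row\n",
        "\n",
        "print(len(by_label), len(by_position), list(strided.index))"
      ],
      "id": "added-label-slice-code"
    },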
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Code I would rather you not use\n",
        "\n",
        "`-` Code I avoid, #1:"
      ],
      "id": "8bcb7933-7434-45bd-b871-5ebfdbeb3348"
    },
    {
      "cell_type": "code",
      "execution_count": 78,
      "metadata": {},
      "outputs": [],
      "source": [
        "df['2022-12380':'2022-12370']"
      ],
      "id": "53b21575-4783-4053-a4c7-b720724a9be5"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This makes it look as if the following should work too.."
      ],
      "id": "720ba7a1-d1cf-4f21-af93-f0965ebff27d"
    },
    {
      "cell_type": "code",
      "execution_count": 79,
      "metadata": {},
      "outputs": [],
      "source": [
        "df['2022-12380'] # KeyError: df[...] with a single label looks up a column, not a row"
      ],
      "id": "4187a4b7-83b8-49e4-a469-32b633b854e3"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Code I avoid, #2: when using a list of bools, avoid iloc if you can"
      ],
      "id": "03c76d52-b66c-4365-b007-f73e65d10fea"
    },
    {
      "cell_type": "code",
      "execution_count": 83,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.iloc[list(df['att']<80),:]"
      ],
      "id": "860eceda-6630-4e23-9433-458b40a82d50"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This makes it look as if the following should work too.."
      ],
      "id": "ff17050b-f0c1-4884-8ffd-4b6b2624f76e"
    },
    {
      "cell_type": "code",
      "execution_count": 84,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.iloc[df['att']<80,:] # raises an error: iloc does not accept a bool Series as a mask"
      ],
      "id": "14a17968-ddd3-4cad-8049-2e4f5aef884a"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` Note: a mistake I make all the time"
      ],
      "id": "098872d9-e937-4c9c-abb8-421892db3bdb"
    },
    {
      "cell_type": "code",
      "execution_count": 86,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.loc['att'] # KeyError: 'att' is a column name, but a single key in loc is looked up among the row labels"
      ],
      "id": "9e12c880-fe18-440b-93dc-9fe88648a98f"
    },
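    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A small sketch of why this fails and what to write instead: with a single\n",
        "key, `loc` searches the row labels, so a column name has to go in the\n",
        "second position. The frame `toy` and the flag `failed` are hypothetical\n",
        "names used only for this illustration."
      ],
      "id": "added-loc-keyerror-md"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "# hypothetical toy frame: 'att' and 'mid' are columns, 's1' and 's2' are row labels\n",
        "toy = pd.DataFrame({'att': [50, 90], 'mid': [10, 30]},\n",
        "                   index=['s1', 's2'])\n",
        "\n",
        "try:\n",
        "    toy.loc['att']        # a single key is looked up among the row labels\n",
        "    failed = False\n",
        "except KeyError:\n",
        "    failed = True         # 'att' is not a row label\n",
        "\n",
        "col = toy.loc[:, 'att']   # all rows, then the column named 'att'\n",
        "print(failed, col.tolist())"
      ],
      "id": "added-loc-keyerror-code"
    },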
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# HW\n",
        "\n",
        "`(1)` Declare the following DataFrame."
      ],
      "id": "ec7d8e4e-3688-4408-9ddf-1d01805e58be"
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {},
      "outputs": [],
      "source": [
        "from IPython.display import HTML \n",
        "HTML('<table border=\"1\" class=\"dataframe\">\\n  <thead>\\n    <tr style=\"text-align: right;\">\\n      <th></th>\\n      <th>A</th>\\n      <th>B</th>\\n    </tr>\\n  </thead>\\n  <tbody>\\n    <tr>\\n      <th>0</th>\\n      <td>1</td>\\n      <td>-2</td>\\n    </tr>\\n    <tr>\\n      <th>1</th>\\n      <td>2</td>\\n      <td>-3</td>\\n    </tr>\\n    <tr>\\n      <th>2</th>\\n      <td>3</td>\\n      <td>-4</td>\\n    </tr>\\n  </tbody>\\n</table>')"
      ],
      "id": "2f7bae7c-1cff-4ce3-8386-202b8fd96bce"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Solution)"
      ],
      "id": "ed7286d4-65e4-42ba-af47-883fc21b74ae"
    },
    {
      "cell_type": "code",
      "execution_count": 54,
      "metadata": {},
      "outputs": [],
      "source": [
        "df = pd.DataFrame({'A':[1,2,3], 'B':[-2,-3,-4]})\n",
        "df"
      ],
      "id": "9be34f50-3b9c-4904-b49d-a9b302a37976"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`(2)` Rename the columns to X1 and X2. The output should look like the following."
      ],
      "id": "e42c1de5-5715-4169-b6a9-1132e647ca88"
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {},
      "outputs": [],
      "source": [
        "from IPython.display import HTML \n",
        "HTML('<table border=\"1\" class=\"dataframe\">\\n  <thead>\\n    <tr style=\"text-align: right;\">\\n      <th></th>\\n      <th>X1</th>\\n      <th>X2</th>\\n    </tr>\\n  </thead>\\n  <tbody>\\n    <tr>\\n      <th>0</th>\\n      <td>1</td>\\n      <td>-2</td>\\n    </tr>\\n    <tr>\\n      <th>1</th>\\n      <td>2</td>\\n      <td>-3</td>\\n    </tr>\\n    <tr>\\n      <th>2</th>\\n      <td>3</td>\\n      <td>-4</td>\\n    </tr>\\n  </tbody>\\n</table>')"
      ],
      "id": "a207bba6-d4aa-499f-96c5-0705d557b5ea"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Solution)"
      ],
      "id": "5357e6c9-ea14-4c0b-90b7-39b64ddfb06f"
    },
    {
      "cell_type": "code",
      "execution_count": 55,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.columns = pd.Index(['X1','X2'])\n",
        "df"
      ],
      "id": "8eaa97ca-5e4d-478d-ac8e-7c093cd72cfb"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`(3)-(5)` Consider the following data."
      ],
      "id": "feded8e0-25f3-47d1-b71c-700a9a328b62"
    },
    {
      "cell_type": "code",
      "execution_count": 56,
      "metadata": {},
      "outputs": [],
      "source": [
        "df = pd.DataFrame(np.random.normal(size=(100,5)),columns=list('ABCDE'))\n",
        "df"
      ],
      "id": "da1b084b-0a31-4a72-9b0d-93e6841f3543"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`(3)` Select columns B and D.\n",
        "\n",
        "(Solution)"
      ],
      "id": "92239cb9-32a3-4e6f-9d12-5b0d0eb35679"
    },
    {
      "cell_type": "code",
      "execution_count": 57,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.loc[:,['B','D']]"
      ],
      "id": "432b2702-4331-480e-8da6-365af7eb7b56"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`(4)` Print the last 10 rows.\n",
        "\n",
        "(Solution)"
      ],
      "id": "e770bcc0-521e-4490-9439-763e1575356d"
    },
    {
      "cell_type": "code",
      "execution_count": 58,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.iloc[-10:]"
      ],
      "id": "ebd8a9e0-762d-480c-934b-85c5b53f624a"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`(5)` Print the first 10 rows of columns A and B.\n",
        "\n",
        "(Solution)"
      ],
      "id": "3a69525d-420a-4ef6-8194-ccb50d36a2db"
    },
    {
      "cell_type": "code",
      "execution_count": 60,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.loc[:,['A','B']].iloc[:10]"
      ],
      "id": "ebc460fd-104b-499f-8113-e42971fad26e"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`(6)-(9)` Consider the following data."
      ],
      "id": "2e7aceee-2661-4abe-8eae-2291074ca7f7"
    },
    {
      "cell_type": "code",
      "execution_count": 63,
      "metadata": {},
      "outputs": [],
      "source": [
        "df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie.csv')\n",
        "df"
      ],
      "id": "d58493a9-833e-48c5-81bf-4e07f8190dc7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`(6)` Count how many columns this DataFrame has.\n",
        "\n",
        "**hint**: check the `len` of `df.columns`.\n",
        "\n",
        "(Solution)"
      ],
      "id": "8faec3f1-0109-4680-b829-c791fe5a0999"
    },
    {
      "cell_type": "code",
      "execution_count": 64,
      "metadata": {},
      "outputs": [],
      "source": [
        "len(df.columns)"
      ],
      "id": "d46faf99-8272-463c-9791-4063ece75d09"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`(7)` Count how many columns of the DataFrame have a name that starts with c or d.\n",
        "\n",
        "**hint:** Study the code below."
      ],
      "id": "13b2e7d1-1cec-4862-8cd0-fb3cf5872a5b"
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {},
      "outputs": [],
      "source": [
        "lst = ['color', 'director_name', 'num_critic_for_reviews', 'duration'] \n",
        "[l for l in lst if l[0]=='c' or l[0]=='d']"
      ],
      "id": "b40a5e73-e630-4134-a2ca-3ab3633db8a7"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Solution)"
      ],
      "id": "646bf59e-62b9-4928-a3c0-afdd57176c64"
    },
    {
      "cell_type": "code",
      "execution_count": 67,
      "metadata": {},
      "outputs": [],
      "source": [
        "[l for l in df.columns if l[0]=='c' or l[0]=='d']"
      ],
      "id": "2b1fb859-95f0-4a64-9781-1af4a3a1df11"
    },
    {
      "cell_type": "code",
      "execution_count": 68,
      "metadata": {},
      "outputs": [],
      "source": [
        "len([l for l in df.columns if l[0]=='c' or l[0]=='d'])"
      ],
      "id": "34d20334-328b-419f-92ca-920a37b2e1ed"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`(8)` Count how many columns of this DataFrame have a name containing the word 'actor'.\n",
        "\n",
        "(Solution)"
      ],
      "id": "444c898f-1f8a-44ce-9cc2-52e76415cc6b"
    },
    {
      "cell_type": "code",
      "execution_count": 70,
      "metadata": {},
      "outputs": [],
      "source": [
        "[l for l in df.columns if 'actor' in l]"
      ],
      "id": "a2575988-4527-4123-808a-bb5dd6cf11e8"
    },
    {
      "cell_type": "code",
      "execution_count": 72,
      "metadata": {},
      "outputs": [],
      "source": [
        "len([l for l in df.columns if 'actor' in l])"
      ],
      "id": "13ff878e-2f3d-46f4-a33d-76860a3da405"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`(9)` Print the columns of this DataFrame whose names contain the word 'actor'.\n",
        "\n",
        "**hint**: Study the code below."
      ],
      "id": "82b44ce7-1e16-4702-8b07-dedca703fb50"
    },
    {
      "cell_type": "code",
      "execution_count": 39,
      "metadata": {},
      "outputs": [],
      "source": [
        "_df = pd.DataFrame(\n",
        "    np.random.randint(1,200,size=(100,2)),\n",
        "    columns=['director_facebook_likes', 'actor_3_facebook_likes']\n",
        ")\n",
        "_df"
      ],
      "id": "b486badb-d613-474b-bf71-c495e843156c"
    },
    {
      "cell_type": "code",
      "execution_count": 40,
      "metadata": {},
      "outputs": [],
      "source": [
        "_df.loc[:,['actor' in colname for colname in _df.columns]]"
      ],
      "id": "ecd31991-4ea5-4427-a5a2-7e69a6cb9331"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Solution)"
      ],
      "id": "4f691ed1-7e87-4b10-8279-d50748ab0e01"
    },
    {
      "cell_type": "code",
      "execution_count": 81,
      "metadata": {},
      "outputs": [],
      "source": [
        "df.loc[:, ['actor' in l for l in df.columns]]"
      ],
      "id": "fda29dac-19ea-410e-84ef-68a44692fb70"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3 (ipykernel)",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "codemirror_mode": {
        "name": "ipython",
        "version": "3"
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.16"
    }
  }
}