{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 04wk-1: 감성분석 파고들기 (2)\n",
        "\n",
        "최규빈  \n",
        "2024-09-27\n",
        "\n",
        "<a href=\"https://colab.research.google.com/github/guebin/MP2024/blob/main/posts/04wk-1.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" style=\"text-align: left\"></a>\n",
        "\n",
        "# 1. 강의영상\n",
        "\n",
        "<https://youtu.be/playlist?list=PLQqh36zP38-yH7aA3RY7GNf1qNIgU6CDs&si=e4-tQHb8cD0FhrUW>\n",
        "\n",
        "# 2. Imports"
      ],
      "id": "a1252f5d-32f8-4a70-8c1b-771f3a4c6b63"
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "/home/cgb3/anaconda3/envs/hf/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
            "  from .autonotebook import tqdm as notebook_tqdm"
          ]
        }
      ],
      "source": [
        "import datasets\n",
        "import transformers\n",
        "import evaluate\n",
        "import numpy as np\n",
        "import torch # 파이토치"
      ],
      "id": "cell-5"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 3. 이전시간 요약\n",
        "\n",
        "`-` 이전시간의 내용중 이번시간에 기억할것들을 요약\n",
        "\n",
        "`-` `DatasetDict`: 임의의 자료에 대한 `DatasetDict` 오브젝트 만들기"
      ],
      "id": "121fd6d5-4cac-4be0-8bfc-3b2088121665"
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {},
      "outputs": [],
      "source": [
        "train_dict = {\n",
        "    'text': [\n",
        "        \"I prefer making decisions based on logic and objective facts.\",\n",
        "        \"I always consider how others might feel when making a decision.\",\n",
        "        \"Data and analysis drive most of my decisions.\",\n",
        "        \"I rely on my empathy and personal values to guide my choices.\"\n",
        "    ],\n",
        "    'label': [0, 1, 0, 1]  # 0은 T(사고형), 1은 F(감정형)\n",
        "}\n",
        "\n",
        "test_dict = {\n",
        "    'text': [\n",
        "        \"I find it important to weigh all the pros and cons logically.\",\n",
        "        \"When making decisions, I prioritize harmony and people's emotions.\"\n",
        "    ],\n",
        "    'label': [0, 1]  # 0은 T(사고형), 1은 F(감정형)\n",
        "}"
      ],
      "id": "cell-9"
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {},
      "outputs": [],
      "source": [
        "train_data = datasets.Dataset.from_dict(train_dict)\n",
        "test_data = datasets.Dataset.from_dict(test_dict)\n",
        "나의데이터 = datasets.dataset_dict.DatasetDict({'train':train_data, 'test':test_data})"
      ],
      "id": "cell-10"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` `토크나이저`: 토크나이저는 “`Str` $\\to$ `Dict`” 인 함수이다."
      ],
      "id": "49c1b30d-47fa-4ab2-9a4e-eb43b567c464"
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "/home/cgb3/anaconda3/envs/hf/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n",
            "  warnings.warn("
          ]
        }
      ],
      "source": [
        "데이터전처리하기1 = 토크나이저 = transformers.AutoTokenizer.from_pretrained(\"distilbert/distilbert-base-uncased\")"
      ],
      "id": "cell-12"
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {},
      "outputs": [],
      "source": [
        "토크나이저(나의데이터['train']['text'][0])"
      ],
      "id": "cell-13"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` `토크나이저`: 토크나이저의 “`Str` $\\to$ `Dict`” 인 함수는 배치처리가\n",
        "가능하다."
      ],
      "id": "113f1a47-3133-435b-aa49-c5530601c17c"
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {},
      "outputs": [],
      "source": [
        "토크나이저(나의데이터['train']['text'])"
      ],
      "id": "cell-15"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` `인공지능`: 인공지능은 많은 파라메터를 포함하고 있는 어떠한\n",
        "물체이다."
      ],
      "id": "0e8d99ed-b505-4cc0-84fb-614257cf8e4a"
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']\n",
            "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference."
          ]
        }
      ],
      "source": [
        "인공지능 = model = transformers.AutoModelForSequenceClassification.from_pretrained(\n",
        "    \"distilbert/distilbert-base-uncased\",\n",
        "    num_labels=2\n",
        ")\n",
        "인공지능.classifier.weight"
      ],
      "id": "cell-17"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` `인공지능`: 인공지능의 파라메터는 변화할 수 있으며, loss가 더\n",
        "작은쪽으로 파라메터를 변화시키는 과정을 “학습”이라고 부른다.\n",
        "\n",
        "`-` `인공지능`: 인공지능은 “`**Dict` $\\to$\n",
        "`transformers.modeling_outputs.SequenceClassifierOutput`”인 함수이다.\n",
        "그런데 쓰기 까다롭다.\n",
        "\n",
        "-   `1`. `Dict`에는 특정한 key를 포함하고 있어야한다. (`input_ids`,\n",
        "    `attention_mask`)\n",
        "-   `2`. key에 대응하는 숫자는 파이토치 텐서형태이어야 한다. (`3`.\n",
        "    따라서 (m,n)꼴의 차원을 **반드시** 가져야 한다)\n",
        "-   `4`. `Dict`에 `labels`이 (텐서형으로) 포함된 경우 loss가 계산된다.\n",
        "    (그리고 이걸 계산해야지 학습을 할 수 있음)"
      ],
      "id": "16b29b26-5b0f-4439-a66a-1a0840d960b0"
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {},
      "outputs": [],
      "source": [
        "토크나이저(나의데이터['train']['text'][0])"
      ],
      "id": "cell-20"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`#` 인공지능의 입력예시1 – 텐서형으로 정리된 텍스트자료만 있는 경우"
      ],
      "id": "b7e5cf36-77e5-4f84-bea4-347462fe3266"
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {},
      "outputs": [],
      "source": [
        "인공지능입력1 = {\n",
        "        'input_ids': torch.tensor([[ 101, 1045, 102],[101, 9544, 102]]), \n",
        "        'attention_mask': torch.tensor([[1, 1, 1],[1, 1, 1]]) # 생략가능\n",
        "}"
      ],
      "id": "cell-22"
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {},
      "outputs": [],
      "source": [
        "인공지능(**인공지능입력1)"
      ],
      "id": "cell-23"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`#`\n",
        "\n",
        "`#` 인공지능의 입력예시2 – 텐서형으로 정리된 텍스트자료와 `labels`이\n",
        "같이 있는경우"
      ],
      "id": "757b88d4-b511-4799-a040-e60102746436"
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {},
      "outputs": [],
      "source": [
        "인공지능입력2 = {\n",
        "        'input_ids': torch.tensor([[ 101, 1045, 102],[101, 9544, 102]]), \n",
        "        'attention_mask': torch.tensor([[1, 1, 1],[1, 1, 1]]), # 생략가능\n",
        "        'labels': torch.tensor([1, 0]) # 생략가능\n",
        "}"
      ],
      "id": "cell-26"
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {},
      "outputs": [],
      "source": [
        "인공지능(**인공지능입력2)"
      ],
      "id": "cell-27"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`#`\n",
        "\n",
        "`-` `인공지능`: 인공지능의 출력결과[1] 에서 확률/예측값을 추출하는 방법\n",
        "\n",
        "-   인공지능의 출력결과 $\\to$ 로짓 $\\to$ 확률 $\\to$ 인공지능의예측\n",
        "-   인공지능의 출력결과 $\\to$ 로짓 $\\to$ 인공지능의예측\n",
        "\n",
        "**예제1:** 인공지능의 출력결과에서 인공지능의 예측값을 계산하자. – 로짓\n",
        "$\\to$ 인공지능의예측\n",
        "\n",
        "[1] `transformers.modeling_outputs.SequenceClassifierOutput` 자료형을\n",
        "가짐"
      ],
      "id": "8a5bbe66-c001-4d91-84c1-6f3a8526ba6b"
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {},
      "outputs": [],
      "source": [
        "인공지능출력 = 인공지능(**인공지능입력2)\n",
        "로짓 = 인공지능출력.logits\n",
        "로짓"
      ],
      "id": "cell-31"
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {},
      "outputs": [],
      "source": [
        "로짓.argmax(axis=1)"
      ],
      "id": "cell-32"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**예제2:** 인공지능의 출력결과에서 인공지능의 예측확률을 계산하자 – 로짓\n",
        "$\\to$ 확률"
      ],
      "id": "c9a2e4bb-e6e3-4c16-a196-a683cdaea3f1"
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {},
      "outputs": [],
      "source": [
        "인공지능출력 = 인공지능(**인공지능입력2)\n",
        "로짓 = 인공지능출력.logits\n",
        "로짓"
      ],
      "id": "cell-34"
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {},
      "outputs": [],
      "source": [
        "torch.exp(로짓) / torch.exp(로짓).sum(axis=1)"
      ],
      "id": "cell-35"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`확률게산하기`라는 함수선언하여 외의 과정을 단순화 하기"
      ],
      "id": "d48161a7-108e-460b-a1a1-211b8299f449"
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {},
      "outputs": [],
      "source": [
        "def 확률계산하기(인공지능출력):\n",
        "    로짓 = 인공지능출력.logits\n",
        "    return torch.exp(로짓) / torch.exp(로짓).sum(axis=1)"
      ],
      "id": "cell-37"
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {},
      "outputs": [],
      "source": [
        "확률계산하기(인공지능(**인공지능입력2))"
      ],
      "id": "cell-38"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`#`\n",
        "\n",
        "# 4. 데이터전처리하기2\n",
        "\n",
        "`-` 아래코드를 파고들어보자.\n",
        "\n",
        "``` python\n",
        "def 데이터전처리하기2(examples):\n",
        "    return 데이터전처리하기1(examples[\"text\"], truncation=True)\n",
        "전처리된나의데이터 = 나의데이터.map(lambda x: {'dummy': '메롱'})\n",
        "```\n",
        "\n",
        "`-` 인공지능의 입력으로 가정된 두가지 경우: (1) 토크나이징결과, (2)\n",
        "토크나이징결과 + label\n",
        "\n",
        "-   1.  와 같은 형태의 입력을 정리하기 위해서는,\n",
        "        `{'text':[...], 'label':[...]}` 이러한 형태로 정리된 자료를\n",
        "        `{'text':[...], 'label':[...], 'input_ids':[...], 'attention_mask':[...]}`\n",
        "        이러한 형태로 만들어야 하는데 이를 쉽게처리해주는 함수가 바로\n",
        "        `나의데이터.map()` 임.\n",
        "\n",
        "-   `나의데이터.map()`의 도움말을 확인해본 결과 map은 (1) function\n",
        "    자체를 입력으로 받는데 (2) function 은 Dict를 입력으로 받고, Dict를\n",
        "    출력하는 함수이어야 한다는 사실을 확인할 수 있었음.\n",
        "\n",
        "`-` 개념1: Hugging Face의 `datasets` 라이브러리에서 제공하는\n",
        "`datasets.dataset_dict.DatasetDict`은, 요소들의 변환에 특화된\n",
        "`map`이라는 메소드가(=함수가) 내장되어있다.\n",
        "\n",
        "`# 예시1` – 메롱"
      ],
      "id": "9497fc45-6a32-4ddb-8b7b-4776684d2493"
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Map: 100%|██████████| 4/4 [00:00<00:00, 1570.02 examples/s]\n",
            "Map: 100%|██████████| 2/2 [00:00<00:00, 1216.45 examples/s]"
          ]
        }
      ],
      "source": [
        "전처리된나의데이터 = 나의데이터.map(lambda dct: {'dummy':'메롱'})"
      ],
      "id": "cell-45"
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {},
      "outputs": [],
      "source": [
        "전처리된나의데이터['train'][0]"
      ],
      "id": "cell-46"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`#`\n",
        "\n",
        "`# 예시2` – Text의 length 계산"
      ],
      "id": "2aab77fb-e911-4fc5-a827-293f9dbf79af"
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Map: 100%|██████████| 4/4 [00:00<00:00, 2154.79 examples/s]\n",
            "Map: 100%|██████████| 2/2 [00:00<00:00, 1365.33 examples/s]"
          ]
        }
      ],
      "source": [
        "전처리된나의데이터 = 나의데이터.map(lambda dct: {'dummy':'메롱', 'length':len(dct['text'])})"
      ],
      "id": "cell-49"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`#`\n",
        "\n",
        "`# 예시2` – 토크나이징결과"
      ],
      "id": "b4e37123-d0ee-48c9-a92d-7ff24e7d5001"
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Map: 100%|██████████| 4/4 [00:00<00:00, 1274.67 examples/s]\n",
            "Map: 100%|██████████| 2/2 [00:00<00:00, 966.10 examples/s]"
          ]
        }
      ],
      "source": [
        "전처리된나의데이터 = 나의데이터.map(lambda dct: 토크나이저(dct[\"text\"], truncation=True))"
      ],
      "id": "cell-52"
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {},
      "outputs": [],
      "source": [
        "전처리된나의데이터['train'][0]"
      ],
      "id": "cell-53"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`#`\n",
        "\n",
        "# 5. 데이터콜렉터\n",
        "\n",
        "`-` 전터리된 데이터가 인공지능은 마음에 들지 않는다."
      ],
      "id": "7fc64dd6-99e4-41ae-a900-3e53d8d22d30"
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {},
      "outputs": [],
      "source": [
        "전처리된나의데이터['train'][2]"
      ],
      "id": "cell-57"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "-   이유: 인공지능은 `torch.tensor()` 자료형을 가지며 (n,m)의 행렬로\n",
        "    정리된 “묶음” 형태의 자료형을 기대함.\n",
        "\n",
        "`-` 자료처리과정요약\n",
        "\n",
        "|                |    주어진자료     | $\\overset{tokenizer,map}{\\Longrightarrow}$ 전처리된자료 | $\\overset{datacollector}{\\Longrightarrow}$더전처리된자료 |\n",
        "|:----------------:|:----------------:|:----------------:|:----------------:|\n",
        "|  Dict의 Keys   |  `text`,`label`   |         `input_ids`, `attention_mask`, `label`          |         `input_ids`, `attention_mask`, `labels`          |\n",
        "|   자료의형태   |    텍스트,라벨    |                   숫자화 O, 행렬화 X                    |                    숫자화 O, 행렬화 O                    |\n",
        "| `torch.tensor` |        \\-         |                            X                            |                            O                             |\n",
        "|    미니배치    |        \\-         |                            X                            |                            O                             |\n",
        "| 패딩/동적패딩  |        \\-         |                            X                            |                            O                             |\n",
        "|    예측할때    | 강인공지능의 입력 |                     트레이너의 입력                     |                     인공지능의 입력                      |\n",
        "\n",
        "`-` 데이터콜렉터에서 우리가 기대하는 것:\n",
        "\n",
        "-   자료의 형태가 \\[Dict,Dict,Dict,Dict\\] 로 되어있는 경우[1] (4,??)\n",
        "    shape 텐서의 `input_ids`, (4,??) shape 텐서의 `attention_mask`,\n",
        "    그리고 (4,) shape 텐서의 `labels`로 변환해줌.\n",
        "\n",
        "`-` 데이터콜렉터 사용방법\n",
        "\n",
        "[1] `전처리된나의데이터['train']`이 이러한 형태로 되어있음"
      ],
      "id": "95ee820b-fe5f-45d8-8dd7-3496fbf48291"
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "metadata": {},
      "outputs": [],
      "source": [
        "데이터콜렉터 = transformers.DataCollatorWithPadding(tokenizer=토크나이저, return_tensors='pt')\n",
        "# 데이터콜렉터?\n",
        "# 도움말에서 당장필요한 것: 입력형태가 [Dict, Dict, Dict, ... ]"
      ],
      "id": "cell-63"
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {},
      "outputs": [],
      "source": [
        "데이터콜렉터(\n",
        "    [\n",
        "        dict(label=0, input_ids=[101,1045,102],attention_mask=[1,1,1]),\n",
        "        dict(label=1, input_ids=[101,1045,9544,102],attention_mask=[1,1,1,1]),\n",
        "        dict(label=0, input_ids=[101,1045,9544,9544,102],attention_mask=[1,1,1,1,1])\n",
        "        \n",
        "    ]\n",
        ")"
      ],
      "id": "cell-64"
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {},
      "outputs": [],
      "source": [
        "인공지능(**데이터콜렉터(\n",
        "    [\n",
        "        dict(label=0, input_ids=[101,1045,102],attention_mask=[1,1,1]),\n",
        "        dict(label=1, input_ids=[101,1045,9544,102],attention_mask=[1,1,1,1]),\n",
        "        dict(label=0, input_ids=[101,1045,9544,9544,102],attention_mask=[1,1,1,1,1])\n",
        "        \n",
        "    ]\n",
        "))"
      ],
      "id": "cell-65"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 6. 평가하기\n",
        "\n",
        "`-` `accuracy.compute`의 기능"
      ],
      "id": "0d424029-711e-405f-a1c4-82e6f73ce16a"
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {},
      "outputs": [],
      "source": [
        "accuracy = evaluate.load(\"accuracy\")"
      ],
      "id": "cell-68"
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "metadata": {},
      "outputs": [],
      "source": [
        "accuracy.compute(predictions=[0,0,0], references=[0,0,0])"
      ],
      "id": "cell-69"
    },
    {
      "cell_type": "code",
      "execution_count": 31,
      "metadata": {},
      "outputs": [],
      "source": [
        "accuracy.compute(predictions=[1,1,1], references=[0,0,0])"
      ],
      "id": "cell-70"
    },
    {
      "cell_type": "code",
      "execution_count": 32,
      "metadata": {},
      "outputs": [],
      "source": [
        "accuracy.compute(predictions=[0,1,0], references=[0,0,0])"
      ],
      "id": "cell-71"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` `평가하기` 함수의 내용"
      ],
      "id": "ff029cf1-aa7e-4ca5-beed-d08eb95bedda"
    },
    {
      "cell_type": "code",
      "execution_count": 33,
      "metadata": {},
      "outputs": [],
      "source": [
        "# def 평가하기(eval_pred):\n",
        "#     predictions, labels = eval_pred\n",
        "#     predictions = np.argmax(predictions, axis=1)\n",
        "#     accuracy = evaluate.load(\"accuracy\")\n",
        "#     return accuracy.compute(predictions=predictions, references=labels)"
      ],
      "id": "cell-73"
    },
    {
      "cell_type": "code",
      "execution_count": 34,
      "metadata": {},
      "outputs": [],
      "source": [
        "def 평가하기(eval_pred):\n",
        "    로짓, 실제정답 = eval_pred\n",
        "    인공지능의예측 = np.argmax(로짓, axis=1)\n",
        "    accuracy = evaluate.load(\"accuracy\")\n",
        "    return accuracy.compute(predictions=인공지능의예측, references=실제정답)"
      ],
      "id": "cell-74"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 7. 트레이너\n",
        "\n",
        "`-` 이전코드"
      ],
      "id": "0d7c5ea9-1e6f-4149-baed-cce1d17ceb63"
    },
    {
      "cell_type": "code",
      "execution_count": 35,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "/home/cgb3/anaconda3/envs/hf/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n",
            "  warnings.warn(\n",
            "Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']\n",
            "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference."
          ]
        }
      ],
      "source": [
        "## Step1 \n",
        "데이터불러오기 = datasets.load_dataset\n",
        "토크나이저 = transformers.AutoTokenizer.from_pretrained(\"distilbert/distilbert-base-uncased\") \n",
        "## Step2 \n",
        "인공지능생성하기 = transformers.AutoModelForSequenceClassification.from_pretrained\n",
        "## Step3 \n",
        "데이터콜렉터 = transformers.DataCollatorWithPadding(tokenizer=토크나이저)\n",
        "def 평가하기(eval_pred):\n",
        "    predictions, labels = eval_pred\n",
        "    predictions = np.argmax(predictions, axis=1)\n",
        "    accuracy = evaluate.load(\"accuracy\")\n",
        "    return accuracy.compute(predictions=predictions, references=labels)\n",
        "트레이너세부지침생성기 = transformers.TrainingArguments\n",
        "트레이너생성기 = transformers.Trainer\n",
        "## Step4 \n",
        "강인공지능생성하기 = transformers.pipeline\n",
        "#---#\n",
        "## Step1 \n",
        "데이터 = 데이터불러오기('imdb')\n",
        "전처리된데이터 = 데이터.map(lambda dct: 토크나이저(dct[\"text\"], truncation=True),batched=True)\n",
        "전처리된훈련자료, 전처리된검증자료 = 전처리된데이터['train'], 전처리된데이터['test']\n",
        "## Step2 \n",
        "torch.manual_seed(43052)\n",
        "인공지능 = 인공지능생성하기(\"distilbert/distilbert-base-uncased\", num_labels=2)"
      ],
      "id": "cell-77"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## A. 트레이너의 제1역할 – CPU에서 GPU로..\n",
        "\n",
        "### `#` 트레이너 생성전\n",
        "\n",
        "`-` 인공지능의 파라메터 상태확인 1"
      ],
      "id": "10db822e-4fb6-4c33-9fc0-7121b3a5bf09"
    },
    {
      "cell_type": "code",
      "execution_count": 36,
      "metadata": {},
      "outputs": [],
      "source": [
        "인공지능.classifier.weight"
      ],
      "id": "cell-81"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "-   중요한내용1: 숫자들 = 초기숫자들\n",
        "-   중요한내용2: 숫자들이 CPU에 존재한다는 의미\n",
        "\n",
        "`-` 인공지능을 이용한 예측 1"
      ],
      "id": "26b2b574-0b67-4f11-994c-dffecd3d612b"
    },
    {
      "cell_type": "code",
      "execution_count": 37,
      "metadata": {},
      "outputs": [],
      "source": [
        "확률계산하기(인공지능(**데이터콜렉터([토크나이저(\"This movie was a huge disappointment.\")])))"
      ],
      "id": "cell-84"
    },
    {
      "cell_type": "code",
      "execution_count": 38,
      "metadata": {},
      "outputs": [],
      "source": [
        "확률계산하기(인공지능(**데이터콜렉터([토크나이저(\"This was a masterpiece.\")])))"
      ],
      "id": "cell-85"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### `#` 트레이너 생성후\n",
        "\n",
        "`-` 트레이너생성"
      ],
      "id": "4b889311-38b4-4092-ac54-889887f0ef81"
    },
    {
      "cell_type": "code",
      "execution_count": 39,
      "metadata": {},
      "outputs": [],
      "source": [
        "## Step3 \n",
        "트레이너세부지침 = 트레이너세부지침생성기(\n",
        "    output_dir=\"my_awesome_model\",\n",
        "    learning_rate=2e-5,\n",
        "    per_device_train_batch_size=16,\n",
        "    per_device_eval_batch_size=16,\n",
        "    num_train_epochs=2, # 전체문제세트를 2번 공부하라..\n",
        "    weight_decay=0.01,\n",
        "    eval_strategy=\"epoch\",\n",
        "    save_strategy=\"epoch\",\n",
        "    load_best_model_at_end=True,\n",
        "    push_to_hub=False,\n",
        ")\n",
        "트레이너 = 트레이너생성기(\n",
        "    model=인공지능,\n",
        "    args=트레이너세부지침,\n",
        "    train_dataset=전처리된훈련자료,\n",
        "    eval_dataset=전처리된검증자료,\n",
        "    tokenizer=토크나이저,\n",
        "    data_collator=데이터콜렉터,\n",
        "    compute_metrics=평가하기,\n",
        ")"
      ],
      "id": "cell-88"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` 인공지능의 파라메터 상태확인 2"
      ],
      "id": "9e85adf5-cce7-48eb-a969-76c32911947e"
    },
    {
      "cell_type": "code",
      "execution_count": 40,
      "metadata": {},
      "outputs": [],
      "source": [
        "인공지능.classifier.weight"
      ],
      "id": "cell-90"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "-   중요한내용1: 숫자들 = 초기숫자들\n",
        "-   중요한내용2: device=“cuda:0” // 숫자들이 GPU에 있다는 의미\n",
        "\n",
        "`-` 인공지능을 이용한 예측 2"
      ],
      "id": "956c578f-a01a-4f43-a43d-cebf5c33782b"
    },
    {
      "cell_type": "code",
      "execution_count": 41,
      "metadata": {},
      "outputs": [],
      "source": [
        "확률계산하기(인공지능(**데이터콜렉터([토크나이저(\"This movie was a huge disappointment.\")]).to(\"cuda:0\")))"
      ],
      "id": "cell-93"
    },
    {
      "cell_type": "code",
      "execution_count": 42,
      "metadata": {},
      "outputs": [],
      "source": [
        "확률계산하기(인공지능(**데이터콜렉터([토크나이저(\"This was a masterpiece.\")]).to(\"cuda:0\")))"
      ],
      "id": "cell-94"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> 트레이너의 제1역할: 트레이너는 생성과 동시에 하는역할이 있는데, 바로\n",
        "> 인공지능의 파라메터를 GPU에 올리는 것이다.\n",
        "\n",
        "## B. 트레이너의 제2역할 – 예측하기\n",
        "\n",
        "> 트레이너의 제2역할: `트레이너.predict()` 사용가능.\n",
        "> `트레이너.predict()`의 입력형태는 input_ids, attention_mask, label 이\n",
        "> 존재하는 `Dataset`\n",
        "\n",
        "`# 예제1` 트레이너를 이용한 예측"
      ],
      "id": "d74587c8-0187-48ed-88b1-aa3bb0dccbbc"
    },
    {
      "cell_type": "code",
      "execution_count": 43,
      "metadata": {},
      "outputs": [],
      "source": [
        "sample_dict = {\n",
        "    'text': [\"This movie was a huge disappointment.\"],\n",
        "    'label': [0],\n",
        "    'input_ids': [[101, 2023, 3185, 2001, 1037, 4121, 10520, 1012, 102]],\n",
        "    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1]]\n",
        "}\n",
        "sample_dataset = datasets.Dataset.from_dict(sample_dict)\n",
        "sample_dataset"
      ],
      "id": "cell-99"
    },
    {
      "cell_type": "code",
      "execution_count": 44,
      "metadata": {},
      "outputs": [],
      "source": [
        "트레이너.predict(sample_dataset)"
      ],
      "id": "cell-100"
    },
    {
      "cell_type": "code",
      "execution_count": 45,
      "metadata": {},
      "outputs": [],
      "source": [
        "logits = np.array([[-0.11731032,  0.02610314]])\n",
        "np.exp(logits)/ np.exp(logits).sum(axis=1)"
      ],
      "id": "cell-101"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`#`\n",
        "\n",
        "`# 예제2` – 트레이너를 이용하여 `train_data`, `test_data` 의 prediction\n",
        "값을 구하라."
      ],
      "id": "1603a91a-80a4-4338-9584-0b9e10c7330d"
    },
    {
      "cell_type": "code",
      "execution_count": 46,
      "metadata": {},
      "outputs": [],
      "source": [
        "트레이너.predict(전처리된데이터['train'])"
      ],
      "id": "cell-104"
    },
    {
      "cell_type": "code",
      "execution_count": 47,
      "metadata": {},
      "outputs": [],
      "source": [
        "트레이너.predict(전처리된데이터['test'])"
      ],
      "id": "cell-105"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`#`\n",
        "\n",
        "## C. 트레이너의 제3역할 – 학습 및 결과저장\n",
        "\n",
        "### `#` 학습"
      ],
      "id": "592589ce-a87e-4a23-8fb6-ebb207709737"
    },
    {
      "cell_type": "code",
      "execution_count": 127,
      "metadata": {},
      "outputs": [],
      "source": [
        "트레이너.train()"
      ],
      "id": "cell-109"
    },
    {
      "cell_type": "code",
      "execution_count": 128,
      "metadata": {},
      "outputs": [],
      "source": [
        "25000 / 16 "
      ],
      "id": "cell-110"
    },
    {
      "cell_type": "code",
      "execution_count": 129,
      "metadata": {},
      "outputs": [],
      "source": [
        "1563 * 2 "
      ],
      "id": "cell-111"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### `#` 학습후\n",
        "\n",
        "`-` 인공지능이 똑똑해졌을까?\n",
        "\n",
        "`-` 인공지능의 파라메터 상태확인 3"
      ],
      "id": "62fc23f3-3166-495f-b6a1-993abfe58681"
    },
    {
      "cell_type": "code",
      "execution_count": 130,
      "metadata": {},
      "outputs": [],
      "source": [
        "인공지능.classifier.weight"
      ],
      "id": "cell-115"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "*인공지능의 파라메터 상태확인 2와 비교삿*\n",
        "\n",
        "    Parameter containing:\n",
        "    tensor([[-0.0234,  0.0279,  0.0242,  ...,  0.0091, -0.0063, -0.0133],\n",
        "            [ 0.0087,  0.0007, -0.0099,  ...,  0.0183, -0.0007,  0.0295]],\n",
        "           device='cuda:0', requires_grad=True)\n",
        "\n",
        "*숫자들이 바뀐걸 확인 $\\to$ 뭔가 다른 계산결과를 준다는 의미겠지? $\\to$\n",
        "진짜 그런지 보자..*\n",
        "\n",
        "`-` 인공지능을 이용한 예측 3"
      ],
      "id": "8ad1ad0b-4cac-4721-9a77-121caba4618d"
    },
    {
      "cell_type": "code",
      "execution_count": 131,
      "metadata": {},
      "outputs": [],
      "source": [
        "확률계산하기(인공지능(**데이터콜렉터([토크나이저(\"This movie was a huge disappointment.\")]).to(\"cuda:0\")))"
      ],
      "id": "cell-120"
    },
    {
      "cell_type": "code",
      "execution_count": 132,
      "metadata": {},
      "outputs": [],
      "source": [
        "확률계산하기(인공지능(**데이터콜렉터([토크나이저(\"This was a masterpiece.\")]).to(\"cuda:0\")))"
      ],
      "id": "cell-121"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` 다시 트레이너를 이용하여 `train_data`, `test_data` 의 prediction\n",
        "값을 구해보자."
      ],
      "id": "1795d6db-7446-4677-9e70-1b0ead5497a2"
    },
    {
      "cell_type": "code",
      "execution_count": 95,
      "metadata": {},
      "outputs": [],
      "source": [
        "트레이너.predict(전처리된데이터['train'])"
      ],
      "id": "cell-123"
    },
    {
      "cell_type": "code",
      "execution_count": 97,
      "metadata": {},
      "outputs": [],
      "source": [
        "트레이너.predict(전처리된데이터['test'])"
      ],
      "id": "cell-124"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`-` 우리가 가져야할 생각: 신기하다 X // 노가다 많이 했구나.. O\n",
        "\n",
        "# 8. 파이프라인\n",
        "\n",
        "`-` 강인공지능?\n",
        "\n",
        "> ref: <https://zdnet.co.kr/view/?no=20160622145838>"
      ],
      "id": "3f436105-e1ec-4198-8cdb-d0767c7fa798"
    },
    {
      "cell_type": "code",
      "execution_count": 133,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU."
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[{'label': 'LABEL_0', 'score': 0.9885253310203552}]\n",
            "[{'label': 'LABEL_1', 'score': 0.978060781955719}]"
          ]
        }
      ],
      "source": [
        "강인공지능 = transformers.pipeline(\"sentiment-analysis\", model=\"my_awesome_model/checkpoint-1563\")\n",
        "print(강인공지능(\"This movie was a huge disappointment.\"))\n",
        "print(강인공지능(\"This was a masterpiece.\"))"
      ],
      "id": "cell-129"
    },
    {
      "cell_type": "code",
      "execution_count": 139,
      "metadata": {},
      "outputs": [],
      "source": [
        "확률계산하기(인공지능(**데이터콜렉터([토크나이저(\"This movie was a huge disappointment.\")]).to(\"cuda:0\")))"
      ],
      "id": "cell-130"
    },
    {
      "cell_type": "code",
      "execution_count": 140,
      "metadata": {},
      "outputs": [],
      "source": [
        "확률계산하기(인공지능(**데이터콜렉터([토크나이저(\"This was a masterpiece.\")]).to(\"cuda:0\")))"
      ],
      "id": "cell-131"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "hf",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "codemirror_mode": {
        "name": "ipython",
        "version": "3"
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.2"
    }
  }
}