Jekyll2023-05-27T05:41:53+00:00/feed.xmlCHLMLNLPWrite an awesome description for your new site here. You can edit this line in _config.yml. It will appear in your document head meta (for Google search results) and in your feed.xml site description.chloammeAdvice on reading research papers by Prof. Andrew Ng (CS230)2022-10-20T00:00:00+00:002022-10-20T00:00:00+00:00/2022/10/20/advice_on_reading_research_papers<p>This post contains my notes on the ‘Reading Research Papers’ part of the Stanford CS230 Deep Learning course lecture on YouTube <a href="https://youtu.be/733m6qBH-jI">URL</a>. The class is all about advice on how to master a body of literature around a topic you want to learn, so I’ve written down Prof. Ng’s blackboard notes and verbal comments.</p>
<ul>
<li>Compiled list of papers (+ Medium/blog posts)</li>
<li>Skip around list
<ul>
<li><img src="/images/221020-reading_research_papers/1.png" alt="" /></li>
<li>(1) Read about 10% of each paper, or try to quickly skim and understand each of these papers.</li>
<li>(2) If, based on that, you decide that paper 2 is worthless (others say the authors got it wrong, or you read it and it just doesn’t make sense), then go ahead and forget it.</li>
<li>(3) As you skip around to different papers, you might decide that paper 3 is a seminal one, and then continuously go ahead and read and understand the whole thing.</li>
<li>(4) Based on that, you might then find the 6th paper from the citations and read that.</li>
<li>(5) Go back and flesh out your understanding of paper 4.</li>
<li>(6) Then find a paper 7 and read that all the way to the conclusion.</li>
</ul>
</li>
<li>Number of papers you should read
<ul>
<li>15-20 papers: good enough to do some work, and apply some algorithms.</li>
<li>50-100 papers: enough to give you a very good understanding of an area.</li>
</ul>
</li>
<li>Read 1 paper
<ul>
<li>The bad way to read a paper is to go from the first word straight through to the last word.</li>
<li>Take multiple passes through the paper.
<ol>
<li>Title/abstract/figures: these carry the most information.</li>
<li>Intro + Conclusions + Figures + Skim rest. (Skip/skim related work)</li>
<li>Read but skip/skim math</li>
<li>Whole thing, but skip parts that don’t make sense.</li>
</ol>
</li>
</ul>
</li>
<li>Questions for a good understanding of the paper
<ul>
<li>What did authors try to accomplish?</li>
<li>What were the key elements of the approach?</li>
<li>What can you use yourself?</li>
<li>What other references do you want to follow?</li>
</ul>
</li>
<li>Sources of papers
<ul>
<li>Twitter (@kiankatan, @AndrewYNg)</li>
<li>ML subreddit</li>
<li>NIPS/ICML/ICLR</li>
<li>Friends</li>
<li>Arxiv sanity</li>
<li>(TMK) Papers with code (Web), Alpha zeta vector (YouTube)</li>
</ul>
</li>
<li>To understand the paper more deeply
<ul>
<li>Math
<ul>
<li>Rederive from scratch</li>
</ul>
</li>
<li>Code
<ul>
<li>Download/Run open-source code</li>
<li>Reimplement from scratch</li>
</ul>
</li>
</ul>
</li>
<li>Longer-term advice
<ul>
<li>Steady reading</li>
<li>Not short bursts</li>
</ul>
</li>
</ul>
<p>I’m grateful to Andrew Ng for sharing his wise and practical advice. Ending the lecture, he said, ‘some of this I wish I had known when I was a first-year Ph.D. student, but c’est la vie.’ I agree, but I think I’m lucky to know this now. It’s never too late to start. So keep going, and happy reading!</p>chloammeThis post contains my notes on the ‘Reading Research Papers’ part of the Stanford CS230 Deep Learning course lecture on YouTube URL. The class is all about advice on how to master a body of literature around a topic you want to learn, so I’ve written down Prof. Ng’s blackboard notes and verbal comments.My e-book reader device review2022-10-01T00:00:00+00:002022-10-01T00:00:00+00:00/2022/10/01/my_ebook_reader_review<p>I decided to read a bunch of books and papers in my free time. I had been floating from the library to study cafes like a nomad, preparing printed papers or books to read every day, but I often wanted to read about other topics depending on my mood.</p>
<p>So, I thought I needed a device to help me read, one that I could use anytime and anywhere. I already had an iPad mini, but its screen wasn’t a good fit. I got attracted to the Onyx Boox Nova 3 Color e-book reader and decided to buy it. At the time, there wasn’t any stock available on the website, so I reserved one. After a week, they notified me that the device was available, and I happily bought it right away. But a new, even cheaper version came out a week or so later. I felt a little sad because I hadn’t been informed about the new version, but it’s okay.</p>
<p>While using the device, I found a few things uncomfortable and thought I needed another device. So I decided to buy a new one, but this time, I bought the <strong>13.3” big device</strong> from the same brand.</p>
<table>
<tbody>
<tr>
<td><img src="/images/221001-ebook_reader/1.png" alt="" /></td>
<td><img src="/images/221001-ebook_reader/2.png" alt="" /></td>
</tr>
</tbody>
</table>
<ul>
<li>Pros
<ul>
<li><em>Eye comfort:</em>
I’m an avid user of reading devices. I used an iPad mini for a long time because of its convenience, but whenever I read a book or a paper on it, my eyes felt strained. An e-book device makes reading much more comfortable.</li>
<li><em>Reading quality:</em>
It is easy and fast for me to read an article, especially one written in my native language. In the past, I used a small reading device (7.8 inches); I couldn’t read properly on it, so it felt a little cramped.</li>
<li><em>Portability:</em>
It weighs only around 570 g, so I can carry it anywhere and read comfortably. It fits in any small backpack or briefcase, and its size is similar to an A4 folder.</li>
<li><em>Downloadable applications:</em>
Unlike the Kindle, Onyx devices are based on Android, and the Boox 13.3 is a brand-new model running Android 11. So we can use the Play Store and download various applications.</li>
</ul>
</li>
<li>Cons
<ul>
<li><em>Notepad:</em>
The device is great for reading, and it comes with a pen for writing notes. But the pen synchronization is not good; I’m not sure if it’s only my device, but there’s an area of the screen that doesn’t recognize the pen.</li>
<li><em>Watching videos:</em>
I watch a lot of lecture videos and spend most of my time on YouTube, and I’m especially interested in blackboard-style presentations. So I thought watching them on a grayscale E-ink display would be fine, but some content doesn’t display well on the device.</li>
<li><em>Browser add-ons:</em>
I hadn’t realized how much I depend on Chrome extensions (e.g., translation, Pocket, highlighter). The Chrome browser I can install on the Onyx Boox doesn’t support extensions. I tried Firefox add-ons, Brave add-ons, etc., but they don’t work well either. This time, I tried the Edge browser with a translation add-on, but it behaves slightly differently from Chrome, so it still feels uncomfortable.</li>
</ul>
</li>
</ul>chloammeI have decided to read a bunch of books and papers during my free time. I have been floating from the library to study cafes like a nomad. I prepared printed papers or books to read every day, but I often wanted to read other topics depending on my mood.[번역] 그림으로 설명하는 Retrieval Transformer2022-01-08T00:00:00+00:002022-01-08T00:00:00+00:00/2022/01/08/illustrated-retrieval-transformer-korean<div class="tooltip">
<blockquote>
<p>이 글은 <a href="http://jalammar.github.io/illustrated-retrieval-transformer/">Jay Alammar님의 글</a>을 번역한 글입니다. [<a href="#additional-info">추가정보</a>]
<span class="tooltiptext">
This post is a translated version of <a href="http://jalammar.github.io/illustrated-retrieval-transformer/">The Illustrated Retrieval Transformer</a> by Jay Alammar.
</span></p>
</blockquote>
</div>
<p><br /></p>
<div class="tooltip">
<p><strong>요약</strong>: 최신 언어 모델들은 훨씬 작지만 DB를 쿼리하거나 웹에서 정보를 검색할 수 있어서 GPT-3와 같은 성능을 달성할 수 있습니다. 우리가 주목해봐야하는 점은 성능을 향상하기 위한 유일한 방법이 더 큰 모델을 만드는 것이 아니라는 것입니다.
<span class="tooltiptext">
Summary: The latest batch of language models can be much smaller yet achieve GPT-3 like performance by being able to query a database or search the web for information. A key indication is that building larger and larger models is not the only way to improve performance.
</span></p>
</div>
<hr />
<div class="tooltip">
<p>최근 몇년간 우리는 거대 언어 모델(LLM; Large Language Model)들의 등장을 봤습니다 – 어떻게 기계가 (인간의) 언어를 처리하고 생성하는 방식을 빠르게 개선한 기계 학습(ML; Machine Learning) 모델입니다. 2017년 이후 하이라이트는 다음과 같습니다:
<span class="tooltiptext">
The last few years saw the rise of Large Language Models (LLMs) – machine learning models that rapidly improve how machines process and generate language. Some of the highlights since 2017 include:
</span></p>
</div>
<div class="tooltip">
<ul>
<li>최초의 <a href="http://jalammar.github.io/illustrated-transformer/">Transformer</a>는 기계 번역의 이전 성능 기록을 경신했습니다.</li>
<li><a href="http://jalammar.github.io/illustrated-bert/">BERT</a>는 pre-training 후 finetuning 프로세스와 Transformer 기반의 contextual word embedding을 대중화합니다. 그리고 빠르게 <a href="https://blog.google/products/search/search-language-understanding-bert/">Google 검색</a> 및 <a href="https://azure.microsoft.com/en-us/blog/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus/">Bing 검색</a>에 파워를 넣기 시작합니다.</li>
<li><a href="http://jalammar.github.io/illustrated-gpt2/">GPT-2</a>는 기계가 인간처럼 글을 쓰는 능력을 보여줍니다.</li>
<li>먼저 <a href="https://arxiv.org/abs/1910.10683">T5</a>가, 그 뒤로는 <a href="https://huggingface.co/bigscience/T0pp">T0</a>가 transfer learning (한 task에서 모델을 훈련시킨 후, 거기서 얻은 정보를 바탕으로 다른 비슷한 task에 대해 잘 수행하도록 함) 및 text-to-text task들 처럼 많은 다형의 task들을 수행하는 것의 경계를 확장합니다.</li>
<li><a href="http://jalammar.github.io/how-gpt3-works-visualizations-animations/">GPT-3</a>는 생성 모델의 대규모 스케일링(크기 확장)이 충격적인 새로운 어플리케이션으로 이어질 수 있음을 보여주었습니다 (업계에서는 <a href="https://deepmind.com/research/publications/2021/scaling-language-models-methods-analysis-insights-from-training-gopher">Gopher</a>, <a href="https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/">MT-NLG</a> 등과 같이 더 큰 모델을 꾸준히 학습하고 있습니다).
<span class="tooltiptext">
<span>*</span> The original <a href="http://jalammar.github.io/illustrated-transformer/">Transformer</a> breaks previous performance records for machine translation.
<span>*</span> <a href="http://jalammar.github.io/illustrated-bert/">BERT</a> popularizes the pre-training then finetuning process, as well as Transformer-based contextualized word embeddings. It then rapidly starts to power <a href="https://blog.google/products/search/search-language-understanding-bert/">Google Search</a> and <a href="https://azure.microsoft.com/en-us/blog/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus/">Bing Search</a>.
<span>*</span> <a href="http://jalammar.github.io/illustrated-gpt2/">GPT-2</a> demonstrates the machine’s ability to write as well as humans do.
<span>*</span> First <a href="https://arxiv.org/abs/1910.10683">T5</a>, then <a href="https://huggingface.co/bigscience/T0pp">T0</a> push the boundaries of transfer learning (training a model on one task, and then having it do well on other adjacent tasks) and posing a lot of different tasks as text-to-text tasks.
<span>*</span> <a href="http://jalammar.github.io/how-gpt3-works-visualizations-animations/">GPT-3</a> showed that massive scaling of generative models can lead to shocking emergent applications (the industry continues to train larger models like <a href="https://deepmind.com/research/publications/2021/scaling-language-models-methods-analysis-insights-from-training-gopher">Gopher</a>, <a href="https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/">MT-NLG</a>…etc).
</span></li>
</ul>
</div>
<div class="tooltip">
<p>한동안은 모델의 크기를 확장해나가는 것이 성능을 향상시키는 주요 방법인 것처럼 보였습니다. 하지만 DeepMind의 <a href="https://deepmind.com/research/publications/2021/improving-language-models-by-retrieving-from-trillions-of-tokens">RETRO Transformer</a> 및 OpenAI의 <a href="https://openai.com/blog/improving-factual-accuracy/">WebGPT</a>와 같은 이 분야의 최근 개발에서, 정보를 검색/쿼리하는 방법으로 확장한다면 작은 생성 언어 모델도 거대 모델과 동등한 성능을 발휘할 수 있음을 보여줌으로써 트렌드를 바꿔놓았습니다.
<span class="tooltiptext">
For a while, it seemed like scaling larger and larger models is the main way to improve performance. Recent developments in the field, like DeepMind’s RETRO Transformer and OpenAI’s WebGPT, reverse this trend by showing that smaller generative language models can perform on par with massive models if we augment them with a way to search/query for information.
</span></p>
</div>
<div class="tooltip">
<p>이 글은 DeepMind의 RETRO (<strong>R</strong>etrieval-<strong>E</strong>nhanced <strong>TR</strong>ansf<strong>O</strong>rmer)와 그 동작방식을 설명합니다. 이 모델은 사이즈가 4% (75억개 파라미터 vs. GPT-3 다빈치 모델의 경우 1850억개 파라미터) 밖에 되지 않지만 GPT-3와 유사하게 성능을 냅니다.
<span class="tooltiptext">
This article breaks down DeepMind’s RETRO (<strong>R</strong>etrieval-<strong>E</strong>nhanced <strong>TR</strong>ansf<strong>O</strong>rmer) and how it works. The model performs on par with GPT-3 despite being 4% its size (7.5 billion parameters vs. 185 billion for GPT-3 Da Vinci).
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/deepmind-retro-retrieval-transformer.png" />
<br />
RETRO는 database에서 retrieval된 정보를 추가하여, 파라미터들이 fact와 world knowledge의 값비싼 저장소가 되는 것을 방지합니다.
</div>
<span class="tooltiptext">
RETRO incorporates information retrieved from a database to free its parameters from being an expensive store of facts and world knowledge.
</span>
</div>
<div class="tooltip">
<p>RETRO는 <a href="https://arxiv.org/abs/2112.04426">Improving Language Models by Retrieving from Trillions of Tokens</a> 논문에 기술되어 있습니다. 이 모델은 연구 커뮤니티의 넓고 다양한 retrieval work를 이어받아 그 위에 구축되었습니다. (참고: <a href="http://www.crm.umontreal.ca/2018/Langue18/pdf/Cheung.pdf">1</a> <a href="https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/">2</a> <a href="https://openreview.net/forum?id=HklBjCEKvH">3</a> <a href="https://arxiv.org/abs/2102.02557">4</a> <a href="https://openreview.net/forum?id=B184E5qee">5</a>) 이 글은 RETRO 모델에 대한 설명이며, 모델의 참신성에 대한 것은 아닙니다.
<span class="tooltiptext">
RETRO was presented in the paper <a href="https://arxiv.org/abs/2112.04426">Improving Language Models by Retrieving from Trillions of Tokens</a>. It continues and builds on a wide variety of retrieval <a href="http://www.crm.umontreal.ca/2018/Langue18/pdf/Cheung.pdf">work</a> <a href="https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/">in</a> <a href="https://openreview.net/forum?id=HklBjCEKvH">the</a> <a href="https://arxiv.org/abs/2102.02557">research</a> <a href="https://openreview.net/forum?id=B184E5qee">community</a>. This article explains the model and not what is especially novel about it.
</span></p>
</div>
<!--more-->
<div class="tooltip">
<h2 id="world-knowledge--">중요한 이유: 언어 정보를 World Knowledge 정보와 분리시킴</h2>
<p><span class="tooltiptext">
Why This is Important: Separating Language Information from World Knowledge Information
</span></p>
</div>
<div class="tooltip">
<p>언어 모델링은 기본적으로 문장 끝의 빈칸(blank)을 채우기 위해 다음 단어를 예측하도록 모델을 학습합니다.
<span class="tooltiptext">
Language modeling trains models to predict the next word–to fill-in-the-blank at the end of the sentence, essentially.
</span></p>
</div>
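<p>위에서 말한 ‘다음 단어 예측’을 아주 단순한 장난감 예제로 스케치하면 다음과 같습니다 (가정: 실제 신경망 언어 모델 대신 bigram count를 사용한 설명용 예시이며, corpus 내용과 함수 이름은 모두 임의로 지은 것입니다):</p>

```python
from collections import Counter, defaultdict

# Toy illustration (not RETRO itself): language modeling as
# fill-in-the-blank, here with a bigram count model that predicts the
# next word as the one most often seen following the current word.
corpus = "the film was released in 2021 and the film was praised".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    # Most frequent continuation observed in the corpus.
    return following[word].most_common(1)[0][0]

print(predict_next("film"))  # the word most often seen after "film"
```

<p>실제 모델은 count 대신 파라미터에 학습된 분포로 다음 토큰의 확률을 계산한다는 점만 다릅니다.</p>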
<div class="tooltip">
<p>빈칸을 채우는 것은 가끔 사실 정보 (예. 이름 또는 날짜)의 지식이 필요합니다. 예를 들면:
<span class="tooltiptext">
Filling the blank sometimes requires knowledge of factual information (e.g. names or dates). For example:
</span></p>
</div>
<div class="img-div">
<img src="/images/retro/prompt-1.png" />
<br />
입력 프롬프트: 영화 듄은 출시(시간정보)가 ....(이다).
</div>
<div class="tooltip">
<p>다른 경우에는, 해당 언어에 대한 친숙도가 빈칸에 들어갈 내용을 추측하기에 충분합니다.
예를 들어:
<span class="tooltiptext">
Other times, familiarity with the language is enough to guess what goes in the blank. For example:
</span></p>
</div>
<div class="img-div">
<img src="/images/retro/prompt-2.png" />
<br />
입력 프롬프트: 입소문으로 퍼진 인기가 Herbert가 시작할 수 있게 했습니다, 본격적인 ....(을).
</div>
<div class="tooltip">
<p>이 구분은 LLM이 알고 있는 모든 것을 모델 파라미터에 인코딩하기 때문에 중요합니다. 이 것은 언어 정보에 대해서는 타당하지만, fact 및 world-knowledge 정보에 대해서는 비효율적입니다.
<span class="tooltiptext">
This distinction is important because LLMs encoded everything they know in their model parameters. While this makes sense for language information, it is inefficient for factual and world-knowledge information.
</span></p>
</div>
<div class="tooltip">
<p>언어 모델에 retrieval 방법을 추가함으로써, 모델은 훨씬 작아질 수 있습니다. neural database는 텍스트 생성 중에 필요한 fact 정보를 retrieving 하는 것에 도움이 됩니다.
<span class="tooltiptext">
By including a retrieval method in the language model, the model can be much smaller. A neural database aids it with retrieving factual information it needs during text generation.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/Large-GPT-vs-Retro-transformer-world-knowledge-information.png" />
<br />
retrieval 방법으로 언어 모델을 지원하는 것은, 언어 모델이 텍스트를 잘 생성하기 위해 파라미터에 인코딩하는 정보의 양(크기)를 줄일 수 있게 합니다.
</div>
<span class="tooltiptext">
Aiding language models with retrieval methods allows us to reduce the amount of information a language model needs to encode in its parameters to perform well at text generation.
</span>
</div>
<div class="tooltip">
<p>훈련 데이터 암기(memorization)가 줄어들기 때문에 작은 언어 모델로 훈련이 빨라집니다. 누구나 작고 저렴한 GPU를 이용하여 이런 작은 모델들을 deploy할 수 있고 필요에 따라 조정할 수 있습니다.
<span class="tooltiptext">
Training becomes fast with small language models, as training data memorization is reduced. Anyone can deploy these models on smaller and more affordable GPUs and tweak them as per need.
</span></p>
</div>
<div class="tooltip">
<p>구조적으로, RETRO는 최초의 transformer와 같은 encoder-decoder 모델입니다. 하지만 retrieval database의 도움으로 입력 시퀀스가 보강됩니다. 모델은 database에서 가장 유력한 시퀀스를 찾고 입력에 추가합니다. 그러면 RETRO는 출력 예측을 생성하기 위한 마법을 겁니다.
<span class="tooltiptext">
Mechanically, RETRO is an encoder-decoder model just like the original transformer. However, it augments the input sequence with the help of a retrieval database. The model finds the most probable sequences in the database and adds them to the input. RETRO works its magic to generate the output prediction.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/dune-prompt-into-retro-transformer-4.png" />
<br />
RETRO는 database를 활용하여 입력 프롬프트를 보강합니다. 프롬프트는 관련 정보를 database에서 retrieve하는데 사용됩니다.
</div>
<span class="tooltiptext">
RETRO utilizes a database to augment its input prompt. The prompt is used to retrieve relevant information from the database.
</span>
</div>
<div class="tooltip">
<p>모델 아키텍처를 보기 전에, retrieval database에 대해 깊이 있게 알아보겠습니다.
<span class="tooltiptext">
Before we explore the model architecture, let’s dig deeper into the retrieval database.
</span></p>
</div>
<div class="tooltip">
<h2 id="retro-retrieval-database--">RETRO의 Retrieval Database 세부 사항</h2>
<p><span class="tooltiptext">
Inspecting RETRO’s Retrieval Database
</span></p>
</div>
<div class="tooltip">
<p>database는 key-value 저장소입니다.
<span class="tooltiptext">
The database is a key-value store.
</span></p>
</div>
<div class="tooltip">
<p>key는 일반적으로 사용하는 BERT 문장 임베딩입니다.
<span class="tooltiptext">
The key is a standard BERT sentence embedding.
</span></p>
</div>
<div class="tooltip">
<p>value는 2파트의 텍스트로 되어있습니다:
<span class="tooltiptext">
The value is text in two parts:
</span></p>
</div>
<div class="tooltip">
<ol>
<li>Neighbor: key를 계산하기 위해 사용됨</li>
<li>Completion: 원본 문서에서 text의 연속 (Neighbor 문장에 연속적으로 이어지는 다음 텍스트)
<span class="tooltiptext">
1. Neighbor, which is used to compute the key
2. Completion, the continuation of the text in the original document.
</span></li>
</ol>
</div>
<div class="tooltip">
<p>RETRO에서의 database는 <em>MassiveText</em> 데이터셋으로부터의 2조개의 다국어(multi-lingual) 토큰을 가지고 있습니다. 길이는 neighbor chunk와 completion chunk 모두 최대 64개 토큰까지 가능합니다.
<span class="tooltiptext">
RETRO’s database contains 2 trillion multi-lingual tokens based on the <em>MassiveText</em> dataset. Both the neighbor and completion chunks are at most 64 tokens long.
</span></p>
</div>
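<p>본문의 key-value 구조를 최소한의 스케치로 표현하면 다음과 같습니다 (가정: 실제 RETRO는 frozen BERT 임베딩과 약 2조개 토큰 규모의 database를 사용하지만, 여기서는 해시 기반의 가상의 embed() 함수와 brute-force 탐색으로 대체했고, entries의 텍스트도 설명용으로 지어낸 것입니다):</p>

```python
import numpy as np

# Sketch of a key-value retrieval store: each entry's key is an embedding
# of the "neighbor" chunk; its value is the (neighbor, completion) pair.
# embed() is a deterministic stand-in (within one process) for a real
# sentence encoder such as frozen BERT.
def embed(text, dim=8):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

entries = [
    ("Dune is a 2021 film", "directed by Denis Villeneuve"),
    ("The 1965 novel Dune", "was written by Frank Herbert"),
]
keys = np.stack([embed(neighbor) for neighbor, _ in entries])

def lookup(query_text, k=1):
    # Nearest neighbors by dot product (vectors are unit-normalized).
    sims = keys @ embed(query_text)
    top = np.argsort(-sims)[:k]
    return [entries[i] for i in top]
```

<p>실제 시스템에서는 brute-force 대신 근사 근접 이웃 탐색(예: SCaNN)으로 이 조회를 수행합니다.</p>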
<div class="img-div-any-width">
<img src="/images/retro/database-key-value-examples.png" />
<br />
RETRO database의 내부를 살펴보면 RETRO database의 key-value 쌍의 예를 볼 수 있습니다. value는 neighbor chunk와 completion chunk를 포함합니다.
A look inside RETRO's database shows examples of key-value pairs in the RETRO database. The value contains a neighbor chunk and a completion chunk.
</div>
<div class="tooltip">
<p>RETRO는 입력 프롬프트를 여러개의 chunk로 쪼갭니다. 단순하게 설명하기 위해, 하나의 chunk가 retrieve된 text로 보강되는 방법을 집중해서 살펴보겠습니다. 하지만, 모델은 입력 프롬프트에서 (첫번째를 제외하고) 각 chunk에 대해 이 프로세스를 수행합니다.
<span class="tooltiptext">
RETRO breaks the input prompt into multiple chunks. For simplicity, we’ll focus on how one chunk is augmented with retrieved text. The model, however, does this process for each chunk (except the first) in the input prompt.
</span></p>
</div>
<div class="tooltip">
<h2 id="database-">Database 조회</h2>
<p><span class="tooltiptext">
The Database Lookup
</span></p>
</div>
<div class="tooltip">
<p>RETRO에 들어가기 전에 입력 프롬프트는 BERT에 먼저 입력됩니다. 출력 contextualized vector들의 평균을 계산하여 문장 임베딩 벡터를 구성합니다. 그렇게 만들어진 벡터는 database를 쿼리하는데에 사용됩니다.
<span class="tooltiptext">
Before hitting RETRO, the input prompt goes into BERT. The output contextualized vectors are then averaged to construct a sentence embedding vector. That vector is then used to query the database.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/retro/bert-sentence-embedding.png" />
<br />
BERT로 입력 프롬프트를 처리하면 contextualized 토큰 임베딩이 생성됩니다. 그 결과들의 평균을 계산해서 문장 임베딩을 생성합니다.
Processing the input prompt with BERT produces contextualized token embeddings. Averaging them produces a sentence embedding.
</div>
<div class="tooltip">
<p>그 문장 임베딩은 <a href="https://github.com/google-research/google-research/tree/master/scann">근사 근접 이웃 탐색(approximate nearest neighbor search)</a>에서 사용됩니다.
<span class="tooltiptext">
That sentence embedding is then used in an approximate nearest neighbor search (<a href="https://github.com/google-research/google-research/tree/master/scann">https://github.com/google-research/google-research/tree/master/scann</a>).
</span></p>
</div>
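<p>평균 계산(mean-pooling)으로 문장 임베딩을 만드는 단계는 다음과 같이 스케치할 수 있습니다 (가정: 실제 BERT 활성값 대신 난수 벡터를 사용하며, 토큰 개수 등의 shape도 설명용입니다):</p>

```python
import numpy as np

# Mean-pooling sketch: BERT emits one contextualized vector per token;
# the lookup key is simply their average. Random vectors stand in for
# real BERT activations here.
rng = np.random.default_rng(0)
num_tokens, hidden = 6, 768           # e.g. 6 tokens, BERT-base hidden size
token_vectors = rng.normal(size=(num_tokens, hidden))

sentence_embedding = token_vectors.mean(axis=0)   # shape (hidden,)
```

<p>이렇게 얻은 벡터 하나가 근사 근접 이웃 탐색의 쿼리로 쓰입니다.</p>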
<div class="tooltip">
<p>두개의 근접 이웃이 retrieve되고, 그 텍스트가 RETRO의 입력의 일부가 됩니다.
<span class="tooltiptext">
The two nearest neighbors are retrieved, and their text becomes a part of the input into RETRO.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/neighbor-retrieval-from-retro-neural-database-with-bert-embeddings.png" />
<br />
BERT 문장 임베딩은 RETRO의 neural database에서 근접 이웃을 retrieve하는데에 사용됩니다. 검색된 결과들은 언어 모델의 입력에 추가됩니다.
</div>
<span class="tooltiptext">
The BERT sentence embedding is used to retrieve the nearest neighbors from RETRO's neural database. These are then added to the input of the language model.
</span>
</div>
<div class="tooltip">
<p>RETRO의 입력은, 입력 프롬프트와 그것으로 database에서 검색한 두개의 근접 이웃 (및 그들의 continuation(이어지는 다음 문장))입니다.
<span class="tooltiptext">
This is now the input to RETRO. The input prompt and its two nearest neighbors from the database (and their continuations).
</span></p>
</div>
<div class="tooltip">
<p>이제부터 Transformer 블록과 RETRO 블록이 이 정보를 프로세싱에 통합합니다.
<span class="tooltiptext">
From here, the Transformer and RETRO Blocks incorporate the information into their processing.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/input-prompt-and-retrieved-text-retro-transformer.png" />
<br />
retrieve된 이웃은 언어 모델의 입력에 추가됩니다. 하지만, 그 이웃들은 모델 안에서 조금 다르게 처리됩니다.
</div>
<span class="tooltiptext">
The retrieved neighbors are added to the input of the language model. They're treated a little differently inside the model, however.
</span>
</div>
<div class="tooltip">
<h2 id="retro--">RETRO 아키텍처 조망</h2>
<p><span class="tooltiptext">
RETRO Architecture at a High Level
</span></p>
</div>
<div class="tooltip">
<p>RETRO 아키텍처는 인코더 스택과 디코더 스택으로 구성되어 있습니다.
<span class="tooltiptext">
RETRO’s architecture is an encoder stack and a decoder stack.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/Retro-transformer-encoder-decoder-stacks-2.png" />
<br />
RETRO transformer는 (neighbor들을 처리하기 위한) 인코더 스택과 (입력을 처리하기 위한) 디코더 스택으로 구성되어 있습니다.
</div>
<span class="tooltiptext">
A RETRO transformer consists of an encoder stack (to process the neighbors) and a decoder stack (to process the input)
</span>
</div>
<div class="tooltip">
<p>인코더는 기본 Transformer 인코더 블록들(self-attention + FFNN)로 구성되어 있습니다. 제가 알기로는, Retro는 두개의 Transformer 인코더 블록으로 구성된 인코더를 사용합니다.
<span class="tooltiptext">
The encoder is made up of standard Transformer encoder blocks (self-attention + FFNN). To my best understanding, Retro uses an encoder made up of two Transformer Encoder Blocks.
</span></p>
</div>
<div class="tooltip">
<p>디코더 스택은 두 종류의 디코더 블록을 interleave(사이에 끼워넣음) 합니다:
<span class="tooltiptext">
The decoder stack interleaves two kinds of decoder blocks:
</span></p>
</div>
<div class="tooltip">
<ul>
<li>기본 transformer 디코더 블록 (ATTN + FFNN)</li>
<li>RETRO 디코더 블록 (ATTN + CCA(Chunked Cross Attention) + FFNN)
<span class="tooltiptext">
<span>*</span> Standard transformer decoder block (ATTN + FFNN)
<span>*</span> RETRO decoder block (ATTN + Chunked cross attention (CCA) + FFNN)
</span></li>
</ul>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/retro-transformer-blocks-4.png" />
<br />
세 종류의 Transformer 블록이 RETRO를 구성합니다.
</div>
<span class="tooltiptext">
The three types of Transformer blocks that make up RETRO
</span>
</div>
<div class="tooltip">
<p>retrive된 이웃을 처리하여, 나중에 attention을 위해 사용될 KEYS 및 VALUES 행렬을 생성하는 인코더 스택부터 살펴보겠습니다 (복습을 위해 <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a> 참고하세요).
<span class="tooltiptext">
Let’s start by looking at the encoder stack, which processes the retrieved neighbors, resulting in KEYS and VALUES matrices that will later be used for attention (see <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a> for a refresher).
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/retro-encoder-block-keys-values-2.png" />
<br />
인코더 스택은 retrive된 이웃을 처리하여 KEYS 및 VALUES 행렬을 생성합니다.
</div>
<span class="tooltiptext">
The encoder stack processes the retrieved neighbors resulting in KEYS and VALUE matrices
</span>
</div>
<div class="tooltip">
<p>디코더 블록은 입력 텍스트를 GPT가 하는 것처럼 처리합니다. 프롬프트 token에 self-attention을 적용시키고 (인과적으로, 이전 토큰에만 attention을 줌), FFNN 레이어를 통과시킵니다.
<span class="tooltiptext">
Decoder blocks process the input text just like a GPT would. It applies self-attention on the prompt token (causally, so only attending to previous tokens), then passes through a FFNN layer.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/retro-transformer-decoders-2.png" />
<br />
입력 프롬프트는 self-attention 및 FFNN 레이어를 포함하는 일반 디코더 블록을 통과합니다.
</div>
<span class="tooltiptext">
Input prompt passes through standard decoder block containing self-attention and FFNN layers
</span>
</div>
<div class="tooltip">
<p>RETRO 디코더에 도달해서야 retrieve된 정보를 통합하기 시작합니다. 9번째부터 시작해서 매 세번째 블록이 RETRO 블록입니다 (입력이 이웃에 attention을 줄 수 있도록 합니다). 그래서 9, 12, 15…32 레이어가 RETRO 블록입니다. (두개의 더 작은 Retro 모델과 Retrofit 모델들은 이러한 레이어들이 9번째가 아닌 6번째부터 시작됩니다.)
<span class="tooltiptext">
It’s only when a RETRO decoder is reached do we start to incorporate the retrieved information. Every third block starting from 9 is a RETRO block (that allows its input to attend to the neighbors). So layers 9, 12, 15…32 are RETRO blocks. (The two smaller Retro models, and the Retrofit models have these layers starting from the 6th instead of the 9th layer.)
</span></p>
</div>
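<p>어떤 디코더 레이어가 RETRO 블록인지에 대한 위 규칙은 간단히 스케치할 수 있습니다 (가정: retro_layers 함수와 파라미터 이름은 설명용으로 지은 것이며, 본문에 기술된 규칙만 반영합니다):</p>

```python
# Which decoder layers carry chunked cross-attention (RETRO blocks)?
# Per the text above: every third layer starting at layer 9; the two
# smaller Retro models and the Retrofit models start at layer 6 instead.
def retro_layers(num_layers, start=9, every=3):
    return [i for i in range(1, num_layers + 1)
            if i >= start and (i - start) % every == 0]

print(retro_layers(15))           # layers 9, 12, 15
print(retro_layers(15, start=6))  # smaller/Retrofit variants
```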
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/retro-decoder-attention-2.png" />
<br />
입력 프롬프트는 RETRO 디코더 블록에 도달하고 정보 retrieval이 시작됩니다.
</div>
<span class="tooltiptext">
Input prompt reaches RETRO Decoder block to start information retrieval
</span>
</div>
<div class="tooltip">
<p>따라서 실질적으로, 이 단계에서 모델은 프롬프트를 완성하는데 필요한 날짜 정보를 retrieve된 정보에서 살펴볼 수 있습니다.
<span class="tooltiptext">
So effectively, this is the step where the retrieved information can glance at the dates it needs to complete the prompt.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/retro-decoder-chunked-cross-attention.png" />
<br />
Chunked Cross-Attention을 이용하여 근접 이웃 chunk로 부터 정보를 retrieving하는 RETRO 디코더 블록.
</div>
<span class="tooltiptext">
RETRO Decoder block retrieving information from nearest neighbour chunks using Chunked Cross-Attention
</span>
</div>
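<p>Chunked Cross-Attention의 바탕이 되는 cross-attention 연산 자체를 최소한으로 스케치하면 다음과 같습니다 (가정: single-head, 배치 없음, 난수 입력이며, 실제 RETRO는 이 연산을 chunk 단위로 적용합니다):</p>

```python
import numpy as np

# Minimal single-head cross-attention: decoder hidden states (queries)
# attend to the KEYS/VALUES the encoder computed from retrieved neighbors.
# Shapes are illustrative only.
rng = np.random.default_rng(0)
d = 16
queries = rng.normal(size=(4, d))    # 4 decoder positions
keys    = rng.normal(size=(10, d))   # 10 neighbor tokens
values  = rng.normal(size=(10, d))

scores = queries @ keys.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
attended = weights @ values                              # shape (4, d)
```

<p>각 디코더 위치가 이웃 토큰들의 가중 평균을 받아오는 것이 핵심이며, RETRO는 이를 chunk 경계에 맞춰 수행합니다.</p>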
<div class="tooltip">
<h2 id="section">이전 연구</h2>
<p><span class="tooltiptext">
Previous Work
</span></p>
</div>
<div class="tooltip">
<p>retrieval 기술로 언어 모델을 지원하는 것은 활발한 연구 영역입니다. 이전 연구들은 다음과 같습니다:
<span class="tooltiptext">
Aiding language models with retrieval techniques has been an active area of research. Some of the previous work in the space includes:
</span></p>
</div>
<ul>
<li><a href="https://openreview.net/forum?id=B184E5qee">Improving Neural Language Models with a Continuous Cache</a></li>
<li><a href="https://openreview.net/forum?id=HklBjCEKvH">Generalization through Memorization: Nearest Neighbor Language Models</a></li>
<li>Read the <a href="https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/">Retrieval Augmented Generation</a> blog from Meta AI and go through Jackie Chi Kit Cheung’s lecture on <a href="http://www.crm.umontreal.ca/2018/Langue18/pdf/Cheung.pdf">Leveraging External Knowledge in Natural Language Understanding Systems</a></li>
<li>SPALM: <a href="https://arxiv.org/abs/2102.02557">Adaptive Semiparametric Language Models</a></li>
<li>DPR: <a href="https://aclanthology.org/2020.emnlp-main.550/">Dense Passage Retrieval for Open-Domain Question Answering</a></li>
<li><a href="https://arxiv.org/abs/2002.08909">REALM: Retrieval-Augmented Language Model Pre-Training</a></li>
<li>FiD: <a href="https://aclanthology.org/2021.eacl-main.74/">Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering</a></li>
<li>EMDR: <a href="https://arxiv.org/abs/2106.05346">End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering</a></li>
<li>BlenderBot 2.0: <a href="https://arxiv.org/abs/2107.07566">Internet-Augmented Dialogue Generation</a></li>
</ul>
<div class="tooltip">
<p>수정사항이나 피드백이 있으시면, <a href="https://github.com/jalammar/jalammar.github.io/discussions/21">이 쓰레드</a>에 글을 남겨주시거나 <a href="https://twitter.com/JayAlammar">이 트위터</a>로 연락 주세요.
<span class="tooltiptext">
Please post in <a href="https://github.com/jalammar/jalammar.github.io/discussions/21">this thread</a> or reach out to me on <a href="https://twitter.com/JayAlammar">Twitter</a> for any corrections or feedback.
</span></p>
</div>
<hr />
<h2 id="추가-정보">Additional Info<a href="#additional-info" name="additional-info">.</a></h2>
<ul>
<li>This post is a translation, published with the permission of the author, Jay Alammar, of his illustrated explanation of the Retrieval Transformer. The original is available at <a href="http://jalammar.github.io/illustrated-retrieval-transformer/">The Illustrated Retrieval Transformer</a>.</li>
<li>To keep terminology compatible with the English original and other English-language resources, the terms and phrases commonly used in this field were mostly left untranslated. The translation also favors sentences that explain the concepts clearly over a literal, word-for-word rendering. Please leave opinions or corrections about the translation in the comments below.</li>
<li>For readers who want to see the English source text alongside the translation, the tooltip feature built by <a href="https://nlpinkorean.github.io">Chan</a> (hover over a paragraph — or touch, on mobile — to see the original text) has been adopted here. Many thanks.</li>
</ul>
</ul><h1 id="vector-space-models-designs-distances-basic-reweighting">Vector-space models: designs, distances, basic reweighting</h1>
<pre><code class="language-notebook">__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2019"
# This document adds explanations (not a translation) to Prof. Christopher Potts's CS224U course materials, under the Apache License 2.0.
</code></pre>
<h2 id="contents">Contents</h2>
<ol>
<li><a href="#Overview">Overview</a></li>
<li><a href="#Motivation">Motivation</a></li>
<li><a href="#Terminological-notes">Terminological notes</a></li>
<li><a href="#Set-up">Set-up</a></li>
<li><a href="#Matrix-designs">Matrix designs</a></li>
<li><a href="#Pre-computed-example-matrices">Pre-computed example matrices</a></li>
<li><a href="#Vector-comparison">Vector comparison</a>
<ol>
<li><a href="#Euclidean">Euclidean</a></li>
<li><a href="#Length-normalization">Length normalization</a></li>
<li><a href="#Cosine-distance">Cosine distance</a></li>
<li><a href="#Matching-based-methods">Matching-based methods</a></li>
<li><a href="#Summary">Summary</a></li>
</ol>
</li>
<li><a href="#Distributional-neighbors">Distributional neighbors</a></li>
<li><a href="#Matrix-reweighting">Matrix reweighting</a>
<ol>
<li><a href="#Normalization">Normalization</a></li>
<li><a href="#Observed/Expected">Observed/Expected</a></li>
<li><a href="#Pointwise-Mutual-Information">Pointwise Mutual Information</a></li>
<li><a href="#TF-IDF">TF-IDF</a></li>
</ol>
</li>
<li><a href="#Subword-information">Subword information</a></li>
<li><a href="#Visualization">Visualization</a></li>
<li><a href="#Exploratory-exercises">Exploratory exercises</a></li>
</ol>
<h2 id="overview">Overview</h2>
<p>This notebook is the first in our series about creating effective <strong>distributed representations</strong>. The focus is on matrix designs, assessing similarity, and methods for matrix reweighting.</p>
<p>The central idea (which takes some getting used to!) is that we can represent words and phrases as dense vectors of real numbers. These take on meaning by being <strong>embedded</strong> in a larger matrix of representations with comparable structure.</p>
<p>In short: we will explore <strong>distributed representations</strong>, covering <code class="language-plaintext highlighter-rouge">matrix design</code>, <code class="language-plaintext highlighter-rouge">similarity</code> computation, and <code class="language-plaintext highlighter-rouge">matrix reweighting</code>.</p>
<h2 id="motivation">Motivation</h2>
<p>Why build distributed representations? There are potentially many reasons. The two we will emphasize in this course:</p>
<ol>
<li>
<p><strong>Understanding words in context</strong>: There is value to linguists in seeing what these data-rich approaches can teach us about natural language lexicons, and there is value for social scientists in understanding how words are being used.</p>
</li>
<li>
<p><strong>Feature representations for other models</strong>: As we will see, many models can benefit from representing examples as distributed representations.</p>
</li>
</ol>
<p>In short: building <code class="language-plaintext highlighter-rouge">distributed representations</code> is worthwhile because (1) they let us understand words in context, and (2) they can serve as feature representations for other models.</p>
<h2 id="terminological-notes">Terminological notes</h2>
<ul>
<li>
<p>The distributed representations we build will always be vectors of real numbers. The models are often called <strong>vector space models</strong> (VSMs).</p>
</li>
<li>
<p><strong>Distributional representations</strong> are the special case where the data come entirely from co-occurrence counts in corpora.</p>
</li>
<li>
<p>We’ll look at models that use supervised labels to obtain vector-based word representations. These aren’t purely distributional, in that they take advantage of more than just co-occurrence patterns among items in the vocabulary, but they share the idea that words can be modeled with vectors.</p>
</li>
<li>
<p>If a neural network is used to train the representations, then they might be called <strong>neural representations</strong>.</p>
</li>
<li>
<p>The term <strong>word embedding</strong> is also used for distributed representations, including distributional ones. This term is a reminder that vector representations are meaningful only when embedded in and compared with others in a unified space (usually a matrix) of representations of the same type.</p>
</li>
<li>
<p>In any case, <strong>distributed representation</strong> seems like the most general cover term for what we’re trying to achieve, and its only downside is that sometimes people think it has something to do with distributed databases.</p>
</li>
<li>In short: building a distributed representation yields vectors of real numbers, and such models are also called <strong>vector space models (VSMs)</strong>.</li>
<li>When the data come entirely from co-occurrence counts in corpora, the result is called a <strong>distributional representation</strong>.</li>
<li>When the representations are trained with a neural network, they are called <strong>neural representations</strong>.</li>
<li>The term <strong>word embedding</strong> can refer to distributed representations, including distributional ones.</li>
<li><strong>Distributed representation</strong> is the most general cover term (and has nothing to do with distributed databases).</li>
</ul>
<p>See also: <a href="https://jair.org/index.php/jair/article/view/10640">From Frequency to Meaning: Vector Space Models for Semantics</a>.</p>
<h2 id="set-up">Set-up</h2>
<ul>
<li>
<p>Make sure your environment meets all the requirements for <a href="https://github.com/cgpotts/cs224u/">the cs224u repository</a>. For help getting set up, see <a href="setup.ipynb">setup.ipynb</a>.</p>
</li>
<li>
<p>Download <a href="http://web.stanford.edu/class/cs224u/data/data.zip">the course data</a>, unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change <code class="language-plaintext highlighter-rouge">DATA_HOME</code> below.)</p>
</li>
<li>We will examine the implementations from vsm.py by entering them directly into this notebook.</li>
<li>Since the imdb corpus has been replaced with the yelp corpus, the exercises are adjusted accordingly.</li>
</ul>
<pre><code class="language-notebook">%matplotlib inline
import numpy as np
import os
import pandas as pd
# import vsm
import scipy
import scipy.spatial.distance
from collections import defaultdict
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
</code></pre>
<pre><code class="language-notebook">DATA_HOME = os.path.join('data', 'vsmdata')
!ls $DATA_HOME
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>giga_window20-flat.csv.gz yelp_window20-flat.csv.gz
giga_window5-scaled.csv.gz yelp_window5-scaled.csv.gz
</code></pre></div></div>
<h2 id="matrix-designs">Matrix designs</h2>
<p>There are many, many ways to define distributional matrices. Here’s a schematic overview that highlights the major decisions for building a word $\times$ word matrix:</p>
<ol>
<li>
<p>Define a notion of <strong>co-occurrence context</strong>. This could be an entire document, a paragraph, a sentence, a clause, an NP — whatever domain seems likely to capture the associations you care about.</p>
</li>
<li>
<p>Define a <strong>count scaling method</strong>. The simplest method just counts everything in the context window, giving equal weight to everything inside it. A common alternative is to scale the weights based on proximity to the target word – e.g., $1/d$, where $d$ is the distance in tokens from the target.</p>
</li>
<li>
<p>Scan through your corpus building a dictionary $d$ mapping word-pairs to co-occurrence values. Every time a pair of words $w$ and $w’$ occurs in the same context (as you defined it in 1), increment $d[(w, w’)]$ by whatever value is determined by your weighting scheme. You’d increment by $1$ with the weighting scheme that simply counts co-occurrences.</p>
</li>
<li>Using the count dictionary $d$ that you collected in 3, establish your full vocabulary $V$, an ordered list of word types.
<ol>
<li>For large collections of documents, $|V|$ will typically be huge. You will probably want to winnow the vocabulary at this point.</li>
<li>You might do this by filtering to a specific subset, or just imposing a minimum count threshold.</li>
<li>You might impose a minimum count threshold even if $|V|$ is small — for words with very low counts, you simply don’t have enough evidence to support good representations.</li>
<li>For words outside the vocabulary you choose, you could ignore them entirely or accumulate all their values into a designated <em>UNK</em> vector.</li>
</ol>
</li>
<li>Now build a matrix $M$ of dimension $|V| \times |V|$. Both the rows and the columns of $M$ represent words. Each cell $M[i, j]$ is filled with the value $d[(w_{i}, w_{j})]$.</li>
</ol>
<p>In short: there are many ways to define a distributional matrix, but a word $\times$ word matrix can be built in the following order:</p>
<ol>
<li>Define a notion of <strong>co-occurrence context</strong>. This can be anything whose associations you want to capture: an entire document, a paragraph, a sentence, a clause, a noun phrase, etc.</li>
<li>Define a <strong>count scaling method</strong>. The simplest method gives every token in the context window the same weight; alternatively, weights can be scaled as $1/d$, where $d$ is the distance in tokens from the target word.</li>
<li>Scan the corpus to build a dictionary $d$ mapping word pairs to co-occurrence values. For each pair of words $w$ and $w’$ appearing in the same context (as defined in 1), increment $d[(w, w’)]$ by the value determined by the weighting scheme from 2.</li>
<li>Using the count dictionary $d$ built in 3, construct the full vocabulary $V$, an ordered list of word types.</li>
<li>Build a $|V| \times |V|$ matrix $M$ whose rows and columns are words. Each cell $M[i,j]$ is filled with the co-occurrence value $d[(w_{i}, w_{j})]$.</li>
</ol>
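The five steps above can be sketched end-to-end in a few lines of Python. This is a minimal illustration on a made-up two-sentence corpus, not the course's vsm.py implementation; it treats each sentence as one context and uses the flat (weight-1) counting scheme, with the 1/d alternative noted in a comment.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Toy corpus; each sentence is one co-occurrence context (step 1).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

# Steps 2-3: count co-occurrences, giving every pair a flat weight of 1.
d = defaultdict(float)
for context in corpus:
    for i, w in enumerate(context):
        for j, w2 in enumerate(context):
            if i != j:
                d[(w, w2)] += 1.0  # use 1.0 / abs(i - j) for 1/d scaling

# Step 4: the vocabulary is an ordered list of word types.
V = sorted({w for pair in d for w in pair})

# Step 5: fill a |V| x |V| matrix from the count dictionary.
M = np.zeros((len(V), len(V)))
for (w, w2), count in d.items():
    M[V.index(w), V.index(w2)] = count

df = pd.DataFrame(M, index=V, columns=V)
print(df.loc["cat", "sat"])  # "cat" and "sat" co-occur once, so 1.0
```

The real matrices below were built the same basic way, just from millions of tokens and with the vocabulary winnowed to the top 6K unigrams.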
<h2 id="pre-computed-example-matrices">Pre-computed example matrices</h2>
<p>The data distribution includes four matrices that we’ll use for hands-on exploration. All of them were designed in the same basic way:</p>
<ul>
<li>
<p>They are word $\times$ word matrices with 6K rows and 6K columns.</p>
</li>
<li>
<p>The vocabulary is the top 6K most frequent unigrams.</p>
</li>
</ul>
<p>Two come from Yelp user-supplied reviews, and two come from Gigaword, a collection of newswire and newspaper text. Further details:</p>
<table>
<thead>
<tr>
<th>filename</th>
<th>source</th>
<th>window size</th>
<th>count weighting</th>
</tr>
</thead>
<tbody>
<tr>
<td>yelp_window5-scaled.csv.gz</td>
<td>Yelp reviews</td>
<td>5</td>
<td>1/d</td>
</tr>
<tr>
<td>yelp_window20-flat.csv.gz</td>
<td>Yelp reviews</td>
<td>20</td>
<td>1</td>
</tr>
<tr>
<td>giga_window5-scaled.csv.gz</td>
<td>Gigaword</td>
<td>5</td>
<td>1/d</td>
</tr>
<tr>
<td>giga_window20-flat.csv.gz</td>
<td>Gigaword</td>
<td>20</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>Any hunches about how these matrices might differ from each other?</p>
<pre><code class="language-notebook">yelp5 = pd.read_csv(
os.path.join(DATA_HOME, 'yelp_window5-scaled.csv.gz'), index_col=0)
yelp5
</code></pre>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>):</th>
<th>);</th>
<th>..</th>
<th>...</th>
<th>:(</th>
<th>:)</th>
<th>:/</th>
<th>:D</th>
<th>:|</th>
<th>;p</th>
<th>...</th>
<th>younger</th>
<th>your</th>
<th>yourself</th>
<th>youth</th>
<th>zebra</th>
<th>zero</th>
<th>zinc</th>
<th>zombie</th>
<th>zone</th>
<th>zoo</th>
</tr>
</thead>
<tbody>
<tr>
<th>):</th>
<td>154.566667</td>
<td>0.000000</td>
<td>16.516667</td>
<td>60.750000</td>
<td>2.550000</td>
<td>8.466667</td>
<td>0.000000</td>
<td>0.866667</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>0.666667</td>
<td>40.516667</td>
<td>4.616667</td>
<td>0.416667</td>
<td>0.000000</td>
<td>4.450000</td>
<td>0.000000</td>
<td>0.166667</td>
<td>0.166667</td>
<td>0.333333</td>
</tr>
<tr>
<th>);</th>
<td>0.000000</td>
<td>24.866667</td>
<td>3.200000</td>
<td>36.200000</td>
<td>0.000000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>0.250000</td>
<td>32.733333</td>
<td>2.200000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.533333</td>
<td>0.000000</td>
<td>0.250000</td>
<td>0.450000</td>
<td>0.700000</td>
</tr>
<tr>
<th>..</th>
<td>16.516667</td>
<td>3.200000</td>
<td>13494.866667</td>
<td>6945.366667</td>
<td>196.750000</td>
<td>501.650000</td>
<td>40.450000</td>
<td>29.400000</td>
<td>0.700000</td>
<td>2.15</td>
<td>...</td>
<td>31.850000</td>
<td>2908.116667</td>
<td>247.366667</td>
<td>3.183333</td>
<td>1.783333</td>
<td>146.733333</td>
<td>1.250000</td>
<td>9.733333</td>
<td>23.816667</td>
<td>20.300000</td>
</tr>
<tr>
<th>...</th>
<td>60.750000</td>
<td>36.200000</td>
<td>6945.366667</td>
<td>62753.766667</td>
<td>783.216667</td>
<td>1711.583333</td>
<td>161.383333</td>
<td>101.700000</td>
<td>4.883333</td>
<td>7.85</td>
<td>...</td>
<td>119.100000</td>
<td>12530.533333</td>
<td>1259.683333</td>
<td>29.550000</td>
<td>4.516667</td>
<td>675.816667</td>
<td>8.833333</td>
<td>33.400000</td>
<td>108.783333</td>
<td>77.583333</td>
</tr>
<tr>
<th>:(</th>
<td>2.550000</td>
<td>0.000000</td>
<td>196.750000</td>
<td>783.216667</td>
<td>423.433333</td>
<td>13.950000</td>
<td>2.133333</td>
<td>0.166667</td>
<td>0.366667</td>
<td>0.00</td>
<td>...</td>
<td>0.333333</td>
<td>72.433333</td>
<td>5.333333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>7.650000</td>
<td>0.000000</td>
<td>0.533333</td>
<td>4.533333</td>
<td>0.166667</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>zero</th>
<td>4.450000</td>
<td>0.533333</td>
<td>146.733333</td>
<td>675.816667</td>
<td>7.650000</td>
<td>8.183333</td>
<td>0.750000</td>
<td>0.200000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>1.466667</td>
<td>121.233333</td>
<td>11.483333</td>
<td>0.333333</td>
<td>0.000000</td>
<td>476.733333</td>
<td>0.000000</td>
<td>0.783333</td>
<td>3.583333</td>
<td>0.866667</td>
</tr>
<tr>
<th>zinc</th>
<td>0.000000</td>
<td>0.000000</td>
<td>1.250000</td>
<td>8.833333</td>
<td>0.000000</td>
<td>0.500000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>0.000000</td>
<td>2.600000</td>
<td>0.166667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>2.066667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr>
<th>zombie</th>
<td>0.166667</td>
<td>0.250000</td>
<td>9.733333</td>
<td>33.400000</td>
<td>0.533333</td>
<td>1.733333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>0.750000</td>
<td>20.700000</td>
<td>1.866667</td>
<td>0.000000</td>
<td>0.166667</td>
<td>0.783333</td>
<td>0.000000</td>
<td>25.400000</td>
<td>4.616667</td>
<td>0.500000</td>
</tr>
<tr>
<th>zone</th>
<td>0.166667</td>
<td>0.450000</td>
<td>23.816667</td>
<td>108.783333</td>
<td>4.533333</td>
<td>7.066667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>0.916667</td>
<td>257.866667</td>
<td>4.783333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>3.583333</td>
<td>0.000000</td>
<td>4.616667</td>
<td>28.800000</td>
<td>0.000000</td>
</tr>
<tr>
<th>zoo</th>
<td>0.333333</td>
<td>0.700000</td>
<td>20.300000</td>
<td>77.583333</td>
<td>0.166667</td>
<td>4.850000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>1.000000</td>
<td>31.783333</td>
<td>3.533333</td>
<td>0.000000</td>
<td>0.333333</td>
<td>0.866667</td>
<td>0.000000</td>
<td>0.500000</td>
<td>0.000000</td>
<td>85.166667</td>
</tr>
</tbody>
</table>
<p>6000 rows × 6000 columns</p>
</div>
<pre><code class="language-notebook">yelp20 = pd.read_csv(
os.path.join(DATA_HOME, 'yelp_window20-flat.csv.gz'), index_col=0)
yelp20
</code></pre>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>):</th>
<th>);</th>
<th>..</th>
<th>...</th>
<th>:(</th>
<th>:)</th>
<th>:/</th>
<th>:D</th>
<th>:|</th>
<th>;p</th>
<th>...</th>
<th>younger</th>
<th>your</th>
<th>yourself</th>
<th>youth</th>
<th>zebra</th>
<th>zero</th>
<th>zinc</th>
<th>zombie</th>
<th>zone</th>
<th>zoo</th>
</tr>
</thead>
<tbody>
<tr>
<th>):</th>
<td>3910.0</td>
<td>24.0</td>
<td>250.0</td>
<td>1162.0</td>
<td>48.0</td>
<td>261.0</td>
<td>4.0</td>
<td>22.0</td>
<td>0.0</td>
<td>1.0</td>
<td>...</td>
<td>6.0</td>
<td>945.0</td>
<td>61.0</td>
<td>3.0</td>
<td>1.0</td>
<td>27.0</td>
<td>1.0</td>
<td>1.0</td>
<td>2.0</td>
<td>3.0</td>
</tr>
<tr>
<th>);</th>
<td>24.0</td>
<td>1738.0</td>
<td>55.0</td>
<td>595.0</td>
<td>7.0</td>
<td>27.0</td>
<td>0.0</td>
<td>2.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>5.0</td>
<td>564.0</td>
<td>29.0</td>
<td>0.0</td>
<td>0.0</td>
<td>7.0</td>
<td>0.0</td>
<td>2.0</td>
<td>3.0</td>
<td>5.0</td>
</tr>
<tr>
<th>..</th>
<td>250.0</td>
<td>55.0</td>
<td>330680.0</td>
<td>154726.0</td>
<td>1605.0</td>
<td>5707.0</td>
<td>311.0</td>
<td>336.0</td>
<td>10.0</td>
<td>21.0</td>
<td>...</td>
<td>437.0</td>
<td>40631.0</td>
<td>2680.0</td>
<td>36.0</td>
<td>12.0</td>
<td>1441.0</td>
<td>8.0</td>
<td>107.0</td>
<td>237.0</td>
<td>182.0</td>
</tr>
<tr>
<th>...</th>
<td>1162.0</td>
<td>595.0</td>
<td>154726.0</td>
<td>1213778.0</td>
<td>6404.0</td>
<td>21831.0</td>
<td>1230.0</td>
<td>1172.0</td>
<td>41.0</td>
<td>73.0</td>
<td>...</td>
<td>1684.0</td>
<td>174619.0</td>
<td>13862.0</td>
<td>323.0</td>
<td>57.0</td>
<td>6345.0</td>
<td>71.0</td>
<td>405.0</td>
<td>1208.0</td>
<td>840.0</td>
</tr>
<tr>
<th>:(</th>
<td>48.0</td>
<td>7.0</td>
<td>1605.0</td>
<td>6404.0</td>
<td>1646.0</td>
<td>528.0</td>
<td>55.0</td>
<td>33.0</td>
<td>4.0</td>
<td>1.0</td>
<td>...</td>
<td>9.0</td>
<td>1425.0</td>
<td>84.0</td>
<td>1.0</td>
<td>1.0</td>
<td>66.0</td>
<td>0.0</td>
<td>5.0</td>
<td>15.0</td>
<td>7.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>zero</th>
<td>27.0</td>
<td>7.0</td>
<td>1441.0</td>
<td>6345.0</td>
<td>66.0</td>
<td>110.0</td>
<td>6.0</td>
<td>4.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>43.0</td>
<td>4141.0</td>
<td>368.0</td>
<td>4.0</td>
<td>0.0</td>
<td>3894.0</td>
<td>0.0</td>
<td>14.0</td>
<td>28.0</td>
<td>19.0</td>
</tr>
<tr>
<th>zinc</th>
<td>1.0</td>
<td>0.0</td>
<td>8.0</td>
<td>71.0</td>
<td>0.0</td>
<td>4.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>34.0</td>
<td>7.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>98.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>zombie</th>
<td>1.0</td>
<td>2.0</td>
<td>107.0</td>
<td>405.0</td>
<td>5.0</td>
<td>22.0</td>
<td>0.0</td>
<td>4.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>3.0</td>
<td>287.0</td>
<td>28.0</td>
<td>1.0</td>
<td>1.0</td>
<td>14.0</td>
<td>0.0</td>
<td>604.0</td>
<td>22.0</td>
<td>1.0</td>
</tr>
<tr>
<th>zone</th>
<td>2.0</td>
<td>3.0</td>
<td>237.0</td>
<td>1208.0</td>
<td>15.0</td>
<td>62.0</td>
<td>1.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>14.0</td>
<td>1721.0</td>
<td>99.0</td>
<td>3.0</td>
<td>1.0</td>
<td>28.0</td>
<td>0.0</td>
<td>22.0</td>
<td>764.0</td>
<td>14.0</td>
</tr>
<tr>
<th>zoo</th>
<td>3.0</td>
<td>5.0</td>
<td>182.0</td>
<td>840.0</td>
<td>7.0</td>
<td>47.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>16.0</td>
<td>699.0</td>
<td>53.0</td>
<td>2.0</td>
<td>10.0</td>
<td>19.0</td>
<td>0.0</td>
<td>1.0</td>
<td>14.0</td>
<td>3148.0</td>
</tr>
</tbody>
</table>
<p>6000 rows × 6000 columns</p>
</div>
<pre><code class="language-notebook">giga5 = pd.read_csv(
os.path.join(DATA_HOME, 'giga_window5-scaled.csv.gz'), index_col=0)
giga5
</code></pre>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>):</th>
<th>);</th>
<th>..</th>
<th>...</th>
<th>:(</th>
<th>:)</th>
<th>:/</th>
<th>:D</th>
<th>:|</th>
<th>;p</th>
<th>...</th>
<th>younger</th>
<th>your</th>
<th>yourself</th>
<th>youth</th>
<th>zebra</th>
<th>zero</th>
<th>zinc</th>
<th>zombie</th>
<th>zone</th>
<th>zoo</th>
</tr>
</thead>
<tbody>
<tr>
<th>):</th>
<td>4.300000</td>
<td>8.100000</td>
<td>7.700000</td>
<td>113.766667</td>
<td>2.666667</td>
<td>1.116667</td>
<td>0.50</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>2.850000</td>
<td>147.633333</td>
<td>8.483333</td>
<td>4.500000</td>
<td>0.000000</td>
<td>2.733333</td>
<td>0.200000</td>
<td>0.333333</td>
<td>344.066667</td>
<td>0.700000</td>
</tr>
<tr>
<th>);</th>
<td>8.100000</td>
<td>1092.400000</td>
<td>0.650000</td>
<td>52.783333</td>
<td>0.666667</td>
<td>0.000000</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>7.100000</td>
<td>44.400000</td>
<td>3.333333</td>
<td>7.116667</td>
<td>0.400000</td>
<td>1.616667</td>
<td>1.000000</td>
<td>0.500000</td>
<td>11.516667</td>
<td>1.250000</td>
</tr>
<tr>
<th>..</th>
<td>7.700000</td>
<td>0.650000</td>
<td>18900.766667</td>
<td>12657.800000</td>
<td>0.500000</td>
<td>0.500000</td>
<td>0.00</td>
<td>0.5</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>1.166667</td>
<td>29.883333</td>
<td>0.833333</td>
<td>0.250000</td>
<td>0.000000</td>
<td>0.916667</td>
<td>0.250000</td>
<td>0.000000</td>
<td>2.983333</td>
<td>1.083333</td>
</tr>
<tr>
<th>...</th>
<td>113.766667</td>
<td>52.783333</td>
<td>12657.800000</td>
<td>131656.033333</td>
<td>0.866667</td>
<td>0.500000</td>
<td>0.85</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>40.333333</td>
<td>2187.066667</td>
<td>89.750000</td>
<td>42.000000</td>
<td>8.200000</td>
<td>30.683333</td>
<td>0.666667</td>
<td>2.333333</td>
<td>79.233333</td>
<td>11.083333</td>
</tr>
<tr>
<th>:(</th>
<td>2.666667</td>
<td>0.666667</td>
<td>0.500000</td>
<td>0.866667</td>
<td>0.000000</td>
<td>0.333333</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.000000</td>
<td>0.333333</td>
<td>0.000000</td>
<td>0.666667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>zero</th>
<td>2.733333</td>
<td>1.616667</td>
<td>0.916667</td>
<td>30.683333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.666667</td>
<td>26.516667</td>
<td>1.533333</td>
<td>1.600000</td>
<td>0.000000</td>
<td>270.333333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>6.900000</td>
<td>0.000000</td>
</tr>
<tr>
<th>zinc</th>
<td>0.200000</td>
<td>1.000000</td>
<td>0.250000</td>
<td>0.666667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.000000</td>
<td>7.116667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>24.966667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr>
<th>zombie</th>
<td>0.333333</td>
<td>0.500000</td>
<td>0.000000</td>
<td>2.333333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.200000</td>
<td>2.700000</td>
<td>0.000000</td>
<td>0.250000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>4.700000</td>
<td>0.500000</td>
<td>0.000000</td>
</tr>
<tr>
<th>zone</th>
<td>344.066667</td>
<td>11.516667</td>
<td>2.983333</td>
<td>79.233333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>1.000000</td>
<td>112.033333</td>
<td>3.850000</td>
<td>2.333333</td>
<td>0.666667</td>
<td>6.900000</td>
<td>0.000000</td>
<td>0.500000</td>
<td>248.466667</td>
<td>0.816667</td>
</tr>
<tr>
<th>zoo</th>
<td>0.700000</td>
<td>1.250000</td>
<td>1.083333</td>
<td>11.083333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.25</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>1.000000</td>
<td>9.383333</td>
<td>0.250000</td>
<td>0.400000</td>
<td>1.500000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.816667</td>
<td>141.133333</td>
</tr>
</tbody>
</table>
<p>6000 rows × 6000 columns</p>
</div>
<pre><code class="language-notebook">giga20 = pd.read_csv(
os.path.join(DATA_HOME, 'giga_window20-flat.csv.gz'), index_col=0)
giga20
</code></pre>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>):</th>
<th>);</th>
<th>..</th>
<th>...</th>
<th>:(</th>
<th>:)</th>
<th>:/</th>
<th>:D</th>
<th>:|</th>
<th>;p</th>
<th>...</th>
<th>younger</th>
<th>your</th>
<th>yourself</th>
<th>youth</th>
<th>zebra</th>
<th>zero</th>
<th>zinc</th>
<th>zombie</th>
<th>zone</th>
<th>zoo</th>
</tr>
</thead>
<tbody>
<tr>
<th>):</th>
<td>7046.0</td>
<td>517.0</td>
<td>29.0</td>
<td>1684.0</td>
<td>45.0</td>
<td>10.0</td>
<td>15.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>...</td>
<td>34.0</td>
<td>2716.0</td>
<td>318.0</td>
<td>75.0</td>
<td>0.0</td>
<td>18.0</td>
<td>11.0</td>
<td>5.0</td>
<td>1128.0</td>
<td>11.0</td>
</tr>
<tr>
<th>);</th>
<td>517.0</td>
<td>108712.0</td>
<td>19.0</td>
<td>288.0</td>
<td>15.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>1.0</td>
<td>0.0</td>
<td>...</td>
<td>105.0</td>
<td>872.0</td>
<td>34.0</td>
<td>80.0</td>
<td>5.0</td>
<td>19.0</td>
<td>15.0</td>
<td>5.0</td>
<td>111.0</td>
<td>13.0</td>
</tr>
<tr>
<th>..</th>
<td>29.0</td>
<td>19.0</td>
<td>101566.0</td>
<td>116373.0</td>
<td>1.0</td>
<td>1.0</td>
<td>6.0</td>
<td>2.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>72.0</td>
<td>886.0</td>
<td>17.0</td>
<td>24.0</td>
<td>23.0</td>
<td>75.0</td>
<td>33.0</td>
<td>0.0</td>
<td>225.0</td>
<td>10.0</td>
</tr>
<tr>
<th>...</th>
<td>1684.0</td>
<td>288.0</td>
<td>116373.0</td>
<td>1627084.0</td>
<td>27.0</td>
<td>48.0</td>
<td>7.0</td>
<td>3.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>671.0</td>
<td>27690.0</td>
<td>1158.0</td>
<td>886.0</td>
<td>392.0</td>
<td>417.0</td>
<td>6.0</td>
<td>34.0</td>
<td>1522.0</td>
<td>149.0</td>
</tr>
<tr>
<th>:(</th>
<td>45.0</td>
<td>15.0</td>
<td>1.0</td>
<td>27.0</td>
<td>50.0</td>
<td>5.0</td>
<td>2.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>1.0</td>
<td>49.0</td>
<td>3.0</td>
<td>2.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>zero</th>
<td>18.0</td>
<td>19.0</td>
<td>75.0</td>
<td>417.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>25.0</td>
<td>596.0</td>
<td>49.0</td>
<td>26.0</td>
<td>0.0</td>
<td>3460.0</td>
<td>2.0</td>
<td>0.0</td>
<td>140.0</td>
<td>7.0</td>
</tr>
<tr>
<th>zinc</th>
<td>11.0</td>
<td>15.0</td>
<td>33.0</td>
<td>6.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>3.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>85.0</td>
<td>2.0</td>
<td>1.0</td>
<td>0.0</td>
<td>2.0</td>
<td>1366.0</td>
<td>0.0</td>
<td>2.0</td>
<td>1.0</td>
</tr>
<tr>
<th>zombie</th>
<td>5.0</td>
<td>5.0</td>
<td>0.0</td>
<td>34.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>3.0</td>
<td>65.0</td>
<td>1.0</td>
<td>3.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>188.0</td>
<td>3.0</td>
<td>0.0</td>
</tr>
<tr>
<th>zone</th>
<td>1128.0</td>
<td>111.0</td>
<td>225.0</td>
<td>1522.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>1.0</td>
<td>...</td>
<td>54.0</td>
<td>1464.0</td>
<td>60.0</td>
<td>52.0</td>
<td>7.0</td>
<td>140.0</td>
<td>2.0</td>
<td>3.0</td>
<td>11250.0</td>
<td>19.0</td>
</tr>
<tr>
<th>zoo</th>
<td>11.0</td>
<td>13.0</td>
<td>10.0</td>
<td>149.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>34.0</td>
<td>257.0</td>
<td>15.0</td>
<td>16.0</td>
<td>36.0</td>
<td>7.0</td>
<td>1.0</td>
<td>0.0</td>
<td>19.0</td>
<td>5568.0</td>
</tr>
</tbody>
</table>
<p>6000 rows × 6000 columns</p>
</div>
<h2 id="vector-comparison">Vector comparison</h2>
<p>Vector comparisons form the heart of our analyses in this context.</p>
<ul>
<li>
<p>For the most part, we are interested in measuring the <strong>distance</strong> between vectors. The guiding idea is that semantically related words should be close together in the vector spaces we build, and semantically unrelated words should be far apart.</p>
</li>
<li>
<p>The <a href="http://docs.scipy.org/doc/scipy-0.14.0/reference/spatial.distance.html">scipy.spatial.distance</a> module has a lot of vector comparison methods, so you might check them out if you want to go beyond the functions defined and explored here. Read the documentation closely, though: many of those methods are defined only for binary vectors, whereas the VSMs we’ll use allow all float values.</p>
</li>
</ul>
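Before turning to specific measures, here is a quick sketch of calling the float-friendly scipy.spatial.distance methods directly on real-valued vectors (the two example vectors are made up for illustration):

```python
import numpy as np
import scipy.spatial.distance

u = np.array([2.0, 4.0])
v = np.array([10.0, 15.0])

# These three are defined for arbitrary float vectors, unlike the
# binary-only methods in the same module.
print(scipy.spatial.distance.euclidean(u, v))   # straight-line distance
print(scipy.spatial.distance.cityblock(u, v))   # L1 (Manhattan) distance
print(scipy.spatial.distance.cosine(u, v))      # 1 - cosine similarity
```

Each returns a single float, so any of them can serve as the pairwise distance function in the neighbor-finding code later in the notebook.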
<h3 id="euclidean">Euclidean</h3>
<p>The most basic and intuitive distance measure between vectors is <strong>euclidean distance</strong>. The euclidean distance between two vectors $u$ and $v$ of dimension $n$ is</p>
\[\textbf{euclidean}(u, v) =
\sqrt{\sum_{i=1}^{n}|u_{i} - v_{i}|^{2}}\]
<p>In two dimensions, this corresponds to the length of the most direct line between the two points.</p>
<p>In <code class="language-plaintext highlighter-rouge">vsm.py</code>, the function <code class="language-plaintext highlighter-rouge">euclidean</code> just uses the corresponding <a href="https://docs.scipy.org/doc/scipy/reference/spatial.distance.html">scipy.spatial.distance</a> method to define it.</p>
<p>Here’s the tiny vector space from the screencast on vector comparisons associated with this notebook:</p>
<p>Running example <br />
• Focus on distance measures <br />
• Illustrations with row vectors <br />
Suppose the words <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">B</code>, and <code class="language-plaintext highlighter-rouge">C</code> appear in two documents (or corpora, etc.) x and y, <br />
or, alternatively, that the words <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">B</code>, and <code class="language-plaintext highlighter-rouge">C</code> each co-occur with two words x and y.</p>
<pre><code class="language-notebook">ABC = pd.DataFrame([
    [ 2.0,  4.0],
    [10.0, 15.0],
    [14.0, 10.0]],
    index=['A', 'B', 'C'],
    columns=['x', 'y'])
</code></pre>
<pre><code class="language-notebook">ABC
</code></pre>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>x</th>
<th>y</th>
</tr>
</thead>
<tbody>
<tr>
<th>A</th>
<td>2.0</td>
<td>4.0</td>
</tr>
<tr>
<th>B</th>
<td>10.0</td>
<td>15.0</td>
</tr>
<tr>
<th>C</th>
<td>14.0</td>
<td>10.0</td>
</tr>
</tbody>
</table>
</div>
<pre><code class="language-notebook">def plot_ABC(df):
    ax = df.plot.scatter(x='x', y='y', marker='.', legend=False)
    m = df.values.max(axis=None)
    ax.set_xlim([0, m*1.2])
    ax.set_ylim([0, m*1.2])
    for label, row in df.iterrows():
        ax.text(row['x'], row['y'], label)
</code></pre>
<pre><code class="language-notebook">plot_ABC(ABC)
</code></pre>
<p><img src="/images/vsm_01_distributional_practice_files/vsm_01_distributional_practice_29_0.png" alt="png" /></p>
<p>The euclidean distances align well with raw visual distance in the plot:</p>
<pre><code class="language-notebook">def euclidean(u, v):
    return scipy.spatial.distance.euclidean(u, v)
</code></pre>
<pre><code class="language-notebook">def abc_comparisons(df, distfunc):
    """Take a distance function and print the (A, B) and (B, C) distances."""
    for a, b in (('A', 'B'), ('B', 'C')):
        dist = distfunc(df.loc[a], df.loc[b])  # df.loc[a] is a pd.Series
        print('{0:}({1:}, {2:}) = {3:7.02f}'.format(
            distfunc.__name__, a, b, dist))
</code></pre>
<pre><code class="language-notebook">abc_comparisons(ABC, euclidean)  # pass ABC with its raw x, y values
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>euclidean(A, B) = 13.60
euclidean(B, C) = 6.40
</code></pre></div></div>
<p>However, suppose we think of the vectors as encoding word meanings in the vector-space sense. In that case, these distances don’t match our intuitions:</p>
<ul>
<li>
<p>The distributions of B and C are more or less directly opposed, suggesting very different meanings, whereas A and B are rather closely aligned, abstracting away from the fact that the first is far less frequent than the second.</p>
</li>
<li>
<p>In terms of the large models we will soon explore, A and B resemble a pair like <em>superb</em> and <em>good</em>, which have similar meanings but very different frequencies.</p>
</li>
<li>
<p>In contrast, B and C are like <em>good</em> and <em>disappointing</em> — similar overall frequencies but different distributions with respect to the overall vocabulary.</p>
</li>
</ul>
<h3 id="length-normalization">Length normalization</h3>
<p>These affinities are immediately apparent if we <strong>normalize the vectors by their length</strong>. To do this, we first define the L2-length of a vector:</p>
\[\|u\|_{2} = \sqrt{\sum_{i=1}^{n} u_{i}^{2}}\]
<p>And then the normalization step just divides each value by this quantity:</p>
\[\left[
\frac{u_{1}}{\|u\|_{2}},
\frac{u_{2}}{\|u\|_{2}},
\ldots,
\frac{u_{n}}{\|u\|_{2}}
\right]\]
<pre><code class="language-notebook">def vector_length(u):
    """L2 length of `u`: the square root of its dot product with itself,
    e.g., np.array([1, 2, 3]).dot([1, 2, 3]) = 1 + 4 + 9 = 14,
    so the length is sqrt(14).
    """
    return np.sqrt(u.dot(u))

def length_norm(u):
    return u / vector_length(u)
</code></pre>
<pre><code class="language-notebook">ABC_normed = ABC.apply(length_norm, axis=1)
</code></pre>
<pre><code class="language-notebook">plot_ABC(ABC_normed)
</code></pre>
<p><img src="/images/vsm_01_distributional_practice_files/vsm_01_distributional_practice_38_0.png" alt="png" /></p>
<pre><code class="language-notebook">abc_comparisons(ABC_normed, euclidean)  # euclidean distances on the length-normalized ABC
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>euclidean(A, B) = 0.12
euclidean(B, C) = 0.36
</code></pre></div></div>
<p>Here, the connection between A and B is more apparent, as is the opposition between B and C.</p>
<h3 id="cosine-distance">Cosine distance</h3>
<p>Cosine distance takes overall length into account. The cosine distance between two vectors $u$ and $v$ of dimension $n$ is</p>
\[\textbf{cosine}(u, v) =
1 - \frac{\sum_{i=1}^{n} u_{i} \cdot v_{i}}{\|u\|_{2} \cdot \|v\|_{2}}\]
<p>The similarity part of this (the righthand term of the subtraction) is actually measuring the angle between the two vectors. The result is the same (in terms of rank order) as one gets from first normalizing both vectors using $\|\cdot\|_{2}$ and then calculating their Euclidean distance.</p>
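This rank-order equivalence can be checked directly: the Euclidean distance between length-normalized vectors is exactly $\sqrt{2 \cdot \textbf{cosine}(u, v)}$, a monotonic transform of cosine distance. A minimal numpy sketch (reimplementing the two measures for plain arrays, using the A and B vectors from the running example):

```python
import numpy as np

def length_norm(u):
    # Divide a vector by its L2 length.
    return u / np.sqrt(u.dot(u))

def cosine_distance(u, v):
    return 1.0 - u.dot(v) / (np.sqrt(u.dot(u)) * np.sqrt(v.dot(v)))

A = np.array([2.0, 4.0])
B = np.array([10.0, 15.0])

# Euclidean distance between the length-normalized vectors equals
# sqrt(2 * cosine distance), so both measures induce the same
# nearest-neighbor ranking:
d_norm = np.linalg.norm(length_norm(A) - length_norm(B))
assert np.isclose(d_norm, np.sqrt(2.0 * cosine_distance(A, B)))
```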
<pre><code class="language-notebook">def cosine(u, v):
    return scipy.spatial.distance.cosine(u, v)
</code></pre>
<pre><code class="language-notebook">abc_comparisons(ABC, cosine)
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cosine(A, B) = 0.01
cosine(B, C) = 0.07
</code></pre></div></div>
<p>So, in building in the length normalization, cosine distance achieves our goal of associating A and B and separating both from C.</p>
<h3 id="matching-based-methods">Matching-based methods</h3>
<p>Matching-based methods are also common in the literature. The basic matching measure effectively creates a vector consisting of all of the smaller of the two values at each coordinate, and then sums them:</p>
\[\textbf{matching}(u, v) = \sum_{i=1}^{n} \min(u_{i}, v_{i})\]
<p>This is implemented in <code class="language-plaintext highlighter-rouge">vsm</code> as <code class="language-plaintext highlighter-rouge">matching</code>.</p>
<p>One approach to normalizing the matching values is the <a href="https://en.wikipedia.org/wiki/Jaccard_index"><strong>Jaccard coefficient</strong></a>. The numerator is the matching coefficient. The denominator — the normalizer — is intuitively like the set union: for binary vectors, it gives the cardinality of the union of the two being compared:</p>
\[\textbf{jaccard}(u, v) =
1 - \frac{\textbf{matching}(u, v)}{\sum_{i=1}^{n} \max(u_{i}, v_{i})}\]
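For intuition on the binary case, here is a small check (with two hypothetical 0/1 vectors) that the numerator counts the intersection and the denominator counts the union:

```python
import numpy as np

# Two binary "membership" vectors over a 5-element universe:
u = np.array([1, 1, 0, 1, 0])
v = np.array([1, 0, 0, 1, 1])

matching = np.sum(np.minimum(u, v))  # intersection size
union = np.sum(np.maximum(u, v))     # union size
jaccard_distance = 1.0 - matching / union

assert matching == 2 and union == 4
assert jaccard_distance == 0.5
```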
<h3 id="summary">Summary</h3>
<p>Suppose we set for ourselves the goal of associating A with B and disassociating B from C, in keeping with the semantic intuition expressed above. Then we can assess distance measures by whether they achieve this goal:</p>
<pre><code class="language-notebook">def matching(u, v):
    return np.sum(np.minimum(u, v))

def jaccard(u, v):
    return 1.0 - (matching(u, v) / np.sum(np.maximum(u, v)))
</code></pre>
<pre><code class="language-notebook">for m in (euclidean, cosine, jaccard):
    fmt = {
        'n': m.__name__,
        'AB': m(ABC.loc['A'], ABC.loc['B']),
        'BC': m(ABC.loc['B'], ABC.loc['C'])}
    print('{n:>15}(A, B) = {AB:5.2f} {n:>15}(B, C) = {BC:5.2f}'.format(**fmt))
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> euclidean(A, B) = 13.60 euclidean(B, C) = 6.40
cosine(A, B) = 0.01 cosine(B, C) = 0.07
jaccard(A, B) = 0.76 jaccard(B, C) = 0.31
</code></pre></div></div>
<h2 id="distributional-neighbors">Distributional neighbors</h2>
<p>The <code class="language-plaintext highlighter-rouge">neighbors</code> function in <code class="language-plaintext highlighter-rouge">vsm</code> is an investigative aide. For a given word <code class="language-plaintext highlighter-rouge">w</code>, it ranks all the words in the vocabulary according to their distance from <code class="language-plaintext highlighter-rouge">w</code>, as measured by <code class="language-plaintext highlighter-rouge">distfunc</code> (default: <code class="language-plaintext highlighter-rouge">vsm.cosine</code>).</p>
<p>By playing around with this function, you can start to get a sense for how the distance functions differ. Here are some example uses; you might try some new words to get a feel for what these matrices are like and how different words look.</p>
<pre><code class="language-notebook">def neighbors(word, df, distfunc=cosine):
    """Tool for finding the nearest neighbors of `word` in `df` according
    to `distfunc`. The comparisons are between row vectors.

    Parameters
    ----------
    word : str
        The anchor word. Assumed to be in `rownames`.
    df : pd.DataFrame
        The vector-space model.
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean`,
        `matching`, `jaccard`, as well as any other distance measure
        between 1d vectors.

    Raises
    ------
    ValueError
        If word is not in `df.index`.

    Returns
    -------
    pd.Series
        Ordered by closeness to `word`.

    """
    if word not in df.index:
        raise ValueError('{} is not in this VSM'.format(word))
    w = df.loc[word]
    dists = df.apply(lambda x: distfunc(w, x), axis=1)
    return dists.sort_values()
</code></pre>
<pre><code class="language-notebook">neighbors('A', ABC, distfunc=euclidean)
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 0.000000
C 13.416408
B 13.601471
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('A', ABC, distfunc=cosine)
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 0.000000
B 0.007722
C 0.116212
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp5, distfunc=euclidean).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
great 363501.370559
very 463140.024533
there 484997.266143
so 503078.898630
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp20, distfunc=euclidean).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000e+00
very 1.582079e+06
food 1.706024e+06
as 2.180615e+06
great 2.309515e+06
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp5, distfunc=cosine).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
decent 0.052814
weak 0.061491
impressive 0.063750
solid 0.072259
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp20, distfunc=cosine).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
decent 0.005797
pretty 0.006568
really 0.007324
quite 0.007358
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga5, distfunc=euclidean).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
very 96629.020074
another 102967.358766
little 111678.083808
big 112087.562461
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga20, distfunc=euclidean).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
very 342311.566423
here 510760.069407
think 529750.740523
right 561272.646841
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga5, distfunc=cosine).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
bad 0.047839
painful 0.070546
simple 0.073625
safe 0.076624
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga20, distfunc=cosine).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
bad 0.004829
everyone 0.006463
always 0.006505
very 0.007011
dtype: float64
</code></pre></div></div>
<h2 id="matrix-reweighting">Matrix reweighting</h2>
<ul>
<li>
<p>The goal of reweighting is to amplify the important, trustworthy, and unusual, while deemphasizing the mundane and the quirky.</p>
</li>
<li>
<p>Absent a defined objective function, this will remain fuzzy, but the intuition behind moving away from raw counts is that frequency is a poor proxy for our target semantic ideas.</p>
</li>
</ul>
<h3 id="normalization">Normalization</h3>
<p>Normalization (row-wise or column-wise) is perhaps the simplest form of reweighting. With <code class="language-plaintext highlighter-rouge">vsm.length_norm</code>, we normalize using <code class="language-plaintext highlighter-rouge">vsm.vector_length</code>. We can also normalize each row by the sum of its values, which turns each row into a probability distribution over the columns:</p>
\[\left[
\frac{u_{1}}{\sum_{i=1}^{n}u_{i}},
\frac{u_{2}}{\sum_{i=1}^{n}u_{i}},
\ldots,
\frac{u_{n}}{\sum_{i=1}^{n}u_{i}}
\right]\]
<p>These normalization measures are <strong>insensitive to the magnitude of the underlying counts</strong>. This is often a mistake in the messy world of large data sets; $[1,10]$ and $[1000,10000]$ are very different vectors in ways that will be partly or totally obscured by normalization.</p>
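A two-line sketch makes this magnitude-insensitivity concrete: the example vectors $[1, 10]$ and $[1000, 10000]$ become identical after probability normalization:

```python
import numpy as np

def prob_norm(u):
    # Turn a row of counts into a probability distribution over columns.
    return u / np.sum(u)

small = np.array([1.0, 10.0])
large = np.array([1000.0, 10000.0])

# The thousandfold difference in raw counts is completely erased:
assert np.allclose(prob_norm(small), prob_norm(large))
```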
<h3 id="observedexpected">Observed/Expected</h3>
<p>Reweighting by observed-over-expected values captures one of the central patterns in VSM reweighting: we can adjust the actual cell value in a co-occurrence matrix using information from the corresponding row and column.</p>
<p>In the case of observed-over-expected, the rows and columns define our expectation about what the cell value would be if the two co-occurring words were independent. In dividing the observed count by this value, we amplify cells whose values are larger than we would expect.</p>
<p>So that this doesn’t look more complex than it is, for an $m \times n$ matrix $X$, define</p>
\[\textbf{rowsum}(X, i) = \sum_{j=1}^{n}X_{ij}\]
\[\textbf{colsum}(X, j) = \sum_{i=1}^{m}X_{ij}\]
\[\textbf{sum}(X) = \sum_{i=1}^{m}\sum_{j=1}^{n} X_{ij}\]
\[\textbf{expected}(X, i, j) =
\frac{
\textbf{rowsum}(X, i) \cdot \textbf{colsum}(X, j)
}{
\textbf{sum}(X)
}\]
<p>Then the observed-over-expected value is</p>
\[\textbf{oe}(X, i, j) = \frac{X_{ij}}{\textbf{expected}(X, i, j)}\]
<p>In many contexts, it is more intuitive to first normalize the count matrix into a joint probability table and then think of $\textbf{rowsum}$ and $\textbf{colsum}$ as probabilities. Then it is clear that we are comparing the observed joint probability with what we would expect it to be under a null hypothesis of independence. These normalizations do not affect the final results, though.</p>
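That invariance is easy to verify with a minimal numpy sketch (reimplementing $\textbf{oe}$ for plain 2d arrays on a small count matrix):

```python
import numpy as np

def observed_over_expected(X):
    row_totals = X.sum(axis=1, keepdims=True)
    col_totals = X.sum(axis=0, keepdims=True)
    # oe = X / (rowsum * colsum / total):
    return X * X.sum() / (row_totals * col_totals)

X = np.array([[34.0, 11.0],
              [47.0,  7.0]])

# Normalizing the counts into a joint probability table first
# leaves the observed-over-expected values unchanged:
P = X / X.sum()
assert np.allclose(observed_over_expected(X), observed_over_expected(P))
```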
<p>Let’s do a quick worked-out example. Suppose we have the count matrix $X$ =</p>
<table>
<thead>
<tr>
<th> </th>
<th>a</th>
<th>b</th>
<th>rowsum</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>x</strong></td>
<td>34</td>
<td>11</td>
<td>45</td>
</tr>
<tr>
<td><strong>y</strong></td>
<td>47</td>
<td>7</td>
<td>54</td>
</tr>
<tr>
<td><strong>colsum</strong></td>
<td>81</td>
<td>18</td>
<td>99</td>
</tr>
</tbody>
</table>
<p>Then we calculate like this:</p>
\[\textbf{oe}(X, 1, 0) = \frac{47}{\frac{54 \cdot 81}{99}} = 1.06\]
<p>And the full table looks like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>x</strong></td>
<td>0.92</td>
<td>1.34</td>
</tr>
<tr>
<td><strong>y</strong></td>
<td>1.06</td>
<td>0.71</td>
</tr>
</tbody>
</table>
<pre><code class="language-notebook">def observed_over_expected(df):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    expected = np.outer(row_totals, col_totals) / total
    oe = df / expected
    return oe
</code></pre>
<pre><code class="language-notebook">oe_ex = np.array([[ 34., 11.], [ 47., 7.]])
observed_over_expected(oe_ex).round(2)
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[0.92, 1.34],
[1.06, 0.71]])
</code></pre></div></div>
<p>The implementation <code class="language-plaintext highlighter-rouge">vsm.observed_over_expected</code> should be pretty efficient.</p>
<pre><code class="language-notebook">yelp5_oe = observed_over_expected(yelp5)
</code></pre>
<pre><code class="language-notebook">yelp20_oe = observed_over_expected(yelp20)
</code></pre>
<pre><code class="language-notebook">neighbors('good', yelp5_oe).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
great 0.323037
decent 0.381849
but 0.403750
excellent 0.411684
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp20_oe).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
too 0.058905
really 0.060264
pretty 0.064466
but 0.066778
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">giga5_oe = observed_over_expected(giga5)
</code></pre>
<pre><code class="language-notebook">giga20_oe = observed_over_expected(giga20)
</code></pre>
<pre><code class="language-notebook">neighbors('good', giga5_oe).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
bad 0.485495
better 0.533762
well 0.553319
but 0.556113
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga20_oe).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
just 0.102044
so 0.106773
that's 0.108614
better 0.111044
dtype: float64
</code></pre></div></div>
<h3 id="pointwise-mutual-information">Pointwise Mutual Information</h3>
<p>Pointwise Mutual Information (PMI) is observed-over-expected in log-space:</p>
\[\textbf{pmi}(X, i, j) = \log\left(\frac{X_{ij}}{\textbf{expected}(X, i, j)}\right)\]
<p>This basic definition runs into a problem for $0$ count cells. The usual response is to set $\log(0) = 0$, but this is arguably confusing – cell counts that are smaller than expected get negative values, cell counts that are larger than expected get positive values, and 0-count values are placed in the middle of this ranking without real justification.</p>
<p>For this reason, it is more typical to use <strong>Positive PMI</strong>, which maps all negative PMI values to $0$:</p>
\[\textbf{ppmi}(X, i, j) =
\begin{cases}
\textbf{pmi}(X, i, j) & \textrm{if } \textbf{pmi}(X, i, j) > 0 \\
0 & \textrm{otherwise}
\end{cases}\]
<p>This is the default for <code class="language-plaintext highlighter-rouge">vsm.pmi</code>.</p>
<pre><code class="language-notebook">def pmi(df, positive=True):
    df = observed_over_expected(df)
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # log(0) = 0
    if positive:
        df[df < 0] = 0.0
    return df
</code></pre>
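Applied to the small count matrix from the observed/expected worked example, Positive PMI keeps only the log of the above-expectation cells and zeroes the rest. The helpers are reproduced here so the snippet runs standalone:

```python
import numpy as np
import pandas as pd

def observed_over_expected(df):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    expected = np.outer(row_totals, col_totals) / total
    return df / expected

def pmi(df, positive=True):
    df = observed_over_expected(df)
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0
    if positive:
        df[df < 0] = 0.0
    return df

X = pd.DataFrame([[34.0, 11.0], [47.0, 7.0]],
                 index=['x', 'y'], columns=['a', 'b'])

# oe(X) was [[0.92, 1.34], [1.06, 0.71]]; only the two cells above 1
# survive as positive log values under PPMI:
ppmi = pmi(X)
assert ppmi.loc['x', 'a'] == 0.0 and ppmi.loc['y', 'b'] == 0.0
assert ppmi.loc['x', 'b'] > 0 and ppmi.loc['y', 'a'] > 0
```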
<pre><code class="language-notebook">yelp5_pmi = pmi(yelp5)
</code></pre>
<pre><code class="language-notebook">yelp20_pmi = pmi(yelp20)
</code></pre>
<pre><code class="language-notebook">neighbors('good', yelp5_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
decent 0.448116
great 0.466581
tasty 0.532094
excellent 0.569720
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp20_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
tasty 0.201911
delicious 0.261547
flavorful 0.317310
ordered 0.349519
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">giga5_pmi = pmi(giga5)
</code></pre>
<pre><code class="language-notebook">giga20_pmi = pmi(giga20)
</code></pre>
<pre><code class="language-notebook">neighbors('good', giga5_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
bad 0.439959
better 0.494448
terrific 0.509023
decent 0.520705
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga20_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
really 0.271877
that's 0.272447
it's 0.285056
pretty 0.294105
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('market', giga5_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>market 0.000000
markets 0.327014
stocks 0.512569
prices 0.516767
sales 0.552022
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('market', giga20_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>market 0.000000
markets 0.175934
prices 0.252851
stocks 0.266186
profits 0.269412
dtype: float64
</code></pre></div></div>
<h3 id="tf-idf">TF-IDF</h3>
<p>Perhaps the best-known reweighting scheme is <strong>Term Frequency–Inverse Document Frequency (TF-IDF)</strong>, which is, I believe, still the backbone of today’s Web search technologies. As the name suggests, it is built from TF and IDF measures:</p>
<p>For an $m \times n$ matrix $X$:</p>
\[\textbf{TF}(X, i, j) = \frac{X_{ij}}{\textbf{colsum}(X, j)}\]
\[\textbf{IDF}(X, i, j) = \log\left(\frac{n}{|\{k : X_{ik} > 0\}|}\right)\]
\[\textbf{TF-IDF}(X, i, j) = \textbf{TF}(X, i, j) \cdot \textbf{IDF}(X, i, j)\]
<p>TF-IDF generally performs best with sparse matrices. It severely punishes words that appear in many documents; if a word appears in every document, then its IDF value is 0. As a result, it can be problematic with very dense word $\times$ word matrices like ours, where most words appear with most other words.</p>
<p>There is an implementation of TF-IDF for dense matrices in <code class="language-plaintext highlighter-rouge">vsm.tfidf</code>.</p>
<p><strong>Important</strong>: <code class="language-plaintext highlighter-rouge">sklearn</code>’s version, <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer">TfidfTransformer</a>, assumes that term frequency (TF) is defined row-wise and document frequency is defined column-wise. That is, it assumes <code class="language-plaintext highlighter-rouge">sklearn</code>’s document $\times$ word basic design, which makes sense for classification tasks, where the design is example $\times$ features. This is the transpose of the way we’ve been thinking.</p>
<pre><code class="language-notebook">def tfidf(df):
    # Inverse document frequencies:
    doccount = float(df.shape[1])
    freqs = df.astype(bool).sum(axis=1)
    idfs = np.log(doccount / freqs)
    idfs[np.isinf(idfs)] = 0.0  # log(0) = 0
    # Term frequencies:
    col_totals = df.sum(axis=0)
    tfs = df / col_totals
    return (tfs.T * idfs).T
</code></pre>
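To see the punishing effect concretely, here is the same computation applied to a tiny hypothetical word $\times$ document count matrix (the function is reproduced so this runs standalone); the word occurring in every document gets TF-IDF 0 everywhere:

```python
import numpy as np
import pandas as pd

def tfidf(df):
    # Same computation as above, for a word x document DataFrame.
    doccount = float(df.shape[1])
    freqs = df.astype(bool).sum(axis=1)
    idfs = np.log(doccount / freqs)
    idfs[np.isinf(idfs)] = 0.0
    col_totals = df.sum(axis=0)
    tfs = df / col_totals
    return (tfs.T * idfs).T

X = pd.DataFrame(
    [[2.0, 0.0],
     [1.0, 1.0]],
    index=['rare', 'everywhere'], columns=['d1', 'd2'])

weighted = tfidf(X)
# 'everywhere' appears in both documents: IDF = log(2/2) = 0,
# so its entire row is zeroed out; 'rare' keeps weight in d1.
assert (weighted.loc['everywhere'] == 0).all()
assert weighted.loc['rare', 'd1'] > 0
```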
<h2 id="subword-information">Subword information</h2>
<p><a href="https://papers.nips.cc/paper/603-word-space">Schütze (1993)</a> pioneered the use of subword information to improve representations by reducing sparsity, thereby increasing the density of connections in a VSM. In recent years, this idea has shown value in numerous contexts.</p>
<p><a href="https://arxiv.org/abs/1607.04606">Bojanowski et al. (2016)</a> (the <a href="https://fasttext.cc">fastText</a> team) explore a particularly straightforward approach to doing this: represent each word as the sum of the representations for the character-level n-grams it contains.</p>
<p>It is simple to derive character-level n-gram representations from our existing VSMs. The function <code class="language-plaintext highlighter-rouge">vsm.ngram_vsm</code> implements the basic step. Here, we create the 4-gram version of <code class="language-plaintext highlighter-rouge">yelp5</code>:</p>
<pre><code class="language-notebook">def ngram_vsm(df, n=2):
    """Create a character-level VSM from `df`.

    Parameters
    ----------
    df : pd.DataFrame
    n : int
        The n-gram size.

    Returns
    -------
    pd.DataFrame
        This will have the same column dimensionality as `df`, but the
        rows will be expanded with representations giving the sum of
        all the original rows in `df` that contain that row's n-gram.

    """
    unigram2vecs = defaultdict(list)
    for w, x in df.iterrows():
        for c in get_character_ngrams(w, n):
            unigram2vecs[c].append(x)
    unigram2vecs = {c: np.array(x).sum(axis=0)
                    for c, x in unigram2vecs.items()}
    cf = pd.DataFrame(unigram2vecs).T
    cf.columns = df.columns
    return cf

def get_character_ngrams(w, n):
    """Map a word to its character-level n-grams, with boundary
    symbols '<w>' and '</w>'.

    Parameters
    ----------
    w : str
    n : int
        The n-gram size.

    Returns
    -------
    list of str

    """
    if n > 1:
        w = ["<w>"] + list(w) + ["</w>"]
    else:
        w = list(w)
    return ["".join(w[i: i+n]) for i in range(len(w)-n+1)]
</code></pre>
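The boundary handling is easiest to see on a short word. Each boundary symbol counts as a single unit, so a 4-letter word yields three 4-grams (the helper is reproduced so the check runs standalone):

```python
def get_character_ngrams(w, n):
    # Add boundary symbols so word-initial and word-final n-grams
    # are distinct from word-internal ones:
    if n > 1:
        w = ["<w>"] + list(w) + ["</w>"]
    else:
        w = list(w)
    return ["".join(w[i: i+n]) for i in range(len(w)-n+1)]

assert get_character_ngrams("good", 4) == ["<w>goo", "good", "ood</w>"]
assert get_character_ngrams("good", 1) == ["g", "o", "o", "d"]
```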
<pre><code class="language-notebook">yelp5_ngrams = ngram_vsm(yelp5, n=4)
</code></pre>
<pre><code class="language-notebook">yelp5_ngrams.shape
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(10382, 6000)
</code></pre></div></div>
<p>This has the same column dimension as <code class="language-plaintext highlighter-rouge">yelp5</code>, but the rows are expanded with all the 4-grams, including the boundary symbols <code class="language-plaintext highlighter-rouge">&lt;w&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;/w&gt;</code>. Here’s a simple function for creating new word representations from the associated character-level ones:</p>
<pre><code class="language-notebook">def character_level_rep(word, cf, n=4):
    ngrams = get_character_ngrams(word, n)
    ngrams = [n for n in ngrams if n in cf.index]
    reps = cf.loc[ngrams].values
    return reps.sum(axis=0)
</code></pre>
<p>Many variations on this are worth trying – including the original word vector where available, changing the aggregation method from <code class="language-plaintext highlighter-rouge">sum</code> to something else, using a real morphological parser instead of just n-grams, and so on.</p>
<p>One very powerful thing about this is that we can represent words that are not even in the original VSM:</p>
<pre><code class="language-notebook">'superbly' in yelp5.index  # 'superbly' is not in the yelp5 vocabulary, but ...
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>False
</code></pre></div></div>
<pre><code class="language-notebook">superbly = character_level_rep("superbly", yelp5_ngrams)  # ... we can still compute a vector for it from the n-gram VSM
</code></pre>
<pre><code class="language-notebook">superb = character_level_rep("superb", yelp5_ngrams)
</code></pre>
<pre><code class="language-notebook">cosine(superb, superbly)  # cosine distance between the two vectors: very small
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.004362871833741844
</code></pre></div></div>
<h2 id="visualization">Visualization</h2>
<ul>
<li>
<p>You can begin to get a feel for what your matrix is like by poking around with <code class="language-plaintext highlighter-rouge">vsm.neighbors</code> to see who is close to or far from whom.</p>
</li>
<li>
<p>It’s very useful to complement this with the more holistic view one can get from looking at a visualization of the entire vector space.</p>
</li>
<li>
<p>Of course, any visualization will have to be much, much lower dimension than our actual VSM, so we need to proceed cautiously, balancing the high-level view with more fine-grained exploration.</p>
</li>
<li>
<p>We won’t have time this term to cover VSM visualization in detail. scikit-learn has a bunch of functions for doing this in <a href="http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifold">sklearn.manifold</a>, and the <a href="http://scikit-learn.org/stable/modules/manifold.html#manifold-learning">user guide</a> for that package is detailed.</p>
</li>
<li>
<p>It’s also worth checking out the online TensorFlow <a href="http://projector.tensorflow.org">Embedding Projector tool</a>, which includes a fast implementation of t-SNE.</p>
</li>
<li>
<p>In addition, <code class="language-plaintext highlighter-rouge">vsm.tsne_viz</code> is a wrapper around <a href="http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE">sklearn.manifold.TSNE</a> that handles the basic preprocessing and layout for you. t-SNE stands for <a href="http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">t-Distributed Stochastic Neighbor Embedding</a>, a powerful method for visualizing high-dimensional vector spaces in 2d. See also <a href="https://lvdmaaten.github.io/multiplemaps/Multiple_maps_t-SNE/Multiple_maps_t-SNE.html">Multiple Maps t-SNE</a>.</p>
</li>
</ul>
<pre><code class="language-notebook">def tsne_viz(df, colors=None, output_filename=None, figsize=(40, 50)):
    """2d plot of `df` using t-SNE, with the points labeled by `df.index`,
    aligned with `colors` (defaults to all black).

    Parameters
    ----------
    df : pd.DataFrame
        The matrix to visualize.
    colors : list of colornames or None (default: None)
        Optional list of colors for the vocab. The color names just
        need to be interpretable by matplotlib. If they are supplied,
        they need to have the same length as `df.index`. If `colors=None`,
        then all the words are displayed in black.
    output_filename : str (default: None)
        If not None, then the output image is written to this location.
        The filename suffix determines the image type. If `None`, then
        `plt.plot()` is called, with the behavior determined by the
        environment.
    figsize : (int, int) (default: (40, 50))
        Default size of the output in display units.

    """
    # Colors:
    vocab = df.index
    if not colors:
        colors = ['black' for i in vocab]
    # Recommended reduction via PCA or similar:
    n_components = 50 if df.shape[1] >= 50 else df.shape[1]
    dimreduce = PCA(n_components=n_components)
    X = dimreduce.fit_transform(df)
    # t-SNE:
    tsne = TSNE(n_components=2, random_state=0)
    tsnemat = tsne.fit_transform(X)
    # Plot values:
    xvals = tsnemat[:, 0]
    yvals = tsnemat[:, 1]
    # Plotting:
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=figsize)
    ax.plot(xvals, yvals, marker='', linestyle='')
    # Text labels:
    for word, x, y, color in zip(vocab, xvals, yvals, colors):
        try:
            ax.annotate(word, (x, y), fontsize=8, color=color)
        except UnicodeDecodeError:  # Python 2 won't cooperate!
            pass
    # Output:
    if output_filename:
        plt.savefig(output_filename, bbox_inches='tight')
    else:
        plt.show()
</code></pre>
<pre><code class="language-notebook">tsne_viz(yelp20_pmi)
</code></pre>
<p><img src="/images/vsm_01_distributional_practice_files/vsm_01_distributional_practice_103_0.png" alt="png" /></p>
chloammeVector-space models: designs, distances, basic reweighting[번역] BERT를 처음 사용하기 위한 시각적 가이드2021-12-22T00:00:00+00:002021-12-22T00:00:00+00:00/2021/12/22/a-visual-guide-to-using-bert-for-the-first-time-korean<div class="tooltip">
<blockquote>
<p>이 글은 <a href="https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/">Jay Alammar님의 글</a>을 번역한 글입니다. [<a href="#additional-info">추가정보</a>]
<span class="tooltiptext">
This post is a translated version of <a href="https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/">A Visual Guide to Using BERT for the First Time</a> by Jay Alammar.
</span></p>
</blockquote>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-sentence-classification.png" />
<br />
</div>
<div class="tooltip">
<p>지난 몇년간 언어를 처리하는 머신러닝 모델의 발전이 빠르게 가속화되었습니다. 이런 발전은 연구실을 떠나 주요 디지털 제품 중 일부에 힘을 공급하기 시작했습니다. 이에 대한 좋은 예는 <a href="https://www.blog.google/products/search/search-language-understanding-bert/">BERT 모델이 이제 구글 검색의 원동력이 되었다는 최근 발표</a>입니다. 구글은 이 단계 (또는 검색에 적용한 자연어 이해의 발전)가 “지난 5년간의 가장 큰 도약이자, 검색 역사상 가장 큰 도약 중 하나”를 의미한다고 믿습니다.
<span class="tooltiptext">
Progress has been rapidly accelerating in machine learning models that process language over the last couple of years. This progress has left the research lab and started powering some of the leading digital products. A great example of this is the <a href="https://www.blog.google/products/search/search-language-understanding-bert/">recent announcement of how the BERT model is now a major force behind Google Search</a>. Google believes this step (or progress in natural language understanding as applied in search) represents “the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search”.
</span></p>
</div>
<div class="tooltip">
<p>이번 포스트에서는 BERT의 변형을 이용하여 문장을 분류하는 간단한 튜토리얼을 다루겠습니다. 이 예제는 첫 입문으로는 충분히 기본적이면서도, 관련된 주요 개념 중 일부를 보여줄 만큼은 충분히 고급입니다.
<span class="tooltiptext">
This post is a simple tutorial for how to use a variant of BERT to classify sentences. This is an example that is basic enough as a first intro, yet advanced enough to showcase some of the key concepts involved.
</span></p>
</div>
<div class="tooltip">
<p>이 포스트와 함께 notebook도 준비했습니다. <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">notebook</a>을 보거나 또는 <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">colab에서 실행</a> 해보실 수 있습니다.
<span class="tooltiptext">
Alongside this post, I’ve prepared a notebook. You can see it here <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">the notebook</a> or <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">run it on colab</a>.
</span></p>
</div>
<!--more-->
<div class="tooltip">
<h2 id="sst2">데이터셋: SST2</h2>
<p><span class="tooltiptext">
Dataset: SST2
</span></p>
</div>
<div class="tooltip">
<p>이 예제에서 사용할 데이터셋은 <a href="https://nlp.stanford.edu/sentiment/index.html">SST2</a>이고, 영화 리뷰에서 가져온 문장들로 구성되어 있으며, 각 문장은 긍정(값 1) 또는 부정(값 0)으로 레이블링되어 있습니다:
<span class="tooltiptext">
The dataset we will use in this example is <a href="https://nlp.stanford.edu/sentiment/index.html">SST2</a>, which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):
</span></p>
</div>
<table class="features-table">
<tr>
<th class="mdc-text-light-green-600">
sentence (문장)
</th>
<th class="mdc-text-purple-600">
label (레이블)
</th>
</tr>
<tr>
<td class="mdc-bg-light-green-50" style="text-align:left">
a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
</td>
<td class="mdc-bg-purple-50">
1
</td>
</tr>
<tr>
<td class="mdc-bg-light-green-50" style="text-align:left">
apparently reassembled from the cutting room floor of any given daytime soap
</td>
<td class="mdc-bg-purple-50">
0
</td>
</tr>
<tr>
<td class="mdc-bg-light-green-50" style="text-align:left">
they presume their audience won't sit still for a sociology lesson
</td>
<td class="mdc-bg-purple-50">
0
</td>
</tr>
<tr>
<td class="mdc-bg-light-green-50" style="text-align:left">
this is a visually stunning rumination on love , memory , history and the war between art and commerce
</td>
<td class="mdc-bg-purple-50">
1
</td>
</tr>
<tr>
<td class="mdc-bg-light-green-50" style="text-align:left">
jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
</td>
<td class="mdc-bg-purple-50">
1
</td>
</tr>
</table>
<div class="tooltip">
<h2 id="section">모델: 문장 감정 분류</h2>
<p><span class="tooltiptext">
Models: Sentence Sentiment Classification
</span></p>
</div>
<div class="tooltip">
<p>우리의 목표는 (우리의 데이터셋에 있는 것과 같은) 문장을 받아서 (긍정적 감정을 가진 문장을 의미하는) 1 또는 (부정적 감정을 가진 문장을 의미하는) 0을 생성하는 모델을 만드는 것입니다. 다음과 같이 생겼다고 생각할 수 있습니다:
<span class="tooltiptext">
Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/sentiment-classifier-1.png" />
<br />
</div>
<div class="tooltip">
<p>모델은 사실 내부적으로 두개의 모델로 구성되어 있습니다.
<span class="tooltiptext">
Under the hood, the model is actually made up of two models.
</span></p>
</div>
<div class="tooltip">
<ul>
<li><a href="https://medium.com/huggingface/distilbert-8cf3380435b5">DistilBERT</a>는 문장을 처리하고, 문장에서 추출한 몇가지 정보를 다음 모델에 전달합니다. DistilBERT는 <a href="https://huggingface.co/">HuggingFace</a> 팀에서 개발하고 오픈소스로 제공하는 BERT의 작은 버전입니다. 성능은 BERT와 대략 비슷하지만 가볍고 빠른 버전입니다.</li>
<li>다음 모델인 scikit learn의 기본 Logistic Regression 모델은 DistilBERT 처리 결과를 받아 문장을 긍정 또는 부정 (각각 1 또는 0)으로 분류합니다.
<span class="tooltiptext">
<span>*</span> <a href="https://medium.com/huggingface/distilbert-8cf3380435b5">DistilBERT</a> processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at <a href="https://huggingface.co/">HuggingFace</a>. It’s a lighter and faster version of BERT that roughly matches its performance.
<span>*</span> The next model, a basic Logistic Regression model from scikit learn will take in the result of DistilBERT’s processing, and classify the sentence as either positive or negative (1 or 0, respectively).
</span></li>
</ul>
</div>
<div class="tooltip">
<p>두 모델 간 전달하는 데이터는 768 크기의 벡터입니다. 벡터를 분류하려는 문장에 대한 임베딩으로 생각할 수 있습니다.
<span class="tooltiptext">
The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the sentence that we can use for classification.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/distilbert-bert-sentiment-classifier.png" />
<br />
</div>
<div class="tooltip">
<p>제 이전 글인 <a href="https://jalammar.github.io/illustrated-bert/">Illustrated BERT</a>를 읽으셨다면, 이 벡터가 ([CLS] 토큰을 입력으로 받는) 첫번째 위치의 결과임을 아실 수 있습니다.
<span class="tooltiptext">
If you’ve read my previous post, <a href="https://jalammar.github.io/illustrated-bert/">Illustrated BERT</a>, this vector is the result of the first position (which receives the [CLS] token as input).
</span></p>
</div>
<div class="tooltip">
<h2 id="section-1">모델 학습</h2>
<p><span class="tooltiptext">
Model Training
</span></p>
</div>
<div class="tooltip">
<p>두개의 모델을 사용하지만, logistic regression 모델만을 학습시킬 것입니다. DistilBERT의 경우 이미 pre-train되었고 영어를 이해할 수 있는 모델을 사용할 것입니다. 하지만 이 모델은 문장 분류를 위해 fine-tune되지 않았습니다. 그러나 우리는 BERT가 훈련된 일반적인 objectives에서 일부 문장 분류 능력을 얻습니다. 이 것은 특히 첫번째 위치에 대한 BERT의 출력([CLS] 토큰과 관련됨)이 그렇습니다. BERT의 두번째 훈련 objective(다음 문장 분류) 때문에 이 능력을 약간 갖게 되었다고 생각합니다. 이 objective는 첫번째 위치의 출력에 문장 전체의 의미를 캡슐화하도록 모델을 훈련시키는 것 같습니다. <a href="https://github.com/huggingface/transformers">transformers</a> 라이브러리는 pretrain된 버전의 모델 뿐만 아니라 DistilBERT의 구현도 제공합니다.
<span class="tooltiptext">
While we’ll be using two models, we will only train the logistic regression model. For DistilBERT, we’ll use a model that’s already pre-trained and has a grasp on the English language. This model, however, is neither trained nor fine-tuned to do sentence classification. We get some sentence classification capability, however, from the general objectives BERT is trained on. This is especially the case with BERT’s output for the first position (associated with the [CLS] token). I believe that’s due to BERT’s second training objective – next sentence classification. That objective seemingly trains the model to encapsulate a sentence-wide sense in the output at the first position. The <a href="https://github.com/huggingface/transformers">transformers</a> library provides us with an implementation of DistilBERT as well as pretrained versions of the model.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/model-training.png" />
<br />
</div>
<div class="tooltip">
<h2 id="section-2">튜토리얼 개요</h2>
<p><span class="tooltiptext">
Tutorial Overview
</span></p>
</div>
<div class="tooltip">
<p>이 튜토리얼의 전략은 다음과 같습니다. 우리는 먼저 2,000 문장에 대한 문장 임베딩을 생성하기 위해 훈련된 distilBERT를 사용할 것입니다.
<span class="tooltiptext">
So here’s the game plan with this tutorial. We will first use the trained distilBERT to generate sentence embeddings for 2,000 sentences.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />
<br />
</div>
<div class="tooltip">
<p>이 단계 이후로는 distilBERT를 건드리지 않습니다. 여기부터는 모두 Scikit Learn입니다. 이 데이터셋에 대해 일반적인 train/test 분할을 수행합니다.
<span class="tooltiptext">
We will not touch distilBERT after this step. It’s all Scikit Learn from here. We do the usual train/test split on this dataset:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />
<br />
<div class="tooltip">
<p>distilBert(모델 #1)의 출력에 대한 Train/test 분할은, logistric regression(모델 #2)을 훈련하고 평가할 데이터셋을 생성합니다. 현실에서는, sklearn의 train/test split이 분할 전에 example들을 섞기 때문에, 데이터셋에서 위쪽 75%의 example들을 (훈련에) 사용하는 것은 아닙니다. (이 포스팅에서 example은 dataset을 구성하고 있는 문장-레이블 조합을 의미합니다. 즉, dataset의 한 row 입니다.)
<span class="tooltiptext">
Train/test split for the output of distilBert (model #1) creates the dataset we’ll train and evaluate logistic regression on (model #2). Note that in reality, sklearn’s train/test split shuffles the examples before making the split, it doesn’t just take the first 75% of examples as they appear in the dataset.
</span></p>
</div>
</div>
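<p>본문에서 말한 셔플 동작은 작은 가짜 데이터로 직접 확인해볼 수 있습니다. 아래는 sklearn의 <code>train_test_split</code> 기본 동작(기본값 <code>test_size=0.25</code>, 분할 전 셔플)을 보여주는 간단한 스케치이며, 데이터와 변수명은 설명을 위한 예시입니다:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 0..7 인덱스를 가진 가짜 example 8개 / 8 dummy examples
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# 기본값 test_size=0.25 이므로 75%가 train으로 갑니다.
# random_state를 고정해도 분할 전에 셔플은 그대로 일어납니다.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(len(X_train), len(X_test))  # 6 2
print(X_train.ravel())            # 데이터셋의 "앞쪽 75%"가 그대로 오지 않음
```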
<div class="tooltip">
<p>그 다음 우리는 train set을 가지고 logistic regression 모델을 훈련합니다:
<span class="tooltiptext">
Then we train the logistic regression model on the training set:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-training-logistic-regression.png" />
<br />
</div>
<div class="tooltip">
<h2 id="section-3">(단일) 예측을 계산하는 방법</h2>
<p><span class="tooltiptext">
How a single prediction is calculated
</span></p>
</div>
<div class="tooltip">
<p>code로 들어가서 어떻게 모델이 훈련되는지 설명하기 전에, 훈련된 모델이 예측을 어떻게 계산하는지 알아봅시다.
<span class="tooltiptext">
Before we dig into the code and explain how to train the model, let’s look at how a trained model calculates its prediction.
</span></p>
</div>
<div class="tooltip">
<p>“a visually stunning rumination on love”(사랑에 대한 시각적으로 굉장히 놀라운 반추)라는 문장을 분류해봅시다. 첫번째 단계는 BERT 토크나이저를 사용하여 단어를 토큰으로 나누는 것입니다. 그 다음, 문장 분류를 위해 필요한 스페셜 토큰(첫번째 위치에 [CLS], 문장의 끝에 [SEP]이 있음)을 추가합니다.
<span class="tooltiptext">
Let’s try to classify the sentence “a visually stunning rumination on love”. The first step is to use the BERT tokenizer to first split the word into tokens. Then, we add the special tokens needed for sentence classifications (these are [CLS] at the first position, and [SEP] at the end of the sentence).
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-tokenization-1.png" />
<br />
</div>
<div class="tooltip">
<p>토크나이저가 하는 세번째 단계는 각 토큰을 훈련된 모델의 일부분인 임베딩 테이블을 이용하여 id로 치환하는 것입니다. 단어 임베딩에 대한 배경정보는 <a href="https://jalammar.github.io/illustrated-word2vec/">The Illustrated Word2vec</a>를 읽어보세요.
<span class="tooltiptext">
The third step the tokenizer does is to replace each token with its id from the embedding table which is a component we get with the trained model. Read <a href="https://jalammar.github.io/illustrated-word2vec/">The Illustrated Word2vec</a> for a background on word embeddings.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />
<br />
</div>
<div class="tooltip">
<p>토크나이저는 이 모든 과정을 한줄의 코드로 수행합니다:
<span class="tooltiptext">
Note that the tokenizer does all these steps in a single line of code:
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"a visually stunning rumination on love"</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>이제 우리의 입력 문장은 DistilBERT로 전달되기에 적합한 shape이 되었습니다.
<span class="tooltiptext">
Our input sentence is now the proper shape to be passed to DistilBERT.
</span></p>
</div>
<div class="tooltip">
<p>만약 <a href="https://jalammar.github.io/illustrated-bert/">Illustrated BERT</a>를 읽으셨다면, 이 단계 또한 아래와 같이 시각화할 수 있습니다.
<span class="tooltiptext">
If you’ve read <a href="https://jalammar.github.io/illustrated-bert/">Illustrated BERT</a>, this step can also be visualized in this manner:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-input-tokenization.png" />
<br />
</div>
<div class="tooltip">
<h2 id="distilbert--">DistilBERT 통과하기</h2>
<p><span class="tooltiptext">
Flowing Through DistilBERT
</span></p>
</div>
<div class="tooltip">
<p>입력 벡터를 DistilBERT로 전달하는 것은 <a href="https://jalammar.github.io/illustrated-bert/">BERT</a>와 동일합니다. 출력은 각 입력 토큰에 대한 벡터가 됩니다. 각 벡터는 768개의 숫자(float)로 구성됩니다.
<span class="tooltiptext">
Passing the input vector through DistilBERT works <a href="https://jalammar.github.io/illustrated-bert/">just like BERT</a>. The output would be a vector for each input token. Each vector is made up of 768 numbers (floats).
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-model-input-output-1.png" />
<br />
</div>
<div class="tooltip">
<p>문장 분류 task이므로, 첫번째 벡터([CLS] 토큰과 관련된 것)를 제외하고 모두 무시합니다. 이 하나의 벡터를 logistic regression 모델의 입력으로 전달합니다.
<span class="tooltiptext">
Because this is a sentence classification task, we ignore all except the first vector (the one associated with the [CLS] token). The one vector we pass as the input to the logistic regression model.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-model-calssification-output-vector-cls.png" />
<br />
</div>
<div class="tooltip">
<p>여기에서, 훈련 단계에서 배운 것을 기반으로 이 벡터를 분류하는 것이 logistic regression 모델이 해야하는 일 입니다. 예측 계산은 다음과 같습니다:
<span class="tooltiptext">
From here, it’s the logistic regression model’s job to classify this vector based on what it learned from its training phase. We can think of a prediction calculation as looking like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-sentence-classification-example.png" />
<br />
</div>
<div class="tooltip">
<p>훈련은 전체적인 프로세스의 코드와 함께 다음 섹션에서 알아보겠습니다.
<span class="tooltiptext">
The training is what we’ll discuss in the next section, along with the code of the entire process.
</span></p>
</div>
<p>
<div class="tooltip">
<h2 id="section-4">코드</h2>
<p><span class="tooltiptext">
The Code
</span></p>
</div>
</p>
<div class="tooltip">
<p>이번 섹션에서는 문장 분류 모델을 훈련하기 위한 코드를 살펴보겠습니다. 모든 코드를 포함하는 notebook은 <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">colab</a>과 <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">github</a>에서 확인하실 수 있습니다.
<span class="tooltiptext">
In this section we’ll highlight the code to train this sentence classification model. A notebook containing all this code is available on <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">colab</a> and <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">github</a>.
</span></p>
</div>
<div class="tooltip">
<p>import 부터 시작해봅시다
<span class="tooltiptext">
Let’s start by importing the tools of the trade
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">transformers</span> <span class="k">as</span> <span class="n">ppb</span> <span class="c1"># pytorch transformers
</span><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
</code></pre></div></div>
<div class="tooltip">
<p>데이터셋은 <a href="https://github.com/clairett/pytorch-sentiment-classification/">이 github</a>에서 파일로 있으므로, pandas dataframe으로 바로 가져오기만 하면 됩니다.
<span class="tooltiptext">
The dataset is <a href="https://github.com/clairett/pytorch-sentiment-classification/">available</a> as a file on github, so we just import it directly into a pandas dataframe
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">'</span><span class="se">\t</span><span class="s">'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>dataframe의 첫 5행을 읽어서 데이터가 어떤 모양인지 확인하기위해 df.head()를 사용할 수 있습니다.
<span class="tooltiptext">
We can use df.head() to look at the first five rows of the dataframe to see how the data looks.
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div class="tooltip">
<p>출력:
<span class="tooltiptext">
Which outputs:
</span></p>
<div class="img-div-any-width">
<image src="/images/distilBERT/sst2-df-head.png" />
<br />
</div>
</div>
<div class="tooltip">
<h3 id="pre-train-distilbert----import-">pre-train된 DistilBERT 모델 및 토크나이저 import 하기</h3>
<p><span class="tooltiptext">
Importing pre-trained DistilBERT model and tokenizer
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_class</span><span class="p">,</span> <span class="n">tokenizer_class</span><span class="p">,</span> <span class="n">pretrained_weights</span> <span class="o">=</span> <span class="p">(</span><span class="n">ppb</span><span class="p">.</span><span class="n">DistilBertModel</span><span class="p">,</span> <span class="n">ppb</span><span class="p">.</span><span class="n">DistilBertTokenizer</span><span class="p">,</span> <span class="s">'distilbert-base-uncased'</span><span class="p">)</span>
<span class="c1">## Want BERT instead of distilBERT? Uncomment the following line:
## distilBERT 대신 BERT를 사용하고 싶으신가요? 그럼 아래의 주석처리를 해제하세요:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
</span>
<span class="c1"># Load pretrained model/tokenizer
# pretrain된 model/tokenizer를 로딩하기
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">tokenizer_class</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_weights</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model_class</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_weights</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>이제 데이터셋을 토큰화할 수 있습니다. 이전 예제와는 조금 다른 것들을 해보려고 합니다. 위 예제에서는 한 문장에 대해서만 토큰화하고 처리를 했습니다. 지금은, 모든 문장을 배치로 묶어서 토큰화하고 처리하겠습니다 (notebook은 리소스 제약으로 작은 그룹의 example들만 처리합니다. 2000개 example이라고 가정해봅시다).
<span class="tooltiptext">
We can now tokenize the dataset. Note that we’re going to do things a little differently here from the example above. The example above tokenized and processed only one sentence. Here, we’ll tokenize and process all sentences together as a batch (the notebook processes a smaller group of examples just for resource considerations, let’s say 2000 examples).
</span></p>
</div>
<div class="tooltip">
<h3 id="section-5">토큰화하기</h3>
<p><span class="tooltiptext">
Tokenization
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenized</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nb">apply</span><span class="p">((</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">True</span><span class="p">)))</span>
</code></pre></div></div>
<div class="tooltip">
<p>이 코드는 모든 문장을 id로 바꿉니다.
<span class="tooltiptext">
This turns every sentence into a list of ids.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/sst2-text-to-tokenized-ids-bert-example.png" />
<br />
</div>
<div class="tooltip">
<p>데이터셋은 현재 리스트들의 리스트(또는 pandas의 Series/DataFrame)입니다. DistilBERT가 이것을 입력으로 처리하려면, 짧은 문장에 토큰 id 0을 패딩(padding)하여 모든 벡터를 같은 크기로 만들어야 합니다. 패딩 단계는 notebook을 참고하세요. 기본적인 파이썬 문자열(string) 및 배열(array) 조작(manipulation)입니다.
<span class="tooltiptext">
The dataset is currently a list (or pandas Series/DataFrame) of lists. Before DistilBERT can process this as input, we’ll need to make all the vectors the same size by padding shorter sentences with the token id 0. You can refer to the notebook for the padding step, it’s basic python string and array manipulation.
</span></p>
</div>
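<p>패딩의 기본 아이디어는 notebook 없이도 다음과 같이 스케치할 수 있습니다. 가장 긴 sequence 길이를 찾아, 짧은 문장 뒤에 토큰 id 0을 붙입니다 (아래 토큰 id 값들은 설명을 위한 가짜 값입니다):</p>

```python
import numpy as np

# 위의 `tokenized`처럼 길이가 제각각인 토큰 id 리스트들 (가짜 값)
tokenized = [[101, 1037, 2298, 102], [101, 2307, 102]]

# 가장 긴 sequence 길이를 찾아, 모자란 만큼 뒤에 0을 패딩
max_len = max(len(ids) for ids in tokenized)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])

print(padded.shape)  # (2, 4)
```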
<div class="tooltip">
<p>패딩 이후에, BERT에 전달할 준비가 된 행렬/텐서가 있습니다.
<span class="tooltiptext">
After the padding, we have a matrix/tensor that is ready to be passed to BERT:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-input-tensor.png" />
<br />
</div>
<p>
<div class="tooltip">
<h3 id="distilbert-">DistilBERT로 처리하기</h3>
<p><span class="tooltiptext">
Processing with DistilBERT
</span></p>
</div>
</p>
<div class="tooltip">
<p>패딩된 토큰 행렬에서 입력 텐서를 생성하여 DistilBERT로 전달합니다.
<span class="tooltiptext">
We now create an input tensor out of the padded token matrix, and send that to DistilBERT
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">padded</span><span class="p">))</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">last_hidden_states</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>이 단계가 실행된 후 <code class="language-plaintext highlighter-rouge">last_hidden_states</code>는 DistilBERT의 출력을 가지고 있습니다. 이 것은 (example 수, sequence에서 최대 토큰 수, DistilBERT 모델에서 hidden unit 수) shape을 가진 tuple입니다. 우리의 경우에, 2000 (우리 스스로 2000개 example로 제한했으므로), 66 (2000 example로 부터 가장 긴 sequence의 토큰의 개수), 768 (DistilBERT 모델의 hidden unit 수) 입니다.
<span class="tooltiptext">
After running this step, <code class="language-plaintext highlighter-rouge">last_hidden_states</code> holds the outputs of DistilBERT. It is a tuple with the shape (number of examples, max number of tokens in the sequence, number of hidden units in the DistilBERT model). In our case, this will be 2000 (since we only limited ourselves to 2000 examples), 66 (which is the number of tokens in the longest sequence from the 2000 examples), 768 (the number of hidden units in the DistilBERT model).
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-output-tensor-predictions.png" />
<br />
</div>
<div class="tooltip">
<h3 id="bert---">BERT 출력 텐서 펼쳐보기</h3>
<p><span class="tooltiptext">
Unpacking the BERT output tensor
</span></p>
</div>
<div class="tooltip">
<p>이 3-차원 출력 텐서를 펼쳐서 보겠습니다. 이 것의 차원을 확인하는 것으로부터 시작할 수 있습니다.
<span class="tooltiptext">
Let’s unpack this 3-d output tensor. We can first start by examining its dimensions:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-output-tensor.png" />
<br />
</div>
<div class="tooltip">
<h3 id="section-6">문장의 처리과정 요약</h3>
<p><span class="tooltiptext">
Recapping a sentence’s journey
</span></p>
</div>
<div class="tooltip">
<p>각 행은 우리 데이터셋으로 부터 문장과 관련이 있습니다. 첫번째 문장의 처리 경로를 요약하면 아래와 같습니다:
<span class="tooltiptext">
Each row is associated with a sentence from our dataset. To recap the processing path of the first sentence, we can think of it as looking like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-input-to-output-tensor-recap.png" />
<br />
</div>
<div class="tooltip">
<h3 id="section-7">필요한 부분만 자르기</h3>
<p><span class="tooltiptext">
Slicing the important part
</span></p>
</div>
<div class="tooltip">
<p>문장 분류를 위해, 우리는 [CLS] 토큰에 대한 BERT의 출력에만 관심이 있습니다. 그래서, 큐브의 해당 조각만 취하고 나머지는 버립니다.
<span class="tooltiptext">
For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything else.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-output-tensor-selection.png" />
<br />
</div>
<div class="tooltip">
<p>3d 텐서에서 우리가 관심있는 2d 텐서를 얻기 위해 자르는 방법입니다:
<span class="tooltiptext">
This is how we slice that 3d tensor to get the 2d tensor we’re interested in:
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1"># Slice the output for the first position for all the sequences, take all hidden unit outputs
</span><span class="n">features</span> <span class="o">=</span> <span class="n">last_hidden_states</span><span class="p">[</span><span class="mi">0</span><span class="p">][:,</span><span class="mi">0</span><span class="p">,:].</span><span class="n">numpy</span><span class="p">()</span>
</code></pre></div></div>
<div class="tooltip">
<p>이제 <code class="language-plaintext highlighter-rouge">features</code>는 데이터셋 모든 문장의 임베딩을 포함하고 있는 2차원 NumPy 배열입니다.
<span class="tooltiptext">
And now <code class="language-plaintext highlighter-rouge">features</code> is a 2d numpy array containing the sentence embeddings of all the sentences in our dataset.
</span></p>
</div>
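<p>이 슬라이싱이 shape을 어떻게 바꾸는지는, 모델을 실제로 실행하지 않고도 본문에서 말한 (2000, 66, 768) shape의 가짜 배열로 확인할 수 있습니다:</p>

```python
import numpy as np

# DistilBERT 출력과 같은 shape의 가짜 텐서
# (example 수, sequence 최대 토큰 수, hidden unit 수)
dummy_output = np.zeros((2000, 66, 768))

# 모든 example에 대해 첫번째 위치([CLS])의 출력만 슬라이스
features = dummy_output[:, 0, :]

print(features.shape)  # (2000, 768)
```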
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-output-cls-senteence-embeddings.png" />
<br />
<div class="tooltip">
<p>BERT의 출력에서 slice한 텐서
<span class="tooltiptext">
The tensor we sliced from BERT’s output
</span></p>
</div>
</div>
<div class="tooltip">
<h2 id="logistic-regression--">Logistic Regression을 위한 데이터셋</h2>
<p><span class="tooltiptext">
Dataset for Logistic Regression
</span></p>
</div>
<div class="tooltip">
<p>이제 BERT의 출력을 얻었으니, logistic regression 모델을 훈련하는데 필요한 데이터셋이 조립되었습니다. 768개의 열이 feature가 되고, label은 초기 데이터셋에서 그대로 가져옵니다.
<span class="tooltiptext">
Now that we have the output of BERT, we have assembled the dataset we need to train our logistic regression model. The 768 columns are the features, and the labels we just get from our initial dataset.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/logistic-regression-dataset-features-labels.png" />
<br />
<div class="tooltip">
Logistic Regression을 훈련하는데 사용하는 레이블링된 데이터셋입니다. feature는, 이전 그림에서 슬라이스 했던 (위치 #0) [CLS] 토큰에 대한 BERT의 출력 벡터입니다. 각 행은 데이터셋의 문장에 해당하며, 각 열은 Bert/DistilBERT 모델의 가장 상단 transformer block에 있는 feed-forward neural network의 hidden unit의 출력에 해당합니다.
<span class="tooltiptext">
The labeled dataset we use to train the Logistic Regression. The features are the output vectors of BERT for the [CLS] token (position #0) that we sliced in the previous figure. Each row corresponds to a sentence in our dataset, each column corresponds to the output of a hidden unit from the feed-forward neural network at the top transformer block of the Bert/DistilBERT model.
</span>
</div>
</div>
<div class="tooltip">
<p>머신러닝에서의 전통적인 train/test split을 한 뒤, Logistic Regression 모델을 선언하고, 데이터셋에 대해 훈련할 수 있습니다.
<span class="tooltiptext">
After doing the traditional train/test split of machine learning, we can declare our Logistic Regression model and train it against the dataset.
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">labels</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">train_features</span><span class="p">,</span> <span class="n">test_features</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">,</span> <span class="n">test_labels</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>데이터셋을 training set과 testing set으로 분할합니다:
<span class="tooltiptext">
Which splits the dataset into training/testing sets:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />
<br />
</div>
<div class="tooltip">
<p>그 다음, training set에 대해 Logistic Regression 모델을 훈련합니다.
<span class="tooltiptext">
Next, we train the Logistic Regression model on the training set.
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lr_clf</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">()</span>
<span class="n">lr_clf</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_features</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>이제 모델이 훈련됐고, test set에 대한 점수(score)를 계산할 수 있습니다:
<span class="tooltiptext">
Now that the model is trained, we can score it against the test set:
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lr_clf</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">test_features</span><span class="p">,</span> <span class="n">test_labels</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>모델이 약 81%의 정확도(accuracy)를 달성했음을 보여줍니다.
<span class="tooltiptext">
Which shows the model achieves around 81% accuracy.
</span></p>
</div>
<p>
<div class="tooltip">
<h2 id="section-8">스코어 벤치마크</h2>
<p><span class="tooltiptext">
Score Benchmarks
</span></p>
</div>
</p>
<div class="tooltip">
<p>참고로, 이 데이터셋에 대한 최고 정확도 스코어는 <strong>96.8</strong>이었습니다. DistilBERT는 이 task에 대해 score를 개선하기 위해 훈련될 수 있습니다 – 이 프로세스는 fine-tuning이라고 부르며, BERT의 weight를 문장 분류(<em>downstream task</em>라고 부를 수 있음)에서 더 좋은 성능을 달성하도록 업데이트합니다. fine-tune된 DistilBERT는 <strong>90.7</strong>의 정확도를 달성하는 것으로 나왔습니다. full size BERT 모델은 <strong>94.9</strong>를 달성했습니다.
<span class="tooltiptext">
For reference, the highest accuracy score for this dataset is currently <strong>96.8</strong>. DistilBERT can be trained to improve its score on this task – a process called fine-tuning which updates BERT’s weights to make it achieve a better performance in the sentence classification (which we can call the <em>downstream task</em>). The fine-tuned DistilBERT turns out to achieve an accuracy score of <strong>90.7</strong>. The full size BERT model achieves <strong>94.9</strong>.
</span></p>
</div>
<p>
<div class="tooltip">
<h2 id="notebook-">Notebook 실습</h2>
<p><span class="tooltiptext">
The Notebook
</span></p>
</div>
</p>
<div class="tooltip">
<p>이 <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">notebook</a> 또는 <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">colab에서 실행</a>하여 바로 확인해보세요.
<span class="tooltiptext">
Dive right into <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">the notebook</a> or <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">run it on colab</a>.
</span></p>
</div>
<div class="tooltip">
<p>해냈습니다! BERT와의 좋은 첫 만남이었습니다. 다음 단계는 문서를 참고하여 <a href="https://huggingface.co/transformers/examples.html#glue">fine-tuning</a>을 직접 해보는 것입니다. 돌아가서 distilBERT를 BERT로 바꿔서 어떻게 동작하는지 확인해볼 수도 있습니다.
<span class="tooltiptext">
And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at <a href="https://huggingface.co/transformers/examples.html#glue">fine-tuning</a>. You can also go back and switch from distilBERT to BERT and see how that works.
</span></p>
</div>
<div class="tooltip">
<p>이 튜토리얼의 초기 버전에 대한 피드백을 주신 <a href="https://twitter.com/ClementDelangue">Clément Delangue</a>, <a href="https://twitter.com/SanhEstPasMoi">Victor Sanh</a>, Huggingface 팀 분들께 감사합니다.
<span class="tooltiptext">
Thanks to <a href="https://twitter.com/ClementDelangue">Clément Delangue</a>, <a href="https://twitter.com/SanhEstPasMoi">Victor Sanh</a>, and the Huggingface team for providing feedback to earlier versions of this tutorial.
</span></p>
</div>
<hr />
<h2 id="추가-정보">추가 정보<a href="#additional-info" name="additional-info">.</a></h2>
<ul>
<li>이 글은 BERT를 처음 사용하는 방법에 대해 이해하기 쉽게 그림으로 설명한 포스팅을 저자인 Jay Alammar님의 허락을 받고 번역한 글 입니다. 원문은 <a href="https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/">A Visual Guide to Using BERT for the First Time</a>에서 확인하실 수 있습니다.</li>
<li>원서/영문블로그를 보실 때 term에 대한 정보 호환을 위해, 이 분야에서 사용하고 있는 단어, 문구에 대해 가급적 번역하지 않고 원문 그대로 두었습니다. 그리고, 직역 보다는 개념이나 의미에 대한 설명을 쉽게 하는 문장 쪽으로 더 무게를 두어 번역 했습니다. 번역에 대한 의견이나 수정 사항은 아래 댓글 창에 남겨주세요.</li>
<li>번역문에 대응하는 영어 원문을 보고싶으신 분들을 위해 <a href="https://nlpinkorean.github.io">찬</a>님께서 만들어두신 툴팁 도움말 기능(해당 문단에 마우스를 올리면 (모바일의 경우 터치) 원문을 확인할 수 있는 기능)을 가져와서 적용했습니다. 감사합니다.</li>
</ul>chloamme이 글은 Jay Alammar님의 글을 번역한 글입니다. [추가정보] This post is a translated version of A Visual Guide to Using BERT for the First Time by Jay Alammar.[번역] NumPy 및 데이터 표현에 대한 시각적 소개2021-12-20T00:00:00+00:002021-12-20T00:00:00+00:00/2021/12/20/visual-numpy-korean<div class="tooltip">
<blockquote>
<p>이 글은 <a href="https://jalammar.github.io/visual-numpy/">Jay Alammar님의 글</a>을 번역한 글입니다. [<a href="#additional-info">추가정보</a>]
<span class="tooltiptext">
This post is a translated version of <a href="https://jalammar.github.io/visual-numpy/">A Visual Intro to NumPy and Data Representation</a> by Jay Alammar.
</span></p>
</blockquote>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array.png" />
<br />
</div>
<div class="tooltip">
<p><a href="https://www.numpy.org/">NumPy</a> 패키지는 파이썬 생태계에서 데이터 분석, 머신러닝, 과학적 컴퓨팅의 핵심 요소입니다. 벡터(vector)와 행렬(matrix)의 조작과 크런칭(대량 고속 처리)을 엄청나게 단순화시킵니다. (scikit-learn, SciPy, pandas, tensorflow 등) 파이썬의 주요 패키지는 NumPy를 인프라스트럭쳐의 기본/근본 부분으로 사용합니다. 숫자 데이터를 쪼개어 분석하는 능력 외에도 NumPy를 마스터하는 것은 이러한 라이브러리의 고급 활용 시 처리하거나 디버깅할 때에 우위를 확보할 수 있습니다.
<span class="tooltiptext">
The <a href="https://www.numpy.org/">NumPy</a> package is the workhorse of data analysis, machine learning, and scientific computing in the python ecosystem. It vastly simplifies manipulating and crunching vectors and matrices. Some of python’s leading packages rely on NumPy as a fundamental piece of their infrastructure (examples include scikit-learn, SciPy, pandas, and tensorflow). Beyond the ability to slice and dice numeric data, mastering NumPy will give you an edge when dealing with and debugging advanced use cases in these libraries.
</span></p>
</div>
<div class="tooltip">
<p>이번 포스팅에서는 NumPy를 사용하기 위한 주요 방법들을 살펴보고, 머신러닝 모델에 넣기 전에 (테이블, 이미지, 텍스트 등) 다른 데이터 타입을 표현하는 방법에 대해 알아보겠습니다.
<span class="tooltiptext">
In this post, we’ll look at some of the main ways to use NumPy and how it can represent different types of data (tables, images, text…etc) before we can serve them to machine learning models.
</span></p>
</div>
<!--more-->
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
</code></pre></div></div>
<h2 id="creating-arrays-배열-생성">Creating Arrays (배열 생성)</h2>
<div class="tooltip">
<p>파이썬 리스트를 <code class="language-plaintext highlighter-rouge">np.array()</code>에 전달하여, NumPy 배열(즉, 강력한 <a href="https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html">ndarray</a>)을 생성할 수 있습니다. 이 케이스에서 파이썬은 오른쪽에 보이는 배열을 생성합니다.
<span class="tooltiptext">
We can create a NumPy array (a.k.a. the mighty <a href="https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html">ndarray</a>) by passing a python list to it and using <code class="language-plaintext highlighter-rouge">np.array()</code>. In this case, python creates the array we can see on the right here:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/create-numpy-array-1.png" />
<br />
</div>
<div class="tooltip">
<p>NumPy가 배열의 값을 초기화해주기를 바라는 케이스들이 종종 있습니다. NumPy는 이런 경우에 ones(), zeros(), random.random()과 같은 메서드를 제공합니다. 우리는 생성하기 원하는 element의 개수만 전달하면 됩니다:
<span class="tooltiptext">
There are often cases when we want NumPy to initialize the values of the array for us. NumPy provides methods like ones(), zeros(), and random.random() for these cases. We just pass them the number of elements we want it to generate:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/create-numpy-array-ones-zeros-random.png" />
<br />
</div>
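위 그림의 배열 생성 방법들을 코드로 간단히 확인해보면 다음과 같습니다:

```python
import numpy as np

a = np.array([1, 2, 3])       # 파이썬 리스트로부터 ndarray 생성
ones = np.ones(3)             # 1로 초기화된 배열
zeros = np.zeros(3)           # 0으로 초기화된 배열
rnd = np.random.random(3)     # [0, 1) 구간의 난수 3개
print(a, ones, zeros, rnd)
```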
<div class="tooltip">
<p>배열을 생성하면, 재밌는 방법들로 배열을 조작(manipulate)할 수 있습니다.
<span class="tooltiptext">
Once we’ve created our arrays, we can start to manipulate them in interesting ways.
</span></p>
</div>
<h2 id="array-arithmetic-배열-산술연산">Array Arithmetic (배열 산술연산)</h2>
<div class="tooltip">
<p>NumPy 배열의 유용성을 설명하기 위해 두 개의 NumPy 배열을 생성합니다. <code class="language-plaintext highlighter-rouge">data</code>와 <code class="language-plaintext highlighter-rouge">ones</code>라고 부르겠습니다:
<span class="tooltiptext">
Let’s create two NumPy arrays to showcase their usefulness. We’ll call them <code class="language-plaintext highlighter-rouge">data</code> and <code class="language-plaintext highlighter-rouge">ones</code>:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-arrays-example-1.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>position-wise 덧셈(즉, 각 행마다의 값들을 더하기)은 <code class="language-plaintext highlighter-rouge">data + ones</code>를 입력하면 됩니다. 간단합니다.
<span class="tooltiptext">
Adding them up position-wise (i.e. adding the values of each row) is as simple as typing <code class="language-plaintext highlighter-rouge">data + ones</code>:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-arrays-adding-1.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>이러한 도구들을 배우기 시작했을 때, loop를 돌면서 계산하는 것을 프로그래밍할 필요가 없는 이런 추상화가 신선하게 느껴졌습니다. 상위 레벨에서 문제를 생각해볼 수 있게 만드는 멋진 추상화입니다.
<span class="tooltiptext">
When I started learning such tools, I found it refreshing that an abstraction like this makes me not have to program such a calculation in loops. It’s a wonderful abstraction that allows you to think about problems at a higher level.
</span></p>
</div>
<div class="tooltip">
<p>이런 방식으로 할 수 있는 것은 덧셈 뿐만이 아닙니다:
<span class="tooltiptext">
And it’s not only addition that we can do this way:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array-subtract-multiply-divide.png" />
<br />
</div>
<p><br /></p>
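그림의 position-wise 산술연산 네 가지를 코드로 옮기면 다음과 같습니다:

```python
import numpy as np

data = np.array([1, 2])
ones = np.ones(2)

print(data + ones)   # [2. 3.]
print(data - ones)   # [0. 1.]
print(data * data)   # [1 4]
print(data / data)   # [1. 1.]
```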
<div class="tooltip">
<p>배열과 하나의 숫자값을 연산하고 싶을 때도 종종 있습니다 (우리는 이 것을 벡터와 스칼라간 연산이라고 부릅니다). 예를 들어, 마일로 표현된 거리 값을 가지고 있는 배열을 킬로미터로 변환하려고 한다고 가정해 보겠습니다. 간단하게 <code class="language-plaintext highlighter-rouge">data * 1.6</code>라고 하면 됩니다:
<span class="tooltiptext">
There are often cases when we want to carry out an operation between an array and a single number (we can also call this an operation between a vector and a scalar). Say, for example, our array represents distance in miles, and we want to convert it to kilometers. We simply say <code class="language-plaintext highlighter-rouge">data * 1.6</code>:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array-broadcast.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>NumPy가 이 연산(<code class="language-plaintext highlighter-rouge">*</code>)을 각 셀마다 곱셈이 수행되어야 한다는 의미로 이해했다는 점을 확인해보세요. 그 컨셉을 <em>브로드캐스팅</em>이라고 합니다. 매우 유용하죠.
<span class="tooltiptext">
See how NumPy understood that operation to mean that the multiplication should happen with each cell? That concept is called <em>broadcasting</em>, and it’s very useful.
</span></p>
</div>
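마일을 킬로미터로 변환하는 위의 스칼라 브로드캐스팅 예시를 코드로 확인해보면:

```python
import numpy as np

data = np.array([1, 2])   # 마일 단위 거리 (예시 값)
km = data * 1.6           # 스칼라 1.6이 각 셀에 브로드캐스트됨
print(km)                 # [1.6 3.2]
```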
<h2 id="indexing-인덱싱">Indexing (인덱싱)</h2>
<div class="tooltip">
<p>파이썬 리스트에서 슬라이싱을 할 수 있는 모든 방법으로 NumPy에서도 인덱싱과 슬라이싱을 할 수 있습니다:
<span class="tooltiptext">
We can index and slice NumPy arrays in all the ways we can slice python lists:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array-slice.png" />
<br />
</div>
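파이썬 리스트와 동일한 인덱싱/슬라이싱 문법을 코드로 확인해보면:

```python
import numpy as np

data = np.array([1, 2, 3])
print(data[0])     # 1
print(data[1])     # 2
print(data[0:2])   # [1 2]
print(data[1:])    # [2 3]
```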
<h2 id="aggregation-집계">Aggregation (집계)</h2>
<div class="tooltip">
<p>NumPy의 추가적인 이점은 집계 함수(aggregation function)들입니다.
<span class="tooltiptext">
Additional benefits NumPy gives us are aggregation functions:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array-aggregation.png" />
<br />
</div>
<div class="tooltip">
<p><code class="language-plaintext highlighter-rouge">min</code>, <code class="language-plaintext highlighter-rouge">max</code>, <code class="language-plaintext highlighter-rouge">sum</code> 뿐만아니라 평균을 구하는 <code class="language-plaintext highlighter-rouge">mean</code>, 모든 element들을 곱한 결과를 구하는 <code class="language-plaintext highlighter-rouge">prod</code>, 표준 편차를 구하는 <code class="language-plaintext highlighter-rouge">std</code>, <a href="https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html">다른 많은 연산들</a>을 얻을 수 있습니다.
<span class="tooltiptext">
In addition to <code class="language-plaintext highlighter-rouge">min</code>, <code class="language-plaintext highlighter-rouge">max</code>, and <code class="language-plaintext highlighter-rouge">sum</code>, you get all the greats like <code class="language-plaintext highlighter-rouge">mean</code> to get the average, <code class="language-plaintext highlighter-rouge">prod</code> to get the result of multiplying all the elements together, <code class="language-plaintext highlighter-rouge">std</code> to get standard deviation, and <a href="https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html">plenty of others</a>.
</span></p>
</div>
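위에서 언급한 집계 함수들을 작은 예시로 확인해보면:

```python
import numpy as np

data = np.array([1, 2, 3])
print(data.max())    # 3
print(data.min())    # 1
print(data.sum())    # 6
print(data.mean())   # 2.0 (평균)
print(data.prod())   # 6   (모든 element의 곱)
print(data.std())    # 약 0.816 (표준 편차)
```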
<h2 id="in-more-dimensions-더-높은-차원">In more dimensions (더 높은 차원)</h2>
<div class="tooltip">
<p>현재까지 우리가 살펴본 예제들은 1차원인 벡터들이었습니다. NumPy의 정수는, 우리가 지금까지 배운 능력들을 어떤 차원에도 적용할 수 있다는 것입니다.
<span class="tooltiptext">
All the examples we’ve looked at deal with vectors in one dimension. A key part of the beauty of NumPy is its ability to apply everything we’ve looked at so far to any number of dimensions.
</span></p>
</div>
<h3 id="creating-matrices-배열-생성">Creating Matrices (행렬 생성)</h3>
<div class="tooltip">
<p>아래와 같은 모양의 파이썬의 중첩 리스트를 전달하여, NumPy가 이를 나타내는 행렬을 생성하도록 할 수 있습니다:
<span class="tooltiptext">
We can pass python lists of lists in the following shape to have NumPy create a matrix to represent them:
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">],[</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">]])</span>
</code></pre></div></div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array-create-2d.png" />
<br />
</div>
<div class="tooltip">
<p>우리가 생성하려는 배열의 차원을 기술하는 튜플을 (파라미터로) 전달한다면, (<code class="language-plaintext highlighter-rouge">ones()</code>, <code class="language-plaintext highlighter-rouge">zeros()</code>, <code class="language-plaintext highlighter-rouge">random.random()</code> 등) 위에서 살펴본 메서드들을 사용할 수 있습니다.
<span class="tooltiptext">
We can also use the same methods we mentioned above (<code class="language-plaintext highlighter-rouge">ones()</code>, <code class="language-plaintext highlighter-rouge">zeros()</code>, and <code class="language-plaintext highlighter-rouge">random.random()</code>) as long as we give them a tuple describing the dimensions of the matrix we are creating:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-ones-zeros-random.png" />
<br />
</div>
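행렬 생성도 같은 패턴입니다. 차원을 튜플로 전달하는 부분만 다릅니다:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])    # 중첩 리스트 -> 2x2 행렬
ones = np.ones((3, 2))            # 차원을 튜플 (3, 2)로 전달
zeros = np.zeros((3, 2))
rnd = np.random.random((3, 2))
print(m.shape, ones.shape)        # (2, 2) (3, 2)
```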
<h3 id="matrix-arithmetic-행렬-산술연산">Matrix Arithmetic (행렬 산술연산)</h3>
<div class="tooltip">
<p>두 행렬의 크기가 같다면, 산술연산자 (<code class="language-plaintext highlighter-rouge">+-*/</code>)를 사용해서 행렬을 더하거나 곱할 수 있습니다. NumPy는 연산을 position-wise로 처리합니다:
<span class="tooltiptext">
We can add and multiply matrices using arithmetic operators (<code class="language-plaintext highlighter-rouge">+-*/</code>) if the two matrices are the same size. NumPy handles those as position-wise operations:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-arithmetic.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>서로 다른 차원의 크기가 1인 경우(예: 행렬이 오직 1개의 열 또는 1개의 행을 가지고 있는 경우)에만, 크기가 다른 행렬에 대해 이러한 산술 연산을 수행할 수 있습니다. 이 경우에 NumPy는 해당 연산에 브로드캐스트 규칙을 적용합니다:
<span class="tooltiptext">
We can get away with doing these arithmetic operations on matrices of different size only if the different dimension is one (e.g. the matrix has only one column or one row), in which case NumPy uses its broadcast rules for that operation:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-broadcast.png" />
<br />
</div>
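같은 크기의 행렬 간 position-wise 연산과, 크기가 1인 차원에 대한 브로드캐스트를 코드로 확인해보면:

```python
import numpy as np

data = np.array([[1, 2], [3, 4]])
ones = np.ones((2, 2))
print(data + ones)           # position-wise: [[2. 3.] [4. 5.]]

ones_row = np.ones((1, 2))   # 행이 1개 -> data의 각 행에 브로드캐스트
print(data + ones_row)       # [[2. 3.] [4. 5.]]
```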
<h3 id="dot-product-내적">Dot Product (내적)</h3>
<div class="tooltip">
<p>산술 연산과의 주요 차이점은 내적을 사용한 <a href="https://www.mathsisfun.com/algebra/matrix-multiplying.html">행렬 곱셈</a>의 경우입니다. NumPy는 행렬에 다른 행렬과 내적 연산을 수행하는데 사용할 수 있는 <code class="language-plaintext highlighter-rouge">dot()</code> 메서드를 제공합니다:
<span class="tooltiptext">
A key distinction to make with arithmetic is the case of <a href="https://www.mathsisfun.com/algebra/matrix-multiplying.html">matrix multiplication</a> using the dot product. NumPy gives every matrix a <code class="language-plaintext highlighter-rouge">dot()</code> method we can use to carry-out dot product operations with other matrices:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-dot-product-1.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>행렬의 차원 정보를 그림 하단에 추가하여, 두 행렬이 서로 마주하는 면이 서로 같은 차원을 가지고 있어야 함을 강조했습니다. 이 연산을 아래와 같이 시각화하여 표현할 수 있습니다.
<span class="tooltiptext">
I’ve added matrix dimensions at the bottom of this figure to stress that the two matrices have to have the same dimension on the side they face each other with. You can visualize this operation as looking like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-dot-product-2.png" />
<br />
</div>
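위 그림의 내적을 코드로 옮기면 다음과 같습니다. (3,) 벡터와 (3, 2) 행렬이 맞대는 차원(3)이 일치하므로 결과는 (2,)가 됩니다:

```python
import numpy as np

data = np.array([1, 2, 3])                   # shape (3,)
powers_of_ten = np.array([[1, 10],
                          [100, 1000],
                          [10000, 100000]])  # shape (3, 2)
print(data.dot(powers_of_ten))               # [ 30201 302010]
```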
<h3 id="matrix-indexing-행렬-인덱싱">Matrix Indexing (행렬 인덱싱)</h3>
<div class="tooltip">
<p>인덱싱과 슬라이싱은 행렬을 다룰 때 더 유용합니다:
<span class="tooltiptext">
Indexing and slicing operations become even more useful when we’re manipulating matrices:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-indexing.png" />
<br />
</div>
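2차원 행렬에서의 인덱싱/슬라이싱을 작은 예시로 확인해보면:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])
print(data[0, 1])     # 2        (0행 1열)
print(data[1:3])      # [[3 4] [5 6]]  (1~2행)
print(data[0:2, 0])   # [1 3]    (0~1행의 0열)
```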
<h3 id="matrix-aggregation-행렬-집계연산">Matrix Aggregation (행렬 집계연산)</h3>
<div class="tooltip">
<p>벡터를 집계했던 것과 동일한 방법으로 행렬도 집계 연산을 할 수 있습니다:
<span class="tooltiptext">
We can aggregate matrices the same way we aggregated vectors:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-aggregation-1.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>행렬의 모든 값을 집계할 수 있을 뿐만 아니라, <code class="language-plaintext highlighter-rouge">axis</code> 파라미터를 사용하여 행 또는 열을 집계할 수도 있습니다.
<span class="tooltiptext">
Not only can we aggregate all the values in a matrix, but we can also aggregate across the rows or columns by using the <code class="language-plaintext highlighter-rouge">axis</code> parameter:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-aggregation-4.png" />
<br />
</div>
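<code>axis</code> 파라미터의 효과를 코드로 확인해보면 (axis=0은 열 방향, axis=1은 행 방향 집계):

```python
import numpy as np

data = np.array([[1, 2], [5, 3], [4, 6]])
print(data.max())         # 6       (전체 최댓값)
print(data.max(axis=0))   # [5 6]   (각 열의 최댓값)
print(data.max(axis=1))   # [2 5 6] (각 행의 최댓값)
```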
<h2 id="transposing-and-reshaping-전치-및-재구조화재배열">Transposing and Reshaping (전치 및 재구조화/재배열)</h2>
<div class="tooltip">
<p>행렬을 다룰 때 일반적으로(많은 경우) 회전을 시킬 필요가 있습니다. 우리가 두 행렬의 내적을 구해야 해서, 두 행렬이 공유하는(맞대는) 차원을 맞출 필요가 있을 때가 그렇습니다. NumPy 배열은 행렬의 전치를 구하는 <code class="language-plaintext highlighter-rouge">T</code>라는 편리한 속성(property)을 가지고 있습니다.
<span class="tooltiptext">
A common need when dealing with matrices is the need to rotate them. This is often the case when we need to take the dot product of two matrices and need to align the dimension they share. NumPy arrays have a convenient property called <code class="language-plaintext highlighter-rouge">T</code> to get the transpose of a matrix:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-transpose.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>고급 유즈케이스에서는, 특정 행렬의 차원을 서로 바꿔야(switch) 할 수도 있습니다. 모델이 기대하는 입력의 shape이 가지고 있는 데이터셋과 다른 머신러닝 어플리케이션에서 종종 발생합니다. NumPy의 <code class="language-plaintext highlighter-rouge">reshape()</code> 메서드는 이러한 케이스에서 유용합니다. 원하는 행렬의 차원을 전달하기만 하면 됩니다. 만약 -1을 차원 값으로 전달하면, NumPy는 그 행렬의 정보를 기반으로 정확한 차원 값을 계산해냅니다.
<span class="tooltiptext">
In more advanced use case, you may find yourself needing to switch the dimensions of a certain matrix. This is often the case in machine learning applications where a certain model expects a certain shape for the inputs that is different from your dataset. NumPy’s <code class="language-plaintext highlighter-rouge">reshape()</code> method is useful in these cases. You just pass it the new dimensions you want for the matrix. You can pass -1 for a dimension and NumPy can infer the correct dimension based on your matrix:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-reshape.png" />
<br />
</div>
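전치(<code>T</code>)와 <code>reshape()</code>(-1 포함)를 코드로 확인해보면:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
print(data.T.shape)                          # (2, 3) -- 전치
print(data.reshape(2, 3))                    # [[1 2 3] [4 5 6]]
print(data.reshape(1, -1).shape)             # (1, 6) -- -1은 NumPy가 계산
```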
<h2 id="yet-more-dimensions-한층-더-높은-차원">Yet More Dimensions (한층 더 높은 차원)</h2>
<div class="tooltip">
<p>NumPy는 앞서 언급한 모든 것을 어떤 차원 수에 대해서도 수행할 수 있습니다. 중심이 되는 자료 구조를 ndarray(N-Dimensional Array; N-차원 배열)라고 부르는 이유입니다.
<span class="tooltiptext">
NumPy can do everything we’ve mentioned in any number of dimensions. Its central data structure is called ndarray (N-Dimensional Array) for a reason.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-3d-array.png" />
<br />
</div>
<div class="tooltip">
<p>여러 면에서, 새로운 차원을 다루는 것은 NumPy 함수의 파라미터에 콤마(<code class="language-plaintext highlighter-rouge">,</code>)를 하나 추가하는 것일 뿐입니다.
<span class="tooltiptext">
In a lot of ways, dealing with a new dimension is just adding a comma to the parameters of a NumPy function:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-3d-array-creation.png" />
<br />
</div>
<div class="tooltip">
<p>참고: 3-차원 NumPy 배열을 출력하려고 할 때, 텍스트 출력은 여기에 표현한 것과 다르게 배열을 시각화합니다. NumPy의 n-차원 배열을 출력하는 순서는 마지막 축(axis)이 가장 빠르게 반복되고, 첫번째 축이 가장 느립니다. 이 것은 <code class="language-plaintext highlighter-rouge">np.ones((4,3,2))</code>이 아래와 같이 출력됨을 의미합니다:
<span class="tooltiptext">
Note: Keep in mind that when you print a 3-dimensional NumPy array, the text output visualizes the array differently than shown here. NumPy’s order for printing n-dimensional arrays is that the last axis is looped over the fastest, while the first is the slowest. Which means that <code class="language-plaintext highlighter-rouge">np.ones((4,3,2))</code> would be printed as:
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">array</span><span class="p">([[[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">]],</span>
<span class="p">[[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">]],</span>
<span class="p">[[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">]],</span>
<span class="p">[[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">]]])</span>
</code></pre></div></div>
<h2 id="practical-usage-실질적인-사용법">Practical Usage (실질적인 사용법)</h2>
<div class="tooltip">
<p>마무리입니다. 다음은 NumPy가 도움이 될 만한 몇가지 유용한 것들의 예입니다.
<span class="tooltiptext">
And now for the payoff. Here are some examples of the useful things NumPy will help you through.
</span></p>
</div>
<h3 id="formulas-수식">Formulas (수식)</h3>
<div class="tooltip">
<p>행렬과 벡터를 다루는 수학적인 공식을 구현하는 것은 NumPy를 고려해야 하는 주요 사용 사례입니다. 이 것이 NumPy가 과학적 파이썬 커뮤니티에서 사랑받는 이유입니다. 예를 들어, 회귀 문제를 다루는 지도학습 머신러닝 모델의 중심인 평균 제곱 오차 수식을 고려해보세요:
<span class="tooltiptext">
Implementing mathematical formulas that work on matrices and vectors is a key use case to consider NumPy for. It’s why NumPy is the darling of the scientific python community. For example, consider the mean square error formula that is central to supervised machine learning models tackling regression problems:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/mean-square-error-formula.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>NumPy에서 이 것을 구현하는 것은 간단합니다:
<span class="tooltiptext">
Implementing this is a breeze in NumPy:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-mean-square-error-formula.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>이 것의 정수는 (서로 같은 크기이기만 하다면) <code class="language-plaintext highlighter-rouge">predictions</code>와 <code class="language-plaintext highlighter-rouge">labels</code>가 한 개든 천 개든 몇 개의 값을 가지고 있든지 NumPy가 신경쓰지 않는다는 것입니다. 다음 코드에서 4개의 연산을 순차적으로 단계별로 실행하는 예제를 살펴볼 수 있습니다:
<span class="tooltiptext">
The beauty of this is that numpy does not care if <code class="language-plaintext highlighter-rouge">predictions</code> and <code class="language-plaintext highlighter-rouge">labels</code> contain one or a thousand values (as long as they’re both the same size). We can walk through an example stepping sequentially through the four operations in that line of code:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-mse-1.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>predictions 및 labels 벡터는 모두 3개의 값을 가지고 있습니다. 이 것은 n이 3임을 의미합니다. 뺄셈을 수행한 뒤, 다음과 같은 값이 나옵니다:
<span class="tooltiptext">
Both the predictions and labels vectors contain three values. Which means n has a value of three. After we carry out the subtraction, we end up with the values looking like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-mse-2.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>그런 다음 벡터에 있는 값들을 제곱합니다:
<span class="tooltiptext">
Then we can square the values in the vector:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-mse-3.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>이제 이 값들을 더합니다:
<span class="tooltiptext">
Now we sum these values:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-mse-4.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>해당 예측에 대한 에러 값이자 모델 품질에 대한 점수가 산출됩니다.
<span class="tooltiptext">
Which results in the error value for that prediction and a score for the quality of the model.
</span></p>
</div>
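이 단계들을 한 줄의 NumPy 코드로 스케치해보면 다음과 같습니다 (predictions와 labels는 임의로 정한 예시 값입니다):

```python
import numpy as np

predictions = np.array([1, 1, 1])   # 예시 값 (가정)
labels = np.array([1, 2, 3])        # 예시 값 (가정)

n = predictions.shape[0]            # n = 3
# 뺄셈 -> 제곱 -> 합 -> 1/n 곱: 평균 제곱 오차
error = (1 / n) * np.sum(np.square(predictions - labels))
print(error)   # (0 + 1 + 4) / 3 = 1.666...
```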
<h3 id="data-representation-데이터-표현">Data Representation (데이터 표현)</h3>
<div class="tooltip">
<p>모델을 처리하거나 빌드하는데 필요한 모든 데이터 유형에 대해 생각해보십시오 (스프레드시트, 이미지, 오디오 등). 그 중 대부분은 n-차원 배열로 표현하기에 완벽하게 적합합니다:
<span class="tooltiptext">
Think of all the data types you’ll need to crunch and build models around (spreadsheets, images, audio…etc). So many of them are perfectly suited for representation in an n-dimensional array:
</span></p>
</div>
<h4 id="tables-and-spreadsheets">Tables and Spreadsheets</h4>
<div class="tooltip">
<ul>
<li>스프레드시트나 테이블은 2차원 행렬입니다. 스프레드시트의 각 시트는 별도의 변수가 될 수 있습니다. 파이썬에서 이 것을 위한 가장 인기있는 추상화는 <a href="https://jalammar.github.io/gentle-visual-intro-to-data-analysis-python-pandas/">판다스 데이터프레임</a>이며, NumPy를 기반으로 그 위에 구축된 것입니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> A spreadsheet or a table of values is a two dimensional matrix. Each sheet in a spreadsheet can be its own variable. The most popular abstraction in python for those is the <a href="https://jalammar.github.io/gentle-visual-intro-to-data-analysis-python-pandas/">pandas dataframe</a>, which actually uses NumPy and builds on top of it.
</span></li>
</ul>
</div>
<div class="img-div-any-width">
<image src="/images/pandas-intro/0%20excel-to-pandas.png" />
<br />
</div>
<h4 id="audio-and-timeseries시계열">Audio and Timeseries(시계열)</h4>
<div class="tooltip">
<ul>
<li>오디오 파일은 오디오 샘플들의 1차원 배열입니다. 각 샘플들은 오디오 신호의 작은 조각을 숫자로 표현한 것 입니다. CD품질 오디오는 초당 44,100 샘플을 가지며 각 샘플은 -32768 ~ 32767 사이의 정수 값을 갖습니다. CD품질 10초 WAVE 파일을 가지고 있다면, 10 * 44,100 = 441,000 샘플 길이의 NumPy 배열로 로딩할 수 있습니다. 오디오의 첫 1초를 추출하고 싶으신가요? <code class="language-plaintext highlighter-rouge">audio</code>라고 명명할 파일을 NumPy 배열로 단순히 로드하고 <code class="language-plaintext highlighter-rouge">audio[:44100]</code>를 가져오면 됩니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> An audio file is a one-dimensional array of samples. Each sample is a number representing a tiny chunk of the audio signal. CD-quality audio may have 44,100 samples per second and each sample is an integer between -32768 and 32767. Meaning if you have a ten-second WAVE file of CD-quality, you can load it into a NumPy array with length 10 * 44,100 = 441,000 samples. Want to extract the first second of audio? Simply load the file into a NumPy array that we’ll call <code class="language-plaintext highlighter-rouge">audio</code>, and get <code class="language-plaintext highlighter-rouge">audio[:44100]</code>.
</span></li>
</ul>
</div>
<div class="tooltip">
<p>다음은 오디오 파일의 일부분입니다:
<span class="tooltiptext" style="display: inline-block; text-align: left;">
Here’s a look at a slice of an audio file:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-audio.png" />
<br />
</div>
<div class="tooltip">
<p>시계열 데이터도 동일합니다 (예를 들어, 시간에 걸친 주식 가격)
<span class="tooltiptext">
The same goes for time-series data (for example, the price of a stock over time).
</span></p>
</div>
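실제 WAVE 파일 대신 임의의 샘플로 채운 가상의(hypothetical) 10초 클립으로, 첫 1초를 슬라이싱하는 과정을 스케치해보면:

```python
import numpy as np

sample_rate = 44100                     # CD 품질: 초당 44,100 샘플
rng = np.random.RandomState(0)
# 실제 오디오 파일 대신 16비트 정수 범위의 난수로 채운 가상의 10초 클립
audio = rng.randint(-32768, 32768, size=10 * sample_rate)
first_second = audio[:sample_rate]      # 첫 1초만 슬라이싱
print(first_second.shape)               # (44100,)
```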
<h4 id="images">Images</h4>
<div class="tooltip">
<ul>
<li>
<p>이미지는 (높이 x 너비) 크기의 픽셀 행렬입니다.</p>
<ul>
<li>만약 이미지가 흑백(회색조라고도 함)이라면, 각 픽셀은 1개의 숫자로 표현됩니다(주로 0(검정), 255(흰색) 사이의 값). 이미지의 왼쪽 상단의 10 x 10 픽셀을 자르고 싶은가요? NumPy에게 <code class="language-plaintext highlighter-rouge">image[:10,:10]</code>라고 하시면 됩니다.</li>
</ul>
</li>
</ul>
<p><span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> An image is a matrix of pixels of size (height x width).
<br />
<span>*</span> If the image is black and white (a.k.a. grayscale), each pixel can be represented by a single number (commonly between 0 (black) and 255 (white)). Want to crop the top left 10 x 10 pixel part of the image? Just tell NumPy to get you <code class="language-plaintext highlighter-rouge">image[:10,:10]</code>.
</span></p>
</div>
<div class="tooltip">
<p>다음은 이미지 파일의 일부입니다:
<span class="tooltiptext">
Here’s a look at a slice of an image file:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-grayscale-image.png" />
<br />
</div>
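난수로 채운 가상의(hypothetical) 흑백 이미지로, 왼쪽 상단 10 x 10 픽셀을 자르는 과정을 스케치해보면:

```python
import numpy as np

rng = np.random.RandomState(0)
# 실제 이미지 파일 대신 0~255 값으로 채운 가상의 480x640 흑백 이미지
image = rng.randint(0, 256, size=(480, 640))
top_left = image[:10, :10]              # 왼쪽 상단 10 x 10 픽셀
print(top_left.shape)                   # (10, 10)
```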
<div class="tooltip">
<ul>
<li>만약 이미지가 컬러이미지라면, 각 픽셀은 3개의 숫자로 표현됩니다 - 빨강, 초록, 파랑 각각의 값. 이 경우에 우리는 3차원이 필요합니다 (각 셀이 한 숫자만 표현할 수 있기 때문에). 그래서 컬러이미지는 (높이 x 너비 x 3) 차원의 ndarray로 표현됩니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> If the image is colored, then each pixel is represented by three numbers - a value for each of red, green, and blue. In that case we need a 3rd dimension (because each cell can only contain one number). So a colored image is represented by an ndarray of dimensions: (height x width x 3).
</span></li>
</ul>
</div>
<div class="img-div-any-width" markdown="0">
<image src="/images/numpy/numpy-color-image.png"/>
<br />
</div>
<h4 id="language">Language</h4>
<div class="tooltip">
<p>만약 텍스트를 다룬다면, 얘기가 조금 달라집니다. 텍스트를 숫자로 표현하려면 어휘(vocab; 모델이 아는 모든 고유한 단어 목록)가 만들어져야 하고 <a href="https://jalammar.github.io/illustrated-word2vec/">임베딩 단계</a>가 있어야 합니다. 고대에 쓰여진 (번역된) 아래 인용구를 수치적으로 표현하는 단계를 살펴보겠습니다.
<span class="tooltiptext">
If we’re dealing with text, the story is a little different. The numeric representation of text requires a step of building a vocabulary (an inventory of all the unique words the model knows) and an <a href="https://jalammar.github.io/illustrated-word2vec/">embedding step</a>. Let us see the steps of numerically representing this (translated) quote by an ancient spirit:
</span></p>
</div>
<p>“Have the bards who preceded me left any theme unsung?”
(“내 이전 음유시인들이 노래를 부르지 않고 남겨둔 주제가 있었던가?”)</p>
<div class="tooltip">
<p>모델은 이 전사 시인의 불안의 말(단어들)을 숫자로 나타내기 전에 대량의 텍스트를 볼 필요가 있습니다. <a href="http://mattmahoney.net/dc/textdata.html">작은 데이터셋</a>을 처리하여, (71,290 단어의) 어휘를 구축하는데 사용할 수 있습니다:
<span class="tooltiptext">
A model needs to look at a large amount of text before it can numerically represent the anxious words of this warrior poet. We can proceed to have it process a <a href="http://mattmahoney.net/dc/textdata.html">small dataset</a> and use it to build a vocabulary (of 71,290 words):
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/numpy/numpy-nlp-vocabulary.png" />
<br />
</div>
<div class="tooltip">
<p>문장은 토큰(일반적으로 단어나 단어의 일부분)의 배열로 나뉩니다.
<span class="tooltiptext">
The sentence can then be broken into an array of tokens (words or parts of words based on common rules):
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/numpy/numpy-nlp-tokenization.png" />
<br />
</div>
<div class="tooltip">
<p>각 단어를 어휘 테이블의 id로 치환할 수 있습니다.
<span class="tooltiptext">
We then replace each word by its id in the vocabulary table:
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/numpy/numpy-nlp-ids.png" />
<br />
</div>
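<div class="tooltip">
<p>[추가설명] 어휘 구축, 토큰화, id 치환 단계를 아래처럼 스케치할 수 있습니다. 실제 어휘는 71,290 단어이지만, 여기서는 설명을 위해 작은 가상의 어휘를 사용합니다:
<span class="tooltiptext">
[Added] The vocabulary, tokenization, and id-replacement steps can be sketched as below. The real vocabulary has 71,290 words; here we use a tiny made-up one for illustration:
</span></p>
</div>

```python
# A tiny made-up vocabulary (the real one in the post has 71,290 words).
vocab = {"have": 0, "the": 1, "bards": 2, "who": 3, "preceded": 4,
         "me": 5, "left": 6, "any": 7, "theme": 8, "unsung": 9, "?": 10}

sentence = "Have the bards who preceded me left any theme unsung ?"

tokens = sentence.lower().split()      # break the sentence into tokens
ids = [vocab[t] for t in tokens]       # replace each token by its vocab id

print(ids)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```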
<div class="tooltip">
<p>이 id들은 여전히 모델에게 많은 정보 가치를 제공하지는 않습니다. 그래서 모델에게 단어의 시퀀스를 공급하기 전에, 토큰/워드는 임베딩으로 대체될 필요가 있습니다 (이 예에서는 50 차원 <a href="https://jalammar.github.io/illustrated-word2vec/">word2vec 임베딩</a>)
<span class="tooltiptext">
These ids still don’t provide much information value to a model. So before feeding a sequence of words to a model, the tokens/words need to be replaced with their embeddings (50 dimension <a href="https://jalammar.github.io/illustrated-word2vec/">word2vec embedding</a> in this case):
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/numpy/numpy-nlp-embeddings.png" />
<br />
</div>
<div class="tooltip">
<p>이 NumPy 배열이 [임베딩_차원 x 시퀀스_길이] 크기의 차원임을 알 수 있습니다. 실제로는 반대일 것이지만, 시각적 일관성을 위해 이렇게 표현하겠습니다. 성능상의 이유로, 딥러닝 모델은 첫번째 차원을 배치 사이즈로 남겨두는 경향이 있습니다 (여러 예제를 병렬로 훈련하면 모델을 더 빠르게 훈련할 수 있기 때문입니다). 이것은 <code class="language-plaintext highlighter-rouge">reshape()</code>이 매우 유용해지는 명백한 케이스입니다. 예를 들어, <a href="https://jalammar.github.io/illustrated-bert/">BERT</a>와 같은 모델은 [배치_사이즈, 시퀀스_길이, 임베딩_크기] shape의 입력을 예상합니다.
<span class="tooltiptext">
You can see that this NumPy array has the dimensions [embedding_dimension x sequence_length]. In practice these would be the other way around, but I’m presenting it this way for visual consistency. For performance reasons, deep learning models tend to preserve the first dimension for batch size (because the model can be trained faster if multiple examples are trained in parallel). This is a clear case where <code class="language-plaintext highlighter-rouge">reshape()</code> becomes super useful. A model like <a href="https://jalammar.github.io/illustrated-bert/">BERT</a>, for example, would expect its inputs in the shape: [batch_size, sequence_length, embedding_size].
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/numpy/numpy-nlp-bert-shape.png" />
<br />
</div>
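<div class="tooltip">
<p>[추가설명] id를 임베딩으로 치환하고 배치 차원을 추가하는 과정은 아래처럼 스케치할 수 있습니다. 임베딩 테이블의 값은 임의로 만든 가상의 값입니다:
<span class="tooltiptext">
[Added] Replacing ids by embeddings and adding a batch dimension can be sketched as below. The embedding table values are made up:
</span></p>
</div>

```python
import numpy as np

vocab_size, embedding_dim = 71290, 50
sequence_ids = np.arange(11)           # the ids from the step above

# A made-up embedding table: one 50-dimensional row per vocabulary word.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Indexing with the id array replaces each id by its embedding row.
embedded = embedding_table[sequence_ids]
print(embedded.shape)   # (11, 50) -> [sequence_length, embedding_dim]

# BERT-like models expect a leading batch dimension, so reshape:
batch = embedded.reshape(1, 11, embedding_dim)
print(batch.shape)      # (1, 11, 50) -> [batch_size, seq_length, emb_size]
```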
<div class="tooltip">
<p>이제 이것은 모델이 처리하여 유용한 작업을 수행할 수 있는 숫자 볼륨입니다. 다른 행들은 비워 두었지만, 모델이 학습(또는 예측)할 다른 예제들로 채워질 것입니다.
<span class="tooltiptext">
This is now a numeric volume that a model can crunch and do useful things with. I left the other rows empty, but they’d be filled with other examples for the model to train on (or predict).
</span></p>
</div>
<div class="tooltip">
<p>위의 예에서 <a href="https://en.wikisource.org/wiki/The_Poem_of_Antara">이 시인(Antara)의 시</a>는 어떤 다른 시들 보다 훨씬 더 불후의 명성을 얻었습니다. 아버지 소유의 노예로 태어난 Antarah(어머니가 포로/노예로 잡혀간 에티오피아 공주)는 용맹과 언어 구사력을 통해 자유를 얻었고, 그의 시가 이슬람 이전 아라비아의 <a href="https://en.wikipedia.org/wiki/Mu%27allaqat">카바(사우디아라비아 메카 소재 ‘하람 성원’의 중심에 위치)에 게시된 7개의 시</a> 중 하나로 신화적 지위를 얻었습니다.
<span class="tooltiptext">
(It turned out the <a href="https://en.wikisource.org/wiki/The_Poem_of_Antara">poet’s words</a> in our example were immortalized more so than those of the other poets which trigger his anxieties. Born a slave owned by his father, <a href="https://en.wikipedia.org/wiki/Antarah_ibn_Shaddad">Antarah’s</a> valor and command of language gained him his freedom and the mythical status of having his poem as one of <a href="https://en.wikipedia.org/wiki/Mu%27allaqat">seven poems suspended in the kaaba</a> in pre-Islamic Arabia).
</span></p>
</div>
<hr />
<h2 id="추가-정보">추가 정보<a href="#additional-info" name="additional-info">.</a></h2>
<ul>
<li>이 글은 NumPy와 데이터 표현에 대해 이해하기 쉽게 그림으로 설명한 포스팅을 저자인 Jay Alammar님의 허락을 받고 번역한 글 입니다. 원문은 <a href="https://jalammar.github.io/visual-numpy/">A Visual Intro to NumPy and Data Representation</a>에서 확인하실 수 있습니다.</li>
<li>원서/영문블로그를 보실 때 term에 대한 정보 호환을 위해, 이 분야에서 사용하고 있는 단어, 문구에 대해 가급적 번역하지 않고 원문 그대로 두었습니다. 그리고, 직역 보다는 개념이나 의미에 대한 설명을 쉽게 하는 문장 쪽으로 더 무게를 두어 번역 했습니다. 번역에 대한 의견이나 수정 사항은 아래 댓글 창에 남겨주세요.</li>
<li>번역문에 대응하는 영어 원문을 보고싶으신 분들을 위해 <a href="https://nlpinkorean.github.io">찬</a>님께서 만들어두신 툴팁 도움말 기능(해당 문단에 마우스를 올리면 (모바일의 경우 터치) 원문을 확인할 수 있는 기능)을 가져와서 적용했습니다. 감사합니다.</li>
</ul>chloamme이 글은 Jay Alammar님의 글을 번역한 글입니다. [추가정보] This post is a translated version of A Visual Intro to NumPy and Data Representation by Jay Alammar.[번역] GPT3가 작동하는 방법 - 시각화 및 동영상 설명2021-12-18T00:00:00+00:002021-12-18T00:00:00+00:00/2021/12/18/how-gpt3-works-visualizations-animations-korean<div class="tooltip">
<blockquote>
<p>이 글은 <a href="https://jalammar.github.io/how-gpt3-works-visualizations-animations/">Jay Alammar님의 글</a>을 번역한 글입니다. [<a href="#additional-info">추가정보</a>]
<span class="tooltiptext">
This post is a translated version of <a href="https://jalammar.github.io/how-gpt3-works-visualizations-animations/">How GPT3 Works - Visualizations and Animations</a> by Jay Alammar.
</span></p>
</blockquote>
</div>
<div class="tooltip">
<p>요즘(GPT3가 릴리즈 되었을 당시) 기술분야에서는 GPT3에 대한 <a href="https://www.theverge.com/21346343/gpt-3-explainer-openai-examples-errors-agi-potential">많은 이야기</a>를 합니다. (GPT3와 같은) 대규모 언어 모델들은 그 능력으로 우리를 놀래키기 시작했습니다. 대부분의 비즈니스에서 고객용으로 사용하기에 충분히 신뢰성(reliable)이 있다고 할 수는 없지만, 이 모델들은 자동화(automation)로의 꾸준한 전환과 지능형 컴퓨터 시스템의 가능성을 가속화할 영리함의 번뜩이는 순간들을 보여주고 있습니다.
<span class="tooltiptext">
The tech world is <a href="https://www.theverge.com/21346343/gpt-3-explainer-openai-examples-errors-agi-potential">abuzz</a> with GPT3 hype. Massive language models (like GPT3) are starting to surprise us with their abilities. While not yet completely reliable for most businesses to put in front of their customers, these models are showing sparks of cleverness that are sure to accelerate the march of automation and the possibilities of intelligent computer systems. Let’s remove the aura of mystery around GPT3 and learn how it’s trained and how it works.
</span></p>
</div>
<div class="tooltip">
<p>훈련된 모델은 텍스트를 생성합니다.
<span class="tooltiptext">
A trained language model generates text.
</span></p>
</div>
<div class="tooltip">
<p>선택적으로 일부 텍스트를 입력으로 넣을 수 있으며, 출력에 영향을 줍니다.
<span class="tooltiptext">
We can optionally pass it some text as input, which influences its output.
</span></p>
</div>
<div class="tooltip">
<p>대량의 텍스트를 읽는(scan하는) 훈련 기간동안 “학습된” 모델로 부터 출력이 생성됩니다.
<span class="tooltiptext">
The output is generated from what the model “learned” during its training period where it scanned vast amounts of text.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/01-gpt3-language-model-overview.gif" />
<br />
</div>
<!--more-->
<div class="tooltip">
<p>훈련은 모델에게 많은 텍스트를 노출시키는 일련의 과정입니다. 바로 그 GPT3 훈련 과정은 이미 완료되었습니다([추가설명] GPT-3가 공개된 바로 그 시점). (커뮤니티/소셜/유투브 등에서) 지금 보고 계신 모든 실험들은, 바로 그 훈련된 모델로부터 나온 것입니다. 훈련에는 GPU 355년(V100 기준)과 460만 달러($4.6m)가 소요된 것으로 추정됩니다.
<span class="tooltiptext">
Training is the process of exposing the model to lots of text. That process has been completed. All the experiments you see now are from that one trained model. It was estimated to cost 355 GPU years and cost $4.6m.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/02-gpt3-training-language-model.gif" />
<br />
</div>
<div class="tooltip">
<p>3000억개(300 billion)의 토큰으로 이루어진 텍스트 데이터셋이 모델을 위한 훈련 예제를 생성하는 데 사용됩니다. 예를 들어, 아래는 맨 위의 한 문장에서 생성된 3개의 훈련 예제입니다.
<span class="tooltiptext">
The dataset of 300 billion tokens of text is used to generate training examples for the model. For example, these are three training examples generated from the one sentence at the top.
</span></p>
</div>
<div class="tooltip">
<p>모든 텍스트에 대해 윈도우를 슬라이드(slide)하면서, 많은 예제들을 만들어내는 것을 볼 수 있습니다.
<span class="tooltiptext">
You can see how you can slide a window across all the text and make lots of examples.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/gpt3-training-examples-sliding-window.png" />
<br />
</div>
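<div class="tooltip">
<p>[추가설명] 위 슬라이딩 윈도우 과정은 아래처럼 스케치할 수 있습니다. 문장은 설명을 위해 임의로 고른 것입니다:
<span class="tooltiptext">
[Added] The sliding-window process above can be sketched as below. The sentence is an arbitrary example:
</span></p>
</div>

```python
def make_examples(tokens, window=3):
    """Slide a window across the tokens: the words inside the window are the
    features, and the word right after the window is the label."""
    examples = []
    for i in range(len(tokens) - window):
        examples.append((tokens[i:i + window], tokens[i + window]))
    return examples

tokens = ["a", "robot", "must", "obey", "the", "orders", "given", "it"]
for features, label in make_examples(tokens):
    print(features, "->", label)
```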
<div class="tooltip">
<p>모델을 예제와 함께 표현했습니다. 우리는 모델에게 feature만을 보여주고, 다음 단어(word)를 예측하도록 요청합니다.
<span class="tooltiptext">
The model is presented with an example. We only show it the features and ask it to predict the next word.
</span></p>
</div>
<div class="tooltip">
<p>모델의 예측이 틀리면, 우리는 모델의 예측의 error를 계산하고 model을 업데이트 합니다. 그래서 다음 번에 더 나은 예측을 만들도록 합니다.
<span class="tooltiptext">
The model’s prediction will be wrong. We calculate the error in its prediction and update the model so next time it makes a better prediction.
</span></p>
</div>
<div class="tooltip">
<p>이 것을 수백만번 반복합니다.
<span class="tooltiptext">
Repeat millions of times
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/03-gpt3-training-step-back-prop.gif" />
<br />
</div>
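<div class="tooltip">
<p>[추가설명] "예측 → 오차 계산 → 업데이트" 반복은, 파라미터가 1개뿐인 장난감 모델로 아래처럼 축소해볼 수 있습니다. 개념을 보여주기 위한 가상의 예제이며, 실제 GPT3와는 규모가 전혀 다릅니다:
<span class="tooltiptext">
[Added] The "predict, measure error, update" loop can be shrunk down to a one-parameter toy model as below. This is a made-up illustration; the real GPT3 is vastly larger:
</span></p>
</div>

```python
import numpy as np

# A deliberately tiny stand-in for "predict -> measure error -> update":
# a single-parameter model trained to map x -> 2x.
rng = np.random.default_rng(0)
w = rng.normal()                       # untrained model: a random parameter
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])

for step in range(200):                # "repeat millions of times" (here: 200)
    pred = w * xs                      # 1) predict
    error = pred - ys                  # 2) how wrong was the prediction?
    w -= 0.01 * (error * xs).mean()    # 3) update for a better next prediction

print(round(w, 2))  # -> 2.0
```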
<div class="tooltip">
<p>같은 단계를 좀 더 자세히 살펴봅시다.
<span class="tooltiptext">
Now let’s look at these same steps with a bit more detail.
</span></p>
</div>
<div class="tooltip">
<p>GPT3는 한번에 하나의 토큰을 출력으로 생성합니다 (지금은 토큰이 단어라고 가정합시다).
<span class="tooltiptext">
GPT3 actually generates output one token at a time (let’s assume a token is a word for now).
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/04-gpt3-generate-tokens-output.gif" />
<br />
</div>
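<div class="tooltip">
<p>[추가설명] 한번에 한 토큰씩 생성하고, 생성된 토큰을 다시 입력에 덧붙이는 구조는 아래처럼 스케치할 수 있습니다. <code class="language-plaintext highlighter-rouge">toy_model</code>은 훈련된 모델을 대신하는 가상의 함수입니다:
<span class="tooltiptext">
[Added] Generating one token at a time and appending it back to the input can be sketched as below. <code class="language-plaintext highlighter-rouge">toy_model</code> is a made-up stand-in for the trained model:
</span></p>
</div>

```python
# A sketch of token-by-token generation with a canned, made-up "model".
def toy_model(tokens):
    reply = ["Okay", "human", "."]
    produced = sum(1 for t in tokens if t in reply)
    return reply[produced]             # next token of the canned reply

tokens = ["Please", "obey", "my", "orders"]   # the input prompt
for _ in range(3):
    tokens.append(toy_model(tokens))   # predict one token, feed it back in

print(tokens[4:])   # ['Okay', 'human', '.']
```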
<div class="tooltip">
<p>강조: 이것은 GPT-3가 어떻게 동작하는지에 대한 설명이며, 무엇이 참신한가(주로, 엄청나게 큰 규모)에 대한 논의가 아닙니다. 여기서 설명할 아키텍처는 <a href="https://arxiv.org/pdf/1801.10198.pdf">Generating Wikipedia by Summarizing Long Sequences 논문</a>을 기반으로 하는 transformer decoder 모델입니다.
<span class="tooltiptext">
Please note: This is a description of how GPT-3 works and not a discussion of what is novel about it (which is mainly the ridiculously large scale). The architecture is a transformer decoder model based on this paper https://arxiv.org/pdf/1801.10198.pdf
</span></p>
</div>
<div class="tooltip">
<p>GPT3는 거.대.합니다. (파라미터라고 불리는) 1750억개 (175 billion)의 숫자를 훈련시켜 학습한 내용을 인코딩합니다. 이 숫자들은 각 실행(학습/평가/추론 등 모든 종류의 실행)에서 생성할 토큰을 계산하는데 사용됩니다.
<span class="tooltiptext">
GPT3 is MASSIVE. It encodes what it learns from training in 175 billion numbers (called parameters). These numbers are used to calculate which token to generate at each run.
</span></p>
</div>
<div class="tooltip">
<p>훈련되지 않은 모델은 임의의 파라미터 값을 갖습니다. 훈련은 더 나은 예측을 하게 만드는 값을 찾습니다.
<span class="tooltiptext">
The untrained model starts with random parameters. Training finds values that lead to better predictions.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/gpt3-parameters-weights.png" />
<br />
</div>
<div class="tooltip">
<p>이 숫자들은 모델 안의 수백개의 matrix들의 일부입니다. 예측은 대부분 많은 횟수의 matrix 곱셈입니다.
<span class="tooltiptext">
These numbers are part of hundreds of matrices inside the model. Prediction is mostly a lot of matrix multiplication.
</span></p>
</div>
<div class="tooltip">
<p>제 <a href="https://youtube.com/watch?v=mSTCzNgDJy4">AI 소개 유투브 영상</a>에서, 한 개의 파라미터를 갖는 단순한 ML 모델을 설명했습니다. 이 175B이라는 가공할 만한 모델을 이해(unpack)하는 데 좋은 출발점이 될 것입니다.
<span class="tooltiptext">
In my <a href="https://youtube.com/watch?v=mSTCzNgDJy4">Intro to AI on YouTube</a>, I showed a simple ML model with one parameter. A good start to unpack this 175B monstrosity.
</span></p>
</div>
<div class="tooltip">
<p>이러한 파라미터가 어떻게 분포되고 사용되는지 밝히기 위해, 우리는 모델 안을 열어보고 살펴볼 필요가 있습니다.
<span class="tooltiptext">
To shed light on how these parameters are distributed and used, we’ll need to open the model and look inside.
</span></p>
</div>
<div class="tooltip">
<p>GPT3는 2048 토큰을 입력으로 받아들일 수 있습니다. 그 것을 “컨텍스트 윈도우”라고 합니다. 이 것은, 토큰들이 처리될 2048개의 경로가 있음을 의미합니다.
<span class="tooltiptext">
GPT3 is 2048 tokens wide. That is its “context window”. That means it has 2048 tracks along which tokens are processed.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/05-gpt3-generate-output-context-window.gif" />
<br />
</div>
<div class="tooltip">
<p>보라색 경로를 따라가 봅시다. 시스템은 어떻게 “robotics”를 처리하고 “A”를 생성할까요?
<span class="tooltiptext">
Let’s follow the purple track. How does a system process the word “robotics” and produce “A”?
</span></p>
</div>
<div class="tooltip">
<p>상위레벨 단계:
<span class="tooltiptext">
High-level steps:
</span></p>
</div>
<div class="tooltip">
<ol>
<li>단어를 <a href="https://jalammar.github.io/illustrated-word2vec/">단어를 표현(representing)하는 vector(숫자의 나열/리스트)</a>로 변환합니다.</li>
<li>예측을 수행합니다.</li>
<li>vector를 단어로 변환합니다.
<span class="tooltiptext">
1. Convert the word to <a href="https://jalammar.github.io/illustrated-word2vec/">a vector (list of numbers) representing the word</a><br />
2. Compute prediction<br />
3. Convert resulting vector to word
</span></li>
</ol>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/06-gpt3-embedding.gif" />
<br />
</div>
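<div class="tooltip">
<p>[추가설명] 위 3단계(단어→벡터, 예측, 벡터→단어)를 아주 단순화하면 아래와 같습니다. 가운데 '예측' 단계는 실제로는 96개의 transformer decoder 레이어지만, 여기서는 자리만 잡아두는 가상의 계산으로 대체했으며, 가중치도 임의의 값입니다:
<span class="tooltiptext">
[Added] A heavily simplified sketch of the three steps above (word→vector, prediction, vector→word). The middle step is really 96 transformer decoder layers; here it is a made-up placeholder, and the weights are random:
</span></p>
</div>

```python
import numpy as np

vocab_size, d_model = 10, 4
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(vocab_size, d_model))   # made-up embedding weights

def lm_step(token_id):
    x = W_embed[token_id]             # 1) word -> vector
    h = np.tanh(x)                    # 2) compute prediction (placeholder for
                                      #    the 96 decoder layers)
    logits = h @ W_embed.T            # 3) vector -> a score per vocab word
    return int(np.argmax(logits))     # the highest-scoring word id

next_id = lm_step(3)
print(0 <= next_id < vocab_size)      # True
```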
<div class="tooltip">
<p>GPT3의 중요한 계산은 96개의 transformer decoder 레이어의 층(stack) 안에서 일어납니다.
<span class="tooltiptext">
The important calculations of the GPT3 occur inside its stack of 96 transformer decoder layers.
</span></p>
</div>
<div class="tooltip">
<p>모든 레이어가 보이죠? 이 것이 “딥러닝”의 “깊이(depth)” 입니다.
<span class="tooltiptext">
See all these layers? This is the “depth” in “deep learning”.
</span></p>
</div>
<div class="tooltip">
<p>이 레이어들 각각이, 계산을 위한 각자의 18억개(1.8B)의 파라미터를 가지고 있습니다. 여기가 바로 “마법”이 일어나는 곳 입니다. 이 과정의 상위 레벨 모습은 아래와 같습니다:
<span class="tooltiptext">
Each of these layers has its own 1.8B parameter to make its calculations. That is where the “magic” happens. This is a high-level view of that process:
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/07-gpt3-processing-transformer-blocks.gif" />
<br />
</div>
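<div class="tooltip">
<p>[추가설명] 위 숫자들의 간단한 일관성 확인: 전체 1750억개의 파라미터를 96개 레이어로 나누면 레이어당 약 18억개가 됩니다:
<span class="tooltiptext">
[Added] A quick consistency check on the numbers above: 175 billion parameters divided over 96 layers is roughly 1.8 billion per layer:
</span></p>
</div>

```python
total_parameters = 175_000_000_000     # 175B parameters in total
layers = 96                            # transformer decoder layers

per_layer = total_parameters / layers
print(round(per_layer / 1e9, 2))       # -> 1.82 (the "1.8B per layer" above)
```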
<div class="tooltip">
<p>decoder 내부의 모든 것에 대한 자세한 설명은 제 블로그 포스트 <a href="https://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT2</a>에서 보실 수 있습니다.
<span class="tooltiptext">
You can see a detailed explanation of everything inside the decoder in my blog post <a href="https://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT2</a>.
</span></p>
</div>
<div class="tooltip">
<p>GPT3와의 차이점은 dense와 <a href="https://arxiv.org/pdf/1904.10509.pdf">sparse self-attention layers</a>가 번갈아가며 구성되어 있다는 것입니다.
<span class="tooltiptext">
The difference with GPT3 is the alternating dense and <a href="https://arxiv.org/pdf/1904.10509.pdf">sparse self-attention layers</a>.
</span></p>
</div>
<div class="tooltip">
<p>이 것은 입력의 엑스레이 이미지이며, GPT3의 응답(“Okay human”)입니다. 모든 토큰이 전체 레이어 층을 따라 흐르는 것에 주목하세요. 우리는 앞쪽의 단어들의 출력에 신경쓰지 않습니다. 입력이 완료되었을 때, 출력에 신경쓰기 시작합니다. 우리는 (출력으로 나온) 모든 단어를 모델에 다시 공급(feed)합니다.
<span class="tooltiptext">
This is an X-ray of an input and response (“Okay human”) within GPT3. Notice how every token flows through the entire layer stack. We don’t care about the output of the first words. When the input is done, we start caring about the output. We feed every word back into the model.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/08-gpt3-tokens-transformer-blocks.gif" />
<br />
</div>
<div class="tooltip">
<p><a href="https://twitter.com/sharifshameem/status/1284421499915403264">리액트 코드 생성 예제</a>에서는, 설명이 몇 가지 설명=>코드 예제와 함께 입력 프롬프트(녹색)가 될 것이라 생각합니다. 그리고 리액트 코드는 이 그림의 분홍색 토큰들처럼, 토큰 단위로 생성될 것입니다.
<span class="tooltiptext">
In the <a href="https://twitter.com/sharifshameem/status/1284421499915403264">React code generation example</a>, the description would be the input prompt (in green), in addition to a couple of examples of description=>code, I believe. And the react code would be generated like the pink tokens here token after token.
</span></p>
</div>
<div class="tooltip">
<p>제 가정은, 마중물(priming) 예제와 설명이 예제와 결과를 구분하는 특정 토큰과 함께 입력으로 추가된다는 것 입니다. 그런 다음 모델로 공급(feed) 됩니다.
<span class="tooltiptext">
My assumption is that the priming examples and the description are appended as input, with specific tokens separating examples and the results. Then fed into the model.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/09-gpt3-generating-react-code-example.gif" />
<br />
</div>
<div class="tooltip">
<p>이렇게 동작한다는 것 자체가 인상적입니다. GPT3를 위한 fine-tuning은 아직 나오지도 않았기 때문에, 앞으로의 가능성은 더욱 놀라울 것입니다.
<span class="tooltiptext">
It’s impressive that this works like this. Because you just wait until fine-tuning is rolled out for the GPT3. The possibilities will be even more amazing.
</span></p>
</div>
<div class="tooltip">
<p>Fine-tuning은 모델의 weight를 실제로 업데이트 해서 특정 태스크에 대한 모델(의 성능)을 더 좋게 만듭니다.
<span class="tooltiptext">
Fine-tuning actually updates the model’s weights to make the model better at a certain task.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/10-gpt3-fine-tuning.gif" />
<br />
</div>
<hr />
<h2 id="추가-정보">추가 정보<a href="#additional-info" name="additional-info">.</a></h2>
<ul>
<li>이 글은 GPT3에 대해 이해하기 쉽게 그림으로 설명한 포스팅을 저자인 Jay Alammar님의 허락을 받고 번역한 글 입니다. 원문은 <a href="https://jalammar.github.io/how-gpt3-works-visualizations-animations/">How GPT3 Works - Visualizations and Animations</a>에서 확인하실 수 있습니다.</li>
<li>원서/영문블로그를 보실 때 term에 대한 정보 호환을 위해, 이 분야에서 사용하고 있는 단어, 문구에 대해 가급적 번역하지 않고 원문 그대로 두었습니다. 그리고, 직역 보다는 개념이나 의미에 대한 설명을 쉽게 하는 문장 쪽으로 더 무게를 두어 번역 했습니다. 번역에 대한 의견이나 수정 사항은 아래 댓글 창에 남겨주세요.</li>
<li>번역문에 대응하는 영어 원문을 보고싶으신 분들을 위해 <a href="https://nlpinkorean.github.io">찬</a>님께서 만들어두신 툴팁 도움말 기능(해당 문단에 마우스를 올리면 (모바일의 경우 터치) 원문을 확인할 수 있는 기능)을 가져와서 적용했습니다. 감사합니다.</li>
</ul>chloamme이 글은 Jay Alammar님의 글을 번역한 글입니다. [추가정보] This post is a translated version of How GPT3 Works - Visualizations and Animations by Jay Alammar.[번역] 그림으로 설명하는 GPT-2 (Transformer Language Model 시각화)2021-12-08T00:00:00+00:002021-12-08T00:00:00+00:00/2021/12/08/illustrated-gpt2-korean<div class="tooltip">
<blockquote>
<p>이 글은 <a href="https://jalammar.github.io/illustrated-gpt2/">Jay Alammar님의 글</a>을 번역한 글입니다. [<a href="#additional-info">추가정보</a>]
<span class="tooltiptext">
This post is a translated version of <a href="https://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT-2 (Visualizing Transformer Language Models)</a> by Jay Alammar.
</span></p>
</blockquote>
</div>
<div class="img-div-any-width">
<img src="/images/gpt2/openAI-GPT-2-3.png" />
<br />
</div>
<div class="tooltip">
<p>올 해, 우리는 눈부시게 빛나는 머신러닝 어플리케이션을 보았습니다. <a href="https://openai.com/blog/better-language-models/">OpenAI의 GPT-2</a>는 조리있고 강렬한 에세이들을 써내는 엄청난 능력을 보여주었습니다. 우리가 현재의 language model들이 만들어낼 것으로 기대하는 수준 이상이었습니다. GPT-2는 특별히 새로운 아키텍처는 아닙니다 – GPT-2의 아키텍처는 decoder로만 구성된 transformer와 매우 유사합니다. 하지만 GPT-2는 방대한 양의 dataset으로 훈련된, transformer 기반의 매우 큰 language model입니다. 이번 글에서, 이 모델이 이러한 결과를 만들어낼 수 있게 한 아키텍처를 알아보고자 합니다. self-attention 레이어를 깊이 있게 살펴보고, language model 그 이상의 decoder-only transformer를 위한 어플리케이션들을 살펴보도록 하겠습니다.
<span class="tooltiptext">
This year, we saw a dazzling application of machine learning. <a href="https://openai.com/blog/better-language-models/">The OpenAI GPT-2</a> exhibited impressive ability of writing coherent and passionate essays that exceed what we anticipated current language models are able to produce. The GPT-2 wasn’t a particularly novel architecture – it’s architecture is very similar to the decoder-only transformer. The GPT2 was, however, a very large, transformer-based language model trained on a massive dataset. In this post, we’ll look at the architecture that enabled the model to produce its results. We will go into the depths of its self-attention layer. And then we’ll look at applications for the decoder-only transformer beyond language modeling.
</span></p>
</div>
<div class="tooltip">
<p>저의 이번 목표는, 이전 글인 <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>에 더 많은 시각적 설명을 더하여 transformer의 내부 동작 원리를 설명하고, 최초의 논문으로 부터 어떻게 발전되어 왔는지에 대해 설명하는 것 입니다. 이러한 시각적 설명을 통해, 내부 동작 방식이 계속 진화되고 있는 transformer 기반의 후속 모델들이 더 쉽게 설명이 되었으면 하는 바람이 있습니다.
<span class="tooltiptext">
My goal here is to also supplement my earlier post, <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>, with more visuals explaining the inner-workings of transformers, and how they’ve evolved since the original paper. My hope is that this visual language will hopefully make it easier to explain later Transformer-based models as their inner-workings continue to evolve.
</span></p>
</div>
<!--more-->
<div style="font-size:75%; background-color:#eee; border: 1px solid #bbb; display: table; padding: 7px">
<div style="text-align:center">
<p><strong>목차</strong></p>
</div>
<ul>
<li><strong><a href="#part-1-got-and-language-modeling">파트 1: GPT2와 Language Modeling</a></strong>
<ul>
<li>Language Model이란</li>
<li>Language Modeling을 위한 Transformer</li>
<li>BERT와 한가지 차이점</li>
<li>Transformer block의 진화</li>
<li>GPT-2의 내부를 살펴보기</li>
<li>더 깊이 알아보기</li>
<li>파트 1의 마무리: 몇가지 안내사항</li>
</ul>
</li>
<li><strong><a href="#part-2-illustrated-self-attention">파트 2: 그림으로 설명하는 Self-Attention</a></strong>
<ul>
<li>(Masking 없는) Self-Attention</li>
<li>1- Query, Key, Value 벡터 생성</li>
<li>2- Score 계산</li>
<li>3- 전체 합산</li>
<li>그림으로 설명하는 Masked Self-Attention</li>
<li>GPT-2의 Masked Self-Attention</li>
<li>드디어 해냈습니다! ‘It’을 만들어냈습니다!</li>
</ul>
</li>
<li><strong><a href="#part-3-beyond-language-modeling">파트 3: Language Modeling, 그 이상의 것</a></strong>
<ul>
<li>기계 번역(Machine Translation)</li>
<li>요약(Summarization)</li>
<li>전이 학습(Transfer Learning)</li>
<li>음악 생성(Music Generation)</li>
</ul>
</li>
</ul>
</div>
<!-- <div class="tooltip" markdown="1">
<span class="tooltiptext">
* **[Part 1: GPT2 And Language Modeling](#part-1-got-and-language-modeling)**
* What is a Language Model
* Transformers for Language Modeling
* One Difference From BERT
* The Evolution of The Transformer Block
* Crash Course in Brain Surgery: Looking Inside GPT-2
* A Deeper Look Inside
* End of part #1: The GPT-2, Ladies and Gentlemen
* **[Part 2: The Illustrated Self-Attention](#part-2-illustrated-self-attention)**
* Self-Attention (without masking)
* 1- Create Query, Key, and Value Vectors
* 2- Score
* 3- Sum
* The Illustrated Masked Self-Attention
* GPT-2 Masked Self-Attention
* Beyond Language modeling
* You've Made it!
* **[Part 3: Beyond Language Modeling](#part-3-beyond-language-modeling)**
* Machine Translation
* Summarization
* Transfer Learning
* Music Generation
</span>
</div> -->
<h2 id="파트-1-gpt2와-language-modeling-">파트 #1: GPT2와 Language Modeling <a href="#part-1-got-and-language-modeling" name="part-1-got-and-language-modeling">#</a></h2>
<div class="tooltip">
<p>그래서, language model이 정확히 무엇일까요?
<span class="tooltiptext">
So what exactly is a language model?
</span></p>
</div>
<h3 id="language-model-이란">Language Model 이란</h3>
<div class="tooltip">
<p>이전 글인 <a href="https://jalammar.github.io/illustrated-word2vec/">The Illustrated Word2vec</a>(<a href="https://databreak.netlify.app/2019-04-25-Illustrated_word2vec/">한국어 번역본</a>)에서 language model이 무엇인지 살펴보았습니다 – 기본적으로는, 문장의 일부를 보고 다음 단어를 예측하는 것을 할 수 있는 머신 러닝 모델입니다. 가장 유명한 language model로는 현재까지 입력된 것을 보고 다음 단어를 제안하는 스마트폰 키보드가 있습니다.
<span class="tooltiptext">
In <a href="https://jalammar.github.io/illustrated-word2vec/">The Illustrated Word2vec</a>, we’ve looked at what a language model is – basically a machine learning model that is able to look at part of a sentence and predict the next word. The most famous language models are smartphone keyboards that suggest the next word based on what you’ve currently typed.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/word2vec/swiftkey-keyboard.png" />
<br />
</div>
<div class="tooltip">
<p>이런 점에서는 GPT-2가 기본적으로 키보드 앱의 다음단어 예측 기능과 동일하다고 말할 수도 있겠지만, 키보드 앱 이상으로 훨씬 더 크고 더욱 복잡합니다. GPT-2는 OpenAI에서 연구 노력의 일환으로 인터넷을 크롤링한 대량의 dataset(40GB)인 WebText으로 학습 되었습니다. 스토리지 크기 측면으로 비교를 해본다면, 제가 사용하고 있는 키보드 앱인 SwiftKey가 78MB의 공간을 차지합니다. 훈련된 GPT-2 중에서 가장 작은 것의 경우에, 모든 파라미터를 저장하기 위해 500MB의 저장공간을 차지합니다. 가장 큰 GPT-2의 경우에는 사이즈가 13배 크기 때문에, 6.5GB 이상의 저장 공간을 차지할 수 있습니다.
<span class="tooltiptext">
In this sense, we can say that the GPT-2 is basically the next word prediction feature of a keyboard app, but one that is much larger and more sophisticated than what your phone has. The GPT-2 was trained on a massive 40GB dataset called WebText that the OpenAI researchers crawled from the internet as part of the research effort. To compare in terms of storage size, the keyboard app I use, SwiftKey, takes up 78MBs of space. The smallest variant of the trained GPT-2, takes up 500MBs of storage to store all of its parameters. The largest GPT-2 variant is 13 times the size so it could take up more than 6.5 GBs of storage space.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt2/gpt2-sizes.png" />
<br />
</div>
<div class="tooltip">
<p>GPT-2를 시험해보는 가장 좋은 방법은 <a href="https://gpt2.apps.allenai.org/?text=Joel%20is">AllenAI의 GPT-2 Explorer</a>를 이용하는 것 입니다. GPT-2를 사용하여, (확률 점수와 함께) 다음 단어로 가능한 10개의 예측을 표시해줍니다. 그 것들 중에서 한 단어를 선택한 뒤, 다시 예측된 리스트를 보고, 계속해서 글을 작성해나갈 수 있습니다.
<span class="tooltiptext">
One great way to experiment with GPT-2 is using the <a href="https://gpt2.apps.allenai.org/?text=Joel%20is">AllenAI GPT-2 Explorer</a>. It uses GPT-2 to display ten possible predictions for the next word (alongside their probability score). You can select a word then see the next list of predictions to continue writing the passage.
</span></p>
</div>
<h3 id="language-modeling을-위한-transformer">Language Modeling을 위한 Transformer</h3>
<div class="tooltip">
<p><a href="https://jalammar.github.io/illustrated-transformer/">Illustrated Transformer</a>에서 본 것과 같이, 최초의 transformer 모델은 encoder와 decoder로 구성되어 있습니다 – 그 각각은 우리가 transformer 블록(block) 이라고 부르는 것들을 쌓아놓은 것(stacking) 입니다. 이 아키텍처는 원래 기계 번역 용도로 적합했었습니다 – encoder-decoder 아키텍처가 과거에는 성공적이었습니다.
<span class="tooltiptext">
As we’ve seen in The <a href="https://jalammar.github.io/illustrated-transformer/">Illustrated Transformer</a>, the original transformer model is made up of an encoder and decoder – each is a stack of what we can call transformer blocks. That architecture was appropriate because the model tackled machine translation – a problem where encoder-decoder architectures have been successful in the past.
</span></p>
</div>
<div class="img-div">
<img src="/images/xlnet/transformer-encoder-decoder.png" />
<br />
</div>
<div class="tooltip">
<p>많은 후속 연구들에서는 architecture에서 encoder 또는 decoder 중 하나를 없애고, 하나의 transformer 블록층(block stack)만을 사용합니다 – block들을 현실적으로 가능한 한 높이 쌓아 올리고, 대량의 텍스트들을 학습에 이용하고, 엄청난 양의 컴퓨팅을 투하합니다. (이러한 language model들을 학습하는데 수십만 달러, <a href="https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/">AlphaStar</a>의 경우 수백만 달러 소요)
<span class="tooltiptext">
A lot of the subsequent research work saw the architecture shed either the encoder or decoder, and use just one stack of transformer blocks – stacking them up as high as practically possible, feeding them massive amounts of training text, and throwing vast amounts of compute at them (hundreds of thousands of dollars to train some of these language models, likely millions in the case of <a href="https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/">AlphaStar</a>).
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt2/gpt-2-transformer-xl-bert-3.png" />
<br />
</div>
<div class="tooltip">
<p>얼마나 이 block들을 높이 쌓을 수 있을까요? 이것이 GPT2 모델 크기를 결정짓는 주요 구별 요소임이 밝혀졌습니다.
<span class="tooltiptext">
How high can we stack up these blocks? It turns out that’s one of the main distinguishing factors between the different GPT2 model sizes:
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt2/gpt2-sizes-hyperparameters-3.png" />
<br />
</div>
<h3 id="bert와-한가지-차이점">BERT와 한가지 차이점</h3>
<blockquote class="subtle">
<strong>First Law of Robotics</strong><br />
A robot may not injure a human being or, through inaction, allow a human being to come to harm.
<br />
(로보틱스 제1원칙: 로봇은 인간에 해를 가하거나, 혹은 행동을 하지 않음으로써 인간에게 해가 가도록 해서는 안 된다.)
</blockquote>
<div class="tooltip">
<p>GPT-2는 transformer의 decoder 블럭으로 구성됩니다. 반대로 BERT는, transformer의 encoder 블럭을 사용합니다. 다음 섹션에서 이 차이를 살펴보겠습니다. 하지만, 이 둘 간의 주요한 차이는 GPT2가 다른 전통적인 language model들 처럼 한번에 하나의 token을 출력한다는 것 입니다. 잘 훈련된 GPT-2가 로보틱스의 제 1원칙을 출력하도록 해봅시다.
<span class="tooltiptext">
The GPT-2 is built using transformer decoder blocks. BERT, on the other hand, uses transformer encoder blocks. We will examine the difference in a following section. But one key difference between the two is that GPT2, like traditional language models, outputs one token at a time. Let’s for example prompt a well-trained GPT-2 to recite the first law of robotics:
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/xlnet/gpt-2-output.gif" />
<br />
</div>
<div class="tooltip">
<p>이러한 모델들은 각 token이 생성된 후에 입력 시퀀스에 더해지는 방식으로 동작합니다. 그러한 새 시퀀스는 다음 단계에서 모델의 입력으로 들어갑니다. 이 것을 “auto-regression”이라고 부릅니다. 이 것은 <a href="https://karpathy.github.io/2015/05/21/rnn-effectiveness/">RNN을 엄청나게 효과적으로 만든</a> 방법 중 하나 입니다.
<span class="tooltiptext">
The way these models actually work is that after each token is produced, that token is added to the sequence of inputs. And that new sequence becomes the input to the model in its next step. This is an idea called “auto-regression”. This is one of the ideas that <a href="https://karpathy.github.io/2015/05/21/rnn-effectiveness/">made RNNs unreasonably effective</a>.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/xlnet/gpt-2-autoregression-2.gif" />
<br />
</div>
<div class="tooltip">
<p>GPT2와 TransformerXL, XLNet과 같은 후속 모델들은 본질적으로 auto-regressive입니다. BERT는 그렇지 않습니다. 여기에는 상충관계(trade off)가 있습니다. auto-regression 특성을 잃는 대신, BERT는 더 좋은 결과를 내기 위해 단어의 양쪽 방향으로부터의 context를 활용할 수 있는 능력을 얻었습니다. XLNet은 autoregression을 되살리면서, 양쪽 방향의 context를 활용하는 대안적 방법을 찾았습니다.
<span class="tooltiptext">
The GPT2, and some later models like TransformerXL and XLNet are auto-regressive in nature. BERT is not. That is a trade off. In losing auto-regression, BERT gained the ability to incorporate the context on both sides of a word to gain better results. XLNet brings back autoregression while finding an alternative way to incorporate the context on both sides.
</span></p>
</div>
<h3 id="transformer-block의-진화">Transformer block의 진화</h3>
<div class="tooltip">
<p><a href="https://arxiv.org/abs/1706.03762">최초의 transformer 논문(Attention Is All You Need)</a>에서는 두가지 타입의 transformer block에 대해 소개합니다.
<span class="tooltiptext">
The <a href="https://arxiv.org/abs/1706.03762">initial transformer paper</a> introduced two types of transformer blocks:
</span></p>
</div>
<h4 id="encoder-block">Encoder block</h4>
<div class="tooltip">
<p>먼저 encoder block입니다:
<span class="tooltiptext">
First is the encoder block:
</span></p>
</div>
<div class="img-div">
<img src="/images/xlnet/transformer-encoder-block-2.png" />
<br />
<div class="tooltip">
<p>최초의 transformer 논문의 encoder block은 특정 최대 시퀀스 길이(예: 512개의 token)까지의 입력을 받을 수 있습니다. 입력 시퀀스가 이 제한보다 짧아도 괜찮습니다. 시퀀스의 나머지 부분을 패딩(padding)으로 채우면 됩니다.
<span class="tooltiptext">
An encoder block from the original transformer paper can take inputs up until a certain max sequence length (e.g. 512 tokens). It’s okay if an input sequence is shorter than this limit, we can just pad the rest of the sequence.
</span></p>
</div>
</div>
<h4 id="decoder-block">Decoder block</h4>
<div class="tooltip">
<p>두번째로, encoder block의 아키텍처를 살짝 변형한 decoder block이 있습니다 – encoder로 부터의 특정 segment에 attention을 줄 수 있도록 하는 레이어가 있습니다.
<span class="tooltiptext">
Second, there’s the decoder block which has a small architectural variation from the encoder block – a layer to allow it to pay attention to specific segments from the encoder:
</span></p>
</div>
<div class="img-div">
<image src="/images/xlnet/transformer-decoder-block-2.png" />
<br />
</div>
<div class="tooltip">
<p>decoder의 self-attention 레이어의 주요 차이점은, 앞으로 나올 token들을 masking한다는 것입니다 – BERT처럼 word를 [mask]로 치환하는 것이 아니라, self-attention 계산 시에 계산 대상 위치의 오른쪽에 있는 token들로부터의 정보를 차단하는 방식입니다.
<span class="tooltiptext">
One key difference in the self-attention layer here, is that it masks future tokens – not by changing the word to [mask] like BERT, but by interfering in the self-attention calculation blocking information from tokens that are to the right of the position being calculated.
</span></p>
</div>
<div class="tooltip">
<p>예를 들어, #4번의 경로를 볼 때, 현재와 이전 token들만이 attention 됨을 알 수 있습니다.
<span class="tooltiptext">
If, for example, we’re to highlight the path of position #4, we can see that it is only allowed to attend to the present and previous tokens:
</span></p>
</div>
<div class="img-div">
<image src="/images/xlnet/transformer-decoder-block-self-attention-2.png" />
<br />
</div>
<div class="tooltip">
<p>(BERT가 사용하는) self-attention과 (GPT-2가 사용하는) masked self-attention의 차이를 명확히 구분하는 것이 중요합니다. 일반 self-attention block은 각 위치가 자신보다 오른쪽에 있는 token들을 엿볼 수 있도록 허용합니다. masked self-attention은 이를 막습니다:
<span class="tooltiptext">
It’s important that the distinction between self-attention (what BERT uses) and masked self-attention (what GPT-2 uses) is clear. A normal self-attention block allows a position to peek at tokens to its right. Masked self-attention prevents that from happening:
</span></p>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/self-attention-and-masked-self-attention.png" />
<br />
</div>
<h4 id="the-decoder-only-block">The Decoder-Only Block</h4>
<div class="tooltip">
<p>최초의 논문에 이어, <a href="https://arxiv.org/pdf/1801.10198.pdf">Generating Wikipedia by Summarizing Long Sequences 논문</a>에서는 language modeling이 가능한, 다른 배열의 transformer block을 제안했습니다. 이 모델은 transformer의 encoder를 버렸습니다. 그러한 이유로, 이 모델을 “Transformer-Decoder”라고 부르겠습니다. 이 초창기의 transformer 기반의 language model은 6개의 transformer decoder block을 쌓아 구성했습니다:
<span class="tooltiptext">
Subsequent to the original paper, <a href="https://arxiv.org/pdf/1801.10198.pdf">Generating Wikipedia by Summarizing Long Sequences</a> proposed another arrangement of the transformer block that is capable of doing language modeling. This model threw away the Transformer encoder. For that reason, let’s call the model the “Transformer-Decoder”. This early transformer-based language model was made up of a stack of six transformer decoder blocks:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/xlnet/transformer-decoder-intro.png" />
<br />
<div class="tooltip">
<p>이 decoder block들은 동일합니다. 첫번째 block을 확대해 놓았으므로, self-attention 레이어가 masked 버전임을 볼 수 있습니다. 이 모델이 이제 특정 segment에서 최대 4,000개의 token까지 참조할 수 있다는 점에 주목하세요 – 최초의 transformer에서 512개였던 것에 비하면 엄청난 업그레이드입니다.
<span class="tooltiptext">
The decoder blocks are identical. I have expanded the first one so you can see its self-attention layer is the masked variant. Notice that the model now can address up to 4,000 tokens in a certain segment – a massive upgrade from the 512 in the original transformer.
</span></p>
</div>
</div>
<div class="tooltip">
<p>이 block들은, 두번째 self-attention layer를 제거한 것을 제외하면, 최초의 decoder block들과 매우 유사합니다. 비슷한 아키텍처가 <a href="https://arxiv.org/pdf/1808.04444.pdf">Character-Level Language Modeling with Deeper Self-Attention 논문</a>에서, 한번에 하나의 문자(letter/character)를 예측하는 language model을 만들기 위해 실험되었습니다.
<span class="tooltiptext">
These blocks were very similar to the original decoder blocks, except they did away with that second self-attention layer. A similar architecture was examined in <a href="https://arxiv.org/pdf/1808.04444.pdf">Character-Level Language Modeling with Deeper Self-Attention</a> to create a language model that predicts one letter/character at a time.
</span></p>
</div>
<div class="tooltip">
<p>OpenAI의 GPT-2 모델은 이러한 decoder만으로 구성된 block들(decoder-only blocks)을 사용합니다.
<span class="tooltiptext">
The OpenAI GPT-2 model uses these decoder-only blocks.
</span></p>
</div>
<h3 id="gpt-2의-내부를-살펴보기">GPT-2의 내부를 살펴보기</h3>
<blockquote class="subtle">
<p>Look inside and you will see,
The words are cutting deep inside my brain.
Thunder burning, quickly burning,
Knife of words is driving me insane, insane yeah.
~<strong><a href="https://en.wikipedia.org/wiki/Budgie_(band)">Budgie</a></strong>
<br />
(Budgie의 노래 “Crash Course in Brain Surgery” 중에서)</p>
</blockquote>
<div class="tooltip">
<p>훈련된 GPT-2를 우리의 수술대 위에 올려놓고, 어떻게 동작하는지 살펴봅시다.
<span class="tooltiptext">
Let’s lay a trained GPT-2 on our surgery table and look at how it works.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt-2-layers-2.png" />
<br />
<div class="tooltip">
<p>GPT-2는 1024개의 token을 처리할 수 있습니다. 각 token은 각자의 경로를 따라서 모든 decoder block으로 흘러갑니다.
<span class="tooltiptext">
The GPT-2 can process 1024 tokens. Each token flows through all the decoder blocks along its own path.
</span></p>
</div>
</div>
<div class="tooltip">
<p>훈련된 GPT-2를 돌려보는 가장 간단한 방법은 모델이 스스로 램블링(rambling; 두서없이 이어서 말하기)하도록 두는 것입니다 (기술적 용어로는 <em>generating unconditional samples</em>) – 또는, prompt를 주어 특정 주제에 대해 말하게 할 수도 있습니다 (<em>interactive conditional samples</em> 생성이라고도 합니다). rambling의 경우, 간단히 start token을 넣어주면 모델이 단어들을 생성하기 시작합니다 (훈련된 모델은 <code class="language-plaintext highlighter-rouge"><|endoftext|></code>를 start token으로 사용하지만, 여기서는 <code class="language-plaintext highlighter-rouge"><s></code>라고 부르겠습니다).
<span class="tooltiptext">
The simplest way to run a trained GPT-2 is to allow it to ramble on its own (which is technically called <em>generating unconditional samples</em>) – alternatively, we can give it a prompt to have it speak about a certain topic (a.k.a generating <em>interactive conditional samples</em>). In the rambling case, we can simply hand it the start token and have it start generating words (the trained model uses <code class="language-plaintext highlighter-rouge"><|endoftext|></code> as its start token. Let’s call it <code class="language-plaintext highlighter-rouge"><s></code> instead).
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-simple-output-2.gif" />
<p><br /></p>
</div>
<div class="tooltip">
<p>모델은 input token 하나를 가지고 있으므로, 경로가 1개만 활성화 됩니다. token이 모든 레이어를 연속적으로 거쳐 처리되고 나면, 그 경로를 따라 vector가 생성됩니다. 그 vector에 모델의 어휘(vocab) 전체(모델이 알고 있는 모든 단어들, GPT-2의 경우엔 50,000개의 단어)에 대해 점수(score)가 매겨질 수 있습니다. 이 경우에 우리는 가장 높은 확률을 갖는 ‘the’를 선택했습니다. 하지만, 선택에 변화를 주는 방법도 있습니다 – 키보드 앱에서 제안하는 단어를 클릭하기를 계속하면, 가끔 반복 루프에 빠질 때가 있고, 이 때 유일한 탈출 방법은 두번째나 세번째로 제안한 단어를 선택하는 것 입니다. 동일한 상황이 여기서도 있습니다. GPT-2는 top-k라는 parameter를 가지고 있어서 우리는 모델이 top 단어(top-k가 1인 경우)가 아닌 다른 단어를 샘플링하게 만들 수도 있습니다.
<span class="tooltiptext">
The model only has one input token, so that path would be the only active one. The token is processed successively through all the layers, then a vector is produced along that path. That vector can be scored against the model’s vocabulary (all the words the model knows, 50,000 words in the case of GPT-2). In this case we selected the token with the highest probability, ‘the’. But we can certainly mix things up – you know how if you keep clicking the suggested word in your keyboard app, it sometimes can get stuck in repetitive loops where the only way out is if you click the second or third suggested word. The same can happen here. GPT-2 has a parameter called top-k that we can use to have the model consider sampling words other than the top word (which is the case when top-k = 1).
</span></p>
</div>
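본문에서 언급한 top-k 샘플링을 numpy로 간단히 스케치하면 다음과 같습니다. score 값과 vocab 크기는 설명을 위해 가정한 것입니다.

```python
import numpy as np

def top_k_sample(scores, k, rng):
    # score 상위 k개의 token만 남기고, softmax 확률에 비례해 하나를 샘플링합니다.
    top_ids = np.argsort(scores)[-k:]
    e = np.exp(scores[top_ids] - scores[top_ids].max())
    probs = e / e.sum()
    return int(rng.choice(top_ids, p=probs))

rng = np.random.default_rng(0)
scores = np.array([0.1, 2.0, 0.3, 1.5])    # vocab이 4개뿐이라고 가정한 toy score
print(top_k_sample(scores, k=1, rng=rng))  # top-k = 1이면 항상 최고 score인 index 1
```

top-k를 1보다 크게 잡으면 최고 score가 아닌 단어도 뽑힐 수 있어, 본문에서 말한 반복 루프를 피하는 데 도움이 됩니다.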
<div class="tooltip">
<p>다음 단계에서, 첫번째 단계의 출력을 입력 시퀀스에 덧붙인 뒤 모델이 다음 예측을 수행합니다:
<span class="tooltiptext">
In the next step, we add the output from the first step to our input sequence, and have the model make its next prediction:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt-2-simple-output-3.gif" />
<br />
</div>
<div class="tooltip">
<p>이번 계산에서는 두번째 경로만 활성화된다는 점에 주목하세요. GPT-2의 각 레이어는 첫번째 token에 대한 자신의 interpretation을 유지하고, 두번째 token을 처리할 때 이를 사용합니다 (뒤의 self-attention 섹션에서 더 자세히 알아보겠습니다). GPT-2는 두번째 token에 비추어 첫번째 token을 재해석(re-interpret)하지 않습니다.
<span class="tooltiptext">
Notice that the second path is the only that’s active in this calculation. Each layer of GPT-2 has retained its own interpretation of the first token and will use it in processing the second token (we’ll get into more detail about this in the following section about self-attention). GPT-2 does not re-interpret the first token in light of the second token.
</span></p>
</div>
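각 레이어가 첫번째 token의 interpretation을 유지한다는 것은, 구현 관점에서는 앞선 token들의 key/value를 보관해두고 재사용하는 것에 해당합니다. 아래는 그 개념만 담은 스케치이며, 실제 GPT-2 구현의 자료구조와는 다르다는 점(가정)을 밝혀둡니다.

```python
class LayerState:
    """한 decoder 레이어가 보관하는, 앞선 token들의 key/value (개념 스케치)."""
    def __init__(self):
        self.keys = []
        self.values = []

    def process_token(self, key, value):
        # 새 token의 key/value만 계산해서 덧붙입니다.
        # 앞선 token들의 값은 재계산(재해석)하지 않고 그대로 재사용합니다.
        self.keys.append(key)
        self.values.append(value)
        return len(self.keys)  # attention이 참조할 수 있는 token의 수

layer = LayerState()
layer.process_token("k1", "v1")      # 첫번째 token 처리
n = layer.process_token("k2", "v2")  # 두번째 token: k1/v1은 그대로 유지
print(n)  # 2
```

이렇게 하면 매 단계마다 전체 시퀀스를 다시 계산할 필요가 없습니다.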
<h3 id="더-깊이-알아보기">더 깊이 알아보기</h3>
<h4 id="입력-encoding">입력 Encoding</h4>
<div class="tooltip">
<p>모델을 더 깊이 이해하기 위해 세부 사항들을 살펴봅시다. 입력부터 시작합니다. 이전에 논의했던 다른 NLP 모델들처럼, GPT-2 모델도 embedding matrix에서 입력 단어의 embedding을 조회합니다 – 이 embedding matrix는 훈련된 모델의 일부로 얻게 되는 구성요소 중 하나입니다.
<span class="tooltiptext">
Let’s look at more details to get to know the model more intimately. Let’s start from the input. As in other NLP models we’ve discussed before, the model looks up the embedding of the input word in its embedding matrix – one of the components we get as part of a trained model.
</span></p>
</div>
<div class="img-div">
<image src="/images/gpt2/gpt2-token-embeddings-wte-2.png" />
<br />
<div class="tooltip">
<p>각 행은 단어 embedding 입니다: 단어를 표현하고 그 의미를 캡쳐(함유)하는 숫자형태의 표현(representation) 리스트 입니다. 이 리스트의 크기는 GPT2 모델 사이즈에 따라 다릅니다. 제일 작은 모델은 단어(토큰) 당 768 embedding 크기를 사용합니다.
<span class="tooltiptext">
Each row is a word embedding: a list of numbers representing a word and capturing some of its meaning. The size of that list is different in different GPT2 model sizes. The smallest model uses an embedding size of 768 per word/token.
</span></p>
</div>
</div>
<div class="tooltip">
<p>처음에는, embedding matrix에서 시작 토큰 <code class="language-plaintext highlighter-rouge"><s></code>의 embedding을 조회합니다. 모델의 첫번째 block에 이 정보를 전달하기 전에, 위치 인코딩(positional encoding) 정보를 통합해야 합니다. – positional encoding은 transformer block에게, 시퀀스 상에서의 word들의 순서 정보를 알려주는 신호(정보)입니다. 훈련된 모델을 구성하는 또다른 한 부분은 입력의 1024개 위치 각각에 대한 positional encoding vector입니다.
<span class="tooltiptext">
So in the beginning, we look up the embedding of the start token <code class="language-plaintext highlighter-rouge"><s></code> in the embedding matrix. Before handing that to the first block in the model, we need to incorporate positional encoding – a signal that indicates the order of the words in the sequence to the transformer blocks. Part of the trained model is a matrix that contains a positional encoding vector for each of the 1024 positions in the input.
</span></p>
</div>
<div class="img-div">
<image src="/images/gpt2/gpt2-positional-encoding.png" />
<br />
</div>
<div class="tooltip">
<p>이로써 입력 단어들이 transformer의 첫번째 block에 전달되기 전에 어떤 처리를 거치는지 살펴봤습니다. 또한 훈련된 GPT-2를 구성하는 두개의 weight matrix(가중치 행렬)도 알게 되었습니다.
<span class="tooltiptext">
With this, we’ve covered how input words are processed before being handed to the first transformer block. We also know two of the weight matrices that constitute the trained GPT-2.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-input-embedding-positional-encoding-3.png" />
<br />
<div class="tooltip">
<p>word를 첫번째 transformer block에 전달한다는 것은, 그 word의 embedding을 조회하고 #1번 위치의 positional encoding vector를 더하는 것을 의미합니다.
<span class="tooltiptext">
Sending a word to the first transformer block means looking up its embedding and adding up the positional encoding vector for position #1.
</span></p>
</div>
</div>
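token embedding과 positional encoding을 더해 첫번째 block의 입력을 만드는 과정은 다음처럼 스케치할 수 있습니다. 행렬 크기는 toy 크기이고(실제 GPT-2 small은 vocab 50,257개, 위치 1,024개, embedding 768차원), 행렬 값은 무작위로 가정했습니다.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_pos, d_model = 10, 8, 4       # toy 크기 (가정)
wte = rng.normal(size=(vocab_size, d_model))  # token embedding matrix
wpe = rng.normal(size=(max_pos, d_model))     # 학습된 positional encoding matrix

def embed(token_id, position):
    # 첫번째 block에 들어가는 입력
    # = 해당 token의 embedding + 해당 위치의 positional vector
    return wte[token_id] + wpe[position]

x = embed(token_id=3, position=0)
print(x.shape)  # (4,)
```

두 matrix(wte, wpe)가 본문에서 말한, 훈련된 GPT-2를 구성하는 두 개의 weight matrix에 해당합니다.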
<h4 id="상위-stack으로의-이동">상위 Stack으로의 이동</h4>
<div class="tooltip">
<p>이제 첫번째 block은 token을 self-attention 프로세스를 통해 전달하고, neural network 레이어로 전달하여 처리할 수 있습니다. 첫번째 transformer block이 이 token을 처리하면, 그 결과 vector를 다음 block에서 처리하도록 윗쪽 stack으로 올립니다. 프로세스는 각 block마다 동일하지만 각 block은 각자의 self-attention 및 neural network 하위 레이어에 대한 weight들을 가지고 있습니다.
<span class="tooltiptext">
The first block can now process the token by first passing it through the self-attention process, then passing it through its neural network layer. Once the first transformer block processes the token, it sends its resulting vector up the stack to be processed by the next block. The process is identical in each block, but each block has its own weights in both self-attention and the neural network sublayers.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-transformer-block-vectors-2.png" />
<br />
</div>
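각 block이 self-attention과 neural network(FFN)를 차례로 거치며 결과 vector를 윗 block으로 넘기는 흐름을 스케치해봅니다. self-attention은 다음 섹션에서 다루므로 여기서는 항등 함수로 대체했고, weight 크기와 값은 설명용 가정입니다.

```python
import numpy as np

def gelu(x):
    # GPT-2가 사용하는 GELU 활성함수의 tanh 근사식
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class Block:
    # 각 block은 self-attention과 FFN 하위 레이어에 대해 각자의 weight를 갖습니다.
    def __init__(self, d, rng):
        self.w1 = rng.normal(size=(d, 4 * d)) * 0.02
        self.w2 = rng.normal(size=(4 * d, d)) * 0.02

    def __call__(self, x):
        a = x  # 단순화: self-attention 부분은 항등으로 대체 (가정)
        return a + gelu(a @ self.w1) @ self.w2  # residual 연결 + FFN

rng = np.random.default_rng(0)
blocks = [Block(8, rng) for _ in range(12)]  # GPT-2 small은 12개의 decoder block
x = rng.normal(size=(8,))
for b in blocks:  # 각 block의 결과 vector가 다음 block으로 올라갑니다
    x = b(x)
print(x.shape)  # (8,)
```

프로세스는 block마다 동일하지만 weight는 block마다 다르다는 점이 코드에도 그대로 드러납니다.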
<h4 id="self-attention-recap">Self-Attention Recap</h4>
<div class="tooltip">
<p>언어는 context(문맥)에 매우 의존적입니다. 예를 들어, 제2원칙을 봅시다:
<span class="tooltiptext">
Language heavily relies on context. For example, look at the second law:
</span></p>
</div>
<blockquote class="subtle">
<strong>Second Law of Robotics</strong><br />
A robot must obey the orders given <strong style="color:#D81B60">it</strong> by human beings except where <strong style="color:#689F38">such orders</strong> would conflict with the <strong style="color:#6D4C41">First Law</strong>.
<br />
(로보틱스 제2원칙: 로봇은 인간이 <strong style="color:#D81B60">그것</strong>에 내리는 명령들에 복종해야만 하며, 단 <strong style="color:#689F38">이러한 명령들</strong>이 <strong style="color:#6D4C41">제1원칙</strong>에 위배될 때는 예외로 한다.)
</blockquote>
<div class="tooltip">
<p>문장에서 다른 단어(word)를 참조하는 단어들 3군데를 하이라이트 표기 했습니다. 이 단어들은 참조하는 context와 통합적으로 보지 않으면 이해 또는 처리할 수 없습니다. 모델이 이 문장을 처리할 때, 다음을 알 수 있어야 합니다.
<span class="tooltiptext">
I have highlighted three places in the sentence where the words are referring to other words. There is no way to understand or process these words without incorporating the context they are referring to. When a model processes this sentence, it has to be able to know that:
</span></p>
</div>
<div class="tooltip">
<ul>
<li><strong style="color:#D81B60">그것</strong>은 로봇을 가리킵니다.</li>
<li><strong style="color:#689F38">그러한 명령들</strong>은 이 법칙의 앞부분(참고: 한국어에서는 언어 구조 상 뒷부분에 위치)의, “인간이 그것에 내리는 명령들”을 가리킵니다.</li>
<li><strong style="color:#6D4C41">제 1원칙</strong>은 제 1원칙 전체를 가리킵니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> <strong style="color:#D81B60">it</strong> refers to the robot
<br />
<span>*</span> <strong style="color:#689F38">such orders</strong> refers to the earlier part of the law, namely “the orders given it by human beings”
<br />
<span>*</span> <strong style="color:#6D4C41">The First Law</strong> refers to the entire First Law
</span></li>
</ul>
</div>
<div class="tooltip">
<p>이것이 self-attention이 하는 일입니다. 특정 word를 (neural network에 전달하여) 처리하기 전에, 그 word의 context를 설명해주는 관련 word들에 대한 모델의 이해를 미리 반영합니다. segment 안의 각 word가 얼마나 관련되어 있는지 score를 할당하고, 그 score를 곱한 vector representation들을 합산하는 방식으로 이를 수행합니다.
<span class="tooltiptext">
This is what self-attention does. It bakes in the model’s understanding of relevant and associated words that explain the context of a certain word before processing that word (passing it through a neural network). It does that by assigning scores to how relevant each word in the segment is, and adding up their vector representation.
</span></p>
</div>
<div class="tooltip">
<p>예를 들어, 상단의 block에서 self-attention 레이어는 단어 “it”을 처리할 때 “a robot”에 attention을 줍니다. neural network으로 전달하는 vector는, 그 3개의 단어들의 vector에 각 score들을 곱한 것의 합입니다.
<span class="tooltiptext">
As an example, this self-attention layer in the top block is paying attention to “a robot” when it processes the word “it”. The vector it will pass to its neural network is a sum of the vectors for each of the three words multiplied by their scores.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-example-2.png" />
<br />
</div>
<h4 id="self-attention-프로세스">Self-Attention 프로세스</h4>
<div class="tooltip">
<p>Self-attention은 segment에서 각 token의 경로를 따라 처리됩니다. 중요한 요소들은 다음 세가지 vector들 입니다:
<span class="tooltiptext">
Self-attention is processed along the path of each token in the segment. The significant components are three vectors:
</span></p>
</div>
<div class="tooltip">
<ul>
<li><span class="decoder">Query</span>: Query는 다른 모든 word들과 score를 계산(각 단어마다 고유 key값 사용)하는데 사용되는 현재 단어(word)의 representation 입니다. 우리는 현재 처리 중인 token의 query 값만 고려합니다.</li>
<li><span class="context">Key</span>: Key vector는 segment에서 모든 word들에 대한 레이블과 같습니다. 관련 word를 검색할 때 매칭해보는 항목입니다.</li>
<li><span class="step_no">Value</span>: Value vector는 실제 word representation입니다. 각 단어가 얼마나 관련이 있는지 score를 매기고 나면, 현재의 word를 표현(representation)하기 위해 이 value들을 합산(add up)합니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> <span class="decoder">Query</span>: The query is a representation of the current word used to score against all the other words (using their keys). We only care about the query of the token we’re currently processing.
<br />
<span>*</span> <span class="context">Key</span>: Key vectors are like labels for all the words in the segment. They’re what we match against in our search for relevant words.
<br />
<span>*</span> <span class="step_no">Value</span>: Value vectors are actual word representations, once we’ve scored how relevant each word is, these are the values we add up to represent the current word.
</span></li>
</ul>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/self-attention-example-folders-3.png" />
<br />
</div>
<div class="tooltip">
<p>대략적으로 비유해보자면, 서류 캐비넷에서 어떤 서류를 찾는 것과 같다고 생각할 수 있습니다. Query는 찾고자 하는 주제를 적은 메모지입니다. Key는 캐비넷 안 서류 폴더들에 달린 레이블과 같습니다. 메모지와 레이블을 매칭시키면 그 폴더의 내용물을 꺼내는데, 이 내용물이 바로 value vector입니다. 다만, 하나의 value만 찾는 것이 아니라, 여러 폴더들에서 꺼낸 value들의 혼합을 찾는다는 점이 다릅니다.
<span class="tooltiptext">
A crude analogy is to think of it like searching through a filing cabinet. The query is like a sticky note with the topic you’re researching. The keys are like the labels of the folders inside the cabinet. When you match the tag with a sticky note, we take out the contents of that folder, these contents are the value vector. Except you’re not only looking for one value, but a blend of values from a blend of folders.
</span></p>
</div>
<div class="tooltip">
<p>Query vector를 각 key vector에 곱해서, 각 폴더 별 score 값을 만듭니다 (기술적으로: 내적(dot product) 연산 뒤 softmax 연산 수행).
<span class="tooltiptext">
Multiplying the query vector by each key vector produces a score for each folder (technically: dot product followed by softmax).
</span></p>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/self-attention-example-folders-scores-3.png" />
<br />
</div>
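query와 key의 내적(dot product) 뒤에 softmax를 적용해 score를 만드는 계산을 numpy로 옮기면 다음과 같습니다. vector 값들은 설명을 위해 가정한 것입니다.

```python
import numpy as np

def attention_scores(query, keys):
    # 각 key와 query의 내적(dot product) → softmax로 정규화된 score
    logits = keys @ query
    e = np.exp(logits - logits.max())
    return e / e.sum()

q = np.array([1.0, 0.0])      # 현재 token의 query (가정)
keys = np.array([[1.0, 0.0],  # 각 "폴더"(token)의 key (가정)
                 [0.0, 1.0],
                 [0.5, 0.5]])
s = attention_scores(q, keys)
print(round(float(s.sum()), 6))  # softmax이므로 score의 합은 1.0
```

query와 가장 잘 매칭되는 key(첫번째 폴더)가 가장 높은 score를 받습니다.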
<div class="tooltip">
<p>각 value를 위에서 구한 score과 곱한 뒤, 합산합니다 – self-attention 결과가 나오게 됩니다.
<span class="tooltiptext">
We multiply each value by its score and sum up – resulting in our self-attention outcome.
</span></p>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/gpt2-value-vector-sum.png" />
<br />
</div>
<div class="tooltip">
<p>이 가중치 혼합된 value vector는, 50%는 단어 <code class="language-plaintext highlighter-rouge">robot</code>에, 30%는 <code class="language-plaintext highlighter-rouge">a</code>에, 19%는 <code class="language-plaintext highlighter-rouge">it</code>에 attention을 준 vector를 생성합니다. 이 글의 뒷부분에서, self-attention에 대해 더 자세히 알아보겠습니다. 지금은, 모델의 출력 방향으로 윗쪽 stack을 계속 알아봅시다.
<span class="tooltiptext">
This weighted blend of value vectors results in a vector that paid 50% of its “attention” to the word <code class="language-plaintext highlighter-rouge">robot</code>, 30% to the word <code class="language-plaintext highlighter-rouge">a</code>, and 19% to the word <code class="language-plaintext highlighter-rouge">it</code>. Later in the post, we’ll go deeper into self-attention. But first, let’s continue our journey up the stack towards the output of the model.
</span></p>
</div>
<h4 id="모델-출력">모델 출력</h4>
<div class="tooltip">
<p>모델의 최상위 block이 (최상위 block의 self-attention 및 neural network 계산을 거친 결과인) output vector를 생성할 때, 모델은 그 vector와 embedding matrix를 곱합니다.
<span class="tooltiptext">
When the top block in the model produces its output vector (the result of its own self-attention followed by its own neural network), the model multiplies that vector by the embedding matrix.
</span></p>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/gpt2-output-projection-2.png" />
<br />
</div>
<div class="tooltip">
<p>embedding matrix의 각 행은 모델 어휘(vocab) 단어들의 embedding에 해당합니다. 이 곱셈의 결과는 모델의 어휘에서 각 word에 대한 score로 해석됩니다. (즉, 단어를 선택하기 위한 score로 사용할 수 있습니다.)
<span class="tooltiptext">
Recall that each row in the embedding matrix corresponds to the embedding of a word in the model’s vocabulary. The result of this multiplication is interpreted as a score for each word in the model’s vocabulary.
</span></p>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/gpt2-output-scores-2.png" />
<br />
</div>
<div class="tooltip">
<p>가장 높은 score를 갖는 token을 선택할 수도 있습니다 (top_k = 1). 하지만 모델이 다른 word들도 함께 고려하면 더 좋은 결과를 얻을 수 있습니다. 따라서 더 좋은 전략은, score를 해당 word가 선택될 확률로 사용하여 전체 리스트에서 word 하나를 샘플링하는 것입니다 (그래서 높은 score를 갖는 word일수록 선택될 가능성이 높습니다). 절충안은 top_k를 40으로 잡아, 모델이 가장 높은 score를 갖는 40개의 word만 고려하도록 하는 것입니다.
<span class="tooltiptext">
We can simply select the token with the highest score (top_k = 1). But better results are achieved if the model considers other words as well. So a better strategy is to sample a word from the entire list using the score as the probability of selecting that word (so words with a higher score have a higher chance of being selected). A middle ground is setting top_k to 40, and having the model consider the 40 words with the highest scores.
</span></p>
</div>
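최상위 block의 output vector를 embedding matrix와 곱해 vocab 전체의 score(logit)를 얻고, 그 score를 확률로 바꿔 샘플링하는 과정의 스케치입니다. 크기와 값은 toy 가정입니다 (실제 GPT-2의 vocab은 50,257개).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8                   # toy 크기 (가정)
wte = rng.normal(size=(vocab_size, d_model))  # embedding matrix
h = rng.normal(size=(d_model,))               # 최상위 block의 output vector라고 가정

logits = wte @ h   # embedding matrix와 곱해 vocab의 각 word에 대한 score 계산

k = 40             # top_k = 40: score 상위 40개 word만 후보로 남김
top_ids = np.argsort(logits)[-k:]
e = np.exp(logits[top_ids] - logits[top_ids].max())
probs = e / e.sum()  # score를 선택 확률로 변환 (높은 score일수록 선택 확률이 큼)
next_token = int(rng.choice(top_ids, p=probs))
```

선택된 next_token이 다시 입력 시퀀스에 덧붙여져 다음 iteration이 시작됩니다.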
<div class="img-div-any-size">
<image src="/images/gpt2/gpt2-output.png" />
<br />
</div>
<div class="tooltip">
<p>그렇게 해서, 모델은 하나의 word를 출력하면서 한 iteration을 종료합니다. 모델은 전체 context가 생성(1024개의 token)될 때까지, 혹은 EOS(end-of-sequence) token이 생성될 때까지 iteration을 계속 수행합니다.
<span class="tooltiptext">
With that, the model has completed an iteration resulting in outputting a single word. The model continues iterating until the entire context is generated (1024 tokens) or until an end-of-sequence token is produced.
</span></p>
</div>
<h3 id="파트-1의-마무리-몇가지-안내사항">파트 #1의 마무리: 몇가지 안내사항</h3>
<div class="tooltip">
<p>이것으로 이 파트를 마칩니다. GPT2 동작 방식에 대한 간단한 요약이었습니다. self-attention 레이어 안쪽에서 정확히 무슨 일이 일어나는지 궁금하다면, 아래의 보너스 섹션을 살펴보세요. TransformerXL과 XLNet 같은 후속 transformer 모델들을 더 쉽게 살펴보고 설명할 수 있도록, self-attention을 설명하는 더 시각적인 언어를 도입하기 위해 그 섹션을 만들었습니다.
<span class="tooltiptext">
And there we have it. A run down of how the GPT2 works. If you’re curious to know exactly what happens inside the self-attention layer, then the following bonus section is for you. I created it to introduce more visual language to describe self-attention in order to make describing later transformer models easier to examine and describe (looking at you, TransformerXL and XLNet).
</span></p>
</div>
<div class="tooltip">
<p>이 글에서 매우 단순화시킨 점들은 다음과 같습니다:
<span class="tooltiptext">
I’d like to note a few oversimplifications in this post:
</span></p>
</div>
<div class="tooltip">
<ul>
<li>“word”와 “token”을 같은 의미로 사용했습니다. 하지만 실제로는, GPT2는 어휘(vocab)의 token들을 만들기 위해 BPE(Byte Pair Encoding)을 사용합니다. 이 것은 일반적으로 token이 word의 일부임을 의미합니다.</li>
<li>예로 든 GPT2는 추론(inference)/평가(evaluation) 모드입니다. (설명 과정에서) 한번에 하나의 word만을 처리하는 이유입니다. 학습(training) 시에는, 모델은 더 긴 문자열 시퀀스에 대해 학습하며, 한번에 여러개의 token을 처리할 것입니다. 또한, 모델은 evaluation 때 사용하는 배치 사이즈 보다 더 큰 배치 사이즈 (512)를 처리할 것 입니다.</li>
<li>그림에서 공간을 효과적으로 사용하기 위해 회전/치환을 자유롭게 사용했습니다. 하지만 구현 때에는, 보다 더 정확히 해야 합니다.</li>
<li>Transformer는 레이어 정규화(layer normalization)를 많이 사용하며, 꽤 중요합니다. 이전 블로그 포스팅 ‘Illustrated Transformer’에서는 몇가지를 언급했었지만, 이번 포스팅에서는 self-attention에 집중했습니다.</li>
<li>vector를 표현하기 위해 더 많은 상자(box)들로 표현해야할 때가 있습니다. 저는 이 상자들을 “zoom in”으로 표시했습니다. 예를 들어 다음과 같습니다:
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> I used “words” and “tokens” interchangeably. But in reality, GPT2 uses Byte Pair Encoding to create the tokens in its vocabulary. This means the tokens are usually parts of words.
<br />
<span>*</span> The example we showed runs GPT2 in its inference/evaluation mode. That’s why it’s only processing one word at a time. At training time, the model would be trained against longer sequences of text and processing multiple tokens at once. Also at training time, the model would process larger batch sizes (512) vs. the batch size of one that evaluation uses.
<br />
<span>*</span> I took liberties in rotating/transposing vectors to better manage the spaces in the images. At implementation time, one has to be more precise.
<br />
<span>*</span> Transformers use a lot of layer normalization, which is pretty important. We’ve noted a few of these in the Illustrated Transformer, but focused more on self-attention in this post.
<br />
<span>*</span> There are times when I needed to show more boxes to represent a vector. I indicate those as “zooming in”. For example:
</span></li>
</ul>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/zoom-in.png" />
<br />
</div>
<h2 id="파트-2-그림으로-설명하는-self-attention-">파트 #2: 그림으로 설명하는 Self-Attention <a name="part-2-illustrated-self-attention" href="#part-2-illustrated-self-attention">#</a></h2>
<div class="tooltip">
<p>이 글의 앞 부분에서 단어 <code class="language-plaintext highlighter-rouge">it</code>을 처리하는 레이어에서 self-attention을 적용하는 것을 보여주기 위해 이 그림을 보여주었습니다.
<span class="tooltiptext">
Earlier in the post we showed this image to showcase self-attention being applied in a layer that is processing the word <code class="language-plaintext highlighter-rouge">it</code>:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-1-2.png" />
<br />
</div>
<div class="tooltip">
<p>이번 섹션에서는, 어떻게 동작하는지 자세히 살펴보겠습니다. 각 개별 word에 무슨일이 일어나는지 이해하는 방향으로 알아보겠습니다. 많은 단일 vector들을 보여줄 것입니다. 실제 구현은 거대한 matrix를 서로 곱하여 수행됩니다. 하지만 여기서는, word 수준에서 어떤 일이 일어나는지 직관적 표현에 집중하겠습니다.
<span class="tooltiptext">
In this section, we’ll look at the details of how that is done. Note that we’ll look at it in a way to try to make sense of what happens to individual words. That’s why we’ll be showing many single vectors. The actual implementations are done by multiplying giant matrices together. But I want to focus on the intuition of what happens on a word-level here.
</span></p>
</div>
<h3 id="masking-없는-self-attention">(Masking 없는) Self-Attention</h3>
<div class="tooltip">
<p>encoder block에서 계산된 최초의 self-attention을 살펴보는 것으로 부터 시작해봅시다. 한번에 4개의 token만 처리할 수 있는 토이(toy) transformer를 살펴보겠습니다.
<span class="tooltiptext">
Let’s start by looking at the original self-attention as it’s calculated in an encoder block. Let’s look at a toy transformer block that can only process four tokens at a time.
</span></p>
</div>
<div class="tooltip">
<p>Self-attention은 3개의 주요 단계가 있습니다:
<span class="tooltiptext">
Self-attention is applied through three main steps:
</span></p>
</div>
<div class="tooltip">
<ol>
<li>각 경로 마다 Query, Key, Value 벡터를 생성합니다.</li>
<li>각 input token 마다, query vector를 사용하여 모든 다른 key vector들에 대한 score를 계산합니다.</li>
<li>value vector에 그 조합된 score를 곱한 뒤 합산합니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>1.</span> Create the Query, Key, and Value vectors for each path.
<span>2.</span> For each input token, use its query vector to score against all the other key vectors
<span>3.</span> Sum up the value vectors after multiplying them by their associated scores.
</span></li>
</ol>
</div>
<div class="img-div-any-width">
<image src="/images/xlnet/self-attention-summary.png" />
<br />
</div>
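위의 세 단계를 numpy로 옮기면 다음과 같습니다. 한번에 4개의 token을 처리하는 toy 예시이며, weight 값은 무작위로 가정했습니다.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    # 1) 각 token 경로마다 query, key, value vector 생성
    q, k, v = x @ wq, x @ wk, x @ wv
    # 2) 각 query를 모든 key와 내적해 score를 계산하고 softmax로 정규화
    logits = q @ k.T
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    # 3) score를 곱한 value vector들을 합산
    return weights @ v

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=(4, d))  # 한번에 4개의 token을 처리하는 toy transformer
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (4, 4): token마다 context가 반영된 vector 하나씩
```

실제 구현에서는 이 계산이 attention head 여러 개로 나뉘어 수행되지만, 여기서는 head를 무시한 단일 계산입니다.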
<h3 id="1--query-key-value-vector-생성">1- Query, Key, Value Vector 생성</h3>
<div class="tooltip">
<p>첫번째 경로를 봅시다. query를 받아서, 모든 key들과 비교할 것입니다. 각 key 별로 score를 계산합니다. self-attention에서의 첫번째 단계는 각 token 경로 별 3개의 vector를 계산하는 것 입니다 (attention head는 일단은 무시합니다):
<span class="tooltiptext">
Let’s focus on the first path. We’ll take its query, and compare against all the keys. That produces a score for each key. The first step in self-attention is to calculate the three vectors for each token path (let’s ignore attention heads for now):
</span></p>
</div>
<div class="img-div-any-width">
1) 각 input token 마다, weight matrix <strong style="color:#A144B8">W^Q</strong>, <strong style="color:#F18533">W^K</strong>, <strong style="color:#329CEB">W^V</strong>를 곱하여 <strong style="color:#A144B8">query vector</strong>, <strong style="color:#F18533">key vector</strong>, <strong style="color:#329CEB">value vector</strong>를 생성합니다.
<image src="/images/xlnet/self-attention-1.png" />
<br />
</div>
<h3 id="2--score-계산">2- Score 계산</h3>
<div class="tooltip">
<p>이제 vector들을 갖게 되었고, 현재 #2번 단계에서만 query 및 key vector를 사용합니다. 우리는 지금 첫번째 token을 집중해서 보고 있기 때문에, 그 token의 query를 모든 key vector들과 곱하여 4개의 token들 각각의 score를 얻습니다.
<span class="tooltiptext">
Now that we have the vectors, we use the query and key vectors only for step #2. Since we’re focused on the first token, we multiply its query by all the other key vectors resulting in a score for each of the four tokens.
</span></p>
</div>
<div class="img-div-any-width">
2) 현재의 <strong style="color:#A144B8">query vector</strong>와 모든 <strong style="color:#F18533">key vector</strong>가 얼마나 잘 매칭되는지 score를 얻기 위해 곱셈(dot product) 연산을 합니다.
<image src="/images/xlnet/self-attention-2.png" />
<br />
</div>
<h3 id="3--전체-합산">3- 전체 합산</h3>
<div class="tooltip">
<p>우리는 이제 score들과 value vector들을 곱할 수 있습니다. 높은 score의 value는, 결과가 다 더해지고 난 뒤에, 결과 vector에서 높은 비중을 차지하게 됩니다.
<span class="tooltiptext">
We can now multiply the scores by the value vectors. A value with a high score will constitute a large portion of the resulting vector after we sum them up.
</span></p>
</div>
<div class="img-div-any-width">
3) <strong style="color:#329CEB">value vector들</strong>을 <strong style="color:#E66A93">score들</strong>과 곱한 뒤 모두 합산합니다.
<image src="/images/xlnet/self-attention-3-2.png" />
<br />
<div class="tooltip">
<p>score가 낮을수록 value vector를 더 투명하게 표시했습니다. 작은 수를 곱하면 vector의 값이 희석된다는(작아진다는) 것을 나타내기 위함입니다.
<span class="tooltiptext">
The lower the score, the more transparent we’re showing the value vector. That’s to indicate how multiplying by a small number dilutes the values of the vector.
</span></p>
</div>
</div>
<div class="tooltip">
<p>만약 각 경로마다 같은 동작을 수행한다면, 각 해당 token 마다, 적합한 context를 포함하는 token의 vector representation을 얻게 됩니다. 그 값은 transformer block의 다음 하위 layer(feed-forward neural network)에 제공됩니다.
<span class="tooltiptext">
If we do the same operation for each path, we end up with a vector representing each token containing the appropriate context of that token. Those are then presented to the next sublayer in the transformer block (the feed-forward neural network):
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/xlnet/self-attention-summary.png" />
<br />
</div>
<h3 id="그림으로-설명하는-masked-self-attention">그림으로 설명하는 Masked Self-Attention</h3>
<div class="tooltip">
<p>우리는 지금까지 transformer의 self-attention 단계 내부를 살펴보았으니, 이제 masked self-attention을 살펴보겠습니다. Masked self-attention은 #2 단계를 제외하면 self-attention과 동일합니다. 모델이 2개의 token만을 입력으로 가지고 있고, 우리가 두번째 token을 처리하는 상황이라고 가정해봅시다. 이 경우, 마지막 2개의 token은 masking됩니다. 즉, 모델이 scoring 단계에 개입하여, 앞으로 나올 token들의 score를 항상 0으로 만들기 때문에 모델이 미래의 word를 엿볼 수 없습니다:
<span class="tooltiptext">
Now that we’ve looked inside a transformer’s self-attention step, let’s proceed to look at masked self-attention. Masked self-attention is identical to self-attention except when it comes to step #2. Assuming the model only has two tokens as input and we’re observing the second token. In this case, the last two tokens are masked. So the model interferes in the scoring step. It basically always scores the future tokens as 0 so the model can’t peek to future words:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/xlnet/masked-self-attention-2.png" />
<br />
</div>
<div class="tooltip">
<p>이러한 masking은 attention mask라고 불리는 matrix로 구현됩니다. 4개의 단어 sequence(예를 들어 “robot must obey orders”)를 생각해보세요. language modeling 시나리오에서, 이 sequence는 4단계에 걸쳐 입력됩니다 – word 당 하나씩 (모든 word는 token이라고 가정합니다). 모델은 배치로 동작하기 때문에, 전체 sequence (4단계를 갖는)를 한 배치로 처리하는 이 toy model의 배치 사이즈를 4로 가정할 수 있습니다.
<span class="tooltiptext">
This masking is often implemented as a matrix called an attention mask. Think of a sequence of four words (“robot must obey orders”, for example). In a language modeling scenario, this sequence is absorbed in four steps – one per word (assuming for now that every word is a token). As these models work in batches, we can assume a batch size of 4 for this toy model that will process the entire sequence (with its four steps) as one batch.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/transformer-decoder-attention-mask-dataset.png" />
<br />
</div>
<div class="tooltip">
<p>matrix 형태에서, query matrix를 key matrix와 곱해서 score를 계산할 수 있습니다. 아래 그림의 각 셀에서 word 대신에 word와 관련된 query (또는 key) vector가 있다고 가정하고 다음과 같이 시각적으로 표현해보겠습니다:
<span class="tooltiptext">
In matrix form, we calculate the scores by multiplying a queries matrix by a keys matrix. Let’s visualize it as follows, except instead of the word, there would be the query (or key) vector associated with that word in that cell:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/queries-keys-attention-mask.png" />
<br />
</div>
<div class="tooltip">
<p>곱셈 이후에, attention mask 삼각형을 적용합니다. masking 하고 싶은 셀들을 마이너스 무한대 또는 매우 큰 음수로 설정합니다 (예. GPT2에서는 -10억):
<span class="tooltiptext">
After the multiplication, we slap on our attention mask triangle. It sets the cells we want to mask to -infinity or a very large negative number (e.g. -1 billion in GPT2):
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/transformer-attention-mask.png" />
<br />
</div>
<div class="tooltip">
<p>각 행에 softmax를 취함으로써 self-attention에 사용하는 실제 score가 생성됩니다.
<span class="tooltiptext">
Then, applying softmax on each row produces the actual scores we use for self-attention:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/transformer-attention-masked-scores-softmax.png" />
<br />
</div>
<div class="tooltip">
<p>이 score 테이블이 의미하는 것은 다음과 같습니다:
<span class="tooltiptext">
What this scores table means is the following:
</span></p>
</div>
<div class="tooltip">
<ul>
<li>모델이 dataset에서 첫번째 케이스(1번 행)를 처리할 때, 단 하나의 단어(“robot”)만을 포함하며, 그 단어에 모든(100%) attention을 갖습니다.</li>
<li>모델이 dataset에서 두번째 케이스(2번 행)을 처리할 때, “robot must”라는 단어들을 포함하며, “robot”에 48%, “must”에 52%의 attention을 가지면서 단어 “must”를 처리합니다.</li>
<li>기타 등등…
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> When the model processes the first example in the dataset (row #1), which contains only one word (“robot”), 100% of its attention will be on that word.
<span>*</span> When the model processes the second example in the dataset (row #2), which contains the words (“robot must”), when it processes the word “must”, 48% of its attention will be on “robot”, and 52% of its attention will be on “must”.
<span>*</span> And so on
</span></li>
</ul>
</div>
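<p>위 score 테이블을 만드는 masking + softmax 과정은 다음과 같이 스케치할 수 있습니다. score 값들은 임의의 수이며, GPT2의 실제 구현과 세부 사항은 다를 수 있습니다:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

words = ["robot", "must", "obey", "orders"]
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # query matrix @ key matrix 결과라고 가정

# attention mask: 미래 token에 해당하는 셀(대각선 위쪽)을 매우 큰 음수로 설정
mask = np.triu(np.full((4, 4), -1e9), k=1)
masked_scores = scores + mask

# 각 행에 softmax를 취하면 실제 사용하는 score가 생성됨
weights = softmax(masked_scores)
print(np.round(weights, 2))  # 1번 행: 첫 단어("robot")에 100% attention
```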
<h3 id="gpt-2의-masked-self-attention">GPT-2의 Masked Self-Attention</h3>
<div class="tooltip">
<p>GPT-2의 masked attention에 대해 더 깊이 알아봅시다.
<span class="tooltiptext">
Let’s get into more detail on GPT-2’s masked attention.
</span></p>
</div>
<h4 id="평가-시-한번에-한-토큰씩-처리">평가 시: 한번에 한 토큰씩 처리</h4>
<div class="tooltip">
<p>GPT-2가 masked self-attention이 동작하는 것과 똑같이 동작하도록 만들 수 있습니다. 하지만 evaluation 할 때에, 우리 모델이 각 iteration이 끝날 때 마다 하나의 새로운 word만 추가할 때, 이미 처리된 token에 대해 이전 경로를 따라 self-attention을 다시 계산하는 것은 비효율적 입니다.<br />
<span class="tooltiptext">
We can make the GPT-2 operate exactly as masked self-attention works. But during evaluation, when our model is only adding one new word after each iteration, it would be inefficient to recalculate self-attention along earlier paths for tokens which have already been processed.
</span></p>
</div>
<div class="tooltip">
<p>이 경우에, 첫번째 token을 처리합니다 (지금은 <code class="language-plaintext highlighter-rouge"><s></code>를 무시합니다).
<span class="tooltiptext">
In this case, we process the first token (ignoring <code class="language-plaintext highlighter-rouge"><s></code> for now).
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-qkv-1-2.png" />
<br />
</div>
<div class="tooltip">
<p>GPT-2는 <code class="language-plaintext highlighter-rouge">a</code> token의 key, value vector를 유지하고 있습니다. 모든 self-attention 레이어는 그 token에 대한 각각의 key, value vector를 유지합니다.
<span class="tooltiptext">
GPT-2 holds on to the key and value vectors of the <code class="language-plaintext highlighter-rouge">a</code> token. Every self-attention layer holds on to its respective key and value vectors for that token:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-qkv-2-2.png" />
<br />
</div>
<div class="tooltip">
<p>이제 다음 iteration에서, 모델이 단어 <code class="language-plaintext highlighter-rouge">robot</code>을 처리할 때, <code class="language-plaintext highlighter-rouge">a</code> token에 대한 query, key, value vector를 생성할 필요가 없습니다. 첫번째 iteration에서 저장한 것을 재사용합니다.
<span class="tooltiptext">
Now in the next iteration, when the model processes the word <code class="language-plaintext highlighter-rouge">robot</code>, it does not need to generate query, key, and value vectors for the <code class="language-plaintext highlighter-rouge">a</code> token. It just reuses the ones it saved from the first iteration:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-qkv-3-2.png" />
<br />
</div>
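<p>이러한 key/value 재사용(caching) 아이디어를 단순화해서 스케치하면 다음과 같습니다. <code class="language-plaintext highlighter-rouge">KVCache</code>라는 이름과 vector 값들은 설명을 위해 임의로 정한 것이며, 실제 GPT-2 구현과는 다릅니다:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy model의 차원

class KVCache:
    """이미 처리한 token들의 key/value vector를 저장하는 단순한 cache."""
    def __init__(self):
        self.keys, self.values = [], []

    def add(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def stacked(self):
        return np.stack(self.keys), np.stack(self.values)

cache = KVCache()

# iteration 1: 'a' token을 처리하고 key/value를 저장
k_a, v_a = rng.normal(size=d), rng.normal(size=d)
cache.add(k_a, v_a)

# iteration 2: 'robot'의 query/key/value만 새로 만들고,
# 'a'의 key/value는 다시 계산하지 않고 cache에서 재사용
q_robot = rng.normal(size=d)
k_robot, v_robot = rng.normal(size=d), rng.normal(size=d)
cache.add(k_robot, v_robot)

keys, values = cache.stacked()
scores = q_robot @ keys.T / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = weights @ values
print(keys.shape, out.shape)  # (2, 8) (8,)
```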
<h4 id="gpt-2의-self-attention-1--querie-key-value-값들-생성">GPT-2의 Self-attention: 1- query, key, value 값들 생성</h4>
<div class="tooltip">
<p>모델이 단어 <code class="language-plaintext highlighter-rouge">it</code>를 처리하고 있다고 가정해봅시다. 하단 block의 경우, 그 token에 대한 입력 값은 <code class="language-plaintext highlighter-rouge">it의 embedding + 슬롯 #9에 대한 positional encoding</code>이 됩니다.
<span class="tooltiptext">
Let’s assume the model is processing the word <code class="language-plaintext highlighter-rouge">it</code>. If we’re talking about the bottom block, then its input for that token would be the embedding of <code class="language-plaintext highlighter-rouge">it</code> + the positional encoding for slot #9:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-1.png" />
<br />
</div>
<div class="tooltip">
<p>Transformer에서 모든 block은 각자의 weight를 갖습니다 (이 글의 후반부에서 설명하겠습니다). 우리가 가장 먼저 볼 것은 query, key, value를 생성하는 데 사용하는 weight matrix입니다.
<span class="tooltiptext">
Every block in a transformer has its own weights (broken down later in the post). The first we encounter is the weight matrix that we use to create the queries, keys, and values.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-2.png" />
<div class="tooltip">
<p>Self-attention은 입력을 weight matrix와 곱합니다 (그리고 여기서 표현하지는 않았지만, bias vector를 더해줍니다).
<span class="tooltiptext">
<br />
Self-attention multiplies its input by its weight matrix (and adds a bias vector, not illustrated here).
</span></p>
</div>
</div>
<div class="tooltip">
<p>이 곱셈 연산은 기본적으로 단어 <code class="language-plaintext highlighter-rouge">it</code>에 대한 query, key, value vector의 접합(concat)된 vector를 생성합니다.
<span class="tooltiptext">
The multiplication results in a vector that’s basically a concatenation of the query, key, and value vectors for the word <code class="language-plaintext highlighter-rouge">it</code>.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-3.png" />
<div class="tooltip">
<p>input vector에 attention weight vector를 곱함으로써 (그리고 나중에 bias vector를 더함으로써), 이 token에 대한 key, value, query vector를 생성합니다.
<span class="tooltiptext">
<br />
Multiplying the input vector by the attention weights vector (and adding a bias vector afterwards) results in the key, value, and query vectors for this token.
</span></p>
</div>
</div>
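<p>이 단계를 코드로 표현하면 다음과 같습니다. weight 값은 임의로 초기화한 것이며, 실제 GPT-2 구현을 단순화한 스케치입니다:</p>

```python
import numpy as np

d_model = 768  # GPT-2 small의 모델 차원
rng = np.random.default_rng(0)

# query/key/value를 한번에 생성하는 하나의 weight matrix와 bias vector
W_qkv = rng.normal(scale=0.02, size=(d_model, 3 * d_model))
b_qkv = np.zeros(3 * d_model)

# 'it'의 embedding + positional encoding 이라고 가정한 입력 vector
x = rng.normal(size=d_model)

# 곱셈 결과는 query, key, value가 접합(concat)된 하나의 긴 vector
qkv = x @ W_qkv + b_qkv
q, k, v = np.split(qkv, 3)
print(qkv.shape, q.shape)  # (2304,) (768,)
```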
<h4 id="gpt-2의-self-attention-15--attention-head로-분할하기">GPT-2의 Self-attention: 1.5- attention head로 분할하기</h4>
<div class="tooltip">
<p>이전 예제에서, “multi-head” 부분을 건너뛰고 self-attention을 바로 살펴봤습니다. 이제 그 개념에 대해 약간의 설명을 하는 것이 좋겠습니다. Self attention은 Q, K, V vector의 다른 부분들에 대해 여러번 수행됩니다. attention head의 “분할(splitting)”은 긴 vector를 matrix로 단순히 재구성하는 것 입니다. small GPT2는 12개의 attention head를 가지며, 그 head의 수가 재구성된 matrix의 첫번째 차원(dimension)이 됩니다.
<span class="tooltiptext">
In the previous examples, we dove straight into self-attention ignoring the “multi-head” part. It would be useful to shed some light on that concept now. Self attention is conducted multiple times on different parts of the Q,K,V vectors. “Splitting” attention heads is simply reshaping the long vector into a matrix. The small GPT2 has 12 attention heads, so that would be the first dimension of the reshaped matrix:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-split-attention-heads-1.png" />
<br />
</div>
<div class="tooltip">
<p>이전 예제에서, attention head 안에서 어떤 일이 일어나는지 살펴보았습니다. 다수의 attention-head를 생각하는 방법은 아래와 같습니다 (만약 12개의 attention head의 3개만을 그림으로 표현한다면):
<span class="tooltiptext">
In the previous examples, we’ve looked at what happens inside one attention head. One way to think of multiple attention-heads is like this (if we’re to only visualize three of the twelve attention heads):
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-split-attention-heads-2.png" />
<br />
</div>
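<p>이 “분할”이 단순한 reshape이라는 점은 다음과 같이 확인할 수 있습니다 (설명을 위한 스케치이며, query vector 값은 임의로 정한 것입니다):</p>

```python
import numpy as np

n_heads, d_model = 12, 768     # GPT-2 small: 12개의 attention head
head_dim = d_model // n_heads  # head 당 64차원

q = np.arange(d_model, dtype=float)  # 한 token의 긴 query vector라고 가정

# "분할"은 단순히 긴 vector를 (n_heads, head_dim) matrix로 재구성하는 것
q_heads = q.reshape(n_heads, head_dim)
print(q_heads.shape)  # (12, 64)
```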
<h4 id="gpt-2의-self-attention-2--score-계산하기">GPT-2의 Self-attention: 2- Score 계산하기</h4>
<div class="tooltip">
<p>이제 score 계산을 진행할 수 있습니다 – 우리는 하나의 attention head만 보고 있으며, 다른 head들도 모두 비슷한 연산을 수행한다는 것을 알고 있습니다:
<span class="tooltiptext">
We can now proceed to scoring – knowing that we’re only looking at one attention head (and that all the others are conducting a similar operation):
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-scoring.png" />
<br />
</div>
<div class="tooltip">
<p>이제 token은 다른 token들의 key에 대해 score 값을 얻을 수 있습니다 (이전 iteration에서 attention head 1번에서 계산되었습니다).
<span class="tooltiptext">
Now the token can get scored against all of keys of the other tokens (that were calculated in attention head #1 in previous iterations):
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-scoring-2.png" />
<br />
</div>
<h4 id="gpt-2의-self-attention-3--합산하기">GPT-2의 Self-attention: 3- 합산하기</h4>
<div class="tooltip">
<p>이전에 살펴본 것과 같이, 각 value를 각 score와 곱하고, 그 결과들을 합산해서, attention-head #1를 위한 self-attention 결과를 만듭니다.
<span class="tooltiptext">
As we’ve seen before, we now multiply each value with its score, then sum them up, producing the result of self-attention for attention-head #1:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-multihead-sum-1.png" />
<br />
</div>
<h4 id="gpt-2의-self-attention-35--attention-head를-합치기merge">GPT-2의 Self-attention: 3.5- attention head를 합치기(merge)</h4>
<div class="tooltip">
<p>여러 attention head를 다루기 위한 방법은, 먼저 이들을 하나의 vector로 접합(concat)하는 것 입니다.
<span class="tooltiptext">
The way we deal with the various attention heads is that we first concatenate them into one vector:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-merge-heads-1.png" />
<br />
</div>
<div class="tooltip">
<p>하지만 이 vector는 아직 다음 순서의 하위 layer로 전달될 준비가 되지 않았습니다. 먼저 hidden state의 이 결과물을 동질적(homogenous) 표현(representation)으로 바꿔야 합니다.
<span class="tooltiptext">
But the vector isn’t ready to be sent to the next sublayer just yet. We need to first turn this Frankenstein’s-monster of hidden states into a homogenous representation.
</span></p>
</div>
<h4 id="gpt-2의-self-attention-4--projecting">GPT-2의 Self-attention: 4- Projecting</h4>
<div class="tooltip">
<p>모델이 접합(concat)된 self-attention 결과를, feed-forward neural network가 처리할 수 있는 하나의 vector로 잘 mapping하도록 학습하게 만들 것 입니다. 여기서 attention head들의 결과를 self-attention 하위 layer의 출력 vector로 projection하는 두번째 큰 weight matrix가 등장합니다:
<span class="tooltiptext">
We’ll let the model learn how to best map concatenated self-attention results into a vector that the feed-forward neural network can deal with. Here comes our second large weight matrix that projects the results of the attention heads into the output vector of the self-attention sublayer:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-project-1.png" />
<br />
</div>
<div class="tooltip">
<p>그리고 이렇게, 다음 layer로 보낼 수 있는 vector를 생성했습니다.
<span class="tooltiptext">
And with this, we have produced the vector we can send along to the next layer:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-project-2.png" />
<br />
</div>
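<p>3.5 단계(merge)와 4 단계(projection)를 합쳐서 스케치하면 다음과 같습니다. head의 출력과 projection weight는 임의의 값입니다:</p>

```python
import numpy as np

n_heads, head_dim = 12, 64
d_model = n_heads * head_dim  # 768
rng = np.random.default_rng(0)

# 각 head의 self-attention 결과라고 가정한 (12, 64) matrix
head_outputs = rng.normal(size=(n_heads, head_dim))

# 3.5) merge: head들의 결과를 하나의 vector로 접합(concat)
merged = head_outputs.reshape(d_model)

# 4) projection: 두번째 큰 weight matrix로, 다음 layer가 다룰 수 있는
#    동질적(homogenous)인 vector로 mapping
W_proj = rng.normal(scale=0.02, size=(d_model, d_model))
b_proj = np.zeros(d_model)
out = merged @ W_proj + b_proj
print(out.shape)  # (768,)
```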
<h4 id="gpt-2의-fully-connected-neural-network-1번-레이어">GPT-2의 Fully-Connected Neural Network: #1번 레이어</h4>
<div class="tooltip">
<p>fully-connected neural network은 self-attention이 representation에 적합한 context를 포함시킨 뒤, block이 입력 token을 처리하는 곳 입니다. 이 것은 두 개의 레이어로 구성되어 있습니다. 첫번째 레이어는 모델 사이즈의 4배 입니다 (GPT2 small의 경우 768 이므로, 이 network는 768*4 = 3072 unit 입니다). 왜 4배 일까요? 그 것은 단순히 최초의 transformer에서 사용한 값과 같습니다 (모델 차원이 512 였고, #1번 레이어는 2048 이었습니다). 이 것은 transformer 모델에 주어진(처리해야 하는) task들을 다루기에 충분한 representation 능력/용량을 주는 것으로 보입니다.
<span class="tooltiptext">
The fully-connected neural network is where the block processes its input token after self-attention has included the appropriate context in its representation. It is made up of two layers. The first layer is four times the size of the model (Since GPT2 small is 768, this network would have 768*4 = 3072 units). Why four times? That’s just the size the original transformer rolled with (model dimension was 512 and layer #1 in that model was 2048). This seems to give transformer models enough representational capacity to handle the tasks that have been thrown at them so far.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-mlp1.gif" />
<div class="tooltip">
<p>(bias vector 생략)
<span class="tooltiptext">
(Not shown: A bias vector)
</span></p>
</div>
</div>
<h4 id="gpt-2의-fully-connected-neural-network-2번-레이어---모델-차원으로-projection-하기">GPT-2의 Fully-Connected Neural Network: #2번 레이어 - 모델 차원으로 projection 하기</h4>
<div class="tooltip">
<p>두번째 레이어는 첫번째 레이어의 결과를 모델 차원(dimension; small GPT2의 경우 768)으로 다시 projection 합니다. 이 곱셈 연산의 결과는 이 token에 대한 transformer block의 결과입니다.
<span class="tooltiptext">
The second layer projects the result from the first layer back into model dimension (768 for the small GPT2). The result of this multiplication is the result of the transformer block for this token.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-mlp-2.gif" />
<div class="tooltip">
<p>(bias vector 생략)
<span class="tooltiptext">
(Not shown: A bias vector)
</span></p>
</div>
</div>
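<p>두 레이어의 차원 변화를 코드로 스케치하면 다음과 같습니다. GPT-2가 사용하는 GELU activation은 tanh 근사로 표현했으며, weight는 임의의 값입니다:</p>

```python
import numpy as np

d_model = 768       # GPT-2 small의 모델 차원
d_ff = 4 * d_model  # 3072: 첫번째 레이어는 모델 사이즈의 4배
rng = np.random.default_rng(0)

def gelu(x):
    # GPT-2가 사용하는 GELU activation의 tanh 근사
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=d_model)  # self-attention 하위 layer의 출력이라고 가정
h = gelu(x @ W1 + b1)         # 레이어 #1: 768 -> 3072
out = h @ W2 + b2             # 레이어 #2: 3072 -> 768 (모델 차원으로 projection)
print(h.shape, out.shape)  # (3072,) (768,)
```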
<h3 id="드디어-해냈습니다-it을-만들어냈습니다">드디어 해냈습니다! ‘It’을 만들어냈습니다!</h3>
<div class="tooltip">
<p>이 것이 우리가 다룰 transformer block의 가장 상세한 버전입니다! 당신은 transformer language model 안에서 일어나는 대다수의 것들을 알게 되었습니다. 요약하자면, 우리의 입력 vector는 이러한 weight matrix들을 만납니다:
<span class="tooltiptext">
That’s the most detailed version of the transformer block we’ll get into! You now pretty much have the vast majority of the picture of what happens inside of a transformer language model. To recap, our brave input vector encounters these weight matrices:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-transformer-block-weights-2.png" />
<br />
</div>
<div class="tooltip">
<p>그리고 각 block 마다 이러한 weight들의 세트를 가지고 있습니다. 반면에, 모델은 하나의 token embedding matrix와 하나의 positional encoding matrix 만을 가지고 있습니다.
<span class="tooltiptext">
And each block has its own set of these weights. On the other hand, the model has only one token embedding matrix and one positional encoding matrix:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-weights-2.png" />
<br />
</div>
<div class="tooltip">
<p>만약 모델의 parameter 전체를 보고 싶다면, 여기에 집계해두었습니다:
<span class="tooltiptext">
If you want to see all the parameters of the model, then I have tallied them here:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-117-parameters.png" />
<br />
</div>
<div class="tooltip">
<p>어떤 이유에서인지 parameter 수를 모두 더하면 117M개가 아닌 124M개가 됩니다. 이유는 모르겠지만, publish된 code에서 보이는 숫자는 그렇습니다 (만약 제가 틀린 경우 수정해주세요).
<span class="tooltiptext">
They add up to 124M parameters instead of 117M for some reason. I’m not sure why, but that’s how many of them seem to be in the published code (please correct me if I’m wrong).
</span></p>
</div>
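<p>위 집계가 어떻게 약 124M에 이르는지 대략적으로 계산해볼 수 있습니다. token embedding matrix가 출력 layer와 공유(weight tying)된다고 가정한 근사치이며, 세부 항목 구분은 설명을 위한 것입니다:</p>

```python
vocab_size, d_model, n_ctx, n_layer = 50257, 768, 1024, 12

# token embedding + positional encoding matrix
embeddings = vocab_size * d_model + n_ctx * d_model

# transformer block 하나의 weight들
attn = d_model * 3 * d_model + 3 * d_model  # query/key/value 생성 (weight + bias)
attn += d_model * d_model + d_model         # attention projection
mlp = d_model * 4 * d_model + 4 * d_model   # fully-connected 레이어 #1 (768 -> 3072)
mlp += 4 * d_model * d_model + d_model      # fully-connected 레이어 #2 (3072 -> 768)
layer_norms = 2 * (2 * d_model)             # block 당 layer norm 2개 (scale + bias)
per_block = attn + mlp + layer_norms

# 마지막 layer norm(2 * d_model)까지 포함한 총합
total = embeddings + n_layer * per_block + 2 * d_model
print(f"{total:,}")  # 124,439,808 -> 약 124M
```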
<h2 id="파트-3-language-modeling-그-이상의-것-">파트 3: Language Modeling, 그 이상의 것 <a href="#part-3-beyond-language-modeling" name="part-3-beyond-language-modeling">#</a></h2>
<div class="tooltip">
<p>decoder-only transformer는 language modeling 이상의 가능성들을 계속 보여줍니다. 성공을 보여준 application들이 많이 있습니다. 이러한 application 몇 개를 보면서 이번 포스팅을 마치고자 합니다.
<span class="tooltiptext">
The decoder-only transformer keeps showing promise beyond language modeling. There are plenty of applications where it has shown success which can be described by similar visuals as the above. Let’s close this post by looking at some of these applications
</span></p>
</div>
<h3 id="기계번역machine-translation">기계번역(Machine Translation)</h3>
<div class="tooltip">
<p>번역(translation)을 하는데에 encoder가 필요하지 않습니다. 이 task를 decoder-only transformer로 처리할 수 있습니다:
<span class="tooltiptext">
An encoder is not required to conduct translation. The same task can be addressed by a decoder-only transformer:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/decoder-only-transformer-translation.png" />
<br />
</div>
<h3 id="요약summarization">요약(Summarization)</h3>
<div class="tooltip">
<p>요약(Summarization)은 첫번째 decoder-only transformer가 학습된 task 입니다. 즉, (목차 앞쪽의 서두 부분을 제외하고) 위키피디아 아티클을 읽고 요약하도록 학습했습니다. 실제 서두 부분은 학습 dataset에서 레이블로 사용되었습니다:
<span class="tooltiptext">
This is the task that the first decoder-only transformer was trained on. Namely, it was trained to read a wikipedia article (without the opening section before the table of contents), and to summarize it. The actual opening sections of the articles were used as the labels in the training dataset:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/wikipedia-summarization.png" />
<br />
</div>
<div class="tooltip">
<p>이 논문에서는 위키피디아 아티클에 대해 모델을 학습시켰고, 학습된 모델은 아티클을 요약할 수 있었습니다:
<span class="tooltiptext">
The paper trained the model against wikipedia articles, and thus the trained model was able to summarize articles:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/decoder-only-summarization.png" />
<br />
</div>
<h3 id="전이-학습transfer-learning">전이 학습(Transfer Learning)</h3>
<div class="tooltip">
<p><a href="https://arxiv.org/abs/1905.08836">Sample Efficient Text Summarization Using a Single Pre-Trained Transformer</a> 논문에서, decoder-only transformer는 먼저 language model에 대해 pre-train 하고, 요약(summary)에 대해 finetuning 했습니다. 이 것은 제한된 data 설정에서 encoder-decoder transformer를 pre-train하는 것 보다 더 좋은 결과를 보였습니다.
<span class="tooltiptext">
In <a href="https://arxiv.org/abs/1905.08836">Sample Efficient Text Summarization Using a Single Pre-Trained Transformer</a>, a decoder-only transformer is first pre-trained on language modeling, then finetuned to do summarization. It turns out to achieve better results than a pre-trained encoder-decoder transformer in limited data settings.
</span></p>
</div>
<div class="tooltip">
<p>GPT2 논문도 language modeling에 대해 pre-train한 뒤에 요약(summary) task의 결과를 보여줍니다.
<span class="tooltiptext">
The GPT2 paper also shows results of summarization after pre-training the model on language modeling.
</span></p>
</div>
<h3 id="음악-생성music-generation">음악 생성(Music Generation)</h3>
<div class="tooltip">
<p><a href="https://magenta.tensorflow.org/music-transformer">Music Transformer</a>는 decoder-only transformer를 사용하여 expressive timing과 dynamics를 갖춘 음악을 생성합니다. “Music Modeling”은 language modeling과 같습니다 – 모델이 unsupervised한 방법으로 음악을 학습하도록 하고, 그 다음 샘플을 출력하게 합니다 (우리가 이전에 “rambling”이라고 불렀던 것입니다).
<span class="tooltiptext">
The <a href="https://magenta.tensorflow.org/music-transformer">Music Transformer</a> uses a decoder-only transformer to generate music with expressive timing and dynamics. “Music Modeling” is just like language modeling – just let the model learn music in an unsupervised way, then have it sample outputs (what we called “rambling”, earlier).
</span></p>
</div>
<div class="tooltip">
<p>이 시나리오에서 음악이 어떻게 represent 되는지 궁금할 것 입니다. language modeling은 문자(character), 단어(word), 또는 단어의 일부인 토큰(token)의 vector representation을 통해 수행될 수 있다는 것을 기억하세요. 음악 연주의 경우 (지금은 피아노를 생각해보세요), 우리는 음표(note)뿐만 아니라 velocity – 피아노 건반을 얼마나 세게 눌렀는지에 대한 측정값 – 도 표현해야 합니다.
<span class="tooltiptext">
You might be curious as to how music is represented in this scenario. Remember that language modeling can be done through vector representations of either characters, words, or tokens that are parts of words. With a musical performance (let’s think about the piano for now), we have to represent the notes, but also velocity – a measure of how hard the piano key is pressed.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/music-transformer-performance-encoding-3.png" />
<br />
</div>
<div class="tooltip">
<p>연주는 이러한 일련의 one-hot vector들일 뿐입니다. midi 파일은 이러한 포맷으로 변환될 수 있습니다. 이 논문에는 다음과 같은 입력 sequence 예시가 있습니다:
<span class="tooltiptext">
A performance is just a series of these one-hot vectors. A midi file can be converted into such a format. The paper has the following example input sequence:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/music-representation-example.png" />
<br />
</div>
<div class="tooltip">
<p>이 입력 sequence에 대한 one-hot vector representation은 이런 모양일 것 입니다:
<span class="tooltiptext">
The one-hot vector representation for this input sequence would look like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/music-transformer-input-representation-2.png" />
<br />
</div>
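<p>이러한 one-hot 표현을 단순화해서 스케치하면 다음과 같습니다. 여기서의 event vocabulary는 설명을 위한 가상의 축소판이며, 실제 논문의 vocabulary(NOTE_ON/NOTE_OFF, TIME_SHIFT, velocity event들)는 훨씬 큽니다:</p>

```python
import numpy as np

# 설명을 위한 가상의(hypothetical) 축소 event vocabulary
vocab = ["NOTE_ON<60>", "NOTE_ON<64>", "TIME_SHIFT<500>",
         "NOTE_OFF<60>", "NOTE_OFF<64>", "SET_VELOCITY<80>"]

def one_hot(event):
    # 각 event를 vocabulary 크기의 one-hot vector로 변환
    v = np.zeros(len(vocab))
    v[vocab.index(event)] = 1.0
    return v

# 연주(performance)는 이러한 one-hot vector들의 sequence일 뿐
performance = ["SET_VELOCITY<80>", "NOTE_ON<60>", "TIME_SHIFT<500>", "NOTE_OFF<60>"]
sequence = np.stack([one_hot(e) for e in performance])
print(sequence.shape)  # (4, 6): event 4개, vocabulary 크기 6
```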
<div class="tooltip">
<p>저는 Music Transformer에서 self-attention을 표현하는 이 논문의 그림을 좋아합니다. 여기에 주석을 조금 달았습니다.
<span class="tooltiptext">
I love a visual in the paper that showcases self-attention in the Music Transformer. I’ve added some annotations to it here:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/music-transformer-self-attention-2.png" />
<br />
<div class="tooltip">
<p>“그림 8: 이 작품은 반복되는 삼각형 형태를 가지고 있습니다. query는 뒷쪽 peak들 중 하나에 있고, 곡의 시작부분에 이르기까지 모든 이전 고음(peak에 있는)에 attention을 줍니다.” …“[이] 그림은 query (모든 attention 선의 source)와 attention을 받는 이전 메모리(더 많은 softmax 확률을 받는 음표(note)가 강조됨)를 보여줍니다. attention line의 색상은 서로 다른 head에 해당하고 두께는 softmax 확률의 가중치(weight)에 해당합니다.”
<span class="tooltiptext">
“Figure 8: This piece has a recurring triangular contour. The query is at one of the latter peaks and it attends to all of the previous high notes on the peak, all the way to beginning of the piece.” … “[The] figure shows a query (the source of all the attention lines) and previous memories being attended to (the notes that are receiving more softmax probability is highlighted in). The coloring of the attention lines correspond to different heads and the width to the weight of the softmax probability.”
</span></p>
</div>
</div>
<div class="tooltip">
<p>악보 representation에 대해 더 알고 싶으시면 <a href="https://www.youtube.com/watch?v=ipzR9bhei_o">이 영상</a>을 참고해보세요.
<span class="tooltiptext">
If you’re unclear on this representation of musical notes, <a href="https://www.youtube.com/watch?v=ipzR9bhei_o">check out this video</a>.
</span></p>
</div>
<h2 id="결론">결론</h2>
<div class="tooltip">
<p>이것으로 GPT2의 전반적인 과정과 상위 모델인 decoder-only transformer에 대한 탐험을 마치겠습니다. 이 포스팅을 통해 self-attention에 대한 더 깊은 이해와 transformer 내부에서 일어나는 것들에 대해 이해하는 것이 더 수월하기를 바랍니다.
<span class="tooltiptext">
This concludes our journey into the GPT2, and our exploration of its parent model, the decoder-only transformer. I hope that you come out of this post with a better understanding of self-attention and more comfort that you understand more of what goes on inside a transformer.
</span></p>
</div>
<h2 id="참고자료">참고자료</h2>
<div class="tooltip">
<ul>
<li>OpenAI의 <a href="https://github.com/openai/gpt-2">GPT2 구현</a></li>
<li><a href="https://huggingface.co/">Hugging Face</a>의 <a href="https://github.com/huggingface/pytorch-transformers">pytorch-transformers</a> 라이브러리를 확인해보세요. GPT2 외에도 BERT, Transformer-XL, XLNet 등 최신 transformer model들을 구현하고 있습니다.
<span class="tooltiptext">
<span>*</span> The <a href="https://github.com/openai/gpt-2">GPT2 Implementation</a> from OpenAI
<span>*</span> Check out the <a href="https://github.com/huggingface/pytorch-transformers">pytorch-transformers</a> library from <a href="https://huggingface.co/">Hugging Face</a> in addition to GPT2, it implements BERT, Transformer-XL, XLNet and other cutting-edge transformer models.
</span></li>
</ul>
</div>
<h2 id="감사의-글">감사의 글</h2>
<div class="tooltip">
<p><a href="https://twitter.com/lukaszkaiser">Lukasz Kaiser</a>, <a href="https://www.cl.uzh.ch/de/people/team/compling/mmueller.html">Mathias Müller</a>, <a href="https://twitter.com/peterjliu">Peter J. Liu</a>, <a href="https://twitter.com/rsepassi">Ryan Sepassi</a>, <a href="https://www.linkedin.com/in/mohammad-saleh-39614224/">Mohammad Saleh</a>님들께 이 포스팅의 이전 버전에서 피드백을 주셔서 감사합니다.
<span class="tooltiptext">
Thanks to <a href="https://twitter.com/lukaszkaiser">Lukasz Kaiser</a>, <a href="https://www.cl.uzh.ch/de/people/team/compling/mmueller.html">Mathias Müller</a>, <a href="https://twitter.com/peterjliu">Peter J. Liu</a>, <a href="https://twitter.com/rsepassi">Ryan Sepassi</a> and <a href="https://www.linkedin.com/in/mohammad-saleh-39614224/">Mohammad Saleh</a> for feedback on earlier versions of this post.
</span></p>
</div>
<div class="tooltip">
<p>(원문에 대한) 의견이나 수정 사항이 있다면 <a href="https://twitter.com/JayAlammar">@JayAlammar</a>로 tweet 해주세요.
<br />
<span class="tooltiptext">
Comments or corrections? Please tweet me at <a href="https://twitter.com/JayAlammar">@JayAlammar</a>
</span></p>
</div>
<hr />
<h2 id="추가-정보">추가 정보<a href="#additional-info" name="additional-info">.</a></h2>
<ul>
<li>이 글은 GPT2에 대해 이해하기 쉽게 그림으로 설명한 포스팅을 저자인 Jay Alammar님의 허락을 받고 번역한 글 입니다. 원문은 <a href="https://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT-2 (Visualizing Transformer Language Models)</a>에서 확인하실 수 있습니다.</li>
<li>원서/영문블로그를 보실 때 term에 대한 정보 호환을 위해, 이 분야에서 사용하고 있는 단어, 문구에 대해 가급적 번역하지 않고 원문 그대로 두었습니다. 그리고, 직역 보다는 개념이나 의미에 대한 설명을 쉽게 하는 문장 쪽으로 더 무게를 두어 번역 했습니다. 번역에 대한 의견이나 수정 사항은 아래 댓글 창에 남겨주세요.</li>
<li>번역문에 대응하는 영어 원문을 보고싶으신 분들을 위해 <a href="https://nlpinkorean.github.io">찬</a>님께서 만들어두신 툴팁 도움말 기능(해당 문단에 마우스를 올리면 (모바일의 경우 터치) 원문을 확인할 수 있는 기능)을 가져와서 적용했습니다. 감사합니다.</li>
</ul>
<!--
### Just Add Memory
So far, our models have only considered the keys and values from the current segment. What's to stop us from adding a bunch more keys and values representing words from previous tokens? Nothing stops us! That's exactly what memory is in this context
<div class="img-div-any-width" markdown="0">
<image src="/images/xlnet/memory-self-attention.png"/>
<br />
</div>
And there we have it! The model can now incorporate all previous tokens in previous segments into the self-attention calculation.
Let's go over an example to make sure we're on the same page. Say we want to process the first eight words of The Second Law using a toy memory-transformer with one block and four token segment length. From now on, we'll show vectors vertically rather than horizontally so we can squeeze them into matrices:
### Memory-Compression
In practice, we can very quickly run out memory if we memorize the keys and values of all previous tokens in a long text sequence. Here, we can turn to the idea of compressing this memory to save space. If we're to rotate our key and value vectors like the following:
<div class="img-div-any-width" markdown="0">
<image src="/images/xlnet/keys-and-values.png"/>
<br />
</div>
We can compress it by compressing every three vectors into one:
<div class="img-div-any-width" markdown="0">
<image src="/images/xlnet/transformer-memory-compression.png"/>
<br />
</div>
The compression is done using a convolutional neural network which learns (during training time) how to effectively turn every three key vectors into a a single vector. Likewise with the values vectors. Again, in technical jargon, the compression is done using a CNN with a kernel size of 3 and a stride of 3.
-->chloamme이 글은 Jay Alammar님의 글을 번역한 글입니다. [추가정보] This post is a translated version of The Illustrated GPT-2 (Visualizing Transformer Language Models) by Jay Alammar.