Jekyll2023-05-27T05:41:53+00:00/feed.xmlCHLMLNLPWrite an awesome description for your new site here. You can edit this line in _config.yml. It will appear in your document head meta (for Google search results) and in your feed.xml site description.chloammeAdvice on reading research papers by Prof. Andrew Ng (CS230)2022-10-20T00:00:00+00:002022-10-20T00:00:00+00:00/2022/10/20/advice_on_reading_research_papers<p>This post contains my notes on the ‘Reading Research Papers’ part of the Stanford CS230 Deep Learning course lecture on YouTube <a href="https://youtu.be/733m6qBH-jI">URL</a>. The class is all about advice on how to master a body of literature around a topic you want to learn, so I’ve written down Prof. Ng’s blackboard notes and verbal comments.</p>
<ul>
<li>Compiled list of papers (+ Medium/blog posts)</li>
<li>Skip around list
<ul>
<li><img src="/images/221020-reading_research_papers/1.png" alt="" /></li>
<li>(1) Read about 10% of each paper, or try to quickly skim and understand each of these papers.</li>
<li>(2) If, based on that, you decide that paper 2 is worthless (others say the authors got it wrong, or you read it and it just doesn’t make sense), then go ahead and forget it.</li>
<li>(3) As you skip around to different papers, you might decide that paper 3 is a seminal one, and then continuously go ahead and read and understand the whole thing.</li>
<li>(4) Based on that, you might then find the 6th paper from the citations and read that.</li>
<li>(5) Go back and flesh out your understanding of paper 4.</li>
<li>(6) Then find a paper 7 and read that all the way to the conclusion.</li>
</ul>
</li>
<li>Number of papers you should read
<ul>
<li>15-20 papers: good enough to do some work, and apply some algorithms.</li>
<li>50-100 papers: enough to give you a very good understanding of an area.</li>
</ul>
</li>
<li>Read 1 paper
<ul>
<li>The bad way to read a paper is to go from the first word straight through to the last word.</li>
<li>Take multiple passes through the paper.
<ol>
<li>Title/abstract/figures: these carry the most information.</li>
<li>Intro + Conclusions + Figures + Skim rest. (Skip/skim related work)</li>
<li>Read but skip/skim math</li>
<li>Whole thing, but skip parts that don’t make sense.</li>
</ol>
</li>
</ul>
</li>
<li>Questions for a good understanding of the paper
<ul>
<li>What did authors try to accomplish?</li>
<li>What were the key elements of the approach?</li>
<li>What can you use yourself?</li>
<li>What other references do you want to follow?</li>
</ul>
</li>
<li>Sources of papers
<ul>
<li>Twitter (@kiankatan, @AndrewYNg)</li>
<li>ML subreddit</li>
<li>NIPS/ICML/ICLR</li>
<li>Friends</li>
<li>Arxiv sanity</li>
<li>(TMK) Papers with code (Web), Alpha zeta vector (YouTube)</li>
</ul>
</li>
<li>To understand the paper more deeply
<ul>
<li>Math
<ul>
<li>Rederive from scratch</li>
</ul>
</li>
<li>Code
<ul>
<li>Download/Run open-source code</li>
<li>Reimplement from scratch</li>
</ul>
</li>
</ul>
</li>
<li>Longer-term advice
<ul>
<li>Steady reading</li>
<li>Not short bursts</li>
</ul>
</li>
</ul>
<p>I’m grateful to Andrew Ng for sharing his wise and practical advice. Ending the lecture, he said, ‘some of this I wish I had known when I was a first-year Ph.D. student, but c’est la vie.’ I agree, but I think I’m lucky to know this now. It’s never too late to start. So keep going, and happy reading!</p>chloammeThis post contains my notes on the ‘Reading Research Papers’ part of the Stanford CS230 Deep Learning course lecture on YouTube URL. The class is all about advice on how to master a body of literature around a topic you want to learn, so I’ve written down Prof. Ng’s blackboard notes and verbal comments.My e-book reader device review2022-10-01T00:00:00+00:002022-10-01T00:00:00+00:00/2022/10/01/my_ebook_reader_review<p>I decided to read a bunch of books and papers in my free time. I had been floating from the library to study cafes like a nomad, preparing printed papers or books to read every day, but I often wanted to read about other topics depending on my mood.</p>
<p>So, I thought I needed a device to help me read, one that I could use anytime and anywhere. I already had an iPad mini, but its screen wasn’t a good fit. I got attracted to the Onyx Boox Nova 3 Color e-book reader and decided to buy it. At the time, there wasn’t any stock available on the website, so I reserved one. After a week, they notified me that the device was available, and I happily bought it right away. But a new, even cheaper version came out a week or so later. I felt a little sad because I hadn’t been informed about the new version, but it’s okay.</p>
<p>While using the device, I found a few things uncomfortable and thought I needed another device. So I decided to buy a new one, but this time, I bought the <strong>13.3” big device</strong> from the same brand.</p>
<table>
<tbody>
<tr>
<td><img src="/images/221001-ebook_reader/1.png" alt="" /></td>
<td><img src="/images/221001-ebook_reader/2.png" alt="" /></td>
</tr>
</tbody>
</table>
<ul>
<li>Pros
<ul>
<li><em>Eye comfort:</em>
I’m an avid user of reading devices. I used an iPad mini for a long time because of its convenience, but whenever I read a book or a paper on it, my eyes felt strained. An e-book device makes reading much more comfortable.</li>
<li><em>Reading quality:</em>
It is easy and fast for me to read an article, especially one written in my native language. In the past, I used a small reading device (7.8 inches); I couldn’t read properly on it, so it felt a little cramped.</li>
<li><em>Portability:</em>
It weighs only around 570 g, so I can carry it anywhere and read comfortably. It fits in any small backpack or briefcase, and its size is similar to an A4 folder.</li>
<li><em>Downloadable applications:</em>
Unlike the Kindle, Onyx devices are based on Android, and the Boox 13.3 is a brand-new model running Android 11. So we can use the Play Store and download various applications.</li>
</ul>
</li>
<li>Cons
<ul>
<li><em>Notepad:</em>
The device is great for reading, and it comes with a pen for writing notes. But the pen synchronization is not good; I’m not sure if it’s only my device, but there’s an area of the screen that doesn’t recognize the pen.</li>
<li><em>Watching videos:</em>
I watch a lot of lecture videos and spend most of my time on YouTube, and I’m especially interested in blackboard-style presentations. So I thought watching them on a grayscale E-ink display would be fine, but some content doesn’t display well on the device.</li>
<li><em>Browser add-ons:</em>
I hadn’t realized how much I depend on Chrome extensions (e.g., translation, Pocket, highlighter). The Chrome browser I can install on the Onyx Boox doesn’t support extensions. I tried Firefox add-ons, Brave add-ons, etc., but they don’t work well either. This time, I tried the Edge browser with a translation add-on, but it behaves slightly differently from Chrome, so it still feels uncomfortable.</li>
</ul>
</li>
</ul>chloammeI have decided to read a bunch of books and papers during my free time. I have been floating from the library to study cafes like a nomad. I prepared printed papers or books to read every day, but I often wanted to read other topics depending on my mood.[번역] 그림으로 설명하는 Retrieval Transformer2022-01-08T00:00:00+00:002022-01-08T00:00:00+00:00/2022/01/08/illustrated-retrieval-transformer-korean<div class="tooltip">
<blockquote>
<p>이 글은 <a href="http://jalammar.github.io/illustrated-retrieval-transformer/">Jay Alammar님의 글</a>을 번역한 글입니다. [<a href="#additional-info">추가정보</a>]
<span class="tooltiptext">
This post is a translated version of <a href="http://jalammar.github.io/illustrated-retrieval-transformer/">The Illustrated Retrieval Transformer</a> by Jay Alammar.
</span></p>
</blockquote>
</div>
<p><br /></p>
<div class="tooltip">
<p><strong>요약</strong>: 최신 언어 모델들은 훨씬 작지만 DB를 쿼리하거나 웹에서 정보를 검색할 수 있어서 GPT-3와 같은 성능을 달성할 수 있습니다. 우리가 주목해봐야하는 점은 성능을 향상하기 위한 유일한 방법이 더 큰 모델을 만드는 것이 아니라는 것입니다.
<span class="tooltiptext">
Summary: The latest batch of language models can be much smaller yet achieve GPT-3 like performance by being able to query a database or search the web for information. A key indication is that building larger and larger models is not the only way to improve performance.
</span></p>
</div>
<hr />
<div class="tooltip">
<p>최근 몇년간 우리는 거대 언어 모델(LLM; Large Language Model)들의 등장을 봤습니다 – 어떻게 기계가 (인간의) 언어를 처리하고 생성하는 방식을 빠르게 개선한 기계 학습(ML; Machine Learning) 모델입니다. 2017년 이후 하이라이트는 다음과 같습니다:
<span class="tooltiptext">
The last few years saw the rise of Large Language Models (LLMs) – machine learning models that rapidly improve how machines process and generate language. Some of the highlights since 2017 include:
</span></p>
</div>
<div class="tooltip">
<ul>
<li>최초의 <a href="http://jalammar.github.io/illustrated-transformer/">Transformer</a>는 기계 번역의 이전 성능 기록을 경신했습니다.</li>
<li><a href="http://jalammar.github.io/illustrated-bert/">BERT</a>는 pre-training 후 finetuning 프로세스와 Transformer 기반의 contextual word embedding을 대중화합니다. 그리고 빠르게 <a href="https://blog.google/products/search/search-language-understanding-bert/">Google 검색</a> 및 <a href="https://azure.microsoft.com/en-us/blog/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus/">Bing 검색</a>에 파워를 넣기 시작합니다.</li>
<li><a href="http://jalammar.github.io/illustrated-gpt2/">GPT-2</a>는 기계가 인간처럼 글을 쓰는 능력을 보여줍니다.</li>
<li>먼저 <a href="https://arxiv.org/abs/1910.10683">T5</a>가, 그 뒤로는 <a href="https://huggingface.co/bigscience/T0pp">T0</a>가 transfer learning (한 task에서 모델을 훈련시킨 후, 거기서 얻은 정보를 바탕으로 다른 비슷한 task에 대해 잘 수행하도록 함) 및 text-to-text task들 처럼 많은 다형의 task들을 수행하는 것의 경계를 확장합니다.</li>
<li><a href="http://jalammar.github.io/how-gpt3-works-visualizations-animations/">GPT-3</a>는 생성 모델의 대규모 스케일링(크기 확장)이 충격적인 새로운 어플리케이션으로 이어질 수 있음을 보여주었습니다 (업계에서는 <a href="https://deepmind.com/research/publications/2021/scaling-language-models-methods-analysis-insights-from-training-gopher">Gopher</a>, <a href="https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/">MT-NLG</a> 등과 같이 더 큰 모델을 꾸준히 학습하고 있습니다).
<span class="tooltiptext">
<span>*</span> The original <a href="http://jalammar.github.io/illustrated-transformer/">Transformer</a> breaks previous performance records for machine translation.
<span>*</span> <a href="http://jalammar.github.io/illustrated-bert/">BERT</a> popularizes the pre-training then finetuning process, as well as Transformer-based contextualized word embeddings. It then rapidly starts to power <a href="https://blog.google/products/search/search-language-understanding-bert/">Google Search</a> and <a href="https://azure.microsoft.com/en-us/blog/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus/">Bing Search</a>.
<span>*</span> <a href="http://jalammar.github.io/illustrated-gpt2/">GPT-2</a> demonstrates the machine’s ability to write as well as humans do.
<span>*</span> First <a href="https://arxiv.org/abs/1910.10683">T5</a>, then <a href="https://huggingface.co/bigscience/T0pp">T0</a> push the boundaries of transfer learning (training a model on one task, and then having it do well on other adjacent tasks) and posing a lot of different tasks as text-to-text tasks.
<span>*</span> <a href="http://jalammar.github.io/how-gpt3-works-visualizations-animations/">GPT-3</a> showed that massive scaling of generative models can lead to shocking emergent applications (the industry continues to train larger models like <a href="https://deepmind.com/research/publications/2021/scaling-language-models-methods-analysis-insights-from-training-gopher">Gopher</a>, <a href="https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/">MT-NLG</a>…etc).
</span></li>
</ul>
</div>
<div class="tooltip">
<p>한동안은 모델의 크기를 확장해나가는 것이 성능을 향상시키는 주요 방법인 것처럼 보였습니다. 하지만 DeepMind의 <a href="https://deepmind.com/research/publications/2021/improving-language-models-by-retrieving-from-trillions-of-tokens">RETRO Transformer</a> 및 OpenAI의 <a href="https://openai.com/blog/improving-factual-accuracy/">WebGPT</a>와 같은 이 분야의 최근 개발에서, 정보를 검색/쿼리하는 방법으로 확장한다면 작은 생성 언어 모델도 거대 모델과 동등한 성능을 발휘할 수 있음을 보여줌으로써 트렌드를 바꿔놓았습니다.
<span class="tooltiptext">
For a while, it seemed like scaling larger and larger models is the main way to improve performance. Recent developments in the field, like DeepMind’s RETRO Transformer and OpenAI’s WebGPT, reverse this trend by showing that smaller generative language models can perform on par with massive models if we augment them with a way to search/query for information.
</span></p>
</div>
<div class="tooltip">
<p>이 글은 DeepMind의 RETRO (<strong>R</strong>etrieval-<strong>E</strong>nhanced <strong>TR</strong>ansf<strong>O</strong>rmer)와 그 동작방식을 설명합니다. 이 모델은 사이즈가 4% (75억개 파라미터 vs. GPT-3 다빈치 모델의 경우 1850억개 파라미터) 밖에 되지 않지만 GPT-3와 유사하게 성능을 냅니다.
<span class="tooltiptext">
This article breaks down DeepMind’s RETRO (<strong>R</strong>etrieval-<strong>E</strong>nhanced <strong>TR</strong>ansf<strong>O</strong>rmer) and how it works. The model performs on par with GPT-3 despite being 4% its size (7.5 billion parameters vs. 185 billion for GPT-3 Da Vinci).
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/deepmind-retro-retrieval-transformer.png" />
<br />
RETRO는 database에서 retrieval된 정보를 추가하여, 파라미터들이 fact와 world knowledge의 값비싼 저장소가 되는 것을 방지합니다.
</div>
<span class="tooltiptext">
RETRO incorporates information retrieved from a database to free its parameters from being an expensive store of facts and world knowledge.
</span>
</div>
<div class="tooltip">
<p>RETRO는 <a href="https://arxiv.org/abs/2112.04426">Improving Language Models by Retrieving from Trillions of Tokens</a> 논문에 기술되어 있습니다. 이 모델은 연구 커뮤니티의 넓고 다양한 retrieval work를 이어받아 그 위에 구축되었습니다. (참고: <a href="http://www.crm.umontreal.ca/2018/Langue18/pdf/Cheung.pdf">1</a> <a href="https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/">2</a> <a href="https://openreview.net/forum?id=HklBjCEKvH">3</a> <a href="https://arxiv.org/abs/2102.02557">4</a> <a href="https://openreview.net/forum?id=B184E5qee">5</a>) 이 글은 RETRO 모델에 대한 설명이며, 모델의 참신성에 대한 것은 아닙니다.
<span class="tooltiptext">
RETRO was presented in the paper <a href="https://arxiv.org/abs/2112.04426">Improving Language Models by Retrieving from Trillions of Tokens</a>. It continues and builds on a wide variety of retrieval <a href="http://www.crm.umontreal.ca/2018/Langue18/pdf/Cheung.pdf">work</a> <a href="https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/">in</a> <a href="https://openreview.net/forum?id=HklBjCEKvH">the</a> <a href="https://arxiv.org/abs/2102.02557">research</a> <a href="https://openreview.net/forum?id=B184E5qee">community</a>. This article explains the model and not what is especially novel about it.
</span></p>
</div>
<!--more-->
<div class="tooltip">
<h2 id="world-knowledge--">중요한 이유: 언어 정보를 World Knowledge 정보와 분리시킴</h2>
<p><span class="tooltiptext">
Why This is Important: Separating Language Information from World Knowledge Information
</span></p>
</div>
<div class="tooltip">
<p>언어 모델링은 기본적으로 문장 끝의 빈칸(blank)을 채우기 위해 다음 단어를 예측하도록 모델을 학습합니다.
<span class="tooltiptext">
Language modeling trains models to predict the next word–to fill-in-the-blank at the end of the sentence, essentially.
</span></p>
</div>
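<p>위에서 말한 ‘다음 단어 예측’을 아주 단순한 장난감 예제로 스케치하면 다음과 같습니다 (가정: 실제 신경망 언어 모델 대신 bigram count를 사용한 설명용 예시이며, corpus 내용과 함수 이름은 모두 임의로 지은 것입니다):</p>

```python
from collections import Counter, defaultdict

# Toy illustration (not RETRO itself): language modeling as
# fill-in-the-blank, here with a bigram count model that predicts the
# next word as the one most often seen following the current word.
corpus = "the film was released in 2021 and the film was praised".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    # Most frequent continuation observed in the corpus.
    return following[word].most_common(1)[0][0]

print(predict_next("film"))  # the word most often seen after "film"
```

<p>실제 모델은 count 대신 파라미터에 학습된 분포로 다음 토큰의 확률을 계산한다는 점만 다릅니다.</p>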
<div class="tooltip">
<p>빈칸을 채우는 것은 가끔 사실 정보 (예. 이름 또는 날짜)의 지식이 필요합니다. 예를 들면:
<span class="tooltiptext">
Filling the blank sometimes requires knowledge of factual information (e.g. names or dates). For example:
</span></p>
</div>
<div class="img-div">
<img src="/images/retro/prompt-1.png" />
<br />
입력 프롬프트: 영화 듄은 출시(시간정보)가 ....(이다).
</div>
<div class="tooltip">
<p>다른 경우에는, 해당 언어에 대한 친숙도가 빈칸에 들어갈 내용을 추측하기에 충분합니다.
예를 들어:
<span class="tooltiptext">
Other times, familiarity with the language is enough to guess what goes in the blank. For example:
</span></p>
</div>
<div class="img-div">
<img src="/images/retro/prompt-2.png" />
<br />
입력 프롬프트: 입소문으로 퍼진 인기가 Herbert가 시작할 수 있게 했습니다, 본격적인 ....(을).
</div>
<div class="tooltip">
<p>이 구분은 LLM이 알고 있는 모든 것을 모델 파라미터에 인코딩하기 때문에 중요합니다. 이 것은 언어 정보에 대해서는 타당하지만, fact 및 world-knowledge 정보에 대해서는 비효율적입니다.
<span class="tooltiptext">
This distinction is important because LLMs encoded everything they know in their model parameters. While this makes sense for language information, it is inefficient for factual and world-knowledge information.
</span></p>
</div>
<div class="tooltip">
<p>언어 모델에 retrieval 방법을 추가함으로써, 모델은 훨씬 작아질 수 있습니다. neural database는 텍스트 생성 중에 필요한 fact 정보를 retrieving 하는 것에 도움이 됩니다.
<span class="tooltiptext">
By including a retrieval method in the language model, the model can be much smaller. A neural database aids it with retrieving factual information it needs during text generation.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/Large-GPT-vs-Retro-transformer-world-knowledge-information.png" />
<br />
retrieval 방법으로 언어 모델을 지원하는 것은, 언어 모델이 텍스트를 잘 생성하기 위해 파라미터에 인코딩하는 정보의 양(크기)를 줄일 수 있게 합니다.
</div>
<span class="tooltiptext">
Aiding language models with retrieval methods allows us to reduce the amount of information a language model needs to encode in its parameters to perform well at text generation.
</span>
</div>
<div class="tooltip">
<p>훈련 데이터 암기(memorization)가 줄어들기 때문에 작은 언어 모델로 훈련이 빨라집니다. 누구나 작고 저렴한 GPU를 이용하여 이런 작은 모델들을 deploy할 수 있고 필요에 따라 조정할 수 있습니다.
<span class="tooltiptext">
Training becomes fast with small language models, as training data memorization is reduced. Anyone can deploy these models on smaller and more affordable GPUs and tweak them as per need.
</span></p>
</div>
<div class="tooltip">
<p>구조적으로, RETRO는 최초의 transformer와 같은 encoder-decoder 모델입니다. 하지만 retrieval database의 도움으로 입력 시퀀스가 보강됩니다. 모델은 database에서 가장 유력한 시퀀스를 찾고 입력에 추가합니다. 그러면 RETRO는 출력 예측을 생성하기 위한 마법을 겁니다.
<span class="tooltiptext">
Mechanically, RETRO is an encoder-decoder model just like the original transformer. However, it augments the input sequence with the help of a retrieval database. The model finds the most probable sequences in the database and adds them to the input. RETRO works its magic to generate the output prediction.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/dune-prompt-into-retro-transformer-4.png" />
<br />
RETRO는 database를 활용하여 입력 프롬프트를 보강합니다. 프롬프트는 관련 정보를 database에서 retrieve하는데 사용됩니다.
</div>
<span class="tooltiptext">
RETRO utilizes a database to augment its input prompt. The prompt is used to retrieve relevant information from the database.
</span>
</div>
<div class="tooltip">
<p>모델 아키텍처를 보기 전에, retrieval database에 대해 깊이 있게 알아보겠습니다.
<span class="tooltiptext">
Before we explore the model architecture, let’s dig deeper into the retrieval database.
</span></p>
</div>
<div class="tooltip">
<h2 id="retro-retrieval-database--">RETRO의 Retrieval Database 세부 사항</h2>
<p><span class="tooltiptext">
Inspecting RETRO’s Retrieval Database
</span></p>
</div>
<div class="tooltip">
<p>database는 key-value 저장소입니다.
<span class="tooltiptext">
The database is a key-value store.
</span></p>
</div>
<div class="tooltip">
<p>key는 일반적으로 사용하는 BERT 문장 임베딩입니다.
<span class="tooltiptext">
The key is a standard BERT sentence embedding.
</span></p>
</div>
<div class="tooltip">
<p>value는 2파트의 텍스트로 되어있습니다:
<span class="tooltiptext">
The value is text in two parts:
</span></p>
</div>
<div class="tooltip">
<ol>
<li>Neighbor: key를 계산하기 위해 사용됨</li>
<li>Completion: 원본 문서에서 text의 연속 (Neighbor 문장에 연속적으로 이어지는 다음 텍스트)
<span class="tooltiptext">
1. Neighbor, which is used to compute the key
2. Completion, the continuation of the text in the original document.
</span></li>
</ol>
</div>
<div class="tooltip">
<p>RETRO에서의 database는 <em>MassiveText</em> 데이터셋으로부터의 2조개의 다국어(multi-lingual) 토큰을 가지고 있습니다. 길이는 neighbor chunk와 completion chunk 모두 최대 64개 토큰까지 가능합니다.
<span class="tooltiptext">
RETRO’s database contains 2 trillion multi-lingual tokens based on the <em>MassiveText</em> dataset. Both the neighbor and completion chunks are at most 64 tokens long.
</span></p>
</div>
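<p>본문의 key-value 구조를 최소한의 스케치로 표현하면 다음과 같습니다 (가정: 실제 RETRO는 frozen BERT 임베딩과 약 2조개 토큰 규모의 database를 사용하지만, 여기서는 해시 기반의 가상의 embed() 함수와 brute-force 탐색으로 대체했고, entries의 텍스트도 설명용으로 지어낸 것입니다):</p>

```python
import numpy as np

# Sketch of a key-value retrieval store: each entry's key is an embedding
# of the "neighbor" chunk; its value is the (neighbor, completion) pair.
# embed() is a deterministic stand-in (within one process) for a real
# sentence encoder such as frozen BERT.
def embed(text, dim=8):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

entries = [
    ("Dune is a 2021 film", "directed by Denis Villeneuve"),
    ("The 1965 novel Dune", "was written by Frank Herbert"),
]
keys = np.stack([embed(neighbor) for neighbor, _ in entries])

def lookup(query_text, k=1):
    # Nearest neighbors by dot product (vectors are unit-normalized).
    sims = keys @ embed(query_text)
    top = np.argsort(-sims)[:k]
    return [entries[i] for i in top]
```

<p>실제 시스템에서는 brute-force 대신 근사 근접 이웃 탐색(예: SCaNN)으로 이 조회를 수행합니다.</p>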
<div class="img-div-any-width">
<img src="/images/retro/database-key-value-examples.png" />
<br />
RETRO database의 내부를 살펴보면 RETRO database의 key-value 쌍의 예를 볼 수 있습니다. value는 neighbor chunk와 completion chunk를 포함합니다.
A look inside RETRO's database shows examples of key-value pairs in the RETRO database. The value contains a neighbor chunk and a completion chunk.
</div>
<div class="tooltip">
<p>RETRO는 입력 프롬프트를 여러개의 chunk로 쪼갭니다. 단순하게 설명하기 위해, 하나의 chunk가 retrieve된 text로 보강되는 방법을 집중해서 살펴보겠습니다. 하지만, 모델은 입력 프롬프트에서 (첫번째를 제외하고) 각 chunk에 대해 이 프로세스를 수행합니다.
<span class="tooltiptext">
RETRO breaks the input prompt into multiple chunks. For simplicity, we’ll focus on how one chunk is augmented with retrieved text. The model, however, does this process for each chunk (except the first) in the input prompt.
</span></p>
</div>
<div class="tooltip">
<h2 id="database-">Database 조회</h2>
<p><span class="tooltiptext">
The Database Lookup
</span></p>
</div>
<div class="tooltip">
<p>RETRO에 들어가기 전에 입력 프롬프트는 BERT에 먼저 입력됩니다. 출력 contextualized vector들의 평균을 계산하여 문장 임베딩 벡터를 구성합니다. 그렇게 만들어진 벡터는 database를 쿼리하는데에 사용됩니다.
<span class="tooltiptext">
Before hitting RETRO, the input prompt goes into BERT. The output contextualized vectors are then averaged to construct a sentence embedding vector. That vector is then used to query the database.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/retro/bert-sentence-embedding.png" />
<br />
BERT로 입력 프롬프트를 처리하면 contextualized 토큰 임베딩이 생성됩니다. 그 결과들의 평균을 계산해서 문장 임베딩을 생성합니다.
Processing the input prompt with BERT produces contextualized token embeddings. Averaging them produces a sentence embedding.
</div>
<div class="tooltip">
<p>그 문장 임베딩은 <a href="https://github.com/google-research/google-research/tree/master/scann">근사 근접 이웃 탐색(approximate nearest neighbor search)</a>에서 사용됩니다.
<span class="tooltiptext">
That sentence embedding is then used in an approximate nearest neighbor search (<a href="https://github.com/google-research/google-research/tree/master/scann">https://github.com/google-research/google-research/tree/master/scann</a>).
</span></p>
</div>
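<p>평균 계산(mean-pooling)으로 문장 임베딩을 만드는 단계는 다음과 같이 스케치할 수 있습니다 (가정: 실제 BERT 활성값 대신 난수 벡터를 사용하며, 토큰 개수 등의 shape도 설명용입니다):</p>

```python
import numpy as np

# Mean-pooling sketch: BERT emits one contextualized vector per token;
# the lookup key is simply their average. Random vectors stand in for
# real BERT activations here.
rng = np.random.default_rng(0)
num_tokens, hidden = 6, 768           # e.g. 6 tokens, BERT-base hidden size
token_vectors = rng.normal(size=(num_tokens, hidden))

sentence_embedding = token_vectors.mean(axis=0)   # shape (hidden,)
```

<p>이렇게 얻은 벡터 하나가 근사 근접 이웃 탐색의 쿼리로 쓰입니다.</p>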
<div class="tooltip">
<p>두개의 근접 이웃이 retrieve되고, 그 텍스트가 RETRO의 입력의 일부가 됩니다.
<span class="tooltiptext">
The two nearest neighbors are retrieved, and their text becomes a part of the input into RETRO.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/neighbor-retrieval-from-retro-neural-database-with-bert-embeddings.png" />
<br />
BERT 문장 임베딩은 RETRO의 neural database에서 근접 이웃을 retrieve하는데에 사용됩니다. 검색된 결과들은 언어 모델의 입력에 추가됩니다.
</div>
<span class="tooltiptext">
The BERT sentence embedding is used to retrieve the nearest neighbors from RETRO's neural database. These are then added to the input of the language model.
</span>
</div>
<div class="tooltip">
<p>RETRO의 입력은, 입력 프롬프트와 그것으로 database에서 검색한 두개의 근접 이웃 (및 그들의 continuation(이어지는 다음 문장))입니다.
<span class="tooltiptext">
This is now the input to RETRO. The input prompt and its two nearest neighbors from the database (and their continuations).
</span></p>
</div>
<div class="tooltip">
<p>이제부터 Transformer 블록과 RETRO 블록이 이 정보를 프로세싱에 통합합니다.
<span class="tooltiptext">
From here, the Transformer and RETRO Blocks incorporate the information into their processing.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/input-prompt-and-retrieved-text-retro-transformer.png" />
<br />
retrieve된 이웃은 언어 모델의 입력에 추가됩니다. 하지만, 그 이웃들은 모델 안에서 조금 다르게 처리됩니다.
</div>
<span class="tooltiptext">
The retrieved neighbors are added to the input of the language model. They're treated a little differently inside the model, however.
</span>
</div>
<div class="tooltip">
<h2 id="retro--">RETRO 아키텍처 조망</h2>
<p><span class="tooltiptext">
RETRO Architecture at a High Level
</span></p>
</div>
<div class="tooltip">
<p>RETRO 아키텍처는 인코더 스택과 디코더 스택으로 구성되어 있습니다.
<span class="tooltiptext">
RETRO’s architecture is an encoder stack and a decoder stack.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/Retro-transformer-encoder-decoder-stacks-2.png" />
<br />
RETRO transformer는 (neighbor들을 처리하기 위한) 인코더 스택과 (입력을 처리하기 위한) 디코더 스택으로 구성되어 있습니다.
</div>
<span class="tooltiptext">
A RETRO transformer consists of an encoder stack (to process the neighbors) and a decoder stack (to process the input)
</span>
</div>
<div class="tooltip">
<p>인코더는 기본 Transformer 인코더 블록들(self-attention + FFNN)로 구성되어 있습니다. 제가 알기로는, Retro는 두개의 Transformer 인코더 블록으로 구성된 인코더를 사용합니다.
<span class="tooltiptext">
The encoder is made up of standard Transformer encoder blocks (self-attention + FFNN). To my best understanding, Retro uses an encoder made up of two Transformer Encoder Blocks.
</span></p>
</div>
<div class="tooltip">
<p>디코더 스택은 두 종류의 디코더 블록을 interleave(사이에 끼워넣음) 합니다:
<span class="tooltiptext">
The decoder stack interleaves two kinds of decoder blocks:
</span></p>
</div>
<div class="tooltip">
<ul>
<li>기본 transformer 디코더 블록 (ATTN + FFNN)</li>
<li>RETRO 디코더 블록 (ATTN + CCA(Chunked Cross Attention) + FFNN)
<span class="tooltiptext">
<span>*</span> Standard transformer decoder block (ATTN + FFNN)
<span>*</span> RETRO decoder block (ATTN + Chunked cross attention (CCA) + FFNN)
</span></li>
</ul>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/retro-transformer-blocks-4.png" />
<br />
세 종류의 Transformer 블록이 RETRO를 구성합니다.
</div>
<span class="tooltiptext">
The three types of Transformer blocks that make up RETRO
</span>
</div>
<div class="tooltip">
<p>retrive된 이웃을 처리하여, 나중에 attention을 위해 사용될 KEYS 및 VALUES 행렬을 생성하는 인코더 스택부터 살펴보겠습니다 (복습을 위해 <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a> 참고하세요).
<span class="tooltiptext">
Let’s start by looking at the encoder stack, which processes the retrieved neighbors, resulting in KEYS and VALUES matrices that will later be used for attention (see <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a> for a refresher).
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/retro-encoder-block-keys-values-2.png" />
<br />
인코더 스택은 retrive된 이웃을 처리하여 KEYS 및 VALUES 행렬을 생성합니다.
</div>
<span class="tooltiptext">
The encoder stack processes the retrieved neighbors resulting in KEYS and VALUE matrices
</span>
</div>
<div class="tooltip">
<p>디코더 블록은 입력 텍스트를 GPT가 하는 것처럼 처리합니다. 프롬프트 token에 self-attention을 적용시키고 (인과적으로, 이전 토큰에만 attention을 줌), FFNN 레이어를 통과시킵니다.
<span class="tooltiptext">
Decoder blocks process the input text just like a GPT would. It applies self-attention on the prompt token (causally, so only attending to previous tokens), then passes through a FFNN layer.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/retro-transformer-decoders-2.png" />
<br />
입력 프롬프트는 self-attention 및 FFNN 레이어를 포함하는 일반 디코더 블록을 통과합니다.
</div>
<span class="tooltiptext">
Input prompt passes through standard decoder block containing self-attention and FFNN layers
</span>
</div>
<div class="tooltip">
<p>RETRO 디코더에 도달해서야 retrieve된 정보를 통합하기 시작합니다. 9번째부터 시작해서 매 세번째 블록이 RETRO 블록입니다 (입력이 이웃에 attention을 줄 수 있도록 합니다). 그래서 9, 12, 15…32 레이어가 RETRO 블록입니다. (두개의 더 작은 Retro 모델과 Retrofit 모델들은 이러한 레이어들이 9번째가 아닌 6번째부터 시작됩니다.)
<span class="tooltiptext">
It’s only when a RETRO decoder is reached do we start to incorporate the retrieved information. Every third block starting from 9 is a RETRO block (that allows its input to attend to the neighbors). So layers 9, 12, 15…32 are RETRO blocks. (The two smaller Retro models, and the Retrofit models have these layers starting from the 6th instead of the 9th layer.)
</span></p>
</div>
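<p>어떤 디코더 레이어가 RETRO 블록인지에 대한 위 규칙은 간단히 스케치할 수 있습니다 (가정: retro_layers 함수와 파라미터 이름은 설명용으로 지은 것이며, 본문에 기술된 규칙만 반영합니다):</p>

```python
# Which decoder layers carry chunked cross-attention (RETRO blocks)?
# Per the text above: every third layer starting at layer 9; the two
# smaller Retro models and the Retrofit models start at layer 6 instead.
def retro_layers(num_layers, start=9, every=3):
    return [i for i in range(1, num_layers + 1)
            if i >= start and (i - start) % every == 0]

print(retro_layers(15))           # layers 9, 12, 15
print(retro_layers(15, start=6))  # smaller/Retrofit variants
```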
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/retro-decoder-attention-2.png" />
<br />
입력 프롬프트는 RETRO 디코더 블록에 도달하고 정보 retrieval이 시작됩니다.
</div>
<span class="tooltiptext">
Input prompt reaches RETRO Decoder block to start information retrieval
</span>
</div>
<div class="tooltip">
<p>따라서 실질적으로, 이 단계에서 모델은 프롬프트를 완성하는데 필요한 날짜 정보를 retrieve된 정보에서 살펴볼 수 있습니다.
<span class="tooltiptext">
So effectively, this is the step where the retrieved information can glance at the dates it needs to complete the prompt.
</span></p>
</div>
<div class="tooltip">
<div class="img-div" style="display: inline-block; text-align: center; color:#92A9BD; font-size: 0.8em">
<img src="/images/retro/retro-decoder-chunked-cross-attention.png" />
<br />
Chunked Cross-Attention을 이용하여 근접 이웃 chunk로 부터 정보를 retrieving하는 RETRO 디코더 블록.
</div>
<span class="tooltiptext">
RETRO Decoder block retrieving information from nearest neighbour chunks using Chunked Cross-Attention
</span>
</div>
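<p>Chunked Cross-Attention의 바탕이 되는 cross-attention 연산 자체를 최소한으로 스케치하면 다음과 같습니다 (가정: single-head, 배치 없음, 난수 입력이며, 실제 RETRO는 이 연산을 chunk 단위로 적용합니다):</p>

```python
import numpy as np

# Minimal single-head cross-attention: decoder hidden states (queries)
# attend to the KEYS/VALUES the encoder computed from retrieved neighbors.
# Shapes are illustrative only.
rng = np.random.default_rng(0)
d = 16
queries = rng.normal(size=(4, d))    # 4 decoder positions
keys    = rng.normal(size=(10, d))   # 10 neighbor tokens
values  = rng.normal(size=(10, d))

scores = queries @ keys.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
attended = weights @ values                              # shape (4, d)
```

<p>각 디코더 위치가 이웃 토큰들의 가중 평균을 받아오는 것이 핵심이며, RETRO는 이를 chunk 경계에 맞춰 수행합니다.</p>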
<div class="tooltip">
<h2 id="section">이전 연구</h2>
<p><span class="tooltiptext">
Previous Work
</span></p>
</div>
<div class="tooltip">
<p>retrieval 기술로 언어 모델을 지원하는 것은 활발한 연구 영역입니다. 이전 연구들은 다음과 같습니다:
<span class="tooltiptext">
Aiding language models with retrieval techniques has been an active area of research. Some of the previous work in the space includes:
</span></p>
</div>
<ul>
<li><a href="https://openreview.net/forum?id=B184E5qee">Improving Neural Language Models with a Continuous Cache</a></li>
<li><a href="https://openreview.net/forum?id=HklBjCEKvH">Generalization through Memorization: Nearest Neighbor Language Models</a></li>
<li>Read the <a href="https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/">Retrieval Augmented Generation</a> blog from Meta AI and go through Jackie Chi Kit Cheung’s lecture on <a href="http://www.crm.umontreal.ca/2018/Langue18/pdf/Cheung.pdf">Leveraging External Knowledge in Natural Language Understanding Systems</a></li>
<li>SPALM: <a href="https://arxiv.org/abs/2102.02557">Adaptive Semiparametric Language Models</a></li>
<li>DPR: <a href="https://aclanthology.org/2020.emnlp-main.550/">Dense Passage Retrieval for Open-Domain Question Answering</a></li>
<li><a href="https://arxiv.org/abs/2002.08909">REALM: Retrieval-Augmented Language Model Pre-Training</a></li>
<li>FiD: <a href="https://aclanthology.org/2021.eacl-main.74/">Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering</a></li>
<li>EMDR: <a href="https://arxiv.org/abs/2106.05346">End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering</a></li>
<li>BlenderBot 2.0: <a href="https://arxiv.org/abs/2107.07566">Internet-Augmented Dialogue Generation</a></li>
</ul>
<div class="tooltip">
<p>수정사항이나 피드백이 있으시면, <a href="https://github.com/jalammar/jalammar.github.io/discussions/21">이 쓰레드</a>에 글을 남겨주시거나 <a href="https://twitter.com/JayAlammar">이 트위터</a>로 연락 주세요.
<span class="tooltiptext">
Please post in <a href="https://github.com/jalammar/jalammar.github.io/discussions/21">this thread</a> or reach out to me on <a href="https://twitter.com/JayAlammar">Twitter</a> for any corrections or feedback.
</span></p>
</div>
<hr />
<h2 id="추가-정보">Additional Info<a href="#additional-info" name="additional-info">.</a></h2>
<ul>
<li>This post is a translation, published with the permission of the author, Jay Alammar, of his illustrated explanation of the Retrieval Transformer. The original is available at <a href="http://jalammar.github.io/illustrated-retrieval-transformer/">The Illustrated Retrieval Transformer</a>.</li>
<li>To keep terminology compatible with the English original and other English-language resources, the terms and phrases commonly used in this field were mostly left untranslated. The translation also favors sentences that explain the concepts clearly over a literal, word-for-word rendering. Please leave opinions or corrections about the translation in the comments below.</li>
<li>For readers who want to see the English source text alongside the translation, the tooltip feature built by <a href="https://nlpinkorean.github.io">Chan</a> (hover over a paragraph — or touch, on mobile — to see the original text) has been adopted here. Many thanks.</li>
</ul>
</ul><h1 id="vector-space-models-designs-distances-basic-reweighting">Vector-space models: designs, distances, basic reweighting</h1>
<pre><code class="language-notebook">__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2019"
# This document adds explanations (not a translation) to Prof. Christopher Potts's CS224U course materials, under the Apache License 2.0.
</code></pre>
<h2 id="contents">Contents</h2>
<ol>
<li><a href="#Overview">Overview</a></li>
<li><a href="#Motivation">Motivation</a></li>
<li><a href="#Terminological-notes">Terminological notes</a></li>
<li><a href="#Set-up">Set-up</a></li>
<li><a href="#Matrix-designs">Matrix designs</a></li>
<li><a href="#Pre-computed-example-matrices">Pre-computed example matrices</a></li>
<li><a href="#Vector-comparison">Vector comparison</a>
<ol>
<li><a href="#Euclidean">Euclidean</a></li>
<li><a href="#Length-normalization">Length normalization</a></li>
<li><a href="#Cosine-distance">Cosine distance</a></li>
<li><a href="#Matching-based-methods">Matching-based methods</a></li>
<li><a href="#Summary">Summary</a></li>
</ol>
</li>
<li><a href="#Distributional-neighbors">Distributional neighbors</a></li>
<li><a href="#Matrix-reweighting">Matrix reweighting</a>
<ol>
<li><a href="#Normalization">Normalization</a></li>
<li><a href="#Observed/Expected">Observed/Expected</a></li>
<li><a href="#Pointwise-Mutual-Information">Pointwise Mutual Information</a></li>
<li><a href="#TF-IDF">TF-IDF</a></li>
</ol>
</li>
<li><a href="#Subword-information">Subword information</a></li>
<li><a href="#Visualization">Visualization</a></li>
<li><a href="#Exploratory-exercises">Exploratory exercises</a></li>
</ol>
<h2 id="overview">Overview</h2>
<p>This notebook is the first in our series about creating effective <strong>distributed representations</strong>. The focus is on matrix designs, assessing similarity, and methods for matrix reweighting.</p>
<p>The central idea (which takes some getting used to!) is that we can represent words and phrases as dense vectors of real numbers. These take on meaning by being <strong>embedded</strong> in a larger matrix of representations with comparable structure.</p>
<p>In short: we will explore <strong>distributed representations</strong>, covering <code class="language-plaintext highlighter-rouge">matrix design</code>, <code class="language-plaintext highlighter-rouge">similarity</code> computation, and <code class="language-plaintext highlighter-rouge">matrix reweighting</code>.</p>
<h2 id="motivation">Motivation</h2>
<p>Why build distributed representations? There are potentially many reasons. The two we will emphasize in this course:</p>
<ol>
<li>
<p><strong>Understanding words in context</strong>: There is value to linguists in seeing what these data-rich approaches can teach us about natural language lexicons, and there is value for social scientists in understanding how words are being used.</p>
</li>
<li>
<p><strong>Feature representations for other models</strong>: As we will see, many models can benefit from representing examples as distributed representations.</p>
</li>
</ol>
<p>In short: building <code class="language-plaintext highlighter-rouge">distributed representations</code> is worthwhile because (1) they let us understand words in context, and (2) they can serve as feature representations for other models.</p>
<h2 id="terminological-notes">Terminological notes</h2>
<ul>
<li>
<p>The distributed representations we build will always be vectors of real numbers. The models are often called <strong>vector space models</strong> (VSMs).</p>
</li>
<li>
<p><strong>Distributional representations</strong> are the special case where the data come entirely from co-occurrence counts in corpora.</p>
</li>
<li>
<p>We’ll look at models that use supervised labels to obtain vector-based word representations. These aren’t purely distributional, in that they take advantage of more than just co-occurrence patterns among items in the vocabulary, but they share the idea that words can be modeled with vectors.</p>
</li>
<li>
<p>If a neural network is used to train the representations, then they might be called <strong>neural representations</strong>.</p>
</li>
<li>
<p>The term <strong>word embedding</strong> is also used for distributed representations, including distributional ones. This term is a reminder that vector representations are meaningful only when embedded in and compared with others in a unified space (usually a matrix) of representations of the same type.</p>
</li>
<li>
<p>In any case, <strong>distributed representation</strong> seems like the most general cover term for what we’re trying to achieve, and its only downside is that sometimes people think it has something to do with distributed databases.</p>
</li>
<li>In short: building a distributed representation yields vectors of real numbers, and such models are also called <strong>vector space models (VSMs)</strong>.</li>
<li>When the data come entirely from co-occurrence counts in corpora, the result is called a <strong>distributional representation</strong>.</li>
<li>When the representations are trained with a neural network, they are called <strong>neural representations</strong>.</li>
<li>The term <strong>word embedding</strong> can refer to distributed representations, including distributional ones.</li>
<li><strong>Distributed representation</strong> is the most general cover term (and has nothing to do with distributed databases).</li>
</ul>
<p>See also: <a href="https://jair.org/index.php/jair/article/view/10640">From Frequency to Meaning: Vector Space Models for Semantics</a>.</p>
<h2 id="set-up">Set-up</h2>
<ul>
<li>
<p>Make sure your environment meets all the requirements for <a href="https://github.com/cgpotts/cs224u/">the cs224u repository</a>. For help getting set up, see <a href="setup.ipynb">setup.ipynb</a>.</p>
</li>
<li>
<p>Download <a href="http://web.stanford.edu/class/cs224u/data/data.zip">the course data</a>, unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change <code class="language-plaintext highlighter-rouge">DATA_HOME</code> below.)</p>
</li>
<li>We will examine the implementations from vsm.py by entering them directly into this notebook.</li>
<li>Since the imdb corpus has been replaced with the yelp corpus, the exercises are adjusted accordingly.</li>
</ul>
<pre><code class="language-notebook">%matplotlib inline
import numpy as np
import os
import pandas as pd
# import vsm
import scipy
import scipy.spatial.distance
from collections import defaultdict
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
</code></pre>
<pre><code class="language-notebook">DATA_HOME = os.path.join('data', 'vsmdata')
!ls $DATA_HOME
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>giga_window20-flat.csv.gz yelp_window20-flat.csv.gz
giga_window5-scaled.csv.gz yelp_window5-scaled.csv.gz
</code></pre></div></div>
<h2 id="matrix-designs">Matrix designs</h2>
<p>There are many, many ways to define distributional matrices. Here’s a schematic overview that highlights the major decisions for building a word $\times$ word matrix:</p>
<ol>
<li>
<p>Define a notion of <strong>co-occurrence context</strong>. This could be an entire document, a paragraph, a sentence, a clause, an NP — whatever domain seems likely to capture the associations you care about.</p>
</li>
<li>
<p>Define a <strong>count scaling method</strong>. The simplest method just counts everything in the context window, giving equal weight to everything inside it. A common alternative is to scale the weights based on proximity to the target word – e.g., $1/d$, where $d$ is the distance in tokens from the target.</p>
</li>
<li>
<p>Scan through your corpus building a dictionary $d$ mapping word-pairs to co-occurrence values. Every time a pair of words $w$ and $w’$ occurs in the same context (as you defined it in 1), increment $d[(w, w’)]$ by whatever value is determined by your weighting scheme. You’d increment by $1$ with the weighting scheme that simply counts co-occurrences.</p>
</li>
<li>Using the count dictionary $d$ that you collected in 3, establish your full vocabulary $V$, an ordered list of word types.
<ol>
<li>For large collections of documents, $|V|$ will typically be huge. You will probably want to winnow the vocabulary at this point.</li>
<li>You might do this by filtering to a specific subset, or just imposing a minimum count threshold.</li>
<li>You might impose a minimum count threshold even if $|V|$ is small — for words with very low counts, you simply don’t have enough evidence to support good representations.</li>
<li>For words outside the vocabulary you choose, you could ignore them entirely or accumulate all their values into a designated <em>UNK</em> vector.</li>
</ol>
</li>
<li>Now build a matrix $M$ of dimension $|V| \times |V|$. Both the rows and the columns of $M$ represent words. Each cell $M[i, j]$ is filled with the value $d[(w_{i}, w_{j})]$.</li>
</ol>
<p>In short: there are many ways to define a distributional matrix, but a word $\times$ word matrix can be built in the following order:</p>
<ol>
<li>Define a notion of <strong>co-occurrence context</strong>. This can be anything whose associations you want to capture: an entire document, a paragraph, a sentence, a clause, a noun phrase, etc.</li>
<li>Define a <strong>count scaling method</strong>. The simplest method gives every token in the context window the same weight; alternatively, weights can be scaled as $1/d$, where $d$ is the distance in tokens from the target word.</li>
<li>Scan the corpus to build a dictionary $d$ mapping word pairs to co-occurrence values. For each pair of words $w$ and $w’$ appearing in the same context (as defined in 1), increment $d[(w, w’)]$ by the value determined by the weighting scheme from 2.</li>
<li>Using the count dictionary $d$ built in 3, construct the full vocabulary $V$, an ordered list of word types.</li>
<li>Build a $|V| \times |V|$ matrix $M$ whose rows and columns are words. Each cell $M[i,j]$ is filled with the co-occurrence value $d[(w_{i}, w_{j})]$.</li>
</ol>
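The five steps above can be sketched end-to-end in a few lines of Python. This is a minimal illustration on a made-up two-sentence corpus, not the course's vsm.py implementation; it treats each sentence as one context and uses the flat (weight-1) counting scheme, with the 1/d alternative noted in a comment.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Toy corpus; each sentence is one co-occurrence context (step 1).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

# Steps 2-3: count co-occurrences, giving every pair a flat weight of 1.
d = defaultdict(float)
for context in corpus:
    for i, w in enumerate(context):
        for j, w2 in enumerate(context):
            if i != j:
                d[(w, w2)] += 1.0  # use 1.0 / abs(i - j) for 1/d scaling

# Step 4: the vocabulary is an ordered list of word types.
V = sorted({w for pair in d for w in pair})

# Step 5: fill a |V| x |V| matrix from the count dictionary.
M = np.zeros((len(V), len(V)))
for (w, w2), count in d.items():
    M[V.index(w), V.index(w2)] = count

df = pd.DataFrame(M, index=V, columns=V)
print(df.loc["cat", "sat"])  # "cat" and "sat" co-occur once, so 1.0
```

The real matrices below were built the same basic way, just from millions of tokens and with the vocabulary winnowed to the top 6K unigrams.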
<h2 id="pre-computed-example-matrices">Pre-computed example matrices</h2>
<p>The data distribution includes four matrices that we’ll use for hands-on exploration. All of them were designed in the same basic way:</p>
<ul>
<li>
<p>They are word $\times$ word matrices with 6K rows and 6K columns.</p>
</li>
<li>
<p>The vocabulary is the top 6K most frequent unigrams.</p>
</li>
</ul>
<p>Two come from Yelp user-supplied reviews, and two come from Gigaword, a collection of newswire and newspaper text. Further details:</p>
<table>
<thead>
<tr>
<th>filename</th>
<th>source</th>
<th>window size</th>
<th>count weighting</th>
</tr>
</thead>
<tbody>
<tr>
<td>yelp_window5-scaled.csv.gz</td>
<td>Yelp reviews</td>
<td>5</td>
<td>1/d</td>
</tr>
<tr>
<td>yelp_window20-flat.csv.gz</td>
<td>Yelp reviews</td>
<td>20</td>
<td>1</td>
</tr>
<tr>
<td>giga_window5-scaled.csv.gz</td>
<td>Gigaword</td>
<td>5</td>
<td>1/d</td>
</tr>
<tr>
<td>giga_window20-flat.csv.gz</td>
<td>Gigaword</td>
<td>20</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>Any hunches about how these matrices might differ from each other?</p>
<pre><code class="language-notebook">yelp5 = pd.read_csv(
os.path.join(DATA_HOME, 'yelp_window5-scaled.csv.gz'), index_col=0)
yelp5
</code></pre>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>):</th>
<th>);</th>
<th>..</th>
<th>...</th>
<th>:(</th>
<th>:)</th>
<th>:/</th>
<th>:D</th>
<th>:|</th>
<th>;p</th>
<th>...</th>
<th>younger</th>
<th>your</th>
<th>yourself</th>
<th>youth</th>
<th>zebra</th>
<th>zero</th>
<th>zinc</th>
<th>zombie</th>
<th>zone</th>
<th>zoo</th>
</tr>
</thead>
<tbody>
<tr>
<th>):</th>
<td>154.566667</td>
<td>0.000000</td>
<td>16.516667</td>
<td>60.750000</td>
<td>2.550000</td>
<td>8.466667</td>
<td>0.000000</td>
<td>0.866667</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>0.666667</td>
<td>40.516667</td>
<td>4.616667</td>
<td>0.416667</td>
<td>0.000000</td>
<td>4.450000</td>
<td>0.000000</td>
<td>0.166667</td>
<td>0.166667</td>
<td>0.333333</td>
</tr>
<tr>
<th>);</th>
<td>0.000000</td>
<td>24.866667</td>
<td>3.200000</td>
<td>36.200000</td>
<td>0.000000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>0.250000</td>
<td>32.733333</td>
<td>2.200000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.533333</td>
<td>0.000000</td>
<td>0.250000</td>
<td>0.450000</td>
<td>0.700000</td>
</tr>
<tr>
<th>..</th>
<td>16.516667</td>
<td>3.200000</td>
<td>13494.866667</td>
<td>6945.366667</td>
<td>196.750000</td>
<td>501.650000</td>
<td>40.450000</td>
<td>29.400000</td>
<td>0.700000</td>
<td>2.15</td>
<td>...</td>
<td>31.850000</td>
<td>2908.116667</td>
<td>247.366667</td>
<td>3.183333</td>
<td>1.783333</td>
<td>146.733333</td>
<td>1.250000</td>
<td>9.733333</td>
<td>23.816667</td>
<td>20.300000</td>
</tr>
<tr>
<th>...</th>
<td>60.750000</td>
<td>36.200000</td>
<td>6945.366667</td>
<td>62753.766667</td>
<td>783.216667</td>
<td>1711.583333</td>
<td>161.383333</td>
<td>101.700000</td>
<td>4.883333</td>
<td>7.85</td>
<td>...</td>
<td>119.100000</td>
<td>12530.533333</td>
<td>1259.683333</td>
<td>29.550000</td>
<td>4.516667</td>
<td>675.816667</td>
<td>8.833333</td>
<td>33.400000</td>
<td>108.783333</td>
<td>77.583333</td>
</tr>
<tr>
<th>:(</th>
<td>2.550000</td>
<td>0.000000</td>
<td>196.750000</td>
<td>783.216667</td>
<td>423.433333</td>
<td>13.950000</td>
<td>2.133333</td>
<td>0.166667</td>
<td>0.366667</td>
<td>0.00</td>
<td>...</td>
<td>0.333333</td>
<td>72.433333</td>
<td>5.333333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>7.650000</td>
<td>0.000000</td>
<td>0.533333</td>
<td>4.533333</td>
<td>0.166667</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>zero</th>
<td>4.450000</td>
<td>0.533333</td>
<td>146.733333</td>
<td>675.816667</td>
<td>7.650000</td>
<td>8.183333</td>
<td>0.750000</td>
<td>0.200000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>1.466667</td>
<td>121.233333</td>
<td>11.483333</td>
<td>0.333333</td>
<td>0.000000</td>
<td>476.733333</td>
<td>0.000000</td>
<td>0.783333</td>
<td>3.583333</td>
<td>0.866667</td>
</tr>
<tr>
<th>zinc</th>
<td>0.000000</td>
<td>0.000000</td>
<td>1.250000</td>
<td>8.833333</td>
<td>0.000000</td>
<td>0.500000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>0.000000</td>
<td>2.600000</td>
<td>0.166667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>2.066667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr>
<th>zombie</th>
<td>0.166667</td>
<td>0.250000</td>
<td>9.733333</td>
<td>33.400000</td>
<td>0.533333</td>
<td>1.733333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>0.750000</td>
<td>20.700000</td>
<td>1.866667</td>
<td>0.000000</td>
<td>0.166667</td>
<td>0.783333</td>
<td>0.000000</td>
<td>25.400000</td>
<td>4.616667</td>
<td>0.500000</td>
</tr>
<tr>
<th>zone</th>
<td>0.166667</td>
<td>0.450000</td>
<td>23.816667</td>
<td>108.783333</td>
<td>4.533333</td>
<td>7.066667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>0.916667</td>
<td>257.866667</td>
<td>4.783333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>3.583333</td>
<td>0.000000</td>
<td>4.616667</td>
<td>28.800000</td>
<td>0.000000</td>
</tr>
<tr>
<th>zoo</th>
<td>0.333333</td>
<td>0.700000</td>
<td>20.300000</td>
<td>77.583333</td>
<td>0.166667</td>
<td>4.850000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>...</td>
<td>1.000000</td>
<td>31.783333</td>
<td>3.533333</td>
<td>0.000000</td>
<td>0.333333</td>
<td>0.866667</td>
<td>0.000000</td>
<td>0.500000</td>
<td>0.000000</td>
<td>85.166667</td>
</tr>
</tbody>
</table>
<p>6000 rows × 6000 columns</p>
</div>
<pre><code class="language-notebook">yelp20 = pd.read_csv(
os.path.join(DATA_HOME, 'yelp_window20-flat.csv.gz'), index_col=0)
yelp20
</code></pre>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>):</th>
<th>);</th>
<th>..</th>
<th>...</th>
<th>:(</th>
<th>:)</th>
<th>:/</th>
<th>:D</th>
<th>:|</th>
<th>;p</th>
<th>...</th>
<th>younger</th>
<th>your</th>
<th>yourself</th>
<th>youth</th>
<th>zebra</th>
<th>zero</th>
<th>zinc</th>
<th>zombie</th>
<th>zone</th>
<th>zoo</th>
</tr>
</thead>
<tbody>
<tr>
<th>):</th>
<td>3910.0</td>
<td>24.0</td>
<td>250.0</td>
<td>1162.0</td>
<td>48.0</td>
<td>261.0</td>
<td>4.0</td>
<td>22.0</td>
<td>0.0</td>
<td>1.0</td>
<td>...</td>
<td>6.0</td>
<td>945.0</td>
<td>61.0</td>
<td>3.0</td>
<td>1.0</td>
<td>27.0</td>
<td>1.0</td>
<td>1.0</td>
<td>2.0</td>
<td>3.0</td>
</tr>
<tr>
<th>);</th>
<td>24.0</td>
<td>1738.0</td>
<td>55.0</td>
<td>595.0</td>
<td>7.0</td>
<td>27.0</td>
<td>0.0</td>
<td>2.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>5.0</td>
<td>564.0</td>
<td>29.0</td>
<td>0.0</td>
<td>0.0</td>
<td>7.0</td>
<td>0.0</td>
<td>2.0</td>
<td>3.0</td>
<td>5.0</td>
</tr>
<tr>
<th>..</th>
<td>250.0</td>
<td>55.0</td>
<td>330680.0</td>
<td>154726.0</td>
<td>1605.0</td>
<td>5707.0</td>
<td>311.0</td>
<td>336.0</td>
<td>10.0</td>
<td>21.0</td>
<td>...</td>
<td>437.0</td>
<td>40631.0</td>
<td>2680.0</td>
<td>36.0</td>
<td>12.0</td>
<td>1441.0</td>
<td>8.0</td>
<td>107.0</td>
<td>237.0</td>
<td>182.0</td>
</tr>
<tr>
<th>...</th>
<td>1162.0</td>
<td>595.0</td>
<td>154726.0</td>
<td>1213778.0</td>
<td>6404.0</td>
<td>21831.0</td>
<td>1230.0</td>
<td>1172.0</td>
<td>41.0</td>
<td>73.0</td>
<td>...</td>
<td>1684.0</td>
<td>174619.0</td>
<td>13862.0</td>
<td>323.0</td>
<td>57.0</td>
<td>6345.0</td>
<td>71.0</td>
<td>405.0</td>
<td>1208.0</td>
<td>840.0</td>
</tr>
<tr>
<th>:(</th>
<td>48.0</td>
<td>7.0</td>
<td>1605.0</td>
<td>6404.0</td>
<td>1646.0</td>
<td>528.0</td>
<td>55.0</td>
<td>33.0</td>
<td>4.0</td>
<td>1.0</td>
<td>...</td>
<td>9.0</td>
<td>1425.0</td>
<td>84.0</td>
<td>1.0</td>
<td>1.0</td>
<td>66.0</td>
<td>0.0</td>
<td>5.0</td>
<td>15.0</td>
<td>7.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>zero</th>
<td>27.0</td>
<td>7.0</td>
<td>1441.0</td>
<td>6345.0</td>
<td>66.0</td>
<td>110.0</td>
<td>6.0</td>
<td>4.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>43.0</td>
<td>4141.0</td>
<td>368.0</td>
<td>4.0</td>
<td>0.0</td>
<td>3894.0</td>
<td>0.0</td>
<td>14.0</td>
<td>28.0</td>
<td>19.0</td>
</tr>
<tr>
<th>zinc</th>
<td>1.0</td>
<td>0.0</td>
<td>8.0</td>
<td>71.0</td>
<td>0.0</td>
<td>4.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>34.0</td>
<td>7.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>98.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>zombie</th>
<td>1.0</td>
<td>2.0</td>
<td>107.0</td>
<td>405.0</td>
<td>5.0</td>
<td>22.0</td>
<td>0.0</td>
<td>4.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>3.0</td>
<td>287.0</td>
<td>28.0</td>
<td>1.0</td>
<td>1.0</td>
<td>14.0</td>
<td>0.0</td>
<td>604.0</td>
<td>22.0</td>
<td>1.0</td>
</tr>
<tr>
<th>zone</th>
<td>2.0</td>
<td>3.0</td>
<td>237.0</td>
<td>1208.0</td>
<td>15.0</td>
<td>62.0</td>
<td>1.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>14.0</td>
<td>1721.0</td>
<td>99.0</td>
<td>3.0</td>
<td>1.0</td>
<td>28.0</td>
<td>0.0</td>
<td>22.0</td>
<td>764.0</td>
<td>14.0</td>
</tr>
<tr>
<th>zoo</th>
<td>3.0</td>
<td>5.0</td>
<td>182.0</td>
<td>840.0</td>
<td>7.0</td>
<td>47.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>16.0</td>
<td>699.0</td>
<td>53.0</td>
<td>2.0</td>
<td>10.0</td>
<td>19.0</td>
<td>0.0</td>
<td>1.0</td>
<td>14.0</td>
<td>3148.0</td>
</tr>
</tbody>
</table>
<p>6000 rows × 6000 columns</p>
</div>
<pre><code class="language-notebook">giga5 = pd.read_csv(
os.path.join(DATA_HOME, 'giga_window5-scaled.csv.gz'), index_col=0)
giga5
</code></pre>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>):</th>
<th>);</th>
<th>..</th>
<th>...</th>
<th>:(</th>
<th>:)</th>
<th>:/</th>
<th>:D</th>
<th>:|</th>
<th>;p</th>
<th>...</th>
<th>younger</th>
<th>your</th>
<th>yourself</th>
<th>youth</th>
<th>zebra</th>
<th>zero</th>
<th>zinc</th>
<th>zombie</th>
<th>zone</th>
<th>zoo</th>
</tr>
</thead>
<tbody>
<tr>
<th>):</th>
<td>4.300000</td>
<td>8.100000</td>
<td>7.700000</td>
<td>113.766667</td>
<td>2.666667</td>
<td>1.116667</td>
<td>0.50</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>2.850000</td>
<td>147.633333</td>
<td>8.483333</td>
<td>4.500000</td>
<td>0.000000</td>
<td>2.733333</td>
<td>0.200000</td>
<td>0.333333</td>
<td>344.066667</td>
<td>0.700000</td>
</tr>
<tr>
<th>);</th>
<td>8.100000</td>
<td>1092.400000</td>
<td>0.650000</td>
<td>52.783333</td>
<td>0.666667</td>
<td>0.000000</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>7.100000</td>
<td>44.400000</td>
<td>3.333333</td>
<td>7.116667</td>
<td>0.400000</td>
<td>1.616667</td>
<td>1.000000</td>
<td>0.500000</td>
<td>11.516667</td>
<td>1.250000</td>
</tr>
<tr>
<th>..</th>
<td>7.700000</td>
<td>0.650000</td>
<td>18900.766667</td>
<td>12657.800000</td>
<td>0.500000</td>
<td>0.500000</td>
<td>0.00</td>
<td>0.5</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>1.166667</td>
<td>29.883333</td>
<td>0.833333</td>
<td>0.250000</td>
<td>0.000000</td>
<td>0.916667</td>
<td>0.250000</td>
<td>0.000000</td>
<td>2.983333</td>
<td>1.083333</td>
</tr>
<tr>
<th>...</th>
<td>113.766667</td>
<td>52.783333</td>
<td>12657.800000</td>
<td>131656.033333</td>
<td>0.866667</td>
<td>0.500000</td>
<td>0.85</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>40.333333</td>
<td>2187.066667</td>
<td>89.750000</td>
<td>42.000000</td>
<td>8.200000</td>
<td>30.683333</td>
<td>0.666667</td>
<td>2.333333</td>
<td>79.233333</td>
<td>11.083333</td>
</tr>
<tr>
<th>:(</th>
<td>2.666667</td>
<td>0.666667</td>
<td>0.500000</td>
<td>0.866667</td>
<td>0.000000</td>
<td>0.333333</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.000000</td>
<td>0.333333</td>
<td>0.000000</td>
<td>0.666667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>zero</th>
<td>2.733333</td>
<td>1.616667</td>
<td>0.916667</td>
<td>30.683333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.666667</td>
<td>26.516667</td>
<td>1.533333</td>
<td>1.600000</td>
<td>0.000000</td>
<td>270.333333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>6.900000</td>
<td>0.000000</td>
</tr>
<tr>
<th>zinc</th>
<td>0.200000</td>
<td>1.000000</td>
<td>0.250000</td>
<td>0.666667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.000000</td>
<td>7.116667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>24.966667</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr>
<th>zombie</th>
<td>0.333333</td>
<td>0.500000</td>
<td>0.000000</td>
<td>2.333333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.200000</td>
<td>2.700000</td>
<td>0.000000</td>
<td>0.250000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>4.700000</td>
<td>0.500000</td>
<td>0.000000</td>
</tr>
<tr>
<th>zone</th>
<td>344.066667</td>
<td>11.516667</td>
<td>2.983333</td>
<td>79.233333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.00</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>1.000000</td>
<td>112.033333</td>
<td>3.850000</td>
<td>2.333333</td>
<td>0.666667</td>
<td>6.900000</td>
<td>0.000000</td>
<td>0.500000</td>
<td>248.466667</td>
<td>0.816667</td>
</tr>
<tr>
<th>zoo</th>
<td>0.700000</td>
<td>1.250000</td>
<td>1.083333</td>
<td>11.083333</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.25</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>1.000000</td>
<td>9.383333</td>
<td>0.250000</td>
<td>0.400000</td>
<td>1.500000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.816667</td>
<td>141.133333</td>
</tr>
</tbody>
</table>
<p>6000 rows × 6000 columns</p>
</div>
<pre><code class="language-notebook">giga20 = pd.read_csv(
os.path.join(DATA_HOME, 'giga_window20-flat.csv.gz'), index_col=0)
giga20
</code></pre>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>):</th>
<th>);</th>
<th>..</th>
<th>...</th>
<th>:(</th>
<th>:)</th>
<th>:/</th>
<th>:D</th>
<th>:|</th>
<th>;p</th>
<th>...</th>
<th>younger</th>
<th>your</th>
<th>yourself</th>
<th>youth</th>
<th>zebra</th>
<th>zero</th>
<th>zinc</th>
<th>zombie</th>
<th>zone</th>
<th>zoo</th>
</tr>
</thead>
<tbody>
<tr>
<th>):</th>
<td>7046.0</td>
<td>517.0</td>
<td>29.0</td>
<td>1684.0</td>
<td>45.0</td>
<td>10.0</td>
<td>15.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>...</td>
<td>34.0</td>
<td>2716.0</td>
<td>318.0</td>
<td>75.0</td>
<td>0.0</td>
<td>18.0</td>
<td>11.0</td>
<td>5.0</td>
<td>1128.0</td>
<td>11.0</td>
</tr>
<tr>
<th>);</th>
<td>517.0</td>
<td>108712.0</td>
<td>19.0</td>
<td>288.0</td>
<td>15.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>1.0</td>
<td>0.0</td>
<td>...</td>
<td>105.0</td>
<td>872.0</td>
<td>34.0</td>
<td>80.0</td>
<td>5.0</td>
<td>19.0</td>
<td>15.0</td>
<td>5.0</td>
<td>111.0</td>
<td>13.0</td>
</tr>
<tr>
<th>..</th>
<td>29.0</td>
<td>19.0</td>
<td>101566.0</td>
<td>116373.0</td>
<td>1.0</td>
<td>1.0</td>
<td>6.0</td>
<td>2.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>72.0</td>
<td>886.0</td>
<td>17.0</td>
<td>24.0</td>
<td>23.0</td>
<td>75.0</td>
<td>33.0</td>
<td>0.0</td>
<td>225.0</td>
<td>10.0</td>
</tr>
<tr>
<th>...</th>
<td>1684.0</td>
<td>288.0</td>
<td>116373.0</td>
<td>1627084.0</td>
<td>27.0</td>
<td>48.0</td>
<td>7.0</td>
<td>3.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>671.0</td>
<td>27690.0</td>
<td>1158.0</td>
<td>886.0</td>
<td>392.0</td>
<td>417.0</td>
<td>6.0</td>
<td>34.0</td>
<td>1522.0</td>
<td>149.0</td>
</tr>
<tr>
<th>:(</th>
<td>45.0</td>
<td>15.0</td>
<td>1.0</td>
<td>27.0</td>
<td>50.0</td>
<td>5.0</td>
<td>2.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>1.0</td>
<td>49.0</td>
<td>3.0</td>
<td>2.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>zero</th>
<td>18.0</td>
<td>19.0</td>
<td>75.0</td>
<td>417.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>25.0</td>
<td>596.0</td>
<td>49.0</td>
<td>26.0</td>
<td>0.0</td>
<td>3460.0</td>
<td>2.0</td>
<td>0.0</td>
<td>140.0</td>
<td>7.0</td>
</tr>
<tr>
<th>zinc</th>
<td>11.0</td>
<td>15.0</td>
<td>33.0</td>
<td>6.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>3.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>85.0</td>
<td>2.0</td>
<td>1.0</td>
<td>0.0</td>
<td>2.0</td>
<td>1366.0</td>
<td>0.0</td>
<td>2.0</td>
<td>1.0</td>
</tr>
<tr>
<th>zombie</th>
<td>5.0</td>
<td>5.0</td>
<td>0.0</td>
<td>34.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>3.0</td>
<td>65.0</td>
<td>1.0</td>
<td>3.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>188.0</td>
<td>3.0</td>
<td>0.0</td>
</tr>
<tr>
<th>zone</th>
<td>1128.0</td>
<td>111.0</td>
<td>225.0</td>
<td>1522.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>1.0</td>
<td>...</td>
<td>54.0</td>
<td>1464.0</td>
<td>60.0</td>
<td>52.0</td>
<td>7.0</td>
<td>140.0</td>
<td>2.0</td>
<td>3.0</td>
<td>11250.0</td>
<td>19.0</td>
</tr>
<tr>
<th>zoo</th>
<td>11.0</td>
<td>13.0</td>
<td>10.0</td>
<td>149.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>34.0</td>
<td>257.0</td>
<td>15.0</td>
<td>16.0</td>
<td>36.0</td>
<td>7.0</td>
<td>1.0</td>
<td>0.0</td>
<td>19.0</td>
<td>5568.0</td>
</tr>
</tbody>
</table>
<p>6000 rows × 6000 columns</p>
</div>
<h2 id="vector-comparison">Vector comparison</h2>
<p>Vector comparisons form the heart of our analyses in this context.</p>
<ul>
<li>
<p>For the most part, we are interested in measuring the <strong>distance</strong> between vectors. The guiding idea is that semantically related words should be close together in the vector spaces we build, and semantically unrelated words should be far apart.</p>
</li>
<li>
<p>The <a href="http://docs.scipy.org/doc/scipy-0.14.0/reference/spatial.distance.html">scipy.spatial.distance</a> module has a lot of vector comparison methods, so you might check them out if you want to go beyond the functions defined and explored here. Read the documentation closely, though: many of those methods are defined only for binary vectors, whereas the VSMs we’ll use allow all float values.</p>
</li>
</ul>
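Before turning to specific measures, here is a quick sketch of calling the float-friendly scipy.spatial.distance methods directly on real-valued vectors (the two example vectors are made up for illustration):

```python
import numpy as np
import scipy.spatial.distance

u = np.array([2.0, 4.0])
v = np.array([10.0, 15.0])

# These three are defined for arbitrary float vectors, unlike the
# binary-only methods in the same module.
print(scipy.spatial.distance.euclidean(u, v))   # straight-line distance
print(scipy.spatial.distance.cityblock(u, v))   # L1 (Manhattan) distance
print(scipy.spatial.distance.cosine(u, v))      # 1 - cosine similarity
```

Each returns a single float, so any of them can serve as the pairwise distance function in the neighbor-finding code later in the notebook.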
<h3 id="euclidean">Euclidean</h3>
<p>The most basic and intuitive distance measure between vectors is <strong>euclidean distance</strong>. The euclidean distance between two vectors $u$ and $v$ of dimension $n$ is</p>
\[\textbf{euclidean}(u, v) =
\sqrt{\sum_{i=1}^{n}|u_{i} - v_{i}|^{2}}\]
<p>In two dimensions, this corresponds to the length of the most direct line between the two points.</p>
<p>In <code class="language-plaintext highlighter-rouge">vsm.py</code>, the function <code class="language-plaintext highlighter-rouge">euclidean</code> just uses the corresponding <a href="https://docs.scipy.org/doc/scipy/reference/spatial.distance.html">scipy.spatial.distance</a> method to define it.</p>
<p>Here’s the tiny vector space from the screencast on vector comparisons associated with this notebook:</p>
<p>Running example <br />
• Focus on distance measures <br />
• Illustrations with row vectors <br />
Suppose the words <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">B</code>, and <code class="language-plaintext highlighter-rouge">C</code> appear in two documents (or corpora, etc.) x and y, <br />
or, alternatively, that the words <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">B</code>, and <code class="language-plaintext highlighter-rouge">C</code> each co-occur with two words x and y.</p>
<pre><code class="language-notebook">ABC = pd.DataFrame([
    [ 2.0,  4.0],
    [10.0, 15.0],
    [14.0, 10.0]],
    index=['A', 'B', 'C'],
    columns=['x', 'y'])
</code></pre>
<pre><code class="language-notebook">ABC
</code></pre>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>x</th>
<th>y</th>
</tr>
</thead>
<tbody>
<tr>
<th>A</th>
<td>2.0</td>
<td>4.0</td>
</tr>
<tr>
<th>B</th>
<td>10.0</td>
<td>15.0</td>
</tr>
<tr>
<th>C</th>
<td>14.0</td>
<td>10.0</td>
</tr>
</tbody>
</table>
</div>
<pre><code class="language-notebook">def plot_ABC(df):
    ax = df.plot.scatter(x='x', y='y', marker='.', legend=False)
    m = df.values.max(axis=None)
    ax.set_xlim([0, m*1.2])
    ax.set_ylim([0, m*1.2])
    for label, row in df.iterrows():
        ax.text(row['x'], row['y'], label)
</code></pre>
<pre><code class="language-notebook">plot_ABC(ABC)
</code></pre>
<p><img src="/images/vsm_01_distributional_practice_files/vsm_01_distributional_practice_29_0.png" alt="png" /></p>
<p>The euclidean distances align well with raw visual distance in the plot:</p>
<pre><code class="language-notebook">def euclidean(u, v):
    return scipy.spatial.distance.euclidean(u, v)
</code></pre>
<pre><code class="language-notebook">def abc_comparisons(df, distfunc):
    """Take a distance function and print the (A, B) and (B, C) distances."""
    for a, b in (('A', 'B'), ('B', 'C')):
        dist = distfunc(df.loc[a], df.loc[b])  # df.loc[a] is a pd.Series
        print('{0:}({1:}, {2:}) = {3:7.02f}'.format(
            distfunc.__name__, a, b, dist))
</code></pre>
<pre><code class="language-notebook">abc_comparisons(ABC, euclidean)  # pass ABC with its raw x, y values
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>euclidean(A, B) = 13.60
euclidean(B, C) = 6.40
</code></pre></div></div>
<p>However, suppose we think of the vectors as encoding word meanings in the vector-space sense. In that case, these distances don’t match our intuitions:</p>
<ul>
<li>
<p>The distributions of B and C are more or less directly opposed, suggesting very different meanings, whereas A and B are rather closely aligned, abstracting away from the fact that the first is far less frequent than the second.</p>
</li>
<li>
<p>In terms of the large models we will soon explore, A and B resemble a pair like <em>superb</em> and <em>good</em>, which have similar meanings but very different frequencies.</p>
</li>
<li>
<p>In contrast, B and C are like <em>good</em> and <em>disappointing</em> — similar overall frequencies but different distributions with respect to the overall vocabulary.</p>
</li>
</ul>
<h3 id="length-normalization">Length normalization</h3>
<p>These affinities are immediately apparent if we <strong>normalize the vectors by their length</strong>. To do this, we first define the L2-length of a vector:</p>
\[\|u\|_{2} = \sqrt{\sum_{i=1}^{n} u_{i}^{2}}\]
<p>And then the normalization step just divides each value by this quantity:</p>
\[\left[
\frac{u_{1}}{\|u\|_{2}},
\frac{u_{2}}{\|u\|_{2}},
\ldots,
\frac{u_{n}}{\|u\|_{2}}
\right]\]
<pre><code class="language-notebook">def vector_length(u):
    """L2 length of `u`: the square root of its dot product with itself,
    e.g., np.array([1, 2, 3]).dot([1, 2, 3]) = 1 + 4 + 9 = 14,
    so the length is sqrt(14).
    """
    return np.sqrt(u.dot(u))

def length_norm(u):
    return u / vector_length(u)
</code></pre>
<pre><code class="language-notebook">ABC_normed = ABC.apply(length_norm, axis=1)
</code></pre>
<pre><code class="language-notebook">plot_ABC(ABC_normed)
</code></pre>
<p><img src="/images/vsm_01_distributional_practice_files/vsm_01_distributional_practice_38_0.png" alt="png" /></p>
<pre><code class="language-notebook">abc_comparisons(ABC_normed, euclidean)  # euclidean distances on the length-normalized ABC
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>euclidean(A, B) = 0.12
euclidean(B, C) = 0.36
</code></pre></div></div>
<p>Here, the connection between A and B is more apparent, as is the opposition between B and C.</p>
<h3 id="cosine-distance">Cosine distance</h3>
<p>Cosine distance takes overall length into account. The cosine distance between two vectors $u$ and $v$ of dimension $n$ is</p>
\[\textbf{cosine}(u, v) =
1 - \frac{\sum_{i=1}^{n} u_{i} \cdot v_{i}}{\|u\|_{2} \cdot \|v\|_{2}}\]
<p>The similarity part of this (the righthand term of the subtraction) is actually measuring the angle between the two vectors. The result is the same (in terms of rank order) as one gets from first normalizing both vectors using $\|\cdot\|_{2}$ and then calculating their Euclidean distance.</p>
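This rank-order equivalence can be checked directly: the Euclidean distance between length-normalized vectors is exactly $\sqrt{2 \cdot \textbf{cosine}(u, v)}$, a monotonic transform of cosine distance. A minimal numpy sketch (reimplementing the two measures for plain arrays, using the A and B vectors from the running example):

```python
import numpy as np

def length_norm(u):
    # Divide a vector by its L2 length.
    return u / np.sqrt(u.dot(u))

def cosine_distance(u, v):
    return 1.0 - u.dot(v) / (np.sqrt(u.dot(u)) * np.sqrt(v.dot(v)))

A = np.array([2.0, 4.0])
B = np.array([10.0, 15.0])

# Euclidean distance between the length-normalized vectors equals
# sqrt(2 * cosine distance), so both measures induce the same
# nearest-neighbor ranking:
d_norm = np.linalg.norm(length_norm(A) - length_norm(B))
assert np.isclose(d_norm, np.sqrt(2.0 * cosine_distance(A, B)))
```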
<pre><code class="language-notebook">def cosine(u, v):
    return scipy.spatial.distance.cosine(u, v)
</code></pre>
<pre><code class="language-notebook">abc_comparisons(ABC, cosine)
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cosine(A, B) = 0.01
cosine(B, C) = 0.07
</code></pre></div></div>
<p>So, in building in the length normalization, cosine distance achieves our goal of associating A and B and separating both from C.</p>
<h3 id="matching-based-methods">Matching-based methods</h3>
<p>Matching-based methods are also common in the literature. The basic matching measure effectively creates a vector consisting of all of the smaller of the two values at each coordinate, and then sums them:</p>
\[\textbf{matching}(u, v) = \sum_{i=1}^{n} \min(u_{i}, v_{i})\]
<p>This is implemented in <code class="language-plaintext highlighter-rouge">vsm</code> as <code class="language-plaintext highlighter-rouge">matching</code>.</p>
<p>One approach to normalizing the matching values is the <a href="https://en.wikipedia.org/wiki/Jaccard_index"><strong>Jaccard coefficient</strong></a>. The numerator is the matching coefficient. The denominator — the normalizer — is intuitively like the set union: for binary vectors, it gives the cardinality of the union of the two being compared:</p>
\[\textbf{jaccard}(u, v) =
1 - \frac{\textbf{matching}(u, v)}{\sum_{i=1}^{n} \max(u_{i}, v_{i})}\]
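For intuition on the binary case, here is a small check (with two hypothetical 0/1 vectors) that the numerator counts the intersection and the denominator counts the union:

```python
import numpy as np

# Two binary "membership" vectors over a 5-element universe:
u = np.array([1, 1, 0, 1, 0])
v = np.array([1, 0, 0, 1, 1])

matching = np.sum(np.minimum(u, v))  # intersection size
union = np.sum(np.maximum(u, v))     # union size
jaccard_distance = 1.0 - matching / union

assert matching == 2 and union == 4
assert jaccard_distance == 0.5
```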
<h3 id="summary">Summary</h3>
<p>Suppose we set for ourselves the goal of associating A with B and disassociating B from C, in keeping with the semantic intuition expressed above. Then we can assess distance measures by whether they achieve this goal:</p>
<pre><code class="language-notebook">def matching(u, v):
    return np.sum(np.minimum(u, v))

def jaccard(u, v):
    return 1.0 - (matching(u, v) / np.sum(np.maximum(u, v)))
</code></pre>
<pre><code class="language-notebook">for m in (euclidean, cosine, jaccard):
    fmt = {
        'n': m.__name__,
        'AB': m(ABC.loc['A'], ABC.loc['B']),
        'BC': m(ABC.loc['B'], ABC.loc['C'])}
    print('{n:>15}(A, B) = {AB:5.2f} {n:>15}(B, C) = {BC:5.2f}'.format(**fmt))
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> euclidean(A, B) = 13.60 euclidean(B, C) = 6.40
cosine(A, B) = 0.01 cosine(B, C) = 0.07
jaccard(A, B) = 0.76 jaccard(B, C) = 0.31
</code></pre></div></div>
<h2 id="distributional-neighbors">Distributional neighbors</h2>
<p>The <code class="language-plaintext highlighter-rouge">neighbors</code> function in <code class="language-plaintext highlighter-rouge">vsm</code> is an investigative aide. For a given word <code class="language-plaintext highlighter-rouge">w</code>, it ranks all the words in the vocabulary according to their distance from <code class="language-plaintext highlighter-rouge">w</code>, as measured by <code class="language-plaintext highlighter-rouge">distfunc</code> (default: <code class="language-plaintext highlighter-rouge">vsm.cosine</code>).</p>
<p>By playing around with this function, you can start to get a sense for how the distance functions differ. Here are some example uses; you might try some new words to get a feel for what these matrices are like and how different words look.</p>
<pre><code class="language-notebook">def neighbors(word, df, distfunc=cosine):
    """Tool for finding the nearest neighbors of `word` in `df` according
    to `distfunc`. The comparisons are between row vectors.

    Parameters
    ----------
    word : str
        The anchor word. Assumed to be in `rownames`.
    df : pd.DataFrame
        The vector-space model.
    distfunc : function mapping vector pairs to floats (default: `cosine`)
        The measure of distance between vectors. Can also be `euclidean`,
        `matching`, `jaccard`, as well as any other distance measure
        between 1d vectors.

    Raises
    ------
    ValueError
        If word is not in `df.index`.

    Returns
    -------
    pd.Series
        Ordered by closeness to `word`.

    """
    if word not in df.index:
        raise ValueError('{} is not in this VSM'.format(word))
    w = df.loc[word]
    dists = df.apply(lambda x: distfunc(w, x), axis=1)
    return dists.sort_values()
</code></pre>
<pre><code class="language-notebook">neighbors('A', ABC, distfunc=euclidean)
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 0.000000
C 13.416408
B 13.601471
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('A', ABC, distfunc=cosine)
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 0.000000
B 0.007722
C 0.116212
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp5, distfunc=euclidean).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
great 363501.370559
very 463140.024533
there 484997.266143
so 503078.898630
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp20, distfunc=euclidean).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000e+00
very 1.582079e+06
food 1.706024e+06
as 2.180615e+06
great 2.309515e+06
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp5, distfunc=cosine).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
decent 0.052814
weak 0.061491
impressive 0.063750
solid 0.072259
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp20, distfunc=cosine).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
decent 0.005797
pretty 0.006568
really 0.007324
quite 0.007358
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga5, distfunc=euclidean).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
very 96629.020074
another 102967.358766
little 111678.083808
big 112087.562461
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga20, distfunc=euclidean).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
very 342311.566423
here 510760.069407
think 529750.740523
right 561272.646841
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga5, distfunc=cosine).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
bad 0.047839
painful 0.070546
simple 0.073625
safe 0.076624
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga20, distfunc=cosine).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
bad 0.004829
everyone 0.006463
always 0.006505
very 0.007011
dtype: float64
</code></pre></div></div>
<h2 id="matrix-reweighting">Matrix reweighting</h2>
<ul>
<li>
<p>The goal of reweighting is to amplify the important, trustworthy, and unusual, while deemphasizing the mundane and the quirky.</p>
</li>
<li>
<p>Absent a defined objective function, this will remain fuzzy, but the intuition behind moving away from raw counts is that frequency is a poor proxy for our target semantic ideas.</p>
</li>
</ul>
<h3 id="normalization">Normalization</h3>
<p>Normalization (row-wise or column-wise) is perhaps the simplest form of reweighting. With <code class="language-plaintext highlighter-rouge">vsm.length_norm</code>, we normalize using <code class="language-plaintext highlighter-rouge">vsm.vector_length</code>. We can also normalize each row by the sum of its values, which turns each row into a probability distribution over the columns:</p>
\[\left[
\frac{u_{1}}{\sum_{i=1}^{n}u_{i}},
\frac{u_{2}}{\sum_{i=1}^{n}u_{i}},
\ldots,
\frac{u_{n}}{\sum_{i=1}^{n}u_{i}}
\right]\]
<p>These normalization measures are <strong>insensitive to the magnitude of the underlying counts</strong>. This is often a mistake in the messy world of large data sets; $[1,10]$ and $[1000,10000]$ are very different vectors in ways that will be partly or totally obscured by normalization.</p>
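A two-line sketch makes this magnitude-insensitivity concrete: the example vectors $[1, 10]$ and $[1000, 10000]$ become identical after probability normalization:

```python
import numpy as np

def prob_norm(u):
    # Turn a row of counts into a probability distribution over columns.
    return u / np.sum(u)

small = np.array([1.0, 10.0])
large = np.array([1000.0, 10000.0])

# The thousandfold difference in raw counts is completely erased:
assert np.allclose(prob_norm(small), prob_norm(large))
```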
<h3 id="observedexpected">Observed/Expected</h3>
<p>Reweighting by observed-over-expected values captures one of the central patterns in VSM reweighting: we can adjust the actual cell value in a co-occurrence matrix using information from the corresponding row and column.</p>
<p>In the case of observed-over-expected, the rows and columns define our expectation about what the cell value would be if the two co-occurring words were independent. In dividing the observed count by this value, we amplify cells whose values are larger than we would expect.</p>
<p>So that this doesn’t look more complex than it is, for an $m \times n$ matrix $X$, define</p>
\[\textbf{rowsum}(X, i) = \sum_{j=1}^{n}X_{ij}\]
\[\textbf{colsum}(X, j) = \sum_{i=1}^{m}X_{ij}\]
\[\textbf{sum}(X) = \sum_{i=1}^{m}\sum_{j=1}^{n} X_{ij}\]
\[\textbf{expected}(X, i, j) =
\frac{
\textbf{rowsum}(X, i) \cdot \textbf{colsum}(X, j)
}{
\textbf{sum}(X)
}\]
<p>Then the observed-over-expected value is</p>
\[\textbf{oe}(X, i, j) = \frac{X_{ij}}{\textbf{expected}(X, i, j)}\]
<p>In many contexts, it is more intuitive to first normalize the count matrix into a joint probability table and then think of $\textbf{rowsum}$ and $\textbf{colsum}$ as probabilities. Then it is clear that we are comparing the observed joint probability with what we would expect it to be under a null hypothesis of independence. These normalizations do not affect the final results, though.</p>
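That invariance is easy to verify with a minimal numpy sketch (reimplementing $\textbf{oe}$ for plain 2d arrays on a small count matrix):

```python
import numpy as np

def observed_over_expected(X):
    row_totals = X.sum(axis=1, keepdims=True)
    col_totals = X.sum(axis=0, keepdims=True)
    # oe = X / (rowsum * colsum / total):
    return X * X.sum() / (row_totals * col_totals)

X = np.array([[34.0, 11.0],
              [47.0,  7.0]])

# Normalizing the counts into a joint probability table first
# leaves the observed-over-expected values unchanged:
P = X / X.sum()
assert np.allclose(observed_over_expected(X), observed_over_expected(P))
```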
<p>Let’s do a quick worked-out example. Suppose we have the count matrix $X$ =</p>
<table>
<thead>
<tr>
<th> </th>
<th>a</th>
<th>b</th>
<th>rowsum</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>x</strong></td>
<td>34</td>
<td>11</td>
<td>45</td>
</tr>
<tr>
<td><strong>y</strong></td>
<td>47</td>
<td>7</td>
<td>54</td>
</tr>
<tr>
<td><strong>colsum</strong></td>
<td>81</td>
<td>18</td>
<td>99</td>
</tr>
</tbody>
</table>
<p>Then we calculate like this:</p>
\[\textbf{oe}(X, 1, 0) = \frac{47}{\frac{54 \cdot 81}{99}} = 1.06\]
<p>And the full table looks like this:</p>
<table>
<thead>
<tr>
<th> </th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>x</strong></td>
<td>0.92</td>
<td>1.34</td>
</tr>
<tr>
<td><strong>y</strong></td>
<td>1.06</td>
<td>0.71</td>
</tr>
</tbody>
</table>
<pre><code class="language-notebook">def observed_over_expected(df):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    expected = np.outer(row_totals, col_totals) / total
    oe = df / expected
    return oe
</code></pre>
<pre><code class="language-notebook">oe_ex = np.array([[ 34., 11.], [ 47., 7.]])
observed_over_expected(oe_ex).round(2)
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[0.92, 1.34],
[1.06, 0.71]])
</code></pre></div></div>
<p>The implementation <code class="language-plaintext highlighter-rouge">vsm.observed_over_expected</code> should be pretty efficient.</p>
<pre><code class="language-notebook">yelp5_oe = observed_over_expected(yelp5)
</code></pre>
<pre><code class="language-notebook">yelp20_oe = observed_over_expected(yelp20)
</code></pre>
<pre><code class="language-notebook">neighbors('good', yelp5_oe).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
great 0.323037
decent 0.381849
but 0.403750
excellent 0.411684
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp20_oe).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
too 0.058905
really 0.060264
pretty 0.064466
but 0.066778
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">giga5_oe = observed_over_expected(giga5)
</code></pre>
<pre><code class="language-notebook">giga20_oe = observed_over_expected(giga20)
</code></pre>
<pre><code class="language-notebook">neighbors('good', giga5_oe).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
bad 0.485495
better 0.533762
well 0.553319
but 0.556113
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga20_oe).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
just 0.102044
so 0.106773
that's 0.108614
better 0.111044
dtype: float64
</code></pre></div></div>
<h3 id="pointwise-mutual-information">Pointwise Mutual Information</h3>
<p>Pointwise Mutual Information (PMI) is observed-over-expected in log-space:</p>
\[\textbf{pmi}(X, i, j) = \log\left(\frac{X_{ij}}{\textbf{expected}(X, i, j)}\right)\]
<p>This basic definition runs into a problem for $0$ count cells. The usual response is to set $\log(0) = 0$, but this is arguably confusing – cell counts that are smaller than expected get negative values, cell counts that are larger than expected get positive values, and 0-count values are placed in the middle of this ranking without real justification.</p>
<p>For this reason, it is more typical to use <strong>Positive PMI</strong>, which maps all negative PMI values to $0$:</p>
\[\textbf{ppmi}(X, i, j) =
\begin{cases}
\textbf{pmi}(X, i, j) & \textrm{if } \textbf{pmi}(X, i, j) > 0 \\
0 & \textrm{otherwise}
\end{cases}\]
<p>This is the default for <code class="language-plaintext highlighter-rouge">vsm.pmi</code>.</p>
<pre><code class="language-notebook">def pmi(df, positive=True):
    df = observed_over_expected(df)
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # log(0) = 0
    if positive:
        df[df < 0] = 0.0
    return df
</code></pre>
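Applied to the small count matrix from the observed/expected worked example, Positive PMI keeps only the log of the above-expectation cells and zeroes the rest. The helpers are reproduced here so the snippet runs standalone:

```python
import numpy as np
import pandas as pd

def observed_over_expected(df):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    expected = np.outer(row_totals, col_totals) / total
    return df / expected

def pmi(df, positive=True):
    df = observed_over_expected(df)
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0
    if positive:
        df[df < 0] = 0.0
    return df

X = pd.DataFrame([[34.0, 11.0], [47.0, 7.0]],
                 index=['x', 'y'], columns=['a', 'b'])

# oe(X) was [[0.92, 1.34], [1.06, 0.71]]; only the two cells above 1
# survive as positive log values under PPMI:
ppmi = pmi(X)
assert ppmi.loc['x', 'a'] == 0.0 and ppmi.loc['y', 'b'] == 0.0
assert ppmi.loc['x', 'b'] > 0 and ppmi.loc['y', 'a'] > 0
```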
<pre><code class="language-notebook">yelp5_pmi = pmi(yelp5)
</code></pre>
<pre><code class="language-notebook">yelp20_pmi = pmi(yelp20)
</code></pre>
<pre><code class="language-notebook">neighbors('good', yelp5_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
decent 0.448116
great 0.466581
tasty 0.532094
excellent 0.569720
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', yelp20_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
tasty 0.201911
delicious 0.261547
flavorful 0.317310
ordered 0.349519
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">giga5_pmi = pmi(giga5)
</code></pre>
<pre><code class="language-notebook">giga20_pmi = pmi(giga20)
</code></pre>
<pre><code class="language-notebook">neighbors('good', giga5_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
bad 0.439959
better 0.494448
terrific 0.509023
decent 0.520705
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('good', giga20_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>good 0.000000
really 0.271877
that's 0.272447
it's 0.285056
pretty 0.294105
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('market', giga5_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>market 0.000000
markets 0.327014
stocks 0.512569
prices 0.516767
sales 0.552022
dtype: float64
</code></pre></div></div>
<pre><code class="language-notebook">neighbors('market', giga20_pmi).head()
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>market 0.000000
markets 0.175934
prices 0.252851
stocks 0.266186
profits 0.269412
dtype: float64
</code></pre></div></div>
<h3 id="tf-idf">TF-IDF</h3>
<p>Perhaps the best-known reweighting scheme is <strong>Term Frequency–Inverse Document Frequency (TF-IDF)</strong>, which is, I believe, still the backbone of today’s Web search technologies. As the name suggests, it is built from TF and IDF measures:</p>
<p>For an $m \times n$ matrix $X$:</p>
\[\textbf{TF}(X, i, j) = \frac{X_{ij}}{\textbf{colsum}(X, j)}\]
\[\textbf{IDF}(X, i, j) = \log\left(\frac{n}{|\{k : X_{ik} > 0\}|}\right)\]
\[\textbf{TF-IDF}(X, i, j) = \textbf{TF}(X, i, j) \cdot \textbf{IDF}(X, i, j)\]
<p>TF-IDF generally performs best with sparse matrices. It severely punishes words that appear in many documents; if a word appears in every document, then its IDF value is 0. As a result, it can be problematic with very dense word $\times$ word matrices like ours, where most words appear with most other words.</p>
<p>There is an implementation of TF-IDF for dense matrices in <code class="language-plaintext highlighter-rouge">vsm.tfidf</code>.</p>
<p><strong>Important</strong>: <code class="language-plaintext highlighter-rouge">sklearn</code>’s version, <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer">TfidfTransformer</a>, assumes that term frequency (TF) is defined row-wise and document frequency is defined column-wise. That is, it assumes <code class="language-plaintext highlighter-rouge">sklearn</code>’s document $\times$ word basic design, which makes sense for classification tasks, where the design is example $\times$ features. This is the transpose of the way we’ve been thinking.</p>
<pre><code class="language-notebook">def tfidf(df):
    # Inverse document frequencies:
    doccount = float(df.shape[1])
    freqs = df.astype(bool).sum(axis=1)
    idfs = np.log(doccount / freqs)
    idfs[np.isinf(idfs)] = 0.0  # log(0) = 0
    # Term frequencies:
    col_totals = df.sum(axis=0)
    tfs = df / col_totals
    return (tfs.T * idfs).T
</code></pre>
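To see the punishing effect concretely, here is the same computation applied to a tiny hypothetical word $\times$ document count matrix (the function is reproduced so this runs standalone); the word occurring in every document gets TF-IDF 0 everywhere:

```python
import numpy as np
import pandas as pd

def tfidf(df):
    # Same computation as above, for a word x document DataFrame.
    doccount = float(df.shape[1])
    freqs = df.astype(bool).sum(axis=1)
    idfs = np.log(doccount / freqs)
    idfs[np.isinf(idfs)] = 0.0
    col_totals = df.sum(axis=0)
    tfs = df / col_totals
    return (tfs.T * idfs).T

X = pd.DataFrame(
    [[2.0, 0.0],
     [1.0, 1.0]],
    index=['rare', 'everywhere'], columns=['d1', 'd2'])

weighted = tfidf(X)
# 'everywhere' appears in both documents: IDF = log(2/2) = 0,
# so its entire row is zeroed out; 'rare' keeps weight in d1.
assert (weighted.loc['everywhere'] == 0).all()
assert weighted.loc['rare', 'd1'] > 0
```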
<h2 id="subword-information">Subword information</h2>
<p><a href="https://papers.nips.cc/paper/603-word-space">Schütze (1993)</a> pioneered the use of subword information to improve representations by reducing sparsity, thereby increasing the density of connections in a VSM. In recent years, this idea has shown value in numerous contexts.</p>
<p><a href="https://arxiv.org/abs/1607.04606">Bojanowski et al. (2016)</a> (the <a href="https://fasttext.cc">fastText</a> team) explore a particularly straightforward approach to doing this: represent each word as the sum of the representations for the character-level n-grams it contains.</p>
<p>It is simple to derive character-level n-gram representations from our existing VSMs. The function <code class="language-plaintext highlighter-rouge">vsm.ngram_vsm</code> implements the basic step. Here, we create the 4-gram version of <code class="language-plaintext highlighter-rouge">yelp5</code>:</p>
<pre><code class="language-notebook">def ngram_vsm(df, n=2):
    """Create a character-level VSM from `df`.

    Parameters
    ----------
    df : pd.DataFrame
    n : int
        The n-gram size.

    Returns
    -------
    pd.DataFrame
        This will have the same column dimensionality as `df`, but the
        rows will be expanded with representations giving the sum of
        all the original rows in `df` that contain that row's n-gram.

    """
    unigram2vecs = defaultdict(list)
    for w, x in df.iterrows():
        for c in get_character_ngrams(w, n):
            unigram2vecs[c].append(x)
    unigram2vecs = {c: np.array(x).sum(axis=0)
                    for c, x in unigram2vecs.items()}
    cf = pd.DataFrame(unigram2vecs).T
    cf.columns = df.columns
    return cf

def get_character_ngrams(w, n):
    """Map a word to its character-level n-grams, with boundary
    symbols '<w>' and '</w>'.

    Parameters
    ----------
    w : str
    n : int
        The n-gram size.

    Returns
    -------
    list of str

    """
    if n > 1:
        w = ["<w>"] + list(w) + ["</w>"]
    else:
        w = list(w)
    return ["".join(w[i: i+n]) for i in range(len(w)-n+1)]
</code></pre>
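The boundary handling is easiest to see on a short word. Each boundary symbol counts as a single unit, so a 4-letter word yields three 4-grams (the helper is reproduced so the check runs standalone):

```python
def get_character_ngrams(w, n):
    # Add boundary symbols so word-initial and word-final n-grams
    # are distinct from word-internal ones:
    if n > 1:
        w = ["<w>"] + list(w) + ["</w>"]
    else:
        w = list(w)
    return ["".join(w[i: i+n]) for i in range(len(w)-n+1)]

assert get_character_ngrams("good", 4) == ["<w>goo", "good", "ood</w>"]
assert get_character_ngrams("good", 1) == ["g", "o", "o", "d"]
```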
<pre><code class="language-notebook">yelp5_ngrams = ngram_vsm(yelp5, n=4)
</code></pre>
<pre><code class="language-notebook">yelp5_ngrams.shape
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(10382, 6000)
</code></pre></div></div>
<p>This has the same column dimension as <code class="language-plaintext highlighter-rouge">yelp5</code>, but the rows are expanded with all the 4-grams, including the boundary symbols <code class="language-plaintext highlighter-rouge">&lt;w&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;/w&gt;</code>. Here’s a simple function for creating new word representations from the associated character-level ones:</p>
<pre><code class="language-notebook">def character_level_rep(word, cf, n=4):
    ngrams = get_character_ngrams(word, n)
    ngrams = [n for n in ngrams if n in cf.index]
    reps = cf.loc[ngrams].values
    return reps.sum(axis=0)
</code></pre>
<p>Many variations on this are worth trying – including the original word vector where available, changing the aggregation method from <code class="language-plaintext highlighter-rouge">sum</code> to something else, using a real morphological parser instead of just n-grams, and so on.</p>
<p>One very powerful thing about this is that we can represent words that are not even in the original VSM:</p>
<pre><code class="language-notebook">'superbly' in yelp5.index  # 'superbly' is not in the yelp5 vocabulary, but ...
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>False
</code></pre></div></div>
<pre><code class="language-notebook">superbly = character_level_rep("superbly", yelp5_ngrams)  # ... we can still compute a vector for it from the n-gram VSM
</code></pre>
<pre><code class="language-notebook">superb = character_level_rep("superb", yelp5_ngrams)
</code></pre>
<pre><code class="language-notebook">cosine(superb, superbly)  # cosine distance between the two vectors: very small
</code></pre>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.004362871833741844
</code></pre></div></div>
<h2 id="visualization">Visualization</h2>
<ul>
<li>
<p>You can begin to get a feel for what your matrix is like by poking around with <code class="language-plaintext highlighter-rouge">vsm.neighbors</code> to see who is close to or far from whom.</p>
</li>
<li>
<p>It’s very useful to complement this with the more holistic view one can get from looking at a visualization of the entire vector space.</p>
</li>
<li>
<p>Of course, any visualization will have to be much, much lower dimension than our actual VSM, so we need to proceed cautiously, balancing the high-level view with more fine-grained exploration.</p>
</li>
<li>
<p>We won’t have time this term to cover VSM visualization in detail. scikit-learn has a bunch of functions for doing this in <a href="http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifold">sklearn.manifold</a>, and the <a href="http://scikit-learn.org/stable/modules/manifold.html#manifold-learning">user guide</a> for that package is detailed.</p>
</li>
<li>
<p>It’s also worth checking out the online TensorFlow <a href="http://projector.tensorflow.org">Embedding Projector tool</a>, which includes a fast implementation of t-SNE.</p>
</li>
<li>
<p>In addition, <code class="language-plaintext highlighter-rouge">vsm.tsne_viz</code> is a wrapper around <a href="http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE">sklearn.manifold.TSNE</a> that handles the basic preprocessing and layout for you. t-SNE stands for <a href="http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">t-Distributed Stochastic Neighbor Embedding</a>, a powerful method for visualizing high-dimensional vector spaces in 2d. See also <a href="https://lvdmaaten.github.io/multiplemaps/Multiple_maps_t-SNE/Multiple_maps_t-SNE.html">Multiple Maps t-SNE</a>.</p>
</li>
</ul>
<pre><code class="language-notebook">def tsne_viz(df, colors=None, output_filename=None, figsize=(40, 50)):
    """2d plot of `df` using t-SNE, with the points labeled by `df.index`,
    aligned with `colors` (defaults to all black).

    Parameters
    ----------
    df : pd.DataFrame
        The matrix to visualize.
    colors : list of colornames or None (default: None)
        Optional list of colors for the vocab. The color names just
        need to be interpretable by matplotlib. If they are supplied,
        they need to have the same length as `df.index`. If `colors=None`,
        then all the words are displayed in black.
    output_filename : str (default: None)
        If not None, then the output image is written to this location.
        The filename suffix determines the image type. If `None`, then
        `plt.plot()` is called, with the behavior determined by the
        environment.
    figsize : (int, int) (default: (40, 50))
        Default size of the output in display units.

    """
    # Colors:
    vocab = df.index
    if not colors:
        colors = ['black' for i in vocab]
    # Recommended reduction via PCA or similar:
    n_components = 50 if df.shape[1] >= 50 else df.shape[1]
    dimreduce = PCA(n_components=n_components)
    X = dimreduce.fit_transform(df)
    # t-SNE:
    tsne = TSNE(n_components=2, random_state=0)
    tsnemat = tsne.fit_transform(X)
    # Plot values:
    xvals = tsnemat[:, 0]
    yvals = tsnemat[:, 1]
    # Plotting:
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=figsize)
    ax.plot(xvals, yvals, marker='', linestyle='')
    # Text labels:
    for word, x, y, color in zip(vocab, xvals, yvals, colors):
        try:
            ax.annotate(word, (x, y), fontsize=8, color=color)
        except UnicodeDecodeError:  # Python 2 won't cooperate!
            pass
    # Output:
    if output_filename:
        plt.savefig(output_filename, bbox_inches='tight')
    else:
        plt.show()
</code></pre>
<pre><code class="language-notebook">tsne_viz(yelp20_pmi)
</code></pre>
<p><img src="/images/vsm_01_distributional_practice_files/vsm_01_distributional_practice_103_0.png" alt="png" /></p>
chloammeVector-space models: designs, distances, basic reweighting[번역] BERT를 처음 사용하기 위한 시각적 가이드2021-12-22T00:00:00+00:002021-12-22T00:00:00+00:00/2021/12/22/a-visual-guide-to-using-bert-for-the-first-time-korean<div class="tooltip">
<blockquote>
<p>이 글은 <a href="https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/">Jay Alammar님의 글</a>을 번역한 글입니다. [<a href="#additional-info">추가정보</a>]
<span class="tooltiptext">
This post is a translated version of <a href="https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/">A Visual Guide to Using BERT for the First Time</a> by Jay Alammar.
</span></p>
</blockquote>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-sentence-classification.png" />
<br />
</div>
<div class="tooltip">
<p>지난 몇년간 언어를 처리하는 머신러닝 모델의 발전이 빠르게 가속화되었습니다. 이런 발전은 연구실을 떠나 주요 디지털 제품 중 일부에 힘을 공급하기 시작했습니다. 이에 대한 좋은 예는 <a href="https://www.blog.google/products/search/search-language-understanding-bert/">BERT 모델이 이제 구글 검색의 원동력이 되었다는 최근 발표</a>입니다. 구글은 이 단계 (또는 검색에 적용한 자연어 이해의 발전)가 “지난 5년간의 가장 큰 도약이자, 검색 역사상 가장 큰 도약 중 하나”를 의미한다고 믿습니다.
<span class="tooltiptext">
Progress has been rapidly accelerating in machine learning models that process language over the last couple of years. This progress has left the research lab and started powering some of the leading digital products. A great example of this is the <a href="https://www.blog.google/products/search/search-language-understanding-bert/">recent announcement of how the BERT model is now a major force behind Google Search</a>. Google believes this step (or progress in natural language understanding as applied in search) represents “the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search”.
</span></p>
</div>
<div class="tooltip">
<p>이번 포스트에서는 BERT의 변형을 이용하여 문장을 분류하는 간단한 튜토리얼을 다루겠습니다. 이 예제는 첫 입문으로는 충분히 기본적이면서도, 관련된 주요 개념 중 일부를 보여줄 만큼은 충분히 고급입니다.
<span class="tooltiptext">
This post is a simple tutorial for how to use a variant of BERT to classify sentences. This is an example that is basic enough as a first intro, yet advanced enough to showcase some of the key concepts involved.
</span></p>
</div>
<div class="tooltip">
<p>이 포스트와 함께 notebook도 준비했습니다. <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">notebook</a>을 보거나 또는 <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">colab에서 실행</a> 해보실 수 있습니다.
<span class="tooltiptext">
Alongside this post, I’ve prepared a notebook. You can see it here <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">the notebook</a> or <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">run it on colab</a>.
</span></p>
</div>
<!--more-->
<div class="tooltip">
<h2 id="sst2">데이터셋: SST2</h2>
<p><span class="tooltiptext">
Dataset: SST2
</span></p>
</div>
<div class="tooltip">
<p>이 예제에서 사용할 데이터셋은 <a href="https://nlp.stanford.edu/sentiment/index.html">SST2</a>이고, 영화 리뷰에서 가져온 문장들로 구성되어 있으며, 각 문장은 긍정(값 1) 또는 부정(값 0)으로 레이블링되어 있습니다:
<span class="tooltiptext">
The dataset we will use in this example is <a href="https://nlp.stanford.edu/sentiment/index.html">SST2</a>, which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):
</span></p>
</div>
<table class="features-table">
<tr>
<th class="mdc-text-light-green-600">
sentence (문장)
</th>
<th class="mdc-text-purple-600">
label (레이블)
</th>
</tr>
<tr>
<td class="mdc-bg-light-green-50" style="text-align:left">
a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
</td>
<td class="mdc-bg-purple-50">
1
</td>
</tr>
<tr>
<td class="mdc-bg-light-green-50" style="text-align:left">
apparently reassembled from the cutting room floor of any given daytime soap
</td>
<td class="mdc-bg-purple-50">
0
</td>
</tr>
<tr>
<td class="mdc-bg-light-green-50" style="text-align:left">
they presume their audience won't sit still for a sociology lesson
</td>
<td class="mdc-bg-purple-50">
0
</td>
</tr>
<tr>
<td class="mdc-bg-light-green-50" style="text-align:left">
this is a visually stunning rumination on love , memory , history and the war between art and commerce
</td>
<td class="mdc-bg-purple-50">
1
</td>
</tr>
<tr>
<td class="mdc-bg-light-green-50" style="text-align:left">
jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
</td>
<td class="mdc-bg-purple-50">
1
</td>
</tr>
</table>
<div class="tooltip">
<h2 id="section">모델: 문장 감정 분류</h2>
<p><span class="tooltiptext">
Models: Sentence Sentiment Classification
</span></p>
</div>
<div class="tooltip">
<p>우리의 목표는 (우리의 데이터셋에 있는 것과 같은) 문장을 받아서 (긍정적 감정을 가진 문장을 의미하는) 1 또는 (부정적 감정을 가진 문장을 의미하는) 0을 생성하는 모델을 만드는 것입니다. 다음과 같이 생겼다고 생각할 수 있습니다:
<span class="tooltiptext">
Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/sentiment-classifier-1.png" />
<br />
</div>
<div class="tooltip">
<p>모델은 사실 내부적으로 두개의 모델로 구성되어 있습니다.
<span class="tooltiptext">
Under the hood, the model is actually made up of two models.
</span></p>
</div>
<div class="tooltip">
<ul>
<li><a href="https://medium.com/huggingface/distilbert-8cf3380435b5">DistilBERT</a>는 문장을 처리하고, 문장에서 추출한 몇가지 정보를 다음 모델에 전달합니다. DistilBERT는 <a href="https://huggingface.co/">HuggingFace</a> 팀에서 개발하고 오픈소스로 제공하는 BERT의 작은 버전입니다. 성능은 BERT와 대략 비슷하지만 가볍고 빠른 버전입니다.</li>
<li>다음 모델인 scikit learn의 기본 Logistic Regression 모델은 DistilBERT 처리 결과를 받아 문장을 긍정 또는 부정 (각각 1 또는 0)으로 분류합니다.
<span class="tooltiptext">
<span>*</span> <a href="https://medium.com/huggingface/distilbert-8cf3380435b5">DistilBERT</a> processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at <a href="https://huggingface.co/">HuggingFace</a>. It’s a lighter and faster version of BERT that roughly matches its performance.
<span>*</span> The next model, a basic Logistic Regression model from scikit learn will take in the result of DistilBERT’s processing, and classify the sentence as either positive or negative (1 or 0, respectively).
</span></li>
</ul>
</div>
<div class="tooltip">
<p>두 모델 간 전달하는 데이터는 768 크기의 벡터입니다. 벡터를 분류하려는 문장에 대한 임베딩으로 생각할 수 있습니다.
<span class="tooltiptext">
The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the sentence that we can use for classification.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/distilbert-bert-sentiment-classifier.png" />
<br />
</div>
<div class="tooltip">
<p>제 이전 글인 <a href="https://jalammar.github.io/illustrated-bert/">Illustrated BERT</a>를 읽으셨다면, 이 벡터가 ([CLS] 토큰을 입력으로 받는) 첫번째 위치의 결과임을 아실 수 있습니다.
<span class="tooltiptext">
If you’ve read my previous post, <a href="https://jalammar.github.io/illustrated-bert/">Illustrated BERT</a>, this vector is the result of the first position (which receives the [CLS] token as input).
</span></p>
</div>
<div class="tooltip">
<h2 id="section-1">모델 학습</h2>
<p><span class="tooltiptext">
Model Training
</span></p>
</div>
<div class="tooltip">
<p>두개의 모델을 사용하지만, logistic regression 모델만을 학습시킬 것입니다. DistilBERT의 경우 이미 pre-train되었고 영어를 이해할 수 있는 모델을 사용할 것입니다. 하지만 이 모델은 문장 분류를 위해 fine-tune되지 않았습니다. 그러나 우리는 BERT가 훈련된 일반적인 objectives에서 일부 문장 분류 능력을 얻습니다. 이 것은 특히 첫번째 위치에 대한 BERT의 출력([CLS] 토큰과 관련됨)이 그렇습니다. BERT의 두번째 훈련 objective(다음 문장 분류) 때문에 이 능력을 약간 갖게 되었다고 생각합니다. 이 objective는 첫번째 위치의 출력에 문장 전체의 의미를 캡슐화하도록 모델을 훈련시키는 것 같습니다. <a href="https://github.com/huggingface/transformers">transformers</a> 라이브러리는 pretrain된 버전의 모델 뿐만 아니라 DistilBERT의 구현도 제공합니다.
<span class="tooltiptext">
While we’ll be using two models, we will only train the logistic regression model. For DistilBERT, we’ll use a model that’s already pre-trained and has a grasp on the English language. This model, however, is neither trained nor fine-tuned to do sentence classification. We get some sentence classification capability, however, from the general objectives BERT is trained on. This is especially the case with BERT’s output for the first position (associated with the [CLS] token). I believe that’s due to BERT’s second training objective – next sentence classification. That objective seemingly trains the model to encapsulate a sentence-wide sense in the output at the first position. The <a href="https://github.com/huggingface/transformers">transformers</a> library provides us with an implementation of DistilBERT as well as pretrained versions of the model.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/model-training.png" />
<br />
</div>
<div class="tooltip">
<h2 id="section-2">튜토리얼 개요</h2>
<p><span class="tooltiptext">
Tutorial Overview
</span></p>
</div>
<div class="tooltip">
<p>이 튜토리얼의 전략은 다음과 같습니다. 우리는 먼저 2,000 문장에 대한 문장 임베딩을 생성하기 위해 훈련된 distilBERT를 사용할 것입니다.
<span class="tooltiptext">
So here’s the game plan with this tutorial. We will first use the trained distilBERT to generate sentence embeddings for 2,000 sentences.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />
<br />
</div>
<div class="tooltip">
<p>이 단계 이후로는 distilBERT를 건드리지 않습니다. 여기부터는 모두 Scikit Learn입니다. 이 데이터셋에 대해 일반적인 train/test 분할을 수행합니다.
<span class="tooltiptext">
We will not touch distilBERT after this step. It’s all Scikit Learn from here. We do the usual train/test split on this dataset:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />
<br />
<div class="tooltip">
<p>distilBert(모델 #1)의 출력에 대한 Train/test 분할은, logistric regression(모델 #2)을 훈련하고 평가할 데이터셋을 생성합니다. 현실에서는, sklearn의 train/test split이 분할 전에 example들을 섞기 때문에, 데이터셋에서 위쪽 75%의 example들을 (훈련에) 사용하는 것은 아닙니다. (이 포스팅에서 example은 dataset을 구성하고 있는 문장-레이블 조합을 의미합니다. 즉, dataset의 한 row 입니다.)
<span class="tooltiptext">
Train/test split for the output of distilBert (model #1) creates the dataset we’ll train and evaluate logistic regression on (model #2). Note that in reality, sklearn’s train/test split shuffles the examples before making the split, it doesn’t just take the first 75% of examples as they appear in the dataset.
</span></p>
</div>
</div>
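<p>본문에서 말한 셔플 동작은 작은 가짜 데이터로 직접 확인해볼 수 있습니다. 아래는 sklearn의 <code>train_test_split</code> 기본 동작(기본값 <code>test_size=0.25</code>, 분할 전 셔플)을 보여주는 간단한 스케치이며, 데이터와 변수명은 설명을 위한 예시입니다:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 0..7 인덱스를 가진 가짜 example 8개 / 8 dummy examples
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# 기본값 test_size=0.25 이므로 75%가 train으로 갑니다.
# random_state를 고정해도 분할 전에 셔플은 그대로 일어납니다.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(len(X_train), len(X_test))  # 6 2
print(X_train.ravel())            # 데이터셋의 "앞쪽 75%"가 그대로 오지 않음
```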
<div class="tooltip">
<p>그 다음 우리는 train set을 가지고 logistic regression 모델을 훈련합니다:
<span class="tooltiptext">
Then we train the logistic regression model on the training set:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-training-logistic-regression.png" />
<br />
</div>
<div class="tooltip">
<h2 id="section-3">(단일) 예측을 계산하는 방법</h2>
<p><span class="tooltiptext">
How a single prediction is calculated
</span></p>
</div>
<div class="tooltip">
<p>code로 들어가서 어떻게 모델이 훈련되는지 설명하기 전에, 훈련된 모델이 예측을 어떻게 계산하는지 알아봅시다.
<span class="tooltiptext">
Before we dig into the code and explain how to train the model, let’s look at how a trained model calculates its prediction.
</span></p>
</div>
<div class="tooltip">
<p>“a visually stunning rumination on love”(사랑에 대한 시각적으로 굉장히 놀라운 반추)라는 문장을 분류해봅시다. 첫번째 단계는 BERT 토크나이저를 사용하여 단어를 토큰으로 나누는 것입니다. 그 다음, 문장 분류를 위해 필요한 스페셜 토큰(첫번째 위치에 [CLS], 문장의 끝에 [SEP]이 있음)을 추가합니다.
<span class="tooltiptext">
Let’s try to classify the sentence “a visually stunning rumination on love”. The first step is to use the BERT tokenizer to first split the word into tokens. Then, we add the special tokens needed for sentence classifications (these are [CLS] at the first position, and [SEP] at the end of the sentence).
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-tokenization-1.png" />
<br />
</div>
<div class="tooltip">
<p>토크나이저가 하는 세번째 단계는 각 토큰을 훈련된 모델의 일부분인 임베딩 테이블을 이용하여 id로 치환하는 것입니다. 단어 임베딩에 대한 배경정보는 <a href="https://jalammar.github.io/illustrated-word2vec/">The Illustrated Word2vec</a>를 읽어보세요.
<span class="tooltiptext">
The third step the tokenizer does is to replace each token with its id from the embedding table which is a component we get with the trained model. Read <a href="https://jalammar.github.io/illustrated-word2vec/">The Illustrated Word2vec</a> for a background on word embeddings.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />
<br />
</div>
<div class="tooltip">
<p>토크나이저는 이 모든 과정을 한줄의 코드로 수행합니다:
<span class="tooltiptext">
Note that the tokenizer does all these steps in a single line of code:
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"a visually stunning rumination on love"</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>이제 우리의 입력 문장은 DistilBERT로 전달되기에 적합한 shape이 되었습니다.
<span class="tooltiptext">
Our input sentence is now the proper shape to be passed to DistilBERT.
</span></p>
</div>
<div class="tooltip">
<p>만약 <a href="https://jalammar.github.io/illustrated-bert/">Illustrated BERT</a>를 읽으셨다면, 이 단계 또한 아래와 같이 시각화할 수 있습니다.
<span class="tooltiptext">
If you’ve read <a href="https://jalammar.github.io/illustrated-bert/">Illustrated BERT</a>, this step can also be visualized in this manner:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-input-tokenization.png" />
<br />
</div>
<div class="tooltip">
<h2 id="distilbert--">DistilBERT 통과하기</h2>
<p><span class="tooltiptext">
Flowing Through DistilBERT
</span></p>
</div>
<div class="tooltip">
<p>입력 벡터를 DistilBERT로 전달하는 것은 <a href="https://jalammar.github.io/illustrated-bert/">BERT</a>와 동일합니다. 출력은 각 입력 토큰에 대한 벡터가 됩니다. 각 벡터는 768개의 숫자(float)로 구성됩니다.
<span class="tooltiptext">
Passing the input vector through DistilBERT works <a href="https://jalammar.github.io/illustrated-bert/">just like BERT</a>. The output would be a vector for each input token. Each vector is made up of 768 numbers (floats).
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-model-input-output-1.png" />
<br />
</div>
<div class="tooltip">
<p>문장 분류 task이므로, 첫번째 벡터([CLS] 토큰과 관련된 것)를 제외하고 모두 무시합니다. 이 하나의 벡터를 logistic regression 모델의 입력으로 전달합니다.
<span class="tooltiptext">
Because this is a sentence classification task, we ignore all except the first vector (the one associated with the [CLS] token). The one vector we pass as the input to the logistic regression model.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-model-calssification-output-vector-cls.png" />
<br />
</div>
<div class="tooltip">
<p>여기에서, 훈련 단계에서 배운 것을 기반으로 이 벡터를 분류하는 것이 logistic regression 모델이 해야하는 일 입니다. 예측 계산은 다음과 같습니다:
<span class="tooltiptext">
From here, it’s the logistic regression model’s job to classify this vector based on what it learned from its training phase. We can think of a prediction calculation as looking like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-sentence-classification-example.png" />
<br />
</div>
<div class="tooltip">
<p>훈련은 전체적인 프로세스의 코드와 함께 다음 섹션에서 알아보겠습니다.
<span class="tooltiptext">
The training is what we’ll discuss in the next section, along with the code of the entire process.
</span></p>
</div>
<p>
<div class="tooltip">
<h2 id="section-4">코드</h2>
<p><span class="tooltiptext">
The Code
</span></p>
</div>
</p>
<div class="tooltip">
<p>이번 섹션에서는 문장 분류 모델을 훈련하기 위한 코드를 살펴보겠습니다. 모든 코드를 포함하는 notebook은 <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">colab</a>과 <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">github</a>에서 확인하실 수 있습니다.
<span class="tooltiptext">
In this section we’ll highlight the code to train this sentence classification model. A notebook containing all this code is available on <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">colab</a> and <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">github</a>.
</span></p>
</div>
<div class="tooltip">
<p>import 부터 시작해봅시다
<span class="tooltiptext">
Let’s start by importing the tools of the trade
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">transformers</span> <span class="k">as</span> <span class="n">ppb</span> <span class="c1"># pytorch transformers
</span><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
</code></pre></div></div>
<div class="tooltip">
<p>데이터셋은 <a href="https://github.com/clairett/pytorch-sentiment-classification/">이 github</a>에서 파일로 있으므로, pandas dataframe으로 바로 가져오기만 하면 됩니다.
<span class="tooltiptext">
The dataset is <a href="https://github.com/clairett/pytorch-sentiment-classification/">available</a> as a file on github, so we just import it directly into a pandas dataframe
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">'</span><span class="se">\t</span><span class="s">'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>dataframe의 첫 5행을 읽어서 데이터가 어떤 모양인지 확인하기위해 df.head()를 사용할 수 있습니다.
<span class="tooltiptext">
We can use df.head() to look at the first five rows of the dataframe to see how the data looks.
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div class="tooltip">
<p>출력:
<span class="tooltiptext">
Which outputs:
</span></p>
<div class="img-div-any-width">
<image src="/images/distilBERT/sst2-df-head.png" />
<br />
</div>
</div>
<div class="tooltip">
<h3 id="pre-train-distilbert----import-">pre-train된 DistilBERT 모델 및 토크나이저 import 하기</h3>
<p><span class="tooltiptext">
Importing pre-trained DistilBERT model and tokenizer
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_class</span><span class="p">,</span> <span class="n">tokenizer_class</span><span class="p">,</span> <span class="n">pretrained_weights</span> <span class="o">=</span> <span class="p">(</span><span class="n">ppb</span><span class="p">.</span><span class="n">DistilBertModel</span><span class="p">,</span> <span class="n">ppb</span><span class="p">.</span><span class="n">DistilBertTokenizer</span><span class="p">,</span> <span class="s">'distilbert-base-uncased'</span><span class="p">)</span>
<span class="c1">## Want BERT instead of distilBERT? Uncomment the following line:
## distilBERT 대신 BERT를 사용하고 싶으신가요? 그럼 아래의 주석처리를 해제하세요:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
</span>
<span class="c1"># Load pretrained model/tokenizer
# pretrain된 model/tokenizer를 로딩하기
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">tokenizer_class</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_weights</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model_class</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_weights</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>이제 데이터셋을 토큰화할 수 있습니다. 이전 예제와는 조금 다른 것들을 해보려고 합니다. 위 예제에서는 한 문장에 대해서만 토큰화하고 처리를 했습니다. 지금은, 모든 문장을 배치로 묶어서 토큰화하고 처리하겠습니다 (notebook은 리소스 제약으로 작은 그룹의 example들만 처리합니다. 2000개 example이라고 가정해봅시다).
<span class="tooltiptext">
We can now tokenize the dataset. Note that we’re going to do things a little differently here from the example above. The example above tokenized and processed only one sentence. Here, we’ll tokenize and process all sentences together as a batch (the notebook processes a smaller group of examples just for resource considerations, let’s say 2000 examples).
</span></p>
</div>
<div class="tooltip">
<h3 id="section-5">토큰화하기</h3>
<p><span class="tooltiptext">
Tokenization
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenized</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nb">apply</span><span class="p">((</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">True</span><span class="p">)))</span>
</code></pre></div></div>
<div class="tooltip">
<p>이 코드는 모든 문장을 id로 바꿉니다.
<span class="tooltiptext">
This turns every sentence into a list of ids.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/sst2-text-to-tokenized-ids-bert-example.png" />
<br />
</div>
<div class="tooltip">
<p>데이터셋은 현재 리스트들의 리스트(또는 pandas의 Series/DataFrame)입니다. DistilBERT가 이것을 입력으로 처리하려면, 짧은 문장에 토큰 id 0을 패딩(padding)하여 모든 벡터를 같은 크기로 만들어야 합니다. 패딩 단계는 notebook을 참고하세요. 기본적인 파이썬 문자열(string) 및 배열(array) 조작(manipulation)입니다.
<span class="tooltiptext">
The dataset is currently a list (or pandas Series/DataFrame) of lists. Before DistilBERT can process this as input, we’ll need to make all the vectors the same size by padding shorter sentences with the token id 0. You can refer to the notebook for the padding step, it’s basic python string and array manipulation.
</span></p>
</div>
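<p>패딩의 기본 아이디어는 notebook 없이도 다음과 같이 스케치할 수 있습니다. 가장 긴 sequence 길이를 찾아, 짧은 문장 뒤에 토큰 id 0을 붙입니다 (아래 토큰 id 값들은 설명을 위한 가짜 값입니다):</p>

```python
import numpy as np

# 위의 `tokenized`처럼 길이가 제각각인 토큰 id 리스트들 (가짜 값)
tokenized = [[101, 1037, 2298, 102], [101, 2307, 102]]

# 가장 긴 sequence 길이를 찾아, 모자란 만큼 뒤에 0을 패딩
max_len = max(len(ids) for ids in tokenized)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])

print(padded.shape)  # (2, 4)
```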
<div class="tooltip">
<p>패딩 이후에, BERT에 전달할 준비가 된 행렬/텐서가 있습니다.
<span class="tooltiptext">
After the padding, we have a matrix/tensor that is ready to be passed to BERT:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-input-tensor.png" />
<br />
</div>
<p>
<div class="tooltip">
<h3 id="distilbert-">DistilBERT로 처리하기</h3>
<p><span class="tooltiptext">
Processing with DistilBERT
</span></p>
</div>
</p>
<div class="tooltip">
<p>패딩된 토큰 행렬에서 입력 텐서를 생성하여 DistilBERT로 전달합니다.
<span class="tooltiptext">
We now create an input tensor out of the padded token matrix, and send that to DistilBERT
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">padded</span><span class="p">))</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">last_hidden_states</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>이 단계가 실행된 후 <code class="language-plaintext highlighter-rouge">last_hidden_states</code>는 DistilBERT의 출력을 가지고 있습니다. 이 것은 (example 수, sequence에서 최대 토큰 수, DistilBERT 모델에서 hidden unit 수) shape을 가진 tuple입니다. 우리의 경우에, 2000 (우리 스스로 2000개 example로 제한했으므로), 66 (2000 example로 부터 가장 긴 sequence의 토큰의 개수), 768 (DistilBERT 모델의 hidden unit 수) 입니다.
<span class="tooltiptext">
After running this step, <code class="language-plaintext highlighter-rouge">last_hidden_states</code> holds the outputs of DistilBERT. It is a tuple with the shape (number of examples, max number of tokens in the sequence, number of hidden units in the DistilBERT model). In our case, this will be 2000 (since we only limited ourselves to 2000 examples), 66 (which is the number of tokens in the longest sequence from the 2000 examples), 768 (the number of hidden units in the DistilBERT model).
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-output-tensor-predictions.png" />
<br />
</div>
<div class="tooltip">
<h3 id="bert---">BERT 출력 텐서 펼쳐보기</h3>
<p><span class="tooltiptext">
Unpacking the BERT output tensor
</span></p>
</div>
<div class="tooltip">
<p>이 3-차원 출력 텐서를 펼쳐서 보겠습니다. 이 것의 차원을 확인하는 것으로부터 시작할 수 있습니다.
<span class="tooltiptext">
Let’s unpack this 3-d output tensor. We can first start by examining its dimensions:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-output-tensor.png" />
<br />
</div>
<div class="tooltip">
<h3 id="section-6">문장의 처리과정 요약</h3>
<p><span class="tooltiptext">
Recapping a sentence’s journey
</span></p>
</div>
<div class="tooltip">
<p>각 행은 우리 데이터셋으로 부터 문장과 관련이 있습니다. 첫번째 문장의 처리 경로를 요약하면 아래와 같습니다:
<span class="tooltiptext">
Each row is associated with a sentence from our dataset. To recap the processing path of the first sentence, we can think of it as looking like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-input-to-output-tensor-recap.png" />
<br />
</div>
<div class="tooltip">
<h3 id="section-7">필요한 부분만 자르기</h3>
<p><span class="tooltiptext">
Slicing the important part
</span></p>
</div>
<div class="tooltip">
<p>문장 분류를 위해, 우리는 [CLS] 토큰에 대한 BERT의 출력에만 관심이 있습니다. 그래서, 큐브의 해당 조각만 취하고 나머지는 버립니다.
<span class="tooltiptext">
For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything else.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-output-tensor-selection.png" />
<br />
</div>
<div class="tooltip">
<p>3d 텐서에서 우리가 관심있는 2d 텐서를 얻기 위해 자르는 방법입니다:
<span class="tooltiptext">
This is how we slice that 3d tensor to get the 2d tensor we’re interested in:
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1"># Slice the output for the first position for all the sequences, take all hidden unit outputs
</span><span class="n">features</span> <span class="o">=</span> <span class="n">last_hidden_states</span><span class="p">[</span><span class="mi">0</span><span class="p">][:,</span><span class="mi">0</span><span class="p">,:].</span><span class="n">numpy</span><span class="p">()</span>
</code></pre></div></div>
<div class="tooltip">
<p>이제 <code class="language-plaintext highlighter-rouge">features</code>는 데이터셋 모든 문장의 임베딩을 포함하고 있는 2차원 NumPy 배열입니다.
<span class="tooltiptext">
And now <code class="language-plaintext highlighter-rouge">features</code> is a 2d numpy array containing the sentence embeddings of all the sentences in our dataset.
</span></p>
</div>
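<p>이 슬라이싱이 shape을 어떻게 바꾸는지는, 모델을 실제로 실행하지 않고도 본문에서 말한 (2000, 66, 768) shape의 가짜 배열로 확인할 수 있습니다:</p>

```python
import numpy as np

# DistilBERT 출력과 같은 shape의 가짜 텐서
# (example 수, sequence 최대 토큰 수, hidden unit 수)
dummy_output = np.zeros((2000, 66, 768))

# 모든 example에 대해 첫번째 위치([CLS])의 출력만 슬라이스
features = dummy_output[:, 0, :]

print(features.shape)  # (2000, 768)
```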
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-output-cls-senteence-embeddings.png" />
<br />
<div class="tooltip">
<p>BERT의 출력에서 slice한 텐서
<span class="tooltiptext">
The tensor we sliced from BERT’s output
</span></p>
</div>
</div>
<div class="tooltip">
<h2 id="logistic-regression--">Logistic Regression을 위한 데이터셋</h2>
<p><span class="tooltiptext">
Dataset for Logistic Regression
</span></p>
</div>
<div class="tooltip">
<p>이제 BERT의 출력을 얻었으니, logistic regression 모델을 훈련하는데 필요한 데이터셋이 조립되었습니다. 768개의 열이 feature가 되고, label은 초기 데이터셋에서 그대로 가져옵니다.
<span class="tooltiptext">
Now that we have the output of BERT, we have assembled the dataset we need to train our logistic regression model. The 768 columns are the features, and the labels we just get from our initial dataset.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/logistic-regression-dataset-features-labels.png" />
<br />
<div class="tooltip">
Logistic Regression을 훈련하는데 사용하는 레이블링된 데이터셋입니다. feature는, 이전 그림에서 슬라이스 했던 (위치 #0) [CLS] 토큰에 대한 BERT의 출력 벡터입니다. 각 행은 데이터셋의 문장에 해당하며, 각 열은 Bert/DistilBERT 모델의 가장 상단 transformer block에 있는 feed-forward neural network의 hidden unit의 출력에 해당합니다.
<span class="tooltiptext">
The labeled dataset we use to train the Logistic Regression. The features are the output vectors of BERT for the [CLS] token (position #0) that we sliced in the previous figure. Each row corresponds to a sentence in our dataset, each column corresponds to the output of a hidden unit from the feed-forward neural network at the top transformer block of the Bert/DistilBERT model.
</span>
</div>
</div>
<div class="tooltip">
<p>머신러닝에서의 전통적인 train/test split을 한 뒤, Logistic Regression 모델을 선언하고, 데이터셋에 대해 훈련할 수 있습니다.
<span class="tooltiptext">
After doing the traditional train/test split of machine learning, we can declare our Logistic Regression model and train it against the dataset.
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">labels</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">train_features</span><span class="p">,</span> <span class="n">test_features</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">,</span> <span class="n">test_labels</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>데이터셋을 training set과 testing set으로 분할합니다:
<span class="tooltiptext">
Which splits the dataset into training/testing sets:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />
<br />
</div>
<div class="tooltip">
<p>그 다음, training set에 대해 Logistic Regression 모델을 훈련합니다.
<span class="tooltiptext">
Next, we train the Logistic Regression model on the training set.
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lr_clf</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">()</span>
<span class="n">lr_clf</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_features</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>이제 모델이 훈련됐고, test set에 대한 점수(score)를 계산할 수 있습니다:
<span class="tooltiptext">
Now that the model is trained, we can score it against the test set:
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lr_clf</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">test_features</span><span class="p">,</span> <span class="n">test_labels</span><span class="p">)</span>
</code></pre></div></div>
<div class="tooltip">
<p>모델이 약 81%의 정확도(accuracy)를 달성했음을 보여줍니다.
<span class="tooltiptext">
Which shows the model achieves around 81% accuracy.
</span></p>
</div>
<p>
<div class="tooltip">
<h2 id="section-8">스코어 벤치마크</h2>
<p><span class="tooltiptext">
Score Benchmarks
</span></p>
</div>
</p>
<div class="tooltip">
<p>참고로, 이 데이터셋에 대한 최고 정확도 스코어는 <strong>96.8</strong>이었습니다. DistilBERT는 이 task에 대해 score를 개선하기 위해 훈련될 수 있습니다 – 이 프로세스는 fine-tuning이라고 부르며, BERT의 weight를 문장 분류(<em>downstream task</em>라고 부를 수 있음)에서 더 좋은 성능을 달성하도록 업데이트합니다. fine-tune된 DistilBERT는 <strong>90.7</strong>의 정확도를 달성하는 것으로 나왔습니다. full size BERT 모델은 <strong>94.9</strong>를 달성했습니다.
<span class="tooltiptext">
For reference, the highest accuracy score for this dataset is currently <strong>96.8</strong>. DistilBERT can be trained to improve its score on this task – a process called fine-tuning which updates BERT’s weights to make it achieve a better performance in the sentence classification (which we can call the <em>downstream task</em>). The fine-tuned DistilBERT turns out to achieve an accuracy score of <strong>90.7</strong>. The full size BERT model achieves <strong>94.9</strong>.
</span></p>
</div>
<p>
<div class="tooltip">
<h2 id="notebook-">Notebook 실습</h2>
<p><span class="tooltiptext">
The Notebook
</span></p>
</div>
</p>
<div class="tooltip">
<p>이 <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">notebook</a> 또는 <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">colab에서 실행</a>하여 바로 확인해보세요.
<span class="tooltiptext">
Dive right into <a href="https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">the notebook</a> or <a href="https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb">run it on colab</a>.
</span></p>
</div>
<div class="tooltip">
<p>해냈습니다! BERT와의 좋은 첫 만남이었습니다. 다음 단계는 문서를 참고하여 <a href="https://huggingface.co/transformers/examples.html#glue">fine-tuning</a>을 직접 해보는 것입니다. 돌아가서 distilBERT를 BERT로 바꿔서 어떻게 동작하는지 확인해볼 수도 있습니다.
<span class="tooltiptext">
And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at <a href="https://huggingface.co/transformers/examples.html#glue">fine-tuning</a>. You can also go back and switch from distilBERT to BERT and see how that works.
</span></p>
</div>
<div class="tooltip">
<p>이 튜토리얼의 초기 버전에 대한 피드백을 주신 <a href="https://twitter.com/ClementDelangue">Clément Delangue</a>, <a href="https://twitter.com/SanhEstPasMoi">Victor Sanh</a>, Huggingface 팀 분들께 감사합니다.
<span class="tooltiptext">
Thanks to <a href="https://twitter.com/ClementDelangue">Clément Delangue</a>, <a href="https://twitter.com/SanhEstPasMoi">Victor Sanh</a>, and the Huggingface team for providing feedback to earlier versions of this tutorial.
</span></p>
</div>
<hr />
<h2 id="추가-정보">추가 정보<a href="#additional-info" name="additional-info">.</a></h2>
<ul>
<li>이 글은 BERT를 처음 사용하는 방법에 대해 이해하기 쉽게 그림으로 설명한 포스팅을 저자인 Jay Alammar님의 허락을 받고 번역한 글 입니다. 원문은 <a href="https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/">A Visual Guide to Using BERT for the First Time</a>에서 확인하실 수 있습니다.</li>
<li>원서/영문블로그를 보실 때 term에 대한 정보 호환을 위해, 이 분야에서 사용하고 있는 단어, 문구에 대해 가급적 번역하지 않고 원문 그대로 두었습니다. 그리고, 직역 보다는 개념이나 의미에 대한 설명을 쉽게 하는 문장 쪽으로 더 무게를 두어 번역 했습니다. 번역에 대한 의견이나 수정 사항은 아래 댓글 창에 남겨주세요.</li>
<li>번역문에 대응하는 영어 원문을 보고싶으신 분들을 위해 <a href="https://nlpinkorean.github.io">찬</a>님께서 만들어두신 툴팁 도움말 기능(해당 문단에 마우스를 올리면 (모바일의 경우 터치) 원문을 확인할 수 있는 기능)을 가져와서 적용했습니다. 감사합니다.</li>
</ul>chloamme이 글은 Jay Alammar님의 글을 번역한 글입니다. [추가정보] This post is a translated version of A Visual Guide to Using BERT for the First Time by Jay Alammar.[번역] NumPy 및 데이터 표현에 대한 시각적 소개2021-12-20T00:00:00+00:002021-12-20T00:00:00+00:00/2021/12/20/visual-numpy-korean<div class="tooltip">
<blockquote>
<p>이 글은 <a href="https://jalammar.github.io/visual-numpy/">Jay Alammar님의 글</a>을 번역한 글입니다. [<a href="#additional-info">추가정보</a>]
<span class="tooltiptext">
This post is a translated version of <a href="https://jalammar.github.io/visual-numpy/">A Visual Intro to NumPy and Data Representation</a> by Jay Alammar.
</span></p>
</blockquote>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array.png" />
<br />
</div>
<div class="tooltip">
<p><a href="https://www.numpy.org/">NumPy</a> 패키지는 파이썬 생태계에서 데이터 분석, 머신러닝, 과학적 컴퓨팅의 핵심 요소입니다. 벡터(vector)와 행렬(matrix)의 조작과 크런칭(대량 고속 처리)을 엄청나게 단순화시킵니다. (scikit-learn, SciPy, pandas, tensorflow 등) 파이썬의 주요 패키지는 NumPy를 인프라스트럭쳐의 기본/근본 부분으로 사용합니다. 숫자 데이터를 쪼개어 분석하는 능력 외에도 NumPy를 마스터하는 것은 이러한 라이브러리의 고급 활용 시 처리하거나 디버깅할 때에 우위를 확보할 수 있습니다.
<span class="tooltiptext">
The <a href="https://www.numpy.org/">NumPy</a> package is the workhorse of data analysis, machine learning, and scientific computing in the python ecosystem. It vastly simplifies manipulating and crunching vectors and matrices. Some of python’s leading packages rely on NumPy as a fundamental piece of their infrastructure (examples include scikit-learn, SciPy, pandas, and tensorflow). Beyond the ability to slice and dice numeric data, mastering NumPy will give you an edge when dealing with and debugging advanced use cases in these libraries.
</span></p>
</div>
<div class="tooltip">
<p>이번 포스팅에서는 NumPy를 사용하기 위한 주요 방법들을 살펴보고, 머신러닝 모델에 넣기 전에 (테이블, 이미지, 텍스트 등) 다른 데이터 타입을 표현하는 방법에 대해 알아보겠습니다.
<span class="tooltiptext">
In this post, we’ll look at some of the main ways to use NumPy and how it can represent different types of data (tables, images, text…etc) before we can serve them to machine learning models.
</span></p>
</div>
<!--more-->
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
</code></pre></div></div>
<h2 id="creating-arrays-배열-생성">Creating Arrays (배열 생성)</h2>
<div class="tooltip">
<p>파이썬 리스트를 <code class="language-plaintext highlighter-rouge">np.array()</code>에 전달하여, NumPy 배열(즉, 강력한 <a href="https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html">ndarray</a>)을 생성할 수 있습니다. 이 케이스에서 파이썬은 오른쪽에 보이는 배열을 생성합니다.
<span class="tooltiptext">
We can create a NumPy array (a.k.a. the mighty <a href="https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html">ndarray</a>) by passing a python list to it and using <code class="language-plaintext highlighter-rouge">np.array()</code>. In this case, python creates the array we can see on the right here:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/create-numpy-array-1.png" />
<br />
</div>
<div class="tooltip">
<p>NumPy가 배열의 값을 초기화해주기를 바라는 케이스들이 종종 있습니다. NumPy는 이런 경우에 ones(), zeros(), random.random()과 같은 메서드를 제공합니다. 우리는 생성하기 원하는 element의 개수만 전달하면 됩니다:
<span class="tooltiptext">
There are often cases when we want NumPy to initialize the values of the array for us. NumPy provides methods like ones(), zeros(), and random.random() for these cases. We just pass them the number of elements we want it to generate:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/create-numpy-array-ones-zeros-random.png" />
<br />
</div>
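위 그림의 배열 생성 방법들을 코드로 간단히 확인해보면 다음과 같습니다:

```python
import numpy as np

a = np.array([1, 2, 3])       # 파이썬 리스트로부터 ndarray 생성
ones = np.ones(3)             # 1로 초기화된 배열
zeros = np.zeros(3)           # 0으로 초기화된 배열
rnd = np.random.random(3)     # [0, 1) 구간의 난수 3개
print(a, ones, zeros, rnd)
```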
<div class="tooltip">
<p>배열을 생성하면, 재밌는 방법들로 배열을 조작(manipulate)할 수 있습니다.
<span class="tooltiptext">
Once we’ve created our arrays, we can start to manipulate them in interesting ways.
</span></p>
</div>
<h2 id="array-arithmetic-배열-산술연산">Array Arithmetic (배열 산술연산)</h2>
<div class="tooltip">
<p>NumPy 배열의 유용성을 설명하기 위해 두 개의 NumPy 배열을 생성합니다. <code class="language-plaintext highlighter-rouge">data</code>와 <code class="language-plaintext highlighter-rouge">ones</code>라고 부르겠습니다:
<span class="tooltiptext">
Let’s create two NumPy arrays to showcase their usefulness. We’ll call them <code class="language-plaintext highlighter-rouge">data</code> and <code class="language-plaintext highlighter-rouge">ones</code>:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-arrays-example-1.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>position-wise 덧셈(즉, 각 행마다의 값들을 더하기)은 <code class="language-plaintext highlighter-rouge">data + ones</code>를 입력하면 됩니다. 간단합니다.
<span class="tooltiptext">
Adding them up position-wise (i.e. adding the values of each row) is as simple as typing <code class="language-plaintext highlighter-rouge">data + ones</code>:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-arrays-adding-1.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>이러한 도구들을 배우기 시작했을 때, loop를 돌면서 계산하는 것을 프로그래밍할 필요가 없는 이런 추상화가 신선하게 느껴졌습니다. 상위 레벨에서 문제를 생각해볼 수 있게 만드는 멋진 추상화입니다.
<span class="tooltiptext">
When I started learning such tools, I found it refreshing that an abstraction like this makes me not have to program such a calculation in loops. It’s a wonderful abstraction that allows you to think about problems at a higher level.
</span></p>
</div>
<div class="tooltip">
<p>이런 방식으로 할 수 있는 것은 덧셈 뿐만이 아닙니다:
<span class="tooltiptext">
And it’s not only addition that we can do this way:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array-subtract-multiply-divide.png" />
<br />
</div>
<p><br /></p>
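그림의 position-wise 산술연산 네 가지를 코드로 옮기면 다음과 같습니다:

```python
import numpy as np

data = np.array([1, 2])
ones = np.ones(2)

print(data + ones)   # [2. 3.]
print(data - ones)   # [0. 1.]
print(data * data)   # [1 4]
print(data / data)   # [1. 1.]
```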
<div class="tooltip">
<p>배열과 하나의 숫자값을 연산하고 싶을 때도 종종 있습니다 (우리는 이 것을 벡터와 스칼라간 연산이라고 부릅니다). 예를 들어, 마일로 표현된 거리 값을 가지고 있는 배열을 킬로미터로 변환하려고 한다고 가정해 보겠습니다. 간단하게 <code class="language-plaintext highlighter-rouge">data * 1.6</code>라고 하면 됩니다:
<span class="tooltiptext">
There are often cases when we want to carry out an operation between an array and a single number (we can also call this an operation between a vector and a scalar). Say, for example, our array represents distance in miles, and we want to convert it to kilometers. We simply say <code class="language-plaintext highlighter-rouge">data * 1.6</code>:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array-broadcast.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>NumPy가 이 연산(<code class="language-plaintext highlighter-rouge">*</code>)을 각 셀마다 곱셈이 수행되어야 한다는 의미로 이해했다는 점을 확인해보세요. 그 컨셉을 <em>브로드캐스팅</em>이라고 합니다. 매우 유용하죠.
<span class="tooltiptext">
See how NumPy understood that operation to mean that the multiplication should happen with each cell? That concept is called <em>broadcasting</em>, and it’s very useful.
</span></p>
</div>
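마일을 킬로미터로 변환하는 위의 스칼라 브로드캐스팅 예시를 코드로 확인해보면:

```python
import numpy as np

data = np.array([1, 2])   # 마일 단위 거리 (예시 값)
km = data * 1.6           # 스칼라 1.6이 각 셀에 브로드캐스트됨
print(km)                 # [1.6 3.2]
```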
<h2 id="indexing-인덱싱">Indexing (인덱싱)</h2>
<div class="tooltip">
<p>파이썬 리스트에서 슬라이싱을 할 수 있는 모든 방법으로 NumPy에서도 인덱싱과 슬라이싱을 할 수 있습니다:
<span class="tooltiptext">
We can index and slice NumPy arrays in all the ways we can slice python lists:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array-slice.png" />
<br />
</div>
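파이썬 리스트와 동일한 인덱싱/슬라이싱 문법을 코드로 확인해보면:

```python
import numpy as np

data = np.array([1, 2, 3])
print(data[0])     # 1
print(data[1])     # 2
print(data[0:2])   # [1 2]
print(data[1:])    # [2 3]
```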
<h2 id="aggregation-집계">Aggregation (집계)</h2>
<div class="tooltip">
<p>NumPy의 추가적인 이점은 집계 함수(aggregation function)들입니다.
<span class="tooltiptext">
Additional benefits NumPy gives us are aggregation functions:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array-aggregation.png" />
<br />
</div>
<div class="tooltip">
<p><code class="language-plaintext highlighter-rouge">min</code>, <code class="language-plaintext highlighter-rouge">max</code>, <code class="language-plaintext highlighter-rouge">sum</code> 뿐만아니라 평균을 구하는 <code class="language-plaintext highlighter-rouge">mean</code>, 모든 element들을 곱한 결과를 구하는 <code class="language-plaintext highlighter-rouge">prod</code>, 표준 편차를 구하는 <code class="language-plaintext highlighter-rouge">std</code>, <a href="https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html">다른 많은 연산들</a>을 얻을 수 있습니다.
<span class="tooltiptext">
In addition to <code class="language-plaintext highlighter-rouge">min</code>, <code class="language-plaintext highlighter-rouge">max</code>, and <code class="language-plaintext highlighter-rouge">sum</code>, you get all the greats like <code class="language-plaintext highlighter-rouge">mean</code> to get the average, <code class="language-plaintext highlighter-rouge">prod</code> to get the result of multiplying all the elements together, <code class="language-plaintext highlighter-rouge">std</code> to get standard deviation, and <a href="https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html">plenty of others</a>.
</span></p>
</div>
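위에서 언급한 집계 함수들을 작은 예시로 확인해보면:

```python
import numpy as np

data = np.array([1, 2, 3])
print(data.max())    # 3
print(data.min())    # 1
print(data.sum())    # 6
print(data.mean())   # 2.0 (평균)
print(data.prod())   # 6   (모든 element의 곱)
print(data.std())    # 약 0.816 (표준 편차)
```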
<h2 id="in-more-dimensions-더-높은-차원">In more dimensions (더 높은 차원)</h2>
<div class="tooltip">
<p>현재까지 우리가 살펴본 예제들은 1차원인 벡터들이었습니다. NumPy의 정수는, 우리가 지금까지 배운 능력들을 어떤 차원에도 적용할 수 있다는 것입니다.
<span class="tooltiptext">
All the examples we’ve looked at deal with vectors in one dimension. A key part of the beauty of NumPy is its ability to apply everything we’ve looked at so far to any number of dimensions.
</span></p>
</div>
<h3 id="creating-matrices-배열-생성">Creating Matrices (행렬 생성)</h3>
<div class="tooltip">
<p>아래와 같은 모양의 파이썬의 중첩 리스트를 전달하여, NumPy가 이를 나타내는 행렬을 생성하도록 할 수 있습니다:
<span class="tooltiptext">
We can pass python lists of lists in the following shape to have NumPy create a matrix to represent them:
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">],[</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">]])</span>
</code></pre></div></div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-array-create-2d.png" />
<br />
</div>
<div class="tooltip">
<p>우리가 생성하려는 배열의 차원을 기술하는 튜플을 (파라미터로) 전달한다면, (<code class="language-plaintext highlighter-rouge">ones()</code>, <code class="language-plaintext highlighter-rouge">zeros()</code>, <code class="language-plaintext highlighter-rouge">random.random()</code> 등) 위에서 살펴본 메서드들을 사용할 수 있습니다.
<span class="tooltiptext">
We can also use the same methods we mentioned above (<code class="language-plaintext highlighter-rouge">ones()</code>, <code class="language-plaintext highlighter-rouge">zeros()</code>, and <code class="language-plaintext highlighter-rouge">random.random()</code>) as long as we give them a tuple describing the dimensions of the matrix we are creating:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-ones-zeros-random.png" />
<br />
</div>
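행렬 생성도 같은 패턴입니다. 차원을 튜플로 전달하는 부분만 다릅니다:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])    # 중첩 리스트 -> 2x2 행렬
ones = np.ones((3, 2))            # 차원을 튜플 (3, 2)로 전달
zeros = np.zeros((3, 2))
rnd = np.random.random((3, 2))
print(m.shape, ones.shape)        # (2, 2) (3, 2)
```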
<h3 id="matrix-arithmetic-행렬-산술연산">Matrix Arithmetic (행렬 산술연산)</h3>
<div class="tooltip">
<p>두 행렬의 크기가 같다면, 산술연산자 (<code class="language-plaintext highlighter-rouge">+-*/</code>)를 사용해서 행렬을 더하거나 곱할 수 있습니다. NumPy는 연산을 position-wise로 처리합니다:
<span class="tooltiptext">
We can add and multiply matrices using arithmetic operators (<code class="language-plaintext highlighter-rouge">+-*/</code>) if the two matrices are the same size. NumPy handles those as position-wise operations:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-arithmetic.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>서로 다른 차원의 크기가 1인 경우(예: 행렬이 오직 1개의 열 또는 1개의 행을 가지고 있는 경우)에만, 크기가 다른 행렬에 대해 이러한 산술 연산을 수행할 수 있습니다. 이 경우에 NumPy는 해당 연산에 브로드캐스트 규칙을 적용합니다:
<span class="tooltiptext">
We can get away with doing these arithmetic operations on matrices of different size only if the different dimension is one (e.g. the matrix has only one column or one row), in which case NumPy uses its broadcast rules for that operation:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-broadcast.png" />
<br />
</div>
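같은 크기의 행렬 간 position-wise 연산과, 크기가 1인 차원에 대한 브로드캐스트를 코드로 확인해보면:

```python
import numpy as np

data = np.array([[1, 2], [3, 4]])
ones = np.ones((2, 2))
print(data + ones)           # position-wise: [[2. 3.] [4. 5.]]

ones_row = np.ones((1, 2))   # 행이 1개 -> data의 각 행에 브로드캐스트
print(data + ones_row)       # [[2. 3.] [4. 5.]]
```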
<h3 id="dot-product-내적">Dot Product (내적)</h3>
<div class="tooltip">
<p>산술 연산과의 주요 차이점은 내적을 사용한 <a href="https://www.mathsisfun.com/algebra/matrix-multiplying.html">행렬 곱셈</a>의 경우입니다. NumPy는 행렬에 다른 행렬과 내적 연산을 수행하는데 사용할 수 있는 <code class="language-plaintext highlighter-rouge">dot()</code> 메서드를 제공합니다:
<span class="tooltiptext">
A key distinction to make with arithmetic is the case of <a href="https://www.mathsisfun.com/algebra/matrix-multiplying.html">matrix multiplication</a> using the dot product. NumPy gives every matrix a <code class="language-plaintext highlighter-rouge">dot()</code> method we can use to carry-out dot product operations with other matrices:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-dot-product-1.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>행렬의 차원 정보를 그림 하단에 추가하여, 두 행렬이 서로 마주하는 면이 서로 같은 차원을 가지고 있어야 함을 강조했습니다. 이 연산을 아래와 같이 시각화하여 표현할 수 있습니다.
<span class="tooltiptext">
I’ve added matrix dimensions at the bottom of this figure to stress that the two matrices have to have the same dimension on the side they face each other with. You can visualize this operation as looking like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-dot-product-2.png" />
<br />
</div>
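위 그림의 내적을 코드로 옮기면 다음과 같습니다. (3,) 벡터와 (3, 2) 행렬이 맞대는 차원(3)이 일치하므로 결과는 (2,)가 됩니다:

```python
import numpy as np

data = np.array([1, 2, 3])                   # shape (3,)
powers_of_ten = np.array([[1, 10],
                          [100, 1000],
                          [10000, 100000]])  # shape (3, 2)
print(data.dot(powers_of_ten))               # [ 30201 302010]
```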
<h3 id="matrix-indexing-행렬-인덱싱">Matrix Indexing (행렬 인덱싱)</h3>
<div class="tooltip">
<p>인덱싱과 슬라이싱은 행렬을 다룰 때 더 유용합니다:
<span class="tooltiptext">
Indexing and slicing operations become even more useful when we’re manipulating matrices:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-indexing.png" />
<br />
</div>
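2차원 행렬에서의 인덱싱/슬라이싱을 작은 예시로 확인해보면:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])
print(data[0, 1])     # 2        (0행 1열)
print(data[1:3])      # [[3 4] [5 6]]  (1~2행)
print(data[0:2, 0])   # [1 3]    (0~1행의 0열)
```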
<h3 id="matrix-aggregation-행렬-집계연산">Matrix Aggregation (행렬 집계연산)</h3>
<div class="tooltip">
<p>벡터를 집계했던 것과 동일한 방법으로 행렬도 집계 연산을 할 수 있습니다:
<span class="tooltiptext">
We can aggregate matrices the same way we aggregated vectors:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-aggregation-1.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>행렬의 모든 값을 집계할 수 있을 뿐만 아니라, <code class="language-plaintext highlighter-rouge">axis</code> 파라미터를 사용하여 행 또는 열을 집계할 수도 있습니다.
<span class="tooltiptext">
Not only can we aggregate all the values in a matrix, but we can also aggregate across the rows or columns by using the <code class="language-plaintext highlighter-rouge">axis</code> parameter:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-matrix-aggregation-4.png" />
<br />
</div>
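<code>axis</code> 파라미터의 효과를 코드로 확인해보면 (axis=0은 열 방향, axis=1은 행 방향 집계):

```python
import numpy as np

data = np.array([[1, 2], [5, 3], [4, 6]])
print(data.max())         # 6       (전체 최댓값)
print(data.max(axis=0))   # [5 6]   (각 열의 최댓값)
print(data.max(axis=1))   # [2 5 6] (각 행의 최댓값)
```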
<h2 id="transposing-and-reshaping-전치-및-재구조화재배열">Transposing and Reshaping (전치 및 재구조화/재배열)</h2>
<div class="tooltip">
<p>행렬을 다룰 때 일반적으로(많은 경우) 회전을 시킬 필요가 있습니다. 우리가 두 행렬의 내적을 구해야 해서, 두 행렬이 공유하는(맞대는) 차원을 맞출 필요가 있을 때가 그렇습니다. NumPy 배열은 행렬의 전치를 구하는 <code class="language-plaintext highlighter-rouge">T</code>라는 편리한 속성(property)을 가지고 있습니다.
<span class="tooltiptext">
A common need when dealing with matrices is the need to rotate them. This is often the case when we need to take the dot product of two matrices and need to align the dimension they share. NumPy arrays have a convenient property called <code class="language-plaintext highlighter-rouge">T</code> to get the transpose of a matrix:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-transpose.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>고급 유즈케이스에서는, 특정 행렬의 차원을 서로 바꿔야(switch) 할 수도 있습니다. 모델이 기대하는 입력의 shape이 가지고 있는 데이터셋과 다른 머신러닝 어플리케이션에서 종종 발생합니다. NumPy의 <code class="language-plaintext highlighter-rouge">reshape()</code> 메서드는 이러한 케이스에서 유용합니다. 원하는 행렬의 차원을 전달하기만 하면 됩니다. 만약 -1을 차원 값으로 전달하면, NumPy는 그 행렬의 정보를 기반으로 정확한 차원 값을 계산해냅니다.
<span class="tooltiptext">
In more advanced use case, you may find yourself needing to switch the dimensions of a certain matrix. This is often the case in machine learning applications where a certain model expects a certain shape for the inputs that is different from your dataset. NumPy’s <code class="language-plaintext highlighter-rouge">reshape()</code> method is useful in these cases. You just pass it the new dimensions you want for the matrix. You can pass -1 for a dimension and NumPy can infer the correct dimension based on your matrix:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-reshape.png" />
<br />
</div>
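전치(<code>T</code>)와 <code>reshape()</code>(-1 포함)를 코드로 확인해보면:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
print(data.T.shape)                          # (2, 3) -- 전치
print(data.reshape(2, 3))                    # [[1 2 3] [4 5 6]]
print(data.reshape(1, -1).shape)             # (1, 6) -- -1은 NumPy가 계산
```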
<h2 id="yet-more-dimensions-한층-더-높은-차원">Yet More Dimensions (한층 더 높은 차원)</h2>
<div class="tooltip">
<p>NumPy는 앞서 언급한 모든 것을 어떤 차원 수에 대해서도 수행할 수 있습니다. 중심이 되는 자료 구조를 ndarray(N-Dimensional Array; N-차원 배열)라고 부르는 이유입니다.
<span class="tooltiptext">
NumPy can do everything we’ve mentioned in any number of dimensions. Its central data structure is called ndarray (N-Dimensional Array) for a reason.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-3d-array.png" />
<br />
</div>
<div class="tooltip">
<p>여러 면에서, 새로운 차원을 다루는 것은 NumPy 함수의 파라미터에 콤마(<code class="language-plaintext highlighter-rouge">,</code>)를 하나 추가하는 것일 뿐입니다.
<span class="tooltiptext">
In a lot of ways, dealing with a new dimension is just adding a comma to the parameters of a NumPy function:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-3d-array-creation.png" />
<br />
</div>
<div class="tooltip">
<p>참고: 3-차원 NumPy 배열을 출력하려고 할 때, 텍스트 출력은 여기에 표현한 것과 다르게 배열을 시각화합니다. NumPy의 n-차원 배열을 출력하는 순서는 마지막 축(axis)이 가장 빠르게 반복되고, 첫번째 축이 가장 느립니다. 이 것은 <code class="language-plaintext highlighter-rouge">np.ones((4,3,2))</code>이 아래와 같이 출력됨을 의미합니다:
<span class="tooltiptext">
Note: Keep in mind that when you print a 3-dimensional NumPy array, the text output visualizes the array differently than shown here. NumPy’s order for printing n-dimensional arrays is that the last axis is looped over the fastest, while the first is the slowest. Which means that <code class="language-plaintext highlighter-rouge">np.ones((4,3,2))</code> would be printed as:
</span></p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">array</span><span class="p">([[[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">]],</span>
<span class="p">[[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">]],</span>
<span class="p">[[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">]],</span>
<span class="p">[[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">]]])</span>
</code></pre></div></div>
<h2 id="practical-usage-실질적인-사용법">Practical Usage (실질적인 사용법)</h2>
<div class="tooltip">
<p>마무리입니다. 다음은 NumPy가 도움이 될 만한 몇가지 유용한 것들의 예입니다.
<span class="tooltiptext">
And now for the payoff. Here are some examples of the useful things NumPy will help you through.
</span></p>
</div>
<h3 id="formulas-수식">Formulas (수식)</h3>
<div class="tooltip">
<p>행렬과 벡터를 다루는 수학적인 공식을 구현하는 것은 NumPy를 고려해야 하는 주요 사용 사례입니다. 이 것이 NumPy가 과학적 파이썬 커뮤니티에서 사랑받는 이유입니다. 예를 들어, 회귀 문제를 다루는 지도학습 머신러닝 모델의 중심인 평균 제곱 오차 수식을 고려해보세요:
<span class="tooltiptext">
Implementing mathematical formulas that work on matrices and vectors is a key use case to consider NumPy for. It’s why NumPy is the darling of the scientific python community. For example, consider the mean square error formula that is central to supervised machine learning models tackling regression problems:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/mean-square-error-formula.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>NumPy에서 이 것을 구현하는 것은 간단합니다:
<span class="tooltiptext">
Implementing this is a breeze in NumPy:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-mean-square-error-formula.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>이 것의 정수는 (서로 같은 크기이기만 하다면) <code class="language-plaintext highlighter-rouge">predictions</code>와 <code class="language-plaintext highlighter-rouge">labels</code>가 한 개든 천 개든 몇 개의 값을 가지고 있든지 NumPy가 신경쓰지 않는다는 것입니다. 다음 코드에서 4개의 연산을 순차적으로 단계별로 실행하는 예제를 살펴볼 수 있습니다:
<span class="tooltiptext">
The beauty of this is that numpy does not care if <code class="language-plaintext highlighter-rouge">predictions</code> and <code class="language-plaintext highlighter-rouge">labels</code> contain one or a thousand values (as long as they’re both the same size). We can walk through an example stepping sequentially through the four operations in that line of code:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-mse-1.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>predictions 및 labels 벡터는 모두 3개의 값을 가지고 있습니다. 이 것은 n이 3임을 의미합니다. 뺄셈을 수행한 뒤, 다음과 같은 값이 나옵니다:
<span class="tooltiptext">
Both the predictions and labels vectors contain three values. Which means n has a value of three. After we carry out the subtraction, we end up with the values looking like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-mse-2.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>그런 다음 벡터에 있는 값들을 제곱합니다:
<span class="tooltiptext">
Then we can square the values in the vector:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-mse-3.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>이제 이 값들을 더합니다:
<span class="tooltiptext">
Now we sum these values:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-mse-4.png" />
<br />
</div>
<p><br /></p>
<div class="tooltip">
<p>해당 예측에 대한 에러 값이자 모델 품질에 대한 점수가 산출됩니다.
<span class="tooltiptext">
Which results in the error value for that prediction and a score for the quality of the model.
</span></p>
</div>
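이 단계들을 한 줄의 NumPy 코드로 스케치해보면 다음과 같습니다 (predictions와 labels는 임의로 정한 예시 값입니다):

```python
import numpy as np

predictions = np.array([1, 1, 1])   # 예시 값 (가정)
labels = np.array([1, 2, 3])        # 예시 값 (가정)

n = predictions.shape[0]            # n = 3
# 뺄셈 -> 제곱 -> 합 -> 1/n 곱: 평균 제곱 오차
error = (1 / n) * np.sum(np.square(predictions - labels))
print(error)   # (0 + 1 + 4) / 3 = 1.666...
```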
<h3 id="data-representation-데이터-표현">Data Representation (데이터 표현)</h3>
<div class="tooltip">
<p>모델을 처리하거나 빌드하는데 필요한 모든 데이터 유형에 대해 생각해보십시오 (스프레드시트, 이미지, 오디오 등). 그 중 대부분은 n-차원 배열로 표현하기에 완벽하게 적합합니다:
<span class="tooltiptext">
Think of all the data types you’ll need to crunch and build models around (spreadsheets, images, audio…etc). So many of them are perfectly suited for representation in an n-dimensional array:
</span></p>
</div>
<h4 id="tables-and-spreadsheets">Tables and Spreadsheets</h4>
<div class="tooltip">
<ul>
<li>스프레드시트나 테이블은 2차원 행렬입니다. 스프레드시트의 각 시트는 별도의 변수가 될 수 있습니다. 파이썬에서 이 것을 위한 가장 인기있는 추상화는 <a href="https://jalammar.github.io/gentle-visual-intro-to-data-analysis-python-pandas/">판다스 데이터프레임</a>이며, NumPy를 기반으로 그 위에 구축된 것입니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> A spreadsheet or a table of values is a two dimensional matrix. Each sheet in a spreadsheet can be its own variable. The most popular abstraction in python for those is the <a href="https://jalammar.github.io/gentle-visual-intro-to-data-analysis-python-pandas/">pandas dataframe</a>, which actually uses NumPy and builds on top of it.
</span></li>
</ul>
</div>
<div class="img-div-any-width">
<image src="/images/pandas-intro/0%20excel-to-pandas.png" />
<br />
</div>
<h4 id="audio-and-timeseries시계열">Audio and Timeseries(시계열)</h4>
<div class="tooltip">
<ul>
<li>오디오 파일은 오디오 샘플들의 1차원 배열입니다. 각 샘플들은 오디오 신호의 작은 조각을 숫자로 표현한 것 입니다. CD품질 오디오는 초당 44,100 샘플을 가지며 각 샘플은 -32768 ~ 32767 사이의 정수 값을 갖습니다. CD품질 10초 WAVE 파일을 가지고 있다면, 10 * 44,100 = 441,000 샘플 길이의 NumPy 배열로 로딩할 수 있습니다. 오디오의 첫 1초를 추출하고 싶으신가요? <code class="language-plaintext highlighter-rouge">audio</code>라고 명명할 파일을 NumPy 배열로 단순히 로드하고 <code class="language-plaintext highlighter-rouge">audio[:44100]</code>를 가져오면 됩니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> An audio file is a one-dimensional array of samples. Each sample is a number representing a tiny chunk of the audio signal. CD-quality audio may have 44,100 samples per second and each sample is an integer between -32768 and 32767. Meaning if you have a ten-second WAVE file of CD-quality, you can load it into a NumPy array with length 10 * 44,100 = 441,000 samples. Want to extract the first second of audio? Simply load the file into a NumPy array that we’ll call <code class="language-plaintext highlighter-rouge">audio</code>, and get <code class="language-plaintext highlighter-rouge">audio[:44100]</code>.
</span></li>
</ul>
</div>
<div class="tooltip">
<p>다음은 오디오 파일의 일부분입니다:
<span class="tooltiptext" style="display: inline-block; text-align: left;">
Here’s a look at a slice of an audio file:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-audio.png" />
<br />
</div>
<div class="tooltip">
<p>시계열 데이터도 동일합니다 (예를 들어, 시간에 걸친 주식 가격)
<span class="tooltiptext">
The same goes for time-series data (for example, the price of a stock over time).
</span></p>
</div>
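실제 WAVE 파일 대신 임의의 샘플로 채운 가상의(hypothetical) 10초 클립으로, 첫 1초를 슬라이싱하는 과정을 스케치해보면:

```python
import numpy as np

sample_rate = 44100                     # CD 품질: 초당 44,100 샘플
rng = np.random.RandomState(0)
# 실제 오디오 파일 대신 16비트 정수 범위의 난수로 채운 가상의 10초 클립
audio = rng.randint(-32768, 32768, size=10 * sample_rate)
first_second = audio[:sample_rate]      # 첫 1초만 슬라이싱
print(first_second.shape)               # (44100,)
```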
<h4 id="images">Images</h4>
<div class="tooltip">
<ul>
<li>
<p>이미지는 (높이 x 너비) 크기의 픽셀 행렬입니다.</p>
<ul>
<li>만약 이미지가 흑백(회색조라고도 함)이라면, 각 픽셀은 1개의 숫자로 표현됩니다(주로 0(검정), 255(흰색) 사이의 값). 이미지의 왼쪽 상단의 10 x 10 픽셀을 자르고 싶은가요? NumPy에게 <code class="language-plaintext highlighter-rouge">image[:10,:10]</code>라고 하시면 됩니다.</li>
</ul>
</li>
</ul>
<p><span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> An image is a matrix of pixels of size (height x width).
<br />
<span>*</span> If the image is black and white (a.k.a. grayscale), each pixel can be represented by a single number (commonly between 0 (black) and 255 (white)). Want to crop the top left 10 x 10 pixel part of the image? Just tell NumPy to get you <code class="language-plaintext highlighter-rouge">image[:10,:10]</code>.
</span></p>
</div>
<div class="tooltip">
<p>다음은 이미지 파일의 일부입니다:
<span class="tooltiptext">
Here’s a look at a slice of an image file:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/numpy/numpy-grayscale-image.png" />
<br />
</div>
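난수로 채운 가상의(hypothetical) 흑백 이미지로, 왼쪽 상단 10 x 10 픽셀을 자르는 과정을 스케치해보면:

```python
import numpy as np

rng = np.random.RandomState(0)
# 실제 이미지 파일 대신 0~255 값으로 채운 가상의 480x640 흑백 이미지
image = rng.randint(0, 256, size=(480, 640))
top_left = image[:10, :10]              # 왼쪽 상단 10 x 10 픽셀
print(top_left.shape)                   # (10, 10)
```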
<div class="tooltip">
<ul>
<li>만약 이미지가 컬러이미지라면, 각 픽셀은 3개의 숫자로 표현됩니다 - 빨강, 초록, 파랑 각각의 값. 이 경우에 우리는 3차원이 필요합니다 (각 셀이 한 숫자만 표현할 수 있기 때문에). 그래서 컬러이미지는 (높이 x 너비 x 3) 차원의 ndarray로 표현됩니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> If the image is colored, then each pixel is represented by three numbers - a value for each of red, green, and blue. In that case we need a 3rd dimension (because each cell can only contain one number). So a colored image is represented by an ndarray of dimensions: (height x width x 3).
</span></li>
</ul>
</div>
<div class="img-div-any-width" markdown="0">
<image src="/images/numpy/numpy-color-image.png"/>
<br />
</div>
<h4 id="language">Language</h4>
<div class="tooltip">
<p>만약 텍스트를 다룬다면, 얘기가 조금 달라집니다. 텍스트를 숫자로 표현하려면 어휘(vocab; 모델이 아는 모든 고유한 단어 목록)가 만들어져야 하고 <a href="https://jalammar.github.io/illustrated-word2vec/">임베딩 단계</a>가 있어야 합니다. 고대에 쓰여진 (번역된) 아래 인용구를 수치적으로 표현하는 단계를 살펴보겠습니다.
<span class="tooltiptext">
If we’re dealing with text, the story is a little different. The numeric representation of text requires a step of building a vocabulary (an inventory of all the unique words the model knows) and an <a href="https://jalammar.github.io/illustrated-word2vec/">embedding step</a>. Let us see the steps of numerically representing this (translated) quote by an ancient spirit:
</span></p>
</div>
<p>“Have the bards who preceded me left any theme unsung?”
(“내 이전 음유시인들이 노래를 부르지 않고 남겨둔 주제가 있었던가?”)</p>
<div class="tooltip">
<p>모델은 이 전사 시인의 불안의 말(단어들)을 숫자로 나타내기 전에 대량의 텍스트를 볼 필요가 있습니다. <a href="http://mattmahoney.net/dc/textdata.html">작은 데이터셋</a>을 처리하여, (71,290 단어의) 어휘를 구축하는데 사용할 수 있습니다:
<span class="tooltiptext">
A model needs to look at a large amount of text before it can numerically represent the anxious words of this warrior poet. We can proceed to have it process a <a href="http://mattmahoney.net/dc/textdata.html">small dataset</a> and use it to build a vocabulary (of 71,290 words):
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/numpy/numpy-nlp-vocabulary.png" />
<br />
</div>
<div class="tooltip">
<p>문장은 토큰(일반적으로 단어나 단어의 일부분)의 배열로 나뉩니다.
<span class="tooltiptext">
The sentence can then be broken into an array of tokens (words or parts of words based on common rules):
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/numpy/numpy-nlp-tokenization.png" />
<br />
</div>
<div class="tooltip">
<p>각 단어를 어휘 테이블의 id로 치환할 수 있습니다.
<span class="tooltiptext">
We then replace each word by its id in the vocabulary table:
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/numpy/numpy-nlp-ids.png" />
<br />
</div>
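<div class="tooltip">
<p>[추가설명] 어휘 구축, 토큰화, id 치환 단계를 아래처럼 스케치할 수 있습니다. 실제 어휘는 71,290 단어이지만, 여기서는 설명을 위해 작은 가상의 어휘를 사용합니다:
<span class="tooltiptext">
[Added] The vocabulary, tokenization, and id-replacement steps can be sketched as below. The real vocabulary has 71,290 words; here we use a tiny made-up one for illustration:
</span></p>
</div>

```python
# A tiny made-up vocabulary (the real one in the post has 71,290 words).
vocab = {"have": 0, "the": 1, "bards": 2, "who": 3, "preceded": 4,
         "me": 5, "left": 6, "any": 7, "theme": 8, "unsung": 9, "?": 10}

sentence = "Have the bards who preceded me left any theme unsung ?"

tokens = sentence.lower().split()      # break the sentence into tokens
ids = [vocab[t] for t in tokens]       # replace each token by its vocab id

print(ids)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```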
<div class="tooltip">
<p>이 id들은 여전히 모델에게 많은 정보 가치를 제공하지는 않습니다. 그래서 모델에게 단어의 시퀀스를 공급하기 전에, 토큰/워드는 임베딩으로 대체될 필요가 있습니다 (이 예에서는 50 차원 <a href="https://jalammar.github.io/illustrated-word2vec/">word2vec 임베딩</a>)
<span class="tooltiptext">
These ids still don’t provide much information value to a model. So before feeding a sequence of words to a model, the tokens/words need to be replaced with their embeddings (50 dimension <a href="https://jalammar.github.io/illustrated-word2vec/">word2vec embedding</a> in this case):
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/numpy/numpy-nlp-embeddings.png" />
<br />
</div>
<div class="tooltip">
<p>이 NumPy 배열이 [임베딩_차원 x 시퀀스_길이] 크기의 차원임을 알 수 있습니다. 실제로는 반대일 것이지만, 시각적 일관성을 위해 이렇게 표현하겠습니다. 성능상의 이유로, 딥러닝 모델은 첫번째 차원을 배치 사이즈로 남겨두는 경향이 있습니다 (여러 예제를 병렬로 훈련하면 모델을 더 빠르게 훈련할 수 있기 때문입니다). 이것은 <code class="language-plaintext highlighter-rouge">reshape()</code>이 매우 유용해지는 명백한 케이스입니다. 예를 들어, <a href="https://jalammar.github.io/illustrated-bert/">BERT</a>와 같은 모델은 [배치_사이즈, 시퀀스_길이, 임베딩_크기] shape의 입력을 예상합니다.
<span class="tooltiptext">
You can see that this NumPy array has the dimensions [embedding_dimension x sequence_length]. In practice these would be the other way around, but I’m presenting it this way for visual consistency. For performance reasons, deep learning models tend to preserve the first dimension for batch size (because the model can be trained faster if multiple examples are trained in parallel). This is a clear case where <code class="language-plaintext highlighter-rouge">reshape()</code> becomes super useful. A model like <a href="https://jalammar.github.io/illustrated-bert/">BERT</a>, for example, would expect its inputs in the shape: [batch_size, sequence_length, embedding_size].
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/numpy/numpy-nlp-bert-shape.png" />
<br />
</div>
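<div class="tooltip">
<p>[추가설명] id를 임베딩으로 치환하고 배치 차원을 추가하는 과정은 아래처럼 스케치할 수 있습니다. 임베딩 테이블의 값은 임의로 만든 가상의 값입니다:
<span class="tooltiptext">
[Added] Replacing ids by embeddings and adding a batch dimension can be sketched as below. The embedding table values are made up:
</span></p>
</div>

```python
import numpy as np

vocab_size, embedding_dim = 71290, 50
sequence_ids = np.arange(11)           # the ids from the step above

# A made-up embedding table: one 50-dimensional row per vocabulary word.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Indexing with the id array replaces each id by its embedding row.
embedded = embedding_table[sequence_ids]
print(embedded.shape)   # (11, 50) -> [sequence_length, embedding_dim]

# BERT-like models expect a leading batch dimension, so reshape:
batch = embedded.reshape(1, 11, embedding_dim)
print(batch.shape)      # (1, 11, 50) -> [batch_size, seq_length, emb_size]
```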
<div class="tooltip">
<p>이제 이것은 모델이 처리하여 유용한 작업을 수행할 수 있는 숫자 볼륨입니다. 다른 행들은 비워 두었지만, 모델이 학습(또는 예측)할 다른 예제들로 채워질 것입니다.
<span class="tooltiptext">
This is now a numeric volume that a model can crunch and do useful things with. I left the other rows empty, but they’d be filled with other examples for the model to train on (or predict).
</span></p>
</div>
<div class="tooltip">
<p>위의 예에서 <a href="https://en.wikisource.org/wiki/The_Poem_of_Antara">이 시인(Antara)의 시</a>는 어떤 다른 시들 보다 훨씬 더 불후의 명성을 얻었습니다. 아버지 소유의 노예로 태어난 Antarah(어머니가 포로/노예로 잡혀간 에티오피아 공주)는 용맹과 언어 구사력을 통해 자유를 얻었고, 그의 시가 이슬람 이전 아라비아의 <a href="https://en.wikipedia.org/wiki/Mu%27allaqat">카바(사우디아라비아 메카 소재 ‘하람 성원’의 중심에 위치)에 게시된 7개의 시</a> 중 하나로 신화적 지위를 얻었습니다.
<span class="tooltiptext">
(It turned out the <a href="https://en.wikisource.org/wiki/The_Poem_of_Antara">poet’s words</a> in our example were immortalized more so than those of the other poets which trigger his anxieties. Born a slave owned by his father, <a href="https://en.wikipedia.org/wiki/Antarah_ibn_Shaddad">Antarah’s</a> valor and command of language gained him his freedom and the mythical status of having his poem as one of <a href="https://en.wikipedia.org/wiki/Mu%27allaqat">seven poems suspended in the kaaba</a> in pre-Islamic Arabia).
</span></p>
</div>
<hr />
<h2 id="추가-정보">추가 정보<a href="#additional-info" name="additional-info">.</a></h2>
<ul>
<li>이 글은 NumPy와 데이터 표현에 대해 이해하기 쉽게 그림으로 설명한 포스팅을 저자인 Jay Alammar님의 허락을 받고 번역한 글 입니다. 원문은 <a href="https://jalammar.github.io/visual-numpy/">A Visual Intro to NumPy and Data Representation</a>에서 확인하실 수 있습니다.</li>
<li>원서/영문블로그를 보실 때 term에 대한 정보 호환을 위해, 이 분야에서 사용하고 있는 단어, 문구에 대해 가급적 번역하지 않고 원문 그대로 두었습니다. 그리고, 직역 보다는 개념이나 의미에 대한 설명을 쉽게 하는 문장 쪽으로 더 무게를 두어 번역 했습니다. 번역에 대한 의견이나 수정 사항은 아래 댓글 창에 남겨주세요.</li>
<li>번역문에 대응하는 영어 원문을 보고싶으신 분들을 위해 <a href="https://nlpinkorean.github.io">찬</a>님께서 만들어두신 툴팁 도움말 기능(해당 문단에 마우스를 올리면 (모바일의 경우 터치) 원문을 확인할 수 있는 기능)을 가져와서 적용했습니다. 감사합니다.</li>
</ul>chloamme이 글은 Jay Alammar님의 글을 번역한 글입니다. [추가정보] This post is a translated version of A Visual Intro to NumPy and Data Representation by Jay Alammar.[번역] GPT3가 작동하는 방법 - 시각화 및 동영상 설명2021-12-18T00:00:00+00:002021-12-18T00:00:00+00:00/2021/12/18/how-gpt3-works-visualizations-animations-korean<div class="tooltip">
<blockquote>
<p>이 글은 <a href="https://jalammar.github.io/how-gpt3-works-visualizations-animations/">Jay Alammar님의 글</a>을 번역한 글입니다. [<a href="#additional-info">추가정보</a>]
<span class="tooltiptext">
This post is a translated version of <a href="https://jalammar.github.io/how-gpt3-works-visualizations-animations/">How GPT3 Works - Visualizations and Animations</a> by Jay Alammar.
</span></p>
</blockquote>
</div>
<div class="tooltip">
<p>요즘(GPT3가 릴리즈 되었을 당시) 기술분야에서는 GPT3에 대한 <a href="https://www.theverge.com/21346343/gpt-3-explainer-openai-examples-errors-agi-potential">많은 이야기</a>를 합니다. (GPT3와 같은) 대규모 언어 모델들은 그 능력으로 우리를 놀래키기 시작했습니다. 대부분의 비즈니스에서 고객용으로 사용하기에 충분히 신뢰성(reliable)이 있다고 할 수는 없지만, 이 모델들은 자동화(automation)로의 꾸준한 전환과 지능형 컴퓨터 시스템의 가능성을 가속화할 영리함의 번뜩이는 순간들을 보여주고 있습니다.
<span class="tooltiptext">
The tech world is <a href="https://www.theverge.com/21346343/gpt-3-explainer-openai-examples-errors-agi-potential">abuzz</a> with GPT3 hype. Massive language models (like GPT3) are starting to surprise us with their abilities. While not yet completely reliable for most businesses to put in front of their customers, these models are showing sparks of cleverness that are sure to accelerate the march of automation and the possibilities of intelligent computer systems. Let’s remove the aura of mystery around GPT3 and learn how it’s trained and how it works.
</span></p>
</div>
<div class="tooltip">
<p>훈련된 모델은 텍스트를 생성합니다.
<span class="tooltiptext">
A trained language model generates text.
</span></p>
</div>
<div class="tooltip">
<p>선택적으로 일부 텍스트를 입력으로 넣을 수 있으며, 출력에 영향을 줍니다.
<span class="tooltiptext">
We can optionally pass it some text as input, which influences its output.
</span></p>
</div>
<div class="tooltip">
<p>대량의 텍스트를 읽는(scan하는) 훈련 기간동안 “학습된” 모델로 부터 출력이 생성됩니다.
<span class="tooltiptext">
The output is generated from what the model “learned” during its training period where it scanned vast amounts of text.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/01-gpt3-language-model-overview.gif" />
<br />
</div>
<!--more-->
<div class="tooltip">
<p>훈련은 모델에게 많은 텍스트를 노출시키는 일련의 과정입니다. 바로 그 GPT3 훈련 과정은 이미 완료되었습니다([추가설명] GPT-3가 공개된 바로 그 시점). (커뮤니티/소셜/유투브 등에서) 지금 보고 계신 모든 실험들은, 바로 그 훈련된 모델로부터 나온 것입니다. 훈련에는 GPU 355년(V100 기준)과 460만 달러($4.6m)가 소요된 것으로 추정됩니다.
<span class="tooltiptext">
Training is the process of exposing the model to lots of text. That process has been completed. All the experiments you see now are from that one trained model. It was estimated to cost 355 GPU years and cost $4.6m.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/02-gpt3-training-language-model.gif" />
<br />
</div>
<div class="tooltip">
<p>3000억개(300 billion)의 토큰으로 이루어진 텍스트 데이터셋이 모델을 위한 훈련 예제를 생성하는 데 사용됩니다. 예를 들어, 아래는 맨 위의 한 문장에서 생성된 3개의 훈련 예제입니다.
<span class="tooltiptext">
The dataset of 300 billion tokens of text is used to generate training examples for the model. For example, these are three training examples generated from the one sentence at the top.
</span></p>
</div>
<div class="tooltip">
<p>모든 텍스트에 대해 윈도우를 슬라이드(slide)하면서, 많은 예제들을 만들어내는 것을 볼 수 있습니다.
<span class="tooltiptext">
You can see how you can slide a window across all the text and make lots of examples.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/gpt3-training-examples-sliding-window.png" />
<br />
</div>
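<div class="tooltip">
<p>[추가설명] 위 슬라이딩 윈도우 과정은 아래처럼 스케치할 수 있습니다. 문장은 설명을 위해 임의로 고른 것입니다:
<span class="tooltiptext">
[Added] The sliding-window process above can be sketched as below. The sentence is an arbitrary example:
</span></p>
</div>

```python
def make_examples(tokens, window=3):
    """Slide a window across the tokens: the words inside the window are the
    features, and the word right after the window is the label."""
    examples = []
    for i in range(len(tokens) - window):
        examples.append((tokens[i:i + window], tokens[i + window]))
    return examples

tokens = ["a", "robot", "must", "obey", "the", "orders", "given", "it"]
for features, label in make_examples(tokens):
    print(features, "->", label)
```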
<div class="tooltip">
<p>모델을 예제와 함께 표현했습니다. 우리는 모델에게 feature만을 보여주고, 다음 단어(word)를 예측하도록 요청합니다.
<span class="tooltiptext">
The model is presented with an example. We only show it the features and ask it to predict the next word.
</span></p>
</div>
<div class="tooltip">
<p>모델의 예측이 틀리면, 우리는 모델의 예측의 error를 계산하고 model을 업데이트 합니다. 그래서 다음 번에 더 나은 예측을 만들도록 합니다.
<span class="tooltiptext">
The model’s prediction will be wrong. We calculate the error in its prediction and update the model so next time it makes a better prediction.
</span></p>
</div>
<div class="tooltip">
<p>이 것을 수백만번 반복합니다.
<span class="tooltiptext">
Repeat millions of times
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/03-gpt3-training-step-back-prop.gif" />
<br />
</div>
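<div class="tooltip">
<p>[추가설명] "예측 → 오차 계산 → 업데이트" 반복은, 파라미터가 1개뿐인 장난감 모델로 아래처럼 축소해볼 수 있습니다. 개념을 보여주기 위한 가상의 예제이며, 실제 GPT3와는 규모가 전혀 다릅니다:
<span class="tooltiptext">
[Added] The "predict, measure error, update" loop can be shrunk down to a one-parameter toy model as below. This is a made-up illustration; the real GPT3 is vastly larger:
</span></p>
</div>

```python
import numpy as np

# A deliberately tiny stand-in for "predict -> measure error -> update":
# a single-parameter model trained to map x -> 2x.
rng = np.random.default_rng(0)
w = rng.normal()                       # untrained model: a random parameter
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])

for step in range(200):                # "repeat millions of times" (here: 200)
    pred = w * xs                      # 1) predict
    error = pred - ys                  # 2) how wrong was the prediction?
    w -= 0.01 * (error * xs).mean()    # 3) update for a better next prediction

print(round(w, 2))  # -> 2.0
```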
<div class="tooltip">
<p>같은 단계를 좀 더 자세히 살펴봅시다.
<span class="tooltiptext">
Now let’s look at these same steps with a bit more detail.
</span></p>
</div>
<div class="tooltip">
<p>GPT3는 한번에 하나의 토큰을 출력으로 생성합니다 (지금은 토큰이 단어라고 가정합시다).
<span class="tooltiptext">
GPT3 actually generates output one token at a time (let’s assume a token is a word for now).
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/04-gpt3-generate-tokens-output.gif" />
<br />
</div>
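<div class="tooltip">
<p>[추가설명] 한번에 한 토큰씩 생성하고, 생성된 토큰을 다시 입력에 덧붙이는 구조는 아래처럼 스케치할 수 있습니다. <code class="language-plaintext highlighter-rouge">toy_model</code>은 훈련된 모델을 대신하는 가상의 함수입니다:
<span class="tooltiptext">
[Added] Generating one token at a time and appending it back to the input can be sketched as below. <code class="language-plaintext highlighter-rouge">toy_model</code> is a made-up stand-in for the trained model:
</span></p>
</div>

```python
# A sketch of token-by-token generation with a canned, made-up "model".
def toy_model(tokens):
    reply = ["Okay", "human", "."]
    produced = sum(1 for t in tokens if t in reply)
    return reply[produced]             # next token of the canned reply

tokens = ["Please", "obey", "my", "orders"]   # the input prompt
for _ in range(3):
    tokens.append(toy_model(tokens))   # predict one token, feed it back in

print(tokens[4:])   # ['Okay', 'human', '.']
```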
<div class="tooltip">
<p>강조: 이것은 GPT-3가 어떻게 동작하는지에 대한 설명이며, 무엇이 참신한가(주로, 엄청나게 큰 규모)에 대한 논의가 아닙니다. 여기서 설명할 아키텍처는 <a href="https://arxiv.org/pdf/1801.10198.pdf">Generating Wikipedia by Summarizing Long Sequences 논문</a>을 기반으로 하는 transformer decoder 모델입니다.
<span class="tooltiptext">
Please note: This is a description of how GPT-3 works and not a discussion of what is novel about it (which is mainly the ridiculously large scale). The architecture is a transformer decoder model based on this paper https://arxiv.org/pdf/1801.10198.pdf
</span></p>
</div>
<div class="tooltip">
<p>GPT3는 거.대.합니다. (파라미터라고 불리는) 1750억개 (175 billion)의 숫자를 훈련시켜 학습한 내용을 인코딩합니다. 이 숫자들은 각 실행(학습/평가/추론 등 모든 종류의 실행)에서 생성할 토큰을 계산하는데 사용됩니다.
<span class="tooltiptext">
GPT3 is MASSIVE. It encodes what it learns from training in 175 billion numbers (called parameters). These numbers are used to calculate which token to generate at each run.
</span></p>
</div>
<div class="tooltip">
<p>훈련되지 않은 모델은 임의의 파라미터 값을 갖습니다. 훈련은 더 나은 예측을 하게 만드는 값을 찾습니다.
<span class="tooltiptext">
The untrained model starts with random parameters. Training finds values that lead to better predictions.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/gpt3-parameters-weights.png" />
<br />
</div>
<div class="tooltip">
<p>이 숫자들은 모델 안의 수백개의 matrix들의 일부입니다. 예측은 대부분 많은 횟수의 matrix 곱셈입니다.
<span class="tooltiptext">
These numbers are part of hundreds of matrices inside the model. Prediction is mostly a lot of matrix multiplication.
</span></p>
</div>
<div class="tooltip">
<p>제 <a href="https://youtube.com/watch?v=mSTCzNgDJy4">AI 소개 유투브 영상</a>에서, 한 개의 파라미터를 갖는 단순한 ML 모델을 설명했습니다. 이 175B이라는 가공할 만한 모델을 이해(unpack)하는 데 좋은 출발점이 될 것입니다.
<span class="tooltiptext">
In my <a href="https://youtube.com/watch?v=mSTCzNgDJy4">Intro to AI on YouTube</a>, I showed a simple ML model with one parameter. A good start to unpack this 175B monstrosity.
</span></p>
</div>
<div class="tooltip">
<p>이러한 파라미터가 어떻게 분포되고 사용되는지 밝히기 위해, 우리는 모델 안을 열어보고 살펴볼 필요가 있습니다.
<span class="tooltiptext">
To shed light on how these parameters are distributed and used, we’ll need to open the model and look inside.
</span></p>
</div>
<div class="tooltip">
<p>GPT3는 2048 토큰을 입력으로 받아들일 수 있습니다. 그 것을 “컨텍스트 윈도우”라고 합니다. 이 것은, 토큰들이 처리될 2048개의 경로가 있음을 의미합니다.
<span class="tooltiptext">
GPT3 is 2048 tokens wide. That is its “context window”. That means it has 2048 tracks along which tokens are processed.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/05-gpt3-generate-output-context-window.gif" />
<br />
</div>
<div class="tooltip">
<p>보라색 경로를 따라가 봅시다. 시스템은 어떻게 “robotics”를 처리하고 “A”를 생성할까요?
<span class="tooltiptext">
Let’s follow the purple track. How does a system process the word “robotics” and produce “A”?
</span></p>
</div>
<div class="tooltip">
<p>상위레벨 단계:
<span class="tooltiptext">
High-level steps:
</span></p>
</div>
<div class="tooltip">
<ol>
<li>단어를 <a href="https://jalammar.github.io/illustrated-word2vec/">단어를 표현(representing)하는 vector(숫자의 나열/리스트)</a>로 변환합니다.</li>
<li>예측을 수행합니다.</li>
<li>vector를 단어로 변환합니다.
<span class="tooltiptext">
1. Convert the word to <a href="https://jalammar.github.io/illustrated-word2vec/">a vector (list of numbers) representing the word</a><br />
2. Compute prediction<br />
3. Convert resulting vector to word
</span></li>
</ol>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/06-gpt3-embedding.gif" />
<br />
</div>
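<div class="tooltip">
<p>[추가설명] 위 3단계(단어→벡터, 예측, 벡터→단어)를 아주 단순화하면 아래와 같습니다. 가운데 '예측' 단계는 실제로는 96개의 transformer decoder 레이어지만, 여기서는 자리만 잡아두는 가상의 계산으로 대체했으며, 가중치도 임의의 값입니다:
<span class="tooltiptext">
[Added] A heavily simplified sketch of the three steps above (word→vector, prediction, vector→word). The middle step is really 96 transformer decoder layers; here it is a made-up placeholder, and the weights are random:
</span></p>
</div>

```python
import numpy as np

vocab_size, d_model = 10, 4
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(vocab_size, d_model))   # made-up embedding weights

def lm_step(token_id):
    x = W_embed[token_id]             # 1) word -> vector
    h = np.tanh(x)                    # 2) compute prediction (placeholder for
                                      #    the 96 decoder layers)
    logits = h @ W_embed.T            # 3) vector -> a score per vocab word
    return int(np.argmax(logits))     # the highest-scoring word id

next_id = lm_step(3)
print(0 <= next_id < vocab_size)      # True
```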
<div class="tooltip">
<p>GPT3의 중요한 계산은 96개의 transformer decoder 레이어의 층(stack) 안에서 일어납니다.
<span class="tooltiptext">
The important calculations of the GPT3 occur inside its stack of 96 transformer decoder layers.
</span></p>
</div>
<div class="tooltip">
<p>모든 레이어가 보이죠? 이 것이 “딥러닝”의 “깊이(depth)” 입니다.
<span class="tooltiptext">
See all these layers? This is the “depth” in “deep learning”.
</span></p>
</div>
<div class="tooltip">
<p>이 레이어들 각각이, 계산을 위한 각자의 18억개(1.8B)의 파라미터를 가지고 있습니다. 여기가 바로 “마법”이 일어나는 곳 입니다. 이 과정의 상위 레벨 모습은 아래와 같습니다:
<span class="tooltiptext">
Each of these layers has its own 1.8B parameter to make its calculations. That is where the “magic” happens. This is a high-level view of that process:
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/07-gpt3-processing-transformer-blocks.gif" />
<br />
</div>
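<div class="tooltip">
<p>[추가설명] 위 숫자들의 간단한 일관성 확인: 전체 1750억개의 파라미터를 96개 레이어로 나누면 레이어당 약 18억개가 됩니다:
<span class="tooltiptext">
[Added] A quick consistency check on the numbers above: 175 billion parameters divided over 96 layers is roughly 1.8 billion per layer:
</span></p>
</div>

```python
total_parameters = 175_000_000_000     # 175B parameters in total
layers = 96                            # transformer decoder layers

per_layer = total_parameters / layers
print(round(per_layer / 1e9, 2))       # -> 1.82 (the "1.8B per layer" above)
```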
<div class="tooltip">
<p>decoder 내부의 모든 것에 대한 자세한 설명은 제 블로그 포스트 <a href="https://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT2</a>에서 보실 수 있습니다.
<span class="tooltiptext">
You can see a detailed explanation of everything inside the decoder in my blog post <a href="https://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT2</a>.
</span></p>
</div>
<div class="tooltip">
<p>GPT3와의 차이점은 dense와 <a href="https://arxiv.org/pdf/1904.10509.pdf">sparse self-attention layers</a>가 번갈아가며 구성되어 있다는 것입니다.
<span class="tooltiptext">
The difference with GPT3 is the alternating dense and <a href="https://arxiv.org/pdf/1904.10509.pdf">sparse self-attention layers</a>.
</span></p>
</div>
<div class="tooltip">
<p>이 것은 입력의 엑스레이 이미지이며, GPT3의 응답(“Okay human”)입니다. 모든 토큰이 전체 레이어 층을 따라 흐르는 것에 주목하세요. 우리는 앞쪽의 단어들의 출력에 신경쓰지 않습니다. 입력이 완료되었을 때, 출력에 신경쓰기 시작합니다. 우리는 (출력으로 나온) 모든 단어를 모델에 다시 공급(feed)합니다.
<span class="tooltiptext">
This is an X-ray of an input and response (“Okay human”) within GPT3. Notice how every token flows through the entire layer stack. We don’t care about the output of the first words. When the input is done, we start caring about the output. We feed every word back into the model.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/08-gpt3-tokens-transformer-blocks.gif" />
<br />
</div>
<div class="tooltip">
<p><a href="https://twitter.com/sharifshameem/status/1284421499915403264">리액트 코드 생성 예제</a>에서는, 설명이 몇 가지 설명=>코드 예제와 함께 입력 프롬프트(녹색)가 될 것이라 생각합니다. 그리고 리액트 코드는 이 그림의 분홍색 토큰들처럼, 토큰 단위로 생성될 것입니다.
<span class="tooltiptext">
In the <a href="https://twitter.com/sharifshameem/status/1284421499915403264">React code generation example</a>, the description would be the input prompt (in green), in addition to a couple of examples of description=>code, I believe. And the react code would be generated like the pink tokens here token after token.
</span></p>
</div>
<div class="tooltip">
<p>제 가정은, 마중물(priming) 예제와 설명이 예제와 결과를 구분하는 특정 토큰과 함께 입력으로 추가된다는 것 입니다. 그런 다음 모델로 공급(feed) 됩니다.
<span class="tooltiptext">
My assumption is that the priming examples and the description are appended as input, with specific tokens separating examples and the results. Then fed into the model.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/09-gpt3-generating-react-code-example.gif" />
<br />
</div>
<div class="tooltip">
<p>이렇게 동작한다는 것 자체가 인상적입니다. GPT3를 위한 fine-tuning은 아직 나오지도 않았기 때문에, 앞으로의 가능성은 더욱 놀라울 것입니다.
<span class="tooltiptext">
It’s impressive that this works like this. Because you just wait until fine-tuning is rolled out for the GPT3. The possibilities will be even more amazing.
</span></p>
</div>
<div class="tooltip">
<p>Fine-tuning은 모델의 weight를 실제로 업데이트 해서 특정 태스크에 대한 모델(의 성능)을 더 좋게 만듭니다.
<span class="tooltiptext">
Fine-tuning actually updates the model’s weights to make the model better at a certain task.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt3/10-gpt3-fine-tuning.gif" />
<br />
</div>
<hr />
<h2 id="추가-정보">추가 정보<a href="#additional-info" name="additional-info">.</a></h2>
<ul>
<li>이 글은 GPT3에 대해 이해하기 쉽게 그림으로 설명한 포스팅을 저자인 Jay Alammar님의 허락을 받고 번역한 글 입니다. 원문은 <a href="https://jalammar.github.io/how-gpt3-works-visualizations-animations/">How GPT3 Works - Visualizations and Animations</a>에서 확인하실 수 있습니다.</li>
<li>원서/영문블로그를 보실 때 term에 대한 정보 호환을 위해, 이 분야에서 사용하고 있는 단어, 문구에 대해 가급적 번역하지 않고 원문 그대로 두었습니다. 그리고, 직역 보다는 개념이나 의미에 대한 설명을 쉽게 하는 문장 쪽으로 더 무게를 두어 번역 했습니다. 번역에 대한 의견이나 수정 사항은 아래 댓글 창에 남겨주세요.</li>
<li>번역문에 대응하는 영어 원문을 보고싶으신 분들을 위해 <a href="https://nlpinkorean.github.io">찬</a>님께서 만들어두신 툴팁 도움말 기능(해당 문단에 마우스를 올리면 (모바일의 경우 터치) 원문을 확인할 수 있는 기능)을 가져와서 적용했습니다. 감사합니다.</li>
</ul>chloamme이 글은 Jay Alammar님의 글을 번역한 글입니다. [추가정보] This post is a translated version of How GPT3 Works - Visualizations and Animations by Jay Alammar.[번역] 그림으로 설명하는 GPT-2 (Transformer Language Model 시각화)2021-12-08T00:00:00+00:002021-12-08T00:00:00+00:00/2021/12/08/illustrated-gpt2-korean<div class="tooltip">
<blockquote>
<p>이 글은 <a href="https://jalammar.github.io/illustrated-gpt2/">Jay Alammar님의 글</a>을 번역한 글입니다. [<a href="#additional-info">추가정보</a>]
<span class="tooltiptext">
This post is a translated version of <a href="https://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT-2 (Visualizing Transformer Language Models)</a> by Jay Alammar.
</span></p>
</blockquote>
</div>
<div class="img-div-any-width">
<img src="/images/gpt2/openAI-GPT-2-3.png" />
<br />
</div>
<div class="tooltip">
<p>올 해, 우리는 눈부시게 빛나는 머신러닝 어플리케이션을 보았습니다. <a href="https://openai.com/blog/better-language-models/">OpenAI의 GPT-2</a>는 조리있고 강렬한 에세이들을 써내는 엄청난 능력을 보여주었습니다. 우리가 현재의 language model들이 만들어낼 것으로 기대하는 수준 이상이었습니다. GPT-2는 특별히 새로운 아키텍처는 아닙니다 – GPT-2의 아키텍처는 decoder로만 구성된 transformer와 매우 유사합니다. 하지만 GPT-2는 방대한 양의 dataset으로 훈련된, transformer 기반의 매우 큰 language model입니다. 이번 글에서, 이 모델이 이러한 결과를 만들어낼 수 있게 한 아키텍처를 알아보고자 합니다. self-attention 레이어를 깊이 있게 살펴보고, language model 그 이상의 decoder-only transformer를 위한 어플리케이션들을 살펴보도록 하겠습니다.
<span class="tooltiptext">
This year, we saw a dazzling application of machine learning. <a href="https://openai.com/blog/better-language-models/">The OpenAI GPT-2</a> exhibited impressive ability of writing coherent and passionate essays that exceed what we anticipated current language models are able to produce. The GPT-2 wasn’t a particularly novel architecture – it’s architecture is very similar to the decoder-only transformer. The GPT2 was, however, a very large, transformer-based language model trained on a massive dataset. In this post, we’ll look at the architecture that enabled the model to produce its results. We will go into the depths of its self-attention layer. And then we’ll look at applications for the decoder-only transformer beyond language modeling.
</span></p>
</div>
<div class="tooltip">
<p>저의 이번 목표는, 이전 글인 <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>에 더 많은 시각적 설명을 더하여 transformer의 내부 동작 원리를 설명하고, 최초의 논문으로 부터 어떻게 발전되어 왔는지에 대해 설명하는 것 입니다. 이러한 시각적 설명을 통해, 내부 동작 방식이 계속 진화되고 있는 transformer 기반의 후속 모델들이 더 쉽게 설명이 되었으면 하는 바람이 있습니다.
<span class="tooltiptext">
My goal here is to also supplement my earlier post, <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>, with more visuals explaining the inner-workings of transformers, and how they’ve evolved since the original paper. My hope is that this visual language will hopefully make it easier to explain later Transformer-based models as their inner-workings continue to evolve.
</span></p>
</div>
<!--more-->
<div style="font-size:75%; background-color:#eee; border: 1px solid #bbb; display: table; padding: 7px">
<div style="text-align:center">
<p><strong>목차</strong></p>
</div>
<ul>
<li><strong><a href="#part-1-got-and-language-modeling">파트 1: GPT2와 Language Modeling</a></strong>
<ul>
<li>Language Model이란</li>
<li>Language Modeling을 위한 Transformer</li>
<li>BERT와 한가지 차이점</li>
<li>Transformer block의 진화</li>
<li>GPT-2의 내부를 살펴보기</li>
<li>더 깊이 알아보기</li>
<li>파트 1의 마무리: 몇가지 안내사항</li>
</ul>
</li>
<li><strong><a href="#part-2-illustrated-self-attention">파트 2: 그림으로 설명하는 Self-Attention</a></strong>
<ul>
<li>(Masking 없는) Self-Attention</li>
<li>1- Query, Key, Value 벡터 생성</li>
<li>2- Score 계산</li>
<li>3- 전체 합산</li>
<li>그림으로 설명하는 Masked Self-Attention</li>
<li>GPT-2의 Masked Self-Attention</li>
<li>드디어 해냈습니다! ‘It’을 만들어냈습니다!</li>
</ul>
</li>
<li><strong><a href="#part-3-beyond-language-modeling">파트 3: Language Modeling, 그 이상의 것</a></strong>
<ul>
<li>기계 번역(Machine Translation)</li>
<li>요약(Summarization)</li>
<li>전이 학습(Transfer Learning)</li>
<li>음악 생성(Music Generation)</li>
</ul>
</li>
</ul>
</div>
<!-- <div class="tooltip" markdown="1">
<span class="tooltiptext">
* **[Part 1: GPT2 And Language Modeling](#part-1-got-and-language-modeling)**
* What is a Language Model
* Transformers for Language Modeling
* One Difference From BERT
* The Evolution of The Transformer Block
* Crash Course in Brain Surgery: Looking Inside GPT-2
* A Deeper Look Inside
* End of part #1: The GPT-2, Ladies and Gentlemen
* **[Part 2: The Illustrated Self-Attention](#part-2-illustrated-self-attention)**
* Self-Attention (without masking)
* 1- Create Query, Key, and Value Vectors
* 2- Score
* 3- Sum
* The Illustrated Masked Self-Attention
* GPT-2 Masked Self-Attention
* Beyond Language modeling
* You've Made it!
* **[Part 3: Beyond Language Modeling](#part-3-beyond-language-modeling)**
* Machine Translation
* Summarization
* Transfer Learning
* Music Generation
</span>
</div> -->
<h2 id="파트-1-gpt2와-language-modeling-">파트 #1: GPT2와 Language Modeling <a href="#part-1-got-and-language-modeling" name="part-1-got-and-language-modeling">#</a></h2>
<div class="tooltip">
<p>그래서, language model이 정확히 무엇일까요?
<span class="tooltiptext">
So what exactly is a language model?
</span></p>
</div>
<h3 id="language-model-이란">Language Model 이란</h3>
<div class="tooltip">
<p>이전 글인 <a href="https://jalammar.github.io/illustrated-word2vec/">The Illustrated Word2vec</a>(<a href="https://databreak.netlify.app/2019-04-25-Illustrated_word2vec/">한국어 번역본</a>)에서 language model이 무엇인지 살펴보았습니다 – 기본적으로는, 문장의 일부를 보고 다음 단어를 예측하는 것을 할 수 있는 머신 러닝 모델입니다. 가장 유명한 language model로는 현재까지 입력된 것을 보고 다음 단어를 제안하는 스마트폰 키보드가 있습니다.
<span class="tooltiptext">
In <a href="https://jalammar.github.io/illustrated-word2vec/">The Illustrated Word2vec</a>, we’ve looked at what a language model is – basically a machine learning model that is able to look at part of a sentence and predict the next word. The most famous language models are smartphone keyboards that suggest the next word based on what you’ve currently typed.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/word2vec/swiftkey-keyboard.png" />
<br />
</div>
<div class="tooltip">
<p>이런 점에서는 GPT-2가 기본적으로 키보드 앱의 다음단어 예측 기능과 동일하다고 말할 수도 있겠지만, 키보드 앱 이상으로 훨씬 더 크고 더욱 복잡합니다. GPT-2는 OpenAI에서 연구 노력의 일환으로 인터넷을 크롤링한 대량의 dataset(40GB)인 WebText으로 학습 되었습니다. 스토리지 크기 측면으로 비교를 해본다면, 제가 사용하고 있는 키보드 앱인 SwiftKey가 78MB의 공간을 차지합니다. 훈련된 GPT-2 중에서 가장 작은 것의 경우에, 모든 파라미터를 저장하기 위해 500MB의 저장공간을 차지합니다. 가장 큰 GPT-2의 경우에는 사이즈가 13배 크기 때문에, 6.5GB 이상의 저장 공간을 차지할 수 있습니다.
<span class="tooltiptext">
In this sense, we can say that the GPT-2 is basically the next word prediction feature of a keyboard app, but one that is much larger and more sophisticated than what your phone has. The GPT-2 was trained on a massive 40GB dataset called WebText that the OpenAI researchers crawled from the internet as part of the research effort. To compare in terms of storage size, the keyboard app I use, SwiftKey, takes up 78MBs of space. The smallest variant of the trained GPT-2, takes up 500MBs of storage to store all of its parameters. The largest GPT-2 variant is 13 times the size so it could take up more than 6.5 GBs of storage space.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt2/gpt2-sizes.png" />
<br />
</div>
<div class="tooltip">
<p>GPT-2를 시험해보는 가장 좋은 방법은 <a href="https://gpt2.apps.allenai.org/?text=Joel%20is">AllenAI의 GPT-2 Explorer</a>를 이용하는 것 입니다. GPT-2를 사용하여, (확률 점수와 함께) 다음 단어로 가능한 10개의 예측을 표시해줍니다. 그 것들 중에서 한 단어를 선택한 뒤, 다시 예측된 리스트를 보고, 계속해서 글을 작성해나갈 수 있습니다.
<span class="tooltiptext">
One great way to experiment with GPT-2 is using the <a href="https://gpt2.apps.allenai.org/?text=Joel%20is">AllenAI GPT-2 Explorer</a>. It uses GPT-2 to display ten possible predictions for the next word (alongside their probability score). You can select a word then see the next list of predictions to continue writing the passage.
</span></p>
</div>
<h3 id="language-modeling을-위한-transformer">Language Modeling을 위한 Transformer</h3>
<div class="tooltip">
<p><a href="https://jalammar.github.io/illustrated-transformer/">Illustrated Transformer</a>에서 본 것과 같이, 최초의 transformer 모델은 encoder와 decoder로 구성되어 있습니다 – 그 각각은 우리가 transformer 블록(block) 이라고 부르는 것들을 쌓아놓은 것(stacking) 입니다. 이 아키텍처는 원래 기계 번역 용도로 적합했었습니다 – encoder-decoder 아키텍처가 과거에는 성공적이었습니다.
<span class="tooltiptext">
As we’ve seen in The <a href="https://jalammar.github.io/illustrated-transformer/">Illustrated Transformer</a>, the original transformer model is made up of an encoder and decoder – each is a stack of what we can call transformer blocks. That architecture was appropriate because the model tackled machine translation – a problem where encoder-decoder architectures have been successful in the past.
</span></p>
</div>
<div class="img-div">
<img src="/images/xlnet/transformer-encoder-decoder.png" />
<br />
</div>
<div class="tooltip">
<p>많은 후속 연구들에서는 architecture에서 encoder 또는 decoder 중 하나를 없애고, 하나의 transformer 블록층(block stack)만을 사용합니다 – block들을 현실적으로 가능한 한 높이 쌓아 올리고, 대량의 텍스트들을 학습에 이용하고, 엄청난 양의 컴퓨팅을 투하합니다. (이러한 language model들을 학습하는데 수십만 달러, <a href="https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/">AlphaStar</a>의 경우 수백만 달러 소요)
<span class="tooltiptext">
A lot of the subsequent research work saw the architecture shed either the encoder or decoder, and use just one stack of transformer blocks – stacking them up as high as practically possible, feeding them massive amounts of training text, and throwing vast amounts of compute at them (hundreds of thousands of dollars to train some of these language models, likely millions in the case of <a href="https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/">AlphaStar</a>).
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt2/gpt-2-transformer-xl-bert-3.png" />
<br />
</div>
<div class="tooltip">
<p>얼마나 이 block들을 높이 쌓을 수 있을까요? 이것이 GPT2 모델 크기를 결정짓는 주요 구별 요소임이 밝혀졌습니다.
<span class="tooltiptext">
How high can we stack up these blocks? It turns out that’s one of the main distinguishing factors between the different GPT2 model sizes:
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/gpt2/gpt2-sizes-hyperparameters-3.png" />
<br />
</div>
<h3 id="bert와-한가지-차이점">BERT와 한가지 차이점</h3>
<blockquote class="subtle">
<strong>First Law of Robotics</strong><br />
A robot may not injure a human being or, through inaction, allow a human being to come to harm.
<br />
(로보틱스 제1원칙: 로봇은 인간에 해를 가하거나, 혹은 행동을 하지 않음으로써 인간에게 해가 가도록 해서는 안 된다.)
</blockquote>
<div class="tooltip">
<p>GPT-2는 transformer의 decoder 블럭으로 구성됩니다. 반대로 BERT는, transformer의 encoder 블럭을 사용합니다. 다음 섹션에서 이 차이를 살펴보겠습니다. 하지만, 이 둘 간의 주요한 차이는 GPT2가 다른 전통적인 language model들 처럼 한번에 하나의 token을 출력한다는 것 입니다. 잘 훈련된 GPT-2가 로보틱스의 제 1원칙을 출력하도록 해봅시다.
<span class="tooltiptext">
The GPT-2 is built using transformer decoder blocks. BERT, on the other hand, uses transformer encoder blocks. We will examine the difference in a following section. But one key difference between the two is that GPT2, like traditional language models, outputs one token at a time. Let’s for example prompt a well-trained GPT-2 to recite the first law of robotics:
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/xlnet/gpt-2-output.gif" />
<br />
</div>
<div class="tooltip">
<p>이러한 모델들은 각 token이 생성된 후에 입력 시퀀스에 더해지는 방식으로 동작합니다. 그러한 새 시퀀스는 다음 단계에서 모델의 입력으로 들어갑니다. 이 것을 “auto-regression”이라고 부릅니다. 이 것은 <a href="https://karpathy.github.io/2015/05/21/rnn-effectiveness/">RNN을 엄청나게 효과적으로 만든</a> 방법 중 하나 입니다.
<span class="tooltiptext">
The way these models actually work is that after each token is produced, that token is added to the sequence of inputs. And that new sequence becomes the input to the model in its next step. This is an idea called “auto-regression”. This is one of the ideas that <a href="https://karpathy.github.io/2015/05/21/rnn-effectiveness/">made RNNs unreasonably effective</a>.
</span></p>
</div>
<div class="img-div-any-width">
<img src="/images/xlnet/gpt-2-autoregression-2.gif" />
<br />
</div>
<div class="tooltip">
<p>GPT2와 TransformerXL, XLNet과 같은 후속 모델들은 본질적으로 auto-regressive입니다. BERT는 그렇지 않습니다. 여기에는 상충관계(trade off)가 있습니다. auto-regression 특성을 잃는 대신, BERT는 더 좋은 결과를 내기 위해 단어의 양쪽 방향으로부터의 context를 활용할 수 있는 능력을 얻었습니다. XLNet은 autoregression을 되살리면서, 양쪽 방향의 context를 활용하는 대안적 방법을 찾았습니다.
<span class="tooltiptext">
The GPT2, and some later models like TransformerXL and XLNet are auto-regressive in nature. BERT is not. That is a trade off. In losing auto-regression, BERT gained the ability to incorporate the context on both sides of a word to gain better results. XLNet brings back autoregression while finding an alternative way to incorporate the context on both sides.
</span></p>
</div>
<h3 id="transformer-block의-진화">Transformer block의 진화</h3>
<div class="tooltip">
<p><a href="https://arxiv.org/abs/1706.03762">최초의 transformer 논문(Attention Is All You Need)</a>에서는 두가지 타입의 transformer block에 대해 소개합니다.
<span class="tooltiptext">
The <a href="https://arxiv.org/abs/1706.03762">initial transformer paper</a> introduced two types of transformer blocks:
</span></p>
</div>
<h4 id="encoder-block">Encoder block</h4>
<div class="tooltip">
<p>먼저 encoder block입니다:
<span class="tooltiptext">
First is the encoder block:
</span></p>
</div>
<div class="img-div">
<img src="/images/xlnet/transformer-encoder-block-2.png" />
<br />
<div class="tooltip">
<p>최초의 transformer 논문의 encoder block은 특정 최대 시퀀스 길이(예: 512개의 token)까지의 입력을 받을 수 있습니다. 입력 시퀀스가 이 제한보다 짧아도 괜찮습니다. 시퀀스의 나머지 부분을 패딩(padding)으로 채우면 됩니다.
<span class="tooltiptext">
An encoder block from the original transformer paper can take inputs up until a certain max sequence length (e.g. 512 tokens). It’s okay if an input sequence is shorter than this limit, we can just pad the rest of the sequence.
</span></p>
</div>
</div>
<h4 id="decoder-block">Decoder block</h4>
<div class="tooltip">
<p>두번째로, encoder block의 아키텍처를 살짝 변형한 decoder block이 있습니다 – encoder로 부터의 특정 segment에 attention을 줄 수 있도록 하는 레이어가 있습니다.
<span class="tooltiptext">
Second, there’s the decoder block which has a small architectural variation from the encoder block – a layer to allow it to pay attention to specific segments from the encoder:
</span></p>
</div>
<div class="img-div">
<image src="/images/xlnet/transformer-decoder-block-2.png" />
<br />
</div>
<div class="tooltip">
<p>decoder의 self-attention 레이어의 주요 차이점은, 앞으로 나올 token들을 masking한다는 것입니다 – BERT처럼 word를 [mask]로 치환하는 것이 아니라, self-attention 계산 시에 계산 대상 위치의 오른쪽에 있는 token들로부터의 정보를 차단하는 방식입니다.
<span class="tooltiptext">
One key difference in the self-attention layer here, is that it masks future tokens – not by changing the word to [mask] like BERT, but by interfering in the self-attention calculation blocking information from tokens that are to the right of the position being calculated.
</span></p>
</div>
<div class="tooltip">
<p>예를 들어, #4번의 경로를 볼 때, 현재와 이전 token들만이 attention 됨을 알 수 있습니다.
<span class="tooltiptext">
If, for example, we’re to highlight the path of position #4, we can see that it is only allowed to attend to the present and previous tokens:
</span></p>
</div>
<div class="img-div">
<image src="/images/xlnet/transformer-decoder-block-self-attention-2.png" />
<br />
</div>
<div class="tooltip">
<p>(BERT가 사용하는) self-attention과 (GPT-2가 사용하는) masked self-attention의 차이를 명확히 구분하는 것이 중요합니다. 일반 self-attention block은 각 위치가 자신보다 오른쪽에 있는 token들을 엿볼 수 있도록 허용합니다. masked self-attention은 이를 막습니다:
<span class="tooltiptext">
It’s important that the distinction between self-attention (what BERT uses) and masked self-attention (what GPT-2 uses) is clear. A normal self-attention block allows a position to peek at tokens to its right. Masked self-attention prevents that from happening:
</span></p>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/self-attention-and-masked-self-attention.png" />
<br />
</div>
<h4 id="the-decoder-only-block">The Decoder-Only Block</h4>
<div class="tooltip">
<p>최초의 논문에 이어, <a href="https://arxiv.org/pdf/1801.10198.pdf">Generating Wikipedia by Summarizing Long Sequences 논문</a>에서는 language modeling이 가능한, 다른 배열의 transformer block을 제안했습니다. 이 모델은 transformer의 encoder를 버렸습니다. 그러한 이유로, 이 모델을 “Transformer-Decoder”라고 부르겠습니다. 이 초창기의 transformer 기반의 language model은 6개의 transformer decoder block을 쌓아 구성했습니다:
<span class="tooltiptext">
Subsequent to the original paper, <a href="https://arxiv.org/pdf/1801.10198.pdf">Generating Wikipedia by Summarizing Long Sequences</a> proposed another arrangement of the transformer block that is capable of doing language modeling. This model threw away the Transformer encoder. For that reason, let’s call the model the “Transformer-Decoder”. This early transformer-based language model was made up of a stack of six transformer decoder blocks:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/xlnet/transformer-decoder-intro.png" />
<br />
<div class="tooltip">
<p>이 decoder block들은 동일합니다. 첫번째 block을 확대해 놓았으므로, self-attention 레이어가 masked 버전임을 볼 수 있습니다. 이 모델이 이제 특정 segment에서 최대 4,000개의 token까지 참조할 수 있다는 점에 주목하세요 – 최초의 transformer에서 512개였던 것에 비하면 엄청난 업그레이드입니다.
<span class="tooltiptext">
The decoder blocks are identical. I have expanded the first one so you can see its self-attention layer is the masked variant. Notice that the model now can address up to 4,000 tokens in a certain segment – a massive upgrade from the 512 in the original transformer.
</span></p>
</div>
</div>
<div class="tooltip">
<p>이 block들은, 두번째 self-attention layer를 제거한 것을 제외하면, 최초의 decoder block들과 매우 유사합니다. 비슷한 아키텍처가 <a href="https://arxiv.org/pdf/1808.04444.pdf">Character-Level Language Modeling with Deeper Self-Attention 논문</a>에서, 한번에 하나의 문자(letter/character)를 예측하는 language model을 만들기 위해 실험되었습니다.
<span class="tooltiptext">
These blocks were very similar to the original decoder blocks, except they did away with that second self-attention layer. A similar architecture was examined in <a href="https://arxiv.org/pdf/1808.04444.pdf">Character-Level Language Modeling with Deeper Self-Attention</a> to create a language model that predicts one letter/character at a time.
</span></p>
</div>
<div class="tooltip">
<p>OpenAI의 GPT-2 모델은 이러한 decoder만으로 구성된 block들(decoder-only blocks)을 사용합니다.
<span class="tooltiptext">
The OpenAI GPT-2 model uses these decoder-only blocks.
</span></p>
</div>
<h3 id="gpt-2의-내부를-살펴보기">GPT-2의 내부를 살펴보기</h3>
<blockquote class="subtle">
<p>Look inside and you will see,
The words are cutting deep inside my brain.
Thunder burning, quickly burning,
Knife of words is driving me insane, insane yeah.
~<strong><a href="https://en.wikipedia.org/wiki/Budgie_(band)">Budgie</a></strong>
<br />
(Budgie의 노래 “Crash Course in Brain Surgery” 중에서)</p>
</blockquote>
<div class="tooltip">
<p>훈련된 GPT-2를 우리의 수술대 위에 올려놓고, 어떻게 동작하는지 살펴봅시다.
<span class="tooltiptext">
Let’s lay a trained GPT-2 on our surgery table and look at how it works.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt-2-layers-2.png" />
<br />
<div class="tooltip">
<p>GPT-2는 1024개의 token을 처리할 수 있습니다. 각 token은 각자의 경로를 따라서 모든 decoder block으로 흘러갑니다.
<span class="tooltiptext">
The GPT-2 can process 1024 tokens. Each token flows through all the decoder blocks along its own path.
</span></p>
</div>
</div>
<div class="tooltip">
<p>훈련된 GPT-2를 돌려보는 가장 간단한 방법은 모델이 스스로 램블링(rambling; 두서없이 이어서 말하기)하도록 두는 것입니다 (기술적 용어로는 <em>generating unconditional samples</em>) – 또는, prompt를 주어 특정 주제에 대해 말하게 할 수도 있습니다 (<em>interactive conditional samples</em> 생성이라고도 합니다). rambling의 경우, 간단히 start token을 넣어주면 모델이 단어들을 생성하기 시작합니다 (훈련된 모델은 <code class="language-plaintext highlighter-rouge"><|endoftext|></code>를 start token으로 사용하지만, 여기서는 <code class="language-plaintext highlighter-rouge"><s></code>라고 부르겠습니다).
<span class="tooltiptext">
The simplest way to run a trained GPT-2 is to allow it to ramble on its own (which is technically called <em>generating unconditional samples</em>) – alternatively, we can give it a prompt to have it speak about a certain topic (a.k.a generating <em>interactive conditional samples</em>). In the rambling case, we can simply hand it the start token and have it start generating words (the trained model uses <code class="language-plaintext highlighter-rouge"><|endoftext|></code> as its start token. Let’s call it <code class="language-plaintext highlighter-rouge"><s></code> instead).
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-simple-output-2.gif" />
<p><br /></p>
</div>
<div class="tooltip">
<p>모델은 input token 하나를 가지고 있으므로, 경로가 1개만 활성화 됩니다. token이 모든 레이어를 연속적으로 거쳐 처리되고 나면, 그 경로를 따라 vector가 생성됩니다. 그 vector에 모델의 어휘(vocab) 전체(모델이 알고 있는 모든 단어들, GPT-2의 경우엔 50,000개의 단어)에 대해 점수(score)가 매겨질 수 있습니다. 이 경우에 우리는 가장 높은 확률을 갖는 ‘the’를 선택했습니다. 하지만, 선택에 변화를 주는 방법도 있습니다 – 키보드 앱에서 제안하는 단어를 클릭하기를 계속하면, 가끔 반복 루프에 빠질 때가 있고, 이 때 유일한 탈출 방법은 두번째나 세번째로 제안한 단어를 선택하는 것 입니다. 동일한 상황이 여기서도 있습니다. GPT-2는 top-k라는 parameter를 가지고 있어서 우리는 모델이 top 단어(top-k가 1인 경우)가 아닌 다른 단어를 샘플링하게 만들 수도 있습니다.
<span class="tooltiptext">
The model only has one input token, so that path would be the only active one. The token is processed successively through all the layers, then a vector is produced along that path. That vector can be scored against the model’s vocabulary (all the words the model knows, 50,000 words in the case of GPT-2). In this case we selected the token with the highest probability, ‘the’. But we can certainly mix things up – you know how if you keep clicking the suggested word in your keyboard app, it sometimes can get stuck in repetitive loops where the only way out is if you click the second or third suggested word. The same can happen here. GPT-2 has a parameter called top-k that we can use to have the model consider sampling words other than the top word (which is the case when top-k = 1).
</span></p>
</div>
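본문에서 언급한 top-k 샘플링을 numpy로 간단히 스케치하면 다음과 같습니다. score 값과 vocab 크기는 설명을 위해 가정한 것입니다.

```python
import numpy as np

def top_k_sample(scores, k, rng):
    # score 상위 k개의 token만 남기고, softmax 확률에 비례해 하나를 샘플링합니다.
    top_ids = np.argsort(scores)[-k:]
    e = np.exp(scores[top_ids] - scores[top_ids].max())
    probs = e / e.sum()
    return int(rng.choice(top_ids, p=probs))

rng = np.random.default_rng(0)
scores = np.array([0.1, 2.0, 0.3, 1.5])    # vocab이 4개뿐이라고 가정한 toy score
print(top_k_sample(scores, k=1, rng=rng))  # top-k = 1이면 항상 최고 score인 index 1
```

top-k를 1보다 크게 잡으면 최고 score가 아닌 단어도 뽑힐 수 있어, 본문에서 말한 반복 루프를 피하는 데 도움이 됩니다.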
<div class="tooltip">
<p>다음 단계에서, 첫번째 단계의 출력을 입력 시퀀스에 덧붙인 뒤 모델이 다음 예측을 수행합니다:
<span class="tooltiptext">
In the next step, we add the output from the first step to our input sequence, and have the model make its next prediction:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt-2-simple-output-3.gif" />
<br />
</div>
<div class="tooltip">
<p>이번 계산에서는 두번째 경로만 활성화된다는 점에 주목하세요. GPT-2의 각 레이어는 첫번째 token에 대한 자신의 interpretation을 유지하고, 두번째 token을 처리할 때 이를 사용합니다 (뒤의 self-attention 섹션에서 더 자세히 알아보겠습니다). GPT-2는 두번째 token에 비추어 첫번째 token을 재해석(re-interpret)하지 않습니다.
<span class="tooltiptext">
Notice that the second path is the only that’s active in this calculation. Each layer of GPT-2 has retained its own interpretation of the first token and will use it in processing the second token (we’ll get into more detail about this in the following section about self-attention). GPT-2 does not re-interpret the first token in light of the second token.
</span></p>
</div>
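각 레이어가 첫번째 token의 interpretation을 유지한다는 것은, 구현 관점에서는 앞선 token들의 key/value를 보관해두고 재사용하는 것에 해당합니다. 아래는 그 개념만 담은 스케치이며, 실제 GPT-2 구현의 자료구조와는 다르다는 점(가정)을 밝혀둡니다.

```python
class LayerState:
    """한 decoder 레이어가 보관하는, 앞선 token들의 key/value (개념 스케치)."""
    def __init__(self):
        self.keys = []
        self.values = []

    def process_token(self, key, value):
        # 새 token의 key/value만 계산해서 덧붙입니다.
        # 앞선 token들의 값은 재계산(재해석)하지 않고 그대로 재사용합니다.
        self.keys.append(key)
        self.values.append(value)
        return len(self.keys)  # attention이 참조할 수 있는 token의 수

layer = LayerState()
layer.process_token("k1", "v1")      # 첫번째 token 처리
n = layer.process_token("k2", "v2")  # 두번째 token: k1/v1은 그대로 유지
print(n)  # 2
```

이렇게 하면 매 단계마다 전체 시퀀스를 다시 계산할 필요가 없습니다.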
<h3 id="더-깊이-알아보기">더 깊이 알아보기</h3>
<h4 id="입력-encoding">입력 Encoding</h4>
<div class="tooltip">
<p>모델을 더 깊이 이해하기 위해 세부 사항들을 살펴봅시다. 입력부터 시작합니다. 이전에 논의했던 다른 NLP 모델들처럼, GPT-2 모델도 embedding matrix에서 입력 단어의 embedding을 조회합니다 – 이 embedding matrix는 훈련된 모델의 일부로 얻게 되는 구성요소 중 하나입니다.
<span class="tooltiptext">
Let’s look at more details to get to know the model more intimately. Let’s start from the input. As in other NLP models we’ve discussed before, the model looks up the embedding of the input word in its embedding matrix – one of the components we get as part of a trained model.
</span></p>
</div>
<div class="img-div">
<image src="/images/gpt2/gpt2-token-embeddings-wte-2.png" />
<br />
<div class="tooltip">
<p>각 행은 단어 embedding 입니다: 단어를 표현하고 그 의미를 캡쳐(함유)하는 숫자형태의 표현(representation) 리스트 입니다. 이 리스트의 크기는 GPT2 모델 사이즈에 따라 다릅니다. 제일 작은 모델은 단어(토큰) 당 768 embedding 크기를 사용합니다.
<span class="tooltiptext">
Each row is a word embedding: a list of numbers representing a word and capturing some of its meaning. The size of that list is different in different GPT2 model sizes. The smallest model uses an embedding size of 768 per word/token.
</span></p>
</div>
</div>
<div class="tooltip">
<p>처음에는, embedding matrix에서 시작 토큰 <code class="language-plaintext highlighter-rouge"><s></code>의 embedding을 조회합니다. 모델의 첫번째 block에 이 정보를 전달하기 전에, 위치 인코딩(positional encoding) 정보를 통합해야 합니다. – positional encoding은 transformer block에게, 시퀀스 상에서의 word들의 순서 정보를 알려주는 신호(정보)입니다. 훈련된 모델을 구성하는 또다른 한 부분은 입력의 1024개 위치 각각에 대한 positional encoding vector입니다.
<span class="tooltiptext">
So in the beginning, we look up the embedding of the start token <code class="language-plaintext highlighter-rouge"><s></code> in the embedding matrix. Before handing that to the first block in the model, we need to incorporate positional encoding – a signal that indicates the order of the words in the sequence to the transformer blocks. Part of the trained model is a matrix that contains a positional encoding vector for each of the 1024 positions in the input.
</span></p>
</div>
<div class="img-div">
<image src="/images/gpt2/gpt2-positional-encoding.png" />
<br />
</div>
<div class="tooltip">
<p>이로써 입력 단어들이 transformer의 첫번째 block에 전달되기 전에 어떤 처리를 거치는지 살펴봤습니다. 또한 훈련된 GPT-2를 구성하는 두개의 weight matrix(가중치 행렬)도 알게 되었습니다.
<span class="tooltiptext">
With this, we’ve covered how input words are processed before being handed to the first transformer block. We also know two of the weight matrices that constitute the trained GPT-2.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-input-embedding-positional-encoding-3.png" />
<br />
<div class="tooltip">
<p>word를 첫번째 transformer block에 전달한다는 것은, 그 word의 embedding을 조회하고 #1번 위치의 positional encoding vector를 더하는 것을 의미합니다.
<span class="tooltiptext">
Sending a word to the first transformer block means looking up its embedding and adding up the positional encoding vector for position #1.
</span></p>
</div>
</div>
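token embedding과 positional encoding을 더해 첫번째 block의 입력을 만드는 과정은 다음처럼 스케치할 수 있습니다. 행렬 크기는 toy 크기이고(실제 GPT-2 small은 vocab 50,257개, 위치 1,024개, embedding 768차원), 행렬 값은 무작위로 가정했습니다.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_pos, d_model = 10, 8, 4       # toy 크기 (가정)
wte = rng.normal(size=(vocab_size, d_model))  # token embedding matrix
wpe = rng.normal(size=(max_pos, d_model))     # 학습된 positional encoding matrix

def embed(token_id, position):
    # 첫번째 block에 들어가는 입력
    # = 해당 token의 embedding + 해당 위치의 positional vector
    return wte[token_id] + wpe[position]

x = embed(token_id=3, position=0)
print(x.shape)  # (4,)
```

두 matrix(wte, wpe)가 본문에서 말한, 훈련된 GPT-2를 구성하는 두 개의 weight matrix에 해당합니다.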
<h4 id="상위-stack으로의-이동">상위 Stack으로의 이동</h4>
<div class="tooltip">
<p>이제 첫번째 block은 token을 self-attention 프로세스를 통해 전달하고, neural network 레이어로 전달하여 처리할 수 있습니다. 첫번째 transformer block이 이 token을 처리하면, 그 결과 vector를 다음 block에서 처리하도록 윗쪽 stack으로 올립니다. 프로세스는 각 block마다 동일하지만 각 block은 각자의 self-attention 및 neural network 하위 레이어에 대한 weight들을 가지고 있습니다.
<span class="tooltiptext">
The first block can now process the token by first passing it through the self-attention process, then passing it through its neural network layer. Once the first transformer block processes the token, it sends its resulting vector up the stack to be processed by the next block. The process is identical in each block, but each block has its own weights in both self-attention and the neural network sublayers.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-transformer-block-vectors-2.png" />
<br />
</div>
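각 block이 self-attention과 neural network(FFN)를 차례로 거치며 결과 vector를 윗 block으로 넘기는 흐름을 스케치해봅니다. self-attention은 다음 섹션에서 다루므로 여기서는 항등 함수로 대체했고, weight 크기와 값은 설명용 가정입니다.

```python
import numpy as np

def gelu(x):
    # GPT-2가 사용하는 GELU 활성함수의 tanh 근사식
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class Block:
    # 각 block은 self-attention과 FFN 하위 레이어에 대해 각자의 weight를 갖습니다.
    def __init__(self, d, rng):
        self.w1 = rng.normal(size=(d, 4 * d)) * 0.02
        self.w2 = rng.normal(size=(4 * d, d)) * 0.02

    def __call__(self, x):
        a = x  # 단순화: self-attention 부분은 항등으로 대체 (가정)
        return a + gelu(a @ self.w1) @ self.w2  # residual 연결 + FFN

rng = np.random.default_rng(0)
blocks = [Block(8, rng) for _ in range(12)]  # GPT-2 small은 12개의 decoder block
x = rng.normal(size=(8,))
for b in blocks:  # 각 block의 결과 vector가 다음 block으로 올라갑니다
    x = b(x)
print(x.shape)  # (8,)
```

프로세스는 block마다 동일하지만 weight는 block마다 다르다는 점이 코드에도 그대로 드러납니다.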
<h4 id="self-attention-recap">Self-Attention Recap</h4>
<div class="tooltip">
<p>언어는 context(문맥)에 매우 의존적입니다. 예를 들어, 제2원칙을 봅시다:
<span class="tooltiptext">
Language heavily relies on context. For example, look at the second law:
</span></p>
</div>
<blockquote class="subtle">
<strong>Second Law of Robotics</strong><br />
A robot must obey the orders given <strong style="color:#D81B60">it</strong> by human beings except where <strong style="color:#689F38">such orders</strong> would conflict with the <strong style="color:#6D4C41">First Law</strong>.
<br />
(로보틱스 제2원칙: 로봇은 인간이 <strong style="color:#D81B60">그것</strong>에 내리는 명령들에 복종해야만 하며, 단 <strong style="color:#689F38">이러한 명령들</strong>이 <strong style="color:#6D4C41">제1원칙</strong>에 위배될 때는 예외로 한다.)
</blockquote>
<div class="tooltip">
<p>문장에서 다른 단어(word)를 참조하는 단어들 3군데를 하이라이트 표기 했습니다. 이 단어들은 참조하는 context와 통합적으로 보지 않으면 이해 또는 처리할 수 없습니다. 모델이 이 문장을 처리할 때, 다음을 알 수 있어야 합니다.
<span class="tooltiptext">
I have highlighted three places in the sentence where the words are referring to other words. There is no way to understand or process these words without incorporating the context they are referring to. When a model processes this sentence, it has to be able to know that:
</span></p>
</div>
<div class="tooltip">
<ul>
<li><strong style="color:#D81B60">그것</strong>은 로봇을 가리킵니다.</li>
<li><strong style="color:#689F38">그러한 명령들</strong>은 이 법칙의 앞부분(참고: 한국어에서는 언어 구조 상 뒷부분에 위치)의, “인간이 그것에 내리는 명령들”을 가리킵니다.</li>
<li><strong style="color:#6D4C41">제 1원칙</strong>은 제 1원칙 전체를 가리킵니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> <strong style="color:#D81B60">it</strong> refers to the robot
<br />
<span>*</span> <strong style="color:#689F38">such orders</strong> refers to the earlier part of the law, namely “the orders given it by human beings”
<br />
<span>*</span> <strong style="color:#6D4C41">The First Law</strong> refers to the entire First Law
</span></li>
</ul>
</div>
<div class="tooltip">
<p>이것이 self-attention이 하는 일입니다. 특정 word를 (neural network에 전달하여) 처리하기 전에, 그 word의 context를 설명해주는 관련 word들에 대한 모델의 이해를 미리 반영합니다. segment 안의 각 word가 얼마나 관련되어 있는지 score를 할당하고, 그 score를 곱한 vector representation들을 합산하는 방식으로 이를 수행합니다.
<span class="tooltiptext">
This is what self-attention does. It bakes in the model’s understanding of relevant and associated words that explain the context of a certain word before processing that word (passing it through a neural network). It does that by assigning scores to how relevant each word in the segment is, and adding up their vector representation.
</span></p>
</div>
<div class="tooltip">
<p>예를 들어, 상단의 block에서 self-attention 레이어는 단어 “it”을 처리할 때 “a robot”에 attention을 줍니다. neural network으로 전달하는 vector는, 그 3개의 단어들의 vector에 각 score들을 곱한 것의 합입니다.
<span class="tooltiptext">
As an example, this self-attention layer in the top block is paying attention to “a robot” when it processes the word “it”. The vector it will pass to its neural network is a sum of the vectors for each of the three words multiplied by their scores.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-example-2.png" />
<br />
</div>
<h4 id="self-attention-프로세스">Self-Attention 프로세스</h4>
<div class="tooltip">
<p>Self-attention은 segment에서 각 token의 경로를 따라 처리됩니다. 중요한 요소들은 다음 세가지 vector들 입니다:
<span class="tooltiptext">
Self-attention is processed along the path of each token in the segment. The significant components are three vectors:
</span></p>
</div>
<div class="tooltip">
<ul>
<li><span class="decoder">Query</span>: Query는 다른 모든 word들과 score를 계산(각 단어마다 고유 key값 사용)하는데 사용되는 현재 단어(word)의 representation 입니다. 우리는 현재 처리 중인 token의 query 값만 고려합니다.</li>
<li><span class="context">Key</span>: Key vector는 segment에서 모든 word들에 대한 레이블과 같습니다. 관련 word를 검색할 때 매칭해보는 항목입니다.</li>
<li><span class="step_no">Value</span>: Value vector는 실제 word representation입니다. 각 단어가 얼마나 관련이 있는지 score를 매기고 나면, 현재의 word를 표현(representation)하기 위해 이 value들을 합산(add up)합니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> <span class="decoder">Query</span>: The query is a representation of the current word used to score against all the other words (using their keys). We only care about the query of the token we’re currently processing.
<br />
<span>*</span> <span class="context">Key</span>: Key vectors are like labels for all the words in the segment. They’re what we match against in our search for relevant words.
<br />
<span>*</span> <span class="step_no">Value</span>: Value vectors are actual word representations, once we’ve scored how relevant each word is, these are the values we add up to represent the current word.
</span></li>
</ul>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/self-attention-example-folders-3.png" />
<br />
</div>
<div class="tooltip">
<p>대략적으로 비유해보자면, 서류 캐비넷에서 어떤 서류를 찾는 것과 같다고 생각할 수 있습니다. Query는 찾고자 하는 주제를 적은 메모지입니다. Key는 캐비넷 안 서류 폴더들에 달린 레이블과 같습니다. 메모지와 레이블을 매칭시키면 그 폴더의 내용물을 꺼내는데, 이 내용물이 바로 value vector입니다. 다만, 하나의 value만 찾는 것이 아니라, 여러 폴더들에서 꺼낸 value들의 혼합을 찾는다는 점이 다릅니다.
<span class="tooltiptext">
A crude analogy is to think of it like searching through a filing cabinet. The query is like a sticky note with the topic you’re researching. The keys are like the labels of the folders inside the cabinet. When you match the tag with a sticky note, we take out the contents of that folder, these contents are the value vector. Except you’re not only looking for one value, but a blend of values from a blend of folders.
</span></p>
</div>
<div class="tooltip">
<p>Query vector를 각 key vector에 곱해서, 각 폴더 별 score 값을 만듭니다 (기술적으로: 내적(dot product) 연산 뒤 softmax 연산 수행).
<span class="tooltiptext">
Multiplying the query vector by each key vector produces a score for each folder (technically: dot product followed by softmax).
</span></p>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/self-attention-example-folders-scores-3.png" />
<br />
</div>
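query와 key의 내적(dot product) 뒤에 softmax를 적용해 score를 만드는 계산을 numpy로 옮기면 다음과 같습니다. vector 값들은 설명을 위해 가정한 것입니다.

```python
import numpy as np

def attention_scores(query, keys):
    # 각 key와 query의 내적(dot product) → softmax로 정규화된 score
    logits = keys @ query
    e = np.exp(logits - logits.max())
    return e / e.sum()

q = np.array([1.0, 0.0])      # 현재 token의 query (가정)
keys = np.array([[1.0, 0.0],  # 각 "폴더"(token)의 key (가정)
                 [0.0, 1.0],
                 [0.5, 0.5]])
s = attention_scores(q, keys)
print(round(float(s.sum()), 6))  # softmax이므로 score의 합은 1.0
```

query와 가장 잘 매칭되는 key(첫번째 폴더)가 가장 높은 score를 받습니다.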
<div class="tooltip">
<p>각 value를 위에서 구한 score과 곱한 뒤, 합산합니다 – self-attention 결과가 나오게 됩니다.
<span class="tooltiptext">
We multiply each value by its score and sum up – resulting in our self-attention outcome.
</span></p>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/gpt2-value-vector-sum.png" />
<br />
</div>
<div class="tooltip">
<p>이 가중치 혼합된 value vector는, 50%는 단어 <code class="language-plaintext highlighter-rouge">robot</code>에, 30%는 <code class="language-plaintext highlighter-rouge">a</code>에, 19%는 <code class="language-plaintext highlighter-rouge">it</code>에 attention을 준 vector를 생성합니다. 이 글의 뒷부분에서, self-attention에 대해 더 자세히 알아보겠습니다. 지금은, 모델의 출력 방향으로 윗쪽 stack을 계속 알아봅시다.
<span class="tooltiptext">
This weighted blend of value vectors results in a vector that paid 50% of its “attention” to the word <code class="language-plaintext highlighter-rouge">robot</code>, 30% to the word <code class="language-plaintext highlighter-rouge">a</code>, and 19% to the word <code class="language-plaintext highlighter-rouge">it</code>. Later in the post, we’ll go deeper into self-attention. But first, let’s continue our journey up the stack towards the output of the model.
</span></p>
</div>
<h4 id="모델-출력">모델 출력</h4>
<div class="tooltip">
<p>모델의 최상위 block이 (최상위 block의 self-attention 및 neural network 계산을 거친 결과인) output vector를 생성할 때, 모델은 그 vector와 embedding matrix를 곱합니다.
<span class="tooltiptext">
When the top block in the model produces its output vector (the result of its own self-attention followed by its own neural network), the model multiplies that vector by the embedding matrix.
</span></p>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/gpt2-output-projection-2.png" />
<br />
</div>
<div class="tooltip">
<p>embedding matrix의 각 행은 모델 어휘(vocab) 단어들의 embedding에 해당합니다. 이 곱셈의 결과는 모델의 어휘에서 각 word에 대한 score로 해석됩니다. (즉, 단어를 선택하기 위한 score로 사용할 수 있습니다.)
<span class="tooltiptext">
Recall that each row in the embedding matrix corresponds to the embedding of a word in the model’s vocabulary. The result of this multiplication is interpreted as a score for each word in the model’s vocabulary.
</span></p>
</div>
<div class="img-div-any-size">
<image src="/images/gpt2/gpt2-output-scores-2.png" />
<br />
</div>
<div class="tooltip">
<p>가장 높은 score를 갖는 token을 선택할 수도 있습니다 (top_k = 1). 하지만 모델이 다른 word들도 함께 고려하면 더 좋은 결과를 얻을 수 있습니다. 따라서 더 좋은 전략은, score를 해당 word가 선택될 확률로 사용하여 전체 리스트에서 word 하나를 샘플링하는 것입니다 (그래서 높은 score를 갖는 word일수록 선택될 가능성이 높습니다). 절충안은 top_k를 40으로 잡아, 모델이 가장 높은 score를 갖는 40개의 word만 고려하도록 하는 것입니다.
<span class="tooltiptext">
We can simply select the token with the highest score (top_k = 1). But better results are achieved if the model considers other words as well. So a better strategy is to sample a word from the entire list using the score as the probability of selecting that word (so words with a higher score have a higher chance of being selected). A middle ground is setting top_k to 40, and having the model consider the 40 words with the highest scores.
</span></p>
</div>
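최상위 block의 output vector를 embedding matrix와 곱해 vocab 전체의 score(logit)를 얻고, 그 score를 확률로 바꿔 샘플링하는 과정의 스케치입니다. 크기와 값은 toy 가정입니다 (실제 GPT-2의 vocab은 50,257개).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8                   # toy 크기 (가정)
wte = rng.normal(size=(vocab_size, d_model))  # embedding matrix
h = rng.normal(size=(d_model,))               # 최상위 block의 output vector라고 가정

logits = wte @ h   # embedding matrix와 곱해 vocab의 각 word에 대한 score 계산

k = 40             # top_k = 40: score 상위 40개 word만 후보로 남김
top_ids = np.argsort(logits)[-k:]
e = np.exp(logits[top_ids] - logits[top_ids].max())
probs = e / e.sum()  # score를 선택 확률로 변환 (높은 score일수록 선택 확률이 큼)
next_token = int(rng.choice(top_ids, p=probs))
```

선택된 next_token이 다시 입력 시퀀스에 덧붙여져 다음 iteration이 시작됩니다.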
<div class="img-div-any-size">
<image src="/images/gpt2/gpt2-output.png" />
<br />
</div>
<div class="tooltip">
<p>그렇게 해서, 모델은 하나의 word를 출력하면서 한 iteration을 종료합니다. 모델은 전체 context가 생성(1024개의 token)될 때까지, 혹은 EOS(end-of-sequence) token이 생성될 때까지 iteration을 계속 수행합니다.
<span class="tooltiptext">
With that, the model has completed an iteration resulting in outputting a single word. The model continues iterating until the entire context is generated (1024 tokens) or until an end-of-sequence token is produced.
</span></p>
</div>
<h3 id="파트-1의-마무리-몇가지-안내사항">파트 #1의 마무리: 몇가지 안내사항</h3>
<div class="tooltip">
<p>이것으로 이 파트를 마칩니다. GPT2 동작 방식에 대한 간단한 요약이었습니다. self-attention 레이어 안쪽에서 정확히 무슨 일이 일어나는지 궁금하다면, 아래의 보너스 섹션을 살펴보세요. TransformerXL과 XLNet 같은 후속 transformer 모델들을 더 쉽게 살펴보고 설명할 수 있도록, self-attention을 설명하는 더 시각적인 언어를 도입하기 위해 그 섹션을 만들었습니다.
<span class="tooltiptext">
And there we have it. A run down of how the GPT2 works. If you’re curious to know exactly what happens inside the self-attention layer, then the following bonus section is for you. I created it to introduce more visual language to describe self-attention in order to make describing later transformer models easier to examine and describe (looking at you, TransformerXL and XLNet).
</span></p>
</div>
<div class="tooltip">
<p>이 글에서 매우 단순화시킨 점들은 다음과 같습니다:
<span class="tooltiptext">
I’d like to note a few oversimplifications in this post:
</span></p>
</div>
<div class="tooltip">
<ul>
<li>“word”와 “token”을 같은 의미로 사용했습니다. 하지만 실제로는, GPT2는 어휘(vocab)의 token들을 만들기 위해 BPE(Byte Pair Encoding)을 사용합니다. 이 것은 일반적으로 token이 word의 일부임을 의미합니다.</li>
<li>예로 든 GPT2는 추론(inference)/평가(evaluation) 모드입니다. (설명 과정에서) 한번에 하나의 word만을 처리하는 이유입니다. 학습(training) 시에는, 모델은 더 긴 문자열 시퀀스에 대해 학습하며, 한번에 여러개의 token을 처리할 것입니다. 또한, 모델은 evaluation 때 사용하는 배치 사이즈 보다 더 큰 배치 사이즈 (512)를 처리할 것 입니다.</li>
<li>그림에서 공간을 효과적으로 사용하기 위해 회전/치환을 자유롭게 사용했습니다. 하지만 구현 때에는, 보다 더 정확히 해야 합니다.</li>
<li>Transformer는 레이어 정규화(layer normalization)를 많이 사용하며, 꽤 중요합니다. 이전 블로그 포스팅 ‘Illustrated Transformer’에서는 몇가지를 언급했었지만, 이번 포스팅에서는 self-attention에 집중했습니다.</li>
<li>vector를 표현하기 위해 더 많은 상자(box)들로 표현해야할 때가 있습니다. 저는 이 상자들을 “zoom in”으로 표시했습니다. 예를 들어 다음과 같습니다:
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> I used “words” and “tokens” interchangeably. But in reality, GPT2 uses Byte Pair Encoding to create the tokens in its vocabulary. This means the tokens are usually parts of words.
<br />
<span>*</span> The example we showed runs GPT2 in its inference/evaluation mode. That’s why it’s only processing one word at a time. At training time, the model would be trained against longer sequences of text and processing multiple tokens at once. Also at training time, the model would process larger batch sizes (512) vs. the batch size of one that evaluation uses.
<br />
<span>*</span> I took liberties in rotating/transposing vectors to better manage the spaces in the images. At implementation time, one has to be more precise.
<br />
<span>*</span> Transformers use a lot of layer normalization, which is pretty important. We’ve noted a few of these in the Illustrated Transformer, but focused more on self-attention in this post.
<br />
<span>*</span> There are times when I needed to show more boxes to represent a vector. I indicate those as “zooming in”. For example:
</span></li>
</ul>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/zoom-in.png" />
<br />
</div>
<h2 id="파트-2-그림으로-설명하는-self-attention-">파트 #2: 그림으로 설명하는 Self-Attention <a name="part-2-illustrated-self-attention" href="#part-2-illustrated-self-attention">#</a></h2>
<div class="tooltip">
<p>이 글의 앞 부분에서 단어 <code class="language-plaintext highlighter-rouge">it</code>을 처리하는 레이어에서 self-attention을 적용하는 것을 보여주기 위해 이 그림을 보여주었습니다.
<span class="tooltiptext">
Earlier in the post we showed this image to showcase self-attention being applied in a layer that is processing the word <code class="language-plaintext highlighter-rouge">it</code>:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-1-2.png" />
<br />
</div>
<div class="tooltip">
<p>이번 섹션에서는, 어떻게 동작하는지 자세히 살펴보겠습니다. 각 개별 word에 무슨일이 일어나는지 이해하는 방향으로 알아보겠습니다. 많은 단일 vector들을 보여줄 것입니다. 실제 구현은 거대한 matrix를 서로 곱하여 수행됩니다. 하지만 여기서는, word 수준에서 어떤 일이 일어나는지 직관적 표현에 집중하겠습니다.
<span class="tooltiptext">
In this section, we’ll look at the details of how that is done. Note that we’ll look at it in a way to try to make sense of what happens to individual words. That’s why we’ll be showing many single vectors. The actual implementations are done by multiplying giant matrices together. But I want to focus on the intuition of what happens on a word-level here.
</span></p>
</div>
<h3 id="masking-없는-self-attention">(Masking 없는) Self-Attention</h3>
<div class="tooltip">
<p>encoder block에서 계산된 최초의 self-attention을 살펴보는 것으로 부터 시작해봅시다. 한번에 4개의 token만 처리할 수 있는 토이(toy) transformer를 살펴보겠습니다.
<span class="tooltiptext">
Let’s start by looking at the original self-attention as it’s calculated in an encoder block. Let’s look at a toy transformer block that can only process four tokens at a time.
</span></p>
</div>
<div class="tooltip">
<p>Self-attention은 3개의 주요 단계가 있습니다:
<span class="tooltiptext">
Self-attention is applied through three main steps:
</span></p>
</div>
<div class="tooltip">
<ol>
<li>각 경로 마다 Query, Key, Value 벡터를 생성합니다.</li>
<li>각 input token 마다, query vector를 사용하여 모든 다른 key vector들에 대한 score를 계산합니다.</li>
<li>value vector에 그 조합된 score를 곱한 뒤 합산합니다.
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>1.</span> Create the Query, Key, and Value vectors for each path.
<span>2.</span> For each input token, use its query vector to score against all the other key vectors
<span>3.</span> Sum up the value vectors after multiplying them by their associated scores.
</span></li>
</ol>
</div>
<div class="img-div-any-width">
<image src="/images/xlnet/self-attention-summary.png" />
<br />
</div>
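위의 세 단계를 numpy로 옮기면 다음과 같습니다. 한번에 4개의 token을 처리하는 toy 예시이며, weight 값은 무작위로 가정했습니다.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    # 1) 각 token 경로마다 query, key, value vector 생성
    q, k, v = x @ wq, x @ wk, x @ wv
    # 2) 각 query를 모든 key와 내적해 score를 계산하고 softmax로 정규화
    logits = q @ k.T
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    # 3) score를 곱한 value vector들을 합산
    return weights @ v

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=(4, d))  # 한번에 4개의 token을 처리하는 toy transformer
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (4, 4): token마다 context가 반영된 vector 하나씩
```

실제 구현에서는 이 계산이 attention head 여러 개로 나뉘어 수행되지만, 여기서는 head를 무시한 단일 계산입니다.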
<h3 id="1--query-key-value-vector-생성">1- Query, Key, Value Vector 생성</h3>
<div class="tooltip">
<p>첫번째 경로를 봅시다. query를 받아서, 모든 key들과 비교할 것입니다. 각 key 별로 score를 계산합니다. self-attention에서의 첫번째 단계는 각 token 경로 별 3개의 vector를 계산하는 것 입니다 (attention head는 일단은 무시합니다):
<span class="tooltiptext">
Let’s focus on the first path. We’ll take its query, and compare against all the keys. That produces a score for each key. The first step in self-attention is to calculate the three vectors for each token path (let’s ignore attention heads for now):
</span></p>
</div>
<div class="img-div-any-width">
1) 각 input token 마다, weight matrix <strong style="color:#A144B8">W^Q</strong>, <strong style="color:#F18533">W^K</strong>, <strong style="color:#329CEB">W^V</strong>를 곱하여 <strong style="color:#A144B8">query vector</strong>, <strong style="color:#F18533">key vector</strong>, <strong style="color:#329CEB">value vector</strong>를 생성합니다.
<image src="/images/xlnet/self-attention-1.png" />
<br />
</div>
<h3 id="2--score-계산">2- Score 계산</h3>
<div class="tooltip">
<p>이제 vector들을 갖게 되었고, 현재 #2번 단계에서만 query 및 key vector를 사용합니다. 우리는 지금 첫번째 token을 집중해서 보고 있기 때문에, 그 token의 query를 모든 key vector들과 곱하여 4개의 token들 각각의 score를 얻습니다.
<span class="tooltiptext">
Now that we have the vectors, we use the query and key vectors only for step #2. Since we’re focused on the first token, we multiply its query by all the other key vectors resulting in a score for each of the four tokens.
</span></p>
</div>
<div class="img-div-any-width">
2) 현재의 <strong style="color:#A144B8">query vector</strong>와 모든 <strong style="color:#F18533">key vector</strong>가 얼마나 잘 매칭되는지 score를 얻기 위해 곱셈(dot product) 연산을 합니다.
<image src="/images/xlnet/self-attention-2.png" />
<br />
</div>
<h3 id="3--전체-합산">3- 전체 합산</h3>
<div class="tooltip">
<p>우리는 이제 score들과 value vector들을 곱할 수 있습니다. 높은 score의 value는, 결과가 다 더해지고 난 뒤에, 결과 vector에서 높은 비중을 차지하게 됩니다.
<span class="tooltiptext">
We can now multiply the scores by the value vectors. A value with a high score will constitute a large portion of the resulting vector after we sum them up.
</span></p>
</div>
<div class="img-div-any-width">
3) <strong style="color:#329CEB">value vector들</strong>을 <strong style="color:#E66A93">score들</strong>과 곱한 뒤 모두 합산합니다.
<image src="/images/xlnet/self-attention-3-2.png" />
<br />
<div class="tooltip">
<p>score가 낮을수록 value vector를 더 투명하게 표시했습니다. 작은 수를 곱하면 vector의 값이 희석된다는(작아진다는) 것을 나타내기 위함입니다.
<span class="tooltiptext">
The lower the score, the more transparent we’re showing the value vector. That’s to indicate how multiplying by a small number dilutes the values of the vector.
</span></p>
</div>
</div>
<div class="tooltip">
<p>만약 각 경로마다 같은 동작을 수행한다면, 각 해당 token 마다, 적합한 context를 포함하는 token의 vector representation을 얻게 됩니다. 그 값은 transformer block의 다음 하위 layer(feed-forward neural network)에 제공됩니다.
<span class="tooltiptext">
If we do the same operation for each path, we end up with a vector representing each token containing the appropriate context of that token. Those are then presented to the next sublayer in the transformer block (the feed-forward neural network):
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/xlnet/self-attention-summary.png" />
<br />
</div>
<h3 id="그림으로-설명하는-masked-self-attention">그림으로 설명하는 Masked Self-Attention</h3>
<div class="tooltip">
<p>우리는 지금까지 transformer의 self-attention 단계 내부를 살펴보았으니, 이제 masked self-attention을 살펴보겠습니다. Masked self-attention은 #2 단계를 제외하면 self-attention과 동일합니다. 모델이 2개의 token만을 입력으로 가지고 있고, 우리가 두번째 token을 처리하는 상황이라고 가정해봅시다. 이 경우, 마지막 2개의 token은 masking됩니다. 즉, 모델이 scoring 단계에 개입하여, 앞으로 나올 token들의 score를 항상 0으로 만들기 때문에 모델이 미래의 word를 엿볼 수 없습니다:
<span class="tooltiptext">
Now that we’ve looked inside a transformer’s self-attention step, let’s proceed to look at masked self-attention. Masked self-attention is identical to self-attention except when it comes to step #2. Assuming the model only has two tokens as input and we’re observing the second token. In this case, the last two tokens are masked. So the model interferes in the scoring step. It basically always scores the future tokens as 0 so the model can’t peek to future words:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/xlnet/masked-self-attention-2.png" />
<br />
</div>
<div class="tooltip">
<p>이러한 masking은 attention mask라고 불리는 matrix로 구현됩니다. 4개의 단어 sequence(예를 들어 “robot must obey orders”)를 생각해보세요. language modeling 시나리오에서, 이 sequence는 4단계에 걸쳐 입력됩니다 – word 당 하나씩 (모든 word는 token이라고 가정합니다). 모델은 배치로 동작하기 때문에, 전체 sequence (4단계를 갖는)를 한 배치로 처리하는 이 toy model의 배치 사이즈를 4로 가정할 수 있습니다.
<span class="tooltiptext">
This masking is often implemented as a matrix called an attention mask. Think of a sequence of four words (“robot must obey orders”, for example). In a language modeling scenario, this sequence is absorbed in four steps – one per word (assuming for now that every word is a token). As these models work in batches, we can assume a batch size of 4 for this toy model that will process the entire sequence (with its four steps) as one batch.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/transformer-decoder-attention-mask-dataset.png" />
<br />
</div>
<div class="tooltip">
<p>matrix 형태에서, query matrix를 key matrix와 곱해서 score를 계산할 수 있습니다. 아래 그림의 각 셀에서 word 대신에 word와 관련된 query (또는 key) vector가 있다고 가정하고 다음과 같이 시각적으로 표현해보겠습니다:
<span class="tooltiptext">
In matrix form, we calculate the scores by multiplying a queries matrix by a keys matrix. Let’s visualize it as follows, except instead of the word, there would be the query (or key) vector associated with that word in that cell:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/queries-keys-attention-mask.png" />
<br />
</div>
<div class="tooltip">
<p>곱셈 이후에, attention mask 삼각형을 적용합니다. masking 하고 싶은 셀들을 마이너스 무한대 또는 매우 큰 음수로 설정합니다 (예. GPT2에서는 -10억):
<span class="tooltiptext">
After the multiplication, we slap on our attention mask triangle. It sets the cells we want to mask to -infinity or a very large negative number (e.g. -1 billion in GPT2):
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/transformer-attention-mask.png" />
<br />
</div>
<div class="tooltip">
<p>각 행에 softmax를 취함으로써 self-attention에 사용하는 실제 score가 생성됩니다.
<span class="tooltiptext">
Then, applying softmax on each row produces the actual scores we use for self-attention:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/transformer-attention-masked-scores-softmax.png" />
<br />
</div>
<div class="tooltip">
<p>이 score 테이블이 의미하는 것은 다음과 같습니다:
<span class="tooltiptext">
What this scores table means is the following:
</span></p>
</div>
<div class="tooltip">
<ul>
<li>모델이 dataset에서 첫번째 케이스(1번 행)를 처리할 때, 단 하나의 단어(“robot”)만을 포함하며, 그 단어에 모든(100%) attention을 갖습니다.</li>
<li>모델이 dataset에서 두번째 케이스(2번 행)을 처리할 때, “robot must”라는 단어들을 포함하며, “robot”에 48%, “must”에 52%의 attention을 가지면서 단어 “must”를 처리합니다.</li>
<li>기타 등등…
<span class="tooltiptext" style="display: inline-block; text-align: left;">
<span>*</span> When the model processes the first example in the dataset (row #1), which contains only one word (“robot”), 100% of its attention will be on that word.
<span>*</span> When the model processes the second example in the dataset (row #2), which contains the words (“robot must”), when it processes the word “must”, 48% of its attention will be on “robot”, and 52% of its attention will be on “must”.
<span>*</span> And so on
</span></li>
</ul>
</div>
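<p>위 score 테이블을 만드는 masking + softmax 과정은 다음과 같이 스케치할 수 있습니다. score 값들은 임의의 수이며, GPT2의 실제 구현과 세부 사항은 다를 수 있습니다:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

words = ["robot", "must", "obey", "orders"]
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # query matrix @ key matrix 결과라고 가정

# attention mask: 미래 token에 해당하는 셀(대각선 위쪽)을 매우 큰 음수로 설정
mask = np.triu(np.full((4, 4), -1e9), k=1)
masked_scores = scores + mask

# 각 행에 softmax를 취하면 실제 사용하는 score가 생성됨
weights = softmax(masked_scores)
print(np.round(weights, 2))  # 1번 행: 첫 단어("robot")에 100% attention
```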
<h3 id="gpt-2의-masked-self-attention">GPT-2의 Masked Self-Attention</h3>
<div class="tooltip">
<p>GPT-2의 masked attention에 대해 더 깊이 알아봅시다.
<span class="tooltiptext">
Let’s get into more detail on GPT-2’s masked attention.
</span></p>
</div>
<h4 id="평가-시-한번에-한-토큰씩-처리">평가 시: 한번에 한 토큰씩 처리</h4>
<div class="tooltip">
<p>GPT-2가 masked self-attention이 동작하는 것과 똑같이 동작하도록 만들 수 있습니다. 하지만 evaluation 할 때에, 우리 모델이 각 iteration이 끝날 때 마다 하나의 새로운 word만 추가할 때, 이미 처리된 token에 대해 이전 경로를 따라 self-attention을 다시 계산하는 것은 비효율적 입니다.<br />
<span class="tooltiptext">
We can make the GPT-2 operate exactly as masked self-attention works. But during evaluation, when our model is only adding one new word after each iteration, it would be inefficient to recalculate self-attention along earlier paths for tokens which have already been processed.
</span></p>
</div>
<div class="tooltip">
<p>이 경우에, 첫번째 token을 처리합니다 (지금은 <code class="language-plaintext highlighter-rouge"><s></code>를 무시합니다).
<span class="tooltiptext">
In this case, we process the first token (ignoring <code class="language-plaintext highlighter-rouge"><s></code> for now).
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-qkv-1-2.png" />
<br />
</div>
<div class="tooltip">
<p>GPT-2는 <code class="language-plaintext highlighter-rouge">a</code> token의 key, value vector를 유지하고 있습니다. 모든 self-attention 레이어는 그 token에 대한 각각의 key, value vector를 유지합니다.
<span class="tooltiptext">
GPT-2 holds on to the key and value vectors of the <code class="language-plaintext highlighter-rouge">a</code> token. Every self-attention layer holds on to its respective key and value vectors for that token:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-qkv-2-2.png" />
<br />
</div>
<div class="tooltip">
<p>이제 다음 iteration에서, 모델이 단어 <code class="language-plaintext highlighter-rouge">robot</code>을 처리할 때, <code class="language-plaintext highlighter-rouge">a</code> token에 대한 query, key, value vector를 생성할 필요가 없습니다. 첫번째 iteration에서 저장한 것을 재사용합니다.
<span class="tooltiptext">
Now in the next iteration, when the model processes the word <code class="language-plaintext highlighter-rouge">robot</code>, it does not need to generate query, key, and value vectors for the <code class="language-plaintext highlighter-rouge">a</code> token. It just reuses the ones it saved from the first iteration:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-qkv-3-2.png" />
<br />
</div>
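<p>이러한 key/value 재사용(caching) 아이디어를 단순화해서 스케치하면 다음과 같습니다. <code class="language-plaintext highlighter-rouge">KVCache</code>라는 이름과 vector 값들은 설명을 위해 임의로 정한 것이며, 실제 GPT-2 구현과는 다릅니다:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy model의 차원

class KVCache:
    """이미 처리한 token들의 key/value vector를 저장하는 단순한 cache."""
    def __init__(self):
        self.keys, self.values = [], []

    def add(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def stacked(self):
        return np.stack(self.keys), np.stack(self.values)

cache = KVCache()

# iteration 1: 'a' token을 처리하고 key/value를 저장
k_a, v_a = rng.normal(size=d), rng.normal(size=d)
cache.add(k_a, v_a)

# iteration 2: 'robot'의 query/key/value만 새로 만들고,
# 'a'의 key/value는 다시 계산하지 않고 cache에서 재사용
q_robot = rng.normal(size=d)
k_robot, v_robot = rng.normal(size=d), rng.normal(size=d)
cache.add(k_robot, v_robot)

keys, values = cache.stacked()
scores = q_robot @ keys.T / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = weights @ values
print(keys.shape, out.shape)  # (2, 8) (8,)
```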
<h4 id="gpt-2의-self-attention-1--querie-key-value-값들-생성">GPT-2의 Self-attention: 1- query, key, value 값들 생성</h4>
<div class="tooltip">
<p>모델이 단어 <code class="language-plaintext highlighter-rouge">it</code>를 처리하고 있다고 가정해봅시다. 하단 block의 경우, 그 token에 대한 입력 값은 <code class="language-plaintext highlighter-rouge">it의 embedding + 슬롯 #9에 대한 positional encoding</code>이 됩니다.
<span class="tooltiptext">
Let’s assume the model is processing the word <code class="language-plaintext highlighter-rouge">it</code>. If we’re talking about the bottom block, then its input for that token would be the embedding of <code class="language-plaintext highlighter-rouge">it</code> + the positional encoding for slot #9:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-1.png" />
<br />
</div>
<div class="tooltip">
<p>Transformer에서 모든 block은 각자의 weight를 갖습니다 (이 글의 후반부에서 설명하겠습니다). 우리가 가장 먼저 볼 것은 query, key, value를 생성하는 데 사용하는 weight matrix입니다.
<span class="tooltiptext">
Every block in a transformer has its own weights (broken down later in the post). The first we encounter is the weight matrix that we use to create the queries, keys, and values.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-2.png" />
<div class="tooltip">
<p>Self-attention은 입력을 weight matrix와 곱합니다 (그리고 여기서 표현하지는 않았지만, bias vector를 더해줍니다).
<span class="tooltiptext">
<br />
Self-attention multiplies its input by its weight matrix (and adds a bias vector, not illustrated here).
</span></p>
</div>
</div>
<div class="tooltip">
<p>이 곱셈 연산은 기본적으로 단어 <code class="language-plaintext highlighter-rouge">it</code>에 대한 query, key, value vector의 접합(concat)된 vector를 생성합니다.
<span class="tooltiptext">
The multiplication results in a vector that’s basically a concatenation of the query, key, and value vectors for the word <code class="language-plaintext highlighter-rouge">it</code>.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-3.png" />
<div class="tooltip">
<p>input vector에 attention weight vector를 곱함으로써 (그리고 나중에 bias vector를 더함으로써), 이 token에 대한 key, value, query vector를 생성합니다.
<span class="tooltiptext">
<br />
Multiplying the input vector by the attention weights vector (and adding a bias vector afterwards) results in the key, value, and query vectors for this token.
</span></p>
</div>
</div>
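<p>이 단계를 코드로 표현하면 다음과 같습니다. weight 값은 임의로 초기화한 것이며, 실제 GPT-2 구현을 단순화한 스케치입니다:</p>

```python
import numpy as np

d_model = 768  # GPT-2 small의 모델 차원
rng = np.random.default_rng(0)

# query/key/value를 한번에 생성하는 하나의 weight matrix와 bias vector
W_qkv = rng.normal(scale=0.02, size=(d_model, 3 * d_model))
b_qkv = np.zeros(3 * d_model)

# 'it'의 embedding + positional encoding 이라고 가정한 입력 vector
x = rng.normal(size=d_model)

# 곱셈 결과는 query, key, value가 접합(concat)된 하나의 긴 vector
qkv = x @ W_qkv + b_qkv
q, k, v = np.split(qkv, 3)
print(qkv.shape, q.shape)  # (2304,) (768,)
```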
<h4 id="gpt-2의-self-attention-15--attention-head로-분할하기">GPT-2의 Self-attention: 1.5- attention head로 분할하기</h4>
<div class="tooltip">
<p>이전 예제에서, “multi-head” 부분을 건너뛰고 self-attention을 바로 살펴봤습니다. 이제 그 개념에 대해 약간의 설명을 하는 것이 좋겠습니다. Self attention은 Q, K, V vector의 다른 부분들에 대해 여러번 수행됩니다. attention head의 “분할(splitting)”은 긴 vector를 matrix로 단순히 재구성하는 것 입니다. small GPT2는 12개의 attention head를 가지며, 그 head의 수가 재구성된 matrix의 첫번째 차원(dimension)이 됩니다.
<span class="tooltiptext">
In the previous examples, we dove straight into self-attention ignoring the “multi-head” part. It would be useful to shed some light on that concept now. Self attention is conducted multiple times on different parts of the Q,K,V vectors. “Splitting” attention heads is simply reshaping the long vector into a matrix. The small GPT2 has 12 attention heads, so that would be the first dimension of the reshaped matrix:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-split-attention-heads-1.png" />
<br />
</div>
<div class="tooltip">
<p>이전 예제에서, attention head 안에서 어떤 일이 일어나는지 살펴보았습니다. 다수의 attention-head를 생각하는 방법은 아래와 같습니다 (만약 12개의 attention head의 3개만을 그림으로 표현한다면):
<span class="tooltiptext">
In the previous examples, we’ve looked at what happens inside one attention head. One way to think of multiple attention-heads is like this (if we’re to only visualize three of the twelve attention heads):
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-split-attention-heads-2.png" />
<br />
</div>
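<p>이 “분할”이 단순한 reshape이라는 점은 다음과 같이 확인할 수 있습니다 (설명을 위한 스케치이며, query vector 값은 임의로 정한 것입니다):</p>

```python
import numpy as np

n_heads, d_model = 12, 768     # GPT-2 small: 12개의 attention head
head_dim = d_model // n_heads  # head 당 64차원

q = np.arange(d_model, dtype=float)  # 한 token의 긴 query vector라고 가정

# "분할"은 단순히 긴 vector를 (n_heads, head_dim) matrix로 재구성하는 것
q_heads = q.reshape(n_heads, head_dim)
print(q_heads.shape)  # (12, 64)
```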
<h4 id="gpt-2의-self-attention-2--score-계산하기">GPT-2의 Self-attention: 2- Score 계산하기</h4>
<div class="tooltip">
<p>이제 score 계산을 진행할 수 있습니다 – 우리는 하나의 attention head만 보고 있으며, 다른 head들도 모두 비슷한 연산을 수행한다는 것을 알고 있습니다:
<span class="tooltiptext">
We can now proceed to scoring – knowing that we’re only looking at one attention head (and that all the others are conducting a similar operation):
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-scoring.png" />
<br />
</div>
<div class="tooltip">
<p>이제 token은 다른 token들의 key에 대해 score 값을 얻을 수 있습니다 (이전 iteration에서 attention head 1번에서 계산되었습니다).
<span class="tooltiptext">
Now the token can get scored against all of keys of the other tokens (that were calculated in attention head #1 in previous iterations):
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-scoring-2.png" />
<br />
</div>
<h4 id="gpt-2의-self-attention-3--합산하기">GPT-2의 Self-attention: 3- 합산하기</h4>
<div class="tooltip">
<p>이전에 살펴본 것과 같이, 각 value를 각 score와 곱하고, 그 결과들을 합산해서, attention-head #1를 위한 self-attention 결과를 만듭니다.
<span class="tooltiptext">
As we’ve seen before, we now multiply each value with its score, then sum them up, producing the result of self-attention for attention-head #1:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-multihead-sum-1.png" />
<br />
</div>
<h4 id="gpt-2의-self-attention-35--attention-head를-합치기merge">GPT-2의 Self-attention: 3.5- attention head를 합치기(merge)</h4>
<div class="tooltip">
<p>여러 attention head를 다루기 위한 방법은, 먼저 이들을 하나의 vector로 접합(concat)하는 것 입니다.
<span class="tooltiptext">
The way we deal with the various attention heads is that we first concatenate them into one vector:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-merge-heads-1.png" />
<br />
</div>
<div class="tooltip">
<p>하지만 이 vector는 아직 다음 순서의 하위 layer로 전달될 준비가 되지 않았습니다. 먼저 hidden state의 이 결과물을 동질적(homogenous) 표현(representation)으로 바꿔야 합니다.
<span class="tooltiptext">
But the vector isn’t ready to be sent to the next sublayer just yet. We need to first turn this Frankenstein’s-monster of hidden states into a homogenous representation.
</span></p>
</div>
<h4 id="gpt-2의-self-attention-4--projecting">GPT-2의 Self-attention: 4- Projecting</h4>
<div class="tooltip">
<p>모델이 접합(concat)된 self-attention 결과를, feed-forward neural network가 처리할 수 있는 하나의 vector로 잘 mapping하도록 학습하게 만들 것 입니다. 여기서 attention head들의 결과를 self-attention 하위 layer의 출력 vector로 projection하는 두번째 큰 weight matrix가 등장합니다:
<span class="tooltiptext">
We’ll let the model learn how to best map concatenated self-attention results into a vector that the feed-forward neural network can deal with. Here comes our second large weight matrix that projects the results of the attention heads into the output vector of the self-attention sublayer:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-project-1.png" />
<br />
</div>
<div class="tooltip">
<p>그리고 이렇게, 다음 layer로 보낼 수 있는 vector를 생성했습니다.
<span class="tooltiptext">
And with this, we have produced the vector we can send along to the next layer:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-self-attention-project-2.png" />
<br />
</div>
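<p>3.5 단계(merge)와 4 단계(projection)를 합쳐서 스케치하면 다음과 같습니다. head의 출력과 projection weight는 임의의 값입니다:</p>

```python
import numpy as np

n_heads, head_dim = 12, 64
d_model = n_heads * head_dim  # 768
rng = np.random.default_rng(0)

# 각 head의 self-attention 결과라고 가정한 (12, 64) matrix
head_outputs = rng.normal(size=(n_heads, head_dim))

# 3.5) merge: head들의 결과를 하나의 vector로 접합(concat)
merged = head_outputs.reshape(d_model)

# 4) projection: 두번째 큰 weight matrix로, 다음 layer가 다룰 수 있는
#    동질적(homogenous)인 vector로 mapping
W_proj = rng.normal(scale=0.02, size=(d_model, d_model))
b_proj = np.zeros(d_model)
out = merged @ W_proj + b_proj
print(out.shape)  # (768,)
```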
<h4 id="gpt-2의-fully-connected-neural-network-1번-레이어">GPT-2의 Fully-Connected Neural Network: #1번 레이어</h4>
<div class="tooltip">
<p>fully-connected neural network은 self-attention이 representation에 적합한 context를 포함시킨 뒤, block이 입력 token을 처리하는 곳 입니다. 이 것은 두 개의 레이어로 구성되어 있습니다. 첫번째 레이어는 모델 사이즈의 4배 입니다 (GPT2 small의 경우 768 이므로, 이 network는 768*4 = 3072 unit 입니다). 왜 4배 일까요? 그 것은 단순히 최초의 transformer에서 사용한 값과 같습니다 (모델 차원이 512 였고, #1번 레이어는 2048 이었습니다). 이 것은 transformer 모델에 주어진(처리해야 하는) task들을 다루기에 충분한 representation 능력/용량을 주는 것으로 보입니다.
<span class="tooltiptext">
The fully-connected neural network is where the block processes its input token after self-attention has included the appropriate context in its representation. It is made up of two layers. The first layer is four times the size of the model (Since GPT2 small is 768, this network would have 768*4 = 3072 units). Why four times? That’s just the size the original transformer rolled with (model dimension was 512 and layer #1 in that model was 2048). This seems to give transformer models enough representational capacity to handle the tasks that have been thrown at them so far.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-mlp1.gif" />
<div class="tooltip">
<p>(bias vector 생략)
<span class="tooltiptext">
(Not shown: A bias vector)
</span></p>
</div>
</div>
<h4 id="gpt-2의-fully-connected-neural-network-2번-레이어---모델-차원으로-projection-하기">GPT-2의 Fully-Connected Neural Network: #2번 레이어 - 모델 차원으로 projection 하기</h4>
<div class="tooltip">
<p>두번째 레이어는 첫번째 레이어의 결과를 모델 차원(dimension; small GPT2의 경우 768)으로 다시 projection 합니다. 이 곱셈 연산의 결과는 이 token에 대한 transformer block의 결과입니다.
<span class="tooltiptext">
The second layer projects the result from the first layer back into model dimension (768 for the small GPT2). The result of this multiplication is the result of the transformer block for this token.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-mlp-2.gif" />
<div class="tooltip">
<p>(bias vector 생략)
<span class="tooltiptext">
(Not shown: A bias vector)
</span></p>
</div>
</div>
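<p>두 레이어의 차원 변화를 코드로 스케치하면 다음과 같습니다. GPT-2가 사용하는 GELU activation은 tanh 근사로 표현했으며, weight는 임의의 값입니다:</p>

```python
import numpy as np

d_model = 768       # GPT-2 small의 모델 차원
d_ff = 4 * d_model  # 3072: 첫번째 레이어는 모델 사이즈의 4배
rng = np.random.default_rng(0)

def gelu(x):
    # GPT-2가 사용하는 GELU activation의 tanh 근사
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=d_model)  # self-attention 하위 layer의 출력이라고 가정
h = gelu(x @ W1 + b1)         # 레이어 #1: 768 -> 3072
out = h @ W2 + b2             # 레이어 #2: 3072 -> 768 (모델 차원으로 projection)
print(h.shape, out.shape)  # (3072,) (768,)
```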
<h3 id="드디어-해냈습니다-it을-만들어냈습니다">드디어 해냈습니다! ‘It’을 만들어냈습니다!</h3>
<div class="tooltip">
<p>이 것이 우리가 다룰 transformer block의 가장 상세한 버전입니다! 당신은 transformer language model 안에서 일어나는 대다수의 것들을 알게 되었습니다. 요약하자면, 우리의 입력 vector는 이러한 weight matrix들을 만납니다:
<span class="tooltiptext">
That’s the most detailed version of the transformer block we’ll get into! You now pretty much have the vast majority of the picture of what happens inside of a transformer language model. To recap, our brave input vector encounters these weight matrices:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-transformer-block-weights-2.png" />
<br />
</div>
<div class="tooltip">
<p>그리고 각 block 마다 이러한 weight들의 세트를 가지고 있습니다. 반면에, 모델은 하나의 token embedding matrix와 하나의 positional encoding matrix 만을 가지고 있습니다.
<span class="tooltiptext">
And each block has its own set of these weights. On the other hand, the model has only one token embedding matrix and one positional encoding matrix:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-weights-2.png" />
<br />
</div>
<div class="tooltip">
<p>만약 모델의 parameter 전체를 보고 싶다면, 여기에 집계해두었습니다:
<span class="tooltiptext">
If you want to see all the parameters of the model, then I have tallied them here:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/gpt2-117-parameters.png" />
<br />
</div>
<div class="tooltip">
<p>어떤 이유에서인지 parameter 수를 모두 더하면 117M개가 아닌 124M개가 됩니다. 이유는 모르겠지만, publish된 code에서 보이는 숫자는 그렇습니다 (만약 제가 틀린 경우 수정해주세요).
<span class="tooltiptext">
They add up to 124M parameters instead of 117M for some reason. I’m not sure why, but that’s how many of them seem to be in the published code (please correct me if I’m wrong).
</span></p>
</div>
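<p>위 집계가 어떻게 약 124M에 이르는지 대략적으로 계산해볼 수 있습니다. token embedding matrix가 출력 layer와 공유(weight tying)된다고 가정한 근사치이며, 세부 항목 구분은 설명을 위한 것입니다:</p>

```python
vocab_size, d_model, n_ctx, n_layer = 50257, 768, 1024, 12

# token embedding + positional encoding matrix
embeddings = vocab_size * d_model + n_ctx * d_model

# transformer block 하나의 weight들
attn = d_model * 3 * d_model + 3 * d_model  # query/key/value 생성 (weight + bias)
attn += d_model * d_model + d_model         # attention projection
mlp = d_model * 4 * d_model + 4 * d_model   # fully-connected 레이어 #1 (768 -> 3072)
mlp += 4 * d_model * d_model + d_model      # fully-connected 레이어 #2 (3072 -> 768)
layer_norms = 2 * (2 * d_model)             # block 당 layer norm 2개 (scale + bias)
per_block = attn + mlp + layer_norms

# 마지막 layer norm(2 * d_model)까지 포함한 총합
total = embeddings + n_layer * per_block + 2 * d_model
print(f"{total:,}")  # 124,439,808 -> 약 124M
```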
<h2 id="파트-3-language-modeling-그-이상의-것-">파트 3: Language Modeling, 그 이상의 것 <a href="#part-3-beyond-language-modeling" name="part-3-beyond-language-modeling">#</a></h2>
<div class="tooltip">
<p>decoder-only transformer는 language modeling 이상의 가능성들을 계속 보여줍니다. 성공을 보여준 application들이 많이 있습니다. 이러한 application 몇 개를 보면서 이번 포스팅을 마치고자 합니다.
<span class="tooltiptext">
The decoder-only transformer keeps showing promise beyond language modeling. There are plenty of applications where it has shown success which can be described by similar visuals as the above. Let’s close this post by looking at some of these applications
</span></p>
</div>
<h3 id="기계번역machine-translation">기계번역(Machine Translation)</h3>
<div class="tooltip">
<p>번역(translation)을 하는데에 encoder가 필요하지 않습니다. 이 task를 decoder-only transformer로 처리할 수 있습니다:
<span class="tooltiptext">
An encoder is not required to conduct translation. The same task can be addressed by a decoder-only transformer:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/decoder-only-transformer-translation.png" />
<br />
</div>
<h3 id="요약summarization">요약(Summarization)</h3>
<div class="tooltip">
<p>요약(Summarization)은 첫번째 decoder-only transformer가 학습된 task 입니다. 즉, (목차 앞쪽의 서두 부분을 제외하고) 위키피디아 아티클을 읽고 요약하도록 학습했습니다. 실제 서두 부분은 학습 dataset에서 레이블로 사용되었습니다:
<span class="tooltiptext">
This is the task that the first decoder-only transformer was trained on. Namely, it was trained to read a wikipedia article (without the opening section before the table of contents), and to summarize it. The actual opening sections of the articles were used as the labels in the training dataset:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/wikipedia-summarization.png" />
<br />
</div>
<div class="tooltip">
<p>이 논문에서는 위키피디아 아티클에 대해 모델을 학습시켰고, 학습된 모델은 아티클을 요약할 수 있었습니다:
<span class="tooltiptext">
The paper trained the model against wikipedia articles, and thus the trained model was able to summarize articles:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/decoder-only-summarization.png" />
<br />
</div>
<h3 id="전이-학습transfer-learning">전이 학습(Transfer Learning)</h3>
<div class="tooltip">
<p><a href="https://arxiv.org/abs/1905.08836">Sample Efficient Text Summarization Using a Single Pre-Trained Transformer</a> 논문에서, decoder-only transformer는 먼저 language model에 대해 pre-train 하고, 요약(summary)에 대해 finetuning 했습니다. 이 것은 제한된 data 설정에서 encoder-decoder transformer를 pre-train하는 것 보다 더 좋은 결과를 보였습니다.
<span class="tooltiptext">
In <a href="https://arxiv.org/abs/1905.08836">Sample Efficient Text Summarization Using a Single Pre-Trained Transformer</a>, a decoder-only transformer is first pre-trained on language modeling, then finetuned to do summarization. It turns out to achieve better results than a pre-trained encoder-decoder transformer in limited data settings.
</span></p>
</div>
<div class="tooltip">
<p>GPT2 논문도 language modeling에 대해 pre-train한 뒤에 요약(summary) task의 결과를 보여줍니다.
<span class="tooltiptext">
The GPT2 paper also shows results of summarization after pre-training the model on language modeling.
</span></p>
</div>
<h3 id="음악-생성music-generation">음악 생성(Music Generation)</h3>
<div class="tooltip">
<p><a href="https://magenta.tensorflow.org/music-transformer">Music Transformer</a>는 decoder-only transformer를 사용하여 expressive timing과 dynamics를 갖춘 음악을 생성합니다. “Music Modeling”은 language modeling과 같습니다 – 모델이 unsupervised한 방법으로 음악을 학습하도록 하고, 그 다음 샘플을 출력하게 합니다 (우리가 이전에 “rambling”이라고 불렀던 것입니다).
<span class="tooltiptext">
The <a href="https://magenta.tensorflow.org/music-transformer">Music Transformer</a> uses a decoder-only transformer to generate music with expressive timing and dynamics. “Music Modeling” is just like language modeling – just let the model learn music in an unsupervised way, then have it sample outputs (what we called “rambling”, earlier).
</span></p>
</div>
<div class="tooltip">
<p>이 시나리오에서 음악이 어떻게 represent 되는지 궁금할 것 입니다. language modeling은 문자(character), 단어(word), 또는 단어의 일부인 토큰(token)의 vector representation을 통해 수행될 수 있다는 것을 기억하세요. 음악 연주의 경우 (지금은 피아노를 생각해보세요), 우리는 음표(note)뿐만 아니라 velocity – 피아노 건반을 얼마나 세게 눌렀는지에 대한 측정값 – 도 표현해야 합니다.
<span class="tooltiptext">
You might be curious as to how music is represented in this scenario. Remember that language modeling can be done through vector representations of either characters, words, or tokens that are parts of words. With a musical performance (let’s think about the piano for now), we have to represent the notes, but also velocity – a measure of how hard the piano key is pressed.
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/music-transformer-performance-encoding-3.png" />
<br />
</div>
<div class="tooltip">
<p>연주는 이러한 일련의 one-hot vector들일 뿐입니다. midi 파일은 이러한 포맷으로 변환될 수 있습니다. 이 논문에는 다음과 같은 입력 sequence 예시가 있습니다:
<span class="tooltiptext">
A performance is just a series of these one-hot vectors. A midi file can be converted into such a format. The paper has the following example input sequence:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/music-representation-example.png" />
<br />
</div>
<div class="tooltip">
<p>이 입력 sequence에 대한 one-hot vector representation은 이런 모양일 것 입니다:
<span class="tooltiptext">
The one-hot vector representation for this input sequence would look like this:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/music-transformer-input-representation-2.png" />
<br />
</div>
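<p>이러한 one-hot 표현을 단순화해서 스케치하면 다음과 같습니다. 여기서의 event vocabulary는 설명을 위한 가상의 축소판이며, 실제 논문의 vocabulary(NOTE_ON/NOTE_OFF, TIME_SHIFT, velocity event들)는 훨씬 큽니다:</p>

```python
import numpy as np

# 설명을 위한 가상의(hypothetical) 축소 event vocabulary
vocab = ["NOTE_ON<60>", "NOTE_ON<64>", "TIME_SHIFT<500>",
         "NOTE_OFF<60>", "NOTE_OFF<64>", "SET_VELOCITY<80>"]

def one_hot(event):
    # 각 event를 vocabulary 크기의 one-hot vector로 변환
    v = np.zeros(len(vocab))
    v[vocab.index(event)] = 1.0
    return v

# 연주(performance)는 이러한 one-hot vector들의 sequence일 뿐
performance = ["SET_VELOCITY<80>", "NOTE_ON<60>", "TIME_SHIFT<500>", "NOTE_OFF<60>"]
sequence = np.stack([one_hot(e) for e in performance])
print(sequence.shape)  # (4, 6): event 4개, vocabulary 크기 6
```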
<div class="tooltip">
<p>저는 Music Transformer에서 self-attention을 표현하는 이 논문의 그림을 좋아합니다. 여기에 주석을 조금 달았습니다.
<span class="tooltiptext">
I love a visual in the paper that showcases self-attention in the Music Transformer. I’ve added some annotations to it here:
</span></p>
</div>
<div class="img-div-any-width">
<image src="/images/gpt2/music-transformer-self-attention-2.png" />
<br />
<div class="tooltip">
<p>“그림 8: 이 작품은 반복되는 삼각형 형태를 가지고 있습니다. query는 뒷쪽 peak들 중 하나에 있고, 곡의 시작부분에 이르기까지 모든 이전 고음(peak에 있는)에 attention을 줍니다.” …“[이] 그림은 query (모든 attention 선의 source)와 attention을 받는 이전 메모리(더 많은 softmax 확률을 받는 음표(note)가 강조됨)를 보여줍니다. attention line의 색상은 서로 다른 head에 해당하고 두께는 softmax 확률의 가중치(weight)에 해당합니다.”
<span class="tooltiptext">
“Figure 8: This piece has a recurring triangular contour. The query is at one of the latter peaks and it attends to all of the previous high notes on the peak, all the way to beginning of the piece.” … “[The] figure shows a query (the source of all the attention lines) and previous memories being attended to (the notes that are receiving more softmax probability is highlighted in). The coloring of the attention lines correspond to different heads and the width to the weight of the softmax probability.”
</span></p>
</div>
</div>
<div class="tooltip">
<p>악보 representation에 대해 더 알고 싶으시면 <a href="https://www.youtube.com/watch?v=ipzR9bhei_o">이 영상</a>을 참고해보세요.
<span class="tooltiptext">
If you’re unclear on this representation of musical notes, <a href="https://www.youtube.com/watch?v=ipzR9bhei_o">check out this video</a>.
</span></p>
</div>
<h2 id="결론">결론</h2>
<div class="tooltip">
<p>이것으로 GPT2의 전반적인 과정과 상위 모델인 decoder-only transformer에 대한 탐험을 마치겠습니다. 이 포스팅을 통해 self-attention에 대한 더 깊은 이해와 transformer 내부에서 일어나는 것들에 대해 이해하는 것이 더 수월하기를 바랍니다.
<span class="tooltiptext">
This concludes our journey into the GPT2, and our exploration of its parent model, the decoder-only transformer. I hope that you come out of this post with a better understanding of self-attention and more comfort that you understand more of what goes on inside a transformer.
</span></p>
</div>
<h2 id="참고자료">참고자료</h2>
<div class="tooltip">
<ul>
<li>OpenAI의 <a href="https://github.com/openai/gpt-2">GPT2 구현</a></li>
<li><a href="https://huggingface.co/">Hugging Face</a>의 <a href="https://github.com/huggingface/pytorch-transformers">pytorch-transformers</a> 라이브러리를 확인해보세요. GPT2 외에도 BERT, Transformer-XL, XLNet 등 최신 transformer model들을 구현하고 있습니다.
<span class="tooltiptext">
<span>*</span> The <a href="https://github.com/openai/gpt-2">GPT2 Implementation</a> from OpenAI
<span>*</span> Check out the <a href="https://github.com/huggingface/pytorch-transformers">pytorch-transformers</a> library from <a href="https://huggingface.co/">Hugging Face</a> in addition to GPT2, it implements BERT, Transformer-XL, XLNet and other cutting-edge transformer models.
</span></li>
</ul>
</div>
<h2 id="감사의-글">감사의 글</h2>
<div class="tooltip">
<p><a href="https://twitter.com/lukaszkaiser">Lukasz Kaiser</a>, <a href="https://www.cl.uzh.ch/de/people/team/compling/mmueller.html">Mathias Müller</a>, <a href="https://twitter.com/peterjliu">Peter J. Liu</a>, <a href="https://twitter.com/rsepassi">Ryan Sepassi</a>, <a href="https://www.linkedin.com/in/mohammad-saleh-39614224/">Mohammad Saleh</a>님들께 이 포스팅의 이전 버전에서 피드백을 주셔서 감사합니다.
<span class="tooltiptext">
Thanks to <a href="https://twitter.com/lukaszkaiser">Lukasz Kaiser</a>, <a href="https://www.cl.uzh.ch/de/people/team/compling/mmueller.html">Mathias Müller</a>, <a href="https://twitter.com/peterjliu">Peter J. Liu</a>, <a href="https://twitter.com/rsepassi">Ryan Sepassi</a> and <a href="https://www.linkedin.com/in/mohammad-saleh-39614224/">Mohammad Saleh</a> for feedback on earlier versions of this post.
</span></p>
</div>
<div class="tooltip">
<p>(원문에 대한) 의견이나 수정 사항이 있다면 <a href="https://twitter.com/JayAlammar">@JayAlammar</a>로 tweet 해주세요.
<br />
<span class="tooltiptext">
Comments or corrections? Please tweet me at <a href="https://twitter.com/JayAlammar">@JayAlammar</a>
</span></p>
</div>
<hr />
<h2 id="추가-정보">추가 정보<a href="#additional-info" name="additional-info">.</a></h2>
<ul>
<li>이 글은 GPT2에 대해 이해하기 쉽게 그림으로 설명한 포스팅을 저자인 Jay Alammar님의 허락을 받고 번역한 글 입니다. 원문은 <a href="https://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT-2 (Visualizing Transformer Language Models)</a>에서 확인하실 수 있습니다.</li>
<li>원서/영문블로그를 보실 때 term에 대한 정보 호환을 위해, 이 분야에서 사용하고 있는 단어, 문구에 대해 가급적 번역하지 않고 원문 그대로 두었습니다. 그리고, 직역 보다는 개념이나 의미에 대한 설명을 쉽게 하는 문장 쪽으로 더 무게를 두어 번역 했습니다. 번역에 대한 의견이나 수정 사항은 아래 댓글 창에 남겨주세요.</li>
<li>번역문에 대응하는 영어 원문을 보고싶으신 분들을 위해 <a href="https://nlpinkorean.github.io">찬</a>님께서 만들어두신 툴팁 도움말 기능(해당 문단에 마우스를 올리면 (모바일의 경우 터치) 원문을 확인할 수 있는 기능)을 가져와서 적용했습니다. 감사합니다.</li>
</ul>
<!--
### Just Add Memory
So far, our models have only considered the keys and values from the current segment. What's to stop us from adding a bunch more keys and values representing words from previous tokens? Nothing stops us! That's exactly what memory is in this context
<div class="img-div-any-width" markdown="0">
<image src="/images/xlnet/memory-self-attention.png"/>
<br />
</div>
And there we have it! The model can now incorporate all previous tokens in previous segments into the self-attention calculation.
Let's go over an example to make sure we're on the same page. Say we want to process the first eight words of The Second Law using a toy memory-transformer with one block and four token segment length. From now on, we'll show vectors vertically rather than horizontally so we can squeeze them into matrices:
### Memory-Compression
In practice, we can very quickly run out memory if we memorize the keys and values of all previous tokens in a long text sequence. Here, we can turn to the idea of compressing this memory to save space. If we're to rotate our key and value vectors like the following:
<div class="img-div-any-width" markdown="0">
<image src="/images/xlnet/keys-and-values.png"/>
<br />
</div>
We can compress it by compressing every three vectors into one:
<div class="img-div-any-width" markdown="0">
<image src="/images/xlnet/transformer-memory-compression.png"/>
<br />
</div>
The compression is done using a convolutional neural network which learns (during training time) how to effectively turn every three key vectors into a a single vector. Likewise with the values vectors. Again, in technical jargon, the compression is done using a CNN with a kernel size of 3 and a stride of 3.
-->chloamme이 글은 Jay Alammar님의 글을 번역한 글입니다. [추가정보] This post is a translated version of The Illustrated GPT-2 (Visualizing Transformer Language Models) by Jay Alammar.