[Practice] VSMs (Vector space models)
Vector space models: designs, distances, basic reweighting
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2019"
# This document contains explanations added (not a translation) while studying Professor Christopher Potts' CS224U course, shared under the Apache License 2.0.
Contents
 Overview
 Motivation
 Terminological notes
 Setup
 Matrix designs
 Precomputed example matrices
 Vector comparison
 Distributional neighbors
 Matrix reweighting
 Subword information
 Visualization
 Exploratory exercises
Overview
This notebook is the first in our series about creating effective distributed representations. The focus is on matrix designs, assessing similarity, and methods for matrix reweighting.
The central idea (which takes some getting used to!) is that we can represent words and phrases as dense vectors of real numbers. These take on meaning by being embedded in a larger matrix of representations with comparable structure.
We will look at distributed representations: matrix design, similarity calculation, matrix reweighting, and so on.
Motivation
Why build distributed representations? There are potentially many reasons. The two we will emphasize in this course:

Understanding words in context: There is value to linguists in seeing what these data-rich approaches can teach us about natural language lexicons, and there is value for social scientists in understanding how words are being used.

Feature representations for other models: As we will see, many models can benefit from representing examples as distributed representations.
Building distributed representations is worthwhile in that (1) it helps us understand words in context, and (2) the representations can serve as feature representations for other models.
Terminological notes

The distributed representations we build will always be vectors of real numbers. The models are often called vector space models (VSMs).

Distributional representations are the special case where the data come entirely from co-occurrence counts in corpora.

We’ll look at models that use supervised labels to obtain vector-based word representations. These aren’t purely distributional, in that they take advantage of more than just co-occurrence patterns among items in the vocabulary, but they share the idea that words can be modeled with vectors.

If a neural network is used to train the representations, then they might be called neural representations.

The term word embedding is also used for distributed representations, including distributional ones. This term is a reminder that vector representations are meaningful only when embedded in and compared with others in a unified space (usually a matrix) of representations of the same type.

In any case, distributed representation seems like the most general cover term for what we’re trying to achieve, and its only downside is that sometimes people think it has something to do with distributed databases.
 Building a distributed representation yields vectors of real numbers; such models are also called vector space models (VSMs).
 When the data come entirely from co-occurrence counts in corpora, we call it a distributional representation.
 When the representations are trained with a neural network, they are called neural representations.
 The term word embedding can be used for distributed representations, including distributional ones.
 Distributed representation is the most general cover term (though it has nothing to do with distributed databases).
See From Frequency to Meaning: Vector Space Models for Semantics.
Setup

Make sure your environment meets all the requirements for the cs224u repository. For help getting setup, see setup.ipynb.

Download the course data, unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change DATA_HOME below.)
 We will examine the implementations from vsm.py by entering them directly into this notebook.
 Since the imdb corpus has been replaced by the yelp corpus in the data distribution, the examples below are adjusted accordingly.
%matplotlib inline
import numpy as np
import os
import pandas as pd
# import vsm
import scipy
import scipy.spatial.distance
from collections import defaultdict
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
DATA_HOME = os.path.join('data', 'vsmdata')
!ls $DATA_HOME
giga_window20flat.csv.gz yelp_window20flat.csv.gz
giga_window5scaled.csv.gz yelp_window5scaled.csv.gz
Matrix designs
There are many, many ways to define distributional matrices. Here’s a schematic overview that highlights the major decisions for building a word $\times$ word matrix:

Define a notion of co-occurrence context. This could be an entire document, a paragraph, a sentence, a clause, an NP — whatever domain seems likely to capture the associations you care about.

Define a count scaling method. The simplest method just counts everything in the context window, giving equal weight to everything inside it. A common alternative is to scale the weights based on proximity to the target word – e.g., $1/d$, where $d$ is the distance in tokens from the target.

Scan through your corpus building a dictionary $d$ mapping word pairs to co-occurrence values. Every time a pair of words $w$ and $w’$ occurs in the same context (as you defined it in 1), increment $d[(w, w’)]$ by whatever value is determined by your weighting scheme. You’d increment by $1$ with the weighting scheme that simply counts co-occurrences.
 Using the count dictionary $d$ that you collected in 3, establish your full vocabulary $V$, an ordered list of word types.
 For large collections of documents, $V$ will typically be huge. You will probably want to winnow the vocabulary at this point.
 You might do this by filtering to a specific subset, or just imposing a minimum count threshold.
 You might impose a minimum count threshold even if $V$ is small — for words with very low counts, you simply don’t have enough evidence to support good representations.
 For words outside the vocabulary you choose, you could ignore them entirely or accumulate all their values into a designated UNK vector.
 Now build a matrix $M$ of dimension $V \times V$. Both the rows and the columns of $M$ represent words. Each cell $M[i, j]$ is filled with the value $d[(w_{i}, w_{j})]$.
There are many ways to define a distributional matrix, but a word $\times$ word matrix can be built in the following order:
 Define a notion of co-occurrence context. It can be anything over which you want to capture associations: a whole document, a paragraph, a sentence, a clause, a noun phrase, etc.
 Define a count scaling method. The simplest gives every token in the context window the same weight; alternatively, scale the weight as $1/d$, where $d$ is the distance from the target token.
 Scan the corpus and build a dictionary $d$ mapping word pairs to co-occurrence values. For every pair of words $w$ and $w’$ appearing in the same context (as defined in step 1), increment by the value given by the weighting scheme from step 2.
 Using the count dictionary $d$ built in step 3, establish the full vocab $V$ (an ordered list of word types).
 Build a $V \times V$ matrix $M$ whose rows and columns are words. Each cell $M[i, j]$ is filled with the co-occurrence value $d[(w_{i}, w_{j})]$.
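To make the steps above concrete, here is a minimal sketch (added here; not part of vsm.py) of building a small word $\times$ word count matrix with flat, increment-by-1 weighting over a fixed token window. The toy sentences and the helper name build_cooccurrence_matrix are made up for illustration.
from collections import defaultdict

import pandas as pd

def build_cooccurrence_matrix(sentences, window=2, min_count=1):
    d = defaultdict(float)     # step 3: word pair -> co-occurrence value
    counts = defaultdict(int)  # raw unigram counts, used for winnowing
    for toks in sentences:
        for i, w in enumerate(toks):
            counts[w] += 1
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    d[(w, toks[j])] += 1.0  # flat weighting: increment by 1
    # Step 4: establish the vocabulary, winnowed by a minimum count threshold.
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    # Step 5: fill the V x V matrix from the pair dictionary.
    M = pd.DataFrame(0.0, index=vocab, columns=vocab)
    for (w1, w2), val in d.items():
        if w1 in M.index and w2 in M.columns:
            M.loc[w1, w2] = val
    return M

toy_corpus = [
    ["good", "food", "and", "good", "service"],
    ["bad", "food", "and", "bad", "service"]]
build_cooccurrence_matrix(toy_corpus, window=2)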
Precomputed example matrices
The data distribution includes four matrices that we’ll use for hands-on exploration. All of them were designed in the same basic way:

They are word $\times$ word matrices with 6K rows and 6K columns.

The vocabulary is the top 6K most frequent unigrams.
Two come from Yelp user-supplied reviews, and two come from Gigaword, a collection of newswire and newspaper text. Further details:
filename  source  window size  count weighting

yelp_window5scaled.csv.gz  Yelp reviews  5  1/d
yelp_window20flat.csv.gz  Yelp reviews  20  1
giga_window5scaled.csv.gz  Gigaword  5  1/d
giga_window20flat.csv.gz  Gigaword  20  1
Any hunches about how these matrices might differ from each other?
yelp5 = pd.read_csv(
os.path.join(DATA_HOME, 'yelp_window5scaled.csv.gz'), index_col=0)
yelp5
):  );  ..  ...  :(  :)  :/  :D  :  ;p  ...  younger  your  yourself  youth  zebra  zero  zinc  zombie  zone  zoo  

):  154.566667  0.000000  16.516667  60.750000  2.550000  8.466667  0.000000  0.866667  0.000000  0.00  ...  0.666667  40.516667  4.616667  0.416667  0.000000  4.450000  0.000000  0.166667  0.166667  0.333333 
);  0.000000  24.866667  3.200000  36.200000  0.000000  1.000000  0.000000  0.000000  0.000000  0.00  ...  0.250000  32.733333  2.200000  0.000000  0.000000  0.533333  0.000000  0.250000  0.450000  0.700000 
..  16.516667  3.200000  13494.866667  6945.366667  196.750000  501.650000  40.450000  29.400000  0.700000  2.15  ...  31.850000  2908.116667  247.366667  3.183333  1.783333  146.733333  1.250000  9.733333  23.816667  20.300000 
...  60.750000  36.200000  6945.366667  62753.766667  783.216667  1711.583333  161.383333  101.700000  4.883333  7.85  ...  119.100000  12530.533333  1259.683333  29.550000  4.516667  675.816667  8.833333  33.400000  108.783333  77.583333 
:(  2.550000  0.000000  196.750000  783.216667  423.433333  13.950000  2.133333  0.166667  0.366667  0.00  ...  0.333333  72.433333  5.333333  0.000000  0.000000  7.650000  0.000000  0.533333  4.533333  0.166667 
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... 
zero  4.450000  0.533333  146.733333  675.816667  7.650000  8.183333  0.750000  0.200000  0.000000  0.00  ...  1.466667  121.233333  11.483333  0.333333  0.000000  476.733333  0.000000  0.783333  3.583333  0.866667 
zinc  0.000000  0.000000  1.250000  8.833333  0.000000  0.500000  0.000000  0.000000  0.000000  0.00  ...  0.000000  2.600000  0.166667  0.000000  0.000000  0.000000  2.066667  0.000000  0.000000  0.000000 
zombie  0.166667  0.250000  9.733333  33.400000  0.533333  1.733333  0.000000  0.000000  0.000000  0.00  ...  0.750000  20.700000  1.866667  0.000000  0.166667  0.783333  0.000000  25.400000  4.616667  0.500000 
zone  0.166667  0.450000  23.816667  108.783333  4.533333  7.066667  0.000000  0.000000  0.000000  0.00  ...  0.916667  257.866667  4.783333  0.000000  0.000000  3.583333  0.000000  4.616667  28.800000  0.000000 
zoo  0.333333  0.700000  20.300000  77.583333  0.166667  4.850000  0.000000  0.000000  0.000000  0.00  ...  1.000000  31.783333  3.533333  0.000000  0.333333  0.866667  0.000000  0.500000  0.000000  85.166667 
6000 rows × 6000 columns
yelp20 = pd.read_csv(
os.path.join(DATA_HOME, 'yelp_window20flat.csv.gz'), index_col=0)
yelp20
):  );  ..  ...  :(  :)  :/  :D  :  ;p  ...  younger  your  yourself  youth  zebra  zero  zinc  zombie  zone  zoo  

):  3910.0  24.0  250.0  1162.0  48.0  261.0  4.0  22.0  0.0  1.0  ...  6.0  945.0  61.0  3.0  1.0  27.0  1.0  1.0  2.0  3.0 
);  24.0  1738.0  55.0  595.0  7.0  27.0  0.0  2.0  0.0  0.0  ...  5.0  564.0  29.0  0.0  0.0  7.0  0.0  2.0  3.0  5.0 
..  250.0  55.0  330680.0  154726.0  1605.0  5707.0  311.0  336.0  10.0  21.0  ...  437.0  40631.0  2680.0  36.0  12.0  1441.0  8.0  107.0  237.0  182.0 
...  1162.0  595.0  154726.0  1213778.0  6404.0  21831.0  1230.0  1172.0  41.0  73.0  ...  1684.0  174619.0  13862.0  323.0  57.0  6345.0  71.0  405.0  1208.0  840.0 
:(  48.0  7.0  1605.0  6404.0  1646.0  528.0  55.0  33.0  4.0  1.0  ...  9.0  1425.0  84.0  1.0  1.0  66.0  0.0  5.0  15.0  7.0 
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... 
zero  27.0  7.0  1441.0  6345.0  66.0  110.0  6.0  4.0  0.0  0.0  ...  43.0  4141.0  368.0  4.0  0.0  3894.0  0.0  14.0  28.0  19.0 
zinc  1.0  0.0  8.0  71.0  0.0  4.0  0.0  0.0  0.0  0.0  ...  0.0  34.0  7.0  0.0  0.0  0.0  98.0  0.0  0.0  0.0 
zombie  1.0  2.0  107.0  405.0  5.0  22.0  0.0  4.0  0.0  0.0  ...  3.0  287.0  28.0  1.0  1.0  14.0  0.0  604.0  22.0  1.0 
zone  2.0  3.0  237.0  1208.0  15.0  62.0  1.0  1.0  0.0  0.0  ...  14.0  1721.0  99.0  3.0  1.0  28.0  0.0  22.0  764.0  14.0 
zoo  3.0  5.0  182.0  840.0  7.0  47.0  0.0  1.0  0.0  0.0  ...  16.0  699.0  53.0  2.0  10.0  19.0  0.0  1.0  14.0  3148.0 
6000 rows × 6000 columns
giga5 = pd.read_csv(
os.path.join(DATA_HOME, 'giga_window5scaled.csv.gz'), index_col=0)
giga5
):  );  ..  ...  :(  :)  :/  :D  :  ;p  ...  younger  your  yourself  youth  zebra  zero  zinc  zombie  zone  zoo  

):  4.300000  8.100000  7.700000  113.766667  2.666667  1.116667  0.50  0.0  0.0  0.0  ...  2.850000  147.633333  8.483333  4.500000  0.000000  2.733333  0.200000  0.333333  344.066667  0.700000 
);  8.100000  1092.400000  0.650000  52.783333  0.666667  0.000000  0.00  0.0  0.0  0.0  ...  7.100000  44.400000  3.333333  7.116667  0.400000  1.616667  1.000000  0.500000  11.516667  1.250000 
..  7.700000  0.650000  18900.766667  12657.800000  0.500000  0.500000  0.00  0.5  0.0  0.0  ...  1.166667  29.883333  0.833333  0.250000  0.000000  0.916667  0.250000  0.000000  2.983333  1.083333 
...  113.766667  52.783333  12657.800000  131656.033333  0.866667  0.500000  0.85  0.0  0.0  0.0  ...  40.333333  2187.066667  89.750000  42.000000  8.200000  30.683333  0.666667  2.333333  79.233333  11.083333 
:(  2.666667  0.666667  0.500000  0.866667  0.000000  0.333333  0.00  0.0  0.0  0.0  ...  0.000000  0.333333  0.000000  0.666667  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000 
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... 
zero  2.733333  1.616667  0.916667  30.683333  0.000000  0.000000  0.00  0.0  0.0  0.0  ...  0.666667  26.516667  1.533333  1.600000  0.000000  270.333333  0.000000  0.000000  6.900000  0.000000 
zinc  0.200000  1.000000  0.250000  0.666667  0.000000  0.000000  0.00  0.0  0.0  0.0  ...  0.000000  7.116667  0.000000  0.000000  0.000000  0.000000  24.966667  0.000000  0.000000  0.000000 
zombie  0.333333  0.500000  0.000000  2.333333  0.000000  0.000000  0.00  0.0  0.0  0.0  ...  0.200000  2.700000  0.000000  0.250000  0.000000  0.000000  0.000000  4.700000  0.500000  0.000000 
zone  344.066667  11.516667  2.983333  79.233333  0.000000  0.000000  0.00  0.0  0.0  0.0  ...  1.000000  112.033333  3.850000  2.333333  0.666667  6.900000  0.000000  0.500000  248.466667  0.816667 
zoo  0.700000  1.250000  1.083333  11.083333  0.000000  0.000000  0.25  0.0  0.0  0.0  ...  1.000000  9.383333  0.250000  0.400000  1.500000  0.000000  0.000000  0.000000  0.816667  141.133333 
6000 rows × 6000 columns
giga20 = pd.read_csv(
os.path.join(DATA_HOME, 'giga_window20flat.csv.gz'), index_col=0)
giga20
):  );  ..  ...  :(  :)  :/  :D  :  ;p  ...  younger  your  yourself  youth  zebra  zero  zinc  zombie  zone  zoo  

):  7046.0  517.0  29.0  1684.0  45.0  10.0  15.0  0.0  1.0  0.0  ...  34.0  2716.0  318.0  75.0  0.0  18.0  11.0  5.0  1128.0  11.0 
);  517.0  108712.0  19.0  288.0  15.0  0.0  0.0  1.0  1.0  0.0  ...  105.0  872.0  34.0  80.0  5.0  19.0  15.0  5.0  111.0  13.0 
..  29.0  19.0  101566.0  116373.0  1.0  1.0  6.0  2.0  0.0  0.0  ...  72.0  886.0  17.0  24.0  23.0  75.0  33.0  0.0  225.0  10.0 
...  1684.0  288.0  116373.0  1627084.0  27.0  48.0  7.0  3.0  0.0  0.0  ...  671.0  27690.0  1158.0  886.0  392.0  417.0  6.0  34.0  1522.0  149.0 
:(  45.0  15.0  1.0  27.0  50.0  5.0  2.0  1.0  0.0  0.0  ...  1.0  49.0  3.0  2.0  0.0  0.0  0.0  0.0  1.0  0.0 
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ... 
zero  18.0  19.0  75.0  417.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  25.0  596.0  49.0  26.0  0.0  3460.0  2.0  0.0  140.0  7.0 
zinc  11.0  15.0  33.0  6.0  0.0  0.0  0.0  0.0  3.0  0.0  ...  0.0  85.0  2.0  1.0  0.0  2.0  1366.0  0.0  2.0  1.0 
zombie  5.0  5.0  0.0  34.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  3.0  65.0  1.0  3.0  0.0  0.0  0.0  188.0  3.0  0.0 
zone  1128.0  111.0  225.0  1522.0  1.0  0.0  0.0  1.0  0.0  1.0  ...  54.0  1464.0  60.0  52.0  7.0  140.0  2.0  3.0  11250.0  19.0 
zoo  11.0  13.0  10.0  149.0  0.0  0.0  1.0  0.0  0.0  0.0  ...  34.0  257.0  15.0  16.0  36.0  7.0  1.0  0.0  19.0  5568.0 
6000 rows × 6000 columns
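Before moving on, here is one rough, added way to start answering that question: compare the total mass and the fraction of nonzero cells in each matrix, which gives a first hint at how window size and corpus genre change the counts.
for name, df in (('yelp5', yelp5), ('yelp20', yelp20),
                 ('giga5', giga5), ('giga20', giga20)):
    vals = df.values
    # Expect flat, wider windows to accumulate more mass and fill more cells.
    print('{:>7}  total: {:>14.1f}  nonzero cells: {:.1%}'.format(
        name, vals.sum(), (vals > 0).mean()))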
Vector comparison
Vector comparisons form the heart of our analyses in this context.

For the most part, we are interested in measuring the distance between vectors. The guiding idea is that semantically related words should be close together in the vector spaces we build, and semantically unrelated words should be far apart.

The scipy.spatial.distance module has a lot of vector comparison methods, so you might check them out if you want to go beyond the functions defined and explored here. Read the documentation closely, though: many of those methods are defined only for binary vectors, whereas the VSMs we’ll use allow all float values.
Euclidean
The most basic and intuitive distance measure between vectors is euclidean distance. The euclidean distance between two vectors $u$ and $v$ of dimension $n$ is
\[\textbf{euclidean}(u, v) = \sqrt{\sum_{i=1}^{n} |u_{i} - v_{i}|^{2}}\]
In two dimensions, this corresponds to the length of the most direct line between the two points.
In vsm.py, the function euclidean just uses the corresponding scipy.spatial.distance method to define it.
Here’s the tiny vector space from the screencast on vector comparisons associated with this notebook:
Running example: focus on distance measures, illustrated with row vectors.
Suppose the words A, B, and C appear in two documents (or corpora) x and y; or, alternatively, that they co-occur with two other words x and y.
ABC = pd.DataFrame([
[ 2.0, 4.0],
[10.0, 15.0],
[14.0, 10.0]],
index=['A', 'B', 'C'],
columns=['x', 'y'])
ABC
x  y  

A  2.0  4.0 
B  10.0  15.0 
C  14.0  10.0 
def plot_ABC(df):
ax = df.plot.scatter(x='x', y='y', marker='.', legend=False)
m = df.values.max(axis=None)
ax.set_xlim([0, m*1.2])
ax.set_ylim([0, m*1.2])
for label, row in df.iterrows():
ax.text(row['x'], row['y'], label)
plot_ABC(ABC)
The euclidean distances align well with raw visual distance in the plot:
def euclidean(u, v):
return scipy.spatial.distance.euclidean(u, v)
def abc_comparisons(df, distfunc):
"""(A,B)간, (B,C)간 거리 함수를 parameter로 받아서, 두 점 사이의 거리를 계산하는 함수
"""
for a, b in (('A', 'B'), ('B', 'C')):
dist = distfunc(df.loc[a], df.loc[b]) # df.loc[a]: Series type
print('{0:}({1:}, {2:}) = {3:7.02f}'.format(
distfunc.__name__, a, b, dist))
abc_comparisons(ABC, euclidean) # pass the raw dataframe ABC with plain x, y values
euclidean(A, B) = 13.60
euclidean(B, C) = 6.40
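Checking the first of these by hand against the definition above, with $A = (2, 4)$ and $B = (10, 15)$:
\[\textbf{euclidean}(A, B) = \sqrt{(2-10)^{2} + (4-15)^{2}} = \sqrt{185} \approx 13.60\]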
However, suppose we think of the vectors as word meanings in the vector-space sense. In that case, the values don’t look good:

The distributions of B and C are more or less directly opposed, suggesting very different meanings, whereas A and B are rather closely aligned, abstracting away from the fact that the first is far less frequent than the second.

In terms of the large models we will soon explore, A and B resemble a pair like superb and good, which have similar meanings but very different frequencies.

In contrast, B and C are like good and disappointing — similar overall frequencies but different distributions with respect to the overall vocabulary.
Length normalization
These affinities are immediately apparent if we normalize the vectors by their length. To do this, we first define the L2 length of a vector:
\[\|u\|_{2} = \sqrt{\sum_{i=1}^{n} u_{i}^{2}}\]
And then the normalization step just divides each value by this quantity:
\[\left[ \frac{u_{1}}{\|u\|_{2}}, \frac{u_{2}}{\|u\|_{2}}, \ldots \frac{u_{n}}{\|u\|_{2}} \right]\]
def vector_length(u):
"""
u vector의 dot(inner) product
([1, 2, 3]).dot([1, 2, 3]) = 1+4+9 = 14
"""
return np.sqrt(u.dot(u))
def length_norm(u):
return u / vector_length(u)
ABC_normed = ABC.apply(length_norm, axis=1)
plot_ABC(ABC_normed)
abc_comparisons(ABC_normed, euclidean) # euclidean distance on the length-normalized ABC dataframe
euclidean(A, B) = 0.12
euclidean(B, C) = 0.36
Here, the connection between A and B is more apparent, as is the opposition between B and C.
Cosine distance
Cosine distance takes overall length into account. The cosine distance between two vectors $u$ and $v$ of dimension $n$ is
\[\textbf{cosine}(u, v) = 1 - \frac{\sum_{i=1}^{n} u_{i} \cdot v_{i}}{\|u\|_{2} \cdot \|v\|_{2}}\]
The similarity part of this (the right-hand term of the subtraction) is actually measuring the angle between the two vectors. The result is the same (in terms of rank order) as one gets from first normalizing both vectors using $\|\cdot\|_{2}$ and then calculating their Euclidean distance.
def cosine(u, v):
return scipy.spatial.distance.cosine(u, v)
abc_comparisons(ABC, cosine)
cosine(A, B) = 0.01
cosine(B, C) = 0.07
So, in building in the length normalization, cosine distance achieves our goal of associating A and B and separating both from C.
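As a quick added check of the rank-order claim above: for length-normalized vectors, the squared euclidean distance equals twice the cosine distance, so the two measures order pairs identically.
for a, b in (('A', 'B'), ('B', 'C')):
    u, v = ABC.loc[a], ABC.loc[b]
    # squared euclidean of the normalized vectors vs. 2 * cosine(u, v)
    print('{}-{}: {:.6f} vs {:.6f}'.format(
        a, b, euclidean(length_norm(u), length_norm(v)) ** 2, 2 * cosine(u, v)))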
Matching-based methods
Matchingbased methods are also common in the literature. The basic matching measure effectively creates a vector consisting of all of the smaller of the two values at each coordinate, and then sums them:
\[\textbf{matching}(u, v) = \sum_{i=1}^{n} \min(u_{i}, v_{i})\]
This is implemented in vsm as matching.
One approach to normalizing the matching values is the Jaccard coefficient. The numerator is the matching coefficient. The denominator — the normalizer — is intuitively like the set union: for binary vectors, it gives the cardinality of the union of the two being compared:
\[\textbf{jaccard}(u, v) = 1 - \frac{\textbf{matching}(u, v)}{\sum_{i=1}^{n} \max(u_{i}, v_{i})}\]
Summary
Suppose we set for ourselves the goal of associating A with B and disassociating B from C, in keeping with the semantic intuition expressed above. Then we can assess distance measures by whether they achieve this goal:
def matching(u, v):
return np.sum(np.minimum(u, v))
def jaccard(u, v):
return 1.0 - (matching(u, v) / np.sum(np.maximum(u, v)))
for m in (euclidean, cosine, jaccard):
fmt = {
'n': m.__name__,
'AB': m(ABC.loc['A'], ABC.loc['B']),
'BC': m(ABC.loc['B'], ABC.loc['C'])}
print('{n:>15}(A, B) = {AB:5.2f} {n:>15}(B, C) = {BC:5.2f}'.format(**fmt))
euclidean(A, B) = 13.60 euclidean(B, C) = 6.40
cosine(A, B) = 0.01 cosine(B, C) = 0.07
jaccard(A, B) = 0.76 jaccard(B, C) = 0.31
Distributional neighbors
The neighbors function in vsm is an investigative aide. For a given word w, it ranks all the words in the vocabulary according to their distance from w, as measured by distfunc (default: vsm.cosine).
By playing around with this function, you can start to get a sense for how the distance functions differ. Here are some example uses; you might try some new words to get a feel for what these matrices are like and how different words look.
def neighbors(word, df, distfunc=cosine):
"""Tool for finding the nearest neighbors of `word` in `df` according
to `distfunc`. The comparisons are between row vectors.
Parameters
----------
word : str
The anchor word. Assumed to be in `rownames`.
df : pd.DataFrame
The vector space model.
distfunc : function mapping vector pairs to floats (default: `cosine`)
The measure of distance between vectors. Can also be `euclidean`,
`matching`, `jaccard`, as well as any other distance measure
between 1d vectors.
Raises
------
ValueError
If word is not in `df.index`.
Returns
-------
pd.Series
Ordered by closeness to `word`.
"""
if word not in df.index:
raise ValueError('{} is not in this VSM'.format(word))
w = df.loc[word]
dists = df.apply(lambda x: distfunc(w, x), axis=1)
return dists.sort_values()
neighbors('A', ABC, distfunc=euclidean)
A 0.000000
C 13.416408
B 13.601471
dtype: float64
neighbors('A', ABC, distfunc=cosine)
A 0.000000
B 0.007722
C 0.116212
dtype: float64
neighbors('good', yelp5, distfunc=euclidean).head()
good 0.000000
great 363501.370559
very 463140.024533
there 484997.266143
so 503078.898630
dtype: float64
neighbors('good', yelp20, distfunc=euclidean).head()
good 0.000000e+00
very 1.582079e+06
food 1.706024e+06
as 2.180615e+06
great 2.309515e+06
dtype: float64
neighbors('good', yelp5, distfunc=cosine).head()
good 0.000000
decent 0.052814
weak 0.061491
impressive 0.063750
solid 0.072259
dtype: float64
neighbors('good', yelp20, distfunc=cosine).head()
good 0.000000
decent 0.005797
pretty 0.006568
really 0.007324
quite 0.007358
dtype: float64
neighbors('good', giga5, distfunc=euclidean).head()
good 0.000000
very 96629.020074
another 102967.358766
little 111678.083808
big 112087.562461
dtype: float64
neighbors('good', giga20, distfunc=euclidean).head()
good 0.000000
very 342311.566423
here 510760.069407
think 529750.740523
right 561272.646841
dtype: float64
neighbors('good', giga5, distfunc=cosine).head()
good 0.000000
bad 0.047839
painful 0.070546
simple 0.073625
safe 0.076624
dtype: float64
neighbors('good', giga20, distfunc=cosine).head()
good 0.000000
bad 0.004829
everyone 0.006463
always 0.006505
very 0.007011
dtype: float64
Matrix reweighting

The goal of reweighting is to amplify the important, trustworthy, and unusual, while de-emphasizing the mundane and the quirky.

Absent a defined objective function, this will remain fuzzy, but the intuition behind moving away from raw counts is that frequency is a poor proxy for our target semantic ideas.
Normalization
Normalization (row-wise or column-wise) is perhaps the simplest form of reweighting. With vsm.length_norm, we normalize using vsm.vector_length. We can also normalize each row by the sum of its values, which turns each row into a probability distribution over the columns:
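Here is a minimal sketch of that row-wise probability normalization (the helper name prob_norm is just illustrative):
def prob_norm(u):
    # Divide a row of counts by its sum, yielding a probability
    # distribution over the columns.
    return u / np.sum(u)

ABC.apply(prob_norm, axis=1)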
These normalization measures are insensitive to the magnitude of the underlying counts. This is often a mistake in the messy world of large data sets; $[1,10]$ and $[1000,10000]$ are very different vectors in ways that will be partly or totally obscured by normalization.
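A quick added illustration of that point: $[1, 10]$ and $[1000, 10000]$ map to exactly the same vector under both normalizations, even though the raw counts carry very different amounts of evidence.
u = np.array([1.0, 10.0])
v = np.array([1000.0, 10000.0])
print(np.allclose(length_norm(u), length_norm(v)))  # True
print(np.allclose(u / u.sum(), v / v.sum()))        # True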
Observed/Expected
Reweighting by observed-over-expected values captures one of the central patterns in all of VSMs: we can adjust the actual cell value in a co-occurrence matrix using information from the corresponding row and column.
In the case of observed-over-expected, the rows and columns define our expectation about what the cell value would be if the two co-occurring words were independent. In dividing the observed count by this value, we amplify cells whose values are larger than we would expect.
So that this doesn’t look more complex than it is, for an $m \times n$ matrix $X$, define
\[\textbf{rowsum}(X, i) = \sum_{j=1}^{n}X_{ij}\]
\[\textbf{colsum}(X, j) = \sum_{i=1}^{m}X_{ij}\]
\[\textbf{sum}(X) = \sum_{i=1}^{m}\sum_{j=1}^{n} X_{ij}\]
\[\textbf{expected}(X, i, j) = \frac{\textbf{rowsum}(X, i) \cdot \textbf{colsum}(X, j)}{\textbf{sum}(X)}\]
Then the observed-over-expected value is
\[\textbf{oe}(X, i, j) = \frac{X_{ij}}{\textbf{expected}(X, i, j)}\]
In many contexts, it is more intuitive to first normalize the count matrix into a joint probability table and then think of $\textbf{rowsum}$ and $\textbf{colsum}$ as probabilities. Then it is clear that we are comparing the observed joint probability with what we would expect it to be under a null hypothesis of independence. These normalizations do not affect the final results, though.
Let’s do a quick worked-out example. Suppose we have the count matrix $X$ =
a  b  rowsum  

x  34  11  45 
y  47  7  54 
colsum  81  18  99 
Then we calculate like this:
\[\textbf{oe}(X, 1, 0) = \frac{47}{\frac{54 \cdot 81}{99}} = 1.06\]
And the full table looks like this:
a  b  

x  0.92  1.34 
y  1.06  0.71 
def observed_over_expected(df):
col_totals = df.sum(axis=0)
total = col_totals.sum()
row_totals = df.sum(axis=1)
expected = np.outer(row_totals, col_totals) / total
oe = df / expected
return oe
oe_ex = np.array([[ 34., 11.], [ 47., 7.]])
observed_over_expected(oe_ex).round(2)
array([[0.92, 1.34],
[1.06, 0.71]])
The implementation vsm.observed_over_expected should be pretty efficient.
yelp5_oe = observed_over_expected(yelp5)
yelp20_oe = observed_over_expected(yelp20)
neighbors('good', yelp5_oe).head()
good 0.000000
great 0.323037
decent 0.381849
but 0.403750
excellent 0.411684
dtype: float64
neighbors('good', yelp20_oe).head()
good 0.000000
too 0.058905
really 0.060264
pretty 0.064466
but 0.066778
dtype: float64
giga5_oe = observed_over_expected(giga5)
giga20_oe = observed_over_expected(giga20)
neighbors('good', giga5_oe).head()
good 0.000000
bad 0.485495
better 0.533762
well 0.553319
but 0.556113
dtype: float64
neighbors('good', giga20_oe).head()
good 0.000000
just 0.102044
so 0.106773
that's 0.108614
better 0.111044
dtype: float64
Pointwise Mutual Information
Pointwise Mutual Information (PMI) is observed-over-expected in log space:
\[\textbf{pmi}(X, i, j) = \log\left(\frac{X_{ij}}{\textbf{expected}(X, i, j)}\right)\]
This basic definition runs into a problem for $0$ count cells. The usual response is to set $\log(0) = 0$, but this is arguably confusing – cell counts that are smaller than expected get negative values, cell counts that are larger than expected get positive values, and 0-count values are placed in the middle of this ranking without real justification.
For this reason, it is more typical to use Positive PMI, which maps all negative PMI values to $0$:
\[\textbf{ppmi}(X, i, j) = \begin{cases} \textbf{pmi}(X, i, j) & \textrm{if } \textbf{pmi}(X, i, j) > 0 \\ 0 & \textrm{otherwise} \end{cases}\]
This is the default for vsm.pmi.
def pmi(df, positive=True):
df = observed_over_expected(df)
# Silence distracting warnings about log(0):
with np.errstate(divide='ignore'):
df = np.log(df)
df[np.isinf(df)] = 0.0 # log(0) = 0
if positive:
df[df < 0] = 0.0
return df
yelp5_pmi = pmi(yelp5)
yelp20_pmi = pmi(yelp20)
neighbors('good', yelp5_pmi).head()
good 0.000000
decent 0.448116
great 0.466581
tasty 0.532094
excellent 0.569720
dtype: float64
neighbors('good', yelp20_pmi).head()
good 0.000000
tasty 0.201911
delicious 0.261547
flavorful 0.317310
ordered 0.349519
dtype: float64
giga5_pmi = pmi(giga5)
giga20_pmi = pmi(giga20)
neighbors('good', giga5_pmi).head()
good 0.000000
bad 0.439959
better 0.494448
terrific 0.509023
decent 0.520705
dtype: float64
neighbors('good', giga20_pmi).head()
good 0.000000
really 0.271877
that's 0.272447
it's 0.285056
pretty 0.294105
dtype: float64
neighbors('market', giga5_pmi).head()
market 0.000000
markets 0.327014
stocks 0.512569
prices 0.516767
sales 0.552022
dtype: float64
neighbors('market', giga20_pmi).head()
market 0.000000
markets 0.175934
prices 0.252851
stocks 0.266186
profits 0.269412
dtype: float64
TF-IDF
Perhaps the best known reweighting scheme is Term Frequency–Inverse Document Frequency (TF-IDF), which is, I believe, still the backbone of today’s Web search technologies. As the name suggests, it is built from TF and IDF measures:
For an $m \times n$ matrix $X$:
\[\textbf{TF}(X, i, j) = \frac{X_{ij}}{\textbf{colsum}(X, j)}\]
\[\textbf{IDF}(X, i, j) = \log\left(\frac{n}{|\{k : X_{ik} > 0\}|}\right)\]
\[\textbf{TFIDF}(X, i, j) = \textbf{TF}(X, i, j) \cdot \textbf{IDF}(X, i, j)\]
TF-IDF generally performs best with sparse matrices. It severely punishes words that appear in many documents; if a word appears in every document, then its IDF value is 0. As a result, it can even be problematic with very dense word $\times$ word matrices like ours, where most words appear with most other words.
There is an implementation of TF-IDF for dense matrices in vsm.tfidf.
Important: sklearn’s version, TfidfTransformer, assumes that term frequency (TF) is defined row-wise and document frequency is defined column-wise. That is, it assumes sklearn’s document $\times$ word basic design, which makes sense for classification tasks, where the design is example $\times$ features. This is the transpose of the way we’ve been thinking.
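For reference, here is a hedged sketch of applying sklearn's TfidfTransformer to our word $\times$ word design by working on the transpose. Note that sklearn's defaults (smooth_idf=True, norm='l2', and a +1 added to the IDF) differ from the simple definition above, so the values will not match it exactly.
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()  # sklearn defaults; see the note above
# Feed the transpose (document x word), then transpose back to word rows:
yelp5_tfidf_sk = transformer.fit_transform(yelp5.T.values).T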
def tfidf(df):
# Inverse document frequencies:
doccount = float(df.shape[1])
freqs = df.astype(bool).sum(axis=1)
idfs = np.log(doccount / freqs)
idfs[np.isinf(idfs)] = 0.0 # log(0) = 0
# Term frequencies:
col_totals = df.sum(axis=0)
tfs = df / col_totals
return (tfs.T * idfs).T
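As a small added demonstration of the warning above: applying this tfidf to the little all-nonzero matrix oe_ex from earlier zeroes everything out, since every "term" occurs in both "documents" and so every IDF value is $\log(2/2) = 0$.
tfidf(pd.DataFrame(oe_ex)).round(2)
# Every cell comes out 0.0: each row is nonzero in every column, so IDF = 0.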
Subword information
Schütze (1993) pioneered the use of subword information to improve representations by reducing sparsity, thereby increasing the density of connections in a VSM. In recent years, this idea has shown value in numerous contexts.
Bojanowski et al. (2016) (the fastText team) explore a particularly straightforward approach to doing this: represent each word as the sum of the representations for the character-level n-grams it contains.
It is simple to derive character-level n-gram representations from our existing VSMs. The function vsm.ngram_vsm implements the basic step. Here, we create the 4-gram version of yelp5:
def ngram_vsm(df, n=2):
"""Create a characterlevel VSM from `df`.
Parameters

df : pd.DataFrame
n : int
The ngram size.
Returns

pd.DataFrame
This will have the same column dimensionality as `df`, but the
rows will be expanded with representations giving the sum of
all the original rows in `df` that contain that row's ngram.
"""
unigram2vecs = defaultdict(list)
for w, x in df.iterrows():
for c in get_character_ngrams(w, n):
unigram2vecs[c].append(x)
unigram2vecs = {c: np.array(x).sum(axis=0)
for c, x in unigram2vecs.items()}
cf = pd.DataFrame(unigram2vecs).T
cf.columns = df.columns
return cf
def get_character_ngrams(w, n):
"""Map a word to its characterlevel ngrams, with boundary
symbols '<w>' and '</w>'.
Parameters

w : str
n : int
The ngram size.
Returns

list of str
"""
if n > 1:
w = ["<w>"] + list(w) + ["</w>"]
else:
w = list(w)
return ["".join(w[i: i+n]) for i in range(len(w)n+1)]
yelp5_ngrams = ngram_vsm(yelp5, n=4)
yelp5_ngrams.shape
(10382, 6000)
This has the same column dimension as yelp5, but the rows are expanded with all the 4-grams, including boundary symbols <w> and </w>. Here’s a simple function for creating new word representations from the associated character-level ones:
def character_level_rep(word, cf, n=4):
ngrams = get_character_ngrams(word, n)
ngrams = [n for n in ngrams if n in cf.index]
reps = cf.loc[ngrams].values
return reps.sum(axis=0)
Many variations on this are worth trying – including the original word vector where available, changing the aggregation method from sum to something else, using a real morphological parser instead of just n-grams, and so on.
One very powerful thing about this is that we can represent words that are not even in the original VSM:
'superbly' in yelp5.index # the word 'superbly' is not in yelp5,
False
superbly = character_level_rep("superbly", yelp5_ngrams) # but we can compute a vector for it from the n-gram VSM
superb = character_level_rep("superb", yelp5_ngrams)
cosine(superb, superbly) # cosine distance between the two vectors: very small, so they are close
0.004362871833741844
Visualization

You can begin to get a feel for what your matrix is like by poking around with vsm.neighbors to see who is close to or far from whom.
It’s very useful to complement this with the more holistic view one can get from looking at a visualization of the entire vector space.

Of course, any visualization will have to be much, much lower dimension than our actual VSM, so we need to proceed cautiously, balancing the highlevel view with more finegrained exploration.

We won’t have time this term to cover VSM visualization in detail. scikit-learn has a bunch of functions for doing this in sklearn.manifold, and the user guide for that package is detailed.

It’s also worth checking out the online TensorFlow Embedding Projector tool, which includes a fast implementation of t-SNE.

In addition, vsm.tsne_viz is a wrapper around sklearn.manifold.TSNE that handles the basic preprocessing and layout for you. t-SNE stands for t-Distributed Stochastic Neighbor Embedding, a powerful method for visualizing high-dimensional vector spaces in 2d. See also Multiple Maps t-SNE.
def tsne_viz(df, colors=None, output_filename=None, figsize=(40, 50)):
"""2d plot of `df` using tSNE, with the points labeled by `df.index`,
aligned with `colors` (defaults to all black).
Parameters
----------
df : pd.DataFrame
The matrix to visualize.
colors : list of colornames or None (default: None)
Optional list of colors for the vocab. The color names just
need to be interpretable by matplotlib. If they are supplied,
they need to have the same length as `df.index`. If `colors=None`,
then all the words are displayed in black.
output_filename : str (default: None)
If not None, then the output image is written to this location.
The filename suffix determines the image type. If `None`, then
`plt.plot()` is called, with the behavior determined by the
environment.
figsize : (int, int) (default: (40, 50))
Default size of the output in display units.
"""
# Colors:
vocab = df.index
if not colors:
colors = ['black' for i in vocab]
# Recommended reduction via PCA or similar:
n_components = 50 if df.shape[1] >= 50 else df.shape[1]
dimreduce = PCA(n_components=n_components)
X = dimreduce.fit_transform(df)
# tSNE:
tsne = TSNE(n_components=2, random_state=0)
tsnemat = tsne.fit_transform(X)
# Plot values:
xvals = tsnemat[: , 0]
yvals = tsnemat[: , 1]
# Plotting:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=figsize)
ax.plot(xvals, yvals, marker='', linestyle='')
# Text labels:
for word, x, y, color in zip(vocab, xvals, yvals, colors):
try:
ax.annotate(word, (x, y), fontsize=8, color=color)
except UnicodeDecodeError: ## Python 2 won't cooperate!
pass
# Output:
if output_filename:
plt.savefig(output_filename, bbox_inches='tight')
else:
plt.show()
tsne_viz(yelp20_pmi)