one-hot encoding

by Jouureee 2022. 1. 21.

One-hot encoding is the most primitive way to embed words as vectors.

 

The dimension of each vector equals the size of the vocabulary (the set of distinct words).

For example, the sentence [the, cat, sat, on, the, mat] consists of 5 distinct words, so each word is represented by a 5-dimensional vector.
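As a minimal sketch of the sentence above (the variable names `tokens`, `vocab`, and `one_hot` are just illustrative choices), the vocabulary and the one-hot vectors could be built with NumPy like this:

```python
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Vocabulary: each distinct word gets the next free index,
# in order of first appearance
vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

# Row i of the identity matrix is the one-hot vector for index i
one_hot = np.eye(len(vocab), dtype=int)[[vocab[tok] for tok in tokens]]

print(vocab)          # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(one_hot.shape)  # (6, 5)
```

The sentence has 6 tokens but only 5 distinct words, so the result is 6 rows of 5-dimensional vectors, and both occurrences of "the" map to the same row.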

 

 

from konlpy.tag import Komoran
import numpy as np

# Extract nouns with the Komoran morphological analyzer
komoran = Komoran()
text = "오늘 날씨는 구름이 많아요"  # "Today's weather is very cloudy"

nouns = komoran.nouns(text)
print(nouns)

# Build the vocabulary: each distinct noun gets the next integer index
dics = {}
for word in nouns:
    if word not in dics:
        dics[word] = len(dics)  # 0, 1, 2
print(dics)

# One-hot encoding: row i of the identity matrix is the vector for index i
nb_classes = len(dics)
targets = list(dics.values())
one_hot_targets = np.eye(nb_classes)[targets]
print(one_hot_targets)

 

However, one-hot encoding produces a sparse matrix, so the larger the vocabulary grows, the more memory and computation are wasted.
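To get a rough feel for the memory cost, the sketch below compares storing tokens as integer indices with storing them as a dense one-hot matrix. The sizes `vocab_size` and `num_tokens` are hypothetical numbers picked only for illustration.

```python
import numpy as np

vocab_size = 10_000   # hypothetical vocabulary size
num_tokens = 1_000    # hypothetical number of tokens to encode

# Integer indices: one number per token
indices = np.arange(num_tokens, dtype=np.int32) % vocab_size

# Dense one-hot matrix: vocab_size numbers per token, all but one are zero
one_hot = np.zeros((num_tokens, vocab_size), dtype=np.float32)
one_hot[np.arange(num_tokens), indices] = 1.0

print(indices.nbytes)                             # 4000
print(one_hot.nbytes)                             # 40000000
print(np.count_nonzero(one_hot) / one_hot.size)   # 0.0001
```

Under these assumptions only 0.01% of the entries carry information, and the dense matrix is 10,000 times larger than the index array.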

It also cannot express the relationship between a token and the words around it.
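One way to see this is that any two different one-hot vectors are orthogonal, so their cosine similarity is always 0 no matter how related the words are. The toy vocabulary and the `cosine` helper below are assumptions made for this sketch.

```python
import numpy as np

vocab = {"cat": 0, "kitten": 1, "car": 2}   # hypothetical toy vocabulary
one_hot = np.eye(len(vocab))

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(one_hot[vocab["cat"]], one_hot[vocab["kitten"]]))  # 0.0
print(cosine(one_hot[vocab["cat"]], one_hot[vocab["car"]]))     # 0.0
print(cosine(one_hot[vocab["cat"]], one_hot[vocab["cat"]]))     # 1.0
```

"cat" is exactly as far from "kitten" as it is from "car", so the encoding carries no notion of similarity or context.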

 

To address these drawbacks, there are broadly two families of techniques that vectorize words in a multidimensional space while reflecting their latent meaning.

The first family is count-based vectorization, which includes LSA (Latent Semantic Analysis) and HAL.

The second is prediction-based vectorization, which includes NNLM, RNNLM, Word2Vec, and FastText.

GloVe, in turn, is a method that uses both the count-based and the prediction-based approach.
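To make the count-based idea a little more concrete, here is a minimal LSA-style sketch: count how often each word occurs in each document, then keep only the top `k` components of a singular value decomposition. The toy corpus `docs`, the value `k = 2`, and all variable names are assumptions for illustration; real LSA pipelines usually add weighting such as TF-IDF and work on far larger corpora.

```python
import numpy as np

# Hypothetical toy corpus; each string is one "document"
docs = ["cat sat mat", "cat sat sofa", "dog ate bone", "dog ate meat"]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix: rows are words, columns are documents
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        counts[index[w], j] += 1

# Truncated SVD: keep only the k largest singular values
k = 2
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
word_vectors = U[:, :k] * S[:k]   # one dense k-dimensional vector per word

print(word_vectors.shape)  # (8, 2)
```

Unlike one-hot vectors, these dense vectors are not all orthogonal: in this toy corpus, words that occur in the same documents (such as "cat" and "sat") point in the same direction, while "cat" and "dog" do not.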

 

References:

https://wikidocs.net/22647

https://www.tensorflow.org/text/guide/word_embeddings
