[Paper Review] Attention Is All You Need (2017) #Transformer


"Attention is all you need", ์ด ๋…ผ๋ฌธ์€ ๊ธฐ์กด seq to seq ๋ชจ๋ธ์˜ ํ•œ๊ณ„์ ์„ ๋ณด์™„ํ•˜๊ณ  ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ•œ Transformer ๋ชจ๋ธ์˜ ๋“ฑ์žฅ์„ ์•Œ๋ฆฐ ๊ธฐ๋…๋น„์ ์ธ ๋…ผ๋ฌธ์ด๋‹ค.

ํ˜„์žฌ NLP์™€ ๊ฐ™์ด seq to seq ํ˜•ํƒœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ํƒœ์Šคํฌ์—์„œ๋Š” ํŠธ๋žœ์Šคํฌ๋จธ๊ฐ€ ์ฃผ๋ฅ˜๋ฅผ ์ฐจ์ง€ํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์‹œ๊ณ„์—ด ๋ถ„์„์—์„œ๋„ ๊ทธ ํ™œ์šฉ์„ฑ์„ ๋†’์ด๋ ค๋Š” ์—ฐ๊ตฌ๊ฐ€ ํ™œ๋ฐœํ•˜๊ฒŒ ์ง„ํ–‰๋˜๊ณ  ์žˆ๋‹ค.

์ฆ‰, ์ƒˆ๋กœ์šด ๋…ผ๋ฌธ์—์„œ ์ œ์‹œ๋˜๋Š” "State of the art"(sota)๋ชจ๋ธ๋“ค์˜ ๋Œ€๋ถ€๋ถ„์ด ์ด ํŠธ๋žœ์Šคํฌ๋จธ์— ๋ฐ”ํƒ•์„ ๋‘๊ณ  ์žˆ๋Š” ๊ฒƒ์ด๋‹ค.

๋•Œ๋ฌธ์— ๊ฒฐ๊ตญ ์ด ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ Attention ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ํŠธ๋žœ์Šคํฌ๋จธ ๋ชจ๋ธ์„ ์ œ๋Œ€๋กœ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์€ ์ตœ์‹  ํŠธ๋ Œ๋“œ๋ฅผ ์ดํ•ดํ•˜๊ณ  ๋” ๋‚˜์•„๊ฐ€ ์ƒˆ๋กœ์šด ์—ฐ๊ตฌ ๊ธฐํšŒ๋ฅผ ํฌ์ฐฉํ•˜๋Š” ๋ฐ ์žˆ์–ด์„œ ๋งค์šฐ ์ค‘์š”ํ•œ ์ฒซ ๊ฑธ์Œ์ผ ๊ฒƒ์ด๋‹ค.

 

 

Introduction

 

์ด ๋…ผ๋ฌธ์ด ๋ฐœํ‘œ๋œ ๋‹น์‹œ์—๋Š” RNN๊ธฐ๋ฐ˜์˜ LSTM, GRU๊ฐ€ ์ž์—ฐ์–ด ๋ฒˆ์—ญ๊ณผ ๊ฐ™์€ Sequential data(์ˆœ์ฐจ ๋ฐ์ดํ„ฐ)๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ํƒœ์Šคํฌ์˜ ์ฃผ๋œ ๋ชจ๋ธ์ด์—ˆ์œผ๋ฉฐ sota์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์˜ ๋ฐ‘๋ฐ”ํƒ•์ด์—ˆ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ RNN ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ๋“ค์€ "๋ฉ”๋ชจ๋ฆฌ, ์†๋„ ์ฐจ์›์˜ ๋น„ํšจ์œจ์„ฑ"๊ณผ "์žฅ๊ธฐ ์˜์กด์„ฑ์— ์˜ํ•œ ์„ฑ๋Šฅ ์ €ํ•˜"๋ผ๋Š” ์น˜๋ช…์ ์ธ ๋‹จ์ ์ด ์กด์žฌํ•œ๋‹ค.

recurrent ๋ชจํ˜•, ์ถœ์ฒ˜)์œ„ํ‚ค๋ฏธ๋””์•„

RNN(Recurrent Neural Network) ๊ธฐ๋ฐ˜์˜ Recurrent(์ˆœํ™˜) ๋ชจ๋ธ์€ ์œ„์™€ ๊ฐ™์ด ์ธํ’‹๊ณผ ์•„์›ƒํ’‹์˜ ํฌ์ง€์…˜์— ๋”ฐ๋ผ ์ˆœ์ฐจ์ ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ํˆฌ์ž…ํ•˜์—ฌ ์ฐจ๋ก€๋Œ€๋กœ ํ•™์Šตํ•œ๋‹ค. ์ฆ‰, ์ž…๋ ฅ์ธต์— ์ธํ’‹์„ ์ง‘์–ด๋„ฃ๋Š” ๊ฒƒ๊ณผ ์€๋‹‰์ธต Hidden state์˜ ์—ฐ์‚ฐ ๋˜ํ•œ ๊ณ„์† ์ˆœ์ฐจ์ ์œผ๋กœ ์ „๋‹ฌ๋˜๋ฉฐ ์ตœ์ข… ์•„์›ƒํ’‹์„ ์‚ฐ์ถœํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋Ÿฌํ•œ ๊ณผ์ •์€ ์ธํ’‹์ด ๊ธธ์ˆ˜๋ก ๊ณ„์‚ฐ๊ณผ์ •์—์„œ ๋ฉ”๋ชจ๋ฆฌ์— ์ œ์•ฝ์ด ์ƒ๊ธฐ๋ฉฐ ์†๋„ ๋˜ํ•œ ์ €ํ•˜๋œ๋‹ค.

 

Recurrent models also suffer from losing the dependency on data points that are far away in time, which degrades performance. This is the "long-range dependency" problem: when computation proceeds step by step over a long input, the influence of distant positions gradually fades. On top of that, the vanilla RNN has the vanishing gradient problem as a structural limitation.

 

์ด์— ์ด๋Ÿฌํ•œ ํ•œ๊ณ„์ ์„ ๋ณด์™„ํ•˜๋ ค๋Š” ์—ฌ๋Ÿฌ ์—ฐ๊ตฌ๋“ค์ด ์ง„ํ–‰๋˜์—ˆ๊ณ  ์ผ๋ถ€ ๊ณ„์‚ฐ ํšจ์œจ์„ฑ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ฒฐ๊ณผ๋ฅผ ์–ป์—ˆ์œผ๋‚˜ ์ˆœ์ฐจ์ ์ธ ๊ณ„์‚ฐ์˜ ๊ทผ๋ณธ์ ์ธ ์ œ์•ฝ์€ ์—ฌ์ „ํžˆ ๋‚จ์•„์žˆ๋Š” ๊ฒƒ์ด ํ˜„์‹ค์ด์—ˆ๋‹ค. ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ "Attention"(= ์–ดํ…์…˜)์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ†ตํ•ด ๊ธด ์ธํ’‹, ์žฅ๊ธฐ๊ฐ„์˜ ์‹œ์ ์„ ๋‹ค๋ฃจ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ์— ์žˆ์–ด์„œ ์žฅ๊ธฐ ์ข…์†์„ฑ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ–ˆ์ง€๋งŒ, ์—ฌ์ „ํžˆ RNN ๊ธฐ๋ฐ˜์˜ ์•„ํ‚คํ…Œ์ฒ˜์™€ ํ•จ๊ป˜ ์ด์šฉ๋œ ๊ฒƒ์ด ๋Œ€๋ถ€๋ถ„์ด์—ˆ๋‹ค.

 

์ด์— ์ด ๋…ผ๋ฌธ์˜ ์ €์ž๋“ค์€ ๊ธฐ์กด์˜ recurrentํ•œ ์•„ํ‚คํ…Œ์ฒ˜๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  "attention"์— ์ „์ ์œผ๋กœ ์˜์กดํ•˜๋Š” transformer(์ดํ•˜ ํŠธ๋žœ์Šคํฌ๋จธ)๋ฅผ ์ œ์•ˆํ•œ๋‹ค.

ํŠธ๋žœ์Šคํฌ๋จธ๋Š” ์ „์—ญ์ ์ธ dependence๋ฅผ ๋ฝ‘์•„๋‚ด๊ณ , ๋” ๋งŽ์€ ๋ณ‘๋ ฌํ™”๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•จ์œผ๋กœ์จ ๋” ๋‚˜์€ ์ตœ์‹ (sota)์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค.

 

Background

 

์ˆœ์ฐจ์  ๊ณ„์‚ฐ(sequential computation)์„ ์ค„์ด๋ ค๋Š” ์—ฐ๊ตฌ๋Š” CNN์„ ๊ธฐ๋ณธ ๊ตฌ์กฐ๋กœ ๋‘” Extended Neural GPU, ByteNet, ConvS2S์—์„œ๋„ ์กด์žฌํ–ˆ๋‹ค. ์ด๋“ค์€ ์ธํ’‹, ์•„์›ƒํ’‹์˜ ํฌ์ง€์…˜์— ๋Œ€ํ•ด ์€๋‹‰์ธต์˜ ์ƒํƒœ๋ฅผ ๋ณ‘๋ ฌ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๋Š”๋ฐ, ์ธํ’‹๊ณผ ์•„์›ƒํ’‹์„ ์—ฐ๊ด€์‹œํ‚ค๋Š” ๋ฐ์— ํ•„์š”ํ•œ ์ž‘์—…์˜ ์ˆ˜๊ฐ€ ํฌ์ง€์…˜ ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋งŒํผ ์ฆ๊ฐ€ํ•œ๋‹ค. (์„ ํ˜•์  for ConvS2S and ๋กœ๊ทธ์  for ByteNe)

์ด๋Ÿฌํ•œ ํŠน์ง•์€ ๋–จ์–ด์ง„ ํฌ์ง€์…˜ ์‚ฌ์ด์— ์˜์กด์„ฑ ๋‚ด์ง€๋Š” ์–ด๋– ํ•œ ๊ด€๊ณ„๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ์–ด๋ ค์›€์„ ์ค€๋‹ค.

 

๋”ฐ๋ผ์„œ ์ƒˆ๋กœ ์ œ์•ˆํ•˜๋Š” ํŠธ๋žœ์Šคํฌ๋จธ๋Š”, averaging attention-weighted positions์œผ๋กœ ์ธํ•ด ํšจ๊ณผ์„ฑ์ด ๊ฐ์†Œํ•˜๋Š” ๋Œ€๊ฐ€๋ฅผ ์น˜๋ฅด์ง€๋งŒ, Multi-Head ์–ดํ…์…˜์œผ๋กœ ์ด๋ฅผ ๋ณด์™„ํ•˜๋ฉฐ ์ž‘์—…์˜ ์ˆ˜๋ฅผ ์ผ์ •ํ•œ ์ˆ˜๋กœ ๊ฐ์†Œ์‹œ์ผฐ๋‹ค.

 

 

Model Architecture

 

 

 

๋ณธ๊ฒฉ์ ์œผ๋กœ ํŠธ๋žœ์Šคํฌ๋จธ์˜ ์•„ํ‚คํ…Œ์ฒ˜๋ฅผ ์„ค๋ช…ํ•˜๊ธฐ์— ์•ž์„œ, ๋…ผ๋ฌธ์—์„œ ์ง€์ •ํ•œ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋จผ์ € ์ •๋ฆฌํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

 

1. Dimension of the inputs and outputs: d_model = 512

2. Number of encoder and decoder layers: N = 6

3. Number of attention heads: num_heads = 8

4. Hidden size of the FFN (feed-forward network): d_ff = 2048
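To keep these values in one place, here is a minimal configuration sketch (the field names are mine, not the paper's; d_k and d_v are the per-head dimensions, d_model // num_heads = 64 in the paper):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    d_model: int = 512    # dimension of the inputs/outputs of every sub-layer
    num_layers: int = 6   # number of encoder layers and of decoder layers (N)
    num_heads: int = 8    # number of parallel attention heads (h)
    d_ff: int = 2048      # hidden size of the position-wise feed-forward network
    d_k: int = 64         # per-head query/key dimension (d_model // num_heads)
    d_v: int = 64         # per-head value dimension

config = TransformerConfig()
```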

 

 

๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์Ÿ๋ ฅ์žˆ๋Š” neural sequential ๋ณ€ํ™˜๊ธฐ๊ฐ€ ์ธ์ฝ”๋” - ๋””์ฝ”๋” ๊ตฌ์กฐ๋กœ ๋˜์–ด์žˆ๋“ฏ์ด, ํŠธ๋žœ์Šคํฌ๋จธ ๋˜ํ•œ ์ธ์ฝ”๋” - ๋””์ฝ”๋”๋กœ ์—ฐ๊ฒฐ๋œ๋‹ค.

๋˜ํ•œ ๊ฐ ๋ชจ๋“ˆ์€ ๋‚ด๋ถ€์— ์…€ํ”„ ์–ดํ…์…˜๊ณผ fully-connected-layer๋ผ๋Š” sub-layer๋“ค๋กœ ๊ตฌ์„ฑ๋˜์–ด์žˆ๋‹ค.

๊ธฐ๋ณธ์ ์œผ๋กœ ์ธํ’‹์ธ x์˜ sequence๋ฅผ ๋งตํ•‘ํ•˜์—ฌ(self-attention) context vactor๋ฅผ ์ƒ์„ฑํ•˜๊ณ , ๋””์ฝ”๋”๋Š” ๊ทธ๊ฒƒ์„ ์ถ”๊ฐ€์ ์ธ ์ธํ’‹์œผ๋กœ ๋ฐ›์•„ ์•„์›ƒํ’‹์„ ์ƒ์„ฑํ•˜๋Š” ๊ตฌ์กฐ์ด๋‹ค.

๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ํŠธ๋žœ์Šคํฌ๋จธ์˜ ๊ตฌ์กฐ๋ฅผ ์‚ดํŽด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

- ์ธ์ฝ”๋” & ๋””์ฝ”๋”

 

๋จผ์ € "์ธ์ฝ”๋”"๋Š” 6๊ฐœ์˜ ๋™์ผํ•œ layer(์ดํ•˜ ๋ ˆ์ด์–ด)๊ฐ€ ์ค‘์ฒฉ๋˜์–ด ์žˆ๋Š” ๊ตฌ์กฐ(N=6, ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ)์ด๋‹ค.

๊ฐ ๋ ˆ์ด์–ด๋Š” 2๊ฐœ์˜ ํ•˜์œ„ ๋ ˆ์ด์–ด๊ฐ€ ์žˆ๋Š”๋ฐ, ์ฒซ ๋ฒˆ์งธ๊ฐ€ self-attention ๋งค์ปค๋‹ˆ์ฆ˜์ด๊ณ  ๋‘ ๋ฒˆ์งธ๊ฐ€ ์™„์ „ ์—ฐ๊ฒฐ๋œ ์ˆœ๋ฐฉํ–ฅ ์‹ ๊ฒฝ๋ง(feed-forward network)์ด๋‹ค. ์ฆ‰, self-attention ๋ ˆ์ด์–ด์™€ feed-forward ๋ ˆ์ด์–ด๊ฐ€ ์—ฐ๊ฒฐ๋œ ํฐ ๋ธ”๋ก, ๋ชจ๋“ˆ 6๊ฐœ๊ฐ€ ์ค‘์ฒฉ๋œ ๊ตฌ์กฐ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

ํ•œํŽธ, ๊ฐ ํ•˜์œ„ ๋ ˆ์ด์–ด๋“ค์˜ ์•„์›ƒํ’‹์€ ์ž”์ฐจ ์—ฐ๊ฒฐ(residual-connection)๊ณผ ์ธต ์ •๊ทœํ™”(layer normalization)์„ ๊ฑฐ์นœ๋‹ค.

์ €์ž๋“ค์€ ์ž”์ฐจ์—ฐ๊ฒฐ์„ ์šฉ์ดํ•˜๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด, ์ž„๋ฒ ๋”ฉ ๋ ˆ์ด์–ด๋ฅผ ํฌํ•จํ•œ ๋ชจ๋“  ํ•˜์œ„ ๋ ˆ์ด์–ด๋“ค์ด ์ƒ์„ฑํ•˜๋Š” ์•„์›ƒํ’‹์˜ ์ฐจ์›์„ 512๋กœ ์„ค์ •ํ–ˆ๋‹ค.

 

์ •๋ฆฌํ•˜์ž๋ฉด, ์ธ์ฝ”๋” ๊ตฌ์กฐ์˜ ์ค‘์š”ํ•œ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

- self-attention layer (sub-layer)

- feed-forward network layer (sub-layer)

- residual connection

- layer normalization

 

"๋””์ฝ”๋”"๋Š” ์ธ์ฝ”๋”์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ 6๊ฐœ์˜ ๋™์ผํ•œ layer(์ดํ•˜ ๋ ˆ์ด์–ด)๊ฐ€ ์ค‘์ฒฉ๋˜์–ด ์žˆ๋Š” ๊ตฌ์กฐ(N=6)์ด๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ๋””์ฝ”๋”๋Š”, ์ธ์ฝ”๋”์™€ ๊ฐ™์€ 2๊ฐœ์˜ ํ•˜์œ„ ๋ ˆ์ด์–ด ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์ธ์ฝ”๋”์˜ ์ตœ์ข… ๊ฒฐ๊ณผ๊ฐ’๊ณผ encorder-decorder attention(not self, multi-head attention)์„ ์ˆ˜ํ–‰ํ•˜๋Š” ํ•˜์œ„ ๋ ˆ์ด์–ด๋ฅผ ์ถ”๊ฐ€ํ•œ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

๋˜ํ•œ ์ธ์ฝ”๋”์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ฐ ํ•˜์œ„ ๋ ˆ์ด์–ด๋“ค์€ ์ž”์ฐจ ์—ฐ๊ฒฐ ์ดํ›„์— ์ธต ์ •๊ทœํ™”๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ์•„์›ƒํ’‹์„ ์‚ฐ์ถœํ•œ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ, ์—ฌ๊ธฐ์„œ ์ฃผ๋ชฉํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ์€ self-attention์„ ์ˆ˜ํ–‰ํ•˜๋Š” ํ•˜์œ„ ๋ ˆ์ด์–ด๊ฐ€ masked self-attention์ด๋ผ๋Š” ๊ฒƒ์ด๋‹ค.

masked๋Š” ์ง€์› ๋‹ค๋Š” ์˜๋ฏธ๋ฅผ ๊ฐ€์ง€๋ฏ€๋กœ ์ด๋Š” ๋ช‡๋ช‡์˜ ๊ฐ’์„ ์ง€์šฐ๊ณ  ์–ดํ…์…˜์„ ์ˆ˜ํ–‰ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค.

์™œ๋ƒํ•˜๋ฉด ๋””์ฝ”๋”์—์„œ ์ฒ˜์Œ ์ง„ํ–‰ํ•˜๋Š” masked self-attention์ดํ›„, ์ธ์ฝ”๋”์˜ ์•„์›ƒํ’‹๊ณผ multi-head-attention์„ ์ง„ํ–‰ํ•  ๋•Œ, t์‹œ์ ์˜ ๊ฐ’์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ ์žˆ์–ด์„œ ๊ทธ ์ด์ „์˜ ๊ฐ’๋งŒ ์ฐธ์กฐํ•˜๊ธฐ ์œ„ํ•ด์„œ์ด๋‹ค.

 

์ด๋Ÿฌํ•œ ๋””์ฝ”๋”์˜ ์ฒซ ๋ฒˆ์งธ ์–ดํ…์…˜์€ ๋งˆ์Šคํ‚น(masking)์„ ์ œ์™ธํ•˜๊ณ , self-attention๊ณผ multi-head attention์ด๋ผ๋Š” ์ ์€ ๋™์ผํ•˜๋‹ค.

์ •๋ฆฌํ•˜์ž๋ฉด ๋””์ฝ”๋”์˜ ์ค‘์š”ํ•œ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

- masked self-attention layer (sub-layer)

- feed-forward network layer (sub-layer)

- encoder-decoder attention

- residual connection

- layer normalization

 

Attention

 

 

์–ดํ…์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ Q(Query, ์ฟผ๋ฆฌ), K(Key, ํ‚ค), V(Value, ๊ฐ’) ์„ธ ๊ฐœ์˜ ์ธํ’‹์„ ๋ฐ›๋Š”๋ฐ, ๊ทธ ๊ธฐ๋Šฅ์„ ํ’€์–ด์„œ ์„ค๋ช…ํ•˜์ž๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

์–ดํ…์…˜์€ ์ฟผ๋ฆฌ์™€ ํ‚ค์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ตฌํ•˜๊ณ , ์ด๋ฅผ ๊ฐ€์ค‘์น˜๋กœ์จ ๊ฐ’์— ๋ฐ˜์˜ํ•˜๋ฉฐ, ์ด๋ฅผ ๊ฐ€์ค‘ํ•ฉํ•˜์—ฌ attention value๋ฅผ ์‚ฐ์ถœํ•œ๋‹ค.

์ฆ‰, ์ธํ’‹์˜ ๋ฌธ๋งฅ์„ ํŒŒ์•…ํ•˜์—ฌ ๊ทธ๊ฒƒ์„ ๊ฐ€์ค‘์น˜๋กœ ํ•˜์—ฌ ๋Œ€์ƒ์ด ๋˜๋Š” ๊ฐ’์— ๋ฐ˜์˜ํ•ด์ฃผ๋Š” ๊ฒƒ์ด๋‹ค. (์—ฌ๊ธฐ์„œ Q,K,V๋Š” ์ž…๋ ฅ ๋ฌธ์žฅ์˜ ๋ชจ๋“  ๋‹จ์–ด ๋ฒกํ„ฐ๋“ค์ด๋‹ค.)

 

Collecting the attentions mentioned so far, the attention performed inside the Transformer architecture falls into three kinds.

Source: https://wikidocs.net/31379

 

1. Encoder self-attention (Q = K = V)

2. Masked decoder self-attention (Q = K = V)

3. Encoder-decoder attention (Q from the decoder; K and V from the encoder)

 

์ด๋“ค์€ ๋ชจ๋‘ multi-head attention(๋ฉ€ํ‹ฐํ—ค๋“œ ์–ดํ…์…˜)์ด๋ผ๋Š” ๊ณตํ†ต์ ์ด ์žˆ๋‹ค. ๋˜ํ•œ 1๋ฒˆ๊ณผ 2๋ฒˆ ์–ดํ…์…˜์€ ์ฟผ๋ฆฌ, ํ‚ค, ๊ฐ’์ด ๋ชจ๋‘ ๊ฐ™๊ธฐ ๋•Œ๋ฌธ์— self ์–ดํ…์…˜์ด๋ผ๋Š” ์ด๋ฆ„์ด ๋ช…๋ช…๋˜์—ˆ๋‹ค. 3๋ฒˆ ์–ดํ…์…˜ ๊ฐ™์€ ๊ฒฝ์šฐ๋Š” ๋””์ฝ”๋”์™€ ์ธ์ฝ”๋”์˜ ์–ดํ…์…˜์œผ๋กœ ์…€ํ”„ ์–ดํ…์…˜์ด ์•„๋‹ˆ๋ผ๋Š” ๊ฒƒ์„ ์ฃผ์˜ํ•ด์•ผ ํ•œ๋‹ค.

์œ„ ๊ทธ๋ฆผ์„ ๋”ฐ๋ผ ์ „์ฒด ๊ณผ์ •์„ ์กฐ๊ธˆ ์ธ๋ฌธํ•™์ ์œผ๋กœ ํ’€์–ด ์„ค๋ช…ํ•œ๋‹ค๋ฉด, ์ธํ’‹๋“ค๊ณผ ์•„์›ƒํ’‹๋“ค ์Šค์Šค๋กœ์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ตฌํ•œ ๋’ค์— ์ธํ’‹๊ณผ ์•„์›ƒํ’‹์„ ๋งตํ•‘์‹œ์ผœ ๊ทธ ์œ ์‚ฌ๋„๋ฅผ ํ†ตํ•ด ๊ฒฐ๊ณผ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๊ณผ์ •์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

๋‹ค์Œ์œผ๋กœ ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ํŠธ๋žœ์Šคํฌ๋จธ์— ์“ฐ์ด๋Š” ์–ดํ…์…˜์„ ์กฐ๊ธˆ ๋” ์ž์„ธํžˆ ์„ค๋ช…ํ•˜๊ณ  ์žˆ๋‹ค.

 

 

Scaled Dot-Product Attention

 

Scaled Dot-Product Attention์€ Q์™€ K(transpose)์˜ ๋‚ด์ (dot-product) ๊ฐ’์„ ๊ตฌํ•˜๊ณ , ์ด๋ฅผ ์†Œํ”„ํŠธ๋งฅ์Šค ํ•จ์ˆ˜์— ๋„ฃ์–ด ๊ฐ€์ค‘์น˜๋ฅผ ๋งŒ๋“  ๋’ค V์™€ ๊ณฑํ•˜์—ฌ attention value๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์„ ๋งํ•œ๋‹ค. ์ด๋ฅผ ์ˆ˜์‹์œผ๋กœ ์“ฐ๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์‹์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

ํ”ํžˆ ์“ฐ์ด๋Š” ์–ดํ…์…˜์€ ํฌ๊ฒŒ additive(ํ•ฉ) attention๊ณผ multiplicative(๊ณฑ) attention์œผ๋กœ ๋‚˜๋‰˜๋Š”๋ฐ ์—ฌ๊ธฐ์„œ ์“ฐ์ด๋Š” ์–ดํ…์…˜์€ ํ›„์ž์— ํ•ด๋‹นํ•œ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ, ์ด ํŠธ๋žœ์Šคํฌ๋จธ์˜ dot-product attention์€ scaled๋ผ๋Š” ์ ์—์„œ ์ฐจ์ด๊ฐ€ ์žˆ๋‹ค. ์—ฌ๊ธฐ์— ์“ฐ์ธ ์–ดํ…์…˜์€ Q์™€ K์˜ ๋‚ด์ ์— ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’์ธ K ๋ฒกํ„ฐ ์ฐจ์›์˜ ์ œ๊ณฑ๊ทผ์œผ๋กœ ๋‚˜๋ˆ ์ฃผ์–ด ์Šค์ผ€์ผ๋ง์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

๋…ผ๋ฌธ์—์„œ ์ €์ž๋Š” k ๋ฒกํ„ฐ์˜ ์ฐจ์›์ด ํด ๋•Œ, dot-product์˜ ๊ทœ๋ชจ๊ฐ€ ์ปค์ง€๋ฉด, ์†Œํ”„ํŠธ๋งฅ์Šค ํ•จ์ˆ˜์˜ gredient๊ฐ€ ๋งค์šฐ ์ž‘์•„์ง€๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์„ ๊ฒƒ์ด๋ผ๊ณ  ์–ธ๊ธ‰ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ์—ฌ์ „ํžˆ ๋” ๋น ๋ฅด๊ณ  ๊ณต๊ฐ„ ํšจ์œจ์ ์ธ dot-product๋ฅผ ์ด์šฉํ•˜๋˜, ์˜ˆ์ƒ๋˜๋Š” ๋ฌธ์ œ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ณ ์ž ์Šค์ผ€์ผ๋ง์„ ์ถ”๊ฐ€ํ•œ ๊ฒƒ์ด๋‹ค.

 

Multi-Head Attention

 

ํŠธ๋žœ์Šคํฌ๋จธ๋ฅผ ์ด๋ฃจ๋Š” ์–ดํ…์…˜ ๋ชจ๋“ˆ์˜ ๋˜ ๋‹ค๋ฅธ ํŠน์ง•์€ "Multi-head(๋ฉ€ํ‹ฐํ—ค๋“œ)"์ด๋‹ค.

๋…ผ๋ฌธ์˜ ์ €์ž๋“ค์€ ํ•œ ๋ฒˆ์˜ ์–ดํ…์…˜์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ๋ณ‘๋ ฌ์ ์ธ ์–ดํ…์…˜์ด ๋” ํšจ๊ณผ์ ์ด๋ผ๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ•˜์˜€๋‹ค.

์ด์— ํ—ค๋“œ์˜ ์ˆ˜๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ์ธ num_head๋ฅผ ์„ค์ •ํ•˜๊ณ  ๊ทธ ๋งŒํผ์˜ ๋ณ‘๋ ฌ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋Š”๋ฐ, ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋ฅผ 8๋กœ ์„ค์ •ํ•˜์˜€๋‹ค. 

์ถœ์ฒ˜) https://wikidocs.net/31379

 

์œ„์™€ ๊ฐ™์ด ์–ดํ…์…˜์€ ๋ฌธ์žฅ ํ–‰๋ ฌ์— Q,K,V ๊ฐ๊ฐ์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณฑํ•˜์—ฌ Q,K,V ๊ฐ๊ฐ์˜ ๋ฒกํ„ฐ๋ฅผ ๋ฝ‘์•„๋‚ธ ๋’ค scaled dot-attention์„ ์ˆ˜ํ–‰ํ•˜์—ฌ attention value๋ฅผ ์–ป์–ด๋‚ธ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ mulit-head๋Š” ์ด๋Ÿฌํ•œ ๊ณผ์ •์„ ํ•œ๋ฒˆ์œผ๋กœ ๊ทธ์น˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ num_head๋งŒํผ ๋ณ‘๋ ฌ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•œ๋‹ค.

์—ฌ๊ธฐ์„œ ์ฃผ๋ชฉํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ์€ ๊ฐ head์— ์ด์šฉ๋˜๋Š” ๊ฐ€์ค‘์น˜๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅด๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. (๊ทธ๋ฆผ์—์„œ w0,w1)

์ด๋ ‡๊ฒŒ ์„œ๋กœ ๋‹ค๋ฅธ ๊ฐ€์ค‘์น˜๋ฅผ ํ•™์Šตํ•จ์œผ๋กœ์จ ์—ฌ๋Ÿฌ ์‹œ๊ฐ์˜ ์ •๋ณด๋ฅผ ๊ณจ๊ณ ๋ฃจ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์ด๊ฒƒ์ด multi-head(๋ฉ€ํ‹ฐํ—ค๋“œ)๊ฐ€ ๊ฐ€์ง€๋Š” ์žฅ์ ์ด ๋œ๋‹ค.

 

 

์œ„์˜ ์ˆ˜์‹์ด ๋ฐ”๋กœ ๋ฉ€ํ‹ฐํ—ค๋“œ ์–ดํ…์…˜์˜ ์ˆ˜์‹์ด๋‹ค. ์ตœ์ข…์ ์œผ๋กœ ๊ฐ head์— ๊ตฌํ•ด์ง„ attention value๋ฅผ ์—ฐ๊ฒฐ(concat)ํ•˜์—ฌ ๋˜ ๋‹ค๋ฅธ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์„ ๊ณฑํ•ด ์ตœ์ข… ๊ฒฐ๊ณผ๋ฌผ์„ ์‚ฐ์ถœํ•œ๋‹ค. ์ด๊ฒƒ์ด ๋ฐ”๋กœ ํ•œ attention ์˜ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฌผ์ด ๋œ๋‹ค.

 

* ๋…ผ๋ฌธ์—์„œ ์“ฐ์ธ ๊ตฌ์ฒด์ ์ธ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค(์ž…์ถœ๋ ฅ ๋ฒกํ„ฐ์˜ ์ฐจ์›, Q/K/V์˜ ์ฐจ์›, ํ—ค๋“œ์˜ ์ˆ˜)์€ ๋…ผ๋ฌธ์„ ์ฐธ์กฐํ•˜๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค.

 

Position-wise Feed-Forward Networks

 

ํŠธ๋žœ์Šคํฌ๋จธ์˜ ์ธ์ฝ”๋”, ๋””์ฝ”๋” ์†์˜ ํ•˜์œ„ ๋ ˆ์ด์–ด์—๋Š” ์–ดํ…์…˜ ๋ ˆ์ด์–ด ์ด์™ธ์—๋„ ์™„์ „ ์—ฐ๊ฒฐ๋œ ์ˆœ๋ฐฉํ–ฅ ์‹ ๊ฒฝ๋ง(feed-forward network)์ด ํฌํ•จ๋œ๋‹ค.

์‹ ๊ฒฝ๋ง์˜ ์—ฐ์‚ฐ์€ ์ธํ’‹ ๋ฒกํ„ฐ x์™€ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์˜ ์„ ํ˜•๊ฒฐํ•ฉ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋“ฏ์ด, ์ˆœ๋ฐฉํ–ฅ ์‹ ๊ฒฝ๋ง(FFN)์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ˆ˜์‹์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

FFN์€ ์–ดํ…์…˜ ๋ ˆ์ด์–ด ๋‹ค์Œ์œผ๋กœ ์—ฐ๊ฒฐ๋˜๋ฏ€๋กœ ์—ฌ๊ธฐ์„œ x๋Š” ์œ„์—์„œ ์„ค๋ช…ํ•œ ๋ฉ€ํ‹ฐํ—ค๋“œ ์–ดํ…์…˜์˜ ์ตœ์ข… ์‚ฐ์ถœ๋ฌผ ํ–‰๋ ฌ์ด๋‹ค.

์ถœ์ฒ˜) https://wikidocs.net/31379

 

์œ„์™€ ๊ฐ™์ด x๋Š” ํ•œ ๋ฒˆ์˜ ์„ ํ˜•๊ฒฐํ•ฉ ์ดํ›„ ํ™œ์„ฑํ™” ํ•จ์ˆ˜(activation function)์„ ์ง€๋‚˜ ๋‹ค์‹œ ํ•œ ๋ฒˆ ์„ ํ˜•๊ฒฐํ•ฉ๋œ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ, ์—ฌ๊ธฐ์„œ ๋งค๊ฐœ๋ณ€์ˆ˜๋“ค์ธ ๊ฐ w์™€ b๋Š” ํ•˜๋‚˜์˜ layer ์•ˆ(์ธ์ฝ”๋”, ๋””์ฝ”๋” ๋ธ”๋ก)์—์„œ๋Š” ๊ฐ™์ง€๋งŒ layer๋งˆ๋‹ค ๋‹ค ๋‹ค๋ฅธ ๊ฐ’์„ ๊ฐ€์ง„๋‹ค.

(์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”๋Š” ํ•˜๋‚˜์˜ layer๋กœ, Attention,FFN์ด๋ผ๋Š” sub_layer์„ ๊ฐ€์ง„๋‹ค.)

 

๋…ผ๋ฌธ์—์„œ๋Š” FFN์˜ ์€๋‹‰์ธต์˜ ํฌ๊ธฐ๋ฅผ 2048๋กœ ์„ค์ •ํ–ˆ๋‹ค. 

 

Embeddings and Softmax

 

Like other sequence transduction models, the Transformer has embedding layers that convert the input and output tokens into vectors of dimension d_model (the dimension fed into the encoder and decoder modules).

In addition, before the decoder output is passed through a softmax to predict the probability of the next token, a pre-softmax linear layer maps it back to the vocabulary dimension. In other words, after the result is produced in the chosen model dimension, a linear transformation converts it back to the original token (vocabulary) dimension before the softmax is applied.

The paper also states that the two embedding layers and the pre-softmax linear transformation share the same weight matrix, and that the embedding weights are multiplied by the square root of d_model.
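A rough sketch of this weight sharing (illustrative only; in the model the shared matrix is learned):

```python
import numpy as np

vocab_size, d_model = 10_000, 512
E = np.random.default_rng(0).normal(scale=0.02, size=(vocab_size, d_model))  # shared matrix

def embed(token_ids):
    # Input/output embedding: look up rows of E, scaled by sqrt(d_model).
    return E[np.asarray(token_ids)] * np.sqrt(d_model)

def pre_softmax_logits(decoder_output):
    # Pre-softmax linear transformation: project back to vocabulary size with the same E.
    return decoder_output @ E.T        # (seq_len, vocab_size)
```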

 

 

Positional Encoding

 

ํŠธ๋žœ์Šคํฌ๋จธ๋Š” ์ˆœ์ฐจ์ ์ธ ์ž…๋ ฅ์„ ๋ฐ›๋Š” recurrence ํ˜น์€ convolution ๊ตฌ์กฐ๊ฐ€ ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์—, ๊ทธ order set ๋‚ด์ง€๋Š” position ์ •๋ณด๋ฅผ ๋„ฃ์–ด์ฃผ์–ด์•ผ ํ•œ๋‹ค. (์—ฌ๊ธฐ์—์„œ ๋„ฃ์–ด์ฃผ๋Š” ์ •๋ณด๋Š” ์ƒ๋Œ€์ ์ธ ์œ„์น˜ ์ •๋ณด์ผ ์ˆ˜๋„ ์žˆ๊ณ , ์ ˆ๋Œ€์ ์ธ ์œ„์น˜ ์ •๋ณด์ผ ์ˆ˜๋„ ์žˆ๋‹ค.)

๋”ฐ๋ผ์„œ ํŠธ๋žœ์Šคํฌ๋จธ์—์„œ๋Š” ๋ชจ๋ธ์— embedding vector๋ฅผ ์ธํ’‹์œผ๋กœ ํˆฌ์ž…ํ•˜๊ธฐ ์ „์— position ์ •๋ณด๋ฅผ ๋”ํ•ด์ฃผ๋Š”๋ฐ, ์ด๋ฅผ Positional Encoding์ด๋ผ๊ณ  ํ•œ๋‹ค.

 

Source: https://wikidocs.net/31379

 

 

์œ„ ๊ณผ์ •์€ ์ธํ’‹์œผ๋กœ ๋“ค์–ด๊ฐ€๊ธฐ ์ „, ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ๋ฅผ ๋ชจ์€ ํ–‰๋ ฌ์— postional encoding ๊ฐ’์ด ๋”ํ•ด์ง€๋Š” ๊ฒƒ์„ ํ‘œํ˜„ํ•œ ๊ฒƒ์ด๋‹ค.

(์œ„์—์„œ๋Š” d_model, ์ฆ‰, ์ž…๋ ฅ์˜ ์ฐจ์›์ด 4์ด์ง€๋งŒ ์‹ค์ œ ๋…ผ๋ฌธ์—์„œ๋Š” 512์ž„์„ ์œ ์˜ํ•ด์•ผ ํ•œ๋‹ค.)

์œ„์—์„œ pos๋Š” ์ž…๋ ฅ ๋ฌธ์žฅ(i am a student)์—์„œ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ(ํ–‰ ๋ฒกํ„ฐ)์˜ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ , i๋Š” ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ ๋‚ด์˜ ์ธ๋ฑ์Šค๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

 

The positional encoding values are determined by sine and cosine functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

That is, even dimensions use the sine function and odd dimensions use the cosine function.

This is only one possible formula; positional encoding can be done in many different ways depending on the purpose.

The important point is that, because the positional encoding values differ by position, the same word can end up with a different embedding vector (input) depending on where it appears.

 

 

Why Self-Attention

 

์ด๋ฒˆ ๋‹จ๋ฝ์—์„œ๋Š” ์—ฐ๊ตฌ์ž๋“ค์ด Transformer์˜ ์ค‘์š”ํ•œ ํ•™์Šต ๊ณผ์ •์ธ Self-Attention์„ RNN, CNN๊ณผ ๋น„๊ตํ•˜๋ฉฐ ์žฅ์ ์„ ๋ถ€๊ฐํ•œ๋‹ค.

 

๋น„๊ต๋Š” ๋‹ค์Œ ์„ธ ๊ฐ€์ง€ ์ธก๋ฉด์—์„œ ์ด๋ฃจ์–ด์ง„๋‹ค.

 

1. Total computational complexity per layer

2. The amount of computation that can be parallelized (measured by the minimum number of sequential operations required)

3. Path length between long-range dependencies in the network

 

๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•˜๋Š” ์—ฐ๊ตฌ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

 

 

1. Total computational complexity per layer

 

Self-Attention์€ ์ง€์†์ ์œผ๋กœ ์ผ์ •ํ•˜๊ฒŒ ๋ชจ๋“  position์„ ์—ฐ๊ฒฐํ•˜๊ธฐ ๋•Œ๋ฌธ์— sequence length์˜ ๋‘ ๋ฐฐ ๋งŒํผ ๋ณต์žก๋„๊ฐ€ ์ฆ๊ฐ€ํ•œ๋‹ค. 

๋ฐ˜๋ฉด Recurrent๋‚˜ Convolution์€ sequence length๊ฐ€ ์•„๋‹Œ representation์˜ ์ฐจ์›์˜ ๋‘ ๋ฐฐ ๋งŒํผ ๋ณต์žก๋„๊ฐ€ ์ฆ๊ฐ€ํ•œ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ, NLP์˜ state-of-the-art models๋“ค์€ ๊ฑฐ์˜ sequence length๊ฐ€ representation์˜ ์ฐจ์›๋ณด๋‹ค ์งง๋‹ค. 

๋”ฐ๋ผ์„œ self-attention์ด ๋” ์ ์€ ๋ณต์žก๋„๋ฅผ ๊ฐ€์ง„๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

2. The amount of computation that can be parallelized

 

Self-Attention์€ Recurrent์— ๋น„ํ•ด ํ•„์š”ํ•œ ์ตœ์†Œ ์ž‘์—…๋Ÿ‰์„ n์—์„œ 1๋กœ ์ค„์ธ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

3. Path length between long-range dependencies in the network

 

The most noteworthy comparison is this third one. Learning long-range dependencies effectively is a central challenge in sequential tasks.

One indicator of this ability is the path length: the length of the path that forward and backward signals must traverse in the network. The shorter the path length, the easier it is to learn long-range dependencies.

 

Looking at the comparison, self-attention has the shortest maximum path length among the layer types.

This means self-attention, unlike the other layers, can learn long-range dependencies more effectively.

 

+ To improve computational performance on long sequences, self-attention can also be restricted to a neighborhood of r positions around each position instead of the whole sequence.

As the comparison shows, this improves on plain self-attention in terms of computational complexity.

However, the maximum path length then increases (to roughly n/r hops), and the authors note that they plan to investigate this approach further in future work.

 

(Indeed, several years after this paper was published, research on improving the "efficiency" of the Transformer, its computational complexity and maximum path length, is being actively pursued.)

 

+ ๋…ผ๋ฌธ์—์„œ๋Š” ํŠธ๋žœ์Šคํฌ๋จธ์˜ ์ถ”๊ฐ€์ ์ธ ์žฅ์ ์œผ๋กœ "์„ค๋ช…์ด ๊ฐ€๋Šฅํ•˜๋‹ค"๋Š” ๊ฒƒ์„ ์–ธ๊ธ‰ํ•œ๋‹ค.

 

 

์—ฐ๊ตฌ์ž๋“ค์€ ์–ดํ…์…˜์ด ์ด๋ฃจ์–ด์ง€๋Š” ๋ถ„ํฌ๋ฅผ ์ ๊ฒ€ํ•˜๊ณ  ์ด๋ฅผ ๊ณต์œ ํ•˜์˜€๋Š”๋ฐ, ๋งŽ์€ ๊ฒƒ๋“ค์ด ๋ฌธ์žฅ์˜ ๊ตฌ๋ฌธ๊ณผ ์˜๋ฏธ ๊ตฌ์กฐ์™€ ๊ด€๋ จ๋œ ํ–‰๋™์„ ๋ณด์ธ๋‹ค๊ณ  ์–ธ๊ธ‰ํ•œ๋‹ค. ์œ„ ๊ทธ๋ฆผ์€ ์ธ์ฝ”๋”์—์„œ ์ด๋ฃจ์–ด์ง€๋Š” self-attention ๋ ˆ์ด์–ด ์ค‘ ํ•˜๋‚˜์ธ๋ฐ, ์‹ค์ œ๋กœ "making"์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ "more" , "difficult"์— ๊ฐ•ํ•˜๊ฒŒ ์—ฐ๊ฒฐ๋œ๋‹ค. ์ด๋Š” ๋ฌธ๋ฒ• ๊ตฌ์กฐ์— ๋งž๋Š” ๋ฐฉํ–ฅ์ด๋‹ค.

 

Training

 

1. Optimizer

์—ฐ๊ตฌ์ง„์€ ์˜ตํ‹ฐ๋งˆ์ด์ €๋กœ Adam์„ ์‚ฌ์šฉํ•˜์˜€๊ณ , ์œ„์™€ ๊ฐ™์€ ์ˆ˜์‹์œผ๋กœ ํ•™์Šต๋ฅ ์„ ์กฐ์ •ํ•˜์˜€๋‹ค.

 

2. Regularization

์—ฐ๊ตฌ์ง„์€ ์ •๊ทœํ™” ๊ธฐ๋ฒ•์œผ๋กœ Residual Dropout๊ณผ, Label Smoothing์„ ์ด์šฉํ•˜์˜€๋‹ค.

๋จผ์ € Residual Dropout์€ ๊ฐ ํ•˜์œ„ ๋ ˆ์ด์–ด(sub_layer)์˜ ์•„์›ƒํ’‹๊ณผ ์ž„๋ฒ ๋”ฉ, ํฌ์ง€์…”๋„ ์ธ์ฝ”๋”ฉ์˜ ํ•ฉ์— ์ ์šฉํ•˜์˜€๋‹ค.

๊ทธ๋ฆฌ๊ณ  ํ•™์Šต๊ณผ์ •์—์„œ Lable Smoothing์„ ์ ์šฉํ•˜์˜€๋Š”๋ฐ, ์ด๋Š” ๋ชจ๋ธ์ด ๋” ํ™•์‹คํ•˜์ง€ ์•Š์€ ๊ฒƒ์„ ํ•™์Šตํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ˜ผ๋ž€์„ ๊ฐ€์ค‘์‹œํ‚ค์ง€๋งŒ ์ •ํ™•๋„์™€ BLEU ์ ์ˆ˜๋ฅผ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค๊ณ  ํ•œ๋‹ค.
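A minimal sketch of one common label smoothing formulation (the paper uses a smoothing value of 0.1; the exact target distribution below is my own illustration): the correct class receives probability 1 - eps and the remaining mass is spread uniformly over the other classes.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    # Softened one-hot targets: correct class gets 1 - eps, the rest share eps uniformly.
    targets = np.full((len(target_ids), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(target_ids)), target_ids] = 1.0 - eps
    return targets
```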

 

 

Result

 

ํŠธ๋žœ์Šคํฌ๋จธ๋Š” BLEU scores์—์„œ ์ด์ „ ๋ชจ๋ธ๋“ค๋ณด๋‹ค ๋” ์ข‹์€ sota ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ํŠนํžˆ ๊ตฌ์กฐ๊ฐ€ ํฐ big ํŠธ๋žœ์Šคํฌ๋จธ๋Š” ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ์–ป์€ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ Traning cost์—์„œ๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ธฐ์กด์˜ ๋ชจ๋ธ๋“ค๊ณผ ์•™์ƒ๋ธ” ๋ชจ๋ธ๋“ค๋ณด๋‹ค ํ–ฅ์ƒ๋œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์—ˆ๋‹ค.

 

Model Variations

์—ฐ๊ตฌ์ž๋“ค์€ ํŠธ๋žœ์Šคํฌ๋จธ์˜ ํ•ต์‹ฌ ์š”์†Œ๋“ค์˜ ์ค‘์š”์„ฑ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ์— ์กฐ๊ธˆ์”ฉ ์ฐจ์ด๋ฅผ ์ฃผ๋ฉด์„œ ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ์ข…ํ•ฉํ•˜์˜€๋‹ค.

 

 

The results can be summarized as follows.

 

- (A): Using more heads helps, but performance does not keep improving just because more heads are added.

- (B): A larger key dimension (d_k) performs better than a smaller one.

- (C): Bigger models perform better.

- (D): Dropout helps improve performance.

- (E): Using a positional embedding different from the one adopted in the paper makes little difference in performance.

 

Closing Thoughts

 

It is impossible to discuss today's AI trends without the Transformer. Its impact on sequential tasks, translation in particular, is so widely known that emphasizing it further would almost be redundant. Application models built on the Transformer are still being actively researched, and its performance has been recognized in industry as well.

 

I am interested in NLP, but what interests me most is the role the Transformer will play in time-series data analysis, which is also a sequential task.

 

As a student aiming at the business domain, I am well aware of how large a role time-series analysis can play in business.

In a sense, time-series analysis is the field most exposed to the long-range dependency problem, so the arrival of the Transformer may have been especially welcome news for time-series practitioners.

 

Looking back from 2023, when this review was written, attempts to apply the Transformer to time-series analysis naturally followed its introduction in this 2017 paper.

To be more precise, as time has passed, researchers have identified the limitations of the basic Transformer and have worked both to improve it and to build the characteristics of time-series data into the model.

These efforts have indeed produced a number of improvements, and application models have been published that account for the properties unique to time series, such as autocorrelation and trend, seasonal, and cyclical variation.

 

Understanding these recent trends ultimately requires a solid grasp of the foundation, the basic "vanilla" Transformer, which is why I think it makes sense to review "Attention Is All You Need" first.

 

๋‹ค์Œ ํฌ์ŠคํŠธ์—์„œ๋Š” ์•ž์„œ ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด ์‹œ๊ณ„์—ด ๋ถ„์„์—์„œ ํŠธ๋žœ์Šคํฌ๋จธ๊ฐ€ ์–ด๋–ป๊ฒŒ ์‘์šฉ๋˜์–ด ์™”๋Š” ์ง€๋ฅผ ์ •๋ฆฌํ•œ ๋…ผ๋ฌธ์„ ๋ฆฌ๋ทฐํ•ด๋ณด๊ณ ์ž ํ•œ๋‹ค.

 

 

 

References

16-01 Transformer: https://wikidocs.net/31379

Original paper: Vaswani et al., "Attention Is All You Need" (2017), https://arxiv.org/abs/1706.03762