[Paper Review] Transformers in Time Series: A Survey (2022)


์ด ๋…ผ๋ฌธ์€ ์‹œ๊ณ„์—ด ๋ถ„์„์— ํŠธ๋žœ์Šคํฌ๋จธ๋ฅผ ์ ์šฉํ•ด์˜จ ์—ฐ๊ตฌ๋“ค์„ ์ •๋ฆฌํ•œ ๋…ผ๋ฌธ์ด๋‹ค.

 

Time-series analysis, together with NLP, is one of the representative sequential tasks, and it is applied across many business domains such as finance, manufacturing, and marketing.

Since the Transformer appeared in 2017 and achieved great success in NLP, there has been a movement to apply it to time-series analysis, which is also a sequential task. In particular, because the Transformer was shown to resolve the long-range dependency problem and to perform well on long sequences, it drew attention as a promising way to learn long time series effectively.

 

However, the vanilla Transformer also has several limitations, and the need to adapt the model to the characteristics of time-series analysis became apparent.

Accordingly, a large body of research has modified the vanilla Transformer, from the attention module up to the overall architecture.

In this context, the survey consolidates the research conducted up to 2022 and suggests directions for future work.

 

 

 

Introduction

 

 

๋”ฅ๋Ÿฌ๋‹์—์„œ ํŠธ๋žœ์Šคํฌ๋จธ์˜ ํ˜์‹ ์€ NLP, CV, Speech processing์—์„œ์˜ ํ›Œ๋ฅญํ•œ ํผํฌ๋จผ์Šค์— ํž˜์ž…์–ด ํ•™๊ณ„์˜ ํฐ ๊ด€์‹ฌ์‚ฌ๊ฐ€ ๋˜์—ˆ๋‹ค.

ํŠธ๋žœ์Šคํฌ๋จธ๋Š” sequential data์—์„œ long- range dependencies์™€ interactions์„ ํฌ์ฐฉํ•˜๋Š” ๊ฒƒ์— ๋›ฐ์–ด๋‚œ ์„ฑ๊ณผ๋ฅผ ๋ณด์˜€๋Š”๋ฐ, ์ด ์ ์€ ์‹œ๊ณ„์—ด ๋ชจ๋ธ๋ง ๋ถ„์•ผ์—๋„ ํฐ ๋งค๋ ฅ์œผ๋กœ ๋‹ค๊ฐ€์™”๋‹ค.

 

Over the past few years, various Transformer variants have been proposed to address the challenges of time-series analysis, and they have achieved strong results on time-series tasks such as forecasting, classification, and anomaly detection.

 

However, the authors note that "effectively" capturing temporal dependency while also modeling characteristics of time-series data such as seasonality and trend remains a challenge.

 

In the process of overcoming these limitations and adapting the Transformer to time-series analysis, the authors state that the goal of the survey is to comprehensively organize the ideas and results so far and to offer implications for future research.

 

The table of contents for the rest of this review is as follows.

 


 

1. Brief introduction about vanilla Transformer

 

2. Taxonomy of variants of TS Transformer

 

         2-1 Network modifications

 

                       - Positional Encoding

                       - Attention Module

                       - Architecture

 

         2-2  Application domains

 

                       - Forecasting

                       - Anomaly Detection

                       - Classification

 

3. Experimental Evaluation and Discussion

 

4. Future Research Opportunities

 


 

1. Brief introduction about vanilla Transformer (Preliminaries of the Transformer)

 

 

๋ณธ๊ฒฉ์ ์œผ๋กœ ์‹œ๊ณ„์—ด ๋ถ„์„์—์„œ ํŠธ๋žœ์Šคํฌ๋จธ๊ฐ€ ์ ์šฉ๋˜์–ด ์˜จ ๊ณผ์ •์„ ์ •๋ฆฌํ•˜๊ธฐ ์ „์—, ๋…ผ๋ฌธ์—์„œ๋Š” ๊ธฐ๋ณธ ํŠธ๋žœ์Šคํฌ๋จธ์˜ ์š”์†Œ๋ฅผ ๊ฐ„๋žตํ•˜๊ฒŒ ์„ค๋ช…ํ•˜๊ณ  ์žˆ๋‹ค.

 

๋ณธ ๋…ผ๋ฌธ์—์„œ ์งš๊ณ  ๋„˜์–ด๊ฐ€๋Š” ํŠธ๋žœ์Šคํฌ๋จธ์˜ ๊ตฌ์กฐ์  ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

1. Positional Encoding (injecting position information before attention)

2. Attention Module (the structure of the layers where attention takes place, e.g., self-attention and multi-head attention)

3. Architecture (the way the modules are connected and arranged)

 

Transformers used for forecasting

 

A detailed description of the vanilla Transformer can be found at the link below.

 

ํŠธ๋žœ์Šคํฌ๋จธ ๊ตฌ์กฐ ์ฐธ์กฐ

https://seollane22.tistory.com/20

 


 


2. Taxonomy of Variants (Transformers in Time Series)

 

 

์ด ๋‹จ๋ฝ๋ถ€ํ„ฐ ๋ณธ๊ฒฉ์ ์œผ๋กœ ์‹œ๊ณ„์—ด ํŠธ๋žœ์Šคํฌ๋จธ์˜ ์—ฐ๊ตฌ๋ฅผ ์š”์•ฝํ•˜๊ณ  ์ •๋ฆฌํ•œ๋‹ค. 

 

 

 

 

๋…ผ๋ฌธ์˜ ์ €์ž๋“ค์€ ์‹œ๊ณ„์—ด์—์„œ์˜ ํŠธ๋žœ์Šคํฌ๋จธ๊ฐ€ ์—ฐ๊ตฌ๋˜์–ด ์˜จ ๊ณผ์ •์˜ ์„ค๋ช…ํ•˜๊ธฐ ์œ„ํ•ด ์œ„์™€ ๊ฐ™์€ ๋ถ„๋ฅ˜๋ฅผ ์ œ์‹œํ•œ๋‹ค.

์ด ๋…ผ๋ฌธ์˜ ํ๋ฆ„ ๋˜ํ•œ ์ด ๋ถ„๋ฅ˜์™€ ๊ฐ™์€๋ฐ, ๋จผ์ € ํฌ๊ฒŒ ํŠธ๋žœ์Šคํฌ๋จธ์˜ ๊ตฌ์กฐ(Network)๋ฅผ ์ˆ˜์ •ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ ๋„๋ฉ”์ธ์— ๋”ฐ๋ผ ์‘์šฉ๋œ ์ธก๋ฉด์œผ๋กœ ๋ถ„๋ฅ˜๋œ๋‹ค.

 

2-1. Network Modification

 

First, one line of research applying the Transformer to time-series analysis takes the perspective of modifying and adapting its network structure.

The goal is to modify the vanilla Transformer architecture so as to resolve the various challenges of time-series tasks.

 


 

2-1-1. Positional Encoding

 

ํŠธ๋žœ์Šคํฌ๋จธ๋Š” RNN ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค๊ณผ ๋‹ฌ๋ฆฌ ์ •๋ณด๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์—, ์‹œ๊ณ„์—ด ์ •๋ณด๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ธํ’‹์— time order(์ˆœ์„œ)๋ฅผ ๋„ฃ์–ด์ฃผ๋Š” ์ž‘์—…์ด ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•˜๋‹ค.

์ด๋Ÿฌํ•œ ์ž‘์—…์„ Positional Encoding์ด๋ผ๊ณ  ํ•˜๋Š”๋ฐ, ์ด๋Š” ์œ„์น˜ ์ •๋ณด๋ฅผ ๋ฒกํ„ฐ๋กœ ์ธ์ฝ”๋”ฉํ•˜์—ฌ ์‹œ๊ณ„์—ด ๋ถ„์„์— ์“ฐ์ผ ์ธํ’‹์— ๋”ํ•ด์ฃผ๋Š” ๊ณผ์ •์ด๋‹ค. 

๊ทธ๋Ÿฐ๋ฐ ์‹œ๊ณ„์—ด ๋ชจ๋ธ๋ง์— ์žˆ์–ด์„œ "์–ด๋–ป๊ฒŒ ์œ„์น˜์ •๋ณด๋ฅผ ์–ป๋Š” ์ง€"์— ๋”ฐ๋ผ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์„ธ ๊ฐ€์ง€ ๋ถ„๋ฅ˜๊ฐ€ ์žˆ๋‹ค.

 

- Vanilla Positional Encoding

 

Vanilla positional encoding reuses the scheme of the vanilla Transformer as-is.

Rather than incorporating any relative information or time-event information, it simply encodes the order in which the inputs appear and adds it to the input. This can extract the information carried by ordered data that reflects the flow of time, but it cannot fully exploit the important features of time-series data.
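For reference, here is a minimal sketch (not taken from the survey) of the fixed sinusoidal encoding used by the vanilla Transformer; it assumes PyTorch and an even d_model, and the result is simply added to the input embedding.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Vanilla (fixed, hand-crafted) positional encoding; assumes an even d_model."""
    position = torch.arange(seq_len).unsqueeze(1)                                   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                                    # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                                    # odd dimensions
    return pe

# x: (batch, seq_len, d_model) time-series embedding
# x = x + sinusoidal_positional_encoding(x.size(1), x.size(2))
```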

 

- Learnable Positional Encoding

 

Vanilla positional encoding simply applies a fixed function to the data order, and such a hand-crafted method has limited expressive power for position information.

Accordingly, several studies have found that a "learnable" positional embedding is more effective.

This approach is more flexible than the vanilla scheme and has the advantage that it can be adapted to various objectives, for example learning the points in time when specific events occur.
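A minimal sketch of such a learnable positional embedding (illustrative only, not the implementation of any particular paper) could look like this:

```python
import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
    """Position indices map to embedding vectors trained jointly with the model,
    instead of being produced by a fixed, hand-crafted function."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_embedding = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)   # (seq_len,)
        return x + self.pos_embedding(positions)                # broadcast over the batch
```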

 

The concrete studies the survey introduces in this regard are as follows.

[Zerveas et al., 2021] introduced an embedding layer in which the positional indices of the embedding vectors are learned jointly with the other model parameters.

[Lim et al., 2021] introduced an LSTM network to better extract the sequential order of time-series data.

 

 

- Timestamp Encoding

 

์‹ค์ƒํ™œ์˜ ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋กœ ๋ชจ๋ธ๋ง ํ•  ๋•Œ ํƒ€์ž„์Šคํƒฌํ”„ ์ •๋ณด๊ฐ€ ๊ฐ€์žฅ ์ ‘๊ทผ, ์ถ”์ถœํ•˜๊ธฐ ์šฉ์ดํ•˜๋‹ค.

ํƒ€์ž„์Šคํƒฌํ”„๋Š” ๋‹ฌ๋ ฅ์— ๊ธฐ์ž…ํ•˜๋Š” ์‹œ๊ฐ„ ์ •๋ณด(์ผ, ์ฃผ๋ง, ์›”, ์—ฐ๋„ ๋“ฑ)๋‚˜ ์–ด๋– ํ•œ ํŠน์ •ํ•œ ์ด๋ฒคํŠธ(๊ฑฐ๋ž˜ ๋งˆ๊ฐ์ผ, ์„ธ์ผ ๊ธฐ๊ฐ„)์™€ ๊ฐ™์ด ์‹œ๊ณ„์—ด์—์„œ์˜ ์ฃผ๊ธฐ์ ์ธ ํฌ์ธํŠธ๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

์ด๋Ÿฌํ•œ ํƒ€์ž„์Šคํƒฌํ”„๋Š” ์‹ค์ƒํ™œ์˜ ์‘์šฉ์—์„œ ๋งค์šฐ ์œ ์˜๋ฏธํ•  ๋•Œ๊ฐ€ ๋งŽ์ง€๋งŒ ๊ธฐ๋ณธ ํŠธ๋žœ์Šคํฌ๋จธ ๋ชจ๋ธ์—์„œ๋Š” ์ด์™€ ๊ฐ™์€ ํฌ์ธํŠธ๋ฅผ ์ด์šฉํ•˜์ง€ ๋ชปํ•ด์™”๋‹ค.

์ด์— ๋‹ค์–‘ํ•œ ๋ณ€ํ˜• ๋ชจ๋ธ๋“ค์ด positional encoding ๊ณผ์ •์—์„œ ์ด๋ฅผ ์ด์šฉํ•˜๊ณ ์ž ์‹œ๋„ํ•˜์˜€๋‹ค.

 

Informer [Zhou et al., 2021]

์ด๋Š” ํŠธ๋žœ์Šคํฌ๋จธ์˜ ๋ชจ๋“ˆ, ์•„ํ‚คํ…Œ์ฒ˜(๊ตฌ์กฐ)๋ฅผ ์‹œ๊ณ„์—ด ๋ถ„์„์— ์šฉ์ดํ•˜๋„๋ก ์ „๋ฐฉ์œ„์ ์œผ๋กœ ๊ฐœ์กฐํ•œ ๋ชจ๋ธ์ธ๋ฐ, ์ด ๋ชจ๋ธ์„ ์ œ์•ˆํ•œ ๋…ผ๋ฌธ์—์„œ๋Š” ์ธ์ฝ”๋”ฉ ๊ณผ์ •์—์„œ ํ•™์Šต๊ฐ€๋Šฅํ•œ ์ž„๋ฒ ๋”ฉ ๋ ˆ์ด์–ด๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ ์ด ํƒ€์ž„์Šคํƒฌํ”„๋ฅผ ํ•™์Šตํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค.

๋˜ํ•œ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ Autoformer [Wu et al., 2021] and FEDformer [Zhou et al., 2022]์—์„œ๋„ ๋น„์Šทํ•œ ํƒ€์ž„์Šคํƒฌํ”„ ์ธ์ฝ”๋”ฉ ๋ฐฉ์‹์„ ์ถ”๊ฐ€ํ•˜์˜€๋‹ค. 
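In the same spirit, a hypothetical sketch of a timestamp embedding is shown below: calendar features extracted from each timestamp are embedded and summed, then added to the value and positional embeddings. The feature choice (month, day, weekday, hour) is an assumption for illustration, not any paper's exact design.

```python
import torch
import torch.nn as nn

class TimestampEmbedding(nn.Module):
    """Hypothetical timestamp encoding: embed calendar features and sum them."""
    def __init__(self, d_model: int):
        super().__init__()
        self.month = nn.Embedding(13, d_model)    # 1..12
        self.day = nn.Embedding(32, d_model)      # 1..31
        self.weekday = nn.Embedding(7, d_model)   # 0..6
        self.hour = nn.Embedding(24, d_model)     # 0..23

    def forward(self, ts: torch.Tensor) -> torch.Tensor:
        # ts: (batch, seq_len, 4) long tensor holding [month, day, weekday, hour]
        return (self.month(ts[..., 0]) + self.day(ts[..., 1])
                + self.weekday(ts[..., 2]) + self.hour(ts[..., 3]))
```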

 


 

2-1-2. Attention Module

 

The second item of network modification in time-series Transformers is the attention module.

Beyond positional encoding, researchers have continued to modify the attention module in order to compensate for the limitations of the vanilla Transformer and to handle the various challenges of time-series analysis.

 

์–ดํ…์…˜ ๋ชจ๋“ˆ(ํŠนํžˆ self attention)์€ ํŠธ๋žœ์Šคํฌ๋จธ์˜ ํ•ต์‹ฌ ์š”์†Œ๋กœ์„œ input ์ „์ฒด๋ฅผ ํ›‘์œผ๋ฉฐ ์œ ์‚ฌ๋„๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ๊ฐ€์ค‘์น˜๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. ์ด๋Š” ๋งˆ์น˜ ์™„์ „ ์—ฐ๊ฒฐ๋œ ์‹ ๊ฒฝ๋ง๊ณผ ๊ฐ™์ด maximum path length๋ฅผ ๊ณต์œ ํ•˜๋ฉฐ long-range dependency๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํŒŒ์•…ํ•˜๋Š” ์žฅ์ ์ด ์žˆ๋‹ค.

 

However, this advantage comes at a large cost: the computational complexity is quadratic in the sequence length N.

This quadratic complexity causes a computational bottleneck and hurts memory efficiency.

Since N grows as the time series gets longer, many studies have worked on improving this efficiency.
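The quadratic cost is easy to see from the shapes involved; a toy example follows (assumptions: a single head and no batch dimension, for illustration only).

```python
import torch

N, d = 4096, 64                            # sequence length, head dimension
q = torch.randn(N, d)
k = torch.randn(N, d)
v = torch.randn(N, d)

scores = q @ k.T / d ** 0.5                # (N, N): memory and compute grow as O(N^2)
out = torch.softmax(scores, dim=-1) @ v    # (N, d)
```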

 

Research on making the Transformer's attention module more efficient is broadly divided into introducing a sparsity bias and exploiting the low-rank property.

 

 

Introducing "Sparsity bias" into the attention mechanism

 

ํŠธ๋žœ์Šคํฌ๋จธ ์–ดํ…์…˜ ๋ชจ๋“ˆ์˜ ๊ณ„์‚ฐ ๋ณต์žก๋„๋ฅผ ์™„ํ™”ํ•˜๋Š” ์ฒซ ๋ฒˆ์งธ๋Š” ๋Œ€์•ˆ์œผ๋กœ Sparsity bias๊ฐ€ ์ œ์•ˆ๋˜์—ˆ๋‹ค.

Sparsity bias๋Š” ๋ณดํ†ต ๊ณผ์ ํ•ฉ์„ ๋ฐฉ์ง€ํ•˜๊ฑฐ๋‚˜ ๊ณ„์‚ฐ์ด ์ง€๋‚˜์น˜๊ฒŒ ๋งŽ์•„์ง€๋Š” ๊ฒƒ์„ ๋ง‰๊ธฐ ์œ„ํ•ด ๋„์ž…ํ•˜๊ณค ํ•˜๋Š”๋ฐ, ์–ดํ…์…˜์—์„œ๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ณ„์‚ฐ ๋ณต์žก๋„๋ฅผ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ด๋ฅผ ๋„์ž…ํ•˜๋ ค๋Š” ์‹œ๋„๊ฐ€ ์žˆ์—ˆ๋‹ค. 

์ด๋Š” ์™„์ „ ์—ฐ๊ฒฐ๋œ attention,์ฆ‰, ๋ชจ๋“  ํฌ์ง€์…˜์— ๋Œ€ํ•ด ์œ ์‚ฌ๋„๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ Sparsity bias๋ฅผ ํ†ตํ•œ ํฌ์ง€์…˜๋งŒ ๊ณ„์‚ฐ์— ์ด์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

 

์•„๋ž˜์˜ ๊ทธ๋ฆผ์€ ์ด๋Ÿฌํ•œ ์•„์ด๋””์–ด๋ฅผ ์ž˜ ์ดํ•ดํ•˜๋„๋ก ๋•๋Š”๋‹ค.

(a)๋Š” ๊ธฐ๋ณธ์ ์ธ ํŠธ๋žœ์Šคํฌ๋จธ์—์„œ ์ˆ˜ํ–‰ํ•˜๋Š” self-attention์ด๋‹ค. ์ด์™€ ๋‹ฌ๋ฆฌ, ๋‚˜๋จธ์ง€ ๋ชจํ˜•์—์„œ๋Š” ์ „์ฒด๊ฐ€ ์•„๋‹Œ ํŽธํ–ฅ์„ ํ†ตํ•ด ์„ ๋ณ„ํ•œ ํฌ์ง€์…˜์—๋งŒ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ณ„์‚ฐ ๋ณต์žก๋„๋ฅผ ์™„ํ™”ํ•œ๋‹ค.

 

The sparsity bias was proposed in Transformer variants such as the following (a rough sketch of the idea follows the list).

- LogTrans [Li et al., 2019]

- Pyraformer [Liu et al., 2022a]
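As a rough illustration of the sparsity-bias idea, in the spirit of LogTrans's log-sparse pattern but not an exact reproduction of either paper, one can mask the score matrix so that each query attends only to a logarithmic number of past positions:

```python
import torch

def log_sparse_mask(seq_len: int) -> torch.Tensor:
    """Each position i attends only to itself and to positions i-1, i-2, i-4, i-8, ...,
    reducing the attended keys per query from O(N) to O(log N)."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        mask[i, i] = True
        step = 1
        while i - step >= 0:
            mask[i, i - step] = True
            step *= 2
    return mask   # True = keep this (query, key) pair in the attention computation

# usage: scores.masked_fill_(~log_sparse_mask(N), float('-inf')) before the softmax
```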

 

 

Exploring the low-rank property of the self-attention

 

ํŠธ๋žœ์Šคํฌ๋จธ๋ฅผ ํšจ์œจํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์–ดํ…์…˜ ๋ชจ๋“ˆ์„ ์ˆ˜์ •ํ•˜๋Š” ๋‘ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ ๋ฐ”๋กœ Low-rank property๋ฅผ ์ฐพ์•„ ์ด๋ฅผ ๊ณ„์‚ฐ์—์„œ ์ œ์™ธํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

 

์ด์™€ ๊ฐ™์€ ๋ฐฉ์‹์„ ์ œ์•ˆํ•œ ํŠธ๋žœ์Šคํฌ๋จธ ๋ณ€ํ˜•๋ชจ๋ธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

- Informer [Zhou et al., 2021]

- FEDformer [Zhou et al., 2022]

 

In particular, Informer, which first proposed and implemented this idea, gives a logical explanation of why such an approach is needed.

The Informer authors point out that the first approach above, introducing a sparsity bias, is a heuristic in which human judgment intervenes.

They therefore argue that, for this kind of "selective computation", a more principled approach is to compute an importance measure with a formula and use only the entries of higher importance. Informer names this scheme "ProbSparse attention".

As a result, this approach achieved faster computation than the vanilla Transformer.

 

๋ณธ ๋…ผ๋ฌธ์€ ์ด ๋‹จ๋ฝ์˜ ๋งˆ์ง€๋ง‰์—์„œ ๊ฐ์ข… ๋ณ€ํ˜• ๋ชจ๋ธ๋“ค๊ณผ ๊ทธ ๋ณต์žก๋„๋ฅผ ๋น„๊ตํ•˜์—ฌ ์ •๋ฆฌํ•˜๊ณ  ์žˆ๋‹ค.

์œ„ ๊ฒฐ๊ณผ๋Š” ์–ดํ…์…˜ ๋ชจ๋“ˆ์„ ์ˆ˜์ •ํ•œ ๋ณ€ํ˜• ๋ชจ๋ธ๋“ค์ด quadratic complexity๋ฅผ ๊ฐ€์ง€๋Š” ๊ธฐ๋ณธ์ ์ธ ํŠธ๋žœ์Šคํฌ๋จธ์— ๋น„ํ•ด ๋” ์™„ํ™”๋œ ๋ณต์žก๋„๋ฅผ ๊ฐ€์ง„๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.

 

 


 

2-1-3. Architecture-based Attention Innovation

 

The last type of network modification in the development of time-series Transformers is modifying the architecture.

Unlike the second approach of redesigning the attention module itself, this modifies the structure that connects the attention modules.

The survey notes that recent studies introduce a hierarchical architecture into Transformers for time-series analysis.

 

- Informer [Zhou et al., 2021]

The Informer paper proposes an architecture that inserts max-pooling layers between attention blocks.

The goal is to extract and pass on only the important information; halving the sampled series also reduces memory usage.

The paper refers to this as a "distilling" operation (knowledge distillation).
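A simplified sketch of this distilling step is shown below (assumptions: PyTorch, with convolution and pooling hyperparameters chosen for illustration); the point is that each block roughly halves the sequence passed to the next attention layer.

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Convolution + activation + max-pooling between attention blocks,
    halving the sequence length so deeper layers see a condensed series."""
    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, seq_len // 2, d_model)
        x = x.transpose(1, 2)            # Conv1d/MaxPool1d expect (batch, channels, length)
        x = self.pool(self.act(self.conv(x)))
        return x.transpose(1, 2)
```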

 

 

 

Informer์˜ max-pooling layer

 

- Pyraformer [Liu et al., 2022a]

The Pyraformer paper designs a C-ary tree-based attention mechanism as a new Transformer architecture.

Also called pyramidal attention, its key feature is that it builds intra-scale attention and inter-scale attention at the same time.

That is, this structure learns with attention both within a scale and across scales, which lets it capture temporal dependencies across different resolutions effectively while also keeping the computation efficient.

 

Figure: Architectures of the vanilla Transformer and Pyraformer

 


 

2-2. Application Domain

 

Besides the network modifications discussed so far, the other major direction of time-series Transformer research is the application domain.

 

 

์œ„ ๊ทธ๋ฆผ์˜ ์˜ค๋ฅธ์ชฝ ๊ฐ€์ง€์ฒ˜๋Ÿผ, ์‹œ๊ณ„์—ด task์˜ domain์€ ํฌ๊ฒŒ Forecasting, Anomaly Detection, Classification์ด ์žˆ๋‹ค.

 


 

2-2-1. Forecasting

 

Forecasting is the most basic and most prominent area of time-series analysis.

In tackling forecasting, the main direction for Transformers has been to build module-level and architecture-level variants.

This is the same direction as the attention-module and architecture modifications discussed under network modification above.

Many variant Transformers have been proposed along these two directions to improve forecasting performance and make the computation more efficient.

 

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ •๋ฆฌํ•œ ๋ชจํ˜•๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

 

1-1) New Module Design


Introducing a sparsity inductive bias or a low-rank approximation

- LogTrans [Li et al., 2019] 
- Informer [Zhou et al., 2021]
- AST [Wu et al., 2020a]
- Pyraformer [Liu et al., 2022a]
- Quatformer [Chen et al., 2022]
- FEDformer [Zhou et al., 2022]

์ด ๋ชจ๋ธ๋“ค์ด ์ถ”๊ตฌํ•˜๋Š” ๋ชฉํ‘œ์™€ ๊ทธ ์ด์œ ๋Š” ์œ„ network modification์—์„œ ๋…ผ์˜ํ•œ ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.

์ด๋“ค์€ "์žฅ๊ธฐ ์‹œ๊ณ„์—ด"์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฌธ์ œ์—์„œ Sparsity inductive bias์™€  low-rank approximation๋ฅผ ํ†ตํ•ด "๋ฉ”๋ชจ๋ฆฌ ํšจ์œจํ™”"์™€ "๊ณ„์‚ฐ ์†๋„์˜ ํ–ฅ์ƒ"์„ ์ด๋ฃจ์–ด๋ƒˆ๋‹ค. 

 

1-2) Modifying the normalization mechanism

 

Addressing the over-stationarization problem

- Non-stationary Transformer [Liu et al., 2022b]

 

 

 

1-3) Utilizing the bias for token input

 

Segmentation-based representation mechanism (making the most of the bias of the input tokens)

Simple "seasonal-trend decomposition architecture" with an auto-correlation mechanism (reflecting the characteristics of time-series data)

 

- Autoformer [Wu et al., 2021]

Autoformer์˜ ์•„ํ‚คํ…Œ์ฒ˜

Autoformer์˜ ๊ตฌ์กฐ๋Š” ๋‹ค๋ฅธ ์‹œ๊ณ„์—ด ํŠธ๋žœ์Šคํฌ๋จธ ๋ณ€ํ˜•๋ชจ๋ธ๋“ค๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ, ์ „ํ†ต์ ์ธ ์‹œ๊ณ„์—ด ๋ถ„์„ ๋ฐฉ๋ฒ•์„ ํฌํ•จํ•˜๊ณ  ์žˆ๋‹ค.

์ด ๋ชจ๋ธ์€ ๋Œ€๋ถ€๋ถ„์˜ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๊ฐ€ ํ•„์—ฐ์ ์œผ๋กœ ๊ฐ€์ง€๊ฒŒ ๋˜๋Š” ์ž๊ธฐ์ƒ๊ด€์„ฑ(Auto-Correlation)๊ณผ Seasonal(๊ณ„์ ˆ๋ณ€๋™), Trend(์ถ”์„ธ๋ณ€๋™)์˜ ๊ฐœ๋…์„ ์ ํ•ฉํ•˜์—ฌ ๋งค์šฐ ํšจ๊ณผ์ ์ธ ๋ถ„์„ ๋งค์ปค๋‹ˆ์ฆ˜์„ ๊ตฌ์ถ•ํ•˜์˜€๋‹ค.

 

 

Meanwhile, beyond the numerical forecasting problems above, Transformers for forecasting are also being actively studied in areas such as spatio-temporal forecasting and event forecasting.


 

2-2-2. Anomaly Detection

 

ํŠธ๋žœ์Šคํฌ๋จธ๋Š” ์‹œ๊ณ„์—ด ์ด์ƒํƒ์ง€์—์„œ๋„ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค€๋‹ค.

 

- ํŠธ๋žœ์Šคํฌ๋จธ์™€ neural generative model์„ ๊ฒฐํ•ฉํ•œ ๋ชจ๋ธ

1. TranAD [Tuli et al., 2022]

์ด ๋ชจ๋ธ์€ ๊ธฐ๋ณธ์ ์ธ ํŠธ๋žœ์Šคํฌ๋จธ๊ฐ€ ์ด์ƒ์น˜์˜ ์ž‘์€ ํŽธ์ฐจ๋ฅผ ๋†“์น˜๋Š” ๊ฒƒ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด "adversarial train"(์ ๋Œ€์  ํ›ˆ๋ จ)์„ ํ†ตํ•ด recostruction error๋ฅผ ์ฆํญํ•˜์—ฌ ํ•™์Šตํ•˜๊ฒŒ ํ•œ๋‹ค.  

 

2. MT-RVAE [Wang et al., 2022]

3. TransAnomaly [Zhang et al., 2021]

These two studies both combine the Transformer with a VAE (Variational Auto-Encoder), but for different purposes.

TransAnomaly [Zhang et al., 2021] combines the VAE to allow more parallelization and to reduce training cost, while MT-RVAE [Wang et al., 2022] aims to effectively combine and extract time-series information at different scales.

 

 

4. GTA [Chen et al., 2021c]

This study combines the Transformer with a graph-based learning architecture.

 


 

2-2-3. Classification

 

ํŠธ๋žœ์Šคํฌ๋จธ๋Š” long-range dependency๋ฅผ ํฌ์ฐฉํ•˜๋Š” ๋ฐ ์ข‹์€ ์„ฑ๋Šฅ์„ ๊ฐ€์ง€๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ์‹œ๊ณ„์—ด ๋ถ„๋ฅ˜์—๋„ ๋งค์šฐ ํšจ๊ณผ์ ์ด๋‹ค.

 

GTN [Liu et al., 2021]

This study proposes a two-tower Transformer in which one tower performs time-step-wise attention and the other channel-wise attention. To merge the features of the two towers, a learnable weighted concatenation, also called "gating", is used (a sketch of the gating idea is shown below). The model is reported to achieve state-of-the-art performance on time-series classification.
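As mentioned above, a hypothetical sketch of such a gating step could look like this; the exact GTN formulation may differ, and the names and shapes here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TwoTowerGating(nn.Module):
    """Learnable weighted concatenation of a time-step-wise tower output
    and a channel-wise tower output before the final classifier."""
    def __init__(self, d_time: int, d_channel: int):
        super().__init__()
        self.gate = nn.Linear(d_time + d_channel, 2)   # one logit per tower

    def forward(self, h_time: torch.Tensor, h_channel: torch.Tensor) -> torch.Tensor:
        # h_time: (batch, d_time), h_channel: (batch, d_channel)
        w = torch.softmax(self.gate(torch.cat([h_time, h_channel], dim=-1)), dim=-1)
        return torch.cat([h_time * w[:, 0:1], h_channel * w[:, 1:2]], dim=-1)
```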

 

[Rußwurm and Korner, 2020]

This study addresses raw optical satellite time-series classification and proposes a self-attention-based Transformer for it.

 

[Yuan and Lin, 2020]

[Zerveas et al., 2021]

[Yang et al., 2021]

์ด ์—ฐ๊ตฌ๋“ค์€ ์‹œ๊ณ„์—ด ๋ถ„๋ฅ˜์— ์žˆ์–ด์„œ Pre-trained Transformers, ์ฆ‰, ์‚ฌ์ „ํ•™์Šต๋œ ํŠธ๋žœ์Šคํฌ๋จธ๋ฅผ ๋„์ž…ํ•˜์˜€๋‹ค.

 

 


 

3. Experimental Evaluation and Discussion

 

 

์ €์ž๋“ค์€ ์‹œ๊ณ„์—ด ํŠธ๋žœ์Šคํฌ๋จธ๋ฅผ ์š”์•ฝ, ์ •๋ฆฌํ•˜๋Š” ๊ฒƒ์˜ ๋งˆ์ง€๋ง‰์œผ๋กœ ๊ฐ ๋ณ€ํ˜•๋ชจ๋ธ๋“ค์„ ๋น„๊ตํ•˜๋ฉฐ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ํ…Œ์ŠคํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค.

ํ…Œ์ŠคํŠธ์— ์“ฐ์ธ ๋ฐ์ดํ„ฐ๋Š” ์‹œ๊ณ„์—ด ๋ถ„์„์—์„œ ์œ ๋ช…ํ•œ ๋ฒค์น˜๋งˆํ‚น ๋ฐ์ดํ„ฐ์ธ ETTm2 [Zhou et al., 2021] ๋ฐ์ดํ„ฐ์ด๋‹ค.

 

๊ณ ์ „์ ์ธ ํ†ต๊ณ„๋ชจ๋ธ์ธ ARIMA๋‚˜ CNN, RNN๊ณผ ๊ฐ™์€ ๊ธฐ์กด์˜ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ๋“ค์€ ์ด๋ฏธ Informer [Zhou et al., 2021]์—์„œ ํŠธ๋žœ์Šคํฌ๋จธ์— ๋น„ํ•ด ์—ด๋“ฑํ•œ ์„ฑ๋Šฅ์„ ๊ฐ€์กŒ๋‹ค๋Š” ๊ฒƒ์ด ์ž…์ฆ๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์—, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํŠธ๋žœ์Šคํฌ๋จธ ๋ณ€ํ˜•๋ชจ๋ธ๋“ค์— ์ง‘์ค‘ํ•˜๊ณ  ์žˆ๋‹ค. 

 

 

3-1) Robustness Analysis

 

 

Table 2์—์„œ ์•Œ ์ˆ˜ ์žˆ๋Š” ๊ฒฐ๊ณผ๋ฅผ ์š”์•ฝํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

- ๋Œ€์ฒด์ ์œผ๋กœ Vanilla Transformer์— ๋น„ํ•ด ๋ณ€ํ˜•๋ชจ๋ธ๋“ค, ํŠนํžˆ Autoforemr์˜ ์„ฑ๋Šฅ(Forecasting Power)์ด ๋” ๋›ฐ์–ด๋‚˜๋‹ค.

- ๋ชจ๋“  ๋ชจ๋ธ๋“ค์ด Input Len์ด ํด ์ˆ˜๋ก, ์ฆ‰, ์žฅ๊ธฐ ์‹œ๊ณ„์—ด๋กœ ๊ฐˆ์ˆ˜๋ก ์„ฑ๋Šฅ์ด ํ•˜๋ฝํ•˜๋Š” ์ถ”์„ธ๋ฅผ ๋ณด์ธ๋‹ค.

 

"์ „ํ†ต์ ์ธ ์‹œ๊ณ„์—ด ๋ถ„์„์˜ ํŠน์ง•์„ ์ž…ํžŒ" Autoformer์˜ ์„ฑ๋Šฅ์ด ๊ฐ€์žฅ ์ข‹๋‹ค๋Š” ๊ฒƒ์€ ์˜๋ฏธ์žˆ๋Š” ์‹œ์‚ฌ์ ์„ ๋˜์ ธ์ฃผ๊ณ  ์žˆ๋‹ค.

์ด์—, ๊ธฐ์กด ํŠธ๋žœ์Šคํฌ๋จธ์˜ ๊ตฌ์กฐ์— ์ „ํ†ต์ ์ธ ์‹œ๊ณ„์—ด ๋ถ„์„์˜ ๋ฐฉ๋ฒ•๋ก ์ด๋‚˜ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ์„ ๊ฒฐํ•ฉํ•˜๋ ค๋Š” ์—ฐ๊ตฌ๋ฅผ ์ง€์†ํ•ด์•ผ ํ•œ๋‹ค.

๋˜ํ•œ ์—ฌ์ „ํžˆ ๊ธด Input์—์„œ๊นŒ์ง€ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๋ชจ๋ธ์ด ๋ถ€์กฑํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๋” ๊ธด ์‹œ๊ณ„์—ด์˜ Time-dependency๋ฅผ ํฌ์ฐฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ์•ˆ์„ ๊ณ„์† ๊ณ ๋ฏผํ•ด์•ผ ํ•œ๋‹ค.

 

 


3-2) Model Size Analysis

 

 

One reason Transformers performed so well in NLP and CV is that their model size can be scaled up dramatically.

Model size is usually controlled by the number of layers; in NLP and CV it is common to choose between 12 and 128 layers.

 

 

 

Table 3์˜ ๊ฒฐ๊ณผ๋ฅผ ์š”์•ฝํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

- Models with 3 to 6 layers show the best performance.

- Increasing the number of layers does not improve performance.

 

The tests revealed that, unlike in other fields, a larger model capacity does not guarantee better performance for time-series Transformers.

Investigating the cause and designing architectures that can scale performance through deeper layers will therefore be an important research direction.


 

3-3) Seasonal-Trend Decomposition Analysis

 

 

Recently, researchers such as [Wu et al., 2021; Zhou et al., 2022; Lin et al., 2021; Liu et al., 2022a] have suggested that decomposing the various components of a time series can be key to Transformer performance.

Table 4 compares the performance of the original versions against versions equipped with the simple moving-average seasonal-trend decomposition architecture proposed in Autoformer [Wu et al., 2021].

 

Looking at the rightmost "promotion" column, applying the decomposition improves performance over the original versions by roughly 50% to 80%. These results suggest that decomposition is very likely to be a key factor in the performance of time-series Transformers going forward.

The survey therefore emphasizes that designing more advanced time-series decomposition schemes will be very important in future research.

 


 

4. Future Research Opportunities

 

 

๋ณธ ๋…ผ๋ฌธ์€ ๋งˆ์ง€๋ง‰์œผ๋กœ ์ง€๊ธˆ๊นŒ์ง€ ์‹œ๊ณ„์—ด ํŠธ๋žœ์Šคํฌ๋จธ๋ฅผ ์ •๋ฆฌํ•ด์˜จ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์‹œ์‚ฌ์ ๊ณผ ์•ž์œผ๋กœ์˜ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ์„ ์ œ์‹œํ•ด์ฃผ๊ณ  ์žˆ๋‹ค.

 

1. Inductive Biases for Time Series Transformers

- ๋…ธ์ด์ฆˆ๋ฅผ ํ†ต์ œํ•˜๊ณ  ํšจ์œจ์ ์œผ๋กœ ์‹ ํ˜ธ๋ฅผ ์ถ”์ถœํ•˜๋ผ. (based on understanding Time-Series Data and tasks)

2. Transformers and GNN(Graph-neural-network) for Time Series

- Combine GNNs, not only for numerical forecasting models but also to develop efficient state-space style models.

3.  Pre-trained Transformers for Time Series

- Build appropriate pre-trained models for each task and domain.

 

4. Transformers with Architecture Level Variants

- Most current research focuses on modifying modules; the Transformer architecture itself also needs to be designed to fit time series.

5. Transformers with NAS (Neural Architecture Search) for Time Series

- The Transformer's hyperparameters (embedding dimension, number of heads, number of layers) need to be optimized for time-series analysis.

 


 

In Closing

 

์ด ๋…ผ๋ฌธ์€ "ํŠธ๋žœ์Šคํฌ๋จธ๋ฅผ ์‹œ๊ณ„์—ด ๋ถ„์„์— ์ ์šฉํ•ด๋ณด๋ฉด ์–ด๋–จ๊นŒ?"๋ผ๋Š” ํ•„์ž์˜ ์ผ์ฐจ์ ์ด๊ณ  ๋ง‰์—ฐํ•œ ๊ถ๊ธˆ์ฆ์„ ํ•ด์†Œํ•ด์ฃผ์—ˆ์Œ์€ ๋ฌผ๋ก , ์ง€๊ธˆ๊นŒ์ง€์˜ ํŠธ๋ Œ๋“œ์™€ ํ–ฅํ›„ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ๋“ค๊นŒ์ง€ ๊ทธ ์‹œ์•ผ๋ฅผ ๋„“ํž ์ˆ˜ ์žˆ๋Š” ์ข‹์€ ๊ณ„๊ธฐ๊ฐ€ ๋˜์—ˆ๋‹ค.

 

๋”ฐ๋ผ์„œ ์ด ๋…ผ๋ฌธ์„ ๋จผ์ € ๋ฆฌ๋ทฐํ•˜๋Š” ๊ฒƒ์€ ์ง€๊ธˆ๊นŒ์ง€ ๋ฐœํ‘œ๋œ ํŠธ๋žœ์Šคํฌ๋จธ์˜ ๋ณ€ํ˜•๋“ค์„ ํ•˜๋‚˜ํ•˜๋‚˜ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๊ธฐ ์ „์— ์ „์ฒด์ ์ธ ํ๋ฆ„๊ณผ ๋ฐฉํ–ฅ์„ ๋จผ์ € ์งš์–ด์ฃผ๋Š” ์˜๋ฏธ๋ฅผ ๊ฐ€์งˆ ๊ฒƒ์ด๋ผ ๊ธฐ๋Œ€ํ•œ๋‹ค.

 

์ง€๋‚˜์ณ์˜จ ๊ณผ์ •๋“ค, ์ด ๋…ผ๋ฌธ์—์„œ ์–ธ๊ธ‰ํ•˜๋Š” ์—ฌ๋Ÿฌ ๋ณ€ํ˜•๋“ค์€ ์ด์–ด์ง€๋Š” ๋‹ค๋ฅธ ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ์—์„œ ์ž์„ธํ•˜๊ฒŒ ๋‹ค๋ฃฐ ์˜ˆ์ •์ด๋‹ค.

 

 

Original paper

https://arxiv.org/abs/2202.07125

 
