
[Paper Review] T5 / Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

prefer_all 2023. 1. 2. 22:17
๐Ÿ’ฌ ๋“ค์–ด๊ฐ€๊ธฐ์— ์•ž์„œ

- Text to text๋ž€?

text ํ˜•ํƒœ๋กœ ์ฃผ์–ด์ง„ ๋ฌธ์ œ์—์„œ text ์ •๋‹ต ์ฐพ๊ธฐ

- Transfer Learning in NLP
์œ„ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์Œ ๋‘ ๋ชจ๋ธ์„ ๋น„๊ตํ•˜๋ฉฐ ์ „๊ฐœํ•œ๋‹ค.


T5์—์„œ ์ ์šฉ๋œ ๋ฐฉ๋ฒ•๋ก ๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.



- Training objective : Modified MLM


- Model structure : Prefix LM + Causal with prefix


- Corruption
Starred entries mark the options used in T5-Base; bold entries mark the best-performing options.

 

https://arxiv.org/abs/1910.10683    Published by Google in 2020, 67 pages... the sheer volume is enormous.

๐Ÿ’ฌ In this crowded era of competing PLMs, the paper is long not because it simply claims "look how well we pre-trained," but because it runs concrete experiments: how many spans, which corruption scheme, and so on. It is important to be selective about which experiments to read closely, and to get a feel for the line of reasoning in the intro!

 

Introduction

The paper presents a comprehensive study of the individual techniques that make up transfer learning, which has recently become the dominant approach to NLP tasks. Through experiments, it quantifies how much each of the many published techniques (unlabeled datasets, model architectures, pre-training objectives, transfer approaches, and so on) contributes to performance, and it proposes a new model, 'T5'.


NLP ๋ถ„์•ผ์—์„œ๋Š” ๋ชจ๋ธ์ด ํ…์ŠคํŠธ๋ฅผ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” general-purpose knowledge๋ฅผ ๊ฐœ๋ฐœํ•˜๊ณ ์ž ํ–ˆ๋‹ค.
์ตœ๊ทผ์˜ NLP ์—ฐ๊ตฌ๋“ค์€ ๋Œ€๋Ÿ‰์˜ unsupervised dataset์— ๋Œ€ํ•ด pre-train๋œ ๋ชจ๋ธ์„ ๋ชฉํ‘œ๋กœ ํ•˜๋Š” task์— ๋Œ€ํ•ด supervised fine-tuningํ•˜๋Š” transfer learning์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ์‹์„ ๋งŽ์ด ์‚ฌ์šฉํ–ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด ๋ฐฉ์‹์ด task-specific model์„ ๋งŒ๋“œ๋Š” ๊ฒƒ๋ณด๋‹ค ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋ƒ„์€ ์ด๋ฏธ ์ž…์ฆ๋˜์—ˆ๋‹ค. ๋˜ํ•œ ์ด์ „์˜ ์—ฐ๊ตฌ๋“ค์— ์˜ํ•ด ๋” ํฐ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ• ์ˆ˜๋ก, ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ• ์ˆ˜๋ก ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค๋Š” ๊ฒƒ๋„ ์ž…์ฆ๋˜์—ˆ๋‹ค.
ํ•˜์ง€๋งŒ ๊ตฌ๊ธ€ ์—ฐ๊ตฌํŒ€์€ rigorous understanding(์—„๊ฒฉํ•œ ์ดํ•ด)์˜ ํ•„์š”์„ฑ์„ ๋А๊ผˆ๊ณ , ์ด๋ฅผ ์œ„ํ•ด์„œ unified approach to transfer learning๋ฅผ ์ œ์•ˆํ–ˆ๋‹ค. ๋ชจ๋“  NLP task๋ฅผ text-to-text ๋ฌธ์ œ๋กœ ์ƒ๊ฐํ•ด๋ณด๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋ ‡๊ฒŒ ๋ชจ๋“  task๋“ค์„ ํ•˜๋‚˜์˜ ์ ‘๊ทผ ๋ฐฉ๋ฒ•์œผ๋กœ ํ’€๊ฒŒ ๋œ๋‹ค๋ฉด ๋‹ค์–‘ํ•œ downstream task์— ๋™์ผํ•œ model, objective, training procedure, decoding process๋ฅผ ์ ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.
T5๋Š” ํš์ผํ™”๋œ ๋ฐฉ์‹์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ transfer learning objective์™€ unlabeled dataset ๊ฐ™์€ ๋‹ค์–‘ํ•œ ๋ชจ๋ธ๋ง ์š”์†Œ๋“ค์— ๋Œ€ํ•ด์„œ ํšจ๊ณผ์ ์œผ๋กœ ๋น„๊ต ๋ถ„์„ํ•œ๋‹ค.

 

 

 


Proposed Methodology

Setup

Model

  • ์ „๋ฐ˜์ ์œผ๋กœ ๊ธฐ์กด Transformer ๋…ผ๋ฌธ์— ๋‚˜์˜จ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. BERT๋‚˜ GPT ๊ฐ™์€ ๋ชจ๋ธ์ฒ˜๋Ÿผ Transformer ๊ตฌ์กฐ์˜ Encoder๋‚˜ Deocoder๋ฅผ ๋”ฐ๋กœ ๋–ผ์–ด๋‚ด์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๊ทธ๋ƒฅ ์›๋ž˜ Transformer์˜ Encoder-Decoder ๊ตฌ์กฐ๋ฅผ ๊ทธ๋Œ€๋กœ ๊ฐ€์ ธ์™€์„œ ์‚ฌ์šฉํ•œ๋‹ค.
    ์ฐจ์ด์ ์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.
    • ๊ฐ„์†Œํ™”๋œ layer normalization ์‚ฌ์šฉ (Transformer์˜ Layer Normalization์— ์‚ฌ์šฉ๋˜๋Š” layer norm bias๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  rescale๋งŒ ์ˆ˜ํ–‰)
    • residual path ์ „์— layer normalization
    • sinusoidal position signal ๋Œ€์‹  ๊ฐ„์†Œํ™”๋œ relative position embeddings ์‚ฌ์šฉ (์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•ด ๋ชจ๋“  layer์— position embedding parameters ๊ณต์œ )
    • Absolute positional embedding ๋Œ€์‹  Relative positional embedding ์‚ฌ์šฉ
  • ๋ชจ๋ธ์— ํฌ๊ธฐ์— ์˜ํ•œ ์„ฑ๋Šฅ ๋ณ€ํ™”๋ฅผ ํ™•์ธ ํ•  ๊ฒƒ์ด๋‹ค.
    • ํฐ ๋ชจ๋ธ์„ ๋Œ๋ฆฌ๊ธฐ ์œ„ํ•ด model and data parallelism ์‚ฌ์šฉ
    • ํ•™์Šต์€ Cloud TPU Pods ์‚ฌ์šฉ (ML supercomputers with 1,024 TPU v3 chips)


The Colossal Clean Crawled Corpus

  • T5์˜ pre-training์— ์‚ฌ์šฉ๋œ ๋ฐ์ดํ„ฐ์…‹์€ C4๋ผ๋Š” ๋ฐ์ดํ„ฐ์…‹์ด๋‹ค. 
  • ์ด ๋…ผ๋ฌธ์—์„œ๋Š” unlabeled data์˜ quality, characteristics, size๊ฐ€ ์–ด๋–ค ์˜ํ–ฅ์„ ์ฃผ๋Š”์ง€ ํƒ๊ตฌํ•œ๋‹ค.
  • ์ด๊ฒƒ์„ ์‹คํ—˜ํ•˜๊ธฐ ์œ„ํ•ด Colossal Clean Crawled Corpus๋ฅผ ๋งŒ๋“ ๋‹ค.

Common Crawl

  • Common Crawl์€ ๋ฌด๋ฃŒ๋กœ ๊ณต๊ฐœ๋œ web archive ์ด๋ฉฐ HTML ํŒŒ์ผ๋“ค์„ ์ฒ˜๋ฆฌํ•ด์„œ ๋งค๋‹ฌ 20TB์˜ text data๋ฅผ ๊ตฌ์ถ•ํ•œ๋‹ค
  • ํ•˜์ง€๋งŒ ๋Œ€๋ถ€๋ถ„์€ ์‚ฌ์šฉ์ด ๋ถˆ๊ฐ€๋Šฅํ•œ text์ด๋‹ค (์—๋Ÿฌ ๋ฉ”์„ธ์ง€, ๋ฉ”๋‰ด, ์ค‘๋ณต text).
  • ๋˜๋Š” ์–ด๋– ํ•œ downstream task์˜ ํ•™์Šต์— ๋„์›€์ด ์•ˆ๋  text์ด๋‹ค (์š•์„ค, ์ฝ”๋“œ, placeholder text)

Common Crawl clean up!

  1. Only keep lines that end in a terminal punctuation mark (period, exclamation mark, question mark, etc.)
  2. Remove pages with fewer than 5 sentences, and only keep lines with at least 3 words
  3. Remove pages containing any word on the "List of Dirty, Naughty, Obscene or Otherwise Bad Words"
  4. Remove lines containing the word "Javascript" (reportedly many pages carried warnings asking the user to enable Javascript)
  5. Remove pages containing the phrase "lorem ipsum"
  6. Since curly braces ("{") rarely appear in natural language but are common in programming languages, remove pages containing "{"
  7. For any three-sentence span occurring 3 or more times, keep one occurrence and remove the rest
  8. Since the focus is on English downstream tasks, use the langdetect filter to remove non-English pages

The data was taken from the April 2019 crawl and filtered, yielding roughly 750GB of data. This reasonably clean and natural English text is what the authors call the "Colossal Clean Crawled Corpus (C4)".
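A few of the page-level heuristics above can be sketched in code. This is a rough illustration with stand-in values (the real bad-words list and language filter are omitted), not the actual C4 pipeline.

```python
import re

BAD_WORDS = {"badword"}          # stand-in for the public bad-words list
TERMINAL = (".", "!", "?", '"')  # terminal punctuation marks

def keep_page(text, min_sentences=5, min_words_per_line=3):
    """Apply a few of the C4-style page-level heuristics listed above."""
    lowered = text.lower()
    if "lorem ipsum" in lowered or "{" in text:
        return False
    if any(w in lowered.split() for w in BAD_WORDS):
        return False
    # Keep only lines that end in terminal punctuation, have enough words,
    # and do not mention Javascript.
    lines = [
        ln.strip() for ln in text.splitlines()
        if ln.strip().endswith(TERMINAL)
        and len(ln.split()) >= min_words_per_line
        and "javascript" not in ln.lower()
    ]
    n_sentences = len(re.findall(r"[.!?]", " ".join(lines)))
    return n_sentences >= min_sentences

page = "This is a clean sentence. " * 6
```

Real web-scale cleaning would also need the deduplication and language-detection steps, which are stateful and so are left out of this sketch.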

 

Downstream Tasks

  • ์ด ๋…ผ๋ฌธ์˜ ๋ชฉํ‘œ๋Š” general language learning abilities๋ฅผ ์ธก์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.
  • ๊ทธ๋Ÿฌ๋ฏ€๋กœ ์—ฌ๋Ÿฌ benchmarks์„ ๊ธฐ์ค€์œผ๋กœ downstream tasks์˜ ์„ฑ๋Šฅ์„ ํ™•์ธํ•  ๊ฒƒ์ด๋‹ค.

GLUE and SuperGLUE

These benchmarks measure general language understanding performance.

  • Sentence acceptability judgment
  • Sentiment Analysis
  • Paraphrasing/sentence similarity
  • Natural Language Inference
  • Coreference resolution
  • Sentence completion
  • Word sense disambiguation
  • Question Answering

fine-tuning์„ ํ•  ๋•Œ ๋ชจ๋“  ๋ฐ์ดํ„ฐ์…‹์„ concatํ•ด์„œ ํ•˜๋‚˜์˜ task๋กœ ์„ค์ •ํ•œ๋‹ค



CNN/Daily Mail - abstractive summarization
SQuAD - question answering
WMT English to German, French, and Romanian - translation

Note that pre-training uses English data only, so for translation the model must learn to generate words in a new language.

 

Input and Output Format

  • ํ•˜๋‚˜์˜ ๋ชจ๋ธ์„ ํ•™์Šตํ•ด์„œ ์—ฌ๋Ÿฌ tasks๋ฅผ ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๊ธฐ ์œ„ํ•ด tasks๋ฅผ “text-to-text” format์œผ๋กœ ๋ณ€๊ฒฝํ•œ๋‹ค
  • ๋ชจ๋ธ์€ teacher forcing์„ ์ด์šฉํ•ด์„œ maximum likelihood objective๋ฅผ ๋ชฉํ‘œ๋กœ ํ•™์Šตํ•œ๋‹ค
  • ์–ด๋–ค task๋ฅผ ์ˆ˜ํ–‰ํ•ด์•ผํ•˜๋Š”์ง€ ์•Œ๋ ค์ฃผ๊ธฐ ์œ„ํ•ด task-specific text prefix๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค
  • STS-B๋ฅผ ์ œ์™ธํ•˜๊ณ  ๋ชจ๋“  tasks๋ฅผ text-to-text format์œผ๋กœ ์ˆ˜์›”ํ•˜๊ฒŒ ๋ณ€๊ฒฝ ๊ฐ€๋Šฅํ•˜๋‹ค
  • STS-B๋Š” 1~5 ์‚ฌ์ด์˜ ์œ ์‚ฌ๋„ ์ ์ˆ˜๋ฅผ ์˜ˆ์ธกํ•˜๋Š” task์ธ๋ฐ ๋Œ€๋ถ€๋ถ„ 0.2 ๋ฐฐ์ˆ˜๋กœ ์ ์ˆ˜๊ฐ€ ์ •ํ•ด์ ธ ๋ชจ๋“  ์ ์ˆ˜๋ฅผ ์ œ์ผ ๊ฐ€๊นŒ์šด 0.2 ๋ฐฐ์ˆ˜๋กœ ๋Œ€์ฒดํ•œ๋‹ค
  • ๊ฒฐ๊ตญ 21-class classification problem์œผ๋กœ ๋Œ€์ฒด๋œ๋‹ค
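The prefixing and the STS-B rounding above can be sketched as follows. The prefixes follow the style of the paper's examples; the field names and the task set are illustrative.

```python
def to_text_to_text(task, **fields):
    """Cast a task example into T5's prefixed text-to-text format."""
    if task == "translate_en_de":
        return f"translate English to German: {fields['text']}"
    if task == "cola":
        return f"cola sentence: {fields['text']}"
    if task == "stsb":
        return f"stsb sentence1: {fields['s1']} sentence2: {fields['s2']}"
    raise ValueError(task)

def stsb_target(score):
    # Round a 1-5 similarity score to the nearest 0.2 increment,
    # giving 21 possible target strings (a 21-class problem).
    return f"{round(score * 5) / 5:.1f}"
```

The target side is just another string, so the same decoder and loss serve regression, classification, and generation alike.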

์‹คํ—˜ ๋ฐ ๊ฒฐ๊ณผ

1. Baseline

๋ชจ๋ธ

  • ์ผ๋ฐ˜์ ์ธ ์ธ์ฝ”๋”-๋””์ฝ”๋” ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉ
  • ๋ชจ๋ธ ์‚ฌ์ด์ฆˆ๋Š” Bert base(110 million)์™€ ๋น„์Šทํ•˜๋„๋ก ์„ค๊ณ„
    • 12๊ฐœ์˜ ์–ดํ…์…˜ ๋ธ”๋ก์ด ์Œ“์—ฌ์žˆ๋Š” ๊ตฌ์กฐ
    • encoder 110 million / decoder 110 million → Total 220 million parameter

 

Training

  • Sequence length = 512
  • Batch size = 128 sequences
  • Pre-training on the C4 data: 524,288 steps ($2^{19}$)
  • Fine-tuning on each task: 262,144 steps ($2^{18}$)
  • Everything framed as a text-to-text task

 

Vocab

  • A 32,000-entry vocabulary built with SentencePiece (Kudo and Richardson, 2018) on the C4 data
  • German, French, and Romanian words are added for the translation tasks (English : German : French : Romanian = 10 : 1 : 1 : 1)
  • Unsupervised objective: a denoising objective in which spans, not individual words, are replaced with sentinel tokens

T5's pre-training objective follows the technique proposed in the SpanBERT paper. The original BERT replaces some individual tokens in the input text with [MASK] tokens, whereas SpanBERT masks an entire span with a single [MASK] token rather than masking token by token. This not only improved performance somewhat but also improved computational efficiency considerably.

 

Baseline Performance

  • ํ•™์Šต์„ ํ•  ๋•Œ๋งˆ๋‹ค ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๋‹ฌ๋ผ์ง€๊ธฐ ๋•Œ๋ฌธ์—(i.e. with different random initializations and data set shuffling) Baseline model์„ 10๋ฒˆ ํ•™์Šต์‹œ์ผœ์„œ ํ‰๊ท ๊ณผ ํ‘œ์ค€ํŽธ์ฐจ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
  • ์—ฌ๊ธฐ์„œ ๊ณ„์‚ฐ๋œ ํ‘œ์ค€ํŽธ์ฐจ๋ฅผ ๊ธฐ์ค€์œผ๋กœ 2-sigma ๋ฒ”์œ„ ๋‚ด ์ ์ˆ˜๋Š” ๋น„์Šทํ•œ ์ˆ˜์ค€์˜ ์„ฑ๋Šฅ์ด๋ผ๊ณ  ํŒ๋‹จํ•œ๋‹ค.

  • Score ๊ธฐ์ค€
    • GLUE/SGLUE : GLUE์™€ SuperGLUE์— ์žˆ๋Š” ๋ชจ๋“  subtask ์ ์ˆ˜์˜ ํ‰๊ท 
    • CNNDM : ROUGE-2-F
    • SQuAD : EM
    • EnDe/EnFr/EnRo : BLUE

 

2. ํ…Œ์ŠคํŠธ ๋ชจ๋ธ ๊ตฌ์กฐ

  • Attention mask types
    • Fully-visible : every output position can attend to the entire input
    • Causal : each position attends only up to the previous time step
    • Causal with prefix : positions always attend within a fixed initial region (the prefix), while positions after the prefix attend causally
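The three mask patterns above can be built explicitly; a minimal NumPy sketch, where entry (i, j) = 1 means position i may attend to position j:

```python
import numpy as np

def attention_mask(n, kind, prefix_len=0):
    """Build an n x n attention mask (1 = may attend). Illustrative only."""
    if kind == "fully_visible":
        return np.ones((n, n), dtype=int)
    if kind == "causal":
        # Lower-triangular: each position sees itself and earlier positions.
        return np.tril(np.ones((n, n), dtype=int))
    if kind == "causal_with_prefix":
        mask = np.tril(np.ones((n, n), dtype=int))
        mask[:, :prefix_len] = 1  # everyone may attend to the prefix
        return mask
    raise ValueError(kind)
```

The prefix LM below is exactly the third pattern: fully-visible attention over the input (prefix) region, causal attention over the continuation.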

 

  • Model types
    • Encoder-Decoder
      • Encoder : fully-visible
      • Decoder : causal
    • Language Model
      • Decoder only : causal
    • Prefix LM
      • LM + prefix (the query portion is fully visible)

 

  • Results
    • The standard encoder-decoder structure with a denoising objective performs best
    • Parameter counts are stated relative to the BERT-Base parameter count = P
    • Sharing parameters between the encoder and decoder costs a little performance, but not much
    • Q) Why is the cost the same even with half the parameters? Here cost (M) is not the cost of training the model but the cost of running it. Even if only half the parameters need to be trained, the number of parameters present in the model is the same, so the cost is the same.

 

3. Unsupervised Objectives

์–ด๋–ค ํ˜•ํƒœ์˜ ์ธํ’‹/์•„์›ƒํ’‹์ด ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด๋Š”์ง€ ๊ฒฐ์ •ํ•˜๋Š” ๊ณผ์ •
๋†’์€ ๋ ˆ๋ฒจ์—์„œ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ์„ ํƒํ•œ ํ›„, ํ•˜์œ„ ๋‹จ๊ณ„๋ฅผ ๋น„๊ตํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์‹คํ—˜์„ ์ง„ํ–‰ํ–ˆ๋‹ค.

  • Prefix language modeling : ๋ฌธ์žฅ ์•ž ๋ถ€๋ถ„์„ ๋ณด๊ณ  ๋’ท ๋ถ€๋ถ„์„ ์˜ˆ์ธกํ•˜๋Š” objective
  • BERT-style : Bert์™€ ๊ฐ™์€ ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜์ง€๋งŒ, ์ถœ๋ ฅ์ด ์—ฐ์†๋œ sequence ๋ฌธ์žฅ์œผ๋กœ ๋‚˜์˜ค๋„๋ก ๋ณ€๊ฒฝํ•œ objective
  • Deshuffling : ์ž…๋ ฅ ๋ฌธ์žฅ์„ ๋’ค์„ž์€ ํ›„ ์›๋ž˜ ์ˆœ์„œ๋Œ€๋กœ ๋งž์ถ”๋Š” objective

 

  • I.i.d. noise, mask : mask randomly chosen tokens
  • I.i.d. noise, replace spans : choose length-1 spans at random, merge adjacent choices into a single span, replace each span with a sentinel token, and predict the replaced text
  • I.i.d. noise, drop : drop randomly chosen tokens and predict the dropped tokens
  • Random spans : choose spans of randomly determined length and predict them

 

Disparate High-Level Approaches
First compare the 3 objectives that differ in how the sentence is processed → BERT-style performs best

The table above compares the performance of each pre-training approach. Prefix language modeling refers to standard language modeling as in GPT, and BERT-style refers to the masked language modeling used in BERT. Finally, Deshuffling is a denoising sequential autoencoder: it takes a sequence as input, shuffles its order, and recovers the original sequence as the target.

 

Simplifying the BERT Objective

  • BERT-style : replace 15% of tokens with a mask token (no random-token replacement)
  • MASS-style : choose spans first, then mask the tokens inside them
  • Replace corrupted spans : choose spans and replace each with a sentinel token
  • Drop corrupted tokens : delete tokens from the sentence

BERT-style์˜ pre-training objective๊ฐ€ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋ƒˆ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•œ ํ›„ ์—ฌ๋Ÿฌ๊ฐ€์ง€์˜ denoising objectives์— ๋Œ€ํ•ด์„œ ์„œ๋กœ ๋น„๊ต ์‹คํ—˜์„ ์ง„ํ–‰ํ–ˆ๋‹ค. ์œ„ ํ‘œ๊ฐ€ ์‹คํ—˜์˜ ๊ฒฐ๊ณผ์ธ๋ฐ, ์ฒซ๋ฒˆ์งธ๋กœ BERT-style์€ original BERT์—์„œ ์‚ฌ์šฉ๋˜์—ˆ๋˜ 15%์˜ token์„ maskingํ•˜๋Š” ๋ฐฉ์‹์„ ๋งํ•œ๋‹ค. ๋‘๋ฒˆ์งธ๋กœ MASS-style์€ MASS ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆ๋˜์—ˆ๋˜ span๋‚ด token๋“ค์„ maskingํ•˜๊ณ  ํ•ด๋‹น token๋“ค์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์„ ๋งํ•œ๋‹ค. ๊ทธ ์•„๋ž˜์˜ Replace corrupted spans๋Š” T5์—์„œ ์‚ฌ์šฉํ•œ ์ผ์ • span์„ ํ•˜๋‚˜์˜ masking token์œผ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋ฐฉ์‹์„, Drop corrupted tokens๋Š” input sequence์˜ tokens๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  ๋‹ค์‹œ ๋ณต์›ํ•˜๋Š” ๋ฐฉ์‹์„ ๋งํ•œ๋‹ค.

 

Varying the corruption rate
Corruption ๋น„์œจ์ด ๋†’์•„์ง€๋ฉด ์˜ˆ์ธกํ•ด์•ผ ํ•˜๋Š” ํƒ€์ผ“์ด ์ฆ๊ฐ€ํ•˜๊ณ  ์ด๋Š” ํ•™์Šต ์‹œ๊ฐ„์„ ๋” ์ฆ๊ฐ€์‹œํ‚ด → Bert์—์„œ ์‚ฌ์šฉํ–ˆ๋˜ 15% corruption ๋น„์œจ์„ ์‚ฌ์šฉํ•˜๊ธฐ๋กœ ๊ฒฐ์ •

 

Corrupting Spans
์ž…๋ ฅ ๋ฌธ์žฅ์„ span-corruption ํ•˜๋ฉด ์ž…๋ ฅ token์˜ ์ˆ˜๊ฐ€ ์ค„์–ด๋“ค๊ธฐ ๋•Œ๋ฌธ์— ์†๋„ ํ–ฅ์ƒ ํšจ๊ณผ๋„ ์žˆ์Œ
i.i.d ๋ฐฉ์‹์—์„œ๋Š” Span length๋Š” ์ž„์˜๋กœ ์ •ํ•ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ํŠน์ •ํ•œ ๊ธธ์ด๋ฅผ ์ง€์ •ํ•  ์ˆ˜ ์—†์Œ
์—ฌ๊ธฐ์„œ๋Š” ๊ฐ span ๊ธธ์ด์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ
ex) 500 tokens๋กœ ์ด๋ฃจ์–ด์ง„ ๋ฌธ์žฅ์—์„œ 75 tokens(15%)์„ span ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒฝ์šฐ

  1. 32.5๊ฐœ spans → Average span length = 2
  2. 25๊ฐœ spans → Average span length = 3
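The arithmetic behind the example above is just corrupted-token count divided by average span length; a one-line sketch:

```python
def num_spans(seq_len, corruption_rate, avg_span_len):
    """Number of corrupted spans implied by a corruption rate and an
    average span length: 500 * 0.15 = 75 corrupted tokens, so an
    average length of 2 gives 37.5 spans and a length of 3 gives 25."""
    n_corrupted = seq_len * corruption_rate
    return n_corrupted / avg_span_len
```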

 

4. Pre-training Dataset

Pre-train ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์ด ์–ด๋–ป๊ฒŒ ๋‹ฌ๋ผ์ง€๋Š”์ง€ ๋น„๊ตํ•˜๋Š” ํŒŒํŠธ์ด๋‹ค.

Unlabeled datasets

  • C4 : described in Setup (2.2)
  • Unfiltered C4 : C4 without the cleaning steps, keeping only the English-extraction step
  • RealNews-like : C4 data filtered with the RealNews (Zellers et al., 2019) procedure
  • WebText-like : one year of C4 data restricted to content from high-quality URLs (as rated by Reddit users)
  • Wikipedia : English Wikipedia data
  • Wikipedia + Toronto Books Corpus : English Wikipedia plus the Toronto Books Corpus, added for more domain diversity

  • Overall, the C4 data performs best
  • Though not shown in the table here (see Table 16), other datasets score higher on SQuAD and on the SuperGLUE subtasks MultiRC and ReCoRD → when the pre-training data comes from a similar source, or covers a specific domain, that data scores higher (e.g. Wikipedia + TBC or Wikipedia) → but using data restricted to a specific domain makes it 1. hard to secure enough data, and 2. hard to ensure the model generalizes

 

Pre-training data size

  • C4 dataset์˜ token ๊ฐœ์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ C4 ๋ฐ์ดํ„ฐ์˜ ์–‘์„ ์กฐ์ ˆํ•ด ๋น„๊ตํ•จ (Full dataset # tokens = $2^{35}$)
  • ๋ฐ์ดํ„ฐ๊ฐ€ ์ถฉ๋ถ„ํžˆ ๋งŽ์€ ๊ฒฝ์šฐ์—๋Š” ๋ฐ˜๋ณตํ•˜์ง€ ์•Š๊ฑฐ๋‚˜ ์ ์€ ํšŸ์ˆ˜๋งŒ ๋ฐ˜๋ณตํ•˜์—ฌ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ์ข‹์Œ
  • ๋„ˆ๋ฌด ๋งŽ์€ ๋ฐ˜๋ณตํ•™์Šต์€ downstram์˜ ์„ฑ๋Šฅ์„ ์ €ํ•˜(๊ณผ์ ํ•ฉ)
  • ํฐ ๋ชจ๋ธ์— ์ ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜๋ณตํ•™์Šตํ•˜๋Š” ๊ฒฝ์šฐ ์ด๋Ÿฌํ•œ ์„ฑ๋Šฅ์ €ํ•˜๊ฐ€ ๋‘๋“œ๋Ÿฌ์ง(๊ณผ์ ํ•ฉ)

 

5. Training Strategy

Fine-tuning์— ๋Œ€ํ•œ ์„ค๋ช…

 

Fine-tuning methods

  • Fine-tune ๋‹จ๊ณ„์—์„œ ๋ฌธ์žฅ ๋ถ„๋ฅ˜ ๋ชจ๋ธ์€ ์ผ๋ถ€๋ถ„(์ผ๋ฐ˜์ ์œผ๋กœ ์ถœ๋ ฅ์ธต)๋งŒ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•๋งŒ์œผ๋กœ ์ข‹์€ ์„ฑ๋Šฅ์„ ์–ป์„ ์ˆ˜ ์žˆ์Œ(Peters et al., 2019)
  • ํ•˜์ง€๋งŒ ์ด ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉํ•˜๋Š” Encoder-Decoder ๋ชจ๋ธ์€ ๋ฌธ์žฅ์„ ์ถœ๋ ฅ์œผ๋กœ ๋‚ด๋ฉด์„œ ์—ฌ๋Ÿฌ๊ฐ€์ง€ task๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ ์šฉํ•˜๊ธฐ ์–ด๋ ค์›€ → ๋ชจ๋ธ ์ „์ฒด๋ฅผ fine-tuningํ•˜์ง€ ์•Š๋Š” ๋Œ€์•ˆ์œผ๋กœ Adapter layer ๋ฐฉ๋ฒ•๊ณผ Gradual unfreezing ๋ฐฉ๋ฒ•์ด ์žˆ์Œ
    • Gradual unfreezing:  2^{18}/12 step๋งˆ๋‹ค ์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”๋ฅผ ํ•œ ๋ธ”๋ก์”ฉ unfreezeํ•˜๋ฉด์„œ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•(๋งˆ์ง€๋ง‰ layer๋ถ€ํ„ฐ ๋จผ์ € ํ•™์Šต)
    • Adapter layer: Adpter layer๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ ํ•ด๋‹น layer๋งŒ fine-tuning

  • ๊ฒฐ๊ณผ
    • ํŒŒ๋ผ๋ฏธํ„ฐ ์ „์ฒด๋ฅผ fine-tuningํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ์„ฑ๋Šฅ์ด ๊ฐ€์žฅ ์ข‹์Œ
    • ๋ฐ์ดํ„ฐ ์…‹์ด ์ž‘๊ฑฐ๋‚˜ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฌ˜์‚ฌํ•˜๋Š” ๋ฒ”์œ„๊ฐ€ ์ ์€ ๊ฒฝ์šฐ(GLUE) ๋‚ฎ์€ ์ฐจ์›์˜ adapter layer๋„ ์ผ์ • ์ˆ˜์ค€์˜ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋งŒ, ๋ฐ์ดํ„ฐ๊ฐ€ ๋” ๋‹ค์–‘ํ•ด์ง€๋Š” ๊ฒฝ์šฐ(SGLUE)์—” adapter layer์˜ ์ฐจ์›๋„ ๋น„๋ก€ํ•ด์•ผ ์„ฑ๋Šฅ์„ ๋†’์ผ ์ˆ˜ ์žˆ์Œ

 

Multi-task learning

Multi-task learning์ด๋ž€ ํ•˜๋‚˜์˜ unsupervised task์— ๋Œ€ํ•ด์„œ pre-training์„ ์ง„ํ–‰ํ•œ ํ›„ fine-tuningํ•˜๋Š” ๊ฒƒ ๋Œ€์‹ ์— ์—ฌ๋Ÿฌ ์ข…๋ฅ˜์˜ task์— ๋Œ€ํ•ด์„œ ํ•œ ๋ฒˆ์— training์„ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์„ ๋งํ•œ๋‹ค.

๋…ผ๋ฌธ์—์„œ๋Š” ์ด multi-task learning ๋ฐฉ์‹๊ณผ pre-train + fine-tune ๋ฐฉ์‹์˜ ์„ฑ๋Šฅ ๋น„๊ต๋ฅผ ์ง„ํ–‰ํ•˜๊ณ ์ž ํ–ˆ๋‹ค. Multi-task learning์˜ ๊ฒฝ์šฐ์—๋Š” ๊ฐ task๋ณ„ data ์‚ฌ์šฉ ๋น„์œจ์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์ด ๋‹ฌ๋ผ์ง€๊ฒŒ ๋˜๋Š”๋ฐ, ์ž์นซ ๋„ˆ๋ฌด ๋งŽ์€ ์–‘์˜ data๋ฅผ training์—์„œ ํ™œ์šฉํ•˜๊ฒŒ ๋œ๋‹ค๋ฉด training dataset์„ memorizeํ•˜๊ฒŒ ๋œ๋‹ค. ๊ทธ๋ž˜์„œ ๋…ผ๋ฌธ์—์„œ๋Š” ์—ฌ๋Ÿฌ๊ฐ€์ง€ ๋น„์œจ ์„ค์ • ๋ฐฉ์‹์— ๋Œ€ํ•ด์„œ ๋น„๊ต๋ฅผ ์ง„ํ–‰ํ–ˆ๋‹ค.

The result: the pre-train/fine-tune scheme performed best. The authors therefore set out to combine multi-task learning with it, devising a scheme that pre-trains a model on multiple tasks and then fine-tunes it on a single task.
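One of the mixing schemes the paper considers, temperature-scaled mixing, can be sketched as follows: cap each task's example count at a limit, raise to 1/T, and normalize. The parameter values here are illustrative.

```python
def mixing_rates(example_counts, temperature=2.0, limit=2**21):
    """Temperature-scaled mixing rates: higher temperature flattens the
    distribution so small tasks are not drowned out by large ones."""
    scaled = [min(n, limit) ** (1.0 / temperature) for n in example_counts]
    total = sum(scaled)
    return [s / total for s in scaled]

# Three tasks of very different sizes.
rates = mixing_rates([1_000_000, 10_000, 100])
```

At temperature 1 this reduces to (capped) examples-proportional mixing; as the temperature grows, the rates approach uniform.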

 

Combining multi-task learning with fine-tuning

๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ค ํ˜•์‹์œผ๋กœ ํ•™์Šตํ–ˆ์„ ๋•Œ ์„ฑ๋Šฅ์ด ์ข‹์€์ง€ ๋น„๊ตํ–ˆ๋‹ค.
๊ทธ ๊ฒฐ๊ณผ multi-task pre-training + fine-tuning์ด ๊ธฐ์กด์˜ unsupervised pre-training + fine-tuning ๋ฐฉ์‹๊ณผ ๊ฑฐ์˜ ๋น„์Šทํ•œ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜๊ณ  ์ตœ์ข… ๋ชจ๋ธ์˜ ํ•™์Šต์— ํ•ด๋‹น ๋ฐฉ์‹์„ ์‚ฌ์šฉํ–ˆ๋‹ค.

 

6. Scaling

This experiment asks: if 4x more compute were available, how much could model performance improve?
Four model configurations, varying parameter count (size), training time (training steps), and batch size, were each compared against the baseline.
That larger models perform better was already common knowledge. But because the performance gain was small relative to using 4x the compute, the authors did not simply scale up.

 

์—ฌ๋Ÿฌ ์‹คํ—˜๋“ค์„ ๊ฑฐ์นœ ํ›„ T5 ๋ชจ๋ธ์ด ์„ ํƒํ•œ ๊ธฐ๋ฒ•๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • Span-corruption objective
  • Longer pre-training (1 million steps on batch size )
  • Larger model (11B parameters)
  • Multi-task pre-training + fine-tuning
  • Beam Search
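The last choice, beam search, can be illustrated with a toy implementation: at each step, keep only the few highest-scoring partial sequences. The scorer here is a hypothetical stand-in for a real decoder's next-token log-probabilities.

```python
import math

def beam_search(score_fn, vocab, start, max_len=3, beam_width=2):
    """Toy beam search: keep the beam_width best partial sequences by
    cumulative log-probability at every step."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok in vocab:
                candidates.append((seq + [tok], score + score_fn(seq, tok)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# A toy scorer that prefers alternating tokens.
def toy_score(seq, tok):
    return math.log(0.9) if tok != seq[-1] else math.log(0.1)

best = beam_search(toy_score, vocab=["a", "b"], start="a")
```

With `beam_width=1` this degenerates to greedy decoding; widening the beam lets the decoder recover sequences whose first token is not the locally best choice.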

๊ทธ๋ฆฌ๊ณ  ๋…ผ๋ฌธ์—์„œ๋Š” ์ด ์—ฐ๊ตฌ๋ฅผ ํ†ตํ•ด ์–ป๊ฒŒ๋œ ํ•ต์‹ฌ ์ •๋ณด๋“ค์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •๋ฆฌํ–ˆ๋‹ค.

Text-to-Text Framework
With the text-to-text framework, a single model can achieve strong performance on generation, classification, and regression tasks alike, with no changes to the loss or the decoding scheme.

Original Transformer Architecture
BERT๋‚˜ GPT๊ฐ™์€ ๋ชจ๋ธ์ฒ˜๋Ÿผ Transformer์˜ encoder/decoder๋ฅผ ๋”ฐ๋กœ ๋–ผ์–ด๋‚ด์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ original Transformer ๊ตฌ์กฐ๋ฅผ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด text-to-text์—์„œ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค.

Denoising Span Objective
Among pre-training objectives, the BERT-family denoising objectives, and in particular the denoising-span objective used in SpanBERT, perform best.

C4 Dataset
On some downstream tasks an in-domain unlabeled dataset contributes substantially to performance, but restricting the domain makes the dataset much smaller. The study shows that a larger dataset covering more diverse domains contributes to better performance.

 


Lessons Learned

  • Rather than ever-bigger models, we need models that can achieve strong performance at smaller sizes.
  • We need more efficient, better pre-training schemes, especially ones capable of learning general knowledge.
  • By formalizing the similarity between tasks, we should move beyond unsupervised pre-training toward supervised pre-training.

References