Getting 100% out of Hugging Face

prefer_all 2022. 12. 19. 13:56

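I didn't really understand these parts during the mentoring sessions; they only clicked once I took part in a competition, so I'm writing them down here.

<Table of contents>
1. Designing the training pipeline
2. Taking a Hugging Face model apart
3. Modifying a Hugging Face model to your liking
4. Large language models

<Summary> - loading the tokenizer, config, and model

from transformers import AutoConfig, AutoTokenizer, AutoModelForQuestionAnswering

model_name = 'klue/bert-base'

config = AutoConfig.from_pretrained(model_name)  # 🍑
tokenizer = AutoTokenizer.from_pretrained(model_name)  # ⛄
model = AutoModelForQuestionAnswering.from_pretrained(  # 🌽
    model_name,
    config=config,
)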

Hugging Face tokenizer

To turn text data into a form the model can understand, the text has to be tokenized and then encoded.

The transformers tokenizer handles both steps: it splits the text data into tokens and encodes each token as a specific number.

Q: How were tokenizers used before PLMs existed?
A: You parsed your own data (using a well-performing parser), built a dictionary per dataset, and assigned the numbers yourself.
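The pre-PLM approach in the answer above can be sketched roughly as follows. This is only a toy: whitespace splitting stands in for a real parser, and the names `build_vocab`, `encode`, and the `[PAD]`/`[UNK]` entries are made up for illustration.

```python
from collections import Counter


def build_vocab(sentences, specials=("[PAD]", "[UNK]")):
    """Assign an integer id to every whitespace-separated word in the data."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {tok: i for i, tok in enumerate(specials)}
    # More frequent words get smaller ids; ties are broken alphabetically.
    for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Map each word to its id, falling back to [UNK] for unseen words."""
    unk = vocab["[UNK]"]
    return [vocab.get(word, unk) for word in sentence.split()]


corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
ids = encode("the bird sat", vocab)  # "bird" was never seen, so it becomes [UNK]
print(vocab)
print(ids)
```

Because the dictionary is built per dataset, the same word can end up with a different id in every project, which is exactly the bookkeeping a pretrained tokenizer takes off your hands.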

from transformers import AutoTokenizer

example = "첫눈 오는 이런 오후에"

model_name = 'klue/bert-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)  # ⛄

# klue/bert-base ships a Korean tokenizer, so it can recognize Korean text
print('tokenization result : ', tokenizer.tokenize(example))
print('tokenization + encoding result : ', tokenizer.encode(example))

'''
tokenization result :  ['첫눈', '오', '##는', '이런', '오후', '##에']
tokenization + encoding result :  [2, 24122, 1443, 2259, 3667, 4082, 2170, 3]
'''
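The split in the output above can be imitated with a toy greedy longest-match tokenizer. This is a minimal sketch of the WordPiece idea, not the tokenizer's actual implementation: `toy_vocab` is a hand-written set holding only the pieces from the output, and `wordpiece` and `detokenize` are hypothetical helper names.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def detokenize(pieces):
    """Glue ## pieces onto the previous token; separate the rest with a space."""
    out = ""
    for p in pieces:
        if p.startswith("##"):
            out += p[2:]
        else:
            out += (" " if out else "") + p
    return out


toy_vocab = {"첫눈", "오", "##는", "이런", "오후", "##에"}
sentence = "첫눈 오는 이런 오후에"
tokens = [p for w in sentence.split() for p in wordpiece(w, toy_vocab)]
print(tokens)
print(detokenize(tokens))
```

A word with no matching piece in the vocab collapses to `[UNK]`, which is the behaviour to expect when a tokenizer is pointed at a language it was not built for.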

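Subword tokenization
Character sequences that occur frequently are treated as a single unit, while infrequent ones are split into subwords.
"##" means that, when decoding, the token is attached to the previous token without a space.

⛔️ Things to watch out for when using a tokenizer

Check that the tokenizer understands the language of the train data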

from transformers import AutoTokenizer

example = "첫눈 오는 이런 오후에"

model_name = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bert-base-cased is an English tokenizer, so it does not understand Korean at all
# As a result, the tokenizer output comes out as [UNK]
print('tokenization result : ', tokenizer.tokenize(example))
print('tokenization + encoding result : ', tokenizer.encode(example))
'''
tokenization result :  ['[UNK]', '[UNK]', '[UNK]', '[UNK]']
tokenization + encoding result :  [101, 100, 100, 100, 100, 102]
'''

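Check that the tokenizer is the same one the pretrained model was trained with
Using the wrong tokenizer can raise a vocab size mismatch error, or special tokens can end up as [UNK], which is a 🤦🏻‍♀️disaster🤦🏻‍♂️
For models whose vocab size and special tokens match exactly (for example KLUE's roberta and bert), the tokenizers 'can' be swapped, but it is not the right way to do it
- As a side note, the publicly released English bert and roberta tokenizers are not compatible. (bert vocab: 28,996 entries, roberta vocab: 50,265 entries)
- The KLUE models come from the same organization and share a total vocab size of 32,000, but this should be regarded as a coincidence rather than something to rely on.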
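Why a mismatched tokenizer breaks things can be seen with a toy embedding lookup. The sketch below uses plain Python lists instead of a real model; the table size, the ids, and the helper names `embed` and `ids_without_row` are all invented for illustration.

```python
def embed(token_ids, embedding_table):
    """Look up one embedding row per token id, like a model's first layer."""
    return [embedding_table[i] for i in token_ids]


def ids_without_row(token_ids, embedding_table):
    """Return the ids that have no row in the embedding table."""
    return [i for i in token_ids if not 0 <= i < len(embedding_table)]


# Toy model: embedding table with 5 rows (vocab size 5) and hidden size 2.
table = [[float(i), float(-i)] for i in range(5)]

ids_from_matching_tokenizer = [2, 4, 1, 3]
ids_from_other_tokenizer = [2, 7, 1, 3]  # id 7 comes from a larger vocab

bad = ids_without_row(ids_from_other_tokenizer, table)
try:
    embed(ids_from_other_tokenizer, table)
    failed = False
except IndexError:
    failed = True

print(bad, failed)
```

The out-of-range case at least fails loudly. When two vocabs happen to have the same size but map ids to different tokens, every lookup succeeds and silently returns the wrong row, which is why swapping tokenizers is discouraged even when it runs.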

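Hugging Face Config

To use a pretrained model, you have to bring over the settings that model was trained with, unchanged.

Each model has its own parameter settings, such as vocab size and hidden dimension,

so transformers provides a way to load this information easily as a Config.

The most basic form used to be ModelName+Config.from_pretrained, but nowadays the Auto classes make it even more convenient to fetch the configuration.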

from transformers import AutoConfig

model_name =  'klue/bert-base'

# Load the same configuration as the pretrained model.
model_config = AutoConfig.from_pretrained(model_name)

model_config

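⛔️ Things to watch out for when using a config

Sometimes the config is modified before use, but some config values may be changed and others must not be.

Config values that must not be changed

When using a pretrained model, architecture settings that are already fixed, such as the hidden dim, must not be modified.
Changing them either raises an error or makes training go in the wrong direction.

Config values that may be changed

For vocab, if you add special tokens, the vocab size in the config has to grow by the number of added tokens before training.
You can also add a few config values for the downstream task. (An example follows below.)

Let's look at a sequence classification model as an example of config settings.
The transformers documentation lists, for each model, the config values that have to be defined in advance.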

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config

        self.bert = BertModel(config)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

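Here the num_labels value has to be filled in.

'''
The current situation is as follows.
1) Model in use: a sequence classification model
2) 2 special tokens have been added
3) There are 10 labels in total

case 1. Changing the value of an existing config entry 🍑
case 2. Adding a config entry required by the downstream task 🥑
'''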

model_name =  'klue/bert-base'

# case 1 🍑
# There are two ways to change the config

# [1] Load first, then modify the config value directly 🍑🍑
model_config = AutoConfig.from_pretrained(model_name)
model_config.vocab_size = model_config.vocab_size + 2
print('case 1 - [1] : ', model_config)

# [2] Modify at load time 🍑🤢🍑
# However, changing the vocab this way is not recommended
# See the token-adding part of the advanced tutorial for details
model_config = AutoConfig.from_pretrained(model_name , vocab_size=32002)

print('case 1 - [2] : ', model_config)

# case 2 (setting num_labels for sequence classification) 🥑
# There are two ways to add the config entry

# [1] Load first, then set the config value directly
model_config = AutoConfig.from_pretrained(model_name)
model_config.num_labels = 10

# [2] Set it at load time
model_config = AutoConfig.from_pretrained(model_name , num_labels=10)

📣 Can any value be added to the config?

An arbitrary config key can be added to the config, but it is not used in model training.

# Creating it this way adds the key to the config
model_config.hey = 'Love~'

However, if model_config is loaded as below, hey is not added to the config.

model_config = AutoConfig.from_pretrained(model_name , hey="hey~")

📣 To update the whole config from your own custom_config, do the following.

custom_config = {
    "Hate" : "You~"
    }

model_config = AutoConfig.from_pretrained(model_name)
model_config.update(custom_config)
model_config

Loading a pretrained model

The most powerful feature of transformers is how easily a pretrained model can be loaded and used.

Just three lines are enough to load and use the pretrained model you want. (Of course, the model has to be published for Hugging Face transformers!)

The model can be used as is, or trained further to fit your own data.

.from_config() builds the model from the config alone. It does not load the pretrained weights, so be careful. 🌽
.from_pretrained() builds the model that matches the model config and loads the pretrained weights. To load a model you trained yourself, pass the directory where the model is saved in place of model_name. 🌽🌽

from transformers import AutoConfig, AutoModelForQuestionAnswering

model_name =  'klue/bert-base'

# Load the same configuration as the pretrained model.
model_config = AutoConfig.from_pretrained(model_name)

# Define the model.
# option 1 : build the model defined by the config (freshly initialized weights) 🌽
model = AutoModelForQuestionAnswering.from_config(model_config)

# option 2 : load the pretrained model defined by the config (pretrained weights) 🌽🌽
model = AutoModelForQuestionAnswering.from_pretrained(model_name, config=model_config)

transformers provides two types of model.

Base model: the basic model that outputs hidden states
Downstream task model: the output dimension is predefined to suit the task.
These are base model + head combinations set up in advance so that common tasks can be carried out easily

Huggingface Trainer

The Trainer in transformers is a modularized version of the training loop that would otherwise be repeated.

Thanks to it, implementing a training loop for each new model takes only a few lines.

The trainer is used with the following structure.

Set the Training Arguments
Create the Trainer
Run training / inference

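from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer

model_name = 'klue/bert-base'
batch_size = 16  # assumed value (not given in the original); adjust to your GPU memory

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# tokenized_datasets and data_collator are assumed to have been prepared beforehand
# (tokenized "train" / "validation" splits and a matching collator).

args = TrainingArguments(
    f"{model_name}-finetuned",
    evaluation_strategy = "epoch",  # recent transformers releases call this eval_strategy
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,  # requires being logged in to the Hugging Face Hub
)


trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()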

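📣 So is the Trainer 'always' the better choice?

Legacy code is unavoidable. A lot changes with every version, so the downside is that your code has to be updated continuously.
pytorch lightning is the typical example of this problem, and transformers is no exception. So the Trainer cannot be the right answer in every situation
Take advantage of the convenience as much as possible, but studying how it works is very important.
Looking through the structure of the Trainer and building a Trainer for your own model is also a good approach
- Override the functions you need in Trainer (suitable for general tasks)
- Build a Custom Trainer (useful when the task is not a general one)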
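The "override only what you need" idea can be pictured with a toy trainer written from scratch. This is not the transformers `Trainer`; `ToyTrainer`, `AbsLossTrainer`, and the hook name `compute_loss_grad` are invented for this sketch, which fits `y = w * x` with plain gradient descent. In transformers itself the analogous hook is `Trainer.compute_loss`.

```python
class ToyTrainer:
    """Tiny stand-in for a trainer: fits y = w * x with gradient descent."""

    def __init__(self, data, lr=0.05, epochs=200):
        self.data, self.lr, self.epochs = data, lr, epochs
        self.w = 0.0

    def compute_loss_grad(self, x, y):
        """Hook: return (loss, d loss / d w) for one example. Default: squared error."""
        err = self.w * x - y
        return err * err, 2.0 * err * x

    def train(self):
        """The reusable loop; subclasses normally leave this untouched."""
        for _ in range(self.epochs):
            for x, y in self.data:
                _, grad = self.compute_loss_grad(x, y)
                self.w -= self.lr * grad
        return self.w


class AbsLossTrainer(ToyTrainer):
    """Overrides only the loss hook; train() is inherited unchanged."""

    def compute_loss_grad(self, x, y):
        err = self.w * x - y
        sign = (err > 0) - (err < 0)
        return abs(err), sign * x


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated from y = 2 * x
w_mse = ToyTrainer(data).train()
w_abs = AbsLossTrainer(data, lr=0.01).train()
print(w_mse, w_abs)
```

Both trainers land near `w = 2`; the subclass changes the objective without touching the loop, which is the same division of labour as subclassing the real Trainer for a task with a custom loss.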

Special Token

Sometimes special tokens or domain-specific words are added to improve model performance.

When adding a special token, the tokenizer has to be told that the token is a special token. In that case, use the add_special_tokens() method.
When adding an ordinary token, use the add_tokens() method to extend the vocab.

Once vocab has been added to the tokenizer, the token embedding size of the pretrained model has to be changed as well. Calling len() on the tokenizer returns the total vocab count, so use that. The vocab grows by the number of added tokens and the embedding grows to match, which makes adding vocab straightforward. Use model.resize_token_embeddings for this.

model_name = 'klue/bert-base'

config = AutoConfig.from_pretrained(
    model_name,
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)

# Add special tokens
special_tokens_dict = {'additional_special_tokens': ['[special1]','[special2]','[special3]','[special4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

# Add ordinary tokens
new_tokens = ['COVID', 'hospitalization']
num_added_toks = tokenizer.add_tokens(new_tokens)

# Load the model with the original config
# Note: changing the vocab size before loading the model conflicts with the pretrained config and raises an error
model = AutoModelForQuestionAnswering.from_pretrained(
    model_name,
    config=config,
)

# Update the vocab size in the config (to reduce errors later on)
config.vocab_size = len(tokenizer)

# Resize the model's token embedding
model.resize_token_embeddings(len(tokenizer))
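Conceptually, the resize step keeps the existing embedding rows and appends freshly initialized rows for the new tokens. The sketch below shows that idea on plain lists; it is a toy picture, not how transformers implements it, and `resize_embeddings`, the seed, and the 0.02 standard deviation are assumptions made for the example.

```python
import random


def resize_embeddings(table, new_size, seed=0):
    """Return a table with new_size rows: old rows kept, extra rows newly initialized."""
    hidden = len(table[0])
    rng = random.Random(seed)
    kept = [row[:] for row in table[:new_size]]
    extra = [
        [rng.gauss(0.0, 0.02) for _ in range(hidden)]
        for _ in range(max(0, new_size - len(table)))
    ]
    return kept + extra


old_table = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # vocab size 2, hidden size 3
vocab = ["[PAD]", "hello"]
vocab += ["[special1]", "COVID"]  # two tokens added to the tokenizer

new_table = resize_embeddings(old_table, len(vocab))
print(len(new_table), len(new_table[0]))
```

The new rows start out untrained, so the added tokens only become meaningful after fine-tuning on data that contains them.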

📣 Does adding special tokens always require a resize?

Not necessarily. Well-built models leave spare vocab slots so that new vocab can be added without a resize. The number of spare slots differs from model to model, so check it.

model_name = 'klue/bert-base'

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(tokenizer.vocab["[unused0]"])  # 31500

# klue/bert-base has 500 dummy vocab entries (index: 0~499)
print(tokenizer.vocab["[unused499]"])  # 31999

To check whether a particular word is in tokenizer.vocab, pass the word in as a string.

Do you see the [unused0] entry? It means dummy vocab entries that are not in use were added.

Why add such dummy vocab entries?
They give users spare room to add words as needed.
In other words, they are meant to let users work with additional vocab while modifying the pretrained model as little as possible.
Most recently released models have 'unused' vocab entries.
Not every model accounts for dummy vocab, though.
- For example, SKT's KoBERT has no dummy vocab.
- So adding vocab there means editing things manually, including the part that uses gluonnlp.
Now that we know models reserve slots for dummy vocab, you can go to the directory where the cache is stored when the tokenizer is loaded and manually edit the vocab.txt file, which lets you use new tokens without a resize. (If that is too much hassle, call add_tokens and then resize.) If you added a very large amount of vocab, it is better to run pretraining again (TAPT: Task Adaptive PreTraining)
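The manual vocab.txt edit described above amounts to overwriting `[unusedN]` lines with the new words so that the total line count stays the same. Here is a minimal sketch on an in-memory list; the six-line vocab and the helper name `fill_unused_slots` are made up, and a real edit would operate on the lines of the cached vocab.txt.

```python
def fill_unused_slots(vocab_lines, new_tokens):
    """Replace '[unusedN]' entries with new tokens, keeping the list length fixed."""
    lines = list(vocab_lines)
    pending = [t for t in new_tokens if t not in lines]  # skip tokens already present
    for i, line in enumerate(lines):
        if not pending:
            break
        if line.startswith("[unused") and line.endswith("]"):
            lines[i] = pending.pop(0)
    if pending:
        raise ValueError(f"not enough unused slots for: {pending}")
    return lines


vocab_lines = ["[PAD]", "[UNK]", "hello", "[unused0]", "[unused1]", "[unused2]"]
updated = fill_unused_slots(vocab_lines, ["COVID", "hospitalization"])
print(updated)
```

Because each new word takes over an existing id, the embedding matrix keeps its shape and no resize is needed; the reused rows still have to be learned during fine-tuning.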

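Extracting the [CLS] output

Sometimes you only want the embedding at the [CLS] position of the model.

You can get the [CLS] embedding by indexing into the full output representation,

but .pooler_output makes it easier to fetch. Note that for BERT, pooler_output is the [CLS] hidden state passed through an additional dense layer and tanh, so it is not numerically identical to last_hidden_state[:, 0].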

from transformers import AutoTokenizer, AutoModel
import torch

model_name = 'klue/bert-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("내가 먼저 엿보고 온 시간들", return_tensors="pt")
outputs = model(**inputs)

cls_output = outputs.pooler_output
cls_output
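The difference between indexing the [CLS] position and taking the pooled output can be shown with a few lines of arithmetic. In BERT the pooler applies a dense layer followed by tanh to the first token's hidden state; the toy below reproduces that formula with made-up numbers (the hidden states, `weight`, and `bias` are illustrative values, not real model parameters).

```python
import math


def pooler(cls_hidden, weight, bias):
    """BERT-style pooling: tanh(W @ h_cls + b) on the first token's hidden state."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy last_hidden_state for one sentence: 3 tokens, hidden size 2.
last_hidden_state = [
    [0.5, -1.0],  # [CLS]
    [0.3, 0.8],
    [-0.2, 0.1],
]
weight = [[1.0, 0.0], [0.0, 2.0]]  # made-up pooler parameters
bias = [0.0, 0.0]

cls_by_indexing = last_hidden_state[0]
cls_by_pooler = pooler(cls_by_indexing, weight, bias)
print(cls_by_indexing, cls_by_pooler)
```

Which of the two to feed into a downstream layer is a design choice, so it is worth checking which one a given head actually uses.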

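📣 Does the [CLS] token really represent the sentence?

According to the BERT paper, the [CLS] token is known as the value that represents the sentence.

However, the BERT authors themselves have said there is no guarantee that [CLS] is a sentence representation. https://github.com/google-research/bert/issues/164 https://arxiv.org/pdf/1908.10084.pdf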
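A commonly used alternative to the [CLS] vector is mean pooling over the token embeddings, which is the default pooling in Sentence-BERT (the second link above). A minimal sketch on plain lists, assuming one sentence with an attention mask that marks padding as 0; `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding positions are ignored."""
    hidden = len(token_embeddings[0])
    total = [0.0] * hidden
    count = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:
            count += 1
            for j in range(hidden):
                total[j] += vec[j]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]  # the last position is padding
sentence_vec = mean_pool(token_embeddings, attention_mask)
print(sentence_vec)
```

Skipping the masked positions matters: averaging over padding would pull the sentence vector toward whatever the model outputs for pad tokens.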

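Designing the training pipeline

1. Load the data

2. Preprocess and format the data

3. Put the data into a Dataset class and Data Loader (turn it into a template)

4. Load the tokenizer, e.g. morpheme-level, wordpiece, syllable-level ...

Load the model

Choose an optimization strategy, e.g. Adam, AdamW, AdamP, SGD ...

Choose a loss function, e.g. MSE, MAE, Cross-entropy, Focal ...

5. Configure the Trainer

6. Find the best parameters with a Wandb sweep

* These are the parts you can experiment with to improve performance.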
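As a small illustration of why the loss function is worth experimenting with, here is how cross-entropy and focal loss treat an easy and a hard example. The formulas are the standard single-example forms (focal loss without the optional alpha weighting); gamma = 2.0 and the two probabilities are arbitrary choices for the sketch.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy for one example, given the probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p)^gamma to down-weight easy examples."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


easy, hard = 0.9, 0.1  # predicted probability of the correct class
ce_ratio = cross_entropy(hard) / cross_entropy(easy)
focal_ratio = focal_loss(hard) / focal_loss(easy)
print(ce_ratio, focal_ratio)
```

The hard example dominates far more strongly under focal loss, which is why it is often tried on imbalanced classification data.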

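Taking a Hugging Face model apart

Example 1) I want to add embedding values to BertForSequenceClassification

=> The embeddings part has to be modified (the red box in the original figure)

Start from the code in the Hugging Face GitHub repository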

import torch
from torch import nn

class CustomBertEmbeddings(nn.Module):  # BertEmbeddings in the original source
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # 🍊 Layers I added
        self.entity_loc_embeddings = nn.Embedding(3, config.hidden_size, max_norm = True)
        self.entity_type_embeddings = nn.Embedding(7, config.hidden_size, max_norm = True)
        # 🍊

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
        self.register_buffer(
            "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False
        )
    
    # Original signature:
    # def forward(
    #     self,
    #     input_ids: Optional[torch.LongTensor] = None,
    #     token_type_ids: Optional[torch.LongTensor] = None,
    #     position_ids: Optional[torch.LongTensor] = None,
    #     inputs_embeds: Optional[torch.FloatTensor] = None,
    #     past_key_values_length: int = 0,
    # ) -> torch.Tensor:

    # 🍊 Modified forward function
    def forward(
        self,
        input_ids=None,
        token_type_ids=None,
        position_ids=None,
        inputs_embeds=None,  # kept from the original signature because the body below uses it
        entity_loc_ids=None,  # 🍊 newly added
        entity_type_ids=None,  # 🍊 newly added
        past_key_values_length=0,
    ) -> torch.Tensor:
        # 🍊

        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]

        # Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs
        # when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves
        # issue #5664
        if token_type_ids is None:
            if hasattr(self, "token_type_ids"):
                buffered_token_type_ids = self.token_type_ids[:, :seq_length]
                buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length)
                token_type_ids = buffered_token_type_ids_expanded
            else:
                token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        # 🍊
        # Originally a single line: embeddings = inputs_embeds + token_type_embeddings
        # entity_loc_ids must be passed in; entity_type_ids is optional
        entity_loc_embeddings = self.entity_loc_embeddings(entity_loc_ids)
        if entity_type_ids is not None:
            entity_type_embeddings = self.entity_type_embeddings(entity_type_ids)
            embeddings = inputs_embeds + token_type_embeddings + entity_loc_embeddings + entity_type_embeddings
        else:
            embeddings = inputs_embeds + token_type_embeddings + entity_loc_embeddings
        # 🍊

        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

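Example 2) I want to change how attention is computed in BertForSequenceClassification

=> The self-attention part has to be modified (the red box in the original figure)

The layer above it also has to call CustomBertSelfAttention instead, so that change is made in its __init__.

The score function can be swapped for one of the various alternatives (listed in a table in the original post).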
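Since the table of score functions did not survive here, the sketch below shows a few commonly cited candidates as an assumption about what such a table would contain: plain dot product, scaled dot product (the one BERT uses), and a "general" form with a learned matrix. Everything runs on plain lists, and the query, keys, and matrix are toy values.

```python
import math


def dot_score(q, k):
    """Plain dot product between a query and a key."""
    return sum(a * b for a, b in zip(q, k))


def scaled_dot_score(q, k):
    """Dot product divided by sqrt(d_k), as in standard Transformer attention."""
    return dot_score(q, k) / math.sqrt(len(k))


def general_score(q, k, W):
    """q^T W k with a learned matrix W (here just a fixed toy matrix)."""
    Wk = [dot_score(row, k) for row in W]
    return dot_score(q, Wk)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attention_weights(q, keys, score_fn):
    """Turn the scores of one query against all keys into attention weights."""
    return softmax([score_fn(q, k) for k in keys])


q = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
w_dot = attention_weights(q, keys, dot_score)
w_scaled = attention_weights(q, keys, scaled_dot_score)
print(w_dot, w_scaled)
```

Scaling flattens the distribution a little, and swapping `score_fn` is all it takes to compare variants, which mirrors what a CustomBertSelfAttention would change inside its forward pass.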

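Example 3) I want to change the classifier of BertForSequenceClassification

=> The classifier part has to be modified (the red box in the original figure)

Modifying a Hugging Face model to your liking

Large language models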
