I've been watching a course on model quantization. When quantizing the model below, the instructor recommends not quantizing the final lm_head.
CodeGenForCausalLM(
  (transformer): CodeGenModel(
    (wte): Embedding(51200, 1024)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-19): 20 x CodeGenBlock(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): CodeGenAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (qkv_proj): W8A16LinearLayer()
          (out_proj): W8A16LinearLayer()
        )
        (mlp): CodeGenMLP(
          (fc_in): W8A16LinearLayer()
          (fc_out): W8A16LinearLayer()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=51200, bias=True)
)
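
As the printout shows, every nn.Linear inside the transformer blocks has been swapped for the course's W8A16LinearLayer, while lm_head is left as a plain Linear. For context, here is a minimal sketch of how such a swap could be done while excluding lm_head. This is my own reconstruction, not the course's actual code: the W8A16Linear class and the replace_linear helper below are names I made up.

import torch
import torch.nn as nn

class W8A16Linear(nn.Module):
    # Minimal int8-weight / fp-activation linear layer. My own sketch,
    # NOT the course's W8A16LinearLayer implementation.
    def __init__(self, in_features, out_features, bias=True, dtype=torch.float16):
        super().__init__()
        self.register_buffer(
            "int8_weight", torch.zeros(out_features, in_features, dtype=torch.int8)
        )
        self.register_buffer("scales", torch.ones(out_features, dtype=dtype))
        if bias:
            self.register_buffer("bias", torch.zeros(out_features, dtype=dtype))
        else:
            self.bias = None

    def quantize(self, fp_weight):
        # Symmetric per-output-channel quantization: the largest |w| in each
        # row is mapped to 127.
        w = fp_weight.detach().to(torch.float32)
        scales = w.abs().max(dim=-1).values.clamp(min=1e-8) / 127.0
        self.int8_weight = torch.round(w / scales.unsqueeze(-1)).to(torch.int8)
        self.scales = scales.to(self.scales.dtype)

    def forward(self, x):
        # Dequantize on the fly, then run a normal matmul.
        w = self.int8_weight.to(x.dtype) * self.scales.to(x.dtype).unsqueeze(-1)
        bias = self.bias.to(x.dtype) if self.bias is not None else None
        return nn.functional.linear(x, w, bias)

def replace_linear(module, exclude=("lm_head",)):
    # Recursively swap nn.Linear for W8A16Linear, skipping excluded names
    # such as lm_head.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name not in exclude:
            new_layer = W8A16Linear(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                dtype=child.weight.dtype,
            )
            new_layer.quantize(child.weight)
            if child.bias is not None:
                new_layer.bias = child.bias.detach().to(child.weight.dtype)
            setattr(module, name, new_layer)
        else:
            replace_linear(child, exclude)

Running replace_linear(model) on a CodeGen checkpoint of this size would give a module tree like the one above, with lm_head left untouched.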
His exact explanation for skipping lm_head, quoted from the transcript:
2:14 And as I said we're not going to quantize the language model head
2:18 because since the model is an autoregressive model, it uses
2:22 the output from the previous iteration to get the output of the next iteration.
2:27 If you quantize the language model head, a lot of errors might
2:31 might be accumulating over the generation steps.
2:34 And you will most likely end up, having some gibberish after some tokens.
I don't follow his reasoning. Why would quantizing the lm_head cause errors to accumulate over generation? Could someone explain it in simple terms?
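
To make my question concrete, here is a toy sketch (entirely my own, not from the course) of the feedback loop he seems to be describing: the token chosen at one step is fed back as the input for the next step, so if noise on the head flips a single argmax, every later step runs on a different prefix.

import torch

torch.manual_seed(0)

vocab, hidden, steps = 50, 16, 8
emb = torch.randn(vocab, hidden)
lm_head = torch.randn(vocab, hidden)       # stand-in for the real lm_head weight
noise = 0.05 * torch.randn_like(lm_head)   # stand-in for quantization error

def generate(head, start_token=0):
    tokens = [start_token]
    h = emb[start_token]
    for _ in range(steps):
        logits = head @ h                  # (vocab,)
        next_token = int(logits.argmax())  # greedy pick
        tokens.append(next_token)
        # The *chosen* token (not the logits) is fed back in, so any
        # divergence here propagates to every following step.
        h = 0.5 * h + 0.5 * emb[next_token]  # toy stand-in for the transformer state
    return tokens

print(generate(lm_head))           # reference sequence
print(generate(lm_head + noise))   # "quantized" head: may diverge from the
                                   # first step where an argmax flips

Is this the kind of accumulation he means, or is there more to it?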