muP (maximum update parametrization) #650

gordicaleksa wants to merge 37 commits into karpathy:master
Conversation
gordicaleksa force-pushed from ce71c19 to 9864277
Hello, I looked over your implementation of muP in both CUDA and Python, and found that in CUDA ( but I haven't seen the same scaling in your PyTorch version.
@alxndrTL it's mentioned in
Ok, I didn't realize that the layernorm code I showcased is only used pre-logits, as per section 3.3 (I thought it was used for every layer norm).
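For readers following along, here is a minimal sketch of what "pre-logits" means here (generic GPT-2-style names, not this repo's actual code): each transformer block has its own layer norms, but the final layer norm runs exactly once, right before the `lm_head` readout.

```python
import torch
import torch.nn as nn

d_model, vocab = 32, 64

# Simplified GPT-2 tail: the per-block layer norms live inside the blocks;
# ln_f below is applied exactly once, right before mapping into logits.
ln_f = nn.LayerNorm(d_model)                     # the "pre-logits" layer norm
lm_head = nn.Linear(d_model, vocab, bias=False)  # readout into logits

x = torch.randn(2, 5, d_model)  # (batch, seq, channels) activations
logits = lm_head(ln_f(x))
print(logits.shape)  # torch.Size([2, 5, 64])
```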
Hyperparam sweeps

Note:

- Conclusion:
- Conclusion: Using
- Conclusion: ~1/2^10 is a sweet spot for the lr. The curves are stable as we increase the depth, i.e. the optimal
- Conclusion: next steps:
cc: @karpathy
@YuchenJin would be great to kick off a 7B muP run if you have some bandwidth! :)
Hey @gordicaleksa, happy to! Do you want me to just run the two scripts (
Line 165 in b125cc6: AFAIK this line zero-initializes modules with the 'LLMC_SKIP_INIT' flag if muP is enabled. There is only one module with the 'LLMC_SKIP_INIT' flag: lm_head. The lm_head weight is tied to wte.weight. Since embedding layers are initialized later in the code, what is the purpose of the zero initialization referenced above?
After reading mup.md, I can see now that muP requires output layers to be initialized to zero. The code assumes that embeddings are initialized before linear layers, which is correct but IMHO a weak assumption. Thanks for the great work.
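The zero-init requirement being discussed can be sketched in PyTorch (a hypothetical toy module, not the actual llm.c code, which handles this in C via the `LLMC_SKIP_INIT` flag):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):  # hypothetical toy model for illustration
    def __init__(self, vocab_size=64, d_model=32):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)  # input embedding
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # muP: the output (readout) layer starts at zero.
        nn.init.zeros_(self.lm_head.weight)
        # Caveat from the discussion above: if lm_head.weight were tied to
        # wte.weight, whichever init runs last would win, so init order matters.

model = TinyLM()
print(model.lm_head.weight.abs().sum().item())  # 0.0
```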
Main changes (see the mup.md file for more details):

- Use `1/d` attention scaling instead of `1/sqrt(d)`; also add an `attn_mult` tunable coefficient
- Scale by `1/width_mult` before mapping into logits

where:

- `width_mult` is the ratio of widths of the current model to the base model
- `d` is the number of channels in a single attn head
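The two scalings can be sketched as follows (a hedged illustration with made-up function names, not the PR's actual kernels):

```python
import math
import torch

def attention_scores(q, k, use_mup=True, attn_mult=1.0):
    """Dot-product attention logits; q, k have shape (..., seq, d),
    where d is the number of channels in a single attn head.
    muP scales by attn_mult/d instead of the standard 1/sqrt(d)."""
    d = q.size(-1)
    scale = attn_mult / d if use_mup else 1.0 / math.sqrt(d)
    return (q @ k.transpose(-2, -1)) * scale

def logits(x, lm_head_weight, width_mult=1.0):
    """muP scales activations by 1/width_mult before the readout."""
    return (x / width_mult) @ lm_head_weight.t()

q = torch.randn(2, 4, 3, 8)   # (batch, heads, seq, d)
k = torch.randn(2, 4, 3, 8)
print(attention_scores(q, k).shape)  # torch.Size([2, 4, 3, 3])
```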
Test

To test muP vs SP (standard parametrization), run the `scripts/mup_coordinate_check.sh` script and visualize the results with the `dev/mup_coordinate_check_visualize.py` script.

Run with:

- `use_mup` set to `1`
- `mup_width_mult` set to the ratio of widths of your target model to your base model
- `mup_base_attn_mult` is a tunable param; `1` seems to be working nicely for our family of models.

Ablations
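For example, with an assumed base width of 256 channels and a target width of 1024 (numbers picked purely for illustration), the width multiplier would be:

```python
base_channels = 256      # width the base model was tuned at (assumed value)
target_channels = 1024   # width of the model you actually train (assumed value)

# mup_width_mult = target width / base width
mup_width_mult = target_channels / base_channels
print(mup_width_mult)  # 4.0
```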
The coord check results are highly dependent on the learning rate and max width used.
In my preliminary ablations (max width = 1024 & lr = 0.0006) I concluded that the only thing that would mess up the coordinate check was this line:
```c
scale = (model->use_mup && i != 0 && i != 1) ? mup_scale_inv*scale : scale;
```

In my subsequent ablations (max width = 1024 & lr = 0.006, i.e. an lr almost the same as in the reference muP GPT-2 implementation, which uses 0.01) I concluded that the results are much more sensitive: the Adam modifications also matter, the `1/width_mult` logits scaling matters, and so does whether we use `1/d` attention scaling.

See the next comment for more thorough ablation results.
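For context, the quantity a coordinate check tracks can be sketched generically (this is not the PR's script): record the mean absolute activation of each layer at several widths; under muP these statistics should stay roughly O(1) as width grows, while under SP they drift.

```python
import torch
import torch.nn as nn

def coord_check_stats(model, x):
    """Mean absolute value of each Linear layer's output: the per-layer
    statistic a coordinate check compares across model widths."""
    stats, hooks = {}, []
    for name, mod in model.named_modules():
        if isinstance(mod, nn.Linear):
            hooks.append(mod.register_forward_hook(
                lambda m, inp, out, name=name: stats.__setitem__(
                    name, out.abs().mean().item())))
    model(x)
    for h in hooks:
        h.remove()
    return stats

# Toy stand-in model; the real check would sweep widths and training steps.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
print(coord_check_stats(model, torch.randn(2, 8)))
```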