Re-update the LLM implementation#48

Merged
niansong1996 merged 8 commits into
mainfrom
yilunzhao/llm_implementation
Jul 10, 2023

Conversation

@yilunzhao
Member

@yilunzhao yilunzhao commented Apr 25, 2023

#46, re-update the implementation for llama, alpaca, santacoder

@niansong1996
Contributor

Seems like there is an error from CI. I've seen this before; check here to see if it's useful.

@yilunzhao yilunzhao force-pushed the yilunzhao/llm_implementation branch from 3f2247e to 56ba84f Compare April 25, 2023 22:46
@yilunzhao yilunzhao force-pushed the yilunzhao/llm_implementation branch 2 times, most recently from d537aba to 9dbc8cd Compare April 26, 2023 01:49
@yilunzhao
Member Author

Hi @niansong1996, sorry for the late reply. I have resolved the CI error. It seems that I had to change the transformers version in requirements.txt to avoid it.

@niansong1996
Contributor

That is okay. What we can do is use this branch to evaluate the new models before we decide to upgrade the transformers version in the main branch.

@niansong1996
Contributor

niansong1996 commented Apr 27, 2023

@yilunzhao I am getting the following error when testing LLaMA:
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:46, unhandled cuda error, NCCL version 2.10.3

The command I ran is:
python finetuning/trainer.py validate --config finetuning/training_configs/few_shot/spider.yaml --model.beam_size 1 --data.val_max_instances 1 --data.val_batch_size 1 --model.print_generation_results true --model.print_eval_every_n_batches 1 --model.init_args.transformer_model_name decapoda-research/llama-7b-hf --data.init_args.transformer_model_name decapoda-research/llama-7b-hf --trainer.devices 2

Now if I use one GPU, I will get this error:
RuntimeError: CUDA error: no kernel image is available for execution on the device

Can you see if you can replicate those errors and figure out why they are happening?

…t llama-based model uses empty string as tokenizer_eos_token
@yilunzhao
Member Author

Hi @niansong1996, I think this error is raised because the installed torch is incompatible with the CUDA setup on ziva. Could you please try reinstalling torch with pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116, and see if that resolves the issue?
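One way to test the incompatibility hypothesis before reinstalling is to compare the GPU architectures the installed torch build was compiled for against the compute capability of the device. The sketch below is only an illustration, not something from this PR: the helper name build_supports_device is hypothetical, and the matching rule is a simplification (an sm_XY binary is taken to run on a device with the same major version and an equal or higher minor version, and a compute_XY PTX entry on any device of equal or higher capability). The torch calls used at the bottom (torch.version.cuda, torch.cuda.get_arch_list, torch.cuda.get_device_capability) only run if torch and a GPU are present.

```python
from typing import Iterable, Tuple


def build_supports_device(arch_list: Iterable[str], capability: Tuple[int, int]) -> bool:
    """Hypothetical helper: True if any compiled arch can run on the device.

    arch_list is expected to look like ["sm_70", "sm_80", "compute_80"];
    capability is a (major, minor) pair such as (8, 6).
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit() or len(digits) < 2:
            continue  # skip entries this simplified parser cannot read
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        # Binary code: same major version, compiled minor <= device minor.
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        # PTX: can be JIT-compiled for any device of equal or newer capability.
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print("torch built for CUDA:", torch.version.cuda)
        print("compiled architectures:", arch_list)
        print("device capability:", capability)
        print("build supports device:", build_supports_device(arch_list, capability))
```

If the last line prints False, the "no kernel image is available" error is consistent with a torch build that lacks kernels for that GPU, and installing a build for a matching CUDA version would be a reasonable next step.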

And this is my pip freeze:

absl-py==1.4.0
aiohttp==3.8.4
aiosignal==1.3.1
appdirs==1.4.4
astunparse==1.6.3
async-timeout==4.0.2
attrs==23.1.0
cachetools==5.3.0
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
deepspeed==0.6.7
docker-pycreds==0.4.0
docopt==0.6.2
docstring-parser==0.15
filelock==3.12.0
frozenlist==1.3.3
fsspec==2023.4.0
func-timeout==4.3.5
gitdb==4.0.10
GitPython==3.1.31
google-auth==2.17.3
google-auth-oauthlib==1.0.0
grpcio==1.54.0
hjson==3.1.0
huggingface-hub==0.14.1
idna==3.4
importlib-metadata==6.6.0
joblib==1.2.0
jsonargparse==4.15.0
Markdown==3.4.3
MarkupSafe==2.1.2
multidict==6.0.4
ninja==1.11.1
nltk==3.8.1
numpy==1.24.3
oauthlib==3.2.2
openai==0.27.5
overrides==7.3.1
packaging==23.1
pandas==2.0.1
pathtools==0.1.2
Pillow==9.5.0
pipreqs==0.4.13
protobuf==4.22.3
psutil==5.9.5
py-cpuinfo==9.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.7
pyDeprecate==0.3.2
python-dateutil==2.8.2
pytorch-lightning==1.7.4
pytz==2023.3
PyYAML==6.0
regex==2023.3.23
requests==2.29.0
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.10.1
sentencepiece==0.1.98
sentry-sdk==1.21.0
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
sqlparse==0.4.4
tensorboard==2.12.2
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
tokenizers==0.13.3
torch==1.12.1+cu116
torchaudio==0.12.1+cu116
torchmetrics==0.9.3
torchvision==0.13.1+cu116
tqdm==4.65.0
transformers @ git+https://github.com/huggingface/transformers@11fd2c773b11c3fcfe0fa25aa4b92db03c83636c
tree-sitter==0.19.0
typing_extensions==4.5.0
tzdata==2023.3
urllib3==1.26.15
wandb==0.15.0
Werkzeug==2.3.1
yarg==0.1.9
yarl==1.9.2
zipp==3.15.0
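
When comparing a listing like the one above against another environment, a quick sanity check is whether torch, torchvision, and torchaudio all carry the same CUDA build tag (here +cu116). The following is a minimal sketch using only the standard library; the function names cuda_tags and tags_consistent are made up for illustration, and the sample text is copied from the listing above.

```python
import re
from typing import Dict, Optional

TORCH_FAMILY = ("torch", "torchvision", "torchaudio")

# Matches "name==version" with an optional "+local" suffix such as "+cu116".
_PIN = re.compile(r"^([A-Za-z0-9_.-]+)==([^+\s]+)(?:\+(\S+))?$")


def cuda_tags(freeze_text: str) -> Dict[str, Optional[str]]:
    """Map each torch-family package in pip freeze output to its local tag."""
    tags: Dict[str, Optional[str]] = {}
    for line in freeze_text.splitlines():
        match = _PIN.match(line.strip())
        if match and match.group(1).lower() in TORCH_FAMILY:
            tags[match.group(1).lower()] = match.group(3)  # None if no "+tag"
    return tags


def tags_consistent(tags: Dict[str, Optional[str]]) -> bool:
    """True when every torch-family package found shares one build tag."""
    return len(set(tags.values())) <= 1


# Lines taken from the pip freeze output above.
sample = """\
torch==1.12.1+cu116
torchaudio==0.12.1+cu116
torchmetrics==0.9.3
torchvision==0.13.1+cu116
"""

if __name__ == "__main__":
    found = cuda_tags(sample)
    print(found)
    print("consistent:", tags_consistent(found))
```

To check another machine, the output of pip freeze there can be passed to cuda_tags in place of sample. A mismatch only signals that the packages come from different builds; it does not by itself confirm the cause of the errors reported earlier.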

@niansong1996 niansong1996 merged commit cd3a9fb into main Jul 10, 2023