(base) [root@localhost ~]# docker exec finetune-trainer tail -100 /tmp/train_297b8bc2-e382-4b53-853b-dbff4578601e.log
[remote_train] task_type=dpo, template=auto
[remote_train] Engine loaded: TextEngine
[remote_train] Running preprocess_dataset...
[remote_train] Preprocessing done, output: /root/Fine-tuning/backend/data/processed/297b8bc2-e382-4b53-853b-dbff4578601e_processed.jsonl
[remote_train] Step 2: Loading model: Qwen/Qwen3.5-0.8B...
Current Triton version 3.0.0 is below the recommended 3.2.0 version. Errors may occur and these issues will not be fixed. Please consider upgrading Triton.
Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
torch.compile is not available in Python 3.10, using identity decorator instead
/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
  warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
  warnings.warn(_BETA_TRANSFORMS_WARNING)
Loading weights: 100%|██████████| 320/320 [00:06<00:00, 47.18it/s]
[remote_train] Model loaded successfully
[remote_train] Step 3: Building PEFT config...
[remote_train] Step 4: Starting training...
[remote_train] NOTE: First step may take 2-5 minutes due to Triton kernel compilation (autotuning). This is normal.
[remote_train] Total steps: 3 epochs, batch_size per GPU=16
/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:1348: UserWarning: Model has `tie_word_embeddings=True` and a tied layer is part of the adapter, but `ensure_weight_tying` is not set to True. This can lead to complications, for example when merging the adapter or converting your model to formats other than safetensors. Check the discussion here: https://github.com/huggingface/peft/issues/2777
  warnings.warn(msg)
bitsandbytes library load error: Configured CUDA binary not found at /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 320, in <module>
    lib = get_native_library()
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 288, in get_native_library
    raise RuntimeError(f"Configured {BNB_BACKEND} binary not found at {cuda_binary_path}")
RuntimeError: Configured CUDA binary not found at /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
[transformers] warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
[transformers] warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
trainable params: 5,070,848 || all params: 757,463,872 || trainable%: 0.6695
Map: 100%|██████████| 5/5 [00:00<00:00, 160.55 examples/s]
0%| | 0/1 [00:00<?, ?it/s]Training failed for job 297b8bc2-e382-4b53-853b-dbff4578601e: 'NoneType' object cannot be interpreted as an integer
[remote_train] [rank 0] ERROR: 'NoneType' object cannot be interpreted as an integer
[remote_train] Traceback (most recent call last):
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 236, in run_training
    adapter_path = await engine.train(
  File "/root/Fine-tuning/backend/app/engines/text_engine.py", line 472, in train
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1427, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1509, in _inner_training_loop
    self._run_epoch(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1704, in _run_epoch
    batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches, self.args.device)
  File "/root/Fine-tuning/backend/app/engines/text_engine.py", line 296, in _patched_gbs
    batch = next(epoch_iterator)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/data_loader.py", line 577, in __iter__
    current_batch = next(dataloader_iter)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 734, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 790, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/utils.py", line 460, in __call__
    to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/utils.py", line 460, in <listcomp>
    to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
TypeError: 'NoneType' object cannot be interpreted as an integer
[remote_train] === Training job failed: 297b8bc2-e382-4b53-853b-dbff4578601e ===
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 466, in <module>
    main()
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 461, in main
    asyncio.run(run_training(job_id, model_id, model_type, dataset_id, config,
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 236, in run_training
    adapter_path = await engine.train(
  File "/root/Fine-tuning/backend/app/engines/text_engine.py", line 472, in train
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1427, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1509, in _inner_training_loop
    self._run_epoch(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1704, in _run_epoch
    batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches, self.args.device)
  File "/root/Fine-tuning/backend/app/engines/text_engine.py", line 296, in _patched_gbs
    batch = next(epoch_iterator)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/data_loader.py", line 577, in __iter__
    current_batch = next(dataloader_iter)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 734, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 790, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/utils.py", line 460, in __call__
    to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/utils.py", line 460, in <listcomp>
    to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
TypeError: 'NoneType' object cannot be interpreted as an integer
0%| | 0/1 [00:12<?, ?it/s]