(base) [root@localhost ~]# docker exec finetune-trainer tail -200 /tmp/train_4e49dfbd-4a47-4c39-842e-462410e055a4.log
[remote_train] fla package found at: /opt/conda/lib/python3.10/site-packages/fla
[remote_train] fla shared memory patch v2 already applied, skipping
[remote_train] [rank 0] === Training job started: 4e49dfbd-4a47-4c39-842e-462410e055a4 ===
[remote_train] model_id=Qwen/Qwen3.5-0.8B, model_type=text
[remote_train] dataset_path=/root/Fine-tuning/backend/data/datasets/dpo_sample.jsonl
[remote_train] config={"model_id": "Qwen/Qwen3.5-0.8B", "model_type": "text", "dataset_id": "41e0a8e2-ddc7-464b-bc44-b13261bbc221", "peft_method": "lora", "epochs": 3, "batch_size": 16, "gradient_accumulation": 4, "learnin
[remote_train] Step 1: Preprocessing dataset...
[remote_train]   task_type=dpo, template=auto
[remote_train]   Engine loaded: TextEngine
[remote_train]   Running preprocess_dataset...
[remote_train]   Preprocessing done, output: /root/Fine-tuning/backend/data/processed/4e49dfbd-4a47-4c39-842e-462410e055a4_processed.jsonl
[remote_train] Step 2: Loading model: Qwen/Qwen3.5-0.8B...
Current Triton version 3.0.0 is below the recommended 3.2.0 version. Errors may occur and these issues will not be fixed. Please consider upgrading Triton.
Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
torch.compile is not available in Python 3.10, using identity decorator instead
/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
  warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
  warnings.warn(_BETA_TRANSFORMS_WARNING)
Loading weights: 100%|██████████| 320/320 [00:06<00:00, 46.85it/s]
[remote_train]   Model loaded successfully
[remote_train] Step 3: Building PEFT config...
[remote_train] Step 4: Starting training...
[remote_train] NOTE: First step may take 2-5 minutes due to Triton kernel compilation (autotuning). This is normal.
[remote_train] Total steps: 3 epochs, batch_size per GPU=16
/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:1348: UserWarning: Model has `tie_word_embeddings=True` and a tied layer is part of the adapter, but `ensure_weight_tying` is not set to True. This can lead to complications, for example when merging the adapter or converting your model to formats other than safetensors. Check the discussion here: https://github.com/huggingface/peft/issues/2777
  warnings.warn(msg)
bitsandbytes library load error: Configured CUDA binary not found at /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 320, in <module>
    lib = get_native_library()
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 288, in get_native_library
    raise RuntimeError(f"Configured {BNB_BACKEND} binary not found at {cuda_binary_path}")
RuntimeError: Configured CUDA binary not found at /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
[transformers] warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
[transformers] warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
trainable params: 5,070,848 || all params: 757,463,872 || trainable%: 0.6695
Map: 100%|██████████| 5/5 [00:00<00:00, 158.56 examples/s]
[remote_train] [rank 0] ERROR: 'DPOTrainer' object has no attribute '_data_collator'
[remote_train] Traceback (most recent call last):
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 236, in run_training
    adapter_path = await engine.train(
  File "/root/Fine-tuning/backend/app/engines/text_engine.py", line 404, in train
    _orig_collator = trainer._data_collator
AttributeError: 'DPOTrainer' object has no attribute '_data_collator'. Did you mean: 'data_collator'?

[remote_train] === Training job failed: 4e49dfbd-4a47-4c39-842e-462410e055a4 ===
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 466, in <module>
    main()
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 461, in main
    asyncio.run(run_training(job_id, model_id, model_type, dataset_id, config,
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 236, in run_training
    adapter_path = await engine.train(
  File "/root/Fine-tuning/backend/app/engines/text_engine.py", line 404, in train
    _orig_collator = trainer._data_collator
AttributeError: 'DPOTrainer' object has no attribute '_data_collator'. Did you mean: 'data_collator'?
(base) [root@localhost ~]# docker exec finetune-trainer bash -c '/opt/conda/bin/python -c "import trl; print(trl.__version__)"'
0.9.6
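
The failure above comes from text_engine.py line 404 reading `trainer._data_collator`, an attribute that the DPOTrainer in the installed trl 0.9.6 does not have; the interpreter itself suggests `data_collator`. A minimal sketch of a defensive lookup follows. It assumes the intent at line 404 is only to save the original collator before swapping it. `resolve_collator` and `_FakeTrainer` are hypothetical names used for illustration and are not part of trl, transformers, or this repository.

```python
from typing import Any, Tuple


def resolve_collator(trainer: Any) -> Tuple[str, Any]:
    """Return (attribute_name, collator) for whichever name the trainer exposes.

    Hypothetical helper: tries the public name first, then the private one
    that text_engine.py currently reads.
    """
    for name in ("data_collator", "_data_collator"):
        if hasattr(trainer, name):
            return name, getattr(trainer, name)
    raise AttributeError(
        f"{type(trainer).__name__} exposes neither 'data_collator' "
        "nor '_data_collator'"
    )


class _FakeTrainer:
    """Stand-in for a trainer that only has the public attribute."""

    def __init__(self) -> None:
        self.data_collator = lambda batch: batch


name, collator = resolve_collator(_FakeTrainer())
print(name)  # data_collator
```

In the engine, the returned name could then be passed to `setattr` when restoring the original collator, so the same code path works whichever attribute the installed trainer provides.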