(base) [root@localhost ~]# docker exec finetune-trainer cat /tmp/train_11342ed4-38f4-4ed9-80eb-13c3e6cf27d3.log | tail -200
[remote_train] === Training job started: 11342ed4-38f4-4ed9-80eb-13c3e6cf27d3 ===
[remote_train] model_id=Qwen/Qwen1.5-0.5B, model_type=text
[remote_train] dataset_path=/root/Fine-tuning/backend/data/datasets/data.jsonl
[remote_train] config={"model_id": "Qwen/Qwen1.5-0.5B", "model_type": "text", "dataset_id": "74fefea9-ea87-49de-a76f-760832526987", "peft_method": "lora", "epochs": 3, "batch_size": 4, "gradient_accumulation": 4, "learning
[remote_train] Dataset file exists: /root/Fine-tuning/backend/data/datasets/data.jsonl
[remote_train] Step 1: Preprocessing dataset...
[remote_train]   task_type=sft, template=auto
[remote_train]   output_path=/root/Fine-tuning/backend/data/processed/11342ed4-38f4-4ed9-80eb-13c3e6cf27d3_processed.jsonl
[remote_train]   Selecting engine for model_type=text...
[remote_train]   Engine loaded: TextEngine
[remote_train]   PEFT method: lora
[remote_train]   Running preprocess_dataset...
[remote_train]   Preprocessing done, output: /root/Fine-tuning/backend/data/processed/11342ed4-38f4-4ed9-80eb-13c3e6cf27d3_processed.jsonl
[remote_train] Step 2: Loading model: Qwen/Qwen1.5-0.5B...
[remote_train]   Quantization: None
Loading weights: 100%|██████████| 291/291 [00:04<00:00, 59.76it/s] 
[remote_train]   Model loaded successfully
[remote_train] Step 3: Building PEFT config...
[remote_train]   PEFT config built
[remote_train] Step 4: Starting training...
Map: 100%|██████████| 274147/274147 [00:13<00:00, 19808.13 examples/s]
/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:1348: UserWarning: Model has `tie_word_embeddings=True` and a tied layer is part of the adapter, but `ensure_weight_tying` is not set to True. This can lead to complications, for example when merging the adapter or converting your model to formats other than safetensors. Check the discussion here: https://github.com/huggingface/peft/issues/2777
  warnings.warn(msg)
[transformers] warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
  warnings.warn(_BETA_TRANSFORMS_WARNING)
/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
  warnings.warn(_BETA_TRANSFORMS_WARNING)
trainable params: 5,593,088 || all params: 469,580,800 || trainable%: 1.1911
  0%|          | 0/12852 [00:00<?, ?it/s][10:49:46.054][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:50:06.058][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:50:26.059][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:50:46.061][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:51:06.099][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:51:26.101][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:51:46.102][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:52:06.111][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:52:26.122][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:52:46.124][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:53:06.127][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:53:26.130][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:53:46.138][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:54:06.139][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:54:26.140][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:54:46.142][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:55:06.146][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:55:26.147][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:55:46.148][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
[10:56:06.149][MXKW][E]queues.c                :949 : [mxkwAllocRingBuff]ioctl alloc ringbuffer failed -1
/opt/conda/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py:108: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /workspace/framework/mcPytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:183.)
  freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
  0%|          | 5/12852 [07:16<148:58:52, 41.75s/it]Training failed for job 11342ed4-38f4-4ed9-80eb-13c3e6cf27d3: CUDA out of memory. Tried to allocate 678.00 MiB. GPU 0 has a total capacity of 63.78 GiB of which 0 bytes is free. Of the allocated memory 1.63 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[remote_train] ERROR: CUDA out of memory. Tried to allocate 678.00 MiB. GPU 0 has a total capacity of 63.78 GiB of which 0 bytes is free. Of the allocated memory 1.63 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[remote_train] Traceback (most recent call last):
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 170, in run_training
    adapter_path = await engine.train(
  File "/root/Fine-tuning/backend/app/engines/text_engine.py", line 280, in train
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1427, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1509, in _inner_training_loop
    self._run_epoch(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1737, in _run_epoch
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1909, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1981, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 195, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 218, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 134, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 126, in gather_map
    return type(out)((k, gather_map([d[k] for d in outputs])) for k in out)
  File "<string>", line 8, in __init__
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py", line 451, in __post_init__
    for idx, element in enumerate(iterator):
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 126, in <genexpr>
    return type(out)((k, gather_map([d[k] for d in outputs])) for k in out)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 120, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 576, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py", line 80, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/comm.py", line 253, in gather
    return torch._C._gather(tensors, dim, destination)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 678.00 MiB. GPU 0 has a total capacity of 63.78 GiB of which 0 bytes is free. Of the allocated memory 1.63 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

[remote_train] === Training job failed: 11342ed4-38f4-4ed9-80eb-13c3e6cf27d3 ===
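
The OOM message above reports 0 bytes free on a 63.78 GiB device, while PyTorch accounts for only 1.63 GiB allocated plus 1.44 GiB reserved. A minimal sketch, assuming the message wording shown in this log, that works out how much device memory is held outside this process's PyTorch allocator. The names `summarize_oom` and `to_gib` are illustrative, and the log does not say what is holding the remaining memory.

```python
import re

# Message text copied from the log above (trailing hint omitted).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 678.00 MiB. "
    "GPU 0 has a total capacity of 63.78 GiB of which 0 bytes is free. "
    "Of the allocated memory 1.63 GiB is allocated by PyTorch, "
    "and 1.44 GiB is reserved by PyTorch but unallocated."
)

UNIT_TO_GIB = {"bytes": 1 / 2**30, "KiB": 1 / 2**20, "MiB": 1 / 2**10, "GiB": 1.0}


def to_gib(value, unit):
    """Convert a (number, unit) pair from the message into GiB."""
    return float(value) * UNIT_TO_GIB[unit]


def summarize_oom(message):
    """Extract the sizes named in the OOM message and derive the remainder."""
    size = r"([\d.]+) (bytes|KiB|MiB|GiB)"
    patterns = {
        "requested": rf"Tried to allocate {size}",
        "total": rf"total capacity of {size}",
        "free": rf"of which {size} is free",
        "allocated": rf"allocated memory {size} is allocated by PyTorch",
        "reserved": rf"and {size} is reserved by PyTorch",
    }
    out = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field not found in message: {key}")
        out[key] = to_gib(*match.groups())
    # Memory in use on the device that this PyTorch allocator does not own.
    in_pytorch = out["allocated"] + out["reserved"]
    out["outside_pytorch"] = round(out["total"] - out["free"] - in_pytorch, 2)
    return out


if __name__ == "__main__":
    for name, gib in summarize_oom(OOM_MESSAGE).items():
        print(f"{name:>16}: {gib:.2f} GiB")
```

On these numbers roughly 60.71 GiB of GPU 0 is in use but not attributed to this PyTorch allocator. That suggests the device may already be occupied by something else (for example other processes or containers), rather than by this 0.5B-parameter LoRA job alone.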
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 211, in <module>
    main()
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 207, in main
    asyncio.run(run_training(job_id, model_id, model_type, dataset_id, config))
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/root/Fine-tuning/backend/app/engines/remote_train.py", line 170, in run_training
    adapter_path = await engine.train(
  File "/root/Fine-tuning/backend/app/engines/text_engine.py", line 280, in train
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1427, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1509, in _inner_training_loop
    self._run_epoch(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1737, in _run_epoch
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1909, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1981, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 195, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 218, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 134, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 126, in gather_map
    return type(out)((k, gather_map([d[k] for d in outputs])) for k in out)
  File "<string>", line 8, in __init__
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py", line 451, in __post_init__
    for idx, element in enumerate(iterator):
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 126, in <genexpr>
    return type(out)((k, gather_map([d[k] for d in outputs])) for k in out)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py", line 120, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 576, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py", line 80, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/comm.py", line 253, in gather
    return torch._C._gather(tensors, dim, destination)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 678.00 MiB. GPU 0 has a total capacity of 63.78 GiB of which 0 bytes is free. Of the allocated memory 1.63 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  0%|          | 5/12852 [07:20<314:27:29, 88.12s/it]
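
The progress bar total of 12852 steps can be cross-checked against the job config (274147 mapped examples, `epochs` 3, `batch_size` 4, `gradient_accumulation` 4). A minimal sketch, assuming that under `DataParallel` each dataloader batch holds `batch_size` examples per visible GPU, that the last partial batch is kept, and that partial accumulation groups are rounded up. The helper `optimizer_steps` is illustrative and not part of the training code.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus):
    """Estimate total optimizer steps under the assumptions stated above."""
    # One dataloader batch feeds every visible GPU at once.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # One optimizer step per `grad_accum` batches, rounding the remainder up.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs


if __name__ == "__main__":
    observed_total = 12852  # from the progress bar in the log
    for gpus in range(1, 9):
        steps = optimizer_steps(274147, 4, 4, 3, gpus)
        marker = "  <- matches the log" if steps == observed_total else ""
        print(f"{gpus} GPU(s): {steps} steps{marker}")
```

Under these assumptions only a 4-GPU count reproduces 12852, which is consistent with the `torch/nn/parallel/data_parallel.py` frames in the traceback: the failing allocation happens while gathering replica outputs onto GPU 0.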
(base) [root@localhost ~]# cat /root/Fine-tuning/backend/data/logs/11342ed4-38f4-4ed9-80eb-13c3e6cf27d3.jsonl
{"ts": "2026-05-22T02:48:40.394919+00:00", "type": "start", "job_id": "11342ed4-38f4-4ed9-80eb-13c3e6cf27d3"}
{"ts": "2026-05-22T02:48:40.397444+00:00", "type": "status", "status": "preprocessing"}
{"ts": "2026-05-22T02:48:44.505034+00:00", "type": "status", "status": "loading_model"}
{"ts": "2026-05-22T02:49:02.526554+00:00", "type": "status", "status": "training"}
{"ts": "2026-05-22T02:49:25.920686+00:00", "type": "status", "status": "training"}
{"ts": "2026-05-22T02:49:25.922548+00:00", "type": "epoch_begin", "epoch": 0}
{"ts": "2026-05-22T02:56:46.009692+00:00", "type": "error", "message": "CUDA out of memory. Tried to allocate 678.00 MiB. GPU 0 has a total capacity of 63.78 GiB of which 0 bytes is free. Of the allocated memory 1.63 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)", "traceback": "Traceback (most recent call last):\n  File \"/root/Fine-tuning/backend/app/engines/remote_train.py\", line 170, in run_training\n    adapter_path = await engine.train(\n  File \"/root/Fine-tuning/backend/app/engines/text_engine.py\", line 280, in train\n    trainer.train()\n  File \"/opt/conda/lib/python3.10/site-packages/transformers/trainer.py\", line 1427, in train\n    return inner_training_loop(\n  File \"/opt/conda/lib/python3.10/site-packages/transformers/trainer.py\", line 1509, in _inner_training_loop\n    self._run_epoch(\n  File \"/opt/conda/lib/python3.10/site-packages/transformers/trainer.py\", line 1737, in _run_epoch\n    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)\n  File \"/opt/conda/lib/python3.10/site-packages/transformers/trainer.py\", line 1909, in training_step\n    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)\n  File \"/opt/conda/lib/python3.10/site-packages/transformers/trainer.py\", line 1981, in compute_loss\n    outputs = model(**inputs)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1773, in _wrapped_call_impl\n    return self._call_impl(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1784, in _call_impl\n    return forward_call(*args, **kwargs)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py\", line 195, in forward\n    return self.gather(outputs, self.output_device)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py\", line 218, in gather\n    return gather(outputs, output_device, dim=self.dim)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py\", line 134, in gather\n    res = gather_map(outputs)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py\", line 126, in gather_map\n    return type(out)((k, gather_map([d[k] for d in outputs])) for k in out)\n  File \"<string>\", line 8, in __init__\n  File \"/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py\", line 451, in __post_init__\n    for idx, element in enumerate(iterator):\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py\", line 126, in <genexpr>\n    return type(out)((k, gather_map([d[k] for d in outputs])) for k in out)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/scatter_gather.py\", line 120, in gather_map\n    return Gather.apply(target_device, dim, *outputs)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py\", line 576, in apply\n    return super().apply(*args, **kwargs)  # type: ignore[misc]\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py\", line 80, in forward\n    return comm.gather(inputs, ctx.dim, ctx.target_device)\n  File \"/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/comm.py\", line 253, in gather\n    return torch._C._gather(tensors, dim, destination)\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 678.00 MiB. GPU 0 has a total capacity of 63.78 GiB of which 0 bytes is free. Of the allocated memory 1.63 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n"}
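
The job log above is one JSON object per record, each with a `ts` and a `type` field. A minimal sketch, assuming that shape, that measures how long the job ran before the error event. The sample reuses the timestamps from the log but shortens the `message` and omits the `traceback` field. The helpers `parse_events` and `seconds_between` are illustrative names.

```python
import json
from datetime import datetime

# Records copied from the job log, with the error payload abbreviated.
SAMPLE_LOG = """\
{"ts": "2026-05-22T02:48:40.394919+00:00", "type": "start", "job_id": "11342ed4-38f4-4ed9-80eb-13c3e6cf27d3"}
{"ts": "2026-05-22T02:48:40.397444+00:00", "type": "status", "status": "preprocessing"}
{"ts": "2026-05-22T02:48:44.505034+00:00", "type": "status", "status": "loading_model"}
{"ts": "2026-05-22T02:49:02.526554+00:00", "type": "status", "status": "training"}
{"ts": "2026-05-22T02:49:25.920686+00:00", "type": "status", "status": "training"}
{"ts": "2026-05-22T02:49:25.922548+00:00", "type": "epoch_begin", "epoch": 0}
{"ts": "2026-05-22T02:56:46.009692+00:00", "type": "error", "message": "CUDA out of memory."}
"""


def parse_events(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def seconds_between(events, first_type, last_type):
    """Seconds from the first event of `first_type` to the first of `last_type`."""
    first = next(e for e in events if e["type"] == first_type)
    last = next(e for e in events if e["type"] == last_type)
    delta = datetime.fromisoformat(last["ts"]) - datetime.fromisoformat(first["ts"])
    return delta.total_seconds()


if __name__ == "__main__":
    events = parse_events(SAMPLE_LOG)
    print("start -> error:      ", seconds_between(events, "start", "error"), "s")
    print("epoch_begin -> error:", seconds_between(events, "epoch_begin", "error"), "s")
```

The gap from `epoch_begin` to `error` is about 440 seconds, which agrees with the `07:20` elapsed time on the final progress bar line.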
(base) [root@localhos