Disable gradient_checkpointing

lxylxy123321 1 week ago
parent
commit
effc062485
2 changed files with 5 additions and 17 deletions
  1. backend/app/engines/text_engine.py (+1 -2)
  2. result.txt (+4 -15)

+ 1 - 2
backend/app/engines/text_engine.py

@@ -160,8 +160,7 @@ class TextEngine(BaseEngine):
             optim="adamw_torch",
             remove_unused_columns=False,
             report_to="none",
-            gradient_checkpointing=True,
-            gradient_checkpointing_kwargs={"use_reentrant": False},
+            gradient_checkpointing=False,
             dataloader_num_workers=1,
             dataloader_pin_memory=False,
             **({"deepspeed": deepspeed_config} if deepspeed_config else {}),
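
The hunk above removes `gradient_checkpointing_kwargs` together with the flag and keeps the conditional `deepspeed` entry. A minimal sketch of the same pattern with plain dictionaries follows. `build_training_kwargs` and its parameters are hypothetical names introduced for illustration and are not part of `text_engine.py`. The sketch only assembles the keyword arguments and does not import or call any training library.

```python
from typing import Any, Optional


def build_training_kwargs(
    use_gradient_checkpointing: bool,
    deepspeed_config: Optional[dict] = None,
) -> dict[str, Any]:
    """Assemble trainer keyword arguments (hypothetical helper, not from the repo)."""
    kwargs: dict[str, Any] = {
        "optim": "adamw_torch",
        "remove_unused_columns": False,
        "report_to": "none",
        "gradient_checkpointing": use_gradient_checkpointing,
        "dataloader_num_workers": 1,
        "dataloader_pin_memory": False,
    }
    # Only pass checkpointing options when checkpointing is enabled,
    # mirroring how the commit drops both lines at once.
    if use_gradient_checkpointing:
        kwargs["gradient_checkpointing_kwargs"] = {"use_reentrant": False}
    # Same effect as the `**({...} if deepspeed_config else {})` idiom in the diff:
    # the key is absent, not None, when no config is supplied.
    if deepspeed_config:
        kwargs["deepspeed"] = deepspeed_config
    return kwargs


if __name__ == "__main__":
    print(build_training_kwargs(use_gradient_checkpointing=False))
```

Keeping the toggle in one place makes it easier to re-enable checkpointing later without forgetting the `use_reentrant` option.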

+ 4 - 15
result.txt

@@ -1,17 +1,6 @@
-2026-05-15 15:46:34 | WARNING  | fla.utils | Current Triton version 3.0.0 is below the recommended 3.2.0 version. Errors may occur and these issues will not be fixed. Please consider upgrading Triton.
-2026-05-15 15:46:34 | WARNING  | fla.utils | Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
-2026-05-15 15:46:40 | WARNING  | fla.ops.rwkv7.fused_addcmul | torch.compile is not available in Python 3.10, using identity decorator instead
-/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
-  warnings.warn(_BETA_TRANSFORMS_WARNING)
-/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
-  warnings.warn(_BETA_TRANSFORMS_WARNING)
-Loading weights: 100%|██████████| 320/320 [00:00<00:00, 362.08it/s]
-2026-05-15 15:46:41 | INFO     | peft-platform | Loaded model: Qwen/Qwen3.5-0.8B
-Map: 100%|██████████| 60/60 [00:00<00:00, 2263.26 examples/s]
-/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:1348: UserWarning: Model has `tie_word_embeddings=True` and a tied layer is part of the adapter, but `ensure_weight_tying` is not set to True. This can lead to complications, for example when merging the adapter or converting your model to formats other than safetensors. Check the discussion here: https://github.com/huggingface/peft/issues/2777
-  warnings.warn(msg)
 [transformers] warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
 trainable params: 5,070,848 || all params: 757,463,872 || trainable%: 0.6695
-  0%|          | 0/12 [00:00<?, ?it/s]2026-05-15 15:46:57 | ERROR    | peft-platform | Training failed for job 95169611-8cfc-445f-ab61-dee09ac711c6: '_ProgressCallback' object has no attribute 'on_step_begin'
-2026-05-15 15:46:57 | ERROR    | peft-platform | Job 95169611-8cfc-445f-ab61-dee09ac711c6 failed: '_ProgressCallback' object has no attribute 'on_step_begin'
-INFO:     127.0.0.1:49812 - "GET /health HTTP/1.1" 200 OK
+  0%|          | 0/12 [00:00<?, ?it/s]/opt/conda/lib/python3.10/site-packages/torch/autograd/graph.py:829: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /workspace/framework/mcPytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:183.)
+  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
+2026-05-15 17:03:20 | ERROR    | peft-platform | Training failed for job bc4d7b3d-6f50-4877-aae7-1a1d0fc16da2: out of resource: shared memory, Required: 106496, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.
+2026-05-15 17:03:20 | ERROR    | peft-platform | Job bc4d7b3d-6f50-4877-aae7-1a1d0fc16da2 failed: out of resource: shared memory, Required: 106496, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.
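
The new failure in `result.txt` reports a shared memory request of 106496 bytes against a hardware limit of 65536 bytes. The sketch below extracts those two numbers from the logged message and prints the overrun in KiB. The helper name `parse_shared_memory_error` and the regular expression are assumptions written for this log line only. They are not part of the platform code or of any library.

```python
import re
from typing import Optional, Tuple

# Matches the "Required: N, Hardware limit: M" portion of the logged error.
_SHARED_MEM_RE = re.compile(r"Required: (\d+), Hardware limit: (\d+)")


def parse_shared_memory_error(message: str) -> Optional[Tuple[int, int]]:
    """Return (required_bytes, limit_bytes), or None if the message does not match."""
    match = _SHARED_MEM_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    message = (
        "out of resource: shared memory, Required: 106496, "
        "Hardware limit: 65536. Reducing block sizes or `num_stages` may help."
    )
    parsed = parse_shared_memory_error(message)
    if parsed is not None:
        required, limit = parsed
        over = required - limit
        print(
            f"required {required // 1024} KiB, "
            f"limit {limit // 1024} KiB, "
            f"over by {over // 1024} KiB"
        )
```

For the message in this log, the request is 104 KiB against a 64 KiB limit, so the kernel would need to shed 40 KiB of shared memory to fit.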