Fix training error

lxylxy123321 1 week ago
parent
commit
b425507f91
6 files changed, 56 additions and 7 deletions
  1. .env (+4 -0)
  2. CLAUDE.md (+16 -0)
  3. backend/app/config.py (+1 -1)
  4. backend/app/engines/text_engine.py (+3 -2)
  5. docker-compose.yml (+1 -0)
  6. result.txt (+31 -4)

+ 4 - 0
.env

@@ -12,3 +12,7 @@ MODELSCOPE_ENDPOINT=https://modelscope.cn
 CUDA_VISIBLE_DEVICES=0
 MAX_MEMORY_PER_GPU=0
 USE_UNSLOTH=false
+
+# Data paths (keep in sync with backend/.env)
+DATA_DIR=/root/Fine-tuning/backend/data
+DATABASE_URL=sqlite+aiosqlite:///root/Fine-tuning/backend/data/finetuning.db
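
A note on the `DATABASE_URL` added above: SQLAlchemy-style SQLite URLs put three slashes before a relative path and four before an absolute one. Read that way, `sqlite+aiosqlite:///root/...` names the relative path `root/...`, which only lands on `/root/...` when the process runs from `/`. The sketch below builds the URL from a path so the slash count follows from whether the path is absolute. The helper name `sqlite_url` is hypothetical and not part of this repository.

```python
from pathlib import PurePosixPath


def sqlite_url(db_path: str, driver: str = "aiosqlite") -> str:
    """Build a SQLAlchemy-style SQLite URL for db_path.

    Hypothetical helper for illustration. The scheme is followed by
    "///" and then the path, so an absolute path (leading "/") yields
    four slashes in total and a relative path yields three.
    """
    path = PurePosixPath(db_path)
    return f"sqlite+{driver}:///{path}"


absolute = sqlite_url("/root/Fine-tuning/backend/data/finetuning.db")
relative = sqlite_url("data/finetuning.db")
print(absolute)
print(relative)
```

If the application always starts from `/`, the three-slash form in this commit resolves to the same file, so this may be harmless in the container as deployed.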

+ 16 - 0
CLAUDE.md

@@ -118,3 +118,19 @@ pending → queued → preprocessing → training → completed
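 - Project path: `/root/Fine-tuning`
 - Data directory: `/root/Fine-tuning/backend/data`
 - Database: `/root/Fine-tuning/backend/data/finetuning.db`
+
+## Language requirements
+
+### Basic rules
+
+1. All everyday conversation, explanations, analysis, summaries, step-by-step instructions, and written answers **must be in Simplified Chinese**.
+2. Code, variable names, function names, commands, configuration keywords, technical terms, original error messages, and JSON/config content **stay in the original English; do not translate them**.
+3. Comments may be written in Chinese to explain the logic clearly.
+4. Do not volunteer small talk or explanations in English; communicate in natural Chinese throughout.
+5. Proposals, root-cause analysis, and step walkthroughs are all written in Chinese; only code and inherently technical text stay in English.
+
+### Output format
+
+- Explanatory text: Simplified Chinese
+- Code blocks, terminal commands, JSON, YAML, error logs: keep the original English unchanged
+- Lists, steps, and conclusions are always written in Chinese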

+ 1 - 1
backend/app/config.py

@@ -21,7 +21,7 @@ class EnvSettingsSourceWithCommaLists(EnvSettingsSource):
 
 class Settings(BaseSettings):
     model_config = SettingsConfigDict(
-        env_file=str(Path(__file__).resolve().parents[2] / ".env"),
+        env_file=str(Path(__file__).resolve().parent.parent / ".env"),
         env_file_encoding="utf-8",
         case_sensitive=False,
         extra="ignore",
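
To see what this one-line change does, assume `config.py` lives at `/root/Fine-tuning/backend/app/config.py` (inferred from the project path recorded in CLAUDE.md, not stated in this hunk). Under that assumption `parents[2]` is the repository root while `parent.parent` is `backend/`, so settings are now read from `backend/.env` rather than the root `.env`. A minimal check:

```python
from pathlib import PurePosixPath

# Assumed location of config.py; the real code uses Path(__file__).resolve().
config_file = PurePosixPath("/root/Fine-tuning/backend/app/config.py")

old_env = config_file.parents[2] / ".env"      # expression before the commit
new_env = config_file.parent.parent / ".env"   # expression after the commit

print(old_env)  # /root/Fine-tuning/.env
print(new_env)  # /root/Fine-tuning/backend/.env
```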

+ 3 - 2
backend/app/engines/text_engine.py

@@ -1,6 +1,6 @@
 import os
 
-# Disable FlashAttention to work around insufficient shared memory on MetaX GPUs
+# Disable FlashAttention and FLA to work around insufficient shared memory on MetaX GPUs
 os.environ["PYTORCH_NO_FLASH"] = "1"
 os.environ["FLASH_ATTENTION_ENABLED"] = "0"
 os.environ["USE_FLASH_ATTENTION"] = "0"
@@ -68,6 +68,7 @@ class TextEngine(BaseEngine):
             "low_cpu_mem_usage": True,
             "use_safetensors": True,
             "max_memory": max_memory,
+            "attn_implementation": "sdpa",
         }
         if quantization == "4bit" or quantization == "qlora":
             load_kwargs["load_in_4bit"] = True
@@ -161,7 +162,7 @@ class TextEngine(BaseEngine):
             remove_unused_columns=False,
             report_to="none",
             gradient_checkpointing=False,
-            dataloader_num_workers=1,
+            dataloader_num_workers=0,
             dataloader_pin_memory=False,
             **({"deepspeed": deepspeed_config} if deepspeed_config else {}),
         )

+ 1 - 0
docker-compose.yml

@@ -24,6 +24,7 @@ services:
     devices:
       - /dev/mxcd:/dev/mxcd
     privileged: true
+    shm_size: "2gb"
     networks:
       - finetune-net
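
`shm_size` sets the size of the container's `/dev/shm` tmpfs, which PyTorch `DataLoader` worker processes use to pass tensors between processes. One way to confirm from inside the container that the setting took effect is sketched below. The function name `shm_total_bytes` is hypothetical, and the path is a parameter so the same check can be pointed at any mount.

```python
import shutil


def shm_total_bytes(path: str = "/dev/shm") -> int:
    """Return the total size in bytes of the filesystem mounted at path."""
    return shutil.disk_usage(path).total


try:
    size = shm_total_bytes()
    print(f"/dev/shm total: {size / 2**30:.2f} GiB")
except FileNotFoundError:
    # /dev/shm only exists on Linux-like systems.
    print("/dev/shm is not available on this system")
```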
 

+ 31 - 4
result.txt

@@ -1,6 +1,33 @@
+INFO:     172.19.0.3:52548 - "POST /api/v1/datasets/download HTTP/1.0" 200 OK
+INFO:     127.0.0.1:46426 - "GET /health HTTP/1.1" 200 OK
+INFO:     172.19.0.3:48310 - "GET /api/v1/models/ HTTP/1.0" 200 OK
+INFO:     172.19.0.3:48320 - "GET /api/v1/datasets/ HTTP/1.0" 200 OK
+INFO:     172.19.0.3:48332 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+2026-05-15 17:24:03 | INFO     | peft-platform | Job 5999c2df-0b6a-4ec2-a99a-9894ef923a85 enqueued
+2026-05-15 17:24:03 | INFO     | peft-platform | Training job created: 5999c2df-0b6a-4ec2-a99a-9894ef923a85
+INFO:     172.19.0.3:48340 - "POST /api/v1/training/jobs HTTP/1.0" 200 OK
+2026-05-15 17:24:03 | INFO     | peft-platform | Preprocessed 60 samples for sft/alpaca
+INFO:     172.19.0.3:48356 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.19.0.3:48362 - "GET /api/v1/models/ HTTP/1.0" 200 OK
+INFO:     172.19.0.3:48360 - "GET /api/v1/datasets/ HTTP/1.0" 200 OK
+2026-05-15 17:24:13 | INFO     | peft-platform | CUDA available: True
+2026-05-15 17:24:13 | INFO     | peft-platform | CUDA device count: 1
+2026-05-15 17:24:13 | INFO     | peft-platform | GPU 0: MetaX N260
+2026-05-15 17:24:13 | INFO     | peft-platform | GPU 0 memory: 63.78 GB
+[transformers] `torch_dtype` is deprecated! Use `dtype` instead!
+2026-05-15 17:24:14 | WARNING  | fla.utils | Current Triton version 3.0.0 is below the recommended 3.2.0 version. Errors may occur and these issues will not be fixed. Please consider upgrading Triton.
+2026-05-15 17:24:14 | WARNING  | fla.utils | Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
+2026-05-15 17:24:20 | WARNING  | fla.ops.rwkv7.fused_addcmul | torch.compile is not available in Python 3.10, using identity decorator instead
+/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
+  warnings.warn(_BETA_TRANSFORMS_WARNING)
+/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
+  warnings.warn(_BETA_TRANSFORMS_WARNING)
+Loading weights: 100%|██████████| 320/320 [00:00<00:00, 382.46it/s]
+2026-05-15 17:24:21 | INFO     | peft-platform | Loaded model: Qwen/Qwen3.5-0.8B
+Map: 100%|██████████| 60/60 [00:00<00:00, 2212.59 examples/s]
+/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:1348: UserWarning: Model has `tie_word_embeddings=True` and a tied layer is part of the adapter, but `ensure_weight_tying` is not set to True. This can lead to complications, for example when merging the adapter or converting your model to formats other than safetensors. Check the discussion here: https://github.com/huggingface/peft/issues/2777
+  warnings.warn(msg)
 [transformers] warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
 trainable params: 5,070,848 || all params: 757,463,872 || trainable%: 0.6695
-  0%|          | 0/12 [00:00<?, ?it/s]/opt/conda/lib/python3.10/site-packages/torch/autograd/graph.py:829: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /workspace/framework/mcPytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:183.)
-  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
-2026-05-15 17:03:20 | ERROR    | peft-platform | Training failed for job bc4d7b3d-6f50-4877-aae7-1a1d0fc16da2: out of resource: shared memory, Required: 106496, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.
-2026-05-15 17:03:20 | ERROR    | peft-platform | Job bc4d7b3d-6f50-4877-aae7-1a1d0fc16da2 failed: out of resource: shared memory, Required: 106496, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.
+  0%|          | 0/12 [00:00<?, ?it/s]2026-05-15 17:27:03 | ERROR    | peft-platform | Training failed for job 5999c2df-0b6a-4ec2-a99a-9894ef923a85: out of resource: shared memory, Required: 106496, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.
+2026-05-15 17:27:03 | ERROR    | peft-platform | Job 5999c2df-0b6a-4ec2-a99a-9894ef923a85 failed: out of resource: shared memory, Required: 106496, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.
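
The run recorded in `result.txt` still ends with the same failure after this commit: 106496 bytes of shared memory required against a limit of 65536. The wording matches Triton's out-of-resources error, which would point at per-block GPU shared memory rather than the container's `/dev/shm`. Treat that reading as an assumption, since the log contains no traceback. The sketch below pulls the two numbers out of such a log line; `parse_shared_memory_error` is a hypothetical helper, not part of the platform.

```python
import re
from typing import Optional, Tuple

_PATTERN = re.compile(
    r"out of resource: shared memory, "
    r"Required: (?P<required>\d+), Hardware limit: (?P<limit>\d+)"
)


def parse_shared_memory_error(line: str) -> Optional[Tuple[int, int]]:
    """Return (required_bytes, limit_bytes) if line holds the error, else None."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return int(match["required"]), int(match["limit"])


# Log line copied from result.txt, split only for readability.
log_line = (
    "2026-05-15 17:27:03 | ERROR    | peft-platform | Job "
    "5999c2df-0b6a-4ec2-a99a-9894ef923a85 failed: out of resource: "
    "shared memory, Required: 106496, Hardware limit: 65536. "
    "Reducing block sizes or `num_stages` may help."
)

required, limit = parse_shared_memory_error(log_line)
print(f"required {required // 1024} KiB, limit {limit // 1024} KiB")
print(f"over budget by {required / limit:.3f}x")
```

For this line the helper reports 104 KiB required against a 64 KiB limit, about 1.625 times the budget.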