
Fix PPO failing to parse any data

lxylxy123321 22 hours ago
Parent
Current commit
3ed003eaed
2 changed files with 68 additions and 7 deletions
  1. 2 0
      backend/app/preprocessors/__init__.py
  2. 66 7
      result.txt

+ 2 - 0
backend/app/preprocessors/__init__.py

@@ -134,6 +134,8 @@ TEMPLATE_MAP = {
     },
     "ppo": {
         "auto": apply_raw_template,
+        "alpaca": apply_alpaca_template,
+        "sharegpt": apply_sharegpt_template,
         "raw": apply_raw_template,
     },
 }
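
Note on the hunk above: a minimal sketch of why a missing format key could yield zero samples, assuming the preprocessor resolves a template by task and format and returns nothing when the format is not registered. The `preprocess` function and the bodies of the two template functions are hypothetical stand-ins; only the names in `TEMPLATE_MAP` come from this commit.

```python
from typing import Callable, Dict, List, Optional

Sample = Dict[str, str]
Template = Callable[[Sample], Optional[Dict[str, str]]]


def apply_raw_template(sample: Sample) -> Optional[Dict[str, str]]:
    # Stand-in body (assumption): accept samples that already carry a "prompt".
    prompt = sample.get("prompt")
    return {"prompt": prompt} if prompt else None


def apply_alpaca_template(sample: Sample) -> Optional[Dict[str, str]]:
    # Stand-in body (assumption): build a prompt from Alpaca-style fields.
    instruction = sample.get("instruction")
    if not instruction:
        return None
    extra = sample.get("input", "")
    return {"prompt": f"{instruction}\n{extra}".strip()}


def preprocess(
    samples: List[Sample],
    task: str,
    fmt: str,
    template_map: Dict[str, Dict[str, Template]],
) -> List[Dict[str, str]]:
    # Hypothetical lookup: an unregistered format produces no samples at all.
    template = template_map.get(task, {}).get(fmt)
    if template is None:
        return []
    converted = (template(sample) for sample in samples)
    return [item for item in converted if item is not None]


# Mapping shape before and after this commit (sharegpt omitted for brevity).
BEFORE = {"ppo": {"auto": apply_raw_template, "raw": apply_raw_template}}
AFTER = {
    "ppo": {
        "auto": apply_raw_template,
        "alpaca": apply_alpaca_template,
        "raw": apply_raw_template,
    }
}

data = [{"instruction": "Translate to French", "input": "Hello"}]
print(len(preprocess(data, "ppo", "alpaca", BEFORE)))  # 0
print(len(preprocess(data, "ppo", "alpaca", AFTER)))   # 1
```

Under these assumptions, registering the `alpaca` and `sharegpt` keys for `ppo` is what lets an Alpaca-formatted dataset produce samples instead of an empty list.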

+ 66 - 7
result.txt

@@ -1,7 +1,66 @@
-(base) [root@localhost ~]# docker exec finetune-trainer /opt/conda/bin/python -c 'from trl.experimental.ppo import PPOTrainer; print([m for m in dir(PPOTrainer) if not m.startswith("_")])'<string>:1: TRLExperimentalWarning: You are importing from 'trl.experimental'. APIs here are unstable and may change or be removed without notice. Silence this warning by setting environment variable TRL_EXPERIMENTAL_SILENCE=1.
-/opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
-  warnings.warn(_BETA_TRANSFORMS_WARNING)
-/opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
-  warnings.warn(_BETA_TRANSFORMS_WARNING)
-['add_callback', 'autocast_smart_context_manager', 'call_model_init', 'compute_loss', 'compute_loss_context_manager', 'create_accelerator_and_postprocess', 'create_model_card', 'create_optimizer', 'create_optimizer_and_scheduler', 'create_scheduler', 'evaluate', 'evaluation_loop', 'floating_point_ops', 'generate_completions', 'get_batch_samples', 'get_cp_size', 'get_decay_parameter_names', 'get_eval_dataloader', 'get_learning_rates', 'get_num_trainable_parameters', 'get_optimizer_cls_and_kwargs', 'get_optimizer_group', 'get_sp_size', 'get_test_dataloader', 'get_total_train_batch_size', 'get_tp_size', 'get_train_dataloader', 'hyperparameter_search', 'init_hf_repo', 'is_local_process_zero', 'is_world_process_zero', 'log', 'log_metrics', 'metrics_format', 'null_ref_context', 'num_examples', 'pop_callback', 'predict', 'prediction_step', 'push_to_hub', 'remove_callback', 'save_metrics', 'save_model', 'save_state', 'set_initial_training_values', 'store_flos', 'train', 'training_step']
-(base) [root@localhost ~]# 
+INFO:     172.20.0.4:39360 - "POST /api/oauth/exchange-code HTTP/1.0" 200 OK
+INFO:     172.20.0.4:39364 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:39378 - "GET /api/v1/datasets/ HTTP/1.0" 200 OK
+INFO:     172.20.0.4:39394 - "GET /api/v1/models/ HTTP/1.0" 200 OK
+INFO:     172.20.0.4:50946 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:50944 - "GET /api/v1/models/ HTTP/1.0" 200 OK
+INFO:     172.20.0.4:50952 - "GET /api/v1/datasets/ HTTP/1.0" 200 OK
+INFO:     172.20.0.4:50958 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+2026-05-27 02:30:41 | INFO     | peft-platform | Training job 4fd86f1d-3f2f-48ac-92a4-8e236159d1cf: num_gpus=1, batch_size=16
+2026-05-27 02:30:41 | INFO     | peft-platform | Job 4fd86f1d-3f2f-48ac-92a4-8e236159d1cf enqueued
+2026-05-27 02:30:41 | INFO     | peft-platform | Training job created: 4fd86f1d-3f2f-48ac-92a4-8e236159d1cf
+INFO:     172.20.0.4:50972 - "POST /api/v1/training/jobs HTTP/1.0" 200 OK
+2026-05-27 02:30:41 | INFO     | app.engines.text_engine | Preprocessed 0 samples for ppo/alpaca
+INFO:     172.20.0.4:50998 - "GET /api/v1/models/ HTTP/1.0" 200 OK
+INFO:     172.20.0.4:50984 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:51000 - "GET /api/v1/datasets/ HTTP/1.0" 200 OK
+INFO:     172.20.0.4:51012 - "WebSocket /ws/training/4fd86f1d-3f2f-48ac-92a4-8e236159d1cf?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJhZjgyN2IxZC0wM2IxLTQwZGMtOTliMC1jOGRjYTEzNWEwNmUiLCJ1c2VybmFtZSI6InN1cGVyX2FkbWluIiwicm9sZXMiOlsic3VwZXJfYWRtaW4iXSwiZXhwIjoxNzc5ODUwMjMzLCJpYXQiOjE3Nzk4NDkwMzMsInR5cGUiOiJhY2Nlc3MifQ.WvY2rgy_lvYhdR4UGaXA6x1X5MiMFvWKwqk3JzQdpOY" [accepted]
+2026-05-27 02:30:41 | INFO     | peft-platform | 客户端已连接到训练 WebSocket (job 4fd86f1d-3f2f-48ac-92a4-8e236159d1cf)
+INFO:     connection open
+INFO:     172.20.0.4:35710 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:35720 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:43638 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     127.0.0.1:40052 - "GET /health HTTP/1.1" 200 OK
+INFO:     172.20.0.4:43646 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:59604 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+2026-05-27 02:31:07 | INFO     | peft-platform | Remote cleanup result: true
+cleaned 147 processes
+2026-05-27 02:32:00 | INFO     | peft-platform | Created remote dataset directory: /root/Fine-tuning/backend/data/datasets
+2026-05-27 02:32:00 | INFO     | peft-platform | Uploading dataset file: /root/Fine-tuning/backend/data/uploads/ppo_sample.jsonl -> /root/Fine-tuning/backend/data/datasets/ppo_sample.jsonl
+2026-05-27 02:32:18 | INFO     | peft-platform | Dataset uploaded successfully: /root/Fine-tuning/backend/data/datasets/ppo_sample.jsonl
+2026-05-27 02:32:53 | INFO     | peft-platform | Remote training launched in container: job=4fd86f1d-3f2f-48ac-92a4-8e236159d1cf, container_pid=26886
+INFO:     127.0.0.1:57260 - "GET /health HTTP/1.1" 200 OK
+INFO:     127.0.0.1:59094 - "GET /health HTTP/1.1" 200 OK
+INFO:     127.0.0.1:55910 - "GET /health HTTP/1.1" 200 OK
+INFO:     172.20.0.4:37264 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:59616 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:37248 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42048 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:45268 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42050 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42172 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42170 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42188 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42186 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42194 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42198 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42202 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42218 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42234 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42252 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42262 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42246 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42270 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42272 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:42284 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     172.20.0.4:44220 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     127.0.0.1:52000 - "GET /health HTTP/1.1" 200 OK
+2026-05-27 02:33:46 | ERROR    | peft-platform | Remote job 4fd86f1d-3f2f-48ac-92a4-8e236159d1cf failed: num_samples should be a positive integer value, but got num_samples=0
+INFO:     172.20.0.4:51606 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+INFO:     127.0.0.1:54618 - "GET /health HTTP/1.1" 200 OK
+INFO:     172.20.0.4:47416 - "GET /api/v1/training/jobs HTTP/1.0" 200 OK
+2026-05-27 02:33:56 | ERROR    | peft-platform | SSH command timeout after 10s: docker exec finetune-trainer bash -c 'kill -9 26886 2>/dev/null; pkill -9 -P 26886 2>/dev/null'
+2026-05-27 02:33:56 | INFO     | peft-platform | Killed remote process 26886 via docker exec
+2026-05-27 02:33:56 | INFO     | peft-platform | Remote training launched for job 4fd86f1d-3f2f-48ac-92a4-8e236159d1cf
+2026-05-27 02:33:56 | INFO     | peft-platform | 客户端已从训练 WebSocket 断开 (job 4fd86f1d-3f2f-48ac-92a4-8e236159d1cf)
+INFO:     connection closed
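
Note on the log above: the job appears to fail at 02:33:46 with `num_samples=0` roughly three minutes after the "Preprocessed 0 samples for ppo/alpaca" line, once the dataset upload and container launch had already happened. The following is a minimal sketch, not code from this repository, of a fail-fast check that could reject an empty preprocessing result before a job is enqueued. `EmptyDatasetError` and `ensure_non_empty` are hypothetical names.

```python
from typing import Dict, List


class EmptyDatasetError(ValueError):
    """Raised when preprocessing yields no usable training samples."""


def ensure_non_empty(
    samples: List[Dict[str, str]], task: str, fmt: str
) -> List[Dict[str, str]]:
    # Hypothetical guard: stop before upload and launch if nothing was parsed.
    if not samples:
        raise EmptyDatasetError(
            f"No samples produced for {task}/{fmt}; "
            "check that the dataset format has a registered template"
        )
    return samples


try:
    ensure_non_empty([], "ppo", "alpaca")
except EmptyDatasetError as err:
    print(f"rejected: {err}")
```

A check of this kind would surface the problem at job creation time with a message that names the task and format, instead of as a sampler error from inside the remote container.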