@@ -0,0 +1,122 @@
+# Converting PDFs with MinerU on an AutoDL Server
+
+## 1. Enter the working directory
+
+``` bash
+cd autodl-tmp
+```
+
+------------------------------------------------------------------------
+
+## 2. Create the project folder
+
+``` bash
+mkdir mu
+cd mu
+```
+
+This folder holds the virtual environment and the MinerU-related files.
+
+------------------------------------------------------------------------
+
+## 3. Install `uv` (Python environment manager)
+
+``` bash
+pip install uv
+```
+
+`uv` creates Python virtual environments and manages dependencies quickly.
+
+------------------------------------------------------------------------
+
+## 4. Create the virtual environment
+
+``` bash
+uv venv
+```
+
+This creates the `.venv` virtual environment directory inside the current folder (`mu/`).
+
+------------------------------------------------------------------------
+
+## 5. Activate the virtual environment
+
+``` bash
+source .venv/bin/activate
+```
+
+After activation the prompt gains an environment prefix. By default `uv venv` names it after the current directory (here `mu`), so the terminal looks similar to:
+
+    (mu) root@xxx:~/autodl-tmp/mu#
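
To confirm that activation took effect, one option is to inspect the `VIRTUAL_ENV` variable, which the `activate` script exports. This is a minimal sketch; the helper name `venv_status` is made up for this illustration:

```shell
# Print whether a virtual environment is active in the current shell.
# The activate script exports VIRTUAL_ENV; it is unset or empty otherwise.
venv_status() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "active: ${VIRTUAL_ENV}"
    else
        echo "inactive"
    fi
}

venv_status
```

Inside the activated environment this should print the path to `.venv`; in a shell without an active environment it prints `inactive`.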
+
+------------------------------------------------------------------------
+
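+## 6. Install MinerU
+
+Use the **Alibaba Cloud (Aliyun) PyPI mirror** to speed up installation:
+
+``` bash
+uv pip install "mineru[core,lmdeploy]==2.6.8" --system --index-url https://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
+```
+
+Notes:
+
+- `mineru[core,lmdeploy]`: installs the core features plus `lmdeploy` inference support
+- `==2.6.8`: pins the MinerU version
+- `--system`: installs into the system Python rather than the `.venv` activated in step 5; omit it if the packages should go into the virtual environment
+- `--index-url`: uses the Aliyun mirror as the package index
+- `--trusted-host`: treats the mirror host as trusted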
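
A quick way to check that the tools from steps 3 and 6 ended up on `PATH` is sketched below. `check_cmd` is a hypothetical helper written for this guide, not part of `uv` or MinerU:

```shell
# Report whether a command can be found on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

check_cmd uv
check_cmd mineru
```

If either line reports `missing`, the corresponding install step most likely targeted a different Python environment than the one the current shell is using.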
+
+------------------------------------------------------------------------
+
+## 7. Run the PDF conversion
+
+``` bash
+mineru -p /root/autodl-fs/142 -o /root/autodl-fs/142output --source modelscope
+```
+
+Parameters:
+
+  Parameter               Description
+  ----------------------- ----------------------------------------
+  `-p`                    Directory containing the input PDF files
+  `-o`                    Directory for the conversion output
+  `--source modelscope`   Download models from ModelScope
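
Before starting a long conversion it can help to verify that the input directory actually contains PDFs. The sketch below assumes the paths from the command above, counts only files directly inside the directory, and relies on `find -maxdepth` (available in GNU and BSD `find`). `count_pdfs`, `INPUT_DIR`, and `OUTPUT_DIR` are illustrative names, not MinerU options:

```shell
# Count the PDF files directly inside a directory (non-recursive, case-insensitive).
count_pdfs() {
    find "$1" -maxdepth 1 -type f -iname '*.pdf' | wc -l | tr -d ' '
}

# Paths from the guide; override them through the environment if needed.
INPUT_DIR="${INPUT_DIR:-/root/autodl-fs/142}"
OUTPUT_DIR="${OUTPUT_DIR:-/root/autodl-fs/142output}"

if [ -d "$INPUT_DIR" ] && [ "$(count_pdfs "$INPUT_DIR")" -gt 0 ]; then
    echo "Converting $(count_pdfs "$INPUT_DIR") PDF(s) from $INPUT_DIR"
    mineru -p "$INPUT_DIR" -o "$OUTPUT_DIR" --source modelscope
else
    echo "No PDF files found in $INPUT_DIR" >&2
fi
```

The guard keeps `mineru` from being started against an empty or mistyped path.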
+
+------------------------------------------------------------------------
+
+## 8. Example directory layout
+
+    autodl-tmp/
+     └── mu/
+          ├── .venv/
+          └── (MinerU environment)
+
+    Input PDFs:
+    /root/autodl-fs/142
+
+    Output:
+    /root/autodl-fs/142output
+
+------------------------------------------------------------------------
+
+## 9. Results after the run
+
+After the conversion finishes:
+
+- Each PDF is parsed into a **structured document**
+- The output typically includes Markdown, JSON, and image files; the layout below is illustrative, and the exact file and folder names depend on the MinerU version and backend:
+
+      142output/
+       ├── markdown/
+       ├── json/
+       ├── images/
+       └── logs/
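
To see which Markdown files a run produced without relying on a particular folder layout, a recursive search is one option. This is a sketch under that assumption; `list_markdown` and `OUTPUT_DIR` are illustrative names:

```shell
# List every Markdown file below a directory, sorted by path.
list_markdown() {
    find "$1" -type f -name '*.md' | sort
}

# Output path from the guide; override it through the environment if needed.
OUTPUT_DIR="${OUTPUT_DIR:-/root/autodl-fs/142output}"

if [ -d "$OUTPUT_DIR" ]; then
    list_markdown "$OUTPUT_DIR"
else
    echo "Output directory not found: $OUTPUT_DIR" >&2
fi
```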
+
+Common uses:
+
+- Building RAG knowledge bases
+- Document parsing
+- Extracting the structure of papers
+- Markdown knowledge notes