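Full architecture refactor: introduce a layered architecture and a highly extensible plugin system

Backend refactor:
- Add a layered architecture: API Routes -> Services -> Repositories -> Infrastructure
- Remove global singletons entirely; use FastAPI dependency injection throughout
- Add api/ directory with routes split by resource (proxies, plugins, scheduler, settings, stats)
- Add services/ business-logic layer: ProxyService, PluginService, SchedulerService, ValidatorService, SettingsService
- Add repositories/ data-access layer: ProxyRepository, SettingsRepository, PluginSettingsRepository
- Add models/ layer: Pydantic Schemas + Domain Models
- Rewrite core/config.py: manage configuration with Pydantic Settings
- Add core/db.py: asynccontextmanager-based connection management, with database migration support
- Add core/exceptions.py: unified business exception hierarchy

Plugin system refactor (core):
- Add core/plugin_system/: BaseCrawlerPlugin + PluginRegistry
- Use explicit registration (decorator + plugins/__init__.py): type-safe and test-friendly
- Add plugins/base.py: BaseHTTPPlugin, a general-purpose HTTP crawler base class
- Migrate all 7 plugins to the new architecture (fate0, proxylist_download, ip3366, ip89, kuaidaili, speedx, yundaili)
- Persist plugin state to the plugin_settings table

Task scheduling refactor:
- Add core/tasks/queue.py: ValidationQueue + WorkerPool
- Decouple crawling from validation: crawlers only crawl; proxies are submitted to the queue and validated asynchronously by workers
- The scheduler periodically pulls stored proxies from the database and feeds them to the validation queue in batches

Frontend changes:
- Add frontend/src/services/ layer to split out API call logic
- Update stores/ and views/ to use the Service layer
- Keep the API compatible; pages need no major changes

Other:
- Add main.py as the new entry point
- Add DESIGN.md architecture design document
- Update requirements.txt to add pydantic-settings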
2026-04-02 11:55:05 +08:00
parent a79f78b338
commit 209a744d94
56 changed files with 2891 additions and 2095 deletions
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -0,0 +1,470 @@
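+# ProxyPool Architecture Refactoring Design Document
+
+> Goal: build a proxy pool system that is highly extensible, cleanly layered, and easy to maintain. The most important goal is to **make adding a new crawler extremely simple**.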
+
+---
+
+## 1. Architecture Overview
+
+The system uses a classic layered architecture:
+
+```
+┌─────────────────────────────────────────┐
+│  Frontend (Vue3 + Vite + Element Plus)  │
+└─────────────┬───────────────────────────┘
+              │ HTTP/REST
+┌─────────────▼───────────────────────────┐
+│  API Layer (FastAPI Routers)            │  ← Only: validate input, call Services, format output
+├─────────────────────────────────────────┤
+│  Service Layer                          │  ← Business logic orchestration: crawl strategy, validation scheduling, export logic
+├─────────────────────────────────────────┤
+│  Plugin System (Crawlers)               │  ← Crawler plugins: implement a common interface, return raw proxy data
+├─────────────────────────────────────────┤
+│  Task Queue & Workers                   │  ← Validation queue: backpressure control, worker pool, smoothing of load spikes
+├─────────────────────────────────────────┤
+│  Repository Layer                       │  ← Data access: all SQL lives here
+├─────────────────────────────────────────┤
+│  Infrastructure (DB / Config / Log)     │  ← Infrastructure: connection pool, config, logging
+└─────────────────────────────────────────┘
+```
+
+---
+
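+## 2. Core Backend Design Principles
+
+### 2.1 Eliminate Global Singletons; Use Dependency Injection (DI) Throughout
+Today `scheduler = ValidationScheduler()` is a module-level global, which makes testing hard and introduces implicit dependencies.
+
+After the refactor:
+- All core components (DB, Scheduler, PluginManager) are injected through FastAPI `Depends`
+- They are initialized in the lifespan handler (built with `contextlib.asynccontextmanager`) and attached to `app.state`
+- Unit tests can easily mock any layer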
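
The testing benefit can be illustrated without any framework. The sketch below is a minimal, framework-free illustration, not the project's actual wiring: `ProxyService.count_usable`, `FakeProxyRepo`, and `list_scores` are hypothetical names, and in the real application the dependency would be supplied through FastAPI `Depends` rather than passed by hand.

```python
import asyncio
from typing import Protocol


class ProxyRepo(Protocol):
    """Hypothetical repository interface the service depends on."""

    async def list_scores(self) -> list[int]: ...


class ProxyService:
    """Receives its dependency explicitly instead of importing a global."""

    def __init__(self, repo: ProxyRepo) -> None:
        self.repo = repo

    async def count_usable(self, min_score: int = 1) -> int:
        scores = await self.repo.list_scores()
        return sum(1 for s in scores if s >= min_score)


class FakeProxyRepo:
    """In-memory stand-in that a unit test can inject."""

    def __init__(self, scores: list[int]) -> None:
        self._scores = scores

    async def list_scores(self) -> list[int]:
        return list(self._scores)


service = ProxyService(FakeProxyRepo([10, 0, 5]))
usable = asyncio.run(service.count_usable())
```

Because the service never reaches for a module-level object, swapping the fake for a real repository changes only the line that constructs it.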
+
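+### 2.2 Repository Pattern: All SQL in One Place
+All database operations are moved out of `api_server.py` and `scheduler.py` into `repositories/proxy_repo.py`.
+
+Benefits:
+- Switching databases only requires changing the Repository
+- Unit tests can mock the Repository directly
+- SQL statements are managed centrally instead of being scattered across the codebase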
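
As a rough sketch of what "all SQL in one place" could look like, the class below keeps every statement for a trimmed-down `proxies` table behind three methods. It is an assumption-laden illustration: it uses the synchronous stdlib `sqlite3` module for brevity (the project uses aiosqlite), and the method names `create_table`, `add`, and `count` are hypothetical.

```python
import sqlite3


class ProxyRepository:
    """All SQL for the proxies table lives here (sync sqlite3 for brevity)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def create_table(self) -> None:
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS proxies ("
            " ip TEXT NOT NULL, port INTEGER NOT NULL,"
            " protocol TEXT DEFAULT 'http', score INTEGER DEFAULT 10,"
            " UNIQUE(ip, port))"
        )

    def add(self, ip: str, port: int, protocol: str = "http") -> None:
        # INSERT OR IGNORE keeps the existing row when (ip, port) is already stored.
        self.conn.execute(
            "INSERT OR IGNORE INTO proxies (ip, port, protocol) VALUES (?, ?, ?)",
            (ip, port, protocol),
        )
        self.conn.commit()

    def count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM proxies").fetchone()[0]


repo = ProxyRepository(sqlite3.connect(":memory:"))
repo.create_table()
repo.add("1.2.3.4", 8080)
repo.add("1.2.3.4", 8080)  # duplicate, ignored by the UNIQUE constraint
```

Callers only see `add` and `count`, so a service can be tested against a fake object with the same two methods.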
+
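+### 2.3 A Task Queue Decouples Crawling from Validation
+Today a plugin validates right after crawling with `asyncio.gather(*10000_tasks)`, which risks excessive memory use and uncontrolled concurrency.
+
+The refactor introduces a lightweight in-memory queue:
+- `ValidationQueue`: built on `asyncio.Queue`
+- `ValidationWorkerPool`: a fixed number of workers consume from the queue
+- Crawl results are `put` into the queue and the call returns; validation runs in the background
+- With a bounded queue (`maxsize` set), backpressure comes for free and memory growth is capped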
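
The backpressure behaviour can be observed directly with a bounded `asyncio.Queue`. The following is a small demonstration, not project code; the queue size of 2 and the `producer`/`consumer` helpers are chosen only for illustration.

```python
import asyncio


async def demo_backpressure() -> tuple[int, list[str]]:
    events: list[str] = []
    queue: asyncio.Queue[int] = asyncio.Queue(maxsize=2)  # bounded on purpose

    async def producer() -> None:
        for i in range(4):
            await queue.put(i)  # suspends here while the queue is full
            events.append(f"put {i}")

    async def consumer() -> None:
        for _ in range(4):
            item = await queue.get()
            events.append(f"got {item}")
            queue.task_done()

    prod = asyncio.create_task(producer())
    await asyncio.sleep(0)        # let the producer run until it has to wait
    puts_before_blocking = len(events)
    await asyncio.gather(prod, consumer())
    return puts_before_blocking, events


puts_before_blocking, events = asyncio.run(demo_backpressure())
```

The producer manages exactly `maxsize` puts before it has to wait for the consumer, which is the property that keeps a fast crawler from flooding memory. An `asyncio.Queue()` created without `maxsize` never blocks on `put` and therefore gives no such protection.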
+
+---
+
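+## 3. Plugin System Design (Core)
+
+### 3.1 Design Goals
+**Adding a new crawler should take only two things:**
+1. Create a class that inherits from `BaseCrawlerPlugin`
+2. Implement the `crawl()` method, returning `list[ProxyRaw]`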
+
+### 3.2 Plugin Interface
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class ProxyRaw:
+    ip: str
+    port: int
+    protocol: str  # http | https | socks4 | socks5
+
+class BaseCrawlerPlugin:
+    """Base class that every crawler plugin must inherit from."""
+
+    name: str = ""           # unique plugin identifier
+    display_name: str = ""   # name shown in the UI
+    description: str = ""    # description
+    enabled: bool = True     # whether the plugin is enabled by default
+
+    async def crawl(self) -> list[ProxyRaw]:
+        """
+        Core method that crawls proxies.
+        The body may be purely synchronous logic or may await async HTTP requests.
+        Return the raw proxy list; do not validate here.
+        """
+        raise NotImplementedError
+
+    async def health_check(self) -> bool:
+        """Optional: check whether the plugin is usable (e.g. the target site is reachable)."""
+        return True
+```
+
+### 3.3 Plugin Registration
+Plugins use **explicit registration with a decorator**; runtime directory scanning is dropped.
+
+```python
+from core.plugin_system import BaseCrawlerPlugin, ProxyRaw, registry
+
+@registry.register
+class MyNewPlugin(BaseCrawlerPlugin):
+    name = "my_new_plugin"
+    display_name = "My New Proxy Source"
+
+    async def crawl(self) -> list[ProxyRaw]:
+        return [ProxyRaw("1.2.3.4", 8080, "http")]
+```
+
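+Advantages:
+- Type-safe: IDEs can autocomplete and run static checks
+- Controlled: no unexpected modules get loaded by accident
+- Test-friendly: tests register only mock plugins
+
+A compatibility entry point, `registry.auto_discover("plugins")`, is also provided for existing workflows.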
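
A registry that supports the decorator usage shown above could be as small as the sketch below. This is a guess at the shape, not the contents of `core/plugin_system/registry.py`: the `names()` helper and the duplicate-name check are assumptions, and `ProxyRaw`/`BaseCrawlerPlugin` are redefined in trimmed form only to keep the block self-contained.

```python
from dataclasses import dataclass


@dataclass
class ProxyRaw:
    ip: str
    port: int
    protocol: str


class BaseCrawlerPlugin:
    name: str = ""

    async def crawl(self) -> list[ProxyRaw]:
        raise NotImplementedError


class PluginRegistry:
    """Hypothetical registry: maps plugin name -> plugin class."""

    def __init__(self) -> None:
        self._plugins: dict[str, type[BaseCrawlerPlugin]] = {}

    def register(self, cls: type[BaseCrawlerPlugin]) -> type[BaseCrawlerPlugin]:
        # Returning cls lets this method double as a class decorator.
        if not cls.name:
            raise ValueError(f"{cls.__name__} must define a non-empty name")
        if cls.name in self._plugins:
            raise ValueError(f"duplicate plugin name: {cls.name}")
        self._plugins[cls.name] = cls
        return cls

    def names(self) -> list[str]:
        return sorted(self._plugins)


registry = PluginRegistry()


@registry.register
class DemoPlugin(BaseCrawlerPlugin):
    name = "demo"

    async def crawl(self) -> list[ProxyRaw]:
        return [ProxyRaw("1.2.3.4", 8080, "http")]
```

Since `register` returns the class unchanged, it works both as `@registry.register` and as a plain call such as `registry.register(DemoPlugin)`. A test can build its own `PluginRegistry()` and register only mock plugins.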
+
+### 3.4 Persisting Plugin Metadata
+A plugin's `enabled` state should be persisted to the database (or a settings JSON) rather than living only in memory.
+
+New `plugin_settings` table:
+```sql
+CREATE TABLE plugin_settings (
+    plugin_id TEXT PRIMARY KEY,
+    enabled INTEGER DEFAULT 1,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+```
+
+At startup:
+1. Load all registered plugins
+2. Read the persisted state from `plugin_settings`
+3. Merge it into the plugin instances
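
The three startup steps might look roughly like this. The sketch uses the stdlib `sqlite3` module and an in-memory database for brevity; `load_enabled_states` and `apply_states` are hypothetical helper names, and the rule that a plugin without a stored row keeps its class default is an assumption.

```python
import sqlite3


class DemoPlugin:
    name = "demo"
    enabled = True  # class-level default


def load_enabled_states(conn: sqlite3.Connection) -> dict[str, bool]:
    rows = conn.execute("SELECT plugin_id, enabled FROM plugin_settings")
    return {plugin_id: bool(enabled) for plugin_id, enabled in rows}


def apply_states(plugins: list, states: dict[str, bool]) -> None:
    for plugin in plugins:
        # Plugins without a stored row keep their class-level default.
        if plugin.name in states:
            plugin.enabled = states[plugin.name]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plugin_settings (plugin_id TEXT PRIMARY KEY, enabled INTEGER DEFAULT 1)"
)
conn.execute("INSERT INTO plugin_settings (plugin_id, enabled) VALUES ('demo', 0)")

plugin = DemoPlugin()
apply_states([plugin], load_enabled_states(conn))
```

The persisted value overrides the instance attribute only, so the class default stays available for plugins that have never been toggled.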
+
+---
+
+## 4. Task Scheduling and the Validation Queue
+
+### 4.1 Validation Queue Design
+
+```python
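+import asyncio
+import logging
+
+logger = logging.getLogger(__name__)
+
+class ValidationQueue:
+    def __init__(self, worker_count: int = 50, maxsize: int = 1000):
+        # A bounded queue is what provides backpressure: put() waits when it is full.
+        # maxsize=1000 is an illustrative default, not a tuned value.
+        self.queue: asyncio.Queue[ProxyRaw | None] = asyncio.Queue(maxsize=maxsize)
+        self.worker_count = worker_count
+        self.workers: list[asyncio.Task] = []
+        self._running = False
+
+    async def start(self):
+        self._running = True
+        for _ in range(self.worker_count):
+            self.workers.append(asyncio.create_task(self._worker_loop()))
+
+    async def stop(self):
+        self._running = False
+        for _ in self.workers:
+            await self.queue.put(None)  # sentinel: one per worker
+        await asyncio.gather(*self.workers, return_exceptions=True)
+        self.workers.clear()
+
+    async def submit(self, proxies: list[ProxyRaw]):
+        for p in proxies:
+            await self.queue.put(p)
+
+    async def _worker_loop(self):
+        while True:
+            item = await self.queue.get()
+            try:
+                if item is None:
+                    break
+                # _validate_and_save (not shown) checks the proxy and stores the result.
+                await self._validate_and_save(item)
+            except Exception:
+                # One failing proxy must not kill the worker.
+                logger.exception("validation failed for %s", item)
+            finally:
+                self.queue.task_done()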
+```
+
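+### 4.2 Scheduler Design
+`SchedulerService` is responsible for:
+- Starting and stopping the validation queue
+- Periodically pulling stored proxies from the database and re-submitting them to the validation queue
+- Coordinating validation after plugin crawls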
+
+```python
+class SchedulerService:
+    def __init__(self, queue: ValidationQueue, proxy_repo: ProxyRepository):
+        self.queue = queue
+        self.proxy_repo = proxy_repo
+        self.interval_minutes = 30
+        self._task: asyncio.Task | None = None
+```
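
The periodic re-validation described above could be sketched as follows. Everything here is illustrative: `revalidate_once`, `run_forever`, `chunked`, the repository method `list_all`, and the batch size are assumed names and values, and `FakeRepo`/`FakeQueue` stand in for the real `ProxyRepository` and `ValidationQueue`.

```python
import asyncio


def chunked(items: list, size: int) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


class FakeRepo:
    async def list_all(self) -> list[str]:  # hypothetical repository method
        return [f"10.0.0.{i}:8080" for i in range(5)]


class FakeQueue:
    def __init__(self) -> None:
        self.batches: list[list[str]] = []

    async def submit(self, proxies: list[str]) -> None:
        self.batches.append(proxies)


async def revalidate_once(repo, queue, batch_size: int) -> int:
    """One scheduler tick: load stored proxies and enqueue them batch by batch."""
    proxies = await repo.list_all()
    for batch in chunked(proxies, batch_size):
        await queue.submit(batch)
    return len(proxies)


async def run_forever(repo, queue, interval_minutes: int = 30) -> None:
    # The periodic loop a SchedulerService could wrap in an asyncio.Task.
    while True:
        await revalidate_once(repo, queue, batch_size=100)
        await asyncio.sleep(interval_minutes * 60)


queue = FakeQueue()
total = asyncio.run(revalidate_once(FakeRepo(), queue, batch_size=2))
```

Keeping a single tick in its own coroutine makes it testable without waiting for the 30-minute interval; stopping the scheduler then amounts to cancelling the task that runs `run_forever`.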
+
+---
+
+## 5. Database Design
+
+SQLite + aiosqlite is kept, with improved connection management.
+
+### 5.1 Table Schema
+
+```sql
+-- Proxy table
+CREATE TABLE proxies (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    ip TEXT NOT NULL,
+    port INTEGER NOT NULL,
+    protocol TEXT DEFAULT 'http',
+    score INTEGER DEFAULT 10,
+    response_time_ms REAL,
+    last_check TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    UNIQUE(ip, port)
+);
+
+-- Plugin settings table
+CREATE TABLE plugin_settings (
+    plugin_id TEXT PRIMARY KEY,
+    enabled INTEGER DEFAULT 1,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+-- System settings table (values stored as JSON)
+CREATE TABLE settings (
+    key TEXT PRIMARY KEY,
+    value TEXT NOT NULL,
+    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+```
+
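+### 5.2 Connection Management
+- Use `asynccontextmanager` to manage the connection lifecycle
+- Each HTTP request acquires its own connection and closes it when the request ends
+- Long-lived components such as the scheduler and queue also rebuild their connection periodically (e.g. every 1000 operations)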
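
A connection helper built on `asynccontextmanager` might look like the sketch below. It is an approximation under stated assumptions: the stdlib `sqlite3` module stands in for aiosqlite so the block runs without third-party packages, and `get_connection` plus the commit-on-success, rollback-on-error policy are illustrative choices rather than the contents of `core/db.py`.

```python
import asyncio
import sqlite3
from contextlib import asynccontextmanager
from typing import AsyncIterator


@asynccontextmanager
async def get_connection(path: str = ":memory:") -> AsyncIterator[sqlite3.Connection]:
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()      # commit only if the body finished without raising
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()       # always release the connection


async def demo() -> int:
    async with get_connection() as conn:
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (41), (1)")
        return conn.execute("SELECT SUM(x) FROM t").fetchone()[0]


result = asyncio.run(demo())
```

A request handler would wrap its work in `async with get_connection(...)`, which gives the "one connection per request, closed at the end" behaviour listed above.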
+
+---
+
+## 6. API Design Changes
+
+Existing API paths stay largely unchanged, but routes are split by resource.
+
+### 6.1 Route Split
+```
+api/routes/
+├── __init__.py
+├── proxies.py      # /api/proxies/*
+├── plugins.py      # /api/plugins/*
+├── scheduler.py    # /api/scheduler/*
+└── settings.py     # /api/settings
+```
+
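+### 6.2 New and Changed APIs
+
+#### Plugins
+- `GET /api/plugins`: list plugins (including persisted state)
+- `PUT /api/plugins/{plugin_id}/toggle`: toggle the enabled state (persisted to the DB)
+- `POST /api/plugins/{plugin_id}/crawl`: trigger a crawl (asynchronous, returns a task ID)
+- `POST /api/plugins/crawl-all`: crawl with all plugins
+
+**Key change**: the crawl endpoints now **trigger the crawl asynchronously** instead of waiting for it to finish. A new crawler may fetch tens of thousands of proxies, and a synchronous HTTP request would time out.
+
+Example response: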
+```json
+{
+  "code": 200,
+  "message": "Crawl task started",
+  "data": {
+    "task_id": "crawl-20250402-001",
+    "queued": 150
+  }
+}
+```
+
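+To keep the frontend simple, phase one can keep the synchronous API but wrap the work in `asyncio.create_task` internally and apply a reasonable timeout (30 seconds). Once usage reaches real scale, migrate to pushing progress over WebSocket/SSE.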
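
The phase-one approach (start a task, wait up to a timeout, respond either way) can be sketched with plain asyncio. The names `fake_crawl` and `trigger_crawl`, the task ID format, and the `202` code for the still-running case are assumptions made for this illustration; only the `create_task` plus 30-second timeout idea comes from the design.

```python
import asyncio
import uuid


async def fake_crawl(seconds: float) -> int:
    """Stand-in for a plugin crawl; returns how many proxies were queued."""
    await asyncio.sleep(seconds)
    return 150


async def trigger_crawl(crawl_seconds: float, timeout_seconds: float = 30.0) -> dict:
    task_id = f"crawl-{uuid.uuid4().hex[:8]}"  # hypothetical ID format
    task = asyncio.create_task(fake_crawl(crawl_seconds))
    try:
        # shield() keeps the crawl running even when the wait times out.
        queued = await asyncio.wait_for(asyncio.shield(task), timeout_seconds)
    except asyncio.TimeoutError:
        # 202 is an assumed "accepted, still running" code, not taken from the design.
        return {"code": 202, "message": "crawl still running",
                "data": {"task_id": task_id, "queued": None}}
    return {"code": 200, "message": "crawl finished",
            "data": {"task_id": task_id, "queued": queued}}


fast = asyncio.run(trigger_crawl(crawl_seconds=0.01))
slow = asyncio.run(trigger_crawl(crawl_seconds=0.5, timeout_seconds=0.01))
```

In a long-running server the shielded task keeps going after the response is sent, so the application should hold a reference to it (for example in a dict keyed by `task_id`) both to report progress later and to stop it from being garbage-collected.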
+
+---
+
+## 7. Frontend Architecture Changes
+
+### 7.1 New Service Layer
+API call logic is split out of the Stores:
+
+```
+frontend/src/
+├── services/
+│   ├── proxyService.js    # Proxy-related API calls
+│   ├── pluginService.js   # Plugin-related API calls
+│   ├── schedulerService.js
+│   └── settingService.js
+├── stores/
+│   ├── proxy.js           # Pure state management
+│   └── plugin.js
+```
+
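+### 7.2 Narrower Store Responsibilities
+A Store is only responsible for:
+- Holding state (`ref/reactive`)
+- Providing computed properties
+- Calling a Service, then updating state
+
+### 7.3 API Adaptation
+Because backend API paths stay the same, frontend changes are mostly about code organization; URLs and response structures remain as compatible as possible.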
+
+---
+
+## 8. Directory Structure (After Refactor)
+
+```
+ProxyPool/
+├── api/                          # FastAPI entry point and routes
+│   ├── __init__.py
+│   ├── main.py                   # Application factory
+│   ├── lifespan.py               # Lifespan management
+│   ├── deps.py                   # Dependency injection
+│   ├── errors.py                 # Unified exceptions
+│   └── routes/
+│       ├── __init__.py
+│       ├── proxies.py
+│       ├── plugins.py
+│       ├── scheduler.py
+│       └── settings.py
+│
+├── core/                         # Infrastructure, plugin system, and tasks
+│   ├── __init__.py
+│   ├── config.py                 # Pydantic Settings
+│   ├── log.py                    # Logging
+│   ├── db.py                     # Database connection pool / context
+│   ├── exceptions.py             # Business exceptions
+│   ├── plugin_system/
+│   │   ├── __init__.py
+│   │   ├── base.py               # BaseCrawlerPlugin
+│   │   └── registry.py           # Plugin registry
+│   └── tasks/
+│       ├── __init__.py
+│       ├── queue.py              # ValidationQueue
+│       └── workers.py            # Worker Pool
+│
+├── models/                       # Data models
+│   ├── __init__.py
+│   ├── schemas.py                # Pydantic models
+│   └── domain.py                 # Domain models (ProxyRaw, PluginInfo, etc.)
+│
+├── repositories/                 # Data access layer
+│   ├── __init__.py
+│   └── proxy_repo.py             # ProxyRepository
+│
+├── services/                     # Business logic layer
+│   ├── __init__.py
+│   ├── proxy_service.py
+│   ├── plugin_service.py
+│   ├── scheduler_service.py
+│   └── validator_service.py
+│
+├── plugins/                      # Crawler plugins
+│   ├── __init__.py
+│   ├── base.py                   # Shared crawling base class (HTTP request wrapper)
+│   ├── fate0.py
+│   ├── proxylist_download.py
+│   └── ...
+│
+├── frontend/                     # Vue3 frontend
+│   └── src/
+│       ├── services/             # New
+│       ├── stores/
+│       ├── api/
+│       └── ...
+│
+├── tests/                        # Tests
+│   ├── conftest.py
+│   ├── unit/
+│   └── integration/
+│
+├── script/
+├── data/
+├── db/
+├── logs/
+├── requirements.txt
+├── .env.example
+└── DESIGN.md                     # This document
+```
+
+---
+
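+## 9. Migration Plan
+
+### Phase 1: Infrastructure (to be completed today)
+1. Rewrite `core/config.py` → Pydantic Settings
+2. Rewrite `core/db.py` → connection pool with context management
+3. Create the `models/` layer
+
+### Phase 2: Repository + Service (to be completed today)
+1. Create `repositories/proxy_repo.py`
+2. Create the business classes under `services/`
+3. Migrate the existing logic
+
+### Phase 3: Plugin System (to be completed today, core)
+1. Create `core/plugin_system/base.py` and `registry.py`
+2. Design the explicit registration mechanism
+3. Migrate all existing plugins to the new base class
+
+### Phase 4: Task Queue (to be completed today)
+1. Create `ValidationQueue` and `WorkerPool`
+2. Rewrite `SchedulerService`
+
+### Phase 5: API Routes (to be completed today)
+1. Split `api_server.py` into `api/routes/`
+2. Assemble the new `api/main.py`
+
+### Phase 6: Frontend Changes (to be completed today)
+1. Split out the Service layer
+2. Adapt the Stores
+3. Keep the existing pages; change only the code organization
+
+### Phase 7: Cleanup and Verification
+1. Delete the old `api_server.py`, `core/scheduler.py`, `core/sqlite.py`, etc.
+2. Run the tests and confirm every feature works
+3. Commit the code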
+
+---
+
+## 10. Standard Workflow for Adding a New Crawler (Target Experience)
+
+Suppose we want to add a crawler named `mynewsource`:
+
+**Step 1**: Create the file `plugins/mynewsource.py`
+
+```python
+from core.plugin_system import BaseCrawlerPlugin, ProxyRaw
+from plugins.base import BaseHTTPPlugin  # optional: for crawlers that fetch over HTTP
+
+class MyNewSourcePlugin(BaseHTTPPlugin):
+    name = "mynewsource"
+    display_name = "My New Proxy Source"
+    description = "Crawls free proxies from example.com"
+
+    def __init__(self):
+        super().__init__()
+        self.urls = ["https://example.com/proxies"]
+
+    async def crawl(self) -> list[ProxyRaw]:
+        results = []
+        for url in self.urls:
+            html = await self.fetch(url)
+            # ... parse html ...
+            results.append(ProxyRaw(ip="1.2.3.4", port=8080, protocol="http"))
+        return results
+```
+
+**Step 2**: Register it in `plugins/__init__.py`
+
+```python
+from .mynewsource import MyNewSourcePlugin
+from core.plugin_system import registry
+
+registry.register(MyNewSourcePlugin)
+```
+
+**Step 3**: Restart the backend service; the frontend shows the new plugin automatically.
+
+No routes, services, or database tables need to change.
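
The parsing step left open in Step 1 depends entirely on the target site. Purely as an illustration, the sketch below assumes a page that lists proxies as plain `ip:port` text; `parse_proxies`, `PROXY_RE`, and the sample markup are hypothetical, and `ProxyRaw` is redefined here only to keep the block self-contained.

```python
import re
from dataclasses import dataclass


@dataclass
class ProxyRaw:
    ip: str
    port: int
    protocol: str


# Matches "ip:port" pairs such as 1.2.3.4:8080 (hypothetical page format).
PROXY_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b")


def parse_proxies(text: str, protocol: str = "http") -> list[ProxyRaw]:
    results = []
    for ip, port in PROXY_RE.findall(text):
        # Reject out-of-range octets and ports that the regex alone lets through.
        octets_ok = all(int(octet) <= 255 for octet in ip.split("."))
        if octets_ok and 0 < int(port) <= 65535:
            results.append(ProxyRaw(ip=ip, port=int(port), protocol=protocol))
    return results


sample = "<td>1.2.3.4:8080</td><td>999.1.1.1:80</td><td>5.6.7.8:70000</td>"
parsed = parse_proxies(sample)
```

A plugin's `crawl()` could call such a helper on the text returned by `self.fetch(url)` and return the result directly, leaving reachability checks to the validation queue.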
+
+---
+
+*Document version: 1.0*
+*Author: Kimi Code*
+*Date: 2026-04-02*