ProxyPool/DESIGN.md

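# ProxyPool Architecture Refactoring Design Document

> Goal: build a proxy pool system that is highly extensible, cleanly layered, and easy to maintain. The most important goal is to **make adding a new crawler extremely simple**.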

---

## 1. Architecture Overview

The system uses a classic layered architecture:

```
┌─────────────────────────────────────────┐
│  Frontend (Vue3 + Vite + Element Plus)  │
└─────────────┬───────────────────────────┘
              │ HTTP/REST
┌─────────────▼───────────────────────────┐
│  API Layer (FastAPI Routers)            │  ← Only validates input, calls services, formats output
├─────────────────────────────────────────┤
│  Service Layer                          │  ← Business orchestration: crawl strategy, validation scheduling, export logic
├─────────────────────────────────────────┤
│  Plugin System (Crawlers)               │  ← Crawler plugins: implement one interface, return raw proxy data
├─────────────────────────────────────────┤
│  Task Queue & Workers                   │  ← Validation queue: backpressure, worker pool, load smoothing
├─────────────────────────────────────────┤
│  Repository Layer                       │  ← Data access: all SQL lives here
├─────────────────────────────────────────┤
│  Infrastructure (DB / Config / Log)     │  ← Infrastructure: connection pool, configuration, logging
└─────────────────────────────────────────┘
```

---

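## 2. Core Backend Design Principles

### 2.1 Eliminate global singletons and use dependency injection (DI) throughout
Today `scheduler = ValidationScheduler()` is a module-level global, which makes testing hard and hides dependencies.

After the refactor:
- All core components (DB, Scheduler, PluginManager) are injected through FastAPI `Depends`
- Components are initialized inside the lifespan handler, written with `contextlib.asynccontextmanager`, and attached to `app.state`
- Unit tests can easily mock any layer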
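
The pattern above can be sketched roughly as follows. To stay runnable with only the standard library, `FakeApp` and `FakeRequest` are stand-ins for FastAPI's application and request objects, and `Scheduler` is a hypothetical placeholder for the real component. In the real application, `lifespan` would presumably be passed to `FastAPI(lifespan=...)` and `get_scheduler` would be consumed through `Depends(get_scheduler)`.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


class Scheduler:
    """Hypothetical stand-in for the real scheduler component."""

    def __init__(self) -> None:
        self.started = False

    async def start(self) -> None:
        self.started = True

    async def stop(self) -> None:
        self.started = False


class FakeApp:
    """Stand-in for a FastAPI app: only the `state` attribute is modelled."""

    def __init__(self) -> None:
        self.state = SimpleNamespace()


class FakeRequest:
    """Stand-in for a request object that exposes `request.app`."""

    def __init__(self, app: FakeApp) -> None:
        self.app = app


@asynccontextmanager
async def lifespan(app: FakeApp):
    # Build components once at startup and attach them to app.state.
    app.state.scheduler = Scheduler()
    await app.state.scheduler.start()
    try:
        yield
    finally:
        # Tear down on shutdown, even if the application body raised.
        await app.state.scheduler.stop()


def get_scheduler(request: FakeRequest) -> Scheduler:
    # Dependency provider: handlers receive the instance instead of importing a global.
    return request.app.state.scheduler


async def demo() -> tuple[bool, bool]:
    app = FakeApp()
    async with lifespan(app):
        scheduler = get_scheduler(FakeRequest(app))
        running_inside = scheduler.started
    return running_inside, scheduler.started


during, after = asyncio.run(demo())
```

Because handlers only ever see what the provider returns, a test can attach a fake scheduler to `app.state` (or override the provider) without touching any module-level state.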

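### 2.2 Consolidate all SQL behind the Repository pattern
All database operations are moved out of `api_server.py` and `scheduler.py` into `repositories/proxy_repo.py`.

Benefits:
- Switching databases only requires changing the repositories
- Unit tests can mock the repository directly
- SQL statements are managed in one place instead of being scattered across the codebase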
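
As an illustration of the testing benefit, here is a minimal sketch in which the service depends only on a repository interface. The method names `upsert` and `list_all`, the `InMemoryProxyRepository` test double, and `ProxyService.add_many` are all assumptions made for this example, not names taken from the existing code.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ProxyRaw:
    ip: str
    port: int
    protocol: str


class ProxyRepository(Protocol):
    """Hypothetical repository interface; the method names are assumptions."""

    async def upsert(self, proxy: ProxyRaw) -> None: ...
    async def list_all(self) -> list[ProxyRaw]: ...


class InMemoryProxyRepository:
    """Test double: same interface, no SQL involved."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, int], ProxyRaw] = {}

    async def upsert(self, proxy: ProxyRaw) -> None:
        # Keyed by (ip, port), mirroring the UNIQUE(ip, port) constraint in section 5.1.
        self._rows[(proxy.ip, proxy.port)] = proxy

    async def list_all(self) -> list[ProxyRaw]:
        return list(self._rows.values())


class ProxyService:
    """Business logic depends only on the interface, never on SQL."""

    def __init__(self, repo: ProxyRepository) -> None:
        self.repo = repo

    async def add_many(self, proxies: list[ProxyRaw]) -> int:
        for proxy in proxies:
            await self.repo.upsert(proxy)
        return len(await self.repo.list_all())


async def demo() -> int:
    service = ProxyService(InMemoryProxyRepository())
    return await service.add_many([
        ProxyRaw("1.2.3.4", 8080, "http"),
        ProxyRaw("1.2.3.4", 8080, "https"),  # same (ip, port): replaces the first
        ProxyRaw("5.6.7.8", 3128, "http"),
    ])


stored = asyncio.run(demo())
```

A SQLite-backed class with the same two methods could replace the in-memory one without any change to `ProxyService`.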

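### 2.3 Decouple crawling from validation with a task queue
Today a plugin crawls and then validates everything at once with `asyncio.gather(*10000_tasks)`, which risks exhausting memory and opening too many concurrent connections.

The refactor introduces a lightweight in-memory queue:
- `ValidationQueue`: built on `asyncio.Queue`
- `ValidationWorkerPool`: a fixed number of workers consuming from the queue
- Crawl results are `put` into the queue and the call returns as soon as they are queued; validation runs in the background
- A bounded queue (one created with `maxsize` set) provides backpressure naturally and keeps memory from growing without limit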
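
Backpressure only appears when the queue is bounded, so a small demonstration may help. In this sketch the producer is suspended at `put` whenever two items are already waiting; the queue size, item count, and the `asyncio.sleep(0)` placeholder for validation work are arbitrary choices for the example.

```python
import asyncio


async def demo() -> tuple[int, int]:
    queue: asyncio.Queue[int] = asyncio.Queue(maxsize=2)  # bounded on purpose
    max_seen = 0
    processed = 0

    async def producer() -> None:
        nonlocal max_seen
        for i in range(10):
            await queue.put(i)  # suspends here whenever the queue is full
            max_seen = max(max_seen, queue.qsize())

    async def worker() -> None:
        nonlocal processed
        while True:
            await queue.get()
            await asyncio.sleep(0)  # placeholder for real validation work
            processed += 1
            queue.task_done()

    worker_task = asyncio.create_task(worker())
    await producer()
    await queue.join()  # wait until every queued item has been handled
    worker_task.cancel()
    return max_seen, processed


max_depth, handled = asyncio.run(demo())
```

With the default `asyncio.Queue()` (no `maxsize`), `put` never waits, and a crawler that returns tens of thousands of proxies would hold all of them in memory at once.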

---

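## 3. Plugin System Design (Core)

### 3.1 Design goals
**Adding a new crawler should take only two steps:**
1. Create a class that inherits from `BaseCrawlerPlugin`
2. Implement the `crawl()` method, returning `list[ProxyRaw]`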

### 3.2 Plugin interface

```python
from dataclasses import dataclass

@dataclass
class ProxyRaw:
    ip: str
    port: int
    protocol: str  # http | https | socks4 | socks5

class BaseCrawlerPlugin:
    """Base class that every crawler plugin must inherit from."""

    name: str = ""           # unique plugin identifier
    display_name: str = ""   # display name
    description: str = ""    # description
    enabled: bool = True     # whether the plugin is enabled by default

    async def crawl(self) -> list[ProxyRaw]:
        """
        Core method that crawls proxies.
        The body may be purely synchronous or may await async HTTP requests.
        Return the raw proxy list; do not validate here.
        """
        raise NotImplementedError

    async def health_check(self) -> bool:
        """Optional: check whether the plugin is usable (e.g. the target site is reachable)."""
        return True
```

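### 3.3 Plugin registration
Plugins use **explicit registration with a decorator**; runtime directory scanning is dropped.

```python
from app.core.plugin_system import BaseCrawlerPlugin, ProxyRaw, registry

@registry.register
class MyNewPlugin(BaseCrawlerPlugin):
    name = "my_new_plugin"
    display_name = "My new proxy source"

    async def crawl(self):
        return [ProxyRaw("1.2.3.4", 8080, "http")]
```

Advantages:
- Type safety: IDEs can autocomplete and static checkers can verify the code
- Control: no unexpected module gets loaded by accident
- Test friendliness: tests register only mock plugins

A compatibility entry point, `registry.auto_discover("plugins")`, is also provided to support existing habits.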
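
One possible shape for such a registry is sketched below. The class name `PluginRegistry`, the validation rules, and the `names()` helper are assumptions for illustration; only `register` and `auto_discover` are named in this document, and their real behavior may differ.

```python
import importlib
import pkgutil


class BaseCrawlerPlugin:
    name: str = ""


class PluginRegistry:
    """Hypothetical registry: maps plugin names to plugin classes."""

    def __init__(self) -> None:
        self._plugins: dict[str, type[BaseCrawlerPlugin]] = {}

    def register(self, cls: type[BaseCrawlerPlugin]) -> type[BaseCrawlerPlugin]:
        # Returning cls lets the same method work as a decorator or as a plain call.
        if not cls.name:
            raise ValueError(f"{cls.__name__} must define a non-empty name")
        if cls.name in self._plugins and self._plugins[cls.name] is not cls:
            raise ValueError(f"duplicate plugin name: {cls.name}")
        self._plugins[cls.name] = cls
        return cls

    def auto_discover(self, package_name: str) -> None:
        # Compatibility path: importing each submodule runs its @register decorators.
        package = importlib.import_module(package_name)
        for module in pkgutil.iter_modules(package.__path__):
            importlib.import_module(f"{package_name}.{module.name}")

    def names(self) -> list[str]:
        return sorted(self._plugins)


registry = PluginRegistry()


@registry.register
class DemoPlugin(BaseCrawlerPlugin):
    name = "demo"


class Unnamed(BaseCrawlerPlugin):
    pass


try:
    registry.register(Unnamed)
    rejected = False
except ValueError:
    rejected = True
```

Rejecting empty and duplicate names at registration time surfaces mistakes at import, before any crawl runs.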

### 3.4 Persisting plugin metadata
A plugin's `enabled` state should be persisted to the database (or a settings JSON) rather than living only in memory.

Add a `plugin_settings` table:
```sql
CREATE TABLE plugin_settings (
    plugin_id TEXT PRIMARY KEY,
    enabled INTEGER DEFAULT 1,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

On startup:
1. Load all registered plugins
2. Read the persisted state from `plugin_settings`
3. Merge it into the plugin instances
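
The three startup steps could look roughly like this. The sketch uses the standard library `sqlite3` module with an in-memory database as a stand-in for aiosqlite, and the helper names `load_plugin_states` and `merge_states` are hypothetical.

```python
import sqlite3


class BaseCrawlerPlugin:
    name: str = ""
    enabled: bool = True


class FastPlugin(BaseCrawlerPlugin):
    name = "fast"


class SlowPlugin(BaseCrawlerPlugin):
    name = "slow"


def load_plugin_states(conn: sqlite3.Connection) -> dict[str, bool]:
    rows = conn.execute("SELECT plugin_id, enabled FROM plugin_settings")
    return {plugin_id: bool(enabled) for plugin_id, enabled in rows}


def merge_states(plugins: list[BaseCrawlerPlugin], states: dict[str, bool]) -> None:
    for plugin in plugins:
        # Plugins with no stored row keep their class-level default.
        plugin.enabled = states.get(plugin.name, plugin.enabled)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plugin_settings ("
    " plugin_id TEXT PRIMARY KEY,"
    " enabled INTEGER DEFAULT 1)"
)
conn.execute("INSERT INTO plugin_settings (plugin_id, enabled) VALUES ('slow', 0)")

plugins = [FastPlugin(), SlowPlugin()]           # step 1: registered plugins
merge_states(plugins, load_plugin_states(conn))  # steps 2 and 3
enabled_by_name = {p.name: p.enabled for p in plugins}
conn.close()
```

Falling back to the class default means a newly added plugin works before it has a row in `plugin_settings`.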

---

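## 4. Task Scheduling and Validation Queue

### 4.1 Validation queue design

```python
import asyncio

from app.core.plugin_system import ProxyRaw

class ValidationQueue:
    # _validate_and_save() is implemented elsewhere (validator + repository).
    def __init__(self, worker_count: int = 50, maxsize: int = 1000):
        # A bounded queue is what provides backpressure: submit() waits when it is full.
        # maxsize=1000 is an assumed default; tune it for the deployment.
        self.queue: asyncio.Queue[ProxyRaw | None] = asyncio.Queue(maxsize=maxsize)
        self.worker_count = worker_count
        self.workers: list[asyncio.Task] = []
        self._running = False

    async def start(self):
        if self._running:
            return  # already started
        self._running = True
        for _ in range(self.worker_count):
            self.workers.append(asyncio.create_task(self._worker_loop()))

    async def stop(self):
        if not self._running:
            return
        self._running = False
        for _ in self.workers:
            await self.queue.put(None)  # one sentinel per worker
        await asyncio.gather(*self.workers, return_exceptions=True)
        self.workers.clear()

    async def submit(self, proxies: list[ProxyRaw]):
        for p in proxies:
            await self.queue.put(p)  # waits while the queue is full

    async def _worker_loop(self):
        while True:
            item = await self.queue.get()
            try:
                if item is None:  # sentinel: shut this worker down
                    break
                await self._validate_and_save(item)
            except Exception:
                # One bad proxy must not kill the worker; real code should log this.
                pass
            finally:
                self.queue.task_done()
```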

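### 4.2 Scheduler design
`SchedulerService` is responsible for:
- Starting and stopping the validation queue
- Periodically pulling existing proxies from the database and resubmitting them to the validation queue
- Coordinating validation after a plugin crawl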

```python
class SchedulerService:
    def __init__(self, queue: ValidationQueue, proxy_repo: ProxyRepository):
        self.queue = queue
        self.proxy_repo = proxy_repo
        self.interval_minutes = 30
        self._task: asyncio.Task | None = None
```
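
The constructor above does not show the periodic loop itself, so here is one way it might be written. `StubQueue` and `StubRepo` are stand-ins so the sketch runs on its own, `list_all` is an assumed repository method, and the interval is expressed in seconds (rather than `interval_minutes`) only to keep the demonstration short.

```python
import asyncio


class StubQueue:
    """Stand-in for ValidationQueue: records what was submitted."""

    def __init__(self) -> None:
        self.submitted: list[str] = []

    async def submit(self, proxies: list[str]) -> None:
        self.submitted.extend(proxies)


class StubRepo:
    """Stand-in for ProxyRepository; `list_all` is an assumed method name."""

    async def list_all(self) -> list[str]:
        return ["1.2.3.4:8080"]


class SchedulerService:
    def __init__(self, queue: StubQueue, proxy_repo: StubRepo, interval_seconds: float):
        self.queue = queue
        self.proxy_repo = proxy_repo
        self.interval_seconds = interval_seconds
        self._task: asyncio.Task | None = None

    async def start(self) -> None:
        if self._task is None:
            self._task = asyncio.create_task(self._run_loop())

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            # Wait for the cancellation to finish; gather collects the CancelledError.
            await asyncio.gather(self._task, return_exceptions=True)
            self._task = None

    async def _run_loop(self) -> None:
        while True:
            # Resubmit everything currently stored, then sleep until the next round.
            await self.queue.submit(await self.proxy_repo.list_all())
            await asyncio.sleep(self.interval_seconds)


async def demo() -> int:
    queue = StubQueue()
    service = SchedulerService(queue, StubRepo(), interval_seconds=0.01)
    await service.start()
    await asyncio.sleep(0.05)
    await service.stop()
    return len(queue.submitted)


rounds = asyncio.run(demo())
```

Keeping the loop in a single task that `stop()` cancels and awaits gives the lifespan handler a clean shutdown point.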

---

## 5. Database Design

SQLite + aiosqlite is kept, but connection management is improved.

### 5.1 Table schema

```sql
-- Proxy table
CREATE TABLE proxies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ip TEXT NOT NULL,
    port INTEGER NOT NULL,
    protocol TEXT DEFAULT 'http',
    score INTEGER DEFAULT 10,
    response_time_ms REAL,
    last_check TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ip, port)
);

-- Plugin settings table
CREATE TABLE plugin_settings (
    plugin_id TEXT PRIMARY KEY,
    enabled INTEGER DEFAULT 1,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- System settings table (JSON storage)
CREATE TABLE settings (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

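### 5.2 Connection management
- Use `asynccontextmanager` to manage the connection lifecycle
- Each HTTP request acquires its own connection and closes it when the request ends
- Long-lived components such as the scheduler and the queue also rebuild their connection periodically (for example every 1000 operations)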
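
A minimal sketch of the per-request connection lifecycle follows. It uses the standard library `sqlite3` module as a stand-in so that it runs without extra dependencies; with aiosqlite the connect, execute, commit, and close calls would be awaited instead. The helper name `get_connection` is an assumption.

```python
import asyncio
import sqlite3
from contextlib import asynccontextmanager


@asynccontextmanager
async def get_connection(path: str):
    # sqlite3 is a stand-in here; with aiosqlite these calls would be awaited.
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()  # always released, even when the body raises


async def demo() -> tuple[int, bool]:
    async with get_connection(":memory:") as conn:
        conn.execute("CREATE TABLE proxies (ip TEXT NOT NULL, port INTEGER NOT NULL)")
        conn.execute("INSERT INTO proxies VALUES ('1.2.3.4', 8080)")
        count = conn.execute("SELECT COUNT(*) FROM proxies").fetchone()[0]
    try:
        conn.execute("SELECT 1")  # the connection is closed after the block
        closed = False
    except sqlite3.ProgrammingError:
        closed = True
    return count, closed


row_count, is_closed = asyncio.run(demo())
```

The same context manager can back a FastAPI dependency, so each request gets a fresh connection that is closed when the request ends.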

---

## 6. API Design Changes

Existing API paths stay largely unchanged, but routes are split by resource.

### 6.1 Route split
```
api/routes/
├── __init__.py
├── proxies.py      # /api/proxies/*
├── plugins.py      # /api/plugins/*
├── scheduler.py    # /api/scheduler/*
└── settings.py     # /api/settings
```

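### 6.2 New and changed APIs

#### Plugins
- `GET /api/plugins`: list plugins (including persisted state)
- `PUT /api/plugins/{plugin_id}/toggle`: toggle the enabled state (persisted to the DB)
- `POST /api/plugins/{plugin_id}/crawl`: trigger a crawl (asynchronous, returns a task ID)
- `POST /api/plugins/crawl-all`: run a batch crawl

**Key change**: the crawl endpoints now **trigger the crawl asynchronously** instead of waiting for it synchronously. A new crawler may collect tens of thousands of proxies, and a synchronous HTTP request would time out.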

Example response:
```json
{
  "code": 200,
  "message": "Crawl task started",
  "data": {
    "task_id": "crawl-20250402-001",
    "queued": 150
  }
}
```

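To keep the frontend simple, phase one can keep the synchronous API but wrap the work in `asyncio.create_task` internally and apply a reasonable timeout (30 seconds). Once the system is used at real scale, it can migrate to pushing progress over WebSocket/SSE.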
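
The phase-one approach might be sketched as below: the crawl runs in its own task, the handler waits up to a timeout, and a crawl that takes longer keeps running in the background. The function name `crawl_with_timeout`, the `status` field, the task ID format, and `fake_crawl` are assumptions for the example; short timeouts are used here in place of the 30 seconds mentioned above.

```python
import asyncio
import itertools

_task_ids = itertools.count(1)
_background: dict[str, asyncio.Task] = {}  # keeps strong references to running crawls


async def crawl_with_timeout(crawl_coro, timeout: float) -> dict:
    task_id = f"crawl-{next(_task_ids)}"
    task = asyncio.create_task(crawl_coro)
    _background[task_id] = task
    task.add_done_callback(lambda _t: _background.pop(task_id, None))
    try:
        # shield() keeps the crawl running if the wait itself times out.
        found = await asyncio.wait_for(asyncio.shield(task), timeout)
        return {"task_id": task_id, "status": "done", "queued": found}
    except asyncio.TimeoutError:
        return {"task_id": task_id, "status": "running", "queued": None}


async def fake_crawl(delay: float, count: int) -> int:
    await asyncio.sleep(delay)  # placeholder for a real plugin crawl
    return count


async def demo() -> tuple[dict, dict]:
    fast = await crawl_with_timeout(fake_crawl(0.0, 150), timeout=0.5)
    slow = await crawl_with_timeout(fake_crawl(0.5, 150), timeout=0.01)
    return fast, slow


fast_result, slow_result = asyncio.run(demo())
```

Without `asyncio.shield`, `asyncio.wait_for` would cancel the crawl on timeout, which defeats the purpose of letting long crawls finish in the background.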

---

## 7. Frontend Architecture Changes

### 7.1 New service layer
Move API call logic out of the stores:

```
frontend/src/
├── services/
│   ├── proxyService.js    # proxy API calls
│   ├── pluginService.js   # plugin API calls
│   ├── schedulerService.js
│   └── settingService.js
├── stores/
│   ├── proxy.js           # state management only
│   └── plugin.js
```

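### 7.2 Narrowing store responsibilities
A store is only responsible for:
- Holding state (`ref/reactive`)
- Exposing computed properties
- Calling services and then updating state

### 7.3 API adaptation
Because the backend API paths stay the same, frontend changes are mostly about code organization; URLs and response structures remain as compatible as possible.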

---

## 8. Directory Structure (After Refactoring)

```
ProxyPool/
├── main.py                       # Project entry point
├── requirements.txt              # Python dependencies
├── .env.example                  # Example environment variables
│
├── app/                          # Backend code
│   ├── api/                      # FastAPI entry point and routes
│   │   ├── __init__.py
│   │   ├── main.py               # Application factory
│   │   ├── lifespan.py           # Lifecycle management
│   │   ├── deps.py               # Dependency injection
│   │   ├── errors.py             # Unified exceptions
│   │   └── routes/
│   │       ├── __init__.py
│   │       ├── proxies.py
│   │       ├── plugins.py
│   │       ├── scheduler.py
│   │       └── settings.py
│   │
│   ├── core/                     # Infrastructure
│   │   ├── __init__.py
│   │   ├── config.py             # Pydantic Settings
│   │   ├── log.py                # Logging
│   │   ├── db.py                 # Database connection pool / context
│   │   ├── exceptions.py         # Business exceptions
│   │   ├── plugin_system/        # Plugin system
│   │   │   ├── __init__.py
│   │   │   ├── base.py           # BaseCrawlerPlugin
│   │   │   └── registry.py       # Plugin registry
│   │   └── tasks/                # Task queue
│   │       ├── __init__.py
│   │       └── queue.py          # ValidationQueue
│   │
│   ├── models/                   # Data models
│   │   ├── __init__.py
│   │   ├── schemas.py            # Pydantic models
│   │   └── domain.py             # Domain models
│   │
│   ├── repositories/             # Data access layer
│   │   ├── __init__.py
│   │   ├── proxy_repo.py
│   │   ├── settings_repo.py
│   │   └── task_repo.py
│   │
│   ├── services/                 # Business logic layer
│   │   ├── __init__.py
│   │   ├── proxy_service.py
│   │   ├── plugin_service.py
│   │   ├── scheduler_service.py
│   │   └── validator_service.py
│   │
│   └── plugins/                  # Crawler plugins
│       ├── __init__.py
│       ├── base.py               # Shared scraping base class
│       ├── fate0.py
│       ├── kuaidaili.py
│       ├── ip3366.py
│       ├── ip89.py
│       ├── speedx.py
│       ├── yundaili.py
│       ├── proxylist_download.py
│       └── proxyscrape.py
│
├── WebUI/                        # Vue3 frontend
│   ├── src/
│   │   ├── api/                  # API wrappers
│   │   ├── stores/               # Pinia state management
│   │   ├── views/                # Page components
│   │   ├── router/               # Router configuration
│   │   ├── components/           # Shared components
│   │   └── style.css             # Global styles
│   ├── index.html
│   └── package.json
│
├── tests/                        # Tests
│   ├── conftest.py
│   ├── unit/
│   └── integration/
│
├── script/                       # Startup scripts
├── db/                           # Data storage
├── logs/                         # Log files
└── DESIGN.md                     # This document
```

---

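## 9. Migration Plan

### Phase 1: Infrastructure (to be completed today)
1. Rewrite `core/config.py` → Pydantic Settings
2. Rewrite `core/db.py` → connection pool with context management
3. Create the `models/` layer

### Phase 2: Repository + Service (to be completed today)
1. Create `repositories/proxy_repo.py`
2. Create the business classes under `services/`
3. Migrate the existing logic

### Phase 3: Plugin system (to be completed today, core)
1. Create `core/plugin_system/base.py` and `registry.py`
2. Design the explicit registration mechanism
3. Migrate all existing plugins to the new base class

### Phase 4: Task queue (to be completed today)
1. Create `ValidationQueue` and `WorkerPool`
2. Rewrite `SchedulerService`

### Phase 5: API routes (to be completed today)
1. Split `api_server.py` into `api/routes/`
2. Assemble the new `api/main.py`

### Phase 6: Frontend changes (to be completed today)
1. Split out the service layer
2. Adapt the stores
3. Keep the existing pages and change only the code organization

### Phase 7: Cleanup and verification
1. Delete the old `api_server.py`, `core/scheduler.py`, `core/sqlite.py`, and similar files
2. Run the tests and confirm that all functionality works
3. Commit the code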

---

## 10. Standard Workflow for Adding a New Crawler (Target Experience)

Suppose we want to add a crawler named `mynewsource`:

**Step 1**: Create the file `app/plugins/mynewsource.py`

```python
from app.core.plugin_system import BaseCrawlerPlugin, ProxyRaw
from app.plugins.base import BaseHTTPPlugin  # optional: use when crawling over HTTP

class MyNewSourcePlugin(BaseHTTPPlugin):
    name = "mynewsource"
    display_name = "My new proxy source"
    description = "Crawls free proxies from example.com"

    def __init__(self):
        super().__init__()
        self.urls = ["https://example.com/proxies"]

    async def crawl(self) -> list[ProxyRaw]:
        results = []
        for url in self.urls:
            html = await self.fetch(url)
            # ... parse html ...
            results.append(ProxyRaw(ip="1.2.3.4", port=8080, protocol="http"))
        return results
```

**Step 2**: Register it in `app/plugins/__init__.py`

```python
from .mynewsource import MyNewSourcePlugin
from app.core.plugin_system import registry

registry.register(MyNewSourcePlugin)
```

**Step 3**: Restart the backend service; the frontend shows the new plugin automatically.

No routes, services, or database tables need to change.

---

*Document version: 1.0*
*Author: Kimi Code*
*Date: 2026-04-02*