Building a Secure Sandbox Executor for Ray Distributed Tasks with Podman

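Our shared Ray cluster was rapidly turning into a security blind spot. It started out as an efficient compute platform that let data scientists and algorithm engineers parallelize their Python code with little effort. As the number of users grew, though, we faced a thorny question: how do we safely execute arbitrary user-submitted code without strangling flexibility? These tasks may inadvertently access sensitive data, abuse compute resources, or pull in risky dependencies. Manual code audits do not scale, and deploying a separate cluster for every team is unrealistic in both cost and operations. What we needed was a sandbox that is transparent, lightweight, and sufficiently robust.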

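The initial idea was to use container technology, but the traditional Docker approach brings a daemon, which is a centralized point of failure and a potential security risk. We did not want a highly privileged resident service running on every compute node. Podman's daemonless architecture and solid support for rootless containers make it an ideal candidate. Our goal is a Python decorator that can be applied seamlessly to any Ray remote function and automatically wraps it in a temporary, heavily restricted Podman container for execution. Users get execution-level isolation with almost no changes to their code.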

The core logic of the execution flow can be summarized as follows:

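graph TD
    A[User calls a Ray remote function decorated with @secure_podman_task] --> B{Decorator intercepts the call};
    B --> C[Serialize function arguments and code];
    C --> D[Generate the isolated environment];
    subgraph Generate the isolated environment
        D1[Create a temporary working directory]
        D2[Generate the Containerfile]
        D3[Generate the entry script runner.py]
    end
    D --> E{Podman operations};
    subgraph Podman operations
        E1[podman build: build a temporary image]
        E2[podman run: execute in a restricted container]
    end
    E --> F[Capture container output and errors];
    F --> G{Result handling};
    G -- success --> H[Deserialize the result and return it];
    G -- failure --> I[Raise an execution exception];
    C --> J[Clean up temporary resources when done];
    H --> J;
    I --> J;
    J --> K[Execution finished];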

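Phase 1: Building the Core Decorator and Execution Flow

Our starting point is the basic structure of the decorator. It is a decorator factory: it takes the sandbox configuration, receives a function, and returns a new function that contains all of the sandbox logic. We use Python's subprocess module to interact with Podman.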

import functools
import subprocess
import tempfile
import cloudpickle
import os
import logging
import uuid
from typing import List, Optional, Tuple

# Configure basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def secure_podman_task(
    requirements: Optional[List[str]] = None,
    base_image: str = "docker.io/library/python:3.9-slim",
    network: str = "none",
    read_only_mounts: Optional[List[Tuple[str, str]]] = None,
):
    """
    Decorator that runs a function inside a restricted Podman container.

    Args:
        requirements: Python dependencies to install with pip.
        base_image: Base image used to build the container.
        network: Podman network mode; defaults to 'none' for network isolation.
        read_only_mounts: Read-only volume mounts, as [(host_path, container_path), ...].
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # A real project needs a more robust temp-directory strategy here
            with tempfile.TemporaryDirectory() as tmpdir:
                task_id = str(uuid.uuid4())
                image_tag = f"ray-sandbox-task-{task_id}"

                try:
                    # Step 1: serialize the code and its context
                    _prepare_execution_context(tmpdir, func, args, kwargs)

                    # Step 2: generate the Containerfile
                    _generate_containerfile(tmpdir, base_image, requirements)

                    # Step 3: build the container image
                    logging.info(f"[{task_id}] Building sandbox image: {image_tag}")
                    _build_podman_image(tmpdir, image_tag)

                    # Step 4: run the task in the sandbox
                    logging.info(f"[{task_id}] Running task in sandbox container.")
                    stdout_bytes, stderr_bytes = _run_task_in_podman(
                        image_tag, network, read_only_mounts
                    )

                    # Note: any output on stderr is treated as a failure, so a
                    # warning printed by a dependency also fails the task.
                    if stderr_bytes:
                        stderr_text = stderr_bytes.decode('utf-8', 'ignore')
                        logging.error(f"[{task_id}] Task execution error:\n{stderr_text}")
                        raise RuntimeError(f"Task failed in sandbox: {stderr_text}")

                    # Step 5: deserialize the result and return it
                    result = cloudpickle.loads(stdout_bytes)
                    logging.info(f"[{task_id}] Task completed successfully.")
                    return result

                finally:
                    # Step 6: remove the temporary image
                    _cleanup_podman_image(image_tag)
                    logging.info(f"[{task_id}] Cleanup complete.")
        
        return wrapper
    return decorator

# --- Helper functions ---

def _prepare_execution_context(directory: str, func, args, kwargs):
    """Serialize the function and its arguments, and create the entry script."""
    payload = {
        'func': func,
        'args': args,
        'kwargs': kwargs
    }
    with open(os.path.join(directory, 'payload.pkl'), 'wb') as f:
        cloudpickle.dump(payload, f)

    runner_script = """
import cloudpickle
import sys

# This script runs inside the container.
# Its only job is to load the payload, call the function, and write the
# serialized result to standard output.
try:
    with open('/app/payload.pkl', 'rb') as f:
        payload = cloudpickle.load(f)
    
    result = payload['func'](*payload['args'], **payload['kwargs'])
    
    # Serialize the result and write it to stdout.
    # This is the only channel through which the host process gets the result,
    # so the task itself must not print to stdout.
    serialized_result = cloudpickle.dumps(result)
    sys.stdout.buffer.write(serialized_result)
    sys.stdout.buffer.flush()

except Exception as e:
    # Any exception is reported on stderr
    print(f"Error during task execution: {e}", file=sys.stderr)
    sys.exit(1)
"""
    with open(os.path.join(directory, 'runner.py'), 'w') as f:
        f.write(runner_script)


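def _generate_containerfile(directory: str, base_image: str, requirements: Optional[List[str]]):
    """Generate a Containerfile for the requested dependencies."""
    content = f"FROM {base_image}\n"
    content += "WORKDIR /app\n"

    # runner.py imports cloudpickle inside the container, and the slim base
    # image does not ship it, so it is always installed alongside the
    # user's requirements.
    packages = ["cloudpickle"] + list(requirements or [])
    # Pitfall: a long dependency list is reinstalled on every build.
    # In production, consider a base image with the common dependencies
    # preinstalled to speed this up.
    req_str = " ".join(packages)
    content += f"RUN pip install --no-cache-dir {req_str}\n"

    content += "COPY payload.pkl runner.py /app/\n"
    content += 'CMD ["python", "runner.py"]\n'

    with open(os.path.join(directory, 'Containerfile'), 'w') as f:
        f.write(content)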


def _build_podman_image(context_dir: str, tag: str):
    """Invoke podman build."""
    cmd = ["podman", "build", "-f", "Containerfile", "-t", tag, "."]
    proc = subprocess.run(cmd, cwd=context_dir, capture_output=True, check=False)
    if proc.returncode != 0:
        # A failed build is a serious problem and needs detailed logs
        error_msg = f"Podman image build failed for {tag}:\n{proc.stderr.decode('utf-8', 'ignore')}"
        logging.error(error_msg)
        raise RuntimeError(error_msg)


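def _run_task_in_podman(image_tag: str, network: str, mounts: Optional[List[Tuple[str, str]]]):
    """
    Run the container with security constraints applied.
    This is the core of the security implementation.
    """
    cmd = [
        "podman", "run",
        "--rm",  # Remove the container as soon as the task ends
        "--cap-drop=ALL",  # Drop all Linux capabilities (least privilege)
        "--security-opt", "no-new-privileges",  # Block privilege escalation inside the container
        "--read-only",  # Mount the container's root filesystem read-only
        "--network", network,  # Network isolation
    ]
    
    if mounts:
        for host_path, container_path in mounts:
            # Make sure the host path exists; otherwise podman reports an error
            if not os.path.exists(host_path):
                raise FileNotFoundError(f"Host path for read-only mount does not exist: {host_path}")
            # :ro makes the mount read-only inside the container, so the task
            # cannot tamper with host files
            cmd.append(f"--volume={host_path}:{container_path}:ro")

    cmd.append(image_tag)
    
    proc = subprocess.run(cmd, capture_output=True, check=False)
    return proc.stdout, proc.stderr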


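def _cleanup_podman_image(image_tag: str):
    """Remove the temporary image so that disk space is not exhausted."""
    cmd = ["podman", "rmi", "-f", image_tag]
    # The return code is deliberately ignored: under concurrency the image
    # may already have been removed
    subprocess.run(cmd, capture_output=True, check=False)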
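
The wrapper above treats any bytes on stderr as a failure, which also trips on harmless warnings. One possible refinement, sketched here under the assumption that the run helper is changed to hand back the whole `subprocess.CompletedProcess`, is to decide success from the container's exit code, since `runner.py` exits with status 1 on any exception. The name `interpret_sandbox_result` is hypothetical and not part of the code above.

```python
import pickle
import subprocess


def interpret_sandbox_result(proc: subprocess.CompletedProcess):
    """Return the unpickled task result, or raise if the container failed.

    Hypothetical helper: success is decided by the exit code, so warnings
    written to stderr by a dependency do not fail an otherwise good task.
    """
    if proc.returncode != 0:
        detail = proc.stderr.decode("utf-8", "ignore")
        raise RuntimeError(
            f"Task failed in sandbox (exit {proc.returncode}): {detail}"
        )
    # cloudpickle emits standard pickle streams, so pickle.loads can read
    # them; cloudpickle must still be importable on the host when the result
    # contains objects that were pickled by value.
    return pickle.loads(proc.stdout)


if __name__ == "__main__":
    # Stand-in for a finished container: exit code 0 plus a noisy stderr.
    finished = subprocess.CompletedProcess(
        args=["podman", "run"],
        returncode=0,
        stdout=pickle.dumps([1, 2, 3]),
        stderr=b"DeprecationWarning: something harmless",
    )
    print(interpret_sandbox_result(finished))
```

Keeping the decision on the exit code leaves stderr free for diagnostics, which can then be logged at warning level instead of aborting the task.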

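Phase 2: Ray Integration and Security Testing

Now we integrate the decorator with Ray and write test cases that verify its isolation. A common mistake is to test only the happy path and overlook the real value of a security tool: protecting you when malicious or faulty behavior occurs.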

import ray
import numpy as np
import os
import time

# Initialize Ray
if ray.is_initialized():
    ray.shutdown()
ray.init()

# --- Test case definitions ---

# 1. A normal compute task
@ray.remote
@secure_podman_task(requirements=["numpy"])
def secure_matrix_multiply(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # This computation runs entirely inside the sandbox
    return np.dot(a, b)

# 2. A malicious task that attempts filesystem access
@ray.remote
@secure_podman_task()
def malicious_fs_task():
    try:
        # Try to write to the root directory; this should fail
        with open("/test.txt", "w") as f:
            f.write("malicious write")
        return "Failed: Wrote to root filesystem"
    except Exception as e:
        return f"Success: Blocked with error - {type(e).__name__}"

# 3. A malicious task that attempts network access
@ray.remote
@secure_podman_task()
def malicious_network_task():
    import socket
    try:
        # With the default network=none, this fails immediately
        socket.create_connection(("google.com", 80), timeout=2)
        return "Failed: Network access was possible"
    except Exception as e:
        return f"Success: Blocked with error - {type(e).__name__}"
        
# 4. Test the read-only mount
# Prepare a file on the host
host_data_path = "/tmp/shared_data.txt"
with open(host_data_path, "w") as f:
    f.write("shared read-only data")

@ray.remote
@secure_podman_task(
    read_only_mounts=[(host_data_path, "/data/input.txt")]
)
def read_mounted_file_task():
    try:
        with open("/data/input.txt", "r") as f:
            content = f.read()
        
        # Try to write to the mounted file; this should fail
        try:
            with open("/data/input.txt", "w") as f:
                f.write("attempt to overwrite")
            return f"Failed: Read '{content}' but also managed to write."
        except Exception:
            return f"Success: Read '{content}' and correctly blocked from writing."
    except Exception as e:
        return f"Failed: Could not even read the file. Error: {e}"

# --- Run the tests ---

def run_tests():
    print("--- Running Test 1: Secure Matrix Multiplication ---")
    mat_a = np.random.rand(10, 10)
    mat_b = np.random.rand(10, 10)
    future = secure_matrix_multiply.remote(mat_a, mat_b)
    result = ray.get(future)
    assert np.allclose(result, np.dot(mat_a, mat_b))
    print("Result: OK\n")

    print("--- Running Test 2: Malicious Filesystem Access ---")
    future = malicious_fs_task.remote()
    result = ray.get(future)
    print(f"Result: {result}")
    assert "Success" in result
    print("\n")
    
    print("--- Running Test 3: Malicious Network Access ---")
    future = malicious_network_task.remote()
    result = ray.get(future)
    print(f"Result: {result}")
    assert "Success" in result
    print("\n")

    print("--- Running Test 4: Read-only Mount ---")
    future = read_mounted_file_task.remote()
    result = ray.get(future)
    print(f"Result: {result}")
    assert "Success: Read 'shared read-only data' and correctly blocked from writing." in result
    print("\n")

    # Remove the test file
    os.remove(host_data_path)

if __name__ == "__main__":
    run_tests()
    ray.shutdown()

Running this code produces output similar to the following:

--- Running Test 1: Secure Matrix Multiplication ---
Result: OK

--- Running Test 2: Malicious Filesystem Access ---
Result: Success: Blocked with error - OSError

--- Running Test 3: Malicious Network Access ---
Result: Success: Blocked with error - gaierror

--- Running Test 4: Read-only Mount ---
Result: Success: Read 'shared read-only data' and correctly blocked from writing.

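These results validate our core security assumptions: by default, a task runs in an environment with no network, a read-only root filesystem, and no way to escalate privileges. Note that a read-only root filesystem is not a default property of base images, and --cap-drop=ALL does not provide it either; it requires passing Podman's --read-only flag to podman run. Any access to external resources must be declared explicitly through the decorator's parameters, which gives us a clear point for auditing and control.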

Production Considerations and Limitations

The prototype works, but several key issues need to be resolved before it goes to production.

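The first is performance. podman build is a relatively heavy operation. For short-lived Ray tasks, the cost of building the container can exceed the execution time of the task itself. The optimization here is image caching. If multiple tasks share the same dependency set (that is, the same requirements list), we can compute a hash of it and tag the built image with that hash. The next time the same dependency set appears, we reuse the existing image and skip the build. This calls for a more elaborate image management and garbage collection strategy.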
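
A minimal sketch of the hashing idea follows. It assumes the cached image holds only the base image and the dependencies, which means the per-task `payload.pkl` could no longer be copied into the image and would have to be supplied at run time instead (for example through a read-only mount). The function names and the `ray-sandbox-deps-` prefix are hypothetical.

```python
import hashlib
import json
import subprocess
from typing import List, Optional


def dependency_image_tag(base_image: str, requirements: Optional[List[str]]) -> str:
    """Derive a stable image tag from the base image and the dependency set."""
    # Sort and de-duplicate so that ordering differences map to the same tag.
    normalized = sorted(set(requirements or []))
    fingerprint = json.dumps({"base": base_image, "reqs": normalized})
    digest = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
    return f"ray-sandbox-deps-{digest[:16]}"


def image_exists(tag: str) -> bool:
    """Ask the local image store; `podman image exists` exits 0 when found."""
    proc = subprocess.run(["podman", "image", "exists", tag], check=False)
    return proc.returncode == 0


if __name__ == "__main__":
    tag = dependency_image_tag("docker.io/library/python:3.9-slim", ["numpy"])
    print(tag)
```

A caller would build only when `image_exists(tag)` is false. Unpinned requirements such as plain `numpy` hash to the same tag even after a new release, so pinning versions keeps the cache honest.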

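The second is the size of the serialized context. cloudpickle can serialize very complex Python objects, but if a function's input is a huge dataset (a large Pandas DataFrame, for example), serializing it and handing it to the container through the filesystem incurs significant I/O overhead. In a real project, large data should live in shared storage (such as S3 or NFS); the task passes only a reference to the data (such as an S3 path) and accesses it from inside the container with configured credentials. That means credentials have to be injected into the container dynamically, for example through Podman's --env option or its secret management mechanism, which adds further complexity to the system.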
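
As a rough sketch of the `--env` route: when `--env` is given a variable name without a value, Podman takes the value from the host environment, which keeps the secret out of the command line. The helper name `env_passthrough_args` is hypothetical, and a task that reaches S3 would also need a network mode other than `none`.

```python
import os
from typing import Iterable, List


def env_passthrough_args(names: Iterable[str]) -> List[str]:
    """Build `--env NAME` arguments for variables that are set on the host.

    Only the variable name is passed; Podman reads the value from the
    environment of the process that launches it.
    """
    args: List[str] = []
    for name in names:
        if name not in os.environ:
            raise KeyError(f"Required credential variable is not set: {name}")
        args += ["--env", name]
    return args


if __name__ == "__main__":
    # Placeholder value for demonstration only.
    os.environ.setdefault("AWS_ACCESS_KEY_ID", "example-key-id")
    print(env_passthrough_args(["AWS_ACCESS_KEY_ID"]))
```

The returned list would be appended to the `podman run` command before the image tag. The task then receives only a reference such as an S3 path in its arguments and reads the credentials from its own environment.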

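Finally, the security boundary of this approach is clear but not unbreakable. It depends on the container isolation mechanisms provided by Podman and the Linux kernel. A vulnerability in the kernel or in Podman itself could lead to a sandbox escape. It therefore cannot be the only line of defense and should be treated as one layer in a defense-in-depth strategy. Keeping the host system and Podman up to date is essential. In addition, stricter seccomp and AppArmor/SELinux policies can further restrict the system calls allowed inside the container and provide finer-grained control, but writing an effective policy requires a deep understanding of how the tasks behave.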
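
For the seccomp part, Podman accepts a custom profile through `--security-opt seccomp=<path>`. The sketch below only wires an operator-supplied profile into the run command after a basic sanity check; the helper name `seccomp_profile_args` is hypothetical, and the content of the profile is left to whoever knows the workload.

```python
import json
import os
from typing import List


def seccomp_profile_args(profile_path: str) -> List[str]:
    """Return the podman run arguments that apply a custom seccomp profile."""
    with open(profile_path, "r", encoding="utf-8") as f:
        profile = json.load(f)
    # A seccomp profile must declare what happens to unlisted system calls.
    if "defaultAction" not in profile:
        raise ValueError(f"Not a seccomp profile (no defaultAction): {profile_path}")
    return ["--security-opt", f"seccomp={os.path.abspath(profile_path)}"]
```

Podman already applies a default seccomp profile when none is specified, so a custom profile is only an improvement if it is at least as strict as that default.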

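As it stands, the approach also does nothing against resource-abuse attacks that consume large amounts of CPU or memory. Podman's --cpus and --memory options can cap resource usage, but doing so requires more careful planning at the scheduling level.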
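
A small sketch of how such limits might be turned into `podman run` arguments is shown below. The helper name `resource_limit_args` and the idea of exposing these as decorator parameters are assumptions; `--pids-limit` is included as an extra guard against fork bombs. In rootless mode these limits generally depend on cgroups v2 with the relevant controllers delegated to the user.

```python
from typing import List, Optional


def resource_limit_args(
    cpus: Optional[float] = None,
    memory: Optional[str] = None,
    pids_limit: Optional[int] = None,
) -> List[str]:
    """Translate optional resource limits into podman run arguments."""
    args: List[str] = []
    if cpus is not None:
        if cpus <= 0:
            raise ValueError("cpus must be positive")
        args.append(f"--cpus={cpus}")
    if memory is not None:
        # Podman accepts a number with a unit suffix, for example "512m" or "2g".
        args.append(f"--memory={memory}")
    if pids_limit is not None:
        args.append(f"--pids-limit={pids_limit}")
    return args


if __name__ == "__main__":
    print(resource_limit_args(cpus=1.5, memory="512m", pids_limit=128))
```

To keep Ray's scheduler consistent with the container limits, the same numbers would ideally be mirrored in the task's `num_cpus` and `memory` options on `@ray.remote`.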

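Despite these limitations, combining Podman's containerization with Ray's distributed computing model through a decorator gives us a practical and highly customizable secure execution layer. It decouples security policy from application code and offers a workable, principled starting point for running code in untrusted environments. Future iterations will focus on optimizing the image caching strategy, integrating a better data transfer mechanism, and introducing finer-grained resource quota management.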
