Pod

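A Pod is the smallest deployable unit of computing in Kubernetes. It holds one or more containers that share a network namespace and storage volumes, and it represents a group of processes running together in the cluster.

Pod internal architecture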

flowchart TB
    subgraph P[Pod]
        N[Network Namespace]
        S[Storage Volumes]
        C1[Container 1]
        C2[Container 2]
    end
    N --> C1
    N --> C2
    S --> C1
    S --> C2

Lifecycle phases

stateDiagram-v2
    [*] --> Pending
    Pending --> Running
    Pending --> Failed
    Running --> Succeeded
    Running --> Failed
    Succeeded --> [*]
    Failed --> [*]
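
The transitions in the diagram can be observed with a run-to-completion Pod. The manifest below is a minimal sketch, not part of the original examples: the Pod name `phase-demo`, the container name, and the `busybox:1.36` image are illustrative assumptions.

```yaml
# Hypothetical Pod used only to observe phase transitions.
apiVersion: v1
kind: Pod
metadata:
  name: phase-demo            # assumed name
spec:
  restartPolicy: Never        # no restarts, so the Pod can reach a terminal phase
  containers:
  - name: task                # assumed name
    image: busybox:1.36       # assumed image
    # Exit code 0 should end in Succeeded; a non-zero exit code should end in Failed.
    command: ["sh", "-c", "echo working; sleep 5; exit 0"]
```

After applying it, `kubectl get pod phase-demo -o jsonpath='{.status.phase}'` should report Pending while the Pod is being scheduled and the image pulled, Running during the sleep, and Succeeded once the command exits.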

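Resource configuration comparison

Setting	requests	limits
Purpose	Minimum resources reserved at scheduling time	Maximum resources allowed at run time
CPU	Guaranteed CPU time	Maximum CPU usage (the container is throttled above this)
Memory	Guaranteed memory	Maximum memory (the container is OOM-killed above this)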

Pod deployment example

# Basic Pod configuration example

apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  labels:
    app: nginx
spec:
  containers:
  - name: nginx-container
    image: nginx:1.21
    ports:
    - containerPort: 80
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

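Pod deployment strategies

Single-container Pod

Suited to simple application deployments
Easy to manage, with low resource overhead
A good fit for stateless services

Multi-container Pod (sidecar pattern)

A main container plus helper containers
Containers share the Pod's network and storage
A good fit for log collection, monitoring, and similar tasks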
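
The sidecar pattern described above can be sketched as a Pod whose two containers share one `emptyDir` volume. This is a minimal illustration under stated assumptions: the Pod name, the container names, the volume name `logs`, the `busybox:1.36` image, and the mount paths are all chosen for the example.

```yaml
# Hypothetical sidecar Pod: nginx writes logs, a helper container tails them.
apiVersion: v1
kind: Pod
metadata:
  name: web-with-log-sidecar      # assumed name
spec:
  volumes:
  - name: logs                    # shared scratch volume, lives as long as the Pod
    emptyDir: {}
  containers:
  - name: web                     # main container
    image: nginx:1.21
    ports:
    - containerPort: 80
    volumeMounts:
    - name: logs
      mountPath: /var/log/nginx   # nginx writes access.log and error.log here
  - name: log-tailer              # sidecar container
    image: busybox:1.36           # assumed image
    # -F keeps retrying until the file exists, then follows it by name.
    command: ["sh", "-c", "tail -F /logs/access.log"]
    volumeMounts:
    - name: logs
      mountPath: /logs
      readOnly: true
```

Because both containers share the Pod's network namespace, the sidecar could also reach nginx at `localhost:80`. A real log collector would forward the lines somewhere rather than print them.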

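GPU Pod deployment

A GPU Pod is a Pod deployed in Kubernetes that uses graphics processing unit (GPU) resources. GPUs matter in workloads that need massive parallel computation, such as machine learning, deep learning, scientific computing, and video processing.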

graph TB
    API["API Server"] --> SCHED["Scheduler"]
    SCHED --> KUBELET["kubelet"]
    KUBELET --> APP["Application container"]
    KUBELET --> DEVICE_PLUGIN["NVIDIA Device Plugin"]
    APP --> CUDA["CUDA Runtime"]
    CUDA --> NVIDIA_DRV["NVIDIA driver"]
    NVIDIA_DRV --> GPU["Physical GPU"]
    
    style API fill:#e1f5fe
    style SCHED fill:#e1f5fe
    style KUBELET fill:#f3e5f5
    style APP fill:#e8f5e8
    style CUDA fill:#e8f5e8
    style NVIDIA_DRV fill:#fff3e0
    style GPU fill:#ffebee
    style DEVICE_PLUGIN fill:#f1f8e9

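GPU resource comparison

Aspect	CPU	GPU	What it reflects
Compute style	Mostly serial, low-latency execution	Massively parallel execution	Computation model
Core count	A few to a few dozen cores	Thousands of CUDA cores	Parallel processing capacity
Memory architecture	System memory (RAM)	Dedicated video memory plus system memory	Memory management
Scheduling in Kubernetes	cpu requests/limits	nvidia.com/gpu extended resource	Resource allocation
Programming model	CPU instruction set	CUDA/OpenCL	Development model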

GPU Pod deployment examples

Basic GPU Pod configuration

# Basic GPU resource request

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
  nodeSelector:
    kubernetes.io/hostname: gpu-node-1

Multi-GPU Pod configuration

# Multi-GPU resource request

apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-pod
spec:
  containers:
  - name: multi-gpu-container
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 2
      requests:
        nvidia.com/gpu: 2
  nodeSelector:
    gpu-node: "true"

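NVIDIA Device Plugin configuration

The NVIDIA Device Plugin is the key component for managing GPU resources in Kubernetes. It advertises each node's GPUs to the kubelet as the nvidia.com/gpu resource, so that the scheduler can account for them and place GPU Pods on suitable nodes.

# NVIDIA Device Plugin DaemonSet manifest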

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin-ctr
        # Note: 1.0.0-beta4 is an old tag; check the device plugin's releases for a current version.
        image: nvidia/k8s-device-plugin:1.0.0-beta4
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
          type: DirectoryOrCreate

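GPU Pod best practices

Practice	Description	Configuration example
Resource requests and limits	State the number of GPUs explicitly	resources.limits.nvidia.com/gpu: 1
Node affinity	Ensure the Pod is scheduled onto a node that has GPUs	nodeSelector or nodeAffinity
Tolerations	Tolerate the taints applied to GPU nodes	tolerations
Storage	Use fast storage to reduce I/O bottlenecks	SSD or NVMe storage
Networking	Use high-bandwidth, low-latency network connections	InfiniBand or high-speed Ethernet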
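
The first three practices in the table can be combined in one manifest. The sketch below assumes that GPU nodes carry the label `gpu-node: "true"` (as in the multi-GPU example) and a `nvidia.com/gpu` taint with effect `NoSchedule` (the taint the device plugin DaemonSet tolerates). The Pod and container names are hypothetical.

```yaml
# Hypothetical GPU Pod combining an explicit GPU limit, node affinity, and a toleration.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-scheduling-demo       # assumed name
spec:
  restartPolicy: Never            # nvidia-smi exits immediately; do not restart it
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-node         # assumed node label
            operator: In
            values: ["true"]
  tolerations:
  - key: nvidia.com/gpu           # assumed taint key on GPU nodes
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-check              # assumed name
    image: tensorflow/tensorflow:latest-gpu
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1         # the request defaults to the limit for extended resources
```

`nodeAffinity` is used here instead of `nodeSelector` only to show the more expressive form; for a single label match the two behave the same way.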

Common commands

# Run a Pod
kubectl run nginx --image=nginx

# Show detailed information about a Pod
kubectl describe pod nginx

# View Pod logs
kubectl logs nginx

# Open a shell in the Pod's container
kubectl exec -it nginx -- /bin/bash

# Delete a Pod
kubectl delete pod nginx

# Check GPU resources on the nodes
kubectl describe nodes | grep -i nvidia

# Check GPU usage
kubectl exec -it <gpu-pod> -- nvidia-smi