一、核心优化原理

显卡特性与参数关联

现代GPU的设计目标是高吞吐量并行计算，其核心硬件参数直接影响深度学习任务的性能表现。以下是三个关键特性：

参数类型	作用与影响
显存带宽	决定数据从显存到计算单元的传输速度，带宽越高，模型推理速度越快
流多处理器(SM)	每个SM包含多个CUDA核心，数量直接影响并行计算能力（如4096核心的RTX 4090）
计算能力(FLOPS)	单精度浮点性能决定矩阵运算速度，直接影响stable diffusion的迭代速率

显存管理机制是性能优化的核心战场。当处理1024x1024图像时，单个扩散迭代可能产生超过500MB的中间张量，低效的内存分配会导致显存碎片化，造成隐性性能损失。

PyTorch内存分配技术解析

PyTorch采用缓存式分配器（Caching Allocator）管理显存，其工作原理如下：

# 典型内存分配过程示意
if 空闲内存块存在:
    复用已有块
else:
    调用cudaMalloc申请新块
    将未使用部分加入缓存池

特性与局限：

优势：减少cudaMalloc调用次数，避免同步开销（单次cudaMalloc耗时约5-100μs）
缺陷：缓存策略导致"幽灵显存"现象，nvidia-smi显示占用高于实际使用量
调优参数：
- max_split_size_mb=256：限制内存块最大分割尺寸，减少碎片
- roundup_power2_divisions=4：将申请尺寸对齐到最近的2^n倍数，提升复用率

CUDA异步内存分配技术解析

CUDA 11.2引入的流序内存分配器（Stream-Ordered Allocator）带来革新：

cudaMallocAsync(&ptr, size, stream);  // 在特定流中异步分配
kernel<<<..., stream>>>(ptr);        // 保证内核执行前完成分配
cudaFreeAsync(ptr, stream);           // 流执行完毕自动释放

技术突破点：

消除全局同步：传统分配器需要全局锁，万级线程竞争时延迟激增
内存池化：将显存划分为固定尺寸块（如2MB），分配耗时从微秒级降至纳秒级
动态复用：通过 cudaMemPoolTrimTo设置保留阈值，平衡碎片与响应速度

实验数据显示，异步分配器在Stable Diffusion类任务中，分配速率提升16.5倍，显存碎片减少83%。

性能差异的本质原因

两种分配器在SD绘图中的表现差异源于内存管理粒度和执行模型的差异：

对比维度	PyTorch原生分配器	CUDA异步分配器
同步机制	全局锁导致多线程竞争	基于CUDA流的无锁设计
内存生命周期	由Python垃圾回收决定	与CUDA流执行周期严格绑定
碎片处理	被动拆分/合并，易产生"内存孤岛"	主动内存池化，支持块级复用
峰值显存占用	缓存策略导致约1.5倍实际需求	按需分配，波动幅度小于10%
超大规模批处理	易出现OOM（显存不足）	支持动态扩容，适合SDXL和高分辨率图像生成

实际测试表明，在SD-WebUI-Forge中使用异步分配器后：

1080x1080图像生成时间从14.2秒降至9.8秒（提升30%）
显存占用波动范围从±35%缩小到±8%
多模型切换时的卡顿率从22%降至3%以下

异步分配器通过硬件级流控机制，将显存管理从"被动响应"转变为"主动规划"，这正是其能突破传统分配器性能瓶颈的核心原理。

二、性能改进参数解析

在SD-WebUI-Forge中，开发者提供了三个与CUDA异步内存分配相关的关键参数（cuda-stream、pin-shared-memory、cuda-malloc），这些参数通过不同维度的显存调度策略，可显著影响图像生成速度和显存占用。

cuda-stream

工作原理

通过调用CUDA流（GPU内部的并行计算通道），实现模型加载与张量计算的同步流水线作业。传统模式下GPU需要等待模型完全加载后才能开始计算，而启用该参数后，模型加载过程可"见缝插针"地利用GPU空闲时段。

实测效果

在RTX 4050 6GB等30/40系小显存显卡上，SDXL生成速度提升15%-25%
基本消除模型加载耗时（传统模式约占用总时间的10%-15%）
但会带来15%概率的黑图异常（NaN输出），尤其在GTX 1060/2060等旧架构显卡上表现明显

适用场景

30/40系显卡用户（需显存≥6GB）
生成分辨率≤1024px的常规图像
对生成速度敏感且能接受偶发失败的情况

pin-shared-memory

工作原理

与 cuda-stream联用时，该参数将模型优先卸载至共享GPU内存而非系统内存。共享内存本质是显存的扩展缓冲区，其读写速度是DDR4内存的5-8倍，但缺乏有效回收机制。

实测数据

RTX 3060笔记本版生成SDXL图像速度提升≥20%
但存在不可逆崩溃风险：当共享内存耗尽时，程序将直接退出（传统模式可通过系统内存兜底）
在GTX 10系显卡上崩溃概率高达60%

硬件建议

必须与 cuda-stream配合使用
仅推荐30/40系显卡（需显存≥6GB）
显存带宽≥256bit的显卡效果更佳

cuda-malloc

技术本质

启用CUDA异步内存分配器（cudaMallocAsync），通过预分配内存池减少动态内存分配开销。该机制类似于"内存批处理"，可降低GPU指令周期中的空闲等待时间。

Figure illustrating reuse of memory within a stream using the new cudaMallocAsync and cudaFreeAsync API actions.

性能表现

在NVIDIA NSight等专业工具中可见毫秒级优化
实际生成单张图像耗时减少≤0.1秒
但存在约7%的程序崩溃概率（开发者测试统计）

特殊优势

对显存碎片化问题有改善作用
可提升多任务并行时的稳定性
适合需要连续生成多批次图像的场景

参数组合策略

组合方式	速度提升	稳定性	推荐指数
单独启用cuda-malloc	5%-8%	★★★★☆	⭐⭐
cuda-stream + pin-shared-memory	25%-35%	★★☆☆☆	⭐⭐⭐⭐
三者同时启用	40%+	★☆☆☆☆	⭐⭐

特别提示：任何参数组合都需要通过 --medvram或 --lowvram模式提供显存兜底保护。建议每次仅启用一个参数进行测试，生成关键作品前务必关闭实验性参数。

三、个人直观感受

当启用CUDA内置异步分配器，并在SD启动参数中添加--cuda-malloc --cuda-stream --pin-shared-memory 这三个启动参数后，绘图速度提升了0.5it/s左右，并且显存利用率的波动有所减小，速率也更加稳定了。虽然并没有明显地缩短生图时间，不过显存确实是被更高效地使用了，黑图崩溃之类的事件也没有频繁发生。如果你是40系显卡，我还是建议你添加这几个启动参数。

参考

[1] https://40jaiio.sadio.org.ar/sites/default/files/T2011/HPC/1074.pdf
[2] https://research.nvidia.com/sites/default/files/pubs/2019-02_Throughput-oriented-GPU-memory/paper.pdf
[3] https://www.techplayon.com/memory-management-in-gpu/
[4] https://pytorch.org/docs/stable/notes/cuda.html?highlight=max_split_size_mb
[5] https://www.restack.io/p/asynchronous-ai-programming-techniques-answer-cuda-asynchronous-memory-allocation-cat-ai
[6] https://www.reddit.com/r/StableDiffusion/comments/168sad2/pytorch_reserving_large_amounts_of_vram/
[7] https://github.com/NVIDIA/TensorRT-LLM/issues/1103
[8] https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/
[9] https://discuss.pytorch.org/t/how-do-i-rewrite-the-gpu-memory-allocation-algorithm-of-pytorch/179979
[10] https://datascience.stackexchange.com/questions/122109/seeking-guidance-on-understanding-graphics-card-parameters-for-deep-learning-tra
[11] https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html
[12] https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/8-0/vsphere-monitoring-and-performance-8-0/monitoring-inventory-objects/overview-performance-charts/hosts/gpu-memory-allocation-top-10-host.html
[13] https://resources.nvidia.com/gtcd-2020/gtc2020cwe21754
[14] https://www.savemyexams.com/a-level/computer-science/ocr/17/revision-notes/1-the-characteristics-of-contemporary-processors-input-output-and-storage-devices/1-2-types-of-processor/graphics-processing-unit-gpu/
[15] https://dl.acm.org/doi/10.1145/3293883.3295727
[16] https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9727-memory-management-on-modern-gpu-architectures.pdf
[17] https://en.wikipedia.org/wiki/Graphics_processing_unit
[18] https://massedcompute.com/faq-answers/?question=What+is+the+optimal+GPU+memory+allocation+for+training+a+specific+deep+learning+model%3F
[19] https://aims.cuhk.edu.hk/converis/getfile?id=63684153&portal=true&v=1
[20] https://www.weka.io/learn/guide/gpu/what-is-a-gpu/
[21] https://kompute.cc/overview/memory-management.html
[22] https://ieeexplore.ieee.org/document/6835963/
[23] https://www.cherryservers.com/blog/everything-you-need-to-know-about-gpu-architecture
[24] https://www.liquidweb.com/gpu/memory/
[25] https://pytorch.org/blog/understanding-gpu-memory-1/
[26] https://github.com/Paperspace/PyTorch-101-Tutorial-Series/blob/master/PyTorch%20101%20Part%204%20-Memory%20management%20and%20Multi-GPU%20Usage%20in%20PyTorch.ipynb
[27] https://discuss.pytorch.org/t/best-practices-for-gpu-memory-management-explicit-outputs-for-example/16250
[28] https://pytorch.org/docs/stable/generated/torch.cuda.memory_allocated.html
[29] https://discuss.pytorch.org/t/how-to-optimize-memory-allocation-with-intermediate-tensor-operations/149053
[30] https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp
[31] https://pytorch.org/docs/2.1/torch_cuda_memory.html
[32] https://discuss.pytorch.org/t/im-trying-to-rewrite-the-cuda-cache-memory-allocator/178387
[33] https://stackoverflow.com/questions/76199688/how-to-allocate-more-memory-to-pytorch/76200688
[34] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html
[35] https://docs.nvidia.com/cuda/cuda-c-programming-guide/
[36] https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MALLOC__ASYNC.html
[37] https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/
[38] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html
[39] http://www.irisa.fr/alf/downloads/collange/cours/hpca2020_gpu_2.pdf
[40] https://stackoverflow.com/questions/8684770/how-is-cuda-memory-managed
[41] https://committhis.github.io/2020/10/06/cuda-abstractions.html
[42] https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-2/
[43] https://forums.developer.nvidia.com/t/cuda-virtual-memory-management/313090
[44] https://stackoverflow.com/questions/5007556/cuda-zero-copy-memory-considerations/5011564
[45] https://stackoverflow.com/questions/54748977/asynchronous-texture-object-allocation-in-multi-gpu-code
[46] https://stackoverflow.com/questions/59129812/how-to-avoid-cuda-out-of-memory-in-pytorch/62556666
[47] https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/
[48] https://stackoverflow.com/questions/19672111/global-memory-vs-dynamic-global-memory-allocation-in-cuda
[49] https://stackoverflow.com/questions/73030553/does-pytorch-allocate-gpu-memory-eagerly
[50] https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/
[51] https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-2/gpu-memory-system.html
[52] https://www.digitalocean.com/community/tutorials/the-hidden-bottleneck-how-gpu-memory-hierarchy-affects-your-computing-experience
[53] https://pytorch.org/docs/2.6/generated/torch.cuda.caching_allocator_alloc.html
[54] https://discuss.pytorch.org/t/how-do-i-rewrite-the-gpu-memory-allocation-algorithm-of-pytorch/179979
[55] https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html
[56] https://www.digitalocean.com/community/tutorials/pytorch-memory-multi-gpu-debugging
[57] https://kshitij12345.github.io/python,/pytorch/2023/02/26/External-CUDA-Allocator-With-PyTorch.html
[58] https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
[59] https://docs.cupy.dev/en/latest/reference/generated/cupy.cuda.MemoryAsyncPool.html
[60] https://www.techplayon.com/memory-management-in-gpu/
[61] https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/
[62] https://forums.developer.nvidia.com/t/cannot-use-stream-ordered-async-memory-allocator-with-cuda-mps/230229
[63] https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
[64] https://developer.ridgerun.com/wiki/index.php/NVIDIA_CUDA_Memory_Management
[65] https://discuss.pytorch.org/t/how-to-maximize-cpu-gpu-memory-transfer-speeds/173855
[66] https://github.com/pytorch/pytorch/blob/main/torch/cuda/memory.py
[67] https://en.wikipedia.org/wiki/CUDA