RTX 5070 Optimization
Generating high-resolution images (3440x1440) with Stable Diffusion XL is computationally expensive and memory-intensive. To achieve this with good performance, Medieval Deck includes a specialized optimization system tuned for the NVIDIA RTX 5070 GPU, implemented in gen_assets/rtx_optimizer.py.
This system is for advanced users who want to understand the performance-tuning aspects of the project.
RTX5070Optimizer Class
This class encapsulates a series of advanced optimizations that are applied to the SDXL pipeline at runtime.
Key Optimizations
- Device Detection and Configuration
  - Automatically detects if a CUDA-enabled GPU is available.
  - Enables specialized settings if an RTX 5070 is identified.
- Advanced Memory Management
  - Memory-Efficient Attention: Enables xformers or flash_sdp if available, which are highly optimized attention mechanisms that significantly reduce VRAM usage and increase speed.
  - Model Offloading: Uses enable_model_cpu_offload() to keep parts of the model on the CPU and only move them to VRAM when needed, reducing the standing memory footprint.
  - VRAM Fraction: Sets a VRAM usage limit (torch.cuda.set_per_process_memory_fraction) to prevent out-of-memory errors (see the combined sketch after this list).
- Performance Acceleration
  - torch.compile: The UNet, the largest part of the SDXL model, is compiled using torch.compile(mode="reduce-overhead"). This feature of PyTorch 2.0+ converts the model's Python code into optimized low-level code, resulting in a significant speedup after an initial warm-up.
  - Half-Precision (FP16): The model is loaded in float16 precision, which cuts VRAM usage nearly in half and is faster on modern GPUs.
  - TF32 Precision: Enables TensorFloat-32 for matrix multiplication (torch.backends.cuda.matmul.allow_tf32), providing a speed boost on Ampere and newer architectures without a significant loss in quality.
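The exact wiring of these settings lives in gen_assets/rtx_optimizer.py. As a rough sketch of how they could be combined on a diffusers SDXL pipeline (the class skeleton, the _is_rtx_5070 helper, and the 0.9 memory fraction below are assumptions for illustration, not the project's actual code):

```python
# Minimal sketch, not the project's implementation; helper names such as
# _is_rtx_5070 and the 0.9 memory fraction are illustrative assumptions.
import torch
from diffusers import StableDiffusionXLPipeline


class RTX5070Optimizer:
    def __init__(self) -> None:
        # Device detection: fall back to CPU when no CUDA GPU is present.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.is_rtx_5070 = self._is_rtx_5070()

    def _is_rtx_5070(self) -> bool:
        # Illustrative check based on the reported GPU name.
        return torch.cuda.is_available() and "RTX 5070" in torch.cuda.get_device_name(0)

    def optimize_pipeline(self, pipeline: StableDiffusionXLPipeline):
        # Half precision is applied when the pipeline is loaded, e.g.
        # StableDiffusionXLPipeline.from_pretrained(..., torch_dtype=torch.float16)
        if self.device != "cuda":
            return pipeline

        # TF32 matmul: faster on Ampere and newer, negligible quality loss.
        torch.backends.cuda.matmul.allow_tf32 = True

        # Cap this process's VRAM usage to reduce out-of-memory errors.
        torch.cuda.set_per_process_memory_fraction(0.9)

        # Memory-efficient attention via xformers when it is installed;
        # otherwise PyTorch's built-in SDP attention is used.
        try:
            pipeline.enable_xformers_memory_efficient_attention()
        except Exception:
            pass

        # Keep idle submodules on the CPU, moving them to VRAM on demand.
        pipeline.enable_model_cpu_offload()

        # Compile the UNet; the first generation pays a warm-up cost.
        pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead")
        return pipeline
```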
How It's Used
The AssetGenerator class uses the RTX5070Optimizer to configure the pipeline and the generation session.
```python
# In AssetGenerator._initialize_pipeline()
# The optimizer applies its settings to the pipeline object
self.pipeline = self.optimizer.optimize_pipeline(self.pipeline)

# In AssetGenerator._generate_image()
# The generation call is wrapped in an optimization context
with self.optimizer.optimized_generation():
    image = self.pipeline(...).images[0]
```
The optimized_generation context manager handles pre-generation cleanup (like clearing the CUDA cache) and post-generation resource release, ensuring the system remains stable during multiple generation tasks.
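A minimal sketch of what such a context manager could look like, assuming it is built on contextlib and torch's cache utilities (the body below is illustrative, not the project's exact code):

```python
# Illustrative sketch of an optimized_generation-style context manager;
# the method in gen_assets/rtx_optimizer.py may differ.
import gc
from contextlib import contextmanager

import torch


@contextmanager
def optimized_generation():
    # Pre-generation cleanup: drop cached VRAM blocks and collect garbage.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()
    try:
        # Run the generation itself without building autograd graphs.
        with torch.inference_mode():
            yield
    finally:
        # Post-generation release, keeping VRAM stable across repeated runs.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()
```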
System Benchmarking
The optimizer includes a benchmark_system method to measure the performance of the local hardware, providing metrics on average generation time.
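A benchmark along these lines could time a few fixed-prompt generations and average the results. The sketch below is hypothetical; the prompt, run count, and step count are placeholder values, not the project's defaults:

```python
# Hypothetical benchmark sketch; the actual benchmark_system method may
# report different metrics or use different defaults.
import time


def benchmark_system(pipeline, runs: int = 3) -> dict:
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # A fixed prompt and reduced step count keep the benchmark short.
        pipeline(
            prompt="a medieval castle at sunset, detailed, cinematic",
            num_inference_steps=20,
        )
        timings.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "average_generation_time_s": sum(timings) / len(timings),
        "fastest_run_s": min(timings),
    }
```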