# RTX 5070 Optimization
Generating high-resolution images (3440x1440) with Stable Diffusion XL is computationally expensive and memory-intensive. To achieve good performance, Medieval Deck includes a specialized optimization system tuned for the NVIDIA RTX 5070 GPU, implemented in `gen_assets/rtx_optimizer.py`.
This system is for advanced users who want to understand the performance-tuning aspects of the project.
## The `RTX5070Optimizer` Class
This class encapsulates a series of advanced optimizations that are applied to the SDXL pipeline at runtime.
### Key Optimizations
- **Device Detection and Configuration**
  - Automatically detects whether a CUDA-enabled GPU is available.
  - Enables specialized settings if an RTX 5070 is identified.
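The detection step can be sketched as a small pure function. Note that `select_device_settings` and its return shape are hypothetical illustrations, not the project's actual API:

```python
def select_device_settings(cuda_available: bool, device_name: str) -> dict:
    """Choose a device profile from the detected hardware (hypothetical helper)."""
    if not cuda_available:
        # No CUDA GPU: fall back to CPU with no specialized settings.
        return {"device": "cpu", "rtx5070_mode": False}
    # Enable the specialized profile only when the target GPU is identified.
    return {
        "device": "cuda",
        "rtx5070_mode": "RTX 5070" in device_name,
    }
```

In the real optimizer the device name would come from `torch.cuda.get_device_name()`, but isolating the decision in a pure function like this keeps it easy to test without a GPU.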
- **Advanced Memory Management**
  - **Memory-Efficient Attention:** Enables `xformers` or `flash_sdp` if available; these highly optimized attention mechanisms significantly reduce VRAM usage and increase speed.
  - **Model Offloading:** Uses `enable_model_cpu_offload()` to keep parts of the model on the CPU and move them to VRAM only when needed, reducing the standing memory footprint.
  - **VRAM Fraction:** Sets a VRAM usage limit (`torch.cuda.set_per_process_memory_fraction`) to prevent out-of-memory errors.
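Taken together, the memory settings look roughly like the sketch below. The helper name is an assumption; the pipeline methods (`enable_xformers_memory_efficient_attention`, `enable_model_cpu_offload`) follow the diffusers API, and the GPU calls are guarded so the sketch degrades gracefully on machines without CUDA:

```python
def apply_memory_optimizations(pipeline, vram_fraction=0.9):
    """Apply the memory-saving settings described above (hypothetical helper)."""
    # Memory-efficient attention: fall back silently if xformers is absent;
    # PyTorch's built-in scaled-dot-product attention remains the default.
    try:
        pipeline.enable_xformers_memory_efficient_attention()
    except Exception:
        pass
    # Keep submodules on the CPU and move them to VRAM only when needed.
    pipeline.enable_model_cpu_offload()
    # Cap this process's share of VRAM to avoid out-of-memory errors.
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.set_per_process_memory_fraction(vram_fraction)
    except ImportError:
        pass  # torch not installed; nothing to cap
    return pipeline
```

Because the helper only calls duck-typed pipeline methods, it works with any diffusers pipeline object, not just SDXL.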
- **Performance Acceleration**
  - **`torch.compile`:** The UNet, the largest part of the SDXL model, is compiled with `torch.compile(mode="reduce-overhead")`. This PyTorch 2.0+ feature converts the model's Python code into optimized low-level code, yielding a significant speedup after an initial warm-up.
  - **Half-Precision (FP16):** The model is loaded in `float16` precision, which cuts VRAM usage nearly in half and is faster on modern GPUs.
  - **TF32 Precision:** Enables TensorFloat-32 for matrix multiplication (`torch.backends.cuda.matmul.allow_tf32`), providing a speed boost on Ampere and newer architectures without a significant loss in quality.
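One way to reason about which of these switches apply is by CUDA compute capability. The helper below is a hypothetical sketch (the name, return type, and thresholds are assumptions): TF32 requires Ampere (compute capability 8.0) or newer, while FP16 and `torch.compile` apply to any modern GPU on PyTorch 2.0+.

```python
from dataclasses import dataclass

@dataclass
class SpeedConfig:
    allow_tf32: bool    # mirrors torch.backends.cuda.matmul.allow_tf32
    dtype: str          # precision the model weights are loaded in
    compile_mode: str   # passed to torch.compile

def speed_config_for(compute_capability: tuple) -> SpeedConfig:
    """Map a GPU's CUDA compute capability to the speed settings above."""
    return SpeedConfig(
        allow_tf32=compute_capability >= (8, 0),  # Ampere and newer only
        dtype="float16",
        compile_mode="reduce-overhead",
    )
```

In practice the capability tuple would come from `torch.cuda.get_device_capability()`.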
## How It's Used
The `AssetGenerator` class uses the `RTX5070Optimizer` to configure both the pipeline and the generation session.
```python
# In AssetGenerator._initialize_pipeline():
# the optimizer applies its settings to the pipeline object.
self.pipeline = self.optimizer.optimize_pipeline(self.pipeline)

# In AssetGenerator._generate_image():
# the generation call is wrapped in an optimization context.
with self.optimizer.optimized_generation():
    image = self.pipeline(...).images[0]
```
The `optimized_generation` context manager handles pre-generation cleanup (such as clearing the CUDA cache) and post-generation resource release, ensuring the system remains stable across multiple generation tasks.
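A minimal stand-in for such a context manager might look like this. The structure is an assumption based on the behavior described above, not the project's actual implementation:

```python
import gc
from contextlib import contextmanager

def _empty_cuda_cache():
    """Clear the CUDA cache when torch and a GPU are available; no-op otherwise."""
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

@contextmanager
def optimized_generation():
    """Run a generation with pre-cleanup and guaranteed post-release."""
    _empty_cuda_cache()      # pre-generation cleanup
    try:
        yield
    finally:
        gc.collect()         # release Python-side references
        _empty_cuda_cache()  # return VRAM even if generation raised
```

The `try/finally` is the important part: VRAM is released even when a generation call fails partway through, which is what keeps back-to-back generation tasks stable.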
## System Benchmarking
The optimizer includes a `benchmark_system` method that measures the performance of the local hardware, reporting metrics such as average generation time.
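The core of such a benchmark can be sketched in a few lines. `benchmark_generation`, its parameters, and the returned keys are hypothetical; only the idea of averaging generation times comes from the source:

```python
import time

def benchmark_generation(generate_fn, runs=3, warmup=1):
    """Time repeated calls to a zero-argument generation callable.

    Warm-up runs are excluded so one-time costs (e.g. the first
    torch.compile invocation) do not skew the averages.
    """
    for _ in range(warmup):
        generate_fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn()
        timings.append(time.perf_counter() - start)
    return {
        "avg_seconds": sum(timings) / len(timings),
        "min_seconds": min(timings),
        "max_seconds": max(timings),
    }
```

Passing the generation as a callable keeps the benchmark independent of any particular pipeline, so it can be exercised with a cheap dummy function during testing.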