RTX 5070 Optimization

Generating high-resolution images (3440x1440) with Stable Diffusion XL is computationally expensive and memory-intensive. To achieve this with good performance, Medieval Deck includes a specialized optimization system tuned for the NVIDIA RTX 5070 GPU, implemented in gen_assets/rtx_optimizer.py.

This system is for advanced users who want to understand the performance-tuning aspects of the project.

RTX5070Optimizer Class

This class encapsulates a series of advanced optimizations that are applied to the SDXL pipeline at runtime.

Key Optimizations

  1. Device Detection and Configuration

    • Automatically detects if a CUDA-enabled GPU is available.
    • Enables specialized settings if an RTX 5070 is identified.
  2. Advanced Memory Management

    • Memory-Efficient Attention: Enables xFormers or PyTorch's flash-attention SDP backend if available. These highly optimized attention implementations significantly reduce VRAM usage and increase speed.
    • Model Offloading: Uses enable_model_cpu_offload() to keep parts of the model on the CPU and only move them to VRAM when needed, reducing the standing memory footprint.
    • VRAM Fraction: Sets a VRAM usage limit (torch.cuda.set_per_process_memory_fraction) to prevent out-of-memory errors.
  3. Performance Acceleration

    • torch.compile: The UNet, the largest component of the SDXL model, is compiled with torch.compile(mode="reduce-overhead"). This PyTorch 2.0+ feature traces the model's Python code and compiles it into optimized kernels, yielding a significant speedup after an initial warm-up (the first call pays the one-time compilation cost).
    • Half-Precision (FP16): The model is loaded in float16 precision, which cuts VRAM usage nearly in half and is faster on modern GPUs.
    • TF32 Precision: Enables TensorFloat-32 for matrix multiplication (torch.backends.cuda.matmul.allow_tf32), providing a speed boost on Ampere and newer architectures without a significant loss in quality.
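The optimizations above can be sketched as a single function that configures PyTorch's global backends and then adjusts the pipeline object. This is a minimal illustration, not the project's actual implementation; the function name is hypothetical, but every torch and diffusers call shown (allow_tf32, set_per_process_memory_fraction, enable_model_cpu_offload, enable_xformers_memory_efficient_attention, torch.compile) is a real API.

```python
def optimize_pipeline(pipe):
    """Apply RTX-oriented settings to an SDXL pipeline (illustrative sketch)."""
    try:
        import torch
    except ImportError:
        return pipe  # no PyTorch available: nothing to optimize

    # TF32 matmul/conv: faster on Ampere+ GPUs with negligible quality loss
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    if torch.cuda.is_available():
        # Cap this process's VRAM usage to avoid out-of-memory errors
        torch.cuda.set_per_process_memory_fraction(0.9)

        # Keep idle submodules on the CPU; move them to VRAM only when used
        pipe.enable_model_cpu_offload()

        # Prefer xFormers attention; fall back to PyTorch's native SDPA
        try:
            pipe.enable_xformers_memory_efficient_attention()
        except Exception:
            pass

        # Compile the UNet; the first generation pays the compilation cost
        pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")
    return pipe
```

Half-precision is not shown here because it is chosen at load time (torch_dtype=torch.float16 when the pipeline is created), not applied afterwards.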

How It's Used

The AssetGenerator class uses the RTX5070Optimizer to configure the pipeline and the generation session.

# In AssetGenerator._initialize_pipeline()
# The optimizer applies its settings to the pipeline object
self.pipeline = self.optimizer.optimize_pipeline(self.pipeline)

# In AssetGenerator._generate_image()
# The generation call is wrapped in an optimization context
with self.optimizer.optimized_generation():
    image = self.pipeline(...).images[0]

The optimized_generation context manager handles pre-generation cleanup (like clearing the CUDA cache) and post-generation resource release, ensuring the system remains stable during multiple generation tasks.
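A context manager with that behavior can be sketched as follows. The cleanup calls (gc.collect, torch.cuda.empty_cache) match what the description above implies, but the body is an assumption, not the project's actual code; the try/except lets the sketch run even where PyTorch is not installed.

```python
import gc
from contextlib import contextmanager

try:
    import torch
    _CUDA = torch.cuda.is_available()
except ImportError:  # assumption: degrade gracefully without PyTorch
    _CUDA = False

@contextmanager
def optimized_generation():
    """Free VRAM before generation and release resources afterwards."""
    if _CUDA:
        torch.cuda.empty_cache()  # pre-generation cleanup
    try:
        yield
    finally:
        gc.collect()              # drop Python-side references
        if _CUDA:
            torch.cuda.empty_cache()  # post-generation release
```

Because the cleanup sits in a finally block, VRAM is released even if a generation call raises, which is what keeps long batch runs stable.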

System Benchmarking

The optimizer includes a benchmark_system method to measure the performance of the local hardware, providing metrics on average generation time.
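The core of such a benchmark is straightforward timing around the generation call. The sketch below is a generic stand-in (benchmark_generation and its parameters are illustrative names, not the project's API); the warm-up runs matter because they absorb torch.compile's one-time compilation cost, which would otherwise skew the average.

```python
import statistics
import time

def benchmark_generation(generate, runs=3, warmup=1):
    """Measure a generation callable; returns timing stats in seconds."""
    for _ in range(warmup):
        generate()  # warm-up: triggers compilation, fills caches

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)

    return {
        "mean_s": statistics.mean(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }
```

In practice the callable would wrap a full pipeline invocation, e.g. `lambda: pipeline(prompt).images[0]`.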