Data Layer Architecture

AlphaGen is designed for speed. Evaluating thousands of alphas during RL training requires a highly optimized data layer.

StockData Class

Located in alphagen_qlib/stock_data.py, this class is the bridge between Qlib's binary storage and PyTorch tensors.

Initialization

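import torch
from alphagen_qlib.stock_data import StockData

# Load CSI 300 constituents for 2010-2020 onto the first CUDA device.
data = StockData(
    instrument="csi300",
    start_time="2010-01-01",
    end_time="2020-12-31",
    device=torch.device("cuda:0"),
)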

Internal Structure

Rather than keeping the data in Pandas DataFrames, StockData loads the entire dataset into a single contiguous PyTorch tensor, placed on the GPU if one is available.

  • Tensor Shape: (n_days, n_features, n_stocks).
    • n_days: Total trading days in the range.
    • n_features: 6 basic features (Open, Close, High, Low, Volume, VWAP).
    • n_stocks: Number of unique stocks in the instrument set.
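The layout above can be illustrated with a stand-in tensor. This is a minimal sketch, not StockData's own code: the sizes are made up, the data is random, and the feature order in FEATURES is assumed to follow the list above.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
n_days, n_features, n_stocks = 5, 6, 4
FEATURES = ["open", "close", "high", "low", "volume", "vwap"]  # order assumed

# Stand-in for the loaded dataset: one contiguous tensor, days first.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
data = torch.rand(n_days, n_features, n_stocks, device=device)

# One feature across all days and stocks: shape (n_days, n_stocks).
close = data[:, FEATURES.index("close"), :]

# Every feature for every stock on a single day: shape (n_features, n_stocks).
day0 = data[0]
```

With days as the leading dimension, a time window is a plain slice such as data[10:30], and a per-day cross-section is a reduction over the last dimension.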

Handling Missing Data

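Stock data is ragged: stocks are halted, delisted, or newly listed partway through the range, so not every stock has a value on every day. Qlib handles aligning the series across dates and instruments. StockData maintains an internal mapping of valid stocks per day so that cross-sectional calculations such as Rank or Mean consider only active stocks.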
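One common way to restrict a cross-sectional operator to active stocks is a boolean validity mask. The sketch below assumes inactive entries are marked with NaN, which may differ from how StockData stores its mapping; masked_mean_per_day is a hypothetical helper, not part of the library.

```python
import torch

def masked_mean_per_day(values: torch.Tensor) -> torch.Tensor:
    """Mean over stocks for each day, ignoring NaN (inactive) entries.

    values has shape (n_days, n_stocks); the result has shape (n_days,).
    """
    valid = ~torch.isnan(values)                      # True where a stock is active
    filled = torch.where(valid, values, torch.zeros_like(values))
    count = valid.sum(dim=1).clamp(min=1)             # avoid division by zero
    return filled.sum(dim=1) / count

nan = float("nan")
x = torch.tensor([[1.0, 3.0, nan],    # third stock inactive on day 0
                  [nan, 2.0, 4.0]])   # first stock inactive on day 1
means = masked_mean_per_day(x)        # one mean per day over active stocks only
```

A day with no active stocks yields 0.0 here because of the clamp; a real implementation might prefer to return NaN for such days.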

Calculator

The QLibStockDataCalculator (in alphagen_qlib/calculator.py) performs the actual evaluation.

  • Batch Pearson Correlation: It implements a fast, tensor-based Pearson correlation to compute the IC (information coefficient) of an alpha against the target.
  • Caching: It manages the evaluation context, so the target (future return) is pre-calculated and normalized ahead of time rather than on every evaluation.
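The two points above can be sketched together. This is an illustrative version under stated assumptions, not the code in alphagen_qlib/calculator.py: batch_pearson and CachedTargetEvaluator are hypothetical names, inactive entries are assumed to be NaN, and target normalization is omitted because Pearson correlation is unaffected by a per-day shift and positive rescaling.

```python
import torch

def batch_pearson(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Per-day Pearson correlation of x and y, each of shape (n_days, n_stocks).

    Entries that are NaN in either input are excluded from that day.
    A day with zero variance in x or y yields NaN (0 / 0).
    """
    valid = ~(torch.isnan(x) | torch.isnan(y))
    n = valid.sum(dim=1, keepdim=True).clamp(min=1).to(x.dtype)
    zero = torch.zeros_like(x)
    # Per-day means over valid entries only.
    mx = torch.where(valid, x, zero).sum(dim=1, keepdim=True) / n
    my = torch.where(valid, y, zero).sum(dim=1, keepdim=True) / n
    # Centered values, with invalid entries zeroed so they drop out of the sums.
    dx = torch.where(valid, x - mx, zero)
    dy = torch.where(valid, y - my, zero)
    cov = (dx * dy).sum(dim=1)
    denom = (dx.pow(2).sum(dim=1) * dy.pow(2).sum(dim=1)).sqrt()
    return cov / denom

class CachedTargetEvaluator:
    """Hypothetical helper: stores the target once and reuses it for every alpha."""

    def __init__(self, target: torch.Tensor):
        self.target = target  # prepared a single time, outside the evaluation loop

    def mean_ic(self, alpha: torch.Tensor) -> float:
        return batch_pearson(alpha, self.target).mean().item()

nan = float("nan")
alpha = torch.tensor([[1.0, 2.0, 3.0, nan],
                      [1.0, 2.0, 3.0, 4.0]])
target = torch.tensor([[2.0, 4.0, 6.0, 5.0],
                       [4.0, 3.0, 2.0, 1.0]])
ic_per_day = batch_pearson(alpha, target)
evaluator = CachedTargetEvaluator(target)
mean_ic = evaluator.mean_ic(alpha)
```

All days are processed in one pass of tensor operations with no Python loop over dates, which is what makes this style of evaluation cheap enough to repeat for many candidate alphas.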