Installation and Data Setup

AlphaGen requires a specific environment setup, particularly for the data layer, which relies on Qlib's binary format for efficiency.

1. Environment Setup

This project requires Python 3.8+ and PyTorch. We recommend creating a dedicated virtual environment.

# Clone the repository
git clone https://github.com/RL-MLDM/alphagen.git
cd alphagen

# Create a virtual environment (optional but recommended)
conda create -n alphagen python=3.8
conda activate alphagen

# Install dependencies
pip install -r requirements.txt

Key Dependencies

  • stable_baselines3 & sb3_contrib: Used for the PPO implementation.
  • qlib: Used for high-speed data retrieval and backtesting.
  • baostock: Used as the raw data source for Chinese A-shares.
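
Once the install finishes, a quick import check can confirm that the key dependencies are visible to the active environment. The sketch below uses only the standard library. The import names are assumed to match the package names listed above, and find_missing is a hypothetical helper, not part of AlphaGen.

```python
import importlib.util

# Import names assumed to match the packages listed above.
KEY_MODULES = ["torch", "stable_baselines3", "sb3_contrib", "qlib", "baostock"]


def find_missing(modules):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(KEY_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All key dependencies are importable.")
```

If anything is reported missing, re-running pip install -r requirements.txt inside the activated environment is the first thing to try.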

2. Data Pipeline Configuration

AlphaGen ships without market data, so it cannot run until you download raw stock data and convert it into the Qlib binary format.

Why Baostock and Qlib?

  • Baostock: A free, open-source data provider for Chinese stock data. We use it to fetch OHLCV (Open, High, Low, Close, Volume) data.
  • Qlib: A quantitative investment platform by Microsoft. It stores each feature in a binary file that loads much faster than CSV files or SQL queries, which matters because alpha mining repeatedly reads large blocks of data into tensors.
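
To give a feel for why a binary layout is cheaper to read than CSV, here is a minimal sketch of a fixed-width feature file. It assumes a simplified layout (little-endian float32 values, with the first value holding the series' start index in the trading calendar), which is how Qlib's dump tools are commonly described. Treat the layout and the function names as illustrative assumptions and consult Qlib itself for the authoritative format.

```python
import struct


def write_feature_bin(path, start_index, values):
    """Write [start_index, v0, v1, ...] as little-endian float32 values."""
    payload = [float(start_index)] + [float(v) for v in values]
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(payload)}f", *payload))


def read_feature_bin(path):
    """Read the file back and return (start_index, values)."""
    with open(path, "rb") as f:
        raw = f.read()
    count = len(raw) // 4  # every float32 occupies exactly 4 bytes
    numbers = struct.unpack(f"<{count}f", raw)
    return int(numbers[0]), list(numbers[1:])
```

Because every value has the same width, a reader can decode the whole column in one pass without parsing text, splitting on delimiters, or converting strings to numbers.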

Running the Data Script

The script data_collection/fetch_baostock_data.py handles the whole pipeline: downloading, cleaning, and converting the data.

# Run the data fetcher
python data_collection/fetch_baostock_data.py

What this script does:

  1. Login: Connects to the Baostock API.
  2. Fetch List: Gets the list of all A-shares.
  3. Download: Iterates through every stock to download daily K-line data and adjustment factors.
  4. Save CSV: Saves intermediate CSVs to ../data/export.
  5. Dump Binary: Invokes DumpDataAll (from qlib_dump_bin.py) to convert CSVs into Qlib binaries.
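
For orientation, the download and save steps can be sketched for a single stock as follows. This is not the code in fetch_baostock_data.py. The helper names, the field list, and the choice to name files by an uppercase symbol such as SH600000 are assumptions made for illustration. The Baostock calls used (login, query_history_k_data_plus, logout) are part of its public API and need network access.

```python
import csv
from pathlib import Path

FIELDS = "date,open,high,low,close,volume"


def to_qlib_symbol(baostock_code):
    """Convert a Baostock code such as 'sh.600000' to 'SH600000'."""
    exchange, number = baostock_code.split(".")
    return f"{exchange.upper()}{number}"


def fetch_daily_bars(code, start_date, end_date):
    """Download unadjusted daily bars for one stock as a list of rows."""
    import baostock as bs  # imported lazily so the other helpers work without it

    rs = bs.query_history_k_data_plus(
        code, FIELDS,
        start_date=start_date, end_date=end_date,
        frequency="d", adjustflag="3",  # "3" requests unadjusted prices
    )
    if rs.error_code != "0":
        raise RuntimeError(f"Baostock query failed: {rs.error_msg}")
    rows = []
    while rs.next():
        rows.append(rs.get_row_data())
    return rows


def save_csv(rows, export_dir, symbol):
    """Write rows to <export_dir>/<symbol>.csv with a header and return the path."""
    out_dir = Path(export_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{symbol}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS.split(","))
        writer.writerows(rows)
    return out_path


def download_one(code, start_date, end_date, export_dir):
    """Log in, fetch one stock, save it as CSV, and always log out."""
    import baostock as bs

    bs.login()
    try:
        rows = fetch_daily_bars(code, start_date, end_date)
        return save_csv(rows, export_dir, to_qlib_symbol(code))
    finally:
        bs.logout()
```

A call such as download_one("sh.600000", "2020-01-01", "2020-01-10", "../data/export") would cover steps 1, 3, and 4 for one stock. The real script additionally fetches the stock list and adjustment factors and runs the binary dump.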

Data Location

By default, the script writes the Qlib binaries to a directory such as ~/.qlib/qlib_data/cn_data_baostock_fwdadj. The exact path may differ, so check the script output and use that same path when you initialize Qlib.

If you want to store data elsewhere, modify the DataManager instantiation in fetch_baostock_data.py:

# In data_collection/fetch_baostock_data.py
dm = DataManager(
    save_path="../data",
    qlib_export_path="~/.qlib/qlib_data/cn_data", # <--- Destination for Qlib Binaries
    qlib_base_data_path="~/.qlib/qlib_data/cn_data",
    adjust_date="2009-01-01"
)
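
After the script finishes, it can be worth checking that the export directory looks like a Qlib dataset before pointing AlphaGen at it. The sketch below assumes the standard Qlib layout with calendars, instruments, and features subdirectories. missing_qlib_subdirs is a hypothetical helper, not part of AlphaGen or Qlib.

```python
from pathlib import Path

# Subdirectories assumed to exist in a standard Qlib data directory.
EXPECTED_SUBDIRS = ("calendars", "instruments", "features")


def missing_qlib_subdirs(data_dir):
    """Return the expected subdirectories that are absent under data_dir."""
    root = Path(data_dir).expanduser()  # resolves a leading "~" to the home directory
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_qlib_subdirs("~/.qlib/qlib_data/cn_data")
    print("Missing:", missing if missing else "nothing")
```

An empty result only shows that the folder structure is present. It does not confirm that the binaries inside are complete.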

3. Verifying Installation

To confirm that Qlib reads the data correctly, run a short Python check:

from alphagen_qlib.stock_data import StockData, initialize_qlib
import torch

# Point this to the directory the data script wrote to (its qlib_export_path)
initialize_qlib("~/.qlib/qlib_data/cn_data")

try:
    # Attempt to load CSI300 data for a small range
    data = StockData(
        instrument="csi300",
        start_time="2020-01-01",
        end_time="2020-01-10",
        device=torch.device("cpu")
    )
    print(f"Success! Loaded data shape: {data.data.shape}")
except Exception as e:
    print(f"Data loading failed: {e}")
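
If the check fails for csi300, one thing to rule out is a missing or empty instrument list. The sketch below assumes the standard Qlib convention of one tab-separated file per market under instruments/ (for example instruments/csi300.txt), with lines of the form symbol, start date, end date. active_symbols is a hypothetical helper for inspection only.

```python
from pathlib import Path


def active_symbols(instruments_file, start_time, end_time):
    """Return symbols whose listing interval overlaps [start_time, end_time].

    Dates are ISO strings (YYYY-MM-DD), so plain string comparison orders them.
    """
    symbols = []
    for line in Path(instruments_file).expanduser().read_text().splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbol, listed_from, listed_to = parts[:3]
        if listed_from <= end_time and listed_to >= start_time:
            symbols.append(symbol)
    return symbols
```

For example, active_symbols("~/.qlib/qlib_data/cn_data/instruments/csi300.txt", "2020-01-01", "2020-01-10") should return a non-empty list if the index constituents were exported for that period.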