AlphaGen Documentation

Automatic Formulaic Alpha Generation with Reinforcement Learning

AlphaGen is a research framework designed to discover synergistic formulaic alphas (trading signals) using Reinforcement Learning (RL), Genetic Programming (GP), and Large Language Models (LLMs).

Based on the paper "Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning" (KDD 2023), this project addresses the limitations of traditional alpha mining by focusing on the combination of signals rather than just individual performance.

Core Philosophy

Quantitative trading relies on "alphas": mathematical formulas that transform historical market data into predictive signals. AlphaGen is built on three principles:

  1. Synergy over Individuality: Traditional methods (like Genetic Programming) often generate many alphas that are individually strong but highly correlated. AlphaGen optimizes for a pool of alphas that work well together in a linear model.
  2. Symbolic Representation: Alphas are represented as expression trees (e.g., Mean(Close, 10) / Open), making them interpretable and structurally valid.
  3. Source-Driven Evaluation: The framework integrates tightly with Qlib for high-performance, tensor-based backtesting on real-world stock data.
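To make the symbolic representation concrete, here is a minimal, self-contained sketch of the expression Mean(Close, 10) / Open as a tree that can be printed and evaluated. The Expr/Feature/Mean/Div classes and the dict-of-arrays data layout are illustrative assumptions, not AlphaGen's actual classes:

```python
import numpy as np

class Expr:
    """Base node of an expression tree (illustrative, not AlphaGen's API)."""
    def evaluate(self, data):  # data: dict mapping field name -> 1-D array
        raise NotImplementedError

class Feature(Expr):
    def __init__(self, name):
        self.name = name
    def evaluate(self, data):
        return data[self.name]
    def __repr__(self):
        return self.name

class Mean(Expr):
    """Trailing rolling mean over a fixed window; first window-1 values are NaN."""
    def __init__(self, child, window):
        self.child, self.window = child, window
    def evaluate(self, data):
        x = self.child.evaluate(data).astype(float)
        out = np.full(x.shape, np.nan)
        c = np.cumsum(np.insert(x, 0, 0.0))
        out[self.window - 1:] = (c[self.window:] - c[:-self.window]) / self.window
        return out
    def __repr__(self):
        return f"Mean({self.child!r}, {self.window})"

class Div(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def evaluate(self, data):
        return self.left.evaluate(data) / self.right.evaluate(data)
    def __repr__(self):
        return f"{self.left!r} / {self.right!r}"

# Build Mean(Close, 10) / Open and evaluate it on synthetic price series.
alpha = Div(Mean(Feature("close"), 10), Feature("open"))
rng = np.random.default_rng(0)
data = {"close": 100 + rng.normal(0, 1, 50).cumsum(),
        "open": 100 + rng.normal(0, 1, 50).cumsum()}
signal = alpha.evaluate(data)
print(alpha)  # Mean(close, 10) / open
```

Because every alpha is such a tree, the framework can both render it as a human-readable formula and check structural validity by construction.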

Supported Algorithms

  • AlphaGen (RL): Uses Proximal Policy Optimization (PPO) with invalid action masking to generate syntactically correct formulas that maximize the pool's Information Coefficient (IC).
  • AlphaGPT (LLM): An iterative human-in-the-loop framework where an LLM generates alphas, the system evaluates them, and feeds performance reports back to the LLM for refinement.
  • Baselines: Includes implementations of standard Genetic Programming (gplearn) and Deep Symbolic Optimization (dso) tailored for financial data.
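Invalid action masking can be illustrated with a small sketch: tokens are emitted one at a time in reverse Polish notation, and at each step only tokens that keep the partial expression well-formed (and finishable within a length budget) are legal. The token set, length limit, and END token below are hypothetical simplifications, not AlphaGen's actual action space:

```python
import random

# Illustrative token vocabulary (not AlphaGen's real operator set).
OPERANDS = ["open", "close", "volume"]
BINARY_OPS = ["add", "sub", "div"]
MAX_LEN = 7  # budget on expression length, in tokens

def sample_formula(rng):
    """Sample a well-formed RPN token sequence using an action mask.

    `stack` tracks how many sub-expressions are waiting to be combined.
    An operand pushes one; a binary operator pops two and pushes one.
    """
    tokens, stack = [], 0
    while True:
        legal = []
        # An operand is legal only if the remaining budget still allows
        # the `stack` binary operators needed to reduce everything to one tree.
        if stack <= MAX_LEN - len(tokens) - 1:
            legal += OPERANDS
        # A binary operator is legal only with two sub-expressions available.
        if stack >= 2:
            legal += BINARY_OPS
        # The episode may end only on a single complete expression.
        if stack == 1:
            legal.append("END")
        tok = rng.choice(legal)
        if tok == "END":
            return tokens
        tokens.append(tok)
        stack += 1 if tok in OPERANDS else -1

rng = random.Random(0)
print(sample_formula(rng))
```

In the real system the same idea applies inside PPO: the policy's logits for illegal tokens are masked out before sampling, so the agent never wastes exploration on syntactically invalid formulas.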

Repository Structure

  • alphagen/: Core RL environment, expression tree definitions, and PPO policy structures.
  • alphagen_qlib/: Data loading and calculation logic wrapping Microsoft Qlib.
  • alphagen_llm/: Prompts and clients for OpenAI/LLaMA integration.
  • scripts/: Entry points for training RL agents, running LLM sessions, and testing validity.

Getting Started

If you are new to this project, we recommend the following path:

  1. Installation: Set up the Python environment and, most importantly, prepare the Qlib binary data files.
  2. Quick Start: Run a minimal RL training session.
  3. Core Concepts: Understand how expressions are built and evaluated.
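As a taste of how alphas are scored, the Information Coefficient mentioned above is the cross-sectional correlation between an alpha's values and subsequent returns, averaged over time. Below is a minimal sketch on synthetic data; the array shapes and the "noisy copy of returns" alpha are assumptions for illustration, not AlphaGen's actual evaluation code:

```python
import numpy as np

def daily_ic(alpha, fwd_ret):
    """Pearson correlation between alpha values and forward returns
    across all stocks on a single day."""
    a = alpha - alpha.mean()
    r = fwd_ret - fwd_ret.mean()
    return float((a * r).sum() / np.sqrt((a * a).sum() * (r * r).sum()))

rng = np.random.default_rng(0)
n_days, n_stocks = 250, 100
ret = rng.normal(0, 0.02, (n_days, n_stocks))   # synthetic forward returns
alpha = ret + rng.normal(0, 0.04, ret.shape)     # a deliberately "good" alpha
ics = [daily_ic(alpha[t], ret[t]) for t in range(n_days)]
print(f"mean IC: {np.mean(ics):.3f}")
```

A pool of alphas is evaluated the same way, except the signal fed into the correlation is a linear combination of the pool's members, which is why low mutual correlation within the pool matters.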