What is the benefit of the updated Gamma value?

Reducing Gamma (the discount factor) from 0.995 to 0.991 shortens the agent's effective horizon from roughly 6 days to roughly 3 days, which better suits agents that make decisions on an hourly timeframe.
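
The day figures are consistent with reading the horizon as a discount half-life: the number of steps after which a future reward is weighted at 50%. A minimal sketch of that arithmetic, assuming one step per hour and 24 steps per day (the function names are illustrative and not part of the skill):

```python
import math

def half_life_steps(gamma: float) -> float:
    """Steps until a future reward is discounted to half its value."""
    return math.log(0.5) / math.log(gamma)

def effective_horizon_steps(gamma: float) -> float:
    """The common 1 / (1 - gamma) rule of thumb for the planning horizon."""
    return 1.0 / (1.0 - gamma)

for gamma in (0.995, 0.991):
    steps = half_life_steps(gamma)
    days = steps / 24  # assumption: 24 hourly steps per day
    print(f"gamma={gamma}: half-life {steps:.0f} steps (~{days:.1f} days), "
          f"1/(1-gamma) = {effective_horizon_steps(gamma):.0f} steps")
```

Under these assumptions the half-life drops from about 138 to about 77 hourly steps. The 1/(1-gamma) rule of thumb gives longer horizons (200 and about 111 steps), so the exact day count depends on which definition of horizon is used.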

Why does this skill replace tanh with linear clamping?

The tanh function saturates quickly, meaning movements beyond 0.1% provide almost no gradient signal. Linear clamping preserves a clear signal for moves up to 2%, allowing the model to learn from larger price actions.
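
The difference can be seen by comparing the slope of each reward shape at several P&L sizes. This is a sketch only: the tanh scale factor below is an assumed value chosen so that saturation sets in near 0.1%, and the rescaling of the clamped reward to [-1, 1] is likewise an assumption, not the skill's actual code.

```python
import math

TANH_SCALE = 2000.0  # assumed scale, i.e. tanh(2000 * pnl); not taken from the skill
CLAMP_LIMIT = 0.02   # +/-2% clamp, as described in the text

def tanh_reward(pnl: float) -> float:
    """Saturating reward: flattens out once TANH_SCALE * pnl is large."""
    return math.tanh(TANH_SCALE * pnl)

def clamped_reward(pnl: float) -> float:
    """Linear in pnl inside +/-2%, flat outside; rescaled to [-1, 1]."""
    return max(-CLAMP_LIMIT, min(CLAMP_LIMIT, pnl)) / CLAMP_LIMIT

def slope(fn, pnl: float, eps: float = 1e-7) -> float:
    """Central finite-difference estimate of d(reward)/d(pnl)."""
    return (fn(pnl + eps) - fn(pnl - eps)) / (2 * eps)

for pnl in (0.0005, 0.002, 0.01, 0.019):
    print(f"pnl={pnl:.2%}  tanh slope={slope(tanh_reward, pnl):10.4f}  "
          f"clamp slope={slope(clamped_reward, pnl):8.4f}")
```

With this assumed scale, the tanh slope collapses toward zero for a 1% move, while the clamped reward keeps a constant slope everywhere inside the ±2% band.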

What is reward hacking in trading models?

Reward hacking occurs when an RL agent finds a loophole to maximize rewards without achieving the actual goal, such as overtrading to collect a 'trading incentive' bonus while ignoring actual slippage and losses.
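
A toy illustration of the loophole, using made-up numbers: the bonus size, the cost per trade, and both agent profiles below are hypothetical and are not values from the skill. The shaped reward pays a bonus per trade and ignores costs, while realized P&L subtracts them.

```python
TRADE_BONUS = 0.001      # hypothetical per-trade incentive in the shaped reward
COST_PER_TRADE = 0.0005  # hypothetical slippage and fees per trade

def shaped_reward(gross_pnl: float, n_trades: int, bonus: float) -> float:
    """Training reward that pays a per-trade bonus and ignores trading costs."""
    return gross_pnl + bonus * n_trades

def realized_pnl(gross_pnl: float, n_trades: int) -> float:
    """What the account actually earns once costs are subtracted."""
    return gross_pnl - COST_PER_TRADE * n_trades

# Two hypothetical agents over one episode.
churner = dict(gross_pnl=0.0, n_trades=100)  # trades constantly, earns nothing gross
patient = dict(gross_pnl=0.02, n_trades=2)   # trades rarely, earns 2% gross

for name, agent in (("churner", churner), ("patient", patient)):
    print(f"{name}: shaped reward={shaped_reward(**agent, bonus=TRADE_BONUS):.4f}, "
          f"realized P&L={realized_pnl(**agent):.4f}")
```

In this toy setup the churning agent scores the higher shaped reward while losing money, which is the mismatch that removing the trading incentive is meant to close.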

How does this skill improve agent exploration?

It increases the entropy coefficient (ent_coef) from 0.005 to 0.015, which prevents the model from converging too quickly on sub-optimal local strategies and encourages finding more robust trading patterns.
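
In PPO-style objectives the entropy coefficient multiplies the policy's entropy, so a larger value rewards the agent more for keeping its action distribution spread out. A minimal sketch of that term, assuming a three-action policy (HOLD, BUY, SELL) with illustrative probabilities; the name `ent_coef` matches the argument used by Stable-Baselines3's PPO, though the source does not state which library the skill targets.

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_bonus(probs: list[float], ent_coef: float) -> float:
    """Entropy term added to the policy objective: ent_coef * H(pi)."""
    return ent_coef * entropy(probs)

# Hypothetical policies over (HOLD, BUY, SELL).
collapsed = [0.98, 0.01, 0.01]  # nearly deterministic, HOLD-heavy
uniform = [1 / 3, 1 / 3, 1 / 3]  # maximally exploratory

for ent_coef in (0.005, 0.015):
    print(f"ent_coef={ent_coef}: "
          f"collapsed bonus={entropy_bonus(collapsed, ent_coef):.5f}, "
          f"uniform bonus={entropy_bonus(uniform, ent_coef):.5f}")
```

Tripling the coefficient triples the bonus the agent forgoes by collapsing to a near-deterministic policy, which is the mechanism behind the slower convergence described above.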

RL Trading Reward Optimizer v3.3.0

Author: smith6jt-cop


Data Science & Machine Learning

Optimizes reinforcement learning reward functions for automated trading to eliminate reward hacking and improve P&L gradient signals.

This skill provides a specialized framework for tuning Reinforcement Learning (RL) agents used in financial trading environments, specifically targeting the Alpaca trading platform. It implements a risk-aware composite reward structure that addresses common pitfalls like HOLD bias and reward hacking through overtrading. By rebalancing weights toward P&L-driven objectives and implementing linear gradient clamping instead of saturating activation functions, this skill ensures more robust model convergence and realistic trading behavior for hourly market horizons.

Key Features

01 Elimination of overtrading by removing artificial trading incentives

02 Optimized discount factor (Gamma) for hourly trading timeframes

03 Risk-aware composite reward rebalancing for trading agents

04 Calibrated PPO hyperparameters with 3x entropy coefficient increase

05 Gradient preservation via linear P&L clamping up to ±2%

Use Cases

01 Improving RL training signals when gradients vanish due to tanh saturation

02 Fixing 'reward hacking' where agents overtrade to collect artificial bonuses

03 Adjusting agent time horizons to match hourly market data cycles


Install with 🐟 Skill.Fish

npx skillfish add smith6jt-cop/skills_registry reward-function-v330

For use in Claude.ai and ChatGPT
