GitHub: https://github.com/Kiiiiii123/TradeWithRL

Introduction


As mentioned in previous blogs, the stock market is an excellent environment for reinforcement learning research. However, this tiny project is just for fun… 😂


OpenAI Gym Stock Trading Environment


We implement the StockTradingEnv class in CustomEnv.py, inheriting from the gym.Env base class.
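
A rough sketch of the skeleton is shown below. The Box bounds and the observation shape here are illustrative assumptions, not necessarily the values used in CustomEnv.py:

import gym
from gym import spaces
import numpy as np

class StockTradingEnv(gym.Env):
    """A stock trading environment following the OpenAI Gym interface."""

    def __init__(self, df):
        super(StockTradingEnv, self).__init__()
        self.df = df  # pandas DataFrame of daily stock data
        # the two spaces are described in the sections below
        self.action_space = spaces.Box(
            low=np.array([0, 0]), high=np.array([3, 1]), dtype=np.float16)
        self.observation_space = spaces.Box(
            low=-1, high=1, shape=(19,), dtype=np.float16)

    def reset(self):
        # restore the initial balance and holdings, return the first observation
        ...

    def step(self, action):
        # apply the action, advance one trading day,
        # return (observation, reward, done, info)
        ...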

  • Observation Space

Our trading strategy observes the parameters of every stock, such as opening price, closing price, volume, etc. Since some parameters take large values, the observations must be normalized into [-1, 1] to help the network training converge (see the sketch after the table).

| Parameter | Description | Details |
| --- | --- | --- |
| date | trading date | format: YYYY-MM-DD |
| code | stock code | format: sh.600000; sh: Shanghai, sz: Shenzhen |
| open | opening price | 4 decimal places; unit: yuan |
| high | highest price | 4 decimal places; unit: yuan |
| low | lowest price | 4 decimal places; unit: yuan |
| close | closing price | 4 decimal places; unit: yuan |
| preclose | previous day's closing price | 4 decimal places; unit: yuan |
| volume | trading volume | unit: shares |
| amount | turnover | 4 decimal places; unit: yuan |
| adjustflag | adjustment status | no adjust, forward adjust, backward adjust |
| turn | turnover rate (%) | 6 decimal places |
| tradestatus | trading status | 1: normal trading; 0: suspended |
| pctChg | price change (%) | 6 decimal places |
| peTTM | rolling price-earnings ratio | 6 decimal places |
| psTTM | rolling price-sales ratio | 6 decimal places |
| pcfNcfTTM | rolling price-cash-flow ratio | 6 decimal places |
| pbMRQ | price-book ratio | 6 decimal places |
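
As flagged above, here is a minimal normalization sketch. The per-column maxima (MAX_SHARE_PRICE, MAX_VOLUME) are hypothetical placeholder constants, not the repo's actual values:

import numpy as np

# hypothetical per-column scale constants
MAX_SHARE_PRICE = 5000       # yuan
MAX_VOLUME = 1e9             # shares

def normalize(value, max_value):
    # map [0, max_value] linearly onto [-1, 1]
    return 2 * value / max_value - 1

def make_observation(row):
    # row: one record of the stock DataFrame
    return np.array([
        normalize(row['open'], MAX_SHARE_PRICE),
        normalize(row['high'], MAX_SHARE_PRICE),
        normalize(row['low'], MAX_SHARE_PRICE),
        normalize(row['close'], MAX_SHARE_PRICE),
        normalize(row['volume'], MAX_VOLUME),
        # ... remaining columns scaled the same way
    ])
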
  • Action Space

We assume there are only three operations to choose from: buy, sell, and hold. We define the action as an array of length 2: the (discretized) value of action[0] selects the stock operation, while the continuous value of action[1] gives the buy or sell percentage.

| action[0] | Description |
| --- | --- |
| 1 | buy action[1] (percentage) |
| 2 | sell action[1] (percentage) |
| 3 | hold |

Note that when action[0] = 3, our trading strategy neither buys nor sells stock, and the value of action[1] is meaningless. Our trading agent learns this during the training process.
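
One way to decode such an action inside the environment is sketched below, assuming the Box bounds from the skeleton above (bucket k of the continuous first component corresponds to row k of the table); _take_action is a hypothetical helper, not the repo's exact code:

# inside StockTradingEnv
def _take_action(self, action):
    action_type = action[0]   # continuous in [0, 3]: bucket 1 / 2 / 3
    amount = action[1]        # percentage in [0, 1]
    if action_type < 1:
        # bucket 1 — buy: spend `amount` percent of the current balance on shares
        ...
    elif action_type < 2:
        # bucket 2 — sell: sell `amount` percent of the shares currently held
        ...
    else:
        # bucket 3 — hold: do nothing; `amount` is ignored
        pass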

  • Reward Function

When trained in a stock trading environment, the strategy is most concerned with its current profitability, so we use the current profit as our reward function.

self.net_worth = self.balance + self.shares_held * current_price
# profit relative to the initial account balance
profit = self.net_worth - INITIAL_ACCOUNT_BALANCE
reward = 1 if profit > 0 else -100

To accelerate the training of our strategy network and push it toward a profitable strategy, a large penalty (-100) is given whenever the profit is negative.
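
Putting the pieces together, step() could apply this reward roughly as follows (a sketch; _take_action and _next_observation are the hypothetical helpers from the sections above):

def step(self, action):
    self._take_action(action)
    self.current_step += 1

    current_price = self.df.loc[self.current_step, 'close']
    self.net_worth = self.balance + self.shares_held * current_price

    profit = self.net_worth - INITIAL_ACCOUNT_BALANCE
    reward = 1 if profit > 0 else -100

    # end the episode when the account is exhausted or the data runs out
    done = self.net_worth <= 0 or self.current_step >= len(self.df) - 1
    return self._next_observation(), reward, done, {}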

Trading Strategy


Since the value of action[1] is continuous, we use PPO, a policy-gradient-based optimization algorithm. We simply use the Python implementation from stable-baselines.
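
For reference, training with the PPO2 implementation from stable-baselines might look as follows; the hyperparameters and save path are illustrative:

import pandas as pd
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

from CustomEnv import StockTradingEnv

df = pd.read_csv('stockdata/train/sz.002714.牧原股份.csv')
env = DummyVecEnv([lambda: StockTradingEnv(df)])  # PPO2 expects a vectorized env

model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=20000)
model.save('ppo2_stock_trading')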

Experiment


  • Installation

# create and activate a virtual environment
virtualenv -p python3.6 venv
source ./venv/bin/activate
# install dependencies
pip install -r requirements.txt

  • Dataset

Our stock data comes from baostock, a free and open-source stock data platform. We can download stock data (as a pandas DataFrame) through its Python API.

>> pip install baostock -i https://pypi.tuna.tsinghua.edu.cn/simple/ --trusted-host pypi.tuna.tsinghua.edu.cn
>> python get_stock_data.py

In get_stock_data.py, the historical stock data is split: everything up to 2019-11-29 forms the training set, and the final month (December 2019) forms the test set. The test set is necessary for verifying the effectiveness of our reinforcement learning strategy.

| Training set | Test set |
| --- | --- |
| 1990-01-01 ~ 2019-11-29 | 2019-12-01 ~ 2019-12-31 |
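
For reference, downloading daily bars through the baostock API follows this pattern (the field list matches the observation table above, and the date range matches the split table):

import baostock as bs
import pandas as pd

bs.login()
rs = bs.query_history_k_data_plus(
    "sz.002714",
    "date,code,open,high,low,close,preclose,volume,amount,"
    "adjustflag,turn,tradestatus,pctChg,peTTM,psTTM,pcfNcfTTM,pbMRQ",
    start_date='1990-01-01', end_date='2019-11-29',
    frequency='d', adjustflag='3')  # '3' = no adjustment

# collect the result rows into a DataFrame
rows = []
while (rs.error_code == '0') and rs.next():
    rows.append(rs.get_row_data())
df = pd.DataFrame(rows, columns=rs.fields)
bs.logout()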

Results


  • Single-Stock

    • Initial Balance: 10000

    • Stock Code: sz.002714 (Muyuan Foodstuff Co., Ltd.)

    • Training Set: stockdata/train/sz.002714.牧原股份.csv

    • Test Set: stockdata/test/sz.002714.牧原股份.csv

    • Simulating 20 trading days, the final profit is about 75 (see the backtest sketch below)
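
A minimal backtest sketch matching the bullets above (the model path carries over from the training sketch; the day count follows the last bullet):

import pandas as pd
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

from CustomEnv import StockTradingEnv

df_test = pd.read_csv('stockdata/test/sz.002714.牧原股份.csv')
env = DummyVecEnv([lambda: StockTradingEnv(df_test)])

model = PPO2.load('ppo2_stock_trading')
obs = env.reset()
for day in range(20):                 # simulate 20 trading days
    action, _states = model.predict(obs)
    obs, reward, done, info = env.step(action)
    env.render()                      # e.g. print balance, net worth, profit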


  • Multi-Stock

We choose 1002 stocks and train a trading strategy for each using the training data. Run visualize_bacth_testing.py, and we get the backtesting results of our strategies, as shown below: