
A few small tips for stable-baselines3

Use a gym wrapper

You need to override reset() and step().

import gymnasium as gym  # newer stable-baselines3 versions use the Gymnasium API (5-tuple step return)


class TimeLimitWrapper(gym.Wrapper):
    """
    :param env: (gym.Env) Gym environment that will be wrapped
    :param max_steps: (int) Max number of steps per episode
    """

    def __init__(self, env, max_steps=100):
        # Call the parent constructor, so we can access self.env later
        super(TimeLimitWrapper, self).__init__(env)
        self.max_steps = max_steps
        # Counter of steps per episode
        self.current_step = 0

    def reset(self, **kwargs):
        """
        Reset the environment and the step counter
        """
        self.current_step = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        """
        :param action: ([float] or int) Action taken by the agent
        :return: (np.ndarray, float, bool, bool, dict) observation, reward, terminated, truncated, additional information
        """
        self.current_step += 1
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Overwrite the truncation signal when the number of steps reaches the maximum
        if self.current_step >= self.max_steps:
            truncated = True
        return obs, reward, terminated, truncated, info
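
A minimal usage sketch (not from the original post), assuming Gymnasium's Pendulum-v1 as a placeholder environment: wrap it and check that the episode is truncated after max_steps.

import gymnasium as gym

# Wrap a toy environment and confirm episodes are cut off after 100 steps
env = TimeLimitWrapper(gym.make("Pendulum-v1"), max_steps=100)

obs, info = env.reset()
n_steps = 0
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    n_steps += 1
print(n_steps, truncated)  # expect: 100 True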

Of course, if you write the environment yourself, you don't need a wrapper.

The various values in the logger

entropy_loss

entropy_loss: a negative value; the smaller (more negative) it is, the less "confident" the policy's predictions are.

If the action probabilities are [0.1, 0.9]:

    entropy = -0.1·ln(0.1) - 0.9·ln(0.9) ≈ 0.325
    entropy_loss = -0.325

If the action probabilities are [0.5, 0.5]:

    entropy = -0.5·ln(0.5) - 0.5·ln(0.5) ≈ 0.693
    entropy_loss = -0.693
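
A quick numpy check of the arithmetic above (a sketch, not SB3 code): SB3 logs entropy_loss as the negative mean entropy of the action distribution.

import numpy as np

def entropy(probs):
    # Shannon entropy of a categorical distribution
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs))

print(entropy([0.1, 0.9]))   # ~0.325 -> entropy_loss ~ -0.325
print(entropy([0.5, 0.5]))   # ~0.693 -> entropy_loss ~ -0.693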

explained_variance

explained_variance measures how well the value function (the critic in actor-critic methods) predicts the actual returns:
≈ 1: perfect prediction
0: no better than always predicting a single constant value
< 0: worse than simply predicting the mean
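
The quantity reported is 1 - Var(returns - predicted_values) / Var(returns). Below is a minimal numpy sketch of that formula (SB3 also ships a helper, stable_baselines3.common.utils.explained_variance); the example arrays are placeholders.

import numpy as np

def explained_variance(y_pred, y_true):
    # 1 - Var(y_true - y_pred) / Var(y_true); undefined when Var(y_true) == 0
    var_y = np.var(y_true)
    return np.nan if var_y == 0 else 1 - np.var(y_true - y_pred) / var_y

returns = np.array([1.0, 2.0, 3.0, 4.0])
print(explained_variance(returns, returns))                     # 1.0: perfect critic
print(explained_variance(np.full(4, returns.mean()), returns))  # 0.0: constant prediction
print(explained_variance(-returns, returns))                    # < 0: worse than the mean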

RL tricks I found on the internet

http://joschu.net/docs/nuts-and-bolts.pdf

  • Premature drop in policy entropy ⇒ no learning

    • Alleviate this by using an entropy bonus or a KL penalty (encourages randomness); an entropy-bonus sketch follows after this list
  • After lots of experimentation I was often unsure what to change next to improve results: feature engineering, reward shaping, more training steps, or algorithm hyper-parameter tuning. What I learned:

    • First and foremost, look at your reward function and validate that the reward for a given episode actually represents what you want to achieve; it took a lot of iterations to get this somewhat right.

    • Once you have checked and double-checked your reward function, move on to feature engineering. In my case, quickly testing with "feature answers" (e.g. data that already contained the information the policy was supposed to figure out) revealed that my reward function was not doing what it should. Start small and simple, and validate while making small changes.

    • Don't waste time on hyper-parameter tuning while you are still developing your environment, observation space, action space, and reward function. Hyper-parameters make a huge difference, but they won't fix a bad reward function. In my experience, tuning found parameters that reached a higher reward faster, but that didn't necessarily generalize to better training; I used the tuning result as a starting point and then tweaked things manually from there.

  • Lastly, how much do you need to train? That's the million-dollar question, and it varies significantly from problem to problem. I found success when the algorithm was able to process any given episode 60+ times; this is essentially a measure of exploration. Some problems/environments need less exploration and others need more, and the larger the observation and action spaces, the more steps are needed. For myself, I came up with needed_steps = number_distinct_episodes * envs * episode_length (mentioned in #2), based on how many times I wanted a given episode executed; a tiny sketch of this heuristic also follows below. Because my problem is data-analytics focused, it was easy to determine how many distinct episodes I had, and I then only needed to decide how many times I wanted each episode explored. In other problems there is no clear number of distinct episodes; the rule of thumb I followed was to run for 1M steps and see how it goes, then, if I'm sure of everything else, run for 5M steps, and then 10M steps, subject to time and compute constraints. I also worked in parallel: make one change, launch a training job, then make a different change in another environment and launch a second job. This let me validate changes quickly and kill the jobs I decided against without waiting for them to finish; tmux was helpful for this.
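
A minimal sketch of adding an entropy bonus in stable-baselines3: PPO's ent_coef weights the entropy term in the loss (the default is 0.0). CartPole-v1 and the coefficient value here are placeholders, not from the original post.

from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "CartPole-v1",
    ent_coef=0.01,  # entropy bonus coefficient; larger values encourage more randomness
    verbose=1,
)
model.learn(total_timesteps=100_000)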
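
The step-budget heuristic from the last point, with made-up numbers purely for illustration:

# All three values are placeholders; plug in your own problem's numbers.
number_distinct_episodes = 500  # distinct episodes in the dataset
envs = 8                        # parallel environments
episode_length = 200            # steps per episode

needed_steps = number_distinct_episodes * envs * episode_length
print(needed_steps)  # 800000 -> use as total_timesteps in model.learn()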