Training Neural Network Dynamics Model
- Collected a large dataset by executing random actions
- Trained a neural network dynamics model on this fixed dataset (a sketch of the training step follows below)
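Consistent with the prediction code later in this section, the model regresses normalized state deltas from (observation, action) pairs with an MSE loss. Below is a minimal sketch of one training step under that setup; the function and argument names are illustrative, not the assignment's exact interface:

import torch
import torch.nn.functional as F

def dynamics_update_step(dynamics_model, optimizer, obs, acs, next_obs,
                         obs_acs_mean, obs_acs_std,
                         obs_delta_mean, obs_delta_std, eps=1e-8):
    """One gradient step: fit the model to normalized state deltas via MSE."""
    inputs = torch.cat([obs, acs], dim=-1)
    inputs_normalized = (inputs - obs_acs_mean) / (obs_acs_std + eps)
    # targets are normalized differences between successive observations
    targets = (next_obs - obs - obs_delta_mean) / (obs_delta_std + eps)
    loss = F.mse_loss(dynamics_model(inputs_normalized), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()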
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_0_iter_layer_1_size_32.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_0_iter_layer_1_size_16.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_0_iter_layer_2_size_16.yaml
Action Selection Using Learned Dynamics Model
Get predictions (inside get_dynamics_predictions; tensor/array conversion omitted)
# the model input is the concatenated (observation, action) pair
inputs = np.concatenate([obs, acs], axis=-1)
# normalize inputs, predict normalized state deltas, then un-normalize
inputs_normalized = (inputs - self.obs_acs_mean) / (self.obs_acs_std + 1e-8)
pred_obs_deltas_normalized = self.dynamics_models[i](inputs_normalized)
pred_obs_deltas = pred_obs_deltas_normalized * self.obs_delta_std + self.obs_delta_mean
pred_next_obs = obs + pred_obs_deltas
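The normalization constants used above are simply statistics of the training data. A minimal sketch of how they could be recomputed from a batch of transitions (the helper name and placement are assumptions, not necessarily the assignment's API):

import numpy as np

def update_statistics(self, obs, acs, next_obs):
    # hypothetical helper: fit normalization statistics to a batch of (s, a, s') transitions
    obs_acs = np.concatenate([obs, acs], axis=-1)  # model inputs
    obs_deltas = next_obs - obs                    # model targets (state deltas)
    self.obs_acs_mean = obs_acs.mean(axis=0)
    self.obs_acs_std = obs_acs.std(axis=0)
    self.obs_delta_mean = obs_deltas.mean(axis=0)
    self.obs_delta_std = obs_deltas.std(axis=0)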
Evaluate action sequences
# accumulate predicted rewards for each candidate sequence under each ensemble member
sum_of_rewards = np.zeros((self.ensemble_size, self.mpc_num_action_sequences))
# replicate the current observation for every model and every candidate sequence
obs = np.tile(obs, (self.ensemble_size, self.mpc_num_action_sequences, 1))
for t in range(self.mpc_horizon):
    acs = action_sequences[:, t, :]
    # roll each ensemble member's dynamics model forward one step
    next_obs = np.stack([
        self.get_dynamics_predictions(i, obs[i], acs)
        for i in range(self.ensemble_size)
    ])
    # score the predicted transitions with the given reward function
    rewards = np.array([
        self.env.get_reward(next_obs[i].reshape(-1, self.ob_dim), acs.reshape(-1, self.ac_dim))[0]
        for i in range(self.ensemble_size)
    ])
    sum_of_rewards += rewards
    obs = next_obs
# average the summed rewards over the ensemble members
return sum_of_rewards.mean(axis=0)
Implemented action selection using the learned dynamics model and the given reward function, then ran:
python cs285/scripts/run_hw4.py -cfg experiments/mpc/obstacles_1_iter.yaml
Observed an average eval return of -27.74.
MBRL Algorithm with On-Policy Data Collection
In code:
# sample candidate action sequences uniformly from the action space
action_sequences = np.random.uniform(
    self.env.action_space.low,
    self.env.action_space.high,
    size=(self.mpc_num_action_sequences, self.mpc_horizon, self.ac_dim),
)
if self.mpc_strategy == "random":
    # random shooting: evaluate every candidate and execute the first action
    # of the highest-scoring sequence
    rewards = self.evaluate_action_sequences(obs, action_sequences)
    best_index = np.argmax(rewards)
    return action_sequences[best_index][0]
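The snippet above covers a single planning step; the full algorithm wraps it in an iterated loop that collects on-policy data with the MPC policy and refits the dynamics ensemble on everything gathered so far. A rough sketch under assumed helper names (replay_buffer.insert, iterate_batches, agent.update_statistics, and agent.update are illustrative, not the run script's exact interface):

def run_mbrl(env, agent, replay_buffer, num_iters, steps_per_iter, ep_len):
    ob, t = env.reset(), 0
    for itr in range(num_iters):
        # 1) collect on-policy data with the current MPC policy
        #    (the very first iteration can instead use purely random actions)
        for _ in range(steps_per_iter):
            ac = agent.get_action(ob)
            next_ob, rew, done, _ = env.step(ac)
            replay_buffer.insert(ob, ac, rew, next_ob, done)
            t += 1
            if done or t >= ep_len:
                ob, t = env.reset(), 0
            else:
                ob = next_ob
        # 2) refit normalization statistics and the dynamics ensemble on all data so far
        agent.update_statistics(replay_buffer)
        for batch in replay_buffer.iterate_batches():
            agent.update(batch["obs"], batch["acs"], batch["next_obs"])

The multi-iteration configs below exercise exactly this loop: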
python cs285/scripts/run_hw4.py -cfg experiments/mpc/obstacles_multi_iter.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_multi_iter.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_multi_iter.yaml
Hyperparameter Ablation
Hyperparameters changed (the corresponding config fields are sketched after this list):
- Number of models in the ensemble
- Number of random action sequences considered during action selection
- MPC planning horizon
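For reference, these map onto the agent attributes used in the code above; the values shown here are placeholders, not the numbers from the provided YAML files:

# illustrative ablation settings only
ablation_overrides = {
    "ensemble_size": 3,                # number of dynamics models in the ensemble
    "mpc_num_action_sequences": 1000,  # candidate sequences evaluated per MPC step
    "mpc_horizon": 10,                 # MPC planning horizon
}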
We then run:
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_action_sequences_decreased.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_action_sequences_increased.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_ensemble_decreased.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_ensemble_increased.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_horizon_decreased.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_horizon_increased.yaml
CEM action selection
# action_sequences starts as the uniform samples drawn above
for i in range(self.cem_num_iters):
    if i == 0:
        # initialize the elite distribution from the uniform samples
        elite_mean = np.mean(action_sequences, axis=0)
        elite_std = np.std(action_sequences, axis=0)
    else:
        # resample candidates from the current elite distribution
        action_sequences = np.random.normal(
            elite_mean, elite_std,
            size=(self.mpc_num_action_sequences, self.mpc_horizon, self.ac_dim),
        )
        action_sequences = np.clip(
            action_sequences,
            self.env.action_space.low,
            self.env.action_space.high,
        )
    # score every candidate sequence with the learned dynamics model
    rewards = self.evaluate_action_sequences(obs, action_sequences)
    # keep the top cem_num_elites sequences and refit the Gaussian to them,
    # smoothing the update with cem_alpha
    elite_indices = np.argsort(rewards)[-self.cem_num_elites:]
    elite_action_sequences = action_sequences[elite_indices]
    elite_mean = (self.cem_alpha * np.mean(elite_action_sequences, axis=0)
                  + (1 - self.cem_alpha) * elite_mean)
    elite_std = (self.cem_alpha * np.std(elite_action_sequences, axis=0)
                 + (1 - self.cem_alpha) * elite_std)

# execute only the first action of the refined mean sequence; MPC replans at the next step
return elite_mean[0]
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_cem.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_cem_iters_2.yaml
MBPO Variant
In code:
# generate a short model rollout under the learned dynamics, starting from a real state ob
for _ in range(rollout_len):
    ac = sac_agent.get_action(ob)
    # predict the next state with a randomly chosen ensemble member
    next_ob = mb_agent.get_dynamics_predictions(
        np.random.randint(mb_agent.ensemble_size), ob, ac
    )
    rew, done = env.get_reward(next_ob, ac)
    # the synthetic transition (ob, ac, rew, next_ob, done) is then added to the
    # SAC agent's replay buffer (insertion call omitted here)
    ob = next_ob
Experiments (the role of rollout_len in each variant is sketched after this list):
- Model-free SAC baseline (rollout length 0)
- Dyna-like algorithm (rollout length 1)
- Full MBPO (rollout length 10)
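As a reading aid, the sketch below spells out how rollout_len distinguishes the three variants; the function and buffer names are assumptions, not the run script's exact API:

import numpy as np

def add_model_rollouts(mb_agent, sac_agent, env, start_obs, rollout_len, sac_replay_buffer):
    """Branch short model-based rollouts from real states and feed them to SAC.

    rollout_len = 0  -> no synthetic data: plain model-free SAC
    rollout_len = 1  -> one-step model rollouts: Dyna-like
    rollout_len = 10 -> longer model rollouts: full MBPO
    """
    for ob in start_obs:
        for _ in range(rollout_len):
            ac = sac_agent.get_action(ob)
            # sample a random ensemble member for the prediction
            i = np.random.randint(mb_agent.ensemble_size)
            next_ob = mb_agent.get_dynamics_predictions(i, ob, ac)
            rew, done = env.get_reward(next_ob, ac)
            sac_replay_buffer.insert(ob, ac, rew, next_ob, done)
            if done:
                break
            ob = next_ob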
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_mbpo.yaml --sac_config_file experiments/sac/halfcheetah_clipq_dyna.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_mbpo.yaml --sac_config_file experiments/sac/halfcheetah_clipq_mbpo.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_mbpo.yaml --sac_config_file experiments/sac/halfcheetah_clipq_sac_baseline.yaml