September 30, 2024

Model-Based RL

Training Neural Network Dynamics Model

  • Collected a large dataset by executing random actions
  • Trained a neural network dynamics model on this fixed dataset
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_0_iter_layer_1_size_32.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_0_iter_layer_1_size_16.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_0_iter_layer_2_size_16.yaml
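For reference, a single training step on this fixed dataset follows the delta-prediction scheme used in the prediction code below. This is a hedged sketch with hypothetical argument names, not the exact homework code:

import torch
import torch.nn.functional as F

def dynamics_train_step(model, optimizer, obs, acs, next_obs,
                        obs_acs_mean, obs_acs_std, obs_delta_mean, obs_delta_std):
    """One MSE training step for a delta-predicting dynamics model (sketch)."""
    inputs = torch.cat([obs, acs], dim=-1)
    inputs_normalized = (inputs - obs_acs_mean) / (obs_acs_std + 1e-8)

    # The regression target is the normalized change in observation,
    # not the raw next state
    target_deltas = (next_obs - obs - obs_delta_mean) / (obs_delta_std + 1e-8)

    loss = F.mse_loss(model(inputs_normalized), target_deltas)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()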

Action Selection Using Learned Dynamics Model

Get predictions

# Normalize the concatenated (obs, acs) inputs with the dataset statistics
inputs_normalized = (inputs - self.obs_acs_mean) / (self.obs_acs_std + 1e-8)

# Predict the normalized observation delta with the i-th ensemble member,
# then un-normalize it back to the original scale
pred_obs_deltas_normalized = self.dynamics_models[i](inputs_normalized)
pred_obs_deltas = pred_obs_deltas_normalized * self.obs_delta_std + self.obs_delta_mean

# The model predicts a change in observation, so the next state is obs + delta
pred_next_obs = obs + pred_obs_deltas
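For completeness, the normalization statistics referenced above (self.obs_acs_mean, self.obs_delta_std, and so on) can be computed from the collected dataset roughly as follows; this is a sketch and the function name is hypothetical:

import numpy as np

def compute_normalization_stats(obs, acs, next_obs):
    """Per-dimension statistics for (obs, acs) inputs and observation deltas (sketch)."""
    inputs = np.concatenate([obs, acs], axis=-1)
    deltas = next_obs - obs
    return {
        "obs_acs_mean": inputs.mean(axis=0),
        "obs_acs_std": inputs.std(axis=0),
        "obs_delta_mean": deltas.mean(axis=0),
        "obs_delta_std": deltas.std(axis=0),
    }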

Action Selection

for t in range(self.mpc_horizon):
    acs = action_sequences[:, t, :]

    # Roll every candidate action sequence forward under each ensemble member
    next_obs = np.stack([
        self.get_dynamics_predictions(i, obs[i], acs)
        for i in range(self.ensemble_size)
    ])

    # Evaluate the given reward function at the predicted states
    rewards = np.array([
        self.env.get_reward(next_obs[i].reshape(-1, self.ob_dim), acs.reshape(-1, self.ac_dim))[0]
        for i in range(self.ensemble_size)
    ])

    # Accumulate reward over the planning horizon and step the rollouts forward
    sum_of_rewards += rewards
    obs = next_obs
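After the horizon loop, each candidate sequence gets a single score by averaging its accumulated reward over the ensemble members. A minimal sketch of that reduction, assuming sum_of_rewards has shape (ensemble_size, mpc_num_action_sequences):

import numpy as np

def score_action_sequences(sum_of_rewards: np.ndarray) -> np.ndarray:
    # Reduce (ensemble_size, num_sequences) -> (num_sequences,) by
    # averaging the accumulated reward across ensemble members
    return sum_of_rewards.mean(axis=0)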

Implemented action selection using the learned dynamics model and a given reward function, then ran:

python cs285/scripts/run_hw4.py -cfg experiments/mpc/obstacles_1_iter.yaml

Observed average eval return: -27.73971176147461

MBRL Algorithm with On-Policy Data Collection

In code:

# Sample candidate action sequences uniformly within the action bounds
action_sequences = np.random.uniform(
    self.env.action_space.low,
    self.env.action_space.high,
    size=(self.mpc_num_action_sequences, self.mpc_horizon, self.ac_dim),
)

if self.mpc_strategy == "random":
    # Random shooting: score every sequence and execute only the first
    # action of the best one (the rest is replanned at the next step)
    rewards = self.evaluate_action_sequences(obs, action_sequences)
    best_index = np.argmax(rewards)
    return action_sequences[best_index][0]
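The surrounding on-policy loop is the standard MBRL recipe: act with the MPC policy, grow the dataset, and refit the ensemble each iteration. A hedged outline is below; the buffer and helper names (replay_buffer.insert, mpc_agent.update) are placeholders, not the actual homework API, and the Gymnasium-style env.step/env.reset return values are also an assumption:

def run_mbrl(env, mpc_agent, replay_buffer, num_iters, steps_per_iter,
             train_steps_per_iter, batch_size):
    """Hypothetical outline of on-policy MBRL with an MPC controller (not the HW4 code)."""
    ob, _ = env.reset()
    for itr in range(num_iters):
        # 1. Collect on-policy data by planning with the current learned models
        #    (iteration 0 typically uses random actions instead)
        for _ in range(steps_per_iter):
            ac = mpc_agent.get_action(ob)
            next_ob, rew, terminated, truncated, _ = env.step(ac)
            replay_buffer.insert(ob, ac, rew, next_ob, terminated)
            ob = next_ob if not (terminated or truncated) else env.reset()[0]

        # 2. Refit every dynamics model in the ensemble on all data collected so far
        for _ in range(train_steps_per_iter):
            batch = replay_buffer.sample(batch_size)
            mpc_agent.update(batch["obs"], batch["acs"], batch["next_obs"])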
python cs285/scripts/run_hw4.py -cfg experiments/mpc/obstacles_multi_iter.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_multi_iter.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_multi_iter.yaml

Hyperparameter Ablation

Hyperparameters Changed:

  • Number of models in the ensemble
  • Number of random action sequences considered during action selection
  • MPC planning horizon

We then run:

python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_action_sequences_decreased.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_action_sequences_increased.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_ensemble_decreased.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_ensemble_increased.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_horizon_decreased.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/reacher_ablation_horizon_increased.yaml
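To compare the ablation runs, one option is to read the scalar curves straight from the TensorBoard event files each run writes. A sketch, where the log directory layout and the scalar tag name ("eval_return") are assumptions:

import glob

from tensorboard.backend.event_processing import event_accumulator

def load_eval_returns(logdir, tag="eval_return"):
    """Return (steps, values) for a scalar tag from one run's event files (sketch)."""
    ea = event_accumulator.EventAccumulator(logdir)
    ea.Reload()
    events = ea.Scalars(tag)
    return [e.step for e in events], [e.value for e in events]

# Hypothetical usage: compare each reacher ablation run by its final eval return
for run_dir in sorted(glob.glob("data/*reacher_ablation*")):
    steps, values = load_eval_returns(run_dir)
    print(run_dir, values[-1])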

CEM Action Selection

for i in range(self.cem_num_iters):
    if i == 0:
        # Initialize the sampling distribution from the uniform random sequences
        elite_mean = np.mean(action_sequences, axis=0)
        elite_std = np.std(action_sequences, axis=0)
    else:
        # Resample candidate sequences from the current elite distribution
        action_sequences = np.random.normal(
            elite_mean, elite_std,
            size=(self.mpc_num_action_sequences, self.mpc_horizon, self.ac_dim)
        )
        action_sequences = np.clip(
            action_sequences,
            self.env.action_space.low,
            self.env.action_space.high
        )

    # Keep the top-k sequences by predicted return
    rewards = self.evaluate_action_sequences(obs, action_sequences)
    elite_indices = np.argsort(rewards)[-self.cem_num_elites:]
    elite_action_sequences = action_sequences[elite_indices]

    # Smooth the elite statistics toward the previous iterate (standard CEM update)
    elite_mean = self.cem_alpha * np.mean(elite_action_sequences, axis=0) + (1 - self.cem_alpha) * elite_mean
    elite_std = self.cem_alpha * np.std(elite_action_sequences, axis=0) + (1 - self.cem_alpha) * elite_std

return elite_mean[0]
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_cem.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_cem_iters_2.yaml

MBPO Variant

In code:

for _ in range(rollout_len):
    # Act with the current SAC policy, but step the learned model, not the real env
    ac = sac_agent.get_action(ob)
    next_ob = mb_agent.get_dynamics_predictions(
        np.random.randint(mb_agent.ensemble_size), ob, ac
    )
    rew, done = env.get_reward(next_ob, ac)

    # The (ob, ac, rew, next_ob, done) transition goes into SAC's replay buffer,
    # and the rollout continues from the model-predicted state
    ob = next_ob

Experiments:

  • Model-free SAC baseline (rollout length 0)
  • Dyna-like algorithm (rollout length 1)
  • Full MBPO (rollout length 10)
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_mbpo.yaml --sac_config_file experiments/sac/halfcheetah_clipq_dyna.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_mbpo.yaml --sac_config_file experiments/sac/halfcheetah_clipq_mbpo.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_mbpo.yaml --sac_config_file experiments/sac/halfcheetah_clipq_sac_baseline.yaml
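The three settings differ only in how many model steps are branched off each real state before the transitions are handed to SAC. A hedged sketch of one such epoch; the buffer and update method names (real_buffer.sample, sac_buffer.insert, sac_agent.update) are placeholders, not the homework API:

import numpy as np

def mbpo_epoch(env, mb_agent, sac_agent, real_buffer, sac_buffer,
               rollout_len, num_rollouts, sac_updates, batch_size):
    """One hypothetical MBPO-style epoch (sketch, not the HW4 code)."""
    # Model rollouts branch off states that were actually visited in the real env
    start_obs = real_buffer.sample(num_rollouts)["obs"]
    for ob in start_obs:
        for _ in range(rollout_len):   # 0 = model-free baseline, 1 = Dyna-like, 10 = full MBPO
            ac = sac_agent.get_action(ob)
            next_ob = mb_agent.get_dynamics_predictions(
                np.random.randint(mb_agent.ensemble_size), ob, ac
            )
            rew, done = env.get_reward(next_ob, ac)
            sac_buffer.insert(ob, ac, rew, next_ob, done)
            if done:
                break
            ob = next_ob

    # SAC then trains on its buffer as usual (real transitions plus the
    # model-generated ones added above)
    for _ in range(sac_updates):
        sac_agent.update(sac_buffer.sample(batch_size))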
