```
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_0_iter_layer_1_size_32.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_0_iter_layer_1_size_16.yaml
python cs285/scripts/run_hw4.py -cfg experiments/mpc/halfcheetah_0_iter_layer_2_size_16.yaml
```

### Get Predictions

```python
pred_obs_deltas_normalized = self.dynamics_models[i](...)
pred_obs_deltas = pred_obs_deltas_normalized * self.obs_delta_std + self.obs_delta_mean
pred_next_obs = obs + pred_obs_deltas
```

### Action Selection

```python
rewards = np.array([
```
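The action-selection snippet above is cut off. For context, this is the step where MPC scores candidate action sequences under the learned model and executes only the first action of the best one. A minimal random-shooting sketch, where every function and parameter name is an illustrative assumption rather than the assignment's actual API:

```python
import numpy as np

def random_shooting_mpc(reward_fn, predict_next_obs, obs, ac_dim,
                        num_sequences=1000, horizon=10, ac_low=-1.0, ac_high=1.0):
    """Return the first action of the candidate sequence with the best
    predicted return (hypothetical helper; reward_fn and predict_next_obs
    stand in for the env reward model and the learned dynamics model)."""
    # Sample candidate action sequences uniformly at random.
    candidates = np.random.uniform(ac_low, ac_high,
                                   size=(num_sequences, horizon, ac_dim))
    rewards = np.zeros(num_sequences)
    for n in range(num_sequences):
        o = obs
        for t in range(horizon):
            rewards[n] += reward_fn(o, candidates[n, t])
            o = predict_next_obs(o, candidates[n, t])  # roll out the learned model
    best = np.argmax(rewards)
    return candidates[best, 0]  # MPC: execute only the first action, then replan
```

Replanning at every step is what makes this MPC rather than open-loop control: model errors compound over the horizon, so only the first action is trusted.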
Note: all YAML files are in the git repo: https://github.com/jimchen2/cs285-reinforcement-learning

```
python cs285/scripts/run_hw5_explore.py \
python cs285/scripts/run_hw5_explore.py \
python cs285/scripts/run_hw5_explore.py \
```

### Random Network Distillation

The Random Network Distillation algorithm encourages exploration by training a predictor network to approximate the output of a second, fixed, randomly initialized target network. States the agent has rarely visited yield a large prediction error, which is used as an exploration bonus.
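The idea above can be sketched in a few lines of PyTorch; the layer sizes and class/method names here are illustrative assumptions, not the assignment's actual code:

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation sketch: the exploration bonus is the
    prediction error of a trained predictor against a frozen random target."""
    def __init__(self, ob_dim, feature_dim=32):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(ob_dim, 64), nn.ReLU(),
                                    nn.Linear(64, feature_dim))
        self.predictor = nn.Sequential(nn.Linear(ob_dim, 64), nn.ReLU(),
                                       nn.Linear(64, feature_dim))
        for p in self.target.parameters():  # the target is never trained
            p.requires_grad_(False)

    def bonus(self, obs):
        # High prediction error => rarely visited state => large bonus.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)

# Training the predictor on visited states drives the bonus down there,
# so only genuinely novel states keep a high bonus.
rnd = RND(ob_dim=4)
opt = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-3)
obs = torch.randn(8, 4)
loss = rnd.bonus(obs).mean()
opt.zero_grad(); loss.backward(); opt.step()
```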
1. Compute the action, using epsilon-greedy exploration:

   ```python
   if random.random() < epsilon:
       action = torch.tensor(random.randint(0, self.num_actions - 1))
   else:
       action = self.critic(observation).argmax(dim=1)
   ```

2. Step the environment with the chosen action.
3. Add the transition to the replay buffer: `replay_buffer.insert(...)`
4. Sample a batch from the replay buffer: `batch = replay_buffer.sample(config["batch_size"])`
5. Train the agent: we update the critic on the sampled batch, periodically syncing the target network.
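Step 5 above can be sketched as a single DQN update; the batch keys, argument names, and network shapes are assumptions for illustration, not the assignment's actual interface:

```python
import torch
import torch.nn as nn

def dqn_update(critic, target_critic, optimizer, batch, gamma=0.99):
    """One DQN gradient step on a sampled batch (a dict of tensors;
    the key names here are hypothetical)."""
    obs, actions = batch["observations"], batch["actions"]
    rewards, next_obs, dones = (batch["rewards"], batch["next_observations"],
                                batch["dones"])
    with torch.no_grad():
        # Bootstrap from the frozen target network.
        next_q = target_critic(next_obs).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q
    # Q-values of the actions actually taken.
    q = critic(obs).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Computing the target under `torch.no_grad()` with a separate target network is what keeps the regression target fixed during each step, which stabilizes training.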
There are two kinds of estimators for policy gradients: the full-trajectory estimator and the "reward-to-go" estimator. We run the two configurations on CartPole with different parameters; in the run names, `rtg` means reward-to-go and `na` means normalizing the advantages.
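The difference between the two estimators comes down to how per-timestep returns are computed; a minimal sketch (the discount factor and function names are illustrative):

```python
import numpy as np

def full_trajectory_returns(rewards, gamma=1.0):
    """Every timestep is weighted by the same total trajectory return."""
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return np.full(len(rewards), total)

def reward_to_go(rewards, gamma=1.0):
    """Each timestep is weighted only by the return from that step onward,
    which lowers variance without biasing the gradient."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```

For rewards `[1, 2, 3]` with no discounting, the full-trajectory estimator weights every step by 6, while reward-to-go gives `[6, 5, 3]`.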
We run imitation learning and DAgger based on expert policies. In this experiment the expert policy is itself a trained neural network that can be queried directly, so DAgger differs from real-world applications in that relabeling visited states with expert actions is free and automatic, whereas in practice a human expert would have to provide those labels.
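The DAgger loop this describes can be sketched as follows; every helper function here is a hypothetical placeholder for the corresponding piece of the experiment:

```python
def dagger(expert_policy, train_policy_fn, rollout_fn, n_iters=5):
    """DAgger sketch: roll out the current learner, relabel the visited
    states with the expert, aggregate the data, and retrain.
    (expert_policy, train_policy_fn, rollout_fn are assumed interfaces.)"""
    dataset_obs, dataset_acts = [], []
    policy = None  # first iteration: rollout_fn may fall back to the expert
    for _ in range(n_iters):
        obs = rollout_fn(policy)                # states the learner visits
        acts = [expert_policy(o) for o in obs]  # expert relabels every state
        dataset_obs.extend(obs)
        dataset_acts.extend(acts)
        policy = train_policy_fn(dataset_obs, dataset_acts)
    return policy
```

The key point is that the expert labels states from the *learner's* own distribution, which is exactly the step that is cheap here (querying a network) but expensive with a human expert.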