We run imitation learning (behavioral cloning) and DAgger using expert policies. In this experiment the expert policy is queried directly from a trained neural network, so DAgger differs from real-world applications in that the relabeling happens automatically rather than through manual correction by a human expert.
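Conceptually, each DAgger iteration chains the steps detailed in the sections below. Here is a minimal sketch of that loop; names like expert_paths and num_train_steps are hypothetical placeholders, add_rollouts is an assumed replay-buffer method, and the real run_hw1.py adds evaluation and logging around this.

# Minimal DAgger loop sketch; iteration 0 reduces to plain behavioral cloning.
for itr in range(n_iter):
    if itr == 0:
        paths = expert_paths  # provided expert demonstrations
    else:
        # Roll out the current actor, then relabel its actions with the expert.
        paths, _ = utils.sample_trajectories(env, actor, params['batch_size'], params['ep_len'])
        for path in paths:
            path["action"] = expert_policy.get_action(path["observation"])
    replay_buffer.add_rollouts(paths)
    for _ in range(num_train_steps):
        # Sample a random batch from the aggregated data and take a gradient step.
        idx = np.random.permutation(replay_buffer.obs.shape[0])[:params['batch_size']]
        actor.update(replay_buffer.obs[idx], replay_buffer.acs[idx])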
Gym Environment
Step the Env
ac = ptu.to_numpy(policy(ptu.from_numpy(ob.reshape(1, -1))).sample())  # Sample an action from the policy's output distribution
ac = ac[0]  # Retrieve the action value from the batch dimension
next_ob, rew, done, _ = env.step(ac)  # Advance the environment by one step
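Putting the stepping code to work, a single rollout might look like the following sketch. It assumes the pre-0.26 Gym step API shown above and a fixed maximum episode length; the path dictionary keys mirror the relabeling snippet later in this section.

ob = env.reset()
obs, acs, rews = [], [], []
for _ in range(params['ep_len']):
    # Sample an action, record the transition, and advance the environment.
    ac = ptu.to_numpy(policy(ptu.from_numpy(ob.reshape(1, -1))).sample())
    ac = ac[0]
    obs.append(ob)
    acs.append(ac)
    ob, rew, done, _ = env.step(ac)
    rews.append(rew)
    if done:
        break
path = {"observation": np.array(obs), "action": np.array(acs), "reward": np.array(rews)}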
Collecting Trajectories
paths, envsteps_this_batch = utils.sample_trajectories(env, actor, params['batch_size'], params['ep_len'])
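One plausible implementation of utils.sample_trajectories simply keeps rolling out episodes until at least batch_size environment steps have been collected. This is a sketch; sample_trajectory here stands for a single-rollout helper like the one sketched above, and the actual utils function may differ.

def sample_trajectories(env, policy, min_timesteps_per_batch, max_path_length):
    # Collect rollouts until the batch contains at least min_timesteps_per_batch steps.
    paths, timesteps_this_batch = [], 0
    while timesteps_this_batch < min_timesteps_per_batch:
        path = sample_trajectory(env, policy, max_path_length)  # one rollout
        paths.append(path)
        timesteps_this_batch += len(path["reward"])
    return paths, timesteps_this_batch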
Get Expert Policy
# Relabel the actions in each path with the expert policy (the DAgger step)
for path in paths:
    path["action"] = expert_policy.get_action(path["observation"])
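For reference, the expert itself is just a pretrained policy loaded from the .pkl file passed on the command line. A sketch of that setup, assuming the starter code's LoadedGaussianPolicy wrapper; the exact class and module path may differ in a given year's starter code.

# Load the pretrained expert policy (assumed LoadedGaussianPolicy wrapper).
from cs285.policies.loaded_gaussian_policy import LoadedGaussianPolicy

expert_policy = LoadedGaussianPolicy(params['expert_policy_file'])
# get_action maps a batch of observations to the expert's actions, which is all
# the relabeling step above needs. Relabeling only matters from the second
# iteration on; iteration 0 trains directly on the provided expert data.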
Get Sampled Data
# Draw a random batch (without replacement) from the aggregated replay buffer
rand_indices = np.random.permutation(replay_buffer.obs.shape[0])[:params['batch_size']]
ob_batch, ac_batch = replay_buffer.obs[rand_indices], replay_buffer.acs[rand_indices]
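The replay buffer only needs to expose flat obs and acs arrays built from the collected paths for this sampling to work. A minimal sketch; the starter code's ReplayBuffer also stores rewards, next observations, and terminals.

import numpy as np

class ReplayBuffer:
    def __init__(self):
        self.obs, self.acs = None, None

    def add_rollouts(self, paths):
        # Flatten the per-path arrays and append them to the aggregated dataset.
        obs = np.concatenate([path["observation"] for path in paths])
        acs = np.concatenate([path["action"] for path in paths])
        self.obs = obs if self.obs is None else np.concatenate([self.obs, obs])
        self.acs = acs if self.acs is None else np.concatenate([self.acs, acs])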
Training the Actor
train_log = actor.update(ob_batch, ac_batch)
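actor.update performs one supervised learning step on the sampled batch. Below is a sketch of what that step might look like for a deterministic MLP policy trained with mean-squared error, reusing the ptu helpers from above; mean_net and optimizer are assumed attributes, and the starter code's MLPPolicySL may instead maximize the log-likelihood of a Gaussian action distribution.

import torch

def update(self, observations, actions):
    # Supervised regression of predicted actions onto the (expert) target actions.
    observations, actions = ptu.from_numpy(observations), ptu.from_numpy(actions)
    loss = torch.nn.functional.mse_loss(self.mean_net(observations), actions)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    return {'Training Loss': ptu.to_numpy(loss)}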
Behavioral Cloning
We run the following commands to train behavioral cloning agents on the expert datasets.
Ant
python cs285/scripts/run_hw1.py --expert_policy_file cs285/policies/experts/Ant.pkl --env_name Ant-v4 --exp_name bc_ant --n_iter 1 --expert_data cs285/expert_data/expert_data_Ant-v4.pkl --video_log_freq -1
python cs285/scripts/run_hw1.py --expert_policy_file cs285/policies/experts/Ant.pkl --env_name Ant-v4 --exp_name bc_ant --n_iter 1 --expert_data cs285/expert_data/expert_data_Ant-v4.pkl --video_log_freq -1 --batch_size 1000 --eval_batch_size 30000 --ep_len 1000 --n_layers 2 --size 32 --learning_rate 3e-2
python cs285/scripts/run_hw1.py --expert_policy_file cs285/policies/experts/Ant.pkl --env_name Ant-v4 --exp_name bc_ant --n_iter 1 --expert_data cs285/expert_data/expert_data_Ant-v4.pkl --video_log_freq -1 --batch_size 100 --eval_batch_size 30000 --ep_len 1000 --n_layers 2 --size 16 --learning_rate 4e-2
HalfCheetah
python cs285/scripts/run_hw1.py --expert_policy_file cs285/policies/experts/HalfCheetah.pkl --env_name HalfCheetah-v4 --exp_name bc_halfcheetah --n_iter 1 --expert_data cs285/expert_data/expert_data_HalfCheetah-v4.pkl --video_log_freq -1
python cs285/scripts/run_hw1.py --expert_policy_file cs285/policies/experts/HalfCheetah.pkl --env_name HalfCheetah-v4 --exp_name bc_halfcheetah --n_iter 1 --expert_data cs285/expert_data/expert_data_HalfCheetah-v4.pkl --video_log_freq -1 --batch_size 100 --eval_batch_size 30000 --ep_len 1000 --n_layers 3 --size 64 --learning_rate 2e-2
| Environment | Configuration | Mean Return | Return Std Dev |
|---|---|---|---|
| Ant | Default (no extra flags) | 3,401.8508 | 0 |
| Ant | --size 32 --learning_rate 3e-2 | 4,585.5176 | 108.1566 |
| Ant | --batch_size 100 --size 16 --learning_rate 4e-2 | 4,170.0972 | 829.4213 |
| HalfCheetah | Default (no extra flags) | 3,264.8667 | 0 |
| HalfCheetah | --batch_size 100 --n_layers 3 --size 64 --learning_rate 2e-2 | 3,765.103 | 106.6637 |
DAgger
Add --do_dagger to the end of each command above and adjust the number of iterations (--n_iter).
Ant
python cs285/scripts/run_hw1.py --expert_policy_file cs285/policies/experts/Ant.pkl --env_name Ant-v4 --exp_name dagger_ant --n_iter 10 --do_dagger --expert_data cs285/expert_data/expert_data_Ant-v4.pkl --video_log_freq -1
HalfCheetah
python cs285/scripts/run_hw1.py --expert_policy_file cs285/policies/experts/HalfCheetah.pkl --env_name HalfCheetah-v4 --exp_name bc_halfcheetah --n_iter 10 --expert_data cs285/expert_data/expert_data_HalfCheetah-v4.pkl --video_log_freq -1 --do_dagger
python cs285/scripts/run_hw1.py --expert_policy_file cs285/policies/experts/HalfCheetah.pkl --env_name HalfCheetah-v4 --exp_name bc_halfcheetah --n_iter 5 --expert_data cs285/expert_data/expert_data_HalfCheetah-v4.pkl --video_log_freq -1 --batch_size 100 --eval_batch_size 10000 --ep_len 1000 --n_layers 3 --size 64 --learning_rate 1e-2 --do_dagger
python cs285/scripts/run_hw1.py --expert_policy_file cs285/policies/experts/HalfCheetah.pkl --env_name HalfCheetah-v4 --exp_name bc_halfcheetah --n_iter 10 --expert_data cs285/expert_data/expert_data_HalfCheetah-v4.pkl --video_log_freq -1 --batch_size 10000 --eval_batch_size 10000 --ep_len 1000 --n_layers 2 --size 64 --learning_rate 4e-3 --do_dagger