# NPTEL Reinforcement Learning Week 3 Assignment Answers 2024


1. Which of the following is true for an MDP?

• Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})
• Pr(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, s_{t−2}, a_{t−2}, …, s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)
• Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)
• Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t−1}, a_{t−1})
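The property being tested here is the Markov property: the next state and reward depend only on the current state and action, not on the rest of the history. A minimal sketch (my own toy MDP, not from the assignment) makes this concrete, since the transition table is keyed *only* by the pair (state, action):

```python
import random

# Toy MDP (hypothetical example): next-state probabilities are keyed only
# by (state, action), so by construction the earlier history cannot
# influence the transition -- exactly the Markov property in question 1.
P = {
    ("s0", "a0"): {"s0": 0.3, "s1": 0.7},
    ("s0", "a1"): {"s0": 0.9, "s1": 0.1},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 0.2, "s1": 0.8},
}

def step(state, action, rng):
    """Sample the next state using only (state, action)."""
    dist = P[(state, action)]
    states, probs = zip(*dist.items())
    return rng.choices(states, weights=probs, k=1)[0]

rng = random.Random(0)
state = "s0"
for action in ["a0", "a1", "a0"]:
    state = step(state, action, rng)
print(state)
```

Because `step` never looks at anything but its two arguments, conditioning on the full history would change nothing, which is the second option above.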

2. The baseline in the REINFORCE update should not depend on which of the following (without voiding any of the steps in the proof of REINFORCE)?

• r_{n−1}
• r_n
• Action taken (a_n)
• None of the above
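The reason the baseline must not depend on the action is that the proof of REINFORCE relies on the score-function identity E_a[∇_θ log π(a)] = 0, so a term b · ∇_θ log π(a) with b independent of a cancels in expectation. A quick numerical check (my own illustration, using a softmax policy with made-up preferences) verifies the cancellation exactly:

```python
import numpy as np

# Sketch: for a softmax policy, grad_theta log pi(a) = one_hot(a) - pi,
# so the expectation over actions of b * grad log pi(a) is zero for any
# baseline b that does NOT depend on the sampled action a_n. If b varied
# with a_n, it could not be pulled out of the sum and the cancellation
# would fail, biasing the gradient estimate.
theta = np.array([0.5, -1.2, 2.0])           # hypothetical action preferences
pi = np.exp(theta) / np.exp(theta).sum()     # softmax policy

grads = np.eye(len(theta)) - pi              # row a = grad_theta log pi(a)

b = 3.7                                      # arbitrary action-independent baseline
expected = (pi[:, None] * (b * grads)).sum(axis=0)
print(expected)                              # ~ [0, 0, 0]
```

This is why the answer singles out the action a_n: rewards observed before the update (like r_{n−1}) are fine to fold into a baseline, but anything depending on the action itself breaks the unbiasedness proof.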

3. In many supervised machine learning algorithms, such as neural networks, we rely on the gradient descent technique. However, in the policy gradient approach to bandit problems, we made use of gradient ascent. This discrepancy can mainly be attributed to the differences in

• the objectives of the learning tasks
• the parameters of the functions whose gradients are being calculated
• the nature of the feedback received by the algorithms
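The distinction comes down to the objective: supervised learning *minimizes* a loss, so parameters step against the gradient, while policy gradient methods *maximize* expected reward, so parameters step along it. A minimal sketch (a hypothetical 1-D objective of my own, not from the course) shows ascent converging to the maximizer:

```python
# Gradient ascent on J(theta) = -(theta - 2)^2, which is maximized at
# theta = 2. The only difference from gradient descent is the sign of
# the update: theta += lr * grad instead of theta -= lr * grad.
def grad_J(theta):
    return -2.0 * (theta - 2.0)

theta, lr = 0.0, 0.1
for _ in range(200):
    theta += lr * grad_J(theta)   # ascent: step ALONG the gradient
print(round(theta, 4))            # converges near 2.0
```

Mechanically the two updates are mirror images; the sign flip reflects the different objectives of the learning tasks, which is the intended answer.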

4. In the case of linear bandits, suppose we have two actions, a_1 and a_2. The policy π to be followed when encountering a state s is given by
7. The actions in contextual bandits do not determine the next state, but they typically do in full RL problems. True or false?

• True
• False
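The contrast in question 7 is structural: in a contextual bandit the next context is drawn independently of the chosen action (the action only affects the reward), whereas in a full MDP the action steers the successor state. A toy sketch (entirely my own illustration, with made-up contexts and states) makes the difference explicit:

```python
import random

# Contextual bandit: the environment ignores the action when drawing the
# next context -- the action influences only the immediate reward.
# Full MDP: the action determines which state comes next.
rng = random.Random(42)

def bandit_next_context(action):
    # Note the argument is unused for the transition.
    return rng.choice(["ctx_a", "ctx_b"])

def mdp_next_state(state, action):
    # Here the action selects the successor state.
    return "left" if action == 0 else "right"

print(bandit_next_context(1), mdp_next_state("s", 0))
```

Since `bandit_next_context` never reads its action while `mdp_next_state` branches on it, the statement is true.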