## NPTEL Reinforcement Learning Week 3 Assignment Answers 2024

1. Which of the following is true for an MDP?

- Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1})
- Pr(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, s_{t−2}, a_{t−2}, …, s_0, a_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t)
- Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_0, a_0)
- Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_t, r_t | s_{t−1}, a_{t−1})

Answer :-
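The Markov property in question 1 can be illustrated with a minimal sketch (the two-state MDP below is a made-up example): the distribution over the next state and reward is a function of the current (state, action) pair only, so the sampler never needs to look at earlier history.

```python
import random

# Hypothetical two-state MDP: each (state, action) pair maps to a list of
# ((next_state, reward), probability) outcomes. Nothing here depends on
# any state or action before time t.
P = {
    ("s0", "a0"): [(("s0", 0.0), 0.5), (("s1", 1.0), 0.5)],
    ("s0", "a1"): [(("s1", 1.0), 1.0)],
    ("s1", "a0"): [(("s0", 0.0), 1.0)],
    ("s1", "a1"): [(("s1", 1.0), 1.0)],
}

def step(state, action):
    """Sample (next_state, reward) from Pr(s_{t+1}, r_{t+1} | s_t, a_t)."""
    outcomes, probs = zip(*P[(state, action)])
    return random.choices(outcomes, weights=probs)[0]

state = "s0"
for _ in range(5):
    action = random.choice(["a0", "a1"])
    state, reward = step(state, action)  # history beyond (s, a) is irrelevant
```

Because `step` reads only `(state, action)`, conditioning on the full history would give the same distribution, which is exactly the equality the Markov property asserts.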

2. The baseline in the REINFORCE update should not depend on which of the following (without voiding any of the steps in the proof of REINFORCE)?

- r_{n−1}
- r_{n}
- Action taken (a_{n})
- None of the above

Answer :-
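To make the role of the baseline in question 2 concrete, here is a sketch of the gradient-bandit (REINFORCE-style) update with a softmax policy. The arm count, reward means, and step size are illustrative choices, not from the assignment; the baseline used is the running average reward, which does not depend on the action taken.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(h):
    z = np.exp(h - h.max())
    return z / z.sum()

# Hypothetical 3-arm bandit; the true mean rewards are unknown to the learner.
true_means = np.array([0.1, 0.5, 0.9])
h = np.zeros(3)    # action preferences (policy parameters)
baseline = 0.0     # running average reward: action-independent
alpha = 0.1

for t in range(2000):
    pi = softmax(h)
    a = rng.choice(3, p=pi)
    r = rng.normal(true_means[a], 0.1)
    baseline += (r - baseline) / (t + 1)
    # For a softmax policy, grad log pi(a) = one_hot(a) - pi.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    h += alpha * (r - baseline) * grad_log_pi
```

Subtracting an action-independent baseline leaves the expected update unchanged (it reduces variance only); a baseline that depended on a_{n} would break that unbiasedness argument in the REINFORCE proof.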

3. In many supervised machine learning algorithms, such as neural networks, we rely on the gradient descent technique. However, in the policy gradient approach to bandit problems, we made use of gradient ascent. This discrepancy can mainly be attributed to the differences in

- the objectives of the learning tasks
- the parameters of the functions whose gradients are being calculated
- the nature of the feedback received by the algorithms

Answer :-
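The descent-versus-ascent distinction in question 3 is only a sign convention: minimizing a loss and maximizing an objective use the same machinery. A tiny sketch with a made-up one-dimensional objective (not from the assignment) shows ascent on J coinciding with descent on −J:

```python
# Hypothetical 1-D objective J(theta) = -(theta - 2)^2, maximized at theta = 2.
# Gradient ascent on J is identical to gradient descent on the loss -J.
def grad_J(theta):
    return -2.0 * (theta - 2.0)

theta_ascent, theta_descent = 0.0, 0.0
lr = 0.1
for _ in range(100):
    theta_ascent += lr * grad_J(theta_ascent)       # ascent on the objective J
    theta_descent -= lr * (-grad_J(theta_descent))  # descent on the loss -J
```

Both parameters follow the exact same trajectory toward theta = 2, which is why the real difference between the supervised and policy-gradient settings lies in what is being optimized, not in the optimizer.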

4. In the case of linear bandits, suppose we have two actions, a_{1} and a_{2}. The policy π to be followed when encountering a state s is given by

Answer :-

Answer :-

Answer :-

7. The actions in contextual bandits do not determine the next state, but typically do in full RL problems. True or false?

- True
- False

Answer :-
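The contrast in question 7 can be sketched in a few lines (both environments below are made-up examples): in a contextual bandit the next context is drawn afresh regardless of the action, while in a full RL problem the action determines where the agent ends up.

```python
import random

# Contextual bandit: the next context is sampled independently of the action,
# so the agent's choice only affects the immediate reward.
def bandit_next_context(_context, _action):
    return random.choice(["x0", "x1"])

# Full RL (hypothetical 5-state chain MDP): the chosen action moves the state,
# so today's action shapes the states seen tomorrow.
def mdp_next_state(state, action):
    delta = 1 if action == "right" else -1
    return min(max(state + delta, 0), 4)
```

The `_action` argument in `bandit_next_context` is deliberately unused: that independence is precisely what separates contextual bandits from the full RL setting.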

8. In a continuous action space environment, we can employ any value function-based algorithm to discover an optimal policy.

- True
- False

Answer :-

Answer :-

10. In solving a multi-arm bandit problem using the policy gradient method, are we assured of converging to the optimal solution?

- No
- Yes

Answer :-