# NPTEL Reinforcement Learning Week 6 Assignment Answers 2024

1. Assertion: In order to use importance sampling for off-policy Monte-Carlo policy evaluation, we require knowledge of the state transition probabilities of the MDP.
Reason: We require knowledge of the probability of each trajectory ξ according to the estimation policy and according to the behaviour policy. Both of these probabilities depend on the state transition probabilities of the MDP.

• Both Assertion and Reason are true, and Reason is the correct explanation for Assertion.
• Both Assertion and Reason are true, but Reason is not the correct explanation for Assertion.
• Assertion is true, Reason is false
• Both Assertion and Reason are false
`Answer :- `
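One way to see what the importance ratio actually requires: the transition probabilities appear in both the numerator and denominator of the trajectory-probability ratio and cancel, leaving only a product of policy ratios. A small numeric sketch (all probabilities below are made up purely for illustration, not taken from the assignment):

```python
import math

# Hypothetical two-step trajectory; the numbers are illustrative only.
pi_probs = [0.9, 0.4]   # pi(a_t | s_t): target (estimation) policy
b_probs  = [0.5, 0.8]   # b(a_t | s_t): behaviour policy
p_trans  = [0.3, 0.6]   # P(s_{t+1} | s_t, a_t): unknown MDP transitions

# Full trajectory probabilities contain the transition terms ...
prob_pi = math.prod(p * t for p, t in zip(pi_probs, p_trans))
prob_b  = math.prod(p * t for p, t in zip(b_probs, p_trans))

# ... but they cancel in the importance ratio, which only needs the policies:
ratio_full   = prob_pi / prob_b
ratio_policy = math.prod(p / q for p, q in zip(pi_probs, b_probs))
assert math.isclose(ratio_full, ratio_policy)
```

The assertion at the end holds for any choice of transition probabilities, which is the key point when weighing the Assertion and Reason above.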

2. Which of the following are true?

• Dynamic programming methods use full backups and bootstrapping.
• Temporal-Difference methods use sample backups and bootstrapping.
• Monte-Carlo methods use sample backups and bootstrapping.
• Monte-Carlo methods use full backups and no bootstrapping.
`Answer :- `

3. Which of the following statement(s) is/are true for the UCT (Upper Confidence bounds applied to Trees) algorithm?

• We typically require a simulation model for the environment.
• It uses ϵ-greedy exploration for action selection.
• It computes an upper confidence bound on the value of (state, action) pairs.
• It is a variation of the Monte-Carlo tree search algorithm.
`Answer :- `

4. Consider the following statements:
(i) TD(0) methods use an unbiased sample of the return.
(ii) TD(0) methods use a sample of the reward from the distribution of rewards.
(iii) TD(0) methods use the current estimate of the value function.
Which of the above statements is/are true?

• (i), (ii)
• (i),(iii)
• (ii), (iii)
• (i), (ii), (iii)
`Answer :- `

5. Assertion: Q-learning can use asynchronous samples from different policies to update Q values.
Reason: Q-learning is an off-policy learning algorithm.

• Assertion and Reason are both true and Reason is a correct explanation of Assertion.
• Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
• Assertion is true but Reason is false.
• Assertion is false but Reason is true.
`Answer :- `
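The Reason above can be illustrated with a minimal tabular Q-learning sketch. The tiny environment below (3 states, 2 actions, a single rewarding pair) is hypothetical; the point is that the update bootstraps with a max over next actions, so transitions gathered by any behaviour policy can be fed in:

```python
import random

alpha, gamma = 0.5, 0.9
Q = {(s, a): 0.0 for s in range(3) for a in range(2)}

def q_update(s, a, r, s_next):
    # Off-policy target: bootstrap with max_a' Q(s', a'), regardless of
    # which policy produced the sample (s, a, r, s').
    target = r + gamma * max(Q[(s_next, b)] for b in range(2))
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Samples generated by an arbitrary (here: uniformly random) behaviour policy:
random.seed(0)
for _ in range(100):
    s, a = random.randrange(3), random.randrange(2)
    r = 1.0 if (s, a) == (2, 1) else 0.0       # only (2, 1) is rewarded
    s_next = random.randrange(3)
    q_update(s, a, r, s_next)
```

After these updates the rewarded pair (2, 1) carries the largest Q value, even though no single coherent policy generated the samples.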

6. Suppose, for a two-player game that we have modelled as an MDP, instead of learning a policy over the MDP directly, we separate the deterministic and stochastic results of playing an action to create ‘after-states’ (as discussed in the lectures). Consider the following statements:

(i) The set of states that make up ‘after-states’ may be different from the original set of states for the MDP.
(ii) The set of ‘after-states’ could be smaller than the original set of states for the MDP.

Which of the above statements is/are True?

• Only (i)
• Only (ii)
• Both (i) and (ii)
• Neither (i) nor (ii).
`Answer :- `

7.

`Answer :- `

8. Assertion: Having a simulator/model is an advantage when using rollout-based methods.
Reason: Multiple trajectories can be sampled from the model from any given state.

• Assertion and Reason are both true and Reason is a correct explanation of Assertion.
• Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
• Assertion is true but Reason is false.
• Assertion and Reason are both false.
`Answer :- `

9. Consider an MDP with two states, A and B. Given the single trajectory shown below (in the pattern state, reward, next state, …), use on-policy TD(0) updates to estimate the values of the two states.
A, 3, B, 2, A, 5, B, 2, A, 4, END
Assume a discount factor γ=1, a learning rate α=1, and initial state values of zero. What are the estimated values for the two states at the end of the sampled trajectory? (Note: you are not asked to compute the true values of the two states.)

• V(A)=2,V(B)=10
• V(A)=8,V(B)=7
• V(A)=4,V(B)=12
• V(A)=12,V(B)=7
`Answer :- `
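A short script can replay the given trajectory and apply the TD(0) update at each step; this just mechanises the hand computation the question asks for:

```python
# TD(0) evaluation over the single trajectory from the question:
# A, 3, B, 2, A, 5, B, 2, A, 4, END  with gamma = 1, alpha = 1, V = 0 initially.
gamma, alpha = 1.0, 1.0
V = {"A": 0.0, "B": 0.0, "END": 0.0}  # terminal state's value stays 0

transitions = [("A", 3, "B"), ("B", 2, "A"), ("A", 5, "B"),
               ("B", 2, "A"), ("A", 4, "END")]

for s, r, s_next in transitions:
    # TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

print(V["A"], V["B"])  # -> 4.0 12.0
```

With α=1 each update simply overwrites V(s) with r + V(s'), so the last update to each state determines its final estimate.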

10. Which of the following statements are true for SARSA?

• It uses bootstrapping to approximate the full return.
• It is an on-policy algorithm.
• It is a TD method.
• It always selects the greedy action choice.
`Answer :- `
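For contrast with Q-learning, here is a minimal SARSA update sketch (the Q-table and the sampled tuple are made up for illustration). The target bootstraps with Q(s', a') for the action a' actually selected by the behaviour policy, rather than a max over actions, which is what makes SARSA on-policy:

```python
alpha, gamma = 0.5, 0.9

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy TD target: r + gamma * Q(s', a') for the chosen next action a'.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

Q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
Q = sarsa_update(Q, 0, 1, 1.0, 1, 0)   # one sampled (s, a, r, s', a') tuple
print(Q[(0, 1)])  # -> 0.5
```

The single update moves Q(0, 1) halfway (α=0.5) toward the target 1.0 + 0.9·Q(1, 0) = 1.0, giving 0.5.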