IQN-DQN.ipynb max over taus instead of max over actions? #9

Open
IgorAherne opened this issue Apr 2, 2023 · 2 comments

IgorAherne commented Apr 2, 2023

Hello,

Given that `forward()` returns the tuple `out.view(batch_size, num_tau, self.num_actions), taus`, should we use `.max(1)` instead of `.max(2)`?

Currently it is
`Q_targets_next = Q_targets_next.detach().max(2)[0].unsqueeze(1) # (batch_size, 1, N)`
but maybe it should be
`Q_targets_next = Q_targets_next.detach().max(1)[0].unsqueeze(1) # (batch_size, 1, numActions)`

In other words, should we take the maximum over the taus rather than over the actions?
Sorry if I misunderstood the process.
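For concreteness, here is a quick shape check of the two variants, with made-up sizes (`batch_size=4`, `num_tau=8`, `num_actions=2`) matching the `(batch, taus, actions)` layout that `forward()` returns:

```python
import torch

batch_size, num_tau, num_actions = 4, 8, 2
q = torch.randn(batch_size, num_tau, num_actions)  # same layout as forward(): (batch, taus, actions)

# .max(2): best value across actions, one value per tau
print(q.detach().max(2)[0].unsqueeze(1).shape)  # torch.Size([4, 1, 8])

# .max(1): best value across taus, one value per action
print(q.detach().max(1)[0].unsqueeze(1).shape)  # torch.Size([4, 1, 2])
```

The full `learn()` method from the notebook, with the line in question marked: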

```python
    def learn(self, experiences):
        """Update value parameters using given batch of experience tuples.
        Params
        ======
            experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tuples
            gamma (float): discount factor
        """
        self.optimizer.zero_grad()
        states, actions, rewards, next_states, dones = experiences

        # Get max predicted Q values (for next states) from target model
        Q_targets_next, _ = self.qnetwork_target(next_states)
        Q_targets_next = Q_targets_next.detach().max(2)[0].unsqueeze(1)  # (batch_size, 1, N)    <-------- HERE

        # Compute Q targets for current states
        Q_targets = rewards.unsqueeze(-1) + (self.GAMMA**self.n_step * Q_targets_next * (1. - dones.unsqueeze(-1)))

        # Get expected Q values from local model
        Q_expected, taus = self.qnetwork_local(states)
        Q_expected = Q_expected.gather(2, actions.unsqueeze(-1).expand(self.BATCH_SIZE, 8, 1))

        # Quantile Huber loss
        td_error = Q_targets - Q_expected
        assert td_error.shape == (self.BATCH_SIZE, 8, 8), "wrong td error shape"
        huber_l = calculate_huber_loss(td_error, 1.0)
        quantil_l = abs(taus - (td_error.detach() < 0).float()) * huber_l / 1.0

        loss = quantil_l.sum(dim=1).mean(dim=1)  # keepdim=True if PER weights get multiplied
        loss = loss.mean()

        # Minimize the loss
        loss.backward()
        # clip_grad_norm_(self.qnetwork_local.parameters(), 1)
        self.optimizer.step()

        # ------------------- update target network ------------------- #
        self.soft_update(self.qnetwork_local, self.qnetwork_target)
        return loss.detach().cpu().numpy()
```
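For completeness, `calculate_huber_loss` is not shown above; I assume it is the usual element-wise Huber loss with threshold `k`, roughly:

```python
import torch

def calculate_huber_loss(td_errors, k=1.0):
    # Assumed helper: quadratic for |delta| <= k, linear beyond, applied element-wise
    return torch.where(
        td_errors.abs() <= k,
        0.5 * td_errors.pow(2),
        k * (td_errors.abs() - 0.5 * k),
    )
```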
IgorAherne changed the title from "IQN-DQN.ipynb max over actions instead of max over taus?" to "IQN-DQN.ipynb max over taus instead of max over actions?" Apr 2, 2023
BY571 (Owner) commented Apr 5, 2023

Hey, I haven't looked at the code in a while. Can you point to the part of the paper that made you think it should be the max over the taus?

IgorAherne (Author) commented Apr 8, 2023

Hi Sebastian,

From page 5 of the paper https://arxiv.org/pdf/1806.06923.pdf. These equations are a bit tough for me, but looking at equations (2) and (3) there:

[screenshot of equations (2) and (3) from the paper]
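In case the screenshot does not come through, this is my paraphrase of the definitions I mean (risk-neutral case; the exact equation numbers may differ):

$$Q(x, a) := \mathbb{E}_{\tau \sim U([0,1])}\big[Z_\tau(x, a)\big], \qquad \pi(x) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q(x, a)$$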

IgorAherne reopened this Apr 8, 2023