Investigate high loss of Mixtral #931

casper-hansen · 2023-12-10T10:51:43Z

⚠️ Please check that this feature request hasn't been suggested before.

I searched previous Ideas in Discussions and didn't find any similar feature requests.
I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

The axolotl implementation of the Mixtral router is not aligned with the MegaBlocks implementation:

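```python
import torch

# Excerpt: __init__ and the helpers it sets up (self.layer, self.jitter, self._top_k) are omitted.
class LearnedRouter(torch.nn.Module):
    def forward(self, x):
        # Training only: apply multiplicative jitter to the router input.
        if self.training and self.args.moe_jitter_eps is not None:
            x = x * self.jitter(x)

        # Flatten to (num_tokens, hidden), then softmax over all experts.
        scores = self.layer(x.view(-1, x.shape[-1])).softmax(dim=-1)
        # Keep the top-k scores and their expert indices.
        expert_weights, expert_indices = self._top_k(scores)

        # Optionally replace the learned assignment with a uniform one.
        expert_indices = (
            _uniform_expert_assignment(expert_indices, self.args.moe_num_experts)
            if self.args.uniform_expert_assignment else expert_indices
        )
        return scores, expert_weights, expert_indices
```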
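To make the ordering in the excerpt above concrete, here is a minimal, framework-free sketch that routes a single token in two ways: softmax over all experts followed by top-k (the order shown in the excerpt), and top-k over the raw logits followed by softmax over only the selected experts. The function names and the example logits are hypothetical and only for illustration. Which ordering axolotl used at the linked commit is not shown here, and whether MegaBlocks renormalizes the weights later is outside the excerpt.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(values, k):
    """Return (values, indices) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order

def route_softmax_then_top_k(logits, k):
    """Softmax over all experts, then keep the k largest scores."""
    scores = softmax(logits)
    return top_k(scores, k)

def route_top_k_then_softmax(logits, k):
    """Keep the k largest logits, then softmax over only those k."""
    selected, indices = top_k(logits, k)
    return softmax(selected), indices

# Hypothetical router logits for one token and 8 experts.
logits = [2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]
w_a, idx_a = route_softmax_then_top_k(logits, k=2)
w_b, idx_b = route_top_k_then_softmax(logits, k=2)

print("softmax -> top-k:", idx_a, "weight sum =", round(sum(w_a), 3))
print("top-k -> softmax:", idx_b, "weight sum =", round(sum(w_b), 3))
```

Both orderings select the same experts and keep the same ratio between the selected weights. The difference is scale: weights taken from a softmax over all experts sum to less than 1 unless they are renormalized, while a softmax over only the selected logits always sums to 1.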

The current implementation:

https://github.com/OpenAccess-AI-Collective/axolotl/blob/68b227a7d8045d0f428d7ca3b9750f837d03611f/src/axolotl/models/mixtral/modeling_moe_mistral.py#L223-L232

Mistral also commented on this:

✔️ Solution

Investigate how we can adopt the most correct implementation of the router. One way to test a candidate is to measure the initial loss, i.e. the loss at the start of training. For reference, when I implemented sliding windows for Mistral, the initial loss dropped from 9.98 on main to 1.9 with the PR.

Measure the loss on both short- and long-context data, e.g. casperhansen/longalpaca_1k_test in alpaca format.
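As a rough yardstick for reading an initial loss, the sketch below compares a measured cross-entropy against the loss of a uniform guess over the vocabulary, which is ln(V). It assumes a vocabulary size of 32,000 and a loss reported in nats per token; both are assumptions, and the helper names are hypothetical. The two loss values are the ones quoted above for the Mistral sliding-window change, not measurements of Mixtral.

```python
import math

def mean_cross_entropy(target_probs):
    """Mean negative log-likelihood, given the probability assigned to each correct token."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def uniform_baseline_loss(vocab_size):
    """Cross-entropy of a model that spreads probability evenly over the vocabulary."""
    return math.log(vocab_size)

def perplexity(loss):
    """Perplexity corresponding to a cross-entropy loss measured in nats."""
    return math.exp(loss)

VOCAB_SIZE = 32_000  # assumption: tokenizer vocabulary size
baseline = uniform_baseline_loss(VOCAB_SIZE)

# Loss values quoted in the issue for the Mistral sliding-window example.
for label, loss in [("main", 9.98), ("PR", 1.9)]:
    print(f"{label}: loss={loss:.2f}  perplexity={perplexity(loss):.1f}  "
          f"uniform baseline={baseline:.2f}")
```

Under these assumptions the uniform baseline is about 10.37, so an initial loss near 9.98 on a pretrained checkpoint is close to random guessing and points to a modeling bug rather than a data problem.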

❓ Alternatives

No response

📝 Additional Context

No response

Acknowledgements

My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this feature has not been requested yet.
I have provided enough information for the maintainers to understand and evaluate this request.


casper-hansen added the "enhancement" (New feature or request) label on Dec 10, 2023

casper-hansen mentioned this issue Dec 10, 2023

Mixtral: More correct MoE, lower loss #932

Merged

winglian closed this as completed in #932 Dec 10, 2023
