## Multi-Gate Mixture of Experts MMoE

Tags: #machine learning #multi task### Equation

$$g^{k}(x)=\text{softmax}(W_{gk}x) \\ f^{k}(x)=\sum^{n}_{i=1}g^{k}(x)_{i}f_{i}(x) \\ y_{k}=h^{k}(f^{k}(x))$$### Latex Code

g^{k}(x)=\text{softmax}(W_{gk}x) \\ f^{k}(x)=\sum^{n}_{i=1}g^{k}(x)_{i}f_{i}(x) \\ y_{k}=h^{k}(f^{k}(x))

### Have Fun

Let's Vote for the Most Difficult Equation!

### Introduction

#### Equation

#### Latex Code

g^{k}(x)=\text{softmax}(W_{gk}x) \\ f^{k}(x)=\sum^{n}_{i=1}g^{k}(x)_{i}f_{i}(x) \\ y_{k}=h^{k}(f^{k}(x))

#### Explanation

Multi-Gate Mixture of Experts (MMoE) model is firstly introduced in KDD2018 paper Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. The model introduce a MMoE layer to model the relationship of K multiple tasks using N experts. Let's assume input feature X has dimension D. There are K output tasks and N experts networks. The gating network is calculated as, g^{k}(x) is a N-dimensional vector indicating the softmax result of relative weights, W_{gk} is a trainable matrix with size R^{ND}. And f^{k}(x) is the weghted sum representation of output of N experts for task k. f_{i}(x) is the output of the i-th expert, and f^{k}(x) indicates the representation of k-th tasks as the summation of N experts.

#### Related Documents

- See paper Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts for details.

Reply