Cheatsheet of LaTeX Code for Multi-Task Learning Equations
Navigation
In this blog, we summarize the LaTeX code for the most fundamental equations of multi-task learning (MTL). Multi-task learning aims to optimize N related tasks simultaneously and achieve an overall trade-off between them. Typical network structures include the Shared-Bottom model, the Cross-Stitch network, Multi-gate Mixture-of-Experts (MMoE), Progressive Layered Extraction (PLE), and the Entire Space Multi-task Model (ESMM). In the following sections, we discuss the MTL equations in more detail for your quick reference.
 1. Multi-Task Learning (MTL)
 1.1 Shared-Bottom Model
 1.2 Multi-gate Mixture-of-Experts (MMoE)
 1.3 Progressive Layered Extraction (PLE)
 1.4 Entire Space Multi-task Model (ESMM)
 1.5 Cross-Stitch Network
1. Multi-Task Learning (MTL)

1.1 Shared-Bottom Model
Equation
Latex Code
y_{k}=h^{k}(f(x))
Explanation
The Shared-Bottom MTL model has a shared representation f(x) for K individual tasks. For each task k, a task-specific tower with parameters h^{k}(\cdot) produces that task's output.
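A minimal numpy sketch of the shared-bottom forward pass, assuming a single linear-ReLU layer as a stand-in for the shared network f(x) and one linear head per task for h^{k}; the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 8, 4, 2  # input dim, shared dim, number of tasks (illustrative)

# Shared bottom f(x): one linear layer with ReLU (a stand-in for any shared network).
W_shared = rng.normal(size=(H, D))
# Task-specific towers h^k(.): one linear head per task.
W_towers = [rng.normal(size=(1, H)) for _ in range(K)]

def shared_bottom(x):
    f_x = np.maximum(W_shared @ x, 0.0)            # shared representation f(x)
    return [float(W_k @ f_x) for W_k in W_towers]  # y_k = h^k(f(x))

outputs = shared_bottom(rng.normal(size=D))  # one scalar output per task
```

Every task reads the same f(x), which is what makes the representation "shared"; only the tower weights differ per task.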

1.2 Multi-gate Mixture-of-Experts (MMoE)
Equation
Latex Code
g^{k}(x)=\text{softmax}(W_{gk}x) \\ f^{k}(x)=\sum^{n}_{i=1}g^{k}(x)_{i}f_{i}(x) \\ y_{k}=h^{k}(f^{k}(x))
Explanation
The Multi-gate Mixture-of-Experts (MMoE) model was first introduced in the KDD 2018 paper Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. The model introduces an MMoE layer to model the relationship of K tasks using N experts. Assume the input feature x has dimension D, and there are K output tasks and N expert networks. In the gating network, g^{k}(x) is an N-dimensional vector holding the softmax weights, and W_{gk} is a trainable matrix of size N \times D. f_{i}(x) is the output of the i-th expert, and f^{k}(x) is the representation for task k, computed as the gate-weighted sum of the N expert outputs. See the paper Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts for more details.
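The three equations above can be sketched in numpy as follows; the experts and towers are simple linear maps here, which is an assumption for illustration rather than the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 8, 3, 2  # input dim, number of experts, number of tasks (illustrative)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

experts = [rng.normal(size=(D, D)) for _ in range(N)]   # expert networks f_i(.)
W_g = [rng.normal(size=(N, D)) for _ in range(K)]       # gating matrices W_{gk}, N x D
towers = [rng.normal(size=(1, D)) for _ in range(K)]    # task towers h^k(.)

def mmoe(x):
    expert_outs = [E @ x for E in experts]              # f_i(x)
    ys = []
    for k in range(K):
        g = softmax(W_g[k] @ x)                         # g^k(x): N weights summing to 1
        f_k = sum(g[i] * expert_outs[i] for i in range(N))  # f^k(x)
        ys.append(float(towers[k] @ f_k))               # y_k = h^k(f^k(x))
    return ys

x = rng.normal(size=D)
ys = mmoe(x)
```

Note that all K tasks share the same expert pool but each task has its own gate, so tasks can weight the experts differently.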

1.3 Progressive Layered Extraction (PLE)
Equation
Latex Code
g^{k}(x)=w^{k}(x)S^{k}(x) \\ w^{k}(x)=\text{softmax}(W^{k}_{g}x) \\ S^{k}(x)=\left[E^{T}_{(k,1)},E^{T}_{(k,2)},...,E^{T}_{(k,m_{k})},E^{T}_{(s,1)},E^{T}_{(s,2)},...,E^{T}_{(s,m_{s})}\right]^{T} \\ y^{k}(x)=t^{k}(g^{k}(x)) \\ g^{k,j}(x)=w^{k,j}(g^{k,j-1}(x))S^{k,j}(x)
Explanation
The Progressive Layered Extraction (PLE) model slightly modifies the original MMoE structure and explicitly separates the experts into shared experts and task-specific experts. Assume there are m_{s} shared experts and m_{k} task-specific experts for task k. S^{k}(x) is a selected matrix composed of (m_{s} + m_{k}) D-dimensional vectors, with dimension (m_{s} + m_{k}) \times D. w^{k}(x) denotes the gating function of size (m_{s} + m_{k}), and W^{k}_{g} is a trainable parameter matrix of dimension (m_{s} + m_{k}) \times D. t^{k} denotes the task-specific tower parameters. The progressive extraction layer means that the gating network g^{k,j}(x) of the j-th extraction layer takes the output of the previous gating layer g^{k,j-1}(x) as input. See the paper Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations for more details.
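A single PLE extraction layer can be sketched as below; linear experts and a softmax gate stand in for the paper's networks, with illustrative sizes. The key difference from MMoE is visible in how S^{k}(x) stacks task-k experts together with the shared experts.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m_s, m_k, K = 6, 2, 2, 2  # input dim, shared experts, task experts per task, tasks

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

shared_experts = [rng.normal(size=(D, D)) for _ in range(m_s)]                 # E_(s,i)
task_experts = [[rng.normal(size=(D, D)) for _ in range(m_k)] for _ in range(K)]  # E_(k,i)
W_g = [rng.normal(size=(m_k + m_s, D)) for _ in range(K)]                      # W^k_g
towers = [rng.normal(size=(1, D)) for _ in range(K)]                           # t^k

def ple_layer(x, k):
    # S^k(x): stack task-k expert outputs and shared expert outputs, (m_k + m_s) x D.
    S = np.stack([E @ x for E in task_experts[k]] + [E @ x for E in shared_experts])
    w = softmax(W_g[k] @ x)      # w^k(x): gate over the (m_k + m_s) experts
    return w @ S                 # g^k(x) = w^k(x) S^k(x)

x = rng.normal(size=D)
ys = [float(towers[k] @ ple_layer(x, k)) for k in range(K)]  # y^k(x) = t^k(g^k(x))
```

In a multi-level PLE, the output g^{k,j-1}(x) would be fed as the input of the next extraction layer in place of x; this sketch shows only one layer.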

1.4 Entire Space Multi-task Model (ESMM)
Equation
Latex Code
L(\theta_{cvr},\theta_{ctr})=\sum^{N}_{i=1}l(y_{i},f(x_{i};\theta_{ctr}))+\sum^{N}_{i=1}l(y_{i}\&z_{i},f(x_{i};\theta_{ctr}) \times f(x_{i};\theta_{cvr}))
Explanation
The ESMM model uses two separate towers to model the pCTR and pCTCVR prediction tasks simultaneously, where y_{i} is the click label, z_{i} is the conversion label, and pCTCVR is the product of the two towers' outputs f(x_{i};\theta_{ctr}) \times f(x_{i};\theta_{cvr}). See the paper Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate for more details.
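The loss above can be sketched numerically as follows, assuming each tower f(x;\theta) is a logistic model over a shared input and l is the log loss; the data is synthetic and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 5, 100  # feature dim, number of impressions (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-9):
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Two towers over the same input: theta_ctr and theta_cvr (linear stand-ins).
theta_ctr = rng.normal(size=D)
theta_cvr = rng.normal(size=D)

X = rng.normal(size=(N, D))
y_click = rng.integers(0, 2, size=N)            # y_i: click label
y_conv = y_click * rng.integers(0, 2, size=N)   # z_i: conversion happens only after a click

p_ctr = sigmoid(X @ theta_ctr)                  # f(x_i; theta_ctr) = pCTR
p_cvr = sigmoid(X @ theta_cvr)                  # f(x_i; theta_cvr) = pCVR
p_ctcvr = p_ctr * p_cvr                         # pCTCVR = pCTR * pCVR

# L(theta_cvr, theta_ctr): CTR loss on clicks plus CTCVR loss on click-and-convert.
loss = log_loss(y_click, p_ctr).sum() + log_loss(y_click & y_conv, p_ctcvr).sum()
```

Because pCVR is only supervised through the product pCTCVR, the CVR tower is trained over the entire impression space rather than only over clicked samples, which is the point of the model.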

1.5 Cross-Stitch Network
Equation
Latex Code
\begin{bmatrix} \tilde{x}^{ij}_{A}\\\tilde{x}^{ij}_{B}\end{bmatrix}=\begin{bmatrix} \alpha_{AA} & \alpha_{AB}\\ \alpha_{BA} & \alpha_{BB} \end{bmatrix}\begin{bmatrix} x^{ij}_{A}\\ x^{ij}_{B} \end{bmatrix}
Explanation
The cross-stitch unit takes two activation maps x_{A} and x_{B} from the previous layers of tasks A and B, learns a linear combination of the two inputs, and combines them into two new representations. The linear combination is controlled by the parameters \alpha. See the paper Cross-stitch Networks for Multi-task Learning for more details.