H Chen, B He, H Wang, Y Ren, SN Lim, A Shrivastava - proceedings.neurips.cc
We provide the architecture details in Table 1. For a 1920×1080 video, given the timestamp
index t, we first apply a 2-layer MLP to the output of the positional encoding layer, then we stack …
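
The first two stages named in the excerpt (positional encoding of the timestamp, then a 2-layer MLP) could be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. The sin/cos encoding form, the values of `base` and `num_freqs`, the layer widths (32 and 64), the ReLU activation, the random initialization, and the assumption that t is normalized to [0, 1] are all hypothetical choices made for the example. The stacked blocks that follow the MLP are elided in the excerpt and are not shown.

```python
import math
import random


def positional_encoding(t, base=1.25, num_freqs=8):
    """Map a timestamp t (assumed normalized to [0, 1]) to sin/cos features.

    base and num_freqs are illustrative assumptions, not values from the excerpt.
    """
    feats = []
    for i in range(num_freqs):
        angle = (base ** i) * math.pi * t
        feats.append(math.sin(angle))
        feats.append(math.cos(angle))
    return feats


def make_linear(in_dim, out_dim, rng):
    """Create one fully connected layer: random weights, zero bias (assumed init)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply y = W x + b for a layer produced by make_linear."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Elementwise ReLU (the activation choice is an assumption)."""
    return [max(0.0, v) for v in x]


def mlp_2layer(x, layer1, layer2):
    """Two fully connected layers with a nonlinearity in between."""
    hidden = relu(linear(x, layer1))
    return linear(hidden, layer2)


rng = random.Random(0)                       # fixed seed for repeatability
embedding = positional_encoding(0.5)         # encode the timestamp
layer1 = make_linear(len(embedding), 32, rng)
layer2 = make_linear(32, 64, rng)
features = mlp_2layer(embedding, layer1, layer2)
print(len(embedding), len(features))
```

With the assumed settings, the encoding has 2 × `num_freqs` entries and the MLP output has as many entries as the second layer's width. In a real model the weights would be learned rather than sampled.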