convolutional and (qkv) self-attention layers. The U-Net processes images while being
conditioned on the time embedding input for each sampling step and the class or caption
embedding input corresponding to the desired conditional generation. Such conditioning
involves scale-and-shift operations to the convolutional layers but does not directly affect the
attention layers. While these standard architectural choices are certainly effective, not …