output sequences, for applications in sequence recognition or production. Results are
presented showing that learning long-term dependencies in such recurrent networks using
gradient descent is a very difficult task. It is shown how this difficulty arises when bits of
information must be robustly latched with certain attractors. The derivatives of the output at time t with
respect to the unit activations at time zero tend rapidly to zero as t increases for most input …
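As a rough numerical illustration of this last claim, the sketch below (not from the paper; the network, the weight scaling, and names such as W, a, and J are illustrative assumptions) builds a small tanh recurrent network whose recurrent weight matrix is rescaled to be contracting, the regime in which information can be latched robustly, and prints the norm of the Jacobian of the activations at time t with respect to the activations at time zero.

```python
# Minimal numerical sketch (an illustration, not the paper's construction):
# a tanh recurrent net with contracting recurrent weights, as robust latching
# requires; the norm of d a_t / d a_0 then decays rapidly as t grows.
import numpy as np

rng = np.random.default_rng(0)
n = 10                                  # number of recurrent units (arbitrary choice)
W = rng.standard_normal((n, n))
W *= 0.9 / np.linalg.norm(W, 2)         # spectral norm 0.9 < 1: a contracting map

a = rng.standard_normal(n)              # initial activations a_0
J = np.eye(n)                           # accumulated Jacobian d a_t / d a_0

for t in range(1, 101):
    a = np.tanh(W @ a)
    # one-step Jacobian of a_t w.r.t. a_{t-1}: diag(1 - a_t**2) @ W
    J = np.diag(1.0 - a**2) @ W @ J
    if t % 20 == 0:
        print(f"t = {t:3d}   ||d a_t / d a_0|| = {np.linalg.norm(J):.3e}")
```

Because the spectral norm of W is held below one and the tanh derivative never exceeds one, each step multiplies the accumulated Jacobian by a factor of norm at most 0.9, so the printed values fall off roughly as 0.9 to the power t.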