Ambient seismic noise consists of emergent and impulsive signals generated by natural and anthropogenic sources. Developing techniques to identify specific cultural noise signals will benefit studies performing seismic imaging from continuous records. We examine spectrograms of urban cultural noise from a spatially dense seismic array located in Long Beach, California. The spectral features of the waveforms are used to develop a self‐supervised clustering model for differentiating cultural noise into separable types of signals. We use 161 hr of seismic data from 5200 geophones that contain impulsive signals originating from human activity. The model uses convolutional autoencoders, a self‐supervised machine‐learning technique, to learn latent features from spectrograms produced from the data. The latent features are evaluated using a deep clustering algorithm to separate the noise signals into different classes. We evaluate the separation of data and analyze the classes to identify the likely sources of the signals present in the data. To interpret the model performance, we examine the time–frequency domain features of the signals and the spatiotemporal evolution observed for each class. We demonstrate that clustering using deep autoencoders is a useful approach to characterizing seismic noise and identifying novel signals in the data.
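As a rough illustration of the kind of objective an autoencoder-plus-deep-clustering pipeline typically optimizes, the sketch below writes out a reconstruction loss and a deep-embedded-clustering style assignment step. This is an assumption for illustration: the abstract does not state the loss functions used, and every symbol here is hypothetical notation ($x_i$ for a spectrogram, $f_\theta$ and $g_\phi$ for the encoder and decoder, $z_i$ for the latent feature vector, $\mu_j$ for the centroid of class $j$, $N$ for the number of spectrograms).

```latex
% Autoencoder: learn latent features by reconstructing each spectrogram
z_i = f_\theta(x_i), \qquad
\mathcal{L}_{\mathrm{rec}} = \frac{1}{N} \sum_{i=1}^{N}
  \left\lVert x_i - g_\phi\!\left(f_\theta(x_i)\right) \right\rVert_2^2

% Soft assignment of latent vector z_i to class j (Student's t kernel, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert z_i - \mu_j \rVert^2\right)^{-1}}
              {\sum_{j'} \left(1 + \lVert z_i - \mu_{j'} \rVert^2\right)^{-1}}

% Sharpened target distribution built from the soft assignments
p_{ij} = \frac{q_{ij}^2 / \sum_{i'} q_{i'j}}
              {\sum_{j'} \left( q_{ij'}^2 / \sum_{i'} q_{i'j'} \right)}

% Clustering loss: Kullback-Leibler divergence between target and soft assignments
\mathcal{L}_{\mathrm{clust}} = \mathrm{KL}(P \,\Vert\, Q)
  = \sum_{i=1}^{N} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

In a formulation of this kind, minimizing $\mathcal{L}_{\mathrm{clust}}$ pulls each latent vector toward the centroid it already most resembles. Retaining $\mathcal{L}_{\mathrm{rec}}$ during that step is one common way to keep the latent space tied to the time–frequency content of the spectrograms.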