Granular materials are among the most commonly encountered materials in the world, and the discrete element method (DEM) has become one of the most accurate and effective methods for simulating them. However, achieving the precision expected from DEM requires a very large number of force computations. Researchers must therefore either restrict their simulations to fewer particles or build large-scale computer clusters to handle more particles. At the same time, DEM simulations are highly data-parallel. Recently, graphics processing units (GPUs) have emerged as a powerful parallel computing platform for scientific applications. In this paper, we implement DEM on GPUs and exploit their resources thoroughly for performance gains. Experimental results demonstrate that the proposed implementation achieves a 2x to 15x speedup, depending on the number of particles and the GPU generation, compared with the LAMMPS granular module running on 4-core systems.