Text this: A temporal–spatial deep learning framework leveraging dynamic 3D attention maps for violence detection