Structure of the multimodal feature representation module.