We investigate RGB and depth fusion techniques for object detection, with the aim of improving detection accuracy over RGB-only systems. We consider recent proposal-free convolutional object detectors, which we modify for RGB-D data. Beside the RGB and depth branches, we introduce a third, mixed branch in our network and define a novel attention mechanism that extracts weighted features from the depth branch and applies them to the RGB feature map, thus fusing the branches adaptively. Our method, which we call the spatial Cross-Attention Fusion network, or CAF-Net, yields a state-of-the-art mean average precision of 60.3% on the SUN RGB-D dataset, outperforming all previous techniques by a significant margin.
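
To make the fusion step concrete, the sketch below shows one plausible reading of the described mechanism: a spatial attention map is predicted from the depth feature map, used to weight the depth features, and the weighted features are then applied to the RGB feature map. The module name, layer choices, and the additive form of the fusion are assumptions for illustration, not the paper's exact CAF-Net block.

```python
import torch
import torch.nn as nn


class SpatialCrossAttentionFusion(nn.Module):
    """Hypothetical sketch of depth-to-RGB cross-attention fusion.

    All layer choices here are assumptions; the actual CAF-Net
    block may differ in its exact formulation.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv + sigmoid yield a per-location weight in [0, 1]
        self.attn = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        # projection applied to the weighted depth features before fusion
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor,
                depth_feat: torch.Tensor) -> torch.Tensor:
        # (B, 1, H, W) spatial weights derived from the depth branch
        weights = self.attn(depth_feat)
        # weight the depth features and add them to the RGB feature map
        return rgb_feat + self.proj(depth_feat * weights)


if __name__ == "__main__":
    fuse = SpatialCrossAttentionFusion(channels=256)
    rgb = torch.randn(2, 256, 32, 32)    # RGB-branch feature map
    depth = torch.randn(2, 256, 32, 32)  # depth-branch feature map
    print(fuse(rgb, depth).shape)        # torch.Size([2, 256, 32, 32])
```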