Holography enables quantitative phase sensing through the interference of light waves, opening up imaging modalities in computational microscopy in which the phase shift serves as a source of image contrast. Recovering both the amplitude and the phase distribution of a light field in in-line holography is an ill-posed problem that requires dedicated post-processing. This paper introduces a novel learning-based model, dubbed Cross-Net, that extracts latent representations suitable for patch-based image reconstruction and regression tasks, thus enabling joint holographic image reconstruction and computational refocusing. Compared to the standard Vision Transformer (ViT), Cross-Net substitutes a dedicated convolutional block for Multi-Head Self-Attention as the inter-spatial processing unit and adapts the Patch Embedding to generate latent representations suited to the tasks at hand. The proposed model is lighter and achieves a significant performance gain on the joint tasks of computational refocusing and holographic image reconstruction compared to ViT and other state-of-the-art approaches. We further introduce the use of large-scale synthetic microscopic data, produced with generative models, to train our network.
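To illustrate the architectural idea stated above, the following is a minimal, hypothetical PyTorch sketch of a ViT-style block in which Multi-Head Self-Attention is replaced by a convolution over the spatial patch grid. The class name `ConvMixingBlock`, the depthwise 3x3 kernel, and all hyperparameters are illustrative assumptions; the abstract does not specify the exact Cross-Net design.

```python
import torch
import torch.nn as nn

class ConvMixingBlock(nn.Module):
    """Hypothetical sketch (not the actual Cross-Net): a Transformer-style
    block where the Multi-Head Self-Attention sub-layer is replaced by a
    depthwise convolution acting as the inter-spatial processing unit."""

    def __init__(self, dim, grid_size, mlp_ratio=4):
        super().__init__()
        self.grid = grid_size                      # patches per side
        self.norm1 = nn.LayerNorm(dim)
        # Depthwise conv mixes information across neighboring patches,
        # standing in for self-attention (kernel size is an assumption).
        self.spatial = nn.Conv2d(dim, dim, kernel_size=3,
                                 padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                  # standard ViT MLP sub-layer
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # x: (batch, num_patches, dim), num_patches = grid * grid
        B, N, C = x.shape
        h = self.norm1(x).transpose(1, 2).reshape(B, C, self.grid, self.grid)
        h = self.spatial(h).reshape(B, C, N).transpose(1, 2)
        x = x + h                                  # residual, as in ViT
        x = x + self.mlp(self.norm2(x))            # residual MLP
        return x

blk = ConvMixingBlock(dim=32, grid_size=8)
out = blk(torch.randn(2, 64, 32))                  # 64 = 8x8 patch tokens
print(out.shape)                                   # torch.Size([2, 64, 32])
```

Swapping attention for a convolution of this kind removes the quadratic token-interaction cost, which is consistent with the abstract's claim that the model is lighter than ViT.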