Visual grouping is the ability to combine parts of objects into whole objects, and it is crucial for general, robust vision. Despite its importance, a comprehensive model of how this process works in humans remains elusive. In my thesis, I investigate how well extant deep neural network (DNN) models explain phenomena in visual grouping, and whether any failures can be remedied via targeted modeling. First, I review studies of deep neural networks as models of the visual system in general. Second, I conduct a large-scale study of over 1,000 DNNs and 50 humans on a contour integration task, revealing contour integration as a fundamental process underlying human-like vision. I examine where contour integration in models comes from and find that the scale of the pre-training dataset is a crucial factor, suggesting that the mechanism may be learned from data rather than necessarily hardwired into architectural specifics. Third, I extend this study to additional psychophysical benchmarks of visual grouping, again evaluating a large set of DNN models against human data, and find that the data-scaling paradigm does not carry over to more abstract, synthetic tasks. Neural evidence suggests that these benchmarks may measure novel signals of visual grouping not captured by current neural benchmarks, indicating that a more direct modeling approach may be needed. Finally, I propose a DNN model that, in a controlled setting, unifies perceptual grouping under a single principle of neural noise and outperforms other extant models on a set of visual grouping benchmarks.