@online{Wijaya2411.03537,
TITLE = {Two-Stage Pretraining for Molecular Property Prediction in the Wild},
AUTHOR = {Wijaya, Kevin Tirta and Guo, Minghao and Sun, Michael and Seidel, Hans-Peter and Matusik, Wojciech and Babaei, Vahid},
LANGUAGE = {eng},
URL = {https://arxiv.org/abs/2411.03537},
EPRINT = {2411.03537},
EPRINTTYPE = {arXiv},
YEAR = {2024},
MARGINALMARK = {$\bullet$},
ABSTRACT = {Accurate property prediction is crucial for accelerating the discovery of new
  molecules. Although deep learning models have achieved remarkable success,
  their performance often relies on large amounts of labeled data that are
  expensive and time-consuming to obtain. Thus, there is a growing need for
  models that can perform well with limited experimentally-validated data. In
  this work, we introduce MoleVers, a versatile pretrained model designed for
  various types of molecular property prediction in the wild, i.e., where
  experimentally-validated molecular property labels are scarce. MoleVers adopts
  a two-stage pretraining strategy. In the first stage, the model learns
  molecular representations from large unlabeled datasets via masked atom
  prediction and dynamic denoising, a novel task enabled by a new branching
  encoder architecture. In the second stage, MoleVers is further pretrained using
  auxiliary labels obtained with inexpensive computational methods, enabling
  supervised learning without the need for costly experimental data. This
  two-stage framework allows MoleVers to learn representations that generalize
  effectively across various downstream datasets. We evaluate MoleVers on a new
  benchmark comprising 22 molecular datasets with diverse types of properties,
  the majority of which contain 50 or fewer training labels reflecting real-world
  conditions. MoleVers achieves state-of-the-art results on 20 out of the 22
  datasets, and ranks second among the remaining two, highlighting its ability to
  bridge the gap between data-hungry models and real-world conditions where
  practically-useful labels are scarce.},
}