Researchers from UC Berkeley and Meta Present AST-T5: A Novel Pretraining Paradigm that Harnesses the Power of Abstract Syntax Trees (ASTs) to Boost the Performance of Code-Centric Language Models

LLMs have had a significant impact in the fields of code generation and comprehension. These models, trained on extensive code datasets such as GitHub, excel in tasks like text-to-code conversion, code-to-code transpilation, and understanding code. However, many current models merely treat code as sequences of subword tokens, overlooking its structure. Research suggests that incorporating the Abstract Syntax Tree (AST) of code can notably improve performance in tasks related to code. Some studies use code obfuscation during pretraining to teach models about abstract code structures, but these methods often involve computationally expensive processes, restricting scalability and imposing stringent conditions. 

Researchers from UC Berkeley and  Meta AI have developed AST-T5, a pretraining approach that capitalizes on the AST to enhance code generation, transpilation, and comprehension. This method, employing dynamic programming, maintains code structure through AST-Aware Segmentation and equips the model with the ability to reconstruct diverse code structures via AST-Aware Span Corruption. Unlike other models, AST-T5 does not require intricate program analyses or architectural changes, ensuring seamless integration with any encoder-decoder Transformer. 


LMs have been extended from NLP to code understanding and generation tasks. Encoder-only models excel in code understanding when fine-tuned with classifiers, while decoder-only models are optimized for code generation through their autoregressive nature. Encoder-decoder models, such as PLBART and CodeT5, have been developed to perform well in diverse code-related tasks. Previous research has leveraged syntactic elements, such as ASTs, in neural network models for code understanding and generation. 

AST-T5 is a  pretraining framework that leverages ASTs for code-based language models. AST-T5 uses AST-Aware Segmentation, an algorithm designed to address Transformer token limits while retaining the semantic coherence of the code. AST-T5 also employs AST-Aware Span Corruption, a masking technique that pretrains the model to reconstruct code structures ranging from individual tokens to entire function bodies, enhancing its flexibility and structure-awareness. The efficacy of AST-T5’s proposed methods is evaluated through controlled experiments, comparing it against T5 baselines with identical Transformer architectures, pretraining data, and computational settings.


AST-T5 consistently outperforms similar-sized LMs across various code-related tasks, particularly in code-to-code tasks, surpassing CodeT5 by 2 points in the exact match score for the Bugs2Fix task and by 3 points in the precise match score for Java-C# Transpilation in CodeXGLUE. The contributions of each component within the AST-aware pretraining framework of AST-T5 are analyzed through controlled experiments, which show the effect of the proposed methods. AST-T5’s structure-awareness, achieved through leveraging the AST of code, enhances code generation, transpilation, and understanding. AST-T5 integrates seamlessly with any encoder-decoder transformer without requiring intricate program analyses or architectural changes. 

In conclusion, AST-T5 is a  pretraining paradigm that harnesses the power of ASTs to boost the performance of code-centric language models. AST-T5 consistently outperforms similar-sized language models across various code-related tasks, particularly in code-to-code tasks, surpassing CodeT5 in exact match scores for the Bugs2Fix task and Java-C# Transpilation in CodeXGLUE. The simplicity and adaptability of AST-T5 make it a potential drop-in replacement for any encoder-decoder language model, highlighting its potential for real-world deployments. AST-T5’s structure-awareness, achieved through leveraging the AST, enhances code generation, transpilation, and understanding. Future work may explore the scalability of AST-T5 by training larger models on more expansive datasets and evaluating the model on the entire sanitized subset without few-shot prompts.

Check out the Paper and Github. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our 36k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.

If you like our work, you will love our newsletter..

Don’t Forget to join our Telegram Channel

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.

Leave a Reply

Your email address will not be published. Required fields are marked *