Addressing the Obvious Challenges
JetBrains engineers faced several significant obstacles in their quest to develop AI-powered autocompletion:
Speed and Cost: Traditional chat models proved impractical due to their substantial computational requirements and slow response times. They also lacked support for code-specific techniques such as fill-in-the-middle (FIM) and token healing, which are essential for effective code completion (a brief sketch of token healing follows this list of challenges). The overhead of running a large language model for every autocompletion request was simply too high for real-time use within an IDE.
Output Formatting: Leading chat models frequently produced output in inconsistent formats, making their responses hard to parse and integrate into the editor. Generated code often arrived wrapped in extraneous text or formatting that had to be stripped out before insertion, and this post-processing added both latency and complexity to the autocompletion pipeline.
Data Provenance: Determining the origin of the training data and mitigating potential copyright infringement issues posed a significant challenge. The widespread use of copyrighted code in training datasets raised concerns about the legality and ethical implications of using these models for code generation. JetBrains needed to ensure that its autocompletion model was trained on data that was either publicly available or licensed for use in this context. Tracing the provenance of the training data was a complex and time-consuming task, but it was essential for avoiding legal and reputational risks.
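For context, token healing addresses the fact that the cursor usually sits in the middle of what the tokenizer would treat as a single token, so the prompt's last token rarely matches how that text appeared during training. Below is a minimal sketch of the idea, assuming a generic tokenizer/model interface; the names and the `allowed_tokens` argument are illustrative, not Mellum's or any particular library's API.

```python
def heal_and_generate(prompt: str, tokenizer, model, max_new_tokens: int = 16) -> str:
    """Token healing sketch: back up over the prompt's last token, then only
    allow first-step continuations that re-produce the removed text."""
    ids = tokenizer.encode(prompt)
    healed_ids, tail = ids[:-1], tokenizer.decode(ids[-1:])
    # Restrict the first generated token to those whose text starts with the
    # removed tail (a linear vocab scan is fine for a sketch, not for production).
    allowed = [t for t in range(tokenizer.vocab_size)
               if tokenizer.decode([t]).startswith(tail)]
    first = model.generate(healed_ids, allowed_tokens=allowed, max_new_tokens=1)
    rest = model.generate(healed_ids + first, max_new_tokens=max_new_tokens - 1)
    completion = tokenizer.decode(first + rest)
    # Strip the re-generated tail so only genuinely new text is inserted.
    return completion[len(tail):]
```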
Mellum: An Overview
The development team at JetBrains concluded that creating their own model was the most effective approach. Their objective was to design a model that struck a balance between quality, inference costs, and latency while being trained on data with clear provenance. Initial research indicated that a model with approximately 4 billion parameters could deliver competent autocompletion capabilities for a wide range of scenarios and users. Moreover, by exclusively training the model on code, they could establish a specialized vocabulary of tokens, free from irrelevant data. This focused approach would enable the model to learn the nuances of programming languages more effectively and generate more accurate and relevant code suggestions.
The training process for the Mellum model comprises three distinct stages, each contributing new knowledge and improving the quality of generated code. The initial phase involves basic pre-training on a large corpus of individual files. The second stage consists of refining the model with a smaller set of specialized examples. Reinforcement Learning with AI Feedback (RLAIF) is employed in the third stage to adapt the model to IDE-specific characteristics and eliminate unwanted output. This multi-stage training approach allows the model to gradually learn and improve its ability to generate high-quality code completions.
Pre-Training
To avoid potential issues related to ambiguous data origins, the model was trained from scratch, requiring a comprehensive introduction to numerous languages, programming syntax, patterns, and core concepts. This involved exposing the model to a vast amount of code from various sources, covering a wide range of programming languages and coding styles. The goal was to equip the model with a broad understanding of programming principles and best practices.
Dataset
The primary data source for pre-training was TheStack, a large collection of open-source code repositories, which made the data both legally sound and practically useful. Using TheStack gave JetBrains confidence that the training data was legally obtained and that the resulting model would not infringe on copyrights, while its breadth of coding styles and patterns gave the model a diverse set of examples to learn from.
Pre-Training Process
During pre-training, the dataset was sampled multiple times to reach approximately 3 trillion tokens. A context window of 8192 tokens was used, with the dataset split into fragments of this size. The fill-in-the-middle (FIM) transformation was applied to half of the files in each fragment, encouraging the model to consider both preceding and subsequent code. This technique closely mimics real-world code generation scenarios. The large context window allowed the model to learn long-range dependencies in the code, while the FIM transformation encouraged the model to understand the context surrounding the code being generated.
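A minimal sketch of what such a FIM transformation can look like, assuming PSM-style sentinel tokens; the actual special tokens and span-selection strategy used for Mellum are simplified here, so treat this as an illustration of the idea rather than the real preprocessing code.

```python
import random

def fim_transform(code: str, fim_probability: float = 0.5) -> str:
    # Leave roughly half of the fragments as plain next-token-prediction data.
    if len(code) < 2 or random.random() >= fim_probability:
        return code
    # Pick two cut points and rearrange the pieces so the model must predict
    # the middle span from both the preceding and the following context.
    i, j = sorted(random.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```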
The pre-training phase was conducted on a cluster of sixteen nodes, each equipped with eight H100 GPUs. This stage took approximately 15 days to complete, resulting in the 4-billion-parameter Mellum-base model. The use of high-performance GPUs and a distributed training setup was essential for completing the pre-training process in a reasonable timeframe. The resulting Mellum-base model served as the foundation for subsequent fine-tuning stages.
Pre-training creates a general-purpose code autocompletion model with extensive knowledge of many programming languages. At this stage, however, the model has only been trained to predict tokens within randomly selected file segments: it has little awareness of code structure and no mechanism for deciding when to stop generating. In other words, it understands programming languages broadly but cannot yet produce completions that are contextually relevant and well-formed.
The fine-tuning stage is designed to address these limitations.
Context-Aware Fine-Tuning
Enhanced Fill-in-the-Middle
Unlike pre-training, where code fragments are selected randomly for prediction, fine-tuning segments code in a more meaningful way so the model learns to complete the kinds of fragments that occur “in the wild”. This involved developing algorithms to identify relevant code segments within larger codebases, such as function definitions, loop bodies, and conditional statements, and then training the model to predict the code within these segments given the surrounding context.
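As an illustration of the difference from random spans, the toy sketch below selects a whole function body as the span to be filled in; the real segment-extraction algorithms are internal to JetBrains and certainly more sophisticated.

```python
import ast
import random
from typing import Optional, Tuple

def pick_function_body_span(source: str) -> Optional[Tuple[int, int]]:
    """Return character offsets of a randomly chosen function body, if any."""
    tree = ast.parse(source)
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if not functions:
        return None
    fn = random.choice(functions)
    lines = source.splitlines(keepends=True)
    # Offsets from the line of the first statement inside the function to the
    # end of the function; this span becomes the "middle" of a FIM example.
    start = sum(len(l) for l in lines[: fn.body[0].lineno - 1])
    end = sum(len(l) for l in lines[: fn.end_lineno])
    return start, end
```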
Specialized Examples
In practice, code autocompletion necessitates understanding surrounding files and broader contexts, possibly encompassing entire projects. To address this requirement, JetBrains developed a system for extracting and analyzing code from entire projects, allowing the model to learn from the relationships between different files and modules. This enabled the model to generate code completions that were more consistent with the overall project structure and coding style.
For data preprocessing, the company launched an internal project codenamed Code Engine: a cross-platform SDK and set of console utilities developed to build context directly from ordinary files without requiring full project indexing. This SDK was deployed on an internal MapReduce cluster and used to process thousands of public repositories, generating a large number of useful training examples in a reasonable timeframe. Code Engine was instrumental in letting the model learn from large codebases without the computationally expensive step of full project indexing.
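The details of Code Engine are internal, but the general idea of assembling context from ordinary files rather than a full project index can be sketched roughly as follows; the file-selection heuristic, path markers, and character budget are all assumptions.

```python
from pathlib import Path

def build_context(target: Path, max_chars: int = 16_000, max_files: int = 4) -> str:
    """Concatenate a few sibling files of the same language under a size budget."""
    siblings = sorted(
        (p for p in target.parent.glob(f"*{target.suffix}") if p != target),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # most recently modified files first
    )[:max_files]
    parts = []
    budget = max_chars
    for path in siblings:
        text = path.read_text(errors="ignore")[:budget]
        parts.append(f"# file: {path.name}\n{text}")
        budget -= len(text)
        if budget <= 0:
            break
    return "\n\n".join(parts)
```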
Finding the correct algorithms required some trial and error. The team experimented with different approaches for extracting relevant code segments and analyzing project context before settling on a set of algorithms that produced satisfactory results. This iterative process involved careful evaluation of the model’s performance and continuous refinement of the data preprocessing techniques.
Tuning for Specific Languages
Small models can benefit greatly from specialization for specific languages. While the base model is trained on over 80 languages, most users typically work with only one or two. To address this, JetBrains created multiple specialized models (a hypothetical routing sketch follows this list):
mellum-all: Supports most languages and dialects available in JetBrains IDEs. Completion quality is lower than that of the specialized models, but it covers users who work across many languages.
mellum-python: Specializes in Python and Jupyter notebooks.
mellum-kotlin: Specializes in Java and Kotlin.
mellum-web: Specializes in web technologies such as HTML, CSS, and JavaScript.
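The descriptions above suggest a straightforward language-to-model mapping. The dispatch table below is a hypothetical illustration inferred from that list, not JetBrains' actual routing logic.

```python
# Hypothetical mapping from a file's language to a specialized Mellum variant,
# falling back to the general-purpose model for everything else.
SPECIALIZED_MODELS = {
    "python": "mellum-python",
    "jupyter": "mellum-python",
    "java": "mellum-kotlin",
    "kotlin": "mellum-kotlin",
    "javascript": "mellum-web",
    "typescript": "mellum-web",
    "html": "mellum-web",
    "css": "mellum-web",
}

def pick_model(language: str) -> str:
    return SPECIALIZED_MODELS.get(language.lower(), "mellum-all")
```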
The Final Step: RLAIF
Finally, cases where the training objective does not match user expectations need to be resolved. An additional training phase, RLAIF (Reinforcement Learning with AI Feedback), is used to solve such problems: feedback on the quality of generated completions serves as a training signal, letting the model correct its mistakes and adapt to how completions are actually used in the IDE.
Through this feedback the model learns to better reflect user preferences, so its suggestions become more relevant, accurate, and aligned with what users expect.
This approach not only improves the overall quality score but also reduces the number of annoying generation artifacts. The RLAIF process helps to eliminate unwanted code fragments, incorrect syntax, and other imperfections in the model’s code completions. This results in a smoother and more efficient coding experience for users.
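What counts as an "artifact" is not spelled out publicly; the toy feedback signal below, which scores a candidate completion and returns a reward, is purely an invented illustration of the kind of check such a stage could rely on.

```python
def completion_reward(completion: str, max_lines: int = 10) -> float:
    """Toy reward: start from 1.0 and penalize common generation artifacts."""
    lines = [l for l in completion.splitlines() if l.strip()]
    reward = 1.0
    if len(lines) > max_lines:                          # runaway generation
        reward -= 0.5
    if len(set(lines)) < len(lines):                    # repeated lines
        reward -= 0.3
    if completion.count("(") != completion.count(")"):  # obviously broken syntax
        reward -= 0.2
    return max(reward, 0.0)
```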
How Good Is Mellum?
The model performs exceptionally well for its size. Here’s how it was evaluated:
First, the model was evaluated on an internal benchmark codenamed “JetBrains BigCode”. This benchmark dataset was created by JetBrains to assess the model’s performance on a wide range of coding tasks.
Then it was tested on well-known public benchmarks such as SAFIM. These public benchmarks provide a standardized way to compare the model’s performance against other code completion tools.
Finally, usage statistics for features were collected, and user metrics were calculated. This involved tracking how frequently users accepted the model’s code completion suggestions and measuring the impact of the model on user productivity.
Offline Evaluation
Collecting data is a complex task, but designing a good metric that compares the original code with the completion proposed by the neural network is even more challenging. The team conducted a small study and ultimately settled on a combination of two primary metrics:
EM:
Exact Match is a widely used metric that measures the share of completions exactly matching the intended code. In this adaptation, a prediction is considered good if the first line of the completion matches the first line of the original, after minimal preprocessing; this focuses the evaluation on the most critical part of the suggestion and ensures the initial suggestion is correct and relevant.
KK:
The metric is named after its authors and measures how closely the model's suggestion matches the intended code. It is computed as the number of lines in the proposed completion that come from the original code, divided by the total number of lines in the proposed completion, which reflects how accurately the model generates complete code blocks.
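A compact sketch of both offline metrics as they are described above; the exact preprocessing and line-matching rules are assumptions.

```python
def exact_match(original: str, completion: str) -> bool:
    def first_line(s: str) -> str:
        return s.strip().splitlines()[0].strip() if s.strip() else ""
    # EM: compare only the first line of each, after trimming whitespace.
    return first_line(completion) == first_line(original)

def kk_score(original: str, completion: str) -> float:
    # KK: fraction of proposed lines that also occur in the original code.
    original_lines = {l.strip() for l in original.splitlines() if l.strip()}
    proposed = [l.strip() for l in completion.splitlines() if l.strip()]
    return sum(l in original_lines for l in proposed) / len(proposed) if proposed else 0.0
```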
JetBrains BigCode
The model was evaluated against a benchmark dataset obtained using the internal JetBrains BigCode tool. This dataset allowed JetBrains to assess the model’s performance on a wide range of coding tasks specific to its IDEs.
By maintaining full control over the dataset rather than relying on public benchmarks, it becomes possible to reliably evaluate model quality across various coding styles and practices, ensuring the model is optimized for the specific needs of JetBrains IDE users.
The results of the JetBrains BigCode evaluation show quality on par with popular models, while Mellum remains smaller and more efficient, demonstrating that this training approach yields a code completion model that is both accurate and resource-efficient.
Quality of single-line suggestions (EM metric)
Public Benchmarks
The model was evaluated not only on the internal dataset but also on various public benchmarks, such as the multilingual benchmark SAFIM (syntax-aware fill in the middle). This ensured that the model’s performance was not biased towards the specific characteristics of the JetBrains BigCode dataset.
Online Evaluation
The main metric is called the ratio of completed code (RoCC). It is defined as the ratio of code characters written using code autocompletion to the total amount of code in the editor. This metric provides a direct measure of the impact of the model on user productivity.
Another important metric is the acceptance rate (AR), which is calculated as the number of accepted suggestions divided by the number of all suggestions shown. This metric reflects the quality and relevance of the model’s code completion suggestions.
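Both online metrics reduce to simple ratios over logged editor events. A sketch, assuming the counts come from editor telemetry:

```python
def ratio_of_completed_code(chars_from_completions: int, total_chars_in_editor: int) -> float:
    # RoCC: share of code characters written via accepted autocompletions.
    return chars_from_completions / total_chars_in_editor if total_chars_in_editor else 0.0

def acceptance_rate(accepted_suggestions: int, shown_suggestions: int) -> float:
    # AR: share of shown suggestions that the user accepted.
    return accepted_suggestions / shown_suggestions if shown_suggestions else 0.0
```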
This was a complex journey, but the JetBrains team saw it through. The result is one general and several specialized models, available through the JetBrains AI platform and now working in JetBrains AI Assistant, where they give users a powerful and efficient code completion tool that can noticeably improve coding productivity.
What’s next?
JetBrains engineers are currently working on a model for web development languages. It may become publicly available in the near future. This model will provide specialized code completion for HTML, CSS, JavaScript, and other web-related technologies.
There are plans to increase both the number of parameters and the diversity of the training data. Coding involves many different tasks, and Mellum is intended to handle more of them over time. Service performance remains a key metric, however, so the model will only grow within reasonable limits, keeping it efficient and responsive while its range of supported tasks and the relevance of its suggestions expand.