Expanding LLM Capabilities with Tool Use
LLMs have demonstrated remarkable proficiency in a wide range of natural language tasks. However, their true potential is unlocked through seamless integration with external tools such as APIs and computational frameworks. These tools empower LLMs with the ability to access real-time data, perform domain-specific calculations, and retrieve precise information, thereby enhancing their reliability and versatility. The ability to leverage external tools extends the capabilities of LLMs beyond their inherent knowledge base, allowing them to interact with the real world and solve complex problems more effectively. This integration is not merely about adding features; it’s about transforming LLMs from passive information providers into active problem solvers.
Consider the integration of a weather API, which enables LLMs to provide accurate and up-to-date weather forecasts. Instead of relying on static training data, the LLM can dynamically query the API to obtain the latest weather information for a specific location. Similarly, a Wikipedia API can equip LLMs with the ability to access a vast repository of information, allowing them to respond to complex queries with greater accuracy. In scientific contexts, tools such as calculators and symbolic engines can help LLMs overcome numerical inaccuracies, making them more reliable for complex calculations. The integration of these tools addresses a key limitation of LLMs: their inability to perform accurate mathematical operations or access real-time data.
By seamlessly integrating with these tools, LLMs evolve into robust, domain-aware systems capable of handling dynamic and specialized tasks with real-world utility. This transformation is driven by the ability to connect LLMs to external systems and data sources, enabling them to adapt to changing environments and solve a wider range of problems. The concept of ‘tool use’ is therefore central to the development of more powerful and versatile AI systems. Furthermore, the selection and usage of the right tools are crucial elements of building autonomous agents capable of complex reasoning and problem-solving. The accuracy with which these tools are utilized directly impacts the effectiveness and reliability of these agents.
Amazon Nova Models and Amazon Bedrock
Amazon Nova models, introduced at AWS re:Invent in December 2024, are designed to deliver exceptional price-performance value. These models offer state-of-the-art performance on key text-understanding benchmarks while maintaining cost-effectiveness. The focus on price-performance is a critical factor in the widespread adoption of LLMs, making them accessible to a broader range of users and applications. The design of Amazon Nova models reflects a commitment to delivering high-quality AI capabilities at a reasonable cost.
The series comprises three variants:
- Micro: A text-only model that delivers the lowest-latency responses in the series at very low cost. The Micro variant is well suited to latency-sensitive and resource-constrained workloads.
- Lite: A multimodal model that strikes a balance between versatility and performance. The Lite variant provides a good compromise between performance and computational cost, making it suitable for a wide range of applications.
- Pro: A high-performance multimodal model designed for tackling complex tasks. The Pro variant is the most powerful model in the series, capable of handling demanding tasks that require high accuracy and performance.
Amazon Nova models can be employed for a wide range of tasks, from content generation to the development of agentic workflows. These models possess the capability to interact with external tools or services through a process known as tool calling. This functionality can be accessed through the Amazon Bedrock console and APIs such as Converse and Invoke. The availability of these APIs simplifies the integration of Amazon Nova models into existing applications and workflows.
In addition to utilizing the pre-trained models, developers have the option to fine-tune these models with multimodal data (Pro and Lite) or text data (Pro, Lite, and Micro). This flexibility enables developers to achieve the desired levels of accuracy, latency, and cost-effectiveness. Fine-tuning allows developers to tailor the model to specific tasks and datasets, resulting in improved performance and efficiency. Furthermore, developers can leverage the Amazon Bedrock console and APIs to perform self-service custom fine-tuning and distillation of larger models into smaller ones. The ability to distill larger models into smaller ones is particularly valuable for deploying LLMs in resource-constrained environments.
Solution Overview
The solution involves preparing a custom dataset specifically designed for tool usage. This dataset is then used to evaluate the performance of Amazon Nova models through Amazon Bedrock, utilizing the Converse and Invoke APIs. Subsequently, the Amazon Nova Micro and Amazon Nova Lite models are fine-tuned using the prepared dataset via Amazon Bedrock. Upon completion of the fine-tuning process, these customized models are evaluated through provisioned throughput. The entire process is designed to be streamlined and efficient, allowing developers to quickly adapt Amazon Nova models to their specific needs.
The choice of Amazon Nova Micro and Amazon Nova Lite models for fine-tuning reflects a focus on optimizing performance and cost-effectiveness. These models offer a good balance between computational requirements and accuracy, making them suitable for a wide range of applications. The evaluation of the fine-tuned models through provisioned throughput ensures that the models meet the required performance criteria.
Tools
Tool usage in LLMs encompasses two essential operations: tool selection and argument extraction or generation. For example, consider a tool designed to retrieve weather information for a specific location. When presented with a query such as, ‘What’s the weather in London right now?’, the LLM assesses its available tools to determine if an appropriate tool exists. If a suitable tool is identified, the model selects it and extracts the necessary arguments – in this case, ‘London’ – to construct the tool call. This process of tool selection and argument extraction is fundamental to the functionality of tool-using LLMs.
Each tool is meticulously defined with a formal specification that outlines its intended functionality, mandatory and optional arguments, and associated data types. These precise definitions, referred to as tool config, ensure that tool calls are executed correctly and that argument parsing aligns with the tool’s operational requirements. The tool config acts as a contract between the LLM and the external tool, ensuring that the tool is used in the intended manner. Adhering to this requirement, the dataset used in this example defines eight distinct tools, each with its own arguments and configurations, all structured in a JSON format. The use of a standardized JSON format facilitates the integration of tools into the LLM ecosystem. The eight tools defined are as follows:
- weather_api_call: A custom tool designed for retrieving weather information. This tool allows the LLM to access real-time weather data for specific locations.
- stat_pull: A custom tool for identifying statistics. This tool enables the LLM to retrieve statistical information from various sources.
- text_to_sql: A custom tool for converting text to SQL queries. This tool allows the LLM to interact with databases using natural language.
- terminal: A tool for executing scripts within a terminal environment. This tool provides the LLM with the ability to execute system commands and interact with the operating system.
- wikipedia: A Wikipedia API tool for searching through Wikipedia pages. This tool allows the LLM to access a vast repository of information.
- duckduckgo_results_json: An internet search tool that utilizes DuckDuckGo to perform searches. This tool provides the LLM with access to a broader range of information beyond its inherent knowledge base.
- youtube_search: A YouTube API search tool for searching video listings. This tool allows the LLM to search for videos on YouTube.
- pubmed_search: A PubMed search tool for searching PubMed abstracts. This tool enables the LLM to access scientific and medical research information.
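To make the tool config concrete, here is a minimal sketch of how the weather tool might be specified using the Converse API's toolSpec structure; the argument names (location, unit) and descriptions are illustrative assumptions rather than the exact schema used in the dataset.

```python
# Sketch of one tool specification in the Converse API toolConfig format.
# The argument names ("location", "unit") are illustrative assumptions.
weather_tool_spec = {
    "toolSpec": {
        "name": "weather_api_call",
        "description": "Retrieve the current weather for a given location.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City or place to fetch weather for, for example 'London'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit to return",
                    },
                },
                "required": ["location"],
            }
        },
    }
}

# The complete tool config passed to the model lists all eight toolSpecs.
tool_config = {"tools": [weather_tool_spec]}
```

Each of the other seven tools is defined the same way, with its own name, description, and JSON input schema.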
Dataset
The dataset used in this solution is a synthetic tool calling dataset, created with the assistance of a foundation model (FM) from Amazon Bedrock and subsequently validated and adjusted manually. This dataset was developed for the set of eight tools previously discussed, with the aim of generating a diverse collection of questions and tool invocations that enable another model to learn from these examples and generalize to unseen tool invocations. The use of a synthetic dataset allows for greater control over the content and distribution of the data, ensuring that the model is trained on a representative sample of tool invocations.
Each entry within the dataset is structured as a JSON object, containing key-value pairs that define the question (a natural language user query for the model), the ground truth tool required to answer the user query, its arguments (a dictionary containing the parameters required to execute the tool), and additional constraints such as order_matters (a boolean indicating whether the order of arguments is critical) and arg_pattern (an optional regular expression, or regex, used for argument validation or formatting). These ground truth labels are used to supervise the training of pre-trained Amazon Nova models, adapting them for tool use. This process, known as supervised fine-tuning, is further explored in the following sections. The inclusion of constraints such as order_matters and arg_pattern allows for more precise control over the model’s behavior during tool invocation.
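For illustration, a single entry might look like the following sketch in Python dictionary form; the field names and argument keys are assumptions based on the description above, not a verbatim record from the dataset.

```python
# Hypothetical dataset entry following the structure described above.
example_entry = {
    "question": "Hey, what's the temperature in Paris right now?",
    "answer": "weather_api_call",    # ground truth tool name
    "args": {"location": "Paris"},   # arguments needed to execute the tool
    "order_matters": False,          # whether the order of arguments is critical
    "arg_pattern": None,             # optional regex for argument validation or formatting
}
```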
The training set comprises 560 questions, while the test set contains 120 questions, 15 per tool category across the eight tools. This balanced distribution of questions across tool categories ensures that the model is evaluated fairly and that its performance is not biased towards any particular tool.
Preparing the Dataset for Amazon Nova
To effectively utilize this dataset with Amazon Nova models, the data must be formatted according to a specific chat template. Native tool calling incorporates a translation layer that converts the inputs into the format the model expects before passing them on. In this solution, a DIY tool use approach is adopted, employing a custom prompt template: the system prompt and the user message embedded with the tool config are provided as input, and the ground truth labels are added as the assistant message. The use of a custom prompt template allows for greater flexibility and control over the interaction between the LLM and the external tools.
The format of the chat template is crucial for ensuring that the Amazon Nova model understands the context of the conversation and can accurately select and invoke the appropriate tools. The system prompt provides the model with overall instructions and guidelines, while the user message contains the specific query or request. The tool config provides the model with information about the available tools and their arguments. The ground truth labels provide the model with the correct answer or tool invocation, which is used during the fine-tuning process.
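The sketch below shows one way this assembly step could look in Python. It assumes the Bedrock conversation-style JSONL schema for fine-tuning data and an illustrative prompt wording for embedding the tool config in the user turn; the exact schema and wording should follow the format documented for the model version being customized.

```python
import json

SYSTEM_PROMPT = "You are a helpful assistant. Use the provided tools to answer the user."  # illustrative
tool_config = {"tools": []}  # in practice, the full tool config with all eight toolSpecs

# In practice this is the list of 560 training entries prepared earlier.
train_entries = [
    {
        "question": "Hey, what's the temperature in Paris right now?",
        "answer": "weather_api_call",
        "args": {"location": "Paris"},
    },
]

def build_training_record(entry, system_prompt, tools):
    """Assemble one fine-tuning record: system prompt, user message with the
    embedded tool config, and the ground truth tool call as the assistant turn."""
    user_text = (
        "Given the following tools, answer the question by returning the tool "
        f"name and its arguments.\n\nTools:\n{json.dumps(tools)}\n\n"
        f"Question: {entry['question']}"
    )
    assistant_text = json.dumps({"tool": entry["answer"], "args": entry["args"]})
    return {
        "schemaVersion": "bedrock-conversation-2024",  # assumed schema identifier
        "system": [{"text": system_prompt}],
        "messages": [
            {"role": "user", "content": [{"text": user_text}]},
            {"role": "assistant", "content": [{"text": assistant_text}]},
        ],
    }

# Write one JSON object per line (JSONL), the format expected by Bedrock fine-tuning jobs.
with open("train.jsonl", "w") as f:
    for entry in train_entries:
        f.write(json.dumps(build_training_record(entry, SYSTEM_PROMPT, tool_config)) + "\n")
```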
Uploading the Dataset to Amazon S3
This step is essential for enabling Amazon Bedrock to access the training data during the fine-tuning process. The dataset can be uploaded either through the Amazon Simple Storage Service (Amazon S3) console or programmatically. Amazon S3 provides a scalable and reliable storage solution for large datasets, making it ideal for use with Amazon Bedrock.
The ability to upload the dataset programmatically allows for automation of the data preparation process, which can be particularly useful for large-scale deployments. The Amazon S3 console provides a user-friendly interface for managing and accessing the dataset.
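As a minimal sketch, the prepared JSONL files could be uploaded programmatically with boto3; the bucket name and key prefixes below are placeholders.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-nova-tool-use-bucket"  # placeholder bucket name

# Upload the training and test sets so Amazon Bedrock can read them during fine-tuning.
s3.upload_file("train.jsonl", bucket, "tool-use/train/train.jsonl")
s3.upload_file("test.jsonl", bucket, "tool-use/test/test.jsonl")
```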
Tool Calling with Base Models Through the Amazon Bedrock API
With the tool use dataset created and formatted as required, it can be used to test the Amazon Nova models. Both the Converse and Invoke APIs can be used for tool use in Amazon Bedrock. The Converse API enables dynamic, context-aware conversations, allowing models to engage in multi-turn dialogues, while the Invoke API allows users to call and interact with the underlying models within Amazon Bedrock. The choice between the Converse and Invoke APIs depends on the specific requirements of the application.
The Converse API is well-suited for applications that require interactive conversations with the LLM, while the Invoke API is more appropriate for applications that require direct access to the underlying model. Both APIs provide a flexible and powerful interface for interacting with Amazon Nova models.
To use the Converse API, the messages, system prompt (if any), and the tool config are sent directly to the API. This allows the LLM to process the input and generate a response based on the provided context and tool information.
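A minimal sketch of such a call with boto3 is shown below, continuing from the tool config sketched earlier; the model ID, region, and prompt text are illustrative.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="us.amazon.nova-micro-v1:0",  # illustrative model ID
    system=[{"text": "You are a helpful assistant. Use the provided tools when needed."}],
    messages=[
        {
            "role": "user",
            "content": [{"text": "Hey, what's the temperature in Paris right now?"}],
        }
    ],
    toolConfig=tool_config,  # the tool config with all eight toolSpecs, sketched earlier
)
```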
To see how the tool and arguments are parsed from the LLM response, consider the question: ‘Hey, what’s the temperature in Paris right now?’ The model’s output is parsed to identify the tool and the arguments needed to answer the question. This parsing step is crucial for extracting the relevant information from the LLM’s response and using it to invoke the appropriate tool.
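With native tool calling through the Converse API, the selected tool and its arguments are returned as a toolUse content block; the sketch below, which continues from the call above, shows one way to pull them out.

```python
# Walk the content blocks of the model's reply and pick out the tool call, if any.
output_message = response["output"]["message"]
for block in output_message["content"]:
    if "toolUse" in block:
        tool_name = block["toolUse"]["name"]   # expected: "weather_api_call"
        tool_args = block["toolUse"]["input"]  # expected: {"location": "Paris"}
        print(f"Tool: {tool_name}, arguments: {tool_args}")
```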
Fine-tuning Amazon Nova Models for Enhanced Tool Usage
Fine-tuning is a crucial step in adapting pre-trained language models like Amazon Nova to specific tasks. By training the model on a dataset tailored to the desired application, the model can learn to perform the task with greater accuracy and efficiency. In the context of tool usage, fine-tuning can significantly improve the model’s ability to select the appropriate tool and extract the correct arguments. This targeted training allows the LLM to specialize in tool usage, optimizing its performance for this specific task.
The process of fine-tuning involves adjusting the model’s internal parameters to minimize the difference between its predictions and the ground truth labels in the training dataset. This is typically achieved through an iterative process, where the model is repeatedly exposed to the training data and its parameters are adjusted based on the observed errors. The goal of fine-tuning is to refine the model’s understanding of tool usage and improve its ability to generalize to new and unseen examples.
Preparing the Fine-tuning Dataset
The fine-tuning dataset should be carefully curated to reflect the types of questions and tool invocations that the model is expected to handle in real-world scenarios. The dataset should include a diverse range of examples, covering different tool categories and argument patterns. A well-curated dataset is essential for successful fine-tuning.
Each example in the dataset should consist of a question, the corresponding tool to be called, and the arguments required to execute the tool. The arguments should be formatted in a structured manner, typically as a JSON object. This structured format ensures that the model can easily parse and understand the arguments.
Fine-tuning Process
The fine-tuning process can be performed using the Amazon Bedrock console or APIs. The process involves specifying the model to be fine-tuned, the fine-tuning dataset, and the desired training parameters. Amazon Bedrock provides a convenient and efficient platform for fine-tuning LLMs.
The training parameters control various aspects of the fine-tuning process, such as the learning rate, batch size, and number of epochs. The learning rate determines the magnitude of the parameter adjustments made during each iteration. The batch size determines the number of examples processed in each iteration. The number of epochs determines the number of times the model is exposed to the entire training dataset. Careful selection of these parameters is crucial for achieving optimal performance.
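As a hedged sketch, a fine-tuning job could also be started programmatically through the Bedrock API; the role ARN, S3 URIs, base model identifier, and hyperparameter values below are placeholders, and the accepted hyperparameter names and ranges depend on the base model.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

job = bedrock.create_model_customization_job(
    jobName="nova-micro-tool-use-ft",
    customModelName="nova-micro-tool-use",
    roleArn="arn:aws:iam::111122223333:role/BedrockFineTuningRole",  # placeholder role
    baseModelIdentifier="amazon.nova-micro-v1:0",                    # illustrative base model
    customizationType="FINE_TUNING",
    hyperParameters={  # illustrative values; accepted names and ranges depend on the model
        "epochCount": "2",
        "learningRate": "0.0001",
        "batchSize": "1",
    },
    trainingDataConfig={"s3Uri": "s3://my-nova-tool-use-bucket/tool-use/train/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-nova-tool-use-bucket/tool-use/output/"},
)
print(job["jobArn"])
```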
Evaluating the Fine-tuned Model
After the fine-tuning process is complete, it is essential to evaluate the performance of the fine-tuned model. This can be done by testing the model on a separate test dataset that was not used during the fine-tuning process. This ensures that the evaluation is unbiased and provides an accurate assessment of the model’s generalization ability.
The test dataset should be representative of the types of questions and tool invocations that the model is expected to handle in real-world scenarios. The performance of the model can be evaluated by measuring metrics such as accuracy, precision, recall, and F1-score. These metrics provide a comprehensive assessment of the model’s performance on the tool usage task.
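The sketch below shows a simple way tool-selection and argument-extraction accuracy could be scored against the ground truth; it assumes each model response has already been parsed into a tool name and argument dictionary, and it computes plain accuracy rather than the full set of metrics listed above.

```python
def score_predictions(examples, predictions):
    """Compute tool-selection accuracy and exact-match argument accuracy.

    examples:    ground truth entries with "answer" (tool name) and "args" keys
    predictions: parsed model outputs with "tool" and "args" keys
    """
    tool_correct = 0
    args_correct = 0
    for gold, pred in zip(examples, predictions):
        if pred["tool"] == gold["answer"]:
            tool_correct += 1
            # Count arguments as correct only when the selected tool is also correct.
            if pred["args"] == gold["args"]:
                args_correct += 1
    n = len(examples)
    return {"tool_accuracy": tool_correct / n, "arg_accuracy": args_correct / n}
```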
Benefits of Customizing Amazon Nova Models for Tool Usage
Customizing Amazon Nova models for tool usage offers several benefits:
- Improved Accuracy: Fine-tuning the model on a task-specific dataset can significantly improve the accuracy of tool selection and argument extraction. This is perhaps the most significant benefit of customization.
- Increased Efficiency: Fine-tuned models can often perform tool usage tasks more efficiently than pre-trained models. This can lead to reduced latency and improved throughput.
- Enhanced Adaptability: Fine-tuning allows the model to adapt to specific domains and use cases. This makes the model more versatile and applicable to a wider range of problems.
- Reduced Costs: In some cases, fine-tuning can reduce the computational resources required to perform tool usage tasks. This can lead to significant cost savings, especially for large-scale deployments.
Conclusion
Customizing Amazon Nova models for tool usage is a valuable technique for enhancing the performance and adaptability of LLMs. By fine-tuning the model on a task-specific dataset, developers can significantly improve the accuracy, efficiency, and adaptability of tool usage applications. As industries increasingly demand AI solutions capable of making informed decisions, the customization of LLMs for tool usage will become increasingly important. The ability to tailor LLMs to specific tasks and domains is a key driver of innovation in the field of artificial intelligence. Furthermore, the cost-effectiveness and ease of use of Amazon Bedrock make it an attractive platform for customizing Amazon Nova models for tool usage. The future of AI lies in the ability to create specialized and adaptable models that can solve real-world problems with greater accuracy and efficiency.