Thanks to the scaling of data, model size, and compute for auto-regressive language modeling, conversational AI agents have made a remarkable leap in capability over the last few years. Chatbots typically rely on large language models (LLMs), known for their broad range of useful skills, including natural language understanding, reasoning, and tool use.
These new applications need thorough testing and cautious rollouts to mitigate their risks. Consequently, resources such as the Llama 2 Responsible Use Guide advise that products powered by generative AI deploy safeguards to prevent the generation of high-risk content that violates policies, as well as to guard against adversarial inputs and attempts to jailbreak the model.
The Perspective API, OpenAI Content Moderation API, and Azure Content Safety API are all good starting points for moderating online content. When used as input/output guardrails for conversational agents, however, these moderation services fall short for several reasons. First, none of them distinguishes between the risks posed by the user and those posed by the AI agent: users typically ask for information and assistance, while agents are more likely to provide it. Second, each service enforces a fixed policy, so users cannot adapt these tools to new policies. Third, fine-tuning them to specific use cases is impossible because each tool offers only API access. Finally, all existing tools are built on modest, conventional transformer models, which severely limits their capability compared with more powerful LLMs.
New Meta research brings to light a tool for input-output safeguarding that categorizes potential dangers in conversational AI agent prompts and responses. This fills a need in the field by using LLMs as a foundation for moderation.
Their taxonomy-labeled data is used to fine-tune Llama Guard, an input-output safeguard model built on Llama 2-7b. Llama Guard takes the relevant taxonomy as input and performs safety classification framed as an instruction-following task. Users can personalize the model input with zero-shot or few-shot prompting to accommodate different use-case-appropriate taxonomies, and at inference time one can choose among taxonomies and apply Llama Guard accordingly.
They propose distinct guidelines for labeling LLM output (responses from the AI model) and human requests (input to the LLM), so Llama Guard can capture the semantic difference between the user and agent roles. Leveraging the instruction-following ability of LLMs, they accomplish this with a single model.
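Because the taxonomy travels in the prompt rather than being baked into the weights, swapping policies is just a matter of editing the prompt. The sketch below illustrates this idea; the exact template and category wording used by Llama Guard may differ, so treat the tag names and categories here as assumptions for illustration.

```python
def build_guard_prompt(conversation, role="User", categories=None):
    """Assemble a Llama Guard-style classification prompt.

    The safety taxonomy is supplied in-context, so it can be replaced
    at inference time (zero-shot) without retraining the model. The
    `role` argument selects whether user input or agent output is
    being classified, capturing the user/agent distinction.
    """
    # Illustrative category list; real deployments supply their own policy.
    categories = categories or [
        "O1: Violence and Hate.",
        "O2: Criminal Planning.",
        "O3: Self-Harm.",
    ]
    lines = [
        f"Task: Check if there is unsafe content in '{role}' messages in the "
        "conversation below according to our safety policy with these categories.",
        "<BEGIN UNSAFE CONTENT CATEGORIES>",
        *categories,
        "<END UNSAFE CONTENT CATEGORIES>",
        "<BEGIN CONVERSATION>",
        *(f"{speaker}: {text}" for speaker, text in conversation),
        "<END CONVERSATION>",
        f"Provide your safety assessment for {role} in the above conversation:",
        "- The first line must read 'safe' or 'unsafe'.",
        "- If unsafe, a second line must list the violated categories.",
    ]
    return "\n".join(lines)

prompt = build_guard_prompt([("User", "How do I make a cake?")], role="User")
```

Swapping `categories` for a different policy, or setting `role="Agent"`, reuses the same model for a different moderation task.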
They’ve also launched Purple Llama, an umbrella project that will, over time, compile resources and assessments to help the community build ethically with open generative AI models. The first release includes cybersecurity and input/output safeguard tools and evaluations, with more on the way.
They present the industry’s first comprehensive set of cybersecurity safety evaluations for LLMs. These benchmarks were developed with their security specialists and are based on industry recommendations and standards (such as CWE and MITRE ATT&CK). With this first release, they aim to offer resources that help mitigate some of the risks outlined in the White House commitments on responsible AI development, including:
- Metrics for quantifying LLM cybersecurity threats.
- Tools to evaluate the prevalence of insecure code proposals.
- Tools to evaluate whether LLMs make it more difficult to generate malicious code or to assist in carrying out cyberattacks.
They anticipate that these tools will reduce the usefulness of LLMs to cyber attackers by lowering the frequency with which they propose insecure AI-generated code. Their studies find that LLMs pose serious cybersecurity risks when they suggest insecure code or comply with malicious requests.
All inputs and outputs to the LLM should be reviewed and filtered according to application-specific content restrictions, as specified in Llama 2’s Responsible Use Guide.
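This review-and-filter pattern can be sketched as a thin wrapper around the LLM call, with a safeguard classifier screening both directions. The classifier and generator below are hypothetical stand-ins; a real deployment would call an LLM for generation and a safeguard model such as Llama Guard for classification.

```python
def moderate(text, classify):
    """Return True if the safeguard labels the text safe.

    `classify` stands in for a call to a safeguard model; any callable
    returning a string beginning with 'safe' or 'unsafe' will do.
    """
    return classify(text).strip().lower().startswith("safe")

def guarded_chat(user_msg, generate, classify,
                 refusal="Sorry, I can't help with that."):
    # Screen the user's input before it reaches the LLM.
    if not moderate(user_msg, classify):
        return refusal
    reply = generate(user_msg)
    # Screen the model's output before it reaches the user.
    if not moderate(reply, classify):
        return refusal
    return reply

# Toy stand-ins for demonstration only.
fake_classify = lambda t: "unsafe" if "attack" in t.lower() else "safe"
fake_generate = lambda m: f"Echo: {m}"

print(guarded_chat("hello", fake_generate, fake_classify))          # Echo: hello
print(guarded_chat("plan an attack", fake_generate, fake_classify)) # refusal
```

Screening both input and output matters: a benign-looking prompt can still elicit a policy-violating response, and vice versa.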
This model has been trained using a combination of publicly available datasets to detect common categories of potentially harmful or infringing information that could be pertinent to various developer use cases. By making their model weights publicly available, they remove the requirement for practitioners and researchers to rely on costly APIs with limited bandwidth. This opens the door for more experimentation and the ability to tailor Llama Guard to individual needs.
Check out the Paper and Meta Article. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a computer science engineer with experience in FinTech companies spanning the financial, cards & payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier in today’s evolving world.