Large language models (LLMs) are an appealing fit for classifying unsafe ad content. Because identifying policy-violating content demands deep contextual and cultural understanding, LLMs hold an advantage over traditional machine learning systems. However, fine-tuning LLMs for such complex tasks requires high-fidelity training data that is difficult and costly to curate at the necessary scale. Standard data-intensive training methods are expensive, especially in the face of concept drift, when safety policies change or new kinds of harmful ad content emerge; in the worst case, the model must be retrained on entirely new data. Reducing the amount of training data needed is therefore paramount.
In light of this, we present a novel, scalable curation process for active learning that substantially improves model alignment with human experts while reducing the amount of training data required for fine-tuning LLMs. The method can operate on datasets with hundreds of billions of examples, iteratively identifying the examples for which annotation would be most useful and using the resulting expert labels for fine-tuning. In our experiments, we reduced the training data needed from 100,000 to fewer than 500 examples while increasing model alignment with human experts by up to 65%. Production systems using larger models have reduced data scale even further, using up to four orders of magnitude less data while maintaining or improving quality.

Curation process

Our method begins with a zero- or few-shot initial model (LLM-0), prompted with a description of the content of interest, e.g., a definition of clickbait and the question "Is this ad clickbait?" LLM-0 then generates a large labeled dataset, marking each ad as either benign (blue in the figure below) or clickbait (orange), as shown in (1). Because only a small fraction of ads in production traffic are actually clickbait, this initial dataset is typically highly imbalanced, and because the model is not yet refined, its true positive rate is low. To find the most informative examples, we cluster the clickbait-labeled and benign-labeled examples separately; overlapping clusters indicate regions where the model confuses clickbait with benign content (2). For each overlapping cluster pair, we send the pairs of examples that are closest to each other but carry different labels to human experts for review (3). When needed to stay within our review budget, we prioritize pairs that cover a larger portion of the search space (4). The resulting curated set is both informative and diverse, since it concentrates on the most confusable examples along the decision boundary.

The expert-provided labels are randomly split into two sets. The first is used for model evaluation, based on two alignment metrics: model-human alignment, between the current model and the human experts, and internal alignment, which measures how much the experts agree with one another. The second set is used to fine-tune the current model, producing the next iteration. The process repeats until model-human alignment either matches the internal alignment or plateaus, with no further improvement.
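To make the loop above concrete, here is a minimal sketch of one curation iteration, assuming each ad has already been embedded as a vector and labeled by the current model. The clustering method (k-means), the centroid-distance overlap heuristic, and names such as `curation_round` and `review_budget` are illustrative assumptions, not details of our production system.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans


def curation_round(embeddings, model_labels, review_budget, n_clusters=50, seed=0):
    """Select the most informative, diverse examples to send for expert review."""
    rng = np.random.default_rng(seed)
    pos = np.where(model_labels == 1)[0]  # ads the current model labeled "clickbait"
    neg = np.where(model_labels == 0)[0]  # ads the current model labeled "benign"

    # (2) Cluster each label group separately.
    pos_km = KMeans(n_clusters=min(n_clusters, len(pos)), n_init=10,
                    random_state=seed).fit(embeddings[pos])
    neg_km = KMeans(n_clusters=min(n_clusters, len(neg)), n_init=10,
                    random_state=seed).fit(embeddings[neg])

    # Treat cluster pairs with nearby centroids as "overlapping": regions where
    # similar ads received different labels, i.e. likely model confusion.
    centroid_dist = cdist(pos_km.cluster_centers_, neg_km.cluster_centers_)
    overlapping = np.argwhere(centroid_dist <= np.quantile(centroid_dist, 0.05))

    # (3) Within each overlapping cluster pair, take the closest pair of
    # examples that carry different labels: the most confusable boundary cases.
    candidates = []
    for pc, nc in overlapping:
        p_idx = pos[pos_km.labels_ == pc]
        n_idx = neg[neg_km.labels_ == nc]
        d = cdist(embeddings[p_idx], embeddings[n_idx])
        i, j = np.unravel_index(np.argmin(d), d.shape)
        coverage = len(p_idx) + len(n_idx)  # proxy for the region of space covered
        candidates.append((coverage, int(p_idx[i]), int(n_idx[j])))

    # (4) Spend the review budget on pairs covering larger regions first.
    candidates.sort(reverse=True)
    selected = [idx for _, a, b in candidates[:review_budget] for idx in (a, b)]

    # Expert labels for the selected examples are then split at random into an
    # evaluation set and a fine-tuning set for the next model iteration.
    rng.shuffle(selected)
    half = len(selected) // 2
    return selected[:half], selected[half:]
```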
Metric
Our curation process does not assume the existence of ground truth. Many classification problems in the ads safety space, such as content moderation or fraud detection, are inherently ambiguous and require interpretation and discussion, even among policy experts. We therefore cannot rely on standard metrics like precision and recall, which require a ground truth label. Instead, we use Cohen's Kappa, which measures the degree to which two independent annotators agree beyond what would be expected by chance. A value of 1 indicates perfect agreement, values near 0 indicate no agreement beyond chance, and negative values indicate systematic disagreement. Although standards for interpreting these scores vary, values above .8 are generally regarded as exceptionally good, and values above .4 as acceptable. We use Cohen's Kappa in our experiments both as a quality indicator for datasets (including model evaluation during the curation process, as described above) and as a performance measure for models.
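As a small illustration, the snippet below computes Cohen's Kappa for two hypothetical annotators, first with scikit-learn and then directly from its definition (observed agreement corrected for chance agreement). The labels are made up purely for demonstration.

```python
# Cohen's Kappa on made-up labels from two hypothetical raters.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = np.array(["benign", "clickbait", "benign", "benign", "clickbait", "benign"])
rater_b = np.array(["benign", "clickbait", "benign", "clickbait", "clickbait", "benign"])

# Library computation.
print(cohen_kappa_score(rater_a, rater_b))

# Same value from the definition: kappa = (p_o - p_e) / (1 - p_e), where p_o is
# the observed agreement and p_e is the agreement expected by chance.
p_o = np.mean(rater_a == rater_b)
p_e = sum(np.mean(rater_a == l) * np.mean(rater_b == l) for l in ["benign", "clickbait"])
print((p_o - p_e) / (1 - p_e))  # matches the library value (~0.67 for these labels)
```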
Experiments
We wanted to know which models and tasks would benefit the most from our curation method. As baselines for our experiments, we fine-tuned two LLMs of different sizes (Gemini Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters) on two tasks of different complexity (lower and higher, based on expert alignment), using crowdsourced labels. Each crowdsourced dataset contains ~100K annotations and has a strong class imbalance, with around 95% benign labels on average.
We compared each of these four baseline conditions to the corresponding curated condition, in which each model (Nano-1 and Nano-2) is fine-tuned over multiple rounds of the curation process described above. At each iteration, we selected our curated set of examples and used them for model evaluation and fine-tuning. All models plateaued before reaching parity with the experts' internal alignment, so we stopped at 6 iterations (~400 fine-tuning and ~250 evaluation examples) for the lower complexity task and 5 iterations (~250 fine-tuning and ~150 evaluation examples) for the higher complexity task. (Note that the lower complexity task had a more diverse range of examples, which may explain why it took longer to converge.) The final class balance for both datasets was less than 40% positive examples. The table below summarizes the scale and quality of the data used in each condition. During the curation process, experts achieved an average pairwise Cohen's Kappa of .81 on the lower complexity task and .78 on the higher complexity task; we consider these the ceiling for model performance. To assess the quality of the crowdsourced data, we computed Kappa alignment between the crowdsourced annotations and the experts over our full curated set, which was .59 (lower complexity) and .41 (higher complexity).
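For reference, the two alignment metrics used in these experiments can be computed as average pairwise Kappa scores. The sketch below assumes we have per-example label arrays from several experts and from the fine-tuned model on a shared evaluation set; the function names are ours, not from the production system.

```python
# Hedged sketch of the alignment metrics, assuming per-example labels from
# several experts and from the model on the same evaluation examples.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score


def internal_alignment(expert_labels):
    """Average pairwise Cohen's Kappa across all pairs of experts."""
    return np.mean([cohen_kappa_score(a, b)
                    for a, b in combinations(expert_labels, 2)])


def model_human_alignment(model_labels, expert_labels):
    """Average Cohen's Kappa between the model and each individual expert."""
    return np.mean([cohen_kappa_score(model_labels, e) for e in expert_labels])

# Stopping rule for the curation loop: iterate until model-human alignment
# reaches the experts' internal alignment (the ceiling) or stops improving.
```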