Tackling Generated Datasets Diversity

In the previous chapter, "Generating Synthetic Datasets with LLMs," we discussed the potential of using large language models (LLMs) to generate synthetic datasets for further fine-tuning a local retrieval model. That approach is possible thanks to a large corpus of unlabeled documents: each document is used to generate one or more synthetic queries, forming query-document pairs.

But what if information retrieval is not your task? Suppose you are working on a legal document classification problem and are not permitted to send any data to an external API. In this situation, you need to train a local model, but collecting data can become a serious obstacle and cause delays in product development.

For simplicity, let's say the task is to generate children's stories. This task was the starting point for the research of Eldan et al. (2023). Each story consists of 2-3 paragraphs that follow a simple plot and theme, while the whole dataset covers a child's vocabulary and factual knowledge.

Language is not just a system of rules and symbols; it conveys and interprets meaning. The main challenge in using large language models to produce training data is ensuring dataset diversity: even at a high generation temperature, a model can still produce repetitive datasets that lack the required diversity (even within a child's vocabulary). Coherence and relevance are further natural language generation challenges.

To address the diversity issue, the authors prepared a vocabulary of roughly 1,500 basic words, reflecting a typical child's vocabulary, split into nouns, verbs, and adjectives. For each generation, one verb, one noun, and one adjective were selected, and the model was asked to generate a story integrating these random words.

This technique significantly broadens dataset diversity, ensuring that the stories cover a child's entire vocabulary and diverse combinations of concepts. In addition, the authors integrated possible story features (such as dialogue, a plot twist, a bad ending, or a moral lesson); for each story, a random subset of these features was selected, and the model was prompted to include them.

Prompt:

Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would likely understand. The story should use the verb "{random.choice(verbs_list)}", the noun "{random.choice(nouns_list)}" and the adjective "{random.choice(adjectives_list)}". The story should have the following features: {random.choice(features_list)}, {random.choice(features_list)}. Remember to only use simple words!
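
As a minimal sketch, the template above could be filled in Python roughly as follows; the tiny word and feature lists here are illustrative stand-ins for the roughly 1,500-word vocabulary used in the paper:

import random

# Tiny illustrative stand-ins for the paper's ~1,500-word vocabulary
verbs_list = ["decorate", "run", "sing", "find"]
nouns_list = ["thunder", "castle", "puppy", "garden"]
adjectives_list = ["ancient", "shiny", "tiny", "happy"]
features_list = [
    "the story should contain at least one dialogue",
    "the story has a bad ending",
    "the story has a moral value",
]

# random.sample avoids drawing the same feature twice
feature_a, feature_b = random.sample(features_list, 2)

prompt = (
    f"Write a short story (3-5 paragraphs) which only uses very simple words "
    f"that a 3 year old child would likely understand. The story should use "
    f'the verb "{random.choice(verbs_list)}", the noun '
    f'"{random.choice(nouns_list)}" and the adjective '
    f'"{random.choice(adjectives_list)}". The story should have the following '
    f"features: {feature_a}, {feature_b}. Remember to only use simple words!"
)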

Let's suppose the placeholders were filled in as follows:

Prompt:

Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would likely understand. The story should use the verb "decorate", the noun "thunder" and the adjective "ancient". The story should have the following features: the story should contain at least one dialogue, the story has a bad ending. Remember to only use simple words!

Output:

Once upon a time, in an ancient house, there lived a girl named Lily. She loved to decorate her room with pretty things. One day, she found a big box in the attic. She opened it and saw many shiny decorations. Lily was very happy and decided to use them in her room. As Lily was decorating her room, the sky outside became dark. There was a loud thunder sound, and Lily got scared. She ran to her mom and said, "Mommy, the thunder is so loud!" Her mom hugged her and said, "Don't worry, it will pass soon." But the thunder did not stop. It got louder and louder, and the ancient house started to shake. Suddenly, the roof fell down on the pretty decorations. Lily was sad because her room was not pretty anymore. The end.

In this chapter we leveraged the latest text generation models (GPT-3.5 and GPT-4), which can produce vast amounts of synthetic content while following instructions closely. Since the prompt is randomly assembled each time and the model adheres to it strictly, the stories turn out to be remarkably different; even when only one or two words change, the resulting narrative differs completely. In essence, we apply randomness to the prompt to obtain a diverse dataset.

The process can be summarized as follows:

  1. Identify which parameters / entities may vary between different samples;
  2. Generate, or manually compile, a collection of these entities to fill in the gaps;
  3. Produce the dataset by randomly selecting entities for insertion, ideally with the generation temperature set higher than the default but below the maximum (see the sketch after this list);
  4. Train a local model on the generation results.
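
A minimal sketch of such a generation loop, assuming the legacy openai Python SDK (pre-1.0) with an API key set in the environment; the entity lists and the sample count are placeholders:

import random
import openai  # legacy SDK (<1.0); expects OPENAI_API_KEY in the environment

# Placeholder entity collections (step 2)
verbs_list = ["decorate", "run", "sing"]
nouns_list = ["thunder", "castle", "puppy"]
adjectives_list = ["ancient", "shiny", "tiny"]

def build_prompt():
    # Step 3: fill the gaps by randomly selecting entities
    return (
        f"Write a short story for a 3 year old child that uses the verb "
        f'"{random.choice(verbs_list)}", the noun "{random.choice(nouns_list)}" '
        f'and the adjective "{random.choice(adjectives_list)}". '
        f"Use only very simple words!"
    )

dataset = []
for _ in range(100):  # scale this up for a real dataset
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt()}],
        temperature=0.9,  # above the default, below the maximum
    )
    dataset.append(response.choices[0].message.content)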

It is worth noting that one of the entities can be used to seed the generation with a label. For example, in a sentiment classification task, you can mention "positive" or "negative" directly in the prompt to generate text with the corresponding label; this data is then used to train a local classifier, as sketched below.
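
A minimal sketch of label seeding for sentiment classification; the prompt wording and the generate_text helper are illustrative assumptions:

import random

labels = ["positive", "negative"]

def make_labeled_example(generate_text):
    # generate_text is a hypothetical wrapper around your LLM call
    label = random.choice(labels)
    prompt = (
        f"Write a short {label} review of a household appliance, "
        f"written in natural, everyday language."
    )
    text = generate_text(prompt)
    # The label was seeded into the prompt, so the classifier's
    # target value is known in advance for every sample.
    return text, label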

Iterative (Hierarchical) Synthetic Data Generation

The format can be made more complex by using additional entities, some of which are generated beforehand by the LLM itself. For example, one can start by generating a summary of the story and writing down one sentence that must appear in it. Afterwards, this generated intermediate data is simply used in the final request.

Prompt:

Summary: {a short summary generated by LLM, using the approach above}
Features: {copy the features from the initial prompt}
Sentence: {a sentence generated by LLM, which should be present in the story}
Words: {copy the words from the initial prompt}
Story:
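
A minimal sketch of this two-stage chaining, where call_llm is a hypothetical wrapper around the model API:

def call_llm(prompt):
    # Hypothetical wrapper around the model API; replace with a real call
    raise NotImplementedError

def generate_story(words, features):
    # Stage 1: ask the LLM for the intermediate entities
    summary = call_llm(
        f"Write a one-sentence summary of a children's story "
        f"that uses the words: {', '.join(words)}."
    )
    sentence = call_llm(
        f"Write one simple sentence that must appear in a story "
        f"with this summary: {summary}"
    )
    # Stage 2: assemble the final request from the intermediate data
    final_prompt = (
        f"Summary: {summary}\n"
        f"Features: {', '.join(features)}\n"
        f"Sentence: {sentence}\n"
        f"Words: {', '.join(words)}\n"
        f"Story:"
    )
    return call_llm(final_prompt)

# story = generate_story(
#     words=["disagree", "network", "beautiful"],
#     features=["Dialogue", "Foreshadowing", "Twist"],
# )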

An example of the filled-in prompt:

Prompt:

Summary: Lily and Timmy build a sandcastle together and learn to compromise, but it gets knocked over by a gust of wind. They find beauty in the broken sandcastle and play happily with a butterfly.
Features: Dialogue, Foreshadowing, Twist
Sentence: One day, she went to the park and saw a beautiful butterfly.
Words: disagree, network, beautiful
Story:

Output:

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, she went to the park and saw a beautiful butterfly. She tried to catch it, but it flew away. Lily’s friend, Timmy, came to play with her. They decided to build a sandcastle, but they disagreed on how to build it. Lily wanted to make it tall, while Timmy wanted to make it wide. They argued for a while, but finally agreed to make it both tall and wide. As they were building the sandcastle, they noticed a group of ants working together. They realized that the ants were building a network of tunnels. Suddenly, a gust of wind came and knocked over their sandcastle. They were sad, but then they saw the butterfly again. It landed on the remains of their sandcastle and they realized that it was still beautiful, even in its broken state. They smiled and played together happily.

In this way it is possible to generate hundreds of thousands of very different examples for training a model. Suppose, for instance, that you need to train a classifier that detects whether a text contains a dialogue or a plot twist: since the initial prompt contains these labels, the target value for each generated sample is known in advance.
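
A minimal sketch of pairing each generated text with its known target; the feature pool and the generate callable (a story generator like the one sketched above) are illustrative assumptions:

import random

ALL_FEATURES = ["Dialogue", "Foreshadowing", "Twist", "Bad ending"]

def make_classifier_example(generate):
    # generate is a hypothetical callable that seeds the given
    # features into the prompt and returns the story text
    features = random.sample(ALL_FEATURES, k=2)
    text = generate(features)
    # Multi-hot target: the labels come for free from the prompt
    target = {f: (f in features) for f in ALL_FEATURES}
    return text, target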

On Synthetic Datasets: Improving Language Models' Effectiveness in Real Applications

A key question is whether a synthetic dataset can bring real benefits when training networks for real-world applications. The authors demonstrated experimentally that training smaller language models on synthetic data from a more advanced language model can indeed improve their performance.

In their study, Gunasekar et al. (2023) emphasize the importance of high-quality training data for their model. They argue that language models would be more effective in real applications if they were trained on "textbook-like" material: clear, detailed, informative, and unbiased.

These principles formed the basis for creating a semi-synthetic dataset used to train an LLM called Phi-1. The main evaluation task was to generate a Python function that follows a given text description or docstring, and the model's quality was assessed on the HumanEval benchmark (Chen et al., 2021).

The authors stress the importance of diversity for the model's performance, for several reasons:

  • It exposes the model to a wide variety of coding expressions and problem-solving approaches;
  • It reduces the risk of overfitting or over-reliance on specific patterns;
  • It improves the model's ability to handle unfamiliar or novel tasks.

To address the code-writing challenge, the authors created textbook-like documents focused on topics that promote reasoning and basic algorithmic skills. To achieve diversity, they imposed constraints on the following (see the sketch after this list):

  • Topics
  • Target audience
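
The authors' actual template was not published (as noted below), so the following is only a hypothetical sketch of how such constraints might be randomized:

import random

# Hypothetical pools of topics and audiences
topics = ["singular matrices", "binary search", "regular expressions"]
audiences = [
    "a high school student",
    "a 1st-year Computer Science bachelor",
    "a PhD researcher",
]

def build_textbook_prompt():
    return (
        f"Write an extract from a Computer Science textbook for "
        f"{random.choice(audiences)}. The coding language is Python 3.6. "
        f"The topic is: {random.choice(topics)}. Keep the extract concise "
        f"and include 1-2 short code snippets."
    )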

Unfortunately, the authors did not provide specific information about the prompt template used to generate the synthetic training data; they did, however, share examples of the resulting output. They also chose to use ChatGPT (GPT-3.5) rather than GPT-4, and even this strategy yielded excellent results once the model was fine-tuned on the synthetic data.

Generated textbook-like sample

Image source: Gunasekar et al. (2023)

It is important to note the constraint on the target audience. Specifying an audience is an excellent way to increase the diversity of synthetic data: a high school student, a computer science undergraduate, and a PhD candidate will each phrase and process the same information differently, making the generated data even more varied. Let's try to recreate the prompt.

Prompt:

Write an extract from a Computer Science textbook for a 1st-year bachelor. The coding language is Python 3.6.

This is an extract from the middle of the following topic: Singular matrices.

The extract starts with a high-level overview of the topic. Then, it presents an example and describes the solution in natural language. After that, it provides 1-2 code snippets, following the example. Each snippet has no more than 10 rows. There should be no text after code snippets.

Keep in mind that the extract should be concise and explain only one concept at a time.  The code should represent a Python function & its call. All the common ML/DS libraries are available.

Don't be too verbose. The expected length of the extract is 2 paragraphs at most.

Output:

**Singular Matrices**

In linear algebra, a matrix is termed as "singular" if it doesn't have an inverse. Another equivalent definition is when the determinant of the matrix is zero. Singular matrices are crucial because they represent systems of linear equations that do not have a unique solution.

Consider a 2x2 matrix `[[a, b], [c, d]]`. If the product of `ad` and `bc` is the same (i.e., `ad = bc`), then this matrix is singular. Let's explore this with an example: given the matrix `[[2, 4], [1, 2]]`, you'll find that the product of the diagonals yields the same result (2 * 2 = 4 * 1), thus it's a singular matrix.

import numpy as np

def is_singular(matrix):
    determinant = np.linalg.det(matrix)
    return abs(determinant) < 1e-9

matrix_example = np.array([[2, 4], [1, 2]])
print(is_singular(matrix_example))  # returns True

Pretty close! Overall, the authors generated 1B tokens to augment the model's training set, allowing a smaller model (only 1.5B parameters) to rival models ten times its size (for details, see Gunasekar et al. (2023): https://arxiv.org/abs/2306.11644).

Phi-1 metrics, compared to bigger models.

Image source: Gunasekar et al. (2023)

For your own task, you will probably not need such a massive amount of synthetic data (the authors studied pretraining, which demands substantial resources). Still, even as a rough estimate, at $0.002 per 1K tokens (standard ChatGPT pricing), generating those tokens would cost about $2,000, with roughly the same amount again for the prompts.

Keep in mind that fine-tuning on synthetic data becomes more valuable as the domain gets more niche, especially if the language deviates from English (among other factors). Additionally, this method works well with Chain-of-Thought (CoT) prompting, helping a local model improve its reasoning capabilities; other prompting techniques work as well. And don't forget that models such as Alpaca (Taori et al., 2023) and Vicuna (Zheng et al., 2023) excel precisely through fine-tuning on synthetic data.