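Search with vector embeddings

This page shows you how to use Firestore to perform K-nearest neighbor (KNN) vector searches using the following techniques:

Store vector values
Create and manage KNN vector indexes
Make a K-nearest neighbor (KNN) query using one of the supported vector distance measures

Store vector embeddings

You can use a model to create vector values, such as text embeddings, from your Firestore data and store them in Firestore documents.

Write operation with a vector embedding

The following example shows how to store a vector embedding in a Firestore document: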

Python

from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector

firestore_client = firestore.Client()
collection = firestore_client.collection("coffee-beans")
doc = {
    "name": "Kahawa coffee beans",
    "description": "Information about the Kahawa coffee beans.",
    "embedding_field": Vector([1.0, 2.0, 3.0]),
}

collection.add(doc)

Node.js

import {
  Firestore,
  FieldValue,
} from "@google-cloud/firestore";

const db = new Firestore();
const coll = db.collection('coffee-beans');
await coll.add({
  name: "Kahawa coffee beans",
  description: "Information about the Kahawa coffee beans.",
  embedding_field: FieldValue.vector([1.0, 2.0, 3.0])
});

Compute vector embeddings with a Cloud Function

To calculate and store vector embeddings whenever a document is created or updated, you can set up a Cloud Run function:

Python

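import functions_framework
from google.cloud import firestore
from google.events.cloud import firestore as firestore_events

firestore_client = firestore.Client()


@functions_framework.cloud_event
def store_embedding(cloud_event) -> None:
    """Triggered by a change to a Firestore document."""
    # ParseFromString fills the message in place and does not return it,
    # so keep a reference to the message itself.
    payload = firestore_events.DocumentEventData()
    payload._pb.ParseFromString(cloud_event.data)

    # from_payload and calculate_embedding are helpers that you define.
    collection_id, doc_id = from_payload(payload)
    # Call a function to calculate the embedding
    embedding = calculate_embedding(payload)
    # Update the document
    doc = firestore_client.collection(collection_id).document(doc_id)
    doc.set({"embedding_field": embedding}, merge=True)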

Node.js

/**
 * A vector embedding will be computed from the
 * value of the `content` field. The vector value
 * will be stored in the `embedding` field. The
 * field names `content` and `embedding` are arbitrary
 * field names chosen for this example.
 */
async function storeEmbedding(event: FirestoreEvent<any>): Promise<void> {
  // Get the previous value of the document's `content` field.
  const previousDocumentSnapshot = event.data.before as QueryDocumentSnapshot;
  const previousContent = previousDocumentSnapshot.get("content");

  // Get the current value of the document's `content` field.
  const currentDocumentSnapshot = event.data.after as QueryDocumentSnapshot;
  const currentContent = currentDocumentSnapshot.get("content");

  // Don't update the embedding if the content field did not change
  if (previousContent === currentContent) {
    return;
  }

  // Call a function to calculate the embedding for the value
  // of the `content` field.
  const embeddingVector = calculateEmbedding(currentContent);

  // Update the `embedding` field on the document.
  await currentDocumentSnapshot.ref.update({
    embedding: embeddingVector,
  });
}
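
The Node.js function above skips the update when the content field did not change, which also keeps the function from reacting to its own write of the embedding. The following Python sketch isolates that decision so it can be tested without Firestore. The names embedding_update and fake_embedding are hypothetical, plain dictionaries stand in for the before and after snapshots, and fake_embedding is only a placeholder for a real model call.

```python
from typing import Any, Callable, Mapping, Optional, Sequence


def embedding_update(
    before: Optional[Mapping[str, Any]],
    after: Optional[Mapping[str, Any]],
    calculate_embedding: Callable[[str], Sequence[float]],
    content_field: str = "content",
    embedding_field: str = "embedding",
) -> Optional[dict]:
    """Return the fields to write back, or None when no update is needed."""
    if after is None:
        # The document was deleted, so there is nothing to embed.
        return None
    current = after.get(content_field)
    previous = before.get(content_field) if before else None
    if current is None or current == previous:
        # Missing or unchanged content: skip the model call and the write.
        return None
    return {embedding_field: list(calculate_embedding(current))}


def fake_embedding(text: str) -> list:
    """Placeholder for a real embedding model (assumption for this sketch)."""
    return [float(len(text)), float(text.count(" ")), 1.0]


# A newly created document (no "before" state) gets an embedding.
update = embedding_update(None, {"content": "dark roast"}, fake_embedding)
```

In a real function, the returned dictionary would be passed to the document reference's update or set call, and the before and after mappings would come from the trigger payload.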

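Create and manage vector indexes

Before you can perform a nearest-neighbor search with your vector embeddings, you must create a corresponding index. The following examples demonstrate how to create and manage vector indexes.

Create a vector index

Before you create a vector index, upgrade to the latest version of the Google Cloud CLI: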

gcloud components update

To create a vector index, use gcloud firestore indexes composite create:

gcloud

gcloud firestore indexes composite create \
--collection-group=collection-group \
--query-scope=COLLECTION \
--field-config field-path=vector-field,vector-config='vector-configuration' \
--database=database-id

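where:

collection-group is the ID of the collection group.
vector-field is the name of the field that contains the vector embedding.
database-id is the ID of the database.
vector-configuration includes the vector dimension and the index type. The dimension is an integer up to 2,048. The index type must be flat. Format the index configuration as follows: {"dimension":"DIMENSION", "flat": "{}"}.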
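
If you generate the index command from a script, a small helper can build the vector-configuration value and reject dimensions above the documented limit before gcloud is ever called. This is a minimal sketch: build_vector_config is a hypothetical name, and the lower bound of 1 is an assumption that this page does not state.

```python
import json

MAX_DIMENSION = 2048  # upper bound stated on this page


def build_vector_config(dimension: int) -> str:
    """Return the JSON string to pass as vector-config for a flat index."""
    if isinstance(dimension, bool) or not isinstance(dimension, int):
        raise TypeError("dimension must be an integer")
    if not 1 <= dimension <= MAX_DIMENSION:  # lower bound is an assumption
        raise ValueError(f"dimension must be between 1 and {MAX_DIMENSION}")
    # Both values are strings, matching the format shown above.
    return json.dumps({"dimension": str(dimension), "flat": "{}"})


config = build_vector_config(1024)
```

The resulting string can be wrapped in single quotes and substituted for vector-configuration in the command.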

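The following example creates a composite index that includes a vector index for the field vector-field and an ascending index for the field color. You can use this type of index to pre-filter data before a nearest-neighbor search.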

gcloud

gcloud firestore indexes composite create \
--collection-group=collection-group \
--query-scope=COLLECTION \
--field-config=order=ASCENDING,field-path="color" \
--field-config field-path=vector-field,vector-config='{"dimension":"1024", "flat": "{}"}' \
--database=database-id

List all vector indexes

gcloud

gcloud firestore indexes composite list --database=database-id

Replace database-id with the ID of the database.

Delete a vector index

gcloud

gcloud firestore indexes composite delete index-id --database=database-id

where:

index-id is the ID of the index to delete. Use indexes composite list to retrieve the index ID.
database-id is the ID of the database.

Describe a vector index

gcloud

gcloud firestore indexes composite describe index-id --database=database-id

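where:

index-id is the ID of the index to describe. Use indexes composite list to retrieve the index ID.
database-id is the ID of the database.

Make a nearest-neighbor query

You can perform a similarity search to find the nearest neighbors of a vector embedding. Similarity searches require vector indexes. If an index doesn't exist, Firestore suggests an index to create using the gcloud CLI.

The following example finds the 10 nearest neighbors of the query vector.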

Python

from google.cloud import firestore
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector

db = firestore.Client()
collection = db.collection("coffee-beans")

# Requires a single-field vector index
vector_query = collection.find_nearest(
    vector_field="embedding_field",
    query_vector=Vector([3.0, 1.0, 2.0]),
    distance_measure=DistanceMeasure.EUCLIDEAN,
    limit=10,  # return the 10 nearest neighbors
)

Node.js

import {
  Firestore,
  FieldValue,
  VectorQuery,
  VectorQuerySnapshot,
} from "@google-cloud/firestore";

// Requires a single-field vector index
const vectorQuery: VectorQuery = coll.findNearest({
  vectorField: 'embedding_field',
  queryVector: [3.0, 1.0, 2.0],
  limit: 10,
  distanceMeasure: 'EUCLIDEAN'
});

const vectorQuerySnapshot: VectorQuerySnapshot = await vectorQuery.get();
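
Conceptually, a nearest-neighbor query ranks documents by their distance to the query vector and keeps the closest ones up to the limit. The following pure-Python sketch illustrates that idea on an in-memory dictionary. It is not how Firestore executes the query, and the names nearest_neighbors and docs are made up for this illustration.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def nearest_neighbors(
    docs: Dict[str, Sequence[float]],
    query_vector: Sequence[float],
    limit: int,
) -> List[Tuple[str, float]]:
    """Return up to `limit` (doc_id, distance) pairs, closest first."""
    # math.dist computes the Euclidean distance and raises ValueError
    # when the two vectors have different dimensions.
    scored = ((math.dist(vec, query_vector), doc_id) for doc_id, vec in docs.items())
    return [(doc_id, dist) for dist, doc_id in heapq.nsmallest(limit, scored)]


docs = {
    "a": [3.0, 1.0, 2.0],
    "b": [0.0, 0.0, 0.0],
    "c": [3.0, 1.0, 3.0],
}
result = nearest_neighbors(docs, [3.0, 1.0, 2.0], limit=2)
```

Here result contains "a" first, because it is identical to the query vector, followed by "c", which differs by one unit in a single coordinate.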

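Vector distances

Nearest-neighbor queries support the following options for vector distance:

EUCLIDEAN: Measures the Euclidean distance between the vectors. To learn more, see Euclidean distance.
COSINE: Compares vectors based on the angle between them, which lets you measure similarity that doesn't depend on the magnitude of the vectors. For unit-normalized vectors, we recommend DOT_PRODUCT instead of COSINE distance: the two are mathematically equivalent for such vectors, but DOT_PRODUCT performs better. To learn more, see Cosine similarity.
DOT_PRODUCT: Similar to COSINE, but affected by the magnitude of the vectors. To learn more, see Dot product.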
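
To make the three options concrete, here is a minimal pure-Python sketch of each measure. It assumes that COSINE distance is defined as 1 minus the cosine similarity, which is consistent with the 0-to-2 range this page gives for COSINE. The function names are illustrative and are not part of any Firestore client library.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """DOT_PRODUCT: larger values mean more similar vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """EUCLIDEAN: straight-line distance; smaller means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """COSINE: 1 - cosine similarity; depends only on the angle."""
    norms = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    if norms == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot_product(a, b) / norms


# Two vectors that point in the same direction but differ in magnitude.
a, b = [3.0, 4.0], [6.0, 8.0]
distances = (euclidean_distance(a, b), cosine_distance(a, b), dot_product(a, b))
```

For these two vectors the Euclidean distance is 5, the cosine distance is 0 because the directions match, and the dot product is 50 because it grows with magnitude.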

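Choose the distance measure

Whether all of your vector embeddings are normalized determines which distance measure you should use. A normalized vector embedding has a magnitude (length) of exactly 1.0.

In addition, if you know which distance measure your model was trained with, use that distance measure to compute the distance between your vector embeddings.

Normalized data

If all the vector embeddings in your dataset are normalized, then all three distance measures provide the same semantic search results. In essence, although each distance measure returns a different value, those values sort the same way. When embeddings are normalized, DOT_PRODUCT is usually the most computationally efficient, but the difference is negligible in most cases. However, if your application is highly performance sensitive, DOT_PRODUCT might help with performance tuning.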
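
The reason the orderings agree is that for unit vectors a and b, the squared Euclidean distance equals 2 - 2(a · b), so a larger dot product always means a smaller Euclidean and cosine distance. The sketch below checks this on a tiny made-up dataset. It again assumes COSINE distance is 1 minus cosine similarity, and normalize and rank are hypothetical helper names.

```python
import math


def normalize(v):
    """Scale v so that its length is 1.0."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Sort keys: smaller key means more similar.
SORT_KEYS = {
    "EUCLIDEAN": lambda v, q: math.dist(v, q),
    # For unit vectors the cosine similarity is just the dot product.
    "COSINE": lambda v, q: 1.0 - dot(v, q),
    # A larger dot product means more similar, so sort on its negative.
    "DOT_PRODUCT": lambda v, q: -dot(v, q),
}


def rank(docs, query, measure):
    """Return document IDs ordered from most to least similar."""
    key = SORT_KEYS[measure]
    return sorted(docs, key=lambda doc_id: key(docs[doc_id], query))


raw = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [-1.0, 0.5]}
docs = {doc_id: normalize(v) for doc_id, v in raw.items()}
query = normalize([2.0, 1.0])
orders = [rank(docs, query, m) for m in ("EUCLIDEAN", "COSINE", "DOT_PRODUCT")]
```

All three entries of orders hold the same sequence of document IDs, even though the underlying distance values differ.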

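Non-normalized data

If the vector embeddings in your dataset aren't normalized, then using DOT_PRODUCT as a distance measure is not mathematically correct, because a dot product doesn't measure distance. Depending on how the embeddings were generated and what type of search you prefer, either the COSINE or the EUCLIDEAN distance measure produces search results that are subjectively better than the other distance measures. Experiment with both COSINE and EUCLIDEAN to determine which is best for your use case.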
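
A small made-up dataset shows how the measures can disagree once magnitudes vary. In this sketch the dot product favors the longest vector, COSINE favors the vector with the same direction as the query, and EUCLIDEAN favors the vector that is physically closest. The document IDs are illustrative, and COSINE distance is assumed to be 1 minus cosine similarity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector magnitude."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
docs = {
    "same_direction_far": [10.0, 0.0],
    "slightly_off_near": [1.0, 1.0],
    "long_off_axis": [20.0, 20.0],
}

# DOT_PRODUCT treats the largest value as the best match.
best_by_dot = max(docs, key=lambda d: dot(docs[d], query))
# COSINE and EUCLIDEAN treat the smallest value as the best match.
best_by_cosine = min(docs, key=lambda d: cosine_distance(docs[d], query))
best_by_euclidean = min(docs, key=lambda d: math.dist(docs[d], query))
```

Each measure picks a different top result here, which is why trying both COSINE and EUCLIDEAN on your own data is worthwhile.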

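Unsure whether data is normalized or non-normalized

If you're unsure whether your data is normalized and you want to use DOT_PRODUCT, we recommend that you use COSINE instead. COSINE is like DOT_PRODUCT with normalization built in. Distance measured using COSINE ranges from 0 to 2. A result that is close to 0 indicates that the vectors are very similar.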
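
The phrase "normalization built in" can be read literally: normalize both vectors, take their dot product, and subtract it from 1. The sketch below follows that reading, which is an assumption consistent with the 0-to-2 range stated above, and evaluates the three boundary cases.

```python
import math


def normalize(v):
    """Scale v so that its length is 1.0."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - dot product of the normalized vectors."""
    return 1.0 - dot(normalize(a), normalize(b))


same_direction = cosine_distance([2.0, 0.0], [5.0, 0.0])   # lower end of the range
orthogonal = cosine_distance([2.0, 0.0], [0.0, 3.0])       # middle of the range
opposite = cosine_distance([2.0, 0.0], [-4.0, 0.0])        # upper end of the range
```

The magnitudes 2, 5, 3, and 4 have no effect on the results; only the direction of each vector matters.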

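Pre-filter documents

To pre-filter documents before finding the nearest neighbors, you can combine a similarity search with other query operators. The and and or composite filters are supported. For more information about supported field filters, see Query operators.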

Python

from google.cloud import firestore
from google.cloud.firestore_v1.base_query import FieldFilter
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector

db = firestore.Client()
collection = db.collection("coffee-beans")

# Similarity search with pre-filter
# Requires a composite vector index
vector_query = collection.where(filter=FieldFilter("color", "==", "red")).find_nearest(
    vector_field="embedding_field",
    query_vector=Vector([3.0, 1.0, 2.0]),
    distance_measure=DistanceMeasure.EUCLIDEAN,
    limit=5,
)

Node.js

// Similarity search with pre-filter
// Requires composite vector index
const preFilteredVectorQuery: VectorQuery = coll
    .where("color", "==", "red")
    .findNearest({
      vectorField: "embedding_field",
      queryVector: [3.0, 1.0, 2.0],
      limit: 5,
      distanceMeasure: "EUCLIDEAN",
    });

const vectorQueryResults = await preFilteredVectorQuery.get();
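
The order of operations matters with pre-filtering: the filter is applied first, and only the surviving documents compete for the nearest-neighbor slots. The following in-memory sketch illustrates that ordering with plain dictionaries. prefiltered_nearest and the sample documents are hypothetical, and the sketch says nothing about how Firestore evaluates the query internally.

```python
import math
from typing import Any, Dict, List, Sequence


def prefiltered_nearest(
    docs: List[Dict[str, Any]],
    field: str,
    value: Any,
    query_vector: Sequence[float],
    limit: int,
    vector_field: str = "embedding_field",
) -> List[str]:
    """Filter on field == value, then return the closest document IDs."""
    # Step 1: keep only documents that pass the equality filter.
    candidates = [d for d in docs if d.get(field) == value]
    # Step 2: rank the remaining documents by Euclidean distance.
    candidates.sort(key=lambda d: math.dist(d[vector_field], query_vector))
    return [d["id"] for d in candidates[:limit]]


docs = [
    {"id": "blue-1", "color": "blue", "embedding_field": [3.0, 1.0, 2.0]},
    {"id": "red-1", "color": "red", "embedding_field": [3.0, 1.0, 4.0]},
    {"id": "red-2", "color": "red", "embedding_field": [0.0, 1.0, 2.0]},
]
nearest_red = prefiltered_nearest(docs, "color", "red", [3.0, 1.0, 2.0], limit=5)
```

The blue document matches the query vector exactly, yet it never appears in nearest_red because it is removed before any distance is computed.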

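Retrieve the calculated vector distance

You can retrieve the calculated vector distance by assigning a distance_result_field output property name on the FindNearest query, as shown in the following example: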

Python

from google.cloud import firestore
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector

db = firestore.Client()
collection = db.collection("coffee-beans")

vector_query = collection.find_nearest(
    vector_field="embedding_field",
    query_vector=Vector([3.0, 1.0, 2.0]),
    distance_measure=DistanceMeasure.EUCLIDEAN,
    limit=10,
    distance_result_field="vector_distance",
)

docs = vector_query.stream()

for doc in docs:
    print(f"{doc.id}, Distance: {doc.get('vector_distance')}")

Node.js

const vectorQuery: VectorQuery = coll.findNearest(
    {
      vectorField: 'embedding_field',
      queryVector: [3.0, 1.0, 2.0],
      limit: 10,
      distanceMeasure: 'EUCLIDEAN',
      distanceResultField: 'vector_distance'
    });

const snapshot: VectorQuerySnapshot = await vectorQuery.get();

snapshot.forEach((doc) => {
  console.log(doc.id, ' Distance: ', doc.get('vector_distance'));
});

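If you want to use a field mask to return a subset of document fields along with distanceResultField, then you must also include the value of distanceResultField in the field mask, as shown in the following example: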

Python

vector_query = collection.select(["color", "vector_distance"]).find_nearest(
    vector_field="embedding_field",
    query_vector=Vector([3.0, 1.0, 2.0]),
    distance_measure=DistanceMeasure.EUCLIDEAN,
    limit=10,
    distance_result_field="vector_distance",
)

Node.js

const vectorQuery: VectorQuery = coll
    .select('color', 'vector_distance')
    .findNearest({
      vectorField: 'embedding_field',
      queryVector: [3.0, 1.0, 2.0],
      limit: 10,
      distanceMeasure: 'EUCLIDEAN',
      distanceResultField: 'vector_distance'
    });

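Specify a distance threshold

You can specify a similarity threshold so that the query returns only documents within the threshold. The behavior of the threshold field depends on the distance measure you choose:

With EUCLIDEAN and COSINE, the query returns only documents whose distance is less than or equal to the specified threshold. These distance measures decrease as the vectors become more similar.
With DOT_PRODUCT, the query returns only documents whose distance is greater than or equal to the specified threshold. Dot product distances increase as the vectors become more similar.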
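
The direction of the comparison is the part that is easy to get wrong, so here is a minimal sketch of the rule described above as a single predicate. within_threshold is a hypothetical helper that uses plain strings for the distance measure; it is not a client library function.

```python
def within_threshold(distance_measure: str, distance: float, threshold: float) -> bool:
    """Return True when a result at `distance` passes the threshold."""
    if distance_measure in ("EUCLIDEAN", "COSINE"):
        # Smaller is more similar, so keep values at or below the threshold.
        return distance <= threshold
    if distance_measure == "DOT_PRODUCT":
        # Larger is more similar, so keep values at or above the threshold.
        return distance >= threshold
    raise ValueError(f"unknown distance measure: {distance_measure}")


# Euclidean distances of four hypothetical results, checked against 4.5.
kept = [d for d in (1.0, 4.5, 4.6, 9.0) if within_threshold("EUCLIDEAN", d, 4.5)]
```

With a threshold of 4.5, the EUCLIDEAN results at 1.0 and 4.5 are kept and those at 4.6 and 9.0 are dropped; the same threshold under DOT_PRODUCT would keep the opposite side of the boundary.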

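The following example shows how to specify a distance threshold to return up to 10 nearest documents that are at most 4.5 units away, using the EUCLIDEAN distance measure: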

Python

from google.cloud import firestore
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector

db = firestore.Client()
collection = db.collection("coffee-beans")

vector_query = collection.find_nearest(
    vector_field="embedding_field",
    query_vector=Vector([3.0, 1.0, 2.0]),
    distance_measure=DistanceMeasure.EUCLIDEAN,
    limit=10,
    distance_threshold=4.5,
)

docs = vector_query.stream()

for doc in docs:
    print(f"{doc.id}")

Node.js

const vectorQuery: VectorQuery = coll.findNearest({
  vectorField: 'embedding_field',
  queryVector: [3.0, 1.0, 2.0],
  limit: 10,
  distanceMeasure: 'EUCLIDEAN',
  distanceThreshold: 4.5
});

const snapshot: VectorQuerySnapshot = await vectorQuery.get();

snapshot.forEach((doc) => {
  console.log(doc.id);
});

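Limitations

As you work with vector embeddings, note the following limitations:

The maximum supported embedding dimension is 2,048. To store larger embeddings, use dimensionality reduction.
The maximum number of documents that a nearest-neighbor query can return is 1,000.
Vector search does not support real-time snapshot listeners.
Only the Python and Node.js client libraries support vector search.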
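
This page does not prescribe a particular dimensionality reduction method. As one possible illustration, the sketch below uses a Gaussian random projection, which approximately preserves Euclidean distances when the output dimension is large enough. The choice of technique, the function names, and the seed handling are all assumptions made for this example; a method such as PCA, or a smaller output size offered by the embedding model itself, may suit your data better.

```python
import math
import random
from typing import List, Sequence


def make_projection(input_dim: int, output_dim: int, seed: int = 0) -> List[List[float]]:
    """Build an output_dim x input_dim Gaussian random projection matrix."""
    rng = random.Random(seed)  # fixed seed so the matrix can be rebuilt
    scale = 1.0 / math.sqrt(output_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(input_dim)]
        for _ in range(output_dim)
    ]


def project(matrix: List[List[float]], vector: Sequence[float]) -> List[float]:
    """Map a vector to the lower-dimensional space defined by the matrix."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("vector dimension does not match the projection matrix")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy sizes for illustration; a real case might map 3,072 dimensions to 2,048.
matrix = make_projection(input_dim=8, output_dim=4, seed=42)
reduced = project(matrix, [1.0] * 8)
```

The same matrix must be applied to every stored embedding and to every query vector, so the seed (or the matrix itself) needs to be kept alongside the data.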

What's next

Read about best practices for Firestore.
Understand reads and writes at scale.