
Search: better, faster, smaller

This is the story of how we completely rebuilt client-side search, delivering a significantly better user experience while making it faster and smaller at the same time.

The search of Material for MkDocs is by far one of its best and most-loved assets: multilingual, offline-capable, and most importantly, all client-side. It empowers the users of your documentation to find what they're searching for instantly, without the headache of managing additional servers. However, even after several iterations, there was still room for improvement, which is why we rebuilt the search plugin and integration from the ground up. This article shines some light on the internals of the new search, why it's much more powerful than the previous version, and what's about to come.

The next section discusses the architecture and issues of the current search implementation. If you immediately want to learn what's new, skip to the section just after that.

Architecture

Material for MkDocs uses lunr together with lunr-languages to implement its client-side search capabilities. When a documentation page is loaded and JavaScript is available, the search index generated by the built-in search plugin during the build process is requested from the server:

const index$ = document.forms.namedItem("search")
  ? __search?.index || requestJSON<SearchIndex>(
    new URL("search/search_index.json", config.base)
  )
  : NEVER

Search index

The search index includes a stripped-down version of all pages. Let's take a look at an example to understand precisely what the search index contains from the original Markdown file:

Example Markdown source, followed by the search index generated from it:
# Example

## Text

It's very easy to make some words **bold** and other words *italic*
with Markdown. You can even add [links](#), or even `code`:

```
if (isAwesome) {
  return true
}
```

## Lists

Sometimes you want numbered lists:

1. One
2. Two
3. Three

Sometimes you want bullet points:

* Start a line with a star
* Profit!
{
  "config": {
    "indexing": "full",
    "lang": [
      "en"
    ],
    "min_search_length": 3,
    "prebuild_index": false,
    "separator": "[\\s\\-]+"
  },
  "docs": [
    {
      "location": "page/",
      "title": "Example",
      "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
    },
    {
      "location": "page/#example",
      "title": "Example",
      "text": ""
    },
    {
      "location": "page/#text",
      "title": "Text",
      "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }"
    },
    {
      "location": "page/#lists",
      "title": "Lists",
      "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
    }
  ]
}

If we inspect the search index, we immediately see several problems:

  1. All content is included twice: the search index contains one entry with the entire contents of the page, and one entry for each section of the page, i.e., each block preceded by a headline or subheadline. This significantly contributes to the size of the search index.

  2. All structure is lost: when the search index is built, all structural information like HTML tags and attributes is stripped from the content. While this approach works well for paragraphs and inline formatting, it might be problematic for lists and code blocks. An excerpt:

    … links , or even code : if (isAwesome) { … } Lists Sometimes you want …

    • Context: for an untrained eye, the result can look like gibberish, as it's not immediately apparent what classifies as text and what as code. Furthermore, it's not clear that Lists is a headline, as it's merged with the code block before and the paragraph after it.

    • Punctuation: inline elements like links that are immediately followed by punctuation are separated by whitespace (see , and : in the excerpt). This is because all extracted text is joined with a whitespace character during the construction of the search index.

It's not difficult to see that implementing a good search experience can be quite challenging for theme authors, which is why Material for MkDocs (up to now) did some monkey patching to be able to render slightly more meaningful search previews.
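The punctuation problem described above can be reproduced with a minimal sketch. The function stripAndJoin is a hypothetical stand-in for the old extraction step, not the plugin's actual code: it removes tags with a regular expression and joins the remaining text fragments with a single space.

```typescript
// Hypothetical stand-in for the old extraction step: strip all tags and
// join the remaining text fragments with a single whitespace character.
function stripAndJoin(html: string): string {
  return html
    .split(/<[^>]+>/)                 // cut the string at every tag
    .map(part => part.trim())         // normalize each text fragment
    .filter(part => part.length > 0)  // drop empty fragments
    .join(" ")                        // this join produces "links ," and "code :"
}

const html =
  '<p>You can even add <a href="#">links</a>, or even <code>code</code>:</p>'

console.log(stripAndJoin(html))
// You can even add links , or even code :
```

Because every fragment is joined with a space, punctuation that directly follows an inline element ends up detached from it, and the information that code was inline code is gone.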

Search worker

The actual search functionality is implemented as part of a web worker [1], which creates and manages the lunr search index. When search is initialized, the following steps are taken:

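  1. Linking sections with pages: the search index is parsed, and each section is linked to its parent page. The parent page itself is not indexed, as it would lead to duplicate results, so only the sections remain. Linking is necessary, as search results are grouped by page.

  2. Tokenization: the title and text values of each section are split into tokens by using the separator as configured in mkdocs.yml. Tokenization itself is carried out by lunr's default tokenizer, which doesn't allow for lookahead or separators spanning multiple characters. Why is this important? We will see later how much more we can achieve with a tokenizer that is capable of separating strings with lookahead.

  3. Indexing: as a final step, each section is indexed. When querying the index, if a search query includes one of the tokens returned by step 2, the section is considered to be part of the search result and passed to the main thread.

Now, that's basically how the search worker operates. Sure, there's a little more magic involved, e.g., search results are post-processed and rescored to account for some shortcomings of lunr, but in general, this is how data gets into and out of the index.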
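Step 1 can be sketched as follows. The type names and the linkSections function are hypothetical and chosen for illustration; only the location, title, and text fields come from the example index. The sketch assumes that a page entry precedes its sections, as it does in the example.

```typescript
// Shape of one entry in search_index.json, as shown in the example index.
interface SearchDocument {
  location: string
  title: string
  text: string
}

// A section together with the page it belongs to.
interface LinkedSection extends SearchDocument {
  parent: SearchDocument
}

// Link each section to its parent page and drop the page entries
// themselves, so the same content is not indexed twice.
function linkSections(docs: SearchDocument[]): LinkedSection[] {
  const pages = new Map<string, SearchDocument>()
  const sections: LinkedSection[] = []
  for (const doc of docs) {
    const hashAt = doc.location.indexOf("#")
    if (hashAt === -1) {
      pages.set(doc.location, doc)  // page entry: remember it, don't index it
    } else {
      const parent = pages.get(doc.location.slice(0, hashAt))
      if (parent) sections.push({ ...doc, parent })
    }
  }
  return sections
}

const sections = linkSections([
  { location: "page/", title: "Example", text: "Example Text It's very easy …" },
  { location: "page/#text", title: "Text", text: "It's very easy …" },
  { location: "page/#lists", title: "Lists", text: "Sometimes you want …" }
])
console.log(sections.map(section => section.title))
// [ 'Text', 'Lists' ]
```

Keeping a reference to the parent is what makes it possible to group results by page later on.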

Search previews

Users should be able to quickly scan and evaluate the relevance of a search result in the given context, which is why a concise summary with highlighted occurrences of the search terms is an essential part of a great search experience.

This is where the current search preview generation falls short, as some of the search previews appear not to include any occurrence of any of the search terms. The reason is that search previews are truncated after a maximum of 320 characters, as can be seen here:

[Figure: search preview]

The first two results look like they're not relevant, as they don't seem to include the query string the user just searched for. Yet, they are.
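The effect can be illustrated with a small sketch. It assumes that the preview is cut by plain character count; the function name truncatePreview and the sample text are made up for illustration.

```typescript
// Assumption: previews are cut by plain character count, as described above.
const MAX_PREVIEW_LENGTH = 320

function truncatePreview(text: string): string {
  return text.length > MAX_PREVIEW_LENGTH
    ? text.slice(0, MAX_PREVIEW_LENGTH)
    : text
}

// A section whose only occurrence of the term lies beyond the cut-off
const section = "Lorem ipsum dolor sit amet. ".repeat(15) + "search.highlight"
const preview = truncatePreview(section)

console.log(section.includes("highlight"))  // true: the section is relevant
console.log(preview.includes("highlight"))  // false: the preview hides the match
```

The section matches the query, but because the match lies after the first 320 characters, nothing in the preview tells the user why the result was returned.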

A better solution to this problem has been on the roadmap for a very long time, but in order to solve it once and for all, several factors need to be carefully considered:

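  1. Word boundaries: some themes [2] for static site generators generate search previews by expanding the text left and right next to an occurrence, stopping at a whitespace character when enough words have been consumed. A preview might look like this:

    … channels, e.g., or which can be configured via mkdocs.yml …

    While this may work for languages that use whitespace as a separator between words, it breaks down for languages like Japanese or Chinese [3], as they have non-whitespace word boundaries and use dedicated segmenters to split strings into tokens.

  2. Context-awareness: although whitespace doesn't work for all languages, one could argue that it could be a good enough solution. Unfortunately, this is not necessarily true for code blocks, as the removal of whitespace might change meaning in some languages.

  3. Structure: preserving structural information is not a must, but clearly beneficial for building more meaningful search previews that allow for a quick evaluation of relevance. If a word occurrence is part of a code block, it should be rendered as a code block.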
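The word boundary problem from item 1 can be made concrete with a simplified sketch. It is not taken from any of the themes mentioned; naivePreview is a hypothetical function that counts whitespace-separated words around the first match, and the Japanese sentence is only sample text.

```typescript
// Simplified whitespace-based preview: keep `context` words on each side
// of the first word that contains the search term.
function naivePreview(text: string, term: string, context = 2): string {
  const words = text.split(/\s+/)
  const hit = words.findIndex(word => word.includes(term))
  if (hit === -1) return ""
  return words.slice(Math.max(0, hit - context), hit + context + 1).join(" ")
}

console.log(naivePreview(
  "can be configured via mkdocs.yml or the command line", "mkdocs"
))
// configured via mkdocs.yml or the

// Without whitespace there is nothing to stop at, so the "preview" is the
// entire string, however long it is.
console.log(naivePreview("検索機能は完全にクライアント側で動作します", "検索"))
// 検索機能は完全にクライアント側で動作します
```

For whitespace-separated text the window is short and readable, while text without whitespace is returned in full, which is exactly the case a dedicated segmenter is needed for.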

What's new?

Now that we have built a solid understanding of the problem space, and before we dive into the internals of our new search implementation to see which of the problems it already solves, here is a quick overview of the features and improvements it brings:

Rich search previews

As we rebuilt the search plugin from scratch, we reworked the construction of the search index to preserve the structural information of code blocks, inline code, as well as unordered and ordered lists. Using the example from the search index section, here's how it looks:

[Figure: search preview, now]

[Figure: search preview, before]

Now, code blocks are first-class citizens of search previews, and even inline code formatting is preserved. Let's take a look at the new structure of the search index to understand why:

Search index, now (first) and before (second):
{
  ...
  "docs": [
    {
      "location": "page/",
      "title": "Example",
      "text": ""
    },
    {
      "location": "page/#text",
      "title": "Text",
      "text": "<p>It's very easy to make some words bold and other words italic with Markdown. You can even add links, or even <code>code</code>:</p> <pre><code>if (isAwesome){\n  return true\n}\n</code></pre>"
    },
    {
      "location": "page/#lists",
      "title": "Lists",
      "text": "<p>Sometimes you want numbered lists:</p> <ol> <li>One</li> <li>Two</li> <li>Three</li> </ol> <p>Sometimes you want bullet points:</p> <ul> <li>Start a line with a star</li> <li>Profit!</li> </ul>"
    }
  ]
}
{
  ...
  "docs": [
    {
      "location": "page/",
      "title": "Example",
      "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
    },
    {
      "location": "page/#example",
      "title": "Example",
      "text": ""
    },
    {
      "location": "page/#text",
      "title": "Text",
      "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }"
    },
    {
      "location": "page/#lists",
      "title": "Lists",
      "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
    }
  ]
}

If we inspect the search index again, we can see how the situation improved:

  1. Content is included only once: the search index does not include the content of the page twice, as only the sections of a page are part of the search index. This leads to a significant reduction in size, fewer bytes to transfer, and a smaller search index.

  2. Some structure is preserved: each section of the search index includes a small subset of HTML to provide the necessary structure to allow for more sophisticated search previews. Revisiting our example from before, let's look at an excerpt:

    Now:

    … links, or even <code>code</code>:</p> <pre><code>if (isAwesome){ … }\n</code></pre>

    Before:

    … links , or even code : if (isAwesome) { … }

    The punctuation issue is gone, as no additional whitespace is inserted, and the preserved markup yields additional context to make scanning search results more effective.

On to the next step in the process: tokenization.

Tokenizer lookahead

The default tokenizer of lunr uses a regular expression to split a given string by matching each character against the separator as defined in mkdocs.yml. This doesn't allow for more complex separators based on lookahead or multiple characters.
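To see why matching one character at a time rules out such separators, consider the following simplified model. It is an illustration of the behaviour described above, not lunr's actual source code, and tokenizePerCharacter is a made-up name.

```typescript
// Simplified model of a tokenizer that tests one character at a time
// against the separator (an illustration, not lunr's source code).
function tokenizePerCharacter(input: string, separator: RegExp): string[] {
  const tokens: string[] = []
  let start = 0
  for (let end = 0; end <= input.length; end++) {
    const atEnd = end === input.length
    if (atEnd || separator.test(input.charAt(end))) {
      if (end > start) tokens.push(input.slice(start, end))
      start = end + 1
    }
  }
  return tokens
}

// Single-character separators work as expected
console.log(tokenizePerCharacter("client-side search", /[\s\-]+/))
// [ 'client', 'side', 'search' ]

// A separator spanning several characters can never match a single one
console.log(tokenizePerCharacter("&lt;html&gt;", /&[lg]t;/))
// [ '&lt;html&gt;' ]
```

Since the separator only ever sees a single character, it has no neighbouring characters to look ahead at, and any expression that needs more than one character to match is silently ignored.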

Fortunately, our new search implementation provides an advanced tokenizer that doesn't have these shortcomings and supports more complex regular expressions. As a result, Material for MkDocs just changed its own separator configuration to the following value:

[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;

While the first part up to the first | contains a list of single control characters at which the string should be split, the following three sections explain the remainder of the regular expression. [4]
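As a rough sketch of what this separator does, the expression can be applied with String.prototype.split. The tokenize function is a hypothetical simplification: the new search implementation also records token positions, which is omitted here. The only change to the expression is that / is escaped, as required inside a JavaScript regular expression literal.

```typescript
// Separator from the article as a JavaScript regular expression literal.
const separator =
  /[\s\-,:!=\[\]()"\/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;/

// Split on the separator and drop empty tokens.
function tokenize(input: string): string[] {
  return input.split(separator).filter(token => token.length > 0)
}

console.log(tokenize('See "search.highlight" (7.2.6 release notes)'))
// [ 'See', 'search', 'highlight', '7.2.6', 'release', 'notes' ]

console.log(tokenize("searchHighlight"))
// [ 'search', 'Highlight' ]

console.log(tokenize("&lt;html&gt;"))
// [ 'html' ]
```

Each of the three outputs corresponds to one of the alternatives after the first |, which the following sections discuss in turn.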

Case changes

Many programming languages use PascalCase or camelCase naming conventions. When a user searches for the term case, it's quite natural to expect PascalCase and camelCase to show up. By adding the following match group to the separator, this can now be achieved with ease:

(?!\b)(?=[A-Z][a-z])

This regular expression is a combination of a negative lookahead (\b, i.e., not a word boundary) and a positive lookahead ([A-Z][a-z], i.e., an uppercase character followed by a lowercase character), and has the following behavior:

  • PascalCase → Pascal, Case

  • camelCase → camel, Case

  • UPPERCASE → UPPERCASE

Searching for searchHighlight now brings up the section discussing the search.highlight feature flag, which also demonstrates that tokenization now works properly for search queries as well. [5]

Version numbers

Indexing version numbers is another problem that can be solved with a small lookahead. Usually, . should be considered a separator to split words like search.highlight. However, splitting version numbers at . makes them undiscoverable. Thus, the following expression:

\.(?!\d)

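This regular expression matches a . only if it is not immediately followed by a digit \d, which leaves version numbers discoverable. Searching for 7.2.6 brings up the 7.2.6 release notes.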
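The difference the lookahead makes is easiest to see side by side. In this small sketch, splitOn and the two constant names are hypothetical helpers; the second expression is the one from the article.

```typescript
// Without lookahead, every "." is a separator
const everyDot = /\./
// With a negative lookahead, a "." followed by a digit is left alone
const dotNotBeforeDigit = /\.(?!\d)/

function splitOn(input: string, separator: RegExp): string[] {
  return input.split(separator).filter(token => token.length > 0)
}

console.log(splitOn("search.highlight", everyDot))           // [ 'search', 'highlight' ]
console.log(splitOn("7.2.6", everyDot))                      // [ '7', '2', '6' ]
console.log(splitOn("search.highlight", dotNotBeforeDigit))  // [ 'search', 'highlight' ]
console.log(splitOn("7.2.6", dotNotBeforeDigit))             // [ '7.2.6' ]
```

One trade-off of this expression is that any . followed by a digit is kept, whether or not it belongs to a version number.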

HTML/XML tags

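If your documentation includes HTML/XML code examples, you may want to allow users to find specific tag names. Unfortunately, the < and > control characters are encoded in code blocks as &lt; and &gt;. Adding the following expression to the separator now allows for just that: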

&[lg]t;

We've only just begun to scratch the surface of the new possibilities tokenizer lookahead brings. If you find other useful expressions, you're invited to share them in the comment section.

Accurate highlighting

Highlighting is the last step in the search process and involves marking all occurrences of the search terms in a given search result. For a long time, highlighting was implemented through dynamically generated regular expressions. [6]

This approach has some problems with non-whitespace languages like Japanese or Chinese [3], since it only works if the highlighted term is at a word boundary. However, Asian languages are tokenized using dedicated segmenters, which cannot be modeled with regular expressions.

Now, as a direct result of the new tokenization approach, our new search implementation uses token positions for highlighting, making it exactly as powerful as tokenization:

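  1. Word boundaries: as the new highlighter uses token positions, word boundaries are equal to token boundaries. This means that more complex cases of tokenization (e.g., case changes, version numbers, HTML/XML tags) are now all highlighted accurately.

  2. Context-awareness: as the new search index preserves some of the structural information of the original document, the content of a section is now divided into separate content blocks: paragraphs, code blocks, and lists. Now, only the content blocks that actually contain occurrences of one of the search terms are considered for inclusion into the search preview. If a term only occurs in a code block, it's the code block that gets rendered, see, for example, the results for twitter.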
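Position-based highlighting, as described in item 1, can be sketched as follows. This is an assumption-laden illustration rather than the actual implementation: PositionedToken, tokenizeWithPositions, and highlight are hypothetical names, terms are compared as whole tokens ignoring case, and the <mark> tag is chosen only for the example.

```typescript
// A token together with its position in the original string.
interface PositionedToken {
  value: string
  start: number
  end: number
}

const separator =
  /[\s\-,:!=\[\]()"\/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;/

// Tokenize while recording where each token starts and ends.
function tokenizeWithPositions(input: string): PositionedToken[] {
  const pattern = new RegExp(separator.source, "g")
  const tokens: PositionedToken[] = []
  let start = 0
  const push = (end: number) => {
    if (end > start) tokens.push({ value: input.slice(start, end), start, end })
  }
  let match: RegExpExecArray | null
  while ((match = pattern.exec(input)) !== null) {
    push(match.index)
    start = match.index + match[0].length
    if (match[0].length === 0) pattern.lastIndex++  // step over zero-width matches
  }
  push(input.length)
  return tokens
}

// Wrap every token that equals one of the terms, using positions only.
function highlight(input: string, terms: string[]): string {
  const wanted = new Set(terms.map(term => term.toLowerCase()))
  let output = ""
  let cursor = 0
  for (const token of tokenizeWithPositions(input)) {
    if (!wanted.has(token.value.toLowerCase())) continue
    output += input.slice(cursor, token.start) + "<mark>" + token.value + "</mark>"
    cursor = token.end
  }
  return output + input.slice(cursor)
}

console.log(highlight("Enable searchHighlight in 7.2.6", ["highlight", "7.2.6"]))
// Enable search<mark>Highlight</mark> in <mark>7.2.6</mark>
```

Because the highlighter reuses the tokenizer's boundaries, a term inside searchHighlight is marked even though it does not start at a whitespace boundary, and the version number is marked as a whole.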

Benchmarks

We conducted two benchmarks: one with the documentation of Material for MkDocs itself, and one with a very large corpus of Markdown files with more than 800,000 words, a size most documentation projects will likely never reach:

                         Before      Now         Relative

Material for MkDocs
  Index size             573 kB      335 kB      –42%
  Index size (gzip)      105 kB      78 kB       –27%
  Indexing time [7]      265 ms      177 ms      –34%

KJV Markdown [8]
  Index size             8.2 MB      4.4 MB      –47%
  Index size (gzip)      2.3 MB      1.2 MB      –48%
  Indexing time          2,700 ms    1,390 ms    –48%

Benchmark results

The results show that indexing time, which is the time it takes to set up the search when the page is loaded, has dropped by up to 48%, which means the new search is up to 95% faster. This is a significant improvement, particularly relevant for large documentation projects.
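The relation between a time reduction of roughly 48% and a speedup of roughly 95% can be checked with a few lines of arithmetic. The function names are made up, and the inputs are the rounded indexing times for the large corpus from the benchmark results, so the outcome is approximate.

```typescript
// Relative change in percent, negative when `now` is smaller than `before`.
function relativeChange(before: number, now: number): number {
  return ((now - before) / before) * 100
}

// Speedup factor: how many times faster the new implementation is.
function speedup(before: number, now: number): number {
  return before / now
}

console.log(relativeChange(2700, 1390).toFixed(1))  // -48.5
console.log(speedup(2700, 1390).toFixed(2))         // 1.94
```

Cutting the time roughly in half means roughly doubling the throughput: a factor of about 1.94 is the same as being about 94% faster, which is in line with the figure quoted above.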

While 1.3 s may still sound like a long time, using the new client-side search together with instant loading creates the search index only on the initial page load. When navigating, the search index is preserved across pages, so the cost only has to be paid once.

User interface

Additionally, some small improvements have been made, most prominently the "more results on this page" button, which now sticks to the top of the search result list when open. This enables the user to jump out of the list more quickly.

What's next?

Our new search implementation is a big improvement to Material for MkDocs. It solves some long-standing issues that had needed tackling for years. Yet, it's only the start of a search experience that is going to get better and better. Next up:

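  • Context-aware search summarization: currently, the first two matching content blocks are rendered as a search preview. With the new tokenization technique, we laid the groundwork for more sophisticated shortening and summarization methods, which we're tackling next.

  • User interface improvements: as we now have full control over the search plugin, we can add meaningful metadata to provide more context and a better experience. We'll explore some of those paths in the future.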

If you've made it this far, thank you for your time and interest in Material for MkDocs! This is the first blog article that I decided to write after a short Twitter survey made me do it. You're invited to leave a comment to share your experiences with the new search implementation.


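  1. Prior to 5.0.0, search was carried out in the main thread, which locked up the browser, rendering it unusable. This problem was first reported in #904 and, after some back and forth, fixed and released in 5.0.0.

  2. At the time of writing, Just the Docs and Docusaurus use this method for generating search previews. Note that the latter also integrates with Algolia, which is a fully managed server-based solution.

  3. China and Japan are both within the top 5 countries of origin of users of Material for MkDocs.

  4. As a fun fact: the separator default value of the search plugin, [\s\-]+, has always been kind of irritating, as it suggests that multiple characters can be considered a separator. However, the + is completely irrelevant, as regular expression groups involving multiple characters were never supported by lunr's default tokenizer.

  5. Previously, the search query was not correctly tokenized due to the way lunr treats wildcards, as it disables the pipeline for search terms that contain wildcards. In order to provide a good type-ahead experience, Material for MkDocs adds wildcards to the end of each search term not explicitly preceded with + or -, effectively disabling tokenization.

  6. Using the separator as defined in mkdocs.yml, a regular expression was constructed that tried to mimic the tokenizer. As an example, the search query search highlight was transformed into the rather cumbersome regular expression (^|<separator>)(search|highlight), which only matches at word boundaries.

  7. Smallest value of ten distinct runs.

  8. We agnostically use KJV Markdown as a tool for testing to learn how Material for MkDocs behaves on large corpora, as it's a very large set of Markdown files with over 800k words.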