如何设计搜索引擎

一、问题描述

1.1 业务背景

搜索引擎是信息检索的核心技术，广泛应用于：

网站搜索：Google、百度、必应
站内搜索：电商商品搜索、文档搜索
企业搜索：知识库、日志搜索

核心价值：从海量数据中快速找到相关信息

1.2 核心功能

基础功能：

全文检索：根据关键词搜索文档
相关性排序：最相关的结果排在前面
分页展示：支持翻页查看更多结果

进阶功能：

搜索提示：输入时自动补全
拼写纠错：自动纠正拼写错误
高亮显示：高亮匹配关键词
相关搜索：推荐相关搜索词

1.3 技术挑战

数据规模：

Google索引：数千亿网页
淘宝商品：数十亿商品
如何快速检索？

响应速度：

搜索响应时间要求：<100ms
如何做到毫秒级返回？

相关性排序：

如何判断哪个结果更相关？
TF-IDF、BM25、PageRank

1.4 面试考察点

倒排索引：搜索引擎核心数据结构
分词算法：中文分词的挑战
相关性算法：TF-IDF、BM25
Elasticsearch：分布式搜索引擎
性能优化：缓存、索引分片

二、核心概念

2.1 倒排索引（Inverted Index）

正排索引（文档 → 词）：

Doc1: "搜索引擎原理"
Doc2: "搜索算法设计"
Doc3: "引擎优化技术"

倒排索引（词 → 文档）：

搜索 → [Doc1, Doc2]
引擎 → [Doc1, Doc3]
原理 → [Doc1]
算法 → [Doc2]
设计 → [Doc2]
优化 → [Doc3]
技术 → [Doc3]

倒排表结构：

词典（Dictionary）        倒排列表（Posting List）
-------------------------  ---------------------------------
搜索                    →  [Doc1:pos1, Doc2:pos1]
引擎                    →  [Doc1:pos2, Doc3:pos1]
...

2.2 TF-IDF算法

TF（Term Frequency）：词频

TF = 词在文档中出现次数 / 文档总词数

IDF（Inverse Document Frequency）：逆文档频率

IDF = log(文档总数 / 包含该词的文档数 + 1)

TF-IDF得分：

Score = TF × IDF

示例：

文档1："搜索引擎原理 搜索引擎设计"
文档2："数据库设计原理"

查询："搜索引擎"

词"搜索"：
- TF(Doc1) = 2/4 = 0.5
- IDF = log(2/1+1) = 0.3
- TF-IDF = 0.5 × 0.3 = 0.15

词"引擎"：
- TF(Doc1) = 2/4 = 0.5
- IDF = log(2/1+1) = 0.3
- TF-IDF = 0.5 × 0.3 = 0.15

Doc1总分 = 0.15 + 0.15 = 0.3
Doc2总分 = 0（不包含"搜索"和"引擎"）

结果：Doc1排名第一

三、系统设计

3.1 架构图

mermaid

graph TB
    subgraph 索引构建
        A[原始文档] --> B[分词器]
        B --> C[倒排索引构建器]
        C --> D[倒排索引存储]
    end
    
    subgraph 搜索流程
        E[用户查询] --> F[查询解析]
        F --> G[分词]
        G --> H[索引检索]
        H --> I[相关性排序]
        I --> J[返回结果]
    end
    
    D --> H

3.2 数据结构设计

词典表：

sql

CREATE TABLE dictionary (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    term VARCHAR(100) UNIQUE NOT NULL COMMENT '词项',
    doc_freq INT NOT NULL COMMENT '文档频率',
    posting_offset BIGINT COMMENT '倒排列表偏移量',
    KEY idx_term (term)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

倒排列表：

sql

CREATE TABLE posting_list (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    term_id BIGINT NOT NULL COMMENT '词项ID',
    doc_id BIGINT NOT NULL COMMENT '文档ID',
    term_freq INT NOT NULL COMMENT '词频',
    positions JSON COMMENT '位置列表',
    KEY idx_term_doc (term_id, doc_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

文档表：

sql

CREATE TABLE documents (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(500) NOT NULL,
    content TEXT NOT NULL,
    url VARCHAR(1000),
    doc_length INT COMMENT '文档长度',
    create_time DATETIME DEFAULT CURRENT_TIMESTAMP,
    KEY idx_create_time (create_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

四、核心实现

4.1 Go实现

点击查看完整实现

package searchengine

import (
    "math"
    "sort"
    "strings"
)

// SearchEngine 搜索引擎
type SearchEngine struct {
    dictionary   map[string]*Term          // 词典
    postingLists map[string][]*Posting    // 倒排列表
    documents    map[int64]*Document       // 文档集合
    totalDocs    int                       // 文档总数
}

// Term 词项
type Term struct {
    Term    string
    DocFreq int // 包含该词的文档数
}

// Posting 倒排列表项
type Posting struct {
    DocID     int64
    TermFreq  int   // 词频
    Positions []int // 词在文档中的位置
}

// Document 文档
type Document struct {
    ID        int64
    Title     string
    Content   string
    DocLength int // 文档长度（词数）
}

// SearchResult 搜索结果
type SearchResult struct {
    DocID int64
    Title string
    Score float64
}

// NewSearchEngine 创建搜索引擎
func NewSearchEngine() *SearchEngine {
    return &SearchEngine{
        dictionary:   make(map[string]*Term),
        postingLists: make(map[string][]*Posting),
        documents:    make(map[int64]*Document),
    }
}

// BuildIndex 构建索引
func (se *SearchEngine) BuildIndex(docs []*Document) {
    for _, doc := range docs {
        se.addDocument(doc)
    }
}

// addDocument 添加文档到索引
func (se *SearchEngine) addDocument(doc *Document) {
    // 分词
    terms := tokenize(doc.Content)
    doc.DocLength = len(terms)
    
    // 存储文档
    se.documents[doc.ID] = doc
    se.totalDocs++
    
    // 统计词频和位置
    termFreqs := make(map[string]int)
    termPositions := make(map[string][]int)
    
    for pos, term := range terms {
        termFreqs[term]++
        termPositions[term] = append(termPositions[term], pos)
    }
    
    // 构建倒排索引
    for term, freq := range termFreqs {
        // 更新词典
        if _, exists := se.dictionary[term]; !exists {
            se.dictionary[term] = &Term{
                Term:    term,
                DocFreq: 0,
            }
        }
        se.dictionary[term].DocFreq++
        
        // 添加到倒排列表
        posting := &Posting{
            DocID:     doc.ID,
            TermFreq:  freq,
            Positions: termPositions[term],
        }
        se.postingLists[term] = append(se.postingLists[term], posting)
    }
}

// Search 搜索
func (se *SearchEngine) Search(query string, limit int) []*SearchResult {
    // 1. 查询分词
    queryTerms := tokenize(query)
    
    // 2. 获取候选文档（包含任一查询词的文档）
    candidateDocs := se.getCandidateDocs(queryTerms)
    
    // 3. 计算BM25得分
    results := make([]*SearchResult, 0)
    for docID := range candidateDocs {
        score := se.calculateBM25(queryTerms, docID)
        
        doc := se.documents[docID]
        results = append(results, &SearchResult{
            DocID: docID,
            Title: doc.Title,
            Score: score,
        })
    }
    
    // 4. 按得分排序
    sort.Slice(results, func(i, j int) bool {
        return results[i].Score > results[j].Score
    })
    
    // 5. 返回Top N
    if limit > len(results) {
        limit = len(results)
    }
    
    return results[:limit]
    }
    
// getCandidateDocs 获取候选文档
func (se *SearchEngine) getCandidateDocs(queryTerms []string) map[int64]bool {
    candidates := make(map[int64]bool)
    
    for _, term := range queryTerms {
        if postings, exists := se.postingLists[term]; exists {
            for _, posting := range postings {
                candidates[posting.DocID] = true
            }
        }
    }
    
    return candidates
}

// calculateBM25 计算BM25得分
func (se *SearchEngine) calculateBM25(queryTerms []string, docID int64) float64 {
    const (
        k1 = 1.5  // 词频饱和参数
        b  = 0.75 // 长度归一化参数
    )
    
    doc := se.documents[docID]
    avgDocLength := se.calculateAvgDocLength()
    score := 0.0
    
    for _, term := range queryTerms {
        // 计算IDF
        idf := se.calculateIDF(term)
        
        // 获取词频
        tf := se.getTermFreq(term, docID)
        
        // BM25公式
        numerator := tf * (k1 + 1)
        denominator := tf + k1*(1-b+b*float64(doc.DocLength)/avgDocLength)
        
        score += idf * (numerator / denominator)
    }
    
    return score
        }
        
// calculateIDF 计算IDF
func (se *SearchEngine) calculateIDF(term string) float64 {
    termInfo, exists := se.dictionary[term]
    if !exists {
        return 0.0
    }
    
    // IDF = log((N - df + 0.5) / (df + 0.5) + 1)
    n := float64(se.totalDocs)
    df := float64(termInfo.DocFreq)
    
    return math.Log((n-df+0.5)/(df+0.5) + 1)
}

// getTermFreq 获取词频
func (se *SearchEngine) getTermFreq(term string, docID int64) float64 {
    postings, exists := se.postingLists[term]
    if !exists {
        return 0.0
    }
    
    for _, posting := range postings {
        if posting.DocID == docID {
            return float64(posting.TermFreq)
        }
    }
    
    return 0.0
}

// calculateAvgDocLength 计算平均文档长度
func (se *SearchEngine) calculateAvgDocLength() float64 {
    if se.totalDocs == 0 {
        return 0
    }
    
    totalLength := 0
    for _, doc := range se.documents {
        totalLength += doc.DocLength
    }
    
    return float64(totalLength) / float64(se.totalDocs)
}

// tokenize 分词（简化版，实际应使用jieba等）
func tokenize(text string) []string {
    // 简单按空格分词
    text = strings.ToLower(text)
    return strings.Fields(text)
        }
        
// Trie树实现（用于搜索提示）
type Trie struct {
    root *TrieNode
}

type TrieNode struct {
    children map[rune]*TrieNode
    isEnd    bool
    word     string
    freq     int // 搜索频率
}

func NewTrie() *Trie {
    return &Trie{
        root: &TrieNode{
            children: make(map[rune]*TrieNode),
        },
    }
}

// Insert 插入词
func (t *Trie) Insert(word string, freq int) {
    node := t.root
    
    for _, char := range word {
        if _, exists := node.children[char]; !exists {
            node.children[char] = &TrieNode{
                children: make(map[rune]*TrieNode),
            }
        }
        node = node.children[char]
    }
    
    node.isEnd = true
    node.word = word
    node.freq = freq
}

// SearchPrefix 搜索前缀
func (t *Trie) SearchPrefix(prefix string) []string {
    node := t.root
    
    // 找到前缀节点
    for _, char := range prefix {
        if _, exists := node.children[char]; !exists {
            return nil
        }
        node = node.children[char]
        }
        
    // DFS收集所有词
    results := make([]*TrieNode, 0)
    t.dfs(node, &results)
    
    // 按频率排序
    sort.Slice(results, func(i, j int) bool {
        return results[i].freq > results[j].freq
    })
    
    // 返回Top 10
    words := make([]string, 0)
    for i := 0; i < len(results) && i < 10; i++ {
        words = append(words, results[i].word)
    }
    
    return words
    }
    
// dfs 深度优先搜索
func (t *Trie) dfs(node *TrieNode, results *[]*TrieNode) {
    if node.isEnd {
        *results = append(*results, node)
    }
    
    for _, child := range node.children {
        t.dfs(child, results)
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306

4.2 Java实现

SearchEngine.java

java

public class SearchEngine {
    
    private Map<String, Term> dictionary;
    private Map<String, List<Posting>> postingLists;
    private Map<Long, Document> documents;
    private int totalDocs;
    
    public SearchEngine() {
        this.dictionary = new HashMap<>();
        this.postingLists = new HashMap<>();
        this.documents = new HashMap<>();
        this.totalDocs = 0;
    }
    
    /**
     * 构建索引
     */
    public void buildIndex(List<Document> docs) {
        for (Document doc : docs) {
            addDocument(doc);
            }
        }
        
    /**
     * 添加文档
     */
    private void addDocument(Document doc) {
        // 分词
        List<String> terms = tokenize(doc.getContent());
        doc.setDocLength(terms.size());
        
        // 存储文档
        documents.put(doc.getId(), doc);
        totalDocs++;
        
        // 统计词频和位置
        Map<String, Integer> termFreqs = new HashMap<>();
        Map<String, List<Integer>> termPositions = new HashMap<>();
        
        for (int i = 0; i < terms.size(); i++) {
            String term = terms.get(i);
            termFreqs.put(term, termFreqs.getOrDefault(term, 0) + 1);
            termPositions.computeIfAbsent(term, k -> new ArrayList<>()).add(i);
        }
        
        // 构建倒排索引
        for (Map.Entry<String, Integer> entry : termFreqs.entrySet()) {
            String term = entry.getKey();
            int freq = entry.getValue();
            
            // 更新词典
            dictionary.computeIfAbsent(term, k -> new Term(term, 0));
            Term termInfo = dictionary.get(term);
            termInfo.setDocFreq(termInfo.getDocFreq() + 1);
            
            // 添加到倒排列表
            Posting posting = new Posting();
            posting.setDocId(doc.getId());
            posting.setTermFreq(freq);
            posting.setPositions(termPositions.get(term));
            
            postingLists.computeIfAbsent(term, k -> new ArrayList<>()).add(posting);
        }
    }
    
    /**
     * 搜索
     */
    public List<SearchResult> search(String query, int limit) {
        // 1. 查询分词
        List<String> queryTerms = tokenize(query);
        
        // 2. 获取候选文档
        Set<Long> candidateDocs = getCandidateDocs(queryTerms);
        
        // 3. 计算BM25得分
        List<SearchResult> results = new ArrayList<>();
        for (Long docId : candidateDocs) {
            double score = calculateBM25(queryTerms, docId);
            
            Document doc = documents.get(docId);
            SearchResult result = new SearchResult();
            result.setDocId(docId);
            result.setTitle(doc.getTitle());
            result.setScore(score);
            
            results.add(result);
}

        // 4. 排序
        results.sort((a, b) -> Double.compare(b.getScore(), a.getScore()));
        
        // 5. 返回Top N
        return results.stream().limit(limit).collect(Collectors.toList());
    }
    
    /**
     * 获取候选文档
     */
    private Set<Long> getCandidateDocs(List<String> queryTerms) {
        Set<Long> candidates = new HashSet<>();
        
        for (String term : queryTerms) {
            List<Posting> postings = postingLists.get(term);
            if (postings != null) {
                for (Posting posting : postings) {
                    candidates.add(posting.getDocId());
                }
            }
        }
        
        return candidates;
    }
    
    /**
     * 计算BM25得分
     */
    private double calculateBM25(List<String> queryTerms, Long docId) {
        final double k1 = 1.5;
        final double b = 0.75;
        
        Document doc = documents.get(docId);
        double avgDocLength = calculateAvgDocLength();
        double score = 0.0;
        
        for (String term : queryTerms) {
            double idf = calculateIDF(term);
            double tf = getTermFreq(term, docId);
            
            double numerator = tf * (k1 + 1);
            double denominator = tf + k1 * (1 - b + b * doc.getDocLength() / avgDocLength);
            
            score += idf * (numerator / denominator);
    }
    
        return score;
    }
    
    /**
     * 计算IDF
     */
    private double calculateIDF(String term) {
        Term termInfo = dictionary.get(term);
        if (termInfo == null) {
            return 0.0;
        }
        
        double n = totalDocs;
        double df = termInfo.getDocFreq();
        
        return Math.log((n - df + 0.5) / (df + 0.5) + 1);
}

    /**
     * 分词
     */
    private List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }
}

五、Elasticsearch应用

5.1 Elasticsearch简介

核心特点：

分布式搜索引擎
基于Lucene
RESTful API
实时搜索

核心概念：

Index（索引）：类似数据库
Document（文档）：类似行记录
Field（字段）：类似列
Shard（分片）：索引分片存储
Replica（副本）：分片副本

5.2 使用示例

创建索引：

json

PUT /products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "price": {
        "type": "double"
      },
      "category": {
        "type": "keyword"
      }
    }
  }
}

插入文档：

json

POST /products/_doc/1
{
  "title": "iPhone 15 Pro",
  "price": 7999,
  "category": "手机"
}

搜索：

json

GET /products/_search
{
  "query": {
    "match": {
      "title": "iPhone"
    }
  },
  "sort": [
    { "_score": "desc" },
    { "price": "asc" }
  ],
  "from": 0,
  "size": 10
}

六、性能优化

6.1 索引优化

分片策略：

索引大小 < 50GB：1个分片
索引大小 100GB：2-5个分片
索引大小 > 500GB：10+个分片

索引压缩：

词典压缩：前缀树
倒排列表压缩：差分编码

6.2 查询优化

缓存：

Query Cache：查询结果缓存
Filter Cache：过滤条件缓存
Field Data Cache：字段数据缓存

性能数据：

指标	优化前	优化后	提升
索引构建速度	1000doc/s	10000doc/s	10x
搜索响应时间	500ms	50ms	10x
索引大小	10GB	5GB	50%

七、面试要点

7.1 常见追问

Q1: 倒排索引和正排索引的区别？

维度	正排索引	倒排索引
结构	文档→词	词→文档
用途	展示文档	搜索文档
查询	慢（全表扫描）	快（直接定位）

Q2: TF-IDF和BM25的区别？

A: BM25是TF-IDF的改进版，解决了词频饱和问题和文档长度归一化。

Q3: 如何实现中文分词？

A: 常用算法：

基于词典：最大匹配法
基于统计：HMM、CRF
开源工具：jieba、IK Analyzer

7.2 扩展知识点

相关场景题：

推荐系统

相关技术文档：

数据结构

八、总结

搜索引擎设计要点：

核心数据结构：倒排索引
相关性算法：TF-IDF、BM25
分词技术：中文分词
Elasticsearch：生产级搜索引擎
性能优化：索引分片、缓存策略

面试重点：

倒排索引原理和实现
BM25算法公式
Elasticsearch架构和使用

参考资料：

《搜索引擎原理》
Elasticsearch官方文档
Lucene核心技术

如何设计搜索引擎 ​

一、问题描述 ​

1.1 业务背景 ​

1.2 核心功能 ​

1.3 技术挑战 ​

1.4 面试考察点 ​

二、核心概念 ​

2.1 倒排索引（Inverted Index） ​

2.2 TF-IDF算法 ​

三、系统设计 ​

3.1 架构图 ​

3.2 数据结构设计 ​

四、核心实现 ​

4.1 Go实现 ​

4.2 Java实现 ​

五、Elasticsearch应用 ​

5.1 Elasticsearch简介 ​

5.2 使用示例 ​

六、性能优化 ​

6.1 索引优化 ​

6.2 查询优化 ​

七、面试要点 ​

7.1 常见追问 ​

7.2 扩展知识点 ​

八、总结 ​

如何设计搜索引擎

一、问题描述

1.1 业务背景

1.2 核心功能

1.3 技术挑战

1.4 面试考察点

二、核心概念

2.1 倒排索引（Inverted Index）

2.2 TF-IDF算法

三、系统设计

3.1 架构图

3.2 数据结构设计

四、核心实现

4.1 Go实现

4.2 Java实现

五、Elasticsearch应用

5.1 Elasticsearch简介

5.2 使用示例

六、性能优化

6.1 索引优化

6.2 查询优化

七、面试要点

7.1 常见追问

7.2 扩展知识点

八、总结