基于Redis的自动补全算法

使用Redis实现搜索引擎的自动补全功能

搜索引擎通常有一个自动不全功能，即用户输入了搜索对象的前缀后，搜索引擎会自动匹配可能的搜索对象的全称。编程时使用的IDE或文本编辑器也有类似自动匹配的提示。常规的思路是使用Trie树或其优化后的Patricia树。具体的实现上，要权衡内存消耗和查询效率，使用Double Array Trie是一种不错的思路。今天，提供一种不一样的实现思路，它不需要我们实现自己的Trie树，而是借用Redis内置的数据结构。这样做的好处是，多个WEB应用（进程）可用通过Redis共享存储的数据，而不用在进程中维护前缀匹配的数据（Trie树这种方案要维护前缀数据需要额外的空间）。另外，Redis内置的数据结构使用C语言实现，经得起性能考验。

本文先提供简单的思路，然后在这个思路上优化。

朴素的实现

我们先假定一个场景以方便说明原理。假设现在数据是以键值的形式保存，把键当做是区分不同场景下值的名字空间，值是一个列表，保存该名字空间下的搜索对象。我们要做的是，给定名字空间和搜索前缀的情况下，找到该名字空间下的所有匹配前缀的搜索对象。例如：

1	key-->[abc, abcd, wind]

key是名字空间，现在我们输入搜索前缀prefix，程序需要好到所有匹配prefix的对象。如果prefix=abc，那么输入的结果为abc、abcd。

一个明显的思路是遍历，对，这就是我们的经典实现。每次进行前缀查找时，把key对应的值（列表）返回给应用进程，程序负责遍历是否匹配前缀。具体的代码如下：

import redis

class AutoComplete:

    def __init__(self, store_size=100):
        self._conn = redis.Redis()
        self._store_size = store_size

    def add(self, key, value):
        ac_list = 'recent:' + key
        pipeline = self._conn.pipeline(True)
        pipeline.lrem(ac_list, value)
        pipeline.lpush(ac_list, value)
        # pipeline.ltrim(ac_list, 0, self._store_size-1) 这行代码用于维护有限的内存空间，一般不用
        pipeline.execute()

    def remove(self, key, value):
        self._conn.lrem('recent:'+key, value)

    def find_prefix(self, key, prefix):
        candidates = self._conn.lrange('recent:'+key, 0, -1)
        matches = []
        for candidate in candidates:
            if candidate.lower().decode('utf-8').startswith(prefix):
                matches.append(candidate)
        return matches

pipeline.lrem(ac_list, value)和pipeline.lpush(ac_list, value)用于把先添加的元素放到列表前面。前缀匹配的核心代码：

def find_prefix(self, key, prefix):
    candidates = self._conn.lrange('recent:'+key, 0, -1)
    matches = []
    for candidate in candidates:
        if candidate.lower().decode('utf-8').startswith(prefix):
            matches.append(candidate)
    return matches

每一次调用find_prefix时都需要遍历列表，时间复杂度为O(N)，如果列表很大，会影响性能。可以通过在add函数中添加pipeline.ltrim(ac_list, 0, self._store_size-1)来确保遍历的链表在一定范围内。

简单的测试样例：

def test_AutoComplete():
    ac = AutoComplete()
    ac.add('allen', 'wind')
    ac.add('allen', 'windy')
    ac.add('allen', 'winding')
    print(ac.find_prefix('allen', 'wind'))
    ac.remove('allen', 'winding')
    print(ac.find_prefix('allen', 'wind'))

优化的实现

接着上面的思路，如果列表已经排序好了，那么那么通过二分查找可以找到前缀所在的区间范围，直接查找该区间即可。但Redis的列表数据结构并不是自动排序的。注意到，Redis的ZSet数据结构的值（集合成员）是可以有序的，为了达到这个结果，需要特殊的处理：把每个成员member的权值设置为零，有序集的元素的排序就是members的排序。通过操作redis体验下。

127.0.0.1:6379> zadd key 0 abc 0 abg 0 abd 0 abe 0 abf
(integer) 5
127.0.0.1:6379> zrange key 0 -1
1) "abc"
2) "abd"
3) "abe"
4) "abf"
5) "abg"

可以看到值（集合成员）是按字符串排序好的，因此，通过二分搜索找到prefix的范围即可。但还需要一点小技巧，通过Python代码来理解这个算法过程。

def find_prefix_range(self, prefix):
    local = bisect.bisect_left(self.characters, prefix[-1:])
    suffix = self.characters[(local or 1) - 1]
    _id = str(uuid.uuid4()) # 为了防止并发的情况，覆盖了别的task的成员标记。
    # _id = '' # 如果没有并发的情况，_id可以设置为空字符
    return prefix[:-1] + suffix + '{' + _id, prefix + '{' + _id

def find_prefix(self, key, prefix):
    start, end = self.find_prefix_range(prefix)
    key = 'members:' + key
    self._conn.zadd(key, start, 0, end, 0)
    pipeline = self._conn.pipeline(True)
    while True:
        try:
            pipeline.watch(True)
            start_index = pipeline.zrank(key, start)
            end_index = pipeline.zrank(key, end)
            size = min(start_index+self._size-1, end_index-2)
            pipeline.multi()
            pipeline.zrem(key, start, end)
            pipeline.zrange(key, start_index, end_index)
            items = pipeline.execute()[-1]
            break
        except redis.exceptions.WatchError:
            continue
    return [item for item in items if b'{' not in item]

find_prefix_range函数生成prefix在排序时位于其左边的元素和右边元素。例如：

1) "aac"
2) "abd"
3) "abe"
4) "abf"
5) "acg"

是排序好的值，为了找到前缀为ab的所有元素，我们先生成排在ab左边和ab后边的元素。它们分别是('aa{', 'ab{')。aa{刚刚好排在所有以ab开头的字符串前面；ab{也刚刚好排在所有以ab开头的字符串后面。

随后，我们把aa{和ab{，加入到zset中，结果如下：

1) "aac"
2) "aa{"
3) "abd"
4) "abe"
5) "abf"
6) "ab{"
7) "acg"

因此，只要找到aa{和ab{的位置索引，那么这两个位置索引的区间全都为ab开头的字符串。可能这个区间也很大，我们可以通过size = min(start_index+self._size-1, end_index-2)这行代码限制区间大小。一旦区间确定了，可以通过zrange(key, start_index, end_index)返回区间指定的元素。

具体的实现见代码。这里指出几个实现上的细节。find_prefix_range函数中给返回值添加uuid字符串是考虑并发情况下，其他进程或线程也进行前缀查找而覆盖了它们的标记元素（即上面提到的aa{和ab{），如果使用Lua脚本实现上述的算法过程就可以免去UUID了，因为redis执行是单线程且lua脚本是原子执行的。使用prefix[-1:]而不是prefix[-1]是因为前者如果传入空字符不会出错。另外，使用完标记元素后调用pipeline.zrem(key, start, end)会移除标记元素（即上面提到的aa{和ab{）。

上述思路的详细过程如下：

class ZAutoComplete:

    characters = '`abcdefghijklmnopqrstuvwxyz{'

    def __init__(self, extract_size=10):
        self._conn = redis.Redis()
        self._size = extract_size

    def add(self, key, value):
        self._conn.zadd('members:'+key, value, 0)

    def remove(self, key, value):
        self._conn.zrem('members:'+key, value)

    def find_prefix_range(self, prefix):
        local = bisect.bisect_left(self.characters, prefix[-1:])
        suffix = self.characters[(local or 1) - 1]
        _id = str(uuid.uuid4())
        # _id = ''
        return prefix[:-1] + suffix + '{' + _id, prefix + '{' + _id

    def find_prefix(self, key, prefix):
        start, end = self.find_prefix_range(prefix)
        key = 'members:' + key
        self._conn.zadd(key, start, 0, end, 0)
        pipeline = self._conn.pipeline(True)
        while True:
            try:
                pipeline.watch(True)
                start_index = pipeline.zrank(key, start)
                end_index = pipeline.zrank(key, end)
                size = min(start_index+self._size-1, end_index-2)
                pipeline.multi()
                pipeline.zrem(key, start, end)
                pipeline.zrange(key, start_index, end_index)
                items = pipeline.execute()[-1]
                break
            except redis.exceptions.WatchError:
                continue
        return [item for item in items if b'{' not in item]

编写简单的测试：

def test_ZAutoComplete():
    zac = ZAutoComplete()
    zac.add('allen', 'wind')
    zac.add('allen', 'windy')
    zac.add('allen', 'winding')
    print(zac.find_prefix('allen', 'wind'))
    zac.remove('allen', 'winding')
    print(zac.find_prefix('allen', 'wind'))

算法源码在redis-autocomplete-algorithm。

结语

通过Redis的源码发现Redis的设计很简洁，但并不代表功能就简单，Redis有很多有趣的运用有待挖掘。当然，召回率更高的方法当然是借助NLP技术，后期有需要在深入。

转载请包括本文地址：https://allenwind.github.io/blog/4560
更多文章请参考：https://allenwind.github.io/blog/archives/