Elasticsearch Search in Depth: Proximity Matching

Author: 觉释 | Published: 2020-09-03 08:29

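Phrase Matching

Just as the match query is the go-to query for standard full-text search, the match_phrase query is the one to reach for when you want to find words that appear near each other.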

GET /my_index/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}

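Older Elasticsearch releases (1.x and 2.x) also accepted a match query of type phrase. The type parameter was deprecated in 5.0 and removed in 6.0, so this form is shown only for reference: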

"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}

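Term Positions
When a string is analyzed, the analyzer returns not only a list of terms but also the position, or order, of each term in the original string:

GET /_analyze
{ "analyzer": "standard", "text": "Quick brown fox" }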

The response looks like this:

{
   "tokens": [
      {
         "token": "quick",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1 
      },
      {
         "token": "brown",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2 
      },
      {
         "token": "fox",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3 
      }
   ]
}

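position gives the place of each term in the original string. The values above start at 1, as in early releases; recent Elasticsearch releases number positions from 0.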
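
To make the idea of offsets and positions concrete, here is a minimal Python sketch. It is only a rough stand-in for the standard analyzer (it splits on word characters and lowercases), the function name analyze is made up for the example, and positions are numbered from 0.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: word tokens, lowercased."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),   # character offset where the token starts
            "end_offset": match.end(),       # character offset just past the token
            "position": position,            # ordinal position of the token (0-based here)
        })
    return tokens

tokens = analyze("Quick brown fox")
for token in tokens:
    print(token)
```

The offsets describe where a token sits in the character stream, while the position describes its order among the tokens; phrase queries rely on the latter.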

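What Is a Phrase

For a document to be considered a match for the phrase quick brown fox, all of the following must be true:
• quick, brown, and fox must all appear in the field.
• The position of brown must be 1 greater than the position of quick.
• The position of fox must be 2 greater than the position of quick.

If any of these conditions is not met, the document is not considered a match.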
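
The position rules for a phrase can be sketched in a few lines of Python. This is an illustration of the rule, not how Lucene implements it; the helper names are hypothetical and tokenization is reduced to lowercasing and splitting on whitespace.

```python
def positions_by_term(tokens):
    """Map each term to the list of positions at which it occurs."""
    index = {}
    for position, term in enumerate(tokens):
        index.setdefault(term, []).append(position)
    return index

def phrase_matches(doc_text, phrase):
    """True if every phrase term occurs in order at consecutive positions."""
    doc = positions_by_term(doc_text.lower().split())
    terms = phrase.lower().split()
    if not terms or any(term not in doc for term in terms):
        return False  # every term must be present in the field
    # Try each occurrence of the first term as the start of the phrase.
    for start in doc[terms[0]]:
        if all(start + offset in doc[term] for offset, term in enumerate(terms)):
            return True
    return False

print(phrase_matches("The quick brown fox", "quick brown fox"))
print(phrase_matches("quick fox brown", "quick brown fox"))
```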

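Mixing It Up

Requiring an exact phrase match may be too strict. Perhaps we want a document that contains "quick brown fox" to match a query for "quick fox" as well, even though the positions are not exactly equivalent.

We can introduce a degree of flexibility into phrase matching with the slop parameter: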

GET /my_index/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick fox",
                "slop":  1
            }
        }
    }
}

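The slop parameter tells the match_phrase query how far apart terms are allowed to be while still considering the document a match. By "how far apart" we mean: how many times do you need to move a term to make the query and the document match?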
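
One way to reason about "how many moves" is to compare, for each query term, its position in the document with its position in the query, and look at the spread of those shifts. The sketch below follows that simplified model; it is an assumption-laden approximation of sloppy phrase matching (it brute-forces every combination and ignores the finer points of repeated terms), and min_slop is a hypothetical name.

```python
from itertools import product

def min_slop(doc_text, query_text):
    """Smallest slop at which the query would match, or None if a term is missing."""
    doc = doc_text.lower().split()
    query = query_text.lower().split()
    # For each query term, every position at which it occurs in the document.
    occurrences = [[p for p, term in enumerate(doc) if term == q] for q in query]
    if not query or any(not occ for occ in occurrences):
        return None
    best = None
    for combo in product(*occurrences):
        if len(set(combo)) < len(combo):
            continue  # one document position cannot serve two query terms
        # How far each term sits from where the query expects it.
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        width = max(shifts) - min(shifts)
        best = width if best is None else min(best, width)
    return best

print(min_slop("quick brown fox", "quick fox"))
print(min_slop("quick brown fox", "fox quick"))
```

Under this model a reversed query such as fox quick needs a larger slop than quick fox, because swapping two terms costs more moves than skipping one word.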

Multivalue Fields

Something odd can happen when you use phrase matching on a multivalue field. Imagine that you index this document:

PUT /my_index/_doc/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

Then run a phrase query for Abraham Lincoln:

GET /my_index/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}

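Surprisingly, the document matches the query, even though Abraham and Lincoln belong to two different people in the names array. The reason lies in the way Elasticsearch indexes arrays.
In other words, analyzing the array above produces almost exactly the same tokens as analyzing the single string John Abraham Lincoln Smith. Our example query looks for abraham immediately followed by lincoln; both terms exist and they sit right next to each other, so the query matches.

Fortunately, there is a simple fix for cases like this, called position_increment_gap, which is configured in the field mapping. Since Elasticsearch 2.0 the gap already defaults to 100, so the surprising match above occurs only on older releases or when the gap is explicitly set to 0.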

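DELETE /my_index

PUT /my_index
{
    "mappings": {
        "properties": {
            "names": {
                "type":                   "text",
                "position_increment_gap": 100
            }
        }
    }
}

First delete the index, along with every document in it. Mapping types such as groups no longer exist in Elasticsearch 7.x and later, and the string field type was replaced by text in 5.0, so the whole index is re-created.

Then re-create the index with a mapping that sets the correct value.

The position_increment_gap setting tells Elasticsearch to increase the current term position by the given value for each new element in the array. Abraham and Lincoln now end up about 100 positions apart, so the phrase query no longer matches unless you add a slop of 100.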
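
The effect of the gap on term positions can be sketched as follows. This is a simplified model with 0-based positions and whitespace tokenization, and index_positions is a hypothetical helper, not an Elasticsearch API.

```python
def index_positions(values, position_increment_gap=100):
    """Assign a position to every term of a multivalue field."""
    positions = []
    next_position = 0
    for value in values:
        for term in value.lower().split():
            positions.append((term, next_position))
            next_position += 1
        # Leave a gap before the first term of the next array element.
        next_position += position_increment_gap
    return positions

names = ["John Abraham", "Lincoln Smith"]
print(index_positions(names, position_increment_gap=0))
print(index_positions(names, position_increment_gap=100))
```

With a gap of 0, abraham and lincoln land on adjacent positions, which is why the phrase query matches; with a gap of 100 they are far enough apart that only a very sloppy phrase query could bridge them.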

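Closer Is Better

Whereas a phrase query simply excludes documents that do not contain the exact query phrase, a proximity query (a phrase query where slop is greater than 0) incorporates the proximity of the query terms into the final relevance _score. By setting a high slop value such as 50 or 100, you can exclude documents in which the words are really too far apart while giving a higher score to documents in which the words are closer together.

The following proximity query for quick dog matches documents that contain both quick and dog, but gives a higher score to documents in which the two words are nearer to each other: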

POST /my_index/_search
{
   "query": {
      "match_phrase": {
         "title": {
            "query": "quick dog",
            "slop":  50 
         }
      }
   }
}

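Note the high slop value.

In the results, a document in which quick and dog are close together scores higher, and one in which they are far apart scores lower.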
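
How proximity feeds into the score can be sketched with the weighting that Lucene's classic similarity applies to a sloppy match, 1 / (distance + 1). This is only an illustration: the final _score also depends on the configured similarity, term statistics, and field length, and the function name below is made up for the example.

```python
def sloppy_weight(distance):
    """Weight of a sloppy phrase match; closer matches weigh more."""
    return 1.0 / (1 + distance)

# distance = number of moves needed to line the terms up (0 means an exact phrase).
for distance in (0, 1, 6, 50):
    print(distance, round(sloppy_weight(distance), 3))
```

The weight falls off quickly as the terms drift apart, which is why a high slop value still favors documents where the words sit close together.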

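Proximity for Relevance

Rather than making proximity an absolute requirement, we can use it as a signal: one of several queries that each contribute to a document's overall score. Here a match query selects the documents and a match_phrase query adjusts their ranking: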

GET /my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { 
          "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": { 
          "title": {
            "query": "quick brown fox",
            "slop":  50
          }
        }
      }
    }
  }
}

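The must clause includes or excludes documents from the result set.

The should clause increases the relevance score of the documents that match it.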

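Improving Performance

Phrase and proximity queries are more expensive than simple match queries. A match query only has to check whether terms exist in the inverted index, whereas a match_phrase query has to calculate and compare the positions of multiple, possibly repeated, terms.

It therefore makes sense to run the cheaper match query first and apply the proximity query only to the top results, which is what the rescore feature does: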

GET /my_index/_search
{
    "query": {
        "match": {  
            "title": {
                "query":                "quick brown fox",
                "minimum_should_match": "30%"
            }
        }
    },
    "rescore": {
        "window_size": 50, 
        "query": {         
            "rescore_query": {
                "match_phrase": {
                    "title": {
                        "query": "quick brown fox",
                        "slop":  50
                    }
                }
            }
        }
    }
}

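The match query decides which documents are included in the final result set and ranks them by the default similarity (BM25 since Elasticsearch 5.0, TF/IDF before that).

window_size is the number of top results to rescore, per shard.

At the time of writing, the only supported rescoring algorithm is another query, but there are plans to add more algorithms later.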
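
The rescoring step can be pictured with a small single-shard simulation. It is a sketch under stated assumptions: the hits and phrase bonuses are invented, rescore is a hypothetical function, and the scores are combined by adding the two weighted parts, which mirrors the query_weight and rescore_query_weight options with their default of 1.

```python
def rescore(hits, window_size, rescore_fn, query_weight=1.0, rescore_query_weight=1.0):
    """Rescore only the top window_size hits; the rest keep their original order."""
    ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    rescored = [
        {**hit, "score": query_weight * hit["score"] + rescore_query_weight * rescore_fn(hit)}
        for hit in window
    ]
    rescored.sort(key=lambda hit: hit["score"], reverse=True)
    return rescored + rest

# Invented first-pass scores from the cheap match query.
hits = [
    {"id": "a", "score": 1.0},
    {"id": "b", "score": 0.9},
    {"id": "c", "score": 0.8},
]
# Invented scores from the expensive phrase query (0 when it does not match).
phrase_bonus = {"b": 0.5, "c": 1.0}

result = rescore(hits, window_size=2, rescore_fn=lambda hit: phrase_bonus.get(hit["id"], 0.0))
print([hit["id"] for hit in result])
```

Document c would have benefited most from the phrase query, but it falls outside the window and is never rescored. That is the trade-off of window_size: a smaller window is cheaper but can miss documents ranked lower by the first pass.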

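Finding Associated Words

Phrase and proximity queries are useful, but they still have a downside. They are overly strict: every term must be present for a phrase query to match, even when slop is used.

Indexing pairs of adjacent words (bigrams, also called shingles) alongside the single words (unigrams) loosens this requirement, because a document gains score from whichever word pairs it shares with the query without having to contain all of them.

Producing Shingles

Shingles need to be created at index time as part of the analysis process. We could index both unigrams and bigrams into a single field, but it is cleaner to keep them in separate fields that can be queried independently. The unigrams field forms the basis of our search, and the bigrams field is used to boost relevance.

First, we need to create an analyzer that uses the shingle token filter: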

DELETE /my_index

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,  
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2, 
                    "max_shingle_size": 2, 
                    "output_unigrams":  false   
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":             "custom",
                    "tokenizer":        "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" 
                    ]
                }
            }
        }
    }
}

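The default minimum and maximum shingle size is 2, so these settings are not strictly required.

The shingle token filter outputs unigrams by default, but we want to keep unigrams and bigrams separate.

my_shingle_analyzer uses our custom my_shingle_filter token filter.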
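
What the filter settings do to a token stream can be sketched in Python. The function below is a hypothetical imitation whose parameters are named after the filter options; it makes no claim about the exact order or position metadata of the real filter's output.

```python
def shingle_filter(tokens, min_shingle_size=2, max_shingle_size=2, output_unigrams=True):
    """Emit word n-grams (shingles) from a list of tokens."""
    output = []
    for start, token in enumerate(tokens):
        if output_unigrams:
            output.append(token)  # keep the single word as well
        for size in range(min_shingle_size, max_shingle_size + 1):
            if start + size <= len(tokens):
                output.append(" ".join(tokens[start:start + size]))
    return output

tokens = ["sue", "ate", "the", "alligator"]
print(shingle_filter(tokens, output_unigrams=False))
print(shingle_filter(tokens))
```

Setting output_unigrams to false leaves only the bigrams, which is what the separate shingles field is meant to hold.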

Multifields

We said that it is cleaner to index unigrams and bigrams separately, so we will create the title field as a multifield (see String Sorting and Multifields):

PUT /my_index/_mapping
{
    "properties": {
        "title": {
            "type": "text",
            "fields": {
                "shingles": {
                    "type":     "text",
                    "analyzer": "my_shingle_analyzer"
                }
            }
        }
    }
}

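With this mapping, the title value of each JSON document is indexed both as unigrams (title) and as bigrams (title.shingles), so the two fields can be queried independently.

Finally, we can index our example documents: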

POST /my_index/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }

Searching for Shingles
To see what the shingles field adds, let's first look at the results of a simple match query for The hungry alligator ate Sue:

GET /my_index/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}

This query returns all three documents, but note that documents 1 and 2 have the same relevance score because they contain the same words:

{
  "hits": [
     {
        "_id": "1",
        "_score": 0.44273707, 
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "2",
        "_score": 0.44273707, 
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "3", 
        "_score": 0.046571054,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

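Both documents contain the, alligator, ate, and sue, so they receive the same score.

We could have excluded document 3 by setting the minimum_should_match parameter; see Controlling Precision.

Now let's add the shingles field to the query. Remember that a match on the shingles field acts only as a signal that raises the relevance score, so the main title field still has to be part of the query: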

GET /my_index/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}

All three documents still match, but document 2 now ranks first because it matched the shingled terms alligator ate and ate sue.

{
  "hits": [
     {
        "_id": "2",
        "_score": 0.4883322,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "1",
        "_score": 0.13422975,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "3",
        "_score": 0.014119488,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

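Even though the query contained the word hungry, which appears in none of the documents, word proximity still brought the most relevant document to the top.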
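
Why document 2 wins can be checked by hand: list the bigrams of the query and of each title and see which ones they share. The sketch below does exactly that with a simplified lowercase-and-split tokenization; bigrams is a hypothetical helper and no scoring is modeled.

```python
def bigrams(text):
    """Two-word shingles of a lowercased, whitespace-split string."""
    terms = text.lower().split()
    return [" ".join(terms[i:i + 2]) for i in range(len(terms) - 1)]

query = "the hungry alligator ate sue"
titles = {
    1: "Sue ate the alligator",
    2: "The alligator ate Sue",
    3: "Sue never goes anywhere without her alligator skin purse",
}

query_shingles = set(bigrams(query))
shared = {
    doc_id: sorted(set(bigrams(title)) & query_shingles)
    for doc_id, title in titles.items()
}
for doc_id, matches in shared.items():
    print(doc_id, matches)
```

Only document 2 shares any bigram with the query, so it is the only one that picks up extra score from the should clause on title.shingles.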

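Performance

Shingles are not only more flexible than phrase queries, they also perform better. Instead of paying the cost of a phrase query on every search, a query on shingles is as efficient as a simple match query. There is a small cost at index time, because more terms have to be indexed, and fields with shingles take up more disk space. However, most applications write once and read many times, so it makes sense to optimize for query speed at index time.

This is a recurring theme in Elasticsearch: it lets you achieve a lot at search time without requiring much up-front setup. Once you understand your requirements more clearly, you can get better results and better performance by modeling your data correctly at index time.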
