1. 倒排索引原理初探
有以下两个文档:
# doc1
I really liked my small dogs, and I think my mom also liked them.
# doc2
He never liked any dogs, so I hope that my mom will not expect me to liked him
对文档进行分词,统计其在这两个doc中是否存在,*代表存在
word | doc1 | doc2 |
---|---|---|
I | * | * |
really | * | |
liked | * | * |
my | * | * |
small | * | |
dogs | * | * |
and | * | |
think | * | |
mom | * | * |
also | * | |
them | * | |
He | * | |
never | * | |
any | * | |
so | * | |
hope | * | |
that | * | |
will | * | |
not | * | |
expect | * | |
me | * | |
to | * | |
him | * |
对单词进行常规化处理:
# 同义词转换
mom -> mother
small -> little
# 时态转换
liked -> like
# 类似:likes -> like
# 单复数转换
dogs -> dog
# 大小写转换
# 举例:Tom -> tom
重新建立倒排索引,加入常规化后的单词:
word | 常规化单词 | document id |
---|---|---|
I | [1、2] | |
really | [1] | |
liked | like | [1、2] |
my | [1、2] | |
small | little | [1] |
dogs | dog | [1、2] |
and | [1] | |
think | [1] | |
mom | mother | [1、2] |
also | [1] | |
them | [1] | |
He | [2] | |
never | [2] | |
any | [2] | |
so | [2] | |
hope | [2] | |
that | [2] | |
will | [2] | |
not | [2] | |
expect | [2] | |
me | [2] | |
to | [2] | |
him | [2] |
搜索"mother like little dog",首先分词,然后查看这些单词出现过的id,就返回了id为1和2的这两条文档
2. TF&IDF算法
ES使用TF&IDF算法(Term Frequency/Inverse Document Frequency)算法来计算document的评分,及其与搜索条件的匹配程度,此算法包括以下几个方面的影响因素:
- Term Frequency:搜索词出现次数越多的文档越相关
doc1:hello you, and world is very good
doc2:hello, how are you
搜索请求:hello world
hello在doc1中出现1次,在doc2中出现1次
world在doc1中出现1次,在doc2中出现0次
所以doc1更相关
- Inverse Document Frequency:搜索词在所有文档中出现的次数越多,则包含这个词的文档就越不相关
doc1:hello, today is very good
doc2:hi world, how are you
搜索请求:hello world
假设hello这个单词在全部文档中出现了10000次,world这个单词在全部文档中出现了100次
此时,doc1和doc2都是只能匹配一个单词,那么doc2的评分就比doc1靠前
- Field-Length Norm:field的长度越长,相关度越弱
doc1:{ "title": "hello article", "content": "yes"(一万个yes)}
doc2:{ "title": "my article", "content": "world yes"(一万个yes)}
搜索请求:hello world
doc1和doc2同样都是匹配到一个单词,且都出现了1次,但是doc1匹配到单词的field(title)比doc2匹配到单词的field(content)长度短
所有doc1评分更高
- 使用执行计划查看API可以获取score的具体计算公式
GET /test_index/test_type/_search?explain
{
"query": {
"match": {
"test_field": "test hello"
}
}
}
3. 正排索引
搜索的时候,要依靠倒排索引,排序的时候,需要依靠正排索引,将每个document的每个field,然后进行排序,就是所谓的正排索引,在建立索引的时候,一方面会建立倒排索引,以供搜索用,一方面会建立正排索引,供排序,聚合,过滤等操作使用
doc1: {"name": "jack", "age": 27}
doc2: {"name": "tom", "age": 30}
以上两条doc的正排索引如下:
document id | name | age |
---|---|---|
1 | jack | 27 |
2 | tom | 30 |
在内存资源充足的时候,正排索引会缓存在内存中,否则就会写入磁盘
4. 倒排索引的结构
一条倒排索引的信息中包含以下内容:
- 内容包含这个关键词的document list
- 内容包含这个关键词的所有document的数量:IDF
- 这个关键词在每个document中出现的次数:TF
- 这个关键词在这个document中的位置
- 内容包含这个关键词的每个document的长度:Length Norm
- 内容包含这个关键词的所有document的平均长度
5. 字符串排序问题
如果对一个字符串进行排序,结果往往不准确,因为分词后是多个单词,再排序就不是我们想要的结果了,通常解决方案是,将一个字符串建立两次索引,一个分词,用来进行搜索,一个不分词,用来进行排序:
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"title": {
"type": "text",
"fielddata": true,
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
PUT /my_index/my_type/1
{
"title": "first article"
}
PUT /my_index/my_type/2
{
"title": "second article"
}
PUT /my_index/my_type/3
{
"title": "third article"
}
# 使用分词的title进行排序
GET /my_index/my_type/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"title": {
"order": "desc"
}
}
]
}
{
"took" : 25,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : null,
"hits" : [
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "3",
"_score" : null,
"_source" : {
"title" : "third article"
},
"sort" : [
"third"
]
},
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "2",
"_score" : null,
"_source" : {
"title" : "second article"
},
"sort" : [
"second"
]
},
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "1",
"_score" : null,
"_source" : {
"title" : "first article"
},
"sort" : [
"first"
]
}
]
}
}
# 使用不分词的title进行排序
GET /my_index/my_type/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"title.raw": {
"order": "desc"
}
}
]
}
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : null,
"hits" : [
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "3",
"_score" : null,
"_source" : {
"title" : "third article"
},
"sort" : [
"third article"
]
},
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "2",
"_score" : null,
"_source" : {
"title" : "second article"
},
"sort" : [
"second article"
]
},
{
"_index" : "my_index",
"_type" : "my_type",
"_id" : "1",
"_score" : null,
"_source" : {
"title" : "first article"
},
"sort" : [
"first article"
]
}
]
}
}
网友评论