As a search engine, ES (Elasticsearch) offers efficient, full-featured search. This post walks through the details.
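IK Chinese analyzer
By default, ES does not segment Chinese text into words: the input is split into individual characters instead of the words we actually want.
Fortunately, ES supports plugins, and installing the IK analyzer solves this problem.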
With Docker, first enter the container (skip this step if you are not using Docker):

    docker exec -it elasticsearch bash

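The bin directory contains the plugin installer, elasticsearch-plugin.
Download and install the plugin. The plugin version must match your ES version (this example installs 7.8.1):

    elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/7.8.1/elasticsearch-analysis-ik-7.8.1.zip
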
Restart the container to finish the installation:

    exit    # leave the container
    docker restart elasticsearch

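To make the version-matching rule concrete, here is a minimal sketch that builds the install command from a version string. The URL pattern is copied from the command above; that other releases follow the same pattern is an assumption, and both function names are hypothetical.

```python
# Assumption: the release URL follows the same pattern as the 7.8.1 URL above.
IK_RELEASE_BASE = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def ik_plugin_url(es_version):
    """Build the IK download URL for the given ES version (hypothetical helper)."""
    return f"{IK_RELEASE_BASE}/{es_version}/elasticsearch-analysis-ik-{es_version}.zip"


def install_command(es_version):
    """Return the elasticsearch-plugin command line for that version."""
    return f"elasticsearch-plugin install {ik_plugin_url(es_version)}"


print(install_command("7.8.1"))
```

Deriving both version occurrences from one variable avoids the easy mistake of updating only one of them.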
Testing
As before, we use Kibana for testing.
Without specifying an analyzer (ES falls back to its default standard analyzer):

    GET _analyze
    {
      "text": "学习?学个屁"
    }

Response:

    {
      "tokens" : [
        { "token" : "学", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
        { "token" : "习", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 },
        { "token" : "学", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 2 },
        { "token" : "个", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 3 },
        { "token" : "屁", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 4 }
      ]
    }

As you can see, the text was split into individual characters.
With the IK analyzer:

    GET _analyze
    {
      "analyzer": "ik_smart",
      "text": "学习?学个屁"
    }

Response:

    {
      "tokens" : [
        { "token" : "学习", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
        { "token" : "学", "start_offset" : 3, "end_offset" : 4, "type" : "CN_CHAR", "position" : 1 },
        { "token" : "个", "start_offset" : 4, "end_offset" : 5, "type" : "CN_CHAR", "position" : 2 },
        { "token" : "屁", "start_offset" : 5, "end_offset" : 6, "type" : "CN_CHAR", "position" : 3 }
      ]
    }

学习 is now treated as one word rather than two separate characters.
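Custom dictionary
学习 is treated as a word, but 学个屁 is still split into single characters. To have 学个屁 recognized as a word, we need a custom dictionary.
Enter the container. This is unnecessary if you have mounted the config directory as a volume (I forgot to):

    docker exec -it elasticsearch bash
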
The ES Docker image is based on CentOS, so first install vim with the system package manager.
Then go to the dictionary directory:

    cd /usr/share/elasticsearch/config/analysis-ik

Use vim to create a file named my_word.dic and add the words we need, one per line.
Check the file contents:

    [root@3fb842497984 analysis-ik]# cat my_word.dic
    学个屁

Register the dictionary: open IKAnalyzer.cfg.xml and add the custom dictionary to ext_dict:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer 扩展配置</comment>
        <entry key="ext_dict">my_word.dic</entry>
        <entry key="ext_stopwords"></entry>
    </properties>

Exit and restart the container:

    exit
    docker restart elasticsearch

Run the same ik_smart request again and check the result:

    {
      "tokens" : [
        { "token" : "学习", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
        { "token" : "学个屁", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 1 }
      ]
    }

The test succeeds.
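To see why adding a dictionary entry changes the output, here is a toy dictionary-based segmenter using forward maximum matching. It is only an illustration under simplifying assumptions, not IK's actual algorithm; the function name and the tiny dictionaries are hypothetical.

```python
def segment(text, dictionary):
    """Forward maximum matching: at each position take the longest dictionary
    word, falling back to a single character when nothing matches."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        if not text[i].isalnum():  # skip punctuation such as the question mark
            i += 1
            continue
        # Try the longest candidate first, down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


base_dictionary = {"学习"}                       # stands in for the built-in dictionary
extended_dictionary = base_dictionary | {"学个屁"}  # plus the my_word.dic entry

print(segment("学习?学个屁", base_dictionary))
print(segment("学习?学个屁", extended_dictionary))
```

With only the base dictionary, 学个屁 falls back to single characters; once the phrase is in the dictionary it is emitted as one token, mirroring the two responses above.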
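ES search syntax
keyword vs. text
keyword: the value is indexed as a single term and is not analyzed
text: the value is analyzed (split into tokens) when it is indexed
Before testing, we also need to introduce two kinds of ES queries:
match: full-text query; the search text is analyzed before matching
term: exact query; the search text is not analyzed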
Testing
Create an index in which name is a keyword field and desc is a text field:

    PUT user
    {
      "mappings": {
        "properties": {
          "name": { "type": "keyword" },
          "age": { "type": "integer" },
          "desc": { "type": "text" }
        }
      }
    }

Insert three documents:

    PUT user/_doc/1
    {
      "name": "user koorye1",
      "age": 19,
      "desc": "i love python"
    }

    PUT user/_doc/2
    {
      "name": "user koorye2",
      "age": 20,
      "desc": "i love java"
    }

    PUT user/_doc/3
    {
      "name": "user koorye3",
      "age": 21,
      "desc": "i love c"
    }

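Now let's try both query types against both field types:
- match query on a keyword field

    GET user/_search
    {
      "query": {
        "match": {
          "name": "user"
        }
      }
    }

The result is empty. Why? Because name is a keyword field, each value is indexed as a single term, so a fragment of the value cannot match it.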
- term query on a keyword field

    GET user/_search
    {
      "query": {
        "term": {
          "name": "user koorye1"
        }
      }
    }

The search text is used as-is and name is a keyword field, so neither side is analyzed and only a complete, exact match succeeds. This time we get one hit:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
      "hits" : {
        "total" : { "value" : 1, "relation" : "eq" },
        "max_score" : 0.9808291,
        "hits" : [
          {
            "_index" : "user", "_type" : "_doc", "_id" : "1", "_score" : 0.9808291,
            "_source" : { "name" : "user koorye1", "age" : 19, "desc" : "i love python" }
          }
        ]
      }
    }

- match query on a text field

    GET user/_search
    {
      "query": {
        "match": {
          "desc": "i love"
        }
      }
    }

Because desc is a text field, its values were analyzed at index time, so the index effectively holds the individual words i / love / python / java / c. A match query also analyzes the search text, producing i / love. By default a match query only needs one of these terms to match, so all 3 documents are returned:

    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
      "hits" : {
        "total" : { "value" : 3, "relation" : "eq" },
        "max_score" : 0.26706278,
        "hits" : [
          {
            "_index" : "user", "_type" : "_doc", "_id" : "1", "_score" : 0.26706278,
            "_source" : { "name" : "user koorye1", "age" : 19, "desc" : "i love python" }
          },
          {
            "_index" : "user", "_type" : "_doc", "_id" : "2", "_score" : 0.26706278,
            "_source" : { "name" : "user koorye2", "age" : 20, "desc" : "i love java" }
          },
          {
            "_index" : "user", "_type" : "_doc", "_id" : "3", "_score" : 0.26706278,
            "_source" : { "name" : "user koorye3", "age" : 21, "desc" : "i love c" }
          }
        ]
      }
    }

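- term query on a text field

    GET user/_search
    {
      "query": {
        "term": {
          "desc": "i love python"
        }
      }
    }

The search text is identical to the stored value, yet the result is empty, which may be surprising. Because desc is a text field, its values were split into tokens at index time: the index holds only the separate words i / love / python, not the whole sentence i love python. So even an identical string finds nothing.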
Conversely, a term query on a single word does return a result:

    GET user/_search
    {
      "query": {
        "term": {
          "desc": "python"
        }
      }
    }

Response:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
      "hits" : {
        "total" : { "value" : 1, "relation" : "eq" },
        "max_score" : 0.9808291,
        "hits" : [
          {
            "_index" : "user", "_type" : "_doc", "_id" : "1", "_score" : 0.9808291,
            "_source" : { "name" : "user koorye1", "age" : 19, "desc" : "i love python" }
          }
        ]
      }
    }

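The four cases above can be reproduced with a toy inverted index. This is a rough sketch under simplifying assumptions (whitespace splitting stands in for the standard analyzer, no scoring), not how ES is implemented; all names are hypothetical.

```python
# Toy model only: real Elasticsearch analysis and scoring are far richer.
DOCS = {
    1: {"name": "user koorye1", "desc": "i love python"},
    2: {"name": "user koorye2", "desc": "i love java"},
    3: {"name": "user koorye3", "desc": "i love c"},
}
FIELD_TYPES = {"name": "keyword", "desc": "text"}


def analyze(field, value):
    """Return the terms a value yields for a field (hypothetical helper)."""
    if FIELD_TYPES[field] == "text":
        return value.lower().split()  # crude stand-in for the standard analyzer
    return [value]  # keyword: the whole value is one term


def build_index(docs):
    """Map (field, term) to the set of document ids containing that term."""
    index = {}
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            for term in analyze(field, value):
                index.setdefault((field, term), set()).add(doc_id)
    return index


INDEX = build_index(DOCS)


def term_query(field, value):
    """Exact lookup: the search text is used as a single term, unanalyzed."""
    return sorted(INDEX.get((field, value), set()))


def match_query(field, text):
    """Analyze the search text, then return docs matching any resulting term."""
    hits = set()
    for term in analyze(field, text):
        hits |= INDEX.get((field, term), set())
    return sorted(hits)


print(match_query("name", "user"))          # match on keyword
print(term_query("name", "user koorye1"))   # term on keyword
print(match_query("desc", "i love"))        # match on text
print(term_query("desc", "i love python"))  # term on text, whole sentence
print(term_query("desc", "python"))         # term on text, single word
```

The key point is that a term query looks up exactly one entry in the index, so it only succeeds when the search text equals a stored term.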
Filtering returned fields
Use _source to choose which fields are returned:

    GET user/_search
    {
      "query": {
        "match": {
          "desc": "i love"
        }
      },
      "_source": ["name", "age"]
    }

Response:

    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
      "hits" : {
        "total" : { "value" : 3, "relation" : "eq" },
        "max_score" : 1.497693,
        "hits" : [
          {
            "_index" : "user", "_type" : "_doc", "_id" : "1", "_score" : 1.497693,
            "_source" : { "name" : "user koorye1", "age" : 19 }
          },
          {
            "_index" : "user", "_type" : "_doc", "_id" : "2", "_score" : 1.497693,
            "_source" : { "name" : "user koorye2", "age" : 20 }
          },
          {
            "_index" : "user", "_type" : "_doc", "_id" : "3", "_score" : 1.497693,
            "_source" : { "name" : "user koorye3", "age" : 21 }
          }
        ]
      }
    }

The response no longer contains desc.
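Conceptually, listing field names in _source keeps only those keys of each document. A minimal sketch of that idea, assuming plain field names only (the real option also accepts patterns and include/exclude forms); the function name is hypothetical:

```python
def filter_source(source, fields):
    """Keep only the listed top-level fields of a document (toy version)."""
    wanted = set(fields)
    return {key: value for key, value in source.items() if key in wanted}


doc = {"name": "user koorye1", "age": 19, "desc": "i love python"}
print(filter_source(doc, ["name", "age"]))
```

Filtering on the server side like this reduces the response size when documents carry large fields the client does not need.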
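Pagination and sorting
Use sort to set the ordering, from to set the offset of the first hit to return (0 is the first hit, so from is an offset rather than a page number), and size to set the number of hits per page:

    GET user/_search
    {
      "query": {
        "match": {
          "desc": "i love"
        }
      },
      "sort": [
        {
          "age": {
            "order": "desc"
          }
        }
      ],
      "from": 0,
      "size": 2
    }
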
Response:

    {
      "took" : 22,
      "timed_out" : false,
      "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
      "hits" : {
        "total" : { "value" : 3, "relation" : "eq" },
        "max_score" : null,
        "hits" : [
          {
            "_index" : "user", "_type" : "_doc", "_id" : "3", "_score" : null,
            "_source" : { "name" : "user koorye3", "age" : 21, "desc" : "i love c" },
            "sort" : [ 21 ]
          },
          {
            "_index" : "user", "_type" : "_doc", "_id" : "2", "_score" : null,
            "_source" : { "name" : "user koorye2", "age" : 20, "desc" : "i love java" },
            "sort" : [ 20 ]
          }
        ]
      }
    }

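Since from is an offset, a page number has to be converted before it goes into the request. The sketch below shows that conversion and mimics sort plus from/size on an in-memory list; the helper names and the 1-based page convention are assumptions made for illustration.

```python
USERS = [
    {"name": "user koorye1", "age": 19},
    {"name": "user koorye2", "age": 20},
    {"name": "user koorye3", "age": 21},
]


def page_params(page, size):
    """Convert a 1-based page number into the from/size pair of a search body."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}


def sort_and_page(docs, field, order, offset, size):
    """Sort by one field, then return the slice starting at the given offset."""
    ordered = sorted(docs, key=lambda doc: doc[field], reverse=(order == "desc"))
    return ordered[offset:offset + size]


params = page_params(page=2, size=2)
print(params)
print(sort_and_page(USERS, "age", "desc", params["from"], params["size"]))
```

With three documents and a page size of 2, the first page holds ages 21 and 20, and the second page holds the remaining document.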
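Boolean queries
ES provides several clause types for AND / OR / NOT logic:
- must: AND, every clause has to match
- should: OR, at least one clause has to match
- must_not: NOT, none of the clauses may match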
AND form:

    GET user/_search
    {
      "query": {
        "bool": {
          "must": [{
            "match": {
              "desc": "love"
            }
          },{
            "match": {
              "age": "19"
            }
          }]
        }
      }
    }

OR form:

    GET user/_search
    {
      "query": {
        "bool": {
          "should": [{
            "match": {
              "desc": "love"
            }
          },{
            "match": {
              "age": "19"
            }
          }]
        }
      }
    }

NOT form:

    GET user/_search
    {
      "query": {
        "bool": {
          "must_not": [{
            "match": {
              "age": "19"
            }
          }]
        }
      }
    }

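The three clause types can be modelled as predicates over documents. This is a simplified sketch that ignores scoring; it assumes the default behaviour in which should clauses are only required when the bool query has no must clause. All names are hypothetical.

```python
USERS = [
    {"id": 1, "age": 19, "desc": "i love python"},
    {"id": 2, "age": 20, "desc": "i love java"},
    {"id": 3, "age": 21, "desc": "i love c"},
]


def bool_query(docs, must=(), should=(), must_not=()):
    """Return ids of docs that satisfy the combined clauses (no scoring)."""
    hits = []
    for doc in docs:
        if not all(clause(doc) for clause in must):
            continue  # AND: every must clause has to hold
        if any(clause(doc) for clause in must_not):
            continue  # NOT: no must_not clause may hold
        # OR: with no must clause, at least one should clause has to hold.
        if should and not must and not any(clause(doc) for clause in should):
            continue
        hits.append(doc["id"])
    return hits


def loves(doc):
    return "love" in doc["desc"].split()


def is_19(doc):
    return doc["age"] == 19


print(bool_query(USERS, must=[loves, is_19]))
print(bool_query(USERS, should=[loves, is_19]))
print(bool_query(USERS, must_not=[is_19]))
```

The three calls correspond to the AND, OR, and NOT requests above, run against the three sample documents.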
Range queries
ES supports range queries with the following operators:
- gt: greater than
- lt: less than
- gte: greater than or equal to
- lte: less than or equal to
Example:

    GET user/_search
    {
      "query": {
        "bool": {
          "filter": [{
            "range": {
              "age": {
                "gte": 20,
                "lte": 30
              }
            }
          }]
        }
      }
    }

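The semantics of the four operators can be written out as a small predicate. A minimal sketch for numeric values only; the function name is hypothetical, and the keyword arguments simply reuse the operator names.

```python
def in_range(value, gt=None, gte=None, lt=None, lte=None):
    """Return True when value satisfies every bound that was supplied."""
    if gt is not None and not value > gt:
        return False
    if gte is not None and not value >= gte:
        return False
    if lt is not None and not value < lt:
        return False
    if lte is not None and not value <= lte:
        return False
    return True


ages = [19, 20, 21]
print([age for age in ages if in_range(age, gte=20, lte=30)])
```

Applied to the sample ages, the bounds from the request above keep the documents with ages 20 and 21 and drop the one with age 19.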
Highlighting results
Use highlight to specify the fields to highlight:

    GET user/_search
    {
      "query": {
        "match": {
          "desc": "love"
        }
      },
      "highlight": {
        "fields": {
          "desc": {}
        }
      }
    }

Response:

    {
      "took" : 115,
      "timed_out" : false,
      "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
      "hits" : {
        "total" : { "value" : 3, "relation" : "eq" },
        "max_score" : 0.13353139,
        "hits" : [
          {
            "_index" : "user", "_type" : "_doc", "_id" : "1", "_score" : 0.13353139,
            "_source" : { "name" : "user koorye1", "age" : 19, "desc" : "i love python" },
            "highlight" : { "desc" : [ "i <em>love</em> python" ] }
          },
          {
            "_index" : "user", "_type" : "_doc", "_id" : "2", "_score" : 0.13353139,
            "_source" : { "name" : "user koorye2", "age" : 20, "desc" : "i love java" },
            "highlight" : { "desc" : [ "i <em>love</em> java" ] }
          },
          {
            "_index" : "user", "_type" : "_doc", "_id" : "3", "_score" : 0.13353139,
            "_source" : { "name" : "user koorye3", "age" : 21, "desc" : "i love c" },
            "highlight" : { "desc" : [ "i <em>love</em> c" ] }
          }
        ]
      }
    }

We can also customize the tags placed before and after each highlighted term:

    GET user/_search
    {
      "query": {
        "match": {
          "desc": "love"
        }
      },
      "highlight": {
        "fields": {
          "desc": {}
        },
        "pre_tags": "<span class='high'>",
        "post_tags": "</span>"
      }
    }

Response:

    {
      "took" : 6,
      "timed_out" : false,
      "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
      "hits" : {
        "total" : { "value" : 3, "relation" : "eq" },
        "max_score" : 0.13353139,
        "hits" : [
          {
            "_index" : "user", "_type" : "_doc", "_id" : "1", "_score" : 0.13353139,
            "_source" : { "name" : "user koorye1", "age" : 19, "desc" : "i love python" },
            "highlight" : { "desc" : [ "i <span class='high'>love</span> python" ] }
          },
          {
            "_index" : "user", "_type" : "_doc", "_id" : "2", "_score" : 0.13353139,
            "_source" : { "name" : "user koorye2", "age" : 20, "desc" : "i love java" },
            "highlight" : { "desc" : [ "i <span class='high'>love</span> java" ] }
          },
          {
            "_index" : "user", "_type" : "_doc", "_id" : "3", "_score" : 0.13353139,
            "_source" : { "name" : "user koorye3", "age" : 21, "desc" : "i love c" },
            "highlight" : { "desc" : [ "i <span class='high'>love</span> c" ] }
          }
        ]
      }
    }

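The effect of pre_tags and post_tags can be imitated with a few lines of string handling. This is a toy sketch that assumes whitespace-separated words and whole-word matches; the real highlighter works on analyzed tokens and offsets. The function name is hypothetical.

```python
def highlight(text, query_terms, pre_tag="<em>", post_tag="</em>"):
    """Wrap every word that equals a query term in the given tags."""
    terms = {term.lower() for term in query_terms}
    words = [
        f"{pre_tag}{word}{post_tag}" if word.lower() in terms else word
        for word in text.split()
    ]
    return " ".join(words)


print(highlight("i love python", ["love"]))
print(highlight("i love python", ["love"], "<span class='high'>", "</span>"))
```

The default tags reproduce the first response above and the custom tags reproduce the second, which makes it easy to style hits with a CSS class on the front end.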