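Common ES Queries and Other Operations

1 Preparing the data

1.1 Creating the index and adding data

First add two documents. Indexing the first one automatically creates the index test_standard_analyzer.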

PUT /test_standard_analyzer/_doc/1
{
  "remark": "This is a test doc"
}

PUT /test_standard_analyzer/_doc/2
{
  "remark": "This is an apple"
}

Then query them.

GET test_standard_analyzer/_search
{
  "query": {
    "match_all": {}
  }
}

The query returns the following.

    "hits" : [
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "remark" : "This is a test doc"
        }
      },
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "remark" : "This is an apple"
        }
      }
    ]

1.2 Testing the analyzer

When no analyzer is specified, ES uses its default analyzer, standard. The following request shows how it tokenizes text.

POST test_standard_analyzer/_analyze
{
  "field": "remark",
  "text": "This is a test doc"
}

# or

POST test_standard_analyzer/_analyze
{
  "analyzer": "standard",
  "text": "This is a test doc"
}

The analysis result is as follows.

{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "test",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "doc",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
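The token list above can be approximated in a few lines of Python. This is only a rough sketch for plain ASCII text, assuming the standard analyzer splits on non-alphanumeric characters and lowercases each token; the real analyzer follows Unicode text segmentation rules, and the function name standard_like_tokens is hypothetical.

```python
import re

def standard_like_tokens(text):
    """Approximate the standard analyzer for plain ASCII text:
    split on non-alphanumeric characters and lowercase each token."""
    tokens = []
    for position, m in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        tokens.append({
            "token": m.group().lower(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "position": position,
        })
    return tokens

for t in standard_like_tokens("This is a test doc"):
    print(t)
```

The offsets and positions it prints line up with the _analyze response shown above.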

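2 ES queries

2.1 match

A match query analyzes the query text into tokens. By default a document is a hit as soon as it matches any one of those tokens.

The following query is a hit: the query text is tokenized into "b" and "doc", and "doc" matches document 1.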

GET test_standard_analyzer/_search
{
  "query": {
    "match": {
      "remark":"b doc"
    }
  }
}
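The default "any token matches" behaviour can be sketched against a toy inverted index. This is a simplified model, not how ES is implemented; analyze, build_index and match are hypothetical helper names, and the analyzer is approximated by lowercasing alphanumeric runs.

```python
import re
from collections import defaultdict

def analyze(text):
    # Rough stand-in for the standard analyzer (ASCII only).
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(docs):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def match(index, query):
    # A document is a hit if it contains at least one query token.
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return hits

docs = {"1": "This is a test doc", "2": "This is an apple"}
index = build_index(docs)
print(match(index, "b doc"))
```

Here "b" finds nothing, but "doc" finds document 1, so the query as a whole is a hit.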

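2.2 match_phrase

A match_phrase query also analyzes the query text into tokens.

A document is a hit when (1) every token of the query is matched, (2) the tokens appear in the same relative order, and (3) by default (slop = 0 or unset) the tokens are adjacent in the indexed text.

2.2.1 Query tokens must be adjacent in the indexed text

(1) Hit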

GET test_standard_analyzer/_search
{
  "query": {
    "match_phrase": {
      "remark":"a test doc"
    }
  }
}

(2) No hit

The following query is not a hit, because its tokens are not adjacent in the indexed text: "test" sits between them.

GET test_standard_analyzer/_search
{
  "query": {
    "match_phrase": {
      "remark":"a doc"
    }
  }
}

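2.2.2 Query tokens need not be adjacent in the indexed text

The slop parameter sets how far the query tokens may be displaced from each other, counted in token positions. With slop > 0 the tokens no longer have to be adjacent in the indexed text.

The following query is therefore a hit: only one word ("test") separates "a" and "doc", which is within a slop of 1.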

GET test_standard_analyzer/_search
{
  "query": {
    "match_phrase": {
      "remark":{
        "query": "a doc",
        "slop": 1
      }
    }
  }
}
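The effect of slop can be illustrated with a small position-based check. This is a simplified model under the assumption that a phrase matches when the displacement of the tokens relative to their positions in the query stays within slop; phrase_matches and analyze are hypothetical helpers, not ES APIs.

```python
import re
from itertools import product

def analyze(text):
    # Rough stand-in for the standard analyzer (ASCII only).
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def phrase_matches(doc_text, phrase, slop=0):
    doc_tokens = analyze(doc_text)
    query_tokens = analyze(phrase)
    # Positions at which each query token occurs in the document.
    candidates = [
        [pos for pos, tok in enumerate(doc_tokens) if tok == q]
        for q in query_tokens
    ]
    if not query_tokens or any(not c for c in candidates):
        return False  # at least one query token is missing entirely
    for positions in product(*candidates):
        # Shift of each token relative to its position in the query.
        shifts = [p - i for i, p in enumerate(positions)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False

doc = "This is a test doc"
print(phrase_matches(doc, "a test doc"))     # adjacent tokens
print(phrase_matches(doc, "a doc"))          # one word in between, slop 0
print(phrase_matches(doc, "a doc", slop=1))  # one word in between, slop 1
```

With slop 0 the tokens must sit at consecutive positions; each extra unit of slop tolerates one more position of displacement.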

2.3 multi_match

A multi_match query analyzes the query text into tokens and looks for them in several fields.

An example follows. First add one more document.

PUT /test_standard_analyzer/_doc/3
{
  "remark": "This is an apple",
  "content":"this is a green apple"
}

Then run the query below. It finds the document because it searches both the remark and content fields, and the token green is present in the inverted index of the content field.

GET test_standard_analyzer/_search
{
  "query": {
    "multi_match": {
        "query" : "green banana",
        "fields" : ["remark", "content"]
    }
  }
}
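As a rough illustration of searching several fields at once, the sketch below reports which fields of a document contain at least one query token. It is a toy model only; multi_match and analyze are hypothetical helper names, and scoring is ignored.

```python
import re

def analyze(text):
    # Rough stand-in for the standard analyzer (ASCII only).
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def multi_match(doc, query, fields):
    # Return the fields in which at least one query token occurs.
    query_tokens = set(analyze(query))
    matched = []
    for field in fields:
        if query_tokens & set(analyze(doc.get(field, ""))):
            matched.append(field)
    return matched

doc3 = {"remark": "This is an apple", "content": "this is a green apple"}
print(multi_match(doc3, "green banana", ["remark", "content"]))
```

For document 3 only the content field matches, through the token green; a non-empty result means the document is a hit.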

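2.4 term

A term query does not analyze the query text; it takes the text as a single term and looks it up in the inverted index. If the term is found the document is a hit, otherwise it is not. Concrete examples follow.

2.4.1 No hit

The inverted index contains no term "This is a test doc"; the standard analyzer indexed the text as the separate lowercase tokens this, is, a, test and doc.

# Return at most 50 hits; the default is 10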
GET test_standard_analyzer/_search
{
  "query": {
    "term": {
      "remark":"This is a test doc"
    }
  },
  "size":50
}

2.4.2 Hit

This query is a hit because it finds the term "doc" in the inverted index.

GET test_standard_analyzer/_search
{
  "query": {
    "term": {
      "remark":"doc"
    }
  }
}
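The difference from match comes down to whether the query text is analyzed before the lookup. A minimal sketch, assuming the indexed field went through a standard-like analyzer (approximated here by the hypothetical helper analyze) while the term value is used verbatim:

```python
import re

def analyze(text):
    # Rough stand-in for the standard analyzer (ASCII only).
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

# Terms stored in the inverted index for document 1.
indexed_terms = set(analyze("This is a test doc"))

def term_hits(value):
    # The value is looked up as-is: no tokenizing, no lowercasing.
    return value in indexed_terms

print(term_hits("This is a test doc"))  # whole sentence is not a stored term
print(term_hits("doc"))                 # exact stored term
print(term_hits("Doc"))                 # case differs from the stored term
```

Because the lookup is exact, even a difference in letter case is enough to miss.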

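2.5 terms

A terms query does not analyze the query values either; each value is looked up as a term in the inverted index. A document is a hit if at least one of the values matches. In other words, a terms query returns the union of the results of the individual term queries. Concrete examples follow.

2.5.1 Multiple hits

Query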

GET test_standard_analyzer/_search
{
  "query": {
    "terms": {
      "remark": ["doc", "apple"]
    }
  }
}

Query result

    "hits" : [
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "remark" : "This is a test doc"
        }
      },
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "remark" : "This is an apple"
        }
      }
    ]

2.5.2 One hit

Query

GET test_standard_analyzer/_search
{
  "query": {
    "terms": {
      "remark": ["haha", "apple"]
    }
  }
}

Query result

    "hits" : [
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "remark" : "This is an apple"
        }
      }
    ]

2.5.3 No hit

GET test_standard_analyzer/_search
{
  "query": {
    "terms": {
      "remark": ["tom", "jack"]
    }
  }
}

2.6 fuzzy

A fuzzy query performs approximate matching. It does not analyze the query text.

2.6.1 Parameters

GET test_standard_analyzer/_search
{
  "query": {
    "fuzzy": {
      "fruit":{
        "value":"app",
        "fuzziness":"1",
        "max_expansions": 50,
        "prefix_length": 1,
        "transpositions": true,
        "rewrite": "constant_score"
      }
    }
  }
}

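The main parameters are:

value: the query term;
fuzziness: the maximum Levenshtein edit distance allowed for a match;
max_expansions: the maximum number of expanded terms, 50 by default;
prefix_length: the number of leading characters of the query term kept unchanged when generating expansions, 0 by default;
transpositions: whether swapping two adjacent characters (e.g. ab → ba) counts as a single edit when generating expansions, true by default;
rewrite: the method used to rewrite the query, which determines how matches are scored. By default a relevance score is computed; with constant_score every match receives the same constant score.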

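2.6.2 How the query works

A fuzzy query proceeds as follows:

(1) using the configured Levenshtein edit distance, build the set of expanded terms for the query term (a set of all possible variations, or expansions);
(2) match each expanded term against the inverted index;
(3) return the union of the matches.

It follows that the larger the expansion set, the more documents ES may recall.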
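The three steps can be sketched in Python. This is a toy model, not the Lucene implementation: it scans a small term dictionary instead of using an automaton, the helper names edit_distance, expansions and fuzzy_search are hypothetical, and the rule for choosing which terms survive max_expansions (closest first, ties broken alphabetically) is an assumption made for the sake of the example.

```python
def edit_distance(a, b, transpositions=True):
    """Levenshtein distance; optionally count a swap of two adjacent
    characters as one edit (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def expansions(value, indexed_terms, fuzziness=2, max_expansions=50,
               prefix_length=0):
    # Step 1: collect indexed terms within the allowed edit distance.
    prefix = value[:prefix_length]
    scored = []
    for term in indexed_terms:
        if term.startswith(prefix):
            dist = edit_distance(value, term)
            if dist <= fuzziness:
                scored.append((dist, term))
    scored.sort()  # assumed ordering: closest first, then alphabetical
    return [term for _, term in scored[:max_expansions]]

def fuzzy_search(value, postings, **params):
    # Steps 2 and 3: look up every expansion and union the hits.
    hits = set()
    for term in expansions(value, postings, **params):
        hits |= postings[term]
    return hits

# Toy inverted index: term -> ids of the documents containing it.
postings = {"apple": {"7"}, "appl": {"8"}, "appla": {"9"}}
print(fuzzy_search("app", postings, fuzziness=2))
```

Shrinking max_expansions or fuzziness makes the expansion list shorter, which in turn shrinks the set of recalled documents.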

2.6.3 Examples

First insert three documents, then query.

PUT /test_standard_analyzer/_doc/7
{
  "fruit":"apple"
}


PUT /test_standard_analyzer/_doc/8
{
  "fruit":"appl"
}

PUT /test_standard_analyzer/_doc/9
{
  "fruit":"appla"
}

2.6.3.1 Query term "app", edit distance 2, max_expansions 3

This recalls "appl", "appla" and "apple", which shows that the expansion set of this fuzzy query contains "appl", "appla" and "apple".

GET test_standard_analyzer/_search
{
  "query": {
    "fuzzy": {
      "fruit":{
        "value":"app",
        "fuzziness":"2",
        "max_expansions": 3
      }
    }
  }
}


# Query result
    "hits" : [
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "8",
        "_score" : 0.58364576,
        "_source" : {
          "fruit" : "appl"
        }
      },
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 0.29182288,
        "_source" : {
          "fruit" : "apple"
        }
      },
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "9",
        "_score" : 0.29182288,
        "_source" : {
          "fruit" : "appla"
        }
      }
    ]

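2.6.3.2 Query term "app", edit distance 2, max_expansions 2

This recalls only "appl" and "appla", which shows that the expansion set of this fuzzy query contains "appl" and "appla". "apple" is not in the expansion set, so it is not recalled.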

GET test_standard_analyzer/_search
{
  "query": {
    "fuzzy": {
      "fruit":{
        "value":"app",
        "fuzziness":"2",
        "max_expansions": 2
      }
    }
  }
}

# Query result

    "hits" : [
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "8",
        "_score" : 0.65388614,
        "_source" : {
          "fruit" : "appl"
        }
      },
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "9",
        "_score" : 0.32694307,
        "_source" : {
          "fruit" : "appla"
        }
      }
    ]

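2.6.3.3 Query term "ap", edit distance 3, max_expansions 50

This recalls only "appl", not "appla" or "apple", even though the edit distance from "ap" to "appla" or "apple" is 3. The likely explanation is that fuzzy matching in ES supports a maximum edit distance of 2, so the setting of 3 is not honored as written: only "appl", at distance 2 from "ap", ends up in the expansion set.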

GET test_standard_analyzer/_search
{
  "query": {
    "fuzzy": {
      "fruit":{
        "value":"ap",
        "fuzziness":"3",
        "max_expansions": 50
      }
    }
  }
}


# Query result
    "hits" : [
      {
        "_index" : "test_standard_analyzer",
        "_type" : "_doc",
        "_id" : "8",
        "_score" : 0.0,
        "_source" : {
          "fruit" : "appl"
        }
      }
    ]

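2.7 range

A range query matches documents whose field value falls within a given range; it is most often used on numeric fields. The main conditions are:

gte: greater than or equal to
gt: greater than
lt: less than
lte: less than or equal to

Example: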

GET /test_standard_analyzer/_search
{ 
  "query": {
    "range": { 
      "nums": { 
          "gte":3, 
          "lt":8
      } 
    }
  } 
}
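The bound semantics can be spelled out with a small predicate. This is only an illustration of how the four conditions combine; in_range and the sample values are hypothetical.

```python
def in_range(value, gte=None, gt=None, lte=None, lt=None):
    # Every bound that is supplied must hold; missing bounds are ignored.
    if gte is not None and not value >= gte:
        return False
    if gt is not None and not value > gt:
        return False
    if lte is not None and not value <= lte:
        return False
    if lt is not None and not value < lt:
        return False
    return True

# Hypothetical documents keyed by id, each holding a "nums" value.
nums_by_doc = {"a": 2, "b": 3, "c": 7, "d": 8}
hits = [doc_id for doc_id, n in nums_by_doc.items() if in_range(n, gte=3, lt=8)]
print(hits)
```

With gte 3 and lt 8, the value 3 is included and the value 8 is excluded.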

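2.8 bool

A bool query combines the following kinds of clauses:

must: an AND relationship; the document must satisfy the clause
should: an OR relationship; a document that satisfies the clause can be returned. When must clauses are also present, should clauses are optional by default and only raise the score of documents that satisfy them
must_not: a NOT relationship; only documents that do not satisfy the clause can be returned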

GET /test_standard_analyzer/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"remark": "apple"}},
        {"term": {"address": "shanghai"}}
      ],
      "should": {
        "term": {"content": "green"}
      },
      "must_not": {
        "term": {"content": "red"}
      }
    }
  }
}
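How the three clause types interact can be sketched with plain predicates. This is a toy model: the helpers term and bool_query are hypothetical, fields are treated as whitespace-separated lowercase tokens, and the returned number is just a count of satisfied must and should clauses, not a real relevance score.

```python
def term(field, value):
    # Predicate: the field, split into lowercase tokens, contains value.
    return lambda doc: value in doc.get(field, "").lower().split()

def bool_query(doc, must=(), should=(), must_not=()):
    """Return None when the document is not a hit, otherwise a toy score."""
    if not all(clause(doc) for clause in must):
        return None
    if any(clause(doc) for clause in must_not):
        return None
    should_hits = sum(1 for clause in should if clause(doc))
    # Without must clauses, at least one should clause has to match.
    if not must and should and should_hits == 0:
        return None
    return len(must) + should_hits

query = dict(
    must=[term("remark", "apple"), term("address", "shanghai")],
    should=[term("content", "green")],
    must_not=[term("content", "red")],
)

green = {"remark": "this is an apple", "address": "shanghai", "content": "a green apple"}
plain = {"remark": "this is an apple", "address": "shanghai", "content": "an apple"}
red = {"remark": "this is an apple", "address": "shanghai", "content": "a red apple"}

for doc in (green, plain, red):
    print(bool_query(doc, **query))
```

The document with green content is a hit and ranks above the plain one, which is still a hit because should is optional next to must; the red one is excluded by must_not.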

2.9 Sorting

ES can sort results by a field. For example:

GET test_1216/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "id": {
        "order": "asc"
      }
    }
  ]
}

The query returns the following:

    "hits" : [
      {
        "_index" : "test_1216",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : null,
        "_source" : {
          "address" : "Jiangsu",
          "name" : "tom",
          "id" : 4
        },
        "sort" : [
          4
        ]
      },
      {
        "_index" : "test_1216",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : null,
        "_source" : {
          "address" : "Jiangsu",
          "name" : "lucy",
          "id" : 5
        },
        "sort" : [
          5
        ]
      },
      {
        "_index" : "test_1216",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : null,
        "_source" : {
          "address" : "Jiangsu",
          "name" : "jack2",
          "id" : 6
        },
        "sort" : [
          6
        ]
      }
    ]

3 Other ES operations

3.1 Deleting a document

DELETE /{index}/_doc/{id}
