Jusene's Blog

Elasticsearch Big Data Analysis

2017/07/27

Elasticsearch

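Elasticsearch is an open-source search engine built on Apache Lucene. In both the open-source and proprietary worlds, Lucene can be regarded as the most advanced, best-performing, and most fully featured search engine library to date. Lucene is also very complex; Elasticsearch hides that complexity behind a RESTful API and makes search simpler. Elasticsearch is more than a search engine, though: it is also a document-oriented NoSQL store. It can be described as:

  • A distributed real-time document store in which every field can be indexed and searched
  • A distributed real-time analytics and search engine
  • Able to scale to clusters of hundreds of nodes and handle petabytes of structured or unstructured data

All of this is packaged in a single service and exposed through a RESTful API, so it can be used from any programming language.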

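Basic components

  • Index: a container of documents; in other words, an index is a collection of documents that share properties, similar to a table. Index names must be lowercase. By default each index has 5 primary shards, and each primary shard has one replica
  • Type: a logical partition of an index whose meaning is entirely up to the user. An index can define one or more types; generally, a type is a predefined group of documents that have the same fields
  • Document: the atomic unit of indexing and searching in Lucene. A document contains one or more fields (it is a container of fields) and is represented as JSON
  • Mapping: raw content has to be analyzed before it is stored as a document, for example tokenized and filtered for certain words; a mapping defines how that analysis is carried out. Mappings also provide other features, such as sorting on the contents of a field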

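ES cluster components

  • Cluster: a cluster is identified by its cluster name, 'elasticsearch' by default. A node decides which cluster to join by this name, and a node can belong to only one cluster.
  • Node: a host running a single ES instance. A node stores data and takes part in the cluster's indexing and search operations; it is identified by its node name.
  • Shard: an index is split into physical storage units called shards, yet each shard is an independent and complete index. When an index is created, ES splits it into 5 shards by default; the user can choose a different number, but it cannot be changed after creation. There are two kinds of shard: primary shards and replicas. Replicas provide data redundancy and load balancing for queries; the number of replicas per primary shard is configurable and can be changed dynamically

When an ES cluster starts, its nodes look for other nodes of the same cluster on 9300/tcp and communicate with them. Older releases could do this by multicast or unicast; the 5.x release used in this post supports unicast only. All nodes in the cluster elect a master node, which manages the cluster-wide state and decides how shards are distributed. From the user's point of view, every node accepts and answers all kinds of requests.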

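ES cluster status:

  • green: all primary and replica shards are available
  • yellow: all primary shards are available, but not all replica shards are
  • red: not all primary shards are available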
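
The three status rules can be written down as a small function. This is only an illustrative sketch: representing shard availability as lists of booleans is an assumption made here, not how ES stores it, and `cluster_status` is a hypothetical name.

```python
def cluster_status(primaries_available, replicas_available):
    """Return 'green', 'yellow' or 'red' following the three rules above.

    Both arguments are lists of booleans, one entry per shard copy.
    This representation is an assumption of the sketch, not an ES structure.
    """
    if not all(primaries_available):
        return "red"      # at least one primary shard is unavailable
    if not all(replicas_available):
        return "yellow"   # primaries are fine, some replica is unavailable
    return "green"        # every primary and every replica is available


print(cluster_status([True, True], [True, True]))
print(cluster_status([True, True], [True, False]))
print(cluster_status([True, False], [True, True]))
```

With no shards at all, both lists are empty and the function returns green, which matches a freshly started cluster that holds no indices yet.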

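Inverted index

The inverted index is a key concept in Lucene and a major reason ES can retrieve content so quickly. It comes from the practical need to find records by the value of an attribute: each entry in the index holds an attribute value together with the addresses of the records that have that value. Because the attribute value determines where the record is, rather than the record determining the attribute value, it is called an inverted index.

When a piece of text is stored, Lucene first tokenizes it, and each resulting term is treated as an attribute of that text. The text is stored as term-to-document pairs; a search then looks up the terms and follows them to the matching documents. Matches are also weighted: the better a document matches, the higher it ranks, much like the web search engines we use every day. That is the basic idea of the inverted index.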
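
A toy version makes the idea concrete. The sketch below is not Lucene's implementation: it tokenizes on whitespace, lowercases, and ranks documents by the number of matching terms, all of which are simplifying assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many query terms they contain (a crude weight)."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {1: "zhang guoxing", 2: "zhang san", 3: "li si"}
index = build_inverted_index(docs)
print(search(index, "zhang guoxing"))
```

Document 1 contains both query terms and document 2 only one, so document 1 is ranked first; document 3 is never visited because none of its terms were looked up.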

Installing Elasticsearch

Elasticsearch is written in Java, so a Java runtime is required: a JDK, either OpenJDK or Oracle JDK. The Elasticsearch release used here (5.5.0) must run on JDK 1.8.

~]# yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
~]# wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.0.rpm
~]# rpm -ivh elasticsearch-5.5.0.rpm

Elasticsearch is demanding on system resources, so some default system parameters need to be changed:
Problem 1:
java.lang.UnsupportedOperationException: seccomp unavailable: CONFIG_SECCOMP not compiled into kernel, CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER are needed
        at org.elasticsearch.bootstrap.SystemCallFilter.linuxImpl(SystemCallFilter.java:363) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.bootstrap.SystemCallFilter.init(SystemCallFilter.java:638) ~[elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.bootstrap.JNANatives.tryInstallSystemCallFilter(JNANatives.java:215) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.bootstrap.Natives.tryInstallSystemCallFilter(Natives.java:99) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:111) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:194) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:351) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.cli.Command.main(Command.java:88) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) [elasticsearch-5.5.0.jar:5.5.0]

This is only a warning and does not affect normal use; running a newer kernel resolves it.

Problem 2:
max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
~]# vim /etc/security/limits.conf
* soft nofile 65536
* hard nofile 65536

Problem 3:
max number of threads [1024] for user [elasticsearch] is too low, increase to at least [2048]
~]# vim /etc/security/limits.d/90-nproc.conf
*          soft    nproc     2048
*          hard    nproc     2048

Problem 4:
max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]
~]# vim /etc/sysctl.conf
vm.max_map_count=655360
~]# sysctl -p

Problem 5:
system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk
~]# vim /etc/elasticsearch/elasticsearch.yml
bootstrap.system_call_filter: false

elasticsearch.yml configuration:

~]# cat /etc/elasticsearch/elasticsearch.yml
cluster.name: MyES    # cluster name; nodes of the same cluster are recognized by sharing it
node.name: node1      # node name
#node.attr.rack: r1   # additional node attribute
path.data: /data/elastic  # data directory
path.logs: /data/elastic/log  # log directory
network.host: 0.0.0.0    # address to bind to
http.port: 9200       # RESTful API port
transport.tcp.port: 9300  # port for inter-node cluster communication
discovery.zen.ping.unicast.hosts: ["10.211.55.48", "10.211.55.49"]  # hosts pinged by unicast to find live cluster members
discovery.zen.minimum_master_nodes: 2    # electing a master needs (total nodes / 2) + 1 votes, so the node count should be odd; two nodes are used here only as an experiment
gateway.recover_after_nodes: 2    # minimum number of nodes that must be up before the cluster recovers after a restart
action.destructive_requires_name: true  # deleting an index requires its exact name
~]# service elasticsearch start
~]# tail -f /data/elastic/log/MyES.log
[2017-05-16T12:27:12,599][WARN ][o.e.d.z.ZenDiscovery     ] [node2] not enough master nodes discovered during pinging (found [[Candidate{node={node2}{HYFYyQ31QmatkqJXoKsNCw}{ikvdDqarTHm-aWCX_89CcQ}{10.211.55.49}{10.211.55.49:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2017-05-16T12:27:15,600][WARN ][o.e.d.z.ZenDiscovery     ] [node2] not enough master nodes discovered during pinging (found [[Candidate{node={node2}{HYFYyQ31QmatkqJXoKsNCw}{ikvdDqarTHm-aWCX_89CcQ}{10.211.55.49}{10.211.55.49:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2017-05-16T12:27:18,441][WARN ][o.e.n.Node               ] [node2] timed out while waiting for initial discovery state - timeout: 30s
[2017-05-16T12:27:18,460][INFO ][o.e.h.n.Netty4HttpServerTransport] [node2] publish_address {10.211.55.49:9200}, bound_addresses {[::]:9200}
[2017-05-16T12:27:18,460][INFO ][o.e.n.Node               ] [node2] started
[2017-05-16T12:27:18,602][WARN ][o.e.d.z.ZenDiscovery     ] [node2] not enough master nodes discovered during pinging (found [[Candidate{node={node2}{HYFYyQ31QmatkqJXoKsNCw}{ikvdDqarTHm-aWCX_89CcQ}{10.211.55.49}{10.211.55.49:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2017-05-16T12:27:21,604][WARN ][o.e.d.z.ZenDiscovery     ] [node2] not enough master nodes discovered during pinging (found [[Candidate{node={node2}{HYFYyQ31QmatkqJXoKsNCw}{ikvdDqarTHm-aWCX_89CcQ}{10.211.55.49}{10.211.55.49:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2017-05-16T12:27:24,606][WARN ][o.e.d.z.ZenDiscovery     ] [node2] not enough master nodes discovered during pinging (found [[Candidate{node={node2}{HYFYyQ31QmatkqJXoKsNCw}{ikvdDqarTHm-aWCX_89CcQ}{10.211.55.49}{10.211.55.49:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2017-05-16T12:27:35,256][INFO ][o.e.c.s.ClusterService   ] [node2] new_master {node2}{HYFYyQ31QmatkqJXoKsNCw}{ikvdDqarTHm-aWCX_89CcQ}{10.211.55.49}{10.211.55.49:9300}, added {{node1}{HrlO474CRxK0XJv_0w4cvg}{PlIc8KhdRKKTmWTK5bRfhQ}{10.211.55.48}{10.211.55.48:9300},}, reason: zen-disco-elected-as-master ([1] nodes joined)[{node1}{HrlO474CRxK0XJv_0w4cvg}{PlIc8KhdRKKTmWTK5bRfhQ}{10.211.55.48}{10.211.55.48:9300}]
[2017-05-16T12:27:35,373][INFO ][o.e.g.GatewayService     ] [node2] recovered [0] indices into cluster_state
~]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp        0      0 0.0.0.0:22                  0.0.0.0:*                   LISTEN      2504/sshd
tcp        0      0 127.0.0.1:25                0.0.0.0:*                   LISTEN      2654/master
tcp        0      0 :::9200                     :::*                        LISTEN      14495/java
tcp        0      0 :::9300                     :::*                        LISTEN      14495/java
tcp        0      0 :::22                       :::*                        LISTEN      2504/sshd
tcp        0      0 ::1:25                      :::*                        LISTEN      2654/master
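
The quorum rule behind discovery.zen.minimum_master_nodes in the configuration above (total nodes / 2 + 1) is simple integer arithmetic. A minimal sketch, where the function name is only illustrative:

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for n in (1, 2, 3, 5):
    print(n, minimum_master_nodes(n))
```

For the two-node test cluster the quorum is 2, which is why node2 kept logging "not enough master nodes discovered ... but needed [2]" until node1 joined. With three nodes the quorum is still 2, so one node can fail without blocking an election.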

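RESTful API

Four categories of API:

  • (1) Checking the health of the cluster, nodes, and indices, and retrieving their status
  • (2) Managing the cluster, nodes, indices, and metadata
  • (3) Performing CRUD operations
  • (4) Performing advanced operations such as paging and filtering

ES access port: TCP/9200

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'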

First, check the status of the cluster and its nodes:

~]# curl '10.211.55.49:9200/'
{
  "name" : "node2",
  "cluster_name" : "MyES",
  "cluster_uuid" : "VvFCdamHRJWoX8NJIK76Qw",
  "version" : {
    "number" : "5.5.0",
    "build_hash" : "260387d",
    "build_date" : "2017-06-30T23:16:05.735Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.0"
  },
  "tagline" : "You Know, for Search"
}
~]# curl '10.211.55.48:9200/'
{
  "name" : "node1",
  "cluster_name" : "MyES",
  "cluster_uuid" : "VvFCdamHRJWoX8NJIK76Qw",
  "version" : {
    "number" : "5.5.0",
    "build_hash" : "260387d",
    "build_date" : "2017-06-30T23:16:05.735Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.0"
  },
  "tagline" : "You Know, for Search"
}

Both nodes belong to the MyES cluster. As the tagline puts it, "You Know, for Search": this is a cluster built for searching large volumes of data.

~]# curl -XGET "http://10.211.55.48:9200/_cluster/health?pretty"
{
  "cluster_name" : "MyES",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

The cluster is green, which means all primary and replica shards are available.

~]# curl -XGET "http://10.211.55.48:9200/_cluster/state?pretty"
{
  "cluster_name" : "MyES",
  "version" : 2,
  "state_uuid" : "Twy26y7dTtqilrjUEmDalQ",
  "master_node" : "HYFYyQ31QmatkqJXoKsNCw",
  "blocks" : { },
  "nodes" : {
    "HrlO474CRxK0XJv_0w4cvg" : {
      "name" : "node1",
      "ephemeral_id" : "PlIc8KhdRKKTmWTK5bRfhQ",
      "transport_address" : "10.211.55.48:9300",
      "attributes" : { }
    },
    "HYFYyQ31QmatkqJXoKsNCw" : {
      "name" : "node2",
      "ephemeral_id" : "ikvdDqarTHm-aWCX_89CcQ",
      "transport_address" : "10.211.55.49:9300",
      "attributes" : { }
    }
  },
  "metadata" : {
    "cluster_uuid" : "VvFCdamHRJWoX8NJIK76Qw",
    "templates" : { },
    "indices" : { },
    "index-graveyard" : {
      "tombstones" : [ ]
    }
  },
  "routing_table" : {
    "indices" : { }
  },
  "routing_nodes" : {
    "unassigned" : [ ],
    "nodes" : {
      "HrlO474CRxK0XJv_0w4cvg" : [ ],
      "HYFYyQ31QmatkqJXoKsNCw" : [ ]
    }
  }
}
This is the detailed cluster state.

~]# curl "10.211.55.49:9200/_nodes/node1/state?pretty"
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "MyES",
  "nodes" : {
    "HrlO474CRxK0XJv_0w4cvg" : {
      "name" : "node1",
      "transport_address" : "10.211.55.48:9300",
      "host" : "10.211.55.48",
      "ip" : "10.211.55.48",
      "version" : "5.5.0",
      "build_hash" : "260387d",
      "roles" : [
        "master",
        "data",
        "ingest"
      ]
    }
  }
}

If raw JSON is hard to read, ES also provides the _cat API:
~]# curl -XGET "http://10.211.55.48:9200/_cat"
=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/tasks
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/thread_pool/{thread_pools}
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/nodeattrs
/_cat/repositories
/_cat/snapshots/{repository}
/_cat/templates

Called without arguments, _cat lists the endpoints that are available.

~]# curl -XGET "http://10.211.55.48:9200/_cat/nodes?v"
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.211.55.48            5          94   0    0.35    0.31     0.34 mdi       -      node1
10.211.55.49            7          94   0    0.08    0.30     0.34 mdi       *      node2
The master column shows that node2 is the master node of the cluster.

Plugin

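Many ES features are delivered as extensions, and some plugins are close to essential. Commonly used plugins include:

  • marvel
  • bigdesk
  • head
  • kopf

These are all site plugins, which let you manage an ES cluster straight from a web page. Note that site plugins are no longer supported as of ES 5.0, so on the 5.5.0 release used here such tools have to run outside the cluster, for example as standalone web applications.

So how is a plugin installed?

  • Place the plugin directly in the plugins directory: /usr/share/elasticsearch/plugins
  • Install it with elasticsearch-plugin: /usr/share/elasticsearch/bin/elasticsearch-plugin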

CRUD

  • Create

    ~]# curl -XPUT "10.211.55.48:9200/students/class1/1?pretty" -d '
    {
        "name": "jusene",
        "age": 25,
        "class": "English"
    }'
    {
      "_index" : "students",
      "_type" : "class1",
      "_id" : "1",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 2,
        "failed" : 0
      },
      "created" : true
    }
    ~]# curl -XPUT "10.211.55.48:9200/students/class2/1?pretty" -d '
    {
        "name": "jack",
        "age": 24,
        "class": "Math"
    }'
    {
      "_index" : "students",
      "_type" : "class2",
      "_id" : "1",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 2,
        "failed" : 0
      },
      "created" : true
    }
  • Read

    ~]# curl -XGET "10.211.55.48:9200/students/class1/1?pretty"
    {
      "_index" : "students",
      "_type" : "class1",
      "_id" : "1",
      "_version" : 1,
      "found" : true,
      "_source" : {
        "name" : "jusene",
        "age" : 25,
        "class" : "English"
      }
    }
  • Update

    ~]# curl -XPOST "10.211.55.48:9200/students/class1/1/_update?pretty" -d '{"doc": {"age": 26}}'
    {
      "_index" : "students",
      "_type" : "class1",
      "_id" : "1",
      "_version" : 2,
      "result" : "updated",
      "_shards" : {
        "total" : 2,
        "successful" : 2,
        "failed" : 0
      }
    }
    ~]# curl -XGET "10.211.55.48:9200/students/class1/1?pretty"
    {
      "_index" : "students",
      "_type" : "class1",
      "_id" : "1",
      "_version" : 2,
      "found" : true,
      "_source" : {
        "name" : "jusene",
        "age" : 26,
        "class" : "English"
      }
    }

    Note: sending a PUT to the same document replaces the whole document instead of updating individual fields.
  • Delete

    ~]# curl -XDELETE '10.211.55.48:9200/students/class1/1?pretty'
    {
      "found" : true,
      "_index" : "students",
      "_type" : "class1",
      "_id" : "1",
      "_version" : 3,
      "result" : "deleted",
      "_shards" : {
        "total" : 2,
        "successful" : 2,
        "failed" : 0
      }
    }
    ~]# curl -XGET "10.211.55.48:9200/students/class1/1?pretty"
    {
      "_index" : "students",
      "_type" : "class1",
      "_id" : "1",
      "found" : false
    }

    A type or an index is deleted in the same way. Deleting a type was removed in ES 2.0, so on 5.x only the index form is expected to work:

    ~]# curl -XDELETE '10.211.55.48:9200/students/class1?pretty'
    ~]# curl -XDELETE '10.211.55.48:9200/students?pretty'
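
The difference between the partial update shown in the update step and a plain PUT can be modeled with ordinary dictionaries. This is a rough in-memory sketch, not the ES implementation; the store and both function names are hypothetical, and the merge here is shallow.

```python
def put_document(store, doc_id, body):
    """PUT semantics: the stored document is replaced by the new body."""
    store[doc_id] = dict(body)
    return store[doc_id]


def update_document(store, doc_id, partial):
    """_update semantics: the fields under "doc" are merged into the stored document."""
    store[doc_id] = {**store[doc_id], **partial}
    return store[doc_id]


store = {}
put_document(store, "1", {"name": "jusene", "age": 25, "class": "English"})
print(update_document(store, "1", {"age": 26}))  # name and class are kept
print(put_document(store, "1", {"age": 26}))     # only age is left
```

The first call keeps name and class and changes only age, as in the _update response above; the second leaves a document with nothing but age.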

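Querying data

Query API:

  • Query DSL: a JSON-based language for building complex queries
    It supports many kinds of query, such as simple term queries, phrase, range, boolean, and fuzzy queries
  • Multi-index and multi-type queries

Multi-index and multi-type queries

/_search: all indices
/INDEX_NAME/_search: a single index
/INDEX1,INDEX2/_search: several indices
/s*,t*/_search: all indices whose names start with s or t
/students/class1/_search: a single type
/students/class1,class2/_search: several types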
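
These path forms follow one pattern: comma-joined index names, then optional comma-joined type names, then _search. A small sketch of that pattern; the helper name is hypothetical, and refusing types without an index is a simplification made here.

```python
def search_path(indices=None, types=None):
    """Build the /_search path for the given index and type names."""
    parts = []
    if indices:
        parts.append(",".join(indices))
    if types:
        if not indices:
            # Simplification of this sketch: types are only accepted with indices.
            raise ValueError("types can only be given together with indices")
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(search_path())
print(search_path(["s*", "t*"]))
print(search_path(["students"], ["class1", "class2"]))
```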

For every document, ES takes all values of all fields and builds a field named _all. When a query_string query does not name a field, the query runs against _all.

For example:

- GET /_search?q='zgx'
- GET /_search?q='zhang%20guoxing'
- GET /_search?q=name:'zgx'
- GET /_search?q=name:'zhang%20guoxing'
~]# curl "10.211.55.48:9200/_search?q='zgx'&pretty"
{
  "took" : 74,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.17225473,
    "hits" : [
      {
        "_index" : "students",
        "_type" : "class1",
        "_id" : "4",
        "_score" : 0.17225473,
        "_source" : {
          "name" : "zgx",
          "age" : 25,
          "class" : "English"
        }
      },
      {
        "_index" : "students",
        "_type" : "class1",
        "_id" : "6",
        "_score" : 0.17225473,
        "_source" : {
          "name" : "zhang guoxing",
          "age" : 25,
          "desc" : "zgx"
        }
      }
    ]
  }
}

Each hit also carries a _score, the relevance score ES computed for this search.
~]# curl "10.211.55.48:9200/_search?q='zhang%20guoxing'&pretty"
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.3097504,
    "hits" : [
      {
        "_index" : "students",
        "_type" : "class1",
        "_id" : "6",
        "_score" : 1.3097504,
        "_source" : {
          "name" : "zhang guoxing",
          "age" : 25,
          "desc" : "zgx"
        }
      },
      {
        "_index" : "students",
        "_type" : "class1",
        "_id" : "5",
        "_score" : 0.5753642,
        "_source" : {
          "name" : "zhang guoxing",
          "age" : 25,
          "class" : "English"
        }
      }
    ]
  }
}
~]# curl "10.211.55.48:9200/_search?q=name:'zhang%20guoxing'&pretty"
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6548752,
    "hits" : [
      {
        "_index" : "students",
        "_type" : "class1",
        "_id" : "6",
        "_score" : 0.6548752,
        "_source" : {
          "name" : "zhang guoxing",
          "age" : 25,
          "desc" : "zgx"
        }
      },
      {
        "_index" : "students",
        "_type" : "class1",
        "_id" : "5",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "zhang guoxing",
          "age" : 25,
          "class" : "English"
        }
      }
    ]
  }
}
~]# curl "10.211.55.48:9200/_search?q=name:'zgx'&pretty"
{
  "took" : 27,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.80259144,
    "hits" : [
      {
        "_index" : "students",
        "_type" : "class1",
        "_id" : "4",
        "_score" : 0.80259144,
        "_source" : {
          "name" : "zgx",
          "age" : 25,
          "class" : "English"
        }
      }
    ]
  }
}

The first two forms search the _all field.
The last two search a specific field (name).
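
When such a URI search is built in code, the value of q has to be URL-encoded, which is where the %20 in the examples comes from. A minimal sketch using Python's standard library; `uri_search` is a hypothetical helper, and it leaves out the single quotes used in the curl examples.

```python
from urllib.parse import quote


def uri_search(q, field=None, path="/_search"):
    """Return a URI-search path; without a field the query runs against _all."""
    term = f"{field}:{q}" if field else q
    # Keep ':' readable so the field prefix stays visible in the URL.
    return f"{path}?q={quote(term, safe=':')}"


print(uri_search("zhang guoxing"))
print(uri_search("zgx", field="name"))
```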

Data types: string, number, boolean, dates

To see the field types in a mapping:

~]# curl "10.211.55.48:9200/students/_mapping/class1?pretty"
{
  "students" : {
    "mappings" : {
      "class1" : {
        "properties" : {
          "age" : {
            "type" : "long"
          },
          "class" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "desc" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

This shows how each field of the type is mapped.

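Broadly speaking, the data searched in ES falls into two types:
exact values: raw, unprocessed values that are matched exactly at search time, much like a SQL statement
full-text: natural-language text, where the question is how well a document matches the query, that is, how relevant the document is to the user's request; this is where ES is at its strongest

To support full-text search, ES must first analyze the text and build an inverted index. The data in the inverted index is also normalized, for example lowercased. Different analyzers apply different rules, so the search results differ depending on which analyzer processed the text.

This whole process is called analysis. In Lucene terms, analysis is the tokenization and normalization that builds the inverted index. Analysis is carried out by analyzers, and an analyzer is made of three components: character filters, a tokenizer, and token filters. ES ships with these analyzers:

  • Standard analyzer
  • Simple analyzer
  • Whitespace analyzer
  • Language analyzer

Analyzers are used not only when an index is built but also when a query is built. If the analyzer used at index time differs from the one used at query time, the query results will differ as well.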
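
The three-stage structure of an analyzer can be sketched in a few lines. This is a toy pipeline, not one of the built-in ES analyzers: the tag-stripping character filter, the letters-and-digits tokenizer, and the tiny stop-word list are all assumptions chosen for illustration.

```python
import re


def char_filter(text):
    """Character filter: replace HTML-like tags with spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenizer(text):
    """Tokenizer: split on anything that is not an ASCII letter or digit."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def token_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
    """Token filter: lowercase every token and drop stop words."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in stopwords]


def analyze(text):
    """Run the three stages in order: char filter, tokenizer, token filter."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<b>The</b> Quick Brown-Fox"))
```

Because the same `analyze` would be applied to both the stored text and the query text, "English" and "english" end up as the same term; using different pipelines on the two sides is what makes results diverge.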

Query DSL

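Query DSL is sent in the request body.
It comes in two flavors:

  • query dsl: runs full-text queries and judges matches by relevance
    Execution is complex and the results are not cached
  • filter dsl: runs exact-value queries, where each document is simply a yes or a no
    It is fast and the results are cached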

Filter DSL

  • term filter: exactly matches documents that contain the given term

    ~]# curl "10.211.55.48:9200/students/_search?pretty" -d '{
        "query":{
            "term":{
                "name": "jusene"
            }
        }
    }'
    {
      "took" : 4,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : {
        "total" : 2,
        "max_score" : 0.6931472,
        "hits" : [
          {
            "_index" : "students",
            "_type" : "class1",
            "_id" : "1",
            "_score" : 0.6931472,
            "_source" : {
              "name" : "jusene",
              "age" : 25,
              "class" : "English"
            }
          },
          {
            "_index" : "students",
            "_type" : "class1",
            "_id" : "3",
            "_score" : 0.2876821,
            "_source" : {
              "name" : "jusene",
              "age" : 25,
              "class" : "English"
            }
          }
        ]
      }
    }
  • terms filter: exactly matches any of several exact values

    ~]# curl "10.211.55.48:9200/students/_search?pretty" -d '{
        "query":{
            "terms":{
                "name":["jusene","zgx"]
            }
        }
    }'
  • range filter: finds numbers or dates within a given range

    ~]# curl "10.211.55.48:9200/students/_search?pretty" -d '{
        "query":{
            "range":{
                "age":{
                    "lt":25
                }
            }
        }
    }'
  • exists filter: matches documents in which the given field has a value

    ~]# curl "10.211.55.48:9200/students/_search?pretty" -d '{
        "query":{
            "exists":{
                "field": "age"
            }
        }
    }'
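  • boolean filter
    Combines several filter clauses using boolean logic

must: every clause inside it must match, i.e. AND
must_not: none of its clauses may match, i.e. NOT
should: at least one of its clauses matches, i.e. OR

The term clauses are not analyzed, so the class values are matched against the class.keyword sub-field shown in the mapping, which keeps the original "English" and "Math" spelling.

~]# curl "10.211.55.48:9200/students/_search?pretty" -d '{
    "query":{
        "bool":{
            "must":{
                "term":{"age": 24}
            },
            "must_not":{
                "term":{"name":"zgx"}
            },
            "should":[
                {"term":{"class.keyword":"English"}},
                {"term":{"class.keyword":"Math"}}
            ]
        }
    }
}'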
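
The must / must_not / should logic can be tried out on plain dictionaries. The sketch below follows the three one-line descriptions above literally (in particular, at least one should clause has to match when any are given), which is a simplification of how ES treats should clauses; the sample students and both helper names are hypothetical.

```python
def bool_match(doc, must=(), must_not=(), should=()):
    """Evaluate bool clauses, each clause being a predicate on a document."""
    if not all(clause(doc) for clause in must):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    # Simplification taken from the description above: when should clauses
    # are given, at least one of them has to match.
    if should and not any(clause(doc) for clause in should):
        return False
    return True


def term(field, value):
    """Exact-value predicate, standing in for a term filter."""
    return lambda doc: doc.get(field) == value


# Hypothetical sample data, loosely modeled on the documents in this post.
students = [
    {"name": "jack", "age": 24, "class": "Math"},
    {"name": "zgx", "age": 24, "class": "English"},
    {"name": "jusene", "age": 25, "class": "English"},
]
hits = [
    s["name"]
    for s in students
    if bool_match(
        s,
        must=[term("age", 24)],
        must_not=[term("name", "zgx")],
        should=[term("class", "English"), term("class", "Math")],
    )
]
print(hits)
```

Only jack passes: zgx is excluded by must_not, and jusene fails the must clause on age.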

Query DSL

  • match_all: matches all documents; when no query is specified, match_all is the default

    ~]# curl '10.211.55.48:9200/_search?pretty' -d '
    {
    "query": {"match_all": {}}
    }'
  • match: runs full-text or exact-value queries on almost any field

    For a full-text query, the query text is analyzed first:
    ~]# curl "10.211.55.48:9200/_search?pretty" -d '{
    "query":{
        "match":{"name":"zgx"}
    }
    }'

    For an exact-value query, which searches for a precise value, a filter is recommended instead of a query:
    ~]# curl "10.211.55.48:9200/students/_search?pretty" -d '{
    "query":{
        "match":{"name":"zgx"}
    }
    }'
  • multi_match: runs the same query on several fields

    ~]# curl "10.211.55.48:9200/_search?pretty" -d '{
        "query":{
            "multi_match":{
                "query":"zgx",
                "fields":["name","desc"]
            }
        }
    }'
  • bool query: combines several query clauses using boolean logic. Unlike the bool filter, a query clause does not return yes or no but a computed relevance score, so a bool query merges the scores of its clauses

    ~]# curl "10.211.55.48:9200/students/_search?pretty" -d '{
        "query":{
            "bool":{
                "must":{
                    "range":{"age":{"gte": 24}}
                },
                "must_not":{
                    "match":{"name":"zgx"}
                },
                "should":[
                    {"match":{"class":"English"}},
                    {"match":{"class":"Math"}}
                ]
            }
        }
    }'
  • wildcard query: matching with shell-style wildcards

    ~]# curl "10.211.55.48:9200/students/class1/_search?pretty" -d '{
        "query":{
            "wildcard":{
                "name":"z*x"
            }
        }
    }'
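  • regexp query: matching with a regular expression (the pattern is applied to text or keyword fields such as name, not to numeric fields such as age)

    ~]# curl "10.211.55.48:9200/_search?pretty" -d '{
        "query":{
            "regexp":{
                "name":"z.*x"
            }
        }
    }'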
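  • prefix query: matching on a prefix (the prefix is not analyzed, so the class.keyword sub-field is used to match the capital M)

    ~]# curl "10.211.55.48:9200/_search?pretty" -d '{
    "query":{
        "prefix":{
            "class.keyword":"M"
        }
    }
    }'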
  • phrase match: matching a whole phrase

    ~]# curl "10.211.55.48:9200/_search?pretty" -d '{
        "query":{
            "match_phrase":{
                "name": "zhang guoxing"
            }
        }
    }'

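Compound queries

That is, using filter DSL and query DSL together. The filtered query was removed in ES 5.0, so on the 5.5.0 cluster used here the combination is written as a bool query with a filter clause:

~]# curl "10.211.55.48:9200/_search?pretty" -d '{
    "query":{
        "bool":{
            "must":{
                "match":{
                    "name":"jusene"
                }
            },
            "filter":{
                "range":{
                    "age":{"gt":24}
                }
            }
        }
    }
}'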
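
A request body that combines a scoring query with a filter can also be assembled in code and serialized with the standard json module. A small sketch, assuming the bool query with a filter clause (the form ES 5.x accepts); `filtered_search` is a hypothetical helper name.

```python
import json


def filtered_search(query, filter_clause):
    """Wrap a scoring query and a non-scoring filter in one bool query."""
    return {"query": {"bool": {"must": query, "filter": filter_clause}}}


body = filtered_search(
    {"match": {"name": "jusene"}},
    {"range": {"age": {"gt": 24}}},
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d; the match clause contributes to the score while the range clause only narrows the result set.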

Highlighted search

~]# curl "10.211.55.48:9200/_search?pretty" -d '{
    "query":{
        "match":{
            "name":"jusene"
        }
    },
    "highlight":{
        "fields":{
            "name":{}
        }
    }
}'

The response then includes the text from the name field, with <em></em> tags marking the words that matched.
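
What highlighting does to the text can be imitated with a regular expression. This is only a sketch of the idea, not the ES highlighter: it wraps whole-word, case-insensitive matches in <em></em> tags, and the function name and sample sentence are made up.

```python
import re


def highlight(text, term, pre="<em>", post="</em>"):
    """Wrap whole-word, case-insensitive matches of term in pre/post tags."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre}{m.group(0)}{post}", text)


print(highlight("jusene studies English", "jusene"))
```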

Validating DSL syntax

~]# curl "10.211.55.48:9200/students/_validate/query?pretty" -d '<BODY>'

References:
https://es.xiaoleilu.com/index.html
http://www.cnblogs.com/ghj1976/p/5293250.html
