Python Elasticsearch Client¶
Official low-level client for Elasticsearch. Its goal is to provide common ground for all Elasticsearch-related code in Python; because of this it tries to be opinion-free and very extendable.
For a higher-level client library with a more limited scope, have a look at elasticsearch-dsl - it is a more pythonic library sitting on top of elasticsearch-py.
Compatibility¶
The library is compatible with all Elasticsearch versions since 0.90.x but you have to use a matching major version:
- For Elasticsearch 2.0 and later, use the major version 2 (2.x.y) of the library.
- For Elasticsearch 1.0 and later, use the major version 1 (1.x.y) of the library.
- For Elasticsearch 0.90.x, use a version from the 0.4.x releases of the library.
The recommended way to set your requirements in your setup.py or requirements.txt is:
# Elasticsearch 2.x
elasticsearch>=2.0.0,<3.0.0
# Elasticsearch 1.x
elasticsearch>=1.0.0,<2.0.0
# Elasticsearch 0.90.x
elasticsearch<1.0.0
Development is happening on the master and 1.x branches, respectively.
Example Usage¶
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()

doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime.now(),
}
res = es.index(index="test-index", doc_type='tweet', id=1, body=doc)
print(res['created'])

res = es.get(index="test-index", doc_type='tweet', id=1)
print(res['_source'])

es.indices.refresh(index="test-index")

res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total'])
for hit in res['hits']['hits']:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])
Features¶
This client was designed as a very thin wrapper around Elasticsearch's REST API to allow for maximum flexibility. This means that there are no opinions in this client; it also means that some of the APIs are a little cumbersome to use from Python. We have created some Helpers to help with this issue as well as a higher-level library (elasticsearch-dsl) on top of this one to provide a more convenient way of working with Elasticsearch.
Persistent Connections¶
elasticsearch-py uses persistent connections inside of individual connection pools (one per configured or sniffed node). Out of the box you can choose to use the http, thrift or an experimental memcached protocol to communicate with the elasticsearch nodes. See Transport classes for more information.
The transport layer will create an instance of the selected connection class per node and keep track of the health of individual nodes - if a node becomes unresponsive (throwing exceptions while connecting to it) it's put on a timeout by the ConnectionPool class and only returned to circulation after the timeout is over (or when no live nodes are left). By default nodes are randomized before being passed into the pool and a round-robin strategy is used for load balancing.
You can customize this behavior by passing parameters to the Connection Layer API (all keyword arguments to the Elasticsearch class will be passed through). If what you want to accomplish is not supported, you should be able to create a subclass of the relevant component and pass it in as a parameter to be used instead of the default implementation.
Note
Since we use persistent connections throughout the client, it doesn't tolerate fork very well. If your application calls for multiple processes, make sure you create a fresh client after the call to fork.
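For example, a minimal sketch of the safe pattern using the standard multiprocessing module (the index name and document are illustrative assumptions):

from multiprocessing import Process

from elasticsearch import Elasticsearch

def worker():
    # create the client inside the child process, after the fork
    es = Elasticsearch()
    es.index(index='test-index', doc_type='tweet', body={'text': 'hello'})

if __name__ == '__main__':
    processes = [Process(target=worker) for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()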
Automatic Retries¶
If a connection to a node fails due to connection issues (raises ConnectionError) it is considered to be in a faulty state. It will be placed on hold for dead_timeout seconds and the request will be retried on another node. If a connection fails multiple times in a row the timeout will get progressively larger to avoid hitting a node that's, by all indications, down. If no live connection is available, the connection that has the smallest timeout will be used.
By default retries are not triggered by a timeout (ConnectionTimeout); set retry_on_timeout to True to also retry on timeouts.
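For example, a minimal sketch of tuning retry behavior when constructing the client (the host names and values are illustrative; max_retries is the Transport parameter that caps how many nodes are tried before giving up):

from elasticsearch import Elasticsearch

# also retry on ConnectionTimeout, and try up to 5 nodes before giving up
es = Elasticsearch(
    ['seed1', 'seed2'],
    retry_on_timeout=True,
    max_retries=5,
)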
Sniffing¶
The client can be configured to inspect the cluster state to get a list of nodes upon startup, periodically and/or on failure. See Transport parameters for details.
Some example configurations:
from elasticsearch import Elasticsearch
# by default we don't sniff, ever
es = Elasticsearch()
# you can specify to sniff on startup to inspect the cluster and load
# balance across all nodes
es = Elasticsearch(["seed1", "seed2"], sniff_on_start=True)
# you can also sniff periodically and/or after failure:
es = Elasticsearch(["seed1", "seed2"], sniff_on_start=True, sniff_on_connection_fail=True, sniffer_timeout=60)
SSL and Authentication¶
You can configure the client to use SSL for connecting to your elasticsearch cluster, including certificate verification and http auth:
from elasticsearch import Elasticsearch
# you can use RFC-1738 to specify the url
es = Elasticsearch(['https://user:secret@localhost:443'])
# ... or specify common parameters as kwargs
# use certifi for CA certificates
import certifi
es = Elasticsearch(
    ['localhost', 'otherhost'],
    http_auth=('user', 'secret'),
    port=443,
    use_ssl=True,
    verify_certs=True,
    ca_certs=certifi.where(),
)
Warning
By default SSL certificates won't be verified; pass in verify_certs=True to make sure your certificates will get verified. The client doesn't ship with any CA certificates; the easiest way to obtain the common set is by using the certifi package (as shown above).
See the Urllib3HttpConnection class for a detailed description of the options.
Logging¶
elasticsearch-py uses the standard logging library from python to define two loggers: elasticsearch and elasticsearch.trace. elasticsearch is used by the client to log standard activity, depending on the log level. elasticsearch.trace can be used to log requests to the server in the form of curl commands using pretty-printed json that can then be executed from the command line. If the trace logger has not been configured already, it is set to propagate=False, so it needs to be activated separately.
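A minimal sketch of enabling both loggers (the handler targets and levels are illustrative assumptions):

import logging

# log standard client activity to the console
logging.basicConfig(level=logging.INFO)

# the trace logger doesn't propagate, so give it its own handler
tracer = logging.getLogger('elasticsearch.trace')
tracer.setLevel(logging.DEBUG)
tracer.addHandler(logging.FileHandler('/tmp/es_trace.log'))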
Environment considerations¶
When using the client there are several limitations of your environment that could come into play.
When using an http load balancer you cannot use the Sniffing functionality - the cluster would supply the client with IP addresses to directly connect to the cluster, circumventing the load balancer. Depending on your configuration this might be something you don't want, or it might break completely.
In some environments (notably on Google App Engine) your http requests might be restricted so that GET requests won't accept a body. In that case use the send_get_body_as parameter of Transport to send all bodies via POST:
from elasticsearch import Elasticsearch
es = Elasticsearch(send_get_body_as='POST')
Running with AWS Elasticsearch service¶
If you want to use this client with IAM based authentication on AWS you can use the requests-aws4auth package:
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
host = 'YOURHOST.us-east-1.es.amazonaws.com'
awsauth = AWS4Auth(YOUR_ACCESS_KEY, YOUR_SECRET_KEY, REGION, 'es')
es = Elasticsearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)
print(es.info())
Contents¶
API Documentation¶
All the API calls map the raw REST api as closely as possible, including the distinction between required and optional arguments to the calls. This means that the code makes a distinction between positional and keyword arguments; we, however, recommend that people use keyword arguments for all calls for consistency and safety.
Note
for compatibility with the Python ecosystem we use from_ instead of from and doc_type instead of type as parameter names.
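For instance, a short sketch of paginating a search with from_ (the index name is an illustrative assumption):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# 'from' is a reserved word in Python, so the client exposes it as 'from_'
res = es.search(index='test-index', from_=10, size=10,
                body={'query': {'match_all': {}}})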
Global options¶
Some parameters are added by the client itself and can be used in all API calls.
Ignore¶
An API call is considered successful (and will return a response) if elasticsearch returns a 2XX response. Otherwise an instance of TransportError (or a more specific subclass) will be raised. You can see other exception and error states in Exceptions. If you do not wish an exception to be raised you can always pass in an ignore parameter with either a single status code that should be ignored or a list of them:
from elasticsearch import Elasticsearch
es = Elasticsearch()
# ignore 400 caused by IndexAlreadyExistsException when creating an index
es.indices.create(index='test-index', ignore=400)
# ignore 404 and 400
es.indices.delete(index='test-index', ignore=[400, 404])
Timeout¶
A global timeout can be set when constructing the client (see the Connection class's timeout parameter) or on a per-request basis using request_timeout (a float value in seconds) as part of any API call; this value will get passed to the perform_request method of the connection class:
# only wait for 1 second, regardless of the client's default
es.cluster.health(wait_for_status='yellow', request_timeout=1)
Note
Some API calls also accept a timeout parameter that is passed to the Elasticsearch server. This timeout is internal and doesn't guarantee that the request will end in the specified time.
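A sketch contrasting the two parameters (the values are illustrative):

# timeout is sent to the Elasticsearch server; request_timeout caps how long
# the client itself will wait for the response
es.cluster.health(wait_for_status='yellow', timeout='5s', request_timeout=10)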
Elasticsearch¶
Indices¶
Cluster¶
Nodes¶
Cat¶
Snapshot¶
Exceptions¶
Connection Layer API¶
All of the classes responsible for handling the connection to the Elasticsearch cluster. The default subclasses used can be overridden by passing parameters to the Elasticsearch class. All of the arguments to the client will be passed on to Transport, ConnectionPool and Connection.
For example, if you wanted to use your own implementation of the ConnectionSelector class you can just pass in the selector_class parameter.
Note
ConnectionPool and related options (like selector_class) will only be used if more than one connection is defined, either directly or via the Sniffing mechanism.
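As an illustration, a minimal sketch of a custom selector (the class name and selection logic are hypothetical):

from elasticsearch import Elasticsearch, ConnectionSelector

class FirstNodeSelector(ConnectionSelector):
    # hypothetical selector that always prefers the first live connection
    def select(self, connections):
        return connections[0]

# only takes effect since more than one connection is defined
es = Elasticsearch(['node1', 'node2'], selector_class=FirstNodeSelector)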
Transport¶
Connection Pool¶
Connection Selector¶
Urllib3HttpConnection (default connection_class)¶
Transport classes¶
List of transport classes that can be used; simply import your choice and pass it to the constructor of Elasticsearch as connection_class. Note that RequestsHttpConnection requires requests to be installed.
For example, to use the requests-based connection just import it and use it:
from elasticsearch import Elasticsearch, RequestsHttpConnection
es = Elasticsearch(connection_class=RequestsHttpConnection)
Connection¶
Urllib3HttpConnection¶
RequestsHttpConnection¶
Helpers¶
Collection of simple helper functions that abstract some specifics of the raw API.
Bulk helpers¶
There are several helpers for the bulk API since its requirements for specific formatting and other considerations can make it cumbersome if used directly.
All bulk helpers accept an instance of the Elasticsearch class and an iterable actions (any iterable, can also be a generator, which is ideal in most cases since it will allow you to index large datasets without the need of loading them into memory).
The items in the actions iterable should be the documents we wish to index in one of several formats. The most common one is the same as returned by search(), for example:
{
    '_index': 'index-name',
    '_type': 'document',
    '_id': 42,
    '_parent': 5,
    '_ttl': '1d',
    '_source': {
        "title": "Hello World!",
        "body": "..."
    }
}
Alternatively, if _source is not present, it will pop all metadata fields from the doc and use the rest as the document data:
{
    "_id": 42,
    "_parent": 5,
    "title": "Hello World!",
    "body": "..."
}
The bulk() api accepts index, create, delete, and update actions. Use the _op_type field to specify an action (_op_type defaults to index):
{
    '_op_type': 'delete',
    '_index': 'index-name',
    '_type': 'document',
    '_id': 42,
}
{
    '_op_type': 'update',
    '_index': 'index-name',
    '_type': 'document',
    '_id': 42,
    'doc': {'question': 'The life, universe and everything.'}
}
Note
When reading raw json strings from a file, you can also pass them in directly (without decoding to dicts first). In that case, however, you lose the ability to specify anything (index, type, even id) on a per-record basis; all documents will just be sent to elasticsearch to be indexed as-is.
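Putting this together, a minimal sketch of feeding helpers.bulk() from a generator (the index name and documents are illustrative):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()

def generate_actions():
    # a generator lets us index a large dataset without
    # loading it all into memory first
    for i in range(10000):
        yield {
            '_index': 'test-index',
            '_type': 'document',
            '_id': i,
            '_source': {'title': 'Document %d' % i},
        }

success, errors = bulk(es, generate_actions())
print('indexed %d documents' % success)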
Scan¶
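A sketch of the scan() helper, which lazily iterates over all documents matching a query using the scroll API (the query and index name are illustrative):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()

for hit in scan(es, query={'query': {'match_all': {}}}, index='test-index'):
    print(hit['_source'])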
Reindex¶
Changelog¶
2.2.0 (2016-01-05)¶
- adding additional options for ssl - ssl_assert_hostname and ssl_assert_fingerprint - to the default connection class
- fix sniffing
2.1.0 (2015-10-19)¶
- move multiprocessing import inside parallel bulk for Google App Engine
2.0.0 (2015-10-14)¶
- Elasticsearch 2.0 compatibility release
1.8.0 (2015-10-14)¶
- removed thrift and memcached connections, if you wish to continue using those, extract the classes and use them separately.
- added a new, parallel version of the bulk helper using thread pools
- In helpers, removed bulk_index as an alias for bulk. Use bulk instead.
1.7.0 (2015-09-21)¶
- elasticsearch 2.0 compatibility
- thrift now deprecated, to be removed in future version
- make sure urllib3 always uses keep-alive
1.6.0 (2015-06-10)¶
- Add indices.flush_synced API
- helpers.reindex now supports reindexing parent/child documents
1.5.0 (2015-05-18)¶
- Add support for query_cache parameter when searching
- helpers have been made more secure by changing defaults to raise an exception on errors
- removed deprecated options replication and the deprecated benchmark api
- Added AddonClient class to allow for extending the client from outside
1.4.0 (2015-02-11)¶
- Using insecure SSL configuration (verify_cert=False) raises a warning
- reindex accepts a query parameter
- enable reindex helper to accept any kwargs for underlying bulk and scan calls
- when doing an initial sniff (via sniff_on_start) ignore special sniff timeout
- option to treat TransportError as normal failure in bulk helpers
- fixed an issue with sniffing when only a single host was passed in
1.3.0 (2014-12-31)¶
- Timeout now doesn't trigger a retry by default (can be overridden by setting retry_on_timeout=True)
- Introduced new parameter retry_on_status (defaulting to (503, 504, )) which controls which http status codes should lead to a retry.
- Implemented url parsing according to RFC-1738
- Added support for proper SSL certificate handling
- Required parameters are now checked for non-empty values
- ConnectionPool now checks if any connections were defined
- DummyConnectionPool introduced when no load balancing is needed (only one connection defined)
- Fixed a race condition in ConnectionPool
1.2.0 (2014-08-03)¶
Compatibility with newest (1.3) Elasticsearch APIs.
- Filter out master-only nodes when sniffing
- Improved docs and error messages
1.1.1 (2014-07-04)¶
Bugfix release fixing escaping issues with request_timeout.
1.1.0 (2014-07-02)¶
Compatibility with newest Elasticsearch APIs.
- Test helpers - ElasticsearchTestCase and get_test_client for use in your tests
- Python 3.2 compatibility
- Use simplejson if installed instead of the stdlib json library
- Introducing a global request_timeout parameter for per-call timeout
- Bug fixes
1.0.0 (2014-02-11)¶
Elasticsearch 1.0 compatibility. See 0.4.X releases (and 0.4 branch) for code compatible with 0.90 elasticsearch.
- major breaking change - compatible with 1.0 elasticsearch releases only!
- Add an option to change the timeout used for sniff requests (sniff_timeout).
- empty responses from the server are now returned as empty strings instead of None
- get_alias now has name as another optional parameter due to issue #4539 in the es repo. Note that the order of params has changed so if you are not using keyword arguments this is a breaking change.
0.4.4 (2013-12-23)¶
- helpers.bulk_index renamed to helpers.bulk (alias put in place for backwards compatibility, to be removed in future versions)
- Added helpers.streaming_bulk to consume an iterator and yield results per operation
- helpers.bulk and helpers.streaming_bulk are no longer limited to just index operations.
- unicode body (for indices.analyze for example) is now handled correctly
- changed perform_request on Connection classes to return headers as well. This is a backwards incompatible change for people who have developed their own connection class.
- changed deserialization mechanics. Users who provided their own serializer that didn't extend JSONSerializer need to specify a mimetype class attribute.
- minor bug fixes
0.4.3 (2013-10-22)¶
- Fixes to helpers.bulk_index, better error handling
- More benevolent hosts argument parsing for Elasticsearch
- requests no longer required (nor recommended) for install
0.4.2 (2013-10-08)¶
- ignore param accepted by all APIs
- Fixes to helpers.bulk_index
0.4.1 (2013-09-24)¶
Initial release.
License¶
Copyright 2013 Elasticsearch
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.