Elasticsearch is a search engine designed to handle natural language and to calculate the relevancy of matched results, providing more than just boolean yes-or-no matching. It is particularly well suited to parsing large amounts of text and finding relevant documents for a given query.
Parts of WordPress.com use Elasticsearch to augment and speed up searches under the hood. In addition, some JSON API endpoints allow limited querying of our Elasticsearch index of posts. This doc details how the internal Elasticsearch index is set up and how to query for results through the WordPress.com API.
Posts and pages are indexed into Elasticsearch as documents of the “post” type. Individual fields within an Elasticsearch document can be referenced using just the field name (e.g. “tag.name”) or with the document type prepended.
Elasticsearch uses only UTF-8 character encoding, so all documents are converted to UTF-8 before being indexed. In addition, all URL fields have their protocol part (e.g. http/https) removed to aid prefix matching.
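As a rough illustration of why dropping the protocol helps prefix matching, the sketch below strips the scheme from two URLs so that they share a common prefix. The helper name `strip_protocol` and the example URLs are hypothetical; this is not the code WordPress.com uses at index time.

```python
import re

# Hypothetical helper: matches a leading URL scheme such as "http://" or "https://".
_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def strip_protocol(url: str) -> str:
    """Return the URL without its leading protocol part, if it has one."""
    return _SCHEME.sub("", url, count=1)


# Example values (assumptions, not taken from the real index).
urls = ["http://example.com/blog/", "https://example.com/blog/post-1"]
stripped = [strip_protocol(u) for u in urls]

# With the protocol gone, both values start with the same prefix,
# so a single prefix match on "example.com/blog/" covers http and https alike.
shared_prefix = "example.com/blog/"
matches = [s for s in stripped if s.startswith(shared_prefix)]
```

If you query a URL field, it is therefore likely safest to strip the protocol from your own search value in the same way before sending it.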
Please see the Elasticsearch Core Data Type documentation for details on each native data type. Textual data is handled in one of three ways, depending on the field:
- analyzed: the text has been broken up into individual terms based on the language analyzer being applied to this document (generally based on whitespace, but Chinese and Japanese for example are broken up into words using word segmentation algorithms)
- not analyzed: the entire string is treated as a single term
- lowercased: the entire string is treated as a single term, but it is lowercased and character folding/normalization is performed, so “My Resumé” becomes the single term “my resume”
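The three treatments above can be approximated in a few lines of Python. This is a simplified sketch for intuition only: the function names are hypothetical, the "analyzed" case uses a plain whitespace split as a stand-in for a real language analyzer, and the folding step uses Unicode decomposition, which may differ from the normalization Elasticsearch actually applies.

```python
import unicodedata


def analyzed(text: str) -> list[str]:
    """Stand-in for a language analyzer: split on whitespace only."""
    return text.split()


def not_analyzed(text: str) -> list[str]:
    """The entire string is kept as a single, unmodified term."""
    return [text]


def lowercased(text: str) -> list[str]:
    """Single term, lowercased, with accents folded away."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop combining marks (e.g. the acute accent left over from "é").
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return [folded]


title = "My Resumé"
terms_analyzed = analyzed(title)          # two terms
terms_not_analyzed = not_analyzed(title)  # one term, unchanged
terms_lowercased = lowercased(title)      # one term: "my resume"
```

The practical consequence is that an exact-match lookup against a not analyzed field must reproduce the original string exactly, while a lowercased field tolerates differences in case and accents.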
Allowed Queries & Filters
Please see the Elasticsearch Query DSL Guide for how to build queries. Note, however, that some types of Elasticsearch queries are too resource intensive for us to run on the WordPress.com infrastructure. The following queries and filters are allowed:
- and filter
- exists filter
- geo bounding box filter
- geo distance filter
- geo distance range filter
- geohash cell filter
- limit filter
- missing filter
- not filter
- or filter
- type filter
- filtered query
- bool query
- boosting query
- dis max query
- function score query
- more like this / mlt query
- more like this field / mlt field query
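To make the list more concrete, here is a sketch of a request body that combines only entries from it: a filtered query wrapping a more like this query, restricted by an and filter made of a type filter and an exists filter. The field name "content" and the search text are assumptions for illustration, the parameter names follow the older Elasticsearch Query DSL in which these filters exist, and how the body is passed to a WordPress.com API endpoint is not covered here.

```python
import json

# Sketch of a query body using only query and filter types from the list above.
query_body = {
    "query": {
        "filtered": {
            # more like this query: find documents similar to some text.
            "query": {
                "more_like_this": {
                    "fields": ["content"],  # hypothetical field name
                    "like_text": "elasticsearch relevancy",  # example text
                    "min_term_freq": 1,
                }
            },
            # and filter: every inner filter must match.
            "filter": {
                "and": [
                    {"type": {"value": "post"}},         # only "post" documents
                    {"exists": {"field": "tag.name"}},   # must have at least one tag
                ]
            },
        }
    }
}

# Serialize to JSON, the form in which a query body would be transmitted.
payload = json.dumps(query_body, indent=2)
```

Keeping the restrictive conditions in the filter rather than in the query means they only narrow the result set and do not influence relevancy scoring.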
Running faceted queries requires a custom index, so it is currently available only to WordPress.com VIP clients with the VIP Search Add On. The custom indices for VIP clients also include additional fields in the post documents that contain the post meta.