General Information
Abstract
The search service is responsible for metadata and content extraction. The retrieved data is indexed and made searchable.
The search service runs out of the box with the shipped default basic configuration.
No further configuration is needed.
Note that as of now, the search service cannot be scaled. Consider using dedicated hardware for this service in case more resources are needed.
Table of Contents
- Search backends
- Query language
- Content analysis / Extraction
- Manually Trigger Re-Indexing a Space
- Metrics
Search backends
To store and query the indexed data, a search backend is needed.
As of now, the search service supports the following backends:
- bleve (default)
- opensearch
Bleve
Bleve is a lightweight, embedded full-text search engine written in Go and is the default search backend. It is straightforward to set up and requires no additional services to run.
The following optional settings can be set:
- SEARCH_ENGINE_BLEVE_DATA_PATH=/path/to/bleve/index (default: $OC_BASE_DATA_PATH/search): Path to store the bleve index.
OpenSearch
OpenSearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. Additionally, it provides advanced features like clustering, replication, and sharding.
To enable OpenSearch as a backend, the following settings must be set:
- SEARCH_ENGINE_TYPE=open-search
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_ADDRESSES=http://YOUR-OPENSEARCH.URL:9200 (comma-separated list of OpenSearch addresses)
Additionally, the following optional settings can be set:
- SEARCH_ENGINE_OPEN_SEARCH_RESOURCE_INDEX_NAME=val (default: opencloud-resource): Name of the OpenSearch index.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_USERNAME=val: Username for HTTP Basic Authentication.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_PASSWORD=val: Password for HTTP Basic Authentication.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_HEADER=val: HTTP headers to include in requests.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_CA_CERT=val: CA certificate for TLS connections.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_RETRY_ON_STATUS=val: HTTP status codes that trigger a retry.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_DISABLE_RETRY=val: Disable retries on errors.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_ENABLE_RETRY_ON_TIMEOUT=val: Enable retries on timeout.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_MAX_RETRIES=val: Maximum number of retries for requests.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_COMPRESS_REQUEST_BODY=val: Compress request bodies.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_DISCOVER_NODES_ON_START=val: Discover nodes on service start.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_DISCOVER_NODES_INTERVAL=val: Interval for discovering nodes.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_ENABLE_METRICS=val: Enable metrics collection.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_ENABLE_DEBUG_LOGGER=val: Enable debug logging.
- SEARCH_ENGINE_OPEN_SEARCH_CLIENT_INSECURE=val: Skip TLS certificate verification.
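As a rough illustration of how the required settings and a few optional ones fit together, the following shell sketch exports them before the service is started. The host names and credentials are placeholders invented for this example, not values from this document.

```shell
# Required: select OpenSearch and point the client at one or more nodes.
# The two host names are hypothetical placeholders.
export SEARCH_ENGINE_TYPE=open-search
export SEARCH_ENGINE_OPEN_SEARCH_CLIENT_ADDRESSES="http://opensearch-1.example.internal:9200,http://opensearch-2.example.internal:9200"

# Optional: HTTP Basic Authentication (placeholder credentials).
export SEARCH_ENGINE_OPEN_SEARCH_CLIENT_USERNAME="search-user"
export SEARCH_ENGINE_OPEN_SEARCH_CLIENT_PASSWORD="change-me"

# Optional: index name; this value is the documented default.
export SEARCH_ENGINE_OPEN_SEARCH_RESOURCE_INDEX_NAME="opencloud-resource"
```

Settings that are left unset keep their defaults, so a minimal setup only needs the first two variables.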
Query language
By default, KQL is used as the query language. For an overview of how to write KQL queries, please read the Microsoft documentation.
Not all parts of KQL are supported. The following parts are not implemented yet:
- Synonym operators
- Inclusion and exclusion operators
- Dynamic ranking operator
- ONEAR operator
- NEAR operator
- Date intervals
In this ADR you can read why KQL was chosen.
Content analysis / Extraction
The search service supports the following content extraction methods:
- Basic: enabled by default, only provides metadata extraction.
- Tika: needs to be installed and configured separately, provides content extraction for many file types.
Note that the file content has to be transferred to the search service internally for content extraction, which is resource-intensive and can lead to delays with larger documents.
Basic
This extractor is the simplest one and just uses the resource information provided by OpenCloud. It does not do any further content analysis.
Tika
Compared to the basic extractor, this extractor can analyze and extract data from more advanced file types like PDF, DOCX, PPTX, etc. It requires Apache Tika. Read the Getting Started with Apache Tika guide on how to install and run Tika, or use a ready-to-run Tika container. See the Tika container usage document for a quickstart.
As soon as Tika is installed and configured, the search service needs to be told to use it.
The following settings must be set:
- SEARCH_EXTRACTOR_TYPE=tika
- SEARCH_EXTRACTOR_TIKA_TIKA_URL=http://YOUR-TIKA.URL
Additionally, the following optional settings can be set:
- SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS=true (default: true): Ignore stop words like I, you, the during content extraction.
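The Tika settings above can be sketched as a small shell fragment. It assumes that Tika runs as a container from the apache/tika image on the same host and listens on port 9998, the image's usual default. The host name and port are assumptions for this example and need to match the actual Tika deployment.

```shell
# Assumption: a Tika server was started separately, for example with
#   docker run -d -p 9998:9998 apache/tika:latest
# and is reachable on localhost:9998.

# Tell the search service to use Tika for content extraction.
export SEARCH_EXTRACTOR_TYPE=tika
export SEARCH_EXTRACTOR_TIKA_TIKA_URL="http://localhost:9998"

# Optional; true is the documented default.
export SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS=true
```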
Manually Trigger Re-Indexing a Space
The service includes a command-line interface to trigger re-indexing a space:
opencloud search index --space $SPACE_ID
It can also be used to re-index all spaces:
opencloud search index --all-spaces
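When only a handful of spaces need re-indexing, the per-space command can be wrapped in a loop. The following is a minimal sketch, assuming the opencloud binary is on the PATH. The function name reindex_spaces is made up for this example and is not part of the CLI.

```shell
# Re-index several spaces one after another and report failures.
# Assumes the opencloud binary is on PATH; space IDs are passed as arguments.
reindex_spaces() {
  failed=0
  for space_id in "$@"; do
    echo "re-indexing space ${space_id}"
    if ! opencloud search index --space "${space_id}"; then
      echo "re-indexing failed for space ${space_id}" >&2
      failed=$((failed + 1))
    fi
  done
  # Return a non-zero status if any space could not be re-indexed.
  [ "${failed}" -eq 0 ]
}
```

It could be called as `reindex_spaces "$SPACE_ID_A" "$SPACE_ID_B"`, where both variables are hypothetical and hold space IDs. The loop continues after a failure so that one broken space does not block the others.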
Metrics
The search service exposes the following Prometheus metrics at <debug_endpoint>/metrics (as configured using the SEARCH_DEBUG_ADDR env var):
| Metric Name | Type | Description | Labels |
|---|---|---|---|
| opencloud_search_build_info | Gauge | Build information | version |
| opencloud_search_events_outstanding_acks | Gauge | Number of outstanding acks for events | |
| opencloud_search_events_unprocessed | Gauge | Number of unprocessed events | |
| opencloud_search_events_redelivered | Gauge | Number of redelivered events | |
| opencloud_search_search_duration_seconds | Histogram | Duration of search operations in seconds | status |
| opencloud_search_index_duration_seconds | Histogram | Duration of indexing operations in seconds | status |
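For a quick manual check, the metrics endpoint can be queried with curl and filtered down to the search service's own metrics. This is a minimal sketch, assuming the debug endpoint is reachable over plain HTTP and that curl is installed. The function name search_metrics is hypothetical.

```shell
# Print only the search service's own metrics from the debug endpoint.
# $1 is the debug address as host:port (the value of SEARCH_DEBUG_ADDR).
search_metrics() {
  curl -s "http://$1/metrics" | grep '^opencloud_search_'
}
```

A call such as `search_metrics "$SEARCH_DEBUG_ADDR"` prints the sample lines for the metrics listed in the table and skips the comment lines as well as metrics from other namespaces.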