About Apache Solr
Apache Solr is an open-source search server built on top of Lucene that provides all of Lucene’s search capabilities through HTTP requests. It has been around for almost a decade and a half, making it a mature product with a broad user community.
Solr offers powerful features such as distributed full-text search, faceting, near real-time indexing, high availability, NoSQL features, integrations with big data tools such as Hadoop, and the ability to handle rich-text documents such as Word and PDF.
About Elasticsearch
Elasticsearch is also an open-source search engine built on top of Apache Lucene. It is the core of the ELK Stack, which also includes Logstash and Kibana. It extends Lucene’s powerful indexing and search functionalities using RESTful APIs, and it achieves the distribution of data across multiple servers using the concepts of indexes and shards. Elasticsearch is completely based on JSON and is suitable for time series and NoSQL data.
This tool is much younger than Solr, but it has gained a lot of popularity because of its feature-rich use cases. Some of its primary features include distributed full-text search, high availability, a powerful query DSL, multitenancy, Geo Search, and horizontal scaling.
Relative Popularity of Apache Solr and Elasticsearch
According to DB-Engines, which ranks database management systems and search engines according to their popularity, Elasticsearch is ranked number one, and Solr is ranked number three.
Solr gained popularity during the first ten years of its existence, but Elasticsearch has been the most popular search engine since 2016.
Figure 1: DB-Engines Ranking—Elasticsearch vs. Solr Popularity (Source: DB-Engines)
Installation and Configuration of Apache Solr and Elasticsearch
Java is the primary prerequisite for installing both of these engines. The default Elasticsearch configuration requires 1GB of heap memory. This can be changed in the jvm.options file inside the config directory.
By default, Solr allocates 512MB of heap memory to each instance. This setting can be changed in either the solr script file or the solr.in.cmd file. Both files are located inside the bin directory of the Solr installation.
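As a rough illustration of the two heap settings, the sketch below renders the lines one would typically edit. The -Xms/-Xmx flags are standard JVM options, and SOLR_HEAP is the variable read by Solr's include scripts; the helper names and the use of solr.in.sh on non-Windows systems are assumptions added here, not something the text above specifies.

```python
from __future__ import annotations


def elasticsearch_jvm_options(heap: str) -> list[str]:
    """Lines for config/jvm.options; min and max heap are set to the same size."""
    return [f"-Xms{heap}", f"-Xmx{heap}"]


def solr_heap_setting(heap: str, windows: bool = False) -> str:
    """Line for bin/solr.in.cmd (Windows) or the solr.in.sh include file (assumed)."""
    if windows:
        return f"set SOLR_HEAP={heap}"
    return f'SOLR_HEAP="{heap}"'


if __name__ == "__main__":
    # Hypothetical sizes matching the defaults mentioned in the text.
    print("\n".join(elasticsearch_jvm_options("1g")))
    print(solr_heap_setting("512m", windows=True))
```

Setting the minimum and maximum heap to the same value avoids heap resizing at runtime, which is the usual recommendation for search servers.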
Elasticsearch is easy to install and configure, but it’s quite a bit heavier than Solr. The latest version of Elasticsearch (version 7.7.1, released in June 2020) has a compressed size of 314.5MB, whereas Solr (version 8.5.2, released in May 2020) ships at 191.7MB.
Configuration files in Elasticsearch are written in YAML format. Solr uses XML-based configuration files.
Indexing and Searching in Apache Solr and Elasticsearch
Both Solr and Elasticsearch write indexes in Lucene. But, since differences exist in sharding and replication (among other features), there are also differences in their files and architectures. Additionally, Elasticsearch has native DSL support while Solr has a robust Standard Query Parser that aligns to Lucene syntax.
Data Sources
Both tools support a wide range of data sources.
Solr uses request handlers to ingest data from XML files, CSV files, databases, Microsoft Word documents, and PDFs. With native support for the Apache Tika library, it supports extraction and indexing from over one thousand file types. Solr also ships with a simple command-line tool called post. To ingest CSV-based data into a collection named testcollection, for example, you just need to use the following command:
bin/post -c testcollection *.csv
Elasticsearch, on the other hand, is completely JSON-based. It supports data ingestion from multiple sources using the Beats family (lightweight data shippers available in the ELK Stack) and Logstash.
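To make the JSON-based ingestion concrete, here is a minimal sketch that converts CSV text into the newline-delimited JSON body accepted by Elasticsearch's _bulk endpoint, where each document is preceded by an action line. The function name and the sample data are hypothetical, and the sketch only builds the payload; it does not send it anywhere.

```python
import csv
import io
import json


def csv_to_bulk_body(csv_text: str, index: str) -> str:
    """Build an NDJSON body for the _bulk endpoint from CSV text."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Action line: index the following document into `index`.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself (all CSV values stay strings).
        lines.append(json.dumps(row))
    # The bulk format requires every line, including the last, to end with \n.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sample = "id,title\n1,Solr in Action\n2,Elasticsearch Basics\n"
    print(csv_to_bulk_body(sample, "testcollection"), end="")
```

The resulting body would be sent with a POST request to the _bulk endpoint using the application/x-ndjson content type. In practice, Beats or Logstash would handle this step for log and metric data.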
Use Cases for Apache Solr and Elasticsearch
While both products are document-oriented search engines, Solr has always been more focused on enterprise text search with advanced information retrieval (IR). Consequently, it’s better suited to search applications that use massive amounts of static data. Solr fits better into enterprise applications that already implement big data ecosystem tools, such as Hadoop and Spark. Additionally, Solr stands out in handling rich-text documents such as Word and PDF files. To compete with Elasticsearch, recent Solr releases have offered new features such as the Parallel SQL Interface and streaming expressions.
Elasticsearch is focused more on scaling, data analytics, and processing time series data to obtain meaningful insights and patterns. Its large-scale log analytics performance makes it quite popular. Elasticsearch is more suited to modern web applications where data is carried in and out in JSON format. Elasticsearch has also put a lot of development effort into making its tool more resilient, to the point where it can serve as a primary data store.
Searching in Apache Solr and Elasticsearch
Both Solr and Elasticsearch support NRT (near real-time) searches and take advantage of all of Lucene’s search capabilities. Both now support a JSON-based Query DSL, and each has additional search-related feature sets, described below.
Earlier Solr versions had to rely on the Standard Query Parser, but Solr now also supports a JSON-based Query DSL. While Solr’s Standard Query Parser allows users to create a variety of structured queries, the chances of making syntax errors while writing these queries are much higher. Nevertheless, you can write very complex search queries in Solr that are unavailable in Elasticsearch. Solr includes a sample search UI, called Velocity Search, that offers powerful features such as searching, faceting, highlighting, autocomplete, and Geo Search.
Elasticsearch’s DSL is native. Its aggregation framework is powerful: aggregation queries are part of the search APIs and benefit from better caching. The more recent releases of the tool also manage their memory footprint better.
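The difference between the two Solr query styles can be sketched as follows. The first helper builds a request URL for the Standard Query Parser, where the whole query is one Lucene-syntax string; the second builds a body for Solr's JSON Request API. The base URL, collection name, and field names are assumptions for illustration only.

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed local default address


def standard_parser_url(collection: str, q: str, rows: int = 10) -> str:
    """URL for the /select handler with a Lucene-syntax query string."""
    return f"{SOLR_BASE}/{collection}/select?{urlencode({'q': q, 'rows': rows})}"


def json_request_body(q: str, filters=(), limit: int = 10) -> str:
    """Body for the JSON Request API (POSTed to the /query handler)."""
    return json.dumps({"query": q, "filter": list(filters), "limit": limit})


if __name__ == "__main__":
    # Hypothetical fields: title and year.
    print(standard_parser_url("testcollection", "title:lucene AND year:[2015 TO *]"))
    print(json_request_body("title:lucene", filters=["year:[2015 TO *]"]))
```

In the first form, a single misplaced bracket or quote invalidates the whole query string, which is where most syntax errors come from. The JSON form separates the main query from its filters and paging parameters.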
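As a small illustration of the native DSL, the sketch below builds a search body that combines a full-text match query with a terms aggregation. The field names are hypothetical, and the body would be sent to an index's _search endpoint; nothing here contacts a server.

```python
import json


def search_with_terms_agg(text_field: str, text: str, facet_field: str, size: int = 10) -> dict:
    """Search body with a match query and a terms aggregation on another field."""
    return {
        "size": size,
        "query": {"match": {text_field: text}},
        "aggs": {
            # The aggregation name is arbitrary; it keys the buckets in the response.
            f"by_{facet_field}": {"terms": {"field": facet_field}},
        },
    }


if __name__ == "__main__":
    # Hypothetical log index fields: message (text) and level (keyword).
    print(json.dumps(search_with_terms_agg("message", "timeout", "level"), indent=2))
```

The query and the aggregation travel in one request, so the hits and the per-value counts are computed over the same matching set.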
Indexing in Apache Solr and Elasticsearch
Because Elasticsearch is schemaless, it is easy to index unstructured data and dynamic fields without defining the schema of the index in advance. Earlier Solr versions required a defined schema before indexing data. However, Solr now supports a schemaless mode.
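When you do want an explicit schema rather than relying on schemaless behavior, both engines accept field definitions as JSON. The sketch below builds an add-field command for Solr's Schema API and a mappings body for creating an Elasticsearch index. The field names are hypothetical, and the field types shown assume Solr's default configset and standard Elasticsearch types.

```python
import json


def solr_add_field(name: str, field_type: str, stored: bool = True) -> dict:
    """Command body for Solr's Schema API (POSTed to a collection's /schema path)."""
    return {"add-field": {"name": name, "type": field_type, "stored": stored}}


def es_create_index_body(properties: dict) -> dict:
    """Body for creating an Elasticsearch index with explicit field mappings."""
    return {
        "mappings": {
            "properties": {name: {"type": t} for name, t in properties.items()}
        }
    }


if __name__ == "__main__":
    print(json.dumps(solr_add_field("title", "text_general")))
    print(json.dumps(es_create_index_body({"title": "text", "year": "integer"})))
```

Defining fields up front avoids surprises from automatic type guessing, such as a numeric-looking ID being indexed as a number.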
Both search engines support custom analyzers, synonym-based indexing, stemming, and various tokenization options.
Scalability and Distribution in Apache Solr and Elasticsearch
Search engines have to quickly process large amounts of data and complex queries on sets of hundreds of millions of records. Sometimes these queries can be so resource-intensive that they can take the whole system down—especially if you haven’t planned for the load in advance and can’t scale quickly. For this reason, a search engine must be scalable and fault-tolerant in nature.
Clusters, Sharding, and Rebalancing in Apache Solr and Elasticsearch
Both Elasticsearch and SolrCloud provide support for sharding. Because Elasticsearch was designed with horizontal scaling in mind, it has better support for scaling and cluster management. Its disadvantage is that the number of primary shards in an index cannot be increased in place once the index has been created, although you can use the shrink API to reduce the number of shards. SolrCloud supports further splitting of an existing shard but not the shrinking of shards.
Elasticsearch’s built-in Zen Discovery module handles cluster coordination. SolrCloud relies on Apache ZooKeeper, which runs as an additional service.
In case of a shard or node failure, Elasticsearch rebalances the cluster itself and rarely requires manual intervention. In SolrCloud, rebalancing is more complex and harder to manage.
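One way to see why the shard count is hard to change after the fact is to look at hash-based routing, where a document's shard is derived from a hash of its routing value modulo the number of primary shards. The sketch below is a simplified model of that idea, not Elasticsearch's actual implementation: CRC32 stands in for the real hash function, and the document IDs are made up.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Toy routing: hash the document ID and take it modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def moved_fraction(doc_ids, old_shards: int, new_shards: int) -> float:
    """Fraction of documents whose shard changes when the shard count changes."""
    moved = sum(
        1 for d in doc_ids if shard_for(d, old_shards) != shard_for(d, new_shards)
    )
    return moved / len(doc_ids)


if __name__ == "__main__":
    ids = [f"doc-{i}" for i in range(10_000)]  # hypothetical document IDs
    fraction = moved_fraction(ids, 5, 6)
    print(f"{fraction:.0%} of documents would map to a different shard")
```

Under this model, going from five to six shards sends most documents to a different shard, so existing data could no longer be found where the routing formula expects it. Changing the shard count therefore means rewriting the index rather than adjusting a setting.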