At PipeCandy, apart from a lot of ML and automation, we have a custom reporting and analysis team as well. This team, which isn't deeply technical, needs a way to query records of ~150 million people and ~10 million companies. The requirement was to support partial search, exact search, fuzzy matching, etc. on text data.
We opted for Elasticsearch to do the job, using an Elasticsearch cluster managed by AWS. After getting familiar with the APIs and indexing, the question in front of the team was: "How big a cluster do we need?" I'll walk you through some basic math that we did while selecting infra for Elasticsearch.
One needs to account for:
- Data storage
- Computation
- Network
- Distribution of data
Every Elasticsearch instance has an upper cap on the disk space it can actually use for your data, so this calculation matters:
- AWS allocates 5% of total available volume to the OS for internal file system
- 20% of total space is reserved for background Elasticsearch operations
- One should also consider that during heavy write operations, disk usage will spike until the Lucene segments are merged. In my case, indexing 12GB of data spiked disk usage to a peak of 20GB. As per my discussion with AWS support, one should start worrying when only 25% of total space is left on the node with the least free space.
So if you opt for a cluster with 100GB of space, you'll have no issues with read and write operations as long as your data needs 50GB (100 – 25 – 25) or less. Index a percentage of your total data and extrapolate with the math above to estimate the final space you'll consume on the cluster. Keep in mind that I haven't taken the replication factor into account.
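The arithmetic above can be captured in a few lines. This is a minimal sketch; the default fractions are the rough figures from this discussion, not official AWS constants.

```python
def usable_storage_gb(total_gb, os_reserve=0.05, es_reserve=0.20, free_buffer=0.25):
    """Estimate how much of a volume is safely usable for index data.

    os_reserve:  fraction AWS allocates to the OS file system (~5%).
    es_reserve:  fraction reserved for background Elasticsearch operations (~20%).
    free_buffer: fraction to keep free so merges during heavy writes
                 don't run the node out of disk (~25%).
    """
    return total_gb * (1 - os_reserve - es_reserve - free_buffer)

# For a 100GB cluster, roughly 50GB is safely usable for data
print(usable_storage_gb(100))
```

Remember to divide the result by (1 + number of replicas) if you enable replication, since each replica stores a full copy of its shard.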
The cost of computation determines the CPU resource requirement. It depends on the following factors:
- Analyzer: During both query and indexing operations, the analyzer plays a crucial role. From the Lucene documentation, "An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text." Elasticsearch offers 8 analyzers by default and you can define your own custom analyzer as well. Each analyzer does a different kind of text extraction and hence incurs a different computation cost.
- Reindexing: The frequency of reindexing the data also matters. The more complex your indexing is, the more your infra should shift towards compute-intensive resources.
- Data Volume: The more data you have, the more processing is needed and the higher the cost.
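To make the analyzer point concrete, the sketch below builds an index body that analyzes the same field twice: once with the built-in `standard` analyzer and once with a custom lowercase/ASCII-folding analyzer. The index and field names (`companies`, `name`) are illustrative, not our actual schema; the point is that each extra analyzer on a field adds token-processing work at index time.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Custom analyzer: standard tokens, lowercased and
                # accent-stripped, for accent-insensitive matching.
                "folded_name": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # The field is analyzed twice; each analyzer costs CPU
            # at indexing time and disk for the extra inverted index.
            "name": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "folded": {"type": "text", "analyzer": "folded_name"}
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Sent as the body of a `PUT /companies` request, this would let you query either `name` or `name.folded` depending on how strict the match should be.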
The network requirement depends on the size of the result documents your queries return. Generally, high-compute machines have better network bandwidth as well. One way to lower your network requirement is to request only those fields in the result that you actually intend to use. This can be done via source filtering.
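A minimal sketch of source filtering on a search request (field and index names here are illustrative): the `_source` list trims each hit down to the named fields, so large documents don't travel over the wire when you only need a couple of values.

```python
# Body for a search request, e.g. GET /companies/_search.
query = {
    "_source": ["name", "domain"],           # return only these two fields per hit
    "query": {"match": {"name": "acme"}},    # the search itself is unchanged
}
```

The query executes exactly as before; only the response payload shrinks, which is where the network savings come from.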
The cluster can be scaled vertically or horizontally. Assuming the cost of the two options is the same, prefer a horizontally scaled cluster where possible (more machines over higher-configuration machines) to ensure your data is replicated across multiple physical locations. This provides better fault tolerance and better search throughput, since your queries can be executed on all replicas in parallel. Keep in mind that the OS, the JVM, and indexing have their own overheads, so each machine's configuration should be big enough to accommodate them.
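Replication across nodes is just an index setting. The sketch below shows the relevant knobs; the shard count is illustrative, as the right number depends on your data volume per shard.

```python
# Index settings controlling data distribution across a cluster.
# Elasticsearch never places a shard's primary and its replica on the
# same node, so with at least two nodes each shard exists in two places.
scaling_settings = {
    "settings": {
        "number_of_shards": 5,      # illustrative; splits the index across nodes
        "number_of_replicas": 1,    # one extra copy of each shard
    }
}
```

With `number_of_replicas: 1`, total disk usage doubles (which is why the storage math earlier must be multiplied by the replication factor), but searches can be served from either copy.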
I’ve discussed four points: storage, computation, network, and data distribution. If you think I’ve omitted a point that should be considered for resource calculation, please let us know in the comments. Feedback is welcome.
This post was written by Ashutosh. He’s a part of our kickass tech team.