At PipeCandy, apart from a lot of ML and automation, we have a custom reporting and analysis team as well. This team, not being tech savvy, needs a way to query data from records of ~150 million people and ~10 million companies. The requirement was to perform partial search, exact search, fuzzy match, etc. on text data. We opted for Elasticsearch to do the job. We are using Elasticsearch cluster managed by AWS. After getting familiar with APIs and indexing, the question in front of the team was: "How big of a cluster do we need?". I'll walk you through some basic math that we did while selecting infra for Elasticsearch. One needs to account for:
Every instance in Elasticsearch will have an upper cap on the disk that you can use with Elasticsearch. So this calculation matters
So if you opt for a cluster with 100GB of space, you'll be having no issues with reading and writing operations till you require 50GB (100 - 25 - 25) or less of space for your data. Index a %age of your total data and use the above math to get the final space that you'll be consuming on the cluster. Keep in mind that I haven't taken into account the replication factor.
The cost of computation will determine CPU resource requirement. It will be determined by following points
The network requirement will depend on the size of result documents that you are getting from the query. Generally, high compute machines will have better network bandwidth as well. One way to lower your network requirement will be to ask for only those fields in the result that you actually want to use. This can be done via source filtering.
The size of the cluster can be scaled vertically or horizontally. Assuming cost is same for the two scaling options, if possible, try for a horizontally scaled cluster (more machines over higher configuration machines) to ensure your data is replicated in multiple physical locations. This will provide better fault tolerance and gives better search throughput since your queries can be executed on all replicas in parallel. One should keep in mind that OS, JVM, and indexing will also have their own overheads. So, the machine configuration should be big enough to take care of these. I've discussed four points - storage, computation, network and data distribution. If you think I've omitted some point which should be considered for resource calculation, please let us know in the comments. Feedback is welcome. This post was written by Ashutosh. He's a part of our kickass tech team.
Every week we send out one deeply researched essay that captures the eCommerce industry and its evolution, right to your inbox. 50%+ open rates for a year now.
Trusted by 22,804 DTC & eCommerce industry insiders.