As analytics becomes an increasing focus across all industries, many organizations are turning to a group of open-source products from Elastic to search, analyze, and visualize their data in real time. The Elastic Stack refers to Elasticsearch, Logstash, Kibana, and Beats, with each product providing unique capabilities as described below:
Elasticsearch takes care of storage and provides RESTful search and analytics capabilities.
Logstash handles the server-side data processing pipeline used to ingest data, perform data transformations, and index data (usually from logs).
Kibana allows users to visualize Elasticsearch data.
Beats sends data from edge machines to Elasticsearch and Logstash.
The Elastic Stack’s distributed architecture allows the platform to scale linearly, so as your data volume and velocity increase, the Elastic Stack provides a path for scaling your system. Different use cases require different node structures, and many Elastic clusters will grow from a small number of nodes to a much larger one as data volume increases. The ability to confidently scale cloud systems is part of what makes them so appealing, so it is important to follow best practices. This blog post will focus on key tips for effectively scaling the Elastic Stack by imagining cluster sizes of three nodes, 10 nodes, and hundreds of nodes.
Tips to Scale Smoothly
When setting up a smaller environment (approximately three nodes), it is important to consider the following:
Use the default installation:
Elastic offers two distributions: a pure open-source build and the default distribution. Use the default distribution
for all the core components of the stack: Elasticsearch, Kibana, Logstash, and Beats. The default distribution is still free, since open source is important to the people at Elastic, and each component ships as a single binary. Using the default installation will save you substantial effort.
Remember the free features: You get several useful features with the free version of Elastic, such as Kibana Spaces, which lets you organize your dashboards and other saved objects. None of these extras are included in the open-source-only option.
Use a base template structure and explicit mappings: Even though Elasticsearch does a good job of guessing what your data looks like, you will want to define mappings explicitly to avoid mapping mistakes.
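As a minimal sketch of what an explicit mapping looks like (the template name, index pattern, and field names here are illustrative, and the exact template API varies slightly by Elasticsearch version), a legacy index template in Kibana Dev Tools syntax might be:

```json
PUT _template/app_logs
{
  "index_patterns": ["app-logs-*"],
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "message":    { "type": "text" },
      "status":     { "type": "keyword" },
      "bytes":      { "type": "long" }
    }
  }
}
```

Defining `status` as `keyword` rather than letting Elasticsearch guess, for example, avoids the common mistake of a field being dynamically mapped as analyzed `text` when you really want exact-match aggregations on it.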
Add the dynamic setting in your development environment: This adds newly detected fields to the mapping automatically. You enable it by setting "dynamic": true in the Elasticsearch template configuration. Do this in dev, but do not do this in production. With the dynamic setting set to true, when Elasticsearch receives a new field, that field is automatically added to the mapping. This is good in development, so the developer updating the Python code does not have to worry about that wider part of the process. For production, you want "dynamic": "strict", so that unexpected new fields throw an exception, just as mismatched data types do.
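To make the dev/production contrast concrete, here is a hedged sketch of a production template (template name and fields are again hypothetical); in development you would set "dynamic" to true instead:

```json
PUT _template/app_logs_prod
{
  "index_patterns": ["app-logs-*"],
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp": { "type": "date" },
      "message":    { "type": "text" }
    }
  }
}
```

With this template in place, indexing a document that contains any field other than `@timestamp` and `message` is rejected with a mapping exception instead of silently widening the mapping.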
When setting up a slightly larger environment (approximately 10 nodes), it is important to utilize the paid version to take advantage of the different node types available. Dedicated master nodes should be considered once a cluster reaches somewhere between five and 10 nodes. If you only need three nodes, adding three dedicated master nodes doubles your cost and is most likely financially unreasonable. But if you need about 10 nodes and want to break responsibilities out, then a few new things should be considered:
You should move to dedicated master nodes at this point.
There are several different types of nodes in Elastic: master, ingest, coordinator, machine learning, and data (hot/warm/cold).
Master: These nodes are the brain of the operation and know where and how everything is connected. You will typically have about three of these.
Data: These nodes are where data shards live and where queries are executed.
Coordinator: The easiest way to think of these is as data nodes that hold no data. They contribute extra resources, such as CPU, RAM, and network capacity, for routing and aggregating requests. You can set things up so that clients connect to coordinator nodes instead of directly to data or master nodes.
Ingest: A relatively new option that can do a lot of what Logstash does, but lives inside the cluster. Ingest nodes are not a replacement for Logstash, but you can use them to transform data on the way in. For example, if you are shipping syslog with Beats, an ingest node pipeline can break each syslog line into multiple fields.
Machine Learning: These nodes run machine learning jobs and handle ML API requests. Elastic ML runs in the cluster but is written in C++ rather than Java.
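As a sketch of how roles are assigned (using the `node.roles` syntax introduced in Elasticsearch 7.9; older versions use separate `node.master`/`node.data`/`node.ingest` booleans), each node's own elasticsearch.yml would carry something like one of these:

```yaml
# Dedicated master node:
node.roles: [ master ]
---
# Coordinating-only node: an empty role list means it holds no data
# and is not master-eligible, so it only routes and aggregates requests.
node.roles: [ ]
---
# Dedicated ingest node:
node.roles: [ ingest ]
```

The `---` separators mark these as three independent snippets, one per node; a single node's file would contain only one `node.roles` line.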
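For the syslog example above, a hedged sketch of an ingest pipeline (the pipeline name is hypothetical, and the grok pattern shown is a common syslog pattern you would adapt to your own log format) might look like:

```json
PUT _ingest/pipeline/parse_syslog
{
  "description": "Break raw syslog lines into structured fields",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:host} %{DATA:program}(?:\\[%{POSINT:pid}\\])?: %{GREEDYDATA:log_message}"
        ]
      }
    }
  ]
}
```

Beats can then be pointed at this pipeline so each incoming syslog line arrives in Elasticsearch with `timestamp`, `host`, `program`, `pid`, and `log_message` as separate fields.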
15 to 50 gigabytes per shard is the rule of thumb. This comes from a lot of empirical evidence and feedback from Elastic users over time.
Shards are how you increase parallelism in your cluster, but too many shards slow the system down and bloat the cluster state the master must track. Don’t split an 80GB index into 10 shards of roughly 8GB each; use two shards in that instance, so each shard is about 40GB and falls within the recommended range.
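The arithmetic behind that rule of thumb can be sketched in a few lines of Python (the function name and 40GB default target are my own illustrative choices, not anything Elastic ships):

```python
import math

def recommended_shard_count(index_size_gb, target_shard_gb=40):
    """Pick a primary shard count so each shard lands near the
    15-50GB rule of thumb (defaulting to roughly 40GB per shard)."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

# An 80GB index gets 2 shards of ~40GB, not 10 shards of ~8GB.
print(recommended_shard_count(80))
```
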
Typically, the more data you have, the larger your shards will be, but this is decided on a use-case basis.
If you are running hundreds of nodes, remember that everyone’s hardware and network are a bit different, so something more bespoke may be required:
Specific cluster sizing:
How to find shard size?
Elastic folks call it “finding the unit of work,” and it takes advantage of how linearly Elastic scales: if one node can handle 100GB, then two nodes can handle 200GB.
Be experimental: create an index with one shard, then two shards, then three shards, and so on. Keep measuring throughput and you’ll see where it levels off or returns diminish. Set the shard count to the last value before you hit diminishing returns.
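The stopping rule in that experiment can be sketched in Python (the function, the 10% gain threshold, and the sample numbers are all illustrative assumptions; the throughput figures would come from your own benchmark runs):

```python
def pick_shard_count(throughputs, min_gain=0.10):
    """Given measured throughput for 1, 2, 3, ... shards
    (throughputs[0] is the 1-shard result), return the shard count
    just before marginal gains drop below min_gain (10% by default)."""
    best = 1
    for i in range(1, len(throughputs)):
        gain = (throughputs[i] - throughputs[i - 1]) / throughputs[i - 1]
        if gain < min_gain:
            break  # diminishing returns: stop at the previous count
        best = i + 1
    return best

# Throughput nearly flattens between 3 and 4 shards, so stop at 3.
print(pick_shard_count([100, 190, 260, 270]))
```
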
Where To Turn If You Need More Help
When addressing small issues or one-off questions, Elastic’s thorough documentation library or training courses can be helpful. For more dedicated help, please feel free to reach out to me at email@example.com or contact Analytics and Insights at Credera to scale to the next level.