I'm searching for information on how ElasticSearch would scale with the amount of data in its indexes and am surprised how little I can find on that topic. Maybe some experience from the crowd here can help me.
We are currently using CloudSearch to index ≈ 7 million documents; in CloudSearch this results in 2 instances of type m2.xlarge. We are considering switching to ElasticSearch instead to reduce the cost. But all I find on the scaling of ElasticSearch is that it does scale well, can be distributed over several instances etc.
But what kind of machine (memory, disc) would I need for this kind of data?
How would that change if I increased the amount of data by the factor of 12 (≈ 80 million documents)?
As Javanna said, it depends. Mostly on: (1) rate of indexing; (2) size of documents; (3) rate and latency requirements for searches; and (4) type of searches.
Given that, the best we can do is offer examples.
On our site (news monitoring) we:
Index more than 100 docs per minute. We currently have nearly 50 million documents. I've also heard of ES indexes with hundreds of millions of documents.
Documents are news articles with some metadata, not short but not that large.
Our search latency varies between ~50ms (for normal and rare terms) and up to 800ms for common terms (stopwords; we index them). This variation is largely due to our custom scoring (thanks to Lucene/ES support for customizing it) and to the fact that the dataset (inverted lists) does not fit entirely in memory (OS cache). When a query hits a cached inverted list, it's faster.
We do OR queries with a lot of terms, which are among the hardest. We also do faceting on two single-valued fields, and have some experiments with a date facet (to show the rate of publication over time); a rough sketch of this kind of query appears after this list.
We do all this with four EC2 m1.large instances. We're now planning to move to the just-released ES 0.90 to get all the goodies and performance improvements of Lucene 4.0.
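For illustration, a query of roughly that shape might look like the following Python sketch, using the requests library against an assumed local endpoint. The index and field names are invented, and the syntax shown is the modern aggregations form rather than the facets of the 0.x era:

```python
import requests

ES = "http://localhost:9200"  # assumed endpoint; index and field names below are made up

# A query roughly shaped like the workload described above: a large OR over many
# terms, plus bucketing on two single-valued fields and a date histogram.
terms = ["economy", "election", "energy", "exports"]
query = {
    "query": {
        "bool": {
            "should": [{"term": {"body": t}} for t in terms],
            "minimum_should_match": 1,
        }
    },
    "aggs": {
        "by_source": {"terms": {"field": "source"}},
        "by_language": {"terms": {"field": "language"}},
        "per_week": {"date_histogram": {"field": "published_at",
                                        "calendar_interval": "week"}},
    },
    "size": 10,
}

resp = requests.post(f"{ES}/news/_search", json=query)
resp.raise_for_status()
print(resp.json()["hits"]["total"])
```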
Now leaving examples aside. ElasticSearch is pretty scalable. It is very simple to create an index with N shards and M replicas, and then create X machines with ES. It will distribute all shards and replicas accordingly. You can change the number of replicas anytime you want (for each index).
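As a minimal sketch of that (assuming a local ES endpoint; the index name and counts are only examples), creating an index with explicit shard/replica counts and later changing the replica count looks roughly like this:

```python
import requests

ES = "http://localhost:9200"  # assumed endpoint; index name and counts are only examples

# Create an index with an explicit number of primary shards and replicas.
requests.put(f"{ES}/articles", json={
    "settings": {"number_of_shards": 6, "number_of_replicas": 1}
}).raise_for_status()

# The replica count can be changed at any time on a live index.
requests.put(f"{ES}/articles/_settings", json={
    "index": {"number_of_replicas": 2}
}).raise_for_status()
```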
One downside is that you can't change the number of shards after the index creation. But you can still "overshard" it beforehand to leave room for scaling when needed. Or create a new index with the right number of shards and reindex everything (we do this).
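A hedged sketch of that reindex-into-a-new-index approach (names are placeholders; the _reindex API exists in recent versions, while on older releases like the ones discussed here you would scan/scroll and bulk-index instead):

```python
import requests

ES = "http://localhost:9200"  # assumed endpoint

# Create the replacement index with the shard count you actually want...
requests.put(f"{ES}/articles_v2", json={
    "settings": {"number_of_shards": 12, "number_of_replicas": 1}
}).raise_for_status()

# ...then copy the documents across with the _reindex API.
resp = requests.post(f"{ES}/_reindex", json={
    "source": {"index": "articles"},
    "dest": {"index": "articles_v2"},
})
resp.raise_for_status()
print(resp.json()["total"], "documents copied")
```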
Finally, ElasticSearch (and also Solr) uses the Lucene search library under the hood, which is a very mature and well-known library.
I've actually recently switched from using CloudSearch to a hosted ElasticSearch service at the company I work for. Our specific application has a little over 100 million documents and is growing daily. So far, our experience with ElasticSearch has been absolutely wonderful. Search performance averages ~250ms, even with all the sorting, filtering, and faceting. Indexing documents is also relatively fast, despite the several-megabyte payloads we push over HTTP with the bulk API every couple of hours. Refresh rates seem to be near instant as well.
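For reference, a bulk request is just newline-delimited action/source pairs; a minimal sketch (the endpoint, index name, and documents here are made up) might look like:

```python
import json
import requests

ES = "http://localhost:9200"  # the poster uses a hosted cluster; this endpoint is assumed

def bulk_index(index, docs):
    """Send a batch of (id, document) pairs in one _bulk request."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"  # the _bulk API requires a trailing newline
    resp = requests.post(f"{ES}/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
    return resp.json()

result = bulk_index("products", [("1", {"title": "first"}), ("2", {"title": "second"})])
print(result["errors"], len(result["items"]))
```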
For our ~100M doc / 12GB index, we used 4 shards / 2 replicas (we will bump to 3 replicas if performance degrades) spread across 4 nodes. Prior to setting up the index, our team spent a couple of days researching ElasticSearch cluster deployment/maintenance, and opted to use http://qbox.io to save money and time. We were initially very worried about performance and scaling issues when choosing to host our index on a dedicated cluster like Qbox, but so far the experience has been seriously fantastic.
Since our index lives on a dedicated cluster, we don't have access to nuts-and-bolts node-level configuration settings, so my technical expertise with ES deployment is still pretty limited. That being said, I can't say exactly which performance tweaks account for the performance we've experienced on our index. However, I do know Qbox's cluster uses SSD... so that could definitely have a significant impact.
In any case, ElasticSearch has scaled seamlessly. I highly, highly recommend the switch (even if it's just to save $$, CloudSearch is crazy expensive). Hope this information helps!
CloudSearch recently dropped prices and may now be a cheaper alternative to maintaining your own search infrastructure on EC2 - http://aws.amazon.com/blogs/aws/cloudsearch-price-reduction-plus-features/
I am a junior DevOps engineer and have a very basic question.
My team is currently working on provisioning an AWS OpenSearch cluster. Due to the type of our problem, we require the storage-optimized instances. From the Amazon documentation I found that they recommend a minimum of 3 nodes. The required storage size is known to me; in the OpenSearch Service pricing calculator I found that I can choose either 10 i3.large instances or 5 i3.xlarge ones. I checked the prices; they are the same.
So my question is, when I am faced with such a problem, do I choose the smaller number of bigger instances or the bigger number of smaller instances? I am particularly interested in the reasoning.
Thank you!
Each VM has some overhead for the OS so 10 smaller instances would have less compute and RAM available for ES in total than 5 larger instances. Also, if you just leave the default index settings (5 primary shards, 1 replica) and actively write to only 1 index at a time, you'll effectively have only 5 nodes indexing data for you (and these nodes will have less bandwidth because they are smaller).
So, I would usually recommend running a few larger instances instead of many smaller ones. There are some special cases where it won't be true (like a concurrent-search-heavy cluster) but for those, I'd recommend going with even larger instances in the first place.
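If you want to see how that plays out on a real cluster, the _cat APIs make the shard layout visible. A small sketch, assuming a hypothetical OpenSearch domain endpoint and basic-auth credentials:

```python
import requests

# Hypothetical OpenSearch domain endpoint and basic-auth credentials.
OS = "https://search-my-domain.eu-west-1.es.amazonaws.com"
AUTH = ("admin", "admin-password")

# With the default 5 primaries / 1 replica and one actively written index,
# _cat/shards shows which nodes actually hold a shard and do the indexing work.
print(requests.get(f"{OS}/_cat/shards/my-index?v", auth=AUTH).text)

# _cat/allocation summarises shard count and disk use per node, which makes
# uneven load across many small nodes easy to spot.
print(requests.get(f"{OS}/_cat/allocation?v", auth=AUTH).text)
```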
I'm building a GUI tool for querying logs and looking for a cheaper option. DDB would fetch logs from an S3 bucket using Lambda, whereas ES would get the same logs streamed from CloudWatch. The thing is, my queries are going to be simple, not complex ones, so I'm leaning towards DDB. Any input will be appreciated.
If you have fixed access patterns that can be queried using the partition key and sort key, staying within the limits of querying on a sort key, then DynamoDB is certainly a very good option. There are other factors, like size of the data, and number of records in a partition.
If you can do most of the filtering with the above, but need to further reduce the data based on values outside the key, you can still use DynamoDB, but your mileage may vary on how well it works. It becomes very dependent on data size and filtering complexity.
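As a sketch of that distinction using boto3 (the table name, key schema, and attributes below are hypothetical):

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

# Hypothetical table layout: partition key "log_group", sort key "timestamp".
table = boto3.resource("dynamodb").Table("app-logs")

# Fixed access pattern served entirely by the key schema:
# all logs for one group within a time window.
by_time = table.query(
    KeyConditionExpression=Key("log_group").eq("api-server")
    & Key("timestamp").between("2023-01-01T00:00", "2023-01-01T23:59")
)

# Extra narrowing on a non-key attribute: still a query, but the filter runs
# after the read, so you consume capacity for items scanned, not items returned.
errors_only = table.query(
    KeyConditionExpression=Key("log_group").eq("api-server"),
    FilterExpression=Attr("level").eq("ERROR"),
)
print(len(by_time["Items"]), len(errors_only["Items"]))
```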
There is certainly a point where the complexity of the queries goes beyond what DynamoDB is designed for. At that point ES is often a good answer. Keep in mind that ES isn't a fully managed service, and you pay for the time it's running, regardless of use. I tend to try to avoid these types of services when I can, but if cost is not a significant factor for you, and you feel comfortable managing the ES cluster, then ES is a great option for advanced querying.
Lately, I have been struggling to understand what my network speed (downlink) is between nodes on AWS (in a multi-homed cluster, with computers in different regions).
I see a lot of fluctuation when I measure it, whether with a script I have written (based on this link and SCP) or with iperf.
I believe this is due to network utilization, which changes rapidly (mostly between regions), but I still don't understand from the AWS documentation what performance I am paying for, e.g. a minimum and maximum downlink rate per instance type (AWS instances).
At first, I tried the T2 type, and since I saw it had burstable CPU performance, I thought that maybe the NIC performance was also bursty, so I moved to the M4 type, but I got the same problems with M4.
Is there any way to know my NIC downlink rate based on the instance type and size?
*I have asked a similar question on the AWS forum, but haven't received a response (https://forums.aws.amazon.com/thread.jspa?threadID=296389).
There is no way to get a better indication than your own measurements. AWS does not publish anything about this performance, except for the larger instance types where network performance is explicitly specified (e.g. the m5.12xlarge is rated at 10 Gbps). Most likely network performance does have a burst component for smaller instance types.
There are pages with other people's benchmarks, but you won't find an official answer for any of this.
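If you want a second opinion next to iperf, a crude probe of your own is easy to write. This is a minimal sketch, not a calibrated benchmark; the port and payload size are arbitrary, and running it repeatedly is what reveals any burst behaviour:

```python
import socket
import time

# Bare-bones throughput probe between two instances: run receive() on one box
# and send() on the other, several times, to see how much the rate moves.
CHUNK = 1 << 20            # 1 MiB per send
TOTAL = 256 * CHUNK        # 256 MiB test payload
PORT = 5201

def receive():
    srv = socket.socket()
    srv.bind(("0.0.0.0", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    received, start = 0, time.time()
    while received < TOTAL:
        data = conn.recv(CHUNK)
        if not data:
            break
        received += len(data)
    elapsed = time.time() - start
    print(f"{received / elapsed / 1e6:.1f} MB/s over {elapsed:.1f}s")

def send(host):
    cli = socket.create_connection((host, PORT))
    payload = b"x" * CHUNK
    for _ in range(TOTAL // CHUNK):
        cli.sendall(payload)
    cli.close()
```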
We are looking into setting up an ELK stack on Amazon, but we don't really know what machines we need to handle it smoothly.
Now I know it will become obvious if it doesn't run smoothly, but we still hoped to get an idea of what we would need for our situation.
So we have 4 servers that generate log files in a custom format: about ~45 million lines of logs each day, producing about 4 files of 600 MB each (gzipped), so around ~24 GB of logs each day.
Now we are looking into the ELK stack and would like the Kibana dashboards to display real-time data, so I was thinking of shipping the logs to Logstash via syslog.
4 Servers -> Rsyslog (on those 4 servers) -> Logstash (AWS) -> ElasticSearch (AWS) -> Kibana (AWS)
So now we need to figure out what kind of hardware we would need in AWS to handle this.
I read somewhere that ElasticSearch needs at minimum 3 master nodes and 2 data nodes.
So that would total 5 servers, plus 1 server for Kibana and 1 for Logstash?
So I would need a total of 7 servers to get started, but that seems a bit like overkill?
I would like to keep my data for 1 month (31 days at most), so I would have around ~1.4 TB of raw log data in ElasticSearch (~45 GB x 31).
But since I don't really have a clue on what the best setup would be, any hints/tips/info would be welcome.
Also, a system or tool that would handle this for me (node failures, etc.) could be useful.
Thanks in advance,
darkownage
Here's how I've architected my cloud clusters:
3 Master nodes - these nodes coordinate the cluster, and keeping three of them helps tolerate the failure of one. Ideally they will be spread across availability zones. They can be fairly small and ideally should not receive any requests - their only job is to maintain the cluster. In this case, set discovery.zen.minimum_master_nodes = 2 to maintain quorum. These IPs, and these IPs only, are what you should provide to all cluster nodes in discovery.zen.ping.unicast.hosts.
Indexes: you should probably take advantage of daily indexes - see https://www.elastic.co/guide/en/elasticsearch/guide/current/time-based.html. This will make more sense below, but it is also beneficial if you begin to scale up - you can increase the shard count over time without re-indexing.
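A small sketch of that daily-index pattern, assuming a local endpoint and placeholder names (the template API shown is the composable one from ES 7.8+; older versions used _template):

```python
import datetime
import requests

ES = "http://localhost:9200"  # assumed endpoint; names below are placeholders

# An index template so every daily index is created with the same settings.
requests.put(f"{ES}/_index_template/logs", json={
    "index_patterns": ["logs-*"],
    "template": {"settings": {"number_of_shards": 5, "number_of_replicas": 1}},
}).raise_for_status()

# Writers simply target today's index; it is created on the first write.
today = datetime.date.today().strftime("logs-%Y.%m.%d")
requests.post(f"{ES}/{today}/_doc", json={"message": "hello"}).raise_for_status()
```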
Data Nodes: Depending on your scale or performance requirements there are a few options - i2.xlarge or d2.xlarge will work well, but r3.2xlarge is also a good option. Make sure to keep the JVM heap under 30 GB. Keep the data paths on ephemeral drives local to the instances - EBS is not really ideal for this use case, but depending on your requirements it might be sufficient. Be sure you have multiple data nodes so the replica shards can be split across availability zones. As your data requirements increase, just scale these up.
Hot/Warm: Depending on the use case - it sometimes is beneficial to split your data nodes into Hot/Warm (Fast SSD/Slow HDD). This is mainly due to the fact that all writes are in realtime, and the majority of reads are on the past few hours. If you can move yesterday's data onto cheaper, slower drives, it helps out quite a bit. This is a little more involved but you can read more at https://www.elastic.co/blog/hot-warm-architecture. This requires adding some tags and using curator on a nightly basis but is generally worth it due to the cost savings of moving largely unsearched data off of more expensive SSD.
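Under the hood, that hot/warm move is just shard allocation filtering on a node attribute (the blog above uses box_type); curator automates it on a schedule, but the raw API call looks roughly like this (the index name and attribute values are illustrative):

```python
import requests

ES = "http://localhost:9200"  # assumed endpoint

# New (hot) indexes live on nodes tagged with a box_type attribute of "hot";
# a nightly job (curator, or a script like this) relocates yesterday's index
# to the warm tier by flipping the allocation filter.
requests.put(f"{ES}/logs-2016.05.01/_settings", json={
    "index.routing.allocation.require.box_type": "warm"
}).raise_for_status()

# Optionally reduce replicas on warm data to save disk on the slower tier.
requests.put(f"{ES}/logs-2016.05.01/_settings", json={
    "index": {"number_of_replicas": 1}
}).raise_for_status()
```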
In production, I run ~20 r3.2xlarge for the hot tier and 4-5 d2.xlarge for the warm tier with a replication factor of 2 - this allows roughly a terabyte per day of ingest and a decent amount of retention. We scale hot for volume and warm for retention.
Overall - good luck! It's a fun stack to build and operate once everything is running smoothly.
PS - Depending on the time/resources you have available, you can run the managed ElasticSearch service on AWS, but the last time I looked it was ~60% more expensive than running it on your own instances, and YMMV.
Seems like you need something to get started with the ELK stack on AWS.
Have you tried these CloudFormation scripts? They ease the installation process and help you set up your environment in one go.
ELK-Cookbook - CloudFormation Script
ELK-Stack with Google OAuth in Private VPC
Comment below if this doesn't solve your problem.
I have a very specific access pattern for my data and wonder about the expected MapReduce performance of Cassandra. These are my requirements:
There will be 10 million documents (e.g. JSON, a couple of KB each) in my database.
There will be occasional updates of the documents.
Users want to create results from the whole dataset that require processing of each document.
Users will want to do this in a semi-interactive fashion, trying out the effects of changes they make to the processing of each document. Waiting a couple of minutes for the result is OK.
Users would like to be able to spend money (scaling up or out) to increase interactive speed if there is a desire to increase processing speed.
There will not be large user numbers; processing needs to be done a couple of times per hour, maybe.
Durability is not a primary concern, as the data is replicated from a source system anyway.
This sounds like a good job for Cassandra and MapReduce, but given that MapReduce is intended to run as a background job rather than semi-interactively, I wonder what performance I can expect when using Cassandra.
My other options are plain MySQL with the documents stored as CLOBs, or partitioned Redis.
Can anyone provide clues on how to estimate the achievable speed?