Good setup on AWS for ELK - amazon-web-services

We are looking into getting an ELK stack setup on Amazon but we don't really know what we need of machines to handle it smoothly.
Now I know that it will become obvious if it doesn't run smooth but still we hoped to get an idea on what we would need for our situation.
So we 4 servers that generate log files in a custom format. About ~45 million lines of logs each day, generating about 4 files of 600mb (gzipped) so around ~24GB of logs each day.
Now we are looking into the ELK stack and would like the dashboards of Kibana display realtime data, so I was thinking of logging using syslog to logstash.
4 Servers -> Rsyslog (on those 4 servers) -> Logstash (AWS) -> ElasticSearch (AWS) -> Kibana (AWS)
So now we need to figure out what kind of hardware we would need in AWS to handle this.
I read somewhere 3 masters for ElasticSearch and 2 datanodes at minimum.
So that would total 5 servers + 1 server for Kibana and 1 for Logstash?
So I would need a total of 7 servers to get started, but that kinda seems overkill?
I would like to keep my data for 1 month, so 31 days at most, so I would have around ~1.4TB of raw logdata in Elastic Search (~45GB x 31)
But since I don't really have a clue on what the best setup would be, any hints/tips/info would be welcome.
Also a system or tool that would handle this for me (node failure, etc) could be useful.
Thanks in advance,
darkownage

Here's how I've architected my cloud clusters:
3 Master nodes - these nodes coordinate the cluster and keeping three of them helps tolerate failure. Ideally these will spread across availability zones. These can be fairly small and ideally do not receive any requests - their only job is to maintain the cluster. In this case set discovery.zen.minimum_master_nodes = 2 to maintain quorum. These IPs and these IPs only are what you should provide to all cluster nodes in discovery.zen.ping.unicast.hosts
Indexes: you should probably take advantage of daily indexes - see https://www.elastic.co/guide/en/elasticsearch/guide/current/time-based.html This will make more sense below but will also be beneficial if you begin to scale up - you can increase shard count over time without re-indexing.
Data Nodes: Depending on your scale or performance requirements there are a few options - i2.xlarge or d2.xlarge will work well but r3.2xlarge are also a good option. Make sure to keep the JVM heap <30GB. Keep the data paths on ephemeral drives local to the instances - EBS is not really so ideal for this use case but depending on your requirements might be sufficient. Be sure you have multiple data nodes so the replica shards can split across availability zones. As your data requirements increase, just scale these up.
Hot/Warm: Depending on the use case - it sometimes is beneficial to split your data nodes into Hot/Warm (Fast SSD/Slow HDD). This is mainly due to the fact that all writes are in realtime, and the majority of reads are on the past few hours. If you can move yesterday's data onto cheaper, slower drives, it helps out quite a bit. This is a little more involved but you can read more at https://www.elastic.co/blog/hot-warm-architecture. This requires adding some tags and using curator on a nightly basis but is generally worth it due to the cost savings of moving largely unsearched data off of more expensive SSD.
In production, I run ~20 r3.2xlarge for the hot tier and 4-5 d2.xlarge for the warm tier with a replication factor of 2 - this allows ~TB per day of ingest and a decent amount of retention. We scale Hot for volume and Warm for retention.
Overall - good luck! It's a fun stack to build and operate once everything is running smoothly.
PS - Depending on the time/resources you have available, you can run the managed elasticsearch service on AWS, but the last time i looked its ~60% more expensive than running it on your own instances, and YMMV.

Seems like you need something to start with ELK Stack on AWS
Did u tried this couple of CloudFormation scripts, It would ease your installation process and will help you setup your environment in one go.
ELK-Cookbook - CloudFormation Script
ELK-Stack with Google OAuth in Private VPC
Comment below if this doesn't solves your problem.

Related

Changing DynamoDB tables from Provisioned to On-Demand throughput at scale

I'm planning to convert 22 DynamoDB tables created through Terraform from Provisioned to On-Demand throughput.
Initial tests for changing this through Terraform showed that it takes about 30 minutes per table, which is too slow and makes using our current deployment pipeline in production a no-go.
I'm trying to speed things up and these are the options I thought of:
Create a job that will do a single table, and spawn 22 of those to run concurrently (Jenkins setup might let 10 run at a time - no control over that). I see this as low risk, but possibly lengthy to run in production.
Scripting the whole thing to: convert using the cli, cleaning up resources not needed post-conversion, deleting existing Terraform state files, then doing a Terraform import of the new DynamoDB resources. Seems much riskier than 1).
Something else I haven't thought of...
Outstanding question: Is the provisioned -> per-request conversion sensitive to amount of stored data?
Looking for opinions, and/or information from folks who might have gone through a similar exercise.

Redis elasticache in aws - how to get persistence and keep good latency

I am currently using redis cluster with 2 node groups and a replica per node.
I chose to use redis because of the high performance. I have a new requirement to have persistent storage of the data in redis. I want to keep the good latency redis gives me and still build some procedure to save the data in the background. Backup built in snapshots is not good enough anymore since there is a maximum of 20 backups per 24 hours. I need data to be synced aprox. every minute
The data needs to be stored in a way that restart of the system will not make the data to be lost and that it can be restored back at all times.
So if I summarize the requirements:
Keep working with redis elasticache
Keep highest performance and latency
Be able to have the data persistent (including when the system is down or restarted)
The data sync to happen in intervals of a minute.
Be able to restore data back to redis when it lost the data.
I was looking when googling at manually running BGSAVE from a side docker in EC2 or to have a slave running in another EC2 machine. And then a lambda may take the rdb dile/data and save in s3.
Will this fit my needs?
What do the experts suggest? What are your ideas?
You can get close to your requirements by enabling AOF persistence.
This is done in the cluster's parameter group:
appendonly yes
appendfsync always|everysec
You will have to restart as well.
As you can see, redis only has two options for file system sync-for every value and every second.
Every value will be quite slow, so go with everysec if you want to keep good performance.

Is there a better way for me to architect this batch processing pipeline?

So I have a large dataset (1.5 Billion) I need to perform a I/O bound transform task on (same task for each point) and place the result into a store that allows fuzzy searching on the transforms fields.
What I currently have is a Step-Function Batch Job Pipeline feeding into RDS. It works like so
Lamba splits input data into X number of even partitions
An Array Batch job is created with X array elements matching the X paritions
Batch jobs (1 vCPU, 2048 Gb ram) run on number of EC2 spot instances, transform the data and place it into RDS.
This current solution (with X=1600 workers) runs in about 20-40 minutes, mainly based on the time it takes to spin up spot instance jobs. The actual jobs themselves average about 15 minutes in run time. As for total cost, with spot savings the workers cost ~40 bucks but the real kicker is the RDS postgres DB. To be able to handle 1600 concurrent writes you need at least a r5.xlarge which is 500 a month!
Therein lies my problem. It seems I could run the actual workers quicker and for cheaper ( due to second based pricing) by having say 10,000 workers but then I would need a RDS system that could handle 10,000 concurrent DB connections somehow.
I've looked high and low and can't find a good solution to this scaling wall I am hitting. Below I'll detail some things I've tried and why they haven't worked for me or don't seem like a good fit.
RDS proxies - I tried creating 2 proxies set to 50% connection pool and giving "Even" numbered jobs one proxy and odd numbered jobs the other but that didn't help
DynamoDb - This seems off the bat to solve my problem hugely concurrent, can definitely handle the write load but it doesn't allow fuzzy searching like select * where field LIKE Y which is a key part of my workflow with the batch job results
(Theory) - have the jobs write their results to S3 then trigger a lambda on new bucket entries to insert those into the DB. (This might be a terrible idea I'm not sure)
Anyways, what I'm after is improving the cost of running this batch pipeline (mainly the DB), improving the time to run (to save on Spot costs) or both! I am open to any feedback or suggestion!
Let me know if there's some key piece of info you need I missed.

Cloud computing service to run thousands of containers in parallel

Is there any provider, that offers such an option out of the box? I need to run at least 1K concurrent sessions (docker containers) of headless web-browsers (firefox) for complex UI tests. I have a Docker image that I just want to deploy and scale to 1000 1CPU/1GB instances in second, w/o spending time on maintaining the cluster of servers (I need to shut them all down after the job is done), just focuse on the code. The most close thing I found so far is Amazon ECS/Fargate, but their limits have no sense to me ("Run containerized applications in production" -> max limit: 50 tasks -> production -> ok). Am I missing something?
I think that AWS Batch might be a better solution for your use case. You define a "compute environment" that provides a certain level of capacity, then submit tasks that are run on that compute environment.
I don't think that you'll find anything that can start up an environment and deploy a large number of tasks in "one second": in my experience it takes about a minute or two ramp-up time for Batch, although once the machines are up and running they are able to sequence jobs quickly. You should also give consideration to whether it makes sense to run all 1,000 jobs concurrently; that will depend on what you're trying to get out of your tests.
You'll also need to be aware of any places where you might be throttled (for example, retrieving configuration from the AWS Parameter Store). This talk from last year's NY Summit covers some of the issues that the speaker ran into when deploying multiple-thousands of concurrent tasks.
You could use lambda layers to run headless browsers (I know there are several implementations for chromium/selenium on github, not sure about firefox).
Alternatively you could try and contact the AWS team to see how much the limit for concurrent tasks on Fargate can be increased. As you can see at the documentation, the 50 task is a soft limit and can be raised.
Be aware if you start via Fargate, there is some API limit on the requests per second. You need to make sure you throttle your API calls or you use the ECS Create Service.
In any case, starting 1000 tasks would require 1000 seconds, which is probably not what you expect.
Those limits are not there if you use ECS, but in that case you need to manage the cluster, so it might be a good idea to explore the lambda option.

Replicated caching solutions compatible with AWS

My use case is as follow:
We have about 500 servers running in an autoscaling EC2 cluster that need to access the same configuration data (layed out in a key/value fashion) several million times per second.
The configuration data isn't very large (1 or 2 GBs) and doesn't change much (a few dozen updates/deletes/inserts per minute during peak time).
Latency is critical for us, so the data needs to be replicated and kept in memory on every single instance running our application.
Eventual consistency is fine. However we need to make sure that every update will be propagated at some point. (knowing that the servers can be shutdown at any time)
The update propagation across the servers should be reliable and easy to setup (we can't have static IPs for our servers, or we don't wanna go the route of "faking" multicast on AWS etc...)
Here are the solutions we've explored in the past:
Using regular java maps and use our custom built system to propagate updates across the cluster. (obviously, it doesn't scale that well)
Using EhCache and its replication feature. But setting it up on EC2 is very painful and somehow unreliable.
Here are the solutions we're thinking of trying out:
Apache Ignite (https://ignite.apache.org/) with a REPLICATED strategy.
Hazelcast's Replicated Map feature. (http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#replicated-map)
Apache Geode on every application node. (http://geode.apache.org/)
I would like to know if each of those solutions would work for our use case. And eventually, what issues I'm likely to face with each of them.
Here is what I found so far:
Hazelcast's Replicated Map is somehow recent and still a bit unreliable (async updates can be lost in case of scaling down)
It seems like Geode became "stable" fairly recently (even though it's supposedly in development since the early 2000s)
Ignite looks like it could be a good fit, but I'm not too sure how their S3 discovery based system will work out if we keep adding / removing node regularly.
Thanks!
Geode should work for your use case. You should be able to use a Geode Replicated region on each node. You can choose to do synchronous OR asynchronous replication. In case of failures, the replicated region gets an initial copy of the data from an existing member in the system, while making sure that no in-flight operations are lost.
In terms of configuration, you will have to start a couple/few member discovery processes (Geode locators) and point each member to these locators. (We recommend that you start one locator/AZ and use 3 AZs to protect against network partitioning).
Geode/GemFire has been stable for a while; powering low latency high scalability requirements for reservation systems at Indian and Chinese railways among other users for a very long time.
Disclosure: I am a committer on Geode.
Ignite provides native AWS integration for discovery over S3 storage: https://apacheignite-mix.readme.io/docs/amazon-aws. It solves main issue - you don't need to change configuration when instances are restarted. In a nutshell, any nodes that successfully joins topology writes its coordinates to a bucket (and removes them when fails or leaves). When you start a new node, it reads this bucket and connects to one the listed addresses.
Hazelcast's Replicated Map will not work for your use-case. Note that it is a map that is replicated across all it's nodes not on the client nodes/servers. Also, as you said, it is not fully reliable yet.
Here is the Hazelcast solution:
Create a Hazelcast cluster with a set of nodes depending upon the size of data.
Create a Distributed map(IMap) and tweak the count & eviction configurations based on size/number of key/value pairs. The data gets partitioned across all the nodes.
Setup Backup count based on how critical the data is and how much time it takes to pull the data from the actual source(DB/Files). Distributed maps have 1 backup by default.
In the client side, setup a NearCache and attach it to the Distributed map. This NearCache will hold the Key/Value pair in the local/client side itself. So the get operations would end up in milliseconds.
Things to consider with NearCache solution:
The first get operation would be slower as it has to go through network to get the data from cluster.
Cache invalidation is not fully reliable as there will be a delay in synchronization with the cluster and may end reading stale data. Again, this is same case across all the cache solutions.
It is client's responsibility to setup timeout and invalidation of Nearcache entries. So that the future pulls would get fresh data from cluster. This depends on how often the data gets refreshed or value is replaced for a key.