Parsing log files with Logstash is taking a long time. How can I increase Logstash's performance? Here is my scenario: parsing 25 MB takes exactly 30 minutes. Could anyone suggest how to improve the performance, and is this the correct way to measure it? My environment: an AWS large instance with 15 GB of RAM. Thanks in advance for your responses.
I am a beginner with AWS OpenSearch and I have one important question about it:
how much data (in MB or GB) can I insert in a single bulk request in AWS OpenSearch?
I tried to find the answer on the AWS website but couldn't get one. Please let me know if you can help.
The amount of data you can insert using bulk operations will depend on the cluster, its configuration and the data (among other factors), but I've found this to be a good recommendation:
"Start with the bulk request size of 5 MiB to 15 MiB. Then, slowly increase the request size until the indexing performance stops improving."
So I have a large dataset (1.5 billion records) on which I need to perform an I/O-bound transform task (the same task for each record) and place the results into a store that allows fuzzy searching on the transformed fields.
What I currently have is a Step Functions / AWS Batch pipeline feeding into RDS. It works like so:
A Lambda splits the input data into X even partitions
An array Batch job is created with X array elements matching the X partitions
The Batch jobs (1 vCPU, 2,048 MB RAM) run on a number of EC2 Spot Instances, transform the data and write it into RDS.
This current solution (with X = 1,600 workers) runs in about 20-40 minutes, mostly depending on how long it takes to spin up the Spot Instance jobs. The actual jobs themselves average about 15 minutes of run time. As for total cost, with Spot savings the workers cost ~40 bucks, but the real kicker is the RDS Postgres DB: to handle 1,600 concurrent writes you need at least an r5.xlarge, which is about $500 a month!
Therein lies my problem. It seems I could run the actual workers faster and cheaper (due to per-second pricing) by having, say, 10,000 workers, but then I would need an RDS setup that could somehow handle 10,000 concurrent DB connections.
I've looked high and low and can't find a good solution to this scaling wall I am hitting. Below I'll detail some things I've tried and why they haven't worked for me or don't seem like a good fit.
RDS Proxy - I tried creating two proxies, each set to a 50% connection pool, and giving even-numbered jobs one proxy and odd-numbered jobs the other, but that didn't help.
DynamoDB - Off the bat this seems to solve my problem: it's hugely concurrent and can definitely handle the write load, but it doesn't allow fuzzy searching like select * where field LIKE Y, which is a key part of my workflow with the batch job results.
(Theory) - Have the jobs write their results to S3, then trigger a Lambda on new bucket objects to insert those into the DB (this might be a terrible idea, I'm not sure); see the rough sketch after this list.
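Roughly what I picture for that Lambda, just as a sketch (the bucket layout, two-column CSV format, table name and DATABASE_URL environment variable are placeholders; I'd also still need to batch inserts and cap the function's reserved concurrency so the DB isn't flooded):

    # Rough sketch of an S3-triggered Lambda that loads one worker's result file
    # into Postgres. Bucket layout, CSV format, table name and the DATABASE_URL
    # environment variable are hypothetical.
    import csv
    import io
    import os

    import boto3
    import psycopg2  # would need to be bundled as a layer or in the deployment package

    s3 = boto3.client("s3")

    def handler(event, context):
        conn = psycopg2.connect(os.environ["DATABASE_URL"])
        try:
            with conn, conn.cursor() as cur:
                for record in event["Records"]:
                    bucket = record["s3"]["bucket"]["name"]
                    key = record["s3"]["object"]["key"]
                    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
                    # Assumes each result file is a two-column CSV: id, transformed_field
                    rows = csv.reader(io.StringIO(body.decode("utf-8")))
                    cur.executemany(
                        "INSERT INTO results (id, transformed_field) VALUES (%s, %s)",
                        list(rows),
                    )
        finally:
            conn.close()

That way each invocation holds a single DB connection, so the Lambda concurrency limit would effectively become the DB connection cap.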
Anyway, what I'm after is reducing the cost of running this batch pipeline (mainly the DB), improving the run time (to save on Spot costs), or both! I am open to any feedback or suggestions!
Let me know if there's some key piece of info you need I missed.
Description:
I have a project running on an AWS server whose requests are taking too long to complete.
The project consists of dynamic form fields which are indexed with Solr. My simple Solr queries are taking 2 minutes to execute successfully. I keep the Solr data in an EFS-mapped folder on the AWS instance.
I don't want to move Solr to EBS.
Solutions I have tried:
EFS performance mode: both General Purpose and Max I/O still take around 2 minutes.
EFS throughput mode: tried Bursting mode; it takes roughly the same amount of time.
I have tried many times to install RStudio Server on an AWS instance using terminal commands, without any luck. I can install it using http://www.louisaslett.com/RStudio_AMI/
and following a YouTube video, but I cannot get the Dropbox sync to stop "syncing". I have tried installing a fresh version using the terminal and PuTTY and other methods without much success.
What I wanted to use AWS for was to use the bandwidth / computing time.
I basically wanted to run an R script to download a bunch of documents, which could take 2 weeks to download. I had hoped to save these to a large Dropbox account I have access to, but unfortunately

    library("RStudioAMI")
    linkDropbox()
    excludeSyncDropbox("*")

doesn't seem to work for me: the whole Dropbox folder gets synced onto my AWS instance and I run out of space.
So basically... I think I will forget dropbox and just use AWS storage.
I want to download approximately 500 GB, or perhaps 1 TB, worth of data (running an R script to download documents and save them). It just connects to a website and downloads documents, so no ML or high computing power is needed, just a consistent connection. Once the documents are fully downloaded I would like to transfer them to an external hard drive I have for further analysis.
So my question is: "approximately" how much do you think this may cost? I don't mind paying $20-30, I just don't want to go in inexperienced/without knowledge and rack up hundreds of dollars.
Additionally: what other instances/servers do you suggest I pay for? I feel like I don't need that much power, just consistency.
Here is another SO question I opened:
Amazon AWS Dropbox link error: "No directories are being ignored."
There will be three main costs for your scenario:
Amazon EC2, which is charged hourly. You do not need much processing power, so a t3.small would probably be adequate if you're not doing any big computations. It's only about 2c/hour, which is $7 for 2 weeks.
An Amazon EBS disk volume attached to your Amazon EC2 instance for storing the data. A General Purpose volume is 10c/GB/month. So, 1TB for 2 weeks would be $50. If you configure it to use "Cold HDD (sc1)", then it's a quarter of that price.
Data Transfer for when you download from AWS. If you are using AWS in the USA, it is 9c/GB. So, 1TB = $90. This would be your major cost.
There might be some other minor costs, but they won't be significant compared to the above.
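A quick back-of-envelope check of those numbers (prices are approximate and vary by region):

    # Rough cost check for the two-week scenario above; all prices are approximations.
    hours = 14 * 24                   # two weeks
    ec2 = 0.02 * hours                # t3.small at ~2c/hour                    -> ~$7
    ebs_gp2 = 0.10 * 1000 / 2         # 1 TB gp2 at 10c/GB-month, half a month  -> ~$50
    ebs_sc1 = ebs_gp2 / 4             # Cold HDD (sc1), roughly a quarter of gp2
    transfer_out = 0.09 * 1000        # 1 TB out of AWS at 9c/GB                -> ~$90
    print(ec2, ebs_gp2, ebs_sc1, transfer_out, ec2 + ebs_gp2 + transfer_out)

So the whole exercise lands somewhere around $150 with a gp2 volume, and a bit less with sc1.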
Or, given that your basic goal is to collect and download data, you could just do it on a computer at home.
If you are not strictly limited to EC2 (which I think you are not, considering the requirements you stated and that the AMI approach failed for you), AWS Lightsail would be a much better solution.
It has a bundled data transfer allowance and acceptable performance.
Here is the 1-month plan
512 MB Memory
1 Core Processor
20 GB SSD Disk
1 TB Transfer (data in costs nothing, only data out, e.g. from Lightsail to your local PC)
Additional SSD - $10 for 1 TB
The average network performance I see for that instance is about 30 megabytes per second. You can just shut everything down and be billed only for the hours you used in the month.
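As a rough sanity check on that throughput figure (assuming the ~30 MB/s holds, which it won't do perfectly in practice):

    # How long moving ~1 TB off the instance takes at ~30 MB/s; purely illustrative.
    dataset_gb = 1000                     # ~1 TB of downloaded documents
    throughput_mb_s = 30                  # observed average for this instance size
    hours = dataset_gb * 1000 / throughput_mb_s / 3600
    print(f"~{hours:.1f} hours to move 1 TB")  # roughly 9-10 hours

And the bundled 1 TB of outbound transfer covers about one full copy of the dataset, so pulling it down to your local machine once should stay inside the plan.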
We are looking into setting up an ELK stack on Amazon, but we don't really know what machines we need to handle it smoothly.
Now I know it will become obvious if it doesn't run smoothly, but we still hoped to get an idea of what we would need for our situation.
So we have 4 servers that generate log files in a custom format: about ~45 million lines of logs each day, in about 4 files of 600 MB (gzipped), so around ~24 GB of logs each day.
Now we are looking into the ELK stack and would like the Kibana dashboards to display realtime data, so I was thinking of shipping the logs via syslog to Logstash.
4 Servers -> Rsyslog (on those 4 servers) -> Logstash (AWS) -> ElasticSearch (AWS) -> Kibana (AWS)
So now we need to figure out what kind of hardware we would need in AWS to handle this.
I read somewhere that you need at least 3 master nodes and 2 data nodes for Elasticsearch.
So that would total 5 servers + 1 server for Kibana and 1 for Logstash?
So I would need a total of 7 servers to get started, but that kinda seems overkill?
I would like to keep my data for 1 month (31 days at most), so I would have around ~1.4 TB of raw log data in Elasticsearch (~45 GB x 31).
But since I don't really have a clue on what the best setup would be, any hints/tips/info would be welcome.
Also a system or tool that would handle this for me (node failure, etc) could be useful.
Thanks in advance,
darkownage
Here's how I've architected my cloud clusters:
3 Master nodes - these nodes coordinate the cluster, and keeping three of them helps tolerate failure. Ideally they are spread across availability zones. They can be fairly small and ideally do not receive any requests - their only job is to maintain the cluster. In this case, set discovery.zen.minimum_master_nodes = 2 to maintain quorum. These IPs, and these IPs only, are what you should provide to all cluster nodes in discovery.zen.ping.unicast.hosts.
Indexes: you should probably take advantage of daily indexes - see https://www.elastic.co/guide/en/elasticsearch/guide/current/time-based.html. This will make more sense below, but it will also be beneficial if you begin to scale up - you can increase the shard count over time without re-indexing.
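As a tiny illustration, daily indices just mean writing each day's documents into a date-stamped index name. A sketch with the elasticsearch Python client (endpoint, index prefix and document shape are made up; Logstash's elasticsearch output does the same thing with an index pattern like logs-%{+YYYY.MM.dd}):

    # Minimal sketch of time-based (daily) indices, assuming a 7.x-style
    # elasticsearch-py client and a placeholder endpoint/prefix.
    from datetime import datetime, timezone
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    def index_log(doc):
        # One index per day, e.g. logs-2016.03.14
        index_name = "logs-" + datetime.now(timezone.utc).strftime("%Y.%m.%d")
        es.index(index=index_name, body=doc)

    index_log({"@timestamp": datetime.now(timezone.utc).isoformat(),
               "message": "example log line"})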
Data Nodes: Depending on your scale and performance requirements there are a few options - i2.xlarge or d2.xlarge will work well, but r3.2xlarge is also a good option. Make sure to keep the JVM heap < 30 GB. Keep the data paths on ephemeral drives local to the instances - EBS is not really ideal for this use case, but depending on your requirements it might be sufficient. Be sure you have multiple data nodes so the replica shards can be split across availability zones. As your data requirements increase, just scale these up.
Hot/Warm: Depending on the use case, it is sometimes beneficial to split your data nodes into Hot/Warm tiers (fast SSD / slow HDD). This is mainly because all writes happen in realtime and the majority of reads are of the past few hours, so if you can move yesterday's data onto cheaper, slower drives, it helps out quite a bit. This is a little more involved, but you can read more at https://www.elastic.co/blog/hot-warm-architecture. It requires adding some tags and running curator on a nightly basis, but it is generally worth it due to the cost savings of moving largely unsearched data off of more expensive SSD.
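The nightly "move to warm" step that curator automates boils down to flipping an allocation setting on yesterday's index. A rough sketch, assuming the data nodes were started with a hot/warm box_type attribute as in the blog post above and that indices are named logs-YYYY.MM.dd:

    # Sketch: relocate yesterday's daily index from hot (SSD) to warm (HDD) nodes.
    # Assumes nodes carry a box_type attribute and daily indices named logs-YYYY.MM.dd.
    from datetime import datetime, timedelta, timezone
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y.%m.%d")
    es.indices.put_settings(
        index="logs-" + yesterday,
        body={"index.routing.allocation.require.box_type": "warm"},
    )

Elasticsearch then moves the shards onto the warm nodes in the background; curator just wraps this (plus retention deletes) in a schedulable config.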
In production, I run ~20 r3.2xlarge for the hot tier and 4-5 d2.xlarge for the warm tier with a replication factor of 2 - this allows ~TB per day of ingest and a decent amount of retention. We scale Hot for volume and Warm for retention.
Overall - good luck! It's a fun stack to build and operate once everything is running smoothly.
PS - Depending on the time/resources you have available, you can run the managed Elasticsearch service on AWS, but the last time I looked it was ~60% more expensive than running it on your own instances, and YMMV.
Seems like you need something to get started with the ELK stack on AWS.
Have you tried this couple of CloudFormation scripts? They would ease your installation process and help you set up your environment in one go.
ELK-Cookbook - CloudFormation Script
ELK-Stack with Google OAuth in Private VPC
Comment below if this doesn't solve your problem.