We are using AWS Elasticsearch for logs. The logs are streamed via Logstash continuously. What is the best way to periodically remove the old indexes?
I have searched, and the recommended approaches are:
Use Lambda to delete old indexes - https://medium.com/@egonbraun/periodically-cleaning-elasticsearch-indexes-using-aws-lambda-f8df0ebf4d9f
Use scheduled docker containers - http://www.tothenew.com/blog/running-curator-in-docker-container-to-remove-old-elasticsearch-indexes/
These approaches seem like overkill for such a basic requirement as "delete indexes older than 15 days".
What is the best way to achieve that? Does AWS provide any setting that I can tweak?
Elasticsearch 6.6 brings a new feature called Index Lifecycle Management (see here). Each index is assigned a lifecycle policy, which governs how the index transitions through specific stages until it is deleted.
For example, if you are indexing metrics data from a fleet of ATMs into Elasticsearch, you might define a policy that says:
When the index reaches 50GB, roll over to a new index.
Move the old index into the warm stage, mark it read only, and shrink it down to a single shard.
After 7 days, move the index into the cold stage and move it to less expensive hardware.
Delete the index once the required 30 day retention period is reached.
The feature is still in beta, but it is probably the way to go from now on.
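For reference, a policy along those lines is created through the open-source ILM API. Below is a minimal sketch: the policy name, the endpoint, and the "box_type" node attribute used for the cold tier are assumptions, and note that the AWS-managed service did not expose ILM at the time (see the Index State Management update further down in this thread).

import requests

# Sketch of a lifecycle policy matching the stages described above.
endpoint = "http://localhost:9200"  # placeholder; not the AWS-managed service

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_size": "50gb"}}
            },
            "warm": {
                "actions": {
                    "readonly": {},
                    "shrink": {"number_of_shards": 1}
                }
            },
            "cold": {
                "min_age": "7d",
                # "box_type" is a hypothetical node attribute marking cheaper hardware.
                "actions": {"allocate": {"require": {"box_type": "cold"}}}
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}}
            }
        }
    }
}

resp = requests.put(f"{endpoint}/_ilm/policy/logs_policy", json=policy)
resp.raise_for_status()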
Running curator is pretty light and easy.
Here you can find a Dockerfile, config and action-file.
https://github.com/zakkg3/curator
Also, Curator can help you if you need to (among other things):
Add or remove indices (or both!) from an alias
Change shard routing allocation
Delete snapshots
Open closed indices
forceMerge indices
reindex indices, including from remote clusters
Change the number of replicas per shard for indices
rollover indices
Take a snapshot (backup) of indices
Restore snapshots
https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html
Here is a typical action file to delete indices older than 15 days:
actions:
  1:
    action: delete_indices
    description: >-
      Delete indices older than 15 days (based on index name), for logstash-
      prefixed indices. Ignore the error if the filter does not result in an
      actionable list of indices (ignore_empty_list) and exit cleanly.
    options:
      ignore_empty_list: True
      disable_action: True   # set to False for the action to actually run
    filters:
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 15
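To actually run it, point curator at a client config file plus this action file, e.g. curator --config config.yml delete_indices.yml (the file names here are placeholders; the client config holds the cluster connection details), and remember to flip disable_action to False first.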
I followed the elasticsearch-curator documentation to install the package:
https://www.elastic.co/guide/en/elasticsearch/client/curator/current/pip.html
Then I used AWS's base example of how to automate index cleanup using the signature-based authentication provided by the requests_aws4auth package:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/curator.html
It worked like a charm.
You can decide to run this inside a lambda, docker or include it in your own DevOps cli.
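For anyone who wants a concrete starting point, here is a minimal sketch along the lines of that AWS example. The endpoint, region and index prefix are placeholders, and it assumes the elasticsearch, elasticsearch-curator, boto3 and requests_aws4auth packages are installed.

import boto3
import curator
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

host = "search-my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
region = "us-east-1"                                   # placeholder region

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, "es", session_token=credentials.token)

es = Elasticsearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Build the working list of indices, keep only logstash-* ones older than 15 days...
index_list = curator.IndexList(es)
index_list.filter_by_regex(kind="prefix", value="logstash-")
index_list.filter_by_age(source="name", direction="older",
                         timestring="%Y.%m.%d", unit="days", unit_count=15)

# ...and delete them. (If nothing matches, curator raises an exception,
# so you may want to guard against an empty list.)
curator.DeleteIndices(index_list).do_action()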
Related
I have successfully created an ASG with a rolling update, which seems to work. I have, however, a rather unique use case. I would like an update strategy where both run in parallel (EC2_old and EC2_new). Meaning, I want to make sure the new one is up and running during a test session of 15-30 min. During these 15-30 min I also want the deployment process to continue and not get stuck waiting for this transition to complete. In a way, I'm looking for a blue/green deployment strategy, and I don't know if it is even possible.
I did some reading and came across the WillReplace update policy. This could do the trick, but the cfn options seem rather limited. Has anyone implemented an update strategy of this complexity?
Current policy looks like this:
updatePolicy = {
  autoScalingRollingUpdate: {
    maxBatchSize: 1,
    minInstancesInService: 1,
    pauseTime: "PT1H",
    waitOnResourceSignals: true,
    suspendProcesses: [
      "HealthCheck",
      "ReplaceUnhealthy",
      "AZRebalance",
      "ScheduledActions",
      "AlarmNotification"
    ]
  }
};
willReplace won't be a blue/green strategy. It does create a new ASG, but it will swap the target group to the new ASG's instances as soon as they are all healthy. If you google "AWS blue green deployment" you should find a quick start that goes over how to set up what you are looking for.
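For comparison, here is roughly what the replacing variant looks like, expressed as plain data in the same shape the CloudFormation resource attributes take (only a sketch; the signal count and timeout are assumptions):

# Sketch: CloudFormation resource attributes for a replacing update.
# WillReplace makes CloudFormation create a brand-new ASG, wait for its
# instances to satisfy the CreationPolicy, and only then retire the old ASG.
resource_attributes = {
    "UpdatePolicy": {
        "AutoScalingReplacingUpdate": {"WillReplace": True}
    },
    "CreationPolicy": {
        # The new ASG only takes over once this many instances signal success
        # within the timeout; both values here are assumptions.
        "ResourceSignal": {"Count": 1, "Timeout": "PT30M"}
    },
}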
I am retiring an old Elasticsearch index in AWS that has not received a new document since 2016. However, something is still trying to search it.
I still want to deprecate this index in a manner where I can get back to the original state quickly. I have created a manual snapshot of the index and it is sitting in S3. I was planning on deleting the domain, but, from what I understand, that deletes everything billable under AWS, including the endpoint. As I mentioned above, I want to be able to get back to the original state of the index. This domain contains a series of indexes, the largest of which is 20.5 GB. I was going to delete the large index and resize the cluster to a smaller instance size and footprint. Will this work, or will the data be unsearchable?
I've no experience using Elasticsearch on AWS, but I have an idea about your index.
You say the index has received no new documents for a long time. If this also means no deletions and no updates, you could theoretically just take this index to a new cluster, using either snapshot + restore, or a cross-cluster reindex. Continue operating your old cluster until you're sure the new one is working well.
Again - not familiar with AWS terminology, but it sounds like this approach translates to using separate "domains". First you fully ensure the new "domain" is working with the right hardware spec and data, and then delete the old "domain".
TL;DR -> yes!
The backup to S3 will work, but the documents will be unsearchable because in order to downsize the storage you have to delete the index.
But if someday you want to restore the data from S3 back to the index, you can.
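For reference, restoring from a snapshot that is already registered in an S3 repository is a single API call. A sketch (the endpoint, repository, snapshot and index names are placeholders; on Amazon ES the request would normally be signed, e.g. with requests_aws4auth as in the curator example earlier in the thread):

import requests

endpoint = "https://search-my-domain.us-east-1.es.amazonaws.com"  # placeholder

# Restore a single index from a manual snapshot stored in the S3 repository.
resp = requests.post(
    f"{endpoint}/_snapshot/my-s3-repo/snapshot-2016-final/_restore",
    json={"indices": "my-large-index"},
)
resp.raise_for_status()
print(resp.json())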
You can resize instances and storage with no downtime; however, that takes a long time and you pay extra for the machines while they are resizing.
Example:
You change your storage size from 100 GB to 99 GB.
The Elasticsearch service will spin up another instance, copy all your data from the old instance to the new one, and then delete the old one.
The same goes for instance sizes:
machine up, cluster sync, machine down.
While they are syncing, you pay for both.
Your plan will work; ES is very flexible.
If you really don't trust AWS, make a JSON export from the index and keep it on S3 too, just in case things go south.
After we changed the DynamoDB table capacity to on-demand, the Data Pipeline job that exports the DynamoDB table failed with this error.
Exception in thread "main" java.lang.RuntimeException: Read throughput should not be less than 1. Read throughput percent: 0.0
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBInputFormat.getSplits(AbstractDynamoDBInputFormat.java:51)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
Any workaround to this issue?
Thanks
--gsu
I'd contact AWS support to confirm, but I was told the EMR DynamoDB connector does not formally support tables using on-demand provisioning yet. So, more than likely you need to switch the table back to provisioned capacity as a workaround.
Edit: As of 23 January 2019, the EMR connector for DynamoDB supports tables set to on-demand billing.
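If your EMR release predates that support and you still need the workaround, switching the table back to provisioned capacity is straightforward with boto3 (a sketch; the table name and throughput values are placeholders):

import boto3

dynamodb = boto3.client("dynamodb")

# Switch the table from on-demand back to provisioned capacity so the
# EMR/Data Pipeline connector can compute its read splits again.
dynamodb.update_table(
    TableName="my-export-table",  # placeholder
    BillingMode="PROVISIONED",
    ProvisionedThroughput={
        "ReadCapacityUnits": 100,   # placeholder values
        "WriteCapacityUnits": 10,
    },
)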
If the issue is not resolved, then you might need to make changes in three places:
Use an EMR release between emr-5.26.0 and emr-5.30.0.
Replace org.apache.hadoop.dynamodb.tools.DynamoDbExport with org.apache.hadoop.dynamodb.tools.DynamoDBExport. Notice the change in casing. The same applies to DynamoDBImport.
If you are using emr-dynamodb-connector, clone its latest version, generate the emr-ddb-tools jar with mvn clean install, and then use the generated emr-dynamodb-tools jar (currently version 4 or higher) in your arguments. These changes should resolve the issue.
Also, there is currently an issue with EMR releases 5.31.0 or higher when using emr-dynamodb-tools that surfaces an error related to the joda-time framework, so I would stick to releases between emr-5.26.0 and emr-5.30.0.
My understanding was that Spark chooses the 'default' number of partitions based solely on the size of the file or, if it is a union of many Parquet files, the number of parts.
However, when reading in a set of large Parquet files, I see that the default number of partitions for an EMR cluster with a single d2.2xlarge is ~1200. However, in a cluster of 2 r3.8xlarge I'm getting default partitions of ~4700.
What metrics does Spark use to determine the default partitions?
EMR 5.5.0
spark.default.parallelism - Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
2X number of CPU cores available to YARN containers.
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-defaults
Looks like it matches non EMR/AWS Spark as well
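If you want to see what value your cluster actually picked up, a quick PySpark check (the S3 path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# What EMR configured for the cluster (2x the vCPUs available to YARN containers by default).
print("spark.default.parallelism:", spark.sparkContext.defaultParallelism)

# How many partitions the actual read produces for this dataset.
df = spark.read.parquet("s3://my-bucket/my-parquet-prefix/")  # placeholder path
print("partitions after read:", df.rdd.getNumPartitions())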
I think there was some transient issue, because I restarted that EMR cluster with d2.2xlarge and it gave me the number of partitions I expected, which matched the r3.8xlarge cluster and the number of files on S3.
If anyone knows why this kind of things happens though, I'll gladly mark yours as the answer.
I want to use AWS Elasticsearch to store the logs of my application. Since there is a huge amount of data going into AWS Elasticsearch (~30 GB daily), I would only keep 3 days of data. Is there any way to schedule data removal from AWS Elasticsearch or do log rotation? What happens if the AWS Elasticsearch storage is full?
Thanks for the help
A possible way is to set the index parameter in the Logstash elasticsearch output to something like logstash-%{appname}-%{date_format}. You can then use Curator to delete the old indices based on their age in days.
This SO answer pretty much explains the same. Hope it helps!
I assume you are using the Amazon Elasticsearch Service?
The storage type is an EBS volume with a fixed amount of disk space. If you want to keep only the last three days, I assume you then have 3 indices, like this:
my-index-2017.01.30
my-index-2017.01.31
my-index-2017.02.01
Basically, you can write a simple script which deletes indices older than 3 days. With the REST API it is just DELETE my-index-2017.01.30 (for example, in Sense).
I recommend using Elasticsearch Curator for the job. See https://www.elastic.co/guide/en/elasticsearch/client/curator/current/delete_indices.html
I'm not sure if the Service interface itself has an option for that. But Elasticsearch Curator should do the job for you.
Update for 2020:
AWS ES now has support for Index State Management, which lets you define custom management policies to automate routine tasks and apply them to indices and index patterns. You no longer need to set up and manage external processes to run your index operations.
For example, you can define a policy that moves your index into a read_only state after 30 days and then ultimately deletes it after 90 days.
Index State Management - https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/ism.html
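A sketch of such a policy, created through the ISM API (the endpoint and policy id are placeholders; older domains use the _opendistro prefix shown here, newer OpenSearch domains use _plugins):

import requests

endpoint = "https://search-my-domain.us-east-1.es.amazonaws.com"  # placeholder

policy = {
    "policy": {
        "description": "Read-only after 30 days, delete after 90 days",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "read_only", "conditions": {"min_index_age": "30d"}}
                ],
            },
            {
                "name": "read_only",
                "actions": [{"read_only": {}}],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "90d"}}
                ],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
    }
}

# On Amazon ES this request would normally be signed (e.g. with requests_aws4auth).
# The policy still has to be attached to your indices afterwards, e.g. via an ISM
# template or the index_state_management policy_id index setting.
resp = requests.put(f"{endpoint}/_opendistro/_ism/policies/log-retention", json=policy)
resp.raise_for_status()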