AWS Kinesis Firehose - using Index Rotation (Elasticsearch) - amazon-web-services

I have set up a new AWS Kinesis Firehose stream and I'd like to create a new index on a weekly basis.
For that, I should use the Index Rotation setting when configuring the stream.
But do I have to create the new index every weekend for the upcoming week?
If not (hopefully not), how does Firehose know what mapping to use? Does it use the mapping defined in the index that I specified in the Index setting?
Moreover, let's say I have old data: can I make Firehose create an index with the relevant timestamp according to the dates specified in my old data?
Thank you !

Have you considered creating an index template in Elasticsearch? That way, new indexes will pick up the mapping defined in the index template.
Refer to the following link for details:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html
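As a rough illustration of the index template idea, here is a minimal sketch that builds a template body matching every rotated index. The prefix `myindex`, the field names and the template name in the final comment are hypothetical, and the typeless `mappings` layout assumes Elasticsearch 7.x (6.x expects a type name under `mappings`, 5.x uses a `template` string instead of `index_patterns`).

```python
import json


def build_index_template(index_prefix, properties):
    """Return an index-template body that applies to every rotated index."""
    return {
        # Wildcard so that each new weekly index picks up the same mapping.
        "index_patterns": [f"{index_prefix}-*"],
        # Assumption: typeless mappings (Elasticsearch 7.x layout).
        "mappings": {"properties": properties},
    }


# Hypothetical prefix and fields, for illustration only.
template = build_index_template(
    "myindex",
    {"timestamp": {"type": "date"}, "message": {"type": "text"}},
)
print(json.dumps(template, indent=2))
# The body would then be sent with something like: PUT _template/myindex_template
```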

Well, apparently the answer is yes, but in a bad way.
If Firehose pushes data to a new index that is not pre-defined with a mapping, the data is ingested into Elasticsearch and a mapping is created automatically for you.
This is really bad, because the automatically inferred field types may not match the mapping you intended.
You should automatically create the index 1-2 hours before the rotation happens.
I'll post a Lambda function and its configuration for doing that automatically.
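Until that function is posted, here is a minimal sketch of the one piece of logic such a job needs: working out the name of next week's index so it can be created ahead of time. It assumes the weekly rotation suffix looks like `myindex-2016-w08` and follows ISO week numbering; both the prefix and the week-numbering assumption should be checked against an index Firehose has actually created.

```python
from datetime import date, timedelta


def weekly_index_name(prefix, day):
    """Build '<prefix>-YYYY-wWW' for the week containing `day`.

    Assumption: ISO week numbering (weeks start on Monday).
    """
    iso_year, iso_week, _ = day.isocalendar()
    return f"{prefix}-{iso_year}-w{iso_week:02d}"


def next_week_index_name(prefix, today):
    """Name of the index that next week's data would be written to."""
    return weekly_index_name(prefix, today + timedelta(days=7))


# Hypothetical prefix; a scheduled job could create this index
# (with the desired mapping) before the rotation happens.
print(next_week_index_name("myindex", date.today()))
```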

Related

Best strategy to archive specific records from RDS to a cheaper storage in AWS

I have the following requirements:
For every deleted record in RDS we need to archive it into somewhere cheaper on AWS.
Reduce storage cost
Not using Glacier
Context oriented (e.g. a file per table)
re-import is not a requirement
I'm not an experienced AWS user, so I'm still a bit lost among the number of options it has to offer, and I'd like to know if you have more ideas to help me sort this out.
Initial thoughts:
1. The microservice that deletes the record could send it to a broker (e.g. RabbitMQ), and another microservice (let's call it the archiver) would listen to the queue, write the records into a file, zip it and send it to S3. This approach has some technical challenges, though: for the files to be reasonably large, I need to wait for the queue to grow a bit, wrap it into a stream and zip it into S3. The transaction control is very weak as well, since file writing and message acks are signal based, i.e. I'd remove the messages from the broker right after the file is created.
2. Add a new "deleted (bool)" column to the "archivable" tables and run a separate job that fetches only those records and saves them into S3. Discarded: they don't want the new microservice to have access to other services' databases.
3. Follow the same approach as in the first item, but instead of saving into S3, save into a cheaper database. SimpleDB?
Option 1, but instead of RabbitMQ, write to a Kinesis Firehose and direct that to an S3 location. It doesn't get much cheaper or easier than that.
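A minimal sketch of what the deleting microservice might do under this suggestion: serialize the deleted row as one JSON line and hand it to Firehose, which batches the lines into S3 objects. The function name, the stream name and the payload layout are hypothetical; `put_record` with `DeliveryStreamName` and `Record={"Data": ...}` is the boto3 Firehose call, and the client is passed in so the function can be exercised without AWS access.

```python
import json


def archive_deleted_row(firehose_client, stream_name, table, row):
    """Send one deleted row to a Firehose delivery stream as a JSON line."""
    # default=str keeps dates/decimals serializable; the trailing newline
    # separates records once Firehose concatenates them into an S3 object.
    payload = json.dumps({"table": table, "row": row}, default=str) + "\n"
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload.encode("utf-8")},
    )


# Usage (assumes boto3 is installed and the stream already exists):
#   import boto3
#   archive_deleted_row(boto3.client("firehose"), "deleted-rows",
#                       "orders", {"id": 42, "status": "cancelled"})
```

Keeping the table name inside each record is one way to stay "context oriented" when a single stream is shared; a stream per table would be another.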

How to read the oldest unprocessed record in Kinesis Data Stream

I'm new to AWS and would like some guidance.
I want to process the oldest unprocessed record but I cannot seem to get the params right.
Current Architecture
For the shard iterator:
I've tried TRIM_HORIZON, which gave me all the records since the beginning.
I've also tried LATEST which only gave me the one latest record.
Not sure if these additional details will help but...
I'm putting my own records in through Lambda on the AWS console
I'm debugging this by looking at the log files in CloudWatch
I'm getting records through the shard iterator (TRIM_HORIZON and LATEST)
My getRecords limit is set at 100
Thanks in advance!
There is no "oldest unprocessed record", as Kinesis doesn't know what you've processed (for example, you may have fetched the records but not done anything with them).
If you're using Kinesis, I strongly recommend using the Kinesis Client Library, which has the concept of checkpoints. These are essentially a nice wrapper on top of the AFTER_SEQUENCE_NUMBER shard iterator type, which translates to "oldest uncheckpointed record", or as close as you'll get to "oldest unprocessed record".
(You could always implement this logic yourself, but why not reuse work that Amazon has already done for you?)
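For anyone who does implement it by hand, a minimal sketch of the checkpoint idea: remember the sequence number of the last record you finished processing, and ask for an iterator that starts right after it. The function name and the way the checkpoint is stored are hypothetical; the keys are the parameters of the Kinesis `GetShardIterator` call.

```python
def shard_iterator_args(stream_name, shard_id, checkpoint=None):
    """Build GetShardIterator arguments from a stored checkpoint.

    `checkpoint` is the sequence number of the last processed record,
    or None if nothing has been processed yet.
    """
    args = {"StreamName": stream_name, "ShardId": shard_id}
    if checkpoint is None:
        # No checkpoint yet: start from the oldest record still retained.
        args["ShardIteratorType"] = "TRIM_HORIZON"
    else:
        # Resume with the first record after the last one processed.
        args["ShardIteratorType"] = "AFTER_SEQUENCE_NUMBER"
        args["StartingSequenceNumber"] = checkpoint
    return args


# Usage (assumes boto3 and a persisted checkpoint value):
#   kinesis = boto3.client("kinesis")
#   it = kinesis.get_shard_iterator(**shard_iterator_args("my-stream",
#            "shardId-000000000000", last_seq))["ShardIterator"]
```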

deleting old indexes in amazon elasticsearch

We are using AWS Elasticsearch for logs. The logs are streamed via Logstash continuously. What is the best way to periodically remove the old indexes?
I have searched and various approaches recommended are:
Use Lambda to delete old indexes - https://medium.com/@egonbraun/periodically-cleaning-elasticsearch-indexes-using-aws-lambda-f8df0ebf4d9f
Use scheduled docker containers - http://www.tothenew.com/blog/running-curator-in-docker-container-to-remove-old-elasticsearch-indexes/
These approaches seem like overkill for such a basic requirement as "delete indexes older than 15 days".
What is the best way to achieve that? Does AWS provide any setting that I can tweak?
Elasticsearch 6.6 brings a new feature called Index Lifecycle Management (ILM). Each index is assigned a lifecycle policy, which governs how the index transitions through specific stages until it is deleted.
For example, if you are indexing metrics data from a fleet of ATMs into Elasticsearch, you might define a policy that says:
When the index reaches 50GB, roll over to a new index.
Move the old index into the warm stage, mark it read only, and shrink it down to a single shard.
After 7 days, move the index into the cold stage and move it to less expensive hardware.
Delete the index once the required 30 day retention period is reached.
The feature is still in beta, but it is probably the way to go from now on.
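As a rough sketch, the ATM example above might translate into a lifecycle policy body along these lines. The node attribute `box_type: cold` used to move data to cheaper hardware is a hypothetical cluster setting, and the exact set of supported actions should be checked against the ILM documentation for the Elasticsearch version in use.

```python
import json

# Sketch of an ILM policy body for the example policy described above.
ilm_policy = {
    "policy": {
        "phases": {
            # Roll over to a new index once the current one reaches 50GB.
            "hot": {"actions": {"rollover": {"max_size": "50gb"}}},
            # Mark the old index read only and shrink it to one shard.
            "warm": {"actions": {"readonly": {}, "shrink": {"number_of_shards": 1}}},
            # After 7 days, relocate to nodes tagged as cheaper hardware
            # (assumption: nodes carry a custom "box_type" attribute).
            "cold": {
                "min_age": "7d",
                "actions": {"allocate": {"require": {"box_type": "cold"}}},
            },
            # Delete once the 30 day retention period is reached.
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}

print(json.dumps(ilm_policy, indent=2))
```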
Running curator is pretty light and easy.
Here you can find a Dockerfile, config and action-file.
https://github.com/zakkg3/curator
Also, Curator can help you if you need to (among others):
Add or remove indices (or both!) from an alias
Change shard routing allocation
Delete snapshots
Open closed indices
forceMerge indices
reindex indices, including from remote clusters
Change the number of replicas per shard for indices
rollover indices
Take a snapshot (backup) of indices
Restore snapshots
https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html
Here is a typical action file for deleting indices older than 15 days:
actions:
  1:
    action: delete_indices
    description: >-
      Delete indices older than 15 days (based on index name), for logstash-
      prefixed indices. Ignore the error if the filter does not result in an
      actionable list of indices (ignore_empty_list) and exit cleanly.
    options:
      ignore_empty_list: True
      # The action is skipped while this is True; set it to False to let it run.
      disable_action: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 15
I followed the elasticsearch-curator documentation to install the package:
https://www.elastic.co/guide/en/elasticsearch/client/curator/current/pip.html
Then I used the basic AWS example of how to automate index cleanup, using the signature-based (SigV4) authentication provided by the requests_aws4auth package:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/curator.html
It worked like a charm.
You can run this inside a Lambda, in Docker, or include it in your own DevOps CLI.

AWS elasticsearch log rotation

I want to use AWS Elasticsearch to store my application's logs. Since there is a huge amount of data to ingest (~30 GB daily), I would only keep 3 days of data. Is there any way to schedule data removal from AWS Elasticsearch or to do log rotation? What happens if the AWS Elasticsearch storage is full?
Thanks for the help
A possible way is to set the index parameter of the elasticsearch output to something like "logstash-%{appname}-%{date_format}". You can then use the Curator plugin to delete old indices by number of days or so.
This SO pretty much explains the same. Hope it helps!
I assume you are using the Amazon Elasticsearch Service?
The storage type is an EBS volume with a fixed amount of disk space. If you want to keep only the last three days, I assume you then have 3 indices, like this:
my-index-2017.01.30
my-index-2017.01.31
my-index-2017.02.01
Basically, you can write a simple script that deletes indices older than 3 days. With the REST API, in Sense this is just DELETE my-index-2017.01.30.
I recommend using Elasticsearch Curator for the job. See https://www.elastic.co/guide/en/elasticsearch/client/curator/current/delete_indices.html
I'm not sure if the Service interface itself has an option for that. But Elasticsearch Curator should do the job for you.
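If you do go the simple-script route, the core of it is just picking out which index names have aged out. A minimal sketch, assuming daily indices named like my-index-2017.01.30; the function name and the `keep_days` parameter are hypothetical, and each returned name would then be removed with a DELETE request.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, prefix, keep_days, today, timestring="%Y.%m.%d"):
    """Return the indices whose date suffix falls outside the retention window.

    The most recent `keep_days` daily indices (including today's) are kept.
    """
    cutoff = today - timedelta(days=keep_days - 1)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of ours
        try:
            stamp = datetime.strptime(name[len(prefix):], timestring).date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if stamp < cutoff:
            expired.append(name)
    return expired


names = ["my-index-2017.01.29", "my-index-2017.01.30",
         "my-index-2017.01.31", "my-index-2017.02.01"]
print(expired_indices(names, "my-index-", 3, date(2017, 2, 1)))
```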
Update for 2020:
AWS ES now supports Index State Management, which lets you define custom management policies to automate routine tasks and apply them to indices and index patterns. You no longer need to set up and manage external processes to run your index operations.
For example, you can define a policy that moves your index into a read_only state after 30 days and then ultimately deletes it after 90 days.
Index State Management - https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/ism.html
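As a rough sketch, the example policy above (read-only after 30 days, deleted after 90) could be expressed as an Index State Management policy body like this. The state names `hot`, `read_only` and `delete` and the builder function are hypothetical choices; the overall shape (`default_state`, `states`, `actions`, `transitions`, `min_index_age`) follows the ISM policy format, but check it against the linked documentation before use.

```python
import json


def build_ism_policy(read_only_after="30d", delete_after="90d"):
    """Build an ISM policy: hot -> read_only -> delete, driven by index age."""
    return {
        "policy": {
            "description": f"Read-only after {read_only_after}, delete after {delete_after}",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {"state_name": "read_only",
                         "conditions": {"min_index_age": read_only_after}}
                    ],
                },
                {
                    "name": "read_only",
                    "actions": [{"read_only": {}}],
                    "transitions": [
                        {"state_name": "delete",
                         "conditions": {"min_index_age": delete_after}}
                    ],
                },
                # Final state: no transitions out of it.
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
        }
    }


print(json.dumps(build_ism_policy(), indent=2))
```

For the 3-day retention asked about in the question, `build_ism_policy("1d", "3d")` would be one possible starting point.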

How can we efficiently push data from csv file to dynamodb without using aws pipeline?

Considering the fact that there is no Data Pipeline available in the Singapore region, are there any alternatives available to efficiently push CSV data to DynamoDB?
If it were me, I would set up an S3 event notification on a bucket that fires a Lambda function each time a CSV file is dropped into it.
The notification would let Lambda know that a new file is available, and the Lambda function would be responsible for loading the data into DynamoDB.
This would work better (because of Lambda's limits) if the CSV files are not huge, so they can be processed in a reasonable amount of time. The bonus is that, once it is working, the only work needed is to drop new files into the right bucket; no server required.
Here is a GitHub repository that has a CSV->DynamoDB loader written in Java; it might help get you started.
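For a Python take on the same idea, here is a minimal sketch of the loading step a Lambda function could run once it has read the CSV text from S3. The function names are hypothetical; `batch_writer()` and `put_item(Item=...)` are the boto3 DynamoDB Table calls, and the table object is passed in so the code does not depend on AWS access. All values are loaded as strings, which is an assumption about the data.

```python
import csv
import io


def csv_to_items(csv_text):
    """Turn CSV text (with a header row) into a list of DynamoDB items."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


def load_items(table, items):
    """Write items through the table's batch writer; returns the item count."""
    # batch_writer groups the puts into BatchWriteItem requests.
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)
    return len(items)


# Usage inside a Lambda handler (assumes boto3; names are placeholders):
#   table = boto3.resource("dynamodb").Table("my-table")
#   load_items(table, csv_to_items(csv_text_read_from_s3))
print(csv_to_items("id,name\n1,alice\n2,bob\n"))
```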