Move data from AWS Elasticsearch to S3 - amazon-web-services

I have an application pumping logs to an AWS OpenSearch (earlier Elasticsearch) cluster. I want to move old logs to S3 to save cost and still be able to read the logs (occasionally).
One approach I can think of is writing a cron job that reads the old indexes, writes them (in text format) to S3, and deletes the indexes. This also requires keeping day-wise indexes. Is there a more efficient/better way?

You can use the manual snapshots approach to back up your indexes to S3: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-snapshots.html
Another option as suggested toward the end of the first link is to use a tool named Curator within lambda that will handle the index rotation:
https://docs.aws.amazon.com/opensearch-service/latest/developerguide/curator.html
Depending on your use case, UltraWarm could be the best approach if you want those logs to remain searchable later without the manual restores that the first two options require:
https://aws.amazon.com/blogs/aws/general-availability-of-ultrawarm-for-amazon-elasticsearch-service/
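For the manual snapshot approach mentioned above, the two REST calls involved can be sketched as follows. This is a minimal sketch, not a complete client: the repository name, bucket, region and role ARN are placeholders, the helper names are made up for illustration, and on the managed service the requests also have to be signed (SigV4), which is left out here.

```python
import json


def snapshot_repo_request(repo_name, bucket, region, role_arn):
    """Build (method, path, body) for registering an S3 snapshot repository.

    All argument values are placeholders supplied by the caller.
    """
    body = {
        "type": "s3",
        "settings": {"bucket": bucket, "region": region, "role_arn": role_arn},
    }
    return "PUT", f"/_snapshot/{repo_name}", json.dumps(body)


def snapshot_request(repo_name, snapshot_name, indices):
    """Build (method, path, body) for snapshotting the given indices."""
    body = {"indices": ",".join(indices), "include_global_state": False}
    return "PUT", f"/_snapshot/{repo_name}/{snapshot_name}", json.dumps(body)


if __name__ == "__main__":
    # Hypothetical names, only to show the shape of the requests.
    print(snapshot_repo_request(
        "my-repo", "my-log-archive", "us-east-1",
        "arn:aws:iam::123456789012:role/TheSnapshotRole"))
    print(snapshot_request("my-repo", "logs-2017-01-30", ["my-index-2017.01.30"]))
```

Once a snapshot has completed, the old index can be deleted from the cluster and restored from the repository when it is needed again.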

There is also a tool called elasticdump:
# Export ES data to S3 (using s3urls)
elasticdump \
--s3AccessKeyId "${access_key_id}" \
--s3SecretAccessKey "${access_key_secret}" \
--input=http://production.es.com:9200/my_index \
--output "s3://${bucket_name}/${file_name}.json"

Related

Export DynamoDB table to S3 automatically

The scenario is the following: I have a lambda function that does an http request to get the data of today and the last 365 days and stores them in DynamoDB. The function is triggered every day at 8am, so the most recent data is always saved in the DynamoDB table.
Now my goal is to export the DynamoDB table to an S3 file automatically on a daily basis as well, so I'm able to use services like QuickSight, Athena, Forecast on the data.
If possible and easily implementable, I'd like to have only one S3 file that gets appended with the most recent data of the day, because an extra file every day seems kinda pricey. If that's not possible, an extra file every day would also be fine.
What's the best way to go about doing so without using CLI (because I'm not allowed to install programs to my laptop) and without using Lambda (because I wouldn't know how to write a function for that without any tutorials)?
Take a look at Data Pipeline. Exporting a DynamoDB table to S3 is one of its documented use cases and most of the configuration is simple.
It will also not require any knowledge of Lambda and can be automated.
More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html
DynamoDB recently released a new, native feature to export your table's data to an S3 bucket. It supports exporting into DynamoDB JSON and Amazon Ion - see the documentation on how to use it at:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
This will enable you to run whatever analytics tools you'd like (Athena, etc.) on the data exported in S3.

AWS QuickSight can't see Athena DB in another region

My Athena DB is in ap-south-1 region and AWS QuickSight doesn't exist in that region.
How can I connect QuickSight with Athena in that case?
All you need to do is copy table definitions from one region to another. There are several ways to do that:
With AWS Console
This approach is the simplest one and doesn't require additional setup, as everything is based on Athena DDL statements.
Get table definition with
SHOW CREATE TABLE `database`.`table`;
This should output something like:
CREATE EXTERNAL TABLE `database`.`table`(
`col_1` string,
`col_2` bigint,
...
`col_n` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://some/location/on/s3'
TBLPROPERTIES (
'classification'='parquet',
...
'compressionType'='gzip')
Change to the desired region.
Create a database where you want to store table definitions, or use the default one.
Execute the statement produced by SHOW CREATE TABLE. Note that you might need to change the database name to match the previous step.
If your table is partitioned, then you need to load all partitions.
If the data on S3 adheres to the Hive partitioning style, i.e.
s3://some/location/on/s3
|
├── day=01
| ├── hour=00
| └── hour=01
...
then you can use
MSCK REPAIR TABLE `database`.`table`
Alternatively, you can load partitions one by one
ALTER TABLE `database`.`table`
ADD PARTITION (day='01', hour='00')
LOCATION 's3://some/location/on/s3/01/00';
ALTER TABLE `database`.`table`
ADD PARTITION (day='01', hour='01')
LOCATION 's3://some/location/on/s3/01/01';
...
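When there are many partitions to load one by one, the ALTER TABLE statements above can be generated instead of typed by hand. A minimal sketch, assuming the same day/hour partition columns and the same non-Hive-style location layout as in the example; the function name is made up for illustration:

```python
def add_partition_statements(database, table, base_location, partitions):
    """Return one ALTER TABLE ... ADD PARTITION statement per (day, hour) pair.

    Assumes partition columns named `day` and `hour` and data stored under
    <base_location>/<day>/<hour>, as in the example above.
    """
    statements = []
    for day, hour in partitions:
        statements.append(
            f"ALTER TABLE `{database}`.`{table}`\n"
            f"ADD PARTITION (day='{day}', hour='{hour}')\n"
            f"LOCATION '{base_location}/{day}/{hour}';"
        )
    return statements


if __name__ == "__main__":
    for stmt in add_partition_statements(
        "database", "table", "s3://some/location/on/s3",
        [("01", "00"), ("01", "01")],
    ):
        print(stmt)
```

Each generated statement can then be pasted into the Athena console or submitted through the API.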
With AWS API
You can use the AWS SDK, e.g. boto3 for Python, which provides an easy-to-use, object-oriented API. Here you have two options:
Use the Athena client. As in the previous approach, you would need to get the table definition statement from the AWS Console, but all other steps can be scripted with the start_query_execution method of the Athena client. There are plenty of resources online, e.g. this one
Use the AWS Glue client. This method is based solely on operations within the AWS Glue Data Catalog, which Athena uses during query execution. The main idea is to create two Glue clients, one for the source and one for the destination catalog. For example
import boto3
KEY_ID = "__KEY_ID__"
SECRET = "__SECRET__"
glue_source = boto3.client(
'glue',
region_name="ap-south-1",
aws_access_key_id=KEY_ID,
aws_secret_access_key=SECRET
)
glue_destination = boto3.client(
'glue',
region_name="us-east-1",
aws_access_key_id=KEY_ID,
aws_secret_access_key=SECRET
)
# Or you can do it with creating sessions
glue_source = boto3.session.Session(profile_name="profile_for_ap_south_1").client("glue")
glue_destination = boto3.session.Session(profile_name="profile_for_us_east_1").client("glue")
Then you would need to use the get- and create-type methods. This also requires parsing the responses that you get from the Glue clients.
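The "parsing" mostly means dropping the read-only fields that get_table returns but create_table does not accept. A minimal sketch of that step, assuming the destination database already exists; the helper names are made up, and partitions are not copied here (they would need a similar get_partitions / batch_create_partition pass):

```python
# Fields accepted by the TableInput structure of create_table; everything
# else in a get_table response (CreateTime, DatabaseName, ...) is read-only.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def to_table_input(table):
    """Reduce a get_table()["Table"] dict to a valid TableInput dict."""
    return {key: value for key, value in table.items() if key in TABLE_INPUT_KEYS}


def copy_table(glue_source, glue_destination, database, table_name):
    """Copy one table definition between two Glue clients (no partitions)."""
    table = glue_source.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue_destination.create_table(
        DatabaseName=database, TableInput=to_table_input(table)
    )
```

With the two clients created above, a call would look like copy_table(glue_source, glue_destination, "database", "table").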
With AWS Glue crawlers
Although you can use AWS Glue crawlers to "rediscover" data on S3, I wouldn't recommend this approach since you already know the structure of your data.
@Ilya Kisil's answer is correct, but I would like to add some more details and alternative solutions.
There are two different approaches you can take.
As suggested by Ilya, copy the table definitions from one region (source region) to another (destination region). The idea is to reference the data of the other region.
I found the Glue Crawlers much easier and faster. You need to create a Glue Crawler in the source region and specify the S3 bucket of the destination region where the metadata is located. Once you do, you will see all the tables of the destination region in Athena in the source region! Behind the scenes, the Glue Crawler does what Ilya explained in the "With AWS Console" section. So, instead of creating the tables one by one and loading the partitions (if they exist), you can just create one Glue Crawler.
Note that it holds a reference to your destination region tables, so it doesn't copy the data. At first glance, that seems great: why copy the data if we can reference it? But when you take a deeper look, you may find that you are going to pay more. When you reference data, you pay for the data each query returns, and if you query the data a lot and have TB/PB of it, that might be too expensive. If cost is a consideration for you, I would recommend the second solution.
Also note that although the data is only referenced and not copied to the source region, behind the scenes AWS saves the data temporarily in the source region when you execute a query. So, if you need to be GDPR compliant, you should be aware of that.
Copy the data from the destination region to the source region and have a process that keeps synchronizing it. Then you will not pay for the Athena queries, but rather pay for the storage, which is usually cheaper. If possible, you can also copy just what you need or aggregate the data, so you have less copied storage and therefore less cost.
A convenient way to do it is by creating a Glue Job that will be responsible for copying the data from the destination region S3 bucket to the source region S3 bucket. And then you can add it to a Glue Workflow that will run this job once a day or whatever is proper for you.
To Summarize:
There are lots of things to consider and I mentioned some of them. In each use case, you have advantages and disadvantages and you can find what is the right one for you.
(Solution 1) Advantages:
Easy. Just some clicks.
Fast.
Referencing the data and no need to have duplicated data.
(Solution 1) Disadvantages:
Might be way more expensive (depends on the data usage).
(Solution 2) Advantages:
Might be much cheaper
(Solution 2) Disadvantages:
Slow/Longer solution
Need to copy existing data and then have a process to copy new data

How to diff very large buckets in Amazon S3?

I have a use case where I have to back up a 200+TB, 18M object S3 bucket that changes often (used in batch processing of critical data) to another account. I need to add a verification step, but due to the bucket size, object count, and frequency of change this is tricky.
My current thoughts are to pull the eTags from the original bucket and the archive bucket, and then write a streaming diff tool to compare the values. Has anyone here had to approach this problem and, if so, did you come up with a better answer?
Firstly, if you wish to keep two buckets in sync (once you've done the initial sync), you can use Cross-Region Replication (CRR).
To do the initial sync, you could try using the AWS Command-Line Interface (CLI), which has an aws s3 sync command. However, it might have some difficulties with a large number of files -- I suggest you give it a try. It uses keys, dates and filesize to determine which files to sync.
If you do wish to create your own sync app, then eTag is a reliable way to compare files. One caveat: for objects uploaded via multipart upload, the eTag depends on the part size used, so identical content can end up with different eTags in the two buckets.
To make things simple, activate Amazon S3 Inventory, which can provide a daily listing of all files in a bucket, including eTag. You could then do a comparison between the Inventory files to discover which remaining files require synchronization.
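The comparison itself can be done as a streaming merge, so neither listing has to fit in memory. A minimal sketch, assuming each listing has already been reduced to (key, etag) pairs sorted by key (Inventory output is not guaranteed to be sorted, so a sort step may be needed first); the function name and status labels are made up for illustration:

```python
def diff_sorted_listings(source, backup):
    """Yield (status, key) differences between two key-sorted listings.

    Both arguments are iterables of (key, etag) tuples sorted by key.
    """
    src_iter, bak_iter = iter(source), iter(backup)
    src, bak = next(src_iter, None), next(bak_iter, None)
    while src is not None or bak is not None:
        if bak is None or (src is not None and src[0] < bak[0]):
            # Key exists only in the source bucket.
            yield ("missing_in_backup", src[0])
            src = next(src_iter, None)
        elif src is None or bak[0] < src[0]:
            # Key exists only in the backup bucket.
            yield ("extra_in_backup", bak[0])
            bak = next(bak_iter, None)
        else:
            # Same key on both sides: compare eTags.
            if src[1] != bak[1]:
                yield ("etag_mismatch", src[0])
            src, bak = next(src_iter, None), next(bak_iter, None)


if __name__ == "__main__":
    source = [("a", "1"), ("b", "2"), ("d", "4")]
    backup = [("a", "1"), ("b", "X"), ("c", "3")]
    for status, key in diff_sorted_listings(source, backup):
        print(status, key)
```

Because the bucket changes often, keys reported as missing may simply have been written after the backup inventory was taken, so the result is best treated as a list of candidates to re-check.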
For anyone looking for a way to solve this problem in an automated way (as was I),
I created a small python script that leverages S3 Inventories and Athena to do the comparison somewhat efficiently. (This is basically automation of John Rosenstein's suggestion)
You can find it here https://github.com/forter/s3-compare

Pointing multiple projects' log sinks to one bucket

I have a few GCP projects with log sinks to different storage buckets. I'd like to combine them into a single bucket. But the stackdriver export doesn't add any distinguishing information to the object names it creates; they all look like cloudaudit.googleapis.com/activity/2017/11/14/00:00:00_00:59:59_S0.json
What will happen if I start pushing them all to a single bucket? Will the different project sinks overwrite each other's objects? Is there any way to distinguish which project created the logs just from the object?
If not, I guess I should switch to pubsub sinks, and then write some code that produces objects with more desirable names. Are there any established patterns or examples for doing this?
Update: I filed https://issuetracker.google.com/issues/69371200 for this issue.
To enable this, just select custom destination on the sink and point to the bucket with this format: storage.googleapis.com/[BUCKET_ID].
I've just enabled this in a couple of my projects, as I'm curious to see the results when exporting to a bucket. However, I have been using a single BQ sink for all my projects, and the tables created have all the logs mixed, so no logs lost when using a single BQ sink.
I'm assuming a GCS sink will work in the same way, but I'll tell you in a couple of days.
If a single bucket sink does not work, you can always use a single BQ sink (that will help in analyzing the logs), and when you no longer want to have them in BQ, export them and store the files wherever you want.
Also, since you'll be writing to your sink constantly, you can't use nearline or coldline, so the storage pricing is better in BQ than a regional bucket (0.02 USD/GB in BQ vs somewhere between 0.02 and 0.35 USD/GB for regional storage, depending on the region; BQ has 10GB free monthly, GCS 5GB).
I would generally recommend using a BQ sink, but I'll tell you what happens with my bucket logs.
Update:
A few hours later, and I've verified that shared bucket sinks work pretty much as you would expect. It concatenates logs chronologically regardless of the project origin, and only creates a single file for each time window. Hope this helps! (I still prefer BQ as a log sink...)
Update 2:
For the behavior you seek in the feature request, I would use BQ, but you could just as easily grep the project ID and separate the logs:
grep '"logName":"projects/<your-project-id>/' mixed-log.json > single-project-log.json
Or just get a cloud function triggered by bucket updates (so, every time you receive a log file in the sink) to run this for you.
Or namespace your buckets and have a cloud function move the files to wherever you need as soon as they are written.
The possibilities are endless!
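If the separation is done in code rather than with grep, for example inside such a cloud function, it could look roughly like this. A minimal sketch, assuming the exported file holds one JSON log entry per line and that logName has the form projects/<project-id>/logs/...; the function name is made up for illustration:

```python
import json
from collections import defaultdict


def split_by_project(lines):
    """Group raw JSON log lines by the project ID found in their logName."""
    by_project = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = json.loads(line).get("logName", "").split("/")
        # Entries without a projects/<id>/... logName go into "unknown".
        if len(parts) > 1 and parts[0] == "projects":
            project = parts[1]
        else:
            project = "unknown"
        by_project[project].append(line)
    return dict(by_project)


if __name__ == "__main__":
    sample = [
        '{"logName":"projects/proj-a/logs/activity","x":1}',
        '{"logName":"projects/proj-b/logs/activity","x":2}',
    ]
    for project, entries in split_by_project(sample).items():
        print(project, len(entries))
```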
If you have an organization or folder which includes all the projects that you want to collect logs from, then you can create a sink that collects from all projects in that org/folder.
Unfortunately, you cannot do this from the Cloud Console. Instead, you must use gcloud with the --organization or --folder option, or the API.

AWS elasticsearch log rotation

I want to use AWS Elasticsearch to store the logs of my application. Since there is a huge amount of data going into AWS Elasticsearch (~30GB daily), I would only keep 3 days of data. Is there any way to schedule data removal from AWS Elasticsearch or do a log rotation? What happens if the AWS Elasticsearch storage is full?
Thanks for the help
A possible way is to set the index parameter in the elasticsearch output to something like "logstash-%{appname}-%{date_format}". You can then use the Curator plugin to delete the old indices by number of days or so.
This SO pretty much explains the same. Hope it helps!
I assume you are using the Amazon Elasticsearch Service?
The storage type is an EBS volume with a fixed size of disk space. If you want to keep only the last three days, I assume you have 3 indices then, like this:
my-index-2017.01.30
my-index-2017.01.31
my-index-2017.02.01
Basically, you can write a simple script which deletes indices older than 3 days. With the REST API (for example in Sense) it is just DELETE my-index-2017.01.30.
I recommend using Elasticsearch Curator for the job. See https://www.elastic.co/guide/en/elasticsearch/client/curator/current/delete_indices.html
I'm not sure if the Service interface itself has an option for that. But Elasticsearch Curator should do the job for you.
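For the simple-script route, the only real logic is deciding which daily indices have fallen out of the retention window. A minimal sketch, assuming index names of the form my-index-YYYY.MM.DD as in the example above; the function name and the prefix default are made up for illustration, and the actual DELETE requests are left out:

```python
from datetime import date, datetime, timedelta


def indices_to_delete(index_names, today, keep_days=3, prefix="my-index-"):
    """Return the daily indices that are older than the last `keep_days` days.

    With keep_days=3, today's index and the two previous days are kept.
    Names that don't match <prefix>YYYY.MM.DD are ignored.
    """
    cutoff = today - timedelta(days=keep_days - 1)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a dated index, leave it alone
        if day < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    names = [
        "my-index-2017.01.29",
        "my-index-2017.01.30",
        "my-index-2017.01.31",
        "my-index-2017.02.01",
    ]
    print(indices_to_delete(names, today=date(2017, 2, 1)))
```

Each returned name would then be passed to a DELETE request against the cluster.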
Update for 2020:
AWS ES now supports Index State Management, which lets you define custom management policies to automate routine tasks and apply them to indices and index patterns. You no longer need to set up and manage external processes to run your index operations.
For example, you can define a policy that moves your index into a read_only state after 30 days and then ultimately deletes it after 90 days.
Index State Management - https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/ism.html
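The read_only-after-30-days, delete-after-90-days example could be expressed roughly as the policy document below. A minimal sketch: the state names and description are arbitrary choices, the builder function is made up for illustration, and the resulting JSON would be sent to the domain's ISM policy endpoint (the exact path differs between Elasticsearch and OpenSearch versions, so check the linked documentation).

```python
import json


def build_ism_policy(read_only_after="30d", delete_after="90d"):
    """Build an ISM policy: hot -> read_only -> delete, driven by index age."""
    return {
        "policy": {
            "description": "Make indices read-only, then delete them",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [{
                        "state_name": "read_only",
                        "conditions": {"min_index_age": read_only_after},
                    }],
                },
                {
                    "name": "read_only",
                    "actions": [{"read_only": {}}],
                    "transitions": [{
                        "state_name": "delete",
                        "conditions": {"min_index_age": delete_after},
                    }],
                },
                {
                    "name": "delete",
                    "actions": [{"delete": {}}],
                    "transitions": [],
                },
            ],
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ism_policy(), indent=2))
```

For the three-day retention asked about in the question, the same structure could be reduced to a single transition to the delete state with a shorter min_index_age.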