AWS Data Lake Dynamo vs ElasticSearch - amazon-web-services

I am really struggling to understand how DynamoDB / Elasticsearch should be used to support AWS data lake efforts (metadata / catalogs). It seems as though you would log the individual S3 locations of your zip archives for your sources in DynamoDB, and any additional metadata / attributes you would like to search by in ES. If that is correct, how would you use the two together to support that? I tried to find more detailed information about how to properly pair the two, but have been unsuccessful. Any information / documentation that others have would be great. There is a good chance I am overlooking some obvious examples / documentation.
What I am imagining is something like the following:
A user could search for metadata / attributes in ES, which would point to the high-level S3 buckets / partitions that match.
DynamoDB would then be queried using that part of the key (partition / bucket) from the ES result.
That query would most likely return many individual objects / keys that could then be processed, extracted, etc.
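Roughly, a sketch of what I am imagining (the index name, table name, and attribute names below are just placeholders I made up, not anything AWS prescribes):

# Hypothetical sketch: search ES for matching metadata, then pull the object
# records for each matching bucket/partition out of DynamoDB.
import boto3
from boto3.dynamodb.conditions import Key
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://my-es-domain.example.com:443"])
dynamodb = boto3.resource("dynamodb")
catalog = dynamodb.Table("datalake-catalog")  # PK: partition_path, SK: object_key

# 1) Search ES for the attributes the user cares about.
resp = es.search(
    index="datalake-metadata",
    body={"query": {"match": {"source_system": "billing"}}},
)

# 2) Each hit carries the S3 bucket / partition it describes.
for hit in resp["hits"]["hits"]:
    partition_path = hit["_source"]["partition_path"]  # e.g. "s3://lake/billing/2021/05/"

    # 3) Query DynamoDB for every object key stored under that partition.
    result = catalog.query(
        KeyConditionExpression=Key("partition_path").eq(partition_path)
    )
    for item in result["Items"]:
        print(item["object_key"])  # hand off for processing / extraction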

I spoke to one of our AWS reps, who referred me to this article: AWS Data Lake. It was a great starting point, and it answered some of my questions about the use of components and the overall approach, which had previously been unclear to me.
Highlights:
Blueprint for implementing a data lake. Combining S3 / DynamoDB / ES is common.
There are many variations to the implementation. Substituting an RDS for ES / DynamoDB, using just ES, etc.
We will most likely start with an RDS to work out the process, then move to DynamoDB / ES.

Related

Is there a way to deal with changes in log schema?

I am in a situation where I need to extract log JSON data, which might have changes in its data structure, to AWS S3 in a real-time manner.
I am thinking of using AWS S3 + AWS Glue Streaming ETL. The thing is that the structure or schema of the log JSON data might change (these changes are unpredictable), so my solution needs to be aware of such changes and should still stream the log data smoothly without causing errors... But as far as I know, all the AWS Glue tutorials show the demo as if there are no changes in the structure of the incoming data.
Can you recommend or tell me a solution within AWS that's suitable for my case?
Thanks.
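One schema-agnostic sketch of the idea (not Glue-specific; the Firehose delivery stream name is made up) is to land the raw JSON in S3 untouched and defer schema interpretation to read time, for example by pushing records through Kinesis Data Firehose:

# Sketch: push raw JSON log records through Kinesis Data Firehose to S3
# without parsing them first, so schema changes cannot break ingestion.
# "my-log-delivery-stream" is a hypothetical delivery stream that is
# already configured with an S3 destination.
import json
import boto3

firehose = boto3.client("firehose")

def ship_log(record: dict) -> None:
    # The payload is written as-is; downstream readers (Glue, Athena, ...)
    # can deal with whatever fields happen to be present.
    firehose.put_record(
        DeliveryStreamName="my-log-delivery-stream",
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

ship_log({"level": "INFO", "msg": "user login", "new_field": 123})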

What does "partitioned data" mean? - S3

I want to use Netflix's outputCommitter (Using Spark with Amazon EMR).
In the README there are 2 options:
S3DirectoryOutputCommitter - for writing unpartitioned data to S3 with conflict resolution.
S3PartitionedOutputCommitter - for writing partitioned data to S3 with conflict resolution.
I tried to understand the difference, but without success. Can someone explain what "partitioned data" is in S3?
According to the Hadoop docs, "This committer [is] an extension of the "Directory" committer which has a special conflict resolution policy designed to support operations which insert new data into a directory tree structured using Hive's partitioning strategy: different levels of the tree represent different columns."
Search the Hadoop docs for the full details.
Be aware that the EMR committers are not the ASF S3A ones, so they take different config options and have their own docs. But since their work is a reimplementation of the Netflix work, they should do the same thing here.
I'm not familiar with outputCommitter, but partitioned data in Amazon S3 normally refers to splitting files amongst directories to reduce the amount of data that needs to be read from disk.
For example:
/data/month=1/
/data/month=2/
/data/month=3/
...
If a Hive-type query is run against the data with a clause like WHERE month=1, then it would only need to look in the month=1/ subdirectory, thereby saving 2/3rds of disk access.
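To illustrate, that directory layout is exactly what Spark produces when writing with a partition column, which is the case the partitioned committer is designed to handle. A quick sketch (the bucket name and columns are made up):

# Sketch: writing partitioned data to S3 with Spark.
# Each distinct value of "month" becomes a month=<value>/ subdirectory.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")],
    ["month", "value"],
)

df.write.mode("append").partitionBy("month").parquet("s3://my-bucket/data/")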

CloudFront Top Referrers Report - ALL referrer URLs

In AWS I can find this under:
CloudFront >> Reports & Analytics >> Top Referrers (CloudFront Top Referrers Report)
There I get the top 25 items. How can I get ALL of them?
I have turned on logging for my bucket, but it seems that the referrer is not part of the log file. Any idea how Amazon collects its top 25, and how I can get the whole list accordingly?
Thanks for your help, in advance.
Amazon's built-in analytics are, as you've noticed, rather basic. The data you're looking for all lives in the log files that you can set CloudFront up to export (in the cs(Referer) field). If you know what you're looking for, you can set up a little pipeline to download logs, pull out the numbers you care about and generate reports.
Amazon also makes it easy[1] to set up Athena or Redshift to look directly at CloudFront or S3 log files in their target bucket. After a one-time setup, you could query them directly for the numbers you need.
There are also paid services built to fill in the holes in Amazon's default reports. S3stat (https://www.s3stat.com/), for example, will give you a Top 200 Referrer list in its reports, with the ability to export complete lists.
[1] "easy", using Amazon's definition of the word, meaning really really hard.

Is there a way to query S3 object key names for the latest per prefix?

In an S3 bucket, I have thousands and thousands of files stored with names having a structure that comes down to prefix and number:
A-0001
A-0002
A-0003
B-0001
B-0002
C-0001
C-0002
C-0003
C-0004
C-0005
New objects for a given prefix should come in with varying frequency, but might not. Older objects may disappear.
Is there a way to efficiently query S3 for the highest number of every prefix, i.e. without listing the entire bucket? The result I want is:
A-0003
B-0002
C-0005
The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it? So far I have only found it capable of searching within objects, but all I care about are their key names. If it can report on the contents of objects in the bucket, can't it on the bucket itself?
I would be okay with the latest modification date per prefix, but I want to avoid having to switch to a versioned bucket with just the prefixes as names to achieve that.
I think this is what you are looking for:
Athena exposes each row's source object key through the pseudo-column "$path", and you can apply a regexp to match the pattern you are querying, e.g.:
WHERE regexp_extract(sp."$path", '[^/]+$') like concat('%',cast(current_date - interval '1' day as varchar),'.csv')
The S3 API itself does not seem to offer anything usable for that.
However, perhaps another service, like Athena, could do it?
Yes, at the moment there is no direct way of doing it with AWS S3 alone. Even with Athena, it will go through the files to query their content, but it will be easier using Athena's standard SQL support and it would be faster since the queries run in parallel.
So far I have only found it capable of searching within objects, but
all I care about are their key names.
Both Athena and S3 Select are for querying by content, not by keys.
The best approach I can recommend is to use AWS DynamoDB to keep the metadata of the files, including the file names, for faster querying.
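As a rough sketch of that idea (the table name, key names and Lambda wiring are assumptions, not anything AWS provides out of the box), a Lambda function triggered by the bucket's ObjectCreated notifications could keep the highest number per prefix in DynamoDB:

# Sketch: Lambda handler that records the highest sequence number seen
# per prefix, driven by S3 ObjectCreated events.
# Table "latest-per-prefix" (partition key "prefix") is hypothetical.
import boto3

table = boto3.resource("dynamodb").Table("latest-per-prefix")

def handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]       # e.g. "C-0005"
        prefix, number = key.rsplit("-", 1)

        # Update the item only if this number is higher than what is stored.
        try:
            table.update_item(
                Key={"prefix": prefix},
                UpdateExpression="SET latest_key = :k, latest_num = :n",
                ConditionExpression="attribute_not_exists(latest_num) OR latest_num < :n",
                ExpressionAttributeValues={":k": key, ":n": int(number)},
            )
        except table.meta.client.exceptions.ConditionalCheckFailedException:
            pass  # an equal or newer object was already recorded

Getting the latest key for a prefix is then a single get_item on that prefix, with no bucket listing involved.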

AWS Elasticsearch log rotation

I want to use AWS Elasticsearch to store the logs of my application. Since there is a huge amount of data going into AWS Elasticsearch (~30 GB daily), I would only keep 3 days of data. Is there any way to schedule data removal from AWS Elasticsearch, or to do log rotation? What happens if the AWS Elasticsearch storage is full?
Thanks for the help
A possible way is to set the index parameter in the Elasticsearch output to something like logstash-%{appname}-%{date_format}. You can then use the Curator plugin to delete the old indices by number of days or so.
This SO answer pretty much explains the same. Hope it helps!
I assume you are using the Amazon Elasticsearch Service?
The storage type is an EBS volume with a fixed amount of disk space. If you want to keep only the last three days, I assume you then have 3 indices, like these:
my-index-2017.01.30
my-index-2017.01.31
my-index-2017.02.01
Basically, you can write a simple script which deletes indices older than 3 days. With the REST API it is just DELETE my-index-2017.01.30 (e.g. run from Sense).
I recommend using Elasticsearch Curator for the job. See https://www.elastic.co/guide/en/elasticsearch/client/curator/current/delete_indices.html
I'm not sure if the Service interface itself has an option for that, but Elasticsearch Curator should do the job for you.
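A rough sketch of such a script over the plain REST API (the endpoint is made up, and in practice the domain's access policy or request signing still has to allow these calls):

# Sketch: delete indices whose date suffix is older than 3 days.
# Assumes daily indices named my-index-YYYY.MM.DD.
from datetime import datetime, timedelta
import requests

ES_ENDPOINT = "https://my-es-domain.example.com"  # hypothetical
cutoff = datetime.utcnow() - timedelta(days=3)

indices = requests.get(f"{ES_ENDPOINT}/_cat/indices/my-index-*?h=index").text.split()

for index in indices:
    date_part = index.replace("my-index-", "")
    if datetime.strptime(date_part, "%Y.%m.%d") < cutoff:
        requests.delete(f"{ES_ENDPOINT}/{index}")
        print(f"deleted {index}")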
Update for 2020:
AWS ES now has support for Index State Management, which lets you define custom management policies to automate routine tasks and apply them to indices and index patterns. You no longer need to set up and manage external processes to run your index operations.
For example, you can define a policy that moves your index into a read_only state after 30 days and then ultimately deletes it after 90 days.
Index State Management - https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/ism.html
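For illustration, such a policy might be created roughly like this (the endpoint path, policy name and thresholds are assumptions based on the Open Distro ISM API of that era; check the linked documentation for the exact API of your domain version):

# Sketch: ISM policy that makes indices read-only after 30 days and
# deletes them after 90 days. Endpoint and policy name are hypothetical.
import requests

ES_ENDPOINT = "https://my-es-domain.example.com"

policy = {
    "policy": {
        "description": "read_only at 30d, delete at 90d",
        "default_state": "hot",
        "states": [
            {"name": "hot", "actions": [],
             "transitions": [{"state_name": "read_only",
                              "conditions": {"min_index_age": "30d"}}]},
            {"name": "read_only", "actions": [{"read_only": {}}],
             "transitions": [{"state_name": "delete",
                              "conditions": {"min_index_age": "90d"}}]},
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
    }
}

requests.put(f"{ES_ENDPOINT}/_opendistro/_ism/policies/log-retention", json=policy)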