Using Amazon S3 as a limited-database - amazon-web-services

I have looked into this post on s3 vs database. But I have a different use case and want to know whether s3 is enough. The primary reason for using s3 instead of other databases on cloud is because of cost.
I have multiple __scraper__s that download data from websites and apis everyday. Most of them return data as Json format. Currently, I will insert them into mongodb. I will then run analysis by querying data out on a specific date or some specific fields or records that match a certain criteria. After querying the data, usually I will load them into a dataframe and do what is needed.
The data will not be updated. They need to be stored and ready for retrieval according to some criteria. I am aware of S3 Select which may be able to do the retrieval task.
Any recommendations?

The use cases you have mentioned above, it seems that you are not using the MongoDB capabilities(any database capability for say) to a greater degree.
I think S3 suites well for your use cases, in fact, you should go for S3-Infrequent access with life cycle policy to archive and then finally purge to be cost efficient.
I hope it will helps!

I think your code will be more efficient if you use dynamodb with all its feature. using s3 for database or data storage will make you code more complex. since you need to retrieve file from s3 every time and have to iterate thorough the file every time. And in case of dynamodb you can easily query and filter the data which is required. At the end s3 is a file storage and dynmodb is a database.

Related

AWS ETL Design Help: How Can I Maintain a new S3 Bucket with just the most recent data updates?

I am fairly new to AWS and am a bit overwhelmed with the options for a task I have.
Have: I have a dimensional model in an S3 bucket (read only access), that has a folder structure and contains partitioned parquet files. This bucket will be updated daily (+40GB a day), with changes to both dim and fact tables. I need to get this data out of S3, but it's extremely inefficient to set up a boto3 connection and repeatedly pull the entire raw data and continuously check if the data has even been updated.
What I was thinking for a solution: To maintain updated tables in another S3 bucket that I create (likely Athena query outputs), where I can just pull in the updated changes, so that boto can just check if there is data in the new bucket and pull, reducing load.
Considerations:
I need some kind of event notification that triggers the Athena query. I was looking into Lambda or Cloudwatch, but unsure which is better or restraints.
For the fact tables, I need an Athena query that gets the most recent "Last Updated" timestamp from the updated data. And then updates the updated bucket tables to include all the raw data that is greater than the found timestamp.
FYI: I am working with partitioned data, and I am not sure if I can just work with the tables as partitions (part-0000dim-table-3.parquet) or if additional steps are required to work with partitions.
For the dim tables, I need to somehow scan the entire table for changes (dim tables are a combination of SCD 0,1,2)... unsure how best to do this. In the worst case, I could just point the boto3 connection to the raw dim tables whenever the fact tables update.
What AWS APIs, workflows, should I think about using?
I am unclear on the constraints that I could run into with either Lambda, Cloudwatch, Athena step functions, etc. and trying to learn as I go. I am also struggling on how to compare Athena query results across the two buckets.
Thank you very much & if there is any more information that would help, just let me know!!

Data Storage and Analytics on AWS

I have one data analytics requirement on AWS. I have limited knowledge on Big Data processing, but based on my
analysis, I have figured out some options.
The requirement is to collect data by calling a Provider API every 30 mins. (data ingestion)
The data is mainly structured.
This data need to be stored in a storage (S3 data lake or Red Shift.. not sure)and various aggregations/dimensions from this data are to be provided through a REST API.
There is a future requirement to run ML algorithms on the original data and hence the storage need to be decided accordingly. So based on this, can you suggest:
How to ingest data (Lambda to run at a scheduled interval and pull data, store in the storage OR any better way to pull data in AWS)
How to store (store in S3 or RedShift)
Data Analytics (currently some monthly, weekly aggregations), what tools can be used? What tools to use if I am storing data in S3.
Expose the analytics results through an API. (Hope I can use Lambda to query the Analytics engine in the previous step)
Ingestion is simple. If the retrieval is relatively quick, then scheduling an AWS Lambda function is a good idea.
However, all the answers to your other questions really depend upon how you are going to use the data, and then work backwards.
For Storage, Amazon S3 makes sense at least for the initial storage of the retrieved data, but might (or might not) be appropriate for the API and Analytics.
If you are going to provide an API, then you will need to consider how the API code (eg using AWS API Gateway) will need to retrieve the data. For example, is it identical to the blob of data original retrieved, or are there complex transformations required or perhaps combining of data from other locations and time intervals. This will help determine how the data should be stored so that it is easily retrieved.
Data Analytics needs will also drive how your data is stored. Consider whether an SQL database sufficient. If there are millions and billions of rows, you could consider using Amazon Redshift. If the data is kept in Amazon S3, then you might be able to use Amazon Athena. The correct answer depends completely upon how you intend to access and process the data.
Bottom line: Consider first how you will use the data, then determine the most appropriate place to store it. There is no generic answer that we can provide.

Strategy for Updating Schema/Data of Data Stored in AWS S3

At my organization, we are using a stack of AWS S3, AWS Glue, and Athena to drive some reporting of internal metrics. In general, this stack is great for quick set up for reporting off of raw data (stored in S3). The problem we've come against is what to do if we notice we need to somehow update the data that's already stored in S3. For example, we want to update values in a column that have a certain string to update that value.
Unlike a database, we can't just run a query to update all the existing data. I've tried to see if we can utilize Glue Jobs to accomplish this, but from my limited understanding, it doesn't seem like it's meant to do ETL from a bucket back to the same bucket.
The only thing I can think is to write a custom tool that iterates through an S3 bucket, loads a file, provides the transformation, and puts it back, overwriting the original. It seems there has to be a better way though.
Updates are not handled in a native way in a traditional hive-like warehousing solution, which I deem Athena to be. A common solution is a kind of engineering workaround where you do "insert overwrite" a partition (borrowing Hive syntax, possible in Presto and hopefully also possible in Athena, which is based on Presto).
Other solutions include creating new tables and atomically replacing a view, which users are supposed to query, instead of querying the underlying table(s) directly.
As this is a common problem, there are also some ready to use solutions to it, but I do not know whether which/whether they are possible with Athena. They are certainly possible with Presto (Presto SQL):
Hive ACID transactional tables (updates currently required Hive runtime)
Data Lake (open sourced by Databricks; updates currently require Spark runtime)
Hudi (I know little about this one)

Database suggestion for large unstructured datasets to integrate with elasticsearch

A scenario where we have millions of records saved in database, currently I was using dynamodb for saving metadata(and also do write, update and delete operations on objects), S3 for storing files(eg: files can be images, where its associated metadata is stored in dynamoDb) and elasticsearch for indexing and searching. But due to dynamodb limit of 400kb for a row(a single object), it was not sufficient for data to be saved. I thought about saving for an object in different versions in dynamodb itself, but it would be too complicated.
So I was thinking for replacement of dynamodb with some better storage:
AWS DocumentDb
S3 for saving metadata also, along with object files
So which one is better option among both in your opinion and why, which is also cost effective. (Also easy to sync with elasticsearch, but this ES syncing is not much issue as somehow it is possible for both)
If you have any other better suggestions than these two you can also tell me those.
I would suggest looking at DocumentDB over Amazon S3 based on your use case for the following reasons:
Pricing of storing the data would be $0.023 for standard and $0.0125 for infrequent access per GB per month (whereas Document DB is $0.10per GB-month), depending on your size this could add up greatly. If you use IA be aware that your costs for retrieval could add up greatly.
Whilst you would not directly get the data down you would use either Athena or S3 Select to filter. Depending on the data size being queried it would take from a few seconds to possibly minutes (not the milliseconds you requested).
For unstructured data storage in S3 and the querying technologies around it are more targeted at a data lake used for analysis. Whereas DocumentDB is more driven for performance within live applications (it is a MongoDB compatible data store after all).

Loading data (incrementally) into Amazon Redshift, S3 vs DynamoDB vs Insert

I have a web app that needs to send reports on its usage, I want to use Amazon RedShift as a data warehouse for that purpose,
How should i collect the data ?
Every time, the user interact with my app, i want to report that.. so when should i write the files to S3 ? and how many ?
What i mean is:
- If do not send the info immediately, then I might lose it as a result of a connection lost, or from some bug in my system while its been collected and get ready to be sent to S3...
- If i do write files to S3 on each user interaction, i will end up with hundreds of files (on each file has minimal data), that need to be managed, sorted, deleted after been copied to RedShift.. that dose not seems like a good solution .
What am i missing? Should i use DynamoDB instead, Should i use simple insert into Redshift instead !?
If i do need to write the data to DynamoDB, should i delete the hold table after been copied .. what are the best practices ?
On any case what are the best practices to avoid data duplication in RedShift ?
Appreciate the help!
It is preferred to aggregate event logs before ingesting them into Amazon Redshift.
The benefits are:
You will use the parallel nature of Redshift better; COPY on a set of larger files in S3 (or from a large DynamoDB table) will be much faster than individual INSERT or COPY of a small file.
You can pre-sort your data (especially if the sorting is based on event time) before loading it into Redshift. This is also improve your load performance and reduce the need for VACUUM of your tables.
You can accumulate your events in several places before aggregating and loading them into Redshift:
Local file to S3 - the most common way is to aggregate your logs on the client/server and every x MB or y minutes upload them to S3. There are many log appenders that are supporting this functionality, and you don't need to make any modifications in the code (for example, FluentD or Log4J). This can be done with container configuration only. The down side is that you risk losing some logs and these local log files can be deleted before the upload.
DynamoDB - as #Swami described, DynamoDB is a very good way to accumulate the events.
Amazon Kinesis - the recently released service is also a good way to stream your events from the various clients and servers to a central location in a fast and reliable way. The events are in order of insertion, which makes it easy to load it later pre-sorted to Redshift. The events are stored in Kinesis for 24 hours, and you can schedule the reading from kinesis and loading to Redshift every hour, for example, for better performance.
Please note that all these services (S3, SQS, DynamoDB and Kinesis) allow you to push the events directly from the end users/devices, without the need to go through a middle web server. This can significantly improve the high availability of your service (how to handle increased load or server failure) and the cost of the system (you only pay for what you use and you don't need to have underutilized servers just for logs).
See for example how you can get temporary security tokens for mobile devices here: http://aws.amazon.com/articles/4611615499399490
Another important set of tools to allow direct interaction with these services are the various SDKs. For example for Java, .NET, JavaScript, iOS and Android.
Regarding the de-duplication requirement; in most of the options above you can do that in the aggregation phase, for example, when you are reading from a Kinesis stream, you can check that you don't have duplications in your events, but analysing a large buffer of events before putting into the data store.
However, you can do this check in Redshift as well. A good practice is to COPY the data into a staging tables and then SELECT INTO a well organized and sorted table.
Another best practice you can implement is to have a daily (or weekly) table partition. Even if you would like to have one big long events table, but the majority of your queries are running on a single day (the last day, for example), you can create a set of tables with similar structure (events_01012014, events_01022014, events_01032014...). Then you can SELECT INTO ... WHERE date = ... to each of this tables. When you want to query the data from multiple days, you can use UNION_ALL.
One option to consider is to create time series tables in DynamoDB where you create a table every day or week in DynamoDB to write every user interaction. At the end of the time period (day, hour or week), you can copy the logs on to Redshift.
For more details, on DynamoDB time series table see this pattern: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
and this blog:
http://aws.typepad.com/aws/2012/09/optimizing-provisioned-throughput-in-amazon-dynamodb.html
For Redshift DynamoDB copy: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/RedshiftforDynamoDB.html
Hope this helps.
Though there is already an accepted answer here, AWS launched a new service called Kinesis Firehose which handles the aggregation according to user defined intervals, a temporary upload to s3 and the upload (SAVE) to redshift, retries and error handling, throughput management,etc...
This is probably the easiest and most reliable way to do so.
You can write data to CSV file on local disk and then run Python/boto/psycopg2 script to load data to Amazon Redshift.
In my CSV_Loader_For_Redshift I do just that:
Compress and load data to S3 using boto Python module and multipart upload.
conn = boto.connect_s3(AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(bucket_name)
k = Key(bucket)
k.key = s3_key_name
k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
reduced_redundancy=use_rr )
Use psycopg2 COPY command to append data to Redshift table.
sql="""
copy %s from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
DELIMITER '%s'
FORMAT CSV %s
%s
%s
%s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,opt.delim,quote,gzip, timeformat, ignoreheader)
Just being a little selfish here and describing exactly what Snowplow ,an event analytics platform does. They use this awesome unique way of collecting event logs from the client and aggregating it on S3.
They use Cloudfront for this. What you can do is, host a pixel in one of the S3 buckets and put that bucket behind a CloudFront distribution as an origin. Enable logs to an S3 bucket for the same CloudFront.
You can send logs as url parameters whenever you call that pixel on your client (similar to google analytics). These logs can then be enriched and added to Redshift database using Copy.
This solves the purpose of aggregation of logs. This setup will handle all of that for you.
You can also look into Piwik which is an open source analytics service and see if you can modify it specific to your needs.