I am new to Snowflake and have been trying it out on AWS. I understand that Snowflake uses S3 as its storage layer. But S3 objects are immutable, so if S3 is used to store the data, how does Snowflake allow updates to the data?
The way Snowflake stores and manages data is very specific to Snowflake and is key to a lot of its unique functionality. While it supports standard SQL commands, what it actually does in the background is not what you might expect: it never updates data in place. Instead, it inserts new data and marks the existing data as "old". Similarly, it does not delete data when a user issues a DELETE command; it flags the data as deleted, and at some point in the future (depending on the type of account you have with Snowflake and how you've configured it) it physically removes your "deleted" and "old" data.
It is this way of working that enables you to undrop tables and do "time travel" on your data, e.g. query it as it was at a specific point in the past.
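For example, here is a hedged sketch using the Python connector (snowflake-connector-python); the account, credentials, and table name are placeholders, not from the question:

# Hypothetical sketch: time travel and UNDROP via snowflake-connector-python.
# Account, credentials, and table name are placeholders.
import snowflake.connector

con = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password"
)
cur = con.cursor()

# Query the table as it was one hour ago (offset is in seconds).
cur.execute("SELECT * FROM my_table AT(OFFSET => -3600)")
print(cur.fetchall())

# Restore a dropped table while it is still within the retention period.
cur.execute("UNDROP TABLE my_table")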
In an application that collects data from IoT devices from multiple customers in a single AWS Timestream table, what happens if a customer leaves and requests that all of their data be deleted, in accordance with the GDPR's right to erasure?
Timestream doesn't support deleting records, only updating them.
Possible solutions we have come up with so far:
update all records of the customer with zero values
use table per customer and then delete the whole table (problematic with large number of customers)
don't actually delete the data, but make it not relatable to the customer anymore (probably not GDPR compliant anyway)
Are there any other options here? Or is there no solution and we have to dump Timestream altogether and use a different product that supports deletes?
Another alternative is to use data retention, which deletes data in Timestream after a period you define; a sketch follows.
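A hedged sketch of configuring retention with Boto3 (the database and table names are placeholders):

# Hypothetical sketch: shorten Timestream retention so records age out.
import boto3

ts = boto3.client("timestream-write")

ts.update_table(
    DatabaseName="iot",
    TableName="device_metrics",
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 24,   # hot (memory) tier
        "MagneticStoreRetentionPeriodInDays": 30,  # data older than this is deleted
    },
)

Note that retention deletes by age, not per customer, so this only satisfies erasure requests if your retention window is shorter than the GDPR deadline.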
The GDPR says you need to delete the data within one month (it seems you can take up to two more months, but then you need to inform the user that the deletion will take longer).
Keeping data in Timestream can become expensive. I usually store the data in both Timestream (short retention) and S3: Timestream for fast analytics or near-real-time dashboarding, and S3 for ad-hoc queries, ML, or historical dashboarding.
My requirements:
I want to store real-time event data coming from e-commerce websites into a database.
In parallel to storing the data, I want to access the event data from the database.
I want to perform some sort of ad-hoc analysis (SQL).
Using some sort of built-in methods (either from Boto3 or the Java SDK), I want to access the event data.
I want to create some sort of custom APIs to access the event data stored in the database.
I recently came across the Amazon Aurora (MySQL) database.
I thought Aurora was a good fit for my requirements. But when I dug into Amazon Aurora (MySQL), I noticed that we can create a database using the AWS CDK,
BUT
1. No equivalent methods to create tables using the AWS CDK/Boto3
2. No equivalent methods in Boto3 or the Java SDK to store/access the database data
Can anyone tell me how I can create a table using IaC in the Aurora DB?
Can anyone tell me how I can store real-time data in Aurora?
Can anyone tell me how I can access real-time data stored in Aurora?
No equivalent methods to create tables using the AWS CDK/Boto3
This is because only Aurora Serverless can be accessed using the Data API, not a regular database instance.
You have to use regular MySQL tools (e.g., the mysql CLI, phpMyAdmin, MySQL Workbench, etc.) to create tables and populate them.
No equivalent methods in Boto3 or the Java SDK to store/access the database data
Same reason and solution as for point 1.
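That said, if you can use Aurora Serverless, the Data API does let you run DDL and DML through Boto3. A hedged sketch (the ARNs and table definition are placeholders, not from the original question):

# Hypothetical sketch: creating a table via the RDS Data API
# (Aurora Serverless only). ARNs and names below are placeholders.
import boto3

rds_data = boto3.client("rds-data")

rds_data.execute_statement(
    resourceArn="arn:aws:rds:us-east-1:123456789012:cluster:my-cluster",
    secretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret",
    database="productcatalogueinfo",
    sql="""
        CREATE TABLE IF NOT EXISTS events (
            id VARCHAR(64) PRIMARY KEY,
            payload JSON,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """,
)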
Can anyone tell me how I can create a table using IaC in the Aurora DB?
Terraform has a MySQL provider, but it is not for tables; it manages users and databases.
Can anyone tell me how I can store real-time data in Aurora?
There is no out-of-the-box solution for that, so you need a custom one. Maybe stream the data to Kinesis Data Streams or Firehose, then to a Lambda function that populates your DB? That seems easiest to implement; a sketch follows.
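A hedged sketch of the Lambda half of that idea, assuming the pymysql driver, an events table, and placeholder credentials (none of these are from the original question):

# Hypothetical sketch: Lambda handler that writes Kinesis records to Aurora.
import base64
import json

import pymysql

# Connection is created at module level so Lambda can reuse it across invocations.
conn = pymysql.connect(
    host="my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    user="root",
    password="admin123",
    database="productcatalogueinfo",
)

def handler(event, context):
    # Each Kinesis record carries a base64-encoded payload.
    with conn.cursor() as cur:
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            cur.execute(
                "INSERT INTO events (id, payload) VALUES (%s, %s)",
                (payload["id"], json.dumps(payload)),
            )
    conn.commit()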
Can anyone tell me how I can access real-time data stored in Aurora?
If you stream the data to a Kinesis stream first, you can use Kinesis Data Analytics to analyze it in real time.
Since many of the above require custom solutions, other architectures are possible.
Create a connection manager as follows (Kotlin/JDBC):

import java.sql.Connection
import java.sql.DriverManager
import java.sql.Statement

val dbName = "productcatalogueinfo"

// Replace localhost with your Aurora cluster endpoint and use your own credentials.
val con: Connection = DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/$dbName",
    "root",
    "admin123"
)

val stmt: Statement = con.createStatement()
// USE does not return a result set, so call execute() rather than executeQuery().
stmt.execute("USE productcatalogueinfo;")
Whenever your Lambda is triggered, it establishes this connection and can perform DDL operations too.
At my organization, we are using a stack of AWS S3, AWS Glue, and Athena to drive some reporting of internal metrics. In general, this stack is great for quickly setting up reporting off of raw data (stored in S3). The problem we've come up against is what to do when we need to update data that's already stored in S3. For example, we want to find values in a column that contain a certain string and update them.
Unlike a database, we can't just run a query to update all the existing data. I've tried to see if we can utilize Glue Jobs to accomplish this, but from my limited understanding, it doesn't seem like it's meant to do ETL from a bucket back to the same bucket.
The only thing I can think is to write a custom tool that iterates through an S3 bucket, loads a file, provides the transformation, and puts it back, overwriting the original. It seems there has to be a better way though.
Updates are not handled natively in a traditional Hive-like warehousing solution, which I deem Athena to be. A common solution is a kind of engineering workaround where you "insert overwrite" a partition (borrowing Hive syntax; this is possible in Presto and hopefully also possible in Athena, which is based on Presto).
Other solutions include creating new tables and atomically replacing a view, which users are supposed to query instead of the underlying table(s) directly; a sketch of that pattern follows.
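A hedged sketch of the view-swap pattern, driving Athena through Boto3 (database, table, and view names and the S3 locations are all placeholders):

# Hypothetical sketch: rebuild data with CTAS, then atomically repoint a view.
import boto3

athena = boto3.client("athena")

def run(sql: str) -> None:
    # start_query_execution is asynchronous; production code should poll
    # get_query_execution until the query finishes before continuing.
    athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "reporting"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )

# 1. Build a corrected copy of the data with CTAS.
run("""
    CREATE TABLE metrics_v2
    WITH (external_location = 's3://my-bucket/metrics_v2/', format = 'PARQUET') AS
    SELECT REPLACE(col, 'old_string', 'new_string') AS col, other_col
    FROM metrics_v1
""")

# 2. Atomically repoint the view that users query.
run("CREATE OR REPLACE VIEW metrics AS SELECT * FROM metrics_v2")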
As this is a common problem, there are also some ready-to-use solutions for it, but I do not know which of them are possible with Athena. They are certainly possible with Presto (Presto SQL):
Hive ACID transactional tables (updates currently require a Hive runtime)
Delta Lake (open-sourced by Databricks; updates currently require a Spark runtime)
Hudi (I know little about this one)
I have looked into this post on S3 vs. databases, but I have a different use case and want to know whether S3 is enough. The primary reason for using S3 instead of a cloud database is cost.
I have multiple scrapers that download data from websites and APIs every day. Most of them return data in JSON format. Currently, I insert it into MongoDB. I then run analyses by querying for data on a specific date, by specific fields, or for records matching certain criteria. After querying the data, I usually load it into a dataframe and do whatever is needed.
The data will not be updated. It needs to be stored and ready for retrieval according to some criteria. I am aware of S3 Select, which may be able to do the retrieval task; something like the sketch below.
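A hedged sketch of what that retrieval could look like with S3 Select via Boto3 (the bucket, key, and field names are placeholders):

# Hypothetical sketch: using S3 Select to filter JSON-lines scraper output.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-scraper-bucket",
    Key="scrapes/2021-01-01.jsonl",
    ExpressionType="SQL",
    Expression="SELECT s.* FROM s3object s WHERE s.category = 'electronics'",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; collect the matching records.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))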
Any recommendations?
From the use cases you have mentioned above, it seems that you are not using MongoDB's capabilities (or any database's capabilities, for that matter) to a great degree.
I think S3 suits your use cases well; in fact, you should go for S3 Infrequent Access with a lifecycle policy to archive and then finally purge the data, to be cost-efficient. A sketch of such a policy follows.
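A hedged sketch of that lifecycle policy with Boto3 (the bucket, prefix, and day counts are placeholders you would tune):

# Hypothetical sketch: transition objects to Infrequent Access, then Glacier,
# then expire them entirely. All values are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-scraper-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-purge",
                "Status": "Enabled",
                "Filter": {"Prefix": "scrapes/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)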
I hope this helps!
I think your code will be more efficient if you use DynamoDB with all its features. Using S3 as a database or data store will make your code more complex, since you need to retrieve the file from S3 every time and iterate through it every time. With DynamoDB you can easily query and filter the data you need. In the end, S3 is file storage and DynamoDB is a database.
The documentation just says that it is a query service, but it does not explicitly state whether it can perform data updates.
If Athena cannot do inserts or updates, is there any other AWS service that can, like a normal DB?
Amazon Athena is, indeed, a query service -- it only allows data to be read from Amazon S3.
One exception, however, is that the results of the query are automatically written to S3. You could, therefore, use a query to generate results that could be used by something else. It's not quite updating data but it is generating data.
My previous attempts to use Athena output in another Athena query didn't work due to problems with the automatically-generated header, but there might be some workarounds available.
If you are seeking a service that can update information in S3, you could use Amazon EMR, which is basically a managed Hadoop cluster. Very powerful and capable, and can most certainly update information in S3, but it is rather complex to learn.
Amazon Athena adds support for inserting data into a table using the results of a SELECT query or using a provided set of values
Amazon Athena now supports inserting new data to an existing table using the INSERT INTO statement.
https://aws.amazon.com/about-aws/whats-new/2019/09/amazon-athena-adds-support-inserting-data-into-table-results-of-select-query/
https://docs.aws.amazon.com/athena/latest/ug/insert-into.html
Bucketed tables not supported
INSERT INTO is not supported on bucketed tables. For more information, see Bucketing vs Partitioning.
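A hedged sketch of issuing such an INSERT through Boto3 (the database, tables, and output location are placeholders):

# Hypothetical sketch: Athena INSERT INTO via Boto3.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="""
        INSERT INTO events
        SELECT * FROM staging_events
        WHERE event_date = DATE '2021-01-01'
    """,
    QueryExecutionContext={"Database": "reporting"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)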
AWS S3 is object storage. Both Athena and S3 Select are for queries. The only way to modify an object (file) in S3 is to retrieve it, modify it, and upload it back to S3.
As of September 20, 2019 Athena also supports INSERT INTO: https://aws.amazon.com/about-aws/whats-new/2019/09/amazon-athena-adds-support-inserting-data-into-table-results-of-select-query/
Finally there is a solution from AWS: you can now perform CRUD (create, read, update, and delete) operations with AWS Athena. The Athena Iceberg integration is now generally available. Create the table with:
TBLPROPERTIES ( 'table_type' ='ICEBERG' [, property_name=property_value])
then you can use its amazing features.
For a quick introduction, you can watch this video, or search for "Insert / Update / Delete on S3 With Amazon Athena and Apache Iceberg | Amazon Web Services" on YouTube.
Also read the Considerations and Limitations page.
Athena supports CTAS (CREATE TABLE AS SELECT) statements as of October 2018. You can specify the output location and file format, among other options.
https://docs.aws.amazon.com/athena/latest/ug/ctas.html
To INSERT into tables, you can write additional files in the same format to the S3 path for a given table (this is somewhat of a hack), or preferably add partitions for the new data, for example:
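A hedged Boto3 sketch of registering newly written data as a partition (the table, partition column, and paths are placeholders, and the partition column must already be declared on the table):

# Hypothetical sketch: register new data as a partition in Athena.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="""
        ALTER TABLE reporting.raw_events
        ADD IF NOT EXISTS PARTITION (event_date = '2021-01-01')
        LOCATION 's3://my-bucket/raw_events/event_date=2021-01-01/'
    """,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)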
Like many big data systems, Athena is not capable of handling UPDATE statements.
We can use something known as Apache Iceberg together with Athena to perform CRUD operations on S3 data inside AWS itself.
The only caveat is that at table creation time we need to pass the extra table property table_type = 'ICEBERG'.
E.g.:
CREATE TABLE demo (
  id string,
  attr1 string
)
LOCATION 's3://path'
TBLPROPERTIES (
  'table_type' = 'ICEBERG'
)
For more details: https://www.youtube.com/watch?v=u1v666EXCJw
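Once the table is created as an Iceberg table, row-level DML works. A hedged sketch issuing the statements through Boto3 (the database and output location are placeholders):

# Hypothetical sketch: row-level DML on the Iceberg table above, via Boto3.
import boto3

athena = boto3.client("athena")

for sql in (
    "INSERT INTO demo VALUES ('1', 'foo')",
    "UPDATE demo SET attr1 = 'bar' WHERE id = '1'",
    "DELETE FROM demo WHERE id = '1'",
):
    # start_query_execution is asynchronous; production code should poll
    # get_query_execution until each statement finishes before issuing the next.
    athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )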