Dropping a table using Presto doesn't drop table data from the warehouse (HDFS)

I am trying to drop a table using Presto. The metadata gets dropped, but the table data in /user/hive/warehouse still exists.
Upon further digging, I found that this behavior is possibly due to the sticky bit that gets applied to the database directory and all the table directories inside it.
drwxrwxrwt - hadoop hadoop 0 2017-11-29 13:40 /user/hive/warehouse/testmanishdb.db
Now, if I remove the sticky bit on the database directory using the command below:
hadoop fs -chmod -t /user/hive/warehouse/testmanishdb.db/
the sticky bit is removed, as can be seen below:
drwxrwxrwx - hadoop hadoop 0 2017-11-29 13:44 /user/hive/warehouse/testmanishdb.db
Going forward, if I create a table in Presto and then drop it, both its metadata and the corresponding data under /user/hive/warehouse/ on HDFS are deleted.
This behavior is only observed in Presto, though: even without removing the sticky bit, dropping a table using Hive or Spark deletes both metadata and data.
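For reference, a minimal sketch of the check described above, run from the Presto CLI against the Hive connector (the schema and table names here are just examples):
-- Run in Presto against the Hive connector; names are examples
CREATE TABLE hive.testmanishdb.demo_table (id bigint);
DROP TABLE hive.testmanishdb.demo_table;
-- With the sticky bit removed, the table directory under
-- /user/hive/warehouse/testmanishdb.db/ disappears as well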
Thanks
Manish

Related

Getting incremental data from Amazon Aurora to Redshift via DMS using CDC

My company wants to build a data warehouse in Redshift. We have an OLTP database running in Amazon Aurora and we are thinking of using DMS (Data Migration Service). I am trying to get my head around the capabilities of CDC (change data capture). The thing is that CDC (over DMS) replicates and stores changes (in our case in Redshift), and I was wondering whether it is possible to select the specific columns I want to store (this should be possible with table mapping - include) and, more importantly, the columns based on which a change should be stored. As far as I understand it, if any column of a row is updated, the replication is triggered, which could mean a replication that is useless to me (e.g. if somebody updates a column that I do not want to follow).
E.g. I have a table with leads that has some 30 columns. For DW purposes I am only interested in 5 of them, and I want a new line in the Redshift table only if any of those 5 columns changes (is updated): if the stage of a lead changes, I get a new line. On the other hand, I am not interested in the column 'Salesmans_comment', so if a salesman updates a comment, I do not want a new line, because I am not interested in it. Cheers!
I have run through most of the available YouTube tutorials and read through the documentation, but I haven't found a clear answer.
Thanks

Retention and archival policy on Hive data

We have an AWS EMR cluster that includes Hive, with its metadata backed by Aurora and its data stored in S3. There are programs that create databases and tables in Hive and populate them with data.
After a while (say after 1 year), these databases are no longer needed, and we want to delete them automatically after a set period. The usual way is to set up a cron job that runs every month or so, finds the databases older than 1 year from an internal metadata table, and programmatically fires the Hive queries that delete them. But this has drawbacks; for example, manually created tables are not covered.
Is there any built-in Hive feature that does the above?
Hive is actually just a metadata store that defines how data should be interpreted. It does not manage any of the underlying data. (This is a major difference between Hive and a conventional database, and it is why Hive can use multiple file backends (HDFS and S3) in the same Hive instance.)
I'm going to guess you are using an S3 bucket for your data, so you likely want to look into expiring objects. This does exactly what you want: it deletes data after a period of time, and it will not disrupt Hive.
If you are using partitions, you may wish to do some additional cleanup.
MSCK REPAIR TABLE will help maintain the partitions in Hive, but it is really slow on S3 and can periodically time out. YMMV.
It's better to drop partitions:
ALTER TABLE bills DROP IF EXISTS PARTITION (mydate='2022-02') PURGE;
In Hive you can implement partition retention (since Hive 3.1.0).
For example, to drop partitions and their data after 7 days:
ALTER TABLE employees SET TBLPROPERTIES ('partition.retention.period'='7d');
There is no built-in Hive tool that removes databases according to a retention period.
You have been doing this for a while, so you are likely well aware of the risks of deleting metadata older than a year.
There are several ways to define retention on data, but none that I'm aware of to remove metadata.
Things you could look at:
You could add a trigger or scheduled job in Aurora to delete tables directly from the Hive metastore. Hive tables store their create time and last access time, so you could build some logic to work at that level.
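For example, a query against the Hive metastore in Aurora could find candidate tables like this (table and column names follow the standard metastore schema; adjust for your metastore version, and the one-year cutoff is just an example):
-- TBLS and DBS are standard Hive metastore tables; CREATE_TIME is Unix epoch seconds
SELECT d.NAME AS db_name, t.TBL_NAME, FROM_UNIXTIME(t.CREATE_TIME) AS created
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE t.CREATE_TIME < UNIX_TIMESTAMP(NOW() - INTERVAL 1 YEAR);
Rather than deleting metastore rows directly, it is probably safer to use a query like this to generate DROP TABLE / DROP DATABASE statements and run them through Hive, so data and metadata stay consistent.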

Is it possible to change a database (schema) name in AWS Athena?

I created a database and some tables with data in AWS Athena and would like to rename the database without deleting and re-creating the tables and the database. Is there a way to do this? I tried the standard SQL ALTER DATABASE, but it doesn't seem to work.
thanks!
I'm afraid there is no way to do this, according to this official forum thread. You would need to remove the database and re-create it. However, since Athena does not store any data by itself, deleting a table or a database won't impact your data stored on S3. Therefore, if you kept all the scripts that create the external tables, re-creating the database should be a fairly quick thing to do.
Athena doesn't support renaming a database. You need to recreate the database with a new name.
You can also use Presto, the open-source engine that Athena is built on, which supports more DDL statements.
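A minimal sketch of the recreate approach in Athena (the database, table, and S3 location names are placeholders):
-- Create the database under the new name
CREATE DATABASE new_db_name;
-- Re-create each external table, pointing at the same S3 data
CREATE EXTERNAL TABLE new_db_name.my_table (
  id bigint,
  name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/my_table/';
-- Once everything is re-created, drop the old database; the data on S3 is untouched
DROP DATABASE old_db_name CASCADE;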

Extracting multiple RDS MySQL tables to S3

I am rather new to AWS Data Pipeline, so any help will be appreciated. I have used the pipeline template RDStoS3CopyActivity to extract all contents of a table in RDS MySQL, and it seems to be working well. But there are 90 other tables to be extracted and dumped to S3, and I cannot imagine creating 90 pipelines, one for each table.
What is the best approach to this task? How could the pipeline be instructed to iterate through a list of table names?
I am not sure if this will ever get a response. However, at this early stage of exploration, I have developed a pipeline that seems to fit a preliminary purpose: extracting from 10 RDS MySQL tables and copying each to its respective sub-bucket on S3.
The logic is rather simple.
Configure the connection for the RDS MySQL instance.
Extract the data by specifying a query in the "Select Query" field for each table.
Drop in a Copy Activity and link it up for each table above. It runs on a specified EC2 instance; if you're running an expensive query, make sure you choose an EC2 instance with enough CPU and memory. This step copies the extracted dump, which lives temporarily in the EC2 tmp filesystem, to the designated S3 bucket you will set up next.
Finally, set up the designated / target S3 destination.
By default, the data extracted and loaded to the S3 bucket will be comma-separated. If you need it to be tab-delimited, then in the last target S3 destination:
- Add an optional field > select Data Format.
- Create a new data format. It will appear under the category 'Others'.
- Give it a name. I call it Tab Separated.
- Type: TSV. Hover the mouse over 'Type' to learn more about the other data formats.
- Column separator: \t (I could leave this blank since the type was already specified as TSV).
(Screenshot omitted.)
If the tables are all in the same RDS instance, why not use a SQLActivity pipeline with a SQL script containing multiple unload statements to S3?
You can just write one query and use one pipeline.
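Plain RDS MySQL has no UNLOAD equivalent, but if the instance were Aurora MySQL (which supports SELECT ... INTO OUTFILE S3, provided the cluster has an IAM role allowed to write to the bucket), the script passed to the SqlActivity might look roughly like this; the region, bucket, and table names are placeholders:
-- One statement per table; repeat for each of the 90 tables
SELECT * FROM customers
INTO OUTFILE S3 's3-us-east-1://my-export-bucket/customers/customers'
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
SELECT * FROM orders
INTO OUTFILE S3 's3-us-east-1://my-export-bucket/orders/orders'
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';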

Backup DynamoDB Table with dynamic columns to S3

I have read several other posts about this, in particular this question with an answer by greg about how to do it in Hive. I would like to know how to account for DynamoDB tables with a variable number of columns, though.
That is, the original DynamoDB table has rows that were added dynamically with different columns. I have tried to review the exportDynamoDBToS3 script that Amazon uses in their Data Pipeline service, but it has code like the following, which does not seem to map the columns:
-- Map DynamoDB Table
CREATE EXTERNAL TABLE dynamodb_table (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "MyTable");
(As an aside, I have also tried using the Data Pipeline system but found it rather frustrating, as I could not figure out from the documentation how to perform simple tasks like running a shell script without everything failing.)
It turns out that the Hive script I posted in the original question works just fine, but only if you are using the correct version of Hive. It seems that even with the install-hive command set to install the latest version, the version used actually depends on the AMI version.
After doing a fair bit of searching, I managed to find the following in Amazon's docs (emphasis mine):
Create a Hive table that references data stored in Amazon DynamoDB. This is similar to the preceding example, except that you are not specifying a column mapping. The table must have exactly one column of type map<string, string>. If you then create an EXTERNAL table in Amazon S3 you can call the INSERT OVERWRITE command to write the data from Amazon DynamoDB to Amazon S3. You can use this to create an archive of your Amazon DynamoDB data in Amazon S3. Because there is no column mapping, you cannot query tables that are exported this way. Exporting data without specifying a column mapping is available in Hive 0.8.1.5 or later, which is supported on Amazon EMR AMI 2.2.3 and later.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html
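For reference, a minimal sketch of the export described on that page (the S3 path and target table name are placeholders):
-- DynamoDB table exposed to Hive as a single map<string,string> column
CREATE EXTERNAL TABLE dynamodb_table (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "MyTable");
-- External table backed by S3 to receive the archive
CREATE EXTERNAL TABLE s3_export (item map<string,string>)
LOCATION 's3://my-backup-bucket/MyTable/';
-- Copy everything across; no column mapping is needed, but the exported data cannot be queried later
INSERT OVERWRITE TABLE s3_export SELECT * FROM dynamodb_table;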