Remember last filename created by Hive on S3 - amazon-web-services

Hi, I would like to know if there is a way to get the name of the last Parquet file that Hive created on S3 after I insert new data into a table?

Please look at the Hive warehouse directory on S3 for changes after you write the data to Hive.
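As a hedged sketch of that, assuming you know the table's warehouse location on S3, you can list the objects under that prefix with boto3 and pick the most recently modified one (the bucket and prefix below are placeholders):

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix; point these at your table's warehouse location.
bucket = "my-warehouse-bucket"
prefix = "warehouse/my_db.db/my_table/"

# Collect every object under the table's prefix (paginated for large tables).
objects = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    objects.extend(page.get("Contents", []))

# The most recently modified object should be the file the last insert wrote.
# Depending on how Hive names its output files, you may also want to filter on ".parquet".
latest = max(objects, key=lambda o: o["LastModified"])
print(latest["Key"], latest["LastModified"])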

Related

Athena tables having history of records of every csv

I am uploading CSV files to an S3 bucket, creating tables through a Glue crawler, viewing the tables in Athena, connecting Athena to QuickSight, and showing the results graphically there.
What I need to do now is keep a history of the files uploaded. Instead of a new CSV file being uploaded and the crawler updating the table, can I have the crawler save each record separately? Is that even a reasonable approach, or would it create so many tables that it becomes a mess?
I'm just trying to figure out a way to keep a history of previous records. How can I achieve this?
When you run an Amazon Athena query, Athena will look at the location parameter defined in the table's DDL. This specifies where the data is stored in an Amazon S3 bucket.
Athena will include all files in that location when it runs the query on that table. Thus, if you wish to add more data to the table, simply add another file in that S3 location. To replace data in that table, you can overwrite the file(s) in that location. To delete data, you can delete files from that location.
There is no need to run a crawler on a regular basis. The crawler can be used to create the table definition and it can be run again to update the table definition if anything has changed. But you typically only need to use the crawler once to create the table definition.
If you wish to preserve historical data in the table while adding more data to the table, simply upload the data to new files and keep the existing data files in place. That way, any queries will include both the historical data and the new data because Athena simply looks at all the files in that location.
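As a small illustration, here is a hedged boto3 sketch of "adding more data" by uploading another file into the table's location (the bucket, prefix, and file name are placeholders):

import boto3

s3 = boto3.client("s3")

# Hypothetical names; point these at the LOCATION defined in the table's DDL.
bucket = "my-data-bucket"
table_prefix = "athena/sales/"

# Dropping a new file under the table's LOCATION is enough for Athena to pick it
# up on the next query; no crawler run is needed just to add data.
s3.upload_file("daily_sales.csv", bucket, table_prefix + "daily_sales.csv")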

Spark SQL query to get the last updated timestamp of an Athena table stored as CSV in AWS S3

Is it possible to get the last updated timestamp of an Athena table stored as CSV in an S3 location using a Spark SQL query?
If yes, can someone please provide more information on it.
There are multiple ways to do this.
Use the Athena JDBC driver and do a Spark read where the format is jdbc. In this read you provide your "select max(timestamp) from table" query. Then, as the next step, just save the resulting Spark DataFrame to S3.
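A rough PySpark sketch of that JDBC approach, assuming the Simba Athena JDBC driver jar is on the Spark classpath; the driver class, connection URL, staging location, and table/column names below are assumptions to adapt to your setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("athena-last-updated").getOrCreate()

# Assumed connection details; adjust region, staging directory, credentials,
# and driver class to match your environment and driver version.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:awsathena://athena.us-east-1.amazonaws.com:443")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option("S3OutputLocation", "s3://my-athena-results/staging/")
    # Push the aggregation down to Athena via a subquery alias.
    .option("dbtable", "(select max(updated_at) as last_updated from my_db.my_table) t")
    .load()
)

# Save the single-row result from the Spark DataFrame to S3.
df.write.mode("overwrite").csv("s3://my-bucket/last-updated/")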
You can skip the JDBC read altogether and just use boto3 to run the above query. It would be a combination of start_query_execution and get_query_results. You can then save the result to S3 as well.
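A minimal boto3 sketch of that second approach (the database, table, column, and output location are placeholders):

import time
import boto3

athena = boto3.client("athena")

# Hypothetical query and result location.
query = "SELECT max(updated_at) FROM my_db.my_table"
output = "s3://my-athena-results/last-updated/"

# Kick off the query and wait for it to finish.
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": output},
)
query_id = execution["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Fetch the single-row result; Athena also writes a CSV of it to the output location.
rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
print(rows)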

AWS Glue Crawler Overwrite Data vs. Append

I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.
CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).
I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.
Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?
Thanks very much in advance!
It is not possible the way you are asking. The Crawler does not alter data.
The Crawler is populating the AWS Glue Data Catalog with tables only.
Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If you want to do data cleaning using Athena/Glue before using the data, you need to follow these steps:
1. Map the data with a Crawler into a temporary Athena database/table.
2. Profile your data using Athena SQL, QuickSight, etc. to get an idea of what you need to alter.
3. Use a Glue job to transform/clean/rename/dedupe the data using PySpark or Scala, and export it to a new S3 location (.csv / .parquet etc.), potentially partitioned (see the PySpark sketch below).
4. Run one more Crawler to map the cleaned data from the new S3 location into an Athena database.
The dedupe you are asking about happens in step 3.
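As a hedged sketch of step 3 above, here is what the transform/dedupe/export could look like in plain PySpark (the same logic runs inside a Glue job); the bucket paths, dedupe key, and partition column are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean-vendor-feed").getOrCreate()

# Hypothetical locations for the raw vendor CSVs and the cleaned output.
raw_path = "s3://my-raw-bucket/vendor-feed/"
clean_path = "s3://my-clean-bucket/vendor-feed/"

# Read the raw CSVs that the first crawler mapped.
df = spark.read.option("header", "true").csv(raw_path)

# Dedupe on an assumed business key; dropDuplicates keeps one row per key.
deduped = df.dropDuplicates(["record_id"])

# Export to a new S3 location as Parquet, partitioned by an assumed date column.
(
    deduped
    .withColumn("load_date", F.to_date("load_date"))
    .write.mode("overwrite")
    .partitionBy("load_date")
    .parquet(clean_path)
)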

Skip columns while copying data into redshift from S3 using copy command

I have a CSV table in S3 with hundreds of attributes/features, and I don't want to create a table in Redshift with all of these attributes before importing the data. Is there any way to select only the columns I need while copying data from S3 into Redshift?
You cannot achieve this using just a COPY command, but it is doable with a Python script. Please go through this:
Read specific columns from a csv file with csv module?
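A minimal sketch of that Python-script approach using the csv module, trimming the file to just the columns you need before running COPY (the file names and column names are placeholders):

import csv

# Hypothetical subset of columns to load into Redshift.
wanted = ["id", "customer", "amount"]

with open("full_export.csv", newline="") as src, open("trimmed.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=wanted)
    writer.writeheader()
    for row in reader:
        # Keep only the wanted columns; everything else is dropped.
        writer.writerow({col: row[col] for col in wanted})

# Upload trimmed.csv back to S3 and point the Redshift COPY command at that file.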
There are a couple of options listed in the AWS forum for this problem; take a look at https://forums.aws.amazon.com/message.jspa?messageID=432590 to see if they may work for you.

Getting s3 key name within EMR

I'm running a Hive script on EMR that's pulling data out of S3 keys. I can get all the data and put it in a table just fine. The problem is, some of the data I need is in the key name. How do I get the key name from within Hive and put it into the Hive table?
I faced a similar problem recently. From what I researched, it depends: you can get the data out of the "directory" part of S3 keys, but not the "filename" part.
You can use partitions if the S3 keys are formatted properly. Partitions can be queried the same way as columns. Here is a link with some examples: Loading data with Hive, S3, EMR, and Recover Partitions
You can also specify the partitions yourself if the S3 files are already grouped properly. For example, I needed the date information, so my script looked like this:
create external table Example(Id string, PostalCode string, State string)
partitioned by (year int, month int, day int)
row format delimited fields terminated by ','
tblproperties ("skip.header.line.count"="1");
alter table Example add partition(year=2014,month=8,day=1) location 's3n://{BucketName}/myExampledata/2014/08/01/';
alter table Example add partition(year=2014,month=8,day=2) location 's3n://{BucketName}/myExampledata/2014/08/02/';
...keep going
The partition data must be part of the "directory name" and not the "filename" because Hive loads data from a directory.
If you need to read some text out of the file name, I think you have to create a custom program to rename the objects so that the text you need is in the "directory name".
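A rough boto3 sketch of that renaming idea, copying each object to a key where the text from the file name becomes part of the directory path (the bucket, prefixes, and file-name convention are assumptions):

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefixes.
bucket = "my-emr-bucket"
src_prefix = "incoming/"
dst_prefix = "partitioned/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip directory marker objects

        # Assume file names look like "20140801_data.csv"; pull the date out of
        # the name and promote it to a directory level Hive can partition on.
        filename = key.rsplit("/", 1)[-1]
        date_part = filename.split("_", 1)[0]
        new_key = dst_prefix + "dt=" + date_part + "/" + filename

        # S3 has no rename, so copy to the new key and delete the original.
        s3.copy_object(Bucket=bucket, Key=new_key, CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)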
Good luck!