Possible to copy data from one S3 bucket to another via Hive?

How can I query one bucket via Hive and copy the results to another bucket in S3?
I have a DDL set up to run Avro queries, but I want to transfer the subset of results from my filter to a new bucket/location in S3.

You can just use a CREATE TABLE AS SELECT statement in Presto to copy data from one catalog to another.
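For example, here is a minimal sketch, assuming two Hive catalogs (called hive_a and hive_b here, each backed by a different S3 bucket) are configured in Presto; the schema, table, and column names are placeholders:

-- copy the filtered subset from one catalog (bucket) to the other
CREATE TABLE hive_b.target_schema.filtered_events AS
SELECT event_id, event_type, created_at
FROM hive_a.source_schema.avro_events
WHERE event_type = 'purchase';

In Hive itself, a CTAS with a LOCATION clause pointing at the destination bucket (CREATE TABLE … LOCATION 's3://other-bucket/path/' AS SELECT …) achieves the same effect with a single metastore.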

Related

Query multiple CSV files in S3 through Athena

I exported my SQL DB to S3 in CSV format. Each table was exported into separate CSV files and saved in Amazon S3. Now, can I send a query to that S3 bucket that joins multiple tables (multiple CSV files in S3) and returns a result set? How can I do that and save the result in a separate CSV file?
The steps are:
Put all files related to one table into a separate folder (directory path) in the S3 bucket. Do not mix files from multiple tables in the same folder because Amazon Athena will assume they all belong to one table.
Use a CREATE TABLE statement to define a new table in Amazon Athena, and specify where the files are kept via the LOCATION 's3://bucket_name/[folder]/' parameter. This tells Athena which folder to use when reading the data.
Or, instead of using CREATE TABLE, an easier way is:
Go to the AWS Glue management console
Select Create crawler
Select Add a data source and provide the location in S3 where the data is stored
Provide other information as prompted (you'll figure it out)
Then, run the crawler and AWS Glue will look at the data files in the specified folder and will automatically create a table for that data. The table will appear in the Amazon Athena console.
Once you have created the tables, you can use normal SQL to query and join them, as in the sketch below.
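As a rough sketch of the whole flow (bucket name, folder names, column types, and the comma-delimited layout are all assumptions here):

-- one table per folder of CSV files
CREATE EXTERNAL TABLE customers (
  customer_id int,
  name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket_name/customers/';

CREATE EXTERNAL TABLE orders (
  order_id int,
  customer_id int,
  total double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket_name/orders/';

-- normal SQL join across the two folders of CSV files
SELECT c.name, SUM(o.total) AS total_spent
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.name;

Athena also writes each query's result set as a CSV file to its configured query-results location in S3, which covers saving the output as a separate CSV file.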

How to directly copy Amazon Athena tables into Amazon Redshift?

I have some JSON files in S3 and I was able to create databases and tables in Amazon Athena from those data files. That's done; my next target is to copy those tables into Amazon Redshift. There are other tables in Amazon Athena that I created based on those data files: I created three tables from the data files in S3, and later I created new tables from those three. So at the moment I have five different tables that I want to create in Amazon Redshift, with or without data.
I checked the COPY command in Amazon Redshift, but there is no COPY command for Amazon Athena. Here is the list of available sources:
COPY from Amazon S3
COPY from Amazon EMR
COPY from Remote Host (SSH)
COPY from Amazon DynamoDB
If there is no other solution, I plan to export new JSON files based on the newly created Amazon Athena tables into S3 buckets. Then I can easily copy those from S3 into Redshift, right? Are there any better solutions for this?
If your S3 files are in an OK format, you can use Redshift Spectrum.
1) Set up a Hive metadata catalog of your S3 files, using AWS Glue if you wish.
2) Set up Redshift Spectrum to see that data inside Redshift (https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html)
3) Use CTAS to create a copy inside Redshift:
create table redshift_table as select * from redshift_spectrum_schema.redshift_spectrum_table;
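Expanding on step 2, a minimal sketch of the external schema setup (the Glue database name, IAM role ARN, and schema name are placeholders):

-- expose a Glue/Hive catalog database inside Redshift as an external schema
create external schema redshift_spectrum_schema
from data catalog
database 'my_glue_database'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;

Every table in that catalog database is then queryable as redshift_spectrum_schema.<table> and can be materialized inside Redshift with the CTAS above.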

AWS Athena - duplicate columns due to partitioning

We have a Glue crawler that reads Avro files in S3 and creates a table in the Glue catalog accordingly.
The thing is that we have a column named 'foo' that came from the Avro schema, and we also have something like 'foo=XXXX' in the S3 bucket path, to get Hive partitions.
What we did not know is that the crawler would then create a table with two columns of the same name, hence our issue when querying the table:
HIVE_INVALID_METADATA: Hive metadata for table mytable is invalid: Table descriptor contains duplicate columns
Is there a way to tell Glue to map the partition 'foo' to another column name, like 'bar'?
That way we would avoid having to reprocess our data by specifying a new partition name in the S3 bucket path.
Or any other suggestions?
Glue Crawlers are pretty terrible; this is just one of the many ways they create unusable tables. I think you're better off just creating the tables and partitions with a simple script. Create the table without the foo column, and then write a script that lists your files on S3 and does the Glue API calls (BatchCreatePartition), or execute ALTER TABLE … ADD PARTITION … calls in Athena.
Whenever new data is added on S3, just add the new partitions with the API call or an Athena query. There is no need to do all the work that Glue Crawlers do if you know when and how data is added. If you don't, you can use S3 notifications to run Lambda functions that do the Glue API calls instead. Almost all solutions are better than Glue Crawlers.
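A hedged sketch of the Athena route (table name, columns, and bucket are assumptions, and depending on the engine version the Avro schema may also need to be supplied via TBLPROPERTIES ('avro.schema.literal'=…)); note the partition key is declared as bar even though the S3 paths still say foo=…:

CREATE EXTERNAL TABLE mytable (
  foo string,      -- the data column from the Avro schema
  other_col int    -- placeholder for the remaining columns
)
PARTITIONED BY (bar string)  -- renamed partition key avoids the duplicate
STORED AS AVRO
LOCATION 's3://my-bucket/data/';

-- register each partition explicitly, pointing at the existing foo=XXXX paths
ALTER TABLE mytable ADD IF NOT EXISTS
PARTITION (bar = 'XXXX')
LOCATION 's3://my-bucket/data/foo=XXXX/';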
The beauty of Athena and Glue Catalog is that it's all just metadata; it's very cheap to throw it all away and recreate it. You can also create as many tables as you want that use the same location, to try out different schemas. In your case there is no need to move any objects on S3; you just need a different table and a different mechanism to add partitions to it.
You can fix this by updating the schema of the Glue table and renaming the duplicate column:
Open the AWS Glue console.
Choose the table name from the list, and then choose Edit schema.
Choose the column name foo (not the partitioned column foo), enter a new name, and then choose Save.
Reference:
Resolve HIVE_INVALID_METADATA error

What does an AWS Glue Crawler do?

I've read the AWS Glue docs re: the crawlers here: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html but I'm still unclear on what exactly the Glue crawler does. Does a crawler go through your S3 buckets and create pointers to those buckets?
When the docs say "The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog" what is the purpose of these metadata tables?
The crawler creates the metadata that allows Glue and services such as Athena to view the information in S3 as a database with tables. That is, it allows you to create the Glue Catalog.
This way you can see the information that S3 holds as a database composed of several tables.
For example if you want to create a crawler you must specify the following fields:
Database --> Name of database
Service role --> service-role/AWSGlueServiceRole
Selected classifiers --> Specify Classifier
Include path --> S3 location
Crawlers are needed to analyze the data in a specified S3 location and generate/update the Glue Data Catalog, which is basically a metastore for the actual data (similar to the Hive metastore). In other words, it persists information about the physical location of the data, its schema, format, and partitions, which makes it possible to query the actual data via Athena or to load it in Glue jobs.
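To make that concrete, a crawler run over, say, s3://my-bucket/sales/ would register roughly this kind of entry in the catalog (names, types, and format are invented for illustration); only this metadata is stored, never the data itself:

CREATE EXTERNAL TABLE sales (
  order_id bigint,
  amount double
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/sales/';

With that entry in place, Athena can run SELECT * FROM sales WHERE dt = '2023-05-01' directly against the files in S3.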
I would suggest reading this documentation to understand Glue crawlers better and, of course, making some experiments.

Creating Table As substitution

I am currently working with AWS Athena, and it does not support CREATE TABLE AS, which is fine, so I thought I would approach it by doing INSERT OVERWRITE DIRECTORY S3://PATH and then loading from S3, but apparently that doesn't work either. How would I create a table from a query if both of these options are out the window?
Amazon Athena is read-only. It cannot be used to create tables in Amazon S3.
However, the output of an Amazon Athena query is stored in Amazon S3 and could be used as input for another query, though you'd have to know the path of the output.
Amazon Athena is ideal for individual queries against data stored in Amazon S3, but it is not the best tool for ETL actions, which typically involve transforming data, storing it, and then sequentially processing it again.
You don't have to use INSERT; just create an external table over the location of the previous query results:
https://aws.amazon.com/premiumsupport/knowledge-center/athena-query-results/
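A minimal sketch of that approach, assuming Athena's default CSV output (the results path and column names are placeholders; you may want to copy the result file to a prefix of its own first, since Athena writes .metadata files alongside it):

CREATE EXTERNAL TABLE previous_results (
  col1 string,
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-query-results-bucket/some-prefix/'
TBLPROPERTIES ('skip.header.line.count' = '1');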