How to create an external table using an S3 location from the Glue catalog? - amazon-athena

I have a Glue job that writes CSV data to an S3 bucket. I would like to create an external table or view based on this data, but I do not want to hardcode the S3 location in the table definition; I want to use the S3 location stored in the Glue catalog. Something like this:
CREATE VIEW t1 as (
SELECT * FROM some_glue_catalog_table
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
XY problem explanation:
What I actually need is to split my CSV data into columns. My CSV data is properly catalogued, but when I use a SELECT statement on it with Athena, all my data is shoved into the first column and the rest of the columns are empty. I've looked for an option that allows me to specify things like the delimiter and quotation character, but those options only seem to be available for CREATE EXTERNAL TABLE statements. Which is why I'm now asking the question above.
I do not wish to use the Athena API/AWS CLI for this.

Related

How to skip files with specific extension on Redshift external tables?

I have a partitioned location on S3 with data I want to read via Redshift External Table, which I create with the SQL statement CREATE EXTERNAL TABLE....
The only thing is that I have some metadata files within these partitions with, for example, the extension .txt, while the data I'm reading is .json.
Is it possible to inform Redshift to skip those files, in a similar manner to Glue Crawler exclude patterns?
e.g. Glue crawler exclude patterns
Can you try using the pseudocolumns in the SQL and excluding based on the path name?
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE_usage.html
create external table as ....
select .....
where "$path" like '%.json'

Duplicate Table in AWS Glue using AWS Athena

I have a table in AWS Glue which uses an S3 bucket for its data location. I want to execute an Athena query on that existing table and use the query results to create a new Glue table.
I have tried creating a new Glue table, pointing it to a new location in S3, and piping the Athena query results to that S3 location. This almost accomplishes what I want, but:
a .csv.metadata file is put in this location along with the actual .csv output (which is read by the Glue table, since it reads all files in the specified S3 location), and
the CSV file places double quotes around each field, which breaks any field schema defined in the Glue table that uses numeric types.
These services are all designed to work together, so there must be a proper way to accomplish this. Any advice would be much appreciated :)
The way to do that is by using CTAS query statements.
A CREATE TABLE AS SELECT (CTAS) query creates a new table in Athena from the results of a SELECT statement from another query. Athena stores data files created by the CTAS statement in a specified location in Amazon S3.
For example:
CREATE TABLE new_table
WITH (
  external_location = 's3://my_athena_results/new_table_files/'
) AS (
  -- Here goes your normal query
  SELECT *
  FROM old_table
);
There are some limitations, though. For your case the most important are:
The destination location for storing CTAS query results in Amazon S3 must be empty.
The same applies to the name of the new table, i.e. it shouldn't already exist in the AWS Glue Data Catalog.
In general, you don't have explicit control of how many files will be created as a result of CTAS query, since Athena is a distributed system.
However, you can work around this by using the bucketed_by and bucket_count fields within the WITH clause to control the number of output files:
CREATE TABLE new_table
WITH (
  external_location = 's3://my_athena_results/new_table_files/',
  bucketed_by = ARRAY['some_column_from_select'],
  bucket_count = 1
) AS (
  -- Here goes your normal query
  SELECT *
  FROM old_table
);
Apart from creating new files and defining a table associated with them, you can also convert your data to a different file format, e.g. Parquet or JSON.
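For instance, a minimal sketch of converting to Parquet within the same CTAS statement (the table name and S3 path are illustrative):
CREATE TABLE new_table_parquet
WITH (
  external_location = 's3://my_athena_results/new_table_parquet/',
  format = 'PARQUET',
  parquet_compression = 'SNAPPY'
) AS
SELECT *
FROM old_table;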
I guess you have to change your SerDe. If you are querying CSV data, either OpenCSVSerde or LazySimpleSerDe should work for you.
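For the original question above, a minimal sketch of such a table using OpenCSVSerde (the columns, bucket path, and header-skipping property are assumptions about your data):
CREATE EXTERNAL TABLE my_csv_table (
  col1 string,
  col2 string,
  col3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
LOCATION 's3://my-bucket/path/written/by/the/glue/job/'
TBLPROPERTIES ('skip.header.line.count' = '1');
Note that the location still has to be spelled out here; alternatively, the same SerDe and SERDEPROPERTIES can be set directly on the existing Glue catalog table, which avoids creating a second table.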

Problems while uploading quoted data to Redshift from S3 using AWS GLUE. How do I insert the data?

I am trying to insert a data set in Redshift with values as :
"2015-04-12T00:00:00.000+05:30"
"2015-04-18T00:00:00.000+05:30"
"2015-05-09T00:00:00.000+05:30"
"2015-05-24T00:00:00.000+05:30"
"2015-07-19T00:00:00.000+05:30"
"2015-08-02T00:00:00.000+05:30"
"2015-09-05T00:00:00.000+05:30"
The crawler which I ran over the S3 data is unable to identify the columns or the datatype of the values. I have been tweaking the table settings to get the job to push the data into Redshift, but to no avail. Here is what I have tried so far:
Manually added the column in the table definition in Glue Catalog. There is only 1 column which is mentioned above.
Changed the Serde serialization lib from LazySimpleSerde to org.apache.hadoop.hive.serde2.lazy.OpenCSVSerDe
Added the following Serde parameters - quoteChar ", line.delim \n, field.delim \n
I have already tried different combinations of line.delim and field.delim properties. Including one, omitting another and taking both at the same time as well.
Changed the classification from UNKNOWN to text in the table properties.
Changed the recordCount property to 469 to match the raw data row counts.
The job runs are always successful. After a job run, when I do select * from table_name, I always get the correct count of rows in the Redshift table as per the raw data, but all the rows are NULL. How do I populate the rows in Redshift?
The table properties have been uploaded in an image album here: Imgur Album
I was unable to push the data into Redshift using Glue, so I turned to the COPY command of Redshift. Here is the command that I executed, in case anyone else needs it or faces the same situation:
copy schema_Name.Table_Name
from 's3://Path/To/S3/Data'
iam_role 'arn:aws:iam::Redshift_Role'
FIXEDWIDTH 'Column_Name:31'
region 'us-east-1';

Converting data in AWS S3 to another schema structure (also in S3)

quite a beginner's question -
I have log data stored in S3 files, in zipped JSON format.
The files reside in a directory hierarchy which reflects partitioning, in the following way: s3://bucket_name/year=2018/month=201805/day=201805/some_more_partitions/file.json.gz
I recently changed the schema of the logging to a slightly different directory structure. I added some more partition levels; the fields currently reside inside the JSON and I want to move them into the folder hierarchy. I also changed the inner JSON schema slightly. The new logs reside in a different S3 bucket.
I wish to convert the old logs to the new format, because I have Athena mapping over the new schema structure.
Is AWS EMR the tool for this? If so, what's the simplest way to achieve it? I thought I'd need an EMR cluster of type step execution, but doesn't that produce just one output file?
Thanks
Yes, Amazon EMR is an appropriate tool to use.
You could use Hive, which has similar-ish syntax to Athena (a rough sketch of these steps follows below):
Create an External Table pointing to your existing data, using your old schema
Create an External Table pointing to where you wish to store the data, using your new schema
INSERT INTO new-table SELECT * FROM old-table
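A rough sketch of those three steps in Hive (the table names, columns, and choice of JSON SerDe are assumptions, since the actual schema isn't shown in the question):
-- Old layout: partitioned by year/month/day, with the extra fields still inside the JSON.
-- On EMR the hcatalog JSON SerDe is available; other Hive setups may need its jar added.
CREATE EXTERNAL TABLE old_logs (
  event_time string,
  tenant_id string,
  payload string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://old-bucket/';

MSCK REPAIR TABLE old_logs;

-- New layout: tenant_id promoted out of the JSON into the partition path.
CREATE EXTERNAL TABLE new_logs (
  event_time string,
  payload string
)
PARTITIONED BY (year string, month string, day string, tenant_id string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://new-bucket/';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- With dynamic partitioning, the partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE new_logs PARTITION (year, month, day, tenant_id)
SELECT event_time, payload, year, month, day, tenant_id
FROM old_logs;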
If your intention is to query the data with Amazon Athena, you can use Amazon EMR to convert the data into Parquet format, which will give even better query performance.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Yes EMR can be used for such conversion.
Here's sample code that converts data arriving in CSV format (in the stg, i.e. source, folder) to the ORC file format. You may want to use INSERT OVERWRITE in case you have overlapping partitions between your staging (source) files and target files.
DROP TABLE IF EXISTS db_stg.stg_table;

CREATE EXTERNAL TABLE db_stg.stg_table (
  GEO_KEY string,
  WK_BEG_DT string,
  FIS_WK_NUM double,
  AMOUNT1 double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://bucket.name/stg_folder_name/'
TBLPROPERTIES ('has_encrypted_data'='false');

DROP TABLE IF EXISTS db_tgt.target_table;

-- The partition column is declared only in PARTITIONED BY, not in the column list.
CREATE EXTERNAL TABLE db_tgt.target_table (
  GEO_KEY string,
  WK_BEG_DT string,
  AMOUNT1 double
)
PARTITIONED BY (FIS_WK_NUM double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 's3://bucket.name/tgt_folder_name/'
TBLPROPERTIES ('orc.compress'='SNAPPY');

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- With dynamic partitioning, the partition column must be the last column in the SELECT list.
INSERT OVERWRITE TABLE db_tgt.target_table PARTITION (FIS_WK_NUM)
SELECT
  GEO_KEY,
  WK_BEG_DT,
  AMOUNT1,
  FIS_WK_NUM
FROM db_stg.stg_table;
Agree with John that converting to a columnar file format like Parquet or ORC (along with compression like SNAPPY) will give you the best performance with AWS Athena.
Remember, the key to using Athena is to optimize the amount of data you scan and read. Hence, if the data is in a columnar format and you are reading only certain partitions, your AWS Athena cost will go down significantly. All you need to do is make sure your Athena queries use filter conditions that select the required partitions.
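For example (the table name and partition columns follow the directory layout from the question and are purely illustrative):
-- Only the partitions matching the filter are scanned, so far less data is read and billed.
SELECT *
FROM my_logs
WHERE year = '2018'
  AND month = '201805';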

Can an Athena table be created for S3 bucket sub-directories?

Our s3 buckets generally have a number of sub-directories, so that the path to a bucket is something like s3:top-level-function-group/more-specific-folder/org-tenant-company-id/entityid/actual-data
We're looking into Athena to be able to query against data on that /actual-data level, but within the org-tenant-company-id, so that would have to be passed as some kind of parameter.
Or would that org-tenant-company-id be a partition?
Is it possible to create an Athena table that queries against this structure? And what would the S3 location be in the create table wizard? I tried it with s3:top-level-function-group/more-specific-folder/ but when it ran, I think it said something like '0 KB data read'.
You can create a partitioned table as follows, where the partition keys are defined only in the PARTITIONED BY clause, not in the list of table fields:
CREATE EXTERNAL TABLE mydb.mytable (
id int,
stuff string,
...
)
PARTITIONED BY (
orgtenantcompanyid string
)
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/';
After creating the table, you can then load individual partitions:
ALTER TABLE mydb.mytable ADD PARTITION (orgtenantcompanyid='org1')
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/org1';
Result rows will contain the partition fields like orgtenantcompanyid.
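For example:
-- orgtenantcompanyid comes from the partition definition, not from the data files themselves.
SELECT id, stuff, orgtenantcompanyid
FROM mydb.mytable
WHERE orgtenantcompanyid = 'org1';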
Yes, it is possible to create tables that only use contents of a specific subdirectory.
It's normal that after creating your table you see 0kb read. That's because no data is read when you CREATE a table.
To check whether you can actually query the data, do something like:
SELECT * FROM <table_name> LIMIT 10
Partitioning only makes sense if the data structure is identical in all the different directories so that the table definition applies to all the data under the location.
And yes, it's possible to use the path structure to create partitions. However, this doesn't happen automatically unless the paths are in the /key=value/ format. You can use the path as an attribute, though, as explained here: How to get input file name as column in AWS Athena external tables
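A minimal sketch of that approach, assuming the path layout from the question (the regular expression and the alias are illustrative):
-- "$path" exposes the full S3 key of each row's source object;
-- regexp_extract pulls out the path segment that follows more-specific-folder/.
SELECT
  "$path" AS source_file,
  regexp_extract("$path", 'more-specific-folder/([^/]+)/', 1) AS org_tenant_company_id,
  id,
  stuff
FROM mydb.mytable;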