How to use multiple file formats in Athena

I have multiple files with different formats (CSV, JSON, and Parquet) in an S3 bucket directory (all files are in the same directory). All the files have the same structure. How can I use these files to create an Athena table?
Is there a provision to specify a different SerDe for each format while creating the table?
Edit: The table gets created, but there is no data when I preview it.

There are a few options, but in my opinion it is best to create separate paths (folders) for each type of file and run a Glue Crawler on each of them. You will end up with multiple tables, but you can consolidate them using Athena views (a sketch follows below), or convert the files to a single format using Glue, for instance.
If you want to keep the files in one folder, you can use include and exclude patterns in the Glue Crawler. In that case you will still have to create a separate table for each type of file.
https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
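For instance, a minimal sketch of the view approach, assuming the crawler produced three tables with identical columns (all table and column names here are hypothetical, not from the question):

-- Consolidate the per-format tables behind a single view.
-- sales_csv, sales_json, sales_parquet and their columns are assumptions.
CREATE OR REPLACE VIEW sales_all AS
SELECT id, amount, created_at FROM sales_csv
UNION ALL
SELECT id, amount, created_at FROM sales_json
UNION ALL
SELECT id, amount, created_at FROM sales_parquet;

Queries against sales_all then see the data from all three formats as one table.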

Related

Is it required to have one table schema per S3 folder, so that the crawler can pick up the data in AWS Glue?

When I try to have multiple files with different table schemas in one S3 folder and use that location to create multiple tables using a crawler and AWS Glue, Athena doesn't detect any data and returns blank results. However, if the folder contains files with only a single table schema (tables with the same column structure), it detects the data well. So the question is: is there a way Athena can create multiple tables with different structures from the same S3 folder?
I have tried creating different folders for different files, and the crawler picks up the table schema well and gives the exact result. However, this is not feasible, since creating different folders for hundreds of files is not a solution. Hence I am searching for another way.
When defining a table in Amazon Athena (and AWS Glue), the location parameter should point to a folder path in an Amazon S3 bucket.
When running a query, Athena will read every file in that folder, including sub-folders.
Therefore, you should only keep files of the same format (and schema) in that directory and all of its subdirectories. All of these files will populate the one table.
Do not put multiple files in the same directory if they are meant to populate different tables or have different schemas.
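A minimal sketch of such a table definition (the table, columns, and bucket are assumptions for illustration):

-- LOCATION points at a folder, not a file; Athena reads every object
-- under it, so only files sharing this schema should live there.
CREATE EXTERNAL TABLE customer_events (
    event_id   string,
    event_type string,
    created_at string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/customer_events/';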

Glue crawler is not combining data - also no visible data in tables

I'm testing this architecture: Kinesis Firehose → S3 → Glue → Athena. For now I'm using dummy data generated by Kinesis; each line looks like this: {"ticker_symbol":"NFLX","sector":"TECHNOLOGY","change":-1.17,"price":97.83}
However, there are two problems. First, the Glue Crawler creates a separate table per file. I've read that if the schemas match, Glue should produce only one table. As you can see in the screenshots below, the schemas are identical. In the crawler options, I tried toggling "Create a single schema for each S3 path" on and off, but nothing changed.
The files also sit in the same path, which leads to the second problem: when those tables are queried, Athena doesn't show any data. That's likely because the files share a folder; I've read about it here (point 1) and tested it several times: if I remove all but one file from the S3 folder and crawl, Athena shows data.
Can I force Kinesis to put each file in a separate folder, or force Glue to record the data in a single table?
(Screenshots of the File1 and File2 schemas omitted; both show the same columns.)
Regarding AWS Glue creating separate tables, there could be a few reasons, based on the AWS documentation:
Confirm that these files use the same schema, format, and compression type as the rest of your source data. This doesn't seem to be your issue, but to make sure, I suggest testing with smaller files by dropping all but a few rows in each file.
Combine compatible schemas when you create the crawler by choosing "Create a single schema for each S3 path". For this to work, the file schemas should be similar, the setting should be enabled, and the data should be compatible. For more information, see "How to Create a Single Schema for Each Amazon S3 Include Path".
When using CSV data, be sure that you're using headers consistently. If some of your files have headers and some don't, the crawler creates multiple tables.
One other really important point: you should have one folder at the root, and the partition sub-folders inside it. If you have partitions at the S3 bucket level, the crawler will not create one table (mentioned by Sandeep in this Stack Overflow question).
I hope this helps you resolve your problem.
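To illustrate the last point, a minimal sketch that roots one table at a single folder with partition sub-folders below it (the bucket, prefix, and partition value are assumptions; the columns come from the dummy data above):

-- One table rooted at one folder; each day of Firehose output becomes
-- a partition sub-folder registered explicitly below.
CREATE EXTERNAL TABLE tickers (
    ticker_symbol string,
    sector        string,
    change        double,
    price         double
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/firehose-output/';

-- Firehose's default yyyy/MM/dd prefixes are not Hive-style,
-- so each partition has to be added by hand (or via a script).
ALTER TABLE tickers ADD PARTITION (dt = '2019-08-13')
LOCATION 's3://my-bucket/firehose-output/2019/08/13/';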

AWS Glue crawler - Order of columns in input files

I have created two partitions in an S3 bucket and am loading a CSV file into each of the folders. I then run the Glue crawler on top of these files, which registers a table in the Glue catalog that I am able to query via Athena.
Partition 1: a CSV file with 5 columns.
Partition 2: a CSV file with the same 5 columns as above, but in a different order compared to (1).
When I run the crawler the first time on (1), it creates the Glue table/schema. Later, when I upload the same data in a different column order to a different partition as (2) and run the crawler, it just maps the second file onto the schema already created as part of (1), which results in data issues.
Is the order of columns important in Glue? Shouldn't the crawler identify the columns by name, instead of expecting (2) to be in the same order as (1)?
Order is important in CSV files: any change makes the crawler think the schema is different, because CSV columns are bound by position. With Parquet files, however, the order can vary, since columns are matched by name.
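A minimal sketch of why this happens (the table and columns are hypothetical): a CSV table definition binds fields to columns strictly by position, so a file whose columns are reordered is still read in declaration order:

-- Athena/Hive maps each CSV field to these columns left to right,
-- by position, never by header name.
CREATE EXTERNAL TABLE orders_csv (
    order_id string,   -- always taken from the 1st field of each row
    customer string,   -- always taken from the 2nd field
    amount   double    -- always taken from the 3rd field
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/orders/';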

Export athena table to S3 as one readable file

I am baffled: I cannot figure out how to export the result of a successfully run CREATE TABLE statement to a single CSV.
The query "saves" the result of my CREATE TABLE command in an appropriately named S3 bucket, partitioned into 60 (!) files. Alas, these files are not readable text files.
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid AS
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
How can I save this table to S3 as a single file in CSV format, without having to download and re-upload it?
If you want the result of a CTAS query statement to be written to a single file, you need to bucket it by one of the columns in your resulting table. To get the resulting files in CSV format, you also need to specify the table's format and field_delimiter properties.
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid
WITH (
    format = 'TEXTFILE',
    field_delimiter = ',',
    external_location = 's3://my_athena_results/ctas_query_result_bucketed/',
    bucketed_by = ARRAY['__SOME_COLUMN__'],
    bucket_count = 1
) AS (
    SELECT *
    FROM targetsmart_idl_data_pa_mi_deduped_aaid
    UNION ALL
    SELECT *
    FROM targetsmart_idl_data_pa_mi_deduped_idfa
);
Athena is a distributed system, and it will scale the execution of your query by some unobservable mechanism. Note that even explicitly specifying a bucket count of one might still produce multiple files [1].
See the Athena documentation for more information on its syntax and what can be specified within the WITH directive. Also, don't forget about the considerations and limitations for CTAS queries, e.g. the external_location for storing CTAS query results in Amazon S3 must be empty, etc.
Update 2019-08-13
Apparently, the results of CTAS statements are compressed with the GZIP algorithm by default. I couldn't find in the documentation how to change this behavior. So all you would need to do is uncompress the files after downloading them locally. NOTE: the uncompressed files won't have a .csv file extension, but you will still be able to open them with a text editor.
Update 2019-08-14
You won't be able to preserve column names inside the files if you save them in CSV format. Instead, they are recorded in the AWS Glue metadata catalog, together with other information about the newly created table.
If you want to preserve column names in the output files after executing CTAS queries, you should consider file formats that inherently do so, e.g. JSON or Parquet. You can choose one via the format property within the WITH clause (see the sketch below). The choice of file format really depends on the use case and the size of the data. Go with JSON if your files are relatively small and you want to download and read their content virtually anywhere. If the files are big and you plan to keep them on S3 and query them with Athena, go with Parquet.
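For example, a variation of the query above that writes Parquet instead of delimited text (the external_location path is an assumption; everything else reuses the names from the question):

CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid
WITH (
    format = 'PARQUET',
    -- hypothetical output path; must be an empty S3 location
    external_location = 's3://my_athena_results/ctas_parquet_result/'
) AS (
    SELECT * FROM targetsmart_idl_data_pa_mi_deduped_aaid
    UNION ALL
    SELECT * FROM targetsmart_idl_data_pa_mi_deduped_idfa
);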
Athena stores query results in Amazon S3.
A results file is stored automatically in CSV format (*.csv), so results can be exported to a CSV file without a CREATE TABLE statement (https://docs.aws.amazon.com/athena/latest/ug/querying.html).
Execute the Athena query using the StartQueryExecution API; the results .csv can then be found at the output location specified in the API call (https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html).

how to combine multiple s3 files into one using Glue

I need some help combining multiple files, stored in different company partitions in S3, into one file with the company name as one of the columns.
I am new to this and am not able to find any information; I also spoke to support, and they say it is not supported. But in DataStage it is a basic function to combine multiple files into one.
Please throw some light.
Regards,
Prakash
If the column names in the files are the same and the number of columns is also the same, Glue will automatically combine them.
Make sure the files you want to combine are in the same folder on S3 and that your Glue crawler is pointing to that folder.
Review the AWS Glue examples, particularly the Join and Relationalize Data in S3 example. It shows how to use a Python script to do joins and filters with transforms.