How to CTAS with external location as csv.gz - amazon-web-services

I have close to 90 GB of data that needs to be uploaded to an S3 bucket with a specific naming convention.
If I use a CTAS query with external_location, it does not give me the option to give the file a specific name. Additionally, csv is not an available value for format.
CREATE TABLE ctas_csv_partitioned
WITH (
format = 'TEXTFILE',
external_location = 's3://my_athena_results/ctas_csv_partitioned/',
partitioned_by = ARRAY['key1']
)
AS SELECT name1, address1, comment1, key1
FROM tables1
I want the uploaded output file to look like sample_file.csv.gz
What is the easiest way to go about this?

Unfortunately, there is no way to specify either the file name or its extension with Athena alone. Moreover, files created with a CTAS query won't have any file extension at all. However, you can rename the files directly in S3 with the CLI.
aws s3 ls s3://path/to/external/location/ --recursive \
| awk '{cmd="aws s3 mv s3://path/to/external/location/"$4 " s3://path/to/external/location/"$4".csv.gz"; system(cmd)}'
I just tried this snippet and everything worked fine. However, sometimes an empty file s3://path/to/external/location/.csv.gz would also get created. Note that I didn't include the --recursive option for aws s3 mv, since it would also produce weird results.
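If you'd rather do the renaming from Python than from the shell (the awk approach breaks on keys that contain spaces), a minimal boto3 sketch of the same idea could look like this; the bucket and prefix are taken from the CTAS example above, and since S3 has no real rename, it is a copy followed by a delete:
import boto3

s3 = boto3.client('s3')
bucket = 'my_athena_results'            # bucket from the CTAS example above
prefix = 'ctas_csv_partitioned/'        # the external_location prefix

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        # Skip zero-byte folder markers and files that already carry the suffix
        if obj['Size'] == 0 or key.endswith('.csv.gz'):
            continue
        # The managed copy handles multipart transfers for large objects
        s3.copy({'Bucket': bucket, 'Key': key}, bucket, key + '.csv.gz')
        s3.delete_object(Bucket=bucket, Key=key)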
As far as the format is concerned, you simply need to add field_delimiter = ',' to the WITH clause.
CREATE TABLE ctas_csv_partitioned
WITH (
format = 'TEXTFILE',
field_delimiter = ',',
external_location = 's3://my_athena_results/ctas_csv_partitioned/',
partitioned_by = ARRAY['key1']
)
AS SELECT name1, address1, comment1, key1
FROM tables1

Related

Convert result of Athena query to S3 multipart files

I am running an Athena query, trying to get the difference between two tables as below:
CREATE TABLE TableName
WITH (
format = 'TEXTFILE',
write_compression = 'NONE',
external_location = s3Location)
AS
SELECT * FROM currentTable
EXCEPT
SELECT * FROM previousTable;
I would like to store the result of the query in the specified s3Location as S3 multipart files (files with an extension like .0000_part_00) and in TSV format. How can I achieve it?
I am trying to do this programmatically in Java.
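For the TSV part, adding field_delimiter with a tab character to the CTAS WITH clause should do it. The question asks about Java, but here is a rough Python/boto3 sketch of submitting such a CTAS (the AWS SDK for Java exposes the same StartQueryExecution operation); the S3 locations and the database name below are placeholders:
import boto3

athena = boto3.client('athena')

s3_location = 's3://my-bucket/diff-output/'   # placeholder for s3Location

# The Python '\t' escape puts a literal tab character into the query text,
# which satisfies Athena's single-character field_delimiter requirement.
ctas = f"""
CREATE TABLE TableName
WITH (
    format = 'TEXTFILE',
    field_delimiter = '\t',
    write_compression = 'NONE',
    external_location = '{s3_location}'
)
AS
SELECT * FROM currentTable
EXCEPT
SELECT * FROM previousTable
"""

response = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={'Database': 'default'},   # assumption: tables live in "default"
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-query-results/'},
)
print(response['QueryExecutionId'])
Note that start_query_execution only submits the query; poll get_query_execution until the state is SUCCEEDED before reading the files under the external location.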

Athena removing delimiter in output CSV files

I'm using Athena to write some gzip files to S3.
Query
CREATE TABLE NODES_GZIPPED_NODESTEST5
WITH (
external_location = 'my-bucket',
format = 'TEXTFILE'
)
AS SELECT col1, col2
FROM ExistingTableIHave
LIMIT 10;
The table is just 2 columns, but when I create this table and check the external_location, the files are missing the comma delimiter between the data. How can I ensure the CSVs it writes to S3 keep the commas?
You can add field_delimiter = ',' to the WITH clause.
From the AWS docs:
Optional and specific to text-based data storage formats. The single-character field delimiter for files in CSV, TSV, and text files. For example, WITH (field_delimiter = ','). Currently, multicharacter field delimiters are not supported for CTAS queries. If you don't specify a field delimiter, \001 is used by default.

AWS Athena create external table succeeds even if AWS s3 doesn't have file in it?

create external table reason ( reason_id int,
retailer_id int,
reason_code string,
reason_text string,
ordering int,
creation_date date,
is_active tinyint,
last_updated_by int,
update_date date
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE
location 's3://bucket_name/athena-workspace/athena-input/'
TBLPROPERTIES ("skip.header.line.count"="1");
The query above executes successfully; however, there are no files in the provided location!
Upon successful execution the table is created and is empty. How is this possible?
Even if I upload a file to the provided location, the created table is still empty!
Athena is not a data store; it is simply a serverless tool for reading data in S3 using SQL-like expressions.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
This query creates the table's metadata; it doesn't write to that location, it reads from it.
If you put a CSV into the location and performed select * from reason, Athena would attempt to map any file under the prefix athena-workspace/athena-input/ in bucket bucket_name to your schema, using the ROW FORMAT and SERDEPROPERTIES to parse the files. It would also skip the first line, assuming it is a header.
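To make this concrete, here is a hedged Python/boto3 sketch of that flow: upload a file that matches the SerDe to the input prefix (note the SerDe above uses a tab separator), then query the table. The sample row, the database name, and the results output location are placeholders, not from the original thread:
import time
import boto3

s3 = boto3.client('s3')
athena = boto3.client('athena')

bucket = 'bucket_name'                         # bucket from the DDL above
prefix = 'athena-workspace/athena-input/'

# The SerDe is configured with separatorChar = "\t", so write a tab-separated file.
# The first line is a header, which skip.header.line.count = 1 will drop.
sample = (
    "reason_id\tretailer_id\treason_code\treason_text\tordering\tcreation_date\tis_active\tlast_updated_by\tupdate_date\n"
    "1\t10\tRC1\tDamaged item\t1\t2020-01-01\t1\t5\t2020-01-02\n"
)
s3.put_object(Bucket=bucket, Key=prefix + 'sample.tsv', Body=sample.encode('utf-8'))

# Now the table is no longer empty: Athena reads whatever sits under the prefix.
qid = athena.start_query_execution(
    QueryString='SELECT * FROM reason',
    QueryExecutionContext={'Database': 'default'},   # assumption: table created in "default"
    ResultConfiguration={'OutputLocation': 's3://bucket_name/athena-workspace/results/'},
)['QueryExecutionId']

while athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']['Status']['State'] in ('QUEUED', 'RUNNING'):
    time.sleep(1)

print(athena.get_query_results(QueryExecutionId=qid)['ResultSet']['Rows'])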

s3 - how to get fast line count of file? wc -l is too slow

Does anyone have a quick way of getting the line count of a file hosted in S3? Preferably using the CLI or s3api, but I am open to python/boto as well.
Note: the solution must run non-interactively, i.e. in an overnight batch.
Right now I am doing this; it works, but takes around 10 minutes for a 20 GB file:
aws s3 cp s3://foo/bar - | wc -l
Here are two methods that might work for you...
Amazon S3 has a new feature called S3 Select that allows you to query files stored on S3.
You can perform a count of the number of records (lines) in a file and it can even work on GZIP files. Results may vary depending upon your file format.
Amazon Athena is also a similar option that might be suitable. It can query files stored in Amazon S3.
Yes, Amazon S3 has the S3 Select feature; also keep an eye on the cost when executing any query from the Select tab.
For example, here is the pricing as of June 2018 (this may vary):
S3 Select pricing is based on the size of the input, the output, and the data transferred.
Each query costs 0.002 USD per GB scanned, plus 0.0007 USD per GB returned.
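As a rough worked example at those rates (the 20 GB figure is just the file size mentioned in the question, and the bytes returned by a count are negligible):
# Back-of-the-envelope S3 Select cost at the June 2018 prices quoted above
gb_scanned = 20           # size of the example file in this thread
gb_returned = 0.000001    # a count(*) result is only a few bytes
cost = gb_scanned * 0.002 + gb_returned * 0.0007
print(f"approx. {cost:.4f} USD per query")   # ~0.04 USD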
You can do it using python/boto3.
Define the bucket name and object key:
import os
import boto3

colsep = ','
s3 = boto3.client('s3')
bucket_name = 'my-data-test'
s3_key = 'in/file.parquet'
Note that S3 SELECT can access only one file at a time.
Now you can open an S3 Select cursor:
sql_stmt = """SELECT count(*) FROM s3object S"""
req_fact = s3.select_object_content(
    Bucket=bucket_name,
    Key=s3_key,
    ExpressionType='SQL',
    Expression=sql_stmt,
    InputSerialization={'Parquet': {}},
    OutputSerialization={'CSV': {
        'RecordDelimiter': os.linesep,
        'FieldDelimiter': colsep}},
)
Now iterate through the returned records:
for event in req_fact['Payload']:
    if 'Records' in event:
        rr = event['Records']['Payload'].decode('utf-8')
        for i, rec in enumerate(rr.split(os.linesep)):
            if rec:
                row = rec.split(colsep)
                if row:
                    print('File line count:', row[0])
If you want to count records in all parquet files in a given S3 directory, check out this python/boto3 script: S3-parquet-files-row-counter

Amazon Athena : How to store results after querying with skipping column headers?

I ran a simple query using the Athena dashboard on data in CSV format. The result was a CSV with column headers.
When storing the results, Athena keeps the column headers in S3. How can I skip storing the header column names, since I have to make a new table from the results and it is repetitive?
Try "skip.header.line.count"="1". This feature has been available in AWS Athena since 2018-01-19; here's a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
`field1` string,
`field2` string,
`field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1')
You can refer to this question:
Aws Athena - Create external table skipping first row
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
Ran an Athena query, with the output going to Amazon S3
Created a new table pointing to this output based on How do I use the results of my Amazon Athena query in another query?, changing the path to the correct S3 location
Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%'
However, later queries store even more data in that S3 directory, which confuses subsequent executions.