Access S3 CSV file in Amazon Athena

I am trying to load files from S3 into Athena to run queries on them, but all the column values end up in the first column.
I have a file in the following format:
id,user_id,personal_id,created_at,updated_at,active
34,34,43,31:28.4,27:07.9,TRUE
Table creation query:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES (
'transient_lastDdlTime'='1665356861')
Can someone please tell me where I am going wrong?

You should add skip.header.line.count to your table properties to skip the header row. Since you have defined all columns as string, Athena cannot differentiate between the header and the first data row.
DDL with property added:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')

The SerDe needs some parameters to recognize CSV files, such as:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
See: LazySimpleSerDe for CSV, TSV, and custom-delimited files - Amazon Athena
An alternative method is to use AWS Glue to create the tables for you. In the AWS Glue console, you can create a Crawler and point it to your data. When you run the crawler, it will automatically create a table definition in Amazon Athena that matches the supplied data files.
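Putting this together with the columns from the question and the header-skip property from the earlier answer, a minimal sketch of the full DDL might look like the following (adjust the escape and line-terminator settings to your data):
-- LazySimpleSerDe (the default text SerDe) with an explicit comma delimiter
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION
's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1');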

Related

How to export hive table data into csv.gz format stored in s3

So I have two Hive queries: one that creates the table, and another that reads Parquet data from a different table and inserts the relevant columns into my new table. I would like this new Hive table to export its data to an S3 location in csv.gz format. My Hive queries running on EMR currently output files named 00000_0.gz, and I have to rename them to csv.gz with a bash script. This is quite hacky, as I have to mount my S3 directory into my terminal; it would be ideal if my queries could do this directly. Could someone please review my queries and point out any faults? Many thanks.
CREATE TABLE db.test (
app_id string,
app_account_id string,
sdk_ts BIGINT,
device_id string)
PARTITIONED BY (
load_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION "s3://test_unload/";
set hive.execution.engine=tez;
set hive.cli.print.header=true;
set hive.exec.compress.output=true;
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=1024000000;
set hive.merge.size.per.task=1024000000;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into db.test
partition(load_date)
select
'' as app_id,
'288' as app_account_id,
from_unixtime(CAST(event_epoch as BIGINT), 'yyyy-MM-dd HH:mm:ss') as sdk_ts,
device_id,
'20221106' as load_date
FROM processed_events.test
where load_date = '20221106';

Amazon Athena Error Creating Partitioned Tables

I am new to Athena and would appreciate some help.
I have multiple CSV files in the following format. Please note that all fields are in double quotes, and the total file size is about 5 GB. If possible, I would rather do this without using Glue, unless there is a good reason to spend money on running the crawlers.
"emailusername.string()","emaildomain.string()","name.string()","details.string()"
"myname1","website1.com","fullname1","address1 n details"
"myname2","website2.com","fullname2","address2 n details"
The following code on Athena works perfectly:
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tablea (
`emailusername` string,
`emaildomain` string,
`name` string,
`details` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://projectzzzz2/0001_aaaa_delme/'
TBLPROPERTIES ('has_encrypted_data'='false');
However, I am not able to use clustering or partitioning. The following code runs successfully, and after that I am also able to load partitions successfully. But no data is returned!
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tablea (
`name` string,
`details` string
)
PARTITIONED BY (emaildomain string, emailusername string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://projectzzzz2/0001_aaaa_delme/'
TBLPROPERTIES ('has_encrypted_data'='false');
MSCK REPAIR TABLE tablea;
SELECT * FROM "db1"."tablea";
Result: Zero records returned
If your intention is to create partitions on emaildomain and emailusername:
You don't need fields called emaildomain and emailusername in the table itself. However, you do need nested directories such as domain1/user1 under your S3 location,
e.g. s3://projectzzzz2/0001_aaaa_delme/domain1/user1
Make sure you
copy your file into s3://projectzzzz2/0001_aaaa_delme/domain1/user1 (not into the table root s3://projectzzzz2/0001_aaaa_delme),
then you can issue
ALTER TABLE tablea ADD PARTITION (emaildomain = 'domain1', emailusername = 'user1') LOCATION 's3://projectzzzz2/0001_aaaa_delme/domain1/user1';
If you query the table tablea, you will see that new fields called emaildomain and emailusername have been added automatically.
To my knowledge, whenever you add a new user or a new email domain you need to copy your file into the new folder and issue the corresponding ALTER TABLE statement.
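For example, once the partition has been added, a query can filter on the partition columns (a simple sketch using the column names from the question):
SELECT name, details
FROM db1.tablea
WHERE emaildomain = 'domain1' AND emailusername = 'user1';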

Why is this SQL query on Amazon Athena not pulling the required table from s3 to my database?

What is the problem with my syntax that prevents the query from running?
(The error and error code are mentioned below.)
All the names have been fixed.
"foldername3" has only one file and its name is pinmap.csv.
There are only 9 columns in the csv file.
CREATE EXTERNAL TABLE IF NOT EXISTS default.`pinmap`(
'circle' string,
'region' string,
'division' string,
'office' string,
'pin' int,
'office_type' string,
'delivery' string,
'district' string,
'state' string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/foldername3/'
TBLPROPERTIES (
'skip.header.line.count'='1');
Error code:
line 1:8: no viable alternative at input 'create external' (Service:
AmazonAthena; Status Code: 400; Error Code: InvalidRequestException;
Ideally, the query should import the CSV file from S3 into Amazon Athena as a table named "pinmap" in the database named "default".
Try using backticks instead of apostrophes:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`pinmap`(
`circle` string,
`region` string,
`division` string,
`office` string,
`pin` int,
`office_type` string,
`delivery` string,
`district` string,
`state` string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/foldername3/'
TBLPROPERTIES (
'skip.header.line.count'='1');
Result:
Query successful.
Also note that this query only defines meta information about your data in S3, such as the table schema and database, which is then stored in the AWS Glue Data Catalog. There is no actual import of the CSV files; they remain in S3.
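As a quick sanity check after creating the table (Athena reads the CSV from S3 at query time, so no separate import step is needed):
SELECT * FROM "default"."pinmap" LIMIT 10;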

AWS Glue/Data catalog showing quotes around data

When I query my files from the Data Catalog using Athena, all the data appears wrapped in quotes. Is it possible to remove those quotes?
I tried adding the quoteChar option in the table settings, but it didn't help.
UPDATE
As requested, the DDL:
CREATE EXTERNAL TABLE `holidays`(
`id` bigint,
`start` string,
`end` string,
`createdat` string,
`updatedat` string,
`deletedat` string,
`type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES (
'quoteChar'='\"')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://pinfare-glue/holidays/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-holidays',
'averageRecordSize'='84',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'recordCount'='29',
'sizeKey'='2494',
'skip.header.line.count'='1',
'typeOfData'='file')
I know it's late, but I think the issue is with the "Serde serialization lib".
In
AWS Glue --> click on the table --> Edit Table --> check "Serde serialization lib"
Its value should be "org.apache.hadoop.hive.serde2.OpenCSVSerde".
Then click Apply.
This should solve your issue.
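Alternatively, if you prefer to recreate the table yourself, a rough sketch of the DDL with the OpenCSVSerDe could look like the one below. This is only an assumption based on the crawler-generated DDL above; note that the OpenCSVSerDe reads every value as a string, so the bigint column is declared as string here and would need casting in queries.
CREATE EXTERNAL TABLE `holidays`(
`id` string,
`start` string,
`end` string,
`createdat` string,
`updatedat` string,
`deletedat` string,
`type` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar'=',',
'quoteChar'='\"')
LOCATION
's3://pinfare-glue/holidays/'
TBLPROPERTIES ('skip.header.line.count'='1')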

read csv file where value contain comma in AWS athena

Hi, I have currently created a table schema in AWS Athena as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
`clientcomputername` string,
`intjnltblrecid` bigint,
`processingstate` string,
`sessionid` int,
`sessionlogindatetime` string,
`sessionlogindatetimetzid` bigint,
`recidoriginal` bigint,
`modifieddatetime` string,
`modifiedby` string,
`createddatetime` string,
`createdby` string,
`dataareaid` string,
`recversion` int,
`partition` bigint,
`recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');
But one of the fields (processingstate) contains values with a comma in them, such as "Europe, Middle East, & Africa", which displaces the column order.
So what would be the best way to read this file? Thanks
When I removed this part
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
I was able to read quoted text with commas in it
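In other words, the table definition ends up relying on the OpenCSVSerDe defaults (comma separator, double-quote quote character); a sketch of the resulting DDL, reusing the columns from the question:
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
`clientcomputername` string,
`intjnltblrecid` bigint,
`processingstate` string,
`sessionid` int,
`sessionlogindatetime` string,
`sessionlogindatetimetzid` bigint,
`recidoriginal` bigint,
`modifieddatetime` string,
`modifiedby` string,
`createddatetime` string,
`createdby` string,
`dataareaid` string,
`recversion` int,
`partition` bigint,
`recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');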
As a workaround, look at the AWS Glue project.
Instead of creating the table via CREATE EXTERNAL TABLE:
invoke get-table for your table
then build the JSON for create-table
merge in the following StorageDescriptor part:
{
"StorageDescriptor": {
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde"
...
}
...
}
perform the create via the AWS CLI. You will get this table in AWS Glue, and Athena will be able to select the correct columns.
Notes
If your table is already defined with OpenCSVSerde, they may have fixed this issue and you can simply recreate the table.
I do not have much knowledge about Athena, but in AWS Glue you can delete or create a table without any data loss.
Before adding this table via create-table, first check how Glue and/or Athena handle table duplicates.
This is a common messy-CSV situation where certain values contain commas. The solution in Athena is to use SERDEPROPERTIES as described in the AWS doc https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html [the URL may change, so just search for 'OpenCSVSerDe for Processing'].
Following is the basic create table example provided there. Based on your data, you would have to ensure that the data types are specified correctly (e.g. string).
CREATE EXTERNAL TABLE test1 (
f1 string,
s2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\")
LOCATION 's3://user-test-region/dataset/test1/'
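With the default double-quote quote character, a value such as "Europe, Middle East, & Africa" stays in a single column. A quick check against the table from the question could look like this (table and column names taken from the question's DDL):
SELECT processingstate, count(*) AS record_count
FROM axlargetable.AEGIntJnlActivityLogStaging
GROUP BY processingstate;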