When I query my files from Data Catalog using Athena, all the data appears wrapped with quotes. Is it possible to remove those quotes?
I tried adding the quoteChar option in the table settings, but it didn't help.
UPDATE
As requested, the DDL:
CREATE EXTERNAL TABLE `holidays`(
`id` bigint,
`start` string,
`end` string,
`createdat` string,
`updatedat` string,
`deletedat` string,
`type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES (
'quoteChar'='\"')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://pinfare-glue/holidays/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-holidays',
'averageRecordSize'='84',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'recordCount'='29',
'sizeKey'='2494',
'skip.header.line.count'='1',
'typeOfData'='file')
I know it's late, but I think the issue is with the "Serde serialization lib".
In
AWS GLUE --> Click on the table --> Edit Table --> check "Serde serialization lib"
Its value should be "org.apache.hadoop.hive.serde2.OpenCSVSerde".
Then click Apply.
This should solve your issue.
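To see why the SerDe choice matters, here is a minimal Python sketch (not Athena code). It assumes a hypothetical sample row in which every field is quoted. Plain `str.split` stands in for a delimiter-only reader, and Python's `csv` module stands in for a quote-aware reader such as OpenCSVSerde. The real SerDes are Java classes with their own edge cases, so treat this purely as an analogy.

```python
import csv
import io

# Hypothetical sample row: every field is wrapped in double quotes.
SAMPLE_ROW = '"1","2019-01-01","2019-01-02","public"'


def split_on_delimiter(line):
    """Split on commas only; quote characters are kept as ordinary data."""
    return line.split(",")


def parse_quote_aware(line):
    """Parse with a quote-aware CSV reader, which strips the enclosing quotes."""
    return next(csv.reader(io.StringIO(line), delimiter=",", quotechar='"'))


if __name__ == "__main__":
    print(split_on_delimiter(SAMPLE_ROW))  # fields still carry their quotes
    print(parse_quote_aware(SAMPLE_ROW))   # quotes removed
```

The first call mirrors what the question describes (values still wrapped in quotes); the second mirrors the result after switching to a quote-aware SerDe.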
I'm creating a table in Athena based on a list of CSV files in an S3 bucket. The files in the buckets are placed in folders like this:
$ aws s3 ls s3://bucket-name/ --recursive
2023-01-23 16:05:01 25601 logs2023/01/23/23/analytics_Log-1-2023-01-23-23-59-59-6dc5bd4c-f00f-4f34-9292-7bfa9ec33c55
2023-01-23 16:10:03 18182 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-05-01-aa2cb565-05c8-43e2-a203-96324f66a5a7
2023-01-23 16:15:05 20350 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-10-03-87b03989-c059-4fca-8e8b-909e787db889
2023-01-23 16:20:09 25187 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-15-06-6d9b39fb-c05f-4416-9b17-415f48e63591
2023-01-23 16:25:18 20590 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-20-16-3939a0fe-8cfb-4168-bc8e-e71d2122add5
This is the format for the folder structure:
logs{year}/{month}/{day}/{hour}/<filename>
I would like to use Athena's partition projection and this is how I'm creating my table:
CREATE EXTERNAL TABLE analytics.logs (
id string,
...
type tinyint)
PARTITIONED BY (
year bigint COMMENT '',
month string COMMENT '',
day string COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket-name/'
TBLPROPERTIES (
'classification'='csv',
'partition.day.values'='01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31',
'partition.day.type'='enum',
'partition.enable'='true',
'partition.month.values'='01,02,03,04,05,06,07,08,09,10,11,12',
'partition.month.type'='enum',
'partition.year.range'='2022,2100',
'partition.year.type'='integer',
'storage.location.template'='s3://bucket-name/logs${year}/${month}/${day}/')
As you can see, I'm trying to partition the data using year, month, and day. Even though there's also an hour folder, I'm not interested in that. This command executes just fine and it creates the table too. But when I query the table:
SELECT * FROM analytics.logs LIMIT 10;
It returns empty. But if I create the same table without the PARTITIONED part, I can see the records. Can someone please help me understand what I'm doing wrong?
[UPDATE]
I simplified the folder structure to see if it works. It does not.
$ aws s3 ls s3://bucket-name/test --recursive
2023-01-24 07:03:30 0 test/
2023-01-24 07:03:59 0 test/2022/
2023-01-24 07:11:06 13889 test/2022/Log-1-2022-12-01-00-00-11-255f8d74-5417-42a0-8c09-97282a626903
2023-01-24 07:11:05 8208 test/2022/Log-1-2022-12-01-00-05-15-c34eda24-36d8-484c-b7b6-4861c297d857
CREATE EXTERNAL TABLE `log_2`(
`id` string,
...
`type` tinyint)
PARTITIONED BY (
`year` bigint COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket-name/test'
TBLPROPERTIES (
'classification'='csv',
'partition.enable'='true',
'partition.year.range'='2021,2023',
'partition.year.type'='integer',
'storage.location.template'='s3://bucket-name/test/${year}/')
And still the following query returns nothing:
SELECT * FROM "analytics"."log_2" where year = 2022 limit 10;
You have a mismatch in data types. Partition by year is bigint and partition projection is integer. Make both integers.
"projection.enabled" = "true",
"projection.datehour.type" = "date",
"projection.datehour.format" = "yyyy/MM/dd/HH",
"projection.datehour.range" = "2021/01/01/00,NOW",
"projection.datehour.interval" = "1",
"projection.datehour.interval.unit" = "HOURS",
Change the word partition to projection in the property names.
For anyone else who might make my mistake: the problem was that I was (incorrectly) using partition in the TBLPROPERTIES section, when it should have been projection.
To provide you with a working example:
CREATE EXTERNAL TABLE `log_2`(
id string,
...
type tinyint)
PARTITIONED BY (
`year` bigint COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket-name/test'
TBLPROPERTIES (
'classification'='csv',
'projection.enabled'='true',
'projection.year.range'='2021,2023',
'projection.year.type'='integer',
'storage.location.template'='s3://bucket-name/test/${year}/')
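As a rough mental model of what these properties describe, the Python sketch below expands the location template for each year in the range. It is only an illustration, not how Athena is implemented; the function name `projected_locations` is hypothetical, and it assumes the range bounds are inclusive.

```python
def projected_locations(template, column, start, end):
    """Expand a location template for every integer in [start, end]."""
    placeholder = "${" + column + "}"
    return [
        template.replace(placeholder, str(value))
        for value in range(start, end + 1)
    ]


if __name__ == "__main__":
    # Template and range copied from the table properties above.
    locations = projected_locations(
        "s3://bucket-name/test/${year}/", "year", 2021, 2023
    )
    for location in locations:
        print(location)
```

With the misspelled property names these locations are never computed, and since no partitions were registered manually either, the query has nothing to scan.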
I am trying to load files from S3 into Athena to run a query, but all the column values end up in the first column.
I have a file in the following format:
id,user_id,personal_id,created_at,updated_at,active
34,34,43,31:28.4,27:07.9,TRUE
This is the output I get:
Table creation query:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES (
'transient_lastDdlTime'='1665356861')
Can someone please tell me where I am going wrong?
You should add skip.header.line.count to your table properties to skip the first row. Because you have defined all columns as the string data type, Athena was unable to differentiate between the header and the first row.
DDL with property added:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')
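The effect of this property can be pictured with a small Python sketch. This is only an analogy under the assumption that the header is the first line of the file; `skip_header_lines` is a hypothetical helper, not part of any Athena API.

```python
from itertools import islice


def skip_header_lines(lines, count):
    """Return an iterator over every line after the first `count` lines."""
    return islice(lines, count, None)


if __name__ == "__main__":
    # Sample file contents taken from the question.
    sample = [
        "id,user_id,personal_id,created_at,updated_at,active",
        "34,34,43,31:28.4,27:07.9,TRUE",
    ]
    for line in skip_header_lines(sample, 1):
        print(line)
```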
The SerDe needs some parameters to recognize CSV files, such as:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
See: LazySimpleSerDe for CSV, TSV, and custom-delimited files - Amazon Athena
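The reason the delimiter has to be stated explicitly is that Hive's default field delimiter for delimited text is the Ctrl-A character (`\x01`), not a comma. The Python sketch below is an analogy rather than the SerDe itself; it uses the sample row from the question to show how a comma-separated line collapses into a single field when split on that default.

```python
# Hive's default field delimiter for delimited text (Ctrl-A).
DEFAULT_FIELD_DELIMITER = "\x01"

# Data row taken from the question.
ROW = "34,34,43,31:28.4,27:07.9,TRUE"


def split_fields(line, delimiter=DEFAULT_FIELD_DELIMITER):
    """Split a line into fields using the given delimiter."""
    return line.split(delimiter)


if __name__ == "__main__":
    print(split_fields(ROW))       # one field: the whole line lands in column 1
    print(split_fields(ROW, ","))  # six fields, one per column
```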
An alternative method is to use AWS Glue to create the tables for you. In the AWS Glue console, you can create a Crawler and point it to your data. When you run the crawler, it will automatically create a table definition in Amazon Athena that matches the supplied data files.
What is the problem in my syntax that the query is not running?
(error and error code mentioned below)
All the names have been fixed.
"foldername3" has only one file and its name is pinmap.csv.
There are only 9 columns in the csv file.
CREATE EXTERNAL TABLE IF NOT EXISTS default.`pinmap`(
'circle' string,
'region' string,
'division' string,
'office' string,
'pin' int,
'office_type' string,
'delivery' string,
'district' string,
'state' string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/foldername3/'
TBLPROPERTIES (
'skip.header.line.count'='1');
Error code:
line 1:8: no viable alternative at input 'create external' (service:
amazonathena; status code: 400; error code: invalidrequestexception;
Ideally the query should import the csv file from s3 to amazon athena as a table named "pinmap" in the database named "default".
Try using backticks instead of apostrophes:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`pinmap`(
`circle` string,
`region` string,
`division` string,
`office` string,
`pin` int,
`office_type` string,
`delivery` string,
`district` string,
`state` string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/foldername3/'
TBLPROPERTIES (
'skip.header.line.count'='1');
Result:
Query successful.
Also note that this query only defines metadata about your data in S3 (table schema, database, and so on), which is then stored in the AWS Glue Data Catalog. There is no actual import of the CSV file; the data remains in S3.
When I try to load a CSV file from S3, the headers end up in the columns as data. I tried to skip the header with
TBLPROPERTIES (
"skip.header.line.count"="1")
But still no use.
Any advice please?
CREATE EXTERNAL TABLE skipheader(
permalink string,
company string,
numemps bigint,
category string,
city string,
state string,
fundeddate string,
raisedamt bigint,
raisedcurrency string,
round string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/filename/'
TBLPROPERTIES (
"skip.header.line.count"="1")
Looking at the release notes for when the feature was released, it says:
Support for ignoring headers. You can use the skip.header.line.count property when defining tables, to allow Athena to ignore headers. This is currently supported for queries that use the OpenCSV SerDe, and not for Grok or Regex SerDes.
My interpretation of this is that it won't work with LazySimpleSerde, which is what you get when you say ROW FORMAT DELIMITED, and that you have to use the OpenCSV serde:
CREATE EXTERNAL TABLE skipheader ( … )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/filename/'
TBLPROPERTIES ("skip.header.line.count"="1")
The OpenCSV serde works differently from LazySimpleSerde: it has much more limited data type support, but on the other hand it is more configurable.
If you can use the OpenCSV SerDe and make it work for you as described by Theo, go for it. However, if you have other tables in other formats, you can get around it in the following way, even though it is a bit of a hack. You can simply add a WHERE clause that excludes the headers, like
SELECT * FROM skipheader WHERE permalink != 'permalink'
Recently, Athena added the ability to create a new table from the result of a query (https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html), so you could even filter out the headers and save the result to a new location using Athena, if that works better for you.
Hi. I have created a table schema in AWS Athena as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
`clientcomputername` string,
`intjnltblrecid` bigint,
`processingstate` string,
`sessionid` int,
`sessionlogindatetime` string,
`sessionlogindatetimetzid` bigint,
`recidoriginal` bigint,
`modifieddatetime` string,
`modifiedby` string,
`createddatetime` string,
`createdby` string,
`dataareaid` string,
`recversion` int,
`partition` bigint,
`recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');
But one of the fields (processingstate) contains values with commas, such as "Europe, Middle East, & Africa", which shifts the columns out of order.
So what would be the best way to read this file? Thanks.
When I removed this part
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
I was able to read quoted text with commas in it
As a workaround, look at AWS Glue.
Instead of creating the table via CREATE EXTERNAL TABLE:
Invoke get-table for your table.
Then build the JSON for create-table.
Merge in the following StorageDescriptor part:
{
"StorageDescriptor": {
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde"
...
}
...
}
}
Perform the create via the AWS CLI. The table will then appear in AWS Glue, and Athena will be able to select the correct columns.
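The merge step can be sketched in Python as plain dictionary manipulation on the JSON that get-table returns. This is a sketch under assumptions: `build_table_input` is a hypothetical helper, the sample response is invented and heavily trimmed, and the set of keys kept is an assumed subset of what Glue's TableInput structure accepts, so check it against the current Glue API reference.

```python
import copy
import json

# Assumption: a subset of the keys accepted by Glue's TableInput structure.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}

OPEN_CSV_SERDE = "org.apache.hadoop.hive.serde2.OpenCSVSerde"


def build_table_input(get_table_response, serde_library=OPEN_CSV_SERDE):
    """Turn a get-table response into a create-table input with a new SerDe."""
    table = get_table_response["Table"]
    # Drop read-only fields (database name, timestamps, ...) from the response.
    table_input = {
        key: copy.deepcopy(value)
        for key, value in table.items()
        if key in TABLE_INPUT_KEYS
    }
    descriptor = table_input.setdefault("StorageDescriptor", {})
    serde_info = descriptor.setdefault("SerdeInfo", {})
    serde_info["SerializationLibrary"] = serde_library
    return table_input


if __name__ == "__main__":
    # Hypothetical, heavily trimmed get-table response.
    response = {
        "Table": {
            "Name": "example_table",
            "DatabaseName": "example_db",
            "StorageDescriptor": {
                "Location": "s3://example-bucket/example/",
                "SerdeInfo": {
                    "SerializationLibrary":
                        "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
                },
            },
        }
    }
    print(json.dumps(build_table_input(response), indent=2))
```

The printed JSON could then be saved to a file and passed to `aws glue create-table` through its `--table-input` option, together with `--database-name`.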
Notes
If your table already specifies OpenCSVSerde, they may have fixed this issue, and you can simply recreate the table.
I do not have much knowledge of Athena, but in AWS Glue you can delete or create a table without any data loss.
Before adding this table via create-table, first check how Glue and/or Athena handle duplicate tables.
This is a common messy CSV file situation where certain values contain commas. The solution in Athena for this is to use SERDEPROPERTIES as described in the AWS doc https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html [the url may change so just search for 'OpenCSVSerDe for Processing']
The following is a basic CREATE TABLE example. Depending on your data, you need to make sure that the data types are specified correctly (e.g. string).
CREATE EXTERNAL TABLE test1 (
f1 string,
s2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://user-test-region/dataset/test1/'
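To illustrate how embedded commas displace columns, here is a short Python sketch. It is an analogy only: the row is hypothetical apart from the quoted value taken from the question, and Python's `csv` module merely stands in for a quote-aware SerDe.

```python
import csv
import io

# Hypothetical row; the quoted value is the one mentioned in the question.
LINE = 'pc-01,42,"Europe, Middle East, & Africa",7'


def columns_naive(line):
    """Split on every comma, including those inside quoted values."""
    return line.split(",")


def columns_quote_aware(line):
    """Split on commas, but keep quoted values together as one field."""
    return next(csv.reader(io.StringIO(line), delimiter=",", quotechar='"'))


if __name__ == "__main__":
    print(len(columns_naive(LINE)), columns_naive(LINE))
    print(len(columns_quote_aware(LINE)), columns_quote_aware(LINE))
```

The naive split yields more fields than the table has columns, which is what shifts later values into the wrong columns.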