I'm not able to query S3 files with AWS Athena; the content of the files is regular JSON arrays like this:
[
{
"DataInvio": "2020-02-06T13:37:00+00:00",
"DataLettura": "2020-02-06T13:35:50+00:00",
"FlagDownloaded": 0,
"GUID": "f257c9c0-b7e1-4663-8d6d-97e652b27c10",
"IMEI": "866100000062167",
"Id": 0,
"IdSessione": "4bd169ff-307c-4fbf-aa63-fce972f43fa2",
"IdTagLocal": 0,
"SerialNumber": "142707160028BJZZZZ",
"Tag": "E200001697080089188056D2",
"Tipo": "B",
"TipoEvento": "L",
"TipoSegnalazione": 0,
"TipoTag": "C",
"UsrId": "10642180-1e34-44ac-952e-9cb3e8e6a03c"
},
{
"DataInvio": "2020-02-06T13:37:00+00:00",
"DataLettura": "2020-02-06T13:35:50+00:00",
"FlagDownloaded": 0,
"GUID": "e531272e-465c-4294-950d-95a683ff8e3b",
"IMEI": "866100000062167",
"Id": 0,
"IdSessione": "4bd169ff-307c-4fbf-aa63-fce972f43fa2",
"IdTagLocal": 0,
"SerialNumber": "142707160028BJZZZZ",
"Tag": "E200341201321E0000A946D2",
"Tipo": "B",
"TipoEvento": "L",
"TipoSegnalazione": 0,
"TipoTag": "C",
"UsrId": "10642180-1e34-44ac-952e-9cb3e8e6a03c"
}
]
A simple query select * from mytable returns empty rows if the table has been created in this way:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
`IdSessione` string,
`DataLettura` date,
`GUID` string,
`DataInvio` date
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json' = 'true'
) LOCATION 's3://athenatestsavino/files/anthea/'
TBLPROPERTIES ('has_encrypted_data'='false')
Or it gives me the error HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Missing value at 1 [character 2 line 1] if the table has been created with:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable(
`IdSessione` string,
`DataLettura` date,
`GUID` string,
`DataInvio` date
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://athenatestsavino/files/anthea/'
TBLPROPERTIES ('has_encrypted_data'='false')
If I modify the content of the file in this way (one JSON object per row, without the enclosing array or trailing commas), the query gives me results:
{ "DataInvio": "2020-02-06T13:37:00+00:00", "DataLettura": "2020-02-06T13:35:50+00:00",....}
{ "DataInvio": "2020-02-07T13:37:00+00:00", "DataLettura": "2020-02-06T13:35:50+00:00",....}
How can I query JSON array structures directly?
The Athena best practices recommend one JSON object per row:
Make sure that each JSON-encoded record is represented on a separate line.
This has been asked a few times, and I don't think anyone has made it work with an array of JSON objects:
aws athena - Create table by an array of json object
AWS Glue Custom Classifiers Json Path
This is related to the formatting of the JSON objects. The resolution of these issues is also described here: https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/
Apart from this, if you are using AWS Glue to crawl these files, make sure the Classification of the Data Catalog table is not "UNKNOWN".
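One more thing worth checking once the files are one JSON object per line: the DDL above declares DataLettura and DataInvio as date, but the files contain full ISO-8601 timestamps, which the JSON SerDe is unlikely to parse cleanly into a date column. A minimal sketch, assuming the files have been reformatted to one object per line: declare them as string and convert at query time with from_iso8601_timestamp:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
  `IdSessione` string,
  `DataLettura` string,  -- ISO-8601 timestamp kept as a string
  `GUID` string,
  `DataInvio` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://athenatestsavino/files/anthea/';

-- Convert the ISO-8601 strings at query time:
SELECT GUID,
       from_iso8601_timestamp(DataLettura) AS data_lettura,
       from_iso8601_timestamp(DataInvio)   AS data_invio
FROM mydb.mytable;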
Related
I have JSON files in an S3 bucket, generated by the AWS Textract service, and I'm using Athena to query data from those files. Every file has the same structure, and I created a table in Athena with a column "blocks" that is an array of structs:
"blocks": [{
"BlockType": "LINE",
"Id": "12345",
"Text": "Text from document",
"Confidence": 98.7022933959961,
"Page": "1",
"SourceLanguage": "de",
"TargetLanguage": "en",
},
...100+ blocks]
How can I query just for the "Text" property from every block that has one?
Thanks in advance!
I have defined a table with the exact schema of yours using the sample JSON provided. Athena reports the column type as:
array(row(blocktype varchar, id varchar, text varchar, confidence double, page varchar, sourcelanguage varchar, targetlanguage varchar))
I used the UNNEST operator to flatten the array of blocks and fetch the Text column from it with the query below:
select block.text from <table-name> CROSS JOIN UNNEST(blocks) as t(block)
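Since the question asks only for blocks that actually have a Text property, it may be worth adding a NULL filter; a small sketch (the table name is a placeholder):
SELECT block.text
FROM my_textract_table
CROSS JOIN UNNEST(blocks) AS t(block)
WHERE block.text IS NOT NULL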
It looks like the column stores an array of rows, so you can process it as one with array functions:
select transform(
  filter(block_column, t -> t.text is not null),
  r -> cast(row(r.text) as row(text varchar))) as texts
from my_table
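If you only need the plain strings rather than single-field rows, a simpler variant of the same idea (a sketch; block_column and my_table are placeholders as above) lets the lambda return the text directly, yielding an array(varchar) per input row:
select transform(
  filter(block_column, t -> t.text is not null),
  r -> r.text) as texts
from my_table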
I've seen other questions saying their query returns no results at all. That is not what is happening with my query: it returns rows, but they contain empty strings.
I have an 81.7MB JSON file in my input bucket (input-data/test_data). I've set up the data source as JSON.
However, when I execute SELECT * FROM test_table; it shows (in green) that the data has been scanned, the query was successful, and there are results, but they are not saved to the output bucket or displayed in the GUI.
I'm not sure what I've done wrong in the setup?
This is my table creation:
CREATE EXTERNAL TABLE IF NOT EXISTS `test_db`.`test_data` (
`tbl_timestamp` timestamp,
`colmn1` string,
`colmn2` string,
`colmn3` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://input-data/test_data/'
TBLPROPERTIES ('has_encrypted_data'='false',
'skip.header.line.count'='1');
Resolved this issue. The column names of the table need to be the same as the keys in the file itself. Simple really!
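If renaming the table columns is not practical, the OpenX JSON SerDe also supports mapping a column to a differently named JSON key via SERDEPROPERTIES. A sketch, where the JSON key names are hypothetical since the actual keys in the file are not shown:
CREATE EXTERNAL TABLE IF NOT EXISTS `test_db`.`test_data` (
  `tbl_timestamp` timestamp,
  `colmn1` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'mapping.tbl_timestamp' = 'eventTime',  -- hypothetical JSON key
  'mapping.colmn1' = 'field1'             -- hypothetical JSON key
) LOCATION 's3://input-data/test_data/';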
I'm getting a timeout error for a full-text query in Athena like this...
SELECT count(textbody) FROM "email"."some_table" where textbody like '% some text to search%'
Is there any way to optimize it?
Update:
The create table statement:
CREATE EXTERNAL TABLE `email`.`email5_newsletters_04032019`(
`nesletterid` string,
`name` string,
`format` string,
`subject` string,
`textbody` string,
`htmlbody` string,
`createdate` string,
`active` string,
`archive` string,
`ownerid` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'ESCAPED BY' = '\\'
) LOCATION 's3://some_bucket/email_backup_updated/email5/'
TBLPROPERTIES ('has_encrypted_data'='false');
And S3 bucket contents:
# aws s3 ls s3://xxx/email_backup_updated/email5/ --human
2020-08-22 15:34:44 2.2 GiB email_newsletters_04032019_updated.csv.gz
There are 11 million records in this file. The file can be imported into Redshift within 30 minutes and everything works fine there, but I would prefer to use Athena!
CSV is not a format that integrates very well with the Presto engine, as queries need to read the full row to reach a single column. A way to optimize your usage of Athena, which will also save you plenty in storage costs, is to switch to a columnar storage format like Parquet or ORC, and you can actually do the conversion with a query:
CREATE TABLE `email`.`email5_newsletters_04032019_orc`
WITH (
external_location = 's3://my_orc_table/',
format = 'ORC')
AS SELECT *
FROM `email`.`email5_newsletters_04032019`;
Then rerun your query above on the new table:
SELECT count(textbody) FROM "email"."email5_newsletters_04032019_orc" where textbody like '% some text to search%'
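If you prefer Parquet over ORC, the same CTAS pattern applies; a sketch with a placeholder bucket, setting the compression explicitly:
CREATE TABLE `email`.`email5_newsletters_04032019_parquet`
WITH (
  external_location = 's3://my_parquet_table/',  -- placeholder location
  format = 'PARQUET',
  parquet_compression = 'SNAPPY')
AS SELECT *
FROM `email`.`email5_newsletters_04032019`;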
I'm trying to create an AWS Athena table using RegexSerDe; due to some export issues I cannot use JsonSerDe. The log lines look like this:
2019-04-11T09:05:16.775Z {"timestamp":"data0","level":"data1","thread":data2","logger":"data3","message":"data4","context":"data5"}
I was trying to obtain the JSON values with a regex, but without any luck:
CREATE EXTERNAL TABLE IF NOT EXISTS dsfsdfs.mecs3(
`timestamp` string,
`level` string,
`thread` string,
`logger` string,
`message` string,
`context` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "[ :]+(\\"[^\"]*\\")"
) LOCATION 's3://thisisates/'
Error: HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
Some help would be great, as I'm not an expert in regex.
Thanks and BR.
Getting this working will probably be very hard - even if you can write a regex that will capture the columns out of the JSON structure, can you guarantee that all JSON documents will be rendered with the properties in the same order? JSON itself considers {"a": 1, "b": 2} and {"b": 2, "a": 1} to be equivalent, so many JSON libraries don't guarantee, or even care about ordering.
Another approach to this is to create a table with two columns: timestamp and data, as a regex table with a regex with two capture groups, the timestamp and the rest of the line – or possibly as a CSV table if the character after the timestamp is a tab (if it's a space it won't work since the JSON will contain spaces):
CREATE EXTERNAL TABLE IF NOT EXISTS mecs3_raw (
`timestamp` string,
`data` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(\\S+) (.+)$"
)
LOCATION 's3://thisisates/'
(the regex assumes that there is a space between the timestamp and the JSON structure, change it as needed).
That table will not be very usable by itself, but what you can do next is to create a view that extracts the properties from the JSON structure:
CREATE VIEW mecs3 AS
SELECT
"timestamp",
JSON_EXTRACT_SCALAR("data", '$.level') AS level,
JSON_EXTRACT_SCALAR("data", '$.thread') AS thread,
JSON_EXTRACT_SCALAR("data", '$.logger') AS logger,
JSON_EXTRACT_SCALAR("data", '$.message') AS message,
JSON_EXTRACT_SCALAR("data", '$.context') AS context
FROM mecs3_raw
(mecs3_raw is the table with timestamp and data columns)
This will give you what you want and will be much less error prone.
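For example, you could then query the view like this (a sketch; the 'ERROR' level value is an assumed sample, since the actual log levels are not shown):
SELECT "timestamp", logger, message
FROM mecs3
WHERE level = 'ERROR'
ORDER BY "timestamp" DESC
LIMIT 100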
Try Regex: (?<=")[^\"]*(?=\" *(?:,|}))
Hi, currently I have created a table schema in AWS Athena as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
`clientcomputername` string,
`intjnltblrecid` bigint,
`processingstate` string,
`sessionid` int,
`sessionlogindatetime` string,
`sessionlogindatetimetzid` bigint,
`recidoriginal` bigint,
`modifieddatetime` string,
`modifiedby` string,
`createddatetime` string,
`createdby` string,
`dataareaid` string,
`recversion` int,
`partition` bigint,
`recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');
But one of the fields (processingstate) contains values with commas, such as "Europe, Middle East, & Africa", which displaces the column order.
So what would be the best way to read this file? Thanks.
When I removed this part:
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
I was able to read quoted text with commas in it.
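In other words, a DDL along these lines (a sketch; the column list is the same as in the question) leaves OpenCSVSerde at its defaults, and the default quoteChar of '"' already handles quoted fields that contain commas:
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
  `clientcomputername` string,
  -- ... remaining columns exactly as in the original statement ...
  `recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');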
As a workaround, look at the AWS Glue project.
Instead of creating the table via CREATE EXTERNAL TABLE:
invoke get-table for your table,
then build the JSON for a create-table call,
and merge in the following StorageDescriptor part:
{
"StorageDescriptor": {
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde"
...
}
...
}
Perform the create via the AWS CLI. You will get this table in AWS Glue, and Athena will be able to select the correct columns.
Notes
If your table was already defined with OpenCSVSerde, they may have fixed this issue and you can simply recreate the table.
I do not have much knowledge about Athena, but in AWS Glue you can delete or create a table without any data loss.
Before adding this table via create-table, you should first check how Glue and/or Athena handle table duplicates.
This is a common messy-CSV situation where certain values contain commas. The solution in Athena is to use SERDEPROPERTIES, as described in the AWS doc https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html (the URL may change, so just search for 'OpenCSVSerDe for Processing').
Following is the basic create table example provided there. Based on your data, you would have to ensure that the data types are specified correctly (e.g. string):
CREATE EXTERNAL TABLE test1 (
f1 string,
s2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\")
LOCATION 's3://user-test-region/dataset/test1/'
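After recreating the table, a quick sanity check (a sketch against the asker's table) is to look for values that still contain commas; if quoting is handled correctly, they stay in a single column:
SELECT processingstate, count(*) AS cnt
FROM axlargetable.AEGIntJnlActivityLogStaging
WHERE processingstate LIKE '%,%'
GROUP BY processingstate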