I have a file that looks like this:
33.49.147.163 20140416123526 https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en 29 409 Firefox/5.0
I want to load it into a hive table. I do it this way:
create external table Logs (
ip string,
ts timestamp,
request string,
page_size smallint,
status_code smallint,
info string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
"timestamp.formats" = "yyyyMMddHHmmss",
"input.regex" = '^(\\S*)\\t{3}(\\d{14})\\t(\\S*)\\t(\\S*)\\t(\\S*)\\t(\\S*).*$'
)
stored as textfile
location '/data/user_logs/user_logs_M';
And
select * from Logs limit 10;
results in
33.49.147.16 NULL https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en 29 409 Firefox/5.0
How to parse timestamps correctly, to avoid this NULLs?
"timestamp.formats" SerDe property works only with LazySimpleSerDe (STORED AS TEXTFILE), it does not work with RegexSerDe. If you are using RegexSerDe, then parse timestamp in a query.
Define ts column as STRING data type in CREATE TABLE and in the query transform it like this:
select timestamp(regexp_replace(ts,'(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})','$1-$2-$3 $4:$5:$6.0')) as ts
Of course, you can extract each part of the timestamp using SerDe as separate columns and properly concatenate them with delimiters in the query to get correct timestamp format, but it will not give you any improvement because anyway you will need additional transformation in the query.
Related
I've seen other questions saying their query returns no results. This is not what is happening with my query. The query itself is returning empty strings/results.
I have an 81.7MB JSON file in my input bucket (input-data/test_data). I've setup the datasource as JSON.
However, when I execute SELECT * FROM test_table; it shows (in green) that the data has been scanned, the query was successful and there are results, but not saved to the output bucket or displayed in the GUI.
I'm not sure what I've done wrong in the setup?
This is my table creation:
CREATE EXTERNAL TABLE IF NOT EXISTS `test_db`.`test_data` (
`tbl_timestamp` timestamp,
`colmn1` string,
`colmn2` string,
`colmn3` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://input-data/test_data/'
TBLPROPERTIES ('has_encrypted_data'='false',
'skip.header.line.count'='1');
Resolved this issue. The labels of the table (e.g. the keys) need to be the same labels in the file itself. Simple really!
Getting timeout error for a full text query in Athena like this...
SELECT count(textbody) FROM "email"."some_table" where textbody like '% some text to seach%'
Is there any way to optimize it?
Update:
The create table statement:
CREATE EXTERNAL TABLE `email`.`email5_newsletters_04032019`(
`nesletterid` string,
`name` string,
`format` string,
`subject` string,
`textbody` string,
`htmlbody` string,
`createdate` string,
`active` string,
`archive` string,
`ownerid` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'ESCAPED BY' = '\\'
) LOCATION 's3://some_bucket/email_backup_updated/email5/'
TBLPROPERTIES ('has_encrypted_data'='false');
And S3 bucket contents:
# aws s3 ls s3://xxx/email_backup_updated/email5/ --human
2020-08-22 15:34:44 2.2 GiB email_newsletters_04032019_updated.csv.gz
There are 11 million records in this file. The file can be imported within 30 minutes in Redshift and everything works OK in redshift. I will prefer to use Athena!
CSV is not a format that integrates very well with the presto engine, as queries need to read the full row to reach a single column. A way to optimize usage of athena, which will also save you plenty of storage costs, is to switch to a columnar storage format, like parquet or orc, and you can actually do it with a query:
CREATE TABLE `email`.`email5_newsletters_04032019_orc`
WITH (
external_location = 's3://my_orc_table/',
format = 'ORC')
AS SELECT *
FROM `email`.`email5_newsletters_04032019`;
Then rerun your query above on the new table:
SELECT count(textbody) FROM "email"."email5_newsletters_04032019_orc" where textbody like '% some text to seach%'
So, I've been trying to load csvs from a s3 bucket into Athena. However, the way the csv are designer looks like the following
ns=2;s=A_EREG.A_EREG.A_PHASE_PRESSURE,102.19468,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE,50.0,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_CURRENT,15.919731,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE,0.22070877,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.ACTIVE_PWR,0.0,12/12/19 00:00:01.2144275 GMT
The csv is just one record. Each column of the record has a value associated to it, which sits between two commas between the timestamp and the name, which I am trying to capture.
I've been trying to parse it using Regex Serde and I got to this Regular expression:
((?<=\,).*?(?=\,))
demo
I want the output of the above to be:
col_a col_b col_c col_d col_e
102.19468 50.0 15.919731 0.22070877 0.0
My DDL query looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS
(...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "\(?<=\,).*?(?=\,)"
) LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false');
The table creation Query above works succesfully, but when I try to preview my table I get the following error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
I am fairly new to Hive and Regex so I don't know what is going on. Can someone help me out here?
Thanks in advance,
BR
One column in Hive table corresponds to one capturing group in the regex. If you want to select single column containing everything between commas then this will work:
'.*,(.*),.*'
Athena serdes require that each record in the input is a single line. Multiline records are not supported.
What you can do instead is to create a table which maps each line in your data to a row in a table, and use a view to pivot the rows that belong together into a single row.
I'm going to assume that the ns field at the start of the lines is an ID, if not, I assume there is some other thing identifying which lines belong together that you can use.
I used your demo to create a regex that matched all the fields of each line and came up with ns=(\d);s=([^,]+),([^,]+),(.+) (see https://regex101.com/r/HnjnxK/5).
CREATE EXTERNAL TABLE my_data (
ns string,
s string,
v double,
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "ns=(\\d);s=([^,]+),([^,]+),(.+)"
)
LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false')
Apologies if the regex isn't correctly escaped, I'm just typing this into Stack Overflow.
This table has four columns, corresponding to the four fields in each line. I've named then ns and s from the data, and v for the numerical value, and dt for the date. The date needs to be typed as a string since it's not in a format Athena natively understands.
Assuming that ns is a record identifier you can then create a view that pivots rows with different values for s to columns. You have to do this the way you want it to, the following is of course just a demonstration:
CREATE VIEW my_pivoted_data AS
WITH data_aggregated_by_ns AS (
SELECT
ns,
map_agg(array_agg(s), array_agg(v)) AS s_and_v
FROM my_data
GROUP BY ns
)
SELECT
ns,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_PRESSURE') AS phase_pressure,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE') AS phase_ref_signal_to_valve,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_CURRENT') AS phase_sec_current,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE') AS phase_sec_voltage,
element_at(s_and_v, 'A_EREG.A_EREG.ACTIVE_PWR') AS active_pwr
FROM data_aggregated_by_ns
Apologies if there are syntax errors in the SQL above.
What this does is that it creates a view (but start by trying it out as a query using everything from WITH and onwards), which has two parts to it.
The first part, the first SELECT results in rows that aggregate all the s and v values for each value of ns into a map. Try to run this query by itself to see how the result looks.
The second part, the second SELECT uses the results of the first part and just picks out the different v values for a number of values of s that I chose from your question using the aggregated map.
I'm trying to create an external table in Athena using quoted CSV file stored on S3. The problem is, that my CSV contain missing values in columns that should be read as INTs. Simple example:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
CREATE TABLE DEFINITION:
CREATE EXTERNAL TABLE schema.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
CREATE TABLE statement runs fine but as soon as I try to query the table, I'm getting HIVE_BAD_DATA: Error parsing field value ''.
I tried making the CSV look like this (quote empty string):
"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"
But it's not working.
Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.
Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.
It works, when you specify all columns as STRING but that's not what I need.
Therefore, the question is, is there any way to read a quoted CSV (quoting is important as my real data is much more complex) to Athena with correct column specification?
Quick and dirty way to handle these data:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"
DDL:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The issue is that it is not handling the quote characters in the last field. Based on the documentation provided by AWS, this makes sense as the LazySimpleSerDe given the following from Hive.
I suspect the solution is using the following SerDe org.apache.hadoop.hive.serde2.RegexSerDe.
I will work on the regex later.
Edit:
Regex as promised:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;
Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.
Further Reading:
Stackoverflow - remove surrounding quotes from fields while loading data into hive
Athena - OpenCSVSerDe for Processing CSV
Unfortunately there is no way to get both support for quoted fields and support for null values in Athena. You have to choose either or.
You can use OpenCSVSerDe and type all columns as string, that will give you support for quoted fields, and emtpty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.
Or you can use LazySimpleSerDe and strip quotes at query time.
I would go for OpenCSVSerDe because you can always create a view with all the type conversion and use the view for your regular queries.
You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV
This worked for me. Use OpenCSVSerDe and convert all columns into string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/
I have created a table in AWS Athena like this:
CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
col1 string,
col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://bucket/test/'
In the bucket I put a simple CSV file with the following context:
rec1 col1,rec2 col2
rec2 col1,"rec2, col2"
rec3 col1,"rec3
col2"
When I run data preview request SELECT * FROM "default"."test_line_breaks" limit 10; then Athena returns the following response:
How should I set ROW FORMAT to properly handle line breaks within the field values? So that rec3\ncol2 appears in col2.
The problem here is that the OpenCSV Serializer-Deserializer
Does not support embedded line breaks in CSV files.
See this documentation from AWS.
However, it might be possible to use RegexSerDe. Just remember that this Deserializer will take "Java Flavored" Regex. So be sure to use an online Regex tool that supports that syntax in your debugging.
Edit: Still working on the syntax for dealing with the embedded line feed \n. However, here is a sample that handles two columns with optional quotes. The following regex "*([^"]*)"*,"*([^"]*)"* worked on your line with the embedded return carriage. However, I think the Presto Engine is only feeding it rec3 col1,"rec3. I continue working on it.
CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
col1 string,
col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)
STORED AS TEXTFILE
LOCATION 's3://.../47936191';