Regex for Parsing vertical CSV in Athena - regex

So, I've been trying to load csvs from a s3 bucket into Athena. However, the way the csv are designer looks like the following
ns=2;s=A_EREG.A_EREG.A_PHASE_PRESSURE,102.19468,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE,50.0,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_CURRENT,15.919731,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE,0.22070877,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.ACTIVE_PWR,0.0,12/12/19 00:00:01.2144275 GMT
The csv is just one record. Each column of the record has a value associated to it, which sits between two commas between the timestamp and the name, which I am trying to capture.
I've been trying to parse it using Regex Serde and I got to this Regular expression:
((?<=\,).*?(?=\,))
demo
I want the output of the above to be:
col_a col_b col_c col_d col_e
102.19468 50.0 15.919731 0.22070877 0.0
My DDL query looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS
(...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "\(?<=\,).*?(?=\,)"
) LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false');
The table creation Query above works succesfully, but when I try to preview my table I get the following error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
I am fairly new to Hive and Regex so I don't know what is going on. Can someone help me out here?
Thanks in advance,
BR

One column in Hive table corresponds to one capturing group in the regex. If you want to select single column containing everything between commas then this will work:
'.*,(.*),.*'

Athena serdes require that each record in the input is a single line. Multiline records are not supported.
What you can do instead is to create a table which maps each line in your data to a row in a table, and use a view to pivot the rows that belong together into a single row.
I'm going to assume that the ns field at the start of the lines is an ID, if not, I assume there is some other thing identifying which lines belong together that you can use.
I used your demo to create a regex that matched all the fields of each line and came up with ns=(\d);s=([^,]+),([^,]+),(.+) (see https://regex101.com/r/HnjnxK/5).
CREATE EXTERNAL TABLE my_data (
ns string,
s string,
v double,
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "ns=(\\d);s=([^,]+),([^,]+),(.+)"
)
LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false')
Apologies if the regex isn't correctly escaped, I'm just typing this into Stack Overflow.
This table has four columns, corresponding to the four fields in each line. I've named then ns and s from the data, and v for the numerical value, and dt for the date. The date needs to be typed as a string since it's not in a format Athena natively understands.
Assuming that ns is a record identifier you can then create a view that pivots rows with different values for s to columns. You have to do this the way you want it to, the following is of course just a demonstration:
CREATE VIEW my_pivoted_data AS
WITH data_aggregated_by_ns AS (
SELECT
ns,
map_agg(array_agg(s), array_agg(v)) AS s_and_v
FROM my_data
GROUP BY ns
)
SELECT
ns,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_PRESSURE') AS phase_pressure,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE') AS phase_ref_signal_to_valve,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_CURRENT') AS phase_sec_current,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE') AS phase_sec_voltage,
element_at(s_and_v, 'A_EREG.A_EREG.ACTIVE_PWR') AS active_pwr
FROM data_aggregated_by_ns
Apologies if there are syntax errors in the SQL above.
What this does is that it creates a view (but start by trying it out as a query using everything from WITH and onwards), which has two parts to it.
The first part, the first SELECT results in rows that aggregate all the s and v values for each value of ns into a map. Try to run this query by itself to see how the result looks.
The second part, the second SELECT uses the results of the first part and just picks out the different v values for a number of values of s that I chose from your question using the aggregated map.

Related

Athena SQL create table with text data

Below is how the data looks
Flight Number: SSSVAD123X Date: 2/8/2020 1:04:40 PM Page[s] Printed: 1 Document Name: DownloadAttachment Print Driver: printermodel (printer driver)
I need help creating an Athena SQL create table with in below format
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
this is new to me, any direction towards solution will work
You may be able to use a regex serde to parse your files. It depends on the shape of your data. You only provide a single line so this assumes that every line in your data files look the same.
Here's the Athena documentation for the feature: https://docs.aws.amazon.com/athena/latest/ug/apache.html
You should be able to do something like the following:
CREATE EXTERNAL TABLE flights (
flight_number STRING,
`date` STRING,
pages_printed INT,
document_name STRING,
print_driver STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^Flight Number:\\s+(\\S+)\\s+Date:\\s+(\\S+)\\s+Page\\[s\\] Printed:\\s+(\\S+)\\s+Document Name:\\s+(\\S+)\\s+Print Driver:\\s+(\\S+)\\s+\\(printer driver\\)$"
) LOCATION 's3://example-bucket/some/prefix/'
Each capture group in the regex will map to a column, in order.
Since I don't have access to your data I can't test the regex, unfortunately, so there may be errors in it. Hopefully this example is enough to get you started.
First, make sure your data format uses tab spacing between columns because your sample doesn't seem to have a consistent separator.
Flight Number Date Pages Printed Document Name Print Driver
SSSVAD123X 2/8/2020 1:04:40 PM 1 DownloadAttachment printermodel
As per AWS documentation, use the LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files if your data does not include values enclosed in quotes. You don't need to make it complicated using Regex.
Reference: https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html
As LazySimpleSerDe is the default used by AWS Athena, you don't even need to declare it, see the create table statement for your data sample:
CREATE EXTERNAL TABLE IF NOT EXISTS `mydb`.`mytable` (
`Flight Number` STRING,
`Date` STRING,
`Pages Printed` INT,
`Document Name` STRING,
`Print Driver` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION
's3://awsexamplebucket1-logs/AWSLogs/'
You can use an online generator to help you in the future: https://www.hivetablegenerator.com/
From the generator page: "Easily convert any JSON (even complex Nested ones), CSV, TSV, or Log sample file to an Apache HiveQL DDL create table statement."

Apache Hive regEx serde: proper regex for a mixed format (json)

im trying to create a AWS Athena table using RegexSerDe.. due to some export issues i cannot use JsonSerDe.
2019-04-11T09:05:16.775Z {"timestamp":"data0","level":"data1","thread":data2","logger":"data3","message":"data4","context":"data5"}
I was trying to obtain json values with a regex, but without any luck.
CREATE EXTERNAL TABLE IF NOT EXISTS dsfsdfs.mecs3(
`timestamp` string,
`level` string,
`thread` string,
`logger` string,
`message` string,
`context` string
)
)ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "[ :]+(\\"[^\"]*\\")"
)LOCATION 's3://thisisates/'
Error: HIVE_CURSOR_ERROR: Number of matching groups doesn't match the
number of columns
Would be great some help as i'm not an expert in regex.
Thanks and BR.
Getting this working will probably be very hard - even if you can write a regex that will capture the columns out of the JSON structure, can you guarantee that all JSON documents will be rendered with the properties in the same order? JSON itself considers {"a": 1, "b": 2} and {"b": 2, "a": 1} to be equivalent, so many JSON libraries don't guarantee, or even care about ordering.
Another approach to this is to create a table with two columns: timestamp and data, as a regex table with a regex with two capture groups, the timestamp and the rest of the line – or possibly as a CSV table if the character after the timestamp is a tab (if it's a space it won't work since the JSON will contain spaces):
CREATE EXTERNAL TABLE IF NOT EXISTS mecs3_raw (
`timestamp` string,
`data` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(\\S+) (.+)$"
)
LOCATION 's3://thisisates/'
(the regex assumes that there is a space between the timestamp and the JSON structure, change it as needed).
That table will not be very usable by itself, but what you can do next is to create a view that extracts the properties from the JSON structure:
CREATE VIEW mecs3 AS
SELECT
"timestamp",
JSON_EXTRACT_SCALAR("data", '$.level') AS level,
JSON_EXTRACT_SCALAR("data", '$.thread') AS thread,
JSON_EXTRACT_SCALAR("data", '$.logger') AS logger,
JSON_EXTRACT_SCALAR("data", '$.message') AS message,
JSON_EXTRACT_SCALAR("data", '$.context') AS context
FROM mecs3_raw
(mecs3_raw is the table with timestamp and data columns)
This will give you what you want and will be much less error prone.
Try Regex: (?<=")[^\"]*(?=\" *(?:,|}))
Demo

How to read quoted CSV with NULL values into Amazon Athena

I'm trying to create an external table in Athena using quoted CSV file stored on S3. The problem is, that my CSV contain missing values in columns that should be read as INTs. Simple example:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
CREATE TABLE DEFINITION:
CREATE EXTERNAL TABLE schema.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
CREATE TABLE statement runs fine but as soon as I try to query the table, I'm getting HIVE_BAD_DATA: Error parsing field value ''.
I tried making the CSV look like this (quote empty string):
"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"
But it's not working.
Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.
Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.
It works, when you specify all columns as STRING but that's not what I need.
Therefore, the question is, is there any way to read a quoted CSV (quoting is important as my real data is much more complex) to Athena with correct column specification?
Quick and dirty way to handle these data:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"
DDL:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The issue is that it is not handling the quote characters in the last field. Based on the documentation provided by AWS, this makes sense as the LazySimpleSerDe given the following from Hive.
I suspect the solution is using the following SerDe org.apache.hadoop.hive.serde2.RegexSerDe.
I will work on the regex later.
Edit:
Regex as promised:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;
Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.
Further Reading:
Stackoverflow - remove surrounding quotes from fields while loading data into hive
Athena - OpenCSVSerDe for Processing CSV
Unfortunately there is no way to get both support for quoted fields and support for null values in Athena. You have to choose either or.
You can use OpenCSVSerDe and type all columns as string, that will give you support for quoted fields, and emtpty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.
Or you can use LazySimpleSerDe and strip quotes at query time.
I would go for OpenCSVSerDe because you can always create a view with all the type conversion and use the view for your regular queries.
You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV
This worked for me. Use OpenCSVSerDe and convert all columns into string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/

Column names containing dots in Spectrum

I created a customers table with columns has account_id.cust_id, account_id.ord_id and so on.
My create external table query was as follows:
CREATE EXTERNAL TABLE spectrum.customers
(
"account_id.cust_id" numeric,
"account_id.ord_id" numeric
)
row format delimited
fields terminated by '^'
stored as textfile
location 's3://awsbucketname/test/';
SELECT "account_id.cust_id" FROM spectrum.customers limit 100
and I get an error as :
Invalid Operation: column account_id.cust_id does not exists in
customers.
Is there any way or syntax to write column names like account_id.cust_id (text.text) while creating the table or while writing the select query?
Please help.
PS: Single quotes, back ticks don't work either.

Getting Null Values in Hive Create & Load Query with REGEX

I have a Log file in which i need to store data with REGEX. I tried below query but loading all NULL values. I have checked REGEX with http://www.regexr.com/, its working fine for my data.
CREATE EXTERNAL TABLE IF NOT EXISTS avl(imei STRING,packet STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(IMEI\\s\\d{15} (\\b(\\d{15})([A-Z0-9]+)) )",
"output.format.string" = "%1$s %2$s"
)
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs:/user/user1/data' OVERWRITE INTO TABLE avl;
Please correct me here.
Sample Log:
[INFO_|01/31 07:19:29] IMEI 356307043180842
[INFO_|01/31 07:19:33] PacketLength = 372
[INFO_|01/31 07:19:33] Recv HEXString : 0000000000000168080700000143E5FC86B6002F20BC400C93C6F000FF000E0600280007020101F001040914B34238DD180028CD6B7801C7000000690000000143E5FC633E002F20B3000C93A3B00105000D06002C0007020101F001040915E64238E618002CCD6B7801C7000000640000000143E5FC43FE002F20AA800C9381700109000F06002D0007020101F001040915BF4238D318002DCD6B7801C70000006C0000000143E5FC20D6002F20A1400C935BF00111000D0600270007020101F001040916394238B6180027CD6B7801C70000006D0000000143E5FBF5DE002F2098400C9336500118000B0600260007020101F0010409174D42384D180026CD6B7801C70000006E0000000143E5FBD2B6002F208F400C931140011C000D06002B0007020101F001040915624238C018002BCD6B7801C70000006F0000000143E5FBAF8E002F2085800C92EB10011E000D06002B0007020101F0010409154C4238A318002BCD6B7801C700000067000700005873
Thanks.
With your current table definition, no regex will do what you're looking for. The reason is that your file_format is set to TEXTFILE, which splits up the input file by line (\r, \n, or \r\n), before the data ever gets to the SerDe.
Each line is then individually passed to RegexSerDe, matched against your regex, and any non-matches return NULL. For this reason, multiline regexes will not work using STORED AS TEXTFILE. This is also why you received all NULL rows: Because no single line of the input matched your entire regex.
One solution here might be pre-processing the data such that each record is only on one line in the input file, but that's not what you're asking for.
The way to do this in Hive is to use a different file_format:
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
TextInputFormat reads from the current configuration a configuration variable named textinputformat.record.delimiter. If you're using TextInputFormat, this variable tells Hadoop and Hive where one record ends and the next one begins.
Consequently, setting this value to something like EOR would mean that the input files are split on EOR, rather than by line. Each chunk generated by the split would then get passed to RegexSerDe as a whole chunk, newlines & all.
You can set this variable in a number of places, but if this is the delimiter for only this (and subsequent within the session) queries, then you can do:
SET textinputformat.record.delimiter=EOR;
CREATE EXTERNAL TABLE ...
...
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = ...
"output.regex" = ...
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION ...;
In your specific scenario, I can't tell what you might use for textinputformat.record.delimiter instead of EOF, since we were only given one example record, and I can't tell which field you're trying to capture second based on your regex.
If you can provide these two items (sample data with >1 records, and what you're trying to capture for packet), I might be able to help out more. As it stands now, your regex does not match the sample data you provided -- not even on the site you linked.