Here are two lines from my log files that I'm trying to match; I want to separate each line into four columns (date, hostname, command, status).
Each line is tab-delimited between the date, hostname, command, and status fields. The status column may contain spaces.
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
03-24-2014 fm504 CHECK-LOAD OK;SOFT;2;OK - load average: 54.61, 56.95
In Rubular (http://rubular.com/) my regex matches exactly as I want it; however, when I query my Hive table for the date column, I get the entire line, which leads me to believe that the regex doesn't match what Hive is looking for.
([^ ]*)\s*([^ ]*)\s*([^ ]*)\s*(.*)
And this is my CREATE TABLE statement, with the result of a SELECT query below:
CREATE EXTERNAL TABLE IF NOT EXISTS sys_results(
date STRING
,hostname STRING
,command STRING
,status STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*)\\s*([^ ]*)\\s*([^ ]*)\\s*(.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION '/user/sys_log_output/sys-results/';
select date from sys_results;
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
I figured it out. The Hive regex recognizes tabs using '\t'. I changed my input.regex expression to this:
"input.regex" = "([^ ])\t([^ ])\t([^ ])\t([^ ].)"
This Athena table correctly reads the first line of the file.
CREATE EXTERNAL TABLE `test_delete_email5`(
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string,
`col6` string,
`col7` string,
`col8` string,
`col9` string,
`col10` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'LINES TERMINATED BY' = '\n',
'ESCAPED BY' = '\\',
'quoteChar' = '\"'
) LOCATION 's3://testme162/email_backup/email5/'
TBLPROPERTIES ('has_encrypted_data'='false')
This table is not imported correctly due to HTML code found in the 5th column. Is there any other way?
It appears that your file contains a lot of multi-line text in the textbody field. This does not conform to the CSV standard (or at least, it cannot be understood by the OpenCSVSerde).
As a test, I made a simple file:
"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"
"one","two","three","four","five","six","seven","eight","nine","ten"
"one","two","three","four","five \" quote \" five2","six","seven","eight","nine","ten"
"one","two","three","four","five \
five2","six","seven","eight","nine","ten"
Row 1 is the header
Row 2 is normal
Row 3 has a field with \" escaped quotes
Row 4 has escaped newlines
I then ran the command from your question and pointed it to this data file.
Result:
Rows 1-3 (including the header row) were returned
Row 4 only worked until the \ -- data after that was lost
Bottom line: Your file format is not compatible with CSV format.
You might be able to find some Serde that can handle it, but OpenCSVSerde doesn't seem to understand it because rows are normally split by newlines.
I am trying to parse a string which is:
"297","298","Y","","299"
using the RegexSerDe, but I am unable to do so.
The table definition I have created is:
create external table test.test1
(a string,
b string,
c string,
d string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "\"\"|\"([^\"]+)\"")
The regex used in the serde properties looks promising on the regex test websites, but I am getting an exception while trying to read the table. Kindly help me out with this.
I know that this can easily be done using the CSV SerDe, but I am trying to figure out a bigger part of the problem, for which I have to use the RegexSerDe.
Thanks
In the regex there should be one capturing group per column.
Your data contains 5 columns and your table has 4, so you want to skip one column, right?
For example, this regex will work: with serdeproperties ('input.regex' = '^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$')
You can easily check it without creating a table, like this:
select regexp_replace('"297","298","Y","","299"','^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$','$1|$2|$3|$4');
OK
_c0
297|298|Y|299
select regexp_replace('"297","298","Y","this column is skipped","299"','^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$','$1|$2|$3|$4');
OK
_c0
297|298|Y|299
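Applied to the table definition from the question, an untested sketch of the DDL with this regex might look like the following (same serde as in the question; groups 1-3 map to a, b, c, the fourth quoted field in the data is skipped, and group 4 maps to d):
create external table test.test1
(a string,
 b string,
 c string,
 d string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ('input.regex' = '^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$');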
I want to query the syslog (basically it's my SQL error log) using Athena. Here is my sample data.
2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.17 Server Buffer pool extension is already disabled. No action is necessary.
2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.29 Server InitializeExternalUserGroupSid failed. Implied authentication will be disabled.
So I created a table like this:
CREATE EXTERNAL TABLE IF NOT EXISTS bhuvi (
timestamp string,
date string,
time string,
user string,
message string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\w+)\\s+(.*\\-.*\\-.*)\\s+(\\d+:\\d+:\\d+.\\d+)\\s+(\\w+)\\s+(\\w+)"
) LOCATION 's3://log/sql_error_log_stream/';
But it didn't give any results. Can someone help me to figure it out?
A few observations:
The timestamp '2019-09-21T12:19:32.107Z' is not in Hive TIMESTAMP format; define it as STRING in the DDL and convert it like in this answer: https://stackoverflow.com/a/23520257/2700344
message in the serde is represented as a (\w+) group. This is wrong because the message contains spaces. Try (.*?)$ instead of (\\w+) for the message field.
Try this regexp:
(\\S+)\\s+(.*-.*-.*)\\s+(\\d+:\\d+:\\d+\\.\\d+)\\s+(\\S+)\\s+(.*?)$
Use (\\S+) - it matches anything except whitespace characters.
(\\w+) does not work for the first group because \\w matches only alphanumeric characters and the underscore, and the first group (the timestamp) also contains - and : characters.
Also, a hyphen - outside of a character class [in square brackets] does not need escaping, while a dot . has a special meaning and needs escaping when used as a literal dot: https://stackoverflow.com/a/57890202/2700344
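As in the earlier answer, you can sanity-check the regex against one sample line without creating the table. This is a sketch in Hive-CLI style (backslash escaping in string literals may differ in Athena's engine), with the expected result shown as a comment rather than an actual run:
select regexp_replace(
    '2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.17 Server Buffer pool extension is already disabled. No action is necessary.',
    '(\\S+)\\s+(.*-.*-.*)\\s+(\\d+:\\d+:\\d+\\.\\d+)\\s+(\\S+)\\s+(.*?)$',
    '$1|$2|$3|$4|$5');
-- expected: 2019-09-21T12:19:32.107Z|2019-09-21|12:19:24.17|Server|Buffer pool extension is already disabled. No action is necessary.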
I want to create a table in Hive:
CREATE TABLE table (
a string
,b string
)
PARTITIONED BY ( pr_filename string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='reg_exp') ;
but the source data has a multi-line header whose lines start with "#":
# <some comment>
#
# <some other comments>
# <some other comments>
# <some other comments>
#
a,b
1,2
8,2
8,9
Is it possible to write reg_exp to filter out all rows starting with a chosen character, or do I have to use a temporary table to deal with this header?
If you try to filter like this:
'input.regex'='^([^#]+),([a-zA-Z])' -- the first group is everything except #
The rows will be returned anyway, with NULLs; you can filter such records out.
RegexSerDe JavaDocs says:
In deserialization stage, if a row does not match the regex, then all columns in the row will be NULL. If a row matches the regex but has less than expected groups, the missing groups will be NULL. If a row matches the regex but has more than expected groups, the additional groups are just ignored
The solution is to use an intermediate table and filter rows when selecting from it.
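A minimal sketch of that intermediate-table approach, assuming the input.regex is adjusted so that the real data rows parse into one group per column (the table names here are only illustrative):
select a, b
from staging_table          -- external table defined with the RegexSerDe
where a is not null         -- '#' comment lines fail the regex and come back as NULLs
  and a <> 'a';             -- also drop the header row if it happened to match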
I am using Hive's RegexSerDe (from the serde2 package), which can use a regex to split data.
I am trying to write a regex to split the following data:
123~|`sample~|`text
12~|`ss|gs~|`max`s
The delimiter or field separator is ~|`.
So far I have come up with this:
[^(?!^\~\|`$)]*\~\|`[^(?!\~\|`)]**\~\|`[^(?!\~\|`)]*
but this is not working. The error message is:
java.io.IOException:
org.apache.hadoop.hive.serde2.SerDeException:
Number of matching groups doesn't match the number of columns
How can I fix my Regex?
I think this is the regex you are looking for:
(.*?)~\\|`(.*?)~\\|`(.*)
In case you are worried about screening out lines in your data which might have a number of fields other than 3, you can add ^ and $ to the beginning and end of the regular expression, respectively. That shouldn't be needed if you are pretty confident about your data, however.
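For reference, a sketch of that anchored variant would be:
"input.regex" = "^(.*?)~\\|`(.*?)~\\|`(.*)$"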
Note that the escaping backslashes themselves have to be escaped since this is a Java string. So, testing with your data in a local file:
# cat test.data
123~|`sample~|`text
12~|`ss|gs~|`max`s
And this is how your data gets de-serialized/serialized:
hive> CREATE TABLE table_name (
> first STRING,
> second STRING,
> third STRING
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
> "input.regex" = "(.*?)~\\|`(.*?)~\\|`(.*)",
> "output.format.string" = "%1$s %2$s %3$s"
> );
OK
Time taken: 0.4 seconds
hive> LOAD DATA LOCAL INPATH 'test.data' INTO TABLE table_name;
Copying data from file:test.data
Copying file: file:test.data
Loading data to table default.table_name
Table default.table_name stats: [numFiles=1, numRows=0, totalSize=39, rawDataSize=0]
OK
Time taken: 0.601 seconds
hive> SELECT * FROM table_name;
OK
123 sample text
12 ss|gs max`s
Time taken: 0.382 seconds, Fetched: 2 row(s)
I hope this helps.