Hive - Regex for the SYSLOG/ERRORLOG - regex

I want to query the syslog(basically its my SQL error log) using Athena. here is my sample data.
2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.17 Server Buffer pool extension is already disabled. No action is necessary.
2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.29 Server InitializeExternalUserGroupSid failed. Implied authentication will be disabled.
So I created a table like this:
CREATE EXTERNAL TABLE IF NOT EXISTS bhuvi (
timestamp string,
date string,
time string,
user string,
message stringg
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\w+)\\s+(.*\\-.*\\-.*)\\s+(\\d+:\\d+:\\d+.\\d+)\\s+(\\w+)\\s+(\\w+)"
) LOCATION 's3://log/sql_error_log_stream/';
But it didn't give any results. Can someone help me to figure it out?

Few observations:
Timestamp '2019-09-21T12:19:32.107Z' is not in hive TIMESTAMP format, define it as STRING in DDL and convert like in this answer: https://stackoverflow.com/a/23520257/2700344
message in the serde is represented as (\w+) group. This is wrong because message contains spaces. Try (.*?)$ instead of (\\w+) for message field.
Try this regexp:
(\\S+)\\s+(.*-.*-.*)\\s+(\\d+:\\d+:\\d+\\.\\d+)\\s+(\\S+)\\s+(.*?)$
Use (\\S+) - this means everything except spaces.
(\\w+) does not work for the first group because \\w matches any alphanumerical character and the underscore only, and first group (timestamp) contains - and : characters also.
Also hyphen - if outside of character class [in square brackets] does not need shielding. and Dot . has a special meaning and needs shielding when used as dot literally: https://stackoverflow.com/a/57890202/2700344

Related

Oracle regex and replace

I have varchar field in the database that contains text. I need to replace every occurrence of a any 2 letter + 8 digits string to a link, such as VA12345678 will return /cs/page.asp?id=VA12345678
I have a regex that replaces the string but how can I replace it with a string where part of it is the string itself?
SELECT REGEXP_REPLACE ('test PI20099742', '[A-Z]{2}[0-9]{8}$', 'link to replace with')
FROM dual;
I can have more than one of these strings in one varchar field and ideally I would like to have them replaced in one statement instead of a loop.
As mathguy had said, you can use backreferences for your use case. Try a query like this one.
SELECT REGEXP_REPLACE ('test PI20099742', '([A-Z]{2}[0-9]{8})', '/cs/page.asp?id=\1')
FROM DUAL;
For such cases, you may want to keep the "text to add" somewhere at the top of the query, so that if you ever need to change it, you don't have to hunt for it.
You can do that with a with clause, as shown below. I also put some input data for testing in the with clause, but you should remove that and reference your actual table in your query.
I used the [:alpha:] character class, to match all letters - upper or lower case, accented or not, etc. [A-Z] will work until it doesn't.
with
text_to_add (link) as (
select '/cs/page.asp?id=' from dual
)
, sample_strings (str) as (
select 'test VA12398403 and PI83048203 to PT3904' from dual
)
select regexp_replace(str, '([[:alpha:]]{2}\d{8})', link || '\1')
as str_with_links
from sample_strings cross join text_to_add
;
STR_WITH_LINKS
------------------------------------------------------------------------
test /cs/page.asp?id=VA12398403 and /cs/page.asp?id=PI83048203 to PT3904

python regex fetching column details from ddl

USE test_db2
CREATE TABLE test_table2
(
Subscn_Purch_Id BIGINT COMMENT 'from deserializer',
Price_Amt DECIMAL(38,18),
Purch_Line_Item_Id BIGINT,
Subscn_Purch_Status_Id BIGINT COMMENT 'from defdf',
Offer_Coupon_Id BIGINT, -- INTRO OFFER
Offer_Period_Hrs BIGINT,
discount_offer_id STRING -- DISCOUNT
)
PARTITIONED BY (
testcol bigint
)
ROW FORMAT SERDE
'eeee'
STORED AS INPUTFORMAT
'rrrr'
OUTPUTFORMAT
'tttt';
from the above DDL I need to get the column level details.
ie,
column name,
data type,
data length (if any present)
data precision (if any present)
column comment (if any present)
I don't need the comment details like '-- INTRO OFFER' and '-- DISCOUNT' in the above sample.
I have tried using the regex
\s*(\w+)\s*(\w+)(?:\s*\,\s*)?(?:\((\d+)(?:,\s?(\d+))?\))?(?:\s*\,\s*)?(?:(?=(?:.*COMMENT\s*)\'(.*)\'(?:\,|\))))
this regex is fetching the details that have 'COMEMNT' word In it but not the others.on adding '?' at the end of this regex ,its fetching the details which I son't need.
how to achieve this.
attaching the regex101 link :
https://regex101.com/r/QfOCfj/3
The regex by #Jan might get you what you want, but there is potentially a much cleaner way to go about this. You may try just querying the information schema tables directly in DB2.
SELECT
COLNO,
SYSTEM_COLUMN_NAME,
DATA_TYPE,
COALESCE(PRECISION, LENGTH) AS length,
SMALLINT(SCALE) AS scale,
STORAGE
FROM QSYS2/SYSCOLUMNS
WHERE
SYSTEM_TABLE_SCHEMA = 'your_db' AND
SYSTEM_TABLE_NAME = 'test_table2';
Generally, it is usually not a good idea to try to parse these strings with regular expressions. That being said, you could try to use the newer regex module which supports \G:
(?:\G(?!\A)|\()
\s*
(?P<column_name>\w+)\s+
(?P<column_type>\w+)
(?:
\(
(?P<column_size>[^()]+)
\)
)?
[, ]+
.*
See a demo on regex101.com and mind the modifiers.
Alternatively - if installing another module is not an option - use two expressions:
Fetch every block of ( and ) first recursively
Analyze that block with the above expression minus the first line

Regular Expression in redshift

I have a data which is being fed in the below format -
2016-006-011 04:58:22.058
This is an incorrect date/timestamp format and in order to convert this to a right one as below -
2016-06-11 04:58:22.058
I'm trying to achieve this using regex in redshift. Is there a way to remove the additional Zero(0) in the date and month portion using regex. I need something more generic and not tailed for this example alone as date will vary.
The function regexp_replace() (see documentation) should do the trick:
select
regexp_replace(
'2016-006-011 04:58:22.058' -- use your date column here instead
, '\-0([0-9]{2}\-)0([0-9]{2})' -- matches "-006-011", captures "06-" in $1, "11" in $2
, '-$1$2' -- inserts $1 and $2 to give "-06-11"
)
;
And so the result is, as required:
regexp_replace
-------------------------
2016-06-11 04:58:22.058
(1 row)

Hive Serde Regex not recognizing string pattern

Here are two lines from my log files that I'm trying to match. I'm trying to separate each line into four columns (date, hostname, command, status).
The line is tab deliminated between date, hostname, command, and status in the line. The status column may contain spaces.
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
03-24-2014 fm504 CHECK-LOAD OK;SOFT;2;OK - load average: 54.61, 56.95
In Rubular (http://rubular.com/) my regex expression matches exactly as I want it; however after I query my hive table for the date column, I get the entire line which leads me to believe that the regex statement doesn't match what HIVE is looking for.
([^ ])\s([^ ])\s([^ ])\s(.*)
And this is my create table statement with results from select query:
CREATE EXTERNAL TABLE IF NOT EXISTS sys_results(
date STRING
,hostname STRING
,command STRING
,status STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*)\\s*([^ ]*)\\s*([^ ]*)\\s*(.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION '/user/sys_log_output/sys-results/';
select date from sys_results;
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
I figured it out. hive regex recognizes tabs using '\t' I changed my input.regex expression to this.
"input.regex" = "([^ ])\t([^ ])\t([^ ])\t([^ ].)"

How to handle fields enclosed within quotes(CSV) in importing data from S3 into DynamoDB using EMR/Hive

I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields which are enclosed within double quotes and separated by comma.
While creating external table in hive, I am able to specify delimiter as comma but how do I specify that fields are enclosed within quotes?
If I don’t specify, I see that values in DynamoDB are populated within two double quotes ““value”” which seems to be wrong.
I am using following command to create external table. Is there a way to specify that fields are enclosed within double quotes?
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '","' LOCATION 's3://emrTest/folder';
Any suggestions would be appreciated.
Thanks
Jitendra
I was also stuck with the same issue as my fields are enclosed with double quotes and separated by semicolon(;). My table name is employee1.
So I have searched with links and I have found perfect solution for this.
We have to use serde for this. Please download serde jar using this link : https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar
then follow below steps using hive prompt :
add jar path/to/csv-serde.jar;
create table employee1(id string, name string, addr string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\;",
"quoteChar" = "\"")
stored as textfile
;
and then load data from your given path using below query:
load data local inpath 'path/xyz.csv' into table employee1;
and then run :
select * from employee1;
Now you will see the magic. Thanks.
Following code solved same type of problem
CREATE TABLE TableRowCSV2(
CODE STRING,
PRODUCTCODE STRING,
PRICE STRING
)
COMMENT 'row data csv'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
If you're stuck with the CSV file format, you'll have to use a custom SerDe; and here's some work based on the opencsv libarary.
But, if you can modify the source files, you can either select a new delimiter so that the quoted fields aren't necessary (good luck), or rewrite to escape any embedded commas with a single escape character, e.g. '\', which can be specified within the ROW FORMAT with ESCAPED BY:
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LOCATION 's3://emrTest/folder';
Hive now includes an OpenCSVSerde which will properly parse those quoted fields without adding additional jars or error prone and slow regex.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
Hive doesn't support quoted strings right out of the box. There are two approaches to solving this:
Use a different field separator (e.g. a pipe).
Write a custom InputFormat based on OpenCSV.
The faster (and arguably more sane) approach is to modify your initial the export process to use a different delimiter so you can avoid quoted strings. This way you can tell Hive to use an external table with a tab or pipe delimiter:
CREATE TABLE foo (
col1 INT,
col2 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
Use the csv-serde-0.9.1.jar file in your hive query, see
http://illyayalovyy.github.io/csv-serde/
add jar /path/to/jar_file
Create external table emrS3_import_1(col1 string, col2 string, col3 string, col4 string) row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties
(
"separatorChar" = "\;",
"quoteChar" = "\"
) stored as textfile
tblproperties("skip.header.line.count"="1") ---to skip if have any header file
LOCATION 's3://emrTest/folder';
There can be multiple solutions to this problem.
Write custom SerDe class
Use RegexSerde
Remove escaped delimiter chars from data
Read more at http://grokbase.com/t/hive/user/117t2c6zhe/urgent-hive-not-respecting-escaped-delimiter-characters