Log analysis in Pig - regex

I have a .txt file which looks like this:
2017-06-22 23:19:05,758 use database stocks
2017-06-22 23:21:27,056 CREATE TABLE stocksdata ( stock_exchange string,
stock_symbol string, date TIMESTAMP,
The regex I wrote is ^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}:\\d{2}:\\d{2}),(\\d{3})\\s((?i)(create|select|use).*)$.
But my output is
2017-06-22 23:19:05,758 use database stocks
2017-06-22 23:21:27,056 CREATE TABLE stocksdata ( stock_exchange string,
It is not capturing the continuation line of the input, i.e. stock_symbol string, date TIMESTAMP,. I need to capture this line as well.

Try using the following pattern:
^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}),(\d{3})\s((?i)(create|select|use)[\s\S]*)$
I replaced the .* at the end with [\s\S]*, because the latter also matches newline characters, which . does not by default.
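If you want a quick way to see the difference outside Pig, the two patterns can be compared in Python; a minimal sketch (the inline (?i) flag is moved to re.IGNORECASE, since Python's re module only accepts global inline flags at the start of an expression):
import re
# Two physical lines from the log; the CREATE statement continues on the second line.
text = ("2017-06-22 23:21:27,056 CREATE TABLE stocksdata ( stock_exchange string,\n"
        "stock_symbol string, date TIMESTAMP,")
pattern = r"^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}),(\d{3})\s((create|select|use)[\s\S]*)$"
m = re.search(pattern, text, re.IGNORECASE)
# Group 4 includes the continuation line, because [\s\S] also matches '\n' while . does not.
print(m.group(4))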

Finally, this expression worked for me:
(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}),(\d{3})\s(\w{4})\s(.)(()(create\s|select\s|use\s).(.\s\S?\D.\s\D)*)
Thank you for the replies.

Related

Select the next line of the matched pattern in clob column using oracle regular expression

I have a CLOB column "details" in table xxx. I want to select the line that follows a matched pattern using a regular expression.
The input text (CLOB data) looks like below (each value is on its own line):
MODEL_DATA 1
TEST1:
NONE
TEST2:
NONE
INFO:
SERVICES,VALUED-YES
TYPE:
NONE
I tried to use INFO: as the pattern to match and retrieve the next line of the text, but I could not do it with a regular expression function. Please help me resolve this.
Expected output:
SERVICES,VALUED-YES
You can use the below to get the details
select replace(regexp_substr(details,'INFO:'||chr(10)||'.+'),'INFO:')
from your_table;
You can also try the below to be operating-system independent:
select replace(regexp_substr(details,'INFO:('||chr(10)||'|'||chr(13)||chr(10)||').+'),'INFO:')
from your_table;
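The same idea (match the marker plus the line that follows it, then strip the marker) can also be checked outside the database; a minimal Python sketch, assuming the CLOB content has already been fetched into a plain string:
import re
clob_text = "MODEL_DATA 1\nTEST1:\nNONE\nTEST2:\nNONE\nINFO:\nSERVICES,VALUED-YES\nTYPE:\nNONE"
# Match 'INFO:', an optional carriage return plus newline, then capture the following line.
m = re.search(r"INFO:\r?\n(.+)", clob_text)
if m:
    print(m.group(1))  # SERVICES,VALUED-YES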

Hive - Regex for the SYSLOG/ERRORLOG

I want to query the syslog (basically it's my SQL error log) using Athena. Here is my sample data:
2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.17 Server Buffer pool extension is already disabled. No action is necessary.
2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.29 Server InitializeExternalUserGroupSid failed. Implied authentication will be disabled.
So I created a table like this:
CREATE EXTERNAL TABLE IF NOT EXISTS bhuvi (
timestamp string,
date string,
time string,
user string,
message string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\w+)\\s+(.*\\-.*\\-.*)\\s+(\\d+:\\d+:\\d+.\\d+)\\s+(\\w+)\\s+(\\w+)"
) LOCATION 's3://log/sql_error_log_stream/';
But it didn't give any results. Can someone help me to figure it out?
A few observations:
The timestamp '2019-09-21T12:19:32.107Z' is not in Hive TIMESTAMP format; define it as STRING in the DDL and convert it as in this answer: https://stackoverflow.com/a/23520257/2700344
The message in the SerDe is represented as a (\\w+) group. This is wrong because the message contains spaces. Try (.*?)$ instead of (\\w+) for the message field.
Try this regexp:
(\\S+)\\s+(.*-.*-.*)\\s+(\\d+:\\d+:\\d+\\.\\d+)\\s+(\\S+)\\s+(.*?)$
Use (\\S+), which matches everything except whitespace.
(\\w+) does not work for the first group because \\w matches only alphanumeric characters and the underscore, and the first group (the timestamp) also contains - and : characters.
Also, a hyphen - outside of a character class [square brackets] does not need escaping, while a dot . has a special meaning and needs escaping when used as a literal dot: https://stackoverflow.com/a/57890202/2700344
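To sanity-check the groups before putting the expression into the SerDe property, you can test it outside Athena; a minimal Python sketch (single backslashes here, because there is no SQL string escaping involved):
import re
# Same expression as in input.regex, with the doubled backslashes reduced to single ones.
pattern = r"(\S+)\s+(.*-.*-.*)\s+(\d+:\d+:\d+\.\d+)\s+(\S+)\s+(.*?)$"
line = ("2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.17 Server "
        "Buffer pool extension is already disabled. No action is necessary.")
# Prints five groups: raw timestamp, date, time, user, and the full message.
print(re.match(pattern, line).groups())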

Adding a space within a line in file with a specific pattern

I have a file with some data as follows:
795 0.16254624E+01-0.40318151E-03 0.45064186E+04
I want to add a space before the third number using search and replace, so it becomes:
795 0.16254624E+01 -0.40318151E-03 0.45064186E+04
The regular expression for the search is \d - \d. But what should I write in the replace field so that I get the above output? I have over 4000 similar lines and cannot do this manually. Also, can I do it in Python, if possible?
Perhaps you could use findall to get your matches and then join them with a whitespace to return a string where your values are separated by a space.
[+-]?\d+(?:\.\d+E[+-]\d+)?\b
import re
# Match an optionally signed integer, optionally followed by a decimal part and an exponent.
regex = r"[+-]?\d+(?:\.\d+E[+-]\d+)?\b"
test_str = "795 0.16254624E+01-0.40318151E-03 0.45064186E+04"
matches = re.findall(regex, test_str)
# Re-join the individual numbers with single spaces.
print(" ".join(matches))
Demo
You could do it very easily in MS Excel.
copy the content of your file into a new Excel sheet, in one column
select the complete column and, from the Data ribbon, select Text to Columns
a wizard dialog will appear; select Fixed width, then Next
click on the position where you want to add the new space, to tell Excel to split the text into a new column after that position, and click Next
select each column header, choose Text as the column data format to keep all formatting, and click Finish
you can then copy the new columns or export them to a new text file

Finding/replacing values for a specific column in Notepad++

I think I need RegEx for this, but it is new to me...
What I have in a text file are 200 rows of data: 100 INSERT INTO rows and 100 corresponding VALUES rows.
So it looks like this:
INSERT INTO DB1.Tbl1 (Col1, Col2, Col3........Col20)
VALUES(123, 'ABC', '201450204 15:37:48'........'DEF')
What I want to do is replace every Date/Timestamp value in Col3 with this: CURRENT_TIMESTAMP. The Date/Timestamps are NOT the same for every row. They differ, but they are all in Column 3.
There are 100 records in this table, some other tables have more, that's why I am looking for a shortcut to do this.
Try this:
search with (INSERT[^,]+,[^,]+,)([^,]+,)([^']+'[^']+'[^']+)('[^']+',) and replace with $1$3, with the Regular expression search mode checked in Notepad++
Live demo
With
"VALUES" being right at the beginning of the line,
"Col1" values being all numeric, and
no single quotes inside the values for "Col2"
you can search for
^(VALUES\(\d+, '[^']+', )'(\d{9} \d{2}:\d{2}:\d{2})'
and replace with
\1CURRENT_TIMESTAMP
as shown on RegEx101. (Remember, Notepad++ uses the backslash in the replacement string…)
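If you would rather script the change than run it in Notepad++, the same search and replace can be expressed, for example, in Python; a minimal sketch with a made-up sample row:
import re
line = "VALUES(123, 'ABC', '201450204 15:37:48', 'DEF')"  # hypothetical row for illustration
pattern = r"^(VALUES\(\d+, '[^']+', )'(\d{9} \d{2}:\d{2}:\d{2})'"
# Group 1 keeps everything before the timestamp; the quoted timestamp itself becomes CURRENT_TIMESTAMP.
print(re.sub(pattern, r"\g<1>CURRENT_TIMESTAMP", line))
# VALUES(123, 'ABC', CURRENT_TIMESTAMP, 'DEF')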
Personally, I'd consider going straight to the database and fixing the timestamps there, especially if you have more data to handle. (See my comment above for the general idea.)
Please comment, if and as further detail / adjustment is required.

Regex to remove footer using wildcards

Ok - this is well beyond my limited knowledge of regular expressions. We receive a report from a banking entity in a fixed-width text file format. Unfortunately their system exports page headers with the data file that must be removed before processing on our end. The page headers start and end with the same text but the content changes (dates and page numbers). A typical one looks like:
00007xxxxx LAST1,FIRST1 111111 20120930
ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16
MRK014 Report Date: 10/04/12
Acct# Name SH. Balance QTR (YYYYMMDD)
----------------------------------------------------------------------------------------------------
00007xxxxx LAST2,FIRST2 222222 20120930
So each header starts with "ABCD" (actually the name of the bank, just removed here for privacy) and ends with the row of -------------------.
What I need to get it down to is the customer data on two rows (00007xxxxx - those account numbers change per person).
So I need to select from the "ABCD" to the end of the "---" to remove that block of text.
Try this regex. This is Java code; you can use the given pattern in your language.
str = str.replaceAll("ABCD((.*?)[\n\r])+(\\-*)", "");
Where str contains your above data. Lines are separated by \n, I assume.
To ensure you are removing the correct part of the report, I would go with a more complicated regex pattern.
Use the regex pattern
(?<=[\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with empty string.
However, if your environment does not support regex lookbehind, then you have to use the pattern:
([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with the first group.
For example in JavaScript it would be:
str.replace(/([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+/g, "$1")
Test this code here.
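For reference, the lookbehind variant can also be scripted; a minimal Python sketch using the sample header from the question (Python's re module supports the fixed-width lookbehind used here):
import re
report = ("00007xxxxx LAST1,FIRST1 111111 20120930\n"
          "ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16\n"
          "MRK014 Report Date: 10/04/12\n"
          "Acct# Name SH. Balance QTR (YYYYMMDD)\n"
          "----------------------------------------\n"
          "00007xxxxx LAST2,FIRST2 222222 20120930")
# Strip everything from 'ABCD' through the dashed separator line and the newline after it.
cleaned = re.sub(r"(?<=[\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]-+[\n\r]+", "", report)
print(cleaned)  # only the two customer data rows remain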