python regex fetching column details from ddl - regex

USE test_db2
CREATE TABLE test_table2
(
Subscn_Purch_Id BIGINT COMMENT 'from deserializer',
Price_Amt DECIMAL(38,18),
Purch_Line_Item_Id BIGINT,
Subscn_Purch_Status_Id BIGINT COMMENT 'from defdf',
Offer_Coupon_Id BIGINT, -- INTRO OFFER
Offer_Period_Hrs BIGINT,
discount_offer_id STRING -- DISCOUNT
)
PARTITIONED BY (
testcol bigint
)
ROW FORMAT SERDE
'eeee'
STORED AS INPUTFORMAT
'rrrr'
OUTPUTFORMAT
'tttt';
From the above DDL I need to get the column-level details,
i.e.:
column name,
data type,
data length (if any present)
data precision (if any present)
column comment (if any present)
I don't need the comment details like '-- INTRO OFFER' and '-- DISCOUNT' in the above sample.
I have tried using the regex
\s*(\w+)\s*(\w+)(?:\s*\,\s*)?(?:\((\d+)(?:,\s?(\d+))?\))?(?:\s*\,\s*)?(?:(?=(?:.*COMMENT\s*)\'(.*)\'(?:\,|\))))
This regex fetches the columns that contain the word 'COMMENT' but not the others. If I add '?' at the end of the regex, it also fetches details that I don't need.
How can I achieve this?
Attaching the regex101 link:
https://regex101.com/r/QfOCfj/3

The regex by @Jan might get you what you want, but there is potentially a much cleaner way to go about this. You may try just querying the information schema tables directly in DB2.
SELECT
COLNO,
SYSTEM_COLUMN_NAME,
DATA_TYPE,
COALESCE(PRECISION, LENGTH) AS length,
SMALLINT(SCALE) AS scale,
STORAGE
FROM QSYS2/SYSCOLUMNS
WHERE
SYSTEM_TABLE_SCHEMA = 'your_db' AND
SYSTEM_TABLE_NAME = 'test_table2';

In general, it is not a good idea to parse such strings with regular expressions. That being said, you could try the newer regex module, which supports \G:
(?:\G(?!\A)|\()
\s*
(?P<column_name>\w+)\s+
(?P<column_type>\w+)
(?:
\(
(?P<column_size>[^()]+)
\)
)?
[, ]+
.*
See a demo on regex101.com and mind the modifiers.
Alternatively - if installing another module is not an option - use two expressions:
Fetch every block of ( and ) first recursively
Analyze that block with the above expression minus the first line
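The two-step idea above can be sketched with only the standard re module. This is a minimal sketch, not a full DDL parser: the helper names (first_paren_block, parse_columns) and the group names are made up for illustration, and it assumes one column per line, no parentheses or "--" inside quoted comments, and that the first ( in the statement opens the column list. For DECIMAL(38,18) it reports 38 as length and 18 as scale.

```python
import re

DDL = """USE test_db2
CREATE TABLE test_table2
(
Subscn_Purch_Id BIGINT COMMENT 'from deserializer',
Price_Amt DECIMAL(38,18),
Purch_Line_Item_Id BIGINT,
Subscn_Purch_Status_Id BIGINT COMMENT 'from defdf',
Offer_Coupon_Id BIGINT, -- INTRO OFFER
Offer_Period_Hrs BIGINT,
discount_offer_id STRING -- DISCOUNT
)
PARTITIONED BY (
testcol bigint
)
ROW FORMAT SERDE
'eeee'
STORED AS INPUTFORMAT
'rrrr'
OUTPUTFORMAT
'tttt';"""


def first_paren_block(text):
    """Step 1: return the text inside the first balanced (...) block."""
    start = text.index("(")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[start + 1:i]
    raise ValueError("unbalanced parentheses")


# Step 2: one column definition per line (hypothetical group names).
COLUMN = re.compile(r"""
    ^\s*
    (?P<name>\w+)\s+
    (?P<type>\w+)
    (?:\(\s*(?P<length>\d+)\s*(?:,\s*(?P<scale>\d+)\s*)?\))?
    (?:\s+COMMENT\s+'(?P<comment>[^']*)')?
""", re.VERBOSE | re.IGNORECASE)


def parse_columns(ddl):
    columns = []
    for line in first_paren_block(ddl).splitlines():
        line = line.split("--", 1)[0]  # drop trailing "-- ..." comments
        match = COLUMN.match(line)
        if match:
            columns.append(match.groupdict())
    return columns


columns = parse_columns(DDL)
for column in columns:
    print(column)
```

Groups that are absent in a line (no length, no comment) come back as None, so the caller can tell "not present" apart from an empty value.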

Related

Hive - Regex for the SYSLOG/ERRORLOG

I want to query the syslog (basically it's my SQL error log) using Athena. Here is my sample data:
2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.17 Server Buffer pool extension is already disabled. No action is necessary.
2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.29 Server InitializeExternalUserGroupSid failed. Implied authentication will be disabled.
So I created a table like this:
CREATE EXTERNAL TABLE IF NOT EXISTS bhuvi (
timestamp string,
date string,
time string,
user string,
message string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\w+)\\s+(.*\\-.*\\-.*)\\s+(\\d+:\\d+:\\d+.\\d+)\\s+(\\w+)\\s+(\\w+)"
) LOCATION 's3://log/sql_error_log_stream/';
But it didn't give any results. Can someone help me to figure it out?
A few observations:
Timestamp '2019-09-21T12:19:32.107Z' is not in hive TIMESTAMP format, define it as STRING in DDL and convert like in this answer: https://stackoverflow.com/a/23520257/2700344
message in the serde is represented as (\w+) group. This is wrong because message contains spaces. Try (.*?)$ instead of (\\w+) for message field.
Try this regexp:
(\\S+)\\s+(.*-.*-.*)\\s+(\\d+:\\d+:\\d+\\.\\d+)\\s+(\\S+)\\s+(.*?)$
Use (\\S+) - this means everything except spaces.
(\\w+) does not work for the first group because \\w matches any alphanumerical character and the underscore only, and first group (timestamp) contains - and : characters also.
Also, a hyphen - outside of a character class (square brackets) does not need escaping, whereas a dot . has a special meaning and needs escaping when it is meant as a literal dot: https://stackoverflow.com/a/57890202/2700344
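One quick way to sanity-check the patterns outside of Athena is to run them against a sample line in Python. This is only a sketch: Python's re module is not the Java regex engine that the SerDe uses, although the constructs used here behave the same in both, and the doubled backslashes from the DDL are written as single ones in a raw string. fullmatch is used so that the whole line has to match.

```python
import re

# Pattern from the question (fails) and the suggested replacement.
ORIGINAL = re.compile(r"(\w+)\s+(.*\-.*\-.*)\s+(\d+:\d+:\d+.\d+)\s+(\w+)\s+(\w+)")
SUGGESTED = re.compile(r"(\S+)\s+(.*-.*-.*)\s+(\d+:\d+:\d+\.\d+)\s+(\S+)\s+(.*?)$")

sample = (
    "2019-09-21T12:19:32.107Z 2019-09-21 12:19:24.17 Server "
    "Buffer pool extension is already disabled. No action is necessary."
)

# \w+ stops at the first "-" of the timestamp, so the original never matches.
original_match = ORIGINAL.fullmatch(sample)

# The suggested pattern yields the five columns of the table.
fields = SUGGESTED.fullmatch(sample).groups()
for name, value in zip(("timestamp", "date", "time", "user", "message"), fields):
    print(f"{name}: {value}")
```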

How can I use regular expressions to select text between commas?

I am using BigQuery on Google Cloud Platform to extract data from GDELT. This uses an SQL syntax and regular expressions.
I have a column of data (called V2Tone), in which each cell looks like this:
1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299
To select only the first number (i.e., the number before the first comma) using regular expressions, we use this:
regexp_replace(V2Tone, r',.*', '')
How can we select only the second number (i.e., the number between the first and second commas)?
How about the third number (i.e., the number between the second and third commas)?
I understand that re2 syntax (https://github.com/google/re2/wiki/Syntax) is used here, but my understanding of how to put that all together is limited.
If anything is unclear, please let me know. Thank you for your help as I learn to use regular expressions.
The example below is for BigQuery Standard SQL and uses a simple SPLIT approach:
#standardSQL
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number
FROM `project.dataset.table`
If for some reason you need or want to use a regexp here, use the following:
#standardSQL
SELECT
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number
FROM `project.dataset.table`
Note the use of REGEXP_EXTRACT instead of REGEXP_REPLACE.
You can test the options above with the dummy string from your question, as below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number,
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number_re
FROM `project.dataset.table`
with output :
first_number second_number third_number first_number_re second_number_re third_number_re fifth_number_re
1.55763239875389 2.80373831775701 1.24610591900312 1.55763239875389 2.80373831775701 1.24610591900312 26.4797507788162
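To see how the two approaches behave for an arbitrary position, here is a small Python sketch that mimics them with str.split and re.match (the function names are made up; this is not BigQuery code). It also shows one difference: because the regexp pattern requires a comma after the captured value, it finds nothing for the last field, while the split approach does.

```python
import re

v2tone = (
    "1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,"
    "26.4797507788162,2.49221183800623,299"
)


def nth_by_split(value, n):
    """Zero-based n-th field, or None when out of range (like SAFE_OFFSET)."""
    parts = value.split(",")
    return parts[n] if 0 <= n < len(parts) else None


def nth_by_regex(value, n):
    """Same pattern shape as the REGEXP_EXTRACT calls: skip n fields, capture one."""
    match = re.match(r"^(?:(?:.*?),){%d}(.*?)," % n, value)
    return match.group(1) if match else None


print(nth_by_split(v2tone, 1), nth_by_regex(v2tone, 1))
print(nth_by_split(v2tone, 6), nth_by_regex(v2tone, 6))
```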
I don't know of a single regex replace that could isolate a single number in your CSV string because, in general, we need to remove things on both sides of the match. But we can chain together two calls to regexp_replace. For example, to target the third number in the CSV string, we could try this:
regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''),
r',.*', '')
The pattern I am using to strip off the first n numbers is this:
^(?:(?:\d+(?:\.\d+)?),){n}
This just removes a number, followed by a comma, n times, from the beginning of the string.
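The chained replace can be tried out in Python with two re.sub calls. This is a sketch with a made-up function name, using Python's re rather than BigQuery's regexp_replace. One caveat it makes visible: if the string has fewer than n leading numbers, the first replace does nothing and the result is simply the first number.

```python
import re

v2tone = (
    "1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,"
    "26.4797507788162,2.49221183800623,299"
)


def nth_by_chained_sub(value, n):
    """Strip the first n numbers, then strip everything after the next comma."""
    head_removed = re.sub(r"^(?:\d+(?:\.\d+)?,){%d}" % n, "", value)
    return re.sub(r",.*", "", head_removed)


print(nth_by_chained_sub(v2tone, 2))  # third number
print(nth_by_chained_sub(v2tone, 6))  # last number, no trailing comma needed
```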
Demo
Here is a solution with a single regex replace:
^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$
Demo
\n is added to the negated character class in the demo to avoid matching across lines in m (multiline) mode.
Usage:
regexp_replace(V2Tone, r'^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$', '$1')
Explanation:
([^,]+(?:,|$)){n} captures everything up to the next comma or the end of the string, n times
([^,]+(?:,|$))* captures the rest, 0 or more times
^.*$ captures everything if we cannot match n times
And then, finally, we can reinsert the nth match using $1.

Pattern or Format Match in XQuery MarkLogic

I need to check whether a given string is in *(*) format, where * must not contain spaces and there must not be two words before the (.
I am searching a MarkLogic DB to see whether a given column value is in [^\s]+\((?!\s)[^()]+(?<!\s)\) format and, if not, replace it with this format.
I am still stuck at fetching the data and have not been able to write the update query.
I am searching the DB as follows:
let $query-opts := cts:search(doc(),
cts:and-query((
cts:directory-query(("/xyz/documentData/"),"1"),
cts:element-query(
xs:QName("cd:clause"), (: <clause> element inside extended for checking query id :)
cts:and-query((
cts:element-attribute-value-query( xs:QName("cd:clause"), xs:QName("tag"), "Title" ), (: only if the <clause> is of type "Title" :)
cts:element-attribute-value-query( xs:QName("cd:xmetadata"), xs:QName("tag"), "Author")
))
)
))
for $d in $query-opts
return (
for $x in $d//cd:document/cd:clause/cd:xmetadata[fn:matches(@tag,"Author")]/cd:metadata_string
where fn:matches($x/string(), "[^\s]+\((?!\s)[^()]+(?<!\s)\)")
return
( <documents> {
<documentId> {$d//cd:cdf/cd:documentId/string()}</documentId>
}</documents>
)
)
It throws an "invalid pattern" error.
The fn:matches function does not support lookaround assertions like (?! and (?<!. Simplify your pattern, and capture false positives after the match with another match if necessary.
Making an educated guess at what you are trying to do, I think you are looking for something like:
where fn:matches($x, '^.+\([^)]+\).*$') (: it uses parentheses :)
and fn:not(fn:matches($x, '^[^\s]+\([^\s)]+\)$')) (: but does not comply to strict rules :)
HTH!
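As a possible alternative to the two-match approach, the lookarounds in the original pattern can be rewritten with plain character classes: "does not start or end with whitespace" becomes "first and last character are non-whitespace". The Python sketch below only checks that the two forms agree on a few invented sample values; the anchors ^ and $ are an addition, and whether MarkLogic accepts the rewritten pattern as written is an assumption that would need to be verified there.

```python
import re

# Original idea (lookarounds), anchored to the whole string.
WITH_LOOKAROUND = re.compile(r"^[^\s]+\((?!\s)[^()]+(?<!\s)\)$")

# Lookaround-free rewrite: the first and last character inside the
# parentheses must be neither whitespace nor a parenthesis.
WITHOUT_LOOKAROUND = re.compile(r"^\S+\([^\s()]([^()]*[^\s()])?\)$")

# Invented sample values for illustration.
samples = [
    "Smith(John A)",
    "Smith(J)",
    "Smith( John)",
    "Smith(John )",
    "John Smith(J)",
    "Smith()",
    "Smith",
]

results = {s: bool(WITHOUT_LOOKAROUND.match(s)) for s in samples}
for value, ok in results.items():
    print(f"{value!r}: {ok}")
```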

Regular Expression in redshift

I have data that is being fed in the format below:
2016-006-011 04:58:22.058
This is an incorrect date/timestamp format, and I need to convert it to the correct one, as below:
2016-06-11 04:58:22.058
I'm trying to achieve this using regex in Redshift. Is there a way to remove the additional zero (0) in the month and day portions using regex? I need something generic and not tailored to this example alone, as the date will vary.
The function regexp_replace() (see documentation) should do the trick:
select
regexp_replace(
'2016-006-011 04:58:22.058' -- use your date column here instead
, '\-0([0-9]{2}\-)0([0-9]{2})' -- matches "-006-011", captures "06-" in $1, "11" in $2
, '-$1$2' -- inserts $1 and $2 to give "-06-11"
)
;
And so the result is, as required:
regexp_replace
-------------------------
2016-06-11 04:58:22.058
(1 row)
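The same substitution can be checked locally in Python before running it in Redshift. This is a sketch with a made-up function name; note that Python's re.sub writes backreferences as \1 and \2 rather than $1 and $2. An already well-formed value is left unchanged because the pattern only matches when both extra zeros are present.

```python
import re


def fix_padded_date(ts):
    """Drop the extra leading zero from the month and day parts."""
    return re.sub(r"-0([0-9]{2}-)0([0-9]{2})", r"-\1\2", ts, count=1)


print(fix_padded_date("2016-006-011 04:58:22.058"))
print(fix_padded_date("2016-06-11 04:58:22.058"))
```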

Hive regex split a string in to two different fields

My record looks like this:
0x0000110PPPP111KZY0 H123456789 XYZ 000000000000000000607532030000607532000060753203002014101707199999
I am looking for a regex that splits the first 3 characters, 0x0, into one field of a Hive table, the rest, 000110PPPP111KZY0, into a second field, and so on. It is a fixed-length file with no delimiter.
I have no experience with hadoop or hive, however the following regex will work with what I believe you're looking for.
/(\dx\d)(.*)/ This will capture/split 0x0 into the first capture group, and everything afterwards into the second capture group. If you only want the numbers/letters following the 0x0 number (so none of the H123456789 or trailing words and letters), use /(\dx\d)([^ ]*)/
If I misunderstood what you're looking for, can you just clarify the exact section of that code you provided that you'd like to select and/or capture? Thanks!
Select
regexp_extract(data, '^(\\dx\\d).*', 1),
regexp_extract(data, '^\\dx\\d(.*)', 1)
from (Select '0x0000110PPPP111KZY0 ' as data) a;
This code returns a Hive row with two fields:
0x0 000110PPPP111KZY0
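For comparison, here is a small Python sketch of the same split on the full sample record, once with a regex and once with plain fixed-width slicing. The offsets 3 and 20 are assumptions read off the sample record, not a known file layout.

```python
import re

record = (
    "0x0000110PPPP111KZY0 H123456789 XYZ "
    "000000000000000000607532030000607532000060753203002014101707199999"
)

# Regex variant: three characters like "0x0", then everything up to the first space.
by_regex = re.match(r"^(\dx\d)(\S*)", record).groups()

# Fixed-width variant: slice by position (offsets assumed from the sample).
by_slice = (record[:3], record[3:20])

print(by_regex)
print(by_slice)
```

Since the file is fixed length, slicing by position avoids depending on what characters happen to appear in each field.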