JScript Regex - extract dates preceded by substrings

I've got a one-line string that includes several dates. In JScript regex I need to extract the dates that are preceded by the case-insensitive substrings "dat" and "wy", in that order. The substrings can be preceded and followed by any character (except a newline).
reg = new RegExp('dat.{0,}wy.{0,}\\d{1,4}([\-/ \.])\\d{1,2}([\-/ \.])\\d{1,4}','ig');
str = ('abc18.Dat wy.03/12/2019FFF*Dato dost2009/03/03**data wy2020-09-30')
result = str.match(reg).toString()
Received result: 'Dat wy.03/12/2019FFF*Dato dost2009/03/03**data wy2020-09-30'
Expected result: 'Dat wy.03/12/2019,data wy2020-09-30' or preferably: '03/12/2019,2020-09-30'
Thanks.

There are several issues.
You want to match as few characters as possible between the substrings and the date, but your current regex uses the greedy .{0,} (the same as .*). See this question and use the lazy .*? instead.
dat.*?wy.*?FOO can still skip over any other dat. To avoid that, use what some call a tempered greedy token: the .*? becomes (?:(?!dat).)*?, which cannot skip over another dat.
Not really an issue, but you can capture the date separator and reuse it.
If you want to extract only the date part, also use capturing groups. I put a demo at regex101.
dat(?:(?!dat).)*?wy.*?(\d{1,4}([/ .-])\d{1,2}\2\d{1,4})
There are many ways to achieve your desired outcome. Another idea: if you know that no digits will ever appear between the substrings and the date, use \D (non-digit) instead of the dot.
dat\D*?wy\D*(\d{1,4}([/ .-])\d{1,2}\2\d{1,4})
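A minimal sketch of how the tempered-token pattern above could be used to pull out only the dates, using the sample string from the question. The helper name extractDates is made up for illustration, and an exec loop is used on the assumption that the JScript host has no matchAll.

```javascript
// Hypothetical helper: collect capture group 1 (the date) from every match.
function extractDates(str) {
  // The tempered token (?:(?!dat).)*? keeps the scan from jumping over a later "dat".
  var re = /dat(?:(?!dat).)*?wy.*?(\d{1,4}([\/ .-])\d{1,2}\2\d{1,4})/ig;
  var dates = [];
  var m;
  while ((m = re.exec(str)) !== null) {
    dates.push(m[1]);
  }
  return dates;
}

var str = 'abc18.Dat wy.03/12/2019FFF*Dato dost2009/03/03**data wy2020-09-30';
var result = extractDates(str).toString();
// result: '03/12/2019,2020-09-30'
```

The "Dato dost2009/03/03" part is rejected because the tempered token stops at the next "dat" before any "wy" is found.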

You might use a capturing group with a backreference to make sure the separators like - and / are the same in the matched date.
\bdat\w*\s*wy\.?(\d{4}([-/ .])\d{2}\2\d{2}|\d{2}([-/ .])\d{2}\3\d{4})
\bdat\w*\s*wy\.? A word boundary, match dat followed by 0+ word chars and 0+ whitespace chars. Then match wy and an optional .
( Capture group 1
\d{4}([-/ .])\d{2}\2\d{2} Match a date like format starting with the year where \2 is a backreference to what is captured in group 2
| Or
\d{2}([-/ .])\d{2}\3\d{4} Match a date like format ending with the year where \3 is a backreference to what is captured in group 3
) Close group
The value is in capture group 1
Regex demo
Note that you could make the date more specific by specifying ranges for the year, month and day.
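To see the effect of the per-alternative backreferences, here is a small sketch that runs the pattern above over the question's sample and over a made-up string with mixed separators. The helper name findDates is hypothetical.

```javascript
// Hypothetical helper: return every date (capture group 1) found in str.
function findDates(str) {
  var re = /\bdat\w*\s*wy\.?(\d{4}([-\/ .])\d{2}\2\d{2}|\d{2}([-\/ .])\d{2}\3\d{4})/ig;
  var dates = [];
  var m;
  while ((m = re.exec(str)) !== null) {
    dates.push(m[1]);
  }
  return dates;
}

var fromQuestion = findDates('abc18.Dat wy.03/12/2019FFF*Dato dost2009/03/03**data wy2020-09-30');
// fromQuestion: ['03/12/2019', '2020-09-30']

// Mixed separators fail the backreference, so no date is returned.
var mixed = findDates('data wy2020-09/30');
// mixed: []
```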

Related

Comma separated prefix list with commas inside

I'm trying to match a comma-separated list of prefixed values, where the values themselves can also contain a comma.
So far I have only managed to match the occurrences that don't contain a ,.
Sample String (With NL for visualization - original string doesn't have NL):
field01=Value 1,
field02=Value 2,
field03=<xml value>,
field04=127.0.0.1,
field05=User-Agent: curl/7.28.0\r\nHost: example.org\r\nAccept: */*,
field06=Location, Resource,
field07={Item 1},{Item 2}
My current regex is this unoptimized piece:
(?'fields'(field[0-9]{2,3})=?([\s\w\d_<>.:="*?\-\/\\(){}<>'#]+))([^,](?&fields))*
Does anyone have a clue how to solve this?
EDIT:
The first pattern is close to my expected result.
This is an anonymized full example of the string:
asm01=Predictable Resource Location,Information Leakage,asm02=N/A,asm04=Uncategorized,asm08=2021-02-15 09:18:16,asm09=127.0.0.1,asm10=443,asm11=N/A,asm15=,asm16=DE,asm17=User-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n,asm18=/Common/_www.example.com_live_v1,asm20=127.0.0.1,asm22=,asm27=HEAD,asm34=/Common/_www.example.com_live_v1,asm35=HTTPS,asm39=blocked,asm41=0,asm42=3,asm43=0,asm44=Error,asm46=200000028,200100015,asm47=Unix hidden (dot-file) access,.htaccess access,asm48={Unix/Linux Signatures},{Apache/NCSA HTTP Server Signatures},asm50=40622,asm52=200000028,asm53=Unix hidden (dot-file) access,asm54={Unix/Linux Signatures},asm55=,asm61=,asm62=,asm63=8985143867830069446,asm64=example-waf.example.com,asm65=/.htaccess,asm67=Attack signature detected,asm68=<?xml version='1.0' encoding='UTF-8'?><BAD_MSG><violation_masks><block>13020008202d8a-f803000000000000</block><alarm>417020008202f8a-f803000000000000</alarm><learn>13000008202f8a-f800000000000000</learn><staging>200000-0</staging></violation_masks><request-violations><violation><viol_index>42</viol_index><viol_name>VIOL_ATTACK_SIGNATURE</viol_name><context>request</context><sig_data><sig_id>200000028</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>2</length></kw_data></sig_data><sig_data><sig_id>200000028</sig_id><blocking_mask>4</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>3</length></kw_data></sig_data><sig_data><sig_id>200100015</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>1</offset><length>9</length></kw_data></sig_data></violation></request-violations></BAD_MSG>,asm69=5,asm71=/Common/_dev.example.com_SSL,asm75=127.0.0.1,asm100=,asm101=HEAD /.htaccess HTTP/1.1\r\nUser-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n#015

The pattern does not work because the fields group matches the literal string field.
You are trying to repeat the named group fields, but the example string does not contain the string field.
Note that [^,] matches any char except a comma. You can omit the capture group inside the named group fields, as the named group already is a group, and \w also matches \d.
With 2 capture groups:
\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)
\b A word boundary
(asm[0-9]+) Capture group 1, match asm and 1+ digits
= Match literally
(.*?) Capture group 2, match any char, as few as possible
(?= Positive lookahead, assert what is at the right is
,asm[0-9]+= Match ,asm followed by 1+ digits and =
| Or
$ Assert the end of the string
) Close lookahead
Regex demo
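As a rough illustration, the pattern above can be run over a shortened excerpt of the asm string from the question (most fields omitted). The variable names log and fields are made up for this sketch.

```javascript
// Shortened excerpt in the same shape as the asm string from the question.
var log = 'asm01=Predictable Resource Location,Information Leakage,' +
          'asm02=N/A,asm15=,' +
          'asm48={Unix/Linux Signatures},{Apache/NCSA HTTP Server Signatures}';

// A value ends where the next ",asmNN=" key begins, or at the end of the string.
var re = /\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)/g;

var fields = {};
var m;
while ((m = re.exec(log)) !== null) {
  fields[m[1]] = m[2];
}
// fields.asm01: 'Predictable Resource Location,Information Leakage'
// fields.asm15: '' (empty values are kept)
```

Commas inside a value survive because the lookahead only accepts a comma that is followed by the next key and an equals sign.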

A simple solution would be (see regexr.com/5mg1b):
/((asm\d{2,3})=(.*?))(?=,asm|$)/g
Match groupings will be:
group #1 - asm01=Predictable Resource Location,Information Leakage
group #2 - asm01
group #3 - Predictable Resource Location,Information Leakage
Conditions:
This will match everything including empty values
The key here is to make sure that each match is delimited by either a comma and your field descriptor, or an end of string. A look ahead will be handy here: (?=,asm|$).

Regular expression with multiline matching (subtitles strings)

I need some help with a regex matching pattern.
The text looks like this (it's subtitles for a video):
...
223
00:20:47,920 --> 00:20:57,520
- Hello! This is good subtitle text.
- Yes! How are you, stackoverflow?
224
00:20:57,520 --> 00:21:11,120
Wow, seems amazing.
- We're good, thanks.
Like, you know, everyone is happy around here with their laptops.
225
00:21:11,120 --> 00:21:14,440
- Understood. Some dumb text
...
I need a set of groups:
startTime, endTime, text
So far my results are not very good. I can get startTime, endTime and some of the text, but not all of it, only the last sentence. I've attached a screenshot.
As you can see, group 3 captures text, but only the last sentence.
Please explain what I'm doing wrong.
Thank you.

Accounting for the possibility that there is no newline character after the final text of your string, would the following work for you?
(\d\d:\d\d:\d\d,\d\d\d)[ >-]*?((?1))\n(.*?(?=\n\n|\Z))
See the online demo
(\d\d:\d\d:\d\d,\d\d\d) - The same pattern as you used to capture starting time in 1st capture group.
[ >-]*? - 0+ (but lazy) character from the character class up to:
((?1)) - A 2nd capture group which matches the same pattern as 1st group.
\n - A newline-character.
(.*?(?=\n\n|\Z)) - A 3rd capture group that captures anything (including newline with the s-flag) up to a positive lookahead for either two newline characters or the end of the whole string.
Note that some (not all) engines allow calling a previous subpattern again with (?1). I guess the app you are using does not. In that case you can replace (?1) with a copy of your own pattern to capture the 2nd group.

Another option is to use a pattern that would capture all lines in group 3 that do not start with 3 digits.
(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d\d\d\b).*)*)
Explanation
(\d\d:\d\d:\d\d,\d\d\d) Capture group 1 Match a time like pattern
--> Match literally
(\d\d:\d\d:\d\d,\d\d\d) Capture group 2 Same pattern as group 1
( Capture group 3
(?: Non capture group
\r?\n(?!\d\d\d\b).* Match a newline and assert using a negative lookahead that the line does not start with 3 digits followed by word boundary. If that is the case, match the whole line
)* Optionally repeat all lines
) Close group 3
Regex demo
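A short sketch of the pattern above applied to the subtitles from the question, assuming plain LF or CRLF line endings and a JavaScript engine. The names srt and cues are made up for illustration.

```javascript
// Sample in the shape of the subtitles from the question.
var srt = [
  '223',
  '00:20:47,920 --> 00:20:57,520',
  '- Hello! This is good subtitle text.',
  '- Yes! How are you, stackoverflow?',
  '224',
  '00:20:57,520 --> 00:21:11,120',
  'Wow, seems amazing.',
  "- We're good, thanks.",
  '225',
  '00:21:11,120 --> 00:21:14,440',
  '- Understood. Some dumb text'
].join('\n');

var re = /(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d\d\d\b).*)*)/g;

var cues = [];
var m;
while ((m = re.exec(srt)) !== null) {
  cues.push({
    startTime: m[1],
    endTime: m[2],
    // Group 3 starts with the line break after the timing line, so strip it.
    text: m[3].replace(/^\r?\n/, '')
  });
}
// cues.length: 3
```

Each cue keeps all of its text lines because the repeated group consumes whole lines until it meets the next three-digit counter.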
A bit more specific pattern could match all lines that are not made up of digits only and do not start with a start/end time like pattern.
^(\d\d:\d\d:\d\d,\d\d\d)[^\S\r\n]+-->[^\S\r\n]+(\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d+$|\d\d:\d\d:\d\d,\d\d\d\b).*)*)
Regex demo

Extract text between certain characters

I have the following link structure (example, link can't be joined):
https://zoom.us/j/345678634?pwd=fdgSDdfdfasgdgJEeXNaRjNBZz09
My goal is to extract two numbers in two different cells
First one:
345678634
I tried:
(?<=/j/).(?=?pwd)
Second one:
fdgSDdfdfasgdgJEeXNaRjNBZz09
I tried (besides others):
(?<=?pwd).
What I thought about is for the second one just everything that's behind ?pwd= and for the first one everything that's between /j/ and ?pwd=. I just don't know how to get this done with regex.

You may try:
.*?\/j\/(\d+)\?pwd=(\w+)
Explanation of the above regex:
.*? - Lazily matches everything before /j/.
\/j\/ - Matches /j/ literally.
(\d+) - Represents first capturing group matching digits 1 or more times.
\? - Matches ? literally.
pwd= - Matches pwd= literally.
(\w+) - Represents second capturing group capturing the word characters i.e. [0-9a-zA-Z_] one or more times.
You can find the demo of the above regex in here.
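For a quick check outside the spreadsheet, the same pattern can be tried in JavaScript on the example link. This is only a sketch; the variable names meetingId and password are labels chosen here, not terms from the question.

```javascript
var url = 'https://zoom.us/j/345678634?pwd=fdgSDdfdfasgdgJEeXNaRjNBZz09';

// Group 1: digits after /j/, group 2: word characters after ?pwd=
var m = /.*?\/j\/(\d+)\?pwd=(\w+)/.exec(url);

var meetingId = m ? m[1] : null; // '345678634'
var password = m ? m[2] : null;  // 'fdgSDdfdfasgdgJEeXNaRjNBZz09'
```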

Unfortunately lookarounds are not supported (AFAIK) in RE2. But it seems like you could use:
=REGEXEXTRACT(A1,"(\d+).*=(.*)")
( - Open 1st capture group.
\d+ - Match at least a single digit.
) - Close 1st capture group.
.* - Match zero or more characters (greedy)
= - Match a literal =.
( - Open 2nd capture group.
.* - Match any character other than newline zero or more times.
) - Close 2nd capture group.
Because of the spill feature both groups will be extracted into neighboring cells.
A 2nd option, if you want to avoid REGEX, is using SPLIT and QUERY. However, depending on your data, I'm not sure which one would be faster in processing:
=QUERY(SPLIT(SUBSTITUTE(A1,"?pwd=","/"),"/"),"Select Col4,Col5")

Regex - optional capture group after wildcard

Say I have the following list:
No 1 And Your Bird Can Sing (4)
No 2 Baby, You're a Rich Man (5)
No 3 Blue Jay Way S
No 4 Everybody's Got Something to Hide Except Me and My Monkey (1)
And I want to extract the number, the title and the number of weeks in the parenthesis if it exists.
Works, but the last group is not optional (regstorm):
No (?<no>\d{1,3}) (?<title>.*?) \((?<weeks>\d)\)
Last group optional, only matches number (regstorm):
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?
Combining one pattern with week capture with a pattern without week capture works, but there gotta be a better way:
(No (?<no>\d{1,3}) (?<title>.*) \((?<weeks>\d)\))|(No (?<no>\d{1,3}) (?<title>.*))
I use C# and javascript but I guess this is a general regex question.

Your regex is almost there!
First and most importantly, you should add a $ at the end. This makes (?<title>.*?) match all the way towards the end of the string. Currently, (?<title>.*?) matches an empty string and then stops, because it realises that it has reached a point where the rest of the regex matches. Why does the rest of the regex match? Because the optional group can match an empty string. By adding the $, you are making the rest of the regex "harder" to match.
Secondly, you forgot to match an open parenthesis \(.
This is what your regex should look like:
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?$
Demo
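The effect of the $ can be seen in a small JavaScript sketch, assuming an engine with named-group support (ES2018 or later). The variable names are made up for illustration.

```javascript
var lines = [
  'No 1 And Your Bird Can Sing (4)',
  'No 3 Blue Jay Way S'
];

var anchored = /No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?$/;
var unanchored = /No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?/;

var parsed = lines.map(function (line) {
  var g = anchored.exec(line).groups;
  return { no: g.no, title: g.title, weeks: g.weeks };
});
// parsed[0]: { no: '1', title: 'And Your Bird Can Sing', weeks: '4' }
// parsed[1]: { no: '3', title: 'Blue Jay Way S', weeks: undefined }

// Without the $, the lazy title stops immediately and stays empty.
var emptyTitle = unanchored.exec(lines[0]).groups.title; // ''
```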

You may use this regex with an optional last part:
^No (?<no>\d{1,3}) (?<title>.*?\S)(?: \((?<weeks>\d)\))?$
RegEx Demo

Another option could be for the title to match either not ( or when it does encounter a ( it should not be followed by a digit and a closing parenthesis.
^No (?<no>\d{1,3}) (?<title>(?:[^(\r\n]+|\((?!\d\)))+)(?:\((?<weeks>\d)\))?
In parts
^No
(?<no>\d{1,3}) Group no and space
(?<title>
(?: Non capturing group
[^(\r\n]+ Match any char except ( or newline
| Or
\((?!\d\)) Match ( if not directly followed by a digit and )
)+ Close group and repeat 1+ times
) Close group title
(?: Non capturing group
\((?<weeks>\d)\) Group weeks between parenthesis
)? Close group and make it optional
Regex demo
If you don't want to have to trim the trailing space of the title, you could exclude it from the title match before the weeks.
Regex demo

RegEx expression that will match Date format

I have a line with "1999-08-16"^^xsd:date. What would be the regex to capture the whole string "1999-08-16"^^xsd:date in a flex file? And is it possible to capture only "1999-08-16" as a string? If yes, what would be the regex for it in flex?

Try this one:
^"(\d{4}-(?:0?[1-9]|1[012])-(?:30|31|[12]\d|0?[1-9]))"\^\^xsd:date$
Explanation:
^ - start of line
\d{4} - year part
(?:0?[1-9]|1[012]) - month, can be:
01-09 or 1-9 (that's why 0?)
10,11,12 (1[012] part)
?: means non-capturing group (if we need just alternating matches
with |, but not outputting them to user)
(?:30|31|[12]\d|0?[1-9]) - day part, can be:
30,31,
10-29 (part [12]\d)
1-9, or 01-09 (0?[1-9])
Also we use non-capturing group for matching day
$ matches end of line
Everything in between "" is captured to standard capturing group, as you need to extract date
Demo
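As a sanity check of the pattern above, here is a sketch that runs it in a JavaScript engine. It only exercises the regex itself; whether flex accepts the same syntax is not checked here. The second input is a made-up invalid date.

```javascript
// The pattern from above as a JavaScript literal; group 1 is the bare date.
var re = /^"(\d{4}-(?:0?[1-9]|1[012])-(?:30|31|[12]\d|0?[1-9]))"\^\^xsd:date$/;

var ok = re.exec('"1999-08-16"^^xsd:date');
var dateOnly = ok ? ok[1] : null;            // '1999-08-16'

// Month 13 is outside the allowed range, so nothing matches.
var bad = re.exec('"1999-13-16"^^xsd:date'); // null
```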
NOTICE:
When matching days we put 1-9 day numbers in LAST alternating group:
(?:30|31|[12]\d|0?[1-9])
That's because the regex engine, when given alternatives, uses the FIRST alternative that matches and ignores the others. For example,
in the string 1 11 the expression:
(?:\d{2}|\d) gives 2 matches
(?:\d|\d{2}) gives 3 matches
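The two counts above can be reproduced with a few lines of JavaScript, shown here as a sketch with made-up variable names:

```javascript
var s = '1 11';

// Two-digit alternative first: '11' is consumed as a single match.
var twoDigitsFirst = s.match(/(?:\d{2}|\d)/g); // ['1', '11']

// One-digit alternative first: \d always succeeds, so \d{2} is never tried.
var oneDigitFirst = s.match(/(?:\d|\d{2})/g);  // ['1', '1', '1']
```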

To capture the whole string, \"[0-9]{4}-[0-9]{2}-[0-9]{2}\"\^\^[^ ]* can be used.
