Capturing groups and money symbol in Regex

I am trying to write a regular expression that takes a string and parses it into three different capturing groups:
$3.99 APP DOWNLOAD – 200 11/19 – 1/21 3.99
Group 1: $3.99 APP DOWNLOAD – 200
Group 2: 11/19 – 1/21
Group 3: 3.99
Does anyone have any ideas???
I do not have much experience with capturing groups and do not know how to create them.
i.e. I believe this expression would work for identifying the dates?
/(\d{2}\/\d{2})/
Any help would be greatly appreciated!

Regex:
([$]\d+[.]\d{2}.*?)\s*(\d{1,2}/\d{2}.*?\d{1,2}/\d{2})\s(\d+[.]\d{2})
So with this we have 3 capture groups (()) separated by whitespace: \s* (0+ whitespace characters) between the first two, and a single \s before the third. The \s* isn't strictly necessary, but it keeps trailing spaces out of the first captured group.
The first capture group [$]\d+[.]\d{2}.*? matches a dollar sign, followed by 1+ digits, followed by a period, followed by 2 digits, followed by a lazy match of 0+ characters (.*?). What this lazy match does is match anything up until the next match in our expression (in this case, our next capture group).
Our second capture group \d{1,2}/\d{2}.*?\d{1,2}/\d{2} matches 1-2 digits, a slash, and 2 digits. Then we use another lazy match of any characters followed by another date.
Our final capture group \d+[.]\d{2} looks for 1+ digits, a period, and 2 more digits.
Note: I used ~ as delimiters (not shown above) so that we do not need to escape the / in the dates. Also, I put $ and . in character classes because I think it looks cleaner than escaping them ([$] vs \$). Either works though :)
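For anyone who wants to try it, here is a minimal sketch of the pattern applied to the sample line from the question. It assumes Python's re module, which takes no delimiters, so the / stays unescaped; the variable names are only for illustration.

```python
import re

# Pattern from the answer above, written as a Python raw string (no delimiters).
PATTERN = re.compile(
    r"([$]\d+[.]\d{2}.*?)\s*(\d{1,2}/\d{2}.*?\d{1,2}/\d{2})\s(\d+[.]\d{2})"
)

line = "$3.99 APP DOWNLOAD – 200 11/19 – 1/21 3.99"
match = PATTERN.search(line)

# groups() returns the three captured parts in order, or we fall back to ().
groups = match.groups() if match else ()
for number, value in enumerate(groups, start=1):
    print(f"Group {number}: {value}")
```

The lazy .*? in the first group is what stops it right before the first date-like token.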

Related

Regex: Replace certain part of the matched characters

I want to be able to match with a certain condition, and keep certain parts of it. For example:
June 2021 9 Feature Article Three-Suiters Via Puppets Kai-Ching Lin
should turn into:
Jun 2021 Three-Suiters Via Puppets Kai-Ching Lin
So, everything until the end of the word Article should be matched; then, only the first three characters of the months is kept, as well as the year, and this part is going to replace the matched characters.
My strong regex knowledge got me as far as:
.+Article(?)
You could use 2 capture groups and use those in a replacement:
\b([A-Z][a-z]{2})[a-z]*(\s+\d{4})\b.*?\bArticle\b
\b A word boundary to prevent a partial word match
([A-Z][a-z]{2}) Capture group 1, match a single uppercase char and 2 lowercase chars (the first three letters of the month)
[a-z]* Match the rest of the month name, 0+ chars a-z
(\s+\d{4})\b Capture group 2, match 1+ whitespace chars and 4 digits followed by a word boundary
.*?\bArticle\b Match as few chars as possible until Article
The replaced value will be
Jun 2021 Three-Suiters Via Puppets Kai-Ching Lin
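A rough sketch of the two-group replacement in Python, using a pattern of the same shape with the first group limited to three letters so that longer month names are also shortened. The function name shorten_header and the replacement string \1\2 are my own assumptions, not something given in the question.

```python
import re

# Group 1: first three letters of the month; group 2: whitespace plus a 4-digit year.
PATTERN = re.compile(r"\b([A-Z][a-z]{2})[a-z]*(\s+\d{4})\b.*?\bArticle\b")

def shorten_header(text: str) -> str:
    # \1\2 puts back only the two captured parts; count=1 touches just the first match.
    return PATTERN.sub(r"\1\2", text, count=1)

result = shorten_header(
    "June 2021 9 Feature Article Three-Suiters Via Puppets Kai-Ching Lin"
)
print(result)
```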
You could use positive lookbehinds and replace each match with an empty string:
(?<=^[A-Z][a-z]{2})[a-z]*|(?<=\d{4}).*Article
(?<=^[A-Z][a-z]{2}) - behind me is the start of a line and 3 chars; presumably the first three chars of the month
[a-z]* - optionally, match the rest of the month
| - or
(?<=\d{4}) - behind me is 4 digits; presumably a year
.*Article - match everything leading up to and including "Article"
https://regex101.com/r/fbYdpH/1
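Here is how the lookbehind version might be applied, as a sketch that assumes every match is replaced with an empty string. Both lookbehinds have a fixed width, so Python's re module accepts them.

```python
import re

# Two alternatives: the tail of the month name, or everything from the year to "Article".
PATTERN = re.compile(r"(?<=^[A-Z][a-z]{2})[a-z]*|(?<=\d{4}).*Article")

text = "June 2021 9 Feature Article Three-Suiters Via Puppets Kai-Ching Lin"

# Deleting both matches leaves the 3-letter month, the year, and the title.
cleaned = PATTERN.sub("", text)
print(cleaned)
```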

greedy-but-not-too-greedy regex: need to exclude last occurrence of optional character

(it must be something trivial and answered many times already - but I can't formulate the right search query, sorry!)
From text like prefix start.then.123.some-more.text. All the rest I need to extract start.then.123.some-more.text, i.e. a string that has no spaces, has periods in the middle and may or may not have a trailing period (and that trailing period should not be included). I struggle to build a regex that would catch both cases:
prefix (start[0-9a-zA-Z\.\-]+)\..* - this works correctly only if there's a trailing period,
prefix (start[0-9a-zA-Z\.\-]+)\.?.* - I thought adding ? after \. will make it optional - but it doesn't...
P.S. My environment is MS VBA script, I'm using CreateObject("vbscript.regexp") - but I guess the question is relevant to other regex engines as well.
If you don’t want to include “prefix” in the match and your engine supports lookbehind (as far as I know, VBScript's RegExp does not), you can use:
(?<=prefix )\S*?(?=\.?\s)
EDIT:
Even simpler, without lookbehinds or lookaheads, if you're using capturing groups anyway:
prefix (\S*\w)
This will stop at the last letter, number, or underscore. If you want to be able to capture a hyphen as the last character, you can change \w above to [\w-].
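A quick check of the capturing-group version, sketched in Python rather than VBA (the helper name extract is made up for this example). The greedy \S* runs to the end of the token and then backtracks until \w can match, which is what drops the trailing period.

```python
import re

PATTERN = re.compile(r"prefix (\S*\w)")

def extract(text):
    # Return the captured token, or None when "prefix " is not followed by one.
    m = PATTERN.search(text)
    return m.group(1) if m else None

with_period = extract("prefix start.then.123.some-more.text. All the rest")
without_period = extract("prefix start.then.123.some-more.text All the rest")
print(with_period, without_period)
```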
You could match prefix, and use a capturing group to first match chars A-Za-z0-9.
Then you can repeat the previous pattern in a group preceded by either a . or - using a character class.
prefix ([0-9a-zA-Z]+(?:[.-][0-9a-zA-Z]+)+)
In parts
prefix Match literally
( Capture group 1
[0-9a-zA-Z]+ Match 1+ times any of the listed chars
(?: Non capture group
[.-][0-9a-zA-Z]+ match either a . or - and again match 1+ times any of the listed chars
)+ Close group and repeat 1+ times to match at least a dot or hyphen
) Close group
If the value in the capturing group should begin with start:
prefix (start(?:[.-][0-9a-zA-Z]+)+)
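The two structured patterns can be compared side by side. This is only a sketch in Python with made-up constant names; it also shows that, unlike a plain \S* match, these patterns require at least one . or - separated part.

```python
import re

# Any alphanumeric first part, then 1+ parts introduced by "." or "-".
ANY_START = re.compile(r"prefix ([0-9a-zA-Z]+(?:[.-][0-9a-zA-Z]+)+)")
# Same idea, but the captured value must begin with the literal "start".
FIXED_START = re.compile(r"prefix (start(?:[.-][0-9a-zA-Z]+)+)")

def first_group(pattern, text):
    m = pattern.search(text)
    return m.group(1) if m else None

sample = "prefix start.then.123.some-more.text. All the rest"
print(first_group(ANY_START, sample))
print(first_group(FIXED_START, sample))
print(first_group(FIXED_START, "prefix other.value. more"))  # no "start", no match
```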

How can I match everything between 2 commas?

I want to match basically any text that has a comma separated list of weekdays.
(?i)(every (mon|tue|wed|thu|fri|sat|sun)[A-Za-z]{3,5}, .*+,
(mon|tue|wed|thu|fri|sat|sun)[A-Za-z]{3,5})
Above is what I have, and I want to make it match the following strings. I don't need help in the case that only 2 weekdays are supplied.
Every mon, tue, wednesday
Every wed, Saturday, Friday, sun.
Try pattern: (?<=,|^)[^,\n]+
Explanation
(?<=,|^) - positive lookbehind: assert what precedes is comma , or beginning of the string ^
[^,\n]+ - match one or more characters other than comma , or newline \n
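If you want to try this in Python, note that its re module rejects a lookbehind whose alternatives differ in width, so the sketch below rewrites (?<=,|^) as (?:(?<=,)|^). That rewrite is my adaptation, not part of the pattern above. The matches keep their leading spaces, so they are stripped afterwards.

```python
import re

# Equivalent of (?<=,|^)[^,\n]+ with the "^" moved outside the lookbehind.
PATTERN = re.compile(r"(?:(?<=,)|^)[^,\n]+")

parts = PATTERN.findall("Every mon, tue, wednesday")

# Each item after the first starts with the space that followed the comma.
items = [p.strip() for p in parts]
print(items)
```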
You might list the abbreviations and optionally match the full name by listing them using an alternation followed by a comma and a space.
Add that to a group and repeat that 0+ times. After that add the group without a comma to make sure you match at least a single day.
(?i)\bevery (?:(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)?), )*(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)?)\b
Explanation
(?i)\bevery Case insensitive modifier, then a word boundary and every followed by a space
(?: Non capturing group
(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)?), Match any of the listed followed by a comma and space
)* Close non capturing group and repeat 0+ times
(?: Non capturing group
mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)? Match any of the listed
)\b Close non capturing group and add a word boundary to prevent being part of a larger word
To match only multiple days, you could change the * quantifier of the first non capturing group to, for example, + or {2,}.
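Since the day alternation appears twice in the weekday-list pattern, one way to keep it readable is to build it from a list. This is a sketch in Python; the DAYS table and variable names are my own, and re.IGNORECASE stands in for the inline (?i).

```python
import re

# Abbreviation plus optional remainder, e.g. "mon" + "day" -> mon(?:day)?
DAYS = [("mon", "day"), ("tue", "sday"), ("wed", "nesday"), ("thu", "rsday"),
        ("fri", "day"), ("sat", "urday"), ("sun", "day")]
DAY = "(?:" + "|".join(f"{abbr}(?:{rest})?" for abbr, rest in DAYS) + ")"

# 0+ "day, " items followed by one final day and a word boundary.
PATTERN = re.compile(r"\bevery (?:" + DAY + r", )*" + DAY + r"\b", re.IGNORECASE)

def find_days(text):
    m = PATTERN.search(text)
    return m.group(0) if m else None

print(find_days("Every mon, tue, wednesday"))
print(find_days("Every wed, Saturday, Friday, sun."))
```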

RegEx for identifying a date followed by a special pattern

I have a pattern of strings/values occurring at different interval. The Pattern is as follows:
30/09/2016 2,085,669 0 0 UC No
Date>SPACE>Number separated by comma>SPACE> NUMBER> SPACE> NUMBER> SPACE>STRING>SPACE>NUMBER
How do I identify this and extract it from a cell? I have been trying to use regex to solve this problem. Please note the pattern can occur anywhere in a single cell, viz.
Somestring(space)(30/09/2016 2,085,669 0 0 UC No)(space) More string
Somemorestring(space)(30/09/2016 2,085,669 0 0 UC No)
Brackets are for illustration only
To identify for date I am using the below regex, not the best way, but does my job.
(^\d{1,2}\/\d{1,2}\/\d{4}$)
How to stitch this with remaining pattern?
You are only matching the date like part between the anchors to assert the start ^ and the end $ of the string.
Note that if you only want to match the value, you can omit the parentheses (), which make a capturing group around the expression.
You could extend it to:
^\d{1,2}\/\d{1,2}\/\d{4} \d+(?:,\d+)+ \d+ \d+ [A-Za-z]+ [A-Za-z]+$
Explanation
^ Start of string
\d{1,2}\/\d{1,2}\/\d{4} Match date like pattern
\d+(?:,\d+)+ Match 1+ digits and repeat 1+ times matching a comma and 1+ digits
\d+ \d+ Match two times 1+ digits separated by a space
[A-Za-z]+ [A-Za-z]+ Match two times 1+ chars A-Z or a-z separated by a space
$ End of string
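To see what the anchors do, here is a small sketch in Python (the constant names are only illustrative). With ^ and $ the pattern accepts only a cell that contains nothing but the row; the same body without anchors can be searched for inside a longer cell.

```python
import re

# Body of the pattern; "/" needs no escaping in Python.
ROW = r"\d{1,2}/\d{1,2}/\d{4} \d+(?:,\d+)+ \d+ \d+ [A-Za-z]+ [A-Za-z]+"

ANCHORED = re.compile("^" + ROW + "$")
UNANCHORED = re.compile(ROW)

bare = "30/09/2016 2,085,669 0 0 UC No"
embedded = "Somestring 30/09/2016 2,085,669 0 0 UC No More string"

print(bool(ANCHORED.search(bare)))       # whole cell is the row
print(bool(ANCHORED.search(embedded)))   # surrounding text breaks the anchors
found = UNANCHORED.search(embedded)
print(found.group(0) if found else None)
```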
If you only wish to extract the date from anywhere in a string, this expression uses two capturing groups before and after the date, and the middle group captures the desired date:
(.*?)(\d{1,2}\/\d{1,2}\/\d{4})(.*)
You do not need the start ^ and end $ anchors here; it works without them.
If you wish to match and capture everything, you might just want to add more boundaries, and match patterns step by step, maybe similar to this expression:
(.*?)(\d{1,2}\/\d{1,2}\/\d{4})\s+([0-9,]+)\s+([0-9]+)\s+([0-9]+)\s+([A-Z]+)\s+(No)(.*)
I have added extra boundaries just to be safe; you can simplify them.
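As a rough illustration of the step-by-step expression, the sketch below pulls every field out of a cell in Python. The field names in FIELDS and the helper parse_cell are hypothetical labels I chose for readability; the question does not name the columns.

```python
import re

PATTERN = re.compile(
    r"(.*?)(\d{1,2}\/\d{1,2}\/\d{4})\s+([0-9,]+)\s+([0-9]+)\s+([0-9]+)\s+([A-Z]+)\s+(No)(.*)"
)

# Hypothetical labels for the eight groups, in order.
FIELDS = ("before", "date", "amount", "first", "second", "code", "flag", "after")

def parse_cell(cell):
    # Map each captured group to its label, or return None if the row is absent.
    m = PATTERN.search(cell)
    return dict(zip(FIELDS, m.groups())) if m else None

row = parse_cell("Somestring 30/09/2016 2,085,669 0 0 UC No More string")
print(row)
```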

Regular expression to exclude group with 0 and more occurrence issue

I need to extract 1234567 from below URLs
http://www.test.in/some--wonders-1234567---2
http://www.test.in/some--wonders-1234567
I tried with .*\-([0-9]+)(?:-{2,}2)?.
but for the first URL it returned 2, but this is in non-capturing group.
Please give me a solution. I am digging it for so long. not getting any idea.
Try this one:
.*?\-([0-9]+)(?:-{2,}2|$)
It sets lazy mode for the first .* pattern; you can also remove it altogether with the same effect:
\-([0-9]+)(?:-{2,}2|$)
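A quick way to confirm the shorter pattern against both URLs from the question, sketched in Python (using search, so no leading .* is needed):

```python
import re

# "-" then the digits to capture, then either "--...2" or the end of the string.
PATTERN = re.compile(r"-([0-9]+)(?:-{2,}2|$)")

urls = [
    "http://www.test.in/some--wonders-1234567---2",
    "http://www.test.in/some--wonders-1234567",
]

ids = []
for url in urls:
    m = PATTERN.search(url)
    ids.append(m.group(1) if m else None)

print(ids)
```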
If your regex engine supports variable-length negative lookbehinds (many do not), you can do it this way:
(?<!\d+-+)\d+
It gives you any non-empty digit string which is not preceded by digits followed by minuses.
Big advantage is that you don't have to use groups here - regex itself returns what you want.
You could match a - followed by one or more digits which you could capture in a group ([0-9]+). This group will contain the value you want to extract.
Then an optional part (?:-{2,}[0-9]+)? that would match ---2 followed by asserting the end of the line $.
-(\d+)(?:-{2,}\d+)?$
Explanation
- Match literally
(\d+) Capture one or more digits in a group
(?: Non capturing group
-{2,} Match 2 or more times -
\d+ Match one or more digits
)? close non capturing group and make it optional
$ Assert position at the end of the line
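Because the optional part here accepts any digits rather than a literal 2, this version also copes with other suffixes. A small Python sketch; the third URL with ---37 is a made-up example, not from the question.

```python
import re

PATTERN = re.compile(r"-(\d+)(?:-{2,}\d+)?$")

def extract_id(url):
    # Group 1 holds the digits before the optional "--...N" suffix.
    m = PATTERN.search(url)
    return m.group(1) if m else None

print(extract_id("http://www.test.in/some--wonders-1234567---2"))
print(extract_id("http://www.test.in/some--wonders-1234567"))
print(extract_id("http://www.test.in/some--wonders-1234567---37"))  # hypothetical URL
```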