I want to be able to match with a certain condition, and keep certain parts of it. For example:
June 2021 9 Feature Article Three-Suiters Via Puppets Kai-Ching Lin
should turn into:
Jun 2021 Three-Suiters Via Puppets Kai-Ching Lin
So, everything up to the end of the word Article should be matched; then only the first three characters of the month and the year are kept, and this part replaces the matched characters.
My strong regex knowledge got me as far as:
.+Article(?)
You could use 2 capture groups and use those in the replacement (for example $1$2, or \1\2 depending on the tool):
\b([A-Z][a-z]{2})[a-z]*(\s+\d{4})\b.*?\bArticle\b
\b A word boundary to prevent a partial word match
([A-Z][a-z]{2}) Capture group 1, match a single uppercase char and 2 lowercase chars (the first three chars of the month)
[a-z]* Match the rest of the month, optional chars a-z
(\s+\d{4})\b Capture group 2, match 1+ whitespace chars and 4 digits followed by a word boundary
.*?\bArticle\b Match as few chars as possible until Article
The replaced value will be
Jun 2021 Three-Suiters Via Puppets Kai-Ching Lin
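As a minimal sketch of the capture-group approach in Python (assuming Python's re module, where the replacement refers to the groups as \1 and \2; the helper name shorten_header is made up for illustration). The pattern here keeps exactly three letters of the month with [a-z]{2} and skips the rest of the month with [a-z]*:

```python
import re

# Group 1: first three letters of the month; group 2: whitespace plus the year.
PATTERN = re.compile(r"\b([A-Z][a-z]{2})[a-z]*(\s+\d{4})\b.*?\bArticle\b")

def shorten_header(line: str) -> str:
    """Replace everything up to 'Article' with the short month and the year."""
    return PATTERN.sub(r"\1\2", line, count=1)

print(shorten_header("June 2021 9 Feature Article Three-Suiters Via Puppets Kai-Ching Lin"))
# Jun 2021 Three-Suiters Via Puppets Kai-Ching Lin
```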
You could use positive lookbehinds and replace the matches with an empty string:
(?<=^[A-Z][a-z]{2})[a-z]*|(?<=\d{4}).*Article
(?<=^[A-Z][a-z]{2}) - behind me is the start of a line and 3 chars; presumably the first three chars of the month
[a-z]* - optionally, match the rest of the month
| - or
(?<=\d{4}) - behind me is 4 digits; presumably a year
.*Article - match everything leading up to and including "Article"
https://regex101.com/r/fbYdpH/1
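The lookbehind version can be tried the same way. This is only a sketch, assuming Python's re module (both lookbehinds are fixed-width, which is what that module requires) and a made-up helper name strip_header. Here the unwanted parts are matched and replaced with an empty string:

```python
import re

# Alternative 1: the rest of the month after its first three letters.
# Alternative 2: everything after the 4-digit year up to and including "Article".
PATTERN = re.compile(r"(?<=^[A-Z][a-z]{2})[a-z]*|(?<=\d{4}).*Article")

def strip_header(line: str) -> str:
    """Delete the matched parts, leaving the short month, the year and the title."""
    return PATTERN.sub("", line)

print(strip_header("June 2021 9 Feature Article Three-Suiters Via Puppets Kai-Ching Lin"))
# Jun 2021 Three-Suiters Via Puppets Kai-Ching Lin
```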
Related
I've been trying to solve this problem for a few hours but with no luck. The task is to write a regular expression that matches at least four words starting with the same letter. But! These words do not have to be one after another.
This regex should be able to match a line like this:
cat color coral chat
but also one like this:
cat take boom candle creepy drum cheek
Thank you!
So far I have got this regex but it only matches words when they are in order.
(\w)\w+\s+\1\w+\s+\1\w+\s+\1
If you have only words in the line that can be matched with \w:
\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}
Explanation
\b A word boundary to prevent a partial word match
(\w)\w* Capture a single word character in group 1 followed by matching optional word characters
(?: Non capture group to repeat as a whole part
(?:\s+\w+)*? Optionally repeat, as few times as possible, 1+ whitespace chars and 1+ word chars, to skip words in between that do not start with the character captured in group 1
\s+\1\w* Match 1+ whitespace chars, a backreference to the same captured character and optional word characters
){3} Close the non capture group and repeat 3 times
See a regex demo
Note that \s can also match a newline.
If the words that start with the same character should be at least 2 characters long (as (\w)\w+ matches 2 or more characters):
\b(\w)\w+(?:(?:\s+\w+)*?\s+\1\w+){3}
See another regex demo.
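A quick way to check the first pattern against the lines from the question, sketched in Python (the re module is an assumption, and the function name has_four_words_same_initial is made up for illustration):

```python
import re

# Group 1 is the first letter of a word; the repeated group then requires
# three more whitespace-separated words starting with that same letter.
FOUR_SAME_INITIAL = re.compile(r"\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}")

def has_four_words_same_initial(line: str) -> bool:
    return FOUR_SAME_INITIAL.search(line) is not None

for line in ["cat color coral chat",
             "cat take boom candle creepy drum cheek",
             "cat take boom candle"]:
    print(line, "->", has_four_words_same_initial(line))
# cat color coral chat -> True
# cat take boom candle creepy drum cheek -> True
# cat take boom candle -> False
```

The match is case sensitive, so Cat and color would not count as starting with the same letter unless a case-insensitive flag is used.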
Another idea to match lines with at least 4 words starting with the same letter:
\b(\w)(?:.*?\b\1){3}
See this demo at regex101
This is less strict: it only checks that the character captured by \b(\w) is followed three more times by a word boundary \b and the same character \1, with any characters .*? in between.
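To illustrate the difference between this shorter pattern and the whitespace-based one, here is a small sketch in Python (the re module and the sample line with punctuation are assumptions chosen for the comparison):

```python
import re

# Looser pattern: any characters may sit between the four word starts.
LOOSE = re.compile(r"\b(\w)(?:.*?\b\1){3}")
# Stricter pattern: words must consist of \w chars and be separated by whitespace.
STRICT = re.compile(r"\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}")

line = "cat, take; candle - creepy/cheek"
print(bool(LOOSE.search(line)))   # True
print(bool(STRICT.search(line)))  # False
```

Which behavior is wanted depends on whether punctuation between the words should be accepted.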
The strings I parse with a regular expression contain a region of fixed length N where there can either be numbers or dashes. However, if a dash occurs, only dashes are allowed to follow for the rest of the region. After this region, numbers, dashes, and letters are allowed to occur.
Examples (N=5, starting at the beginning):
12345ABC
12345123
1234-1
1234--1
1----1AB
How can I correctly match this? I currently am stuck at something like (?:\d|-(?!\d)){5}[A-Z0-9\-]+ (for N=5), but I cannot make numbers work directly following my region if a dash is present, as the negative look ahead blocks the match.
Update
Strings that should not be matched (N=5)
1-2-3-A
----1AB
--1--1A
You could assert that the first 5 characters are either digits or - and make sure that there is no - before a digit in the first 5 chars.
^(?![\d-]{0,3}-\d)(?=[\d-]{5})[A-Z\d-]+$
^ Start of string
(?![\d-]{0,3}-\d) Make sure that in the first 5 chars there is no - before a digit
(?=[\d-]{5}) Assert that the first 5 chars are digits or -
[A-Z\d-]+ Match 1+ times any of the listed characters
$ End of string
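The examples from the question can be run against this pattern. A minimal sketch in Python (the re module is an assumption; the two lists simply restate the strings from the question and its update):

```python
import re

# The negative lookahead rejects a dash followed by a digit inside the first 5 chars.
REGION = re.compile(r"^(?![\d-]{0,3}-\d)(?=[\d-]{5})[A-Z\d-]+$")

should_match = ["12345ABC", "12345123", "1234-1", "1234--1", "1----1AB"]
should_not_match = ["1-2-3-A", "----1AB", "--1--1A"]

for s in should_match + should_not_match:
    print(f"{s!r}: {bool(REGION.match(s))}")
```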
If atomic groups are available:
^(?=[\d-]{5})(?>\d+-*|-{5})[A-Z\d_]*$
^ Start of string
(?=[\d-]{5}) Assert that the first 5 chars are - or a digit
(?> Atomic group
\d+-* Match 1+ digits and optional -
| or
-{5} match 5 times -
) Close atomic group
[A-Z\d_]* Match optional chars A-Z digit or _
$ End of string
Use a non-word-boundary assertion \B:
^[-\d](?:-|\B\d){4}[A-Z\d-]*$
A non-word boundary succeeds at a position between two word characters (from \w, i.e. [A-Za-z0-9_]) or between two non-word characters (from \W, i.e. [^A-Za-z0-9_]), and also between a non-word character and the limit of the string.
With it, each \B\d always follows a digit (and can't follow a dash).
Another way (if lookbehinds are allowed):
^\d*-*(?<=^.{5})[A-Z\d-]*$
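Both the \B pattern and the lookbehind pattern can be checked against the strings from the question. This is a sketch in Python, assuming the re module (the lookbehind ^.{5} is fixed-width, so that module accepts it):

```python
import re

# Variant 1: after the first char, a digit is only allowed right after a word char.
NON_BOUNDARY = re.compile(r"^[-\d](?:-|\B\d){4}[A-Z\d-]*$")
# Variant 2: digits, then dashes, and the lookbehind pins that run to exactly 5 chars.
LOOKBEHIND = re.compile(r"^\d*-*(?<=^.{5})[A-Z\d-]*$")

should_match = ["12345ABC", "12345123", "1234-1", "1234--1", "1----1AB"]
should_not_match = ["1-2-3-A", "----1AB", "--1--1A"]

for s in should_match + should_not_match:
    print(f"{s!r}: {bool(NON_BOUNDARY.match(s))} {bool(LOOKBEHIND.match(s))}")
```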
The words' length could be 2 or 6-10, and the words could be separated by a space or a comma. The words only include letters of the alphabet and are not case sensitive.
Here is the groups of words that should be matched:
RE,re,rereRE
Not matching groups:
RE,rere,rel
RE,RERE
Here is the pattern that I have tried
((([a-zA-Z]{2})|([a-zA-Z]{6,10}))(,|\s+)?)
But unfortunately this pattern can match string like this: RE,RERE
It looks like the word boundary has not been set.
You could match chars a-z either 2 or 6-10 times using an alternation, then repeat that pattern 0+ times, each time preceded by a comma or a space [, ].
^(?:[A-Za-z]{6,10}|[A-Za-z]{2})(?:[, ](?:[A-Za-z]{6,10}|[A-Za-z]{2}))*$
Explanation
^ Start of string
(?:[A-Za-z]{6,10}|[A-Za-z]{2}) Match chars a-z 6-10 or 2 times
(?: Non capturing group
[, ](?:[A-Za-z]{6,10}|[A-Za-z]{2}) Match comma or space and repeat previous pattern
)* Close non capturing group and repeat 0+ times
$ End of string
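A short sketch of how the anchored pattern behaves on the three lines from the question, in Python (the re module and the names WORD and LINE are assumptions made for the example):

```python
import re

# One word: 6-10 letters or exactly 2 letters.
WORD = r"(?:[A-Za-z]{6,10}|[A-Za-z]{2})"
# A whole line: one word, then any number of words each preceded by a comma or space.
LINE = re.compile(rf"^{WORD}(?:[, ]{WORD})*$")

for s in ["RE,re,rereRE", "RE,rere,rel", "RE,RERE"]:
    print(s, "->", bool(LINE.match(s)))
# RE,re,rereRE -> True
# RE,rere,rel -> False
# RE,RERE -> False
```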
If lookarounds are supported, you might also assert what is directly on the left and on the right is not a non whitespace character \S.
(?<!\S)(?:[A-Za-z]{6,10}|[A-Za-z]{2})(?:[ ,](?:[A-Za-z]{6,10}|[A-Za-z]{2}))*(?!\S)
([a-zA-Z]{2}(,|\s)|[a-zA-Z]{6,10}|(,|\s))
This one will get only the words that have 2 letters, or between 6 and 10 letters:
\b,?([a-zA-Z]{6,10}|[a-zA-Z]{2}),?\b
You can use this
^(?!.*\b[a-z]{4}\b)(?:(?:[a-z]{2}|[a-z]{6,10})(?:,|[ ]+)?)+$
This regex will match your first case, but neither of your two other cases:
^((([a-zA-Z]{2})|([a-zA-Z]{6,10}))(,|[ ]+|$))+$
I'm making the assumption here that each line should be a single match.
Here it is in action.
I'm trying to write a regex for an alphanumeric string of length 7 (with positions 1, 3, 4 as letters and positions 2, 5, 6, 7 as digits).
[a-zA-Z]|[0-9]|[a-zA-Z]|[a-zA-Z]|[0-9]|[0-9]|[0-9]
Can someone help me?
The sequence "character, digit, character, character, digit, digit, digit" is expressed in regex as
[a-zA-Z][0-9][a-zA-Z]{2}[0-9]{3}
If you're working in PCRE (with say, PHP):
^([a-zA-Z])([0-9])(?1){2}(?2){3}$
Breakdown:
^ - from the start of the string
([a-zA-Z]) - match and capture a single character in the ranges given: a-z, A-Z
([0-9]) - match and capture a single character in the ranges given: 0-9
(?1){2} - run the pattern of the first group again, twice (a subroutine call)
(?2){3} - run the pattern of the second group again, 3 times (a subroutine call)
$ - the end of the string
If you want to match this in the middle of a sentence, exchange ^ and $ for \b - which will match a word boundary
See the demo
If you're not using PCRE:
^[a-zA-Z][0-9][a-zA-Z]{2}[0-9]{3}$
Which does the same thing, but has some copy-paste involved
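For a concrete check of the non-PCRE version, here is a sketch in Python (an assumption; Python's standard re module has no (?1) subroutine calls, so the spelled-out form is used). The sample codes A1bc234 and x9YZ000 are made up for illustration:

```python
import re

# Whole string must be: letter, digit, two letters, three digits.
CODE = re.compile(r"^[a-zA-Z][0-9][a-zA-Z]{2}[0-9]{3}$")
# Same sequence inside a longer text, delimited by word boundaries.
EMBEDDED = re.compile(r"\b[a-zA-Z][0-9][a-zA-Z]{2}[0-9]{3}\b")

print(bool(CODE.match("A1bc234")))  # True
print(bool(CODE.match("A1b2345")))  # False
print(EMBEDDED.findall("codes A1bc234 and x9YZ000 here"))  # ['A1bc234', 'x9YZ000']
```

The attempt with | between the character classes matches any single letter or digit, because | means "or"; placing the classes next to each other is what expresses the sequence.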
I am trying to write a regular expression that takes a string and parses it into three different capturing groups:
$3.99 APP DOWNLOAD – 200 11/19 – 1/21 3.99
Group 1: $3.99 APP DOWNLOAD – 200
Group 2: 11/19 – 1/21
Group 3: 3.99
Does anyone have any ideas???
I do not have much experience with capturing groups and do not know how to create them.
i.e. I believe this expression would work for identifying the dates?
/(\d{2}\/\d{2})/
Any help would be greatly appreciated!
Regex:
([$]\d+[.]\d{2}.*?)\s*(\d{1,2}/\d{2}.*?\d{1,2}/\d{2})\s(\d+[.]\d{2})
So with this we have 3 capture groups (()) separated by whitespace: \s* means 0+ characters of whitespace, and the last separator is a single \s (this isn't necessary, but it will remove trailing spaces from your captured groups).
The first capture group [$]\d+[.]\d{2}.*? matches a dollar sign, followed by 1+ digits, followed by a period, followed by 2 digits, followed by a lazy match of 0+ characters (.*?). What this lazy match does is match anything up until the next match in our expression (in this case, our next capture group).
Our second capture group \d{1,2}/\d{2}.*?\d{1,2}/\d{2} matches 1-2 digits, a slash, and 2 digits. Then we use another lazy match of any characters followed by another date.
Our final capture group \d+[.]\d{2} looks for 1+ digits, a period, and 2 more digits.
Note: I used ~ as delimiters so that we do not need to escape our / in the dates. Also, I put $ and . in character classes because I think it looks cleaner than escaping them ([$] vs \$); either works though :)
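To see the three groups in practice, here is a minimal sketch in Python (an assumption; Python needs no delimiters around the pattern, so the / in the dates is left unescaped). The variable names are made up for the example:

```python
import re

# Group 1: price and description; group 2: the date range; group 3: the final amount.
LINE = re.compile(r"([$]\d+[.]\d{2}.*?)\s*(\d{1,2}/\d{2}.*?\d{1,2}/\d{2})\s(\d+[.]\d{2})")

text = "$3.99 APP DOWNLOAD – 200 11/19 – 1/21 3.99"
match = LINE.search(text)
if match:
    description, dates, amount = match.groups()
    print(description)  # $3.99 APP DOWNLOAD – 200
    print(dates)        # 11/19 – 1/21
    print(amount)       # 3.99
```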