How can I capture the desired group using REGEX - regex

How can I break this string, to just capture Chocolate cake & nuts?
Input string
pizza & coke > sweets > Chocolate cake & nuts >
I am using this regex:
.*[\>]\s(.*)
However, it is capturing Chocolate cake & nuts >
How can I remove the > and the space in the end?
Desired result
lastone=Chocolate cake & nuts

Avoiding capture of space around the final phrase is a bit tricky. In Java,
.*>\s*(\S+(?:\s+[^>\s]+)*)\s*>.*
captures everything except initial and ending whitespace between the final two >'s. Note that you only get the last stuff between >'s because the * is "greedy." It matches the longest possible string that allows the rest of the regex to match.
Note that when you ask about a regex, you need to specify which regex engine you're using.
Edit: How it works
.*> matches anything followed by >. Then \s* matches 0 or more whitespace chars, and capturing starts. The \S+ matches one or more non-space characters, and (?:\s+[^>\s]+)* matches 0 or more repeats of spaces followed by characters that are anything except > and space (this is the tricky part), whereupon capturing stops. The (?: ) form of parentheses is non-capturing. It only groups what's inside so * can match 0 or more of whatever that is. Finally, \s*>.* matches a final > preceded by optional whitespace and followed by anything.
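As a quick check of the walkthrough above, here is a minimal sketch in Python rather than Java. It assumes Python's re module treats this particular pattern the same way Java does; the helper name last_segment is made up for illustration.

```python
import re

# Pattern copied from the answer; assumed to behave the same in Python's re as in Java.
LAST_SEGMENT = re.compile(r".*>\s*(\S+(?:\s+[^>\s]+)*)\s*>.*")

def last_segment(text):
    """Return the trimmed text between the final two '>' characters, or None."""
    m = LAST_SEGMENT.fullmatch(text)
    return m.group(1) if m else None

print(last_segment("pizza & coke > sweets > Chocolate cake & nuts >"))
```

The greedy leading .* is what pushes the capture to the last pair of > characters; the group itself never starts or ends on whitespace.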

Try moving the > out of the capture group: .*[\>]\s(.*?)\s*>
Or the more precise version [>\s]+(\w+[\w ]*&[ \w]*\w+)[> ]+

Related

How to extract a word that could possibly be followed with another word

I want to extract [games, games, things, things] from
the following array.
Today_games
Today_games_freq
Today_things
Today_things_freq
I have tried Today_(\w+)(?=_freq)?
Which will give me the extra "freq"
And some other combinations, but I couldn't figure out how to get just the part after the first underscore.
You can use
Today_(\w+?)(?:_freq)?$
See the regex demo. This matches Today_, then captures any one or more word chars (as few as possible) into Group 1 (with (\w+?)), and then (?:_freq)?$ matches an optional occurrence of a _freq substring and asserts the position at the end of string.
Or,
Today_([^\W_]+)
See this regex demo.
Here, Today_ is matched and the ([^\W_]+) pattern captures one or more alphanumeric chars into Group 1 (same as \w+ with _ subtracted from \w).
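To compare the two patterns side by side, here is a small Python sketch (Python is an assumption; the question does not name an engine). Each string is matched on its own, so $ means the end of that string.

```python
import re

names = ["Today_games", "Today_games_freq", "Today_things", "Today_things_freq"]

# Lazy capture followed by an optional _freq suffix and end of string.
lazy = re.compile(r"Today_(\w+?)(?:_freq)?$")
# Capture word characters except the underscore.
no_underscore = re.compile(r"Today_([^\W_]+)")

lazy_result = [lazy.match(n).group(1) for n in names]
class_result = [no_underscore.match(n).group(1) for n in names]

print(lazy_result)
print(class_result)
```

The two differ once the middle part itself contains an underscore: for a hypothetical input such as Today_big_games_freq the first pattern captures big_games, while the second stops at big.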

How to find regex for multiple conditions

I am trying to find a regex that would find the matches below, which I would then replace with blank. I am able to create a regex for a few of these conditions individually, but I am not able to figure out how to create one regex for all of them.
Strings:
song1 artist (SiteWithMp3Keyword.com).mp3
02.song2 | siteWithdownloadKeyword.in 320 Kbps
song3 [SitewithDjKeyword.in] 128kbps.mp3
Output
song1 artist.mp3
song2
song3.mp3
Criteria for match:
Case Insensitive
Find Strings with particular keyword and remove whole word, even if inside any braces
Find kbps keyword and remove it along with any number before it (128/320)
if string ends in .mp3, keep it as it is.
Remove junk characters (like | ) and replace _ with space.
Remove number if present at start of string, like 001_ 02. etc.
Trim whitespaces before and after remaining string
Example Regex for 2.
\S+(mp3|dj|download)\S+
https://regex101.com/r/nxp4d3/1
Try this regex:
Find: ^[0-9. ]*(song\d+ (\w+ )?).*?(\.mp3 ?)?$
Replace with: $1$3
P.S. If this doesn't solve your problem, please share a sample of your real data so that someone can better understand it.
Thanks.
For the example data, you might use:
^\h*(?:\d+\W*)?(\w+(?:\h+\w+)*).*?(\.mp3)?\h*$
The pattern matches:
^ Start of string
\h* Match optional leading spaces
(?:\d+\W*)? Optionally match 1+ digits followed by optional non word characters
(\w+(?:\h+\w+)*) Capture group 1, match word characters optionally repeated with a space in between
.*? Match any character except a newline, as few as possible
(\.mp3)? Optionally capture .mp3 in group 2
\h* Match optional trailing spaces
$ End of string
Replace with capture group 1 and group 2
$1$2
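For readers who want to try this outside a PCRE tool, here is a rough Python sketch. Python's re module has no \h, so [ \t] is substituted for it; that substitution and the helper name clean are assumptions of this example, and it is only checked against the three sample strings from the question.

```python
import re

# \h (horizontal whitespace) is not available in Python's re, so [ \t] stands in for it.
CLEANUP = re.compile(r"^[ \t]*(?:\d+\W*)?(\w+(?:[ \t]+\w+)*).*?(\.mp3)?[ \t]*$")

def clean(name):
    """Keep group 1 plus the optional .mp3 extension captured in group 2."""
    return CLEANUP.sub(lambda m: m.group(1) + (m.group(2) or ""), name)

samples = [
    "song1 artist (SiteWithMp3Keyword.com).mp3",
    "02.song2 | siteWithdownloadKeyword.in 320 Kbps",
    "song3 [SitewithDjKeyword.in] 128kbps.mp3",
]
for s in samples:
    print(clean(s))
```

The replacement is written as a function so that a missing group 2 becomes an empty string explicitly instead of relying on how the engine expands an unmatched group.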

Extract application name from user agent

I am using the following regex to extract application name from user agents:
^([^\s/\[]+)([\s/\[]|\z)
The application name termination character class consists of white space, forward slash and [.
It reads characters from the beginning of the string until it reaches whitespace, / or [.
link : https://regex101.com/r/7ndDEq/1
It fails on application names that contain white space: it extracts only the characters before the white space.
eg:
Based on above regex on:
Pump Log/1300 CFNetwork/1121.2.2 Darwin/19.3.0
It extracts Pump
but the ground truth is Pump Log
Unless I'm misreading your requirements, your application name is anything up to but not including the first slash, which would just be
^([^/]+)
Or depending on your regex engine (which you should always specify when asking regex questions), you could do this with PCRE:
^(.+?)/
Try this:
^([^\s/[]+(?:\s[\w]+/)?)
It's almost there (the last slash should be removed in some matches).
The principle is simple: after capturing the required string, allow the regex to catch the optional stuff (in our case it's the second word after the first space) as well if it is available after the main match (the ? sign at the end makes this second part like optional).
UPD: this one is more general
^([^\s/[]+(?: [^/\d]+)?)
But there are two interesting points here:
I had to put a whitespace in regex, \s did not work there, I don't know how it will be in the code
It is required to have some rule what is possible after the whitespace, where we need to stop in the second optional part. If it's a slash or a bracket that will work fine but in strings like Apple iPhone10,4 iOS v13.3.1 Main/3.2.0 or POF 12.51.1859; (iPhone8,4; iOS 13.3.1; en_US; g=ON; p=ON; r=WWAN) 56BA8A93-3748-4C5E-9D00-D811FCC4EBCE; it's hard to find where to stop...
You might specify the allowed characters in a character class or use an alternation |
You can extend those to allow more characters or allowed strings.
^([^\s/\[]+(?: (?:& )?[A-Z][a-z]*)*)(?:[\s/\[]|\Z)
^ Start of string
( Capture group 1
[^\s/\[]+ Match 1+ times any char except a whitespace char, / or [
(?: Non capture group, starting with a space (or use \s+ to match 1+ whitespace chars, which could also match a newline)
(?:& )?[A-Z][a-z]* Optionally match & and match an uppercase char A-Z followed by optional lowercase chars a-z
)* Close non capture group and optionally repeat
) Close group 1
(?:[\s/\[]|\Z) Match either a space / [ or assert the end of the string
Note that as you selected Python on regex101, you can use \Z to assert the position at the end of the string.
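Since Python was selected on regex101, a short Python sketch of this last pattern may help. The helper name app_name and the second test string are made up for illustration; only the Pump Log user agent comes from the question.

```python
import re

# Pattern from the answer above: first token, then optional capitalised words
# (optionally preceded by "& "), terminated by whitespace, /, [ or end of string.
APP_NAME = re.compile(r"^([^\s/\[]+(?: (?:& )?[A-Z][a-z]*)*)(?:[\s/\[]|\Z)")

def app_name(user_agent):
    """Return the application name captured in group 1, or None if nothing matches."""
    m = APP_NAME.match(user_agent)
    return m.group(1) if m else None

print(app_name("Pump Log/1300 CFNetwork/1121.2.2 Darwin/19.3.0"))
```

For user agents where the name is simply everything before the first slash, the much shorter ^([^/]+) from the first answer gives the same result on this input.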

Match certain string on second line of text with regex

I'm new to regex, and would appreciate some guidance/help.
Currently, I'm looking to write an expression, that derives a certain part of text from the 2nd line of the provided text.
Here is the text:
123 anywhere Avenue
Winnipeg, Manitoba R3E 0L7
Canada
Pharmacy Manager: person person
Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd.
My goal is to derive the 'Manitoba' string from the second line, however I'd like to make it dynamic rather than writing an expression to always fetch Manitoba as a static. I used the below code to target the second line:
(.*)(?=(\n.*){3}$)
(It matches 3 lines up from the last line, thus targeting the desired line)
I noticed, that within the dataset, that the Province (Manitoba) is always in between two spaces.
Is there any addition I can make to the code, so that the expression only targets the second line, then matches the first string in-between spaces?
Perhaps using a lazy expression with a positive lookaround?
If I target all matches in between spaces, it would take both 'Manitoba' and 'R3E 0L7' which I dont want.
I want it to only match the first piece of text in between spaces on the second line.
Any help is much appreciated :-)
Thanks.
One option could be to match the first line, then capture the second word of the second line in capturing group 1.
Then match the rest of the second line and assert what follows is 3 times a line.
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)
In parts:
^ Start of string
.*\r?\n Match the whole first line and a newline
\S+ Match 1+ non whitespace char (the first "word")
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
(\S+) Capture group 1, match 1+ times a non whitespace char (the second "word")
.* Match the rest of the line
(?= Positive lookahead, assert what follows on the right is
(?:\r?\n.*){3}$ Match 3 times a newline followed by 0+ times any char except a newline, and assert the end of the string
) Close lookahead
You could also turn the lookahead in to a match instead
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?:\r?\n.*){3}$
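A minimal Python sketch of the lookahead version, run against the address block from the question. Python is an assumption here (the question does not name an engine), and the pattern is used without the MULTILINE flag so that ^ and $ refer to the whole text.

```python
import re

text = "\n".join([
    "123 anywhere Avenue",
    "Winnipeg, Manitoba R3E 0L7",
    "Canada",
    "Pharmacy Manager: person person",
    "Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd.",
])

# First line, then the second "word" of the second line, followed by exactly 3 more lines.
SECOND_WORD = re.compile(r"^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)")

m = SECOND_WORD.search(text)
province = m.group(1) if m else None
print(province)
```

Because the pattern counts three lines after the second one, it only matches blocks with this exact five-line layout.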

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs I'd like to get rid of the last words/numbers/etc in the last quotation marks (like "Senior Champion", "Junior Champion" etc. There are many possibilities here)
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK I forgot the . (dot) sign however even if I do not have a . (dot) sign it would not work. Not sure if it has anything to do when using the + sign used more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
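The same idea can be tried outside Notepad++. Python's re module supports neither \K nor \h, so the sketch below swaps in a capture group and [ \t]; that rewrite is an assumption of this example, not part of the answer.

```python
import re

records = "\n".join([
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00060" "GBA146" "Junior Champion"',
    '201910031044 "00999" "10.12G" "ProAM"',
    '201910031044 "00362" "113.1LI" "Abcd"',
])

# Group 1 plays the role of everything before \K; the trailing quoted field is dropped.
DROP_LAST = re.compile(r'^(.+)[ \t]+".*?"$', re.MULTILINE)

trimmed = DROP_LAST.sub(r"\1", records)
print(trimmed)
```

The greedy .+ keeps as much of the line as it can, so only the final quoted field is removed.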
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as a CSV in code.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
This is not a character count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets and it will match only a literal dot. The dot is usually a wildcard, but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of a space followed by +. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
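To see the corrected search and replace in action without Notepad++, here is a small Python sketch. It assumes Python's re treats this pattern like Notepad++ does; the MULTILINE flag makes ^ and $ work per line.

```python
import re

records = "\n".join([
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00060" "GBA146" "Junior Champion"',
    '201910031044 "00999" "10.12G" "ProAM"',
    '201910031044 "00362" "113.1LI" "Abcd"',
])

# 12 digits, a 5-digit quoted field, a quoted alphanumeric/dot field, then the rest of the line.
KEEP_THREE = re.compile(r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$', re.MULTILINE)

trimmed = KEEP_THREE.sub(r"\1", records)
print(trimmed)
```

Lines that do not follow the three-field layout are left untouched, which is the main difference from simply stripping the last quoted field.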
If you want to get rid of the last words/numbers/etc in the last quotation marks you could capture in a group what is before that and match the last quotation marks and everything between it to remove it using a negated character class.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2+ times a space or tab
"[^"\r\n]+" Match " then 1+ times any char except " or a newline, then the closing "
In the replacement use group 1 $1