Regex match for text - regex

I am trying to create a regex to match the content between numbered lists, e.g. with the following content:
1) Text for part 1
2) Text for part 2
3) Text for part 3

The following PCRE should work, assuming you haven't got anything formatted like "1)" or the like inside of the sections:
\d+\)\s*(.*?)\s*(?=\d+\)|$)
Explanation:
\d+\) gives a number followed by a ).
\s* matches any whitespace between the ) and the contents.
(.*?) captures the contents non-greedily.
\s* matches the trailing whitespace.
(?=\d+\)|$) ensures that the match is followed by either the start of a new section or the end of the text.
Note, it doesn't enforce that they must be ascending or anything like that, so it'd match the following text as well:
4) Hello there 1) How are you? 5) Good.
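As a quick check of the pattern above, here is a minimal sketch in Python. Using Python's re module instead of PCRE is an assumption, as is the re.S flag, which is added so that . can cross line breaks when each item sits on its own line.

```python
import re

# Pattern from the answer above; re.S lets "." span newlines.
SECTION = re.compile(r"\d+\)\s*(.*?)\s*(?=\d+\)|$)", re.S)

text = "1) Text for part 1\n2) Text for part 2\n3) Text for part 3"
parts = SECTION.findall(text)
print(parts)

# Numbering is not validated, so out-of-order items are matched too.
mixed = SECTION.findall("4) Hello there 1) How are you? 5) Good.")
print(mixed)
```

findall returns only group 1 here, so the list numbers and the surrounding whitespace are dropped.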

I'd suggest the following (PCRE):
(?:\d+\)\s*(.*?))*$
The inner part \d+\)\s* matches the list number and the closing parenthesis, followed by optional whitespace.
(.*?) matches the list text, but in a non-greedy manner (otherwise, it would also match the next list item).
The enclosing (?: )*$ then matches the above zero or more times, until the end of the input.

Keep in mind that the text after the number and parenthesis might be anything. This would find your substrings:
\d\).+?(?=\d\)|$)
EDIT:
To get rid of the leading whitespace and return only the text without the number, take group 1 from the following match (note \s* for whitespace; \w* would match word characters instead):
\d\)\s*(.+?)(?=\d\)|$)
To get the number in group 1 and the text in group 2, use this:
(\d)\)\s*(.+?)(?=\d\)|$)
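A small Python sketch of the number-plus-text idea, under a few assumptions: whitespace after the parenthesis is skipped with \s*, the re.M flag is added so that $ also matches at each line end, and the items use single-digit numbers (a single \d would not cover 10 and above).

```python
import re

# Group 1: the item number; group 2: the item text.
# \s* skips the blanks after ")"; re.M lets "$" match at every line end.
ITEM = re.compile(r"(\d)\)\s*(.+?)(?=\d\)|$)", re.M)

text = "1) Text for part 1\n2) Text for part 2\n3) Text for part 3"
items = {int(m.group(1)): m.group(2) for m in ITEM.finditer(text)}
print(items)
```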

Related

Notepad++ regex extract two options

I've a list below:
7080508136242611718:7080508978035787525:7549dda86ba9af19:31050:install_id=7080508978035787525; store-country-code=us; store-idc=useast5; ttreq=1$fd2f36282a10633c5638a02cc54c19ff13f60755; passport_csrf_token=13bf74c4e5fe04307f0a99de9aed53f9; passport_csrf_token_default=13bf74c4e5fe04307f0a99de9aed53f9; odin_tt=11ed1b48fba2d7a9fe3d86929b3d52cebbad0ca7f7dbd127e220cfb3be279621ba04487517b536050a6ded9fbe50e300cd11615e2e9551523478e5484896a9dda800e55e428842872fcf862e8c57d439:1648559503:351451268482810:3f:49:8c:b7:8c:cb:c5379d41-6cf3-4152-9d48-7aa45f7f611c:79375640-197c-4aaa-86cf-4ef8e7238be2:1:AgICAw0AFockF-RPsNA-7qeIMtk5-CKdkW2eP4TZYMDY7A
7080507996291827206:7080508977079666438:6742591cc0d20580:31050:install_id=7080508977079666438; store-country-code=us; store-idc=useast5; ttreq=1$a119611bfe79541b0b4c029fe910b6507123eec2; passport_csrf_token=fb42bbd472462c17f45acb531deb057a; passport_csrf_token_default=fb42bbd472462c17f45acb531deb057a; odin_tt=6c3b06ff01fd67f42e3dccb60a1e69ca67cb8654f49662017acc209f7176517bcd13a374311f7a1b3538e6407fb237267abf43578d3180d8c834e7df886fa4377a9b950dbb6ff146e3fabf37158dcfa8:1648559508:351451233766930:dd:9e:82:59:5f:7f:596da881-89e8-4f60-b644-5fef23f0a422:f04adc87-56de-4191-a25f-843bec1d5818:1:AgICAw0AFockF-RPsNA-7qeIMtk5-CKdsYPWv4TZYMDY7A
7080509102451394054:7080509820378072837:e36dc9aceecfc1cc:31050:install_id=7080509820378072837; store-country-code=us; store-idc=useast5; ttreq=1$d94700921d5ee2b21992910a2a4e84dd0ade1ec8; passport_csrf_token=2d4f4eca772dbfcbb37548ff02da3166; passport_csrf_token_default=2d4f4eca772dbfcbb37548ff02da3166; odin_tt=53d6999ebe29c0d5144a9669331ce3307a290891370914dabadbfa0520114e6e76b9103c9a6db5476e139251ee478f3a305577a89e3fa07288b7aca00774d3fccbd03566687dbcfdce31700065295939:1648559700:351451299637010:71:de:41:2b:ad:b4:1eba1ae9-3216-40e1-be7f-00303e524c27:2713cbd3-7a4f-493e-b76f-ac6d56ab8045:5:AgMNAgIAhyQWF-RPsNA-7qeIMtk5-CKcsBcWP4TZYMDY7w
7080509086894851590:7080509909225604870:98be64e38551984d:31050:install_id=7080509909225604870; store-country-code=us; store-idc=useast5; ttreq=1$05929375d8605739d8ebdbb5ce15eb406da5c467; passport_csrf_token=c95c71ad206a1d371e5b67505ae25be8; passport_csrf_token_default=c95c71ad206a1d371e5b67505ae25be8; odin_tt=6ddaa02f6133e61a4c591ef2a872f0ec2339d8b6a3fc480575fe279b13ded615e1fa7de979e18565f3ac8b8229a19a98bdf79aa1804071dcc025e1a4cd5314522cf40a62ca961770baea1d5d653d6d64:1648559720:351451292934660:9d:cf:c3:92:f6:f5:787dfb42-f4bf-43fa-9c64-ded19a1b1660:366c3024-217d-4f85-90dd-d95a0fd3e296:4:AgICAw0AFockF-RPsNA-7qeIMtk5-CKcs7bUP4TZYMDY7w
7080509183397299718:7080509974838085382:f39db5d314071713:31050:install_id=7080509974838085382; store-country-code=us; store-idc=useast5; ttreq=1$561ee2083cb13f0849a9f09e7f89edfe08c7ce6c; passport_csrf_token=721a8fee6f4f97c16ed1923ad3bbc72d; passport_csrf_token_default=721a8fee6f4f97c16ed1923ad3bbc72d;
I'd like to extract the first two options, as shown below:
7080508136242611718:7080508978035787525
7080507996291827206:7080508977079666438
7080509102451394054:7080509820378072837
7080509086894851590:7080509909225604870
7080509183397299718:7080509974838085382
I've tried *.: but it removes the rest of the text and keeps only the first option.
I've tried ^.*[0-9]+.*$ to get the second one, but with no success.
Hopefully somebody can help me with accurate regex.
Thank you in advance.
This pattern *.: by itself is not a valid regex, and this pattern ^.*[0-9]+.*$ matches the whole string with at least a single digit.
If you want to match the digits and : you could make use of \K to forget what is matched so far and then match the rest of the line.
In the replacement use an empty string.
^\d+:\d+\K.*
^ Start of the line
\d+:\d+ Match 1+ digits with : in between
\K.* Clear the current match, and match the rest of the line
Regex demo
^[^:]*:[^:]*\K.*
When matching delimited fields, I use a negated character class to match the contents. Here the delimiter is a colon, so [^:]* matches everything up to the first colon, the literal : matches the colon itself, and the second [^:]* matches everything up to the second colon. Because I want to keep what has been matched so far, I put \K at that point to reset the start of the match, and then .* matches the rest of the line.
That match can be replaced with nothing, which leaves just the first two columns of each line.
You can use
Find: ^(\d+:\d+).*
Replace: $1
See this regex demo online.
The ^(\d+:\d+).* regex matches and captures into Group 1 one or more digits + : + one or more digits (with (\d+:\d+)) at the beginning of a line (^) and then matches the rest of the line (with .*).
The $1 replacement replaces the match with the Group 1 value.
As an alternative, if there are chars other than digits you can also use
^([^:\v]+:[^:\v]+).*
where [^:\v]+ matches one or more chars other than a colon and any vertical whitespace.
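Outside Notepad++, the capture-group replacement can be sketched in Python as follows. The sample lines are shortened stand-ins (the tail after the fourth field is made up), and the character class is adapted: Python's re treats \v as a plain vertical tab rather than a vertical-whitespace class, so \r and \n are excluded explicitly. Python's standard re module also has no \K, which is why only the group-based variants are shown.

```python
import re

# Shortened stand-ins for the lines in the question; the tail is made up.
lines = (
    "7080508136242611718:7080508978035787525:7549dda86ba9af19:31050:install_id=1; a=b\n"
    "7080507996291827206:7080508977079666438:6742591cc0d20580:31050:install_id=2; a=b"
)

# Keep group 1 (the first two numeric fields) and drop the rest of each line.
digits_only = re.sub(r"^(\d+:\d+).*", r"\1", lines, flags=re.M)

# Variant for fields that are not purely numeric; line breaks are
# excluded by hand because \v is not a class in Python's re.
any_chars = re.sub(r"^([^:\r\n]+:[^:\r\n]+).*", r"\1", lines, flags=re.M)

print(digits_only)
```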

notepad++ regex divide two lists

i have below list:
21870172299%3Akvm6wcmcVYaoQ2J%3A2 340282366841710300949128111982633033733
21200717504%3AUhGubOhpHPtBKLk%3A6 340282366841710300949128111984034029824
21256096197%3AMGYmtB2uoj4er5i%3A1 340282366841710300949128111984541030820
11665946937%3AHBBkUBzcy3cvbtb%3A5 340282366841710300949128111986242038268
21719881031%3AH3t9c4b7re6cs5%3A24 340282366841710300949128111986284030213
21697692027%3A1S0fM2Jp6Ivsxo9%3A5 340282366841710300949128111986299030036
20424141770%3AFPiScGMuAVBPGvk%3A7 340282366841710300949128111987613032298
I would like to use a regular expression to split these into two lists. Example:
list1:
21870172299%3Akvm6wccVYaoQ2J%3A2
21200717504%3AUhGubOpHPtBKLk%3A6
21256096197%3AMGYmtBuoj4er5i%3A1
11665946937%3AHBBkUBcy3cvbtb%3A5
21719881031%3AH3t9c4b7re6cs5%3A24
21697692027%3A1S0fMJp6Ivsxo9%3A5
20424141770%3AFPiSGMuAVBPGvk%3A7
list2:
340282366841710300949128111982633033733
340282366841710300949128111984034029824
340282366841710300949128111984541030820
340282366841710300949128111986242038268
340282366841710300949128111986284030213
340282366841710300949128111986299030036
340282366841710300949128111987613032298
I have tried an online regex tester (regex101), but my attempts failed.
Kindly help me divide these lists.
Thank you.
Copy this text and paste it twice into your text file, one copy below the other.
Select the first block of data:
Check the "In selection" option, use the pattern (^\S+).+ and replace it with \1, meaning the first capturing group.
Pattern explanation: ^ matches beginning of a string, \S+ matches one or more non-whitespace characters, .+ matches one or more of any character, (...) means store matched text in first capturing group.
Similarly, select second block of data and use pattern: ^\S+\s+(.+)
\s+ matches one or more of whitespaces. Again, check "In selection" check box.
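The same split can be sketched in one pass in Python by capturing both columns at once. This is an illustration under the assumption that every line has exactly one run of whitespace between the two columns; only the first two rows from the question are used.

```python
import re

rows = """\
21870172299%3Akvm6wcmcVYaoQ2J%3A2 340282366841710300949128111982633033733
21200717504%3AUhGubOhpHPtBKLk%3A6 340282366841710300949128111984034029824"""

# Same idea as the two Notepad++ passes: (\S+) is the first column,
# \s+ the separator, (.+) the second column.
pairs = re.findall(r"^(\S+)\s+(.+)$", rows, flags=re.M)
list1 = [first for first, _ in pairs]
list2 = [second for _, second in pairs]

print(list1)
print(list2)
```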

Regex get text inside one pair of square brackets

In the text below how do I get the text inside the first pair of square brackets
xxxx [I can be any text and even have digits like 0 25 ] [sdfsfsf] [ssf sf565wf]
This is what I tried. But it goes till the last square bracket.
.*\[.*]
What I want selected is
[I can be any text and even have digits like 0 25 ]
If you don't want to go past the closing square bracket, use [^\]]* in place of .*:
^[^\[]*(\[[^\]]*])
Add ^ anchor at the beginning if you would like to search multiple lines.
Add a capturing group around the square brackets, and get the content of that group to obtain the text that you need.
Demo.
Another one with DEMO. A bit complicated though:
(\[[^\]]+\])[^\[\]]*(?:\[[^\]].*\])
EXPLANATION
(\[[^\]]+\])       # capturing group: match the first [] pair
[^\[\]]*           # match characters except ] and [
(?:\[[^\]].*\])    # non-capturing group: match all the remaining [] pairs (this is a greedy match)
* and + are 'greedy' by default, so they try to match as much text as possible. If you want them to match as little as possible, you can make them non-greedy with ?, e.g. .*\[.*?\]. Also, the .* at the beginning matches any number of any characters before the opening square bracket, so this regex will match all text up to the first ']' that follows a '['. If you only want to match the brackets and their contents, you want simply \[.*?\].
Non-greedy modifiers with ? are not supported in all regex engines; if it's available to you you should write it with ? because it makes your intent clearer, but if you are using a simpler regex engine you can achieve the same effect by using \[[^\]]*\] instead. This is a negated character class, which matches as many as possible of any character except ']'.
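To see the difference between the three variants on the sample text, here is a short Python check (Python is chosen only for illustration; the patterns are the ones discussed above).

```python
import re

text = "xxxx [I can be any text and even have digits like 0 25 ] [sdfsfsf] [ssf sf565wf]"

# Greedy: .* runs on to the last "]" in the line.
greedy = re.search(r".*\[.*]", text).group(0)

# Lazy: .*? stops at the first "]" after the first "[".
lazy = re.search(r"\[.*?\]", text).group(0)

# Negated class: same result without a lazy quantifier.
negated = re.search(r"\[[^\]]*\]", text).group(0)

print(lazy)
```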

regex in parenthesis at the beginning

I have a regex trying to divide questions by speciality. Say I have the following regex:
(?P<speciality>[0-9x]+)
It works fine for this question (correct match: 7)
(7)Which of the following is LEAST to be considered as a risk factor for esophageal cancer?;
And for this (correct match: 8 and 13)
(8,13)30 year old woman with amenorrhea, low serum estrogen and high serum LH/FSH, the most likely diagnosis is:
But not for this one (incorrect match: 20).
First trimester spontaneous abortion (before 20 wk) is most commonly due to:
I only need the numbers in parentheses at the beginning of the question, all other parentheses should be ignored. Is this possible with a regex alone (lookahead?).
If your regex flavor supports \G continuous matching and \K reset beginning of match, try:
(?:^\(|\G,)\K[\dx]+
^\( matches an opening parenthesis at the start, | means OR, and \G, matches a comma right where the previous match ended. \K then resets the start of the match, and [\dx]+ matches one or more digits or x (\d is a shorthand for [0-9]). The matches will be in $0.
Test at regex101.com; Regex FAQ
PHP example
$str = "(1x,2,3x) abc (1,2x,3) d";
preg_match_all('~(?:^\(|\G,)\K[\dx]+~', $str, $out);
print_r($out[0]);
Array
(
    [0] => 1x
    [1] => 2
    [2] => 3x
)
Test at eval.in
Perhaps something like this will work (you don't mention the regex flavor that you're using, though I am guessing it is PCRE by the use of the named group - and yes, it does use positive lookahead):
/^\((?P<speciality>(?:[0-9x]+,?)+)(?=\))/mg
The caret ^ combined with the multiline modifier m (which causes the anchors ^ and $ to match the beginning and end of lines, respectively, instead of the beginning and end of the string) will ensure that what is matched is at the start of the paragraph. The specialities will be captured in the speciality named capture group; the only caveat is that if more than one speciality is given (as in your example starting (8,13)), the capture will be a comma-delimited list (to use the same example, the capture will be 8,13 in that case).
Please see Regex Demo here.
(?P<speciality>[0-9x]+) matches any nonempty sequence of digits (or x) anywhere in the input. The parentheses just delimit the capturing group and are not part of the match.
To match a number (or several separated by commas) between parentheses at the beginning of the line, you could use something like this:
^\((\d+)(,(\d+))*\)
EDIT
It seems that repeated capturing groups, as in (,(\d+))*, only return the last match. So to get the values, it is necessary to capture the complete list of numbers and parse it afterwards:
^\((?P<specialities>(\d+)(,(\d+))*)\)
This captures one or more numbers separated by commas, between parentheses.
The start-of-line anchor ensures the match is at the beginning of the line.
Demo
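The "capture the whole list, then parse it" approach can be sketched in Python as below. This is an illustration, not the original poster's code: the helper name leading_codes is hypothetical, and [0-9x] is used instead of \d so that codes like 1x are accepted. Python's standard re module supports neither \G nor \K, so the continuous-matching answer is not reproduced here.

```python
import re

# Capture the whole comma-separated list at the start, then split it.
LEADING = re.compile(r"^\((?P<specialities>[0-9x]+(?:,[0-9x]+)*)\)")

def leading_codes(question):
    """Return the codes in a leading parenthesised group, or [] if absent."""
    m = LEADING.match(question)
    return m.group("specialities").split(",") if m else []

questions = [
    "(7)Which of the following is LEAST to be considered as a risk factor for esophageal cancer?;",
    "(8,13)30 year old woman with amenorrhea, low serum estrogen and high serum LH/FSH, the most likely diagnosis is:",
    "First trimester spontaneous abortion (before 20 wk) is most commonly due to:",
]
results = [leading_codes(q) for q in questions]
print(results)
```

Parentheses later in the question, such as (before 20 wk), are ignored because the pattern is anchored at the start.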

perl style regex to match nth item in a list

Trying to match the third item in this list:
/text word1, word2, some_other_word, word_4
I tried using this perl style regex to no avail:
([^, ]*, ){$m}([^, ]*),
I want to match ONLY the third word, nothing before or after, and no commas or whitespace. I need it to be a regex, this is not in a program but UltraEdit for a word file.
What can I use to match some_other_word (Or anything third in the list.)
Based on some input by the community members I made the following change to make the logic of the regex pattern clearer.
/^(?:(?:.(?<!,))+,){2}\s*(\w+).*/x
Explanation
/^ # 1.- Match start of line.
(?:(?:.(?<!,))+ # 2.- Match but don't capture a sequence of characters not containing a comma ...
,) # 3.- followed by a comma
{2} # 4.- (exactly two times)
\s* # 5.- Match any optional space
(\w+) # 6.- Match and capture a sequence of the characters represented by \w, at least one character long.
.* # 7.- Match anything after that if necessary.
/x
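For readers who want to try the commented pattern outside an editor, here is a sketch using Python's re.VERBOSE flag, which plays the role of /x. Python is an assumption here; the question itself is about UltraEdit.

```python
import re

THIRD_ITEM = re.compile(
    r"""
    ^                    # start of the line
    (?:
        (?:.(?<!,))+     # a run of characters, none of which is a comma
        ,                # followed by a comma
    ){2}                 # exactly two such chunks
    \s*                  # optional whitespace
    (\w+)                # capture the third item
    .*                   # anything after it
    """,
    re.VERBOSE,
)

line = "/text word1, word2, some_other_word, word_4"
third = THIRD_ITEM.match(line).group(1)
print(third)
```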
This is the one suggested previously.
/(?:\w+,?\s*){3}(\w+)/
Try group 1 of this regex:
^(?:.*?,){2}\s*(.*?)\s*(,|$)
See a live demo using your sample, plus an edge case, input showing capture in group 1.
A single pattern can't return just the one match you want, because your string contains more than one occurrence of the same pattern and regular expressions have no option for selecting the nth match. So take whatever you need from the returned array of matches.
,\s?([^,]+)
See it in action: group 1 of the 2nd match is what you need.
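Both approaches from the last two answers can be compared in a short Python sketch. Python is used only for illustration; the patterns are taken from the answers unchanged.

```python
import re

line = "/text word1, word2, some_other_word, word_4"

# Approach 1: skip two comma-terminated chunks, then read group 1.
third = re.search(r"^(?:.*?,){2}\s*(.*?)\s*(,|$)", line).group(1)

# Approach 2: collect every item that follows a comma and index the list.
# The first item has no leading comma, so the third item sits at index 1.
after_commas = re.findall(r",\s?([^,]+)", line)
third_again = after_commas[1]

print(third, after_commas)
```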