How to replace within a capture group - regex

I am modifying an existing HTML doc. I'm doing things like adding a table of contents etc.
I have a heading with this ID: id="transcending intellectual limitations" (for real!)
I want to be able to find the whole ID, and then replace the spaces with hyphens.
It would be simple if I had just the IDs but I don't want to remove all the spaces in the whole document.
I'm reasonably new to regex, I'm using Sublime's find and replace to do this.

You can use
(?:\bid="|(?!^)\G)[^\s"]*\K\s+ 
And replace with anything you need to replace spaces with.
The (?:\bid="|(?!^)\G) pattern sets the initial boundary: either id=" or the end of the last successful match. This pattern presents an alternation list with two alternatives. \b matches a word boundary so that id=" is matched as a whole word. The \G operator matches at the start of the string and after ech successful match. To exclude the start position, a negative (?!^) lookahead is added (not followed with a string start position).
See more about \G in "Where You Left Off: The \G Assertion".
The [^\s"]* matches zero or more characters other than whitespace and a quote.
The \K operator makes the regex engine omit all the text matched so far from the match buffer.
The \s+ finally matches one or more whitespaces that will be replaced.
Regex101 Demo

Here's a 2 pass solution using Ruby as the regex parser:
#!/usr/bin/env ruby
line = 'yadayadayadaid="transcending intellectual limitations"yadayadayada'
line =~ /id="(.*)"/
part = $1.gsub( /\s+/, '-' )
print part
yields:
transcending-intellectual-limitations
Note that this will replace all whitespace between the words on the 2nd pass.

Related

Notepad++ regex extract two options

I've a list below:
7080508136242611718:7080508978035787525:7549dda86ba9af19:31050:install_id=7080508978035787525; store-country-code=us; store-idc=useast5; ttreq=1$fd2f36282a10633c5638a02cc54c19ff13f60755; passport_csrf_token=13bf74c4e5fe04307f0a99de9aed53f9; passport_csrf_token_default=13bf74c4e5fe04307f0a99de9aed53f9; odin_tt=11ed1b48fba2d7a9fe3d86929b3d52cebbad0ca7f7dbd127e220cfb3be279621ba04487517b536050a6ded9fbe50e300cd11615e2e9551523478e5484896a9dda800e55e428842872fcf862e8c57d439:1648559503:351451268482810:3f:49:8c:b7:8c:cb:c5379d41-6cf3-4152-9d48-7aa45f7f611c:79375640-197c-4aaa-86cf-4ef8e7238be2:1:AgICAw0AFockF-RPsNA-7qeIMtk5-CKdkW2eP4TZYMDY7A
7080507996291827206:7080508977079666438:6742591cc0d20580:31050:install_id=7080508977079666438; store-country-code=us; store-idc=useast5; ttreq=1$a119611bfe79541b0b4c029fe910b6507123eec2; passport_csrf_token=fb42bbd472462c17f45acb531deb057a; passport_csrf_token_default=fb42bbd472462c17f45acb531deb057a; odin_tt=6c3b06ff01fd67f42e3dccb60a1e69ca67cb8654f49662017acc209f7176517bcd13a374311f7a1b3538e6407fb237267abf43578d3180d8c834e7df886fa4377a9b950dbb6ff146e3fabf37158dcfa8:1648559508:351451233766930:dd:9e:82:59:5f:7f:596da881-89e8-4f60-b644-5fef23f0a422:f04adc87-56de-4191-a25f-843bec1d5818:1:AgICAw0AFockF-RPsNA-7qeIMtk5-CKdsYPWv4TZYMDY7A
7080509102451394054:7080509820378072837:e36dc9aceecfc1cc:31050:install_id=7080509820378072837; store-country-code=us; store-idc=useast5; ttreq=1$d94700921d5ee2b21992910a2a4e84dd0ade1ec8; passport_csrf_token=2d4f4eca772dbfcbb37548ff02da3166; passport_csrf_token_default=2d4f4eca772dbfcbb37548ff02da3166; odin_tt=53d6999ebe29c0d5144a9669331ce3307a290891370914dabadbfa0520114e6e76b9103c9a6db5476e139251ee478f3a305577a89e3fa07288b7aca00774d3fccbd03566687dbcfdce31700065295939:1648559700:351451299637010:71:de:41:2b:ad:b4:1eba1ae9-3216-40e1-be7f-00303e524c27:2713cbd3-7a4f-493e-b76f-ac6d56ab8045:5:AgMNAgIAhyQWF-RPsNA-7qeIMtk5-CKcsBcWP4TZYMDY7w
7080509086894851590:7080509909225604870:98be64e38551984d:31050:install_id=7080509909225604870; store-country-code=us; store-idc=useast5; ttreq=1$05929375d8605739d8ebdbb5ce15eb406da5c467; passport_csrf_token=c95c71ad206a1d371e5b67505ae25be8; passport_csrf_token_default=c95c71ad206a1d371e5b67505ae25be8; odin_tt=6ddaa02f6133e61a4c591ef2a872f0ec2339d8b6a3fc480575fe279b13ded615e1fa7de979e18565f3ac8b8229a19a98bdf79aa1804071dcc025e1a4cd5314522cf40a62ca961770baea1d5d653d6d64:1648559720:351451292934660:9d:cf:c3:92:f6:f5:787dfb42-f4bf-43fa-9c64-ded19a1b1660:366c3024-217d-4f85-90dd-d95a0fd3e296:4:AgICAw0AFockF-RPsNA-7qeIMtk5-CKcs7bUP4TZYMDY7w
7080509183397299718:7080509974838085382:f39db5d314071713:31050:install_id=7080509974838085382; store-country-code=us; store-idc=useast5; ttreq=1$561ee2083cb13f0849a9f09e7f89edfe08c7ce6c; passport_csrf_token=721a8fee6f4f97c16ed1923ad3bbc72d; passport_csrf_token_default=721a8fee6f4f97c16ed1923ad3bbc72d;
I'd like to extract first two options aka below:
7080508136242611718:7080508978035787525
7080507996291827206:7080508977079666438
7080509102451394054:7080509820378072837
7080509086894851590:7080509909225604870
7080509183397299718:7080509974838085382
I've tried: *.: but its remove the reset of text. and keeps only first.
I've tried ^.*[0-9]+.*$ to get the second one. but no success.
Hopefully somebody can help me with accurate regex.
Thank you in advance.
This pattern *.: by itself is not a valid regex, and this pattern ^.*[0-9]+.*$ matches the whole string with at least a single digit.
If you want to match the digits and : you could make use of \K to forget what is matched so far and then match the rest of the line.
In the replacement use an empty string.
^\d+:\d+\K.*
^ Start of string
\d+:\d+ Match 1+ digits with : in between
\K.* Clear the current match, and match the rest of the line
Regex demo
^[^:]*:[^:]*\K.*
When matching things with delimiters I will use a negated character set to match the contents. In this case, the delimiter is a colon, so I want to match everything that isn't a colon until there's a colon. Then I want to match everything that isn't a colon. This will match everything up until the second colon. Because I want to keep what I just matched, I am using .* after \K, which resets the match at that point and matches everything else.
That pattern can be replaced with nothing, and the result is the first two columns of each line left.
You can use
Find: ^(\d+:\d+).*
Replace: $1
See this regex demo online.
The ^(\d+:\d+).* regex matches and captures into Group 1 one or more digits + : + one or more digits (with (\d+:\d+)) at the beginning of a line (^) and then matches the rest of the line (with .*).
The $1 replacement replaces the match with the Group 1 value.
See the demo and settings screenshot:
As an alternative, if there are chars other than digits you can also use
^([^:\v]+:[^:\v]+).*
where [^:\v]+ matches one or more chars other than a comma and any vertical whitespace.

Regex - Discard the entire string if any part of the string doesn't match the pattern

I have a comma separated string which I want to validate using a regex. What I have written is gives me a match if there a part wrong later in the string. I want to discard it completely if any part is wrong.
My regex : ^(?:[\w\.]+,{1}(?:STR|INT|REAL){1},{1}(\s*|$))+
Positive Case : Component,STR,YoungGenUse,STR,YoungGenMax,STR,OldGenUse,INT,OldGenMax,INT,PermGenUse,INT,PermGenMax,INT,MajCollCnt,INT,MinCollDur,REAL,MinCollCnt,INT,
Negative Case :
Component,STR,YoungGenUse,STR,YoungGenMax,TEST,OldGenUse,INT,OldGenMax,INT,PermGenUse,INT,PermGenMax,INT,MajCollCnt,INT,MinCollDur,REAL,MinCollCnt,INT,
For the second case, my regex gives a match for the bold portion eventhough, later there is an incorrect part (TEST). How can I modify my regex to discard the entire string?
The pattern that you tried would not match TEST in YoungGenMax,TEST because the alternatives STR|INT|REAL do not match it.
It would show until the last successful match in the repetition which would be Component,STR,YoungGenUse,STR,
You have to add the anchor at the end, outside of the repetition of the group, to indicate that the whole pattern should be followed by asserting the end of the string.
There are no spaces or dots in your string, so you might leave out \s* and use \w+ without the dot in the character class. Note that \s could also possibly match a newline.
^(?:\w+,(?:STR|INT|REAL),)+$
Regex demo
If you want to keep matching optional whitespace chars and the dot:
^(?:[\w.]+,(?:STR|INT|REAL),\s*)+$
Regex demo
Note that by repeating the group with the comma at the end, the string should always end with a comma. You can omit {1} from the pattern as it is superfluous.
your regex must keep matching until end of the string, so you must use $ to indicate end of the line:
^(?:[\w.]+,{1}(?:STR|INT|REAL){1},{1}(\s*|$))+$
Regex Demo

Regex - Finding fullstops (periods) that aren't followed by a space

I'm trying to create a simple Grammar correction tool.
I want to create a regular expression that finds fullstops (" . ") that are not followed by a space so I can replace that with a fullstop and space.
For e.g. This is a sentence.This is another sentence.
Only the first fullstop in the above example should be matched in the expression.
I've tried /\.[^\s]/g but it returns an additional character after the matched fullstop. I would like to match only the fullstop.
How can I do this?
The negated character class [^\s] in the pattern expects a match (any character except a whitespace character), that is why you have the additional character.
If you want to match the dot only, you could use a negative lookahead to assert what is on the right is not a whitspace char or the end of the string:
\.(?!\s|$)
Regex demo
To not match a dot that is not followed by a whitespace char excluding a newline:
\.(?![^\S\r\n])
Regex demo
You can look for all dots using:
(\.)
This will match all dots on below examples:
This is a sentence.This is another sentence.
i am looking. for dots. . ...
You can add a |$ to seek for end of line, and with a little tweak, you get a regex that match all dots not followed by whitespace nor being on the end of a line:
(\.(?!\ |$))
Note that there's a whitespace as literal here. The "must-work-everywhere" example will be like:
(\.(?![[:space:]]|$))
If not, search on the regex reference on the language you use.

Perl: Matching string not containing PATTERN

While using Perl regex to chop a string down into usable pieces I had the need to match everything except a certain pattern. I solved it after I found this hint on Perl Monks:
/^(?:(?!PATTERN).)*$/; # Matches strings not containing PATTERN
Although I solved my initial problem, I have little clue about how it actually works. I checked perlre, but it is a bit too formal to grasp.
Regular expression to match a line that doesn't contain a word? helps a lot in understanding, but why is the . in my example and the ?: and how do the outer parentheses work?
Can someone break up the regex and explain in simple words how it works?
Building it up piece by piece (and throughout assuming no newlines in the string or PATTERN):
This matches any string:
/^.*$/
But we don't want . to match a character that starts PATTERN, so replace
.
with
(?!PATTERN).
This uses a negative look-ahead that tests a given pattern without actually consuming any of the string and only succeeds if the pattern does not match at the given point in the string. So it's like saying:
if PATTERN doesn't match at this point,
match the next character
This needs to be done for every character in the string, so * is used to match zero or more times, from the beginning to the end of the string.
To make the * apply to the combination of the negative look-ahead and ., not just the ., it needs to be surrounded by parentheses, and since there's no reason to capture, they should be non-capturing parentheses (?: ):
(?:(?!PATTERN).)*
And putting back the anchors to make sure we test at every position in the string:
/^(?:(?!PATTERN).)*$/
Note that this solution is particularly useful as part of a larger match; e.g. to match any string with foo and later baz but no bar in between:
/foo(?:(?!bar).)*baz/
If there aren't such considerations, you can simply do:
/^(?!.*PATTERN)/
to check that PATTERN does not match anywhere in the string.
About newlines: there are two problems with your regex and newlines. First, . doesn't match newlines, so "foo\nbar" =~ /^(?:(?!baz).)*$/ doesn't match, even though the string does not contain baz. You need to add the /s flag to make . match any character; "foo\nbar" =~ /^(?:(?!baz).)*$/s correctly matches. Second, $ doesn't match just at the end of the string, it also can match before a newline at the end of the string. So "foo\n" =~ /^(?:(?!\s).)*$/s does match, even though the string contains whitespace and you are attempting to only match strings with no whitespace; \z always only matches at the end, so "foo\n" =~ /^(?:(?!\s).)*\z/s correctly fails to match the string that does in fact contain a \s. So the correct general purpose regex is:
/^(?:(?!PATTERN).)*\z/s
jippie, first, here's a tip. If you see a regex that is not immediately obvious to you, you can dump it in a tool that explains every token.
For instance, here is the RegexBuddy output:
"
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed)
(?: # Match the regular expression below
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
PATTERN # Match the character string “PATTERN” literally (case insensitive)
)
. # Match any single character that is NOT a line break character (line feed)
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\$ # Assert position at the end of a line (at the end of the string or before a line break character) (line feed)
# Perl 5.18 allows a zero-length match at the position where the previous match ends.
# Perl 5.18 attempts the next match at the same position as the previous match if it was zero-length and may find a non-zero-length match at the same position.
"
Some people also use regex101.
A Human Explanation
Now if I had to explain the regex, I would not be so linear. I would start by saying that it is fully anchored by the ^ and the $, implying that the only possible match is the whole string, not a substring of that string.
Then we come to the meat: a non-capturing group introduced by (?: and repeated any number of times by the *
What does this group do? It contains
a negative lookahead (you may want to read up on lookarounds here) asserting that at this exact position in the string, we cannot match the word PATTERN,
then a dot to match the next character
This means that at each position in the string, we assert that we cannot match PATTERN, then we match the next character.
If PATTERN can be matched anywhere, the negative lookahead fails, and so does the entire regex.

perl style regex to match nth item in a list

Trying to match the third item in this list:
/text word1, word2, some_other_word, word_4
I tried using this perl style regex to no avail:
([^, ]*, ){$m}([^, ]*),
I want to match ONLY the third word, nothing before or after, and no commas or whitespace. I need it to be a regex, this is not in a program but UltraEdit for a word file.
What can I use to match some_other_word (Or anything third in the list.)
Based on some input by the community members I made the following change to make the logic of the regex pattern clearer.
/^(?:(?:.(?<!,))+,){2}\s*(\w+).*/x
Explanation
/^ # 1.- Match start of line.
(?:(?:.(?<!,))+ # 2.- Match but don't capture a secuence of character not containing a comma ...
,) # 3.- followed by a comma
{2} # 4.- (exactly two times)
\s* # 5.- Match any optional space
(\w+) # 6.- Match and capture a secuence of the characters represented by \w a leat one character long.
.* # 7.- Match anything after that if neccesary.
/x
This is the one suggested previously.
/(?:\w+,?\s*){3}(\w+)/
Try group 1 of this regex:
^(?:.*?,){2}\s*(.*?)\s*(,|$)
See a live demo using your sample, plus an edge case, input showing capture in group 1.
It can't only return one match at a time because your string has more than one occurrence of the same pattern and Regular Expression doesn't have a selective return option! So you can do whatever you want from the returned array.
,\s?([^,]+)
See it in action, 2nd matched group is what you need.