Trying to match the third item in this list:
/text word1, word2, some_other_word, word_4
I tried using this perl style regex to no avail:
([^, ]*, ){$m}([^, ]*),
I want to match ONLY the third word, nothing before or after, and no commas or whitespace. I need it to be a regex, this is not in a program but UltraEdit for a word file.
What can I use to match some_other_word (or anything third in the list)?
Based on some input by the community members I made the following change to make the logic of the regex pattern clearer.
/^(?:(?:.(?<!,))+,){2}\s*(\w+).*/x
Explanation
/^ # 1.- Match start of line.
(?:(?:.(?<!,))+ # 2.- Match but don't capture a sequence of characters not containing a comma ...
,) # 3.- followed by a comma
{2} # 4.- (exactly two times)
\s* # 5.- Match any optional space
(\w+) # 6.- Match and capture a sequence of the characters represented by \w, at least one character long.
.* # 7.- Match anything after that if necessary.
/x
This is the one suggested previously.
/(?:\w+,?\s*){3}(\w+)/
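As a rough check outside the editor, the commented pattern above can be tried with Python's re module. Python is an assumption here (the question targets UltraEdit); re.VERBOSE stands in for the /x flag, and the variable names are only illustrative.

```python
import re

# Same pattern as above; re.VERBOSE allows the inline comments.
pattern = re.compile(
    r"""
    ^(?:(?:.(?<!,))+,){2}   # two runs of non-comma characters, each followed by a comma
    \s*                     # optional whitespace
    (\w+)                   # the third item (capture group 1)
    .*                      # rest of the line
    """,
    re.VERBOSE,
)

line = "/text word1, word2, some_other_word, word_4"
m = pattern.match(line)
third = m.group(1) if m else None
print(third)  # some_other_word
```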
Try group 1 of this regex:
^(?:.*?,){2}\s*(.*?)\s*(,|$)
See a live demo using your sample, plus an edge case, input showing capture in group 1.
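A small sketch of how group 1 behaves, written in Python as an assumption (the original question is about an editor). The helper name third_item and the extra inputs are made up for illustration.

```python
import re

THIRD_ITEM = re.compile(r"^(?:.*?,){2}\s*(.*?)\s*(,|$)")

def third_item(line):
    """Return the third comma-separated item, or None if there are fewer than three."""
    m = THIRD_ITEM.search(line)
    return m.group(1) if m else None

print(third_item("/text word1, word2, some_other_word, word_4"))  # some_other_word
print(third_item("a, b, last item"))  # last item (third item is also the final one)
print(third_item("only, two"))        # None
```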
It can't return only one match, because your string contains more than one occurrence of the same pattern and regular expressions have no selective-return option. You can pick whatever you need from the returned array.
,\s?([^,]+)
See it in action; group 1 of the 2nd match is what you need.
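To make "pick from the returned array" concrete, here is a minimal Python sketch (Python is an assumption; any language with a find-all function would do). Note that the first item is never captured, because it is not preceded by a comma.

```python
import re

line = "/text word1, word2, some_other_word, word_4"

# Every comma-prefixed item is returned, so the third list item is at index 1.
items = re.findall(r",\s?([^,]+)", line)
print(items)  # ['word2', 'some_other_word', 'word_4']

third = items[1] if len(items) > 1 else None
print(third)  # some_other_word
```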
Related
How can I use regex in notepad++ to make a query like this:
I have a list with subdomains containing three words such as
web1.com
test.web2.com
www.test.web3.com
I want to filter so that only three words remain and something like this comes out:
web1.com
test.web2.com
test.web3.com
I was able to delete so that only the domain remains, but this is not what I want
^(?:.+\.)?([^.\r\n]+\.[^.\r\n]+)$
An idea to match until the endpart starts and capture that.
^.*?\.([\w-]+\.[\w-]+\.[\w-]+)$
Replace with $1 (what was captured by the first group)
.*? matches lazily any amount of any characters (besides newline)
[\w-]+ char-class matches one or more word characters or hyphens
See this demo at regex101 (more explanation on the right side)
In Notepad++ be sure to have unchecked: [ ] dot matches newline
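The same find-and-replace can be sketched in Python for a quick sanity check. This is an assumption for illustration, not the Notepad++ workflow itself: re.MULTILINE makes ^ and $ work per line, and \1 plays the role of $1.

```python
import re

hosts = "web1.com\ntest.web2.com\nwww.test.web3.com"

# Keep only the last three dot-separated labels; shorter names are left untouched.
trimmed = re.sub(
    r"^.*?\.([\w-]+\.[\w-]+\.[\w-]+)$",
    r"\1",
    hosts,
    flags=re.MULTILINE,
)
print(trimmed)
```

Lines with three or fewer labels do not match at all, so they pass through unchanged.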
Another take at it using a positive lookahead to assert the 3 "words" to the right, allowing for non whitespace chars excluding a dot using [^\s.]
In the replacement use an empty string.
^\S+?\.(?=[^\s.]+\.[^\s.]+\.[^\s.]+$)
See a regex demo.
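A hedged Python sketch of the lookahead variant, where only the unwanted prefix is matched and replaced with an empty string. The fourth host name is a made-up extra case.

```python
import re

PREFIX = re.compile(r"^\S+?\.(?=[^\s.]+\.[^\s.]+\.[^\s.]+$)", re.MULTILINE)

hosts = "web1.com\ntest.web2.com\nwww.test.web3.com\na.b.test.web4.com"

# The lookahead asserts exactly three labels remain, without consuming them.
cleaned = PREFIX.sub("", hosts)
print(cleaned)
```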
I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch the ~match-this~ occurrences, but when I tried its negation, (?!(~match-this)), it matches only empty strings.
When I tried the pattern [^~]match-this[^~], it catches only one match (the 2nd match from above). And when I added an asterisk wildcard to either negated tilde, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk wildcard on both, it catches every match-this, including those enclosed by tildes (~).
Is it possible to achieve this with only one regex test, or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char, as long as match-this does not start at that position
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or #, as long as match-this does not start at that position
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print(res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
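The "discard what is not captured" idea can be sketched in Python as follows. The choice of Python and the variable names are assumptions; the string is the one from the question.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(r"~match-this~|(\bmatch-this\b)")

# Keep only the matches where group 1 took part; ~match-this~ matches are skipped.
kept = [m.start(1) for m in pattern.finditer(s) if m.group(1) is not None]
print(kept)  # [0, 23, 35, 46, 68]
```

The list holds the start offsets of the five wanted occurrences; the two tilde-enclosed ones at offsets 11 and 57 are consumed by the first alternative and dropped.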
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead end again anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
There is a difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.
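A minimal Python sketch of that last suggestion, assuming Python's re module (which supports these fixed-width lookbehinds): match with the lookaround pattern, then read the neighbouring characters from the input using each match's start and end offsets.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)

results = []
for m in pattern.finditer(s):
    # Look one character to the left and right of the match, if they exist.
    before = s[m.start() - 1] if m.start() > 0 else ""
    after = s[m.end()] if m.end() < len(s) else ""
    results.append((m.start(), before, after))

print(results)
# [(0, '', '~'), (23, ' ', ' '), (35, '~', '#'), (46, '#', '~'), (68, '~', '')]
```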
I've a list below:
7080508136242611718:7080508978035787525:7549dda86ba9af19:31050:install_id=7080508978035787525; store-country-code=us; store-idc=useast5; ttreq=1$fd2f36282a10633c5638a02cc54c19ff13f60755; passport_csrf_token=13bf74c4e5fe04307f0a99de9aed53f9; passport_csrf_token_default=13bf74c4e5fe04307f0a99de9aed53f9; odin_tt=11ed1b48fba2d7a9fe3d86929b3d52cebbad0ca7f7dbd127e220cfb3be279621ba04487517b536050a6ded9fbe50e300cd11615e2e9551523478e5484896a9dda800e55e428842872fcf862e8c57d439:1648559503:351451268482810:3f:49:8c:b7:8c:cb:c5379d41-6cf3-4152-9d48-7aa45f7f611c:79375640-197c-4aaa-86cf-4ef8e7238be2:1:AgICAw0AFockF-RPsNA-7qeIMtk5-CKdkW2eP4TZYMDY7A
7080507996291827206:7080508977079666438:6742591cc0d20580:31050:install_id=7080508977079666438; store-country-code=us; store-idc=useast5; ttreq=1$a119611bfe79541b0b4c029fe910b6507123eec2; passport_csrf_token=fb42bbd472462c17f45acb531deb057a; passport_csrf_token_default=fb42bbd472462c17f45acb531deb057a; odin_tt=6c3b06ff01fd67f42e3dccb60a1e69ca67cb8654f49662017acc209f7176517bcd13a374311f7a1b3538e6407fb237267abf43578d3180d8c834e7df886fa4377a9b950dbb6ff146e3fabf37158dcfa8:1648559508:351451233766930:dd:9e:82:59:5f:7f:596da881-89e8-4f60-b644-5fef23f0a422:f04adc87-56de-4191-a25f-843bec1d5818:1:AgICAw0AFockF-RPsNA-7qeIMtk5-CKdsYPWv4TZYMDY7A
7080509102451394054:7080509820378072837:e36dc9aceecfc1cc:31050:install_id=7080509820378072837; store-country-code=us; store-idc=useast5; ttreq=1$d94700921d5ee2b21992910a2a4e84dd0ade1ec8; passport_csrf_token=2d4f4eca772dbfcbb37548ff02da3166; passport_csrf_token_default=2d4f4eca772dbfcbb37548ff02da3166; odin_tt=53d6999ebe29c0d5144a9669331ce3307a290891370914dabadbfa0520114e6e76b9103c9a6db5476e139251ee478f3a305577a89e3fa07288b7aca00774d3fccbd03566687dbcfdce31700065295939:1648559700:351451299637010:71:de:41:2b:ad:b4:1eba1ae9-3216-40e1-be7f-00303e524c27:2713cbd3-7a4f-493e-b76f-ac6d56ab8045:5:AgMNAgIAhyQWF-RPsNA-7qeIMtk5-CKcsBcWP4TZYMDY7w
7080509086894851590:7080509909225604870:98be64e38551984d:31050:install_id=7080509909225604870; store-country-code=us; store-idc=useast5; ttreq=1$05929375d8605739d8ebdbb5ce15eb406da5c467; passport_csrf_token=c95c71ad206a1d371e5b67505ae25be8; passport_csrf_token_default=c95c71ad206a1d371e5b67505ae25be8; odin_tt=6ddaa02f6133e61a4c591ef2a872f0ec2339d8b6a3fc480575fe279b13ded615e1fa7de979e18565f3ac8b8229a19a98bdf79aa1804071dcc025e1a4cd5314522cf40a62ca961770baea1d5d653d6d64:1648559720:351451292934660:9d:cf:c3:92:f6:f5:787dfb42-f4bf-43fa-9c64-ded19a1b1660:366c3024-217d-4f85-90dd-d95a0fd3e296:4:AgICAw0AFockF-RPsNA-7qeIMtk5-CKcs7bUP4TZYMDY7w
7080509183397299718:7080509974838085382:f39db5d314071713:31050:install_id=7080509974838085382; store-country-code=us; store-idc=useast5; ttreq=1$561ee2083cb13f0849a9f09e7f89edfe08c7ce6c; passport_csrf_token=721a8fee6f4f97c16ed1923ad3bbc72d; passport_csrf_token_default=721a8fee6f4f97c16ed1923ad3bbc72d;
I'd like to extract the first two fields, as shown below:
7080508136242611718:7080508978035787525
7080507996291827206:7080508977079666438
7080509102451394054:7080509820378072837
7080509086894851590:7080509909225604870
7080509183397299718:7080509974838085382
I've tried *.: but it removes the rest of the text and keeps only the first field.
I've tried ^.*[0-9]+.*$ to get the second one, but with no success.
Hopefully somebody can help me with accurate regex.
Thank you in advance.
This pattern *.: by itself is not a valid regex, and this pattern ^.*[0-9]+.*$ matches the whole string with at least a single digit.
If you want to match the digits and : you could make use of \K to forget what is matched so far and then match the rest of the line.
In the replacement use an empty string.
^\d+:\d+\K.*
^ Start of string
\d+:\d+ Match 1+ digits with : in between
\K.* Clear the current match, and match the rest of the line
Regex demo
^[^:]*:[^:]*\K.*
When matching things with delimiters I will use a negated character set to match the contents. In this case, the delimiter is a colon, so I want to match everything that isn't a colon until there's a colon. Then I want to match everything that isn't a colon. This will match everything up until the second colon. Because I want to keep what I just matched, I am using .* after \K, which resets the match at that point and matches everything else.
That pattern can be replaced with nothing, and the result is the first two columns of each line left.
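Outside an editor, the same negated-class idea can be sketched in Python. Python's built-in re module does not support \K, so this sketch keeps the matched prefix instead of deleting the remainder. The sample lines are shortened versions of the ones in the question.

```python
import re

FIRST_TWO = re.compile(r"[^:]*:[^:]*")

# Shortened sample lines (the real ones continue after install_id=...).
lines = [
    "7080508136242611718:7080508978035787525:7549dda86ba9af19:31050:install_id=7080508978035787525",
    "7080507996291827206:7080508977079666438:6742591cc0d20580:31050:install_id=7080508977079666438",
]

kept = []
for line in lines:
    m = FIRST_TWO.match(line)  # anchored at the start of the line
    if m:
        kept.append(m.group(0))

print(kept)
```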
You can use
Find: ^(\d+:\d+).*
Replace: $1
See this regex demo online.
The ^(\d+:\d+).* regex matches and captures into Group 1 one or more digits + : + one or more digits (with (\d+:\d+)) at the beginning of a line (^) and then matches the rest of the line (with .*).
The $1 replacement replaces the match with the Group 1 value.
As an alternative, if there are chars other than digits you can also use
^([^:\v]+:[^:\v]+).*
where [^:\v]+ matches one or more chars other than a colon and any vertical whitespace.
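A rough Python equivalent of this find/replace, as an assumption for testing the idea in a script. In Python's re, \v means only the vertical tab character, so \n is used here to keep the negated class from crossing line breaks. The input text is invented.

```python
import re

text = "abc-1:def_2:ghi:jkl\n12:34:56"

# Capture the first two colon-separated fields and drop the rest of each line.
result = re.sub(r"^([^:\n]+:[^:\n]+).*", r"\1", text, flags=re.MULTILINE)
print(result)
```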
I am modifying an existing HTML doc. I'm doing things like adding a table of contents etc.
I have a heading with this ID: id="transcending intellectual limitations" (for real!)
I want to be able to find the whole ID, and then replace the spaces with hyphens.
It would be simple if I had just the IDs but I don't want to remove all the spaces in the whole document.
I'm reasonably new to regex, I'm using Sublime's find and replace to do this.
You can use
(?:\bid="|(?!^)\G)[^\s"]*\K\s+
And replace with anything you need to replace spaces with.
The (?:\bid="|(?!^)\G) pattern sets the initial boundary: either id=" or the end of the last successful match. This pattern presents an alternation list with two alternatives. \b matches a word boundary so that id=" is matched as a whole word. The \G operator matches at the start of the string and after each successful match. To exclude the start position, a negative (?!^) lookahead is added (not followed by a string start position).
See more about \G in "Where You Left Off: The \G Assertion".
The [^\s"]* matches zero or more characters other than whitespace and a quote.
The \K operator makes the regex engine omit all the text matched so far from the match buffer.
The \s+ finally matches one or more whitespaces that will be replaced.
Regex101 Demo
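For comparison, Python's built-in re module supports neither \G nor \K, so a script would need a different technique. The sketch below swaps in a replacement callback: it matches each whole id attribute and rewrites only the whitespace inside it. The function name and the sample HTML are hypothetical.

```python
import re

def hyphenate_id(match):
    """Rebuild the id attribute with whitespace runs replaced by hyphens."""
    return 'id="' + re.sub(r"\s+", "-", match.group(1)) + '"'

html = '<h2 id="transcending intellectual limitations">Some heading text</h2>'

# Only text inside id="..." is touched; spaces elsewhere stay as they are.
fixed = re.sub(r'\bid="([^"]*)"', hyphenate_id, html)
print(fixed)
```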
Here's a 2 pass solution using Ruby as the regex parser:
#!/usr/bin/env ruby
line = 'yadayadayadaid="transcending intellectual limitations"yadayadayada'
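line =~ /id="([^"]*)"/         # pass 1: capture the id value up to the closing quote
part = $1.gsub( /\s+/, '-' )   # pass 2: replace each run of whitespace with a hyphen
puts part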
yields:
transcending-intellectual-limitations
Note that this will replace all whitespace between the words on the 2nd pass.
My question is pretty similar to this question and the answer is almost fine. However, I need a regexp that matches not just from one character to another, but from the second occurrence of a character up to another character.
My purpose is to get password from uri, example:
http://mylogin:mypassword#mywebpage.com
So in fact I need the text from the second ":" up to "#".
You could give the following regex a go:
(?<=:)[^:]+?(?=#)
It matches any consecutive string not containing any : character, prefixed by a : and suffixed by a #.
Depending on your flavour of regex you might need something like:
:([^:]+?)#
This doesn't use lookarounds, so the match includes the : and #, but the password will be in the first capturing group.
The ? makes the quantifier lazy in case there are any further # characters in the actual URL string; otherwise it is optional. Please note that this will match any character between : and #, even newlines and so on.
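Both variants can be compared side by side in a short Python sketch (Python is an assumption; the question does not name a regex flavour).

```python
import re

uri = "http://mylogin:mypassword#mywebpage.com"

with_lookarounds = re.search(r"(?<=:)[^:]+?(?=#)", uri)
with_group = re.search(r":([^:]+?)#", uri)

print(with_lookarounds.group(0))  # mypassword
print(with_group.group(0))        # :mypassword#
print(with_group.group(1))        # mypassword
```

With lookarounds the whole match is already the password; without them, the delimiters are part of the match and the password has to be read from group 1.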
Here's an easy one that does not need look-aheads or look-behinds:
.*:.*:([^#]+)#
Explanation:
.*:.*: matches everything up to (and including) the second colon (:)
([^#]+) matches the longest possible series of non-# characters
# - matches the # character.
If you run this regex, the first capturing group (the expression between parentheses) will contain the password.
Here it is in action: http://regex101.com/r/fT6rI0
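A minimal sketch of using this pattern from code, assuming Python; the helper name extract_password is made up for illustration.

```python
import re

PASSWORD = re.compile(r".*:.*:([^#]+)#")

def extract_password(uri):
    """Return the text between the second ':' and the '#', or None if absent."""
    m = PASSWORD.search(uri)
    return m.group(1) if m else None

print(extract_password("http://mylogin:mypassword#mywebpage.com"))  # mypassword
print(extract_password("no-password-here"))                         # None
```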