I need to come up with a regular expression with flavor PCRE. It must be a regular expression <
I want to grab all lines of text that end in a newline character up until I encounter <zz> where zz is a digit enclosed in '<' and '>'.
e.g.
111a z
222 aset
333 //+
12 <zz> 11
abc
def
It would need to capture "111a z", "222 aset", "333 //+" in this case [and nothing else].
Right now I have ^(?!.*<zz>)[^\n]+(?=\n) but it's pretty far off from what it needs to be.
For clarification purposes, the regex I was using shows <zz>, but definitely looking for a digit enclosed in angle brackets.
Would really appreciate some help.
Edit
This is /really/ difficult for me, because at least one of the answers looks like it does the job. I'll try to mark one... Thank you, everyone.
You could repeat matching all lines including a Unicode newline sequence while the <\d+> pattern does not occur in the line.
\A(?:(?!.*<\d+>).*\R)+
Explanation
\A Start of string
(?: Non capture group
(?!.*<\d+>) Negative lookahead, assert that the pattern <\d+> does not occur
.*\R Match any char except a newline 0+ times, followed by a Unicode newline sequence
)+ Close the non capturing group, and repeat it 1+ times to match at least a single line
If the <\d+> has to be present, you could assert that with a positive lookahead at the end
\A(?:(?!.*<\d+>).*\R)+(?=.*<\d+>)
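For anyone who wants to try the first pattern outside a regex tester, here is a minimal Python sketch. Python's re module has no \R, so \r?\n stands in for it (an assumption that only LF and CRLF line endings occur). The sample text uses <5> in place of <zz>, and the name lines_before_tag is made up for illustration.

```python
import re

# \R is not supported by Python's re module; \r?\n is used as a stand-in.
LEADING_LINES = re.compile(r"\A(?:(?!.*<\d+>).*\r?\n)+")

SAMPLE = "111a z\n222 aset\n333 //+\n12 <5> 11\nabc\ndef\n"


def lines_before_tag(text):
    """Return the leading lines that precede the first line containing <digits>."""
    match = LEADING_LINES.match(text)
    return match.group(0).splitlines() if match else []


print(lines_before_tag(SAMPLE))  # ['111a z', '222 aset', '333 //+']
```

Because the pattern is anchored with \A, it yields nothing when the very first line already contains a tag.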
I'm not sure why you're using a negative lookahead, but I think you want a positive lookahead. This lets you only match the line if you see the <zz> in a lookahead. I would solve the problem using something like this:
^.*(?=.*(?:\n.*)*<\d+>)\n
^ Anchors match to beginning of line (like yours)
.* Matches all the characters it can. In this case it matches the whole line because it has to satisfy the \n at the end.
(?=...) Performs a positive lookahead (makes sure the string exists somewhere ahead)
.*(?:\n.*)* Allows any number of characters on any number of lines
<\d+> Only matches one or more digits enclosed in angle brackets
\n ensures that there is a newline at the end of the line.
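A small Python sketch of the positive-lookahead version, assuming re.MULTILINE so that ^ matches at each line start. The sample text (with <5> standing in for <zz>) and the function name are only illustrative.

```python
import re

# re.MULTILINE makes ^ match at the start of every line.
LINE_BEFORE_TAG = re.compile(r"^.*(?=.*(?:\n.*)*<\d+>)\n", re.MULTILINE)

SAMPLE = "111a z\n222 aset\n333 //+\n12 <5> 11\nabc\ndef\n"


def lines_followed_by_tag(text):
    """Return each full line that has a <digits> tag on some later line."""
    return [m.group(0).rstrip("\n") for m in LINE_BEFORE_TAG.finditer(text)]


print(lines_followed_by_tag(SAMPLE))  # ['111a z', '222 aset', '333 //+']
```

One thing to be aware of: a line that itself contains a tag is still matched if another tag appears further down, since the lookahead only checks that a tag exists later in the text.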
I have assumed that the text may have more than one line that contains one or more digits bracketed in '<' and '>', and that those lines are not themselves to be matched.
You can use the following expression to match the lines of interest.
^(?!.*<\d+>).*\r?\n(?=[\s\S]*?<\d+>)
The regex engine performs the following operations.
^ match beginning of line
(?! begin negative lookahead (prevent matching a line that contains e.g. '<12>')
.* match 0+ characters other than newlines
<\d+> match '<', 1+ digits, '>'
) end negative lookahead
.* match 0+ characters other than newlines
\r?\n match newline optionally preceded by '\r'
(?= begin positive lookahead
[\s\S]*? match 0+ characters (incl. newlines), non-greedily
<\d+> match '<', 1+ digits, '>'
) end positive lookahead
'\r', a carriage return, will be present if the file was produced when using the Windows operating system.
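The same expression can be tried in Python as a rough sketch, assuming re.MULTILINE. The sample strings and the function name are invented for illustration; the second tag <77> is there to show the multi-tag assumption stated above.

```python
import re

UNTAGGED_LINE_BEFORE_TAG = re.compile(
    r"^(?!.*<\d+>).*\r?\n(?=[\s\S]*?<\d+>)", re.MULTILINE
)

SAMPLE = "111a z\n222 aset\n333 //+\n12 <5> 11\nabc\n<77>\ndef\n"


def untagged_lines_before_a_tag(text):
    """Return lines without a <digits> tag that are followed, somewhere, by one."""
    return [
        m.group(0).rstrip("\r\n")
        for m in UNTAGGED_LINE_BEFORE_TAG.finditer(text)
    ]


print(untagged_lines_before_a_tag(SAMPLE))
# ['111a z', '222 aset', '333 //+', 'abc']
```

Here 'abc' is included because <77> follows it, while 'def' is not because no tag comes after it.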
Related
I am trying to solve http://play.inginf.units.it/#/level/10
I have some strings as follows:
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},
title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},
title={Fuel cells and their applications},
author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},
volume={117},
I need to match the names in bold. I tried the following regex:
(?<=author={).+(?=})
But it matches the entire string inside {}. I understand why that is, but how can I split the match at each "and"?
It took me a little while to get the samples to show up in your link. What about:
(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},).)+
See an online demo
(?:^\s*author={|\G(?!^) and ) - Either match the start of a line followed by 0+ whitespace chars and the literal 'author={', or assert the position at the end of the previous match (but not at the start of a line) and match ' and ';
\K - Reset starting point of reported match;
(?:(?! and |},).)+ - Match 1+ characters, each one only if ' and ' or '},' does not start at that position.
Above will also match 'others' as per last sample in linked test. If you wish to exclude 'others' then maybe add the option to the negated list as per:
(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},|\bothers\b).)+
See an online demo
In the comment section we established that the above would not work on the linked website. Apparently it's JS-based, which would support variable-length lookbehind. Therefore try:
(?<=\bauthor={(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*
See the demo
(?<= - Open lookbehind;
\bauthor={ - Match word-boundary and literally 'author={';
(?:(?!\},).*?)) - Open non-capture group to match a negative lookahead for '},' and 0+ (lazy) characters. Close lookbehind;
\b[A-Z]\S*\b - Match anything between two word-boundaries starting with a capital letter A-Z followed by 0+ non-whitespace chars;
(?:,? [A-Z]\S*\b)* - A 2nd non-capture group to keep matching comma/space separated parts of a name.
If a lookbehind assertion is supported and you are matching word characters, you might use:
(?<=\bauthor={[^{}]*(?:{[^{}]*}[^{}]*)*)[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b
Explanation
(?<= Positive lookbehind, assert that to the left of the current position is
\bauthor={ Match author={ preceded by a word boundary
[^{}]*(?:{[^{}]*}[^{}]*)* Match optional chars other than { } or match {...}
) Close the lookbehind
[A-Z] Match an uppercase char A-Z
[^\s,]*, Optionally match non whitespace chars except , and then match ,
(?: Non capture group to repeat as a whole part
\s+[A-Z][^\s,]* Match 1+ whitespace chars, uppercase char A-Z, optional non whitespace chars except ,
)+ Close the non capture group and repeat it 1 or more times
\b a word boundary
See a regex101 demo.
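Where neither \G/\K nor variable-length lookbehind is available (Python's re module is one such case), a different, two-step approach may be easier: capture the whole author field first, then split it on ' and '. This is a swapped-in technique, not what the answers above do, and it does not filter out 'others'. The names AUTHOR_FIELD and author_names are hypothetical.

```python
import re

# Capture everything between "author={" and the closing "}" at the end of the line.
AUTHOR_FIELD = re.compile(r"^[ \t]*author=\{(.*)\},?[ \t]*$", re.MULTILINE)

SAMPLE = (
    "title={Fuel cells and their applications},\n"
    'author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},\n'
    "volume={117},\n"
)


def author_names(bibtex):
    """Return the individual names found in every author={...} line."""
    names = []
    for field in AUTHOR_FIELD.findall(bibtex):
        names.extend(name.strip() for name in field.split(" and "))
    return names


print(author_names(SAMPLE))
# ['Kordesch, Karl', 'Simader, G{"u}nter', 'Wiley, John']
```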
I need to match only those words that don't contain special characters like # and :.
For example:
git#github.com shouldn't match
list should return a valid match
show should also return a valid match
I tried it using a negative lookahead \w+(?![#:])
But it matches gi out of git#github.com, which it shouldn't match either.
You may add \w to the lookahead:
\w+(?![\w#:])
The equivalent is using a word boundary:
\w+\b(?![#:])
Besides, you may consider adding a left-hand boundary to avoid matching words inside non-word non-whitespace chunks of text:
^\w+(?![\w#:])
Or
(?<!\S)\w+(?![\w#:])
The ^ will match the word only at the start of the string, and (?<!\S) will match only if the word is preceded by whitespace or the start of the string.
See the regex demo.
Why not (?<!\S)\w+(?!\S), the whitespace boundaries? Because since you are building a lexer, you most probably have to deal with natural language sentences where words are likely to be followed with punctuation, and the (?!\S) negative lookahead would make the \w+ match only when it is followed with whitespace or at the end of the string.
You can use negative lookbehind and negative lookahead patterns around a word pattern to make sure that the word is not preceded or followed by a non-space character, or in other words, to make sure that it is surrounded by either a space or a string boundary:
(?<!\S)\w+(?!\S)
Demo: https://regex101.com/r/cjhUUM/2
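To compare the variants side by side, here is a short Python sketch. The constant names and the second test string ("list, show") are made up for illustration; the patterns themselves are the ones discussed in this thread.

```python
import re

NAIVE = re.compile(r"\w+(?![#:])")                    # the attempt from the question
LEFT_AND_RIGHT = re.compile(r"(?<!\S)\w+(?![\w#:])")  # left whitespace boundary, no \w, # or : after
WHITESPACE_ONLY = re.compile(r"(?<!\S)\w+(?!\S)")     # whitespace boundaries on both sides

text = "git#github.com list show"

print(NAIVE.findall(text))            # ['gi', 'github', 'com', 'list', 'show']
print(LEFT_AND_RIGHT.findall(text))   # ['list', 'show']
print(WHITESPACE_ONLY.findall(text))  # ['list', 'show']

# With trailing punctuation the two boundary styles behave differently.
print(LEFT_AND_RIGHT.findall("list, show"))   # ['list', 'show']
print(WHITESPACE_ONLY.findall("list, show"))  # ['show']
```

The last two lines show the trade-off: the whitespace-only boundaries skip a word that is followed by a comma, while the [\w#:] lookahead still accepts it.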
I can remove lines that do not contain a space character in Notepad++ with
^[^ ]*$
How do I remove lines that do not contain 2 space characters?
To match a line that does not contain exactly 2 spaces, you could use a negative lookahead that rejects lines made of exactly 2 spaces separated by runs of zero or more non-whitespace chars (\S*).
^(?!\S* \S* \S*$).+$
^ Start of string
(?! Negative lookahead, assert what is on the right is not
\S* \S* \S*$ Match 2 spaces between 0+ non whitespace chars \S*
) Close lookahead
.+ Match any char except a newline 1+ times
$ End of string
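Outside Notepad++, the lookahead pattern can be checked with a few lines of Python. This is only a sketch: re.MULTILINE is assumed so that ^ and $ work per line, and the sample lines are invented.

```python
import re

NOT_EXACTLY_TWO_SPACES = re.compile(r"^(?!\S* \S* \S*$).+$", re.MULTILINE)

SAMPLE = "nospace\none space\nhas two spaces\nthis has three spaces"


def lines_to_remove(text):
    """Return the non-empty lines that do not contain exactly two spaces."""
    return NOT_EXACTLY_TWO_SPACES.findall(text)


print(lines_to_remove(SAMPLE))
# ['nospace', 'one space', 'this has three spaces']
```

Empty lines are not matched, because .+ needs at least one character.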
I guess, maybe you want to remove lines with 1 space and 3 or more, maybe then
^ {1}$|^ {3,}$
might be OK to look into.
There are a thousand regular expression questions on SO, so I apologize if this is already covered. I did look first.
I have string:
Name Subname 11X22 88X620 AB33(20) YA5619 77,66
I need to capture this string: YA5619
What I am doing is just finding AB33(20) and after this I am capturing until first white space. But AB33(20) can be AB-33(20) or AB33(-20) or AB33(-1).
My preg_match regex is: (?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)
Why am I getting an error when I change \d{2} to \d+?
For the final result I thought this regex would work, but it doesn't:
(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)
Any ideas what I am doing wrong?
With most regex flavors, lookbehind needs to evaluate to a fixed-length sequence, so you can't use variable quantifiers like * or + or even {1,2}.
Instead of using lookaround, you can simply match your marker pattern and then forget it with \K.
AB-?\d+(?:\(-?\d+\))? \K[^ ]+
demo: https://regex101.com/r/8XXngH/1
It depends on the language. In .NET, for example, it matches because variable-length lookbehind is supported.
Another solution might be to use a character class listing the characters you would allow after AB. Then match a whitespace character, reset the match start with \K, and match \S+, which matches 1+ non-whitespace characters.
\bAB[()\d-]+\s\K\S+
Explanation
\bAB Match AB literally, preceded by a word boundary to prevent AB being part of a larger word.
[()\d-]+ Match 1+ times any of the listed character in the character class
\s Match a whitespace char (or \s+ to match 1 or more)
\K Reset the starting point of the reported match (forget what was matched so far)
\S+ Match 1+ non-whitespace characters
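The question uses preg_match, but the fixed-length restriction is easy to reproduce in Python as well, whose re module also rejects variable-length lookbehind and has no \K. In this sketch a capture group takes the place of \K; the helper names token_after_marker and compiles are hypothetical.

```python
import re

# \K is not available in Python's re module, so a capture group is used instead.
TOKEN_AFTER_MARKER = re.compile(r"\bAB-?\d+(?:\(-?\d+\))?\s(\S+)")


def token_after_marker(text):
    """Return the token following the AB.. marker, or None if there is none."""
    match = TOKEN_AFTER_MARKER.search(text)
    return match.group(1) if match else None


def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


line = "Name Subname 11X22 88X620 AB33(20) YA5619 77,66"
print(token_after_marker(line))                            # YA5619
print(compiles(r"(?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)"))     # True: fixed-length lookbehind
print(compiles(r"(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)"))     # False: variable-length lookbehind
```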
I would like to ask for help with a problem: I want to catch spoofed usernames using regex.
For example, the correct username is:
rolf
and here are the spoofed versions that I could think of:
roooolf
r123olf
123rolf123
rolf5623
123rolf
rollllf
rrrrrrolf
rolffff
So basically I have this regex (which I know is not sufficient, because I've tried it on the regex101 website):
.+(?![rolf]).+
I'm using this as a baseline because it doesn't catch the correct username, which is:
rolf
but it doesn't catch all the other "spoofed" versions of the username.
Any ideas how I can make my regex more effective?
Thanks in advance!
You may try this too
(?m)^(?![^\n]*?rolf[^\n]*$).*$
To match anything that is not exactly rolf, you can use a negative lookahead (?! to assert that what follows from the beginning of the string up to the end of the string is not 'rolf'.
^(?!rolf$).+$
That would match
^ Assert position at the begin of the string
(?! Negative lookahead that asserts that what follows is not
rolf Match literally
) Close negative lookahead
.+ Match any character one or more times
$ Assert position at the end of the string
Your example regex uses .+, which, as @Ωmega fairly points out, also matches spaces.
Instead of .+ you could specify what characters you might accept like \w+ for example to match one or more word characters or specify more using a character class.
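As a quick check, the negative lookahead with \w+ in place of .+ can be run against the usernames from the question. This is a minimal Python sketch; is_not_exactly_rolf is a made-up helper name.

```python
import re

NOT_EXACTLY_ROLF = re.compile(r"^(?!rolf$)\w+$")

CANDIDATES = [
    "rolf", "roooolf", "r123olf", "123rolf123", "rolf5623",
    "123rolf", "rollllf", "rrrrrrolf", "rolffff",
]


def is_not_exactly_rolf(name):
    """True for any run of word characters other than the exact name 'rolf'."""
    return NOT_EXACTLY_ROLF.match(name) is not None


flagged = [name for name in CANDIDATES if is_not_exactly_rolf(name)]
print(flagged)  # every candidate except 'rolf'
```

Keep in mind that this flags every name other than rolf, including unrelated ones, so it is a "not equal" test rather than a similarity test.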
You can use a regex pattern
\b(?!rolf\b)\S+\b
\b Word boundary - Matches a word boundary position between a word character and non-word character or position (start / end of string).
(?! Negative lookahead - Specifies a group that can not match after the main expression (if it matches, the result is discarded).
\S Not whitespace - Matches any character that is not a whitespace character (spaces, tabs, line breaks).
+ Quantifier - Match 1 or more of the preceding token.
Test your inputs with this pattern here.
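If the usernames appear inside longer text rather than one per string, the word-boundary pattern can be used with findall. A small Python sketch, with an invented sample string:

```python
import re

NOT_THE_WORD_ROLF = re.compile(r"\b(?!rolf\b)\S+\b")

text = "rolf roooolf 123rolf rolf5623"

print(NOT_THE_WORD_ROLF.findall(text))
# ['roooolf', '123rolf', 'rolf5623']
```

The standalone word rolf is skipped, while rolf5623 is kept because there is no word boundary between 'f' and '5'.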