How do I match a group except when it starts with a certain character?
For example, I have the following sentence:
just _checking any _string.
I have the regex ([\w]+), which matches all the words {just, _checking, any, _string}. What I want is to match only the words that don't start with the character _, i.e. {just, any}.
The above example is a watered down version of what I'm actually trying to parse.
I'm parsing a code file that contains strings in the following format:
package1.class1<package2.class2 <? extends package3.class3> , package4.class4 <package5.package6.class5<?>.class6.class7<class8> >.class9.class10
The output I need is one match per fully qualified name (a name with at least one . in the middle), with each match stopping as soon as a < is encountered.
So, the result should be:
{ package1.class1, package2.class2, package3.class3, package4.class4, package5.package6.class5 }
I wrote ([\w]+\.)+([\w]+) to parse it, but it also matches class6.class7 and class9.class10, which I don't want. I know it's way off the mark and I apologize for that.
Hence, I earlier asked if I can ignore a capture group starting from a specific character.
Here's the link where I tried : regex101
There, everything it matches is correct except for class6.class7 and class9.class10.
I'm not sure how to proceed on this. I'm using C++14, which supports the ECMAScript grammar as well as POSIX style.
EDIT : as suggested by #Corion, I've added more details.
EDIT2 : added regex101 link
Just use a word boundary \b and make sure that the first character is not an underscore (but still a letter):
(\b(?=[^_])[\w]+)
Using the following Perl script to validate that:
perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_])[\w]+)/g"
Matched <just>
Matched <any>
regex101 playground
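For readers who would rather check this outside Perl, here is a minimal Python sketch of the same pattern. The function name is made up for illustration, and Python is used only because other snippets on this page use it; the asker's C++ setup is not covered here.

```python
import re

# Pattern from the answer above: a word boundary, a lookahead that rejects
# a leading underscore, then the word itself.
NO_LEADING_UNDERSCORE = re.compile(r"\b(?=[^_])\w+")


def words_without_leading_underscore(text):
    """Return every word in `text` that does not start with '_'."""
    return NO_LEADING_UNDERSCORE.findall(text)


print(words_without_leading_underscore("just _checking any _string."))
# ['just', 'any']
```

The word boundary is what keeps the regex from matching "checking" inside "_checking": there is no boundary between _ and c, because both are word characters.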
In response to the expansion of the question in the comment, the following regular expression will also capture dots in the "middle" of the word (but still disallow them at the start of a word):
(\b(?=[^_.])[\w.]+)
perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_.])[\w.]+)/g"
just _checking any _string. and. this. inclu.ding dots
Matched <just>
Matched <any>
Matched <and.>
Matched <this.>
Matched <inclu.ding>
Matched <dots>
regex101 playground
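The dotted variant can be tried the same way in Python. This is only a sketch with a hypothetical function name; it reproduces the Perl output above, including the fact that a trailing dot is kept as part of the match.

```python
import re

# Same idea as before, but dots are allowed inside the match and
# rejected (together with '_') as the first character.
WORDS_WITH_DOTS = re.compile(r"\b(?=[^_.])[\w.]+")


def words_allowing_inner_dots(text):
    """Return words that start with neither '_' nor '.'; dots may follow."""
    return WORDS_WITH_DOTS.findall(text)


sample = "just _checking any _string. and. this. inclu.ding dots"
print(words_allowing_inner_dots(sample))
# ['just', 'any', 'and.', 'this.', 'inclu.ding', 'dots']
```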
After the third expansion of the question, I've expanded the regular expression to match the class names but exclude the extends keyword, and to start a new match only after a space (\s) or an angle bracket (< or >). The fully qualified matches are achieved by forcing a dot ( \. ) to appear in the match:
(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))
perl -nwle "print qq(Matched <$_>) for /(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))/g"
Matched <package1.class1>
Matched <package2.class2>
Matched <package3.class3>
Matched <package4.class4>
Matched <package5.package6.class5>
regex 101 playground
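A rough Python equivalent of the one-liner above, run against the sample line from the question. The function and variable names are invented for the example, and whether the pattern behaves identically under C++ std::regex has not been checked here.

```python
import re

# Pattern from the answer: a match may only start at the beginning of the
# string or right after '<', '>' or whitespace; the captured name must
# contain at least one dot and must not be the keyword "extends".
FQN = re.compile(r"(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))")


def qualified_names(source):
    """Return the captured fully qualified names found in `source`."""
    return FQN.findall(source)


line = ("package1.class1<package2.class2 <? extends package3.class3> , "
        "package4.class4 <package5.package6.class5<?>.class6.class7<class8> >"
        ".class9.class10")

for name in qualified_names(line):
    print(name)
```

class6.class7 and class9.class10 are skipped because each is preceded by a dot, which is not one of the allowed start delimiters.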
Related
I am trying to build a combined regular expression, but I don't know how to combine the two sub expressions
I have an input string like this: 4711_001.doc
I want to match the following: 4711.doc
I am able to match 4711 with this expression: [^\_\.]*
I am able to match .doc with this expression: \.[^.]+
Is there some kind of logical AND to combine the two expressions and match 4711.doc? What would the expression look like?
You can use groups to do it in one regular expression. Check out this code for reference:
import re
s = "4711_001.doc"
match = re.search(r"(.+?)_\d+(\..+)", s)
print(match.group(1) + match.group(2))
Output:
4711.doc
Another possibility would be to match the part you don't want:
_\d+
And replace this with "":
import re
s = "4711_001.doc"
result = re.sub(r"_\d+", "", s)
print(result)
See the online demo
For this example string 4711_001.doc, using [^_.]* and \.[^.]+ is quite a broad match as it can match any character except what is listed in the character class.
Perhaps you could make the pattern a bit more specific, matching digits at the start and word characters as the extension.
In the replacement use capture group 1 and 2, often denoted as $1$2 or \1\2
(\d+)_\d+(\.\w+)
Regex demo
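In Python, for instance, the two groups can be put back together with \1\2 in the replacement. A small sketch, with a made-up function name:

```python
import re


def strip_numeric_suffix(filename):
    """Drop the '_<digits>' part between the leading digits and the extension."""
    # Group 1 is the leading number, group 2 the extension; the text
    # between them is matched but left out of the replacement.
    return re.sub(r"(\d+)_\d+(\.\w+)", r"\1\2", filename)


print(strip_numeric_suffix("4711_001.doc"))  # 4711.doc
```

Names that don't fit the pattern are returned unchanged, since re.sub only rewrites what it matches.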
There is no language tagged, but if for example \K is supported to clear the match buffer this might also be an option (including the parts that you tried)
In the replacement use an empty string.
[^_.]*\K_[^._]+(?=\.[^.]+$)
In parts
[^_.]*\K Match the part before the underscore, then forget what is matched so far
_[^._]+ Match the underscore, followed by 1+ chars other than . and _
(?=\.[^.]+$) A positive lookahead assertion to make sure what is at the right is a . followed by any char other than a . until the end of the string.
Regex demo
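As far as I know, Python's standard re module does not support \K, so the pattern above will not run there as written. A workaround sketch (not the pattern from this answer, and the function name is hypothetical): leave out the [^_.]*\K prefix, match only the underscore part, and replace it with an empty string.

```python
import re


def drop_underscore_part(filename):
    """Remove '_xxx' when it sits directly before the final extension."""
    # No \K needed: only the underscore part is matched, so replacing it
    # with "" leaves the text in front of it untouched.
    return re.sub(r"_[^._]+(?=\.[^.]+$)", "", filename)


print(drop_underscore_part("4711_001.doc"))  # 4711.doc
```

The lookahead still requires a single extension up to the end of the string, so a name such as 4711_001.tar.gz is left as it is.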
I'd like to know how I can ignore characters that follow a particular pattern in a regex.
I tried positive lookaheads, but they do not work: they preserve those characters for other matches, while I want them to be just... discarded.
For example, a part of my regex is: (?<DoubleQ>\"\".*?\"\")|(?<SingleQ>\".*?\")
in order to match some "key-parts" of this string:
This is a ""sample text"" just for "testing purposes": not to be used anywhere else.
I want to capture the entire ""sample text"", but then I want to "extract" only sample text and the same with testing purposes. That is, I want the group to match to be ""sample text"", but then I want the full match to be sample text. I partially achieved that with the use of the \K option:
(?<DoubleQ>\"\"\K.*?\"\")|(?<SingleQ>\"\K.*?\")
Which ignores the first "" (or ") from the full match but takes it into account when matching the group. How can I ignore the following "" (")?
Note: positive lookahead does not work: it does not ignore characters from the following matches, it just does not include them in the current match.
Thanks a lot.
I hope I got your question right. So you want to match the whole string including the quotes, but you want to replace/extract only the expression without the quotes, right?
You typically can use the regex replace functionality to extract just a part of the match.
This is the regex expression:
""?(.*?)""?
And this the replace expression:
$1
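The question does not say which engine is in use, so the following is only a Python sketch of the same idea, with invented names. Python writes the replacement as \1 rather than $1, and findall returns the contents of group 1 directly.

```python
import re

# Pattern from the answer: one or two opening quotes, a lazily matched
# body captured as group 1, then one or two closing quotes.
QUOTED = re.compile(r'""?(.*?)""?')


def quoted_parts(text):
    """Return the text inside each quoted section, without the quotes."""
    return QUOTED.findall(text)


def unquote(text):
    """Replace each quoted section with its unquoted contents."""
    return QUOTED.sub(r"\1", text)


sample = 'This is a ""sample text"" just for "testing purposes": not to be used anywhere else.'
print(quoted_parts(sample))  # ['sample text', 'testing purposes']
print(unquote(sample))
```

Because the quotes are consumed by the match itself, they are not available to start another match, which is the behaviour the lookahead could not provide.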
I'm looking to match any text and non text values, located within " " , but only between the 1st and the last instance of [[ and ]]
window.google.ac.h(["text",[["text1",0],["text2",0,[3]],["text3",0],["text4",0,[3]],["text5",0],["text6",0],["text7",0
]],{"q":"hygjgjhbjh","k":1}])
So far I've managed to get some results (far from ideal) by using: "(.*?)",0
The issue I have is that it either matches all the way until
"text4",0,[3]]
or starts matching at
["text",
I only need to match text1 text2 text3 .. text7
Note: the position and number of double square brackets between the first and the last are not consistent.
Thanks for your help guys!
Edit: I'm using http://regexr.com/ to test it
Well, you didn't say which language, but using .NET regex (which largely follows the Perl standard), the first match groups of all the matches of a global match using the following regex will contain the values you want:
(?<=\[\[.*)"(.+?)"(?=.*\]\])
A global match on "(.+?)" alone would return matches whose first match groups would contain the characters between the quotes. The lookbehind assertion (?<=\[\[.*) tells it to include only cases where there is a [[ anywhere behind, i.e. after the first instance of [[. Similarly, the lookahead assertion (?=.*\]\]) tells it to include only cases where there is a ]] anywhere ahead, i.e. before the last instance of ]].
I tested this using PowerShell. The exact syntax to do a global match and extract the first match groups of all the results will depend on the language.
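Engines that reject variable-length lookbehind (Python's standard re module is one) cannot run this pattern as written. A different technique that gives the same result on the sample is to cut the string down to the part between the first [[ and the last ]] and only then collect the quoted values. This is a sketch of that two-step approach, not the regex above, and the function name is made up.

```python
import re


def quoted_between_brackets(text):
    """Return quoted values found between the first '[[' and the last ']]'."""
    start = text.find("[[")
    end = text.rfind("]]")
    if start == -1 or end == -1 or end < start:
        return []  # no usable [[ ... ]] region
    # Inside the region, every "..." pair is a value we want.
    return re.findall(r'"([^"]*)"', text[start:end])


data = ('window.google.ac.h(["text",[["text1",0],["text2",0,[3]],["text3",0],'
        '["text4",0,[3]],["text5",0],["text6",0],["text7",0]],'
        '{"q":"hygjgjhbjh","k":1}])')
print(quoted_between_brackets(data))
```

On the sample this prints text1 through text7 and leaves out "text", "q", "hygjgjhbjh" and "k", which lie outside the bracketed region.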
Depending on your RegEx Engine, you could use this pattern
(?:^.*?\[\[|\G[^"]*)("[^"]*")(?=.*\]\])
Demo
I have the following text string:
$ABCD(file="somefile.txt")$' />Some more text followed by a dollar like this one)$. Some more random text
I am trying to match the $ABCD(file="somefile.txt")$ part of the string using a regular expression.
I am using this (?=[$]ABCD[(]file=).*(?<=[)][$]) regular expression pattern to make the intended match. It's not working as expected because I am getting a match all the way to the second )$ in the string.
For example, the match will be as follows:
$ABCD(file="somefile.txt")$' />Some more text followed by a dollar like this one)$
How should I modify the pattern to match to the end of the first occurrence of the )$?
Here is a good online regular expression engine tester:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Try appending a ? to the greedy *:
(?=[$]ABCD[(]file=).*?(?<=[)][$])
Lazy quantification
The standard quantifiers in regular expressions are greedy, meaning they match as much as they can. Modern regular expression tools allow a quantifier to be specified as lazy (also known as non-greedy, reluctant, minimal, or ungreedy) by putting a question mark after the quantifier.
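The difference is easy to see side by side. The question targets .NET, so this Python sketch is only an illustration; both patterns happen to be valid in Python's re as well, because the lookbehind has a fixed width. The variable names are invented.

```python
import re

text = ('$ABCD(file="somefile.txt")$\' />Some more text followed by '
        'a dollar like this one)$. Some more random text')

# Greedy: .* runs to the end and backtracks to the LAST ")$".
greedy = re.search(r"(?=[$]ABCD[(]file=).*(?<=[)][$])", text).group(0)

# Lazy: .*? grows one character at a time and stops at the FIRST ")$".
lazy = re.search(r"(?=[$]ABCD[(]file=).*?(?<=[)][$])", text).group(0)

print(greedy)
print(lazy)  # $ABCD(file="somefile.txt")$
```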
You could just use this:
\$ABCD\(file="[a-z.]+"\)\$
to get $ABCD(file="somefile.txt")$.
Your problem was the .* bit: it was too general and thus matched everything up to the last )$.
I would advise you to use the second quote to define the end of the searched pattern: [^"]* will match anything except ".
So the pattern for the file name would be: \$ABCD\(file="([^"]*)
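A quick way to try that pattern, sketched in Python with a hypothetical helper name. Group 1 holds the file name, and the match cannot run past the closing quote because [^"]* never consumes one.

```python
import re


def extract_file_name(text):
    """Return the value of file="..." in the first $ABCD(...) tag, or None."""
    m = re.search(r'\$ABCD\(file="([^"]*)', text)
    return m.group(1) if m else None


text = ('$ABCD(file="somefile.txt")$\' />Some more text followed by '
        'a dollar like this one)$. Some more random text')
print(extract_file_name(text))  # somefile.txt
```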
I have this regular expression
([A-Z], )*
which should match something like
test, (with a space after the comma)
How do I change the regex so that it doesn't match if there are any characters after the space?
For example if I had:
test, test
I'm looking to do something similar to
([A-Z], ~[A-Z])*
Cheers
Use the following regular expression:
^[A-Za-z]*, $
Explanation:
^ matches the start of the string.
[A-Za-z]* matches 0 or more letters (case-insensitive) -- replace * with + to require 1 or more letters.
, matches a comma followed by a space.
$ matches the end of the string, so if there's anything after the comma and space then the match will fail.
As has been mentioned, you should specify which language you're using when you ask a Regex question, since there are many different varieties that have their own idiosyncrasies.
^([A-Z]+, )?$
The difference between mine and Donut's is that his will match , and fail for the empty string, while mine will match the empty string and fail for ,. (Also, his is more case-insensitive than mine: with mine you'll have to add case-insensitivity to the options of your regex function, but it follows your example.)
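Since the question names no language, here is a Python sketch that runs both suggested patterns over the edge cases just described. The function names are made up, and the second pattern is given the IGNORECASE flag so that lowercase input like "test, " is accepted.

```python
import re

# First suggestion: zero or more letters, then ", " at the very end.
LETTERS_THEN_COMMA = re.compile(r"^[A-Za-z]*, $")
# Second suggestion: the whole "letters, " group is optional.
OPTIONAL_GROUP = re.compile(r"^([A-Z]+, )?$", re.IGNORECASE)


def matches_first(s):
    return LETTERS_THEN_COMMA.search(s) is not None


def matches_second(s):
    return OPTIONAL_GROUP.search(s) is not None


for s in ["test, ", "test, test", ", ", ""]:
    print(repr(s), matches_first(s), matches_second(s))
```

Both accept "test, " and reject "test, test"; they disagree only on ", " (accepted by the first) and the empty string (accepted by the second).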
I am not sure which regex engine/language you are using, but there is often something like a negated character class, e.g. [^a-z], meaning "any character other than a lowercase letter".