I am trying to build a combined regular expression, but I don't know how to combine the two subexpressions.
I have an input string like this: 4711_001.doc
I want to match the following: 4711.doc
I am able to match 4711 with this expression: [^\_\.]*
I am able to match .doc with this expression: \.[^.]+
Is there some kind of logical AND to combine the two expressions and match 4711.doc? What would the expression look like?
You can use groups to do it in one regular expression. Check out this code for reference:
import re
s = "4711_001.doc"
match = re.search(r"(.+?)_\d+(\..+)", s)
print(match.group(1) + match.group(2))
Output:
4711.doc
Another possibility would be to match the part you don't want:
_\d+
And replace this with "":
import re
s = "4711_001.doc"
match = re.sub(r"_\d+", "", s)
print(match)
See the online demo
For this example string 4711_001.doc, using [^_.]* and \.[^.]+ is quite a broad match as it can match any character except what is listed in the character class.
Perhaps you could make the pattern a bit more specific, matching digits at the start and word characters as the extension.
In the replacement use capture group 1 and 2, often denoted as $1$2 or \1\2
(\d+)_\d+(\.\w+)
Regex demo
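As a minimal sketch of the capture-group replacement, assuming Python's built-in re module (the question itself names no language; the variable name result is only illustrative):

```python
import re

s = "4711_001.doc"
# Group 1 keeps the leading digits, group 2 keeps the extension;
# the "_001" in between is matched but not captured, so it is dropped.
result = re.sub(r"(\d+)_\d+(\.\w+)", r"\1\2", s)
print(result)
```

Python writes the backreferences as \1\2; other engines may expect $1$2 instead.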
No language is tagged, but if \K is supported (it discards what has been matched so far from the reported match), this might also be an option, reusing the parts that you tried.
In the replacement use an empty string.
[^_.]*\K_[^._]+(?=\.[^.]+$)
In parts
[^_.]*\K Match the part before the underscore, then forget what is matched so far
_[^._]+ Match the underscore, followed by 1+ chars other than . and _
(?=\.[^.]+$) A positive lookahead assertion to make sure what is at the right is a . followed by any char other than a . until the end of the string.
Regex demo
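To my knowledge, Python's built-in re module does not support \K, so there the same idea can be approximated by capturing the leading part and writing it back. This is a sketch under that assumption, not a drop-in equivalent for every engine:

```python
import re

s = "4711_001.doc"
# Capture what \K would have "forgotten" and put it back via \1.
# The lookahead still requires a final ".ext" after the removed part.
result = re.sub(r"([^_.]*)_[^._]+(?=\.[^.]+$)", r"\1", s)
print(result)
```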
Related
How do I match a group unless it starts with a certain character?
e.g. I have the following sentence:
just _checking any _string.
I have the regex ([\w]+) which matches all the words {just, _checking, any, _string}. But what I want is to match all the words that don't start with the character _, i.e. {just, any}.
The above example is a watered down version of what I'm actually trying to parse.
I'm parsing a code file, which contains strings in the following format:
package1.class1<package2.class2 <? extends package3.class3> , package4.class4 <package5.package6.class5<?>.class6.class7<class8> >.class9.class10
The output I need is one match for each fully qualified name (having at least one . in the middle), stopping when a < is encountered.
So, the result should be :
{ package1.class1, package2.class2, package3.class3, package4.class4, package5.package6.class5 }
I wrote ([\w]+\.)+([\w]+) to parse it but it also matches class6.class7 and class9.class10 which I don't want. I know it's way off the mark and I apologize for that.
Hence, I earlier asked if I can ignore a capture group starting from a specific character.
Here's the link where I tried it: regex101
There, everything it matches is correct except that it also matches class6.class7 and class9.class10.
I'm not sure how to proceed on this. I'm using C++14 and it supports ECMAScript grammar as well along with POSIX style.
EDIT : as suggested by @Corion, I've added more details.
EDIT2 : added regex101 link
Just use a word boundary \b and make sure that the first character is not an underscore (but still a word character):
(\b(?=[^_])[\w]+)
Using the following Perl script to validate that:
perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_])[\w]+)/g"
Matched <just>
Matched <any>
regex101 playground
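The same check can be sketched in Python, assuming its re module treats \b and \w the way Perl does for this ASCII input:

```python
import re

sentence = "just _checking any _string."
# \b anchors at a word start; (?=[^_]) rejects starts that are an underscore.
# Inside "_checking" there is no word boundary, so "checking" is not picked up.
words = re.findall(r"\b(?=[^_])\w+", sentence)
print(words)
```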
In response to the expansion of the question in the comment, the following regular expression will also capture dots in the "middle" of the word (but still disallow them at the start of a word):
(\b(?=[^_.])[\w.]+)
perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_.])[\w.]+)/g"
just _checking any _string. and. this. inclu.ding dots
Matched <just>
Matched <any>
Matched <and.>
Matched <this.>
Matched <inclu.ding>
Matched <dots>
regex101 playground
After the third expansion of the question, I've expanded the regular expression to match the class names but exclude the extends keyword, and to start a new match only at the beginning of the string or after whitespace (\s) or an angle bracket (< or >). The fully qualified matches are achieved by forcing a dot ( \. ) to appear in the match:
(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))
perl -nwle "print qq(Matched <$_>) for /(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))/g"
Matched <package1.class1>
Matched <package2.class2>
Matched <package3.class3>
Matched <package4.class4>
Matched <package5.package6.class5>
regex 101 playground
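For readers without Perl at hand, here is a rough Python rendering of the same pattern. The question targets C++14, so this is only a sketch for checking the pattern's behaviour, not the asker's actual code:

```python
import re

text = ("package1.class1<package2.class2 <? extends package3.class3> , "
        "package4.class4 <package5.package6.class5<?>.class6.class7<class8> >"
        ".class9.class10")

# A name may start only at the string start or after <, > or whitespace.
# The lookahead rejects names beginning with _ or . and the keyword "extends".
pattern = r"(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))"
names = re.findall(pattern, text)  # findall returns the single capture group
for name in names:
    print(f"Matched <{name}>")
```

class6.class7 and class9.class10 are skipped because each is preceded by a dot, which is not one of the allowed starting contexts.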
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
used to match string
123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23
How come this works in regexr.com but Python 3.5.1 can't find a match?
r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))'
can match up to
123 FEX-1-80 Online N2K-C2248TP
but the second hyphen - in group(4) is not matched
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
Just a comment, not really an answer but for the sake of clarity I have put it as an answer.
If you are relatively new to regular expressions, consider using verbose mode. With it, your expression becomes much more readable:
(1[0-9]{2})\s+ # three digits, the first one needs to be 1
(\w+(?:-\w+)+)\s+ # a word character (wc), followed by - and wcs
(\w+)\s+ # another word
(\w+(?:-\w+)+)\s+ # same expression as above
(\w+) # another word
Also, check whether your second and fourth expressions could be rewritten as [\w-]+. It is not the same as yours and will match other substrings, but try to avoid nested parentheses in general.
Concerning your question, the second string cannot be matched as you made all of your expressions mandatory (and group 5 is missing in the second example, so it will fail).
See a demo on regex101.com.
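In Python, verbose mode is enabled with the re.VERBOSE flag. A minimal sketch using the input string from the question (the variable names are illustrative):

```python
import re

pattern = re.compile(r"""
    (1[0-9]{2})\s+       # three digits, the first one must be 1
    (\w+(?:-\w+)+)\s+    # word characters with one or more -word parts
    (\w+)\s+             # another word
    (\w+(?:-\w+)+)\s+    # same expression as the second group
    (\w+)                # final word
""", re.VERBOSE)

line = "123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23"
groups = pattern.search(line).groups()
print(groups)
```

Whitespace and # comments inside the pattern are ignored in this mode, so literal spaces have to be written as \s or escaped.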
This regular expression matches the full input string:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
This one doesn't:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))
The latter is missing a + after the last non-capturing group, and it's missing the \s+(\w+) at the end that matches the SSDFDFWFw23r23 at the end of the input string.
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
I'm not sure I follow. A non-capturing group is really just there to group a part of a regular expression.
(?:-\w+) or just -\w+ will both match a hyphen (-) followed by one or more "word" characters (\w+). It doesn't matter whether that regular expression is in a non-capturing group or not. If you want to match repetitions of that pattern, you can use the + modifier after the non-capturing group, e.g. (?:-\w+)+. That pattern will match a string like -foo-bar-baz.
So the reason your second regular expression doesn't match the repeated pattern is because it's lacking the + modifier.
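The effect of the missing + can be seen on the -foo-bar-baz example in a short Python sketch:

```python
import re

text = "-foo-bar-baz"
# Without the quantifier the group matches exactly one "-word" part.
once = re.match(r"(?:-\w+)", text).group(0)
# With + the group may repeat, so the whole string is consumed.
repeated = re.match(r"(?:-\w+)+", text).group(0)
print(once)
print(repeated)
```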
I have this regular expression
([A-Z], )*
which should match something like
test, (with a space after the comma)
How do I change the regex so that it doesn't match if there are any characters after the space?
For example if I had:
test, test
I'm looking to do something similar to
([A-Z], ~[A-Z])*
Cheers
Use the following regular expression:
^[A-Za-z]*, $
Explanation:
^ matches the start of the string.
[A-Za-z]* matches 0 or more letters (case-insensitive) -- replace * with + to require 1 or more letters.
, matches a comma followed by a space.
$ matches the end of the string, so if there's anything after the comma and space then the match will fail.
As has been mentioned, you should specify which language you're using when you ask a Regex question, since there are many different varieties that have their own idiosyncrasies.
^([A-Z]+, )?$
The difference between mine and Donut's is that his will match ", " (a comma and a space with no letters before them) and fail for the empty string, while mine will match the empty string and fail for ", ". His is also case-insensitive; with mine you'll have to add case-insensitivity through the options of your regex function, but it follows your example more closely.
I am not sure which regex engine/language you are using, but there is often something like a negated character class, e.g. [^a-z], meaning "any character other than a lowercase letter".
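A small Python sketch comparing the two anchored patterns on the edge cases mentioned above. The helper name check is hypothetical, and re.IGNORECASE is added to the second pattern only so that the lowercase sample from the question can be tested:

```python
import re

donut = re.compile(r"^[A-Za-z]*, $")
optional = re.compile(r"^([A-Z]+, )?$", re.IGNORECASE)

def check(pattern, samples):
    """Return the samples that the pattern accepts."""
    return [s for s in samples if pattern.search(s)]

samples = ["test, ", "test, test", ", ", ""]
donut_hits = check(donut, samples)
optional_hits = check(optional, samples)
print(donut_hits)
print(optional_hits)
```

Both reject "test, test" because $ forces the match to end right after the comma and space.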
My use case is as follows: I would like to find all occurrences of something similar to this /name.action, but where the last part is not .action eg:
name.actoin - should match
name.action - should not match
nameaction - should not match
I have this:
/\w+.\w*
to match two words separated by a dot, but I don't know how to add 'and do not match .action'.
Firstly, you need to escape your . character, as an unescaped dot matches any character in a regex.
Secondly, you need to add a negative lookahead ("match only if not followed by"), written with the (?!...) syntax.
You may also want to add a circumflex ^ to anchor the match at the start of the string (or of a line, in multiline mode) and change your * (zero or more repetitions) to a + (one or more repetitions).
^/\w+\.(?!action)\w+ is the finished Regex.
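A quick Python check of that pattern. The sample inputs carry a leading slash here, which is an assumption based on the /name.action form in the question:

```python
import re

pattern = re.compile(r"^/\w+\.(?!action)\w+")

samples = ["/name.actoin", "/name.action", "/nameaction"]
# Keep only the inputs where the part after the dot does not start with "action".
matched = [s for s in samples if pattern.search(s)]
print(matched)
```

Note that the lookahead rejects anything that begins with action after the dot, so /name.actions would be rejected as well.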
^\w+\.(?!action)\w*
You need to escape the dot character.
\w+\.(?!action).*
Note the trailing .*; I'm not sure what you want to match after the action text.
See also Regular expression to match string not containing a word?
You'll need to use a zero-width negative lookahead assertion. This will let you look ahead in the string, and match based on the negation of a word.
So the regex you'd need (including the escaped . character) would look something like:
/name\.(?!action)/
I need some help with regex.
I have a pattern AB.* which should match strings like AB.CD, AB.CDX, AB.whatever and so on. But it should NOT match strings like AB, AB.CD.CD or AB.CD., that is, it should fail if it encounters a second dot in the string. What's the regex for this?
I have a pattern AB.** which should match strings like AB, AB.CD.CD and AB.CD., but NOT strings like AB.CD, AB.CDX or AB.whatever. What's the regex for this?
Thanks a lot.
Looks like you've got globs, not regular expressions. In a regular expression, a dot matches any character, and * makes the previous element match zero or more times.
1) AB\.[^.]*
Escape the first dot so it matches a literal dot, and then match any character other than a dot, any number of times.
2) "^(AB|AB\.[^.]*\.[^.]*)$"
This matches AB or AB followed by .<stuff>.<stuff>
http://www.regular-expressions.info/ contains lots of useful information for learning about regular expressions.
If your regex engine supports negative lookahead you might try something like:
^AB\.[^.]+$
^AB(?!\.[^.]+$)
(or
^AB\.[^.]*$
^AB(?!\.[^.]*$)
if you want to allow AB. )
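A short Python sketch of the first pair, run against the sample strings from the question (the list and variable names are illustrative):

```python
import re

samples = ["AB.CD", "AB.CDX", "AB.whatever", "AB", "AB.CD.CD", "AB.CD."]

single_dot = re.compile(r"^AB\.[^.]+$")          # exactly one dot section
not_single_dot = re.compile(r"^AB(?!\.[^.]+$)")  # AB not followed by only that

first = [s for s in samples if single_dot.search(s)]
second = [s for s in samples if not_single_dot.search(s)]
print(first)
print(second)
```

The lookahead version only inspects what follows AB, so it also accepts any other string starting with AB that is not of the single-dot form.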
I don't find your question entirely clear; please comment here (or edit your question if you can't add comments) if I'm getting this wrong, but what I think you're looking for is:
1) matching strings "AB.AnyTextHereWithoutDots" but not "AB" or "AB.foo." etc
If so a matching regex would be:
"^AB\.[^.]*$"
2) matching "AB" or "AB.something.something" with either none or two or more dots
If so a matching regex would be something like:
"^AB(\..*\..*)?$" or '^AB\(\..*\..*\)\?$' (depending on the nature of your regex engine)
As Douglas suggests matching with globs would likely be easier.
And as spdenne suggests, find a good regex reference.
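The two patterns can be compared in Python, which uses the unescaped-parentheses form. The extra sample "AB." is added here only to show what the * in the first pattern allows:

```python
import re

samples = ["AB.CD", "AB.CDX", "AB.whatever", "AB", "AB.CD.CD", "AB.CD.", "AB."]

one_dot = re.compile(r"^AB\.[^.]*$")        # AB, a dot, then no further dots
none_or_two = re.compile(r"^AB(\..*\..*)?$")  # bare AB, or at least two dots

case1 = [s for s in samples if one_dot.search(s)]
case2 = [s for s in samples if none_or_two.search(s)]
print(case1)
print(case2)
```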
I tried this in vim. Here is the sample data:
AB.CD
AB.CDX
AB.whatever
AB
AB.CD.CD
AB.CD.
AB.CD.CD
Here are my regexes.
This captures all lines that start with AB followed by a literal dot, and filters out all lines that have a second dot.
^AB\.[^.]*$
This captures all lines that are just AB (the part before the pipe) or lines that start with AB and contain at least two literal dots (escaped with a backslash), each followed by any number of characters.
^AB$\|^AB\..*\..*$