How is ? used in regular expression in python? - regex

I have this snippet
print(re.sub(r'(-script\.pyw|\.exe)?', '','.exe1.exe.exe'))
The output is 1
If I remove the ? from the above snippet and run it as
print(re.sub(r'(-script\.pyw|\.exe)', '','.exe1.exe.exe'))
The output is again the same.
Although I am using ?, it is getting greedy and replacing every '.exe' with an empty string.
Is there any workaround to replace only the first occurrence?

re.sub(pattern, repl, string, count=0, flags=0)
This is the signature of the re.sub function. Notice the count parameter: if you want only the first occurrence to be replaced, use count=1.
? is a non-greedy modifier when it follows a repetition operator; after anything else, it makes the preceding element optional. Thus, your top expression replaces either -script.pyw, .exe, or nothing with nothing. Since replacing nothing with nothing doesn't change the string, the top and the bottom versions (where the empty string cannot be matched) give the same result.
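As a quick check of the points above, here is a minimal sketch using the string from the question. The variable names are only for illustration.

```python
import re

s = '.exe1.exe.exe'
pattern = r'(-script\.pyw|\.exe)'

# Without a count, every match is replaced.
all_removed = re.sub(pattern, '', s)

# With count=1, only the first match is replaced.
first_removed = re.sub(pattern, '', s, count=1)

# Making the group optional also allows empty matches, but replacing
# an empty match with '' leaves the string unchanged.
optional_removed = re.sub(pattern + '?', '', s)

print(all_removed)       # 1
print(first_removed)     # 1.exe.exe
print(optional_removed)  # 1
```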

The question mark makes the preceding token in the regular expression optional.
Use
print(re.sub(r'(-script\.pyw|\.exe)', '','.exe1.exe.exe', 1))
if you want to remove only the first match.

? is greedy, so if it can match, it will.
For example: aaab? will match aaab instead of aaa.
To make ? non-greedy, you must add an extra ? (this is the same way you make * and + non-greedy, by the way).
So aaab?? will just match aaa. Yet, at the same time, aaab??c will match aaabc.
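The three cases above can be tried directly in Python; this is just a small sketch, with variable names chosen for illustration.

```python
import re

# Greedy ?: takes the optional 'b' when it is available.
greedy = re.match(r'aaab?', 'aaab').group()

# Lazy ??: prefers to skip the optional 'b'.
lazy = re.match(r'aaab??', 'aaab').group()

# Lazy ?? followed by 'c': skipping 'b' would make the match fail,
# so the engine backtracks and takes the 'b' after all.
forced = re.match(r'aaab??c', 'aaabc').group()

print(greedy)  # aaab
print(lazy)    # aaa
print(forced)  # aaabc
```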

Related

Matching multiple letters and special characters in regex

I am trying to catch strings around the acronym ADJ. The strings look like this:
·NOM·JJ·ADJ+CASE_DEF_GEN
·NOM·JJ·ADJ+CASE_DEF_ACC
·NOM·JJ·ADJ+CASE_INDEF_GEN
·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_GEN
·NOM·JJ·ADJ+CASE_INDEF_GEN
·NOM·JJ·ADJ+NSUFF_FEM_SG+CASE_INDEF_GEN
·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_ACC
So far I have this:
/[A-Z·\+#_]*?[·\+]ADJ[·\+][A-Z_·\+#]*?/g
But it only matches from the beginning of each string up to and including "ADJ+", for example ·NOM·DT+JJ·DET+ADJ+.
Since the rest of the string after ADJ has the same composition as the part before ADJ, I thought this /[A-Z·\+#_]*?[·\+]/g should work, but it doesn't.
How do I get it to match the rest of the string?
My guess is that you want to check whether the string contains ADJ; if so, maybe we could simplify the expression to something like:
([A-Z·+#_]*)\bADJ\b([A-Z·+#_]*)
The *? quantifier after the +ADJ+ part is satisfied with the empty string right after it, since the ? makes the quantifier before it match "the minimum number of times possible", and for * that is zero times.
So drop the ?; a lazy quantifier serves no purpose when you want the rest of the line.
perl -wE'$_=q(-XADJX-JJ+ADJ-REST-);
($before, $after) = /(.*?)[+\-]ADJ[+\-](.*)/;
say for $before,$after'
Removing the ? at the end makes the pattern match the whole string:
/[A-Z·\+#_]*?[·\+]ADJ[·\+][A-Z_·\+#]*/g
I am not entirely sure why you needed a ? after the *.
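To see the difference, here is a small Python sketch run against one of the sample lines from the question. It uses Python's re module rather than the /.../g syntax above, so the patterns are written without delimiters and flags.

```python
import re

line = '·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_GEN'

# Trailing *? is lazy: it is satisfied with zero characters after 'ADJ+'.
lazy_tail = re.search(r'[A-Z·+#_]*?[·+]ADJ[·+][A-Z_·+#]*?', line).group()

# Trailing * is greedy: it consumes the rest of the line.
greedy_tail = re.search(r'[A-Z·+#_]*?[·+]ADJ[·+][A-Z_·+#]*', line).group()

print(lazy_tail)    # ·NOM·DT+JJ·DET+ADJ+
print(greedy_tail)  # ·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_GEN
```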

Remove string between 2 pattern using gawk regex only

Input:
secNm:ATA,_class:com.dddao.domaffin.summaggrfy.GddenericMohsg},{ttlRec:0,ttlVal:{:0}secNm:B2B,_class:com.xyz.dakjdain.sfffummary.GenericMo73hs}extra
secNm:ATA,_class:com.dddao.domaffin.summaggrfy.GddenericMohsg},{ttlRec:0,ttlVal:{:0}secNm:B2B,_class:com.xyz.dakjdain.sfffummary.GenericMo73hs
In both of the strings above, I want to remove:
For string 1: the parts that start at ",_class" and end at the first occurrence of "}".
For string 2: the part that starts at ",_class" and runs to the end of the string, if the first condition fails.
Output:
secNm:ATA,{ttlRec:0,ttlVal:{:0}secNm:B2Bextra
secNm:ATA,{ttlRec:0,ttlVal:{:0}secNm:B2B
This kind of pattern can occur any number of times in the string.
I simply want to remove those parts.
I have written the regex call gsub(/,_class(.*?)\}/,"",$0)
I want an answer that uses gawk regex functions only, no other method.
The call above has an issue and removes a big part of the string.
Please help me correct my regex.
Thanks in advance.
You may use a [^}] negated bracket expression to match any char but }, since gawk does not support lazy quantifiers.
Besides, you do not need a grouping construct here, as you are not referring to the captured value. You may remove ( and ) safely.
Use
/,_class[^}]*}/
Basically, this should be understood as:
,_class - match ,_class substring
[^}]* - 0 or more chars other than }
} - up to and including }.
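For illustration only, here is the same negated-class idea sketched in Python rather than gawk, run against the two input strings from the question. The second pattern, which also accepts the end of the string in place of the closing }, is my own assumption for covering string 2; it is not part of the answer above, and I have not tested the equivalent in gawk.

```python
import re

s1 = ('secNm:ATA,_class:com.dddao.domaffin.summaggrfy.GddenericMohsg},'
      '{ttlRec:0,ttlVal:{:0}secNm:B2B,_class:com.xyz.dakjdain.sfffummary.'
      'GenericMo73hs}extra')
s2 = ('secNm:ATA,_class:com.dddao.domaffin.summaggrfy.GddenericMohsg},'
      '{ttlRec:0,ttlVal:{:0}secNm:B2B,_class:com.xyz.dakjdain.sfffummary.'
      'GenericMo73hs')

# Pattern from the answer: ',_class', then non-} characters, then '}'.
closed_only = r',_class[^}]*\}'

# Assumed extension: accept either '}' or the end of the string.
closed_or_end = r',_class[^}]*(?:\}|$)'

cleaned1 = re.sub(closed_or_end, '', s1)
cleaned2 = re.sub(closed_or_end, '', s2)

print(re.sub(closed_only, '', s1))  # same as cleaned1
print(cleaned1)
print(cleaned2)
```

With the pattern from the answer alone, the trailing ",_class..." part of string 2 is left in place because it has no closing }.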

Regular expressions with an alternative if the first one doesn't match

I need to have a regular expression that takes a function signature as an input and returns the name of the function, i.e I may have the following input:
FUNCTION(A,B,C)
and after applying the following regular expression:
^(.*?)(?=\()
I correctly obtain the word "FUNCTION" as expected.
However, sometimes I can get the name of the function WITHOUT parentheses (and therefore without parameters), like this:
FUNCTION
In this case, the previous regex fails and doesn't take the name. Is there any way to define a regex that, in case it cannot find the first regular expression, try another one? (In this case would be taking the whole input.)
From what I see, you want to match the first n characters other than (, ) and space.
Thus, it is much more efficient to use
^[^()\s]+
^(.*?)(?=\(|\s*$|\s)
This should do it for you. You need to use the | (or) operator.
\s*$ === stop if you have 0 or more spaces and then string ends
\s ==== stop at the first instance of space
^([^(]+)\s*\(?
Could do what you want.
Explanation :
([^(]+) : one or more characters that are not (
\s* : maybe some blank spaces
\(? : optional parenthesis
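A short Python sketch comparing two of the suggested patterns on both kinds of input. The helper names name_by_class and name_by_lookahead are made up for this example.

```python
import re

def name_by_class(signature):
    # Leading run of characters that are not '(', ')' or whitespace.
    m = re.match(r'[^()\s]+', signature)
    return m.group() if m else None

def name_by_lookahead(signature):
    # Lazy match that stops before '(', whitespace, or the end of the string.
    m = re.match(r'(.*?)(?=\(|\s*$|\s)', signature)
    return m.group(1) if m else None

for sig in ('FUNCTION(A,B,C)', 'FUNCTION'):
    print(name_by_class(sig), name_by_lookahead(sig))
```

Both helpers return FUNCTION for either input, so no fallback to a second regex is needed.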

Get all matches for a certain pattern using RegEx

I am not really a RegEx expert and hence asking a simple question.
I have a few parameters that I need to use which are in a particular pattern
For example
$$DATA_START_TIME
$$DATA_END_TIME
$$MIN_POID_ID_DLAY
$$MAX_POID_ID_DLAY
$$MIN_POID_ID_RELTM
$$MAX_POID_ID_RELTM
And these will be replaced at runtime in a string with their values (a SQL statement).
For example I have a simple query
select * from asdf where asdf.starttime = $$DATA_START_TIME and asdf.endtime = $$DATA_END_TIME
Now when I try to use the RegEx pattern
\$\$[^\W+]\w+$
I do not get all the matches (I get only the last match).
I am trying to test my usage here https://regex101.com/r/xR9dG0/2
If someone could correct my mistake, I would really appreciate it.
Thanks!
This will do the job:
\$\$\w+/g
Some clarification on why your regex behaves the way it does:
\$\$[^\W+]\w+$
An unescaped $ means end of string, so your pattern only matches something at the end of the string; that's why you get only the last match.
The class [^\W+] doesn't really make sense. A class starting with [^ negates the characters inside it, \W is the negation of a word character, and + inside a class is a literal +. So you are saying: match anything that is not a non-word character and not a + sign, which is just a single word character. I guess that was not what you wanted.
To match the name, \w+ alone will do. The global modifier /g ensures that matching does not stop at the first match.
Based on what you said you wanted to match, this should work. Note that it won't match $$lower_case_strings, if that's what you wanted. If not, add the "i" flag as well.
\${2}[A-Z_]+/g
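For comparison, a small Python sketch using the query from the question. In Python, re.findall plays the role of the /g flag.

```python
import re

query = ('select * from asdf where asdf.starttime = $$DATA_START_TIME '
         'and asdf.endtime = $$DATA_END_TIME')

# Original pattern: the trailing $ anchors the match to the end of the string.
anchored = re.findall(r'\$\$[^\W+]\w+$', query)

# Without the anchor, every $$NAME parameter is found.
all_params = re.findall(r'\$\$\w+', query)

# Upper-case-only variant from the last answer.
upper_only = re.findall(r'\${2}[A-Z_]+', query)

print(anchored)    # ['$$DATA_END_TIME']
print(all_params)  # ['$$DATA_START_TIME', '$$DATA_END_TIME']
print(upper_only)  # ['$$DATA_START_TIME', '$$DATA_END_TIME']
```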

How to Match The Inner Possible Result With Regular Expressions

I have a regular expression to match anything between { and } in my string.
"/{.*}/"
Couldn't be simpler. The problem arises when I have a single line with multiple matches. So if I have a line like this:
this is my {string}, it doesn't {work} correctly
The regex will match
{string}, it doesn't {work}
rather than
{string}
How do I get it to just match the first result?
Question-mark means "non-greedy"
"/{.*?}/"
Use a character class that includes everything except a right brace:
/{[^}]+}/
This will work with nested braces, but only to a depth of one: {(({.*?})*|.*?)*}
I'm not sure how to get infinite depth, or if it's even possible with regex.
The default behaviour is greedy matching, i.e. from the first { to the last }. Use lazy matching by adding a ? after your *:
/{.*?}/
or even rather than * use "not a }"
/{[^}]*}/
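The greedy, lazy, and negated-class variants can be compared side by side on the line from the question. This sketch uses Python's re module, with the braces escaped for clarity.

```python
import re

line = "this is my {string}, it doesn't {work} correctly"

# Greedy: runs from the first { to the last }.
greedy = re.search(r'\{.*\}', line).group()

# Lazy: stops at the first } it can.
lazy = re.search(r'\{.*?\}', line).group()

# Negated class: can never cross a }, so each pair is matched separately.
each_pair = re.findall(r'\{[^}]*\}', line)

print(greedy)     # {string}, it doesn't {work}
print(lazy)       # {string}
print(each_pair)  # ['{string}', '{work}']
```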