Python: RE only captures first and last match - regex

I'm trying to make a Regular Expression that captures the following:
- XX or XX:XX, up to 6 repetitions (XX:XX:XX:XX:XX:XX), where X is a hexadecimal number.
In other words, I'm trying to capture MAC addresses that can range from 1 to 6 bytes.
regex = re.compile("^([0-9a-fA-F]{2})(?:(?:\:([0-9a-fA-F]{2})){0,5})$")
The problem is that if I enter for example "11:22:33", it only captures the first match and the last, which results in ["11", "33"].
The question: is there any way to make the {0,5} quantifier capture all repetitions, and not just the last one?
Thanks!

Not in Python, no. But you can first check the format with your regex, and then simply split the string at ':':
result = s.split(':')
Also note that you should always write regular expressions as raw strings (otherwise you get problems with escaping). And your outer non-capturing group does nothing.
Technically there is a way to do it with regex only, but the regex is quite horrible:
r"^([0-9a-fA-F]{2})(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?$"
But here you would always get six captures, just that some might be empty (None in Python).
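A minimal sketch of the check-then-split approach, assuming Python 3. The names MAC_PREFIX and split_mac are hypothetical, chosen only for illustration, and re.fullmatch stands in for the ^...$ anchors.

```python
import re

# One byte, then up to five more ":XX" bytes. No capture groups are needed,
# because the pieces come from str.split, not from the regex.
MAC_PREFIX = re.compile(r"[0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){0,5}")


def split_mac(s):
    """Return the bytes as a list of strings, or None if the format is wrong."""
    if MAC_PREFIX.fullmatch(s) is None:
        return None
    return s.split(":")


print(split_mac("11:22:33"))  # ['11', '22', '33']
print(split_mac("11:22:zz"))  # None
```

The regex only validates here; a repeated capture group keeps just its last repetition, which is why the splitting is left to str.split.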

Related

Combining 2 regular expressions

I have 2 strings and I would like to get a result that gives me everything before the first '\n\n'.
'1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\nÁltalános kitöltési előírások\nI.\nA felügyeleti jelentésre vonatkozó általános szabályok\n\n1.
'12. melléklet a 40/2018. (XI. 14.) MNB rendelethez\n\nÁltalános kitöltési előírások\n\nKapcsolódó jogszabályok\naz Önkéntes Kölcsönös Biztosító Pénztárakról szóló 1993. évi XCVI. törvény (a továbbiakban: Öpt.);\na személyi jövedelemadóról szóló 1995. évi CXVII.
I have been trying to combine 2 regular expressions to solve my problem; however, I may be on the wrong track. Maybe a function would be easier; I do not know.
Here is one that I believe finds the character 'z':
extended regex : [\z+$]
I guess finding the first number is: [^0-9.].+
My problem is how to combine these two expressions to get the string in between them.
Is there a more efficient way to do this?
You may use
re.findall(r'^(\d.*?)(?:\n\n|$)', s, re.S)
Or with re.search, since it seems that only one match is expected:
m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
if m:
    print(m.group(1))
Pattern details
^ - start of a string
(\d.*?) - Capturing group 1: a digit and then any 0+ chars, as few as possible
(?:\n\n|$) - a non-capturing group matching either two newlines or end of string.
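As a rough, self-contained illustration of the pattern above, assuming Python 3: the helper name first_block and the shortened sample string are made up for this sketch.

```python
import re


def first_block(s):
    """Return the text from the leading digit up to the first blank line."""
    # re.S lets "." cross single newlines; the lazy ".*?" stops at the
    # first "\n\n", or at the end of the string if there is none.
    m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
    return m.group(1) if m else None


sample = ('1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\n'
          'Általános kitöltési előírások\nI.')
print(first_block(sample))  # 1. melléklet a 37/2018. (XI. 13.) MNB rendelethez
```

If the leading digit does not need to be checked, s.partition('\n\n')[0] gives the same text without a regex.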

re.sub (python) substitute part of the matched string

I have a series of strings which are identifiable by finding a substring "p" tag followed by at least two CAPITAL letters.
Input:
<p>JIM <p>SALLY <p>ROBERT <p>Eric
I want to change the "p" tag to an "i" tag if it's followed by those two capital letters (so not the last one, 'Eric').
Desired output:
<i>JIM <i>SALLY <i>ROBERT <p>Eric
I've tried this using regular expressions in Python:
import re
Mytext = "<p>JIM <p>SALLY <p>ROBERT <p>Eric"
changeTags = re.sub('<p>[A-Z]{2}', '<i>' + re.search('<p>[A-Z]{2}', Mytext).group()[-2:], Mytext)
print(changeTags)
But the output uses the "i" tag + JI in every instance, rather than iterating through to use SA and then RO in entries 2 and 3.
<i>JIM <i>JILLY <i>JIBERT <p>Eric
I believe the problem is that I don't understand the .group() method properly. Can anyone advise what I've done wrong?
Thank you.
Another way using look-ahead assertion:
re.sub(r'<p>(?=[A-Z]{2,})','<i>',Mytext)
Your inner re.search is only evaluated once, and its result is passed as one of the parameters to re.sub. It can't possibly capture every pair of capital letters, only the first one. So it is the approach itself that cannot work, not merely your understanding of groups.
Furthermore, using groups is unnecessary.
You need to capture the capital letters using parentheses, and reference them as \1 in the substitution expression:
re.sub('<p>([A-Z]{2})', r'<i>\1', Mytext)
\1 here means: replace with the substring matched by the first (...) in the regular expression.
Note the leading r in front of the substitution string, to make it raw.
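Both suggestions can be compared side by side. This is a small sketch assuming Python 3; the variable names are illustrative only.

```python
import re

text = "<p>JIM <p>SALLY <p>ROBERT <p>Eric"

# Capture the two capitals and put them back with \1.
with_group = re.sub(r'<p>([A-Z]{2})', r'<i>\1', text)

# Look ahead for the capitals without consuming them, so only the tag
# itself is replaced.
with_lookahead = re.sub(r'<p>(?=[A-Z]{2})', '<i>', text)

print(with_group)      # <i>JIM <i>SALLY <i>ROBERT <p>Eric
print(with_lookahead)  # <i>JIM <i>SALLY <i>ROBERT <p>Eric
```

In both cases re.sub re-evaluates the match for every occurrence, which is what the single up-front re.search call could not do.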

Regular expression which will match if there is no repetition

I would like to construct a regular expression which will match a password if no character repeats 4 or more times.
I have come up with a regex which matches if there is a character or group of characters repeating 4 times:
(?:([a-zA-Z\d]{1,})\1\1\1)
Is there any way to match only if the string doesn't contain such repetitions? I tried the approach suggested in "Regular expression to match a line that doesn't contain a word?", as I thought some combination of positive/negative lookaheads would do it. But I haven't found a working example yet.
By repetition I mean any number of characters anywhere in the string
Example - should not match
aaaaxbc
abababab
x14aaaabc
Example - should match
abcaxaxaz
(a is here 4 times but it is not problem, I want to filter out repeating patterns)
That link was very helpful, and I was able to use it to create the regular expression from your original expression.
^(?:(?!(?<char>[a-zA-Z\d]+)\k<char>{3,}).)+$
or
^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$
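A quick way to try the numbered-group version against the examples from the question, assuming Python 3. Python's re module spells named groups (?P<char>...), so this sketch uses the \1 form; the names NO_REPEAT and has_no_fourfold_repeat are hypothetical.

```python
import re

# At every position, refuse to continue if some run of letters/digits
# starting there is immediately repeated three more times.
NO_REPEAT = re.compile(r'^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$')


def has_no_fourfold_repeat(password):
    """True if no substring occurs four or more times in a row."""
    return NO_REPEAT.match(password) is not None


for candidate in ('aaaaxbc', 'abababab', 'x14aaaabc', 'abcaxaxaz'):
    print(candidate, has_no_fourfold_repeat(candidate))
```

Only the last candidate, abcaxaxaz, is accepted.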
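Nota bene: this solution doesn't answer the question exactly; it does more than the stated need requires, because it rejects any character that occurs four or more times anywhere in the string, not only consecutive repetitions.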
-----
In Python:
import re

# Each character must not occur three more times later in the string.
pat = r'(?:(.)(?!.*?\1.*?\1.*?\1.*\Z))+\Z'
regx = re.compile(pat)
for s in (':1*2-3=4#',
          ':1*1-3=4#5',
          ':1*1-1=4#5!6',
          ':1*1-1=1#',
          ':1*2-a=14#a~7&1{g}1'):
    m = regx.match(s)
    if m:
        print(m.group())
    else:
        print('--No match--')
result
:1*2-3=4#
:1*1-3=4#5
:1*1-1=4#5!6
--No match--
--No match--
This gives the regex engine a lot of work, because for each character of the string it runs through, the pattern must verify that the current character isn't found three more times in the characters that follow it.
But it works, apparently.

how to create regular expression for this sentence?

I have the following statement: {$("#aprilfoolc").val("HoliWed27"); $("#UgadHieXampp").val("ugadicome");}. I want to get the string with combination. I have written the following regex, but it is not working.
Please help!
(?=[\$("#]?)[\w]*(?<=[")]?)
Your lookaround assertions are using character classes by mistake, and you've confused lookbehind and lookahead. Try the following:
(?<=\$\(")\w*(?="\))
You could use this simpler one :
'{$("#aprilfoolc").val("HoliWed27");}'.match(/\$\(\"#(\w+)\"[^"]*"(\w+)"/)
This returns
["$("#aprilfoolc").val("HoliWed27"", "aprilfoolc", "HoliWed27"]
where the strings you want are at indexes 1 and 2.
This construction
(?=[\$("#]?)
is a lookahead, but an optional one: the character set is followed by a ?. That defeats its purpose, because the next part,
[\w]*
matches word characters only. So the lookahead can never usefully match one of its characters: whatever follows must also be a word character. Similarly, this part
(?<=[")]?)
can never match " or ), because a string that matches \w only cannot end in one of those characters. Since this portion is optional as well (that ? at the end again), it simply succeeds without checking anything.
It's a bit unclear what you are after. Strings inside double quotes, yes, but in the first one you want to skip the hash -- why? Given your input and desired output, this ought to work:
\w+(?=")
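The lookahead-only pattern above is easy to try out. This is a sketch in Python 3, using the full statement from the question as input; the variable names are illustrative.

```python
import re

statement = ('{$("#aprilfoolc").val("HoliWed27"); '
             '$("#UgadHieXampp").val("ugadicome");}')

# A run of word characters counts only if a double quote follows it,
# so "val" (followed by "(") is skipped and the leading "#" is never
# part of a match.
found = re.findall(r'\w+(?=")', statement)
print(found)  # ['aprilfoolc', 'HoliWed27', 'UgadHieXampp', 'ugadicome']
```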
Also possible:
/\("[#]?(.*?)"\)/
import re
s = '{$("#aprilfoolc").val("HoliWed27");}'
f = re.findall(r'\("[#]?(.*?)"\)', s)
for m in f:
    print(m)
I don't know why you would, but if you want to capture both groups at once:
/\("#(.*?)"\).*?\("(.*?)"\)/
import re
s = '{$("#aprilfoolc").val("HoliWed27");}'
f = re.findall(r'\("#(.*?)"\).*?\("(.*?)"\)', s)
for m in f:
    print(m[0], m[1])
In JavaScript:
var s='{$("#aprilfoolc").val("HoliWed27")';
var re=/\("#(.*?)"\).*?\("(.*?)"\)/;
alert(s.match(re));

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't simply pull two bits (name and extension) of the string out as one match, because regex patterns match contiguous chunks only.
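The same two ideas, sketched in Python 3 rather than PowerShell (an assumption made only so the snippet is self-contained; the regexes are the ones given above and the variable names are illustrative).

```python
import re

name = 'my_file_name_01012013_111546.xls'

# Option 1: capture the name and the extension, then join the two groups.
m = re.match(r'^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$', name)
rebuilt = m.group(1) + m.group(2)

# Option 2: delete the date/time part by replacing it with nothing.
stripped = re.sub(r'_[0-9]{8}_[0-9]{6}', '', name)

print(rebuilt)   # my_file_name.xls
print(stripped)  # my_file_name.xls
```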
Try this (not tested); it should work for any 'my_file_name' length, any number of digits, and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the longest match that is followed by the 16-character date/time part (_ 8 characters _ 6 characters).
Once you've changed it, it returns that same longest match plus the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better in case you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages:
- get the initial part
- get the extension
The reason you get my_file_name_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.
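The zero-width behaviour described above can be reproduced directly. This sketch uses Python 3 instead of PowerShell (an assumption for the sake of a runnable snippet); the two patterns are the ones from the question.

```python
import re

name = "my_file_name_01012013_111546.xls"

# The lookahead checks what follows but consumes nothing, so the match
# ends right after "my_file_name".
base = re.match(r".*(?=_.{8}_.{6})", name).group()

# Matching resumes where the lookahead was tested, so .{3} consumes the
# first three characters of the date/time part, not the extension.
plus_three = re.match(r".*(?=_.{8}_.{6}).{3}", name).group()

print(base)        # my_file_name
print(plus_three)  # my_file_name_01
```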