Python regular expression: how to extract 'A =BC= D' -> 'BC'

I'm at a loss because I don't know how to write a Python regular expression to extract particular substrings, for example A =BC= D =EF= -> 'BC', 'EF'. I searched a lot but couldn't get this to work. Please help.

Something like this:
=..=
Result on regex101:
Match 1
Full match 2-6 =BC=
Match 2
Full match 9-13 =EF=
Here is a nice tutorial:
Regex tutorial — A quick cheatsheet by examples
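For anyone who wants to reproduce the result above in Python rather than on regex101, here is a minimal sketch. It assumes the sample text from the question; the variable names are only illustrative.

```python
import re

text = "A =BC= D =EF="

# Each dot matches exactly one character, so =..= finds two characters
# enclosed in equals signs. finditer exposes the span of every match.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(r"=..=", text)]
for start, end, full in matches:
    print(f"Full match {start}-{end} {full}")
```

Note that this returns the full match including the equals signs; a capturing group is needed to get only the inner text.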

You could use =([^=]+)= to match one or more characters other than = between two equals signs. The capturing group gives you the contents between the equals signs.
If you want to match exactly two characters within equal signs, =([^=]{2})= should do.
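A short sketch of both patterns with re.findall, assuming the sample text from the question (the second input, "A =BCD= E", is made up to show the two-character limit):

```python
import re

text = "A =BC= D =EF="

# One or more non-= characters between equals signs; findall returns the group.
any_length = re.findall(r"=([^=]+)=", text)

# Exactly two non-= characters between equals signs.
exactly_two = re.findall(r"=([^=]{2})=", text)

# Three characters between the equals signs do not satisfy {2}.
too_long = re.findall(r"=([^=]{2})=", "A =BCD= E")

print(any_length, exactly_two, too_long)
```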

First you'll need to import the re module:
import re
Then you can use re.findall(pattern, string) to get a list of all the substrings that match your pattern.
It's not clear from your question what defines the 'particular strings' you are looking for. Assuming you are looking for everything between two equals signs, but not greedily (not including equals signs inside), you could use the regex "=(.*?)=".
import re
m = re.findall("=(.*?)=", "A =BC= D =EF=")
Result:
>>>m
['BC', 'EF']

Related

Find all groups of 9 digits (\d{9}) up to a certain word

I have the following string extracted from a PDF file and I would like to obtain the nine digits "control class" number from it:
string = ‘(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)’
I want all the matches that occur before the word “Sector”, otherwise I will have undesired matches.
I’m using the “re” module, in Python 3.8.
I tried to use the negative lookbehind as follows:
(?<!Sector:)\d{9}
However, it didn’t work. I still got matches like ‘541778469’ and ‘201874249’, which come after the word ‘Sector’.
I also tried to “isolate” the search area between the words “Process ID” and “Sector”:
(Process ID:.*?)(\d{9})(.*Sector)
I also tried to search for the expression \d{9} only up to the word “Sector”, but it returned no results.
I had to work around it in two steps: (1) I created a regex that finds everything up to the word “Sector” (desperate_regex = ‘(.*)Sector’) and assigned the result to a new variable, partial_text; (2) I then searched for the desired regex ('\d{9}') within the new variable.
My code is working, but it does not satisfy me. How would I find my matches with a single regex search?
Please note that the first "control class" number is truncated with the text that comes before it ("CONTROL CLASS706345519").
(PS: I'm a total newbie, and this is my first post. I hope I explained myself well. Thank you!)
The easiest way is to get the string before Sector and just search that:
split_string = string.split("Sector", 1)[0]  # text before the first "Sector"
nums = re.findall(r'\d{9}', split_string)
# ['706345519', '708393673', '706855190']
Another would be to use the third-party regex module, which allows overlapping matches:
import regex as re
nums = re.findall(r'(\d{9}).*?Sector', string, overlapped=True)
# ['706345519', '708393673', '706855190']
The regex described below may be more than the actual case requires, but better safe than sorry.
If you want to match a string of exactly 9 digits, no more and no fewer, then you should use negative lookbehind and lookahead assertions to ensure that the 9 digits are neither preceded nor followed by another digit (again, perhaps the OP knows that only 9-digit numbers will ever appear, in which case this is overkill). You can also use a negative lookbehind assertion to ensure that Sector does not appear before the 9 digits. This latter assertion is a variable-length assertion, which requires the regex package from PyPI:
r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)'
(?<!Sector.*?) Assert that we haven't scanned past Sector. This handles the situation where Sector might appear multiple times in the input by ensuring that we never scan past the first occurrence.
(?<!\d) Assert that the previous character is not a digit.
\d{9} Match 9 digits.
(?!\d) Assert that the next character is not a digit.
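The two digit-boundary assertions work with the standard-library re module as well. A small sketch of what they change, using a made-up sample string that contains one 14-digit run and one 9-digit run:

```python
import re

# Hypothetical sample: a 14-digit number and a genuine 9-digit number.
sample = "SAAT: 54177846900115 CLASS706345519,"

# Without boundaries, the first 9 digits of the 14-digit run also match.
loose = re.findall(r"\d{9}", sample)

# With (?<!\d) and (?!\d), only runs of exactly 9 digits match.
strict = re.findall(r"(?<!\d)\d{9}(?!\d)", sample)

print(loose)
print(strict)
```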
The simplified version:
r'(?<!Sector.*?)\d{9}'
The code:
import regex as re
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
#print(re.findall(r'(?<!Sector.*?)\d{9}', string))
print(re.findall(r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)', string))
Prints:
['706345519', '708393673', '706855190']
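The standard-library re module rejects the variable-length lookbehind used above, because it only accepts fixed-width lookbehind patterns. As a possible alternative that is not part of the answer, a lookahead can require that Sector still lies ahead of the match. This is only a sketch: if Sector occurs more than once, it accepts digits before the last occurrence rather than the first, and the re.DOTALL flag is my assumption so that the check also works across newlines.

```python
import re

string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'

# (?=.*Sector) succeeds only while "Sector" still appears later in the text,
# so 9-digit runs after it are skipped.
pattern = re.compile(r'(?<!\d)\d{9}(?!\d)(?=.*Sector)', re.DOTALL)
nums = pattern.findall(string)
print(nums)
```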
You could use an alternation and break if you find "Sector":
import re
text = """(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)"""
rx = re.compile(r'\d{9}|(Sector)')
results = []
for match in rx.finditer(text):
    if match.group(1):
        break
    results.append(match.group(0))
print(results)
Which yields
['706345519', '708393673', '706855190']
If either of these works, I'll add an explanation:
[\s\S]+(?:Process ID:\s+)(.*)(?:\s+Sector)[\s\S]+
\g<1>
Or this?
(?i)[\s\S]+(?:control\s+class\s*)(\d{9})[\s\S]+
\g<1>

Look for any character that surrounds one of any character including itself

I am trying to write a regex code to find all examples of any character that surrounds one of any character including itself in the string below:
b9fgh9f1;2w;111b2b35hw3w3ww55
So ‘b2b’ and ‘111’ would be valid, but ‘3ww5’ would not be.
Could someone please help me out here?
Thanks,
Nikhil
You can use this regex, which matches three characters where the first and third are the same (using a backreference), whereas the middle one can be any character:
(.).\1
Demo
Edit:
The above regex only gives you non-overlapping matches. Since you want all matches, including overlapping ones, you can use this regex based on a positive lookahead. It doesn't consume the next two characters but captures them in group 2, so for your desired output you concatenate group 1 and group 2.
(.)(?=(.\1))
Demo with overlapping matches
Here is some Java code (I've never programmed in Ruby) demonstrating this; you can write the same logic in your favorite programming language.
String s = "b9fgh9f1;2w;111b2b35hw3w3ww55";
Pattern p = Pattern.compile("(.)(?=(.\\1))");
Matcher m = p.matcher(s);
while(m.find()) {
    System.out.println(m.group(1) + m.group(2));
}
Prints all your intended matches,
111
b2b
w3w
3w3
w3w
Also, here is some Python code, in case you know Python:
import re
s = 'b9fgh9f1;2w;111b2b35hw3w3ww55'
# Each result is a (group 1, group 2) tuple; joining them gives the 3-character string.
for m in re.findall(r'(.)(?=(.\1))', s):
    print(m[0] + m[1])
Prints all your expected matches,
111
b2b
w3w
3w3
w3w
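To see the difference between the two patterns side by side, here is a small sketch on the string from the question (the variable names are only illustrative):

```python
import re

s = 'b9fgh9f1;2w;111b2b35hw3w3ww55'

# (.).\1 consumes all three characters, so matches cannot overlap.
non_overlapping = [m.group(0) for m in re.finditer(r'(.).\1', s)]

# (.)(?=(.\1)) consumes only the first character; the lookahead captures
# the other two, so the next search can start one character later.
overlapping = [first + rest for first, rest in re.findall(r'(.)(?=(.\1))', s)]

print(non_overlapping)
print(overlapping)
```

The consuming pattern misses the 3w3 and the second w3w because their first characters were already used by the preceding match.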

Regular expression which will match if there is no repetition

I would like to construct a regular expression which will match a password if no character repeats 4 or more times.
I have come up with a regex which matches if a character or group of characters repeats 4 times:
(?:([a-zA-Z\d]{1,})\1\1\1)
Is there any way to match only if the string doesn't contain such repetitions? I tried the approach suggested in "Regular expression to match a line that doesn't contain a word?", as I thought some combination of positive/negative lookaheads would do it. But I haven't found a working example yet.
By repetition I mean any number of characters, anywhere in the string.
Example - should not match
aaaaxbc
abababab
x14aaaabc
Example - should match
abcaxaxaz
(a appears here 4 times, but that is not a problem; I want to filter out repeating patterns)
That link was very helpful, and I was able to use it to create the regular expression from your original expression.
^(?:(?!(?<char>[a-zA-Z\d]+)\k<char>{3,}).)+$
or
^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$
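For Python users, here is a sketch that checks the numbered-group version against the examples from the question. Python's re module spells named groups as (?P<char>...) rather than (?<char>...), so the first form above does not compile there as written. The list and variable names are only illustrative.

```python
import re

# At every position, the negative lookahead rejects a run of letters/digits
# that is immediately repeated three or more times (four copies in total).
no_repeats = re.compile(r'^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$')

should_fail = ['aaaaxbc', 'abababab', 'x14aaaabc']
should_pass = ['abcaxaxaz']

results = {s: bool(no_repeats.match(s)) for s in should_fail + should_pass}
for s, ok in results.items():
    print(s, 'match' if ok else 'no match')
```

Note that the character class only covers letters and digits, so a repeated symbol such as !!!! would not be rejected by this pattern.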
Nota bene: this solution doesn't exactly answer the question. It does more than what was asked, because it rejects any character that occurs four times anywhere in the string, not only consecutive repetitions.
-----
In Python language:
import re
pat = r'(?:(.)(?!.*?\1.*?\1.*?\1.*\Z))+\Z'
regx = re.compile(pat)
for s in (':1*2-3=4#',
          ':1*1-3=4#5',
          ':1*1-1=4#5!6',
          ':1*1-1=1#',
          ':1*2-a=14#a~7&1{g}1'):
    m = regx.match(s)
    if m:
        print(m.group())
    else:
        print('--No match--')
result
:1*2-3=4#
:1*1-3=4#5
:1*1-1=4#5!6
--No match--
--No match--
This gives the regex engine a lot of work, because for each character of the string the pattern must verify that the same character doesn't occur three more times in the rest of the string.
But it works, apparently.

How to create a regular expression for this sentence?

I have the following statement: {$("#aprilfoolc").val("HoliWed27"); $("#UgadHieXampp").val("ugadicome");} and I want to get the strings out of it. I have written the following regex, but it is not working.
Please help!
(?=[\$("#]?)[\w]*(?<=[")]?)
Your lookaround assertions are using character classes by mistake, and you've confused lookbehind and lookahead. Try the following:
(?<=\$\(")\w*(?="\))
You could use this simpler one :
'{$("#aprilfoolc").val("HoliWed27");}'.match(/\$\(\"#(\w+)\"[^"]*"(\w+)"/)
This returns
["$("#aprilfoolc").val("HoliWed27"", "aprilfoolc", "HoliWed27"]
where the strings you want are at indexes 1 and 2.
This construction
(?=[\$("#]?)
is a lookahead, but only an optional one: the character set is followed by a ?. An optional lookahead succeeds whether or not the character is there, so it never constrains the next part,
[\w]
which matches word characters only. Similarly, this part
(?<=[")]?)
has no effect: text matched by \w alone can never end in " or ), and since this portion is optional as well (that ? at the end again), the lookbehind places no requirement on the match.
It's a bit unclear what you are after. Strings inside double quotes, yes, but in the first one you want to skip the hash -- why? Given your input and desired output, this ought to work:
\w+(?=")
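A quick way to try this pattern in Python, as a sketch on the full statement from the question (the variable names are only illustrative):

```python
import re

statement = '{$("#aprilfoolc").val("HoliWed27"); $("#UgadHieXampp").val("ugadicome");}'

# \w+ grabs a run of word characters; (?=") keeps it only when the next
# character is a closing double quote, which skips names such as val.
words = re.findall(r'\w+(?=")', statement)
print(words)
```

The leading # is dropped simply because it is not a word character.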
Also possible:
/\("[#]?(.*?)"\)/
import re
s='{$("#aprilfoolc").val("HoliWed27");}'
f = re.findall(r'\("[#]?(.*?)"\)',s)
for m in f:
    print(m)
I don't know why you would, but if you want to capture both groups at once:
/\("#(.*?)"\).*?\("(.*?)"\)/
import re
s='{$("#aprilfoolc").val("HoliWed27");}'
f = re.findall(r'\("#(.*?)"\).*?\("(.*?)"\)',s)
for m in f:
    print(m[0], m[1])
In JavaScript:
var s='{$("#aprilfoolc").val("HoliWed27")';
var re=/\("#(.*?)"\).*?\("(.*?)"\)/;
alert(s.match(re));

How to apply conditional treatment with line.endswith(x) where x is a regex result?

I am trying to apply conditional treatment to lines in a file (represented below by values in a list, for demonstration purposes) and would like to use a regex in the endswith(x) method, where x is a range (page-[1-100]).
import re
lines = ['http://test.com','http://test.com/page-1','http://test.com/page-2']
for line in lines:
    if line.startswith('http') and line.endswith('page-2'):
        print(line)
So the required functionality is that if the value starts with http and ends with a page in the range of 1-100 then it will be returned.
Edit: After reflecting on this, I guess the corollary questions are:
How do I make a regex pattern ie page-[1-100] a variable?
How do I then use this variable eg x in endswith(x)
Edit:
This is not an answer to the original question (ie it does not use startswith() and endswith()), and I have no idea if there are problems with this, but this is the solution I used (because it achieved the same functionality):
import re
lines = ['http://test.com','http://test.com/page-1','http://test.com/page-100']
for line in lines:
    match_beg = re.search(r'^http://', line)
    match_both = re.search(r'^http://.*page-(?:[1-9]|[1-9]\d|100)$', line)
    if match_beg and not match_both:
        print(match_beg.group())
    elif match_beg and match_both:
        print(match_both.group())
I don't know Python well enough to paste usable code, but as far as the regular expression is concerned, this is rather trivial to do:
page-(?:[2-9]|[1-9]\d|100)$
What this expression will match:
page- is just a fixed string that will be matched 1:1 (case-insensitively if you set the options for that).
(?:...) is a non-capturing group that's just used for separating the following branching.
The | characters act as "either or" between the expressions to their left and right.
[2-9] will match this numerical range, i.e. 2-9.
[1-9]\d will match any two-digit number (10-99); \d matches any digit.
100 is again a plain and simple match.
$ will match the line end or end of string (again based on settings).
Using this expression you don't use any specific "ends with" functionality (that's given through using $).
Considering this will have to parse the whole string anyway, you may include the "begins with" check as well, which shouldn't cause any additional overhead (at least none you'd notice):
^http://.*page-(?:[2-9]|[1-9]\d|100)$
^ matches the beginning of the line or string (based on settings).
http:// is once again a plain match.
. will match any character.
* is a quantifier meaning "zero or more" of the previous expression.
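In Python, the combined expression could be used roughly like this. This is only a sketch: the sample URLs are made up, and the pattern is kept exactly as written above, so page-1 is not matched; replace [2-9] with [1-9] to cover the 1-100 range mentioned in the question.

```python
import re

# Pattern from the explanation above, compiled once so it can be reused
# like any other variable.
page_pattern = re.compile(r'^http://.*page-(?:[2-9]|[1-9]\d|100)$')

# Sample values, made up for illustration.
lines = [
    'http://test.com',
    'http://test.com/page-1',
    'http://test.com/page-2',
    'http://test.com/page-57',
    'http://test.com/page-100',
    'http://test.com/page-101',
]

matching = [line for line in lines if page_pattern.search(line)]
print(matching)
```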
To get you going in the right direction, the regex that matches your needed range of pages is:
^http.*page-([2-9]|[1-9][0-9]|100)$
This will match lines that start with http and end with page-<2 to 100> inclusive.
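If you would rather keep startswith() and endswith() as in the original question, note that str.endswith also accepts a tuple of suffixes. The following is a sketch without any regex, assuming the range 1-100 from the question; the sample URLs are made up.

```python
# Build every allowed suffix once: page-1, page-2, ..., page-100.
page_suffixes = tuple(f'page-{n}' for n in range(1, 101))

# Sample values, made up for illustration.
lines = [
    'http://test.com',
    'http://test.com/page-1',
    'http://test.com/page-100',
    'http://test.com/page-101',
]

# endswith returns True if the line ends with any suffix in the tuple.
selected = [
    line for line in lines
    if line.startswith('http') and line.endswith(page_suffixes)
]
print(selected)
```

Lines read from a file usually end with a newline, so you would probably want to call line.strip() before testing them.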