Find all groups of 9 digits (\d{9}) up to a certain word - regex

I have the following string extracted from a PDF file and I would like to obtain the nine-digit "control class" numbers from it:
string = ‘(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)’
I want all the matches that occur before the word “Sector”, otherwise I will have undesired matches.
I’m using the “re” module, in Python 3.8.
I tried to use the negative lookbehind as follows:
(?<!Sector:)\d{9}
However, it didn’t work. I still got matches like ‘541778469’ and ‘201874249’, which come after the word ‘Sector’.
I also tried to “isolate” the search area between the words “Process ID” and “Sector”:
(Process ID:.*?)(\d{9})(.*Sector)
I also tried to search for the expression \d{9} only up to the word “Sector”, but it returned no results.
I had to work around it in two steps: (1) I created a regex that finds everything up to the word “Sector” (desperate_regex = ‘(.*)Sector’) and assigned the result to a new variable, partial_text; (2) I then searched for the desired regex ('\d{9}') within the new variable.
My code works, but it does not satisfy me. How can I find my matches with a single regex search?
Please note that the first "control class" number is truncated with the text that comes before it ("CONTROL CLASS706345519").
(PS: I'm a total newbie, and this is my first post. I hope I explained myself clearly. Thank you!)

The easiest way is to get the string before Sector and just search that:
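import re
# Keep only the text before the first "Sector" (also works if "Sector" occurs more than once)
split_string = string.split("Sector", 1)[0]
nums = re.findall(r'\d{9}', split_string)
# ['706345519', '708393673', '706855190']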
Another would be to use the third-party regex module, which allows overlapping matches:
import regex as re
nums = re.findall(r'(\d{9}).*?Sector', string, overlapped=True)
# ['706345519', '708393673', '706855190']
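For a single-pattern alternative that stays within the standard re module, a lookahead may be enough. The following is a minimal sketch, not taken from the answers here. It assumes that "Sector" occurs only once after the numbers of interest; if it occurs several times, every 9-digit run before the last occurrence would be returned. The variable sample is a hypothetical, shortened version of the question's string.

```python
import re

# Shortened, hypothetical sample modeled on the string in the question
sample = ("Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 "
          "CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. "
          "SAAT: 54177846900115Date of Production2019/12/20")

# \d{9} matches only where "Sector" still appears somewhere later in the text.
# re.DOTALL lets "." cross newlines in case the extracted text spans several lines.
nums = re.findall(r'\d{9}(?=.*Sector)', sample, flags=re.DOTALL)
print(nums)
```

The lookahead is re-evaluated for every candidate match, so on very long inputs the split-then-search approach is likely cheaper.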

The regex described below may be more than the actual case requires, but better safe than sorry.
If you want to match a string of exactly 9 digits, no more and no fewer, you should use negative lookbehind and lookahead assertions to ensure that the 9 digits are neither preceded nor followed by another digit (again, perhaps the OP knows that only 9-digit numbers will ever appear, in which case this is overkill). You can also use a negative lookbehind assertion to ensure that Sector does not appear before the 9 digits. This latter assertion is a variable-length lookbehind, which the standard re module does not support; it requires the regex package from PyPI:
r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)'
(?<!Sector.*?) Assert that we haven't scanned past Sector. This handles the situation where Sector might appear multiple times in the input by ensuring that we never scan past the first occurrence.
(?<!\d) Assert that the previous character is not a digit.
\d{9} Match 9 digits.
(?!\d) Assert that the next character is not a digit.
The simplified version:
r'(?<!Sector.*?)\d{9}'
The code:
import regex as re
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
#print(re.findall(r'(?<!Sector.*?)\d{9}', string))
print(re.findall(r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)', string))
Prints:
['706345519', '708393673', '706855190']
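The effect of the two digit guards can be seen in isolation with the standard re module, since both are fixed-width assertions. This is a small sketch on a hypothetical fragment (the variable fragment is made up for illustration): without the guards, the first 9 digits of a longer number are picked up as well.

```python
import re

# Hypothetical fragment: a 14-digit number followed by a genuine 9-digit one
fragment = "SAAT: 54177846900115 CLASS706345519,"

plain = re.findall(r'\d{9}', fragment)
guarded = re.findall(r'(?<!\d)\d{9}(?!\d)', fragment)

print(plain)    # also contains the first 9 digits of the 14-digit number
print(guarded)  # only the run of exactly 9 digits
```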

You could use an alternation and break if you find "Sector":
import re
text = """(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)"""
rx = re.compile(r'\d{9}|(Sector)')
results = []
for match in rx.finditer(text):
    if match.group(1):
        break
    results.append(match.group(0))
print(results)
Which yields
['706345519', '708393673', '706855190']

If either of these works, I'll add an explanation:
[\s\S]+(?:Process ID:\s+)(.*)(?:\s+Sector)[\s\S]+
\g<1>
Or this?
(?i)[\s\S]+(?:control\s+class\s*)(\d{9})[\s\S]+
\g<1>

Related

How to retrieve the targeted substring, if the number of characters can vary?

I want to retrieve from input similar to the following: code="XY85XXXX", the substring between "".
In case of a fixed number of 8 characters I can retrieve the value with (?<=code=").{8}.
But the targeted substring length varies, 7 or 9, or somewhere in the range between 3 and 11 (as in the examples below) and that is what I need to also handle.
Input can for example be code="XY85XXXX765" or code="123".
How must I adjust the regex to achieve that flexibility?
You can use a positive lookbehind to 'anchor' your matches to the fixed part (?<=code=") and a negated character class that allows any character except ", occurring one or more times:
(?<=code=")[^"]+
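As a quick check of how the negated character class adapts to different lengths, here is a minimal sketch in Python (chosen only for illustration; the question does not name a language). The inputs are the examples from the question.

```python
import re

# Lookbehind pins the match to code=", the negated class stops at the closing quote
pattern = re.compile(r'(?<=code=")[^"]+')

values = []
for text in ('code="XY85XXXX"', 'code="XY85XXXX765"', 'code="123"'):
    match = pattern.search(text)
    values.append(match.group() if match else None)

print(values)
```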
You can use a lookahead and lookbehind both searching for quotes:
(?<=").*(?=")
let rx = /(?<=").*(?=")/;
let extract = (txt) => console.log(txt.match(rx)[0]);
extract('code="XY85XXXX"');
extract('code="Y85XXXX"');
extract('code="ZXY85XXXXZ"');
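One thing to watch for with the greedy .* between two quote lookarounds: if the input contains more than one quoted value, the match may run from the first opening quote to the last closing quote. The sketch below uses Python and a hypothetical input with two attributes to show the difference from the negated-class pattern.

```python
import re

text = 'code="123" name="abc"'  # hypothetical input with two quoted values

# Greedy .* extends as far as possible while still being followed by a quote
greedy = re.search(r'(?<=").*(?=")', text).group()

# The negated class cannot cross a quote, so it stops at the first closing quote
negated = re.search(r'(?<=code=")[^"]+', text).group()

print(greedy)
print(negated)
```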
I've copied the solution ( (?<=code=")[^"]+) in this tool https://regex101.com/ for PHP.
Ok, I get my result but when I select in the tool .NET I have no result.
What should/must be changed?

Look for any character that surrounds one of any character including itself

I am trying to write a regex code to find all examples of any character that surrounds one of any character including itself in the string below:
b9fgh9f1;2w;111b2b35hw3w3ww55
So ‘b2b’ and ‘111’ would be valid, but ‘3ww5’ would not be.
Could someone please help me out here?
Thanks,
Nikhil
You can use this regex, which matches three characters where the first and third are the same (using a backreference), whereas the middle one can be anything:
(.).\1
Demo
Edit:
The regex above only gives you non-overlapping matches. Since you want all matches, including overlapping ones, you can use this positive-lookahead-based regex, which does not consume the next two characters but captures them in group 2; for your desired output, concatenate group 1 and group 2.
(.)(?=(.\1))
Demo with overlapping matches
Here is some Java code (I've never programmed in Ruby) demonstrating this; you can write the same logic in your favorite programming language.
String s = "b9fgh9f1;2w;111b2b35hw3w3ww55";
Pattern p = Pattern.compile("(.)(?=(.\\1))");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group(1) + m.group(2));
}
Prints all your intended matches,
111
b2b
w3w
3w3
w3w
Also, here is a Python code that may help if you know Python,
import re
s = 'b9fgh9f1;2w;111b2b35hw3w3ww55'
matches = re.findall(r'(.)(?=(.\1))', s)
for m in matches:
    print(m[0] + m[1])
Prints all your expected matches,
111
b2b
w3w
3w3
w3w

Regular expression which will match if there is no repetition

I would like to construct regular expression which will match password if there is no character repeating 4 or more times.
I have come up with regex which will match if there is character or group of characters repeating 4 times:
(?:([a-zA-Z\d]{1,})\1\1\1)
Is there any way to match only if the string doesn't contain the repetitions? I tried the approach suggested in Regular expression to match a line that doesn't contain a word? as I thought some combination of positive/negative lookaheads would do it. But I haven't found a working example yet.
By repetition I mean any number of characters anywhere in the string
Example - should not match
aaaaxbc
abababab
x14aaaabc
Example - should match
abcaxaxaz
(a is here 4 times but it is not problem, I want to filter out repeating patterns)
That link was very helpful, and I was able to use it to create the regular expression from your original expression.
^(?:(?!(?<char>[a-zA-Z\d]+)\k<char>{3,}).)+$
or
^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$
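To see how the numbered-group version behaves on the examples from the question, here is a minimal Python sketch (Python is an assumption; the question does not name a regex flavor, and the variable names are made up for illustration).

```python
import re

# At every position, refuse to continue if a group repeated 4+ times starts there
no_repeats = re.compile(r'^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$')

samples = ['aaaaxbc', 'abababab', 'x14aaaabc', 'abcaxaxaz']
results = {s: bool(no_repeats.search(s)) for s in samples}
print(results)
```

Only the last sample matches: it contains 'a' four times, but never as four consecutive repetitions of the same group.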
Nota bene: this solution does not answer the question exactly; it goes further than the stated need.
-----
In Python language:
import re
pat = r'(?:(.)(?!.*?\1.*?\1.*?\1.*\Z))+\Z'
regx = re.compile(pat)
for s in (':1*2-3=4#',
          ':1*1-3=4#5',
          ':1*1-1=4#5!6',
          ':1*1-1=1#',
          ':1*2-a=14#a~7&1{g}1'):
    m = regx.match(s)
    if m:
        print(m.group())
    else:
        print('--No match--')
result
:1*2-3=4#
:1*1-3=4#5
:1*1-1=4#5!6
--No match--
--No match--
This gives the regex engine a lot of work, because for each character of the string, the pattern must verify that the character does not occur three more times in the rest of the string.
But it works, apparently.
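Since this pattern effectively rejects any string in which some character occurs four or more times, the same check can probably be done without a regex. The sketch below is an alternative technique, not the answer's method; the function name no_char_four_times is hypothetical. It differs slightly from the regex in that it treats newlines like any other character.

```python
from collections import Counter

def no_char_four_times(s: str) -> bool:
    """Return True if s is non-empty and no character occurs 4 or more times."""
    return bool(s) and max(Counter(s).values()) < 4

samples = [':1*2-3=4#', ':1*1-3=4#5', ':1*1-1=4#5!6',
           ':1*1-1=1#', ':1*2-a=14#a~7&1{g}1']
checks = [no_char_four_times(s) for s in samples]
print(checks)
```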

How to apply conditional treatment with line.endswith(x) where x is a regex result?

I am trying to apply conditional treatment to lines in a file (represented below by the values of a list, for demonstration purposes) and would like to use a regex in the endswith(x) method, where x is a range (page-[1-100]).
import re
lines = ['http://test.com','http://test.com/page-1','http://test.com/page-2']
for line in lines:
    if line.startswith('http') and line.endswith('page-2'):
        print(line)
So the required functionality is that if the value starts with http and ends with a page in the range of 1-100 then it will be returned.
Edit: After reflecting on this, I guess the corollary questions are:
How do I make a regex pattern ie page-[1-100] a variable?
How do I then use this variable eg x in endswith(x)
Edit:
This is not an answer to the original question (ie it does not use startswith() and endswith()), and I have no idea if there are problems with this, but this is the solution I used (because it achieved the same functionality):
import re
lines = ['http://test.com','http://test.com/page-1','http://test.com/page-100']
for line in lines:
    match_beg = re.search(r'^http://', line)
    match_both = re.search(r'^http://.*page-(?:[1-9]|[1-9]\d|100)$', line)
    if match_beg and not match_both:
        print(match_beg.group())
    elif match_beg and match_both:
        print(match_both.group())
I don't know python well enough to paste usable code, but as far as the regular expression is concerned, this is rather trivial to do:
page-(?:[2-9]|[1-9]\d|100)$
What this expression will match:
page- is just a fixed string that will be matched 1:1 (case-insensitively if you set the corresponding option).
(?:...) is a non-capturing group that's just used for separating the following branching.
| all act as "either or" with the expressions being to their left/right.
[2-9] will match this numerical range, i.e. 2-9.
[1-9]\d will match any two-digit number (10-99); \d matches any digit.
100 is again a plain and simple match.
$ will match the line end or end of string (again based on settings).
Using this expression you don't use any specific "ends with" functionality (that's given through using $).
Considering this will have to parse the whole string anyway, you may include the "begins with" check as well, which shouldn't cause any additional overhead (at least none you'd notice):
^http://.*page-(?:[2-9]|[1-9]\d|100)$
^ matches the beginning of the line or string (based on settings).
http:// is once again a plain match.
. will match any character.
* is a quantifier "none or more" for the previous expression.
To get you going in the right direction, the regex that matches your needed range of pages is:
^http.*page-([2-9]|[1-9][0-9]|100)$
This will match lines that start with http and end with page-<2 to 100> inclusive.
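On the corollary questions: str.endswith() accepts a string or a tuple of strings, not a regex, so there are two ways to keep the check in a variable. The sketch below is a hedged illustration in Python 3, using the range 1 to 100 as stated in the question; the names page_pattern and suffixes and the page-101 test line are made up for the example.

```python
import re

lines = ['http://test.com', 'http://test.com/page-1',
         'http://test.com/page-100', 'http://test.com/page-101']

# Option 1: a compiled pattern stored in a variable
page_pattern = re.compile(r'^http.*page-(?:[1-9]|[1-9]\d|100)$')
by_regex = [line for line in lines if page_pattern.search(line)]

# Option 2: no regex at all, endswith() with a tuple of allowed suffixes
suffixes = tuple(f'page-{n}' for n in range(1, 101))
by_suffix = [line for line in lines
             if line.startswith('http') and line.endswith(suffixes)]

print(by_regex)
print(by_suffix)
```

Both options select the same lines here; the tuple version stays closest to the startswith()/endswith() style of the original code.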

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, replace the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't simply pull two bits (name and extension) of the string out as one match: regex patterns match contiguous chunks only.
Try this (not tested); it should work for any 'my_file_name' length, any number of digits, and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the longest match that is followed by the 16-character date/time part (an underscore, 8 characters, another underscore, and 6 characters).
Once you've changed it, it returns that same match plus the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better if you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages:
get the initial part
get the extension
The reason you get my_file_name_01 when you add that is that lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the added .{3} will then consume the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.
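The zero-width behavior described above can be reproduced outside PowerShell. The following sketch uses Python's re module as a stand-in (an assumption made for illustration; the constructs used here behave the same way in both engines) and the file name from the question.

```python
import re

name = 'my_file_name_01012013_111546.xls'

# The lookahead only checks what follows; it consumes nothing
base = re.match(r'.*(?=_.{8}_.{6})', name).group()

# .{3} therefore consumes the 3 characters right after the base name
base_plus_three = re.match(r'.*(?=_.{8}_.{6}).{3}', name).group()

# Removing the date/time part instead leaves name and extension in one string
cleaned = re.sub(r'_[0-9]{8}_[0-9]{6}', '', name)

print(base, base_plus_three, cleaned)
```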