Regex function to find all and only 6 digit numeric string ignoring spaces if any any between [duplicate] - regex

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I have HTML source page as text file.
I need to read file and find out only those numeric strings which have 6 continous digits and can have a space in between those 6 digits
Eg
209 016 - should be come up in search result and as 400013(space removed)
209016 - should also come up in search and unaltered as 209016
any numeric string more then 6 digits long should not come up in search eg 20901677,209016#223, 29016,
I think this can be achieved by regex but I was not able to
A soln in regex is more desirable but anything else is also welcome

To match 6 digits with any number of spaces in between, you may use the following pattern:
\b(?:\d[ ]*?){6}\b
Or if you want to reject it when it's followed by an #, you may use:
\b(?:\d[ ]*?){6}\b(?!#)
Regex demo.
Then, you can use the replace method to remove the space characters.
Python example:
import re
regex = r"\b(?:\d[ ]*?){6}\b(?!#)"
test_str = ("209016 \n"
"209 016\n"
"20901677','209016#223', '29016")
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
print (match.group().replace(" ", ""))
Output:
209016
209016
Try it online.

You can try the following regex:
\b(?<!#)\d(?:\s*\d){5}\b(?!#)
demo: https://regex101.com/r/ZCcDmF/2/
But note that you might have to modify your boundaries if you need to exclude more than the #. it will become something like:
\b(?<!#|other char I need to exclude|another one|...)\d(?:\s*\d){5}\b(?!#|other char I need to exclude|another one|...)
where you have to replace other char I need to exclude, another one,... by the characters.

Related

Find all groups of 9 digits (\d{9}) up to a certain word

I have the following string extracted from a PDF file and I would like to obtain the nine digits "control class" number from it:
string = ‘(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)’
I want all the matches that occur before the word “Sector”, otherwise I will have undesired matches.
I’m using the “re” module, in Python 3.8.
I tried to use the negative lookbehind as follows:
(?<!Sector:)\d{9})
However, it didn’t work. I still had the matches like ‘54177846’ and ‘201874249’, which are after the ‘Sector’ word.
I also tried to “isolate” the search area between the words “Process ID” and “Sector”:
(Process ID:.*?)(\d{9})(.*Sector)
I also tried to search for the expression \d9 only up to the “Sector” word, but it returned no results.
I had to work a solution around, in two steps: (1) I created a regex that would find all the results up to the word “Sector” (desperate_regex = ‘(.*)Sector)’ and assigned it to a new variable,partial_text`; (2) I then searched for the desired regex ('\d{9}') within the new variable.
My code is working, but it does not satisfies me. How would I find my matches with a single regex search?
Please note that the first "control class" number is truncated with the text that comes before it ("CONTROL CLASS706345519").
(PS: I'm a totally newbie, and this is my first post. I hope I could explain my self. Thank you!)
The easiest way is to get the string before Sector and just search that:
split_string, _ = string.split("Sector")
nums = re.findall(r'\d{9}', split_string)
# ['706345519', '708393673', '706855190']
Another would be to use the third-party regex module, which allows overlapping matches:
import regex as re
nums = re.findall(r'(\d{9}).*?Sector', string, overlapped=True)
# ['706345519', '708393673', '706855190']
The regex described below may be more overkill then required for the actual case being handled, but better safe than sorry.
If you want match a string of exactly 9 digits, no more no fewer, then you should you negative lookbehind and lookahead assertions to ensure that the 9 digits are not preceded nor followed by another digit (again, in this case perhaps the OP knows that only 9-digit numbers will ever appear and this is overkill). You can also use a negative lookbehind assertion to ensure that Sector does not appear before the 9 digits. This later assertion is a variable length assertion requiring the regex package from PyPI:
r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)'
(?<!Sector.*? Assert that we haven't scanned past Sector. This handles the situation where Sector might appear multiple times in the input by ensuring that we never scan past the first occurrence.
(?<!\d) Assert that the previous character is not a digit.
\d{9} Match 9 digits.
(?!\d) Assert that the next character is not a digit.
The simplified version:
r'(?<!Sector.*?)\d{9}'
The code:
import regex as re
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
#print(re.findall(r'(?<!Sector.*?)\d{9}', string))
print(re.findall(r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)', string))
Prints:
['706345519', '708393673', '706855190']
You could use an alternation and break if you find "Sector":
import re
text = """(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)"""
rx = re.compile(r'\d{9}|(Sector)')
results = []
for match in rx.finditer(text):
if match.group(1):
break
results.append(match.group(0))
print(results)
Which yields
['706345519', '708393673', '706855190']
If either of these work I'll add an explaination to it:
[\s\S]+(?:Process ID:\s+)(.*)(?:\s+Sector)[\s\S]+
\g<1>
Or this?
(?i)[\s\S]+(?:control\s+class\s*)(\d{9})[\s\S]+
\g<1>

Python regex to parse '#####' text in description field [duplicate]

This question already has answers here:
regex to extract mentions in Twitter
(2 answers)
Extracting #mentions from tweets using findall python (Giving incorrect results)
(3 answers)
Closed 3 years ago.
Here's the line I'm trying to parse:
#abc def#gmail.com #ghi j#klm #nop.qrs #tuv
And here's the regex I've gotten so far:
#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
My goal is to get ['#abc', '#ghi', '#tuv'], but no matter what I do, I can't get 'j#klm' to not match. Any help is much appreciated.
Try using re.findall with the following regex pattern:
(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)
inp = "#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"
matches = re.findall(r'(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)', inp)
print(matches)
This prints:
['#abc', '#ghi', '#tuv']
The regex calls for an explanation. The leading lookbehind (?:(?<=^)|(?<=\s)) asserts that what precedes the # symbol is either a space or the start of the string. We can't use a word boundary here because # is not a word character. We use a similar lookahead (?=\s|$) at the end of the pattern to rule out matching things like #nop.qrs. Again, a word boundary alone would not be sufficient.
just add the line initiation match at the beginning:
^#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
it shoud work!

Remove parenthesis if length > max length

I have a number of models like
"Dell Inspiron 6 (i7-8550U/8GB/256GB/Radeon 520/FHD/W10)"
"Lenovo V130 (i5-7200U/4GB/128GB/FHD/W10) (2017)"
"Dell XPS 13 Touch (i7-8550U/16GB/512GB/UHD/W10)"
and I want to remove the parenthesis where it mentions the specs.
The final output should be something like this
"Dell Inspiron 6"
"Lenovo V130 (2017)"
"Dell XPS 13 Touch"
I managed to get the parenthesis contents with this regex
\((.+?)\)
However, the regex returns both parentheses.
Is there any way to only get the text that is greater than a number of characters?
Here is a regex example.
You may use this regex to match (...) where there are 6+ characters between ( and ):
\s*\(([^)]{6,})\)
Updated RegEx Demo
RegEx Details:
[^)]: matches any character except )
{6,}: Quantifier to match 6 or more characters
Use a regex substitution, and sub in an empty string. For example, in Python, you could do the following:
import re
name = 'Dell Inspiron 6 (i7-8550U/8GB/256GB/Radeon 520/FHD/W10)'
re.sub("\s\(.{5,}\)", "", name)
Output
'Dell Inspiron 6'
Note that I used {5,} which will match only parathentical strings with 5 or more characteres, which will leave the years (e.g., (2017)) in place.
You can exploit some text that is surely there in your first brackets and not in second brackets like a slash / and change your regex from,
\((.+?)\)
to
\((.*\/[^)]*?)\)
And replace it with empty string.
Live Demo

Python 2.7 Regex Tokenizer Implementation Not Working [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 5 years ago.
I created a regular expression to match tokens in a german text which is of type string.
My Regular expression is working as expected using regex101.com. Here is a link of my regex with an example sentence: My regex + example on regex101.com
So I implemented it in python 2.7 like this:
GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
([A-ZÄÖÜ]\.)+ # abbrevations including ÄÖÜ
|\d+([.,]\d+)?([€$%])? # numbers, allowing commas as seperators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():-_'!] # matches special characters including !
'''
def tokenize_german_text(text):
'''
Takes a text of type string and
tokenizes the text
'''
matchObject = re.findall(GERMAN_TOKENIZER, text)
pass
tokenize_german_text(u'Das ist ein Deutscher Text! Er enthält auch Währungen, 10€')
Result:
When I was debugging this I found out that the matchObject is only a list containing 11 entries with empty characters. Why is it not working as expected and how can I fix this?
re.findall() collects only the matches in capturing groups (unless there are no capturing groups in your regex, in which case it captures each match).
So your regex matches several times, but every time the match is one where no capturing group is participating. Remove the capturing groups, and you'll see results. Also, place the - at the end of the character class unless you actually want to match the range of characters between : and _ (but not the - itself):
GERMAN_TOKENIZER = r'''(?x) # set flag to allow verbose regex
(?:[A-ZÄÖÜ]\.)+ # abbrevations including ÄÖÜ
|\d+(?:[.,]\d+)?[€$%]? # numbers, allowing commas as seperators and € as currency
|[\wäöü]+ # matches normal words
|\.\.\. # ellipsis
|[][.,;\"'?():_'!-] # matches special characters including !
'''
Result:
['Das', 'ist', 'ein', 'Deutscher', 'Text', '!', 'Er', 'enthält', 'auch', 'Währungen', ',', '10€']

Regexp - Match any character except "Something.AnyChar" [duplicate]

This question already has answers here:
Need a regex to exclude certain strings
(6 answers)
Closed 9 years ago.
I have a string:
Input:
"Feature.. sklsd " AND klsdjkls 9290 "Feass . lskdk SDFSD __ ksdljsklfsd" NOT "Feuas" "Feature.lskd" OR PUT klasdkljf al9- .s.a, 9a0sd90209 .a,sdklf jalkdfj al;akd
I need to match any character except OR, NOT, AND, "Feature.any_count_of_characters"
the last one is important this start with: "Feature.
This is followed by any number of characters and then ends with: " character.
I'm trying to solve this using lookahead or lookbehind but I can get the expected output, only a portion of characters that I don't want.
My expected output is
"Feature.. sklsd " AND klsdjkls 9290 "Feass . lskdk SDFSD __ ksdljsklfsd" NOT "Feuas" "Feature.lskd" OR PUT klasdkljf al9- .s.a, 9a0sd90209 .a,sdklf jalkdfj al;akd
All that is in black.
To test it i'm using these links:
http://gskinner.com/RegExr/
http://regexpal.com/
Thanks.
EDIT
Check this link http://regexr.com?37v36
inside the link i get matched some expression. But i don't need the expression that matched. i need the inverse, how i can get it?
Thanks.
Just use
\s*(?:AND|OR|NOT|"[^"]+")\s*
but do a replace operation. That will leave what you want.
Your basic problem is that look behinds can not have arbitrary lengths, but you need that. There are work arounds, but a simpler approach is to use a capturing group:
"Feature\.[^"]*" (?:OR|NOT|AND) ([^"])
And your target will be in group 1 of the match.