Capturing two different lines using regex - regex

I want to capture two lines in one variable. This is my input:
Rose 0 82
ABC 0 0
ABC (Backup) 0 0
ABC XYZ 637 2021
ABC XYZ (Backup) 0 0
ABC EXYZ 0 0
I want to capture the lines which are in bold (ABC 0 0 and ABC XYZ 637 2021).
I tried this code:
var = re.search("ABC\s+\d+\s+ .*\n(.*)\nABC XYZ .*",file_name)
but it is giving me output like this:
ABC 0 0
ABC (Backup) 0 0
ABC XYZ 637 2021
and my expected output is this:
ABC 0 0
ABC XYZ 637 2021
Can someone please suggest what modification is needed?

You may use
re.search(r"^(ABC[ \t]+\d+[ \t].*\n).*\n(ABC[ \t]+XYZ[ \t].*)", s, re.MULTILINE)
The regex will find the match you need and capture 2 lines into separate capturing groups. Then, check if there was a match and, if yes, join the two capturing group values.
See the Python demo
import re
s="""Rose 0 82
ABC 0 0
ABC (Backup) 0 0
ABC XYZ 637 2021
ABC XYZ (Backup) 0 0
ABC EXYZ 0 0"""
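# Raw string so that \d reaches the regex engine without an invalid-escape warning
v = re.search(r"^(ABC[ \t]+\d+[ \t].*\n).*\n(ABC[ \t]+XYZ[ \t].*)", s, re.MULTILINE)
if v:
    # Group 1 already ends with a newline, so the two groups can be joined directly
    print("{}{}".format(v.group(1), v.group(2)))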
Output:
ABC 0 0
ABC XYZ 637 2021
Pattern details
^ - start of a line (due to re.MULTILINE)
(ABC[ \t]+\d+[ \t].*\n) - Capturing group 1: ABC, 1+ spaces or tabs, 1+ digits, a space or tab and then the rest of the line with the newline
.*\n - whole next line
(ABC[ \t]+XYZ[ \t].*) - Capturing group 2: ABC, 1+ spaces or tabs, XYZ, a space or tab and then the rest of the line.

You can make use of "^" and "$" to catch the start and end of a line.
^\*\*.*\*\*
This will give you 2 matches to iterate through. All the matches represent bold lines, marked by the two * at the beginning and end of a line.
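A minimal Python sketch of that idea, assuming the bold lines really are wrapped in ** in the raw input (the post's rendered text does not show the markers). The capture group and the trailing $ are additions for illustration so that only the text between the markers is kept.

```python
import re

# Assumption: bold lines are wrapped in ** markers in the raw input.
text = """Rose 0 82
**ABC 0 0**
ABC (Backup) 0 0
**ABC XYZ 637 2021**
ABC XYZ (Backup) 0 0
ABC EXYZ 0 0"""

# re.MULTILINE makes ^ and $ match at every line boundary, not only at the
# start and end of the whole string.
bold_line = re.compile(r"^\*\*(.*)\*\*$", re.MULTILINE)

# Collect the text between the markers for every bold line.
captured = [m.group(1) for m in bold_line.finditer(text)]
print("\n".join(captured))
```

Joining the captured lines with a newline gives both lines in a single variable, which is what the question asks for.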

If a comment starts with two stars, then you can use this (but it will not separate two comments if they are on one line).
^[\*]{2}(.*)[\*]{2}
If you want to find any comment with the form of **comment** use this
[\*]{2}[^\*]+[\*]{2}
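To see the difference between the two patterns, here is a small Python sketch on a made-up line that holds two **comment** spans. The sample line is an assumption for illustration only.

```python
import re

# Hypothetical input: two **...** spans on the same line.
line = "**first** plain text **second**"

# Anchored, greedy version: .* runs to the last ** on the line,
# so both comments end up inside a single capture.
whole_line = re.search(r"^[\*]{2}(.*)[\*]{2}", line)

# Unanchored version: [^\*]+ cannot cross a star, so each comment
# is returned as its own match.
each_comment = re.findall(r"[\*]{2}[^\*]+[\*]{2}", line)

print(whole_line.group(1))
print(each_comment)
```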

Related

Regex for number of digits 0 and at least one digit 1

I have a block of text in which I try to find lines that contain any (*) number of digits 0 and at least one (+) digit 1. Example:
1234 xxx 00000000000111000000 00000010000100000000 Some text <-- matches
2345 yyy 00000000000000000000 00000000000000000000 Some text <-- does not match
2345 yyy 00000001000000000000 00000000000000000000 Some text <-- matches
3456 zzz 11111111111111111111 11111111111111111111 Some text <-- matches
How to accomplish this? Thanks!
Tried with negative lookahead but failed:
\s+\d+ +[a-zA-Z]+ +(?![0]{20}) +(?![0]{20}) +([0-9a-zA-Z ]+)
You are not matching any digits 0 or 1 after the assertions.
If both columns with the digits 0 or 1 can not be only zeroes, you can use both columns in the assertion:
+\d+ +[a-zA-Z]+ +(?!0{20} +0{20}\b)[01]{20} +[01]{20} +([0-9a-zA-Z ]+)
See a regex101 demo.
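A rough Python check of the lookahead pattern. The sample rows are rebuilt from the question with string multiplication so the 20-digit column width is explicit, and the leading " +" is swapped for "^ *" on the assumption that the rows may start without indentation.

```python
import re

# Assumption: rows may have no leading spaces, hence "^ *" instead of " +".
row = re.compile(
    r"^ *\d+ +[a-zA-Z]+ +(?!0{20} +0{20}\b)[01]{20} +[01]{20} +([0-9a-zA-Z ]+)"
)

ZEROS = "0" * 20
columns = [
    ("1234 xxx", "0" * 11 + "111" + "0" * 6, "0" * 6 + "1" + "0" * 4 + "1" + "0" * 8),
    ("2345 yyy", ZEROS, ZEROS),
    ("2345 yyy", "0" * 7 + "1" + "0" * 12, ZEROS),
    ("3456 zzz", "1" * 20, "1" * 20),
]
lines = [f"{head} {a} {b} Some text" for head, a, b in columns]

# The negative lookahead rejects a row only when BOTH columns are all zeroes.
matched = [bool(row.match(line)) for line in lines]
print(matched)
```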
Here is my shorter version of the regex. But it only tests line by line, so you will have to iterate through each line in your file like the code below:
import re
text = '''1234 xxx 00000000000111000000 00000010000100000000 Some text
2345 yyy 00000000000000000000 00000000000000000000 Some text
2345 yyy 00000001000000000000 00000000000000000000 Some text
3456 zzz 11111111111111111111 11111111111111111111 Some text'''
regex = r'^\d+\s+\w+\s+0*1+0*\s+\d+\s+\w+'
matches = re.findall(regex, text, re.MULTILINE)
for match in matches:
    print(match)
For explanation and details, please check regex101 demo
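If a line-by-line loop is acceptable anyway, a plain split may be enough. The sketch below is a different technique from the regex above, not a rewrite of it: it assumes the two binary columns are always the third and fourth whitespace-separated fields, and the sample rows are made up in the shape of the question's data.

```python
def has_one_in_binary_columns(line):
    """Return True if either binary column contains at least one '1'."""
    fields = line.split()
    # Assumption: fields[2] and fields[3] are the two 20-digit columns.
    return any("1" in column for column in fields[2:4])

# Hypothetical rows modeled on the question's input.
lines = [
    "1234 xxx " + "0" * 11 + "111" + "0" * 6 + " " + "0" * 20 + " Some text",
    "2345 yyy " + "0" * 20 + " " + "0" * 20 + " Some text",
    "2345 yyy " + "0" * 20 + " " + "0" * 12 + "1" + "0" * 7 + " Some text",
]

flags = [has_one_in_binary_columns(line) for line in lines]
print(flags)
```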

Put everything in ordered groups, but have chars in parentheses grouped together

Say I have this string:
111 222 (333 444) 555 666 (777) 888
What I want is:
Group 1: 111 222
Group 2: 333 444
Group 3: 555 666
Group 4: 777
Group 5: 888
I have this regex \(([^\)]+)\) but it only captures what's between parens.
You can use
String text = "111 222 (333 444) 555 666 (777) 888";
RegExp rx = new RegExp(r'\(([^()]+)\)|[^()]+');
var values = rx.allMatches(text).map((z) => z.group(1) != null ? z.group(1)?.trim() : z.group(0)?.trim()).toList();
print(values);
// => [111 222, 333 444, 555 666, 777, 888]
See the regex demo. The output is either trimmed Group 1 values, or the whole match values (also trimmed) otherwise. The \(([^()]+)\)|[^()]+ pattern matches a (, then captures into Group 1 any one or more chars other than parentheses and then matches a ), or matches one or more chars other than parentheses.
To avoid empty items, you may require at least one non-whitespace:
\(\s*([^\s()][^()]*)\)|[^\s()][^()]*
See this regex demo. Details:
\( - a ( char
\s* - zero or more whitespaces
([^\s()][^()]*) - Group 1: a char other than whitespace, ( and ), and then zero or more chars other than round parentheses
\) - a ) char
| - or
[^\s()][^()]* - a char other than whitespace, ( and ), and then zero or more chars other than round parentheses.
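For readers who are not using Dart, the same pattern can be tried in Python. This is only a sketch of the idea above (take Group 1 when it participated in the match, otherwise the whole match), not a translation of any particular library call.

```python
import re

text = "111 222 (333 444) 555 666 (777) 888"

# Either a parenthesized chunk (captured without the parens) or a run of
# non-parenthesis characters that starts with a non-whitespace character.
chunk = re.compile(r"\(\s*([^\s()][^()]*)\)|[^\s()][^()]*")

values = []
for m in chunk.finditer(text):
    # group(1) is None when the second alternative matched.
    value = m.group(1) if m.group(1) is not None else m.group(0)
    values.append(value.strip())

print(values)
```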
If you want to group the string by identical characters, I would consider using a stack to keep a running storage of the congruent, consecutive characters. Once you reach a character that does not match, you can clear the entire stack until it is empty.
You can add more code (if statement logic) to keep populating a stack once an opening parenthesis is read until a closing parenthesis is read.

Python regex to extract phone number

I would like to clean up the phone number column in my pandas dataframe. I'm using the code below, but it leaves a bracket at the end. How do I get the right regex to exclude any extra characters at the end, like ( or anything else which is not part of the phone number? I've looked through old posts, but can't seem to find an exact solution.
Sample code below:
import pandas as pd
df1 = pd.DataFrame({'x': ['1234567890', '202-456-3456', '(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']})
df1['x1'] = df1['x'].str.extract('([\(\)\s\d\-]+)',expand= True)
expected output:
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)       1234564567
Current output :
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)      1234564567(
You may use
((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})
See the regex demo
Details
(?:\(\d{3}\)|\d{3})? - an optional sequence of
\(\d{3}\) - (, three digits, )
| - or
\d{3} - three digits
(?:\s|\s?-\s?)? - an optional sequence of a whitespace char or an - enclosed with single optional whitespaces
\d{3} - three digits
(?:\s|\s?-\s?)? - an optional sequence of a whitespace char or an - enclosed with single optional whitespaces
\d{4} - four digits.
Pandas test:
>>> df1['x'].str.extract(r'((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})',expand= True)
0
0 1234567890
1 202-456-3456
2 (202)-456-3456
3 (202)-456- 4567
4 1234564567
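The pattern can also be checked without pandas. Below is a small sketch using only the standard re module on the same sample values; the helper name extract_phone is made up for illustration.

```python
import re

PHONE = re.compile(
    r"((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})"
)

samples = [
    "1234567890",
    "202-456-3456",
    "(202)-456-3456adsd",
    "(202)-456- 4567",
    "1234564567(dads)",
]

def extract_phone(value):
    """Return the first phone-like substring, or None if nothing matches."""
    m = PHONE.search(value)
    return m.group(1) if m else None

cleaned = [extract_phone(s) for s in samples]
print(cleaned)
```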
How about a different approach? Instead of trying to match the phone numbers, remove the bits you don't want:
import pandas as pd
df1 = pd.DataFrame({'x': ['1234567890', '202-456-3456', '(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']})
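# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
df1['x1'] = df1['x'].str.replace(r'\([^0-9]+\)|\D*$', '', regex=True)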
Output:
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)       1234564567
It means using str.replace instead of str.extract but I think the code is simpler as a result.
Explanation:
\([^0-9]+\) matches any characters except 0-9 inside parentheses.
| means logical OR.
\D*$ matches zero or more non-numeric characters at the end of the string.
Used with replace, this matches the above pattern and replaces it with an empty string.
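The same removal can be sketched with plain re.sub, which may help when testing the pattern outside a DataFrame. The variable names are illustrative only.

```python
import re

# Either a parenthesized run of non-digits, or any non-digits at the very end.
TRAILING_JUNK = re.compile(r"\([^0-9]+\)|\D*$")

samples = [
    "1234567890",
    "202-456-3456",
    "(202)-456-3456adsd",
    "(202)-456- 4567",
    "1234564567(dads)",
]

# "(202)" survives because [^0-9]+ needs non-digits between the parentheses.
cleaned = [TRAILING_JUNK.sub("", s) for s in samples]
print(cleaned)
```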
I would use replace.
df1['x1'] = df1['x'].str.replace(r'(?<=\(\d{3}\)[-]\d{3}[-]\d{4})[a-z]*', '', regex=True)
df1
Simply put: replace Y if it is immediately to the right of X, that is, (?<=X)Y
Y = a run of lowercase letters - [a-z]*
X =
three digits between () followed by a dash, \(\d{3}\)[-], followed by;
another three digits and a dash, \d{3}[-], and finally followed by;
four digits, \d{4}
Output

Regex: Match a malformed date

I'm trying to grab the date (without time) from the following OCR'd strings:
04.10.2015, in USD
04.10.20 15, in EUR
04,1 0.2015, in XYZ
1 1. 10.2 01 5, in XYZ
0 1.11.201 5 12:30
1 1,0 3, 2 0 1 5 1 2:3 0
With the following expression I can catch the dates, but I can't skip the "12" hours:
([\d\s]{2,}(?:\.|,)[\d\s]{2,}(?:\.|,)[\d\s]{4,})
How can I make it work? In plain English, how can I make the last part stop once it has found 4 digits in a mix of digits and spaces/tabs?
By catching the first 8 digits on a line, you will get your date.
\D is any non-digit character
\d is a digit character
(?:...) is a non-capturing group
^\D* is used to ignore the beginning of the line until we get a digit
We match, 8 times, a digit followed by any non-digit characters, starting with the first digit found.
import re
# Python 3: plain raw strings replace the Python 2 ur'' prefix
p = re.compile(r'^\D*((?:\d\D*?){8})', re.MULTILINE)
test_str = """04.10.2015, in USD
04.10.20 15, in EUR
04,1 0.2015, in XYZ
1 1. 10.2 01 5, in XYZ
0 1.11.201 5 12:30
1 1,0 3, 2 0 1 5 1 2:3 0
"""
print(re.findall(p, test_str))
Have a test over here: https://regex101.com/r/eQ8zJ9/4
You can then filter out any non-digits to get the date:
from datetime import datetime
for s in re.findall(p, test_str):
    # Drop separators and stray spaces, leaving the 8 digits DDMMYYYY
    digits = re.sub(r'\D', '', s)
    print(datetime.strptime(digits, '%d%m%Y'))
You can also try with:
((?:\d\s*){2})[,.-]((?:\s*\d\s*){2})[,.-]((?:\s*\d){4})
DEMO
which is not restricted to the beginning of a line. It also matches only if one of the chosen delimiters (,, . or -) sits between the numbers, since there could be other chaotic 8-digit number sequences in text formatted like this.
The other answer is nice and short, but if the delimiters are of importance:
((?:(?:\d\s*){2}[.,]\s*){2}(?:\d\s*?){4})
The key being:
(?:\d\s*?){𝑛}
To capture 𝑛 digits with optional, but non-greedy, whitespace in-between.
I also took the liberty to shorten (?:\.|,) to [.,].
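A short Python 3 sketch of the delimiter-aware pattern on the question's samples. The cleanup step (dropping everything that is not a digit, then parsing as day, month, year) is an assumption about how the captured text would be used afterwards.

```python
import re
from datetime import datetime

DATE = re.compile(r"((?:(?:\d\s*){2}[.,]\s*){2}(?:\d\s*?){4})")

samples = [
    "04.10.2015, in USD",
    "04.10.20 15, in EUR",
    "04,1 0.2015, in XYZ",
    "1 1. 10.2 01 5, in XYZ",
    "0 1.11.201 5 12:30",
    "1 1,0 3, 2 0 1 5 1 2:3 0",
]

digits = []
for s in samples:
    m = DATE.search(s)
    if m:
        # The lazy \s*? stops right after the fourth year digit, so the
        # hours that follow are never pulled into the match.
        digits.append(re.sub(r"\D", "", m.group(1)))

# Assumption: the OCR'd dates are in day.month.year order.
dates = [datetime.strptime(d, "%d%m%Y").date() for d in digits]
print(digits)
```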

Perl, Any match string after a pattern

I'm trying to match if any string exists after a certain pattern.
The pattern is "pattern" 'anything in between' "[after]". case insensitive.
e.g
pattern 1 [after] ABC
pattern 2 [after] 123 abc DEX
pattern 3 [after]
pattern 12345123 [after]
pattern #ASd#98 #_90sd [after] xyz dec
[after] 4 pattern
So the result I would like to obtain is,
pattern 1 [after] ABC
pattern 2 [after] 123 abc DEX
pattern #ASd#98 #_90sd [after] xyz dec
It begins with "pattern" and ends with "[after]"; anything sandwiched between is also accepted.
I'm having difficulty combining the [ ] delimiters with the check that a string exists after them.
Here is what I've tried; the closest I've gotten ends up matching
m/pattern/ ../ \[after]/
pattern 1 [after] ABC
pattern 2 [after] 123 ABC DEX
pattern 3 [after]
pattern 12345123 [after]
pattern #ASd#98 #_90sd [after] xyz dec
But I don't need the 3rd or 4th pattern as it doesn't hold any numerics or characters after "[after]".
Thanks
Here is the code I used to test against your input (which I just cat'ed and piped to the script)
#!/usr/bin/perl
while(<>)
{
    print if (/^pattern.*\[after\]\s*\S+/);
}
So to break it down for you:
/^pattern : match any string that begins with "pattern"
.*\[after\] : match any characters followed by "[after]"
\s*\S+ : match 0 or more whitespace characters followed by one or more non-whitespace character
That should give you enough to work with to tweak it up as you see fit.
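The same filter, sketched in Python for comparison. The re.IGNORECASE flag is added on the assumption that the "case insensitive" requirement from the question should apply; the Perl one-liner above would need the /i modifier for the same effect.

```python
import re

# Starts with "pattern", contains "[after]", then at least one
# non-whitespace character somewhere after it.
HAS_TEXT_AFTER = re.compile(r"^pattern.*\[after\]\s*\S+", re.IGNORECASE)

lines = [
    "pattern 1 [after] ABC",
    "pattern 2 [after] 123 abc DEX",
    "pattern 3 [after]",
    "pattern 12345123 [after]",
    "pattern #ASd#98 #_90sd [after] xyz dec",
    "[after] 4 pattern",
]

kept = [line for line in lines if HAS_TEXT_AFTER.search(line)]
print("\n".join(kept))
```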
Code:
$str = 'pattern 2 [after] 123 abc DEX';
if ($str =~ m/^pattern\s+(\d+)\s+\[after\]\s+(.+)/) {
    print "$1\t$2\n";
} else {
    print "(no match)\n";
}
Output:
2 123 abc DEX
Test this code here.
Is this what you want:
/pattern [0-9] \[after\](?= .)/s
or
/pattern [0-9] \[after\] ./s