How to extract headings in a text file using regex in Python?

I have always used stackoverflow for solving many of my problems by searching the threads. Today I would like some guidance on creating a regex pattern for my text files. My files have headings that are varied in nature and do not follow the same naming pattern. The pattern they do follow somewhat is like this:
2.0 DESCRIPTION
3.0 PLACE OF PERFORMANCE
5.0 SERVICES RETAINED
6.0 STRUCTURE AND ROLES
etc....
A heading always consists of a number followed by capital letters, or a number followed by spaces and then capital letters. The output I need is a list:
output = ['2.0 DESCRIPTION','3.0 PLACE OF PERFORMANCE','5.0 SERVICES RETAINED','6.0 STRUCTURE AND ROLES']
I am extremely new to python and regex. I tried the following but it did not give me the output desired:
import re
text = f'''2.0 DESCRIPTION
some text here
3.0 SERVICES
som text
5.0 SERVICES RETAINED
some text
6.0 STRUCTURE AND ROLES
sometext'''
pattern = r"\d\s[A-Z][A-Z]+"
matches = re.findall(pattern,text)
But it returned:
['0 DESCRIPTION', '0 SERVICES', '0 SERVICES']
Not the output that I was looking for. Your guidance in finding a pattern will be really appreciated.
Cheers,
Abhishek

You may use
matches = re.findall(r'^\d+(?:\.\d+)* *[A-Z][A-Z ]*$',text, re.M)
Here,
^ - start of a line (re.M redefines ^ behavior to include these positions, too)
\d+(?:\.\d+)* - 1+ digits and then 0+ sequences of a . and 1+ digits
" *" (a space followed by *) - zero or more spaces
[A-Z][A-Z ]* - an uppercase letter and then 0 or more uppercase letters or spaces
$ - end of a line.
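As a quick check, the pattern can be run against the sample text from the question. This is only a minimal sketch; the names heading_re and headings are illustrative and not part of the original answer.

```python
import re

text = """2.0 DESCRIPTION
some text here
3.0 SERVICES
som text
5.0 SERVICES RETAINED
some text
6.0 STRUCTURE AND ROLES
sometext"""

# re.M makes ^ and $ match at the start and end of every line
heading_re = re.compile(r'^\d+(?:\.\d+)* *[A-Z][A-Z ]*$', re.M)
headings = heading_re.findall(text)
print(headings)
# ['2.0 DESCRIPTION', '3.0 SERVICES', '5.0 SERVICES RETAINED', '6.0 STRUCTURE AND ROLES']
```

Lines such as "some text here" are skipped because they do not start with a digit.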

import pdfplumber
import re
pdfToString = ""
with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        # fall back to "" in case a page yields no extractable text
        page_text = page.extract_text() or ""
        print(page_text)
        pdfToString += page_text
# each match is a numbered heading plus every following line up to the next heading
matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdfToString, re.M)
for i in matches:
    # "word_to_extract" is a placeholder for the heading text you are looking for
    if "word_to_extract" in i[:50]:
        print(i)
This solution extracts every heading that has the same format as the headings in the question, then prints the required heading together with the paragraphs that follow it.
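The same heading-plus-body pattern can be tried without a PDF by feeding it a plain string. The sketch below is only an illustration under that assumption: split_sections and SECTION_RE are hypothetical names, and the sample text is made up to mirror the question.

```python
import re

# a numbered heading line, then every following line that does not
# itself start a new numbered heading
SECTION_RE = re.compile(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', re.M)

def split_sections(text):
    """Map each heading line to the text that follows it."""
    sections = {}
    for block in SECTION_RE.findall(text):
        # the first line of a block is the heading, the rest is the body
        heading, _, body = block.partition("\n")
        sections[heading.strip()] = body.strip()
    return sections

sample = "2.0 DESCRIPTION\nsome text here\nmore text\n3.0 SERVICES\nsom text"
sections = split_sections(sample)
print(sections["2.0 DESCRIPTION"])
```

Returning a dictionary makes it possible to look up one heading directly instead of scanning the first 50 characters of every match.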

Related

How to combine independent regular expressions and apply them on all rows of a dataset using Pandas?

Problem Statement:
I have two separate regular expressions that I am trying to "combine" into one and apply to each row in a dataset. The matching part of each row should go to a new Pandas dataframe column called "Wanted". Please see the example data below for how values that match should be formatted in the "Wanted" column.
Example Data (how I want it to look):
| Column0 | Wanted (want "Column0" to look like this) |
| --- | --- |
| Alice\t12-345-623/ 10-1234 | Alice, 12-345-623, 10-1234 |
| Bob 201-888-697 / 12-0556a | Bob, 201-888-697, 12-0556a |
| Tim 073-110-101 / 13-1290 | Tim, 073-110-101, 13-1290 |
| Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c | Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c |
In other words...:
2-3 digits ----- hyphen ---- 3 digits --- hyphen ---- 3 digits ---- any character ----
2 digits --- hyphen --- 4 digits ---- permit one single character
What I have tried #1:
After dinking around for a while I figured out two different regular expressions that on their own will solve part of the problem. Kinda.
This regex matches the first group of numbers in each row, but it doesn't get the second group, which I also want. I'm not sure how robust it is, though.
Example Problem Row (regex = r"(?:\d{1,3}-){0,3}\d{1,3}")
search_in = "Alice\t12-345-623/ 10-1234"
wanted_regex = r"(?:\d{1,3}\-){0,3}\d{1,3}"
match = re.search(wanted_regex, search_in)
match.group(0)
Wanted: Alice, 12-345-623, 10-1234
Got: 12-345-623 # matches the group of numbers but isn't formatted how I would like (see example data)
What I have tried #2:
This will match the second part in each row, but only if it's the only value in the column. The problem I have is that it matches on the first group of digits instead of the second.
Example Problem Row (regex = r"(?:\d{2,3}-){1}\d{3,4}") # different regex than above!
search_in = "Alice\t12-345-623/ 10-1234"
wanted_regex = r"(?:\d{2,3}\-){1}\d{3,4}"
match = re.search(wanted_regex, search_in)
match.group(0)
Wanted : Alice, 12-345-623, 10-1234
Got: 12-345 # matched on the first part
Known Problems:
When I try, "Alice\t12-345-623/ 10-1234", it will match "12-345" when I'm trying to match "10-1234"
Thank you!
Thanks in advance to all you wizards being willing to help me with this problem. I really appreciate it:)
Note: I have asked regarding regex that may make solving this problem easier. It might not, but here is the link anyways --> How to use regex to select a row and a fixed number of rows following a row containing a specific substring in a pandas dataframe
So this works for the four test examples you gave. How's this using the .split() method? Technically this returns a list of values and not a string.
import re
# text here
text = "Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c"
# split this out to a list. remove the ending parenthesis since you are *splitting* on this
new_splits = re.split(r'\t|/|and|\(| ', text.replace(')',''))
# filter out the blank spaces
list(filter(None,new_splits))
['Joe', '74-111-333', '33-1290', 'Amy', '12-345-623', '10-1234c']
and if you are using pandas you can try the same steps above:
df['answer_Step1'] = df['Column0'].str.split(r'\\t+|/|and|\(| ')
df['answer_final'] = df['answer_Step1'].apply(lambda x: list(filter(None,x)))
You can use
re.sub(r'\s*\band\b\s*|[^\w-]+', ', ', text)
Pandas version:
df['Wanted'] = df['Column0'].str.replace(r'\s*\band\b\s*|[^\w-]+', ', ', regex=True)
Details:
\s*\band\b\s* - a whole word (\b are word boundaries) and enclosed with optional zero or more whitespace chars
| - or
[^\w-]+ - one or more chars other than letters, digits, _ and -
See a Python demo:
import re
texts = ['Alice 12-345-623/ 10-1234',
'Bob 201-888-697 / 12-0556a','Tim 073-110-101 / 13-1290',
'Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c']
for text in texts:
print(re.sub(r'\s*\band\b\s*|[^\w-]+', ', ', text))
# => Alice, 12-345-623, 10-1234
# Bob, 201-888-697, 12-0556a
# Tim, 073-110-101, 13-1290
# Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
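If the goal is to pull the two number groups out rather than rewrite the separators, the two shapes described in the question can be combined into one pattern with two capture groups. This is a hedged sketch built from the question's own description of the formats; pair_re and number_pairs are hypothetical names, and treating the optional trailing character as a letter is an assumption.

```python
import re

# group 1: 2-3 digits, hyphen, 3 digits, hyphen, 3 digits
# \W*    : any run of separator characters such as "/", ")" or spaces
# group 2: 2 digits, hyphen, 4 digits, plus an optional trailing letter (assumed)
pair_re = re.compile(r'(\d{2,3}-\d{3}-\d{3})\W*(\d{2}-\d{4}[A-Za-z]?)')

def number_pairs(row):
    """Return every (first_number, second_number) pair found in a row."""
    return pair_re.findall(row)

pairs = number_pairs("Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c")
print(pairs)
# [('74-111-333', '33-1290'), ('12-345-623', '10-1234c')]
```

Because the first group must be immediately followed by the second (apart from separators), the shorter pattern can no longer match inside the longer number, which was the problem with the two stand-alone regexes.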

Find all groups of 9 digits (\d{9}) up to a certain word

I have the following string extracted from a PDF file and I would like to obtain the nine digits "control class" number from it:
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
I want all the matches that occur before the word “Sector”, otherwise I will have undesired matches.
I’m using the “re” module, in Python 3.8.
I tried to use the negative lookbehind as follows:
(?<!Sector:)\d{9})
However, it didn’t work. I still had the matches like ‘54177846’ and ‘201874249’, which are after the ‘Sector’ word.
I also tried to “isolate” the search area between the words “Process ID” and “Sector”:
(Process ID:.*?)(\d{9})(.*Sector)
I also tried to search for the expression \d{9} only up to the “Sector” word, but it returned no results.
I had to work out a solution in two steps: (1) I created a regex that would find all the text up to the word “Sector” (desperate_regex = '(.*)Sector') and assigned the result to a new variable, partial_text; (2) I then searched for the desired regex ('\d{9}') within the new variable.
My code is working, but it does not satisfy me. How would I find my matches with a single regex search?
Please note that the first "control class" number is truncated with the text that comes before it ("CONTROL CLASS706345519").
(PS: I'm a total newbie, and this is my first post. I hope I explained myself clearly. Thank you!)
The easiest way is to get the string before Sector and just search that:
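# keep only the text before the first "Sector"; maxsplit=1 avoids an unpacking error if the word appears more than once
split_string = string.split("Sector", 1)[0]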
nums = re.findall(r'\d{9}', split_string)
# ['706345519', '708393673', '706855190']
Another would be to use the third-party regex module, which allows overlapping matches:
import regex as re
nums = re.findall(r'(\d{9}).*?Sector', string, overlapped=True)
# ['706345519', '708393673', '706855190']
The regex described below may be more than the actual case requires, but better safe than sorry.
If you want to match a string of exactly 9 digits, no more and no fewer, then you should use negative lookbehind and lookahead assertions to ensure that the 9 digits are neither preceded nor followed by another digit (again, in this case perhaps the OP knows that only 9-digit numbers will ever appear and this is overkill). You can also use a negative lookbehind assertion to ensure that Sector does not appear before the 9 digits. This latter assertion is a variable-length assertion, which requires the regex package from PyPI:
r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)'
(?<!Sector.*?) Assert that we haven't scanned past Sector. This handles the situation where Sector might appear multiple times in the input by ensuring that we never scan past the first occurrence.
(?<!\d) Assert that the previous character is not a digit.
\d{9} Match 9 digits.
(?!\d) Assert that the next character is not a digit.
The simplified version:
r'(?<!Sector.*?)\d{9}'
The code:
import regex as re
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
#print(re.findall(r'(?<!Sector.*?)\d{9}', string))
print(re.findall(r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)', string))
Prints:
['706345519', '708393673', '706855190']
You could use an alternation and break if you find "Sector":
import re
text = """(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)"""
rx = re.compile(r'\d{9}|(Sector)')
results = []
for match in rx.finditer(text):
    # group 1 is only set once "Sector" itself has been matched
    if match.group(1):
        break
    results.append(match.group(0))
print(results)
Which yields
['706345519', '708393673', '706855190']
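For a single search that needs only the standard re module, one option is a lookahead that requires "Sector" to appear somewhere later in the text. This is a sketch rather than a drop-in replacement: unlike the answers above, it accepts digits that precede the last occurrence of "Sector", so it assumes the word appears only once. The name nine_digits_before_sector is illustrative.

```python
import re

string = (
    '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, '
    '706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: '
    '54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c'
    '(some text after)'
)

# (?<!\d) and (?!\d) keep the match to exactly nine digits;
# (?=.*Sector) only accepts it if "Sector" still lies ahead (re.S lets . cross newlines)
nine_digits_before_sector = re.compile(r'(?<!\d)\d{9}(?!\d)(?=.*Sector)', re.S)
nums = nine_digits_before_sector.findall(string)
print(nums)
# ['706345519', '708393673', '706855190']
```

The 14-digit run after "SAAT:" is rejected by the digit guards, and 201874249 is rejected because no "Sector" follows it.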
If either of these works I'll add an explanation to it:
[\s\S]+(?:Process ID:\s+)(.*)(?:\s+Sector)[\s\S]+
\g<1>
Or this?
(?i)[\s\S]+(?:control\s+class\s*)(\d{9})[\s\S]+
\g<1>

Look for any character that surrounds one of any character including itself

I am trying to write a regex code to find all examples of any character that surrounds one of any character including itself in the string below:
b9fgh9f1;2w;111b2b35hw3w3ww55
So ‘b2b’ and ‘111’ would be valid, but ‘3ww5’ would not be.
Could someone please help me out here?
Thanks,
Nikhil
You can use this regex, which matches three characters where the first and third are the same (via a backreference) and the middle one can be anything,
(.).\1
Edit:
The regex above only returns non-overlapping matches. Since you want all matches, including overlapping ones, you can use this positive-lookahead regex. It does not consume the next two characters; instead it captures them in group 2, so for your desired output you can concatenate group 1 and group 2.
(.)(?=(.\1))
Here is some Java code (I've never programmed in Ruby) demonstrating this; you can write the same logic in your favorite programming language.
String s = "b9fgh9f1;2w;111b2b35hw3w3ww55";
Pattern p = Pattern.compile("(.)(?=(.\\1))");
Matcher m = p.matcher(s);
while(m.find()) {
    System.out.println(m.group(1) + m.group(2));
}
Prints all your intended matches,
111
b2b
w3w
3w3
w3w
Also, here is some Python code that may help if you know Python,
import re
s = 'b9fgh9f1;2w;111b2b35hw3w3ww55'
# each item is a (first_char, next_two_chars) tuple
for m in re.findall(r'(.)(?=(.\1))', s):
    print(m[0] + m[1])
Prints all your expected matches,
111
b2b
w3w
3w3
w3w
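A small variation, offered only as a sketch, wraps the whole triple in the lookahead so that no string concatenation is needed afterwards. The name triples is illustrative.

```python
import re

s = 'b9fgh9f1;2w;111b2b35hw3w3ww55'

# group 1 is the whole triple, group 2 is its first character;
# the lookahead consumes nothing, so overlapping triples are all found
triples = [m.group(1) for m in re.finditer(r'(?=((.).\2))', s)]
print(triples)
# ['111', 'b2b', 'w3w', '3w3', 'w3w']
```

Since every match is zero-width, the scan advances one character at a time, which is what allows 'w3w', '3w3' and 'w3w' to be reported from the same stretch of text.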

Extracting email addresses from messy text in OpenRefine

I am trying to extract just the emails from a text column in OpenRefine. Some cells have just the email, but others have the name and email in john doe <john@doe.com> format. I have been using the following GREL/regex but it does not return the entire email address. For the above example I'm getting ["n@doe.com"]
value.match(
/.*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*/
)
Any help is much appreciated.
The n is captured because you are using .* before the capturing group, and since it can match any 0+ chars other than line break chars greedily, the only char that can land in Group 1 during backtracking is the char right before @.
If you can get partial matches, get rid of the .* and use
/[^<\s]+@[^\s>]+/
Details
[^<\s]+ - 1 or more chars other than < and whitespace
@ - a @ char
[^\s>]+ - 1 or more chars other than whitespace and >.
Python/Jython implementation:
import re
res = ''
m = re.search(r'[^<\s]+@[^\s>]+', value)
if m:
    res = m.group(0)
return res
There are other ways to match these strings. In case you need a full string match, use .*<([^<]+@[^>]+)>.* where .* will not gobble the name since it will stop before an obligatory <.
If some cells contain just the email, it's probably better to use @wiktor-stribiżew's partial match. In the development version of OpenRefine, there is now a value.find() function that can do this, but it will only be officially implemented in the next version (2.9). In the meantime, you can reproduce it using Python/Jython instead of GREL:
import re
return re.findall(r"[^<\s]+@[^\s>]+", value)[0]
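Outside OpenRefine, the partial-match idea can be wrapped in a small helper that also copes with cells containing no address at all. This is a sketch that assumes addresses use the usual @ separator; first_email and EMAIL_RE are hypothetical names, not OpenRefine functions.

```python
import re

# one or more chars other than "<" or whitespace, an "@",
# then one or more chars other than whitespace or ">"
EMAIL_RE = re.compile(r'[^<\s]+@[^\s>]+')

def first_email(value, default=''):
    """Return the first email-like token in value, or default if none is found."""
    m = EMAIL_RE.search(value)
    return m.group(0) if m else default

print(first_email('john doe <john@doe.com>'))
# john@doe.com
```

Returning a default avoids the IndexError that indexing an empty findall() result would raise for a cell without an address.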

What is the Regex for Windows Domain Username in C#?

What is the regular expression to accept only text, number and backslash. It should not accept space and should start with text only. For example domain\username. Thanks in advance...
This is a regex for domain\name with the restriction that 'domain' should start and end with a letter. You can easily adapt the regex to your needs.
/^[a-zA-Z][a-zA-Z0-9-]{1,61}[a-zA-Z]\\[a-zA-Z]{2,}$/
Domain - Beginning:
[a-zA-Z] Text
Domain - Text:
1-61 times of [a-zA-Z0-9-] Text, Numbers, '-'
Domain - End:
1 time [a-zA-Z] = Text
Backslash:
1 time [\]
User - Text:
2-infinity times [a-zA-Z] = Text
Edit: as bgh pointed out in the comment you could include more valid characters
/^[a-zA-Z][a-zA-Z0-9\-\.]{0,61}[a-zA-Z]\\\w[\w\.\- ]*$/
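To try the component breakdown (a letter, 1-61 letters, digits or hyphens, a letter, one backslash, then two or more letters) without a C# project, here is a minimal sketch in Python rather than C#. LOGIN_RE and is_domain_login are hypothetical names, and the pattern is the stricter first variant, so user names containing digits are rejected.

```python
import re

# \\ in the raw string is the regex for exactly one literal backslash
LOGIN_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9-]{1,61}[a-zA-Z]\\[a-zA-Z]{2,}$')

def is_domain_login(s):
    """True if s looks like domain\\username under the rules listed above."""
    return LOGIN_RE.match(s) is not None

print(is_domain_login('domain\\username'))   # True
print(is_domain_login('domain username'))    # False
```

The same pattern string should carry over to a C# verbatim string unchanged, since neither raw Python strings nor verbatim C# strings treat the backslash as an escape.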
The following is a regex with named groups; it can be pasted into LINQPad and run. Note that Active Directory actually allows a lot of characters in user names: any Unicode character except for some special characters (some of which are used in LDAP searches, among other things).
Oh yes - the English alphabet ends in Z. In Norwegian we have three extra vowels: Æ, Ø, Å.
void Main()
{
    string user = "someaddomain\\someuser99";
    var matches = Regex.Match(user, @"^(?<domain>[a-æA-Æ0-9-]+)\\(?<username>[a-æA-Æ0-9-]+)$").Dump();
    string[] comps = user.Split('\\');
    comps.Dump();
    matches.Groups["domain"].Value.Dump();
    matches.Groups["username"].Value.Dump();
}
For new programmers who have not used this development tool yet, LINQPad is available for download at https://www.linqpad.net