More Regex solution? - regex

I want to use the sub function to format the string
"Ross McFluff: 0456-45324: 155 Elm Street\nRonald Heathmore: 5543-23464: 445 Finley Avenue".
For each person it should look like this:
Contact
Name: xx yy
Phone number: 0000-00000
Address: 000 zzz zzz
I tried to resolve the problem:
line = """Ross McFluff: 0456-45324: 155 Elm Street \nRonald Heathmore: 5543-23464: 445 Finley Avenue"""
match = re.sub(r':', r'', line)
rematch = re.sub(r'([A-Z][a-z]+\s[A-Z][a-zA-Z]+)(.*?)(\d\d\d\d-\d\d\d\d\d)', r'Contact. Name: \1. Phone number: \3. Address:\2', match)
I got something like this :
"Contact. Name: Ross McFluff. Phone number: 0456-45324. Address: 155 Elm Street \nContact. Name: Ronald Heathmore. Phone number: 5543-23464. Address: 445 Finley Avenue"
How can I get this result instead:
Contact
Name: Ross McFluff
Phone number: 0456-45324
Address: 155 Elm Street
Contact
Name: Ronald Heathmore
Phone number: 5543-23464
Address: 445 Finley Avenue
Any idea? thanks
/Georges

I would toss a split in there like this:
import re
data = """Ross McFluff: 0456-45324: 155 Elm Street \nRonald Heathmore: 5543-23464: 445 Finley Avenue"""
linelist = data.split("\n")
for theline in linelist:
    rematch = re.sub('([^:]+): ([^:]+): (.*)', r'Contact\nName: \1\nPhone Number: \2\nAddress: \3', theline)
    print(rematch)
results:
Contact
Name: Ross McFluff
Phone Number: 0456-45324
Address: 155 Elm Street
Contact
Name: Ronald Heathmore
Phone Number: 5543-23464
Address: 445 Finley Avenue
That way you can easily process each "line". I really like using stuff like:
([^:]+)
That's a negated character class: it matches any character that is NOT in the class (here, anything except a colon), which is really what you are after. You could also just split on the colons, but a regex like this gives you more control. You may have to use strip() to clean up stray whitespace (the first line of your data has a trailing space before the \n); it really depends on what you are doing with the data.
If you need a pure regex solution, it can be done; you can experiment with patterns at https://regex101.com/
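For what it's worth, here is a minimal sketch of the pure-regex route in Python. It assumes every record sits on its own line in the "name: phone: address" layout from the question; the variable names are mine, and re.MULTILINE is there so that ^ and $ anchor at each line rather than only at the ends of the string.

```python
import re

data = "Ross McFluff: 0456-45324: 155 Elm Street\nRonald Heathmore: 5543-23464: 445 Finley Avenue"

# One substitution over the whole string: with re.MULTILINE, ^ and $ match at
# every line boundary, so each line is rewritten independently.
# The trailing \s* keeps stray spaces at the end of a line out of the address.
pattern = re.compile(r"^([^:\n]+):\s*([^:\n]+):\s*(.*?)\s*$", re.MULTILINE)
formatted = pattern.sub(r"Contact\nName: \1\nPhone number: \2\nAddress: \3", data)
print(formatted)
```

Excluding \n from the negated classes stops a group from running into the next record.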

I tend to prefer explicit repetition counts (\d{4} rather than \d\d\d\d) when I can. I am not sure how your first attempt came back correctly; I am assuming that is just a weird anomaly. The pattern below should work: your values will be \1, \3, and \5 for name, number, and address, and the last group reads the address up to the end of the line. (I used a generic regex tester for this.)
([A-Z][a-z]+\s[A-Z][a-zA-Z]+)(.*?)(\d{4}-\d{5})(.*?)([\w+ ]+)
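A quick sketch of how that pattern could be dropped into re.sub, assuming the colons are left in the input (so the separate colon-stripping step is not needed). The replacement template is my own guess at the desired layout.

```python
import re

line = "Ross McFluff: 0456-45324: 155 Elm Street\nRonald Heathmore: 5543-23464: 445 Finley Avenue"

pattern = r"([A-Z][a-z]+\s[A-Z][a-zA-Z]+)(.*?)(\d{4}-\d{5})(.*?)([\w+ ]+)"

# Groups 2 and 4 soak up the ": " separators. Group 5 begins with the space
# that follows the second colon, so the template writes "Address:" without
# adding a space of its own.
result = re.sub(pattern, r"Contact\nName: \1\nPhone number: \3\nAddress:\5", line)
print(result)
```

Because the class [\w+ ] does not include a newline, the last group stops at the end of each line and the next record is matched separately.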

Related

Using Regex to capture content after first occurrence of a string

I've done some research and I'm struggling to figure out how to answer this question. I have the following text and I want to extract the zip code in the business address field:
BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001
This code captures the last instance (77001):
(BUSINESS\s*ADDRESS:)(.*)(ZIP:\s*)(.*)
How can I capture the first zip code (77027)?
Thanks for helping a noob.
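Well, in your example you just need to make the .* lazy by adding a question mark (.*?) and specify that the ZIP consists only of digits:
BUSINESS\s*ADDRESS:.*?ZIP:\s*(\d+)
By default, the asterisk and plus quantifiers are greedy, so .* runs to the last ZIP: it can find. (As in your original pattern, the dot has to be allowed to match newlines.)
There is also no need to capture anything other than the ZIP code.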
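To make the greedy/lazy difference concrete, here is a small Python sketch run against the sample from the question. re.DOTALL is my assumption for how the dot gets to cross line breaks; other flavors spell that flag differently.

```python
import re

text = """BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001
"""

# Lazy .*? stops at the first "ZIP:" after the header.
lazy = re.search(r"BUSINESS\s*ADDRESS:.*?ZIP:\s*(\d+)", text, re.DOTALL)
first_zip = lazy.group(1) if lazy else None

# Greedy .* runs to the end and backtracks to the last "ZIP:".
greedy = re.search(r"BUSINESS\s*ADDRESS:.*ZIP:\s*(\d+)", text, re.DOTALL)
last_zip = greedy.group(1) if greedy else None

print(first_zip, last_zip)
```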
Given:
my $tgt="BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001";
You can do:
print "$1: $2\n" while $tgt=~/^(\S[^:]+):[^\R]*\R.*?^\s+ZIP:\s+(\d+)/gms;
Prints:
BUSINESS ADDRESS: 77027
MAIL ADDRESS: 77001
With the same method you can build a hash mapping each address block to its ZIP code.
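A rough Python equivalent of that header-to-ZIP mapping, as a sketch only. It assumes the block headers start in column one and the fields under them are indented, which is what the ^(\S and ^\s+ZIP parts of the Perl pattern rely on; the indentation in the sample below is my reconstruction.

```python
import re

text = """BUSINESS ADDRESS:
    STREET 1: 101 AWESOME DRIVE
    STREET 2: P O BOX 144
    CITY: HOUSTON
    STATE: TX
    ZIP: 77027
    BUSINESS PHONE: 7138675309
MAIL ADDRESS:
    STREET 1: P O BOX 144
    CITY: HOUSTON
    STATE: TX
    ZIP: 77001
"""

# Header: an unindented line ending in a colon. Then lazily skip ahead to the
# first indented "ZIP:" line and capture its digits.
block_re = re.compile(
    r"^(\S[^:\n]*):[ \t]*\n.*?^[ \t]+ZIP:[ \t]+(\d+)",
    re.MULTILINE | re.DOTALL,
)
zip_by_block = {header: zip_code for header, zip_code in block_re.findall(text)}
print(zip_by_block)
```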
The match operator running in list context will return all the matching values that were found. So you could do something like this:
my $data = '
BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001
';
my @allzips = ($data =~ /ZIP:\s*(\d+)/g);
foreach my $zip (@allzips) {
print "Found ZIP: $zip\n";
}
Which prints:
Found ZIP: 77027
Found ZIP: 77001
For those about to awk...
Here is a tested version, assuming the file is named test.txt in the current directory:
awk '{if ($0 ~ /BUSINESS ADDRESS:/) { inzone=1; } if (inzone) {if ($0 ~ /ZIP:/) { print $2; } else if ($0 ~ /MAIL ADDRESS:/) { inzone=0; }}}' test.txt
It prints the second field of every line containing ZIP:, but only for lines inside a block that starts at a line containing BUSINESS ADDRESS: and ends at a line containing MAIL ADDRESS:
A test run:
awk '{if ($0 ~ /BUSINESS ADDRESS:/) { inzone=1; } if (inzone) {if ($0 ~ /ZIP:/) { print $2; } else if ($0 ~ /MAIL ADDRESS:/) { inzone=0; }}}' test.txt
77027
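The same flag-based scan can be sketched in Python for anyone who would rather not awk. The function name and the idea of passing in a list of lines are mine; the logic mirrors the awk one-liner step by step.

```python
def zips_in_business_block(lines):
    """Collect the second field of ZIP: lines found between a
    BUSINESS ADDRESS: line and the next MAIL ADDRESS: line."""
    in_zone = False
    zips = []
    for line in lines:
        if "BUSINESS ADDRESS:" in line:
            in_zone = True
        if in_zone:
            if "ZIP:" in line:
                fields = line.split()  # whitespace-separated, like awk fields
                zips.append(fields[1] if len(fields) > 1 else "")
            elif "MAIL ADDRESS:" in line:
                in_zone = False
    return zips


sample_lines = [
    "BUSINESS ADDRESS:",
    "STREET 1: 101 AWESOME DRIVE",
    "CITY: HOUSTON",
    "STATE: TX",
    "ZIP: 77027",
    "BUSINESS PHONE: 7138675309",
    "MAIL ADDRESS:",
    "STREET 1: P O BOX 144",
    "ZIP: 77001",
]
print(zips_in_business_block(sample_lines))
```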

Using Perl to extract text from a text file

I have a question related to using regex to pull out data from a text file. I have a text file in the following format:
REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
FILING VALUES:
FORM TYPE: 4
SEC ACT: 1934 Act
SEC FILE NUMBER: 811-00248
FILM NUMBER: 11530052
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET
STREET 2: STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
IRS NUMBER: 134912740
STATE OF INCORPORATION: MD
FISCAL YEAR END: 1231
BUSINESS ADDRESS:
STREET 1: SEVEN ST PAUL ST STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
BUSINESS PHONE: 4107525900
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET SUITE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
I want to save the owner's name (John Doe) and identifier (99999999999) and the company's name (ACME Inc) and identifier (0000002230) as separate variables. However, as you can see, the variable names (CENTRAL INDEX KEY and COMPANY CONFORMED NAME) are exactly the same for both pieces of information.
I've used the following code to extract the owner's information, but I can't figure out how to extract the data for the company. (Note: I read the entire text file into $data).
if($data=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1;}
if($data=~m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m){$name=$1;}
Any idea as to how I can extract the information for both the owner and the company?
Thanks!
There is a big difference between doing it quick and dirty with regexes (maintenance nightmare), or doing it right.
As it happens, the file you gave looks very much like YAML.
use YAML;
my $data = Load(...);
say $data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"};
say $data->{"ISSUER"}->{"COMPANY DATA"}->{"COMPANY CONFORMED NAME"};
Prints:
DOE JOHN
ACME INC
Isn't that cool? All in a few lines of safe and maintainable code ☺
my ($ownname, $ownkey, $comname, $comkey) = $data =~ /\bOWNER DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+).*\bCOMPANY DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+)/ms;
If you're reading this file on a UNIX operating system but it was generated on Windows, then line endings will be indicated by the character pair \r\n instead of just \n, and in this case you should do
$data =~ tr/\r//d;
first to get rid of these \r characters and prevent them from finding their way into $ownname and $comname.
Select both bits of information at the same time so that you know that you're getting the CENTRAL INDEX KEY associated with either the owner or the company.
($name, $cik) = $data =~ /COMPANY\s+CONFORMED\s+NAME:\s+(.+)$\s+CENTRAL\s+INDEX\s+KEY:\s+(.*)$/m;
Instead of trying to match elements in the string, split it into lines and parse them into a data structure that makes such lookups easy, like:
$data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"}
That should be relatively easy to do.
Search for the OWNER DATA: header, read one more line, split on : and take the last field. Do the same for the COMPANY DATA: header (sort of), and so on.
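The line-by-line idea could be sketched like this in Python. It is only an illustration under a couple of assumptions: the name and key always sit on the two lines right after the OWNER DATA: or COMPANY DATA: header (so I read two lines rather than one), and the function and variable names are made up for the example.

```python
def extract_entities(text):
    """Map each 'OWNER DATA' / 'COMPANY DATA' header to the two fields
    that follow it (assumed to be the conformed name and the index key)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    found = {}
    for i, line in enumerate(lines):
        if line in ("OWNER DATA:", "COMPANY DATA:"):
            fields = {}
            for field_line in lines[i + 1:i + 3]:
                # Split on the first colon; the value is what comes after it.
                key, _, value = field_line.partition(":")
                fields[key.strip()] = value.strip()
            found[line.rstrip(":")] = fields
    return found


sample = """REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
FILING VALUES:
FORM TYPE: 4
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
IRS NUMBER: 134912740
"""
entities = extract_entities(sample)
print(entities["OWNER DATA"]["COMPANY CONFORMED NAME"])
print(entities["COMPANY DATA"]["CENTRAL INDEX KEY"])
```

Keeping the keys as strings preserves the leading zeros in 0000002230.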

Regex Pattern for String including newline characters

I am looking for a regex pattern that will return a match from %PDF-1.2 to and including %%EOF in the string below.
So far my patterns don't seem to work.
DOCUMENTS ACCEPTED
001//201//0E9136614////ACME 107 PTY LTD//8
**E10 End of validation report**
BDAT 4367 LAST
XSVBOUT
001XSVSEPRXXXOUT_TP.19
ZHDASCRA55 0700 8
ZCO*** TEST DATABASE ***ACME 107 PTY LTD 551824563 APTY LMSH PDF NSW 20111217 PNPC
ZIL 77000030149 Australian Securities and Investments Commission 86768265615 ZUMESOFT SOLUTIONS PTY LTD 61 buxton st north adelaide SA 5006
ZIAProprietary Company 42600 0E9136614 201 TAX INVOICE EXE 0 0E9136614201C PA 20111217 Not Subject to GST - Treasurer's Determination (Exempt Taxes, Fees and Charges)
ZTRENDRA55 5
%PDF-1.2
%????
3495
%%EOF
BDAT 11 LAST
/(?s)(%PDF-1\.2.+%%EOF)/ should solve your problem. The (?s) flag lets the dot match newline characters.
If your regex flavor does not support the inline (?s) flag, move it to the end of the pattern as the s modifier instead, e.g. /(%PDF-1\.2.+%%EOF)/s.
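In Python the same thing can be sketched with the re.DOTALL flag, which plays the role of (?s). The message string below is a shortened stand-in for the data in the question.

```python
import re

message = "ZTRENDRA55 5\n%PDF-1.2\n%????\n3495\n%%EOF\nBDAT 11 LAST\n"

# re.DOTALL lets "." match newlines, so .+ can span the whole PDF body.
match = re.search(r"%PDF-1\.2.+%%EOF", message, re.DOTALL)
pdf_block = match.group(0) if match else None
print(repr(pdf_block))
```

Note that .+ is greedy: if the input held several %%EOF markers, the match would run to the last one, and .+? would stop at the first instead.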

Regular Expressions task

Below is an example of a text file I need to parse.
Lead Attorney: John Doe
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
Geographic Area: Wisconsin
Affiliated Offices: None
E-mail: blah@blah.com
I need to parse all the key/value pairs and import it into a database. For example, I will insert 'John Doe' into the [Lead Attorney] column. I started a regex but I'm running into problems when parsing line 2:
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
I started with the following regex:
(\w*.?\w+):\s*(.)(?!(\w.?\w+:.*))
But that does not parse out 'Staff Attorneys: John Doe Jr.' and 'Paralegal: John Doe III'. How can I ensure that my regex returns two groups for every key/value pair even if the key/value pairs are on the same line? Thanks!
Does any kind of key appear as a second key? The text above can be fixed by doing a data.replace('Paralegal:', '\nParalegal:') first. Then there is only one key/value pair per line, and it gets trivial:
>>> data = """Lead Attorney: John Doe
... Staff Attorneys: John Doe Jr. Paralegal: John Doe III
... Geographic Area: Wisconsin
... Affiliated Offices: None
... E-mail: blah@blah.com"""
>>>
>>> result = {}
>>> data = data.replace('Paralegal:', '\nParalegal:')
>>> for line in data.splitlines():
...     key, val = line.split(':', 1)
...     result[key.strip()] = val.strip()
...
>>> print(result)
{'Staff Attorneys': 'John Doe Jr.', 'Lead Attorney': 'John Doe', 'Paralegal': 'John Doe III', 'Affiliated Offices': 'None', 'Geographic Area': 'Wisconsin', 'E-mail': 'blah@blah.com'}
If "Paralegal:" also appears first you can make a regexp to do the replacement only when it's not first, or make a .find and check that the character before is not a newline. If there are several keywords that can appear like this, you can make a list of keywords, etc.
If the keywords can be anything, but only one word, you can look for ':' and parse backwards for space, which can be done with regexps.
If the keywords can be anything and include spaces, it's impossible to do automatically.
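For the "list of keywords" case, one possible sketch is to build the regex from the list, so a value ends wherever the next known key (or the end of the line) begins. KNOWN_KEYS is a hypothetical list you would have to maintain yourself, and the data is an excerpt of the sample from the question.

```python
import re

# Hypothetical list of every key that can occur in the file.
KNOWN_KEYS = ["Lead Attorney", "Staff Attorneys", "Paralegal",
              "Geographic Area", "Affiliated Offices"]

data = """Lead Attorney: John Doe
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
Geographic Area: Wisconsin
Affiliated Offices: None"""

keys = "|".join(re.escape(key) for key in KNOWN_KEYS)
# A value runs lazily until the next known key followed by a colon,
# or until the end of the line, whichever comes first.
pair_re = re.compile(rf"({keys}):\s*(.*?)\s*(?=(?:{keys}):|$)", re.MULTILINE)
result = {key: value for key, value in pair_re.findall(data)}
print(result)
```

This handles a key appearing either first or second on a line without a separate replace step.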

Search then Extract

I have a text file with multiple records. I want to search by name and date; for example, if I type JULIUS CESAR as the name, the whole record for JULIUS should be extracted. And what if I want to extract only the Information field?
Record number: 1
Date: 08-Oct-08
Time: 23:45:01
Name: JULIUS CESAR
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Computer Engineering student
An OJT at TIPI.
23 years old.
Record number: 2
Date: 09-Oct-08
Time: 23:45:01
Name: JOHN Castro
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Electronics Comm. Engineering Student at SLU.
An OJT at TIPI.
My Hobby is Programming.
Record number: 3
Date: 08-Oct-08
Time: 23:45:01
Name: CESAR JOSE
Address: BAGUIO CITY, Philippines
Information:
Hi,,
I lived Manila City
A Computer Engineering student
Working at TIPI.
If it is one line per entry, you could use a regular expression such as:
$name = "JULIUS CESAR";
Then use:
/$name/i
to test if each line is about "JULIUS CESAR." Then you simply have to use the following regex to extract the information (once you find the line):
/Record number: (\d+) Date: (\d+)-(\w+)-(\d+) Time: (\d+):(\d+):(\d+) Name: $name Address: ([\w\s]+), ([\w\s]+?) Information: (.+?)$/i
$1 = record number
$2-$4 = date
$5-$7 = time
$8-$9 = address
$10 = comments (the Information field)
I would write a code example, but my perl is rusty. I hope this helps :)
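Since the records in the question actually span several lines, here is a hedged Python sketch of the multi-line case. It assumes every record starts with "Record number:" and that the Information field is the last thing in a record; find_information is a name I made up, and the sample is a trimmed copy of the data above.

```python
import re

def find_information(text, name):
    """Return the Information block of the first record whose Name matches
    (case-insensitively), or None if no record matches."""
    # Each record runs from "Record number:" up to the next one (or the end).
    for record in re.finditer(r"Record number:.*?(?=Record number:|\Z)", text, re.DOTALL):
        block = record.group(0)
        name_match = re.search(r"^Name:\s*(.+?)\s*$", block, re.MULTILINE)
        if name_match and name_match.group(1).lower() == name.lower():
            info = re.search(r"^Information:\s*(.*)", block, re.MULTILINE | re.DOTALL)
            return info.group(1).strip() if info else ""
    return None


records = """Record number: 1
Date: 08-Oct-08
Time: 23:45:01
Name: JULIUS CESAR
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Computer Engineering student
An OJT at TIPI.
23 years old.
Record number: 2
Date: 09-Oct-08
Time: 23:45:01
Name: JOHN Castro
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
My Hobby is Programming.
"""
print(find_information(records, "JULIUS CESAR"))
```

Returning block instead of the Information match would give the whole record; filtering on the Date line could be added the same way as the Name check.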
In PHP, you can run a SQL select statement like:
"SELECT * WHERE name LIKE 'JULIUS%';"
There are native aspects of PHP where you can get all of your results in an associative array. I'm pretty sure it's ordered by row order. Then you can just do something like this:
echo implode(" ", $whole_row);
Hope this is what you're looking for!