I've done some research and I'm struggling to figure out how to answer this question. I have the following text and I want to extract the zip code in the business address field:
BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001
This code captures the last instance (77001):
(BUSINESS\s*ADDRESS:)(.*)(ZIP:\s*)(.*)
How can I capture the first zip code (77027)?
Thanks for helping a noob.
In your example you just need to add a question mark to make the quantifier lazy, (.*?), and specify that the zip consists only of digits:
BUSINESS\s*ADDRESS:.*?ZIP:\s*(\d+)
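By default the asterisk and plus quantifiers are greedy, so .* runs as far as it can and you end up at the last ZIP. Note that . must be allowed to match newlines for this to work (the /s modifier in Perl).
There is also no need to capture anything other than the zip code.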
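The difference between the greedy and the lazy quantifier can be checked with a small sketch. This is a Python translation rather than the asker's Perl, and the variable names are only illustrative; re.DOTALL is used here as the counterpart of Perl's /s modifier.

```python
import re

text = """BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001"""

# re.DOTALL lets "." cross line breaks, like Perl's /s modifier.
greedy = re.search(r"BUSINESS\s*ADDRESS:.*ZIP:\s*(\d+)", text, re.DOTALL)
lazy = re.search(r"BUSINESS\s*ADDRESS:.*?ZIP:\s*(\d+)", text, re.DOTALL)

greedy_zip = greedy.group(1)  # .* grabs everything, then backtracks to the last ZIP
lazy_zip = lazy.group(1)      # .*? expands only until the first ZIP is reachable
print(greedy_zip, lazy_zip)
```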
Given:
my $tgt="BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001";
You can do:
print "$1: $2\n" while $tgt=~/^(\S[^:]+):[^\n]*\n.*?^\s+ZIP:\s+(\d+)/gms;
Prints:
BUSINESS ADDRESS: 77027
MAIL ADDRESS: 77001
With the same method you can construct a hash mapping each address block to its zip.
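A rough sketch of that mapping, written in Python with a dict standing in for the Perl hash. It assumes block headers are lines of the form "... ADDRESS:" with nothing after the colon, which holds for the sample but may not hold for every file; zip_by_block is just an illustrative name.

```python
import re

text = """BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001"""

zip_by_block = {}
current = None  # name of the address block we are currently inside
for raw in text.splitlines():
    line = raw.strip()
    header = re.fullmatch(r"([A-Z ]+ADDRESS):", line)
    if header:
        current = header.group(1)
        continue
    zip_match = re.fullmatch(r"ZIP:\s*(\d+)", line)
    if zip_match and current is not None:
        zip_by_block[current] = zip_match.group(1)

print(zip_by_block)
```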
The match operator running in list context will return all the matching values that were found. So you could do something like this:
my $data = '
BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001
';
my @allzips = ($data =~ /ZIP:\s*(\d+)/g);
foreach my $zip (@allzips) {
print "Found ZIP: $zip\n";
}
Which prints:
Found ZIP: 77027
Found ZIP: 77001
For those about to awk...
There is a tested version below, given that the file is named test.txt in current directory:
awk '{if ($0 ~ /BUSINESS ADDRESS:/) { inzone=1; } if (inzone) {if ($0 ~ /ZIP:/) { print $2; } else if ($0 ~ /MAIL ADDRESS:/) { inzone=0; }}}' test.txt
It will print the second field for all lines containing ZIP:, but only the lines encountered in a block between a line containing BUSINESS ADDRESS: and another line containing MAIL ADDRESS:
The test is below:
awk '{if ($0 ~ /BUSINESS ADDRESS:/) { inzone=1; } if (inzone) {if ($0 ~ /ZIP:/) { print $2; } else if ($0 ~ /MAIL ADDRESS:/) { inzone=0; }}}' test.txt
77027
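For comparison, the same flag-based scan can be sketched in Python. The function name business_zips is made up for this example; the logic mirrors the awk one-liner, including taking the second whitespace-separated field of the ZIP line.

```python
def business_zips(lines):
    """Collect ZIP values seen between BUSINESS ADDRESS: and MAIL ADDRESS:."""
    in_zone = False
    found = []
    for line in lines:
        if "BUSINESS ADDRESS:" in line:
            in_zone = True
        if in_zone:
            if "ZIP:" in line:
                found.append(line.split()[1])  # same as awk's $2
            elif "MAIL ADDRESS:" in line:
                in_zone = False
    return found


sample = [
    "BUSINESS ADDRESS:",
    "STREET 1: 101 AWESOME DRIVE",
    "ZIP: 77027",
    "BUSINESS PHONE: 7138675309",
    "MAIL ADDRESS:",
    "STREET 1: P O BOX 144",
    "ZIP: 77001",
]
zips = business_zips(sample)
print(zips)
```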
I want to use the sub function to format the string
"Ross McFluff: 0456-45324: 155 Elm Street\nRonald Heathmore: 5543-23464: 445 Finley Avenue".
For each person it should look like this:
Contact
Name: xx yy
Phone number: 0000-00000
Address: 000 zzz zzz
I tried to resolve the problem:
import re
line = """Ross McFluff: 0456-45324: 155 Elm Street \nRonald Heathmore: 5543-23464: 445 Finley Avenue"""
match = re.sub(r':', r'', line)
rematch = re.sub(r'([A-Z][a-z]+\s[A-Z][a-zA-Z]+)(.*?)(\d\d\d\d-\d\d\d\d\d)', r'Contact. Name: \1. Phone number: \3. Address:\2', match)
I got something like this :
"Contact. Name: Ross McFluff. Phone number: 0456-45324. Address: 155 Elm Street \nContact. Name: Ronald Heathmore. Phone number: 5543-23464. Address: 445 Finley Avenue"
How can I get this result:
Contact
Name: Ross McFluff
Phone number: 0456-45324
Address: 155 Elm Street
Contact
Name: Ronald Heathmore
Phone number: 5543-23464
Address: 445 Finley Avenue
Any idea? thanks
/Georges
I would toss a split in there like this:
import re
data = """Ross McFluff: 0456-45324: 155 Elm Street \nRonald Heathmore: 5543-23464: 445 Finley Avenue"""
linelist = data.split("\n")
for theline in linelist:
    rematch = re.sub('([^:]+): ([^:]+): (.*)', r'Contact\nName: \1\nPhone Number: \2\nAddress: \3', theline)
    print (rematch)
results:
Contact
Name: Ross McFluff
Phone Number: 0456-45324
Address: 155 Elm Street
Contact
Name: Ronald Heathmore
Phone Number: 5543-23464
Address: 445 Finley Avenue
That way you can easily process each "line". I really like using stuff like:
([^:]+)
That's a negated character class: it matches any character that is NOT in the class, which is really what you want here (everything up to the next colon). You could also just split on the colons, but a regex like this gives you more control. You may need to strip leading and trailing whitespace from the captured fields (note the trailing space after "Street" in the input); how much cleanup you need depends on what you are doing with the data.
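The plain split-on-colons alternative might look like the sketch below. It assumes every record has exactly three colon-separated fields; format_contact is a hypothetical helper name, and strip() handles the whitespace cleanup.

```python
line = "Ross McFluff: 0456-45324: 155 Elm Street \nRonald Heathmore: 5543-23464: 445 Finley Avenue"


def format_contact(record):
    # Split at the first two colons only, then trim each field.
    name, phone, address = (part.strip() for part in record.split(":", 2))
    return f"Contact\nName: {name}\nPhone number: {phone}\nAddress: {address}"


formatted = "\n".join(format_contact(r) for r in line.splitlines())
print(formatted)
```

Limiting the split to two colons keeps an address that happens to contain a colon in one piece.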
If you need to go with a pure regex solution, it can be done by just fiddling around on here: https://regex101.com/
I tend to prefer a repetition count (such as \d{4}) when I can. I am not sure how your first attempt came back correctly; I assume that was an anomaly. The pattern below should work: the name, number, and address end up in \1, \3, and \5, and the address is read through to the end of each line. (I used a generic regex tester to check it.)
([A-Z][a-z]+\s[A-Z][a-zA-Z]+)(.*?)(\d{4}-\d{5})(.*?)([\w+ ]+)
I am trying to understand an example from a website.
People2.txt is as follows.
2323:Doe John California
827:Doe Jane Texas
982982:Neuman Alfred Nebraska
I don't get the output shown for the command below.
PS C:\> Get-Content people2.txt | %{$data = [regex]::split($_, '\t|:'); Write-Output "$($data[2]) $($data[1]), $($data[3])"}
John Doe, California
Jane Doe, Texas
Alfred Neuman, Nebraska
I was able to strip the numbers and swap the first and second fields using:
gc C:\appl\ppl.txt | %{$data = [regex]::split($_, ":") ;write-output $data[1] } | Out-File c:\appl\ppll.txt
gc C:\appl\ppll.txt | %{$data = $_.split(" "); Write-Output "$($data[1]) $($data[0]), $($data[2])"}
Please help
I need to find a more efficient way to do this.
Also, I want to understand '\t|:'. Does it mean "split at the first tab stop and at a :"?
Off the top of my head (the file lists the last name first): ^(?<number>\d+):(?<last>\w+)\s+(?<first>\w+)\s(?<location>.*)$
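To make both points concrete, here is a sketch in Python rather than PowerShell. It assumes the fields in people2.txt are separated by spaces, as they appear in the question, and uses group names that follow the file's surname-first order. The pattern '\t|:' is an alternation, so it splits at every tab or colon, not just the first; with space-separated data that leaves only two pieces, which would explain the missing output.

```python
import re

lines = [
    "2323:Doe John California",
    "827:Doe Jane Texas",
    "982982:Neuman Alfred Nebraska",
]

# Python spells named groups (?P<name>...) where .NET uses (?<name>...).
pattern = re.compile(
    r"^(?P<number>\d+):(?P<last>\w+)\s+(?P<first>\w+)\s+(?P<location>.*)$"
)


def reorder(line):
    m = pattern.match(line)
    if m is None:
        return None
    return f"{m['first']} {m['last']}, {m['location']}"


results = [reorder(entry) for entry in lines]
print(results)

# "\t|:" means "a tab OR a colon", applied at every occurrence.
tab_fields = re.split(r"\t|:", "2323:Doe\tJohn\tCalifornia")
space_fields = re.split(r"\t|:", "2323:Doe John California")
print(tab_fields, space_fields)
```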
I have the following string:
> show box detail
2 boxes:
1) Box ID: 1
IP: 127.0.0.1
Interface: 1/1
Priority: 31
2) Box ID: 2
IP: 192.68.1.1
Interface: 1/2
Priority: 31
How can I get the Box IDs from the above string in Perl?
The number of boxes can vary. Given n boxes, how do I extract the box IDs if the show box detail output can go up to n entries in the same format?
my @ids = $string =~ /Box ID: ([0-9]+)/g;
More restrictive:
my @ids = $string =~ /^[0-9]+\) Box ID: ([0-9]+)$/mg;
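For readers who want to try the restrictive pattern outside Perl, a Python equivalent is sketched below. re.findall plays the role of the /g match in list context and re.MULTILINE the role of /m; the sample assumes the "n) Box ID:" lines start at column one, as shown in the question.

```python
import re

output = """> show box detail
2 boxes:
1) Box ID: 1
IP: 127.0.0.1
Interface: 1/1
Priority: 31
2) Box ID: 2
IP: 192.68.1.1
Interface: 1/2
Priority: 31"""

# ^ and $ anchor to each line because of re.MULTILINE.
box_ids = re.findall(r"^[0-9]+\) Box ID: ([0-9]+)$", output, re.MULTILINE)
print(box_ids)
```

The list grows with the number of boxes, so nothing has to change when n varies.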
I have a question related to using regex to pull out data from a text file. I have a text file in the following format:
REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
FILING VALUES:
FORM TYPE: 4
SEC ACT: 1934 Act
SEC FILE NUMBER: 811-00248
FILM NUMBER: 11530052
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET
STREET 2: STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
IRS NUMBER: 134912740
STATE OF INCORPORATION: MD
FISCAL YEAR END: 1231
BUSINESS ADDRESS:
STREET 1: SEVEN ST PAUL ST STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
BUSINESS PHONE: 4107525900
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET SUITE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
I want to save the owner's name (John Doe) and identifier (99999999999) and the company's name (ACME Inc) and identifier (0000002230) as separate variables. However, as you can see, the variable names (CENTRAL INDEX KEY and COMPANY CONFORMED NAME) are exactly the same for both pieces of information.
I've used the following code to extract the owner's information, but I can't figure out how to extract the data for the company. (Note: I read the entire text file into $data).
if($data=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1;}
if($data=~m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m){$name=$1;}
Any idea as to how I can extract the information for both the owner and the company?
Thanks!
There is a big difference between doing it quick and dirty with regexes (maintenance nightmare), or doing it right.
As it happens, the file you gave looks very much like YAML.
use feature 'say';   # needed for say()
use YAML;
my $data = Load(...);
say $data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"};
say $data->{"ISSUER"}->{"COMPANY DATA"}->{"COMPANY CONFORMED NAME"};
Prints:
DOE JOHN
ACME INC
Isn't that cool? All in a few lines of safe and maintainable code ☺
my ($ownname, $ownkey, $comname, $comkey) = $data =~ /\bOWNER DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+).*\bCOMPANY DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+)/ms;
If you're reading this file on a UNIX operating system but it was generated on Windows, then line endings will be indicated by the character pair \r\n instead of just \n, and in this case you should do
$data =~ tr/\r//d;
first to get rid of these \r characters and prevent them from finding their way into $ownname and $comname.
Select both bits of information at the same time so that you know that you're getting the CENTRAL INDEX KEY associated with either the owner or the company.
($name, $cik) = $data =~ /COMPANY\s+CONFORMED\s+NAME:\s+(.+)$\s+CENTRAL\s+INDEX\s+KEY:\s+(.*)$/m;
Instead of trying to match elements in the string, split it into lines and parse it properly into a data structure that makes such lookups easy, like:
$data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"}
That should be relatively easy to do.
Search for the OWNER DATA: header, read one more line, split on : and take the last field. Do the same for the COMPANY DATA: header, and so on.
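The header-then-next-lines approach could be sketched as follows, in Python rather than Perl. fields_after is a hypothetical helper, and the sketch assumes the name and key lines directly follow the section header, as they do in the sample file.

```python
def fields_after(lines, header, count=2):
    """Return {label: value} for the `count` lines that follow `header`."""
    for i, line in enumerate(lines):
        if line.strip() == header:
            result = {}
            for follow in lines[i + 1 : i + 1 + count]:
                label, _, value = follow.partition(":")
                result[label.strip()] = value.strip()
            return result
    return {}


lines = """REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
FILING VALUES:
FORM TYPE: 4
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
IRS NUMBER: 134912740""".splitlines()

owner = fields_after(lines, "OWNER DATA:")
company = fields_after(lines, "COMPANY DATA:")
print(owner["COMPANY CONFORMED NAME"], owner["CENTRAL INDEX KEY"])
print(company["COMPANY CONFORMED NAME"], company["CENTRAL INDEX KEY"])
```

Because each lookup is tied to its own header, the identical field labels no longer collide.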
I have a text file with multiple records. I want to search by name and date; for example, if I type JULIUS CESAR as the name, the whole record for JULIUS should be extracted. What if I only want to extract the Information section?
Record number: 1
Date: 08-Oct-08
Time: 23:45:01
Name: JULIUS CESAR
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Computer Engineering student
An OJT at TIPI.
23 years old.
Record number: 2
Date: 09-Oct-08
Time: 23:45:01
Name: JOHN Castro
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Electronics Comm. Engineering Student at SLU.
An OJT at TIPI.
My Hobby is Programming.
Record number: 3
Date: 08-Oct-08
Time: 23:45:01
Name: CESAR JOSE
Address: BAGUIO CITY, Philippines
Information:
Hi,,
I lived Manila City
A Computer Engineering student
Working at TIPI.
If it is one line per entry, you could use a regular expression such as:
$name = "JULIUS CESAR";
Then use:
/$name/i
to test whether each line is about "JULIUS CESAR". Then you simply use the following regex to extract the information (once you find the line):
/Record number: (\d+) Date: (\d+)-(\w+)-(\d+) Time: (\d+):(\d+):(\d+) Name: $name Address: ([\w\s]+), ([\w\s]+?) Information: (.+?)$/i
$1 = record number
$2-$4 = date
$5-$7 = time
$8-$9 = address (city, country)
$10 = information
I would write a code example, but my perl is rusty. I hope this helps :)
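If the records span several lines, as they do in the question, a line-by-line scan may be easier than one large regex. The sketch below is in Python, not Perl, and makes a few assumptions: each record starts with a "Record number:" line, the name must match the whole Name: field (case-insensitively), and everything after "Information:" up to the next record belongs to that section. information_for is a made-up function name.

```python
def information_for(text, name):
    """Return the Information lines of the first record whose Name matches."""
    target = name.casefold()
    matched = False      # True once the current record's Name matches
    collecting = False   # True once we are past its "Information:" line
    info = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Record number:"):
            if matched:
                break    # the next record begins, so we are done
            continue
        if line.startswith("Name:"):
            matched = line[len("Name:"):].strip().casefold() == target
        elif matched and line == "Information:":
            collecting = True
        elif collecting and line:
            info.append(line)
    return info


sample = """Record number: 1
Date: 08-Oct-08
Time: 23:45:01
Name: JULIUS CESAR
Address: BAGUIO CITY, Philippines
Information:
A Computer Engineering student
23 years old.
Record number: 2
Date: 09-Oct-08
Time: 23:45:01
Name: JOHN Castro
Address: BAGUIO CITY, Philippines
Information:
My Hobby is Programming.
"""

julius = information_for(sample, "julius cesar")
print(julius)
```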
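In PHP, if the records are stored in a database table, you can run a SQL select statement like the following (records is a hypothetical table name):
"SELECT * FROM records WHERE name LIKE 'JULIUS%';"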
There are native aspects of PHP where you can get all of your results in an associative array. I'm pretty sure it's ordered by row order. Then you can just do something like this:
echo implode(" ", $whole_row);
Hope this is what you're looking for!