Using Regex to capture content after first occurrence of a string

I've done some research and I'm struggling to figure out how to answer this question. I have the following text and I want to extract the zip code in the business address field:
BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001
This code captures the last instance (77001):
(BUSINESS\s*ADDRESS:)(.*)(ZIP:\s*)(.*)
How can I capture the first zip code (77027)?
Thanks for helping a noob.

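Well, in your example you just need to add a question mark, making it (.*?), and specify that the zip consists only of digits:
BUSINESS\s*ADDRESS:.*?ZIP:\s*(\d+)
By default, the asterisk and plus quantifiers are greedy; the question mark makes them lazy, so the match stops at the first ZIP it can reach.
There is also no need to capture anything other than the zip code.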
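To see the greedy versus lazy difference side by side, here is a minimal Python sketch (Python is my choice for illustration; the question does not name a language). It assumes the whole text is held in one string and uses re.DOTALL so that the dot can cross line breaks.

```python
import re

text = """BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001"""

# Lazy ".*?" expands only as far as needed, so it stops at the first ZIP.
lazy = re.search(r"BUSINESS\s*ADDRESS:.*?ZIP:\s*(\d+)", text, re.DOTALL)
# Greedy ".*" grabs everything and backtracks, so it lands on the last ZIP.
greedy = re.search(r"BUSINESS\s*ADDRESS:.*ZIP:\s*(\d+)", text, re.DOTALL)

first_zip = lazy.group(1)
last_zip = greedy.group(1)
print(first_zip, last_zip)  # prints: 77027 77001
```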

Given:
my $tgt="BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001";
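You can do the following (the pattern expects the field lines to be indented under each unindented header line):
print "$1: $2\n" while $tgt=~/^(\S[^:]+):[^\n]*\R.*?^\s+ZIP:\s+(\d+)/gms;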
Prints:
BUSINESS ADDRESS: 77027
MAIL ADDRESS: 77001
With the same method you can construct a hash mapping each address block to its zip code.
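As a rough illustration of that mapping, here is a Python sketch rather than Perl. It assumes a header is any line that has nothing after its colon; the name BLOCK_ZIP is made up for this example, and the pattern is my adaptation, not the one from the answer above.

```python
import re

text = """BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001"""

# Group 1: a header line (nothing but optional blanks after the colon).
# Group 2: the digits of the first ZIP line found below that header.
BLOCK_ZIP = re.compile(
    r"^([^:\n]+):[ \t]*\n.*?^\s*ZIP:\s*(\d+)",
    re.MULTILINE | re.DOTALL,
)

# findall returns (header, zip) tuples, which dict() turns into a mapping.
zip_by_block = dict(BLOCK_ZIP.findall(text))
print(zip_by_block)
```

Looking up zip_by_block["BUSINESS ADDRESS"] then gives the first zip code directly.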

With the /g flag, the match operator in list context returns every captured value that was found, so the first zip code is simply the first element of the list. You could do something like this:
my $data = '
BUSINESS ADDRESS:
STREET 1: 101 AWESOME DRIVE
STREET 2: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77027
BUSINESS PHONE: 7138675309
MAIL ADDRESS:
STREET 1: P O BOX 144
CITY: HOUSTON
STATE: TX
ZIP: 77001
';
my @allzips = ($data =~ /ZIP:\s*(\d+)/g);
foreach my $zip (@allzips) {
print "Found ZIP: $zip\n";
}
Which prints:
Found ZIP: 77027
Found ZIP: 77001

For those about to awk...
Here is a tested version, assuming the file is named test.txt and sits in the current directory:
awk '{if ($0 ~ /BUSINESS ADDRESS:/) { inzone=1; } if (inzone) {if ($0 ~ /ZIP:/) { print $2; } else if ($0 ~ /MAIL ADDRESS:/) { inzone=0; }}}' test.txt
It will print the second field for all lines containing ZIP:, but only the lines encountered in a block between a line containing BUSINESS ADDRESS: and another line containing MAIL ADDRESS:
The test is below:
awk '{if ($0 ~ /BUSINESS ADDRESS:/) { inzone=1; } if (inzone) {if ($0 ~ /ZIP:/) { print $2; } else if ($0 ~ /MAIL ADDRESS:/) { inzone=0; }}}' test.txt
77027
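The same flag-based scan can be sketched in Python for anyone who wants to compare. This is only a loose translation of the awk one-liner; the function name zips_in_business_block is invented for the example.

```python
def zips_in_business_block(lines):
    """Collect ZIP values seen between BUSINESS ADDRESS: and MAIL ADDRESS:."""
    in_zone = False
    found = []
    for line in lines:
        if "BUSINESS ADDRESS:" in line:
            in_zone = True
        if in_zone:
            if "ZIP:" in line:
                fields = line.split()
                # Second whitespace-separated field, like $2 in awk.
                if len(fields) > 1:
                    found.append(fields[1])
            elif "MAIL ADDRESS:" in line:
                in_zone = False
    return found


sample = [
    "BUSINESS ADDRESS:",
    "STREET 1: 101 AWESOME DRIVE",
    "ZIP: 77027",
    "BUSINESS PHONE: 7138675309",
    "MAIL ADDRESS:",
    "STREET 1: P O BOX 144",
    "ZIP: 77001",
]
business_zips = zips_in_business_block(sample)
print(business_zips)  # prints: ['77027']
```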

Related

Splitting with a regular expression and accessing an array of arrays

I am trying to understand an example from a website.
The file people2.txt is as follows:
2323:Doe John California
827:Doe Jane Texas
982982:Neuman Alfred Nebraska
I don't get the output shown for the command below.
PS C:\> Get-Content people2.txt | %{$data = [regex]::split($_, '\t|:'); Write-Output "$($data[2]) $($data[1]), $($data[3])"}
John Doe, California
Jane Doe, Texas
Alfred Neuman, Nebraska
I was able to strip the numbers and swap the first and second names using:
gc C:\appl\ppl.txt | %{$data = [regex]::split($_, ":") ;write-output $data[1] } | Out-File c:\appl\ppll.txt
gc C:\appl\ppll.txt | %{$data = $_.split(" "); Write-Output "$($data[1]) $($data[0]), $($data[2])"}
Please help. I need to find a more efficient way to do this.
Also, I want to understand '\t|:'. Does it mean 'split at the first tab stop and at a colon'?

Just threw this off the top of my head: ^(?<number>\d+):(?<first>\w+)\s+(?<last>\w+)\s(?<location>.*)$
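On the '\t|:' part of the question: the pattern is an alternation, so it matches either a tab or a colon, and the split happens at every such character, not only the first. A small Python sketch of the behaviour follows (re.split stands in for [regex]::split here). The tab-separated sample line is an assumption about what the website's file looked like, and the last pattern is just one possible way to accept spaces as well.

```python
import re

tab_line = "2323:Doe\tJohn\tCalifornia"   # fields separated by tabs
space_line = "2323:Doe John California"   # fields separated by spaces

# Splits at every tab or colon.
tab_fields = re.split(r"\t|:", tab_line)
# With spaces instead of tabs only the colon matches, so just two fields come back.
space_fields = re.split(r"\t|:", space_line)
# Accepting runs of tabs or spaces as well as a colon recovers all four fields.
flexible_fields = re.split(r"[\t ]+|:", space_line)

number, last, first, state = flexible_fields
formatted = f"{first} {last}, {state}"
print(formatted)  # prints: John Doe, California
```

If the file uses spaces rather than tabs, that would explain why the original command did not produce the expected output.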

How to parse through a string in perl to extract certain value?

I have the following string:
> show box detail
2 boxes:
1) Box ID: 1
IP: 127.0.0.1
Interface: 1/1
Priority: 31
2) Box ID: 2
IP: 192.68.1.1
Interface: 1/2
Priority: 31
How can I get the Box ID values from the above string in Perl?
The number of boxes can vary. Given "n" boxes, how do I extract all the box IDs if the show box detail output can go up to n nodes in the same format?

my @ids = $string =~ /Box ID: ([0-9]+)/g;
More restrictive:
my @ids = $string =~ /^[0-9]+\) Box ID: ([0-9]+)$/mg;
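For comparison, roughly the same two extractions in Python, where re.findall plays the role of a /g match in list context. The sample text is copied from the question and assumes the lines are not indented.

```python
import re

output = """> show box detail
2 boxes:
1) Box ID: 1
IP: 127.0.0.1
Interface: 1/1
Priority: 31
2) Box ID: 2
IP: 192.68.1.1
Interface: 1/2
Priority: 31
"""

# Loose version: every "Box ID: <digits>" anywhere in the text.
ids = re.findall(r"Box ID: ([0-9]+)", output)

# Stricter version: the whole line must look like "<n>) Box ID: <digits>".
strict_ids = re.findall(r"^[0-9]+\) Box ID: ([0-9]+)$", output, re.MULTILINE)

print(ids, strict_ids)  # prints: ['1', '2'] ['1', '2']
```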

Using Perl to extract text from a text file

I have a question related to using regex to pull out data from a text file. I have a text file in the following format:
REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
FILING VALUES:
FORM TYPE: 4
SEC ACT: 1934 Act
SEC FILE NUMBER: 811-00248
FILM NUMBER: 11530052
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET
STREET 2: STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
IRS NUMBER: 134912740
STATE OF INCORPORATION: MD
FISCAL YEAR END: 1231
BUSINESS ADDRESS:
STREET 1: SEVEN ST PAUL ST STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
BUSINESS PHONE: 4107525900
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET SUITE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
I want to save the owner's name (John Doe) and identifier (99999999999) and the company's name (ACME Inc) and identifier (0000002230) as separate variables. However, as you can see, the field names (CENTRAL INDEX KEY and COMPANY CONFORMED NAME) are exactly the same for both pieces of information.
I've used the following code to extract the owner's information, but I can't figure out how to extract the data for the company. (Note: I read the entire text file into $data).
if($data=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1;}
if($data=~m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m){$name=$1;}
Any idea as to how I can extract the information for both the owner and the company?
Thanks!

There is a big difference between doing it quick and dirty with regexes (maintenance nightmare), or doing it right.
As it happens, the file you gave looks very much like YAML.
use YAML;
use feature 'say';  # needed for say
my $data = Load(...);
say $data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"};
say $data->{"ISSUER"}->{"COMPANY DATA"}->{"COMPANY CONFORMED NAME"};
Prints:
DOE JOHN
ACME INC
Isn't that cool? All in a few lines of safe and maintainable code ☺

my ($ownname, $ownkey, $comname, $comkey) = $data =~ /\bOWNER DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+).*\bCOMPANY DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+)/ms;
If you're reading this file on a UNIX operating system but it was generated on Windows, then line endings will be indicated by the character pair \r\n instead of just \n, and in this case you should do
$data =~ tr/\r//d;
first to get rid of these \r characters and prevent them from finding their way into $ownname and $comname.

Select both bits of information at the same time so that you know that you're getting the CENTRAL INDEX KEY associated with either the owner or the company.
($name, $cik) = $data =~ /COMPANY\s+CONFORMED\s+NAME:\s+(.+)$\s+CENTRAL\s+INDEX\s+KEY:\s+(.*)$/m;
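Matching the pair repeatedly yields the owner's pair and then the company's pair. Here is a hedged Python sketch of that idea, using an abbreviated copy of the file and assuming the owner section always comes before the issuer section; PAIR is a name made up for the example.

```python
import re

data = """REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
"""

# Capture the name and the key that immediately follows it, as one unit.
PAIR = re.compile(
    r"COMPANY\s+CONFORMED\s+NAME:\s+(.+)$\s+CENTRAL\s+INDEX\s+KEY:\s+(\d+)",
    re.MULTILINE,
)

pairs = PAIR.findall(data)
# Relies on the assumed ordering: owner first, company second.
(owner_name, owner_cik), (company_name, company_cik) = pairs
print(owner_name, owner_cik, company_name, company_cik)
```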

Instead of trying to match elements in the string, split it into lines, and parse properly into data structure that will let such searches be made easily, like:
$data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"}
That should be relatively easy to do.
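One way this could look, sketched in Python: a small indentation-based parser that builds nested dictionaries. It assumes the real file indents each level consistently (the indentation in the sample below is my guess), and parse_indented is a hypothetical helper, not a library function.

```python
def parse_indented(text):
    """Parse 'KEY: value' lines, nested by indentation, into nested dicts."""
    root = {}
    # Stack of (indent, dict); the root sits below every real indent level.
    stack = [(-1, root)]
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        key, _, value = raw.strip().partition(":")
        value = value.strip()
        # Drop sections that are at the same depth or deeper than this line.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if value:
            parent[key] = value
        else:
            # A line with nothing after the colon opens a new section.
            parent[key] = {}
            stack.append((indent, parent[key]))
    return root


sample = """\
REPORTING-OWNER:
    OWNER DATA:
        COMPANY CONFORMED NAME: DOE JOHN
        CENTRAL INDEX KEY: 99999999999
ISSUER:
    COMPANY DATA:
        COMPANY CONFORMED NAME: ACME INC
        CENTRAL INDEX KEY: 0000002230
"""
parsed = parse_indented(sample)
print(parsed["ISSUER"]["COMPANY DATA"]["COMPANY CONFORMED NAME"])  # prints: ACME INC
```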

Search for the OWNER DATA: line, read one more line, split it on : and take the last field. Do the same for the COMPANY DATA: header, and so on.
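A minimal Python sketch of that line-by-line idea, assuming the name is always on the line directly after the header; value_after_header is a name invented for the example.

```python
def value_after_header(lines, header):
    """Return the value on the line that follows `header`, or None."""
    for index, line in enumerate(lines):
        if line.strip() == header and index + 1 < len(lines):
            # Split on ":" and keep the last field.
            return lines[index + 1].split(":")[-1].strip()
    return None


lines = """REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
""".splitlines()

owner_name = value_after_header(lines, "OWNER DATA:")
company_name = value_after_header(lines, "COMPANY DATA:")
print(owner_name, "/", company_name)  # prints: DOE JOHN / ACME INC
```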

Search then Extract

I have a text file with multiple records. I want to search by name and date. For example, if I type JULIUS CESAR as the name, then all the data about JULIUS should be extracted. What if I want to extract only the Information field?
Record number: 1
Date: 08-Oct-08
Time: 23:45:01
Name: JULIUS CESAR
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Computer Engineering student
An OJT at TIPI.
23 years old.
Record number: 2
Date: 09-Oct-08
Time: 23:45:01
Name: JOHN Castro
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Electronics Comm. Engineering Student at SLU.
An OJT at TIPI.
My Hobby is Programming.
Record number: 3
Date: 08-Oct-08
Time: 23:45:01
Name: CESAR JOSE
Address: BAGUIO CITY, Philippines
Information:
Hi,,
I lived Manila City
A Computer Engineering student
Working at TIPI.

If it is one line per entry, you could store the name in a variable:
$name = "JULIUS CESAR";
Then use:
/$name/i
to test whether each line is about "JULIUS CESAR". Once you find the line, you can use the following regex to extract the information:
/Record number: (\d+) Date: (\d+)-(\w+)-(\d+) Time: (\d+):(\d+):(\d+) Name: $name Address: ([\w\s]+), ([\w\s]+?) Information: (.+?)$/i
$1 = record number
$2-$4 = date
$5-$7 = time
$8-$9 = address
$10 = information
I would write a code example, but my Perl is rusty. I hope this helps :)
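If the records span several lines, as in the question, one option is to split the text at each "Record number:" line and then look inside each chunk. Here is a rough Python sketch under that assumption; it handles only the name search (not the date), information_for is a made-up helper, and the sample is an abbreviated copy of the data.

```python
import re


def information_for(text, name):
    """Return the Information lines of the first record whose Name matches."""
    # Each chunk after a "Record number:" line start is one record.
    for record in re.split(r"^Record number:", text, flags=re.MULTILINE):
        name_match = re.search(r"^Name:\s*(.+)$", record, re.MULTILINE)
        if name_match and name_match.group(1).strip().lower() == name.lower():
            # Everything after the "Information:" label belongs to that field.
            _, _, info = record.partition("Information:")
            return [line.strip() for line in info.splitlines() if line.strip()]
    return None


records = """Record number: 1
Date: 08-Oct-08
Time: 23:45:01
Name: JULIUS CESAR
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Computer Engineering student
Record number: 2
Date: 09-Oct-08
Time: 23:45:01
Name: JOHN Castro
Address: BAGUIO CITY, Philippines
Information:
My Hobby is Programming.
"""

julius_info = information_for(records, "julius cesar")
print(julius_info)
```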

In PHP, you can run a SQL select statement like:
"SELECT * WHERE name LIKE 'JULIUS%';"
There are native aspects of PHP where you can get all of your results in an associative array. I'm pretty sure it's ordered by row order. Then you can just do something like this:
echo implode(" ", $whole_row);
Hope this is what you're looking for!
