Search then Extract

Search then Extract - regex

I have a text file with multiple records. I want to search a name and date, for example if I typed JULIUS CESAR as name then the whole data about JULIUS will be extracted. What if I want only to extract information?
Record number: 1
Date: 08-Oct-08
Time: 23:45:01
Name: JULIUS CESAR
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Computer Engineering student
An OJT at TIPI.
23 years old.
Record number: 2
Date: 09-Oct-08
Time: 23:45:01
Name: JOHN Castro
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Electronics Comm. Engineering Student at SLU.
An OJT at TIPI.
My Hobby is Programming.
Record number: 3
Date: 08-Oct-08
Time: 23:45:01
Name: CESAR JOSE
Address: BAGUIO CITY, Philippines
Information:
Hi,,
I lived Manila City
A Computer Engineering student
Working at TIPI.

If it is one line per entry, you could use a regular expression such as:
$name = "JULIUS CESAR";
Then use:
/$name/i
to test if each line is about "JULIUS CESAR." Then you simply have to use the following regex to extract the information (once you find the line):
/Record number: (\d+) Date: (\d+)-(\w+)-(\d+) Time: (\d+):(\d+):(\d+) Name: $name Address: ([\w\s]+), ([\w\s]+?) Information: (.+?)$/i
$1 = record number
$2-$4 = date
$5-$7 = time
$6 = address
$7 = comments
I would write a code example, but my perl is rusty. I hope this helps :)

In PHP, you can run a SQL select statement like:
"SELECT * WHERE name LIKE 'JULIUS%';"
There are native aspects of PHP where you can get all of your results in an associative array. I'm pretty sure it's ordered by row order. Then you can just do something like this:
echo implode(" ", $whole_row);
Hope this is what you're looking for!

Related

Terraform templating question - iterating over a directory of variable files

Is it possible in Terraform to iterate over a directory of files containing variables and apply each file's values to a template? In very general, non-Terraform, bad pseudo-code -- here's what I'm hoping to do:
/varfiles/var_file1.yml
User:
Name: Jim
Address:
Street: 123 Main
City: MyTown
State: IL
Zip: 12345
Age: 40
Children:
- Beth
- Mike
/varfiles/var_file2.yml
User:
Name: Jill
Address:
Street: 321 Oak St
City: UrTown
State: IL
Zip: 54321
Age: 27
Children:
- Ricky
Template Logic
FOR each file in /varfiles/*(
print(User's name is: %{file.User.Name})
print(User's age is: %{file.User.Age})
print(User's address is: %{file.User.Street file.User.City, file.User.State file.User.Zip})})
IF COUNT(%{file.User.Children}) > 0 THEN (
FOR child in %{file.User.Children}(
print(The user has a child named: %{child})
)
)
)
Each iteration would be expected to generate a separate set of outputs - e.g. separate files.
Expected 'output' for var_file1.yml
User's name is: Jim
User's age is: 40
User's address is: 123 Main MyTown, IL 12345
The user has a child named: Beth
The user has a child named: Mike
Expected 'output' for var_file2.yml
User's name is: Jill
User's age is: 27
User's address is: 321 Oak Street UrTown, IL 54321
The user has a child named: Ricky
Solution doesn't have to implement this particular logic - but I'm not familiar enough with Terraform yet to know if it's possible at all, if this is the right approach, or if Terraform is the right tool.
The problem I'm hoping to solve is to template configuration files that would be generated on a per-user or per-role basis. Then store user or role-specific variables in separate files within a specific directory. This would make it easy to potentially manage hundreds of roles/users without having to deal with a var file that's hundreds or thousands of lines long and the dir structure and file names would make it obvious who/what was assigned what.

Regex capture lines A, B, or C in any order only when not preceded by D

I have a file with content something like this:
SUBJECT COMPANY:
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS SUBJECT CORP
CENTRAL INDEX KEY: 0000000000
STANDARD INDUSTRIAL CLASSIFICATION: []
IRS NUMBER: 123456789
STATE OF INCORPORATION: DE
FISCAL YEAR END: 1231
Then later in the file, it has something like this:
<REPORTING-OWNER>
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS OWNER CORP
CENTRAL INDEX KEY: 0101010101
STANDARD INDUSTRIAL CLASSIFICATION: []
What I need to do is capture the company conformed name, central index key, IRS number, fiscal year end, or whatever I am looking to extract, but only in the subject company section--not the reporting owner section. These lines may be in any order, or not present, but I want to capture their values if they are present.
The regex I was trying to build looks like this:
(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,#`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))
The desired results would be as follows:
conformed_name = "MISCELLANEOUS SUBJECT CORP"
CIK = "000000000"
IRS_number = "123456789"
fiscal_year_end = "1231"
Any flavor of regex is acceptable for this, as I'll adapt to whatever works best for the scenario. Thank you for reading about my quandary and for any guidance you can offer.

I ended up figuring it out on my own. Try it out here.
/SUBJECT COMPANY:\s+COMPANY DATA:(?:\s+(?:(?:COMPANY CONFORMED NAME:\s+(?'conformed_name'[^\n]+))|(?:CENTRAL INDEX KEY:\s+(?'CIK'\d{10}))|(?:STANDARD INDUSTRIAL CLASSIFICATION:\s+(?'assigned_SIC'[^\n]+))|(?:IRS NUMBER:\s+?(?'IRS_number'\w{2}-?\w{7,8}))|(?:STATE OF INCORPORATION:\s+(?'state_of_incorporation'\w{2}))|(?:FISCAL YEAR END:\s+(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))\n))+/s

To match only the company section, and only when preceded by “SUBJECT COMPANY”, use a look behind:
(?<=SUBJECT COMPANY:\t\n \n )(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,#`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))

Stata - inputting data from .txt with "" and ,

I am using perl to scrape the following through .txt which I'd ultimately bring into Stata. What format option works? I have many such observations, so would like to use an approach over which I can generalize.
The original data are of the form:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
Location: District 1, Ocean City, Cape May, New Jersey, USA
First Name: Lee Roy
Last Name: McBride
Birth Year: 1967
Location: Precinct 5, District 2, Chicago, Cook, Illinois, USA
The goal is to create the variables in Stata:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
County: Cape May
State: New Jersey
First Name: Allen
Last Name: McBride
Birth Year: 1967
County: Cook
State: Illinois
What possible .txt might lead to such, and how would I load it into Stata?
Also, the amount of terms vary in Location as in these 2 examples, but I always want the 2 before USA.
At the moment, I am putting "", around each variable from the table for the .txt.
"Allen","Von Schmidt","1965","District 1, Ocean City, Cape May, New Jersey, USA"
"Lee Roy","McBride","1967","Precinct 5, District 2, Chicago, Cook, Illinois, USA"
Is there a better way to format the .txt? How would I create the corresponding variables in Stata?
Thank you for your help!
P.S. I know that stata uses infile or insheet and can handle , or tabs to separate variables. I did not know how to scrape a variable like Location in perl with all of the those so I added the ""

There are two ways to do this. The first is to paste the data into your do-file and use input. Assuming the format is fairly regular, you can clean it up easily using commas to parse. Note that I removed the commas:
#delimit;
input
str100(first_name last_name yob geo);
"Allen" "Von Schmidt" "1965" "District 1, Ocean City, Cape May, New Jersey, USA";
end;
compress;
destring, replace;
split geo, parse(,);
rename geo1 district;
rename geo2 city;
rename geo3 county;
rename geo4 state;
rename geo5 country;
drop geo;
The second way is to insheet the data from the txt file directly, which is probably easier. This assumes that the commas were not removed:
#delimit;
insheet first_name last_name yob geo using "raw_data.txt", clear comma nonames;
Then clean it up as in the first example.

This isn't a complete answer, but I need more space and flexibility than comments (easily) allow.
One trick is based on peeling off elements from the end. The easiest way to do that could be to start looking for the last comma, which is in turn the first comma in the reversed string. Use strpos(reverse(stringvar), ",").
For example the first commma is found by strpos() like this
. di strpos("abcd,efg,h", ",")
5
and the last comma like this
. di strpos(reverse("abcd,efg,h"), ",")
2
Once you know where the last comma is you can peel off the last element. If the last comma is at position # in the reversed string, it is at position -# in the string.
. di substr("abcd,efg,h", -2, 2)
,h
These examples clearly are calculator-style examples for single strings. But the last element can be stripped off similarly for entire string variables.
. gen poslastcomma = strpos(reverse(var), ",")
. gen var_end = substr(var, -poslastcomma, poslastcomma)
. gen var_begin = substr(var, 1, length(var) - poslastcomma)
Once you get used to stuff like this you can write more complicated statements with fewer variables, but slowly, slowly step by step is better when you are learning.
By the way, a common Stata learner error (in my view) is to assume that a solution to a string problem must entail the use of regular expressions. If you are very fluent at regular expressions, you can naturally do wonderful things with them, but the other string functions in conjunction can be very powerful too.
In your specific example, it sounds as if you want to ignore a last element such as "USA" and then work in turn on the next elements working backwards.
split in Stata is fine too (I am a fan and indeed am its putative author) but can be awkward if a split yields different numbers of elements, which is where I came in.

Using Perl to extract text from a text file

I have a question related to using regex to pull out data from a text file. I have a text file in the following format:
REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
FILING VALUES:
FORM TYPE: 4
SEC ACT: 1934 Act
SEC FILE NUMBER: 811-00248
FILM NUMBER: 11530052
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET
STREET 2: STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
IRS NUMBER: 134912740
STATE OF INCORPORATION: MD
FISCAL YEAR END: 1231
BUSINESS ADDRESS:
STREET 1: SEVEN ST PAUL ST STE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
BUSINESS PHONE: 4107525900
MAIL ADDRESS:
STREET 1: 7 ST PAUL STREET SUITE 1140
CITY: BALTIMORE
STATE: MD
ZIP: 21202
I want to save the owner's name (John Doe) and identifier (99999999999) and the company's name (ACME Inc) and identfier (0000002230) as separate variables. However, as you can see, the variable names (CENTRAL INDEX KEY and COMPANY CONFORMED NAME) are exactly the same for both pieces of information.
I've used the following code to extract the owner's information, but I can't figure out how to extract the data for the company. (Note: I read the entire text file into $data).
if($data=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1;}
if($data=~m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m){$name=$1;}
Any idea as to how I can extract the information for both the owner and the company?
Thanks!

There is a big difference between doing it quick and dirty with regexes (maintenance nightmare), or doing it right.
As it happens, the file you gave looks very much like YAML.
use YAML;
my $data = Load(...);
say $data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"};
say $data->{"ISSUER"}->{"COMPANY DATA"}->{"COMPANY CONFORMED NAME"};
Prints:
DOE JOHN
ACME INC
Isn't that cool? All in a few lines of safe and maintainable code ☺

my ($ownname, $ownkey, $comname, $comkey) = $data =~ /\bOWNER DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+).*\bCOMPANY DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+)/ms
If you're reading this file on a UNIX operating system but it was generated on Windows, then line endings will be indicated by the character pair \r\n instead of just \n, and in this case you should do
$data =~ tr/\r//d;
first to get rid of these \r characters and prevent them from finding their way into $ownname and $comname.

Select both bits of information at the same time so that you know that you're getting the CENTRAL INDEX KEY associated with either the owner or the company.
($name, $cik) = $data =~ /COMPANY\s+CONFORMED\s+NAME:\s+(.+)$\s+CENTRAL\s+INDEX\s+KEY:\s+(.*)$/m;

Instead of trying to match elements in the string, split it into lines, and parse properly into data structure that will let such searches be made easily, like:
$data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"}
That should be relatively easy to do.

Search for OWNER DATA: read one more line, split on : and take the last field. Same for COMPANY DATA: header (sortof), on so on

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Search then Extract - regex

Related

Terraform templating question - iterating over a directory of variable files

Regex capture lines A, B, or C in any order only when not preceded by D

More Regex solution?

Stata - inputting data from .txt with "" and ,

Using Perl to extract text from a text file

Categories

Resources