PIG - Create a tuple line from flat file


I'm trying to create a tuple in Pig, but the file format isn't very friendly:
File Format:
Name: Zach
LastName: Red
Address: 34 Store Av
Age: 34
Name: Brian
LastName: Curts
Address: 123 Street Av
Age: 23
I need to create a tuple:
Name: Zach LastName: Red Address: 34 Store Av Age: 34
Name: Brian LastName: Curts Address: 123 Street Av Age: 23

You can write your own UDF in Java/Python/... to load this data. Check doc:
http://pig.apache.org/docs/r0.15.0/udf.html#load-store-functions

Crazy idea, but it might work. I ASSUME every record has exactly 4 rows; otherwise it won't work.
Load the file using PigStorage.
Use the RANK operator to generate a rank field for each row: the first row gets 1, the second row gets 2, and so on.
For each row, generate another number between 1 and 4 based on its type: 1 for Name, 2 for LastName, 3 for Address, 4 for Age. Call it 'RecordType'.
Add another field computed as FLOOR((RANK-1)/4) and name it 'PersonID'. For the first person it is 0, for the second it is 1, and so on.
Group by PersonID to bring all records for the same person together.
For each person you now have the PersonID and a bag containing all of that person's records. They still need to be sorted, which a nested foreach can do:
-- 'output' is a reserved word in Pig, so the alias is named 'result' here
result = foreach Person {
    sorted = order PersonRows by RecordType;
    generate PersonID, sorted;
}
Flatten the bag into a tuple using the BagToTuple function, and you're done.
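To make the row-numbering idea concrete, here is a rough sketch of the same logic in plain Python rather than Pig. It is only an illustration: the function name rows_to_records is made up, and it relies on the same assumption as above, that every record has exactly 4 rows and that they already appear in Name, LastName, Address, Age order (so no sort by RecordType is needed).

```python
from itertools import groupby

ROWS_PER_PERSON = 4  # assumption: every record has exactly 4 rows


def rows_to_records(lines, rows_per_person=ROWS_PER_PERSON):
    """Join each consecutive group of rows into one line per person."""
    # enumerate(..., start=1) plays the role of Pig's RANK: 1, 2, 3, ...
    ranked = enumerate((line.strip() for line in lines), start=1)
    # (rank - 1) // 4 is the PersonID: 0 for the first person, 1 for the second, ...
    by_person = groupby(ranked, key=lambda pair: (pair[0] - 1) // rows_per_person)
    return [" ".join(text for _rank, text in rows) for _person_id, rows in by_person]


sample = [
    "Name: Zach",
    "LastName: Red",
    "Address: 34 Store Av",
    "Age: 34",
    "Name: Brian",
    "LastName: Curts",
    "Address: 123 Street Av",
    "Age: 23",
]

for record in rows_to_records(sample):
    print(record)
```

If a record is missing a row, every later PersonID shifts, which is exactly why the 4-rows assumption matters.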

Related

Terraform templating question - iterating over a directory of variable files

Is it possible in Terraform to iterate over a directory of files containing variables and apply each file's values to a template? In very general, non-Terraform, bad pseudo-code -- here's what I'm hoping to do:
/varfiles/var_file1.yml
User:
  Name: Jim
  Address:
    Street: 123 Main
    City: MyTown
    State: IL
    Zip: 12345
  Age: 40
  Children:
    - Beth
    - Mike
/varfiles/var_file2.yml
User:
  Name: Jill
  Address:
    Street: 321 Oak St
    City: UrTown
    State: IL
    Zip: 54321
  Age: 27
  Children:
    - Ricky
Template Logic
FOR each file in /varfiles/* (
    print(User's name is: %{file.User.Name})
    print(User's age is: %{file.User.Age})
    print(User's address is: %{file.User.Address.Street file.User.Address.City, file.User.Address.State file.User.Address.Zip})
    IF COUNT(%{file.User.Children}) > 0 THEN (
        FOR child in %{file.User.Children} (
            print(The user has a child named: %{child})
        )
    )
)
Each iteration would be expected to generate a separate set of outputs - e.g. separate files.
Expected 'output' for var_file1.yml
User's name is: Jim
User's age is: 40
User's address is: 123 Main MyTown, IL 12345
The user has a child named: Beth
The user has a child named: Mike
Expected 'output' for var_file2.yml
User's name is: Jill
User's age is: 27
User's address is: 321 Oak St UrTown, IL 54321
The user has a child named: Ricky
Solution doesn't have to implement this particular logic - but I'm not familiar enough with Terraform yet to know if it's possible at all, if this is the right approach, or if Terraform is the right tool.
The problem I'm hoping to solve is to template configuration files that would be generated on a per-user or per-role basis. Then store user or role-specific variables in separate files within a specific directory. This would make it easy to potentially manage hundreds of roles/users without having to deal with a var file that's hundreds or thousands of lines long and the dir structure and file names would make it obvious who/what was assigned what.

Pull Value After String

Please could someone assist with the following? I'm looking to pull values for Applicant Two, so I need the regex to ignore Applicant One and pull the value after "First Name:". For example, I should be left with "Joe", not "Marie".
Applicant One
Title: Miss
First Name: Marie
Surname: Miller
DOB: 01-01-1982
Applicant Two
Title: Mr
First Name: Joe
Surname: Blogss
DOB: 19-09-1983

On each iteration you can pull out a line that follows "Applicant Two" (the ^ and $ anchors need multiline mode):
(.*)(Applicant Two)([\s\S]*?)^([^\n]+)$
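For comparison, here is a small Python sketch of one way to target the "First Name:" line directly. The helper name first_name_of is made up for illustration, and it assumes each applicant block starts with a heading line such as "Applicant Two" and contains a "First Name:" line somewhere after it.

```python
import re


def first_name_of(text, applicant):
    """Return the First Name value that follows the given applicant heading."""
    # Skip lazily from the heading to the nearest "First Name:" line after it.
    pattern = re.escape(applicant) + r"\b[\s\S]*?^First Name:\s*(.+)$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


sample = """Applicant One
Title: Miss
First Name: Marie
Surname: Miller
DOB: 01-01-1982
Applicant Two
Title: Mr
First Name: Joe
Surname: Blogss
DOB: 19-09-1983
"""

print(first_name_of(sample, "Applicant Two"))
```

Because the search starts at the "Applicant Two" heading, the "First Name:" line under Applicant One is never considered.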

Extracting only dates from a text file and ignoring large numbers

I have a text file and I want to extract all dates from it but somehow my code is also extracting the other values like
Procedure #: 10075453.
Below is a small sample of that file:
Patient Name: Mills, John Procedure #: 10075453
October 7, 2017
Med Rec #: 747901 Visit ID: 110408731
Patient Location: OUTPATIENT Patient Type: OUTPATIENT
DOB:07/09/1943 Gender: F Age: 73Y Phone: (321)8344-0456
Can I get an idea how I could approach this problem?
import pandas as pd
doc = []
with open('Clean.txt', encoding="utf8") as file:
    for line in file:
        doc.append(line)
df = pd.Series(doc)
def date_extract():
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})')
    two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))')
    three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})')
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber','December',regex=True).replace('Janaury','January',regex=True))
    return pd.Series(dates.sort_values())
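One possible direction, sketched with only the standard re module: the third pattern above accepts any run of four digits, so it matches inside long numbers such as 10075453. Guarding each pattern with lookarounds that reject neighbouring digits, slashes and hyphens avoids that. The names DATE_RE and find_dates are made up, and the sketch only covers the two date styles visible in the sample ("October 7, 2017" and "07/09/1943"), so treat it as a starting point rather than a full solution.

```python
import re

MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"

DATE_RE = re.compile("".join([
    r"(?<![\d/-])(?:",                        # not glued to a preceding digit, '/' or '-'
    r"\b", MONTH, r"\.?\s+\d{1,2},?\s+\d{4}",  # e.g. October 7, 2017
    r"|\d{1,2}[/-]\d{1,2}[/-]\d{2,4}",         # e.g. 07/09/1943
    r")(?![\d/-])",                           # not glued to a following digit, '/' or '-'
]))


def find_dates(text):
    """Return every date-like substring found in `text`, in order."""
    return DATE_RE.findall(text)


sample = """Patient Name: Mills, John Procedure #: 10075453
October 7, 2017
Med Rec #: 747901 Visit ID: 110408731
Patient Location: OUTPATIENT Patient Type: OUTPATIENT
DOB:07/09/1943 Gender: F Age: 73Y Phone: (321)8344-0456
"""

print(find_dates(sample))
```

The matched strings can then be handed to pd.to_datetime as in the original code.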

Using Perl to extract text from a text file

I have a question related to using regex to pull out data from a text file. I have a text file in the following format:
REPORTING-OWNER:
    OWNER DATA:
        COMPANY CONFORMED NAME: DOE JOHN
        CENTRAL INDEX KEY: 99999999999
    FILING VALUES:
        FORM TYPE: 4
        SEC ACT: 1934 Act
        SEC FILE NUMBER: 811-00248
        FILM NUMBER: 11530052
    MAIL ADDRESS:
        STREET 1: 7 ST PAUL STREET
        STREET 2: STE 1140
        CITY: BALTIMORE
        STATE: MD
        ZIP: 21202
ISSUER:
    COMPANY DATA:
        COMPANY CONFORMED NAME: ACME INC
        CENTRAL INDEX KEY: 0000002230
        IRS NUMBER: 134912740
        STATE OF INCORPORATION: MD
        FISCAL YEAR END: 1231
    BUSINESS ADDRESS:
        STREET 1: SEVEN ST PAUL ST STE 1140
        CITY: BALTIMORE
        STATE: MD
        ZIP: 21202
        BUSINESS PHONE: 4107525900
    MAIL ADDRESS:
        STREET 1: 7 ST PAUL STREET SUITE 1140
        CITY: BALTIMORE
        STATE: MD
        ZIP: 21202
I want to save the owner's name (John Doe) and identifier (99999999999) and the company's name (ACME Inc) and identifier (0000002230) as separate variables. However, as you can see, the field names (CENTRAL INDEX KEY and COMPANY CONFORMED NAME) are exactly the same for both pieces of information.
I've used the following code to extract the owner's information, but I can't figure out how to extract the data for the company. (Note: I read the entire text file into $data).
if($data=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1;}
if($data=~m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m){$name=$1;}
Any idea as to how I can extract the information for both the owner and the company?
Thanks!

There is a big difference between doing it quick and dirty with regexes (maintenance nightmare), or doing it right.
As it happens, the file you gave looks very much like YAML.
use feature 'say';
use YAML;
my $data = Load(...);
say $data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"};
say $data->{"ISSUER"}->{"COMPANY DATA"}->{"COMPANY CONFORMED NAME"};
Prints:
DOE JOHN
ACME INC
Isn't that cool? All in a few lines of safe and maintainable code ☺

my ($ownname, $ownkey, $comname, $comkey) = $data =~ /\bOWNER DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+).*\bCOMPANY DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+)/ms;
If you're reading this file on a UNIX operating system but it was generated on Windows, then line endings will be indicated by the character pair \r\n instead of just \n, and in this case you should do
$data =~ tr/\r//d;
first to get rid of these \r characters and prevent them from finding their way into $ownname and $comname.

Select both bits of information at the same time so that you know that you're getting the CENTRAL INDEX KEY associated with either the owner or the company.
($name, $cik) = $data =~ /COMPANY\s+CONFORMED\s+NAME:\s+(.+)$\s+CENTRAL\s+INDEX\s+KEY:\s+(.*)$/m;
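The same "capture both fields together" idea, sketched in Python for illustration. The names PAIR_RE and name_key_pairs are made up, and the sketch assumes that CENTRAL INDEX KEY always sits on the line right after COMPANY CONFORMED NAME, as it does in the sample. Collecting every pair in order gives the owner first and the company second.

```python
import re

# Name and key are captured in one match, so each key stays with its own name.
PAIR_RE = re.compile(
    r"COMPANY CONFORMED NAME:[ \t]*(.+?)[ \t\r]*\n"
    r"\s*CENTRAL INDEX KEY:[ \t]*(\d+)"
)


def name_key_pairs(data):
    """Return (name, key) tuples in the order they appear in `data`."""
    return PAIR_RE.findall(data)


sample = """REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
"""

owner, company = name_key_pairs(sample)
print(owner)
print(company)
```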

Instead of trying to match elements in the string, split it into lines and parse them into a data structure that makes such lookups easy, like:
$data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"}
That should be relatively easy to do.

Search for the OWNER DATA: header, read one more line, split on : and take the last field. Do the same for the COMPANY DATA: header (sort of), and so on.
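A line-oriented version of that could look roughly like the Python sketch below. The helper name values_after is made up, and it assumes the name and the key are the two lines immediately following the header, as in the sample file.

```python
def values_after(lines, header, count=2):
    """Return the values on the `count` lines that follow the `header` line."""
    for i, line in enumerate(lines):
        if line.strip() == header:
            following = lines[i + 1 : i + 1 + count]
            # Keep whatever comes after the first ':' on each line.
            return [item.partition(":")[2].strip() for item in following]
    return []


sample = """REPORTING-OWNER:
OWNER DATA:
COMPANY CONFORMED NAME: DOE JOHN
CENTRAL INDEX KEY: 99999999999
ISSUER:
COMPANY DATA:
COMPANY CONFORMED NAME: ACME INC
CENTRAL INDEX KEY: 0000002230
""".splitlines()

print(values_after(sample, "OWNER DATA:"))
print(values_after(sample, "COMPANY DATA:"))
```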

Search then Extract

I have a text file with multiple records. I want to search by name and date: for example, if I type JULIUS CESAR as the name, all the data about JULIUS should be extracted. And what if I only want to extract the Information part?
Record number: 1
Date: 08-Oct-08
Time: 23:45:01
Name: JULIUS CESAR
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Computer Engineering student
An OJT at TIPI.
23 years old.
Record number: 2
Date: 09-Oct-08
Time: 23:45:01
Name: JOHN Castro
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
A Electronics Comm. Engineering Student at SLU.
An OJT at TIPI.
My Hobby is Programming.
Record number: 3
Date: 08-Oct-08
Time: 23:45:01
Name: CESAR JOSE
Address: BAGUIO CITY, Philippines
Information:
Hi,,
I lived Manila City
A Computer Engineering student
Working at TIPI.

If it is one line per entry, you could use a regular expression such as:
$name = "JULIUS CESAR";
Then use:
/$name/i
to test whether each line is about "JULIUS CESAR". Once you find the line, use the following regex to extract the information:
/Record number: (\d+) Date: (\d+)-(\w+)-(\d+) Time: (\d+):(\d+):(\d+) Name: $name Address: ([\w\s]+), ([\w\s]+?) Information: (.+?)$/i
$1 = record number
$2-$4 = date
$5-$7 = time
$8-$9 = address
$10 = information
I would write a code example, but my perl is rusty. I hope this helps :)
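Since the records in the question actually span several lines, here is a rough Python sketch of a line-by-line alternative. The function names parse_records and information_for are made up, and it assumes every record starts with a "Record number:" line and that everything after "Information:" up to the next record belongs to that record.

```python
def parse_records(text):
    """Split the text into one dict per record; Information is a list of lines."""
    records, current, in_info = [], None, False
    for line in text.splitlines():
        if line.startswith("Record number:"):
            # A new record begins; this also ends the previous Information block.
            current = {"Information": []}
            records.append(current)
            in_info = False
        if current is None:
            continue  # ignore anything before the first record
        if in_info:
            current["Information"].append(line.strip())
        elif line.startswith("Information:"):
            in_info = True
        else:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


def information_for(text, name):
    """Return the Information lines of every record whose Name matches `name`."""
    wanted = name.lower()
    return [r["Information"] for r in parse_records(text)
            if r.get("Name", "").lower() == wanted]


sample = """Record number: 1
Date: 08-Oct-08
Time: 23:45:01
Name: JULIUS CESAR
Address: BAGUIO CITY, Philippines
Information:
I lived in Peza Loakan, Bagiou City
23 years old.
Record number: 2
Date: 09-Oct-08
Time: 23:45:01
Name: JOHN Castro
Address: BAGUIO CITY, Philippines
Information:
My Hobby is Programming.
"""

print(information_for(sample, "julius cesar"))
```

Filtering on the date as well would just be one more comparison on r.get("Date") inside information_for.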

In PHP, you can run a SQL select statement like:
"SELECT * FROM your_table WHERE name LIKE 'JULIUS%';"
There are native aspects of PHP where you can get all of your results in an associative array. I'm pretty sure it's ordered by row order. Then you can just do something like this:
echo implode(" ", $whole_row);
Hope this is what you're looking for!
