How to format first 7 rows in this txt file using Regex - regex

I have a text file with data formatted as below. I figured out how to format the second part of the file for upload into a db table, but I'm hitting a wall trying to get just the first 7 lines to format in the same way.
If it wasn't obvious, I'm trying to get it pipe delimited with the exact same number of columns, so I can easily upload it to the db.
Year: 2019 Period: 03
Office: NY
Dept: Sales
Acct: 111222333
SubAcct: 11122234-8
blahblahblahblahblahblahblah
Status: Pending
1000
AAAAAAAAAA
100,000.00
2000
BBBBBBBBBB
200,000.00
3000
CCCCCCCCCC
300,000.00
4000
DDDDDDDDDD
400,000.00
Some kind folks answered my question about the bottom part. Using the following regex, I can format it to look like this:
(.*)\r?\n(.*)\r?\n(.*)(?:\r?\n|$)
substitute with |||||||$1|$2|$3\n
|||||||1000|AAAAAAAAAA|100,000.00
|||||||2000|BBBBBBBBBB|200,000.00
|||||||3000|CCCCCCCCCC|300,000.00
|||||||4000|DDDDDDDDDD|400,000.00
I just need help formatting the top part to look like this, so the entire file has the exact same number of columns.
Year: 2019|Period: 03|Office: NY|Dept: Sales|Acct: 111222333|SubAcct: 11122234-8|blahblahblahblahblahblahblah|Status: Pending|||
I'm ok with having multiple passes on the file to get the desired end result.

I've helped you on your previous question, so I will focus now on the first part of your file.
You can use this regex:
\n|\b(?=Period)
And use | as the replacement string
If you don't want the previous space before Period, then you can use:
\n|\s(?=Period)
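As a rough sketch of how this substitution could be applied to only the first 7 lines, here is a small Python version. The function name `format_header`, its parameters, and the trailing `|||` padding (copied from the desired output in the question) are assumptions for illustration, not part of the original answer.

```python
import re

SAMPLE = """Year: 2019 Period: 03
Office: NY
Dept: Sales
Acct: 111222333
SubAcct: 11122234-8
blahblahblahblahblahblahblah
Status: Pending
1000
AAAAAAAAAA
100,000.00
"""

def format_header(text, header_lines=7, padding="|||"):
    """Join the first `header_lines` lines with pipes (hypothetical helper)."""
    lines = text.splitlines()
    header = "\n".join(lines[:header_lines])
    # Same pattern as above: every newline, plus the space before "Period",
    # is replaced by a pipe.
    return re.sub(r"\n|\s(?=Period)", "|", header) + padding

print(format_header(SAMPLE))
```

Restricting the substitution to the header keeps the three-line groups below it untouched, so the second pass from the earlier question can still run on them.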

Related

Regex to extract shoe size from string column

I have a database with string column product_name which has data like:
Vans Classic Slip-On Black & White Checkerboard/ White - veľkosť (US) : 6 (EUR: 38)
Vans Old Skool - čierna - veľkosť (US) : 9.5 (EUR: 42.5)
I am trying to extract the US size...
SELECT REGEXP_SUBSTR("product_name", ...) AS "size"
...with desired output like this.
size
6
9.5
I have tried this, but to no avail
SELECT REGEXP_SUBSTR("product_name", '(US)(\d+)') AS "size"
I need to agree with B001, this might not be the best way of saving your information. However, if you are sure your strings are going to have this format, you could use this regex
\(US\) ?: ?(\d+\.?\d*) \(EUR: ?(\d+\.?\d*)\)
This will match the US shoe size first and then the EUR one.
Please note that this regex will match BOTH sizes, I'm not sure which one you prefer
When working in the web UI I had to double my backslashes, so the following worked as you want:
select REGEXP_SUBSTR(str, '\\(US\\)\\s\\:\\s(\\d+\\.?\\d*)',1,1,'i',1)
from values ('Vans Classic Slip-On Black & White Checkerboard/ White - veľkosť (US) : 6 (EUR: 38)'),
('Vans Old Skool - čierna - veľkosť (US) : 9.5 (EUR: 42.5)') v(str);
gives:
REGEXP_SUBSTR(STR, '\\(US\\)\\S\\:\\S(\\D+\\.?\\D*)',1,1,'I',1)
6
9.5
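To check the pattern outside the database, a minimal Python sketch can run it against the two sample product names. The helper name `us_size` is hypothetical, and the pattern is a slightly more tolerant variant (`\s*` around the colon) of the ones shown above.

```python
import re

# Capture the number that follows "(US) :"; whitespace around the colon is optional.
US_SIZE = re.compile(r"\(US\)\s*:\s*(\d+\.?\d*)")

def us_size(product_name):
    """Return the US size as a string, or None if the name has no such part."""
    match = US_SIZE.search(product_name)
    return match.group(1) if match else None

names = [
    "Vans Classic Slip-On Black & White Checkerboard/ White - veľkosť (US) : 6 (EUR: 38)",
    "Vans Old Skool - čierna - veľkosť (US) : 9.5 (EUR: 42.5)",
]
sizes = [us_size(name) for name in names]
print(sizes)
```

Note that a Python raw string needs only single backslashes, unlike the doubled ones required in the SQL string literal.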

Get a string after a specific word, using a program that has limited regex features?

Looking for help on building a regex that captures a 1-line string after a specific word.
The challenge I'm running into is that the program where I need to build this regex uses a single line format, in other words dot matches new line. So the formula I created isn't working. See more details below. Any advice or tips?
More specific regex task:
I'm trying to grab the line that comes after the word Details from entries like below. The goal is to pull out 100% Silk, or 100% Velvet. This is the material of the product that always comes after Details.
Raw data:
<p>Loose fitted blouse green/yellow lily print.
V-neck opening with a closure string.
Small tie string on left side of top.</p>
<h3>Details</h3> <p>100% Silk.</p>
<p>Made in Portugal.</p> <h3>Fit</h3>
<p>Model is 5'10," size 2 wearing size 34.</p> <p>Size 34 measurements</p>
OR
<p>The velvet version of this dress. High waist fit with hook and zipper closure.
Seams run along edges of pants to create a box-like.</p>
<h3>Details</h3> <p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5'10", size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"</p>
<p>These pants run small. We recommend sizing up.</p>
Here is the current formula I created that's not working:
Replace (.*)(\bDetails\s+(.*)) with $3
The output gives the below:
<p>100% Silk.</p>
<p>Made in Portugal.</p>
<h3>Fit</h3>
<p>Model is 5'10," size 2 wearing size 34.</p>
<p>Size 34 measurements</p>
OR
<p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5'10", size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"</p>
<p>These pants run small. We recommend sizing up.</p>
How do I capture just the desired string? Let me know if you have any tips! Thank you!
It is difficult to provide a working solution in your situation, since you mention your program has "limited regex features" but don't explain what the limitations are.
Here is a regex you can try to work with to capture the target string:
^(?:<h3>Details<\/h3>)(.*)$
I would personally use BeautifulSoup for something like this, but here are two solutions you could use:
Match the line after "Details", then pull out the data.
import re
# re.M lets $ match at the end of each line, so only the "Details" line is captured
matches = re.findall(r'(?<=Details<).*$', text, re.M)
matches = [i.strip('<>') for i in matches]
matches = [i.split('<')[0] for i in [j.split('>')[-1] for j in matches]]
Replace "Details<...>data" with "Detailsdata", then find the data.
# remove the tags between "Details" and the data, keeping the word itself
text = re.sub(r'Details<.*?<.*?>', 'Details', text)
matches = re.findall(r'(?<=Details).*?(?=<)', text)
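A single-pattern alternative may also be worth trying. The sketch below assumes the material always sits in the first `<p>` right after `<h3>Details</h3>`, as in both samples; the helper name `material` is hypothetical. Because the capture is lazy, it should behave the same whether or not dot matches newline.

```python
import re

# Lazy capture stops at the first closing </p>; the optional \. drops the
# trailing period so "100% Silk." becomes "100% Silk".
DETAILS = re.compile(r"<h3>Details</h3>\s*<p>(.*?)\.?</p>", re.DOTALL)

def material(html):
    """Return the text of the paragraph following the Details heading, or None."""
    match = DETAILS.search(html)
    return match.group(1) if match else None

SILK = """<p>Loose fitted blouse green/yellow lily print.
V-neck opening with a closure string.
Small tie string on left side of top.</p>
<h3>Details</h3> <p>100% Silk.</p>
<p>Made in Portugal.</p> <h3>Fit</h3>
"""

VELVET = """<p>The velvet version of this dress.</p>
<h3>Details</h3> <p>100% Velvet.</p>
<p>Made in the United States.</p>
"""

print(material(SILK), "|", material(VELVET))
```

In a replace-only tool, the equivalent would be to wrap the same pattern in leading and trailing `.*` and substitute the capture group, provided the tool supports lazy quantifiers.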

How to load specific columns with varying location from a text file in python?

I'm trying to read the discharge data of 346 US rivers stored online in textfiles. The files are more or less in this format:
Measurement_number Date Gage_height Discharge_value
1 2017-01-01 10 1000
2 2017-01-20 15 2000
# etc.
I only want to read the gage height and discharge value columns.
The problem is that in most files additional columns with metadata are added in front of the 'Gage height' column, so i can not just simply read the 3rd and 4th column because their index varies.
I'm trying to find a way to say 'read the columns with the name 'Gage_height' and 'Discharge_value'', but I haven't succeeded yet.
I hope anyone can help. I'm currently trying to load the text files with numpy.genfromtxt so it would be great to find a solution with that package but other solutions are also more than welcome.
This is my code so far
data_url = urllib2.urlopen(url)  # url: the address of this specific site's file
data = np.genfromtxt(data_url, skip_header=1, comments='#', usecols=[2, 3])
You can use the names=True option to genfromtxt, and then use the column names to select which columns you want to read with usecols.
For example, to read 'Gage_height' and 'Discharge_value' from your data file:
data = np.genfromtxt(filename, names=True, usecols=['Gage_height', 'Discharge_value'])
Note that you don't need to set skip_header=1 if you use names=True.
You can then access the columns using their names:
gage_height = data['Gage_height'] # == array([ 10., 15.])
discharge_value = data['Discharge_value'] # == array([ 1000., 2000.])
See the numpy.genfromtxt documentation for more information.
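For comparison, the same idea (look up the column positions by header name rather than by fixed index) can be sketched without NumPy. This is not the genfromtxt approach itself, just a plain-Python illustration; the helper `read_named_columns` and the extra `Agency` metadata column in the sample are made up for the example.

```python
import io

def read_named_columns(lines, wanted):
    """Return {column name: list of floats} for the requested header names."""
    # Skip blank lines and '#' comments, then split on whitespace.
    rows = (ln.split() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#"))
    header = next(rows)
    # Resolve each wanted name to its position in this particular file.
    positions = [header.index(name) for name in wanted]
    columns = {name: [] for name in wanted}
    for row in rows:
        for name, pos in zip(wanted, positions):
            columns[name].append(float(row[pos]))
    return columns

SAMPLE = """# hypothetical file with an extra metadata column in front
Agency Measurement_number Date Gage_height Discharge_value
USGS 1 2017-01-01 10 1000
USGS 2 2017-01-20 15 2000
"""

columns = read_named_columns(io.StringIO(SAMPLE), ["Gage_height", "Discharge_value"])
print(columns)
```

Because the positions are resolved per file, the same call works whether the wanted columns are 3rd and 4th or shifted by extra metadata columns.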

Source file is fixed width; need only Header and Footer in the target (Oracle)

I have a scenario where the source is a fixed-width flat file, and I need to load only the Header and Footer into the target, not the detail records.
I need to trim the first column (PA22109) to get only PA, and turn the next 2 columns into rows as two different dates.
For the Footer, I need to get only the PT (PT000000000700000030620E00000055612I00000010277I) and put the rest into a column of the target.
How can I achieve this logic? Any input is appreciated.
Source file:
PA22109 00153252015110905408179 2015110820151108PO ---header
DE0E9D TESTGROUPEXCH TESTINSEXCH TESTLOCEXCH ID014 LNAME014 FNAME014 14 MAIN ST ANYWHERE NJ011110000 195001012Z 01000000014 LNAME014 PATFIRST014 14 MAIN ST ANYWHERE NJ011110000 1955010110106000220 TESTGROUPEXCH 8179 TESTBENEXCH TESTCNTE53 0000000000 0000002643005 011234567890 011234567890 1234 TEST PHARMACY TEST PHARMACY LANE PHARMACYTOWN NJ09876 5555555555 11Y5 019876543210 019876543210 NJPRESCLAST PRESCFIRST 5555555551 DRLAST DRFIRST 110110000009770990300406048410 2015092720150927154401000000000000120150929 0000100000000000000000000000000
PT000000000700000030620E00000055612I00000010277I --Footer
As this is a fixed-width file, you can do the following to meet your requirement.
In your Informatica mapping, read each row into a single column.
In an Expression transformation, flag each record that does not start with PA or PT (assuming your detail records never start with PA or PT), then drop the flagged detail records with a Filter transformation.
Now you have only the Header and Footer records.
You can then apply the respective logic for the PA and PT records in an Expression.
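Outside Informatica, the filter step can be prototyped in a few lines of Python to check the logic against a sample file. This is only a sketch: the function name `header_and_footer` is hypothetical, the detail line is shortened, and the exact fixed-width offsets for the header dates are not known from the question, so the record is simply split into its two-character type and the remainder.

```python
def header_and_footer(lines, keep=("PA", "PT")):
    """Keep only records whose first two characters mark a header or footer."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        record_type = line[:2]
        if record_type in keep:
            # Store the record type and everything after it as separate fields.
            records.append((record_type, line[2:]))
    return records

SAMPLE = [
    "PA22109 00153252015110905408179 2015110820151108PO\n",
    "DE0E9D TESTGROUPEXCH TESTINSEXCH TESTLOCEXCH\n",  # detail record (shortened)
    "PT000000000700000030620E00000055612I00000010277I\n",
]

records = header_and_footer(SAMPLE)
for record_type, rest in records:
    print(record_type, rest)
```

As in the mapping described above, this relies on the assumption that no detail record starts with PA or PT.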

Stata - inputting data from .txt with "" and ,

I am using perl to scrape the following through .txt which I'd ultimately bring into Stata. What format option works? I have many such observations, so would like to use an approach over which I can generalize.
The original data are of the form:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
Location: District 1, Ocean City, Cape May, New Jersey, USA
First Name: Lee Roy
Last Name: McBride
Birth Year: 1967
Location: Precinct 5, District 2, Chicago, Cook, Illinois, USA
The goal is to create the variables in Stata:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
County: Cape May
State: New Jersey
First Name: Lee Roy
Last Name: McBride
Birth Year: 1967
County: Cook
State: Illinois
What possible .txt might lead to such, and how would I load it into Stata?
Also, the number of terms in Location varies, as in these 2 examples, but I always want the 2 before USA.
At the moment, I am putting "", around each variable from the table for the .txt.
"Allen","Von Schmidt","1965","District 1, Ocean City, Cape May, New Jersey, USA"
"Lee Roy","McBride","1967","Precinct 5, District 2, Chicago, Cook, Illinois, USA"
Is there a better way to format the .txt? How would I create the corresponding variables in Stata?
Thank you for your help!
P.S. I know that Stata uses infile or insheet and can handle commas or tabs to separate variables. I did not know how to scrape a variable like Location in perl with all of those commas, so I added the "".
There are two ways to do this. The first is to paste the data into your do-file and use input. Assuming the format is fairly regular, you can clean it up easily using commas to parse. Note that I removed the commas:
#delimit;
input
str100(first_name last_name yob geo);
"Allen" "Von Schmidt" "1965" "District 1, Ocean City, Cape May, New Jersey, USA";
end;
compress;
destring, replace;
split geo, parse(,);
rename geo1 district;
rename geo2 city;
rename geo3 county;
rename geo4 state;
rename geo5 country;
drop geo;
The second way is to insheet the data from the txt file directly, which is probably easier. This assumes that the commas were not removed:
#delimit;
insheet first_name last_name yob geo using "raw_data.txt", clear comma nonames;
Then clean it up as in the first example.
This isn't a complete answer, but I need more space and flexibility than comments (easily) allow.
One trick is based on peeling off elements from the end. The easiest way to do that could be to start looking for the last comma, which is in turn the first comma in the reversed string. Use strpos(reverse(stringvar), ",").
For example, the first comma is found by strpos() like this
. di strpos("abcd,efg,h", ",")
5
and the last comma like this
. di strpos(reverse("abcd,efg,h"), ",")
2
Once you know where the last comma is you can peel off the last element. If the last comma is at position # in the reversed string, it is at position -# in the string.
. di substr("abcd,efg,h", -2, 2)
,h
These examples clearly are calculator-style examples for single strings. But the last element can be stripped off similarly for entire string variables.
. gen poslastcomma = strpos(reverse(var), ",")
. gen var_end = substr(var, -poslastcomma, poslastcomma)
. gen var_begin = substr(var, 1, length(var) - poslastcomma)
Once you get used to stuff like this you can write more complicated statements with fewer variables, but slowly, slowly step by step is better when you are learning.
By the way, a common Stata learner error (in my view) is to assume that a solution to a string problem must entail the use of regular expressions. If you are very fluent at regular expressions, you can naturally do wonderful things with them, but the other string functions in conjunction can be very powerful too.
In your specific example, it sounds as if you want to ignore a last element such as "USA" and then work in turn on the next elements working backwards.
split in Stata is fine too (I am a fan and indeed am its putative author) but can be awkward if a split yields different numbers of elements, which is where I came in.
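If the cleanup is done before the data reaches Stata, the "work backwards from the end" idea can be sketched at the scraping stage. The example below is in Python rather than perl or Stata, and the helper name `county_and_state` is hypothetical; it assumes the last element is always the country, as in both sample records.

```python
def county_and_state(location):
    """Return the two comma-separated elements that precede the final one."""
    parts = [part.strip() for part in location.split(",")]
    if len(parts) < 3:
        raise ValueError("expected at least county, state, country: %r" % location)
    # Index from the end, so a varying number of leading elements does not matter.
    return parts[-3], parts[-2]

first = county_and_state("District 1, Ocean City, Cape May, New Jersey, USA")
second = county_and_state("Precinct 5, District 2, Chicago, Cook, Illinois, USA")
print(first, second)
```

Writing county and state out as their own fields would leave no embedded commas in the .txt, so the quoting workaround would no longer be needed for those two variables.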