I am creating YML templates to match files (parsed with Python). In each YML template I enter fields whose patterns are matched against the input file, and Python then converts the matches into a database (a CSV file).
But I am facing a problem matching company details. A portion of the file looks like this:
COMPANY DETAILS
Date : 01-06-2018
ABC Industries
12-31 Lane
New York
Contact No. 1111
The company is ABC Industries, but in the file that I have, the Date line comes between the COMPANY DETAILS heading and the actual company details.
I matched the Date as:
date: Date :\s+(\d+\-\d+\-\d+)
in the YML template file. But I am unable to match the company details.
I am using a regex like this to skip the line starting with the text Date :
company: COMPANY DETAILS\s+^(Date :.*)?([A-Za-z\s*]*)\s+Contact No.
But it isn't working. Please help me out with a proper regex that skips any blank lines and any lines that start with Date : so that I can extract the proper company details from the text.
Thanks in advance.
EDIT
This problem is solved now.
COMPANY DETAILS\s+Date :\s+\d+\-\d+\-\d+\s+([A-Z ]*)\n
Did the trick.
You may use
COMPANY DETAILS\s+Date\s*:.*\s*(.+)
See the regex demo
Details
COMPANY DETAILS - a literal substring
\s+ - 1+ whitespace
Date\s*: - Date, 0+ whitespaces, :
.*\s* - the rest of the Date line, then 0+ whitespaces (including the line breaks that follow it)
(.+) - Group 1: the line with the company data.
Python demo:
import re
rx = r"COMPANY DETAILS\s+Date\s*:.*\s*(.+)"
s = "COMPANY DETAILS\n\nDate : 01-06-2018\n\nABC Industries\n12-31 Lane\nNew York\n\nContact No. 1111"
m = re.search(rx, s, re.MULTILINE)
if m:
    print(m.group(1)) # => ABC Industries
Using re.search
Demo:
import re
s = """COMPANY DETAILS
Date : 01-06-2018
ABC Industries
12-31 Lane
New York
Contact No. 1111"""
# Skip the heading and the Date line, then lazily capture everything up to "Contact"
m = re.search(r"COMPANY DETAILS\s+Date\s*:[^\n]*\s+(?P<company>.*?)\s*(?=Contact)", s, flags=re.DOTALL)
if m:
    print(m.group('company'))
Output:
ABC Industries
12-31 Lane
New York
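If the Date line or blank lines may or may not be present, a line-based filter can be easier to maintain than a single regex. The following is only a sketch in plain Python, outside the YML template; the function name company_block is hypothetical, and it assumes the block always starts at COMPANY DETAILS and ends before Contact No.

```python
import re

def company_block(text):
    """Return the lines between 'COMPANY DETAILS' and 'Contact No.',
    dropping blank lines and any line that starts with 'Date :'."""
    m = re.search(r"COMPANY DETAILS\n(.*?)\n\s*Contact No\.", text, flags=re.DOTALL)
    if not m:
        return []
    lines = (line.strip() for line in m.group(1).splitlines())
    return [line for line in lines if line and not re.match(r"Date\s*:", line)]

sample = ("COMPANY DETAILS\n\nDate : 01-06-2018\n\nABC Industries\n"
          "12-31 Lane\nNew York\n\nContact No. 1111")
print(company_block(sample))  # ['ABC Industries', '12-31 Lane', 'New York']
```

The first element of the returned list is the company name; the remaining elements are the address lines.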
I need to extract the title from a name, but I cannot understand how the code works. I have provided the code below:
combine = [traindata, testdata]
for dataset in combine:
    dataset["title"] = dataset["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
There is no error, but I need to understand how the above code works.
Name
Braund, Mr. Owen Harris
Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Heikkinen, Miss. Laina
Futrelle, Mrs. Jacques Heath (Lily May Peel)
Allen, Mr. William Henry
Moran, Mr. James
Above is the Name feature from the CSV file; dataset["title"] stores the title from each name, that is Mr, Miss, Master, etc.
Your code extracts the title from the name using the pandas.Series.str.extract function, which uses a regex.
pandas.Series.str.extract - Extract capture groups in the regex pat as columns in a DataFrame.
' ([A-Za-z]+)\.' is the regex pattern in your code. It matches a space, then a run of letters that is immediately followed by a . in the string (here, Name); the parentheses capture only the letters.
[A-Za-z] - this part of the pattern matches one character in the alphabetic ranges a-z and A-Z
+ - states that there can be one or more such characters
\. - matches the literal . that follows the letters
An example is provided at the link above, where it extracts parts of a string and puts the parts in separate columns.
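To see what the pattern captures without loading the dataset, here is a minimal sketch that applies the same regex with Python's built-in re module to a few of the names from the question. It uses re.search, which returns the first match only; that mirrors how str.extract takes the first match per row.

```python
import re

# Same pattern as in the question: a space, one or more letters (captured),
# then a literal dot.
TITLE = re.compile(r" ([A-Za-z]+)\.")

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]

titles = []
for name in names:
    m = TITLE.search(name)  # first match only
    titles.append(m.group(1) if m else None)

print(titles)  # ['Mr', 'Mrs', 'Miss']
```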
I found this response and its link very helpful for learning how to use the str.extract method and how to put the strings in columns or in a Series by changing the value of expand from True to False.
I have a passage and I need to extract a couple of words from it in Tableau. The passage is given below:
This looks like a suspicious account. Please look at the details
below. Name: John Mathew Email:john.mathew#abc.com Phone:+1
111-111-1111 Department: abc
For more enquiries contact: ----
Name, email, phone and the department are in the same line separated by blank spaces. I used the below regex and it works well for the department alone:
regexp_extract([CASE DESCRIPTION],'Department : (.+)')
When I apply the same approach to the name, I get:
Name: John Mathew Email:john.mathew#abc.com Phone:+1 111-111-1111
Department: abc
instead of just the name. The same happens with email.
How do I solve this problem?
It looks to me like the issue is that your regex just has '(.+)' as its capture group, which basically means "everything" (after the specified string). Since the fields are all on one line, everything after "name" includes the email, phone, and department. (The regex works with department because it's the last thing on the line.)
So, to make it work right, you need to give your regex something other than the end of the line to stop on. To capture just the name, you need to stop before the Email tag, and so on down the list. Something like
Name = regexp_extract([CASE_DESCRIPTION],'Name:\s*(.+?)\s+Email:')
email = regexp_extract([CASE_DESCRIPTION],'Email:\s*(.+?)\s+Phone:')
phone = regexp_extract([CASE_DESCRIPTION],'Phone:\s*(.+?)\s+Department:')
department = regexp_extract([CASE_DESCRIPTION],'Department:\s*(.+)')
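The idea of stopping each capture at the next label can be checked outside Tableau. The following is a rough Python sketch, not Tableau syntax; it assumes the four labels always appear in this order on one line, and it uses \s* after each colon because the sample text has no space after Email: and Phone:.

```python
import re

text = ("Name: John Mathew Email:john.mathew#abc.com "
        "Phone:+1 111-111-1111 Department: abc")

# Each pattern stops at the label of the next field; the last one runs to
# the end of the line.
patterns = {
    "name": r"Name:\s*(.+?)\s+Email:",
    "email": r"Email:\s*(.+?)\s+Phone:",
    "phone": r"Phone:\s*(.+?)\s+Department:",
    "department": r"Department:\s*(.+)",
}

fields = {}
for key, pattern in patterns.items():
    m = re.search(pattern, text)
    fields[key] = m.group(1) if m else None

print(fields)
```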
I am new to regex.
I have a String formatted like below
Street Name
City, StateCode ZipNumber
for example, the string can be like
50 Connecticut Avenue
Norwalk, CT 06850
or
123 6th Avenue
New York, NY 10013
or
4TH Highway 6
Rule, TX 79547
I am trying to construct a regex for this.
But I cannot proceed, as I have very little knowledge of regex.
Can you please help me?
The following might be enough:
^(?<Street>[^\n]+)\n(?<City>[^,]+), (?<StateCode>[A-Z]{2}) (?<Zip>\d+)$
It captures the following segments in different groups:
the first line in a group named Street
the part of the second line which precedes the comma in a group named City
the next two capital letters in a group named StateCode
the following digits in a group named Zip
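As a quick check, here is a sketch of the same pattern in Python. The only change is the named-group syntax: Python's re module writes (?P<name>...) where the pattern above writes (?<name>...). It assumes the address is exactly two lines with no trailing whitespace.

```python
import re

ADDRESS = re.compile(
    r"^(?P<Street>[^\n]+)\n(?P<City>[^,]+), (?P<StateCode>[A-Z]{2}) (?P<Zip>\d+)$"
)

m = ADDRESS.match("50 Connecticut Avenue\nNorwalk, CT 06850")
parts = m.groupdict() if m else {}
print(parts)
# {'Street': '50 Connecticut Avenue', 'City': 'Norwalk', 'StateCode': 'CT', 'Zip': '06850'}
```

The Zip group is kept as a string, which preserves the leading zero in 06850.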
I have the below data.
• PRT_Edit & Set Shopping Cart in Retail
• PRT_Confirm Shopping Cart for Goods
o PRT-Ret_Process Supplier Invoice
o PRT-Web_Overview of Orders
o PRT_Update Outfirst Agreement
PRT_Axn_-Purchase and Requisition
The data has special symbols, tab space and spaces. I want to extract only the text part from this data as:
PRT_Edit & Set Shopping Cart in Retail
PRT_Confirm Shopping Cart for Goods
PRT-Ret_Process Supplier Invoice
PRT-Web_Overview of Orders
PRT_Update Outfirst Agreement
I have tried using REGEX_EXTRACT_ALL in Pig Script as below but it does not work.
PRT = LOAD '/DATA' USING TEXTLOADER() AS (LINE:CHARARRAY);
Cleansed = FOREACH PRT GENERATE REGEX_EXTRACT_ALL(LINE,'[A-Z]*') AS DATA;
When I try dumping Cleansed, it does not show any data. Can anyone please help?
You can use
Cleansed = FOREACH PRT GENERATE FLATTEN(
REGEX_EXTRACT_ALL(LINE, '^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$'))
AS (FIELD1:chararray), LINE;
The regex matches the following:
^ - start of string
[^a-zA-Z]* - 0 or more characters other than the Latin letters in the character class
([a-zA-Z].*[a-zA-Z]) - a capturing group that we'll refer to as FIELD1 later, matching:
[a-zA-Z].*[a-zA-Z] - a Latin letter, then any 0+ characters, as many as possible (the greedy * is used, not the lazy *?), then a Latin letter
[^a-zA-Z]* - 0 or more characters other than the Latin letters
$ - end of string
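The behaviour of this pattern can be tried in Python before running the Pig job. This is only a sketch with made-up input lines modelled on the question's data. Note one caveat it exposes: if a bullet is the literal letter o rather than a symbol, the pattern treats it as the first letter and keeps it.

```python
import re

TRIM = re.compile(r"^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$")

lines = [
    "\u2022\tPRT_Edit & Set Shopping Cart in Retail",  # bullet symbol + tab
    "  \tPRT_Update Outfirst Agreement  ",             # leading/trailing blanks
    "o PRT-Web_Overview of Orders",                    # bullet typed as letter o
]

cleaned = []
for line in lines:
    m = TRIM.match(line)
    cleaned.append(m.group(1) if m else None)

for text in cleaned:
    print(text)
```

The first two lines come out without the bullet and surrounding whitespace; the third keeps its leading "o ", so such bullets would need an extra rule.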
My text file (input):
City,Description
Chicago,One day car rental is <b>$90</b>
Dallas,One day car rental is <b>$65</b>
Output needed:
City Costofrental
Chicago, $90
Dallas, $65
I am using REGEX_EXTRACT to get the cost ($) details, but I am not getting the desired output. I am new to regex, so please let me know what I am missing. TIA
A = LOAD '/user/Testfile.csv' USING PigStorage(',') AS(a1:chararray,a8:chararray);
B = FOREACH A GENERATE a1,REGEX_EXTRACT(a8, '/<b>([0-9]*)</b>/',1);
dump B;
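You need to add an escaped \$ inside the capture group. Pig uses Java regular expressions, so the surrounding / delimiters should be dropped (they would be matched literally), </b> needs no escaping, and the backslash is doubled inside the Pig string literal:
'<b>(\\$[0-9]*)</b>'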
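The capture itself can be checked in Python. This is a sketch on the two sample rows from the question, not Pig code; in Python's re there are no / delimiters, and a raw string lets \$ be written with a single backslash.

```python
import re

rows = [
    "Chicago,One day car rental is <b>$90</b>",
    "Dallas,One day car rental is <b>$65</b>",
]

# A literal dollar sign followed by digits, between <b> and </b>.
PRICE = re.compile(r"<b>(\$[0-9]+)</b>")

result = []
for row in rows:
    city, description = row.split(",", 1)  # split on the first comma only
    m = PRICE.search(description)
    result.append((city, m.group(1) if m else None))

print(result)  # [('Chicago', '$90'), ('Dallas', '$65')]
```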