PIG REGEX_EXTRACT_ALL is not working - regex

I have the below data.
• PRT_Edit & Set Shopping Cart in Retail
• PRT_Confirm Shopping Cart for Goods
o PRT-Ret_Process Supplier Invoice
o PRT-Web_Overview of Orders
o PRT_Update Outfirst Agreement
PRT_Axn_-Purchase and Requisition
The data has special symbols, tabs, and spaces. I want to extract only the text part from this data, like this:
PRT_Edit & Set Shopping Cart in Retail
PRT_Confirm Shopping Cart for Goods
PRT-Ret_Process Supplier Invoice
PRT-Web_Overview of Orders
PRT_Update Outfirst Agreement
I have tried using REGEX_EXTRACT_ALL in Pig Script as below but it does not work.
PRT = LOAD '/DATA' USING TEXTLOADER() AS (LINE:CHARARRAY);
Cleansed = FOREACH PRT GENERATE REGEX_EXTRACT_ALL(LINE,'[A-Z]*') AS DATA;
When I try dumping Cleansed, it does not show any data. Can anyone please help?

You can use
Cleansed = FOREACH PRT GENERATE FLATTEN(
REGEX_EXTRACT_ALL(LINE, '^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$'))
AS (FIELD1:chararray), LINE;
The regex matches the following:
^ - start of string
[^a-zA-Z]* - 0 or more characters other than the Latin letters in the character class
([a-zA-Z].*[a-zA-Z]) - a capturing group that we'll refer to as FIELD1 later, matching:
[a-zA-Z].*[a-zA-Z] - a Latin letter, then any characters, as many as possible (the greedy * is used, not the lazy *?), then a Latin letter
[^a-zA-Z]* - 0 or more characters other than the Latin letters
$ - end of string
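To check the pattern outside Pig, the same expression can be tried in Python, where this particular pattern should behave the same way as in Java. This is only a sketch: the helper name extract_text and the sample strings are assumptions for illustration, not part of the original script. Note that if the "o" bullets in the data are literal letters rather than bullet symbols, the pattern keeps them, because [^a-zA-Z]* stops at the first Latin letter.

```python
import re

# Same pattern as in the Pig statement above.
PATTERN = re.compile(r'^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$')

def extract_text(line):
    """Return the text between the first and last Latin letter, or None."""
    m = PATTERN.match(line)
    return m.group(1) if m else None

# Hypothetical sample lines modelled on the data in the question.
samples = [
    "\u2022 PRT_Edit & Set Shopping Cart in Retail",
    "\t\u2022\tPRT_Confirm Shopping Cart for Goods  ",
    "o PRT-Ret_Process Supplier Invoice",  # literal letter "o" used as a bullet
]

for line in samples:
    print(repr(extract_text(line)))
```

The third sample shows the caveat: a bullet that is really the letter "o" stays in the captured text and would need separate handling.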

Related

RegEx for String Validation [x_name1_zded|we3e_name2_235|yyy_name3_3435]

I have a requirement to check that the product input is valid.
Example Product String = "x_name1_zded|we3e_name2_235|yyy_name3_3435"
Each product is delimited with "|".
Each productInfo is delimited with "_".
In this example there are 3 products:
1-> x_name1_zded
2-> we3e_name2_235
3-> yyy_name3_3435
Each product has 3 details; for example, product 1 has id: x, name: name1, store: zded.
I need a RegEx to validate that each product delimited with "|" has at least 3 sections (id, name, store). The user can send any number of products separated by "|".
So the RegEx should validate that if a product is present, it has 3 sections.
I am trying to do this in a JSON Schema validator, in the pattern section.
Suppose each details section can contain the letters a-z and the digits 0-9, and is at least 1 character long. A details section is then [a-z0-9]+. We need at least 3 sections separated by _, that is, one section followed by at least 2 sequences of "delimiter + section" (pseudocode):
section (delimiter section)*(2 or more times)
In regex a single product will be:
[a-z0-9]+(_[a-z0-9]+){2,}
Next, there can be N products. If N is any value greater than or equal to 1, we can use the same scheme:
product (delimiter product)*(zero or more times)
So final version of regex is:
[a-z0-9]+(_[a-z0-9]+){2,}(\|[a-z0-9]+(_[a-z0-9]+){2,})*
\| is the escaped delimiter, because | is a regex metacharacter. * means "zero or more times".
You can replace [a-z0-9]+ with any other regex that describes your details section.
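The construction above can be sketched in Python as follows. The names SECTION, PRODUCT and is_valid are assumptions made for illustration. The sketch uses re.fullmatch so that the whole string must match; in a JSON Schema pattern, which is not implicitly anchored, the same effect would need the expression wrapped in ^ and $.

```python
import re

# One details section: lowercase letters and digits, at least one character.
SECTION = r"[a-z0-9]+"
# One product: a section followed by at least two "_section" parts.
PRODUCT = rf"{SECTION}(?:_{SECTION}){{2,}}"
# One or more products separated by a literal "|".
PRODUCTS = re.compile(rf"{PRODUCT}(?:\|{PRODUCT})*")

def is_valid(product_string):
    """True if the whole string is a valid '|'-separated product list."""
    return PRODUCTS.fullmatch(product_string) is not None

print(is_valid("x_name1_zded|we3e_name2_235|yyy_name3_3435"))  # valid: three products
print(is_valid("x_name1_zded|we3e_name2"))                     # invalid: second product has 2 sections
```

Building the expression from named pieces keeps the "section" definition in one place, so it can be swapped for a different character set without touching the rest.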

Skip multiple lines in YML Regex

I am creating YML templates to match files (through Python parsing). In the YML template I have to enter fields that match text in the input file, and Python then converts the matches into a database (a CSV file).
But I am facing a problem matching company details. A portion of the file looks like this:
COMPANY DETAILS
Date : 01-06-2018
ABC Industries
12-31 Lane
New York
Contact No. 1111
The company is actually ABC Industries, but in the file that I have, the Date line comes between the COMPANY DETAILS text and the actual company details.
I matched the Date as:
date: Date :\s+(\d+\-\d+\-\d+)
in the YML template file. But I am unable to match the company details.
I am using Regex like this to skip the line starting with the text DATE:
company: COMPANY DETAILS\s+^(Date :.*)?([A-Za-z\s*]*)\s+Contact No.
But it isn't working. Please help me out with a proper Regex which skips any blank lines or the lines which start with Date : so that I can extract the proper company details from the text.
Thanks in advance.
EDIT
This problem is solved now.
COMPANY DETAILS\s+Date :\s+\d+\-\d+\-\d+\s+([A-Z ]*)\n
Did the trick.
You may use
COMPANY DETAILS\s+Date\s*:.*\s*(.+)
Details
COMPANY DETAILS - a literal substring
\s+ - 1+ whitespace
Date\s*: - Date, 0+ whitespaces, :
.*\s* - the rest of the Date line, then 0+ whitespaces (including line breaks)
(.+) - Group 1: the line with the company data.
Python demo:
import re
rx = r"COMPANY DETAILS\s+Date\s*:.*\s*(.+)"
s = "COMPANY DETAILS\n\nDate : 01-06-2018\n\nABC Industries\n12-31 Lane\nNew York\n\nContact No. 1111"
m = re.search(rx, s, re.MULTILINE)
if m:
    print(m.group(1)) # => ABC Industries
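If the Date line is not always present, one possible variation is to make that part of the pattern optional. The following is a sketch under that assumption; the function name company_name and the second sample string are made up for illustration.

```python
import re

# Assumption: the Date line may be missing, so the non-capturing group is optional.
rx = re.compile(r"COMPANY DETAILS\s+(?:Date\s*:.*\s+)?(.+)")

def company_name(text):
    """Return the first non-Date line after COMPANY DETAILS, or None."""
    m = rx.search(text)
    return m.group(1) if m else None

with_date = "COMPANY DETAILS\n\nDate : 01-06-2018\n\nABC Industries\n12-31 Lane"
without_date = "COMPANY DETAILS\nABC Industries\n12-31 Lane"

print(company_name(with_date))
print(company_name(without_date))
```

Both calls print ABC Industries, since the optional group consumes the Date line only when it is there.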
Using re.search
Demo:
import re
s = """COMPANY DETAILS
Date : 01-06-2018
ABC Industries
12-31 Lane
New York
Contact No. 1111"""
m = re.search("(?<=COMPANY DETAILS)(?P<company>.*?(?=Contact))", s, flags=re.DOTALL)
if m:
    print( m.group('company') )
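Output (the capture runs from just after the header text up to Contact, so it also includes the Date line):
Date : 01-06-2018
ABC Industries
12-31 Lane
New York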

How to lookup an array of strings to match a value in a column?

I have a master table holding the list of possible street types:
CREATE TABLE land.street_type (
str_type character varying(300)
);
insert into land.street_type values
('STREET'),
('DRIVE'),
('ROAD');
I have a table into which addresses are loaded, and I need to parse each string, look up the street type in the master table, and fetch the suburb that follows the street.
CREATE TABLE land.bank_application (
mailing_address character varying(300)
);
insert into land.bank_application values
('8 115 MACKIE STREET VICTORIA PARK WA 6100 AU'),
('69 79 CABBAGE TREE ROAD BAYVIEW NSW 2104 AU'),
('17 COWPER DRIVE CAMDEN SOUTH NSW 2570 AU');
Expected output:
VICTORIA PARK
BAYVIEW
CAMDEN SOUTH
Is there any PostgreSQL technique to look up an array of values against a table column and fetch the data following the matching word?
If I can fetch the data after the street type, I can then remove the last 3 fields (state, postal code and country code) to identify the suburb.
This query does what you ask for using regular expressions:
SELECT substring(b.mailing_address, ' ' || s.str_type || ' (.*) \D+ \d+ \D+$') AS suburb
FROM   land.bank_application b
JOIN   land.street_type s ON b.mailing_address ~ (' ' || s.str_type || ' ');
The regexp ' (.*) \D+ \d+ \D+$' explained step by step:
(space) .. leading space (the assumed delimiter, else something like 'BROAD' would match 'ROAD')
(.*) .. capturing parentheses with 0-n arbitrary characters: .*
\D+ .. 1-n non-digits
\d+ .. 1-n digits
$ .. end of string
The manual on POSIX Regular Expressions.
But it relies on the given format of mailing_address. Is the format of your strings that reliable?
And suburbs can have words like 'STREET' etc. as part of their name, so the approach seems unreliable in principle.
BTW, there is no array involved, you seem to be confusing arrays and sets.
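For experimenting with the pattern outside the database, the same idea can be sketched in Python. The list STREET_TYPES and the function suburb are hypothetical names standing in for the lookup table and the query. Unlike the SQL join, which returns one row per matching street type, this sketch returns the first match only.

```python
import re

# Stand-in for the land.street_type lookup table.
STREET_TYPES = ["STREET", "DRIVE", "ROAD"]

def suburb(address, street_types=STREET_TYPES):
    """Return the text between the street type and the trailing
    'state postcode country' fields, or None if nothing matches."""
    for street_type in street_types:
        pattern = rf" {re.escape(street_type)} (.*) \D+ \d+ \D+$"
        m = re.search(pattern, address)
        if m:
            return m.group(1)
    return None

addresses = [
    "8 115 MACKIE STREET VICTORIA PARK WA 6100 AU",
    "69 79 CABBAGE TREE ROAD BAYVIEW NSW 2104 AU",
    "17 COWPER DRIVE CAMDEN SOUTH NSW 2570 AU",
]
for a in addresses:
    print(suburb(a))
```

The same reliability caveats apply: the sketch depends on the trailing fields always having the "letters digits letters" shape.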

VBscript_Words containing unique and case-insensitive letters only

I have a large text file in which I need to find the words containing only unique letters(a-z,A-Z). The words should not contain any character other than letters. Also, it needs to be case-insensitive so that words like alphA, morNing are not matched.
Examples:
marco - Should match(because of unique letters)
asia - Should Not Match(contains 2 'a')
asiA - Should Not Match(as it has 'a' and 'A')
alpha - Should not match
mike - Should match
roger - Should not match
abascus - Should not match
mach1 - Should not match(because of presence of 1)
Sample text from file against which I need to test:
The shares together form stock.The stock of a corporation is partitioned into shares, the total of which are stated at the time of business formation. Additional shares may subsequently be authorized by the existing shareholders and issued by the company. In some jurisdictions, each share of stock has a certain declared par value, which is a nominal accounting value used to represent the equity on the balance sheet of the corporation. In other jurisdictions, however, shares of stock may be issued without associated par value.
Shares represent a fraction of ownership in a business. A business may
declare different types (or classes) of shares, each having
distinctive ownership rules, privileges, or share values. Ownership of
shares may be documented by issuance of a stock certificate. A stock
certificate is a legal document that specifies the number of shares
owned by the shareholder, and other specifics of the shares, such as
the par value, if any, or the class of the shares.
In the United Kingdom, Republic of Ireland, South Africa, and
Australia, stock can also refer to completely different financial
instruments such as government bonds or, less commonly, to all kinds
of marketable securities.
My attempt:
\b(?![^a-zA-Z]+)(?!(?:[a-zA-Z]*([a-zA-Z]))*\1)[a-zA-Z]+\b
but it is not able to match anything here.
I have been stuck here for quite some time. Please point me in the right direction. Thanks
Try this regex:
\b(?![^a-zA-Z]+\b)(?![a-zA-Z]*([a-zA-Z])[a-zA-Z]*\1)[a-zA-Z]+\b
Explanation:
\b - Word boundary
(?![^a-zA-Z]+\b) - Negative lookahead validating that words should only contain the 1+ letters
(?![a-zA-Z]*([a-zA-Z])[a-zA-Z]*\1) - Another negative lookahead - this part is for validating no 2 letters are repeated. Further break-up below:
[a-zA-Z]* - checks for the presence of 0+ letters
([a-zA-Z]) - captures a letter in a group. This letter captured in the group will be checked for any repetition.
[a-zA-Z]* - checks for the presence of 0+ letters again so as to consider for the cases when the repeated letters are not next to each other.
\1 - checks for another occurrence of the letter captured in group 1
[a-zA-Z]+ - matches 1+ occurrences of a letter
\b - Word Boundary
VBScript Code:
Option Explicit
Dim objRE, strTest, objMatches, match, strOutput
strTest = "marco asia asiA alpha mike roger abascus mach1"
Set objRE = New RegExp
objRE.Global=True
objRE.IgnoreCase=True
objRE.Pattern="\b(?![^a-zA-Z]+\b)(?![a-zA-Z]*([a-zA-Z])[a-zA-Z]*\1)[a-zA-Z]+\b"
Set objMatches = objRE.Execute(strTest)
For Each match In objMatches
    strOutput = strOutput & match.Value & vbCrLf
Next
MsgBox strOutput
Set objMatches = Nothing
Set objRE = Nothing
Output:
marco
mike
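The same pattern can also be tried in Python, whose re module supports the lookaheads and the backreference used here. This is a sketch for experimentation rather than a replacement for the VBScript code; the function name unique_letter_words is an assumption. re.finditer is used instead of re.findall because findall would return the contents of the capturing group rather than the whole word.

```python
import re

# Same pattern as above; IGNORECASE makes \1 match regardless of case,
# mirroring objRE.IgnoreCase = True.
PATTERN = re.compile(
    r"\b(?![^a-zA-Z]+\b)(?![a-zA-Z]*([a-zA-Z])[a-zA-Z]*\1)[a-zA-Z]+\b",
    re.IGNORECASE,
)

def unique_letter_words(text):
    """Return the words made only of letters, none of which repeats."""
    return [m.group(0) for m in PATTERN.finditer(text)]

print(unique_letter_words("marco asia asiA alpha mike roger abascus mach1"))
```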

scala regex to limit with double space

I have data like the following:
135 stjosephhrsecschool  london  DunAve
175865 stbele_higher_secondary sch  New York
11 st marys high school for women  Paris  Louis Avenue
I want to extract id schoolname city area.
The pattern is: an id (digits), followed by a single space, then the school name. The name can have multiple words separated by single spaces, and it may contain special characters. Then come two or more spaces, then the city. The city may likewise have multiple words separated by single spaces, or special characters. Then come two or more spaces, then the area, which follows the same rules as the school name and city. The area may or may not be present in the line; if it is missing, I want a null value for the area.
Here is regex I have tried.
([\d]+) ([\w\s\S]+)\s\s+([\w\s\S]+)\s\s+([\w\s\S]*)
But this regex does not stop when it sees two or more spaces. I am not sure how to modify it to fit my data.
All help is appreciated.
Thanks
If I understand your issue correctly - the issue is that the resulting groups contain trailing spaces (e.g. "Louis Avenue "). If so - you can fix this by using the non-greedy modifiers like +? and *?:
([\d]+) ([\w\s\S]+?)\s\s+([\w\s\S]+?)\s\s+([\w\s\S]*?)?\s*
Which results in what seems to be the desired output:
val s1 = "135 stjosephhrsecschool  london  DunAve"
val s2 = "175865 stbele_higher_secondary sch  New York  "
val s3 = "11 st marys high school for women  Paris  Louis Avenue  "
val r = """([\d]+) ([\w\s\S]+?)\s\s+([\w\s\S]+?)\s\s+([\w\s\S]*?)?\s*""".r
def matching(s: String) = s match {
  case r(a,b,c,d) => println((a,b,c,d))
  case _ => println("no match")
}
matching(s1) // (135,stjosephhrsecschool,london,DunAve)
matching(s2) // (175865,stbele_higher_secondary sch,New York,)
matching(s3) // (11,st marys high school for women,Paris,Louis Avenue)
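An alternative that avoids one large regex is to split each line on runs of two or more whitespace characters and then separate the id from the school name at the first single space. The sketch below shows this in Python; the function name parse_line is an assumption, and it returns None for a missing city or area, which corresponds to the null value asked for in the question.

```python
import re

def parse_line(line):
    """Split a record into (id, school name, city, area).
    Fields are separated by two or more whitespace characters;
    the id and school name are separated by a single space."""
    parts = re.split(r"\s{2,}", line.strip())
    school_id, _, name = parts[0].partition(" ")
    city = parts[1] if len(parts) > 1 else None
    area = parts[2] if len(parts) > 2 else None
    return school_id, name, city, area

print(parse_line("135 stjosephhrsecschool  london  DunAve"))
print(parse_line("175865 stbele_higher_secondary sch  New York  "))
print(parse_line("11 st marys high school for women  Paris  Louis Avenue  "))
```

Stripping the line first means trailing spaces never produce an empty final field, so a missing area shows up as None rather than as an empty string.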