Extracting Title, FirstName, MiddleName and LastName from Name field in Informatica


I have a table with a name field which consists of a Title (optional), FirstName, MiddleName (optional) and LastName. All records have a FirstName and LastName but not all records have either a Title or MiddleName. For example:
Mr. Joey Tribbiani (no middlename)
Rachel Karen Green (no title)
Ms. Monica E Geller (all four fields)
Phoebe Buffay (no title or middlename) and so on
The titles in the table are one of: Mr., Mrs., Ms., Ms, Sr., or Sra.
Given this, how do I split the Name field into Title, FirstName, MiddleName and LastName in Informatica?

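You can achieve this using the INSTR and SUBSTR functions in Informatica.
Say, for example, your input NAME is Mr.Joey Tribbiani
Note: I am assuming there is no space between Mr. and the first name, and that there is no middle name. If a middle name is present, the LAST NAME expression below returns everything after the first space.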
TITLE :
IIF((INSTR(NAME,'.',1))=0,NULL,
(SUBSTR(NAME,1,INSTR(NAME,'.',1)-1)))
FIRSTNAME :
SUBSTR(NAME,INSTR(NAME,'.',1)+1,((INSTR(NAME,' ',1)-1)-INSTR(NAME,'.',1)))
SUBSTR('Mr.Joey Tribbiani',INSTR('Mr.Joey Tribbiani','.',1)+1,((INSTR('Mr.Joey Tribbiani',' ',1)-1)-INSTR('Mr.Joey Tribbiani','.',1)))
LAST NAME :
SUBSTR(NAME,INSTR(NAME,' ',1)+1)
SUBSTR('Mr.Joey Tribbiani',INSTR('Mr.Joey Tribbiani',' ',1)+1)
Oracle SQL statements for your reference
SELECT SUBSTR('Mr.Joey Tribbiani',1,INSTR('Mr.Joey Tribbiani','.',1)-1) FROM DUAL;
SELECT SUBSTR('Mr.Joey Tribbiani',INSTR('Mr.Joey Tribbiani','.',1)+1,((INSTR('Mr.Joey Tribbiani',' ',1)-1)-INSTR('Mr.Joey Tribbiani','.',1))) FROM DUAL;
SELECT SUBSTR('Mr.Joey Tribbiani',INSTR('Mr.Joey Tribbiani',' ',1)+1) FROM DUAL;
See the links below for a detailed explanation of INSTR and SUBSTR:
http://www.techonthenet.com/oracle/functions/instr.php
http://www.techonthenet.com/oracle/functions/substr.php
Regards,
Raj

I am assuming that there is a single space between every word.
What I did:
Checked whether a title exists and saved the result in a variable.
Counted the spaces to determine whether a middle name exists.
Derived title, first_name, middle_name and last_name based on the above conditions.
Create an Expression transformation with the following variable ports.
Here are the variable port expressions.
v_title_exists
IIF(
SUBSTR(upper(v_FULL_NAME),1,3)='MR.' OR
SUBSTR(upper(v_FULL_NAME),1,4)='MRS.' or
SUBSTR(upper(v_FULL_NAME),1,3)='MS.' or
SUBSTR(upper(v_FULL_NAME),1,2)='MS' or
SUBSTR(upper(v_FULL_NAME),1,3)='SR.' or
SUBSTR(upper(v_FULL_NAME),1,4)='SRA.'
,1,0)
v_number_of_spaces
length(v_FULL_NAME) - length(REPLACECHR(0,v_FULL_NAME,' ',''))
v_middle_name_exists
IIF(v_title_exists=1,
IIF(v_number_of_spaces=3,1,0),
IIF(v_number_of_spaces=2,1,0)
)
v_title
IIF(v_title_exists=1,
substr(v_FULL_NAME,0,instr(v_FULL_NAME,' ',1,1) -1)
, null
)
v_first_name
IIF(v_title_exists=1,
substr(
v_FULL_NAME,
instr(v_FULL_NAME,' ',1,1)+1,
instr(v_FULL_NAME,' ',1,2)-instr(v_FULL_NAME,' ',1,1)-1
)
,
substr(
v_FULL_NAME,
1,
instr(v_FULL_NAME,' ',1,1)-1
)
)
v_middle_name
IIF(v_middle_name_exists=1,
IIF(v_title_exists=1,
substr(
v_FULL_NAME,
instr(v_FULL_NAME,' ',1,2)+1,
instr(v_FULL_NAME,' ',1,3)-instr(v_FULL_NAME,' ',1,2)-1
)
,
substr(
v_FULL_NAME,
instr(v_FULL_NAME,' ',1,1)+1,
instr(v_FULL_NAME,' ',1,2)-instr(v_FULL_NAME,' ',1,1)-1
)
)
,
null)
v_last_name
IIF(v_middle_name_exists=1,
IIF(v_title_exists=1,
substr(
v_FULL_NAME,
instr(v_FULL_NAME,' ',1,3)+1,
length(v_FULL_NAME)
)
,
substr(
v_FULL_NAME,
instr(v_FULL_NAME,' ',1,2)+1,
length(v_FULL_NAME)
)
)
,
IIF(v_title_exists=1,
substr(
v_FULL_NAME,
instr(v_FULL_NAME,' ',1,2)+1,
length(v_FULL_NAME)
)
,
substr(
v_FULL_NAME,
instr(v_FULL_NAME,' ',1,1)+1,
length(v_FULL_NAME)
)
)
)
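For comparison, here is a minimal Python sketch of the same rules, in case it helps to prototype them outside Informatica. It is not Informatica code: the function name split_name and the TITLES set are my own, and unlike the expressions above it compares the whole first word against the title list and splits on any run of whitespace.

```python
# Titles listed in the question, lower-cased for a case-insensitive comparison.
TITLES = {"mr.", "mrs.", "ms.", "ms", "sr.", "sra."}


def split_name(full_name):
    """Split a name into (title, first, middle, last); missing parts are None."""
    words = full_name.split()  # tolerates repeated spaces
    title = None
    if words and words[0].lower() in TITLES:
        title = words.pop(0)
    if len(words) < 2:
        raise ValueError("expected at least a first and a last name")
    first, last = words[0], words[-1]
    # Anything between the first and last word is treated as the middle name.
    middle = " ".join(words[1:-1]) or None
    return title, first, middle, last


if __name__ == "__main__":
    for name in ["Mr. Joey Tribbiani", "Rachel Karen Green",
                 "Ms. Monica E Geller", "Phoebe Buffay"]:
        print(split_name(name))
```

Treating everything between the first and last word as the middle name is an assumption on my part; the question only shows single-word middle names.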

Related

Using Match with Regex and Array formula in Sheets

I have a list of names, and I want to know whether they all share the same family name.
If every name in the Family column has the same family name as the one in column B, the result should be a match; otherwise not.
I started by cleaning/splitting the names
=TRANSPOSE(ARRAYFORMULA(TRIM( SPLIT(SUBSTITUTE($A2," and",","),","))))
then doing a T/F match of only the family name for each case
=ISNUMBER(MATCH(REGEXEXTRACT($B$2,"\w+$"),REGEXEXTRACT(D2,"\w+$"),0))
I wanted to do this MATCH as an array formula, but it's not working. After that, I would still have to count the TRUE values and return MATCH if all are TRUE, else NO MATCH.
I want to do this in a single cell, but I got stuck because I can't make the MATCH work over an array. I hope that makes sense, or am I going about this the wrong way?
Here's the sample sheet

try:
=ARRAYFORMULA(IF(A2:A="",,IF(1+LEN(
REGEXREPLACE(SUBSTITUTE(A2:A, "and", ","), "[^,]", ))=
MMULT(N(IFERROR(IF(SPLIT(SUBSTITUTE(A2:A, "and", ","), ",")="",,
REGEXMATCH(TRIM(SPLIT(SUBSTITUTE(A2:A, "and", ","), ",")),
REGEXEXTRACT(B2:B, "\w+$"))))),
SEQUENCE(COLUMNS(SPLIT(SUBSTITUTE(A2:A, "and", ","), ",")), 1, 1, 0)),
"match", "no match")))

use this
C2=trim(index(split(B2," "),1,COUNTA(split(B2," "))))
D2=SUBSTITUTE(A2,"and",",")
E2=if(COUNTA(split(D2,C2,false))=counta(split(D2,",",false)),"matched","not matched")
1- C2 gets the last word of B2 and uses it as the last name
2- D2 replaces "and" with ","
3- E2 splits D2 by "," and also by C2, then compares the two counts; if they are the same, all names matched
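The idea behind these steps (take the last word as the family name, normalise "and" to a comma, then check every entry) can be sketched in Python for anyone who wants to verify the logic outside Sheets. This is only an illustration: family_matches is a name I made up, the sample names in the usage lines are invented, and it compares the last word of each entry directly instead of comparing split counts.

```python
import re


def family_matches(members, reference):
    """Return True when every listed member ends with the reference's last word."""
    family = reference.split()[-1]
    # Split on commas and on the standalone word "and" (so "Sandy" stays intact).
    names = [n.strip() for n in re.split(r",|\band\b", members) if n.strip()]
    # An empty list counts as "no match" here; that is a design choice.
    return bool(names) and all(n.split()[-1] == family for n in names)


if __name__ == "__main__":
    print(family_matches("John Smith, Jane Smith and Bob Smith", "Al Smith"))
    print(family_matches("John Smith and Sandy Jones", "Al Smith"))
```

Using a word boundary around "and" avoids the side effect of a plain substitution, which would also cut names that merely contain those letters.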

another one for you:
=ARRAYFORMULA(
IFS(
A2:A = "",,
ISNA(MATCH(
ROW(A2:A),
QUERY(
QUERY(
SPLIT(
FLATTEN(
FILTER(
ROW(A2:A) & "♥"
& --NOT(REGEXMATCH(
SPLIT(
REGEXREPLACE(A2:A, ",\s*|\s+and\s+", "♥"),
"♥"
),
"^$|" & REGEXEXTRACT(B2:B, "\s(\w+)$")
)),
A2:A <> ""
)
),
"♥"
),
"SELECT Col1, SUM(Col2)
GROUP BY Col1",
),
"SELECT Col1
WHERE Col2 = 0",
),
)),
"NO MATCH",
True,
"MATCH"
)
)

PL SQL regular expression substring

I have a long string.
message := 'I loooove my pet animal';
This string is 23 chars long. If message is longer than 15 chars, I need to find the position at which I can break it into 2 strings. For example, in this case,
message1 := 'I loooove my'
message2 := 'pet animal'
Essentially, it should find the last whole-word boundary at or before 15 chars and then break the original string in two at that point.
Please give me ideas on how I can do this.
Thank you.

Here is a general solution, with possibly more than one input string and with inputs of any length. The only assumptions are that no single word is longer than 15 characters and that everything between two spaces is considered a word. If a "word" can be longer than 15 characters, the solution can be adapted, but the requirement itself would need to state what the desired result is in such a case.
I make up two input strings in a CTE (at the top); that is not part of the solution, it is just for testing and illustration. I also wrote this in plain SQL, since there is no need for PL/SQL code for this type of problem. Set processing (instead of one row at a time) should result in much better performance.
The approach is to identify the location of all spaces (I also append and prepend a space to each string, so I won't have to deal with exceptions for the first and last substring); then I decide, in a recursive subquery, where each "maximal" substring should begin and end; after that, outputting the substrings is trivial. I used a recursive query, which should work in Oracle 11.1 (or 11.2 with the syntax I used, with column names in CTE declarations; it can be changed easily to work in 11.1). In Oracle 12, it would be easier to rewrite the same idea using MATCH_RECOGNIZE.
with
inputs ( id, str ) as (
select 101, 'I loooove my pet animal' from dual union all
select 102, '1992 was a great year for - make something up here as needed' from dual
),
positions ( id, pos ) as (
select id, instr(' ' || str || ' ', ' ', 1, level)
from inputs
connect by level <= length(str) - length(replace(str, ' ')) + 2
and prior id = id
and prior sys_guid() is not null
),
r ( id, str, line_number, pos_from, pos_to ) as (
select id, ' ' || str || ' ', 0, null, 1
from inputs
union all
select r.id, r.str, r.line_number + 1, r.pos_to,
( select max(pos)
from positions
where id = r.id and pos - r.pos_to between 1 and 16
)
from r
where pos_to is not null
)
select id, line_number, substr(str, pos_from + 1, pos_to - pos_from - 1) as line_text
from r
where line_number > 0 and pos_to is not null
order by id, line_number
;
Output:
 ID LINE_NUMBER LINE_TEXT
--- ----------- ---------------
101           1 I loooove my
101           2 pet animal
102           1 1992 was a
102           2 great year for
102           3 - make
102           4 something up
102           5 here as needed
7 rows selected.
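To make the greedy rule easier to follow, here is a small Python sketch of the same idea: keep adding whole words to the current line while it stays within 15 characters, and start a new line otherwise. It is not a translation of the SQL above; wrap_words and its width parameter are names I chose for illustration, and it raises an error for a word longer than the limit because the requirement does not say what should happen then.

```python
def wrap_words(text, width=15):
    """Greedily pack whole words into lines of at most `width` characters."""
    lines, current = [], ""
    for word in text.split():
        if len(word) > width:
            raise ValueError(f"word longer than {width} characters: {word!r}")
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= width:
            current = candidate      # the word still fits on this line
        else:
            lines.append(current)    # close the line and start a new one
            current = word
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    for line in wrap_words("I loooove my pet animal"):
        print(line)
```

Python's standard library also has textwrap.wrap, which does a similar kind of greedy wrapping, so the hand-written loop is only there to show the mechanics.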

First, reverse the string.
SELECT REVERSE(strField) FROM DUAL;
Then calculate the length, i = length(strField).
Then find the first space at or after the middle of the reversed string:
j := INSTR(REVERSE(strField), ' ', i / 2)
Finally, split at position i - j (it may need +/- 1; test it).
DEMO
WITH parameter (id, strField) as (
select 101, 'I loooove my pet animal' from dual union all
select 102, '1992 was a great year for - make something up here as needed' from dual union all
select 103, 'You are Supercalifragilisticexpialidocious' from dual
), prepare (id, rev, len, middle) as (
SELECT id, reverse(strField), length(strField), length(strField) / 2
FROM parameter
)
SELECT p.*, l.*,
SUBSTR(strField, 1, len - INSTR(rev, ' ', middle)) as first,
SUBSTR(strField, len - INSTR(rev, ' ', middle) + 2, len) as second
FROM parameter p
JOIN prepare l
ON p.id = l.id

how to get out string oracle regex

I have the following string, and I am trying to extract 1111111 and 33333333333 without the | characters:
SELECT regexp_substr('7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|','*[|]*[|][0-9]*')FROM dual

Using REGEXP_REPLACE may be a bit simpler;
SELECT REGEXP_REPLACE('7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|',
'^([^|]*[|]){1}([^|]*).*$', '\2') FROM dual;
> 1111111
SELECT REGEXP_REPLACE('7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|',
'^([^|]*[|]){3}([^|]*).*$', '\2') FROM dual;
> 33333333333
You can choose the column by changing the number of pipes to skip, that is, the count in braces ({1} or {3} above).
A simple SQLfiddle to test with.
A short explanation of the regexp:
^([^|]*[|]){3} -- Matches 3 groups of {optional non-pipe characters}{pipe} from the start of the string
([^|]*) -- Matches the next field (the one we want)
.*$ -- Matches the rest of the string
What we want is the second parenthesized group, so we replace the whole string with the back reference \2.
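If it helps to experiment with the pattern outside Oracle, the same "skip N pipes, capture the next field" idea can be sketched in Python. This uses Python's re module, not Oracle's regex engine, and nth_field is a helper name of my own; it captures the field directly instead of replacing the whole string with a back reference.

```python
import re


def nth_field(line, n):
    """Return the n-th pipe-delimited field (1-based), or None if it is absent."""
    # Skip n-1 "field|" groups from the start, then capture the next run of
    # non-pipe characters (which may be empty).
    m = re.match(r"^(?:[^|]*\|){%d}([^|]*)" % (n - 1), line)
    return m.group(1) if m else None


if __name__ == "__main__":
    s = "7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|"
    print(nth_field(s, 2))
    print(nth_field(s, 4))
```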

Because the "|" separators are always present, it is simpler to extract fields with plain substring functions than with regular expressions.
Just find the positions of the corresponding separators in the source string and extract the content between them:
with test_data as (
select
'7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|ABC' as s,
8 as field_number -- test 1, 3, 8, 10 and 16
from dual
)
select
field_number,
substr(
s,
decode( field_number,
1,1,
instr(s,'|',1,field_number - 1) + 1
),
(
decode( instr(s,'|',1,field_number),
0, length(s)+ 1,
instr(s,'|',1,field_number)
)
-
decode( field_number,
1, 1,
instr(s,'|',1,field_number - 1) + 1
)
)
) as field_value
from
test_data
SQLFiddle
This variant works with empty fields, non-numeric fields and so on.
A possible simplification is to append additional separators to the start and the end of the string:
with test_data as (
select
(
'|' ||
'7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|ABC' ||
'|'
) as s, -- additional separators appended before and after original string
10 as field_number -- test 1, 3, 8, 10 and 16
from dual
)
select
field_number,
substr(
s,
instr(s, '|', 1, field_number) + 1,
(
instr(s, '|', 1, field_number + 1)
-
(instr(s, '|', 1, field_number) + 1)
)
) as field_value
from
test_data
;
SQLFiddle
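The padded-separator trick may be easier to see in a few lines of Python. This is a rough sketch under the same assumption as the query (fields are numbered from 1); field_by_position is a hypothetical helper name, and str.find plays the role of INSTR.

```python
def field_by_position(s, field_number, sep="|"):
    """Return the field_number-th field (1-based) by locating separators."""
    padded = sep + s + sep  # sentinels: every field now sits between two separators
    start = -1
    for _ in range(field_number):
        # Position of the field_number-th separator, found one at a time.
        start = padded.find(sep, start + 1)
        if start == -1:
            return None
    end = padded.find(sep, start + 1)
    if end == -1:
        return None  # field_number is past the last field
    return padded[start + 1:end]


if __name__ == "__main__":
    s = "7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|ABC"
    for n in (1, 3, 8, 10, 16):
        print(n, repr(field_by_position(s, n)))
```

In everyday Python one would simply write s.split("|")[n - 1]; the loop above is only meant to mirror the position arithmetic of the SQL.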

Efficient SQL statement to encode text for NLP locutions

Background
A locution is a noun-phrase consisting of at least two words, such as:
black olive
hot pepper sauce
rose finn apple potato
The separate words black and olive are an adjective (black - JJ) and a noun (olive - NN). However, humans know that black olive is a noun (that differentiates it from, say, a green olive).
The problem here is how to most efficiently transform a list of normalized ingredient names (such as the list above) into a specific format for a natural language processor (NLP).
Example Data
The table can be created as follows:
CREATE TABLE ingredient_name (
id bigserial NOT NULL, -- Uniquely identifies the ingredient.
label character varying(30) NOT NULL
);
The following SQL statements show actual database records:
insert into ingredient_name (label) values ('alfalfa sprout');
insert into ingredient_name (label) values ('almond extract');
insert into ingredient_name (label) values ('concentrated apple juice');
insert into ingredient_name (label) values ('black-eyed pea');
insert into ingredient_name (label) values ('rose finn apple potato');
Data Format
The general format is:
lexeme1_lexeme2_<lexemeN> lexeme1_lexeme2_lexemeN NN
Given the list of words above, the NLP expects:
black_<olive> black_olive NN
hot_pepper_<sauce> hot_pepper_sauce NN
rose_finn_apple_<potato> rose_finn_apple_potato NN
The database has a table (recipe.ingredient_name) and a column (label). The labels are normalized (e.g., single space, lower case).
SQL Statement
The code that produces the expected results:
CREATE OR REPLACE VIEW ingredient_locutions_vw AS
SELECT
t.id,
-- Replace spaces with underscores
translate( t.prefix, ' ', '_' )
|| '<' || t.suffix || '>' || ' ' ||
translate( t.label, ' ', '_' )
|| ' NN' AS locution_nlp
FROM (
SELECT
id,
-- Ingredient name
label,
-- All words except the last word
left( label, abs( strpos( reverse( label ), ' ' ) - length( label ) ) + 1 ) AS prefix,
-- Just the last word
substr( label,
length( label ) - strpos( reverse( label ), ' ' ) + 2
) AS suffix
FROM
ingredient_name
WHERE
-- Limit set to ingredient names having at least one space
strpos( label, ' ' ) > 0
) AS t;
Question
What is a more efficient (or elegant) way to split the prefix (all words except the last) and suffix (just the last word) in the above code?
The system is PostgreSQL 9.1.
Thank you!

CREATE OR REPLACE VIEW ingredient_locutions_vw AS
SELECT
t.id,
format('%s_<%s> %s NN',
array_to_string(t.prefix, '_'),
t.suffix,
array_to_string(t.label, '_')
) AS locution_nlp
FROM (
SELECT
id,
-- Ingredient name
label,
-- All words except the last word
label[1:array_length(label, 1) - 1] AS prefix,
-- Just the last word
label[array_length(label, 1)] AS suffix
FROM (
select id, string_to_array(label, ' ') as label
from ingredient_name
) s
WHERE
-- Limit set to ingredient names having at least one space
array_length(label, 1) > 1
) AS t;
select * from ingredient_locutions_vw ;
id | locution_nlp
----+--------------------------------------------------------
1 | alfalfa_<sprout> alfalfa_sprout NN
2 | almond_<extract> almond_extract NN
3 | concentrated_apple_<juice> concentrated_apple_juice NN
4 | black-eyed_<pea> black-eyed_pea NN
5 | rose_finn_apple_<potato> rose_finn_apple_potato NN
(5 rows)
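As a quick cross-check of the target format, the same array-based split can be sketched in Python. This is not part of the PostgreSQL solution; the function name locution is my own, and it assumes labels are already normalized (single spaces, lower case) as stated in the question.

```python
def locution(label):
    """Format a multi-word label as 'w1_w2_<wN> w1_w2_wN NN', or None for one word."""
    words = label.split(" ")
    if len(words) < 2:
        return None  # mirrors the WHERE clause: skip single-word names
    prefix, last = words[:-1], words[-1]
    return f"{'_'.join(prefix)}_<{last}> {'_'.join(words)} NN"


if __name__ == "__main__":
    for name in ["black olive", "hot pepper sauce", "rose finn apple potato"]:
        print(locution(name))
```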

convert comma-separated string-pairs with regex

I have a comma-separated list of first- and lastnames which I need to convert to SQL
(whitespace exists after the comma):
joe, cool
alice, parker
etc.
should become:
( firstname ='joe' and lastname = 'cool' ) or
( firstname ='alice' and lastname = 'parker' )
How can I achieve this with a regular expression?

In Perl you can do this:
s/(\S+),\s*(\S+)/( firstname ='\1' and lastname = '\2' )/
From the command line:
> perl -pe "s/(\S+),\s*(\S+)/( firstname ='\1' and lastname = '\2' )/" input.txt
Input:
joe, cool
alice, parker
Output:
( firstname ='joe' and lastname = 'cool' )
( firstname ='alice' and lastname = 'parker' )
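The Perl one-liner converts each line but does not add the " or" between clauses that the question asks for. A possible way to do both, sketched in Python with the same pattern, is shown below. The function name names_to_sql is made up for this example, and doubling single quotes is only a basic precaution I added; bound parameters would be the safer route if the values reach a real query.

```python
import re


def names_to_sql(text):
    """Turn 'first, last' lines into OR-joined SQL conditions."""
    clauses = []
    for line in text.splitlines():
        m = re.fullmatch(r"\s*(\S+),\s*(\S+)\s*", line)
        if not m:
            continue  # skip blank or malformed lines
        # Double any single quote so the generated literal stays well-formed.
        first, last = (part.replace("'", "''") for part in m.groups())
        clauses.append(f"( firstname ='{first}' and lastname = '{last}' )")
    return " or\n".join(clauses)


if __name__ == "__main__":
    print(names_to_sql("joe, cool\nalice, parker"))
```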
