I have 2 tables.

Table one (the source) has 2 columns:

id       Name
101,102  abc,cde

Table two (the target):

id    Name
101   abc
102   cde

How can I achieve this in Informatica?
I am assuming you always have one source and two targets as you described, and the data will always look like your example.

After the Source Qualifier, you can use an Expression transformation to create the output ports below (note the -1 on the third SUBSTR argument, so the comma itself is not picked up):

idcol1   = SUBSTR(id, 1, INSTR(id, ',') - 1)    -- picks up 101 from the concatenated id column
idcol2   = SUBSTR(id, INSTR(id, ',') + 1)       -- picks up 102 from the concatenated id column
namecol1 = SUBSTR(name, 1, INSTR(name, ',') - 1)
namecol2 = SUBSTR(name, INSTR(name, ',') + 1)

Then use a Router with the conditions below:

group1 = idcol1 = '101'
group2 = idcol2 = '102'

Then use two Sorters after the Router, and connect the rows of each group to its target:

router_group1 --> linked to Target1
router_group2 --> linked to Target2
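The Expression-port logic above can be sketched in plain Python for illustration (this is not Informatica syntax; the sample row mirrors the source table in the question):

```python
# Sample source row, mirroring the table in the question.
row = {"id": "101,102", "Name": "abc,cde"}

def split_pair(value):
    # Mirrors SUBSTR(value, 1, INSTR(value, ',') - 1)
    # and     SUBSTR(value, INSTR(value, ',') + 1)
    comma = value.index(",")
    return value[:comma], value[comma + 1:]

idcol1, idcol2 = split_pair(row["id"])
namecol1, namecol2 = split_pair(row["Name"])

# The Router then sends one pair to each target.
target1 = (idcol1, namecol1)   # ("101", "abc")
target2 = (idcol2, namecol2)   # ("102", "cde")
```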
I'm trying to recode a list of values using Pyspark to create a new column. I've set my mapping up with nested dictionaries, but can't get the mapping syntax figured out. The original data has several string values that need to get recoded to a new value, then I want to give the column a new name. The original column values will get grouped several different ways to create different new columns.
The df will have several thousand columns, so I need the code to be as efficient as possible.
I have a different scenario with a 1-1 mapping where I was able to create my expression with:
expr = [create_map([lit(x) for x in chain(*values.items())])[orig_df[key]].cast(IntegerType()).alias('new_name')
        for key, values in my_dict.items() if key in orig_df.columns]
I just can't figure out the syntax for mapping the many to one.
Here's what I've tried:
grouping_dict = {'orig_col_n': {'new_col_n_a': {'20': ['011','012','013'], '30': ['014','015','016']},
                                'new_col_n_b': {'25': ['011','013','015'], '35': ['012','014','016']}}}
expr = [ f.when(f.col(key) == f.lit(old_val),f.lit(new_value))
.cast(IntegerType())
.alias(new_var_name)
for key, new_var_names_dict in grouping_dict.items()
for new_var_name,mapping_dict in new_var_names_dict.items()
for new_value,old_value_list in mapping_dict.items()
for old_val in old_value_list
if key in original_df.columns]
new_df = original_df.select(*expr)
This expression isn't quite right; it creates multiple columns with the same name as it loops through the values that need to be mapped.
Any suggestions for restructuring my dictionary or how to fix my syntax would be greatly appreciated!
Desired mapping:
orig_col_n new_col_n_a new_col_n_b
011 20 25
012 20 35
013 20 25
014 30 35
015 30 25
016 30 35
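One way to avoid the duplicate columns is to first invert each inner {new_value: [old_values]} dict into a flat {old_value: new_value} lookup, so every output column is built from exactly one mapping; each flat dict could then feed the same create_map pattern as the 1-1 case. A plain-Python sketch of the inversion (grouping_dict copied from above):

```python
grouping_dict = {'orig_col_n': {'new_col_n_a': {'20': ['011', '012', '013'],
                                                '30': ['014', '015', '016']},
                                'new_col_n_b': {'25': ['011', '013', '015'],
                                                '35': ['012', '014', '016']}}}

# Invert {new_value: [old_values]} into {old_value: new_value}, one dict per new column.
flat = {
    (orig_col, new_col): {old: int(new) for new, olds in mapping.items() for old in olds}
    for orig_col, new_cols in grouping_dict.items()
    for new_col, mapping in new_cols.items()
}

# Each flat dict now defines one output column, with no duplicate column names.
# flat[('orig_col_n', 'new_col_n_a')] == {'011': 20, '012': 20, '013': 20,
#                                         '014': 30, '015': 30, '016': 30}
```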
I have a Spatialite Database and I've imported OSM Data into this database.
With the following query I get all motorways:
SELECT * FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
I use GLOB '*A [0-9]*' here, because in Germany every Autobahn begins with A, followed by a number (like A 73).
There is a column called other_tags with information about the motorway part:
"bdouble"=>"yes","hazmat"=>"designated","lanes"=>"2","maxspeed"=>"none","oneway"=>"yes","ref"=>"A 73","width"=>"7"
If you look closer there is the part "ref"=>"A 73".
I want to extract the A 73 as the name for the motorway.
How can I do this in sqlite?
If the format doesn't change, i.e. you can expect the other_tags field to look like %"ref"=>"A 73","width"=>"7"%, then you can use instr and substr (note that 8 is the length of "ref"=>"):
SELECT substr(other_tags,
instr(other_tags, '"ref"=>"') + 8,
instr(other_tags, '","width"') - 8 - instr(other_tags, '"ref"=>"')) name
FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
The result will be
name
A 73
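A quick way to sanity-check the substr/instr arithmetic is an in-memory SQLite database from Python (a sketch; the table layout and the sample other_tags value are assumptions based on the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lines (highway TEXT, other_tags TEXT)")
conn.execute("""INSERT INTO lines VALUES ('motorway',
    '"bdouble"=>"yes","maxspeed"=>"none","ref"=>"A 73","width"=>"7"')""")

row = conn.execute("""
    SELECT substr(other_tags,
                  instr(other_tags, '"ref"=>"') + 8,
                  instr(other_tags, '","width"') - 8 - instr(other_tags, '"ref"=>"')) AS name
    FROM lines
    WHERE other_tags GLOB '*A [0-9]*'
      AND highway = 'motorway'
""").fetchone()
print(row[0])  # -> A 73
```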
Check with the following conditions:

other_tags LIKE 'A%' -- begins with 'A'
abs(substr(other_tags, 3, 2)) <> 0.0 -- the two characters from the 3rd position form a number
length(other_tags) = 4 -- the length of other_tags is 4

So here is how your query should be:
SELECT *
FROM lines
WHERE other_tags LIKE 'A%'
AND abs(substr(other_tags, 3,2)) <> 0.0
AND length(other_tags) = 4
AND highway = 'motorway'
I have a frequency table of words which looks like below
> head(freqWords)
employees work bose people company
1879 1804 1405 971 959
employee
100
> tail(freqWords)
youll younggood yoyo ytd yuorself zeal
1 1 1 1 1 1
I want to create another frequency table which combines similar words and adds up their frequencies.
In the above example, the new table should contain both employee and employees as one element with a frequency of 1979. For example:
> head(newTable)
employee,employees work bose people
1979 1804 1405 971
company
959
I know how to find similar words (using adist, stringdist) but I am unable to create the frequency table. For instance, I can use the following to get a list of similar words:
words <- names(freqWords)
lapply(words, function(x) words[stringdist(x, words) < 3])
and the following to get a list of similar phrases of two words:
lapply(words, function(x) words[stringdist2(x, words) < 3])
where stringdist2 is the following:
stringdist2 <- function(word1, word2){
min(stringdist(word1, word2),
stringdist(word1, gsub(word2,
pattern = "(.*) (.*)",
repl="\\2,\\1")))
}
I do not have any punctuation/special symbols in my words/phrases. (I do not know a lot of R; I created stringdist2 by tweaking an implementation of adist2 I found here, but I do not fully understand how pattern and repl work.)
So I need help creating the new frequency table.
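The grouping logic itself (greedily merging words whose edit distance is below 3 and summing their counts) is language-independent; here is a minimal Python sketch of the idea, using the counts from head(freqWords) above. The threshold and the greedy merge order (highest count first) are assumptions:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

freq = {"employees": 1879, "work": 1804, "bose": 1405,
        "people": 971, "company": 959, "employee": 100}

merged = {}  # tuple of grouped words -> summed frequency
for word, count in sorted(freq.items(), key=lambda kv: -kv[1]):
    for key in list(merged):
        # Merge into the first existing group containing a near-identical word.
        if any(edit_distance(word, w) < 3 for w in key):
            merged[key + (word,)] = merged.pop(key) + count
            break
    else:
        merged[(word,)] = count

# merged[('employees', 'employee')] == 1979
```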
I am new to Apache Pig. I want to split and flatten the following input so that, for each product, I get all the users who viewed it.
My Input :(UserId, ProductId)
12345 123456,23456,987653
23456 23456,123456,234567
34567 234567,765678,987653
My Required Output:(ProductId, UserId)
123456 12345
123456 23456
23456 12345
23456 23456
987653 12345
987653 34567
234567 23456
234567 34567
765678 34567
My Pig script:
a = load '/home/hadoopuser/ips' using PigStorage('\t') as (key:chararray, val:chararray);
b = foreach a generate key as ky1, FLATTEN(TOKENIZE(val)) as vl1;
c = group b by vl1;
d = foreach c generate group as vl2, $1 as ky2;
e = foreach d generate vl2, BagToString(ky2) as kyy;
f = foreach e generate vl2 as vl3,FLATTEN(STRSPLIT(kyy,'_')) as ky3;
g = foreach f generate vl3, FLATTEN(TOKENIZE(ky3)) as kk1;
dump g;
I got the following output, which eliminates the repeated (duplicate) values:
(23456,12345)
(123456,12345)
(234567,23456)
(765678,34567)
(987653,12345)
I don't know how to solve this problem. Can anyone help me solve it, and is there a simpler way to do this?
Well, the second line of your code does exactly what you want; it simply displays the customer first and the product second. Put the FLATTEN first and then the key part:
a = load '/home/hadoopuser/ips' using PigStorage('\t') as (key:chararray, val:chararray);
b = foreach a generate FLATTEN(TOKENIZE(val)) as ProductId, key as UserId;
dump b;
(123456,12345)
(23456,12345)
(987653,12345)
(23456,23456)
(123456,23456)
(234567,23456)
(234567,34567)
(765678,34567)
(987653,34567)
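For intuition, the corrected foreach is just tokenize-and-flatten over the value column; here is the same transformation in plain Python (input rows copied from the question):

```python
rows = [("12345", "123456,23456,987653"),
        ("23456", "23456,123456,234567"),
        ("34567", "234567,765678,987653")]

# Equivalent of: foreach a generate FLATTEN(TOKENIZE(val)) as ProductId, key as UserId;
pairs = [(product, user) for user, val in rows for product in val.split(",")]

for pair in pairs:
    print(pair)   # nine (ProductId, UserId) rows, no rows lost
```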
As to why you are getting only one result per ProductId with your current code: you are grouping by ProductId, which gives you one row per distinct ProductId, with a bag that contains all of the customers who viewed that product. Then you convert that bag to one long string joined by _, only to turn it back into the same bag as before:
d = foreach c generate group as vl2, $1 as ky2;
e = foreach d generate vl2, BagToString(ky2) as kyy;
f = foreach e generate vl2 as vl3,FLATTEN(STRSPLIT(kyy,'_')) as ky3;
The BagToString UDF converts a bag to a string, joining the values in the bag with a custom delimiter, which defaults to _. In the next line, however, you split it by _ again, producing the same bag as before. But you also FLATTEN that bag, so instead of a row with the ProductId and a bag, you get a row with several fields: the first is the ProductId, and the rest are all the customers that viewed the product:
Before FLATTEN:
(23456,{(23456,23456),(12345,23456)})
(123456,{(23456,123456),(12345,123456)})
(234567,{(34567,234567),(23456,234567)})
(765678,{(34567,765678)})
(987653,{(34567,987653),(12345,987653)})
After FLATTEN:
(23456,23456,23456,12345,23456)
(123456,23456,123456,12345,123456)
(234567,34567,234567,23456,234567)
(765678,34567,765678)
(987653,34567,987653,12345,987653)
And here lies the error: you have only one row for each product, with one field per customer. The last foreach selects the first field (the product) and the second (the first of the customers), discarding the rest of the customers on each row.
I have a table temp that has a column named REMARKS.
Create script
Create table temp (id number,remarks varchar2(2000));
Insert script
Insert into temp values (1,'NAME =GAURAV Amount=981 Phone_number =98932324 Active Flag =Y');
Insert into temp values (2,'NAME =ROHAN Amount=984 Phone_number =98932333 Active Flag =N');
Now, I want to fetch the corresponding values of NAME, Amount, Phone_number and Active Flag from the remarks column of the table.
I thought of using regular expressions, but I am not comfortable with them.
I tried substr and instr to fetch the name from the remarks column, but to fetch all four I would need to write PL/SQL. Can we achieve this with a regular expression?
Can I get output (a CURSOR) like:
id Name Amount phone_number Active flag
------------------------------------------
1 Gaurav 981 98932324 Y
2 Rohan 984 98932333 N
-------------------------------------------
Thanks for your help
You can use something like:
SQL> select regexp_replace(remarks, '.*NAME *=([^ ]*).*', '\1') name,
2 regexp_replace(remarks, '.*Amount *=([^ ]*).*', '\1') amount,
3 regexp_replace(remarks, '.*Phone_number *=([^ ]*).*', '\1') ph_number,
4 regexp_replace(remarks, '.*Active Flag *=([^ ]*).*', '\1') flag
5 from temp;
NAME AMOUNT PH_NUMBER FLAG
-------------------- -------------------- -------------------- --------------------
GAURAV 981 98932324 Y
ROHAN                984                  98932333             N
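The same capture-group idea can be checked quickly with Python's re module (sample strings copied from the inserts above; the helper name grab is just for illustration):

```python
import re

remarks = [
    "NAME =GAURAV Amount=981 Phone_number =98932324 Active Flag =Y",
    "NAME =ROHAN Amount=984 Phone_number =98932333 Active Flag =N",
]

def parse(r):
    # Each pattern mirrors one regexp_replace above:
    # take the run of non-space characters after 'label ... ='.
    def grab(label):
        return re.search(label + r" *=([^ ]*)", r).group(1)
    return grab("NAME"), grab("Amount"), grab("Phone_number"), grab("Active Flag")

for r in remarks:
    print(parse(r))
# ('GAURAV', '981', '98932324', 'Y')
# ('ROHAN', '984', '98932333', 'N')
```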