Alternative for Normalizer Transformation? - Informatica

I have to implement Normalizer transformation logic without using the Normalizer transformation in Informatica PowerCenter. May I know all the ways I can implement this, with or without knowing the number of groups in the input data?

Knowing the number of groups, it's possible to implement this with an Expression transformation and variable ports, but it's tedious.
Without knowing the number of groups it won't be possible, I'm afraid, or it will require creating plenty of additional columns and hoping they will be enough (but with no certainty).

Try implementing it in SQL: use the SQL below as an override in the Source Qualifier and try loading into the target.
The Normalizer converts columns to rows in Informatica. I have implemented the scenario using Oracle SQL's CONNECT BY LEVEL:
CREATE TABLE MARKS
(
NAME VARCHAR2(50),
SUBJ1 NUMBER,
SUBJ2 NUMBER,
SUBJ3 NUMBER);
INSERT INTO MARKS VALUES ('RAJ',70,80,90);
INSERT INTO MARKS VALUES ('RAM',70,85,75);
INSERT INTO MARKS VALUES ('RAVI',90,80,90);
INSERT INTO MARKS VALUES ('RANI',80,80,95);
INSERT INTO MARKS VALUES ('RAGHU',73,82,90);
COMMIT;
WITH DATA AS (SELECT LEVEL L FROM DUAL CONNECT BY LEVEL <= 3)
SELECT NAME,
       DECODE(L, 1, 'PHYSICS', 2, 'CHEMISTRY', 3, 'MATHS') AS SUBJECT,
       DECODE(L, 1, SUBJ1, 2, SUBJ2, 3, SUBJ3) AS MARKS
FROM MARKS, DATA
ORDER BY NAME;
NAME SUBJECT MARKS
raghu physics 73
raghu maths 90
raghu chemistry 82
raj physics 70
raj chemistry 80
raj maths 90
ram physics 70
ram maths 75
ram chemistry 85
rani chemistry 80
rani physics 80
rani maths 95
ravi maths 90
ravi physics 90
ravi chemistry 80
The solution might not be dynamic but it serves your purpose.
Thanks
Raj

You can use a Java transformation; I have used Java transformations in many instances to achieve the results a Normalizer would provide.

Related

How to derive AVISIT in SAS using different conditions mentioned in specs

Hello, I am new to SAS and have to derive AVISIT. The specs are:
1. If there are 2 or more assessments falling in the same visit window, the non-missing assessment closest to the midpoint will be used.
2. If there are 2 assessments equidistant from the midpoint, then create a new observation using the average of AVAL of the 2 observations.
Midpoints are like:
If AVISIT = 'month1' then the midpoint is 28;
if AVISIT = 'month 3' then the midpoint is 84, and so on.
So I am looking for help deriving the above 2 conditions using the midpoints.
Thank you in advance.
I do not have any idea how to program it.
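Not SAS, but the two selection rules themselves are easy to sketch in code. Here is a toy illustration in R (the other examples on this page use R; the data set, column names, and midpoints below are made up) of what the logic has to do within each subject and visit window:

# toy ADaM-style records: one row per assessment (made-up values)
adlb <- data.frame(
  subject = c(1, 1, 1, 2, 2),
  avisit  = rep("month1", 5),
  ady     = c(25, 30, 40, 20, 36),       # study day of each assessment
  aval    = c(5.0, 5.2, 6.1, 4.8, 5.6)
)
midpoint <- c(month1 = 28, month3 = 84)  # per-visit midpoints from the specs

pick_one <- function(df) {
  df   <- df[!is.na(df$aval), ]                      # only non-missing assessments
  d    <- abs(df$ady - midpoint[df$avisit[1]])
  best <- which(d == min(d))
  if (length(best) == 1) return(df[best, ])          # rule 1: closest to the midpoint
  out <- df[best[1], ]                               # rule 2: equidistant -> average AVAL
  out$aval <- mean(df$aval[best])
  out
}

result <- do.call(rbind,
                  lapply(split(adlb, list(adlb$subject, adlb$avisit), drop = TRUE), pick_one))

The same per-group logic would translate to a DATA step with BY-group processing (or PROC SQL) in SAS.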

Using Logistic Regression For Timeseries Data in Amazon SageMaker

For a project I am working on, which uses annual financial reports data (of multiple categories) from companies which have been successful or gone bust/into liquidation, I previously created a (fairly well-performing) model on AWS SageMaker using a multiple linear regression algorithm (specifically, the AWS stock algorithm for logistic regression/classification problems, the 'Linear Learner' algorithm).
This model just produces a simple "company is in good health" or "company looks like it will go bust" binary prediction, based on one set of annual data fed in; e.g.
query input: {data:[{
"Gross Revenue": -4000,
"Balance Sheet": 10000,
"Creditors": 4000,
"Debts": 1000000
}]}
inference output: "in good health" / "in bad health"
I trained this model by just ignoring what year each company's values were from and piling in all of the annual financial reports data (i.e. one year's financial data for one company = one input line) for the training, along with the label of "good" or "bad" - a good company is one which has existed for a while but hasn't gone bust, a bad company is one which was found to have eventually gone bust; e.g.:
label   Gross Revenue   Balance Sheet   Creditors   Debts
good    10000           20000           0           0
bad     0               5               100         10000
bad     20000           0               4           100000000
I hence used these multiple features (gross revenue, balance sheet...) along with the label (good/bad) in my training input, to create my first model.
I would like to use the same features as before as input (gross revenue, balance sheet, etc.) but over multiple years; e.g. take the values from 2020 & 2019 and use these (along with the eventual company status of "good" or "bad") as the singular input for my new model. However, I'm unsure of the following:
Is this an inappropriate use of logistic regression machine learning, i.e. is there a more suitable algorithm I should consider?
Is it fine, or terribly wrong, to try and just use the same technique as before but combine the data for both years into one input line, like:
label   Gross Revenue(2019)   Balance Sheet(2019)   Creditors(2019)   Debts(2019)   Gross Revenue(2020)   Balance Sheet(2020)   Creditors(2020)   Debts(2020)
good    10000                 20000                 0                 0             30000                 10000                 40                500
bad     100                   50                    200               50000         100                   5                     100               10000
bad     5000                  0                     2000              800000        2000                  0                     4                 100000000
I would personally expect that a company which has gotten worse over time (i.e. its finances are worse in 2020 than in 2019) should be more likely to be found "bad"/likely to go bust, so I would hope that, if I feed in data like the above example (i.e. an earlier year's data comes before a later year's data on an input line), my training job ends up creating a model which gives greater weighting to the earlier years' data when making predictions.
Any advice or tips would be greatly appreciated - I'm pretty new to machine learning and would like to learn more
UPDATE:
Using Long Short-Term Memory recurrent neural networks (LSTM RNNs) is one potential route I think I could try, but this seems to commonly be used with multivariate data over many dates; my data only has 2 or 3 dates' worth of multivariate data per company. I would want to try using the data I have for all the companies, over the few dates' worth of data there are, in training.
I once developed a so-called genetic time series in R. I used a genetic algorithm which sorted out the best solutions from multivariate data, which were fitted on a VAR in differences or a VECM. Your data seems more macroeconomic or financial than user-centric, and VAR or VECM seems appropriate. (Surely it is possible to treat time-series data in the same way so that we can use LSTM or other approaches, but these are very common.) However, I do not know if VAR in differences or VECM works with binary classification labels. Perhaps if you calculated a metric outcome, which you later encode to a categorical label (or label it first as categorical), then VAR or VECM may also be appropriate.
However, you may add all the yearly data points into one data point per firm to forecast its survival, but you would lose a lot of insight. If you are interested in time series ML, which works a little differently than for neural networks or elastic net (which could also be used with time series), let me know and we can work something out, or I'll paste you some sources.
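To make the "one row per firm" layout concrete, here is a minimal reshape sketch in base R (toy firms, years, and figures; the column names are only illustrative):

# toy panel: one row per firm and year (made-up figures)
panel <- data.frame(
  firm          = c("A", "A", "B", "B"),
  year          = c(2019, 2020, 2019, 2020),
  gross_revenue = c(10000, 30000, 100, 100),
  debts         = c(0, 500, 50000, 10000)
)

# pivot to one row per firm: gross_revenue.2019, debts.2019, gross_revenue.2020, debts.2020
wide <- reshape(panel, idvar = "firm", timevar = "year",
                v.names = c("gross_revenue", "debts"), direction = "wide")

This is exactly the wide table from the question; as noted, the explicit time ordering disappears, which is the insight you lose.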
Summary:
1.) It is possible to use LSTM or elastic net (time points may be dummies or treated as a cross-sectional panel), or you can use VAR in differences and VECM with a slightly different outcome variable.
2.) It is possible, but you will lose information over time.
All the best,
Patrick

Excel IF function and in between values, but only if

I have values for postage, pricing, and postage service. I have two choices of postage service (express and eco); the service price depends on the weight, but the service depends on the item price (express for items over £5, eco under).
Service, by product price (A2): <5 = eco; >5 = express
Service price (C2), by weight (B2):
<=1000 gr = £2 eco or £3 express
1001-1250 gr = £5 eco or £6 express
1251-5000 gr = £9 eco or £11 express
Cells A2 and B2 always contain a value. I need a formula for C2 to display the service price calculated from the weight, but if the item is over £5 it must display the express service price, otherwise the eco price.
I have tried:
=IF(AND(OR(B2<=1000),A2<5),2,IF(AND(OR(B2>1000,B2<=1250),A2<5),5,IF(AND(OR(B2>1250,B2<=5000),A2<5),9)))
=IF(AND(OR(B2<=1000),A2<5),2)+IF(AND(OR(B2>=1001,B2<=1250),A2<5),5)+IF(AND(OR(B2>2000),A2<5),9)
I didn't even start adding A2>5, because nothing works anyway! I have tried many more, but no luck.
I would appreciate any help because I'm stuck and have run out of options :(
Thanks!
There are a couple of ways to accomplish this. The preferred method is to build a small cross-reference table for your surcharges and use the VLOOKUP function to return the values.
However, this question was about hard-coded values in a conditional statement, so I will address that with a LOOKUP function and array constants.
The standard formula in C2 is,
=LOOKUP(B2,{0,1001,1251},{2,5,9})+SIGN(A2)*LOOKUP(B2,{0,1001,1251},{1,1,2})
Fill down as necessary.
In the example image (not reproduced here), custom number formats were used on columns A and B ([Color9]\Exp\r\e\s\s - [$£-809]#,##0.00;;[Color10]\Eco - [$£-809]#,##0.00; and 0\g\r_)). Weights >5000 in column B trigger conditional formatting in column C that displays 'too heavy'.

Extract all numeric values in Columns in R

I am currently working with a large data set in R. I have a column called "Offers". This column contains text describing 'promotions' that companies offer on their products, and I am trying to extract numeric values from these. While for most cases I am able to do this using a combination of regex and functions in R packages, I am unable to deal with a couple of specific cases of text, shown below. I would really appreciate any help on these.
"Buying this ensures Savings of $50. Online Credit worth 35$ is also available. So buy soon!"
1a. I want to get both the numeric values out but in 2 different columns. How do I go about that?
1b. For another problem that I have to solve, I only need to take the value associated with the credit. It is always the case that for texts like above, the second numeric value in the text, if it exists, is the one associated with the credit.
"Get 50% off on your 3 night stay along with 25 credits, offer available on 3 December 2016"
(How should I only take the value associated with credits?)
Note: Efficiency would be important as well because I am dealing with about 14 million rows.
I have tried looking online for a solution but have not found anything very satisfactory.
I am not 100% sure about what you want but this may help you.
A <- "do 50% and whatever 23"
# start positions of every digit run in A; the widths sit in the "match.length" attribute
B <- gregexpr("\\d+", A)[[1]]
# cut each run back out of the string with substr()
firstNum  <- substr(A, B[1], B[1] + attr(B, "match.length")[1] - 1)
secondNum <- substr(A, B[2], B[2] + attr(B, "match.length")[2] - 1)
Hope this helps.
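Building on that, a base-R sketch for the two cases in the question (regmatches() keeps it vectorised, which matters at 14 million rows; the credit patterns below are assumptions about how consistent the wording is):

offers <- c(
  "Buying this ensures Savings of $50. Online Credit worth 35$ is also available. So buy soon!",
  "Get 50% off on your 3 night stay along with 25 credits, offer available on 3 December 2016"
)

# 1a. every number per row, then the first two spread into separate columns
nums <- regmatches(offers, gregexpr("[0-9]+", offers))
first_value  <- as.numeric(sapply(nums, `[`, 1))
second_value <- as.numeric(sapply(nums, `[`, 2))

# 1b. value tied to the credit: a number after "Credit worth" or directly before "credits"
m <- regexpr("(?i)(?<=credit worth )[0-9]+|[0-9]+(?= credits?)", offers, perl = TRUE)
credit_value <- rep(NA_real_, length(offers))
credit_value[m > 0] <- as.numeric(regmatches(offers, m))

If the wording varies a lot, the stringr/stringi equivalents (e.g. stringr::str_extract_all()) work the same way and are often quicker on millions of rows.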

What is the best data mining method for vehicle search?

I'm trying to build a search engine that goes through online vehicle classifieds such as Oodle, eBay Motors, and Craigslist. I also have a large database of standard vehicle names and their specifications. What I would like to do is, for each record that I find on a classifieds site, determine exactly which vehicle model and style it is (from my database). For example, a standard name for a Ford truck in my db is:
2003 Ford F150.
However, on classified sites people might refer to it as "2003 Ford F 150" or "2003 Ford f-150" or "03 Ford truck 150". Is there an effective data mining/text classification algorithm to normalize these texts to the standard name above?
You could use the Levenshtein distance to match the found string against your database records.
Another (probably better) idea is to tokenize the strings and use a term vector model for the vehicle names. This way you can use cosine similarity to find relevant matches.
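As a rough illustration of both suggestions, here is a small base-R sketch (toy catalogue entries; token-level Jaccard stands in for a full term-vector/cosine setup):

catalogue <- c("2003 Ford F150", "2004 Ford F150", "2003 Ford Focus")
listing   <- "2003 Ford f-150"

# edit distance: adist() is base R Levenshtein; smaller means a closer match
adist(listing, catalogue, ignore.case = TRUE)

# token overlap: drop punctuation, lower-case, split on spaces, then Jaccard similarity
tokens  <- function(x) unique(strsplit(gsub("[^a-z0-9 ]", "", tolower(x)), " +")[[1]])
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
sapply(catalogue, function(name) jaccard(tokens(listing), tokens(name)))

How you normalise model codes (F150, F-150, F 150) before comparing matters as much as the metric itself.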
If you're going to develop a whole search engine intended to scale in both usage and size, you will need something robust to support your queries.
If you're going to use edit distance, Bed-trees provide a good alternative for your index structure. Another good approach, depending on the size of your dataset, is to use a Levenshtein automaton. Levenshtein automata are also great at providing auto-complete functionality, which you may need since you're developing a search engine.
An alternative to edit distance is to use n-grams combined with the Jaccard index. For this approach you can use MinHash + LSH. You can also turn Jaccard into a distance metric (1 - Jaccard index), which respects the triangle inequality and can therefore be used in a metric tree such as a VP-tree.
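For the n-gram + Jaccard route specifically, the stringdist package (assuming it is available) exposes this directly; a quick sketch:

library(stringdist)  # CRAN package with q-gram, Jaccard, cosine and edit-distance metrics
# Jaccard distance over character 2-grams: 0 = identical q-gram sets, 1 = disjoint
stringdist(tolower("2003 Ford f-150"),
           tolower(c("2003 Ford F150", "2004 Ford F150", "2003 Ford Focus")),
           method = "jaccard", q = 2)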
One of these approaches will certainly help you.