I've set up a Google Sheets workbook that synthesizes data from a few different sources via manual input, IMPORTHTML and IMPORTRANGE. Once the data is populated, I'm using INDEX MATCH to filter and compare the information and to RANK each data set.
Since I have multiple data inputs, I'm running into a persistent issue of names not being written exactly the same between sources, even though they're the same person. First names are the primary culprit (i.e. Mary Lou vs Marylou vs Mary-Lou vs Mary Louise) but some last names with special symbols (umlauts, accents, tildes) are also causing errors. When Sheets can't recognize a match, the INDEX MATCH and RANK functions both break down.
I'm wondering how to better unify the data automatically so my Sheet understands that each occurrence is actually the same person (or "value").
Since you can't edit the results of an IMPORTHTML directly, I've set up "helper columns" and used functions like TRIM and SPLIT to try and fix instances as I go, but it seems like there must be a simpler path.
It feels like IFS could work but I can't figure how to integrate it. Also thinking this may require a script, which I'm just beginning to study.
Here's a simplified example of what I'm trying to achieve and the corresponding errors: Sample Spreadsheet
The first tab is attempting to pull and RANK data from tabs 2 and 3. Sample formulas from the Summary tab, row 3 (Amelia Rose):
Cell B3: =INDEX('Q1 Sales'!B:B, MATCH(A3,'Q1 Sales'!A:A,0))
Cell C3: =RANK(B3,$B$2:B,1)
Cell D3: =INDEX('Q2 Sales'!B:B, MATCH(A3,'Q2 Sales'!A:A,0))
Cell E3: =RANK(D3,$D$2:D,1)
I'd be grateful for any insight on how to best index 'Q2Sales'!B3 as the correct value for 'Summary'!D3. Thanks in advance - the thoughtful answers on Stack Overflow have gotten me this far!
to counter every possible scenario do it like this:
=ARRAYFORMULA(IFERROR(VLOOKUP(LOWER(REGEXREPLACE(A2:A, "-|\s", )),
{REGEXEXTRACT(LOWER(REGEXREPLACE('Q2 Sales'!A2:A, "-|\s", )),
TEXTJOIN("|", 1, LOWER(REGEXREPLACE(A2:A, "-|\s", )))), 'Q2 Sales'!B2:B}, 2, 0)))
Can I have a one-line regex code that matches the values between a pipe line "|" independent of the number if items between the pipe lines. E.g. I have the following regex:
^(.*?)\|(.*?)\|(.*?)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)$
which works only if I have 12 items. How can I make the same work for e.g. 6 items as well?
([^|]+)+
This is the pattern I've used in the past for that purpose. It matches 1 or more group that does not contain the pipe delimeter.
For Adobe Classification Rule Builder (CRB), there is no way to write a regex that will match an arbitrary number of your pattern and push them to $n capture group. Most regex engines do not allow for this, though some languages offer certain ways to more or less effectively do this as returned arrays or whatever. But CRB doesn't offer that sort of thing.
But, it's mostly pointless to want this anyways, since there's nothing upstream or downstream that really dynamically/automatically accommodates this sort of thing anyways.
For example, there's no way in the CRB interface to dynamically populate the output value with an arbitary $1$2$3[$n..] value, nor is there a way to dynamically generate an arbitrary number of rules in the rule set.
In addition, Adobe Analytics (AA) does not offer arbitrary on-the-fly classification column generation anyways (unless you want to write a script using the Classification API, but you can't say the same for CRBs anyways).
For example if you have
s.eVar1='foo1|foo2';
And you want to classify this into 2 classification columns/reports, you have to go and create them in the classification interface. And then let's say your next value sent in is:
s.eVar1='foo1|foo2|foo3';
Well AA does not automatically create a new classification level for you; you have to go in and add a 3rd, and so on.
So overall, even though it is not possible to return an arbitrary number of captured groups $n in a CRB, there isn't really a reason you need to.
Perhaps it would help if you explain what you are actually trying to do overall? For example, what report(s) do you expect to see?
One common reason I see this sort of "wish" come up for is when someone wants to track stuff like header or breadcrumb navigation links that have an arbitrary depth to them. So they push e.g. a breadcrumb
Home > Electronics > Computers > Monitors > LED Monitors
...or whatever to an eVar (but pipe delimited, based on your question), and then they want to break this up into classified columns.
And the problem is, it could be an arbitrary length. But as mentioned, setting up classifications and rules for them doesn't really accommodate this sort of thing.
Usually the best practice for a scenario like this is to to look at the raw data and see how many levels represents the bulk of your data, on average. For example if you look at your raw eVar report and see even though upwards of like 5 or 6 levels in the values can be found, but you can also see that most of values on average are between 1-3 levels, then you should create 4 classification columns. The first 3 classifications represent the first 3 levels, and the 4th one will have everything else.
So going back to the example value:
Home|Electronics|Computers|Monitors|LED Monitors
You can have:
Level1 => Home
Level2 => Electronics
Level3 => Computers
Level4+ => Monitors|LED Monitors
Then you setup a CRB with 4 rules, one for each of the levels. And you'd use the same regex in all 4 rule rows:
^([^|]+)(?:\|([^|]+))?(?:\|([^|]+))?(?:\|(.+))?
Which will return the following captured groups to use in the CRB outputs:
$1 => Home
$2 => Electronics
$3 => Computers
$4 => Monitors|LED Monitors
Yeah, this isn't the same as having a classification column for every possible length, but it is more practical, because when it comes to analytics, you shouldn't really try to be too granular about things in the first place.
But if you absolutely need to have something for every possible amount of delimited values, you will need to find out what the max possible is and make that many, hard coded.
Or as an alternative to classifications, consider one of the following alternatives:
Use a list prop
Use a list variable (e.g. list1)
Use a Merchandising eVar (product variable syntax)
This isn't exactly the same thing, and they each have their caveats, but you didn't provide details for what you are ultimately trying to get out of the reports, so this may or may not be something you can work with.
Well anyways, hopefully some of this is food for thought for you.
I have 10 records in a file and I don't need the first and the last line, I need data from 2 through 9 lines only.
Can anybody provide me solution on it?
Source file example:
SIDE,MTYPE,PAGENO,CONTIND,SUBACC,SIGN,DEAL QUANTITY,SECURITY,SOURCE SYSTEM,TODATE,SETTLEMENT DATE,REFERENCE 4,REFERENCE 2,TRADE DATE,ACCRUED INTEREST,ACCRUED INTEREST CURRENCY,XAMT1,XAMT2,XAMT3,XAMT4,XAMT5
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,00107020052_CSA,107020052,6/12/2013,0,USD,,0,250000,0,200000
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,00115020036_CSA,115020036,6/12/2013,0,USD,,0,250000,0,220000
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,00301410097_CSA,301410097,6/12/2013,0,USD,,0,226725,0,226725
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,00030020088_CSA,30020088,6/12/2013,0,USD,,0,250000,0,250000
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,00106410075_CSA,106410075,6/12/2013,0,USD,,0,250000,0,260000
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,00116510010_CSA,116510010,6/12/2013,300000,USD,,0,250000,0,260000
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,00177020015_CSA,177020015,6/12/2013,0,USD,,0,250000,0,270000
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,00189110093_CSA,189110093,6/12/2013,0,USD,,0,250000,0,280000
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,00272220015_CSA,272220015,6/12/2013,0,USD,,0,250000,0,10000
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,SLAVE1,189110093,6/12/2013,0,USD,,0,250000,0,250000
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,SLAVE2,272220015,6/12/2013,0,USD,,0,250000,0,1000
L,536,1,M,L_CAMS_COLATAGREEMENT,C,0,AGREEMENTS,CAMS_AGREEMENT,6/12/2013,6/12/2013,SLAVE3,301410097,6/12/2013,0,USD,,0,250000,0,200
Not an expert in Informatica but I found the following answer on the web, hope it should be useful for you.
Step 1: You have to assign row numbers to each record. Generate the row numbers using the expression transformation. Create a DUMMY output port in the same expression transformation and assign 1 to that port. So that, the DUMMY output port always return 1 for each row.
Step 2: Pass the output of expression transformation to aggregator and do not specify any group by condition. Create an output port Ototalrecords in the aggregator and assign Ocount port to it. The aggregator will return the last row by default. The output of aggregator contains the DUMMY port which has value 1 and Ototal_records port which has the value of total number of records in the source.
Step 3: Pass the output of expression transformation, aggregator transformation to joiner transformation and join on the DUMMY port. In the joiner transformation check the property sorted input, then only you can connect both expression and aggregator to joiner transformation.
Step 4: In the last step use router transformation. In the router transformation create two output groups.
In the first group, the condition should be Ocount = 1 and connect the corresponding output group to table A. In the second group, the condition should be Ocount = Ototalrecords and connect the corresponding output group to table B. The output of default group should be connected to table C, which will contain all records except first & last.
Source: http://www.queryhome.com/47922/informatica-how-to-get-middle-data-from-a-file
From informatica prospective, There are multiple way to do this.
if data in flat file, the sqloverride would not work. you can create two pipe line, first line read from source and use aggregator get the count and assign to a mapping variable such v_total. second pipe line you use another variable v_count, initialize to 0 , call count function. create filter transformation, filter out v_count=1 and (v_total-v_count)=1, the rest will be load to target.
Seems a lot of code wasted making the mapping unnecessarilly complex when a simple unix command such as
head -9 (currentfilename) (newinputfilename)
Will do the job. Then all you need do is use the new file for your mapping (if you even need it anymore)
For a windows server equivalent see https://serverfault.com/questions/490841/how-to-display-the-first-n-lines-of-a-command-output-in-windows-the-equivalent