Lookup On Table Using Wildcards: - informatica

I have this requirement, where my lookup table is:
Fi Sortorder
003 Transit
002 Serial
001 Transit,Serial
0** Transit,Account
*** Account
I want to look up on this table on FI field from source, so here problem is how will I search for the wild cards, in Source I will always get 3 digits in FI col, so lookin up on exact 3 digit is fine but how to lookup for last 2 digits and extract its sortorder col.
If I use wildcards then there might be possibility that wrong sortorder will be taken so I have to lookup with more precise information. Also there are 100s of combination and entries its just an example. I don't need full solution with syntax and all, I am just looking for approach. I will post my solution soon which I am trying right now if you guys think of anything kindly let me know.

If you need to do a lookup using less number of characters, just use a substring. And create a lookup for each number of characters you need. Eg. if you need to match using last 2 digits (TSQL):
use select right(Fi, 2) from yourtable as lookup SQL override
do a substring on your port in expression transformation to get last 2 characters from Fi port
use if for lookup condition
Just make sure to set the order and Lookup policy on multiple match according to your needs.

Related

How to use Calculated Join in Tableau using 1-M relationship and regex_extract

Question 1 - Is Tableau able to use multiple results from from a single line in a REGEXP using the global variable to compare against another table during a Join operation? If no, question 2 is null. If yes...
Question 2 - I'm attempting to join two data sources in Tableau using a regexp in a calculated join because the left table has 1 value in each cell (ie. 64826) and the right table has 4 possible matches in each cell (ie. 00000|00000|21678|64826).
The problem is that my regex stops looking after it finds 1 match (the first of 4 values), and the global variable /g has the opposite effect I expected and eliminates all matches.
I've tried calculated joins on the Data Source tab. I've also tried separating those 4 values into their own columns in worksheets using
regexp_extract_nth. In both cases, regex stops looking after the first result. A Left Join seems to work somewhat, while an Outer Join returns nothing.
REGEXP_EXTRACT([Event Number],'(\d{5})')
REGEXP_EXTRACT_NTH([Event Number],'(?!0{5})(\d{5})',1)
With these examples, regex would match a NULL with the left table even though 64826 is in the right table. I expect the calculated join to return all possible matches from the right set, so there'd be a match on 21678 and on 64826, duplicating rows in the right table like so...
21678 - 00000|00000|21678|64826
64826 - 00000|00000|21678|64826
45245 - 45106|45245|00000|00000
45106 - 45106|45245|00000|00000
Your original expression is just fine, we might want to make sure that we are sending a right command in Tableau, which I'm not so sure, maybe let's try an expression similar to:
\b([^0]....)\b
even just for testing, then maybe let's modify our commands to:
REGEXP_EXTRACT([Event Number], '\b([^0]....)\b')
or:
REGEXP_EXTRACT_NTH([Event Number], '\b([^0]....)\b', 1)
to see what happens. I'm assuming that the desired numbers won't be starting with 0.
Please see the demo here
Reference

regex to match max of input numeric string

I have incoming input entries.
Like these
750
1500
1
100
25
55
And There is an lookup table like given below
25
7
5
75
So when I will receive my first entry, in this case its 750. So this will look up into lookup entry table will try to match with a string which having max match from left to right.
So for 750, max match case would be 75.
I was wondering, Is that possible if we could write a regex for this kind of scenario. Because if I choose using startsWith java function It can get me output of 7 as well.
As input entries will be coming from text file one by one and all lookup entries present in file different text file.
I'm using java language.
May I know how can I write a regex for this flavor..?
This doesn't seem like a regex problem at first, but you could actually solve it with a regex, and the result would be pretty efficient.
A regex for your example lookup table would be:
/^(75?|5|25)/
This will do what you want, and it will avoid the repeated searches of a naive "check every one" approach.
The regex would get complicated,though, as your lookup table grew. Adding a couple of terms to your lookup table:
25
7
5
75
750
72
We now have:
/^(7(50?|2)?|5|25)/
This is obviously going to get complicated quickly. The trick would be programmatically constructing the appropriate regex for arbitrary data--not a trivial problem, but not insurmountable either.
That said, this would be an..umm...unusual thing to implement in production code.
I would be hesitant to do so.
In most cases, I would simply do this:
Find all the strings that match.
Find the longest one.
(?: 25 | 5 | 75? )
There is free software that automatically makes full blown regex trie for you.
Just put the output regex into a text file and load it instead.
If your values don't change very much, this is a very fast way to do a lookup.
If it does change, generate another one.
Whats good about a full blown trie here, is that it never takes more than 8
steps to match.
The one I just did http://imgur.com/a/zwMhL
App screenshot
Even a 175,000 Word Dictionary takes no more than 8 steps.
Internally the app initially makes a ternary tree from the input
then converts it into a full blown regex trie.

Google Sheets Pattern Matching/RegEx for COUNTIF

The documentation for pattern matching for Google Sheets has not been helpful. I've been reading and searching for a while now and can't find this particular issue. Maybe I'm having a hard time finding the correct terms to search for but here is the problem:
I have several numbers (part numbers) that follow this format: ##-####
Categories can be defined by the part numbers, i.e. 50-03## would be one product category, and the remaining 2 digits are specific for a model.
I've been trying to run this:
=countif(E9:E13,"50-03[123][012]*")
(E9:E13 contains the part number formatted as text. If I format it any other way, the values show up screwed up because Google Sheets thinks I'm writing a date or trying to do arithmetic.)
This returns 0 every time, unless I were to change to:
=countif(E9:E13,"50-03*")
So it seems like wildcards work, but pattern matching does not?
As you identified and Wiktor mentioned COUNTIF only supports wildcards.
There are many ways to do what you want though, to name but 2
=ArrayFormula(SUM(--REGEXMATCH(E9:E13, "50-03[123][012]*")))
=COUNTA(FILTER(E9:E13, REGEXMATCH(E9:E13, "50-03[123][012]*")))
This is a really big hammer for a problem like yours, but you can use QUERY to do something like this:
=QUERY(E9:E13, "select count(E) where E matches '50-03[123][012]' label count(E) ''")
The label bit is to prevent QUERY from adding an automatic header to the count() column.
The nice thing about this approach is that you can pull in other columns, too. Say that over in column H, you have a number of orders for each part. Then, you can take two cells and show both the count of parts and the sum of orders:
=QUERY(E9:H13, "select count(E), sum(H) where E matches '50-03[123][012]' label count(E) '', sum(H) ''")
I routinely find this question on $searchEngine and fail to notice that I linked another question with a similar problem and other relevant answers.

What's the format of a CUID in SAP BI/BO?

I'm interfacing with an SAP BI/BO server and some webservices require an input id, called "CUID" (Cluser Unique ID). for example, there's a webservice getObjectById which reqires a cuid as input.
I'm trying to make my code more robust by checking if the cuid entered by a user makes sense, but I can't find a regular expression that properly describes how a CUID looks like. There is a lot of documentation for GUID, but they're not the same. Below are some examples of CUID's found in our system and it looks like they are well-formatted but I'm not sure:
AQA9CNo0cXNLt6sZp5Uc5P0
AXiYjXk_6cFEo.esdGgGy_w
AZKmxuHgAgRJiducy2fqmv0
ASSn7jfNPCFDm12sv3muJwU
AUmKm2AjdPRMl.b8rf5ILww
AaratKz7EDFIgZEeI06o8Fc
ATjdf_MjcR9Anm6DgSJzxJ8
AaYbXdzZ.8FGh5Lr1R1TRVM
Afda1n_SWgxKkvU8wl3mEBw
AaZBfzy_S8FBvQKY4h9Pj64
AcfqoHIzrSFCnhDLMH854Qc
AZkMAQWkGkZDoDrKhKH9pDU
AaVI1zfn8gRJqFUHCa64cjg
My guess would: start with capital A, then add 22 random characters in range [0-9A-Za-Z_.]. but perhaps it could be the A means something else and after awhile it would be using B...
Is anyone familiar with this type of id's and how they are formatted?
(quick side question: do I need to escape the "dot" in the square brackets like this \. to get the actual dot character?)
The definition of the different ID types and their purpose is described in the SAP KB note 1285103: What are the different types of IDs used in the BusinessObjects Enterprise repository?
However, I couldn't find any description of the format of the CUID. I wouldn't make any assumptions about it though, other than the fact that it's alphanumeric.
I did a quick query on a repository and found CUIDs consisting up to 35 characters and beginning with the letters A,B,C,F,k and M.
If you look at the repository database, more specifically the table CMS_INFOOBJECTS7, you'll notice that the column SI_CUID is defined as a VARCHAR2, 56 bytes in size (Oracle RDBMS).
Thus, a valid regex expression to match these would be [a-zA-Z0-9\._]+.

What would be the best (runtime performance) application or pattern or code or library for matching string patterns

I have been trying to figure out a decent way of matching string patterns. I will try my best to provide as much information as I can regarding what I am trying to do.
The simplest thougt is that there are some specified patterns and we want to know which of these patterns match completely or partially to a given request. The specified patterns hardly change. The amount of requests are about 10K per day but the results have to pe provided ASAP and thus runtime performance is the highest priority.
I have been thinking of using Assembly Compiled Regular Expression in C# for this, but I am not sure if I am headed in the right direction.
Scenario:
Data File:
Let's assume that data is provided as an XML request in a known schema format. It has anywehere between 5-20 rows of data. Each row has 10-30 columns. Each of the columns also can only have data in a pre-defined pattern. For example:
A1- Will be "3 digits" followed by a
"." follwed by "2 digits" -
[0-9]{3}.[0-9]{2}
A2- Will be "1
character" follwoed by "digits" -
[A-Z][0-9]{4}
The sample would be something like:
<Data>
<R1>
<A1>123.45</A1>
<A2>A5567</A2>
<A4>456EV</A4>
<An>xxx</An>
</R1>
</Data>
Rule File:
Rule ID A1 A2
1001 [0-9]{3}.45 A55[0-8]{2}
2002 12[0-9].55 [X-Z][0-9]{4}
3055 [0-9]{3}.45 [X-Z][0-9]{4}
Rule Location - I am planning to store the Rule IDs in some sort of bit mask.
So the rule IDs are then listed as location on a string
Rule ID Location (from left to right)
1001 1
2002 2
3055 3
Pattern file: (This is not the final structure, but just a thought)
Column Pattern Rule Location
A1 [0-9]{3}.45 101
A1 12[0-9].55 010
A2 A55[0-8]{2} 100
A2 [X-Z][0-9]{4} 011
Now let's assume that SOMEHOW (not sure how I am going to limit the search to save time) I run the regex and make sure that A1 column is only matched aginst A1 patterns and A2 column against A2 patterns. I would end up with the follwoing reults for "Rule Location"
Column Pattern Rule Location
A1 [0-9]{3}.45 101
A2 A55[0-8]{2} 100
Doing AND on each of the loctions
gives me the location 1 - 1001 -
Complete match.
Doing XOR on each of the loctions
gives me the location 3 - 3055 -
Partial match. (I am purposely not
doing an OR, because that would have
returned 1001 and 3055 as the result
which would be wrong for partial
match)
The final reulsts I am looking for are:
1001 - Complete Match
3055 - Partial Match
Start Edit_1: Explanation on Matching results
Complete Match - This occurs when all
of the patterns in given Rule are
matched.
Partial Match - This ocurrs when NOT
all of the patterns in given Rule are
matched, but atleast one pattern
matches.
Example Complete Match (AND):
Rule ID 1001 matched for A1(101) and A2 (100). If you look at the first charcter in 101 and 100 it is "1". When you do an AND - 1 AND 1 the result is 1. Thus position 1 i.e. 1001 is a Complete Match.
Exmple Partial Match (XOR):
Rule ID 3055 matched for A1(101). If you look at the last character in 101 and 100 it is "1" and "0". When you do an XOR - 1 XOR 0 the result is 1. Thus position 3 i.e. 3055 is Partial Match.
End Edit_1
Input:
The data will be provided in some sort of XML request. It can be one big request with 100K Data nodes or 100K requests with one data node only.
Rules:
The matching values have to be intially saved as some sort of pattern to make it easier to write and edit. Let's assume that there are approximately 100K rules.
Output:
Need to know which rules matched completely and partially.
Preferences:
I would prefer doing as much of the coding as I can in C#. However if there is a major performance boost, I can use a different language.
The "Input" and "Output" are my requirements, how I manage to get the "Output" does not matter. It has to be fast, lets say each Data node has to be processed in approximately 1 second.
Questions:
Are there any existing pattern or
framewroks to do this?
Is using Regex the right path
specifically Assembly Compiled
Regex?
If I end up using Regex how can I
specify for A1 patterns to only
match against A1 column?
If I do specify rule locations in a
bit type pattern. How do I process
ANDs and XORs when it grows to be
100K charcter long?
I am looking for any suggestions or options that I should consider.
Thanks..
The regular expression API only tells you when they fully matched, not when they partially matched. What you therefore need is some variation on a regular expression API that lets you try to match multiple regular expressions at once, and at the end can tell you which matched fully, and which partially matched. Ideally one that lets you precompile a set of patterns so you can avoid compilation at runtime.
If you had that then you could match your A1 patterns against the AI column, A2 columns against the A2 pattern, and so on. Then do something with the list of partial and full regular expressions.
The bad news is that I don't know of any software out there that implements this.
The good news is that the strategy described in http://swtch.com/~rsc/regexp/regexp1.html should be able to implement this. In particular the State sets can be extended to have information about your current state in multiple patterns at the same time. This extended set of State sets will result in a more complex state diagram (because you're tracking more stuff), and a more complex return at the end (you're returning a set of State sets), but runtime won't be changed a bit, whether you're matching one pattern or 50.