Mapping logic for converting a pattern in excel or database - regex

I have thousands of mapping pattern that I need to convert. Attached is the image which shows the source value which I need to convert into Target values.
Few of the rules that I am able to decipher are below: -
If the dash ('-') is just before any value then it needs to convert to pipe ('|')
If there are multiple dashes in the middle e.g. 4 dashes, then they would converted to 4 pipes and 3 dashes as shown in second example in order to show 3 empty fields (|-|-|-|)
If there are multiple dashes at the end e.g. 3 dashes, then they would be converted to 3 pipes and 3 dashes as shown in first example in order to show 3 empty fields without pipe at the end (|-|-|-)
There would never be a dash at the beginning
There are in total 8 values. Each |-| is considered a empty value. Each field is separated by pipe.
I am looking at ways to convert the source values into intended target values using any software possible.

This quick user defined function appears to meet your requirements without regular expressions.
Function mapSource(str As String)
Dim tmp As Variant, i As Long
'strip leading hyphens
Do While Left(str, 1) = Chr(45) And Len(str) > 0: str = Right(str, Len(str) - 1): Loop
'split str to a maximum of 8 array elements
tmp = Split(str, Chr(45), 8)
'preserve an array of 8 elements
ReDim Preserve tmp(7)
'replace empty array elements with hyphens
For i = LBound(tmp) To UBound(tmp)
If tmp(i) = vbNullString Then tmp(i) = Chr(45)
Next i
'rejoin array into str
str = Join(tmp, Chr(124))
'output mapped str
mapSource = str
End Function

Related

Optional parts of regex pattern in vba

I am trying to build regex pattern for the text like that
numb contra: 1.29151306 number mafo: 66662308
numb contra 1.30789668number mafo 60.046483
numb contra/ 1.29154056 number mafo: 666692638
numb contra 137459625
mafo: 666692638
mafo: 666692638 numb contra/ 1.29154056
Here's the pattern I could build
contra?.\s+?(\d+\.?\d+)(.+mafo.?\s+(\d+\.?\d+))?
It works fine for all the lines except the last one. How can I implement all the possibilities to include the last line too?
Please have a look at this link
https://regex101.com/r/pSThAU/1
All is OK as for contra but not as for mafo
I think the key here is to make your regexp do less and your vba do more. What I think I see here is either the word 'mafo' or 'contra' and a number following. Don't know what order or whether each is present or how many times. So you can scan each of your strings for ALL occurrences with a regexp like this:
(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)
Then process it with some VBA code like this that I created in Excel:
Sub BreakItUp()
Dim rg As RegExp, scanned As MatchCollection, eachMatch As Match, i As Long, col As Long
Set rg = New RegExp
rg.Pattern = "(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)"
rg.IgnoreCase = True
rg.Global = True
i = 1
Do While (Not IsEmpty(ActiveSheet.Cells(i, 1).Value))
Set scanned = rg.Execute(ActiveSheet.Cells(i, 1).Value)
col = 2
For Each eachMatch In scanned
ActiveSheet.Cells(i, col).Value = eachMatch.SubMatches(0) & eachMatch.SubMatches(1)
ActiveSheet.Cells(i, col + 1).Value2 = "'" & eachMatch.SubMatches(2)
col = col + 2
Next eachMatch
i = i + 1
Loop
End Sub
That MatchCollection object will get one item for each Match that occurs and the subMatches array contains each capturing group. You should be able write your own logic within this processing loop to interpret what was extracted. When I ran it on your data it created all the fields in blue:
Notice I added a line to your data that had two contra entries and one mafo and it found all the occurrences. You should be able to modify this to interpret the meanings.

Extract words based on a pattern in Groovy [duplicate]

Is there a nicer/shorter/better way of performing the following:
filename = "AA_BB_CC_DD_EE_FF.xyz"
parts = filename.split("_")
packageName = "${parts[0]}_${parts[1]}_${parts[2]}_${parts[3]}"
//packageName == "AA_BB_CC_DD"
The format remains constant (6 parts, _ separator) but some of the values and lengths of AA,BB are variable.
You can do the same thing by just programming the "joining" part differently:
The following result in the same thing as packageName:
filename.split('_')[0..3].join('_')
It just uses a range to slice the array, and .join to concatenate with a delimiter.
As the separator char between the "segments" in the source filename and in the
result is the same (_), you don't need to split the filename and join the parts again.
Your task can be done with a single regex:
def result = filename.find(/([A-Z0-9]+_){3}[A-Z0-9]+/)

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
and so, I want to retrieved, 45.72643,4.91203 and also hereanotherdata
As they are both between characters : and /.
I tried with this syntax in a easier string where there is only 1 time the pattern,
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern happens once. If I use it in string containing multiples times the pattern, I get all the string between the first : and the last /.
How can I mention that the pattern will occur multiple time, and how can I retrieve it?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should will break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array index (as it contains text before a ":"), loop the rest of the array and regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then Ignore 1, and extract everything before the "/" in 2 and 3.
As John Mawer and Adriaan mentioned, strsplit is a good place to start with. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you do it with strsplit twice, you can know where the ':' starts :
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(#(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now B has cells that start with ':', and has two cells in each cell that contain '/' also. You can extract it with checking where B has more than one cell, and take the first of each of them:
C=cellfun(#(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in 16b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Spotfire: count the number of a certain character in a string

I am trying to add a new calculated column that counts the number of semi colons in a string and adds one to it. So the column i have contains a bunch of aliases and I need to know how many for each row.
For example,
A; B; C; D
So basically this means there are 4 aliases (3 semi colons + 1)
Need to do this for over 2 million rows. Help please!
Basic idea is to subtract length of your string without ; characters from it's original length:
len([columnName])-len(Substitute([columnName],";",""))+1
Here it is with a regular expression:
Len(RXReplace([Column 1], "(?!;).", "", "gis"))+1
RXReplace takes as arguments:
The string you are wanting to work on (in this case it is on Column 1)
The regular expression you want to use (here it is (?!;). )
What you want to replace matches with (blank in this situation so
that everything that matches the regex is removed)
Finally a parameter saying how you want it to work (we are passing
in gis which means replace all matches not just the first, ignore case, replace newlines)
We wrap this in a Len which gives us the amount of semicolons since that is all that is left and finally we add 1 to it to get the final result.
You can read more about the regular expression here: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx but in a nutshell it says match everything that isn't a semi colon.
You can read more about RXReplace and Len here: https://docs.tibco.com/pub/spotfire/6.0.0-november-2013/userguide-webhelp/ncfe/ncfe_text_functions.htm

Extract root, month letter-year and yellow key from a Bloomberg futures ticker

A Bloomberg futures ticker usually looks like:
MCDZ3 Curcny
where the root is MCD, the month letter and year is Z3 and the 'yellow key' is Curcny.
Note that the root can be of variable length, 2-4 letters or 1 letter and 1 whitespace (e.g. S H4 Comdty).
The letter-year allows only the letter listed below in expr and can have two digit years.
Finally the yellow key can be one of several security type strings but I am interested in (Curncy|Equity|Index|Comdty) only.
In Matlab I have the following regular expression
expr = '[FGHJKMNQUVXZ]\d{1,2} ';
[rootyk, monthyear] = regexpi(bbergtickers, expr,'split','match','once');
where
rootyk{:}
ans =
'mcd' 'curncy'
and
monthyear =
'z3 '
I don't want to match the ' ' (space) in the monthyear. How can I do?
Assuming there are no leading or trailing whitespaces and only upcase letters in the root, this should work:
^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$
You've got root in the first group, letter-year in the second, yellow key in the third.
I don't know Matlab nor whether it covers Perl Compatible Regex. If it fails, try e.g. with instead of \s. Also, drop the ^...$ if you'd like to extract from a bigger source text.
The expression you're feeding regexpi with contains a space and is used as a pattern for 'match'. This is why the matched monthyear string also has a space1.
If you want to keep it simple and let regexpi do the work for you (instead of postprocessing its output), try a different approach and capture tokens instead of matching, and ignore the intermediate space:
%// <$1><----------$2---------> <$3>
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2}) (.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
You can also simplify the expression to a more genereic '(.+)(\w{1}\d{1,2})\s+(.+)', if you wish.
Example
bbergtickers = 'MCDZ3 Curncy';
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2})\s+(.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
The result is:
tickinfo =
'MCD'
'Z3'
'Curncy'
1 This expression is also used as a delimiter for 'split'. Removing the trailing space from it won't help, as it will reappear in the rootyk output instead.
Assuming you just want to get rid of the leading and or trailing spaces at the edge, there is a very simple command for that:
monthyear = trim(monthyear)
For removing all spaces, you can do:
monthyear(isspace(monthyear))=[]
Here is a completely different approach, basically this searches the letter before your year number:
s = 'MCDZ3 Curcny'
p = regexp(s,'\d')
s(min(p)
s(min(p)-1:max(p))