Spotfire: count the number of a certain character in a string - regex

I am trying to add a new calculated column that counts the number of semi colons in a string and adds one to it. So the column i have contains a bunch of aliases and I need to know how many for each row.
For example,
A; B; C; D
So basically this means there are 4 aliases (3 semi colons + 1)
Need to do this for over 2 million rows. Help please!

Basic idea is to subtract length of your string without ; characters from it's original length:
len([columnName])-len(Substitute([columnName],";",""))+1

Here it is with a regular expression:
Len(RXReplace([Column 1], "(?!;).", "", "gis"))+1
RXReplace takes as arguments:
The string you are wanting to work on (in this case it is on Column 1)
The regular expression you want to use (here it is (?!;). )
What you want to replace matches with (blank in this situation so
that everything that matches the regex is removed)
Finally a parameter saying how you want it to work (we are passing
in gis which means replace all matches not just the first, ignore case, replace newlines)
We wrap this in a Len which gives us the amount of semicolons since that is all that is left and finally we add 1 to it to get the final result.
You can read more about the regular expression here: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx but in a nutshell it says match everything that isn't a semi colon.
You can read more about RXReplace and Len here: https://docs.tibco.com/pub/spotfire/6.0.0-november-2013/userguide-webhelp/ncfe/ncfe_text_functions.htm

Related

Remove character from string in informatica

I have the following string values .
string 1 = test123
string 2 = stri567
now i need to remove 123,567 from the string. which means i need only first four character from the strings.(test,stri)
Have you tried reg replace to replace all numbers from a string -
REG_REPLACE( inp_col, '[0-9]','')
If you need first four characters from the column then use substring
SUBSTR(column_name, 0, 4)
Please make sure that you need first four digits or just strings from the columns.
In case if you need strings from the values, please use Koushik Roy's solution

Get just X number of strings from a comma separated cel

having real trouble finding a succinct solution to this simple problem. Currently I have cells which contain many comma separated items. I just want the first 5.
ie. cell A1 =
text, another string, something else, here's another one, guess what another string here, and another, hello i'm another string, another string etc, etc, etccccc
and I'm trying to grab just the first 5 strings.
Beyond that, I wonder if I can incorporate a formula such as =LEN(A1)>20
Currently I do this with numerous; =IFERROR(INDEX( SPLIT(C31,","),1)) then =IFERROR(INDEX( SPLIT(C31,","),2)) etc. then run the LEN formula above.
Is there a simpler solution? Thanks so much.
Try,
=split(replace(A1, find("|", SUBSTITUTE(A1, ", ", "|", 5)), len(A1), ""), ", ", false)
For Excel, with data in A1, in B1 enter:
=TRIM(MID(SUBSTITUTE($A1,",",REPT(" ",999)),COLUMNS($A:A)*999-998,999))
and copy across:
To get all 5 substrings into a single cell, use:
=LEFT(A1,FIND(CHAR(1),SUBSTITUTE(A1,",",CHAR(1),5))-1)
=ARRAY_CONSTRAIN(SPLIT(A1,","),1,5)
=REGEXEXTRACT(A1,"((?:.*?,){5})")
=REGEXEXTRACT(A1,REPT("(.*?),",5))
SPLIT to split by delimiter
ARRAY_CONSTRAIN to constrain the array
REGEX1 to extract 5 comma separated values
. Any character
.*?, Any character repeated unlimited number of times (? as little as possible) followed by a ,
{5} Quantifier
REPT to repeat strings

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
and so, I want to retrieved, 45.72643,4.91203 and also hereanotherdata
As they are both between characters : and /.
I tried with this syntax in a easier string where there is only 1 time the pattern,
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern happens once. If I use it in string containing multiples times the pattern, I get all the string between the first : and the last /.
How can I mention that the pattern will occur multiple time, and how can I retrieve it?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should will break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array index (as it contains text before a ":"), loop the rest of the array and regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then Ignore 1, and extract everything before the "/" in 2 and 3.
As John Mawer and Adriaan mentioned, strsplit is a good place to start with. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you do it with strsplit twice, you can know where the ':' starts :
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(#(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now B has cells that start with ':', and has two cells in each cell that contain '/' also. You can extract it with checking where B has more than one cell, and take the first of each of them:
C=cellfun(#(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in 16b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

How to find a specific string followed by a number, with any number of characters between? [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
I'm trying to write a regex for the following pattern:
[MyLiteralString][0 or more characters without restriction][at least 1 digit]
I thought this should do it:
(theColumnName)[\s\S]*[\d]+
As it looks for the literal string theColumnName, followed by any number of characters (whitespace or otherwise), and then at least one digit. But this matches more than I want, as you can see here:
https://www.regex101.com/r/HBsst1/1
(EDIT) Second set of more complex data - https://www.regex101.com/r/h7PCv7/1
Using the sample data in that link, I want the regex to identify the two occurrences of theColumnName] VARCHAR(10) and nothing more.
I have 300+ sql scripts which containing create statements for every type of database object: procedures, tables, triggers, indexes, functions -- everything. Because of that, I can't be too strict with my regex.
A stored procedure's file might include text like LEFT(theColumnName, 10) which I want to identify.
A create table statement would be like theColumnName VARCHAR(12).
So it needs to be very flexible as the number(s) isn't always the same. Sometimes it's 10, sometimes it's 12, sometimes it's 51 -- all kinds of different numbers.
Basically, I'm looking for the regex equivalent of this C# code:
//Get file data
string[] lines = File.ReadAllLines(filePath);
//Let's assume the first line contains 'theColumnName'
int theColumnNameIndex = lines[0].IndexOf("theColumnName");
if (theColumnNameIndex >= 0)
{
//Get the text proceeding 'theColumnName'
string temp = lines[0].Remove(0, theColumnNameIndex + "theColumnNameIndex".Length;
//Iterate over our substring
foreach (char c in temp)
{
if (Char.IsDigit(c))
//do a thing
}
}
(theColumnName).*?[\d]+
That'll make it stop capturing after the first number it sees.
The difference between * and *? is about greediness vs. laziness. .*\d for example would match abcd12ad4 in abcd12ad4, whereas .*?\d would have its first match as abcd1. Check out this page for more info.
Btw, if you don't want to match newlines, use a . (period) instead of [\s\S]

Regular expressions, what a trouble!

I need your kind help to resolve this question.
I state that I am not able to use regolar expressions with Oracle PL/SQL, but I promise that I'll study them ASAP!
Please suppose you have a table with a column called MY_COLUMN of type VARCHAR2(4000).
This colums is populated as follows:
Description of first no.;00123457;Description of 2nd number;91399399119;Third Descr.;13456
You can see that the strings are composed by couple of numbers (which may begin with zero), and strings (containing all alphanumeric characters, and also dot, ', /, \, and so on):
Description1;Number1;Description2;Number2;Description3;Number3;......;DescriptionN;NumberN
Of course, N is not known, this means that the number of couples for every record can vary from record to record.
In every couple the first element is always the number (which may begin with zero, I repeat), and the second element is the string.
The field separator is ALWAYS semicolon (;).
I would like to transform the numbers as follows:
00123457 ===> 001-23457
91399399119 ===> 913-99399119
13456 ===> 134-56
This means, after the first three digits of the number, I need to put a dash "-"
How can I achieve this using regular expressions?
Thank you in advance for your kind cooperation!
I don't know Oracle/PL/SQL, but I can provide a regex:
([[:digit:]]{3})([[:digit:]]+)
matches a number of at least four digits and remembers the first three separately from the rest.
RegexBuddy constructs the following code snippet from this:
DECLARE
result VARCHAR2(255);
BEGIN
result := REGEXP_REPLACE(subject, '([[:digit:]]{3})([[:digit:]]+)', '\1-\2', 1, 0, 'c');
END;
If you need to make sure that those numbers are always directly surrounded by ;, you can alter this slightly:
(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)
However, this will not work if two numbers can directly follow each other (12345;67890 will only match the first number). If that's not a problem, use
result := REGEXP_REPLACE(subject, '(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)', '\1\2-\3\4', 1, 0, 'c');