Parsing pipes from a line using expression - informatica

I have data that looks like this:
A|B|CC|DD|EE|FF|GG
Is there any way I can parse the string to output the values between the pipe separators? Can someone give me some examples?
e.g.
A is the value before the first pipe
B is the value before the second pipe
etc.

It's possible within an Expression Transformation, but very inconvenient. You need to use the INSTR and SUBSTR functions, as indicated by @Vikas.
What you can also try is Java Transformation or...
A trick: how about dumping this (i.e. the string along with some key value) to a file prior to processing the dataset. And then use an additional Source Qualifier with Column delimiter set to "|" to do all the dirty work for you? Then you can join it all back together using a Joiner Transformation and the key value dumped to the file.

You can use a combination of INSTR and SUBSTR, or the REG_ functions (for example REG_EXTRACT).
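
The INSTR/SUBSTR approach mentioned in the answers boils down to finding each delimiter position and slicing out the text between two positions. A minimal sketch of that logic in Python, for illustration only (this is not Informatica expression syntax, and the function name split_by_positions is made up for the example):

```python
def split_by_positions(line: str, delimiter: str = "|") -> list[str]:
    """Split a line by locating each delimiter, then slicing between positions."""
    fields = []
    start = 0
    while True:
        pos = line.find(delimiter, start)  # plays the role of INSTR
        if pos == -1:
            fields.append(line[start:])    # last field: everything after the final pipe
            break
        fields.append(line[start:pos])     # plays the role of SUBSTR
        start = pos + len(delimiter)
    return fields


fields = split_by_positions("A|B|CC|DD|EE|FF|GG")
print(fields)
```

In an Expression Transformation the same idea needs one output port per field, which is why the answers call it inconvenient.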

Related

How can I extract specific patterns from a string?

I currently have a dataset filled with the following pattern:
My goal is to get each value into a different cell.
I have tried with the following formula, but it's not yielded the results I am looking for.
=SPLIT(D8,"[Stock]",FALSE,FALSE)
I would appreciate any guidance on how I can get to the ideal output, using Google Sheets.
Thank you in advance!
I will assume here from your post that your original data runs D8:D.
If you want to retain [Stock] in each entry, try the following in the Row-8 cell of a column that is otherwise empty from Row 8 downward:
=ArrayFormula(IF(D8:D="",,TRIM(SPLIT(REGEXREPLACE(D8:D&"~","(\[Stock\]).","$1~"),"~",1,1))))
If you don't want to retain [Stock] in each entry, use this version:
=ArrayFormula(IF(D8:D="",,TRIM(SPLIT(REGEXREPLACE(D8:D&"~","\[Stock\].","~"),"~",1,1))))
These formulas do not rely on any punctuation as markers. They also ensure that you don't wind up with blank (and therefore unusable) cells interspersed from the trailing SPLITs.
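
For readers who want to see what the REGEXREPLACE + SPLIT combination does step by step, here is a rough Python sketch of the same idea: put a "~" marker after every [Stock], split on the marker, trim, and drop empty pieces. The sample cell is hypothetical (the question's data is not shown), the function name split_entries is made up, and like the formulas it assumes "~" never occurs in the data.

```python
import re


def split_entries(cell: str, keep_marker: bool = True) -> list[str]:
    """Split a cell into entries that each end with [Stock]."""
    if not cell:
        return []
    padded = cell + "~"  # guarantees a character after the final [Stock]
    if keep_marker:
        # Replace "[Stock]" plus the next character with "[Stock]~".
        marked = re.sub(r"(\[Stock\]).", r"\1~", padded)
    else:
        # Replace "[Stock]" plus the next character with "~" only.
        marked = re.sub(r"\[Stock\].", "~", padded)
    # Trim each piece and skip empty ones, like TRIM and remove_empty_text.
    return [part.strip() for part in marked.split("~") if part.strip()]


sample = "Apples [Stock], Pears [Stock], Plums [Stock]"  # hypothetical cell content
print(split_entries(sample))
print(split_entries(sample, keep_marker=False))
```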
If "," is only used in the separator:
=ARRAYFORMULA(SPLIT(D8:D,", ",FALSE))
If "," is also used within each string ([Stock] will be removed):
=ARRAYFORMULA(SPLIT(D8:D," [Stock], ",FALSE))
If "," is also used within each string ([Stock] will be kept):
=ArrayFormula(SPLIT(REGEXREPLACE(M9:M11,"(\[Stock\]), ","$1♦"),"♦"))
use:
=INDEX(TRIM(IFNA(SPLIT(D8:D; ","))))

How can I use Regex to parse irregular CSV and not select certain characters

I have to handle a weird CSV format, and I have been running into problems. The string I have been able to work out thus far is
(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*?)+?
My files are often broken and irregular, since we have to deal with OCR'd text which is usually not checked by our users. Therefore, we tend to end up with lots of weird things, like a single " within a field, or even a newline character (which is why I am using Regex instead of my previous readLine()-based solution). I've gotten it to parse almost everything correctly, except that it captures [,] [,]. How can I get it to NOT select fields with only a single comma? When I try to have it not select commas, it turns "156,000" into [156] and [000].
The test string I've been using is
"156,000","",""i","parts","dog"","","Monthly "running" totals"
The desired capture output is
[156,000],[],[i],[parts],[dog],[],[Monthly "running" totals]
I can do with or without the internal quotes, since I can always just strip them during processing.
Thank you all very much for your time.
Your CSV is indeed irregular and difficult to parse. I suggest you do 2 replacements first to your data.
// Collapse stray doubled quotes ("") at field edges into a single quote.
// Inside a C# verbatim string (@"..."), each literal " is written as "".
input = Regex.Replace(input, @"(?<!,|^)""""(?=,|$)|(?<=,)""""(?!,|$)", "\"");
// Now escape every remaining inner " with a backslash.
input = Regex.Replace(input, @"(?<!,|^)""(?!,|$)", "\\\"");
// At this stage you have proper CSV data, and I suggest using a good .NET CSV parser
// to parse it and get the individual values.
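
As a cross-check of the two-replacement idea, here is a hedged Python sketch run against the test string from the question. It is an adaptation, not the answer's code: Python's re module needs fixed-width lookbehinds, so (?<!,|^) is written as (?<!,)(?<!^), and the standard csv module stands in for "a good CSV parser". The function names are made up, and the sketch handles one record per string (no embedded newlines).

```python
import csv
import re


def repair_csv_line(line: str) -> str:
    """Apply the two replacements to one broken CSV record."""
    # Step 1: collapse a stray doubled quote at the start or end of a field.
    line = re.sub(r'(?<!,)(?<!^)""(?=,|$)|(?<=,)""(?!,|$)', '"', line)
    # Step 2: backslash-escape quotes that neither open nor close a field.
    return re.sub(r'(?<!,)(?<!^)"(?!,|$)', lambda m: '\\"', line)


def parse_line(line: str) -> list[str]:
    """Repair the record, then parse it with backslash as the escape character."""
    reader = csv.reader([repair_csv_line(line)], escapechar="\\", doublequote=False)
    return next(reader)


raw = '"156,000","",""i","parts","dog"","","Monthly "running" totals"'
fields = parse_line(raw)
print(fields)
```

On this input the fields come out as 156,000, an empty field, i, parts, dog, an empty field, and Monthly "running" totals, which is the capture the question asks for.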

How do you remove text from example sets before processing the data?

I am using RapidMiner 5.3.013. I am reading from an Excel file with thousands of rows of worklogs from Remedy. I want to remove text based on the regex ^[A-Z][\w\d/?(# ]+[\w0-9#)]{2}: and then use Process Documents from Data. So far I have not figured out how to do this. I could probably just write VBA, but I would like to know how it can be done in RapidMiner.
Having read the Excel data, make sure the field to be processed by the Process Documents operator is set to type text. Do this using the Nominal to Text operator. Inside the Process Documents loop, split the data into tokens using the Tokenize operator. Use the Filter Tokens operator to remove any tokens you don't want. This operator takes a regular expression as a parameter. Make sure the invert flag is set on this operator, so that it removes the tokens you don't want rather than keeping them.
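
To make the intent concrete outside RapidMiner, here is a small Python sketch that removes the leading text matched by the regex from the question and then tokenizes what is left. It illustrates the goal only and is not RapidMiner operator configuration; the sample worklog line is invented, and the letters-only tokenizer is an assumption.

```python
import re

# The pattern from the question, unchanged.
PREFIX = re.compile(r"^[A-Z][\w\d/?(# ]+[\w0-9#)]{2}:")


def strip_prefix(entry: str) -> str:
    """Remove the leading text matched by PREFIX, if any."""
    return PREFIX.sub("", entry, count=1).lstrip()


def tokenize(text: str) -> list[str]:
    """Split on non-letters (an assumed tokenization rule)."""
    return re.findall(r"[A-Za-z]+", text)


sample = "Jsmith (IT#42): Rebooted the server"  # hypothetical worklog entry
cleaned = strip_prefix(sample)
tokens = tokenize(cleaned)
print(cleaned, tokens)
```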

Separate text using regex

I have a string like
abcdefangners
and a set of numbers that specifies how to group the above string, such as
3,4
In this case, the output should be
abc,defa,ngners
Is something like this possible using regex? One option I have is a loop that works through the numbers in the set one by one, but is there a better way to do it?
You could do:-
/(.{3})(.{4})(.*)/
This would give you the substrings which you'd then have to join together.
You'd have to create the regexp for each set of numbers so it would not be as easy as other methods of string manipulation.
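
Building the regexp for each set of numbers can be automated. A minimal Python sketch, assuming the sizes arrive as a list of integers and that whatever is left over forms the last group (the function name group_by_sizes is made up for the example):

```python
import re


def group_by_sizes(text: str, sizes: list[int]) -> list[str]:
    """Build a pattern such as (.{3})(.{4})(.*) from sizes and apply it to text."""
    pattern = "".join(f"(.{{{n}}})" for n in sizes) + "(.*)"
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    if match is None:
        raise ValueError("text is shorter than the requested group sizes")
    return list(match.groups())


groups = group_by_sizes("abcdefangners", [3, 4])
print(",".join(groups))  # abc,defa,ngners
```

Plain slicing in a loop does the same job without a regex, which is the point of the answer's last sentence.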

Using SAS to format a string as a substring

I am new to SAS formats.
Say I have a string in the form of NNN.xxx where NNN is a number in the format of z3. and xxx is just some text.
E.g.
001.NUL and 002.ABC
Now can I define a format, fff, such that b = put("&NNN..&xxx.",fff.); returns only the &xxx. part?
I know we can achieve this by using b = substr("&NNN..&xxx.",5,3); but I want to have a format so that I can simply assign the format to a variable and not have to create a new variable out of it.
Thanks in advance.
Probably the only way is to code your own custom character format using SAS/TOOLKIT. It will be much easier to create another variable as you do with substr().
As said, I think this can be achieved through a combination of custom-defined formats and SAS built-in character functions, i.e. CAT, CATX, CATS, CATT, etc.