Regex Extract with two different delimiters - regex

I'm working in Google Data Studio and having trouble extracting a string between two different delimiters.
For example, if I have the following line item:
Company_Clothes_Shirt:Red_Online_US
I would like to extract just Red
I’ve tried
REGEXP_EXTRACT(Dimension,'^(?:[^\\_]*\\_){2}([^\\:]*\\:){1}') but it just gives me Shirt:
Tried several other iterations but have only been able to extract the first part (Shirt), rather than the second (Red).
Would appreciate any help on this!

You don't need to extract based on the whole string, you can just extract the value between the two delimiters:
SELECT REGEXP_EXTRACT(Dimension,':([^_]+)_')
For an input value of Company_Clothes_Shirt:Red_Online_US, this will give Red.
Note that neither _ nor : is a special character in regex, so they don't need to be escaped.
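As a quick way to check the pattern outside Data Studio, here is a minimal sketch using Python's re module. The helper name extract_between is made up for illustration, and the sketch assumes the same pattern behaves identically here, which seems reasonable since it uses only basic syntax.

```python
import re
from typing import Optional


def extract_between(value: str) -> Optional[str]:
    """Return the text between the first ':' and the next '_', or None."""
    # Same pattern as in the answer: a colon, one or more non-underscore
    # characters (captured), then an underscore.
    match = re.search(r":([^_]+)_", value)
    return match.group(1) if match else None


print(extract_between("Company_Clothes_Shirt:Red_Online_US"))
```

Returning None when nothing matches makes it easy to spot rows that do not follow the expected naming scheme.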

Related

KNIME regex expression to return 6th line

I have a column with string values present in several lines. I would like to only have the values in the 6th line, all the lines have varying lengths, but all the cells in the column have the information I need in the 6th line.
I am honestly absolutely new and have no background in Java nor KNIME - I have scoured this forum and other internet sources, and none seem to tackle what I need in KNIME specifically - I found something similar but it doesn't work in KNIME:
Regex for nth line in a text file
Your answer will probably need to be broken into two parts:
How to do a regex search in KNIME
How to do a regex search for the 6th line
I can help with the regex search, but I don't know KNIME.
To start with, you want to know how to search for a single line which is
([^\n]*\n)
This looks for
*: 0 or more of
[^\n]: anything that isn't a new line
followed by \n: a new line
and (): groups them together into a single match
We can then expand this into: ([^\n]*\n){5}([^\n]*\n){1} Which creates 2 capture groups, one with the first 5 lines, the second with the 6th line.
If KNIME supports Non-Capturing groups you can then expand that into the following so that you only have one matching capture group. You can decide for yourself which you like best.
(?:[^\n]*\n){5}([^\n]*\n){1}
I've created an example you can test on RegExr
Regardless of which way you go, make sure to document the regex with comments or store it in a variable with a very clear name, since regexes aren't particularly human-readable.
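The non-capturing variant can be tried out in Python as a rough sketch. This is not KNIME code: the names SIXTH_LINE and sixth_line are hypothetical, and as a small variation the trailing \n is left out of the capture group so the 6th line is returned without its line break and still matches when it is the last line.

```python
import re
from typing import Optional

# Skip five complete lines (non-capturing), then capture the 6th line's text.
SIXTH_LINE = re.compile(r"(?:[^\n]*\n){5}([^\n]*)")


def sixth_line(cell: str) -> Optional[str]:
    """Return the 6th line of a multi-line string, or None if there are fewer."""
    match = SIXTH_LINE.match(cell)
    return match.group(1) if match else None


sample = "one\ntwo\nthree\nfour\nfive\nsix\nseven"
print(sixth_line(sample))
```

Keeping the pattern in a clearly named constant follows the advice above about documenting regexes.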

regexReplace in String Manipulation KNIME

I'm trying to remove the content of all cells that start with a character that is not a number using KNIME (v3.2.1). I have different ideas but nothing works.
1) String Manipulation Node: regexReplace($column$,"^[^0-9].*","")
The cells contain multiple lines, however only the first line is removed by this approach.
2) String Manipulation Node: regexMatcher($casrn_new$,"^[^0-9].*") followed by Rule Engine Node to remove all columns that are "TRUE".
The regexMatcher gives me "False" even for columns that should be "True" though.
3) String Replacer Node: I inserted the expression ^[^0-9].* into the Pattern column and selected "Replace whole String" but the regex is not recognised by that node so nothing gets replaced.
Does anyone have a solution for any of those approaches or knows another Node that might do the job? Help is much appreciated!
I would go with your first solution, since it has already worked, you just have to expand your regex to include newlines. I would try something like this:
regexReplace($column$,"^[^0-9](.|\n)*","")
This should match any text starting with a character that is not a number, followed by any number of occurrences of any character or a newline. Depending on the line endings, you might need (.|\n|\r) instead of (.|\n).
You should use the following expression:
"(?s)^\D.*$"
So the dot will match even new lines. (Based on this: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#DOTALL)
In case you need to only change the content of the cells that do not start with a number, I do not think you need to filter any columns or rows. (BTW in case you want to remove rows, there are the Rule-based Row Filter/Splitter nodes which also support regular expressions with the MATCHES predicate.)
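To see why the (?s) flag matters, here is a small Python sketch. KNIME uses Java regexes, so this is only an analogy; it assumes the inline (?s) flag behaves the same way in Python's re module for this pattern, and the function name is invented for the example.

```python
import re

# (?s) turns on DOTALL, so '.' also matches newline characters.
STARTS_WITH_NON_DIGIT = re.compile(r"(?s)^\D.*$")


def blank_if_non_digit_start(cell: str) -> str:
    """Empty the whole cell if its first character is not a digit."""
    return STARTS_WITH_NON_DIGIT.sub("", cell)


# Without DOTALL, only the first line is removed (the problem in approach 1).
first_line_only = re.sub(r"^\D.*", "", "n/a\n50-00-0")

print(repr(first_line_only))
print(repr(blank_if_non_digit_start("n/a\n50-00-0")))
print(repr(blank_if_non_digit_start("50-00-0\nnote")))
```

Cells that start with a digit are returned unchanged, because ^ only matches at the very start of the string here.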

REGEX to find first instance after set length

I'm probably going to get pilloried for asking this question, but after searching and trying to figure out this regex on my own, I'm just tired of wasting time trying to figure it out. Here's the problem I'm trying to solve. I frequently use EditPad Pro to convert character strings so they will fit into a mainframe.
For instance, I want to convert a column of words from excel into an IN clause for sql. The column is 5000 words or so.
I can easily copy and paste that into the text editor and then using find and replace convert that from a column of words to a single row with ',' separating each word.
Once that's done, though, I want to use a regex to split this row at the first comma that appears after 70 characters have gone by.
(?P<start>^.{0,70})
This will give me the first 70 characters, but then I get stuck as I can't figure out how to create the next group to find all the characters up to the next comma so I can refer to it like this
(?P<start>^.{0,70})(?P<next>????,)
If I could get that, then I could do a find and replace that would break it after the first comma that appears after the 70th character.
I know given the rest of the day I could figure it out, but I need to move on. I've tried this before. I would even be willing to only find the first 70 characters and then the next few characters until the comma, and then have to repeat the find and replace multiple times if necessary, but I cannot get the regex to work.
Any assistance with this would be greatly appreciated.
Here is some sample data that I have added line breaks into as an example of what I want it to look like after the regex runs.
'Ability','Absence','Absolute','Absorb','Accident','Acclaim','Accompany',
'Accomplish','Achievement','Acquaintance','Acquire','Across','Acting','Address',
'Admire','Adorable','Advance','Advertisement','Afraid','Agriculture','Align',
'All','Allow','Allowance','Allowed','Alone','Aluminium','Always','America',
'Analyze','Android','Angle','Announce','Annual','Ant','Antarctica','Antler',
I think you should consider restricting your initial concatenation, but here's a solution to your specific implementation:
^.{0,70}[^,]*
This will select the first 70 characters (if available), then every character up to the one before the next comma.
I don't think you need groups here, but you can obviously add them to the regex:
(?P<start>^.{0,70})(?P<next>[^,]*)
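For anyone who would rather script the wrapping than repeat a find and replace, here is a minimal Python sketch of the same idea. The function name wrap_after_comma is hypothetical, and the pattern is a variation on the one above: it drops the ^ anchor so it can match repeatedly along the row, and adds an optional trailing comma so each line ends with its comma as in the sample output.

```python
import re
from typing import List


def wrap_after_comma(row: str, width: int = 70) -> List[str]:
    """Split a single-line row at the first comma found after `width` characters."""
    # Up to `width` characters, then everything up to the next comma,
    # then the comma itself if there is one.
    pattern = re.compile(r".{1,%d}[^,]*,?" % width)
    return pattern.findall(row)


row = "'a','bb','ccc','dddd','eeeee'"
for line in wrap_after_comma(row, width=10):
    print(line)
```

Joining the returned pieces gives back the original row, so nothing is lost in the split. The sketch does not handle commas inside the quoted words themselves.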

Regular expression for rest of line after first x characters

I have a bunch of lines with IDs as the first six characters, and data I don't need after. Is there a way to identify everything after the ID section so Find and Replace can replace it with whitespace?
/.{6}\K.*//
If you want something more specific, please be more specific in your question.
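Python's built-in re module does not support \K, so as a rough equivalent the sketch below captures the six-character ID and puts it back in the replacement. The function name keep_id and the sample line are made up for illustration, and the sketch drops the trailing data entirely rather than padding it with spaces.

```python
import re


def keep_id(line: str) -> str:
    """Keep the first six characters of a line and drop everything after them."""
    # Group 1 is the ID; the rest of the line is matched and discarded.
    return re.sub(r"^(.{6}).*$", r"\1", line)


print(keep_id("AB1234 data that is not needed"))
```

Lines shorter than six characters do not match and are returned unchanged.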

Extract values from this string?

I have the following string of text.
LOCATION: -20.443 122.951TEMPERATURE: 54.5CCONFIDENCE:
50%SATELLITE: aquaOBS TIME: 2014-05-06T05:30:30ZGRID:
1km
This is being pulled from a feed, and the fieldnames stay the same, but the values differ.
I have been trying to get my head around regular expressions and find a way to pull:
54.5 (temperature)
50 (confidence)
So I need two separate regular expressions that can pull the above from the original string. Any clues or pointers would be great.
I am doing this within a product that allows me to point to strings and can apply regular expressions to the strings so that values can be extracted and written to new fields.
ArcGIS appears to be using a very limited regex engine. It looks like it doesn't even support capturing groups, let alone lookaround. So I guess you need to try the following:
TEMPERATURE: ([0-9.]+)C
will match the TEMPERATURE entry and
CONFIDENCE: ([0-9]+)%
will match the CONFIDENCE entry.
If you're lucky, you can then access the relevant part of the match via the special variable \1 or $1 (which would then contain "54.5" and "50", respectively).
If that's not possible, you'll have to "manually" trim the first 13 or 12 characters (for TEMPERATURE and CONFIDENCE, respectively) from the left side of the match, as well as the rightmost character.
You can split this text using the newline as the delimiter, which gives you an array. Then you can split the elements of the array using ':' as the delimiter.
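Both approaches can be sketched in Python to check the patterns against the sample string. This says nothing about what ArcGIS itself supports. The names FEED, extract_readings and trim_match are hypothetical, and \s* is used instead of a literal space after the colon on the assumption that the line break after "CONFIDENCE:" in the sample may be part of the real data.

```python
import re

FEED = (
    "LOCATION: -20.443 122.951TEMPERATURE: 54.5CCONFIDENCE:\n"
    "50%SATELLITE: aquaOBS TIME: 2014-05-06T05:30:30ZGRID:\n"
    "1km"
)


def extract_readings(text: str):
    """Return (temperature, confidence) as strings, or None where not found."""
    temperature = re.search(r"TEMPERATURE:\s*([0-9.]+)C", text)
    confidence = re.search(r"CONFIDENCE:\s*([0-9]+)%", text)
    return (
        temperature.group(1) if temperature else None,
        confidence.group(1) if confidence else None,
    )


def trim_match(whole_match: str, prefix_length: int) -> str:
    """Fallback without capture groups: cut the label and the last character."""
    return whole_match[prefix_length:-1]


print(extract_readings(FEED))
print(trim_match("TEMPERATURE: 54.5C", 13))
print(trim_match("CONFIDENCE: 50%", 12))
```

The trim_match fallback only works when the label and value are separated by exactly one space, which is why the prefix lengths are 13 and 12.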