Getting "null" when extracting string using REGEXP_EXTRACT in Tableau - regex

I have been trying to use the REGEXP_EXTRACT function in Tableau without success (see image below). I have a string column 'FOB', and I want to extract the leading capital letters. Sometimes there's a dash following the capital letters, sometimes not, so I used the following syntax in the created field 'Advertiser':
REGEXP_EXTRACT([FOB],'^[A-Z]*')
However, this produces a column full of "null". The weird thing is even if I changed the pattern from '^[A-Z]*' to 'SDM', it was still the same. It just seems that Tableau is not regex enabled...
I did check my regex online here and it worked... getting really confused, any help will be appreciated.

To extract the first character of each [FOB] cell, use the ^ anchor and the [A-Z] character class, and wrap the pattern in a capturing group (i.e. paired parentheses, (...)) to tell Tableau which part of the match to return:
REGEXP_EXTRACT([FOB],'^([A-Z])')
To extract all (one or more) leading capital letters, add +:
REGEXP_EXTRACT([FOB],'^([A-Z]+)')
See Mark Jackson's regex blog excerpt:
The whole pattern is wrapped in parenthesis to tell Tableau what part of the pattern to return. This is an update from the earlier beta version I was using when I created this post. The nice thing about this addition is that Tableau lets you pattern match on a larger portion of the string, but allows you to return a subset of the pattern.
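The effect of the capturing group can be sketched outside Tableau. The snippet below is a Python analogy, not Tableau code; the sample FOB values and the helper name leading_capitals are made up for illustration, since the real column is not shown in the question.

```python
import re

# Hypothetical sample values standing in for the FOB column.
SAMPLE_FOB = ["SDM-Spring-2015", "ABC summer", "lowercase-only"]

def leading_capitals(value):
    """Return the leading run of capital letters, or None if there is none."""
    match = re.match(r"^([A-Z]+)", value)
    # group(1) is the text captured by the parentheses.
    return match.group(1) if match else None

results = [leading_capitals(v) for v in SAMPLE_FOB]
print(results)
```

A value that does not start with a capital letter yields None here, which is the closest Python counterpart to a null result.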

Related

Trying to extract repeating pattern from string in php/javascript

The following is in PHP but the regex will also be used in javascript.
Trying to extract repeating patterns from a string
string can be any of the following:
"something arbitrary"
"D123"
"D111|something"
"D197|what.org|when.net"
"D297|who.197d234.whatever|when.net|some other arbitrary string"
I'm currently using the following regex: /^D([0-9]{3})(?:\|([^\|]+))*/
This correctly does not match the first string, and it matches the second and third correctly. The problem is that for the fourth and fifth strings it only captures the Dxxx and the last string. I need each of the strings between the '|' to be captured.
I'm hoping to use a regex as it makes it a single step. I realize I could just detect the leading Dxxx then use explode or split as appropriate to break the strings out. I've just gotten stuck on wanting a single regular expression match step.
This same regex may be used in Python as well so just want a generic regex solution.
There is no way to have a dynamic number of capture groups in a regular expression, but if you know some upper limit to how many parts you would have in one string, you can just repeat the pattern that many times:
/^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)/
So after the initial ^D([0-9]{3})(?:$|\|) you just repeat (.*?)(?:$|\|) as many times as you need it.
When the string has fewer elements, those remaining capture groups will match the empty string.
See regex tester.
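Since the question mentions Python, here is a minimal sketch of the repeated-pattern approach there, assuming an upper limit of four parts after the Dxxx prefix. The function name parse_fixed is hypothetical.

```python
import re

# The pattern from the answer as a Python raw string: the Dxxx prefix
# followed by four optional parts, each ended by '|' or end of string.
PATTERN = re.compile(
    r"^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)"
)

def parse_fixed(text):
    """Return a tuple of all capture groups, or None if the prefix is missing."""
    match = PATTERN.match(text)
    return match.groups() if match else None

for sample in ["something arbitrary", "D123", "D197|what.org|when.net"]:
    print(sample, "->", parse_fixed(sample))
```

Unused groups come back as empty strings, so the caller may want to filter them out.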
Is something like preg_match_all() (the PHP variant of a global match) also acceptable for you?
Then you could use:
^(?|D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$))
This will match everything in a string in different matches, e.g. take your string:
D197|what.org|when.net
It will then give you three matches:
D197
what.org
when.net
Running live: https://regex101.com/r/jL2oX6/4 (Everything in green are your group matches. Ignore what's in blue.)
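Note that the branch reset group, (?|...), is a PCRE/Perl feature; JavaScript and Python's built-in re module do not support it. Where that matters, the two-step approach the question mentions (check the Dxxx prefix, then split) is a portable alternative. A minimal Python sketch, with parse_split as a hypothetical name:

```python
import re

def parse_split(text):
    """Return [digits, part1, part2, ...], or None if the Dxxx prefix is missing."""
    head, *rest = text.split("|")
    # The first field must be exactly 'D' followed by three digits.
    match = re.fullmatch(r"D([0-9]{3})", head)
    if match is None:
        return None
    return [match.group(1)] + rest

print(parse_split("D197|what.org|when.net"))
```

This is a swapped-in technique rather than a single-regex solution, but it handles any number of parts.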

Regex for value.contains() in Google Refine

I have a column of strings, and I want to use a regex to find commas or pipes in every cell, and then take an action. I tried this, but it doesn't work (no syntax error, it just matches neither commas nor pipes).
if(value.contains(/(,|\|)/), ...
The funny thing is that the same regex works with the same data in SublimeText. (Yes, I can work it there and then reimport, but I would like to understand what's the difference or what is my mistake).
I'm using Google Refine 2.5.
Since value.match should return captured texts, you need to define a regex with a capture group and check if the result is not null.
Also, pay attention to the regex itself: the string should be matched in its entirety:
Attempts to match the string s in its entirety against the regex pattern p and returns an array of capture groups.
So, add .* before and after the pattern you are looking for inside a larger string:
if(value.match(/.*([,|]).*/) != null)
You can use a combination of if and isNonBlank like:
if(isNonBlank(value.match(/your regex/)), ...
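Whole-string matching is not unique to Refine. As an analogy only (this is Python, not GREL), re.fullmatch also requires the entire string to match, so it shows why the .* padding is needed. The function names are hypothetical.

```python
import re

def has_separator(value):
    # Padded with .* so the whole string can match while the group
    # captures one comma or pipe.
    return re.fullmatch(r".*([,|]).*", value) is not None

def has_separator_unpadded(value):
    # Without the padding, only a string that is exactly one
    # separator character matches in its entirety.
    return re.fullmatch(r"([,|])", value) is not None

print(has_separator("a,b"), has_separator_unpadded("a,b"))
```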

Regex Extraction for Google Analytics Content Grouping

I'm attempting to set up Content Groupings using Extraction within Google Analytics.
I have URLs of the form http://www.ehattons.com/52674/Bachmann_Branchline_37_671_Pack_of_3_14_Ton_tank_wagons_in_Fina_livery_weathered/StockDetail.aspx
I wish to use Regex to say that only in cases where a URL contains /StockDetail.aspx, extract everything before the first underscore, excluding any digits. e.g. 'Bachmann'.
I've managed to source the following regex to return everything before the first underscore
^[^_]+(?=_).
However, that's as far as I can get with my limited understanding. Anyone know what regex will do the trick here?
Many thanks,
Well, you're halfway there.
Think about it this way: you want to extract something that is followed by an underscore but does not itself follow one, and only when the string contains /StockDetail.aspx. You know that this part of the string always comes after your first underscore.
So you start with a character that is not an underscore: [^_]
Then you create the group you want to match with ([a-zA-Z]*) (you cannot use \w, since it includes the underscore). The group has to be followed by an underscore, so you add _ after it. And finally, somewhere later in the URL you have /StockDetail.aspx. Your regex should look like this:
[^_]([a-zA-Z]*)_.*(?:\/StockDetail\.aspx)
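The pattern can be checked against the URL from the question outside Google Analytics. The following is a Python sketch for testing only; extract_group is a hypothetical helper, and the second URL in the usage lines is invented to show the non-matching case.

```python
import re

URL = ("http://www.ehattons.com/52674/Bachmann_Branchline_37_671_Pack_of_3_14_"
       "Ton_tank_wagons_in_Fina_livery_weathered/StockDetail.aspx")

# Same pattern as above; the slash needs no escaping in Python.
PATTERN = re.compile(r"[^_]([a-zA-Z]*)_.*(?:/StockDetail\.aspx)")

def extract_group(url):
    """Return the letters before the first underscore, or None if no match."""
    match = PATTERN.search(url)
    return match.group(1) if match else None

print(extract_group(URL))
# Invented URL without /StockDetail.aspx, so nothing is extracted.
print(extract_group("http://www.ehattons.com/52674/Bachmann_Branchline/Other.aspx"))
```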

Improving a regex

I am looking for alternate methods to get john from the provided example.
My expression works as is but was hoping for some examples of better methods.
Example: john&home
my regexp: [a-z]{3,6}[^&home]
I'm matching any characters of length 3-6 up to but not including &home
Every item I run the regexp on is in the same format: 3-6 characters followed by &home
I have looked at other posts but was hoping for a reply specific to my regexp.
Most regex engines allow you to capture parts of a regex with capture groups. For instance:
^([A-Za-z]{3,6})&home$
The parentheses here mean that you are interested in the part before the &home. The ^ and $ mean that you want to match the entire string. Without them, averylongname&homeofsomeone will be matched as well.
Since you use rubular, I assume you use the Ruby regex engine. In that case you can for instance use:
full = "john&home"
name = full.match(/^([A-Za-z]{3,6})&home$/)[1]
And name will in this case contain john. (Calling .captures instead would return an array of all captured groups, here ["john"].)
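For comparison, a minimal Python sketch of the same capture, showing the effect of the anchors. The names ANCHORED, UNANCHORED and extract_name are hypothetical.

```python
import re

ANCHORED = re.compile(r"^([A-Za-z]{3,6})&home$")
UNANCHORED = re.compile(r"([A-Za-z]{3,6})&home")

def extract_name(text, pattern=ANCHORED):
    """Return the captured name, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None

print(extract_name("john&home"))
# The anchored pattern rejects the longer string; the unanchored one
# still finds a match somewhere inside it.
print(extract_name("averylongname&homeofsomeone"))
print(extract_name("averylongname&homeofsomeone", UNANCHORED))
```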

Regex for capitalizing first letter in a tag, alt=", etc

I've found regular expressions that capitalize the first letter in a sentence. But does anyone know a regex that capitalizes the first letter inside a tag, including URL and image attributes (e.g. title="antelope" or alt="antelope")?
I used another regex to change all my image paths to lower case, and it zapped a bunch of my tags as well (alt, title, h2, etc.). So now I'd like to get a head start fixing them by capitalizing the first letters.
I'm working on a Mac, using Dreamweaver and TextWrangler as my text editors.
Before...
alt="antelope" title="antelope" <h2>antelope
After...
alt="Antelope" title="Antelope" <h2>Antelope
Regex
(="\w|>\w)
Replace Regex
\U$1\E
Description: This will work for your example, depending on the regex engine you are using.
Debuggex Demo
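Not every engine supports \U and \E in the replacement. As a sketch of the same find pattern in an engine without them, Python's re.sub accepts a function that builds the replacement; capitalize_first is a hypothetical name.

```python
import re

def capitalize_first(text):
    """Uppercase the first word character after =" or >."""
    # The callable receives each match and returns its replacement;
    # upper() leaves the = " and > characters unchanged.
    return re.sub(r'(="\w|>\w)', lambda m: m.group(1).upper(), text)

before = 'alt="antelope" title="antelope" <h2>antelope'
print(capitalize_first(before))
```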
This replaces the value in parameters in a url. NOT in html, as I now see that is what you mean. Oh well.
Find what: (\?|\&)([a-z_]+=)([a-z])([^&]+)
Replace (all) with: $1$2\u$3$4
Free spaced:
(\?|\&)
Capture group 1: Either the literal question mark or ampersand.
([a-z_]+=)
Capture group 2: One or more of any lowercase letter or underscore, followed by the equals sign.
([a-z])
Capture group 3: The first letter in the value of the url parameter. Note this does not even notice parameters whose values don't start with a letter.
([^&]+)
Capture group 4: Every other character in the value. Or more specifically, one or more of any character as long as it's not an ampersand. This is a negated character class.
The \u in the replacement is an option in TextWrangler (and in TextPad, which is what I use, so TextWrangler might also use the Boost regex engine) that uppercases the immediately following character. I'm not sure if this would work if capture groups 3 and 4 were merged.
Try it (although it doesn't have the \u option.)
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. There's a lot of helpful information in it, including a list of online regex testers (in the bottom section), so you can try things out yourself. All the links in this answer come from the FAQ.