Google Data Studio Regexp Replace formula - delete all characters after ? and # - regex

I have a dashboard in Google Data Studio.
I'm trying to create a custom field that removes all the characters that come after the # and ? signs (and the signs themselves, of course). But this formula does not work, and I don't know why.
I was trying this one
REGEXP_REPLACE(Landing Page,'(#|\?)(.*)','')
Could you please help?

The pattern you tried, (#|\?)(.*), captures either # or ? in a capturing group with an alternation |, followed by a second capturing group that matches any character 0+ times.
But in the replacement there is an empty string specified, removing all that is matched.
You could use a character class in a capturing group, ([#?]), to capture one of the listed characters.
To only do the replacement where there is something after the match, you could match 1+ times any character except a newline using .+
To remove what comes after the matched character, you could refer to the capturing group using \\1 so that you keep the # or ? and remove what is matched afterwards.
The pattern could look like:
([#?]).+
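As a rough check of that pattern outside Data Studio, here is a minimal Python sketch. Python's re module stands in for Data Studio's regex engine, which is an assumption, and the function names and the sample landing page are made up for illustration. The second function is a variant that also drops the marker, as the question asks.

```python
import re

def keep_marker(page: str) -> str:
    # ([#?]) captures the first # or ?, .+ consumes the rest; \1 puts the marker back.
    return re.sub(r"([#?]).+", r"\1", page)

def drop_marker(page: str) -> str:
    # Variant without a group: the marker is removed together with what follows it.
    return re.sub(r"[#?].*", "", page)

page = "/pricing?utm_source=mail#top"  # hypothetical landing page value
print(keep_marker(page))  # /pricing?
print(drop_marker(page))  # /pricing
```

Because of the .+ quantifier, a value that ends right at the marker (nothing after it) is left unchanged by the first variant.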

Related

Regex - extract last term between _ and before . from path

This is the regex that I'm currently testing
[\w\. ]+(?=[\.])
My ultimate goal is to include a regex expression to extract using regexp_extract in Impala/Hive query.
regexp_extract(col, '[\w\. ]+(?=[\.])', 1)
This doesn't work in Impala however.
Examples of path to extract from:
D:\mypath\Temp\abs\device\Program1.lua
D:\mypath\Temp\abs\device\SE1_Test-program.lua
D:\mypath\Temp\abs\device\Test_program.lua
D:\mypath\Temp\abs\device\Device_Test_Case-general.lua
The regex I've tested extracts the term I'm looking for, but it's not good enough: for the second, third and fourth cases I would need to extract only the part after the last underscore.
My expectations are:
Program1
Test-program
program
Case-general
Any suggestions? I'm also open to using something other than regexp_extract.
Note that Impala regex does not support lookarounds, and thus you need a capturing group to get a submatch out of the overall match. Also, if you use escaping \ in the pattern, make sure it is doubled.
You can use
regexp_extract(col, '([^_\\\\]+)\\.\\w+$', 1)
See the regex demo.
The regex means
([^_\\]+) - Group 1: one or more chars other than _ and \ (a hyphen must stay allowed so that Test-program and Case-general are returned whole)
\. - a dot
\w+ - one or more word chars
$ - end of string.
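Whether the hyphen is listed in the negated class decides what comes back for the hyphenated file names. A small Python sketch (not Impala; the helper name last_term and its excluded parameter are hypothetical) runs both variants of the class against the sample paths from the question:

```python
import re

PATHS = [
    r"D:\mypath\Temp\abs\device\Program1.lua",
    r"D:\mypath\Temp\abs\device\SE1_Test-program.lua",
    r"D:\mypath\Temp\abs\device\Test_program.lua",
    r"D:\mypath\Temp\abs\device\Device_Test_Case-general.lua",
]

def last_term(path, excluded):
    # Build ([^...]+)\.\w+$ with the given characters excluded from group 1.
    match = re.search(r"([^" + excluded + r"]+)\.\w+$", path)
    return match.group(1) if match else None

for path in PATHS:
    # Left: hyphen, underscore and backslash excluded. Right: only underscore and backslash.
    print(last_term(path, r"-_\\"), "|", last_term(path, r"_\\"))
```

With the hyphen excluded, the second and fourth paths give program and general; with only the underscore and backslash excluded, they give Test-program and Case-general.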
Since \w also matches an underscore, you can use [a-zA-Z0-9] instead.
Add matching a dot and hyphen in the character class, capture that in group 1 and match the expected trailing dot.
Note that you don't have to escape dots in a character class.
([a-zA-Z0-9.-]+)[.]
See a regex101 demo
Example using regexp_extract where the , 1 gets the group 1 value:
regexp_extract(col, '([a-zA-Z0-9.-]+)[.]', 1)
If it should be at the end of the string only, matching the last dot without matching any backslashes in between:
regexp_extract(col, '([a-zA-Z0-9.-]+)[.][^\\\\.]+$', 1)
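The difference between the unanchored and the end-anchored variant shows up as soon as a folder name contains a dot. The sketch below uses Python instead of Impala, and the sample path with a dotted folder is hypothetical:

```python
import re

UNANCHORED = re.compile(r"([a-zA-Z0-9.-]+)[.]")
ANCHORED = re.compile(r"([a-zA-Z0-9.-]+)[.][^\\.]+$")

def first_group(pattern, path):
    # Return group 1 of the first match, or None when nothing matches.
    match = pattern.search(path)
    return match.group(1) if match else None

path = r"D:\my.path\device\Test_program.lua"  # hypothetical: dot inside a folder name
print(first_group(UNANCHORED, path))  # my
print(first_group(ANCHORED, path))    # program
```

In a Python raw string the backslash in the character class is written as \\, whereas the Impala string literal above needs it doubled again.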

Is there a RegEx to remove the first instance of "."?

I am trying to remove the first dot "." from a sequence of numbers like this: 2500155978.06. intending to have 250015597806.
Typically, I try to only match what I need and substitute later, i.e. match all "." and then remove the first match. I have been trying with ^[^.]+ but I am only getting the digits up to the first "."
Thought about using a capture group with a positive lookahead but it got me nowhere (still learning RegEx).
Thank you in advance for your time and assistance!
You can use
^(\d+)\.
and replace with $1, the placeholder pointing to the value stored in Group 1.
See the regex demo. Details:
^ - start of string
(\d+) - Group 1 (later referred to with $1 from the replacement pattern): one or more digits
\. - a dot.
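For illustration, the same find-and-replace in Python might look like the sketch below. Note that Python's re.sub writes the group reference as \1 rather than $1; the function name is made up.

```python
import re

def remove_first_dot(value: str) -> str:
    # ^(\d+)\. matches the leading digits plus the first dot; \1 keeps only the digits.
    return re.sub(r"^(\d+)\.", r"\1", value)

print(remove_first_dot("2500155978.06"))  # 250015597806
```

Only a dot that directly follows the leading digits is removed, so later dots and strings that do not start with digits are left alone.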

Match a part of a string using regex

I have a string and would like to match a part of it.
The string is Accept: multipart/mixedPrivacy: nonePAI: <sip:4168755400#1.1.1.238>From: <sip:4168755400#1.1.1.238>;tag=5430960946837208_c1b08.2.3.1602135087396.0_1237422_3895152To: <sip:4168755400#1.1.1.238>
I want to match PAI: <sip:4168755400#
The whitespace after PAI: can also be a word, so I would like to use .* there, but if I do, it matches most of the string.
This is the pattern that matches what I want when I use the whitespace instead of .*:
(PAI: <sip:)((?:\([2-9]\d{2}\)\ ?|[2-9]\d{2}(?:\-?|\ ?))[2-9]\d{2}[- ]?\d{4})#
This is what I'm trying to achieve with .*, but it should only match PAI: <sip:4168755400#
(PAI:.*<sip:)((?:\([2-9]\d{2}\)\ ?|[2-9]\d{2}(?:\-?|\ ?))[2-9]\d{2}[- ]?\d{4})#
I tried lookarounds, but without success.
Any ideas?
Thanks
Instead of matching the single space, you can use a character class that matches either a space or a word character, repeated 1 or more times so that at least one occurrence is required.
Note that you don't have to escape the spaces, and in both places you can use an optional character class matching either a space or a hyphen: [ -]?
If you want the match only, you can omit the 2 capturing groups.
(PAI:[ \w]+<sip:)((?:\([2-9]\d{2}\) ?|[2-9]\d{2}[ -]?)[2-9]\d{2}[- ]?\d{4})#
Regex demo
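A quick way to try this pattern against the string from the question is the Python sketch below. The variable names are made up, and the second input, with a word between PAI: and <sip:, is a hypothetical example.

```python
import re

HEADERS = (
    "Accept: multipart/mixedPrivacy: nonePAI: <sip:4168755400#1.1.1.238>"
    "From: <sip:4168755400#1.1.1.238>;tag=5430960946837208_c1b08.2.3.1602135087396.0_1237422_3895152"
    "To: <sip:4168755400#1.1.1.238>"
)

PAI = re.compile(
    r"(PAI:[ \w]+<sip:)"  # group 1: PAI:, then spaces or word chars, then <sip:
    r"((?:\([2-9]\d{2}\) ?|[2-9]\d{2}[ -]?)[2-9]\d{2}[- ]?\d{4})#"  # group 2: the number
)

match = PAI.search(HEADERS)
print(match.group(0))  # PAI: <sip:4168755400#
print(match.group(2))  # 4168755400

# Hypothetical input with a word between PAI: and <sip:
print(PAI.search("PAI: anonymous <sip:4168755400#1.1.1.238>").group(0))
```

Because [ \w] cannot match < or :, the class cannot run past the first <sip: the way .* does.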
The regex should be like
PAI:.*?(<sip:.*?#)
Explanation:
PAI:.*? finds the literal PAI: followed by anything (.*), where the ? makes the quantifier lazy, so it matches as few characters as possible before the next part of the pattern.
(<sip:.*?#) is the capturing group that holds the result we want.
<sip:.*?# finds <sip: followed by as few characters as possible up to the first #.
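To see what the lazy quantifiers change, the sketch below runs the lazy pattern and a greedy counterpart (the same pattern with the ? marks removed, added here only for comparison) on the string from the question, using Python as a stand-in for whatever engine is actually in use:

```python
import re

HEADERS = (
    "Accept: multipart/mixedPrivacy: nonePAI: <sip:4168755400#1.1.1.238>"
    "From: <sip:4168755400#1.1.1.238>;tag=5430960946837208_c1b08.2.3.1602135087396.0_1237422_3895152"
    "To: <sip:4168755400#1.1.1.238>"
)

lazy = re.search(r"PAI:.*?(<sip:.*?#)", HEADERS)
greedy = re.search(r"PAI:.*(<sip:.*#)", HEADERS)

print(lazy.group(0))  # PAI: <sip:4168755400#
print(lazy.group(1))  # <sip:4168755400#
# The greedy version runs on to the last <sip:...# in the string.
print(len(greedy.group(0)) > len(lazy.group(0)))  # True
```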

Regex to match ISO languages ISO

I have the following language or language-locale codes in a URL and I am trying to identify them with a regex. I was partially successful, but it fails for some scenarios.
Languages that I am testing with:
en-us -- Passes
us -- Fails
Here is the regex that I have:
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/)c\/(deals-and-tips\/)?
For instance:
https://forum.leasehackr.com/en-us/c/deals-and-tips (passes)
https://forum.leasehackr.com/us/c/deals-and-tips (fails)
What am I missing in the above REGEX?
The regex you wanted is:
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips\/)?
The difference from your regex is that I moved the first \/ from inside the parenthesis to outside (to sit with c\/).
Test here.
The last \/ keeps the optional deals-and-tips group from matching in any case, since your URLs don't have a trailing slash. Either way, I would rewrite your regex like this: ([a-zA-Z]{2})(-[a-zA-Z]{2})?\/c\/(deals-and-tips)?
This way it always looks for the first part (en) and treats the second part (-us) as optional.
Alternatively, use (\w{2})(-\w{2})?\/c\/(deals-and-tips)? if you don't mind the risk of also matching underscores and digits.
The reason your pattern does not match us is because the alternation ([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}\/) only matches the \/ in the second part of the alternation.
Also it does not match the last group with deals-and-tips because there is no trailing \/ in the example data.
Your updated pattern might look like
([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})\/c\/(deals-and-tips)?
Regex demo
You could shorten the pattern a bit by using an optional non capturing group (?:-[a-zA-Z]{2})? inside the first capturing group to optionally match the part starting with a hyphen.
As in the example data you could match the leading \/ in front of the capturing group to get a more efficient match.
\/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)\/c\/(deals-and-tips)?
In parts
\/ To be a bit more precise, match the leading /
( Capture group 1
[a-zA-Z]{2} Match 2 chars a-z
(?:-[a-zA-Z]{2})? Optionally match - and 2 chars a-z
) Close group
\/c\/ Match /c/
(deals-and-tips)? Optional capture group 2 match deals-and-tips
Regex demo
Note that if you use another delimiter than / you don't have to escape the forward slash.
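The effect of moving the slash out of the alternation can be checked against the two URLs from the question. This is a Python sketch, where / needs no escaping; the names ORIGINAL and FIXED are made up:

```python
import re

# The pattern from the question: the / only belongs to the second alternative.
ORIGINAL = re.compile(r"([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2}/)c/(deals-and-tips/)?")
# The shortened pattern with the optional non-capturing group.
FIXED = re.compile(r"/([a-zA-Z]{2}(?:-[a-zA-Z]{2})?)/c/(deals-and-tips)?")

URLS = [
    "https://forum.leasehackr.com/en-us/c/deals-and-tips",
    "https://forum.leasehackr.com/us/c/deals-and-tips",
]

for url in URLS:
    old = ORIGINAL.search(url)
    new = FIXED.search(url)
    print(url, "| original matches:", bool(old), "| fixed groups:", new.groups() if new else None)
```

For the en-us URL the original pattern does match, but group 1 includes the trailing slash and the second group stays empty because the URL has no slash after deals-and-tips.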

Keep only the strings in between quotes in Notepad++

In Notepad++, I use the expression (?<=").*(?=") to find all strings in between quotes. It would then seem rather trivial to keep only those results. However, I cannot find an easy solution for this.
I think the problem is that Notepad++ is not able to make multiple selections. But there must be some kind of workaround, right? Perhaps I must invert the regex and then find/replace those results to end up with the strings I want.
For example:
blablabla "Important" blabla
blabla "Again important" blablabla
I want to keep:
Important
Again important
There is no great solution for this, and depending on your use case I would recommend writing a quick script that uses your first expression and writes all of the matches to a new file (or something like this). However, if you just want something quick and dirty, this find expression and replacement should get you started:
Find: [^"]*(?:"([^"]*)")?
Replace: \1\n
Explanation:
[^"]* # 0+ non-" characters
(?: # Start non-capturing group
" # " literally
( # Start capturing group
[^"]* # 0+ non-" characters
) # End capturing group
" # " literally
)? # End non-capturing group AND make it optional
The reason the optional non-capturing group is used is because the end of your file may very well not have a string in quotes, so this isn't a necessary match (we're more interested in the first [^"]* that we want to remove).
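The quick-script route mentioned at the start of this answer could look roughly like the Python sketch below. It swaps the lookaround expression for a capturing group, "([^"]*)", so that several quoted strings on one line are returned separately; the function name is made up and the sample text is the one from the question.

```python
import re

def quoted_strings(text):
    # "([^"]*)" captures the text between each pair of double quotes.
    return re.findall(r'"([^"]*)"', text)

sample = 'blablabla "Important" blabla\nblabla "Again important" blablabla\n'
print("\n".join(quoted_strings(sample)))
```

Writing the joined result to a new file instead of printing it gives the file of matches described above.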
Try something like this:
[^"\r\n]+"([^"]+)"[^"\r\n]+
And replace with $1. The above regex assumes there will be only 2 double quotes in each line.
[^"]+ matches non-quote characters.
[^"\r\n]+ matches non-quote, non newline characters.
regex101 demo
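The behaviour of this replacement can be tried outside Notepad++ with a short Python sketch. Python writes the replacement as \1 instead of $1, and the names LINE and keep_quoted are made up:

```python
import re

LINE = re.compile(r'[^"\r\n]+"([^"]+)"[^"\r\n]+')

def keep_quoted(text):
    # Replace each "prefix "quoted" suffix" run with just the quoted part.
    return LINE.sub(r"\1", text)

sample = 'blablabla "Important" blabla\nblabla "Again important" blablabla'
print(keep_quoted(sample))
```

Since both outer classes use +, a line that has no text before or after the quoted part is not matched and stays as it is.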
Hard to be certain from your post, but I think you may want the following (see the update below):
<(?<=")(.*)(?=")
The part you keep will be captured as \2.
(?<=")(.*)(?=")
\1 \2 \3
Your original regex string uses parentheses to group characters for evaluation. Parentheses ALSO group characters for capturing. That is what I added.
Update:
The regex pattern you provided doesn't seem to work correctly. Won't this work?
\"(.*)\"
\1 now captures the content.