Regular Expression: Replace values according to a translation table - regex

How can I replace a list of values like
married
single
non
married
couple
to a list like this using a regular expression
Status 2
Status 1
non
Status 2
couple
? I know I can match each group with something like this
/(married|single)/gm
and that I can address the matched group by $1, $2, ... . But how can I address and/or if-else the group value in the replacement part to actually translate the values?
Edit
Let's say I have the values to replace in a MariaDB column marital in myTable. Then I can do something like
SELECT
marital,
REGEXP_REPLACE(REGEXP_REPLACE(marital,
"married", "Status 2")
, "single", "Status 1")
FROM myTable
to get the desired result. But is there a way to do this with just one REGEXP_REPLACE?
Thanks for your help!

You cannot do it with a single REGEXP_REPLACE because MariaDB doesn't support the required features in the third parameter.
You may do it using PHP with arrays: http://php.net/manual/en/function.preg-replace.php
or with callback: http://php.net/manual/en/function.preg-replace-callback.php
You may do it using Perl: How to replace a set of search/replace pairs?
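For comparison, the callback idea behind preg_replace_callback can be sketched in Python. This is only an illustration outside the database, not a MariaDB feature; the names TRANSLATION, PATTERN and translate are made up for the example, and values that are not in the table are left unchanged.

```python
import re

# Hypothetical translation table built from the question's example values.
TRANSLATION = {"married": "Status 2", "single": "Status 1"}

# One alternation built from the table keys; \b restricts matches to whole words.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, TRANSLATION)) + r")\b")

def translate(text: str) -> str:
    """Replace every table key found in text with its mapped value."""
    return PATTERN.sub(lambda m: TRANSLATION[m.group(1)], text)

values = ["married", "single", "non", "married", "couple"]
translated = [translate(v) for v in values]
print(translated)
```

The callback looks up the matched text in the table, which is the "if-else on the group value" step that a plain replacement string cannot express.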

Related

Select all substrings before a special character

In PostgreSQL, I have the following text type in a column value.
{186=>15.55255158, 21=>5123.43494408, 164=>0.0}
I would like to select the numbers before the => character and use the output in a subquery. So the output should be:
186
21
164
I tried several regex statements, but none of them worked. Any help would be appreciated.
You need to both use a regular expression, and a function to extract the values matched from the regular expression into a data set. The ~ operator is only used to match in a where clause. You need the REGEXP_MATCHES function.
SELECT REGEXP_MATCHES(your_column_name, '(\d+)=>', 'g')
FROM your_table_name
The 'g' option will return multiple matches, rather than just the first.
If you only need the first key in each value, you can use the simpler substring(column_name from '(\d+)=>'), which returns only the first match.
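The difference between "all matches" and "first match only" can be sketched in Python, outside PostgreSQL. Here re.findall plays the role of REGEXP_MATCHES with the 'g' option and re.search plays the role of substring; the variable names are illustrative only.

```python
import re

value = "{186=>15.55255158, 21=>5123.43494408, 164=>0.0}"

# All keys: every run of digits that is directly followed by "=>".
keys = re.findall(r"(\d+)=>", value)

# First key only: stop after the first match.
first = re.search(r"(\d+)=>", value)
first_key = first.group(1) if first else None

print(keys, first_key)
```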

How can I use regular expressions to select text between commas?

I am using BigQuery on Google Cloud Platform to extract data from GDELT. This uses an SQL syntax and regular expressions.
I have a column of data (called V2Tone), in which each cell looks like this:
1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299
To select only the first number (i.e., the number before the first comma) using regular expressions, we use this:
regexp_replace(V2Tone, r',.*', '')
How can we select only the second number (i.e., the number between the first and second commas)?
How about the third number (i.e., the number between the second and third commas)?
I understand that re2 syntax (https://github.com/google/re2/wiki/Syntax) is used here, but my understanding of how to put that all together is limited.
If anything is unclear, please let me know. Thank you for your help as I learn to use regular expressions.
The example below is for BigQuery Standard SQL and uses a very simple SPLIT approach
#standardSQL
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number
FROM `project.dataset.table`
If for some reason you need or want to use a regexp here, use the query below
#standardSQL
SELECT
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number
FROM `project.dataset.table`
Note the use of REGEXP_EXTRACT instead of REGEXP_REPLACE.
You can test both options with the dummy string from your question, as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number,
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number_re
FROM `project.dataset.table`
with output:
first_number | second_number | third_number | first_number_re | second_number_re | third_number_re | fifth_number_re
1.55763239875389 | 2.80373831775701 | 1.24610591900312 | 1.55763239875389 | 2.80373831775701 | 1.24610591900312 | 26.4797507788162
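The two approaches above can be compared in a small Python sketch. It is not BigQuery code: str.split stands in for SPLIT with SAFE_OFFSET, and re.match with the same pattern shape stands in for REGEXP_EXTRACT. The function names are hypothetical.

```python
import re
from typing import Optional

V2TONE = ("1.55763239875389,2.80373831775701,1.24610591900312,"
          "4.04984423676012,26.4797507788162,2.49221183800623,299")

def nth_by_split(s: str, n: int) -> Optional[str]:
    """Zero-based field n via splitting; None when out of range (like SAFE_OFFSET)."""
    parts = s.split(",")
    return parts[n] if 0 <= n < len(parts) else None

def nth_by_regex(s: str, n: int) -> Optional[str]:
    """Zero-based field n via the answer's pattern: skip n 'field,' prefixes, capture up to the next comma."""
    m = re.match(r"^(?:(?:.*?),){%d}(.*?)," % n, s)
    return m.group(1) if m else None

print(nth_by_split(V2TONE, 1), nth_by_regex(V2TONE, 4))
```

One difference worth noting: the regex requires a comma after the captured field, so in this sketch it finds nothing for the last field (299), while the split version returns it.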
I don't know of a single regex replace which could be used to isolate a single number in your CSV string, because, in general, we need to remove things on both sides of the match. But we can chain together two calls to regexp_replace. For example, if you wanted to target the third number in the CSV string, you could try this:
regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''),
r',.*', '')
The pattern I am using to strip off the first n numbers is this:
^(?:(?:\d+(?:\.\d+)?),){n}
This just removes a number, followed by a comma, n times, from the beginning of the string.
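The same two-step idea can be sketched in Python with re.sub in place of regexp_replace. The function name nth_number and the zero-based parameter n are assumptions made for the illustration.

```python
import re

V2TONE = ("1.55763239875389,2.80373831775701,1.24610591900312,"
          "4.04984423676012,26.4797507788162,2.49221183800623,299")

def nth_number(csv: str, n: int) -> str:
    """Return the number at zero-based position n using two chained replacements."""
    # Step 1: strip the first n "number," prefixes from the start of the string.
    rest = re.sub(r"^(?:\d+(?:\.\d+)?,){%d}" % n, "", csv)
    # Step 2: strip everything from the next comma to the end.
    return re.sub(r",.*", "", rest)

third = nth_number(V2TONE, 2)
print(third)
```

If the string has fewer than n commas, step 1 matches nothing and the function falls back to returning the first number, so callers may want to check the field count separately.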
Here is a solution with a single regex replace:
^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$
In the demo, \n is added to the negated character class to avoid matching across lines in multiline (m) mode.
Usage:
regexp_replace(V2Tone, r'^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$', r'\1')
Explanation:
([^,]+(?:,|$)){n} captures everything up to and including the next comma (or up to the end of the string) n times; group 1 keeps the last repetition
([^,]+(?:,|$))* captures the rest 0 or more times
^.*$ captures everything if we cannot match n times
And then, finally, we can reinsert the nth match using \1 (BigQuery writes group references in the replacement as \1; many other engines write $1).
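A single-substitution approach can be tried out in Python. The sketch below is a variation, not the exact pattern above: the counted group is non-capturing and a separate group captures the wanted field without its trailing comma. It assumes a single-line input and Python 3.5 or later, where an unmatched group in the replacement is treated as an empty string. The name nth_field is hypothetical.

```python
import re

V2TONE = ("1.55763239875389,2.80373831775701,1.24610591900312,"
          "4.04984423676012,26.4797507788162,2.49221183800623,299")

def nth_field(csv: str, n: int) -> str:
    """Keep only the field at zero-based position n; return '' if there is no such field."""
    # First alternative: skip n "field," prefixes, capture the next field, consume the rest.
    # Second alternative: consume the whole string when there are too few fields.
    pattern = r"^(?:[^,]*,){%d}([^,]*).*$|^.*$" % n
    return re.sub(pattern, r"\1", csv)

print(nth_field(V2TONE, 2))
```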

Using Regex to extract numeric values in Tableau

I am trying to pull out the numeric values (10004, 12245, 13456) from the following IDs:
10004a,
12v245, and
13456n
I can get the correct ID numbers with the exception of 12v245 ID, using the following regex code:
REGEXP_EXTRACT([ID], '([0-9]+)')
The 12v245 ID is only returning the first two digits. What am I missing in my code?
Your issue is that the function REGEXP_EXTRACT in Tableau requires exactly one capturing group.
The pattern [0-9]+ matches one contiguous block of digits. Because the ID 12v245 has a letter between its digits, it contains two separate blocks, 12 and 245, and only the first one is returned.
The workaround for this is to use a nested replace as follows:
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE([ID], '[\D]+',"")
, '[\D]+' , "")
, '[\D]+' , "")
Depending on the nature of your data you may want to add more replaces.
This issue is documented on the Tableau community so feel free to vote up for a better fix: https://community.tableau.com/ideas/4975#
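The "strip everything that is not a digit" idea can be sketched in Python. In Python, re.sub replaces every match by default, so one call is enough here; whether a single Tableau REGEXP_REPLACE call behaves the same way is not confirmed by this thread.

```python
import re

ids = ["10004a", "12v245", "13456n"]

# Remove every run of non-digit characters, keeping only the digits.
digits = [re.sub(r"\D+", "", value) for value in ids]
print(digits)
```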

Using OpenOffice sCalc - How to use IF function with REGEX capture and if true print capture to cell

I have a worksheet (OpenOffice sCalc) with many rows of data, MOST of them have a year enclosed in ()
One of the cells has this content: Mary had a little lamb, Sarah Josepha Hale (1830)
I would like to capture the year and save it in the cell to the right.
This stmt will tell me if a year is present:
=IF(COUNTIF(L115; ".*[(][0-9]{4,4}[)].*");"hooray"; "boo")
When I try to replace "Hooray" with $1 in this stmt I get an error:
=IF(COUNTIF(L115; ".*([(][0-9]{4,4}[)]).*");$1; "boo")
I get this: #REF!
What is the correct syntax? Thank you in advance!
Regex capturing is possible in Search/replace (must be enabled under "More Options"), but I don't know if you can use capturing in formulae.
An alternative way:
=VALUE(MID(L115;FIND("(";L115)+1;4))
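For readers who can preprocess the data outside the spreadsheet, the capture the question asks for can be sketched in Python. This is not a Calc formula; extract_year is a hypothetical helper, and it returns None where the question's formula would print "boo".

```python
import re
from typing import Optional

def extract_year(cell: str) -> Optional[int]:
    """Return the four-digit year found inside parentheses, or None if absent."""
    m = re.search(r"\((\d{4})\)", cell)
    return int(m.group(1)) if m else None

year = extract_year("Mary had a little lamb, Sarah Josepha Hale (1830)")
print(year)
```

Unlike the MID/FIND formula, this only accepts exactly four digits directly enclosed in parentheses, so a cell such as "Title (anon.)" yields None rather than an error.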

Format all IP-Addresses to 3 digits

I'd like to use the search & replace dialogue in UltraEdit (Perl Compatible Regular Expressions) to format a list of IPs into a standard Format.
The list contains:
192.168.1.1
123.231.123.2
23.44.193.21
It should be formatted like this:
192.168.001.001
123.231.123.002
023.044.193.021
The RegEx from http://www.regextester.com/regular+expression+examples.html for IPv4 in the PCRE-Format is not working properly:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]){3}$
I'm stuck. Does anybody have a proper solution which works in UltraEdit?
Thanks in advance!
Set the regular expression engine to Perl (in the advanced section) and replace this:
(?<!\d)(\d\d?)(?!\d)
with this:
0$1
twice. That should do it.
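The effect of running that replacement twice can be checked with a short Python sketch. Python writes the group reference as \1 instead of $1; the names PAD and pad_octets are made up for the example.

```python
import re

# A one- or two-digit number that is not part of a longer run of digits.
PAD = re.compile(r"(?<!\d)(\d\d?)(?!\d)")

def pad_octets(text: str) -> str:
    """Prefix short octets with 0, twice, so 1 becomes 01 and then 001."""
    for _ in range(2):
        text = PAD.sub(r"0\1", text)
    return text

ips = ["192.168.1.1", "123.231.123.2", "23.44.193.21"]
padded = [pad_octets(ip) for ip in ips]
print(padded)
```

The second pass is needed because a single-digit octet only gains one zero per pass; octets that already have three digits never match, so they are left alone.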
If your input is a single IP address (per line) and nothing else (no other text), this approach will work:
I used "Replace All" with Perl style regular expressions:
Replace (?<!\d)(?=\d\d?(?=[.\s]|$))
with 0
Just replace as often as it matches. If there is other text, things will get more complicated. Maybe the "Search in Column" option is helpful here, in case you are dealing with CSV.
If this is just a one-off data cleaning job, I often just use Excel or OpenOffice Calc for this type of thing:
Open your textfile and make sure only one IP address per line.
Open Excel or whatever and go to "Data|Import External Data" and import your textfile using "." as the separator.
You should now have 4 columns in excel:
192 | 168 | 1 | 1
Right click and format each column as a number with 3 digits and leading zeroes.
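In column 5, concatenate the previous columns with a "." between them. Plain concatenation uses the underlying value rather than the displayed format, so wrap each cell in TEXT to keep the leading zeroes:
=TEXT(A1,"000") & "." & TEXT(B1,"000") & "." & TEXT(C1,"000") & "." & TEXT(D1,"000")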
This obviously is a cheap and dirty fix and is not a programmatic way of dealing with this, but I find this sort of technique useful for cleaning up data every now and then.
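The split, pad and rejoin steps of the spreadsheet approach map directly onto a few lines of Python, shown here as a minimal sketch for a one-off cleanup. It assumes one IP address per line and does no validation; pad_ip is a hypothetical name.

```python
def pad_ip(ip: str) -> str:
    """Split on '.', left-pad each part with zeroes to 3 digits, and rejoin."""
    return ".".join(octet.zfill(3) for octet in ip.strip().split("."))

lines = ["192.168.1.1", "123.231.123.2", "23.44.193.21"]
formatted = [pad_ip(line) for line in lines]
print(formatted)
```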
I'm not sure how you can use a regular expression in the Replace With box in UltraEdit.
You can use this regular expression to find your string:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])$