MariaDB REGEXP_REPLACE backreferences do not work in FUNCTION

I have this simple expression:
SELECT REGEXP_REPLACE('aaa 123 bbb 456 ccc', '([0-9]+)', LPAD('\\1', 10, '0'));
+-------------------------------------------------------------------------+
| REGEXP_REPLACE('aaa 123 bbb 456 ccc', '([0-9]+)', LPAD('\\1', 10, '0')) |
+-------------------------------------------------------------------------+
| aaa 00000000123 bbb 00000000456 ccc |
+-------------------------------------------------------------------------+
But if I put this into a FUNCTION:
CREATE FUNCTION test()
RETURNS VARCHAR(4000)
BEGIN
RETURN REGEXP_REPLACE('aaa 123 bbb 456 ccc', '([0-9]+)', LPAD('\\1', 10, '0'));
END
It doesn't work:
SELECT test();
+-----------------------------------+
| test() |
+-----------------------------------+
| aaa 0000000001 bbb 0000000001 ccc |
+-----------------------------------+
How can I fix it? (10.1.38-MariaDB-0+deb9u1)

Use '\1' instead of '\\1'....
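A side note on why the plain SELECT pads to 11 digits rather than 10: LPAD() runs before REGEXP_REPLACE() ever sees its replacement argument, so what gets padded is the two-character text \1, not the matched digits. Here is a minimal Python sketch of the same ordering. It is only an analogy: re.sub and str.rjust stand in for REGEXP_REPLACE and LPAD, and none of this is MariaDB code.

```python
import re

text = "aaa 123 bbb 456 ccc"

# The padding runs first, on the literal two characters backslash + "1",
# so the replacement template becomes eight zeros followed by \1.
template = r"\1".rjust(10, "0")
eager = re.sub(r"([0-9]+)", template, text)

# Padding each match itself needs a callback that sees the matched digits.
padded = re.sub(r"[0-9]+", lambda m: m.group(0).rjust(10, "0"), text)

print(eager)   # aaa 00000000123 bbb 00000000456 ccc
print(padded)  # aaa 0000000123 bbb 0000000456 ccc
```

The first result matches the 11-digit output shown in the question. The second is what a true "pad each number to 10 digits" would look like.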

Extract string pattern in new variable based on other string variable

Consider the following variable:
clear
input str18 string
"abc bcd cde"
"def efg fgh"
"ghi hij ijk"
end
I can use the regexm() function to flag the observations that contain abc, cde or def:
generate new = regexm(string, "abc|cde|def")
list
| string        new |
|-------------------|
| abc bcd cde     1 |
| def efg fgh     1 |
| ghi hij ijk     0 |
How can I get the following?
| string        wanted  |
|-----------------------|
| abc bcd cde   abc cde |
| def efg fgh   def     |
| ghi hij ijk           |
This question is an extension of the one answered here:
Create new string variable with partial matching of another
I read this as your having a list of allowed words and wanting the words in each string that occur among the allowed words.
It is fashionable to seek a fancy regular expression solution for such problems, but your example at least yields to a plain loop over the words that exist. Be aware, however, that inlist() has advertised limits.
clear
input str18 string
"abc bcd cde"
"def efg fgh"
"ghi hij ijk"
end
generate wanted = ""
generate wc = wordcount(string)
summarize wc, meanonly
quietly forvalues j = 1/`r(max)' {
replace wanted = wanted + " " + word(string, `j') if inlist(word(string, `j'), "abc", "cde", "def")
}
replace wanted = trim(wanted)
list
     +----------------------------+
     | string        wanted    wc |
     |----------------------------|
  1. | abc bcd cde   abc cde    3 |
  2. | def efg fgh   def        3 |
  3. | ghi hij ijk              3 |
     +----------------------------+
This is the solution using a regular expression:
clear
input str18 string
"abc bcd cde"
"def efg fgh"
"ghi hij ijk"
end
generate wanted = ustrregexra(string, "(\b((?!(abc|cde|def))\w)+\b)", " ")
replace wanted = strtrim(stritrim(wanted))
list
     +-----------------------+
     | string        wanted  |
     |-----------------------|
  1. | abc bcd cde   abc cde |
  2. | def efg fgh   def     |
  3. | ghi hij ijk           |
     +-----------------------+
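For comparison, the same selection can be phrased as "keep the allowed words" instead of "delete everything else". This is a sketch in Python, not Stata; the names rows, allowed and wanted are made up for illustration, and the word-boundary behaviour may differ in detail from the ustrregexra() pattern above.

```python
import re

rows = ["abc bcd cde", "def efg fgh", "ghi hij ijk"]
allowed = ["abc", "cde", "def"]

# \b...\b keeps only whole-word matches of the allowed words.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, allowed)) + r")\b")

# Join the matches found in each row back into one space-separated string.
wanted = [" ".join(pattern.findall(row)) for row in rows]

print(wanted)  # ['abc cde', 'def', '']
```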

Hive: Extract string between first and last occurrence of a character

I have a Hive table column that holds strings separated by '-', and I need to extract the string between the first and last occurrence of '-'.
+------------------+
| col1             |
+------------------+
| abc-123-na-00-sf |
| 123-abc-01-sd    |
| 123-abcd-sd      |
+------------------+
Required output:
+-----------+
| col1 |
+-----------+
| 123-na-00 |
| abc-01 |
| abcd |
+-----------+
Please suggest some regex to extract the desired output.
Thanks
with t as (select explode(array('abc-123-na-00-sf','123-abc-01-sd','123-abcd-sd')) as str)
select regexp_extract (str,'-(.*)-',1)
from t
;
123-na-00
abc-01
abcd
or
with t as (select explode(array('abc-123-na-00-sf','123-abc-01-sd','123-abcd-sd')) as str)
select regexp_extract (str,'(?<=-).*(?=-)',0)
from t
;
123-na-00
abc-01
abcd
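Both queries rely on .* being greedy: it runs to the end of the string and backs off only as far as the last '-'. Here is a small Python sketch of the same two patterns, using Python's re module rather than Hive's regexp_extract. Only the patterns are carried over.

```python
import re

values = ["abc-123-na-00-sf", "123-abc-01-sd", "123-abcd-sd"]

# Greedy .* stretches to the last '-', so group 1 spans first to last hyphen.
between = [re.search(r"-(.*)-", v).group(1) for v in values]

# Lookaround version: the whole match (group 0) excludes both hyphens.
between_lookaround = [re.search(r"(?<=-).*(?=-)", v).group(0) for v in values]

print(between)             # ['123-na-00', 'abc-01', 'abcd']
print(between_lookaround)  # ['123-na-00', 'abc-01', 'abcd']
```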

How to delete prefix, suffix in a string matching a pattern and split on a character using sed?

I have the following string, which is the output of a cassandra query in bash
col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d (3 rows)
I want to strip everything from the beginning of this string up to the last + symbol, and then also strip the tail end, which is (XYZ rows).
So, the string becomes A|1|a B|2|b C|3|c D|4|d. Now, I want to split this string into multiple arrays that look like this
A 1 a
B 2 b
C 3 c
D 4 d
so that I can iterate over each row using a for loop to do some processing. The number of rows can vary.
How can I do this using sed or grep?
I tried this for the first pass but it didn't work:
echo $string | sed 's/([0-9])rows//' | sed 's/[^+]//'
NOTE: the column strings can have multiple spaces in them
Example: the output of the CQL query, when written to a file, is
topic | partition | offset
---------+-----------+--------
topic_2 | 31 | 4
topic_2 | 30 | 4
topic_2 | 29 | 4
topic_2 | 28 | 4
topic_2 | 27 | 4
topic_2 | 26 | 4
topic_2 | 25 | 4
topic_2 | 24 | 4
topic_2 | 23 | 4
topic_2 | 22 | 4
topic_2 | 21 | 4
topic_2 | 20 | 4
topic_2 | 19 | 4
topic_2 | 18 | 4
topic_2 | 17 | 4
topic_2 | 16 | 4
topic_2 | 15 | 4
topic_2 | 14 | 4
topic_2 | 13 | 4
topic_2 | 12 | 4
topic_2 | 11 | 4
topic_2 | 10 | 4
topic_2 | 9 | 4
topic_2 | 8 | 4
topic_2 | 7 | 4
topic_2 | 6 | 4
topic_2 | 5 | 4
topic_2 | 4 | 4
topic_2 | 3 | 4
topic_2 | 2 | 4
topic_2 | 1 | 4
topic_2 | 0 | 4
(32 rows)
$ sed 's/[^+]*[+]*\(.*[^ ]\) *(.*)$/\1/;y/ |/\n /' <<< 'col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d (3 rows)'
A 1 a
B 2 b
C 3 c
D 4 d
The substitution does the following (hat tip to potong for pointing out how to get rid of one more substitution):
s/
[^+]* # Match non-plusses
[+]* # Followed by plusses
\( # Capture the next group
.* # Any characters (greedily)
[^ ] # that end with a non-space
\) # End of capture group
* # Spaces
(.*) # Followed by whatever in parentheses
$/\1/ # Replace all that by the capture group
resulting in this intermediate stage:
$ sed 's/[^+]*[+]*\(.*[^ ]\) *(.*)$/\1/' <<< 'col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d (3 rows)'
A|1|a B|2|b C|3|c D|4|d
The transformation (y///) turns all spaces into newlines and pipes into spaces.
Spaces other than the ones separating rows
If there are spaces within columns and we assume that each entry has the format
[spaces]entry[spaces]
i.e., exactly two sets of spaces per entry, we have to replace the transformation y/// with another substitution,
s/\([^ |]\)\( \+[^ |]\)/\1\n\2/g
This looks for spaces following not a space or pipe and followed by not a space or pipe, and inserts a newline before those spaces. Result:
$ var='col1 | col2 | col3 +++++++++++ A | 1 | a B | 2 | b C | 3 | c D | 4 | d (3 rows)'
$ sed 's/[^+]*[+]*\(.*[^ ]\) *(.*)$/\1/;s/\([^ |]\)\( \+[^ |]\)/\1\n\2/g' <<< "$var"
A | 1 | a
B | 2 | b
C | 3 | c
D | 4 | d
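If the end goal is arrays to loop over, the same two steps can be sketched in Python. This is only an illustration of the logic, not a sed replacement, and it shares the assumption above: a run of spaces between two non-pipe characters is taken as a row boundary, so column values that themselves contain spaces would still be split.

```python
import re

raw = "col1 | col2 | col3 +++++++++++ A | 1 | a B | 2 | b C | 3 | c D | 4 | d (3 rows)"

# Same trimming as the sed substitution: keep what lies between the plusses
# and the trailing parenthesised footer.
body = re.sub(r"^[^+]*\+*(.*[^ ]) *\(.*\)$", r"\1", raw).strip()

# Spaces that sit between two non-space, non-pipe characters separate rows.
rows = re.split(r"(?<=[^ |]) +(?=[^ |])", body)

# Within a row, pipes separate cells; strip the padding around each cell.
cells = [[c.strip() for c in row.split("|")] for row in rows]

print(cells)  # [['A', '1', 'a'], ['B', '2', 'b'], ['C', '3', 'c'], ['D', '4', 'd']]
```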
echo 'col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d (3 rows)' |
sed -r "s/^.*\+//;s/\(.* rows\)//;s/ /\n/g;s/\|/ /g"
A 1 a
B 2 b
C 3 c
D 4 d
There are 4 substitutions:
turn from start until last plus (greedy) into nothing
turn parens, ending in 'rows' into nothing
replace blanks with newlines
make pipe characters blanks (order of commands matters)
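The same four steps, written out in Python as a rough sketch (Python is used only to make each step inspectable; the variable names are made up):

```python
import re

raw = "col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d (3 rows)"

# 1. Greedy match from the start through the last '+', replaced by nothing.
body = re.sub(r"^.*\+", "", raw)
# 2. Remove the parenthesised footer ending in 'rows'.
body = re.sub(r"\(.* rows\)", "", body)
# 3 and 4. Blanks separate rows, pipes separate columns.
rows = [row.split("|") for row in body.split()]

for row in rows:
    print(" ".join(row))
```

Splitting on blanks before splitting on pipes mirrors the point about ordering: once the pipes have become blanks, rows and columns can no longer be told apart.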
You can use sed together with the xargs -n flag to group the values; xargs runs echo by default:
echo "A|1|a B|2|b C|3|c D|4|d" | sed 's/|/ /g;s/ / /g' | xargs -n 3
A 1 a
B 2 b
C 3 c
D 4 d

awk remove substring using regex

I have a pipe-delimited file that looks like this:
34ab1 | aaa bbb ccc fff vf | 2015-01-01
35ab1 | aaa bbb ccc dddefd ddff ssss fff vi | 2015-01-01
I want to remove everything that starts with bbb and ends with fff.
I used this:
BEGIN {
FS = OFS = "|"
}
{
sub(/[0-9].*[0-9]/, "", $2); sub(/bbb.*fff/, "", $2);
print
}
The regex part for the numbers worked, but the second part of the regex didn't.
Output I want:
34ab1 | aaa vf | 2015-01-01
35ab1 | aaa vi | 2015-01-01
Use a single gsub call, with both patterns joined by alternation.
BEGIN {
FS = OFS = "|"
}
{
gsub(/[0-9].*[0-9]|bbb.*fff/, "", $2);
print
}
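To check what the alternation does to the sample lines, here is the same idea sketched in Python (not awk; the clean helper is a made-up name). Note that bbb.*fff is greedy, so on the second line it runs to the last fff.

```python
import re

lines = [
    "34ab1 | aaa bbb ccc fff vf | 2015-01-01",
    "35ab1 | aaa bbb ccc dddefd ddff ssss fff vi | 2015-01-01",
]

def clean(line):
    fields = line.split("|")
    # One alternation covers both patterns; only the second field is touched.
    fields[1] = re.sub(r"[0-9].*[0-9]|bbb.*fff", "", fields[1])
    return "|".join(fields)

cleaned = [clean(line) for line in lines]

for line in cleaned:
    print(line)
```

The removed text leaves its surrounding blanks behind, so aaa and vf end up separated by two spaces rather than one.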

How to change case of back references?

I'm trying to modify a back reference in PowerShell but am having no luck :(
This is my example:
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"`$2`".ToUpper()) | `$1 |"
If I run it I get this:
| Jane Doe | 456 |
But I'm really expecting this:
| JANE DOE | 456 |
If I run the following (the same as above but without the '()' on the call to ToUpper):
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"`$2`".ToUpper) | `$1 |"
I get this:
| string ToUpper(), string ToUpper(System.Globalization.CultureInfo culture) | 456 |
So it would appear that PowerShell knows that the back reference '$2' is a string but why can't I get PowerShell to convert it to upper case?
Terry
[Regex]::Replace('456,Jane Doe',
'^(\d{3}),(.*)$',
{
param($m)
'| ' + $m.Groups[2].Value.ToUpper() + ' | ' + $m.Groups[1].Value + ' |'
}
)
Not very pretty, I admit. And sadly, in Windows PowerShell you cannot use a script block as the replacement in the -replace operator (PowerShell 6 and later do accept one).
Just to explain what is happening, in "| $(`"`$2`".ToUpper()) | `$1 |" PowerShell is evaluating the $(...) subexpression before passing the string to the -replace operator, rather than after the replace operation has occurred.
In other words, ToUpper is called on the literal string "$2", resulting in | $2 | $1 | being used for the replace operation. You can see this by including a letter in the subexpression string, for example:
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"zz `$2`".ToUpper()) | `$1 |"
This has an effective replace string of | ZZ $2 | $1 |, giving | ZZ Jane Doe | 456 | as the result.
Similarly, the second version omitting the parentheses, "| $(`"`$2`".ToUpper) | `$1 |", is evaluated as "some string".ToUpper, which puts the array of overload definitions for the ToUpper method on System.String in the replace string.
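The same evaluation order can be reproduced outside PowerShell. Here is a minimal Python sketch, by analogy only: re.sub stands in for -replace and Regex.Replace, \2 for $2, and to_row is a made-up callback name.

```python
import re

pattern = r"^(\d{3}),(.*)$"
text = "456,Jane Doe"

# upper() runs on the two-character token \2 before any matching happens.
# Backslash and "2" have no uppercase form, so the template is unchanged.
template = "| " + r"\2".upper() + r" | \1 |"
eager = re.sub(pattern, template, text)

# A callback receives the match object, so upper() sees the captured text.
def to_row(m):
    return "| " + m.group(2).upper() + " | " + m.group(1) + " |"

late = re.sub(pattern, to_row, text)

print(eager)  # | Jane Doe | 456 |
print(late)   # | JANE DOE | 456 |
```

The callback plays the role of the MatchEvaluator: it is the only place where code runs after the match has been made.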
To keep the replace operation as a one-liner, Joey's answer using the MatchEvaluator overload to Regex.Replace works well. Or you might do the string formatting yourself based on the results of a -match:
if( '456,Jane Doe' -match '^(\d{3}),(.*)$' ) {
'| {0} | {1} |' -f $matches[2].ToUpper(),$matches[1]
}
If this needs to be replaced in the context of a larger string, you can always do a literal replace to get the final result:
PS> $r = '| {0} | {1} |' -f $matches[2].ToUpper(),$matches[1]
PS> 'A longer string with 456,Jane Doe in it.'.Replace( $matches[0], $r )
A longer string with | JANE DOE | 456 | in it.