Hive regexp_replace - regex

My use case is the following:
String text_string: "text1:message1,text3:message3,text2:message2,..."
select regexp_replace(text_string, '[^:]*:([^,]*(,|$))', '$1')
Correct output: message1,message3,message2,...
The pattern works, but the problem is that if there is a ":" or "," character in the message, the replace doesn't work.
So I tried to use "::" and ",," as separators in the string:
String text_string: "text1::message1,,text3::message3,,text2::message2,..."
select regexp_replace(text_string, '[^::]*::([^,,]*(,,|$))', '$1')
Correct output: message1,,message3,,message2,,...
But in this case too, if there is a single ":" or "," character anywhere in the string (in the text or in the message), the replace doesn't work.
How should the regular expression be modified to make this work?

Delimiters cannot be characters that are likely to appear in the data. Since you have control over the format, consider pipes '|' or tildes '~'. Only you can come up with the right characters by analyzing the data.
If you can't do that, then you'll need to put quotes around the data that contains the delimiter character and come up with a way to deal with that.
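For illustration, here is a minimal sketch of the delimiter idea in Python's re module (the '~'/'|' re-encoding is only an assumption; a Hive regexp_replace call would use the same pattern, with $1 as the replacement):

import re

# Input re-encoded with the assumed delimiters: '~' after the key, '|' between pairs.
text_string = "text1~message1|text3~message3|text2~message2"

# Drop everything up to and including the '~'; keep the message and the trailing '|'.
print(re.sub(r'[^~]*~([^|]*(\||$))', r'\1', text_string))
# -> message1|message3|message2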

REGEX_TOO_COMPLEX error when parsing regex expression

I need to split a CSV file at commas, but the problem is that the file can contain commas inside fields. For example:
one,two,tree,"four,five","six,seven".
The file uses double quotes to escape such commas, but I could not solve it.
I tried something like this with the regex below, but I got a REGEX_TOO_COMPLEX error:
data: lv_sep type string,
lv_rep_pat type string.
data(lv_row) = iv_row.
"Define a separator to replace commas in double quotes
lv_sep = cl_abap_conv_in_ce=>uccpi( uccp = 10 ).
concatenate '$1$2' lv_sep into lv_rep_pat.
"replace all commas that are separator with the new separator
replace all occurrences of regex '(?:"((?:""|[^"]+)+)"|([^,]*))(?:,|$)' in lv_row with lv_rep_pat.
split lv_row at lv_sep into table rt_cells.
You must use this regex: ,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
DATA: lv_sep TYPE string,
lv_rep_pat TYPE string.
DATA(lv_row) = 'one,two,tree,"four,five","six,seven"'.
"Define a separator to replace commas in double quotes
lv_sep = cl_abap_conv_in_ce=>uccpi( uccp = 10 ).
CONCATENATE '$1$2' lv_sep INTO lv_rep_pat.
"replace all commas that are separator with the new separator
REPLACE ALL OCCURRENCES OF REGEX ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)' IN lv_row WITH lv_rep_pat.
SPLIT lv_row AT lv_sep INTO TABLE DATA(rt_cells).
LOOP AT rt_cells INTO DATA(cells).
WRITE cells.
SKIP.
ENDLOOP.
Testing output
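The lookahead trick itself is easy to sanity-check outside ABAP; here is a minimal Python sketch of the same split-on-commas-outside-quotes idea:

import re

row = 'one,two,tree,"four,five","six,seven"'

# Split on commas followed by an even number of quotes up to the end of the row,
# i.e. commas that are not inside a quoted field.
print(re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', row))
# -> ['one', 'two', 'tree', '"four,five"', '"six,seven"']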
I have never touched ABAP, so please treat this as pseudocode.
I'd recommend using a non-regex solution here:
data: checkedOffsetComma type i,
checkedOffsetQuotes type i,
baseOffset type i,
testString type string value 'val1, "val2, val21", val3'.
LOOP AT SomeFancyConditionYouDefine.
checkedOffsetComma = baseOffset.
checkedOffsetQuotes = baseOffset.
find FIRST OCCURRENCE OF ','(or end of line here) in testString match OFFSET checkedOffsetComma.
write checkedOffsetComma.
find FIRST OCCURRENCE OF '"' in testString match OFFSET checkedOffsetQuotes.
write checkedOffsetQuotes.
*if the next comma is closer than the next quotes
IF checkedOffsetComma < checkedOffsetQuotes.
REPLACE SECTION checkedOffsetComma 1 OF ',' WITH lv_rep_pat.
baseOffset = checkedOffsetComma.
ELSE.
*if we found quotes, we go to the next quotes afterwards and then continue as before after that position
find FIRST OCCURRENCE OF '"' in testString match OFFSET checkedOffsetQuotes.
write baseOffset.
ENDIF.
ENDLOOP.
This assumes that there are no nested quotes. I didn't test or validate it in any way; I'd be happy if it at least partly compiles :)
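The same non-regex idea, walking the string and toggling an in-quotes flag, looks roughly like this in Python (a sketch under the same no-nested-quotes assumption):

def split_outside_quotes(row, sep=','):
    cells, current, in_quotes = [], [], False
    for ch in row:
        if ch == '"':
            in_quotes = not in_quotes  # toggle on every quote; nested quotes are not handled
            current.append(ch)
        elif ch == sep and not in_quotes:
            cells.append(''.join(current))
            current = []
        else:
            current.append(ch)
    cells.append(''.join(current))
    return cells

print(split_outside_quotes('val1, "val2, val21", val3'))
# -> ['val1', ' "val2, val21"', ' val3']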

simple character replacement regex, tokenizing jsonpath

I am trying to tokenize a jsonpath string. For example, given a string like the following:
$.sensor.subsensor[0].foo[1][2].temp
I want ["$", "sensor", "subsensor", "0", "foo", "1", "2", "temp"]
I am terrible with regexes, but I managed to come up with the following, which according to regexr matches ".", "[" and "]". Assume the jsonpath string does not contain slices, wildcards, unions or recursive descent.
[\.\[\]]
I am planning to match all ".", "[" and "]" characters and replace them with ";", then split on ";".
The problem with the above regex is that in certain instances I will get ";;":
$;sensor;subsensor;0;;foo;1;;2;;temp
Is there a way I can, in a single regex, replace ".", "[" and "]" as well as ".[", "]." and "][" with ";"? Do I need to check for these groups explicitly, or do I need to run the sequence through two regexes?
No need to transform ".", "[" and "]" into ";" first; just split directly:
console.log('$.sensor.subsensor[0].foo[1][2].temp'.split(/[.[\]]+/));
You can use this code to avoid the double semicolons:
console.log(
'$.sensor.subsensor[0].foo[1][2].temp'.replace(/[\.\[\]]+/g, ';')
)
Got a decent solution.
/(\.|\].|\[)/g
Apparently when you use [] in your regex it matches only a single character, which is why sequences like "]." become ";;". Using () with alternation lets you match multi-character sequences, and the group above just enumerates the possibilities.
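For comparison, here is the character-class-plus-quantifier approach in Python (my own sketch of the same idea as the answers above): a + after the class makes runs like "][" collapse into a single separator, so no ";;"-style empty tokens appear for this input:

import re

path = '$.sensor.subsensor[0].foo[1][2].temp'

# One or more of . [ ] in a row count as a single separator,
# so ".[", "]." and "][" never produce empty tokens here.
print(re.split(r'[.\[\]]+', path))
# -> ['$', 'sensor', 'subsensor', '0', 'foo', '1', '2', 'temp']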

Replace every " with \" in Lua

X-Problem: I want to dump an entire Lua script to a single string line, which can be compiled into a C program afterwards.
Y-Problem: How can you replace every " with \" ?
I think it makes sense to try something like this
data = string.gsub(line, "c", "\c")
where c is the " character. But of course this does not work.
You need to escape both quotes and backslashes, if I understand your Y problem:
data = string.gsub(line, "\"", "\\\"")
or use the other single quotes (still escape the backslash):
data = string.gsub(line, '"', '\\"')
A solution to your X-Problem is to safely escape any sequence that could interfere with the interpreter.
Lua has the %q option for string.format that formats and escapes the provided string in such a way that it can safely be read back by Lua. The same should hold for your C interpreter.
Example string: This \string's truly"tricky
If you just enclosed it in either single or double-quotes, there'd still be a quote that ended the string early. Also there's the invalid escape sequence \s.
Imagine this string was already properly handled in Lua, so we'll just pass it as a parameter:
string.format("%q", 'This \\string\'s truly"tricky')
returns (notice, I used single-quotes in code input):
"This \\string's truly\"tricky"
Now that's a completely valid Lua string that can be written and read from a file. No need to manually escape every special character and risk implementation mistakes.
To correctly implement your Y approach of escaping (invalid) characters with \, use proper pattern matching to replace the captured string with a prefix plus the captured string:
string.gsub('he"ll"o', "[\"']", "\\%1") -- will prepend backslash to any quote

Regular expression exclusion in PostgreSQL

I have to split some string in PostgreSQL on ',' but not on '\,' (backslash is escape character).
For example, regexp_split_to_array('123,45,67\,89', ???) must split the string to array {123, 45, "67\,89"}.
What I have done already: E'(?<!3),' works with '3' as the escape character. But how can I use the backslash instead of 3?
Does not work:
E'(?<!\),' does not split the string at all
E'(?<!\\),' throws error "parentheses () not balanced"
E'(?<!\ ),' (with space) splits on all ',' including '\,'
E'(?<!\\ ),' (with space) splits on all ',' too.
The letter E in front of the text means a C-style escape string, and then you must escape twice: once for the C string and once for the regexp.
Try with and without E:
regexp_split_to_array('123,45,67\,89', '(?<!\\),')
regexp_split_to_array('123,45,67\,89', E'(?<!\\\\),')
Here is a running example at http://rextester.com/VEE84838 (unnest() is just for row-by-row display of the results):
select unnest(regexp_split_to_array('123,45,67\,89', '(?<!\\),'));
select unnest(regexp_split_to_array('123,45,67\,89', E'(?<!\\\\),'));
You can also split it into groups first:
(\d+),(\d+\,\d+)?
(and later concatenate them with a comma)
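As a quick sanity check of the lookbehind itself, here is the same split in Python (a minimal sketch; note the raw string still needs a doubled backslash for the regex engine):

import re

s = r'123,45,67\,89'

# Split on commas that are not preceded by a backslash.
print(re.split(r'(?<!\\),', s))
# -> ['123', '45', '67\\,89']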

Regex parse a command line string but don't return spaces between quotes

I am using python to parse a string that is passed in by the optparse module.
I want to split the string on certain delimiters but not in between quote marks.
A sample string is:
--state-basedir /dir/dir/dir/ --cmd=\"param load $v2param\" --master=/dev/ttyUSB0 --console --map --out=udp:192.168.1.1:14550
This string is passed in as a single optparse argument; I am then going to pass it to another process.
I have been trying various things at http://pythex.org/
The closest I have gotten is:
`(?<!")[\s=](?![\s0-9a-zA-Z\$\\]*")`
The issue is that the = sign after --cmd and the space before --master are not matched.
In plain English, this is how I am reading my regex:
match either a space character or an equals character, as long as it is not preceded by a quotation mark and as long as it is not followed by a combination of any other letters, numbers, punctuation and another quotation mark
I had a feeling that there was something else I was missing, like greediness, so I tried adding ? after my look-ahead and look-behind terms. If I put a ? after the look-behind I can get the space before --master, but if I put the ? after the look-ahead term I also get the spaces inside the quotation marks, which I don't want.
The idea here is that I am going to use re.split to handle things.
Thanks for any explanations as to what I am doing wrong.
This is not a pure regex answer and it's also not pretty, but it is one line.
sum([[x] if '"' in x else re.split(' |=',x) for x in re.split('=(\".+?\" )',a)],[])
output:
['--state-basedir', '/dir/dir/dir/', '--cmd', '"param load $v2param" ', '--master', '/dev/ttyUSB0', '--console', '--map', '--out', 'udp:192.168.1.1:14550']
Starting from re.split('=(\".+?\" )', a): this splits out text surrounded by quotes (more specifically, ="something another thing" ). The split pieces are then split further with re.split(' |=', x) if they do not have a " in them, or are returned as-is ([x]) if they do. The last step is flattening the resulting 2D list with sum(two_d_list, []).
I hope this answer helps but I understand if it isn't what you're looking for
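For readability, here is the same one-liner expanded into named steps (equivalent logic, with the question's sample string assumed as the input a):

import re

a = '--state-basedir /dir/dir/dir/ --cmd="param load $v2param" --master=/dev/ttyUSB0 --console --map --out=udp:192.168.1.1:14550'

# Step 1: cut out the ="..." chunks, keeping each quoted part as its own piece.
pieces = re.split(r'=(".+?" )', a)

# Step 2: split the unquoted pieces on spaces and '=', keep quoted pieces whole.
tokens = []
for piece in pieces:
    if '"' in piece:
        tokens.append(piece)
    else:
        tokens.extend(re.split(' |=', piece))

print(tokens)
# -> ['--state-basedir', '/dir/dir/dir/', '--cmd', '"param load $v2param" ', '--master',
#     '/dev/ttyUSB0', '--console', '--map', '--out', 'udp:192.168.1.1:14550']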