Capturing a delimiter that isn't in between single quotes - regex

Like the question says, is it possible to use a single Regex string to get a delimiter that isn't in between some quotes?
For example, I want to split this string with the delimiter &:
"example=3&testing='f&tmp'"
should produce
["example=3", "testing='f&tmp'"]
Essentially, things inside single quotes (' ') should remain untouched.
I found out how to get things within quotes with expression: (?:'.*?')
The closest I could get to a tangible solution was: (.[^']&[^'])

It is not an easy task for a String#split, but is quite a feasible task for Matcher#find if you use
[^&\s=]+=(?:'[^']*'|[^\s&]*)
(see this regex demo) and this Java code:
String text = "example=3&testing='f&tmp'";
Pattern p = Pattern.compile("[^&\\s=]+=(?:'[^']*'|[^\\s&]*)");
Matcher m = p.matcher(text);
List<String> res = new ArrayList<>();
while(m.find()) {
res.add(m.group());
}
System.out.println(res);
// => [example=3, testing='f&tmp']
Details
[^&\s=]+ - one or more chars other than &, = and whitespace
= - a = char
(?:'[^']*'|[^\s&]*) - a non-capturing group matching either ', zero or more chars other than ' and then a ', or zero or more chars other than whitespace and &.
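The same match-instead-of-split idea can be sketched in JavaScript for comparison. This is only an illustration under the assumption that every item has the key=value shape shown in the question; the helper name splitPairs is hypothetical and not part of the answer above.

```javascript
// Match each key=value pair instead of splitting on "&".
// A value is either a single-quoted run or a run without whitespace and "&".
const pairPattern = /[^&\s=]+=(?:'[^']*'|[^\s&]*)/g;

// Hypothetical helper: returns all pairs, or an empty array when nothing matches.
function splitPairs(text) {
  // String#match with the g flag returns every match, or null if there are none
  return text.match(pairPattern) || [];
}

console.log(splitPairs("example=3&testing='f&tmp'"));
// => [ 'example=3', "testing='f&tmp'" ]
```

Because the quoted alternative is tried first, an & inside single quotes is consumed as part of the value and never treated as a delimiter.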

Related

regex to extract substring for special cases

I have a scenario where I want to extract a substring based on the following conditions.
search for any pattern myvalue=123& , extract myvalue=123
If "myvalue" is present at the end of the line without "&", extract myvalue=123
For example:
The string is abcdmyvalue=123&xyz => then it should return myvalue=123
The string is abcdmyvalue=123 => then it should return myvalue=123
For the first scenario it is working for me with the following regex - myvalue=(.*?(?=[&,""]))
I am looking for how to modify this regex to include my second scenario as well. I am using https://regex101.com/ to test this.
Thanks in advance!
Some notes about the pattern that you tried
if you want to only match, you can omit the capture group
e* matches 0+ times an e char
the part .*?(?=[&,""]) matches as few chars as possible until it can assert either & , or " to the right, so the positive lookahead expects a single char to be present to the right, and that is why it fails at the end of the line
You could shorten the pattern to a match only, using a negated character class that matches 0+ times any character except a whitespace char or &
myvalue=[^&\s]*
Regex demo
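A minimal sketch of the negated character class in use, assuming JavaScript as the host language (the question does not name one). The function name extractMyValue is hypothetical.

```javascript
// Hypothetical helper: returns "myvalue=..." up to the next "&" or whitespace,
// or null when the key does not occur in the line.
function extractMyValue(line) {
  const m = line.match(/myvalue=[^&\s]*/);
  return m ? m[0] : null;
}

console.log(extractMyValue("abcdmyvalue=123&xyz")); // myvalue=123
console.log(extractMyValue("abcdmyvalue=123"));     // myvalue=123
```

The class [^&\s]* simply stops at & or at the end of the string, so no lookahead is needed for the second scenario.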
function regex(data) {
var test = data.match(/=(.*)&/);
if (test === null) {
return data.split('=')[1]
} else {
return test[1]
}
}
console.log(regex('abcdmyvalue=123&3e')); //123
console.log(regex('abcdmyvalue=123')); //123
Here is working code: if there is no & after the value, match returns null and the if branch runs, where we simply split the string on = and take the value. If an & is present, the regex extracts the value between = and &.
If you want to do it with a regex alone, you can do it like this:
var test = data.match(/=(.*)&|=(.*)/)
const result = test[1] ? test[1] : test[2];
console.log(result);

Regex - change commas only in a portion of a string

I make a lot of changes to an original CSV string that contains many comma delimiters. I have to replace with ";" either only the commas inside the expression ||....|| or only the commas outside it. I need this change in order to have a different delimiter inside ||....|| than in the rest of the string.
Example:
(.*)(?:\|\|)(?:.*)(,)(?:.*)\|\|
After I use
var regex = /myregex/g;
var str = str.replace(regex, ',')
thanks
You can use
const string = "aba,bjlj,alj,ljlj||name1,name2,name3||jflkj,glfgjlf,jflg,fjlfd||name1,name2||fd,sdfsfd,dfs||name1,name2,name3,name4,name5||";
console.log( string.replace(/\|{2}[\w\W]*?\|{2}/g, (x) => x.replace(/,/g, ';')) );
The regex is
/\|{2}.*?\|{2}/gs // matches any text between two double pipes
/\|{2}[\w\W]*?\|{2}/g // matches any text between two double pipes
/\|{2}.*?\|{2}/g // matches any text but line breaks between two double pipes
Note the . does not match line breaks without the s modifier flag.
The regex matches double pipe, then any zero or more chars, as few as possible up to the next double pipe.
Then, x, the whole match value, is passed as an argument to the anonymous callback function used as a replacement argument, and all commas are replaced with ; only inside the matches.
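The note about line breaks can be checked with a small sketch. The sample string below is made up for illustration; it puts a line break between the two double pipes.

```javascript
// Made-up input: the text between the double pipes spans two lines
const multiline = "a,b||x,\ny||c,d";
const swapCommas = (m) => m.replace(/,/g, ";");

// "." cannot cross the line break, so nothing between the pipes is matched
const withDot = multiline.replace(/\|{2}.*?\|{2}/g, swapCommas);

// [\w\W] matches any char, including line breaks
const withAnyChar = multiline.replace(/\|{2}[\w\W]*?\|{2}/g, swapCommas);

console.log(withDot === multiline);       // true (unchanged)
console.log(JSON.stringify(withAnyChar)); // "a,b||x;\ny||c,d"
```

If the delimited parts never contain line breaks, the plain . version is enough.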
The "contrary" solution is to match and capture the strings between double pipes and only match commas in all other contexts so that you could keep the captures and replace those commas:
const string = "aba,bjlj,alj,ljlj||name1,name2,name3||jflkj,glfgjlf,jflg,fjlfd||name1,name2||fd,sdfsfd,dfs||name1,name2,name3,name4,name5||";
console.log( string.replace(/(\|{2}[\w\W]*?\|{2})|,/g, (x,y) => y || ';') );
Big Thanks.
I also found
var newStr = str.replace(/\|{2}.*?\|{2}/g, function(match) {
return match.replace(/,/g,";");
});
Do you think it is possible to do the contrary and change all the commas outside the ||...|| occurrences?

How do I replace the nth occurrence of a special character, say, a pipe delimiter with another in Scala?

I'm new to Spark using Scala and I need to replace every nth occurrence of the delimiter with the newline character.
So far, I have been successful at entering a new line after the pipe delimiter.
I'm unable to replace the delimiter itself.
My input string is
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
println(txt.replaceAll(".\\|", "$0\n"))
The above statement generates the following output.
January|
February|
March|
April|
May|
June|
July|
August|
September|
October|
November|
December
I referred to the suggestion at https://salesforce.stackexchange.com/questions/189923/adding-comma-separator-for-every-nth-character but when I enter the number in the curly braces, I only end up adding the newline after 2 characters after the delimiter.
I'm expecting my output to be as given below.
January|February
March|April
May|June
July|August
September|October
November|December
How do I change my regular expression to get the desired output?
Update:
My friend suggested I try the following statement
println(txt.replaceAll("(.*?\\|){2}", "$0\n"))
and this produced the following output
January|February|
March|April|
May|June|
July|August|
September|October|
November|December
Now I just need to get rid of the pipe symbol at the end of each line.
You want to move the 2nd bar | outside of the capture group.
txt.replaceAll("([^|]+\\|[^|]+)\\|", "$1\n")
//val res0: String =
// January|February
// March|April
// May|June
// July|August
// September|October
// November|December
Regex Explained (regex is not Scala)
( - start a capture group
[^|] - any character as long as it's not the bar | character
[^|]+ - 1 or more of those (any) non-bar chars
\\| - followed by a single bar char |
[^|]+ - followed by 1 or more of any non-bar chars
) - close the capture group
\\| - followed by a single bar char (not in capture group)
"$1\n" - replace the entire matching string with just the first $1 capture group ($0 is the entire matching string) followed by the newline char
UPDATE
For the general case of N repetitions, regex becomes a bit more cumbersome, at least if you're trying to do it with a single regex formula.
The simplest thing to do (not the most efficient but simple to code) is to traverse the String twice.
val n = 5
txt.replaceAll(s"(\\w+\\|){$n}", "$0\n")
.replaceAll("\\|\n", "\n")
//val res0: String =
// January|February|March|April|May
// June|July|August|September|October
// November|December
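For comparison, the general case can also be handled in one pass by building the pattern from n. The sketch below is in JavaScript rather than Scala, the function name breakEveryNth is hypothetical, and it assumes n is a positive integer and that no item is empty.

```javascript
// Hypothetical helper: replaces every nth "|" with a newline in a single pass.
// Assumes n >= 1 and that items between pipes are non-empty.
function breakEveryNth(text, n) {
  // (n - 1) "item|" repetitions plus one more item are captured;
  // the nth pipe stays outside the group so it is dropped by the replacement
  const pattern = new RegExp("((?:[^|]+\\|){" + (n - 1) + "}[^|]+)\\|", "g");
  return text.replace(pattern, "$1\n");
}

const months = "January|February|March|April|May|June|July|August|September|October|November|December";
console.log(breakEveryNth(months, 5));
// January|February|March|April|May
// June|July|August|September|October
// November|December
```

This extends the n = 2 pattern above: the captured group keeps the inner pipes, and only the pipe after the group is replaced.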
You could first split the string using '|' to get the array of string and then loop through it to perform the logic you want and get the output as required.
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
val out = txt.split("\\|")
var output: String = ""
// Walk the pieces two at a time and join each pair with the pipe.
// Note: with an odd number of pieces, the last one is skipped.
for (i <- 0 until out.length - 1 by 2) {
  output = output + out(i) + "|" + out(i + 1) + "\n"
}
val finalout = output.trim // trim drops the trailing newline
println(finalout)

Formatting regex in Dart on several lines

I have
Pattern pattern = r'^((?:19|20)\d\d)[- /.]
(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$';
My editor shows an error on this regexp:
How can I fix it?
You entered a line break inside a string literal, that is why you get a syntax issue.
If you want to split a pattern into several lines, just use string concatenation:
Pattern pattern = r'^((?:19|20)\d\d)[- /.]' +
r'(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$';
Or, since string literals separated only with whitespace characters are concatenated automatically:
Pattern pattern = r'^((?:19|20)\d\d)[- /.]'
r'(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$';
Or, if you plan to re-use a long pattern, you may define this part as a variable, and just use string interpolation:
String y = r'((?:19|20)\d\d)'; // year
String M = r'(0[1-9]|1[012])'; // month
String d = r'(0[1-9]|[12][0-9]|3[01])'; // day
String sep = r'[- /.]';
Pattern pattern = '^$y$sep$M$sep$d\$';
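The same build-from-named-parts idea carries over to other regex flavors. Below is a rough JavaScript sketch (not Dart) that composes the date pattern from smaller pieces through their source strings; the variable names are illustrative only.

```javascript
// Each piece is a small regex; .source gives its pattern text for concatenation
const year = /((?:19|20)\d\d)/;
const month = /(0[1-9]|1[012])/;
const day = /(0[1-9]|[12][0-9]|3[01])/;
const sep = /[- \/.]/;

const datePattern = new RegExp(
  "^" + year.source + sep.source + month.source + sep.source + day.source + "$"
);

const m = "2023-07-15".match(datePattern);
console.log(m && m.slice(1)); // [ '2023', '07', '15' ]
console.log(datePattern.test("2023-13-01")); // false, there is no month 13
```

Naming the pieces after what they match (year, month, day) makes the final concatenation self-describing.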

Matlab: using regexp to get a string that has a whitespace in between

I want to use Regex to acquire some ID's in a cellstring array, the array looks like this:
myString = '(['US04650Y1001', 'US90274P3029', 'HON WI', 'US41165F1012'])';
My pattern for regex is as follows:
pattern = '[A-Za-z0-9.^_]+';
newArr = regexp(myString, pattern,'match');
I'd like to get the ID called 'HON WI', but with my current pattern it's being split in two because my pattern can't deal with the whitespace properly. I would like to get the whole "HON WI", as well as my other strings, i.e. everything that's in '', and these might have special characters like ^, . or _, but I don't know how to add the whitespace.
I already tried stuff like this, without success:
pattern = '[A-Za-z0-9.^_\s]+';
My new array should have, in each cell, the strings/ID's contained in myString (US04650Y1001, US90274P3029, HON WI and US41165F1012) with dimensions 1x4.
Another approach that seems to work but not entirely sure:
myString = strrep(myString,'([','');
myString = strrep(myString,'])','');
myString = regexp(myString,',','split');
myString = strrep(myString,'''','');
This seems to get me what I want, but I would like to know how can I alter the regex on my first approach.
Many thanks in advance.
You may use a mere '([^']+)' regex and use 'tokens' to get the captures:
myString = '([''US04650Y1001'', ''US90274P3029'', ''HON WI'', ''US41165F1012''])';
pattern = '''([^'']+)''';
newArr = regexp(myString, pattern,'match', 'tokens');
The newArr will look like
{
[1,1] = 'US04650Y1001'
[1,2] = 'US90274P3029'
[1,3] = 'HON WI'
[1,4] = 'US41165F1012'
}
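The capture-group idea is not specific to Matlab. As a rough cross-check, here is the same '([^']+)' pattern in JavaScript, collecting only the captured part of each match; the helper name quotedItems is hypothetical.

```javascript
// Hypothetical helper: returns the text found between each pair of single quotes
function quotedItems(text) {
  // matchAll yields one match object per hit; index 1 is the captured group
  return Array.from(text.matchAll(/'([^']+)'/g), (m) => m[1]);
}

const ids = quotedItems("(['US04650Y1001', 'US90274P3029', 'HON WI', 'US41165F1012'])");
console.log(ids);
// => [ 'US04650Y1001', 'US90274P3029', 'HON WI', 'US41165F1012' ]
```

Since [^']+ accepts anything except a quote, the space in HON WI needs no special handling.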
Another option is to use lookaround assertions. The following will match any string made of alphanumeric characters or underscores (\w), spaces (' ') or the characters . or ^, that is located between quotes. This will specifically exclude the blank space next to the comma, in the separation between tokens, i.e. ', ' does not give a match.
Note that \s will match any blank space character (including tab, newline), this is why a space is preferred here:
pattern2='(?<='')[\w.^ ]+(?='')';
pattern2 =
(?<=')[\w.^ ]+(?=')
newArr = regexp(myString, pattern2,'match');
newArr'
ans =
'US04650Y1001'
'US90274P3029'
'HON WI'
'US41165F1012'