Regular expression for removing white spaces but not those inside "" - regex

I have the following input string:
key1 = "test string1" ; key2 = "test string 2"
I need to convert it to the following, without tokenizing:
key1="test string1";key2="test string 2"

You'd be far better off NOT using a regular expression.
What you should be doing is parsing the string. The problem you've described is a mini-language, since each point in that string has a state (eg "in a quoted string", "in the key part", "assignment").
For example, what happens when you decide you want to escape characters?
key1="this is a \"quoted\" string"
Move along the string character by character, maintaining and changing state as you go. Depending on the state, you can either emit or omit the character you've just read.
As a bonus, you'll get the ability to detect syntax errors.
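A minimal sketch of such a character-by-character scanner, written in Python purely for illustration (the question does not name a language). The function name strip_outside_quotes and the decision to raise ValueError on an unterminated quote are assumptions of this sketch, not part of the answer above.

```python
def strip_outside_quotes(text: str) -> str:
    """Drop spaces and tabs outside double-quoted strings; keep quoted text verbatim."""
    out = []
    in_quotes = False  # state: currently inside "..."
    escaped = False    # state: previous character inside quotes was a backslash
    for ch in text:
        if in_quotes:
            out.append(ch)       # inside quotes every character is emitted
            if escaped:
                escaped = False  # an escaped character can never close the string
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
            out.append(ch)
        elif ch not in " \t":
            out.append(ch)       # outside quotes: omit whitespace, emit the rest
    if in_quotes:
        # syntax error detection comes almost for free with explicit state
        raise ValueError("unterminated quoted string")
    return "".join(out)


print(strip_outside_quotes('key1 = "test string1" ; key2 = "test string 2"'))
# prints: key1="test string1";key2="test string 2"
```

Because the escape handling is just one more state flag, supporting \" inside quoted strings does not require touching the rest of the loop.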

Using ERE (extended regular expressions, which are clearer than basic REs in cases like this), assuming no quote escaping and with the global flag set (to replace all occurrences), you can do it this way:
s/ *([^ "]*) *("[^"]*")?/\1\2/g
sed:
$ echo 'key1 = "test string1" ; key2 = "test string 2"' | sed -r 's/ *([^ "]*) *("[^"]*")?/\1\2/g'
C# code:
using System.Text.RegularExpressions;
Regex regex = new Regex(" *([^ \"]*) *(\"[^\"]*\")?");
String input = "key1 = \"test string1\" ; key2 = \"test string 2\"";
String output = regex.Replace(input, "$1$2");
Console.WriteLine(output);
Output:
key1="test string1";key2="test string 2"
Escape-aware version
On second thought, leaving out an escape-aware version of the regexp could lead readers to wrong conclusions, so here it is:
s/ *([^ "]*) *("([^\\"]|\\.)*")?/\1\2/g
which in C# looks like:
Regex regex = new Regex(" *([^ \"]*) *(\"(?:[^\\\\\"]|\\\\.)*\")?");
String output = regex.Replace(input, "$1$2");
Please do not go blind from those backslashes!
Example
Input: key1 = "test \\ " " string1" ; key2 = "test \" string 2"
Output: key1="test \\ "" string1";key2="test \" string 2"
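For comparison, the same escape-aware pattern can be tried in Python, where a raw string avoids the doubled backslashes. This is only a sketch: the names PATTERN and squeeze are made up for illustration, and it relies on Python 3.5+ expanding an unmatched group in the replacement to an empty string.

```python
import re

# Same pattern as the escape-aware sed expression, written as a raw string.
# Group 1: an unquoted token; group 2: an optional quoted string with \-escapes.
PATTERN = re.compile(r' *([^ "]*) *("(?:[^\\"]|\\.)*")?')


def squeeze(text: str) -> str:
    # Keep only the captured token and quoted string; the surrounding spaces are dropped.
    return PATTERN.sub(r"\1\2", text)


print(squeeze('key1 = "test string1" ; key2 = "test string 2"'))
print(squeeze(r'key1 = "test \\ " " string1" ; key2 = "test \" string 2"'))
```

The second call uses the input from the example above, so its output can be compared directly with the Output line.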

Related

How to represent many parts of awk sub/gsub's matched string

How to represent more than one part of awk sub or gsub's matched string.
For a regexp like "##code", if I want to insert a word between "##" and "code", I would like something like VSCode's syntax, in which $1 represents the first part and $2 represents the second part:
sub(/(##)(code)/, "$1before$2", str)
From awk's user manual, I found that awk uses & to represent the whole matched string. How can I refer to one, two, or more parts of the matched string, as in VSCode?
sub(regexp, replacement [, target])
Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Modify the entire string by replacing the matched text with replacement. The modified string becomes the new value of target. Return the number of substitutions made (zero or one).
The regexp argument may be either a regexp constant (/…/) or a string constant ("…"). In the latter case, the string is treated as a regexp to be matched. See Computed Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.
This function is peculiar because target is not simply used to compute a value, and not just any expression will do: it must be a variable, field, or array element so that sub() can store a modified value there. If this argument is omitted, then the default is to use and alter $0. For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)
sets str to ‘wither, water, everywhere’, by replacing the leftmost longest occurrence of ‘at’ with ‘ith’.
If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:
{ sub(/candidate/, "& and his wife"); print }
changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line.
The user manual's link is here
Your best option is to use GNU awk for either of these:
$ awk '{$0=gensub(/(##)(code)/,"\\1before\\2",1)} 1' <<<'##code'
##beforecode
$ awk 'match($0,/(##)(code)/,a){$0=a[1] "before" a[2]} 1' <<<'##code'
##beforecode
The first one only lets you move text segments around while the 2nd lets you call functions, perform math ops or do anything else on the matching text before moving it around in the original or doing anything else with it:
$ awk 'match($0,/(##)(code)/,a){$0=length(a[1])*10 "before" toupper(a[2])} 1' <<<'##code'
20beforeCODE
After thinking about this for a bit, I don't know how to get the desired behavior in any reasonable way using just POSIX awk constructs. Here's something I tried (the matches() function):
$ cat tst.awk
BEGIN {
    str = "foobar"
    re = "(f.*o)(b.*r)"
    printf "\nre \"%s\" matching string \"%s\"\n", re, str
    print "succ: gensub(): ", gensub(re,"<\\1> <\\2>",1,str)
    print "succ: match(): ", (match(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
    print "succ: matches(): ", (matches(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
    str = "foofoo"
    re = "(f.*o)(f.*o)"
    printf "\nre \"%s\" matching string \"%s\"\n", re, str
    print "succ: gensub(): ", gensub(re,"<\\1> <\\2>",1,str)
    print "succ: match(): ", (match(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
    print "fail: matches(): ", (matches(str,re,a) ? "<" a[1] "> <" a[2] ">" : "")
}

function matches(str,re,arr,    start,tgt,n,i,segs) {
    delete arr
    if ( start=match(str,re) ) {
        tgt = substr($0,RSTART,RLENGTH)
        n = split(re,segs,/[)(]+/) - 1
        for (i=1; RSTART && (i < n); i++) {
            if ( match(str,segs[i+1]) ) {
                arr[i] = substr(str,RSTART,RLENGTH)
                str = substr(str,RSTART+RLENGTH)
            }
        }
    }
    return start
}
$ awk -f tst.awk
re "(f.*o)(b.*r)" matching string "foobar"
succ: gensub(): <foo> <bar>
succ: match(): <foo> <bar>
succ: matches(): <foo> <bar>
re "(f.*o)(f.*o)" matching string "foofoo"
succ: gensub(): <foo> <foo>
succ: match(): <foo> <foo>
fail: matches(): <foofoo> <>
But of course that doesn't work for the 2nd case, because the first RE segment f.*o matches the whole string foofoo, and the same thing happens if you take the RE segments in reverse. I also considered extracting the RE segments as above, then building up a new string one character at a time from the input and comparing the first RE segment against it until it matches, since that would be the shortest string matching the segment. But that would fail for a string and RE like:
str='foooobar'
re='(f.*o)(b.*r)'
since f.*o would match foo with that algorithm when it really needs to match foooo.
So - I guess you'd need to keep iterating (being careful of what direction you iterate in - from the end is correct I expect) till you get the string split up into segments that each match every RE segment in a left-most-longest fashion. Seems like a lot of work!
With GNU awk, you can use gensub for this purpose. Without gensub, in any generic awk, it becomes a bit more tedious. The procedure could be something like this:
ere="(ere1)(ere2)"
match(str,ere)
tmp=substr(str,RSTART,RLENGTH)
match(tmp,"ere1"); part1=substr(tmp,RSTART,RLENGTH)
part2=substr(tmp,RSTART+RLENGTH)
sub(ere,part1 "before" part2,str)
The problem is that this will not always work, so you have to engineer it a bit. A simple failure can be produced by the greediness of the ERE:
str="foocode"
ere="(f.*o)(code)"
match(str,ere) # finds "foocode"
tmp=substr(str,RSTART,RLENGTH) # tmp <: "foocode"
match(tmp,"(f.*o)"); # greedy "fooco"
part1=substr(tmp,RSTART,RLENGTH) # part1 <: "fooco"
part2=substr(tmp,RSTART+RLENGTH) # part2 <: "de"
sub(ere,part1 "before" part2,str) # :> "foocobeforede"
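The greediness failure is easy to reproduce in any regex engine that does expose capture groups. The following Python sketch (variable names are illustrative only) contrasts the groups taken from the full match with the result of re-matching the first segment on its own, which is what the generic-awk workaround does:

```python
import re

s = "foocode"
whole = re.search(r"(f.*o)(code)", s)

# Groups taken from the full match respect the rest of the pattern.
true_part1, true_part2 = whole.group(1), whole.group(2)

# Re-matching the first segment alone is greedy and overshoots into "code".
naive = re.match(r"f.*o", whole.group(0))
naive_part1 = naive.group(0)
naive_part2 = whole.group(0)[naive.end():]

print(true_part1 + "before" + true_part2)    # foobeforecode
print(naive_part1 + "before" + naive_part2)  # foocobeforede
```

The first line is what gensub produces in GNU awk; the second is the incorrect result of the segment-by-segment approach.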

Exact matching with Question mark in Perl

I want to find the string ?Allen in an array of strings, but the question mark in the keyword causes problems.
I wrote this code to search the array:
@arr = ("My name is ?Allen",
        "My name is ?Allens",
        "My name is s?Allen",
        "My name is s?Allens",
        "My name is ?allen");
$keyword = "?Allen";
for (my $i=0; $i <= 4; $i++){
    if ($arr[$i] =~ /\b$keyword\b/){
        print "str $i = match\n";
    }else{
        print "str $i = no\n";
    }
}
I get this result:
str 0 = match
str 1 = no
str 2 = match
str 3 = no
str 4 = no
but I want only the first array element to be reported as a match, like this:
str 0 = match
str 1 = no
str 2 = no
str 3 = no
str 4 = no
Note that your keyword contains special regex characters that must be quoted before it is interpolated into the pattern. Also, because non-word characters can appear at the start or end of the keyword, you cannot rely on \b (its meaning depends on the adjacent characters). You can fix the code with
/(?<!\S)\Q$keyword\E(?!\S)/
where
(?<!\S) - requires a whitespace char or start of string before
\Q$keyword\E - a literal search string (see Quoting Metacharacters)
(?!\S) - requires a whitespace char or end of string after.
Another alternative for \Q...\E (mentioned by Dave Cross) is using quotemeta:
This is the internal function implementing the \Q escape in double-quoted strings.
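The same idea carries over to other regex flavours. As a rough Python sketch (contains_token is a made-up helper name), re.escape plays the role of \Q...\E or quotemeta, and the two lookarounds replace \b:

```python
import re


def contains_token(text: str, keyword: str) -> bool:
    # Escape the keyword so "?" is matched literally, then require
    # whitespace or a string boundary on both sides.
    pattern = r"(?<!\S)" + re.escape(keyword) + r"(?!\S)"
    return re.search(pattern, text) is not None


lines = [
    "My name is ?Allen",
    "My name is ?Allens",
    "My name is s?Allen",
    "My name is s?Allens",
    "My name is ?allen",
]
for i, line in enumerate(lines):
    result = "match" if contains_token(line, "?Allen") else "no"
    print(f"str {i} = {result}")
```

Only the first string is reported as a match, which is the result the question asks for.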

return first instance of unmatched regex scala

Is there a way to return the first instance of an unmatched string between 2 strings with Scala's Regex library?
For example:
val a = "some text abc123 some more text"
val b = "some text xyz some more text"
a.firstUnmatched(b) = "abc123"
Regex is good for matching & replacing in strings based on patterns.
But to look for the differences between strings? Not exactly.
However, diff can be used to find differences.
object Main extends App {
  val a = "some text abc123 some more text 321abc"
  val b = "some text xyz some more text zyx"
  val firstdiff = (a.split(" ") diff b.split(" "))(0)
  println(firstdiff)
}
prints "abc123"
Is regex desired after all? Then realize that the splits could be replaced by regex matching.
The regex pattern in this example looks for words:
val reg = "\\w+".r
val firstdiff = (reg.findAllIn(a).toList diff reg.findAllIn(b).toList)(0)
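To make explicit what diff is doing here (a multiset difference that keeps the order of the left-hand sequence), here is a small sketch of the same logic in Python. The function name first_unmatched is borrowed from the question's pseudocode; returning None when nothing is left over is an assumption of this sketch.

```python
from collections import Counter


def first_unmatched(a: str, b: str):
    """Return the first word of a that is not cancelled by an occurrence in b."""
    remaining = Counter(b.split())  # how many times each word of b can still cancel
    for word in a.split():
        if remaining[word] > 0:
            remaining[word] -= 1    # this occurrence is matched by one in b
        else:
            return word             # first word with no counterpart left in b
    return None


a = "some text abc123 some more text 321abc"
b = "some text xyz some more text zyx"
print(first_unmatched(a, b))  # abc123
```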

Regex VB.Net Regex.Replace

I'm trying to perform a simple regex find and replace, adding a tab into the string after some digits as outlined below.
From
a/users/12345/badges
To
a/users/12345 /badges
I'm using the following:
s = regex.replace(s, "(a\/users\/\d*)("a\/users\/\d*\t)", $1 $2")
But I'm clearly doing something wrong.
Where am I going wrong? I know it's a stupid mistake, but help would be gratefully received.
VBVirg
You can achieve that with a mere look-ahead that will find the position right before the last /:
Dim s As String = Regex.Replace("a/users/12345/badges", "(?=/[^/]*$)", vbTab)
Output:
a/users/12345 /badges
See IDEONE demo
Or, you can just use LastIndexOf with Insert:
Dim str2 As String
Dim str As String = "a/users/12345/badges"
Dim idx = str.LastIndexOf("/")
If idx > 0 Then
    str2 = str.Insert(idx, vbTab)
End If
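Both approaches translate directly to other languages. A hedged Python sketch follows (the two function names are invented for illustration): the first uses the same zero-width look-ahead, the second the index-based insert. Note that the index version here also handles a slash at position 0, unlike the idx > 0 check above.

```python
import re


def tab_before_last_slash_regex(path: str) -> str:
    # The look-ahead matches the empty position right before the last "/",
    # so the tab is inserted without consuming any characters.
    return re.sub(r"(?=/[^/]*$)", "\t", path)


def tab_before_last_slash_index(path: str) -> str:
    idx = path.rfind("/")
    if idx < 0:
        return path  # no slash at all: leave the string unchanged
    return path[:idx] + "\t" + path[idx:]


print(repr(tab_before_last_slash_regex("a/users/12345/badges")))
print(repr(tab_before_last_slash_index("a/users/12345/badges")))
```

Printing with repr makes the inserted tab visible as \t.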
When I read, "adding a tab into the string after some digits" I think there could be more than one set of digits that can appear between forward slashes. This pattern:
"/(\d+)/"
Will capture only digits that are between forward slashes and will allow you to insert a tab like so:
Imports System.Text.RegularExpressions
Module Module1
    Sub Main()
        Dim str As String = "a/54321/us123ers/12345/badges"
        str = Regex.Replace(str, "/(\d+)/", String.Format("/$1{0}/", vbTab))
        Console.WriteLine(str)
        Console.ReadLine()
    End Sub
End Module
Results (NOTE: The tab spaces can vary in length):
a/54321 /us123ers/12345 /badges
When String is "a/54321/users/12345/badges" results are:
a/54321 /users/12345 /badges

In DOORS DXL, how do I use a regular expression to determine whether a string starts with a number?

I need to determine whether a string begins with a number - I've tried the following to no avail:
if (matches("^[0-9].*)", upper(text))) str = "Title"""
I'm new to DXL and Regex - what am I doing wrong?
You need the caret character to indicate a match only at the start of a string. I added the plus character to match all the leading digits, although you might not need it for your situation. If you're only checking for a number at the start and don't care what follows, you don't need anything more.
string str1 = "123abc"
string str2 = "abc123"
string strgx = "^[0-9]+"
Regexp rgx = regexp2(strgx)
if(rgx(str1)) { print str1[match 0] "\n" } else { print "no match\n" }
if(rgx(str2)) { print str2[match 0] "\n" } else { print "no match\n" }
The code block above will print:
123
no match
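The pattern itself is not DXL-specific. As an illustration only, here is the same check in Python (leading_number is a hypothetical helper, and returning None for "no match" is a choice made for this sketch):

```python
import re

# ^ anchors the match at the start of the string; + takes all leading digits.
LEADING_DIGITS = re.compile(r"^[0-9]+")


def leading_number(text: str):
    m = LEADING_DIGITS.match(text)
    return m.group(0) if m else None


print(leading_number("123abc"))  # 123
print(leading_number("abc123"))  # None
```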
@mrhobo is correct, you want something like this:
Regexp numReg = "^[0-9]"
if(numReg text) str = "Title"
You don't need upper since you are just looking for numbers. Also matches is more for finding the part of the string that matches the expression. If you just want to check that the string as a whole matches the expression then the code above would be more efficient.
Good luck!
At least according to the example I found, this should work:
Regexp plural = regexp "^([0-9].*)$"
if plural "15systems" then print "yes"
Resource:
http://www.scenarioplus.org.uk/papers/dxl_regexp/dxl_regexp.htm