Match any char inside two quotation pairs, including nested quotations - regex

I have data that will appear as dual quotation pairs like this, per line.
"Key" "Value"
Inside of these pairs there can be any character, and sometimes there comes the
dreaded "" nested pair:
"Key "superkey"" ""Space" Value"
Previously I've found: "([^"]*)"\s*"([^"]*)"
And this matches Key and Value to two groups:
$1 = Key
$2 = Value
But, with the nested pairs, it will only output:
$1 = superkey
Is there a way to match all characters between the pairs? Example output wanted:
$1 = Key "superkey"
$2 = "Space" Value
The regex is processed by QRegularExpression and written as a C++11 raw string literal:
QRegularExpression(R"D("([^"]*)"\s*"([^"]*)")D");
I know QRegularExpression uses the same syntax as Perl and PHP regex.

"(.*?)"[\t\r ]+"(.*?)"(?=[ ]*$)
Try this. See the demo:
https://regex101.com/r/hR7tH4/2
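To try the pattern outside regex101, here is a minimal Python sketch. Python's re module is my assumption (the question uses QRegularExpression), and split_pair is a hypothetical helper name. The lazy first group stops at the first quote that is followed by whitespace and another quote, and the lookahead forces the second group to run to the last quote on the line.

```python
import re

# Same pattern as in the answer above, compiled with Python's re module.
PAIR_RE = re.compile(r'"(.*?)"[\t\r ]+"(.*?)"(?=[ ]*$)')

def split_pair(line):
    """Return (key, value) for one line, or None if the line does not match."""
    m = PAIR_RE.search(line)
    return m.groups() if m else None

print(split_pair('"Key" "Value"'))
print(split_pair('"Key "superkey"" ""Space" Value"'))
```

Note that a key which itself contains a quote, whitespace, then another quote is ambiguous; this pattern always splits at the first such spot.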

Related

Surrounding one group with special characters using substitute in Vim

Given string:
some_function(inputId = "select_something"),
(...)
some_other_function(inputId = "some_other_label")
I would like to arrive at:
some_function(inputId = ns("select_something")),
(...)
some_other_function(inputId = ns("some_other_label"))
The key change is the ns( ... ) wrapper around the quoted string that follows inputId =
Regex
So far, I have come up with this regex:
:%substitute/\(inputId\s=\s\)\(\"[a-zA-Z]"\)/\1ns(/2/cgI
However, when deployed, it produces an error:
E488: Trailing characters
A simpler version of that regex works. The command:
:%substitute/\(inputId\s=\s\)/\1ns(/cgI
correctly inserts ns( after inputId = and produces the string
some_other_function(inputId = ns("some_other_label")
Challenge
I'm struggling to match the remaining part of the string, e.g. "select_something"), and return it as:
"select_something")).
You have many problems with your regex.
[a-zA-Z] will only match one letter. Presumably you want to match everything up to the next ", so you'll need a \+, and you'll need to match underscores as well. I would recommend \w\+, unless the string may contain more than [a-zA-Z_], in which case I would use .\{-}.
You have a /2 instead of \2. This is why you're getting E488.
I would do this:
:%s/\(inputId = \)\(".\{-}"\)/\1ns(\2)/cgI
Or use the match-start atom, \zs:
:%s/inputId = \zs\".\{-}"/ns(&)/cgI
You can use a negated character class "[^"]*" to match a quoted string:
%s/\(inputId\s*=\s*\)\("[^"]*"\)/\1ns(\2)/g
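For checking the idea outside Vim, here is a rough Python equivalent of the negated-character-class substitution. This is only a sketch: Python's re module and the name wrap_input_ids are my assumptions, not part of the question. Group 1 keeps the inputId = prefix, and group 2 is the whole quoted string, closing quote included.

```python
import re

# Group 1: "inputId =" with flexible spacing; group 2: the full quoted string.
INPUT_ID_RE = re.compile(r'(inputId\s*=\s*)("[^"]*")')

def wrap_input_ids(source):
    """Wrap every quoted inputId value in ns( ... )."""
    return INPUT_ID_RE.sub(r'\1ns(\2)', source)

print(wrap_input_ids('some_function(inputId = "select_something"),'))
print(wrap_input_ids('some_other_function(inputId = "some_other_label")'))
```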

Search and replace sub-patterns using regex

I'm trying to use regular expressions to search/replace sub-patterns but I seem to be stuck. Note: I'm using TextWrangler on OSX to complete this.
SCENARIO:
Here is an example of a complete match:
{constant key0="variable/three" anotherkey=$variable.inside.same.match key2="" thirdkey='exists'}
Each match will always:
start with the following: {constant key0=
terminate with a single curly brace: }
contain one or more key=value pairs
the key of the first pair is constant (in this case, the key is key0)
the value of the first pair is variable (in this case, the value is "variable/three")
any additional pairs are separated by whitespace
Here's an example of what a minimal (but complete) match would look like (with only one key=value pair):
{constant key0="first/variable/example"}
Here's another example of a valid match, but with trailing whitespace after the last (and only) key=value pair:
{constant key0="same/as/above/but/with/whitespace/after/quote" }
GOAL:
What I need to be able to do is extract each key and each value from each match and then rearrange them. For example, I might need the following:
{constant key0="variable/4" variable_key_1="yes" variable_key_2=0}
... to look like this after all is said and done:
$variable_key_1 = "yes"; $variable_key_2 = 0; {newword "variable/4"}
... where
a $ has been added to the extracted keys
spaces have been added around the = of each key=value pair
a ; has been appended to each extracted value
the word constant has been changed to newword, and
key0= has been removed completely.
Here are some examples of what I've tried (note that the first one actually works, but only when there is exactly one key/value pair):
Search:
(\{constant\s+key0=\s*)([^\}\s]+)(\s*\})
Replace:
{newword \2}
Search:
(\{constant\s+key0=)([^\s]+)(([\s]+[^\s]+)([\s]*=\s*)([^\}]+)+)(\s*\})
Replace:
I wasn't able to come up with a good way to replace the output of this one.
Any help would be most appreciated.
Because of the nature of this match, it actually takes three different regexes: one to find the match, and two others to process it. I don't know how you intend to escape the quotes, so I'll give one set for each common escaping scheme.
Without further ado, here's the set for backslash escaping:
Find:
\{constant\s+key0=([^\s"]\S*|"(\\.|[^\\"])*")(\s+[^\s=]+=([^\s"]\S*|"(\\.|[^\\"])*"))*\s*\}
Search 1:
(?<=\s)([^\s=]+)=([^\s"]\S*|"(\\.|[^\\"])*")(?=.*\})
Replace 1:
$1 = $2;
Search 2:
^\{constant\s+key0 = ([^\s"]\S*|"(\\.|[^\\"])*");\s*(?=\S)(.*)\}
Replace 2:
$3 {newword $1}
Now the set for URL/XML/HTML-style escaping, which is much easier to parse:
Find:
\{constant\s+key0=([^\s"]\S*|"[^"]*")(\s+[^\s=]+=([^\s"]\S*|"[^"]*"))*\s*\}
Search 1:
(?<=\s)([^\s=]+)=([^\s"]\S*|"[^"]*")(?=.*\})
Replace 1:
$1 = $2;
Search 2:
^\{constant\s+key0 = ([^\s"]\S*|"[^"]*");\s*(?=\S)(.*)\}$
Replace 2:
$2 {newword $1}
Hope this helps.
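If a script is an option instead of the editor's search/replace dialog, the whole rearrangement can be done in one pass. The following is a minimal Python sketch, not the TextWrangler procedure above. It assumes quoted values never contain a double quote (the second escaping scheme), and the names BLOCK, PAIR, rearrange and convert are made up for illustration. Unlike the replace patterns above, it also adds the $ prefix that the question asks for.

```python
import re

# A value is a double-quoted string (no embedded quotes assumed) or a run
# of characters that are neither whitespace nor a closing brace.
VALUE = r'(?:"[^"]*"|[^\s}]+)'

# Group 1: value of key0; group 2: all remaining key=value pairs, unparsed.
BLOCK = re.compile(
    r'\{constant\s+key0=(' + VALUE + r')((?:\s+[^\s=]+=' + VALUE + r')*)\s*\}'
)
# One key=value pair inside group 2.
PAIR = re.compile(r'([^\s=]+)=(' + VALUE + r')')

def rearrange(match):
    """Build '$key = value; ... {newword value0}' from one matched block."""
    first_value, rest = match.group(1), match.group(2)
    assignments = [f"${key} = {value};" for key, value in PAIR.findall(rest)]
    return " ".join(assignments + [f"{{newword {first_value}}}"])

def convert(text):
    """Rewrite every {constant key0=...} block found in text."""
    return BLOCK.sub(rearrange, text)

print(convert('{constant key0="variable/4" variable_key_1="yes" variable_key_2=0}'))
```

Using a replacement function avoids the two-step search, because the repeated pairs are parsed separately from the block that contains them.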

Go ReplaceAllString

I read the example code from golang.org website. Essentially the code looks like this:
re := regexp.MustCompile("a(x*)b")
fmt.Println(re.ReplaceAllString("-ab-axxb-", "T"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "${1}W"))
The output is like this:
-T-T-
--xx-
---
-W-xxW-
I understand the first output, but I don't understand the remaining three. Can someone explain results 2, 3 and 4? Thanks.
The most intriguing is the fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W")) line. The docs say:
Inside repl, $ signs are interpreted as in Expand
And Expand says:
In the template, a variable is denoted by a substring of the form $name or ${name}, where name is a non-empty sequence of letters, digits, and underscores.
A reference to an out of range or unmatched index or a name that is not present in the regular expression is replaced with an empty slice.
In the $name form, name is taken to be as long as possible: $1x is equivalent to ${1x}, not ${1}x, and, $10 is equivalent to ${10}, not ${1}0.
So, in the 3rd replacement, $1W is treated as ${1W} and since this group is not initialized, an empty string is used for replacement.
When I say "the group is not initialized", I mean that the group is not defined in the regex pattern, so it was not populated during the match operation. Replacing means finding all matches and then substituting each one with the replacement pattern. Backreferences ($xx constructs) are populated during the matching phase. The 1W group is missing from the pattern, so it was not populated during matching, and an empty string is used in the replacement phase.
The 2nd and 4th replacements are easy to understand and are described in the other answers. $1 simply backreferences the characters captured by the first capturing group (the subpattern enclosed in a pair of unescaped parentheses), and the same goes for Example 4.
You can think of {} as a means to disambiguate the replacement pattern.
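The "as long as possible" rule is easy to see in a toy model. The sketch below is written in Python and is not Go's implementation. It only mimics the name scanning quoted from the docs, ignores details such as $$ escapes, and uses the hypothetical names REFERENCE and expand.

```python
import re

# Simplified model of the template syntax: $name or ${name}, where name is
# a run of letters, digits and underscores, taken as long as possible.
REFERENCE = re.compile(r'\$(?:\{([A-Za-z0-9_]+)\}|([A-Za-z0-9_]+))')

def expand(template, groups):
    """Substitute references; groups maps a group name or index (as a string) to text."""
    def lookup(match):
        name = match.group(1) or match.group(2)
        # Unknown names expand to the empty string, as the quoted docs describe.
        return groups.get(name, "")
    return REFERENCE.sub(lookup, template)

# Captures for the match "axxb" of a(x*)b: group 1 holds "xx".
captures = {"1": "xx"}
print(repr(expand("$1", captures)))     # 'xx'
print(repr(expand("$1W", captures)))    # '' because the name is read as "1W"
print(repr(expand("${1}W", captures)))  # 'xxW'
```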
Now, if you need to make the results consistent, use a named capture (?P<1W>....):
re := regexp.MustCompile("a(?P<1W>x*)b") // <= See here, pattern updated
fmt.Println(re.ReplaceAllString("-ab-axxb-", "T"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W"))
fmt.Println(re.ReplaceAllString("-ab-axxb-", "${1}W"))
Results:
-T-T-
--xx-
--xx-
-W-xxW-
The 2nd and 3rd lines now produce consistent output since the named group 1W is also the first group, and $1 numbered backreference points to the same text captured with a named capture $1W.
$number or $name refers to a subgroup of the regex, by index or by name.
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1"))
$1 is subgroup 1 of the regex, i.e. x*.
fmt.Println(re.ReplaceAllString("-ab-axxb-", "$1W"))
$1W: there is no subgroup named 1W, so every match is replaced with the empty string.
fmt.Println(re.ReplaceAllString("-ab-axxb-", "${1}W"))
$1 and ${1} are the same, so every match is replaced with the content of subgroup 1 followed by W.
For more information: https://golang.org/pkg/regexp/
$1 is a shorthand for ${1}
${1} is the value of the first (1) group, i.e. the content of the first pair of (). This group is (x*), i.e. any number of x.
ReplaceAllString replaces every match. There are two matches. The first is ab, the second is axxb.
No 2. replaces any match with the content of the group: This is "" in the first match and "xx" in the second.
No 4. adds a "W" after the content of the group.
No 3. Is left as an exercise. Hint: The twelfth capturing group would be $12.

Regex match for a single character delimiter in Perl

I am struggling to have a regex match for separating keys and values.
The requirement is that the delimiter is ':', yet the keys can have multiple "::". The values can have ':', but the keys cannot. So the first ':' should be the delimiter. If there is any space before the values, it should be eliminated.
I have the following regex, but it fails for key:value (no space after ':').
if ($_ =~ /^(.+?):\s+(.*)$/)
{
$data{$1} = $2;
}
Valid key values are:
key:value
key: value
key: value::subvalue
key::subkey:value
key::subkey:value:subvalue
key::subkey: value:subvalue
key::subkey::subsubkey:value
Note that key, subkey, value, subvalue can be replaced by any word. My regex works for all but the first one.
How can I fix it?
I can have an elsif and add another regex, but I wonder if I can have a single regex for the whole thing.
/^((?:[^:]+::)*[^:]+):(?!:)\s*(.*)$/
You can use this pattern:
/^((?>[^:]+|::)+):\s*(.*)$/
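To check the first pattern against every sample from the question, here is a short Python sketch. Python is my choice for the demo (the question is about Perl), and split_key_value is a hypothetical name. The key is any number of segments ending in :: plus one final segment, and the delimiter is the first single colon that is not followed by another colon.

```python
import re

# Key: zero or more "segment::" parts plus a final segment.
# Delimiter: a single ":" not followed by ":"; whitespace after it is dropped.
KEY_VALUE_RE = re.compile(r'^((?:[^:]+::)*[^:]+):(?!:)\s*(.*)$')

def split_key_value(line):
    """Return (key, value), or None if the line has no single-colon delimiter."""
    m = KEY_VALUE_RE.match(line)
    return m.groups() if m else None

samples = [
    "key:value",
    "key: value",
    "key: value::subvalue",
    "key::subkey:value",
    "key::subkey:value:subvalue",
    "key::subkey: value:subvalue",
    "key::subkey::subsubkey:value",
]
for sample in samples:
    print(sample, "->", split_key_value(sample))
```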

Regular expression help in Perl

I have the following text pattern:
(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain.com>\s\d+:
which matches the text up to the : but I want only the part after the email address, up to the :
Is there an easy regular expression to do this, or should I use split instead?
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last#site.domain.com> - Email address in format FirstName.LastName#sub.domain.com
1234, 4657 - database primary keys
: xxxx - Headline
What I have to do is process the above, get the database IDs (in this example 1234 and 4657, two separate IDs), and query the tables.
The above is the output of the tool, which I call from my Perl script; I will get many entries like this.
My idea was to use a regular expression to get the database IDs.
You can fudge the stuff you don't care about to make the expression easier: just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
There are only two captured groups: the leading ID in parentheses, and the number list (e.g. 1234, 4657). From your pattern I can only assume the latter means "a digit string, followed by zero or more comma-separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. That is, change the final \d+ to [\d, ]+ to allow digits, commas and spaces.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\)). Also, it is not a good idea to rely too heavily on the * quantifier, because it matches zero or more times and so also matches the empty string; it should only be used for things which are truly optional. Otherwise use +, which matches one or more times.
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
    my $string = shift;
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        return $1 =~ /\d+/g;  # return the extracted numbers as a list
        # return $1;          # or just return the matched string as-is
    }
    return;  # no match: empty list (undef in scalar context)
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
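The same loose-match idea can be sketched in Python, which is handy for quickly testing it against the sample lines. This is an illustration rather than a replacement for the Perl sub; extract_ids is a hypothetical name, and converting the IDs to integers is my assumption.

```python
import re

# Anything in <...>, optional spaces, then digits/commas/spaces up to a colon.
IDS_RE = re.compile(r'<[^>]*> *([\d, ]+):')

def extract_ids(line):
    """Return the database IDs after the email part as a list of ints."""
    m = IDS_RE.search(line)
    if m is None:
        return []
    return [int(number) for number in re.findall(r'\d+', m.group(1))]

line = '(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf'
print(extract_ids(line))
```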
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar#domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar#domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*([0-9, ]+):.+/;
$numbers = $1;
Not tested but you get the idea.