c++11 - regex matching - c++

I am extracting info from a string using regex.
auto version = { // comments show the expected output
// version // output : (year, month, sp#, patch#)
"2012.12", // "2012", "12", "", ""
"2012.12-1", // "2012", "12", "", "1"
"2012.12-SP1", // "2012", "12", "SP1", ""
"2012.12-SP2-1", // "2012", "12", "SP2", "1"
"I-2013.12-2", // "2013", "12", "", "2"
"J-2014.09", // "2014", "09", "", ""
"J-2014.09-SP2-1", // "2014", "09", "SP2", "1"
};
The regex I have is the following:
// J - 2014 . 09 - SP2 - 1
std::regex regexExpr("[A-Z]?-?([0-9]{4})\\.([0-9]{2})-?(SP[1-9])?-?([1-9])?.*");
and this seems to work well. I am not very confident about this since I don't have much expertise in regex. Is the regex right and can this be improved?

You can just use \w{2,}|\d as your regex. It matches either a run of two or more word characters (\w{2,}), which avoids matching the single letter such as J at the beginning of some strings, or a single digit (\d), which picks up the 1 at the end of some strings.
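A quick way to see what this pattern returns is to run it over a few of the sample strings. The sketch below uses Python's re module purely for illustration (the question is about std::regex, and the helper name tokens is made up here). Note that the result is a flat list of tokens, so a version without a service pack yields only three items and the patch number lands in the third position rather than the fourth.

```python
import re

# Pattern from the answer: a run of 2+ word characters, or a single digit.
TOKEN = re.compile(r"\w{2,}|\d")

def tokens(version):
    """Return every non-overlapping match of the pattern, left to right."""
    return TOKEN.findall(version)

if __name__ == "__main__":
    for v in ("J-2014.09-SP2-1", "2012.12-1", "2012.12"):
        print(v, "->", tokens(v))
```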
You can use sub_match class template for this aim:
The class template sub_match is used by the regular expression engine to denote sequences of characters matched by marked sub-expressions. A match is a [begin, end) pair within the target range matched by the regular expression, but with additional observer functions to enhance code clarity.
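Coming back to the pattern in the question: it accepts every sample, but because each hyphen is optional on its own and the pattern ends in .*, it also accepts strings such as 2012.12SP1 or 2012.12-garbage. One possible tightening, shown here as a Python sketch rather than C++, ties each hyphen to the part that follows it and requires the whole string to match. The same pattern text should also be usable with std::regex_match in the default ECMAScript grammar, though that is not tested here. VERSION and parse_version are names invented for this example.

```python
import re

# Optional "X-" prefix, year.month, then optional "-SPn" and optional "-n".
# Each hyphen is grouped with the part it introduces.
VERSION = re.compile(r"(?:[A-Z]-)?(\d{4})\.(\d{2})(?:-(SP[1-9]))?(?:-([1-9]))?")

def parse_version(text):
    """Return (year, month, sp, patch), or None if the whole string does not match."""
    m = VERSION.fullmatch(text)
    if m is None:
        return None
    # Groups that did not participate are None in Python; map them to "".
    return tuple(g or "" for g in m.groups())

if __name__ == "__main__":
    for v in ("2012.12", "2012.12-SP2-1", "I-2013.12-2", "2012.12-garbage"):
        print(v, "->", parse_version(v))
```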

Related

Repeated Capturing Matching Groups (Submatches)

For a fun exercise I wondered if I could tokenize simple arithmetic expressions (containing only positive integers and the four basic operations) using a regular expression, so I came up with the pattern ^(\d+)(?:([*/+-])(\d+))*$.
But the test cases below do not behave as I expected; the failures are listed at the end:
func TestParseCalcExpression(t *testing.T) {
	re := regexp.MustCompile(`^(\d+)(?:([*/+-])(\d+))*$`)
	for _, eg := range []struct {
		input    string
		expected [][]string
	}{
		{"1", [][]string{{"1", "1", "", ""}}},
		{"1+1", [][]string{{"1+1", "1", "+", "1"}}},
		{"22/7", [][]string{{"22/7", "22", "/", "7"}}},
		{"1+2+3", [][]string{{"1+2+3", "1", "+", "2", "+", "3"}}},
		{"2*3+5/6", [][]string{{"2*3+5/6", "2", "*", "3", "+", "5", "/", "6"}}},
	} {
		actual := re.FindAllStringSubmatch(eg.input, -1)
		if !reflect.DeepEqual(actual, eg.expected) {
			t.Errorf("expected parse(%q)=%#v, got %#v", eg.input, eg.expected, actual)
		}
	}
}
// === RUN TestParseCalcExpression
// prog.go:24: expected parse("1+2+3")=[][]string{[]string{"1+2+3", "1", "+", "2", "+", "3"}}, got [][]string{[]string{"1+2+3", "1", "+", "3"}}
// prog.go:24: expected parse("2*3+5/6")=[][]string{[]string{"2*3+5/6", "2", "*", "3", "+", "5", "/", "6"}}, got [][]string{[]string{"2*3+5/6", "2", "/", "6"}}
// --- FAIL: TestParseCalcExpression (0.00s)
// FAIL
I was hoping that the "zero or more" repetition of the non-capturing group ((?:...)*), which wraps the operator and number groups (([*/+-])(\d+)), would capture every occurrence of that sub-expression, but it only appears to capture the last one.
On the one hand, this makes sense because the regex literally has only three capturing groups, so any resulting match can only have three submatches. On the other hand, the "zero or more" repetition makes it look as if all the "middle" repeated items are missing in the failed tests (e.g. +2 in 1+2+3).
// expected parse("1+2+3")=
// [][]string{[]string{"1+2+3", "1", "+", "2", "+", "3"}},
// got [][]string{[]string{"1+2+3", "1", "+", "3"}}
Is there a way to parse these kinds of arithmetic expressions using Go regular expressions, or is this a fundamental limitation of regular expressions (or of Go/RE2 regexps, or of the general combination of capturing and non-capturing groups)?
(I realize I could just split by word boundaries and scan the tokens to validate the structure but I'm more interested in this limitation of non/capturing groups than the example problem.)
package main
import (
	"reflect"
	"regexp"
	"testing"
)
func TestParseCalcExpression(t *testing.T) {
	re := regexp.MustCompile(`(\d+)([*/+-]?)`)
	for _, eg := range []struct {
		input    string
		expected [][]string
	}{
		{"1", [][]string{{"1", "1", ""}}},
		{"1+1", [][]string{{"1+", "1", "+"}, {"1", "1", ""}}},
		{"22/7", [][]string{{"22/", "22", "/"}, {"7", "7", ""}}},
		{"1+2+3", [][]string{{"1+", "1", "+"}, {"2+", "2", "+"}, {"3", "3", ""}}},
		{"2*3+5/6", [][]string{{"2*", "2", "*"}, {"3+", "3", "+"}, {"5/", "5", "/"}, {"6", "6", ""}}},
	} {
		actual := re.FindAllStringSubmatch(eg.input, -1)
		if !reflect.DeepEqual(actual, eg.expected) {
			t.Errorf("expected parse(%q)=%#v, got %#v", eg.input, eg.expected, actual)
		}
	}
}
As mentioned in this question about Swift (I'm not a Swift or regex expert, so I'm just guessing this applies to Go as well), you can only get one match back for each capturing group in your regex. If the group repeats, it seems to report only the last match.
From the Go standard library regexp package documentation:
If 'Submatch' is present, the return value is a slice identifying the successive submatches of the expression. Submatches are matches of parenthesized subexpressions (also known as capturing groups) within the regular expression, numbered from left to right in order of opening parenthesis. Submatch 0 is the match of the entire expression, submatch 1 the match of the first parenthesized subexpression, and so on.
Given this convention, returning multiple matches per capturing group would break the numbering, so you wouldn't know which items belong to which group. A regex engine could in principle return multiple matches per group, but this package couldn't do that without breaking the convention stated in its documentation.
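The same behaviour is easy to reproduce outside Go. As a small illustration in Python's re module, which also gives each capturing group a single numbered slot, the repeated group below runs more than once on 1+2+3, yet only the values from its last iteration are reported. The function name captured_groups is made up for this example.

```python
import re

# Same pattern as in the question (anchors dropped because fullmatch anchors).
PATTERN = re.compile(r"(\d+)(?:([*/+-])(\d+))*")

def captured_groups(expr):
    """Return the captured groups for a full match, or None if there is none."""
    m = PATTERN.fullmatch(expr)
    return m.groups() if m else None

if __name__ == "__main__":
    print(captured_groups("1+2+3"))    # ('1', '+', '3')
    print(captured_groups("2*3+5/6"))  # ('2', '/', '6')
```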
My solution is to make your problem more regular. Instead of treating the entire expression as one match, which gave us the problem that we can only return finitely many strings per match, we treat the entire expression as simply a series of pairs.
Each pair is composed of a number (\d+), and an optional operator ([*/+-]?).
Then doing a FindAllStringSubmatch on the whole expression, we extract a series of these pairs and get the number and operator for each.
For example:
"1+2+3"
returns
[][]string{{"1+", "1", "+"}, {"2+", "2", "+"}, {"3", "3", ""}}
This only tokenizes the expression; it doesn't validate it. If you need the expression to be validated, then you'll need another initial regex match to verify that the string is indeed an unbroken series of these pairs.
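A sketch of that two-step idea, written in Python for brevity: first check that the whole string is a number followed by zero or more operator-number pairs, then collect the pairs. The function name tokenize and the exact validation pattern are assumptions made for this example, not part of the answer above.

```python
import re

VALID = re.compile(r"\d+(?:[*/+-]\d+)*")   # whole-expression shape check
PAIR = re.compile(r"(\d+)([*/+-]?)")       # one number plus optional operator

def tokenize(expr):
    """Return [(number, operator), ...]; raise ValueError on malformed input."""
    if VALID.fullmatch(expr) is None:
        raise ValueError(f"not a valid expression: {expr!r}")
    return PAIR.findall(expr)

if __name__ == "__main__":
    print(tokenize("2*3+5/6"))
```

Keeping validation separate means the tokenizing pattern can stay as simple as the one in the answer.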

how to match and replace the repeated group patterns and align the result?

I have a code snippet like below
[ "sortBy", "String", "sort by method" ],
[ "sortOrder", "String", "sort order includes ascend and descend" ],
[ "count", "Int", "The number of results to return." ],
[ "names", "Array<String>", "array of strings represents name" ]
I want to use a regular expression to match, replace, and align so that the result looks like this:
{ Name = "sortBy"; Ref = "String"; Description = Some "sort by method" }
{ Name = "sortOrder"; Ref = "String"; Description = Some "sort order includes ascend and descend" }
{ Name = "count"; Ref = "Int"; Description = Some "The number of results to return." }
{ Name = "names"; Ref = "Array<String>"; Description = Some "array of strings represents name" }
Each column should be aligned. I am stuck at the very start: how to match the groups and align the result. My search pattern is this
*\[ *"(.*)", *"(.*)", *"(.*)" *\],
in Visual Studio Code, but it only matches the first row. Instead, I want to match all rows at once, replace them, and then align the result.
The point here is to match and capture only the parts you need to keep, and just match other parts.
You may use
^( *)\[( *)(".*?"),( *)(".*?"),( *)(".*?" *)\],?$
Replace with $1{$2Name = $3;$4Ref = $5;$6Description = Some $7}.
Details
^ - start of line
( *) - Group 1 ($1): leading spaces
\[ - a [ char (will be replaced with {)
( *) - Group 2 ($2): spaces after [
(".*?") - Group 3 ($3): "..." substring
, - a comma (will be replaced with ;)
( *) - Group 4 ($4): spaces after the first ,
(".*?") - Group 5 ($5): "..." substring
, - a comma (will be replaced with ;)
( *) - Group 6 ($6): spaces after the second ,
(".*?" *) - Group 7 ($7): "..." substring and 0+ spaces after
\],?$ - ], an optional , and end of line.
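To check the pattern outside the editor, here is a minimal Python sketch that applies it to every line of a string. re.MULTILINE makes ^ and $ match at line boundaries, and Python writes group references as \1 rather than $1. The name to_records is made up for this example. Note that this replacement only preserves whatever spacing the input already has; it does not pad the columns.

```python
import re

ROW = re.compile(
    r'^( *)\[( *)(".*?"),( *)(".*?"),( *)(".*?" *)\],?$',
    re.MULTILINE,
)
# Same replacement as above, with \N in place of $N.
TEMPLATE = r'\1{\2Name = \3;\4Ref = \5;\6Description = Some \7}'

def to_records(text):
    """Rewrite every [ "a", "b", "c" ] row as a { Name = ...; ... } record."""
    return ROW.sub(TEMPLATE, text)

if __name__ == "__main__":
    sample = (
        '[ "sortBy", "String", "sort by method" ],\n'
        '[ "names", "Array<String>", "array of strings represents name" ]'
    )
    print(to_records(sample))
```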
Here is an answer using a macro extension, which is needed because you have to run two separate regexes (although the second one is very simple). The demo shows your original text first, some badly formatted text second, and your desired results last:
Select your text first and then trigger the macro. I am using alt+r as the keybinding but you can choose whatever you want.
Using the macro extension multi-command put this into your settings.json:
"multiCommand.commands": [
  {
    "command": "multiCommand.insertAlignRows",
    "sequence": [
      "editor.action.insertCursorAtEndOfEachLineSelected",
      "cursorHomeSelect",
      {
        "command": "editor.action.insertSnippet",
        "args": {
          "snippet": "${TM_SELECTED_TEXT/^(\\s*)\\[\\s*(.{12})\\s*(.{18})\\s*([^\\]]*)\\],?/$1{ Name = $2 Ref = $3Description = Some $4}/g}",
        }
      },
      "cursorHomeSelect",
      {
        "command": "editor.action.insertSnippet",
        "args": {
          "snippet": "${TM_SELECTED_TEXT/,/;/g}",
        }
      },
    ]
  }
]
In keybindings.json:
{
  "key": "alt+r", // choose whatever keybinding you want
  "command": "extension.multiCommand.execute",
  "args": { "command": "multiCommand.insertAlignRows" },
  "when": "editorTextFocus"
},
The regex that is doing almost all of the work is:
^(\s*)\[\s*(.{12})\s*(.{18})\s*([^\]]*)\],?
I removed the double escapes, which are necessary in snippets but not in the find/replace widget, so you could just use this regex in your find input (and skip the macro entirely) and
$1{ Name = $2 Ref = $3Description = Some $4}
in the replace field. And then just replace , with ; after that.
Back to that regex: ^(\s*)\[\s*(.{12})\s*(.{18})\s*([^\]]*)\],? looks brittle because of the "magic numbers" 12 and 18 derived from your sample text. But it isn't as bad as it first seems, as the demo with the badly formatted original shows. The numbers just count characters, and as long as your input is reasonably close to what you presented, it will work.
The 12 can actually be anywhere from 12 to 16: 12 is the length of your longest first item (like "sortOrder",) and 16 is the minimum distance from the beginning of the first items to where the second items (like "String") begin.
Likewise, the 18 could be anything from 17 to 24, given your input and where you want the final column to start. Play with the numbers; it is pretty easy to do in the regex101 demo.
I think the only restriction is that your input not look like this:
[ "names", "Array<String>", "array of strings represents name" ]
[ "sortOrder","String", "sort order includes ascend and descend" ],
where a later column starts before the end of the previous column - as in column 3 starts before all the column 2's end. Likewise for some column 2 item starting before all the column 1 items have ended like
[ "sortOrder", "String", "sort order includes ascend and descend" ],
[ "names", "Array<String>", "array of strings represents name" ]
If your input is that bad, you could fix it first with some simple regexes.
Remember you can also adjust where the columns start in your replace by adding/subtracting spaces, as between the $2 Ref in my example above or $3Description - you can add space(s) after the $3 if you wish.
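If running a script is an option, the magic numbers can be avoided altogether by measuring the columns. The following Python sketch is a different technique from the snippet-based macro above: it parses each row, finds the widest Name and Ref fields, and pads with ljust. The function name align_rows is hypothetical, and the sketch assumes each row holds exactly three double-quoted strings with no embedded quotes.

```python
import re

ROW = re.compile(
    r'^\s*\[\s*("[^"]*")\s*,\s*("[^"]*")\s*,\s*("[^"]*")\s*\],?\s*$'
)

def align_rows(text):
    """Convert [ "a", "b", "c" ] rows into aligned { Name = ...; ... } records."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # lines that do not look like a row are skipped
            rows.append(m.groups())
    if not rows:
        return ""
    # Column widths, including the trailing semicolon.
    w_name = max(len(name) for name, _, _ in rows) + 1
    w_ref = max(len(ref) for _, ref, _ in rows) + 1
    out = []
    for name, ref, desc in rows:
        out.append(
            "{ Name = " + (name + ";").ljust(w_name)
            + " Ref = " + (ref + ";").ljust(w_ref)
            + " Description = Some " + desc + " }"
        )
    return "\n".join(out)

if __name__ == "__main__":
    print(align_rows(
        '[ "sortBy", "String", "sort by method" ],\n'
        '[ "names", "Array<String>", "array of strings represents name" ]'
    ))
```

Because the widths are computed from the data, badly aligned input such as the examples above needs no pre-cleaning.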

PhpStorm search and replace multiple times between two strings

In PhpStorm IDE, using the search and replace feature, I'm trying to add .jpg to all strings between quotes that come after $colorsfiles = [ and before the closing ].
$colorsfiles = ["Blue", "Red", "Orange", "Black", "White", "Golden", "Green", "Purple", "Yellow", "cyan", "Gray", "Pink", "Brown", "Sky Blue", "Silver"];
If the "abc" is not in between $colorsfiles = [ and ], there should be no replacement.
The regex that I'm using is
$colorsfiles = \[("(\w*?)", )*
and replace string is
$colorsfiles = ["$2.jpg"]
The current result is
$colorsfiles = ["Brown.jpg"]"Sky Blue", "Silver"];
While the expected output is
$colorsfiles = ["Blue.jpg", "Red.jpg", "Orange.jpg", "Black.jpg", "White.jpg", "Golden.jpg", "Green.jpg", "Purple.jpg", "Yellow.jpg", "cyan.jpg", "Gray.jpg", "Pink.jpg", "Brown.jpg", "Sky Blue.jpg", "Silver.jpg"];
You should have mentioned that you are doing this in an IDE.
Even though I don't use PhpStorm, I'm posting a solution tested in NetBeans.
Find : "([\w ]+)"([\,\]]{1})
Replace : "$1\.jpg"$2
Why do you need a regex for this? A simple array_map() will do the trick.
<?php
function addExtension($color)
{
    return $color.".jpg";
}
$colorsfiles = ["Blue", "Red", "Orange", "Black", "White", "Golden", "Green", "Purple", "Yellow", "cyan", "Gray", "Pink", "Brown", "Sky Blue", "Silver"];
$colorsfiles_with_extension = array_map("addExtension", $colorsfiles);
print_r($colorsfiles_with_extension);
?>
Edit: I've tested it in PhpStorm; do it like this:
search:
"([a-zA-Z\s]+)"
replace_all:
"$1.jpg"
You may use
(\G(?!^)",\s*"|\$colorsfiles\s*=\s*\[")([^"]+)
and replace with $1$2.jpg. See this regex demo.
The regex matches $colorsfiles = [" or the end of the previous match followed with "," while capturing these texts into Group 1 (later referred to with $1 placeholder) and then captures into Group 2 (later referred to with $2) one or more chars other than a double quotation mark.
Details
(\G(?!^)",\s*"|\$colorsfiles\s*=\s*\[") -
\G(?!^)",\s*" - the end of the previous match (\G(?!^)), ", substring, 0+ whitespaces (\s*) and a " char
| - or
\$colorsfiles\s*=\s*\[" - $colorsfiles, 0+ whitespaces (\s*), =, 0+ whitespaces, [" (note that $ and [ must be escaped to match literal chars)
([^"]+) - Capturing group 2: one or more (+) chars other than " (the negated character class, [^"])
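\G is not available in every engine; Python's standard re module, for instance, does not support it. Where that is the case, a two-pass substitution gives the same result: one pattern finds the $colorsfiles array, and a second one rewrites the quoted strings inside it. This is a different technique from the single \G regex above, sketched in Python with a made-up function name (add_jpg), and it assumes no string in the array contains a ] character.

```python
import re

# Group 1: "$colorsfiles = [", group 2: the array body, group 3: "]".
ARRAY = re.compile(r'(\$colorsfiles\s*=\s*\[)([^\]]*)(\])')
QUOTED = re.compile(r'"([^"]+)"')

def add_jpg(source):
    """Append .jpg to every quoted string inside the $colorsfiles array only."""
    def rewrite(m):
        body = QUOTED.sub(r'"\1.jpg"', m.group(2))
        return m.group(1) + body + m.group(3)
    return ARRAY.sub(rewrite, source)

if __name__ == "__main__":
    print(add_jpg('$colorsfiles = ["Blue", "Sky Blue", "Silver"];'))
```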

Split line based on regex in Julia

I'm interested in splitting a line using a regular expression in Julia. My input is a corpus in Blei's LDA-C format, where each line consists of docId wordID : wordCNT. For example, a document with five words is represented as follows:
186 0:1 12:1 15:2 3:1 4:1
I'm looking for a way to aggregate words and their counts into separate arrays, i.e. my desired output:
words = [0, 12, 15, 3, 4]
counts = [1, 1, 2, 1, 1]
I've tried using m = match(r"(\d+):(\d+)",line). However, it only finds the first pair 0:1. I'm looking for something similar to Python's re.compile(r'[ :]').split(line). How would I split a line based on regex in Julia?
There's no need to use regex here; Julia's split function allows using multiple characters to define where the splits should occur:
julia> v = split(line, [':',' '])
11-element Array{SubString{String},1}:
"186"
"0"
"1"
"12"
"1"
"15"
"2"
"3"
"1"
"4"
"1"
julia> words = v[2:2:end]
5-element Array{SubString{String},1}:
"0"
"12"
"15"
"3"
"4"
julia> counts = v[3:2:end]
5-element Array{SubString{String},1}:
"1"
"1"
"2"
"1"
"1"
I discovered the eachmatch method that returns an iterator over the regex matches. An alternative solution is to iterate over each match:
words, counts = Int64[], Int64[]
for m in eachmatch(r"(\d+):(\d+)", line)
    wd, cnt = m.captures
    push!(words, parse(Int64, wd))
    push!(counts, parse(Int64, cnt))
end
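Since the question compares against Python, here is the same iterate-over-matches approach as a Python sketch (parse_line is a name chosen for this example). As with eachmatch, the leading document id 186 is skipped automatically because it is not followed by a colon.

```python
import re

PAIR = re.compile(r"(\d+):(\d+)")

def parse_line(line):
    """Return (words, counts) as two lists of ints from an LDA-C style line."""
    words, counts = [], []
    for m in PAIR.finditer(line):
        words.append(int(m.group(1)))
        counts.append(int(m.group(2)))
    return words, counts

if __name__ == "__main__":
    print(parse_line("186 0:1 12:1 15:2 3:1 4:1"))
```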
As Matt B. mentions, there's no need for a Regex here as the Julia lib split() can use an array of chars.
However - when there is a need for Regex - the same split() function just works, similar to what others suggest here:
line = "186 0:1 12:1 15:2 3:1 4:1"
s = split(line, r":| ")
words = s[2:2:end]
counts = s[3:2:end]
I've recently had to do exactly that in some Unicode processing code (where the split chars were "combined characters", and thus not something that can fit in Julia's single-quoted Char literals), meaning:
split_chars = ["bunch","of","random","delims"]
line = "line_with_these_delims_in_the_middle"
r_split = Regex( join(split_chars, "|") )
split( line, r_split )

Pattern matching and replacement in R

I am not familiar at all with regular expressions, and would like to do pattern matching and replacement in R.
I would like to replace the pattern #1, #2 in the vector: original = c("#1", "#2", "#10", "#11") with each value of the vector vec = c(1,2).
The result I am looking for is the following vector: c("1", "2", "#10", "#11")
I am not sure how to do that. I tried doing:
for(i in 1:2) {
  pattern = paste("#", i, sep = "")
  original = gsub(pattern, vec[i], original, fixed = TRUE)
}
but I get :
#> original
#[1] "1" "2" "10" "11"
instead of: "1" "2" "#10" "#11"
I would appreciate any help I can get! Thank you!
Specify that you are matching the entire string from start (^) to end ($).
Here, I've matched exactly the conditions you are looking at in this example, but I'm guessing you'll need to extend it:
> gsub("^#([1-2])$", "\\1", original)
[1] "1" "2" "#10" "#11"
So, that's basically, "from the start, look for a hash symbol followed by either the exact number one or two. The one or two should be just one digit (that's why we don't use * or + or something) and also ends the string. Oh, and capture that one or two because we want to 'backreference' it."
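The effect of the anchors is easy to check in any regex engine. Below is a minimal Python sketch of the same idea: without ^ and $ the pattern #1 also matches the start of #10, whereas the anchored pattern only matches when the whole string is #1 or #2. The function name replace_exact is made up, and the replacement values are looked up in a dict to mirror the question's vec.

```python
import re

def replace_exact(items, values):
    """Replace '#<n>' by values[n] only when it is the entire string."""
    def lookup(m):
        return str(values[int(m.group(1))])
    return [re.sub(r"^#([12])$", lookup, s) for s in items]

if __name__ == "__main__":
    original = ["#1", "#2", "#10", "#11"]
    print(replace_exact(original, {1: 1, 2: 2}))
```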
Another option using gsubfn:
library(gsubfn)
gsubfn("^#([1-2])$", I, original) ## Function substituting
[1] "1" "2" "#10" "#11"
Or, if you want to explicitly use the values of your vector vec:
gsubfn("^#[1-2]$", as.list(setNames(vec,c("#1", "#2"))), original)
Or formula notation equivalent to function notation:
gsubfn("^#([1-2])$", ~ x, original) ## formula substituting
Here's a slightly different take that uses a zero-width negative lookahead assertion (what a mouthful!). This is the (?!...), which lets the pattern match # at the start of a string only as long as it is not followed by whatever is in the .... In this case that is two digits (or, equivalently, more, as long as they are contiguous). The matched # is replaced with nothing.
gsub( "^#(?![0-9]{2})" , "" , original , perl = TRUE )
[1] "1" "2" "#10" "#11"
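The lookahead version carries over to other engines that support (?!...). A Python sketch (strip_single_hash is a hypothetical name): the leading # is removed unless it is immediately followed by two digits.

```python
import re

# '#' at the start, not followed by two digits; the lookahead consumes nothing.
HASH = re.compile(r"^#(?![0-9]{2})")

def strip_single_hash(items):
    """Drop a leading '#' unless two or more digits follow it."""
    return [HASH.sub("", s) for s in items]

if __name__ == "__main__":
    print(strip_single_hash(["#1", "#2", "#10", "#11"]))
```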