Regex - match between 1st and last occurrence of double square brackets

I'm looking to match any text and non text values, located within " " , but only between the 1st and the last instance of [[ and ]]
window.google.ac.h(["text",[["text1",0],["text2",0,[3]],["text3",0],["text4",0,[3]],["text5",0],["text6",0],["text7",0
]],{"q":"hygjgjhbjh","k":1}])
So far I've managed to get some results (far from ideal) by using: "(.*?)",0
The issue I have is that it either matches all the way until
"text4",0,[3]]
or starts matching at
["text",
I only need to match text1 text2 text3 .. text7
Notes: double square bracket position and nr of instances, between the 1st and the last is not consistent.
Thanks for your help guys!
Edit: I'm using http://regexr.com/ to test it

You didn't say which language you're using, but with .NET regex (which largely follows the Perl standard), a global match with the following regex puts the values you want in the first capture group of each match:
(?<=\[\[.*)"(.+?)"(?=.*\]\])
A global match on "(.+?)" alone would return matches whose first match groups would contain the characters between the quotes. The lookbehind assertion (?<=\[\[.*) tells it to include only cases where there is a [[ anywhere behind, i.e. after the first instance of [[. Similarly, the lookahead assertion (?=.*\]\]) tells it to include only cases where there is a ]] anywhere ahead, i.e. before the last instance of ]].
I tested this using PowerShell. The exact syntax to do a global match and extract the first match groups of all the results will depend on the language.
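For engines that don't allow a variable-length lookbehind (Python's built-in re module is one), roughly the same result can be had without lookarounds: slice out the text between the first [[ and the last ]], then match the quoted strings inside that slice. A minimal Python sketch of that alternative, not of the .NET regex above; the function name quoted_between_brackets is made up for illustration:

```python
import re

def quoted_between_brackets(text):
    """Return quoted values found between the first '[[' and the last ']]'."""
    start = text.find("[[")   # first opening pair
    end = text.rfind("]]")    # last closing pair
    if start == -1 or end < start + 2:
        return []             # no usable [[ ... ]] span
    inner = text[start + 2:end]
    return re.findall(r'"(.+?)"', inner)

# Sample input from the question, joined onto one line
sample = ('window.google.ac.h(["text",[["text1",0],["text2",0,[3]],'
          '["text3",0],["text4",0,[3]],["text5",0],["text6",0],["text7",0]],'
          '{"q":"hygjgjhbjh","k":1}])')
values = quoted_between_brackets(sample)
print(values)
```

Because the slicing is done with plain string methods, it doesn't matter how many [[ or ]] pairs sit in between.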

Depending on your RegEx Engine, you could use this pattern
(?:^.*?\[\[|\G[^"]*)("[^"]*")(?=.*\]\])
Demo

Related

How can I get the second part of a hyphenated word using regex?

For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct to get either sh0rt or t3rm.
You flipped the square brackets and the parentheses, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you might use, for example, one of:
Regex: \b([a-zA-Z0-9]+)- (Demo 1)
Explanation: \b is a word boundary to prevent a partial word match; the value is in capture group 1.
Regex: \b[a-zA-Z0-9]+(?=-) (Demo 2)
Explanation: Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-).
To get t3rm in sh0rt-t3rm you might use for example one of:
Regex: -([a-zA-Z0-9]+)\b (Demo 3)
Explanation: The other way around, with a leading -; get the value from capture group 1.
Regex: -\K[a-zA-Z0-9]+\b (Demo 4)
Explanation: Match - and use \K to keep out what is matched so far. Then match 1 or more times the allowed chars in the character class.
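If you want to try these outside Perl, here is a rough Python sketch of the same patterns. Python's built-in re module has no \K, so the last variant uses a fixed-width lookbehind (?<=-) as a stand-in; that substitution is mine and not part of the patterns above.

```python
import re

word = "sh0rt-t3rm"

# Part before the hyphen: capture group, then the lookahead variant
first = re.search(r"\b([a-zA-Z0-9]+)-", word).group(1)
first_alt = re.search(r"\b[a-zA-Z0-9]+(?=-)", word).group(0)

# Part after the hyphen: capture group, then a lookbehind in place of \K
second = re.search(r"-([a-zA-Z0-9]+)\b", word).group(1)
second_alt = re.search(r"(?<=-)[a-zA-Z0-9]+\b", word).group(0)

print(first, first_alt, second, second_alt)
```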
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parentheses on the left-hand side so that the match runs in list context, because that is when it returns the captures (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
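The effect of the leading greedy .* is easy to see on a string with more than one hyphen. A small Python sketch (the sample string is made up for illustration):

```python
import re

s = "long-sh0rt-t3rm"

# Without a leading .*, the capture starts after the FIRST hyphen
after_first = re.search(r"-(.+)", s).group(1)

# With a greedy .* in front, the capture starts after the LAST hyphen
after_last = re.search(r".*-(.+)", s).group(1)

print(after_first)  # sh0rt-t3rm
print(after_last)   # t3rm
```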
There are of course many other variations in what the data may look like, other than just being one hyphenated word. In particular, if it is part of a larger text, you may want to include word boundaries
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier and cookier and would need careful close inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre.
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.
Demo

Stop Regex Engine at first match [duplicate]

I am learning about using cucumber's step definitions, which use regex. I came across the following different usages and would like to know if there's some material difference between the two approaches of capturing a group within a pair of double quotes:
approach one: "([^"]*)"
approach two: "(.*?)"
For example, consider a string input: 'the output should be "pass!"'. Both approaches would capture pass!. Are there inputs where the two approaches capture differently, or are they equivalent?
Thanks
To the naked eye they look the same, but they are slightly different. Have a look at this example:
input:
a " regex
example is
here" please
Output for "([^"]*)":
regex
example is
here
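And the output for "(.*?)" is empty.
.*? means any character except \n (0 or more times, as few as possible), and there are a few newlines between the quotes ("). To use this pattern here, we need to tell the regex engine to let . match newlines too (single-line mode, sometimes called "dotall"); multiline mode only changes how ^ and $ behave.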
"([^"]*)" will also capture newlines, so if you have
"Something
that goes on two lines"
then it will match it.
"(.*?)" does not span newlines, so it will not match that phrase.
Unless you use the single-line modifier (?s). In which case . will also include newline characters. The following expression: (?s)"(.*?)" would then match and capture.
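A quick way to check the newline behaviour is to run all three patterns against the multi-line input from the earlier answer. This is a Python sketch; other engines should behave the same way for these patterns, but that is an assumption worth testing in your own flavor.

```python
import re

text = 'a " regex\nexample is\nhere" please'

# Negated class: happily crosses newlines
negated = re.findall(r'"([^"]*)"', text)

# Lazy dot: . does not match \n by default, so nothing is found
lazy = re.findall(r'"(.*?)"', text)

# Lazy dot with the single-line modifier: . now matches \n as well
lazy_dotall = re.findall(r'(?s)"(.*?)"', text)

print(negated, lazy, lazy_dotall)
```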
Difference between "(.*?)" and "([^"]*)"
It depends upon where this regex fragment appears within the larger context of the overall pattern. It also depends upon the target string that is being searched. For example, given the following input string:
'foo "quote1" bar "quote2"'
The expression: /"(.*?)"$/ (note the added end of string anchor) will match: "quote1" bar "quote2" but the /"([^"]*)"$/ expression will match: "quote2".
The dot will match a double quote if it has to to get a successful overall match.
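The anchored case can be reproduced like this (Python sketch, same input string as above without the outer single quotes):

```python
import re

s = 'foo "quote1" bar "quote2"'

# Lazy dot: starts at the first quote and is forced to stretch
# across the inner quotes so that "$ can match at the end
lazy = re.search(r'"(.*?)"$', s).group(0)

# Negated class: can never cross a quote, so only the last pair fits
negated = re.search(r'"([^"]*)"$', s).group(0)

print(lazy)     # "quote1" bar "quote2"
print(negated)  # "quote2"
```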

conditional group matching using regex

How can I match a group unless it starts with a certain character?
e.g. I have the following sentence:
just _checking any _string.
I have the regex ([\w]+) which matches all the words {just, _checking, any, _string}. But what I want is to match all the words that don't start with the character _, i.e. {just, any}.
The above example is a watered down version of what I'm actually trying to parse.
I'm parsing a code file, which contains string in the following format :
package1.class1<package2.class2 <? extends package3.class3> , package4.class4 <package5.package6.class5<?>.class6.class7<class8> >.class9.class10
The output I need is a match for each fully qualified name (having at least one . in the middle), stopping when a < is encountered.
So, the result should be :
{ package1.class1, package2.class2, package3.class3, package4.class4, package5.package6.class5 }
I wrote ([\w]+\.)+([\w]+) to parse it but it also matches class6.class7 and class9.class10 which I don't want. I know it's way off the mark and I apologize for that.
Hence, I earlier asked if I can ignore a capture group starting from a specific character.
Here's the link where I tried : regex101
There, everything it matches is correct except the part where it matches class6.class7 and class9.class10.
I'm not sure how to proceed on this. I'm using C++14 and it supports ECMAScript grammar as well along with POSIX style.
EDIT : as suggested by @Corion, I've added more details.
EDIT2 : added regex101 link
Just use a word boundary \b and make sure that the first character is not an underscore (but still a letter):
(\b(?=[^_])[\w]+)
Using the following Perl script to validate that:
perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_])[\w]+)/g"
Matched <just>
Matched <any>
regex101 playground
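The same check in Python, as a sketch (the pattern is the one above minus the outer capture group, since findall then returns the whole match):

```python
import re

sentence = "just _checking any _string."

# \b puts us at the start of a word; (?=[^_]) rejects words
# whose first character is an underscore
words = re.findall(r"\b(?=[^_])\w+", sentence)

print(words)  # ['just', 'any']
```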
In response to the expansion of the question in the comment, the following regular expression will also capture dots in the "middle" of the word (but still disallow them at the start of a word):
(\b(?=[^_.])[\w.]+)
perl -wlne "print qq(Matched <$_>) for /(\b(?=[^_.])[\w.]+)/g"
just _checking any _string. and. this. inclu.ding dots
Matched <just>
Matched <any>
Matched <and.>
Matched <this.>
Matched <inclu.ding>
Matched <dots>
regex101 playground
After the third expansion of the question, I've expanded the regular expression to match the class names but exclude the extends keyword, and to start a new match only at the start of the string or after whitespace (\s) or an angle bracket (< or >). The fully qualified matches are achieved by forcing a dot ( \. ) to appear in the match:
(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))
perl -nwle "print qq(Matched <$_>) for /(?:^|[<>\s])(?:(?![_.]|\bextends\b)([\w]+\.[\w.]+))/g"
Matched <package1.class1>
Matched <package2.class2>
Matched <package3.class3>
Matched <package4.class4>
Matched <package5.package6.class5>
regex 101 playground
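For anyone who wants to run this last pattern outside Perl, here is a Python sketch using the input string from the question. Note that the question targets C++14 with ECMAScript grammar; this pattern uses only lookahead, but I have only traced it against Python's re semantics here, so treat the std::regex behaviour as an assumption to verify.

```python
import re

code = ("package1.class1<package2.class2 <? extends package3.class3> , "
        "package4.class4 <package5.package6.class5<?>.class6.class7<class8> "
        ">.class9.class10")

pattern = re.compile(
    r"(?:^|[<>\s])"                 # start of string, angle bracket, or whitespace
    r"(?:(?![_.]|\bextends\b)"      # next token must not start with _ or . or be 'extends'
    r"([\w]+\.[\w.]+))"             # a name containing at least one dot
)

# With exactly one capture group, findall returns the group contents
names = pattern.findall(code)
print(names)
```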

Regex: Find multiple matching strings in all lines

I'm trying to match multiple strings in a single line using regex in Sublime Text 3.
I want to match all values and replace them with null.
Part of the string that I'm matching against:
"userName":"MyName","hiScore":50,"stuntPoints":192,"coins":200,"specialUser":false
List of strings that it should match:
"MyName"
50
192
200
false
Result after replacing:
"userName":null,"hiScore":null,"stuntPoints":null,"coins":null,"specialUser":null
Is there a way to do this without using sed or any other substitution method, but just by matching the wanted pattern in regex?
You can use this find pattern:
:(.*?)(,|$)
And this replace pattern:
:null\2
The first group will match any symbol (dot) zero or more times (asterisk) with this last quantifier lazy (question mark), this last part means that it will match as little as possible. The second group will match either a comma or the end of the string. In the replace pattern, I substitute the first group with null (as desired) and I leave the symbol matched by the second group unchanged.
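Outside Sublime Text, the same find/replace pair can be tried like this (Python sketch; the sample string is the one from the question). Like the pattern itself, it assumes values never contain a colon or a comma.

```python
import re

line = ('"userName":"MyName","hiScore":50,"stuntPoints":192,'
        '"coins":200,"specialUser":false')

# Group 1 is the value (discarded), group 2 is the trailing comma or end of string
result = re.sub(r":(.*?)(,|$)", r":null\2", line)

print(result)
```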
Here is an alternative to amaurs's answer that doesn't need to put the comma back in after each substitution:
:\K(.*?)(?=,|$)
And this replacement pattern:
null
This works like amaurs's pattern but starts matching after the colon is found (using \K to reset the match starting point) and matches until a comma or the end of the line (using a positive lookahead).
I have tested and this works in Sublime Text 2 (so should work in Sublime Text 3)
Another slightly better alternative to this is:
(?<=:).+?(?=,|$)
which uses a positive lookbehind instead of resetting the regex starting point
Another good alternative (so far the most efficient here):
:\K[^,]*
This may help.
Find: (?<=:)[^,]*
Replace: null
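The lookbehind version can be checked the same way. A Python sketch (Python's re module does not support \K, so only the (?<=:) form is shown; the sample string is the one from the question and values are assumed not to contain commas or colons):

```python
import re

line = ('"userName":"MyName","hiScore":50,"stuntPoints":192,'
        '"coins":200,"specialUser":false')

# Match everything after a colon up to (not including) the next comma
result = re.sub(r"(?<=:)[^,]*", "null", line)

print(result)
```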

regex - 4 digit number match and replace

I want to match and replace a number of four digit numbers in a csv file
1,1456,2,3,4,5
2,1455,2,3,4,5
so that all 1400 numbers in the second column are mapped to the range of two hundred
1456 -> 256
1455 -> 255
I have this regex to match the 1400 numbers
',[1][4][0-9][0-9],'
but how can I define the replacement so that it retains the last two digits of the match?
EDIT
Ended up changing the match regex to
,[1][4]([0-9][0-9])
and the replacement defined as
,2\1
in Notepad++
Replace /14(\d{2})/ with 2\1, where \1 is a backreference to the first capture group. Adapt to your regex flavor of choice.
sed -e 's/,[1][4]\([0-9][0-9]\),/,2\1,/'
Notice how the \( \) syntax captures a part of the matched expression, and \1 is used to say "the first captured data".
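The same capture-and-backreference idea in Python, as a sketch. Restricting the match to the second column with ^([^,]*), is my own addition based on the question's wording; the sed command above matches ,14dd, anywhere in the line.

```python
import re

lines = ["1,1456,2,3,4,5", "2,1455,2,3,4,5", "3,1300,1456,3,4,5"]

# Group 1: first column; group 2: last two digits of a 14xx value in column two
pattern = re.compile(r"^([^,]*),14(\d{2}),")

# \g<1> and \g<2> are the explicit forms of \1 and \2 in Python replacements
converted = [pattern.sub(r"\g<1>,2\g<2>,", line) for line in lines]

print(converted)
```

The third line is left alone because its 1456 is not in the second column.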
You need to use a backreference - by surrounding one or more parts of a regex in parentheses, you can later reference them in the output. Here is my final version (works with sed -r).
's/,[1][4]([0-9][0-9])/,2\1/'
You should use a group, i.e. something like
',[1][4]([0-9][0-9]),'
Some regex dialects will let you name groups, e.g. in .NET
',[1][4](?<LastTwoDigits>[0-9][0-9]),'
If you specify which language you are using, it will be easier to help you.