Regex to match keys in json

I am trying to match keys in JSON of this type:
define({
key1: "some text: and more",
key2 : 'some text ',
key3: ": more some text",
key4: 'some text:'
});
with this regexp: (?<=\s|{|,)\s*(\w+)\s*:\s?[\"|\']/g. But currently it also matches the last text: (inside the value of key4), which should be ignored.
An example can be seen here
Could you give me a hint on how to fix this regex so that it matches only keys?

How about this shorter regex:
(?m)^[ ]*([^\r\n:]+?)\s*:
In the demo, look at the Group 1 captures in the right pane.
(?m) allows the ^ to match at the beginning of each line
^ asserts that we are positioned at the beginning of the line
[ ]* eats up all the space characters
([^\r\n:]+?) lazily matches all characters that are not colons : or newlines, and captures them to Group 1 (this is what we want), up to...
\s*: matches optional whitespace characters and a colon
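As a rough illustration of the pattern explained above, here is a minimal Python sketch that runs it over the sample from the question. The names KEY_PATTERN and sample are made up for this example, and it assumes one key per line, as in the question.

```python
import re

# Pattern from the answer above: optional leading spaces, a lazy run of
# characters that are neither colons nor newlines (Group 1), optional
# whitespace, then a colon. (?m) lets ^ match at the start of every line.
KEY_PATTERN = re.compile(r"(?m)^[ ]*([^\r\n:]+?)\s*:")

# Sample input copied from the question.
sample = """define({
key1: "some text: and more",
key2 : 'some text ',
key3: ": more some text",
key4: 'some text:'
});"""

# findall returns the Group 1 capture of every match.
keys = KEY_PATTERN.findall(sample)
print(keys)
```

Because every match is anchored to the start of a line and stops at the first colon, colons inside the values are never reached.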

I wouldn't suggest parsing JSON using regular expressions. There are small libraries for that, some even header-only and with very convenient licensing terms (like rapidjson, which I'm using right now).
But if you really want to, the following expression should find your key/value pairs (note that I'm using Perl, mostly for nice syntax highlighting):
(\w+)\s*:\s*('[^']*'|"[^"]*"|[+\-]?\d+(?:\.\d+)?)
Keep in mind that this won't work properly with escaped quotes inside your values or with strings that aren't properly enclosed.
(\w+) will match the full key.
\s* matches any sequence of whitespace characters, including none.
: is really just a direct match.
'[^']*' will match any characters enclosed by ' (same for the second part of that group, with ").
[+\-]?\d+(?:\.\d+)? will match any number (with or without decimals).
Edit: Since others provided nice and easy to see online demos, here's mine.
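For readers who want to try the key/value expression outside Perl, here is a small Python sketch. It uses the expression with the decimal point escaped (\.), and the names PAIR_PATTERN, sample, pairs and keys are illustrative only. The same caveat applies: escaped quotes inside values are not handled.

```python
import re

# Key, colon, then a single-quoted string, a double-quoted string, or a number.
PAIR_PATTERN = re.compile(r"""(\w+)\s*:\s*('[^']*'|"[^"]*"|[+\-]?\d+(?:\.\d+)?)""")

# Sample input copied from the question.
sample = """define({
key1: "some text: and more",
key2 : 'some text ',
key3: ": more some text",
key4: 'some text:'
});"""

# Each match yields a (key, raw value) tuple; the value keeps its quotes.
pairs = PAIR_PATTERN.findall(sample)
keys = [key for key, _value in pairs]
print(keys)
```

Since each match consumes the whole quoted value, a "text:" inside a value is skipped rather than mistaken for a key.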

text: is matched initially because it is considered a key. Try this regular expression:
(\w+)\s*:\s*(["']).+\2,?
Demo

Related

replace single-quote with double-quote, if and only if quote is after specific string

I'm working in notepad++, and using its find-replace dialog box.
NP++ documentation states: Notepad++ regular expressions use the Boost regular expression library v1.70, which is based on PCRE (Perl Compatible Regular Expression) syntax. ref: https://npp-user-manual.org/docs/searching
What I'm trying to do should be simple, but I'm a regex novice, and after 2-3 hrs of web searches and playing with online regex testers, I give up.
I want to replace every single quote ' with a double quote ", but if and only if the ' is to the RIGHT of one or more #, i.e. inside a Python comment.
For example,
list1 = ['apple','banana','pear'] # All 'single quotes' to LEFT of # remained unchanged.
list2 = ['tomato','carrot'] # All 'single quotes' to RIGHT of one or more # are replaced
# # with "double quotes", like this.
The np++ file is over 800 lines, manual replacement would be tedious & error prone. Advice appreciated.

This regex should do what you want:
(^[^#]*#|(?<!^)\G)[^'\n]*\K'
It looks for a ' which is preceded by either
^[^#]*# : start of line and some number of non-# characters followed by a #; or
(?<!^)\G : the end of the previous match (\G). On its own, \G also matches at the very start of the text, so the negative lookbehind for start of line (?<!^) rules that case out, meaning that it only matches at the end of the previous match
and then some number of non ' or newline (to prevent the match wrapping around the end of the previous line) characters [^'\n]*.
We then use \K to reset the match, so that everything before that is discarded from the match, and the regex only matches the '.
That can then be replaced with ".
Demo on regex101
Update
You can avoid matching apostrophes within words by only matching ones that are either preceded or followed by a non-word character:
(^[^#]*#|(?<!^)\G)[^'\n]*\K('(?=\W)|(?<=\W)')
Demo on regex101
Update 2
You can also deal with the case where there are # characters in strings by qualifying the first part of the regex with the requirement for there to be matched pairs of quotes beforehand:
(?:^[^'#]*(?:'[^']*'[^#']*)*[^'#]*#|(?<!^)\G)[^'\n]*\K(?:'(?=\W)|(?<=\W)')
Demo on regex101
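Python's built-in re module supports neither \G nor \K, so the regexes above cannot be reused there directly. As a swapped-in technique (not the method from this answer), the basic case can be sketched by splitting each line at its first # and replacing quotes only in the part after it. The function name requote_comments is made up, and the sketch deliberately ignores # characters inside string literals and apostrophes inside words, which the two updates above address.

```python
def requote_comments(source: str) -> str:
    """Replace ' with " only to the right of the first # on each line."""
    result = []
    for line in source.split("\n"):
        # partition returns (text before '#', '#', text after '#');
        # if there is no '#', the last two parts are empty strings.
        code, hash_mark, comment = line.partition("#")
        result.append(code + hash_mark + comment.replace("'", '"'))
    return "\n".join(result)


example = (
    "list1 = ['apple','banana','pear'] # All 'single quotes' here\n"
    "# # with 'double quotes', like this."
)
converted = requote_comments(example)
print(converted)
```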

Regex transform for Java code with Notepad++ - date .format to SimpleDateFormat

I need to do some re-factoring on my Java code. I need to turn this:
X.format("Z")
into this:
(new SimpleDateFormat("Z").format(X))
Examples:
dateStart.format("yyyy-MM-dd HH:mm") into
(new SimpleDateFormat("yyyy-MM-dd HH:mm").format(dateStart))
reportStart.format("yyyy-MM") into
(new SimpleDateFormat("yyyy-MM").format(reportStart))
I'm thinking to use Notepad++ find/replace, but I'm not good with Regex, and hoping someone would know easily?
I've tried variations of the below, and the closest I've got is with the one below... But with the one below, it wants to take everything to the left of .format and treat that as $1
find:([^)]*)\.format\(([^)]*)\) replace with:
(new SimpleDateFormat($2.format($1))

Probably a simple find / replace will work like this :
Find (?s)(\w+)\.format\((.*?)\)
Update: escape the parentheses when they are used as literals, because Boost::regex treats these characters as special operators in the replacement (format) string.
Boost-Extended format strings treat all characters as literals except for '$', '\', '(', ')', '?', and ':'
Replace \(new SimpleDateFormat\($2\).format\($1\)\)
https://regex101.com/r/f77yBt/1
If you are interested in why certain characters need to be escaped to be treated as literals, see this:
https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html
Essentially, boost::regex uses these characters to implement a pseudo-callback that does simple (possibly nested) conditionals, checking whether a group matched and taking a yes : no replacement action.

Be aware that in Notepad++ parentheses have to be escaped in the replacement part.
Ctrl+H
Find what: (\w+)\.format\((.+?)\)
Replace with: \(new SimpleDateFormat\($2\).format\($1\)\)
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(\w+) # group 1, 1 or more word characters
\. # a dot
format\( # literally
(.+?) # group 2, 1 or more any character but newline, not greedy
\)
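If the refactoring has to be scripted rather than done in Notepad++, the same find/replace can be sketched in Python. In Python's re replacement templates parentheses are ordinary characters, so they need no escaping there. The names FORMAT_CALL, REPLACEMENT and to_simple_date_format are illustrative, and the sketch assumes the format argument contains no closing parenthesis.

```python
import re

# Group 1: the variable before .format, group 2: the argument (non-greedy).
FORMAT_CALL = re.compile(r"(\w+)\.format\((.+?)\)")

# \2 and \1 refer to the groups; the parentheses are literal here.
REPLACEMENT = r"(new SimpleDateFormat(\2).format(\1))"


def to_simple_date_format(java_source: str) -> str:
    """Rewrite X.format("Z") as (new SimpleDateFormat("Z").format(X))."""
    return FORMAT_CALL.sub(REPLACEMENT, java_source)


rewritten = to_simple_date_format('dateStart.format("yyyy-MM-dd HH:mm")')
print(rewritten)
```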

Regex Search and replace, if between ," and ", is a comma

How can I replace a comma with a dot in Visual Basic?
If there is a comma between ," and ", it should be replaced with a dot.
For example, in the first row, replace "482,5" with "482.5":
"Peter",1,1,1,1,500,"500",631,"631",19,"482,5",1
"Peter",1,1,1,1,500,"500",631,"631",19,"482,5",2
"Peter",1,1,1,1,1984,"1984",635,"635",4,"101,5",3
"Peter",1,1,1,1,500,"500",2000,"2000",19,"482,5",4
"Peter",1,1,1,1,500,"500",1962,"1962",18,"457",5
"Peter",1,1,1,1,486,"486",613,"613",18,"457",6
"Peter",1,1,1,1,1016,"1016",322,"322",19,"482,5",7
"Peter",1,1,1,1,933,"933",444,"444",16,"406,5",8
"Peter",1,1,1,1,250,"250",476,"476",16,"406,5",9
"Peter",1,1,1,1,250,"250",476,"476",16,"406,5",10
"Peter",1,1,1,1,234,"234",933,"933",16,"406,5",11
"Peter",1,1,1,1,250,"250",965,"965",16,"406,5",12

In general, I suggest parsing the CSV with a CSV parser, because the CSV format is far more complicated than it seems. See RFC 4180 for details. The ideal solution would identify the problematic columns and then replace the text in those columns only.
The regex approach must make some assumptions. I.e. the regex approach will work in some cases, and will not work in others.
Probably some people can write a really advanced regex that handles csvs correctly. But they are hard to understand and difficult to maintain. Let's just make assumptions here:
The only text delimiter that we care about is ". I.e. no ' -s.
There are no quotes within fields. They would look like this: "asd""ghi". Here is a more confusing example: "asd"",".
So the regex is:
((?:^|,)"[^",]*),
And the replacement is: $1.
Explanation:
(?:...) is a non-capturing group
(?:^|,) matches either start of line, or a comma
then comes the " to match the starting quote
[^",]* matches everything that's neither a quote nor a comma. So it prevents matching through several fields.
finally, it matches a comma: ,
the parentheses (...) capture the stuff inside. I.e. everything before the comma.
In the replacement $1 refers to the captured group. I.e. the replacement is the matched stuff, and then a dot. The closing comma was not in the group, so this is how the replacement goes.
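The explanation above can be tried out with a short Python sketch (Python rather than VB.NET, purely for illustration). It writes the expression with the outer capture group that $1 refers to, and uses \1 because that is Python's replacement syntax. The names DECIMAL_COMMA and comma_to_dot are made up, and the same assumptions apply: " is the only text delimiter and fields contain no embedded quotes.

```python
import re

# Group 1 captures the field separator (or line start), the opening quote and
# the digits before the decimal comma; the comma itself is outside the group.
DECIMAL_COMMA = re.compile(r'((?:^|,)"[^",]*),', re.MULTILINE)


def comma_to_dot(csv_text: str) -> str:
    """Replace the comma inside a quoted field with a dot."""
    return DECIMAL_COMMA.sub(r"\1.", csv_text)


row = '"Peter",1,1,1,1,500,"500",631,"631",19,"482,5",1'
fixed = comma_to_dot(row)
print(fixed)
```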
RegexR demo.
VB.Net fiddle demo.

Regular expression to match any word followed by a literal string

So I have the following:
^[a-zA-Z]+\b(myword+-)\b*
which I thought would match
^ start of string
[a-zA-Z] any alpha character
+ of one or more characters
\b followed by a word break
(myword+-) followed by myword which could include one or more special characters
\b followed by a word break
\* followed by anything at all
One: it does not work - it does not match anything
Two: any special characters included in (myword+-) throw an error
I could escape the special characters, but I don't know in advance what they might be, so I would have to escape all the possibilities, or perhaps I could just escape every character, as in (\m\y\w\o\r\d\\+\\-)
Edited to add:
Sorry, I knew I should have given more information
I have a series of strings to search through in the form:
extra android-sdk and more that is of no interest
extra android-ndk and more that is of no interest
extra anjuta-extra and more that is of no interest
community c++-gtk-utils and more that is of no interest
and I have a list of items to search for in the strings:
android-sdk
android-ndk
extra
c++-gtk-utils
The item should only match if the second word in the string is an exact match to the item, so:
android-sdk will match the first string
android-ndk will match the second string
extra will NOT match the third string
c++-gtk-utils will match the fourth string
So (myword+-) is the item I am searching for "which could include one or more special characters"
Thanks for the help
Andrew

OK, with the help from above I worked it out.
This regex does exactly what I wanted, bear in mind that I am working in tcl (note the spaces to delimit the search word):
^[a-zA-Z]+\y extra \y *
where the search word is "extra".
It is necessary to escape any characters in the search string which may be interpreted by regex as quantifiers etc., e.g. +
So this will also work:
^[a-zA-Z]+\y dbus-c\+\+ \y *
Andrew

Strong recommendation: if you want to match literal strings, don't use regular expressions.
If we have this sample data:
set strings {
{extra android-sdk and more that is of no interest}
{extra android-ndk and more that is of no interest}
{extra anjuta-extra and more that is of no interest}
{community c++-gtk-utils and more that is of no interest}
}
set search_strings {
android-sdk
android-ndk
extra
c++-gtk-utils
}
Then, to find matches in the 2nd word of each string, we'll just use the eq string equality operator
foreach string $strings {
    foreach search $search_strings {
        if {[lindex [split $string] 1] eq $search} {
            puts "$search matches $string"
        }
    }
}
outputs
android-sdk matches extra android-sdk and more that is of no interest
android-ndk matches extra android-ndk and more that is of no interest
c++-gtk-utils matches community c++-gtk-utils and more that is of no interest
If you insist on regular expression matching, you can escape any special characters to take away their usual regex meaning. Here, we'll take the brute force approach: any non-word chars will get escaped, so that the pattern may look like ^\S+\s+c\+\+\-gtk\-utils
foreach string $strings {
    foreach search $search_strings {
        set pattern "^\\S+\\s+[regsub -all {\W} $search {\\&}]"
        if {[regexp $pattern $string]} {
            puts "$search matches $string"
        }
    }
}
I was hoping to be able to make a portion of a regular expression to be a literal string, like
set pattern "^\\S+\\s+(***=$string)"
set pattern "^\\S+\\s+((?q)$string)"
but both failed.
Tcl regular expressions are documented at
https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm
Also note your pattern ^[a-zA-Z]+\b(myword+-)\b* does not provide for any whitespace between the first and second words.
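For comparison, the two approaches in this answer (plain string equality on the second word, and a regex built from an escaped literal) look roughly like this in Python. This is only a sketch: the function names are made up, and re.escape plays the role of the regsub escaping step. The regex variant adds a lookahead so the item must be followed by whitespace or the end of the string, which is an extra assumption about what "exact match" should mean.

```python
import re

strings = [
    "extra android-sdk and more that is of no interest",
    "extra android-ndk and more that is of no interest",
    "extra anjuta-extra and more that is of no interest",
    "community c++-gtk-utils and more that is of no interest",
]
search_strings = ["android-sdk", "android-ndk", "extra", "c++-gtk-utils"]


def second_word_equals(item: str, line: str) -> bool:
    """No regex at all: compare the second whitespace-separated word."""
    words = line.split()
    return len(words) > 1 and words[1] == item


def second_word_matches(item: str, line: str) -> bool:
    """Regex variant: re.escape removes the special meaning of + and friends."""
    pattern = r"^\S+\s+" + re.escape(item) + r"(?=\s|$)"
    return re.search(pattern, line) is not None


hits = [(s, line) for line in strings for s in search_strings
        if second_word_equals(s, line)]
for item, line in hits:
    print(f"{item} matches {line}")
```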

Disclaimer: Since your question lacks information about what input and output are expected, I will try to explain why your regex isn't working at all. Since this is not a full answer, you might not want to mark it as accepted, and may prefer to wait for someone to give you an example of a working solution once you provide the necessary information.
Notes:
quantifier characters (*, +, ? etc.) are applied to literal character or character class (a.k.a character group, namely characters/ranges inside [ ]) - when in your regex you write (myword+-) the only thing the + sign is applied to is letter 'd', nothing else.
what is myword in your regex? If you want a set of characters use [ ] combined with character ranges and/or character tokens such as \w (all word characters, such as letters and some special characters) or \d (all digit characters)
you also seem to misunderstand and misuse groups ("( )"), character classes ("[ ]") and quantifier notation ("{ }")
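The first note is easy to check in any regex engine. A tiny Python sketch, with made-up variable names, showing what the + in myword+ actually applies to and how a group or a character class changes that:

```python
import re

# + applies only to the preceding item, here the single letter d.
repeated_d = re.fullmatch(r"myword+", "myworddd")

# The whole word repeated does NOT match, because only d is quantified.
repeated_word = re.fullmatch(r"myword+", "mywordmyword")

# A (non-capturing) group makes + apply to the whole word.
grouped = re.fullmatch(r"(?:myword)+", "mywordmyword")

# A character class means "any one of these letters", in any order.
char_class = re.fullmatch(r"[myword]+", "wordy")

print(bool(repeated_d), bool(repeated_word), bool(grouped), bool(char_class))
```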

RegEx: Grabbing values between quotation marks

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?

In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']

I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
(["']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was used for opening.

I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
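To see why either the negated character class or the non-greedy quantifier is needed, here is a small Python comparison on the string from the question. The variable names are illustrative.

```python
import re

text = '"Foo Bar" "Another Value" something else'

# Greedy .* runs to the last quote on the line, swallowing both values.
greedy = re.findall(r'"(.*)"', text)

# Non-greedy .*? stops at the first closing quote.
lazy = re.findall(r'"(.*?)"', text)

# The negated class [^"]* cannot cross a quote in the first place.
negated = re.findall(r'"([^"]*)"', text)

print(greedy)
print(lazy)
print(negated)
```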

Let's look at two efficient ways to deal with escaped quotes. These patterns are designed to be efficient rather than concise or pretty.
Both use first-character discrimination to find quotes in the string quickly without the cost of an alternation. (The idea is to quickly discard characters that are not quotes without testing both branches of the alternation.)
The content between quotes is described with an unrolled loop (instead of a repeated alternation), which is more efficient too: [^"\\]*(?:\\.[^"\\]*)*
To deal with strings that have unbalanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround that emulates them, to prevent excessive backtracking. You can also decide that a quoted part runs from an opening quote to the next (non-escaped) quote or to the end of the string. In that case there is no need for possessive quantifiers; you only need to make the closing quote optional.
Note: sometimes quotes are escaped not with a backslash but by doubling the quote. In that case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid a capture group and a backreference (something like (["']).....\1) and use a simple alternation instead, with ["'] factored out at the beginning.
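The unrolled-loop content subpattern [^"\\]*(?:\\.[^"\\]*)* can be tried on its own before looking at the full patterns. The following Python sketch handles double quotes only and is not one of the author's patterns; DOUBLE_QUOTED and the sample text are made up for illustration. re.DOTALL is used so that an escaped newline is also accepted.

```python
import re

# Opening quote, unrolled loop for the content, closing quote:
#   [^"\\]*            a run of ordinary characters
#   (?:\\.[^"\\]*)*    then any number of (escaped char + ordinary run)
DOUBLE_QUOTED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.DOTALL)

# Raw string, so the backslashes below are literal characters in the text.
text = r'say "a \"quoted\" word" and "plain"'

# No capture groups in the pattern, so findall returns the whole matches.
found = DOUBLE_QUOTED.findall(text)
print(found)
```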
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(This pattern is written entirely "by hand" and does not take into account any optimizations the engine may perform internally.)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'

Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind looks behind the current character to check for a quote and, if one is found, starts from there; the lookahead then checks the character ahead for a quote and, if one is found, stops on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start; this is then used in the lookahead at the end (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.

The RegEx of the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
All support escaped and nested quotes.

I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string from $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
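A quick way to see the group-2 variant in action is a Python sketch like the one below. QUOTED, text and contents are names invented for the example; the pattern itself is the one given above.

```python
import re

# Group 1: the opening quote. Group 2: the content, where each step is either
# a character that is neither the opening quote nor a backslash, or a
# backslash followed by any character. \1 then matches the closing quote.
QUOTED = re.compile(r"""(['"])((?:(?!\1|\\).|\\.)*)\1""")

# One double-quoted value with escaped quotes, one single-quoted value.
text = r'"Foo \"Bar\"" and ' + "'single'"

# Group 2 holds the text between the quotes, escapes left untouched.
contents = [match.group(2) for match in QUOTED.finditer(text)]
print(contents)
```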

A very late answer, but I'd still like to contribute:
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1

The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad, but it could be better). Mine, below, is ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.

This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/

MORE ANSWERS! Here is the solution I used:
\"([^\"]*?icon[^\"]*?)\"
TL;DR: replace the word icon with whatever you're looking for inside the quotes, and voilà!
This works by looking for the keyword without caring what else is between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then for any run of characters that are not "
until it finds icon
then any run of characters that are not "
and finally a closing "
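The steps above can be sketched in Python with the keyword passed in as a parameter. The helper name quoted_containing and the sample markup are made up; re.escape is added so that a keyword containing regex metacharacters is still treated literally.

```python
import re


def quoted_containing(keyword: str, text: str) -> list:
    """Return the contents of double-quoted strings that contain keyword."""
    # quote, non-quote run, keyword, non-quote run, quote
    pattern = r'"([^"]*?' + re.escape(keyword) + r'[^"]*?)"'
    return re.findall(pattern, text)


markup = 'id="fb-icon" class="btn" id="large-icon-close"'
icons = quoted_containing("icon", markup)
print(icons)
```

Note that the pattern does not know which quotes are opening and which are closing, so text lying between two separate quoted strings could also be picked up if it happens to contain the keyword.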

I liked Axeman's more expansive version, but had some trouble with it. For example, it didn't match
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1

import re
string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
Just try this out; it works like a charm!
\ is the escape character

My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["']) -> Matches either ' or " and stores it in capture group 1, so that it can be referenced later as \1
.* -> Greedily matches everything (zero or more characters) up to the end of the line; the regex engine then backtracks one character at a time until the rest of the pattern can match
\1 -> Matches the same quote character that was captured by the first group
(?![^\s]) -> Negative lookahead ensuring that the closing quote is not followed by a non-whitespace character

Unlike Adam's answer, I have a simple one that works:
(["'])(?:\\\1|.)*?\1
And just add parentheses if you want to get the content in quotes, like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.

All the answers above are good... except that they DO NOT support all Unicode characters in ECMAScript (JavaScript)!
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.

echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I show each result string between >< for clarity. Using negated character classes in place of a non-greedy match, this sed command first throws out the junk before and after each pair of quotes, then replaces the whole match with the part between the quotes, surrounded by ><.

From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match; no partial match should trigger a hit,
e.g. "test" could not match for "test2".
import re
reg = r"""(['"])(%s)\1"""
if re.search(reg % needle, haystack, re.IGNORECASE):
    print("winning...")
Hunter

If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
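A Python sketch of the suffix idea, using a slightly simplified variant of the pattern above: a single [^"]* for the content and an escaped dot (\.) so that only a literal .localized suffix counts. LOCALIZED and the variable names are illustrative.

```python
import re

# A double-quoted string immediately followed by the literal suffix.
LOCALIZED = re.compile(r'"([^"]*)"\.localized')

source = ('print("this is something I need to return".localized + '
          '"so is this".localized + "but this is not")')

# Group 1 is the text between the quotes of each suffixed string.
localized_strings = LOCALIZED.findall(source)
print(localized_strings)
```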

A supplementary answer for Microsoft VBA coders only: use the library Microsoft VBScript Regular Expressions 5.5, which gives the following code:
Sub TestRegularExpression()
    Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
    Set oRE = New VBScript_RegExp_55.RegExp
    oRE.Pattern = """([^""]*)"""
    oRE.Global = True
    Dim sTest As String
    sTest = """Foo Bar"" ""Another Value"" something else"
    Debug.Assert oRE.test(sTest)
    Dim oMatchCol As VBScript_RegExp_55.MatchCollection
    Set oMatchCol = oRE.Execute(sTest)
    Debug.Assert oMatchCol.Count = 2
    Dim oMatch As Match
    For Each oMatch In oMatchCol
        Debug.Print oMatch.SubMatches(0)
    Next oMatch
End Sub

We Keep Coding
