In regex, is it possible to accurately extract a JSON scalar value?

I wish to extract a scalar value from JSON.
I know JSON uses double quotes.
I know the datatype of the scalar: string, number, date, boolean.
I know the scalar will be on the first level, i.e., not an attribute of an embedded object.
{ "want": "string" } => "string"
{ "want": 123 } => 123
{ "not": { "want": "wrong" }, "want": "right" } => "right"
{ "nothing": 0 } => null / not found
I do not know how to handle the opening/closing quotes, nor do I know how to handle embedded objects.
Is this possible?
This is the best I have come up with so far:
// match `want` attribute
(?:"want"\s*:\s*)
// string, number, boolean or null
(((?:")([^"]*)(?:"))|([-0-9][.eE0-9]*)|true|false|null)
// followed by comma or right bracket
(?:\s*(,|}))
it's good because it
can be run in postgres
grabs strings
grabs numbers
grabs boolean and null
it's bad because it
does not ensure want is a first level attribute
string value cannot have quote (") inside

This expression will get you 50% of the way there:
(?<=:\s*)(".*?"(?<!\\")|\-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?)(?=\s*})
Or, when written as a multi-line regex:
(?x:
(?<=:\s*) # After : + space
(
".*?"(?<!\\") # String in double quotes
| # -or-
\-? # Optional leading -ve
(0|[1-9]\d*) # Number
(\.\d+)? # Optional fraction
([eE][+-]?\d+)? # Optional exponent
)
(?=\s*}) # space + }
)
This will not match your nested object example ({ "not": { "want" ...) or rather, it will match, but on the wrong thing. Also, your final example ({ "nothing": 0 } => null / not found) is difficult because 0 is a valid number. To work around this problem, I would just check the result in procedural code and replace a result of 0 with null.
The nested objects problem is a whole different ball game though. It's getting into the realm of parsing rather than simple tokenizing. At that point, you might as well just use a JSON library because you'd be writing a full JSON parser anyway. Fortunately, JSON is a simple enough grammar that it wouldn't be that expensive to use a third-party library, certainly no more than doing it yourself.
I think the short answer is: from a simple { "name" : <value> } object, yes, but from anything more complicated, no.
For info on the JSON syntax, see http://www.json.org/.
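As a rough illustration of the "just use a JSON library" route, here is a minimal Python sketch. The function name top_level_scalar is made up for this example, and it assumes the input is a single JSON object; it is not a Postgres solution, only a demonstration that a parser handles quotes, escapes and nesting without extra work.

```python
import json

def top_level_scalar(text, key):
    """Return the scalar stored under `key` at the first level, or None."""
    obj = json.loads(text)
    value = obj.get(key) if isinstance(obj, dict) else None
    # Nested objects and arrays are not scalars, so report them as "not found".
    if isinstance(value, (dict, list)):
        return None
    return value

print(top_level_scalar('{ "not": { "want": "wrong" }, "want": "right" }', "want"))
```

Note that a JSON null value and a missing key both come back as None in this sketch; a caller that needs to tell them apart would have to check for the key explicitly.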


Regex- to get part of String

I have the string below and I need to get all the values between Pizzahut: and |.
ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|
I tried the regular expression .scan(/(?<=Pizzahut:)([.*\s\S]+)(?=\|)/) but it fetches
"j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|"
Result should be: j34532jdhgj,3242237,67688873rg
You can use
s='ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|'
p s.scan(/Pizzahut:([^|]+)/).flatten
# => ["j34532jdhgj", "3242237", "67688873rg"]
It seems unlikely that Pizzahut will appear as part of another word, but if it can, use a version with a word boundary: /\bPizzahut:([^|]+)/.
The Pizzahut:([^|]+) matches Pizzahut: and then captures into Group 1 any one or more chars other than a pipe (with ([^|]+)).
Note that String#scan returns the captures only if a pattern contains a capturing group, so you do not need to use lookarounds.
I'm not sure why you're jumping to a regex solution here; that input string clearly looks structured to me, and you would probably do better by splitting it on the delimiters to convert it into a more convenient data structure.
Something like this:
input = "ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg"
converted_input = input
  .split('|') #=> ["ABC:2fg45rdvsg", "Pizzahut:j34532jdhgj", ... ]
  .map { |pair| pair.split(':') } #=> [["ABC", "2fg45rdvsg"], ["Pizzahut", "j34532jdhgj"], ... ]
  .group_by(&:first) #=> {"ABC"=>[["ABC", "2fg45rdvsg"]], "Pizzahut"=>[["Pizzahut", "j34532jdhgj"], ... ], "Dominos"=>[["Dominos", "3424232"]], ... }
  .transform_values { |v| v.flat_map(&:last) }
(The above series of transformations is just one possible way; you could probably come up with a dozen similar alternative steps to convert this input into the same hash shown below! For example, by using reduce or even the CSV library.)
Which gives you the final result:
converted_input = {
  "ABC" => ["2fg45rdvsg"],
  "Pizzahut" => ["j34532jdhgj", "3242237", "67688873rg"],
  "Dominos" => ["3424232"],
  "Wendys" => ["3462783"]
}
Now that the data is formatted conveniently, obtaining data like your original request becomes trivial:
converted_input["Pizzahut"].join(',') #=> "j34532jdhgj,3242237,67688873rg"
(Although quite likely it would be more suitable to leave it as an Array, not a comma-separated String!!)

Regex Multiple rows

I'm trying to get the list of all digits preceding a hyphen in a given string (let's say in cell A1), using a Google Sheets regex formula :
=REGEXEXTRACT(A1, "\d-")
My problem is that it only returns the first match... how can I get all matches?
Example text:
"A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"
My formula returns 1-, whereas I want to get 1-2-2-2-2-2-2-2-2-2-3-3- (either as an array or concatenated text).
I know I could use a script or another function (like SPLIT) to achieve the desired result, but what I really want to know is how I could get a re2 regular expression to return such multiple matches in a "REGEX.*" Google Sheets formula.
Something like the "global - Don't return after first match" option on regex101.com
I've also tried removing the undesired text with REGEXREPLACE, with no success either (I couldn't get rid of other digits not preceding a hyphen).
Any help appreciated!
Thanks :)
You can actually do this in a single formula, using REGEXREPLACE to surround all the values with a capture group instead of replacing the text:
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
Basically, it surrounds every instance of \d- with a capture group; REGEXEXTRACT then neatly returns all the captures. If you want a single string rather than an array, JOIN packs the captures back into a single cell, as in the formula above.
You may create your own custom function in the Script Editor:
function ExtractAllRegex(input, pattern,groupId) {
  return [Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId])];
}
Or, if you need to return all matches in a single cell joined with some separator:
function ExtractAllRegex(input, pattern,groupId,separator) {
  return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}
Then, just call it like =ExtractAllRegex(A1, "\d-", 0, ", ").
Description:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.
Edit
I came up with a more general solution:
=regexreplace(A1,"(.)?(\d-)|(.)","$2")
It replaces any text except the second group match (\d-) with just the second group $2.
"(.)?(\d-)|(.)"
  1    2    3
Groups are in ()
---------------------------------------
"$2" -- means return the group number 2
Learn regular expressions: https://regexone.com
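The same replace-everything-but-group-2 idea can be tried outside Sheets. Below is a minimal Python sketch; the helper name keep_only and the shortened sample string are made up for this example, and it assumes the wanted pattern contains no capturing groups of its own (otherwise the group numbers shift).

```python
import re

def keep_only(pattern, text):
    """Drop everything except matches of `pattern`, like REGEXREPLACE with "$2"."""
    # Group 2 holds the wanted match; groups 1 and 3 swallow every other character.
    # In Python 3.5+, a group that did not take part in the match expands to "".
    return re.sub(r"(.)?(" + pattern + r")|(.)", r"\2", text, flags=re.DOTALL)

sample = "A1-Nutrition;A2-ActPhysiq;A2-Patho-jour;H3-Génétiq"
print(keep_only(r"\d-", sample))  # 1-2-2-3-
```

The hyphen in "Patho-jour" is dropped because it is preceded by a letter, not a digit, so it only ever matches the catch-all third group.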
Try this formula:
=regexreplace(regexreplace(A1,"[^\-0-9]",""),"(\d-)|(.)","$1")
It will handle string like this:
"A1-Nutrition;A2-ActPhysiq;A2-BioM---eta;A2-PH3-Généti***566*9q"
with output:
1-2-2-2-3-
I wasn't able to get the accepted answer to work for my case. I'd like to do it that way, but needed a quick solution and went with the following:
Input:
1111 days, 123 hours 1234 minutes and 121 seconds
Expected output:
1111 123 1234 121
Formula:
=split(REGEXREPLACE(C26,"[a-z,]"," ")," ")
The shortest possible regex:
=regexreplace(A1,".?(\d-)|.", "$1")
Which returns 1-2-2-2-2-2-2-2-2-2-3-3- for "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq".
Explanation of regex:
.? -- optional character
(\d-) -- capture group 1 with a digit followed by a dash (use (\d+-) to match multiple digits)
| -- logical or
. -- any character
the replacement "$1" uses just the capture group 1, and discards anything else
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
This seems to work and I have tried to verify it.
The logic is
(1) Replace letter followed by hyphen with nothing
(2) Replace any digit not followed by a hyphen with nothing
(3) Replace everything which is not a digit or hyphen with nothing
=regexreplace(A1,"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]","")
Result
1-2-2-2-2-2-2-2-2-2-3-3-
Analysis
I had to step through these procedurally to convince myself that this was correct. According to this reference when there are alternatives separated by the pipe symbol, regex should match them in order left-to-right. The above formula doesn't work properly unless rule 1 comes first (otherwise it reduces all characters except a digit or hyphen to null before rule (1) can come into play and you get an extra hyphen from "Patho-jour").
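The ordering effect described above can be reproduced in a few lines of Python. This is only a sketch: Python's re module is a different engine from the RE2 engine behind Google Sheets, and the assumption here is that both prefer the leftmost alternative that matches at a given position. The sample string is a shortened, made-up version of the example text.

```python
import re

text = "A2-BioMeta;A2-Patho-jour;H3-Endocrino"

# Rule 1 first: the "o-" in "Patho-jour" is consumed as a single unit.
rule1_first = re.sub(r"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]", "", text)

# Rule 1 last: the "o" is consumed alone by the character class,
# which strands the hyphen that follows it.
rule1_last = re.sub(r"[0-9][^-]|[a-zA-Z;/é]|[a-zA-Z]-", "", text)

print(rule1_first)  # 2-2-3-
print(rule1_last)   # 2-2--3-
```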
The solution of adding capture groups with REGEXREPLACE and then running REGEXEXTRACT works here too, but there is a catch.
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
If the cell you are extracting from contains special regex characters such as a parenthesis "(" or a question mark "?", the solution provided won't work.
In my case, I was trying to list all the "variable texts" contained in the cell, which were written like this: "{example_name}". But the full content of the cell had special characters that made the regex formula break. Once I removed those special characters, I could list all the captured groups as the solution describes.
There are two general ('Excel' / 'native' / non-Apps Script) solutions to return an array of regex matches in the style of REGEXEXTRACT:
Method 1)
insert a delimiter around matches, remove junk, and call SPLIT
Regexes work by iterating over the string from left to right, and 'consuming'. If we are careful to consume junk values, we can throw them away.
(This gets around the problem faced by the currently accepted solution, which is that as Carlos Eduardo Oliveira mentions, it will obviously fail if the corpus text contains special regex characters.)
First we pick a delimiter, which must not already exist in the text. The proper way to do this is to parse the text to temporarily replace our delimiter with a "temporary delimiter", like if we were going to use commas "," we'd first replace all existing commas with something like "<<QUOTED-COMMA>>" then un-replace them later. BUT, for simplicity's sake, we'll just grab a random character such as  from the private-use unicode blocks and use it as our special delimiter (note that it is 2 bytes... google spreadsheets might not count bytes in graphemes in a consistent way, but we'll be careful later).
=SPLIT(
  LAMBDA(temp,
    MID(temp, 1, LEN(temp)-LEN(""))
  )(
    REGEXREPLACE(
      "xyzSixSpaces:[ ]123ThreeSpaces:[ ]aaaa 12345",".*?( |$)",
      "$1"
    )
  ),
  ""
)
We just use a lambda to define temp="match1match2match3", then use that to remove the last delimiter, giving "match1match2match3", then SPLIT it.
Taking COLUMNS of the result will prove that the correct result is returned, i.e. {" ", " ", " "}.
This is a particularly good function to turn into a Named Function, and call it something like REGEXGLOBALEXTRACT(text,regex) or REGEXALLEXTRACT(text,regex), e.g.:
=SPLIT(
  LAMBDA(temp,
    MID(temp, 1, LEN(temp)-LEN(""))
  )(
    REGEXREPLACE(
      text,
      ".*?("&regex&"|$)",
      "$1"
    )
  ),
  ""
)
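To see the mechanics of Method 1 outside a spreadsheet, here is a minimal Python sketch of the same steps: lazily consume junk up to each match, emit the match plus a delimiter, then split. The name regex_all_extract and the choice of U+E000 as the delimiter are assumptions for this example; the delimiter must not occur in the text.

```python
import re

DELIM = "\ue000"  # private-use character, assumed absent from the text

def regex_all_extract(text, regex):
    """Return every match of `regex` in `text` via replace-then-split."""
    # ".*?" lazily eats junk up to the next match (or the end of the text);
    # only the captured match survives, followed by the delimiter.
    marked = re.sub(
        r".*?(" + regex + r"|$)",
        lambda m: m.group(1) + DELIM,
        text,
        flags=re.DOTALL,
    )
    # Drop empty pieces, which is what SPLIT does by default in Sheets.
    return [piece for piece in marked.split(DELIM) if piece]

print(regex_all_extract("A1-Nutrition;A2-ActPhysiq", r"\d-"))  # ['1-', '2-']
```

Because empty pieces are filtered out, the sketch does not need the MID step that trims the trailing delimiter in the formula.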
Method 2)
use recursion
With LAMBDA (which lets you define a function, as in any other programming language), you can use some tricks from the well-studied lambda calculus and functional programming: you have access to recursion. Defining a recursive function is confusing because there's no easy way for it to refer to itself, so you have to use a trick/convention:
trick for recursive functions: to actually define a function f which needs to refer to itself, instead define a function that takes a parameter of itself and returns the function you actually want; pass this 'convention' to the Y-combinator to turn it into an actual recursive function
The plumbing which makes such a function work is called the Y-combinator.
For example to get the result of 5! (5 factorial, i.e. implement our own FACT(5)), we could define:
Named Function Y(f)=LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) ) (this is the Y-combinator and is magic; you don't have to understand it to use it)
Named Function MY_FACTORIAL(n)=
Y(LAMBDA(self,
  LAMBDA(n,
    IF(n=0, 1, n*self(n-1))
  )
))(n)
result of MY_FACTORIAL(5): 120
The Y-combinator makes writing recursive functions look relatively easy, like an introduction to programming class. I'm using Named Functions for clarity, but you could just dump it all together at the expense of sanity...
=LAMBDA(Y,
  Y(LAMBDA(self, LAMBDA(n, IF(n=0,1,n*self(n-1))) ))(5)
)(
  LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) )
)
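If the nested LAMBDAs are hard to read, the same construction can be written in Python, where it is easy to experiment with. This is a sketch that mirrors the structure of the Y definition above one-to-one; the names Y and factorial are just the ones used in this example.

```python
# Fixed-point combinator for a strictly evaluated language, with the same
# shape as the Named Function Y: the inner "lambda y" delays the
# self-application so that it does not loop forever.
Y = lambda f: (lambda x: x(x))(lambda x: f(lambda y: x(x)(y)))

# The "convention": a function that receives itself as `self`
# and returns the function we actually want.
factorial = Y(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))

print(factorial(5))  # 120
```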
How does this apply to the problem at hand? Well a recursive solution is as follows:
in pseudocode below, I use 'function' instead of LAMBDA, but it's the same thing:
// code to get around the fact that you can't have 0-length arrays
function emptyList() {
  return {"ignore this value"}
}
function listToArray(myList) {
  return OFFSET(myList,0,1)
}
function allMatches(text, regex) {
  return allMatchesHelper(emptyList(), text, regex)
}
function allMatchesHelper(resultsToReturn, text, regex) {
  currentMatch = REGEXEXTRACT(...)
  if (currentMatch succeeds) {
    textWithoutMatch = SUBSTITUTE(text, currentMatch, "", 1)
    return allMatchesHelper(
      {resultsToReturn,currentMatch},
      textWithoutMatch,
      regex
    )
  } else {
    return listToArray(resultsToReturn)
  }
}
Unfortunately, the recursive approach is quadratic order of growth (because it's appending the results over and over to itself, while recreating the giant search string with smaller and smaller bites taken out of it, so 1+2+3+4+5+... = big^2, which can add up to a lot of time), so may be slow if you have many many matches. It's better to stay inside the regex engine for speed, since it's probably highly optimized.
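For comparison, the recursive pseudocode translates to Python roughly as follows. This is a sketch under a few assumptions: the name all_matches is made up, re.search stands in for REGEXEXTRACT, and str.replace with a count of 1 stands in for SUBSTITUTE(text, currentMatch, "", 1). Python has real empty lists, so the placeholder-element workaround is not needed.

```python
import re

def all_matches(text, regex, results=()):
    """Collect matches by extracting one, deleting it from the text, and recursing."""
    match = re.search(regex, text)
    # Stop when nothing matches; also stop on an empty match to avoid looping forever.
    if match is None or match.group(0) == "":
        return list(results)
    # Remove only the first occurrence of the matched text, as SUBSTITUTE does.
    remaining = text.replace(match.group(0), "", 1)
    return all_matches(remaining, regex, results + (match.group(0),))

print(all_matches("A1-Nutrition;A2-ActPhysiq", r"\d-"))  # ['1-', '2-']
```

Like the spreadsheet version, this rebuilds the remaining text on every step, and Python's default recursion limit caps the number of matches at roughly a thousand.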
You could of course avoid using Named Functions by doing temporary bindings with LAMBDA(varName, expr)(varValue) if you want to use varName in an expression. (You can define this pattern as a Named Function =cont(varValue) to invert the order of the parameters to keep code cleaner, or not.)
Whenever I use varName = varValue, write that instead.
to see if a match succeeds, use ISNA(...)
It would look something like:
Named Function allMatches(text, regex):
UNTESTED:
LAMBDA(helper,
  OFFSET(
    helper({"ignore"}, text),
  0,1)
)(
  Y(LAMBDA(helperItself,
    LAMBDA(results, partialText,
      LAMBDA(currentMatch,
        IF(ISNA(currentMatch),
          results,
          LAMBDA(textWithoutMatch,
            helperItself({results,currentMatch}, textWithoutMatch)
          )(
            SUBSTITUTE(partialText, currentMatch, "", 1)
          )
        )
      )(
        REGEXEXTRACT(partialText, regex)
      )
    )
  ))
)

eval certain regex from file to replace chars in string

I'm new to ruby so please excuse my ignorance :)
I just learned about eval and I read about its dark sides.
what I've read so far:
When is eval in Ruby justified?
Is 'eval' supposed to be nasty?
Ruby Eval and the Execution of Ruby Code
So what I have to do is read a file containing lines such as /e/ 3, which should replace each e with a 3 after evaluation.
Here is what I did so far (working, but..):
def evaluate_lines
  result="elt"
  IO.foreach("test.txt") do |reg|
    reg=reg.chomp.delete(' ')
    puts reg
    result=result.gsub(eval(reg[0..2]),"#{reg[3..reg.length]}" )
    p result
  end
end
contents of the test.txt file
/e/ 3
/l/ 1
/t/ 7
/$/ !
/$/ !!
This only works because I know the length of the lines in the file.
So if my file contained, say, /a-z/ 3, my program would not be able to do what is expected of it.
Note
I tried using Regexp.new reg, which resulted in /\/e\/3/; that isn't very helpful in this case.
A simple example with Regexp:
str="/e/3"
result="elt"
result=result.gsub(Regexp.new str)
p result #outputs: #<Enumerator: "elt":gsub(/\/e\/3/)>
I already tried stripping off the slashes, but even that won't deliver the desired result, because gsub() takes two parameters, as in gsub(/e/, "3").
For the usage of Regexp, I have already read Convert a string to regular expression ruby
While you can write something to parse that file, it rapidly gets complicated because you have to parse regular expressions. Consider /\/foo\\/.
There are a number of incomplete solutions. You can split on whitespace, but this will fail on /foo bar/.
re, replace = line.split(/\s+/, 2)
You can use a regex. Here's a first stab.
match = "/3/ 4".match(%r{^/(.*)/\s+(.+)})
This fails on escaped /, we need something more complex.
match = '/3\// 4'.match(%r{\A / ((?:[^/]|\\/)*) / \s+ (.+)}x)
I'm going to guess it was not your teacher's intent to have you parsing regexes. For the purposes of the assignment, splitting on whitespace is probably fine. You should clarify with your teacher.
This is a poor data format. It is non-standard, difficult to parse, and has limitations on the replacement. Even a tab-delimited file would be better.
There's little reason to use a non-standard format these days. The simplest thing is to use a standard data format for the file. YAML or JSON are the most obvious choices. For such simple data, I'd suggest JSON.
[
{ "re": "e", "replace": "3" },
{ "re": "l", "replace": "1" }
]
Parsing the file is trivial, use the built-in JSON library.
require 'json'
specs = JSON.parse(File.read("test.json"))
And then you can use them as a list of hashes.
specs.each do |spec|
  # No eval necessary.
  re = Regexp.new(spec["re"])
  # `gsub!` replaces in place
  result.gsub!(re, spec["replace"])
end
The data file is extensible. For example, if later you want to add regex options.
[
{ "re": "e", "replace": "3" },
{ "re": "l", "replace": "1", "options": ["IGNORECASE"] }
]
While the teacher may have specified a poor format, pushing back on bad requirements is good practice for being a developer.
Here's a really simple example that uses vi notation like s/.../.../ and s/.../.../g:
def rsub(text, spec)
  _, mode, repl, with, flags = spec.match(%r[\A(.)\/((?:[^/]|\\/)*)/((?:[^/]|\\/)*)/(\w*)\z]).to_a
  case (mode)
  when 's'
    if (flags.include?('g'))
      text.gsub(Regexp.new(repl), with)
    else
      text.sub(Regexp.new(repl), with)
    end
  end
end
Note the matcher looks for non-slash characters ([^/]) or a literal-slash combination (\\/) and splits out the two parts accordingly.
Where you can get results like this:
rsub('sandwich', 's/and/or/')
# => "sorwich"
rsub('and/or', 's/\//,/')
# => "and,or"
rsub('stack overflow', 's/o/O/')
# => "stack Overflow"
rsub('stack overflow', 's/o/O/g')
# => "stack OverflOw"
The principle here is you can use a very simple regular expression to parse out your input regular expression and feed that cleaned up data into Regexp.new. There is absolutely no need for eval here, and if anything that severely limits what you can do.
With a little work you could alter that regular expression to parse what's in your existing file and make it do what you want.

Railo, remove some double quotes from SerializeJSON result

Let's say I have:
<cfscript>
arrButtons = [
{
"name" = "Add",
"bclass" = "add",
"onpress" = "addItem"
},
{
"name" = "Edit",
"bclass" = "edit",
"onpress" = "editItem"
},
{
"name" = "Delete",
"bclass" = "delete",
"onpress" = "deleteItem"
}
];
jsButtons = SerializeJSON(arrButtons);
// result :
// [{"onpress":"addItem","name":"Add","bclass":"add"},{"onpress":"editItem","name":"Edit","bclass":"edit"},{"onpress":"deleteItem","name":"Delete","bclass":"delete"}]
</cfscript>
For every onpress item, I need to remove the double quotes from its value to match the JS library requirement (the onpress value must be a callback function).
How do I remove the double quotes using a regular expression?
The final result must be:
[{"onpress":addItem,"name":"Add","bclass":"add"},{"onpress":editItem,"name":"Edit","bclass":"edit"},{"onpress":deleteItem,"name":"Delete","bclass":"delete"}]
No double quotes surrounding addItem, editItem, and deleteItem.
Edit 2012-07-13
Why do I need this? I created a CFML function whose result is a collection of JS that will be used in many files. The jsButtons object will be used as one part of the options available in the JS library. One of that function's arguments is an array of structs (the default is arrButtons), and the supplied argument value can be merged with the default value.
Since we can't (in CFML) write the onpress value without double quotes, I have to add double quotes to that value, convert the (CFML) array of structs to JSON (which is just a string), and remove the double quotes before placing it in the JS library option.
With Railo, we can declare the struct as a linked struct to make sure we have the same key order for loops or conversion (in the example above, onpress would then always be the last key in the struct). With a linked struct and a fixed key order, we can remove the double quotes with a simple Replace function, but of course we can't guarantee that every programmer who uses the CFML function remembers to use a linked struct with the same key order as the example above.
I'm not sure this is actually necessary - depending on how/where you're dealing with the JS callbacks, it might be possible to use the string function names to reference the function without needing to remove the quotes (i.e. object[button.onpress]).
However, since you asked, here is a regex solution:
jsButtons = jsButtons.replaceAll('(?<="onpress":)"([^"]+)"','$1');
The regex there is made up of two parts:
(?<="onpress":) -- lookbehind to ensure we are dealing with the text "onpress":
"([^"]+)" -- match the quotes and capture their contents.
The $1 on the replacement side is to replace the matched text (i.e. the entire quoted value) with the first capture group (i.e. the contents of the quotes).
If case-sensitivity of "onpress" might be an issue, you can prefix the regex with (?i) to ignore case.
If there will be multiple different events (not just "onpress") you can update the relevant part of the expression above to be (?<="on(?:press|hover|squeek)":) etc.
Note: All the above relies on the format output from serializeJson not changing - if it's possible that there might be comments, whitespace, single quotes, or anything else in future then a longer expression would be needed to cater for those - which is part of why you should investigate if you even need regex to solve this problem in the first place.
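The lookbehind-plus-capture pattern above is not specific to CFML. As a quick way to experiment with it, here is a minimal Python sketch; the function name unquote_callback is made up for this example, and it carries the same assumption as the answer, namely that the serialized output has no whitespace between the key, the colon and the value.

```python
import re

def unquote_callback(json_text, key="onpress"):
    """Strip the double quotes around the value of `key` in serialized JSON text."""
    # Lookbehind: only touch a quoted value that directly follows "key":
    pattern = r'(?<="' + re.escape(key) + r'":)"([^"]+)"'
    # Replace the whole quoted value with just its contents (group 1).
    return re.sub(pattern, r"\1", json_text)

serialized = '[{"onpress":"addItem","name":"Add","bclass":"add"}]'
print(unquote_callback(serialized))
# [{"onpress":addItem,"name":"Add","bclass":"add"}]
```

Python's re module only accepts fixed-width lookbehinds, so matching several event names of different lengths would need a capturing group in place of the lookbehind.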
What you're wanting to output is not JSON, so using SerializeJSON is a kludge.
Is there any reason you are putting it into a ColdFusion Array first, instead of writing the Javascript directly?
JSON is purely meant to be a data description language. Per http://www.json.org, it is a "lightweight data-interchange format", not a programming language.
Per http://en.wikipedia.org/wiki/JSON, the "basic types" supported are:
Number (integer, real, or floating point)
String (double-quoted Unicode with backslash escaping)
Boolean (true and false)
Array (an ordered sequence of values, comma-separated and enclosed in square brackets)
Object (collection of key:value pairs, comma-separated and enclosed in curly braces)
null
I guess in this case you can simply use serialize(). That should do the trick...
Gert

Vim regex to find missing JSON element

I have a big JSON file, formatted over multiple lines. I want to find objects that don't have a given property. The objects are guaranteed not to contain any further nested objects. Say the given property was "bad", then I would want to locate the value of "foo" in the second element in the following (but not in the first element).
{
result: [
{
"foo" : {
"good" : 1,
"bad" : 0
},
"bar" : 123
},
{
"foo" : {
"good" : 1
},
"bar" : 123
}
]
}
I know about multi-line regexes in Vim but I can't get anything that does what I want. Any pointers?
Try the following:
/\v"foo"\_s*:\_s*\{%(%(\_[\t ,]"bad"\_s*:)@!\_.){-}\}
When you need to exclude something, you should look at negative look-aheads or look-behinds (the latter are slower, and, unlike Vim, Perl/PCRE regular expressions only support look-behinds of fixed width, or a set of fixed-width alternatives).
JSON is a context free grammar and as such is not regular. Unless you can give a much stricter set of rules to go on, no regex will be able to do what you want.
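If stepping outside Vim is an option, a parser sidesteps the problem entirely. The following Python sketch is only an illustration: the function name missing_property is made up, and it assumes the file is valid JSON with a quoted top-level "result" key (the sample in the question leaves result unquoted, which a strict parser would reject).

```python
import json

def missing_property(document, outer_key, required):
    """Return (index, value) for each element whose `outer_key` object lacks `required`."""
    data = json.loads(document)
    found = []
    for index, element in enumerate(data["result"]):
        inner = element.get(outer_key, {})
        if required not in inner:
            found.append((index, inner))
    return found

doc = """
{
  "result": [
    { "foo": { "good": 1, "bad": 0 }, "bar": 123 },
    { "foo": { "good": 1 }, "bar": 123 }
  ]
}
"""
print(missing_property(doc, "foo", "bad"))  # [(1, {'good': 1})]
```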