Get values of nested json with regex - regex

I'm trying to select all of the lowest tier(in the "nest") possible values that are numbers, and not strings, in a nested json. This will include arrays and objects. Currently, im using this code
(?<=:)\s*([+-]?\d+(?:\.\d*(?:E-?\d+)?)?)\b\s*(?=(?:\{|,)\s*\"[^\"]*\":)
(which is mostly from this question)
and it works, if it isn't the last value of an object, or in a list. My main problem is, that it can be fooled. for example, if i have these two "a":":1,", ":":akey-value pairs next to eachother, it sees the number 1 as a value, except that the actual value is ":1,". How can i make a foolproof system for this?
Im excpecting it to not select anything, that couldn't be an integer or a float. I would like to keep the middle part of my original regex, beacuse it also needs to work with scientific notations.

Related

Maxima: creating a function that acts on parts of a string

Context: I'm using Maxima on a platform that also uses KaTeX. For various reasons related to content management, this means that we are regularly using Maxima functions to generate the necessary KaTeX commands.
I'm currently trying to develop a group of functions that will facilitate generating different sets of strings corresponding to KaTeX commands for various symbols related to vectors.
Problem
I have written the following function makeKatexVector(x), which takes a string, list or list-of-lists and returns the same type of object, with each string wrapped in \vec{} (i.e. makeKatexVector(string) returns \vec{string} and makeKatexVector(["a","b"]) returns ["\vec{a}", "\vec{b}"] etc).
/* Flexible Make KaTeX Vector Version of List Items */
makeKatexVector(x):= block([ placeHolderList : x ],
if stringp(x) /* Special Handling if x is Just a String */
then placeHolderList : concat("\vec{", x, "}")
else if listp(x[1]) /* check to see if it is a list of lists */
then for j:1 thru length(x)
do placeHolderList[j] : makelist(concat("\vec{", k ,"}"), k, x[j] )
else if listp(x) /* check to see if it is just a list */
then placeHolderList : makelist(concat("\vec{", k, "}"), k, x)
else placeHolderList : "makeKatexVector error: not a list-of-lists, a list or a string",
return(placeHolderList));
Although I have my doubts about the efficiency or elegance of the above code, it seems to return the desired expressions; however, I would like to modify this function so that it can distinguish between single- and multi-character strings.
In particular, I'd like multi-character strings like x_1 to be returned as \vec{x}_1 and not \vec{x_1}.
In fact, I'd simply like to modify the above code so that \vec{} is wrapped around the first character of the string, regardless of how many characters there may be.
My Attempt
I was ready to tackle this with brute force (e.g. transcribing each character of a string into a list and then reassembling); however, the real programmer on the project suggested I look into "Regular Expressions". After exploring that endless rabbit hole, I found the command regex_subst; however, I can't find any Maxima documentation for it, and am struggling to reproduce the examples in the related documentation here.
Once I can work out the appropriate regex to use, I intend to implement this in the above code using an if statement, such as:
if slength(x) >1
then {regex command}
else {regular treatment}
If anyone knows of helpful resources on any of these fronts, I'd greatly appreciate any pointers at all.
Looks like you got the regex approach working, that's great. My advice about handling subscripted expressions in TeX, however, is to avoid working with names which contain underscores in Maxima, and instead work with Maxima expressions with indices, e.g. foo[k] instead of foo_k. While writing foo_k is a minor convenience in Maxima, you'll run into problems pretty quickly, and in order to straighten it out you might end up piling one complication on another.
E.g. Maxima doesn't know there's any relation between foo, foo_1, and foo_k -- those have no more in common than foo, abc, and xyz. What if there are 2 indices? foo_j_k will become something like foo_{j_k} by the preceding approach -- what if you want foo_{j, k} instead? (Incidentally the two are foo[j[k]] and foo[j, k] when represented by subscripts.) Another problematic expression is something like foo_bar_baz. Does that mean foo_bar[baz], foo[bar_baz] or foo_bar_baz?
The code for tex(x_y) yielding x_y in TeX is pretty old, so it's unlikely to go away, but over the years I've come to increasing feel like it should be avoided. However, the last time it came up and I proposed disabling that, there were enough people who supported it that we ended up keeping it.
Something that might be helpful, there is a function texput which allows you to specify how a symbol should appear in TeX output. For example:
(%i1) texput (v, "\\vec{v}");
(%o1) "\vec{v}"
(%i2) tex ([v, v[1], v[k], v[j[k]], v[j, k]]);
$$\left[ \vec{v} , \vec{v}_{1} , \vec{v}_{k} , \vec{v}_{j_{k}} ,
\vec{v}_{j,k} \right] $$
(%o2) false
texput can modify various aspects of TeX output; you can take a look at the documentation (see ? texput).
While I didn't expect that I'd work this out on my own, after several hours, I made some progress, so figured I'd share here, in case anyone else may benefit from the time I put in.
to load the regex in wxMaxima, at least on the MacOS version, simply type load("sregex");. I didn't have this loaded, and was trying to work through our custom platform, which cost me several hours.
take note that many of the arguments in the linked documentation by Dorai Sitaram occur in the reverse, or a different order than they do in their corresponding Maxima versions.
not all the "pregexp" functions exist in Maxima;
In addition to this, escaping special characters varied in important ways between wxMaxima, the inline Maxima compiler (running within Ace editor) and the actual rendered version on our platform; in particular, the inline compiler often returned false for expressions that compiled properly in wxMaxima and on the platform. Because I didn't have sregex loaded on wxMaxima from the beginning, I lost a lot of time to this.
Finally, the regex expression that achieved the desired substitution, in my case, was:
regex_subst("\vec{\\1}", "([[:alpha:]])", "v_1");
which returns vec{v}_1 in wxMaxima (N.B. none of my attempts to get wxMaxima to return \vec{v}_1 were successful; escaping the backslash just does not seem to work; fortunately, the usual escaped version \\vec{\\1} does return the desired form).
I have yet to adjust the code for the rest of the function, but I doubt that will be of use to anyone else, and wanted to be sure to post an update here, before anyone else took time to assist me.
Always interested in better methods / practices or any other pointers / feedback.

How do I prevent strings containing the word "yes" or "no" from printing out as true or false? [duplicate]

This question already has answers here:
How can I prevent SerializeJSON from changing Yes/No/True/False strings to boolean?
(7 answers)
Closed 6 years ago.
I'm currently setting a number of variables like so:
<cfset printPage = "YES">
Eventually, when I print these variables out, anything that I try to set to "YES" prints out as "true". Anything set to "NO", prints out as "false". I'm not opposed to using the YesNoFormat function. In fact I might end up using that function in this application, but in the mean time I would like to know if ColdFusion is actually storing the words "YES" and "NO" in memory, or if it is converting them to a boolean format behind the scenes.
If CF is storing my variables exactly the way that I declare them, how would I go about retrieving these variables as strings? If CF is changing the variables in some way, are there any special characters or keywords that I could use to force it to store the variables as strings?
Thank you to everyone that commented / answered. I did a little more experimenting and reading, and it seems that the serializeJSON function will automatically convert "Yes" to "true" and "No" to "false". I either need to deal with this problem in my javascript, or I can add a space in the affected properties to circumvent this behavior.
You already know how to display the boolean value as "yes" or "no" (using YesNoFormat()). I don't think there is a way to force ColdFusion to store a variable a certain way. It just doesn't support that. I guess you could call out the Java type directly by using JavaCast(). I just don't see why you would want to go through that extra work for something like this. You can certainly research that a bit more if you like. Here is a link to the document for JavaCast.
Have a look at this document regarding data types in ColdFusion. I will post some of the relevant points from that document here but please read that page for more information.
ColdFusion is often referred to as typeless because you do not assign types to variables and ColdFusion does not associate a type with the variable name. However, the data that a variable represents does have a type, and the data type affects how ColdFusion evaluates an expression or function argument. ColdFusion can automatically convert many data types into others when it evaluates expressions. For simple data, such as numbers and strings, the data type is unimportant until the variable is used in an expression or as a function argument.
ColdFusion variable data belongs to one of the following type categories:
Simple One value. Can use directly in ColdFusion expressions. Include numbers, strings, Boolean values, and date-time values.
Binary Raw data, such as the contents of a GIF file or an executable program file.
Complex A container for data. Generally represent more than one value. ColdFusion built-in complex data types include arrays, structures, queries, and XML document objects. You cannot use a complex variable, such as an array, directly in a ColdFusion expression, but you can use simple data type elements of a complex variable in an expression. For example, with a one-dimensional array of numbers called myArray, you cannot use the expression myArray * 5. However, you could use an expression myArray[3] * 5 to multiply the third element in the array by five.
Objects Complex constructs. Often encapsulate both data and functional operations.
It goes on to say this regarding Data Types:
Data type notes
Although ColdFusion variables do not have types, it is often convenient to use “variable type” as a shorthand for the type of data that the variable represents.
ColdFusion provides the following functions for identifying the data type of a variable:
IsArray
IsBinary
IsBoolean
IsImage
IsNumericDate
IsObject
IsPDFObject
IsQuery
IsSimpleValue
IsStruct
IsXmlDoc
ColdFusion also includes the following functions for determining whether a string can be represented as or converted to another data type:
IsDate
IsNumeric
IsXML
So in your code you could use something like IsBoolean(printPage) to check if it contains a boolean value. Of course that doesn't mean it is actually stored as a boolean but that ColdFusion can interpret it's value as a boolean.

Two reason why zipcode should be stored as a string data type

In the textbook "Starting Out In C++" by Gaddis in chapter 1 the author says that some numbers like zip codes are intended for humans to read, to be printed out on the screen to look at and to not calculate with so they should be stored in string data type not numeric data types. But there is a couple of other reasons why this statement is true. The only other reason why I can think this would be true is if you were to enter a zip code with an ending like 37217-1221 you may have to use string catenation to only use the first five digits chopping of the characters after the -1221. What would be some other reasons for the statement "If a number is not going to be used in an arithmetic operation, store it in a string data type". Any answers would be greatly appreciated.
Zipcodes simply are not numeric data. As you point out, zipcodes can contain extensions, which numeric data does not represent. They can also contain significant leading zeros. Some postal code schemes can also contain letters.
Your questions was a bit...not of a questions? That's the best I can explain it. Anyway, a string is text and an integer or number is numerical and should only be used for calculations or counting. For example:
A zip code is a number but you will never do calculations with it. A zip code is something you reference as a place and has no counting purpose. If you think this could confuse you later on try to give the variable with the zip code an assignment of a String so that you cannot try to do any sort of math with the variable.

Stream or Iterator to generate all strings that match a regular expression?

This is a follow-up to my previous question.
Suppose I want to generate all strings that match a given (simplified) regular expression.
It is just a coding exercise and I do not have any additional requirements (e.g. how many strings are generated actually). So the main requirement is to produce nice, clean, and simple code.
I thought about using Stream but after reading this question I am thinking about Iterator. What would you use?
The solution to this question asks for too much code for it to be practical to answer here, but the outline goes as follows.
First, you want to parse your regular expression--you can look into parser combinators for this, for example. You'll then have an evaluation tree that looks like, for example,
List(
Constant("abc"),
ZeroOrOne(Constant("d")),
Constant("efg"),
OneOf(Constant("h"),List(Constant("ij"),ZeroOrOne(Constant("klmnop")))),
Constant("qrs"),
AnyChar()
)
Rather than running this expression tree as a matcher, you can run it as a generator by defining a generate method on each term. For some terms, (e.g. ZeroOrOne(Constant("d"))), there will be multiple options, so you can define an iterator. One way to do this is to store internal state in each term and pass in either an "advance" flag or a "reset" flag. On "reset", the generator returns the first possible match (e.g. ""); on advance, it goes to the next one and returns that (e.g. "d") while consuming the advance flag (leaving the rest to evaluate with no flags). If there are no more items, it produces a reset instead for everything inside itself and leaves the advance flag intact for the next item. You start by running with a reset; on each iteration, you put an advance in, and stop when you get it out again.
Of course, some regex constructs like "d+" can produce infinitely many values, so you'll probably want to limit them in some way (or at some point return e.g. d...d meaning "lots"); and others have very many possible values (e.g. . matches any char, but do you really want all 64k chars, or howevermany unicode code points there are?), and you may wish to restrict those also.
Anyway, this, though time-consuming, will result in a working generator. And, as an aside, you'll also have a working regex matcher, if you write a match routine for each piece of the parsed tree.

Using the Levenshtein distance in a spell checker

I am working on a spell checker in C++ and I'm stuck at a certain step in the implementation.
Let's say we have a text file with correctly spelled words and an inputted string we would like to check for spelling mistakes. If that string is a misspelled word, I can easily find its correct form by checking all words in the text file and choosing the one that differs from it with a minimum of letters. For that type of input, I've implemented a function that calculates the Levenshtein edit distance between 2 strings. So far so good.
Now, the tough part: what if the inputted string is a combination of misspelled words? For example, "iloevcokies". Taking into account that "i", "love" and "cookies" are words that can be found in the text file, how can I use the already-implemented Levenshtein function to determine which words from the file are suitable for a correction? Also, how would I insert blanks in the correct positions?
Any idea is welcome :)
Spelling correction for phrases can be done in a few ways. One way requires having an index of word bi-grams and tri-grams. These of course could be immense. Another option would be to try permutations of the word with spaces inserted, then doing a lookup on each word in the resulting phrase. Take a look at a simple implementation of a spell checker by Peter Norvig from Google. Either way, consider using an n-gram index for better performance, there are libraries available in C++ for reference.
Google and other search engines are able to do spelling correction on phrases because they have a large index of queries and associated result sets, which allows them to calculate a statistically good guess. Overall, the spelling correction problem can become very complex with methods like context-sensitive correction and phonetic correction. Given that using permutations of possible sub-terms can become expensive you can utilize certain types of heuristics, however this can get out of scope quick.
You may also consider using and existing spelling library, such as aspell.
A starting point for an idea: one of the top hits of your L-distance for "iloevcokies" should be "cookies". If you can change your L-distance function to also track and return a min-index and max-index (i.e., this match is best starting from character 5 and going to character 10) then you could remove that substring and re-check L-distance for the string before it and after it, then concatenate those for a suggestion....
Just a thought, good luck....
I will suppose that you have an existing index, on which you run your levenshtein distance (for example, a Trie, but any sorted index generally work well).
You can consider the addition of white-spaces as a regular edit operation, it's just that there is a twist: you need (then) to get back to the root of your index for the next word.
This way you get the same index, almost the same route, approximately the same traversal, and it should not even impact your running time that much.