String replacing nested JSON in Scala - regex

I have a Scala method that will be given a String like so:
"blah blah sediejdri \"foos\": {\"fizz\": \"buzz\"}, odedrfj49 blah"
And I need to strip the "foos JSON" out of it using pure Java/Scala (no external libs). That is, find the substring matching the pattern:
\"foos\" : {ANYTHING},
...and strip it out, so that the input string is now:
"blah blah sediejdri odedrfj49 blah"
The token to search for will always be \"foos\", but the content inside the JSON curly braces will always be different. My best attempt is:
// Ex: "blah \"foos\": { flim flam }, blah blah" ==> "blah blah blah", etc.
def stripFoosJson(var : toClean : String) : String = {
val regex = ".*\"foos\" {.*},.*"
toClean.replaceAll(regex, "")
}
However, my regex is clearly not correct. Can anyone spot where I'm going awry?

Here are 2 solutions I came up with, hope it helps. I think you forgot to handle possible spaces with \s* etc.
object JsonStrip extends App {
// SOLUTION 1, hard way, handles nested braces also:
def findClosingParen(text: String, openPos: Int): Int = {
var closePos = openPos
var parensCounter = 1 // if (parensCounter == 0) it's a match!
while (parensCounter > 0 && closePos < text.length - 1) {
closePos += 1
val c = text(closePos)
if (c == '{') {
parensCounter += 1
} else if (c == '}') {
parensCounter -= 1
}
}
if (parensCounter == 0) closePos else openPos
}
val str = "blah blah sediejdri \"foos\": {\"fizz\": \"buzz\"}, odedrfj49 blah"
val indexOfFoos = str.indexOf("\"foos\"")
val indexOfFooOpenBrace = str.indexOf('{', indexOfFoos)
val indexOfFooCloseBrace = findClosingParen(str, indexOfFooOpenBrace)
// here you would handle if the brace IS found etc...
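// drop the closing brace plus the trailing comma and whitespace (a fixed "+ 2" offset leaves a double space)
val stripped = str.substring(0, indexOfFoos) + str.substring(indexOfFooCloseBrace + 1).replaceFirst("^\\s*,\\s*", "")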
println("WITH BRACE COUNT: " + stripped)
// SOLUTION 2, with regex:
val reg = "\"foos\"\\s*:\\s*\\{(.*)\\}\\s*,\\s*"
println("WITH REGEX: " + str.replaceAll(reg, ""))
}
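One thing to watch with the regex in Solution 2: `(.*)` is greedy, so if a later `},` appears in the input, the match runs to the last one. The sketch below is in JavaScript rather than Scala and uses a made-up input with a second object, purely to contrast greedy `(.*)` with lazy `(.*?)`; the quantifiers behave the same way in java.util.regex.

```javascript
// Hypothetical input: a second object follows the "foos" one.
const sample = 'a "foos": {"fizz": "buzz"}, b "bars": {"x": 1}, c';

// Greedy: .* grabs as much as possible, so the match ends at the LAST "},".
const greedy = /"foos"\s*:\s*\{(.*)\}\s*,\s*/;
// Lazy: .*? stops at the FIRST "}," it can reach.
const lazy = /"foos"\s*:\s*\{(.*?)\}\s*,\s*/;

const greedyResult = sample.replace(greedy, "");
const lazyResult = sample.replace(lazy, "");

console.log(greedyResult); // the "bars" object is removed as well
console.log(lazyResult);   // only the "foos" object is removed
```

The lazy form still stops too early when the "foos" object itself contains nested braces, which is what the brace-counting approach addresses.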

This regex, \\"foos\\": {(.*?)}, should match what you want in most regex engines; you might need to replace " with \". If your JSON can contain other curly brackets, you can use \\"foos\\": (\{(?>[^{}]|(?1))*\}), which uses recursion on the bracketed group to match balanced curly brackets. Note that this one only works in engines that support recursion, such as PCRE; others, including java.util.regex (which Scala uses), won't accept it.
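Where recursion is not available, nested braces can be handled by locating the key with a regex and then counting brace depth by hand. This is a minimal sketch in JavaScript (not Scala); `stripFoos` is a name chosen for illustration, and it assumes braces never appear inside quoted string values.

```javascript
// Removes the first `"foos": { ... },` fragment, tolerating nested braces.
// Assumption: no '{' or '}' occurs inside quoted strings in the fragment.
function stripFoos(text) {
  const keyMatch = /"foos"\s*:\s*\{/.exec(text);
  if (keyMatch === null) return text; // key not present: nothing to strip

  const openPos = keyMatch.index + keyMatch[0].length - 1; // index of '{'
  let depth = 0;
  let closePos = -1;
  for (let i = openPos; i < text.length; i++) {
    if (text[i] === "{") {
      depth++;
    } else if (text[i] === "}") {
      depth--;
      if (depth === 0) { // matching close brace found
        closePos = i;
        break;
      }
    }
  }
  if (closePos === -1) return text; // unbalanced braces: leave input alone

  // Drop the trailing comma and surrounding whitespace, if present.
  const rest = text.slice(closePos + 1).replace(/^\s*,\s*/, "");
  return text.slice(0, keyMatch.index) + rest;
}

console.log(stripFoos('blah blah sediejdri "foos": {"fizz": "buzz"}, odedrfj49 blah'));
```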

Related

Match word by its prefix

I'm trying to match a string by its prefix that ends with a particular character. For example, if my string is "abcd" and ends in #, then any word which is a prefix of "abcd" should be matched as long as it ends with #. Here are some examples to help illustrate the pattern:
Input: "ab#" gives true (as "ab" is a prefix of "abcd" and ends with a #).
Input: "abcd#" gives true (as "abcd" is a prefix of "abcd" and ends with a #).
Input: "bc#" gives false (as "bc" is not a prefix of "abcd").
Input: "ab" gives false (while "ab" is a prefix of "abcd", it doesn't end with #).
Input: "ac#" gives false (while "a" and "c" both occur in "abcd", "ac" is not a prefix of "abcd").
So far, I've managed to come up with the following expression which seems to be working fine:
/(abcd|abc|ab|a)#/
While this is working, it isn't very practical, as larger words of length n will make the expression quite large:
/(n|n-1|n-2| ... |1)#/
Is there a way to rewrite this expression so it is more scalable and concise?
Example of my attempt (in JS):
const regex = /(abcd|abc|ab|a)#/;
console.log(regex.test("abcd#")); // true
console.log(regex.test("ab#")); // true
console.log(regex.test("abc#")); // true
console.log(regex.test("abz#")); // false
console.log(regex.test("ac#")); // false
Edit: Some of the solutions provided are nice and do what I'm after. However, for this particular question, I'm after a solution which uses pure regular expressions to match the prefix.
Just use String#startsWith and String#endsWith here:
String input = "abcd";
String prefix = "ab#";
if (input.startsWith(prefix.replaceAll("#$", "")) && prefix.endsWith("#")) {
System.out.println("MATCH");
}
else {
System.out.println("NO MATCH");
}
Edit: A JavaScript version of the above:
var input = "abcd";
var prefix = "ab#";
if (input.startsWith(prefix.replace(/#$/, "")) && prefix.endsWith("#")) {
console.log("MATCH");
}
else {
console.log("NO MATCH");
}
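Try ^ab?c?d?#$
Explanation:
`^` - match the beginning of the string
`b?` - match zero or one `b`
The rest is analogous to the above.
Note that this pattern also accepts inputs such as "ac#" and "ad#", which are not prefixes of "abcd", because each optional character is independent of the ones before it.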
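A quick way to see how a flat run of optional characters differs from a nested one is to run both against the same inputs. The sketch below assumes the target word "abcd"; the variable names are illustrative only.

```javascript
// Flat: every later character is optional on its own.
const flat = /^ab?c?d?#$/;
// Nested: each character may only appear if the previous one did.
const nested = /^a(b(cd?)?)?#$/;

const inputs = ["a#", "ab#", "abcd#", "ac#", "ad#", "bc#"];
const comparison = inputs.map((input) => ({
  input,
  flat: flat.test(input),
  nested: nested.test(input),
}));

// "ac#" and "ad#" are where the two patterns disagree.
console.table(comparison);
```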
Here's a left-field JavaScript option. Build an array of valid prefixes, then use join on the array to make your regex pattern.
var validPrefixes = ["abcd",
"abc",
"ab",
"a",
"areallylongprefix"];
var regexp = new RegExp("^(" + validPrefixes.join("|") + ")#$");
console.log(regexp.test("abcd#"));// true
console.log(regexp.test("ab#")); // true
console.log(regexp.test("abc#")); // true
console.log(regexp.test("abz#")); // false
console.log(regexp.test("ac#")); // false
console.log(regexp.test("areallylongprefix#")); //true
This can be adapted to the language of your choosing, and is also handy if your prefixes are dynamically retrieved from a database or similar.
Here's my C# attempt:
private static bool test(string v)
{
var pattern = "abcd#";
//No error handling
return v.EndsWith(pattern[pattern.Length-1])
&& pattern.Replace("#", "").StartsWith(v.Replace("#",""));
}
Console.WriteLine(test("abcd#")); // true
Console.WriteLine(test("ab#")); // true
Console.WriteLine(test("abc#")); // true
Console.WriteLine(test("abz#")); // false
Console.WriteLine(test("ac#")); // false
Console.WriteLine(test("abc")); //false
/a(b(cd?)?)?#/
Or for a longer example, to match a prefix of "abcdefg#":
/a(b(c(d(e(fg?)?)?)?)?)?#/
Generating this regex isn't completely trivial, but some options are:
function createPrefixRegex(s) {
// This method creates an unnecessary set of parentheses
// around the last letter, but that won't harm anything.
return new RegExp(s.slice(0,-1).split('').join('(') + ')?'.repeat(s.length - 2) + '#');
}
function createPrefixRegex2(s) {
var r = s[0];
for (var i = 1; i < s.length - 2; ++i) {
r += '(' + s[i];
}
r += s[s.length - 2] + '?' + ')?'.repeat(s.length - 3) + '#';
return new RegExp(r);
}
function createPrefixRegex3(s) {
var recurse = function(i) {
if (i >= s.length - 1) {
return '';
}
if (i === s.length - 2) {
return s[i] + '?';
}
return '(' + s[i] + recurse(i + 1) + ')?';
}
return new RegExp(s[0] + recurse(1) + '#');
}
These may fail if the input string has no prefix before the '#' character, and they assume the last character in the string is '#'.
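A variant that sidesteps those caveats might take the word and the terminator separately, escape regex metacharacters, and anchor the result. The following is only a sketch: `escapeChar` and `buildPrefixRegex` are hypothetical names, and the terminator is assumed to be a single character.

```javascript
// Escapes a single character if it is a regex metacharacter.
function escapeChar(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/, "\\$&");
}

// Builds ^a(?:b(?:c(?:d)?)?)?#$ for word "abcd" and terminator "#".
function buildPrefixRegex(word, terminator = "#") {
  if (word.length === 0) {
    throw new Error("word must not be empty");
  }
  const chars = Array.from(word).map(escapeChar);
  // Wrap from the last character outwards so each character
  // is only allowed when all earlier ones are present.
  let tail = "";
  for (let i = chars.length - 1; i >= 1; i--) {
    tail = "(?:" + chars[i] + tail + ")?";
  }
  return new RegExp("^" + chars[0] + tail + escapeChar(terminator) + "$");
}

const abcd = buildPrefixRegex("abcd");
console.log(abcd.source);
console.log(abcd.test("ab#"), abcd.test("ac#"), abcd.test("abc"));
```

Non-capturing groups are used here so the generated pattern does not create one capture per character.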

Use Meteor Match and Regex to check strings

I'm checking an array of strings for a specific combination of patterns. I'm having trouble using Meteor's Match function and regex literal together. I want to check if the second string in the array is a url.
addCheck = function(line) {
var firstString = _.first(line);
var secondString = _.indexOf(line, 1);
console.log(secondString);
var urlRegEx = /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+#)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+#)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-]*)?\??(?:[\-\+=&;%#\.\w]*)#?(?:[\.\!\/\\\w]*))?)/g;
if ( firstString == "+" && Match.test(secondString, urlRegEx) === true ) {
console.log( "detected: + | line = " + line )
} else {
// do stuff if we don't detect a
console.log( "line = " + line );
}
}
Any help would be appreciated.
Match.test is used to test the structure of a variable. For example: "it's an array of strings, or an object including the field createdAt", etc.
RegExp.test on the other hand, is used to test if a given string matches a regular expression. That looks like what you want.
Try something like this instead:
if ((firstString === '+') && urlRegEx.test(secondString)) {
...
}
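Two details in the question's code are worth checking alongside this. A regex declared with the g flag, as urlRegEx is, keeps state in lastIndex between test calls, and `_.indexOf(line, 1)` returns the position of the value 1 rather than the second element. The sketch below uses a deliberately simplified, hypothetical URL pattern and a made-up helper name `isPlusUrlLine` to illustrate both points.

```javascript
// Hypothetical simplified URL pattern, declared with the g flag.
const globalRe = /^https?:\/\/\S+/g;
const url = "https://example.com";

// With g, test() resumes from lastIndex, so repeated calls can disagree.
const firstCall = globalRe.test(url);  // matches, lastIndex moves to the end
const secondCall = globalRe.test(url); // starts at the end, fails, resets lastIndex

// Without g, test() is stateless. line[1] reads the second element directly.
function isPlusUrlLine(line) {
  const urlRe = /^https?:\/\/\S+$/;
  return line[0] === "+" && urlRe.test(line[1]);
}

console.log(firstCall, secondCall);
console.log(isPlusUrlLine(["+", "https://example.com"]));
```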

How to find all matches of a pattern in a string using regex

If I have a string like:
s = "This is a simple string 234 something else here as well 4334"
and a regular expression like:
regex = ~"[0-9]{3}"
How can I extract all the words from the string using that regex? In this case 234 and 433?
You can use CharSequence.findAll:
def triads = s.findAll("[0-9]{3}")
assert triads == ['234', '433']
Latest documentation of CharSequence.findAll
You have to use capturing groups. You can check groovy's documentation about it:
http://mrhaki.blogspot.com/2009/09/groovy-goodness-matchers-for-regular.html
For instance, you can use this code:
s = "This is a simple string 234 something else here as well 4334"
regex = /([0-9]{3})/
matcher = ( s=~ regex )
if (matcher.getCount() > 0) { // matches() would require the whole string to match the pattern
println(matcher.getCount()+ " occurrence of the regular expression was found in the string.");
println(matcher[0][1] + " found!")
}
As a side note:
m[0] is the first match object.
m[0][0] is everything that matched in this match.
m[0][1] is the first capture in this match.
m[0][n] is the nth capture in this match.
You could do something like this.
def s = "This is a simple string 234 something else here as well 4334"
def m = s =~ /[0-9]{3}/
(0..<m.count).each { println m[it][0] }
Output:
234
433
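For comparison, the same match-then-index idea can be sketched in JavaScript (not Groovy): `match` with the g flag returns just the matched substrings, while `matchAll` yields one array per match, where index 0 is the whole match and indexes 1..n are the captures. The two capture groups here are added only to show the indexing.

```javascript
const s = "This is a simple string 234 something else here as well 4334";

// All non-overlapping three-digit runs, left to right.
const triads = s.match(/[0-9]{3}/g) || [];

// One entry per match: m[0] is the whole match, m[1] and m[2] are the captures.
const detailed = Array.from(s.matchAll(/([0-9])([0-9]{2})/g));

console.log(triads);
for (const m of detailed) {
  console.log(m[0], m[1], m[2]);
}
```

The trailing 4 of 4334 is left over because matches do not overlap.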
def INPUT= 'There,once,was,a man,from,"the , extremely,,,,bad .,, , edge",of the,"universe, ultimately,, is mostly",empty,,,space';
def REGEX = ~/(\"[^\",]+),([^\"]+\")/;
def m = (INPUT =~ REGEX);
while (true) {
m = (INPUT =~ REGEX);
if (m.getCount()>0) {
INPUT = INPUT.replaceAll(REGEX,'$1!-!$2');
System.out.println(INPUT );
} else {
break;
}
}

Mongoose: how to find 3 words in any order and in any place in the string? (SQL 3 like with and)

I cannot find a way to find a row that contains ALL 3 words (or more) with a regexp:
example
input words: "comp abc 300"
should match: "abcdef compres 300" and "ascompr zazabcd 9300"
I have a loop at the moment with a regexp that returns this:
(.*comp.*)(.*abc.*)(.*300.*)
but it only matches in this order. I would like it to match in any order, as in the example.
It's just like three LIKE conditions joined with AND in SQL.
Thanks ;)
Positive lookahead is what you need:
(?=.*comp)(?=.*abc)(?=.*300).*
More info at http://www.regular-expressions.info/lookaround.html
I've created a Node function that encapsulates it:
// Will add a full-text regexp search to a query
// can be used like this:
// var query = myModel.model.find();
// addFullTextSearch( query, "firstName", mySearchString );
function addFullTextSearch( query, paramName, searchString ) {
if (searchString) {
var r = "";
var sss = searchString.split(" ");
if (sss.length<=1) { // only one word
r = sss[0];
} else {
// result should look like this: (?=.*comp)(?=.*abc)(?=.*300).*
for (var s in sss) {
r += "(?=.*" + sss[s] + ")";
}
r += ".*";
}
query.where(paramName).regex(new RegExp(r, 'i')); // "i" for case insensitivity
}
} // addFullTextSearch
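If the search words can contain regex metacharacters (for example "c++"), they need escaping before being placed inside the lookaheads. A minimal standalone sketch follows; `escapeRegex` and `buildAllWordsRegex` are hypothetical helper names, and the result could be passed to the same `.regex(...)` call as above.

```javascript
// Escapes every regex metacharacter in the given text.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds ^(?=.*w1)(?=.*w2)....* so that every word must occur, in any order.
function buildAllWordsRegex(searchString) {
  const words = searchString
    .trim()
    .split(/\s+/)
    .filter((word) => word.length > 0); // ignore empty pieces
  const lookaheads = words
    .map((word) => "(?=.*" + escapeRegex(word) + ")")
    .join("");
  return new RegExp("^" + lookaheads + ".*", "i"); // "i" for case insensitivity
}

const allWords = buildAllWordsRegex("comp abc 300");
console.log(allWords.source);
console.log(allWords.test("abcdef compres 300"));   // all three words present
console.log(allWords.test("ascompr zazabcd 9300")); // all three words present
console.log(allWords.test("comp abc"));             // "300" is missing
```

An empty search string produces a pattern that matches everything, which may or may not be the behaviour you want.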

Regular expression to match word pairs joined with colons

I don't know regular expressions at all. Can anybody help me with one very simple regular expression for
extracting 'word:word' from a sentence, e.g. "Java Tutorial Format:Pdf With Location:Tokyo Javascript"?
A small modification:
the first 'word' comes from a list, but the second can be anything: "word1 in [ABC, FGR, HTY]".
The situation demands one more modification:
the matching form can be "word11:word12 word13 .. " up to the next "word21: ... ".
Things are becoming more complex... I have to learn regex :(
Thanks in advance.
You can use the regex:
\w+:\w+
Explanation:
\w - single char which is either a letter(uppercase or lowercase), digit or a _.
\w+ - one or more of above char..basically a word
so \w+:\w+
would match a pair of words separated by a colon.
Try \b(\S+?):(\S+?)\b. Group 1 will capture "Format" and group 2, "Pdf".
A working example:
<html>
<head>
<script type="text/javascript">
function test() {
var re = /\b(\S+?):(\S+?)\b/g; // without 'g' matches only the first
var text = "Java Tutorial Format:Pdf With Location:Tokyo Javascript";
var match = null;
while ( (match = re.exec(text)) != null) {
alert(match[1] + " -- " + match[2]);
}
}
</script>
</head>
<body onload="test();">
</body>
</html>
A good reference for regexes is https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp
Use this snippet :
$str=" this is pavun:kumar hello world bk:systesm" ;
if ( preg_match_all ( '/(\w+\:\w+)/',$str ,$val ) )
{
print_r ( $val ) ;
}
else
{
print "Not matched \n";
}
Continuing Jaú's function with your additional requirement:
function test() {
var words = ['Format', 'Location', 'Size'],
text = "Java Tutorial Format:Pdf With Location:Tokyo Language:Javascript",
match = null;
var re = new RegExp( '(' + words.join('|') + '):(\\w+)', 'g');
while ( (match = re.exec(text)) != null) {
alert(match[1] + " = " + match[2]);
}
}
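The question's last modification, where a value runs until the next "word:" starts, can be approached with a lazy value followed by a lookahead. This is a sketch under the assumption that keys consist of \w characters only and that the text contains no line breaks; `parsePairs` is a name chosen for illustration.

```javascript
// Returns [key, value] pairs where each value extends up to the next "key:" or the end.
function parsePairs(text) {
  // (\w+):      the key
  // (.*?)       the value, as short as possible...
  // (?=\s+\w+:|$)  ...but it must be followed by the next key or the end of the text
  const re = /(\w+):(.*?)(?=\s+\w+:|$)/g;
  const pairs = [];
  for (const m of text.matchAll(re)) {
    pairs.push([m[1], m[2]]);
  }
  return pairs;
}

const pairs = parsePairs("Java Tutorial Format:Pdf With Location:Tokyo Javascript");
console.log(pairs);
```

To restrict keys to a known list, the leading `(\w+)` could be swapped for an alternation built with `words.join('|')` as in the function above.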
I am currently solving this problem in my Node.js app and found that the following seems suitable for colon-paired words:
([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))
It also matches quoted values, like a:"b" c:'d e' f:g
Example code in ES6:
const regex = /([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))/g;
const str = `category:"live casino" gsp:S1aik-UBnl aa:"b" c:'d e' f:g`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Example code in PHP:
$re = '/([\w]+:)("(([^"])*)"|\'(([^\'])*)\'|(([^\s])*))/';
$str = 'category:"live casino" gsp:S1aik-UBnl aa:"b" c:\'d e\' f:g';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
You can check/test your regex expressions using this online tool: https://regex101.com
Here's the non-regex way: in your favourite language, split on whitespace, go through the elements, check each for ":", and print those that contain it. E.g. in Python:
>>> s="Java Tutorial Format:Pdf With Location:Tokyo Javascript"
>>> for i in s.split():
...     if ":" in i:
...         print(i)
...
Format:Pdf
Location:Tokyo
You can do further checks to make sure it's really "someword:someword" by splitting again on ":" and checking that there are 2 elements in the resulting list, e.g.
>>> for i in s.split():
...     if ":" in i:
...         a=i.split(":")
...         if len(a) == 2:
...             print(i)
...
Format:Pdf
Location:Tokyo
([^:]+):(.+)
Meaning: (everything except : one or more times), :, (any character one or more times)
You'll find good manuals on the net... Maybe it's time for you to learn...