NodeJS REGEX: Efficient Regex Search for multiple keywords? - regex

I have a search that iterates over each JSON object I have and over each keyword. I want the match to be exclusive rather than inclusive, and I'm guessing I will need a more robust regex. Basically, the test should be true only if the string contains ALL of the keywords (order does not matter).
Searching for "This Text" would include the following results:
"this text", "this is a text", "This Text", "Text This", "this is a long string and text", "a long string with this in the middle and text", "that this that this text"
and negate text similar to the following strings:
"that text", "this is not", "text that is not included"
Here's the script I have right now.
items.forEach(function(item) { // iterate over the items array
    var s = JSON.stringify(item); // convert each item in items to a string
    var matched = false;
    sarray.forEach(function(qs) { // take the toArray-converted query and iterate over it
        var r = new RegExp(qs, "g"); // compose a regex object from the keyword
        if (r.test(s)) { // if the regex finds the keyword in the item string,
            matched = true; // set matched to true
        }
    });
    if (matched) {
        results.push(item); // push the item into the results array
    }
});

I would use a simple function instead of a regular expression, because in this case a complex regex would be needed.
/**
 * Look for all `items` inside `str`.
 *
 * @param str   the string to search inside
 * @param items all items that must appear in the string
 *
 * @return
 *   TRUE  => All items were found
 *   FALSE => At least one item was not found
 */
function all_items_present(str, items) {
    var i;
    var len = items.length;
    var found = true;
    for (i = 0; i < len; i++) {
        // indexOf does a literal search; search() would treat the item as a regex
        if (str.indexOf(items[i]) == -1) {
            found = false;
            break;
        }
    }
    return found;
}
// returns true
all_items_present(
'{"title":"This text is foo", "location":"Austin, TX(bar)", "baz":false}',
['foo','bar','baz']
);
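For comparison, the same check can be written with Array.prototype.every. This is only a minimal sketch, assuming that plain-substring, case-insensitive matching is what is wanted (the question's examples accept "this text" for the query "This Text"). The names containsAllKeywords and filterItems are hypothetical and not part of the answer above.

```javascript
// Hypothetical helper: true only if every keyword occurs somewhere in str.
// Matching is by plain substring and ignores case.
function containsAllKeywords(str, keywords) {
  const haystack = str.toLowerCase();
  return keywords.every(function (keyword) {
    return haystack.includes(keyword.toLowerCase());
  });
}

// Filter an array of objects the way the question's loop does,
// but requiring ALL keywords instead of ANY keyword.
function filterItems(items, query) {
  const keywords = query.split(/\s+/).filter(Boolean);
  return items.filter(function (item) {
    return containsAllKeywords(JSON.stringify(item), keywords);
  });
}
```

Because this is substring matching, a keyword such as "this" would also be found inside a longer word like "thistle"; word-boundary matching would need a regex.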
Demo
http://jsfiddle.net/7FwUP/1/
Here is the equivalent regex for finding foo, bar and baz in no particular order:
^.*?(?:foo.*?bar.*?baz|foo.*?baz.*?bar|baz.*?foo.*?bar|baz.*?bar.*?foo|bar.*?baz.*?foo|bar.*?foo.*?baz).*?$
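Listing every permutation grows factorially with the number of keywords. A common alternative is one lookahead per keyword, which keeps the pattern linear in size. The sketch below builds such a pattern from an array of keywords; escapeRegExp and buildAllKeywordsRegex are hypothetical helper names, and the case-insensitive flag is an assumption based on the question's examples.

```javascript
// Escape regex metacharacters so each keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build ^(?=[\s\S]*kw1)(?=[\s\S]*kw2)... with one lookahead per keyword.
// Each lookahead scans the whole string from the start, so order is irrelevant.
function buildAllKeywordsRegex(keywords) {
  const lookaheads = keywords.map(function (keyword) {
    return "(?=[\\s\\S]*" + escapeRegExp(keyword) + ")";
  });
  // No "g" flag, so test() keeps no state between calls.
  return new RegExp("^" + lookaheads.join(""), "i");
}

const allKeywords = buildAllKeywordsRegex(["foo", "bar", "baz"]);
const sample = '{"title":"This text is foo", "location":"Austin, TX(bar)", "baz":false}';
console.log(allKeywords.test(sample)); // true
```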

Related

Is it possible to match a nested pair with regex?

I'm attempting to parse some BBCode with regex, but the nested structures are giving me a headache.
What I'm trying to parse is the following:
[COLOR="Red"]Red [COLOR="Green"]Green[/COLOR][/COLOR]
I've come up with the following pattern, which needs to deal with the quotation marks around the color attribute, but it only matches the first opening COLOR and the first closing COLOR. It's not matching in a properly nested arrangement.
\[COLOR=(\"?)(.*?)(\"?)]([\s\S]*?)\[\/COLOR\]
It's being done in Dart, as follows, but I believe the problem is with my regex pattern rather than the Dart implementation.
text = text.replaceAllMapped(RegExp(r'\[COLOR=(\"?)(.*?)(\"?)]([\s\S]*?)\[\/COLOR\]', caseSensitive: false, multiLine: true), (match) {
return '<font style="color: ${match.group(2)}">${match.group(4)}</font>';
});
Matching braces (of any kind) are not regular. It's known to be a problem which is context free (can be solved by a stack machine or specified by a context free grammar), but not regular (can be solved by a finite state machine or specified by a regular expression).
While the commonly implemented "regular expressions" can do some non-regular things (due to backreferences), this is not one of those things.
In general, I'd recommend using a RegExp to tokenize the input, then build the stack based machine yourself on top.
Here, because it's simple enough, I'd just match the start and end markers and replace them individually, and not try to match the text between.
var re = RegExp(r'\[COLOR="(\w+)"\]|\[/COLOR\]');
text = text.replaceAllMapped(re, (m) {
var color = m[1]; // The color of a start tag, null if not start tag.
return color == null ? "</span>" : "<span style='color:$color'>";
});
If you want to check that the tags are balanced, we're back to having a stack (in this case so simple it's just a counter):
var re = RegExp(r'\[COLOR="(\w+)"\]|\[/COLOR\]');
var nesting = 0;
text = text.replaceAllMapped(re, (m) {
var color = m[1];
if (color == null) {
if (nesting == 0) {
throw ArgumentError.value(text, "text", "Bad nesting");
}
nesting--; // Decrement on close tag.
return "</span>";
}
nesting++; // Increment on open-tag.
return "<span style='color:$color'>";
});
if (nesting != 0) {
throw ArgumentError.value(text, "text", "Bad nesting");
}
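The "tokenize with a regex, then run a stack on top" advice can also be sketched with an explicit stack instead of a counter. The following is an illustrative JavaScript version, not a translation the answer provides; colorBBCodeToHtml is a hypothetical name, and the case-insensitive flag is an added assumption.

```javascript
// Convert [COLOR="x"]...[/COLOR] BBCode to nested <span> elements.
// A single regex tokenizes the input; an explicit stack tracks open tags.
function colorBBCodeToHtml(text) {
  const tokenRe = /\[COLOR="(\w+)"\]|\[\/COLOR\]/gi;
  const openColors = []; // stack of colors for the currently open tags
  let html = "";
  let lastIndex = 0;
  let match;
  while ((match = tokenRe.exec(text)) !== null) {
    html += text.slice(lastIndex, match.index); // copy plain text through
    lastIndex = tokenRe.lastIndex;
    if (match[1] !== undefined) {
      // Opening tag: remember it and emit the span.
      openColors.push(match[1]);
      html += "<span style='color:" + match[1] + "'>";
    } else {
      // Closing tag: there must be an open tag to close.
      if (openColors.length === 0) {
        throw new Error("Bad nesting: unexpected [/COLOR]");
      }
      openColors.pop();
      html += "</span>";
    }
  }
  if (openColors.length !== 0) {
    throw new Error("Bad nesting: unclosed [COLOR]");
  }
  return html + text.slice(lastIndex);
}
```

With only one tag type the stack carries no more information than a counter, but it becomes necessary as soon as different tags must be matched against each other.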

Remove inline styles from markdown in SwiftUI

I'm pulling in JSON data and displaying it with Text in my SwiftUI App, but some of the text contains inline styles in the MD. Is there a way to remove this or even apply the styles?
Example:
if !ticket.isEmpty {
Text(self.ticket.first?.notes.first?.prettyUpdatedString ?? "")
.padding()
Text(self.ticket.first?.notes.first?.mobileNoteText ?? "")
.padding()
.fixedSize(horizontal: false, vertical: true)
}
The prettyUpdatedString prints out "Last updated by < strong >Seth Duncan</ strong>"
Update:
In an attempt to apply this fix to the Ticket Short Detail, an exception is being thrown.
Exception NSException * "*** -[NSRegularExpression enumerateMatchesInString:options:range:usingBlock:]: Range or index out of bounds" 0x0000600003f06370
I'm not sure what's going on here. Any ideas?
Example of Data shortDetail is being pulled from
{
id: ID,
type: "Ticket",
lastUpdated: "2020-07-23T08:19:12Z",
shortSubject: null,
shortDetail: "broken screen - # CEH Tina Desk",
displayClient: "STUDENT",
updateFlagType: 0,
prettyLastUpdated: "6 days ago",
latestNote: {
id: ID,
type: "TechNote",
mobileListText: "<b>S. Duncan: </b> Sent to AGI for repair.",
noteColor: "aqua",
noteClass: "bubble right"
}
},
ERROR CODE
Screen I'm attempting to access...short detail is below name
In my opinion, the trick here is to use a regex; in this case I would create a function that strips the markup.
UPDATE
Based on what I understood from your comment, you want to replace the non-breaking space with a white space, not an empty string.
To achieve that, I just replace all occurrences of it with a space " "
.replacingOccurrences(of: " ", with: " ")
Leaving you with this code
// original answer had an incorrect regex
func clearMarkdown(on str: String) -> String {
// we build a regular expression that matches opening and closing tags (and newlines), and ensure that it is valid
guard let match = try? NSRegularExpression(pattern: "<[^>]+>|\\n+") else { return str }
// we get the range of the string to analyze, in this case the whole string
let range = NSRange(location: 0, length: str.lengthOfBytes(using: .utf8))
// we match all opening markdown
let matches = match.matches(in: str, range: range)
// we start replacing with empty strings
return matches.reversed().reduce(into: str) { current, result in
let range = Range(result.range, in: current)!
current.replaceSubrange(range, with: "")
}.replacingOccurrences(of: " ", with: " ")
}
This function strips all markup from your strings, but it does not format them or give you any information about the markup that was removed. Taking your example, the usage would be something like this:
var str = "Last updated by < strong >Seth Duncan</ strong>"
str = clearMarkdown(on: str) // str is now "Last updated by Seth Duncan"
If you need the styling to be applied, this will not work, but I can write something that will.
UPDATE 2
After looking at your problem, I found that some of the strings you receive contain characters that take more than one byte in UTF-8. The character in this case is ’ (a typographic apostrophe), which occupies three bytes in UTF-8 but only one UTF-16 code unit. NSRange counts UTF-16 code units, so a length taken from the UTF-8 byte count runs past the end of the string and raises the out-of-bounds exception. That being said, you just need to build the range from the string itself:
// Replacing this line
let range = NSRange(location: 0, length: str.lengthOfBytes(using: .utf8))
// With
let range = NSRange(str.startIndex..., in: str)
Counting bytes in a single-byte encoding such as .windowsCP1254 also avoided the crash in my tests, but only because ’ happens to be one byte there; it is not a general fix.
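The length mismatch behind an out-of-bounds range like this one can be illustrated outside Swift. JavaScript strings report their length in UTF-16 code units, the same unit NSRange uses, so the sketch below compares that count with the UTF-8 byte count. The sample string is made up for illustration and is not taken from the ticket data.

```javascript
// Sample text containing U+2019, a typographic apostrophe (assumed example).
const note = "It’s fixed";

// UTF-16 code units: the unit NSRange positions and lengths are expressed in.
const utf16Units = note.length;

// UTF-8 bytes: what a byte count under UTF-8 encoding would report.
const utf8Bytes = new TextEncoder().encode(note).length;

console.log(utf16Units); // 10
console.log(utf8Bytes);  // 12
```

A range of length 12 over a string that is only 10 units long is exactly the kind of input that makes a regex engine report an out-of-bounds range.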

Extract all allowed characters from a regular expression

I need to extract a list of all allowed characters from a given regular expression.
So for example, if the regex looks like this (some random example):
[A-Z]*\s+(4|5)+
the output should be
ABCDEFGHIJKLMNOPQRSTUVWXYZ45
(omitting the whitespace)
One obvious solution would be to define a complete set of allowed characters, and use a find method, to return the corresponding subsequence for each character. This seems to be a bit of a dull solution though.
Can anyone think of a (possibly simple) algorithm on how to implement this?
One thing you can do is:
1. Split the regex into sub-patterns.
2. Test every candidate character against each sub-pattern.
See the following C# example (not perfect yet):
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static void Main(String[] args)
{
    Console.WriteLine($"-->{TestRegex(@"[A-Z]*\s+(4|5)+")}<--");
}
public static string TestRegex(string pattern)
{
    string result = "";
    // naive split on the quantifiers * and +
    foreach (var subPattern in Regex.Split(pattern, @"[*+]"))
    {
        if (string.IsNullOrWhiteSpace(subPattern))
            continue;
        result += GetAllCharCoveredByRegex(subPattern);
    }
    return result;
}
public static string GetAllCharCoveredByRegex(string pattern)
{
    Console.WriteLine($"Testing {pattern}");
    var regex = new Regex(pattern);
    var matches = new List<char>();
    // try every char value against the sub-pattern
    for (var c = char.MinValue; c < char.MaxValue; c++)
    {
        if (regex.IsMatch(c.ToString()))
        {
            matches.Add(c);
        }
    }
    return string.Join("", matches);
}
Which outputs:
Testing [A-Z]
Testing \s
Testing (4|5)
-->ABCDEFGHIJKLMNOPQRSTUVWXYZ
? ? ???????? 45<--
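The same brute-force idea can be sketched in JavaScript while restricting the candidates to printable, non-space ASCII, which drops the whitespace characters that clutter the output above. This is an illustrative sketch only: allowedChars is a hypothetical name, the 33..126 range is an assumption, and the naive split fails on escaped quantifiers or on * and + inside character classes.

```javascript
// Hypothetical helper: list the printable ASCII characters (codes 33..126)
// that each sub-pattern of `pattern` accepts on its own.
function allowedChars(pattern) {
  let result = "";
  // Naive split on the quantifiers * and +.
  for (const subPattern of pattern.split(/[*+]/)) {
    if (subPattern.trim() === "") continue;
    // Anchor the sub-pattern so it must match the single character exactly.
    const re = new RegExp("^(?:" + subPattern + ")$");
    for (let code = 33; code <= 126; code++) {
      const ch = String.fromCharCode(code);
      if (re.test(ch)) result += ch;
    }
  }
  return result;
}

console.log(allowedChars("[A-Z]*\\s+(4|5)+")); // ABCDEFGHIJKLMNOPQRSTUVWXYZ45
```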

how to exclude a string if it's in a URL using regex?

I'm replacing a number of different strings, but I only want them replaced in normal text, not rewritten when they appear as part of a link in a document. The regex to find the strings is very straightforward: /word|anotherword|athirdword/gi but that means that if there's a link that contains anotherword, it gets found and replaced as well, breaking the link.
I think I just need a part in my regex that says "but just ignore anything that starts with http or https" but not sure how to write that.
thanks so much!
edit. here's what I'm doing with the javascript
if (node.nodeType === 3) {
var text = node.nodeValue;
var replacedText = text.replace(/word|anotherword|athirdword/gi, 'replaced text');
if (replacedText !== text) {
element.replaceChild(document.createTextNode(replacedText), node);
}
}
the result replaces those three strings anywhere on a page, which is great. except it changes http://www.foo.com/the-whole-world into http://www.foo.com/the-whole-replaced text which obviously breaks the link.
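I would try negative lookbehind.
Lookbehind support differs greatly from flavor to flavor, so a pattern that relies on it may not be portable.
For JavaScript, you can try the following, which optionally captures a leading URL and leaves the match untouched when one is present:
str.replace(/(https?:[\/.\-a-z0-9]+)?(word|anotherword|athirdword)/gi, function($0, $1) {
    // $1 is set only when the keyword is preceded by a URL prefix
    return $1 ? $0 : 'replaced text';
});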
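In engines that implement ES2018 lookbehind (current V8-based runtimes such as Node.js and Chrome do), the negative lookbehind idea can be written directly. This is a minimal sketch under that assumption; replaceOutsideUrls is a hypothetical name, and a "URL" is taken to be an http:// or https:// prefix followed by non-whitespace characters.

```javascript
// Negative lookbehind: skip a keyword when an http(s):// prefix and only
// non-whitespace characters sit between that prefix and the keyword.
const keywordOutsideUrl = /(?<!https?:\/\/\S*)(?:word|anotherword|athirdword)/gi;

function replaceOutsideUrls(text, replacement) {
  return text.replace(keywordOutsideUrl, replacement);
}

console.log(replaceOutsideUrls("see http://www.foo.com/the-whole-word for a word", "X"));
// see http://www.foo.com/the-whole-word for a X
```

Older engines throw a SyntaxError on the lookbehind group, so the capture-and-callback approach remains the portable option.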
You can split the string first, then do a conditional replace:
function condReplace(str) {
    var sentences = [];
    // split on URLs; the capturing group keeps each URL in the result array
    var res = str.split(/(https?:\/\/[^\s]+)(?:\s+|$)/i);
    res.forEach(function(entry) {
        if (entry) {
            if (entry.match(/^https?:\/\//i)) {
                sentences.push(entry); // leave URLs untouched
            } else {
                sentences.push(entry.replace(/word|anotherword|athirdword/gi, "REPLACED"));
            }
        }
    });
    document.write(sentences.join(" "));
}
var str = "http://sometext.com/word.doc and This is a word normal text anotherword containing a anotherword another link http://www.foo.com/the-whole-word. This is a single word.";
condReplace(str);

Want to Encode text during Regex.Replace call

I have a regex call that I need help with.
I haven't posted my regex, because it is not relevant here.
What I want to do is, during the Replace, also modify the ${text} portion by calling Html.Encode on the entire text matched by the regex.
Basically, wrap the entire text matched by the regex in a bold tag, and also Html.Encode the text in between the bold tags.
RegexOptions regexOptions = RegexOptions.Compiled | RegexOptions.IgnoreCase;
text = Regex.Replace(text, regexBold, @"<b>${text}</b>", regexOptions);
There is an incredibly easy way of doing this in .NET. It's called a MatchEvaluator, and it lets you do all sorts of cool find-and-replace operations. Essentially, you pass Regex.Replace the name of a method that takes a Match object as its only parameter and returns a string. Do whatever makes sense for your particular match (HTML-encode it), and the string you return will replace the entire text of the match in the input string.
Example: Let's say you want to find all the places where two numbers are being added (in text) and replace each expression with the actual result. You can't do that with a strict regex approach, but with a MatchEvaluator it becomes easy.
public void Stuff()
{
string pattern = @"(?<firstNumber>\d+)\s*(?<operator>[*+\-/])\s*(?<secondNumber>\d+)";
string input = "something something 123 + 456 blah blah 100 - 55";
string output = Regex.Replace(input, pattern, MatchMath);
//output will be "something something 579 blah blah 45"
}
private static string MatchMath(Match match)
{
try
{
double first = double.Parse(match.Groups["firstNumber"].Value);
double second = double.Parse(match.Groups["secondNumber"].Value);
switch (match.Groups["operator"].Value)
{
case "*":
return (first * second).ToString();
case "+":
return (first + second).ToString();
case "-":
return (first - second).ToString();
case "/":
return (first / second).ToString();
}
}
catch { }
return "NaN";
}
Find out more at http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.matchevaluator.aspx
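The same "callback instead of replacement string" idea exists in other regex APIs. As an illustration only, here is a JavaScript sketch that wraps each match in a bold tag and HTML-encodes the matched text. The question does not show its regex, so boldPattern (text between double asterisks) is a made-up stand-in, and escapeHtml and boldAndEncode are hypothetical helpers.

```javascript
// Minimal HTML escaping for text placed inside an element.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical pattern: text wrapped in double asterisks is made bold.
const boldPattern = /\*\*(.+?)\*\*/g;

function boldAndEncode(input) {
  // The callback plays the role of a MatchEvaluator: it receives the match
  // and its groups, and returns the replacement for the whole match.
  return input.replace(boldPattern, function (wholeMatch, inner) {
    return "<b>" + escapeHtml(inner) + "</b>";
  });
}

console.log(boldAndEncode("a **1 < 2** b")); // a <b>1 &lt; 2</b> b
```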
Don't use Regex.Replace in this case. Use:
foreach (Match match in Regex.Matches(...))
{
    // do your stuff here
}
Here's an implementation of this I've used to pick out special replace strings from content and localize them.
protected string FindAndTranslateIn(string content)
{
    return Regex.Replace(content, @"\{\^(.+?);(.+?)?}", new MatchEvaluator(TranslateHandler), RegexOptions.IgnoreCase);
}
public string TranslateHandler(Match m)
{
if (m.Success)
{
string key = m.Groups[1].Value;
key = FindAndTranslateIn(key);
string def = string.Empty;
if (m.Groups.Count > 2)
{
def = m.Groups[2].Value;
if(def.Length > 1)
{
def = FindAndTranslateIn(def);
}
}
// `group` is not defined in this snippet; it is assumed to be a field of the enclosing class
if (group == null)
{
return Translate(key, def);
}
else
{
return Translate(key, group, def);
}
}
return string.Empty;
}
From the match evaluator delegate you return whatever you want the match replaced with, so where I have returns you would have bold tags and an encode call. Mine also supports recursion, so it is a little overcomplicated for your needs, but you can pare the example down.
This is equivalent to iterating over the collection of matches and doing part of the Replace method's job yourself. It just saves you some code, and you get to use a fancy-schmancy delegate.
If you do a Regex.Match, the resulting Match object's group at index 0 is the subset of the input that matched the regex.
You can use this to stitch in the bold tags and encode the text there.
Can you fill in the code inside {} to add the bold tag, and encode the text?
I'm confused as to how to apply the changes to the entire text block AND replace the section in the text variable at the end.