Program hangs due to too many regex-based replacements - regex

In my input file, I need to do lots of string-manipulations (find/replaces) using Regex depending on various conditions. Like, if the content's one block meets the condition, I need to go to previous block and do replacement in that block.
For this reason, I am splitting the content to many substrings, so that I can move back to previous block (here, previous substring); and do the REGEX replacement.
But the Program hangs in the middle if the File content is more(or may be, no. of substrings exceeds).
Here is the code snippet.
string content = string.Empty;
string target_content = string.Empty;
string[] active_doc_nos;
byte[] content_bytes;
FileInfo input_fileinfo = new FileInfo(input_file);
long file_length = input_fileinfo.Length;
using (FileStream fs_read = new FileStream(input_file, FileMode.Open, FileAccess.Read))
{
content_bytes = new byte[Convert.ToInt32(file_length)];
fs_read.Read(content_bytes, 0, Convert.ToInt32(file_length));
fs_read.Close();
}
content = ASCIIEncoding.ASCII.GetString(content_bytes);
if (Regex.IsMatch(content, "<\\?CLG.MDFO ([^>]*) LEVEL=\"STRUCTURE\""))
{
#region Logic-1: TWO PAIRS of MDFO-MDFC-s one pair following the other
content = Regex.Replace(content, "(<\\?CLG.MDFO)([^>]*)(LEVEL=\"STRUCTURE\")", "<MDFO_VALIDATOR>$1$2$3");
string[] MDFO_Lines = Regex.Split(content, "<MDFO_VALIDATOR>");
active_doc_nos = new string[MDFO_Lines.GetLength(0)];
active_doc_nos[0] = Regex.Match(MDFO_Lines[0], "ACTIVE DOC=\"([^>]*)\"\\s+").ToString();
for (int i = 1; i < MDFO_Lines.GetLength(0); i++)
{
active_doc_nos[i] = Regex.Match(MDFO_Lines[i], "ACTIVE DOC=\"([^>]*)\"\\s+").ToString();
if (Regex.IsMatch(MDFO_Lines[i - 1], "(<\\?CLG.MDFC)([^>]*)(\\?>)(<\\S*\\s*\\S*>)*$"))
{
MDFO_Lines[i - 1] = Regex.Replace(MDFO_Lines[i - 1], "(<\\?CLG.MDFC)([^>]*)(\\?>)(<\\S*\\s*\\S*>)*$", "<?no_smark?>$1$2$3$4");
if (Regex.IsMatch(MDFO_Lines[i - 1], "^<\\?CLG.MDFO ([^>]*) ACTION=\"DELETED\""))
{
MDFO_Lines[i - 1] = Regex.Replace(MDFO_Lines[i - 1], "^<\\?CLG.MDFO ([^>]*) ACTION=\"DELETED\"", "<?no_bmark?><?CLG.MDFO $1 ACTION=\"DELETED\"");
}
if (active_doc_nos[i] == active_doc_nos[i - 1])
{
MDFO_Lines[i] = Regex.Replace(MDFO_Lines[i], "^<\\?CLG.MDFO ([^>]*) " + active_doc_nos[i], "<?no_smark?><?CLG.MDFO $1 " + active_doc_nos[i]);
}
}
}
foreach (string str_piece in MDFO_Lines)
{
target_content += str_piece;
}
byte[] target_bytes = ASCIIEncoding.ASCII.GetBytes(target_content);
using (FileStream fs_write = new FileStream(input_file, FileMode.Create, FileAccess.Write))
{
fs_write.Write(target_bytes, 0, target_bytes.Length);
fs_write.Close();
}
Do I have any other option to achieve this task??

Hard to say without seeing your data, but I have a suspicion that this part of some of your regexes may be the culprit:
(<\\S*\\s*\\S*>)*
Because \S can also match < and >, because everything is optional, and because you've got nested quantifiers, it's possible that this part of the regex leads to catastrophic backtracking.
What happens if you replace these parts with (?>(<\\S*\\s*\\S*>))*?

Related

JavaFX - TextField with regex for zipcode

for my programm I want to use a TextField where the user can enter a zipcode (German ones). For that I tried what you can see below. If the user enters more than 5 digits every additional digit shall be deleted immediately. Of course letters are not allowed.
When I use this pattern ^[0-9]{0,5}$ on https://regex101.com/ it does what I intended to, but when I try this in JavaFX it doesn't work. But I couldn't find a solution yet.
Can anyone tell me what I did wrong?
Edit: For people, who didn't work with JavaFX yet: When the user enters just one character, the method check(String text) is called. So the result should also be true, when there are 1 to 5 digits. But not more ;-)
public class NumberTextField extends TextField{
ErrorLabel label;
NumberTextField(String text, ErrorLabel label){
setText(text);
setFont(Font.font("Calibri", 17));
setMinHeight(35);
setMinWidth(200);
setMaxWidth(200);
this.label = label;
}
NumberTextField(){}
#Override
public void replaceText(int start, int end, String text){
if(check(text)) {
super.replaceText(start, end, text);
}
}
#Override
public void replaceSelection(String text){
if(check(text)){
super.replaceSelection(text);
}
}
private boolean check(String text){
if(text.matches("^[0-9]{0,5}$")){
label.setText("Success");
label.setBlack();
return true;
} else{
return false;
}
}
You don't need to extend TextField to do this. In fact I recommend using a TextFormatter, since this is simpler to implement:
It does not require you to overwrite multiple method. You simply need to decide based on the data about the desired input, if you want to allow the change or not.
final Pattern pattern = Pattern.compile("\\d{0,5}");
TextFormatter<?> formatter = new TextFormatter<>(change -> {
if (pattern.matcher(change.getControlNewText()).matches()) {
// todo: remove error message/markup
return change; // allow this change to happen
} else {
// todo: add error message/markup
return null; // prevent change
}
});
TextField textField = new TextField();
textField.setTextFormatter(formatter);
Your original expression should be working fine, if we wish to validate a five-digits zip though, we might want to drop the 0 quantifier:
^[0-9]{5}$
^\d{5}$
For validation purposes, we might want to keep the start and end anchors, however for just testing, we can remove and see:
[0-9]{5}
\d{5}
It is likely that some other chars, would get through our inputs, which we do not wish to have.
Demo
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "^[0-9]{5}$";
final String string = "01234\n"
+ "012345\n"
+ "0\n"
+ "1234";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}

Parsing tags in string

I'm trying to parse a string with custom tags like this
[color value=0x000000]This house is [wave][color value=0xFF0000]haunted[/color][/wave].
I've heard about ghosts [shake]screaming[/shake] here after midnight.[/color]
I've figured out what regexps to use
/\[color value=(.*?)\](.*?)\[\/color\]/gs
/\[wave\](.*?)\[\/wave\]/gs
/\[shake\](.*?)\[\/shake\]/gs
But the thing is - I need to get correct ranges (startIndex, endIndex) of those groups in result string so I could apply them correctly. And that's where I feel completely lost, because everytime I replace tags there's always a chance for indexes to mess up. It gets espesically hard for nested tags.
So input is a string
[color value=0x000000]This house is [wave][color value=0xFF0000]haunted[/color][/wave].
I've heard about ghosts [shake]screaming[/shake] here after midnight.[/color]
And in output I want to get something like
Apply color 0x000000 from 0 to 75
Apply wave from 14 to 20
Apply color 0xFF0000 from 14 to 20
Apply shake from 46 to 51
Notice that's indices match to result string.
How do I parse it?
Unfortunately, I'm not familiar with ActionScript, but this C# code shows one solution using regular expressions. Rather than match specific tags, I used a regular expression that can match any tag. And instead of trying to make a regular expression that matches the whole start and end tag including the text in between (which I think is impossible with nested tags), I made the regular expression just match a start OR end tag, then did some extra processing to match up the start and end tags and remove them from the string keeping the essential information.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
class Program
{
static void Main(string[] args)
{
string data = "[color value=0x000000]This house is [wave][color value=0xFF0000]haunted[/color][/wave]. " +
"I've heard about ghosts [shake]screaming[/shake] here after midnight.[/color]";
ParsedData result = ParseData(data);
foreach (TagInfo t in result.tags)
{
if (string.IsNullOrEmpty(t.attributeName))
{
Console.WriteLine("Apply {0} from {1} to {2}", t.name, t.start, t.start + t.length - 1);
}
else
{
Console.WriteLine("Apply {0} {1}={2} from {3} to {4}", t.name, t.attributeName, t.attributeValue, t.start, t.start + t.length - 1);
}
Console.WriteLine(result.data);
Console.WriteLine("{0}{1}\n", new string(' ', t.start), new string('-', t.length));
}
}
static ParsedData ParseData(string data)
{
List<TagInfo> tagList = new List<TagInfo>();
Regex reTag = new Regex(#"\[(\w+)(\s+(\w+)\s*=\s*([^\]]+))?\]|\[(\/\w+)\]");
Match m = reTag.Match(data);
// Phase 1 - Collect all the start and end tags, noting their position in the original data string
while (m.Success)
{
if (m.Groups[1].Success) // Matched a start tag
{
tagList.Add(new TagInfo()
{
name = m.Groups[1].Value,
attributeName = m.Groups[3].Value,
attributeValue = m.Groups[4].Value,
tagLength = m.Groups[0].Length,
start = m.Groups[0].Index
});
}
else if (m.Groups[5].Success)
{
tagList.Add(new TagInfo()
{
name = m.Groups[5].Value,
tagLength = m.Groups[0].Length,
start = m.Groups[0].Index
});
}
m = m.NextMatch();
}
// Phase 2 - match end tags to start tags
List<TagInfo> unmatched = new List<TagInfo>();
foreach (TagInfo t in tagList)
{
if (t.name.StartsWith("/"))
{
for (int i = unmatched.Count - 1; i >= 0; i--)
{
if (unmatched[i].name == t.name.Substring(1))
{
t.otherEnd = unmatched[i];
unmatched[i].otherEnd = t;
unmatched.Remove(unmatched[i]);
break;
}
}
}
else
{
unmatched.Add(t);
}
}
int subtractLength = 0;
// Phase 3 - Remove tags from the string, updating start positions and calculating length in the process
foreach (TagInfo t in tagList.ToArray())
{
t.start -= subtractLength;
// If this is an end tag, calculate the length for the corresponding start tag,
// and remove the end tag from the tag list.
if (t.otherEnd.start < t.start)
{
t.otherEnd.length = t.start - t.otherEnd.start;
tagList.Remove(t);
}
// Keep track of how many characters in tags have been removed from the string so far
subtractLength += t.tagLength;
}
return new ParsedData()
{
data = reTag.Replace(data, string.Empty),
tags = tagList.ToArray()
};
}
class TagInfo
{
public int start;
public int length;
public int tagLength;
public string name;
public string attributeName;
public string attributeValue;
public TagInfo otherEnd;
}
class ParsedData
{
public string data;
public TagInfo[] tags;
}
}
The output is:
Apply color value=0x000000 from 0 to 76
This house is haunted. I've heard about ghosts screaming here after midnight.
-----------------------------------------------------------------------------
Apply wave from 14 to 20
This house is haunted. I've heard about ghosts screaming here after midnight.
-------
Apply color value=0xFF0000 from 14 to 20
This house is haunted. I've heard about ghosts screaming here after midnight.
-------
Apply shake from 47 to 55
This house is haunted. I've heard about ghosts screaming here after midnight.
---------
Let me show you a parsing method that you can apply not only to the case above, but to every case with a pattern cutting through the case. This method is not limited to the terms - color, wave, shake.
private List<Tuple<string, string>> getVals(string input)
{
List<Tuple<string, string>> finals = new List<Tuple<string,string>>();
// first parser
var mts = Regex.Matches(input, #"\[[^\u005D]+\]");
foreach (var mt in mts)
{
// has no value=
if (!Regex.IsMatch(mt.ToString(), #"(?i)value[\n\r\t\s]*="))
{
// not closing tag
if (!Regex.IsMatch(mt.ToString(), #"^\[[\n\r\t\s]*\/"))
{
try
{
finals.Add(new Tuple<string, string>(Regex.Replace(mt.ToString(), #"^\[|\]$", "").Trim(), ""));
}
catch (Exception es)
{
Console.WriteLine(es.ToString());
}
}
}
// has value=
else
{
try
{
var spls = Regex.Split(mt.ToString(), #"(?i)value[\n\r\t\s]*=");
finals.Add(new Tuple<string, string>(Regex.Replace(spls[0].ToString(), #"^\[", "").Trim(), Regex.Replace(spls[1].ToString(), #"^\]$", "").Trim()));
}
catch (Exception es)
{
Console.WriteLine(es.ToString());
}
}
}
return finals;
}
I also have an experience parsing JSON with a single regular expression. If you wonder what it is, visit my blog www.mysplitter.com .

Eliminate newlines in google app script using regex

I'm trying to write part of an add-on for Google Docs that eliminates newlines within selected text using replaceText. The obvious text.replaceText("\n",""); gives the error Invalid argument: searchPattern. I get the same error with text.replaceText("\r","");. The following attempts do nothing: text.replaceText("/\n/","");, text.replaceText("/\r/","");. I don't know why Google App Script does not allow for the recognition of newlines in regex.
I am aware that there is an add-on that does this already, but I want to incorporate this function into my add-on.
This error occurs even with the basic
DocumentApp.getActiveDocument().getBody().textReplace("\n","");
My full function:
function removeLineBreaks() {
var selection = DocumentApp.getActiveDocument().getSelection();
if (selection) {
var elements = selection.getRangeElements();
for (var i = 0; i < elements.length; i++) {
var element = elements[i];
// Only deal with text elements
if (element.getElement().editAsText) {
var text = element.getElement().editAsText();
if (element.isPartial()) {
text.replaceText("\n","");
}
// Deal with fully selected text
else {
text.replaceText("\n","");
}
}
}
}
// No text selected
else {
DocumentApp.getUi().alert('No text selected. Please select some text and try again.');
}
}
It seems that in replaceText, to remove soft returns entered with Shift-ENTER, you can use \v:
.replaceText("\\v+", "")
If you want to remove all "other" control characters (C0, DEL and C1 control codes), you may use
.replaceText("\\p{Cc}+", "")
Note that the \v pattern is a construct supported by JavaScript regex engine, and is considered to match a vertical tab character (≡ \013) by the RE2 regex library used in most Google products.
The Google Apps Script function replaceText() still doesn't accept escape characters, but I was able to get around this by using getText(), then the generic JavaScript replace(), then setText():
var doc = DocumentApp.getActiveDocument();
var body = doc.getBody();
var bodyText = body.getText();
//DocumentApp.getUi().alert( "Does document contain \\t? " + /\t/.test( bodyText ) ); // \n true, \r false, \t true
bodyText = bodyText.replace( /\n/g, "" );
bodyText = bodyText.replace( /\t/g, "" );
body.setText( bodyText );
This worked within a Doc. Not sure if the same is possible within a Sheet (and, even if it were, you'd probably have to run this once cell at a time).
here is my pragmatic solution to eliminate newlines in Google Docs, or, more exact, to eliminate newlines from Gmail message.getPlainBody().
It looks that Google uses '\r\n\r\n' as a plain EOL and '\r\n' as a manuell Linefeed (Shift-Enter). The code should be self explainable.
It might help to get alone with the newline problem in Docs.
A solution possibly not very elegant, but works like a charm :-)
function GetEmails2Doc() {
var doc = DocumentApp.getActiveDocument();
var body = doc.getBody();
var pc = 0; // Paragraph Counter
var label = GmailApp.getUserLabelByName("_Send2Sheet");
var threads = label.getThreads();
var i = threads.length;
// LOOP Messages within a THREAT
for (i=threads.length-1; i>=0; i--) {
for (var j = 0; j < messages.length; j++) {
var message = messages[j];
/* Here I do some ...
body.insertParagraph(pc++, Utilities.formatDate(message.getDate(), "GMT",
"dd.MM.yyyy (HH:mm)")).setHeading(DocumentApp.ParagraphHeading.HEADING4)
str = message.getFrom() + ' to: ' + message.getTo();
if (message.getCc().length >0) str = str + ", Cc: " + message.getCc();
if (message.getBcc().length >0) str = str + ", Bcc: " + message.getBcc();
body.insertParagraph(pc++,str);
*/
// Body !!
var str = processBody(message.getPlainBody()).split("pEOL");
Logger.log(str.length + " EOLs");
for (var k=0; k<str.length; k++) body.insertParagraph(pc++,str[k]);
}
}
}
function processBody(tx) {
var s = tx.split(/\r\n\r\n/g);
// it looks like message.getPlainBody() [of mail] uses \r\n\r\n as EOL
// so, I first substitute the 'EOL's with the string pattern "pEOL"
// to be replaced with body.insertParagraph in the main function
tx = '';
for (k=0; k<s.length; k++) tx = tx + s[k] + "pEOL";
// then replace all remaining simple \r\n with a blank
s = tx.split(/\r\n/g);
tx = '';
for (k=0; k<s.length; k++) tx = tx + s[k] + " ";
return tx;
}
I have now found out through much trial and error -- and some much needed help from Wiktor Stribiżew (see other answer) -- that there is a solution to this, but it relies on the fact that Google Script does not recognise \n or \r in regex searches. The solution is as follows:
function removeLineBreaks() {
var selection = DocumentApp.getActiveDocument()
.getSelection();
if (selection) {
var elements = selection.getRangeElements();
for (var i = 0; i < elements.length; i++) {
var element = elements[i];
// Only deal with text elements
if (element.getElement()
.editAsText) {
var text = element.getElement()
.editAsText();
if (element.isPartial()) {
var start = element.getStartOffset();
var finish = element.getEndOffsetInclusive();
var oldText = text.getText()
.slice(start, finish);
if (oldText.match(/\r/)) {
var number = oldText.match(/\r/g)
.length;
for (var j = 0; j < number; j++) {
var location = oldText.search(/\r/);
text.deleteText(start + location, start + location);
text.insertText(start + location, ' ');
var oldText = oldText.replace(/\r/, ' ');
}
}
}
// Deal with fully selected text
else {
text.replaceText("\\v+", " ");
}
}
}
}
// No text selected
else {
DocumentApp.getUi()
.alert('No text selected. Please select some text and try again.');
}
}
Explanation
Google Docs allows searching for vertical tabs (\v), which match newlines.
Partial text is a whole other problem. The solution to dealing with partially selected text above finds the location of newlines by extracting a text string from the text element and searching in that string. It then uses these locations to delete the relevant characters. This is repeated until the number of newlines in the selected text has been reached.
This Stack Overflow answer removes, specifically, "\n". It may help, it helped me indeed.

How to remove asterisk from this spin syntax code?

here is my code it is a text spinner (synonym)
public function fetchContent($keyword)
{
$customContent = $this->getOption('custom_content_text');
$this->_setHttpStatusCode(200);
if (!$customContent)
{
$this->_setContentStatus(self::CONTENT_STATUS_NO_RESULTS);
return false;
}
if (preg_match_all('/({\*)(.*?)(\*})/', $customContent, $result))
{
if (is_array($result[0]))
{
foreach ($result[0] as $index => $group_string)
{
//replace the first or next pattern match with a replaceable token
$customContent = preg_replace('/(\{\*)(.*?)(\*\})/', '{#'.$index.'#}', $customContent, 1);
$words = explode('|', $result[2][$index]);
//clean and trim all words
$finalPhrase = array();
foreach ($words as $word)
{
if (preg_match('/\S/', $word))
{
$word = preg_replace('/{%keyword%}/i', $keyword, $word);
$finalPhrase[] = trim($word);
}
}
$finalPhrase = $finalPhrase[rand(0, count($finalPhrase) - 1)];
//now inject it back to where the token was
$customContent = str_ireplace('{#' . $index . '#}', $finalPhrase, $customContent);
}
$this->_setContentStatus(self::CONTENT_STATUS_PASSED);
}
}
return $customContent;
}
}
there is regex that request bracket like this
{*spin1|spin2|spin3*}
here is the regex from the snippet above
if (preg_match_all('/({\*)(.*?)(\*})/', $customContent, $result))
$customContent = preg_replace('/(\{\*)(.*?)(\*\})/', '{#'.$index.'#}', $customContent, 1);
i would like to remove the * to format allow just {spin1|spin2|spin3} wich is more compatible with most spinner ,
i tried with some regex that i find online
i tried to remove the * from both regex without result
thanks you very much for your help
Remove \* instead of just * – Lucas Trzesniewski

Regular expression for csv with commas and no quotes

I'm trying to parse really complicated csv, which is generated wittout any quotes for columns with commas.
The only tip I get, that commas with whitespace before or after are included in field.
Jake,HomePC,Microsoft VS2010, Microsoft Office 2010
Should be parsed to
Jake
HomePC
Microsoft VS2010, Microsoft Office 2010
Can anybody advice please on how to include "\s," and ,"\s" to column body.
If your language supports lookbehind assertions, split on
(?<!\s),(?!\s)
In C#:
string[] splitArray = Regex.Split(subjectString,
#"(?<!\s) # Assert that the previous character isn't whitespace
, # Match a comma
(?!\s) # Assert that the following character isn't whitespace",
RegexOptions.IgnorePatternWhitespace);
split by r"(?!\s+),(?!\s+)"
in python you can do this like
import re
re.split(r"(?!\s+),(?!\s+)", s) # s is your string
Try this. It gave me the desired result which you have mentioned.
StringBuilder testt = new StringBuilder("Jake,HomePC,Microsoft VS2010, Microsoft Office 2010,Microsoft VS2010, Microsoft Office 2010");
Pattern varPattern = Pattern.compile("[a-z0-9],[a-z0-9]", Pattern.CASE_INSENSITIVE);
Matcher varMatcher = varPattern.matcher(testt);
List<String> list = new ArrayList<String>();
int startIndex = 0, endIndex = 0;
boolean found = false;
while (varMatcher.find()) {
endIndex = varMatcher.start()+1;
if (startIndex == 0) {
list.add(testt.substring(startIndex, endIndex));
} else {
startIndex++;
list.add(testt.substring(startIndex, endIndex));
}
startIndex = endIndex;
found = true;
}
if (found) {
if (startIndex == 0) {
list.add(testt.substring(startIndex));
} else {
list.add(testt.substring(startIndex + 1));
}
}
for (String s : list) {
System.out.println(s);
}
Please note that the code is in Java.