Regex: Replace semicolons with a newline "\n"

I have a string in which I need to replace semicolons with \n. The requirement is to detect only those semicolons that are outside HTML <> tags and replace them with \n.
After several fixes, I have come very close with this regex:
/((?:^|>)[^<>]*);([^<>]*(?:<|$))/g, '$1\n$2'
The above regex works well if I input string like the below one -
Value1;<p style="color:red; font-weight:400;">Value2</p>;<p style="color:red; font-weight:400;">Value3</p>;Value4
The output it gives is this (which is expected and correct) -
Value1
<p style="color:red; font-weight:400;">Value2</p>
<p style="color:red; font-weight:400;">Value3</p>
Value4
But it fails for an input like M1;M2;M3
The output it gives is -
M1;M2
M3
(the semicolon between M1 and M2 is not replaced),
whereas the expected output is -
M1
M2
M3
The string can also combine both forms - M1;M2;M3;Value1;<p style="color:red; font-weight:400;">Value2</p>;<p style="color:red; font-weight:400;">Value3</p>;Value4
The main goal is to replace every semicolon outside HTML tags <> with '\n' (a newline).

You can use this regex together with JavaScript's .replace() method:
/(<[^<>]*>)|;/g
For substitution, you may use this function:
(_, tag) => tag || '\n'
If the group (<[^<>]*>) captures anything, i.e. an HTML tag, it is passed in as the tag parameter; otherwise the match must be a ; outside any tag.
So you can check whether tag exists. If it does, replace the match with the tag itself; otherwise replace it with \n.
const text = `Value1;<p style="color:red; font-weight:400;">Value2</p>;<p style="color:red; font-weight:400;">Value3</p>;Value4
M1;M2;M3`;
const regex = /(<[^<>]*>)|;/g;
const result = text.replace(regex, (_, tag) => tag || '\n');
console.log(result);
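To see why the original pattern leaves M1;M2 untouched, it may help to run both patterns side by side. The sketch below is only an illustration, and the helper names originalAttempt and splitOutsideTags are made up for it. The original regex consumes the text on both sides of the semicolon, so on M1;M2;M3 the greedy [^<>]* swallows the whole string in a single match and only the last semicolon is replaced.

```javascript
// Hypothetical helper names, used only for this illustration.

// The asker's pattern: each match consumes the text around one semicolon,
// so on "M1;M2;M3" a single match covers the whole string.
function originalAttempt(text) {
  return text.replace(/((?:^|>)[^<>]*);([^<>]*(?:<|$))/g, '$1\n$2');
}

// The answer's pattern: tags are matched first and put back unchanged;
// a semicolon that is matched on its own is therefore outside any tag.
function splitOutsideTags(text) {
  return text.replace(/(<[^<>]*>)|;/g, (_, tag) => tag || '\n');
}

const combined =
  'M1;M2;M3;Value1;<p style="color:red; font-weight:400;">Value2</p>;Value4';

console.log(originalAttempt('M1;M2;M3'));
console.log(splitOutsideTags('M1;M2;M3'));
console.log(splitOutsideTags(combined));
```

Because the second pattern matches one tag or one semicolon at a time, no match ever spans two semicolons, which is what the original approach could not guarantee.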

Related

How to replace newline or soft linebreak (ctrl+enter) in Google doc app script?

I have a working template (Google Doc) with variables in the following patterns that need to be replaced with values:
{{BASIC SALARY_.Description}}
{{OT1.5.Description}}
{{MEL ALW.Description}}
{{OST ALW.Description}}
{{TRV ALW.Description}}
{{ADV SAL.Description}}
Note: I am using soft line breaks (Ctrl+Enter) in the Google Doc because I couldn't figure out how to detect the normal line break patterns "\n" or "\r\n". My results are always off, because some lines need to be replaced with proper descriptions while others need to be removed entirely (the whole {{pattern}} together with its line break, to avoid an empty line).
I have tried multiple regex patterns and searched online:
https://github.com/google/re2/wiki/Syntax
Eliminate newlines in google app script using regex
Use RegEx in Google Doc Apps Script to Replace Text
and concluded that soft line breaks (matched by the pattern \v) are the only workable approach. Please check my sample code below, as the pattern replacement doesn't work as expected.
// code block 1
var doc = DocumentApp.openById(flPayslip.getId());
var body = doc.getBody();
body.replaceText("{{BASIC SALARY_.Description}}", "Basic Salary");
body.replaceText("{{OST ALW.Description}}", "Outstation Allowance");
// code block 2
var doc = DocumentApp.openById(flPayslip.getId());
var body = doc.getBody();
body.replaceText("{{BASIC SALARY_.Description}}", "Basic Salary");
body.replaceText("{{OST ALW.Description}}", "Outstation Allowance");
body.replaceText("{{.*}}\\v+", ""); // to replace soft linebreak
Actual Result of code block 1
Basic Salary
{{OT1.5.Description}}
{{MEL ALW.Description}}
Outstation Allowance
{{TRV ALW.Description}}
{{ADV SAL.Description}}
Actual Result of code block 2:
Basic Salary
Issue: "Outstation Allowance" was also removed by the regex replacement.
Expected result
Basic Salary
Outstation Allowance
What's the proper regex pattern I should use in my code?
Try
body.replaceText("{{[^\\v]*?}}\\v+", ""); // no \v allowed inside {{ }}, and a lazy quantifier (*?)
When you use {{.*}}, the greedy .* matches everything between the first {{ and the last }}, including the "Outstation Allowance" line:
Basic Salary
{{
OT1.5.Description}}
{{MEL ALW.Description}}
Outstation Allowance
{{TRV ALW.Description}}
{{ADV SAL.Description
}}
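The difference between the two patterns can be reproduced outside Google Docs. The sketch below is a plain-JavaScript simulation, not Apps Script: it does not call DocumentApp, and it assumes each soft line break is a single "\v" character and that the last placeholder is also followed by one.

```javascript
// Plain-JavaScript simulation; DocumentApp is not used here.
// Assumption: each soft line break in the document is a "\v" character.
const lines = [
  'Basic Salary',
  '{{OT1.5.Description}}',
  '{{MEL ALW.Description}}',
  'Outstation Allowance',
  '{{TRV ALW.Description}}',
  '{{ADV SAL.Description}}',
];
const paragraph = lines.join('\v') + '\v';

// Greedy: .* runs from the first {{ to the last }} in the paragraph.
const greedyResult = paragraph.replace(/\{\{.*\}\}\v+/g, '');

// Lazy and restricted: each match stays within one {{...}} placeholder.
const lazyResult = paragraph.replace(/\{\{[^\v]*?\}\}\v+/g, '');

console.log(JSON.stringify(greedyResult));
console.log(JSON.stringify(lazyResult));
```

In this simulation the greedy version keeps only "Basic Salary", while the restricted version also keeps "Outstation Allowance".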

Regex trim all <br>'s on a string while ignoring line breaks and spaces

var str = `
<br><br/>
<Br>
foobar
<span>yay</span>
<br><br>
catmouse
<br>
`;
// this doesn't work, but it's what I have so far
str.replace(/^(<br\s*\/?>)*|(<br\s*\/?>)*$/ig, '');
var desiredOutput = `
foobar
<span>yay</span>
<br><br>
catmouse
`;
I want to remove all leading and trailing <br>'s, regardless of case or whether a closing slash is present, and I want to keep any <br>'s that sit in the middle of the text. There may be other HTML tags present.
Edit: I want to note that this will be happening server-side so DOMParser won't be available to me.
We may try using the following pattern:
^\s*(<br\/?>\s*)*|(<br\/?>\s*)*\s*$
This pattern targets <br> tags (and their variants) only if they occur at the start or end of the string, possibly preceded or followed by some whitespace.
var str = '<br><br/>\n<Br>\nfoobar\n<span>yay</span>\n<br><br>\ncatmouse\n<br>';
console.log(str + '\n');
str = str.replace(/^\s*(<br\/?>\s*)*|(<br\/?>\s*)*\s*$/ig, '');
console.log(str);
Note that in general parsing HTML with regex is not advisable. But in this case, since you just want to remove flat non-nested break tags from the start and end, regex might be viable.
Don't use a regular expression for this - regular expressions and HTML parsing don't work that well together. Even if it's possible with a regex, I'd recommend using DOMParser instead; transform the text into a document, and iterate through the first and last nodes, removing them while their tagName is BR (and removing empty text nodes too, if they exist):
var str = `
<br><br/>
<Br>
foobar
<span>yay</span>
<br><br>
catmouse
<br>
`;
const body = new DOMParser().parseFromString(str.trim(), 'text/html').body;
const nodes = [...body.childNodes];
let node;
// Remove leading <br> elements, plus any whitespace-only text node after each.
while ((node = nodes.shift()) && node.tagName === 'BR') {
  node.remove();
  const next = nodes[0];
  if (next && next.nodeType === 3 && next.textContent.trim() === '') nodes.shift().remove();
}
// Do the same from the end of the document.
while ((node = nodes.pop()) && node.tagName === 'BR') {
  node.remove();
  const next = nodes[nodes.length - 1];
  if (next && next.nodeType === 3 && next.textContent.trim() === '') nodes.pop().remove();
}
console.log(body.innerHTML);
Note that it gets a lot easier if you don't have to worry about empty text nodes, or if you don't care about whether there are empty text nodes or not in HTML output.
Try
/^(\s*<br\s*\/?>)*|(<br\s*\/?>\s*)*$/ig
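Since the question rules out DOMParser on the server, the regex answers can be wrapped in a small function. The following is a minimal sketch rather than a definitive solution: trimBreaks is a hypothetical name, and it assumes it is acceptable to strip the surrounding whitespace together with the <br> tags (the desired output in the question keeps one newline at each end). Unlike the first pattern above, it also accepts a space before the slash, as in <br />.

```javascript
// Hypothetical helper; no DOM is required, so it can run server-side.
function trimBreaks(html) {
  // A run of whitespace and <br> variants anchored at the start or the end.
  const edgeBreaks = /^(?:\s|<br\s*\/?>)+|(?:\s|<br\s*\/?>)+$/gi;
  return html.replace(edgeBreaks, '');
}

const sample =
  '\n<br><br/>\n<Br>\nfoobar\n<span>yay</span>\n<br><br>\ncatmouse\n<br />\n';

console.log(trimBreaks(sample));
```

The <br><br> in the middle is left alone because neither alternative can reach it: one is anchored with ^ and the other with $.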

How to use regex to extract such url

Here is a text:
<a class="mkapp-btn mab-download" href="javascript:void(0);" onclick="zhytools.downloadApp('C100306099', 'appdetail_dl', '24', 'http://appdlc.hicloud.com/dl/appdl/application/apk/f4/f44d320c2c1b466389e6f6b3d3f5cff4/com.uniquestudio.android.iemoji.1806141014.apk?sign=portal#portal1531621480529&source=portalsite' , 'v1.1.4');">
I want to extract
http://appdlc.hicloud.com/dl/appdl/application/apk/f4/f44d320c2c1b466389e6f6b3d3f5cff4/com.uniquestudio.android.iemoji.1806141014.apk?sign=portal#portal1531621480529&source=portalsite
I use the code below to extract it.
m = re.search("mkapp-btn mab-download.*'http://[^']'", apk_page)
As I understand it, .* should match the text between mkapp-btn mab-download and http. However, the search fails.
EDIT
I also tried.
m = re.search("(?<=mkapp-btn mab-download.*)http://[^']'", apk_page)
You need to add + after the negated character class ([^']) because the URL is more than one character long. You also need a capturing group (parentheses) to extract only the part you want.
m = re.search("mkapp-btn mab-download.*'(http[^']+)'", apk_page)
m.groups()
And the output will be
('http://appdlc.hicloud.com/dl/appdl/application/apk/f4/f44d320c2c1b466389e6f6b3d3f5cff4/com.uniquestudio.android.iemoji.1806141014.apk?sign=portal#portal1531621480529&source=portalsite',)

regex fail on razor

I have a simple regex - [a-zA-Z0-9][^\s] - that checks that there are at least two characters and the second one is not white space. This works fine in C# but not in an <input> field in Razor.
Here is the field in Razor:
<input type="search" name="search" required pattern="[a-zA-Z0-9][^\s]">
Here is a test in C#:
Console.WriteLine("string 'c-III' is valid = {0}", rgx.IsMatch("c-III"));
Console.WriteLine("string 'c ' is valid = {0}", rgx.IsMatch("c "));
Here is the result:
string 'c-III' is valid = True
string 'c ' is valid = False
This works as expected as well in regex101.com
When I type c-III I get the error message: Please match the requested format.
The expression needs to validate the following:
minimum of 2 characters
second character cannot be white space
I am not sure whether the expression needs to be adjusted or whether the problem is somewhere else. Any help will be appreciated.
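One likely cause, offered as an assumption since the question does not show where the message comes from: "Please match the requested format" is the browser's own validation message, and the HTML pattern attribute has to match the entire value, as if the expression were wrapped in ^(?:...)$. Regex.IsMatch in C# and regex101 instead look for a match anywhere in the string. The sketch below imitates the browser check in plain JavaScript; matchesLikePatternAttribute is a hypothetical helper, and the adjusted pattern is only one way to express the two requirements.

```javascript
// Hypothetical helper: imitates how a browser applies the pattern attribute,
// where the whole value must match the expression.
function matchesLikePatternAttribute(pattern, value) {
  return new RegExp('^(?:' + pattern + ')$').test(value);
}

// The original expression describes exactly two characters.
const originalPattern = '[a-zA-Z0-9][^\\s]';
// Possible adjustment: two or more characters, the second not whitespace.
const adjustedPattern = '[a-zA-Z0-9]\\S.*';

console.log(matchesLikePatternAttribute(originalPattern, 'c-III'));
console.log(matchesLikePatternAttribute(adjustedPattern, 'c-III'));
console.log(matchesLikePatternAttribute(adjustedPattern, 'c '));
```

Under this assumption the field would become pattern="[a-zA-Z0-9]\S.*", which accepts c-III and still rejects "c " and a single character.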

Regex match only working with first catches (JavaScript)

I have a file's contents in memory. Within the file, there are variables of the form:
{{ _("variable1") }}
{{ _("variable2") }}
{{ _("variable3") }}
I'm trying to catch them with /\{\{ _(.+) \}\}/i:
var result = /\{\{ _(.+) \}\}/i.exec(fileContents);
It seems to work at first, as the first two variables are pushed into the array, but then it pushes the whole file content.
What am I missing?
BONUS: It would be awesome if I could grab variable1 instead of {{ _("variable1") }} but I can live with it.
What you need is the g flag. This way you get an additional match every time you call exec (until there are no further matches, and you get null). For the bonus, just include the (" and ") in the pattern, so that they are not captured. Finally, you might want to make the .+ ungreedy, otherwise you'll get funny surprises if there are multiple occurrences of this pattern in a single line:
const r = /\{\{ _\("(.+?)"\) \}\}/ig;
let m;
while ((m = r.exec(fileContents)) !== null) {
  // m[0] contains the entire match
  // m[1] contains the contents of the quotes
}
By the way, if "variable1" cannot contain escaped quotes (like "some\"oddvariable"), then this regex should be slightly more efficient:
const r = /\{\{ _\("([^"]*)"\) \}\}/ig;
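As an alternative to looping over exec, String.prototype.matchAll (available in ES2020 environments) can collect every captured name in one pass. This is a small sketch, not part of the original answer; extractVariables and the sample contents are hypothetical.

```javascript
// Hypothetical helper: returns only the quoted names from the placeholders.
function extractVariables(fileContents) {
  const pattern = /\{\{ _\("([^"]*)"\) \}\}/g;
  // matchAll requires the g flag and yields one match object per occurrence.
  return Array.from(fileContents.matchAll(pattern), (m) => m[1]);
}

// Made-up sample standing in for the file contents.
const sampleContents = [
  '<h1>{{ _("variable1") }}</h1>',
  '<p>{{ _("variable2") }} and {{ _("variable3") }}</p>',
].join('\n');

console.log(extractVariables(sampleContents));
```

Because [^"]* cannot run past a closing quote, two placeholders on the same line are reported as two separate matches.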