How to replace text values that contain a # - regex

In my dataset I have a variable with values which contain html-code, e.g.:
<font color="#800080">None of these</font>.
I wanted to replace that with Other by:
df$Country <- gsub("<font color="#800080">None of these</font>", "Other", df$Country)
However that doesn't work, which is probably caused by the #-character. How can I solve this?
Part of the data:
structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("Spain", "<font color=\"#800080\">None of these</font>"), class = "factor")

Problems like this are a good reason not to use regex on HTML. Assuming your data started out as an actual HTML document, use XPath instead. Here's an example:
html.text <- '<html>
<head></head>
<body>
<div><font color="#800080">None of these</font></div>
</body>
</html>'
library(XML)
html <- htmlTreeParse(html.text,useInternalNodes=TRUE)
replaceNodes(html['//font[@color="#800080"]'][[1]],"Other")
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html>
# <head></head>
# <body>
# <div>Other</div>
# </body>
# </html>

There are two options to look at. Both assume we are starting with something that looks like this.
x <- '<font color="#800080">None of these</font>'
Option 1: Using a different quote. When you delimit the "pattern" argument with double quotes, the string ends at the next double quote it encounters, which comes just before your #. The # itself is not the problem. Hence, you can enclose the pattern in single quotes instead.
gsub('<font color="#800080">None of these</font>', "other", x)
Option 2: Escaping the quote character. This is as simple as putting a \ before the quote to indicate that it should be escaped.
gsub("<font color=\"#800080\">None of these</font>", "other", x)
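The same quoting rule can be checked outside R. A minimal Python sketch (Python is used purely for illustration, and the `countries` list is a hypothetical stand-in for df$Country) showing that both spellings denote the identical string and that the # needs no special treatment:

```python
# Both literals denote exactly the same string; only the quoting differs.
single_quoted = '<font color="#800080">None of these</font>'
escaped = "<font color=\"#800080\">None of these</font>"
assert single_quoted == escaped

# Hypothetical stand-in for the Country column from the question.
countries = [single_quoted, "Spain", "Spain", "Spain", "Spain"]

# Literal (non-regex) replacement, so no character needs regex escaping.
cleaned = [value.replace(single_quoted, "Other") for value in countries]
print(cleaned)
```

Because the replacement is literal, none of the characters in the HTML snippet can be misread as a regex metacharacter.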

Related

RegEx: Grabbing values with or without quotation marks

My Issue:
I am trying to grab Facebook meta values from different sites, but some websites (e.g. usatoday.com) do not quote their HTML attributes consistently, as Data Samples 1 and 2 show. My question is: how can I modify my regex to get the values of property and content?
What I've done:
With the if statement below I partly work around the quotation mark issue (it is not dynamic enough), but I guess there must be a better way (I am really bad at regex).
Secondly, my regex is not able to catch the content value (the URL) in Data Sample 2 for usatoday.com; I guess the missing quotation marks around the URL break my regex.
Really need some help here, big thanks!
if(
preg_match( '/<meta(.*?)property="og:title"(.*?)content="(.+?)"(.*?)(\/)?>/', $raw_html, $matching )
// for normal sites
or
preg_match( '/<meta(.*?)property=og:title(.*?)content="(.+?)"(.*?)(\/)?>/', $raw_html, $matching )
// property no quote at all
or
preg_match( '/<meta(.*?)property=og:title(.*?)content=(.+?)(.*?)(\/)?>/', $raw_html, $matching )
// no quote at all
)
Data Sample 1 - no quotation mark on meta text attribute
# usatoday.com
<meta property=og:title content="Lakers trading Russell Westbrook in massive three-team deal with Jazz and Timberwolves"/>
# normal sites
<meta property="og:title" content="Lakers trading Russell Westbrook in massive three-team deal with Jazz and Timberwolves"/>
Data Sample 2 - no quotation mark on meta URL attribute
# usatoday.com
<meta property=og:url content=https://www.usatoday.com/story/sports/nba/2023/02/08/lakers-jazz-timberwolves-trade-russell-westbrook-mike-conley-dangelo-russell/11214855002/ />
# normal sites
<meta property="og:url" content="https://www.usatoday.com/story/sports/nba/2023/02/08/lakers-jazz-timberwolves-trade-russell-westbrook-mike-conley-dangelo-russell/11214855002/" />
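One possible approach is a single pattern that accepts a double-quoted, single-quoted, or unquoted attribute value. The following is a minimal Python sketch rather than a drop-in PHP fix; the names `META_RE` and `og_values` are made up for illustration, and it assumes that property appears before content inside the tag, as in both data samples. A real HTML parser would be more robust than any regex.

```python
import re

# Assumption: "property" comes before "content", as in the samples.
# The value is either "double quoted", 'single quoted', or bare (no spaces).
META_RE = re.compile(
    r"""<meta\s+property=["']?og:(?P<prop>[\w:]+)["']?\s+"""
    r"""content=(?:"(?P<dq>[^"]*)"|'(?P<sq>[^']*)'|(?P<bare>\S+?))\s*/?>""",
    re.IGNORECASE,
)


def og_values(raw_html):
    """Return {property: content} for every og: meta tag that matches."""
    found = {}
    for m in META_RE.finditer(raw_html):
        # Exactly one of the three alternatives took part in the match.
        value = next(v for v in m.group("dq", "sq", "bare") if v is not None)
        found[m.group("prop")] = value
    return found


samples = """
<meta property=og:title content="Lakers trading Russell Westbrook in massive three-team deal with Jazz and Timberwolves"/>
<meta property=og:url content=https://www.usatoday.com/story/sports/nba/2023/02/08/lakers-jazz-timberwolves-trade-russell-westbrook-mike-conley-dangelo-russell/11214855002/ />
"""

result = og_values(samples)
print(result["title"])
print(result["url"])
```

The lazy `\S+?` in the unquoted branch stops as soon as the rest of the tag (`\s*/?>`) can match, so the trailing ` />` is not swallowed into the URL.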

regexing a string that sometimes has a space and sometimes not

This is my response
<body onload="javascript:document.getElementById('idForm').submit()">
<form id="idForm" action="https://x.y-test.z:443/hpp-webapp/hentpasient.html?ticket=I6VZgglkX/Z2z1GJYY1TzIqAscCJbWPI5pPBLl38VCEHcD/qh9qSz MzAIVv 6H2fau4DFMQscbPqy1HhFkgvg==" method="POST"
target="_top">
and I want to extract the value of ticket with a regex (Scala/Gatling).
Tried this:
.check(regex("<form id=\"idForm\" action=\"https://x.y-test.z:443/hpp-webapp/hentpasient.html?ticket=\"(.*?)\"").saveAs("jwtUncoded"))
But I get
> regex(<form id="idForm" action="https://x.y-z 1 (100,0%)
.no:443/hpp-webapp/hentpasient.html?ticket="(.*?)").find.exist...
When observing the output in Gatling I can see that the value of <ticket> sometimes has a space and sometimes not.
How can I regex this value?
regex("ticket=(.*?)\"")
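Your own attempt has a stray quote before the capture group: the response has no " directly after ticket=. It also contains an unescaped ? in the URL, which the regex engine treats as a quantifier rather than a literal character. Anchoring on ticket= alone avoids both problems: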
.check(regex("ticket=(.*?)\"").saveAs("jwtUncoded"))
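To see why the shorter pattern copes with the optional space, here is a small Python sketch that applies the same pattern to the response from the question. Python's `re` module stands in for Gatling here; the pattern only uses syntax that Java regex shares, but the surrounding code is illustrative rather than Gatling API.

```python
import re

# The response body from the question, including the space inside the ticket.
response = '''<body onload="javascript:document.getElementById('idForm').submit()">
<form id="idForm" action="https://x.y-test.z:443/hpp-webapp/hentpasient.html?ticket=I6VZgglkX/Z2z1GJYY1TzIqAscCJbWPI5pPBLl38VCEHcD/qh9qSz MzAIVv 6H2fau4DFMQscbPqy1HhFkgvg==" method="POST"
target="_top">'''

# Lazily capture everything between "ticket=" and the next double quote.
match = re.search(r'ticket=(.*?)"', response)
ticket = match.group(1)
print(ticket)
```

Since `.` matches a space like any other character (except a newline), the capture ends only at the closing quote of the action attribute, whether or not the ticket contains spaces.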

Yesod Hamlet breaks HTML by replacing single quotes with double quotes

I have some HTML code that I'm using in Hamlet:
<div .modal-card .card data-options='{"valueNames": ["name"]}' data-toggle="lists">
Notice that the single quotes for data-options allows the use of double quotes inside the string.
The problem is that when Hamlet renders the page, Hamlet puts " around the ' and so the HTML is broken:
<div class="modal-card card" data-options="'{" valuenames":"="" ["name"]}'="" data-toggle="lists">
Some external JS library plugin code runs, it tries to parse the JSON inside data-options and fails.
How can I tell Hamlet to include a literal string?
I've tried various combinations of:
let theString = "{\"valueNames\": [\"name\"]}"
let theString2 = "data-options='{\"valueNames\": [\"name\"]}'"
etc
And in the hamlet file:
<div .modal-card .card data-options='#{ preEscapedText theString }' data-toggle="lists">
or
<div .modal-card .card #{ preEscapedText theString2 } data-toggle="lists">
But all attempts produce invalid HTML or invalid JSON inside the string.
How can I instruct Hamlet to simply include a literal string in the output HTML?
Update:
Tried more things, no result.
The string2 example doesn't work because Hamlet seems to think that I'm trying to set id="{" as per https://www.yesodweb.com/book/shakespearean-templates#shakespearean-templates_attributes
Why not render the JSON escaped (" becomes &quot;) and “handle” the quotes later when parsing?
Interpolate in Hamlet:
<div #the-modal .modal-card .card data-options='#{theString}' data-toggle="lists">
Parse the data attribute as JSON:
let json = document.getElementById("the-modal").getAttribute("data-options");
let opts = JSON.parse(json); // At least in Chrome, it works!
As for theString2 alternative, you can also interpolate attributes in Hamlet using a tuple or list of tuples and the star symbol:
let dataOptions = ("data-options", "{\"valueNames\": [\"name\"]}") :: (Text, Text)
...
<div #the-modal .modal-card .card *{dataOptions} data-toggle="lists">
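The escaped-quote suggestion above relies on HTML parsers turning &quot; back into " when an attribute value is read. A minimal Python sketch of that round trip, using only the standard library; the class name `AttrGrabber` is made up for illustration, and the tag is built by hand here rather than rendered by Hamlet:

```python
import html
import json
from html.parser import HTMLParser

the_string = '{"valueNames": ["name"]}'

# Escape the JSON for use inside an attribute: " becomes &quot;
escaped = html.escape(the_string, quote=True)
tag = f'<div id="the-modal" data-options="{escaped}" data-toggle="lists">'


class AttrGrabber(HTMLParser):
    """Collects the attributes of the last <div> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.attrs = dict(attrs)


parser = AttrGrabber()
parser.feed(tag)

# The parser hands back the unescaped value, which is valid JSON again.
opts = json.loads(parser.attrs["data-options"])
print(escaped)
print(opts)
```

The markup contains no literal double quotes inside the attribute, yet the consumer still receives the original JSON text.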

Get the second eval (regex with wildcards)?

I am reading an HTML document. So far I have been using HTML::TreeBuilder with HTML::Element and look_down, but now I am stuck with the content of a script <script>...</script>
<script language="JavaScript">
eval(function(p,a,c,k,e,r){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('7 F={f:"P+/=",Q:z(5){7 8="";7 s,k,l,v,t,h,j;7 i=0;5=F.I(5);G(i<5.B){s=5.m(i++);k=5.m(i++);l=5.m(i++);v=s>>2;t=((s&3)<<4)|(k>>4);h=((k&H)<<2)|(l>>6);j=l&u;o(J(k)){h=j=C}w o(J(l)){j=C}8=8+p.f.q(v)+p.f.q(t)+p.f.q(h)+p.f.q(j)}D 8},R:z(5){7 8="";7 s,k,l;7 v,t,h,j;7 i=0;5=5.K(/[^A-S-T-9\\+\\/\\=]/g,"");G(i<5.B){v=p.f.E(5.q(i++));t=p.f.E(5.q(i++));h=p.f.E(5.q(i++));j=p.f.E(5.q(i++));s=(v<<2)|(t>>4);k=((t&H)<<4)|(h>>2);l=((h&3)<<6)|j;8=8+b.d(s);o(h!=C){8=8+b.d(k)}o(j!=C){8=8+b.d(l)}}8=F.L(8);D 8},I:z(e){e=e.K(/\\r\\n/g,"\\n");7 a="";U(7 n=0;n<e.B;n++){7 c=e.m(n);o(c<x){a+=b.d(c)}w o((c>V)&&(c<W)){a+=b.d((c>>6)|X);a+=b.d((c&u)|x)}w{a+=b.d((c>>M)|N);a+=b.d(((c>>6)&u)|x);a+=b.d((c&u)|x)}}D a},L:z(a){7 e="";7 i=0;7 c=Y=y=0;G(i<a.B){c=a.m(i);o(c<x){e+=b.d(c);i++}w o((c>Z)&&(c<N)){y=a.m(i+1);e+=b.d(((c&10)<<6)|(y&u));i+=2}w{y=a.m(i+1);O=a.m(i+2);e+=b.d(((c&H)<<M)|((y&u)<<6)|(O&u));i+=3}}D e}}',62,63,'|||||input||var|output||utftext|String||fromCharCode|string|_keyStr||enc3||enc4|chr2|chr3|charCodeAt||if|this|charAt||chr1|enc2|63|enc1|else|128|c2|function||length|64|return|indexOf|Base64|while|15|_utf8_encode|isNaN|replace|_utf8_decode|12|224|c3|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789|encode|decode|Za|z0|for|127|2048|192|c1|191|31'.split('|'),0,{}));
eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(c/a))+String.fromCharCode(c%a+161)};if(!''.replace(/^/,String)){while(c--){d[e(c)]=k[c]||e(c)}k=[function(e){return d[e]}];e=function(){return'\[\xa1-\xff]+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp(e(c),'g'),k[c])}}return p}('¦ £=\'¥+¢+¤+¢+\';¡.«();¡.§(©.¨(£));¡.ª();',11,11,'document|PC9pZnJhbWU|ba2se|PGlmcmFtZSB3aWR0aCA9ICIxMDAlIiBoZWlnaHQgPSAiMTAwJSIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iJiMxMDQ7JiMxMTY7JiMxMTY7JiMxMTI7JiM1ODsmIzQ3OyYjNDc7JiMxMTc7JiMxMDg7JiM0NjsmIzExNjsmIzExMTsmIzQ3OyYjNTY7JiMxMTQ7JiM5ODsmIzEyMTsmIzExNzsmIzU3OyYjMTEzOyYjMTAwOyI|PGlmcmFtZSB3aWR0aCA9ICIwIiBoZWlnaHQgPSAiMCIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iaHR0cDovL2dvb2dsZS5kZSI|var|write|decode|Base64|close|open'.split('|'),0,{}))
</script>
I want to get the text of the second eval, i.e. from eval(......) up to the closing </script> tag.
I tried
my ($var) = $response->decoded_content =~ /^eval(.*?)\/script/
but I get both evals, which is obvious.
EDIT : Added raw source
This program shows how you might go about it. /eval/ finds the first occurrence of eval, while /.*eval/ finds the last occurrence, because the greedy .* consumes as much as possible before eval has to match.
I have used an HTML document that is empty apart from a single <script> element in the <head> section.
The call to look_down will find all <script> elements with a language attribute equal to JavaScript and put them in the array @script. In this case there is only one, so I use $script[0]. Depending on your HTML you may need to select one of several elements.
A call to as_text ignores <script> and <style> elements, so I have to use content_list to get the text inside the <script> element. This text is put into $content, and everything from the last occurrence of eval onwards is copied to $eval.
I hope this helps.
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file(\*DATA);
my @script = $tree->look_down(_tag => 'script', language => 'JavaScript');
my ($content) = $script[0]->content_list;
my ($eval) = $content =~ /.*(eval.+\S)/s;
print $eval;
__DATA__
<html>
<head>
<script language="JavaScript">
eval(function(p,a,c,k,e,r){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('7 F={f:"P+/=",Q:z(5){7 8="";7 s,k,l,v,t,h,j;7 i=0;5=F.I(5);G(i<5.B){s=5.m(i++);k=5.m(i++);l=5.m(i++);v=s>>2;t=((s&3)<<4)|(k>>4);h=((k&H)<<2)|(l>>6);j=l&u;o(J(k)){h=j=C}w o(J(l)){j=C}8=8+p.f.q(v)+p.f.q(t)+p.f.q(h)+p.f.q(j)}D 8},R:z(5){7 8="";7 s,k,l;7 v,t,h,j;7 i=0;5=5.K(/[^A-S-T-9\\+\\/\\=]/g,"");G(i<5.B){v=p.f.E(5.q(i++));t=p.f.E(5.q(i++));h=p.f.E(5.q(i++));j=p.f.E(5.q(i++));s=(v<<2)|(t>>4);k=((t&H)<<4)|(h>>2);l=((h&3)<<6)|j;8=8+b.d(s);o(h!=C){8=8+b.d(k)}o(j!=C){8=8+b.d(l)}}8=F.L(8);D 8},I:z(e){e=e.K(/\\r\\n/g,"\\n");7 a="";U(7 n=0;n<e.B;n++){7 c=e.m(n);o(c<x){a+=b.d(c)}w o((c>V)&&(c<W)){a+=b.d((c>>6)|X);a+=b.d((c&u)|x)}w{a+=b.d((c>>M)|N);a+=b.d(((c>>6)&u)|x);a+=b.d((c&u)|x)}}D a},L:z(a){7 e="";7 i=0;7 c=Y=y=0;G(i<a.B){c=a.m(i);o(c<x){e+=b.d(c);i++}w o((c>Z)&&(c<N)){y=a.m(i+1);e+=b.d(((c&10)<<6)|(y&u));i+=2}w{y=a.m(i+1);O=a.m(i+2);e+=b.d(((c&H)<<M)|((y&u)<<6)|(O&u));i+=3}}D e}}',62,63,'|||||input||var|output||utftext|String||fromCharCode|string|_keyStr||enc3||enc4|chr2|chr3|charCodeAt||if|this|charAt||chr1|enc2|63|enc1|else|128|c2|function||length|64|return|indexOf|Base64|while|15|_utf8_encode|isNaN|replace|_utf8_decode|12|224|c3|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789|encode|decode|Za|z0|for|127|2048|192|c1|191|31'.split('|'),0,{}));
eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(c/a))+String.fromCharCode(c%a+161)};if(!''.replace(/^/,String)){while(c--){d[e(c)]=k[c]||e(c)}k=[function(e){return d[e]}];e=function(){return'\[\xa1-\xff]+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp(e(c),'g'),k[c])}}return p}('¦ £=\'¥+¢+¤+¢+\';¡.«();¡.§(©.¨(£));¡.ª();',11,11,'document|PC9pZnJhbWU|ba2se|PGlmcmFtZSB3aWR0aCA9ICIxMDAlIiBoZWlnaHQgPSAiMTAwJSIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iJiMxMDQ7JiMxMTY7JiMxMTY7JiMxMTI7JiM1ODsmIzQ3OyYjNDc7JiMxMTc7JiMxMDg7JiM0NjsmIzExNjsmIzExMTsmIzQ3OyYjNTY7JiMxMTQ7JiM5ODsmIzEyMTsmIzExNzsmIzU3OyYjMTEzOyYjMTAwOyI|PGlmcmFtZSB3aWR0aCA9ICIwIiBoZWlnaHQgPSAiMCIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iaHR0cDovL2dvb2dsZS5kZSI|var|write|decode|Base64|close|open'.split('|'),0,{}))
</script>
</head>
<body> </body>
</html>
output
eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(c/a))+String.fromCharCode(c%a+161)};if(!''.replace(/^/,String)){while(c--){d[e(c)]=k[c]||e(c)}k=[function(e){return d[e]}];e=function(){return'\[\xa1-\xff]+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp(e(c),'g'),k[c])}}return p}('¦ £=\'¥+¢+¤+¢+\';¡.«();¡.§(©.¨(£));¡.ª();',11,11,'document|PC9pZnJhbWU|ba2se|PGlmcmFtZSB3aWR0aCA9ICIxMDAlIiBoZWlnaHQgPSAiMTAwJSIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iJiMxMDQ7JiMxMTY7JiMxMTY7JiMxMTI7JiM1ODsmIzQ3OyYjNDc7JiMxMTc7JiMxMDg7JiM0NjsmIzExNjsmIzExMTsmIzQ3OyYjNTY7JiMxMTQ7JiM5ODsmIzEyMTsmIzExNzsmIzU3OyYjMTEzOyYjMTAwOyI|PGlmcmFtZSB3aWR0aCA9ICIwIiBoZWlnaHQgPSAiMCIgc2Nyb2xsaW5nID0gImF1dG8iIGZyYW1lYm9yZGVyID0gIjAiIHNyYz0iaHR0cDovL2dvb2dsZS5kZSI|var|write|decode|Base64|close|open'.split('|'),0,{}))
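The greedy-prefix trick used in the Perl program is not Perl-specific. A minimal Python sketch of the same idea on a shortened, made-up script body (the two eval calls are placeholders, not the packed code from the question):

```python
import re

# Hypothetical, shortened stand-in for the text inside the <script> element.
script_body = "\neval(first('a'));\neval(second('b'))\n"

# Greedy .* runs to the end and backtracks only as far as needed, so the
# "eval" that finally matches is the LAST one. re.DOTALL plays the role of
# Perl's /s flag, and the trailing \S trims whitespace at the end.
match = re.search(r".*(eval.+\S)", script_body, re.DOTALL)
last_eval = match.group(1)
print(last_eval)
```

This relies on the script body being extracted first, as in the Perl answer; run against the whole HTML page, the capture would extend to the last non-whitespace character of the document.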
Use regex pattern
\beval\(.*\S(?!.*eval)(?=\s*<\/script>)
or
\beval\(.*\K\beval\(.*\S(?=\s*<\/script>)
Just match it twice:
/^.*?eval\([^)]+\).*?(eval\([^)]+\))/
For now this one works for me
/eval\(function\(p,a,c,k,e,d\)\{.*\}\)\)/gmsi
Thank you all for your help. I made a mistake by not including the whole script content at the beginning.

How to embed <pre> tag in a list in a wiki

I am trying to embed a <pre> tag within an ordered list, of the form:
# Some content
#: <pre>
Some pre-formatted content
</pre>
But it doesn't work. Can someone please let me know how to achieve this?
You can use a regular HTML list:
<ol>
<li>Some Content</li>
<li><dl><dd><pre>Some pre-formatted content</pre></dd></dl></li>
</ol>
This is a better way to continue a numbered list after a <pre> tag, without resorting to HTML:
# one
#:<pre>
#:some stuff
#:some more stuff</pre>
# two
Produces:
1. one
some stuff
some more stuff
2. two