RegEx: Grabbing values with or without quotation marks - regex

My Issue:
I am trying to grab Facebook meta values from different sites, but some websites (for example usatoday.com) do not quote their HTML attribute values. As Data Samples 1 and 2 below show, the quotation marks may be missing, so my question is: how can I modify my regex to get the values of both the property and the content attributes?
What I've done:
With the if statement below I am partly working around the quotation mark issue (it is not dynamic enough), but I guess there must be a better way (I really suck at regex).
Secondly, my regex is not able to catch the content value (the URL) in Data Sample 2 for usatoday.com; I guess the missing "" around the URL messes up my regex.
I really need some help here, big thanks!
if(
preg_match( '/<meta(.*?)property="og:title"(.*?)content="(.+?)"(.*?)(\/)?>/', $raw_html, $matching )
// for normal sites
or
preg_match( '/<meta(.*?)property=og:title(.*?)content="(.+?)"(.*?)(\/)?>/', $raw_html, $matching )
// property no quote at all
or
preg_match( '/<meta(.*?)property=og:title(.*?)content=(.+?)(.*?)(\/)?>/', $raw_html, $matching )
// no quote at all
)
Data Sample 1 - no quotation mark on meta text attribute
# usatoday.com
<meta property=og:title content="Lakers trading Russell Westbrook in massive three-team deal with Jazz and Timberwolves"/>
# normal sites
<meta property="og:title" content="Lakers trading Russell Westbrook in massive three-team deal with Jazz and Timberwolves"/>
Data Sample 2 - no quotation mark on meta URL attribute
# usatoday.com
<meta property=og:url content=https://www.usatoday.com/story/sports/nba/2023/02/08/lakers-jazz-timberwolves-trade-russell-westbrook-mike-conley-dangelo-russell/11214855002/ />
# normal sites
<meta property="og:url" content="https://www.usatoday.com/story/sports/nba/2023/02/08/lakers-jazz-timberwolves-trade-russell-westbrook-mike-conley-dangelo-russell/11214855002/" />
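One way to avoid a separate pattern per quoting style is to match the tag first and then read its attributes with a single alternation that accepts a double-quoted, single-quoted, or unquoted value. The sketch below is in Python rather than PHP and is only an illustration: META_TAG, ATTRIBUTE, and meta_properties are made-up names, and the patterns assume that no attribute value contains a > character.

```python
import re

# Hypothetical helper names; a sketch, not a full HTML parser.
META_TAG = re.compile(r"<meta\b([^>]*)>", re.IGNORECASE)

# One attribute: name = "double quoted" | 'single quoted' | unquoted
ATTRIBUTE = re.compile(
    r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)


def meta_properties(html):
    """Map each <meta property=...> value to its content value."""
    found = {}
    for tag in META_TAG.finditer(html):
        attrs = {}
        for match in ATTRIBUTE.finditer(tag.group(1)):
            name = match.group(1).lower()
            # Exactly one of the three value groups took part in the match.
            value = next(g for g in match.groups()[1:] if g is not None)
            attrs[name] = value
        if "property" in attrs and "content" in attrs:
            found[attrs["property"]] = attrs["content"]
    return found


sample = (
    '<meta property=og:title content="Lakers trading Russell Westbrook"/>\n'
    "<meta property=og:url content=https://www.usatoday.com/story/sports/nba/ />\n"
    "<meta content='Quoted the other way' property=\"og:description\">\n"
)
for key, value in meta_properties(sample).items():
    print(key, "=>", value)
```

Because the attributes are collected into a dictionary, the order of property and content inside the tag does not matter. One known gap: an unquoted value followed directly by /> with no space in between keeps the trailing slash.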

Related

How can I add text on meta expr:content?

I have a live streaming site, I am having trouble adding text to
<meta expr:content='data:blog.pageName' property='og:title'/>
I would like to put text before and after the data:blog.pageName data tag, so that it shows up on Facebook as ( Watch "title of the post" Live Free ).
I am trying something like the line below, but it is not working. Please help.
<meta expr:content='Watch + data:blog.pageName + Live Free' property='og:title'/>
Here is my blogger website URL = https://www.nbalink.com/
You will need to enclose the literal text in double quotes ("). The following format will work:
<meta expr:content='"Watch " + data:blog.pageName + " Live Free"' property='og:title'/>

How to use Regular Expression Extractor to get authenticity token with / = + signs?

I need to correctly parse authenticity token in JMeter which has +, / and spaces in it and looks like below…
<meta content="authenticity_token" name="csrf-param" />
<meta content="kJ+AzaV/saCxK+F4Ibh6LeEqH8rpiGZfyRKn3RGX960=" name="csrf-token" />
I have a “Regular Expression Extractor” and Regular Expression looks like..
meta content="([^"]+)" name="csrf-token" />
The problem is that the / gets replaced with %2F and the = at the end gets replaced with %3D, so I end up with:
kJ+AzaV%2FsaCxK+F4Ibh6LeEqH8rpiGZfyRKn3RGX960%3D
How can I parse the authenticity token correctly?
It looks like your attribute has been URI encoded, so you'll need to decode it before attempting to do more
window.decodeURIComponent('kJ+AzaV%2FsaCxK+F4Ibh6LeEqH8rpiGZfyRKn3RGX960%3D');
// "kJ+AzaV/saCxK+F4Ibh6LeEqH8rpiGZfyRKn3RGX960="
Further, using a RegExp to extract data from HTML or XML is not always the best idea, perhaps you could try parsing it and accessing the Nodes and Attributes you want via a DOM Tree.
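For comparison, the same decoding step can be sketched in Python (not part of JMeter; shown only to illustrate what the decoding does to this particular token). The choice of function matters here because the token contains + signs.

```python
from urllib.parse import unquote

encoded = "kJ+AzaV%2FsaCxK+F4Ibh6LeEqH8rpiGZfyRKn3RGX960%3D"

# unquote() only decodes %XX escapes and leaves "+" untouched;
# unquote_plus() would turn each "+" into a space and corrupt the token.
token = unquote(encoded)
print(token)
```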
If you're passing it as a parameter just untick "Encode?" box for it.
If you need to decode this via JavaScript as Paul S. suggests, consider using __javaScript function as follows:
${__javaScript(decodeURIComponent("${YOUR_VARIABLE_NAME_HERE}"),)}
See How to Use JMeter Functions post series for more details.

Coldfusion Regex to convert a URL to lowercase

I am trying to convert the URLs in a block of HTML to ensure they are lowercase.
Some of the links are a mix of uppercase and lowercase, and they need to be converted to lowercase only.
It would be impossible to go round the site and redo every link, so I was looking to use a regex when outputting the text.
<p>Hello world Some link.</p>
Needs to be converted to:
<p>Hello world Some link.</p>
Using a ColdFusion Regex such as below (although this doesn't work):
<cfset content = Rereplace(content,'(http[*])','\L\1','All')>
Any help much appreciated.
I think I would use the lower case function, lCase().
Put your URL into a variable, if it's not already:
<cfset MyVar = "http://www.ThisSite.com">
Force it to lower case here:
<cfset MyVar = lCase(MyVar)>
Or here:
<cfoutput>
Some Link
</cfoutput>
UPDATE: Actually, I see that what you are actually asking is how to generate your entire HTML page (or a big portion) and then go back through it, find all of the links, and then lower their cases. Is that what you are trying to do?
Since you have the HTML stored in a database, there is a bit more work that needs to be done than just using lcase(). I would wrap the functionality into a function that can be easily reused. Check out this code for an example.
content = '<p>Hello world Some link.</p>
<p>Hello world Some link.</p>
<p>Hello world <a href=''http://www.somelink.com/BLARG''>Some link</a>.</p>';
writeDump( content );
writeDump( fixLinks( content ) );
function fixLinks( str ){
var links = REMatch( 'http[^"'']*', str );
for( var link in links ){
str = replace( str, link, lcase( link ), "ALL" );
}
return str;
}
This has only been tested in CF9 & CF10.
Using REMatch() you get an array of matches. You then simply loop over that array and use replace() with lcase() to make the links lowercase.
And...based on Leigh's suggestion, here is a solution in one line of code using REReplace()
REReplace( content, '(http[^"'']*)', '\L\1', 'all' )
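The same idea, a pattern plus a lowercasing replacement, can be sketched outside ColdFusion as well. The Python version below reuses the character class from the REMatch() call and passes a callback to the substitution; fix_links and URL are illustrative names, and the sample string is the quoted link from the example above.

```python
import re

# Same character class as the REMatch() call: "http" up to the next quote.
URL = re.compile(r"""http[^"']*""")


def fix_links(html):
    """Lowercase every substring that starts with http and runs to a quote."""
    return URL.sub(lambda m: m.group(0).lower(), html)


content = "<p>Hello world <a href='http://www.somelink.com/BLARG'>Some link</a>.</p>"
print(fix_links(content))
```

Like the ColdFusion versions, this relies on the URL being terminated by a quote; an "http" that appears in plain text would be lowercased up to the next quote character or the end of the string.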
Use a HTML parser to parse HTML, not regex.
Here's how you can do it with jQuery:
<!doctype html>
<script src="jquery.js"></script>
<cfsavecontent variable="HtmlCode">
<p>Hello world Some link.</p>
</cfsavecontent>
<pre></pre>
<script>
var HtmlCode = "<cfoutput>#JsStringFormat(HtmlCode)#</cfoutput>";
HtmlCode = jQuery('a[href]',HtmlCode).each( lowercaseHref ).end().html();
function lowercaseHref(index,item)
{
var $item = jQuery(item);
// prevent non-links from being changed
// (alternatively, can check for specific domain, etc)
if ( $item.attr('href').startsWith('#') )
return
$item.attr( 'href' , $item.attr('href').toLowerCase() );
}
jQuery('pre').text(HtmlCode);
</script>
This works for href attributes on a tags, but can of course be updated for other things.
It will ignore in-page links like <a href="#SomeId"> but not stuff like <a href="/HOME/#SomeId"> - if that's an issue you'd need to update the function to exclude page fragment part (e.g. split on # then rejoin, or whatever). Same goes if you might have case-sensitive querystrings.
And of course the above is just jQuery because I felt like it - you could also use a server-side HTML parser, like jSoup to achieve this.

Regex for parsing Facebook Open Graph meta tag

I'm trying to pull the og:title attribute from a Bing Local page for a Windows Store app.
There is no HTML parser for WinRT and C++/CX, so I've resorted to using a regex to grab the tag, then an XML parser to pull out relevant attributes.
This is what the tag looks like.
<meta property="og:title" content="Some Location Name"/>
I'm using the following regex to pull out the tag from the HTML, but whenever the content attribute has a space in it, it fails to find a match.
<meta property="og:title" content="[\s\S]*"/>
So, my regex will work for McDonald's, but not for Jack In The Box.
What do I need to do to get the entire title?
This is one of my Open Graph regexes. It matches most tags; it does have problems with specific kinds of content, but those are rare and I'd rather have a more readable regex.
<meta [^>]*property=[\"']og:title[\"'] [^>]*content=[\"']([^'^\"]+?)[\"'][^>]*>
But I do sometimes come across pages where content comes before property, so I also run this:
<meta [^>]*content=[\"']([^'^\"]+?)[\"'] [^>]*property=[\"']og:image[\"'][^>]*>
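The two-pass approach can be sketched in Python as follows. This is an illustration only: og_title is a made-up helper, both patterns target og:title (the second pattern above uses og:image), and the stray ^ inside the character class has been dropped so that only quote characters end the value.

```python
import re

# Attribute order 1: property before content.
PROPERTY_FIRST = re.compile(
    r"""<meta [^>]*property=["']og:title["'] [^>]*content=["']([^'"]+?)["'][^>]*>"""
)
# Attribute order 2: content before property.
CONTENT_FIRST = re.compile(
    r"""<meta [^>]*content=["']([^'"]+?)["'] [^>]*property=["']og:title["'][^>]*>"""
)


def og_title(html):
    """Return the og:title content, trying each attribute order in turn."""
    for pattern in (PROPERTY_FIRST, CONTENT_FIRST):
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None


print(og_title('<meta property="og:title" content="Some Location Name"/>'))
print(og_title('<meta content="Jack In The Box" property="og:title"/>'))
```

A value that contains the other kind of quote, such as an apostrophe inside a double-quoted title, is cut short at that quote, which is one example of the content-specific problems mentioned above.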
You can just add a space to the regex. [ \s\S]*
DISCLAIMER: OpenGraph.io is a commercial product I work on and support.
Unfortunately, any regex you come up with is going to be hit or miss. If you end up needing to do this you can use the API available at http://www.opengraph.io/
One of its major benefits is that it will infer information like the title or description (if you end up needing it) from the content on the page if OpenGraph tags don't exist.
To get information about a site use:
GET https://opengraph.io/api/1.0/site/<URL encoded site URL>
Which will return something like:
{
"hybridGraph": {
"title": "Google",
"description": "Search the world's information...",
"image": "http://google.com/images/srpr/logo9w.png",
"url": "http://google.com",
"type": "site",
"site_name": "Google"
},
"openGraph": {..}
"htmlInferred": {..}
}

I18n special characters in page title

I'm coding a Rails app, and I have a problem with the page title:
In my config/locales/fr.yml I have this:
fr:
  product:
    edit: "Modification de l'objet"
And in my /app/views/products/edit.html.erb I have this: <title><%= t('product.edit') %></title>
And when I render the page, it gives me this: Modification de l&#39;objet.
Do you know what's wrong with it?
I tried adding <meta http-equiv="content-type" content="text/html; charset=UTF-8" /> to the head of my HTML, or this, but it didn't work for me...
You can use <%= raw(I18n.t('product.edit')) %> to avoid this. Be aware though, that the code won't be escaped. When using raw you have to be sure there's no way to inject malicious code in the string.
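A possible way to picture what is happening, assuming the apostrophe is HTML-escaped and the resulting entity is then escaped a second time before it reaches the browser: the short Python sketch below (not Rails code) shows the effect. Python's html.escape writes the apostrophe as &#x27; where Rails writes &#39;, but the mechanism is the same.

```python
import html

title = "Modification de l'objet"

escaped_once = html.escape(title)          # apostrophe becomes an entity
escaped_twice = html.escape(escaped_once)  # the entity's "&" is escaped again

print(escaped_once)
print(escaped_twice)
print(html.unescape(escaped_once) == title)
```

A browser renders the once-escaped string as the original text, while the twice-escaped string shows the entity literally.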
I think I can tell you where the l&#39; is coming from...
Hopefully you can then find a way to fix it.
Open Notepad, hold down the Alt key and type 39, then see what character appears.
You'll notice you get the ' character when you type that number, so after compiling, your code seems to treat l'objet as l and #39.
So I think, as you are pointing out, there is some sort of language issue with how the characters are represented. You might be able to reverse this to solve your problem.
Sorry, this is all I have.