Searching HTML Files using Regex - c++

I have a pool of HTML files and want to search through them for some targeted text. The search should cover only their text content, ignoring all HTML tags, headers, scripts, etc.
I tried QRegExp, the regex class in Qt, but could not find a good pattern to do what I'm after.
I’d appreciate any help in this regard.
Thank you.

This may or may not be a good answer for you, but have you considered using a DOM-parser instead? That will eliminate the need to filter out what is text and what is HTML markup. Sadly I can't recommend a good one for C++ though.
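To illustrate the parser-first idea from the answer above, here is a minimal sketch in Python rather than C++/Qt, using only the standard library's `html.parser`. The class name `TextExtractor`, the helpers `visible_text` and `contains`, and the choice of which elements to skip are assumptions made for this example, not something from the original thread.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data that sits outside skipped elements."""

    # Elements whose contents should never be searched (assumed list).
    SKIPPED = {"script", "style", "head"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the text content of html with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def contains(html, pattern):
    """Run a regex against the extracted text only, never the markup."""
    return re.search(pattern, visible_text(html)) is not None
```

The regex is applied after the markup has been discarded, so tag names, attribute values, and script bodies cannot produce false matches. One trade-off in this sketch: text chunks are joined with a space, so a word split by an inline tag would be searched as two words.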

Related

Regex for an anchor tag which contains a specific inline style

Unfortunately my blog was hacked and 1000+ posts have been infected with links to spam sites. As part of the cleaning process I'm trying to use a regex to find and replace the bad links in an XML file in Sublime Text.
The only consistency I can see is all the bad links contain an inline style changing the text colour to #676c6c, so I'm trying but failing to create a regular expression that can highlight all anchor tags containing this hex value - #676c6c
<a[\s]+([^>]+)>((?:.(?!\<\/a\>))*.)</a>
So far I've got this, which I believe highlights all anchor tags, can anyone help expand this to include anchors containing #676c6c between the first angled brackets? Here's an example of one of the bad links
spam keyword
I appreciate any help! Cheers.
Maybe you could use <a[^>]+#676c6c[^>]+>[^<]*<\/a>.
If the anchor tag may contain other tags, use (?s)<a[^>]+#676c6c[^>]+>.*?<\/a> instead.
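As a rough illustration of the second pattern, the following Python sketch removes every anchor whose opening tag mentions `#676c6c`. The function name `strip_bad_links` and the sample markup are made up for this example, and it assumes the anchors appear as literal `<a ...>` tags in the export rather than entity-escaped.

```python
import re

# [^>]* cannot cross a '>', so the colour must appear inside the opening tag.
# DOTALL lets '.' span newlines, which plays the role of the (?s) flag.
BAD_LINK = re.compile(r"<a\b[^>]*#676c6c[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)


def strip_bad_links(markup):
    """Delete anchors whose opening tag contains the spam colour."""
    return BAD_LINK.sub("", markup)
```

This version uses `[^>]*` instead of `[^>]+` so that the colour may sit at the very start or end of the attribute list. Replacing with `r"\1"` around a captured group instead of `""` would keep the link text and drop only the tags.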

Get value between <b> tag using regex in Yahoo Pipes

I have searched up and down trying to find an answer that will work for me but haven't been able to figure this out. I'm using Yahoo Pipes for this.
Lake Harmony Estates <b>Sleeps: 16</b>
What I need to do is extract the Sleeps: 16 out from the B tag and output just that value and nothing else. I don't suspect this is very hard to do, but given my limited regex knowledge it's giving me troubles. I've tried adapting regex code pertaining to other tags, but just can't seem to get this one to work.
Any help on this would be appreciated. Thanks.
Edit:
Here is my pipe if you wanted to take a look at the regex horrible-ness I've created. The one I'm trying to work though is the item.sleeps, last entry in the 2nd regex
http://pipes.yahoo.com/pipes/pipe.info?_id=567026d850223b0075d80fd3c9bf7e75
This should fit your needs, assuming the HTML isn't laden with quotes and such. Note that the + means that empty <b> tags are ignored. Also, HTML is not truly parsable via regex, so this will only work for basic tags. It should work even if the tag has an id or a class attribute, but there are certainly ways to break this regex.
/<b[^>]*>([^<]+)<\/b>/
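For readers who want to try the pattern outside Yahoo Pipes, here is a small Python sketch of the same expression. The helper name `first_bold_text` is hypothetical; the sample string is the one from the question.

```python
import re

# Same pattern as above: an opening <b ...>, one or more non-'<' characters, </b>.
B_TAG = re.compile(r"<b[^>]*>([^<]+)</b>")


def first_bold_text(text):
    """Return the contents of the first non-empty <b> element, or None."""
    match = B_TAG.search(text)
    return match.group(1) if match else None


print(first_bold_text("Lake Harmony Estates <b>Sleeps: 16</b>"))
```

Because `<b[^>]*>` also accepts tags that merely start with "b", such as `<br>`, adding a word boundary (`<b\b[^>]*>`) is a reasonable tightening if the input may contain them.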
I posted this question to Twitter and got a response back that worked for me.
(?s)^.*<b>(.*?)</b>.*
Replace with $1 and have G flag checked.
This solution did everything I needed. I had additional data that I had already excluded in my example that became unnecessary with this regex.

Need to create a gmail like search syntax; maybe using regular expressions?

I need to enhance the search functionality on a page listing user accounts. Rather than have multiple search boxes for each possible field, or a drop down menu where the user can only search against one field, I'd like a single search box and to use a gmail like syntax. That's the best way I can describe it, and what I mean by a gmail like search syntax is being able to type the following into the input box:
username:bbaggins type:admin "made up plc"
When the form is submitted, the search string should be split into its separate parts, which will allow me to construct a SQL query. So for example, type:admin would form part of the WHERE clause so that it would find any record where the field type is equal to admin, and the same for username. The text in quotes may be a free text search, but I'm not sure on that yet.
I'm thinking that a regular expression or two would be the best way to do this, but that's something I'm really not good at. Can anyone help to construct a regular expression which could be used for this purpose? I've searched around for some pointers but either I don't know what to search for or it's not out there as I couldn't find anything obvious. Maybe if I understood regular expressions better it would be easier :-)
Cheers,
Adam
No, you would not use regular expressions for this. Just split the string on spaces in whatever language you're using.
You don't necessarily have to use a regex. Regexes are powerful, but in many cases also slow. Regex also does not handle nested parameters very well. It would be easier for you to write a script that uses string manipulation to split the string and extract the keywords and the field names.
If you want to experiment with Regex, try the online REGex tester. Find a tutorial and play around, it's fun, and you should quickly be able to produce useful regexes that find any words before or after a : character, or any sentences between " quotation marks.
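The string-manipulation approach suggested in these answers could look something like the sketch below. It is written in Python for illustration, and the function name `parse_query` is an assumption. It relies on the standard library's `shlex.split`, which splits on whitespace but keeps quoted phrases together, so a plain split on spaces does not break "made up plc" apart.

```python
import shlex


def parse_query(query):
    """Split a search string into field:value pairs and free-text terms."""
    fields = {}
    free_text = []
    for token in shlex.split(query):
        name, sep, value = token.partition(":")
        if sep and name and value:
            # Looks like field:value, e.g. type:admin
            fields[name] = value
        else:
            # No colon (or an empty side): treat as free text.
            free_text.append(token)
    return fields, free_text


fields, free_text = parse_query('username:bbaggins type:admin "made up plc"')
print(fields)
print(free_text)
```

When turning the result into SQL, it is safer to check the field names against a fixed list of allowed columns and to pass the values as bound parameters rather than concatenating them into the query. Note that in this sketch a quoted phrase containing a colon would be treated as a field.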
thanks for the answers...I did start doing it without regex and just wondered if a regex would be simpler. Sounds like it wouldn't though, so I'll go back to the way I was doing it and test it again.
Good old Mr Bilbo is my go to guy for any naming needs :-)
Cheers,
Adam

How can I manipulate just part of a Perl string?

I'm trying to write some Perl to convert some HTML-based text over to MediaWiki format and hit the following problem: I want to search and replace within a delimited subsection of some text and wondered if anyone knew of a neat way to do it. My input stream is something like:
Please mail support. if you want some help.
and I want to change Please help and Please can some one help me out here to Please%20help and Please%20can%20some%20one%20help%20me%20out%20here respectively, without changing any of the other spaces on the line.
Naturally, I also need to be able to cope with more than one such link on a line so splicing isn't such a good option.
I've taken a good look round Perl tutorial sites (it's not my first language) but didn't come across anything like this as an example. Can anyone advise an elegant way of doing this?
Your task has two parts. Find and replace the mailto URIs - use a HTML parsing module for that. This topic is covered thoroughly on Stack Overflow.
The other part is to canonicalise the URI. The module URI is suitable for this purpose.
use URI::mailto;
my @hrefs = ('mailto:help@myco.com&Subject=Please help&Body=Please can some one help me out here');
print URI::mailto->new($_)->as_string for @hrefs;
__END__
mailto:help@myco.com&Subject=Please%20help&Body=Please%20can%20some%20one%20help%20me%20out%20here
Why don't you just search for the "Body=" tag up to the closing quote and replace every space with %20?
I would not even use regular expressions for that, since I don't find them useful for anything except mass changes where everything on the line is changed.
A simple loop might be the best solution.
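As a language-neutral illustration of the idea discussed in this thread (encode the spaces inside each mailto link and leave the rest of the line alone), here is a short Python sketch. It assumes the links appear as double-quoted `href="mailto:..."` attributes; the function name `encode_mailto_links` and the sample input are made up for the example. For real HTML, a parser remains the more robust way to locate the attributes.

```python
import re
from urllib.parse import quote

# Matches only double-quoted mailto hrefs; the group captures the URI itself.
MAILTO_HREF = re.compile(r'href="(mailto:[^"]*)"')


def encode_mailto_links(line):
    """Percent-encode each mailto URI on the line, leaving other text as-is."""
    def fix(match):
        # Keep URI punctuation and existing %XX escapes; spaces become %20.
        return 'href="' + quote(match.group(1), safe=":@&=?%") + '"'

    return MAILTO_HREF.sub(fix, line)
```

Because the substitution runs once per match, any number of links on the same line is handled without manual splicing.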

A Regex builder for CSS queries

I've got a problem I need solved using regular expressions; it involves taking a CSS selector and compiling a regex that matches the string representation of the nodes inside an HTML document. The point is to avoid parsing the HTML as XML and then making either XPath or DOM queries to apply style attributes.
Does anyone know of a project that already implements something like this in any language? The target platform would be .NET 3.5.
Html Agility Pack
Regular expressions seem like an amazingly bad way of matching those nodes. I'm not sure I follow your problem - why not just use something like jQuery to pick out those nodes? E.g. given a CSS selector 'div>span.red:first-child',
$('div>span.red:first-child')
would return an array of those matching nodes.
EDIT: Oh, wait - are you trying to do this 'offline', as it were - not in a user's browser? Yeah, ignore my advice. (Even so, I'd still suggest that regular expressions aren't going to help you. Why are you against generating an xml-document representation of the page?)