Extract specific portion of HTML file using c++/boost::regex

I have thousands of HTML files and, for the ultimate purpose of running a word-frequency counter, I am only interested in a particular portion of each file. For example, suppose the following is part of one of the files:
<!-- Lots of HTML code up here -->
<div class="preview_content clearfix module_panel">
<div class="textelement "><div><div><p><em>"Portion of interest"</em></p></div>
</div>
<!-- Lots of HTML code down here -->
How should I go about using regular expressions in c++ (boost::regex) to extract that particular portion of text highlighted in the example and put that into a separate string?
I currently have some code that opens the HTML file and reads the entire content into a single string, but when I run boost::regex_match looking for that particular line beginning, <div class="preview_content clearfix module_panel">, I don't get any matches. I'm open to any suggestions as long as they are in C++.

How should I go about using regular expressions in c++ (boost::regex) to extract that particular portion of text highlighted in the example and put that into a separate string?
You don't.
Never use regular expressions to process HTML. Whether in C++ with Boost.Regex, in Perl, Python, JavaScript, anything and anywhere. HTML is not a regular language; therefore, it cannot be processed in any meaningful way via regular expressions. Oh, in extremely limited cases, you might be able to get it to extract some particular information. But once those cases change, you'll find yourself unable to get done what you need to get done.
I would suggest using an actual HTML parser, like libxml2 (which can read HTML 4). Using regexes to parse HTML is simply using the wrong tool for the job.

Since all I needed was something quite simple (as per question above), I was able to get it done without using regex or any type of parsing. Following is the code snippet that did the trick:
// Requires <fstream>, <iterator> and <string>
// Read HTML file into string variable str
std::ifstream t("/path/inputFile.html");
std::string str((std::istreambuf_iterator<char>(t)), std::istreambuf_iterator<char>());
// Find the two "flags" that enclose the content I'm trying to extract
size_t pos1 = str.find("<div class=\"preview_content clearfix module_panel\">");
// Search for the closing flag only after the opening one
size_t pos2 = (pos1 == std::string::npos) ? pos1 : str.find("</em></p></div>", pos1);
// Get that content and store it in a new string (left empty if a flag is missing)
std::string buf;
if (pos2 != std::string::npos)
    buf = str.substr(pos1, pos2 - pos1);
Thank you for pointing out the fact that I was totally on the wrong track.
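A minimal sketch of the same find/substr idea packaged as a function, in case only the text between the two markers is wanted. The name extractBetween and its signature are hypothetical, not part of the original snippet; it skips past the opening marker and reports failure when either marker is missing.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original answer): copies the text found
// between `open` and the next `close` into `out`, excluding both markers.
// Returns false, leaving `out` untouched, if either marker is missing.
bool extractBetween(const std::string& text, const std::string& open,
                    const std::string& close, std::string& out) {
    const std::size_t openPos = text.find(open);
    if (openPos == std::string::npos) return false;

    // Start right after the opening marker so it is not part of the result.
    const std::size_t start = openPos + open.size();
    const std::size_t end = text.find(close, start);
    if (end == std::string::npos) return false;

    out = text.substr(start, end - start);
    return true;
}
```

With the markers from the question, the result would still contain the inner tags (the nested div, p and em elements), so a word counter would have to strip those as well.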

Related

Qt adjust value relatively using regular expression

I have some HTML (excerpt)
<span style="font-size:30pt;">HELLO</span>
I have this HTML stored in a QString.
I wish to reduce the font size by a factor. For simplicity's sake let's halve it, so I need it changed to
<span style="font-size:15pt;">HELLO</span>
Can I apply QString::replace() here somehow?
The only examples I have seen replace a match with a fixed value, but mine needs to read the current value, apply simple math to it, then write it back.
FWIW, I have this expression worked out:
<span.+font-size:(\d+)pt;.*?>
I don't think it really matters that this question places it in context of HTML.
I suppose it could apply to any string.

Use QRegularExpression.
Calling QRegularExpression::match() returns a QRegularExpressionMatch, which gives you access to the captured strings.
Then you will have to parse the string to a number, do the maths and rebuild the final string.
Note that Qt provides a useful example program for QRegularExpression: https://doc.qt.io/qt-5/qtwidgets-tools-regularexpression-example.html
If you are using Qt Creator you can open it from the Welcome page.
Also note that using regular expressions to parse HTML does not work well.
If you are sure that you will only get simple HTML it can work, but if the HTML comes from untrusted sources, you could end up with a mix of HTML, CSS and JavaScript that won't match your regular expression.
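The steps described in this answer (match, read the captured number, do the maths, rebuild the string) can be sketched as follows. This uses std::regex as a stand-in for QRegularExpression so that it compiles without Qt; that substitution, the helper name scaleFontSizes, and the simplified pattern are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: multiplies every "font-size:<N>pt" value in `html`
// by `factor`, rounding to the nearest whole point.
std::string scaleFontSizes(const std::string& html, double factor) {
    static const std::regex sizeRe(R"(font-size:(\d+)pt)");
    std::string out;
    std::size_t last = 0;
    for (std::sregex_iterator it(html.begin(), html.end(), sizeRe), end; it != end; ++it) {
        const auto start = static_cast<std::size_t>(it->position(0));
        // Copy the untouched text that precedes this match.
        out.append(html, last, start - last);
        // Capture group 1 holds the digits; parse, scale, and write back.
        const int oldSize = std::stoi(it->str(1));
        const int newSize = static_cast<int>(oldSize * factor + 0.5);
        out += "font-size:" + std::to_string(newSize) + "pt";
        last = start + static_cast<std::size_t>(it->length(0));
    }
    // Copy whatever follows the last match.
    out.append(html, last, std::string::npos);
    return out;
}
```

In Qt the same loop would presumably be driven by the match object's captured text and offsets instead of std::sregex_iterator.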

Parsing links with regex

I can't figure out how to write a regular expression correctly. Given some loaded text, the part that interests me is the links that end with .m3u or .m3u8. For example, suppose I give my program this input:
Input - player = new Player({"player-id":"1","autoplay":"false","fullscreen":"false","debug":"true","content-volume":"85","ad-volume":"30","ad-load-timeout":"15000","div-id":"videoPlayer","default-quality-index":0,"title":"\u0428\u043f\u0438\u043e\u043d, \u043a\u043e\u0442\u043e\u0440\u044b\u0439 \u043c\u0435\u043d\u044f \u043a\u0438\u043d\u0443\u043b ","poster":"https://test/four/v1/video-file1/00/00/00/00/00/00/00/10/22/11/102211-480p.mp4/thumb-33000.jpg","content":{"mp4":[],"dash":"https://test/four/v1/video-file1/00/00/00/00/00/00/00/10/22/11/102211-,480,p.mp4.urlset/manifest.mpd","hls":"https://test/four/v1/video-file1/00/00/00/00/00/00/00/10/22/11/102211-,480,p.mp4.urlset/master.m3u8"},"about":"false","key":"4eeeb77181526bedc1025586d43a70fa","btn-play-pause":"true","btn-stop":"true","btn-fullscreen":"true","btn-prev-next":"false","btn-share":"true","btn-vk-share":"true","btn-twitter-share":"true","btn-facebook-share":"true","btn-google-share":"true","btn-linkedin-share":"true","quality":"true","volume":"true","timer":"true","timeline":"true","iframe-version":"true","max-hls-buffer-size":"10","time-from-cookie":"true","set-prerolls":["https://test/j/v.php?id=645"],"max-prerolls-impressions":1});
Using a regex, the output should be:
https://test/four/v1/video-file1/00/00/00/00/00/00/00/10/22/11/102211-,480,p.mp4.urlset/master.m3u8
I have tried writing a regular expression, but it matches all links and not just the ones I need. I only need the links that end with a specific extension.
Thank you in advance for your answer.

I don't see why there are so many downvotes; maybe the question looked totally different originally.
Using regex only, my solution in ASP.NET would be to reverse the text first, then look for everything between "u3m" and the next occurrence of "ptth".
Play with it: http://refiddle.com/nwvu
Regex for m3u8 OR m3u:
(8u3m.+?ptth)|(u3m.+?ptth)
ASP String reversal (from https://forums.asp.net/t/1841367.aspx?Reverse+String+in+asp+net):
string input = TextBox1.Text;
char[] inputarray = input.ToCharArray();
Array.Reverse(inputarray);
string output = new string(inputarray);
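For comparison, here is a rough C++ sketch of the same reverse-then-match idea using std::regex. The helper name findPlaylistUrls is hypothetical, and the pattern is slightly tightened (it escapes the dot and folds the two alternatives into "8?"), which is my own variation on the regex above. It can still produce false positives, for example when ".m3u" appears in the middle of a URL.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every URL-like run in `text` that starts with
// "http" and ends in ".m3u" or ".m3u8".
std::vector<std::string> findPlaylistUrls(const std::string& text) {
    // Work on the reversed text so the wanted suffix becomes a prefix.
    const std::string reversed(text.rbegin(), text.rend());
    // "8?u3m\." is ".m3u" / ".m3u8" backwards; ".+?ptth" lazily extends the
    // match to the nearest "http" (also backwards).
    static const std::regex re(R"(8?u3m\..+?ptth)");
    std::vector<std::string> urls;
    for (std::sregex_iterator it(reversed.begin(), reversed.end(), re), end; it != end; ++it) {
        const std::string hit = it->str(0);
        // Reverse each hit again to recover the original URL.
        urls.emplace_back(hit.rbegin(), hit.rend());
    }
    return urls;
}
```

Because the search runs over the reversed text, multiple matches come back in reverse order of appearance.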

Parse a string for open and close tags

Let's say I have the following strings:
"This [color=RGB]is[\color] a string."
"This [color=RGB][bold]is[\bold][\color] another string."
What I'm looking for is a good way to parse the string in order to extract the tag information and then reconstruct the original string without tags.
The tag information will be used during text rendering.
Obviously I can achieve the goal by working directly with strings (find/substr/replace and so on), but I'm asking whether there is a cleaner way, for example using regular expressions.
Note:
I need very few tags, but they can be nested (only tags of different types).
Can't use Boost.

There's a very simple answer that might work, depending on the complexity of your strings (and assuming I understand you correctly, i.e. you just want the cleaned-up strings, not the tags themselves). Just remove all tags. Replace
\[.*?]
with nothing.
Now, if your strings can legitimately contain tag-like text, this might not work.
Regards
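A small sketch of that replacement using std::regex from the standard library (so no Boost), extended to also record each tag and where it sat in the cleaned string, since the question mentions needing the tag information for rendering. The Tag and Parsed types and the parseTags name are hypothetical, and the closing bracket is escaped here as a precaution.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct Tag {
    std::string text;    // tag content without the brackets, e.g. "color=RGB"
    std::size_t offset;  // position in the cleaned string where the tag stood
};

struct Parsed {
    std::string clean;   // input with all [..] tags removed
    std::vector<Tag> tags;
};

Parsed parseTags(const std::string& input) {
    static const std::regex tagRe(R"(\[(.*?)\])");
    Parsed out;
    std::size_t last = 0;
    for (std::sregex_iterator it(input.begin(), input.end(), tagRe), end; it != end; ++it) {
        const auto start = static_cast<std::size_t>(it->position(0));
        // Keep the plain text before the tag, then note the tag's position.
        out.clean.append(input, last, start - last);
        out.tags.push_back({it->str(1), out.clean.size()});
        last = start + static_cast<std::size_t>(it->length(0));
    }
    out.clean.append(input, last, std::string::npos);
    return out;
}
```

Pairing opening and closing tags (for example with a stack) is left out; the sketch only collects them in order.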

Classic ASP: encoding text outside tags with regex

I need a function that can process a string and replace its textual parts with HTML-encoded ones:
Sample 1
Input: "<span>Total amount:<br>€ 50,00</span>"
Output: "<span>Total amount:<br>&euro; 50,00</span>"
Sample 2
Input: "<span>When threshold > x<br>act as described below:</span>"
Output: "<span>When threshold &gt; x<br>act as described below:</span>"
These are simplified cases of course, and yes, I know I could do that with a series of replacements for each specific character I need to encode, but I'd rather have a function that can recognize and skip HTML tags using a regex and perform a Server.HTMLEncode on the textual part of the input string.
Any help will be highly appreciated.

I'm not sure why you'd want to do this. Why don't you just pass the innerHTML into your parser using JavaScript and have JavaScript create your span tag? Then you can encode the entire thing. I'd be worried that the encoding here won't add any security to your application, if that's what you are trying to do.
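For what it's worth, the regex-based approach the question asks about can be roughly sketched like this. It is written in C++ with std::regex rather than Classic ASP's RegExp object and Server.HTMLEncode, so treat it as an illustration of the idea only. The helper names are hypothetical. It assumes anything shaped like <name ...> or </name> is a tag, encodes only &, <, > and the double quote, does not handle non-ASCII characters such as €, and will double-encode entities that are already present.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: escapes the characters that are significant in HTML text.
std::string encodeText(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;";  break;
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '"': out += "&quot;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Hypothetical helper: copies tags through unchanged and encodes the text between them.
std::string encodeOutsideTags(const std::string& html) {
    // Assumption: a tag is '<', optional '/', a letter, then anything up to '>'.
    static const std::regex tagRe(R"(</?[A-Za-z][^<>]*>)");
    std::string out;
    std::size_t last = 0;
    for (std::sregex_iterator it(html.begin(), html.end(), tagRe), end; it != end; ++it) {
        const auto start = static_cast<std::size_t>(it->position(0));
        out += encodeText(html.substr(last, start - last));  // text before the tag
        out += it->str(0);                                    // the tag itself
        last = start + static_cast<std::size_t>(it->length(0));
    }
    out += encodeText(html.substr(last));  // trailing text after the last tag
    return out;
}
```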

Folder with 1300 png files into html images list

I've got a folder with about 1300 PNG icons. What I need is an HTML file with all of them inside, like:
<img src="path-to-image.png" alt="file name without .png" id="file-name-without-.png" class="icon"/>
It's easy as hell, but with that number of files it's a pure waste of time to do it manually. Do you have any ideas how to automate it?

If you need it just once, then do a "dir" or "ls" and redirect it to a file, then use an editor with macro support like Notepad++ to record the changes you want on a single line, then play the macro back for the remainder of the file. If it's dynamic, use PHP.

I would not use C++ to do this. I would use vi, honestly, because running regular expressions repeatedly is all that is needed for this.
But you can do this in C++. I would start with a plain text file with all the file names generated by dir or ls on the command prompt.
Then write code that takes a line of input and turns it into a line formatted the way you want. Test this and get it working on a single line first.
The RE engine of C++ is probably overkill (and is not all that well supported in compilers), but substr and basic find and replace are all you need. Is there a string library you are familiar with? std::string would do.
To generate the file name without PNG, check the last four characters and see if they exist and are .PNG (if not report an error). Then strip them. To remove dashes, copy characters to a new string but if you are reading a dash write a space. Everything else is just string concatenation.
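The per-line transformation described above might look something like this. The function name makeImgTag is hypothetical, and the sketch assumes a lowercase ".png" extension and that the file name needs no HTML escaping.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Hypothetical helper: turns one file name (one line of the dir/ls listing)
// into the <img> line requested in the question.
std::string makeImgTag(const std::string& fileName) {
    const std::string ext = ".png";
    // Check that the name ends in ".png" and has something before it.
    if (fileName.size() <= ext.size() ||
        fileName.compare(fileName.size() - ext.size(), ext.size(), ext) != 0) {
        throw std::invalid_argument("not a .png file: " + fileName);
    }
    // id: the file name without the extension, dashes kept.
    const std::string id = fileName.substr(0, fileName.size() - ext.size());
    // alt: the same name with dashes turned into spaces.
    std::string alt = id;
    std::replace(alt.begin(), alt.end(), '-', ' ');
    return "<img src=\"" + fileName + "\" alt=\"" + alt +
           "\" id=\"" + id + "\" class=\"icon\"/>";
}
```

Reading the listing with std::getline and writing one makeImgTag result per line would then produce the whole file.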
