C++ Screen Scraping from HTML - c++

I'm trying to extract the text "Lady Gaga Fame Monster" from the HTML below using substr and find, but I wasn't able to retrieve it.
<div class="album-name"><strong>Album</strong> > Lady Gaga Fame Monster</div>
I tried to extract the whole string first, but cout << line_found only prints up to "Album"; the whitespace after it seems to stop the extraction from going any further.
When I try cout << extract_line, I see no spaces in the extracted HTML.
I followed the tutorial at http://www.cplusplus.com/reference/string/string/substr/, and it works there, even with spaces. I'm following it closely, but my code stops extracting once it hits a space. Any help would be really appreciated; I've spent two days on this without finding a solution.
Here's the source code:
#include "parser.h"
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
using namespace std;
int main() {
string line_found, extract_line, result, finalResult="";
int firstPosition, secondPosition, input, location;
ifstream sourceFile ("cd1.htm"); // extracts from sourcefile
while(!sourceFile.eof())
{
sourceFile >> extract_line;
location = extract_line.find("album-name");
// cout << extract_line;
if (location >=0)
{
line_found = extract_line.substr(location);
cout << line_found << endl;
firstPosition= line_found.find_first_of(">");
result = line_found.substr(firstPosition);
}
}
return 0;
}

The >> operator doesn't fetch lines; it fetches whitespace-separated tokens. Use std::getline instead.
Better still, don't use string searching tools to parse HTML. It's a disaster waiting to happen. In fact, it's happening to you right now. Note that there is more than one instance of > in your line, so you will probably find the wrong one and get yourself into a complete muddle trying to skip all the ones that don't matter. You could try looking for " > ", but what if you encounter this: ...class="album-name" > <strong>..., which is perfectly valid HTML?
If the HTML is proper XHTML, use an XML parser instead. Expat, for instance, is small, fast and (relatively) simple to use. You can find a nice, easy intro here.
If the HTML is messy, you're going to struggle with C++. There's a related SO question here. Alternatively, use a language with a good HTML library such as Python (Beautiful Soup), which you can call from C++.

Another lightweight and simple option could be to use a regex. VS2010 and VS2008 (SP1, IIRC) come with the <regex> header, which should allow much more control and flexibility than your approach.
It wouldn't be as robust as Marcelo's approach, but it would be quicker to get started with.

Related

Enable the writing of ANSI escape codes on a file?

I am struggling with a problem. I searched the web and Stack Overflow and found similar questions, but none of them gave me the answer I am looking for.
I am on a Linux system (Ubuntu) and basically want to know how to write an ANSI escape code in an output file. For example, if I want to write a red string on the terminal I do:
cout << "\033[31m" << "Red string";
and it works. But if I want to write it to a .rtf file, for example:
#include <iostream>
#include <fstream>
using namespace std;
int main() {
ofstream os( "this_file.rtf" );
os << "\033[31m" << "Red string";
os.close();
return 0;
}
it doesn't work, and the file ends up containing something like:
#[31mRed string
Is there a way to enable the writing of an ANSI escape code to an output file like that one? Thanks.
After all your answers and weeks of practice, the solution turns out to be fairly obvious: whether ANSI escape sequences have any effect when redirected to a file depends on the kind of file you are writing. You have to translate the ANSI sequences yourself into whatever the target file format uses to express the same thing.

Trouble with special characters

First of all, I'm a newbie in programming, so you might have to be patient with me. I'm writing a program that takes some input and writes some information, together with that input, to a .doc file.
My problem is that some constant strings come out garbled when they contain special characters like é í ó ã õ º ª.
I was able to fix that by adding setlocale(LC_ALL, "portuguese"), but that broke the output of my input strings (i.e. the variable strings), which no longer print special characters. Any clues on how I can solve this? I've already tried wstrings and looked everywhere, but couldn't find a solution.
I can show my code here if it helps.
Here is an example of my problem:
#include <iostream>
#include <string>
using namespace std;
int main()
{
string a;
wcout << "Enter special characters like éíó: ";
getline (cin, a);
cout << a;
}
I can't make the constant string and the variable string output correctly in the console at the same time.
You are probably using Windows. The Command Prompt's default encoding there is typically an OEM code page such as CP850, which is rarely used anywhere else and displays most special symbols differently from what you see in your favorite text editor. You can try the Windows API calls SetConsoleOutputCP(1252); and SetConsoleCP(1252); to switch to CP1252, an encoding that is somewhat more compatible and should display those symbols the same way you see them in the editor. You will need #include <windows.h>, if it's available.

Extract value preceded by a certain word in c++ output

If you have a file containing text of a known structure, how would you extract a value preceded by a certain identifying word? Specifically, how do you extract the value from the piece of text below?
CDM-nucleon micrOMEGAs amplitudes:
proton: SI -3.443E-10
Here is how far I got with the script:
#include <string>
#include <fstream>
#include <iostream>
using namespace std;
int main()
{
string identifier;
double value;
ifstream file("output.txt");
// Commands to extract value
file.close();
return 0;
}
Thank you very much.
It depends on the complexity and size of the output file. By 'complexity' I mean, for example, whether the identifying word can appear without a number after it, and how many different keywords you want to parse.
For a big file it may be unreasonable to load it all into memory, so you may need to process it as a stream.
In your particular case, you may want to look into this:
http://www.cplusplus.com/reference/regex/
or one of the parsers suitable for streamed data:
http://en.wikipedia.org/wiki/Flex_lexical_analyser
You might also look into other languages, such as Perl, to parse that.

Minify HTML with Boost regex in C++

Question
How to minify HTML using C++?
Resources
An external library could be the answer, but I'm looking more for improvements to my current code, although I'm all ears for other possibilities.
Current code
This is my interpretation in c++ of the following answer.
The only part I had to change from the original post is this part on top: "(?ix)"
...and a few escape signs
#include <string>
#include <boost/regex.hpp>
void minifyhtml(std::string* s) {
boost::regex nowhitespace(
"(?ix)"
"(?>" // Match all whitespans other than single space.
"[^\\S ]\\s*" // Either one [\t\r\n\f\v] and zero or more ws,
"| \\s{2,}" // or two or more consecutive-any-whitespace.
")" // Note: The remaining regex consumes no text at all...
"(?=" // Ensure we are not in a blacklist tag.
"[^<]*+" // Either zero or more non-"<" {normal*}
"(?:" // Begin {(special normal*)*} construct
"<" // or a < starting a non-blacklist tag.
"(?!/?(?:textarea|pre|script)\\b)"
"[^<]*+" // more non-"<" {normal*}
")*+" // Finish "unrolling-the-loop"
"(?:" // Begin alternation group.
"<" // Either a blacklist start tag.
"(?>textarea|pre|script)\\b"
"| \\z" // or end of file.
")" // End alternation group.
")" // If we made it here, we are not in a blacklist tag.
);
// #todo Don't remove conditional html comments
boost::regex nocomments("<!--(.*)-->");
*s = boost::regex_replace(*s, nowhitespace, " ");
*s = boost::regex_replace(*s, nocomments, "");
}
Only the first regex is from the original post; the other one is something I'm working on and should be considered far from complete. It should hopefully give a good idea of what I'm trying to accomplish, though.
Regexps are a powerful tool, but I think using them in this case is a bad idea. For example, the regexp you provided is a maintenance nightmare: looking at it, you can't quickly understand what it is supposed to match.
You need an HTML parser that tokenizes the input file and lets you access the tokens either as a stream or as an object tree. Basically, read the tokens, discard the tokens and attributes you don't need, then write what remains to the output. Using something like this would let you develop a solution faster than trying to tackle it with regexps.
I think you might be able to use an XML parser, or you could search for an XML parser with HTML support.
In C++, there are libxml (which might have an HTML support module), Qt 4 and tinyxml; libstrophe also uses some kind of XML parser that could work.
Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike Python, its "Beautiful Soup" module would work very well for this kind of problem.
Qt 4 might work because it provides a decent Unicode string type (and you'll need one if you're going to parse HTML).

replace string through regex using boost C++

I have a string in which tags like this occur (there are multiple such tags):
|{{nts|-2605.2348}}
I want to use boost regex to remove |{{nts| and }}, replacing the whole tag above with
-2605.2348
in the original string.
To make it more clear:
Suppose string is:
number is |{{nts|-2605.2348}}
I want string as:
number is -2605.2348
I am quite new to boost regex and have read many things online, but I haven't been able to find an answer to this. Any help would be appreciated.
It really depends on how specific you want to be. Do you want to always remove exactly |{{nts|, or do you want to remove a pipe, followed by {{, followed by any number of letters, followed by a pipe? Or do you want to remove everything that isn't whitespace between the last space and the first part of the number?
One of the many ways to do this would be something like:
#include <iostream>
#include <boost/regex.hpp>
int main()
{
std::string str = "number is |{{nts|-2605.2348}}";
boost::regex re("\\|[^-\\d.]*(-?[\\d.]*)\\}\\}");
std::cout << regex_replace(str, re, "$1") << '\n';
}
online demo: http://liveworkspace.org/code/2B290X
However, since you're using boost, consider the much simpler and faster parsers generated by boost.spirit.