How to create a html tree? - c++

I need to dig into some HTML files, and I first want to transform them into a readable tree, one tag per line. However, I have no experience with HTML. Could someone correct my code and point out the rules I've forgotten?
My code does not work for real-life pages. At the end of execution the nesting counter should be 0, since the program should have left every nested tag it entered. It is not: for a Facebook page, more than 2000 tags remain open.
Before anyone suggests a library: I haven't found a good one. Converting my pages to XML somehow fails, and the htmlcxx library has no proper documentation.
#include <cstdio>
char get_char( FILE *stream ) {
char c;
do
c = getc(stream);
while ( c == ' ' || c == '\n' || c == '\t' || c == '\r' );
return c;
}
void fun( FILE *stream, FILE *out ) {
int counter = -1;
char c;
do {
c = get_char(stream);
if ( c == EOF )
break;
if ( c != '<' ) { // print text
for ( int i = counter + 1; i; --i )
putc( ' ', out );
fprintf( out, "TEXT: " );
do {
if ( c == '\n' )
fprintf( out, "<BR>" ); // random separator
else
putc( c, out );
c = getc( stream );
} while ( c != '<' );
putc( '\n', out );
}
c = getc( stream );
if ( c != '/' ) { // nest deeper
++counter;
for ( int i = counter; i; --i )
putc( ' ', out );
} else { // go back in nesting
--counter;
// maybe here should be some exception handling
do // assuming there's no strings in quotation marks here
c = getc( stream );
while ( c != '>' );
continue;
}
ungetc( c, stream );
do { // reading tag
c = getc(stream);
if( c == '/' ) { // checking if it's not a <blahblah/>
c = getc(stream);
if ( c == '>' ) {
--counter;
break;
}
putc( '/', out );
putc( c, out );
} else if ( c == '"' ) { // not parsing strings put in quotation marks
do {
putc( c, out ); c = getc( stream );
if ( c == '\\' ) {
putc( c, out ); c = getc( stream );
if ( c == '"' ) {
putc( c, out ); c = getc( stream );
}
}
} while ( c != '"' );
putc( c, out );
} else if ( c == '>' ) { // end of tag
break;
} else // standard procedure
putc( c, out );
} while ( true );
putc( '\n', out );
} while (true);
fprintf( out, "Counter: %d", counter );
}
int main() {
const char *name = "rfb.html";
const char *oname = "out.txt";
FILE *file = fopen(name, "r");
FILE *out = fopen(oname, "w");
fun( file, out );
return 0;
}

HTML != XML
Tags could be non-closed, for example <img ...> is considered equal to <img ... />
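To make this concrete, here is a minimal sketch of how a nesting counter could treat tags that never get a closing tag. The helper names `is_void_element` and `depth_change` are hypothetical, and the tag list is a set of common HTML void elements rather than a guaranteed complete one. It also does nothing about optional end tags such as a missing `</p>` or `</li>`, which real pages use freely.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper: true if the tag never takes a closing tag in HTML.
bool is_void_element(std::string name) {
    // Tag names are case-insensitive in HTML, so normalise to lower case first.
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
    static const std::set<std::string> void_tags = {
        "area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"};
    return void_tags.count(name) != 0;
}

// Hypothetical helper: how much the nesting depth changes for one tag.
// `closing` is true for </name>, `self_closed` is true for <name ... />.
int depth_change(const std::string& name, bool closing, bool self_closed) {
    if (is_void_element(name) || self_closed) return 0;  // opens and closes at once
    return closing ? -1 : +1;
}
```

With this, `<img ...>` and `<img ... />` both leave the depth unchanged, which is the case the counter in the question gets wrong.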

Such an interesting and useful topic, and almost no answers. Really strange...
It's hard to find a good C++ HTML parser! I'll try to point you in the right direction; it may help you move on.
The libcurl page has some sample source code to get you going. It demonstrates traversing the DOM tree. You don't need an XML parser, and it doesn't fail on badly formatted HTML.
http://curl.haxx.se/libcurl/c/htmltidy.html
Another option is htmlcxx. From the website description:
htmlcxx is a simple non-validating css1 and html parser for C++.
You can also try libraries like HTML Tidy - http://tidy.sourceforge.net (free)
If you're using Qt 4.6, you can use QWebElement. A simple example:
frame->setHtml(HTML);
QWebElement document = frame->documentElement();
QWebElementCollection imgs = document.findAll("img");
Here is another example. http://doc.qt.digia.com/4.6/webkit-simpleselector.html

Related

Download files in C++ with URL dont work. (%username% and User) [duplicate]

What's the best way to expand
${MyPath}/filename.txt to /home/user/filename.txt
or
%MyPath%/filename.txt to c:\Documents and settings\user\filename.txt
without traversing the path string looking for environment variables directly?
I see that wxWidgets has a wxExpandEnvVars function. I can't use wxWidgets in this case, so I was hoping to find a boost::filesystem equivalent or similar. I am only using the home directory as an example; I am looking for general-purpose path expansion.
For UNIX (or at least POSIX) systems, have a look at wordexp:
#include <iostream>
#include <wordexp.h>
using namespace std;
int main() {
wordexp_t p;
char** w;
wordexp( "$HOME/bin", &p, 0 );
w = p.we_wordv;
for (size_t i=0; i<p.we_wordc;i++ ) cout << w[i] << endl;
wordfree( &p );
return 0;
}
It seems it will even do glob-like expansions (which may or may not be useful for your particular situation).
On Windows, you can use ExpandEnvironmentStrings. Not sure about a Unix equivalent yet.
If you have the luxury of using C++11, then regular expressions are quite handy. I wrote a version for updating in place and a declarative version.
#include <cstdlib> // getenv
#include <string>
#include <regex>
// Update the input string.
void autoExpandEnvironmentVariables( std::string & text ) {
static std::regex env( "\\$\\{([^}]+)\\}" );
std::smatch match;
while ( std::regex_search( text, match, env ) ) {
const char * s = getenv( match[1].str().c_str() );
const std::string var( s == NULL ? "" : s );
text.replace( match[0].first, match[0].second, var );
}
}
// Leave input alone and return new string.
std::string expandEnvironmentVariables( const std::string & input ) {
std::string text = input;
autoExpandEnvironmentVariables( text );
return text;
}
An advantage of this approach is that it can be adapted easily to cope with syntactic variations and deal with wide strings too. (Compiled and tested using Clang on OS X with the flag -std=c++0x)
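As an illustration of adapting the pattern to a syntactic variation, here is a sketch that accepts both `${VAR}` and `%VAR%`. The function name `expandBothSyntaxes` is made up for this example. Unlike the loop above, it builds the result in a single pass and does not rescan replaced text, so a variable whose value itself contains `${...}` cannot cause endless re-expansion. Unset variables expand to an empty string, as before.

```cpp
#include <cstdlib>
#include <regex>
#include <string>

// Sketch: expand ${VAR} and %VAR% in one left-to-right pass.
std::string expandBothSyntaxes(const std::string& input) {
    // Group 1 captures the name in ${NAME}, group 2 the name in %NAME%.
    static const std::regex env("\\$\\{([^}]+)\\}|%([^%]+)%");
    std::string result;
    std::string::const_iterator last = input.begin();
    const std::sregex_iterator end;
    for (std::sregex_iterator it(input.begin(), input.end(), env); it != end; ++it) {
        const std::smatch& m = *it;
        result.append(last, m[0].first);            // text before the match
        const std::string name = m[1].matched ? m[1].str() : m[2].str();
        const char* value = std::getenv(name.c_str());
        if (value != nullptr) result += value;      // inserted text is not rescanned
        last = m[0].second;
    }
    result.append(last, input.end());               // text after the last match
    return result;
}
```

Whether treating `%VAR%` the same way on every platform is desirable depends on the application; a literal percent sign pair in a path would be consumed by this pattern.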
Simple and portable:
#include <cstdlib>
#include <string>
static std::string expand_environment_variables( const std::string &s ) {
if( s.find( "${" ) == std::string::npos ) return s;
std::string pre = s.substr( 0, s.find( "${" ) );
std::string post = s.substr( s.find( "${" ) + 2 );
if( post.find( '}' ) == std::string::npos ) return s;
std::string variable = post.substr( 0, post.find( '}' ) );
std::string value = "";
post = post.substr( post.find( '}' ) + 1 );
const char *v = getenv( variable.c_str() );
if( v != NULL ) value = std::string( v );
return expand_environment_variables( pre + value + post );
}
expand_environment_variables( "${HOME}/.myconfigfile" ); yields /home/joe/.myconfigfile
As the question is tagged "wxWidgets", you can use wxExpandEnvVars() function used by wxConfig for its environment variable expansion. The function itself is unfortunately not documented but it basically does what you think it should and expands any occurrences of $VAR, $(VAR) or ${VAR} on all platforms and also of %VAR% under Windows only.
Within C/C++, here is what I do to expand environment variables under Unix. The fs_parm pointer contains the filespec (or text) with possible environment variables to be expanded. The buffer that wrkSpc points to must be MAX_PATH+60 chars long. The double quotes in the echo string prevent wildcards from being processed. Most default shells should be able to handle this.
FILE *fp1;
sprintf(wrkSpc, "echo \"%s\" 2>/dev/null", fs_parm);
if ((fp1 = popen(wrkSpc, "r")) == NULL) /* do echo cmd */
return (P_ERROR); /* popen failed, there is no pipe to close */
if (fgets(wrkSpc, MAX_NAME, fp1) == NULL)/* Get echo results */
{ /* reading from the pipe failed */
pclose(fp1); /* close pipe */
return (P_ERROR); /* pipe function failed */
}
pclose(fp1); /* close pipe */
wrkSpc[strlen(wrkSpc)-1] = '\0';/* remove newline */
For MS Windows, use the ExpandEnvironmentStrings() function.
This is what I use:
const unsigned short expandEnvVars(std::string& original)
{
const boost::regex envscan("%([0-9A-Za-z\\/]*)%");
const boost::sregex_iterator end;
typedef std::list<std::tuple<const std::string,const std::string>> t2StrLst;
t2StrLst replacements;
for (boost::sregex_iterator rit(original.begin(), original.end(), envscan); rit != end; ++rit)
replacements.push_back(std::make_pair((*rit)[0],(*rit)[1]));
unsigned short cnt = 0;
for (t2StrLst::const_iterator lit = replacements.begin(); lit != replacements.end(); ++lit)
{
const char* expanded = std::getenv(std::get<1>(*lit).c_str());
if (expanded == NULL)
continue;
boost::replace_all(original, std::get<0>(*lit), expanded);
cnt++;
}
return cnt;
}
Using Qt, this works for me:
#include <QString>
#include <QRegExp>
QString expand_environment_variables( QString s )
{
QString r(s);
QRegExp env_var("\\$([A-Za-z0-9_]+)");
int i;
while((i = env_var.indexIn(r)) != -1) {
QByteArray value(qgetenv(env_var.cap(1).toLatin1().data()));
if(value.size() > 0) {
r.remove(i, env_var.matchedLength());
r.insert(i, value);
} else
break;
}
return r;
}
expand_environment_variables(QString("$HOME/.myconfigfile")); yields /home/martin/.myconfigfile
(It also works with nested expansions)

changing first letter to uppercase with fscanf and fseek

My program changes the first letter of each word in a .txt file to uppercase.
I enter the path of the file. The program reads a word into a character array named "word" and changes the first cell of the array to uppercase. Then it counts the letters of that word, moves back to the first letter of the word, and writes the new word to the file.
But it does not work correctly!
#include <iostream>
#include <stdio.h>
using namespace std;
int main ()
{
int t=0, i=0,j=0;
char word[5][20];
FILE *f;
char adres[20];
cin >> adres; // K:\\t.txt
f=fopen(adres,"r+");
{
t=ftell(f);
cout << t<<"\n";
fscanf(f,"%s",&word[i]);
word[i][0]-=32;
for (j=0;word[i][j]!=0;j++){}
fseek(f,-j,SEEK_CUR);
fprintf(f,"%s",word[i]);
t=ftell(f);
cout << t<<"\n";
}
i++;
{
fscanf(f,"%s",&word[i]);
word[i][0]-=32;
for (j=0;word[i][j]!=0;j++){}
fseek(f,-j,SEEK_CUR);
fprintf(f,"%s",word[i]);
t=ftell(f);
cout << t<<"\n";
}
return 0;
}
The file looks like this:
hello kami how are you
and the result is:
Hello kaAmihow are you
I think this is what you need.
#include<iostream>
#include<string>
#include<fstream>
using namespace std;
void readFile()
{
string word;
ifstream fin;
ofstream fout;
fin.open ("read.txt");
fout.open("write.txt");
if (!fin.is_open()) return;
while (fin >> word)
{
if(word[0]>='a' && word[0]<='z')
word[0]-=32;
fout<< word << ' ';
}
}
int main(){
readFile();
return 0;
}
This looks like homework.
Don't try to read and write in the same file. Use different files (in.txt & out.txt, for instance). You may delete & rename the files at end.
Use c++ streams.
Read one character at a time.
Divide your algorithm into three parts:
Read & write white-space until you find a non-white-space character.
Change the character to uppercase and write it.
Read and write the rest of the word.
Here is a starting point:
#include <fstream>
#include <locale>
int main()
{
using namespace std;
ifstream is( "d:\\temp\\in.txt" );
if ( !is )
return -1;
ofstream os( "d:\\temp\\out.txt" );
if ( !os )
return -2;
while ( is )
{
char c;
while ( is.get( c ) && isspace( c, locale() ) )
os.put( c );
is.putback( c );
// fill in the blanks
}
return 0;
}
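One possible way to fill in the blanks is sketched below. It follows the three parts listed above but tracks them with a single flag instead of nested loops. The function name `capitalize_words` and the string-based wrapper are assumptions made for this example, not part of the starting point; the wrapper exists only so the function can be tried without files.

```cpp
#include <cctype>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Sketch: copy `in` to `out`, upper-casing the first character of each word.
void capitalize_words(std::istream& in, std::ostream& out) {
    bool at_word_start = true;  // true while we are between words
    char c;
    while (in.get(c)) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isspace(uc)) {
            out.put(c);  // part 1: white-space is copied unchanged
            at_word_start = true;
        } else if (at_word_start) {
            out.put(static_cast<char>(std::toupper(uc)));  // part 2: first character
            at_word_start = false;
        } else {
            out.put(c);  // part 3: the rest of the word
        }
    }
}

// Convenience wrapper for trying the function on an in-memory string.
std::string capitalize_words(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream out;
    capitalize_words(in, out);
    return out.str();
}
```

For the file case, the stream overload can be called with the `ifstream` and `ofstream` opened as in the starting point.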
[EDIT]
Your program has too many problems.
It is not clear what you're trying to do. You probably want to capitalize each word.
scanf functions skip white-spaces in front of a string. If the file contains " abc" (notice the white space in front of 'a') and you use fscanf, you will get "abc" - no white-space.
Subtracting 32 from a character does not necessarily convert it to a capital letter. What if it is a digit, not a letter? Instead you should use toupper function.
etc.
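To illustrate the point about toupper, here is a small sketch. The helper name `capitalize_first` is hypothetical. The cast to `unsigned char` is there because passing a negative `char` value to `toupper` is undefined behaviour.

```cpp
#include <cctype>
#include <string>

// Sketch: upper-case the first character of a word. Digits, punctuation and
// letters that are already upper-case are left alone, unlike `word[0] -= 32`.
std::string capitalize_first(std::string word) {
    if (!word.empty()) {
        const unsigned char first = static_cast<unsigned char>(word[0]);
        word[0] = static_cast<char>(std::toupper(first));
    }
    return word;
}
```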
[EDIT] One file & c style:
#include <stdio.h>
#include <ctype.h>
int main()
{
FILE* f = fopen( "d:\\temp\\inout.txt", "r+b" );
if ( !f )
return -1;
while ( 1 )
{
int c;
//
while ( ( c = getc( f ) ) != EOF && isspace( c ) )
;
if ( c == EOF )
break;
//
fseek( f, -1, SEEK_CUR );
putc( toupper( c ), f );
fseek( f, ftell( f ), SEEK_SET ); // a positioning call is needed when switching from writing back to reading (not only with visual c)
//
while ( ( c = getc( f ) ) != EOF && !isspace( c ) )
;
if ( c == EOF )
break;
}
fclose( f );
return 0;
}

reading csv file for specific information

I am wondering how to read a specific value from a csv file in C++, and then read the next four items in the file. For example, this is what the file would look like:
fire,2.11,2,445,7891.22,water,234,332.11,355,5654.44,air,4535,122,334.222,16,earth,453,46,77.3,454
What I want to do is let my user select one of the values, let's say "air" and also read the next four items(4535 122 334.222 16).
I only want to use fstream,iostream,iomanip libraries. I am a newbie, and I am horrible at writing code, so please, be gentle.
You should read about parsers. Full CSV specifications.
If your fields are free of commas and double quotes, and you need a quick solution, search for getline/strtok, or try this (not compiled/tested):
#include <fstream>
#include <istream>
#include <string>
#include <vector>
typedef std::vector< std::string > svector;
bool get_line( std::istream& is, svector& d, const char sep = ',' )
{
d.clear();
if ( ! is )
return false;
char c;
std::string s;
while ( is.get(c) && c != '\n' )
{
if ( c == sep )
{
d.push_back( s );
s.clear();
}
else
{
s += c;
}
}
if ( ! s.empty() )
d.push_back( s );
return ! d.empty(); // true if at least one field was read
}
int main()
{
std::ifstream is( "test.txt" );
if ( ! is )
return -1;
svector line;
while ( get_line( is, line ) )
{
//...
}
return 0;
}
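For the specific task in the question (find "air" and take the next four items), a sketch along these lines may be enough. It assumes the fields contain no quotes or embedded commas, and it uses `<sstream>`, `<string>` and `<vector>` in addition to the headers the question wanted to stay within. The function name `fields_after` is made up for this example.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Sketch: split `line` on commas, find `key`, and return the `count` fields
// that follow it. Returns an empty vector if the key is missing or fewer
// than `count` fields follow it.
std::vector<std::string> fields_after(const std::string& line,
                                      const std::string& key,
                                      std::size_t count = 4) {
    std::vector<std::string> found;
    std::istringstream fields(line);
    std::string field;
    while (std::getline(fields, field, ',')) {
        if (field != key) continue;  // keep scanning until the key shows up
        while (found.size() < count && std::getline(fields, field, ','))
            found.push_back(field);
        break;
    }
    if (found.size() < count) found.clear();  // key missing or line too short
    return found;
}
```

A caller would read one line from the file with `std::getline(file, line)` and pass it in together with the user's choice. The values come back as strings and can be converted with `std::stod` or a string stream if numbers are needed.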

scoped_ptr to call member function throws error

I am currently reading Accelerated C++, chapter 13, and tried to rewrite the sample program from the book using boost::scoped_ptr, but I have run into an error.
Could you please help me out?
error: cannot use arrow operator on a type
record->read( cin );
^
The original sample code is something like what is shown below, and it works flawlessly:
std::vector< Core* > students ; // read students
Core* records;
std::string::size_type maxlen = 0;
// read data and store in an object
char ch ;
while( cin >> ch )
{
if( 'U' == ch )
{
records = new Core;
}
else if( 'G' == ch )
{
records = new Grad;
}
records->read( cin );
maxlen = max( maxlen , records->getname().size() );
students.push_back( records );
}
Now my version, using scoped_ptr:
typedef boost::scoped_ptr<Core> record;
std::vector< record > students;
char ch;
std::string::size_type maxlen = 0;
// read and store
while( ( cin >> ch ) )
{
if( ch == 'U')
{
record( new Core);
}
else if( ch == 'G')
{
record( new Grad);
}
record->read( cin );// GOT ERROR
//maxlen = max( maxlen, record->name().size() );// SAME type of error I EXPECT HERE
// students.push_back( record );
}
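A likely reading of the error: `record` is a typedef, so it names a type rather than a variable, and `record->read( cin )` applies the arrow operator to a type. In addition, boost::scoped_ptr cannot be copied, so it cannot be stored in a std::vector. The sketch below shows one way around both problems with std::unique_ptr, assuming C++11 is available. The `Core` and `Grad` classes here are simplified stand-ins for illustration only, not the classes from the book, and `read_students` is a hypothetical name.

```cpp
#include <algorithm>
#include <istream>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Stand-in classes for illustration only; the book's Core and Grad differ.
class Core {
public:
    virtual ~Core() {}
    virtual std::istream& read(std::istream& in) { return in >> name_; }
    const std::string& name() const { return name_; }
protected:
    std::string name_;
};

class Grad : public Core {
public:
    std::istream& read(std::istream& in) override {
        Core::read(in);
        return in >> thesis_;
    }
private:
    double thesis_ = 0.0;
};

typedef std::unique_ptr<Core> record_ptr;  // a type name, not a variable

// Reads records prefixed by 'U' or 'G' and tracks the longest name.
std::vector<record_ptr> read_students(std::istream& in,
                                      std::string::size_type& maxlen) {
    std::vector<record_ptr> students;
    char ch;
    while (in >> ch) {
        record_ptr rec;  // the variable that owns the new object
        if (ch == 'U') rec.reset(new Core);
        else if (ch == 'G') rec.reset(new Grad);
        else break;  // unknown tag: stop reading
        if (!rec->read(in)) break;
        maxlen = std::max(maxlen, rec->name().size());
        students.push_back(std::move(rec));  // ownership moves into the vector
    }
    return students;
}

// Convenience wrapper for trying the sketch on in-memory text.
std::vector<record_ptr> read_students_from_text(const std::string& text,
                                                std::string::size_type& maxlen) {
    std::istringstream in(text);
    return read_students(in, maxlen);
}
```

If Boost is required, boost::shared_ptr is copyable and can be stored in a vector in the same way, at the cost of reference counting.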

Display escape sequences as text only

My program is outputting text which sometimes contains escape sequences such as "\x1B[J" (Clear screen). Is there any way to suppress the escape sequence such that it doesn't perform its associated action but instead gets displayed via its text representation?
I would even be interested in doing this for \n and \r.
Escape the \ characters by changing each occurrence to \\.
Note that these sequences are only interpreted when you write them in source code. Check the result of the following program:
#include <cstdio>
int main(int argc, char * argv[])
{
char test[3] = { 0x5c, 0x6e, 0x00 }; // \n
const char * test2 = "\\n"; // \n
printf("%s\n", test);
printf("%s\n", test2);
printf(test);
printf(test2);
return 0;
}
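If the string already contains the real control bytes at run time (for example the ESC byte 0x1B rather than a backslash followed by letters), they have to be converted before printing. The following is a minimal sketch of that idea; the function name `make_visible` and the chosen output format are assumptions for this example.

```cpp
#include <cstdio>
#include <string>

// Sketch: return a copy of `s` in which control characters are replaced by
// a printable representation such as "\n" or "\x1B".
std::string make_visible(const std::string& s) {
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(s[i]);
        switch (ch) {
        case '\n': out += "\\n"; break;
        case '\r': out += "\\r"; break;
        case '\t': out += "\\t"; break;
        case '\\': out += "\\\\"; break;  // keep the output unambiguous
        default:
            if (ch < 0x20 || ch == 0x7F) {
                // Any other control character is shown as a hex escape.
                char buf[8];
                std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(ch));
                out += buf;
            } else {
                out += s[i];
            }
        }
    }
    return out;
}
```

Printing `make_visible(text)` instead of `text` then shows the clear-screen sequence as the characters `\x1B[J` instead of clearing the screen.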
It's not clear at what level you want to intervene. If you're
writing the output, the simplest solution is just to not insert
the characters to begin with. If you're passing an
std::ostream to some library, and it's inserting the
characters, it's fairly simple to insert a filtering streambuf
into the output stream, and filter them out. Something like the
following should do the trick for the standard escape sequences:
#include <cctype>
#include <cstdio>
#include <ostream>
#include <streambuf>
class EscapeSequenceFilter : public std::streambuf
{
std::streambuf* myDest;
std::ostream* myOwner;
bool myIsInEscapeSequence;
protected:
int overflow( int ch )
{
int retval = ch == EOF ? ch : 0;
if ( myIsInEscapeSequence ) {
// swallow everything up to and including the terminating letter
if ( isalpha( ch ) ) {
myIsInEscapeSequence = false;
}
} else if ( ch == 0x1B ) {
myIsInEscapeSequence = true;
} else {
retval = myDest->sputc( ch );
}
return retval;
}
public:
EscapeSequenceFilter( std::streambuf* dest )
: myDest( dest )
, myOwner( NULL )
, myIsInEscapeSequence( false )
{
}
EscapeSequenceFilter( std::ostream& dest )
: myDest( dest.rdbuf() )
, myOwner( &dest )
, myIsInEscapeSequence( false )
{
myOwner->rdbuf( this );
}
~EscapeSequenceFilter()
{
if ( myOwner != NULL ) {
myOwner->rdbuf( myDest );
}
}
};
Just declare an instance of this class with the output stream as
argument before calling the function you want to filter.
This class is easily extended to filter any other characters you
might wish.