Parsing a textfile (with HTML in it) with C++ - c++

I've been able to get some raw data in the form of a html webpage, which I have in turn put into an ordinary text file. I'm currently trying to use a C++ program to parse this file, but for some reason it's giving me weird output in that it's putting #s, symbols, and ^Ms in between every single letter. I'm unsure as to whether this is because I'm trying to parse an HTML file or if it's because my code is wrong, but I've tried my code on smaller HTML files and it works fine. The file I want it to work on is just 145kB
Here is my code:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main(int argc, char** argv)
{
ifstream inFile;
inFile.open(argv[1]);
string str;
while(getline(inFile, str))
{
cout << str << endl;
}
}
If anyone could give me a clue as to why this isn't working, I'd be very grateful.

HTML files may come in virtually any encoding. The OP needs to open the file according to the encoding it actually has, which is typically supplied along with the page when it is served. Note that individual pages served by the same site may have different encodings. The "#" are probably actually printed as "^#", which is what many output routines will print if you give them null characters. He may have a UTF-16 file and be reading it as if it were 8-bit ASCII.
He also needs to understand that "newline" conventions vary between machines; his "^M" probably means he is running on a Unix machine (which thinks "^J" is a line break) and got his file from a Windows box (which thinks "^M^J" is a line break). Welcome to the real world.
Next, the OP will find that parsing HTML is actually hard: it is complex, has lots of crazy character conventions (above and beyond encoding), and is often simply invalid, because browsers tolerate it and not everyone checks that their HTML is clean.
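As a rough illustration of the first two points, here is a minimal sketch that guesses the encoding from a byte-order mark and removes the stray "^M" from a line read with getline. The function names encoding_from_bom and strip_trailing_cr are made up for this example, and a file without a BOM may still be in any encoding, so treat the result as a hint only.

```cpp
#include <string>

// Hypothetical helper: guesses the encoding from a byte-order mark (BOM)
// at the start of the raw file contents. "unknown" only means that no BOM
// was found; the data may still be UTF-8 or some legacy 8-bit encoding.
// A UTF-32LE BOM also starts with FF FE and is not distinguished here.
std::string encoding_from_bom(const std::string& bytes)
{
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) return "UTF-8";
    if (bytes.compare(0, 2, "\xFF\xFE") == 0) return "UTF-16LE";
    if (bytes.compare(0, 2, "\xFE\xFF") == 0) return "UTF-16BE";
    return "unknown";
}

// Hypothetical helper: drops the '\r' (shown as ^M) that getline leaves
// behind when a file with Windows line endings is read on Unix.
void strip_trailing_cr(std::string& line)
{
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
    }
}
```

If encoding_from_bom reports UTF-16, every ASCII character in the file is stored as two bytes, one of which is a null, which would explain a symbol between every letter.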

Try if this works for you.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main(int argc, char** argv)
{
wifstream inFile;
inFile.open(argv[1]);
wstring str;
while(getline(inFile, str))
{
wcout << str << endl;
}
}

Related

How to append to the last line of a file in c++?

using g++, I want to append some data to the last line (but to not create a new line) of a file. Probably, a good idea would be to move back the cursor to skip the '\n' character in the existing file. However this code does not work:
#include <iostream>
#include <fstream>
using namespace std;
int main() {
ofstream myfile;
myfile.open ("file.dat", fstream::app|fstream::out);
myfile.seekp(-1,myfile.ios::end); //I believe, I am just before the last '\n' now
cout << myfile.tellp() << endl; //indicates the position set above correctly
myfile << "just added"; //places the text IN A NEW LINE :(
//myfile.write("just added",10); //also, does not work correctly
myfile.close();
return 0;
}
Please give me the idea of correcting the code. Thank you in advance. Marek.
When you open with app, writing always writes at the end, regardless of what tellp tells you.
("app" is for "append", which does not mean "write in an arbitrary location".)
You want ate (one of the more inscrutable names in C++) which seeks to the end only immediately after opening.
You also want to add that final newline, if you want to keep it.
And you probably also want to check that the last character is a newline before overwriting it.
And seeking by characters can do strange things in text mode, and if you open in binary mode you need to worry about the platform's newline convention.
Manipulating text is much harder than you think.
(And by the way, you don't need to specify out on an ofstream - the "o" in "ofstream" takes care of that.)
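Putting those points together, one possible sketch looks like the following. The function name append_to_last_line is invented for this example, and it assumes the file uses plain '\n' line endings (a file ending in "\r\n" would keep its '\r'). It opens the file for both reading and writing so nothing is truncated, checks whether the last byte is a newline, and writes over it.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: appends `text` to the last line of the file at `path`
// and finishes the file with a single '\n'. Returns false on failure.
// Assumes '\n' line endings; binary mode avoids newline translation.
bool append_to_last_line(const std::string& path, const std::string& text)
{
    // in | out keeps the existing contents and allows writing anywhere.
    std::fstream file(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!file) {
        return false;
    }

    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    std::streamoff write_pos = size;

    if (size > 0) {
        // Only overwrite the final byte if it really is a newline.
        file.seekg(-1, std::ios::end);
        char last = 0;
        file.get(last);
        if (last == '\n') {
            write_pos = size - 1;
        }
    }

    file.seekp(write_pos, std::ios::beg);
    file << text << '\n';   // put the final newline back
    file.flush();
    return static_cast<bool>(file);
}
```

Calling append_to_last_line("file.dat", "just added") would then extend the last line instead of starting a new one.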

So...is there a way to stop files from clearing automatically? (c++)

So I'm making an extremely simple guessing console game and I want to store data permanently in a file (the high score). However, every time I compile, the file I'm using empties itself. Is there any way to stop that?
I've tried a lot of things which didn't work and I honestly don't know where the problem is. I'm guessing it has to do with the fin and fout, but for others it seemed to work.
#include <iostream>
#include <fstream>
#include <time.h>
#include <conio.h>
int hs;
//this would be the play_game() function, unrelated to the subject
int main()
{
std::ofstream fout;
fout.open("HS.txt");
std::ifstream fin;
fin.open("HS.txt");
srand(time(NULL));
//menu with 4 options, play, quit, help and highscore (which i'm working on)
fin.close();
fout.close();
}
Don't open your file twice in parallel with two streams. Also, opening a file with a std::ofstream in its default mode truncates it, which is why HS.txt ends up empty every time the program runs (not when it is compiled); see this question. To keep the existing contents, you have to choose the open mode deliberately.
You need to either:
Open your file for both input and output, without truncating it; see: What does it mean to open an output file as both input and output?, or
Open your file for reading only when your app starts, and open it for writing, and write to it, when it exits (or every once in a while for better resiliency).
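The second option can be sketched roughly as follows. The helper names load_highscore and save_highscore are made up for this example, and it assumes the file holds nothing but a single integer.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads the stored high score.
// Returns 0 if the file is missing or does not start with a number.
int load_highscore(const std::string& path)
{
    std::ifstream fin(path);   // opening for reading never truncates
    int score = 0;
    if (!(fin >> score)) {
        return 0;
    }
    return score;
}

// Hypothetical helper: replaces the stored high score.
// The default ofstream mode truncates the file, which is fine here
// because the whole file is rewritten in one go.
bool save_highscore(const std::string& path, int score)
{
    std::ofstream fout(path);
    fout << score << '\n';
    fout.flush();
    return static_cast<bool>(fout);
}
```

The idea is to call load_highscore("HS.txt") once at startup and save_highscore("HS.txt", hs) only when there is a new value to store, so that no output stream is open (and truncating the file) the rest of the time.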

Can't get a simple ifstream to work in Visual Studio Express

I am trying to learn C++ and am on the file input/output section. I've hit a brick wall because my test application just plainly isn't working in Visual Studio Express 2012. Here is my code:
// ConsoleApp03.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
ifstream file_reader;
file_reader.open("C:\temp.txt");
// Test to see if the file was opened
if (!file_reader.is_open() ) {
cout << "Could not open file!" << endl;
return 255;
}
string line;
// Read the entire file and display it to the user;
while (getline(file_reader,line)) {
cout << line << endl;
}
// Close the file
file_reader.close();
return 0;
}
Every time I run this, I get "Could not open file!". I have verified that the file being opened does exist, and I have sufficient permission to read it. I have tried other text files, including ones in different locations like my documents folder, but the result is always the same. My text file is very simple and only contains two lines of text. I am able to open this file in Notepad++, and the file has no special attributes (system, read-only, etc.). I have even tried converting the file to/from ANSI and UTF-8 with no luck.
I have looked at other problems similar to what I have here, but these don't seem to be applicable to me (e.g.: ifstream::open not working in Visual Studio debug mode and ifstream failing to open)
Just to show how simple the text file is, here is me typing it from the command prompt:
C:\>type C:\temp.txt
Hi
There
This may or may not fix your problem, but \ followed by char is an escape sequence. So your file path is actually invalid. Try
file_reader.open("C:\\temp.txt");
The \t actually means tab. See here.
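To make the difference visible, here is a small sketch with three spellings of the same path. The variable names are invented for this example, and the raw string literal assumes a C++11 compiler that supports it, which may not be the case for older Visual Studio releases.

```cpp
#include <string>

// What the question's code actually passes: "\t" is a single tab character,
// so this string is C, :, <tab>, e, m, p, ... and names no real file.
const std::string broken_path = "C:\temp.txt";

// Escaping the backslash yields the intended path.
const std::string escaped_path = "C:\\temp.txt";

// A raw string literal (C++11) takes backslashes literally.
const std::string raw_path = R"(C:\temp.txt)";
```

Here escaped_path and raw_path hold the same eleven characters, while broken_path is one character shorter and contains a tab in third position.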

Using getline on html file

I have this assignment to search for certain info in an HTML file and put the result into a text file. I wanted to do it using getline, but somehow it's not working. I have no problems using getline on a text file, so I assumed that you cannot use getline on an HTML file. Is that assumption right? How can I convert such a file into a text file? Or maybe there is a better/easier solution? Thanks.
Here is the code:(the names of the variables are not in english, I hope it's not a problem)
#include <iostream>
#include <string>
#include <fstream>
using namespace std;
string nazwa_wejsciowego;
string roboczy;
ifstream html;
ifstream txt("wynik.txt");
int main()
{
cout<<"Podaj nazwę pliku html, z ktorego odczytane maja zostac dane."<<endl;
cin>>nazwa_wejsciowego;
ifstream html(nazwa_wejsciowego, ios::app); //opening the file
if(!html){
cout<<"Otwarcie pliku "<<nazwa_wejsciowego<<" nie powiodlo sie."<<endl;
system("pause");}
//checking if it opened properly
getline(html, roboczy);
cout<<roboczy<<endl;
return 0;}
No, the assumption is not right. HTML is text; the fact that it is in a structured format that can be parsed by a computer to render a webpage is not relevant to reading individual characters from the file.
getline may be a suitable approach, though, as Steve points out in the comments, some HTML pages are "minified" (they have unnecessary whitespace removed to save space and make the code harder to copy) and, in such a case, you may end up with just one really big line. It may therefore be more convenient to read in chunks of bytes.
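Reading in chunks could look roughly like this. The function name read_in_chunks and the default chunk size of 4096 bytes are arbitrary choices for this sketch; it simply collects the whole file into one string, which can then be searched with std::string::find.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads a file in fixed-size chunks and returns its
// contents. Returns an empty string if the file cannot be opened.
std::string read_in_chunks(const std::string& path, std::size_t chunk_size = 4096)
{
    if (chunk_size == 0) {
        chunk_size = 1;   // avoid looping forever on zero-byte reads
    }

    std::ifstream in(path, std::ios::binary);
    std::string contents;
    std::vector<char> buffer(chunk_size);

    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        // gcount() is how many bytes the last read really delivered;
        // the final chunk is usually shorter than the buffer.
        contents.append(buffer.data(), static_cast<std::size_t>(in.gcount()));
    }
    return contents;
}
```

With the contents in memory, something like contents.find("<title>") works the same whether the page is minified onto a single line or not.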

Reading ISO-8859 type file containing special characters such as é in C++

I'm trying to read a file which is encoded in ISO-8859(ansi), and it contains some west European characters such as "é".
When I try to read the file and output the result, all the special characters appear as �, whereas normal alphabets appear correctly.
If I convert the file to utf-8 format and then do the same job, everything works perfectly.
Does anyone have any idea to solve this problem? I tried to use wifstream and wstring instead of ifstream and string but didn't help much.
Here's my sample code:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
ifstream myFS;
myFS.open("test.txt", ios::in);
string myString;
if(myFS.is_open()){
while(myFS >> myString)
cout << myString << endl;
}
myFS.close();
return 0;
}
test.txt (ISO-8859-15 format) contains:
abcd éfg
result:
abcd
�fg
Any advice will be appreciated.
Thank you in advance!
+)
forgot to mention my system environment.
I'm using ubuntu 10.10(Maverick) console with g++ ver 4.4.5
Thanks!
Your console is set to use UTF-8, so when you just dump the ISO-8859-15 file to the console using cout, it shows the wrong characters. Characters with ASCII codes below 128 are the same in both encodings, which is why all of those appear correctly on your screen.
The output from the program is actually correct; it's just your console that's not set to display the output correctly.
I'd also recommend using ios::binary on files that aren't all ASCII, or you may have problems on other platforms later.
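If changing the console is not an option, the bytes can be converted before printing. The sketch below uses an invented function name, latin1_to_utf8, and handles ISO-8859-1; ISO-8859-15 differs from it at eight byte values (0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD and 0xBE, for example the euro sign), which this simplified version does not remap. The "é" from the question is 0xE9 in both.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Simplification: the eight positions where ISO-8859-15 differs from
// ISO-8859-1 are treated as Latin-1 and would come out wrong.
std::string latin1_to_utf8(const std::string& input)
{
    std::string output;
    output.reserve(input.size());

    for (unsigned char c : input) {
        if (c < 0x80) {
            // ASCII is identical in both encodings.
            output += static_cast<char>(c);
        } else {
            // Code points 0x80..0xFF need two bytes in UTF-8.
            output += static_cast<char>(0xC0 | (c >> 6));
            output += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return output;
}
```

In the question's loop, printing latin1_to_utf8(myString) instead of myString would send the two-byte UTF-8 form of "é" to the console.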