C++ Text File, Chinese characters - c++

I have a C++ project that is supposed to add <item> to the beginning of every line and </item> to the end of every line. This works fine with normal English text, but it does not work on a Chinese text file I would like to process. I normally use .txt files, but for this one I have to use .rtf to save the Chinese text. After I run my code, the output is gibberish. Here's an example:
{\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff31507\deff0\stshfdbch31506\stshfloch31506\stshfhich31506\stshfbi31507\deflang1033\deflangfe1033\themelang1033\themelangfe0\themelangcs0{\fonttbl{\f2\fbidi
\fmodern\fcharset0\fprq1{*\panose
02070309020205020404}Courier
New;}
Code:
int main()
{
ifstream in;
ofstream out;
string lineT, newlineT;
in.open("rawquote.rtf");
if(in.fail())
exit(1);
out.open("itemisedQuote.rtf");
do
{
getline(in,lineT,'\n');
newlineT += "<item>";
newlineT += lineT;
newlineT += "</item>";
if (lineT.length() >5)
{
out<<newlineT<<'\n';
}
newlineT = "";
lineT = "";
} while(!in.eof());
return 0;
}

That looks like RTF, which makes sense since you say this is an .rtf file.
Basically, if you dump the raw contents of that file right after opening it, you'll see that it already looks like that.
Also, you should revisit your loop:
std::string line;
while(getline(in, line, '\n'))
{
// do stuff here; the loop condition above checks that a line was actually read
out << "<item>" << line << "</item>" << endl;
}

You can't read the RTF code the same way as plain text as you'll just ignore format tags, etc. and might just break the code.
Try saving your Chinese text as a plain text file encoded as UTF-8 (without BOM) and your code should work. I was not sure whether some multi-byte UTF-8 character could contain a line-break byte; it cannot, because every byte of a multi-byte UTF-8 sequence has its high bit set, so splitting on '\n' is safe. If you need to work with individual characters rather than bytes, you should do a real UTF-8 conversion and read the file using wide chars instead of regular chars (as Chan suggested), which is a little bit tricky in C++.

It's kind of a miracle that this works for non-Chinese text. "\n" is not the line separator in RTF, "\par" is. The odds that more damage is done to the RTF header are certainly greater for Chinese.
C++ is not the best language to tackle this. It is a trivial 5 minute program in C# as long as the file doesn't get too large:
using System;
using System.Windows.Forms; // Add reference
class Program {
static void Main(string[] args) {
var rtb = new RichTextBox();
rtb.LoadFile(args[0], RichTextBoxStreamType.RichText);
var lines = rtb.Lines;
for (int ix = 0; ix < lines.Length; ++ix) {
lines[ix] = "<item>" + lines[ix] + "</item>";
}
rtb.Lines = lines;
rtb.SaveFile(args[0], RichTextBoxStreamType.RichText);
}
}
If C++ is a hard requirement then you'll have to find an RTF parser.

I think you should use wide characters (wchar_t, e.g. std::wstring) for the string instead of regular char.

If I'm understanding the objective of this code, your solution is not going to work. A line break in an RTF document does not correspond to a line break in the visible text.
If you can't just use plain text (Chinese characters are not a problem with a valid encoding), take a look at the RTF spec. You'll discover that it is a nightmare. So your best bet is probably a third-party library that can parse RTF and read it "line" by "line." I have never looked for such a library, so I do not have any suggestions off the top of my head, but I'm sure they are out there.

Related

Setting source filename in C++ target

I have a code preprocessor which inserts a #line directive in the source code. The directive contains a filename, line number, and character position. I have a lexer rule for the #line directive that calls a function called newFile. The newFile function sets the lexer line number and character position. But, I don't see a way to set the source name. There are functions for getting the source name, but not setting it. I tried setting the input stream source name, but that didn't seem to work (I have an errorListener that gets the filename from recognizer->getInputStream()->getSourceName() but it always returns the initial filename).
My code is (C++ target):
preprocessor pp(_defines, _incpaths);
ANTLRInputStream input(pp.preprocess(filename));
myLexer lexer(&input);
CommonTokenStream tokens(&lexer);
myParser parser(&tokens);
antlr4::tree::ParseTree* tree = parser.start();
And, the newFile code is:
void myLexer::newFile (std::string newFilename, int newLine, int newPos)
{
static_cast<ANTLRInputStream*>(_input)->name = newFilename; // doesn't work
setLine(newLine);
setCharPositionInLine(newPos);
}
Thanks for any and all help.
There's no built-in functionality like that. Keep this information in a separate structure and manage changes there. The input stream name is just a convenient feature, which is not flexible enough for that kind of processing.
Thanks for the info. I tried a few different ways to store file/location information in a separate structure, but it quickly got overly complex.
I resolved the problem by taking advantage of Antlr's line tracking functionality. I stored the filename in a list, then encoded the list index of the filename into the line number.
int fileIndex = filename_list.size();
filename_list.append(filename);
int line = (fileIndex << 20) + newLine;
setLine(line);
setCharPositionInLine(newPos);
Then, in the errorListener, or AST builder, it's easy to access the filename:
void errorListener::syntaxError(Recognizer* recognizer......)
{
int fileIndex = line >> 20;
line &= 0xFFFFF;

How can I read CSV file in to vector in C++

I'm working on a project that converts Python code to C++ for better performance. The Python project is called Advanced EAST. For now, I have the input data for the nms function in a .csv file like this:
"[ 5.9358170e-04 5.2773970e-01 5.0061589e-01 -1.3098677e+00
-2.7747922e+00 1.5079222e+00 -3.4586751e+00]","[ 3.8175487e-05 6.3440394e-01 7.0218205e-01 -1.5393494e+00
-5.1545496e+00 4.2795391e+00 -3.4941311e+00]","[ 4.6003381e-05 5.9677261e-01 6.6983813e-01 -1.6515008e+00
-5.1606908e+00 5.2009044e+00 -3.0518508e+00]","[ 5.5172237e-05 5.8421570e-01 5.9929764e-01 -1.8425952e+00
-5.2444854e+00 4.5013981e+00 -2.7876694e+00]","[ 5.2929961e-05 5.4777789e-01 6.4851379e-01 -1.3151239e+00
-5.1559062e+00 5.2229333e+00 -2.4008298e+00]","[ 8.0250458e-05 6.1284608e-01 6.1014801e-01 -1.8556541e+00
-5.0002270e+00 5.2796564e+00 -2.2154367e+00]","[ 8.1256607e-05 6.1321974e-01 5.9887391e-01 -2.2241254e+00
-4.7920742e+00 5.4237065e+00 -2.2534993e+00]
One unit is 7 numbers, but there is a '\n' after the first four numbers.
I want to read this CSV file into my C++ project
so that I can do the math in C++ and make it faster.
using namespace std;
void read_csv(const string &filename)
{
//File pointer
fstream fin;
//open an existing file
fin.open(filename, ios::in);
vector<vector<vector<double>>> predict;
string line;
while (getline(fin, line))
{
std::istringstream sin(line);
vector<double> preds;
double pred;
while (getline(sin, pred, ']'))
{
preds.push_back(preds);
}
}
}
For now my code is not working, of course,
and I have no idea how to fix it.
Please help me read the CSV data into my code.
Thanks.
Unfortunately parsing strings (and consequently files) is very tedious in C++.
I highly recommend using a library, ideally a header-only one, like this one.
If you insist on writing it yourself, maybe you can draw some inspiration from this StackOverflow question on how to parse general CSV files in C++.
You could look at POSIX getdelim(&buf, &size, ',', fp), which reads from a C FILE* (not a C++ fstream) up to and including the next comma.
But the other issue will be those quotes: unless you /know/ the file is always formatted exactly this way, it becomes difficult.
One hack I have used in the past, which is NOT PERFECT: if the first character is a quote, then the last character before the comma must also be a matching quote, and not escaped.
If it is not a quote, then getdelim() some more. But the auto-alloc feature of getdelim means each call hands back a separate buffer, so in C++ I end up with a vector of all the pieces returned by getdelim, which then need to be concatenated to make the final string:
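// Sketch only: assumes `FILE* fp` is already open for reading (POSIX, <stdio.h>).
std::vector<char*> gotLine;
char* first = static_cast<char*>(malloc(2));
first[0] = static_cast<char>(fgetc(fp)); // first character of the field
first[1] = 0;
gotLine.push_back(first);
bool gotquote = first[0] == '"'; // perhaps different classes of quote
if (first[0] != ',')
for(;;)
{
char* gotSub = nullptr; // nullptr + size 0 makes getdelim allocate a fresh buffer
size_t cap = 0;
ssize_t subLen = getdelim(&gotSub, &cap, ',', fp);
if (subLen < 0) { free(gotSub); break; } // EOF or read error
gotLine.push_back(gotSub);
if (!gotquote) break;
// gotSub ends with the ',' delimiter, so a closing quote sits at subLen-2
if (subLen>1 && gotSub[subLen-2]=='"') // again different classes of quote
if (subLen==2 || gotSub[subLen-3]!='\\') // needs to be a while loop
break;
}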
Then just concatenate all these string segments back together.
Note that getdelim supports null bytes. If you expect null bytes in the content, and not represented by the character sequences \000 or \# you need to store the actual length returned by getdelim, and use memcpy to concatenate them.
Oh, and if you allow utf-8 extended quotes it gets very messy!
The case this doesn't cover is a string that ends \\" or \\\\". Ideally you need a while loop that counts the backslashes immediately preceding the quote, and accepts the quote as the closing one only if the count is even.
Note that this leaves the issue of unescaping the quoted content, i.e. converting any \" into ", and \\ into \, etc., and also discarding the enclosing quotes.
In the end a library may be easier if you need to deal with completely arbitrary content. But if the content is "known" you can live without.

No method of reading a file seems to work, all return nothing - C++

EDIT: Problem solved! It turns out Windows 7 won't let me read or write these files without explicitly running as administrator. If I run as admin it works fine; if I don't, I get the weird results I explain below.
I've been trying to get a part of a larger program of mine to read a file.
Despite trying multiple methods (istream::getline, std::getline, the >> operator, etc.), all of them return either \0, blank, or a random number or whatever I initialised the variable with.
My first thought was that the file didn't exist or couldn't be opened. However, the state flags .good, .bad and .eof all indicate no problems, and the file I'm trying to read is certainly in the same directory as the debug .exe and contains data.
I'd most like to use istream::getline to read lines into a char array, however reading lines into a string array is possible too.
My current code looks like this:
void startup::load_settings(char filename[]) //master function for opening a file.
{
int i = 0; //count variable
int num = 0; //var containing all the lines we read.
char line[5];
ifstream settings_file (settings.inf);
if (settings_file.is_open());
{
while (settings_file.good())
{
settings_file.getline(line, 5);
cout << line;
}
}
return;
}
As said above, it compiles but just puts \0 into every element of the char array, much like all the other methods I've tried.
Thanks for any help.
Firstly, your code is not complete: what is settings.inf?
Secondly, most probably you are reading everything fine, and the problem is in the way you are printing.
With cout << line; where line is declared as char line[5];, make sure that the last element of the array is \0.
You can do something like this:
line[4] = '\0'; or you can manually print the value of each element of the array in a loop.
You can also try printing the character codes in hex, because the values (character codes) in the array might not be in the visible range of ASCII characters. You can do it like this, for example:
cout << hex << (int)line[i]

Reading and writing to files isn't working in C++

I am basically trying to reverse the contents of a text file. When I run this code, nothing happens. Code:
getArguments();
stringstream ss;
ss << argument;
string fileName;
ss >> fileName;
fstream fileToReverse(fileName);
if (fileToReverse.is_open()) {
send(sock, "[*] Contents is being written to string ... ", strlen("\n[*] Contents is being written to string ... "), 0);
string line;
string contentsOfFile;
while (getline(fileToReverse, line)) {
contentsOfFile.append(line);
line = "\0";
}
send(sock, "done\n[*] File is being reversed ... ", strlen("done\n[*] File is being reversed ... "), 0);
string reversedText(contentsOfFile.length(), ' ');
int i;
int j;
for(i=0,j=contentsOfFile.length()-1;i<contentsOfFile.length();i++,j--) {
reversedText[i] = contentsOfFile[j];
}
contentsOfFile = "\0";
fileToReverse << reversedText;
fileToReverse.close();
send(sock, "done\n", strlen("done\n"), 0);
}
fileName is created from user input, and I know that the file exists. It just doesn't do anything to the file. If anyone has any ideas that they would like to share that would be great.
UPDATE:
I can now write reversedText to the file, but how can I delete all of the file's contents?
In this particular case, when you have read all the input content, your file is in an "error state" (eof and fail bits set in the status).
You need to clear that with fileToReverse.clear();. Your file position will also be at the end of the file, so you need to use fileToReverse.seekp(0, ios_base::beg) to set the position to the beginning.
But I, just as g-makulik, prefer to have two files, one for input and one for output. Saves a large amount of messing about.
When you need to debug something like this, saying "all the functions are being run and all the variables are being created, and it compiled without any warnings" isn't really debugging.
Debugging means: this doesn't work, so remove bits until you find the part that doesn't work. Like you said, all variables are what you expect them to be. So try and see if, for example, the way you read from and write to a file works. Just write a small program that opens a file like you open it, reads from it like you do, and then writes whatever back into it in the same way you do. See if that works.
In other words, try and find the smallest program that reproduces what you see.

How to ignore a character through strtok?

In the code below I would like to also ignore the character ”. But after adding it, I still get “Mr_Bishop” as my output.
I have the following code:
ifstream getfile;
getfile.open(file,ios::in);
char data[256];
char *line;
//loop till end of file
while(!getfile.eof())
{
//get data and store to variable data
getfile.getline(data,256,'\n');
line = strtok(data," ”");
while(line != NULL)
{
cout << line << endl;
line = strtok(NULL," ");
}
}//end of while loop
my file content :
hello 7 “Mr_Bishop”
hello 10 “0913823”
Basically all i want my output to be :
hello
7
Mr_Bishop
hello
10
0913823
With this code i only get :
hello
7
"Mr_Bishop"
hello
10
"0913823"
Thanks in advance! :)
I realise I made an error in the inner loop by leaving out the quote. But now I receive the following output:
hello
7
Mr_Bishop
�
hello
10
0913823
�
any help? thanks! :)
It looks like you used WordPad or something similar to generate the file. You should use Notepad or Notepad++ on Windows, or a similar editor on Linux, that will save the file in ASCII encoding. Right now you are using what looks like UTF-8 encoding.
In addition the proper escape sequence for " is \". For instance
line = strtok(data," \"");
Once you fix your file to be in ASCII encoding, you'll find you missed something in your loop.
while(!getfile.eof())
{
//get data and store to variable data
getfile.getline(data,256,'\n');
line = strtok(data," \"");
while(line != NULL)
{
std::cout << line << std::endl;
line = strtok(NULL," \""); // THIS used to be strtok(NULL," ");
}
}//end of while loop
You missed a set of quotes there.
Correcting the file and this mistake yields the proper output.
Have a very careful look at your code:
line = strtok(data," ”");
Notice how the quotes lean at different angles (well, mine do; hopefully your font shows the same thing). You have included only the closing double quote in your strtok() call. However, your data file has:
hello 7 “Mr_Bishop”
which has two different kinds of quotes. Make sure you're using all the right characters, whatever "right" is for your data.
UPDATE: Your data is probably UTF-8 encoded (that's how you got those leaning double quotes in there) and you're using strtok() which is completely unaware of UTF-8 encoding. So it's probably doing the wrong thing, splitting up the multibyte UTF-8 characters, and leaving you with rubbish at the end of the line.