Edit text file in place using C++

I have a text file which I am adding tags to in order to make it XML readable. In order for our reader to recognize it as valid, each line must at least be wrapped in tags. My issue arises because this is actually a Syriac translation dictionary, so there are many non-standard characters (the actual Syriac words). The most straightforward way I see to accomplish what I need is to simply prepend and append the needed tags to each line, in place, without necessarily accessing or modifying the rest of the line. Any other options would also be greatly appreciated.
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main() {
    ifstream in_file;
    string file_name;
    string line;
    string line2;
    string pre_text;
    string post_text;
    int num = 1;
    pre_text = "<entry n=\"";
    post_text = "</entry>";
    file_name = "D:/TEI/dictionary1.txt";
    in_file.open(file_name.c_str());
    if (in_file.is_open()) {
        while (getline(in_file, line)) {
            line2 = pre_text + to_string(num) + "\">" + line + post_text;
            cout << line2 << "\n";
            num++;
        }
    }
}
The file in question may be downloaded here.

You are using std::string which, by default, deals with ASCII encoded text, and you are opening your file in "text translation mode". The first thing you need to do is open the file in binary mode so that it doesn't perform translation on individual char values:
in_file.open(file_name.c_str(), std::ios::binary);
or in C++11
in_file.open(file_name, std::ios::binary);
The next thing is to stop using std::string for storing the text from the file. You will need to use a string type that recognizes the character encoding you are using and use the appropriate character type.
As it turns out, std::string is actually an alias for std::basic_string<char>. In C++11 several new unicode character types were introduced, in C++03 there was wchar_t which supports "wide" characters (more than 8 bits). There is a standard alias for basic_strings of wchar_ts: std::wstring.
Start with the following simple test:
#include <iostream>
#include <fstream>
#include <string>

int main() {
    std::string file_name = "D:/TEI/dictionary1.txt";
    std::wifstream in_file(file_name, std::ios::binary);
    if (!in_file.is_open()) {
        // "L" prefix indicates a wide string literal
        std::wcerr << L"file open failed\n";
        return 1;
    }
    std::wstring line1;
    std::getline(in_file, line1);
    std::wcout << L"line1 = " << line1 << L"\n";
}
Note how cout etc also become prefixed with w...
The standard ASCII character set contains 128 characters numbered 0 through 127. In ASCII, \n and \r are represented by the 7-bit values 10 and 13 respectively.
Your text file appears to be UTF-8 encoded. UTF-8 is a variable-length encoding: code points 0 through 127 (the ASCII range) take a single byte, code points 128 through 2047 take two bytes, code points 2048 through 65535 take three bytes, and so on.
A byte with the highest bit (2^7) clear is a plain 7-bit ASCII character and is never part of a multibyte sequence. Multibyte sequences start with a lead byte of the form 110xxxxx, 1110xxxx or 11110xxx (announcing two, three or four bytes in total), followed by continuation bytes of the form 10xxxxxx; the payload bits, concatenated, form the code point. So the byte sequence { 0xC4, 0x80 } decodes to (0x04 << 6) | 0x00, i.e. code point 0x100 (U+0100).
You can read and write utf-8 values through char streams and storage, but only as opaque byte streams. The moment you start to need to understand the values you generally need to resort to wchar_t, uint16_t or uint32_t etc.
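For illustration, here is a minimal sketch (my own example, assuming well-formed input and doing no error checking) of decoding a UTF-8 byte sequence held in a plain std::string into 32-bit code points:
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Decode a UTF-8 encoded std::string into a vector of code points.
// Assumes the input is well-formed UTF-8; no validation is performed.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> code_points;
    for (std::size_t i = 0; i < bytes.size();) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t extra;                                  // number of continuation bytes
        if      (b < 0x80) { cp = b;        extra = 0; }    // 0xxxxxxx: plain ASCII
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }    // 110xxxxx: 2-byte sequence
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }    // 1110xxxx: 3-byte sequence
        else               { cp = b & 0x07; extra = 3; }    // 11110xxx: 4-byte sequence
        for (std::size_t k = 1; k <= extra; ++k)            // 10xxxxxx continuation bytes
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        code_points.push_back(cp);
        i += extra + 1;
    }
    return code_points;
}

int main() {
    std::string s = "\xC4\x80";                             // U+0100, LATIN CAPITAL LETTER A WITH MACRON
    for (std::uint32_t cp : decode_utf8(s))
        std::cout << "U+" << std::hex << cp << "\n";        // prints U+100
}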
If you are working with Microsoft's toolset (noting the "D:/" path), you may need to look into TCHAR (https://msdn.microsoft.com/en-us/library/c426s321.aspx)
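That said, since the question only prepends and appends tags and never inspects the Syriac text itself, you can also keep std::string and treat each line as an opaque UTF-8 byte sequence, as mentioned above. A minimal sketch along those lines (the tags and input path are taken from the question; the output file name is my own choice):
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream in_file("D:/TEI/dictionary1.txt", std::ios::binary);
    std::ofstream out_file("D:/TEI/dictionary1.xml", std::ios::binary); // assumed output path
    if (!in_file.is_open() || !out_file.is_open()) {
        std::cerr << "file open failed\n";
        return 1;
    }
    std::string line;
    int num = 1;
    while (std::getline(in_file, line)) {
        // Strip a trailing '\r' left over from Windows line endings in binary mode.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        // The Syriac text in `line` is copied through untouched as raw UTF-8 bytes.
        out_file << "<entry n=\"" << num << "\">" << line << "</entry>\n";
        ++num;
    }
}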

Related

how to make a new line in binary file c++?

Can anybody help me with this simple thing in file handling?
This is my code:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    fstream f;
    f.open("input05.bin", ios_base::binary | ios_base::out);
    string str = "";
    cout << "Input text:" << endl;
    while (1)
    {
        getline(cin, str);
        if (str == "end")
            break;
        else {
            f.write((char*)&str, sizeof(str));
        }
    }
    f.close();
    f.open("input05.bin", ios_base::in | ios_base::binary);
    while (!f.eof())
    {
        string st;
        f.read((char*)&st, sizeof(st));
        cout << st << endl;
    }
    f.close();
}
It is running successfully now. I want to format the output of the text file in my own way.
I have:
hi this is first program i writer this is an experiment
How can I make my output file look like the following:
hi this is first program
I writer this is an experiment
What should I do to format the output in that way?
First of all,
string str;
....
f.write((char*)&str, sizeof(str));
is absolutely wrong as you cast a pointer to an object of type std::string to a pointer to a character, i.e. char*. Note that an std::string is an object having data members like the length of the string and a pointer to the memory where the string content is kept, but it is not a c-string of type char *. Further, sizeof(str) gives you the size of the "wrapper object" with the length member and the pointer, but it does not give you the length of the string.
So it should be something like this:
f.write(str.c_str(), str.length());
Another thing is the OS-dependent handling of the newline character. Depending on the operating system, a new line in a file is represented either by 0x0d 0x0a (carriage return + line feed) or by just 0x0a (line feed). In memory, C++ always treats a new line as the single character '\n' (i.e. 0x0a). When writing to a file in text mode, C++ will expand '\n' to 0x0d 0x0a or keep it as 0x0a, depending on the platform. If you write to a file in binary mode, however, this replacement will not occur. So if you create a file in binary mode and insert only an 0x0a, then, depending on the platform, printing the file in the console may not result in a new line.
You could write the platform's line ending explicitly, e.g. on Windows
f.write(str.c_str(), str.length());
f.put('\r');
f.put('\n');
such that it will work on your platform (and will not work on other platforms then).
That's why you should write in text mode if you want to write text.
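For completeness, a minimal corrected sketch of the program above, writing in text mode so that each entered line ends up on its own line in the file (I switched the file name to input05.txt since the content is now plain text):
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Write in text mode: '\n' is translated to the platform's line ending.
    std::ofstream out("input05.txt");
    std::string str;
    std::cout << "Input text:" << std::endl;
    while (std::getline(std::cin, str) && str != "end")
        out << str << '\n';              // write the characters, not the std::string object
    out.close();

    // Read the lines back and print them.
    std::ifstream in("input05.txt");
    std::string line;
    while (std::getline(in, line))
        std::cout << line << std::endl;
}
Typing the two lines from the question and then "end" produces a file containing exactly those two lines.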

How to apply <cctype> functions on text files with different encoding in c++

I would like to split some files (around 1000) into words and remove numbers and punctuation. I will then process these tokenized words accordingly... However, the files are mostly in German language and are encoded in different types:
ISO-8859-1
ISO Latin-1
ASCII
UTF-8
The problem that I am facing is that I cannot find a correct way to apply character conversion functions such as tolower(), and I also get some weird icons in the terminal when I use std::cout on Ubuntu Linux.
For example, in non-UTF-8 files, the word französische is shown as franz�sische, für as f�r, etc. Also, words like Örebro or Österreich are ignored by tolower(). From what I know, the Unicode replacement character � (U+FFFD) is inserted for any character that the program cannot decode correctly when trying to handle Unicode.
When I open UTF-8 files I don't get any weird characters, but I still cannot convert uppercase special characters such as Ö to lower case... I used std::setlocale(LC_ALL, "de_DE.iso88591"); and some other options that I have found on Stack Overflow, but I still don't get the desired output.
My guess on how I should solve this is:
Check encoding of file that is about to be opened
open file according to its specific encoding
Convert file input to UTF-8
Process file and apply tolower() etc
Is the above algorithm feasible, or will the complexity skyrocket?
What is the correct approach for this problem? How can I open the files with some sort of encoding options?
1. Should my OS have the corresponding locale enabled as a global variable in order to process the text (without bothering how the console displays it)? (On Linux, for example, I do not have de_DE enabled when I run locale -a.)
2. Is this problem only visible due to the terminal's default encoding? Do I need to take any further steps before I process the extracted string normally in C++?
My Linux locale:
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=el_GR.UTF-8
LC_TIME=el_GR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=el_GR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=el_GR.UTF-8
LC_NAME=el_GR.UTF-8
LC_ADDRESS=el_GR.UTF-8
LC_TELEPHONE=el_GR.UTF-8
LC_MEASUREMENT=el_GR.UTF-8
LC_IDENTIFICATION=el_GR.UTF-8
LC_ALL=
and locale -a gives:
C
C.UTF-8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
Here is some sample code that I wrote that doesn't work as I want at the moment.
void processFiles() {
    std::string filename = "17454-8.txt";
    std::ifstream inFile;
    inFile.open(filename);
    if (!inFile) {
        std::cerr << "Failed to open file" << std::endl;
        exit(1);
    }
    //calculate file size
    std::string s = "";
    s.reserve(filesize(filename) + std::ifstream::pos_type(1));
    std::string line;
    while( (inFile.good()) && std::getline(inFile, line) ) {
        s.append(line + "\n");
    }
    inFile.close();
    std::cout << s << std::endl;

    //remove punctuation, numbers, tolower,
    //TODO encoding detection and specific transformation (cannot catch Ö, Ä etc) will add too much complexity...
    std::setlocale(LC_ALL, "de_DE.iso88591");
    for (unsigned int i = 0; i < s.length(); ++i) {
        if (std::ispunct(s[i]) || std::isdigit(s[i]))
            s[i] = ' ';
        if (std::isupper(s[i]))
            s[i] = std::tolower(s[i]);
    }
    //std::cout << s << std::endl;

    //tokenize string
    std::istringstream iss(s);
    tokens.clear();
    tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
    for (auto & i : tokens)
        std::cout << i << std::endl;

    //PROCESS TOKENS
    return;
}
Unicode defines "code points" for characters. A code point is a value in the range 0 to 0x10FFFF, commonly stored in a 32-bit integer. There are several encodings of these code points. ASCII only uses 7 bits, which gives 128 different chars. The 8th bit was used (by Microsoft, among others) to define another 128 chars, depending on the locale, in so-called "code pages" with names such as "Latin-1" or "ISO-8859-1". Nowadays Windows uses UTF-16 internally, a 2-byte encoding; because 2 bytes are not enough for the whole Unicode set, characters outside the Basic Multilingual Plane are represented by pairs of 16-bit units (surrogate pairs).
Most used in Linux (typically for files) is UTF-8, which uses a variable number of bytes for each character. The first 128 chars are exactly the same as ASCII chars, with just one byte per character. To represent a character UTF-8 can use up to 4 bytes. More info on Wikipedia.
While Windows uses UTF-16 both for files and in memory, on Linux wchar_t is 32 bits, so wide strings in memory are effectively UTF-32.
In order to read a file you need to know its encoding. Trying to detect it is a real nightmare which may not succeed. The use of std::basic_ios::imbue allows you to set the desired locale for your stream, like in this SO answer
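As a minimal sketch (the file name is hypothetical, and the named locale must be installed on your system, see locale -a), imbuing a wide file stream so that UTF-8 bytes are decoded into wchar_t while reading could look like this:
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::ios::sync_with_stdio(false);      // so wcout does not fight with C stdio
    std::locale utf8("de_DE.UTF-8");       // assumed to be installed; "" picks the system default
    std::wcout.imbue(utf8);

    std::wifstream in("input.txt");        // hypothetical file name
    in.imbue(utf8);                        // decode UTF-8 bytes into wchar_t while reading
    std::wstring line;
    while (std::getline(in, line))
        std::wcout << line << L'\n';
}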
tolower and such functions can work with a locale, e.g.
#include <iostream>
#include <locale>

int main() {
    wchar_t s = L'\u00D6'; // latin capital 'O' with diaeresis, decimal 214
    wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); // hex = 00F6, dec = 246
    std::cout << "s = " << s << std::endl;
    std::cout << "sL= " << sL << std::endl;
    return 0;
}
outputs:
s = 214
sL= 246
In this other SO answer you can find good solutions, such as using the iconv library (on Linux, or iconv for Win32).
In Linux the terminal can be set to use a locale with the help of LC_ALL, LANG and LANGUAGE, e.g.:
//Deutsch
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"
//English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"

C++ Non ASCII letters

How do I loop through the letters of a string when it has non-ASCII characters?
This works on Windows!
for (int i = 0; i < text.length(); i++)
{
    std::cout << text[i];
}
But on Linux, if I do:
std::string text = "á";
std::cout << text.length() << std::endl;
It tells me the string "á" has a length of 2, while on Windows it's only 1.
But with ASCII letters it works fine!
In your Windows system's code page, á is a single-byte character, i.e. every char in the string is indeed a character. So you can just loop and print them.
On Linux, á is represented as the multibyte (2 bytes to be exact) utf-8 character 'C3 A1'. This means that in your string, the á actually consists of two chars, and printing those (or handling them in any way) separately yields nonsense. This will never happen with ASCII characters because the utf-8 representation of every ASCII character fits in a single byte.
Unfortunately, utf-8 is not really supported by C++ standard facilities. As long as you only handle the whole string and neither access individual chars from it nor assume the length of the string equals the number of actual characters in the string, std::string will most likely do fine.
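For instance, if all you need is the number of characters rather than the number of bytes, you can count the code points yourself by skipping UTF-8 continuation bytes; a minimal sketch, assuming well-formed UTF-8 input:
#include <iostream>
#include <string>

// Count code points in a UTF-8 std::string by counting only the bytes
// that are NOT continuation bytes (continuation bytes look like 10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

int main() {
    std::string text = "á";
    std::cout << text.size() << "\n";        // 2: number of bytes (C3 A1)
    std::cout << utf8_length(text) << "\n";  // 1: number of characters
}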
If you need more utf-8 support, look for a good library that implements what you need.
You might also want to read this for a more detailed discussion on different character sets on different systems and advice regarding string vs. wstring.
Also have a look at this for information on how to handle different character encodings portably.
Try using std::wstring. The encoding used isn't specified by the standard as far as I know, so I wouldn't save these contents to a file without a library that handles a specific format of some sort. It supports multi-byte characters, so you can use letters and symbols not supported by ASCII.
#include <iostream>
#include <string>

int main()
{
    std::wstring text = L"áéíóú";
    for (int i = 0; i < text.length(); i++)
        std::wcout << text[i];
    std::wcout << text.length() << std::endl;
}

How to read non-ASCII lines from file with std::ifstream on Linux?

I was trying to read a plain text file. In my case, I need to read it line by line and process that information. I know C++ has the w-prefixed facilities (std::wstring, std::wfstream, etc.) for reading wide characters. I tried the following:
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::wfstream file("file"); // the file contains: aaaàaaa
    std::wstring str;
    std::getline(file, str);
    std::wcout << str << std::endl; // prints only: aaa
}
But as you can see, it did not read a full line. It stops when reads "à", which is non-ASCII. How can I fix it?
You will need to understand some basic concepts of encodings. I recommend reading this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Basically you can't assume every byte is a letter and that every letter fits in a char. Also, the system must know how to extract letters from the sequence of bytes you have on the file.
Let's assume your file is encoded in UTF-8, this is likely given that you are on Linux. I'll assume your terminal also supports it. If you directly read using a std::string, with chars, you will have everything working. Look:
// olá
#include <iostream>
#include <fstream>
#include <string>

int main() {
    std::fstream file("test.cpp");
    std::string str;
    std::getline(file, str);
    std::cout << str << std::endl;
}
The output is what you expect, but this is not really correct. Look at what is going on: The file is encoded in utf-8. This means the first line is this byte sequence:
/    /    (space)  o    l    á
47   47   32       111  108  195 161
Note that á is encoded with two bytes. If you ask the size of the string (str.size()), you will indeed get the wrong value: 7. This happens because the string thinks every byte is a char. When you send it to std::cout, the string will be given to the terminal to print. And the magical part: The terminal works with utf-8 by default. So it just assumes the string is utf-8 and correctly prints 6 chars.
You see that it works, but it is not really right. Try to make any string operation on the data and you may break the utf-8 encoding and will never be able to print it again!
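As a small demonstration of the problem (not something you should do), cutting the string at a byte position that falls inside a multibyte sequence leaves invalid UTF-8 behind:
#include <iostream>
#include <string>

int main() {
    std::string line = "// olá";          // 7 bytes: 'á' is the pair C3 A1
    std::string cut  = line.substr(0, 6); // keeps only the C3 byte of 'á'
    std::cout << cut << "\n";             // terminal shows a replacement character or garbage
}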
Let's go for wstrings. They store each letter with a wchar_t that, on Linux, has 4 bytes. This is enough to hold any possible unicode character. But it will not work directly because C++ by default uses the "C" locale. A locale is a specification of how to deal with various aspects of the system, like "how to print a date" or "how to format a currency value" or even "how to decode text". The last factor is important and the default "C" encoding says: "Assume everything is ASCII". When it is reading the file and tries to decode a non-ASCII byte, it just fails silently.
The correction is simple: Use a UTF-8 locale. Look:
// olá
#include <iostream>
#include <fstream>
#include <locale>
#include <string>

int main() {
    std::ios::sync_with_stdio(false);
    std::locale loc("en_US.UTF-8"); // You can also use "" for the default system locale
    std::wcout.imbue(loc);          // Use it for output
    std::wfstream file("test.cpp");
    file.imbue(loc);                // Use it for file input
    std::wstring str;
    std::getline(file, str);        // str.size() will be 6
    std::wcout << str << std::endl;
}
You may be asking what std::ios::sync_with_stdio(false); means. It is required because by default C++ streams are kept in sync with C streams. This is good because enables you to use both cout and printf on the same program. We have to disable it because C streams will break the utf-8 encoding and will produce garbage on the output.

Converting 32 bit integer to printable 8 bit Character

I want to convert a series of 32-bit integer values into a sequence of printable 8-bit character values. Mapping the 32-bit integers to printable 8-bit character values should result in a clear ASCII art image.
I can convert Integer to ASCII:
#include <iostream>
using namespace std;

int main() {
    char ascii;
    int numeric;
    cout << "Enter Number ";
    cin >> numeric;
    cout << "The ascii value of " << numeric << " is " << (char) numeric << "\n\n" << endl;
    return 0;
}
Also I need to open the text file that my numbers are saved into:
// reading a text file
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
    string line;
    ifstream myfile ("1.txt");
    if (myfile.is_open())
    {
        while ( myfile.good() )
        {
            getline (myfile, line);
            cout << line << endl;
        }
        myfile.close();
    }
    else
        cout << "Unable to open file";
    return 0;
}
But my problem is that I cannot open this text file, print the ASCII on the screen, and also write a copy of the output to "Output.txt".
Inside my text file is just:
757935403 544999979 175906848 538976380
757795452 170601773 170601727
which, after converting to ASCII, needs to look like this (it represents an ASCII art picture):
+---+
| |
| |
+---+
and I need to have this in my output.txt as well.
Please advise if you know how I can write this program.
First of all, you cannot convert a 32-bit integer to 8-bit ASCII without losing information. My guess is that you should extract 4 ASCII chars from each 32-bit integer.
If your input file is non-binary (which means the integer values are human-readable and separated by some delimiter), the first thing you should do is create another file/stream and write these values to the new file/stream, but now in binary mode (in this mode there will be no delimiter and the resulting file/stream will not be human-readable).
Now read the chars one by one (open the file in binary mode) from this new file/stream, and write them to your final output file using non-binary mode.
IF YOU WANT TO DO IT WITHOUT SEVERAL FILE INPUTS/OUTPUTS,
Read all your integer values in an array, then point the starting memory location with a char pointer, then write one by one the contents of this char array.
int* myIntArray; //keep the size of it somewhere
char* myCharArray = (char*)myIntArray; // size of myCharArray is 4 times that of myIntArray
Having converted those numbers into hex, you get this:
2d2d2d2b = '-' '-' '-' '+'
207c0a2b = ' ' '|' LF  '+'
0a7c2020 = LF  '|' ' ' ' '
2020207c = ' ' ' ' ' ' '|'
etc.
So basically, for some reason, the input file contains the characters to output stored as integers, which is completely endian-unsafe.
Your least worst bet is to read in each integer, cast it to an array of chars, and output those 4 chars.
If you're using Unix, I'd suggest using 'tee' to send your output to 2 files if you can; otherwise output once to stdout, then output again to whatever file handle you've opened for Output.txt.
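Putting the pieces together, here is a minimal sketch under the assumptions above: the integers in "1.txt" are human-readable and whitespace-separated (as in the sample), and each integer is emitted low byte first, which matches a little-endian layout and reproduces the expected ASCII art; the characters go both to the screen and to "Output.txt":
#include <fstream>
#include <iostream>

int main() {
    std::ifstream in("1.txt");
    std::ofstream out("Output.txt", std::ios::binary);   // binary so the bytes are copied exactly
    if (!in || !out) {
        std::cerr << "Unable to open file\n";
        return 1;
    }
    unsigned int value;
    while (in >> value) {                                 // read the human-readable integers
        for (int i = 0; i < 4; ++i) {                     // low byte first (little-endian assumption)
            char c = static_cast<char>((value >> (8 * i)) & 0xFF);
            std::cout << c;                               // show on screen
            out << c;                                     // and copy to Output.txt
        }
    }
}
For the sample data, the first integer 757935403 is 0x2d2d2d2b, whose bytes taken low-to-high are '+', '-', '-', '-', i.e. the start of the "+---+" box.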