C++ : Interpret unicode white space

I have a file which contains text (ASCII + unicode) and I am trying to count total words in it using a C++ program. It is a requirement that I should read the file line by line (using getline) and then process each line to count the words within it.
So I have written the following simple program:
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cstdint>
int main(int argc, char* argv[]) {
uint64_t ct = 0;
std::string line;
std::ifstream infile(argv[1]);
while(std::getline(infile, line)) {
std::stringstream inputStream(line);
std::string token;
while (inputStream >> token) {
++ct;
}
}
std::cout << ct << std::endl;
return 0;
}
However, the above program outputs a number that is smaller than what the wc -w command gives. To narrow down the problem, I modified the program to simply output whatever it reads. So now the program becomes:
int main(int argc, char* argv[]) {
uint64_t ct = 0;
std::string line;
std::ifstream infile(argv[1]);
while(std::getline(infile, line)) {
std::stringstream inputStream(line);
std::string token;
while (inputStream >> token) {
std::cout << token << " ";
}
std::cout << std::endl;
}
return 0;
}
I redirected the output of this program to another file. When I run wc -w on this new file, the count is the same as running wc -w on the original file. This means my program is reading all the words (i.e., "words" as defined by wc). A reasonable explanation is therefore that some value of token read by inputStream >> token contains a Unicode character that wc interprets as white space. So how do I change my program to also treat such Unicode white space characters as separators?

You can go by either:
A. Java's definition of Unicode (not non-breaking) whitespace.
or
B. Wikipedia's list of all 25 Unicode code points defined as whitespace.
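Option B can be sketched as follows, assuming the file is UTF-8. The helper names (is_unicode_space, next_code_point, count_words) are made up for this illustration, and the decoder is deliberately lenient rather than a strict UTF-8 validator. Whether the result matches wc -w exactly may depend on the locale wc runs under.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// True for the 25 code points that carry the Unicode White_Space property.
bool is_unicode_space(char32_t cp) {
    return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085 ||
           cp == 0x00A0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
           cp == 0x3000;
}

// Decodes one code point from UTF-8 text starting at index i and advances i.
// Malformed or truncated sequences yield U+FFFD (hypothetical policy chosen
// for this sketch) so the scan always makes progress.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                  // plain ASCII
    int extra = 0;
    char32_t cp = 0;
    if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
    else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
    else return 0xFFFD;                            // stray continuation byte
    for (; extra > 0; --extra) {
        if (i >= s.size()) return 0xFFFD;          // sequence cut off
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;     // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    return cp;
}

// Counts maximal runs of non-whitespace code points in one line.
std::uint64_t count_words(const std::string& line) {
    std::uint64_t words = 0;
    bool in_word = false;
    for (std::size_t i = 0; i < line.size(); ) {
        if (is_unicode_space(next_code_point(line, i))) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;
            ++words;
        }
    }
    return words;
}
```

In the question's program, ct += count_words(line); would then replace the inner while (inputStream >> token) loop.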

Related

I/O ascii codes to foreign characters

Using cout << "\n\u00f3\n" << endl, I can print ó with newlines at the Unix command line. Once I start attempting to read files and print strings containing the characters, I see the literal output \n\u00f3\n instead.
I am not sure if this is because the file read techniques use character arrays or if there is some other nuance I do not know.
Any ideas?
Thanks!
const char *filename ="spanish_project_sample1.txt";
FILE *file = fopen(filename, "r");
int c;
char *data;
data = " ";
while ((c=fgetc(file)) != EOF) {
data = appendCharToCharArray(data, c);
}
printf("%s", data);

I looked at the JavaScript solutions to a similar problem (e.g. FromCharCode) and found this code online:
https://ideone.com/Udo3hN
#include <cstdarg>
#include <iostream>
using namespace std;
string FromCharCode ( int num, ... )
{
va_list arguments;
char ch;
string s;
va_start ( arguments, num );
for ( int x = 0; x < num; x++ )
{
ch = va_arg ( arguments, int );
s = s + ch;
}
va_end ( arguments );
return s;
}
int main()
{
cout<<FromCharCode (10,73,78,68,69,83,73,71,78,33,33) ;//<<endl;
return 0;
}
Specifically, it looks like reading in the characters is the issue because at runtime instead of reading '\n' as value 10 for example, the character array would actually record two ints [92,110].
Using a hardcoded string, the compiler parses the escaped characters as the desired values.
Any suggestions or solutions still welcome.

The C++ idiom for reading a file line by line is:
#include <fstream>
#include <iostream>
using namespace std;
int main(int argc, char **argv)
{
string line;
ifstream ifs;
ifs.open(argv[1]);
while(getline(ifs, line))
cout << line << endl;
}
Try that.
Your problem is probably one of interpretation, though. If the file contains the literal characters "\n\u00f3\n", that is what gets read and printed. If the file contains "ó" itself (the code unit 0x00F3 in UTF-16, or the two bytes 0xC3 0xB3 in UTF-8), you will get what you want. The I/O routines don't do any conversion.
You also need to know if your file is in UTF-8 or UTF-16 so that you can read it properly.
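If the file really does contain the literal backslash sequences, they have to be converted by hand after reading. Here is a minimal sketch, assuming only \n and \uXXXX (four hex digits, non-surrogate) need handling and that UTF-8 output is wanted; hex_value, append_utf8 and unescape are hypothetical helper names, not library functions.

```cpp
#include <cstddef>
#include <string>

// Returns the value of one hex digit, or -1 if c is not a hex digit.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Appends code point cp (assumed to be at most U+FFFF) to out as UTF-8.
void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Replaces the two-character sequence \n with a newline and \uXXXX with the
// UTF-8 encoding of that code point. Everything else is copied unchanged.
std::string unescape(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            if (in[i + 1] == 'n') {
                out += '\n';
                ++i;                       // skip the 'n'
                continue;
            }
            if (in[i + 1] == 'u' && i + 5 < in.size()) {
                unsigned long cp = 0;
                bool ok = true;
                for (std::size_t k = i + 2; k <= i + 5; ++k) {
                    const int v = hex_value(in[k]);
                    if (v < 0) { ok = false; break; }
                    cp = cp * 16 + static_cast<unsigned long>(v);
                }
                if (ok) {
                    append_utf8(out, cp);
                    i += 5;                // skip 'u' and the four digits
                    continue;
                }
            }
        }
        out += in[i];
    }
    return out;
}
```

With the line-reading loop above, cout << unescape(line) would print ó on a UTF-8 terminal for a line containing the six characters \u00f3.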

C++ Reading multiple lines from a file to a single string

I have an input file of dna.txt which looks like what is below. I'm trying to read all of the characters to a single string and I'm not allowed to use a character array. How would I go about doing this.
cggccgattgtattctgtatagaaaaacac
atacagatggattttaactagagc
aagtcgcaataaccagcgagtattaca
cctcgaccaaatcctcgaattctc

Try the following:
std::string dna;
std::string text_read;
while (getline(input_file, text_read))
{
dna += text_read;
}
In the above loop, each line is read into a separate variable.
After the line is read, then it is appended to the DNA string.
Edit 1: Example working program:
Note: on some platforms, there may be a '\r' in the buffer which causes portions to be overwritten when displayed.
#include <iostream>
#include <fstream>
#include <string>
int main()
{
std::ifstream input_file("./data.txt");
std::string dna;
std::string text_read;
while (std::getline(input_file, text_read))
{
const std::string::size_type position = text_read.find('\r');
if (position != std::string::npos)
{
text_read.erase(position);
}
dna += text_read;
}
std::cout << "As one long string:\n"
<< dna;
return 0;
}
Output:
$ ./dna.exe
As one long string:
cggccgattgtattctgtatagaaaaacacatacagatggattttaactagagcaagtcgcaataaccagcgagtattacacctcgaccaaatcctcgaattctc
The file "data.txt":
cggccgattgtattctgtatagaaaaacac
atacagatggattttaactagagc
aagtcgcaataaccagcgagtattaca
cctcgaccaaatcctcgaattctc
The program compiled using g++ version 5.3.0 on Cygwin terminal.
The issue was found by using the gdb debugger.

C++ XOR encryption - decryption issue

I followed a tutorial on the stephan-brumme website for XOR encryption (unfortunately I cannot include the URL because I do not have enough reputation). What I want to do is the following: read the content of the example.txt file and decrypt the text that it includes. For example, this is the content of example.txt:
\xe7\xfb\xe0\xe0\xe7
This, when decrypted using password "password" should return "hello". This is the code I got:
#include <string>
#include <iostream>
#include <fstream>
using namespace std;
std::string decode(const std::string& input)
{
const size_t passwordLength = 9;
static const char password[passwordLength] = "password";
std::string result = input;
for (size_t i = 0; i < input.length(); i++)
result[i] ^= ~password[i % passwordLength];
return result;
}
int main(int argc, char* argv[])
{
string line;
ifstream myfile ("example.txt");
if (myfile.is_open())
{
while ( getline (myfile,line) )
{
cout << decode(line);
}
myfile.close();
}
return 0;
}
And this is the result of running the application:
click for image
As you can see, the decryption was not successful. Now, if I make it so it doesn't read the .txt, but directly decrypts the text, like this:
cout << decode("\xe7\xfb\xe0\xe0\xe7");
It works perfectly:
click for image
What am I doing wrong here?
Many thanks in advance! :)

XOR of a character with the same value is zero, so the result may contain zero bytes. std::string itself can store them, but anything that treats the data as a zero-terminated C string will stop at the first one.
You also can use std::vector<char> instead of std::string for the actual encoding/decoding. You would have to change the decode function to handle vector<char>
And read/write the file in binary.
Edit: Using std::string only, with the same std::string decode(const std::string& input) as in the question. This also needs #include <iomanip> for std::setfill and std::setw.
int main()
{
std::string line = "hello";
{
line = decode(line);
std::ofstream myfile("example.txt", std::ios::binary);
myfile.write(line.data(), line.size());
//Edit 2 *************
//std::cout << std::hex;
//for (char c : line)
// std::cout << "\\x" << (0xff & c);
//*************
//This will make sure width is always 2
//For example, it will print "\x01\x02" instead of "\x1\x2"
std::cout << std::hex << std::setfill('0');
for (char c : line)
std::cout << "\\x" << std::setw(2) << (0xff & c);
std::cout << std::endl;
}
{
std::ifstream myfile("example.txt", std::ios::binary | std::ios::ate);
int filesize = (int)myfile.tellg();
line.resize(filesize);
myfile.seekg(0);
myfile.read(&line[0], filesize);
line = decode(line);
std::cout << line << std::endl;
}
return 0;
}

I bet example.txt contains the characters '\', 'x', 'e', '7' etc. You have to read those, process all the backslash escapes, and then feed it to decode.
\xe7 is a common way of representing a single character with hex value E7. (Which is quite likely to be the single character 'ç' depending on your character set). If you want to store (encrypted) readable text, I suggest dropping the \x, and having the file contain lines like "e7fbe0e0e7". Then
- read each line into a string.
- Convert each pair of characters from a hex number into an integer, and store the result in a char.
- Store that char in the string.
- Then xor decrypt the string.
Alternatively, ensure the file contains the actual binary characters you need it to.
Also beware that you are XOR-ing with the terminating nul byte of the password. Did you mean to do that?
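The hex-pair steps above can be sketched like this. hex_digit and hex_to_bytes are hypothetical helpers written for this example; decode is the question's function, kept unchanged (including the XOR with the terminating nul) so that the sample data still decrypts.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns the value of one hex digit, or -1 if c is not a hex digit.
int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Converts a line such as "e7fbe0e0e7" into the raw bytes E7 FB E0 E0 E7.
std::string hex_to_bytes(const std::string& hex) {
    if (hex.size() % 2 != 0)
        throw std::invalid_argument("odd number of hex digits");
    std::string bytes;
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        const int hi = hex_digit(hex[i]);
        const int lo = hex_digit(hex[i + 1]);
        if (hi < 0 || lo < 0)
            throw std::invalid_argument("not a hex digit");
        bytes += static_cast<char>(hi * 16 + lo);
    }
    return bytes;
}

// The question's decode(): XOR each byte with the complement of the
// corresponding password byte (the key includes the terminating nul).
std::string decode(const std::string& input) {
    const std::size_t passwordLength = 9;
    static const char password[passwordLength] = "password";
    std::string result = input;
    for (std::size_t i = 0; i < input.length(); i++)
        result[i] ^= ~password[i % passwordLength];
    return result;
}
```

Inside the question's getline loop, cout << decode(hex_to_bytes(line)); would then print hello for a file containing the line e7fbe0e0e7.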

Trying to return size of input file of c++ but receive an error when I convert the char variable to string

I am trying to count the characters in my program. Initially my variable "words" was a char and the file read just fine. When trying to determine the length of the variable, it wouldn't work with .length(). Can you explain how I can make my "words" variable as a string so that the words.length() executes correctly?
error on line words = readFile.get(); is:
no match for ‘operator!=’ in ‘words != -0x00000000000000001’
#include <iostream>
#include <cmath>
#include <fstream>
#include <cstdlib>
#include <string>
#include <stdio.h>
#include <math.h>
using namespace std;
int main() {
//buff array to hold char words in the input text file
string words;
//char words;
//read file
ifstream readFile("TextFile1.txt");
//notify user if the file didn't transfer into the system
if (!readFile)
cout <<"I am sorry but we could not process your file."<<endl;
//read and output the file
while (readFile)
{
words = readFile.get();
if(words!= EOF)
cout <<words;
}
cout << "The size of the file is: " << words.length() << " bytes. \n";
return 0;
}

char c;
while (readFile.get(c))
{
words.push_back(c);
}
Of course, if you were solely doing this to count the number of characters (and were intent on using std::istream::get) you'd probably be better off just doing this:
int NumChars = 0;
while (readFile.get() != EOF)
{
NumChars++;
}
Oh, and by the way, you might want to close the file after you're done with it.

You should read a reference first: try cppreference.com and look for std::istream::get.
I'm not sure what do you want, but if you wanna just count words, you can do something like this:
std::ifstream InFile(/*filename*/);
if(!InFile)
{
// file not found: handle the error here
}
std::string s;
int numWords = 0;
while(InFile >> s)
numWords++;
std::cout << numWords;
Or if you want to get to know how many characters are in file, change std::string s to char s and use std::ifstream::get instead:
std::ifstream InFile(/*filename*/);
if(!InFile)
{
// file not found: handle the error here
}
char s;
int numCharacters = 0;
while(InFile.get(s)) //this will read one character after another until EOF
numCharacters++;
std::cout << numCharacters;
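Note that the second approach counts bytes, not characters:
If the file is plain ASCII, numCharacters equals the number of characters;
If it is UTF-16 without surrogate pairs, the number of characters is numCharacters / 2; for UTF-8 the number of bytes per character varies.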

get() returns an int. To do what you're doing, you must check that int before appending it to "words", instead of comparing words against EOF, e.g.:
...
//read and output the file
while (readFile)
{
const int w = readFile.get();
if (w!= EOF) {
words += static_cast<char>(w);
cout << static_cast<char>(w);
}
}
...

Incorrect char from file

I have the following .txt file:
test.txt
1,2,5,6
Passing into a small C++ program I made through command line as follows:
./test test.txt
Source is as follows:
#include <iostream>
#include <fstream>
using namespace std;
int main(int argc, char **argv)
{
int temp =0;
ifstream file;
file.open(argv[1]);
while(!file.eof())
{
temp=file.get();
file.ignore(1,',');
cout<<temp<<' ';
}
return 0;
}
For some reason my output is not 1 2 5 6 but 49 50 53 54. What gives?
UPDATE:
Also, I noticed there is another implementation of get(). If I define char temp then I can do file.get(temp) and that will also save me converting ASCII representation. However I like using while (file >> temp) so I will be going with that. Thanks.

temp is an int. So you see the encoded ascii values after casting the char to an int.

49 is the ASCII code for the digit '1' (49 - 48 = 1).
get() gives you a character (character code).
By the way, eof() only becomes true after a failed read attempt, so the code you show,
while(!file.eof())
{
temp=file.get();
file.ignore(1,',');
cout<<temp<<' ';
}
will possibly display one extraneous character at the end.
the conventional loop is
while( file >> temp )
{
cout << temp << ' ';
}
where the expression file >> temp reads in one number and produces a reference to file, and where that file object is converted to bool as if you had written
while( !(file >> temp).fail() )

This does not do what you think it does:
while(!file.eof())
This is covered in Why is iostream::eof inside a loop condition considered wrong?, so I won't cover it in this answer.
Try:
char c;
while (file >> c)
{
// [...]
}
...instead. Reading in a char rather than an int will also save you having to convert the ASCII representation (ASCII value 49 is '1', etc.).

For the record, and despite this being the nth duplicate, here's how this code might look in idiomatic C++:
for (std::string line; std::getline(file, line); )
{
std::istringstream iss(line);
std::cout << "We read:";
for (std::string n; std::getline(iss, n, ','); )
{
std::cout << " " << n;
// now use e.g. std::stoi(n)
}
std::cout << "\n";
}
If you don't care about lines or just have one line, you can skip the outer loop.
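As a fuller sketch of the same idea, the nested getline loops can be wrapped in a function that returns the parsed numbers. parse_numbers is a hypothetical name; std::stoi throws std::invalid_argument on a field that is not a number, which this sketch does not catch.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads lines of comma-separated integers, e.g. "1,2,5,6", from any stream.
std::vector<int> parse_numbers(std::istream& in) {
    std::vector<int> numbers;
    for (std::string line; std::getline(in, line); ) {
        std::istringstream iss(line);
        for (std::string field; std::getline(iss, field, ','); ) {
            if (!field.empty())              // skip empty fields such as "1,,2"
                numbers.push_back(std::stoi(field));
        }
    }
    return numbers;
}
```

Passing the opened std::ifstream from the question to parse_numbers(file) should yield the values 1, 2, 5 and 6 for the sample test.txt.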
