conversion from stringstream to string removes '=' characters - c++

I am reading an XML file into a stringstream buffer in order to parse it using RapidXML. RapidXML is only parsing the names of the XML nodes, but none of their attribute names or values. After some experimentation, I discovered that the problem is not likely to be with RapidXML, but with conversion of the stringstream buffer to a string using std::string content(buffer.str());. The '=' characters that are so important to XML parsing are converted to ' ' (space characters), prior to any RapidXML processing.
The character replacement is evident in the console window when the cout << calls are made in the code below, which is before RapidXML gets its hands on the string.
My code is as follows:
#include <iostream>
#include <fstream>
#include <stdio.h>
#include <conio.h>
#include <string>
#include <stdlib.h>
#include <rapidxml.hpp>
#include <vector>
#include <sstream>
using namespace std;
using namespace rapidxml;
//... main() and so forth, all works fine...
ifstream file(names.at(i)); // names.at(i) works fine...
//...
file.read(fileData, fileSize); // works fine...
//...
// Create XML document object using RapidXML:
xml_document<> doc;
//...
std::stringstream buffer;
buffer << file.rdbuf();
// This is where everything looks okay (i.e., '=' shows up properly):
cout << "\n" << buffer.str() << "\n\nPress a key to continue...";
getchar();
file.close();
std::string content(buffer.str());
// This is where the '=' are replaced by ' ' (space characters):
cout << "\n" << content << "\n\nPress a key to continue...";
getchar();
// Parse XML:
doc.parse<0>(&content[0]);
// Presumably the lack of '=' is preventing RapidXML from parsing attribute
// names and values, which always follow '='...
Thanks in advance for your help.
p.s. I followed advice on using this technique for reading an entire XML file into a stringstream, converting it to a string, and then feeding the string to RapidXML from the following links (thanks to contributors of these pieces of advice, sorry I can't make them work yet...):
Automation Software's RapidXML mini-tutorial
...this method appears in many other places, so I won't list them all here. It seems sensible enough, and my errors seem to be unique. Could this be an ASCII vs. Unicode issue?
I also tried code from here:
Thomas Whitton's example converting a string buffer to a dynamic cstring
code snippet from the above:
// string to dynamic cstring
std::vector<char> stringCopy(xml.length() + 1, '\0'); // the extra byte keeps the terminating '\0' that rapidxml expects
std::copy(xml.begin(), xml.end(), stringCopy.begin());
char *cstr = &stringCopy[0];
rapidxml::xml_document<> parsedFromFile;
parsedFromFile.parse<0>(cstr);
...with a similar RapidXML failure to parse node attribute names and values. Note that I didn't dump the character vector stringCopy to the console to inspect it, but I am getting the same problem, which, to recap, is:
I am seeing correctly parsed names of XML tags after RapidXML parsing of the string fed to it for analysis.
There are no correctly parsed tag attribute names or values. These are dependent upon the '=' character showing up in the string to be parsed.

If you look closely, the = characters probably aren't being replaced by spaces but by zero bytes. If you look at the rapidxml documentation here:
http://rapidxml.sourceforge.net/manual.html#namespacerapidxml_1differences
it specifically states that it modifies the source text. This way it avoids allocating any new strings; instead it keeps pointers into the original source.
This part seems to be working correctly, so maybe the problem is in the rest of your code, where you try to read the attributes?
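For what it's worth, here is a minimal sketch of the usual read pattern (the sample XML string is made up; the important point is that content must stay alive for as long as you read names and values, because RapidXML hands back pointers into that buffer rather than copies):
#include <iostream>
#include <sstream>
#include <string>
#include <rapidxml.hpp>
int main()
{
    // stand-in for the stringstream filled from the file in the question
    std::stringstream buffer("<item id=\"42\" name=\"widget\"/>");
    std::string content(buffer.str());
    rapidxml::xml_document<> doc;
    doc.parse<0>(&content[0]);   // writes '\0' terminators into content
    rapidxml::xml_node<>* node = doc.first_node();
    std::cout << "node: " << node->name() << "\n";
    for (rapidxml::xml_attribute<>* attr = node->first_attribute();
         attr; attr = attr->next_attribute())
        std::cout << attr->name() << " = " << attr->value() << "\n";
}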

Related

C++ std::string::at()

I want to print the first letter of a string.
#include <iostream>
#include <string>
using namespace std;
int main() {
string str = "다람쥐 헌 쳇바퀴 돌고파.";
cout << str.at(0) << endl;
}
I want '다' to be printed like java, but '?' is printed.
How can I fix it?
That text you have in str -- how is it encoded?
Unfortunately, you need to know that to get the first "character". The std::string class only deals with bytes. How bytes turn into characters is a rather large topic.
The magic word you are probably looking for is UTF-8. See here for more information: How do I properly use std::string on UTF-8 in C++?
If you want to go down this road yourself, look here: Extract (first) UTF-8 character from a std::string
And if you're really interested, here's an hour-long video that is actually a great explanation of text encoding: https://www.youtube.com/watch?v=_mZBa3sqTrI
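For example, here is a minimal sketch that pulls out the first UTF-8 code point by inspecting the lead byte (it assumes the source file and the terminal are both UTF-8; str.at(0) only returns the first byte of the three bytes that encode '다', which is why a '?' shows up):
#include <iostream>
#include <string>
// byte length of the first UTF-8 code point in s (0 for an empty or malformed string)
std::size_t first_utf8_length(const std::string& s)
{
    if (s.empty()) return 0;
    unsigned char lead = static_cast<unsigned char>(s[0]);
    if (lead < 0x80) return 1;          // plain ASCII
    if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
    return 0;                           // continuation or invalid lead byte
}
int main()
{
    std::string str = "다람쥐 헌 쳇바퀴 돌고파.";
    std::cout << str.substr(0, first_utf8_length(str)) << std::endl;  // prints 다 on a UTF-8 terminal
}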

Reading from file without using string

I am doing a school project where we must not use std::string. How can I do this? In the txt file the data are separated with a ";", and we do not know the length of the words.
Example:
apple1;apple2;apple3
mango1;mango2;mango3
I tried a lot of things, but nothing worked; I always got errors.
I tried using getline, but since it takes a std::string it did not work.
I also tried to overload operator<<, but it did not help.
There are two entirely separate getline()'s. One is std::getline(), which takes a std::string as a parameter.
But there's also a member function in std::istream, which works with an array of chars instead of a std::string, eg:
#include <sstream>
#include <iostream>
int main() {
std::istringstream infile{"apple1;apple2;apple3"};
char buffer[256];
while (infile.getline(buffer, sizeof(buffer), ';'))
std::cout << buffer << "\n";
}
Result:
apple1
apple2
apple3
Note: while this fits the school prohibition against using std::string, there's almost no other situation where it makes sense.
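If the input really has several lines like the example, one way to handle both separators without std::string is to read each line into a char array and split it with strtok (a sketch only; fruits.txt is a made-up name for a file holding the sample data):
#include <cstring>
#include <fstream>
#include <iostream>
int main()
{
    std::ifstream infile("fruits.txt");           // hypothetical file with the sample lines
    char line[256];
    while (infile.getline(line, sizeof(line)))    // one text line per iteration
    {
        // strtok chops the line in place at every ';'
        for (char* tok = std::strtok(line, ";"); tok != nullptr; tok = std::strtok(nullptr, ";"))
            std::cout << tok << "\n";
    }
}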

Issues with Wide Characters in C++

I have a program that is meant to read in a text file of words (each on a separate line), and then print out a random word from that file. It also gives you the ability to select a non-English language (e.g., Greek or Russian). Because of the latter condition, I use std::wstring to capture the text. Here is the code:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/random_device.hpp>
#include <boost/random/uniform_int_distribution.hpp>
int main(int argc, char* argv[]) {
if (argc != 2) {
std::cout << "Usage: word [lang]" << std::endl;
std::cout << "\tlang: Choose from de,en,es,fr,gr,it,la,ru" << std::endl;
return EXIT_FAILURE;
}
std::string file = static_cast<std::string>("C:\\util_bin\\data\\words_") + static_cast<std::string>(argv[1]) + static_cast<std::string>(".txt");
std::wfstream fin(file, std::wifstream::in);
std::vector<std::wstring> data;
std::wstring line;
while (std::getline(fin, line))
data.push_back(line);
int size = data.size();
boost::random::random_device rd;
boost::random::mt19937 mt(rd());
boost::random::uniform_int_distribution<int> dist(0, size - 1);
std::wcout << data[dist(mt)] << std::endl;
}
This code compiles just fine, however when I run it with Russian (for instance), I just get garbage text:
C:\util_bin>word ru
������������
C:\util_bin>
I'm not all that familiar with the ins and outs of wide characters in C++, so I can't really discern what's going wrong. Anyone have any ideas?
I'm going to guess you're using Visual Studio. This is a quirk of the implementation of std::basic_filebuf in Windows. From the relevant MSDN page:
Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer. To store Unicode strings in the buffer, create a new buffer of type wchar_t and set it using the basic_streambuf::pubsetbuf() method.
As it was explained to me, the filebuf is implemented with a FILE*; there is an internal flag that performs the ANSI conversion whether you want it or not, and you can't clear the flag except by allocating and setting your own buffer (via pubsetbuf). Putting a codecvt in your locale won't do it, and it has to happen right after a successful file open. Really, infuriatingly intrusive. I wound up having to write a wrapper class (which wasn't so bad, as it also let me store the file name before opening).
You can also open the file with std::ios::binary. Some people recommend that you always do that. But opening the file that way probably means doing your own code conversions before inserting into a stream or extracting from it.
After you instantiate your wfstream object, call imbue on it like this:
fin.imbue( std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>) );
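Put together with the rest of the question's code, that looks roughly like this (a sketch only: std::locale::empty() is MSVC-specific, so a default-constructed locale is used instead, std::codecvt_utf8 from <codecvt> is deprecated in C++17 but still available, and the word lists are assumed to be stored as UTF-8):
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
#include <vector>
int main()
{
    std::wifstream fin("C:\\util_bin\\data\\words_ru.txt");  // path from the question
    // imbue before the first read so the UTF-8 bytes are converted to wchar_t
    fin.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    std::vector<std::wstring> data;
    std::wstring line;
    while (std::getline(fin, line))
        data.push_back(line);
    std::wcout << data.size() << L" lines read" << std::endl;
}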

Obtaining a certain section from a line in a file (C++)

I've spent a lot of time looking online to find an answer for this, but nothing was helping, so I figured I'd post my specific scenario. I have a .txt file (see below), and I am trying to write a routine that just finds a certain chunk of a certain line (e.g. I want to get the 5-digit number from the second column of the first line). The file opens fine and I'm able to read in the entire thing, but I just don't know how to get certain chunks from a line specifically. Any suggestions? (NOTE: These names and numbers are fictional...)
//main cpp file
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
ifstream fin;
fin.open("customers.txt");
return 0;
}
//customers.txt
100007 13153 09067.50 George F. Thompson
579489 21895 00565.48 Keith Y. Graham
711366 93468 04602.64 Isabel F. Anderson
Text parsing is not such a trivial thing to implement.
If your format won't change, you could parse it yourself: use random file access and regular expressions to extract the part of the stream you need, or read a fixed number of characters.
If you go the regex way, you'll need C++11 or a third-party library, like Boost or POCO.
If you control the format of the text file, you might also want to choose a standard way to structure your data, like XML, and use the facilities of that format to extract the information you want. POCO might help you there.
Here are some simple hints added to your code; you will need to complete it, but the missing pieces are easy to find on Stack Overflow.
//main cpp file
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;
void splitLine(const char* str, vector<string>& results){
// splits str and stores each value in results vector
}
int main()
{
ifstream fin;
fin.open("customers.txt");
char buffer[128];
if(fin.good()){
while(fin.getline(buffer, sizeof(buffer))){ // stops at end of file and never overruns buffer
cout << buffer << endl;
vector<string> results;
splitLine(buffer, results);
// now results MUST contain 4 strings, for each
// column in a line
}
}
return 0;
}
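One possible splitLine, shown only as a sketch (it splits on whitespace with an istringstream, which matches the sample file; it needs #include <sstream> added to the includes above):
void splitLine(const char* str, vector<string>& results){
    istringstream iss(str);
    string token;
    // every whitespace-separated field becomes one entry in results
    while(iss >> token)
        results.push_back(token);
}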
If the columns are separated by whitespace then the second column of the first row is simply the second token extracted from the stream.
std::ifstream input{"customers.txt"}; // Open file input stream.
std::istream_iterator<int> it{input}; // Create iterator to first token.
int number = *std::next(it); // Advance to next token and dereference.

Tokenizer for full-text

This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.
Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full text searching. Performance is very important, I will parse many gigabytes of text.
Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (afaik) and requires some heuristics. This has been done a thousand times before, and probably in a thousand different ways, but I can't even find one of them :)
Any good pointers?
Thanks!
The C++ String Toolkit Library (StrTk) has the following solution to your problem:
#include <iostream>
#include <string>
#include <deque>
#include "strtk.hpp"
int main()
{
std::deque<std::string> word_list;
strtk::for_each_line("data.txt",
[&word_list](const std::string& line)
{
const std::string delimiters = "\t\r\n ,,.;:'\""
"!##$%^&*_-=+`~/\\"
"()[]{}<>";
strtk::parse(line,delimiters,word_list);
});
std::cout << strtk::join(" ",word_list) << std::endl;
return 0;
}
More examples can be found Here
If performance is the main issue, you should probably stick to good old strtok, which is sure to be fast:
/* strtok example */
#include <stdio.h>
#include <string.h>
int main ()
{
char str[] ="- This, a sample string.";
char * pch;
printf ("Splitting string \"%s\" into tokens:\n",str);
pch = strtok (str," ,.-");
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok (NULL, " ,.-");
}
return 0;
}
A regular expression library might work well if your tokens aren't too difficult to parse.
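For instance, a sketch with std::regex from C++11 (it treats each run of word characters as one token; fine for trying things out, though for gigabytes of text it will probably be much slower than a hand-rolled scanner):
#include <iostream>
#include <regex>
#include <string>
int main()
{
    std::string text = "- This, a sample string.";
    std::regex word("[A-Za-z0-9_]+");   // one token = one run of word characters
    for (std::sregex_iterator it(text.begin(), text.end(), word), end; it != end; ++it)
        std::cout << it->str() << "\n";
}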
I wrote my own tokenizer as part of the open-source SWISH++ indexing and search engine.
There's also the ICU tokenizer that handles Unicode.
I might look into std::stringstream from <sstream>. C-style strtok has a number of usability problems, and C-style strings are just troublesome.
Here's an ultra-trivial example of it tokenizing a sentence into words:
#include <sstream>
#include <iostream>
#include <string>
int main(void)
{
std::stringstream sentence("This is a sentence with a bunch of words");
std::string word;
// extracting in the loop condition avoids printing a spurious empty token at the end
while (sentence >> word)
{
std::cout << "Got token: " << word << std::endl;
}
}
janks#phoenix:/tmp$ g++ tokenize.cc && ./a.out
Got token: This
Got token: is
Got token: a
Got token: sentence
Got token: with
Got token: a
Got token: bunch
Got token: of
Got token: words
The std::stringstream class is "bi-directional", in that it supports input and output. You'd probably want to do just one or the other, so you'd use std::istringstream or std::ostringstream.
The beauty of them is that they are also a std::istream and a std::ostream respectively, so you can use them as you'd use std::cin or std::cout, which are hopefully familiar to you.
Some might argue these classes are expensive to use; std::strstream from <strstream> is mostly the same thing, but built on top of cheaper C-style 0-terminated strings. It might be faster for you. But anyway, I wouldn't worry about performance right away. Get something working, and then benchmark it. Chances are you can get enough speed by simply writing well-written C++ that minimizes unnecessary object creation and destruction. If it's still not fast enough, then you can look elsewhere. These classes are probably fast enough, though. Your CPU can waste thousands of cycles in the amount of time it takes to read a block of data from a hard disk or network.
You can use the Ragel State Machine Compiler to create a tokenizer (or a lexical analyzer).
The generated code has no external dependencies.
I suggest you look at the clang.rl sample for a relevant example of the syntax and usage.
Well, I would begin by searching Boost and... hop: Boost.Tokenizer
The nice thing? By default it breaks on whitespace and punctuation, because it's meant for text, so you won't forget a symbol.
From the introduction:
#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>
int main(){
using namespace std;
using namespace boost;
string s = "This is, a test";
tokenizer<> tok(s);
for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
cout << *beg << "\n";
}
}
// prints
This
is
a
test
// note how the ',' and ' ' were nicely removed
And there are additional features:
it can escape characters
it is compatible with Iterators so you can use it with an istream directly... and thus with an ifstream
and a few options (like keeping empty tokens etc.; see the sketch below)
Check it out!
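For example, keeping empty tokens looks roughly like this (the input string and delimiter are made up for the example):
#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>
int main()
{
    std::string s = "one;;three";
    // drop ';' as a delimiter, keep no delimiters, and report empty fields instead of skipping them
    boost::char_separator<char> sep(";", "", boost::keep_empty_tokens);
    boost::tokenizer<boost::char_separator<char>> tok(s, sep);
    for (const std::string& t : tok)
        std::cout << "[" << t << "]\n";   // prints [one] [] [three]
}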