Vector push_back on duplicate strings with the help of delimiter - c++

I am trying to read the PATH environment variable and remove any duplicates in it using vector functionality such as sort, erase and unique. But as far as I can tell, the vector delimits each element by newline by default. When I get the path as C:\Program Files(x86)\..., it breaks at C:\Program. This is my code so far:
char *path = getenv("PATH");
char str[10012] = "";
strcpy(str,path);
string strr(str);
vector<string> vec;
stringstream ss(strr);
string s;
while(ss >> s)
{
vec.push_back(s);
}
sort(vec.begin(),vec.end());
vec.erase(unique(vec.begin(),vec.end()),vec.end());
for(unsigned i=0;i<vec.size();i++)
{
cout<<vec[i]<<endl;
}
Is it a delimiter problem? I need to push_back at every ; and then search for duplicates. Can anyone help me with this?

I would use a stringstream to chop it up, and then use a set to ensure there are no duplicates.
std::string p { std::getenv("PATH") }; // note: getenv returns a null pointer if PATH is unset
std::set<std::string> set;
std::stringstream ss { p };
std::string s;
while(std::getline(ss, s, ':')) //this might need to be ';' for windows
{
set.insert(s);
}
for(const auto& elem : set)
std::cout << elem << std::endl;
Should you need to use a vector for some reason, you'd want to sort it with std::sort then remove duplicates with std::unique then erase the slack with erase.
std::sort(begin(vec), end(vec));
auto it=std::unique(begin(vec), end(vec));
vec.erase(it, end(vec));
EDIT: link to docs
http://en.cppreference.com/w/cpp/container/set
http://en.cppreference.com/w/cpp/algorithm/unique
http://en.cppreference.com/w/cpp/algorithm/sort

For this task it is better to use std::set<std::string> which will eliminate duplicates automatically. To read in PATH, use strtok to split it into substrings.

You need to use a different delimiter (':' or ';' to split the directories from the PATH, depending on the system). For instance, you can have a look at the std::getline() function to replace your current while () / push_back loop. This function allows you to specify a custom delimiter and would be a drop-in replacement in your code.
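A minimal sketch of that replacement, assuming the path has already been copied into a std::string. The helper name unique_dirs is made up for illustration, and the delimiter is passed in because it differs between systems:

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split `path` on `delim`, then sort and drop duplicates.
std::vector<std::string> unique_dirs(const std::string& path, char delim) {
    std::vector<std::string> dirs;
    std::istringstream ss(path);
    std::string dir;
    // getline with a custom delimiter keeps spaces inside each entry
    while (std::getline(ss, dir, delim)) {
        if (!dir.empty()) dirs.push_back(dir);  // skip empty entries such as ";;"
    }
    std::sort(dirs.begin(), dirs.end());
    dirs.erase(std::unique(dirs.begin(), dirs.end()), dirs.end());
    return dirs;
}
```

Calling unique_dirs(path, ';') on Windows (or with ':' elsewhere) returns the directories sorted, so the original search order of PATH is not preserved.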

It isn't so much that std::vector<T> is delimiting anything but that the formatted input operator (operator>>()) for strings uses whitespace as delimiters. Others have already posted about using std::getline() and the like. There are two other approaches:
Change what is considered to be whitespace for the stream! The std::string input operator uses the stream's std::locale object to obtain a std::ctype<char> facet which can be replaced. The std::ctype<char> facet has functions to do character classification and it can be used to consider, e.g., the character ';' as a space. It is a bit involved but a more solid approach than the next one.
I don't think path components can include newlines, i.e., a simple approach could be to replace all semicolons by newlines before reading the components:
std::string path(std::getenv("PATH"));
std::replace(path.begin(), path.end(), ';', '\n');
std::istringstream pin(path);
std::istream_iterator<std::string> pbegin(pin), pend;
std::vector<std::string> vec(pbegin, pend);
This approach may have the problem that the PATH may contain components which contain spaces: these would be split into individual objects. You might want to replace spaces with another character (e.g., the now unused ';') and restore them to spaces at an appropriate point after reading.
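The first approach (replacing the std::ctype<char> facet) could look roughly like the sketch below. The names semicolon_ws and split_on_semicolons are made up for illustration; the facet marks ';' as whitespace and takes the space character out of that category, so components containing spaces stay in one piece:

```cpp
#include <cstddef>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: ';' counts as whitespace, ' ' does not.
struct semicolon_ws : std::ctype<char> {
    static const mask* make_table() {
        // start from the classic "C" classification table
        static std::vector<mask> table(classic_table(), classic_table() + table_size);
        table[';'] = space;  // ';' now separates tokens
        table[' '] = print;  // ' ' becomes an ordinary printable character
        return table.data();
    }
    explicit semicolon_ws(std::size_t refs = 0)
        : std::ctype<char>(make_table(), false, refs) {}
};

// Hypothetical helper: read ';'-separated components with operator>>.
std::vector<std::string> split_on_semicolons(const std::string& path) {
    std::istringstream in(path);
    in.imbue(std::locale(in.getloc(), new semicolon_ws));  // the locale owns the facet
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>());
}
```

Newlines and tabs are still treated as whitespace here, which should be harmless for a PATH value.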

Related

C++ appending to vector of strings efficiently (and idiomatically)

If I want to fill a vector of strings with lines from a file in C++, is it a good idea to use push_back with std::move?
{
std::ifstream file("E:\\Temp\\test.txt");
std::vector<std::string> strings;
// read
while (!file.eof())
{
std::string s;
std::getline(file, s);
strings.push_back(std::move(s));
}
// dump to cout
for (const auto &s : strings)
std::cout << s << std::endl;
}
Or is there some other variant where I would simply append a new string instance to the vector and get its reference?
E.g. I can do
std::vector<std::string> strings;
strings.push_back("");
string &s = strings.back();
But I feel like there must be a better way, e.g.
// this doesn't exist
std::vector<std::string> strings;
string & s = strings.create_and_push_back();
// s is now a reference to the last item in the vector,
// no copying needed
Except for the eof misuse, this is pretty much the idiomatic way to do it, yes.
Below is the correct code:
std::string s;
while(std::getline(file, s))
{
strings.push_back(std::move(s));
s.clear();
}
Note the explicit s.clear() call: the only guarantee you have for a moved-from object std::string is that you can call member functions with no prerequisites, so clearing a string should reset it to a "fresh" state, as the move is not guaranteed to do anything to the object, and you can't rely on getline not doing anything weird.
There are some other ways to spell this out (you can probably achieve something similar with istream_iterator and a proper whitespace setting), but I do think this is the clearest.
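Wrapped into a function, the loop might look like the following sketch. The name read_lines is made up for illustration, and taking a std::istream& means the same code works for a std::ifstream or an in-memory stream:

```cpp
#include <sstream>   // only needed if you feed it a std::istringstream
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collect every line of `in` into a vector.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {       // loop ends on the first failed read
        lines.push_back(std::move(line));  // hand the buffer over to the vector
        line.clear();                      // put the moved-from string in a known state
    }
    return lines;
}
```

Empty lines are kept as empty strings, so the vector has one element per line in the input.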

Reading in from a .tsv file

I'm trying to read in information from a tab separated value file with the format:
<string> <int> <string>
Example:
Seaking 119 Azumao
Mr. Mime 122 Barrierd
Weedle 13 Beedle
This is currently how I'm doing it:
string americanName;
int pokedexNumber;
string japaneseName;
inFile >> americanName;
inFile >> pokedexNumber;
inFile >> japaneseName;
My issue stems from the space in the "Mr. Mime" as the strings can contain spaces.
I would like to know how to read the file in properly.
The standard library uses locales to classify characters and to handle other locale-dependent things according to your system locale. Standard streams use that classification to decide what counts as whitespace.
You can use this fact to control the meaning of ' ' in your case:
#include <iostream>
#include <locale>
#include <algorithm>
#include <string>
struct tsv_ws : std::ctype<char>
{
mask t[table_size]; // classification table, stores category for each character
tsv_ws() : ctype(t) // ctype will use our table to check character type
{
// copy all default values to our table;
std::copy_n(classic_table(), table_size, t);
// here we tell, that ' ' is a punctuation, but not a space :)
t[' '] = punct;
}
};
int main() {
std::string s;
std::cin.imbue(std::locale(std::cin.getloc(), new tsv_ws)); // using our locale, will work for any stream
while (std::cin >> s) {
std::cout << "read: '" << s << "'\n";
}
}
Here we make ' ' a punctuation symbol, but not a space symbol, so streams don't consider it a separator anymore. The exact category isn't important, but it mustn't be space.
That's quite a powerful technique. For example, you could redefine ',' to be a space to read in CSV format.
You can use std::getline to extract strings with non-tab whitespace.
std::getline(inFile, americanName, '\t'); // read up to first tab
inFile >> pokedexNumber >> std::ws; // read number then second tab
std::getline(inFile, japaneseName); // read up to first newline
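Put into a loop over all rows, those three reads could be sketched as below. The Entry struct and the read_tsv name are made up for illustration, and the sketch assumes every row has exactly three tab-separated fields with a non-empty last field:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for one row of the file.
struct Entry {
    std::string americanName;
    int pokedexNumber;
    std::string japaneseName;
};

// Hypothetical helper: read rows until any of the three reads fails.
std::vector<Entry> read_tsv(std::istream& in) {
    std::vector<Entry> rows;
    Entry e;
    while (std::getline(in, e.americanName, '\t')   // up to the first tab
           && (in >> e.pokedexNumber >> std::ws)    // number, then skip the second tab
           && std::getline(in, e.japaneseName)) {   // rest of the line
        rows.push_back(e);
    }
    return rows;
}
```

Because the first field is read with getline rather than operator>>, the space in "Mr. Mime" stays part of the name.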
It seems you want to read CSV data or, in your case, TSV data. Let's stick to the common term "CSV". This is a standard task and I will give you detailed explanations. In the end all the reading will be done in a one-liner.
I would recommend using a "modern" C++ approach.
When searching for "reading csv data", people still link to How can I read and parse CSV files in C++?. That question is from 2009 and now over 10 years old. Most answers are also old and very complicated. So maybe it's time for a change.
In modern C++ you have algorithms that iterate over ranges. You will often see something like "someAlgorithm(container.begin(), container.end(), someLambda)". The idea is that we iterate over some similar elements.
In your case we iterate over tokens in your input string, and create substrings. This is called tokenizing.
And for exactly that purpose, we have the std::sregex_token_iterator. And because we have something that has been defined for such purpose, we should use it.
This thing is an iterator for iterating over a string, hence sregex. The begin and end parts define the range of input to operate on, then there is a std::regex for what should or should not be matched in the input string. The matching strategy is given by the last parameter:
0 --> give me what the regex matched (a positive n selects the n-th capture group) and
-1 --> give me what is NOT matched by the regex.
So, now that we understand the iterator, we can std::copy the tokens from the iterator to our target, a std::vector of std::string. And since we do not know how many columns we have, we will use the std::back_inserter as a target. This will take all tokens that we get from the std::sregex_token_iterator and append them to our std::vector<std::string>. It doesn't matter how many columns we have.
Good. Such a statement could look like
std::copy( // We want to copy something
std::sregex_token_iterator // The iterator begin, the sregex_token_iterator. Give back first token
(
line.begin(), // Evaluate the input string from the beginning
line.end(), // to the end
re, // The regex that matches the delimiter
-1 // But give me back not the delimiter but everything else
),
std::sregex_token_iterator(), // iterator end for sregex_token_iterator, last token + 1
std::back_inserter(cp.columns) // Append everything to the target container
);
Now we can understand, how this copy operation works.
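As a self-contained illustration of that single statement, here is a sketch that wraps it in a function. The name split_line is made up for illustration, and the delimiter regex is passed in by the caller:

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the parts of `line` that lie between matches of `re`.
std::vector<std::string> split_line(const std::string& line, const std::regex& re) {
    std::vector<std::string> columns;
    std::copy(std::sregex_token_iterator(line.begin(), line.end(), re, -1),  // -1: the unmatched parts
              std::sregex_token_iterator(),                                  // end-of-sequence iterator
              std::back_inserter(columns));
    return columns;
}
```

For a tab-separated row, split_line(line, std::regex("\t")) yields one string per column, however many columns there are.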
Next step. We want to read from a file. The file also contains data of a repeating kind: the rows.
As above, we can iterate over similar data, whether it comes from a file or from anywhere else. For this purpose C++ has the std::istream_iterator. This is a template: as a template parameter it gets the type of data that it should read and, as a constructor parameter, it gets a reference to an input stream. It doesn't matter whether the input stream is std::cin, a std::ifstream or a std::istringstream. The behaviour is identical for all kinds of streams.
And since we do not have files on SO, I use (in the example below) a std::istringstream to hold the input CSV data. But of course you can open a file by defining a std::ifstream testCsv(filename). No problem.
And with std::istream_iterator, we iterate over the input and read similar data. In our case one problem is that we want to iterate over special data and not over some built-in data type.
To solve this, we define a Proxy class, which does the internal work for us (we do not want to know how, that should be encapsulated in the proxy). In the proxy we overload the type cast operator, to convert the result to the type we expect from the std::istream_iterator.
And the last important step. A std::vector has a range constructor. It has also a lot of other constructors that we can use in the definition of a variable of type std::vector. But for our purposes this constructor fits best.
So we define a variable csv and use its range constructor and give it a begin of a range and an end of a range. And, in our specific example, we use the begin and end iterator of std::istream_iterator.
If we combine all the above, reading the complete CSV file is a one-liner, it is the definition of a variable with calling its constructor.
Please see the resulting code:
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <algorithm>
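// Tabs are written as explicit \t escapes so they cannot get lost when the code is copied
std::istringstream testCsv{ "Seaking\t119\tAzumao\n"
"Mr. Mime\t122\tBarrierd\n"
"Weedle\t13\tBeedle" };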
// Define Alias for easier Reading
using Columns = std::vector<std::string>;
using CSV = std::vector<Columns>;
// Proxy for the input Iterator
struct ColumnProxy {
// Overload extractor. Read a complete line
friend std::istream& operator>>(std::istream& is, ColumnProxy& cp) {
// Read a line
std::string line; cp.columns.clear();
if (std::getline(is, line)) {
// The delimiter
const std::regex re("\t");
// Split values and copy into resulting vector
std::copy(std::sregex_token_iterator(line.begin(), line.end(), re, -1),
std::sregex_token_iterator(),
std::back_inserter(cp.columns));
}
return is;
}
// Type cast operator overload. Cast the type 'Columns' to std::vector<std::string>
operator std::vector<std::string>() const { return columns; }
protected:
// Temporary to hold the read vector
Columns columns{};
};
int main()
{
// Define variable CSV with its range constructor. Read complete CSV in this statement, So, one liner
CSV csv{ std::istream_iterator<ColumnProxy>(testCsv), std::istream_iterator<ColumnProxy>() };
// Print result. Go through all lines and then copy line elements to std::cout
std::for_each(csv.begin(), csv.end(), [](Columns & c) {
std::copy(c.begin(), c.end(), std::ostream_iterator<std::string>(std::cout, " ")); std::cout << "\n"; });
}
I hope the explanation was detailed enough to give you an idea, what you can do with modern C++.
This example basically does not care how many rows and columns are in the source text file. It will eat everything.

Parsing key/value pairs from a string in C++

I'm working in C++11, no Boost. I have a function that takes as input a std::string that contains a series of key-value pairs, delimited with semicolons, and returns an object constructed from the input. All keys are required, but may be in any order.
Here is an example input string:
Top=0;Bottom=6;Name=Foo;
Here's another:
Name=Bar;Bottom=20;Top=10;
There is a corresponding concrete struct:
struct S
{
const uint8_t top;
const uint8_t bottom;
const string name;
};
I've implemented the function by repeatedly running a regular expression on the input string, once per member of S, and assigning the captured group of each to the relevant member of S, but this smells wrong. What's the best way to handle this sort of parsing?
For an easy readable solution, you can e.g. use std::regex_token_iterator and a sorted container to distinguish the attribute value pairs (alternatively use an unsorted container and std::sort).
std::regex r{R"([^;]+;)"};
std::set<std::string> tokens{std::sregex_token_iterator{std::begin(s), std::end(s), r}, std::sregex_token_iterator{}};
Now the attribute value strings are sorted lexicographically in the set tokens, i.e. the first is Bottom, then Name and last Top.
Lastly use a simple std::string::find and std::string::substr to extract the desired parts of the string.
Do you care about performance or readability? If readability is good enough, then pick your favorite version of split from this question and away we go:
std::map<std::string, std::string> tag_map;
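for (const std::string& tag : split(input, ';')) {
auto key_val = split(tag, '='); // split each tag, not the whole input
tag_map.insert(std::make_pair(key_val[0], key_val[1]));
}
// keys are case-sensitive and must match the input ("Top", "Bottom", "Name")
S s{static_cast<uint8_t>(std::stoi(tag_map["Top"])),
static_cast<uint8_t>(std::stoi(tag_map["Bottom"])),
tag_map["Name"]};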
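Since split is left to the reader, here is a self-contained sketch of the same idea that uses only std::getline and std::string::find. The names parse_pairs and make_s are made up for illustration, and the sketch assumes values fit into a uint8_t and contain no ';' or '=':

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

struct S {
    const std::uint8_t top;
    const std::uint8_t bottom;
    const std::string name;
};

// Hypothetical helper: turn "Key=Value;Key=Value;" into a map.
std::map<std::string, std::string> parse_pairs(const std::string& input) {
    std::map<std::string, std::string> pairs;
    std::istringstream ss(input);
    std::string item;
    while (std::getline(ss, item, ';')) {
        const std::string::size_type eq = item.find('=');
        if (eq == std::string::npos) continue;  // ignore items without '='
        pairs[item.substr(0, eq)] = item.substr(eq + 1);
    }
    return pairs;
}

// Hypothetical helper: build an S; map::at throws std::out_of_range if a key is missing.
S make_s(const std::string& input) {
    const std::map<std::string, std::string> pairs = parse_pairs(input);
    return S{static_cast<std::uint8_t>(std::stoi(pairs.at("Top"))),
             static_cast<std::uint8_t>(std::stoi(pairs.at("Bottom"))),
             pairs.at("Name")};
}
```

Using at() instead of operator[] makes a missing required key an error rather than a silently empty value, and the order of the pairs in the input does not matter.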

C++: storing CSV in container

I have a std::string that contains comma separated values. I need to store those values in some suitable container, e.g. an array, a vector or some other container. Is there any built-in function through which I could do this, or do I need to write custom code for it?
If you're willing and able to use the Boost libraries, Boost Tokenizer would work really well for this task.
That would look like:
std::string str = "some,comma,separated,words";
typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
boost::char_separator<char> sep(",");
tokenizer tokens(str, sep);
std::vector<std::string> vec(tokens.begin(), tokens.end());
You basically need to tokenize the string using , as the delimiter. This earlier Stack Overflow thread should help you with it.
Here is another relevant post.
I don't think there is anything for this in the standard library. I would approach it like this:
Tokenize the string on the , delimiter using strtok.
Convert it to integer using atoi function.
push_back the value to the vector.
If you are comfortable with boost library, check this thread.
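The three steps above can be sketched as follows, assuming the values are integers. The name parse_ints is made up for illustration. std::strtok modifies the buffer it scans, so the sketch works on a writable copy of the string:

```cpp
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: parse "1,22,333" into {1, 22, 333}.
std::vector<int> parse_ints(const std::string& csv) {
    // strtok writes '\0' into its input, so copy into a mutable, terminated buffer
    std::vector<char> buf(csv.begin(), csv.end());
    buf.push_back('\0');

    std::vector<int> values;
    for (char* tok = std::strtok(buf.data(), ","); tok != nullptr;
         tok = std::strtok(nullptr, ",")) {
        values.push_back(std::atoi(tok));  // atoi returns 0 for non-numeric text
    }
    return values;
}
```

Note that std::strtok skips empty fields (",,") and keeps hidden static state, so it is not a good fit for multithreaded code.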
Using AXE parser generator you can easily parse your csv string, e.g.
std::string input = "aaa,bbb,ccc,ddd";
std::vector<std::string> v; // your strings get here
auto value = *(r_any() - ',') >> r_push_back(v); // rule for single value
auto csv = *(value & ',') & value & r_end(); // rule for csv string
csv(input.begin(), input.end());
Disclaimer: I didn't test the code above, it might have some superficial errors.

sequentially reading a text file in C++

In C++, I want to sequentially read words from a text file and store each word in an array. After that, I will perform some operations on this array. But I do not know how to handle the first phase: sequentially reading words from a text file and storing each word in an array.
I should skip those punctuations, which include ".", ",", "?"
You need to use streams for this. Take a look at the examples here:
Input/Output with files
This sounds like homework. If it is, please be forthright.
First of all, it's almost always a bad idea in C++ to use a raw array -- using a vector is a much better idea. As for your question about punctuation -- that's up to your customer, but my inclination is to separate on whitespace.
Anyway, here's an easy way to do it that takes advantage of operator>>(istream&, string&) separating on whitespace by default.
ifstream infile("/path/to/file.txt");
vector<string> words;
copy(istream_iterator<string>(infile), istream_iterator<string>(), back_inserter(words));
Here's a complete program that reads words from a file named "filename", stores them in a std::vector and removes punctuation from the words.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
struct is_punct {
bool operator()(char c) const {
static const std::string punct(",.:;!?");
return punct.find(c) != std::string::npos;
}
};
int main(int argc, char* argv[])
{
std::ifstream in("filename");
std::vector<std::string> vec((std::istream_iterator<std::string>(in)),
std::istream_iterator<std::string>());
std::transform(vec.begin(), vec.end(),
vec.begin(),
[](std::string s) {
s.erase(std::remove_if(s.begin(), s.end(), is_punct()),
s.end());
return s;
});
// manipulate vec
}
Do you know how many words you'll be reading? If not, you'll need to grow the array as you read more and more words. The easiest way to do that is to use a standard container that does it for you: std::vector. Reading words separated by whitespace is easy as it's the default behavior of std::ifstream::operator>>. Removing punctuation marks requires some extra work, so we'll get to that later.
The basic workflow for reading words from a file goes like this:
#include <fstream>
#include <string>
#include <vector>
int main()
{
std::vector<std::string> words;
std::string w;
std::ifstream file("words.txt"); // opens the file for reading
while (file >> w) // read one word from the file, stops at end-of-file
{
// do some work here to remove punctuation marks
words.push_back(w);
}
return 0;
}
Assuming you're doing homework here, the real key is learning how to remove the punctuation marks from w before adding it to the vector. I would look into the following concepts to help you:
The erase-remove idiom. Note that a std::string behaves like a container of char.
std::remove_if
The ispunct function in the cctype library
Feel free to post more questions if you run into trouble.
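For reference, those three hints combine into something like the sketch below. The names strip_punct and read_words are made up for illustration, and std::ispunct follows the current C locale, so it strips more than just ".", "," and "?":

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: erase-remove idiom applied to one word.
std::string strip_punct(std::string w) {
    w.erase(std::remove_if(w.begin(), w.end(),
                           // cast through unsigned char: ispunct is undefined for negative values
                           [](unsigned char c) { return std::ispunct(c) != 0; }),
            w.end());
    return w;
}

// Hypothetical helper: read whitespace-separated words and clean each one.
std::vector<std::string> read_words(std::istream& in) {
    std::vector<std::string> words;
    std::string w;
    while (in >> w) {
        w = strip_punct(w);
        if (!w.empty()) words.push_back(w);  // drop tokens that were only punctuation
    }
    return words;
}
```

Passing a std::ifstream to read_words works the same way as the in-memory stream used for testing.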
Yet another possibility, using my usual approach of a special facet:
class my_ctype : public std::ctype<char> {
public:
static mask const *get_table() { // static: it is called before the object is constructed
// this copies the "classic" table used by <ctype.h>:
static std::vector<std::ctype<char>::mask>
table(classic_table(), classic_table()+table_size);
// Anything we want to separate tokens, we mark its spot in the table as 'space'.
table[','] = (mask)space;
table['.'] = (mask)space;
table['?'] = (mask)space;
// and return a pointer to the table:
return &table[0];
}
my_ctype(size_t refs=0) : std::ctype<char>(get_table(), false, refs) { }
};
Using that, reading the words is pretty easy:
int main(int argc, char **argv) {
std::ifstream infile(argv[1]); // open the file.
infile.imbue(std::locale(std::locale(), new my_ctype())); // use our classifier
// Create a vector containing the words from the file:
std::vector<std::string> words(
(std::istream_iterator<std::string>(infile)),
std::istream_iterator<std::string>());
// and now we're ready to process the words in the vector
// though it might be worth considering using `std::transform`, to take
// the input from the file and process it directly.