Linewise lexicographic sort of a file with a catch - c++

Although I am not happy with the title of this question and this might be an odd question; bear with me, please.
So I have text files with content as follows:
& AAABBAB
this
& AAAAAAB
is
& BCAAAA
an
& BBBBBA
example
& BABABAB
text
where every other line starts with an identifier ('&'). Lines with said identifier should be lexicographically sorted, but I need it in a way such that the next line is dragged along to the new position in the output file with it.
This is what I am hoping to be the content of the output file.
& AAAAAAB
is
& AAABBAB
this
& BABABAB
text
& BBBBBA
example
& BCAAAA
an
With this, I can get the file content line-by-line:
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
using namespace std;
int main()
{
ifstream is("test.txt");
string str;
while(getline(is, str))
{
cout<<str<<endl;
}
return 0;
}
Is there an easy way to accomplish what I am looking for? Thanks for your help!

I'd bundle the pairs together while reading, making them easy to sort:
vector<pair<string, string>> vec; // first is identifier
vec.reserve(1000);
bool first = true;
while(getline(is, str))
{
if (first)
vec.emplace_back(str, string());
else
vec.back().second = str;
first = !first;
}
sort(vec.begin(), vec.end());

You can gather your lines by pairs into a vector of std::pair<std::string, std::string> :
using line_t = std::pair<std::string, std::string>;
std::vector<line_t> lines;
line_t pair_line;
while (std::getline(is, pair_line.first) &&
std::getline(is, pair_line.second)) {
lines.push_back(pair_line);
}
and sort them by their .first:
std::sort(begin(lines), end(lines),
[](auto const &l1, auto const &l2)
{ return l1.first < l2.first; });
DEMO

Yes, there is.
View the entire file as a map of key and value pairs, read into a std::map<std::string,std::string>, then output the map. Since string compares are lexicographic by default and maps have ordered keys, the map will do the sorting for you.

Here's a take that works nicely if you have a file that's too big to fit in memory, or, in general you need the efficiency.
It combines
a memory map¹
string views²
standard algorithms
Live On Coliru
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_view.hpp>
#include <deque>
namespace io = boost::iostreams;
using boost::string_view;
auto map_entries(string_view input) {
std::deque<string_view> pairs;
while (!input.empty()) {
size_t pos = input.find('\n');
if (pos != string_view::npos)
pos = input.find('\n', pos + 1);
if (pos != string_view::npos)
pairs.push_back(input.substr(0, pos));
input.remove_prefix(pos + 1); // safe with unsigned wrap around
}
return pairs;
}
#include <iostream>
int main() {
io::mapped_file_source file("input.txt");
auto data = map_entries({ file.data(), file.size() });
std::stable_sort(data.begin(), data.end());
for (auto entry : data)
std::cout << entry << "\n";
}
Prints
& AAAAAAB
is
& AAABBAB
this
& BABABAB
text
& BBBBBA
example
& BCAAAA
an
¹ it's trivial to use POSIX mmap instead of the boost thing there
² you can use std::[experimental::]string_view if your compiler/library is recent enough

Related

How to get a word vector from a string?

I want to store words separated by spaces into single string elements in a vector.
The input is a string that may end or may not end in a symbol( comma, period, etc.)
All symbols will be separated by spaces too.
I created this function but it doesn't return me a vector of words.
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (size_t character = 0; character < sentence.size(); ++character)
{
if (sentence[character] == ' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
return word_vector;
}
What did I do wrong?
Your problem has already been resolved by answers and comments.
I would like to give you the additional information that such functionality is already existing in C++.
You could take advantage of the fact that the extractor operator extracts space separated tokens from a stream. Because a std::string is not a stream, we can put the string first into an std::istringstream and then extract from this stream vie the std:::istream_iterator.
We could life make even more easier.
Since roundabout 10 years we have a dedicated, special C++ functionality for splitting strings into tokens, explicitely designed for this purpose. The std::sregex_token_iterator. And because we have such a dedicated function, we should simply use it.
The idea behind it is the iterator concept. In C++ we have many containers and always iterators, to iterate over the similar elements in these containers. And a string, with similar elements (tokens), separated by a delimiter, can also be seen as such a container. And with the std::sregex:token_iterator, we can iterate over the elements/tokens/substrings of the string, splitting it up effectively.
This iterator is very powerfull and you can do really much much more fancy stuff with it. But that is too much for here. Important is that splitting up a string into tokens is a one-liner. For example a variable definition using a range constructor for iterating over the tokens.
See some examples below:
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>
#include <regex>
const std::regex delimiter{ " " };
const std::regex reWord{ "(\\w+)" };
int main() {
// Some debug print function
auto print = [](const std::vector<std::string>& sv) -> void {
std::copy(sv.begin(), sv.end(), std::ostream_iterator<std::string>(std::cout, "\n")); std::cout << "\n"; };
// The test string
std::string test{ "word1 word2 word3 word4." };
//-----------------------------------------------------------------------------------------
// Solution 1: use istringstream and then extract from there
std::istringstream iss1(test);
// Define a vector (CTAD), use its range constructor and, the std::istream_iterator as iterator
std::vector words1(std::istream_iterator<std::string>(iss1), {});
print(words1); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 2: directly use dedicated function sregex_token iterator
std::vector<std::string> words2(std::sregex_token_iterator(test.begin(), test.end(), delimiter, -1), {});
print(words2); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 3: directly use dedicated function sregex_token iterator and look for words only
std::vector<std::string> words3(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {});
print(words3); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 4: Use such iterator in an algorithm, to copy data to a vector
std::vector<std::string> words4{};
std::copy(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {}, std::back_inserter(words4));
print(words4); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 5: Use such iterator in an algorithm for direct output
std::copy(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {}, std::ostream_iterator<std::string>(std::cout,"\n"));
return 0;
}
You added the index instead of the character:
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (size_t i = 0; i < sentence.size(); ++i)
{
char character = sentence[i];
if (character == ' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
return word_vector;
}
Since your mistake was only due to the reason, that you named your iterator variable character even though it is actually not a character, but rather an iterator or index, I would like to suggest to use a ranged-base loop here, since it avoids this kind of confusion. The clean solution is obviously to do what #ArminMontigny said, but I assume you are prohibited to use stringstreams. The code would look like this:
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (char& character: sentence) // Now `character` is actually a character.
{
if (character==' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
word_vector.push_back(result_word); // In your solution, you forgot to push the last word into the vector.
return word_vector;
}
int main() {
string sentence="Maybe try range based loops";
vector<string> result= single_words(sentence);
for(string& word: result)
cout<<word<<" ";
return 0;
}

C++ Writing a column of a CSV file into a vector

I have a CSV file with a bunch of columns, but I only need the information for the 11th column. How do I read through each line and skip to the 11th column in each line? I'm struggling to find clear information on how to read files in c++. This is what I have so far:
#include<iostream>
#include<fstream>
#include<vector>
#include <sstream>
#include <string>
std::string readStock(std::string fileName){
std::vector<std::string> ticker; //create vector
std::ifstream f(fileName, std::ios::in|std::ios:: binary|std::ios::ate);
std::string finalString = "";
if(f.is_open()){
std::string str;
std::getline(f,str); //skip the first row
while(std::getline(f,str)){ //read each line
std::istringstream s(str); //stringstream to parse csv
std::string val; //string to hold value
for(int i=1;i<=10;++i){ //skips everything until we get to the
column that we want
while(std::getline(s,val, ',')){
}
std::getline(s,val,',');
ticker.push_back(val);
}
f.close();
finalString = ticker.front();
}
}
else{
finalString="Could not open the file properly.";
}
return finalString;
}
int main(){
std::string st;
st=readStock("pr.csv");
std::cout<<st<<std::endl;
return 0;
}
There is a very simple solution for your problem.
You define a proxy class that reads one complete line, splits it into ALL tokens, using the dedicated functionality of the std::regex_token_iterator and then extracts the 11th element.
Using this proxy mechanism, you can use the std::istream_iterator to read the complete file, column 11, into a std::vector. For that we use the range constructor of the std::vector.
The result is a simple and short one-liner.
Please see:
#include <string>
#include <iostream>
#include <vector>
#include <fstream>
#include <regex>
#include <iterator>
#include <algorithm>
std::regex delimiter{ "," };
constexpr size_t targetColumn = 10U; // Target column is eleven
struct String11 { // Proxy for the input Iterator
// Overload extractor. Read a complete line
friend std::istream& operator>>(std::istream& is, String11& s11) {
// Read a complete line
if (std::string line{}; std::getline(is, line)) {
// Split it into tokens
std::vector token(std::sregex_token_iterator(line.begin(), line.end(), delimiter, -1), {});
// We only need one column
if (targetColumn < token.size()) {
// Get column 11
s11.result = token[targetColumn];
}
}
return is;
}
// Cast the type 'String11' to std::string
operator std::string() const { return result; }
// Temporary to hold the resulting string
std::string result{};
};
int main() {
// Open CSV fíle
if (std::ifstream csvFile{ "pr.csv" }; csvFile) {
// Read complete CSV file and get column 11 of each line
std::vector col11(std::istream_iterator<String11>(csvFile), {});
// Show output. Show all columns 11
std::copy(col11.begin(), col11.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
}
return 0;
}
EDIT:
For having output with doubles.
We just change one line in the cast operator in the proxy. That's all.
Even in main, there is no change in the read operatrion necessary. Through CTAD, the vector will be of type double.
Please see:
#include <string>
#include <iostream>
#include <vector>
#include <fstream>
#include <regex>
#include <iterator>
#include <algorithm>
std::regex delimiter{ "," };
constexpr size_t targetColumn = 10U; // Target column is eleven
struct String11 { // Proxy for the input Iterator
// Overload extractor. Read a complete line
friend std::istream& operator>>(std::istream& is, String11& s11) {
// Read a complete line
if (std::string line{}; std::getline(is, line)) {
// Split it into tokens
std::vector token(std::sregex_token_iterator(line.begin(), line.end(), delimiter, -1), {});
// We only need one column
if (targetColumn < token.size()) {
// Get column 11
s11.result = token[targetColumn];
}
}
return is;
}
// Cast the type 'String11' to double
operator double() const { return std::stod(result); }
// Temporary to hold the resulting string
std::string result{};
};
int main() {
// Open CSV fíle
if (std::ifstream csvFile{ "r:\\pr.csv" }; csvFile) {
// Read complete CSV file and get column 11 of each line
std::vector col11(std::istream_iterator<String11>(csvFile), {});
// Show output. Show all columns 11
std::copy(col11.begin(), col11.end(), std::ostream_iterator<double>(std::cout, "\n"));
}
return 0;
}
Output needs to adapted as well.

Creating a custom comparator in C++

Background:
I got asked this question today in a online practice interview and I had a hard time figuring out a custom comparator to sort. Here is the question
Question:
Implement a document scanning function wordCountEngine, which receives a string document and returns a list of all unique words in it and their number of occurrences, sorted by the number of occurrences in a descending order. If two or more words have the same count, they should be sorted according to their order in the original sentence. Assume that all letters are in english alphabet. You function should be case-insensitive, so for instance, the words “Perfect” and “perfect” should be considered the same word.
The engine should strip out punctuation (even in the middle of a word) and use whitespaces to separate words.
Analyze the time and space complexities of your solution. Try to optimize for time while keeping a polynomial space complexity.
Examples:
input: document = "Practice makes perfect. you'll only
get Perfect by practice. just practice!"
output: [ ["practice", "3"], ["perfect", "2"],
["makes", "1"], ["youll", "1"], ["only", "1"],
["get", "1"], ["by", "1"], ["just", "1"] ]
My idea:
The first think I wanted to do was first get the string without punctuation and all in lower case into a vector of strings. Then I used an unordered_map container to store the string and a count of its occurrence. Where I got stuck was creating a custom comparator to make sure that if I have a string that has the same count then I would sort it based on its precedence in the actual given string.
Code:
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <sstream>
#include <iterator>
#include <numeric>
#include <algorithm>
using namespace std;
struct cmp
{
bool operator()(std::string& word1, std::string& word2)
{
}
};
vector<vector<string>> wordCountEngine( const string& document )
{
// your code goes here
// Step 1
auto doc = document;
std::string str;
remove_copy_if(doc.begin(), doc.end(), std::back_inserter(str),
std::ptr_fun<int, int>(&std::ispunct));
for(int i = 0; i < str.size(); ++i)
str[i] = tolower(str[i]);
std::stringstream ss(str);
istream_iterator<std::string> begin(ss);
istream_iterator<std::string> end;
std::vector<std::string> vec(begin, end);
// Step 2
std::unordered_map<std::string, int> m;
for(auto word : vec)
m[word]++;
// Step 3
std::vector<std::vector<std::string>> result;
for(auto it : m)
{
result.push_back({it.first, std::to_string(it.second)});
}
return result;
}
int main() {
std::string document = "Practice makes perfect. you'll only get Perfect by practice. just practice!";
auto result = wordCountEngine(document);
for(int i = 0; i < result.size(); ++i)
{
for(int j = 0; j < result[0].size(); ++j)
{
std::cout << result[i][j] << " ";
}
std::cout << "\n";
}
return 0;
}
If anyone can help me with learning how to build a custom comparator for this code I would really appreciate it.
You could use a std::vector<std::pair<std::string, int>>, with each pair representing one word and the number of occurrences of that word in the sequence. Using a vector will help to maintain the order of the original sequence when two or more words have the same count. Finally sort by occurrences.
#include <vector>
#include <algorithm>
#include <string>
#include <sstream>
std::vector<std::vector<std::string>> wordCountEngine(const std::string& document)
{
std::vector<std::pair<std::string, int>> words;
std::istringstream ss(document);
std::string word;
//Loop through words in sequence
while (getline(ss, word, ' '))
{
//Convert to lowercase
std::transform(word.begin(), word.end(), word.begin(), tolower);
//Remove punctuation characters
auto it = std::remove_if(word.begin(), word.end(), [](char c) { return !isalpha(c); });
word.erase(it, word.end());
//Find this word in the result vector
auto pos = std::find_if(words.begin(), words.end(),
[&word](const std::pair<std::string, int>& p) { return p.first == word; });
if (pos == words.end()) {
words.push_back({ word, 1 }); //Doesn't occur -> add it
}
else {
pos->second++; //Increment count
}
}
//Sort vector by word occurrences
std::sort(words.begin(), words.end(),
[](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2) { return p1.second > p2.second; });
//Convert to vector<vector<string>>
std::vector<std::vector<std::string>> result;
result.reserve(words.size());
for (auto& p : words)
{
std::vector<std::string> v = { p.first, std::to_string(p.second) };
result.push_back(v);
}
return result;
}
int main()
{
std::string document = "Practice makes perfect. you'll only get Perfect by practice. just practice!";
auto result = wordCountEngine(document);
for (auto& word : result)
{
std::cout << word[0] << ", " << word[1] << std::endl;
}
return 0;
}
Output:
practice, 3
perfect, 2
makes, 1
youll, 1
only, 1
get, 1
by, 1
just, 1
In step2, try this:
std::vector<std::pair<std::pair<std::string, int>, int>> m;
Here, the pair stores the string and this index of its occurance, and the vector stores the pair and the count of its occurances. Write a logic, to sort according to the count first and then if the counts are same, then sort it according to the position of its occurance.
bool sort_vector(const std::pair<const std::pair<std::string,int>,int> &a, const std::pair<const std::pair<std::string,int>,int> &b)
{
if(a.second==b.second)
{
return a.first.second<b.first.second
// This will make sure that if the no of occurances of each string is same, then it will be sorted according to the position of the string
}
return a.second>b.second
//This will make sure that the strings are sorted in the order to return the string having higher no of occurances first.
}
You have to write a logic to count the number of occurrences and the index of occurrence of each word in the string.

separating 2 words from a string

I have done a lot of reading on this topic online, and cannot figure out if my code is working. i am working on my phone with the c4droid app, and the debugger is nearly useless as far as i can tell.
as the title says, i need to separate 2 words out of one input. depending on what the first word is, the second may or may not be used. if i do not need the second word everything is fine. if i need and have the second word it works, or seems to. but if i need a second word but only have the first it compiles, but crashes with an out of range exception.
ActionCommand is a vector of strings with 2 elements.
void splitstring(std::string original)
{
std::string
std::istringstream OrigStream(original);
OrigStream >> x;
ActionCommand.at(0) = x;
OrigStream >> x;
ActionCommand.at(1) = x;
return;
}
this code will separate the words right?
any help would be appreciated.
more of the code:
called from main-
void DoAction(Character & Player, room & RoomPlayerIn)
{
ParseAction(Player, GetAction(), RoomPlayerIn);
return;
}
std::string GetAction()
{
std::string action;
std::cout<< ">";
std::cin>>action;
action = Lowercase(action);
return action;
}
maybe Lowercase is the problem.
std::string Lowercase(std::string sourceString)
{
std::string destinationString;
destinationString.resize(sourceString.size());
std::transform(sourceString.begin(), sourceString.end(), destinationString.begin(), ::tolower);
return destinationString;
)
void ParseAction(Character & Player, std::string CommandIn, room & RoomPlayerIn)
(
std::vector<std::string> ActionCommand;
splitstring(CommandIn, ActionCommand);
std::string action = ActionCommand.at(0);
if (ActionCommand.size() >1)
std::string action2 = ActionCommand.at(1);
skipping some ifs
if (action =="wield")
{
if(ActionCommand.size() >1)
DoWield(action2);
else std::cout<<"wield what??"<<std::endl;
return;
}
and splitstring now looks like this
void splitstring(std::string const &original, std::vector<std::string> &ActionCommand)
{
std::string x;
std::istringstream OrigStream(original);
if (OrigStream >>x)
ActionCommand.push_back(x);
else return;
if (OrigStream>>x)
ActionCommand.push_back(x);
return;
}
#include <sstream>
#include <vector>
#include <string>
std::vector<std::string> ActionCommand;
void splitstring(std::string const &original)
{
std::string x;
std::istringstream OrigStream{ original };
if(OrigStream >> x)
ActionCommand.push_back(x);
else return;
if(OrigStream >> x)
ActionCommand.push_back(x);
}
Another idea would be to use the standard library. You can split a string into tokens (using spaces as dividers) with the following function:
#include <string>
#include <vector>
#include <sstream>
#include <iterator>
inline auto tokenize(const std::string &String)
{
auto Stream = std::stringstream(String);
return std::vector<std::string>{std::istream_iterator<std::string>{Stream}, std::istream_iterator<std::string>{}};
}
Here, the result is created in place by using an std::istream_iterator, which basically stands in for the >> operation in your example.
Warning:
This code needs at least c++11 to compile.

Accelerated C++ exercise 8-5 solution not clear

I am stuck at solving Accelerated C++ exercise 8-5 and I don't want to miss a single exercise in this book.
Accelerated C++ Exercise 8-5 is as follows:
Reimplement the gen_sentence and xref functions from Chapter 7 to use
output iterators rather than putting their entire output in one data
structure. Test these new versions by writing programs that attach the
output iterator directly to the standard output, and by storing the
results in list <string> and map<string, vector<int> >, respectively.
To understand scope of this question and current knowledge in this part of the book - this exercise is part of chapter about generic function templates and iterator usage in templates. Previous exercise was to implement simple versions of <algorithm> library functions, such as equal, find, copy, remove_copy_if etc.
If I understand correctly, I need to modify xref function so it:
Use output iterator
Store results in map<string, vector<int> >
I tried to pass map iterator as back_inserter(), .begin(), .end() to this function, but was not able to compile it. Answer here explains why.
xref function as in Chapter 7:
// find all the lines that refer to each word in the input
map<string, vector<int> >
xref(istream& in,
vector<string> find_words(const string&) = split)
{
string line;
int line_number = 0;
map<string, vector<int> > ret;
// read the next line
while (getline(in, line)) {
++line_number;
// break the input line into words
vector<string> words = find_words(line);
// remember that each word occurs on the current line
for (vector<string>::const_iterator it = words.begin();
it != words.end(); ++it)
ret[*it].push_back(line_number);
}
return ret;
}
Split implementation:
vector<string> split(const string& s)
{
vector<string> ret;
typedef string::size_type string_size;
string_size i = 0;
// invariant: we have processed characters `['original value of `i', `i)'
while (i != s.size()) {
// ignore leading blanks
// invariant: characters in range `['original `i', current `i)' are all spaces
while (i != s.size() && isspace(s[i]))
++i;
// find end of next word
string_size j = i;
// invariant: none of the characters in range `['original `j', current `j)' is a space
while (j != s.size() && !isspace(s[j]))
++j;
// if we found some nonwhitespace characters
if (i != j) {
// copy from `s' starting at `i' and taking `j' `\-' `i' chars
ret.push_back(s.substr(i, j - i));
i = j;
}
}
return ret;
}
Please help to understand what am i missing.
I found more details on the exercise, here: https://stackoverflow.com/questions/5608092/accelerated-c-exercise-8-5-wording-help:
template <class Out>
void gen_sentence( const Grammar& g, string s, Out& out )
USAGE:
std::ostream_iterator<string> out_str (std::cout, " ");
gen_sentence( g, "<sentence>", out_str );
template <class Out, class In>
void xref( In& in, Out& out, vector<string> find_words( const string& ) = split )
USAGE:
std::ostream_iterator<string> out_str (std::cout, " ");
xref( cin, out_str, find_url ) ;
Frankly, I have to come to the conclusion that that question is ill-posed, specifically where they specified the new interface for xref: xref should result in a map. However, using output iterators would imply using std::inserter(map, map.end()) in this case. While you can write a compiling version of the code, this will not do what you expect since map::insert will simply ignore any insertions with duplicated keys.
If the goal of xref is only to link the words to the line number of their first appearance this would still be ok, but I have a feeling that the author of the exercise simply missed this subtler point :)
Here is the code anyways (note that I invented a silly implementation for split, because it was both missing and required):
#include <map>
#include <vector>
#include <iostream>
#include <sstream>
#include <fstream>
#include <algorithm>
#include <iterator>
std::vector<std::string> split(const std::string& str)
{
std::istringstream iss(str);
std::vector<std::string> result;
std::copy(std::istream_iterator<std::string>(iss),
std::istream_iterator<std::string>(),
std::back_inserter(result));
return result;
}
// find all the lines that refer to each word in the input
template <typename OutIt>
OutIt xref(std::istream& in,
OutIt out,
std::vector<std::string> find_words(const std::string&) = split)
{
std::string line;
int line_number = 0;
// read the next line
while (getline(in, line)) {
++line_number;
// break the input line into words
std::vector<std::string> words = find_words(line);
// remember that each word occurs on the current line
for (std::vector<std::string>::const_iterator it = words.begin();
it != words.end(); ++it)
*out++ = std::make_pair(*it, line_number);
}
return out;
}
int main(int argc, const char *argv[])
{
std::map<std::string, int> index;
std::ifstream file("/tmp/test.cpp");
xref(file, std::inserter(index, index.end()));
#if __GXX_EXPERIMENTAL_CXX0X__
for(auto& entry: index)
std::cout << entry.first << " first found on line " << entry.second << std::endl;
#else
for(std::map<std::string, int>::const_iterator it = index.begin();
it != index.end();
++it)
{
std::cout << it->first << " first found on line " << it->second << std::endl;
}
#endif
return 0;
}