I am currently trying to create a C++ function to join two pipe-delimited files with over 10,000,000 records on one or two key fields.
The files look like
P2347|John Doe|C1234
P7634|Peter Parker|D2344
P522|Toni Stark|T288
and
P2347|Bruce Wayne|C1234
P1111|Captain America|D534
P522|Terminator|T288
To join on field 1 and 3, the expected output should show:
P2347|C1234|John Doe|Bruce Wayne
P522|T288|Toni Stark|Terminator
What I am currently thinking about is using a set/array/vector to read in the files and create something like:
P2347|C1234>>John Doe
P522|T288>>Toni Stark
and
P2347|C1234>>Bruce Wayne
P522|T288>>Terminator
And then split off the first part as the key and match it against the second set/vector/array.
What I currently have is: read in the first file and match the second file line by line against the set. Right now it takes the whole line and matches it:
#include <iostream>
#include <fstream>
#include <string>
#include <set>
#include <ctime>
using namespace std;
int main()
{
clock_t startTime = clock();
ifstream inf("test.txt");
set<string> lines;
string line;
for (unsigned int i=1; std::getline(inf,line); ++i)
lines.insert(line);
ifstream inf2("test2.txt");
clock_t midTime = clock();
ofstream outputFile("output.txt");
while (getline(inf2, line))
{
if (lines.find(line) != lines.end())
outputFile << line << '\n';
}
return 0;
}
I am very happy about any suggestion. I am also happy to change the whole concept if there is a better (faster) way. Speed is critical, as there might be even more than 10 million records.
EDIT: Another idea would be to take a map and use the key fields as the map key, but this might be a little slower. Any suggestions?
Thanks a lot for any help!
I tried multiple ways to get this task completed; none of them has been efficient so far:
Read everything into a set and parse the key fields into a format: keys >> values simulating an array type set. Parsing took a long time, but memory usage stays relatively low. Not fully developed code:
#include <iostream>
#include <fstream>
#include <string>
#include <set>
#include <vector>
#include <sstream>
#include <ctime>
std::vector<std::string> &split(const std::string &s, char delim, std::vector<std::string> &elems) {
std::stringstream ss(s);
std::string item;
while (std::getline(ss, item, delim)) {
elems.push_back(item);
}
return elems;
}
std::vector<std::string> split(const std::string &s, char delim) {
std::vector<std::string> elems;
split(s, delim, elems);
return elems;
}
std::string getSelectedRecords(std::string record, int position){
std::string values;
std::vector<std::string> tokens = split(record, ' ');
//get position in vector
for(auto& s: tokens)
//pick last one or depending on number, not developed
values = s;
return values;
}
int main()
{
clock_t startTime = clock();
std::ifstream secondaryFile("C:/Users/Batman/Desktop/test/secondary.txt");
std::set<std::string> secondarySet;
std::string record;
for (unsigned int i=1; std::getline(secondaryFile,record); ++i){
std::string keys = getSelectedRecords(record, 2);
std::string values = getSelectedRecords(record, 1);
secondarySet.insert(keys + ">>>" + values);
}
clock_t midTime = clock();
std::ifstream primaryFile("C:/Users/Batman/Desktop/test/primary.txt");
std::ofstream outputFile("C:/Users/Batman/Desktop/test/output.txt");
while (getline(primaryFile, record))
{
//rewrite find() function to go through set and find all keys (first part until >> ) and output values
std::string keys = getSelectedRecords(record, 2);
if (secondarySet.find(keys) != secondarySet.end())
outputFile << record << '\n';
}
return 0;
}
Instead of pipe-delimited, it currently uses space-delimited files, but that should not be a problem. Reading the data is very quick, but parsing it takes an awful lot of time.
The other option was taking a multimap. Similar concept, with key fields pointing to values, but this one is very slow and memory-intensive.
#include <fstream>
#include <string>
#include <map>
#include <vector>
#include <sstream>
#include <iterator>
#include <ctime>
int main()
{
std::clock_t startTime = clock();
std::ifstream inf("C:/Users/Batman/Desktop/test/test.txt");
typedef std::multimap<std::string, std::string> Map;
Map map;
std::string line;
for (unsigned int i=1; std::getline(inf,line); ++i){
//load tokens into vector
std::istringstream buffer(line);
std::istream_iterator<std::string> beg(buffer), end;
std::vector<std::string> tokens(beg, end);
//get keys and values into the multimap (the probe/output phase is not developed yet)
if (tokens.size() >= 3)
map.insert(Map::value_type(tokens[0] + "|" + tokens[2], tokens[1]));
}
return 0;
}
Further thoughts are: Splitting the pipe divided files into different files with one column each right when importing the data. With that I will not have to parse anything but can read in each column individually.
EDIT: optimized the first example with a recursive split function. Still >30 seconds for 100,000 records. I would like to see that faster, plus the actual find() logic is still missing.
Any thoughts?
Thanks!
Background:
I got asked this question today in an online practice interview and I had a hard time figuring out a custom comparator for the sort. Here is the question:
Question:
Implement a document scanning function wordCountEngine, which receives a string document and returns a list of all unique words in it and their number of occurrences, sorted by the number of occurrences in descending order. If two or more words have the same count, they should be sorted according to their order in the original sentence. Assume that all letters are in the English alphabet. Your function should be case-insensitive, so for instance, the words “Perfect” and “perfect” should be considered the same word.
The engine should strip out punctuation (even in the middle of a word) and use whitespaces to separate words.
Analyze the time and space complexities of your solution. Try to optimize for time while keeping a polynomial space complexity.
Examples:
input: document = "Practice makes perfect. you'll only
get Perfect by practice. just practice!"
output: [ ["practice", "3"], ["perfect", "2"],
["makes", "1"], ["youll", "1"], ["only", "1"],
["get", "1"], ["by", "1"], ["just", "1"] ]
My idea:
The first thing I wanted to do was get the string, without punctuation and all in lower case, into a vector of strings. Then I used an unordered_map container to store each string and a count of its occurrences. Where I got stuck was creating a custom comparator to make sure that if two strings have the same count, they are sorted based on their precedence in the given string.
Code:
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <sstream>
#include <iterator>
#include <numeric>
#include <algorithm>
using namespace std;
struct cmp
{
bool operator()(std::string& word1, std::string& word2)
{
}
};
vector<vector<string>> wordCountEngine( const string& document )
{
// your code goes here
// Step 1
auto doc = document;
std::string str;
remove_copy_if(doc.begin(), doc.end(), std::back_inserter(str),
std::ptr_fun<int, int>(&std::ispunct));
for(int i = 0; i < str.size(); ++i)
str[i] = tolower(str[i]);
std::stringstream ss(str);
istream_iterator<std::string> begin(ss);
istream_iterator<std::string> end;
std::vector<std::string> vec(begin, end);
// Step 2
std::unordered_map<std::string, int> m;
for(auto word : vec)
m[word]++;
// Step 3
std::vector<std::vector<std::string>> result;
for(auto it : m)
{
result.push_back({it.first, std::to_string(it.second)});
}
return result;
}
int main() {
std::string document = "Practice makes perfect. you'll only get Perfect by practice. just practice!";
auto result = wordCountEngine(document);
for(int i = 0; i < result.size(); ++i)
{
for(int j = 0; j < result[0].size(); ++j)
{
std::cout << result[i][j] << " ";
}
std::cout << "\n";
}
return 0;
}
If anyone can help me with learning how to build a custom comparator for this code I would really appreciate it.
You could use a std::vector<std::pair<std::string, int>>, with each pair representing one word and the number of occurrences of that word in the sequence. Using a vector will help to maintain the order of the original sequence when two or more words have the same count. Finally sort by occurrences.
#include <iostream>
#include <vector>
#include <algorithm>
#include <string>
#include <sstream>
#include <cctype>
std::vector<std::vector<std::string>> wordCountEngine(const std::string& document)
{
std::vector<std::pair<std::string, int>> words;
std::istringstream ss(document);
std::string word;
//Loop through words in sequence
while (getline(ss, word, ' '))
{
//Convert to lowercase
std::transform(word.begin(), word.end(), word.begin(), tolower);
//Remove punctuation characters
auto it = std::remove_if(word.begin(), word.end(), [](char c) { return !isalpha(c); });
word.erase(it, word.end());
//Find this word in the result vector
auto pos = std::find_if(words.begin(), words.end(),
[&word](const std::pair<std::string, int>& p) { return p.first == word; });
if (pos == words.end()) {
words.push_back({ word, 1 }); //Doesn't occur -> add it
}
else {
pos->second++; //Increment count
}
}
//Sort vector by word occurrences (stable_sort keeps the original order for equal counts)
std::stable_sort(words.begin(), words.end(),
[](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2) { return p1.second > p2.second; });
//Convert to vector<vector<string>>
std::vector<std::vector<std::string>> result;
result.reserve(words.size());
for (auto& p : words)
{
std::vector<std::string> v = { p.first, std::to_string(p.second) };
result.push_back(v);
}
return result;
}
int main()
{
std::string document = "Practice makes perfect. you'll only get Perfect by practice. just practice!";
auto result = wordCountEngine(document);
for (auto& word : result)
{
std::cout << word[0] << ", " << word[1] << std::endl;
}
return 0;
}
Output:
practice, 3
perfect, 2
makes, 1
youll, 1
only, 1
get, 1
by, 1
just, 1
In step 2, try this:
std::vector<std::pair<std::pair<std::string, int>, int>> m;
Here, the inner pair stores the string and the index of its first occurrence, and the vector stores each pair together with the count of its occurrences. Write the logic to sort according to the count first and then, if the counts are the same, according to the position of the first occurrence.
bool sort_vector(const std::pair<const std::pair<std::string,int>,int> &a, const std::pair<const std::pair<std::string,int>,int> &b)
{
if(a.second==b.second)
{
// If the number of occurrences of each string is the same, sort
// according to the position of the string in the input.
return a.first.second<b.first.second;
}
// Otherwise put strings with a higher number of occurrences first.
return a.second>b.second;
}
You have to write a logic to count the number of occurrences and the index of occurrence of each word in the string.
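To make that concrete, here is one possible way to wire up the counting and the comparator in a single pass. The `countWords` name is my own illustration, not from the question; it records each word's count together with the index of its first occurrence and returns word/count pairs already in the required order.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Count case-insensitive, punctuation-stripped words; ties in count are
// broken by the index of each word's first occurrence in the document.
std::vector<std::pair<std::string, int>> countWords(const std::string& document)
{
    std::unordered_map<std::string, std::pair<int, int>> counts; // word -> {count, first index}
    std::istringstream ss(document);
    std::string word;
    int index = 0;
    while (ss >> word) {
        std::string clean;
        for (char c : word)
            if (std::isalpha(static_cast<unsigned char>(c)))
                clean += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (!clean.empty()) {
            auto it = counts.find(clean);
            if (it == counts.end())
                counts.emplace(clean, std::make_pair(1, index)); // remember first position
            else
                ++it->second.first;                              // just bump the count
        }
        ++index;
    }
    std::vector<std::pair<std::string, std::pair<int, int>>> v(counts.begin(), counts.end());
    std::sort(v.begin(), v.end(), [](const auto& a, const auto& b) {
        if (a.second.first != b.second.first)
            return a.second.first > b.second.first; // higher count first
        return a.second.second < b.second.second;   // earlier first occurrence first
    });
    std::vector<std::pair<std::string, int>> result;
    for (const auto& e : v)
        result.emplace_back(e.first, e.second.first);
    return result;
}
```

The first-occurrence index plays the role of the "position of occurrence" described above, so a plain std::sort suffices here; no stable sort is needed because the tie-break is explicit in the comparator.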
Although I am not happy with the title of this question, and this might be an odd question, bear with me, please.
So I have text files with content as follows:
& AAABBAB
this
& AAAAAAB
is
& BCAAAA
an
& BBBBBA
example
& BABABAB
text
where every other line starts with an identifier ('&'). Lines with said identifier should be lexicographically sorted, but I need it in a way such that the next line is dragged along to the new position in the output file with it.
This is what I am hoping to be the content of the output file.
& AAAAAAB
is
& AAABBAB
this
& BABABAB
text
& BBBBBA
example
& BCAAAA
an
With this, I can get the file content line-by-line:
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
using namespace std;
int main()
{
ifstream is("test.txt");
string str;
while(getline(is, str))
{
cout<<str<<endl;
}
return 0;
}
Is there an easy way to accomplish what I am looking for? Thanks for your help!
I'd bundle the pairs together while reading, making them easy to sort:
vector<pair<string, string>> vec; // first is identifier
vec.reserve(1000);
bool first = true;
while(getline(is, str))
{
if (first)
vec.emplace_back(str, string());
else
vec.back().second = str;
first = !first;
}
sort(vec.begin(), vec.end());
You can gather your lines by pairs into a vector of std::pair<std::string, std::string> :
using line_t = std::pair<std::string, std::string>;
std::vector<line_t> lines;
line_t pair_line;
while (std::getline(is, pair_line.first) &&
std::getline(is, pair_line.second)) {
lines.push_back(pair_line);
}
and sort them by their .first:
std::sort(begin(lines), end(lines),
[](auto const &l1, auto const &l2)
{ return l1.first < l2.first; });
Yes, there is.
View the entire file as a map of key and value pairs, read it into a std::map<std::string, std::string> (or a std::multimap if identifiers can repeat), then output the map. Since string comparisons are lexicographic by default and map keys are ordered, the map will do the sorting for you.
Here's a take that works nicely if you have a file that's too big to fit in memory, or, in general you need the efficiency.
It combines
a memory map¹
string views²
standard algorithms
Live On Coliru
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_view.hpp>
#include <deque>
namespace io = boost::iostreams;
using boost::string_view;
auto map_entries(string_view input) {
std::deque<string_view> pairs;
while (!input.empty()) {
size_t pos = input.find('\n');
if (pos != string_view::npos)
pos = input.find('\n', pos + 1);
if (pos != string_view::npos)
pairs.push_back(input.substr(0, pos));
input.remove_prefix(pos + 1); // safe with unsigned wrap around
}
return pairs;
}
#include <iostream>
int main() {
io::mapped_file_source file("input.txt");
auto data = map_entries({ file.data(), file.size() });
std::stable_sort(data.begin(), data.end());
for (auto entry : data)
std::cout << entry << "\n";
}
Prints
& AAAAAAB
is
& AAABBAB
this
& BABABAB
text
& BBBBBA
example
& BCAAAA
an
¹ it's trivial to use POSIX mmap instead of the boost thing there
² you can use std::[experimental::]string_view if your compiler/library is recent enough
So I've got a dozen strings which I download; examples of the kind I need to parse are below.
"Australija 036 AUD 1 4,713831 4,728015 4,742199"
"Vel. Britanija 826 GBP 1 10,300331 10,331325 10,362319"
So my first idea was to count manually where the number I need is (the second one, 4,728015 or 10,331325 in the examples above) and take substr(52, 8).
But then I realized that a few of the strings I'm parsing have a number greater than 9 in them, so I would need substr(51, 9) for that case, so I can't do it this way.
The second idea was to save all the number-like chars in a vector, and then get vector[4] and save it into a separate variable.
And the third one is to just loop over the string until I position myself after the fifth group of spaces and then take the substring.
Just looking for some feedback on what would be "best".
The problem
is that we can have multiple words at the beginning of the string. I.e. the first element may contain spaces.
The solution
Start from the end of the string where we are stable.
Split the string up at the spaces, start counting from the end, and pick the second-to-last element.
Solution 1: Boost string algorithms
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
using namespace std;
using namespace boost;
string extractstring(string & fullstring)
{
vector<string> vs;
split(vs, fullstring, is_any_of(" "), token_compress_on);
return vs[vs.size() - 2];
}
Solution 2: QString (from Qt framework)
#include <QString>
#include <QStringList>
QString extractstring(const QString &fullstring)
{
QStringList sl = fullstring.split(" ");
return sl[sl.size() - 2];
}
Solution 3: STL only
#include <iostream>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>
#include <vector>
using namespace std;
string extractstring(string & fullstring)
{
istringstream iss(fullstring);
vector<string> elements;
copy(istream_iterator<string>(iss),
istream_iterator<string>(),
back_inserter(elements));
return elements[elements.size() - 2];
}
Other solutions: regex, C pointer acrobatics.
Update:
I would not use sscanf-based solutions, because it may be difficult to skip multiple words at the beginning of the string.
I believe you can do it with a single line using sscanf?
http://www.cplusplus.com/reference/cstdio/sscanf/
For example (http://ideone.com/e2cCT9):
const char *str = "Australija 4,713831 4,728015 4,742199";
char tmp[255];
int a,b,c,d;
sscanf(str, "%*[^0-9] %d,%d %d,%d", &a, &b, &c, &d);
printf("Parsed values: %d %d %d %d\n",a,b,c,d);
The hurdle is that the first field is allowed to have spaces, but the remaining fields are separated by spaces.
This may not be elegant, but the concept should work:
std::string text_line;
getline(my_file, text_line);
std::string::size_type field_1_start;
const unsigned int text_length = text_line.length();
for (field_1_start = 0; field_1_start < text_length; ++field_1_start)
{
if (isdigit(text_line[field_1_start]))
{
break;
}
}
if (field_1_start < text_length)
{
std::string remaining_text = text_line.substr(field_1_start, text_length - field_1_start);
std::istringstream input_data(remaining_text);
int field_1;
std::string field_2;
input_data >> field_1;
input_data >> field_2;
//...
}
I am currently trying to write code for finding unique lines as well as unique words. My attempted code for unique lines is below, but it is not calculating them correctly. I believe it is ignoring enters (blank lines) and only counting lines that contain letters, words, sentences, etc. So I need help figuring out what to add so that it counts blank lines as lines and does not count duplicate lines more than once. I know this is happening because I have tested a few different inputs. As for the unique words, I don't even know how to get started :/
unsigned long countULines(const string& s)
{
set<string> wl;
string ulines;
for(int ul = 0; ul < s.size(); ul++){
wl.insert(ulines);
}
return wl.size();
}
This will work for you:
#include <iostream>
#include <sstream>
#include <string>
#include <set>
size_t countUniqueLines(const std::string& s)
{
std::set<std::string> uniqueLines;
std::istringstream is(s);
std::string line;
while (std::getline(is, line)) {
if (!line.empty() && line[line.size() - 1] == '\r') {
line.erase(line.size() -1);
}
uniqueLines.insert(line);
}
return uniqueLines.size();
}
int main()
{
const std::string myLines = "four\none\n\ntwo\nthree\nthree\n\n";
std::cout << countUniqueLines(myLines) << std::endl;
}
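For the unique-words half of the question, the same pattern works with operator>> in place of getline, since stream extraction already skips any run of whitespace, including newlines. A sketch, where `countUniqueWords` is an illustrative name mirroring countUniqueLines:

```cpp
#include <set>
#include <sstream>
#include <string>

// Count distinct whitespace-separated words in the input string.
size_t countUniqueWords(const std::string& s)
{
    std::set<std::string> uniqueWords;
    std::istringstream is(s);
    std::string word;
    while (is >> word)          // operator>> skips spaces, tabs, and newlines
        uniqueWords.insert(word);
    return uniqueWords.size();
}
```

Note this treats "Word" and "word" as different; lowercasing each word before the insert would make it case-insensitive.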
I have comma delimited strings I need to pull values from. The problem is these strings will never be a fixed size. So I decided to iterate through the groups of commas and read what is in between. In order to do that I made a function that returns every occurrence's position in a sample string.
Is this a smart way to do it? Is this considered bad code?
#include <string>
#include <iostream>
#include <vector>
#include <Windows.h>
using namespace std;
vector<int> findLocation(string sample, char findIt);
int main()
{
string test = "19,,112456.0,a,34656";
char findIt = ',';
vector<int> results = findLocation(test,findIt);
return 0;
}
vector<int> findLocation(string sample, char findIt)
{
vector<int> characterLocations;
for(int i =0; i < sample.size(); i++)
if(sample[i] == findIt)
characterLocations.push_back(sample[i]);
return characterLocations;
}
vector<int> findLocation(string sample, char findIt)
{
vector<int> characterLocations;
for(int i =0; i < sample.size(); i++)
if(sample[i] == findIt)
characterLocations.push_back(sample[i]);
return characterLocations;
}
As currently written, this will simply return a vector containing the int representations of the characters themselves, not their positions, which is what you really want, if I read your question correctly.
Replace this line:
characterLocations.push_back(sample[i]);
with this line:
characterLocations.push_back(i);
And that should give you the vector you want.
If I were reviewing this, I would see this and assume that what you're really trying to do is tokenize a string, and there are already good ways to do that.
Best way I've seen to do this is with boost::tokenizer. It lets you specify how the string is delimited and then gives you a nice iterator interface to iterate through each value.
#include <boost/tokenizer.hpp>
using namespace boost;
string sample = "Hello,My,Name,Is,Doug";
escaped_list_separator<char> sep("" /*escape char*/, "," /*separator*/, "" /*quotes*/);
tokenizer<escaped_list_separator<char> > myTokens(sample, sep);
//iterate through the contents
for (tokenizer<escaped_list_separator<char> >::iterator iter = myTokens.begin();
iter != myTokens.end();
++iter)
{
std::cout << *iter << std::endl;
}
Output:
Hello
My
Name
Is
Doug
Edit: If you don't want a dependency on Boost, you can also use getline with an istringstream, as in this answer. To copy somewhat from that answer:
std::string str = "Hello,My,Name,Is,Doug";
std::istringstream stream(str);
std::string tok1;
while (std::getline(stream, tok1, ','))
{
std::cout << tok1 << std::endl;
}
Output:
Hello
My
Name
Is
Doug
This may not be directly what you're asking but I think it gets at your overall problem you're trying to solve.
Looks good to me too; one comment is on the naming of your variables and types. You call the vector you are going to return characterLocations, and it holds ints, when really you are pushing back the character itself (which is of type char), not its location. I am not sure what the greater application is, but I think it would make more sense to pass back the locations, or do a more cookie-cutter string tokenize.
Well, if your purpose is to find the indices of the occurrences, the following code will be more efficient: in C++, passing objects by value causes them to be copied, which is both less safe and less efficient. In particular, returning a vector by value is the worst practice in this case, which is why passing it in as a reference argument is better.
#include <string>
#include <iostream>
#include <vector>
#include <Windows.h>
using namespace std;
void findLocation(const string& sample, char findIt, vector<int>& resultList);
int main()
{
string test = "19,,112456.0,a,34656";
char findIt = ',';
vector<int> results;
findLocation(test,findIt, results);
return 0;
}
void findLocation(const string& sample, const char findIt, vector<int>& resultList)
{
const int sz = sample.size();
for(int i =0; i < sz; i++)
{
if(sample[i] == findIt)
{
resultList.push_back(i);
}
}
}
How smart it is also depends on what you do with those substrings delimited by commas. In some cases it may be better (e.g. faster, with smaller memory requirements) to avoid searching and splitting, and just parse and process the string at the same time, possibly using a state machine.
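As a concrete illustration of that last point, here is a sketch that splits in a single pass, emitting each field as soon as its delimiter is seen instead of first collecting comma positions and calling substr later. The `splitFields` name is hypothetical; note that empty fields between consecutive commas are preserved:

```cpp
#include <string>
#include <vector>

// Single pass over the string: build up the current field character by
// character and emit it whenever the delimiter is reached.
std::vector<std::string> splitFields(const std::string& sample, char delim)
{
    std::vector<std::string> fields;
    std::string current;
    for (char c : sample) {
        if (c == delim) {
            fields.push_back(current); // field ends at the delimiter
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);         // last field has no trailing delimiter
    return fields;
}
```

This touches each character exactly once and never rescans the string, so it stays linear regardless of how many commas there are.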