Finding occurrences of a specific string of length L inside a txt file

Suppose I have a hexadecimal string y of length N of the form y{N}y{N-1}...y{1}.
Then given another hexadecimal string x of length L (L less than N), I want to check how many times (if at all) this string appears inside y... say like y{N}...x{L}x{L-1}...x{1}...y{j}..x{L}x{L-1}...x{1}....y{1}.
Which is the most efficient way to do this in C++? I need a really efficient implementation, as I would like to run this on a large database.

"Hexadecimal" doesn't mean a thing here. C++ is a computer language, and works on bits. "Hexadecimal" is just a convenient way to group 4 bits together for human consumption.
Similarly, C++ doesn't index strings like y{N}y{N-1}...y{1}. It indexes them as y[0], y[1], ..., y[N-1]. (There's no y[N].)
Under normal circumstances, std::string::find is going to be faster than your disk, which means it's fast enough.
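As a minimal sketch of that suggestion, the function below counts matches with repeated calls to std::string::find. The name countOccurrences is hypothetical, and counting overlapping matches is an assumption, since the question does not say whether overlaps should count.

```cpp
#include <cstddef>
#include <string>

// Counts how many times `needle` occurs in `haystack`.
// Overlapping matches are counted; an empty needle yields 0.
std::size_t countOccurrences(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = haystack.find(needle);
         pos != std::string::npos;
         pos = haystack.find(needle, pos + 1)) {  // restart one character after the last match
        ++count;
    }
    return count;
}
```

To skip overlapping matches instead, restart the search at pos + needle.size() rather than pos + 1.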

Your request is a simple string search algorithm.
There are many algorithms to do that.
Most of them will give you a good answer in O(L+N) with preprocessing.
You can also use a suffix tree, which, once built, answers a query in O(L + Z), where Z is the number of occurrences of x in y.
A suffix tree takes a lot of memory, though: the space is O(N), but with a large constant factor (a naive suffix trie would need O(N²)), so it might not be the ideal choice here.
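One preprocessing-based algorithm is available directly in the standard library since C++17. The sketch below assumes a C++17 compiler and uses std::boyer_moore_searcher, which preprocesses the pattern once; the function name countWithSearcher is hypothetical. It is an illustration of the idea, not a claim that Boyer-Moore is the best choice for this data.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Counts (possibly overlapping) occurrences of `needle` in `haystack`
// using a searcher that preprocesses the pattern once.
std::size_t countWithSearcher(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;
    const std::boyer_moore_searcher searcher(needle.begin(), needle.end());
    std::size_t count = 0;
    auto it = haystack.begin();
    while ((it = std::search(it, haystack.end(), searcher)) != haystack.end()) {
        ++count;
        ++it;  // move past the start of this match before searching again
    }
    return count;
}
```

The searcher object can be reused across many haystacks, which matters when the same pattern is checked against every record of a large database.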

Which is the most efficient way to do this in C++?
Try std::search on the contents of your input file. std::search needs forward iterators, so it cannot run directly on a single-pass std::istream_iterator; buffer the input first, like this:
#include <string>
#include <iterator>
#include <iostream>
#include <algorithm>
int main () {
    // std::ifstream input("input.txt"); // needs #include <fstream>
    std::istream& input(std::cin);
    const std::string search_for("1234");
    // read the whole stream into memory (extra parentheses avoid the most vexing parse)
    const std::string data((std::istreambuf_iterator<char>(input)), std::istreambuf_iterator<char>());
    int count(0);
    auto it = data.begin();
    while((it = std::search(it, data.end(), search_for.begin(), search_for.end())) != data.end()) {
        count++;
        ++it; // step past the start of this match, otherwise the loop never ends
    }
    std::cout << count << "\n";
}
If the input is too large to buffer, or that isn't fast enough, you might try memory-mapping the file and using the initial and final pointers as your iterators.

Related

A good way to create a palindrome in C++ given first half

Is there an easy way to create a palindrome in C++ given the first half? For example given "abcdef", I want to return "abcdeffedcba", while keeping the input first half unchanged?
I know you could do it as shown here, but is there a better way to do it in one line? In Java, reverse() returns a value, so you can do it in one line.
string createPalindrome(string & half)
{
string temp = half;
reverse(temp.begin(), temp.end());
return half + temp;
}

If you want to do this in one line, here is an implementation:
#include <string>
#include <iostream>
std::string createPalindrome(const std::string & half)
{
return half + std::string(half.rbegin(), half.rend());
}
int main()
{
std::cout << createPalindrome("abcdef");
}
Note that this is basically taking the string and concatenating it with the reverse of itself.
The half.rbegin() and half.rend() are reverse iterators, so the temporary reverse string is constructed using these iterators.

It is not necessary to do this in one line. The code you have written is perfectly clean and understandable. If you want to use the reverse method, which I would recommend since it does exactly what you want, you cannot do it in one line anyway.

Find All Palindrome Substrings in a String by Rearranging Characters

For fun and practice, I have tried to solve the following problem (using C++): Given a string, return all the palindromes that can be obtained by rearranging its characters.
I've come up with an algorithm that doesn't work completely. Sometimes, it finds all the palindromes, but other times it finds some but not all.
It works by swapping each adjacent pair of characters N times, where N is the length of the input string. Here is the code:
std::vector<std::string> palindromeGen(std::string charactersSet) {
std::vector<std::string> pals;
for (const auto &c : charactersSet) {
for (auto i = 0, j = 1; i < charactersSet.length() - 1; ++i, ++j) {
std::swap(charactersSet[i], charactersSet[j]);
if (isPalandrome(charactersSet)) {
if (std::find(pals.begin(), pals.end(), charactersSet) == pals.end()) {
// if palindrome is unique
pals.push_back(charactersSet);
}
}
}
}
return pals;
}
What's the fault in this algorithm? I'm mostly concerned about the functionality of the algorithm, rather than the efficiency. Although I'll appreciate tips about efficiency as well. Thanks.

This probably fits a bit better in Code Review but here goes:
Logic Error
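You change charactersSet while iterating over it. Swapping characters does not invalidate the iterator, but each pass starts from wherever the previous pass left the string, so the loops only ever visit the arrangements reachable by that fixed sequence of at most N(N-1) adjacent swaps, which for longer inputs is far fewer than the N! possible orderings. That is why some palindromes are never found. If you need the original order, make a copy of charactersSet and iterate over that.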
Things to Change
Since pals holds only unique values, it should be a std::set instead of a std::vector. This will simplify some things. Also, your isPalandrome method spells palindrome wrong!
Alternative Approach
Since palindromes can only take a certain form, consider sorting the input string first, so that you can have a list of characters with an even number of occurrences, and a list of characters with an odd number. You can only have one character with an odd number of occurrences (and this only works for an odd length input). This should let you discard a bunch of possibilities. Then you can work through the different possible combinations of one half of the palindrome (since you can build one half from the other).
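A minimal sketch of this idea follows. It counts each character, rejects inputs with more than one odd count, and then permutes only one half of the palindrome. The function name palindromesFromLetters is hypothetical, and treating the input as single-byte characters is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Returns every distinct palindrome that uses all characters of `letters`.
// Returns an empty vector when no palindrome is possible.
std::vector<std::string> palindromesFromLetters(const std::string& letters) {
    std::array<int, 256> counts{};
    for (unsigned char ch : letters) ++counts[ch];

    std::string half;    // one copy of each pair of equal characters
    std::string middle;  // the single odd character, if any
    for (int ch = 0; ch < 256; ++ch) {
        if (counts[ch] % 2 != 0) {
            if (!middle.empty()) return {};  // two odd counts: impossible
            middle.assign(1, static_cast<char>(ch));
        }
        half.append(counts[ch] / 2, static_cast<char>(ch));
    }

    // next_permutation needs a sorted starting point to visit every arrangement
    std::sort(half.begin(), half.end());
    std::vector<std::string> result;
    do {
        result.push_back(half + middle + std::string(half.rbegin(), half.rend()));
    } while (std::next_permutation(half.begin(), half.end()));
    return result;
}
```

Because only half of the characters are permuted, the work grows with (N/2)! rather than N!, and no candidate has to be tested with a palindrome check.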

Here is another implementation that leverages std::next_permutation:
#include <string>
#include <algorithm>
#include <set>
// simple (not the most efficient) check: compare the string with a reversed copy of itself
bool isPalindrome(const std::string& s)
{
    return s == std::string(s.rbegin(), s.rend());
}
std::set<std::string> palindromeGen(std::string charactersSet)
{
    std::set<std::string> pals;
    std::sort(charactersSet.begin(), charactersSet.end());
    do
    {
        // check if the string is the same backwards as forwards
        if (isPalindrome(charactersSet))
            pals.insert(charactersSet);
    } while (std::next_permutation(charactersSet.begin(), charactersSet.end()));
    return pals;
}
We first sort the original string. This is required for std::next_permutation to work correctly. A call to the isPalindrome function with a permutation of the string is done in a loop. Then if the string is a palindrome, it's stored in the set. The subsequent call to std::next_permutation just rearranges the string.
Granted, the live example uses a reversed copy of the string as the "isPalindrome" function (probably not efficient), but you should get the idea.

Need for Fast map between string and integers

I have a map of string and unsigned, in which I store a word to its frequency of the following form:
map<string,unsigned> mapWordFrequency; //contains 1 billion such mappings
Then I read a huge file (100GB), and retain only those words in the file which have a frequency greater than 1000. I check for the frequency of the words in the file using: mapWordFrequency[word]>1000. However, it turns out as my mapWordFrequency has 1 billion mappings and my file is huge, therefore trying to check mapWordFrequency[word]>1000 for each and every word in the file is very slow and takes more than 2 days. Can someone please suggest as to how can I improve the efficiency of the above code.
map does not fit in my RAM and swapping is consuming a lot of time.
Would erasing all words having frequency < 1000 help using erase function of map?

I suggest you use an unordered_map as opposed to a map. As already discussed in the comments, the former gives you an average insertion/retrieval time of O(1), as opposed to O(log n) in a map.
As you have already said, memory swapping is consuming a lot of time. So how about tackling the problem incrementally. Load the maximum data and unordered_map you can into memory, hash it, and continue. After one pass, you should have a lot of unordered_maps, and you can start to combine them in subsequent passes.
You could improve the speed by doing this in a distributed manner: process the pieces of data on different computers, and then combine the data (which will be in the form of unordered maps). However, I have no prior experience in distributed computing, so I can't help beyond this.
Also, if implementing something like this is too cumbersome, I suggest you use external mergesort. It is a method of sorting a file too large to fit into memory by sorting smaller chunks and combining them. The reason I'm suggesting this is that external mergesort is a pretty common technique, and you might find already implemented solutions for your need. Even though the time complexity of sorting is higher than your idea of using a map, it will reduce the overhead in swapping as compared to a map. As pointed out in the comments, sort in linux implements external mergesort.
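Once the words are sorted (for example by an external mergesort), equal words sit next to each other, so counting needs only constant memory. The sketch below assumes whitespace-separated, already sorted input; the names emitFrequentWords and frequentWordsOf are hypothetical.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Reads whitespace-separated words that are ALREADY sorted (equal words adjacent)
// and writes every word that occurs more than `threshold` times, one per line.
void emitFrequentWords(std::istream& sortedWords, std::ostream& out, std::size_t threshold) {
    std::string current, word;
    std::size_t run = 0;  // length of the current run of equal words
    while (sortedWords >> word) {
        if (run > 0 && word == current) {
            ++run;
        } else {
            if (run > threshold) out << current << '\n';
            current = word;
            run = 1;
        }
    }
    if (run > threshold) out << current << '\n';  // flush the last run
}

// Convenience wrapper for small in-memory inputs, e.g. when experimenting.
std::string frequentWordsOf(const std::string& sortedText, std::size_t threshold) {
    std::istringstream in(sortedText);
    std::ostringstream out;
    emitFrequentWords(in, out, threshold);
    return out.str();
}
```

For the real data, emitFrequentWords could be given a std::ifstream over the sorted file and a std::ofstream for the result, with threshold set to 1000.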

You can use a hash map in which the hashed string is the key and the occurrence count is the value. It will be faster. You can choose a good string hash function based on your requirements. Here is a link to some good hashing functions:
http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
You can also use a third-party library for this.
EDIT:
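A sketch in C++ (collisions are ignored here; std::hash is only a stand-in for whichever hash you choose):
#include <cstddef>
#include <functional>
#include <string>
#include <vector>
std::size_t someHashMethod(const std::string& input) { return std::hash<std::string>{}(input); }
void countWords(const std::vector<std::string>& listOfString, std::size_t maxSize) { // maxSize must be > 0
    std::vector<int> mapWordFrequency(maxSize, 0); // heap-allocated, so maxSize may be large
    for (const std::string& currString : listOfString) {
        const std::size_t key = someHashMethod(currString) % maxSize;
        ++mapWordFrequency[key];
        if (mapWordFrequency[key] > 1000) {
            // doSomeThing();
        }
    }
}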
Update:
As @Jens pointed out, there can be cases where someHashMethod() returns the same hash value for two different strings. In that case we have to resolve the collision, and lookup time will then be more than constant. Also, as the input size is very large, creating a single array of that size may not be possible. In that case we may use distributed computing concepts, but the actual lookup time will again go up compared to a single machine.

Depending on the statistical distribution of your words, it may be worth compressing each word before adding it to the map. As long as this is lossless compression you can recover the original words after filtering. The idea being you may be able to reduce the average word size (hence saving memory, and time comparing keys). Here's a simple compression/decompression procedure you could use:
#include <string>
#include <sstream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <boost/iostreams/copy.hpp>
inline std::string compress(const std::string& data)
{
std::stringstream decompressed {data};
boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
stream.push(boost::iostreams::zlib_compressor());
stream.push(decompressed);
std::stringstream compressed {};
boost::iostreams::copy(stream, compressed);
return compressed.str();
}
inline std::string decompress(const std::string& data)
{
std::stringstream compressed {data};
boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
stream.push(boost::iostreams::zlib_decompressor());
stream.push(compressed);
std::stringstream decompressed;
boost::iostreams::copy(stream, decompressed);
return decompressed.str();
}
In addition to using a std::unordered_map as others have suggested, you could also move any words that have already been seen more than 1000 times out of the map, and into a std::unordered_set. This would require also checking the set before the map, but you may see better hash performance by doing this. It may also be worth rehashing occasionally if you employ this strategy.
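The map-plus-set strategy could be sketched as follows. The class name FrequentWordTracker and its member names are hypothetical, and the threshold is passed in rather than fixed at 1000.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Counts words, but moves a word from the counting map into a set
// as soon as its count exceeds the threshold.
class FrequentWordTracker {
public:
    explicit FrequentWordTracker(unsigned threshold) : threshold_(threshold) {}

    void add(const std::string& word) {
        if (frequent_.count(word) != 0) return;  // check the set before the map
        const unsigned n = ++counts_[word];      // operator[] starts new words at 0
        if (n > threshold_) {
            frequent_.insert(word);
            counts_.erase(word);                 // the exact count is no longer needed
        }
    }

    bool isFrequent(const std::string& word) const { return frequent_.count(word) != 0; }
    std::size_t pendingCount() const { return counts_.size(); }

private:
    unsigned threshold_;
    std::unordered_map<std::string, unsigned> counts_;
    std::unordered_set<std::string> frequent_;
};
```

The set stores no counter per word, so every word that crosses the threshold frees a little memory, and the counting map shrinks as the run progresses.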

You need another approach to your problem, your data is too big to be processed all at once.
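For example, you could split your file into multiple files; the easiest way would be to split them logically by first letter.
100GB/26 letters = about 3.85GB
Now you'll have 26 files of roughly 3.85GB each (in practice the sizes will vary, since some initial letters are much more common than others).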
You know that the words in any of the files can not be part of any other file, this will help you, since you won't have to merge the results.
With a 4GB file, it now gets easier to work in ram.
std::map has a problem when you start using a lot of memory, since it fragments a lot. Try std::unordered_map, and if that's still not performing well, you may be able to load in memory the file and sort it. Counting occurrences will be then quite easy.
Assuming you have several duplicates, your map or unordered_map will have a significantly lower memory footprint.
Run your code in a loop, for each file, and append the results in another file.
You should be done quite quickly.

The main problem seems to be the memory footprint, so we are looking for a solution that uses up little memory. A way to save memory is to use sorted vectors instead of a map. Now, vector has a lookup time with ~log(n) comparisons and an average insert time of n/2, which is bad. The upside is you have basically no memory overhead, the memory to be moved is small due to separation of data and you get sequential memory (cache-friendliness) which can easily outperform a map. The required memory would be 2 (wordcount) + 4 (index) + 1 (\0-char) + x (length of word) bytes per word. To achieve that we need to get rid of the std::string, because it is just too big in this case.
You can split your map into a vector<char> that saves the strings one after another separated by \0-characters, a vector<unsigned int> for the index and a vector<short int> for the word count. The code would look something like this (tested):
#include <vector>
#include <algorithm>
#include <cstring>
#include <string>
#include <fstream>
#include <iostream>
std::vector<char> strings;
std::vector<unsigned int> indexes;
std::vector<short int> wordcount;
const int countlimit = 1000;
void insertWord(const std::string &str) {
//find the word
auto stringfinder = [](unsigned int lhs, const std::string &rhs) {
return &strings[lhs] < rhs;
};
auto index = lower_bound(begin(indexes), end(indexes), str, stringfinder);
//increment counter
if (index == end(indexes) || strcmp(&strings[*index], str.c_str())) { //unknown word
wordcount.insert(begin(wordcount) + (index - begin(indexes)), 1);
indexes.insert(index, strings.size());
strings.insert(end(strings), str.c_str(), str.c_str() + str.size() + 1);
}
else { //known word
auto &count = wordcount[index - begin(indexes)];
if (count <= countlimit) //stop at countlimit + 1: prevents overflow, yet lets the "> countlimit" test in main succeed
count++;
}
}
int main() {
std::ifstream f("input.txt");
std::string s;
while (f >> s) { //not a good way to read in words
insertWord(s);
}
for (size_t i = 0; i < indexes.size(); ++i) {
if (wordcount[i] > countlimit) {
std::cout << &strings[indexes[i]] << ": " << wordcount[i] << '\n';
}
}
}
This approach still saves all words in memory. According to Wolfram Alpha the average word length in the English language is 5.1 characters. This gives you a total memory requirement of (5.1 + 7) * 1bn bytes = 12.1bn bytes = 12.1GB. Assuming you have a halfway modern computer with 16+GB of RAM you can fit it all into RAM.
If this fails (because you don't have English words and they don't fit in memory), the next approach would be memory mapped files. That way you can make indexes point to the memory mapped file instead of strings, so you can get rid of strings, but the access time would suffer.
If this fails due to low performance you should look into map-reduce which is very easy to apply to this case. It gives you as much performance as you have computers.

@TonyD Can you please give a little example with trie? – Rose Sharma
Here's an example of a trie approach to this problem:
#include <iostream>
#include <string>
#include <limits>
#include <array>
#include <cstdint>
class trie
{
public:
void insert(const std::string& s)
{
node_.insert(s.c_str());
}
friend std::ostream& operator<<(std::ostream& os, const trie& t)
{
return os << t.node_;
}
private:
struct Node
{
Node() : freq_(0) { }
uint16_t freq_;
std::array<Node*, 26> next_letter_{};
void insert(const char* p)
{
if (*p)
{
Node*& p_node = next_letter_[*p - 'a'];
if (!p_node)
p_node = new Node;
p_node->insert(++p);
}
else
if (freq_ < std::numeric_limits<decltype(freq_)>::max()) ++freq_;
}
} node_;
friend std::ostream& operator<<(std::ostream& os, const Node& n)
{
os << '(';
if (n.freq_) os << n.freq_ << ' ';
for (size_t i = 0; i < 26; ++i)
if (n.next_letter_[i])
os << char('a' + i) << *(n.next_letter_[i]);
return os << ')';
}
};
int main()
{
trie my_trie;
my_trie.insert("abc");
my_trie.insert("abcd");
my_trie.insert("abc");
my_trie.insert("bc");
std::cout << my_trie << '\n';
}
Output:
(a(b(c(2 d(1 ))))b(c(1 )))
The output is a compressed/tree-like representation of your word-frequency histogram: abc appears 2 times, abcd 1, bc 1. The parentheses can be thought of as pushing and popping characters from a "stack" to form the current prefix or - when there's a number - word.
Whether it improves much on a map depends on the variations in the input words, but it's worth a try. A more memory-efficient implementation might use a vector or set (or even a string of, say, space-separated suffixes) when there are few elements beneath the current prefix, then switch to the array-of-26-pointers when that's likely to need less memory.

Word Frequency Statistics

In a pre-interview, I was faced with a question like this:
Given a string consisting of words separated by a single white space, print out the words in descending order sorted by the number of times they appear in the string.
For example an input string of “a b b” would generate the following output:
b : 2
a : 1
Firstly, I'd say it is not clear whether the input string is made up of single-letter words or multiple-letter words. If the former is the case, it could be simple.
Here is my thought:
int c[26] = {0};
const char *pIn = strIn;
while (*pIn != 0)
{
    if (*pIn != ' ')
        ++c[*pIn - 'a']; /* assumes every word is a single lowercase letter */
    ++pIn;
}
/* how to sort the array c[26] and remember the original index? */
I can get the statistics of the frequency of every single-letter word in the input string, and I can get it sorted (using QuickSort or whatever). But after the count array is sorted, how do I get the single-letter word associated with each count so that I can print them out in pairs later?
If the input string is made up of multiple-letter words, I plan to use a map<const char *, int> to track the frequency. But again, how do I sort the map's key-value pairs?
The question is in C or C++, and any suggestion is welcome.
Thanks!

I would use a std::map<std::string, int> to store the words and their counts. Then I would use something like this to get the words:
while(std::cin >> word) {
// increment map's count for that word
}
Finally, you just need to figure out how to print them in order of frequency; I'll leave that as an exercise for you.

You're definitely wrong in assuming that you need only 26 options, 'cause your employer will want to allow multiple-character words as well (and maybe even numbers?).
This means you're going to need an array with a variable length. I strongly recommend using a vector or, even better, a map.
To find the character sequences in the string, find your current position (start at 0) and the position of the next space. Then that's the word. Set the current position to the space and do it again. Keep repeating this until you're at the end.
By using the map you'll already have the word/count available.
If the job you're applying for requires university skills, I strongly recommend optimizing the map by adding some kind of hashing function. However, judging by the difficulty of the question I assume that that is not the case.

Taking the C-language case:
I like brute-force, straightforward algos so I would do it in this way:
1. Tokenize the input string to give an unsorted array of words. I'll have to actually, physically move each word (because each is of variable length); and I think I'll need an array of char*, which I'll use as the arg to qsort( ).
2. qsort( ) (descending) that array of words. (In the COMPAR function of qsort(), pretend that bigger words are smaller words so that the array acquires descending sort order.)
3.a. Go through the now-sorted array, looking for subarrays of identical words. The end of a subarray, and the beginning of the next, is signalled by the first non-identical word I see.
3.b. When I get to the end of a subarray (or to the end of the sorted array), I know (1) the word and (2) the number of identical words in the subarray.
4. (EDIT: new step) Save, in another array (call it array2), a char* to a word in the subarray and the count of identical words in the subarray.
5. When there are no more words in the sorted array, I'm done. It's time to print.
6. qsort( ) array2 by word frequency.
7. Go through array2, printing each word and its frequency.
I'M DONE! Let's go to lunch.

None of the earlier answers gives a complete solution, so let us work through one.
There is a more or less standard approach for counting something in a container.
We can use an associative container like a std::map or a std::unordered_map. Here we associate a "key", in this case the word, with a value, in this case the count of that word.
Luckily, the maps have a very convenient index operator[]. It looks up the given key and, if found, returns a reference to the value. If not found, it creates a new entry with the key (and a count of 0) and returns a reference to the new value. So, in both cases, we get a reference to the value used for counting, and we can simply write:
std::unordered_map<std::string, int> counter{};
counter[word]++;
And that looks really intuitive.
After this operation, you already have the frequency table: either sorted by the key (the word) when using a std::map, or unsorted but with faster access when using a std::unordered_map.
Now you want to sort by frequency/count. A map cannot do that, because it orders its elements by key only.
Therefore we need a second container: either a std::vector, which we can then sort using std::sort with any given predicate, or a container like a std::multiset that implicitly orders its elements.
To extract the words from a std::string, we simply use a std::istringstream and the standard extraction operator >>. No big deal at all.
And because writing all these long names for the std containers is tedious, we create alias names with the using keyword.
After all this, we can write compact code that fulfills the task in just a few lines:
#include <iostream>
#include <string>
#include <sstream>
#include <utility>
#include <set>
#include <unordered_map>
#include <type_traits>
#include <iomanip>
// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<std::string, unsigned int>;
// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Rank = std::multiset<Pair, Comp>;
// ------------------------------------------------------------
std::istringstream text{ " 4444 55555 1 22 4444 333 55555 333 333 4444 4444 55555 55555 55555 22 "};
int main() {
Counter counter;
// Count
for (std::string word{}; text >> word; counter[word]++);
// Sort
Rank rank(counter.begin(), counter.end());
// Output
for (const auto& [word, count] : rank) std::cout << std::setw(15) << word << " : " << count << '\n';
}
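The std::vector plus std::sort alternative mentioned in this answer could look like the sketch below. The function name rankWords is hypothetical; the tie-break (alphabetical order for equal counts) mirrors the Comp predicate of the multiset version.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns (word, count) pairs sorted by descending count,
// with alphabetical order as the tie-break.
std::vector<std::pair<std::string, unsigned>> rankWords(const std::string& text) {
    std::unordered_map<std::string, unsigned> counter;
    std::istringstream in(text);
    for (std::string word; in >> word; ) ++counter[word];

    // Copy the map entries into a vector, which can be sorted by any predicate.
    std::vector<std::pair<std::string, unsigned>> ranked(counter.begin(), counter.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) {
                  return a.second != b.second ? a.second > b.second : a.first < b.first;
              });
    return ranked;
}
```

For the question's input "a b b", the first element of the result is ("b", 2) and the second is ("a", 1).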

C++ string sort like a human being?

I would like to sort alphanumeric strings the way a human being would sort them. I.e., "A2" comes before "A10", and "a" certainly comes before "Z"! Is there any way to do this without writing a mini-parser? Ideally it would also put "A1B1" before "A1B10". I see the question "Natural (human alpha-numeric) sort in Microsoft SQL 2005" with a possible answer, but it uses various library functions, as does "Sorting Strings for Humans with IComparer".
Below is a test case that currently fails:
#include <algorithm>
#include <cassert>
#include <cctype>
#include <iostream>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>
template <typename T>
struct LexicographicSort {
inline bool operator() (const T& lhs, const T& rhs) const{
std::ostringstream s1,s2;
s1 << toLower(lhs); s2 << toLower(rhs);
bool less = s1.str() < s2.str();
//Answer: bool less = doj::alphanum_less<std::string>()(s1.str(), s2.str());
std::cout<<s1.str()<<" "<<s2.str()<<" "<<less<<"\n";
return less;
}
inline std::string toLower(const std::string& str) const {
std::string newString("");
for (std::string::const_iterator charIt = str.begin();
charIt!=str.end();++charIt) {
newString.push_back(std::tolower(*charIt));
}
return newString;
}
};
int main(void) {
const std::string reference[5] = {"ab","B","c1","c2","c10"};
std::vector<std::string> referenceStrings(&(reference[0]), &(reference[5]));
//Insert in reverse order so we know they get sorted
std::set<std::string,LexicographicSort<std::string> > strings(referenceStrings.rbegin(), referenceStrings.rend());
std::cout<<"Items:\n";
std::copy(strings.begin(), strings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
std::vector<std::string> sortedStrings(strings.begin(), strings.end());
assert(sortedStrings == referenceStrings);
}

Is there any way to do this without writing a mini-parser?
Let someone else do that?
I'm using this implementation: http://www.davekoelle.com/alphanum.html, I've modified it to support wchar_t, too.

It really depends what you mean by "parser." If you want to avoid writing a parser, I would think you should avail yourself of library functions.
Treat the string as a sequence of subsequences which are uniformly alphabetic, numeric, or "other."
Get the next alphanumeric sequence of each string using isalnum and backtrack-checking for + or - if it is a number. Use strtold in-place to find the end of a numeric subsequence.
If one is numeric and one is alphabetic, the string with the numeric subsequence comes first.
If one string has run out of characters, it comes first.
Use strcoll to compare alphabetic subsequences within the current locale.
Use strtold to compare numeric subsequences within the current locale.
Repeat until finished with one or both strings.
Break ties with strcmp.
This algorithm has something of a weakness in comparing numeric strings which exceed the precision of long double.

Is there any way to do it without writing a mini parser? I would think the answer is no. But writing a parser isn't that tough. I had to do this a while ago to sort our company's stock numbers. Basically just scan the number and turn it into an array. Check the "type" of every character: alpha, number, maybe you have others you need to deal with special. Like I had to treat hyphens special because we wanted A-B-C to sort before AB-A. Then start peeling off characters. As long as they are the same type as the first character, they go into the same bucket. Once the type changes, you start putting them in a different bucket. Then you also need a compare function that compares bucket-by-bucket. When both buckets are alpha, you just do a normal alpha compare. When both are digits, convert both to integer and do an integer compare, or pad the shorter to the length of the longer or something equivalent. When they're different types, you'll need a rule for how those compare, like does A-A come before or after A-1 ?
It's not a trivial job and you have to come up with rules for all the odd cases that may arise, but I would think you could get it together in a few hours of work.
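A minimal sketch of such a bucket-by-bucket comparison is shown below. It is not the alphanum library linked earlier; the name naturalLess is hypothetical, and several rules are assumptions: only unsigned runs of ASCII digits count as numbers, letters compare case-insensitively via std::tolower, and a digit compared with a non-digit falls back to character-code order. Digit runs are compared by length after stripping leading zeros, which avoids integer overflow on long numbers.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns true if `a` should sort before `b` in "natural" order.
bool naturalLess(const std::string& a, const std::string& b) {
    auto isDigit = [](char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; };
    auto lower = [](char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    };

    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (isDigit(a[i]) && isDigit(b[j])) {
            // Skip leading zeros, then find the end of each digit run.
            while (i < a.size() && a[i] == '0') ++i;
            while (j < b.size() && b[j] == '0') ++j;
            std::size_t ie = i, je = j;
            while (ie < a.size() && isDigit(a[ie])) ++ie;
            while (je < b.size() && isDigit(b[je])) ++je;
            // A run with more significant digits is the larger number.
            if (ie - i != je - j) return (ie - i) < (je - j);
            // Same length: plain comparison of the digits decides.
            const int cmp = a.compare(i, ie - i, b, j, je - j);
            if (cmp != 0) return cmp < 0;
            i = ie;
            j = je;
        } else {
            const char ca = lower(a[i]), cb = lower(b[j]);
            if (ca != cb) return ca < cb;
            ++i;
            ++j;
        }
    }
    // The string that ran out first sorts first.
    if (i < a.size() || j < b.size()) return i == a.size();
    // Equal under the natural rules: fall back to plain comparison for a consistent order.
    return a < b;
}
```

The function can be passed directly to std::sort, for example std::sort(v.begin(), v.end(), naturalLess) for a std::vector<std::string> v.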

Without any parsing, there's no way to compare human written numbers (high values first with leading zeroes stripped) and normal characters as part of the same string.
The parsing doesn't need to be terribly complex though. A simple hash table to deal with things like case sensitivity and stripping special characters ('A'='a'=1,'B'='b'=2,... or 'A'=1,'a'=2,'B'=3,..., '-'=0(strip)), remap your string to an array of the hashed values, then truncate number cases (if a number is encountered and the last character was a number, multiply the last number by ten and add the current value to it).
From there, sort as normal.
