I'm trying to find the most efficient way to remove punctuation marks from a string in C++. This is what I currently have:
#include <iostream>
#include <string>
#include <fstream>
#include <iomanip>
#include <stdlib.h>
#include <algorithm>
using namespace std;
void PARSE(string a);
int main()
{
    string f;
    PARSE(f);
    cout << f;
}

void PARSE(string a)
{
    a = "aBc!d:f'a";
    a.erase(remove_if(a.begin(), a.end(), ispunct), a.end());
    cout << a << endl;
}
Is there an easier/more efficient way to do this?
I was thinking of using str.length() to get the length of the string, running it through a for loop, checking ispunct on each character, and removing it if it is punctuation.
No string copies. No heap allocation. No heap deallocation.
void strip_punct(string& inp)
{
    auto to = begin(inp);
    for (auto from : inp)
        if (!ispunct(from))
            *to++ = from;
    inp.resize(distance(begin(inp), to));
}
Comparing to:
void strip_punct_re(string& inp)
{
    inp.erase(remove_if(begin(inp), end(inp), ispunct), end(inp));
}
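As a side note (not part of the benchmark below): passing ispunct directly has undefined behavior if plain char is signed and the string can contain negative char values. If that can happen, a common fix is to wrap it in a lambda that casts to unsigned char:

void strip_punct_safe(string& inp)
{
    // Cast to unsigned char before calling ispunct to avoid UB on negative chars
    inp.erase(remove_if(begin(inp), end(inp),
                        [](unsigned char c) { return ispunct(c); }),
              end(inp));
}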
I created a variety of workloads. As a baseline input, I created a string containing all char values between 32 and 127. I appended this string num-times to create my test string. I called both strip_punct and strip_punct_re with a copy of the test string iters-times. I performed these workloads 10 times timing each test. I averaged the timings after dropping the lowest and highest results. I tested using release builds (optimized) from VS2015 on Windows 10 on a Microsoft Surface Book 4 (Skylake). I SetPriorityClass() for the process to HIGH_PRIORITY_CLASS and timed the results using QueryPerformanceFrequency/QueryPerformanceCounter. All timings were performed without a debugger attached.
num      iters     seconds   seconds (re)   improvement
10000    1000      2.812     2.947          4.78%
1000     10000     2.786     2.977          6.85%
100      100000    2.809     2.952          5.09%
By varying num and iters while keeping the number of processed bytes the same, I was able to see that the cost is primarily influenced by the number of bytes processed rather than per-call overhead. Reading the disassembly confirmed this.
So this version is ~5% faster and generates only about 30% as much code.
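For reference, a minimal sketch of the QueryPerformanceCounter-based timing described above (the actual harness isn't shown in the answer; run_workload is a hypothetical stand-in for strip_punct / strip_punct_re, and the averaging over 10 runs is omitted):

#include <windows.h>
#include <string>

// Time one call of a workload on a copy of the test string, in seconds.
double time_workload(void (*run_workload)(std::string&), std::string work)
{
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    run_workload(work);
    QueryPerformanceCounter(&stop);
    return double(stop.QuadPart - start.QuadPart) / double(freq.QuadPart);
}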
Related
I want to read a text file from local storage. I'm trying to experiment with multiprocessing, so I want to break the text file into smaller chunks and run a process on each of them.
Rough idea:
Input: a 10 KB text file
Program separates it into chunks of 1 KB each
Run a function on each chunk separately (e.g. capitalise certain characters, find the frequency of letters, or search for a word in that chunk)
Output: return the function output with no memory leaks or mismatches in reads
I've tried using pread, but I'm on Windows, so any solution or leads on how to solve this would be helpful.
Maybe you have chosen the wrong example to learn multithreading.
A file stored on a sequential drive will be read fastest in sequential mode.
Therefore, in my example below, I will read the complete file into a string in one go. For test purposes I used a "Lorem Ipsum" generator and created a file with 1 million characters; 1 million is still considered small nowadays.
For demo purposes, I will create 4 parallel threads.
After having this complete file in one string, I will split the big string into 4 substrings, one for each thread.
For the thread function, I created a 4-line test function that calculates the count of letters in a given substring.
For easier learning, I will use std::async to create the threads. The result of std::async is stored in a std::future, from which we can pick up the test function's result later. We need to use std::shared_future to be able to store all of them in a std::array, because std::future's copy constructor is deleted.
Then, we let the threads do their work.
In an additional loop, we use the future's get function, which will wait for thread completion and then give us the result.
We sum up the values from all 4 threads and then print them out in sorted order. Please note: the '\n' characters will also be counted, which will look a little bit strange in the output.
Please note: this is just a demo. It will be even slower than a straightforward solution. It is just to show how multithreading could work.
Please see below one simple example (one of many, many possible solutions):
#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>
#include <iterator>
#include <future>
#include <thread>
#include <array>
#include <set>
// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<char, unsigned int>;
// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Rank = std::multiset<Pair, Comp>;
// ------------------------------------------------------------
// We will use 4 threads for our task
constexpr size_t NumberOfThreads = 4u;
// Some test function used by a thread. Count characters in text
Counter countCharacters(const std::string& text) {
    // Definition of the counter
    Counter counter{};
    // Count all letters
    for (const char c : text) counter[c]++;
    // Give back result
    return counter;
}
// Test / driver code
int main() {
    // Open a test file with 1M characters and check if it could be opened
    if (std::ifstream sourceStream{ "r:\\text.txt" }; sourceStream) {
        // Read the complete 1M file into a string
        std::string text(std::istreambuf_iterator<char>(sourceStream), {});
        // ------------------------------------------------------------------------------------------------
        // This is the multithreading part
        // We will split the big string into parts and give each thread the task of working on one part
        // Calculate the length of one partition + some reserve in case of rounding problems
        const size_t partLength = text.length() / NumberOfThreads + NumberOfThreads;
        // We will create NumberOfThreads substrings starting at equidistant positions. This is the start.
        size_t threadStringStartpos = 0;
        // Container for the futures. Please note: we can only store shared futures in containers.
        std::array<std::shared_future<Counter>, NumberOfThreads> counter{};
        // Now create the threads
        for (unsigned int threadNumber{}; threadNumber < NumberOfThreads; ++threadNumber) {
            // Start a thread, store the returned future, and call it with our test function and a part of the string
            counter[threadNumber] = std::async(countCharacters, text.substr(threadStringStartpos, partLength));
            // Calculate the start of the next part of the string
            threadStringStartpos += partLength;
        }
        // Combine results from threads
        Counter result{};
        for (unsigned int threadNumber{}; threadNumber < NumberOfThreads; ++threadNumber) {
            // get() will fetch the result from the thread via the assigned future
            for (const auto& [letter, count] : counter[threadNumber].get())
                result[letter] += count; // Sum up all counts
        }
        // ------------------------------------------------------------------------------------------------
        for (const auto& [letter, count] : Rank(result.begin(), result.end()))
            std::cout << letter << " --> " << count << '\n';
    }
    else std::cerr << "\n*** Error: Could not open source file\n";
}
I noticed that the string capacities in C++ follow this pattern:
the initial string capacity is 15
whenever a string grows beyond its current capacity, the capacity is doubled
Here are the string capacities for strings up to length 500:
15
30
60
120
240
480
960
Capacities were found with the following C++ program:
#include <iostream>
#include <string>
#include <vector>
using namespace std;
string getstr(int len) {
    string s = "";
    for (int i = 0; i < len; i++) {
        s.append("1");
    }
    return s;
}

int main() {
    vector<int> capacities;
    int prevcap = 0;   // initialize so the first comparison is well defined
    for (int i = 0; i < 500; i++) {
        int cap = getstr(i).capacity();
        if (cap > prevcap) {
            capacities.push_back(cap);
            prevcap = cap;
        }
    }
    for (int i : capacities) {
        cout << i << endl;
    }
}
What is the logic behind choosing this algorithm? Do the numbers (here 15 and 2) have any significance, or have they been randomly chosen? Also, does this algorithm vary from compiler to compiler? (This was compiled and tested with g++ 5.4.0 on Ubuntu 16.04) Any insights are appreciated.
Doubling is a well-known method. It amortizes the cost of reallocation, making push_back an amortized constant-time operation (as it is required to be): over n appends, the total number of characters copied across all reallocations is proportional to n, because each growth copies as many characters as the previous capacity and those capacities form a geometric series. The 'obvious' alternative of adding a fixed size would make push_back an amortized linear-time operation. Other patterns are possible though; any multiplicative increase would work theoretically, and I once read an article advocating that each increased capacity should be taken from the next term in a Fibonacci sequence.
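A toy illustration of the difference (not from the answer, just counting how many characters would be copied under each growth policy):

#include <cstdio>

// Count characters copied while growing a buffer to hold n characters,
// once with doubling and once with a fixed +15 increment.
int main() {
    const long long n = 1'000'000;
    long long copies_doubling = 0, copies_fixed = 0;
    for (long long cap = 15; cap < n; cap *= 2)  copies_doubling += cap; // each growth copies 'cap' chars
    for (long long cap = 15; cap < n; cap += 15) copies_fixed    += cap;
    std::printf("doubling: %lld copies, fixed +15: %lld copies, for %lld appends\n",
                copies_doubling, copies_fixed, n);
}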
I imagine the initial size of 15 is chosen with short string optimization (SSO) in mind. With SSO the string data is stored in the string object itself instead of in separately allocated memory. I imagine 15 is the largest short string that can be accommodated in this particular implementation. Knowing what sizeof(std::string) is might shed some light on this.
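A quick way to check that, as suggested (the output varies by standard library implementation):

#include <iostream>
#include <string>

int main() {
    std::cout << "sizeof(std::string):      " << sizeof(std::string) << '\n';
    std::cout << "default-constructed cap.: " << std::string{}.capacity() << '\n';
}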
I wrote a program that prints all possible 4-letter words; the letters can be in upper case or lower case. It took 42 minutes, which is a long amount of time.
char Something[5] = {0, 0, 0, 0, 0};
for (int i = 65; i <= 122; i++) {          // 65 is the ASCII code of 'A' and 122 of 'z'
    Something[0] = i;
    cout << Something << endl;
    for (int j = 65; j <= 122; j++) {
        Something[1] = j;
        cout << Something << endl;
        for (int n = 65; n <= 122; n++) {
            Something[2] = n;
            cout << Something << endl;
            for (int m = 65; m <= 122; m++) {
                Something[3] = m;
                cout << Something << endl;
            }
        }
    }
}
So I need to know what takes most of the time in the program, and how I can make it more efficient.
We can get rid of the calls to endl, use only letters, and simply write out each string once it's complete:
#include <string>
#include <vector>
#include <iostream>
int main() {
    std::string out = "    ";   // 4 placeholder characters
    std::string letters = "abcdefghijklmnopqrstuvwxyz"
                          "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    for (char f : letters)
        for (char g : letters)
            for (char h : letters)
                for (char i : letters) {
                    out[0] = f;
                    out[1] = g;
                    out[2] = h;
                    out[3] = i;
                    std::cout << out << '\n';
                }
}
A quick test on my machine (which is rather trailing edge hardware--an AMD A8-7600) shows this running in a little over half a second (with the output directed to a file). Realistically, the time is likely to depend more upon disk speed than CPU speed. It produces about 30 megabytes of output, so on a typical disk with a maximum write speed of 100 megabytes per second (or so) the minimum time would be around a third of a second, regardless of CPU speed (though you might be able to do considerably better with a really fast CPU and an SSD).
Riffing on @Jerry Coffin's answer (which is already a big win over the OP's solution), I get an extra 20x improvement on my machine:
#include <string>
#include <vector>
#include <iostream>
int main() {
    const char l[] = "abcdefghijklmnopqrstuvwxyz"
                     "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    // trim away the final NUL
    const char (&letters)[sizeof(l)-1] = (const char (&)[sizeof(l)-1])l;
    std::vector<char> obuf(5*sizeof(letters)*sizeof(letters)*sizeof(letters));
    for (char f : letters) {
        char *out = obuf.data();
        for (char g : letters) {
            for (char h : letters) {
                for (char i : letters) {
                    out[0] = f;
                    out[1] = g;
                    out[2] = h;
                    out[3] = i;
                    out[4] = '\n';
                    out += 5;
                }
            }
        }
        std::cout.write(obuf.data(), obuf.size());
    }
    return 0;
}
Redirecting to /dev/null (edit: or to a file on my disk; Linux seems pretty good at IO caching) on my machine Jerry Coffin's answer takes roughly 400 ms, mine takes 20 ms.
This is not surprising if you consider that the inner loop here is just trivial pointer manipulation into a preallocated buffer, with no function calls, extra branches, or "complicated" stuff that poisons the registers and wastes time (operator<< is quite a complicated beast even for chars - for no good reason if you ask me). IO (plus the iostream overhead) is done just once every ~700 KB (the buffer holds 5 * 52³ ≈ 687 KB), so its cost is well amortized.
I have to apologize for my poor English first.
I'm learning hardware transactional memory now and I'm using spin_rw_mutex.h from TBB to implement a transaction block in C++. speculative_spin_rw_mutex, a class in spin_rw_mutex.h, is a mutex that already implements the RTM interface of Intel TSX.
The example I used to test RTM is very simple. I created the Account class and I transfer money from one account to another randomly. All accounts are in an accounts array of size 100. The random functions are from boost (I think the STL has the same random functions). The transfer function is protected with the speculative_spin_rw_mutex. I used tbb::parallel_for and tbb::task_scheduler_init to control concurrency. All transfer calls happen in the lambda of parallel_for. The total number of transfers is 1 million. The strange thing is that when task_scheduler_init is set to 2, the program is fastest (8 seconds). In fact my CPU is an i7 6700K, which has 8 threads. In the range from 8 to 50,000, the performance of the program hardly changes (11 to 12 seconds). When I increase task_scheduler_init to 100,000, the run time increases to about 18 seconds.
I tried to use a profiler to analyze the program and I found that the hotspot function is the mutex. However, I think the transaction roll-back rate is not that high. I don't know why the program is so slow.
Somebody said that false sharing slows down the performance, so I tried to use
std::vector<Account, cache_aligned_allocator<Account>> cache_aligned_accounts(AccountsSIZE, Account(1000));
to replace the original array
Account* accounts[AccountsSIZE];
to avoid false sharing. It seems nothing changed.
Here is my new code.
#include <tbb/spin_rw_mutex.h>
#include <iostream>
#include "tbb/task_scheduler_init.h"
#include "tbb/task.h"
#include "boost/random.hpp"
#include <ctime>
#include <tbb/parallel_for.h>
#include <tbb/spin_mutex.h>
#include <tbb/cache_aligned_allocator.h>
#include <vector>
using namespace tbb;
tbb::speculative_spin_rw_mutex mu;
class Account {
private:
int balance;
public:
Account(int ba) {
balance = ba;
}
int getBalance() {
return balance;
}
void setBalance(int ba) {
balance = ba;
}
};
// Transfer function. Using speculative_spin_rw_mutex to protect the critical section
void transfer(Account &from, Account &to, int amount) {
speculative_spin_rw_mutex::scoped_lock lock(mu);
if ((from.getBalance())<amount)
{
throw std::invalid_argument("Illegal amount!");
}
else {
from.setBalance((from.getBalance()) - amount);
to.setBalance((to.getBalance()) + amount);
}
}
const int AccountsSIZE = 100;
// Random number generator and distributions
boost::random::mt19937 gener(time(0));
boost::random::uniform_int_distribution<> distIndex(0, AccountsSIZE - 1);
boost::random::uniform_int_distribution<> distAmount(1, 1000);
/*
Function of transfer money
*/
void all_transfer_task() {
task_scheduler_init init(10000); // Set the number of tasks that can run together
/*
Initial accounts, using cache_aligned_allocator to avoid false sharing
*/
std::vector<Account, cache_aligned_allocator<Account>> cache_aligned_accounts(AccountsSIZE,Account(1000));
const int TransferTIMES = 10000000;
//All transfer tasks
parallel_for(0, TransferTIMES, 1, [&](int i) {
try {
transfer(cache_aligned_accounts[distIndex(gener)], cache_aligned_accounts[distIndex(gener)], distAmount(gener));
}
catch (const std::exception& e)
{
//cerr << e.what() << endl;
}
//std::cout << distIndex(gener) << std::endl;
});
std::cout << cache_aligned_accounts[0].getBalance() << std::endl;
int total_balance = 0;
for (size_t i = 0; i < AccountsSIZE; i++)
{
total_balance += (cache_aligned_accounts[i].getBalance());
}
std::cout << total_balance << std::endl;
}
As Intel TSX works at cache line granularity, false sharing is definitely the thing to start with. Unfortunately, cache_aligned_allocator does not do what you are probably expecting: it aligns the whole std::vector, but you need each individual Account to occupy a whole cache line to prevent false sharing.
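One possible way to do that (a sketch, not code from the question, assuming 64-byte cache lines and reusing the Account class defined above) is to pad each account to a full cache line:

#include <vector>

// alignas(64) forces each element onto its own 64-byte cache line, so two
// accounts can never share a line (C++17: std::allocator honours the alignment).
struct alignas(64) PaddedAccount {
    Account account;
};

std::vector<PaddedAccount> accounts(AccountsSIZE, PaddedAccount{ Account(1000) });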
While I can't reproduce your benchmark, I see here two possible causes for this behavior:
"Too many cooks boil the soup": you use a single spin_rw_mutex that is locked by all the transfers by all the threads. Seems to me that your transfers execute sequentially. This would explain why the profile sees a hot point there. The Intel page warns against performance degradation in such case.
Throughput vs. speed: on an i7, in a couple of benchmarks, I noticed that when you use more cores, each core runs a little bit slower, so that the overall time of fixed-size loops gets longer. However, counting the overall throughput (i.e. the total number of transactions that happen across all these parallel loops), the throughput is much higher (although not fully proportional to the number of cores).
I'd rather opt for the first case, but the second should not be ruled out.
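Regarding the first point, a minimal sketch of one way to reduce the single-lock contention (my illustration, not code from the question; it uses plain std::mutex rather than the TSX mutex) is to give each account its own lock and lock both ends of a transfer together:

#include <mutex>
#include <stdexcept>

// Hypothetical per-account locking: each account carries its own mutex, and
// std::scoped_lock acquires both mutexes deadlock-free in a single statement.
struct LockedAccount {
    std::mutex m;
    int balance = 1000;
};

void transfer(LockedAccount& from, LockedAccount& to, int amount) {
    if (&from == &to) return;                 // avoid locking the same mutex twice
    std::scoped_lock lock(from.m, to.m);      // C++17; locks both without deadlock
    if (from.balance < amount)
        throw std::invalid_argument("Illegal amount!");
    from.balance -= amount;
    to.balance   += amount;
}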
I have a program that starts up and within about 5 minutes the virtual size of the process is about 13 GB. It runs on Linux, uses boost, the GNU C++ library and various other 3rd party libraries.
After 5 minutes the virtual size stays at 13 GB and the RSS holds steady at around 5 GB.
I can't just run it in a debugger because at startup about 30 threads are started, each of which starts running its own code, that does various allocations. So stepping through and checking virtual memory at different parts of code at each breakpoint is not feasible.
I thought of changing program to start each thread one at a time to make it easier to track allocation of memory, but before doing this are there any good tools?
Valgrind is fairly slow; maybe tcmalloc could provide the info?
I would use valgrind (perhaps run it an entire night) or else use Boehm GC.
Alternatively, use the proc(5) filesystem to understand (e.g. thru /proc/$pid/statm & /proc/$pid/maps) when a lot of memory gets allocated.
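For example, a small helper along these lines could be called periodically to log the process's own memory figures (a sketch; the first two fields of /proc/self/statm are total program size and resident set size, in pages):

#include <fstream>
#include <iostream>
#include <unistd.h>

// Print virtual and resident size of the current process, in MB,
// by reading /proc/self/statm (values there are in pages).
void print_memory_usage() {
    long size_pages = 0, resident_pages = 0;
    std::ifstream statm("/proc/self/statm");
    if (statm >> size_pages >> resident_pages) {
        const long page_kb = sysconf(_SC_PAGESIZE) / 1024;
        std::cout << "virtual: "  << size_pages * page_kb / 1024 << " MB, "
                  << "resident: " << resident_pages * page_kb / 1024 << " MB\n";
    }
}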
The most important thing is to find memory leaks. If the memory doesn't grow after startup, it is less of an issue.
Perhaps adding instance counters to each class might help (use atomic integers or mutexes to serialize them).
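A minimal sketch of such an instance counter (the class name is hypothetical; dump the counters periodically to see which class keeps growing):

#include <atomic>

// Bump the counter on construction and copy, decrement on destruction.
class Foo {
public:
    Foo()           { ++instances; }
    Foo(const Foo&) { ++instances; }
    ~Foo()          { --instances; }
    static std::atomic<long> instances;
};
std::atomic<long> Foo::instances{0};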
If the program's source code is big (e.g. a million source lines), so that spending several days or weeks is worth the effort, perhaps customizing the GCC compiler (e.g. with MELT) might be relevant.
A std::set minibenchmark
You mentioned a big std::set based on millions of rows.
#include <set>
#include <string>
#include <string.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <time.h>
class MyElem
{
int _n;
char _s[16-sizeof(_n)];
public:
MyElem(int k) : _n(k)
{
snprintf (_s, sizeof(_s), "%d", k);
};
~MyElem()
{
_n=0;
memset(_s, 0, sizeof(_s));
};
int n() const
{
return _n;
};
std::string str() const
{
return std::string(_s);
};
bool less(const MyElem&x) const
{
return _n < x._n;
};
};
bool operator < (const MyElem& l, const MyElem& r)
{
return l.less(r);
}
typedef std::set<MyElem> MySet;
void bench (int cnt, MySet& set)
{
for (long i=0; i<(long)cnt*1024; i++)
set.insert(MyElem(i));
time_t now = 0;
time (&now);
set.insert (((now) & 0xfffffff) * 100);
}
int main (int argc, char** argv)
{
MySet s;
clock_t cstart, cend;
int c = argc>1?atoi(argv[1]):256;
if (c<16) c=16;
printf ("c=%d Kiter\n", c);
cstart = clock();
bench (c, s);
cend = clock();
int x = getpid();
char cmdbuf[64];
snprintf(cmdbuf, sizeof(cmdbuf), "pmap %d", x);
printf ("running %s\n", cmdbuf);
fflush (NULL);
system(cmdbuf);
putchar('\n');
printf ("at end c=%d Kiter clockdiff=%.2f millisec = %.f µs/Kiter\n",
c, (cend-cstart)*1.0e-3, (double)(cend-cstart)/c);
if (s.find(x) != s.end())
printf("set has %d\n", x);
else
printf("set don't contain %d\n", x);
return 0;
}
Notice the 16-byte sizeof(MyElem). On Debian/Sid/AMD64 with GCC 4.8.1 (Intel i3770K processor, 16 GB RAM), I compiled that benchmark with g++ -Wall -O1 tset.cc -o ./tset-01.
With 32768 thousand iterations, so 32M elements:
total 2109592K
(last line above given by pmap)
at end c=32768 Kiter clockdiff=16470.00 millisec = 503 µs/Kiter
Then the time reported by my zsh:
./tset-01 32768 16.77s user 0.54s system 99% cpu 17.343 total
This is about 2.1 GB, so perhaps 64.3 bytes per element including the set's per-member overhead (since sizeof(MyElem) == 16, the set seems to have a non-negligible cost of perhaps 6 words per element, which is plausible for a red-black tree node carrying three pointers and a color field plus allocator bookkeeping).