How to test the integrity of hardware on an AWS instance?

I have a cluster of consumers (50 or so instances) consuming from Kafka partitions.
I notice that one server is consistently slow. Its CPU usage is always around 80-100%, while the others sit at around 50%.
Originally I thought there was a slight chance that this was traffic dependent, so I manually switched the partitions that the slow loader was consuming.
However, I did not observe an increase in processing speed.
I also don't see CPU steal in iostat, but since all consumers run the same code, I suspect there is some bottleneck in the hardware.
Unfortunately, I can't just replace the server unless I can provide conclusive proof that the hardware is the problem.
So I want to write a load-testing script that pinpoints the bottleneck.
My plan is to write a while loop in Python that does n computations and find out the maximum number of computations the slow consumer can do versus the fast consumer.
What other testing strategies can I use?
Perhaps I should also test for a disk bottleneck by having my Python script write to a text file?
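Something like the following is what I have in mind (a rough sketch; the iteration count, file size, and file path are arbitrary - the idea is to run the identical script on the fast and the slow instance and compare the numbers):

import os
import time

def cpu_benchmark(n=10_000_000):
    # Tight arithmetic loop; returns iterations per second.
    start = time.perf_counter()
    total = 0
    for i in range(n):
        total += i * i
    return n / (time.perf_counter() - start)

def disk_benchmark(path="bench.tmp", size_mb=512, chunk_mb=4):
    # Sequential write followed by fsync; returns MB per second.
    chunk = b"x" * (chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed

if __name__ == "__main__":
    print("CPU: %.0f iterations/s" % cpu_benchmark())
    print("Disk: %.1f MB/s sequential write" % disk_benchmark())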
Here is the fast consumer's iostat:
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          50.01   0.00     3.96     0.13    0.12  45.77

Device:    tps    kB_read/s   kB_wrtn/s   kB_read       kB_wrtn
xvda      1.06         0.16       11.46    422953      30331733
xvdb    377.63         0.01    46937.99     35897  124281808572
xvdc    373.43         0.01    46648.25     26603  123514631628
md0     762.53         0.01    93586.24     22235  247796440032
Here is the slow consumer's iostat:
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          81.58   0.00     5.28     0.11    0.06  12.98

Device:    tps    kB_read/s   kB_wrtn/s   kB_read      kB_wrtn
xvda      1.02         0.40       13.74    371145     12685265
xvdb    332.85         0.02    40775.06     18229  37636091096
xvdc    327.42         0.01    40514.44     10899  37395540132
md0     676.47         0.01    81289.50     11287  75031631060

Related

runtime.futex taking 50% to 70% time in go profile

As part of profiling a couple of Golang services, we are seeing that all of the services spend 55% to 70% of their time in the runtime.futex function.
Note that these services use goroutines, locks, and channels for synchronization and communication between the goroutines.
Is this expected, and should we only care about the time spent in non-"runtime" functions? Or could this be an artifact of incorrect usage of goroutines/locks/channels that is causing this overhead in the futex call?
(pprof) top20
Showing nodes accounting for 43.80s, 80.25% of 54.58s total
Dropped 654 nodes (cum <= 0.27s)
Showing top 20 nodes out of 156
      flat  flat%   sum%        cum   cum%
    31.05s 56.89% 56.89%     31.05s 56.89%  runtime.futex
     3.55s  6.50% 63.39%      4.51s  8.26%  runtime.cgocall
     2.25s  4.12% 67.52%      2.25s  4.12%  runtime.usleep
     1.14s  2.09% 69.60%      1.53s  2.80%  runtime.scanobject
     0.73s  1.34% 70.94%      1.56s  2.86%  syscall.Syscall
     0.71s  1.30% 72.24%      2.27s  4.16%  runtime.mallocgc
(pprof) top20
Showing nodes accounting for 12.49s, 95.93% of 13.02s total
Dropped 93 nodes (cum <= 0.07s)
Showing top 20 nodes out of 47
      flat  flat%   sum%        cum   cum%
     9.69s 74.42% 74.42%      9.69s 74.42%  runtime.futex
     1.25s  9.60% 84.02%      1.63s 12.52%  runtime.cgocall
     0.64s  4.92% 88.94%      0.64s  4.92%  runtime.usleep
     0.17s  1.31% 90.25%      0.22s  1.69%  runtime.timeSleepUntil
     0.11s  0.84% 91.09%      0.17s  1.31%  runtime.scanobject
     0.08s  0.61% 91.71%      0.08s  0.61%  runtime.nanotime1
     0.08s  0.61% 92.32%      0.26s  2.00%  runtime.retake
     0.07s  0.54% 92.86%      0.07s  0.54%  runtime.lock2
     0.07s  0.54% 93.39%      0.13s     1%  runtime.traceEventLocked
     0.06s  0.46% 93.86%      0.36s  2.76%  runtime.notetsleep_internal
     0.05s  0.38% 94.24%      0.13s     1%  runtime.mallocgc
This is not unusual for a program that isn't doing very much. If your program spends most of its time waiting for something to happen, it will often be sleeping in a futex.
FYI:
I faced the same problem. In my case, I had a worker pool implementation in which 200 goroutines were listening on the same channel, which in turn caused the contention. I then switched to one channel per goroutine, which reduced runtime.futex considerably.

Why CPU Utilization on VM Instance has intervals? (Google Cloud Platform)

I'm new to the Google Cloud Platform and am currently trying to understand the CPU utilization chart (attached below).
This is the chart for a medium machine type with 2 reserved vCPUs.
What I can't understand is why the CPU utilization has this pattern (the line goes up and down, from 5% to 26%, again and again) when my machine's usage is more or less constant. I know that small machines allow CPU bursting, but that doesn't seem to be the explanation, since my usage never topped the CPU cap.
Details on VM:
machine type is e2-medium with 2 vCPUs and 4 GB of memory
the instance is used as a white-label server
I'll be grateful for any hint!
The metric you mentioned indicates that some process on your VM uses the CPU at specific time intervals.
Without more information, it's a guessing game to figure out why your chart spikes from 5% to 25% and back.
To nail down the process that's causing it, you may try using ps or the more "human friendly" htop command and see what's going on in your system.
For example:
$ ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head
  PID  PPID CMD                         %MEM %CPU
  397     1 /usr/bin/google_osconfig_ag  0.7  0.0
  437     1 /usr/bin/google_guest_agent  0.5  0.0
  408     1 /usr/bin/python3 /usr/share  0.4  0.0
    1     0 /sbin/init                   0.2  0.0
  234     1 /lib/systemd/systemd-journa  0.2  0.0
17419     1 /lib/systemd/systemd --user  0.2  0.0
17416   543 sshd: wb [priv]              0.2  0.0
  575     1 /lib/systemd/systemd-logind  0.1  0.0
  543     1 /usr/sbin/sshd -D            0.1  0.0
Here are some tips on how to use ps, and another useful Q&A regarding it.
Have a look at this answer - it may give you some insight.
Also - this is what the official GCP docs say about the CPU usage metric collected from the agent running on the VM:
Percentage of the total CPU capacity spent in different states. This value is reported from inside the VM and can differ from compute.googleapis.com/instance/cpu/utilization, which is reported by the hypervisor for the VM. Sampled every 60 seconds.
cpu_number: CPU number, for example, "0", "1", or "2". This label is only set with certain Monitoring configurations. Linux only.
cpu_state: CPU state, one of [idle, interrupt, nice, softirq, steal, system, user, wait] on Linux or [idle, used] on Windows.

Google Dataflow Pricing Streaming Mode

I'm new to Dataflow.
I'd like to use the Dataflow streaming template "Pub/Sub Subscription to BigQuery" to transfer some messages, say 10,000 per day.
My question is about pricing, since I don't understand how it is computed for streaming mode, with Streaming Engine enabled or not.
I've used the Google pricing calculator, which asks for the following:
machine type, number of worker nodes used by the job, whether it is a streaming or batch job, number of GB of Persistent Disk (PD), and hours the job runs per month.
Consider the easiest case, since I don't need many resources, i.e.:
Machine type: n1-standard-1
Max workers: 1
Job type: Streaming
Price: in us-central1
Case 1: Streaming Engine DISABLED
Hours using the vCPU = 730 hours (1 month, always active). Is this always true for streaming mode? Or can there be a case in streaming mode where the usage is lower?
Persistent Disk: 430 GB HDD, which is the default value.
So I will pay:
(vCPU) 730 hours x $0.069 (cost per vCPU hour) = $50.37
(PD) 730 hours x $0.000054 x 430 GB = $16.95
(RAM) 730 hours x $0.003557 x 3.75 GB = $9.74
TOTAL: $77.06, as confirmed by the calculator.
Case 2: Streaming Engine ENABLED
Hours using the vCPU = 730 hours
Persistent Disk: 30 GB HDD, which is the default value
So I will pay:
(vCPU) 730 hours x $0.069 (cost per vCPU hour) = $50.37
(PD) 730 hours x $0.000054 x 30 GB = $1.18
(RAM) 730 hours x $0.003557 x 3.75 GB = $9.74
TOTAL: $61.29, PLUS the amount of data processed (which is billed extra with Streaming Engine).
Considering messages of 1024 bytes, we have a traffic of 1024 x 10,000 x 30 bytes = 0.307 GB, and an extra cost of 0.307 GB x $0.018 = $0.005 (almost zero).
So with this kind of traffic, I would save about $15 per month by using Streaming Engine.
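To sanity-check my arithmetic, here are the two cases as a small script (using the rates quoted above for us-central1; treat them as assumptions, since prices may change):

# Rates as quoted above for us-central1 (assumed; may change)
HOURS = 730          # one month, always on
VCPU_RATE = 0.069    # $ per vCPU hour (n1-standard-1 = 1 vCPU)
RAM_RATE = 0.003557  # $ per GB hour (n1-standard-1 = 3.75 GB RAM)
PD_RATE = 0.000054   # $ per GB hour of HDD Persistent Disk

def monthly_cost(pd_gb):
    return HOURS * (VCPU_RATE + RAM_RATE * 3.75 + PD_RATE * pd_gb)

case1 = monthly_cost(430)                  # Streaming Engine disabled -> ~$77.06
case2 = monthly_cost(30)                   # Streaming Engine enabled  -> ~$61.29
data_gb = 1024 * 10_000 * 30 / 1e9         # ~0.307 GB processed per month
extra = data_gb * 0.018                    # ~$0.005 Streaming Engine data charge
print(round(case1, 2), round(case2 + extra, 2), round(case1 - case2 - extra, 2))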
Am I correct? Is there something else to consider or something wrong with my assumptions and my calculations?
Additionally, considering the low amount of data, is Dataflow really suited to this kind of use? Or should I approach this problem in a different way?
Thank you in advance!
That reasoning is not wrong, but it's not perfectly accurate.
In streaming mode, your Dataflow job always listens to the Pub/Sub subscription, and thus it needs to be up full time.
In batch processing, you normally start the batch, it performs its job, and then it stops.
In your comparison, you assume a batch job that runs full time. That's not impossible, but I don't think it fits your use case.
Whether to use streaming or batch depends entirely on your need for real-time data.
If you want to ingest the data into BigQuery with low latency (within a few seconds) to have real-time data, streaming is the right choice.
If having the data updated only every hour or every day is acceptable, batch is the more suitable solution.
A final remark: if your task is only to get messages from Pub/Sub and stream them into BigQuery, you can consider coding it yourself on Cloud Run or Cloud Functions. With only 10k messages per day, it will be free!
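For illustration, a minimal sketch of that do-it-yourself approach as a Pub/Sub-triggered Cloud Function in Python (the table ID is a placeholder, and I assume each message body is a JSON object matching your BigQuery table schema):

import base64
import json
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.my_table"  # placeholder; replace with your table

def pubsub_to_bigquery(event, context):
    # Background Cloud Function triggered by a Pub/Sub message.
    payload = base64.b64decode(event["data"]).decode("utf-8")
    row = json.loads(payload)  # assumes the message is JSON matching the table schema
    errors = client.insert_rows_json(TABLE_ID, [row])  # streaming insert into BigQuery
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")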

rocksdb write stall with many writes of the same data

First of all I have to explain our setup, as it is quite different from the usual database server with RAID groups and so on. Usually our customers buy a server like an HP DL380 Gen10 with two 300 GB HDDs (not SSDs), which are configured as a RAID 1 and run Windows.
We only manage the metadata of other storage systems here, so that a client can query us and find its information on those large storage systems.
As our old database kept getting corrupted, we searched for a new, more stable database without much overhead and found RocksDB, currently at version 6.12.0.
Unfortunately, after it has been running for some hours, it seems to block my program for many minutes due to a write stall:
2020/10/01-15:58:44.678646 1a64 [WARN] [db\column_family.cc:876] [default]
Stopping writes because we have 2 immutable memtables (waiting for flush),
max_write_buffer_number is set to 2
Am I right that the write workload is too much for the machine?
Our software is a service which retrieves at least one update per second from up to 2000 different servers (the limit might be increased in the future, if possible). Most of the time the same database entries are just updated/written again, as one of the pieces of information inside an entry is the current time of the respective server. Of course I could try to write the data to the HDD less often, but then, if a client requests this data from us, our information would not be up to date.
So my questions are:
1. I assume that currently every write request is really written to disk. Is there a way to enable some kind of cache (or maybe increase its size, if it is not sufficient) so that the data is written to the HDD less often, but read requests still return the correct data from memory?
2. I also see that there is a merge operator, but I'm not sure when this merge would take place. Is there already a cache like the one mentioned under 1., where the data is collected for some time, then merged and then written to the HDD?
3. Are there any other optimizations which could help me in this situation?
Any help is welcome. Thanks in advance.
Here are some more log lines, in case they are of interest:
** File Read Latency Histogram By Level [default] **
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
L0 2/0 17.12 MB 0.5 0.0 0.0 0.0 0.4 0.4 0.0 1.0 0.0 81.5 5.15 0.00 52 0.099 0 0
L1 3/0 192.76 MB 0.8 3.3 0.4 2.9 3.2 0.3 0.0 8.0 333.3 327.6 10.11 0.00 13 0.778 4733K 119K
L2 20/0 1.02 GB 0.4 1.6 0.4 1.2 1.2 -0.0 0.0 3.1 387.5 290.0 4.30 0.00 7 0.614 2331K 581K
Sum 25/0 1.22 GB 0.0 4.9 0.8 4.1 4.9 0.7 0.0 11.8 257.4 254.5 19.56 0.00 72 0.272 7064K 700K
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0
** Compaction Stats [default] **
Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
Low 0/0 0.00 KB 0.0 4.9 0.8 4.1 4.5 0.3 0.0 0.0 349.5 316.4 14.40 0.00 20 0.720 7064K 700K
High 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.4 0.4 0.0 0.0 0.0 81.7 5.12 0.00 51 0.100 0 0
User 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 50.9 0.03 0.00 1 0.030 0 0
Uptime(secs): 16170.9 total, 0.0 interval
Flush(GB): cumulative 0.410, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 4.86 GB write, 0.31 MB/s write, 4.92 GB read, 0.31 MB/s read, 19.6 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
** File Read Latency Histogram By Level [default] **
2020/10/01-15:53:21.248110 1a64 [db\db_impl\db_impl_write.cc:1701] [default] New memtable created with log file: #10465. Immutable memtables: 0.
2020/10/01-15:58:44.678596 1a64 [db\db_impl\db_impl_write.cc:1701] [default] New memtable created with log file: #10466. Immutable memtables: 1.
2020/10/01-15:58:44.678646 1a64 [WARN] [db\column_family.cc:876] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2020/10/01-16:02:57.448977 2328 [db\db_impl\db_impl.cc:900] ------- DUMPING STATS -------
2020/10/01-16:02:57.449034 2328 [db\db_impl\db_impl.cc:901]
** DB Stats **
Uptime(secs): 16836.8 total, 665.9 interval
Cumulative writes: 20M writes, 20M keys, 20M commit groups, 1.0 writes per commit group, ingest: 3.00 GB, 0.18 MB/s
Cumulative WAL: 20M writes, 0 syncs, 20944372.00 writes per sync, written: 3.00 GB, 0.18 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 517K writes, 517K keys, 517K commit groups, 1.0 writes per commit group, ingest: 73.63 MB, 0.11 MB/s
Interval WAL: 517K writes, 0 syncs, 517059.00 writes per sync, written: 0.07 MB, 0.11 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
Our watchdog triggered a breakpoint after a mutex had been blocked for several minutes; after that, this showed up in the logfile:
2020/10/02-17:44:18.602776 2328 [db\db_impl\db_impl.cc:900] ------- DUMPING STATS -------
2020/10/02-17:44:18.602990 2328 [db\db_impl\db_impl.cc:901]
** DB Stats **
Uptime(secs): 109318.0 total, 92481.2 interval
Cumulative writes: 20M writes, 20M keys, 20M commit groups, 1.0 writes per commit group, ingest: 3.00 GB, 0.03 MB/s
Cumulative WAL: 20M writes, 0 syncs, 20944372.00 writes per sync, written: 3.00 GB, 0.03 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
L0 2/0 17.12 MB 0.5 0.0 0.0 0.0 0.4 0.4 0.0 1.0 0.0 81.5 5.15 0.00 52 0.099 0 0
L1 3/0 192.76 MB 0.8 3.3 0.4 2.9 3.2 0.3 0.0 8.0 333.3 327.6 10.11 0.00 13 0.778 4733K 119K
L2 20/0 1.02 GB 0.4 1.6 0.4 1.2 1.2 -0.0 0.0 3.1 387.5 290.0 4.30 0.00 7 0.614 2331K 581K
Sum 25/0 1.22 GB 0.0 4.9 0.8 4.1 4.9 0.7 0.0 11.8 257.4 254.5 19.56 0.00 72 0.272 7064K 700K
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0
** Compaction Stats [default] **
Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
Low 0/0 0.00 KB 0.0 4.9 0.8 4.1 4.5 0.3 0.0 0.0 349.5 316.4 14.40 0.00 20 0.720 7064K 700K
High 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.4 0.4 0.0 0.0 0.0 81.7 5.12 0.00 51 0.100 0 0
User 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 50.9 0.03 0.00 1 0.030 0 0
Uptime(secs): 109318.0 total, 92481.2 interval
Flush(GB): cumulative 0.410, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 4.86 GB write, 0.05 MB/s write, 4.92 GB read, 0.05 MB/s read, 19.6 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 1 memtable_compaction, 0 memtable_slowdown, interval 0 total count
** File Read Latency Histogram By Level [default] **
RocksDB does have a built-in cache, which you can definitely explore, but I suspect your write stalls can be handled by increasing either the max_write_buffer_number or the max_background_flushes configuration parameter.
The former will reduce the size of the memory table to flush, while the latter will increase the number of background threads available for flushing operations.
It's hard to say whether the workload is too high for the machine just from this information, because there are a lot of moving parts, but I doubt it. For starters, two background flush threads is pretty low. Also, I'm obviously not familiar with the workload you're testing for, but since you're not using the caching mechanism, the performance can only improve from here.
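For example, roughly like this (shown with the python-rocksdb bindings purely to illustrate which knobs are meant; your service presumably sets the same-named fields on rocksdb::Options in C++, the values here are arbitrary starting points, and option availability depends on your RocksDB version and bindings):

import rocksdb  # python-rocksdb; option names mirror the C++ Options struct

opts = rocksdb.Options()
opts.create_if_missing = True
opts.max_write_buffer_number = 4            # allow more memtables before writes stall (your log shows 2)
opts.write_buffer_size = 64 * 1024 * 1024   # bigger memtables -> fewer, larger flushes
opts.max_background_flushes = 2             # more background threads available for flushing

db = rocksdb.DB("metadata.db", opts)        # "metadata.db" is a placeholder path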
Disk writes are tricky. Any Linux application "writing to disk" is really calling the write system call and thus writing to an intermediate kernel buffer, which is then written to its intended destination once the kernel deems it fit and proper. The process of actually updating the data on the disk is known as writeback (I highly recommend reading the file I/O chapter of Robert Love's Linux System Programming for more on this).
You can use the fsync(2) system call to manually commit a file's data from the system cache to disk, but it's not recommended, since the kernel tends to know when the best time to do this is.
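To make the distinction concrete, a tiny illustration (the file name is arbitrary):

import os

# write() only puts the data into the kernel page cache (the "intermediate kernel buffer");
# the kernel writes it back to the device later. fsync() forces that writeback now.
with open("example.dat", "wb") as f:
    f.write(b"some metadata")
    f.flush()             # flush the user-space buffer down to the kernel
    os.fsync(f.fileno())  # block until the kernel has committed the data to the device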
Before rolling any bespoke optimizations, I would check out the tuning guide. If tuning the configuration for your workload doesn't solve the problem, I would probably look into a RAID configuration. Write stalls are a disk I/O throughput limitation, so data striping might give you enough of a boost, and even give you the option to add some redundancy to your storage solution, depending on the RAID configuration you opt for.
Either way, my gut feeling is that 2000 connections per second is way too low for the problem to be the current machine's CPU or even memory. Unless each request requires a significant amount of work to process, I doubt this is it. Of course, it never hurts to make sure your endpoint's server configuration is also optimally tuned.

How to improve SSD I/O throughput concurrency in Linux

The program below reads in a bunch of lines from a file and parses them. It could be faster. On the other hand, if I have several cores and several files to process, that shouldn't matter much; I can just run jobs in parallel.
Unfortunately, this doesn't seem to work on my Arch Linux machine. Running two copies of the program is only slightly (if at all) faster than running one copy (see below), and reaches less than 20% of what my drive is capable of. On an Ubuntu machine with identical hardware, the situation is a bit better: I get linear scaling for 3-4 cores, but I still top out at about 50% of my SSD drive's capacity.
What obstacles prevent linear scaling of I/O throughput as the number of cores increases, and what can be done to improve I/O concurrency on the software/OS side?
P.S. - For the hardware alluded to below, a single core is fast enough that reading would be I/O bound if I moved parsing to a separate thread. There are also other optimizations for improving single-core performance. For this question, however, I'd like to focus on the concurrency and how my coding and OS choices affect it.
Details:
Here are a few lines of iostat -x 1 output:
Copying a file to /dev/null with dd:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 883.00 0.00 113024.00 0.00 256.00 1.80 2.04 2.04 0.00 1.13 100.00
Running my program:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 1.00 1.00 141.00 2.00 18176.00 12.00 254.38 0.17 1.08 0.71 27.00 0.96 13.70
Running two instances of my program at once, reading different files:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 11.00 0.00 139.00 0.00 19200.00 0.00 276.26 1.16 8.16 8.16 0.00 6.96 96.70
It's barely better! Adding more cores doesn't increase throughput; in fact, it starts to degrade and become less consistent.
Here's one instance of my program and one instance of dd:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 9.00 0.00 468.00 0.00 61056.00 0.00 260.92 2.07 4.37 4.37 0.00 2.14 100.00
Here is my code:
#include <string>
#include <vector>
#include <boost/filesystem/path.hpp>
#include <boost/algorithm/string.hpp>
#include <boost/filesystem/operations.hpp>
#include <boost/filesystem/fstream.hpp>

typedef boost::filesystem::path path;
typedef boost::filesystem::ifstream ifstream;

int main(int argc, char ** argv) {
    path p{std::string(argv[1])};
    ifstream f(p);
    std::string line;
    std::vector<boost::iterator_range<std::string::iterator>> fields;
    for (getline(f, line); !f.eof(); getline(f, line)) {
        boost::split(fields, line, boost::is_any_of(","));
    }
    f.close();
    return 0;
}
Here's how I compiled it:
g++ -std=c++14 -lboost_filesystem -o gah.o -c gah.cxx
g++ -std=c++14 -lboost_filesystem -lboost_system -lboost_iostreams -o gah gah.o
Edit: Even more details
I clear the memory cache (free page cache, dentries, and inodes) before running the above benchmarks, to keep Linux from serving pages from cache.
My process appears to be CPU-bound; switching to mmap or changing the buffer size via pubsetbuf has no noticeable effect on recorded throughput.
On the other hand, scaling is I/O-bound. If I bring all files into the memory cache before running my program, throughput (now measured via execution time, since iostat can't see it) scales linearly with the number of cores.
What I'm really trying to understand is: when I read from disk using multiple sequential-read processes, why doesn't throughput scale linearly with the number of processes, up to something close to the drive's maximum read speed? Why would I hit an I/O bound without saturating throughput, and how does the point at which I do so depend on the OS/software stack I am running on?
You're not comparing similar things.
You're comparing
Copying a file to /dev/dull with dd:
(I'll assume you meant /dev/null...)
with
int main(int argc, char ** argv) {
    path p{std::string(argv[1])};
    ifstream f(p);
    std::string line;
    std::vector<boost::iterator_range<std::string::iterator>> fields;
    for (getline(f, line); !f.eof(); getline(f, line)) {
        boost::split(fields, line, boost::is_any_of(","));
    }
    f.close();
    return 0;
}
The first just reads raw bytes without a care as to what they are and dumps them into the bit bucket. Your code reads by lines, which need to be identified, and then splits them into a vector.
The way you're reading the data, you read a line, then spend time processing it. The dd command you compare your code to never spends time doing anything other than reading data - it doesn't have to read, then process, then read, then process...
I believe there were at least three issues at play here:
1) My reads were occurring too regularly.
The file I was reading from had predictable-length lines with predictably placed delimiters. By randomly introducing a 1 microsecond delay one time in a thousand, I was able to push throughput among multiple cores up to about 45 MB/s.
2) My implementation of pubsetbuf did not actually set the buffer size.
The standard only specifies that pubsetbuf turns off buffering when a buffer size of zero is specified, as described in this link (thanks, @Andrew Henle); all other behavior is implementation-defined. Apparently my implementation used a buffer size of 8191 (verified by strace), regardless of what value I set.
Being too lazy to implement my own stream buffering for testing purposes, I rewrote the code to read 1000 lines into a vector, then attempt to parse them in a second loop, then repeat the whole procedure until end of file (there were no random delays). This allowed me to scale up to about 50MB/s.
3) My I/O scheduler and settings were not appropriate for my drive and application.
Apparently Arch Linux defaults to the cfq I/O scheduler for my SSD drive, with parameters appropriate for HDD drives. After setting slice_sync to 0, as described here (see Mikko Rantalainen's answer and the linked article), or switching to the noop scheduler, as described here, the original code gets around 60 MB/s maximum throughput when running four cores. This link was also helpful.
With noop scheduling, scaling appears to be almost linear, up to my machine's four physical cores (I have eight with hyperthreading).