I'm basically facing a blocking problem.
I have my server coded based on C++ Boost.ASIO using 8 threads since the server has 8 logical cores.
My problem is that a thread may block for 0.2~1.5 seconds on a MySQL query. I honestly don't know how to get around that, since MySQL C++ Connector does not support asynchronous queries, and I don't know how to design the server "correctly" so that it uses multiple threads for the queries.
This is where I'm asking for opinions of what to do in this case.
Should I create 100 threads to run the SQL queries asynchronously?
Could I have an opinion from experts about this?
Okay, the proper solution to this would be to extend Asio and write a mysql_service implementation to integrate this. I was almost going to find out how this is done right away, but I wanted to get started using an "emulation".
The idea is to have
your business processes using an io_service (as you are already doing)
a database "facade" interface that dispatches async queries into a different queue (io_service) and posts the completion handler back onto the business_process io_service
One subtle tweak is needed here: you need to keep the io_service on the business process side from shutting down as soon as its job queue is empty, since it might still be awaiting a response from the database layer.
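To make the "keep the queue alive while a reply is pending" point concrete, here is a minimal sketch that uses only the standard library rather than Asio. The names task_queue, work_started, work_finished and run_demo are hypothetical stand-ins (for io_service and io_service::work), not Asio API; the sketch only assumes that run() should return once the queue is empty and no work is outstanding.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for io_service: run() drains the queue and returns
// only when the queue is empty AND no outstanding work is registered.
class task_queue {
public:
    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        wakeup_.notify_all();
    }

    // Similar in spirit to constructing an io_service::work object.
    void work_started() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++outstanding_;
    }

    // Similar in spirit to destroying (resetting) that work object.
    void work_finished() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(outstanding_ > 0);
            --outstanding_;
        }
        wakeup_.notify_all();
    }

    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wakeup_.wait(lock, [this] { return !tasks_.empty() || outstanding_ == 0; });
            if (tasks_.empty()) return;  // nothing queued and nothing pending
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop_front();
            lock.unlock();
            task();  // run the handler without holding the lock
            lock.lock();
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable wakeup_;
    std::deque<std::function<void()>> tasks_;
    std::size_t outstanding_ = 0;
};

// Simulates one query: a separate "database" thread posts the completion
// handler back onto the application queue. Returns the number of completions.
int run_demo() {
    task_queue app;
    int completions = 0;
    app.work_started();  // keep app.run() alive while the query is in flight
    std::thread db([&] {
        // ... the blocking database call would happen here ...
        app.post([&] { ++completions; app.work_finished(); });
    });
    app.run();  // returns only after the completion handler has run
    db.join();
    return completions;
}
```

Without the work_started() call, app.run() could return before the database thread has posted its completion, which is exactly the early shutdown described above.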
So, modeling this into a quick demo:
namespace database
{
// data types
struct sql_statement { std::string dml; };
struct sql_response { std::string echo_dml; }; // TODO cover response codes, resultset data etc.
I hope you will forgive my gross simplifications :/
struct service
{
service(unsigned max_concurrent_requests = 10)
: work(io_service::work(service_)),
latency(mt19937(), uniform_int<int>(200, 1500)) // random 0.2 ~ 1.5s
{
for (unsigned i = 0; i < max_concurrent_requests; ++i)
svc_threads.create_thread(boost::bind(&io_service::run, &service_));
}
friend struct connection;
private:
void async_query(io_service& external, sql_statement query, boost::function<void(sql_response response)> completion_handler)
{
service_.post(bind(&service::do_async_query, this, ref(external), std::move(query), completion_handler));
}
void do_async_query(io_service& external, sql_statement q, boost::function<void(sql_response response)> completion_handler)
{
this_thread::sleep_for(chrono::milliseconds(latency())); // simulate the latency of a db-roundtrip
external.post(bind(completion_handler, sql_response { q.dml }));
}
io_service service_;
thread_group svc_threads; // note the order of declaration
optional<io_service::work> work;
// for random delay
random::variate_generator<mt19937, uniform_int<int> > latency;
};
The service is what coordinates a maximum number of concurrent requests (on the "database io_service" side) and ping/pongs the completion back onto another io_service (the async_query/do_async_query combo). This stub implementation emulates latencies of 0.2~1.5s in the obvious way :)
Now comes the client "facade"
struct connection
{
connection(int connection_id, io_service& external, service& svc)
: connection_id(connection_id),
external_(external),
db_service_(svc)
{ }
void async_query(sql_statement query, boost::function<void(sql_response response)> completion_handler)
{
db_service_.async_query(external_, std::move(query), completion_handler);
}
private:
int connection_id;
io_service& external_;
service& db_service_;
};
connection is really only a convenience so we don't have to explicitly deal with various queues on the calling site.
Now, let's implement a demo business process in good old Asio style:
namespace domain
{
struct business_process : id_generator
{
business_process(io_service& app_service, database::service& db_service_)
: id(generate_id()), phase(0),
in_progress(io_service::work(app_service)),
db(id, app_service, db_service_)
{
app_service.post([=] { start_select(); });
}
private:
int id, phase;
optional<io_service::work> in_progress;
database::connection db;
void start_select() {
db.async_query({ "select * from tasks where completed = false" }, [=] (database::sql_response r) { handle_db_response(r); });
}
void handle_db_response(database::sql_response r) {
if (phase++ < 4)
{
if ((id + phase) % 3 == 0) // vary the behaviour slightly
{
db.async_query({ "insert into tasks (text, completed) values ('hello', false)" }, [=] (database::sql_response r) { handle_db_response(r); });
} else
{
db.async_query({ "update tasks set text = 'update' where id = 123" }, [=] (database::sql_response r) { handle_db_response(r); });
}
} else
{
in_progress.reset();
lock_guard<mutex> lk(console_mx);
std::cout << "business_process " << id << " has completed its work\n";
}
}
};
}
This business process starts by posting itself on the app service. It then does a number of db queries in succession, and eventually exits (by doing in_progress.reset() the app service is made aware of this).
A demonstration main, starting 10 business processes on a single thread:
int main()
{
io_service app;
database::service db;
ptr_vector<domain::business_process> bps;
for (int i = 0; i < 10; ++i)
{
bps.push_back(new domain::business_process(app, db));
}
app.run();
}
In my sample, the business_processes don't do any CPU-intensive work, so there's no use in scheduling them across CPUs, but if you wanted to, you could easily achieve this by replacing the app.run() line with:
thread_group g;
for (unsigned i = 0; i < thread::hardware_concurrency(); ++i)
g.create_thread(boost::bind(&io_service::run, &app));
g.join_all();
See the demo running Live On Coliru
I'm not a MySQL guru, but the following is generic multithreading advice.
Having NumberOfThreads == NumberOfCores is appropriate when none of the threads ever block and you are just splitting the load over all CPUs.
A common pattern is to have multiple threads per CPU, so one is executing while another is waiting on something.
In your case, I'd be inclined to set NumberOfThreads = n * NumberOfCores, where 'n' is read from a config file, a registry entry or some other user-settable value. You can test the system with different values of 'n' to find the optimum. I'd suggest somewhere around 3 for a first guess.
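As a rough illustration of that formula, the helper below computes the thread count from a user-supplied multiplier. The name worker_thread_count and the fallback to one core are assumptions for this sketch; how 'n' is loaded from configuration is left out.

```cpp
#include <cassert>
#include <thread>

// NumberOfThreads = n * NumberOfCores.
// `cores` defaults to what the standard library reports for this machine.
unsigned worker_thread_count(unsigned threads_per_core,
                             unsigned cores = std::thread::hardware_concurrency()) {
    assert(threads_per_core > 0);
    if (cores == 0) cores = 1;  // hardware_concurrency() may return 0 when unknown
    return threads_per_core * cores;
}
```

With the 8 logical cores from the question and the suggested first guess of n = 3, worker_thread_count(3, 8) gives 24 threads.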
I'm looking for examples of using triggers and timers in Apache Beam. I want to use processing-time timers to read my data from Pub/Sub every 5 minutes, and processing-time triggers to process all the data collected over an hour together, in Python.
Please take a look at the following resources: Stateful processing with Apache Beam and Timely (and Stateful) Processing with Apache Beam
The first blog post is more general in how to handle states for context, and the second has some examples on buffering and triggering after a certain period of time, which seems similar to what you are trying to do.
A full example was requested. Here is what I was able to come up with:
PCollection<String> records =
pipeline.apply(
"ReadPubsub",
PubsubIO.readStrings()
.fromSubscription(
"projects/{project}/subscriptions/{subscription}"));
TupleTag<Iterable<String>> every5MinTag = new TupleTag<>();
TupleTag<Iterable<String>> everyHourTag = new TupleTag<>();
PCollectionTuple timersTuple =
records
.apply("WithKeys", WithKeys.of(1)) // A KV<> is required to use state. Keying by data is more appropriate than hardcode.
.apply(
"Batch",
ParDo.of(
new DoFn<KV<Integer, String>, Iterable<String>>() {
@StateId("buffer5Min")
private final StateSpec<BagState<String>> bufferedEvents5Min =
StateSpecs.bag();
@StateId("count5Min")
private final StateSpec<ValueState<Integer>> countState5Min =
StateSpecs.value();
@TimerId("every5Min")
private final TimerSpec every5MinSpec =
TimerSpecs.timer(TimeDomain.PROCESSING_TIME);
@StateId("bufferHour")
private final StateSpec<BagState<String>> bufferedEventsHour =
StateSpecs.bag();
@StateId("countHour")
private final StateSpec<ValueState<Integer>> countStateHour =
StateSpecs.value();
@TimerId("everyHour")
private final TimerSpec everyHourSpec =
TimerSpecs.timer(TimeDomain.PROCESSING_TIME);
@ProcessElement
public void process(
@Element KV<Integer, String> record,
@StateId("count5Min") ValueState<Integer> count5MinState,
@StateId("countHour") ValueState<Integer> countHourState,
@StateId("buffer5Min") BagState<String> buffer5Min,
@StateId("bufferHour") BagState<String> bufferHour,
@TimerId("every5Min") Timer every5MinTimer,
@TimerId("everyHour") Timer everyHourTimer) {
if (Objects.firstNonNull(count5MinState.read(), 0) == 0) {
every5MinTimer
.offset(Duration.standardMinutes(5))
.align(Duration.standardMinutes(5))
.setRelative();
}
buffer5Min.add(record.getValue());
if (Objects.firstNonNull(countHourState.read(), 0) == 0) {
everyHourTimer
.offset(Duration.standardMinutes(60))
.align(Duration.standardMinutes(60))
.setRelative();
}
bufferHour.add(record.getValue());
}
@OnTimer("every5Min")
public void onTimerEvery5Min(
OnTimerContext context,
@StateId("buffer5Min") BagState<String> bufferState,
@StateId("count5Min") ValueState<Integer> countState) {
if (!bufferState.isEmpty().read()) {
context.output(every5MinTag, bufferState.read());
bufferState.clear();
countState.clear();
}
}
@OnTimer("everyHour")
public void onTimerEveryHour(
OnTimerContext context,
@StateId("bufferHour") BagState<String> bufferState,
@StateId("countHour") ValueState<Integer> countState) {
if (!bufferState.isEmpty().read()) {
context.output(everyHourTag, bufferState.read());
bufferState.clear();
countState.clear();
}
}
})
.withOutputTags(every5MinTag, TupleTagList.of(everyHourTag)));
timersTuple
.get(every5MinTag)
.setCoder(IterableCoder.of(StringUtf8Coder.of()))
.apply(<<do something every 5 min>>);
timersTuple
.get(everyHourTag)
.setCoder(IterableCoder.of(StringUtf8Coder.of()))
.apply(<< do something every hour>>);
pipeline.run().waitUntilFinish();
OK, so I'm new to all the Kafka stuff and to the cppkafka library in particular. I'm trying to write my own convenience wrapper on top of cppkafka. The producer side is quite straightforward and looks like it does what it is supposed to do:
cppkafka::Producer producer(cppkafka::Configuration{{"metadata.broker.list", "localhost:9092"}});
producer.produce(cppkafka::MessageBuilder(topic[0])
.partition((message_counter++) % partitions)
.payload(buffer.str()));
producer.flush();
Error handling and retries, here and below, were removed for brevity.
This code produces messages that I can see using any Kafka UI.
The consumer side is somewhat more complicated:
cppkafka::Consumer consumer(cppkafka::Configuration{
{"metadata.broker.list", "localhost:9092"},
{"group.id", "MyGroup"},
{"client.id", "MyClient"},
// {"auto.offset.reset", "smallest"},
// {"auto.offset.reset", "earliest"},
// {"enable.auto.offset.store", "false"},
{"enable.partition.eof", "false"},
{"enable.auto.commit", "false"}});
cppkafka::TopicPartitionList assignment;
consumer.set_assignment_callback([&, this](cppkafka::TopicPartitionList& partitions_) {
assignment = partitions_;
LOG_INFO(log, "Partitions assigned");
});
consumer.set_rebalance_error_callback([this](cppkafka::Error error) {
LOG_ERROR(log, "Rebalancing error. Reason: {}", error.to_string());
});
consumer.set_revocation_callback([&, this](const cppkafka::TopicPartitionList&) {
assignment.clear();
LOG_INFO(log, "Partitions revoked");
});
consumer.subscribe({topic});
auto start = std::chrono::high_resolution_clock::now();
while (some_condition)
{
auto subscriptions = consumer.get_subscription();
if (subscriptions.empty())
{
consumer.subscribe(topic);
}
cppkafka::Message msg = consumer.poll(100ms);
if (!msg)
{
if (!assignment.empty())
{
auto committed_offset = consumer.get_offsets_committed(consumer.get_assignment());
consumer.assign(committed_offset);
}
continue;
}
try
{
std::string_view payload_view(reinterpret_cast<const char *>(msg.get_payload().get_data()), msg.get_payload().get_size());
consumer.commit(msg);
}
....
}
It looks like this code is not picking up old uncommitted messages; poll always returns a null message. I have no idea what's going on here or why it leaves old messages in the queue. As you can see at the beginning of the last code snippet, I've tried various consumer settings to no avail. What am I missing here? Do I have to configure something on the Kafka side, like some topic settings? Or is the problem in the consumer configuration?
One important (I think) detail: messages may be added before any consumer connects to Kafka. Maybe that matters?
I am using questdb (embedded) to store a bunch of time series.
I would like to run my storage method inside a parallel stream, but I don't know if TableWriter is thread-safe.
Here is the code:
SqlExecutionContextImpl ctx = new SqlExecutionContextImpl(engine, 1);
try (TableWriter writer = engine.getWriter(ctx.getCairoSecurityContext(), name, "writing")) {
tickerData.stream().parallel().forEach(
r -> {
try {
Instant i = r.getDateTime("DateTime")
.atZone(EST)
.toInstant();
long ts = TimestampFormatUtils.parseTimestamp(i.toString());
TableWriter.Row row = writer.newRow(ts);
row.putDouble(0, r.getDouble("x1"));
row.putDouble(1, r.getDouble("x2"));
row.putDouble(2, r.getDouble("y1"));
row.putDouble(3, r.getDouble("y2"));
row.putDouble(4, r.getDouble("z"));
row.append();
writer.commit();
} catch (NumericException ex) {
log.error("Cannot parse the date {}", r.getDateTime("DateTime"));
} catch (Exception ex) {
log.error("Cannot write to table {}!", name, ex);
}
});
}
This throws all sorts of errors. Is there a way to make the storage process parallel?
Thanks,
Juan
The short answer is that TableWriter is not thread-safe. It is your responsibility not to use it from parallel threads.
The slightly longer answer is that even in standalone QuestDB, parallel writing is restricted. At the moment it is only possible from multiple ILP connections.
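One common way to live with a writer that is not thread-safe is to do the expensive per-record preparation in parallel and keep every write on a single thread. The sketch below shows that pattern in plain C++; it is not QuestDB code. The types raw_record, row and single_thread_writer, and the functions prepare, prepare_all and store, are all hypothetical names made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct raw_record { double x1, x2; };  // hypothetical input record
struct row { double sum; };            // hypothetical prepared row

// Hypothetical writer that, like TableWriter, must only be used from one thread.
struct single_thread_writer {
    std::vector<row> rows;
    void append(const row& r) { rows.push_back(r); }
};

// Preparation step; safe to run concurrently because it touches no shared state.
row prepare(const raw_record& r) { return row{r.x1 + r.x2}; }

// Phase 1: prepare rows in parallel, each worker filling its own slice of `out`.
std::vector<row> prepare_all(const std::vector<raw_record>& input, unsigned workers) {
    assert(workers > 0);
    std::vector<row> out(input.size());
    std::vector<std::thread> pool;
    const std::size_t chunk = (input.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end = std::min(input.size(), begin + chunk);
        if (begin >= end) break;
        pool.emplace_back([&input, &out, begin, end] {
            for (std::size_t i = begin; i < end; ++i) out[i] = prepare(input[i]);
        });
    }
    for (std::thread& t : pool) t.join();
    return out;
}

// Phase 2: append from the calling thread only, in input order.
void store(const std::vector<raw_record>& input, single_thread_writer& writer, unsigned workers) {
    for (const row& r : prepare_all(input, workers)) writer.append(r);
}
```

In the question's code this would correspond to doing the timestamp parsing in the parallel stream and the newRow/append/commit calls afterwards on one thread, though whether that actually helps depends on where the time is spent.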
I'm new to Apex and I have to call a web service for every account (for some thousands of accounts).
Usually a single web service request takes 500 to 5000 ms.
As far as I know schedulable and batchable classes are required for this task.
My idea was to group the accounts by country codes (Europe only) and start a batch for every group.
First batch is started by the schedulable class, next ones start in batch finish method:
global class AccValidator implements Database.Batchable<sObject>, Database.AllowsCallouts {
private List<String> countryCodes;
private Integer countryIndex;
global AccValidator(List<String> countryCodes, Integer countryIndex) {
this.countryCodes = countryCodes;
this.countryIndex = countryIndex;
...
}
// Get Accounts for current country code
global Database.QueryLocator start(Database.BatchableContext bc) {...}
global void execute(Database.BatchableContext bc, list<Account> myAccounts) {
for (Integer i = 0; i < this.AccAccounts.size(); i++) {
// Callout for every Account
HttpRequest request ...
Http http = new Http();
HttpResponse response = http.send(request);
...
}
}
global void finish(Database.BatchableContext BC) {
if (this.countryIndex < this.countryCodes.size() - 1) {
// start next batch
Database.executeBatch(new AccValidator(this.countryCodes, this.countryIndex + 1), 200);
}
}
global static List<String> getCountryCodes() {...}
}
And my schedule class:
global class AccValidatorSchedule implements Schedulable {
global void execute(SchedulableContext sc) {
List<String> countryCodes = AccValidator.getCountryCodes();
Id AccAddressID = Database.executeBatch(new AccValidator(countryCodes, 0), 200);
}
}
Now I'm stuck with Salesforce's execution governors and limits:
For nearly all callouts I get the exceptions "Read timed out" or "Exceeded maximum time allotted for callout (120000 ms)".
I also tried asynchronous callouts, but they don't work with batches.
So, is there any way to schedule a large number of callouts?
Have you tried limiting the batch size of your execute method to 100? Salesforce only allows 100 callouts per transaction. I.e.
Id AccAddressID = Database.executeBatch(new AccValidator(countryCodes, 0), 100);
Perhaps this might help you:
https://salesforce.stackexchange.com/questions/131448/fatal-errorsystem-limitexception-too-many-callouts-101
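The callout count is probably not the only constraint here. Assuming the 120000 ms in the error message is a budget shared by all callouts made in one execute() call, the batch size also has to fit the slowest expected callout. A small worked calculation, written as a C++ helper with the hypothetical name max_batch_size:

```cpp
#include <algorithm>
#include <cassert>

// Largest batch size for which sequential callouts stay within both limits:
// the per-transaction callout count and the total callout time budget.
int max_batch_size(int callout_limit, int total_timeout_ms, int worst_case_callout_ms) {
    assert(worst_case_callout_ms > 0);
    return std::min(callout_limit, total_timeout_ms / worst_case_callout_ms);
}
```

With the numbers from the question (100 callouts, 120000 ms, up to 5000 ms per request) this gives 24, so under that assumption a batch size of 100 would still run out of time whenever most requests are slow.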
Consider this context:
A group of threads does some work (the work is an infinite loop; this is an embedded project), and the number of threads (and some of their parameters) depends on a database result.
What I need is to remove threads from, or add threads to, that group when there's a change in the database.
Here is the code:
for (result::const_iterator pin = pinesBBB.begin(); pin != pinesBBB.end(); ++pin)
{
string pinStr = pin["pin"].as<string>();
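// thread_group deletes the threads it holds, so the thread must be heap-allocated
// rather than a local variable that goes out of scope at the end of each iteration.
Worker.add_thread(new boost::thread(bind(WorkPin, pinStr)));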
}
Here result is pqxx::result from the pqxx library.
This piece of code iterates over the table returned by an SQL query and creates a thread for every record found.
After that, there's this code that checks the same table every couple of minutes:
void ThreadWorker(boost::thread_group *worker, string *pinesLocales)
{
int threadsVivosInt = worker->size();
string *pinesDB;
int contador;
for (;;)
{
contador = 0;
sleep(60);
try
{
result pinesBBB = TraerPines();
for (result::const_iterator pin = pinesBBB.begin(); pin != pinesBBB.end(); ++pin)
{
pinesDB[contador] = pin["pin"].as<string>();
contador++;
}
thread hiloMuerto
}
catch (...)
{
sleep(360);
}
}
}
What I want to do is access this thread_group worker and remove one of those threads.
I've tried using an int index like worker[0] and the thread's ID (boost::thread::id).
I can kill a thread using its native_handle and a platform-specific call like pthread_cancel, but I can't get the thread out of the thread group.
Any ideas? Thanks!
boost::thread_group::remove_thread() removes the specified thread from a given thread_group. Once you've done this, you're now responsible for managing the thread.
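Since remove_thread() needs the thread pointer, you have to keep track of which pointer belongs to which pin yourself. If that bookkeeping is needed anyway, one alternative is to skip thread_group and hold the workers in a map keyed by pin. The sketch below does that with the standard library only; it is a swapped-in technique, not Boost API, and pin_workers, sync and run are hypothetical names. It assumes the per-pin loop can check a stop flag regularly.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <thread>
#include <utility>

// Hypothetical replacement for thread_group: one worker per pin, addressable by name.
class pin_workers {
public:
    ~pin_workers() { sync({}); }  // stop and join everything that is still running

    // Start workers for new pins and stop workers whose pin is no longer wanted.
    void sync(const std::set<std::string>& wanted) {
        for (auto it = workers_.begin(); it != workers_.end();) {
            if (wanted.count(it->first) == 0) {
                it->second->stop = true;    // ask the loop to finish
                it->second->thread.join();  // wait until it has exited
                it = workers_.erase(it);
            } else {
                ++it;
            }
        }
        for (const std::string& pin : wanted) {
            if (workers_.count(pin) == 0) {
                auto w = std::make_unique<worker>();
                worker* raw = w.get();  // stays valid: the map owns the worker
                w->thread = std::thread([raw, pin] { run(*raw, pin); });
                workers_.emplace(pin, std::move(w));
            }
        }
    }

    bool running(const std::string& pin) const { return workers_.count(pin) != 0; }
    std::size_t size() const { return workers_.size(); }

private:
    struct worker {
        std::atomic<bool> stop{false};
        std::thread thread;
    };

    static void run(worker& self, const std::string& pin) {
        while (!self.stop) {
            // ... the real per-pin work (WorkPin in the question) would go here ...
            (void)pin;
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::map<std::string, std::unique_ptr<worker>> workers_;
};
```

The polling loop would then build the set of pins from the query result and call sync() with it. Stopping via a flag lets each thread exit at a safe point in its loop, unlike pthread_cancel on a native handle.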