Kafka consumer poll newest message - C++

I am using CppKafka to program a Kafka consumer. I want my consumer, when it starts, to poll only newly arriving messages (i.e. messages that arrive after the consumer's start time) instead of messages from the consumer's committed offset.
// Construct the configuration
Configuration config = {
    { "metadata.broker.list", "127.0.0.1:9092" },
    { "group.id", "1" },
    // Disable auto commit
    { "enable.auto.commit", false },
    // Set offset to latest to receive the latest messages when the consumer starts working
    { "auto.offset.reset", "latest" },
};
// Create the consumer
Consumer consumer(config);
// Print the assigned partitions on assignment
consumer.set_assignment_callback([](TopicPartitionList& partitions) {
    cout << "Got assigned: " << partitions << endl;
});
// Print the revoked partitions on revocation
consumer.set_revocation_callback([](const TopicPartitionList& partitions) {
    cout << "Got revoked: " << partitions << endl;
});
string topic_name = "test_topic";
// Subscribe to the topic
consumer.subscribe({ topic_name });
As I understand it, setting auto.offset.reset to latest only works if the consumer has no committed offset when it starts reading an assigned partition. So my guess is that I should call consumer.poll() without committing, but that feels wrong and I am afraid I will break something along the way. Can anyone show me the right way to achieve this requirement?

If "enable.auto.commit" is set as false and you do not commit offsets in your code, then every time your consumers starts it starts message consumption from the first message in the topic if auto.offset.reset=earliest.
The default for auto.offset.reset is “latest,” which means that lacking a valid offset, the consumer will start reading from the newest records (records that were written after the consumer started running).
Based on your question above it looks like auto.offset.reset=latest should solve your problem.
But if you need a real time based offset you need to apply the time filter in your consumer. That means get the message from the topic compare offset time with either on some custom field in message payload or the meta attribute of the message (ConsumerRecord.timestamp())and do further processing accordingly.
Also refer to this answer Retrieve Timestamp based data from Kafka
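For completeness, here is a rough sketch of that filter with CppKafka, continuing from the consumer set up in the question. It assumes (please double-check against your cppkafka version) that Message::get_timestamp() returns an optional MessageTimestamp whose get_timestamp() is milliseconds since the Unix epoch, and that poll() may return empty or error messages that should simply be skipped:
// Rough sketch, continuing from the consumer set up in the question.
// Assumption: cppkafka's Message::get_timestamp() returns an optional
// MessageTimestamp whose get_timestamp() is milliseconds since the epoch.
using namespace std::chrono;

const milliseconds start_time =
    duration_cast<milliseconds>(system_clock::now().time_since_epoch());

while (true) {
    Message msg = consumer.poll();
    if (!msg || msg.get_error()) {
        continue; // no message, or a (possibly EOF) error: nothing to filter
    }
    const auto ts = msg.get_timestamp();
    if (ts && ts->get_timestamp() < start_time) {
        continue; // produced before this consumer started, skip it
    }
    cout << msg.get_payload() << endl;
}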

Use the seekToEnd(Collection partitions) method (from the Java consumer API):
Seek to the last offset for each of the given partitions. This function evaluates lazily, seeking to the final offset in all partitions only when poll(long) is called. If no partitions are provided, seek to the final offset for all of the currently assigned partitions.
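As far as I know, CppKafka has no direct seekToEnd equivalent, but you can get the same effect by overriding the offsets in the assignment callback before the first poll(). This is only a sketch; it assumes that cppkafka calls assign() with the (possibly modified) partition list after the callback returns, and that TopicPartition exposes set_offset() and an OFFSET_END sentinel:
// Sketch: start every assigned partition at the end of the log, ignoring
// any committed offset. Assumes cppkafka assigns the (possibly modified)
// partition list after this callback returns, and that TopicPartition
// provides set_offset() and the OFFSET_END sentinel.
consumer.set_assignment_callback([](TopicPartitionList& partitions) {
    for (TopicPartition& partition : partitions) {
        partition.set_offset(TopicPartition::OFFSET_END);
    }
    cout << "Got assigned (seeking to end): " << partitions << endl;
});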

Related

Can an AWS Step Function execute more than 25,000 times?

I am currently evaluating an AWS Step Functions state machine that processes a single document. The state machine takes 5-10 minutes to process a single document.
{
    "Comment": "Process document",
    "StartAt": "InitialState",
    "States": {
        // the document goes through multiple states here
    }
}
The C# code invokes the state machine by passing some JSON for each document. Something like:
// max 100 documents
public async Task Process(IEnumerable<Document> documents)
{
    var amazonStepFunctionsConfig = new AmazonStepFunctionsConfig { RegionEndpoint = RegionEndpoint.USWest2 };
    using (var amazonStepFunctionsClient = new AmazonStepFunctionsClient(awsAccessKeyId, awsSecretAccessKey, amazonStepFunctionsConfig))
    {
        foreach (var document in documents)
        {
            var jsonData1 = JsonConvert.SerializeObject(document);
            var startExecutionRequest = new StartExecutionRequest
            {
                Input = jsonData1,
                Name = document.Id,
                StateMachineArn = "arn:aws:states:us-west-2:<SomeNumber>:stateMachine:ProcessDocument"
            };
            var taskStartExecutionResponse = await amazonStepFunctionsClient.StartExecutionAsync(startExecutionRequest);
        }
    }
}
We process the documents in batches of 100, so in the above loop the maximum number of documents will be 100. However, we process thousands of documents weekly (25,000+).
As per the AWS documentation, the maximum execution history size is 25,000 events, and if the execution history reaches this limit the execution will fail.
Does that mean we cannot execute a single state machine more than 25,000 times?
Why should execution of a state machine depend on its history? Why can't AWS just purge the history?
I know there is a way to continue as a new execution, but I am just trying to understand the history limit and its relation to state machine execution, and whether my understanding is correct.
Update 1
I don't think this is a duplicate question. I am trying to find out whether my understanding of the history limit is correct. Why does history have anything to do with the number of times a state machine can execute? When a state machine executes, it creates history records; if the history grows beyond 25,000 records, then purge or archive them. Why would AWS stop execution of the state machine? That does not make sense.
So the question: can a single state machine (unique ARN) execute more than 25,000 times in a loop?
If I have to create a new state machine (after 25,000 executions), wouldn't that state machine have a different ARN?
Also, if I had to follow the linked SO post, where would I get the current number of executions? And he is looping within the step function, while I am calling the step function within a loop.
Update 2
So, just for testing, I created the following state machine
{
    "StartAt": "HelloWorld",
    "States": {
        "HelloWorld": {
            "Type": "Pass",
            "Result": "Hello World!",
            "End": true
        }
    }
}
and executed it 26,000 times with NO failure:
public static async Task Main(string[] args)
{
    AmazonStepFunctionsClient client = new AmazonStepFunctionsClient("my key", "my secret key", Amazon.RegionEndpoint.USWest2);
    for (int i = 1; i <= 26000; i++)
    {
        var startExecutionRequest = new StartExecutionRequest
        {
            Input = JsonConvert.SerializeObject(new { }),
            Name = i.ToString(),
            StateMachineArn = "arn:aws:states:us-west-2:xxxxx:stateMachine:MySimpleStateMachine"
        };
        var response = await client.StartExecutionAsync(startExecutionRequest);
    }
    Console.WriteLine("Press any key to continue");
    Console.ReadKey();
}
and on the AWS Console I am able to pull the history for all 26,000 executions.
So I am not sure exactly what is meant by "Maximum execution history size is 25,000 events".
I don't think you've got it right. The 25,000 limit is for a single state machine execution's history, while what you have tested is 26,000 state machine executions. The limit on state machine executions is 1,000,000 open executions.
A state machine execution can run for up to 1 year, and during that time its execution history must not grow beyond 25,000 events.
Hope it helps.
The term "Execution History" is used to describe 2 completely different things in the quota docs, which has caused your confusion (and mine until I realized this):
90 day quota on execution history retention: This is the history of all executions, as you'd expect
25,000 quota on execution history size: This is the history of "state events" within 1 execution, NOT across all executions in history. In other words, if your single execution runs through thousands of steps, thereby racking up 25k events (likely because of a looping structure in the workflow), it will suddenly fail and exit.
As long as your executions complete in under 25k steps each, you can run the state machine much more than 25k times sequentially without issue :)

Siddhi: check if an event does not arrive within a specified time window?

I am using CEP to check if an event has arrived within a specified amount of time (let's say 1 min). If not, I want to publish an alert.
More specifically, a (server) machine generates a heartbeat data stream and sends it to CEP. The heartbeat stream contains the server id and a timestamp. An alert should be generated if no heartbeat data arrives within the 1 min period.
Is it possible to do something like that with CEP? I have seen other questions regarding the detection of non-occurrences, but I am still not sure how to approach the scenario described above.
You can try this:
define stream heartbeats (serverId string, timestamp long);

from heartbeats#window.time(1 minute)
insert expired events into delayedStream;

from every e = heartbeats
    -> e2 = heartbeats[serverId == e.serverId]
       or expired = delayedStream[serverId == e.serverId]
    within 1 minute
select e.serverId, e2.serverId as id2, expired.serverId as id3
insert into tmpStream;

// every event on tmpStream with an 'expired' match has timed out
from tmpStream[id3 is not null]
select serverId
insert into expiredHeartbeats;

Building an Orderbook representation for a Bitcoin exchange

I am trying to build an Orderbook representation for the Poloniex Bitcoin exchange. I am subscribing to the push API, which sends updates of the Orderbook over WebSocket. The problem is that my Orderbook becomes inconsistent over time, i.e. orders which should have been removed are still in my book.
My Orderbook display has the following format:
Exchange-Name - ASK - Amount - Price | Price - Amount - BID - Exchange-Name
On the left side (ASK) are people who are selling a currency. On the right side (BID) are people who are buying a currency. BTCUSD, ETHBTC and ETHUSD describe the different markets. BTCUSD means Bitcoin is exchanged for US-Dollar, ETHBTC means Ethereum is exchanged for Bitcoin and ETHUSD means Ethereum is exchanged for US-Dollar.
Poloniex sends updates over Websocket in JSON-Format. Here is an example of such an update:
[
    36,
    7597659581972377,
    8089731807973507,
    {},
    [
        { "data": { "rate": "609.00000029", "type": "bid" }, "type": "orderBookRemove" },
        { "data": { "amount": "0.09514285", "rate": "609.00000031", "type": "bid" }, "type": "orderBookModify" }
    ],
    {
        "seq": 19976127
    }
]
json[0] can be ignored for this question.
json[1] is the market identifier. That means I send a request like "Subscribe to market BTCUSD" and they answer "BTCUSD updates will be sent under identifier number 7597659581972377".
json[2] can be ignored for this question.
json[3] can be ignored for this question.
json[4] contains the actual update data. More about that later.
json[5] contains a sequence number. It is used to execute the updates in the correct order if they arrive out of order. So if I receive 5 updates within 1 second in the order 1 - 3 - 5 - 4 - 2, they have to be executed as 1 - 2 - 3 - 4 - 5. Each market has its own independent sequence of sequence numbers.
As I said, json[4] contains an array of updates. There are three different kinds in json[4][array-index]["type"]:
orderBookModify: The available amount for a specific price has changed.
orderBookRemove: The order is not available anymore and must be removed.
newTrade: Can be used to build a trade history. Not required for what I am trying to do so it can be ignored.
json[4][array-index]["data"] contains two values if it is a orderBookRemove and three values if it is a orderBookModify.
rate: The price.
amount (only existant if it is a orderBookModify): The new amount.
type: ask or bid.
There is also one kind of special message:
[36,8932491360003688,1315671639915103,{},[],{"seq":98045310}]
It only contains a sequence number. It is kind of a heartbeat message and does not send any updates.
The Code
I use three containers:
std::map<std::uint64_t,CMarket> m_mMarkets;
std::map<CMarket, long> m_mCurrentSeq;
std::map<CMarket, std::map<long, web::json::value>> m_mStack;
m_mMarkets is used to map the market-identifier number to the Market as it is stored inside my program.
m_mCurrentSeq is used to store the current sequence number for each market.
m_mStack stores the updates by market and sequence-number (that's what the long is for) until they can be executed.
This is the part which receives the updates:
// ....
// This method can be called asynchronously, so lock the containers.
this->m_muMutex.lock();
// Check if it is a known market. This should never happen!
if(this->m_mMarkets.find(json[1].as_number().to_uint64()) == this->m_mMarkets.end())
{
    this->m_muMutex.unlock();
    throw std::runtime_error("Received Market update of unknown Market");
}
// Map the market-identifier to a CMarket object.
CMarket market = this->m_mMarkets.at(json[1].as_number().to_uint64());
// Add the update to the execution-queue
this->m_mStack[market][(long)json[5]["seq"].as_integer()] = json;
// Execute the execution-queue
this->executeStack();
this->m_muMutex.unlock();
// ....
Now comes the execution-queue. I think this is where my mistake is located.
Function: "executeStack":
for(auto& market : this->m_mMarkets) // For all markets
{
    if(this->m_mCurrentSeq.find(market.second) != this->m_mCurrentSeq.end()) // if market has a sequence number
    {
        long seqNum = this->m_mCurrentSeq.at(market.second);
        // erase old entries
        for(std::map<long, web::json::value>::iterator it = this->m_mStack.at(market.second).begin(); it != this->m_mStack.at(market.second).end(); )
        {
            if((*it).first < seqNum)
                it = this->m_mStack.at(market.second).erase(it);
            else
                ++it;
        }
        // This container is used to store the updates to the Orderbook temporarily.
        std::vector<Order> addOrderStack{};
        while(this->m_mStack.at(market.second).find(seqNum) != this->m_mStack.at(market.second).end()) // has entry for seqNum
        {
            web::json::value json = this->m_mStack.at(market.second).at(seqNum);
            for(auto& v : json[4].as_array())
            {
                if(v["type"].as_string().compare("orderBookModify") == 0)
                {
                    Order::Type t = v["data"]["type"].as_string().compare("ask") == 0 ? Order::Type::Ask : Order::Type::Bid;
                    Order newOrder(std::stod(v["data"]["rate"].as_string()), std::stod(v["data"]["amount"].as_string()), t, market.second, this->m_pclParent, v.serialize());
                    addOrderStack.push_back(newOrder);
                } else if(v["type"].as_string().compare("orderBookRemove") == 0)
                {
                    Order::Type t = v["data"]["type"].as_string().compare("ask") == 0 ? Order::Type::Ask : Order::Type::Bid;
                    Order newOrder(std::stod(v["data"]["rate"].as_string()), 0, t, market.second, this->m_pclParent, v.serialize());
                    addOrderStack.push_back(newOrder);
                } else if(v["type"].as_string().compare("newTrade") == 0)
                {
                    //
                } else
                {
                    throw std::runtime_error("Unknown message format");
                }
            }
            this->m_mStack.at(market.second).erase(seqNum);
            seqNum++;
        }
        // The actual OrderList gets modified here. The mistake CANNOT be inside OrderList::addOrderStack, because I am running Orderbooks for other exchanges too and they use the same method to modify the Orderbook, and they do not get inconsistent.
        if(addOrderStack.size() > 0)
            OrderList::addOrderStack(addOrderStack);
        this->m_mCurrentSeq.at(market.second) = seqNum;
    }
}
So if this runs for a longer period, the Orderbook becomes inconsistent. That means orders which should have been removed are still available, and there are wrong entries inside the book. I am not quite sure why this is happening. Maybe I did something wrong with the sequence numbers, because it seems that the update stack does not always get executed correctly. I have tried everything that came to my mind, but I could not get it to work, and now I am out of ideas about what could be wrong. If you have any questions, please feel free to ask.
tl;dr: The Poloniex API is imperfect and drops messages; some simply never arrive. I've found that this happens for all subscribed users, regardless of their location in the world.
Hope that answer regarding utilization of Autobahn|cpp to connect to Poloniex' Websocket API (here) was useful. I suspect you had already figured it out though (otherwise this question/problem couldn't exist for you). As you might have gathered, I too have a Crypto Currency Bot written in C++. I've been working on it off and on now for about 3.5 years.
The problem set you're facing is something I had to overcome as well. In this case, I'd prefer not to provide my source code, as the speed at which you process this can have huge effects on your profit margins. However, I will give pseudocode that offers some very rough insight into how I'm handling WebSocket event processing for Poloniex.
// Pseudocode
void someClass::handle_poloniex_ws_event(ws_event event){
    if(event.seq_num == expected_seq_num){
        process_ws_event(event)
        update_expected_seq_num
    }
    else{
        if(in_cache(expected_seq_num)){
            process_ws_event(from_cache(expected_seq_num))
            update_expected_seq_num
        }
        else{
            cache_event(event)
        }
    }
}
Note that what I've written above is a super simplified version of what I'm actually doing. My actual solution is about 500+ lines long with "goto xxx" and "goto yyy" throughout. I recommend taking timestamps/cpu clock cycle counts and comparing to current time/cycle counts to help you make decisions at any given moment (such as, should I wait for the missing event, should I continue processing and note to the rest of the program that there may be inaccuracies, should I utilize a GET request to refill my table, etc.?). The name of the game here is speed, as I'm sure you know. Good luck! Hope to hear from ya. :-)
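To make the buffering-plus-deadline idea a bit more concrete, here is a minimal, self-contained C++ sketch; every type and member name in it is hypothetical (none of it comes from either post), and the 500 ms deadline and the resync hook are arbitrary placeholders:
#include <chrono>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical event type: a sequence number plus the raw update payload.
struct WsEvent {
    std::uint64_t seq;
    std::string payload;
};

// Buffers out-of-order events and gives up on a missing sequence number
// after a deadline by triggering a resync callback (e.g. a REST snapshot
// of the order book). One instance per market.
class SequencedProcessor {
public:
    SequencedProcessor(std::uint64_t firstSeq,
                       std::function<void(const WsEvent&)> apply,
                       std::function<void()> resync)
        : m_expected(firstSeq), m_apply(std::move(apply)), m_resync(std::move(resync)) {}

    void handle(const WsEvent& event) {
        if (event.seq == m_expected) {
            m_apply(event);
            ++m_expected;
            drainCache();
        } else if (event.seq > m_expected) {
            // Out of order: cache it and remember when the gap first appeared.
            if (m_cache.empty()) {
                m_gapSince = std::chrono::steady_clock::now();
            }
            m_cache[event.seq] = event;
            // If the missing event has not shown up within the deadline
            // (500 ms is an arbitrary choice), assume it was dropped:
            // resync and skip ahead to the oldest buffered update. A real
            // implementation would reconcile against the snapshot's own
            // sequence number instead of blindly skipping.
            if (std::chrono::steady_clock::now() - m_gapSince >
                std::chrono::milliseconds(500)) {
                m_resync();
                m_expected = m_cache.begin()->first;
                drainCache();
            }
        }
        // event.seq < m_expected: stale duplicate, ignore it.
    }

private:
    // Apply cached events as long as they are consecutive.
    void drainCache() {
        for (auto it = m_cache.begin();
             it != m_cache.end() && it->first == m_expected;
             it = m_cache.erase(it)) {
            m_apply(it->second);
            ++m_expected;
        }
    }

    std::uint64_t m_expected;
    std::map<std::uint64_t, WsEvent> m_cache;
    std::chrono::steady_clock::time_point m_gapSince{};
    std::function<void(const WsEvent&)> m_apply;
    std::function<void()> m_resync;
};
Combined with the tl;dr above, this also explains the symptom in the question: if a sequence number simply never arrives and there is no deadline or resync, executeStack() stops advancing for that market and the book silently goes stale.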

StatsD gauge timer send data issue - cannot send only one value to statsd server in one flush interval

I use a StatsD client based on the Akka IO source code to send data to a statsd server. In my scenario, I want to monitor Spark job status: if the current job succeeds, we send 1 to the statsd server, otherwise 0. So in one flush interval I only want to send one value (1 or 0) to the statsd server, but it didn't work. If I add a for loop and send this value (1 or 0) at least twice, it works, but I don't understand why I should have to send the same value twice, so I checked the statsd source code and found:
for (key in gauges) {
    var namespace = gaugesNamespace.concat(sk(key));
    stats.add(namespace.join(".") + globalSuffix, gauges[key], ts);
    numStats += 1;
}
So gauges is iterated over here; my thought was that if I only send one value it cannot be iterated, but that may be wrong. I hope someone can explain why I have to send a value at least twice.
My client code snippet:
for (i <- 1 to 2) {
    client ! ExcutionTime("StatsD_Prod.Reporting." + name + ":" + status_str + "|ms", status)
    Thread.sleep(100)
}

How to limit an Akka Stream to process and send downstream only one message per second?

I have an Akka Stream and I want the stream to send messages downstream approximately once every second.
I tried two ways to solve this problem. The first way was to make the producer at the start of the stream only send messages once every second, when a Continue message comes into this actor.
// When receiving a Continue message in an ActorPublisher
// do work, then...
if (totalDemand > 0) {
    import scala.concurrent.duration._
    context.system.scheduler.scheduleOnce(1 second, self, Continue)
}
This works for a short while, then a flood of Continue messages appears in the ActorPublisher actor. I assume (a guess, I am not sure) that they come from downstream via back-pressure requesting messages, since the downstream can consume quickly but the upstream is not producing at a fast rate. So this method failed.
The other way I tried was via backpressure control: I used a MaxInFlightRequestStrategy on the ActorSubscriber at the end of the stream to limit the number of messages to 1 per second. This works, but messages come in roughly three or so at a time, not one at a time. It seems the backpressure control doesn't immediately change the rate of incoming messages, or messages were already queued in the stream and waiting to be processed.
So the problem is, how can I have an Akka Stream which can process one message only per second?
I discovered that MaxInFlightRequestStrategy is a valid way to do it, but I should set the batch size to 1; its default batch size is 5, which was causing the problem I saw. It is also an over-complicated way to solve the problem now that I am looking at the submitted answer here.
You can either put your elements through a throttling flow, which will backpressure a fast source, or you can use a combination of tick and zip.
The first solution would be like this:
val veryFastSource =
  Source.fromIterator(() => Iterator.continually(Random.nextLong() % 10000))

val throttlingFlow = Flow[Long].throttle(
  // how many elements do you allow
  elements = 1,
  // in what unit of time
  per = 1.second,
  maximumBurst = 0,
  // you can also set this to Enforcing, but then your
  // stream will collapse if exceeding the number of elements / s
  mode = ThrottleMode.Shaping
)

veryFastSource.via(throttlingFlow).runWith(Sink.foreach(println))
The second solution would be like this:
val veryFastSource =
  Source.fromIterator(() => Iterator.continually(Random.nextLong() % 10000))

val tickingSource = Source.tick(1.second, 1.second, 0)

veryFastSource.zip(tickingSource).map(_._1).runWith(Sink.foreach(println))