I am currently evaluating an AWS Step Functions state machine that processes a single document. The state machine takes 5-10 minutes to process one document.
{
  "Comment": "Process document",
  "StartAt": "InitialState",
  "States": {
    // the document goes through multiple states here
  }
}
The C# code invokes the state machine by passing some JSON for each document, something like:
// max 100 documents
public async Task Process(IEnumerable<Document> documents)
{
    var amazonStepFunctionsConfig = new AmazonStepFunctionsConfig { RegionEndpoint = RegionEndpoint.USWest2 };
    using (var amazonStepFunctionsClient = new AmazonStepFunctionsClient(awsAccessKeyId, awsSecretAccessKey, amazonStepFunctionsConfig))
    {
        foreach (var document in documents)
        {
            var jsonData1 = JsonConvert.SerializeObject(document);
            var startExecutionRequest = new StartExecutionRequest
            {
                Input = jsonData1,
                Name = document.Id,
                StateMachineArn = "arn:aws:states:us-west-2:<SomeNumber>:stateMachine:ProcessDocument"
            };
            var taskStartExecutionResponse = await amazonStepFunctionsClient.StartExecutionAsync(startExecutionRequest);
        }
    }
}
We process the documents in batches of 100, so in the above loop the maximum number of documents is 100. However, we process thousands of documents weekly (25,000+).
As per the AWS documentation, the maximum execution history size is 25,000 events, and if the execution history reaches this limit the execution will fail.
Does that mean we cannot execute a single state machine more than 25,000 times?
Why should the execution of a state machine depend on its history? Why can't AWS just purge the history?
I know there is a way to continue as a new execution, but I am just trying to understand the history limit and its relation to state machine execution. Is my understanding correct?
Update 1
I don't think this is a duplicate question. I am trying to find out whether my understanding of the history limit is correct. Why does the history have anything to do with the number of times a state machine can execute? When a state machine executes, it creates history records; if there are more than 25,000 history records, purge or archive them. Why would AWS stop execution of the state machine? That does not make sense to me.
So the question: can a single state machine (unique ARN) execute more than 25,000 times in a loop?
If I have to create a new state machine (after 25,000 executions), wouldn't that state machine have a different ARN?
Also, if I had to follow the linked SO post, where would I get the current number of executions? And that poster is looping within the Step Function, while I am calling the Step Function within a loop.
Update 2
So, just for testing, I created the following state machine:
{
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Pass",
      "Result": "Hello World!",
      "End": true
    }
  }
}
and executed it 26,000 times with NO failures:
public static async Task Main(string[] args)
{
    AmazonStepFunctionsClient client = new AmazonStepFunctionsClient("my key", "my secret key", Amazon.RegionEndpoint.USWest2);
    for (int i = 1; i <= 26000; i++)
    {
        var startExecutionRequest = new StartExecutionRequest
        {
            Input = JsonConvert.SerializeObject(new { }),
            Name = i.ToString(),
            StateMachineArn = "arn:aws:states:us-west-2:xxxxx:stateMachine:MySimpleStateMachine"
        };
        var response = await client.StartExecutionAsync(startExecutionRequest);
    }
    Console.WriteLine("Press any key to continue");
    Console.ReadKey();
}
and in the AWS Console I am able to pull up the history for all 26,000 executions.
So I am not sure what exactly is meant by "Maximum execution history size is 25,000 events".
I don't think you've got it right. The 25,000 limit applies to the execution history of a single state machine execution. What you have tested is 26,000 state machine executions, and the limit on those is 1,000,000 open executions.
A single execution can run for up to 1 year, and during that time its execution history must not exceed 25,000 events.
Hope it helps.
The term "Execution History" is used to describe 2 completely different things in the quota docs, which has caused your confusion (and mine until I realized this):
90 day quota on execution history retention: This is the history of all executions, as you'd expect
25,000 quota on execution history size: This is the history of "state events" within 1 execution, NOT across all executions in history. In other words, if your single execution runs through thousands of steps, thereby racking up 25k events (likely because of a looping structure in the workflow), it will suddenly fail and exit.
As long as your executions complete in under 25k steps each, you can run the state machine much more than 25k times sequentially without issue :)
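If you want to see how many events a single execution has accumulated (i.e. how close it is to that 25,000-event quota), you can page through that execution's history. Below is a rough sketch using the AWS SDK for Java v2; the execution ARN is a placeholder, and the same GetExecutionHistory API operation can be called from the .NET SDK used in the question.
import software.amazon.awssdk.services.sfn.SfnClient;
import software.amazon.awssdk.services.sfn.model.GetExecutionHistoryRequest;
import software.amazon.awssdk.services.sfn.model.GetExecutionHistoryResponse;

public class ExecutionEventCounter {
    public static void main(String[] args) {
        // Placeholder ARN of one *execution* (not of the state machine itself).
        String executionArn = "arn:aws:states:us-west-2:123456789012:execution:ProcessDocument:doc-42";

        try (SfnClient sfn = SfnClient.create()) {
            int events = 0;
            String nextToken = null;
            do {
                GetExecutionHistoryResponse page = sfn.getExecutionHistory(
                        GetExecutionHistoryRequest.builder()
                                .executionArn(executionArn)
                                .maxResults(1000)          // page size
                                .nextToken(nextToken)
                                .build());
                events += page.events().size();
                nextToken = page.nextToken();
            } while (nextToken != null);

            // This count is per execution; it is what the 25,000-event quota applies to.
            System.out.println("Events in this execution: " + events);
        }
    }
}
The simple Pass state machine from Update 2 produces only a handful of events per execution, which is why 26,000 separate executions never hit the limit.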
Related
We created a minimal state machine with a single AWS Lambda step, then set the timeout period of that step in the state machine definition to a low value.
The execution is correctly terminated with a timeout, but the result is then "Failed" instead of "Timed out". I wonder why?
Steps to reproduce
Create a simple Lambda function that simulates a long-running process. To keep it simple, create a Python script and make the function sleep for some seconds:
import time

def lambda_handler(event, context):
    time.sleep(10)  # Delays for 10 seconds.
    return event
Set the timeout for the Lambda function to 30 sec. (It will never actually time out.)
Create a simple State Machine which will invoke this Lambda with a timeout of 5 seconds:
{
  "StartAt": "Execute Lambda",
  "States": {
    "Execute Lambda": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:**********:function:helloWorld",
      "TimeoutSeconds": 5,
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "MaxAttempts": 0
        }
      ],
      "End": true
    }
  }
}
Start an execution.
Result
According to the "Execution event history", the last event is "ExecutionFailed" and "error" is "States.Timeout". So far, so good.
But:
When you view the list of the executions of the state machine, the status of this execution is "Failed". (Expected: "Timed out".)
When you view the list of the state machines, this execution increases the counter in the "Failed" column. (Expected: increase the counter in the "Timed out" column.)
I'd guess that somehow the execution result is not correctly "mapped", but I can't find the reason why. Or is it just a bug in Lambda-based state machine steps?
There are five statuses associated with a state machine execution:
Running
Succeeded
Failed
Aborted
Timed out
An execution can run for up to 1 year. If one of the states in the execution times out, the execution is marked Failed, not Timed out. Only if the execution itself runs for more than 1 year will you see the status 'Timed out'.
See the Step Functions limits:
If an execution runs for more than the 1 year limit, it will fail with a States.Timeout error and emit an ExecutionsTimedOut CloudWatch metric.
The Timed out status in Step Functions applies only to the execution time of the whole state machine (up to 1 year); a timeout of your Lambda task shows up as a Failed event for that state. Remember that you can add a "Type": "Wait" state to your state machine; that way you avoid spending billable Lambda time on waiting.
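To confirm the distinction programmatically: DescribeExecution reports the overall status (FAILED in this scenario), while the ExecutionFailed event in the history carries the States.Timeout error you already saw in the event history. A rough sketch with the AWS SDK for Java v2; the execution ARN is a placeholder.
import software.amazon.awssdk.services.sfn.SfnClient;
import software.amazon.awssdk.services.sfn.model.*;

public class TimeoutInspector {
    public static void main(String[] args) {
        // Placeholder execution ARN.
        String executionArn = "arn:aws:states:eu-west-1:123456789012:execution:MyStateMachine:run-1";

        try (SfnClient sfn = SfnClient.create()) {
            // Overall status: this is what the console's "Failed" column reflects.
            DescribeExecutionResponse exec = sfn.describeExecution(
                    DescribeExecutionRequest.builder().executionArn(executionArn).build());
            System.out.println("Execution status: " + exec.status()); // FAILED

            // The history still records *why* it failed: States.Timeout.
            GetExecutionHistoryResponse history = sfn.getExecutionHistory(
                    GetExecutionHistoryRequest.builder()
                            .executionArn(executionArn)
                            .reverseOrder(true)   // last events first
                            .maxResults(10)
                            .build());
            history.events().stream()
                    .filter(e -> e.type() == HistoryEventType.EXECUTION_FAILED)
                    .findFirst()
                    .ifPresent(e -> System.out.println(
                            "Failure error: " + e.executionFailedEventDetails().error())); // States.Timeout
        }
    }
}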
When I insert rows into BigQuery using writeTableRows, performance is really bad compared to InsertAllRequest. Clearly, something is not set up correctly.
Use case 1: I wrote a Java program to process the 'sample' Twitter stream using Twitter4j. When a tweet comes in, I write it to BigQuery using this:
insertAllRequestBuilder.addRow(rowContent);
When I run this program from my Mac, it inserts about 1,000 rows per minute directly into the BigQuery table. I thought I could do better by running a Dataflow job on the cluster.
Use case 2: When a tweet comes in, I write it to a topic in Google's Pub/Sub. I run this from my Mac, which sends about 1,000 messages every minute.
I wrote a Dataflow job that reads this topic and writes to BigQuery using BigQueryIO.writeTableRows(). I have an 8-machine Dataproc cluster. I started this job on the master node of the cluster with DataflowRunner. It's unbelievably slow! Like 100 rows every 5 minutes or so. Here's a snippet of the relevant code:
statuses.apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        TableRow row = new TableRow();
        Status status = c.element();
        row.set("Id", status.getId());
        row.set("Text", status.getText());
        row.set("RetweetCount", status.getRetweetCount());
        row.set("FavoriteCount", status.getFavoriteCount());
        row.set("Language", status.getLang());
        row.set("ReceivedAt", null);
        row.set("UserId", status.getUser().getId());
        row.set("CountryCode", status.getPlace().getCountryCode());
        row.set("Country", status.getPlace().getCountry());
        c.output(row);
    }
}))
.apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)
    .withSchema(schema)
    .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
    .withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
    .withNumFileShards(1000)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
What am I doing wrong? Should I use a 'SparkRunner'? How do I confirm that it's running on all nodes of my cluster?
With BigQuery you can either:
Stream data in. Low latency, up to 100k rows per second, has a cost.
Batch data in. Way higher latency, incredible throughput, totally free.
That's the difference you are experiencing. If you only want to ingest 1,000 rows, batching will be noticeably slower. The same 10 billion rows, on the other hand, will be way faster through batching, and at no cost.
Dataflow/Beam's BigQueryIO.writeTableRows can either stream or batch data in.
With BigQueryIO.Write.Method.FILE_LOADS, the pasted code is choosing batch.
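If low latency matters more than cost for a stream this small, the same sink can be switched to streaming inserts. A rough sketch, assuming rows is the PCollection<TableRow> produced by the ToBQRow step and tweetsTable/schema are the table spec and schema from the question:
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

class StreamingWrite {
    // Same sink as in the question, but with STREAMING_INSERTS instead of FILE_LOADS.
    static void writeStreaming(PCollection<TableRow> rows, String tweetsTable, TableSchema schema) {
        rows.apply("WriteTableRows",
                BigQueryIO.writeTableRows()
                        .to(tweetsTable)
                        .withSchema(schema)
                        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
    }
}
With streaming inserts there is no triggering frequency or file sharding to configure, but each row is billed at the streaming-insert rate.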
I am using CppKafka to program a Kafka consumer. I want my consumer, when it starts, to poll only newly arriving messages (i.e. messages that arrive after the consumer's start time) instead of messages at the consumer's committed offset.
// Construct the configuration
Configuration config = {
    { "metadata.broker.list", "127.0.0.1:9092" },
    { "group.id", "1" },
    // Disable auto commit
    { "enable.auto.commit", false },
    // Set offset reset to latest so the consumer starts from the newest messages
    { "auto.offset.reset", "latest" },
};

// Create the consumer
Consumer consumer(config);
consumer.set_assignment_callback([](TopicPartitionList& partitions) {
    cout << "Got assigned: " << partitions << endl;
});

// Print the revoked partitions on revocation
consumer.set_revocation_callback([](const TopicPartitionList& partitions) {
    cout << "Got revoked: " << partitions << endl;
});

string topic_name = "test_topic";
// Subscribe to the topic
consumer.subscribe({ topic_name });
As I understand it, setting auto.offset.reset to latest only works if the consumer has no committed offset when it starts reading an assigned partition. So my guess is that I should call consumer.poll() without committing, but that feels wrong and I am afraid I will break something along the way. Can anyone show me the right way to achieve this?
If "enable.auto.commit" is set as false and you do not commit offsets in your code, then every time your consumers starts it starts message consumption from the first message in the topic if auto.offset.reset=earliest.
The default for auto.offset.reset is “latest,” which means that lacking a valid offset, the consumer will start reading from the newest records (records that were written after the consumer started running).
Based on your question above it looks like auto.offset.reset=latest should solve your problem.
But if you need a real time based offset you need to apply the time filter in your consumer. That means get the message from the topic compare offset time with either on some custom field in message payload or the meta attribute of the message (ConsumerRecord.timestamp())and do further processing accordingly.
Also refer to this answer Retrieve Timestamp based data from Kafka
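To make the time-based approach concrete: the Java consumer API has offsetsForTimes, which returns, per partition, the earliest offset whose timestamp is at or after the given time, so you can seek straight to "messages produced after the consumer started". A rough Java sketch (topic name and broker address taken from the question; librdkafka, which CppKafka wraps, offers an equivalent offsets-for-times lookup):
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

import java.util.*;

public class StartFromNow {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "127.0.0.1:9092");
        props.put("group.id", "1");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manual assignment keeps the sketch short; with subscribe() the seeking
            // would go into the rebalance/assignment callback instead.
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo info : consumer.partitionsFor("test_topic")) {
                partitions.add(new TopicPartition(info.topic(), info.partition()));
            }
            consumer.assign(partitions);

            // Ask the broker for the first offset at or after "now" and seek there.
            long startTime = System.currentTimeMillis();
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, startTime));
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
            offsets.forEach((tp, oat) -> {
                if (oat != null) {
                    consumer.seek(tp, oat.offset());                   // first message after startTime
                } else {
                    consumer.seekToEnd(Collections.singletonList(tp)); // nothing newer yet, start at the end
                }
            });
            // From here on, poll() only returns messages produced after startTime.
        }
    }
}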
Use the seekToEnd(Collection partitions) method:
Seek to the last offset for each of the given partitions. This function evaluates lazily, seeking to the final offset in all partitions only when poll(long) is called. If no partitions are provided, seek to the final offset for all of the currently assigned partitions.
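Note that seekToEnd is the Java consumer API; in the CppKafka setup from the question, the assignment callback is the analogous place to jump to the end of each newly assigned partition. A minimal Java sketch of the pattern, using the topic and broker from the question:
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

public class TailConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "127.0.0.1:9092");
        props.put("group.id", "1");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test_topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Skip everything already in the partitions; takes effect on the next poll().
                consumer.seekToEnd(partitions);
            }

            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Nothing to do for this example.
            }
        });

        while (true) {
            consumer.poll(Duration.ofMillis(500))
                    .forEach(record -> System.out.println(record.value()));
        }
    }
}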
I have two Dart apps running on Amazon (AWS Ubuntu), which are:
Self-hosted http API
Worker that handles background tasks on a timer
Both apps use PostgreSQL. They were occasionally crashing so, in addition to trying to find the root causes, I also implemented a supervisor script that just detects whether those 2 main apps are running and restarts them as needed.
Now the problem I need to solve is that the supervisor script is crashing, or the VM is crashing. It happens every few days.
I don't think it is a memory leak because if I increase the polling rate from 10s to much more often (1 ns), it correctly shows in the Dart Observatory that it exhausts 30MB and then garbage-collects and starts over at low memory usage, and keeps cycling.
I don't think it's an uncaught exception because the infinite loop is completely enclosed in try/catch.
I'm at a loss for what else to try. Is there a VM dump file that can be examined if the VM really crashed? Is there any other technique to debug the root cause? Is Dart just not stable enough to run apps for days at a time?
This is the main part of the code in the supervisor script:
/// never ending function checks the state of the other processes
Future pulse() async {
  while (true) {
    sleep(new Duration(milliseconds: 100)); //DEBUG - was seconds:10
    try {
      //detect restart (as signaled from existence of restart.txt)
      File f_restart = new File('restart.txt');
      if (await f_restart.exists()) {
        log("supervisor: restart detected");
        await f_restart.delete();
        await endBoth();
        sleep(new Duration(seconds: 10));
      }

      //if restarting or either proc crashed, restart it
      bool apiAlive = await isRunning('api_alive.txt', 3);
      if (!apiAlive) await startApi();
      bool workerAlive = await isRunning('worker_alive.txt', 8);
      if (!workerAlive) await startWorker();

      //if it's time to send mail, run that process
      if (utcNow().isAfter(_nextMailUtc)) {
        log("supervisor: starting sendmail");
        Process.start('dart', [rootPath() + '/sendmail.dart'], workingDirectory: rootPath());
        _nextMailUtc = utcNow().add(_mailInterval);
      }
    } catch (ex) {}
  }
}
If you have the Observatory up, you can get a crash dump with:
curl localhost:<your observatory port>/_getCrashDump
I'm not totally sure if this is related, but Process.start returns a Future which I don't believe will be caught by your try/catch if it completes with an error...
I am trying to build an Orderbook representation for the Poloniex Bitcoin exchange. I am subscribing to the Push API, which sends updates of the Orderbook over WebSocket. The problem is that my Orderbook becomes inconsistent over time, i.e. orders which should have been removed are still in my book.
The Orderbook in the following picture has this format:
Exchange-Name - ASK - Amount - Price | Price - Amount - BID - Exchange-Name
On the left side (ASK) are people who are selling a currency. On the right side (BID) are people who are buying a currency. BTCUSD, ETHBTC and ETHUSD describe the different markets. BTCUSD means Bitcoin is exchanged for US-Dollar, ETHBTC means Ethereum is exchanged for Bitcoin and ETHUSD means Ethereum is exchanged for US-Dollar.
Poloniex sends updates over Websocket in JSON-Format. Here is an example of such an update:
[
36,
7597659581972377,
8089731807973507,
{},
[
{"data":{"rate":"609.00000029","type":"bid"},"type":"orderBookRemove"},{"data":{"amount":"0.09514285","rate":"609.00000031","type":"bid"},"type":"orderBookModify"}
],
{
"seq":19976127
}
]
json[0] can be ignored for this question.
json[1] is the market identifier. That means I send a request like "Subscribe to market BTCUSD" and they answer "BTCUSD updates will be sent under identifier number 7597659581972377".
json[2] can be ignored for this question.
json[3] can be ignored for this question.
json[4] contains the actual update data. More about that later.
json[5] contains a sequence number. It is used to execute the updates correctly if they arrive out of order. So if I receive 5 updates within 1 second by the order 1 - 3 - 5 - 4 - 2 they have to be executed like 1 - 2 - 3 - 4 - 5. Each market gets a different "sequence-number-sequence".
As I said, json[4] contains an array of updates. There are three different kinds in json[4][array-index]["type"]:
orderBookModify: The available amount for a specific price has changed.
orderBookRemove: The order is not available anymore and must be removed.
newTrade: Can be used to build a trade history. Not required for what I am trying to do so it can be ignored.
json[4][array-index]["data"] contains two values if it is a orderBookRemove and three values if it is a orderBookModify.
rate: The price.
amount (only existant if it is a orderBookModify): The new amount.
type: ask or bid.
There is also one kind of special message:
[36,8932491360003688,1315671639915103,{},[],{"seq":98045310}]
It only contains a sequence number. It is kind of a heartbeat message and does not send any updates.
The Code
I use three containers:
std::map<std::uint64_t,CMarket> m_mMarkets;
std::map<CMarket, long> m_mCurrentSeq;
std::map<CMarket, std::map<long, web::json::value>> m_mStack;
m_mMarkets is used to map the market-identifier number to the Market as it is stored inside my program.
m_mCurrentSeq is used to store the current sequence number for each market.
m_mStack stores the updates by market and sequence-number (that's what the long is for) until they can be executed.
This is the part which receives the updates:
// ....
// This method can be called asynchronously, so lock the containers.
this->m_muMutex.lock();

// Check whether it is a known market. This should never happen!
// (Do this check before calling .at(), otherwise .at() would throw first.)
if(this->m_mMarkets.find(json[1].as_number().to_uint64()) == this->m_mMarkets.end())
{
    this->m_muMutex.unlock();
    throw std::runtime_error("Received Market update of unknown Market");
}

// Map the market-identifier to a CMarket object.
CMarket market = this->m_mMarkets.at(json[1].as_number().to_uint64());

// Add the update to the execution-queue
this->m_mStack[market][(long)json[5]["seq"].as_integer()] = json;

// Execute the execution-queue
this->executeStack();

this->m_muMutex.unlock();
// ....
Now comes the execution-queue. I think this is where my mistake is located.
Function: "executeStack":
for(auto& market : this->m_mMarkets) // For all markets
{
if(this->m_mCurrentSeq.find(market.second) != this->m_mCurrentSeq.end()) // if market has a sequence number
{
long seqNum = this->m_mCurrentSeq.at(market.second);
// erase old entries
for(std::map<long, web::json::value>::iterator it = this->m_mStack.at(market.second).begin(); it != this->m_mStack.at(market.second).end(); )
{
if((*it).first < seqNum)
it = this->m_mStack.at(market.second).erase(it);
else
++it;
}
// This container is used to store the updates to the Orderbook temporarily.
std::vector<Order> addOrderStack{};
while(this->m_mStack.at(market.second).find(seqNum) != this->m_mStack.at(market.second).end())// has entry for seqNum
{
web::json::value json = this->m_mStack.at(market.second).at(seqNum);
for(auto& v : json[4].as_array())
{
if(v["type"].as_string().compare("orderBookModify") == 0)
{
Order::Type t = v["data"]["type"].as_string().compare("ask") == 0 ? Order::Type::Ask : Order::Type::Bid;
Order newOrder(std::stod(v["data"]["rate"].as_string()), std::stod(v["data"]["amount"].as_string()), t, market.second, this->m_pclParent, v.serialize());
addOrderStack.push_back(newOrder);
} else if(v["type"].as_string().compare("orderBookRemove") == 0)
{
Order::Type t = v["data"]["type"].as_string().compare("ask") == 0 ? Order::Type::Ask : Order::Type::Bid;
Order newOrder(std::stod(v["data"]["rate"].as_string()), 0, t, market.second, this->m_pclParent, v.serialize());
addOrderStack.push_back(newOrder);
} else if(v["type"].as_string().compare("newTrade") == 0)
{
//
} else
{
throw std::runtime_error("Unknown message format");
}
}
this->m_mStack.at(market.second).erase(seqNum);
seqNum++;
}
// The actual OrderList gets modified here. The mistake CANNOT be inside OrderList::addOrderStack, because I am running Orderbooks for other exchanges too and they use the same method to modify the Orderbook, and they do not get inconsistent.
if(addOrderStack.size() > 0)
OrderList::addOrderStack(addOrderStack);
this->m_mCurrentSeq.at(market.second) = seqNum;
}
}
So if this runs for a longer period, the Orderbook becomes inconsistent. That means orders which should have been removed are still present, and there are wrong entries in the book. I am not quite sure why this is happening. Maybe I did something wrong with the sequence numbers, because it seems the update stack does not always get executed correctly. I have tried everything that came to my mind, but I could not get it to work, and now I am out of ideas about what could be wrong. If you have any questions, please feel free to ask.
tl;dr: Poloniex API is imperfect and drops messages. Some simply never arrive. I've found that this happens for all users subscribed regardless of location in the world.
Hope that answer regarding using Autobahn|cpp to connect to Poloniex's WebSocket API (here) was useful. I suspect you had already figured it out though (otherwise this question/problem couldn't exist for you). As you might have gathered, I too have a cryptocurrency bot written in C++. I've been working on it off and on now for about 3.5 years.
The problem set you're facing is something I had to overcome as well. In this case, I'd prefer not to provide my source code, as the speed at which you process this can have huge effects on your profit margins. However, I will give pseudocode that offers some very rough insight into how I'm handling WebSocket event processing for Poloniex.
// Pseudocode
void someClass::handle_poloniex_ws_event(ws_event event){
    if(event.seq_num == expected_seq_num){
        process_ws_event(event)
        update_expected_seq_num
    }
    else{
        if(in_cache(expected_seq_num)){
            process_ws_event(from_cache(expected_seq_num))
            update_expected_seq_num
        }
        else{
            cache_event(event)
        }
    }
}
Note that what I've written above is a super simplified version of what I'm actually doing. My actual solution is about 500+ lines long with "goto xxx" and "goto yyy" throughout. I recommend taking timestamps/cpu clock cycle counts and comparing to current time/cycle counts to help you make decisions at any given moment (such as, should I wait for the missing event, should I continue processing and note to the rest of the program that there may be inaccuracies, should I utilize a GET request to refill my table, etc.?). The name of the game here is speed, as I'm sure you know. Good luck! Hope to hear from ya. :-)
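For what it's worth, here is a rough Java sketch of the same buffering idea: keep out-of-order updates in a map keyed by sequence number, drain the map whenever the expected number shows up, and fall back to a fresh snapshot (e.g. a REST GET of the order book, as mentioned above) if a gap is not filled within some deadline. The class, the hooks, and the timeout value are made up for illustration:
import java.util.TreeMap;

public class SequencedProcessor<E> {
    private final TreeMap<Long, E> buffer = new TreeMap<>(); // out-of-order events, keyed by seq
    private long expectedSeq;
    private long gapDeadlineMillis = -1;
    private static final long GAP_TIMEOUT_MS = 2_000; // how long to wait for a missing seq (arbitrary)

    public SequencedProcessor(long firstSeq) {
        this.expectedSeq = firstSeq;
    }

    /** Called for every incoming update of one market. */
    public void onEvent(long seq, E event) {
        if (seq < expectedSeq) return;          // stale duplicate, drop it
        buffer.put(seq, event);
        drain();
        if (!buffer.isEmpty() && gapDeadlineMillis < 0) {
            gapDeadlineMillis = System.currentTimeMillis() + GAP_TIMEOUT_MS;
        }
        // The missing event apparently never arrives: skip the gap and resync.
        if (gapDeadlineMillis > 0 && System.currentTimeMillis() > gapDeadlineMillis) {
            expectedSeq = buffer.firstKey();
            resyncFromSnapshot();               // e.g. re-fetch the full order book over REST
            drain();
        }
    }

    private void drain() {
        E next;
        while ((next = buffer.remove(expectedSeq)) != null) {
            apply(next);                        // apply orderBookModify / orderBookRemove here
            expectedSeq++;
        }
        if (buffer.isEmpty()) gapDeadlineMillis = -1;
    }

    protected void apply(E event) { /* update the order book */ }
    protected void resyncFromSnapshot() { /* fetch a fresh snapshot */ }
}
Without the resync fallback, a single dropped sequence number blocks the queue forever and the book drifts out of sync, which matches the symptom described in the question.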