I need to process data from an AWS Kinesis stream, which collects events from devices. The processing function has to be called every second with all events received during the last 10 seconds.
Say I have two devices, A and B, that write events into the stream.
My procedure is named MyFunction and takes the following parameters:
DeviceId
An array of data for a period
If I start processing at 10:00:00 (and have already accumulated events for devices A and B for the last 10 seconds),
then I need to make two calls:
MyFunction(A, {Events for device A from 09:59:50 to 10:00:00})
MyFunction(B, {Events for device B from 09:59:50 to 10:00:00})
In the next second, at 10:00:01:
MyFunction(A, {Events for device A from 09:59:51 to 10:00:01})
MyFunction(B, {Events for device B from 09:59:51 to 10:00:01})
and so on.
It looks like the simplest way to accumulate the data received from the devices is just to store it in memory in a temporary buffer (the last 10 seconds only, of course), so I'd like to try this first.
And the most convenient way I have found to keep such a memory-based buffer is to build an application on the Java Kinesis Client Library (KCL).
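To illustrate, here is a minimal sketch of the kind of buffer I mean (the Event type, the once-per-second scheduling, and MyFunction are placeholders; the KCL record-processor wiring is omitted):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class SlidingBuffer {
        record Event(long timestampMillis, byte[] payload) {}

        private final Map<String, Deque<Event>> byDevice = new ConcurrentHashMap<>();

        // Called from the KCL record processor for every incoming record.
        public void append(String deviceId, Event e) {
            Deque<Event> events = byDevice.computeIfAbsent(deviceId, k -> new ArrayDeque<>());
            synchronized (events) {
                events.addLast(e);
            }
        }

        // Called once per second by a scheduled task.
        public void tick(long nowMillis) {
            long cutoff = nowMillis - 10_000; // keep only the last 10 seconds
            byDevice.forEach((deviceId, events) -> {
                List<Event> window;
                synchronized (events) {
                    while (!events.isEmpty() && events.peekFirst().timestampMillis() < cutoff) {
                        events.pollFirst(); // evict expired events
                    }
                    window = new ArrayList<>(events);
                }
                MyFunction(deviceId, window); // one call per device, outside the lock
            });
        }

        void MyFunction(String deviceId, List<Event> eventsOfLast10s) {
            // the actual processing
        }
    }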
I have also considered an AWS Lambda based solution, but it looks like it's impossible to keep data in memory between Lambda invocations. Another option for Lambda is to have two functions: the first writes all the data into DynamoDB, and the second is called every second to process data fetched from the DB, not from memory (so this option is much more complicated).
So my question is: what other options are there to implement such processing?
What you are doing is called a "window operation" (or "windowed computation"). There are multiple ways to achieve it, and, as you said, buffering is the best option.
In-memory cache systems: Ehcache, Hazelcast
Accumulate data in a cache system and choose a suitable eviction policy (10 seconds in your case). Then do a grouping summation operation and calculate the output.
In-memory databases: Redis, VoltDB
Just like a cache system, you can use a database. Redis could be helpful here and is stateful. If you use VoltDB or a similar SQL system, calling a sum() or avg() operation is easier.
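For example, here is a sketch of the Redis variant using one sorted set per device, scored by event timestamp (Jedis 4.x client; the key naming and String payloads are my assumptions):

    import redis.clients.jedis.Jedis;
    import java.util.List;

    public class RedisWindowBuffer {
        private final Jedis jedis = new Jedis("localhost", 6379);

        // Score each event by its timestamp so the sorted set acts as a time index.
        // Note: identical payloads collapse into one member; include a unique id
        // in the payload if that matters.
        public void add(String deviceId, long timestampMillis, String payload) {
            jedis.zadd("events:" + deviceId, timestampMillis, payload);
        }

        // Evict entries older than 10 seconds, then return what remains.
        public List<String> lastTenSeconds(String deviceId, long nowMillis) {
            String key = "events:" + deviceId;
            jedis.zremrangeByScore(key, 0, nowMillis - 10_000);
            return jedis.zrangeByScore(key, nowMillis - 10_000, nowMillis);
        }
    }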
Spark Streaming: http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
It is possible to do that kind of windowed counting with Spark. You can run it on Elastic MapReduce (EMR), so you stay in the AWS ecosystem and integration is easier.
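And a sketch of the Spark Streaming variant, which maps almost one-to-one onto your requirement: a 10-second window sliding every 1 second, grouped per device (the socket source and String payloads are stand-ins; in your case the source would be Kinesis):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class WindowedProcessing {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("device-windows").setMaster("local[2]");
            // Batch interval of 1 second = how often output is produced.
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

            // Stand-in source emitting "deviceId,payload" lines.
            JavaPairDStream<String, String> events = ssc
                .socketTextStream("localhost", 9999)
                .mapToPair(line -> {
                    String[] parts = line.split(",", 2);
                    return new Tuple2<>(parts[0], parts[1]);
                });

            // The window operation itself: last 10 seconds, recomputed every second.
            JavaPairDStream<String, Iterable<String>> lastTenSeconds =
                events.groupByKeyAndWindow(Durations.seconds(10), Durations.seconds(1));

            // Every second: one MyFunction call per device with its 10-second window.
            lastTenSeconds.foreachRDD(rdd ->
                rdd.foreach(kv -> MyFunction(kv._1(), kv._2())));

            ssc.start();
            ssc.awaitTermination();
        }

        static void MyFunction(String deviceId, Iterable<String> eventsOfLast10s) {
            // the actual processing
        }
    }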
Context: I have a long message (~1 GB) to send over gRPC. (Why do I use gRPC rather than plain HTTP? 95% of messages are shorter than 4 MB.)
Since gRPC's default max message size is 4 MB, I have to split the message into multiple pieces on the server side and reassemble them on the client side.
This is how I do it:
Get the Response and marshal it into a std::string with response.SerializeToString(&str)
Split the long string into multiple short strings, wrap each with metadata (e.g., an index), and send them one by one
The client receives all of them and concatenates them
Get the Response back with message.ParseFromString(concatenated_str)
By my count, these steps copy the 1 GB of data four times. Is there a way to avoid any of these copies?
It looks OK.
But it is rare to send such a huge message over RPC. Why don't you use a DB or a messaging component such as Kafka or RabbitMQ? RPC is Remote Procedure Call, which means you should use it just like calling a function: you can pass some arguments to a function, but not a 1 GB piece of data. gRPC can retry and set timeouts, and you should set the retry count and the timeout yourself, because you don't want to rely on the default configuration. In your case, what if a retry happens while sending the third piece of data? This can happen, because the network is not always reliable. You are going to have to wait a very long time to finish just one call. And how would you set the gRPC timeout? 10 seconds? 20 seconds? What if the first 1 GB request hasn't finished while the second one is coming in? The stack may explode!
So your design may work as expected but is not a good design.
Do NOT try to use one technique to solve every issue. gRPC is simply one technique, and it's not a silver bullet; sending huge data should use another technique. In fact, that's why people developed so many kinds of messaging components.
I suggest using a messaging queue, such as Kafka. The 1 GB of data can't be generated in one shot, right? So when some data is generated, write it into Kafka immediately, and read from the Kafka queue on the other side.
Here is the architecture:
GRPC ----> Kafka writer ----> Kafka ----> Kafka reader ----> GRPC
You see, that's why people invented the word "stream".
Do not regard the 1 GB as a piece of static data; instead, regard it as a stream.
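A rough sketch of the writer side of that architecture (topic name, chunk size, and the transfer-id scheme are my assumptions, not a fixed protocol):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class ChunkWriter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
                byte[] bigPayload = new byte[64 * 1024 * 1024]; // stand-in for the real data
                int chunkSize = 1024 * 1024;                    // 1 MB per message
                String transferId = "transfer-42";              // lets the reader group chunks

                for (int offset = 0; offset < bigPayload.length; offset += chunkSize) {
                    int len = Math.min(chunkSize, bigPayload.length - offset);
                    byte[] chunk = new byte[len];
                    System.arraycopy(bigPayload, offset, chunk, 0, len);
                    // Same key => same partition => the reader sees chunks in order.
                    producer.send(new ProducerRecord<>("big-transfers", transferId, chunk));
                }
            } // close() flushes all pending sends
        }
    }

The reader side consumes the topic, concatenates the chunks for a given transfer id, and only then parses the result.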
I have a Flink job (Scala) that is basically reading from a Kafka topic (Kafka 1.0), aggregating data (a 1-minute event-time tumbling window, using a fold function, which I know is deprecated but is easier to implement than an aggregate function), and writing the result to 2 different Kafka topics.
The problem is: when I'm using the FS state backend, everything runs smoothly and checkpoints take 1-2 seconds, with an average state size of 200 MB - that is, until the state size increases (while closing a gap, for example).
I figured I would try RocksDB (over HDFS) for checkpoints - but the throughput is SIGNIFICANTLY lower than with the FS state backend. As I understand it, Flink does not need to serialize/deserialize on every state access when using the FS state backend, because the state is kept in memory (on the heap), while RocksDB DOES, and I guess that accounts for the slowdown (and the backpressure, and checkpoints taking MUCH longer, sometimes timing out after 10 minutes).
Still, there are times when the state cannot fit in memory, so I am trying to figure out how to make the RocksDB state backend perform "better".
Is it because of the deprecated fold function? Do I need to fine-tune some parameters that are not easily found in the documentation? Any tips?
Each state backend holds the working state somewhere, and then durably persists its checkpoints in a distributed filesystem. The RocksDB state backend holds its working state on disk, and this can be a local disk, hopefully faster than hdfs.
Try setting state.backend.rocksdb.localdir (see https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/state_backends.html#rocksdb-state-backend-config-options) to somewhere on the fastest local filesystem on each taskmanager.
Turning on incremental checkpointing could also make a large difference.
Also see Tuning RocksDB.
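Putting those suggestions together, a minimal configuration sketch (Flink 1.6-era API; both paths are assumptions for your environment):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class JobSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoints still go to HDFS; 'true' enables incremental checkpointing.
            RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints", true);

            // Working state on the fastest local disk of each taskmanager;
            // this is the programmatic equivalent of state.backend.rocksdb.localdir.
            backend.setDbStoragePath("/mnt/local-ssd/flink/rocksdb");

            env.setStateBackend(backend);
            env.enableCheckpointing(60_000); // checkpoint every 60 seconds

            // ... kafka source -> windowed aggregation -> two kafka sinks ...
        }
    }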
I have a large file that I want to process using Lambda functions in AWS. Since I cannot control the size of the file, I came up with the solution of distributing the processing across multiple Lambda function calls to avoid timeouts. Here's how it works:
I dedicated a bucket to accept the new input files to be processed.
I set a trigger on the bucket to handle each time a new file is uploaded (let's call it uploadHandler)
uploadHandler measures the size of the file and splits it into equal chunks.
Each chunk is sent to processor lambda function to be processed.
Notes:
The uploadHandler does not read the file content.
The data sent to each processor is just { start: #, end: # }.
Multiple instances of the processor are called in parallel.
Each processor call reads its own chunk of the file individually and generates the output for it.
So far so good. The problem is how to consolidate the outputs of all the processor calls into one output. Does anyone have a suggestion? Also, how do I know when all the processors have finished executing?
I recently had a similar problem. I solved it using AWS Lambda and Step Functions, following this solution: https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-create-iterate-pattern-section.html
In this specific example the execution doesn't happen in parallel but sequentially. However, when the state machine finishes executing, you have the guarantee that the file was processed completely and correctly. I don't know if that is exactly what you are looking for.
Option 1:
After breaking up the file, have the uploadHandler function call the processor functions synchronously.
Make the calls concurrent, so that you can trigger all processors at once. Lambda functions have only one vCPU (or 2 vCPUs if RAM > 1,800 MB), but the requests are IO-bound, so one is enough.
The uploadHandler will wait for all processors to respond, and then you can assemble all the responses.
Pros: simpler to implement, no storage;
Cons: no visibility on what's going on until everything is finished;
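A sketch of Option 1's fan-out using the AWS SDK for Java (the function name "processor" and the {start, end} payload follow your question; error handling is omitted):

    import com.amazonaws.services.lambda.AWSLambdaAsync;
    import com.amazonaws.services.lambda.AWSLambdaAsyncClientBuilder;
    import com.amazonaws.services.lambda.model.InvokeRequest;
    import com.amazonaws.services.lambda.model.InvokeResult;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Future;

    public class UploadHandler {
        public List<String> processAllChunks(long fileSize, long chunkSize) throws Exception {
            AWSLambdaAsync lambda = AWSLambdaAsyncClientBuilder.defaultClient();

            // Fire one invocation per chunk without waiting.
            List<Future<InvokeResult>> pending = new ArrayList<>();
            for (long start = 0; start < fileSize; start += chunkSize) {
                long end = Math.min(start + chunkSize, fileSize);
                pending.add(lambda.invokeAsync(new InvokeRequest()
                    .withFunctionName("processor")
                    .withPayload(String.format("{\"start\": %d, \"end\": %d}", start, end))));
            }

            // Wait for every processor and assemble the responses.
            List<String> results = new ArrayList<>();
            for (Future<InvokeResult> f : pending) {
                InvokeResult r = f.get(); // blocks until that processor returns
                results.add(new String(r.getPayload().array(), StandardCharsets.UTF_8));
            }
            return results;
        }
    }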
Option 2:
Persist a processingJob in a DB (RDS, DynamoDB, whatever). The uploadHandler would create the job and save the number of parts into which the file was broken up. Save the job ID with each file part.
Each processor gets one part (with the job ID), processes it, then store in the DB the results of the processing.
Make each processor check whether it's the last one delivering its results; if yes, make it trigger an assembler function to collect all results and do whatever you need (see the sketch after the pros and cons below).
Pros: more visibility, as you can query your storage DB at any time to check which parts were processed and which are pending; you could store all sorts of metadata from the processor for detailed analysis, if needed;
Cons: requires a storage service and a slightly more complex handling of your Lambdas;
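Here is a sketch of that "last one?" check as a DynamoDB atomic counter (table and attribute names are made up; uploadHandler would create the job item with partsRemaining set to the number of chunks):

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
    import com.amazonaws.services.dynamodbv2.model.AttributeValue;
    import com.amazonaws.services.dynamodbv2.model.ReturnValue;
    import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;
    import com.amazonaws.services.dynamodbv2.model.UpdateItemResult;
    import java.util.Map;

    public class ProcessorFinisher {
        private final AmazonDynamoDB db = AmazonDynamoDBClientBuilder.defaultClient();

        // Call after storing this part's results; returns true only for the last part.
        public boolean finishPart(String jobId) {
            UpdateItemResult result = db.updateItem(new UpdateItemRequest()
                .withTableName("processingJobs")
                .withKey(Map.of("jobId", new AttributeValue(jobId)))
                // Atomic decrement: safe even if processors finish at the same instant.
                .withUpdateExpression("ADD partsRemaining :minusOne")
                .withExpressionAttributeValues(Map.of(":minusOne", new AttributeValue().withN("-1")))
                .withReturnValues(ReturnValue.UPDATED_NEW));

            long remaining = Long.parseLong(result.getAttributes().get("partsRemaining").getN());
            return remaining == 0; // the caller then triggers the assembler function
        }
    }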
We are working with a streaming, unbounded PCollection in Google Dataflow that originates from a Cloud Pub/Sub subscription. We are using this as a firehose to simply deliver events to BigTable continuously. Everything with the delivery is performing nicely.
Our problem is that we have downstream batch jobs that expect to read a day's worth of data out of BigTable once it is delivered. I would like to utilize windowing and triggering to implement a side effect that will write a marker row out to bigtable when the watermark advances beyond the day threshold, indicating that dataflow has reason to believe that most of the events have been delivered (we don't need strong guarantees on completeness, just reasonable ones) and that downstream processing can begin.
What we've tried is writing out the raw events as one sink in the pipeline, and then windowing into another sink, using the timing information in the pane to determine whether the watermark has advanced. The problem with this approach is that it operates on the raw events themselves again, which is undesirable since it would repeat the writing of the event rows. We can prevent this write, but the parallel path in the pipeline would still be operating over the windowed streams of events.
Is there an efficient way to attach a callback of sorts to the watermark, such that we can perform a single action when the watermark advances?
The general ability to set a timer in event time and receive a callback is definitely an important feature request, filed as BEAM-27, which is under active development.
But actually your approach of windowing into FixedWindows.of(Duration.standardDays(1)) seems like it will accomplish your goal using just the features of the Dataflow Java SDK 1.x. Instead of forking your pipeline, you can maintain the "firehose" behavior by adding the trigger AfterPane.elementCountAtLeast(1). It does incur the cost of a GroupByKey but does not duplicate anything.
The complete pipeline might look like this:
pipeline
    // Read your data from Cloud Pubsub and parse to MyValue
    .apply(PubsubIO.Read.topic(...).withCoder(MyValueCoder.of()))
    // You'll need some keys
    .apply(WithKeys.<MyKey, MyValue>of(...))
    // Window into daily windows, but still output as fast as possible
    .apply(Window.<KV<MyKey, MyValue>>into(FixedWindows.of(Duration.standardDays(1)))
        .triggering(AfterPane.elementCountAtLeast(1))
        // a trigger requires an explicit lateness and accumulation choice
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    // GroupByKey adds the necessary EARLY / ON_TIME / LATE labeling
    .apply(GroupByKey.<MyKey, MyValue>create())
    // Convert KV<MyKey, Iterable<MyValue>>
    // to KV<ByteString, Iterable<Mutation>>
    // where the iterable of mutations has the "end of day" marker if
    // it was ON_TIME
    .apply(MapElements.via(new MessageToMutationWithEndOfWindow()))
    // Write it!
    .apply(BigTableIO.Write.to(...));
Please do comment on my answer if I have missed some detail of your use case.
Firstly I'm an engineer, not a computer scientist, so please be gentle.
I currently have a C++ program which uses MySQL++. The program also incorporates the NI VISA runtime. One of the interrupt handlers receives data (1 byte) from a USB device about 200 times a second. I would like to store this data, with a timestamp on each sample, on a remote server. Is this feasible? Can anyone recommend a good approach?
Regards,
Michael
I think that performing 200 transactions/second against a remote server is asking a lot, especially when you consider that these transactions would be occurring in the context of an interrupt handler which has to do its job and get done quickly. I think it would be better to decouple your interrupt handler from your database access - perhaps have the interrupt handler store the incoming data and timestamp into some sort of in-memory data structure (array, circular linked list, or whatever, with appropriate synchronization) and have a separate thread that waits until data is available in the data structure and then pumps it to the database. I'd want to keep that interrupt handler as lean and deterministic as possible, and I'm concerned that database access across the network to a remote server would be too slow - or worse, would be OK most of the time, but sometimes would go to h*ll for no obvious reason.
This, of course, raises the question/problem of data overrun, where data comes in faster than it can be pumped to the database and the in-memory storage structure fills up. This could cause data loss. How bad a thing is it if you drop some samples?
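To make the decoupling concrete, here is a sketch in Java (the same shape ports to C++ with a mutex-guarded ring buffer and a writer thread): the interrupt-side path only enqueues, and a separate worker drains the queue and writes batches.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class SampleLogger {
        record Sample(long timestampMillis, byte value) {}

        private final BlockingQueue<Sample> queue = new ArrayBlockingQueue<>(10_000);

        // Fast path, ~200 calls/second: never touches the network.
        public void onSample(byte value) {
            // offer() silently drops the sample if the buffer is full -- the overrun policy
            queue.offer(new Sample(System.currentTimeMillis(), value));
        }

        // Slow path: one worker thread batching inserts to the remote server.
        public void runWriter(String jdbcUrl) throws Exception {
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO records(timestamp, value) VALUES (?, ?)")) {
                List<Sample> batch = new ArrayList<>(200);
                while (true) {
                    batch.clear();
                    batch.add(queue.take());   // block until at least one sample exists
                    queue.drainTo(batch, 199); // then grab up to 200 in total
                    for (Sample s : batch) {
                        ps.setLong(1, s.timestampMillis());
                        ps.setByte(2, s.value());
                        ps.addBatch();
                    }
                    ps.executeBatch(); // one round-trip for the whole batch
                }
            }
        }
    }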
I don't think you'll be able to maintain that rate with a separate INSERT per value, but if you batch the values up into large enough batches, you can send them all as one query and it should be fine.
INSERT INTO records(timestamp, value)
VALUES(1, 2), (3, 4), (5, 6), [...], (399, 400);
Just push the timestamp and value onto a buffer, and when the buffer hits 200 in size (or some other arbitrary figure), generate the SQL and send the whole lot off. Building this string up with sprintf shouldn't be too slow. Just beware of reading from a data structure that your interrupt routine might be writing to at the same time.
If you find that this SQL generation is too slow for some reason, and there's no quicker method using the API (eg. stored procedures), then you might want to run this concurrently with the data collection. Simplest is probably to stream the data across a socket or pipe to another process that performs the SQL generation. There are also multithreading approaches but they are more complex and error-prone.
In my opinion, you should do two things: 1) buffer the data, and 2) use one timestamp per buffer. The USB protocol is message-based rather than byte-based; if you are tracking messages, then timestamp the messages.
Also, databases would rather receive blocks or chunks of data than one byte at a time, since there is per-transaction overhead in the database. To measure the efficiency, divide the overhead by the number of bytes in the transaction; you'll see that large blocks are more efficient than lots of little transactions.
Another option is to store the data in a file and then use MySQL's LOAD DATA INFILE statement to load it into the database. Alternatively, buffer the data and use the MySQL C++ connector's stream interface to load it into the database.
Multi-threading doesn't guarantee being any faster than a single-threaded (apartment) model, even if you cache correctly on the server side, unless the CPU scheduler happens to favor it. What about using shaders and letting the pass-by-reference value in windows.h be the timestamp?