I'm trying to build a web server that only receives HTTP requests (Twitter data) from the client side, processes them (some calculation), and returns a response to the client. The only consideration is speed.
I'm thinking of Spark Streaming, but it seems that it can't send a response back. Is there an efficient solution, or are there other recommendations for the overall framework?
One possible solution is to use a Kafka message queue together with Apache Spark Streaming.
Kafka has two parts:
1] Producer => sends messages across the network and acts as a buffer
2] Consumer => receives messages from the producer
A possible design could be:
1] Write a producer using JavaScript/Node.js.
2] Write a consumer in a Spark Streaming program that consumes all the Twitter messages sent from the front end. You can then process these messages in Spark.
Now, to give a response back:
3] Write a second producer in the Spark program and send your processed data to a consumer.
4] Write a second consumer in your front-end program (JavaScript/Node.js) that consumes the processed messages sent by Spark Streaming.
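A rough sketch in Java of what the Spark side (steps 2] and 3]) could look like, assuming the spark-streaming-kafka-0-10 integration; the broker address, the topic names "tweets-in"/"tweets-out" and the process() step are placeholders, not anything prescribed by Spark or Kafka:

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;

public class TweetPipeline {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("tweet-pipeline");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "tweet-processor");

        // 2] Consumer side: read the raw tweets sent by the front-end producer
        JavaInputDStream<ConsumerRecord<String, String>> tweets =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("tweets-in"), kafkaParams));

        // 3] Second producer: push the processed results to a topic the front end consumes
        tweets.foreachRDD(rdd -> rdd.foreachPartition(records -> {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (records.hasNext()) {
                    String processed = process(records.next().value());
                    producer.send(new ProducerRecord<>("tweets-out", processed));
                }
            }
        }));

        jssc.start();
        jssc.awaitTermination();
    }

    // Placeholder for your actual calculation
    private static String process(String tweetJson) {
        return tweetJson;
    }
}

In a real job you would pool or reuse the producer instead of creating one per partition per batch; it is done inline here only to keep the sketch short.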
Please let me know if you need more information.
Related
We have created an Alpakka stream which consumes Kafka messages from a topic and then processes them. The messages are processed in parallel, using mapAsyncUnordered with a configured parallelism. The Kafka lag for the consumer increases, but the application uses only 1 CPU core. I have changed the consumer dispatcher to akka.actor.default-dispatcher, which uses a fork-join executor, expecting it to use more than one CPU core. My application runs on 32 cores.
Please find the configured settings below:
akka.kafka.consumer.use-dispatcher = "akka.actor.default-dispatcher"
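(For context, a dedicated dispatcher could alternatively be declared in application.conf and referenced from the same setting; the dispatcher name and pool sizes below are purely illustrative, not what we currently run.)

akka.kafka.consumer.use-dispatcher = "kafka-consumer-dispatcher"

kafka-consumer-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 8
    parallelism-factor = 1.0
    parallelism-max = 32
  }
  throughput = 1
}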
Consumer stream code:
Consumer.DrainingControl<Done> control =
    Consumer.committableSource(consumerSettings, Subscriptions.topics(topic))
        .buffer(500, OverflowStrategy.backpressure())
        // De-serialize the response from JSON to a Java object
        .mapAsyncUnordered(5, /* deserialize the output */)
        .mapAsyncUnordered(5, /* process it and perform some calculations */)
        .mapAsyncUnordered(5, /* do something and return the consumer offset */)
        // Commit the offset
        .toMat(Committer.sink(committerSettings.withMaxBatch(100)), Consumer::createDrainingControl)
        .run(materializer);
The stream runs in an akka-cluster, load-balanced by using the same consumer group id. We also have a typed actor system in the application, used for triggering the request, with a group router that helps share the load across the cluster. The triggered request is sent to a microservice as a Kafka message, and we get the response back as a Kafka message, which is processed by the stream. These messages do not need to be processed in order, hence the use of mapAsyncUnordered…
I tried increasing the parallelism even to 100, but didn't see any change.
Thanks in advance
If you're using PUSH sockets, you'll find that the first PULL socket to connect will grab an unfair share of messages. The accurate rotation of messages only happens when all PULL sockets are successfully connected, which can take some milliseconds. As an alternative to PUSH/PULL, for lower data rates, consider using ROUTER/DEALER and the load balancing pattern.
So one way to do sync in PUSH/PULL is using the load balancing pattern.
For this specific case below, I wonder whether there is another way to do sync:
I could have each worker block until the connection on its PULL endpoint is successfully set up, and then have the worker send a special message to the 'sink'. Once the 'sink' has received the special messages from all workers, it sends a message over REQ-REP to the 'ventilator' to notify it that all workers are ready, and the 'ventilator' then starts distributing jobs to the workers.
Is it reliable?
Yes, so long as the Sink knows how many Workers to wait for before telling the Ventilator that it's OK to start sending messages. There's the question of whether the special messages from the Workers get through if they start up before the Sink connects - but you could solve that by having them keep sending their special message until they start getting data from the Ventilator. If you do this, the Sink would of course simply ignore any duplicates it receives.
Of course, that's not quite the same as the Workers having a live, working connection to the Ventilator; but the Ventilator could itself send out special do-nothing messages that the Workers receive, and when a Worker receives one of those, that's when it can start sending its special message to the Sink.
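A minimal sketch of the sink-side barrier described above, assuming JeroMQ; the endpoints, the worker count and the "READY" payload are illustrative, and the Ventilator is assumed to expose a REP socket just for this handshake:

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class Sink {
    public static void main(String[] args) {
        final int expectedWorkers = 4; // how many workers the sink waits for
        try (ZContext ctx = new ZContext()) {
            // PULL socket the workers push their results (and READY messages) to
            ZMQ.Socket results = ctx.createSocket(SocketType.PULL);
            results.bind("tcp://*:5558");

            // REQ socket to the ventilator's REP socket, used only for the handshake
            ZMQ.Socket ventilator = ctx.createSocket(SocketType.REQ);
            ventilator.connect("tcp://localhost:5559");

            // Collect one READY per worker, ignoring duplicates from re-sends
            java.util.Set<String> ready = new java.util.HashSet<>();
            while (ready.size() < expectedWorkers) {
                String msg = results.recvStr();
                if (msg != null && msg.startsWith("READY:")) {
                    ready.add(msg); // e.g. "READY:worker-3"
                }
            }

            // Tell the ventilator it may start distributing jobs
            ventilator.send("ALL_WORKERS_READY");
            ventilator.recvStr(); // wait for the REP acknowledgement

            // ... from here on, read normal results from the PULL socket ...
        }
    }
}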
I have a Kinesis producer which writes a single type of message to a stream. I want to process this stream in multiple, completely different consumer applications. So, a pub/sub with a single publisher for a given topic/stream. I also want to make use of checkpointing to ensure that each consumer processes every message written to the stream.
Initially, I was using the same App Name for all consumers and producers. However, I started getting the following error once I started more than one consumer:
com.amazonaws.services.kinesis.model.InvalidArgumentException: StartingSequenceNumber 49564236296344566565977952725717230439257668853369405442 used in GetShardIterator on shard shardId-000000000000 in stream PackageCreated under account ************ is invalid because it did not come from this stream. (Service: AmazonKinesis; Status Code: 400; Error Code: InvalidArgumentException; Request ID: ..)
This seems to be because consumers are clashing with their checkpointing as they are using the same App Name.
From reading the documentation, it seems the only way to do pub/sub with checkpointing is by having a stream per consumer application, which requires each producer to know about all possible consumers. This is more tightly coupled than I want; it's really just a queue.
It seems like Kafka supports what I want: arbitrary consumption of a given topic/partition, since consumers are completely in control of their own checkpointing. Is my only option to move to Kafka, or some other alternative, if I want pub/sub with checkpointing?
My RecordProcessor code, which is identical in each consumer:
override def processRecords(processRecordsInput: ProcessRecordsInput): Unit = {
  log.trace("Received record(s) from kinesis")

  for {
    record <- processRecordsInput.getRecords
    json   <- jawn.parseByteBuffer(record.getData).toOption
    msg    <- decode[T](json.toString).toOption
  } yield subscriber ! msg

  processRecordsInput.getCheckpointer.checkpoint()
}
The code parses the message and sends it off to the subscriber. For now, I'm simply marking all messages as successfully received. I can see messages being sent on the AWS Kinesis dashboard, but no reads happen, presumably because each application has its own AppName and doesn't see any other messages.
The pattern you want, one publisher to and multiple independent consumers from a single Kinesis stream, is supported. You don't need a separate stream per consumer.
How do you do that? Give a different application name to every consumer. That way, the checkpointing info of one consumer won't collide with that of another.
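For example (a sketch only, assuming the KCL 1.x Java API; the application name, worker id and MyRecordProcessorFactory are placeholders), each consumer application would be started with its own application name against the same stream:

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

public class ConsumerLauncher {
    public static void main(String[] args) {
        // Each consumer application gets its OWN application name, so each one
        // keeps its own checkpoint/lease table and independently reads the full stream.
        KinesisClientLibConfiguration config = new KinesisClientLibConfiguration(
                "package-created-emailer",                // unique per consumer application
                "PackageCreated",                         // the shared stream
                new DefaultAWSCredentialsProviderChain(),
                "worker-" + java.util.UUID.randomUUID());

        Worker worker = new Worker.Builder()
                .recordProcessorFactory(new MyRecordProcessorFactory()) // hypothetical factory
                .config(config)
                .build();
        worker.run();
    }
}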
Check the first response to this: https://forums.aws.amazon.com/message.jspa?messageID=554375
I am trying to create a web service in Spring MVC 4.
I have created a producer and a consumer.
I send data from the producer and I receive that JSON data on the consumer side, but I am not able to respond to the producer that I have received the JSON data.
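If the consumer side is an HTTP endpoint, one way to acknowledge receipt is to return a body (or status) from the controller method. A minimal sketch, with an illustrative path and payload:

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ConsumerController {

    // Accept the producer's JSON and acknowledge it in the HTTP response
    @RequestMapping(value = "/consume", method = RequestMethod.POST,
                    consumes = "application/json")
    public ResponseEntity<String> consume(@RequestBody String json) {
        // ... handle/store the payload here ...
        return ResponseEntity.ok("{\"status\":\"received\"}");
    }
}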
When using Kafka, I can set a codec by setting the kafka.compression.codec property of my Kafka producer.
Suppose I use snappy compression in my producer; when consuming the messages from Kafka with some Kafka consumer, should I do something to decode the data from snappy, or is decompression a built-in feature of the Kafka consumer?
In the relevant documentation I could not find any property that relates to compression on the Kafka consumer (such properties only exist for the producer).
Can someone clarify this?
As far as I understand, decompression is taken care of by the consumer itself, as mentioned on their official wiki page:
The consumer iterator transparently decompresses compressed data and only returns an uncompressed message
As described in this article, the consumer works as follows:
The consumer has background “fetcher” threads that continuously fetch data in batches of 1MB from the brokers and add it to an internal blocking queue. The consumer thread dequeues data from this blocking queue, decompresses and iterates through the messages
Also, on the documentation page under End-to-end Batch Compression, it is written that:
A batch of messages can be clumped together compressed and sent to the server in this form. This batch of messages will be written in compressed form and will remain compressed in the log and will only be decompressed by the consumer.
So it appears that the decompression part is handled in the consumer itself; all you need to do is provide a valid/supported compression type using the compression.codec ProducerConfig attribute while creating the producer. I couldn't find any example or explanation describing a decompression step on the consumer end. Please correct me if I am wrong.
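For illustration, a producer using the old 0.8 producer API with snappy enabled might look roughly like this (the broker address and topic name are placeholders):

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class CompressedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("compression.codec", "snappy"); // none | gzip | snappy

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        producer.send(new KeyedMessage<String, String>("my-topic", "hello, compressed world"));
        producer.close();
    }
}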
I have the same issue with v0.8.1, and this compression/decompression in Kafka is poorly documented beyond saying that the consumer should "transparently" decompress compressed data, which it never did for me.
The example high-level consumer client using ConsumerIterator on the Kafka web site only works with uncompressed data. Once I enable compression in the producer client, the messages never get into the following while loop. Hopefully they fix this issue soon, or they shouldn't claim this feature, since some users may rely on Kafka to transport large messages that need batching and compression.
ConsumerIterator<byte[], byte[]> it = stream.iterator();
while (it.hasNext())
{
    String message = new String(it.next().message());
}
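For context, the stream used in the snippet above would come from the 0.8 high-level consumer, roughly like this (the ZooKeeper address, group id and topic are placeholders):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class HighLevelConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "my-group");
        props.put("auto.offset.reset", "smallest");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        Map<String, Integer> topicCountMap = new HashMap<>();
        topicCountMap.put("my-topic", 1); // one stream (thread) for this topic

        KafkaStream<byte[], byte[]> stream =
                connector.createMessageStreams(topicCountMap).get("my-topic").get(0);

        // ... then iterate over stream as in the snippet above; the iterator is where
        // gzip/snappy batches are supposed to be transparently decompressed ...
    }
}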