I wrote a unit test using KafkaEmbedded (and KafkaTemplate), but the message order is random. Does anyone know whether this is expected, and whether it is possible to guarantee the order?
Here is my code:
public class KafkaTest {
private static String TOPIC = "test.topic";
@ClassRule
public static KafkaEmbedded embeddedKafka = new KafkaEmbedded(1, true, TOPIC);
@Test
public void testEmbeddedKafkaSendOrder() throws Exception {
Map<String, Object> producerConfig = new HashMap<>();
producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, embeddedKafka.getBrokersAsString());
producerConfig.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
producerConfig.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
KafkaTemplate<String, byte[]> kafkaTemplate = new KafkaTemplate<>(new DefaultKafkaProducerFactory<>(producerConfig));
kafkaTemplate.send(TOPIC, "TEST1".getBytes()).get();
kafkaTemplate.send(TOPIC, "TEST2".getBytes()).get();
kafkaTemplate.send(TOPIC, "TEST3".getBytes()).get();
kafkaTemplate.send(TOPIC, "TEST4".getBytes()).get();
kafkaTemplate.send(TOPIC, "TEST5".getBytes()).get();
Map<String, Object> consumerConfig = new HashMap<>();
consumerConfig.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, embeddedKafka.getBrokersAsString());
consumerConfig.put(ConsumerConfig.GROUP_ID_CONFIG, "consumer-test-group");
consumerConfig.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
consumerConfig.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
consumerConfig.put("auto.offset.reset", "earliest");
final Consumer<String, byte[]> consumer = new KafkaConsumer<>(consumerConfig);
embeddedKafka.consumeFromAnEmbeddedTopic(consumer, TOPIC);
ConsumerRecords<String, byte[]> records = consumer.poll(100L);
// Tests
final Iterator<ConsumerRecord<String, byte[]>> recordIterator = records.iterator();
while (recordIterator.hasNext()) {
System.out.println("received:" + new String(recordIterator.next().value()));
}
}
}
This code prints, for example (the order can change from run to run):
received:TEST2
received:TEST4
received:TEST1
received:TEST3
received:TEST5
In Kafka, the order of messages is guaranteed within a single partition, but not across the topic as a whole.
Note that as a topic typically has multiple partitions, there is no guarantee of message time-ordering across the entire topic, just within a single partition.
Quote from the book Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale.
What can you do about this, and how can you receive the messages in order?
Option 1:
kafkaTemplate.send(TOPIC,"1", "TEST1".getBytes()).get();
kafkaTemplate.send(TOPIC,"1", "TEST2".getBytes()).get();
kafkaTemplate.send(TOPIC,"1", "TEST3".getBytes()).get();
kafkaTemplate.send(TOPIC,"1", "TEST4".getBytes()).get();
kafkaTemplate.send(TOPIC,"1", "TEST5".getBytes()).get();
This way, you send the same key "1" with every value. Kafka chooses the partition based on the key, so since all keys are equal, all messages go to the same partition and you receive your records in order.
Option 2:
Initialize KafkaEmbedded this way:
new KafkaEmbedded(1, true, 1, TOPIC);
This way you are telling Kafka that you want only one partition for this topic, so every record goes to that one partition.
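With either option in place, you can also assert the order in the test itself. Here is a minimal sketch (not the original code) that replaces the printing loop at the end of the test with an assertion; it assumes the keyed sends from Option 1 (or the single-partition topic from Option 2), JUnit's assertEquals, and java.util imports:
// Sketch only: poll with a generous timeout so all five records arrive,
// then compare the received values against the expected order.
ConsumerRecords<String, byte[]> records = consumer.poll(10000L);
List<String> received = new ArrayList<>();
for (ConsumerRecord<String, byte[]> record : records) {
    received.add(new String(record.value()));
}
assertEquals(Arrays.asList("TEST1", "TEST2", "TEST3", "TEST4", "TEST5"), received);
A single poll is usually enough against an embedded broker once every send() has completed with get(), but a polling loop with a deadline would be more robust.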
We are using a Spark job with the emr-dynamodb-connector to load data from S3 files into DynamoDB.
https://github.com/awslabs/emr-dynamodb-connector
But if a document is already present in DynamoDB, my code replaces it.
Is there a way to avoid updating existing records (based on id) if they are already present in DynamoDB? If an id is present in DynamoDB, I simply don't want to update it, just skip that id and write the rest of the records. The code I am using is:
JobConf ddbConf = new JobConf(spark.sparkContext().hadoopConfiguration());
ddbConf.set("dynamodb.output.tableName", tableName);
ddbConf.set("dynamodb.throughput.write.percent", "50");
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat");
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat");
JavaPairRDD<Text, DynamoDBItemWritable> ddbInsertFormattedRDD = finalDatasetToBeSaved.toJavaRDD().mapToPair(new PairFunction<Row, Text, DynamoDBItemWritable>() {
@Override
public Tuple2<Text, DynamoDBItemWritable> call(Row row) throws Exception {
Map<String, AttributeValue> ddbMap = new HashMap<String, AttributeValue>();
for (int i = 0 ; i <= schemaDdb.length - 1; i++) {
Object value = row.get(i);
if (value != null) {
AttributeValue att = new AttributeValue();
if(schemaDdb[i]._2.toString().equalsIgnoreCase("IntegerType")){
att.setN(value.toString());
}else{
att.setS(value.toString());
}
ddbMap.put((String)schemaDdb[i]._1, att);
}
}
DynamoDBItemWritable item = new DynamoDBItemWritable();
item.setItem(ddbMap);
return new Tuple2<Text, DynamoDBItemWritable>(new Text(""), item);
}
});
ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf);
By saying "Is there a way to avoid updating existing records (based on id) if they are already present", do you mean that you want to add another document instead of replacing/updating it?
If yes, I am afraid that won't be possible with the primary key, since it has to be unique and is what distinguishes one item from another. You would need to make that key non-primary to do this.
If you want to skip the insertion when the item already exists, you can use the condition expression attribute_not_exists(your-key), as described in the documentation.
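For illustration, here is a minimal sketch of such a conditional write using the plain AWS SDK (v1) PutItem API rather than the connector's output format. The item map corresponds to the ddbMap built in the question; "id" stands in for your table's partition key, and the table name is whatever you already use:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
import java.util.Map;

public class ConditionalPut {
    // Writes the item only if no item with the same partition key ("id" here) exists yet.
    public static void putIfAbsent(AmazonDynamoDB client, String tableName, Map<String, AttributeValue> item) {
        try {
            client.putItem(new PutItemRequest()
                    .withTableName(tableName)
                    .withItem(item)
                    .withConditionExpression("attribute_not_exists(id)"));
        } catch (ConditionalCheckFailedException e) {
            // An item with this id already exists: skip it and move on.
        }
    }
}
You would call something like this per record instead of the bulk Hadoop write shown in the question, at the cost of one PutItem call per item.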
I have a List of items to be inserted into a DynamoDB table. The size of the list may vary from 100 to 10k. I am looking for an optimised way to batch-write all the items using BatchWriteItemEnhancedRequest (Java SDK v2). What is the best way to add the items into the WriteBatch builder and then write the request using BatchWriteItemEnhancedRequest?
My Current Code:
// itemClass below stands for the mapped bean class used by the enhanced client
WriteBatch.Builder<T> builder = WriteBatch.builder(itemClass).mappedTableResource(getTable());
items.forEach(item -> { builder.addPutItem(item); });
BatchWriteItemEnhancedRequest bwr = BatchWriteItemEnhancedRequest.builder().writeBatches(builder.build()).build();
BatchWriteResult batchWriteResult =
        DynamoDB.enhancedClient().batchWriteItem(getBatchWriteItemEnhancedRequest(builder));
do {
// Check for unprocessed keys which could happen if you exceed
// provisioned throughput
List<T> unprocessedItems = batchWriteResult.unprocessedPutItemsForTable(getTable());
if (unprocessedItems.size() != 0) {
unprocessedItems.forEach(unprocessedItem -> {
builder.addPutItem(unprocessedItem);
});
batchWriteResult = DynamoDB.enhancedClient().batchWriteItem(getBatchWriteItemEnhancedRequest(builder));
}
} while (batchWriteResult.unprocessedPutItemsForTable(getTable()).size() > 0);
I am looking for batching logic and a better way to execute the BatchWriteItemEnhancedRequest.
I came up with a utility class to deal with that. The batches-of-batches approach in v2 is overly complex for most use cases, especially when we're still limited to 25 items overall.
public class DynamoDbUtil {
private static final int MAX_DYNAMODB_BATCH_SIZE = 25; // AWS blows chunks if you try to include more than 25 items in a batch or sub-batch
/**
* Writes the list of items to the specified DynamoDB table.
*/
public static <T> void batchWrite(Class<T> itemType, List<T> items, DynamoDbEnhancedClient client, DynamoDbTable<T> table) {
List<List<T>> chunksOfItems = Lists.partition(items, MAX_DYNAMODB_BATCH_SIZE); // Guava's Lists.partition
chunksOfItems.forEach(chunkOfItems -> {
List<T> unprocessedItems = batchWriteImpl(itemType, chunkOfItems, client, table);
while (!unprocessedItems.isEmpty()) {
// some failed (provisioning problems, etc.), so write those again
unprocessedItems = batchWriteImpl(itemType, unprocessedItems, client, table);
}
});
}
/**
* Writes a single batch of (at most) 25 items to DynamoDB.
* Note that the overall limit of items in a batch is 25, so you can't have nested batches
* of 25 each that would exceed that overall limit.
*
* @return those items that couldn't be written due to provisioning issues, etc., but were otherwise valid
*/
private static <T> List<T> batchWriteImpl(Class<T> itemType, List<T> chunkOfItems, DynamoDbEnhancedClient client, DynamoDbTable<T> table) {
WriteBatch.Builder<T> subBatchBuilder = WriteBatch.builder(itemType).mappedTableResource(table);
chunkOfItems.forEach(subBatchBuilder::addPutItem);
BatchWriteItemEnhancedRequest.Builder overallBatchBuilder = BatchWriteItemEnhancedRequest.builder();
overallBatchBuilder.addWriteBatch(subBatchBuilder.build());
return client.batchWriteItem(overallBatchBuilder.build()).unprocessedPutItemsForTable(table);
}
}
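For completeness, a hypothetical usage of the utility above could look like this; Customer is a placeholder bean class mapped with the enhanced client, and "Customer" a placeholder table name:
// Sketch only: the bean class, table name and item source are placeholders.
DynamoDbEnhancedClient enhancedClient = DynamoDbEnhancedClient.create();
DynamoDbTable<Customer> customerTable =
        enhancedClient.table("Customer", TableSchema.fromBean(Customer.class));
List<Customer> customers = loadCustomers(); // hypothetical source of the items to write
DynamoDbUtil.batchWrite(Customer.class, customers, enhancedClient, customerTable);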
I'm building up an Instances object, adding Attributes, and then adding data in the form of Instance objects.
When I go to write it out, the toString() method throws an IndexOutOfBoundsException and cannot evaluate the data in the Instances object. I get the error when I try to print the data, and I can also see the exception in the debugger, which shows that it can't evaluate toString() for the data object.
The only clue I have is that the error message seems to be taking the first data element (the student identifier) and using it as an index, and I'm confused as to why.
The code:
// Set up the attributes for the Weka data model
ArrayList<Attribute> attributes = new ArrayList<>();
attributes.add(new Attribute("StudentIdentifier", true));
attributes.add(new Attribute("CourseGrade", true));
attributes.add(new Attribute("CourseIdentifier"));
attributes.add(new Attribute("Term", true));
attributes.add(new Attribute("YearCourseTaken", true));
// Create the data model object - I'm not happy that capacity is required and fixed? But that's another issue
Instances dataSet = new Instances("Records", attributes, 500);
// Set the attribute that will be used for prediction purposes - that will be CourseIdentifier
dataSet.setClassIndex(2);
// Pull back all the records in this term range, create Weka Instance objects for each and add to the data set
List<Record> records = recordsInTermRangeFindService.find(0, 10);
int count = 0;
for (Record r : records) {
Instance i = new DenseInstance(attributes.size());
i.setValue(attributes.get(0), r.studentIdentifier);
i.setValue(attributes.get(1), r.courseGrade);
i.setValue(attributes.get(2), r.courseIdentifier);
i.setValue(attributes.get(3), r.term);
i.setValue(attributes.get(4), r.yearCourseTaken);
dataSet.add(i);
}
System.out.println(dataSet.size());
BufferedWriter writer = null;
try {
writer = new BufferedWriter(new FileWriter("./test.arff"));
writer.write(dataSet.toString());
writer.flush();
writer.close();
} catch (IOException e) {
e.printStackTrace();
}
The error message:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1010, Size: 0
I finally figured it out. I was declaring the Attributes as string attributes with the 'true' second parameter in the constructors, but the values were integers coming out of the database table. I needed to change my lines to convert the integers to strings:
i.setValue(attributes.get(0), Integer.toString(r.studentIdentifier));
However, that created a different set of issues for me, since things like the Apriori algorithm don't work on string attributes! I'm continuing to plug along learning Weka.
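In case it helps someone else, a possible next step (just a sketch, and only workable if the set of values is known up front) is to declare such fields as nominal attributes rather than string attributes, since Apriori expects nominal data. For example, for the CourseGrade attribute:
// Sketch: declare CourseGrade as a nominal attribute with a fixed, known value set.
// The values are placeholders and must match the strings later passed to setValue().
ArrayList<String> gradeValues = new ArrayList<>(Arrays.asList("1", "2", "3", "4", "5"));
attributes.add(new Attribute("CourseGrade", gradeValues));
// later, for each record (CourseGrade is at index 1 in the attribute list above):
// i.setValue(attributes.get(1), Integer.toString(r.courseGrade));
setValue(Attribute, String) works for nominal attributes as long as the value is one of the declared ones.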
I'm using a save expression on an encrypted attribute named transactionAmount while updating data in DynamoDB. However, the update is failing with a ConditionalCheckFailedException. The data was encrypted on the client side during the initial persistence to DynamoDB, in the same way as described here. Following is the code:
Data Transfer Object:
public final class SampleDTO {
@DynamoDBHashKey(attributeName = CommonDynamoDBSchemaConstants.UNIQUE_KEY)
@Getter(onMethod = @__({ @DoNotTouch }))
private String uniqueKey;
@DynamoDBAttribute(attributeName = CommonDynamoDBSchemaConstants.EVENT_RUNNING_TIME_EPOCH)
@Getter(onMethod = @__({ @DoNotTouch }))
private Long eventRunningTimeInEpoch;
@DynamoDBAttribute(attributeName = CommonDynamoDBSchemaConstants.INSTRUMENT_TYPE)
@DynamoDBTypeConverted(converter = InstrumentTypeConverter.class)
@Getter(onMethod = @__({ @DoNotTouch }))
private InstrumentType instrumentType;
@DynamoDBAttribute(attributeName = CommonDynamoDBSchemaConstants.TRANSACTION_AMOUNT)
private String transactionAmount;
}
Data Access Code:
// fetches data from dynamoDB based on unique key passed to it.
SampleDTO sampleDTO = getSampleDTO("testLedgerUniqueKey");
sampleDTO.setInstrumentType(InstrumentType.MACHINE);
DynamoDBSaveExpression saveExpression = new DynamoDBSaveExpression();
Map<String, ExpectedAttributeValue> expressionAttributeValues =
new HashMap<String, ExpectedAttributeValue>();
expressionAttributeValues.put(
CommonDynamoDBSchemaConstants.LEDGER_UNIQUE_KEY,
new ExpectedAttributeValue(true)
.withValue(new AttributeValue(sampleDTO.getLedgerUniqueKey())));
expressionAttributeValues.put(
CommonDynamoDBSchemaConstants.TRANSACTION_AMOUNT,
new ExpectedAttributeValue(true).withValue(
new AttributeValue(sampleDTO.getTransactionAmount())));
saveExpression.setExpected(expressionAttributeValues);
saveExpression.setConditionalOperator(ConditionalOperator.AND);
dynamoDBMapper.save(sampleDTO, saveExpression, null /*dynamoDBMapperConfig*/);
ConditionalCheckFailedException:
You are trying to update a record that does not exist according to your query condition. Please verify your query condition to make sure it actually matches a record.
Reference:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.MessagesAndCodes
You specified a condition that evaluated to false. For example, you might have tried to perform a conditional update on an item, but the actual value of the attribute did not match the expected value in the condition.
Hope it helps.
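If it is not obvious which expectation fails, one way to narrow it down (a sketch reusing the mapper and the key from the question) is to catch the exception and load the stored item so you can compare it with the values used in the ExpectedAttributeValue entries:
try {
    dynamoDBMapper.save(sampleDTO, saveExpression, null /*dynamoDBMapperConfig*/);
} catch (ConditionalCheckFailedException e) {
    // Inspect what is actually stored for this key and compare it with the
    // expected values above (the unique key and transactionAmount).
    SampleDTO stored = dynamoDBMapper.load(SampleDTO.class, "testLedgerUniqueKey");
    System.out.println("stored transactionAmount: "
            + (stored == null ? "<no item>" : stored.getTransactionAmount()));
}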
I recently set up a 4-node Cassandra cluster for learning, with one column family that holds time-series data as:
Key -> {column name: timeUUID, column value: csv log line, ttl: 1 year}. I used the Netflix Astyanax Java client to load about 1 million log lines.
I also configured Hadoop with 1 namenode and 4 datanodes to run map-reduce jobs for some analytics on the Cassandra data.
All the available examples on the internet use a column name in the SlicePredicate for the Hadoop job configuration, whereas I have timeUUIDs as column names. How can I efficiently feed the Cassandra data to the Hadoop job configuration in batches of 1000 columns at a time?
Some rows have more than 10,000 columns in this test data, and there are expected to be more in the real data.
I configure my job as follows:
public int run(String[] arg0) throws Exception {
Job job = new Job(getConf(), JOB_NAME);
job.setJarByClass(LogTypeCounterByDate.class);
job.setMapperClass(LogTypeCounterByDateMapper.class);
job.setReducerClass(LogTypeCounterByDateReducer.class);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
ConfigHelper.setRangeBatchSize(getConf(), 1000);
SliceRange sliceRange = new SliceRange(ByteBuffer.wrap(new byte[0]),
ByteBuffer.wrap(new byte[0]), true, 1000);
SlicePredicate slicePredicate = new SlicePredicate();
slicePredicate.setSlice_range(sliceRange);
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
ConfigHelper.setInputRpcPort(job.getConfiguration(), INPUT_RPC_PORT);
ConfigHelper.setInputInitialAddress(job.getConfiguration(), INPUT_INITIAL_ADRESS);
ConfigHelper.setInputPartitioner(job.getConfiguration(), INPUT_PARTITIONER);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), slicePredicate);
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.waitForCompletion(true);
return job.isSuccessful() ? 0 : 1;
}
But I am not able to understand how to define the Mapper; could you kindly provide a template for the Mapper class?
public static class LogTypeCounterByDateMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable>
{
private Text key = null;
private LongWritable value = null;
@Override
protected void setup(Context context){
}
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context){
//String[] lines = columns.;
}
}
ConfigHelper.setRangeBatchSize(getConf(), 1000);
...
SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(TimeUUID.asByteBuffer(startValue), TimeUUID.asByteBuffer(endValue), false, 1000));
ConfigHelper.setInputSlicePredicate(conf, predicate);
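To round out the Mapper template from the question, the map method could iterate the slice of columns along these lines. This is just a sketch: it assumes each column value holds one CSV log line as described above, uses the first CSV field as a placeholder grouping key, and relies on org.apache.cassandra.utils.ByteBufferUtil for decoding:
@Override
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
        throws IOException, InterruptedException {
    for (IColumn column : columns.values()) {
        // Each column value is one CSV log line (per the schema in the question).
        String logLine = ByteBufferUtil.string(column.value());
        // Placeholder grouping key: the first CSV field; emit a count of 1 per line.
        context.write(new Text(logLine.split(",")[0]), new LongWritable(1L));
    }
}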