Apache Beam: Why does it write to Spanner twice on REPORT_FAILURES mode? - google-cloud-platform

While looking at SpannerIO I found an interesting piece of write logic and want to understand the reasoning behind it.
In write (WriteToSpannerFn) with the REPORT_FAILURES failure mode, it appears to try writing failed mutations twice.
I assume this is so that each mutation's exception can be logged. Is that assumption correct, and is there any workaround?
Below, I removed some lines for simplicity.
public void processElement(ProcessContext c) {
  Iterable<MutationGroup> mutations = c.element();
  boolean tryIndividual = false;
  try {
    Iterable<Mutation> batch = Iterables.concat(mutations);
    spannerAccessor.getDatabaseClient().writeAtLeastOnce(batch);
  } catch (SpannerException e) {
    if (failureMode == FailureMode.REPORT_FAILURES) {
      tryIndividual = true;
    } else {
      ...
    }
  }
  if (tryIndividual) {
    for (MutationGroup mg : mutations) {
      try {
        spannerAccessor.getDatabaseClient().writeAtLeastOnce(mg);
      } catch (SpannerException e) {
        LOG.warn("Failed to submit the mutation group", e);
        c.output(failedTag, mg);
      }
    }
  }
}

So rather than writing each Mutation individually to the database, the SpannerIO.write() connector tries to write a batch of Mutations in a single transaction for efficiency.
If just one of the Mutations in that batch fails, the whole transaction fails, so in REPORT_FAILURES mode the mutations are retried individually to find which Mutation(s) are the problematic ones. Those failing MutationGroups are then emitted on the failed-mutations output (failedTag in the code above) rather than failing the pipeline.
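As for a workaround: the individual re-writes appear to be inherent to REPORT_FAILURES mode (as the code above shows), but you can consume the reported failures downstream instead of relying on the warning log. A minimal sketch, assuming mutations is the PCollection<Mutation> being written; the instance/database IDs and the handling DoFn body are placeholders, not values from the question:

// Sketch: write with REPORT_FAILURES and consume the failed MutationGroups
// from the connector's SpannerWriteResult output.
SpannerWriteResult result = mutations.apply("WriteToSpanner",
    SpannerIO.write()
        .withInstanceId("my-instance")   // placeholder
        .withDatabaseId("my-database")   // placeholder
        .withFailureMode(SpannerIO.FailureMode.REPORT_FAILURES));

result.getFailedMutations()
    .apply("HandleFailedMutations", ParDo.of(new DoFn<MutationGroup, Void>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // e.g. count them with a metric, or route them somewhere durable
            System.err.println("Mutation group failed to apply: " + c.element());
        }
    }));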

Related

Unit testing suspend coroutine

I am a bit new to Kotlin and to testing it. I am trying to test a DAO wrapper that uses a suspend method, which in turn uses awaitFirst() on an SQL return object. However, when I wrote the unit test for it, it just gets stuck in a loop. I think this is because awaitFirst() is not in the same scope as the test.
Implementation:
suspend fun queryExecution(querySpec: DatabaseClient.GenericExecuteSpec): OrderDomain {
    var result: Map<String, Any>?
    try {
        result = querySpec.fetch().first().awaitFirst()
    } catch (e: Exception) {
        if (e is DataAccessResourceFailureException)
            throw CommunicationException(
                "Cannot connect to " + DatabaseConstants.DB_NAME +
                    DatabaseConstants.ORDERS_TABLE + " when executing querySelect",
                "querySelect",
                e
            )
        throw InternalException("Encountered R2dbcException when executing SQL querySelect", e)
    }
    if (result == null)
        throw ResourceNotFoundException("Resource not found in Aurora DB")
    try {
        return OrderDomain(result)
    } catch (e: Exception) {
        throw InternalException("Exception when parsing to OrderDomain entity", e)
    } finally {
        logger.info("querySelect;stage=end")
    }
}
Unit Test:
@Test
fun `get by orderid id, null`() = runBlocking {
    // Assign
    Mockito.`when`(fetchSpecMock.first()).thenReturn(monoMapMock)
    Mockito.`when`(monoMapMock.awaitFirst()).thenReturn(null)
    // Act & Assert
    val exception = assertThrows<ResourceNotFoundException> {
        auroraClientWrapper.queryExecution(
            databaseClient.sql("SELECT * FROM orderTable WHERE orderId=:1").bind("1", "123") // orderId
        )
    }
    assertEquals("Resource not found in Aurora DB", exception.message)
}
I noticed this issue on https://github.com/Kotlin/kotlinx.coroutines/issues/1204 but none of the workarounds there have worked for me.
Using runBlocking in the unit test just causes my tests to never complete. Using runBlockingTest explicitly throws an error saying "Job never completed". Does anyone have any idea? Any hack at this point?
I also broadly understand the point that you should not use suspend together with a blocking call, because that defeats the purpose of suspend: it releases the thread to continue later, whereas blocking forces the thread to wait for a result. But then how does this work?
private suspend fun queryExecution(querySpec: DatabaseClient.GenericExecuteSpec): Map<String, Any>? {
    var result: Map<String, Any>?
    try {
        result = withContext(Dispatchers.Default) {
            querySpec.fetch().first().block()
        }
        return result
    } catch (e: Exception) {
        // (rest of the error handling elided in this excerpt)
        throw e
    }
}
Does this mean withContext will use a new thread and free the old thread for use elsewhere? That doesn't really optimize anything, since I will still have one thread being blocked regardless of spawning a new context, right?
Found the solution.
The monoMapMock is a mock value from Mockito. It seems the kotlinx-coroutines-test machinery can't intercept an async call that returns a mocked Mono. So, as suggested by Louis, I stopped mocking it and made the method return a real Mono value instead of a mocked one:
@Test
fun `get by orderid id, null`() = runBlocking {
    // Assign
    Mockito.`when`(fetchSpecMock.first()).thenReturn(Mono.empty())
    Mockito.`when`(monoMapMock.awaitFirst()).thenReturn(null)
    // Act & Assert
    val exception = assertThrows<ResourceNotFoundException> {
        auroraClientWrapper.queryExecution(
            databaseClient.sql("SELECT * FROM orderTable WHERE orderId=:1").bind("1", "123") // orderId
        )
    }
    assertEquals("Resource not found in Aurora DB", exception.message)
}

How to apply DebuggerHiddenAttribute to dependent/cascaded methods

I'm trying to ignore an exception that happens when my test methods run.
I'm using a unit-test project.
The problem appears when the TestCleanup method runs. It's a recursive method that cleans up all entities created in the DB during the test; it is recursive because of dependencies between them.
This method calls the generic delete method of my ORM (PetaPoco), which throws an exception if it can't delete an entity. That's not a problem in itself: the method simply runs again recursively until everything is deleted.
Now the problem: while debugging, Visual Studio stops many times in the Execute method because of the failed deletes, and I can't modify that method to ignore them. I need a way to suppress these breaks while debugging tests, something like DebuggerHiddenAttribute.
Thanks!
I tried DebuggerHiddenAttribute, but it does not work for methods called by the attributed method.
[TestCleanup(), DebuggerHidden]
public void CleanData()
{
    ErrorDlt = new Dictionary<Guid, object>();
    foreach (var entity in TestEntity.CreatedEnt)
    {
        try
        {
            CallingTest(entity);
        }
        catch (Exception e)
        {
            if (!ErrorDlt.ContainsKey(entity.Key))
                ErrorDlt.Add(entity.Key, entity.Value);
        }
    }
    if (ErrorDlt.Count > 0)
    {
        TestEntity.CreatedEnt = new Dictionary<Guid, object>();
        ErrorDlt.ForEach(x => TestEntity.CreatedEnt.Add(x.Key, x.Value));
        CleanData();
    }
}
public int Execute(string sql, params object[] args)
{
    try
    {
        OpenSharedConnection();
        try
        {
            using (var cmd = CreateCommand(_sharedConnection, sql, args))
            {
                var retv = cmd.ExecuteNonQuery();
                OnExecutedCommand(cmd);
                return retv;
            }
        }
        finally
        {
            CloseSharedConnection();
        }
    }
    catch (Exception x)
    {
        OnException(x);
        throw new DatabaseException(x.Message, LastSQL, LastArgs);
    }
}
Error messages are not required.

Why is writing to Bigquery using Dataflow EXTREMELY slow?

I can stream inserts directly into BigQuery at a speed of about 10,000 inserts per second, but when I try to insert using Dataflow, the 'ToBQRow' step (given below) is EXTREMELY slow: barely 50 rows per 10 minutes, and this is with 4 workers. Any idea why? Here's the relevant code:
PCollection<Status> statuses = p
    .apply("GetTweets", PubsubIO.readStrings().fromTopic(topic))
    .apply("ExtractData", ParDo.of(new DoFn<String, Status>() {
        @ProcessElement
        public void processElement(DoFn<String, Status>.ProcessContext c) throws Exception {
            String rowJson = c.element();
            try {
                TweetsWriter.LOGGER.debug("ROWJSON = " + rowJson);
                Status status = TwitterObjectFactory.createStatus(rowJson);
                if (status == null) {
                    TweetsWriter.LOGGER.error("Status is null");
                } else {
                    TweetsWriter.LOGGER.debug("Status value: " + status.getText());
                }
                c.output(status);
                TweetsWriter.LOGGER.debug("Status: " + status.getId());
            } catch (Exception var4) {
                TweetsWriter.LOGGER.error("Status creation from JSON failed: " + var4.getMessage());
            }
        }
    }));
statuses
    .apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            TableRow row = new TableRow();
            Status status = c.element();
            row.set("Id", status.getId());
            row.set("Text", status.getText());
            row.set("RetweetCount", status.getRetweetCount());
            row.set("FavoriteCount", status.getFavoriteCount());
            row.set("Language", status.getLang());
            row.set("ReceivedAt", (Object) null);
            row.set("UserId", status.getUser().getId());
            row.set("CountryCode", status.getPlace().getCountryCode());
            row.set("Country", status.getPlace().getCountry());
            c.output(row);
        }
    }))
    .apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)
        .withSchema(schema)
        .withMethod(Method.STREAMING_INSERTS)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));
p.run();
Turns out BigQuery under Dataflow is NOT slow. The problem was that 'status.getPlace().getCountryCode()' was returning NULL, so it was throwing a NullPointerException that I couldn't see anywhere in the log! Clearly, Dataflow logging needs to improve. It's running really well now: as soon as a message arrives on the topic, it is written to BigQuery almost instantaneously.
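For anyone hitting the same thing, a simple guard in the ToBQRow DoFn avoids the crash. A minimal sketch of such a guard, assuming the NullPointerException came from status.getPlace() being null for tweets without location data (which is common with the Twitter API):

// Sketch of a null guard inside the ToBQRow DoFn: tweets without location
// data have no Place, so dereferencing it blindly throws a NullPointerException.
Place place = status.getPlace();
if (place != null) {
    row.set("CountryCode", place.getCountryCode());
    row.set("Country", place.getCountry());
} else {
    row.set("CountryCode", (Object) null);
    row.set("Country", (Object) null);
}
c.output(row);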

How to execute multiple SQL queries in parallel in Java

I have three methods, each of which executes an SQL query and returns a list of results. I want to execute all three methods in parallel so that one does not have to wait for another to complete. I saw a Stack Overflow post about this, but it is not working for me:
How to execute multiple queries in parallel instead of sequentially? (the answer "Execute multiple queries in parallel via Streams")
I want to solve this using Java 8 features.
But it is not clear to me from that link how to call my multiple methods; please tell me.
Execute multiple queries in parallel via Streams works for your task. Here is sample code that demonstrates it:
public static void main(String[] args) {
    // Create Stream of tasks:
    Stream<Supplier<List<String>>> tasks = Stream.of(
            () -> getServerListFromDB(),
            () -> getAppListFromDB(),
            () -> getUserFromDB());
    List<List<String>> lists = tasks
            // Supply all the tasks for execution and collect CompletableFutures
            .map(CompletableFuture::supplyAsync).collect(Collectors.toList())
            // Join all the CompletableFutures to gather the results
            .stream()
            .map(CompletableFuture::join).collect(Collectors.toList());
    System.out.println(lists);
}

private static List<String> getUserFromDB() {
    try {
        TimeUnit.SECONDS.sleep((long) (Math.random() * 3));
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println(Thread.currentThread().getName() + " getUser");
    return Arrays.asList("User1", "User2", "User3");
}

private static List<String> getAppListFromDB() {
    try {
        TimeUnit.SECONDS.sleep((long) (Math.random() * 3));
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println(Thread.currentThread().getName() + " getAppList");
    return Arrays.asList("App1", "App2", "App3");
}

private static List<String> getServerListFromDB() {
    try {
        TimeUnit.SECONDS.sleep((long) (Math.random() * 3));
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println(Thread.currentThread().getName() + " getServer");
    return Arrays.asList("Server1", "Server2", "Server3");
}
The output is:
ForkJoinPool.commonPool-worker-1 getServer
ForkJoinPool.commonPool-worker-3 getUser
ForkJoinPool.commonPool-worker-2 getAppList
[[Server1, Server2, Server3], [App1, App2, App3], [User1, User2, User3]]
You can see that the default ForkJoinPool.commonPool is used and that each get* method is executed on a separate thread in that pool. You just need to run your real SQL queries inside these get* methods.
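For illustration, here is a hypothetical version of one of those get* methods that runs a real query over plain JDBC instead of sleeping; the connection URL, credentials, and table/column names are placeholders, and java.sql / java.util imports are assumed:

// Hypothetical replacement for getUserFromDB() that executes a real query.
private static List<String> getUserFromDB() {
    List<String> users = new ArrayList<>();
    String url = "jdbc:mysql://localhost:3306/mydb";   // placeholder
    try (Connection conn = DriverManager.getConnection(url, "user", "password");
         PreparedStatement ps = conn.prepareStatement("SELECT name FROM users");
         ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            users.add(rs.getString("name"));
        }
    } catch (SQLException e) {
        throw new RuntimeException(e);
    }
    return users;
}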
The solution from Alex won't work in the expected way. CompletableFuture executes effects one after another; you will have to use parallel streams for this, or combine CompletableFuture with parallelStream.
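For completeness, a minimal sketch of the parallel-stream variant that comment suggests, reusing the get*FromDB helpers from the answer above (note that the degree of parallelism still depends on the size of the common ForkJoinPool):

// Run the three query methods via a parallel stream; each Supplier.get()
// call is executed on a common ForkJoinPool worker.
List<Supplier<List<String>>> tasks = Arrays.asList(
        () -> getServerListFromDB(),
        () -> getAppListFromDB(),
        () -> getUserFromDB());

List<List<String>> lists = tasks.parallelStream()
        .map(Supplier::get)
        .collect(Collectors.toList());

System.out.println(lists);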

MongoDB C++ driver handling replica set connection failures

So the MongoDB C++ driver documentation says:
On a failover situation, expect at least one operation to return an error (throw an exception) before the failover is complete. Operations are not retried.
Kind of annoying, but that leaves it up to me to handle a failed operation. Ideally I would just like the application to sleep for a few seconds (the app is single threaded) and retry, in the hope that a new primary mongod has been established by then. In the case of a second failure, I take it the connection is truly broken and I just want to throw an exception.
Within my MongodbManager class this means all operations have this kind of double try/catch block set up. I was wondering if there is a more elegant solution?
Example method:
template <typename T>
std::string
MongoManager::insert(std::string ns, T object)
{
    mongo::BSONObj oo = convertToBson(object);
    std::string result;
    try {
        connection_->insert(ns, oo); // connection_ = shared_ptr<DBClientReplicaSet>
        result = connection_->getLastError();
        lastOpSucceeded_ = true;
    }
    catch (mongo::SocketException& ex)
    {
        lastOpSucceeded_ = false;
        boost::this_thread::sleep( boost::posix_time::seconds(5) );
    }

    // try again?
    if (!lastOpSucceeded_) {
        try {
            connection_->insert(ns, oo);
            result = connection_->getLastError();
            lastOpSucceeded_ = true;
        }
        catch (mongo::SocketException& ex)
        {
            // do some clean up, throw exception
        }
    }
    return result;
}
That's indeed roughly how you need to handle it. Perhaps, instead of having two try/catch blocks, I would use the following strategy:
keep a count of how many times you have tried;
create a while loop with the terminating condition (count < 5 && !lastOpSucceeded);
and then sleep with pow(2, count) so you sleep longer on every iteration.
And then, when all else fails, bail out. A sketch of that loop is given below.
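A minimal sketch of that retry loop, written here in Java for brevity rather than against the C++ driver API above; insertOnce() and RetryableException are hypothetical stand-ins for connection_->insert(...) and mongo::SocketException:

// Retry with exponential backoff: up to 5 attempts, sleeping 2, 4, 8, ... seconds.
String insertWithRetry(String ns, Object document) throws InterruptedException {
    int count = 0;
    boolean lastOpSucceeded = false;
    String result = null;
    while (count < 5 && !lastOpSucceeded) {
        try {
            result = insertOnce(ns, document);   // one attempt against the replica set
            lastOpSucceeded = true;
        } catch (RetryableException ex) {
            count++;
            // back off longer on every iteration
            Thread.sleep((long) Math.pow(2, count) * 1000L);
        }
    }
    if (!lastOpSucceeded) {
        throw new RuntimeException("Giving up after " + count + " attempts");
    }
    return result;
}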