Why is writing to Bigquery using Dataflow EXTREMELY slow?

Why is writing to Bigquery using Dataflow EXTREMELY slow? - google-cloud-platform

I can stream inserts directly into BigQuery at a speed of about 10,000 inserts per second but when I try to insert using Dataflow the 'ToBqRow' step (given below) is EXTREMELY slow. Barely 50 rows per 10 minutes and this is with 4 Workers. Any idea why? Here's the relevant code:
PCollection<Status> statuses = p
.apply("GetTweets", PubsubIO.readStrings().fromTopic(topic))
.apply("ExtractData", ParDo.of(new DoFn<String, Status>() {
#ProcessElement
public void processElement(DoFn<String, Status>.ProcessContext c) throws Exception {
String rowJson = c.element();
try {
TweetsWriter.LOGGER.debug("ROWJSON = " + rowJson);
Status status = TwitterObjectFactory.createStatus(rowJson);
if (status == null) {
TweetsWriter.LOGGER.error("Status is null");
} else {
TweetsWriter.LOGGER.debug("Status value: " + status.getText());
}
c.output(status);
TweetsWriter.LOGGER.debug("Status: " + status.getId());
} catch (Exception var4) {
TweetsWriter.LOGGER.error("Status creation from JSON failed: " + var4.getMessage());
}
}
}));
statuses
.apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = new TableRow();
Status status = c.element();
row.set("Id", status.getId());
row.set("Text", status.getText());
row.set("RetweetCount", status.getRetweetCount());
row.set("FavoriteCount", status.getFavoriteCount());
row.set("Language", status.getLang());
row.set("ReceivedAt", (Object)null);
row.set("UserId", status.getUser().getId());
row.set("CountryCode", status.getPlace().getCountryCode());
row.set("Country", status.getPlace().getCountry());
c.output(row);
}
}))
.apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)
.withSchema(schema)
.withMethod(Method.STREAMING_INSERTS)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));
p.run();

Turns out Bigquery under Dataflow is NOT slow. Problem was, 'status.getPlace().getCountryCode()' was returning NULL so it was throwing NullPointerException that I couldn't see anywhere in the log! Clearly, Dataflow logging needs to improve. It's running really well now. As soon as message comes in the topic, almost instantaneously it gets written to BigQuery!

Related

How to enter data from Spring Boot Application into Amazon Kinesis?

I want to add data into kinesis using Sprint Boot Application and React. I am a complete beginner when it comes to Kinesis, AWS, etc. so a beginner friendly guide would be appriciated.

To add data records into an Amazon Kinesis data stream from a Spring BOOT app, you can use the AWS SDK for Java V2 and specifically the Amazon Kinesis Java API. You can use the software.amazon.awssdk.services.kinesis.KinesisClient.
Because you are a beginner, I recommend that you read the AWS SDK Java V2 Developer Guide to become familiar with how to work with this Java API. See Developer guide - AWS SDK for Java 2.x.
Here is a code example that shows you how to add data records using this Service Client. See Github that has the other required classes here.
package com.example.kinesis;
//snippet-start:[kinesis.java2.putrecord.import]
import software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;
import software.amazon.awssdk.services.kinesis.model.KinesisException;
import software.amazon.awssdk.services.kinesis.model.DescribeStreamRequest;
import software.amazon.awssdk.services.kinesis.model.DescribeStreamResponse;
//snippet-end:[kinesis.java2.putrecord.import]
/**
* Before running this Java V2 code example, set up your development environment, including your credentials.
*
* For more information, see the following documentation topic:
*
* https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
*/
public class StockTradesWriter {
public static void main(String[] args) {
final String usage = "\n" +
"Usage:\n" +
" <streamName>\n\n" +
"Where:\n" +
" streamName - The Amazon Kinesis data stream to which records are written (for example, StockTradeStream)\n\n";
if (args.length != 1) {
System.out.println(usage);
System.exit(1);
}
String streamName = args[0];
Region region = Region.US_EAST_1;
KinesisClient kinesisClient = KinesisClient.builder()
.region(region)
.credentialsProvider(ProfileCredentialsProvider.create())
.build();
// Ensure that the Kinesis Stream is valid.
validateStream(kinesisClient, streamName);
setStockData( kinesisClient, streamName);
kinesisClient.close();
}
// snippet-start:[kinesis.java2.putrecord.main]
public static void setStockData( KinesisClient kinesisClient, String streamName) {
try {
// Repeatedly send stock trades with a 100 milliseconds wait in between
StockTradeGenerator stockTradeGenerator = new StockTradeGenerator();
// Put in 50 Records for this example
int index = 50;
for (int x=0; x<index; x++){
StockTrade trade = stockTradeGenerator.getRandomTrade();
sendStockTrade(trade, kinesisClient, streamName);
Thread.sleep(100);
}
} catch (KinesisException | InterruptedException e) {
System.err.println(e.getMessage());
System.exit(1);
}
System.out.println("Done");
}
private static void sendStockTrade(StockTrade trade, KinesisClient kinesisClient,
String streamName) {
byte[] bytes = trade.toJsonAsBytes();
// The bytes could be null if there is an issue with the JSON serialization by the Jackson JSON library.
if (bytes == null) {
System.out.println("Could not get JSON bytes for stock trade");
return;
}
System.out.println("Putting trade: " + trade);
PutRecordRequest request = PutRecordRequest.builder()
.partitionKey(trade.getTickerSymbol()) // We use the ticker symbol as the partition key, explained in the Supplemental Information section below.
.streamName(streamName)
.data(SdkBytes.fromByteArray(bytes))
.build();
try {
kinesisClient.putRecord(request);
} catch (KinesisException e) {
e.getMessage();
}
}
private static void validateStream(KinesisClient kinesisClient, String streamName) {
try {
DescribeStreamRequest describeStreamRequest = DescribeStreamRequest.builder()
.streamName(streamName)
.build();
DescribeStreamResponse describeStreamResponse = kinesisClient.describeStream(describeStreamRequest);
if(!describeStreamResponse.streamDescription().streamStatus().toString().equals("ACTIVE")) {
System.err.println("Stream " + streamName + " is not active. Please wait a few moments and try again.");
System.exit(1);
}
}catch (KinesisException e) {
System.err.println("Error found while describing the stream " + streamName);
System.err.println(e);
System.exit(1);
}
}
// snippet-end:[kinesis.java2.putrecord.main]
}

Unit testing suspend coroutine

a bit new to Kotlin and testing it... I am trying to test a dao object wrapper with using a suspend method which uses an awaitFirst() for an SQL return object. However, when I wrote the unit test for it, it is just stuck in a loop. And I would think it is due to the awaitFirst() is not in the same scope of the testing
Implementation:
suspend fun queryExecution(querySpec: DatabaseClient.GenericExecuteSpec): OrderDomain {
var result: Map<String, Any>?
try {
result = querySpec.fetch().first().awaitFirst()
} catch (e: Exception) {
if (e is DataAccessResourceFailureException)
throw CommunicationException(
"Cannot connect to " + DatabaseConstants.DB_NAME +
DatabaseConstants.ORDERS_TABLE + " when executing querySelect",
"querySelect",
e
)
throw InternalException("Encountered R2dbcException when executing SQL querySelect", e)
}
if (result == null)
throw ResourceNotFoundException("Resource not found in Aurora DB")
try {
return OrderDomain(result)
} catch (e: Exception) {
throw InternalException("Exception when parsing to OrderDomain entity", e)
} finally {
logger.info("querySelect;stage=end")
}
}
Unit Test:
#Test
fun `get by orderid id, null`() = runBlocking {
// Assign
Mockito.`when`(fetchSpecMock.first()).thenReturn(monoMapMock)
Mockito.`when`(monoMapMock.awaitFirst()).thenReturn(null)
// Act & Assert
val exception = assertThrows<ResourceNotFoundException> {
auroraClientWrapper.queryExecution(
databaseClient.sql("SELECT * FROM orderTable WHERE orderId=:1").bind("1", "123") orderId
)
}
assertEquals("Resource not found in Aurora DB", exception.message)
}
I noticed this issue on https://github.com/Kotlin/kotlinx.coroutines/issues/1204 but none of the work around has worked for me...
Using runBlocking within Unit Test just causes my tests to never complete. Using runBlockingTest explicitly throws an error saying "Job never completed"... Anyone has any idea? Any hack at this point?
Also I fairly understand the point of you should not be using suspend with a block because that kinda defeats the purposes of suspend since it is releasing the thread to continue later versus blocking forces the thread to wait for a result... But then how does this work?
private suspend fun queryExecution(querySpec: DatabaseClient.GenericExecuteSpec): Map {
var result: Map<String, Any>?
try {
result = withContext(Dispatchers.Default) {
querySpec.fetch().first().block()
}
return result
}
Does this mean withContext will utilize a new thread, and re-use the old thread elsewhere? Which then doesnt really optimize anything since I will still have one thread that is being blocked regardless of spawning a new context?

Found the solution.
The monoMapMock is a mock value from Mockito. Seems like the kotlinx-test coroutines can't intercept an async to return a mono. So I forced the method that I can mock, to return a real Mono value instead of a Mocked Mono. To do so, as suggested by Louis. I stop mocking it and return a real value
#Test
fun `get by orderid id, null`() = runBlocking {
// Assign
Mockito.`when`(fetchSpecMock.first()).thenReturn(Mono.empty())
Mockito.`when`(monoMapMock.awaitFirst()).thenReturn(null)
// Act & Assert
val exception = assertThrows<ResourceNotFoundException> {
auroraClientWrapper.queryExecution(
databaseClient.sql("SELECT * FROM orderTable WHERE orderId=:1").bind("1", "123") orderId
)
}
assertEquals("Resource not found in Aurora DB", exception.message)
}

Apache Beam: Why does it write to Spanner twice on REPORT_FAILURES mode?

I found interesting write operation codes while looking at SpannerIO, and want to understand reasons.
On write(WriteToSpannerFn) and REPORT_FAILURES failure mode, it seems trying to write failed mutations twice.
I think it's for logging each mutation's exceptions. Is it a correct assumption, and is there any workaround?
Below, I removed some lines for simplicity.
public void processElement(ProcessContext c) {
Iterable<MutationGroup> mutations = c.element();
boolean tryIndividual = false;
try {
Iterable<Mutation> batch = Iterables.concat(mutations);
spannerAccessor.getDatabaseClient().writeAtLeastOnce(batch);
} catch (SpannerException e) {
if (failureMode == FailureMode.REPORT_FAILURES) {
tryIndividual = true;
} else {
...
}
}
if (tryIndividual) {
for (MutationGroup mg : mutations) {
try {
spannerAccessor.getDatabaseClient().writeAtLeastOnce(mg);
} catch (SpannerException e) {
LOG.warn("Failed to submit the mutation group", e);
c.output(failedTag, mg);
}
}
}
}

So rather than write each Mutation individually to the database, the SpannerIO.write() connector tries to write a batch of Mutations in a single transaction for efficiency.
If just one of these Mutations in the batch fails, then the whole transaction fails, so in REPORT_FAILURES mode, the mutations are re-tried individually to find which Mutation(s) are the problematic ones...

Calling a Web Service (containg multiple pages) does not load all the pages (without an added sleep delay)

My question is about a strange behavious I notice both on my iPhone device and the codenameone simulator (NetBeans).
I invoke the following code below which calls a google web service to provide a list of food places around a GPS coordinate:
The web service that is called is as follows (KEY OBSCURED):
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=40.714353,-74.00597299999998&radius=200&types=food&key=XXXXXXXXXXXXXXXXXXXXXXX
Each result contains the next page token and thus, the second call (for the subsequent page) is as follows:
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=40.714353,-74.00597299999998&radius=200&types=food&key=XXXXXXXXXXXXXXXXXXXXXXX&pagetoken=YYYYYYYYYYYYYYYYYY
public static byte[] getWSResponseData(String urlString, boolean usePost)
{
ConnectionRequest r = new ConnectionRequest();
r.setUrl(urlString);
r.setPost(usePost);
InfiniteProgress prog = new InfiniteProgress();
Dialog dlg = prog.showInifiniteBlocking();
r.setDisposeOnCompletion(dlg);
NetworkManager.getInstance().addToQueueAndWait(r);
try
{
Thread.sleep(2000);
}
catch (InterruptedException ex)
{
}
byte[] responseData = r.getResponseData();
return responseData;
}
public static void getLocationsList(double lat, double lng)
{
boolean done = false;
while (!done)
{
byte[] responseData = getWSResponseData(finalURL,false);
result = Result.fromContent(parser.parseJSON(new InputStreamReader(new ByteArrayInputStream(responseData))));
String venueNames[] = result.getAsStringArray("/results/name");
nextToken = result.getAsString("/next_page_token");
if ( nextToken == null || nextToken.equals(""))
done = true;
else
finalURL = completeURL + "&pagetoken=" + nextToken;
}
.....
}
This code works fine with the sleep timer, but when I remove the Thread.sleep, only the first page gets called.
Any help would be appreciated.
Using the debugger does not help as this is a timing issue and the issue does not occur when using the debugger.
Also when I put some print statements into the code
while (!done)
{
String nextToken = null;
**System.out.println(finalURL);**
...
}
System.out.println("Total Number of entries returned: " + itemCount);
I get the following output:
First Run (WITHOUT SLEEP):
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=40.714353,-74.00597299999998&radius=200&types=food&key=XXXXXXXX
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=40.714353,-74.00597299999998&radius=200&types=food&key=XXXXXXXX&pagetoken=CqQCF...
Total Number of entries returned: 20
Using the network monitor I see that the response to the second WS call returns:
{
"html_attributions" : [],
"results" : [],
"status" : "INVALID_REQUEST"
}
Which is strange as when I cut and paste the WS URL into my browser, it works fine...
Second Run (WITH SLEEP):
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=40.714353,-74.00597299999998&radius=200&types=food&key=XXXXXXXXX
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=40.714353,-74.00597299999998&radius=200&types=food&key=XXXXXXXXX&pagetoken=CqQCFQEAA...
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=40.714353,-74.00597299999998&radius=200&types=food&key=XXXXXXXXX&pagetoken=CsQDtQEAA...
Total Number of entries returned: 60

Well it seems to be a google API issue as indicated here:
Paging on Google Places API returns status INVALID_REQUEST
I still could not get it to work by changing the WS URL with a random parameter as they suggested, but I will keep trying and post something here if I get it to work. For now I will just keep a 2 second delay between the calls which seems to work.

Well gave up on using the google WS for this and switched to Yelp, works very well:
https://api.yelp.com/v3/businesses/search?.....

How to do multiple parallel readers for data export using Google Spanner?

External Backups/Snapshots for Google Cloud Spanner recommends to use queries with timestamp bounds to create snapshots for export. On the bottom of the Timestamp Bounds documentation it states:
Cloud Spanner continuously garbage collects deleted and overwritten data in the background to reclaim storage space. This process is known as version GC. By default, version GC reclaims versions after they are one hour old. Because of this, Cloud Spanner cannot perform reads at a read timestamp more than one hour in the past.
So any export would need to complete within an hour. A single reader (i.e. select * from table; using timestamp X) would not be able to export the entire table within an hour.
How can multiple parallel readers be implemented in spanner?
Note: It is mentioned in one of the comments that support for Apache Beam is coming, but it looks like that uses a single reader:
/** A simplest read function implementation. Parallelism support is coming. */
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/NaiveSpannerReadFn.java#L26
Is there a way to do the parallel reader that beam requires today using exising APIs? Or will Beam need to use something that isn't released yet on google spanner?

It is possible to read data in parallel from Cloud Spanner with the BatchClient class. Follow read_data_in_parallel for more information.
If you are looking to export data from Cloud Spanner, I'd recommend you to use Cloud Dataflow (see the integration details here) as it provides higher level abstractions and takes care data processing details, like scaling and failure handling.

Edit 2018-03-30 - The example project has been updated to use the BatchClient offered by Google Cloud Spanner
After the release of the BatchClient for reading/downloading large amounts of data, the example project below has been updated to use the new batch client instead of the standard database client. The basic idea behind the project is still the same: Copy data to/from Cloud Spanner and any other database using standard jdbc functionality. The following code snippet sets the jdbc connection in batch read mode:
if (source.isWrapperFor(ICloudSpannerConnection.class))
{
ICloudSpannerConnection con = source.unwrap(ICloudSpannerConnection.class);
// Make sure no transaction is running
if (!con.isBatchReadOnly())
{
if (con.getAutoCommit())
{
con.setAutoCommit(false);
}
else
{
con.commit();
}
con.setBatchReadOnly(true);
}
}
When the connection is in 'batch read only mode', the connection will use the BatchClient of Google Cloud Spanner instead of the standard database client. When one of the Statement#execute(String) or PreparedStatement#execute() methods are called (as these allow multiple result sets to be returned) the jdbc driver will create a partitioned query instead of a normal query. The results of this partitioned query will be a number of result sets (one per partition) that can be fetched by the Statement#getResultSet() and Statement#getMoreResults(int) methods.
Statement statement = source.createStatement();
boolean hasResults = statement.execute(select);
int workerNumber = 0;
while (hasResults)
{
ResultSet rs = statement.getResultSet();
PartitionWorker worker = new PartitionWorker("PartionWorker-" + workerNumber, config, rs, tableSpec, table, insertCols);
workers.add(worker);
hasResults = statement.getMoreResults(Statement.KEEP_CURRENT_RESULT);
workerNumber++;
}
The result sets that are returned by the Statement#execute(String) are not executed directly, but only after the first call to ResultSet#next(). Passing these result sets to separate worker threads ensures parallel download and copying of the data.
Original answer:
This project was initially created for conversion in the other direction (from a local database to Cloud Spanner), but as it uses JDBC for both source and destination it can also be used the other way around: Converting a Cloud Spanner database to a local PostgreSQL database. Large tables are converted in parallel using a thread pool.
The project uses this open source JDBC driver instead of the JDBC driver supplied by Google. The source Cloud Spanner JDBC connection is set to read-only mode and autocommit=false. This ensures that the connection automatically creates a read-only transaction using the current time as timestamp the first time you execute a query. All subsequent queries within the same (read-only) transaction will use the same timestamp giving you a consistent snapshot of your Google Cloud Spanner database.
It works as follows:
Set the source database to read-only transactional mode.
The convert(String catalog, String schema) method iterates over all
tables in the source database (Cloud Spanner)
For each table the number of records is determined, and depending on the size of the table, the table is copied using either the main thread of the application or by a worker pool.
The class UploadWorker is responsible for the parallel copying. Each worker is assigned a range of records from the table (for example rows 1 to 2,400). The range is selected by a select statement in this format: 'SELECT * FROM $TABLE ORDER BY $PK_COLUMNS LIMIT $BATCH_SIZE OFFSET $CURRENT_OFFSET'
Commit the read-only transaction on the source database after ALL tables have been converted.
Below is a code snippet of the most important parts.
public void convert(String catalog, String schema) throws SQLException
{
int batchSize = config.getBatchSize();
destination.setAutoCommit(false);
// Set the source connection to transaction mode (no autocommit) and read-only
source.setAutoCommit(false);
source.setReadOnly(true);
try (ResultSet tables = destination.getMetaData().getTables(catalog, schema, null, new String[] { "TABLE" }))
{
while (tables.next())
{
String tableSchema = tables.getString("TABLE_SCHEM");
if (!config.getDestinationDatabaseType().isSystemSchema(tableSchema))
{
String table = tables.getString("TABLE_NAME");
// Check whether the destination table is empty.
int destinationRecordCount = getDestinationRecordCount(table);
if (destinationRecordCount == 0 || config.getDataConvertMode() == ConvertMode.DropAndRecreate)
{
if (destinationRecordCount > 0)
{
deleteAll(table);
}
int sourceRecordCount = getSourceRecordCount(getTableSpec(catalog, tableSchema, table));
if (sourceRecordCount > batchSize)
{
convertTableWithWorkers(catalog, tableSchema, table);
}
else
{
convertTable(catalog, tableSchema, table);
}
}
else
{
if (config.getDataConvertMode() == ConvertMode.ThrowExceptionIfExists)
throw new IllegalStateException("Table " + table + " is not empty");
else if (config.getDataConvertMode() == ConvertMode.SkipExisting)
log.info("Skipping data copy for table " + table);
}
}
}
}
source.commit();
}
private void convertTableWithWorkers(String catalog, String schema, String table) throws SQLException
{
String tableSpec = getTableSpec(catalog, schema, table);
Columns insertCols = getColumns(catalog, schema, table, false);
Columns selectCols = getColumns(catalog, schema, table, true);
if (insertCols.primaryKeyCols.isEmpty())
{
log.warning("Table " + tableSpec + " does not have a primary key. No data will be copied.");
return;
}
log.info("About to copy data from table " + tableSpec);
int batchSize = config.getBatchSize();
int totalRecordCount = getSourceRecordCount(tableSpec);
int numberOfWorkers = calculateNumberOfWorkers(totalRecordCount);
int numberOfRecordsPerWorker = totalRecordCount / numberOfWorkers;
if (totalRecordCount % numberOfWorkers > 0)
numberOfRecordsPerWorker++;
int currentOffset = 0;
ExecutorService service = Executors.newFixedThreadPool(numberOfWorkers);
for (int workerNumber = 0; workerNumber < numberOfWorkers; workerNumber++)
{
int workerRecordCount = Math.min(numberOfRecordsPerWorker, totalRecordCount - currentOffset);
UploadWorker worker = new UploadWorker("UploadWorker-" + workerNumber, selectFormat, tableSpec, table,
insertCols, selectCols, currentOffset, workerRecordCount, batchSize, source,
config.getUrlDestination(), config.isUseJdbcBatching());
service.submit(worker);
currentOffset = currentOffset + numberOfRecordsPerWorker;
}
service.shutdown();
try
{
service.awaitTermination(config.getUploadWorkerMaxWaitInMinutes(), TimeUnit.MINUTES);
}
catch (InterruptedException e)
{
log.severe("Error while waiting for workers to finish: " + e.getMessage());
throw new RuntimeException(e);
}
}
public class UploadWorker implements Runnable
{
private static final Logger log = Logger.getLogger(UploadWorker.class.getName());
private final String name;
private String selectFormat;
private String sourceTable;
private String destinationTable;
private Columns insertCols;
private Columns selectCols;
private int beginOffset;
private int numberOfRecordsToCopy;
private int batchSize;
private Connection source;
private String urlDestination;
private boolean useJdbcBatching;
UploadWorker(String name, String selectFormat, String sourceTable, String destinationTable, Columns insertCols,
Columns selectCols, int beginOffset, int numberOfRecordsToCopy, int batchSize, Connection source,
String urlDestination, boolean useJdbcBatching)
{
this.name = name;
this.selectFormat = selectFormat;
this.sourceTable = sourceTable;
this.destinationTable = destinationTable;
this.insertCols = insertCols;
this.selectCols = selectCols;
this.beginOffset = beginOffset;
this.numberOfRecordsToCopy = numberOfRecordsToCopy;
this.batchSize = batchSize;
this.source = source;
this.urlDestination = urlDestination;
this.useJdbcBatching = useJdbcBatching;
}
#Override
public void run()
{
// Connection source = DriverManager.getConnection(urlSource);
try (Connection destination = DriverManager.getConnection(urlDestination))
{
log.info(name + ": " + sourceTable + ": Starting copying " + numberOfRecordsToCopy + " records");
destination.setAutoCommit(false);
String sql = "INSERT INTO " + destinationTable + " (" + insertCols.getColumnNames() + ") VALUES \n";
sql = sql + "(" + insertCols.getColumnParameters() + ")";
PreparedStatement statement = destination.prepareStatement(sql);
int lastRecord = beginOffset + numberOfRecordsToCopy;
int recordCount = 0;
int currentOffset = beginOffset;
while (true)
{
int limit = Math.min(batchSize, lastRecord - currentOffset);
String select = selectFormat.replace("$COLUMNS", selectCols.getColumnNames());
select = select.replace("$TABLE", sourceTable);
select = select.replace("$PRIMARY_KEY", selectCols.getPrimaryKeyColumns());
select = select.replace("$BATCH_SIZE", String.valueOf(limit));
select = select.replace("$OFFSET", String.valueOf(currentOffset));
try (ResultSet rs = source.createStatement().executeQuery(select))
{
while (rs.next())
{
int index = 1;
for (Integer type : insertCols.columnTypes)
{
Object object = rs.getObject(index);
statement.setObject(index, object, type);
index++;
}
if (useJdbcBatching)
statement.addBatch();
else
statement.executeUpdate();
recordCount++;
}
if (useJdbcBatching)
statement.executeBatch();
}
destination.commit();
log.info(name + ": " + sourceTable + ": Records copied so far: " + recordCount + " of "
+ numberOfRecordsToCopy);
currentOffset = currentOffset + batchSize;
if (recordCount >= numberOfRecordsToCopy)
break;
}
}
catch (SQLException e)
{
log.severe("Error during data copy: " + e.getMessage());
throw new RuntimeException(e);
}
log.info(name + ": Finished copying");
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Why is writing to Bigquery using Dataflow EXTREMELY slow? - google-cloud-platform

Related

How to enter data from Spring Boot Application into Amazon Kinesis?

Unit testing suspend coroutine

Apache Beam: Why does it write to Spanner twice on REPORT_FAILURES mode?

Calling a Web Service (containg multiple pages) does not load all the pages (without an added sleep delay)

How to do multiple parallel readers for data export using Google Spanner?

Categories

Resources