I am trying to add a Spanner connection within an Apache Beam ParDo(DoFn). I need to look up some rows as part of the ParDo. The Dataflow job creates a number of workers (usually 4 max) and I use the startBundle and finishBundle methods to open and close the Spanner connection for the worker's lifetime. Then within the processElement method I perform the lookup for each item, using the DatabaseClient and a singleUseReadOnlyTransaction.
I should add that this is running as a Dataflow job on GCP.
Some code to illustrate this:
private static CustomDoFn<String, TransactionImport> processRow = new CustomDoFn<String, TransactionImport>(){
private static final long serialVersionUID = 1L;
private Spanner spanner = null;
private DatabaseClient dbClient = null;
@StartBundle
public void startBundle(StartBundleContext c){
TransactionFileOptions options = c.getPipelineOptions().as(TransactionFileOptions.class);
com.google.cloud.spanner.SpannerOptions spannerOptions = com.google.cloud.spanner.SpannerOptions.newBuilder().build();
spanner = spannerOptions.getService();
String spannerProjectID = options.getSpannerProjectId();
String spannerInstanceID = options.getSpannerInstanceId();
String spannerDatabaseID = options.getSpannerDatabaseId();
DatabaseId db = DatabaseId.of(spannerProjectID, spannerInstanceID, spannerDatabaseID);
dbClient = spanner.getDatabaseClient(db);
}
@FinishBundle
public void finishBundle(FinishBundleContext c){
spanner.close();
}
@ProcessElement
public void processElement(DoFn<String, TransactionImport>.ProcessContext c) throws Exception {
TransactionImport transactionImport = new TransactionImport();
Statement statement = Statement.newBuilder("SELECT * FROM Table1 WHERE Name= @Name")
.bind("Name").to(c.element())
.build();
ResultSet resultSet = dbClient.singleUseReadOnlyTransaction().executeQuery(statement);
// set some value on transactionImport dependent on retrieved value
c.output(transactionImport);
}
};
This always results in the Dataflow job not completing, and when I check the log I see:
Processing stuck in step Process Rows for at least 05m00s without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:458)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:924)
at com.google.common.util.concurrent.Uninterruptibles.takeUninterruptibly(Uninterruptibles.java:233)
at com.google.cloud.spanner.SessionPool$Waiter.take(SessionPool.java:411)
at com.google.cloud.spanner.SessionPool$Waiter.access$3300(SessionPool.java:399)
at com.google.cloud.spanner.SessionPool.getReadSession(SessionPool.java:754)
at com.google.cloud.spanner.DatabaseClientImpl.singleUseReadOnlyTransaction(DatabaseClientImpl.java:52)
at com.mycompany.pt.SpannerDataAccess.getBinDetails(SpannerDataAccess.java:197)
at com.mycompany.pt.transactionFiles.TransactionFileDataflow$1.processLine(TransactionFileDataflow.java:411)
at com.mycompany.pt.transactionFiles.TransactionFileDataflow$1.processElement(TransactionFileDataflow.java:336)
at com.mycompany.pt.transactionFiles.TransactionFileDataflow$1$DoFnInvoker.invokeProcessElement(Unknown Source)
Does anyone have any experience using Spanner like this within a ParDo?
I'm not a spanner expert, but maybe I can help:
You should use @Setup/@Teardown to connect to and disconnect from Spanner. @StartBundle/@FinishBundle get called multiple times over the lifetime of a worker. See here for more details: https://beam.apache.org/documentation/execution-model/#bundling-and-persistence
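For illustration, a minimal sketch of that structure, assuming the TransactionFileOptions and TransactionImport types from your code (the class name and the way results are mapped are made up):
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.spanner.Statement;
import org.apache.beam.sdk.transforms.DoFn;

public class SpannerLookupFn extends DoFn<String, TransactionImport> {

    private final String projectId;
    private final String instanceId;
    private final String databaseId;

    // Not serializable, so create these in @Setup rather than in the constructor.
    private transient Spanner spanner;
    private transient DatabaseClient dbClient;

    public SpannerLookupFn(String projectId, String instanceId, String databaseId) {
        this.projectId = projectId;
        this.instanceId = instanceId;
        this.databaseId = databaseId;
    }

    @Setup
    public void setup() {
        // Runs once per DoFn instance (worker lifetime), not once per bundle.
        spanner = SpannerOptions.newBuilder().build().getService();
        dbClient = spanner.getDatabaseClient(DatabaseId.of(projectId, instanceId, databaseId));
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        Statement statement = Statement.newBuilder("SELECT * FROM Table1 WHERE Name = @Name")
            .bind("Name").to(c.element())
            .build();
        // try-with-resources closes the ResultSet so the session is released back to the pool.
        try (ResultSet resultSet = dbClient.singleUse().executeQuery(statement)) {
            TransactionImport result = new TransactionImport();
            while (resultSet.next()) {
                // set values on result from resultSet
            }
            c.output(result);
        }
    }

    @Teardown
    public void teardown() {
        // Runs once when the DoFn instance is discarded.
        if (spanner != null) {
            spanner.close();
        }
    }
}
You would then construct it from your options when building the pipeline, e.g. ParDo.of(new SpannerLookupFn(options.getSpannerProjectId(), options.getSpannerInstanceId(), options.getSpannerDatabaseId())).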
Does your processElement method ever emit an element using c.output(...)? If not, Beam will think your pipeline is stuck.
Related
At the moment I am going through the GCP docs trying to figure out the optimal/fastest way to publish data from BigQuery to Pub/Sub (using Python). What I am doing so far (in a simplified way) is:
MESSAGE_SIZE_IN_BYTES = 500
MAX_BATCH_MESSAGES = 20
MAX_BYTES_BATCH = MESSAGE_SIZE_IN_BYTES * MAX_BATCH_MESSAGES
BATCH_MAX_LATENCY_IN_10MS = 0.01
MAX_FLOW_MESSAGES = 20
MAX_FLOW_BYTES = MESSAGE_SIZE_IN_BYTES * MAX_FLOW_MESSAGES
batch_settings = pubsub_v1.types.BatchSettings(
max_messages=MAX_BATCH_MESSAGES,
max_bytes=MAX_BYTES_BATCH,
max_latency=BATCH_MAX_LATENCY_IN_10MS,
)
publisher_options = pubsub_v1.types.PublisherOptions(
flow_control=pubsub_v1.types.PublishFlowControl(
message_limit=MAX_FLOW_MESSAGES,
byte_limit=MAX_FLOW_BYTES,
limit_exceeded_behavior=pubsub_v1.types.LimitExceededBehavior.BLOCK,
),
)
pubsub_client = pubsub_v1.PublisherClient(credentials=credentials,
batch_settings=batch_settings,
publisher_options=publisher_options)
bigquery_client = ....
bq_query_job = bigquery_client.query(QUERY)
rows = bq_query_job.result()
for row in rows:
callback_obj = PubsubCallback(...)
json_data = json.dumps(row).encode("utf-8")
publish_future = pubsub_client.publish(topic_path, json_data)
publish_future.add_done_callback(callback_obj.callback)
publish_futures.append(publish_future)
So one message per row. I have been trying to tweak different params for the Pub/Sub publisher client etc., but I cannot get further than 20/30 messages (rows) per second. Is there a way to get data from BigQuery into Pub/Sub faster (at least 1000 times faster than now)?
We also have a need to get data from BigQuery into Pub/Sub, and we do so using Dataflow. I've just looked at one of the jobs we ran today and we loaded 3.4 million rows in about 5 minutes (so ~11,000 rows per second).
Our Dataflow jobs are written in Java, but you could write them in Python if you wish. Here is the code for the pipeline I described above:
package com.ourcompany.pipelines;
import com.google.api.services.bigquery.model.TableRow;
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* The {@code BigQueryEventReplayer} pipeline runs a supplied SQL query
* against BigQuery and sends the results one by one to Pub/Sub.
* The query MUST return a column named 'json'; it is this column
* (and ONLY this column) that will be sent onward. The column must be a String type
* and should be valid JSON.
*/
public class BigQueryEventReplayer {
private static final Logger logger = LoggerFactory.getLogger(BigQueryEventReplayer.class);
/**
* Options for the BigQueryEventReplayer. See descriptions for more info
*/
public interface Options extends PipelineOptions {
#Description("SQL query to be run."
+ "An SQL string literal which will be run 'as is'")
@Required
ValueProvider<String> getBigQuerySql();
void setBigQuerySql(ValueProvider<String> value);
#Description("The name of the topic which data should be published to. "
+ "The name should be in the format of projects/<project-id>/topics/<topic-name>.")
@Required
ValueProvider<String> getOutputTopic();
void setOutputTopic(ValueProvider<String> value);
#Description("The ID of the BigQuery dataset targeted by the event")
#Required
ValueProvider<String> getBigQueryTargetDataset();
void setBigQueryTargetDataset(ValueProvider<String> value);
#Description("The ID of the BigQuery table targeted by the event")
#Required
ValueProvider<String> getBigQueryTargetTable();
void setBigQueryTargetTable(ValueProvider<String> value);
#Description("The SourceSystem attribute of the event")
#Required
ValueProvider<String> getSourceSystem();
void setSourceSystem(ValueProvider<String> value);
}
/**
* Takes the data from the TableRow and prepares it for the PubSub, including
* adding attributes to ensure the payload is routed correctly.
*/
public static class MapQueryToPubsub extends DoFn<TableRow, PubsubMessage> {
private final ValueProvider<String> targetDataset;
private final ValueProvider<String> targetTable;
private final ValueProvider<String> sourceSystem;
MapQueryToPubsub(
ValueProvider<String> targetDataset,
ValueProvider<String> targetTable,
ValueProvider<String> sourceSystem) {
this.targetDataset = targetDataset;
this.targetTable = targetTable;
this.sourceSystem = sourceSystem;
}
/**
* Entry point of DoFn for Dataflow.
*/
@ProcessElement
public void processElement(ProcessContext c) {
TableRow row = c.element();
if (!row.containsKey("json")) {
logger.warn("table does not contain column named 'json'");
}
Map<String, String> attributes = new HashMap<>();
attributes.put("sourceSystem", sourceSystem.get());
attributes.put("targetDataset", targetDataset.get());
attributes.put("targetTable", targetTable.get());
String json = (String) row.get("json");
c.output(new PubsubMessage(json.getBytes(), attributes));
}
}
/**
* Run the pipeline. This is the entrypoint for running 'locally'
*/
public static void main(String[] args) {
// Parse the user options passed from the command-line
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
run(options);
}
/**
* Run the pipeline. This is the entrypoint that GCP will use
*/
public static PipelineResult run(Options options) {
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read from BigQuery query",
BigQueryIO.readTableRows().fromQuery(options.getBigQuerySql()).usingStandardSql().withoutValidation()
.withTemplateCompatibility())
.apply("Map data to PubsubMessage",
ParDo.of(
new MapQueryToPubsub(
options.getBigQueryTargetDataset(),
options.getBigQueryTargetTable(),
options.getSourceSystem()
)
)
)
.apply("Write message to PubSub", PubsubIO.writeMessages().to(options.getOutputTopic()));
return pipeline.run();
}
}
This pipeline requires that each row retrieved from BigQuery is a JSON document, something that can easily be achieved using TO_JSON_STRING.
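If it helps, here is a hypothetical way to invoke the pipeline that shows the shape of query it expects; the project, dataset, table and topic names are made up, and a real run would also need the Dataflow runner dependency on the classpath plus the usual --project/--region/--tempLocation flags:
package com.ourcompany.pipelines;

// Hypothetical launcher: the query returns a single column named 'json',
// built with TO_JSON_STRING, which is the only thing forwarded to Pub/Sub.
public class BigQueryEventReplayerExample {
    public static void main(String[] args) {
        String query =
            "SELECT TO_JSON_STRING(t) AS json FROM `my-project.my_dataset.my_table` AS t";
        BigQueryEventReplayer.main(new String[] {
            "--bigQuerySql=" + query,
            "--outputTopic=projects/my-project/topics/my-topic",
            "--bigQueryTargetDataset=my_dataset",
            "--bigQueryTargetTable=my_table",
            "--sourceSystem=example",
            "--runner=DataflowRunner"
        });
    }
}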
I know this might look rather daunting to some (it kinda does to me I admit) but it will get you the throughput that you require!
You can ignore this part:
Map<String, String> attributes = new HashMap<>();
attributes.put("sourceSystem", sourceSystem.get());
attributes.put("targetDataset", targetDataset.get());
attributes.put("targetTable", targetTable.get());
that's just some extra attributes we add to the pubsub message purely for our own use.
Use Pub/Sub Batch Messages. This allows your code to batch multiple messages into a single call to the Pub/Sub service.
Example code from Google's documentation:
from concurrent import futures
from google.cloud import pubsub_v1
# TODO(developer)
# project_id = "your-project-id"
# topic_id = "your-topic-id"
# Configure the batch to publish as soon as there are 10 messages
# or 1 KiB of data, or 1 second has passed.
batch_settings = pubsub_v1.types.BatchSettings(
max_messages=10, # default 100
max_bytes=1024, # default 1 MB
max_latency=1, # default 10 ms
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path(project_id, topic_id)
publish_futures = []
# Resolve the publish future in a separate thread.
def callback(future: pubsub_v1.publisher.futures.Future) -> None:
message_id = future.result()
print(message_id)
for n in range(1, 10):
data_str = f"Message number {n}"
# Data must be a bytestring
data = data_str.encode("utf-8")
publish_future = publisher.publish(topic_path, data)
# Non-blocking. Allow the publisher client to batch multiple messages.
publish_future.add_done_callback(callback)
publish_futures.append(publish_future)
futures.wait(publish_futures, return_when=futures.ALL_COMPLETED)
print(f"Published messages with batch settings to {topic_path}.")
We have a DynamoDB table with a counter attribute that will be decremented asynchronously by multiple Lambdas based on an event. I am trying to update the counter using UpdateItemEnhancedRequest (using the DynamoDB Enhanced Client, Java SDK 2). I am able to build the condition for updating the counter, but it updates the entire item and not just the counter. Can somebody please guide me on how to update a single attribute using the DynamoDB Enhanced Client?
Code Sample
public void update(String counter, T item) {
AttributeValue value = AttributeValue.builder().n(counter).build();
Map<String, AttributeValue> expressionValues = new HashMap<>();
expressionValues.put(":value", value);
Expression myExpression = Expression.builder()
.expression("nqctr = :value")
.expressionValues(expressionValues)
.build();
UpdateItemEnhancedRequest<T> updateItemEnhancedRequest =
UpdateItemEnhancedRequest.builder(collectionClassName)
.item(item)
.conditionExpression(myExpression)
.build();
getTable().updateItem(updateItemEnhancedRequest);
}
When you update a specific column, you need to specify which column to update. Assume we have a table named Work whose items have an id key and an archive column.
Now assume we want to update the archive column. You need to specify the column in your code. Here we change the archive column of the item that corresponds to the given key to Closed (a single-column update). Notice we specify the column name by using the HashMap object named updatedValues.
// Archives an item based on the key
public String archiveItem(String id){
DynamoDbClient ddb = getClient();
HashMap<String,AttributeValue> itemKey = new HashMap<String,AttributeValue>();
itemKey.put("id", AttributeValue.builder()
.s(id)
.build());
HashMap<String, AttributeValueUpdate> updatedValues =
new HashMap<String,AttributeValueUpdate>();
// Update the column specified by name with updatedVal
updatedValues.put("archive", AttributeValueUpdate.builder()
.value(AttributeValue.builder()
.s("Closed").build())
.action(AttributeAction.PUT)
.build());
UpdateItemRequest request = UpdateItemRequest.builder()
.tableName("Work")
.key(itemKey)
.attributeUpdates(updatedValues)
.build();
try {
ddb.updateItem(request);
return"The item was successfully archived";
NOTE: This is not the Enhanced Client.
This code is from the AWS tutorial that shows how to build a Java web app by using Spring Boot. Full tutorial here:
Creating the DynamoDB web application item tracker
To update a single column using the Enhanced Client, call the enhanced client's table method. This returns a DynamoDbTable instance. Now you can call the updateItem method.
Here is the logic to update the archive column using the Enhanced Client. Notice you get a Work object, call its setArchive method, and then pass the Work object back: workTable.updateItem(r->r.item(work));
Java code:
// Update the archive column by using the Enhanced Client.
public String archiveItemEC(String id) {
DynamoDbClient ddb = getClient();
try {
DynamoDbEnhancedClient enhancedClient = DynamoDbEnhancedClient.builder()
.dynamoDbClient(getClient())
.build();
DynamoDbTable<Work> workTable = enhancedClient.table("Work", TableSchema.fromBean(Work.class));
//Get the Key object.
Key key = Key.builder()
.partitionValue(id)
.build();
// Get the item by using the key.
Work work = workTable.getItem(r->r.key(key));
work.setArchive("Closed");
workTable.updateItem(r->r.item(work));
return"The item was successfully archived";
} catch (DynamoDbException e) {
System.err.println(e.getMessage());
System.exit(1);
}
return "";
}
This answer shows both ways to update a single column in a DynamoDB table. The tutorial linked above now shows this approach as well.
In your original solution, you're misinterpreting the meaning of the conditionExpression attribute. This is used to validate conditions that must be true on the item that matches the key in order to perform the update, not the expression to perform the update itself.
There is a way to perform this operation with the enhanced client without needing to fetch the object before making an update. The UpdateItemEnhancedRequest class has an ignoreNulls attribute that will exclude all null attributes from the update. This is false by default, which is what causes a full overwrite of the object.
Let's assume this is the structure of your item (without all the Enhanced Client annotations and boilerplate; you can add those):
class T {
public String partitionKey;
public Integer counter;
public String someOtherAttribute;
public T(String partitionKey) {
this.partitionKey = partitionKey;
this.counter = null;
this.someOtherAttribute = null;
}
}
You can issue an update of just the counter, and only if the item exists, like this:
public void update(Integer counter, String partitionKey) {
T updateItem = new T(partitionKey);
updateItem.counter = counter;
Expression itemExistsExpression = Expression.builder()
.expression("attribute_exists(partitionKey)")
.build();
UpdateItemEnhancedRequest<T> updateItemEnhancedRequest =
UpdateItemEnhancedRequest.builder(collectionClassName)
.item(updateItem)
.conditionExpression(itemExistsExpression)
.ignoreNulls(true)
.build();
getTable().updateItem(updateItemEnhancedRequest);
}
I have a small Dataflow job triggered from a Cloud Function using a Dataflow template. The job basically reads from a table in BigQuery, converts the resulting TableRow to a key-value pair, and writes that key-value pair to Datastore.
This is what my code looks like:
PCollection<TableRow> bigqueryResult = p.apply("BigQueryRead",
BigQueryIO.readTableRows().withTemplateCompatibility()
.fromQuery(options.getQuery()).usingStandardSql()
.withoutValidation());
bigqueryResult.apply("WriteFromBigqueryToDatastore", ParDo.of(new DoFn<TableRow, String>() {
@ProcessElement
public void processElement(ProcessContext pc) {
TableRow row = pc.element();
Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
KeyFactory keyFactoryCounts = datastore.newKeyFactory().setNamespace("MyNamespace")
.setKind("MyKind");
Key key = keyFactoryCounts.newKey("Key");
Builder builder = Entity.newBuilder(key);
builder.set("Key", BooleanValue.newBuilder("Value").setExcludeFromIndexes(true).build());
Entity entity= builder.build();
datastore.put(entity);
}
}));
This pipeline runs fine when the number of records I try to process is anywhere in the range of 1 to 100. However, when I try putting more load on the pipeline, i.e. ~10000 records, the pipeline does not scale (even though autoscaling is set to THROUGHPUT_BASED and maxNumWorkers is specified as high as 50 with an n1-standard-1 machine type). The job keeps processing 3 or 4 elements per second with one or two workers. This is impacting the performance of my system.
Any advice on how to scale up the performance is very welcome.
Thanks in advance.
Found a solution by using DatastoreIO instead of the Datastore client. Following is the snippet I used:
PCollection<TableRow> row = p.apply("BigQueryRead",
BigQueryIO.readTableRows().withTemplateCompatibility()
.fromQuery(options.getQueryForSegmentedUsers()).usingStandardSql()
.withoutValidation());
PCollection<com.google.datastore.v1.Entity> userEntity = row.apply("ConvertTablerowToEntity", ParDo.of(new DoFn<TableRow, com.google.datastore.v1.Entity>() {
@SuppressWarnings("deprecation")
@ProcessElement
public void processElement(ProcessContext pc) {
final String namespace = "MyNamespace";
final String kind = "MyKind";
com.google.datastore.v1.Key.Builder keyBuilder = DatastoreHelper.makeKey(kind, "root");
if (namespace != null) {
keyBuilder.getPartitionIdBuilder().setNamespaceId(namespace);
}
final com.google.datastore.v1.Key ancestorKey = keyBuilder.build();
TableRow row = pc.element();
String entityProperty = "sample";
String key = "key";
com.google.datastore.v1.Entity.Builder entityBuilder = com.google.datastore.v1.Entity.newBuilder();
com.google.datastore.v1.Key.Builder keyBuilder1 = DatastoreHelper.makeKey(ancestorKey, kind, key);
if (namespace != null) {
keyBuilder1.getPartitionIdBuilder().setNamespaceId(namespace);
}
entityBuilder.setKey(keyBuilder1.build());
entityBuilder.getMutableProperties().put(entityProperty, DatastoreHelper.makeValue("sampleValue").build());
pc.output(entityBuilder.build());
}
}));
userEntity.apply("WriteToDatastore", DatastoreIO.v1().write().withProjectId(options.getProject()));
This solution was able to scale from 3 elements per second with 1 worker to ~1500 elements per second with 20 workers.
At least with Python's ndb client library it's possible to write up to 500 entities at a time in a single .put_multi() Datastore call, which is a whole lot faster than calling .put() for one entity at a time (the calls block on the underlying RPCs).
I'm not a Java user, but a similar technique appears to be available for it as well. From Using batch operations:
You can use the batch operations if you want to operate on multiple entities in a single Cloud Datastore call.
Here is an example of a batch call:
Entity employee1 = new Entity("Employee");
Entity employee2 = new Entity("Employee");
Entity employee3 = new Entity("Employee");
// ...
List<Entity> employees = Arrays.asList(employee1, employee2, employee3);
datastore.put(employees);
Is there a way to get partitions of an Ignite node in C++?
I would like to parallelize scan queries over the partitions.
Something similar to this in Java:
ignite.compute(ignite.cluster().forDataNodes("myCache"))
.broadcast(new IgniteCallable<Void>() {
@IgniteInstanceResource
private Ignite ignite0;
@Override public Void call() throws Exception {
ClusterNode localNode = ignite0.cluster().localNode();
// get partitions
int[] parts = ignite0.affinity("myCache").primaryPartitions(localNode);
Arrays.stream(parts).parallel().forEach(p -> {
ScanQuery<Integer, Record> qry = new ScanQuery().setLocal(true).setPartition(p);
// query over the partition.
...
});
return null;
}
});
There is no Cluster API in Ignite C++ yet, though there is some Compute API. You can track ticket [1] for updates on the Cluster API.
[1] - https://issues.apache.org/jira/browse/IGNITE-5708
I'm researching a way to build an n-tiered sync solution. From the WebSharingAppDemo-CEProviderEndToEnd sample it seems almost feasible; however, for some reason the app will only sync if the client has a live SQL DB connection. Can someone explain what I'm missing and how to sync without exposing SQL to the internet?
The problem I'm experiencing is that when I provide a relational sync provider that has an open SQL connection from the client, it works fine; but when I provide a relational sync provider that has a closed but configured connection string, as in the example, I get an error from WCF stating that the server did not receive the batch file. So what am I doing wrong?
SqlConnectionStringBuilder builder = new SqlConnectionStringBuilder();
builder.DataSource = hostName;
builder.IntegratedSecurity = true;
builder.InitialCatalog = "mydbname";
builder.ConnectTimeout = 1;
provider.Connection = new SqlConnection(builder.ToString());
// provider.Connection.Open(); **** un-commenting this causes the code to work**
//create a new scope description and add the appropriate tables to this scope
DbSyncScopeDescription scopeDesc = new DbSyncScopeDescription(SyncUtils.ScopeName);
//class to be used to provision the scope defined above
SqlSyncScopeProvisioning serverConfig = new SqlSyncScopeProvisioning();
....
The error I get occurs in this part of the WCF code:
public SyncSessionStatistics ApplyChanges(ConflictResolutionPolicy resolutionPolicy, ChangeBatch sourceChanges, object changeData)
{
Log("ProcessChangeBatch: {0}", this.peerProvider.Connection.ConnectionString);
DbSyncContext dataRetriever = changeData as DbSyncContext;
if (dataRetriever != null && dataRetriever.IsDataBatched)
{
string remotePeerId = dataRetriever.MadeWithKnowledge.ReplicaId.ToString();
//Data is batched. The client should have uploaded this file to us prior to calling ApplyChanges.
//So look for it.
//The Id would be the DbSyncContext.BatchFileName which is just the batch file name without the complete path
string localBatchFileName = null;
if (!this.batchIdToFileMapper.TryGetValue(dataRetriever.BatchFileName, out localBatchFileName))
{
//Service has not received this file. Throw exception
throw new FaultException<WebSyncFaultException>(new WebSyncFaultException("No batch file uploaded for id " + dataRetriever.BatchFileName, null));
}
dataRetriever.BatchFileName = localBatchFileName;
}
Any ideas?
For the "batch file not available" issue, remove the IsOneWay=true setting from IRelationalSyncContract.UploadBatchFile. When the batch file size is big, ApplyChanges can be called before the previous UploadBatchFile has fully completed.
// Replace
[OperationContract(IsOneWay = true)]
// with
[OperationContract]
void UploadBatchFile(string batchFileid, byte[] batchFile, string remotePeerId);
I suppose it's simply a stupid example. It demonstrates "some" of the technique but assumes you have to arrange it in the proper order yourself.
http://msdn.microsoft.com/en-us/library/cc807255.aspx