I am trying to write Pub/Sub JSON messages to Bigtable. I run the code from my local machine and the Dataflow job gets created, but I don't see any data in the Bigtable instance, and no error is thrown in the console or in the Dataflow job. I also tried writing a hardcoded value to Bigtable, but that didn't work either. Can anyone suggest or guide me on this issue?
try {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(projectArgs).create();
    options.setRunner(DataflowRunner.class);
    System.out.println("tempfile-" + options.getTempLocation());
    Pipeline p = Pipeline.create(options);
    System.out.println("options" + options.getTempLocation());

    p.apply("Read PubSub Messages", PubsubIO.readStrings().fromTopic(PUBSUB_SUBSCRIPTION))
     .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
     .apply(ParDo.of(new RowGenerator()))
     .apply(CloudBigtableIO.writeToTable(bigtableConfig));

    p.run();
} catch (Exception e) {
    System.out.println(e);
}
}
static class RowGenerator extends DoFn<String, Mutation> {

    @ProcessElement
    public void processElement(ProcessContext context) {
        try {
            System.out.println("In RowGenerator");
            String decodedMessageAsJsonString = context.element();
            System.out.println("decodedMessageAsJsonString: " + decodedMessageAsJsonString);
            // Use the current UTC epoch second as the row key.
            String rowKey = String.valueOf(
                    LocalDateTime.ofInstant(Instant.now(), ZoneId.of("UTC"))
                            .toEpochSecond(ZoneOffset.UTC));
            System.out.println("rowKey: " + rowKey);
            Put put = new Put(rowKey.getBytes());
            put.addColumn("VALUE".getBytes(), "VALUE".getBytes(), decodedMessageAsJsonString.getBytes());
            // put.addColumn(Bytes.toBytes("IBS"), Bytes.toBytes("name"), Bytes.toBytes("ram"));
            context.output(put);
        } catch (Throwable e) {
            System.out.println(e);
        }
    }
}
I don't see any issue with the Bigtable side of the template. Just make sure that the column family (which I am assuming is "VALUE") exists on the destination table.
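For example, here is a minimal sketch (using the google-cloud-bigtable admin client; the project, instance and table ids are placeholders) that checks whether the "VALUE" column family exists on the destination table and creates it if it doesn't:
import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
import com.google.cloud.bigtable.admin.v2.models.ModifyColumnFamiliesRequest;
import com.google.cloud.bigtable.admin.v2.models.Table;

public class EnsureValueColumnFamily {
    public static void main(String[] args) throws Exception {
        // Placeholder ids; replace with your own project, instance and table.
        String projectId = "my-project";
        String instanceId = "my-instance";
        String tableId = "my-table";

        try (BigtableTableAdminClient admin = BigtableTableAdminClient.create(projectId, instanceId)) {
            Table table = admin.getTable(tableId);
            boolean hasValueFamily = table.getColumnFamilies().stream()
                    .anyMatch(cf -> cf.getId().equals("VALUE"));
            if (!hasValueFamily) {
                // Writes to a non-existent family are rejected, so create it first.
                admin.modifyFamilies(ModifyColumnFamiliesRequest.of(tableId).addFamily("VALUE"));
                System.out.println("Created column family VALUE on " + tableId);
            }
        }
    }
}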
Are you sure that you are reading the right Pub/Sub subscription, and that messages are actually being sent to Pub/Sub? If all of that is correct, there may be some issue in the Pub/Sub configuration. Maybe add the Pub/Sub tag to the question so someone from the Pub/Sub community can help.
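Note that the pipeline in the question passes PUBSUB_SUBSCRIPTION to fromTopic(). If that value is actually a subscription path, a minimal sketch of the read step using fromSubscription() instead would look like this (the project and subscription names are placeholders):
p.apply("Read PubSub Messages",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"))
 .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
 .apply(ParDo.of(new RowGenerator()))
 .apply(CloudBigtableIO.writeToTable(bigtableConfig));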
I'm aiming to create a Lambda function that executes a Java client; that Java client is supposed to call an AWS service endpoint.
Since my Java client needs authentication (I am approaching this with AWS4Signer), I would like to authenticate my Java code through the instance metadata service (IMDS) using my Lambda execution role, as I can't use IAM users due to security procedures.
I've been trying to use InstanceProfileCredentialsProvider as my credentials provider
https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-roles.html
which in theory should take the credentials from the instance's IMDS. I'm not sure if that only works on an EC2 instance, or whether it also works with any other AWS compute service, such as Lambda.
With InstanceProfileCredentialsProvider I'm getting the following error:
com.amazonaws.internal.InstanceMetadataServiceResourceFetcher - Token is not supported. Ignoring
Failed to connect to service endpoint: com.amazonaws.SdkClientException: Failed to connect to service endpoint:
at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:100)
I came across the following posts where a similar issue was reported:
https://github.com/aws/aws-sdk-java/issues/2285
https://medium.com/expedia-group-tech/service-slow-to-retrieve-aws-credentials-ebc02a38e95b
and it seems this happens because the cached instance credentials are already outdated by the time the credentials are used for authentication.
So I have added logic to refresh the credentials provider object (InstanceProfileCredentialsProvider):
public static Optional<AWSCredentials> retrieveCredentials(AWSCredentialsProvider provider) {
    var attempts = 0;
    System.out.println("Retrieving credentials...");
    try {
        System.out.printf("Retrieving credentials at attempt : %s%n", attempts);
        return Optional.of(provider.getCredentials());
    } catch (Exception e) {
        // First attempt failed: retry up to 15 times, refreshing the provider before each retry.
        while (attempts < 15) {
            try {
                TimeUnit.SECONDS.sleep(30);
            } catch (InterruptedException ex) {
                ex.printStackTrace();
            }
            System.out.printf("Retrieving credentials at attempt : %s%n", attempts);
            provider.refresh();
            try {
                return Optional.of(provider.getCredentials());
            } catch (Exception e1) {
                System.out.printf("Attempt : %s failed due to: %s%n", attempts, e1.getMessage());
            }
            attempts++;
        }
        // All retries exhausted.
        e.printStackTrace();
        System.exit(1);
    }
    return Optional.empty();
}
But I'm still getting the same error.
Any kind of help will be much appreciated.
We switched to Bigtable some time ago, and since then there have been a number of 404 responses and also a high number of errors in the GCP Metrics console.
We see no errors in our logs and even data storage/retrieval seems to work as expected.
What is the cause for these errors and how is it possible to find out what is causing them?
As mentioned previously, 404 means the resource was not found. The relevant resource here is the Bigtable table, which could mean that either the instance id or the table id is misconfigured in your application.
I'm guessing that you are looking at the metrics under APIs & Services > Cloud Bigtable API. These metrics show the response codes returned by the Cloud Bigtable service. You should be able to see the error rate under Monitoring > Metrics Explorer with the metric bigtable.googleapis.com/server/error_count, grouping by instance, method, error_code and app_profile. This will tell you which instance and which RPC is causing the errors, which lets you grep your source code for incorrect usages.
A significantly more complex approach is to install an interceptor in the Bigtable client that:
dumps the resource name of the RPC
once you identify the problematic table name, logs the stack trace of the caller
Something along these lines:
BigtableDataSettings.Builder builder = BigtableDataSettings.newBuilder()
    .setProjectId("...")
    .setInstanceId("...");

ConcurrentHashMap<String, Boolean> seenTables = new ConcurrentHashMap<>();

builder.stubSettings().setTransportChannelProvider(
    EnhancedBigtableStubSettings.defaultGrpcTransportProviderBuilder()
        .setInterceptorProvider(() -> ImmutableList.of(new ClientInterceptor() {
          @Override
          public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
              MethodDescriptor<ReqT, RespT> methodDescriptor, CallOptions callOptions,
              Channel channel) {
            return new ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>(
                channel.newCall(methodDescriptor, callOptions)) {
              @Override
              public void sendMessage(ReqT message) {
                Message protoMessage = (Message) message;
                FieldDescriptor desc = protoMessage.getDescriptorForType()
                    .findFieldByName("table_name");
                if (desc != null) {
                  String tableName = (String) protoMessage.getField(desc);
                  if (seenTables.putIfAbsent(tableName, true) == null) {
                    System.out.println("Found new tableName: " + tableName);
                  }
                  if ("projects/my-project/instances/my-instance/tables/my-misspelled-table".equals(
                      tableName)) {
                    new RuntimeException(
                        "Fake error to get caller location of misspelled table id").printStackTrace();
                  }
                }
                delegate().sendMessage(message);
              }
            };
          }
        }))
        .build()
);
Google Cloud Support here,
Without more insight I won't be able to provide specific information about this 404 issue.
The issue is most likely either a typo or a configuration problem, but I cannot confirm that from the shared data.
In order to provide more meaningful support, I would suggest opening a Public Issue Tracker issue or a Google Cloud Support ticket.
I have written a small Cloud Function in GCP which is subscribed to a Pub/Sub event. Whenever a Cloud Build is triggered, the function posts a message into a Slack channel over a webhook.
In the response we get lots of details such as the trigger name, branch name, and variable details, but I am more interested in the build logs URL.
Currently the build logs URL in the response looks like: logUrl: https://console.cloud.google.com/cloud-build/builds/899-08sdf-4412b-e3-bd52872?project=125205252525252
which requires GCP console access to check the logs.
In the console there is an option View Raw. Is it possible to get that direct URL in the event response, so that I can send it directly to Slack and anyone can access the logs without having GCP console access?
In your Cloud Build event message, you need to extract 2 values from the JSON message:
logsBucket
id
The raw file is stored here
<logsBucket>/log-<id>.txt
So, you can get it easily in your function with the Cloud Storage client library (preferred solution) or with a simple HTTP GET call to the Storage API.
If you need more guidance, let me know your dev language, I will send you a piece of code.
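For instance, here is a minimal sketch in Java (the bucket name and build id below are placeholders standing in for values from the event payload) that downloads the raw log object with the Cloud Storage client library:
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class RawBuildLog {
    public static void main(String[] args) {
        // Placeholder values taken from the Cloud Build event payload.
        String logsBucket = "my-build-logs-bucket"; // "logsBucket" field, without the gs:// prefix
        String buildId = "12345678-aaaa-bbbb-cccc-1234567890ab"; // "id" field

        Storage storage = StorageOptions.getDefaultInstance().getService();
        Blob blob = storage.get(BlobId.of(logsBucket, "log-" + buildId + ".txt"));
        String logText = new String(blob.getContent());
        System.out.println(logText);
    }
}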
As @guillaume blaquiere suggested, here is the piece of code used in the Cloud Function to generate the signed URL of the Cloud Build logs:
var filename = 'log-' + build.id + '.txt';
var file = gcs.bucket(BUCKET_NAME).file(filename);

const getURL = async () => {
  return new Promise((resolve, reject) => {
    file.getSignedUrl({
      action: 'read',
      expires: Date.now() + 76000000
    }, (err, url) => {
      if (err) {
        console.error(err);
        return reject(err);
      }
      console.log("URL");
      resolve(url);
    });
  });
};

const signedUrl = await getURL();
If anyone is looking for the whole code, please follow this link: https://github.com/harsh4870/Cloud-build-slack-notification/blob/master/singedURL.js
We want to have a fallback mechanism in case publishing an event to Pub/Sub fails. I am using ListenableFutureCallback to know whether a message was published successfully or not. In case of failure, it just throws an exception, and I need the event details so I can post them to an internal messaging service. How do I get the event details in the onFailure() method?
I am using Spring Integration.
Below is a piece of the code.
Listener:
@Component
public class PubSubOperationListener implements ListenableFutureCallback<String> {

    private static final Logger LOGGER = LoggerFactory.getLogger(PubSubOperationListener.class);

    @Override
    public void onFailure(Throwable throwable) {
        LOGGER.error("Failed to publish the message and details : {}", throwable);
        // Logic to process it using a different approach.
    }

    @Override
    public void onSuccess(String s) {
        LOGGER.info("Message published successfully.");
    }
}
ServiceActivator:
PubSubMessageHandler pubSubMessageHandler = new PubSubMessageHandler(pubSubTemplate, testTopic);
pubSubMessageHandler.setPublishCallback(pubSubOperationListener);
return pubSubMessageHandler;
Please suggest if there is a different approach to do the same.
Currently, it's not possible because Spring Cloud GCP simply delegates to the Pub/Sub Publisher in the client library.
However, when we wrap the Future provided by the Publisher in Spring Cloud GCP, we could potentially include the original message there, along with other metadata. This would be a feature request that should be filed against the Spring Cloud GCP project.
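In the meantime, one possible workaround is to publish through PubSubTemplate directly and capture the payload in a closure so it is available in onFailure(). This is only a sketch under that assumption, not the PubSubMessageHandler callback API itself; it also assumes a Spring Cloud GCP version where publish() returns a Spring ListenableFuture, and the topic name, class name and fallback method are placeholders:
import org.springframework.cloud.gcp.pubsub.core.PubSubTemplate; // com.google.cloud.spring.pubsub.core.PubSubTemplate in newer releases
import org.springframework.util.concurrent.ListenableFuture;
import org.springframework.util.concurrent.ListenableFutureCallback;

public class PublishWithFallback {

    private final PubSubTemplate pubSubTemplate;

    public PublishWithFallback(PubSubTemplate pubSubTemplate) {
        this.pubSubTemplate = pubSubTemplate;
    }

    public void publish(String payload) {
        ListenableFuture<String> future = pubSubTemplate.publish("testTopic", payload);
        future.addCallback(new ListenableFutureCallback<String>() {
            @Override
            public void onSuccess(String messageId) {
                // The message reached Pub/Sub; messageId is the server-assigned id.
            }

            @Override
            public void onFailure(Throwable throwable) {
                // The original payload is captured by the closure, so it can be
                // forwarded to an internal messaging service from here.
                forwardToInternalMessaging(payload, throwable);
            }
        });
    }

    private void forwardToInternalMessaging(String payload, Throwable cause) {
        // Placeholder for the internal fallback logic.
    }
}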
I'm running into an issue reading GCP Pub/Sub from Dataflow: when I publish a large number of messages in a short period of time, Dataflow receives most of the sent messages, but some messages are lost and some others are duplicated. The weirdest part is that the number of lost messages is exactly the same as the number of duplicated messages.
In one example, I sent 4,000 messages in 5 seconds; 4,000 messages were received in total, but 9 messages were lost and exactly 9 messages were duplicated.
The way I determine the duplicates is via logging. I log every message that is published to Pub/Sub along with the message id generated by Pub/Sub, and I also log each message right after reading it from PubsubIO in a ParDo transformation.
The way I read from Pub/Sub in Dataflow is using org.apache.beam.sdk.io.PubsubIO:
public interface Options extends GcpOptions, DataflowPipelineOptions {

    // PUBSUB URL
    @Description("Pubsub URL")
    @Default.String("https://pubsub.googleapis.com")
    String getPubsubRootUrl();
    void setPubsubRootUrl(String value);

    // TOPIC
    @Description("Topic")
    @Default.String("projects/test-project/topics/test_topic")
    String getTopic();
    void setTopic(String value);

    ...
}

public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    options.setStreaming(true);
    options.setRunner(DataflowRunner.class);
    ...

    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply(PubsubIO
            .<String>read()
            .topic(options.getTopic())
            .withCoder(StringUtf8Coder.of())
        )
        .apply("Logging data coming out of Pubsub", ParDo
            .of(some_logging_transformation)
        )
        .apply("Saving data into db", ParDo
            .of(some_output_transformation)
        );

    pipeline.run().waitUntilFinish();
}
I wonder if this is a known issue in Pubsub or PubsubIO?
UPDATE:
I tried 4,000 requests with the Pub/Sub emulator: no missing data and no duplicates.
UPDATE #2:
I went through some more experiments and found that the duplicated messages take the message_id from the missing ones. Because the direction of the issue has diverged quite a bit from its origin, I decided to post another question with detailed logs as well as the code I used to publish and receive messages.
Link to the new question: Google Cloud Pubsub Data lost
I talked with a Googler from the Pub/Sub team. It seems to be caused by a thread-safety issue with the Python client. Please refer to the accepted answer to Google Cloud Pubsub Data lost for the response from Google.