I launched a Dataflow batch job to load CSV data from GCS into Pub/Sub.
The Dataflow job is failing with the following log excerpt:
Error message from worker: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "One or more messages in the publish request is empty. Each message must contain either non-empty data, or at least one attribute.",
    "reason" : "badRequest"
  } ],
  "message" : "One or more messages in the publish request is empty. Each message must contain either non-empty data, or at least one attribute.",
  "status" : "INVALID_ARGUMENT"
}
com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:443)
com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1108)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:541)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:474)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:591)
org.apache.beam.sdk.io.gcp.pubsub.PubsubJsonClient.publish(PubsubJsonClient.java:138)
org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Write$PubsubBoundedWriter.publish(PubsubIO.java:1195)
org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Write$PubsubBoundedWriter.finishBundle(PubsubIO.java:1184)
So basically it's saying that at least one message is empty, but while the CSV data contains some empty fields, it has no fully empty line.
Here is a sample below:
2019-12-01 00:00:00 UTC,remove_from_cart,5712790,1487580005268456287,,f.o.x,6.27,576802932,51d85cb0-897f-48d2-918b-ad63965c12dc
2019-12-01 00:00:00 UTC,view,5764655,1487580005411062629,,cnd,29.05,412120092,8adff31e-2051-4894-9758-224bfa8aec18
2019-12-01 00:00:02 UTC,cart,4958,1487580009471148064,,runail,1.19,494077766,c99a50e8-2fac-4c4d-89ec-41c05f114554
2019-12-01 00:00:05 UTC,view,5848413,1487580007675986893,,freedecor,0.79,348405118,722ffea5-73c0-4924-8e8f-371ff8031af4
2019-12-01 00:00:07 UTC,view,5824148,1487580005511725929,,,5.56,576005683,28172809-7e4a-45ce-bab0-5efa90117cd5
2019-12-01 00:00:09 UTC,view,5773361,1487580005134238553,,runail,2.62,560109803,38cf4ba1-4a0a-4c9e-b870-46685d105f95
2019-12-01 00:00:18 UTC,cart,5629988,1487580009311764506,,,1.19,579966747,1512be50-d0fd-4a92-bcd8-3ea3943f2a3b
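For context, here is a minimal sketch of the kind of defensive filter I could add in front of the Pub/Sub write (assuming the file is read line by line with TextIO; the bucket, file and topic names are placeholders, not my real ones):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create();

// Read the CSV line by line.
PCollection<String> lines = p.apply("ReadCsv",
    TextIO.read().from("gs://my-bucket/2019-Dec.csv"));

lines
    // Drop lines that are empty or whitespace-only, so no empty message
    // ever reaches the publish request.
    .apply("DropEmptyLines", Filter.by((String line) -> !line.trim().isEmpty()))
    .apply("WriteToPubsub", PubsubIO.writeStrings()
        .to("projects/my-project/topics/my-topic"));

p.run();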
Any help? Thanks.
I am getting the below error when running a Dataflow job. My data source is BigQuery in GCP (asia-south1) and the destination is a PostgreSQL DB on AWS (Mumbai region).
java.io.IOException: Extract job beam_job_0c64359f7e274ff1ba4072732d7d9653_firstcrybqpgnageshpinjarkar07200750105c51e26c-extract failed, status: {
"errorResult" : {
"message" : "Cannot read and write in different locations: source: asia-south1, destination: us-central1",
"reason" : "invalid"
},
"errors" : [ {
"message" : "Cannot read and write in different locations: source: asia-south1, destination: us-central1",
"reason" : "invalid"
} ],
"state" : "DONE"
}.
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.executeExtract(BigQuerySourceBase.java:185)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:121)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:139)
at com.google.cloud.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:275)
at com.google.cloud.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:197)
at com.google.cloud.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:181)
at com.google.cloud.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:160)
at com.google.cloud.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:77)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:391)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:360)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:288)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:134)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:114)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:101)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My code is as below:
p
    .apply(BigQueryIO.read().from("datalake:Yearly2020.Sales"))
    .apply(JdbcIO.<TableRow>write()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(
                    "org.postgresql.Driver", "jdbc:postgresql://xx.xx.xx.xx:1111/dbname")
                .withUsername("username")
                .withPassword("password"))
        .withStatement("INSERT INTO Table VALUES(ProductRevenue)")
        .withPreparedStatementSetter(new BQPGStatementSetter()));

p.run().waitUntilFinish();
I am running the pipeline as below:
gcloud beta dataflow jobs run sales_data \
--gcs-location gs://datalake-templates/Template \
--region=asia-east1 \
--network=datalake-vpc \
--subnetwork=regions/asia-east1/subnetworks/asia-east1
When BigQuery is the source, Beam runs export (extract) jobs that stage the data in GCS buckets. The data is staged in temp_location, and if temp_location is not specified then the staging_location is used instead.
In the Dataflow job, specify a temp_location that points to a bucket created in asia-south1, since that is where your BigQuery dataset is.
Also, since you are using a network and subnetwork, it is advisable to turn off public IPs so that the connectivity goes through the VPN.
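If the template is built from the Java pipeline above, a minimal sketch of setting both options when the pipeline is constructed (the bucket name below is a placeholder and has to be a bucket created in asia-south1):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);

// Placeholder bucket: it must live in asia-south1, the same location as the
// BigQuery dataset, so the extract job can stage its files there.
options.setTempLocation("gs://my-asia-south1-bucket/temp");

// Corresponds to "turn off public IPs"; workers then reach PostgreSQL only
// over the VPC network / VPN.
options.setUsePublicIps(false);

Pipeline p = Pipeline.create(options);

The same settings can also be passed as --tempLocation and --usePublicIps=false arguments when the template is generated.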
I am trying to read data from Elasticsearch from PySpark, using the elasticsearch-hadoop API in Spark. The ES cluster sits on AWS EMR, which requires credentials to sign in. My script is as below:
from pyspark import SparkContext, SparkConf

sc.stop()
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)

es_read_conf = {
    "es.host": "vhost",
    "es.nodes": "node",
    "es.port": "443",
    "es.query": '{ "query": { "match_all": {} } }',
    "es.input.json": "true",
    "es.net.https.auth.user": "aws_access_key",
    "es.net.https.auth.pass": "aws_secret_key",
    "es.net.ssl": "true",
    "es.resource": "index/type",
    "es.nodes.wan.only": "true"
}

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)
PySpark keeps throwing this error:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on
[index] failed; server[node:443] returned [403|Forbidden:]
I checked everything and it all made sense except for the user and pass entries: would the AWS access key and secret key work here? We don't want to use the console user and password here for security purposes. Is there a different way to do the same thing?
I am using the S3 PowerShell API (Get-S3Object) to retrieve files from S3, and the API behaves in a strange way.
I run the following command first:
Get-S3Object -BucketName "tools-bucket" -keyprefix "Rollback/ust1twastool01a"
It returns this list:
ETag : "d41d8cd98f00b204e9800998ecf8427e"
Key : Rollback/ust1twastool01a/
LastModified : 11/7/2016 3:24:13 PM
Owner : Amazon.S3.Model.Owner
Size : 0
StorageClass : STANDARD
ETag : "e0ada177422c1fe4d9bd9801636f4e8a"
Key : Rollback/ust1twastool01a/Rollback_Kit.txt
LastModified : 11/7/2016 3:25:00 PM
Owner : Amazon.S3.Model.Owner
Size : 626
StorageClass : STANDARD
The first entry is the key prefix itself, which is a folder. Then I run the command with another key prefix:
Get-S3Object -BucketName "tools-bucket" -keyprefix "Rollback/autopatch"
It returns this:
ETag : "4c3723148b9fb78d5b182c72aa6f1866-62"
Key : Rollback/autopatch/2016-08-30_21-15-17_server-1.1.20558_client-1.1.20518.zip
LastModified : 8/30/2016 5:18:43 PM
Owner : Amazon.S3.Model.Owner
Size : 323772907
StorageClass : STANDARD
ETag : "bfc65b2cde2c3f24a2086ca503270a54"
Key : Rollback/autopatch/buildRecords.txt
LastModified : 8/30/2016 5:19:44 PM
Owner : Amazon.S3.Model.Owner
Size : 53
StorageClass : STANDARD
This time, the key prefix itself is not returned. I can't quite figure out why this happens.
This typically happens when you create a folder in the console: the zero-byte object whose key ends in / is just a placeholder. These placeholders aren't needed unless you want to navigate "into" a folder manually in the console to upload objects. Your workaround is to skip objects ending in / that have a size of zero.
http://docs.aws.amazon.com/AmazonS3/latest/UG/about-using-console.html#welcome-folder-concept
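That workaround as a quick sketch (same bucket and key prefix as your first command): pipe the output through Where-Object and drop zero-byte keys that end with "/".

Get-S3Object -BucketName "tools-bucket" -KeyPrefix "Rollback/ust1twastool01a" |
    Where-Object { -not ($_.Size -eq 0 -and $_.Key.EndsWith("/")) }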
I'm trying to install a custom-compiled package that I have in S3 as a zip file. I added this to my CloudFormation template:
"sources" : {
"/opt" : "https://s3.amazonaws.com/mybucket/installers/myapp-3.2.1.zip"
},
It downloads and unzips into /opt without issues, but none of the "executable" files have the "x" permission; they end up as "-rw-r--r-- 1 root root 220378 Dec 4 18:23 myapp".
If I download the zip and unzip it myself in any directory, the permissions are OK.
I already read the CloudFormation documentation and there is no clue there.
Can someone help me figure this out? Thanks in advance.
Maybe you can combine "configSets" (to guarantee the execution order) with a "commands" element and write something like this (the unzip done by cfn-init does not seem to preserve the Unix execute bits, so you restore them afterwards):
"AWS::CloudFormation::Init" : {
"configSets" : {
"default" : [ "download", "fixPermissions" ]
},
"download" : {
"sources" : {
"/opt" : "https://s3.amazonaws.com/mybucket/installers/myapp-3.2.1.zip"
},
},
"fixPermissions" : {
"commands" : {
"fixMyAppPermissions" : {
"command" : "chmod +x /opt/myapp-3.2.1/myapp"
}
}
}
}
Sources:
https://s3.amazonaws.com/cloudformation-examples/BoostrappingApplicationsWithAWSCloudFormation.pdf
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-init.html