Dataprep: BigQuery running in a different region - google-cloud-platform

So I get the following error when running Dataprep.
java.io.IOException: Query job beam_job_9e016180fbb74637b35319c89b6ed6d7_clouddataprepleads6085795bynick-query-d23eb37a1bee4a788e7b16c1de1f92e6 failed, status: { "errorResult" : { "message" : "Not found: Dataset lumi-210601:attribution was not found in location US", "reason" : "notFound" }, "errors" : [ { "message" : "Not found: Dataset lumi-210601:attribution was not found in location US", "reason" : "notFound" } ], "state" : "DONE" }
I've tried doing the following already:
Under Dataprep "Project Settings" I've set the regional endpoint to asia-east1 and the zone to australia-southeast1-a
Under "Profile" I've set all of the upload/job run/temp directories to new directories in a bucket that belongs to the Australia south-east region
In the flow output I'm just doing a basic CSV output to test, with the Dataprep execution settings set to the Australia region as well
I can't seem to find any other references to US anywhere. If I go into the Dataflow UI I can see that the jobs ran and failed in the asia-east1 region as well.
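Before changing any more Dataprep settings, it's worth confirming where the dataset from the error (lumi-210601:attribution) actually lives, since the job is looking for it in US. Below is a minimal sketch using the google-cloud-bigquery Java client to print the dataset location; the project and dataset names are taken from the error message, so adjust if yours differ.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetId;

public class CheckDatasetLocation {
    public static void main(String[] args) {
        // Uses Application Default Credentials for the active project.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        Dataset dataset = bigquery.getDataset(DatasetId.of("lumi-210601", "attribution"));
        // Prints e.g. "US" or "australia-southeast1"; the Dataprep/Dataflow job
        // and its temp locations need to run in this same location.
        System.out.println(dataset.getLocation());
    }
}

If this prints an Australian region, the dataset itself is fine and it is the BigQuery query job that is still being issued in US.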

Related

How to join whosonfirst results with other layers? Pelias API

I've configured the Pelias API and see that my results don't include the whosonfirst locality.
My API results differ from the api.geocode.earth results (both outputs omitted here).
I do have all services (placeholder, pip, libpostal) configured and running correctly.
Config for whosonfirst import job:
"whosonfirst": {
"sqlite": true,
"datapath": "/data/whosonfirst",
"dataHost": "https://data.geocode.earth/wof/dist/sqlite/whosonfirst-data-admin-ua-latest.db.bz2",
"importPostalcodes": "true",
"countryCode": "UA",
"importPlace": [ 101752483 ]
}
I can see that some data from whosonfirst was imported to Elasticsearch successfully.
How can I add this information to my venues/streets layers results?

Dataflow Job - HTTP 400 Non Empty Data

I launched a Dataflow batch job to load CSV data from GCS to Pub/Sub.
The Dataflow job is failing with the following log excerpt:
Error message from worker: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request { "code" : 400, "errors" : [ { "domain" : "global", "message" : "One or more messages in the publish request is empty. Each message must contain either non-empty data, or at least one attribute.", "reason" : "badRequest" } ], "message" : "One or more messages in the publish request is empty. Each message must contain either non-empty data, or at least one attribute.", "status" : "INVALID_ARGUMENT" } com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150) com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113) com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40) com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:443) com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1108) com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:541) com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:474) com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:591) org.apache.beam.sdk.io.gcp.pubsub.PubsubJsonClient.publish(PubsubJsonClient.java:138) org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Write$PubsubBoundedWriter.publish(PubsubIO.java:1195) org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Write$PubsubBoundedWriter.finishBundle(PubsubIO.java:1184)
So basically it's saying that at least one message is empty, but the CSV data only contains some empty fields, not a fully empty line.
Here is a sample below:
2019-12-01 00:00:00 UTC,remove_from_cart,5712790,1487580005268456287,,f.o.x,6.27,576802932,51d85cb0-897f-48d2-918b-ad63965c12dc
2019-12-01 00:00:00 UTC,view,5764655,1487580005411062629,,cnd,29.05,412120092,8adff31e-2051-4894-9758-224bfa8aec18
2019-12-01 00:00:02 UTC,cart,4958,1487580009471148064,,runail,1.19,494077766,c99a50e8-2fac-4c4d-89ec-41c05f114554
2019-12-01 00:00:05 UTC,view,5848413,1487580007675986893,,freedecor,0.79,348405118,722ffea5-73c0-4924-8e8f-371ff8031af4
2019-12-01 00:00:07 UTC,view,5824148,1487580005511725929,,,5.56,576005683,28172809-7e4a-45ce-bab0-5efa90117cd5
2019-12-01 00:00:09 UTC,view,5773361,1487580005134238553,,runail,2.62,560109803,38cf4ba1-4a0a-4c9e-b870-46685d105f95
2019-12-01 00:00:18 UTC,cart,5629988,1487580009311764506,,,1.19,579966747,1512be50-d0fd-4a92-bcd8-3ea3943f2a3b
Any help?
Thanks
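Pub/Sub rejects the entire publish request if any message in the batch has empty data and no attributes, so a single blank line reaching the PubsubIO writer is enough to fail the whole bundle; empty fields inside a row are not a problem. A minimal sketch of filtering blank lines out before publishing, assuming a simple TextIO-to-PubsubIO pipeline (the GCS path and topic are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;

public class CsvToPubsub {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadCsv", TextIO.read().from("gs://my-bucket/events/*.csv"))          // placeholder path
         // Drop blank or whitespace-only lines so no empty Pub/Sub message is built.
         .apply("DropEmptyLines", Filter.by((String line) -> !line.trim().isEmpty()))
         .apply("Publish", PubsubIO.writeStrings().to("projects/my-project/topics/events")); // placeholder topic

        p.run();
    }
}

If the CSV file ends with an extra newline or contains stray blank rows, a filter like this is usually enough to make the publish requests valid again.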

Dataflow: Cannot read and write in different locations: source: asia-south1, destination: us-central1

I am getting the below error when running the Dataflow job. My data source is GCP BigQuery (asia-south1) and the destination is a PostgreSQL DB (AWS, Mumbai region).
java.io.IOException: Extract job beam_job_0c64359f7e274ff1ba4072732d7d9653_firstcrybqpgnageshpinjarkar07200750105c51e26c-extract failed, status: {
"errorResult" : {
"message" : "Cannot read and write in different locations: source: asia-south1, destination: us-central1",
"reason" : "invalid"
},
"errors" : [ {
"message" : "Cannot read and write in different locations: source: asia-south1, destination: us-central1",
"reason" : "invalid"
} ],
"state" : "DONE"
}.
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.executeExtract(BigQuerySourceBase.java:185)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:121)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:139)
at com.google.cloud.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:275)
at com.google.cloud.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:197)
at com.google.cloud.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:181)
at com.google.cloud.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:160)
at com.google.cloud.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:77)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:391)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:360)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:288)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:134)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:114)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:101)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My code is as below:
p
    .apply(BigQueryIO.read().from("datalake:Yearly2020.Sales"))
    .apply(JdbcIO.<TableRow>write()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create("org.postgresql.Driver", "jdbc:postgresql://xx.xx.xx.xx:1111/dbname")
            .withUsername("username")
            .withPassword("password"))
        .withStatement("INSERT INTO Table VALUES(ProductRevenue)")
        .withPreparedStatementSetter(new BQPGStatementSetter()));
p.run().waitUntilFinish();
I am running the pipeline as below:
gcloud beta dataflow jobs run sales_data \
  --gcs-location gs://datalake-templates/Template \
  --region=asia-east1 \
  --network=datalake-vpc \
  --subnetwork=regions/asia-east1/subnetworks/asia-east1
When BigQuery is the source, BigQueryIO runs an extract job that stages the data in a GCS bucket. The data is staged in temp_location, and if temp_location is not specified then the location of the bucket given in staging_location is used.
In the Dataflow job, specify a temp_location bucket that was created in asia-south1, since that is where your BigQuery dataset is.
Also, if you are using a network and subnetwork, it is advisable to turn off public IPs so that connectivity goes via the VPN.
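If the pipeline/template is built in Java, the same settings can be set on the pipeline options when it is constructed. A minimal sketch, where the temp bucket name is a placeholder assumed to have been created in asia-south1 (the dataset's location) and the worker region/network are kept as in the question:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SalesPipelineOptionsSketch {
    public static void main(String[] args) {
        DataflowPipelineOptions options =
                PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setRegion("asia-east1");                                     // worker region, as in the question
        options.setNetwork("datalake-vpc");
        options.setSubnetwork("regions/asia-east1/subnetworks/asia-east1");
        // Bucket created in asia-south1, matching the BigQuery dataset location,
        // so the extract job reads and writes in the same location.
        options.setTempLocation("gs://datalake-temp-asia-south1/temp");      // placeholder bucket
        // Workers get only private IPs; traffic leaves through the VPC/VPN instead.
        options.setUsePublicIps(false);

        Pipeline p = Pipeline.create(options);
        // ... BigQueryIO.read() -> JdbcIO.write() as in the question ...
        p.run().waitUntilFinish();
    }
}

With tempLocation pointing at an asia-south1 bucket, the extract job's source and destination are in the same location, which is what the "Cannot read and write in different locations" error is complaining about.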

Pyspark - read data from elasticsearch cluster on EMR

I am trying to read data from Elasticsearch from PySpark, using the elasticsearch-hadoop connector in Spark. The ES cluster sits on AWS EMR, which requires credentials to sign in. My script is as below:
from pyspark import SparkContext, SparkConf

sc.stop()
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)

es_read_conf = {
    "es.host": "vhost",
    "es.nodes": "node",
    "es.port": "443",
    "es.query": '{ "query": { "match_all": {} } }',
    "es.input.json": "true",
    "es.net.https.auth.user": "aws_access_key",
    "es.net.https.auth.pass": "aws_secret_key",
    "es.net.ssl": "true",
    "es.resource": "index/type",
    "es.nodes.wan.only": "true"
}

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf
)
PySpark keeps throwing this error:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on
[index] failed; server [node:443] returned [403|Forbidden:]
I checked everything, and it all made sense except for the user and pass entries: would the AWS access key and secret key work here? We don't want to use the console user and password here for security purposes. Is there a different way to do the same thing?

StorageException: Anonymous caller does not have storage.objects.get access

When trying to run the below code on CircleCI
fun getJsonFromCloudStorage(): ByteArrayInputStream {
    val blobId = BlobId.of("my-company", "creds/my-company-creds.json")
    val storage = StorageOptions.getDefaultInstance().service
    val get = storage.get(blobId)
    return get.getContent().inputStream()
}
it will throw the below error during the integration tests.
> Task :test FAILED
function.GetMetadataFromYouTubeTest > extractIncorrectId FAILED
java.lang.ExceptionInInitializerError
at function.GetMetadataFromYouTube.expand(GetMetadataFromYouTube.kt:17)
at function.GetMetadataFromYouTube.expand(GetMetadataFromYouTube.kt:14)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at function.GetMetadataFromYouTubeTest.extractIncorrectId(GetMetadataFromYouTubeTest.kt:71)
Caused by:
com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.get access to cni-analytics/creds/cni-awesome.json.
at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:220)
at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:414)
at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:198)
at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:195)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:89)
at com.google.cloud.RetryHelper.run(RetryHelper.java:74)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:51)
at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:195)
at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:209)
at storage.CredentialHelper$Companion.getJsonFromCloudStorage(CredentialHelper.kt:18)
at service.YoutubeService.initialiseYouTube(YoutubeService.kt:50)
at service.YoutubeService.<init>(YoutubeService.kt:19)
at MainKt.<clinit>(main.kt:15)
... 6 more
Caused by:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 401 Unauthorized
{
"code" : 401,
"errors" : [ {
"domain" : "global",
"location" : "Authorization",
"locationType" : "header",
"message" : "Anonymous caller does not have storage.objects.get access to my-company/creds/my-company-creds.json.",
"reason" : "required"
} ],
"message" : "Anonymous caller does not have storage.objects.get access to my-company/creds/my-company-creds.json."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1065)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:411)
... 17 more
I followed their documentation.
They said this in their documentation:
Note: To use certain services (like Google Cloud Datastore), you will also need to set the CircleCI $GOOGLE_APPLICATION_CREDENTIALS environment variable to ${HOME}/gcloud-service-key.json.
Instead I set $GOOGLE_APPLICATION_CREDENTIALS in the CircleCI UI to /home/circleci/gcloud-service-key.json and it worked.
I'm assuming this is because I was trying to reference an environment variable from the UI, so ${HOME} had not been set at the time that environment variable was evaluated. Perhaps if it were set in config.yml, ${HOME} would resolve.
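Another way to make the test independent of how CircleCI resolves GOOGLE_APPLICATION_CREDENTIALS is to pass the credentials explicitly instead of relying on StorageOptions.getDefaultInstance(), which goes through Application Default Credentials. A minimal sketch with the google-cloud-storage client (shown in Java; the key path and the bucket/object names are just the values from this question and may differ in your setup):

import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.FileInputStream;

public class ExplicitCredentialsSketch {
    public static void main(String[] args) throws Exception {
        // Same key file the GOOGLE_APPLICATION_CREDENTIALS variable points at on CircleCI.
        String keyPath = "/home/circleci/gcloud-service-key.json";

        GoogleCredentials credentials;
        try (FileInputStream keyStream = new FileInputStream(keyPath)) {
            credentials = GoogleCredentials.fromStream(keyStream);
        }

        Storage storage = StorageOptions.newBuilder()
                .setCredentials(credentials)
                .build()
                .getService();

        // Same read as the Kotlin snippet above, but with explicit credentials,
        // so the request is never sent as an anonymous caller.
        byte[] content = storage.readAllBytes(BlobId.of("my-company", "creds/my-company-creds.json"));
        System.out.println("Fetched " + content.length + " bytes");
    }
}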