I've configured the Pelias API and I see that my results don't include the whosonfirst locality.
my API results:
api.geocode.earth results:
I do have all services (placeholder, pip, libpostal) configured and running correctly.
Config for whosonfirst import job:
"whosonfirst": {
"sqlite": true,
"datapath": "/data/whosonfirst",
"dataHost": "https://data.geocode.earth/wof/dist/sqlite/whosonfirst-data-admin-ua-latest.db.bz2",
"importPostalcodes": "true",
"countryCode": "UA",
"importPlace": [ 101752483 ]
}
I can see that some data from whosonfirst is imported into Elasticsearch successfully:
How can I add this information to my venues/streets layers results?
I launched a Dataflow batch job to load CSV data from GCS to Pub/Sub.
The Dataflow job is failing with the following log portion:
Error message from worker: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "One or more messages in the publish request is empty. Each message must contain either non-empty data, or at least one attribute.",
    "reason" : "badRequest"
  } ],
  "message" : "One or more messages in the publish request is empty. Each message must contain either non-empty data, or at least one attribute.",
  "status" : "INVALID_ARGUMENT"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:443)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1108)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:541)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:474)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:591)
at org.apache.beam.sdk.io.gcp.pubsub.PubsubJsonClient.publish(PubsubJsonClient.java:138)
at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Write$PubsubBoundedWriter.publish(PubsubIO.java:1195)
at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Write$PubsubBoundedWriter.finishBundle(PubsubIO.java:1184)
So basically it's saying that at least one message is empty, but the CSV data contains some empty fields, not a fully empty line.
Here is a sample below:
2019-12-01 00:00:00 UTC,remove_from_cart,5712790,1487580005268456287,,f.o.x,6.27,576802932,51d85cb0-897f-48d2-918b-ad63965c12dc
2019-12-01 00:00:00 UTC,view,5764655,1487580005411062629,,cnd,29.05,412120092,8adff31e-2051-4894-9758-224bfa8aec18
2019-12-01 00:00:02 UTC,cart,4958,1487580009471148064,,runail,1.19,494077766,c99a50e8-2fac-4c4d-89ec-41c05f114554
2019-12-01 00:00:05 UTC,view,5848413,1487580007675986893,,freedecor,0.79,348405118,722ffea5-73c0-4924-8e8f-371ff8031af4
2019-12-01 00:00:07 UTC,view,5824148,1487580005511725929,,,5.56,576005683,28172809-7e4a-45ce-bab0-5efa90117cd5
2019-12-01 00:00:09 UTC,view,5773361,1487580005134238553,,runail,2.62,560109803,38cf4ba1-4a0a-4c9e-b870-46685d105f95
2019-12-01 00:00:18 UTC,cart,5629988,1487580009311764506,,,1.19,579966747,1512be50-d0fd-4a92-bcd8-3ea3943f2a3b
Any help?
Thanks
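One thing worth checking (an assumption, not something confirmed by the log alone): a trailing newline or blank row in the CSV file becomes an empty element, and Pub/Sub rejects any publish request that contains an empty message. Below is a minimal Beam sketch of that idea, filtering blank lines before publishing; the bucket, topic, and class names are illustrative, not taken from the original job:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;

public class CsvToPubsub {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply("ReadCsv", TextIO.read().from("gs://my-bucket/input/*.csv"))
        // Drop blank lines (for example a trailing newline at the end of the file)
        // so that no empty Pub/Sub message is ever produced.
        .apply("DropEmptyLines", Filter.by((String line) -> !line.trim().isEmpty()))
        .apply("PublishToPubsub", PubsubIO.writeStrings().to("projects/my-project/topics/my-topic"));

    p.run();
  }
}

If the job is one of the Google-provided templates rather than custom code, the quicker test is simply to check whether the input file ends with a blank line.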
I am getting the below error when running the Dataflow job. My data source is GCP BigQuery (asia-south1) and the destination is a PostgreSQL DB (AWS, Mumbai region).
java.io.IOException: Extract job beam_job_0c64359f7e274ff1ba4072732d7d9653_firstcrybqpgnageshpinjarkar07200750105c51e26c-extract failed, status: {
"errorResult" : {
"message" : "Cannot read and write in different locations: source: asia-south1, destination: us-central1",
"reason" : "invalid"
},
"errors" : [ {
"message" : "Cannot read and write in different locations: source: asia-south1, destination: us-central1",
"reason" : "invalid"
} ],
"state" : "DONE"
}.
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.executeExtract(BigQuerySourceBase.java:185)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:121)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:139)
at com.google.cloud.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:275)
at com.google.cloud.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:197)
at com.google.cloud.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:181)
at com.google.cloud.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:160)
at com.google.cloud.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:77)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:391)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:360)
at com.google.cloud.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:288)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:134)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:114)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:101)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My code is as below:
p
    .apply(BigQueryIO.read().from("datalake:Yearly2020.Sales"))
    .apply(JdbcIO.<TableRow>write()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create("org.postgresql.Driver", "jdbc:postgresql://xx.xx.xx.xx:1111/dbname")
            .withUsername("username")
            .withPassword("password"))
        .withStatement("INSERT INTO Table VALUES(ProductRevenue)")
        .withPreparedStatementSetter(new BQPGStatementSetter()));

p.run().waitUntilFinish();
I am running the pipeline as below:
gcloud beta dataflow jobs run sales_data \
--gcs-location gs://datalake-templates/Template \
--region=asia-east1 \
--network=datalake-vpc \
--subnetwork=regions/asia-east1/subnetworks/asia-east1 \
When BigQuery is the source, BigQueryIO runs an extract job that stages the table data as files in a GCS bucket. The files are staged in temp_location, and if temp_location is not specified the location given in staging_location is used instead.
In the Dataflow job, specify temp_location with a bucket created in asia-south1, since that is where your BigQuery dataset is.
Also, if you are using network and subnetwork, it is advisable to turn off public IPs so that connectivity goes over the VPN.
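For example, the temp location can be pinned when the pipeline (or template) is built; a minimal sketch, assuming a bucket in asia-south1 (the bucket name is illustrative):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);

// BigQueryIO's extract job writes its files here, so the bucket must be in the
// same region as the BigQuery dataset (asia-south1 in this case).
options.setTempLocation("gs://my-asia-south1-bucket/temp");

Pipeline p = Pipeline.create(options);

For the private-IP part, gcloud dataflow jobs run also accepts a --disable-public-ips flag as far as I know; check gcloud's help output to confirm.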
I am trying to read data from Elasticsearch from PySpark, using the elasticsearch-hadoop API in Spark. The ES cluster sits on AWS EMR, which requires credentials to sign in. My script is as below:
from pyspark import SparkContext, SparkConf

sc.stop()
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)

es_read_conf = {
    "es.host": "vhost",
    "es.nodes": "node",
    "es.port": "443",
    "es.query": '{ "query": { "match_all": {} } }',
    "es.input.json": "true",
    "es.net.https.auth.user": "aws_access_key",
    "es.net.https.auth.pass": "aws_secret_key",
    "es.net.ssl": "true",
    "es.resource": "index/type",
    "es.nodes.wan.only": "true"
}

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf
)
Pyspark keeps throwing error:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on
[index] failed; server[node:443] returned [403|Forbidden:]
I checked everything, and it all made sense except for the user and pass entries: would the AWS access key and secret key work here? We don't want to use the console username and password here for security purposes. Is there a different way to do the same thing?
When trying to run the below code on CircleCI
fun getJsonFromCloudStorage(): ByteArrayInputStream {
    val blobId = BlobId.of("my-company", "creds/my-company-creds.json")
    val storage = StorageOptions.getDefaultInstance().service
    val get = storage.get(blobId)
    return get.getContent().inputStream()
}
it will throw the below error during the integration tests.
> Task :test FAILED
function.GetMetadataFromYouTubeTest > extractIncorrectId FAILED
java.lang.ExceptionInInitializerError
at function.GetMetadataFromYouTube.expand(GetMetadataFromYouTube.kt:17)
at function.GetMetadataFromYouTube.expand(GetMetadataFromYouTube.kt:14)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at function.GetMetadataFromYouTubeTest.extractIncorrectId(GetMetadataFromYouTubeTest.kt:71)
Caused by:
com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.get access to cni-analytics/creds/cni-awesome.json.
at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:220)
at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:414)
at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:198)
at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:195)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:89)
at com.google.cloud.RetryHelper.run(RetryHelper.java:74)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:51)
at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:195)
at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:209)
at storage.CredentialHelper$Companion.getJsonFromCloudStorage(CredentialHelper.kt:18)
at service.YoutubeService.initialiseYouTube(YoutubeService.kt:50)
at service.YoutubeService.<init>(YoutubeService.kt:19)
at MainKt.<clinit>(main.kt:15)
... 6 more
Caused by:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 401 Unauthorized
{
"code" : 401,
"errors" : [ {
"domain" : "global",
"location" : "Authorization",
"locationType" : "header",
"message" : "Anonymous caller does not have storage.objects.get access to my-company/creds/my-company-creds.json.",
"reason" : "required"
} ],
"message" : "Anonymous caller does not have storage.objects.get access to my-company/creds/my-company-creds.json."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1065)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:411)
... 17 more
I followed their documentation, which says:
Note: To use certain services (like Google Cloud Datastore), you will also need to set the CircleCI $GOOGLE_APPLICATION_CREDENTIALS environment variable to ${HOME}/gcloud-service-key.json.
Instead I set $GOOGLE_APPLICATION_CREDENTIALS in the CircleCI UI to /home/circleci/gcloud-service-key.json and it worked.
I'm assuming this is because I was referencing an environment variable from the UI, so ${HOME} had not been set yet at the point where this environment variable was being set. Perhaps if the variable were set in config.yml, ${HOME} would resolve.
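An alternative that sidesteps GOOGLE_APPLICATION_CREDENTIALS and ${HOME} expansion altogether is to read the key file path from a plain environment variable and hand the credentials to the client explicitly. A JVM sketch (written in Java here; the GCS_KEY_PATH variable name is made up for illustration):

import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.FileInputStream;
import java.io.IOException;

public class StorageClientFactory {
  public static Storage create() throws IOException {
    // GCS_KEY_PATH is an illustrative variable pointing at the service-account JSON key.
    String keyPath = System.getenv("GCS_KEY_PATH");
    try (FileInputStream keyStream = new FileInputStream(keyPath)) {
      GoogleCredentials credentials = GoogleCredentials.fromStream(keyStream);
      return StorageOptions.newBuilder()
          .setCredentials(credentials)
          .build()
          .getService();
    }
  }
}

This keeps credential resolution explicit in code instead of depending on how CircleCI expands environment variables.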