How to get time zone boundaries in BigQuery? - google-cloud-platform

Given the following GCP services:
BigQuery
Cloud Storage
Cloud Shell
What is the easiest way to create a BigQuery table with the following two-column structure?
Column | Description          | Type      | Primary key
------ | -------------------- | --------- | -----------
tzid   | Time zone identifier | STRING    | x
bndr   | Boundaries           | GEOGRAPHY |
For example:
tzid           | bndr
-------------- | ----
Africa/Abidjan | POLYGON((-5.440683 4.896553, -5.303699 4.912035, -5.183637 4.923927, ...))
Africa/Accra   | POLYGON((-0.136231 11.13951, -0.15175 11.142384, -0.161168 11.14698, ...))
Pacific/Wallis | MULTIPOLYGON(((-178.350043 -14.384951, -178.344628 -14.394109, ...)))

1. Download and unzip timezones.geojson.zip from the evansiroky/timezone-boundary-builder repository onto your computer.
Coordinates are structured as follows (GeoJSON format):
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "tzid": "Africa/Abidjan"
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-5.440683, 4.896553], [-5.303699, 4.912035], ...]]
      }
    },
    {
      "type": "Feature",
      "properties": ...
    }
  ]
}
BigQuery does not accept a GeoJSON FeatureCollection file as-is; tables are loaded from newline-delimited JSON (JSONL). Steps 3 to 5 convert the file to JSONL.
2. Upload the file timezones_geojson.json to Cloud Storage gs://your-bucket/.
3. Move the file to the Cloud Shell virtual machine:
gsutil mv gs://your-bucket/timezones_geojson.json .
4. Parse the file timezones_geojson.json, extract the "features" array, and write one feature per line (see the jq command):
cat timezones_geojson.json | jq -c ".features[]" > timezones_jsonl.json
The previous format will be transformed to:
{
  "type": "Feature",
  "properties": {
    "tzid": "Africa/Abidjan"
  },
  "geometry": {
    "type": "Polygon",
    "coordinates": [[[-5.440683, 4.896553], [-5.303699, 4.912035], ...]]
  }
}
{
  "type": "Feature",
  "properties": ...,
  "geometry": ...
}
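If jq is not available, the same conversion can also be done with a short Python script; this is a sketch assuming the file names used in these steps:

import json

# Convert the GeoJSON FeatureCollection into newline-delimited JSON:
# one feature object per line, as expected by bq load.
with open("timezones_geojson.json") as src, open("timezones_jsonl.json", "w") as dst:
    for feature in json.load(src)["features"]:
        dst.write(json.dumps(feature) + "\n")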
5. Move the JSONL file back to Cloud Storage:
gsutil mv timezones_jsonl.json gs://your-bucket/
6. Load the JSONL file into BigQuery:
bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON --json_extension=GEOJSON your_dataset.timezones gs://your-bucket/timezones_jsonl.json
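Once loaded, the table can be queried with BigQuery's geography functions, for example to find which time zone contains a point. Below is a sketch using the google-cloud-bigquery Python client; it assumes the dataset/table names above and the target column names tzid and bndr (with --autodetect the geography column may come out under a different name, e.g. geometry, in which case adjust the query):

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Find the time zone whose boundary contains a given point (longitude, latitude).
# Column names follow the target schema (tzid, bndr); adjust if autodetect
# produced different names.
sql = """
    SELECT tzid
    FROM `your_dataset.timezones`
    WHERE ST_CONTAINS(bndr, ST_GEOGPOINT(@lng, @lat))
"""
job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("lng", "FLOAT64", 2.3522),
            bigquery.ScalarQueryParameter("lat", "FLOAT64", 48.8566),
        ]
    ),
)
for row in job:
    print(row.tzid)  # expected: Europe/Paris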

Related

python, google cloud platform: unable to overwrite a file from google bucket: CRC32 does not match

I am using the python3 client to connect to Google buckets and am trying to do the following:
download 'my_rules_file.yaml'
modify the yaml file
overwrite the file
Here is the code that I used:
from google.cloud import storage
import yaml
client = storage.Client()
bucket = client.get_bucket('bucket_name')
blob = bucket.blob('my_rules_file.yaml')
yaml_file = blob.download_as_string()
doc = yaml.load(yaml_file, Loader=yaml.FullLoader)
doc['email'].clear()
doc['email'].extend(["test@gmail.com"])
yaml_file = yaml.dump(doc)
blob.upload_from_string(yaml_file, content_type="application/octet-stream")
This is the error I get from the last line (the upload):
BadRequest: 400 POST https://storage.googleapis.com/upload/storage/v1/b/fc-sandbox-datastore/o?uploadType=multipart: {
"error": {
"code": 400,
"message": "Provided CRC32C \"YXQoSg==\" doesn't match calculated CRC32C \"EyDHsA==\".",
"errors": [
{
"message": "Provided CRC32C \"YXQoSg==\" doesn't match calculated CRC32C \"EyDHsA==\".",
"domain": "global",
"reason": "invalid"
},
{
"message": "Provided MD5 hash \"G/rQwQii9moEvc3ZDqW2qQ==\" doesn't match calculated MD5 hash \"GqyZzuvv6yE57q1bLg8HAg==\".",
"domain": "global",
"reason": "invalid"
}
]
}
}
: ('Request failed with status code', 400, 'Expected one of', <HTTPStatus.OK: 200>)
Why is this happening? It seems to happen only for .yaml files.
The reason for your error is that you are trying to use the same blob object for both downloading and uploading. This will not work; you need two separate instances. You can find some good examples here: Python google.cloud.storage.Blob() Examples.
You should use a separate blob instance to handle the upload; you are currently trying with only one:
.....
blob = bucket.blob('my_rules_file.yaml')
yaml_file = blob.download_as_string()
.....
the second instance is needed here
....
blob.upload_from_string(yaml_file, content_type="application/octet-stream")
...
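Putting the answer together, here is a minimal sketch of the suggested fix, reusing the bucket name, file name, and email value from the question:

from google.cloud import storage
import yaml

client = storage.Client()
bucket = client.get_bucket('bucket_name')

# First blob instance: download and edit the YAML document.
download_blob = bucket.blob('my_rules_file.yaml')
doc = yaml.load(download_blob.download_as_string(), Loader=yaml.FullLoader)
doc['email'].clear()
doc['email'].extend(["test@gmail.com"])

# Second, fresh blob instance: upload the modified document.
upload_blob = bucket.blob('my_rules_file.yaml')
upload_blob.upload_from_string(yaml.dump(doc), content_type="application/octet-stream")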

Pyspark - read data from elasticsearch cluster on EMR

I am trying to read data from Elasticsearch in PySpark, using the elasticsearch-hadoop API in Spark. The ES cluster sits on AWS EMR, which requires credentials to sign in. My script is below:
from pyspark import SparkContext, SparkConf

sc.stop()
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)

es_read_conf = {
    "es.host": "vhost",
    "es.nodes": "node",
    "es.port": "443",
    "es.query": '{ "query": { "match_all": {} } }',
    "es.input.json": "true",
    "es.net.https.auth.user": "aws_access_key",
    "es.net.https.auth.pass": "aws_secret_key",
    "es.net.ssl": "true",
    "es.resource": "index/type",
    "es.nodes.wan.only": "true"
}

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf
)
PySpark keeps throwing this error:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on
[index] failed; server[node:443] returned [403|Forbidden:]
I checked everything, and it all made sense except for the user and pass entries: would the AWS access key and secret key work here? We don't want to use the console user name and password here for security purposes. Is there a different way to do the same thing?

How to prepare rendered JSON for aws-cli in Terraform?

In another thread I asked how to keep ECS task definitions active in AWS. As a result, I am planning to update a task definition like this:
resource "null_resource" "update_task_definition" {
triggers {
keys = "${uuid()}"
}
# Workaround to prevent older task definitions being deactivated
provisioner "local-exec" {
command = <<EOF
aws ecs register-task-definition \
--family my-task-definition \
--container-definitions ${data.template_file.task_definition.rendered} \
--network-mode bridge \
EOF
}
}
data.template_file.task_definition is a template data source that provides templated JSON from a file. However, this does not work, since the JSON contains newlines and whitespace.
I have already figured out that I can use the replace interpolation function to get rid of the newlines and whitespace, but I still need to escape the double quotes so that the AWS API accepts the request.
How can I safely prepare the string resulting from data.template_file.task_definition.rendered? I am looking for something like this:
Raw string:
{
"key": "value",
"another_key": "another_value"
}
Prepared string:
{\"key\":\"value\",\"another_key\":\"another_value\"}
You should be able to wrap the rendered JSON with the jsonencode function.
With the following Terraform code:
data "template_file" "example" {
  template = file("example.tpl")

  vars = {
    foo = "foo"
    bar = "bar"
  }
}

resource "null_resource" "update_task_definition" {
  triggers = {
    keys = uuid()
  }

  provisioner "local-exec" {
    command = <<EOF
echo ${jsonencode(data.template_file.example.rendered)}
EOF
  }
}
And the following template file:
{
"key": "${foo}",
"another_key": "${bar}"
}
Running a Terraform apply gives the following output:
null_resource.update_task_definition: Creating...
triggers.%: "" => "1"
triggers.keys: "" => "18677676-4e59-8476-fdde-dc19cd7d2f34"
null_resource.update_task_definition: Provisioning with 'local-exec'...
null_resource.update_task_definition (local-exec): Executing: ["/bin/sh" "-c" "echo \"{\\n \\\"key\\\": \\\"foo\\\",\\n \\\"another_key\\\": \\\"bar\\\"\\n}\\n\"\n"]
null_resource.update_task_definition (local-exec): {
null_resource.update_task_definition (local-exec): "key": "foo",
null_resource.update_task_definition (local-exec): "another_key": "bar"
null_resource.update_task_definition (local-exec): }
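Conceptually, jsonencode turns the rendered multi-line string into a single JSON string literal, escaping the double quotes and newlines so the value survives interpolation into the shell command. A rough Python illustration of that transformation, using a string shaped like the raw example from the question:

import json

# A rendered template: a multi-line JSON document held as a plain string.
rendered = '{\n  "key": "value",\n  "another_key": "another_value"\n}'

# JSON-encoding the string escapes the quotes and newlines and wraps it in quotes,
# which is roughly what Terraform's jsonencode does to a string value.
print(json.dumps(rendered))
# -> "{\n  \"key\": \"value\",\n  \"another_key\": \"another_value\"\n}"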

Using jq to parse json output of AWS CLI tools with Lightsail

I'm trying to modify a script to automate lightsail snapshots, and I am having trouble modifying the jq query.
I'm trying to parse the output of aws lightsail get-instance-snapshots
This is the original line from the script:
aws lightsail get-instance-snapshots | jq '.[] | sort_by(.createdAt) | select(.[0].fromInstanceName == "WordPress-Test-Instance") | .[].name'
which returns a list of snapshot names with one per line.
I need to modify the query so that it does not return all snapshots, but only those whose names start with 'autosnap'. I'm doing this because the script rotates snapshots, and I don't want it to delete snapshots I create manually (which will not start with 'autosnap').
Here is a redacted sample output from aws lightsail get-instance-snapshots
{
  "instanceSnapshots": [
    {
      "location": {
        "availabilityZone": "all",
        "regionName": "*****"
      },
      "arn": "*****",
      "fromBlueprintId": "wordpress_4_9_2_1",
      "name": "autosnap-WordPress-Test-Instance-2018-04-16_01.46",
      "fromInstanceName": "WordPress-Test-Instance",
      "fromBundleId": "nano_1_2",
      "supportCode": "*****",
      "sizeInGb": 20,
      "createdAt": 1523843190.117,
      "fromAttachedDisks": [],
      "fromInstanceArn": "*****",
      "resourceType": "InstanceSnapshot",
      "state": "available"
    },
    {
      "location": {
        "availabilityZone": "all",
        "regionName": "*****"
      },
      "arn": "*****",
      "fromBlueprintId": "wordpress_4_9_2_1",
      "name": "Premanent-WordPress-Test-Instance-2018-04-16_01.40",
      "fromInstanceName": "WordPress-Test-Instance",
      "fromBundleId": "nano_1_2",
      "supportCode": "*****",
      "sizeInGb": 20,
      "createdAt": 1523842851.69,
      "fromAttachedDisks": [],
      "fromInstanceArn": "*****",
      "resourceType": "InstanceSnapshot",
      "state": "available"
    }
  ]
}
I would have thought something like this would work, but I'm not having any luck after many attempts...
aws lightsail get-instance-snapshots | jq '.[] | sort_by(.createdAt) | select(.[0].fromInstanceName == "WordPress-Test-Instance") | select(.[0].name | test("autosnap")) |.[].name'
Any help would be greatly appreciated!
The basic query for making the selection you describe would be:
.instanceSnapshots | map(select(.name|startswith("autosnap")))
(If you didn't need to preserve the array structure, you could go with:
.instanceSnapshots[] | select(.name|startswith("autosnap"))
)
You could then perform additional filtering by extending the pipeline.
If you were to use test/1, the appropriate invocation would be test("^autosnap") or perhaps test("^autosnap-").
Example
.instanceSnapshots
| map(select(.name|startswith("autosnap")))
| map(select(.fromInstanceName == "WordPress-Test-Instance"))
| sort_by(.createdAt)
| .[].name
The two successive selects could of course be compacted into one. For efficiency, the sorting should be done as late as possible.
Postscript
Although you might indeed be able to get away with commencing the pipeline with .[] instead of .instanceSnapshots, the latter is advisable in case the JSON schema changes. In a sense, the whole point of data formats like JSON is to make it easy to write queries that are robust with respect to (sane) schema-evolution.
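If it helps to see the same selection outside jq, here is a rough Python sketch of the equivalent logic, using the field names from the sample output above:

import json
import subprocess

# Fetch the snapshot list via the AWS CLI, then filter and sort in Python,
# mirroring the jq pipeline above.
raw = subprocess.run(
    ["aws", "lightsail", "get-instance-snapshots"],
    capture_output=True, text=True, check=True,
).stdout

snapshots = json.loads(raw)["instanceSnapshots"]
names = [
    s["name"]
    for s in sorted(snapshots, key=lambda s: s["createdAt"])
    if s["name"].startswith("autosnap")
    and s["fromInstanceName"] == "WordPress-Test-Instance"
]
print("\n".join(names))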

Apache Drill: Not able to query the database

I am using Ubuntu 14.04.
I have started exploring querying HDFS with Apache Drill, installed it on my local system, and configured the storage plugin to point to a remote HDFS. Below is the configuration:
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://devlpmnt.mycrop.kom:8020",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "json": {
      "type": "json"
    }
  }
}
After creating a JSON file "rest.json", I ran the following query:
select * from hdfs.`/tmp/rest.json` limit 1
I am getting the following error:
org.apache.drill.common.exceptions.UserRemoteException: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 'hdfs./tmp/rest.json' not found
I would appreciate it if someone could help me figure out what is wrong.
Thanks in advance!