In the past I have successfully loaded data into US-hosted BigQuery datasets from CSV data in US-hosted GCS buckets. We have since decided to move our BigQuery data to the EU, and I created a new dataset with that region selected. I have successfully populated those of our tables small enough to be uploaded from my machine at home, but two tables are far too large for this, so I would like to load them from files in GCS. I have tried doing this from both a US-hosted GCS bucket and an EU-hosted GCS bucket (thinking that bq load might not like crossing regions), but the load fails every time. Below is the error detail I'm getting from the bq command line (500, Internal Error). Does anyone know why this might be happening?
{
"configuration": {
"load": {
"destinationTable": {
"datasetId": "######",
"projectId": "######",
"tableId": "test"
},
"schema": {
"fields": [
{
"name": "test_col",
"type": "INTEGER"
}
]
},
"sourceFormat": "CSV",
"sourceUris": [
"gs://######/test.csv"
]
}
},
"etag": "######",
"id": "######",
"jobReference": {
"jobId": "######",
"projectId": "######"
},
"kind": "bigquery#job",
"selfLink": "https://www.googleapis.com/bigquery/v2/projects/######",
"statistics": {
"creationTime": "1445336673213",
"endTime": "1445336674738",
"startTime": "1445336674738"
},
"status": {
"errorResult": {
"message": "An internal error occurred and the request could not be completed.",
"reason": "internalError"
},
"errors": [
{
"message": "An internal error occurred and the request could not be completed.",
"reason": "internalError"
}
],
"state": "DONE"
},
"user_email": "######"
}
After searching through other related questions on Stack Overflow, I eventually realised that I had set my GCS bucket region to EUROPE-WEST1 rather than the multi-region EU location. Things are now working as expected.
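For anyone who hits the same thing, a rough sketch of the check and fix (the bucket, dataset and table names below are placeholders):
# Check where the bucket lives (look at "Location constraint" in the output)
gsutil ls -L -b gs://my-bucket
# If it is a single region such as EUROPE-WEST1, create an EU multi-region bucket and copy the data over
gsutil mb -l EU gs://my-bucket-eu
gsutil cp gs://my-bucket/test.csv gs://my-bucket-eu/test.csv
# Then load into the EU dataset from the EU multi-region bucket
bq load --source_format=CSV mydataset.test gs://my-bucket-eu/test.csv test_col:INTEGER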
I've added an API Key for my project and I'm trying to retrieve data via URL:
https://www.googleapis.com/youtube/v3/videos?part=contentDetails%2Cstatistics&id={VIDEO_ID}&key={MY_KEY}
When I try it in my browser, I get the response:
{
"kind": "youtube#videoListResponse",
"etag": "\"{ETAG}\"",
"pageInfo": {
"totalResults": 1,
"resultsPerPage": 1
},
"items": [
{
"kind": "youtube#video",
"etag": "\"{ETAG}\"",
"id": "{VIDEO_ID}",
"contentDetails": {
"duration": "PT3M11S",
"dimension": "2d",
"definition": "sd",
"caption": "false",
"licensedContent": false,
"projection": "rectangular"
},
"statistics": {
"viewCount": "12822",
"likeCount": "44",
"dislikeCount": "0",
"favoriteCount": "0",
"commentCount": "4"
}
}
]
}
But it doesn't work from my Google Cloud virtual machine; I get an error:
{
"error": {
"errors": [
{
"domain": "usageLimits",
"reason": "dailyLimitExceededUnreg",
"message": "Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup.",
"extendedHelp": "https://code.google.com/apis/console"
}
],
"code": 403,
"message": "Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup."
}
}
This API key is unrestricted.
What is wrong, and how can I fix it?
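For reference, the same request can be reproduced from the VM's shell along these lines (VIDEO_ID and MY_KEY are placeholders):
curl -v "https://www.googleapis.com/youtube/v3/videos?part=contentDetails%2Cstatistics&id=VIDEO_ID&key=MY_KEY"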
The Dataproc page describing Druid support has no section on how to load data into the cluster. I've been trying to do this using Google Cloud Storage, but I don't know how to set up a spec for it that works. I'd expect the "firehose" section to have some Google-specific reference to a bucket, but there are no examples of how to do this.
What is the method for loading data into Druid running on GCP Dataproc straight out of the box?
I haven't used the Dataproc version of Druid, but I have a small cluster running on a Google Compute Engine VM. The way I ingest data into it from GCS is with the Google Cloud Storage Druid extension: https://druid.apache.org/docs/latest/development/extensions-core/google.html
To enable the extension you need to add it to the list of extensions in your Druid common.properties file:
druid.extensions.loadList=["druid-google-extensions", "postgresql-metadata-storage"]
To ingest data from GCS, I send an HTTP POST request to http://druid-overlord-host:8081/druid/indexer/v1/task
The POST request body contains a JSON ingestion spec (see the ["ioConfig"]["firehose"] section):
{
"type": "index_parallel",
"spec": {
"dataSchema": {
"dataSource": "daily_xport_test",
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "MONTH",
"queryGranularity": "NONE",
"rollup": false
},
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "dateday",
"format": "auto"
},
"dimensionsSpec": {
"dimensions": [{
"type": "string",
"name": "id",
"createBitmapIndex": true
},
{
"type": "long",
"name": "clicks_count_total"
},
{
"type": "long",
"name": "ctr"
},
"deleted",
"device_type",
"target_url"
]
}
}
}
},
"ioConfig": {
"type": "index_parallel",
"firehose": {
"type": "static-google-blobstore",
"blobs": [{
"bucket": "data-test",
"path": "/sample_data/daily_export_18092019/000000000000.json.gz"
}],
"filter": "*.json.gz$"
},
"appendToExisting": false
},
"tuningConfig": {
"type": "index_parallel",
"maxNumSubTasks": 1,
"maxRowsInMemory": 1000000,
"pushTimeout": 0,
"maxRetry": 3,
"taskStatusCheckPeriodMs": 1000,
"chatHandlerTimeout": "PT10S",
"chatHandlerNumRetries": 5
}
}
}
Example cURL command to start an ingestion task in Druid (spec.json contains the JSON from the previous section):
curl -X 'POST' -H 'Content-Type:application/json' -d @spec.json http://druid-overlord-host:8081/druid/indexer/v1/task
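Submitting the task returns a task ID in the response; as a rough extra step (TASK_ID below is a placeholder), progress can then be polled via the overlord's task status endpoint:
curl http://druid-overlord-host:8081/druid/indexer/v1/task/TASK_ID/status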
I'm receiving "I'm unable to reach the requested skill" message from an Alexa dashboard testing console for the skill that used to work before (with no modifications to any underlying infrastructure or code).
Here's the error obtained from Alexa's device logs:
{
"header": {
"namespace": "SkillDebugger",
"name": "CaptureError",
"messageId": "57d00be6-19d6-4529-b0b1-4c5d6c2760ac"
},
"payload": {
"skillId": "amzn1.ask.skill.db1bac88-183d-409c-9d3e-0e69fa0f5fe2",
"timestamp": "2018-09-27T19:11:51.066Z",
"dialogRequestId": "d9ec106d-2ef2-4526-a156-f4714ce5d034",
"skillRequestId": "amzn1.echo-api.request.1e166266-56e1-4c51-b40a-3ceb144f997f",
"code": "SKILL_ENDPOINT_ERROR",
"description": "An error occurred while issuing a SpeechletRequest for (requestId [amzn1.echo-api.request.1e166266-56e1-4c51-b40a-3ceb144f997f]",
"debuggingInfo": {
"type": "SkillExecutionInfo",
"content": {
"invocationRequest": {
"endpoint": "https://emptio.serveo.net/abc/api/v1/alexa",
"body": {
"version": "1.0",
"session": {
"new": false,
"sessionId": "amzn1.echo-api.session.bfc02d53-fe83-4c70-b731-ea7ede99d20a",
"application": {
"applicationId": "amzn1.ask.skill.db1bac88-183d-409c-9d3e-0e69fa0f5fe2"
},
"user": {
"userId": "amzn1.ask.account.AGX2NO3NXXDS6NLEZMDZXMRZZPJ3DLEERYK7J3NUPFUYRADFB2HRILB7BZVTN336OFVSNFFUP3VDVFHERK5PKQE5H32EQ5GGWTT67EMDQKP22Q7NTXXNYDUTYNCYI6EJUEODQ54VHKW4JSWVCS7JINWLYH2LICQVETFGZBY6NBDJVEX66VCGCZMRTFZYAG2E3IXDPMPVF3U4VMY",
"accessToken": "Atza|IwEBIM_YZylf-iVoydW0WhXTS4ykk6oA0FwI9Aa7Pdz_pysLPaL1AJwQLXA-Y1GJabHTWMJxfDEKyIiLFuxEPnTxuYaEDyany7WXzHMOd0-iiD9lYBxE6rIXkC3Z-I5PYU6DQtkT6DHxbusrkyGTb1bSfbznIaaFat3yNvKY9mXaNHEEhuuPRZJkXjffBA9WKzWrkGetOdHVvo-PLw2w9rWUiQQuJ6ryzQjugYILyCuTry3qz8lvqWGxYX0XB3dx_CGuzjEnNP0-X2ozhLXN8cBjtBrl7MlTffNyo6K94vi24-16bdIdFZG3mVL_bKSCXzAx2qzPJvBCn953FrPVw9zd7CtOintRSBDZ9Aw_QgKqTklliWTBP_8uRqq_nuMB8s992-Yhi6Zb-k7VvyYp7oLtJ8ggRqRlRk9vS4HBxyfKCxvfXmvlmZJlAtGjec_-Bx8UB2pf1ZH0xi-2LYpezVh2e7dgWenKU0PHvtduprVtpO4E72148mddcYyQRzAEdk8LYQx1SiamYY64_qmkv14h1qBPUIQPuv3MFt2PB7Mhm6cVTA"
}
},
"context": {
"System": {
"application": {
"applicationId": "amzn1.ask.skill.db1bac88-183d-409c-9d3e-0e69fa0f5fe2"
},
"user": {
"userId": "amzn1.ask.account.AGX2NO3NXXDS6NLEZMDZXMRZZPJ3DLEERYK7J3NUPFUYRADFB2HRILB7BZVTN336OFVSNFFUP3VDVFHERK5PKQE5H32EQ5GGWTT67EMDQKP22Q7NTXXNYDUTYNCYI6EJUEODQ54VHKW4JSWVCS7JINWLYH2LICQVETFGZBY6NBDJVEX66VCGCZMRTFZYAG2E3IXDPMPVF3U4VMY",
"accessToken": "Atza|IwEBIM_YZylf-iVoydW0WhXTS4ykk6oA0FwI9Aa7Pdz_pysLPaL1AJwQLXA-Y1GJabHTWMJxfDEKyIiLFuxEPnTxuYaEDyany7WXzHMOd0-iiD9lYBxE6rIXkC3Z-I5PYU6DQtkT6DHxbusrkyGTb1bSfbznIaaFat3yNvKY9mXaNHEEhuuPRZJkXjffBA9WKzWrkGetOdHVvo-PLw2w9rWUiQQuJ6ryzQjugYILyCuTry3qz8lvqWGxYX0XB3dx_CGuzjEnNP0-X2ozhLXN8cBjtBrl7MlTffNyo6K94vi24-16bdIdFZG3mVL_bKSCXzAx2qzPJvBCn953FrPVw9zd7CtOintRSBDZ9Aw_QgKqTklliWTBP_8uRqq_nuMB8s992-Yhi6Zb-k7VvyYp7oLtJ8ggRqRlRk9vS4HBxyfKCxvfXmvlmZJlAtGjec_-Bx8UB2pf1ZH0xi-2LYpezVh2e7dgWenKU0PHvtduprVtpO4E72148mddcYyQRzAEdk8LYQx1SiamYY64_qmkv14h1qBPUIQPuv3MFt2PB7Mhm6cVTA"
},
"device": {
"deviceId": "amzn1.ask.device.AGUTTO7VCXPCUUSXNDCNO6LK7LZHUKPDGZBOXUOBNRNOBGD7FHBJWHOK3LJNQX4U47HTFLUXJ6MHBL6V7UCDNTWOMBJIP5R4R2ZVK3XJX42PEZG6J6TCS3U7NSYZZ3PDCUSH22CY7LYGNIK2MGXCUGR4ITQQ",
"supportedInterfaces": {}
},
"apiEndpoint": "https://api.amazonalexa.com",
"apiAccessToken": "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6IjEifQ.eyJhdWQiOiJodHRwczovL2FwaS5hbWF6b25hbGV4YS5jb20iLCJpc3MiOiJBbGV4YVNraWxsS2l0Iiwic3ViIjoiYW16bjEuYXNrLnNraWxsLmRiMWJhYzg4LTE4M2QtNDA5Yy05ZDNlLTBlNjlmYTBmNWZlMiIsImV4cCI6MTUzODA3OTEwNywiaWF0IjoxNTM4MDc1NTA3LCJuYmYiOjE1MzgwNzU1MDcsInByaXZhdGVDbGFpbXMiOnsiY29uc2VudFRva2VuIjpudWxsLCJkZXZpY2VJZCI6ImFtem4xLmFzay5kZXZpY2UuQUdVVFRPN1ZDWFBDVVVTWE5EQ05PNkxLN0xaSFVLUERHWkJPWFVPQk5STk9CR0Q3RkhCSldIT0szTEpOUVg0VTQ3SFRGTFVYSjZNSEJMNlY3VUNETlRXT01CSklQNVI0UjJaVkszWEpYNDJQRVpHNko2VENTM1U3TlNZWlozUERDVVNIMjJDWTdMWUdOSUsyTUdYQ1VHUjRJVFFRIiwidXNlcklkIjoiYW16bjEuYXNrLmFjY291bnQuQUdYMk5PM05YWERTNk5MRVpNRFpYTVJaWlBKM0RMRUVSWUs3SjNOVVBGVVlSQURGQjJIUklMQjdCWlZUTjMzNk9GVlNORkZVUDNWRFZGSEVSSzVQS1FFNUgzMkVRNUdHV1RUNjdFTURRS1AyMlE3TlRYWE5ZRFVUWU5DWUk2RUpVRU9EUTU0VkhLVzRKU1dWQ1M3SklOV0xZSDJMSUNRVkVURkdaQlk2TkJESlZFWDY2VkNHQ1pNUlRGWllBRzJFM0lYRFBNUFZGM1U0Vk1ZIn19.LzPCt8QPxkEa5jFK3IMGMQLWS3vXOopyGKBu0cAy1cnJzAk7wnbKwc9eyQYDMr3uH7MyHr4s7xUKpWlvspGOAL3LqKxFbxqpB5zIjhKifqdGQhB_nurOAjeyZOipZ0ZhSuPN9fqTwp7zwca4LdYz6Kuahklz7D7pU7ICNI1DNqNKDx9HmyWbJIwXWL3MvS9sEujDo15oTdiueNaCbC7kLnPi0adrukHy3J6HVN_XjWS5mSSawuObgiT2b9eLm4qntoMG7MnDTSrzxmhKgXm3WrbFxRW_ZKE3uu1wa7-412f8DPxvbVZkeYDRwWMTO8s7BtnzjPcKEcT6daLXKRgpVw"
}
},
"request": {
"type": "SessionEndedRequest",
"requestId": "amzn1.echo-api.request.1e166266-56e1-4c51-b40a-3ceb144f997f",
"timestamp": "2018-09-27T19:11:47Z",
"locale": "en-US",
"reason": "ERROR",
"error": {
"type": "INVALID_RESPONSE",
"message": "An exception occurred while dispatching the request to the skill."
}
}
}
},
"invocationResponse": null,
"metrics": {
"skillExecutionTimeInMilliseconds": 3107
}
}
}
}
}
As can be seen from the response above, the skill is configured with the endpoint https://emptio.serveo.net/abc/api/v1/alexa, which is perfectly reachable.
Again, the exact same skill worked just yesterday, and the invocation name under which I am calling it used to work fine.
I'm able to reach and verify the above endpoint is functional and responsive outside Alexa, but it's somehow not reachable from the Alexa dashboard.
I'm monitoring the logs from Serveo - they don't show any activity, meaning that something is broken before the webhook is called.
What could be the reason for the error? How can I debug what is going on in the Alexa stack?
Make sure the option for the endpoint's SSL certificate type is correct. You can change this under the endpoint settings in the web development console.
Your host might be using a wildcard certificate, so select the wildcard option, save your endpoint again, and then retest reachability using the console.
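As a quick sanity check (a generic sketch, not specific to this skill), you can inspect which certificate the endpoint actually serves and whether its subject is a wildcard:
openssl s_client -connect emptio.serveo.net:443 -servername emptio.serveo.net </dev/null 2>/dev/null | openssl x509 -noout -subject -dates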
I am trying to import a VM image as an AMI in AWS. I have tried the VM as both a .ova and a .vmdk.
The import gets through the Converting and Updating statuses, then fails with ClientError: Saved entry is empty. I tried googling around but to no avail. Any help on what this means?
aws ec2 import-image --description "HCPVM" --disk-containers "file://C:\hhh\containers2.json"
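For reference, the containers2.json passed via --disk-containers generally looks something like the following; treat it as a sketch rather than the exact file (the entries mirror the snapshot details reported below):
[
  {
    "Description": "First disk",
    "Format": "vmdk",
    "UserBucket": { "S3Bucket": "hhh-em", "S3Key": "hhh-VM-VMDK-disk1.vmdk" }
  },
  {
    "Description": "Second disk",
    "Format": "vmdk",
    "UserBucket": { "S3Bucket": "hhh-em", "S3Key": "hhh-VM-VMDK-disk2.vmdk" }
  },
  {
    "Description": "Third disk",
    "Format": "vmdk",
    "UserBucket": { "S3Bucket": "hhh-em", "S3Key": "hhh-VM-VMDK-disk3.vmdk" }
  }
]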
{
"ImportImageTasks": [
{
"Status": "deleted",
"SnapshotDetails": [
{
"UserBucket": {
"S3Bucket": "hhh-em",
"S3Key": "hhh-VM-VMDK-disk1.vmdk"
},
"DiskImageSize": 1020535808.0,
"Description": "First disk",
"Format": "VMDK"
},
{
"UserBucket": {
"S3Bucket": "hhh-em",
"S3Key": "hhh-VM-VMDK-disk2.vmdk"
},
"DiskImageSize": 132096.0,
"Description": "Second disk",
"Format": "VMDK"
},
{
"UserBucket": {
"S3Bucket": "hhh-em",
"S3Key": "hhh-VM-VMDK-disk3.vmdk"
},
"DiskImageSize": 132096.0,
"Description": "Third disk",
"Format": "VMDK"
}
],
"Description": "hhhVM",
"StatusMessage": "ClientError: Saved entry is empty",
"ImportTaskId": "import-ami-fg7xdq2r"
}
]
}
This is an error that AWS is passing back from your newly created instance right after it boots.
Most likely there's a GRUB configuration problem. In my case, GRUB was set to remember the last selected entry as the default. Change default in grub/menu.lst to 0 and try again, but at this point you need to look more closely at what is inside your image.
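A rough sketch of that change, assuming a legacy GRUB menu.lst on the source VM before re-exporting:
# Show the current default/fallback entries ("default saved" means "use the last selected entry")
grep -E '^(default|fallback)' /boot/grub/menu.lst
# Point the default at the first entry, then rebuild the OVA/VMDK and re-run the import
sudo sed -i 's/^default.*/default 0/' /boot/grub/menu.lst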
I have been using the UNLOAD statement in Redshift for a while now; it makes it easy to dump a file to S3 and then let people analyse it.
The time has come to try to automate it. We have Amazon Data Pipeline running for several tasks and I wanted to run SQLActivity to execute UNLOAD automatically. I use a SQL script hosted in S3.
The query itself is correct, but what I have been trying to figure out is how I can dynamically assign the name of the file. For example:
UNLOAD('<the_query>')
TO 's3://my-bucket/' || to_char(current_date)
WITH CREDENTIALS '<credentials>'
ALLOWOVERWRITE
PARALLEL OFF
doesn't work, and of course I suspect that you can't execute functions (to_char) in the TO line. Is there any other way I can do it?
And if UNLOAD is not the way, do I have any other options for automating such tasks with the currently available infrastructure (Redshift + S3 + Data Pipeline; our Amazon EMR is not active yet)?
The only thing I thought might work (but I'm not sure) is, instead of using a script file, to copy the script into the Script option in SQLActivity (at the moment it points to a file) and reference {#ScheduleStartTime}.
Why not use RedshiftCopyActivity to copy from Redshift to S3? The input is a RedshiftDataNode and the output is an S3DataNode, where you can specify an expression for directoryPath.
You can also specify the transformSql property in RedshiftCopyActivity to override the default query of select * from <inputRedshiftTable>.
Sample pipeline:
{
"objects": [{
"id": "CSVId1",
"name": "DefaultCSV1",
"type": "CSV"
}, {
"id": "RedshiftDatabaseId1",
"databaseName": "dbname",
"username": "user",
"name": "DefaultRedshiftDatabase1",
"*password": "password",
"type": "RedshiftDatabase",
"clusterId": "redshiftclusterId"
}, {
"id": "Default",
"scheduleType": "timeseries",
"failureAndRerunMode": "CASCADE",
"name": "Default",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
}, {
"id": "RedshiftDataNodeId1",
"schedule": {
"ref": "ScheduleId1"
},
"tableName": "orders",
"name": "DefaultRedshiftDataNode1",
"type": "RedshiftDataNode",
"database": {
"ref": "RedshiftDatabaseId1"
}
}, {
"id": "Ec2ResourceId1",
"schedule": {
"ref": "ScheduleId1"
},
"securityGroups": "MySecurityGroup",
"name": "DefaultEc2Resource1",
"role": "DataPipelineDefaultRole",
"logUri": "s3://myLogs",
"resourceRole": "DataPipelineDefaultResourceRole",
"type": "Ec2Resource"
}, {
"myComment": "This object is used to control the task schedule.",
"id": "DefaultSchedule1",
"name": "RunOnce",
"occurrences": "1",
"period": "1 Day",
"type": "Schedule",
"startAt": "FIRST_ACTIVATION_DATE_TIME"
}, {
"id": "S3DataNodeId1",
"schedule": {
"ref": "ScheduleId1"
},
"directoryPath": "s3://my-bucket/#{format(#scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
"name": "DefaultS3DataNode1",
"dataFormat": {
"ref": "CSVId1"
},
"type": "S3DataNode"
}, {
"id": "RedshiftCopyActivityId1",
"output": {
"ref": "S3DataNodeId1"
},
"input": {
"ref": "RedshiftDataNodeId1"
},
"schedule": {
"ref": "ScheduleId1"
},
"name": "DefaultRedshiftCopyActivity1",
"runsOn": {
"ref": "Ec2ResourceId1"
},
"type": "RedshiftCopyActivity"
}]
}
Are you able to SSH into the cluster? If so, I would suggest writing a shell script where you can create variables and whatnot, then pass those variables into the connection's statement/query.
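A rough sketch of that approach, assuming psql connectivity to the cluster; the host, table, bucket and IAM role below are placeholders, and the password would come from PGPASSWORD or ~/.pgpass:
#!/bin/bash
# Build today's export path, then run the UNLOAD through psql.
EXPORT_DATE=$(date +%Y-%m-%d)
S3_PATH="s3://my-bucket/${EXPORT_DATE}/"
psql "host=my-cluster.example.us-east-1.redshift.amazonaws.com port=5439 dbname=mydb user=myuser" <<SQL
UNLOAD ('select * from my_schema.my_table')
TO '${S3_PATH}'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
ALLOWOVERWRITE
PARALLEL OFF;
SQL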
You can use a Redshift procedural wrapper around the UNLOAD statement and derive the S3 path name dynamically.
In your job, call the procedure that dynamically builds the UNLOAD statement and executes it.
This way you can avoid the other services, but it depends on what kind of use case you are working on.
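A minimal sketch of that idea, assuming Redshift stored procedures are available and that UNLOAD can be issued as dynamic SQL inside one; the table, bucket and IAM role names are placeholders:
CREATE OR REPLACE PROCEDURE export_daily(p_prefix varchar)
AS $$
DECLARE
  v_sql varchar(65535);
BEGIN
  -- Build the UNLOAD with today's date in the S3 path, then run it dynamically.
  v_sql := 'UNLOAD (''select * from my_schema.my_table'') '
        || 'TO ''' || p_prefix || to_char(current_date, 'YYYY-MM-DD') || '/'' '
        || 'IAM_ROLE ''arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'' '
        || 'ALLOWOVERWRITE PARALLEL OFF';
  EXECUTE v_sql;
END;
$$ LANGUAGE plpgsql;

-- The scheduled job then only needs to run:
CALL export_daily('s3://my-bucket/');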