The Dataproc page describing Druid support has no section on how to load data into the cluster. I've been trying to do this using Google Cloud Storage, but I don't know how to set up a spec for it that works. I'd expect the "firehose" section to have some Google-specific references to a bucket, but there are no examples of how to do this.
What is the method for loading data into Druid running on GCP Dataproc straight out of the box?
I haven't used the Dataproc version of Druid, but I have a small cluster running on a Google Compute Engine VM. The way I ingest data into it from GCS is with the Google Cloud Storage Druid extension - https://druid.apache.org/docs/latest/development/extensions-core/google.html
To enable the extension, add it to the extensions list in your Druid common.runtime.properties file:
druid.extensions.loadList=["druid-google-extensions", "postgresql-metadata-storage"]
To ingest data from GCS, I send an HTTP POST request to http://druid-overlord-host:8081/druid/indexer/v1/task
The POST request body contains the JSON ingestion spec (see the ["ioConfig"]["firehose"] section):
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "daily_xport_test",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "MONTH",
        "queryGranularity": "NONE",
        "rollup": false
      },
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "dateday",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              {
                "type": "string",
                "name": "id",
                "createBitmapIndex": true
              },
              {
                "type": "long",
                "name": "clicks_count_total"
              },
              {
                "type": "long",
                "name": "ctr"
              },
              "deleted",
              "device_type",
              "target_url"
            ]
          }
        }
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": {
        "type": "static-google-blobstore",
        "blobs": [
          {
            "bucket": "data-test",
            "path": "/sample_data/daily_export_18092019/000000000000.json.gz"
          }
        ],
        "filter": "*.json.gz$"
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumSubTasks": 1,
      "maxRowsInMemory": 1000000,
      "pushTimeout": 0,
      "maxRetry": 3,
      "taskStatusCheckPeriodMs": 1000,
      "chatHandlerTimeout": "PT10S",
      "chatHandlerNumRetries": 5
    }
  }
}
Example cURL command to start the ingestion task in Druid (spec.json contains the JSON from the previous section):
curl -X 'POST' -H 'Content-Type:application/json' -d @spec.json http://druid-overlord-host:8081/druid/indexer/v1/task
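The response to the POST contains the new task's ID, which you can use to poll the same Overlord API for the task status. For example (the task ID below is just a placeholder for whatever the submit call returned):
curl http://druid-overlord-host:8081/druid/indexer/v1/task/<task_id>/status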
Related
I'm studying the AWS API to retrieve the requisite information about my EC2 instances.
So I'm looking at the AWS Cost Explorer service.
It has a function, 'GetCostAndUsage', that accepts a request like the one below (this is an example from the official AWS documentation):
{
  "TimePeriod": {
    "Start": "2017-09-01",
    "End": "2017-10-01"
  },
  "Granularity": "MONTHLY",
  "Filter": {
    "Dimensions": {
      "Key": "SERVICE",
      "Values": [
        "Amazon Simple Storage Service"
      ]
    }
  },
  "GroupBy": [
    {
      "Type": "DIMENSION",
      "Key": "SERVICE"
    },
    {
      "Type": "TAG",
      "Key": "Environment"
    }
  ],
  "Metrics": ["BlendedCost", "UnblendedCost", "UsageQuantity"]
}
and returns a response like the one below (also from the official AWS documentation):
{
  "GroupDefinitions": [
    {
      "Key": "SERVICE",
      "Type": "DIMENSION"
    },
    {
      "Key": "Environment",
      "Type": "TAG"
    }
  ],
  "ResultsByTime": [
    {
      "Estimated": false,
      "Groups": [
        {
          "Keys": [
            "Amazon Simple Storage Service",
            "Environment$Prod"
          ],
          "Metrics": {
            "BlendedCost": {
              "Amount": "39.1603300457",
              "Unit": "USD"
            },
            "UnblendedCost": {
              "Amount": "39.1603300457",
              "Unit": "USD"
            },
            "UsageQuantity": {
              "Amount": "173842.5440074444",
              "Unit": "N/A"
            }
          }
        },
        {
          "Keys": [
            "Amazon Simple Storage Service",
            "Environment$Test"
          ],
          "Metrics": {
            "BlendedCost": {
              "Amount": "0.1337464807",
              "Unit": "USD"
            },
            "UnblendedCost": {
              "Amount": "0.1337464807",
              "Unit": "USD"
            },
            "UsageQuantity": {
              "Amount": "15992.0786663399",
              "Unit": "N/A"
            }
          }
        }
      ],
      "TimePeriod": {
        "End": "2017-10-01",
        "Start": "2017-09-01"
      },
      "Total": {}
    }
  ]
}
The data returned under the 'Metrics' key is, I guess, the total cost, not a per-instance breakdown.
So, how can I get the usage and cost of each EC2 instance?
This was way harder than I had imagined, so I'm sharing in case someone else needs it.
aws ce get-cost-and-usage \
--filter file://filters.json \
--time-period Start=2021-08-01,End=2021-08-14 \
--granularity DAILY \
--metrics "BlendedCost" \
--group-by Type=TAG,Key=Name
Contents of filters.json:
{
  "Dimensions": {
    "Key": "SERVICE",
    "Values": [
      "Amazon Elastic Compute Cloud - Compute"
    ]
  }
}
--- Available Metrics ---
AmortizedCost
BlendedCost
NetAmortizedCost
NetUnblendedCost
NormalizedUsageAmount
UnblendedCost
UsageQuantity
Descriptions for most of the metrics (except usage) are here: https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/ce-advanced.html
I know this question is old, but you will need to use the GetCostAndUsageWithResources call, as opposed to GetCostAndUsage.
https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ce/get-cost-and-usage-with-resources.html
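A rough sketch of the equivalent CLI call, reusing the filters.json from the earlier answer (note that resource-level data is only retained for roughly the last 14 days and has to be enabled in the Cost Explorer settings; the RESOURCE_ID grouping shown here is my assumption of what you want):
aws ce get-cost-and-usage-with-resources \
    --time-period Start=2021-08-01,End=2021-08-14 \
    --granularity DAILY \
    --metrics "UnblendedCost" "UsageQuantity" \
    --filter file://filters.json \
    --group-by Type=DIMENSION,Key=RESOURCE_ID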
It's going to be difficult to associate an exact cost with each instance. A simple example: you have two instances of the same size, one reserved and one on-demand; you run both for half the month and then turn one of them off for the second half of the month.
You will pay for a reserved instance for the entire month and for an on-demand instance for half the month - but which instance was reserved and which was on-demand? You can't tell; a reserved instance is just a billing concept and is not associated with a particular instance.
You might be able to approximate what you are looking for, but there are limitations.
You can use tags to track the cost of resources. In the case of EC2 you can assign tags like Project: myproject or Application: myapp, then filter expenses by tag in Cost Explorer and use the tag you applied to track the spend. If the instance was covered by a reservation plan at some point, the tag will only show you the cost for the periods in which your expenses were not covered.
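For example, tagging an instance from the CLI looks roughly like this (the instance ID and tag values are placeholders); keep in mind the tag also has to be activated as a cost allocation tag in the Billing console before Cost Explorer can filter on it:
aws ec2 create-tags \
    --resources i-0123456789abcdef0 \
    --tags Key=Project,Value=myproject Key=Application,Value=myapp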
I'm trying to use Data Pipeline to run a Spark application. How can I access the input/output I specify (S3DataNode) for the EmrActivity inside my Spark application?
My question is similar to this - https://forums.aws.amazon.com/message.jspa?messageID=507877
Earlier I used to pass the input and output as arguments to the Spark application in the steps.
Thanks
I ran across the same question. There's very limited documentation around this. This is my understanding:
You specify the input and output for the EmrActivity. This will create the dependencies between the data nodes and the activity.
In the EmrActivity, you can reference the input and output like this: #{input.directoryPath}, #{output.directoryPath}
Example:
...
{
  "name": "Input Data Node",
  "id": "inputDataNode",
  "type": "S3DataNode",
  "directoryPath": "s3://my/raw/data/path"
},
{
  "name": "transform",
  "id": "transform",
  "type": "EmrActivity",
  "step": [
    "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://my/transform/script.sh,#{input.directoryPath},#{output.directoryPath}"
  ],
  "runsOn": {
    "ref": "emrcluster"
  },
  "input": {
    "ref": "inputDataNode"
  },
  "output": {
    "ref": "outputDataNode"
  }
},
{
  "name": "Output Data Node",
  "id": "outputDataNode",
  "type": "S3DataNode",
  "directoryPath": "s3://path/to/output/"
},
...
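By the time the step runs, Data Pipeline has already substituted those expressions, so the transform script simply receives them as ordinary positional arguments. A minimal sketch of what s3://my/transform/script.sh could do with them (the jar, class name and spark-submit invocation are placeholders, not anything from the original pipeline):
#!/bin/bash
# $1 and $2 are the resolved values of #{input.directoryPath} and #{output.directoryPath}
INPUT_PATH="$1"
OUTPUT_PATH="$2"

spark-submit \
    --class com.example.Transform \
    s3://my/transform/app.jar \
    "$INPUT_PATH" "$OUTPUT_PATH"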
When using ARM templates to deploy various Azure components, you can use some functions. One of them, listKeys, can be used to return the keys created during the deployment through the outputs, for example when deploying a storage account.
Is there a way to get the keys when deploying a Power BI workspace collection?
According to the link you mentioned, to use the listKeys function we need to know the resource name and apiVersion.
From the Azure Power BI Workspace Collection "get access keys" API, we can get the resource name
Microsoft.PowerBI/workspaceCollections/{workspaceCollectionName} and the API version "2016-01-29".
So please try the following code; it works correctly for me.
"outputs": {
"exampleOutput": {
"value": "[listKeys(resourceId('Microsoft.PowerBI/workspaceCollections', parameters('workspaceCollections_tompowerBItest')), '2016-01-29')]",
"type": "object"
}
Check the created Power BI service in the Azure portal.
The whole ARM template I used:
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "workspaceCollections_tompowerBItest": {
      "defaultValue": "tomjustforbitest",
      "type": "string"
    }
  },
  "variables": {},
  "resources": [
    {
      "type": "Microsoft.PowerBI/workspaceCollections",
      "sku": {
        "name": "S1",
        "tier": "Standard"
      },
      "tags": {},
      "name": "[parameters('workspaceCollections_tompowerBItest')]",
      "apiVersion": "2016-01-29",
      "location": "South Central US"
    }
  ],
  "outputs": {
    "exampleOutput": {
      "value": "[listKeys(resourceId('Microsoft.PowerBI/workspaceCollections', parameters('workspaceCollections_tompowerBItest')), '2016-01-29')]",
      "type": "object"
    }
  }
}
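If it helps, one way to deploy the template and read the keys back from the outputs with the Azure CLI (the resource group and file names are placeholders, and this uses the older az group deployment syntax matching this API version's era):
az group deployment create \
    --resource-group my-resource-group \
    --template-file template.json \
    --query properties.outputs.exampleOutput.value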
I have been using the UNLOAD statement in Redshift for a while now; it makes it easier to dump the file to S3 and then allow people to analyse it.
The time has come to try to automate it. We have Amazon Data Pipeline running for several tasks and I wanted to run SQLActivity to execute UNLOAD automatically. I use a SQL script hosted in S3.
The query itself is correct, but what I have been trying to figure out is how I can dynamically assign the name of the file. For example:
UNLOAD('<the_query>')
TO 's3://my-bucket/' || to_char(current_date)
WITH CREDENTIALS '<credentials>'
ALLOWOVERWRITE
PARALLEL OFF
doesn't work, and of course I suspect that you can't execute functions (to_char) in the "TO" line. Is there any other way I can do it?
And if UNLOAD is not the way, do I have any other options for automating such tasks with the currently available infrastructure (Redshift + S3 + Data Pipeline; our Amazon EMR is not active yet)?
The only thing that I thought could work (but I'm not sure) is, instead of pointing to a script file, to copy the script into the Script option in SQLActivity (at the moment it points to a file) and reference #{ScheduleStartTime}.
Why not use RedshiftCopyActivity to copy from Redshift to S3? The input is a RedshiftDataNode and the output is an S3DataNode, where you can specify an expression for directoryPath.
You can also specify the transformSql property in RedshiftCopyActivity to override the default value of: select * from + inputRedshiftTable.
Sample pipeline:
{
  "objects": [{
    "id": "CSVId1",
    "name": "DefaultCSV1",
    "type": "CSV"
  }, {
    "id": "RedshiftDatabaseId1",
    "databaseName": "dbname",
    "username": "user",
    "name": "DefaultRedshiftDatabase1",
    "*password": "password",
    "type": "RedshiftDatabase",
    "clusterId": "redshiftclusterId"
  }, {
    "id": "Default",
    "scheduleType": "timeseries",
    "failureAndRerunMode": "CASCADE",
    "name": "Default",
    "role": "DataPipelineDefaultRole",
    "resourceRole": "DataPipelineDefaultResourceRole"
  }, {
    "id": "RedshiftDataNodeId1",
    "schedule": {
      "ref": "ScheduleId1"
    },
    "tableName": "orders",
    "name": "DefaultRedshiftDataNode1",
    "type": "RedshiftDataNode",
    "database": {
      "ref": "RedshiftDatabaseId1"
    }
  }, {
    "id": "Ec2ResourceId1",
    "schedule": {
      "ref": "ScheduleId1"
    },
    "securityGroups": "MySecurityGroup",
    "name": "DefaultEc2Resource1",
    "role": "DataPipelineDefaultRole",
    "logUri": "s3://myLogs",
    "resourceRole": "DataPipelineDefaultResourceRole",
    "type": "Ec2Resource"
  }, {
    "myComment": "This object is used to control the task schedule.",
    "id": "ScheduleId1",
    "name": "RunOnce",
    "occurrences": "1",
    "period": "1 Day",
    "type": "Schedule",
    "startAt": "FIRST_ACTIVATION_DATE_TIME"
  }, {
    "id": "S3DataNodeId1",
    "schedule": {
      "ref": "ScheduleId1"
    },
    "directoryPath": "s3://my-bucket/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
    "name": "DefaultS3DataNode1",
    "dataFormat": {
      "ref": "CSVId1"
    },
    "type": "S3DataNode"
  }, {
    "id": "RedshiftCopyActivityId1",
    "output": {
      "ref": "S3DataNodeId1"
    },
    "input": {
      "ref": "RedshiftDataNodeId1"
    },
    "schedule": {
      "ref": "ScheduleId1"
    },
    "name": "DefaultRedshiftCopyActivity1",
    "runsOn": {
      "ref": "Ec2ResourceId1"
    },
    "type": "RedshiftCopyActivity"
  }]
}
Are you able to SSH into the cluster? If so, I would suggest writing a shell script where you can create variables and whatnot, then pass those variables into the connection's statement/query.
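A rough sketch of that idea, assuming psql connectivity to the cluster (the host, database, user, table and bucket below are all placeholders, and the credentials clause is left as in the question):
#!/bin/bash
# build the S3 prefix dynamically, then run UNLOAD through psql
# assumes PGPASSWORD is set or a .pgpass entry exists for the cluster
TODAY=$(date +%Y-%m-%d)
psql -h my-cluster.example.us-east-1.redshift.amazonaws.com -p 5439 -U myuser -d mydb <<EOF
UNLOAD ('select * from my_table')
TO 's3://my-bucket/export_${TODAY}_'
WITH CREDENTIALS '<credentials>'
ALLOWOVERWRITE
PARALLEL OFF;
EOF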
Alternatively, use a Redshift procedural wrapper around the UNLOAD statement and derive the S3 path name dynamically.
In your job, call the procedure that dynamically builds the UNLOAD statement and executes it.
This way you can avoid the other services, but it depends on what kind of use case you are working on.
In the past I have successfully loaded data into US-hosted BigQuery datasets from CSV data in US-hosted GCS buckets. We since decided to move our BigQuery data to the EU and I created a new dataset with this region selected on it. I have successfully populated those of our tables small enough to be uploaded from my machine at home. But two tables are far too large for this so I would like to load them from files in GCS. I have tried doing this from both a US-hosted GCS bucket and an EU-hosted GCS bucket (thinking that bq load might not like to cross regions) but the load fails every time. Below is the error detail I'm getting from the bq command line (500, Internal Error). Does anyone know a reason why this might be happening?
{
  "configuration": {
    "load": {
      "destinationTable": {
        "datasetId": "######",
        "projectId": "######",
        "tableId": "test"
      },
      "schema": {
        "fields": [
          {
            "name": "test_col",
            "type": "INTEGER"
          }
        ]
      },
      "sourceFormat": "CSV",
      "sourceUris": [
        "gs://######/test.csv"
      ]
    }
  },
  "etag": "######",
  "id": "######",
  "jobReference": {
    "jobId": "######",
    "projectId": "######"
  },
  "kind": "bigquery#job",
  "selfLink": "https://www.googleapis.com/bigquery/v2/projects/######",
  "statistics": {
    "creationTime": "1445336673213",
    "endTime": "1445336674738",
    "startTime": "1445336674738"
  },
  "status": {
    "errorResult": {
      "message": "An internal error occurred and the request could not be completed.",
      "reason": "internalError"
    },
    "errors": [
      {
        "message": "An internal error occurred and the request could not be completed.",
        "reason": "internalError"
      }
    ],
    "state": "DONE"
  },
  "user_email": "######"
}
After searching through other related questions on Stack Overflow, I eventually realised that I had set my GCS bucket's location to the europe-west1 region and not the multi-region EU location. Things are now working as expected.
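For anyone hitting the same thing, the fix amounts to keeping the bucket and the dataset in the same location. A rough sketch with placeholder bucket/dataset names:
# create the bucket in the EU multi-region (not a single region like europe-west1) to match the EU dataset
gsutil mb -l EU gs://my-eu-bucket
gsutil cp test.csv gs://my-eu-bucket/test.csv
bq load --source_format=CSV my_eu_dataset.test gs://my-eu-bucket/test.csv test_col:INTEGER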