I'm working with a pipeline that pushes JSON entries in batches to my Google Cloud Storage bucket. I want to get this data into Kafka.
The way I'm going about it now is a lambda function, triggered every minute, that finds the files that have changed, opens streams from them, reads them line by line, and every so often batches those lines as messages into a Kafka producer.
This process is pretty terrible, but it works.... eventually.
I was hoping there'd be a way to do this with Kafka Connect or Flink, but there really isn't much development around detecting incremental file additions to a bucket.
Do the JSON entries end up in different files in your bucket? Flink has support for streaming in new files from a source.
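For what it's worth, here is a rough sketch of how that could look with Flink's continuously monitoring file source feeding a Kafka sink. The bucket path, topic, and broker address are placeholders, reading gs:// paths assumes the flink-gs-fs-hadoop filesystem plugin is installed, and the class names follow recent Flink releases (1.15+), so treat it as a starting point rather than a drop-in job:

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class GcsJsonToKafka {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Continuously watch the bucket (or a prefix) for new files and read them line by line.
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("gs://my-bucket/json-batches/"))
                .monitorContinuously(Duration.ofSeconds(30)) // discovery interval
                .build();

        DataStream<String> lines = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "gcs-json-files");

        // Each JSON line becomes one Kafka record.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("json-entries")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        lines.sinkTo(sink);
        env.execute("gcs-to-kafka");
    }
}

The source keeps scanning the path at the discovery interval and streams new files as they appear, which is essentially the behavior the minute-by-minute lambda is approximating.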
I am new to AWS Glue Studio. I am trying to create a job involving multiple joins and custom code, reading data from the Glue catalog and writing the data into an S3 bucket. It was working fine until recently; the only change was that I added more withColumn operations in the custom transform block. Now when I try to save the job I get the following error:
Failed to update job
[gluestudio-service.us-east-2.amazonaws.com] updateDag: InternalFailure: null
I tried cloning the job and making the changes there. I also tried creating a new job from scratch.
We're trying to use AWS Glue for ETL operations in our Node.js project. The workflow will be as follows:
user uploads a CSV file
data transformation from XYZ format to ABC format (mapping and changing field names)
download the transformed CSV file to the local system
Note that this flow should happen programmatically (creating crawlers and job triggers should be done through code, not the console). I don't know why the documentation and other articles always show how to create crawlers and jobs from the Glue console.
I believe we have to create Lambda functions and triggers, but I'm not quite sure how to achieve this end-to-end flow. Can anyone please help me? Thanks.
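Not a full answer, but to make the programmatic path concrete: everything the console does (crawlers, jobs, triggers, job runs) is also exposed through the Glue API, so it can be driven entirely from the AWS SDK. The sketch below uses the AWS SDK for Java v2 purely for illustration (the equivalent createCrawler/createJob/startJobRun calls exist in the AWS SDK for JavaScript for a Node.js project); the crawler and job names, role ARN, script location, and bucket paths are all placeholders:

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.CrawlerTargets;
import software.amazon.awssdk.services.glue.model.CreateCrawlerRequest;
import software.amazon.awssdk.services.glue.model.CreateJobRequest;
import software.amazon.awssdk.services.glue.model.JobCommand;
import software.amazon.awssdk.services.glue.model.S3Target;
import software.amazon.awssdk.services.glue.model.StartCrawlerRequest;
import software.amazon.awssdk.services.glue.model.StartJobRunRequest;

public class GlueProgrammaticSetup {
    public static void main(String[] args) {
        try (GlueClient glue = GlueClient.create()) {

            // 1. Crawler over the bucket the user uploads CSVs to.
            glue.createCrawler(CreateCrawlerRequest.builder()
                    .name("uploads-crawler")
                    .role("arn:aws:iam::123456789012:role/GlueServiceRole") // placeholder role
                    .databaseName("uploads_db")
                    .targets(CrawlerTargets.builder()
                            .s3Targets(S3Target.builder().path("s3://my-upload-bucket/incoming/").build())
                            .build())
                    .build());
            glue.startCrawler(StartCrawlerRequest.builder().name("uploads-crawler").build());

            // 2. ETL job whose script does the XYZ -> ABC field mapping and writes back to S3.
            glue.createJob(CreateJobRequest.builder()
                    .name("xyz-to-abc")
                    .role("arn:aws:iam::123456789012:role/GlueServiceRole")
                    .command(JobCommand.builder()
                            .name("glueetl")
                            .scriptLocation("s3://my-scripts-bucket/xyz_to_abc.py")
                            .build())
                    .build());

            // 3. Kick off a run (this is what an upload-triggered Lambda would call).
            glue.startJobRun(StartJobRunRequest.builder().jobName("xyz-to-abc").build());
        }
    }
}

An upload-triggered Lambda would typically run step 3 for each new file, and once the job run finishes (which can be polled with getJobRun), the transformed CSV is just an S3 object the user can download.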
The Cloud Spanner docs say that Spanner can export/import Avro format. Can this path also be used for batch ingestion of Avro data generated from another source? The docs seem to suggest it can only import Avro data that was also generated by Spanner.
I ran a quick export job and took a look at the generated files. The manifest and schema look pretty straightforward. I figured I would post here in case this rabbit hole is deep.
manifest file
{
  "files": [{
    "name": "people.avro-00000-of-00001",
    "md5": "HsMZeZFnKd06MVkmiG42Ag=="
  }]
}
schema file
{
  "tables": [{
    "name": "people",
    "manifestFile": "people-manifest.json"
  }]
}
data file
{"type":"record",
"name":"people",
"namespace":
"spannerexport","
fields":[
{"name":"fullName",
"type":["null","string"],
"sqlType":"STRING(MAX)"},{"name":"memberId",
"type":"long",
"sqlType":"INT64"}
],
"googleStorage":"CloudSpanner",
"spannerPrimaryKey":"`memberId` ASC",
"spannerParent":"",
"spannerPrimaryKey_0":"`memberId` ASC",
"googleFormatVersion":"1.0.0"}
In response to your question, yes! There are two ways to ingest Avro data into Cloud Spanner.
Method 1
If you place Avro files in a Google Cloud Storage bucket arranged the way a Cloud Spanner export operation would arrange them, and you generate a manifest formatted as Cloud Spanner expects, then the import functionality in the Cloud Spanner web interface will work. Obviously, there may be a lot of tedious formatting work here, which is why the official documentation states that the "import process supports only Avro files exported from Cloud Spanner".
Method 2
Instead of executing the import/export job using the Cloud Spanner web console and relying on the Avro manifest and data files being perfectly formatted, you can slightly modify the code in either of two public GitHub repositories under the GoogleCloudPlatform organization that provide import/export (or backup/restore, or export/ingest) functionality for moving data from Avro format into Google Cloud Spanner: (1) Dataflow Templates, especially this file; (2) Pontem, especially this file.
Both of these repositories contain Dataflow jobs that let you move data into and out of Cloud Spanner using the Avro format. Each has a specific means of parsing an Avro schema for input (i.e., moving data from Avro into Cloud Spanner). Since your use case is input (i.e., ingesting Avro-formatted data into Cloud Spanner), you need to modify the Avro parsing code to fit your specific schema and then execute the Cloud Dataflow job from the command line locally on your machine (the job is then uploaded to Google Cloud Platform).
If you are not familiar with Cloud Dataflow, it is a managed service for defining and running data-processing jobs over large data sets.
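To make the shape of such a job concrete, here is a heavily simplified sketch of the core of what those pipelines do, written against Apache Beam's AvroIO and SpannerIO directly and using the people schema shown in the question. The instance, database, and file pattern are placeholders, package locations vary a bit across Beam versions, and this illustrates the approach rather than the actual template code:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

import com.google.cloud.spanner.Mutation;

public class AvroToSpanner {

    // Map one Avro record of the `people` schema shown above to a Spanner mutation.
    private static Mutation toMutation(GenericRecord record) {
        Object fullName = record.get("fullName"); // nullable string in the schema
        return Mutation.newInsertOrUpdateBuilder("people")
                .set("memberId").to((Long) record.get("memberId"))
                .set("fullName").to(fullName == null ? null : fullName.toString())
                .build();
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadAvro", AvroIO.parseGenericRecords(AvroToSpanner::toMutation)
                        .withCoder(SerializableCoder.of(Mutation.class))
                        .from("gs://my-bucket/people.avro-*"))
         .apply("WriteToSpanner", SpannerIO.write()
                        .withInstanceId("my-instance")
                        .withDatabaseId("my-database"));

        p.run();
    }
}

The real templates add schema-driven mutation building, validation, and manifest handling on top of this, which is where the modification work mentioned above comes in.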
As the documentation specifically states that import only supports Avro files originally exported from Spanner [1], I've raised a feature request for this, which you can track here.
[1] https://cloud.google.com/spanner/docs/import
Referring to item: Watching for new files matching a filepattern in Apache Beam
Can you use this for simple use cases? My use case is: a user uploads data to Cloud Storage -> pipeline (process CSV to JSON) -> BigQuery. I know Cloud Storage is a bounded collection, so it represents batch Dataflow.
What I would like to do is keep the pipeline running in streaming mode so that, as soon as a file is uploaded to Cloud Storage, it is processed through the pipeline. Is this possible with watchForNewFiles?
I wrote my code as follows:
p.apply(TextIO.read().from("<bucketname>")
    .watchForNewFiles(
        // Check for new files every 30 seconds
        Duration.standardSeconds(30),
        // Never stop checking for new files
        Watch.Growth.<String>never()));
None of the contents are being forwarded to BigQuery, even though the pipeline shows that it is streaming.
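For context, and not as a diagnosis of why nothing shows up in BigQuery here, this is the overall shape such a pipeline takes when the watched read is wired all the way through to a BigQuery write. The filepattern, table name, and two-column CSV layout below are assumptions made up for the sketch:

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Duration;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;

public class CsvToBigQuery {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Hypothetical two-column target table; adjust to the real CSV layout.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
                new TableFieldSchema().setName("col_a").setType("STRING"),
                new TableFieldSchema().setName("col_b").setType("STRING")));

        p.apply("WatchCsv", TextIO.read()
                        .from("gs://my-bucket/uploads/*.csv") // a filepattern, not just the bucket name
                        .watchForNewFiles(Duration.standardSeconds(30), Watch.Growth.<String>never()))
         .apply("CsvLineToRow", MapElements.into(TypeDescriptor.of(TableRow.class))
                        .via((String line) -> {
                            String[] parts = line.split(",", -1);
                            return new TableRow().set("col_a", parts[0]).set("col_b", parts[1]);
                        }))
         .setCoder(TableRowJsonCoder.of())
         .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                        .to("my-project:my_dataset.my_table")
                        .withSchema(schema)
                        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

        p.run();
    }
}

With watchForNewFiles the collection becomes unbounded, which is why the pipeline reports that it is streaming; the BigQuery write then has to happen as streaming inserts rather than a single batch load.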
You may use Google Cloud Storage triggers here:
https://cloud.google.com/functions/docs/calling/storage#functions-calling-storage-python
These triggers use Cloud Functions, similar to Cloud Pub/Sub, and fire on objects when they are created, deleted, archived, or have their metadata changed.
These events are sent using Pub/Sub notifications from Cloud Storage, but pay attention not to set too many functions on the same bucket, as there are notification limits.
Also, at the end of the document there is a link to a sample implementation.
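For illustration, a minimal sketch of such a storage-triggered function, written here in Java (the linked sample is Python). It assumes the Functions Framework's BackgroundFunction interface with a hand-rolled event class, and it only logs the uploaded object; in the flow above it would instead hand the file off to the processing pipeline:

import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import java.util.logging.Logger;

// Triggered by google.storage.object.finalize events on the configured bucket.
public class OnFileUploaded implements BackgroundFunction<OnFileUploaded.GcsEvent> {
    private static final Logger logger = Logger.getLogger(OnFileUploaded.class.getName());

    // Minimal view of the Cloud Storage event payload; fields are filled in from the event JSON.
    public static class GcsEvent {
        public String bucket;
        public String name;
        public String metageneration;
    }

    @Override
    public void accept(GcsEvent event, Context context) {
        logger.info("Event type: " + context.eventType());
        logger.info("New object: gs://" + event.bucket + "/" + event.name);
        // Here you would publish to Pub/Sub, start a Dataflow job, etc.
    }
}

The function would be deployed with a trigger on the google.storage.object.finalize event for the upload bucket, so it runs once per newly created object.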