How to programmatically save cache in GitLab CI while running the script inside a job, instead of saving at the end of the job?

Hi, I am trying to save data to the GitLab CI cache programmatically, after verifying that the data meets some criteria. Is this possible, or do I have to wait until the end of the GitLab CI job? The issue is that the job might fail, and if we save the cache blindly at the end of the job, there is no guarantee that the required data gets saved. I'd appreciate any help.

Related

Dataflow writes to GCS bucket, but timestamp in filename is unchanged

I have a question about Apache Beam, specifically about Dataflow.
I have a pipeline that reads from a Cloud SQL database and writes to GCS. The filename has a timestamp in it, so I expect each run to generate a file with a different timestamp.
I tested it on my local machine: Beam reads from a Postgres database and writes to a local file (instead of GCS). It works fine, and the generated files have different timestamps, like
jdbc_output.csv-00000-of-00001_2020-08-19_00:11:17.csv
jdbc_output.csv-00000-of-00001_2020-08-19_00:25:07.csv
However, when I deploy to Dataflow and trigger it via Airflow (we use Airflow as our scheduler), the generated filename always uses the same timestamp. The timestamp is unchanged even if I run the job multiple times, and it is very close to the time when the Dataflow template was uploaded.
Here is the code that writes the output.
output.apply("Write to Bucket", TextIO.write().to("gs://my-bucket/filename").withNumShards(1)
.withSuffix("_" + String.valueOf(new Timestamp(new Date().getTime())).replace(" ","_") +".csv"));
I'd like to know why Dataflow does not use the current time in the filename and instead uses the timestamp from when the template file was uploaded.
Furthermore, how do I solve this? My plan is to run the Dataflow job each day and get a new file with a different timestamp in its name.
My intuition (I have never tested this) is that template creation starts your pipeline and takes a snapshot of it. Your pipeline is run, your date-time expression is evaluated, and the result is kept as-is in the template, so the value never changes afterwards.
The documentation also mentions that the pipeline is run before template creation, much like a compilation step:
Developers run the pipeline and create a template. The Apache Beam SDK stages files in Cloud Storage, creates a template file (similar to job request), and saves the template file in Cloud Storage.
To fix this, you can use the ValueProvider interface. I had never made the connection before now, but it is described in the templates section of the documentation.
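For illustration, here is a minimal sketch of the ValueProvider approach using the Beam Python SDK (the question uses the Java SDK, which offers the same ValueProvider concept); the option name, the help-text path, and the placeholder read are assumptions, not taken from the original pipeline. The point is that --output_prefix is only resolved when the template is launched, so a scheduler such as Airflow can pass a fresh, timestamped prefix on every run.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ExportOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Runtime parameter: read at template launch time, not at template creation time.
        parser.add_value_provider_argument(
            "--output_prefix",
            type=str,
            help="e.g. gs://my-bucket/jdbc_output_2020-08-19_00:11:17")

def run():
    options = ExportOptions()
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | "Read" >> beam.Create(["placeholder row"])  # stand-in for the Cloud SQL read
         | "Write" >> beam.io.WriteToText(options.output_prefix,  # accepts a ValueProvider
                                          file_name_suffix=".csv"))

if __name__ == "__main__":
    run()

With a classic template built from a pipeline like this, the scheduler passes output_prefix as a runtime parameter at launch, so each daily run can carry its own timestamp instead of the one baked in at template creation.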
Note: However, for a simple read from a Cloud SQL database and export to a file, the cheapest and easiest-to-maintain option is not to use Dataflow at all!

INFA workflow fails when scheduler enabled

I am trying to run an Informatica workflow that checks a database table and writes the fetched data to a flat file on the server.
source: database table
target: flat file
transformations: none
This workflow runs fine when "run on demand", but I need to run it continuously, so I set up the INFA scheduler to run it every 5 minutes. When the scheduler is enabled, the workflow fails every time. Kindly help with any ideas for running this on the scheduler.
Oops... sorted, this was my mistake. I had not checked in the Target I created for the flat file. My bad. Thanks, all.

How to process files serially in a Cloud Function?

I have written a Cloud Storage triggered Cloud Function. I have 10-15 files landing in the bucket at 5-second intervals, and the function loads their data into a BigQuery table (truncate and load).
While there are 10 files in the bucket, I want the Cloud Function to process them sequentially, i.e. one file at a time, since all the files write to the same table.
Currently the Cloud Function is triggered for multiple files at a time, and the BigQuery operation fails because multiple files are trying to access the same table.
Is there any way to configure this in Cloud Functions?
Thanks in advance!
You can achieve this by using Pub/Sub and the max-instances parameter on Cloud Functions.
First, use the notification capability of Google Cloud Storage and sink the events into a Pub/Sub topic.
Now you will receive a message every time an event occurs on the bucket. If you want to filter on file creation only (object finalize), you can apply a filter on the subscription. I wrote an article on this.
Then, create an HTTP function (an HTTP function is required if you want to apply a filter) with max instances set to 1. That way, only one instance of the function can run at a time, so there is no concurrency.
Finally, create a Pub/Sub push subscription on the topic, with or without a filter, to call your function over HTTP.
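As a rough illustration of that last step, here is a minimal sketch in Python of what the HTTP-triggered function could look like (assuming a push subscription and a deployment with max instances set to 1); the function name and the commented BigQuery step are placeholders, not taken from the original setup.

import base64
import json

def process_gcs_event(request):
    """HTTP entry point called by the Pub/Sub push subscription."""
    envelope = request.get_json(silent=True)
    if not envelope or "message" not in envelope:
        return "Bad Request: no Pub/Sub message received", 400

    # The push subscription delivers the GCS notification as base64-encoded JSON.
    payload = json.loads(base64.b64decode(envelope["message"]["data"]).decode("utf-8"))
    bucket, name = payload["bucket"], payload["name"]
    print("Processing gs://{}/{}".format(bucket, name))

    # Load gs://{bucket}/{name} into BigQuery here and wait for the load job to
    # finish before returning, so the next message is only handled once this
    # file is fully processed.

    return "OK", 200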
EDIT
Thanks to your code, I understand what happens. BigQuery is an asynchronous system: when you perform a query or a load job, a job is created and runs in the background.
In Python you can explicitly wait for the end of the job, but with pandas I did not find a way to do it.
I just found a Google Cloud page that explains how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end:
# Wait for the load job to complete.
job.result()
that waits for the end of the job.
You did this correctly in the _insert_into_bigquery_dwh function, but not in the staging _insert_into_bigquery_staging one. This can lead to two issues:
The dwh function works on old data, because the staging load has not yet finished when you trigger that job.
If the staging load takes, say, 10 seconds and runs in the background (you don't wait for it explicitly in your code) and the dwh step takes 1 second, the next file is processed as soon as the dwh function ends, even though the staging load is still running in the background. That leads to your issue.
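For illustration, a hypothetical version of the staging load with an explicit wait might look like the sketch below; the dataset and table names, the CSV source format, and the autodetect setting are assumptions, since the original code isn't shown in the thread.

from google.cloud import bigquery

def _insert_into_bigquery_staging(bucket: str, file_name: str) -> None:
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # truncate and load
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        "gs://{}/{}".format(bucket, file_name),
        "my_dataset.staging_table",
        job_config=job_config,
    )
    load_job.result()  # block until the staging load finishes, so the dwh step sees fresh data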
The architecture you describe isn't the same as the one in the documentation you linked. Note that in the flow diagram and the code samples, the storage event triggers the Cloud Function, which streams the data directly into the destination table. Since BigQuery allows multiple concurrent streaming inserts, several function instances can execute at the same time without problems. In your use case, the intermediate table loaded with write-truncate for data cleaning makes a big difference, because each execution needs the previous one to finish, which requires a sequential processing approach.
I would like to point out that Pub/Sub doesn't let you configure the rate at which messages are delivered: if 10 messages arrive at the topic, they will all be sent to the subscriber, even if they are processed one at a time. Limiting the function to one instance may add overhead for that reason and could increase latency as well. That said, since the expected workload is 15-30 files a day, this may not be a big concern.
If you'd like to have parallel executions, you may try creating a new table for each message and setting a short expiration deadline on it with the table.expires setter, so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise, the great answer from Guillaume will completely get the job done.
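For reference, a small sketch of that expiring-table idea with the BigQuery Python client library; the table id, the one-column schema, and the one-hour expiry are illustrative assumptions.

import datetime
from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table(
    "my-project.my_dataset.staging_20200819_0001",
    schema=[bigquery.SchemaField("raw_line", "STRING")],
)
# Give the table a short lifetime so stale per-message tables clean themselves up.
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
table = client.create_table(table)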

Azure Stream Analytics with Event Hub input: stream position

Setup
I use Azure Stream Analytics to stream data into an Azure data warehouse staging table.
The input source of the job is an Event Hub stream.
I notice that when I update the job, the input event backlog goes up massively after the start.
It looks like the job starts processing the complete Event Hub queue again from the beginning.
Questions
How is stream position management organised in Stream Analytics?
Is it possible to define the stream position where the job starts (for example, only events queued after a specific point in time)?
What I have done so far
I noticed a similar question here on Stack Overflow.
It mentions a variable named "eventStartTime".
But since I use an "asaproj" project within Visual Studio to create, update, and deploy the job, I don't know where to set this before deploying.
When you update the job without stopping it, it will use the previous "Job output start time" setting, so it is possible for the job to start processing the data from the beginning.
You can stop the job first, then choose the "Job output start time" before you start the job again.
You can refer to this document https://learn.microsoft.com/en-us/azure/stream-analytics/start-job for detailed information on each mode. For your scenario, the "When last stopped" mode may be the one you need; it will not process data from the beginning of the Event Hub queue.

Glue Job fails to write file

I am backfilling some data via Glue jobs. The job itself reads a TSV from S3, transforms the data slightly, and writes it to S3 as Parquet. Since I already have the data, I am trying to launch multiple jobs at once to reduce the time needed to process it all. When I launch multiple jobs at the same time, I sometimes run into an issue where one of the runs fails to output the resulting Parquet files to S3, even though the job itself completes successfully without throwing an error. When I rerun the job as a non-parallel task, the files are output correctly. Is there some issue, either with Glue (or the underlying Spark) or with S3, that would cause this?
Parallel runs of the same Glue job may produce files with the same names, so some of them can be overwritten. If I remember correctly, the transformation context is used as part of the name. I assume you don't have bookmarking enabled, so it should be safe for you to generate the transformation-context value dynamically to ensure it is unique for each job run.
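Here is a hedged sketch of that suggestion, assuming a Python (PySpark) Glue job; the bucket paths, the TSV read options, and the per-run uuid suffix are illustrative, not taken from the original job.

import sys
import uuid

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# A value that is unique per job run, used both in the transformation context
# and in the output prefix so parallel runs cannot overwrite each other.
run_id = uuid.uuid4().hex

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"separator": "\t", "withHeader": True},  # TSV input
    transformation_ctx="read_{}".format(run_id),
)

# ...apply the slight transformations here...

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/{}/".format(run_id)},
    format="parquet",
    transformation_ctx="write_{}".format(run_id),
)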