Apache Beam / PubSub time delay before processing files

I need to delay processing or publishing filenames (files).
I am looking for the best option.
Currently I have two Apache Beam Dataflow pipelines with PubSub in between. The first pipeline reads filenames from the source and pushes them to a PubSub topic. The second pipeline reads them and processes them. However, my use case is to start processing/reading the actual files at least 1 hour after they are created in the source.
So I have two options:
1) Delay publishing a message, so that it is processed right away when it arrives, but at the expected moment
2) Delay processing of the retrieved files
As mentioned above, I am looking for the best solution. I am not sure whether the Guava retry mechanism should be used in Apache Beam. Any other ideas?

You could likely achieve what you want via the triggering/windowing configuration in the publishing job. You could define a windowing configuration where the trigger does not fire until after a 1-hour delay. Something like:
Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardHours(1)))
    .withAllowedLateness(Duration.ZERO)   // required once a non-default trigger is set
    .discardingFiredPanes()
Keep in mind that you'll end up with a job that sits doing little except holding onto state for an hour. Also, the above is based solely on processing time, so it will wait an hour after an element arrives even if the file's actual creation time is old enough that the result could be emitted immediately.
You could refine this to an event time trigger, but you would likely need to write your own code to assign timestamps to your records (the filenames). To my knowledge, Beam does not currently have built-in support for reading the creation time of files. When reading files via TextIO, for example, I have observed that the records are all assigned a default static timestamp. You should check the specifics of the transform you're using to read filenames to see if it perhaps does something more useful for your purposes. You can also use a WithTimestamps transform to assign timestamps on your own.
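For illustration, here is a minimal sketch of that timestamp-assignment step using the Beam Python SDK (the snippet above is Java). It assumes the publishing side has already looked up each file's creation time itself (for example via the storage API) and emits (filename, creation time) pairs; all names and values below are hypothetical.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

def attach_event_time(element):
    # element is assumed to be a (filename, creation_unix_seconds) pair
    filename, created_at = element
    return TimestampedValue(filename, created_at)

with beam.Pipeline() as p:
    filenames_with_creation_time = p | beam.Create(
        [("gs://my-bucket/file-1.csv", 1700000000)]  # hypothetical sample input
    )
    delayed_filenames = (
        filenames_with_creation_time
        | "AttachEventTime" >> beam.Map(attach_event_time)
        | "WindowByCreationTime" >> beam.WindowInto(
            FixedWindows(60),                      # 1-minute windows, as in the Java snippet
            trigger=AfterWatermark(),              # fire based on event time, not processing time
            accumulation_mode=AccumulationMode.DISCARDING,
        )
    )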

Related

Making Airflow behave like Luigi: how to prevent tasks from being re-run in future runs of a DAG if their output only needed to be obtained once?

I come from experience with Luigi, where if a file was produced successfully by a task and the task was unmodified, then re-runs of the DAG would not re-run that task, but would reuse its previously obtained output.
Is there any way to obtain the same behavior with Airflow?
Currently, if I re-run the DAG, it re-executes all the tasks, regardless of whether they produced a successful (and unchanged) output in the past. So, basically, I need a task to be marked as successful if its code was unchanged.
It is a crucial and important feature of Airflow that all tasks are idempotent. This means that re-running a task on the same input should generally override the output with a newly processed version of that data, so that tasks depending on it can be automatically re-run. But the data might be different after reprocessing than it was originally.
That's why in Airflow you have a backfill command that basically means:
Please re-run this DAG for selected past runs (say, the last week's worth of runs), but reprocess JUST starting from task X (which will re-run task X and ALL tasks that depend on its output).
This also means that when you want to re-run parts of past DAGs but you know that you want to rely on the existing output of certain tasks there, you only backfill the tasks that depend on the output of that task (but not the task itself).
This allows for much more flexibility in defining which tasks in past DAG runs should be re-run (you basically invalidate the outputs of certain tasks by making them the target of the backfill).
This covers more than the case you mention:
a) if you do not want to change the output of a certain task, you do not backfill that task, but only the task(s) that follow from it
b) more importantly, if you want to re-process a task even if neither the task's input nor the task itself was modified, you can still do it, by backfilling that task.
Case b) is often important, because some tasks might have implicit dependencies that change: even if the inputs and the task did not change, processing it again might produce a different (often better) result.
A good example I've heard is telecom operators reprocessing call records, where you have to determine phone models from the IMEIs of the phones. You might have a single service that does the mapping, but it gets updated to a newer version when manufacturers refresh their model database; since new phones are added with some delay, regularly reprocessing last week's data might give different results even if the input ("list of calls") and the task ("map IMEIs to phone models") did not change from the DAG's Python point of view.
Airflow almost always calls external services to run certain tasks, and those services themselves might improve over time; this means that limiting re-processing to the cases where neither the input nor the task code has changed is very limiting (but you can still deliberately decide on it by choosing the backfill scope, i.e. which tasks to reprocess).

AWS Glue ETL: Reading huge JSON file format to process but, got OutOfMemory Error

I am working on the AWS Glue ETL part, reading a huge JSON file (testing with only 1 file, around 9 GB) for the ETL process, but I got an error from AWS Glue, java.lang.OutOfMemoryError: Java heap space, after it had been running and processing for a while.
My code and flow are as simple as:
df = spark.read.option("multiline", "true").json(f"s3://{raw_path}")
# ...
# and write it to be used as source_df by another object in S3
df.write.json(f"s3://{source_path}", lineSep=",\n")
From the error/log, it seems it failed and the container was terminated while reading this huge file. I have already tried upgrading the worker type to G.1X with a number of worker nodes; however, I would like to ask about and find another solution that does not amount to vertical scaling, i.e. simply increasing resources.
I am very new to this area and this service, so I want to keep cost and time as low as possible :-)
Thank you all in advance
After looking into Glue and Spark, I found that to get the benefit of parallel processing across multiple executors, in my case I had to split the (large) file into multiple smaller files, and it worked! The files are then distributed to multiple executors.
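As a rough illustration of that approach, here is a minimal PySpark sketch, assuming the 9 GB file has already been split into many smaller JSON files under one S3 prefix and that spark is the same session used in the question; the bucket and prefixes are hypothetical. (If the data were newline-delimited JSON, dropping the multiline option would also let Spark split a single file.)
# Read a prefix of smaller JSON files instead of one huge file; each file can be
# handled by a different executor, so the read is spread across the cluster.
df = spark.read.option("multiline", "true").json("s3://my-bucket/raw_split/")

# Repartitioning before the write also spreads the output work across many tasks.
df.repartition(64).write.mode("overwrite").json("s3://my-bucket/source_path/")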

AWS "state file" solution for Lambda

I'm using a library in Lambda where a "state file" is persisted.
This is what it looks like in code:
def initialize
  @config = '/tmp/dogscaler.yaml'
  @state = self.load
end
If you need to look at the whole logic
https://github.com/cvent/dogscaler/blob/master/lib/dogscaler/state.rb#L5
My issue is that this won't work in Lambda (it being serverless). I'm trying to find a solution where I don't have to change the logic of how the file is read and modified.
Can this be achieved with S3?
Would something like this pseudo code work?
read s3://path/to/file
write s3://path/to/file
Are there better solutions to S3?
Additional Context
The file is needed for cooldown-period logic. Every time the application runs, it checks a timestamp from that file to make a judgement on whether to change an element or not. The file is less than 1 KB.
Based on the updated information you could store the data in a number of places.
S3 would be perfectly fine, but might be overkill if this is all you're using it for.
The same can be said of DynamoDB.
Parameter Store is a solid option for your use case. Bear in mind that if you are calling it often you may need to increase your TPS limit. It doesn't sound like that will be an issue for you. Also keep in mind that there is no protection here for multiple instances of your Lambda function writing to the parameter at the "same time." The last write will win. If you need to protect against that DynamoDB is probably the best option.
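For example, a minimal Python/boto3 sketch of the Parameter Store option might look like the following; the parameter name is hypothetical, and since the library in the question is Ruby this only illustrates the storage calls, not a drop-in replacement.
import boto3

ssm = boto3.client("ssm")
PARAM_NAME = "/dogscaler/state"  # hypothetical parameter name

def load_state() -> str:
    # Read the small state blob that the library would otherwise keep in /tmp.
    return ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]

def save_state(state_yaml: str) -> None:
    # Persist the updated state; as noted above, the last writer wins.
    ssm.put_parameter(Name=PARAM_NAME, Value=state_yaml, Type="String", Overwrite=True)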

Triggering Lambda on the basis of multiple files

I'm a bit confused, as I need to run an AWS Glue job when multiple specific files are available in S3. On every file-put event in S3, I trigger a Lambda which writes that file's metadata to DynamoDB. In DynamoDB, I also maintain a counter which counts the number of required files present.
But when multiple files are uploaded at once, which triggers multiple Lambdas, they write to DynamoDB at nearly the same time, which affects the counter; hence the counter is not able to count accurately.
I need a better way to start a job, when specific (multiple) files are made available in s3.
Kindly suggest a better way.
Dynamo is eventually consistent by default. You need to request a strongly consistent read to guarantee you are reading the same data that was written.
See this page for more information, or for a more concrete example, see the ConsistentRead flag in the GetItem docs.
It's worth noting that these will only minimise your problem. There will also be a very small window between read/writes where network lag causes one function to read/write while another is doing so too. You should think about only allowing one function to run at a time, or some other logic to guarantee mutually exclusive access to the DB.
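As a sketch, a strongly consistent read with boto3 looks roughly like this (the table and key names are hypothetical):
import boto3

table = boto3.resource("dynamodb").Table("file_arrivals")  # hypothetical table

# ConsistentRead=True asks DynamoDB for a strongly consistent read, so the item
# reflects all writes that completed before this call (at a higher read cost).
item = table.get_item(
    Key={"job_id": "daily-load"},  # hypothetical key
    ConsistentRead=True,
).get("Item")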
It sounds like you are getting the current count, incrementing it in your Lambda function, then updating DynamoDB with the new value. Instead you need to be using DynamoDB Atomic Counters, which will ensure that multiple concurrent updates will not cause the problems you are describing.
By using Atomic counters you simply send DynamoDB a request to increment your counter by 1. If your Lambda needs to check if this was the last file you were waiting on before doing other work, then you can use the return value from the update call to check what the new count is.
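A minimal boto3 sketch of that atomic-counter update, with hypothetical table, key and attribute names, might look like:
import boto3

table = boto3.resource("dynamodb").Table("file_arrivals")  # hypothetical table
EXPECTED_FILE_COUNT = 3  # assumption for this sketch

# ADD increments the counter atomically on the server, so concurrent Lambda
# invocations cannot overwrite each other's updates; ReturnValues returns the new count.
response = table.update_item(
    Key={"job_id": "daily-load"},  # hypothetical key
    UpdateExpression="ADD files_received :inc",
    ExpressionAttributeValues={":inc": 1},
    ReturnValues="UPDATED_NEW",
)

if int(response["Attributes"]["files_received"]) == EXPECTED_FILE_COUNT:
    pass  # the last expected file has arrived; start the Glue job here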
Not sure what you mean by "specific" (multiple) files.
If you are expecting specific file names (or "patterns"), then you could just check for all the expected files as the first instruction of your Lambda function, as in the sketch below. I.e. if you expect files A.txt, B.txt and C.txt, test whether your S3 bucket contains those 3 specific files (or 3 *.txt files, or whatever suits your requirements). If that's the case, keep processing; if not, return from the function. This would technically work even in the case of concurrent calls.
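A hedged Python/boto3 sketch of that check, with a hypothetical bucket and file names, could be:
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
EXPECTED_KEYS = ["incoming/A.txt", "incoming/B.txt", "incoming/C.txt"]  # hypothetical

def all_expected_files_present(bucket: str) -> bool:
    # Return True only if every expected object already exists in the bucket.
    for key in EXPECTED_KEYS:
        try:
            s3.head_object(Bucket=bucket, Key=key)
        except ClientError:
            return False
    return True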

When does an action not run on the driver in Apache Spark?

I have just started with Spark and am struggling with the concept of tasks.
Can anyone please help me understand when an action (say reduce) does not run in the driver program?
From the Spark tutorial:
"Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel."
I'm currently experimenting with an application which reads a directory of 'n' files and counts the number of words.
From the web UI, the number of tasks is equal to the number of files, and all the reduce functions appear to take place on the driver node.
Can you please describe a scenario where the reduce function won't execute at the driver? Does a task always include "transformation + action", or only "transformation"?
All the actions are performed on the cluster, and the results of the actions may end up on the driver (depending on the action).
Generally speaking, the Spark code you write around your business logic is not the program that actually runs; rather, Spark uses it to create a plan which will execute your code in the cluster. The plan groups into a single task all the operations that can be done on a partition without the need to shuffle data around. Every time Spark needs the data arranged differently (e.g. after sorting), it will create a new task, with a shuffle between the former and the latter tasks.
I'll take a stab at this, although I may be missing part of the question. A task is indeed always transformation(s) plus an action. The transformations are lazy and would not submit anything on their own, hence the need for an action. You can always call .toDebugString on your RDD to see where each job will be split; each level of indentation is a new stage. I think the reduce function showing on the driver is a bit of a misnomer, as it will run first in parallel on the workers and then merge the results on the driver. So I would expect that the task does indeed run on the workers as far as it can.
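To make that concrete, here is a small PySpark sketch (the input path is hypothetical): reduce combines partial results on the executors before the driver merges them, and .toDebugString shows where a shuffle introduces a new stage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-stages").getOrCreate()
sc = spark.sparkContext

# With many small text files, Spark typically creates at least one partition (one task) per file.
words = sc.textFile("hdfs:///data/docs/*.txt").flatMap(lambda line: line.split())

# reduce() is an action: each executor reduces its own partitions first,
# and only the per-partition results are merged on the driver.
total = words.map(lambda _: 1).reduce(lambda a, b: a + b)
print(total)

# reduceByKey() forces a shuffle; the lineage below shows a new indentation
# level (a new stage) at that shuffle boundary.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.toDebugString().decode("utf-8"))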