How to process two batch files simultaneously with Dataflow on GCP - google-cloud-platform

I want to process two files from Cloud Storage in Dataflow at the same time.
I think it would be possible if the second file came in as a side input.
However, in that case I think it would be processed every time, not just once.
e.g.) How do I read and process file1 and file2 at the same time? (Do I have to put the two files into one file and just point at that path?)
I'd appreciate it if you could give me a good example or some advice.
Thank you.

If you know the 2 files from the beginning, you can simply have a pipeline with 2 entry points (FileIO).
I don't know which language you use, but by design you can do this:
PCollection1            PCollection2
     |                       |
FileIO(readFile1)       FileIO(readFile2)
     |                       |
Transform file          Transform file
     |                       |
WriteIO(sink)           WriteIO(sink)
You can imagine side inputs, Flatten, GroupBy, ... it all depends on your needs.
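A minimal Python sketch of that layout, assuming two text files on Cloud Storage and a shared line-level transform (the bucket paths and parse_line are placeholders, not something from the question):

import apache_beam as beam

def parse_line(line):
    # Placeholder transform: replace with your real parsing logic.
    return line.strip()

with beam.Pipeline() as pipeline:
    # Two independent entry points, one per file.
    lines1 = pipeline | "ReadFile1" >> beam.io.ReadFromText("gs://my-bucket/file1.txt")
    lines2 = pipeline | "ReadFile2" >> beam.io.ReadFromText("gs://my-bucket/file2.txt")

    # Transform each branch; both run in the same job, in parallel.
    parsed1 = lines1 | "TransformFile1" >> beam.Map(parse_line)
    parsed2 = lines2 | "TransformFile2" >> beam.Map(parse_line)

    # Optionally merge the branches before writing to a single sink.
    ((parsed1, parsed2)
     | "MergeBranches" >> beam.Flatten()
     | "WriteOutput" >> beam.io.WriteToText("gs://my-bucket/output/result"))

If each file needs different handling, give each branch its own transforms before the Flatten, or skip the Flatten entirely and keep two separate sinks.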

Related

Scalable way to read large numbers of files with Apache Beam?

I’m writing a pipeline where I need to read the metadata files (500,000+ files) from the Sentinel-2 dataset located in my Google Cloud Storage bucket with apache_beam.io.ReadFromTextWithFilename.
It works fine on a small subset, but when I run it on the full dataset it seems to block on "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json').
It doesn’t even show up in the Dataflow jobs list.
The pipeline looks like this:
with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json')
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )
    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )
    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
        )
    )
I'm wondering:
Is there a smarter way to read large numbers of files?
Would copying the metadata files into one folder be more performant? (How much more does it cost to traverse sub-folders, as opposed to files in one folder?)
Is the way to go to match the file names first with apache_beam.io.fileio.MatchAll and then read and extract them in one or two subsequent ParDos?
This is probably due to the pipeline running into Dataflow API limits when splitting the text source glob into a large number of sources.
The current solution is to use the transform ReadAllFromText, which should not run into this.
In the future we hope to update the ReadFromText transform for this case as well by using the Splittable DoFn framework.
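A sketch of that suggestion, reusing BUCKET, DATA_FOLDER, pipeline_options, and ExtractMetaData from the question; note that ReadAllFromText consumes a PCollection of file patterns (so the glob is expanded inside the pipeline rather than at job submission) and, unlike ReadFromTextWithFilename, yields only the lines without the file names, so ExtractMetaData may need adjusting:

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        # The pattern is an element, so splitting happens inside the job.
        | "File pattern" >> beam.Create([f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json'])
        | "Read Metadata" >> ReadAllFromText()
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )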
It looks like I was suffering from a case of unwanted fusion.
Drawing inspiration from the page about file processing on the Apache Beam website, I tried adding a Reshuffle to the pipeline.
I also upgraded to a paid Google Cloud account, thus getting higher quotas.
That resulted in Dataflow handling the job a lot better.
In fact, Dataflow wanted to scale to 251 workers for my BigQuery write job. At first it didn't provision more workers, so I stopped the job and set --num_workers=NUM_WORKERS and --max_num_workers=NUM_WORKERS, where NUM_WORKERS was the max quota for my project. When running with those parameters it scaled up automatically.
My final pipeline looks like this:
with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | MatchFiles(f'gs://{BUCKET}/{DATA_FOLDER}/*metadata.json')
        | ReadMatches()
        | beam.Reshuffle()
        | beam.Map(lambda x: (x.metadata.path, x.read_utf8()))
        | beam.ParDo(ExtractMetaData())
    )
    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )
    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE
        )
    )
Appendix
I also got a hint that Splittable DoFns might be a solution, but I have not tested it.

What will happen if the power gets shut down while we are inserting into a database?

I was recently asked a question in an interview; maybe someone can help me figure it out.
Suppose we have 100 files, and a process reads a file, parses it, and writes the data into a database.
Now let's say the process was at file number 60 and the power went off. How would you design the system so that when the power comes back up, the process resumes writing data into the database where it left off before the shutdown?
This would be one way:
Loop over:
  Pick up a file
  Check it hasn't been processed, with a query to the database
  Process the file
  Update the database
  Update the database with a log of the file processed
  Commit
  Move the file out of the non-processed queue
You can also log the file entry to some other persistent resource. (A minimal code sketch of this loop follows at the end of this answer.)
Q: What if there are many files? Doesn't writing to logs slow down the process?
A: Probably not much; it's just one entry into the database per file. It's the cost of resilience.
Q: What if the files are so small that processing one is almost just updating a single row?
A: Make your update query idempotent. Don't log, but ensure that files are removed from the queue once the transaction is complete.
Q: What if there are many lines in a file? Do you really want to restart with the first line of a file?
A: It depends on the cost/benefit. You could split the file into smaller ones prior to processing each sub-file. If power outages happen all the time, that's a good compromise. If they happen very rarely, the extra work by the system may not be worth it.
Q: What if there is a mix of small and large files?
A: Put the files into separate queues that handle them accordingly.
The UPS idea by @Tim Biegeleisen is very good, though:
Well actually it is about that, because unplugging a database in the middle of a lengthy transaction might result in corrupted data. – Tim Biegeleisen Feb 22 '20 at 10:24
I've experienced the failure of one such UPS, so you'll need two.
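A minimal sketch of the loop above in Python, assuming a local SQLite database, an incoming/ and a done/ directory, and one parsed value per line; the table names and the parser are made up for illustration, not taken from the question:

import os
import shutil
import sqlite3

conn = sqlite3.connect("ingest.db")
conn.execute("CREATE TABLE IF NOT EXISTS processed_files (name TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE IF NOT EXISTS file_data (value TEXT)")
conn.commit()

def already_processed(name):
    # The log table records which files were fully committed before any crash.
    return conn.execute(
        "SELECT 1 FROM processed_files WHERE name = ?", (name,)
    ).fetchone() is not None

def parse_file(path):
    # Hypothetical parser: one value per line.
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

for name in sorted(os.listdir("incoming")):
    if already_processed(name):
        continue  # picked up again after a restart, but already committed
    path = os.path.join("incoming", name)
    rows = parse_file(path)
    with conn:  # one transaction per file: data rows and log entry commit together
        conn.executemany("INSERT INTO file_data (value) VALUES (?)",
                         [(r,) for r in rows])
        conn.execute("INSERT INTO processed_files (name) VALUES (?)", (name,))
    shutil.move(path, os.path.join("done", name))  # leave the non-processed queue

If the power goes out mid-file, the open transaction rolls back, so on the next run that file is simply not in processed_files and gets reprocessed from the start.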
I think you must:
Store somewhere a reference to the file (an ID or the index of the processed file - it depends on the case, really).
You have to define the boundaries of a single transaction - let it be the full processing of one file, so: read the file, parse it, store the data in the database, and update the reference to the file you processed. If all of that succeeds, you can commit the transaction to the database.
Your main task, which will process all the files, should look into the reference table and, based on its state, fetch the next file.
In this case you create a transaction around the processing of a single file. If anything goes wrong there, you can always rerun the processing job and it will start where it left off.
Please be aware that this is a very simple example; in most scenarios you want to keep transactions as thin as possible.

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google Dataflow. I need to run multiple transformations, all of which expect items to be grouped by key.
Based on the answer to this question, Dataflow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to the other transformations (see the sample code below).
I wonder if this is supposed to work efficiently in Dataflow. If not, what is the recommended workaround in the Python SDK? Is there an efficient way to have multiple Map or Write transformations take the results of the same GroupBy? In my case, I observe Dataflow scale to the max number of workers at 5% utilization and make no progress at the steps following the GroupBy, as described in this question.
Sample code. For simplicity, only 2 transformations are shown.
# Group by key once.
items_by_key = raw_items | GroupByKey()
# Write groupped items to a file.
(items_by_key | FlatMap(format_item) | WriteToText(path))
# Run another transformation over the same group.
features = (items_by_key | Map(extract_features))
Feeding the output of a single GroupByKey step into multiple transforms should work fine. But the amount of parallelization you can get depends on the total number of keys available in the original GroupByKey step. If any one of the downstream steps is high fanout, consider adding a Reshuffle step after those steps, which will allow Dataflow to further parallelize execution.
For example,
pipeline | Create([<list of globs>]) | ParDo(ExpandGlobDoFn()) | Reshuffle() | ParDo(MyReadDoFn()) | Reshuffle() | ParDo(MyProcessDoFn())
Here,
ExpandGlobDoFn: expands input globs and generates files
MyReadDoFn: reads a given file
MyProcessDoFn: processes an element read from a file
I used two Reshuffles here (note that Reshuffle has a GroupByKey in it) to allow (1) parallelizing the reading of files from a given glob and (2) parallelizing the processing of elements from a given file.
Based on my experience troubleshooting this SO question, reusing GroupBy output in more than one transformation can make your pipeline extremely slow. At least this was my experience with the Apache Beam SDK 2.11.0 for Python.
Common sense told me that branching out from a single GroupBy in the execution graph should make my pipeline run faster. After 23 hours of running on 120+ workers, the pipeline was not able to make any significant progress. I tried adding Reshuffles, using a combiner where possible, and disabling the experimental shuffle service.
Nothing helped until I split the pipeline in two. The first pipeline computes the GroupBy and stores it in a file (I need to ingest it "as is" into the DB). The second reads the file with the GroupBy output, reads additional inputs, and runs further transformations. The result: all transformations successfully finished in under 2 hours. I think that if I had just duplicated the GroupBy in my original pipeline, I would probably have achieved the same results.
I wonder whether this is a bug in the Dataflow execution engine or the Python SDK, or whether it works as intended. If it is by design, then at least it should be documented, and a pipeline like this should not be accepted when submitted, or there should be a warning.
You can spot this issue by looking at the 2 branches coming out of the "Group keywords" step. It looks like the solution is to rerun the GroupBy for each branch separately.
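A sketch of that workaround, reusing the placeholders from the question (raw_items, format_item, extract_features, and path are the question's own names): each branch pays for its own shuffle, but neither branch can stall the other.

import apache_beam as beam

# Branch 1: group, format, and write to text.
(raw_items
 | "GroupForWrite" >> beam.GroupByKey()
 | "FormatItems" >> beam.FlatMap(format_item)
 | "WriteItems" >> beam.io.WriteToText(path))

# Branch 2: run a second, independent GroupByKey for feature extraction.
features = (raw_items
            | "GroupForFeatures" >> beam.GroupByKey()
            | "ExtractFeatures" >> beam.Map(extract_features))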

Hadoop mapreduce using 2 mapper and 1 reducer using c++

Following the instructions in this link, I implemented a word-count program in C++ using a single mapper and a single reducer. Now I need to use two mappers and one reducer for the same problem.
Can someone help me please in this regard?
The number of mappers depends on the number of input splits created. The number of input splits depends on the size of the input, the block size, the number of input files (each input file creates at least one input split), whether the input files are splittable or not, etc. See also this post on SO.
You can set the number of reducers to as many as you wish. I guess in Hadoop Pipes you can do this by passing -D mapred.reduce.tasks=... when running hadoop. See this post on SO.
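For example, a Pipes invocation along these lines (the paths and the binary name are placeholders, and whatever other options your current command uses stay as they are; on newer Hadoop releases the property is called mapreduce.job.reduces):

hadoop pipes \
  -D mapred.reduce.tasks=1 \
  -input /user/me/wordcount/input \
  -output /user/me/wordcount/output \
  -program bin/wordcount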
If you want to quickly test how your program works with more than one mapper, you can simply put a new file in your input path. This will make Hadoop create another input split and thus another map task.
PS: The link that you provide is not reachable.

Multiple MapReduce jobs with multiple files as input and multiple reducers

I need to chain multiple MapReduce streaming jobs in order to perform some computation over a large dataset.
I intend to use multiple reducers for each job in order to speed up the overall job. As a workflow scheduler I use Oozie.
Here is an illustration to clarify my problem:
Let's say I have two files:
File 1:    File 2:
A B 1      A B 3
A C 4      C D 6
B D 2      B D 1
I'd like to have two mappers and two reducers and get the following output for the MapReduce job:
Output:
A B 4
A C 4
B D 3
C D 6
But this is not at all what I get; instead I get partial sums.
Here is what I think happens.
Since I have multiple reducers for each MapReduce job, the input of the next job is split into several files. These files are given to the mappers, which then send their output to the reducers. It seems that the mappers send their output to the reducers without waiting for the whole input to be processed and sorted with, for example, name1 as the key.
I've read several threads about using multiple files as input, and I don't think it is a matter of performing a map-side join. Maybe it has to do with partitioning, but I haven't exactly understood what partitioning consists of.
Is there any way to sort the output of several mappers before sending it to the reducers? Or can I tell Oozie to merge the output of several reducers in order to have only one file as the input of the next MapReduce job?
I'm slightly new to MapReduce, but if you are not getting the desired output from your example, it looks like your job isn't processing the keys correctly.
By default, Hadoop Streaming uses a tab as the field separator and takes everything from the start of a line to the first tab character as the key. In your case, if your input format is actually "A[space]B[space]1", you'll need to add
-D stream.map.output.field.separator=' ' \
-D stream.num.map.output.key.fields=2 \
to your Hadoop Streaming command in order to set the space as the column delimiter and the first 2 columns as the key. This will map all the lines that start with "A B" to the same reducer. More info can be found here.
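For context, the full streaming invocation might look roughly like this (the jar path, input/output paths, and mapper/reducer scripts are placeholders; the generic -D options must come before the streaming-specific ones):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D stream.map.output.field.separator=' ' \
  -D stream.num.map.output.key.fields=2 \
  -input /user/me/job1/output \
  -output /user/me/job2/output \
  -mapper sum_mapper.py \
  -reducer sum_reducer.py \
  -file sum_mapper.py \
  -file sum_reducer.py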