Scalable way to read large numbers of files with Apache Beam? - google-cloud-platform

I’m writing a pipeline where I need to read the metadata files (500,000+ files) from the Sentinel-2 dataset located in my Google Cloud Storage bucket with apache_beam.io.ReadFromTextWithFilename.
It works fine on a small subset, but when I run it on the full dataset it seems to block on "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json').
It doesn't even show up in the Dataflow jobs list.
The pipeline looks like this:
with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json')
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )
    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )
    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
        )
    )
I'm wondering:
Is there a smarter way to read large numbers of files?
Will copying the metadata files into one folder be more performant? (How much more does it cost to traverse sub-folders, as opposed to files in one folder?)
Is the way to go to first match the file names with apache_beam.io.fileio.MatchAll and then read and extract them in one or two subsequent ParDos?

This is probably due to the pipeline running into Dataflow API limits when splitting the text source glob into a large number of sources.
The current solution is to use the ReadAllFromText transform, which should not run into this limit.
In the future we hope to update the ReadFromText transform to handle this case as well, using the Splittable DoFn framework.
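A hedged sketch of how ReadAllFromText could slot into the pipeline above; note that, unlike ReadFromTextWithFilename, it emits plain lines without filenames, so if you need the filename the MatchFiles/ReadMatches approach shown in the next answer may be preferable:
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | "File patterns" >> beam.Create(
            [f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json'])
        | "Read Metadata" >> ReadAllFromText()
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )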

It looks like I was suffering from a case of unwanted fusion.
Drawing inspiration from the page about file-processing on the Apache Beam website, I tried to add a Reshuffle to the pipeline.
I also upgraded to a paid Google Cloud account, thus getting higher quotas.
That resulted in Dataflow handling the job a lot better.
In fact Dataflow wanted to scale to 251 workers for my BigQuery write job. At first it didn't provision more workers, so I stopped the job and set --num_workers=NUM_WORKERS and --max_num_workers=NUM_WORKERS, where NUM_WORKERS was the max quota for my project. When running with those parameters it scaled up automatically.
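For reference, the same worker settings can also be passed programmatically; a minimal sketch with placeholder project, region and bucket values:
from apache_beam.options.pipeline_options import PipelineOptions

NUM_WORKERS = 250  # placeholder: the worker quota of your project

pipeline_options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                 # placeholder project id
    region='us-central1',                 # placeholder region
    temp_location='gs://my-bucket/tmp',   # placeholder temp location
    num_workers=NUM_WORKERS,
    max_num_workers=NUM_WORKERS,
)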
My final pipeline looks like this:
with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | MatchFiles(f'gs://{BUCKET}/{DATA_FOLDER}/*metadata.json')
        | ReadMatches()
        | beam.Reshuffle()
        | beam.Map(lambda x: (x.metadata.path, x.read_utf8()))
        | beam.ParDo(ExtractMetaData())
    )
    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )
    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE
        )
    )
Appendix
I also got a hint that Splittable DoFns might be a solution, but I have not tested it.

Related

Join PubSub data with BigQuery data and then save result into BigQuery using dataflow SDK in python

I have a problem statement: read streaming data from a Pub/Sub topic (PubSubTopic1), join it with a BigQuery table (BQTable1) using Dataflow, and then save the result into a new BigQuery table (ResultBQTable).
PubSubTopic1: has ItemID, UnitPrice
BQTable1: has ItemID, ItemName, OfferPrice
ResultBQTable: need to have columns: ItemID, ItemName, UnitPrice, OfferPrice, TotalCost
I am able to create a Dataflow job using 'Dataflow SQL Workbench', but that is a one-time job which I cannot automate. Hence I want to write Python code using the Apache Beam and Dataflow SDKs to automate this, so that it can be shared with anyone to implement the same thing.
I am new to Dataflow, hence my approach might be tedious. Better and more optimal approaches are all welcome.
I am thinking of the below things, but do not know how to implement the second:
I can try to implement windowing on the Pub/Sub topic to read in small batches using a time limit.
Can we read PubSubTopic1 streaming data into one PCollection and data from BQTable1 into another PCollection and then join these?
Can we read PubSubTopic1 streaming data into one PCollection and data from BQTable1 into another PCollection and then join these
Yes, that's exactly the right way to be thinking about it!
The Side Input Patterns page on the Beam docs contains an example of enriching streaming data with a slowly-changing side input. This is a slightly modified version to match your input and output types:
import apache_beam as beam
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.transforms.window import TimestampedValue
from apache_beam.transforms import window
# from apache_beam.utils.timestamp import MAX_TIMESTAMP
# last_timestamp = MAX_TIMESTAMP to go on indefinitely

# first_timestamp, last_timestamp, interval and
# main_input_windowing_interval are assumed to be defined by the user.

# Any user-defined function.
# cross join is used as an example.
def cross_join(left, rights):
    for x in rights:
        yield (left, x)

# Create pipeline.
pipeline = beam.Pipeline()
side_input = (
    pipeline
    | 'PeriodicImpulse' >> PeriodicImpulse(
        first_timestamp, last_timestamp, interval, True)
    | 'ReadFromBQ' >> beam.io.ReadFromBigQuery(table='mytable'))
main_input = (
    pipeline
    | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic='mytopic')
    | 'WindowMpInto' >> beam.WindowInto(
        window.FixedWindows(main_input_windowing_interval)))
result = (
    main_input
    | 'ApplyCrossJoin' >> beam.FlatMap(
        cross_join, rights=beam.pvalue.AsIter(side_input)))
result | beam.io.WriteToBigQuery(table='my_output_table')
result here is a windowed, unbounded PCollection, and Beam will use streaming inserts to send the data to BigQuery as windows are processed.
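If you need the exact ResultBQTable columns rather than a raw cross join, the user-defined function can do a keyed lookup instead. This is only a rough sketch under my own assumptions: it presumes the Pub/Sub messages have already been parsed into dicts (e.g. with beam.Map(json.loads)), uses a hypothetical join_items function, and guesses a formula for TotalCost.
import apache_beam as beam

def join_items(pubsub_row, items_by_id):
    # pubsub_row: a parsed Pub/Sub message, e.g. {'ItemID': ..., 'UnitPrice': ...}
    # items_by_id: dict built from the BQTable1 side input, keyed by ItemID
    item = items_by_id.get(pubsub_row['ItemID'])
    if item is not None:
        yield {
            'ItemID': pubsub_row['ItemID'],
            'ItemName': item['ItemName'],
            'UnitPrice': pubsub_row['UnitPrice'],
            'OfferPrice': item['OfferPrice'],
            # Assumed definition of TotalCost; replace with your business rule.
            'TotalCost': pubsub_row['UnitPrice'] + item['OfferPrice'],
        }

result = (
    main_input
    | 'JoinWithBQ' >> beam.FlatMap(
        join_items,
        items_by_id=beam.pvalue.AsDict(
            side_input | beam.Map(lambda row: (row['ItemID'], row)))))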

How to process dataflow two batch files simultaneously on GCP

I want to process two files from GCS in Dataflow at the same time, simultaneously.
I think it would be possible if one more file comes in as a side input.
However, in this case, I think it will be processed every time, not just once.
e.g.) How do I read and process file1 and file2 at the same time? (Do I have to put the two files into one file and just point to that path?)
I'd appreciate it if you could give me a good example or advice.
Thank you.
If you know the 2 files from the beginning, you can simply have a pipeline with 2 entries (FileIO).
I don't know your language, but by design you can do this:
PCollection1                PCollection2
      |                           |
FileIO(readFile1)          FileIO(readFile2)
      |                           |
 Transform file             Transform file
      |                           |
 WriteIO(sink)              WriteIO(sink)
You can imagine side inputs, flatten, group by, ... it all depends on your needs.
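In the Python SDK the same shape could be sketched roughly as follows; the paths and the transforms are placeholders of mine, not part of the original answer:
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # Branch 1: read, transform and write file1.
    (pipeline
     | 'ReadFile1' >> beam.io.ReadFromText('gs://my-bucket/file1.csv')  # placeholder path
     | 'TransformFile1' >> beam.Map(lambda line: line.upper())          # placeholder transform
     | 'WriteFile1' >> beam.io.WriteToText('gs://my-bucket/out/file1'))

    # Branch 2: read, transform and write file2, in the same job.
    (pipeline
     | 'ReadFile2' >> beam.io.ReadFromText('gs://my-bucket/file2.csv')  # placeholder path
     | 'TransformFile2' >> beam.Map(lambda line: line.lower())          # placeholder transform
     | 'WriteFile2' >> beam.io.WriteToText('gs://my-bucket/out/file2'))
Both branches live in one pipeline, so Dataflow processes the two files in parallel within a single job.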

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

I am using Python SDK for Apache Beam to run a feature extraction pipeline on Google DataFlow. I need to run multiple transformations all of which expect items to be grouped by key.
Based on the answer to this question, Dataflow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to other transformations (see sample code below).
I wonder if this is supposed to work efficiently in Dataflow. If not, what is the recommended workaround in the Python SDK? Is there an efficient way to have multiple Map or Write transformations take the results of the same GroupBy? In my case, I observe Dataflow scale to the max number of workers at 5% utilization and make no progress at the steps following the GroupBy, as described in this question.
Sample code. For simplicity, only 2 transformations are shown.
# Group by key once.
items_by_key = raw_items | GroupByKey()
# Write grouped items to a file.
(items_by_key | FlatMap(format_item) | WriteToText(path))
# Run another transformation over the same group.
features = (items_by_key | Map(extract_features))
Feeding output of a single GroupByKey step into multiple transforms should work fine. But the amount of parallelization you can get depends on the total number of keys available in the original GroupByKey step. If any one of the downstream steps are high fanout, consider adding a Reshuffle step after those steps which will allow Dataflow to further parallelize execution.
For example,
pipeline | Create([<list of globs>]) | ParDo(ExpandGlobDoFn()) | Reshuffle() | ParDo(MyReadDoFn()) | Reshuffle() | ParDo(MyProcessDoFn())
Here,
ExpandGlobDoFn: expands input globs and generates files
MyReadDoFn: reads a given file
MyProcessDoFn: processes an element read from a file
I used two Reshuffles here (note that Reshuffle has a GroupByKey in it) to allow (1) parallelizing reading of files from a given glob (2) parallelizing processing of elements from a given file.
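A possible sketch of those DoFns using the Beam FileSystems API; the class bodies are my assumption of what the answer intends, not code from it:
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class ExpandGlobDoFn(beam.DoFn):
    # Expand an input glob into the individual file paths it matches.
    def process(self, glob):
        for metadata in FileSystems.match([glob])[0].metadata_list:
            yield metadata.path

class MyReadDoFn(beam.DoFn):
    # Read a given file and emit its lines.
    def process(self, path):
        with FileSystems.open(path) as f:
            for line in f:
                yield line.decode('utf-8').rstrip('\n')

class MyProcessDoFn(beam.DoFn):
    # Process an element read from a file (placeholder logic).
    def process(self, element):
        yield element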
Based on my experience in troubleshooting this SO question, reusing GroupBy output in more than one transformation can make your pipeline extremely slow. At least this was my experience with Apache Beam SDK 2.11.0 for Python.
Common sense told me that branching out from a single GroupBy in the execution graph should make my pipeline run faster. After 23 hours of running on 120+ workers, the pipeline was not able to make any significant progress. I tried adding reshuffles, using a combiner where possible and disabling the experimental shuffle service.
Nothing helped until I split the pipeline into two. The first pipeline computes the GroupBy and stores it in a file (I need to ingest it "as is" into the DB). The second reads the file with the GroupBy output, reads additional inputs and runs further transformations. The result: all transformations successfully finished in under 2 hours. I think if I had just duplicated the GroupBy in my original pipeline, I would probably have achieved the same results.
I wonder if this is a bug in the Dataflow execution engine or the Python SDK, or whether it works as intended. If it is by design, then at least it should be documented, and a pipeline like this should not be accepted when submitted, or there should be a warning.
You can spot this issue by looking at the 2 branches coming out of the "Group keywords" step. It looks like the solution is to rerun GroupBy for each branch separately.
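Applied to the sample code from the question, the workaround amounts to giving each branch its own GroupByKey; a rough sketch of the idea, not the poster's exact pipeline:
import apache_beam as beam
from apache_beam.io import WriteToText

# Branch 1: group once for the file output.
(raw_items
 | 'GroupForWrite' >> beam.GroupByKey()
 | 'FormatItems' >> beam.FlatMap(format_item)
 | 'WriteItems' >> WriteToText(path))

# Branch 2: group again, independently, for feature extraction.
features = (
    raw_items
    | 'GroupForFeatures' >> beam.GroupByKey()
    | 'ExtractFeatures' >> beam.Map(extract_features))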

Spark Dataframe loading 500k files on EMR

I am running a pyspark job on EMR (5.5.1) with Spark 2.1.0, Hadoop 2.7.3, Hive 2.1.1, Sqoop 1.4.6 and Ganglia 3.7.2, which loads data from S3. There are multiple buckets that contain the input files, so I have a function which uses boto to traverse them and filter them according to some pattern.
Cluster size: Master => r4.xlarge, Workers => 3 x r4.4xlarge
Problem:
The function getFilePaths returns a list of S3 paths, which is fed directly to the Spark DataFrame load method.
Using Dataframe
file_list = getFilePaths() # ['s3://some_bucket/log.json.gz','s3://some_bucket/log2.json.gz']
schema = getSchema() # for mapping to the json files
df = sparkSession.read.format('json').load(file_list, schema=schema)
Using RDD
master_rdd = sparkSession.sparkContext.union(
    [sparkSession.sparkContext.textFile(file) for file in file_list]
)
df = sparkSession.createDataFrame(master_rdd, schema=schema)
The file_list can be a huge list (up to 500k files) due to the large amount of data and files. Calculating these paths only takes 5-20 minutes, but when trying to load them as a DataFrame with Spark, the Spark UI remains inactive for hours, i.e. not processing anything at all. The inactivity period for processing 500k files is above 9 hours, while for 100k files it is around 1.5 hours.
Viewing Ganglia metrics shows that only the driver is running/processing while the workers are idle. No logs are generated until the Spark job has finished, and I haven't had any success with 500k files.
I have tried the s3 and s3n connectors, but with no success.
Question:
How do I figure out the root cause of this delay?
How can I debug it properly?
In general, Spark/Hadoop prefer large files they can split over huge numbers of small files. One approach you might try, though, would be to parallelize your file list and then load the data in a map call.
I don't have the resources right now to test this out, but it should be something similar to this:
import json
import boto3

file_list = getFilePaths()
schema = getSchema()  # for mapping to the json files
paths_rdd = sc.parallelize(file_list)

def get_data(path):
    # 'bucket' is assumed to be defined elsewhere as the S3 bucket name.
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, path)
    data = obj.get()['Body'].read().decode('utf-8')
    return [json.loads(r) for r in data.split('\n') if r]

rows_rdd = paths_rdd.flatMap(get_data)
df = spark.createDataFrame(rows_rdd, schema=schema)
You could also make this a little more efficient by using mapPartitions instead, so you don't need to recreate the S3 resource for every file.
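A hedged sketch of that mapPartitions variant, reusing the assumed bucket variable and the paths_rdd from the snippet above:
import json
import boto3

def get_data_partition(paths):
    # One S3 resource per partition, reused for every file in it.
    s3 = boto3.resource('s3')
    for path in paths:
        body = s3.Object(bucket, path).get()['Body'].read().decode('utf-8')
        for line in body.split('\n'):
            if line:
                yield json.loads(line)

rows_rdd = paths_rdd.mapPartitions(get_data_partition)
df = spark.createDataFrame(rows_rdd, schema=schema)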
EDIT 6/14/18:
With regards to handling the gzip data, you can decompress a stream of gzip data using Python as detailed in this answer: https://stackoverflow.com/a/12572031/1461187. Basically, just pass obj.get()['Body'].read() into the function defined in that answer.
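For example, on Python 3 the per-file reader could be adapted roughly like this; gzip.decompress stands in for the streaming approach in the linked answer, and bucket is still assumed to be defined elsewhere:
import gzip
import json
import boto3

def get_gzip_data(path):
    s3 = boto3.resource('s3')
    raw = s3.Object(bucket, path).get()['Body'].read()
    text = gzip.decompress(raw).decode('utf-8')
    return [json.loads(line) for line in text.split('\n') if line]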
There are two performance issues surfacing:
Reading the files: gzip files can't be split to have their workload shared across workers, though with 50 MB files there's little benefit in splitting things up.
The way the S3 connectors Spark uses mimic a directory structure is a real performance killer for complex directory trees.
Issue #2 is what slows down partitioning: the initial work of deciding what to do, which is done before any of the computation.
How would I go about trying to deal with this? Well, there's no magic switch here. But:
Have fewer, bigger files; as noted, Avro is good, and so are Parquet and ORC later on.
Use a very shallow directory tree. Are these files all in one single directory, or in a deep directory tree? The latter is worse.
Coalesce the files first.
I'd also avoid any kind of schema inference; it sounds like you aren't doing that (good!), but for anyone else reading this answer: know that for CSV, and presumably JSON, schema inference means "read through all the data once just to work out the schema".
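For completeness, passing an explicit schema looks like this; the field names below are purely hypothetical since the question never shows what getSchema() returns:
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical log schema; replace the fields with the ones in your JSON files.
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("request_id", StringType(), True),
    StructField("bytes_sent", LongType(), True),
])

df = sparkSession.read.format('json').load(file_list, schema=schema)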

Write to dynamic destination to cloud storage in dataflow in Python

I was trying to read from a big file in Cloud Storage and shard the records according to a given field.
I'm planning to do Read | Map(lambda x: (x[key field], x)) | GroupByKey | Write to a file named after the key field.
However, I couldn't find a way to write dynamically to Cloud Storage. Is this functionality supported?
Thank you,
Yiqing
Yes, you can use the FileSystems API to create the files.
An experimental write was added to the Beam Python SDK in 2.14.0, beam.io.fileio.WriteToFiles:
my_pcollection | beam.io.fileio.WriteToFiles(
    path='/my/file/path',
    destination=lambda record: 'avro' if record['type'] == 'A' else 'csv',
    sink=lambda dest: AvroSink() if dest == 'avro' else CsvSink(),
    file_naming=beam.io.fileio.destination_prefix_naming())
which can be used to write to different files per-record.
You can skip the GroupByKey and just use destination to decide which file each record is written to. The return value of destination needs to be a value that can be grouped by.
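For the shard-by-field case in the question, a hedged sketch might look like the following; the bucket path and the key_field name are placeholders, and records are re-serialized to JSON strings because fileio.TextSink expects one string per element:
import json
import apache_beam as beam
from apache_beam.io import fileio

(records  # a PCollection of dicts, each containing 'key_field'
 | 'Serialize' >> beam.Map(json.dumps)
 | 'WriteSharded' >> fileio.WriteToFiles(
       path='gs://my-bucket/output/',                            # placeholder output location
       destination=lambda line: json.loads(line)['key_field'],   # shard by the key field
       sink=lambda dest: fileio.TextSink(),
       file_naming=fileio.destination_prefix_naming()))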
More documentation here:
https://beam.apache.org/releases/pydoc/2.14.0/apache_beam.io.fileio.html#dynamic-destinations
And the JIRA issue here:
https://issues.apache.org/jira/browse/BEAM-2857