Write to dynamic destination to cloud storage in dataflow in Python - python-2.7

I am trying to read a big file from Cloud Storage and shard it according to a given field.
I'm planning to do Read | Map(lambda x: (x[key field], x)) | GroupByKey | Write to a file named after the key field.
However, I couldn't find a way to write dynamically to Cloud Storage. Is this functionality supported?
Thank you,
Yiqing

Yes, you can use the FileSystems API to create the files.

An experimental write was added to the Beam python SDK in 2.14.0, beam.io.fileio.WriteToFiles:
my_pcollection | beam.io.fileio.WriteToFiles(
    path='/my/file/path',
    destination=lambda record: 'avro' if record['type'] == 'A' else 'csv',
    sink=lambda dest: AvroSink() if dest == 'avro' else CsvSink(),
    file_naming=beam.io.fileio.destination_prefix_naming())
which can be used to write to different files per-record.
You can skip the GroupByKey; just use destination to decide which file each record is written to. The return value of destination must be a value that can be grouped by.
More documentation here:
https://beam.apache.org/releases/pydoc/2.14.0/apache_beam.io.fileio.html#dynamic-destinations
And the JIRA issue here:
https://issues.apache.org/jira/browse/BEAM-2857
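Applied to the sharding case in the question, a minimal sketch might look like the following. It assumes each input line is a JSON object and that the key field is called customer (a hypothetical name); the paths in the example call are placeholders. fileio.TextSink and fileio.destination_prefix_naming come from the same module as WriteToFiles.

```python
import json

KEY_FIELD = 'customer'  # hypothetical key field name


def shard_key(line):
    """Return the destination for one JSON line: the value of KEY_FIELD."""
    return str(json.loads(line)[KEY_FIELD])


def run(input_pattern, output_dir, pipeline_args=None):
    # Imported here so shard_key can be tried without Beam installed.
    import apache_beam as beam
    from apache_beam.io import fileio

    with beam.Pipeline(argv=pipeline_args) as p:
        (p
         | 'Read' >> beam.io.ReadFromText(input_pattern)
         | 'Write' >> fileio.WriteToFiles(
             path=output_dir,
             destination=shard_key,  # one destination per key value
             sink=lambda dest: fileio.TextSink(),
             file_naming=fileio.destination_prefix_naming()))

# Example call (placeholder paths):
# run('gs://my-bucket/input/big-file.json', 'gs://my-bucket/output/')
```

With destination_prefix_naming, the destination value becomes the prefix of each output file name, so records sharing a key end up in files that start with that key.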

Related

Scalable way to read large numbers of files with Apache Beam?

I’m writing a pipeline where I need to read the metadata files (500.000+ files) from the Sentinel2 dataset located on my Google Cloud Bucket with apache_beam.io.ReadFromTextWithFilename.
It works fine on a small subset, but when I ran it on the full dataset it seems to block on "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json').
It doesn't even show up in the Dataflow jobs list.
The pipeline looks like this:
with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json')
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )
    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )
    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
        )
    )
I'm wondering:
Is there a smarter way to read large numbers of files?
Will copying the metadata files into one folder be more performant? (How much more does it cost to traverse sub-folders, as opposed to files in one folder)
Is the way to go to match the file names first with apache_beam.io.fileio.MatchAll and then read and extract in one or two following ParDos?
This is probably due to the pipeline running into Dataflow API limits when splitting the text source glob into a large number of sources.
The current solution is to use the ReadAllFromText transform, which should not run into this limit.
In the future we hope to update the ReadFromText transform for this case as well by using the Splittable DoFn framework.
It looks like I was suffering from a case of unwanted fusion.
Drawing inspiration from the page about file-processing on the Apache Beam website, I tried to add a Reshuffle to the pipeline.
I also upgraded to a paid Google Cloud account, thus getting higher quotas.
That resulted in Dataflow handling the job a lot better.
In fact, Dataflow wanted to scale to 251 workers for my BigQuery write job. At first it didn't provision more workers, so I stopped the job and set --num_workers=NUM_WORKERS and --max_num_workers=NUM_WORKERS, where NUM_WORKERS was the maximum quota for my project. When running with those parameters, it scaled up automatically.
My final pipeline looks like this:
with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | MatchFiles(f'gs://{BUCKET}/{DATA_FOLDER}/*metadata.json')
        | ReadMatches()
        | beam.Reshuffle()
        | beam.Map(lambda x: (x.metadata.path, x.read_utf8()))
        | beam.ParDo(ExtractMetaData())
    )
    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )
    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE
        )
    )
Appendix
I also got a hint that Splittable DoFns (splittable ParDos) might be a solution, but I have not tested that.

upload greater than 5TB object to google cloud storage bucket

The max size limit for a single object upload is 5TB. How does one do a backup of a larger single workload? Say - I have a single file that is 10TB or more - that needs to be backed up to cloud storage?
Also, a related question - if the 10 TB is spread across multiple files (each file is less than 5TB) in a single folder, that shouldn't affect anything correct? A single object can't be greater than 5TB, but there isn't a limit on the actual bucket size. Say a folder containing 3 objects equal to 10TB, that upload will be automatically split across multiple buckets (console or gsutil upload)?
Thanks
You are right. The current size limit for an individual object is 5 TB, so a larger file has to be split into parts that each stay under that limit.
As for the total bucket size, no limit is documented. In fact, the overview says "Cloud Storage provides worldwide, highly durable object storage that scales to exabytes of data.".
You might also take a look at the GCS best practices.
Maybe look over http://stromberg.dnsalias.org/~dstromberg/chunkup.html ?
I wrote it to back up to SRB, which is a sort of precursor to Google and Amazon bucketing. But chunkup itself doesn't depend on SRB or any other form of storage.
Usage: /usr/local/etc/chunkup -c chunkhandlerprog -n nameprefix [-t tmpdir] [-b blocksize]
-c specifies the chunk handling program to handle each chunk, and is required
-n specifies the nameprefix to use on all files created, and is required
-t specifies the temporary directory to write files to, and is optional. Defaults to $TMPDIR or /tmp
-b specifies the length of files to create, and is optional. Defaults to 1 gigabyte
-d specifies the number of digits to use in filenames. Defaults to 5
You can see example use of chunkup in http://stromberg.dnsalias.org/~strombrg/Backup.remote.html#SRB
HTH
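The same idea can be sketched in a few lines of Python. This is not how chunkup itself is implemented; it only illustrates splitting one large file into numbered chunks, each small enough to upload as its own object. The function name and the parameter names (modelled loosely on the -n, -b and -d options above) are assumptions made for this example.

```python
import os


def split_file(src_path, out_dir, name_prefix, chunk_size=1 << 30, digits=5):
    """Split src_path into numbered chunk files of at most chunk_size bytes.

    Returns the list of chunk paths in order.
    """
    chunk_paths = []
    buf_size = 1 << 20  # copy in 1 MiB pieces to keep memory use small
    with open(src_path, 'rb') as src:
        index = 0
        while True:
            remaining = chunk_size
            piece = src.read(min(buf_size, remaining))
            if not piece:
                break  # no more data: do not create an empty chunk
            chunk_path = os.path.join(
                out_dir, '%s%0*d' % (name_prefix, digits, index))
            with open(chunk_path, 'wb') as out:
                while piece:
                    out.write(piece)
                    remaining -= len(piece)
                    if remaining == 0:
                        break  # this chunk is full
                    piece = src.read(min(buf_size, remaining))
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Concatenating the chunks in name order restores the original file, so the zero-padded index matters: it keeps the lexical order of the object names equal to the chunk order.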

Skip top N lines in snowflake load

My actual data in csv extracts starts from line 10. How can I skip top few lines in snowflake load using copy or any other utility. Do we have anything similar to SKIP_HEADER ?
I have files on S3 and its my stage. I would be creating a snowpipe later on this datasource.
Yes, there is a SKIP_HEADER option for CSV that lets you skip a specified number of rows. You set it when defining a file format. Please have a look here:
https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html#type-csv
So you create a file format for the CSV files you have in mind and then reference it when calling the COPY command.
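As a rough sketch, the two statements could be generated like this. The file format, table and stage names are hypothetical placeholders, and the helper only builds SQL strings; running them against Snowflake is left to whatever client you already use.

```python
def skip_header_statements(format_name, table, stage, skip_lines):
    """Build the CREATE FILE FORMAT and COPY INTO statements as strings."""
    create_format = (
        "CREATE OR REPLACE FILE FORMAT {fmt} "
        "TYPE = 'CSV' SKIP_HEADER = {n};"
    ).format(fmt=format_name, n=int(skip_lines))
    copy_into = (
        "COPY INTO {table} FROM @{stage} "
        "FILE_FORMAT = (FORMAT_NAME = '{fmt}');"
    ).format(table=table, stage=stage, fmt=format_name)
    return create_format, copy_into


# Data starts on line 10, so the first 9 lines are skipped.
for statement in skip_header_statements(
        'csv_skip9', 'my_table', 'my_s3_stage', 9):
    print(statement)
```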

Spark Dataframe loading 500k files on EMR

I am running pyspark job on EMR ( 5.5.1 ) with Spark 2.1.0, Hadoop 2.7.3, Hive 2.1.1, Sqoop 1.4.6 and Ganglia 3.7.2 which is loading data from s3. There are multiple buckets that contain input files so I have a function which uses boto to traverse through them and filter them out according to some pattern.
Cluster Size: Master => r4.xlarge , Workers => 3 x r4.4xlarge
Problem:
The function getFilePaths returns a list of s3 paths which is directly fed to spark dataframe load method.
Using Dataframe
file_list = getFilePaths() # ['s3://some_bucket/log.json.gz','s3://some_bucket/log2.json.gz']
schema = getSchema() # for mapping to the json files
df = sparkSession.read.format('json').load(file_list, schema=schema)
Using RDD
master_rdd = sparkSession.sparkContext.union(
    map(lambda file: sparkSession.sparkContext.textFile(file), file_list)
)
df = sparkSession.createDataFrame(master_rdd, schema=schema)
The file_list can be a huge list ( max 500k files ) due to large amount of data & files. Calculation of these paths only takes 5-20mins but when trying to load them as dataframe with spark, spark UI remains inactive for hours i.e. not processing anything at all. The inactivity period for processing 500k files is above 9hrs while for 100k files it is around 1.5hrs.
Viewing Ganglia metrics shows that only the driver is running/processing while the workers are idle. No logs are generated until the Spark job has finished, and I haven't had any success with 500k files.
I have tried the s3 and s3n connectors, but with no success.
Questions:
What is the root cause of this delay?
How can I debug it properly?
In general, Spark/Hadoop prefer to have large files they can split instead of huge numbers of small files. One approach you might try though would be to parallelize your file list and then load the data in a map call.
I don't have the resources right now to test this out, but it should be something similar to this:
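import json
import boto3
file_list = getFilePaths()
schema = getSchema()  # for mapping to the json files
paths_rdd = sc.parallelize(file_list)
def get_data(path):
    # Assumes paths look like 's3://bucket/key', as in the question.
    bucket, key = path[len('s3://'):].split('/', 1)
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, key)
    data = obj.get()['Body'].read().decode('utf-8')
    # One JSON document per line; skip empty lines.
    return [json.loads(r) for r in data.split('\n') if r]
rows_rdd = paths_rdd.flatMap(get_data)
df = spark.createDataFrame(rows_rdd, schema=schema)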
You could also make this a little more efficient by using mapPartitions instead, so you don't need to recreate the S3 resource for every file.
EDIT 6/14/18:
With regards to handling the gzip data, you can decompress a stream of gzip data using python as detailed in this answer: https://stackoverflow.com/a/12572031/1461187 . Basically just pass in obj.get()['Body'].read() into the function defined in that answer.
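Putting the mapPartitions suggestion and the gzip note together, a sketch could look like this. It is untested against a real cluster and assumes Python 3 (for gzip.decompress), paths of the form s3://bucket/key, and one JSON document per line; the helper names are made up for this example. The client factory is a parameter only so the function can be exercised without S3.

```python
import gzip
import json


def make_s3_client():
    # Imported lazily so the module loads on machines without boto3.
    import boto3
    return boto3.client('s3')


def split_s3_path(path):
    """Split 's3://bucket/key' into (bucket, key)."""
    bucket, key = path[len('s3://'):].split('/', 1)
    return bucket, key


def read_partition(paths, client_factory=make_s3_client):
    """Yield parsed JSON rows for every path in one partition."""
    client = client_factory()  # created once per partition, not per file
    for path in paths:
        bucket, key = split_s3_path(path)
        body = client.get_object(Bucket=bucket, Key=key)['Body'].read()
        if key.endswith('.gz'):
            body = gzip.decompress(body)
        for line in body.decode('utf-8').splitlines():
            if line.strip():
                yield json.loads(line)

# Usage with the names from the answer above:
# rows_rdd = paths_rdd.mapPartitions(read_partition)
# df = spark.createDataFrame(rows_rdd, schema=schema)
```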
There are two performance issues surfacing:
1. Reading the files: gzip files can't be split to have their workload shared across workers, though with 50 MB files there's little benefit in splitting things up.
2. The way the S3 connectors Spark uses mimic a directory structure is a real performance killer for complex directory trees.
Issue #2 is what slows down partitioning: the initial code that decides what to do, which runs before any of the computation.
How would I go about trying to deal with this? Well, there's no magic switch here. But:
Have fewer, bigger files; as noted, Avro is good, and so are Parquet and ORC later.
Use a very shallow directory tree. Are these files all in one single directory, or in a deep directory tree? The latter is worse.
Coalesce the files first.
I'd also avoid any kind of schema inference; it sounds like you aren't doing that (good!), but for anyone else reading this answer: know that for CSV and presumably JSON, schema inference means "read through all the data once just to work out the schema"
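As an illustration of the "coalesce the files first" step, here is a small sketch that merges many gzip-compressed JSON-lines files into one. It assumes the files are available on a local filesystem (for S3 you would need to download and upload around it), and the function name is made up for this example. Since a gzip file cannot be split, it is probably better to merge into several medium-sized outputs than into one giant file.

```python
import gzip


def coalesce_gzip_lines(src_paths, dest_path):
    """Merge gzip-compressed line files into one gzip file.

    Returns the number of lines written.
    """
    count = 0
    with gzip.open(dest_path, 'wb') as dest:
        for src_path in src_paths:
            with gzip.open(src_path, 'rb') as src:
                for line in src:
                    if not line.strip():
                        continue  # drop blank lines
                    # Make sure every record ends with exactly one newline.
                    dest.write(line.rstrip(b'\r\n') + b'\n')
                    count += 1
    return count
```

Rewriting the line endings matters when an input file lacks a trailing newline; otherwise its last record would run into the first record of the next file.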

HDFS sink: "clever" folder routing

I am new to Flume (and to HDFS), so I hope my question is not stupid.
I have a multi-tenant application (about 100 different customers as of now).
I have 16 different data types.
(In production, we have approx. 15 million messages/day through our RabbitMQ.)
I want to write all my events to HDFS, separated by tenant, data type, and date, like this:
/data/{tenant}/{data_type}/2014/10/15/file-08.csv
Is it possible with one sink definition? I don't want to duplicate configuration, and new clients arrive every week or so.
In the documentation, I see:
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%Y/%m/%d/%H/
Is something like this possible?
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%tenant/%type/%Y/%m/%d/%H/
I want to write to different folders according to my incoming data.
Yes, this is indeed possible. You can use either the event metadata (headers) or some field in the incoming data to route the output.
For example, in my case I am getting different types of log data and I want to store it in respective folders accordingly. Also in my case the first word in my log lines is the file name. Here is the config snippet for the same.
Interceptor:
dataplatform.sources.source1.interceptors = i3
dataplatform.sources.source1.interceptors.i3.type = regex_extractor
dataplatform.sources.source1.interceptors.i3.regex = ^(\\w*)\t.*
dataplatform.sources.source1.interceptors.i3.serializers = s1
dataplatform.sources.source1.interceptors.i3.serializers.s1.name = filename
HDFS Sink
dataplatform.sinks.sink1.type = hdfs
dataplatform.sinks.sink1.hdfs.path = hdfs://server/events/provider=%{filename}/years=%Y/months=%Y%m/days=%Y%m%d/hours=%H
Hope this helps.
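To make the effect of that config easier to see, the following plain-Python sketch emulates what happens to one event: the interceptor's regex captures the first word into a filename header, and the sink path substitutes that header and the time escapes. This is only an emulation for illustration, not Flume code; the sample log line and timestamp are invented.

```python
import re
from datetime import datetime

# Same pattern as in the interceptor config, after properties-file unescaping.
FIRST_WORD = re.compile(r'^(\w*)\t.*')
PATH_TEMPLATE = ('hdfs://server/events/provider={filename}'
                 '/years=%Y/months=%Y%m/days=%Y%m%d/hours=%H')


def resolve_path(line, timestamp):
    """Return the directory an event line would be routed to, or None."""
    match = FIRST_WORD.match(line)
    if match is None:
        return None  # no header extracted for this line
    # Fill in the header value first, then the time escapes.
    with_header = PATH_TEMPLATE.format(filename=match.group(1))
    return timestamp.strftime(with_header)


print(resolve_path('nginx\tGET /index.html', datetime(2014, 10, 15, 8)))
```

For the tenant and data type in the question, the same pattern would apply with two headers instead of one, provided both values can be extracted from each event.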
A possible solution is to write an interceptor that passes the tenant value.
Please refer to the link below:
http://hadoopi.wordpress.com/2014/06/11/flume-getting-started-with-interceptors/