Regarding Flink stream sink to HDFS

I am writing a Flink job in which I read a file from the local file system and write it to a database using "writeUsingOutputFormat".
Now my requirement is to write to HDFS instead of the database.
Could you please help me with how I can do this in Flink?
Note: HDFS is up and running on my local machine.

Flink provides an HDFS connector which can be used to write data to any file system supported by Hadoop's FileSystem abstraction.
The provided sink is a bucketing sink, which partitions the data stream into folders containing rolling files. The bucketing behavior, as well as the writing, can be configured with parameters such as the batch size and the batch rollover time interval.
The Flink documentation gives the following example:
import java.time.ZoneId;
import org.apache.flink.streaming.connectors.fs.SequenceFileWriter;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

DataStream<Tuple2<IntWritable, Text>> input = ...;

BucketingSink<Tuple2<IntWritable, Text>> sink = new BucketingSink<>("/base/path"); // with fs.defaultFS pointing at HDFS, this resolves to an HDFS path
sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HHmm", ZoneId.of("America/Los_Angeles")));
sink.setWriter(new SequenceFileWriter<IntWritable, Text>());
sink.setBatchSize(1024 * 1024 * 400); // roll a new part file after 400 MB
sink.setBatchRolloverInterval(20 * 60 * 1000); // or after 20 minutes, whichever comes first

input.addSink(sink);

The newer Streaming File Sink is probably a better choice than the Bucketing Sink at this point. This description is from the Flink 1.6 release notes (note that support for S3 was added in Flink 1.7):
The new StreamingFileSink is an exactly-once sink for writing to filesystems which capitalizes on the knowledge acquired from the previous BucketingSink. Exactly-once is supported through integration of the sink with Flink's checkpointing mechanism. The new sink is built upon Flink's own FileSystem abstraction and it supports local file system and HDFS, with plans for S3 support in the near future [now included in Flink 1.7]. It exposes pluggable file rolling and bucketing policies. Apart from row-wise encoding formats, the new StreamingFileSink comes with support for Parquet. Other bulk-encoding formats like ORC can be easily added using the exposed APIs.
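As a concrete starting point for the original question, here is a minimal sketch of a StreamingFileSink writing plain strings to HDFS. It assumes Flink 1.6+; the namenode address, checkpoint interval, and input file are placeholders:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // part files are only finalized on checkpoints

DataStream<String> lines = env.readTextFile("file:///tmp/input.txt"); // placeholder local input

StreamingFileSink<String> sink = StreamingFileSink
    .forRowFormat(new Path("hdfs://localhost:9000/flink/output"), // placeholder namenode address
                  new SimpleStringEncoder<String>("UTF-8"))
    .build();

lines.addSink(sink);
env.execute("local-file-to-hdfs");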

Related

How to enable pipe mode for XGBoost BYO script distributed training on parquet data using SageMaker?

If I use a BYO (bring your own) script for XGBoost with distributed training and Pipe input mode on Parquet data using SageMaker, what change is needed to the distributed training code (it works in script mode on Parquet data) to enable Pipe mode?
NOTE: If I change input_mode to Pipe, no model output is uploaded to S3 even though the training step completes. However, there are warning messages such as WARNING:root:Host algo-2 not being used for distributed training. and WARNING:root:Host algo-1 not being used for distributed training.
Can anyone help with this?

Stream incremental file uploads off of GCS into Kafka

I'm working with a pipeline that pushes JSON entries in batches to my Google Cloud Storage bucket. I want to get this data into Kafka.
The way I'm going about it now is with a lambda function that gets triggered every minute to find the files that have changed, open streams from them, read them line by line, and periodically batch those lines as messages into a Kafka producer.
This process is pretty terrible, but it works... eventually.
I was hoping there'd be a way to do this with Kafka Connect or Flink, but there really isn't much development around detecting incremental file additions to a bucket.
Do the JSON entries end up in different files in your bucket? Flink has support for streaming in new files from a source.
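If they do, Flink's continuous file monitoring can pick them up as they arrive. A hedged sketch, assuming the GCS Hadoop connector is on the classpath so gs:// paths resolve; the bucket path and poll interval are hypothetical:

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Re-scan the bucket path every 60 seconds and stream in any files that appear or change.
String path = "gs://my-bucket/json-entries/"; // hypothetical bucket
DataStream<String> lines = env.readFile(
    new TextInputFormat(new Path(path)),
    path,
    FileProcessingMode.PROCESS_CONTINUOUSLY,
    60_000L);

// From here the lines can be parsed as JSON and handed to a FlinkKafkaProducer sink.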

Used Dataflow's DLP to read from GCS and write to BigQuery - Only 50% data written to BigQuery

I recently started a Dataflow job to load data from GCS, run it through DLP's identification template, and write the masked data to BigQuery. I could not find a Google-provided template for batch processing, so I used the streaming one (ref: link).
Only 50% of the rows are written to the destination BigQuery table, and there has been no activity on the pipeline for a day even though it is still in the running state.
Yes, the DLP Dataflow template is a streaming pipeline, but with some easy changes you can also use it as a batch job. Here is the template source code. As you can see, it uses the FileIO transform and polls/watches for any new file every 30 seconds. If you take out the window transform and the continuous polling syntax, you should be able to execute it as a batch job.
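For illustration, a hedged sketch of that difference using Beam's FileIO; the option accessor getInputFilePattern is hypothetical, and the real template wires things up differently:

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.Watch;
import org.joda.time.Duration;

// Streaming (what the template does): keep polling the input pattern every 30 seconds.
p.apply(FileIO.match()
        .filepattern(options.getInputFilePattern()) // hypothetical option accessor
        .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
 .apply(FileIO.readMatches());

// Batch: match the pattern once, with no windowing or continuous polling, so the job can finish.
p.apply(FileIO.match().filepattern(options.getInputFilePattern()))
 .apply(FileIO.readMatches());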
In terms of the pipeline not processing all the data, can you confirm whether you are running a large file with the default settings (e.g. workerMachineType, numWorkers, maxNumWorkers)? The current pipeline code uses line-based offsetting, which requires a highmem machine type with a large number of workers if the input file is large; for example, for a 10 GB file with 80M lines you may need 5 highmem workers.
One thing you can try is to trigger the pipeline with more resources, e.g. --workerMachineType=n1-highmem-8 --numWorkers=10 --maxNumWorkers=10, and see if that helps.
Alternatively, there is a V2 solution that uses byte-based offsetting with the state and timer APIs for optimized batching and resource utilization that you can try out.

How exactly does Spark on EMR read from S3?

Just a few simple questions on the actual mechanism behind reading a file on S3 into an EMR cluster with Spark:
Does spark.read.format("com.databricks.spark.csv").load("s3://my/dataset/").where($"state" === "WA") copy the whole dataset into the EMR cluster's local HDFS and then perform the filter? Or does it filter records as it brings the dataset into the cluster? Or does it do neither? If so, what is actually happening?
The official documentation lacks an explanation of what's going on (or if it does have an explanation, I cannot find it). Can someone explain, or link to a resource with such an explanation?
I can't speak for the closed-source AWS connector, but the ASF s3a: connector does its work in S3AInputStream.
Reading data is done over HTTPS, which has a slow startup time, and if you need to stop the download before the GET has finished, it forces you to abort the TCP stream and create a new one.
To keep this cost down, the code has features like:
Lazy seek: when you do a seek(), it updates its internal pointer but doesn't issue a new GET until you actually do a read.
It chooses whether to abort() or read to the end of a GET based on how much is left.
It has 3 IO modes:
"sequential": the GET content range is (pos, EOF). Best bandwidth, worst performance on seeks. For: CSV, .gz, ...
"random": small GETs of min(block-size, length(read)). Best for columnar data (ORC, Parquet) compressed in a seekable format (Snappy).
"adaptive" (new last week, based on some work from Microsoft on the Azure WASB connector): starts off sequential, and as soon as you do a backwards seek it switches to random IO.
The code is all there; improvements are welcome. The current performance work (especially random IO) is based on TPC-DS benchmarking of ORC data on Hive, BTW.
Assuming you are reading CSV and filtering there, it'll read the entire CSV file and then filter. This is horribly inefficient for large files. It's best to import the data into a columnar format and use predicate pushdown so the layers below can seek around the file, filtering and reading only the columns needed; a sketch follows below.
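For illustration, a sketch in Spark's Java API; the bucket paths are made up, and fs.s3a.experimental.input.fadvise is the Hadoop 2.8+ property behind the IO modes above (the adaptive policy is configured as "normal"):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("s3a-columnar-read")
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random") // "sequential", "random", or "normal" (adaptive)
    .getOrCreate();

// One-off conversion: read the CSV once and rewrite it as Parquet.
spark.read()
     .option("header", "true")
     .csv("s3a://my-bucket/dataset-csv/")
     .write()
     .parquet("s3a://my-bucket/dataset-parquet/");

// Subsequent reads: the filter and column selection can be pushed down to the Parquet
// reader, so only the needed row groups and columns are fetched from S3.
Dataset<Row> wa = spark.read()
    .parquet("s3a://my-bucket/dataset-parquet/")
    .where("state = 'WA'");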
Loading data from S3 (s3://...) usually goes via EMRFS on EMR.
EMRFS accesses S3 directly (not via HDFS).
When Spark loads data from S3, it is stored as a Dataset in the cluster according to the StorageLevel (memory or disk).
Finally, Spark filters the loaded data; a sketch of this sequence follows.
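A small sketch of that sequence, reusing the reader from the question; the persist call and storage level are illustrative, not something EMRFS requires:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;

// Read from S3 via EMRFS; nothing is materialized yet because Spark is lazy.
Dataset<Row> ds = spark.read()
    .format("com.databricks.spark.csv")
    .load("s3://my/dataset/");

// Optionally pin the loaded data on the cluster nodes (memory, spilling to disk).
ds.persist(StorageLevel.MEMORY_AND_DISK());

// The filter runs on the executors over the partitions they have read.
Dataset<Row> wa = ds.where("state = 'WA'");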
When you specify files located on S3, they are read into the cluster, and the processing happens on the cluster nodes.
However, this may be changing with S3 Select, which is now in preview.

How to read file in Apache Samza from local file system and hdfs system

Looking for an approach in Apache Samza to read a file from the local file system or HDFS,
then apply filters, aggregations, where conditions, order by, and group by on batches of data.
Please provide some help.
You should create a system for each source of data you want to use. For example, to read from a file, you should create a system with the FileReaderSystemFactory; for HDFS, create a system with the HdfsSystemFactory. Then, you can use the regular process callback or windowing to process your data, as in the sketch below.
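A minimal sketch of that process callback; the filter condition and the output system/stream names ("kafka", "filtered-lines") are hypothetical:

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class FileFilterTask implements StreamTask {
  // Hypothetical output: a Kafka topic named "filtered-lines".
  private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-lines");

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String line = (String) envelope.getMessage();
    if (line.contains("ERROR")) { // example filter; replace with the real predicate
      collector.send(new OutgoingMessageEnvelope(OUTPUT, line));
    }
  }
}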
You can feed your Samza job using a standard Kafka producer. To make it easy for you, you can use Logstash; you need to create a Logstash script where you specify:
input as a local file or HDFS
filters (optional): here you can do basic filtering, aggregation, etc.
a Kafka output with the specific topic you want to feed
I was using this approach to feed my Samza job from a local file.
Another approach could be using Kafka Connect
http://docs.confluent.io/2.0.0/connect/