How to read an HDFS file with Akka Streams?

Can someone give an example of reading a text file at an HDFS location with Akka Streams? I'm new to Akka Streams, and after a lot of googling I haven't been able to find a solution.
Any help would be appreciated. Thanks.

Use the Alpakka HDFS connector, with which you can create a HdfsSource to read from HDFS.
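For reference, here is a minimal sketch of what that can look like with the Scala DSL, assuming the akka-stream-alpakka-hdfs dependency is on the classpath and Akka 2.6+; the namenode URI and file path below are placeholders:

    import akka.actor.ActorSystem
    import akka.stream.alpakka.hdfs.scaladsl.HdfsSource
    import akka.stream.scaladsl.{Framing, Sink}
    import akka.util.ByteString
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ReadTextFileFromHdfs extends App {
      // Akka 2.6+: the ActorSystem provides the stream materializer implicitly.
      implicit val system: ActorSystem = ActorSystem("hdfs-read")
      import system.dispatcher

      // Plain Hadoop client configuration; namenode URI and path are placeholders.
      val conf = new Configuration()
      conf.set("fs.defaultFS", "hdfs://namenode:8020")
      val fs = FileSystem.get(conf)

      // HdfsSource.data emits the file contents as ByteString chunks.
      val done =
        HdfsSource
          .data(fs, new Path("/tmp/input.txt"))
          // Split the byte stream into lines so we can treat it as a text file.
          .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 65536, allowTruncation = true))
          .map(_.utf8String)
          .runWith(Sink.foreach(println))

      done.onComplete(_ => system.terminate())
    }

HdfsSource also has variants such as HdfsSource.compressed for codec-compressed files; check the Alpakka HDFS documentation for the exact signatures in your version.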

Related

Creating an API for an AWS S3 upload?

I have about 200 GB of files on a Windows server file share that I want to upload to S3. From looking at the documentation, it looks like I have to write my own API to do this.
I'm a networking and server guy with no coding experience. Can anyone point me in the right direction for getting started? Has anyone done this before who could give me a high-level overview of the steps involved, so that I can go and research each step? Any info will be greatly appreciated. Thanks.
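One route that avoids writing any code at all: the AWS CLI's sync command copies a local folder tree into a bucket. The share path and bucket name below are placeholders:

    aws s3 sync "D:\share\logs" s3://my-bucket/logs/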

Does Google store the requests that are sent via the Google DLP API?

I am trying to understand whether Google stores the text or data that is sent to the DLP API. For example, I have some data (text files) locally, and I am planning to use Google DLP to help identify sensitive information and maybe transform it back later.
Would Google store the text files data that I am using? In other words, would it retain a copy of the files that I am sending? I am trying to read through the security and compliance page, but there is nothing that I could find that clearly explains this.
Could anyone please advise?
Here is what I was looking at https://cloud.google.com/dlp/data-security
The Google DLP API only classifies and identifies the kinds of (mostly sensitive) data we want to analyse; Google doesn't store the data we send.
We certainly don't store the data being scanned with the *Content api methods beyond what is needed to process it and return a response to you.

Kafka Connect HDFS - How to make it work?

This is not a very specific question; however, I have not found a single document that explains how you actually use the Kafka HDFS connector.
Basically, I have a Kafka topic containing JSON-encoded strings, and I would like to send the data to HDFS as Avro-formatted data.
Any help would be more than welcome!
What specifically are you trying to achieve with the HDFS connector? While the docs certainly could use some work, they do cover the basics of how to both configure and run the hdfs-connector. If you could be a little more specific about the goal you are trying to achieve, it will be easier to offer you some guidance.
Thanks,
Ryan
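To give this a concrete starting point, here is a rough sketch of a standalone-mode properties file for the Confluent HDFS sink connector. The topic name, HDFS URL, and flush size are placeholders, and the exact property set should be checked against the connector documentation for the version in use:

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=my-json-topic
    hdfs.url=hdfs://namenode:8020
    flush.size=1000
    # Write Avro files; AvroFormat is the connector's Avro output format class.
    format.class=io.confluent.connect.hdfs.avro.AvroFormat

One caveat for the JSON-to-Avro goal: the connector can only write Avro if the records carry a schema (for example via the converter configured on the Connect worker); plain schemaless JSON strings will not be converted automatically.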

Amazon EMR: best compression/file format

We currently have some files stored on an S3 server. The files are log files (.log extension but plain text content) that have been gzipped to reduce disk space.
But gzip isn't splittable, so we are now looking for good alternatives for storing and processing our files on Amazon EMR.
So what is the best compression or file format to use on log files? I came across Avro, SequenceFile, bzip2, LZO and Snappy. It's a lot of options and I am a bit overwhelmed.
So I would appreciate any insights in this matter.
Data is to be used for pig jobs (map/reduce jobs)
Kind regards
If you check the Best Practices for Amazon EMR, there's a section talking about compressing the outputs:
Compress mapper outputs: compression means less data written to disk, which improves disk I/O. You can monitor how much data is written to disk by looking at the FILE_BYTES_WRITTEN Hadoop metric. Compression can also help with the shuffle phase, where reducers pull data, and can benefit your cluster's HDFS data replication as well. Enable compression by setting mapred.compress.map.output to true. When you enable compression, you can also choose the compression algorithm; LZO has better performance and is faster to compress and decompress.
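As a concrete illustration of that setting for Pig jobs (a sketch only: these are the Hadoop 1.x property names quoted above, the Hadoop 2.x equivalents are mapreduce.map.output.compress and mapreduce.map.output.compress.codec, and the codec class is just one common choice):

    -- Enable compression of intermediate map output.
    SET mapred.compress.map.output true;
    -- Choose a codec for the intermediate data; Snappy is widely available,
    -- while LZO needs the hadoop-lzo libraries installed on the cluster.
    SET mapred.map.output.compression.codec 'org.apache.hadoop.io.compress.SnappyCodec';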
We can use the following algorithms, depending on the use case:
GZIP: not splittable; high compression ratio; medium compression/decompression speed
Snappy: not splittable; low compression ratio; very fast compression/decompression
bzip2: splittable; very high compression ratio; slow compression/decompression
LZO: splittable (when indexed); low compression ratio; fast compression/decompression

Programmatically write files into HDFS

I am looking at options for Java programs that can write files into HDFS with the following requirements.
1) Transaction support: each file is either written fully and successfully, or fails completely without any partial file blocks being written.
2) Compression support/file formats: the compression type or file format can be specified when writing the contents.
I know how to write data into a file on HDFS by opening an FSDataOutputStream, as shown here. I'm just wondering whether there are libraries or out-of-the-box solutions that provide the support I mentioned above.
I stumbled upon Flume, which provides an HDFS sink that supports transactions, compression, file rotation, etc. But it doesn't seem to provide an API to be used as a library. The features Flume provides are tightly coupled to its architectural components (sources, channels, and sinks) and don't seem to be usable independently. All I need is the HDFS loading part.
Does anyone have some good suggestions?
I think using Flume as a "gateway" to HDFS would be a good solution. Your program sends data to Flume (using one of the interfaces provided by its sources), and Flume writes to HDFS.
This way you don't need to maintain a bunch of custom code for interacting with HDFS. On the other hand, you need to install and configure Flume, but in my experience that is much easier (see this comment for installation recommendations).
Finally, the Flume HDFS sink is an open-source component, so you are free to reuse its code under the terms of the Apache license. Get the sources here: https://git-wip-us.apache.org/repos/asf?p=flume.git;a=tree;f=flume-ng-sinks/flume-hdfs-sink;h=b9414a2ebc976240005895e3eafe37b12fad4716;hb=trunk
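For completeness, if you stay with the plain FSDataOutputStream route from the question, both requirements can be approximated by hand: write through a Hadoop compression codec into a temporary path and rename it into place only on success (an HDFS rename is atomic, so readers never see a partial file). A rough sketch, written in Scala over the plain Hadoop API, with the paths and namenode URI as placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.hadoop.util.ReflectionUtils

    object AtomicCompressedHdfsWrite extends App {
      val conf = new Configuration()
      conf.set("fs.defaultFS", "hdfs://namenode:8020") // placeholder namenode
      val fs = FileSystem.get(conf)

      val target = new Path("/data/output/events.gz")       // final, visible location
      val tmp    = new Path("/data/output/.events.gz.tmp")  // staging location

      // Requirement 2: wrap the stream in a Hadoop codec (gzip here; any codec works).
      val codec = ReflectionUtils.newInstance(classOf[GzipCodec], conf)

      try {
        val out = codec.createOutputStream(fs.create(tmp, true))
        out.write("hello hdfs\n".getBytes("UTF-8"))
        out.close()
        // Requirement 1: an HDFS rename is atomic, so readers either see the
        // complete file at the target path or nothing at all.
        if (!fs.rename(tmp, target))
          throw new java.io.IOException(s"rename $tmp -> $target failed")
      } catch {
        case e: Exception =>
          fs.delete(tmp, false) // drop the partial staging file on failure
          throw e
      }
    }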