How to write to ORC files using BucketingSink in Apache Flink? - hdfs

I'm working on a Flink streaming program that reads Kafka messages and dumps them to ORC files on AWS S3. I found there is no documentation about integrating Flink's BucketingSink with an ORC file writer, and no ORC file writer implementation that can be used with BucketingSink.
I'm stuck here, any ideas?

I agree, a BucketingSink writer for ORC files would be a great feature. However, it hasn't been contributed to Flink yet. You would have to implement such a writer yourself.
I'm sure the Flink community would help with designing and reviewing the writer if you consider contributing it to Flink.
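For reference, a custom writer would implement Flink's org.apache.flink.streaming.connectors.fs.Writer interface and wrap the orc-core writer. Below is an untested sketch in Scala that makes several assumptions: a hypothetical Event record type, a hard-coded two-column schema mapping, Flink 1.4-era interface signatures (the Writer interface differs between versions), and no handling of BucketingSink's pending/valid-length recovery, so it is not exactly-once safe as written.

    import org.apache.flink.streaming.connectors.fs.Writer
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.hive.ql.exec.vector.{BytesColumnVector, VectorizedRowBatch}
    import org.apache.orc.{OrcFile, TypeDescription, Writer => OrcWriter}
    import java.nio.charset.StandardCharsets.UTF_8

    // Hypothetical record type coming out of the Kafka source.
    case class Event(id: String, payload: String)

    class OrcSinkWriter(schemaString: String) extends Writer[Event] {

      @transient private var schema: TypeDescription = _
      @transient private var orcWriter: OrcWriter = _
      @transient private var batch: VectorizedRowBatch = _

      override def open(fs: FileSystem, path: Path): Unit = {
        // e.g. schemaString = "struct<id:string,payload:string>"
        schema = TypeDescription.fromString(schemaString)
        // The ORC writer opens its own stream on the path rather than using the
        // stream BucketingSink normally manages, so recovery semantics
        // (truncate / valid-length files) are not handled by this sketch.
        orcWriter = OrcFile.createWriter(
          path, OrcFile.writerOptions(new Configuration()).setSchema(schema))
        batch = schema.createRowBatch()
      }

      override def write(element: Event): Unit = {
        val row = batch.size
        batch.cols(0).asInstanceOf[BytesColumnVector].setVal(row, element.id.getBytes(UTF_8))
        batch.cols(1).asInstanceOf[BytesColumnVector].setVal(row, element.payload.getBytes(UTF_8))
        batch.size += 1
        if (batch.size == batch.getMaxSize) {
          orcWriter.addRowBatch(batch)
          batch.reset()
        }
      }

      override def flush(): Long = {
        if (batch.size > 0) {
          orcWriter.addRowBatch(batch)
          batch.reset()
        }
        getPos()
      }

      // Only an approximation of the bytes written; BucketingSink uses this
      // value for its file-rolling decisions.
      override def getPos(): Long =
        if (orcWriter != null) orcWriter.getRawDataSize else 0L

      override def close(): Unit = {
        flush()
        if (orcWriter != null) orcWriter.close()
      }

      override def duplicate(): Writer[Event] = new OrcSinkWriter(schemaString)
    }

It would then be plugged in with something like new BucketingSink[Event]("s3://my-bucket/orc-out").setWriter(new OrcSinkWriter("struct<id:string,payload:string>")).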

Related

How to write Parquet files on HDFS using C++?

I need to write in-memory data records to an HDFS file in Parquet format using C++. I know there is a parquet-cpp library on GitHub, but I can't find example code.
Could anybody share a copy of, or a link to, example code if you have any? Thanks.
There are examples for parquet-cpp in the GitHub repo, in the examples directory. They only deal with Parquet, though, and do not involve HDFS access.
For HDFS access from C++, you will need libhdfs from Apache Hadoop. Alternatively, you may use Apache Arrow, which has HDFS integration, as described here.

Apache Spark/AWS EMR and tracking of processed files

I have an AWS S3 folder where a large number of JSON files is stored. I need to ETL these files with Spark on AWS EMR and store the transformed data in AWS RDS.
I have implemented the Spark job for this purpose in Scala and everything is working fine. I plan to execute this job once a week.
From time to time, external logic can add new files to the AWS S3 folder, so the next time my Spark job starts I'd like to process only the new (unprocessed) JSON files.
Right now I don't know where to store the information about the processed JSON files so that the Spark job can decide which files/folders to process. Could you please advise me on the best practice (and how) to track these changes with Spark/AWS?
If it is a Spark Streaming job, checkpointing is what you are looking for; it is discussed here.
Checkpointing stores the state information (i.e. offsets, etc.) in an HDFS/S3 bucket, so when the job is started again, Spark picks up only the unprocessed files. Checkpointing also offers better fault tolerance in case of failures, as the state is handled automatically by Spark itself.
Again, checkpointing only works when the Spark job runs in streaming mode.
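If you do restructure this as a streaming job, a minimal sketch of the checkpointing setup described above could look like the following (Scala); the bucket paths, batch interval, and the RDS write are placeholders for your own logic:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Placeholder locations; replace with your own bucket/prefixes.
    val checkpointDir = "s3://my-bucket/checkpoints/json-etl"
    val inputDir      = "s3://my-bucket/incoming-json"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("json-etl")
      val ssc  = new StreamingContext(conf, Seconds(300))
      ssc.checkpoint(checkpointDir)

      // textFileStream only picks up files that appear after the stream has
      // started, and that state survives restarts because it is stored in the
      // checkpoint directory.
      ssc.textFileStream(inputDir).foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
          // parse the JSON and write the transformed records to RDS (e.g. via JDBC)
        }
      }
      ssc
    }

    // Recover from an existing checkpoint, or build a fresh context on first run.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()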

Using the Kinesis Client Library with Spark Streaming (PySpark)

I am looking to use the KCL with Spark Streaming in PySpark.
Any pointers would be helpful.
I tried a few of the suggestions from the Spark Kinesis Integration link, but I get an error about a Java class reference. It seems Python is calling into a Java class.
I tried linking
spark-streaming-kinesis-asl-assembly_2.10-2.0.0-preview.jar
while trying to run the KCL app on Spark, but I still get the error.
Please let me know if anyone has done this already.
If I search online, I mostly find material about Twitter and Kafka; I'm not able to get much help with regard to Kinesis.
Spark version used: 1.6.3
I encountered the same problem. The kinesis-asl jar had several files missing.
To overcome this problem, I included the following jars in my spark-submit:
amazon-kinesis-client-1.9.0.jar
aws-java-sdk-1.11.310.jar
jackson-dataformat-cbor-2.6.7.jar
Note: I am using Spark 2.3.0, so the jar versions listed might not be the same as those you should be using for your Spark version.
Hope this helps.

How to pull data from API and store it in HDFS

I am aware of Flume and Kafka, but these are event-driven tools. I don't need the import to be event-driven or real-time; scheduling it once a day would be enough.
What data ingestion tools are available for importing data from APIs into HDFS?
I am not using HBase either, just HDFS and Hive.
I have used R for this for quite a while, but I am looking for a more robust, perhaps native, solution for the Hadoop environment.
Look into using Scala or Python for this. There are a couple of ways to approach pulling from an API into HDFS. The first approach would be to write a script which runs on your edge node (essentially just a Linux server), pulls data from the API, and lands it in a directory on the Linux file system. Then your script can use HDFS file system commands to put the data into HDFS.
The second approach would be to use Scala or Python with Spark to call the API and load the data directly into HDFS using a spark-submit job (sketched below). Again, this script would be run from an edge node; it just uses Spark to bypass having to land the data on the local file system.
The first option is easier to implement. The second option is worth looking into if you have huge data volumes or an API that could be parallelized by making calls for multiple IDs/accounts at once.
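A minimal sketch of the second approach in Scala, assuming Spark 2.2+ (where spark.read.json accepts a Dataset[String]); the endpoint and target path are placeholders, and the job could be scheduled once a day with cron or Oozie:

    import org.apache.spark.sql.SparkSession
    import scala.io.Source

    // Placeholder endpoint and target directory.
    val apiUrl   = "https://api.example.com/v1/records"
    val hdfsPath = "hdfs:///data/raw/records"

    val spark = SparkSession.builder().appName("api-to-hdfs").getOrCreate()
    import spark.implicits._

    // Pull the payload on the driver; for large volumes, parallelize over
    // IDs/accounts and fetch inside mapPartitions instead.
    val payload = Source.fromURL(apiUrl).mkString

    // Let Spark parse the JSON and write it straight into HDFS, so nothing
    // ever lands on the local file system. If the API returns a single JSON
    // array, split it into one record per element first.
    val df = spark.read.json(Seq(payload).toDS())
    df.write.mode("overwrite").parquet(hdfsPath)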

jar containing org.apache.hadoop.hive.dynamodb

I am trying to programmatically load a DynamoDB table into HDFS (via Java, not Hive). I couldn't find examples online of how to do it, so I thought I'd download the jar containing org.apache.hadoop.hive.dynamodb and reverse-engineer the process.
Unfortunately, I couldn't find the file either :(.
Could someone answer the following questions for me (listed in order of priority):
A Java example that loads a DynamoDB table into HDFS (one that can be passed to a mapper as a table input format).
The jar containing org.apache.hadoop.hive.dynamodb.
Thanks!
It's in hive-bigbird-handler.jar. Unfortunately, AWS doesn't provide any source, or even Javadoc, for it. But you can find the jar on any node of an EMR cluster:
/home/hadoop/.versions/hive-0.8.1/auxlib/hive-bigbird-handler-0.8.1.jar
You might want to check out this article:
Amazon DynamoDB Part III: MapReducin’ Logs
Unfortunately, Amazon haven’t released the sources for
hive-bigbird-handler.jar, which is a shame considering its usefulness.
Of particular note, it seems it also includes built-in support for
Hadoop’s Input and Output formats, so one can write straight on
MapReduce Jobs, writing directly into DynamoDB.
Tip: search for hive-bigbird-handler.jar to get to the interesting parts... ;-)
1- I am not aware of any such example, but you might find this library useful. It provides InputFormats, OutputFormats, and Writable classes for reading data from and writing data to Amazon DynamoDB tables; a rough sketch of using it is shown below.
2- I don't think they have made it available publicly.
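For the programmatic (non-Hive) route, here is a rough sketch of driving that library from Spark in Scala. The class names and configuration keys are written from memory of the open-source emr-dynamodb-connector and should be verified against that project's README; the table name, region, and output path are placeholders.

    import org.apache.hadoop.dynamodb.DynamoDBItemWritable
    import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("dynamodb-to-hdfs"))

    // Connector configuration keys as I recall them; verify them for your
    // connector version before relying on this.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.input.tableName", "my-table") // placeholder table
    jobConf.set("dynamodb.regionid", "us-east-1")       // placeholder region

    // The connector exposes the table through an old-API Hadoop InputFormat,
    // which is also what you would hand to a plain MapReduce mapper.
    val rows = sc.hadoopRDD(
      jobConf,
      classOf[DynamoDBInputFormat],
      classOf[Text],
      classOf[DynamoDBItemWritable])

    // Dump the raw items to HDFS; a real job would first map each
    // DynamoDBItemWritable into whatever record format it needs.
    rows.map { case (_, item) => item.toString }
      .saveAsTextFile("hdfs:///data/dynamodb/my-table")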