How to enable compression on an existing HBase table?

I have a very big HBase table, apData, but it was not set as compressed when it was created. Right now it's 1.5TB, so I want to enable compression on this table. I did the following:
(1)disable apData
(2)alter apData,{NAME=>'cf1',COMPRESSION=>'snappy'}
(3)enable 'apData'.
But when I use "desc apData" to see the configuration, it's still showing:
COMPRESSION => 'NONE'
Why didn't it take effect? How should I compress the table, and also make sure that future data is compressed automatically when it is inserted?
Thanks in advance!

HBase only compresses new HFiles, either new data you write or the results of compactions, so the existing data stays in its old format until it is rewritten by a major compaction.
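A minimal sketch of doing the same thing programmatically, assuming the HBase 2.x Java client API (the table and column family names are taken from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableSnappy {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            TableName table = TableName.valueOf("apData");

            // Switch the column family 'cf1' to Snappy compression.
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("cf1"))
                    .setCompressionType(Compression.Algorithm.SNAPPY)
                    .build();
            admin.modifyColumnFamily(table, cf);

            // Existing HFiles keep their old format until they are rewritten,
            // so force a major compaction to compress the data already on disk.
            admin.majorCompact(table);
        }
    }
}

In the HBase shell, the equivalent after your alter is simply running major_compact 'apData'. desc should show the new COMPRESSION value right away; the disk footprint only shrinks once the compaction has rewritten the existing HFiles.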

Did you configure Snappy?
First verify that Snappy is loaded on all the nodes. To verify, use this command:
hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy
Once the Snappy test succeeds, the compression described above should work.
For more detail about the configuration and installation of Snappy:
http://hbase.apache.org/0.94/book/snappy.compression.html

You would need to configure HBase to use Snappy.
You can follow the steps in the reference link below to enable Snappy compression in HBase:
configure snappy compression with HBase
Hope it helps you.

We need to configure HBase to use Snappy if we installed Hadoop and HBase from tarballs; if we installed them from RPM or Debian packages, Snappy requires no HBase configuration.
Depending on the architecture of the machine we are installing on, we have to add one of the following lines to /etc/hbase/conf/hbase-env.sh:
For 32-bit platforms:
export HBASE_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-i386-32
For 64-bit platforms:
export HBASE_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64

Related

Flink job using JNI on EMR

I am trying to invoke a native library from within a Flink pipeline.
Environment:
EMR 5.34
Flink 1.13.1
I have built an uber (fat) JAR and made sure the .so file is available inside it.
However, I am facing the exception below when starting the Flink application.
Appreciate any pointers.
Caused by: java.lang.UnsatisfiedLinkError: no <<my native library artifact name>> in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
at java.lang.Runtime.loadLibrary0(Runtime.java:871)
Thank you,
Amit
I was able to resolve this, at least in "Session" mode, by setting the config parameters below in the flink-conf.yaml file.
env.java.opts: "-Djava.library.path=<<path to libraries>>"
containerized.master.env.LD_LIBRARY_PATH: "<<path to libraries>>"
containerized.taskmanager.env.LD_LIBRARY_PATH: "<<path to libraries>>"
You also need to use StreamExecutionEnvironment.registerCachedFile to ship the extracted files from the JobManager to the TaskManagers involved.
On the driver side:
StreamExecutionEnvironment.getExecutionEnvironment.registerCachedFile(directorywherefilesareextracted,"somekey")
Hope this helps if someone is looking for an approach to this kind of scenario.
You can access these cached files and store them in the directory configured in flink-conf.yaml so that they are included in the library path for execution.
getRuntimeContext().getDistributedCache().getFile("somekey")
To be able to access the RuntimeContext, you need to extend RichMapFunction.
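A hedged sketch of the task-side piece, assuming the Flink 1.13 Java API ("somekey" matches the name used with registerCachedFile above; the native library handling is a placeholder):

import java.io.File;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class NativeCallMapper extends RichMapFunction<String, String> {

    @Override
    public void open(Configuration parameters) throws Exception {
        // Resolve the file that Flink distributed to this TaskManager under "somekey".
        File nativeLib = getRuntimeContext().getDistributedCache().getFile("somekey");
        // If a whole directory was registered, append the .so file name here first.
        // Load the library explicitly instead of relying on java.library.path.
        System.load(nativeLib.getAbsolutePath());
    }

    @Override
    public String map(String value) {
        // Call into the native code here (placeholder).
        return value;
    }
}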
Update:
With all the above changes, when I run the Flink pipeline for the first time, it still complains that the library cannot be found. I checked the directory into which I extract the distributed cache, and the libraries are there.
Subsequent runs after the first failure succeed. I am not sure why I am seeing this behavior.
Update:
Made sure that the directory where we extract the libraries is already present when the EMR cluster is created, and it worked like a charm. I created the directory with a bootstrap action.

Does Google Dataproc support Apache Impala?

I am new to using cloud services and navigating Google's Cloud Platform is quite intimidating. When it comes to Google Dataproc, they do advertise Hadoop, Spark and Hive.
My question is, is Impala available at all?
I would like to do some benchmarking projects using all four of these tools, and I require Apache Impala alongside Spark/Hive.
No, Dataproc clusters support Hadoop, Spark, Hive and Pig using the default images.
Check this link for more information about the default image versions for Dataproc:
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions
You can also try a new Dataproc instance with additional components instead of using the default.
For example, you can create a Dataproc instance with HUE (Hadoop User Experience), an interface for working with Hadoop clusters built by Cloudera. The advantage here is that HUE ships Apache Impala as a default component, along with Pig, Hive, etc. So it's a pretty good option for using Impala.
Another solution would be to create your own cluster from scratch, but that is not a good idea unless you want to customize everything. That way you can install Impala yourself.
Here is a link, for more information:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/hue
Dataproc gives you SSH access to the master and workers, so it is possible to install additional software; according to the Impala documentation you would need to:
Ensure Impala Requirements.
Set up Impala on a cluster by building from source.
Remember that it is recommended to install the impalad daemon on each DataNode.
Cloud Dataproc supports Hadoop, Spark, Hive and Pig by default on the cluster. You can install optional components such as ZooKeeper, Jupyter, Anaconda, Kerberos, Druid and Presto (you can find the complete list here). In addition, you can install a large set of open source components using initialization actions.
Impala is not supported as an optional component and there is no initialization action script for it yet. You could get it to work on Dataproc with HDFS, but making it work with GCS may require non-trivial changes.

Read file from s3a along with AWS Athena SDK (1.11+)

I am writing a Spark/Scala program which submits a query to Athena (using aws-java-sdk-athena:1.11.420) and waits for the query to complete. Once the query is complete, my Spark program reads the query's output location directly from the S3 bucket over the s3a protocol, using Spark's sparkSession.read.csv() function.
In order to read the CSV file, I need to use org.apache.hadoop.hadoop-aws:1.8+ and org.apache.hadoop.hadoop-client:1.8+. Both of these libraries are built against AWS SDK version 1.10.6. However, AWS Athena does not have any SDK with that version; the oldest version they have is 1.11+.
How can I resolve the conflict? I need to use the latest version of the AWS SDK to get access to Athena, but hadoop-aws pushes me back to an older version.
Are there other versions of hadoop-aws that use 1.11+ AWS SDKs? If so, which versions will work for me? If not, what other options do I have?
I found out that I can use hadoop-aws:3.1+, which comes with aws-java-sdk-bundle:1.11+. This AWS SDK bundle includes Athena.
However, I still need to run Spark with hadoop-common:3.1+ libraries. The Spark cluster I have runs version 2.8 libraries.
Because my Spark cluster runs 2.8, spark-submit jobs were failing, while running the jar directly (java -jar myjar.jar) worked fine. This is because Spark replaces the Hadoop libraries I provide with the versions it is bundled with.
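A rough sketch of the end-to-end flow under those assumptions (aws-java-sdk-athena 1.11.x and the Spark Java API, with hadoop-aws 3.1+ and aws-java-sdk-bundle on the classpath; bucket, database, and query are placeholders):

import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
import com.amazonaws.services.athena.model.GetQueryExecutionRequest;
import com.amazonaws.services.athena.model.QueryExecutionContext;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AthenaToSpark {
    public static void main(String[] args) throws InterruptedException {
        AmazonAthena athena = AmazonAthenaClientBuilder.defaultClient();

        // Placeholder database, query, and output location.
        String outputLocation = "s3://my-athena-results-bucket/output/";
        StartQueryExecutionRequest start = new StartQueryExecutionRequest()
                .withQueryString("SELECT * FROM my_table LIMIT 100")
                .withQueryExecutionContext(new QueryExecutionContext().withDatabase("my_db"))
                .withResultConfiguration(new ResultConfiguration().withOutputLocation(outputLocation));
        String queryId = athena.startQueryExecution(start).getQueryExecutionId();

        // Poll until the query reaches a terminal state (SUCCEEDED, FAILED, or CANCELLED).
        String state;
        do {
            Thread.sleep(1000L);
            state = athena.getQueryExecution(new GetQueryExecutionRequest().withQueryExecutionId(queryId))
                    .getQueryExecution().getStatus().getState();
        } while (state.equals("QUEUED") || state.equals("RUNNING"));

        // Athena writes its result to <outputLocation>/<queryId>.csv; read it back over s3a,
        // which needs the s3a connector from hadoop-aws 3.1+ on the classpath.
        SparkSession spark = SparkSession.builder().appName("athena-reader").getOrCreate();
        Dataset<Row> results = spark.read()
                .option("header", "true")
                .csv("s3a://my-athena-results-bucket/output/" + queryId + ".csv");
        results.show();
    }
}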

Driver to support reading from or writing to Hive from C++ code

I have a core product built in C++ which uses an RDBMS, namely Oracle DB. We are in a phase of big-data enabling this product, with access to Hive tables. I know that with Apache Spark we have libraries to access Hive tables directly.
Now, with C++ being the base language, what are possible ways to read/write data in Hive on Cloudera?
Note: I am not looking to pull data between Hive and the RDBMS or vice versa (Sqoop). I am looking to read from, or fire query execution on, Hive itself.
Thanks in advance.
This is what worked for me:
1. Install the ODBC driver.
2. Go through the installation guide.
3. Open the project in Visual C++ and execute it.

jar containing org.apache.hadoop.hive.dynamodb

I was trying to programmatically load a DynamoDB table into HDFS (via Java, not Hive). I couldn't find examples online on how to do it, so I thought I'd download the jar containing org.apache.hadoop.hive.dynamodb and reverse engineer the process.
Unfortunately, I couldn't find that jar either :(.
Could someone answer the following questions for me (listed in order of priority):
A Java example that loads a DynamoDB table into HDFS (so that it can be passed to a mapper as a table input format).
The jar containing org.apache.hadoop.hive.dynamodb.
Thanks!
It's in hive-bigbird-handler.jar. Unfortunately, AWS doesn't provide any source, or even Javadoc, for it. But you can find the jar on any node of an EMR cluster:
/home/hadoop/.versions/hive-0.8.1/auxlib/hive-bigbird-handler-0.8.1.jar
You might want to check out this article:
Amazon DynamoDB Part III: MapReducin’ Logs
Unfortunately, Amazon haven’t released the sources for
hive-bigbird-handler.jar, which is a shame considering its usefulness.
Of particular note, it seems it also includes built-in support for
Hadoop’s Input and Output formats, so one can write straight on
MapReduce Jobs, writing directly into DynamoDB.
Tip: search for hive-bigbird-handler.jar to get to the interesting parts... ;-)
1- I am not aware of any such example, but you might find this library useful. It provides InputFormats, OutputFormats, and Writable classes for reading and writing data to Amazon DynamoDB tables; a rough sketch of wiring them into a MapReduce job follows below.
2- I don't think they have made it available publicly.
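A loose sketch only: the class names and configuration keys below are assumptions based on the open-source emr-dynamodb-connector (org.apache.hadoop.dynamodb), not on hive-bigbird-handler.jar itself, and the table name, region, and output path are placeholders to adapt:

import org.apache.hadoop.dynamodb.DynamoDBItemWritable;
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class DynamoTableToHdfs {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(DynamoTableToHdfs.class);
        job.setJobName("dynamodb-to-hdfs");

        // Assumed connector configuration keys; check the connector docs for your version.
        job.set("dynamodb.input.tableName", "myTable");
        job.set("dynamodb.regionid", "us-east-1");

        // The connector's InputFormat hands each DynamoDB item to the (identity) mapper
        // as a Text key plus a DynamoDBItemWritable value.
        job.setInputFormat(DynamoDBInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DynamoDBItemWritable.class);
        job.setNumReduceTasks(0);

        // Write the raw items out to HDFS as a sequence file (placeholder path).
        job.setOutputFormat(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/dynamodb-export"));

        JobClient.runJob(job);
    }
}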