Sqoop import as ORC file - HDFS

Is there any option in Sqoop to import data from an RDBMS and store it as ORC files in HDFS?
Alternative tried: imported in text format, then used a temporary Hive table to read the text input and write it back to HDFS as ORC.

At least since Sqoop 1.4.5 there is HCatalog integration that supports the ORC file format (among others).
For example, you have the option
--hcatalog-storage-stanza
which can be set to
stored as orc tblproperties ("orc.compress"="SNAPPY")
Example:
sqoop import \
--connect jdbc:postgresql://foobar:5432/my_db \
--driver org.postgresql.Driver \
--connection-manager org.apache.sqoop.manager.GenericJdbcManager \
--username foo \
--password-file hdfs:///user/foobar/foo.txt \
--table fact \
--hcatalog-home /usr/hdp/current/hive-webhcat \
--hcatalog-database my_hcat_db \
--hcatalog-table fact \
--create-hcatalog-table \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")'

Sqoop import supports only the formats below:
--as-avrodatafile Imports data to Avro Data Files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data as Parquet files (from Sqoop version 1.4.6)
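For instance, a Parquet import might look like the sketch below (the connection details mirror the MySQL example further down; the target directory is illustrative):
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--target-dir /user/cloudera/parquet \
--as-parquetfile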

In the currently available version of Sqoop, it is not possible to import data from an RDBMS to HDFS in ORC format with a single command. This is a known issue in Sqoop.
Reference link for the issue: https://issues.apache.org/jira/browse/SQOOP-2192
I think the only alternative available for now is the one you mentioned. I came across a similar use case and used the same two-step approach.

Currently there is no option to import RDBMS table data directly as an ORC file using Sqoop.
We can achieve the same in two steps:
1. Import the data in any available format (say text).
2. Read the data using Spark SQL and save it as an ORC file.
Example:
Step 1: Import the table data as a text file.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table orders \
--target-dir /user/cloudera/text \
--as-textfile
Step 2: Use spark-shell at the command prompt to get a Scala REPL.
scala> val sqlHiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlHiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@638a9d61
scala> val textDF = sqlHiveContext.read.text("/user/cloudera/text")
textDF: org.apache.spark.sql.DataFrame = [value: string]
scala> textDF.write.orc("/user/cloudera/orc/")
Step 3: Check the output.
[root@quickstart exercises]# hadoop fs -ls /user/cloudera/orc/
Found 5 items
-rw-r--r-- 1 cloudera cloudera 0 2018-02-13 05:59 /user/cloudera/orc/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 153598 2018-02-13 05:59 /user/cloudera/orc/part-r-00000-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 153466 2018-02-13 05:59 /user/cloudera/orc/part-r-00001-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 153725 2018-02-13 05:59 /user/cloudera/orc/part-r-00002-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 160907 2018-02-13 05:59 /user/cloudera/orc/part-r-00003-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
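If you prefer PySpark over the Scala shell for step 2, the same conversion can be done with a short sketch like this (assuming Spark 2.x, where the spark session object is available, and the same HDFS paths as above):
# pyspark shell: read the text files imported by Sqoop and rewrite them as ORC
text_df = spark.read.text("/user/cloudera/text")   # a single string column named "value"
text_df.write.orc("/user/cloudera/orc/")
Note that reading with text() keeps each record as one string column; reading with csv() and an explicit schema would preserve the individual columns in the resulting ORC files.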

How to read a csv file from s3 bucket using pyspark

I'm using Apache Spark 3.1.0 with Python 3.9.6. I'm trying to read a CSV file from an AWS S3 bucket, something like this:
spark = SparkSession.builder.getOrCreate()
file = "s3://bucket/file.csv"
c = spark.read\
.csv(file)\
.count()
print(c)
But I'm getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
I understand that I need to add extra libraries, but I couldn't find any clear information about exactly which ones and which versions. I've tried adding something like this to my code, but I'm still getting the same error:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
How can I fix this?
You need to use hadoop-aws version 3.2.0 for Spark 3. Specifying the hadoop-aws library in --packages is enough to read files from S3.
--packages org.apache.hadoop:hadoop-aws:3.2.0
You also need to set the configurations below.
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")
After that you can read the CSV file.
spark.read.csv("s3a://bucket/file.csv")
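Putting it together, a minimal PySpark sketch might look like this (bucket name, file key and credentials are placeholders; it assumes the package can be resolved when the session starts):
from pyspark.sql import SparkSession

# Pull in hadoop-aws at startup; the version should match the Hadoop build Spark ships with
spark = (SparkSession.builder
         .appName("s3-csv-read")
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
         .config("spark.hadoop.fs.s3a.access.key", "<access_key>")
         .config("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
         .getOrCreate())

# Note the s3a:// scheme rather than s3://
df = spark.read.csv("s3a://bucket/file.csv")
print(df.count())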
Thanks Mohana for the pointer! After breaking my head for more than a day, I was finally able to figure it out. Summarizing my learnings:
Make sure you know which version of Hadoop your Spark comes with:
print(f'pyspark hadoop version: {spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}')
or look for
ls jars/hadoop*.jar
The issue I was having was an older Spark installation from a while back that came with Hadoop 2.7 and was messing everything up.
This should give a brief idea of which binaries you need to download.
For me it was Spark 3.2.1 and Hadoop 3.3.1.
Hence I downloaded:
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.1
https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.901 # added this just in case;
Placed these jar files in the spark installation dir:
spark/jars/
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 runner.py
Then your code snippet that reads from AWS S3 should work.

AWS EMR driver jars

I'm trying to use external drivers in AWS EMR 5.29 on pyspark notebooks via:
#%%configure -f
{ "conf": {"spark.jars":"s3://bucket/spark-redshift_2.10-2.0.1.jar,"
"s3://bucket/minimal-json-0.9.5.jar,"
"s3://bucket/spark-avro_2.11-3.0.0.jar,"
"s3://bucket/RedshiftJDBC4-no-awssdk-1.2.41.1065.jar"}}
As per https://blog.benthem.io/2020/04/21/connect-aws-emr-to-spark.html
However, when trying
from pyspark.sql import SQLContext
sc = spark # existing SparkContext
sql_context = SQLContext(sc)
df = sql_context.read.format("com.databricks.spark.redshift")\
.option("url", jdbcUrl)\
.option("query","select * from test")\
.option("tempdir", "s3://")\
.load()
I get
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.redshift.
How can I troubleshoot this? I can confirm the EMR role has access to the bucket, as I can process a CSV file in the same bucket with Spark. I can also confirm all the listed jar files are in the bucket.
Actually the way to troubleshoot this is to SSH into the master node and then look at the ivy logs:
/mnt/var/log/livy/livy-livy-server.out
and the downloaded jar files at
/var/lib/livy/.ivy2/jars/
Based on what I found out, I changed my code to:
%%configure -f
{
  "conf": {
    "spark.jars": "s3://bucket/RedshiftJDBC4-no-awssdk-1.2.41.1065.jar",
    "spark.jars.packages": "com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4"
  }
}
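With those packages resolved by Livy, the original read should then find the data source; a quick usage sketch (jdbcUrl and the tempdir path are placeholders carried over from the question):
df = (spark.read.format("com.databricks.spark.redshift")
      .option("url", jdbcUrl)
      .option("query", "select * from test")
      .option("tempdir", "s3://bucket/tmp/")  # placeholder; the question truncates the real path
      .load())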

How to Sqoop import data from different databases into HDFS

$cat > import.txt
import
--connect
jdbc:mysql://localhost/hadoopdb
--username
hadoop
-password
abc
I keep the JDBC URL, username and password in one text file, and when I run Sqoop I call it as follows:
sqoop --options-file /user/cloudera/import.txt --table employee
But I want to import from multiple databases into HDFS. How should I approach this for multiple databases?
I tried searching for this but didn't find any proper resource. Can anyone help me with this?
I have accomplished this by writing a shell script with multiple Sqoop statements, one Sqoop statement per job. You could have each statement within the shell script reference its own options file.
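For example, a wrapper script of this shape runs one Sqoop job per source database (the options-file names are placeholders; each file holds the connect string and credentials for one database):
#!/bin/bash
# one Sqoop statement per job, each with its own options file
sqoop --options-file /user/cloudera/import_db1.txt --table employee
sqoop --options-file /user/cloudera/import_db2.txt --table employee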
You can also create a workflow.xml for the Sqoop action and parameterise each field, e.g. instead of the hard-coded arguments
import
--connect
jdbc:mysql://localhost/hadoopdb
--username
hadoop
-password
abc
use variables:
import
--connect
${connection_string}
--username
${user_name}
--password-file
${password_file_path}
--table
${table_name}
Then assign the value of each variable in the job.properties file and run it through Oozie:
oozie job -oozie http://XXXX.XX.iroot.adidom.com:XXXX/oozie -config job.properties -run
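For reference, a job.properties for the parameterised action above might look like this (the host/port values, application path and password-file location are illustrative placeholders):
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
oozie.wf.application.path=${nameNode}/user/cloudera/sqoop-workflow
connection_string=jdbc:mysql://localhost/hadoopdb
user_name=hadoop
password_file_path=/user/cloudera/sqoop.password
table_name=employee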
You can also schedule it through a coordinator.xml.
Thanks,

Pyspark error reading file. Flume HDFS sink imports file with user=flume and permissions 644

I'm using Cloudera Quickstart VM 5.12
I have a Flume agent moving CSV files from spooldir source into HDFS sink. The operation works ok but the imported files have:
User=flume
Group=cloudera
Permissions=-rw-r--r--
The problem starts when I use Pyspark and get:
PriviledgedActionException as:cloudera (auth:SIMPLE)
cause:org.apache.hadoop.security.AccessControlException: Permission denied:
user=cloudera, access=EXECUTE,
inode=/user/cloudera/flume/events/small.csv:cloudera:cloudera:-rw-r--r--
(Ancestor /user/cloudera/flume/events/small.csv is not a directory).
If I use "hdfs dfs -put ..." instead of Flume, user and group are "cloudera" and permissions are 777. No Spark error
What is the solution? I cannot find a way from Flume to change file's permissions. Maybe my approach is fundamentally wrong
Any ideas?
Thank you

pyspark Cassandra connector

I have to install the pyspark-cassandra connector, which is available at https://github.com/TargetHolding/pyspark-cassandra,
but I have faced huge problems and errors, and found no supporting documentation on Spark with Python (PySpark).
I want to know whether the pyspark-cassandra package is deprecated or something else. I also need a clear step-by-step tutorial for cloning the pyspark-cassandra package, installing and importing it in the pyspark shell, making a successful connection with Cassandra, and making transactions and building tables or keyspaces via pyspark.
Approach 1 (spark-cassandra-connector)
Use the command below to start the pyspark shell with the spark-cassandra-connector package:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2
Now you can import modules
Read data from the Cassandra table "emp" in keyspace "test" as:
spark.read.format("org.apache.spark.sql.cassandra").options(table="emp", keyspace="test").load().show()
Approach 2 (pyspark-cassandra)
Use the command below to start the pyspark shell with the pyspark-cassandra package:
pyspark --packages anguenot/pyspark-cassandra:2.4.0
Read data from the Cassandra table "emp" in keyspace "test" as:
spark.read.format("org.apache.spark.sql.cassandra").options(table="emp", keyspace="test").load().show()
I hope this link helps you in your task
https://github.com/datastax/spark-cassandra-connector/#documentation
The link in your question points to a repository where the builds are failing.
It also has a link to the repository above.
There are two ways to do this: either using pyspark or spark-shell.
#1 pyspark:
Steps to follow:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2
df = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "<keyspace_name>").option("table", "<table_name>").load()
Note: this will create a DataFrame on which you can perform further operations.
Try agg(), select(), show(), etc., or hit tab after 'df.' to see the available methods.
Example: df.select(sum("<column_name>")).show() (import sum from pyspark.sql.functions first)
#2 spark-shell:
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2
or use the connector jar file with spark-shell.
The steps above (#1) work exactly the same; just use 'val' to create the variable,
e.g. val df = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "<keyspace_name>").option("table", "<table_name>").load()
Note: use the ':paste' command in the Scala REPL to write multiple lines or to paste your code.
#3 Steps to download spark-cassandra-connector:
download the spark-cassandra-connector by cloning https://github.com/datastax/spark-cassandra-connector.git
cd to the spark-cassandra-connector
./sbt/sbt assembly
This will build the spark-cassandra-connector and put the jars into the 'project' folder.
use spark-shell
all set
Cheers 🍻!
You can use this to connect to Cassandra:
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
You can read like this if you have a keyspace called test and a table called my_table:
val test_spark_rdd = sc.cassandraTable("test", "my_table")
test_spark_rdd.first