$cat > import.txt
import
--connect
jdbc:mysql://localhost/hadoopdb
--username
hadoop
--password
abc
I have kept the JDBC URL, username, and password in a single options text file, and when I run a Sqoop command I call it as follows:
sqoop --options-file /user/cloudera/import.txt --table employee
But I want to import from multiple databases into HDFS. How should I approach this for multiple databases?
I tried searching for this but didn't find any proper resource. Can anyone help me with this?
I have accomplished this by writing a shell script with multiple Sqoop statements, one Sqoop statement per job. You could have each statement within the shell script reference its own options file.
You can create a workflow.xml for the Sqoop action and parameterize each field, e.g. instead of hard-coding the values:

import
--connect
jdbc:mysql://localhost/hadoopdb
--username
hadoop
--password
abc

use variables:

--connect
${connection_string}
--username
${user_name}
--password-file
${password_file_path}
--table
${table_name}
Then assign a value to each variable in the job.properties file and run it with the Oozie command:
oozie job -oozie http://XXXX.XX.iroot.adidom.com:XXXX/oozie -config job.properties -run
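A minimal job.properties sketch for the parameterized fields above (the values, and the standard Oozie entries such as nameNode and oozie.wf.application.path, are placeholders to adjust for your cluster):

nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.wf.application.path=${nameNode}/user/cloudera/sqoop-workflow
connection_string=jdbc:mysql://localhost/hadoopdb
user_name=hadoop
password_file_path=/user/cloudera/password.txt
table_name=employee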
You can also schedule it through a coordinator.xml.
Thanks,
Can anyone please help me with how to:
submit a PySpark job in Google Cloud Shell
pass files and arguments in spark-submit
read those files and arguments in the PySpark code
1. To submit a job to a Dataproc cluster, run the gcloud CLI's gcloud dataproc jobs submit command locally in a terminal window or in Cloud Shell. Here is the detailed official documentation.
2. With the help of spark-submit you can pass program arguments. You'll see that spark-submit has the following syntax:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
You can use application-arguments and --conf to pass the required configuration to the main method and to SparkConf, respectively. Here is the documentation.
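As a minimal sketch of how those arguments then show up inside the PySpark code (the script name, paths, and values are hypothetical):

# my_job.py, submitted e.g. as:
#   spark-submit --conf spark.executor.memory=2g my_job.py /data/input.csv 2021-01-01
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    input_path = sys.argv[1]   # first application argument
    run_date = sys.argv[2]     # second application argument

    spark = SparkSession.builder.appName("my_job").getOrCreate()

    # Values passed via --conf are visible on the Spark configuration.
    print(spark.conf.get("spark.executor.memory"))

    df = spark.read.csv(input_path, header=True)
    print(run_date, df.count())

    spark.stop()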
3. Some information on reading files with Spark:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks. Here is a blog explaining it.
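For example, a small sketch of requesting more partitions (the path and the count of 10 are arbitrary):

from pyspark import SparkContext

sc = SparkContext(appName="partition_demo")

# The second argument asks for a minimum number of partitions;
# by default Spark creates one partition per HDFS block.
rdd = sc.textFile("/my/directory/*.txt", 10)
print(rdd.getNumPartitions())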
Here is the guide for all processes.
I have a scenario where I have an AWS EMR setup with a few applications such as Spark, Hadoop, Hive, HCatalog, Zeppelin, Sqoop, etc., and another server which runs only Airflow.
I am working on a requirement where I want to move MySQL tables (which are on a different RDS instance) to Hive using Sqoop, and this trigger has to be submitted by Airflow.
Is it possible to achieve this using the SqoopOperator available in Airflow, given that Airflow is on a remote server? I believe not; if so, is there any other way to achieve this?
Thanks in advance.
Yes, this is possible. I'll admit the documentation on how to use the operators is lacking, but if you understand the concept of hooks and operators in Airflow, you can figure it out by reading the code of the operator you're looking to use. In this case, you'll want to read through the SqoopHook and SqoopOperator codebase. Most of what I know how to do with Airflow comes from reading the code; while I haven't used this operator, I can try to help you out here as best I can.
Let's assume you want to execute this Sqoop command:
sqoop import --connect jdbc:mysql://mysql.example.com/testDb --username root --password hadoop123 --table student
And you have a Sqoop server running on a remote host which you can access with the Sqoop client at http://scoop.example.com:12000/sqoop/.
First, you'll need to create the connection in the Airflow Admin UI; call the connection sqoop. For the connection, fill in the host as scoop.example.com, the schema as sqoop, and the port as 12000. If you have a password, you will need to put it into a file on your server and, in Extra, fill out a JSON string that looks like {"password_file": "/path/to/password.txt"} (see the inline code about this password file).
After you set up the connection in the UI, you can now create a task using the SqoopOperator in your DAG file. This might look like this:
sqoop_mysql_import = SqoopOperator(task_id='sqoop_mysql_import',
                                   conn_id='sqoop',
                                   table='student',
                                   username='root',
                                   password='password',
                                   driver='jdbc:mysql://mysql.example.com/testDb',
                                   cmd_type='import',
                                   dag=dag)
The full list of parameters you might want to pass for imports can be found in the code here.
You can see how the SqoopOperator (and really the SqoopHook which the operator leverages to connect to Sqoop) translates these arguments to command line commands here.
Really this SqoopOperator just works by translating the kwargs you pass into sqoop client CLI commands. If you check out the SqoopHook, you can see how that's done and probably figure out how to make it work for your case. Good luck!
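For completeness, here is a hedged sketch of the surrounding DAG file (import paths assume Airflow 1.x contrib modules; the DAG id, schedule, and dates are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.sqoop_operator import SqoopOperator

# Minimal DAG wrapper around the SqoopOperator task shown above.
dag = DAG(
    dag_id='sqoop_mysql_to_hdfs',
    start_date=datetime(2018, 1, 1),
    schedule_interval='@daily',
)

sqoop_mysql_import = SqoopOperator(
    task_id='sqoop_mysql_import',
    conn_id='sqoop',  # the connection created in the Admin UI
    table='student',
    username='root',
    password='password',
    driver='jdbc:mysql://mysql.example.com/testDb',
    cmd_type='import',
    dag=dag,
)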
To troubleshoot, I would recommend SSHing into the server you're running Airflow on and confirming you can run the Sqoop client from the command line and connect to the remote Sqoop server.
Try adding a step that uses script-runner.jar; here is more information.
aws emr create-cluster --name "Test cluster" --release-label emr-5.16.0 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m4.large --instance-count 3 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
Then you can do it like this.
# Airflow 1.x contrib import paths for the operators used below.
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

SPARK_TEST_STEPS = [
    {
        'Name': 'demo',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 's3://cn-northwest-1.elasticmapreduce/libs/script-runner/script-runner.jar',
            'Args': [
                "s3://d.s3.d.com/demo.sh",
            ]
        }
    }
]
step_adder = EmrAddStepsOperator(
task_id='add_steps',
job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
aws_conn_id='aws_default',
steps=SPARK_TEST_STEPS,
dag=dag
)
step_checker = EmrStepSensor(
task_id='watch_step',
job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
step_id="{{ task_instance.xcom_pull('add_steps', key='return_value')[0] }}",
aws_conn_id='aws_default',
dag=dag
)
step_adder.set_downstream(step_checker)
demo.sh looks like this:
#!/bin/bash
sqoop import --connect jdbc:mysql://mysql.example.com/testDb --username root --password hadoop123 --table student
I'm trying to run a PySpark script that works fine when I run it on my local machine.
The issue is that I want to fetch the input files from S3.
No matter what I try, though, I can't seem to find where to set the ID and secret. I found some answers for specific files,
e.g.: Locally reading S3 files through Spark (or better: pyspark)
but I want to set the credentials for the whole SparkContext, as I reuse the SQL context all over my code.
So the question is: how do I set the AWS access key and secret for Spark?
P.S. I tried the $SPARK_HOME/conf/hdfs-site.xml and environment variable options; neither worked.
Thank you.
For pyspark, we can set the credentials as shown below:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)
Setting spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in spark-defaults.conf before establishing a spark session is a nice way to do it.
But I also had success with Spark 2.3.2 and a pyspark shell, setting these dynamically from within a Spark session as follows:
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)
And then I was able to read/write from S3 using s3a:
documents = spark.sparkContext.textFile('s3a://bucket_name/key')
I'm not sure if this was true at the time, but as of PySpark 2.4.5 you don't need to access the private _jsc object to set Hadoop properties. You can set Hadoop properties using SparkConf.set(). For example:
import pyspark
conf = (
pyspark.SparkConf()
.setAppName('app_name')
.setMaster(SPARK_MASTER)
.set('spark.hadoop.fs.s3a.access.key', AWS_ACCESS_KEY)
.set('spark.hadoop.fs.s3a.secret.key', AWS_SECRET_KEY)
)
sc = pyspark.SparkContext(conf=conf)
See https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration
You can see a couple of suggestions here:
http://www.infoobjects.com/2016/02/27/different-ways-of-setting-aws-credentials-in-spark/
I usually do the third one (setting the Hadoop configuration on the SparkContext), as I want the credentials to be parameters within my code, so that I can run it from any machine.
For example:
JavaSparkContext javaSparkContext = new JavaSparkContext();
javaSparkContext.sc().hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "");
javaSparkContext.sc().hadoopConfiguration().set("fs.s3n.awsSecretAccessKey","");
The method where you add the AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY to hdfs-site.xml should ideally work. Just ensure that you run pyspark or spark-submit as follows:
spark-submit --master "local[*]" \
--driver-class-path /usr/src/app/lib/mssql-jdbc-6.4.0.jre8.jar \
--jars /usr/src/app/lib/hadoop-aws-2.6.0.jar,/usr/src/app/lib/aws-java-sdk-1.11.443.jar,/usr/src/app/lib/mssql-jdbc-6.4.0.jre8.jar \
repl-sql-s3-schema-change.py
pyspark --jars /usr/src/app/lib/hadoop-aws-2.6.0.jar,/usr/src/app/lib/aws-java-sdk-1.11.443.jar,/usr/src/app/lib/mssql-jdbc-6.4.0.jre8.jar
Setting them in core-site.xml, provided that directory is on the classpath, should work.
I have to install pyspark-cassandra-connector, which is available at https://github.com/TargetHolding/pyspark-cassandra,
but I faced huge problems and errors, and there is no supporting documentation for Spark with Python (PySpark).
I want to know whether the pyspark-cassandra-connector package is deprecated or something else. Also, I need a clear step-by-step tutorial for git cloning the pyspark-cassandra-connector package, installing it, importing it in the pyspark shell, making a successful connection with Cassandra, and performing operations such as building tables or keyspaces via pyspark.
Approach 1 (spark-cassandra-connector)
Use the command below to start the pyspark shell with spark-cassandra-connector:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2
Now you can import modules
Read data from the Cassandra table "emp" in the keyspace "test" as follows:
spark.read.format("org.apache.spark.sql.cassandra").options(table="emp", keyspace="test").load().show()
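Writing a DataFrame df back goes through the same data source; a sketch, assuming the target table already exists:

df.write.format("org.apache.spark.sql.cassandra") \
    .options(table="emp", keyspace="test") \
    .mode("append") \
    .save()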
Approach 2 (pyspark-cassandra)
Use the command below to start the pyspark shell with pyspark-cassandra:
pyspark --packages anguenot/pyspark-cassandra:2.4.0
Read data from the Cassandra table "emp" in the keyspace "test" as follows:
spark.read.format("org.apache.spark.sql.cassandra").options(table="emp", keyspace="test").load().show()
I hope this link helps you in your task
https://github.com/datastax/spark-cassandra-connector/#documentation
The link in your question points to a repository where the builds are failing. It also has a link to the repository above.
There are two ways to do this: either using pyspark or using spark-shell.
#1 pyspark:
Steps to follow:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2
df = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "<keyspace_name>").option("table", "<table_name>").load()
Note: this will create a DataFrame on which you can perform further operations.
Try methods such as agg(), select(), show(), etc., or press Tab after 'df.' to see the available options.
example: df.select(sum("<column_name>")).show()
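Note that for the aggregate example above, sum needs to come from pyspark.sql.functions rather than the Python builtin, e.g.:

from pyspark.sql.functions import sum  # shadows Python's builtin sum in this session

df.select(sum("<column_name>")).show()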
#2 spark-shell:
Use spark-shell --packages with the same package as above, or pass the connector jar file to spark-shell.
The steps above (#1) will work exactly the same; just use 'val' to create the variable,
e.g. val df = spark.read.format(...).load()
Note: use the ':paste' option in Scala to write multiple lines or to paste your code.
#3 Steps to download spark-cassandra-connector:
Download the spark-cassandra-connector by cloning https://github.com/datastax/spark-cassandra-connector.git
cd into the spark-cassandra-connector directory
./sbt/sbt assembly
This will download the dependencies, build the spark-cassandra-connector, and put the jars into the 'project' folder.
Use spark-shell.
All set.
Cheers 🍻!
You can use this to connect to Cassandra:
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
You can read like this, if you have a keyspace called test and a table called my_table:
val test_spark_rdd = sc.cassandraTable("test", "my_table")
test_spark_rdd.first
Is there any option in Sqoop to import data from an RDBMS and store it in ORC file format in HDFS?
Alternative tried: imported in text format and used a temp table to read the input as a text file and write it to HDFS as ORC in Hive.
At least in Sqoop 1.4.5 there exists HCatalog integration that supports the ORC file format (amongst others).
For example you have the option
--hcatalog-storage-stanza
which can be set to
stored as orc tblproperties ("orc.compress"="SNAPPY")
Example:
sqoop import \
  --connect jdbc:postgresql://foobar:5432/my_db \
  --driver org.postgresql.Driver \
  --connection-manager org.apache.sqoop.manager.GenericJdbcManager \
  --username foo \
  --password-file hdfs:///user/foobar/foo.txt \
  --table fact \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database my_hcat_db \
  --hcatalog-table fact \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")'
Sqoop import supports only the formats below.
--as-avrodatafile Imports data to Avro Data Files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data as parquet file (from sqoop 1.4.6 version)
In the currently available version of Sqoop, it is not possible to import data from an RDBMS to HDFS in ORC format in a single shot. This is a known issue in Sqoop.
Reference link for this issue raised: https://issues.apache.org/jira/browse/SQOOP-2192
I think the only alternative available for now is the one you mentioned. I also came across a similar use case and have used the same two-step approach.
Currently there is no option to import the RDBMS table data directly as an ORC file using Sqoop.
We can achieve the same using two steps.
Import the data in any available format (say text).
Read the data using Spark SQL and save it as an orc file.
Example:
Step 1: Import the table data as a text file.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table orders \
--target-dir /user/cloudera/text \
--as-textfile
Step 2: Use spark-shell at the command prompt to get a Scala REPL.
scala> val sqlHiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlHiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@638a9d61
scala> val textDF = sqlHiveContext.read.text("/user/cloudera/text")
textDF: org.apache.spark.sql.DataFrame = [value: string]
scala> textDF.write.orc("/user/cloudera/orc/")
Step 3: Check the output.
[root@quickstart exercises]# hadoop fs -ls /user/cloudera/orc/
Found 5 items
-rw-r--r-- 1 cloudera cloudera 0 2018-02-13 05:59 /user/cloudera/orc/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 153598 2018-02-13 05:59 /user/cloudera/orc/part-r-00000-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 153466 2018-02-13 05:59 /user/cloudera/orc/part-r-00001-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 153725 2018-02-13 05:59 /user/cloudera/orc/part-r-00002-24f75a77-4dd9-44b1-9e25-6692740360d5.orc
-rw-r--r-- 1 cloudera cloudera 160907 2018-02-13 05:59 /user/cloudera/orc/part-r-00003-24f75a77-4dd9-44b1-9e25-6692740360d5.orc