Using Append and last modified in sqoop import statement - hdfs

We are using Sqoop to extract data from an Oracle database into HDFS. Rows in the source table are regularly updated and new rows are added. I am aware of Sqoop's incremental import options --append and --incremental lastmodified.
My question is: can we use both options in the same import statement?
For example,
sqoop import --incremental lastmodified --append --check-column 'lastmodified' --check-column 'id' --last-value '2017-09-22-123456' --last-value 100
Or we need to use separately? Or any other better approach?

I found the answer myself.
--incremental lastmodified takes care of both updates and newly added rows, based on the last-modified column. There is no need to use --append in this scenario.

Can we remove column names from the S3 partition path and set the path to values only?

I am just curious: with Spark using the Glue sink format, is it possible to save the file as "2021/05/05/filename.parquet" rather than "year=2021/month=05/day=05/filename.parquet"? I tried playing with 'writepath', but it works at the record level and I believe it would break Spark's ability to save partitioned files.
This is not possible.
Partitioning drops the columns used for partitioning.
Spark relies on the column=value directory structure for partition discovery, so including the column names in the path is necessary for it to work.
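As a small illustration of why the column=value layout matters, here is a minimal PySpark sketch (the output path and columns are made up for this example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2021", "05", "05", 1), ("2021", "05", "06", 2)],
    ["year", "month", "day", "value"])

# Writing with partitionBy produces .../year=2021/month=05/day=05/part-*.parquet;
# the year/month/day columns are dropped from the data files themselves.
df.write.mode("overwrite").partitionBy("year", "month", "day").parquet("/tmp/partition_demo")

# On read, Spark rebuilds year/month/day from the column=value directory names.
# With a plain 2021/05/05/ layout nothing tells Spark which columns those path
# segments correspond to, so partition discovery cannot restore them.
spark.read.parquet("/tmp/partition_demo").printSchema()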

Groupby existing attribute present in json string line in apache beam java

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps, and I have to pick the latest one for each customer. I am planning to achieve this as follows:
Read files
Group by customer id
Apply DoFn to compare timestamp of records in each group and have only latest one from them
Flatten it, convert to table rows, and insert into BQ.
But I am stuck at the grouping step: I see GroupByKey.create() but I am unable to make it use the customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
// Reading and parsing the files into JsonObjects (elided here, as in the original sketch):
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);
// Once we have the data in JsonObjects, we key each element by customer ID and group:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were planning.
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.

Query hive table with Spark

I am a newbie to Apache Hive and Spark. I have some existing Hive tables on my Hadoop server against which I can run HQL commands and get what I want, using hive or beeline, e.g., selecting the first 5 rows of a table. Instead of that, I want to use Spark to achieve the same goal. The Spark version on the server is 1.6.3.
Using the code below (I have replaced my actual database and table names with database and table):
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(conf=config)
sqlContext = HiveContext(sc)
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
df = query.toPandas()
df.show()
I get this error:
ValueError: Some of types cannot be determined after inferring.
Error:root: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
However, I can run the same query with beeline and see the results.
After a day of googling and searching, I modified the code as follows:
table_ccx = sqlContext.table("database.table")
table_ccx.registerTempTable("temp")
sqlContext.sql("SELECT * FROM temp LIMIT 5").show()
Now the error is gone, but all the row values are null except for one or two dates and the column names.
I also tried
sqlContext.refreshTable("database.table")
and it did not help. Is there a setting or configuration that I need to ask my IT team to do? I appreciate any help.
EDIT: Having said that, my Python code does work for some of the tables on Hadoop. Could the problem be caused by certain entries in the table? If so, how come the corresponding beeline/Hive query works?
As it came out in the comments, straightening up the code a little bit makes the thing work.
The problem lies on this line of code:
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
What you are doing here is:
asking Spark to query the data source (which creates a DataFrame),
collecting everything on the driver as a local collection,
re-parallelizing the local collection on Spark with createDataFrame.
In general the approach should work, although it's evidently unnecessarily convoluted.
The following will do:
query = sqlContext.sql("SELECT * from database.table LIMIT 5")
I'm not entirely sure why the round trip breaks your code, but it does (as came out in the comments), and the simpler form is an improvement in any case.
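Putting the fix together, a minimal sketch of the corrected flow on Spark 1.6 (database.table is the placeholder name used in the question):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(conf=SparkConf())
sqlContext = HiveContext(sc)

# Keep the result as a distributed DataFrame; no collect()/createDataFrame round trip.
query = sqlContext.sql("SELECT * FROM database.table LIMIT 5")
query.show()

# Only convert to pandas if you really need a local copy of these few rows.
pandas_df = query.toPandas()
print(pandas_df)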

Performing operations on an RDD in PySpark

I am using the Python Spark API and I am having an issue getting the file name. For example,
recordDataFrame=originalDataFrame \
.withColumn('file_name',input_file_name())
This adds the file_name column to the DataFrame.
recordDataFrame.take(1)
The above shows the column with a value.
But when I convert the DataFrame to an RDD, or loop over the RDD, the file_name column doesn't have any value.
For example,
rdd_data=recordDataFrame.rdd
print(rdd_data.take(1))
This shows the file_name column with a blank value.
Similarly, if I loop over the DataFrame directly, file_name doesn't have any value:
recordDataFrame.foreach(process_data)
But if I pass a static value to file_name instead of using input_file_name() while adding the column, then everything works fine.
This is a bug which has been resolved in Spark 2.0.0.
Disclaimer:
These are serious hacks and should be avoided unless you're desperate. Also, none of these have been properly tested. If you can, it is better to upgrade.
Trigger a shuffle after loading the data, for example with:
recordDataFrame.repartition("file_name")
or
recordDataFrame.orderBy("file_name")
Truncate lineage as shown in high-performance-spark/high-performance-spark-examples
(the code is GPL licensed so it cannot be reproduced here, but the main idea is to access the internal Java RDD, cache it, and recreate the DataFrame):
cutLineage(recordDataFrame)
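For reference, here is a sketch of the first workaround on a pre-2.0 cluster; the input path is a placeholder and the shuffle is triggered with orderBy, one of the two options shown above:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import input_file_name

sc = SparkContext()
sqlContext = SQLContext(sc)

# Placeholder input path; any file-based source behaves the same way.
originalDataFrame = sqlContext.read.json("/data/input/*.json")
recordDataFrame = originalDataFrame.withColumn("file_name", input_file_name())

# Force a shuffle so file_name is materialized before dropping down to the RDD.
shuffled = recordDataFrame.orderBy("file_name")

print(shuffled.rdd.take(1))  # file_name should now carry the real path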

In Redshift, how do you combine CTAS with the "if not exists" clause?

I'm having some trouble getting this table creation query to work, and I'm wondering if I'm running into a limitation of Redshift.
Here's what I want to do:
I have data that I need to move between schemas, and I need to create the destination tables for the data on the fly, but only if they don't already exist.
Here are queries that I know work:
create table if not exists temp_table (id bigint);
This creates a table if it doesn't already exist, and it works just fine.
create table temp_2 as select * from temp_table where 1=2;
So that creates an empty table with the same structure as the previous one. That also works fine.
However, when I do this query:
create table if not exists temp_2 as select * from temp_table where 1=2;
Redshift chokes and says there is an error near "as" (for the record, I did try removing "as", and then it says there is an error near "select").
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
I should mention that I absolutely can separate out the queries that selectively create the table and populate it with data, and I probably will end up doing that. I was mostly just curious if anyone could tell me what's wrong with that query.
EDIT:
I do not believe this is a duplicate. The post linked to offers a number of solutions that rely on user-defined functions... Redshift doesn't support UDFs. They did recently implement a Python-based UDF system, but my understanding is that it's in beta, and we don't know how to implement it anyway.
Thanks for looking, though.
"I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?"
Indeed, this combination of CREATE TABLE ... AS SELECT and IF NOT EXISTS is not possible in Redshift (per the documentation). In PostgreSQL it has been possible since version 9.5.
On SO, this is discussed here: PostgreSQL: Create table if not exists AS. The accepted answer provides options that don't require any UDF or procedural code, so they're likely to work with Redshift too.