Query hive table with Spark - python-2.7

I am a newbie to Apache Hive and Spark. I have some existing Hive tables sitting on my Hadoop server, and I can run HQL commands against them and get what I want using hive or beeline, e.g., selecting the first 5 rows of my table. Instead, I want to use Spark to achieve the same goal. My Spark version on the server is 1.6.3.
Using the code below (I replaced my actual database and table names with database and table):
sc = SparkContext(conf = config)
sqlContext = HiveContext(sc)
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
df = query.toPandas()
df.show()
I get this error:
ValueError: Some of types cannot be determined after inferring.
Error:root: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
However, I can run the same query with beeline and see the results.
After a day of googling and searching, I modified the code to:
table_ccx = sqlContext.table("database.table")
table_ccx.registerTempTable("temp")
sqlContext.sql("SELECT * FROM temp LIMIT 5").show()
Now the error is gone, but all the row values are null except one or two dates and the column names.
I also tried
table_ccx.refreshTable("database.table")
and it did not help. Is there a setting or configuration that I need to ask my IT team to change? I appreciate any help.
EDIT: Having said that, my Python code works for some of the tables on Hadoop. I do not know whether the problem is caused by some entries in the table. If so, then how come the corresponding beeline/Hive command works?

As it came out in the comments, straightening up the code a little bit makes the thing work.
The problem lies on this line of code:
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
What you are doing here is:
asking Spark to query the data source (which creates a DataFrame)
collecting everything on the driver as a local collection
re-parallelizing the local collection on Spark with createDataFrame
In general the approach should work, although it's evidently unnecessarily convoluted.
The following will do:
query = sqlContext.sql("SELECT * from database.table LIMIT 5")
I'm not entirely sure why the collect/createDataFrame roundtrip breaks your code, but it evidently does (as it came out in the comments), and removing it also improves the code.
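For completeness, here is a minimal sketch of the simplified flow, assuming Spark 1.6 with a HiveContext; database.table is a placeholder for the real table name:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("hive-query-example")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

# Build the DataFrame directly from the SQL query; no collect()/createDataFrame roundtrip.
df = sqlContext.sql("SELECT * FROM database.table LIMIT 5")
df.show()

# Only convert to pandas if a local pandas DataFrame is really needed.
pdf = df.toPandas()
print(pdf.head())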

Related

AWS GlueStudio 'datediff'

Has anyone tried using AWS GlueStudio and the custom SQL queries? I am currently trying to find the difference in days between two dates, like so:
select
datediff(currentDate, expire_date) as days_since_expire
But in the data preview window I get an
AnalysisException: cannot resolve 'currentDate' given input columns: []; line 3 pos 9; 'Project ['datediff('nz_eventdate, 'install_date) AS days_since_install#613] +- OneRowRelation
Does anyone know how to fix this or what causes it?
You don't write PostgreSQL/T-SQL/PL-SQL (or any other flavor of) SQL; instead, "you enter the Apache SparkSQL query". Read the following carefully:
Using a SQL query to transform data (in AWS Glue "SQL Query" transform task)
https://docs.aws.amazon.com/glue/latest/ug/transforms-sql.html
The functions you can write in AWS Glue "SQL Query" transform task to achieve desired transformation are here (follow correct syntax):
https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html
BTW: the error you posted does not correlate with your select statement, for many potential reasons, but I am writing this answer anyway for the sake of your question heading and for others who may come here.
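For illustration, here is a hedged PySpark sketch of the Spark SQL built-ins the transform expects; the table and column names (subscriptions, id, expire_date) are made up. Note that current_date() is the Spark SQL function, while a bare currentDate identifier is resolved as a (missing) column, which is what the AnalysisException above complains about:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datediff-example").getOrCreate()

# Hypothetical sample data standing in for the real Glue source.
spark.createDataFrame(
    [("a", "2023-01-01"), ("b", "2024-06-30")],
    ["id", "expire_date"],
).createOrReplaceTempView("subscriptions")

# datediff(end, start) and current_date() are Spark SQL built-ins.
spark.sql("""
    SELECT id,
           datediff(current_date(), expire_date) AS days_since_expire
    FROM subscriptions
""").show()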

Great Expectations - Run Validation over specific subset of a PostgreSQL table

I am fairly new to Great Expectations and have a question. Essentially, I have a PostgreSQL database, and every time I run my data pipeline, I want to validate a specific subset of a PostgreSQL table based on some key. E.g.: if the data pipeline is run every day, there would be a field called current_batch, and the validation would occur for the query below:
SELECT * FROM jobs WHERE current_batch = <input_batch>.
I am unsure of the best way to accomplish this. I am using the v3 API of Great Expectations and am a bit confused as to whether to use a checkpoint or a validator. I assume I want to use a checkpoint, but I can't figure out how to create a checkpoint that validates only a specific subset of the PostgreSQL datasource.
Any help or guidance would be much appreciated.
Thanks,
I completely understand your confusion because I am working with GE too and the documentation is not really clear.
First of all, "Validators" are now called "Checkpoints", so they are not a different entity, as you can read here.
I am working on an Oracle database and the only way I found to apply a query before testing my data with expectations is to put the query inside the checkpoint.
To create a checkpoint you should run the great_expectations checkpoint new command from your terminal. After creating it, you should add the "query" field inside the .yml file that is your checkpoint.
Below you can see a snippet of a checkpoint I am working with. When I want to validate my data, I run the command great_expectations checkpoint run check1
name: check1
module_name: great_expectations.checkpoint
class_name: LegacyCheckpoint
batches:
  - batch_kwargs:
      table: pso
      schema: test
      query: SELECT p AS c,
        [ ... ]
        AND lsr = c)
      datasource: my_database
      data_asset_name: test.pso
    expectation_suite_names:
      - exp_suite1
Hope this helps! Feel free to ask if you have any doubts :)
I managed this using views (in Postgres). Before running GE, I create (or replace) the view as a query with all necessary joins, filtering, aggregations, etc., and then specify the name of this view in the GE checkpoint.
Yes, it is not the ideal solution. I would rather use a query in checkpoints too. But as a workaround, it covers all my cases.
Let's have a view like this:
CREATE OR REPLACE VIEW table_to_check_1_today AS
SELECT * FROM initial_table
WHERE dt = current_date;
And the checkpoint configured something like this:
name: my_task.my_check
config_version: 1.0
validations:
  - expectation_suite_name: my_task.my_suite
    batch_request:
      datasource_name: my_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: table_to_check_1_today
Yes, a view can be created using current_date, and the checkpoint can simply run against that view. However, this means the variable (current_date) is effectively stored in the database, which may not be desirable; you might want to run the query in the checkpoint for a different date, coming from an environment variable or elsewhere, and passed in through the CLI or Python/a notebook.
I have yet to find a solution where we can substitute a string into the checkpoint query; using a config variable from the file is very static, and there may be different checkpoints running for different dates.
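One possible workaround, sketched under the assumption that you are on a recent v3 release and that the Postgres datasource has a runtime data connector configured (the datasource, asset, suite, and identifier names below are placeholders): build the query string in Python and hand it to a RuntimeBatchRequest instead of hard-coding it in the checkpoint YAML.
import os
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()

# The date to validate comes from the environment (or the CLI, a notebook, etc.).
batch_date = os.environ.get("BATCH_DATE", "2024-01-01")

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource",                             # placeholder
    data_connector_name="default_runtime_data_connector_name",   # assumes a RuntimeDataConnector
    data_asset_name="jobs_subset",                               # logical name, not a real table
    runtime_parameters={
        "query": "SELECT * FROM jobs WHERE current_batch = '%s'" % batch_date
    },
    batch_identifiers={"default_identifier_name": batch_date},
)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_task.my_suite",                   # placeholder suite
)
print(validator.validate())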

Groupby existing attribute present in json string line in apache beam java

I am reading JSON files from GCS and have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest among them for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one
Flatten it, convert to table rows and insert into BQ.
But I am unable to proceed with step 1. I see GroupByKey.create() but am unable to make it use the customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey, you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);
// Once we have the data in JsonObjects, we key by customer ID.
// MapElements needs the output type via into(...) when given a lambda:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(),
                                      TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc.; if you get stuck with that, we can iterate.
As a hint / tip, you can consider this example of a Json Coder.
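For the pick-the-latest step, here is a hedged sketch of the same group-then-keep-latest pattern in the Beam Python SDK; the question is Java, so treat this only as an illustration of the idea, and the customerId and timestamp field names are assumptions:
import apache_beam as beam

# Toy records standing in for the parsed JSON objects.
records = [
    {"customerId": "c1", "timestamp": 1, "value": "old"},
    {"customerId": "c1", "timestamp": 5, "value": "new"},
    {"customerId": "c2", "timestamp": 3, "value": "only"},
]

with beam.Pipeline() as p:
    latest = (
        p
        | "Create" >> beam.Create(records)
        # Key by customer id so GroupByKey can gather each customer's records.
        | "KeyByCustomer" >> beam.Map(lambda r: (r["customerId"], r))
        | "GroupByCustomer" >> beam.GroupByKey()
        # Keep only the record with the largest timestamp per customer.
        | "PickLatest" >> beam.Map(lambda kv: max(kv[1], key=lambda r: r["timestamp"]))
        | "Print" >> beam.Map(print)
    )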

Creating temporary view in Spark sql using org.apache.spark.sql.json OPTIONS

I have some data in a Google Storage bucket. It is in JSON format and originated from Kafka. I want to create a temporary view in spark-sql on top of the bucket.
I tried this:
CREATE TEMPORARY VIEW TEMP_1 USING org.apache.spark.sql.json OPTIONS ( path "gs://xxx/xx/");
Now, when I try to desc the view, it only gives me a limited length and does not show all the column names.
keys struct<eventIDs:array<string>,id:string> NULL
values struct<Column1:string,columns2:string... 298 more fields> NULL
What should I do if I want to see all the column names in the view? I'm new to Spark. Any help would be appreciated. Thanks.
Still looking for an answer, as the possible solutions below did not work for me. I only have access to the spark-sql shell. The answers below point to using a Scala environment, which I don't have access to. Please help. Thanks.
After some digging, I found a solution:
run spark-shell, then:
val path = "gs://XX/XX/X"
val df = spark.read.json(path)
df.printSchema()
Worked like a charm. Thank you to everyone who chimed in to give an answer. My below-average brain took some time to figure it out.
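If pyspark happens to be available instead of the Scala shell, a roughly equivalent sketch (with the same placeholder bucket path) would be:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-json-schema").getOrCreate()

path = "gs://XX/XX/X"  # same placeholder path as above
df = spark.read.json(path)

# printSchema() prints every nested field, unlike desc on the view,
# which truncates wide struct columns.
df.printSchema()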

Performing operations on an RDD in PySpark

I am using the Python Spark API and am having an issue getting the file name. For example:
recordDataFrame=originalDataFrame \
.withColumn('file_name',input_file_name())
This adds the file_name column to the DataFrame. The column is indeed added:
recordDataFrame.take(1)
The above shows the column with a value.
But when I convert the DataFrame to an RDD or loop over the RDD, the file_name column doesn't have any value.
For example,
rdd_data=recordDataFrame.rdd
print(rdd_data.take(1))
This shows the file_name column with a blank value.
Similarly, if I loop over the DataFrame directly, the file name doesn't have any value either:
recordDataFrame.foreach(process_data)
But if I pass a static value to file_name instead of using input_file_name() while adding the column, then everything works fine.
This is a bug which has been resolved in 2.0.0.
Disclaimer:
These are serious hacks and should be avoided unless you're desperate. Also, none of these have been properly tested. If you can, it is better to update.
Trigger a shuffle after loading the data, for example with one of the following (a fuller sketch follows after this list):
recordDataFrame.repartition("file_name")
or
recordDataFrame.orderBy("file_name")
Truncate lineage as shown in high-performance-spark/high-performance-spark-examples
(the code is GPL-licensed so it cannot be reproduced here, but the main idea is to access the internal Java RDD, cache it, and recreate the DataFrame):
cutLineage(recordDataFrame)
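A hedged end-to-end sketch of the first workaround, assuming Spark 1.x with an SQLContext; the input path is a placeholder:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import input_file_name

sc = SparkContext(appName="file-name-workaround")
sqlContext = SQLContext(sc)

originalDataFrame = sqlContext.read.json("/data/input/*.json")  # placeholder path
recordDataFrame = originalDataFrame.withColumn("file_name", input_file_name())

# The shuffle triggered by repartition (or orderBy) materializes file_name
# before the DataFrame is converted to an RDD.
shuffled = recordDataFrame.repartition("file_name")
print(shuffled.rdd.take(1))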