I am trying to create a dynamic location for a CREATE EXTERNAL TABLE statement.
I am using the statement
#set loc = 's3/root/' + Replace(Convert (Varchar, Current_Date),'-','')
in order to set loc to 's3/root/20200622', but I am unable to get it to work, even though the equivalent SELECT gives the expected result.
Redshift doesn't support variables.
You could create a Python UDF for this, but I'm not sure even that will serve your use case.
You can execute the queries using Python code and pass the filenames dynamically from Python.
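For example, here is a minimal sketch of that approach using psycopg2; the connection details, schema, and column definitions are placeholders, not from the original question:

# Hypothetical sketch: build the date-suffixed S3 location in Python,
# then run CREATE EXTERNAL TABLE against Redshift over a psycopg2 connection.
from datetime import date
import psycopg2

loc = 's3://root/' + date.today().strftime('%Y%m%d')   # e.g. 's3://root/20200622'

conn = psycopg2.connect(host='my-cluster.example.com', port=5439,
                        dbname='dev', user='awsuser', password='secret')
conn.autocommit = True   # external-table DDL cannot run inside a transaction block

ddl = """
CREATE EXTERNAL TABLE spectrum_schema.my_table (id INT, payload VARCHAR(256))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{}';
""".format(loc)

with conn.cursor() as cur:
    cur.execute(ddl)
conn.close()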
BigQuery has session-bound (or script-bound) system variables, described here. Within a session, I can declare the value of one of those variables, e.g. with something like:
set @@dataset_id = 'my_dataset_id';
Now, I'd like to have a view (which I plan on running within a session) that includes something like:
create view foo
as
select ...
from @@dataset_id.my_table
... this doesn't work, nor does any form of quoting around the variable. It appears the variable simply can't be used to identify the namespace of my_table.
If that's true, I'm struggling to see the value of the variable at all. Does anyone know whether I can use these variables this way, or how to avoid having to fully qualify every instance of my_table? I'd like to manage these query scripts outside of BQ itself, and ideally without templating the dataset in everywhere (e.g. {dataset_id}.my_table).
I tried to replicate your scenario.
Please try the below:
set ##dataset_id = "SampleDataSetID";
EXECUTE IMMEDIATE format("""
CREATE VIEW sampleView1 as (
SELECT
*
FROM
%s.sampleTable
);
""", ##dataset_id);
I used EXECUTE IMMEDIATE to run the SQL dynamically, using the variable that we set in the session.
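If you'd rather manage the scripts outside of BigQuery entirely, roughly the same thing can be done from the Python client library. A minimal sketch, reusing the sample dataset/table names from above and assuming the client's default project:

# Hypothetical sketch: build the CREATE VIEW statement client-side and run it
# with google-cloud-bigquery instead of a session variable.
from google.cloud import bigquery

client = bigquery.Client()
dataset_id = "SampleDataSetID"   # placeholder, matching the example above

sql = """
CREATE VIEW `{project}.{dataset}.sampleView1` AS
SELECT * FROM `{project}.{dataset}.sampleTable`
""".format(project=client.project, dataset=dataset_id)

client.query(sql).result()   # wait for the DDL job to finish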
Good morning all,
I'm looking for a way in Google Data Fusion to make the name of a source file stored on GCS dynamic. The files to be processed are named according to their value date, for example: 2020-12-10_data.csv
I need to set the filename dynamically so that the pipeline uses the correct file every day (something like ${ new Date().getFullYear()... }_data.csv).
I managed to use runtime arguments by specifying the date as a string (2020-12-10), but not with a function.
More generally, is there any documentation on how to supply dynamic parameters with ready-made or custom "functions"? (I couldn't find any.)
Thanks in advance for your help.
There is a ready-made workaround: you can try the "BigQuery Execute" plugin.
Steps:
Put the query below in the SQL field:
select cast(current_date as string) ||'_data.csv' as filename
--for output '2020-12-15_data.csv'
Set "Row As Arguments" to 'true'
Now use the above argument via ${filename} wherever you need it.
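Alternatively, if you prefer to compute the value outside the pipeline, a runtime argument can also be supplied when the pipeline is started through the CDAP REST API that Data Fusion exposes. A rough sketch in Python; the instance URL, namespace, pipeline name, and auth handling are assumptions, not from the original post:

# Hypothetical sketch: compute today's filename and start the pipeline with it
# passed as a runtime argument (available in the pipeline as ${filename}).
from datetime import date
import requests

filename = date.today().isoformat() + "_data.csv"   # e.g. '2020-12-15_data.csv'

api = "https://INSTANCE-PROJECT-dot-REGION.datafusion.googleusercontent.com/api"  # assumed endpoint
url = api + "/v3/namespaces/default/apps/my_pipeline/workflows/DataPipelineWorkflow/start"

resp = requests.post(url,
                     json={"filename": filename},                       # runtime arguments map
                     headers={"Authorization": "Bearer ACCESS_TOKEN"})  # assumed auth token
resp.raise_for_status()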
I am a newbie to Apache Hive and Spark. I have some existing Hive tables sitting on my Hadoop server that I can run HQL commands against and get what I want out of the table using hive or beeline, e.g., selecting the first 5 rows of my table. Instead I want to use Spark to achieve the same goal. My Spark version on the server is 1.6.3.
Using the code below (I have replaced my database and table names with database and table):
sc = SparkContext(conf = config)
sqlContext = HiveContext(sc)
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
df = query.toPandas()
df.show()
I get this error:
ValueError: Some of types cannot be determined after inferring.
Error:root: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
However, I can use beeline with the same query and see the results.
After a day of googling and searching I modified the code as follows:
table_ccx = sqlContext.table("database.table")
table_ccx.registerTempTable("temp")
sqlContext.sql("SELECT * FROM temp LIMIT 5").show()
Now the error is gone but all the row values are null except one or two dates and column names.
I also tried
table_ccx.refreshTable("database.table")
and it did not help. Is there a setting or configuration that I need to ask my IT team to do? I appreciate any help.
EDIT: That said, my Python code works for some of the tables on Hadoop. Could the problem be caused by some entries in the table? If so, how come the corresponding beeline/Hive command works?
As it came out in the comments, straightening up the code a little bit makes the thing work.
The problem lies on this line of code:
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
What you are doing here is:
asking Spark to query the data source (which creates a DataFrame)
collecting everything on the driver as a local collection
parallelizing the local collection on Spark with createDataFrame
In general the approach should work, although it's evidently unnecessarily convoluted.
The following will do:
query = sqlContext.sql("SELECT * from database.table LIMIT 5")
I'm not entirely sure why the original version breaks your code, but it does (as came out in the comments), and the simpler version is an improvement anyway.
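Putting it together, a minimal sketch of the corrected flow (database and table names remain placeholders):

# Hypothetical sketch: query Hive through HiveContext and inspect the result
# directly, without collect() + createDataFrame().
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(conf=SparkConf().setAppName("hive-query"))
sqlContext = HiveContext(sc)

query = sqlContext.sql("SELECT * FROM database.table LIMIT 5")
query.show()             # show() is a Spark DataFrame method
pdf = query.toPandas()   # a pandas DataFrame uses head()/print, not show()
print(pdf.head())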
We have a situation where we are dealing with a relational source (Oracle). The system is built in such a way that we first have to execute a package which enables reading data from Oracle; only then can the user get results out of a SELECT statement. I am trying to find a way to implement this in an Informatica mapping.
What we tried
1. In Pre-SQL we tried to execute the package, and in the SQL query we wrote the SELECT statement - the data does not get loaded into the target.
2. In Pre-SQL we wrote a block in which we execute the package, and just after that (within the same BEGIN...END block) we wrote an INSERT statement on top of the SELECT statement. This does insert data through the INSERT statement; however, I am not in favor of this solution, as both source and target are dummies, which will confuse people in the future.
Is there any way to implement this using the first option?
Please help and suggest.
Thanks
The Stored Procedure transformation is there for this purpose. Configure it to execute as Source Pre-load.
Pre-SQL and the data read are not part of the same session. From what I understand, this needs to happen within the same session, because otherwise the read access is granted only to the session that executed the package.
What you can do is create a stored procedure/package that will grant the read access and then return the data. Use it as a SQL override on your Source Qualifier. This way the SQ will read the data as usual. The concept:
CREATE PROCEDURE ReadMyData AS
BEGIN
  EXECUTE IMMEDIATE 'GiveMeTheReadAccess';
  SELECT * FROM MyTable;
END;
And use the ReadMyData on the Source Qualifier.
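To illustrate the same-session point outside of Informatica, this is roughly what has to happen on a single connection. A sketch in Python with cx_Oracle; the DSN is a placeholder and the package/table names are taken from the concept above:

# Hypothetical sketch: the grant call and the read must happen on the same
# connection/session, which is why a separate Pre-SQL session is not enough.
import cx_Oracle

conn = cx_Oracle.connect("user/password@host:1521/service")  # placeholder DSN
cur = conn.cursor()
cur.callproc("GiveMeTheReadAccess")      # the package call that enables reading
cur.execute("SELECT * FROM MyTable")     # works only within this same session
rows = cur.fetchall()
conn.close()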
I am trying to get a simple PigActivity to work in Data Pipeline.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-pigactivity.html#pigactivity
The Input and Output fields are required for this activity. I have them both set to use S3DataNode. Both of these DataNodes have a directoryPath which point to my s3 input and output. I originally tried to use filePath but got the following error:
PigActivity requires 'directoryPath' in 'Output' object.
I am using a custom pig script, also located in S3.
My question is how do I reference these input and output paths in my script?
The example given in the reference uses the stage field (which can be disabled/enabled). My understanding is that this is used to convert the data into staged tables. I don't want to do this, as it also requires that you specify a dataFormat field.
Determines whether staging is enabled and allows your Pig script to have access to the staged-data tables, such as ${INPUT1} and ${OUTPUT1}.
I have disabled staging and I am trying to access the data in my script as follows:
input = LOAD '$Input';
But I get the following error:
IOException. org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : Input
I have tried using:
input = LOAD '${Input}';
But I get an error for this too.
There is the optional scriptVariable field. Do I have to use some sort of mapping here?
Just using
LOAD 'uri to your s3'
should work.
Normally this is done for you by staging (table creation), so you do not have to access the URI directly from the script; you only specify it in the S3DataNode.
Make sure you have set the "stage" property of the PigActivity to true.
Once I did that the script below started working for me:
part = LOAD ${input1} USING PigStorage(',') AS (p_partkey,p_name,p_mfgr,p_category,p_brand1,p_color,p_type,p_size,p_container);
grpd = GROUP part BY p_color;
${output1} = FOREACH grpd GENERATE group, COUNT(part);