AWS GlueStudio 'datediff' - amazon-web-services

Has anyone tried using AWS GlueStudio and the custOm SQL queries? I am currently trying to find the difference in days between to dates like so..
select
datediff(currentDate, expire_date) as days_since_expire
But in the data preview window I get an
AnalysisException: cannot resolve 'currentDate' given input columns: []; line 3 pos 9; 'Project ['datediff('nz_eventdate, 'install_date) AS days_since_install#613] +- OneRowRelation
Does anyone know how to fix this solution or what causes it?

You don't write PostgreSQL/T/PL (or any other flavor) SQL, instead "you enter the Apache SparkSQL query". Read the following carefully:
Using a SQL query to transform data (in AWS Glue "SQL Query" transform task)
https://docs.aws.amazon.com/glue/latest/ug/transforms-sql.html
The functions you can write in AWS Glue "SQL Query" transform task to achieve desired transformation are here (follow correct syntax):
https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html
BTW: The error you wrote is not correlating with your select statement for many potential reasons, but I am writing this answer anyway just for the sake of your question heading or other who may come here.

Related

DLP data scan from bigquery table showing start byte as null

I have scanned a Bigquery table from Google DLP Console. The scan results are saved back into a big query table. DLP has identified sensitive information, but the start byte is shown as null, can anyone help me understand why?
The source data looks as follows:
2,james#example.org ,858-333-0333,333-33-3333,8
3,mallory#example.org,858-222-0222,222-22-2222,8
4,maria#example.org ,858-444-0444,444-44-4444,1
------------------------------
If I put the same data in Cloud storage bucket and then perform a scan using DLP, I get the start and end bytes for the sensitive data
Thanks folks, the product team is investigating. What's happening is that "0" is mapping to null "by accident" due to a proto to BQ schema conversion bug on our end. We'll address this.
Unfortunatelly this looks like a bug.
I was able to reproduce your issue completely; I fallowed these steps:
screated a source csv file:
1,mail1#test.com,858-333-0333,333-33-3333,8
2,epaweda-8101#yopmail.com,858-333-0334,333-33-3334,3
3,petersko#live.com,858-333-0335,333-33-3335,5
4,danneng#gmail.com,858-333-0336,333-33-3336,1
5,chance#icloud.com,858-333-0337,333-33-3337,4
imported it to a BQ table - it looks like this:
DLP'ed it and got the same result with null column:
In my opinion this is a bug (certainly looks like it) so my recommendation would be to go to Google's Issuetracker and report it here (with as much details as possible) and wait for an answer.

Query hive table with Spark

I am newbie to Apache Hive and Spark. I have some existing Hive tables sitting on my Hadoop server that I can run some HQL commands and get what I want out of the table using hive or beeline, e.g, selecting first 5 rows of my table. Instead of that I want to use Spark to achieve the same goal. My Spark version on server is 1.6.3.
Using below code (I replace my database name and table with database and table):
sc = SparkContext(conf = config)
sqlContext = HiveContext(sc)
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
df = query.toPandas()
df.show()
I get this error:
ValueError: Some of types cannot be determined after inferring.
Error:root: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
However, I can use beeline with same query and see the results.
After a day of googling and searching I modified the code as:
table_ccx = sqlContext.table("database.table")
table_ccx.registerTemplate("temp")
sqlContext.sql("SELECT * FROM temp LIMIT 5").show()
Now the error is gone but all the row values are null except one or two dates and column names.
I also tried
table_ccx.refreshTable("database.table")
and it did not help. Is there a setting or configuration that I need to ask my IT team to do? I appreciate any help.
EDIT: Having said that, my python code is working for some of the table on Hadoop. Do not know the problem is because of some entries on table or not? If yes, then how come the corresponding beeline/Hive command is working?
As it came out in the comments, straightening up the code a little bit makes the thing work.
The problem lies on this line of code:
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
What you are doing here is:
asking Spark to query the data source (which creates a DataFrame)
collect everything on the driver as a local collection
parallelize the local collection on Spark with createDataFrame
In general the approach should work, although it's evidently unnecessarily convoluted.
The following will do:
query = sqlContext.sql("SELECT * from database.table LIMIT 5")
I'm not entirely sure of why the thing breaks your code, but still it does (as it came out in the comments) and it also improves it.

Pentaho PDI get SQL SUM() with conditions

I'm using Pentaho PDI 7.1. I'm trying to convert data from Mysql to Mysql changing the structure of data.
I'm reading the source table (customers) and for each row I've to run another query to calculate the balance.
I was trying to use Database value lookup to accomplish it but maybe is not the best way.
I've to run a query like this to get the balance:
SELECT
SUM(
CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END
)
FROM Movimento WHERE contoFidelizzato_id = ?
I should set the parameter taking it from the previous step. Some advice?
The Database lookup value may be a good idea, especially if you are used to database reasoning, but it may result in many queries which may not be the most efficient.
A more PDI-ish style would be to make the query like:
SELECT contoFidelizzato_id
, SUM(CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END)
FROM Movimento
GROUP BY contoFidelizzato_id
and use it as the info source of a Lookup Stream Step, like this:
An even more PDI-ish style would be to divert the source table (customer) in two flows : one in which you keep the source rows, and one that you group by contoFidelizzato_id. Of course, you need a formula, or a Javascript, or to put a formula in the SQL of the Table input to change the sign when needed.
Test to know which strategy is better in your case. You'll soon discover that the PDI is very good at handling large data.

Select stmt in source qualifier along with procedure call in Informatica

We have a situation where we are dealing with a relational source(Oracle). The system is developed in a way where we have to first execute a package which will enable data read from Oracle and user will be able to get results out of select statement. I am trying to find a way on how to implement this in informatica mapping.
What we tried
1. In PreSQL we tried to execute the package and in SQL query we wrote select statement - data not getting loaded in target.
2. In PreSQL we wrote a block in which we are executing the package and just after that(within same beging...end block) we wrote insert statement on top of select statement - This is inserting data through insert statement however I am not in favor of this solution as both source and target are dummy which will confuse people in future.
Is there any possibility to implement this solution somehow by using 1st option.
Please help and suggest.
Thanks
The stored procedure transformation is there for this purpose configure it to execute source pre load
Pre-Sql and data read are not a part of same session. From what I understand, this needs to be done within the same session as otherwise the read is granted only for the session.
What you can do, is create a stored procedure/package that will grant read access and then return the data. Use it as a SQL Override on your SQ. This way SQ will read the data as usual. The concept:
CREATE PROCEDURE ReadMyData AS
BEGIN
execute immediate 'GiveMeTheReadAccess';
select * from MyTable;
END;
And use the ReadMyData on the Source Qualifier.

Amazon Athena: no viable alternative at input

While creating a table in Athena; it gives me following exception:
no viable alternative at input
hyphens are not allowed in table name.. ( though wizard allows it ) .. Just remove hyphen and it works like a charm
Unfortunately, at the moment the syntax validation error messages are not very descriptive in Athena, this error may mean "almost" any possible syntax errors on the create table statement.
Although this is annoying at the moment you will need to check if the syntax follows the Create table documentation
Some examples are:
Backticks not in place (as already pointed out)
Missing/extra commas (remember that the last column doesn't need the comma after column definition
Missing spaces
More ..
This error generally occurs when the syntax of DDL has some silly errors.There are several answers that explain different errors based on there state.The simple solution to this problem is to patiently look into DDL and verify following points line by line:-
Check for missing commas
Unbalanced `(backtick operator)
Incompatible datatype not supported by HIVE(HIVE DATA TYPES REFERENCE)
Unbalanced comma
Hypen in table name
In my case, it was because of a trailing comma after the last column in the table. For example:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
one STRING,
two STRING,
) LOCATION 's3://my-bucket/some/path';
After I removed the comma at the end of two STRING, it worked fine.
My case: it was an external table and the location had a typo (hence didn't exist)
Couple of tips:
Click the "Format query" button so you can spot errors easily
Use the example at the bottom of the documentation - it works - and modify it with your parameters: https://docs.aws.amazon.com/athena/latest/ug/create-table.html
Slashes. Mine was slashes. I had the DDL from Athena, saved as a python string.
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'quoteChar'='\"',
'separatorChar'=',')
was changed to
WITH SERDEPROPERTIES (
'escapeChar'='\',
'quoteChar'='"',
'separatorChar'=',')
And everything fell apart.
Had to make it:
WITH SERDEPROPERTIES (
'escapeChar'='\\\\',
'quoteChar'='\\\"',
'separatorChar'=',')
In my case, it was an extra comma in PARTITIONED BY section,
In my case, I was missing the singlequotes for the S3 URL
In my case, it was that one of the table column names was enclosed in single quotes, as per the AWS documentation :( ('bucket')
As other users have noted, the standard syntax validation error message that Athena provides is not particularly helpful. Thoroughly checking the required DDL syntax (see HIVE data types reference) that other users have mentioned can be pretty tedious since it is fairly extensive.
So, an additional troubleshooting trick is to let AWS's own data parsing engine (AWS Glue) give you a hint about where your DDL may be off. The idea here is to let AWS Glue parse the data using its own internal rules and then show you where you may have made your mistake.
Specifically, here are the steps that worked for me to troubleshoot my DDL statement, which was giving me lots of trouble:
create a data crawler in AWS Glue; AWS and lots of other places go through the very detailed steps this requires so I won't repeat it here
point the crawler to the same data that you wanted (but failed) to upload into Athena
set the crawler output to a table (in an Athena database you've already created)
run the crawler and wait for the table with populated data to be created
find the newly-created table in the Athena Query Editor tab, click on the three vertical dots (...), and select "Generate Create Table DLL":
this will make Athena create the DLL for this table that is guaranteed to be valid (since the table was already created using that DLL)
take a look at this DLL and see if/where/how it differs from the DLL that you originally wrote. Naturally, this automatically-generated DLL will not have the exact choices for the data types that you may find useful, but at least you will know that it is 100% valid
finally, update your DLL based on this new Glue/Athena-generated-DLL, adjusting the column/field names and data types for your particular use case
After searching and following all the good answers here.
My issue was that working in Node.js i needed to remove the optional
ESCAPED BY '\' used in the Row settings to get my query to work. Hope this helps others.
Something that wasn't obvious for me the first time I used the UI is that if you get an error in the create table 'wizard', you can then cancel and there should be the query used that failed written in a new query window, for you to edit and fix.
My database had a hypen, so I added backticks in the query and rerun it.
This happened to me due to having comments in the query.
I realized this was a possibility when I tried the "Format Query" button and it turned the entire thing into almost 1 line, mostly commented out. My guess is that the query parser runs this formatter before sending the query to Athena.
Removed the comments, ran the query, and an angel got its wings!