How do we drop partitions in Hive with regex? Is it possible?

I am trying to run the following
alter table historical_data drop partition (my_date not rlike '[A-Za-z]');
Which gives me an Exception
org.apache.hadoop.hive.ql.parse.ParseException: line 2:69 mismatched input 'not' expecting set null in drop partition statement
I couldn't find anything similar. I did see one answer on a similar question here on SO, but it doesn't work.
Any help is appreciated.

Unfortunately, regexp is not supported in DROP PARTITION.
You can use the comparators <, >, <=, >=, <>, =, and !=; maybe that will help. See usage in this answer: https://stackoverflow.com/a/56646879/2700344
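For example, a minimal sketch, assuming my_date is a string-typed partition column (the cutoff value here is hypothetical):
alter table historical_data drop partition (my_date < '2020-01-01');
This drops every partition whose my_date value sorts lexicographically before '2020-01-01'.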
See also this Jira: Extend ALTER TABLE DROP PARTITION syntax to use all comparators.
There is also one more Jira, not implemented yet: Extend ALTER TABLE DROP PARTITION syntax to use multiple conditions.
Impala supports LIKE in drop partition:
alter table historical_data drop partition (year < 1995, last_name like 'A%');
I created this Jira for adding regexp support; please vote for it in Jira if you need this feature.


What is LATEST_ON syntax in QuestDB?

I'm using QuestDB and SQL for the first time, and I stumbled upon the LATEST_ON syntax used in QuestDB. Can someone explain its usage and where to use it?
Quoted from the docs:
For scenarios where multiple time series are stored in the same table, it is relatively difficult to identify the latest items of these time series with standard SQL syntax. QuestDB introduces LATEST ON clause for a SELECT statement to remove boilerplate clutter and splice the table with relative ease.
For more information visit the official documentation
LATEST ON is for finding the latest record for each unique time series in a table. See this page for some examples: https://questdb.io/docs/reference/sql/latest-on/
It gives you the latest available record for each combination of the PARTITION BY values, according to the ON timestamp.
Maybe easier to understand with an example. If you go to https://demo.questdb.io you can execute this query
select * from trades latest on timestamp
partition by symbol, side
It will then show you the latest existing row for each combination of Symbol and Side. If you wanted to do this using standard SQL, you would probably have to use a window function, something like this:
select * from
(select *,
ROW_NUMBER() over (partition by Symbol, Side
order by timestamp DESC) AS RowNumber
from trades where timestamp > '2022-10-01') t
where t.RowNumber = 1
LATEST ON retrieves the latest entry by timestamp for a given key or combination of keys, in scenarios where multiple time series are stored in the same table.
Check this link for some examples: https://questdb.io/docs/reference/sql/latest-on/

BigQuery struct introspection

Is there a way to get the element types of a struct? For example something along the lines of:
SELECT #TYPE(structField.y)
SELECT #TYPE(structField)
...etc
Is that possible to do? The closest I can find is via the query editor and the web call it makes to validate a query.
As I already mentioned in the comments, one option is to mimic that same dry-run call, with the query built in such a way that it fails with an error message containing exactly the info you are looking for. Obviously, this assumes your use case can be implemented in whatever scripting language you prefer; it should be relatively easy to do.
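For instance, a hypothetical sketch of that trick: selecting a field that does not exist in the struct makes the dry run fail with an error message that spells out the full struct type:
select s.no_such_field
from (
select struct(date '2022-01-01' as birthdate, 'UA' as country) s
)
-- the dry run should fail with something like:
-- Field name no_such_field does not exist in STRUCT<birthdate DATE, country STRING>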
Meanwhile, I was looking at making this work within the SQL query itself.
Below is the example of another option.
It is limited to the types below, which may or may not fit your particular use case:
object, array, string, number, boolean, null
So, the example is:
select
s.birthdate, json_type(to_json(s.birthdate)),
s.country, json_type(to_json(s.country)),
s.age, json_type(to_json(s.age)),
s.weight, json_type(to_json(s.weight)),
s.is_this, json_type(to_json(s.is_this))
from (
select struct(date '2022-01-01' as birthdate, 'UA' as country, 1 as age, 2.5 as weight, true as is_this) s
)
with output showing the JSON type for each field: string for birthdate and country, number for age and weight, and boolean for is_this.
You can try the below approach.
SELECT COLUMN_NAME, DATA_TYPE
FROM `your-project.your-dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE TABLE_NAME = 'your-table-name'
AND COLUMN_NAME = 'your-struct-column-name'
ORDER BY ORDINAL_POSITION
You can check this documentation for more details on using INFORMATION_SCHEMA with BigQuery.
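If you need the type of each individual struct field rather than the declared type of the whole column, INFORMATION_SCHEMA also has a COLUMN_FIELD_PATHS view; a sketch along the same lines (project, dataset, table, and column names are placeholders):
SELECT FIELD_PATH, DATA_TYPE
FROM `your-project.your-dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE TABLE_NAME = 'your-table-name'
AND COLUMN_NAME = 'your-struct-column-name'
This returns one row per nested field (e.g. structField.y) together with its data type.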

Amazon Athena: no viable alternative at input

While creating a table in Athena, it gives me the following exception:
no viable alternative at input
Hyphens are not allowed in table names (though the wizard allows them). Just remove the hyphen and it works like a charm.
Unfortunately, at the moment the syntax validation error messages are not very descriptive in Athena; this error may mean almost any possible syntax error in the CREATE TABLE statement.
Although this is annoying, for the moment you will need to check that your syntax follows the CREATE TABLE documentation.
Some examples are:
Backticks not in place (as already pointed out)
Missing/extra commas (remember that the last column doesn't need a comma after its definition)
Missing spaces
and more.
This error generally occurs when the syntax of the DDL has some silly errors. There are several answers here that explain different errors based on their cause. The simple solution to this problem is to patiently look at the DDL and verify the following points line by line:
Check for missing commas
Unbalanced ` (backtick) characters
Incompatible data types not supported by Hive (see the Hive data types reference)
Unbalanced commas
Hyphens in the table name
In my case, it was because of a trailing comma after the last column in the table. For example:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
one STRING,
two STRING,
) LOCATION 's3://my-bucket/some/path';
After I removed the comma at the end of two STRING, it worked fine.
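That is, the working version is the same statement minus the trailing comma:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
one STRING,
two STRING
) LOCATION 's3://my-bucket/some/path';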
My case: it was an external table, and the location had a typo (hence it didn't exist).
Couple of tips:
Click the "Format query" button so you can spot errors easily
Use the example at the bottom of the documentation - it works - and modify it with your parameters: https://docs.aws.amazon.com/athena/latest/ug/create-table.html
Slashes. Mine was slashes. I had the DDL from Athena, saved as a Python string.
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'quoteChar'='\"',
'separatorChar'=',')
was changed to
WITH SERDEPROPERTIES (
'escapeChar'='\',
'quoteChar'='"',
'separatorChar'=',')
And everything fell apart.
Had to make it:
WITH SERDEPROPERTIES (
'escapeChar'='\\\\',
'quoteChar'='\\\"',
'separatorChar'=',')
In my case, it was an extra comma in the PARTITIONED BY section.
In my case, I was missing the single quotes around the S3 URL.
In my case, it was that one of the table column names ('bucket') was enclosed in single quotes, as per the AWS documentation :(
As other users have noted, the standard syntax validation error message that Athena provides is not particularly helpful. Thoroughly checking the required DDL syntax (see HIVE data types reference) that other users have mentioned can be pretty tedious since it is fairly extensive.
So, an additional troubleshooting trick is to let AWS's own data parsing engine (AWS Glue) give you a hint about where your DDL may be off. The idea here is to let AWS Glue parse the data using its own internal rules and then show you where you may have made your mistake.
Specifically, here are the steps that worked for me to troubleshoot my DDL statement, which was giving me lots of trouble:
create a data crawler in AWS Glue; AWS and lots of other places go through the very detailed steps this requires so I won't repeat it here
point the crawler to the same data that you wanted (but failed) to upload into Athena
set the crawler output to a table (in an Athena database you've already created)
run the crawler and wait for the table with populated data to be created
find the newly-created table in the Athena Query Editor tab, click on the three vertical dots (...), and select "Generate Create Table DDL"
this will make Athena create the DDL for this table that is guaranteed to be valid (since the table was already created using that DDL)
take a look at this DDL and see if/where/how it differs from the DDL that you originally wrote. Naturally, this automatically-generated DDL will not have the exact choices for the data types that you may find useful, but at least you will know that it is 100% valid
finally, update your DDL based on this new Glue/Athena-generated DDL, adjusting the column/field names and data types for your particular use case
After searching and following all the good answers here, my issue turned out to be that, working in Node.js, I needed to remove the optional
ESCAPED BY '\'
used in the row settings to get my query to work. Hope this helps others.
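That is, roughly, going from something like this (the exact row settings here are a hypothetical sketch):
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
to:
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','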
Something that wasn't obvious to me the first time I used the UI is that if you get an error in the create table 'wizard', you can then cancel, and the failed query should appear in a new query window for you to edit and fix.
My database name had a hyphen, so I added backticks in the query and reran it.
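For example, a sketch with hypothetical names:
CREATE EXTERNAL TABLE `my-database`.my_table (id STRING)
LOCATION 's3://my-bucket/some/path/';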
This happened to me due to having comments in the query.
I realized this was a possibility when I tried the "Format query" button and it turned the entire thing into almost one line, mostly commented out. My guess is that the query parser runs this formatter before sending the query to Athena.
Removed the comments, ran the query, and an angel got its wings!

In Redshift, how do you combine CTAS with the "if not exists" clause?

I'm having some trouble getting this table creation query to work, and I'm wondering if I'm running into a limitation in Redshift.
Here's what I want to do:
I have data that I need to move between schemas, and I need to create the destination tables for the data on the fly, but only if they don't already exist.
Here are queries that I know work:
create table if not exists temp_table (id bigint);
This creates a table if it doesn't already exist, and it works just fine.
create table temp_2 as select * from temp_table where 1=2;
So that creates an empty table with the same structure as the previous one. That also works fine.
However, when I do this query:
create table if not exists temp_2 as select * from temp_table where 1=2;
Redshift chokes and says there is an error near "as" (for the record, I did try removing "as", and then it says there is an error near "select").
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
I should mention that I absolutely can separate out the queries that selectively create the table and populate it with data, and I probably will end up doing that. I was mostly just curious if anyone could tell me what's wrong with that query.
EDIT:
I do not believe this is a duplicate. The post linked to offers a number of solutions that rely on user-defined functions... Redshift doesn't support UDFs. They did recently implement a Python-based UDF system, but my understanding is that it's in beta, and we don't know how to implement it anyway.
Thanks for looking, though.
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
Indeed, this combination of CREATE TABLE ... AS SELECT and IF NOT EXISTS is not possible in Redshift (per the documentation). As for PostgreSQL, it's been possible since version 9.5.
On SO, this is discussed here: PostgreSQL: Create table if not exists AS. The accepted answer provides options that don't require any UDF or procedural code, so they're likely to work with Redshift too.
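For example, one workaround is simply to split the statement in two: create the empty table first, then populate it separately (a minimal sketch using the table names from the question; Redshift's LIKE clause copies the column definitions):
create table if not exists temp_2 (like temp_table);
insert into temp_2 select * from temp_table;
Note that rerunning the insert will duplicate rows, so guard it accordingly.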

Django: how to filter() after distinct()

If we chain a call to filter() after a call to distinct(), the filter is applied to the query before the distinct. How do I filter the results of a query after applying distinct?
Example.objects.order_by('a','foreignkey__b').distinct('a').filter(foreignkey__b='something')
The where clause in the SQL resulting from filter() means the filter is applied to the query before the distinct. I want to filter the queryset resulting from the distinct.
This is probably pretty easy, but I just can't quite figure it out and I can't find anything on it.
Edit 1:
I need to do this in the ORM...
SELECT z.column1, z.column2, z.column3
FROM (
SELECT DISTINCT ON (b.column1, b.column2) b.column1, b.column2, c.column3
FROM table1 a
INNER JOIN table2 b ON ( a.id = b.id )
INNER JOIN table3 c ON ( b.id = c.id)
ORDER BY b.column1 ASC, b.column2 ASC, c.column4 DESC
) z
WHERE z.column3 = 'Something';
(I am using Postgres by the way.)
So I guess what I am asking is "How do you nest subqueries in the ORM? Is it possible?" I will check the documentation.
Sorry if I was not specific earlier. It wasn't clear in my head.
This is an old question, but when using Postgres you can do the following to force nested queries on your 'Distinct' rows:
foo = Example.objects.order_by('a','foreign_key__timefield').distinct('a')
bar = Example.objects.filter(pk__in=foo).filter(some_field=condition)
bar is the nested query as requested in the OP, without resorting to raw/extra etc. Tested working in 1.10, but the docs suggest it should work back to at least 1.7.
My use case was to filter up a reverse relationship. If Example has some ForeignKey to model Toast then you can do:
Toast.objects.filter(pk__in=bar.values_list('foreign_key', flat=True))
This gives you all instances of Toast where the most recent associated example meets your filter criteria.
A big health warning about performance though: if bar is likely to be a huge queryset, you're probably going to have a bad time using this.
Thanks a ton for the help, guys. I tried both suggestions and could not bend either of them to work, but I think they started me in the right direction.
I ended up using
from django.db.models import Max, F
Example.objects.annotate(latest=Max('foreignkey__timefield')).filter(foreignkey__timefield=F('latest'), foreign__a='Something')
This checks what the latest foreignkey__timefield is for each Example, and if it is the latest one and a=something then keep it. If it is not the latest or a!=something for each Example then it is filtered out.
This does not nest subqueries, but it gives me the output I am looking for, and it is fairly simple. If there is a simpler way, I would really like to know.
No, you can't do this in one simple SELECT.
As you said in the comments, in the Django ORM filter is mapped to the SQL clause WHERE, and distinct is mapped to DISTINCT. And in SQL, DISTINCT always happens after WHERE, operating on the result set; see the SQLite docs for an example.
But you could write a subquery to nest SELECTs; how depends on the actual target (I don't know exactly what yours is.. could you elaborate more?)
Also, for your query, distinct('a') only keeps the first occurrence of Example having the same a; is that what you want?