Query in AWS Athena with dynamic date - amazon-athena

I'm executing this query:
SELECT tvd_data_hora FROM "mydb"."i_public_0622" limit 10;
but I need to build the current MMYY dynamically for the query, like:
SELECT tvd_data_hora FROM "mydb"."i_public_0822" limit 10;

Related

Query Google BigQuery table by today's date

I have a question pertaining to Google BigQuery tables. We are currently looking to query the BigQuery table based on the file uploaded into Cloud Storage on that day.
Meaning:
I have to load each day's data from Cloud Storage into the BigQuery table.
When I query:
select * from BQT where load_date = <TODAY'S DATE>
Can we achieve this without adding the date field to the file?
If you just don't want to add a date column, append the current date as a suffix to your table name, like BQT_20200112, when the GCS file is uploaded.
Then you can query the table for a specific date using the _TABLE_SUFFIX syntax.
Below is an example query using _TABLE_SUFFIX:
SELECT
field1,
field2,
field3
FROM
`your_dataset.BQT_*`
WHERE
_TABLE_SUFFIX = '20200112'
As you can see, you don't need to add an additional field like load_date when you query the tables using a date suffix and the wildcard symbol.
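Since the original question asks about today's date specifically, the suffix can also be computed at query time. A minimal sketch, assuming BigQuery Standard SQL and the BQT_* naming above:
SELECT
field1,
field2,
field3
FROM
`your_dataset.BQT_*`
WHERE
-- build today's YYYYMMDD suffix dynamically
_TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE());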

How to select data from an AWS Athena table which is partitioned like 'year=yyyy/month=MM/date=dd/' for a given date range?

The Athena table is partitioned the same way as the S3 folder path:
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9
parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14
PARTITIONED BY (
`parent` string,
`year` int,
`month` tinyint,
`date` tinyint)
Now how should I form the WHERE condition for a SELECT query to get data for parent = "9ab4fcca-65d8-11ea-bc55-0242ac130003" from 2019-06-01 to 2020-04-30?
SELECT *
FROM table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003' AND year >= 2019 AND year <= 2020 AND month >= 04 AND month <= 06 AND date >= 01 AND date <= 31 ;
But this isn't correct. Please help
Partitioning on year, month, and day separately makes it unnecessarily difficult to query tables. If you're starting out, I really suggest avoiding this kind of partitioning scheme. If you can't avoid it, you can still make things easier by creating the table partitions differently.
Most guides will tell you to create directory structures like year=2020/month=4/date=1/file1, create a table with three corresponding partition columns, and then run MSCK REPAIR TABLE to load partitions. This works, but it's far from the best way to use Athena. MSCK REPAIR TABLE has atrocious performance, and partitioning like that is far from ideal.
I suggest creating directory structures that are just 2020-03-01/file1, but if you can't, you can actually have any structure you want; 2020/03/01/file1, year=2020/month=4/date=1/file1, or any other structure where there is one distinct prefix per date will work more or less equally well.
I also suggest you create tables with only one partition column: date (or dt or day if you want to avoid quoting), typed as DATE, not string.
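For reference, here is a minimal sketch of such a table definition; the data columns, file format, and S3 location are placeholders rather than anything taken from the question:
CREATE EXTERNAL TABLE my_table (
col1 string,
col2 bigint
)
PARTITIONED BY (`day` date)
STORED AS PARQUET
LOCATION 's3://some-bucket/path/to/';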
What you do then, instead of running MSCK REPAIR TABLE, is use ALTER TABLE … ADD PARTITION or the Glue APIs directly to add partitions. This command lets you specify the location separately from the partition column value:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/2020-04-01/'
The important thing here is that the partition column value doesn't have to have any relationship at all with the location, this would work equally well:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/data-for-first-of-april/'
For your specific case you could have:
PARTITIONED BY (`parent` string, `day` date)
and then do:
ALTER TABLE your_table ADD
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-17') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-09') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9'
PARTITION (parent = '0fc966a0-bba7-4c0b-a648-cff7f0332059', day = '2020-04-16') LOCATION 's3://your-bucket/parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-14') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14'
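With partitions registered like that, the date range from the question becomes a simple comparison on the day partition column. A sketch, assuming the parent/day layout above (and using 2020-04-30 as the upper bound, since April has only 30 days):
SELECT *
FROM your_table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003'
AND day BETWEEN DATE '2019-06-01' AND DATE '2020-04-30';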
Here is how you can use the year, month, and date values that come from the partitions in order to select a date range:
SELECT col1, col2
FROM my_table
WHERE CAST(date_parse(concat(CAST(year AS VARCHAR(4)), '-',
CAST(month AS VARCHAR(2)), '-',
CAST("date" AS VARCHAR(2))
), '%Y-%m-%d') AS DATE)
BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'
You can add additional filter statements as needed.

SQLAlchemy: update result set of a raw query

I am new to SQLAlchemy. I have a raw SQL query which I need to execute by passing bind parameters. For the rows resulting from the query, I need to update a particular column value. How do I do this in an efficient way?
Below are the columns in my table metrics:
TABLE
id,total,pass,fail,category,ref_id
query = "Select * from table where id in(select max(id) from table ...)"
sql = text(query)
result = db.engine.execute(sql, CATEGORY=category)
for row in result:
    # update ref_id for this row here
So I have this complex query that I need to execute as an inline query. Let's say I get three rows from my query and I need to update ref_id for all 3 rows with a value. How can I achieve this, preferably as a bulk update?
I am using Python 2.7, SQLAlchemy==0.9.9, SQLAlchemy-Utils==0.29.8
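If the new ref_id value is known up front, one option is to skip the per-row loop and run a single bulk UPDATE through the same text()/engine.execute() pattern, reusing the subquery. This is only a sketch: :new_ref is a hypothetical bind parameter, the subquery stands in for the one elided above, and depending on the database you may need to wrap the inner SELECT in a derived table:
UPDATE metrics
SET ref_id = :new_ref
WHERE id IN (SELECT max(id) FROM metrics /* ... same conditions as the SELECT ... */);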

Completing information in a table in Power BI

I was hoping someone could help me with a part of the formula to get me to my end goal.
Table PR_HIST_MOVIM_PEDID:
Table LOG:
The PR_HIST_MOVIM_PEDID table has the order status per day, but when the month turns over, it loses information. The LOG table has the last transition of each order per day, and it is this table that is the correct reference. I have to create a new table that shows me all the orders and their days without losing information when the month turn-over happens.
Example:
NOTE: Notice that this is a small example. My database table has tens of thousands of values.
How to achieve the final result?
Table Log
First, create a new table that just has the dates that you want to use. I created a new table like this:
AllDates = CALENDAR(MIN(TableLOG[DTH_INCLUI_LOG]), MAX(TableLOG[DTH_INCLUI_LOG]))
Now write a measure like this:
Measure =
VAR CurrentDate = SELECTEDVALUE(AllDates[Date])
VAR MaxDate = CALCULATE(MAX(TableLOG[DTH_INCLUI_LOG]),
TableLOG[DTH_INCLUI_LOG] <= CurrentDate)
RETURN CALCULATE(MAX(TableLOG[VLR_NOVO]), TableLOG[DTH_INCLUI_LOG] = MaxDate)
You should now be able to create a matrix with the AllDates[Date] for the columns and the Ovitem_log for the rows with the Measure for the values.

Using TABLESAMPLE with PERCENT returns all the records from table

I have a small test table with two fields, id and name, 19 records in total. When I try to get 10 percent of the records from this table using the following query, I get ALL the records. I tried to do this on a large table, but the result is the same: all records are returned. The query:
select * from test tablesample (10 percent) s;
If I use ROWS instead of PERCENT (i.e. select * from test tablesample (10 rows) s;), it works fine and only 10 records are returned. How can I get just the necessary percentage of records?
You can refer to the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
You must be using CombinedHiveOutputFormat, which does not go well with the ORC format. Hence you will never be able to save the output of the PERCENT query to a table.
To my knowledge, the best way to do this is using the rand() function. But again, you should not use this with an ORDER BY clause as it will impact performance. Here is my sample query, which is time-efficient:
SELECT * FROM table_name
WHERE rand() <= 0.0001
DISTRIBUTE BY rand()
SORT BY rand()
LIMIT 5000;
I tested this on a 900M-row table and the query executed in 2 minutes.
Hope this helps.
You can use PERCENT with TABLESAMPLE. For example:
SELECT * FROM TABLE_NAME
TABLESAMPLE(1 PERCENT) T;
This will select 1% of the data size of the input and not necessarily 1% of the number of rows. More details can be found here.
But if you are really looking for a method to select a percentage of the number of rows, then you may have to use the LIMIT clause with the number of records you need to retrieve.
For example, if your table has 1000 records, then you can select a random 10% of them as:
select * from table_name order by rand() limit 100;