What file metadata is available in AWS Athena?

AWS Athena allows the user to display the underlying file from which a row is being read, like so:
select timestamp, "$path" from table
I'm after a comprehensive list of other columns similar to $path. In particular I'm looking for a way to query the total size of those files (not the data scanned by the query, but the total size of the files).

AWS released Athena engine version 3 last month, and I can see that it supports the hidden columns $file_size and $file_modified_time, but I couldn't find them advertised in the documentation.
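For reference, a minimal sketch of how those hidden columns can be combined on engine version 3 (my_table is a placeholder name, not from the original question):
-- per-row file metadata
select "$path" as source_file, "$file_size" as file_size_bytes, "$file_modified_time" as file_last_modified
from my_table
limit 10;
-- total size of the underlying files, counting each distinct file once
select sum(file_size_bytes) as total_bytes
from (select distinct "$path" as source_file, "$file_size" as file_size_bytes from my_table) t;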

Related

How does Amazon Athena select new files/records from S3?

I'm adding files to Amazon S3 from time to time, and I'm using Amazon Athena to query this data and save the aggregated result in another S3 bucket in CSV format. I'm trying to find a way for Athena to select only new data (data not previously queried by Athena), in order to optimize cost and avoid data duplication.
I have tried updating the records after they were selected by Athena, but UPDATE queries are not supported in Athena.
Any ideas on how to solve this?
Athena does not keep track of files on S3; it only figures out what files to read when you run a query.
When planning a query, Athena will look at the table metadata for the table location, list that location, and finally read all files that it finds during query execution. If the table is partitioned, it will list the locations of all partitions that match the query.
The only way to control which files Athena will read during query execution is to partition a table and ensure that queries match the partitions you want it to read.
One common way of reading only new data is to put data into prefixes on S3 that include the date, and create tables partitioned by date. At query time you can then filter on the last week, month, or other time period to limit the amount of data read.
You can find more information about partitioning in the Athena documentation.
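As a hedged sketch of that pattern (the bucket, table, and column names below are made up for illustration):
-- table partitioned by a date-string column, one S3 prefix per day
create external table events (id string, payload string)
partitioned by (dt string)
stored as parquet
location 's3://my-bucket/events/';
-- register each new day's prefix as a partition as data arrives
alter table events add if not exists
partition (dt = '2023-01-15') location 's3://my-bucket/events/dt=2023-01-15/';
-- queries that filter on the partition column only read the matching prefixes
select count(*) from events where dt >= '2023-01-08';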

Determine the table creation date in AWS Athena using information_schema catalog?

Does anyone know if it's possible to retrieve the creation date of a table in AWS Athena using SQL on the information_schema catalog? I know I can use show properties on an individual table basis, but I want to get the data for thousands of tables.
This can tell you when the data in a given table was populated.
1) list the s3 objects in the table
Some SQL to execute via Athena. I like knowing record counts as well, but really the "$PATH" thing is the important part.
select count(*) as record_cnt, "$PATH" as path
from myschema.mytable
group by "$PATH"
order by "$PATH"
record_cnt path
10000 s3://mybucket/data/foo/tables/01234567-890a-bcde/000000_00000-abcdef
..etc...
2) get the timestamp for the s3 objects
$ aws s3 ls s3://mybucket/data/foo/tables/01234567-890a-bcde/000000_00000-abcdef
2022-01-01 12:01:23 4456780 000000_00000-abcdef
$
You'll need to loop through the S3 objects & consolidate the timestamps.
I see +/- a few seconds on the backing s3 objects when I create tables, so you may want to consolidate that to the nearest 5 minutes... or hour.
Shrug. Whatever fits the tempo of your table creation.
I've also seen tables where data is appended over time, so maybe the closest you could get to an actual table creation timestamp would be the oldest S3 object's timestamp.
PS: I haven't dug into the Glue table definition stuff enough to know whether one can get useful metadata out of that.
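A hedged footnote tying this back to the first question above: on Athena engine version 3, the hidden $file_modified_time column should give the same information without listing S3 objects by hand, with the oldest object's timestamp approximating the time the data was populated.
select min("$file_modified_time") as oldest_object,
max("$file_modified_time") as newest_object
from myschema.mytable;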

Is it possible to query log data stored in Cloud Storage without cleaning it, using BigQuery?

I have a huge amount of log data exported from StackDriver to Google Cloud Storage. I am trying to run queries using BigQuery.
However, while creating the table in BigQuery Dataset I am getting
Invalid field name "k8s-app".
Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
Table: bq_table
A huge amount of log data is exported from StackDriver sinks, and it contains a large number of unique column names. Some of these names aren't valid column names for BigQuery tables.
What is the solution for this? Is there a way to query the log data without cleaning it? Using temporary tables or something else?
Note: I do not want to load(put) my data into BigQuery Storage, just to query data which is present in Google Cloud Storage.
* EDIT *
Please refer to this documentation for clear understanding
I think you can go down one of these routes, depending on your application:
A. Ignore Header
If the problematic field is in the header row of your logs, you can choose to ignore the header row by adding the --skip_leading_rows=1 parameter in your import command. Something like:
bq --location=US load --source_format=YOURFORMAT --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING,..'
B. Load Raw Data
If the above is not applicable, then just simply load the data in its un-structured raw format into BigQuery. Once your data is in there, you can go about doing all sorts of stuff.
So, first create a table with a single column:
bq mk --table mydataset.rawlogstable 'data:STRING'
Now load your dataset in the table providing appropriate location:
bq --location=US load --replace --source_format=YOURFORMAT mydataset.rawlogstable gs://mybucket/path/* 'data:STRING'
Once your data is loaded, now you can process it using SQL queries, and split it based on your delimiter and skip the stuff you don't like.
C. Create External Table
If you do not want to load data into BigQuery but still want to query it, you can choose to create an external table in BigQuery:
bq --location=US mk --external_table_definition=data:STRING@CSV=gs://mybucket/path/* mydataset.rawlogstable
Querying Data
If you pick option A and it works for you, you can simply choose to query your data the way you were already doing.
If you pick B or C, your table now holds each row of your dataset as a single-column row. You can then split these single-column rows into multiple columns, based on your delimiter requirements.
Let's say your rows should have 3 columns named a, b, and c:
a1,b1,c1
a2,b2,c2
Right now it's all in a single column named data, which you can split on the delimiter ,:
select
splitted[safe_offset(0)] as a,
splitted[safe_offset(1)] as b,
splitted[safe_offset(2)] as c
from (select split(data, ',') as splitted from `mydataset.rawlogstable`)
Hope it helps.
To expand on @khan's answer:
If the files are JSON, then you won't be able to use the first method (skip headers).
But you can load each JSON row raw to BigQuery - as if it was a CSV - and then parse in BigQuery
Find a full example for loading rows raw at:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And then you can use JSON_EXTRACT_SCALAR to parse JSON in BigQuery - and transform the existing field names into BigQuery compatible ones.
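A hedged sketch of that parsing step (the JSON field names below are made up; only the single data column comes from the raw load above):
select
JSON_EXTRACT_SCALAR(data, '$.timestamp') as log_timestamp,
JSON_EXTRACT_SCALAR(data, '$.resource.type') as resource_type,
JSON_EXTRACT_SCALAR(data, "$['labels']['k8s-app']") as k8s_app -- hyphenated key becomes a legal column alias
from `mydataset.rawlogstable`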
Unfortunately no!
As part of log analytics, it is common to reshape the log data and run a few ETL steps before the files are committed to a persistent sink such as BigQuery.
If performance monitoring is all you need for log analytics, and there is no rationale to write additional ETL code, all metrics can be derived from the REST API endpoints of Stackdriver Monitoring.
If you do not need the fields whose names contain -, you can use ignore_unknown_values. You have to provide the schema you want, and with ignore_unknown_values any field not matching that schema will be ignored.

Moving a partitioned table across regions (from US to EU)

I'm trying to move a partitioned table over from the US to the EU region, but whenever I manage to do so, it doesn't partition the table on the correct column.
The current process that I'm taking is:
Create a Storage bucket in the region that I want the partitioned table to be in
Export the partitioned table as CSV to the original bucket (within the old region)
Transfer the table across buckets (from the original bucket to the new one)
Create a new table using the CSV from the new bucket (auto-detect schema is on)
bq --location=eu load --autodetect --source_format=CSV table_test_set.test_table [project ID/test_table]
I expect the table to be partitioned on the DATE column, but instead it's partitioned on the column PARTITIONTIME.
Also a note that I'm currently doing this with CLI commands. This will need to be redone multiple times and so having reusable code is a must.
When I migrate data from one table to another, I follow this process:
I extract the data to GCS (CSV or other format)
I extract the schema of the source table with this command: bq show --schema <dataset>.<table>
I create the destination table via the GUI, using the 'Edit as text' schema option, and paste the schema in. I manually define the partition field that I want to use from that schema;
I load the data from GCS to the destination table.
This process has 2 advantages:
When you import a CSV format, you define the REAL type that you want. Remember, with schema auto-detect, BigQuery looks at about 10 or 20 lines and deduces the schema. String fields are often set as INTEGER because the first lines of the file contain no letters, only numbers (serial numbers, for example).
You can define your partition fields properly
The process is quite easy to script. I use the GUI for creating the destination table, but the bq command line is great for doing the same thing.
After some more digging I managed to find the solution. By using "--time_partitioning_field [column name]" you are able to partition by a specific column. So the command would look like this:
bq --location=eu load --schema [where your JSON schema file is] --time_partitioning_field [column name] --source_format=NEWLINE_DELIMITED_JSON table_test_set.test_table [project ID/test_table]
I also found that using JSON files makes things easier.
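As a hedged alternative for the final load step (the staging table name and date_column are illustrative, and this assumes the partition column is already typed as DATE), the column partitioning can also be expressed in plain DDL:
create table `table_test_set.test_table`
partition by date_column as
select * from `table_test_set.test_table_staging`;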

How to download a big result in BigQuery

I am new to GCP. My mission is to download a query result from the patents dataset, but the result is too large. I cannot download it directly because GCP only supports downloading 16000 lines of data.
I select several columns and the data is already too large:
SELECT country_code, kind_code, application_kind, family_id, publication_date, filing_date, cpc.code as cpc_code, ipc.code as ipc_code
FROM
`patents-public-data.patents.publications` p
cross join unnest(p.cpc) as cpc
cross join unnest(p.ipc) as ipc
I expect to be able to download the result table, or to download it split by country_code into separate tables.
To complement the response of @Christopher, and for achieving your download, here are the steps to perform:
Perform your query
Save result in (temporary) table
Extract table to Google storage bucket
Download file(s) where you want, manually in console or with gsutil tool
Note that there is no limitation on size, but you can end up with more than one file if the result is huge. Take care with the format for nested fields, and prefer gzip compression for a faster download!
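A hedged sketch of those steps in SQL (the destination dataset, table, and bucket names are placeholders; EXPORT DATA is one way to do the extract step):
-- 1 & 2: run the query and save the result into a table
create or replace table `my_project.scratch.patents_result` as
select country_code, kind_code, application_kind, family_id, publication_date, filing_date, cpc.code as cpc_code, ipc.code as ipc_code
from `patents-public-data.patents.publications` p
cross join unnest(p.cpc) as cpc
cross join unnest(p.ipc) as ipc;
-- 3: extract the table to a Cloud Storage bucket as gzipped CSV (this may produce many files)
export data options (
uri = 'gs://my-bucket/patents/result-*.csv.gz',
format = 'CSV',
compression = 'GZIP',
overwrite = true
) as
select * from `my_project.scratch.patents_result`;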
You can write the result to another table, or export the table data to Cloud Storage (take note of the export limitations).