We have thousands of files stored in S3. These files are exposed to Athena so that we can query them. While debugging, I found that Athena shows multiple blank lines when I query on a specific id. Given that there are thousands of files, I am not sure where that data is coming from.
Is there a way that I can see the source file for the respective rows in the Athena result?
There is a hidden column exposed by the Presto Hive connector: "$path"
This column exposes the path of the file a particular row has been read from.
Note: the column name is actually $path, but you need to double-quote it in SQL, i.e. write "$path". This is because $ is otherwise illegal in an identifier.
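As a rough sketch of how the hidden column could be used to track down the offending files, the Python helper below builds a query that counts matching rows per source file. The function name, the table name my_table and the column name id are placeholders of mine, not anything from the question; the generated SQL relies only on the "$path" column described above.

```python
def build_source_file_query(table: str, id_column: str, id_value: str) -> str:
    """Build an Athena query that counts matching rows per source file."""
    # Escape single quotes so the value stays a valid SQL string literal.
    escaped = id_value.replace("'", "''")
    return (
        'SELECT "$path" AS source_file, COUNT(*) AS row_count\n'
        f"FROM {table}\n"
        f"WHERE {id_column} = '{escaped}'\n"
        'GROUP BY "$path"\n'
        "ORDER BY row_count DESC"
    )


if __name__ == "__main__":
    # Hypothetical table and id value, for illustration only.
    print(build_source_file_query("my_table", "id", "12345"))
```

Paste the printed query into the Athena console; the files at the top of the result are the ones contributing the most rows for that id.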
Related
I have a BigQuery table and I want to update the content of a few of its rows from a reference CSV file. This CSV file is uploaded to a Google Cloud Storage bucket.
When you use an external table backed by Cloud Storage, you can only read the CSV files, not update them.
However, you can load your CSV into a BigQuery native table, perform the update with DML, and then export the table to CSV. That only works if you have only one CSV.
If you have several CSV files, you can at least select the pseudo column _FILE_NAME to identify the files where you need to make the change. But the change will have to be made manually or with the previous solution (native table).
I have created two partitions in an S3 bucket and am loading a CSV file into each folder. I then run the Glue crawler on these files, which registers them as a table in the Glue catalog that I am able to query via Athena.
Partition-1: Loading csv file in s3, csv file has 5 columns
Partition-2: Loading csv file in s3, csv file has same 5 columns as above, but in different order compared to (1)
When I run the crawler the first time on (1), it creates the Glue table/schema. Later, when I upload the same data with a different column order to a different partition as (2) and run the crawler, it just tries to map the second file to the schema already created for (1), which results in data issues.
Is the order of columns important in Glue? Does the crawler not identify the columns automatically by name, instead of expecting (2) to be in the same order as (1)?
Order is important in CSV files. Any change makes it think that the schema is different. However, if you use Parquet files, the column order does not have to match.
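If converting to Parquet is not an option, one possible workaround is to rewrite the second file so its columns follow the order of the first before uploading it. Below is a minimal sketch using only the Python standard library. It assumes the CSV files carry a header row; the function name and the sample column names are made up for illustration.

```python
import csv
import io
from typing import List


def reorder_csv_columns(source_text: str, target_order: List[str]) -> str:
    """Rewrite CSV text so its columns follow target_order (matched by header name)."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=target_order, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Pick the values by name, so the input order no longer matters.
        writer.writerow({name: row[name] for name in target_order})
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical data: same columns as partition 1, different order.
    partition_2 = "city,id,name\nOslo,1,Ann\nRome,2,Bob\n"
    print(reorder_csv_columns(partition_2, ["id", "name", "city"]))
```

A missing column raises a KeyError, which is probably what you want here: it is better to fail loudly than to load misaligned data.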
I have a huge amount of log data exported from StackDriver to Google Cloud Storage. I am trying to run queries using BigQuery.
However, while creating the table in BigQuery Dataset I am getting
Invalid field name "k8s-app".
Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
Table: bq_table
A huge amount of log data is exported from StackDriver sinks which contains a large number of unique column names. Some of these names aren't valid as per BigQuery tables.
What is the solution for this? Is there a way to query the log data without cleaning it? Using temporary tables or something else?
Note: I do not want to load(put) my data into BigQuery Storage, just to query data which is present in Google Cloud Storage.
* EDIT *
Please refer to this documentation for clear understanding
I think you can take any of these routes, depending on your application:
A. Ignore Header
If the problematic field is in the header row of your logs, you can choose to ignore the header row by adding the --skip_leading_rows=1 parameter in your import command. Something like:
bq --location=US load --source_format=YOURFORMAT --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING,..'
B. Load Raw Data
If the above is not applicable, then just simply load the data in its un-structured raw format into BigQuery. Once your data is in there, you can go about doing all sorts of stuff.
So, first create a table with a single column:
bq mk --table mydataset.rawlogstable 'data:STRING'
Now load your dataset in the table providing appropriate location:
bq --location=US load --replace --source_format=YOURFORMAT mydataset.rawlogstable gs://mybucket/path/* 'data:STRING'
Once your data is loaded, now you can process it using SQL queries, and split it based on your delimiter and skip the stuff you don't like.
C. Create External Table
If you do not want to load data into BigQuery but still want to query it, you can choose to create an external table in BigQuery:
bq --location=US mk --external_table_definition=data:STRING@CSV=gs://mybucket/path/* mydataset.rawlogstable
Querying Data
If you pick option A and it works for you, you can simply choose to query your data the way you were already doing.
In the case you pick B or C, your table now has rows from your dataset as singular column rows. You can now choose to split these singular column rows into multiple column rows, based on your delimiter requirements.
Let's say your rows should have 3 columns named a,b and c:
a1,b1,c1
a2,b2,c2
Right now it's all in the form of a single column named data, which you can split on the delimiter ,:
select
splitted[safe_offset(0)] as a,
splitted[safe_offset(1)] as b,
splitted[safe_offset(2)] as c
from (select split(data, ',') as splitted from `mydataset.rawlogstable`)
Hope it helps.
To expand on @khan's answer:
If the files are JSON, then you won't be able to use the first method (skip headers).
But you can load each JSON row raw to BigQuery - as if it was a CSV - and then parse in BigQuery
Find a full example for loading rows raw at:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And then you can use JSON_EXTRACT_SCALAR to parse JSON in BigQuery - and transform the existing field names into BigQuery compatible ones.
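If you would rather fix the names before the data reaches BigQuery, the renaming can also be done in a small preprocessing step. The sketch below is only an illustration, not part of the answer above: it applies the rule quoted in the error message (letters, numbers and underscores, starting with a letter or underscore, at most 128 characters) to every key of a JSON log record. The function names and the sample record are hypothetical.

```python
import json
import re
from typing import Any

_INVALID_CHARS = re.compile(r"[^A-Za-z0-9_]")
_MAX_LENGTH = 128


def to_bigquery_field_name(name: str) -> str:
    """Map an arbitrary key to a name that satisfies the rule in the error message."""
    cleaned = _INVALID_CHARS.sub("_", name)
    # A name must start with a letter or underscore.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned[:_MAX_LENGTH]


def rename_keys(value: Any) -> Any:
    """Recursively rename the keys of nested dicts and lists."""
    if isinstance(value, dict):
        return {to_bigquery_field_name(k): rename_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(item) for item in value]
    return value


if __name__ == "__main__":
    # Hypothetical log line with the field name from the question.
    line = '{"labels": {"k8s-app": "kube-dns"}}'
    print(json.dumps(rename_keys(json.loads(line))))
```

Be aware that two different keys can collapse into the same name (for example a-b and a_b), in which case the later one silently wins in this sketch.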
Unfortunately no!
As part of log analytics, it is common to reshape the log data and run a few ETL jobs before the files are committed to a persistent sink such as BigQuery.
If performance monitoring is all you need from log analytics, and there is no rationale for writing additional ETL code, all metrics can be derived from the REST API endpoints of Stackdriver Monitoring.
If you do not need the fields containing -, you can set ignore_unknown_values. You have to provide the schema you want, and with ignore_unknown_values any field not matching the schema will be ignored.
I am baffled: I cannot figure out how to export a successfully run CREATE TABLE statement to a single CSV.
The query "saves" the result of my CREATE TABLE command in an appropriately named S3 bucket, partitioned into 60 (!) files. Alas, these files are not readable text files.
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid AS
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
How can I save this table to S3, as a single file, CSV format, without having to download and re-upload it?
If you want the result of a CTAS query to be written into a single file, you need to bucket by one of the columns in your resulting table. To get the resulting files in CSV format, you need to specify the table's format and field delimiter properties.
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid
WITH (
format = 'TEXTFILE',
field_delimiter = ',',
external_location = 's3://my_athena_results/ctas_query_result_bucketed/',
bucketed_by = ARRAY['__SOME_COLUMN__'],
bucket_count = 1)
AS (
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
);
Athena is a distributed system, and it will scale the execution of your query by some unobservable mechanism. Note that even if you explicitly specify a bucket count of one, you might still get multiple files [1].
See the Athena documentation for more information on its syntax and what can be specified within the WITH clause. Also, don't forget about the considerations and limitations for CTAS queries, e.g. the external_location for storing CTAS query results in Amazon S3 must be empty.
Update 2019-08-13
Apparently, the results of CTAS statements are compressed with the GZIP algorithm by default. I couldn't find in the documentation how to change this behavior. So all you need to do is uncompress the files after downloading them locally. NOTE: the uncompressed files won't have a .csv file extension, but you will still be able to open them with text editors.
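For the local uncompress step, something along these lines should do. It is a small sketch with the Python standard library; the function name and the file names are placeholders, and it assumes you have already downloaded the GZIP-compressed part file from S3.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_to_csv(gz_path: str, csv_path: str) -> Path:
    """Decompress a GZIP file and write the content to csv_path."""
    with gzip.open(gz_path, "rb") as src, open(csv_path, "wb") as dst:
        # Stream the data so large files are not held in memory.
        shutil.copyfileobj(src, dst)
    return Path(csv_path)


if __name__ == "__main__":
    # Hypothetical file names; replace with the part file you downloaded.
    print(gunzip_to_csv("ctas_part.gz", "ctas_part.csv"))
```

Writing to a path that ends in .csv also takes care of the missing file extension.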
Update 2019-08-14
You won't be able to preserve column names inside the files if you save them in CSV format. Instead, they are recorded in the AWS Glue metadata catalog, together with other information about the newly created table.
If you want to preserve column names in the output files after executing CTAS queries, then you should consider file formats which inherently do that, e.g. JSON, Parquet etc. You can do that by using format property within WITH clause. Choice of file format really depends on a use case and size of data. Go with JSON if your files are relatively small and you want to download and be able to read their content virtually from anywhere. If files are big and you are planning to keep them on S3 and query with Athena, then go with Parquet.
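If you do stay with CSV, you can put the column names back yourself after downloading. The following is a minimal sketch, assuming the parts are already uncompressed text, the column names are known (for example copied from the Glue catalog), and no field contains an embedded newline. The function name and sample values are hypothetical.

```python
from typing import Iterable, List


def combine_parts_with_header(parts: Iterable[str], columns: List[str]) -> str:
    """Join headerless CSV parts into one CSV text with a header row."""
    header = ",".join(columns)
    body_lines: List[str] = []
    for part in parts:
        # Drop empty lines so a trailing newline in one part leaves no gap.
        body_lines.extend(line for line in part.splitlines() if line)
    return "\n".join([header] + body_lines) + "\n"


if __name__ == "__main__":
    # Hypothetical content of two downloaded part files.
    parts = ["1,aaid\n2,aaid\n", "3,idfa\n"]
    print(combine_parts_with_header(parts, ["id", "source"]))
```

This also covers the case where you end up with more than one output file despite bucket_count = 1.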
Athena stores query results in Amazon S3.
The results file is stored automatically in CSV format (*.csv), so results can be exported to a CSV file without a CREATE TABLE statement (https://docs.aws.amazon.com/athena/latest/ug/querying.html)
Execute the Athena query using the StartQueryExecution API; the results .csv can be found at the output location specified in the API call.
(https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html)
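Here is a rough sketch of that flow in Python. The helper takes the Athena client as a parameter, so the function itself needs nothing beyond the standard library; in practice you would pass boto3.client("athena"), which assumes boto3 is installed and credentials are configured. The function name, database name and bucket path are placeholders. As far as I know, GetQueryExecution reports the full path of the result file once the query has succeeded, and the sketch relies on that.

```python
import time
from typing import Any


def run_query_to_csv(athena: Any, sql: str, database: str,
                     output_location: str, poll_seconds: float = 1.0) -> str:
    """Run a query and return the S3 path of its CSV result file."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        response = athena.get_query_execution(QueryExecutionId=query_id)
        execution = response["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            # Assumed to be the path of the .csv written by Athena.
            return execution["ResultConfiguration"]["OutputLocation"]
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
```

Called as run_query_to_csv(boto3.client("athena"), "SELECT ...", "my_database", "s3://my-bucket/results/"), it blocks until the query finishes and returns where the CSV landed, so nothing has to be downloaded and re-uploaded.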
I'm new to AWS and Hive, and I'm trying to use Hive to analyze Google Ngrams data. I tried to save a table as tab-delimited CSV in an S3 bucket, but now I don't know how to view it or download it to see if my job executed correctly.
The query I used to create the table was
CREATE EXTERNAL TABLE test_table2 (
gram string,
year int,
occurrences bigint,
pages bigint,
books bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://mybucket/sub-bucket/test-table2.txt';
I then filled the table with data:
INSERT OVERWRITE TABLE test_table2
SELECT
gram,
year,
occurrences,
pages,
books
FROM
eng1m_5grams_normed
WHERE
gram = 'early bird gets the worm';
The query ran fine, and I think everything worked correctly. However, when I navigate to my bucket in the S3 Management Console online, the text file appears as a folder containing a bunch of files. These files have long hexadecimal character names and are 0 bytes big.
Is this just the text file represented as a directory? Is there a way I can view or download the file to see if my query worked? I tried to make the directory public so I could download it, but the download button in the "Actions" dropdown menu is still greyed out.
In Hive/S3, think of S3 directories as tables. The files contained in those directories are the contents of those tables (i.e. rows). The reason you have multiple files in the directory is that multiple reducers are writing the "table".
S3 Browser is a very nice tool for working with S3.
What happened is that very few rows may have qualified against the predicate in the WHERE clause, so very few (or no) rows were selected and emitted into the output (hence the zero-sized files). EMR doesn't give a simple way to download the result of a query.