I have a table in Snowflake that has 24 columns. I also have CSV files in an S3 bucket with a varying number of columns: sometimes a file has 4 columns, sometimes 24, and so on. I also need to map the column names in the CSV files to the column names of the Snowflake table. Is there any way to do this?
You would need to pre-process your CSV file to bring it into a format consistent with your target table.
You can extract the column header and generate a COPY command that maps those columns into your table.
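As a minimal sketch (the stage, file name and column names below are hypothetical), a generated COPY for a file that happens to contain only four of the table's 24 columns could look like this; target columns that are not listed are simply left NULL:

COPY INTO my_table (customer_id, order_date, amount, status)
FROM (
    -- map the positional CSV fields ($1, $2, ...) onto the named target columns
    SELECT $1, $2, $3, $4
    FROM @my_stage/daily_extract.csv
)
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);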
I have a BigQuery table with around 2M rows which were loaded from a JSON file. The JSON file actually has 10 fields, but the table has 7 columns as per the initial DDL. I have now altered the table and added the remaining three columns. After altering, the values in the newly added columns are NULL.
Now I want to backfill the data in the existing 2M rows, but only for those three newly added columns, with the actual data from the JSON file. How can I bulk update the table so that the existing column values remain untouched and only the new column values are updated?
Note: the table has a streaming buffer enabled and is NOT partitioned.
Now I want to backfill the data in the existing 2M rows, but only for those three newly added columns, with the actual data from the JSON file.
Since loading data is free of charge, I'd reload the whole table with WRITE_TRUNCATE option to overwrite the existing data.
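WRITE_TRUNCATE is an option on the load job (for example, bq load --replace on the CLI). As a rough sketch in SQL, assuming a newline-delimited JSON file at a hypothetical GCS path, the full reload could look like this:

-- Overwrites the existing table contents with a fresh load of the source file
LOAD DATA OVERWRITE mydataset.mytable
FROM FILES (
  format = 'JSON',
  uris = ['gs://my-bucket/source_data.json']
);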
What you said confuses me because:
If your 2M rows in the BQ table have the same data as what's in the JSON file, why do you care whether they are touched or not?
If your 2M rows in the BQ table have been altered in some way, how do you expect the rows in the JSON file to match the altered data on a per-row basis (to backfill the missing columns)?
--
Update: based on the comment, it seems that the loaded rows have been altered in some way. Then:
For your existing data, if there is no (logical) primary key for you to use to match the rows, then it is technically impossible to "match and update".
If your existing data does have a logical primary key, and you don't mind the cost, you could load the full JSON file into a temporary table and then use DML to backfill the missing columns (a sketch follows below).
For your future data loading, if you want the loading to be incremental (either in rows or in columns), it would be better to keep your loaded table untouched so that it represents the 'full fact', and keep the 'altered rows' in a separate table, assuming you have a logical primary key to match them.
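A minimal sketch of that backfill, assuming the file has been loaded into a temporary table mydataset.temp_full, and with id and the new_col_* names as hypothetical placeholders for your key and the three new columns:

-- Only the three newly added columns are updated; all other columns stay untouched.
-- Note that rows still in the streaming buffer cannot be modified by DML.
MERGE `mydataset.mytable` AS t
USING `mydataset.temp_full` AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET
    t.new_col_1 = s.new_col_1,
    t.new_col_2 = s.new_col_2,
    t.new_col_3 = s.new_col_3;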
I have created an Impala table as follows:
create table my_schema.my_table stored as textfile as select ...
As per the definition, the table's data is stored in text files somewhere in HDFS. Now when I run an HDFS command such as:
hadoop fs -cat path_to_file | head
I do not see any column names. I suppose Impala stores the column names somewhere else, but since I would like to work with these text files outside of Impala as well, it would be great if the files included the headers.
Is there some option I can set when creating the table to add the headers to the text files? Or do I need to figure out the names by parsing the results of SHOW CREATE TABLE?
We are planning to use Athena as a backend service for our data (stored as Parquet files in partitions) in S3.
One of the things we are interested in finding out is how adding additional columns to the WHERE clause of a query affects the query run time.
For example, we have 10 million records in one Hive partition (partitioned on the column 'date').
All the queries below return the same volume, 10 million rows. Would they all take the same time, or does the run time decrease when we add additional columns to the WHERE clause (since Parquet is a columnar format)?
I tried to test this, but the results were not consistent, as there was some queuing time as well, I guess.
select * from table where date='20200712'
select * from table where date='20200712' and type='XXX'
select * from table where date='20200712' and type='XXX' and subtype='YYY'
Parquet files contain page "indexes" (min/max values and Bloom filters). If you sort the data by the columns in question during insert, for example like this:
insert overwrite table mytable partition (dt)
select col1, --some columns
       type,
       subtype,
       dt
from source_table -- your source table or query here
distribute by dt
sort by type, subtype
then these indexes can work efficiently, because data with the same type and subtype will be loaded into the same pages, and data pages can be selected (or skipped) using the indexes. See some benchmarks here: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
Also switch on predicate pushdown: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html
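For Hive, the settings are along these lines (a sketch based on common Hive properties, not taken from the linked page; check the documentation for the exact names in your CDH version):

-- enable predicate pushdown and let the storage format's indexes filter data
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;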
I am trying to create a parameterized dataset that imports files from GCS and appends them underneath each other. This all works fine (Import Data > Parameterize).
To give a bit of context, each day I store a .csv file with a different name referring to that date.
It so happens that my provider added a new column to the files last month. This means that files before that date have 8 columns, whereas files from that date onward have 9 columns.
However, when I parameterize, Dataprep only takes into account the columns that match (thus 8 columns only). Ideally I would want empty observations for the rows coming from files that did not have this new column.
How can this be achieved?
Datasets with parameters only work on a fixed schema, as mentioned in the documentation:
Avoid creating datasets with parameters where individual files or tables have differing schemas.
This fixed schema is generated using one of the files found during the creation of the dataset with parameters.
If the schema has changed, then you can "refresh" it by editing the dataset with parameters and clicking save. If all the matching files contain 9 columns, you should now see 9 columns in the transformer.
I am following this example:
LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files - TSV example
Summary of the code:
CREATE EXTERNAL TABLE flight_delays_tsv (
yr INT,
quarter INT,
month INT,
...
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/tsv/';
My questions are:
My TSV does not have column names.
Is it OK if I just list the columns as c1, c2, … and declare all of them as STRING?
I do not understand this:
PARTITIONED BY (year STRING)
In the example, the column ‘year’ is not listed among the other columns…
Column names
The column names are defined by the CREATE EXTERNAL TABLE command. I recommend you name them something useful so that it is easier to write queries. The column names do not need to match any names in the actual file. (Athena does not interpret header rows.)
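For example (a hypothetical, shortened table; the bucket path and column choices are placeholders), both of the following read the same headerless TSV, because names are purely positional; the second is just easier to query:

-- works, but queries are hard to read and everything is a string
CREATE EXTERNAL TABLE flights_raw (c1 STRING, c2 STRING, c3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/flights/tsv/';

-- same data, with readable names and real types
CREATE EXTERNAL TABLE flights (yr INT, quarter INT, carrier STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/flights/tsv/';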
Partitioning
From Partitioning Data - Amazon Athena:
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
The field used to partition the data is NOT stored in the files themselves, which is why it does not appear in the main column list of the table definition. Rather, the column value is stored in the name of the directory.
This might seem strange (storing values in a directory name!) but actually makes sense because it avoids situations where an incorrect value is stored in a folder. For example, if there is a year=2018 folder, what happens if a file contains a column where the year is 2017? This is avoided by storing the year in the directory name, such that any files within that directory are assigned the value denoted in the directory name.
Queries can still use WHERE year = 2018 even though year is not stored inside the data files.
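To illustrate (the layout and partition value here are hypothetical), files for 2018 would sit under a year=2018 directory, and the partitions are then registered with the table:

-- hypothetical layout: s3://athena-examples-myregion/flight/tsv/year=2018/data.tsv
-- register every partition directory found under the table location:
MSCK REPAIR TABLE flight_delays_tsv;
-- or add a single partition explicitly:
ALTER TABLE flight_delays_tsv ADD PARTITION (year = '2018')
  LOCATION 's3://athena-examples-myregion/flight/tsv/year=2018/';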
See also: LanguageManual DDL - Apache Hive - Apache Software Foundation
The other neat thing is that data can be updated by simply moving a file to a different directory. In this example, it would change the year value as a result of being in a different directory.
Yes, it's strange, but the trick is to stop thinking of it like a normal database and appreciate the freedom that it offers. For example, appending new data is as simple as dropping a file into a directory. No loading required!