I have a CSV file with around 10 columns: it's the customer table from TPC-H, roughly 1.4 GB in size.
Just downloading the file takes around 6 seconds. However, using S3 Select to pull just the first column takes 20 seconds.
time aws s3 cp s3://tpc-h-csv/customer/customer.tbl.1 .
time aws s3api select-object-content --bucket tpc-h-csv --key customer/customer.tbl.1 --expression "select _1 from s3object" --expression-type 'SQL' --input-serialization '{"CSV": {"FieldDelimiter": "|"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' "output.csv"
I notice that S3 Select is only fast when you have a very selective LIMIT clause. Is this generally true? Can I expect no speedup if I am just pulling a single column?
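For reference, the kind of query that does come back quickly for me is the same projection with a very selective LIMIT, e.g. (same serialization settings as above):
time aws s3api select-object-content --bucket tpc-h-csv --key customer/customer.tbl.1 --expression "select _1 from s3object limit 100" --expression-type 'SQL' --input-serialization '{"CSV": {"FieldDelimiter": "|"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' "output.csv"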
My Amazon Athena table contains a column 'closed_date':
closed_date
2002-05-12
2003-03-26
Now I need to find the number of days since the account was closed.
I am trying the following:
select
extract (current_date - closed_date) as day
from athena_table
Ideally, it should return (2021-07-27) - (2002-05-12) = 7,016 Days
You can use the DATE_DIFF(unit, timestamp1, timestamp2) function. It returns timestamp2 - timestamp1, so the closed date goes first:
SELECT
DATE_DIFF('day', closed_date, CURRENT_DATE) as days
FROM athena_table
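As a quick sanity check with the literal dates from the question (no table needed, since Athena accepts DATE literals):
SELECT DATE_DIFF('day', DATE '2002-05-12', DATE '2021-07-27') AS days
which should return 7016.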
I am trying to load a CSV file in S3 into Redshift using the COPY command from a Lambda function. The problem is I have more columns in the CSV than in the Redshift table,
so whenever I trigger the Lambda function I get the error "Extra columns found".
How do I load only specific columns from the CSV?
My CSV file is of the form:
year, month, description, category, SKU, sales(month)
and my Redshift table is of the form:
year month description category SKU
-----------------------------------
My COPY command is as follows:
COPY public.sales
FROM 's3://mybucket/sales.csv'
iam_role 'arn:aws:iam::99999999999:role/RedShiftRole'
delimiter ','
ignoreheader 1
acceptinvchars
You can specify the list of columns to import into your table - see the COPY command documentation for more details.
COPY public.sales (year, month, description, category, SKU)
FROM 's3://mybucket/sales.csv'
iam_role 'arn:aws:iam::99999999999:role/RedShiftRole'
delimiter ','
ignoreheader 1
acceptinvchars
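If you're running the COPY from Lambda, one low-friction way to issue it is the Redshift Data API; a rough sketch from the CLI (cluster identifier, database and user below are placeholders, and the same call can be made from the Lambda via the redshift-data client):
aws redshift-data execute-statement \
  --cluster-identifier my-cluster \
  --database dev \
  --db-user awsuser \
  --sql "COPY public.sales (year, month, description, category, SKU) FROM 's3://mybucket/sales.csv' iam_role 'arn:aws:iam::99999999999:role/RedShiftRole' delimiter ',' ignoreheader 1 acceptinvchars"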
I have around 108 tables in a dataset. I am trying to extract all those tables using the following bash script:
# get list of tables
tables=$(bq ls "$project:$dataset" | awk '{print $1}' | tail -n +3)
# extract into storage
for table in $tables
do
bq extract --destination_format "NEWLINE_DELIMITED_JSON" --compression "GZIP" "$project:$dataset.$table" "gs://$bucket/$dataset/$table.json.gz"
done
But it seems that bq ls only shows around 50 tables at once, and as a result I cannot extract all of them to Cloud Storage.
Is there any way I can access all 108 tables using the bq ls command?
I tried with the CLI and this command worked for me:
bq ls --max_results 1000 'project_id:dataset'
Note: set --max_results to a number based on how many tables you have.
The default number of tables that bq ls will display is 100. You can change this with the command-line option --max_results or -n.
You can also set the default values for bq in $HOME/.bigqueryrc.
Adding flags to .bigqueryrc
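For example, a ~/.bigqueryrc along these lines (the project ID is a placeholder; command-specific flags go under a section named after the command) should make the larger listing size the default:
--project_id=my-project

[ls]
--max_results=1000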
This will take all the views and materialized views in your dataset and write each one's SQL into its own file.
You could add another loop to iterate over all datasets too (see the commented-out loop in the script).
#!/bin/bash
## THIS shell script will pull down every view's SQL in a dataset into its own file
# Set the project ID and dataset name
PROJECT_ID=<YOUR_PROJECT_ID>
DATASET=<YOUR_DATASET>
# for dataset in $(bq ls --format=json | jq -r '.[] | .dataset_id'); do
# Loop over each table in the dataset
for table in $(bq ls --max_results 1000 "$PROJECT_ID:$DATASET" | tail -n +3 | awk '{print $1}'); do
  # Determine the table type; only views and materialized views are extracted
  if bq show --format=prettyjson "$PROJECT_ID:$DATASET.$table" | jq -r '.type' | grep -q -E "MATERIALIZED_VIEW|VIEW"; then
    file_extension="bqsql"
    # Output the table being processed
    echo "Extracting SQL for $DATASET.$table"
    # Get the view's SQL definition and write it to its own file
    bq show --view --format=prettyjson "$PROJECT_ID:$DATASET.$table" | jq -r '.view.query' > "$DATASET-$table.$file_extension"
  else
    echo "Ignoring $table"
    continue
  fi
done
# done
I set up a new log table in Athena on an S3 bucket laid out as shown below, where Athena is pointed at the top of BucketName/.
I had a well-functioning Athena setup based on the same data but without the subdirectory structure listed below. Now, with this new subdirectory structure, the data displays properly when I do select * from table_name limit 100, but when I do something like a count by week the query hangs.
The data in S3 doesn't exceed 100 GB of gzipped files, but the query was hanging for more than 20 minutes and reported 6.5 TB scanned, which sounds like it was looping over and scanning the same data repeatedly. My guess is that it has to do with this directory structure, but from what I've seen in other threads Athena should be able to parse through the subdirectories just by being pointed at the base folder BucketName/.
BucketName
|
|
|---Year(2016)
| |
| |---Month(11)
| | |
| | |---Daily File Format YYYY-MM-DD-Data000.gz
Any advice would be appreciated!
Create Table DDL
CREATE EXTERNAL TABLE test_table(
  foo1 string,
  foo2 string,
  foo3 string,
  `date` string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  MAP KEYS TERMINATED BY '\u0003'
WITH SERDEPROPERTIES (
  'collection.delim'='\u0002')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://Listen_Data/2018/01'
TBLPROPERTIES (
  'has_encrypted_data'='false')
Fixed by adding
PARTITIONED BY (
`year` string,
`month` string)
after the schema definition in the DDL statement.
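One thing to watch with this layout: because the prefixes are plain Year/Month rather than Hive-style year=2016/month=11 folders, MSCK REPAIR TABLE won't discover the partitions, so each one likely has to be registered explicitly, roughly like this (bucket and dates taken from the question):
ALTER TABLE test_table ADD PARTITION (year='2016', month='11') LOCATION 's3://BucketName/2016/11/';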
I have a serverless project and I'm trying to export DynamoDB tables into a single CSV, then upload it to S3.
All the npm modules I checked export a single table. Is there a way to export data from multiple tables into one CSV?
To export as a CSV, adding onto @dixon1e's post, use jq in the shell. With DynamoDB run:
aws dynamodb scan --table-name my-table --select ALL_ATTRIBUTES --page-size 500 --max-items 100000 --output json | jq -r '.Items' | jq -r '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ].S])[] | @csv' > export.my-table.csv
The AWS CLI can be used to download data from DynamoDB:
aws dynamodb scan --table-name my-table --select ALL_ATTRIBUTES --page-size 500 --max-items 100000
The --page-size parameter is important; there is a 1 MB limit on each scan/query response.
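To get several tables into one file, a simple option is to loop over the table names and append, reusing the same scan/jq pipeline. A rough sketch, assuming all tables share the same string attributes (the table names below are placeholders):
#!/bin/bash
# Sketch: scan several tables and append them to one CSV.
# Assumes every table has the same string attributes, so the header row
# from the first table applies to the rest.
tables="table-a table-b table-c"   # placeholder table names
out="export.all-tables.csv"
first=1
> "$out"
for t in $tables; do
  aws dynamodb scan --table-name "$t" --select ALL_ATTRIBUTES --page-size 500 --max-items 100000 --output json \
    | jq -r '.Items' \
    | jq -r '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ].S])[] | @csv' \
    | if [ $first -eq 1 ]; then cat; else tail -n +2; fi >> "$out"   # drop repeated headers
  first=0
done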