I have a serverless project where I'm trying to export DynamoDB tables into a single CSV and then upload it to S3.
All the npm modules I checked export a single table. Is there a way to export data from multiple tables into one single CSV?
To export as a CSV, adding onto #dixon1e's post, use jq in the shell. With DynamoDB, run:
aws dynamodb scan --table-name my-table --select ALL_ATTRIBUTES --page-size 500 --max-items 100000 --output json | jq -r '.Items' | jq -r '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ].S])[] | @csv' > export.my-table.csv
The AWS CLI can be used to download data from DynamoDB:
aws dynamodb scan --table-name my-table --select ALL_ATTRIBUTES --page-size 500 --max-items 100000
The --page-size option is important; there is a 1 MB limit on every query result.
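To address the original multi-table question, one option is to run the same scan-plus-jq pipeline once per table and concatenate the results into a single file before uploading it. Below is a minimal sketch, assuming all tables share the same flat string (S) attributes so the columns line up; the table and bucket names are hypothetical.

#!/bin/bash
# Hypothetical table and bucket names; adjust to your project.
TABLES="orders customers invoices"
OUT="export.all-tables.csv"
BUCKET="my-export-bucket"

: > "$OUT"
first=1
for t in $TABLES; do
  if [ "$first" -eq 1 ]; then
    # First table: emit the header row plus the data rows.
    # Sorted keys keep the column order stable across tables.
    filter='(.[0] | keys) as $keys | $keys, map([.[ $keys[] ].S])[] | @csv'
    first=0
  else
    # Subsequent tables: data rows only, so the header is not repeated.
    filter='(.[0] | keys) as $keys | map([.[ $keys[] ].S])[] | @csv'
  fi
  aws dynamodb scan --table-name "$t" --select ALL_ATTRIBUTES \
      --page-size 500 --max-items 100000 --output json \
    | jq -r '.Items' | jq -r "$filter" >> "$OUT"
done

aws s3 cp "$OUT" "s3://$BUCKET/$OUT"

If the tables have different attributes, you'd need a per-table column mapping instead of deriving the header from the first item.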
I have a CSV file with around 10 columns: it's the customer table from TPC-H. Its size is around 1.4 GB.
Just downloading the file takes around 6 seconds. However, using S3 Select to pull just the first column takes 20 seconds.
time aws s3 cp s3://tpc-h-csv/customer/customer.tbl.1 .
time aws s3api select-object-content --bucket tpc-h-csv --key customer/customer.tbl.1 --expression "select _1 from s3object" --expression-type 'SQL' --input-serialization '{"CSV": {"FieldDelimiter": "|"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' "output.csv"
I notice that S3 Select is only fast when you have a very selective limit clause. Is this true in general? Can I expect no speedup if I am just pulling a single column?
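For comparison, a more selective query would look like the sketch below; the predicate on column 7 (the TPC-H market segment) is a hypothetical example, but this is the kind of case where the returned data shrinks enough that I'd expect S3 Select to pay off:
time aws s3api select-object-content --bucket tpc-h-csv --key customer/customer.tbl.1 --expression "select s._1 from s3object s where s._7 = 'BUILDING'" --expression-type 'SQL' --input-serialization '{"CSV": {"FieldDelimiter": "|"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' "output_filtered.csv"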
I want to get only the list of tables in a dataset. The bq ls dataset command shows the list of table names along with extra columns: Type, Labels, Time Partitioning, and Clustered Fields.
How can I show only the tableId column?
bq ls <DATASET> | tail -n +3 | tr -s ' ' | cut -d' ' -f2
This works in Cloud Shell and locally on macOS.
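If jq is available, a variant that doesn't depend on column positions is to list in JSON and pull out the table IDs. A sketch, assuming your bq version includes tableReference in its JSON listing:
bq ls --format=json <DATASET> | jq -r '.[].tableReference.tableId'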
I am creating a Parquet file by reading data from Oracle.
Oracle is running in UTC. I confirmed this using:
SELECT DBTIMEZONE FROM DUAL;
Output:
DBTIMEZONE|
----------|
+00:00 |
Reading from JDBC and writing to S3 as parquet:
df = spark.read.format('jdbc').options(
    url=url,
    dbtable=query,
    user=user,
    password=password,
    fetchsize=2000).load()
df.write.parquet(s3_loc, mode="overwrite")
Now, I checked the value of spark.sql.session.timeZone:
print(spark.conf.get("spark.sql.session.timeZone"))
Output:
UTC
Now, I am reading the data back from the S3 location:
df1 = spark.read.parquet(s3_loc)
df1.show()
Output:
+-------------------+
|               col1|
+-------------------+
|2012-11-11 05:00:00|
|2013-11-25 05:00:00|
|2013-11-11 05:00:00|
|2014-12-25 05:00:00|
+-------------------+
col1 is a DATE in Oracle and is converted to a timestamp in the Spark DataFrame.
Why are 5 hours added in the output? The database is running in UTC and spark.sql.session.timeZone is UTC.
Note:
Both RDS and EMR are running in AWS US-EAST-1
On all the Spark nodes, I set TZ=UTC
The timezone is recognized by the JDBC driver, which does not know about Spark's timezone setting but relies on the JVM's default timezone. Moreover, it ignores the remote database session's timezone settings. You said you set TZ=UTC - I'm not sure, but it probably didn't take effect. Check what TimeZone.getDefault tells you.
If, as I suspect, your JVM timezone is EDT (US-EAST-1 is Virginia), then 2012-11-11 00:00:00 read from Oracle over JDBC is interpreted as being in EDT. Displayed in Spark it becomes 2012-11-11 05:00:00 UTC, and this is the result you got.
To fix it, override the JVM default timezone when running spark-submit:
spark-submit \
--conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" \
--conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC" \
...
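If you'd rather not pass these flags on every submit, the same two options can be set cluster-wide in spark-defaults.conf; the path below is the usual EMR location, but treat it as an assumption for your setup (and note EMR ships its own extraJavaOptions defaults that you may need to merge with):
# /etc/spark/conf/spark-defaults.conf
spark.driver.extraJavaOptions    -Duser.timezone=UTC
spark.executor.extraJavaOptions  -Duser.timezone=UTC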
I have around 108 tables in a dataset. I am trying to extract all those tables using the following bash script:
# get list of tables
tables=$(bq ls "$project:$dataset" | awk '{print $1}' | tail -n +3)
# extract into storage
for table in $tables
do
  bq extract --destination_format "NEWLINE_DELIMITED_JSON" --compression "GZIP" "$project:$dataset.$table" "gs://$bucket/$dataset/$table.json.gz"
done
But it seems that bq ls only shows around 50 tables at once, and as a result I cannot extract all of them to Cloud Storage.
Is there any way I can access all 108 tables using the bq ls command?
I tried with the CLI and this command worked for me:
bq ls --max_results 1000 'project_id:dataset'
Note: set --max_results based on your table count.
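Applied to the extraction script from the question, the listing line becomes (reusing the question's variable names):
tables=$(bq ls --max_results 1000 "$project:$dataset" | awk '{print $1}' | tail -n +3)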
The default number of rows that bq ls displays when listing tables is 100. You can change this with the command-line option --max_results or -n.
You can also set the default values for bq in $HOME/.bigqueryrc.
Adding flags to .bigqueryrc
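For example, to make a larger listing the default, the command-specific section for ls could look like the sketch below; the exact section syntax may vary by bq version, so check the linked documentation:
# $HOME/.bigqueryrc
[ls]
--max_results=1000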
This will take all the views and materialized views in your dataset and push each one's SQL into its own file.
You could add another loop to iterate over all datasets too.
#!/bin/bash
## This shell script will pull down every view's SQL in a dataset into its own file
# Set the project ID and dataset name
PROJECT_ID=<YOUR_PROJECT_ID>
DATASET=<YOUR_DATASET>
# for dataset in $(bq ls --format=json | jq -r '.[] | .dataset_id'); do
# Loop over each table in the dataset
for table in $(bq ls --max_results 1000 "$PROJECT_ID:$DATASET" | tail -n +3 | awk '{print $1}'); do
  # Determine the table type and file extension
  if bq show --format=prettyjson "$DATASET.$table" | jq -r '.type' | grep -q -E "MATERIALIZED_VIEW|VIEW"; then
    file_extension=".bqsql"
    # Output the table being processed
    echo "Extracting SQL for $DATASET.$table"
    # Get the view query for the table
    bq show --view --format=prettyjson "$DATASET.$table" | jq -r '.view.query' > "$DATASET-$table$file_extension"
  else
    echo "Ignoring $table"
    continue
  fi
done
# done
I'm using a BigQuery view to fetch yesterday's data from a BigQuery table and then trying to write into a date partitioned table using Dataprep.
My first issue was that Dataprep would not correctly pick up DATE type columns, but converting them to TIMESTAMP works (thanks Elliot).
However, when using Dataprep and setting an output BigQuery table, you only have three options: append, truncate, or drop the existing table. If the table is date-partitioned and you use truncate, it will remove all existing data, not just the data in that partition.
Is there another way to do this that I should be using? My alternative is using Dataprep to overwrite a table and then using Cloud Composer to run some SQL that pushes this data into a date-partitioned table. Ideally, I'd want to do this with Dataprep alone, but that doesn't seem possible right now.
BigQuery table schema and partition details: (screenshots omitted)
The data I'm ingesting is simple. In one flow:
+------------+--------+
| date | name |
+------------+--------+
| 2018-08-08 | Josh1 |
| 2018-08-08 | Josh2 |
+------------+--------+
In the other flow:
+------------+--------+
| date | name |
+------------+--------+
| 2018-08-09 | Josh1 |
| 2018-08-09 | Josh2 |
+------------+--------+
It overwrites the data in both cases.
You can create a partitioned table based on DATE.
Data written to a partitioned table is automatically delivered to the appropriate partition based on the date value (expressed in UTC) in the partitioning column.
Append the data to have the new rows added to the correct partitions.
You can create the table using the bq command:
bq mk --table --expiration [INTEGER1] --schema [SCHEMA] --time_partitioning_field date [DATASET].[TABLE]
The --time_partitioning_field flag defines which field will be used for the partitions.
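For the sample data above, a concrete sketch (with hypothetical dataset and table names) would be:
bq mk --table --schema 'date:DATE,name:STRING' --time_partitioning_field date my_dataset.my_partitioned_table
With the table created this way, writing from Dataprep with the Append option adds each run's rows to the partition matching their date value.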