Data source in rrdtool

I am trying to fetch logs and statistics data from a server using rrdtool and a shell script, but I am unable to set the data source.
These are the steps I am running:
rrdtool create latency_db.rrd \
--step 60 \
DS:pl:GAUGE:120:0:100 \
DS:rtt:GAUGE:120:0:10000000 \
RRA:MAX:0.5:1:1500 \

You need to define it like this (note there must be no trailing backslash after the final RRA line, otherwise the shell keeps waiting for a continuation):
rrdtool create latency_db.rrd \
--step 60 \
DS:pl:GAUGE:120:0:100 \
DS:rtt:GAUGE:120:0:10000000 \
RRA:MAX:0.5:1:1500
or, as a single line:
rrdtool create latency_db.rrd --step 60 DS:pl:GAUGE:120:0:100 DS:rtt:GAUGE:120:0:10000000 RRA:MAX:0.5:1:1500
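Once the database exists, samples can be fed into the two data sources with rrdtool update; a minimal sketch (the values are placeholders, "N" means "now", and the value order matches the DS definitions, pl then rtt):
rrdtool update latency_db.rrd N:0:25000
rrdtool update latency_db.rrd N:2:31000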

Related

S3 Select very slow

I have a CSV file of around 10 columns: it's the customers table in TPC-H. Its size is around 1.4 GB.
Just downloading the file takes around 6 seconds. However, using S3 Select to fetch just the first column takes 20 seconds.
time aws s3 cp s3://tpc-h-csv/customer/customer.tbl.1 .
time aws s3api select-object-content --bucket tpc-h-csv --key customer/customer.tbl.1 --expression "select _1 from s3object" --expression-type 'SQL' --input-serialization '{"CSV": {"FieldDelimiter": "|"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' "output.csv"
I notice that S3 Select is only fast when you have a very selective limit clause. Is this generally true? Can I expect no speedup if I am just pulling a whole column?
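For comparison, a sketch of a much more selective query against the same object (the filter value and output file name are only illustrative); because far less data crosses the wire, this is where S3 Select tends to pay off:
time aws s3api select-object-content --bucket tpc-h-csv --key customer/customer.tbl.1 --expression "select * from s3object s where s._1 = '42' limit 10" --expression-type 'SQL' --input-serialization '{"CSV": {"FieldDelimiter": "|"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' "filtered.csv"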

BigQuery CLI show tableId only

I want to get only the list of tables in a dataset. The bq ls dataset command shows the list of table names along with extra columns: Type, Labels, Time Partitioning, and Clustered Fields.
How can I show only the tableId column?
bq ls <DATASET> | tail -n +3 | tr -s ' ' | cut -d' ' -f2
This works in Cloud Shell and locally on macOS.
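If jq is available, a parsing-free alternative is to use the machine-readable listing (a sketch; the field path assumes bq's JSON output exposes tableReference.tableId):
bq ls --format=json <DATASET> | jq -r '.[].tableReference.tableId'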

spark timestamp timezone in JDBC read/write

I am creating a parquet file by reading data from Oracle.
Oracle is running in UTC, which I confirmed using:
SELECT DBTIMEZONE FROM DUAL;
Output:
DBTIMEZONE|
----------|
+00:00 |
Reading from JDBC and writing to S3 as parquet:
df = spark.read.format('jdbc').options(url=url,
                                        dbtable=query,
                                        user=user,
                                        password=password,
                                        fetchsize=2000).load()
df.write.parquet(s3_loc, mode="overwrite")
Now, I checked the value of spark.sql.session.timeZone:
print(spark.conf.get("spark.sql.session.timeZone"))
Output:
UTC
Now, I am reading the data back from the S3 location:
df1 = spark.read.parquet(s3_loc)
df1.show()
Output:
+-------------------+
|               col1|
+-------------------+
|2012-11-11 05:00:00|
|2013-11-25 05:00:00|
|2013-11-11 05:00:00|
|2014-12-25 05:00:00|
+-------------------+
col1 is a DATE in Oracle and is converted to a timestamp in the Spark DataFrame.
Why are 5 hours added in the output? The database is running in UTC and spark.sql.session.timeZone is UTC.
Note:
Both RDS and EMR are running in AWS US-EAST-1.
On all the Spark nodes, I set TZ=UTC.
The timezone is recognized by the JDBC driver, which does not know about Spark's timezone setting but relies on the JVM's default timezone. Moreover, it ignores the remote database session's timezone settings. You said you set TZ=UTC - I'm not sure that it took effect; check what TimeZone.getDefault() tells you.
If, as I suspect, your JVM timezone is EDT (US-EAST-1 is Virginia), then 2012-11-11 00:00:00 read from Oracle over JDBC is interpreted as EDT. Displayed in Spark it becomes 2012-11-11 05:00:00 UTC, which is the result you got.
To fix it, override the JVM default timezone when running spark-submit:
spark-submit \
--conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" \
--conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC" \
...
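To confirm what the JVM actually defaults to on a node, a quick check (a sketch; relies on OpenJDK's -XshowSettings flag):
# Print the JVM's system properties and look for user.timezone;
# if it is empty or absent, the JVM falls back to the node's OS timezone.
java -XshowSettings:properties -version 2>&1 | grep -i timezone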

List all the tables in a dataset in bigquery using bq CLI and store them to google cloud storage

I have around 108 tables in a dataset. I am trying to extract all those tables using the following bash script:
# get list of tables
tables=$(bq ls "$project:$dataset" | awk '{print $1}' | tail -n +3)
# extract into storage
for table in $tables
do
  bq extract --destination_format "NEWLINE_DELIMITED_JSON" --compression "GZIP" "$project:$dataset.$table" "gs://$bucket/$dataset/$table.json.gz"
done
But it seems that bq ls only shows around 50 tables at once, and as a result I cannot extract all of them to Cloud Storage.
Is there any way I can list all 108 tables with the bq ls command?
I tried with the CLI and this command worked for me:
bq ls --max_results 1000 'project_id:dataset'
Note: set --max_results to a number based on your table count.
By default, bq ls limits the number of tables it lists, which is why you only see around 50. You can raise the limit with the command-line option --max_results or -n.
You can also set default values for bq in $HOME/.bigqueryrc; see "Adding flags to .bigqueryrc".
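A minimal sketch of such a $HOME/.bigqueryrc, assuming the documented layout of global flags at the top and per-command flags under a bracketed command section:
--project_id=my-project

[ls]
--max_results=1000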
This will take all the views and materialized views in your dataset and write each one's SQL into its own file.
You could add another loop to iterate over all datasets too; a sketch of that outer loop follows the script.
#!/bin/bash
## This shell script pulls down every view's SQL in a dataset into its own file

# Set the project ID and dataset name
PROJECT_ID=<YOUR_PROJECT>
DATASET=<YOUR_DATASET>

# for dataset in $(bq ls --format=json | jq -r '.[] | .dataset_id'); do

# Loop over each table in the dataset
for table in $(bq ls --max_results 1000 "$PROJECT_ID:$DATASET" | tail -n +3 | awk '{print $1}'); do
  # Only process views and materialized views
  if bq show --format=prettyjson "$DATASET.$table" | jq -r '.type' | grep -q -E "MATERIALIZED_VIEW|VIEW"; then
    file_extension="bqsql"
    # Output the table being processed
    echo "Extracting SQL for $DATASET.$table"
    # Get the view definition and write it to its own file
    bq show --view --format=prettyjson "$DATASET.$table" | jq -r '.view.query' > "$DATASET-$table.$file_extension"
  else
    echo "Ignoring $table"
    continue
  fi
done
# done
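A hedged sketch of that outer loop over every dataset in the project (the jq field path assumes bq's JSON listing exposes datasetReference.datasetId; adjust if your bq version names it differently):
for DATASET in $(bq ls --format=json --max_results 1000 | jq -r '.[].datasetReference.datasetId'); do
  echo "Processing dataset: $DATASET"   # run the per-table loop from the script above here
done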

DynamoDB export to CSV

I have a serverless project, and I'm trying to export DynamoDB tables into a single CSV and then upload it to S3.
All the npm modules I checked export a single table. Is there a way to export data from multiple tables into one CSV?
To export as a CSV, adding onto @dixon1e's post, use jq in the shell. With DynamoDB, run:
aws dynamodb scan --table-name my-table --select ALL_ATTRIBUTES --page-size 500 --max-items 100000 --output json \
  | jq -r '.Items' \
  | jq -r '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ].S])[] | @csv' > export.my-table.csv
The AWS CLI can be used to download data from DynamoDB:
aws dynamodb scan --table-name my-table --select ALL_ATTRIBUTES --page-size 500 --max-items 100000
The --page-size option is important: there is a 1 MB limit on every scan result page.
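To get several tables into one CSV, as the question asks, one option is to loop the same scan-and-jq pipeline over each table and append everything to a single file; a sketch (the table names and output path are placeholders, and the columns only line up if the tables share the same string attributes):
for t in table-a table-b; do
  aws dynamodb scan --table-name "$t" --select ALL_ATTRIBUTES --page-size 500 --max-items 100000 --output json \
    | jq -r '.Items | (.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ].S])[] | @csv'
done > export.combined.csv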