How to read an RRD file to get usage data? - rrdtool

I have set up MRTG with rrdtool. Now I'm planning to get the incoming and outgoing usage data from these RRD files, but I'm failing to find the right way to do it.
Can anyone show me how to get that usage data out of the RRD files?
I could then maintain a database to keep the usage data and calculate the cost, etc.

You can use rrdtool graph ... PRINT:xxx, or rather rrdtool xport ..., to get data out of the RRD file. If you want to get at the actual data, use rrdtool fetch.
You can find tons of additional info on http://rrdtool.org
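For example, here is a minimal Python sketch of pulling the averaged samples out of an MRTG-style RRD with rrdtool fetch and turning them into (timestamp, in, out) tuples ready for a database insert. The file name is a placeholder, and it assumes the RRD holds the usual two MRTG data sources (incoming and outgoing), so adjust both to your setup:

import math
import subprocess

# Ask rrdtool for the AVERAGE consolidated values of the last day.
# "myrouter.rrd" is a placeholder for the RRD file MRTG writes for your target.
out = subprocess.run(
    ["rrdtool", "fetch", "myrouter.rrd", "AVERAGE", "--start", "-1d", "--end", "now"],
    capture_output=True, text=True, check=True,
).stdout

rows = []
for line in out.splitlines():
    if ":" not in line:          # skip the header line with the DS names and any blank lines
        continue
    ts, values = line.split(":", 1)
    ds_in, ds_out = (float(v) for v in values.split()[:2])   # MRTG RRDs normally hold two DS: in and out
    if math.isnan(ds_in) or math.isnan(ds_out):
        continue                 # unknown samples come back as nan
    rows.append((int(ts), ds_in, ds_out))

# rows now holds (unix_timestamp, in_per_sec, out_per_sec) tuples (MRTG stores rates,
# i.e. per-second averages) that you can insert into your own database for cost calculations.
print(rows[:5])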

Related

How to access AWS public dataset using Databricks?

For one of my classes, I have to analyze a "big data" dataset. I found the following dataset on the AWS Registry of Open Data that seems interesting:
https://registry.opendata.aws/openaq/
How exactly can I create a connection and load this dataset into Databricks? I've tried the following:
df = spark.read.format("text").load("s3://openaq-fetches/")
However, I receive the following error:
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
Also, it seems that this dataset has multiple folders. How do I access a particular folder in Databricks, and if possible, can I focus on a particular time range? Let's say, from 2016 to 2020?
Ultimately, I would like to perform various SQL queries in order to analyze the dataset and perhaps create some visualizations as well. Thank you in advance.
If you browse the bucket, you'll see that there are multiple datasets there, in different formats, that will require different access methods. So you need to point to the specific folder (and maybe its subfolders) to load data. For example, to load the daily dataset you need to use the CSV format:
df = spark.read.format("csv").option("inferSchema", "true")\
.option("header", "false").load("s3://openaq-fetches/daily/")
To load only a subset of the data you can use path filters, for example; see the Spark documentation on loading data.
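As a rough sketch, you could narrow the read to 2016-2020 with a filename glob. This assumes the objects under daily/ are named by date (e.g. 2016-01-01.csv) and that you're on Spark 3.0+ for the pathGlobFilter option, so check the bucket listing first and adjust the pattern if the layout differs:

# Hadoop-style globs also work directly in the load() path, e.g. .load("s3://openaq-fetches/daily/{201[6-9],2020}*")
df_2016_2020 = (spark.read.format("csv")
    .option("header", "false")
    .option("pathGlobFilter", "{201[6-9],2020}*")   # keep only files whose names start with 2016..2020
    .load("s3://openaq-fetches/daily/"))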
P.S. inferSchema isn't very optimal from a performance standpoint, so it's better to explicitly provide a schema when reading.
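Along the same lines, a sketch of passing an explicit schema instead of using inferSchema; the column names and types below are purely illustrative, so take the real ones from the dataset's documentation or from a one-off inferred read:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Illustrative schema only - replace the fields with the dataset's actual columns.
openaq_schema = StructType([
    StructField("location", StringType(), True),
    StructField("parameter", StringType(), True),
    StructField("value", DoubleType(), True),
    StructField("date_utc", TimestampType(), True),
])

df = (spark.read.format("csv")
    .option("header", "false")
    .schema(openaq_schema)    # no extra inferSchema pass over the data
    .load("s3://openaq-fetches/daily/"))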

Creating a BigQuery Transfer Service with a more complex regex

I have a bucket that stores files based on their transaction time in a filename structure like
gs://my-bucket/YYYY/MM/DD/[autogeneratedID].parquet
Let's assume this structure dates back to 2015/01/01.
Some of the files might arrive late, so in theory a new file could be written to the 2020/07/27 structure tomorrow.
I now want to create a BigQuery table that loads all files with a transaction date of 2019-07-01 and newer.
My current strategy is to slice the past into small enough chunks to just run batch loads, e.g. by month. Then I want to create a transfer service that listens for all new files coming in.
I cannot just point it to gs://my-bucket/* as this would try to load the data prior to 2019-07-01.
So basically I was thinking about encoding the "future-looking" file name structures into a suitable regex, but it seems like the wildcard names https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames only allow a very limited syntax, which is not as flexible as awk regex, for instance.
I know there are streaming inserts into BQ, but I'm still hoping to avoid that extra complexity and just set up a smart transfer configuration to follow the batch load.
You can use scheduled queries with an external table. When you query your external table, you can use the pseudo column _FILE_NAME in the WHERE clause of your query and filter on it.
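As a rough sketch of the idea, here is the kind of query a scheduled query would wrap, run through the Python client. The project, dataset and table names are hypothetical, and the lexicographic comparison on _FILE_NAME works here because the gs://my-bucket/YYYY/MM/DD/ layout sorts chronologically:

from google.cloud import bigquery

client = bigquery.Client()

# ext_transactions is assumed to be a GCS-backed external table over gs://my-bucket/*.parquet
sql = """
SELECT *
FROM `my-project.my_dataset.ext_transactions`
WHERE _FILE_NAME >= 'gs://my-bucket/2019/07/01/'
"""

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my-project.my_dataset.transactions"),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # a scheduled query would append the new slices
)
job = client.query(sql, job_config=job_config)
job.result()  # wait for the query to finish

A scheduled query running the same SELECT over a suitable _FILE_NAME window can then keep picking up late-arriving files without touching anything before 2019-07-01.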

Amazon Athena with big gzip JSON files?

I'm taking my first steps with Amazon Athena and I don't know why I'm not getting the expected results.
I'm dealing with big JSON files, gzip-compressed and stored in S3, and I cannot get results for even a simple count query.
Right now I'm testing with two files, each with about 10 GB of compressed JSON.
When I test the table with LIMIT 10, I get results, so the table is created and working, but when I run any other query, even with a simple WHERE, the query never ends; I had to stop it after 30 minutes without a response.
I've read about data partitioning and I know big files are not the best way to store data in S3 if you want to use Athena.
Despite this, I've been searching around the internet and found tests where people query over big files (70-80 GB) and get results in about 10 seconds.
Athena seems very easy to use, but there must be something I'm doing wrong in addition to the unpartitioned data.
Could you give me any tips, or is there no solution for this situation?
Thank you

BigQuery Python client dropping some rows using the streaming API

I have around a million data items being inserted into BigQuery using the streaming API (the BigQuery Python client's insert_row function), but there is some data loss: ~10,000 data items are lost while inserting. Is there a chance BigQuery might be dropping some of the data? There aren't any insertion errors (or any errors whatsoever, for that matter).
I would recommend filing a private issue in the Issue Tracker so that the BigQuery engineers can look into this. Make sure to provide the affected project, the source of the data, the code you're using to stream into BigQuery, and the client library version.
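Before filing, it may also be worth double-checking how errors are detected on your side: the Python client's streaming calls report per-row problems in their return value rather than by raising, so silent loss can simply be an unchecked return. A minimal sketch (the table ID and row contents are hypothetical), shown here with insert_rows_json:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"   # hypothetical table

rows = [{"id": i, "payload": f"item-{i}"} for i in range(500)]

# insert_rows_json returns a list of per-row error mappings; an empty list means
# every row in this batch was accepted by the streaming backend.
errors = client.insert_rows_json(table_id, rows)
if errors:
    for err in errors:
        print("row", err["index"], "failed:", err["errors"])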

Django database; how to download huge data in CSV format

I have set up my database in Django and it holds a huge amount of data. The task is to download all the data at once in CSV format. The problem I am facing is that when the data size (in number of table rows) is up to 2,000, I am able to download it, but when the number of rows goes above 5K, it throws a "Gateway timeout" error. How do I handle this issue? There is no table indexing as of now.
Also, when there are 2K rows, it takes around 18 seconds to download. So how can this be optimized?
First, make sure the code that is generating the CSV is as optimized as possible.
Next, the gateway timeout is coming from your front-end proxy, so simply increase the timeout there.
However, this is a temporary reprieve - as your data set grows, this timeout will be exhausted and you'll keep getting these errors.
The permanent solution is to trigger a separate process to generate the CSV in the background, and then download it once it's finished. You can do this by using Celery or RQ, which are both ways to queue tasks for execution (and then collect the results at a later time).
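A rough sketch of that pattern with Celery follows; the model, field and path names are made up for illustration, and a real setup also needs a configured broker plus a way to hand the finished file back (a download view, or an upload to object storage):

import csv

from celery import shared_task

from myapp.models import Record   # hypothetical model


@shared_task
def export_records_csv(output_path="/tmp/records_export.csv"):
    """Write the whole table to a CSV file outside the request/response cycle."""
    with open(output_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "name", "created_at"])   # illustrative columns
        # iterator() keeps memory usage flat by not caching the whole queryset
        for rec in Record.objects.values_list("id", "name", "created_at").iterator():
            writer.writerow(rec)
    return output_path

# In a view: kick off the export and return immediately, e.g. export_records_csv.delay(),
# then let the user fetch the file (or a signed URL) once the task reports it is done.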
If you are currently using HttpResponse from django.http then you could try using StreamingHttpResponse instead.
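That keeps the connection alive by sending rows as they are produced; a sketch along the lines of the pattern in the Django docs (the model and column names are hypothetical):

import csv

from django.http import StreamingHttpResponse

from myapp.models import Record   # hypothetical model


class Echo:
    """Pseudo-buffer whose write() just returns the value, so csv.writer yields each row."""
    def write(self, value):
        return value


def export_csv(request):
    writer = csv.writer(Echo())
    rows = Record.objects.values_list("id", "name", "created_at").iterator()
    response = StreamingHttpResponse(
        (writer.writerow(row) for row in rows),
        content_type="text/csv",
    )
    response["Content-Disposition"] = 'attachment; filename="records.csv"'
    return response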
Failing that, you could try querying the database directly. For example, if you use the MySQL database backend, these answers might help you:
dump-a-mysql-database-to-a-plaintext-csv-backup-from-the-command-line
As for the speed of the transaction, you could experiment with other database backends. However, if you need to do this often enough for the speed to be a major issue then there may be something else in the larger process which should be optimized instead.