I'm using Ganglia for monitoring. Ganglia stores its metrics data as RRD files on the gmetad host. The default path is usually /var/lib/ganglia/rrds/<cluster-name>/<node-name>/, where each metric is stored in its own RRD file, e.g. bytes_in.rrd.
Is there any way to use this RRD data in InfluxDB?
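A rough sketch of one possible approach, assuming the python-rrdtool bindings and the InfluxDB 1.x Python client are installed (the file path, measurement name, host tag, and database name below are placeholders), is to read each RRD file with fetch and write the samples to InfluxDB:

# Rough sketch: read one Ganglia RRD file and write its samples to InfluxDB 1.x.
# Placeholders: file path, measurement name, host tag, and database name.
import rrdtool
from influxdb import InfluxDBClient

rrd_path = "/var/lib/ganglia/rrds/my-cluster/node01/bytes_in.rrd"  # placeholder

# fetch() returns ((start, end, step), data-source names, rows of values)
(start, end, step), names, rows = rrdtool.fetch(rrd_path, "AVERAGE")

points = []
ts = start
for row in rows:
    for name, value in zip(names, row):
        if value is not None:  # unknown samples come back as None
            points.append({
                "measurement": "bytes_in",       # placeholder
                "tags": {"host": "node01"},      # placeholder
                "time": ts,
                "fields": {name: float(value)},
            })
    ts += step

client = InfluxDBClient(host="localhost", port=8086, database="ganglia")  # placeholders
client.write_points(points, time_precision="s")

The same loop can then be repeated over every .rrd file under the cluster/node directories.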
The business logic is as follows:
The user uploads a CSV file;
The application converts the CSV file into a database table;
In the future, the user can run SQL on the table to generate a BI report.
Currently, the solution is to save each table in MySQL. But as time goes on, the MySQL database accumulates thousands of tables.
I want to find a file format that represents a table, can be put in an object store such as AWS S3, and can then be queried with SQL.
For example:
Datasource ds = new Datasource("s3://xxx/bbb/t1.tbl");
ResultSet rs = ds.runSQL("select c1, c2 from t1 where c3=8");
What are your ideas or solutions?
Amazon S3 can run an SQL query against a single CSV file. It uses a capability called S3 Select.
From Filtering and retrieving data using Amazon S3 Select - Amazon Simple Storage Service:
With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.
You can make an API call to S3 to perform the SQL query and retrieve the results. No database required. Just pay for the storage used by the CSV files (which can be gzipped to save space), plus $0.002 per GB scanned and $0.0007 per GB returned.
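For reference, a minimal sketch of that API call from Python with boto3; the bucket, key, and column names are placeholders:

# Minimal S3 Select sketch with boto3; bucket, key, and column names are placeholders.
import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="my-bucket",                           # placeholder
    Key="bbb/t1.csv.gz",                          # placeholder (gzipped CSV)
    ExpressionType="SQL",
    Expression="SELECT s.c1, s.c2 FROM s3object s WHERE s.c3 = '8'",
    InputSerialization={
        "CSV": {"FileHeaderInfo": "USE"},         # first row holds the column names
        "CompressionType": "GZIP",
    },
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry the result payload.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")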
You can store the file as CSV in S3 and use S3 Select as mentioned in the other answer. Or you can store it as CSV or Parquet (a much more performant format) and run queries against it using AWS Athena.
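If you go the Athena route, the query can also be driven from code. A rough sketch with boto3, assuming the table already exists over the S3 data; the database, table, and output location are placeholders:

# Rough Athena sketch with boto3; database, table, and output location are placeholders.
import time
import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString="SELECT c1, c2 FROM t1 WHERE c3 = 8",
    QueryExecutionContext={"Database": "my_database"},                          # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},   # placeholder
)
query_id = query["QueryExecutionId"]

# Athena is asynchronous, so poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])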
I want to load data from a Parquet file in Google Cloud Storage into BigQuery. While loading, I also want to add a couple of extra columns that are not present in the source file, such as insert_time_stamp and source_file_name. After doing some research, I found these options:
Create a temporary table linked to the file in GCS and then load the data from that temporary table, along with the additional columns, into the final BigQuery table.
Load the data from the Parquet file into a pandas DataFrame, add the two extra columns, and then use pandas.DataFrame.to_gbq or client.load_table_from_dataframe to load the data into the BigQuery table.
Load the data from the Parquet file into a staging table (by this I mean a normal table) in BigQuery, then populate the final table with "insert into final_table select *, current_timestamp as insert_time_stamp, <file_name> as source_file_name from staging_table", and finally drop the staging table.
If the source file contains millions of rows, which approach would be best?
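A rough sketch of the staging-table approach (option 3) with the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders, and the final table is assumed to already exist with the two extra columns at the end of its schema:

# Rough sketch of option 3: Parquet -> staging table -> final table with extra columns.
# Placeholders: file URI, project, dataset, and table names.
from google.cloud import bigquery

client = bigquery.Client()

uri = "gs://my-bucket/data/file1.parquet"               # placeholder
staging_table = "my-project.my_dataset.staging_table"   # placeholder
final_table = "my-project.my_dataset.final_table"       # placeholder

# 1) Load the Parquet file into the staging table.
load_job = client.load_table_from_uri(
    uri,
    staging_table,
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
)
load_job.result()

# 2) Populate the final table, adding the two extra columns.
client.query(f"""
    INSERT INTO `{final_table}`
    SELECT *, CURRENT_TIMESTAMP() AS insert_time_stamp, '{uri}' AS source_file_name
    FROM `{staging_table}`
""").result()

# 3) Drop the staging table.
client.delete_table(staging_table)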
I have a BigQuery table and I want to update the content of a few rows in it from a reference CSV file. The CSV file is uploaded to a Google Cloud Storage bucket.
When you use an external table over Cloud Storage, you can only read the CSV files, not update them.
However, you can load your CSV into a BigQuery native table, perform the update with DML, and then export the table back to CSV. That only works if you have a single CSV file.
If you have several CSV files, you can at least select the pseudo-column _FILE_NAME to identify the files where you need to perform the change. But the change itself will have to be performed manually or with the previous solution (a native table).
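A rough sketch of that native-table round trip with the google-cloud-bigquery Python client; the bucket paths, table name, and the UPDATE statement itself are placeholders:

# Rough sketch: load the CSV into a native table, update it with DML, export it back.
# Placeholders: bucket paths, table name, and the UPDATE statement.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.csv_native_table"       # placeholder

# 1) Load the CSV from Cloud Storage into a native table.
client.load_table_from_uri(
    "gs://my-bucket/reference.csv",                        # placeholder
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",                # replace any previous load
    ),
).result()

# 2) Perform the update with DML (placeholder predicate and columns).
client.query(f"UPDATE `{table_id}` SET status = 'fixed' WHERE id IN (1, 2, 3)").result()

# 3) Export the table back to CSV in Cloud Storage.
client.extract_table(table_id, "gs://my-bucket/export/updated-*.csv").result()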
Using Sqoop, I've successfully imported a few rows from a table that has a BLOB column. Now the part-m-00000 file contains all the records, with the BLOB field included in the CSV output.
Questions:
1) As per the documentation, knowledge of the Sqoop-specific format can help in reading those BLOB records. So, what does "Sqoop-specific format" mean?
2) The BLOB is basically a .gz file of a text file containing some float data. These .gz files are stored in the Oracle DB as BLOBs and imported into HDFS using Sqoop. How can I get that float data back from the HDFS files?
Any sample code would be of great use.
I see these options.
Sqoop import from Oracle directly into a Hive table with a binary data type. This option may limit your processing options outside Hive, like MapReduce, Pig, etc., i.e. you need to know how the BLOB is stored in Hive as binary. This is the same limitation you described in your question 1.
Sqoop import from Oracle into Avro, SequenceFile, or ORC file formats, which can hold binary data. You should be able to read these by creating a Hive external table on top of them, and you can write a Hive UDF to decompress the binary data. This option is more flexible, as the data can also be processed easily with MapReduce, especially in the Avro and SequenceFile formats.
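For illustration, a rough sketch of reading such an Avro import outside Hive and gunzipping the BLOB column, assuming the fastavro package and a hypothetical column name blob_col:

# Rough sketch: read a Sqoop Avro import and gunzip the BLOB column.
# Assumptions: fastavro is installed; the file name and the column name blob_col are hypothetical.
import gzip
from fastavro import reader

with open("part-m-00000.avro", "rb") as f:
    for record in reader(f):
        blob = record["blob_col"]                  # bytes of the gzipped text file
        text = gzip.decompress(blob).decode("utf-8")
        floats = [float(x) for x in text.split()]  # the float data from the original text file
        print(floats[:5])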
Hope this helps. How did you resolve it?
I have set up MRTG with rrdtool. Now I'm planning to extract incoming/outgoing usage data from these RRD files, but I'm failing to find the correct way to do it.
Can anyone show me how to get that usage data from the RRD files?
Then I can maintain a database to keep the usage data, calculate the cost, etc.
You can use rrdtool graph ... PRINT:xxx or, better, rrdtool xport ... to get data out of the RRD file. If you want to get at the actual stored data, use rrdtool fetch.
You can find plenty of additional info at http://rrdtool.org
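For example, a small sketch using the python-rrdtool bindings to fetch the last day of averaged values; the file path is a placeholder:

# Small sketch: pull the last day of averaged values out of an RRD file.
# Assumes the python-rrdtool bindings; the file path is a placeholder.
import rrdtool

(start, end, step), names, rows = rrdtool.fetch(
    "/path/to/target.rrd",       # placeholder
    "AVERAGE",
    "--start", "-1d",
    "--end", "now",
)

print("data sources:", names)    # e.g. the incoming/outgoing traffic data sources
ts = start
for row in rows:
    if any(v is not None for v in row):
        print(ts, dict(zip(names, row)))
    ts += step

From there the rows can be inserted into whatever database you use for the cost calculations.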