Merge multiple RRDs over time - rrdtool

I've got an old RRD file that was only set up to track 1 year of history. I decided more history would be nice. I did rrdtool resize, and the RRD is now bigger. I've got old backups of this RRD file and I'd like to merge the old data in so that the up-to-date RRD also has the historical data.
I've tried the rrd contrib "merged-rrd.py" but it gives:
$ python merged-rrd.py ../temperature-2010-12-06.rrd ../temperature-2011-05-24.rrd merged1.rrd
merging old:../temperature-2010-12-06.rrd to new:../temperature-2011-05-24.rrd. creating merged rrd: merged1.rrd
Traceback (most recent call last):
File "merged-rrd.py", line 149, in <module>
mergeRRD(old_path, new_path, mer_path)
File "merged-rrd.py", line 77, in mergeRRD
odict = getXmlDict(oxml)
File "merged-rrd.py", line 52, in getXmlDict
cf = line.split()[1]
IndexError: list index out of range
Also tried "rrd_merger.pl":
$ perl rrd_merger.pl --oldrrd=../temperature-2010-12-06.rrd --newrrd=../temperature-2011-05-24.rrd --mergedrrd=merged1.rrd
Dumping ../temperature-2010-12-06.rrd to XML: /tmp/temperature-2010-12-06.rrd_old_8615.xml
Dumping ../temperature-2011-05-24.rrd to XML: /tmp/temperature-2011-05-24.rrd_new_8615.xml
Parsing ../temperature-2010-12-06.rrd XML......parsing completed
Parsing ../temperature-2011-05-24.rrd XML...
Last Update: 1306217100
Start processing Round Robin DB
Can't call method "text" on an undefined value at rrd_merger.pl line 61.
at rrd_merger.pl line 286
at rrd_merger.pl line 286
Is there a tool to combine or merge RRDs that works?

I ended up putting together a really simple script, based on the existing Python script, that works well enough for my case.
http://gist.github.com/2166343

That fixed rrdtool-merge.pl for me:
< my $xff = $new_rra->first_child( 'xff' )->text;
---
> my $xff = $new_rra->first_child_text( 'xff' );
From the XML::Twig documentation:
first_child_text ($optional_condition)
Return the text of the first child of the element, or the first child matching the $optional_condition. If there is no first_child then returns ''. This avoids getting the child, checking for its existence and then getting the text, for trivial cases.

The rrdmerge.pl utility, included with Routers2 in the /extras directory, can do this. Collect the latest version of Routers2 from http://www.steveshipway.org/software/rrd/
This is a utility I wrote for the purpose of merging multiple archived MRTG RRD files, which sounds exactly like the situation you are describing.
This is probably too late for the OP but will hopefully be useful to people who come here later. It can merge any RRD files, even ones with different DSs, RRAs or intervals; it can generate XML or RRD output, and it will pick the best available data from the component RRD files to build the output.
Example:
rrdmerge.pl --rrd --libpath $RRDLIBDIR --output /tmp/merge.rrd --rows 12000 $FROMDIR/file.rrd $ARCHIVE/*.rrd

Looking at the XML file generated by rrdtool, there is a simple logic error in the Perl script. The <cf> and <pdp_per_row> elements are simple enough, but the <xff> tag is contained within a <params> tag, with the value as its text:
<cf> AVERAGE </cf>
<pdp_per_row> 1 </pdp_per_row> <!-- 300 seconds -->
<params>
<xff> 5.0000000000e-01 </xff>
</params>
The parsing just needs to be tweaked a bit; once it is working, the fix should be fed back here (where it is easy to Google) and also to the script's author.
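For anyone patching their own merge script, here is a minimal Python sketch (not taken from either contrib script; the file name is a placeholder) that reads cf, pdp_per_row and xff from an rrdtool dump, allowing for xff sitting either directly under <rra> or inside <params>:
import xml.etree.ElementTree as ET

# Parse an `rrdtool dump` XML file; "temperature.xml" is a placeholder name.
tree = ET.parse("temperature.xml")
root = tree.getroot()

for rra in root.findall("rra"):
    cf = rra.findtext("cf", default="").strip()
    pdp_per_row = rra.findtext("pdp_per_row", default="").strip()
    # Newer dumps nest <xff> inside <params>; older ones put it under <rra>.
    xff = (rra.findtext("params/xff") or rra.findtext("xff") or "").strip()
    print(cf, pdp_per_row, xff)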

Related

What will happen if the power goes down while we are inserting into a database?

I was recently asked a question in an interview; I'm hoping someone can help me figure it out.
Suppose we have 100 files, and a process reads each file, parses it, and writes the data into a database.
Now let's say the process was at file number 60 when the power went off. How would you design the system so that, when the power comes back, the process resumes writing data into the database where it left off before the shutdown?
This would be one way:
Loop over:
Pick up a file
Check it hasn't been processed with a query to the database.
Process the file
Update the database
Update the database with a log of the file processed
Commit
Move the file out of the non-processed queue
You can also log the file entry to some other persistent resource.
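A minimal sketch of that loop in Python, using SQLite and a processed_files table purely for illustration (the table and directory names are made up):
import sqlite3
from pathlib import Path

def process_file(path, conn):
    # Hypothetical parser: one INSERT per line of the file.
    for line in path.read_text().splitlines():
        conn.execute("INSERT INTO records (payload) VALUES (?)", (line,))

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (payload TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS processed_files (name TEXT PRIMARY KEY)")

for path in sorted(Path("incoming").glob("*.txt")):
    already_done = conn.execute(
        "SELECT 1 FROM processed_files WHERE name = ?", (path.name,)
    ).fetchone()
    if already_done:                  # handled before the power cut
        continue
    process_file(path, conn)          # write the parsed data
    conn.execute("INSERT INTO processed_files (name) VALUES (?)", (path.name,))
    conn.commit()                     # data and log entry become durable together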
Q: What if there are many files? Doesn't writing to the log slow down the process?
A: Probably not much, it's just one entry into the database per file. It's the cost of resilience.
Q: What if the files are so small it's almost only updating one row per file?
A: Make your update query idempotent. Don't log, but ensure that files are removed from the queue once the transaction is complete.
Q: What if there are many lines in a file. Do you really want to restart with the first line of a file?
A: Depends on the cost/benefit. You could split the file into smaller ones prior to processing each sub-file. If the power out happens all the time, then that's a good compromise. If it happens very rarely, the extra work by the system may not be worth it.
Q: What if there is a mix of small and large files?
A: Put the files into separate queues that handle them accordingly.
The UPS idea by @TimBiegeleisen is very good, though:
Well actually it is about that, because unplugging a database in the middle of a lengthy transaction might result in corrupted data. – Tim Biegeleisen Feb 22 '20 at 10:24
I've experienced the failure of one such unit myself, so you'll need two.
I think you must:
Store a reference to the file somewhere (an ID or the index of the processed file; it really depends on the case).
Define the boundaries of a single transaction. Let it be the full processing of one file: read the file, parse it, store the data in the database, and update the reference to the file you processed. If all of that succeeds, you can commit the transaction to the database.
Your main task, which processes all the files, should look at the reference table and, based on its state, fetch the next file.
In this case you create a transaction around the processing of a single file. If anything goes wrong there, you can always rerun the processing job and it will start where it left off.
Please be aware that this is a very simple example; in most scenarios you want to keep transactions as thin as possible.
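As a rough illustration of those transaction boundaries (again SQLite, with made-up table and directory names), the commit only happens after both the data and the file reference are written, so a crash in the middle of a file simply rolls back to the last committed file:
import sqlite3
from pathlib import Path

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (payload TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS file_refs (name TEXT PRIMARY KEY)")

def handle(path):
    try:
        for line in path.read_text().splitlines():
            conn.execute("INSERT INTO records (payload) VALUES (?)", (line,))
        conn.execute("INSERT INTO file_refs (name) VALUES (?)", (path.name,))
        conn.commit()                 # one transaction per file
    except Exception:
        conn.rollback()               # partial work for this file disappears
        raise

for path in sorted(Path("incoming").glob("*.txt")):
    if not conn.execute("SELECT 1 FROM file_refs WHERE name = ?", (path.name,)).fetchone():
        handle(path)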

Google DataPrep - Apparently Limited Table Size

I'm trying to prepare SEO data from Screaming Frog, Majestic and Ahrefs, and join it before importing it into BigQuery for analysis.
The Majestic and Ahrefs CSV files import after some pruning to get them under the 100MB limit.
The Screaming Frog CSV file, however, doesn't fully load, displaying only approx 37,000 of 193,000 rows. By further pruning less important columns in Excel and reducing the file size (from 44MB to 39MB), the number of rows loaded increases slightly. This would indicate to me that it's not an errant character or cell.
I've made sure (by resaving via a text editor) that the CSV file is saved as UTF-8, and I've checked the limitations of Dataprep to see if there is a limit on the number of cells per Flow/Wrangle, but I can find nothing.
The Majestic and Ahrefs files are larger and load completely with no issue. There is no data corruption in the Screaming Frog file. Is there something common I'm missing?
Is the total limit for all files 100MB?
Any advice or insight would be appreciated.
To get the full transformation of your files, you should run the recipe.
What you see in the Dataprep Transformer page is a head sample.
You can take a look at how the sampling works here.

Consolidate file write and read together

I am writing a Python script to write data to a Vertica DB, using the official library vertica_db_client. If I use the built-in cur.executemany method, for some reason it takes a long time to complete (40+ seconds per 1k entries). The recommendation I got was to first save the data to a file and then use the COPY method. Here is the save-to-a-csv-file part:
import csv

with open('/data/dscp.csv', 'w') as out:
    csv_out = csv.writer(out)
    # header row
    csv_out.writerow(("time_stamp", "subscriber", "ip_address", "remote_address",
                      "signature_service_name", "dscp_out", "bytes_in", "bytes_out"))
    for row in data:
        csv_out.writerow(row)
My data is a list of tuples. examples are like:
[
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452),
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.21', 'SIP', 26, 4270, 5212),
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.129', '18.215.140.51', 'HTTP2 over TLS', 0, 14378, 5291)
]
Then, in order to use the COPY method, I have to (at least based on their instructions: https://www.vertica.com/docs/9.1.x/HTML/python_client/loadingdata_copystdin.html) read the file first and then do "COPY FROM STDIN". Here is my code:
f = open("/data/dscp.csv", "r")
cur.stdin = f
cur.execute("""COPY pason.dscp FROM STDIN DELIMITER ','""")
Here is the code for connecting to the DB, in case it is relevant to the problem:
import vertica_db_client
user = 'dbadmin'
pwd = 'xxx'
database = 'xxx'
host = 'xxx'
db = vertica_db_client.connect(database=database, user=user, password=pwd, host=host)
cur = db.cursor()
So clearly it is a waste of effort to first save and then read... What is the best way to consolidate the writing and reading parts?
If anyone can tell me why my executemany was so slow, that would be equally helpful!
Thanks!
First of all, yes, it is both the recommended way and the most efficient way to write the data to a file first. It may seem inefficient at first, but writing the data to a file on disk will take next to no time at all, but Vertica is not optimized for many individual INSERT statements. Bulk loading is the fastest way to get large amounts of data into Vertica. Not only that, but when you do many individual INSERT statements, you could potentially run into ROS pushback issues, and even if you don't there will be extra load on the database when the ROS containers are merged after the load.
You could convert your list of tuples to one large string variable and then print the string to the console.
The string would look something like:
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.21', 'SIP', 26, 4270, 5212
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.129', '18.215.140.51', 'HTTP2 over TLS', 0, 14378, 5291
But instead of actually printing it to the console, you could just pipe it into a VSQL command.
$ python my_script.py | vsql -U dbadmin -d xxx -h xxx -c "COPY pason.dscp FROM STDIN DELIMITER ','"
This may not be efficient, though; I don't have much experience with exceedingly long string variables in Python.
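A short sketch of what that script might look like, assuming the data list from the question (no quoting is done, so it assumes no field contains a comma; otherwise add quoting plus an ENCLOSED BY clause to the COPY statement):
# my_script.py (sketch): emit the tuples in `data` as comma-separated lines
# on stdout so they can be piped into vsql's COPY ... FROM STDIN as above.
import sys

for row in data:
    sys.stdout.write(",".join(str(value) for value in row) + "\n")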
Secondly, vertica_db_client is no longer being actively developed by Vertica. While it will still be supported at least until the Python 2 end of life, you should be using vertica_python.
You can install vertica_python with pip.
$ pip install vertica_python
or
$ pip3 install vertica_python
depending on which version of Python you want to use it with.
You can also build from source; the code can be found on Vertica's GitHub page: https://github.com/vertica/vertica-python/
As for using the COPY command with vertica_python, see the answer in this question here: Import Data to SQL using Python
I have used several Python libraries to connect to Vertica, and vertica_python is by far my favorite; ever since Vertica took over its development from Uber it has continued to improve on a very regular basis.
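For completeness, here is a rough sketch of the same load done through vertica_python; cursor.copy() accepts a file-like object (or a string), though it is worth double-checking the library's README for the exact behaviour of the version you install:
import vertica_python

# Connection details and the pason.dscp table are taken from the question.
conn_info = {'host': 'xxx', 'user': 'dbadmin', 'password': 'xxx', 'database': 'xxx'}

connection = vertica_python.connect(**conn_info)
cur = connection.cursor()
with open('/data/dscp.csv', 'r') as f:
    # Streams the file to Vertica as the STDIN side of the COPY statement.
    cur.copy("COPY pason.dscp FROM STDIN DELIMITER ','", f)
connection.commit()
connection.close()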

Scikit Learn - Working with datasets

I've read through some Stack Overflow questions and could not find what I was looking for; at least, I didn't think so from the various posts I read.
I have some training data set up as described here.
So I am using sklearn.datasets.load_files to read it, as that was a perfect match for the setup.
BUT my files are already TSV bag-of-words files (i.e., each line is a word and its frequency count, separated by a tab).
To be honest, I am not sure how to proceed. The data pulled in by load_files is set up as a list where each element is the contents of a file, including the newline characters. I am not even 100% sure how the Bunch data type tracks which files belong to which classifier folder.
I have worked with scikit-learn and TSVs before, but that was a single TSV file holding all the data, so I read it in with pandas and then used numpy.array to fetch what I needed, which is one of the things I attempted here. I am not sure how to do that with multiple files where the classifier is the folder name; in the single TSV file I worked with before, each line was an individual piece of training data.
Some help on getting the data into a format that is usable for training classifiers would be appreciated.
You could loop over the files and read them to create a list of dictionaries, where each dictionary contains the features and frequencies of one document. Assume the file 1.txt:
import codecs
from sklearn.feature_extraction import DictVectorizer

corpus = []
# make a loop over the files here and repeat the following
f = codecs.open("1.txt", encoding='utf8').read().splitlines()
# each line is "word<TAB>count"; convert the count to a number
corpus.append({line.split("\t")[0]: float(line.split("\t")[1]) for line in f})
# exit the loop here

vec = DictVectorizer()
X = vec.fit_transform(corpus)
You can find more about DictVectorizer here.
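Since the question mentions that the class label is the folder name, here is a slightly fuller sketch (the data/ root, the folder layout and the UTF-8 encoding are assumptions) that builds both the feature dictionaries and the label list without going through load_files:
import os
from sklearn.feature_extraction import DictVectorizer

# Walk a load_files-style layout: data/<class_name>/<file>, where each file
# is a TSV of "word<TAB>count" lines. Keep the folder name as the label.
root = "data"
corpus, labels = [], []

for class_name in sorted(os.listdir(root)):
    class_dir = os.path.join(root, class_name)
    if not os.path.isdir(class_dir):
        continue
    for file_name in sorted(os.listdir(class_dir)):
        counts = {}
        with open(os.path.join(class_dir, file_name), encoding="utf8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                word, count = line.rstrip("\n").split("\t")
                counts[word] = float(count)
        corpus.append(counts)
        labels.append(class_name)

vec = DictVectorizer()
X = vec.fit_transform(corpus)   # feature matrix; labels is the matching y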

C++ inserting a line into a file at a specific line number

I want to be able to read from an unsorted source text file (one record in each line), and insert the line/record into a destination text file by specifying the line number where it should be inserted.
Where to insert the line/record into the destination file will be determined by comparing the incoming line from the incoming file to the already ordered list in the destination file. (The destination file will start as an empty file and the data will be sorted and inserted into it one line at a time as the program iterates over the incoming file lines.)
Incoming File Example:
1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data
Desired Destination File Example:
2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data
I could do this by performing the sort in memory via a linked list or similar, but I want to allow this to scale to very large files. (And I am having fun trying to solve this problem as I am a C++ newbie :).)
One way to do this may be to open two file streams with fstream (one in and one out, or just a single in/out stream), but then I run into the difficulty that it's hard to find a given line, because file positions seem to depend on the absolute byte position from the start of the file rather than on line numbers :).
I'm sure problems like this have been tackled before, and I would appreciate advice on how to proceed in a manner that is good practice.
I'm using Visual Studio 2008 Pro C++, and I'm just learning C++.
The basic problem is that under common OSs, files are just streams of bytes. There is no concept of lines at the filesystem level. Those semantics have to be added as an additional layer on top of the OS-provided facilities. Although I have never used it, I believe VMS has a record-oriented filesystem that would make what you want to do easier. But under Linux or Windows, you can't insert into the middle of a file without rewriting the rest of the file. It is similar to memory: at the lowest level, it's just a sequence of bytes, and if you want something more complex, like a linked list, it has to be added on top.
If the file is just a plain text file, then I'm afraid the only way to find a particular numbered line is to walk the file counting lines as you go.
The usual 'non-memory' way of doing what you're trying to do is to copy the file from the original to a temporary file, inserting the data at the right point, and then do a rename/replace of the original file.
Obviously, once you've done your insertion, you can copy the rest of the file in one big lump, because you don't care about counting lines any more.
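To keep it short the sketch below is in Python rather than C++, but the copy-to-temporary-file-and-rename idea is exactly the same; the file name and the 1-based line number are just example values:
import os

def insert_line(path, line_number, new_line):
    # Stream the original into a temp file, inserting before `line_number`,
    # then replace the original with the temp file.
    tmp_path = path + ".tmp"
    inserted = False
    with open(path, "r") as src, open(tmp_path, "w") as dst:
        for current, line in enumerate(src, start=1):
            if current == line_number:
                dst.write(new_line + "\n")
                inserted = True
            dst.write(line)
        if not inserted:              # line number was past the end of file
            dst.write(new_line + "\n")
    os.replace(tmp_path, path)        # rename/replace step

insert_line("destination.txt", 2, "1 10/01/2008 line1data")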
A [distinctly-no-c++] solution would be to use the *nix sort tool, sorting on the second column of data. It might look something like this:
cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>
It's not exactly in-place, and it fails the request of using C++, but it does work :)
You might be tempted to do:
cat <file> | sort -k 2,2 > <file>
but don't: the shell truncates <file> for the redirection before sort has finished reading it, so you would lose data. Stick with the temporary-file version above, or use sort -k 2,2 -o <file> <file>, which is documented to handle writing back to an input file safely.
* http://www.ss64.com/bash/sort.html - sort man page
One way to do this is not to keep the file sorted, but to use a separate index, using Berkeley DB (BerkeleyDB). Each record in the db has the sort keys and the offset into the main file. The advantage of this is that you can have multiple ways of sorting without duplicating the text file. You can also change lines without rewriting the file, by appending the changed line at the end and updating the index to ignore the old line and point to the new one. We used this successfully for multi-GB text files that we had to make many small changes to.
Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files under source/IO.
Try a modified bucket sort. Assuming the ID values lend themselves well to it, you'll get a much more efficient sorting algorithm. You may be able to enhance I/O efficiency by actually writing out the buckets (use small ones) as you scan, thus potentially reducing the amount of randomized file I/O you need. Or not.
Hopefully, there are some good code examples on how to insert a record based on line number into the destination file.
You can't insert content into the middle of a file (i.e., not without overwriting what was previously there); I'm not aware of any production-level filesystem that supports it.
I think the question is more about implementation than about specific algorithms; specifically, about handling very large datasets.
Suppose the source file has 2^32 lines of data. What would be an efficient way to sort the data?
Here's how I'd do it:
Parse the source file and extract the following information: sort key, offset of line in file, length of line. This information is written to another file. This produces a dataset of fixed size elements that is easy to index, call it the index file.
Use a modified merge sort. Recursively divide the index file until the number of elements to sort reaches some minimum: a true merge sort recurses down to 1 or 0 elements, but I suggest stopping at 1024 or so; this will need fine tuning. Load each block of data from the index file into memory, quicksort it, and then write the data back to disk.
Perform the merge on the index file. This is tricky, but can be done like this: load a block of data from each source (1024 entries, say). Merge into a temporary output file and write. When a block is emptied, refill it. When no more source data is found, read the temporary file from the start and overwrite the two parts being merged - they should be adjacent. Obviously, the final merge doesn't need to copy the data (or even create a temporary file). Thinking about this step, it is probably possible to set up a naming convention for the merged index files so that the data doesn't need to overwrite the unmerged data (if you see what I mean).
Read the sorted index file and pull out from the source file the line of data and write to the result file.
It certainly won't be quick with all that file reading and writing, but it should be quite efficient; the real killer is the random seeking into the source file in the final step. Up to that point the disk access is mostly linear, so it should be reasonably efficient.
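Again in Python rather than C++ for brevity, here is a rough sketch of that index-and-merge approach; the choice of the second whitespace-separated column as the key and the tiny chunk size are assumptions to keep the example small:
import heapq
import os
import tempfile

CHUNK = 1024   # index entries sorted in memory at a time (needs tuning)

def write_chunk(entries):
    # Write one sorted chunk of (key, offset) pairs to a temp file.
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        for key, offset in entries:
            f.write("%s\t%d\n" % (key, offset))
    return path

def read_chunk(path):
    with open(path, "r") as f:
        for line in f:
            key, offset = line.rstrip("\n").split("\t")
            yield key, int(offset)

def sort_file(src_path, dst_path):
    # Pass 1: index the source file (key + byte offset) in sorted chunks.
    chunk_paths, chunk = [], []
    with open(src_path, "rb") as src:
        while True:
            offset = src.tell()
            line = src.readline()
            if not line:
                break
            key = line.split()[1].decode()   # sort key: second column
            chunk.append((key, offset))
            if len(chunk) >= CHUNK:
                chunk_paths.append(write_chunk(sorted(chunk)))
                chunk = []
        if chunk:
            chunk_paths.append(write_chunk(sorted(chunk)))

    # Pass 2: merge the sorted index chunks, then pull each line by offset.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for key, offset in heapq.merge(*[read_chunk(p) for p in chunk_paths]):
            src.seek(offset)                 # the random seek mentioned above
            dst.write(src.readline())
    for p in chunk_paths:
        os.remove(p)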