My program reads in 4 terminal arguments:
a text file
"bubble", "quick", or "brick" corresponding to the sort the user wants
a number between 2 and 392 corresponding to the number of items from the text file to sort, and
"verbose" or "terse", where verbose prints the vector once it is sorted and terse prints the number of swaps used to sort the data.
For example:
./SortAutoData auto-mpg.data.txt bubble 100 terse
would print the number of swaps needed to bubble sort the first 100 items from auto-mpg.data.txt.
What I need to do is find the number of swaps for each of the three sorts for each amount of elements 2 through 392. Is there a way to increment the amount of data without completely altering my code?
I tried doing this: for i in {2..392}; do echo ./SortAutoData auto-mpg.data.txt bubble ${i} terse; done but it doesn't print exactly what I would like. Ideally, I would like it to print:
(sorts for 2 data items)
(sorts for 3 data items)
(sorts for 4 data items)
(sorts for 5 data items)
(sorts for 6 data items)
etc....
Thanks in advance for any help anyone can provide!
I have an assignment where I need to find a minimum set cover of some points. I want to store each row of numbers in its own set, but I do not know the best data structure or approach for this. I know how many rows there will be. For example, the .txt file will look like this:
1 2 3 4 5 6
5 6 8 9
1 4 7 10
2 5 7 8 11
3 6 9 12
10 11
Is there a way to dynamically create multiple data structures to store each row of numbers? I was thinking something that works like this if it exists:
list<int> myList[6]; // create 6 lists
myList[0].insert(num); // insert numbers into this list
myList[1].insert(num); //insert numbers into the second list
I do not want to individually create lists because in a .txt file, there can be up to 300 sets of numbers.
edit: my main issue is figuring out how to dynamically create some data structure, preferably if it works with std::set_union since it looks useful to my assignment
If you want to control the number of lists programmatically, you can use a std::vector of datasets. So in your case the declaration would be
std::vector<std::list<int>> lists(6);
Adding a new, empty list to the vector is done with
lists.push_back({});
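As a minimal sketch of how this could look for the set-cover input, the code below reads one set per line and combines two of them with std::set_union. It assumes std::set<int> instead of std::list<int>, because std::set_union expects sorted ranges and std::set iterates in sorted order. The helper names readSets and unite are hypothetical, chosen only for illustration.

```cpp
#include <algorithm>
#include <istream>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: read one set of integers per non-empty line.
std::vector<std::set<int>> readSets(std::istream& in) {
    std::vector<std::set<int>> sets;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row(line);
        std::set<int> current;
        int value;
        while (row >> value) {
            current.insert(value);
        }
        if (!current.empty()) {
            sets.push_back(std::move(current));  // the vector grows as rows arrive
        }
    }
    return sets;
}

// Hypothetical helper: union of two sets via std::set_union.
std::set<int> unite(const std::set<int>& a, const std::set<int>& b) {
    std::set<int> result;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(result, result.end()));
    return result;
}
```

With a std::ifstream opened on the .txt file, readSets(file) would return one std::set<int> per row, so the number of rows does not need to be known in advance.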
I'm facing the following problem:
I have a huge file (let's say 30 GB), that is streamed in memory with a specific API.
This API only allows me to read going forward (not backward). But the files can be read as many times as I want.
The file contains data that is almost all sorted, as in, 99% of the data is sorted but it can happen that a record is not in its correct position and should have been inserted much before if everything was sorted.
I'm trying to create a duplicate of this file, except it would need to be sorted.
Is there a graceful way to do this?
The only way I can think of is the most generic way:
read the file
create batches of a few GB in memory, sort each one, and write it to a file on the HDD
use external merge to merge all these temporary files into the final output
However, this does not exploit the fact that the data is "almost" sorted. Would there be a better way to do this? For instance, without using external files on the HDD?
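You could do this (example in Python):
# `records` stands for a fresh forward pass over the file each time it is
# iterated; `write` and `max_memory` are placeholders.
last = None
special = []
# First pass: collect the out-of-sequence records.
for r in records:
    if last is None or r >= last:
        last = r
    else:
        special.append(r)
        if len(special) > max_memory:
            break
if len(special) > max_memory:
    # too many out-of-sequence records, use a regular sort
    ...
else:
    special.sort()
    i = 0
    last = None
    # Second pass: merge the in-order records with the sorted specials.
    for r in records:
        if last is not None and r < last:
            continue  # out-of-sequence record, already held in special
        last = r
        while i < len(special) and special[i] < r:
            write(special[i])
            i += 1
        write(r)
    while i < len(special):
        write(special[i])
        i += 1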
Use a variation of bottom-up merge sort called natural merge sort. The idea is to find runs of ordered data, then repeatedly merge those runs back and forth between two files (all sequential I/O) until only a single run is left. If the sort doesn't have to be stable (preserve the order of equal elements), you can treat a run boundary as occurring wherever a pair of consecutive elements is out of order, which eliminates some housekeeping. If the sort needs to be stable, you need to keep track of run boundaries on the initial pass that finds the runs; this could be an array of counts (the size of each run), which hopefully fits in memory. After each merge pass, the number of counts in the array is cut in half, and once there is only a single count, the sort is done.
Wiki article (no sample code given, though): natural bottom-up merge sort.
If all the out of order elements consist of somewhat isolated records, you could separate the out of order elements into a third file, only copying in order records from the first file to the second file. Then you sort the third file with any method you want (bottom up merge sort is probably still best if the third file is large), then merge the second and third files to create a sorted file.
If you have multiple hard drives, keep the files on separate drives. If doing this on an SSD, it won't matter. If using a single hard drive, reading or writing a large number of records at a time, like 10MB to 100MB per read or write, will greatly reduce the seek overhead during the sort process.
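The natural merge sort described in this answer can be sketched in memory as follows. This is only an illustration under stated assumptions: two std::vector<int> buffers stand in for the two files, the run boundaries are kept as start indices rather than counts, and the names findRuns and naturalMergeSort are hypothetical. A real implementation for a 30 GB file would stream the runs between files instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Start index of every ascending run, plus a final entry equal to data.size().
std::vector<std::size_t> findRuns(const std::vector<int>& data) {
    std::vector<std::size_t> bounds{0};
    for (std::size_t i = 1; i < data.size(); ++i) {
        if (data[i] < data[i - 1]) {
            bounds.push_back(i);  // out-of-order pair: a new run starts here
        }
    }
    bounds.push_back(data.size());
    return bounds;
}

void naturalMergeSort(std::vector<int>& data) {
    if (data.size() < 2) {
        return;
    }
    std::vector<std::size_t> bounds = findRuns(data);
    std::vector<int> buffer(data.size());  // plays the role of the second file

    while (bounds.size() > 2) {  // more than one run left
        std::vector<std::size_t> next{0};
        std::size_t k = 0;
        // Merge runs pairwise from `data` into `buffer`.
        for (; k + 2 < bounds.size(); k += 2) {
            std::merge(data.begin() + bounds[k], data.begin() + bounds[k + 1],
                       data.begin() + bounds[k + 1], data.begin() + bounds[k + 2],
                       buffer.begin() + bounds[k]);
            next.push_back(bounds[k + 2]);
        }
        // An odd run at the end has no partner and is copied unchanged.
        if (k + 1 < bounds.size()) {
            std::copy(data.begin() + bounds[k], data.begin() + bounds[k + 1],
                      buffer.begin() + bounds[k]);
            next.push_back(bounds[k + 1]);
        }
        data.swap(buffer);   // the roles of the two "files" alternate
        bounds.swap(next);   // roughly half as many runs after each pass
    }
}
```

For data that is 99% sorted, findRuns returns only a handful of boundaries, so very few merge passes are needed. Because std::merge takes equal elements from the first range first and the run boundaries are tracked explicitly, this sketch is stable.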
I am running a NetLogo model in BehaviorSpace, varying the number of runs each time. I have a turtle breed, pigs, and each pig accumulates a table with patch-types as keys and the number of visits to each patch-type as values.
In the end I calculate a list of mean number of visits from all pigs. The list has the same length as long as the original table has the same number of keys (number of patch-types). I would like to export this mean number of visits to each patch-type with BehaviorSpace.
Perhaps I could write a separate csv file (tried - creates many files, so lots of work later on putting them together). But I would rather have everything in the same file output after a run.
I could make a global variable for each patch-type but this seems crude and wrong. Especially if I upload a different patch configuration.
I tried just exporting the list, but then in Excel I see it with brackets e.g. [49 0 31.5 76 7 0].
So my question Q1: is there a proper way to export a list of values so that in BehaviorSpace table output csv there is a column for each value?
Q2: Or perhaps there is an example of how to output a single csv that looks exactly as I want it from BehaviorSpace?
PS: In my case the patch types are costs. And I might change those in the future and rerun everything. Ideally I would like to have as output: a graph of costs vs frequency of visits.
Thanks
If the lists are a fixed length that doesn't vary from run to run, you can get the items into separate columns by using one metric for each item. So in your BehaviorSpace experiment definition, instead of putting mylist, put item 0 mylist and item 1 mylist and so on.
If the lists aren't always the same length, you're out of luck. BehaviorSpace isn't flexible that way. You would have to write a separate program (in the programming language of your choice, perhaps NetLogo itself, perhaps an Excel macro, perhaps something else) to postprocess the BehaviorSpace output and make it look how you want.
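As one possible shape for such a postprocessing program, the sketch below turns a single cell like [49 0 31.5 76 7 0] into comma-separated fields. It assumes a flat list of numbers with no nested lists or strings containing spaces, and the helper names splitNetLogoList and toCsvFields are hypothetical. Reading the BehaviorSpace CSV and locating the right column is left out.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a NetLogo-style list cell into its items,
// ignoring anything outside the outermost brackets (such as quotes).
std::vector<std::string> splitNetLogoList(const std::string& cell) {
    std::string inner = cell;
    const std::string::size_type open = cell.find('[');
    const std::string::size_type close = cell.rfind(']');
    if (open != std::string::npos && close != std::string::npos && close > open) {
        inner = cell.substr(open + 1, close - open - 1);
    }
    std::istringstream in(inner);
    std::vector<std::string> items;
    std::string item;
    while (in >> item) {  // items are separated by whitespace
        items.push_back(item);
    }
    return items;
}

// Hypothetical helper: join the items so each one becomes its own CSV column.
std::string toCsvFields(const std::vector<std::string>& items) {
    std::string out;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i > 0) {
            out += ',';
        }
        out += items[i];
    }
    return out;
}
```

Because the items are kept as strings, lists of different lengths simply produce rows with different numbers of fields, which the calling program would have to pad or label as needed.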
I am writing code to do some template matching using cv::matchTemplate, but I have run into some problems with the 2-dimensional vector of vectors (vov) I created, which I have called vvABC. At the moment, my vov has 10 elements, which can change based on the values I pass while running the code.
My problem is moving from one column in my vov to the next so I can calculate the size. From my understanding of how vov works, if I have my elements stored in my vov as:
C_A   C_B
0     0
1     1
2     2
      3
      4
      5
      6
To calculate the size of the first column, I should simply do something like:
vvABC[0].size() to get the size of the first column (which would give 3 in this case) and vvABC[1].size() to get the size of the second column (which would give 7). The problem I now face is that both calls return 3, which is obviously wrong.
Can someone please help me out on how I can get the correct size of the next column?
I stored my detections in my vvABC, now I want to match them one at a time.
It seems like you made a mistake here:
for (uint iCaTemplate = iCa + 1; iCaTemplate < vvABC[iCa].size(); ++iCaTemplate) {
iCa is an index on the 'first level' of vector (of size 2 in your example above), i.e. columns, and you use it to go through the elements of the 'second level' of vector, i.e. rows.
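To illustrate the distinction between the two levels, here is a small sketch that uses a separate index for each one. It assumes int elements as a stand-in for the stored detections, and the function names columnSizes and countPairsInColumn are hypothetical. The asker's full loop is not shown in the thread, so this is not a drop-in fix.

```cpp
#include <cstddef>
#include <vector>

// Size of each inner vector ("column" in the question's terms).
// vov.size() is the number of columns; vov[c].size() is the length of column c.
std::vector<std::size_t> columnSizes(const std::vector<std::vector<int>>& vov) {
    std::vector<std::size_t> sizes;
    sizes.reserve(vov.size());
    for (const std::vector<int>& column : vov) {
        sizes.push_back(column.size());
    }
    return sizes;
}

// Visit every pair (row, other) with row < other inside one column and count
// them. The column index and the row indices are kept as separate variables.
std::size_t countPairsInColumn(const std::vector<std::vector<int>>& vov,
                               std::size_t col) {
    const std::vector<int>& column = vov.at(col);  // at() throws if col is out of range
    std::size_t pairs = 0;
    for (std::size_t row = 0; row < column.size(); ++row) {
        for (std::size_t other = row + 1; other < column.size(); ++other) {
            // column[row] would be matched against column[other] here
            ++pairs;
        }
    }
    return pairs;
}
```

For the layout in the question, columnSizes would report 3 and 7 if the vector of vectors were filled as intended, so printing those sizes is a quick way to check how the data was actually stored.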
Thanks a lot guys, esp. JGab, after several debug outputs, I finally found that my vector of vectors wasn't being filled up the way I thought it was...thanks once more and my apologies for my belated response.
I am testing the speed of inserting multiple rows with a single INSERT statement.
For example:
INSERT INTO [MyTable] VALUES (5, 'dog'), (6, 'cat'), (3, 'fish')
This is very fast until I pass 50 rows on a single statement, then the speed drops significantly.
Inserting 10000 rows in batches of 50 takes 0.9 seconds.
Inserting 10000 rows in batches of 51 takes 5.7 seconds.
My question has two parts:
Why is there such a hard performance drop at 50?
Can I rely on this behavior and code my application to never send batches larger than 50?
My tests were done in C++ and ADO.
Edit:
It appears the drop off point is not 50 rows, but 1000 columns. I get similar results with 50 rows of 20 columns or 100 rows of 10 columns.
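If the threshold really is the total number of values per statement, the batch size could be derived from the column count. The sketch below is only an illustration of that arithmetic: the default of 1000 comes from the measurements above and is an assumption, not a documented limit, and rowsPerBatch is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>

// Rows per INSERT statement so that rows * columns stays at or below maxValues.
// maxValues = 1000 is an assumption taken from the timings in this question;
// measure on your own server before relying on it.
std::size_t rowsPerBatch(std::size_t columns, std::size_t maxValues = 1000) {
    if (columns == 0) {
        return 0;  // nothing to insert
    }
    // Always allow at least one row, even for very wide tables.
    return std::max<std::size_t>(1, maxValues / columns);
}
```

With 20 columns this gives 50 rows per statement and with 10 columns it gives 100, which matches the two cases reported above.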
It could also be related to the size of the row. The table you use as an example seems to have only 2 columns. What if it has 25 columns? Is the performance drop off also at 50 rows?
Did you also compare with the "union all" approach shown here? http://blog.sqlauthority.com/2007/06/08/sql-server-insert-multiple-records-using-one-insert-statement-use-of-union-all/
I suspect there's an internal cache/index that is used up to 50 rows (it's a nice round decimal number). After 50 rows it falls back on a less efficient general case insertion algorithm that can handle arbitrary amounts of inputs without using excessive memory.
The slowdown is probably caused by parsing the string values, VALUES (5, 'dog'), (6, 'cat'), (3, 'fish'), and is not an INSERT issue.
Try something like this, which will insert one row for each row returned by the query:
INSERT INTO YourTable1
(col1, col2)
SELECT
Value1, Value2
FROM YourTable2
WHERE ...--rows will be more than 50
and see what happens
If you are using SQL Server 2008, you can use table-valued parameters and do a single INSERT statement.
Personally, I've never seen a slowdown at 50 inserted records, even with regular batches. Regardless, we moved to table-valued parameters, which gave us a significant speed increase.
Random thoughts:
is it completely consistent when run repeatedly?
are you checking for duplicates in the 1st 10k rows for the 2nd 10k insert?
did you try batch size of 51 first?
did you empty the table between tests?
For high-volume and high-frequency inserts, consider using bulk inserts to load your data. It is not the simplest thing in the world to implement and it brings a new set of challenges, but it can be much faster than doing an INSERT.