How to determine the number of filled drums, and the room left in each drum - openoffice-calc

Not quite a homework problem, but it may as well be:
You have a long list of positive integer values stored in column A. These are packets in unit U.
A drum can fit up to 500 U, but you cannot break up packets.
How many drums are required for any given list of values in column A?
This does not have to be the most efficient answer, processing in row order is absolutely fine.
I think you should be able to solve this with a formula, but the closest I got was
=CEILING(SUM(A1:A1000)/500;1)
Of course, this breaks up packets.
Additionally, this problem requires me to be able to find the room left in each drum used, but the emphasis for this question should remain on just the number of drums required.

This cannot be done with a single simple formula; each drum and packet needs to be counted. However, contrary to my comment, for this particular problem a spreadsheet works well, and there is no need for a macro.
First, set B2 to 500 for use in other formulas. If column A is not yet filled, use the formula =RANDBETWEEN(1,B$2) to add some values.
Column C holds the main formula, which determines how full each drum is. Set C2 to =A2 and C3 to =IF(C2+A3>B$2,A3,C2+A3): if adding the next packet would overflow the drum, start a new drum containing just that packet; otherwise add it to the running total. Fill C3 down through the remaining rows.
For column D, use =IF(C2+A3>B$2,B$2-C2,""), which records the room left in a drum at the moment it is closed. The last row of column D is different: =B$2-C21, changing 21 to whatever the last row is.
Finally, column E gives the answer, which is simply =COUNT(D2:D21).
Packets  Drum Size  How Full  Room left in each drum used  Number of filled drums
-------  ---------  --------  ---------------------------  ----------------------
    206        500       206                          294                      13
    309                  309
     68                  377
     84                  461                           39
    305                  305                          195
    387                  387                          113
    118                  118
      8                  126                          374
    479                  479                           21
    492                  492                            8
    120                  120
    291                  411                           89
    262                  262
    108                  370                          130
    440                  440                           60
     88                   88
    100                  188
    102                  290                          210
    478                  478                           22
     87                   87                          413
For OpenOffice Calc, use semicolons ; instead of commas , in formulas.
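If you want to sanity-check the spreadsheet against a script, here is a minimal sketch of the same row-order logic (shown in Python; the packet list is just the first ten example values from the table above):

DRUM_SIZE = 500
packets = [206, 309, 68, 84, 305, 387, 118, 8, 479, 492]

fills = []   # final fill level of each drum
fill = 0     # how full the current drum is
for p in packets:
    if fill + p > DRUM_SIZE:   # next packet would overflow: close this drum
        fills.append(fill)
        fill = p               # start a new drum with just this packet
    else:
        fill += p
fills.append(fill)             # the last, possibly partial, drum

print(len(fills))                        # number of drums required
print([DRUM_SIZE - f for f in fills])    # room left in each drum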

Related

Stata: tsline graphs values not at the right date

I want to depict the evolution of a variable called share over time. I do so by using tsline, but the resulting graph looks off: although my data starts in May 1989 and ends in December 1993, the trendline is drawn so that it begins in January 1989 and ends in mid-1993.
gen double time3 = monthly(time2, "YM")
format time3 %tm
tsset time3
tsline share, ///
ttitle("years") ytitle("") ylabel(0(.2).65) ///
tlabel(1989m5(12)1994m5, format(%tmY) labsize(small))
I know that Stata stores dates as integers, so I tried replacing the year-month indications after tlabel with integers. Since the time variable is defined as months since 1960m1, 1989m5 is stored internally as 352 and 1993m12 as 407; I learned this by running dis tm(1989m5). But even with tlabel(352(12)407), the trendline is not drawn correctly. Does anyone have an idea how to fix this? This is how the graph looks at the moment.
This is a subsample of the data that I used:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 time2 double(time3 share)
"1989-05" 352 .1536926147704591
"1989-06" 353 .1665024630541872
"1989-08" 355 .12674650698602793
"1989-09" 356 .18095712861415753
"1989-10" 357 .24629080118694363
"1989-11" 358 .23008849557522124
"1989-12" 359 .17638036809815952
"1990-01" 360 .20521653543307086
"1990-02" 361 .1754473161033797
"1990-03" 362 .17401960784313725
"1990-04" 363 .14173998044965788
"1990-05" 364 .1669970267591675
"1990-06" 365 .1398838334946757
"1990-08" 367 .10461689587426326
"1990-09" 368 .14965312190287414
"1990-10" 369 .1921182266009852
"1990-11" 370 .18038617886178862
"1990-12" 371 .19577735124760076
"1991-01" 372 .10562685093780849
"1991-02" 373 .09596928982725528
"1991-03" 374 .1941747572815534
"1991-04" 375 .1889106967615309
"1991-05" 376 .1794234592445328
"1991-06" 377 .1968390804597701
"1991-08" 379 .17846309403437816
"1991-09" 380 .19425173439048563
"1991-10" 381 .14556962025316456
"1991-11" 382 .15569143932267168
"1991-12" 383 .1694015444015444
"1992-01" 384 .20812928501469147
"1992-02" 385 .257590597453477
"1992-03" 386 .2204724409448819
"1992-04" 387 .22096456692913385
"1992-05" 388 .21601941747572814
"1992-06" 389 .1675025075225677
"1992-07" 390 .22176591375770022
"1992-09" 392 .15128968253968253
"1992-10" 393 .15841584158415842
"1992-11" 394 .1849112426035503
"1992-12" 395 .19642857142857142
"1993-01" 396 .22469252601702933
"1993-02" 397 .2796528447444552
"1993-03" 398 .290811339198436
"1993-04" 399 .24108910891089108
"1993-05" 400 .2562437562437562
"1993-06" 401 .22127872127872128
"1993-07" 402 .27874743326488705
"1993-09" 404 .3391472868217054
"1993-10" 405 .3840155945419103
"1993-11" 406 .45184824902723736
"1993-12" 407 .43987975951903807
end
format %tm time3
The graph you've posted doesn't seem surprising.
Using the data and code you've posted
clear
input str7 time2 double(time3 share)
"1989-05" 352 .1536926147704591
"1989-06" 353 .1665024630541872
"1989-08" 355 .12674650698602793
"1989-09" 356 .18095712861415753
"2019-10" 717 .13052208835341367
"2019-11" 718 .13559059987631417
"2019-12" 719 .13997555012224938
end
format %tm time3
tsset time3
tsline share, ///
ttitle("years") ytitle("") ylabel(0(.2).65) ///
tlabel(1989m5(12)2019m12, format(%tmY) labsize(small))
it's hard to see what might be a problem.
tsline doesn't purport to draw a trend line, just a line graph for the data specified.
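As an aside, the internal values mentioned in the question are easy to verify, since a Stata %tm value is just the number of months elapsed since 1960m1. A small sketch of that arithmetic (in Python, as it is plain integer math):

def tm(year, month):
    # Months elapsed since 1960m1, i.e. Stata's internal %tm value.
    return (year - 1960) * 12 + (month - 1)

print(tm(1989, 5))   # 352, matching dis tm(1989m5)
print(tm(1993, 12))  # 407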

Looping in SAS to bring the latest value

I am trying to find the days that match a given reference number of days, or failing that, the number of days closest to the reference days.
I have coded this far, but I am not sure how to go forward.
ID  Date        ref_days  lags  total_days
 1  2017-02-02       224     .           0
 1  2017-02-02       224    84          84
 1  2017-02-02       224    84         168
 2  2015-01-21       213   300         388
 3  2016-02-12       560    95           .
 3  2016-02-12       560    86         181
 3  2016-02-12       560    82         263
 3  2016-02-12       560    69         332
 3  2016-02-12       560    77         409
So now I want to pull out the last value close to the reference days, and the next total_days should start from zero again to find the next window. How can I do this?
Here is the code that I wrote:
data want;
    do until (totaldays <= ref_days);
        set have;
        by ID ref_days notsorted;
        if first.id then totaldays = 0;
        else totaldays + lags;
    end;
run;
Required Output:
ID  Date        ref_days  lags  total_days
 1  2017-02-02       224     .           0
 1  2017-02-02       224    84          84
 1  2017-02-02       224    84         168
 2  2015-01-21       213   300         388
 3  2016-02-12       560    95           .
 3  2016-02-12       300    86         181
 3  2016-02-12       300    82         263
 3  2016-02-12       300    69           .
 3  2016-02-12       300    77         146
A while ago I did something similar via PROC SQL. It calculates all the distances and takes the closest one. It works with moderately sized datasets. Hopefully it is of some use.
proc sql;
    select * from
        (select *,
                abs(t1.link - t2.link) as dist  /* in your case these would be date variables */
         from test1 t1
         left join test2 t2 on 1=1)
    group by system1
    having dist = min(dist);
quit;
There was some talk that the left join on 1=1 is a bit silly (a full outer join would suffice, or something), but it worked for the problem in question.
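For what it's worth, the same "compute all pairwise distances, keep the minimum per group" idea is easy to sketch in Python/pandas; the frame and column names (test1, test2, link, system1) just mirror the hypothetical ones in the PROC SQL above:

import pandas as pd

test1 = pd.DataFrame({'system1': ['a', 'b'], 'link': [100, 200]})
test2 = pd.DataFrame({'link': [90, 130, 210]})

# Cross join (the "on 1=1"), then the absolute distance for every pair.
pairs = test1.merge(test2, how='cross', suffixes=('_1', '_2'))
pairs['dist'] = (pairs['link_1'] - pairs['link_2']).abs()

# Per system1, keep the row with the smallest distance.
closest = pairs.loc[pairs.groupby('system1')['dist'].idxmin()]
print(closest)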

Reshape/pivot pandas dataframe

I have a dataframe with variables: id, 2001a, 2001b, 2002a, 2002b, 2003a, 2003b, etc.
I am trying to figure out a way to pivot the data so the variables are: id, year, a, b
The 16.2 documentation refers to some reshaping and pivoting, but that seemed to speak more towards hierarchical columns.
Any suggestions?
I am thinking about creating a hierarchical dataframe, but am not sure how to map the year in the original variable names to a created hierarchical column
sample df:
id  2001a  2001b  2002a  2002b  2003a  etc.
 1    242    235   5735     23   1521
 2    124    168    135   1361      1
 3    436    754      1     24   5124
etc.
Here is a way to create hierarchical columns.
df = pd.DataFrame({'2001a': [242, 124, 236],
                   '2001b': [242, 124, 236],
                   '2002a': [242, 124, 236],
                   '2002b': [242, 124, 236],
                   '2003a': [242, 124, 236]})
df.columns = df.columns.str.split(r'(\d+)', expand=True)
df
  2001      2002      2003
     a    b    a    b    a
0  242  242  242  242  242
1  124  124  124  124  124
2  236  236  236  236  236
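From here, one way (a sketch, using the sample values from the question) to get all the way to id, year, a, b is to move the year level of the hierarchical columns into the rows with stack:

import pandas as pd

# Mirror the question's sample frame (2003b absent, as in the question).
df = pd.DataFrame({'id': [1, 2, 3],
                   '2001a': [242, 124, 436],
                   '2001b': [235, 168, 754],
                   '2002a': [5735, 135, 1],
                   '2002b': [23, 1361, 24],
                   '2003a': [1521, 1, 5124]}).set_index('id')

# Split each '2001a'-style name into a ('2001', 'a') pair of levels
# (simple slicing here, rather than the regex split above, so there are
# exactly two levels; this assumes 4-digit years).
df.columns = pd.MultiIndex.from_tuples(
    [(c[:4], c[4:]) for c in df.columns], names=['year', None])

# stack moves the year level into the row index, leaving columns a and b.
long = df.stack(level='year').reset_index()
print(long)   # columns: id, year, a, b (the 2003 b value is NaN, as it is absent)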

Working with files I/O for beginners [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Hi all, I am working on a beginner's school project using file I/O in C++.
This program consists of two parts:
1) reading and processing a student data file, and writing the results to a student report file
2) modifying part 1 to calculate some statistics and write them to another file.
For this assignment, you will be reading one input file and writing out two other files.
Your program will be run using the referenced student data file.
Part 1 Detail
Read in the student data file. This 50 record file consists of a (8-digit numeric) student id, 8 assignment's points, midterm points, final points and lab exercise points. You must again follow the syllabus specifications for the determination of the letter grade, this time, processing 50 student grades. Extra credit points are not applicable for this assignment. You will write the input student data and the results of the processing into a student report file that looks like the output shown below. In addition to the input student data, the report should contain the "total" of the assignment grades, the total and percent of all points achieved, and the letter grade. You may assume that the input data file does not contain any errant data.
The file looks like the one below:
The file that we need to read from is hyperlinked here
The student report output file should look like this:
The Student Report Output File
Student -- Assignment Grades -- Ass Mid Fin LEx Total Pct Gr
-------- -- -- -- -- -- -- -- -- --- --- --- --- ----- --- --
56049257 16 16 20 16 12 15 12 20 115 58 123 59 355 89 B+
97201934 19 15 13 19 16 12 13 18 113 72 101 55 341 85 B
93589574 13 16 19 19 18 12 6 14 111 58 108 50 327 82 B
85404010 17 19 19 19 19 10 17 19 129 70 102 58 359 90 A-
99608681 11 15 19 19 17 10 16 19 116 42 117 57 332 83 B
84918110 11 20 18 17 12 8 12 19 109 46 122 31 308 77 C
89307179 16 16 19 18 14 17 15 19 120 56 117 52 345 86 B
09250373 15 15 18 18 11 18 17 19 120 44 106 51 321 80 B-
91909583 12 14 16 19 20 11 20 16 117 66 92 50 325 81 B-
...
Part 2 Detail
Write a summary report file that contains the average total points and average percent for all students. Also, display the number of A's, B's, C's, D's and F's for the students. Your summary output file should look something like this:
The average total points = ???
The average percent total = ??
The number of A's = ??
The number of B's = ??
The number of C's = ??
The number of D's = ??
The number of F's = ??
Additional requirements
All files must be checked for a successful open. They should also be closed when you are finished with them.
Make sure you write the student id with a leading 0, if appropriate (i.e. the 8th id).
Add headings to your output report file. They should be aligned and correctly identify the column data.
Do not use global variables, except for constants, in your solution.
For part 1: how do I read the input file and write a new, formatted report file with the headings above the data and each student's totals and grade at the end of each line?
Any help in this matter would be appreciated. Thanks in advance.
Engineering is all about converting a large, complex problem into many smaller, easy-to-solve problems.
Here is how I would start:
1.) Open the input file.
2.) Read one line from the input file.
3.) Break the line's string into values and compute that student's results.
4.) Repeat steps 2-3 until all records are read, then close the input file.
5.) Open the output file.
6.) Write the headings and results to the output file (see the sketch after the references below).
References:
1.)File I/O
2.)std::string
3.)File I/O C
Now you're pretty much there. Take it one step at a time.
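To make those steps concrete, here is a minimal sketch of the read-process-write loop, shown in Python for brevity (the C++ version does the same with ifstream/ofstream). The file names and letter-grade cutoffs are hypothetical, since the syllabus rules are not shown here; the 400-point maximum and the dropped-lowest-assignment rule are only inferred from the sample rows (e.g., 16+16+20+16+12+15+12+20 = 127, minus the lowest 12, gives the 115 shown, and 355/400 rounds to 89%).

# Hypothetical cutoffs -- the syllabus defines the real ones.
def letter(pct):
    if pct >= 90: return 'A'
    if pct >= 80: return 'B'
    if pct >= 70: return 'C'
    if pct >= 60: return 'D'
    return 'F'

MAX_POINTS = 400   # inferred from the sample report rows

with open('student_data.txt') as fin, open('report.txt', 'w') as fout:
    fout.write('Student  Total  Pct  Gr\n')       # headings (abbreviated)
    for line in fin:                              # one student per line
        fields = line.split()
        sid = fields[0]                           # kept as a string so a leading 0 survives
        scores = [int(x) for x in fields[1:]]     # 8 assignments, midterm, final, lab
        ass, mid, fin_pts, lab = scores[:8], scores[8], scores[9], scores[10]
        total = sum(ass) - min(ass) + mid + fin_pts + lab   # lowest assignment dropped (inferred)
        pct = round(100 * total / MAX_POINTS)
        fout.write(f'{sid:>8}  {total:5}  {pct:3}  {letter(pct):>2}\n')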

Memory Efficient Methods To Find Unique Strings

I have a data set that looks like this:
000 100 200 300 010 020 030 001 002 003
001 101 201 301 011 021 031 000 002 003
002 102 202 302 012 022 032 001 000 003
003 103 203 303 013 023 033 001 002 000
010 110 210 310 000 020 030 011 012 013
020 120 220 320 010 000 030 021 022 023
030 130 230 330 010 020 000 031 032 033
033 133 233 333 013 023 003 031 032 030
100 000 200 300 110 120 130 101 102 103
133 033 233 333 113 123 103 131 132 130
200 100 000 300 210 220 230 201 202 203
233 133 033 333 213 223 203 231 232 230
300 100 200 000 310 320 330 301 302 303
303 103 203 003 313 323 333 301 302 300
313 113 213 013 303 323 333 311 312 310
330 130 230 030 310 320 300 331 332 333
331 131 231 031 311 321 301 330 332 333
332 132 232 032 312 322 302 331 330 333
333 133 233 033 313 323 303 331 332 330
What I intend to do is to generate a list of unique strings from it, yielding:
000
001
002
003
010
011
012
013
020
021
022
023
030
031
032
033
100
101
102
103
110
113
120
123
130
131
132
133
200
201
202
203
210
213
220
223
230
231
232
233
300
301
302
303
310
311
312
313
320
321
322
323
330
331
332
333
The code I have to generate that is below, but it is very memory-hungry, because in reality the strings are of length >36 and there are more than 35 million lines in the file, each line with >36*3 columns/entries.
Is there a memory-efficient way to do it?
#include <iostream>
#include <vector>
#include <fstream>
#include <sstream>
#include <map>
#include <cstdlib>   // EXIT_FAILURE
using namespace std;

int main(int arg_count, char *arg_vec[]) {
    if (arg_count != 2) {
        cerr << "expected one argument" << endl;
        return EXIT_FAILURE;
    }
    string line;
    ifstream myfile(arg_vec[1]);
    map<string, int> Tags;   // every distinct token becomes a key
    if (myfile.is_open()) {
        while (getline(myfile, line)) {
            stringstream ss(line);
            string Elem;
            while (ss >> Elem) {
                Tags[Elem] = 1;
            }
        }
        myfile.close();
    }
    else { cout << "Unable to open file"; }
    // The map keeps its keys sorted, so this prints the unique strings in order.
    for (map<string, int>::iterator iter = Tags.begin(); iter != Tags.end(); iter++) {
        cout << (*iter).first << endl;
    }
    return 0;
}
This depends a bit on the characteristics of your dataset. In the worst case, where all strings are unique, you will need either O(n) memory to record your seen-set, or O(n^2) time to re-scan the entire file for each word. However, there are improvements that can be made.
First off, if your dataset only consists of 3-digit integers, then a simple array of 1000 bools will be much more memory-efficient than a map.
If you're using general data, then another good approach would be to sort the set, so copies of the same string end up adjacent, and then simply remove the adjacent duplicates. There are well-researched algorithms for sorting a dataset too large to fit in memory. This is most effective when a large percentage of the words in the set are unique, and thus holding a set of seen words in memory is prohibitively expensive.
Incidentally, this can be implemented easily with a shell pipeline, as GNU sort does the external sort for you:
tr " " "\n" < testdata | LC_ALL=C sort | uniq
Passing LC_ALL=C to sort disables locale processing and multibyte character set support, which can give a significant speed boost here.
O(1) memory [RAM]:
If you want to use no memory at all (besides a couple of temp variables), you could simply read one entry at a time and add it to the output file if it doesn't already exist there. This would be slower, though, since you'd have to read one entry at a time from the output file.
You could insert each entry into the output file in alphabetical order, though; then you would be able to see whether the entry already exists in O(log n) time via binary search per inserted entry. To actually insert, you need to re-create the file, which is O(n log n). You do this once for each of the n input strings, so overall the algorithm would run in O(n^2 log n) (which includes the lookup to find the insertion position plus the insertion) and use O(1) RAM.
Since your output file is already in alphabetical order, future lookups would also only be O(log n) via binary search.
You could also minimize the re-creation phase of the file by leaving excess space between entries. When the algorithm is done, you could do a vacuum on the file. This would bring it down to O(n log n).
Another way to reduce memory:
If your strings commonly share prefixes, then you can use a trie and probably use less memory overall, since you mentioned your strings are longer than 36 characters. This would still use a lot of memory, though.
Well, std::set might be slightly faster and use less memory than std::map.
It seems, given the large number of entries, that there will be a reasonable amount of overlap in the sequences of symbols. You could build a tree using the symbol at each position of each sequence as a node. Say you have the entries 12345 and 12346; the first four symbols overlap, so the two entries could be stored in a tree with 6 nodes.
You could walk the tree to see if a given symbol is contained at a given position; if not, you would add it. When you reach the end of the given string, you would mark that node as a string-terminating node. Reproducing the unique entries is then a matter of a depth-first traversal of the tree: the path from the root node to each terminator represents a unique entry.
If you partition the dataset, say into X thousand line chunks and aggregate the trees it would make a nice map-reduce job.
You're looking at a node space of 10^36, so if the data is entirely random you're looking at a very large possible number of nodes. If there's a good deal of overlap and a smaller number of unique entries, you will probably find that you use a good deal less.
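A sketch of that tree idea (in Python, with a plain dict of dicts as nodes and a terminator flag, as described above):

def add(trie, s):
    node = trie
    for ch in s:
        node = node.setdefault(ch, {})   # shared prefixes reuse nodes
    node['$'] = True                     # mark a string-terminating node

def walk(trie, prefix=''):
    # Depth-first traversal: each root-to-terminator path is a unique entry.
    for ch, child in sorted(trie.items()):
        if ch == '$':
            yield prefix
        else:
            yield from walk(child, prefix + ch)

trie = {}
for s in ['12345', '12346']:
    add(trie, s)
print(list(walk(trie)))   # ['12345', '12346'], stored in 6 shared nodes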
You could probably just write an in-place sort for it. You're still going to have to load the file into memory, though, because in-place sorting with I/O wouldn't be efficient: you'd have to read and write for every comparison.
std::set would be better, but you should look into hash sets. Those are not available in the current C++ standard library (although they are supposed to be in C++0x), but some compilers have implementations. Visual C++ has stdext::hash_set, and Boost has related machinery; see the Boost Multi-index Containers Library.
Try STXXL, an external-memory STL, for huge datasets.
The solution I would suggest is to use a memory-mapped file to access the data and radix sort to sort the list. This is trivial if the strings are all the same length.
If the strings are of different lengths, use radix sort for a fast presort on the first n characters, then sort the unsorted sublists with whatever method is most appropriate. With very short sublists, a simple bubble sort would do it. With longer lists, use quicksort, or use the STL set with a custom class providing the compare operator.
With memory mapping and radix sort, the memory requirement and performance would be optimal. All you need is the mapping space (roughly the size of the file) and a table of 32-bit values, one per string, to hold the linked lists for radix sort.
You could also save memory and speed up the sorting by using a more compact encoding of the strings. A simple encoding would use 2 bits per character, using the values 1, 2 and 3 for the three letters and 0 to signal the end of the string. Or, more efficiently, use a base-3 encoding packed into integers, with the length of the string encoded in front: say you have characters c1, c2, c3, c4; the encoding would yield the integer c1*3^3 + c2*3^2 + c3*3^1 + c4*3^0. This supposes you assign a value from 0 to 2 to each character. Using this encoding will save memory and optimize sorting if the number of strings to sort is huge.
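As a sketch of the bit-packing idea: the data in this question actually uses a four-symbol alphabet (the digits 0-3), so a fixed 2-bits-per-character variant works, storing the length separately instead of using a terminator (all four 2-bit values are taken by the digits):

def pack(s):
    # 2 bits per character; valid because only '0'..'3' occur here.
    code = 0
    for ch in s:
        code = (code << 2) | int(ch)
    return code

def unpack(code, length):
    chars = []
    for _ in range(length):
        chars.append(str(code & 3))   # low 2 bits = last character
        code >>= 2
    return ''.join(reversed(chars))

assert unpack(pack('3102'), 4) == '3102'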
You can look at the excellent Judy arrays library: a compact, trie-based, very fast and memory-efficient data structure which scales to billions of strings. Better than any search tree.
You can use the JudySL functions for your purpose. You can use them much as in your program, but they are much faster and more memory-efficient.