Splitting awkward cstrings into different arrays? - c++

Ok, so here's the deal. This is a project for school, and we can't use #include <string>.
For any strings we'll be dealing with, we have to use cstrings, i.e. char arrays that end with a null terminator. Basically the same thing, right? Well, I'm having a little bit of trouble. I have to read in a first name, a last name, a student id, and a minimum of 5 but a maximum of 6 grades from an input file. There is a catch: there can be an arbitrary number of spaces between each of those details, with the maximum length of a line being 250. An example of the input is below.
Adam Zeller 452231 78 86 91 64 90 76
Barbara Young 274253 88 77 91 66 82
Carl Wilson 112231 87 77 76 78 77 82
Notice how there are random amounts of spaces between the details. Basically, I need to get the names (both first name and last name can vary in length), read the student id into an int, and then read all the rest of their grades (preferably into an int array). Also, they can have either 5 or 6 grades, and the program should be able to handle either. How in the world do I go about sorting this data? I thought maybe I could getline() the whole line into a char array, and then separate each bit into its own array, but I just don't know how to go about this. I definitely don't want anyone to give me any code, but maybe point me in the right direction: how to sort a line of data into different variables while accounting for either 5 or 6 grades without affecting the data. The list could also be up to 60 lines long (meaning up to 60 students on it, but no more than that). This is only a portion of the project, but it seems to be the one part I can't get past. Again, I don't want any code or direct answers, just a pointer in the right direction. Thanks so much!

I'm not going to post any code (as requested), but consider filtering each line through strtok. strtok splits the string into tokens, where they can be arranged or stored however you like. See here for more: http://www.cplusplus.com/reference/cstring/strtok/
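For readers who do want to see the shape of it, here is a minimal sketch of the strtok approach. The Student struct, its field sizes, and the function names are assumptions made up for illustration, not part of the assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical record layout; the 64-character name limit is an assumption.
struct Student {
    char first[64];
    char last[64];
    int id;
    int grades[6];
    int gradeCount;
};

// Copies src into dst, truncating if needed and always null-terminating.
static void copyField(char* dst, std::size_t cap, const char* src) {
    std::strncpy(dst, src, cap - 1);
    dst[cap - 1] = '\0';
}

// Splits one input line in place. strtok overwrites delimiters with '\0',
// so `line` must be a writable buffer. Returns false if a field is missing
// or fewer than 5 grades are present.
bool parseLine(char* line, Student& s) {
    assert(line != nullptr);
    const char* delims = " \t\r\n";  // any run of these counts as one separator

    char* tok = std::strtok(line, delims);
    if (!tok) return false;
    copyField(s.first, sizeof s.first, tok);

    tok = std::strtok(nullptr, delims);
    if (!tok) return false;
    copyField(s.last, sizeof s.last, tok);

    tok = std::strtok(nullptr, delims);
    if (!tok) return false;
    s.id = std::atoi(tok);

    // Whatever tokens remain are grades; anything past the sixth is ignored.
    s.gradeCount = 0;
    while (s.gradeCount < 6 && (tok = std::strtok(nullptr, delims)) != nullptr) {
        s.grades[s.gradeCount++] = std::atoi(tok);
    }
    return s.gradeCount >= 5;
}
```

Because strtok treats consecutive delimiters as a single separator, the arbitrary runs of spaces need no special handling.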

I guess you may follow the below steps:
read a line
get the words from the line
first two words are first and last names
the rest of the words are numbers; use atoi to convert each one from a string
continue the above till EOF
How do you get the words? Maybe the C library function isspace will help.
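The steps above can be sketched roughly as follows. The helper names (nextWord, parseStudent) and the 64-character name limit are assumptions for illustration; only isspace and atoi come from the answer.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <cstring>

const std::size_t kNameCap = 64;  // assumed maximum name length, including '\0'

// Copies the next whitespace-delimited word starting at `p` into `out`
// (at most cap-1 characters) and advances `p` past it.
// Returns false when no word is left on the line.
bool nextWord(const char*& p, char* out, std::size_t cap) {
    assert(cap > 0);
    while (*p && std::isspace(static_cast<unsigned char>(*p))) ++p;  // skip the gap
    if (*p == '\0') return false;
    std::size_t n = 0;
    while (*p && !std::isspace(static_cast<unsigned char>(*p))) {
        if (n + 1 < cap) out[n++] = *p;  // over-long words are truncated
        ++p;
    }
    out[n] = '\0';
    return true;
}

// Fills in the names, the id and up to 6 grades from one line.
// Returns the number of grades read, or -1 if a name or the id is missing.
int parseStudent(const char* line, char* first, char* last, int& id, int grades[6]) {
    const char* p = line;
    char word[32];
    if (!nextWord(p, first, kNameCap)) return -1;
    if (!nextWord(p, last, kNameCap)) return -1;
    if (!nextWord(p, word, sizeof word)) return -1;
    id = std::atoi(word);

    int count = 0;
    while (count < 6 && nextWord(p, word, sizeof word)) {
        grades[count++] = std::atoi(word);
    }
    return count;
}
```

The caller is expected to pass name buffers of at least kNameCap characters and to treat a return value below 5 as a malformed line.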

Technically, this is a trick project/question.
The hardest part of parsing the lines in the file is as you said: dealing with the random amount of spaces.
Since you requested no code, I'll give you a few hints based on what you suggested:
You know that the size of each line is a maximum of 250 characters (including the newline character?) - this is the length of each line, and as such, since it occurs with such regularity, you can read this many characters at a time with regular file functions, or, using fstreams (as deduced from your tag).
The only issue, really, is storing these tokens. If you know ahead of time how many you will store, you can define an array of c-strings (as you can't use <string>) sized to the maximum number that you think will occur. However, as that is both unreliable and a bit inefficient (it wastes memory if you reserve room for too many lines), you can make it dynamic. In that regard you have the option of using a C++ container like <vector>, for ease of access and storage.
After that, figuring out the values of the data on each line is relatively easy:
First, look at your data, what do you observe?
Each individual piece of data (a token in parsing nomenclature) is delimited by at least one space character.
Also, the names are the first two tokens of any line, and do not contain digits in them.
Hence anything that is neither a space nor a digit belongs to a string, in this case: part of a name.
All you have to do is then iterate over the container/data structure you've used to store the c-strings and parse them using the criteria described above.

Related

Copy Certain Portion of One Array to Another

I'm still quite new to programming -- about two months in -- so if this is a really basic question, then I apologize. Going along with that, my terminology might be completely off. If it is, I'd greatly appreciate any help you might be able to offer with telling me the proper terms. I searched around the forums here for a bit, but couldn't find anything that answered my question. If you're aware of a topic that does, then please just link it below.
Onto the question.
Let's say that I have an external text file with a bunch of information in it. The information is divided into items, each item delineated from the next by '\\'. Each item is divided into four fields, each field delineated from the next by '::'.
What I want to do is take one item's information out of the text file and place it into an array called info. I want to then take info and pass it to another function. This function will create four new arrays and then portion out field 1 to array 1, field 2 to array 2, etc.
Basically, how do I take an array, take a portion of that array and give it to another variable, then copy another portion of that array and give it to another array.
Example:
The External Text File looks like the following:
26::Female::Kentucky::Trauma\\34::Male::Michigan::Elective\\85::Male::Unknown::Trauma\\18::Female::Washington::Emergent
Using fstream, I then take "26::Female::Kentucky::Trauma" and put it into an array called 'info', which is then passed to a function called Sort(char info[]).
How do I get Sort(char info[]) to take an array with "26::Female::Kentucky::Trauma" and turn it into four arrays such as:
Age: 26
Sex: Female
Location: Kentucky
Reason for Admission: Trauma
EDIT
Array 1 looks like:
26::Female::Kentucky::Trauma
I then create four char arrays called, Age, Sex, Location, Reason. How do I get 26 into the Age array, Female into the Sex array, Kentucky into the Location array, and Trauma, into the Reason array?
I know that I could do this at the stage where I'm reading in from an external file, but it seems easier to do it this way for my purposes.
Thank you for your time.
Look at the documentation for the string class. The functions find_first_of and substr will be useful. Split the string when it finds :: or \\. For example, 26::Female::Kentucky::Trauma would be split into 26 and Female::Kentucky::Trauma. This sounds like it may be an assignment, so I will not give a complete solution, but this should be enough to get you going.
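As a rough illustration of the splitting idea, here is a sketch built on std::string. It uses find rather than find_first_of, since find matches the whole two-character delimiter while find_first_of matches any single character from a set. The function name splitOn is an assumption, not something from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits `text` at every occurrence of `delim` and returns the pieces in order.
// An empty piece is kept when two delimiters are adjacent.
std::vector<std::string> splitOn(const std::string& text, const std::string& delim) {
    assert(!delim.empty());
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        std::size_t pos = text.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(text.substr(start));  // last piece: everything that is left
            break;
        }
        parts.push_back(text.substr(start, pos - start));
        start = pos + delim.size();  // resume right after the delimiter
    }
    return parts;
}
```

Splitting the whole file contents on the item delimiter first, and then each item on "::", gives the four fields per item; copying them into char arrays afterwards is a matter of strcpy or std::string::copy.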

Fast way to get two first and last characters of a string from the input

I need to read a string from the input
a string has its length from 2 letters up to 1000 letters
I only need 2 first letters, 2 last letters, and the size of the entire string
Here is my way of doing it. HOWEVER, I do believe there is a smarter way, which is why I am asking this question. Could you please tell me, an inexperienced and new C++ programmer, what the possible ways of doing this task better are?
Thank you.
string word;
getline(cin, word);
// results - I need only those 5 numbers:
int l = word.length();
int c1 = word[0];
int c2 = word[1];
int c3 = word[l-2];
int c4 = word[l-1];
Why do I need this? I want to encode a huge number of really long strings, but I figured out I really need only those 5 values I mentioned, the rest is redundant. How many words will be loaded? Enough to make this part of code worth working on :)
I will take you at your word that this is something that is worth optimizing to an extreme. The method you've shown in the question is already the most straight-forward way to do it.
I'd start by using memory mapping to map chunks of the file into memory at a time. Then, loop through the buffer looking for newline characters. Take the first two characters after the previous newline and the last two characters before the one you just found. Subtract the address just past the previous newline from the address of the newline you just found to get the length of the line. Rinse, lather, and repeat.
Obviously some care will need to be taken around boundaries, where one newline is in the previous mapped buffer and one is in the next.
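A minimal sketch of that inner loop is below. It works on a plain in-memory buffer so it stays portable; the memory mapping itself and the chunk-boundary handling are left out, and the LineSummary struct and function name are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// The five values the question asks for, per line.
struct LineSummary {
    std::size_t length;
    char first[2];
    char last[2];
};

// Summarises every '\n'-terminated line in [data, data + size).
// A trailing piece without '\n' is not reported: in a chunked reader it would
// be carried over and joined with the start of the next chunk.
// Lines shorter than two characters are skipped; '\r' is not treated specially.
std::vector<LineSummary> summarise(const char* data, std::size_t size) {
    std::vector<LineSummary> out;
    const char* lineStart = data;
    const char* end = data + size;
    while (lineStart < end) {
        const void* found = std::memchr(lineStart, '\n',
                                        static_cast<std::size_t>(end - lineStart));
        if (!found) break;
        const char* nl = static_cast<const char*>(found);
        std::size_t len = static_cast<std::size_t>(nl - lineStart);
        if (len >= 2) {
            LineSummary s;
            s.length = len;
            s.first[0] = lineStart[0];
            s.first[1] = lineStart[1];
            s.last[0] = nl[-2];
            s.last[1] = nl[-1];
            out.push_back(s);
        }
        lineStart = nl + 1;  // the next line begins right after the newline
    }
    return out;
}
```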
The first two letters are easy to obtain and fast.
The issue is with the last two letters.
In order to read a text line, the input must be scanned until it finds an end-of-line character (usually a newline). Since your text lines are variable, there is no fast solution here.
You can mitigate the issue by reading in blocks of data from the file into memory and searching memory for the line endings. This avoids a call to getline, and it avoids a double search for the end of line (once by getline and the other by your program).
If you change the input to be fixed width, this can be sped up.
If you want to optimize this (although I can't imagine why you would want to do that, but surely you have your reasons), the first thing to do is to get rid of std::string and read the input directly. That will spare you one copy of the whole string.
If your input is stdin, you will be slowed down by the buffering too. As has already been said, the best speed would be achieved by reading big chunks from a file in binary mode and doing the end-of-line detection yourself.
At any rate, you will be limited by the I/O bandwidth (disk access speed) in the end.

How to find special values in large file using C++ or C

I've some values I want to find in a large (> 500 MB) text file using C++ or C. I know that a possible matching value can only exist at the very beginning of each line and that its length is exactly ten characters. Okay, I can read the whole file line by line, searching for the value with substr() or a regexp, but that is a little bit ugly and very slow. I considered using an embedded database (e.g. Berkeley DB), but the file I want to search in is very dynamic and I see a problem with bringing it into the database every time. Due to a memory limit it is not possible to load the whole file into memory at once. Many thanks in advance.
This doesn't seem well suited to C/C++. Since the problem is defined with the need to parse whole lines of text, and perform pattern matching on the first 10-chars, something interpreted, such as python or perl would seem to be simpler.
How about:
pattern = '0123456789'  # <-- replace with pattern
with open('myfile.txt') as f:
    for line in f:  # iterates line by line without loading the whole file
        if line.startswith(pattern):
            print("Eureka!")
I don't see how you're going to do this faster than using the stdio library, reading each line in turn into a buffer, and using strchr, strcmp, strncmp or some such. Given the description of your problem, that's already fairly optimal. There's no magic that will avoid the need to go through the file line by line looking for your pattern.
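A minimal sketch of that stdio approach follows. The function name and the 4096-byte buffer size are assumptions; the only library calls are fgets, strlen and strncmp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Counts the lines of `in` that begin with `key`.
// fgets hands back a long line in several pieces, so only the piece that
// actually starts a line is compared against the key.
long countLinesStartingWith(std::FILE* in, const char* key) {
    assert(in != nullptr && key != nullptr);
    const std::size_t keyLen = std::strlen(key);
    char buf[4096];
    bool atLineStart = true;
    long hits = 0;
    while (std::fgets(buf, sizeof buf, in)) {
        if (atLineStart && std::strncmp(buf, key, keyLen) == 0) ++hits;
        std::size_t n = std::strlen(buf);
        // The next piece starts a new line only if this one ended with '\n'.
        atLineStart = (n > 0 && buf[n - 1] == '\n');
    }
    return hits;
}
```

Only one buffer's worth of the file is in memory at a time, which fits the memory limit mentioned in the question.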
That said, regular expressions are almost certainly not needed here if you're dealing with a fixed pattern of exactly ten characters at the start of a line -- that would be needlessly slow and I wouldn't use the regex library.
If you really, really need to beat the last few microseconds out of this, and the pattern is literally constant and at the start of a line, you might be able to do a memchr on read-in buffers looking for "\npattern" or some such (that is, including the newline character in your search) but you make it sound like the pattern is not precisely constant. Assuming it is not precisely constant, the most obvious method (see first paragraph) is the right thing to do.
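One way to read the buffer-scanning idea: memchr only searches for a single byte, so a sketch would use it to hop from newline to newline and memcmp to check the bytes that follow. The function below is such a sketch, under the assumption that the buffer begins at the start of a line; its name and signature are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns a pointer to the first line in [buf, buf + size) that starts with
// the keyLen bytes at `key`, or nullptr if there is none.
// Assumes buf[0] is the first character of a line.
const char* findLineStart(const char* buf, std::size_t size,
                          const char* key, std::size_t keyLen) {
    assert(buf != nullptr && key != nullptr);
    const char* end = buf + size;
    const char* line = buf;
    while (line != nullptr && static_cast<std::size_t>(end - line) >= keyLen) {
        if (std::memcmp(line, key, keyLen) == 0) return line;
        // Jump to the character after the next newline, if there is one.
        const void* nl = std::memchr(line, '\n', static_cast<std::size_t>(end - line));
        line = nl ? static_cast<const char*>(nl) + 1 : nullptr;
    }
    return nullptr;
}
```

A key that straddles the end of one buffer and the start of the next is not found by this function; the caller has to carry the unfinished tail over to the next read.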
If you have a large number of values that you are looking for then you want to use Aho-Corasick. This algorithm allows you to create a single finite state machine that can search for all occurrences of any string in a set simultaneously. This means that you can search through your file a single time and find all matches of every value you are looking for. The wikipedia link above has a link to a C implementation of Aho-Corasick. If you want to look at a Go implementation that I've written you can look here.
If you are looking for a single or a very small number of values then you'd be better off using Boyer-Moore. Although in this case you might want to just use grep, which will probably be just as fast as anything you write for this application.
How about using memory mapped files before search?
http://beej.us/guide/bgipc/output/html/multipage/mmap.html
One way may be loading and searching for say first 64 MB in memory, unload this then load the next 64 MB and so on (in multiples of 4 KB so that you are not overlooking any text which might be split at the block boundary)
Also view Boyer Moore String Search
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
Yes this can be done fast. Been there. Done that. It is easy to introduce bugs, however.
The trick is in managing end of buffer, since you will read a buffer full of data, search that buffer, and then go on to the next. Since the pattern could span the boundary between two buffers, you wind up writing most of your code to cover that case.
At any rate, outside of the boundary case, you have a loop that looks like the following:
unsigned short *p = buffer;
while( (p < EOB) && !patterns[*p] ) ++p;  // skip byte pairs that cannot start a match
This assumes that EOB has been appropriately initialized, and that patterns[] is an array of 65536 values which are 0 if you can't be at the start of your pattern and 1 if you can.
Depending on your CR/LF and byte order conventions, patterns to set to 1 might include \nx or \rx where x is the first character in your 10 character pattern. Or x\n or x\r for the other byte order. And if you don't know the byte order or convention you can include all four.
Once you have a candidate location (EOL followed by the first byte) you do the work of checking the remaining 9 bytes. Building the patterns array is done offline, ahead of time. Two byte patterns fit in a small enough array that you don't have too much memory thrashing when doing the indexing, but you get to zip through the data twice as fast as if you did single byte.
There is one crazy optimization you can add into this, and that is to write a sentinel at the end of buffer, and put it in your patterns array. But that sentinel must be something that couldn't appear in the file otherwise. It gets the loop down to one test, one lookup and one increment, though.

Write a program to count how many times each distinct word appears in its input

This is question 3-3 in Accelerated C++.
I am new to C++. I have thought about this for a long time, however, I can't figure it out.
Will anyone resolve this problem for me?
Please explain it in detail, you know I am not very good at programming. Tell me the meaning of the variables you use.
The best data structure for this is something like a std::map<std::string,unsigned>, but you don't encounter maps until chapter 7.
Here are some hints based on the contents of chapter 3:
You can put strings in a vector, so you can have std::vector<std::string>
Strings can be compared, so std::sort works with std::vector<std::string>, and you can check if two strings are the same with s1==s2 just like for integers.
You saw in chapter 1 that std::cin >> s reads a word from std::cin into s if s is a std::string.
To provide maximal learning experience, I will not provide pastable code. That's an exercise. You have to do it yourself to learn as much as you can.
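For anyone who has already attempted the exercise and wants something to compare against, the hints above could be combined roughly like this. It is one possible sketch, not the book's solution; the WordCount struct and the function wrapper go slightly beyond chapter 3 and are assumptions made to keep the example testable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One distinct word together with how many times it appeared.
struct WordCount {
    std::string word;
    unsigned count;
};

// Reads whitespace-separated words from `in`, sorts them so that equal words
// end up next to each other, then counts each run of equal words.
std::vector<WordCount> countWords(std::istream& in) {
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);

    std::sort(words.begin(), words.end());

    std::vector<WordCount> result;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (!result.empty() && result.back().word == words[i]) {
            ++result.back().count;  // same word as the previous one
        } else {
            WordCount wc;
            wc.word = words[i];
            wc.count = 1;
            result.push_back(wc);   // first time this word is seen
        }
    }
    return result;
}
```

Passing std::cin as the stream gives the behaviour the exercise asks for; the result comes back in alphabetical order because of the sort.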
This is the perfect scenario for employing a kind of map that creates its value type upon accessing a non-existing key. Fortunately, C++ has such a map in its standard library: std::map<key_type,value_type> is exactly what you need.
So here are the jigsaw pieces:
you can read word by word from a stream into a string by using operator >>
you can store what you find in a map of words (strings) to occurrences (unsigned number type)
when you access an entry in the map through a non-existing key, the map will helpfully create a new default-constructed value under that key for you; if the value happens to be a number, default-construction will set it to 0 (zero)
Have fun putting this together!
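Assembled, the pieces above might look like the following sketch. The function name is an assumption; operator>> and std::map::operator[] behave as described in the list.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Counts how often each whitespace-separated word occurs in `in`.
std::map<std::string, unsigned> countWordsWithMap(std::istream& in) {
    std::map<std::string, unsigned> counts;
    std::string word;
    while (in >> word) {
        // operator[] inserts a zero-initialised count the first time a word
        // is seen, so a plain increment is enough.
        ++counts[word];
    }
    return counts;
}
```

Iterating over the returned map visits the words in sorted order, with the word in `first` and its count in `second`.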
Here's my hint. std::map will be your friend.
Here is an algorithm you could use. Try coding something and put your results here; people can then help you get further.
Scan down the string collecting each letter until you get to a word boundary (say space or . or , etc).
Take that word and compare it to the words you've already found, if already found then add one to the count for that word. If it's not then add that word to the list of words found with a count of 1.
Carry on down the string
Well, you need a way of getting individual words from the input stream (perhaps something like an "input stream" method applied to the "standard input stream") and a way of storing those strings and counts in some sort of "collection".
My natural homework cynicism and general apathy towards life prevent me from adding more detail at the moment :-)
The meaning of any variables I use is fairly self-evident since I tend to use things like objectsRemaining or hasBeenOpened.

Is this an acceptable use of "ASCII arithmetic"?

I've got a string value of the form 10123X123456 where 10 is the year, 123 is the day number within the year, and the rest is unique system-generated stuff. Under certain circumstances, I need to add 400 to the day number, so that the number above, for example, would become 10523X123456.
My first idea was to substring those three characters, convert them to an integer, add 400 to it, convert them back to a string and then call replace on the original string. That works.
But then it occurred to me that the only character I actually need to change is the third one, and that the original value would always be 0-3, so there would never be any "carrying" problems. It further occurred to me that the ASCII code points for the numbers are consecutive, so adding the number 4 to the character "0", for example, would result in "4", and so forth. So that's what I ended up doing.
My question is, is there any reason that won't always work? I generally avoid "ASCII arithmetic" on the grounds that it's not cross-platform or internationalization friendly. But it seems reasonable to assume that the code points for numbers will always be sequential, i.e., "4" will always be 1 more than "3". Anybody see any problem with this reasoning?
Here's the code.
string input = "10123X123456";
input[2] += 4;
//Output should be 10523X123456
From the C++ standard, section 2.2.3:
In both the source and execution basic character sets, the value of each character after 0 in the
above list of decimal digits shall be one greater than the value of the previous.
So yes, if you're guaranteed to never need a carry, you're good to go.
The C++ language definition requires that the code-point values of the numerals be consecutive. Therefore, ASCII arithmetic is perfectly acceptable.
Always keep in mind that if this is generated by something that you do not entirely control (such as users and third-party systems), something can and will go wrong with it. (Check out Murphy's laws.)
So I think you should at least add some validation before doing so.
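A small sketch of what such validation could look like, assuming the layout from the question (the hundreds digit of the day number sits at index 2). The function name is made up for illustration.

```cpp
#include <cassert>
#include <string>

// Adds 400 to the three-digit day number at positions 2..4 by bumping its
// hundreds digit. Returns false and leaves `id` untouched if the string is
// too short or the digit is outside '0'..'3' (which would need a carry).
bool addFourHundredDays(std::string& id) {
    if (id.size() < 5) return false;
    char& hundreds = id[2];
    if (hundreds < '0' || hundreds > '3') return false;
    // The standard guarantees '0'..'9' are consecutive, so '1' + 4 == '5'.
    hundreds = static_cast<char>(hundreds + 4);
    return true;
}
```

The range check also stops the function from being applied twice to the same value, since the digit is 4 or higher after the first call.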
It sounds like altering the string as you describe is easier than parsing the number out in the first place. So if your algorithm works (and it certainly does what you describe), I wouldn't consider it premature optimization.
Of course, after you add 400, it's no longer a day number, so you couldn't apply this process recursively.
And, <obligatory Year 2100 warning>.
A very long time ago I saw some x86 processor instructions for ASCII and BCD.
Those are AAA (ASCII Adjust for Addition), AAS (subtraction), AAM (mult), AAD (div).
But even if you are not sure about the target platform, you can refer to the specification of the character set you are using, and I guess you'll find that the first 128 characters (the ASCII range) have the same meaning in most character sets in common use (for Unicode, that is the first code page).