Explaining frequency[toupper(new_letter) - 'A']++; - c++

So I've been searching for a solution to a problem in which one step involves counting the frequency of each unique letter. Everywhere I look uses the same array incrementer. I haven't seen this form before and don't fully understand it. I have tried to find supporting documentation for the format but can't figure out what it actually does. I can get it to work; however, I'm not sure what each piece represents.
The piece I'm having trouble understanding is what's going on inside the brackets here.
frequency[toupper(new_letter) - 'A']++;
Where frequency is an array
an example from: count number of times a character appears in an array?
Algorithm:
Open file / read a letter.
Search for the letters array for the new letter.
If the new letter exists: increment the frequency slot for
that letter: frequency[toupper(new_letter) - 'A']++; If the new
letter is missing, add to array and set frequency to 1.
After all letters are processed, print out the frequency array:
cout << static_cast<char>('A' + index) << ": " << frequency[index] << endl;
Any help understanding would be much appreciated.

This is simply indexing into an array. The part that may be confusing you is toupper(new_letter) - 'A': it converts the letter to uppercase and then subtracts the ASCII code of 'A' from the ASCII code of the result. The result is therefore a number in the range [0, 25]. Adding that number back to 'A' recovers the original uppercase character, which is what the output line does. As for the rest of the algorithm, it is essentially the counting step of a counting sort.

Unfortunately, this solution is not completely portable. It assumes that in the execution character set, the capital letters A-Z have consecutive values. That is, it assumes 'A' + 1 is equal to 'B', 'B' + 1 is equal to 'C', and so on. This is not necessarily true, but it usually is.
toupper simply converts whatever character is passed to it to uppercase. Subtracting 'A' from this, given the above assumption, will work out the "distance" from 'A' to the given letter. That is, if new_letter is 'A', the result will be 0. If it is 'b', the result will be 1. As you can see, the reason for using toupper was to make it independent as to whether new_letter was uppercase or lowercase.
This result (essentially the position of the letter in the alphabet) is then used to access the array. If frequency is an array of 26 ints (one for each letter), you will access the corresponding int. That int is then incremented.
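As a rough illustration of the indexing described above, here is a minimal sketch in Python rather than C++. The function name letter_frequencies is made up for this example, and ord() stands in for the implicit char-to-int conversion that C++ performs. Like the original, it assumes the letters A-Z have consecutive codes.

```python
def letter_frequencies(text):
    # One counter per letter A-Z, all starting at zero.
    frequency = [0] * 26
    for ch in text:
        # Only plain ASCII letters get a slot; everything else is skipped.
        if ch.isascii() and ch.isalpha():
            # Distance from 'A': 'A' -> 0, 'b' -> 1, ..., 'Z' -> 25.
            index = ord(ch.upper()) - ord('A')
            frequency[index] += 1
    return frequency

counts = letter_frequencies("Hello, World")
for index, count in enumerate(counts):
    if count:
        # Adding the index back to 'A' recovers the letter.
        print(chr(ord('A') + index), count)
```

The guard on ASCII letters is an addition of this sketch; without it, punctuation or accented letters would produce an index outside 0..25.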

If it's an array (e.g. int frequency[26];) then we don't add to array - it is already there, but with a value of zero.
The ++ operator is short hand for add one to the thing, so
frequency[toupper(new_letter) - 'A']++;
is the same as:
frequency[toupper(new_letter) - 'A'] = frequency[toupper(new_letter) - 'A'] + 1;
Obviously, the short hand version is much easier to read, as there is much less repetition that has to be carefully checked that it's the same on both sides, etc.
The index is toupper(new_letter) - 'A'. This works by first making any letter uppercase, so we don't care whether it's 'a' or 'A', 'c' or 'C', etc., and then subtracting the value of the first letter in the alphabet, 'A'. This means that if new_letter is 'A' the index is zero. If new_letter is 'G' we use index 6, etc. [This assumes that all the letters are sequential, which isn't absolutely certain. And if we talk about languages other than English that have, for example, ä, ǹ, Ë or ê as part of the language, those letters definitely do not follow A-Z.]
If you were to count the number of letters in a piece of text by hand, you could just list all the letters A-Z along the edge of the paper, and then put a dot next to each letter as you read them in the text, and then count the number of dots. This does the same sort of thing, except it keeps each count running as you go along.

Related

Regex: How to match a range of characters except another range

I'm trying to create a regex filter to satisfy:
1) The 1st character should be a lower-case letter or a number
2) The rest of the characters should be a single character between index 32 and 126
3) However, none of the characters should be upper case letters or _
My current regex is:
^[a-z0-9][ -~]*$
This solves 1) and 2) above - but I struggle to include 3) above in the right way. Any help is appreciated.
A simple way is to add a negative lookahead for what you don't want.
^[a-z0-9](?!.*[A-Z_])[ -~]*$
But it's also possible to just split up the ranges, based on the ascii-table
^[a-z0-9][ -@\[-^`-~]*$
It's just a bit less easy to understand at a first glance.
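To see where the split ranges come from, and to sanity-check the lookahead version, here is a small sketch using Python's re module. The helper names contiguous_runs and is_valid and the sample strings are made up for this illustration. It derives the allowed ranges from the three rules directly instead of reading them off an ASCII table.

```python
import re
import string

# The lookahead pattern from the answer above.
lookahead = re.compile(r"^[a-z0-9](?!.*[A-Z_])[ -~]*$")

# Rules 2 and 3 combined: codes 32..126 minus A-Z and underscore.
allowed_codes = [c for c in range(32, 127)
                 if chr(c) not in string.ascii_uppercase + "_"]

def contiguous_runs(codes):
    """Group sorted code points into (first_char, last_char) runs."""
    runs = []
    start = prev = codes[0]
    for c in codes[1:]:
        if c != prev + 1:
            runs.append((chr(start), chr(prev)))
            start = c
        prev = c
    runs.append((chr(start), chr(prev)))
    return runs

def is_valid(s):
    """Reference check written directly from the three rules."""
    first_ok = s[:1] != "" and s[0] in string.ascii_lowercase + string.digits
    return first_ok and all(ord(ch) in allowed_codes for ch in s)

runs = contiguous_runs(allowed_codes)
samples = ["abc 123", "a_b", "aBc", "9 lives!", "Abc", "", "a~`[]^"]
results = [(s, bool(lookahead.match(s)), is_valid(s)) for s in samples]
print(runs)
```

The runs come out as space through @, [ through ^, and backtick through ~, which are the three ranges a single character class needs once A-Z and _ are cut out.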

Every other letter

So, I have tried this problem for what it seems like a hundred times this week alone.
It's filling in the blank for the following program...
You entered jackson and ville.
When these are combined, it makes jacksonville.
Taking every other letter gives us jcsnil.
The blanks I have filled are fine, but the rest of the blanks, I can't figure out. Here they are.
x = raw_input("Enter a word: ")
y = raw_input("Enter another word: ")
print("You entered %s and %s." % (x,y))
combined = x + y
print("When these are combined, it makes %s." % combined)
every_other = ""
counter = 0
for __________________ :
    if ___________________ :
        every_other = every_other + letter
    ____________
print("Taking every other letter gives us %s." % every_other)
I just need the three blanks for this program. This is basic Python, so it shouldn't be anything too complicated, just something I can match with the twenty options. Please, I appreciate your help!
The first blank needs to define letter so that each time through the loop it is the letter at position counter in combined
The second blank needs to test for the current letter's position being one that gets included
The last blank needs to modify counter for the next value of letter (much as the initial value of counter was for the first letter).
The solution is to slice with a step value.
In [10]: "jacksonville"[::2]
Out[10]: 'jcsnil'
The slice notation means "take the subset starting at the beginning of the iterable, ending at the end of the iterable, selecting every second element". Remember that Python slices start by selecting the first element available in the slice.
EDIT: Didn't realize it had to fill in the blanks
for letter in combined:
    if (counter % 2) == 0:
        every_other = every_other + letter
    counter += 1
Since taking every other would mean you take every second letter, or every second pass through the loop, and you use counter to track how many passes you've made, you can use modulo division (%) to check when to take a letter. The base case is that 0 % 2 = 0, which lets you take the first letter. It's important to remember to always increment the counter.
A way to do this without the manual counter, which was already mentioned in the comments, is to use the enumerate function on combined. When given an iterable as a parameter, enumerate returns an iterator that yields a pair on each request: the position in the iterable and the value at that position.
I'm using this language, talking about iterables as index-able sequences, but it could be any generator-like object which doesn't have to have a finite, pre-defined sequence.
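For reference, the enumerate variant could look like the sketch below. It is written for Python 3 (the question uses Python 2's raw_input), and the function name every_other_letter is only for illustration.

```python
def every_other_letter(combined):
    every_other = ""
    # enumerate supplies the position, so no manual counter is needed.
    for position, letter in enumerate(combined):
        if position % 2 == 0:
            every_other = every_other + letter
    return every_other

result = every_other_letter("jackson" + "ville")
print(result)
```

This gives the same string as the slice "jacksonville"[::2].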

How to decrement a character value alphabetically in C++

Is there a way to decrement a character value alphabetically in C++?
For example, changing a variable containing
'b' to the value 'a' or a variable containing
'd' to the value 'c' ?
I tried looking at character sequence but couldn't find anything useful.
Characters are essentially one byte integers (although the representation may vary between compilers). While there are many encodings which map integer values to characters, almost all of them map 'a' to 'z' characters in successive numerical order. So, if you wanted to change the string "aaab" to "aaaa" you could do something like the following:
char letters [4] = {'a','a','a','b'};
letters[3]--;
Alphabet characters are part of the ASCII character table. 65 is uppercase letter A, and 32 positions later, at 97, is lowercase letter a. (Letters B through Z and b through z are 66 through 90 and 98 through 122, respectively.) The designers of ASCII placed the two cases 32 apart rather than 26 (the number of letters in the alphabet) so that simple bit manipulation can change lowercase to uppercase (and vice versa) or ignore case altogether, by ignoring the bit with value 32 (0010 0000).
This way, for example, character 84 on the ASCII chart, which represents the letter T, is represented with the bits 0101 0100. Lowercase t is 116, which is 0111 0100. When ignoring the case, the 1 in the bit with value 32 (6th position from the right) is ignored. You can see all the other bits are exactly the same for uppercase and lowercase. This makes case conversion and case-insensitive comparison cheap for the computer.
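The case bit can be seen in a few lines. This is a sketch in Python rather than C++, with made-up helper names, and it only makes sense for ASCII letters.

```python
CASE_BIT = 0b0010_0000  # the bit with value 32

def toggle_case(ch):
    # Flipping the case bit turns 'T' into 't' and back again.
    return chr(ord(ch) ^ CASE_BIT)

def same_letter_ignoring_case(a, b):
    # Forcing the case bit on in both codes removes the case difference.
    return (ord(a) | CASE_BIT) == (ord(b) | CASE_BIT)

upper_bits = format(ord("T"), "08b")
lower_bits = format(ord("t"), "08b")
print(upper_bits, lower_bits, toggle_case("T"))
```

Applied to anything other than a letter, these helpers give meaningless results, so real code would check the input first.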
To decrement, just convert the character to its ASCII integer value, subtract 1, and convert that integer back into a character. Be careful when you have an 'A' though (or 'a'), as that's a special case: there is no letter before it.
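The same steps as a sketch in Python, where ord and chr make the conversions explicit. The name previous_letter is made up, and raising an error for 'a' and 'A' is just one possible way to handle the special case; wrapping around to 'z' would be another.

```python
def previous_letter(ch):
    # Assumes ch is a single ASCII letter.
    if ch in ("a", "A"):
        # Special case: nothing comes before the first letter.
        raise ValueError("no letter comes before " + repr(ch))
    # Character -> integer code, minus one, back to a character.
    return chr(ord(ch) - 1)

print(previous_letter("b"), previous_letter("d"))
```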

Linear time construction of a data structure to answer in constant time if two strings have 2 common letters

Problem statement:
Given a string which contains words separated by blanks(spaces) it's required to construct a data structure in linear time (O(n) where n is the number of characters in the string) that can answer in constant time (O(1)) if two words in that string have two common characters.
Any ideas on what that data structure would look like?
Thanks.
Your question is ambiguous.
What is the definition of a word?
Exactly two characters in common or at least two characters in common?
Do you just need to answer if a) there exists two words in the string which have two common characters or b) you will be given some pairs of words and need to answer if those words have two common characters for each pair?
In case of (b) of point 3, what is the format of input? Will you be given whole word as input or just the indices of the words in the string?
I'm assuming a word consists only of alphabetical characters (a-zA-Z) and that you will be given arbitrary pairs of words by their indices in the input string.
Define an array f such that (f[a],...,f[z]) = (0,...,25) and (f[A],...,f[Z]) = (26,...,51).
Let the input string S consist of words W[0], W[1], ..., W[n-1]. Define a map from a word to an integer by F(W) = OR(1 << f[c]) over all characters c in word W.
To find out whether two words W1, W2 have (assuming at least) two characters in common, we just need to check whether B = F(W1) & F(W2) has at least two set bits. You can either loop over all bits of B, which is still O(1) because B has at most 52 bits, or check that B && (B & (B-1)) is true. (Explanation: B & (B-1) clears the lowest set bit of B, so if the result is non-zero, B must have at least two set bits.)
Now you can precompute F(w) for each word in O(|S|) and then output for each query in O(1) by comparing the F-values.
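A minimal sketch of this scheme in Python, under the same assumptions (words contain only a-zA-Z and queries give word indices). The function names are made up for the example.

```python
def letter_bit(c):
    # f[a..z] = 0..25 and f[A..Z] = 26..51, as defined above.
    if "a" <= c <= "z":
        return ord(c) - ord("a")
    return 26 + ord(c) - ord("A")

def build_masks(s):
    # One pass over the string: F(W) for every word W, O(|S|) in total.
    masks = []
    for word in s.split():
        mask = 0
        for c in word:
            mask |= 1 << letter_bit(c)
        masks.append(mask)
    return masks

def share_two_letters(masks, i, j):
    # B has at least two set bits iff B and B & (B - 1) are both non-zero.
    b = masks[i] & masks[j]
    return b != 0 and (b & (b - 1)) != 0

masks = build_masks("hello world help cat")
print(share_two_letters(masks, 0, 1), share_two_letters(masks, 1, 2))
```

Here "hello" and "world" share l and o, while "world" and "help" share only l.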

Constructing Strings using Regular Expressions and Boolean logic ||

How do I construct strings with exactly one occurrence of 111 from a set E* consisting of all possible combinations of elements in the set {0,1}?
You can generate the set of strings with the following steps:
Some chunks of digits and their legal positions are enumerated:
Start: 110
Must have one, anywhere: 111
Anywhere: 0, 010, 0110
End: 011
Depending on the length of the target string (the length must be at least 3):
Condition 1: Length = 3 : {111}
Condition 2: 6 > Length > 3 : (Length-3) = 1x + 3y + 4z
For example, if Length - 3 is 5 (a target length of 8), the solutions include (2,1,0) and (1,0,1)
(2,1,0) -> two '0' and one '010' -> ^0^010^ or ^010^0^ (111 can be placed in any one place marked as ^)
(1,0,1) -> one '0' and one '0110' ...
Condition 3: If 9 > Length > 6, you should consider the solution of two formulas:
Comments:
length – 3 : the length exclude 111
x: the times 0 occurred
y: the times 010 occurred
z: the times 0110 occurred
Finding all solutions {(x,y,z) | 1x + 3y + 4z = (Length - 3)} ----(1)
For each solution, you can generate one or more qualified strings. For example, one solution of (x,y,z) for Length - 3 = 10 (a target length of 13) is (0,2,1), which means '010' should occur twice and '0110' should occur once. Based on this solution, the following blocks are used:
0: 0 times
010: 2 times
0110: 1 time
111: 1 time (must have)
Finding the permutations of elements above.
010-0110-010-111 or 111-010-010-0110 …
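Formula (1) and the permutation step can be tried out with a short sketch. It is in Python, the function names are made up, and it uses the (0,2,1) solution, whose blocks together with 111 add up to 13 characters.

```python
from itertools import permutations

def block_counts(remaining):
    """All (x, y, z) with 1*x + 3*y + 4*z == remaining and x, y, z >= 0."""
    solutions = []
    for z in range(remaining // 4 + 1):
        for y in range((remaining - 4 * z) // 3 + 1):
            x = remaining - 4 * z - 3 * y
            solutions.append((x, y, z))
    return solutions

def count_111(s):
    """Count occurrences of '111', overlapping ones included."""
    return sum(1 for i in range(len(s) - 2) if s[i:i + 3] == "111")

def arrangements(x, y, z):
    """Distinct strings from x '0' blocks, y '010' blocks, z '0110' blocks and one '111'."""
    blocks = ["0"] * x + ["010"] * y + ["0110"] * z + ["111"]
    return sorted({"".join(p) for p in permutations(blocks)})

solutions = block_counts(10)
strings = arrangements(0, 2, 1)
print(len(solutions), strings[0])
```

Because every filler block starts and ends with 0, the single 111 can never merge with a neighbouring run of 1s, which is why each arrangement contains it exactly once.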
(Length - 6) = 1x + 3y + 4z ---(2)
As in the case above, find all permutations to form an intermediate string.
Finally, for each intermediate string Istr, Istr + 011 or 110 + Istr are both qualified.
For example, (10-6) = 1*0 + 3*0 + 4*1 or = 1*1 + 3*1 + 4*0
The intermediate string can be composed by one '0110' for answer(0,0,1):
Then ^0110^011 and 110^0110^ are qualified strings (111 can be placed in any one place marked as ^)
Or the intermediate string can also be composed by one '0' and one '010' for answer (1,1,0)
The intermediate string can be 0 010 or 010 0
Then ^0010^011 and 110^0100^ are qualified strings (111 can be placed in any one place marked as ^)
Condition 4: If Length > 9, an additional formula should be considered:
(Length - 9) = 1x + 3y + 4z
As in the case above, find all permutations to form an intermediate string.
Finally, for each intermediate string Istr, 110 + Istr + 011 is qualified.
Explanation:
The logic I use is based on combinatorics. A target string is viewed as a concatenation of one or more substrings. To fulfill the constraint ('111' appears exactly once in the target string), we set criteria on the substrings. '111' is definitely one substring, and it can be used only once. The other substrings must not violate the '111'-exactly-once constraint, yet must be general enough to generate all possible target strings.
Apart from the single '111', the other substrings must not contain more than two adjacent '1's, because a substring such as '111', '1111' or '11111' would contain an unwanted '111'. However, the substrings '1' and '11' alone cannot guarantee the constraint either: '1'+'11', '11'+'11' or '1'+'1'+'1' all contain an unwanted '111'. To prevent this, we add '0' on both sides to stop runs of adjacent '1's. That results in three qualified substrings: '0', '010' and '0110'. Any string built from these three substrings contains no '111' at all.
These three qualified substrings can be placed anywhere in the target string, since they guarantee no additional '111' in the target string.
If a substring is at the start or end of the target string, it needs only one '0' to prevent '111'.
In start:
10xxxxxxxxxxxxxxxxxxxxxxxxxxxx
110xxxxxxxxxxxxxxxxxxxxxxxxxxx
In end:
xxxxxxxxxxxxxxxxxxxxxxxxxxx011
xxxxxxxxxxxxxxxxxxxxxxxxxxxx01
These two cases can also ensure no additional '111'.
Based on the logic above, we can generate a target string of any length with exactly one '111'.
Your question could be clearer.
For one thing, does "1111" contain one occurrence of "111" or two?
If it counts as two (overlapping occurrences are counted separately), you want all strings that contain "111" but do not contain either "1111" or "111.*111". If it counts as one, omit the test for "1111".
If I understand you correctly, you're trying to construct an infinite subset of the infinite set of sequences of 0s and 1s. How you do that is probably going to depend on the language you're using (most languages don't have a way of representing infinite sets).
My best guess is that you want to generate a sequence of all sequences of 0s and 1s (which shouldn't be too hard) and select the ones that meet your criteria.
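For a fixed length, that generate-and-filter idea might look like the sketch below. It is in Python, the function names are made up, and it assumes overlapping occurrences are counted separately, so "1111" counts as two occurrences and is rejected.

```python
from itertools import product

def count_111(s):
    """Count occurrences of '111', overlapping ones included."""
    return sum(1 for i in range(len(s) - 2) if s[i:i + 3] == "111")

def strings_with_one_111(length):
    """All 0/1 strings of the given length containing '111' exactly once."""
    candidates = ("".join(bits) for bits in product("01", repeat=length))
    return [s for s in candidates if count_111(s) == 1]

for n in range(3, 6):
    print(n, strings_with_one_111(n))
```

This is exponential in the length, so it is only practical for short strings, but it is handy for checking a cleverer construction against.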