What do the numbers in square brackets mean beside the TAoCP exercises?

Here is an example:
[00] The binary form of 2009...
[05] Which of the letters...
[10] Four-bit quantities -- half-bytes, or hexadecimal digits...
[15] A kilobyte...
[M13] If x is any string of 0s and 1s...
[M20] Prove or disprove...
What do the [00], [05], [10], [15], [M13], [M20] mean?
I have tried:
Googling taocp exercises square brackets
Looking for a pattern in the square-bracketed numbers:
they both increase and decrease
they are mostly, but not all, multiples of five
the ones with an M appear every now and then
M is the only prefix
the codes are non-unique
Googling "the art of computer programming" exercises brackets
Googling "the art of computer programming" M13
Viewing http://dl.acm.org/citation.cfm?id=1312683, which indicates M means milestone.
Googling "the art of computer programming" [00]
Looking for an appendix in the book that explains them
Considering the > that appears beside some questions too
No luck!

On page xvii of the text there is a summary of the notation used with the exercises:
► = Recommended
M = Mathematically oriented
HM = Requiring "higher math"
00 = Immediate
10 = Simple (one minute)
20 = Medium (quarter hour)
30 = Moderately hard
40 = Term project
50 = Research problem
It's meant to be a roughly logarithmic scale.
Furthermore, "The remainder of the rating number divided by five indicates the amount of detailed work required. Thus, an exercise rated 24 may take longer to solve than an exercise that is rated 25, but the latter will require more creativity" (p. xvi, "Notes on the Exercises").

I think this is mentioned in the introduction to the book somewhere (my copy is in my office now). If I remember correctly, the numbers indicate difficulty, with numbers beginning with 0 being warm-up questions, numbers beginning with 3 indicating problem-set level questions, numbers beginning with 4 indicating very hard problems, and 50 meaning extremely hard (possibly open) questions.
The M means "math," as in "you'll need some tricky math here." The HM means "higher math," meaning "you'll need math beyond what we've covered here to solve this problem."
Hope this helps!

Regex improvements for international, common and RFC 3966 phone number validation?

Context
Hi, earlier I was browsing the web in order to find a quick answer about telephone number validation in one regex formula: for emergency, short, international, French, Spanish and North American numbers (normal, fancy and extended versions).
Strangely, I couldn't find anything better than "A comprehensive regex for phone number formula", since it seems to be the best topic about this, or I missed it, which is totally possible.
So I'm new to the site and actually writing this very first question (yeah!), since that other thread is currently on hold of some sort: it seems the author didn't get what he and I were seeking.
That makes at least three of us who would like to have a good solution; I know at least my pal, the one who first asked me about finding one to be used in simple integrations like his Google Forms.
Hence my current question(s) and my own answer to begin with, since I took some night time to build my own based on advice and test patterns from the best replies on the other thread. If you're interested in the topic, there are some interesting elements there.
Questions
What is the best way to optimize and improve this regex (without resorting to coding), which is dedicated to validating international and most national phone numbers (following the recommendations of RFC 3966 at least)?
Not sure if I can add a related question as well (since it still serves the purpose of improving the usefulness of the regex pattern); no harm in asking, I guess.
Are there other commonly-used formats that this regex should match (and not match)?
If you can add them (or a link) here for me to update my test bundles, I would be thankful. Equally useful would be phone numbers that should definitely not be validated (the unwanted ones).
My initial solution
My current regex solution (version 4) on Regular Expressions 101
An earlier version was matching results despite leading and trailing whitespace, which is not that useful to the point (a bit too fancy for the execution time).
The latest version at the time of writing took into consideration the other posts on the subject, RFC 3966 (from the IETF standards) and the Wikipedia article on "National conventions for writing telephone numbers".
Another potential side dish is isolating matching groups for country code, area code and extended code... and things work relatively dandy up to a certain point: it only works well when there are some separators (or the parentheses) to distinguish those groups of digits.
Matching goals
Emergency and short numbers: 112 or 911
Spanish international: +34 987 654 321
French extended: +33 (0)1 23 45 67 89
French national: 01 23 45 67 89
American extended: 001-(123)-456-7890 ext-4321
German (Microsoft style): +49 (1234) 567890
Mexican national: (01 55) 1234 5678
Hypothetical international number (max length?): 00321-(4321)-567.89 ext-4321
Another matching goal is to have a regex that does not underperform too much; I'm not really picky, since it is not to be used in critical parts of code.
Still, how could we optimize the best regex(es) people find/propose without changing their results?
Goals from the main thread
+1(234)/567.8901 x1234 and the like (with different permutations of separators: ., /, - and horizontal whitespace).
2345678901: the same US number dialed within the States, I guess.
Not sure how that should work, since I thought that + (or its equivalent, the double zero 00) was required in front of any international number... I've always done it that way. The other thread had a list of positive matches without it.
Could someone confirm that + or 00 is not mandatory for US numbers? Thank you again.
Best of unwanted formats
12(34567890 and 123)456789012345: unmatched parenthesis.
)123(34567890: parentheses are wrongly matched.
++34123456789: double + is a typo.
+9-123/456.7890 x12345: ext has 4 digits tops.
1-234-567-8901: missing 00 or + at the beginning of an international number.
1234 to 12345678: not a short number, yet not a normal one (between 9 and 12 digits? as far as I know).
1234567890123: over max length (since it has no international prefix).
0012312345678901: over max length (as an international number).
Regex101.com was a big plus for rewriting and testing the regex to this point; I couldn't have progressed so far without its help. Yet, I'm no expert, so I can only scratch the surface here, and I need your help to improve this.
Thank you for reading; it was very educational to write this question (but not something I would do every day, it's very time-consuming at my pace), and I hope it will find its answers as well. Have a nice day (or night... ;) ).
Before I forget, here's the latest version of the regex I put together:
^(?=(?:\+|0{2})?(?:(?:[\(\-\)\.\/ \t\f]*\d){7,10})?(?:[\-\.\/ \t\f]?\d{2,3})(?:[\-\s]?[ext]{1,3}[\-\.\/ \t\f]?\d{1,4})?$)((?:\+|0{2})\d{0,3})?(?:[\-\.\/ \t\f]?)(\(0\d[ ]?\d{0,4}\)|\(\d{0,4}\)|\d{0,4})(?:[\-\.\/ \t\f]{0,2}\d){3,8}(?:[\-\s]?(?:x|ext)[\-\t\f ]?(\d{1,4}))?$
As far as I know, it passes the tests I put in the question and some more that I added on that Regex101.com page. You can even fork it, a very useful feature indeed; I'm a new fan. :)
The code seems to work, as is, with PHP (PCRE), Python and JavaScript (but not Golang), with performance that is not awesome but good enough for our purpose.
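To sanity-check that claim in Python, here is a minimal harness (my own addition, not part of the original post) that runs the version-4 pattern above against a few of the matching goals; whether every sample passes depends on the regex version you paste in:

import re

# Version-4 pattern quoted verbatim from the post above.
PATTERN = re.compile(r"^(?=(?:\+|0{2})?(?:(?:[\(\-\)\.\/ \t\f]*\d){7,10})?(?:[\-\.\/ \t\f]?\d{2,3})(?:[\-\s]?[ext]{1,3}[\-\.\/ \t\f]?\d{1,4})?$)((?:\+|0{2})\d{0,3})?(?:[\-\.\/ \t\f]?)(\(0\d[ ]?\d{0,4}\)|\(\d{0,4}\)|\d{0,4})(?:[\-\.\/ \t\f]{0,2}\d){3,8}(?:[\-\s]?(?:x|ext)[\-\t\f ]?(\d{1,4}))?$")

samples = [
    "112", "911",                   # emergency and short numbers
    "+34 987 654 321",              # Spanish international
    "+33 (0)1 23 45 67 89",         # French extended
    "01 23 45 67 89",               # French national
    "001-(123)-456-7890 ext-4321",  # American extended
]

for number in samples:
    print(number, "->", "match" if PATTERN.match(number) else "no match")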
For instance, I wanted to use \h for horizontal whitespace (instead of \t, \f and space), but it is less compatible across the different platforms.
It still needs a lot of improvements, and I'm eager to see what you will be cooking up to answer this little problem of ours, but I'm spent... it's already a sunny morning here. Good night folks.

What happens if the error-correcting code word is corrupted in RS code

I have a message which has been encoded with a Reed-Solomon code, so I now have the entire data, which is message + code word. During transfer, if there is a change in the message part, then it is possible to decode; but what if the code word itself got corrupted/changed? Is it possible to correct the code word as well? If it can be corrected, how do I do it? Or will the Reed-Solomon error correcting code itself take care of correcting the corrupted code word?
I'm a bit confused. I hope I'll get a relevant answer.
Thanks in advance.
There is an issue with the terminology in the question. A code word consists of the message data plus the parity (redundancy) data (what the question calls the code word). In addition, the term code word usually means one with no errors (in either the message part or the parity part): one that is an exact multiple (polynomial multiplication with finite-field coefficients) of the generator polynomial.
During correction, it doesn't matter if the errors are located in the message data or the parity data. As long as the total number of symbols in error is less than or equal to 1/2 times the number of parity symbols, the errors are correctable. There are two types of Reed-Solomon codes, the "original view" and the "BCH view", with most implementations being "BCH view".
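To see this in practice, here is a minimal sketch using the third-party Python reedsolo package (my choice of library, not something from the question; the exact return shape of decode varies between reedsolo versions), corrupting one byte in the message part and one in the parity part:

# pip install reedsolo
from reedsolo import RSCodec

rsc = RSCodec(10)                 # 10 parity symbols -> corrects up to 5 symbol errors
encoded = bytearray(rsc.encode(b"hello world"))

encoded[0] ^= 0xFF                # corrupt a byte in the message part
encoded[-1] ^= 0xFF               # corrupt a byte in the parity part

decoded = rsc.decode(encoded)     # recent versions return a tuple; older ones just the message
print(decoded)                    # b"hello world" is recovered either way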
The wiki article may help you understand.
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction
In this "BCH view" section of the wiki article:
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Systematic_encoding_procedure
the message is treated as a polynomial p(x) multiplied by x^t to make space for t parity symbols; the remainder is sr(x) = p(x) x^t mod g(x), where g(x) is the generator polynomial. The code word is s(x) = p(x) x^t - sr(x). If using a binary-based field for Reed-Solomon, then addition and subtraction are both exclusive-or.
Note that in other articles about Reed-Solomon, t usually represents the number of symbols that can be corrected, and the number of parity symbols would then be 2t (or 2t+1 for an odd number of parity symbols). The wiki article uses lambda (Λ) for the coefficients of the locator polynomial, while other articles use sigma (σ) for the same thing.

Removing specific numbers using regex in notepad++

To illustrate my problem with an example, in the following paragraph,
1.If we may believe the Egyptians, Hephaestus was the son of the Nile, and with him philosophy began, priests and prophets being its chief exponents. 2. Hephaestus lived 48,863 years before Alexander of Macedon, and in the interval there occurred 373 solar and 832 lunar eclipses. The date of the Magians, beginning with Zoroaster the Persian, was 5000 years before the fall of Troy, as given by Hermodorus the Platonist in his work on mathematics; but Xanthus the Lydian reckons 6000 years from Zoroaster to the expedition of Xerxes, and after that event he places a long line of Magians in succession, bearing the names of Ostanas, Astrampsychos, Gobryas, and Pazatas, down to the conquest of Persia by Alexander,
I want to remove the "1." and "2.", but not the "373", "832", or any of the other numbers. The document contains much more than just this example, so just removing single-digit numbers won't work. I assume this is fairly straightforward, but I'm new to using regex and I've been finding it difficult.
Find \d+\. and replace with an empty string. (If the document also contains digits followed by a period elsewhere, such as at the end of a sentence, tighten the pattern, e.g. by anchoring it to the start of a line or a preceding space.)
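Outside Notepad++, the same replacement can be sanity-checked with a quick script (a sketch; the sample text is abridged from the question and the trailing \s* is my own addition to eat the space after "2."):

import re

text = "1.If we may believe the Egyptians, philosophy began with Hephaestus. 2. Hephaestus lived 48,863 years before Alexander, and there occurred 373 solar and 832 lunar eclipses."

# \d+\. matches one or more digits followed by a literal period
cleaned = re.sub(r"\d+\.\s*", "", text)
print(cleaned)  # "48,863", "373" and "832" survive: none of them is followed by a period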

Regex for binary multiple of 3

I would like to know how I can construct a regex to tell whether a number in base 2 (binary) is a multiple of 3. I had read the thread Check if a number is divisible by 3, but they don't do it with a regex, and the graph someone drew there is wrong (it doesn't accept even numbers). I have tried ((1+)(0*)(1+))(0), but it doesn't work for some values. Hope you can help me.
UPDATE:
Ok, thanks all for your help. Now I know how to draw the NFA; here is the graph and the regular expression:
In the graph, the states are the number in base 10 mod 3.
For example: to go to state 1 you have to read a 1; then you can add 1 or 0. If you add 1, you would have 11 (3 in base 10), and this number mod 3 is 0, so you draw the arc to state 0.
((0*)((11)*)((1((00)*)1)*)(101*(0|((00)*1*)*0)1)*(1(000)+1*01)*)*
And the other regex works too, but this one is shorter.
Thanks a lot :)
I know this is an old question, but an efficient answer is yet to be given and this question pops up first for "binary divisible by 3 regex" on Google.
Based on the DFA proposed by the author, a ridiculously short regex can be generated by simplifying the routes a binary string can take through the DFA.
The simplest one, using only state A, is:
0*
Including state B:
0*(11)*0*
Including state C:
0*(1(01*0)*1)*0*
And including the fact that, after going back to state A, the whole process can start again:
0*((1(01*0)*1)*0*)*
Using some basic regex rules, this simplifies to
(1(01*0)*1|0)*
Have a nice day.
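A quick brute-force check of that final pattern (my own addition, not part of the original answer) compares it against n % 3 for the first few thousand integers:

import re

DIV3 = re.compile(r"(1(01*0)*1|0)*")

for n in range(5000):
    bits = bin(n)[2:]  # binary representation without the "0b" prefix
    assert bool(DIV3.fullmatch(bits)) == (n % 3 == 0), n

print("regex agrees with n % 3 for all n < 5000")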
If I may plug my solution to this code golf question: it's a piece of JavaScript that generates regexes (probably inefficiently, but it does the job) for divisibility in each base.
This is what it generates for divisibility by 3 in base 2:
/^((((0+)?1)(10*1)*0)(0(10*1)*0|1)*(0(10*1)*(1(0+)?))|(((0+)?1)(10*1)*(1(0+)?)|(0(0+)?)))$/
Edit: compared to Asmor's, it's probably very inefficient :)
Edit 2: Also, this is a duplicate of this question.
For someone who is learning and searching how to do this:
see this video:
https://www.youtube.com/watch?v=SmT1DXLl3f4&t=138s
Write the state equations and solve them with Arden's theorem.
The way I did it is visible in the image; the result is the same as pointed out by user Kert Ojasoo. I hope I did it correctly, because I spent 2 days solving it...
n+2n = 3n. Thus, 2 adjacent bits set to 1 denote a multiple of 3. If there are an odd number of adjacent 1s, that would not be 3.
So I'd propose this regex:
(0*(11)?)+

Similar String algorithm

I'm looking for an algorithm, or at least a theory of operation, for how you would find similar text in two or more different strings...
Much like the question posed here: Algorithm to find articles with similar text, the difference being that my text strings will only ever be a handful of words.
Like say I have a string:
"Into the clear blue sky"
and I'm doing a compare with the following two strings:
"The color is sky blue" and
"In the blue clear sky"
I'm looking for an algorithm that can be used to match the text in the two, and decide on how closely they match. In my case, spelling and punctuation are going to be important. I don't want them to affect the ability to discover the real text. In the above example, if the color reference is stored as "sky-blue", I want it to still be able to match. However, the 3rd string listed should be a BETTER match than the second, etc.
I'm sure places like Google probably use something similar with the "Did you mean:" feature...
* EDIT *
In talking with a friend, I learned that he worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...
Here's the link to his paper, I hope it is helpful to those reading this question, and on the topic of similar string algorithms.
Levenshtein distance will not completely work, because you want to allow rearrangements. I think your best bet is going to be to find the best rearrangement, with Levenshtein distance as the cost for each word.
To find the cost of rearrangement, think of something like the pancake sorting problem. So you can permute every combination of words (filtering out exact matches) against every combination of the other string, trying to minimize a combination of permutation distance and Levenshtein distance on each word pair.
edit:
Now that I have a second, I can post a quick example (all 'best' guesses are by inspection, not from actually running the algorithms):
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the c_lear blue sky
The color is sky blue | is__ the colo_r blue sky
R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3
L_dist = (2D+S) + (I+D+S) (Total Substitutions: 2, deletions: 3, insertions: 1)
(notice all the flips include all elements in the range, and I use ranges where Xi - Xj = +/- 1)
Other example
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the clear blue sky
In the blue clear sky | In__ the clear blue sky
R_dist = dist( 1 2 4 3 5 ) --> 1 2 *3 4* 5 = 1
L_dist = (2D) (Total Substitutions: 0, deletions: 2, insertions: 0)
And to show all possible combinations of the three...
The color is sky blue | The colo_r is sky blue
In the blue clear sky | the c_lear in sky blue
R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total Subsitutions: 2, deletions: 1, insertion: 1)
However you construct the cost function, the second choice will have the lowest cost, which is what you expected!
One way to determine a measure of "overall similarity without respect to order" is to use some kind of compression-based distance. Basically, the way most compression algorithms (e.g. gzip) work is to scan along a string looking for string segments that have appeared earlier -- any time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier segment to use. You can use measures of how well two strings compress to detect similarities between them.
Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t:
len(comp(s)) + len(comp(t)) - len(comp(s . t))
where . is taken to be concatenation. The idea is that you are measuring how much further you can compress t by looking at s first. If s == t, then len(comp(s . t)) will be barely any larger than len(comp(s)) and you'll get a high score, while if they are completely different, len(comp(s . t)) will be very near len(comp(s)) + len(comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.
Actually the following formula is even better as it is symmetric (i.e. the score doesn't change depending on which string is s and which is t):
2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))
This technique has its roots in information theory.
Advantages: good compression algorithms are already available, so you don't need to do much coding, and they run in linear time (or nearly so) so they're fast. By contrast, solutions involving all permutations of words grow super-exponentially in the number of words (although admittedly that may not be a problem in your case as you say you know there will only be a handful of words).
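A minimal sketch of this scoring idea in Python, using zlib as the comp() function (my choice; any decent compressor would do). Note that for strings as short as the question's examples, the compressor's header overhead dominates, so the technique shows its strength on longer texts:

import zlib

def clen(s: bytes) -> int:
    # len(comp(s)) from the formula above
    return len(zlib.compress(s))

def similarity(s: bytes, t: bytes) -> int:
    # symmetric score: higher means more similar
    return 2 * (clen(s) + clen(t)) - clen(s + t) - clen(t + s)

print(similarity(b"Into the clear blue sky", b"In the blue clear sky"))
print(similarity(b"Into the clear blue sky", b"The color is sky blue"))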
One way (although this is perhaps better suited to a spellcheck-type algorithm) is the "edit distance", i.e., calculating how many edits it takes to transform one string into another. A common technique is described here:
http://en.wikipedia.org/wiki/Levenshtein_distance
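For reference, here is a textbook dynamic-programming implementation of the Levenshtein distance in Python (a sketch I'm adding; it is not taken from the linked article):

def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("clear", "color"))  # 3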
You might want to look into the algorithms used by biologists to compare DNA sequences, since they have to cope with many of the same things (chunks may be missing, or have been inserted, or just moved to a different position in the string).
The Smith-Waterman algorithm would be one example that'd probably work fairly well, although it might be too slow for your uses. Might give you a starting point, though.
I had a similar problem: I needed to get the percentage of characters in a string that were similar. It needed exact sequences, so for example "hello sir" and "sir hello", when compared, needed to give me five characters that are the same; in this case they would be the two "hello"s. It would then take the length of the longer of the two strings and give me a percentage of how similar they were. This is the code that I came up with:
#include <string>
using std::string;

int bigger(string a, string b); // forward declaration; defined below

// Dispatches so that the longer string is always the first argument.
int compare(string a, string b){
    return a.size() > b.size() ? bigger(a, b) : bigger(b, a);
}

// Slides b along a, finds the longest run of consecutive matching
// characters, and returns it as a percentage of the longer string's length.
int bigger(string a, string b){
    int maxcount = 0; // longest run of consecutive matches seen so far
    for(size_t i = 0; i < a.size(); ++i){
        int currentcount = 0; // current run at this offset
        for(size_t j = 0; j < b.size() && i + j < a.size(); ++j){ // bounds check: the original indexed past the end of a
            if(a[i + j] == b[j]){
                ++currentcount;
                if(currentcount > maxcount) // update here so a run ending at the string end is counted
                    maxcount = currentcount;
            }
            else{
                currentcount = 0;
            }
        }
    }
    return (int)(((float)maxcount / (float)a.size()) * 100);
}
I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But it is worth mentioning j_random_hacker's answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution.

In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein distance algorithm will work. But, not mentioned in the question was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well, but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question.

I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.
There's another way: pattern recognition using convolution. Image A is run through a Fourier transform, and so is image B. Superimposing F(A) over F(B) and then transforming the result back gives you a black image with a few white spots. Those spots indicate where A matches B strongly, and the total sum of the spots would indicate the overall similarity. I'm not sure how you'd run an FFT on strings, but I'm pretty sure it would work.
The difficulty would be to match the strings semantically.
You could generate some kind of value based on the lexical properties of the string: e.g. they both have "blue" and "sky", and they're in the same sentence, etc. But it won't handle cases like "Sky's jean is blue", or some other oddball English construction that uses the same words; for that you'd need to parse the English grammar...
To do anything beyond lexical similarity, you'd need to look at natural language processing, and there isn't going to be one single algorithm that will solve your problem.
Possible approach:
Construct a Dictionary with a string key of "word1|word2" for all combinations of words in the reference string. A single combination may happen multiple times, so the value of the Dictionary should be a list of numbers, each representing the distance between the words in the reference string.
When you do this, there will be duplication here: for every "word1|word2" dictionary entry, there will be a "word2|word1" entry with the same list of distance values, but negated.
For each combination of words in the comparison string (words 1 and 2, words 1 and 3, words 2 and 3, etc.), check the two keys (word1|word2 and word2|word1) in the reference string and find the closest value to the distance in the current string. Add the absolute value of the difference between the current distance and the closest distance to a counter.
If the closest reference distance between the words is in the opposite direction (word2|word1) as the comparison string, you may want to weight it smaller than if the closest value was in the same direction in both strings.
When you are finished, divide the sum by the square of the number of words in the comparison string.
This should provide some decimal value representing how closely each word/phrase matches some word/phrase in the original string.
Of course, if the original string is longer, it won't account for that, so it may be necessary to compute this both directions (using one as the reference, then the other) and average them.
I have absolutely no code for this, and I probably just re-invented a very crude wheel. YMMV.
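That said, here is a minimal sketch of the approach described above (my own illustration; the fixed penalty for pairs absent from the reference and the lack of direction-based weighting are assumptions the description leaves open):

from itertools import combinations

def pair_distances(words):
    # "word1|word2" -> list of signed position distances in the reference string;
    # the mirrored "word2|word1" key gets the negated distances, as described above
    table = {}
    for (i, w1), (j, w2) in combinations(enumerate(words), 2):
        table.setdefault(w1 + "|" + w2, []).append(j - i)
        table.setdefault(w2 + "|" + w1, []).append(i - j)
    return table

def similarity(reference, comparison):
    ref = pair_distances(reference.lower().split())
    words = comparison.lower().split()
    total = 0.0
    for (i, w1), (j, w2) in combinations(enumerate(words), 2):
        candidates = ref.get(w1 + "|" + w2)
        if candidates:
            # difference between the current distance and the closest reference distance
            total += min(abs((j - i) - c) for c in candidates)
        else:
            total += len(words)  # arbitrary penalty for a pair the reference never contains
    return total / (len(words) ** 2)  # lower score = closer match

print(similarity("Into the clear blue sky", "In the blue clear sky"))
print(similarity("Into the clear blue sky", "The color is sky blue"))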