I was given some Fortran code (90, I believe) and I'm trying to figure out what it does. I know no Fortran, but do know Perl.
Here is a snippet that I've not been able to figure out:
fmly='I:\CEX\Fmly'
fmlyfile=fmly(1:23)//yearqtr(qtrcnt)
open(unit=13,file=fmlyfile)
I know that // is a concatenation operator, but I'm confused about what the fmly(1:23) part is doing.
fmly(1:23) is slicing a character string fmly from position 1 to position 23. Note that in Fortran, string indexing begins from 1 and not from 0. fmly(1:23) is equivalent to fmly(:23).
string(A:B) is a substring, selecting characters A to B of the string named string. The literal assigned to fmly has fewer than 23 characters, so (assuming fmly is declared with a length of at least 23) the trailing characters will be blanks. The substring is then concatenated with an element of the string array yearqtr (or possibly the result of a string-valued function yearqtr).
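For readers more at home in Perl or Python, a rough analogue of what the snippet computes might look like the sketch below. It assumes fmly is a fixed-length character variable of length 23, so the shorter literal is padded with blanks; the yearqtr value '20011' is made up purely for illustration.

```python
# Rough Python analogue of the Fortran snippet (illustration only).
# Assumption: fmly is declared as a fixed-length string of 23 characters,
# so assigning a shorter literal pads it with trailing blanks.
FMLY_LEN = 23  # assumed declared length, not taken from the original code

fmly = r"I:\CEX\Fmly".ljust(FMLY_LEN)  # emulate fixed-length, blank-padded storage
yearqtr = "20011"                       # hypothetical value of yearqtr(qtrcnt)

# Fortran fmly(1:23) is 1-based and inclusive; Python's fmly[0:23] covers the same span.
fmlyfile = fmly[0:23] + yearqtr         # Fortran's // corresponds to + here
print(repr(fmlyfile))
```

The repr() output makes the padding blanks between the path and the appended suffix visible.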
Quite by accident, I wrote a call to findc() like the one below and submitted the program.
data test;
x=findc(,'abcde');
run;
I looked at the result and nothing seemed abnormal. As I glanced over the code, I noticed that the findc() call was missing its first character argument. I was amazed that such code would work.
I checked the help documentation:
The FINDC function allows character arguments to be null. Null arguments are treated as character strings that have a length of zero. Numeric arguments cannot be null.
What is this feature designed for? Fault tolerance or something more? Thanks for any hint.
PS: I find that findw() has the same behavior, but find() does not.
I suspect that allowing the argument to be not present at all is just an artifact of allowing the strings passed to it to be of zero length.
Normally in SAS strings are fixed length. So there was no such thing as an empty string, just one that was filled with spaces. If you use the TRIM() function on a string that only has spaces the result is a string with one space.
But when they introduced the TRIMN() and other functions like FINDC() and FINDW() they started allowing arguments to functions to be empty strings (if you want to store the result into a variable it will still be fixed length). But they did not modify the behavior of the existing functions like INDEX() or FIND().
For the FINDC() function you might want this functionality when using the TRIMN() function or the strip modifier.
Example use case might be to locate the first space in a string while ignoring the spaces used to pad the fixed length variable.
space = findc(trimn(string),' ');
Recently, in an interview, I was asked the following question: I have a string with a couple of billion characters in it, containing both ASCII and non-ASCII characters. The task was to remove all the non-ASCII characters so that the output contains only ASCII characters. The solution had to be a time-efficient algorithm.
I suggested two approaches:
Make an array of the ASCII characters. Loop over the string and check whether the current character is in the ASCII array. If it is, move on; otherwise replace it with null.
Obviously, it's not a time efficient solution.
Second, I suggested partitioning the string in half, then in half again, and so on, but I would still be checking each character against the ASCII array as in the first approach.
This conversation led to a discussion in which the interviewer was looking for a solution where we don't have to go character by character, and he suggested using regular expressions.
My question is: when we match a pattern using regular expressions, does the engine check the string character by character, or does it use some other approach? I was sure that regular expressions find/match character by character.
Can anyone please clear my doubt?
Thanks
You could use a range like this:
[\x20-\x7E]
This range matches every character from space to ~, i.e. the printable ASCII range.
Regular expressions do use optimisations when matching a sequence of characters: put simply, if you're looking for "XXXXXXX", you can test every 7th character and only look closer once you find an X there. Here, however, you need to filter every single character, so a regular expression would not be more efficient (and would in fact be less efficient, because you would need to go in and out of the regexp engine to process each match).
Instead, the efficient method (assuming C-like architecture) would be to start with two indices (source and result) at zero, and process the string: if the character has the high-bit clear, it's ASCII: copy from source to result, increment both indices. If the high-bit is set, it's non-ASCII: just increment source index.
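#include <stddef.h>

/* Remove every byte with the high bit set, compacting the string in place. */
void removeNonAscii(char *str) {
    size_t s, r;  /* size_t: an int index could overflow on a multi-gigabyte string */
    for (s = 0, r = 0; str[s]; s++) {
        if (!((unsigned char)str[s] & 0x80)) {
            str[r++] = str[s];  /* high bit clear: keep the ASCII byte */
        }
    }
    str[r] = '\0';
}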
(or you can make a non-destructive one, by copying into a new string instead of overwriting the current one; the algorithm is the same.)
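As a hedged illustration of the same idea outside C, the sketch below shows three ways to drop bytes with the high bit set in Python. It assumes the data is a byte string in an encoding such as UTF-8, where every non-ASCII character consists only of bytes 0x80 to 0xFF; the function names are made up for this example. All three variants still examine every byte; they differ only in where the loop runs.

```python
import re

def strip_non_ascii_two_index(buf: bytearray) -> None:
    """In-place compaction, mirroring the two-index C routine."""
    r = 0
    for s in range(len(buf)):
        if buf[s] < 0x80:      # high bit clear: ASCII, keep it
            buf[r] = buf[s]
            r += 1
    del buf[r:]                # truncate to the kept prefix

# Bytes 0x80..0xFF are exactly the ones with the high bit set.
_HIGH_BYTES = bytes(range(0x80, 0x100))

def strip_non_ascii_translate(data: bytes) -> bytes:
    """Single pass inside the interpreter's C code via bytes.translate."""
    return data.translate(None, _HIGH_BYTES)

def strip_non_ascii_regex(data: bytes) -> bytes:
    """Regex equivalent; the engine still has to look at every byte."""
    return re.sub(rb"[\x80-\xff]+", b"", data)

sample = "naïve café, 100% ASCII?".encode("utf-8")
buf = bytearray(sample)
strip_non_ascii_two_index(buf)
print(bytes(buf))
print(strip_non_ascii_translate(sample) == strip_non_ascii_regex(sample))
```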
Assume that I get a few hundred lines of text as a string (C++) from an API, and sprinkled into that data are German umlauts, such as ä or ö, which need to be replaced with ae and oe.
I'm familiar with encoding (well, I've read http://www.joelonsoftware.com/articles/Unicode.html) and solving the problem was trivial (basically, searching through the string, removing the char and adding 2 others instead).
However, I do not know enough about C++ to do this fast. I've just stumbled upon StringBuilder (http://www.codeproject.com/Articles/647856/4350-Performance-Improvement-with-the-StringBuilde), which improved speed a lot, but I was curious if there are any better or smarter ways to do this?
If you must improve efficiency on such a small scale, consider doing the replacement in two phases:
The first phase calculates the number of characters in the result after the replacement. Go through the string, and add 1 to the count for each normal character; for characters such as ä or ö, add 2.
At this point, you have enough information to allocate the string for the result. Make a string of the length that you counted in the first phase.
The second phase performs the actual replacement: go through the string again, copying the regular characters, and replacing umlauted ones with their corresponding pairs.
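The two phases can be sketched as follows. Python is used here only to show the shape of the algorithm; the same count-then-fill structure carries over to a C++ string sized once up front. The replacement table is an illustrative assumption (only ä and ö come from the question), and the sketch assumes precomposed characters rather than a base letter followed by a combining diaeresis.

```python
# Hypothetical replacement table; extend or trim as needed.
REPLACEMENTS = {"ä": "ae", "ö": "oe", "ü": "ue",
                "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def replace_umlauts_two_phase(text: str) -> str:
    # Phase 1: compute the exact length of the result
    # (1 per regular character, 2 per replaced character).
    out_len = sum(len(REPLACEMENTS.get(ch, ch)) for ch in text)

    # Allocate the result buffer once, at its final size.
    out = [""] * out_len

    # Phase 2: copy regular characters and expand the umlauts.
    pos = 0
    for ch in text:
        for c in REPLACEMENTS.get(ch, ch):
            out[pos] = c
            pos += 1
    return "".join(out)

print(replace_umlauts_two_phase("Käse und Öl"))
```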
When the text is encoded in UTF-8, the German umlauts are all two-byte sequences, and so are their replacements such as ae or oe. So if you use a char[] instead of a string, you don't have to reallocate any memory and can just replace the bytes while iterating over the char[].
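A minimal sketch of that in-place idea, using a Python bytearray to stand in for the char[]. It assumes the input is valid UTF-8; the table and function name are made up for illustration, and the byte pairs are derived from the encoder rather than hard-coded.

```python
# Illustrative table; the two-byte keys are computed, not hard-coded.
PAIRS = {"ä": "ae", "ö": "oe", "ü": "ue"}
BYTE_MAP = {k.encode("utf-8"): v.encode("ascii") for k, v in PAIRS.items()}

# Sanity check of the claim: each umlaut and each replacement is two bytes.
assert all(len(k) == 2 and len(v) == 2 for k, v in BYTE_MAP.items())

def replace_umlauts_in_place(buf: bytearray) -> None:
    i = 0
    while i + 1 < len(buf):
        pair = bytes(buf[i:i + 2])
        if pair in BYTE_MAP:
            buf[i:i + 2] = BYTE_MAP[pair]  # same length, so the buffer never grows
            i += 2
        else:
            i += 1

buf = bytearray("schön für Bär".encode("utf-8"))
size_before = len(buf)
replace_umlauts_in_place(buf)
print(buf.decode("utf-8"), len(buf) == size_before)
```

Because the umlauts' first byte is a UTF-8 lead byte, a two-byte window starting on a continuation byte can never match a table entry in valid input.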
Working on my homework for a class and I came to this question:
For each of the following regular expressions, give minimal length strings that are not in the language defined by the expression.
(bb)*(aa)*b*
a*(bab)*∪b∪ab
I'm going to try to get help only on the first one and see if I can figure out the second. Here's what I know: the Kleene * indicates 0 or more repetitions of an element, and the union of two sets is the set containing all elements of set a and set b, without repeating an element. Working through the first problem, starting by inserting lambda, I get:
1st run: bbaab
2nd: bbbbaabaabbaabbbbaab
3rd: bbbbbbaabaabbaabbbbaabaabbbbaabaabbaabbbbaabbbbbbaabaabbaabbbbaab
If I'm doing that correctly, then strings of length 0 to 5 are not in the language. Am I doing this correctly?
The first regular expression is matching any word that starts with an even number of 'b's (zero included) followed by an even number of 'a's (zero is ok), then followed by some 'b's.
This means that the empty string is in the language, as well as the string "b".
However, the string "a" is not in the language.
Thus the only minimal-length string that is not in the language is "a".
The second regex matches on "", "a" and "aa" (by a*(bab)*) and also on "b" and "ab".
However it doesn't match on "ba" and "bb".
Thus the minimal strings are of length 2: "bb" and "ba".
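One way to double-check answers like these is to enumerate strings by increasing length and test each against the expression. The sketch below does that with Python's re module; the helper name and the max_len cutoff are assumptions for this example, and the union ∪ is written as | in Python's syntax.

```python
import re
from itertools import product

def shortest_non_members(pattern, alphabet="ab", max_len=6):
    """Return every string of the smallest length that the pattern does not fully match."""
    regex = re.compile(pattern)
    for n in range(max_len + 1):
        candidates = ("".join(p) for p in product(alphabet, repeat=n))
        missing = [s for s in candidates if regex.fullmatch(s) is None]
        if missing:
            return missing
    return []

first = shortest_non_members(r"(bb)*(aa)*b*")
second = shortest_non_members(r"a*(bab)*|b|ab")
print(first, second)
```

Using fullmatch matters here: search or match would accept any string that merely contains or starts with a member of the language.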
I'm working on a problem (from Introduction to Automata Theory, Languages, and Computation by Hopcroft, Motwani and Ullman) to write a regular expression that defines a language consisting of all strings of 0s and 1s not containing the substring 011.
Is the answer (0+1)* - 011 correct? If not, what should the correct answer be?
Edit: Updated to include start states and fixes, as per below comments.
If you are looking for all strings that do not have 011 as a substring rather than simply excluding the string 011:
A classic regex for that would be:
1*(0+01)*
Basically you can have as many ones at the beginning as you want, but as soon as you hit a zero, it's either zeros, or zero-ones that follow (since otherwise you'd get a zero-one-one).
A modern, not-really-regular regex would be:
^((?!011)[01])*$
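Both expressions can be sanity-checked by brute force against the plain substring test. The sketch below is one way to do it in Python, where the textbook + (union) is written as |; the length bound of 12 is an arbitrary choice for the example.

```python
import re
from itertools import product

CLASSIC = re.compile(r"1*(0|01)*")           # textbook 1*(0+01)*
LOOKAHEAD = re.compile(r"^((?!011)[01])*$")  # lookahead version

def all_binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

# Collect every string on which either regex disagrees with the substring test.
mismatches = []
for s in all_binary_strings(12):
    expected = "011" not in s
    classic_ok = CLASSIC.fullmatch(s) is not None
    lookahead_ok = LOOKAHEAD.match(s) is not None
    if classic_ok != expected or lookahead_ok != expected:
        mismatches.append(s)

print(mismatches)
```

An empty list means both expressions agree with the substring test on every string up to the chosen length.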
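If, however, you want any string that is not exactly 011, you can simply enumerate the short strings and wildcard the rest, keeping in mind that longer strings beginning with 011 must still be accepted:
λ+0+1+00+01+10+11+(1+00+010+011(0+1))(0+1)*
And in modern regex (the $ inside the lookahead restricts the exclusion to the exact string 011):
^(?!011$)[01]*$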
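The difference between excluding the exact string 011 and excluding everything that starts with 011 is easy to get wrong, so here is a small brute-force comparison. It is a sketch in Python syntax: a lookahead containing $ rejects only the exact string, a lookahead without it rejects every string with that prefix, and the enumerate-and-wildcard form needs its own branch for strings that start with 011 but are longer.

```python
import re
from itertools import product

NOT_EXACT_011 = re.compile(r"(?!011$)[01]*")   # rejects only the string 011
NOT_PREFIX_011 = re.compile(r"(?!011)[01]*")   # rejects anything starting with 011
# Enumerate-and-wildcard form; 011[01] covers longer strings beginning with 011.
ENUMERATED = re.compile(r"(?:|0|1|00|01|10|11|(?:1|00|010|011[01])[01]*)")

# All binary strings of length 0..6.
strings = ["".join(b) for n in range(7) for b in product("01", repeat=n)]

rejected_exact = [s for s in strings if NOT_EXACT_011.fullmatch(s) is None]
rejected_prefix = [s for s in strings if NOT_PREFIX_011.fullmatch(s) is None]
rejected_enum = [s for s in strings if ENUMERATED.fullmatch(s) is None]

print(rejected_exact, rejected_enum, len(rejected_prefix))
```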