How to get number of items (length) from a Stata macro? - stata

Say I have a global macro x, how do I extract the number of items (space-separated unquoted words) in it?
global x abc xyz foo
I am sure it is an easy task, but I've looked in help macro and haven't found anything so far.

Yeah, it was right there in the manual. The macro function word count would do the job.
. global x abc xyz foo
. local len_x : word count $x
. di `len_x'
3
Alternatively,
. di wordcount("$x")
3
Note the need to quote the global macro in this case.
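If you also need the items themselves and not just their number, the extended macro function word # of pulls them out by position. Here is a minimal sketch combining it with word count to walk through the items of the same global (the local name item is just an illustrative choice, not anything required by Stata):
global x abc xyz foo
local len_x : word count $x
forvalues i = 1/`len_x' {
    // item is just an illustrative local name for the i-th word of $x
    local item : word `i' of $x
    display "item `i' is `item'"
}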

Related

Find strings that begin and end with X or . (regular expressions)

I want to detect expressions that contain both [X] and [.]
like:
".......X...."
"X..........."
"...........X"
"XXXXXXXX...."
"........XX.."
"........XXXX"
"XXXXXXXXXXXX"
and not like
"............" .
I don't know SAS syntax, but a regex with lookahead does the job:
^(?=.*X)[.X]+$
This matches strings that contain only . or X and at least one X.
Regex seems like overkill for the problem of "finding strings that begin and end with X". Say your variable is named str; then you could use code like:
if char(str,1)='X' and char(str,length(str))='X' then ....
If you want strings that have at least one X and otherwise only have X or . then something like:
if indexc(str,'X') and 0=verify(trim(str),'X.') then ...
In case you can't use lookaheads, this may work for you:
^\.*X+\.*$
You can detect strings that start and end with . or X using in:, a truncated comparison operator that matches on prefixes.
where stringValue in: ('.', 'X') and reverse(trim(stringValue)) in: ('.', 'X')

How to find strings beginning with X?

I am trying to identify strings that begin with X using the function regexm() in Stata.
My code:
for var lookin: count if regexm(X, "X")
I have tried using double quotes, square brackets, and adding patterns for the other characters in the string (X[0-9][0-9] etc.), but to no avail.
I expect the resultant number to be about 1000, but it returns 0.
The following works for me:
clear
input str22 foo
"Xhello"
"this is a X sentence"
"X a silly one"
"but serves the purpose"
end
generate tag = strmatch(foo, "X*")
list
     +---------------------------------+
     |                    foo      tag |
     |---------------------------------|
  1. |                 Xhello        1 |
  2. |   this is a X sentence        0 |
  3. |          X a silly one        1 |
  4. | but serves the purpose        0 |
     +---------------------------------+
count if tag
2
This is the regular expression solution based on the above example:
generate tag = regexm(foo, "^X")
for in Stata is ancient and now undocumented syntax, unless you are using a very old version of Stata, in which case you would do better to flag that.
X is the default loop element which is substituted everywhere it is found.
Hence your syntax -- looping over a single variable -- reduces to
count if regexm(lookin, "lookin")
and even without a data example we can believe that the answer is 0.
This would be legal and is closer to what you seek:
for Y in var lookin : count if regexm(Y, "X")
but the regular expression is wrong, as @Pearly Spencer points out.
Incidentally,
count if strpos(lookin, "X") == 1
is a direct alternative to your code.
In any Stata that supports regexm() you should be looping with foreach or forvalues.
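For completeness, a minimal foreach sketch of that advice, assuming the string variables to scan are called lookin and lookin2 (hypothetical names used only for illustration):
foreach v of varlist lookin lookin2 {
    // count observations whose value of `v' starts with X
    count if regexm(`v', "^X")
}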

Edit CSV rows in two different ways

I have a bash script that outputs two CSV columns. I need to prepend "f. " to the three-digit number in the second column of those rows that contain one, and keep the rest of the rows intact. I have tried different ways so far, but each has failed in one way or another.
What I've tried mainly has been to use regular expressions with either the first or second column to separate the desired rows from the rest, but I can't separate and prepend at the same time without cancelling out or messing up the process somehow. Some of the commands I've used so far are sed and cut, as well as (nested) for loops, read-while loops, if/else and if/else/elif statements, etc. What follows is one such (failed) solution:
for var1 in "^.*_[^f]_.*"
do
    sed -i "" "s:$MSname::" $pathToCSV"_final.csv"
    for var2 in "^.*_f_.*"
    do
        sed -i "" "s:$MSname:f.:" $pathToCSV"_final.csv"
    done
done
And these are some sample rows:
abc_deg0014_0001_a_1.tif,British Library 1 Front Board Outside
abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside
abc_deg0014_0003_f_001r.tif,British Library 1 001r
abc_deg0014_0004_f_001v.tif,British Library 1 001v
…
abc_deg0014_0267_f_132r.tif,British Library 1 132r
abc_deg0014_0268_f_132v.tif,British Library 1 132v
abc_deg0014_0269_y_999.tif,British Library 1 Back Board Inside
abc_deg0014_0270_z_1.tif,British Library 1 Back Board Outside
Here $MSname = British Library 1 (since with different CSVs the "British Library 1" part can change to other words that I need to remove/replace and that's why I use parameter expansion).
The desired result:
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
…
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
If you look closely, you'll notice these rows are also differentiated from the rest by "f" in their first column (the rows that shouldn't get the "f. " in front of their second column are differentiated by "a", "b", "y", and "z", respectively, in the first column).
You are not using var1 or var2 for anything, and even if you did, looping over variables and repeatedly running sed -i on the same output file is extremely wasteful. Ideally, you would like to write all the modifications into a single sed script, and process the file only once.
Without being able to guess what other strings than "British Library 1" you have and whether those require different kinds of actions, I would suggest something along the lines of
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/' "${pathToCSV}_final.csv"
Notice how the sed script in single quotes can be wrapped over multiple physical lines. The first line finds any line where the text between the last pair of underscores in the first comma-separated column is f, and replaces ",British Library 1 " with ",f. ". (I made some adjustments to the spacing here -- I hope they make sense for you.) On the following line, we simply replace any (remaining) occurrences of ",British Library 1 " with just a comma; the idea is that only the lines which didn't match the regex on the previous line will still contain this string, and so we don't have to do another regex match.
This can easily be extended to cover more patterns in the same sed script, rather than repeatedly looping over the file and rewriting one pattern at a time. For example, if your next task is to replace Windsor Palace A with either a. or nothing depending on whether the penultimate underscore-separated subfield in the first field contains a, that should be obvious enough:
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/
/^[^,]*_a_[^,_]*,/s/,Windsor Palace A /,a. /
s/,Windsor Palace A /,/' "${pathToCSV}_final.csv"
In some more detail, the regex says
^        beginning of line
[^,]*    any sequence of characters which are not a comma
_f_      literal characters underscore, f, underscore
[^,_]*   any sequence of characters which are not a comma or an underscore
,        literal comma
You should be able to see that this will target the last pair of underscores in the first column. It's important never to skip across the first comma and, near the end, not to allow any underscores after the pair we specifically target before we finally allow the comma column delimiter.
Finally, also notice how we always use double quotes around variables which contain file names. There are scenarios where you can avoid this but you have to know what you are doing; the easy and straightforward rule of thumb is to always put double quotes around variables. For the full scoop, see When to wrap quotes around a shell variable?
With awk, you can look at the fifth field to see whether it matches "3 digits + 1 letter"; if it does, print it with f. in front, and otherwise just drop fields 2, 3 and 4. For example:
awk -F'[, ]' '{
    if ($5 ~ /.?[[:digit:]]{3}[a-z]$/) {
        printf("%s,f. %s\n", $1, $5)
    } else {
        printf("%s,%s %s %s\n", $1, $5, $6, $7)
    }
}' test.txt
On the example you provide, it gives:
abc_deg0014_0001_a_1.tif,Front Board Outside
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
abc_deg0014_0004_f_001v.tif,f. 001v
abc_deg0014_0267_f_132r.tif,f. 132r
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
abc_deg0014_0270_z_1.tif,Back Board Outside

Stata: Why aren't these two strings the same?

I define two local variables, str1 and str2, containing strings with the same characters -- the only difference is that str2 has been broken into two lines. They both display the same string, but str2 is only of length 1 and hence is different from str1. Why is this the case?
local str1 "abc"
display "`str1'"
. abc
display strlen("`str1'") // get string length
. 3
local str2 "a" ///
"bc"
display "`str2'"
. abc
display strlen("`str2'")
. 1
display ("`str1'" == "`str2'")
. 0
Your problem lies not in breaking your string between two lines, but in breaking it into two parts. Consider
. local str3 "a" "bc"
. display "`str3'"
abc
. display strlen("`str3'")
1
A look at help local suggests that the syntax you attempted - giving two string constants for the macro - is not within the range of the documented syntax. Consider instead the syntax that allows an expression for the definition, and use it to concatenate the two strings.
. local str4 = "a" ///
> + "bc"
. display "`str4'"
abc
. display strlen("`str4'")
3
The fundamental point here is that your definition of str2 has the consequence, no doubt unintended, of including the middle double quotation marks in the local macro. It follows that you will get misleading results unless thereafter you delimit the macro with compound double quotation marks.
What is also crucial is that display has a presentation role as well as a reporting role and to some extent interprets what is fed to it. It's not always utterly literal therefore. In particular, it will strip delimiters, or what it thinks are delimiters, before displaying a string. Usually this is pure feature, but in your example it stops you seeing the consequences of your definition.
The puzzlement therefore centres on the fact that double quotation marks " " have two roles in Stata, as string delimiters and as literal characters.
. local str2 "a" ///
> "bc"
. display "`str2'"
abc
. display `"`str2'"'
a" "bc
. display strlen("`str2'")
1
. display strlen(`"`str2'"')
17

Matlab: extracting substrings from a cell array

I have a 3x1 cell array whose contents look like the following:
'ASDF_LE_NEWYORK Fixedafdfgd_ML'
'Majo_LE_WASHINGTON FixedMonuts_ML'
'Array_LE_dfgrt_fdhyuj_BERLIN Potato Price'
I want to be able to elegantly extract and create another 3x1 cell array with contents:
'NEWYORK'
'WASHINGTON'
'BERLIN'
If you notice, in the above the NAMEs are after the last underscore and before the first space or '_ML'. How do I write such code in a concise manner?
Thanks
Edit:
Sorry guys, I should have used a better example. I have corrected it now.
You can use lookbehind for _ and lookahead for space:
names = regexp(A, '(?<=_)[^\s_]*(?=\s)', 'match', 'once');
Where A is the cell array containing the strings:
A = {...
    'ASDF_LE_NEWYORK Fixedafdfgd_ML'
    'Majo_LE_WASHINGTON FixedMonuts_ML'
    'Array_LE_dfgrt_fdhyuj_BERLIN Potato Price'};
>> names = regexp(A, '(?<=_)[^\s_]*(?=\s)', 'match', 'once')
names =
    'NEWYORK'
    'WASHINGTON'
    'BERLIN'
NOTE: The question was changed, so the answer is no longer complete, but hopefully the regexp example is still useful.
Try regexp like this:
names = regexp(fullNamesCell,'_(NAME\d?)\s','tokens');
names = cellfun(@(x)(x{1}),names)
In the pattern _(NAME\d?)\s, the parentheses define a subexpression, which will be returned as a token (a portion of matched text). The \d? specifies zero or one digit, but you could use \d{1} for exactly one digit or \d{1,3} if you expect between 1 and 3 digits. The \s specifies whitespace.
The reorganization of names is a little convoluted, but when you use regexp with a cell input and tokens you get a cell of cells that needs some reformatting for your purposes.