Eliminate first and second number after a symbol in stata - stata

I have a variable cn that is a string variable that includes court case names.
For example, 1:2013-cv-10153 and 0:1979-cv-06704.
I would like to remove the first two numbers after the : so that it appears as:
1:13-cv-10153 and 0:79-cv-06704
I've tried substr, but I keep getting an error.
Could someone please advise the best coding strategy in Stata?

This works for me:
display substr("1:2013-cv-10153", 1, 2) + substr("1:2013-cv-10153", 5, .)
Similarly:
. generate string = "1:2013-cv-10153"
. display string
1:2013-cv-10153
. generate new_string = substr(string, 1, 2) + substr(string, 5, .)
. display new_string
1:13-cv-10153

You could also focus on the colon without worrying about its exact character position in the string.
You can do this using the strpos() function in addition to substr():
generate string = "1:2013-cv-10153"
generate new_string = substr(string, 1, strpos(string, ":")) + ///
substr(string, strpos(string, ":") + 3, .)
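For readers who want to see the colon-anchored logic outside Stata, here is a minimal JavaScript sketch of the same idea. It is only an illustration: shortenYear is a hypothetical helper name, and it assumes the two characters to drop always follow the first colon directly.

```javascript
// Illustrative sketch only: mirrors the substr()/strpos() logic of the Stata answer.
// shortenYear is a hypothetical helper name, not part of any library.
function shortenYear(caseName) {
  const colon = caseName.indexOf(':');
  if (colon === -1) return caseName; // no colon found: leave the string unchanged
  // Keep everything up to and including the colon, then skip the next two characters.
  return caseName.slice(0, colon + 1) + caseName.slice(colon + 3);
}

console.log(shortenYear('1:2013-cv-10153')); // 1:13-cv-10153
console.log(shortenYear('0:1979-cv-06704')); // 0:79-cv-06704
```

Unlike the Stata expression, the sketch returns the input unchanged when there is no colon.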

Related

How can I extract a 6 digit number from the text in excel? Examples shown below

I want to extract a pincode from the address. For example, I want to extract 751003 from below address:
Siksha O Annushandhan University, Extension of Sum Hospital,Khandagiri,K-8,Bhubaneswar-751003,Odisha
and another example, I want to extract 799001 from below address:
Saha Drug Distributors; Santipara,Maszid Road,Agartala-799 001,Tripura
Assuming that your data starts in cell A2, try below formula.
=LOOKUP(1,1/MID(SUBSTITUTE(A2," ",""),ROW($A$1:$A$199),6),MID(SUBSTITUTE(A2," ",""),ROW($A$1:$A$199),6))
Note for OP: SO expects users to attempt a solution and to include a description of the attempt and the difficulty faced.
Edit: See below edit.
=LOOKUP(1,1/MID(SUBSTITUTE(A2," ","")&"a",ROW($A$1:$A$199),6),MID(SUBSTITUTE(A2," ",""),ROW($A$1:$A$199),6))
Try using regular expressions:
Sub ExtractCode()
Set regex = CreateObject("VBScript.RegExp")
' pattern explanation: \d{6} - match 6 digits
regex.Pattern = "\d{6}"
' Get address from cell A1 and remove all spaces
testString = Replace(Cells(1, 1), " ", "")
MsgBox regex.Execute(testString)(0).Value
' Get address from cell A2 and remove all spaces
testString = Replace(Cells(2, 1), " ", "")
MsgBox regex.Execute(testString)(0).Value
End Sub
Example I used
Saha Drug Distributors; Santipara,Maszid Road,Agartala-799 001,Tripura
You can use the following formulas to extract the pin code:
=MID(TRIM(A1),FIND(CHAR(1),SUBSTITUTE(TRIM(A1)," ",CHAR(1),LEN(TRIM(A1))-LEN(SUBSTITUTE(TRIM(A1)," ",""))))-3,7)
Or
=MID(A2,FIND("-",A2)+1,7)
Here is a rather easy approach, assuming the addresses are in column A of the sheet:
Sub GetVals()
Dim rng As Range, cl As Range
Dim lr As Long
With Sheet1 'Change accordingly
lr = .Cells(.Rows.Count, 1).End(xlUp).Row
Set rng = .Range("A1:A" & lr)
For Each cl In rng
cl.Offset(0, 1) = Val(Mid(cl, InStrRev(cl, "-") + 1, Len(cl)))
Next cl
End With
End Sub
Val will get the numeric value starting from the position after the last "-" (found through InStrRev), ignoring the space (as your wanted result in the question shows).
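The "remove the spaces, then look for six consecutive digits" idea used in the regex answer can be sketched in JavaScript as well. This is a rough illustration rather than a drop-in Excel solution; extractPincode is a hypothetical name, and it assumes the first run of six digits in the space-stripped address is the pin code.

```javascript
// Hypothetical helper: returns the first run of six digits after removing whitespace,
// or null when the address contains no such run.
function extractPincode(address) {
  const compact = address.replace(/\s+/g, ''); // "799 001" becomes "799001"
  const match = compact.match(/\d{6}/);
  return match ? match[0] : null;
}

console.log(extractPincode('Saha Drug Distributors; Santipara,Maszid Road,Agartala-799 001,Tripura')); // 799001
```

Note that stripping spaces can also join unrelated digit groups, so addresses containing other long numbers may need a stricter pattern.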

Regex Queries for Lines

I am trying to figure out a couple regular expressions for the below cases:
Lines with a length divisible by n, but not by m for integers n and m
Lines that do not contain a certain number n of a given character,
but may contain more or less
I am a newcomer and would appreciate any clarification on these.
I've used JavaScript for my examples.
For the first one the key is to note that 'multiples' are just repeats. So using /(...)+/ will match 3 characters and then repeat that match as many times as it can. Each matching group doesn't need to be the same set of 3 characters but they do need to be consecutive. Suitable anchoring using ^ and $ ensures you're checking the exact string length, and a negative lookahead, (?!...), can be used to negate.
e.g. Multiple of 5 but not 3:
/^(?!(.{3})+$)(.{5})+$/gm
Note that JavaScript uses / to mark the beginning and end of the expression and gm are modifiers to perform global, multiline matches. I wasn't clear what you meant by matching 'lines' so I've assumed the string itself contains newline characters that must be taken into consideration. If you had, say, an array of lines and could check each one individually then things get slightly easier, or a lot easier in the case of your second question.
Demo for the first question:
var chars = '12345678901234567890',
str = '';
for (var i = 1 ; i <= chars.length ; ++i) {
str += chars.slice(0, i) + '\n';
}
console.log('Full text is:');
console.log(str);
console.log('Lines with length that is a multiple of 2 but not 3:');
console.log(lineLength(str, 2, 3));
console.log('Lines with length that is a multiple of 3 but not 2:');
console.log(lineLength(str, 3, 2));
console.log('Lines with length that is a multiple of 5 but not 3:');
console.log(lineLength(str, 5, 3));
function lineLength(str, multiple, notMultiple) {
return str.match(new RegExp('^(?!(.{' + notMultiple + '})+$)(.{' + multiple + '})+$', 'gm'));
}
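If the lines are already available as an array, the length check needs no regex at all. The following is a minimal sketch under that assumption; linesByLength and the sample array are made up for illustration.

```javascript
// Hypothetical helper: keeps lines whose length is a multiple of `multiple`
// but not a multiple of `notMultiple`. Empty lines are skipped, matching the
// regex version, where (.{n})+ needs at least one repeat.
function linesByLength(lines, multiple, notMultiple) {
  return lines.filter(line =>
    line.length > 0 &&
    line.length % multiple === 0 &&
    line.length % notMultiple !== 0);
}

const sampleLines = ['12', '123', '1234', '12345', '123456', '1234567890'];
console.log(linesByLength(sampleLines, 2, 3)); // [ '12', '1234', '1234567890' ]
console.log(linesByLength(sampleLines, 5, 3)); // [ '12345', '1234567890' ]
```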
For the second question I couldn't come up with a nice way to do it. This horror show is what I ended up with. It wasn't too bad to match n occurrences of a particular character in a line but matching 'not n' proved difficult. I ended up matching {0,n-1} or {n+1,} but the whole thing doesn't feel so great to me. I suspect there's a cleverer way to do it that I'm currently not seeing.
var str = 'a\naa\naaa\naaaa\nab\nabab\nababab\nabababab\nba\nbab\nbaba\nbbabbabb';
console.log('Full string:');
console.log(str);
console.log('Lines with 1 occurrence of a:');
console.log(mOccurrences(str, 'a', 1));
console.log('Lines with 2 occurrences of a:');
console.log(mOccurrences(str, 'a', 2));
console.log('Lines with 3 occurrences of a:');
console.log(mOccurrences(str, 'a', 3));
console.log('Lines with not 1 occurrence of a:');
console.log(notMOccurrences(str, 'a', 1));
console.log('Lines with not 2 occurrences of a:');
console.log(notMOccurrences(str, 'a', 2));
console.log('Lines with not 3 occurrences of a:');
console.log(notMOccurrences(str, 'a', 3));
function mOccurrences(str, character, m) {
return str.match(new RegExp('^[^' + character + '\n]*(' + character + '[^' + character + '\n]*){' + m + '}[^' + character + '\n]*$', 'gm'));
}
function notMOccurrences(str, character, m) {
return str.match(new RegExp('^([^' + character + '\n]*(' + character + '[^' + character + '\n]*){0,' + (m - 1) + '}[^' + character + '\n]*|[^' + character + '\n]*(' + character + '[^' + character + '\n]*){' + (m + 1) + ',}[^' + character + '\n]*)$', 'gm'));
}
The key to how that works is that it tries to find n occurrences of a separated by sequences of [^a], with \n thrown in to stop it walking onto the next line.
In a real world scenario I would probably do the splitting into lines first as that makes things much easier. Counting the number of occurrences of a particular character is then just:
str.replace(/[^a]/g, '').length;
// Could use this instead, but note that in JS it throws when there are no matches (match returns null)
str.match(/a/g).length;
Again, this assumes a JavaScript environment. If you were using regexes in an environment where you could literally only pass in the regex as an argument then it's back to my earlier horror show.
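To make the "split into lines first" route concrete, here is a small sketch. The helper names countChar and linesWithoutExactly are hypothetical, and the code assumes the character is passed as a non-empty, single-character string.

```javascript
// Counts occurrences of a single character without a regex,
// so special characters such as '.' or '*' need no escaping.
function countChar(line, character) {
  return line.split(character).length - 1;
}

// Returns the lines that do NOT contain exactly n occurrences of the character.
function linesWithoutExactly(text, character, n) {
  return text.split('\n').filter(line => countChar(line, character) !== n);
}

const sampleText = 'a\naa\naaa\nab\nabab\nbab';
console.log(linesWithoutExactly(sampleText, 'a', 2)); // [ 'a', 'aaa', 'ab', 'bab' ]
```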

Split line based on regex in Julia

I'm interested in splitting a line using a regular expression in Julia. My input is a corpus in Blei's LDA-C format consisting of docId wordID : wordCNT For example a document with five words is represented as follows:
186 0:1 12:1 15:2 3:1 4:1
I'm looking for a way to aggregate words and their counts into separate arrays, i.e. my desired output:
words = [0, 12, 15, 3, 4]
counts = [1, 1, 2, 1, 1]
I've tried using m = match(r"(\d+):(\d+)",line). However, it only finds the first pair 0:1. I'm looking for something similar to Python's re.compile(r'[ :]').split(line). How would I split a line based on regex in Julia?
There's no need to use regex here; Julia's split function allows using multiple characters to define where the splits should occur:
julia> v = split(line, [':',' '])
11-element Array{SubString{String},1}:
"186"
"0"
"1"
"12"
"1"
"15"
"2"
"3"
"1"
"4"
"1"
julia> words = v[2:2:end]
5-element Array{SubString{String},1}:
"0"
"12"
"15"
"3"
"4"
julia> counts = v[3:2:end]
5-element Array{SubString{String},1}:
"1"
"1"
"2"
"1"
"1"
I discovered the eachmatch method that returns an iterator over the regex matches. An alternative solution is to iterate over each match:
words, counts = Int64[], Int64[]
for m in eachmatch(r"(\d+):(\d+)", line)
wd, cnt = m.captures
push!(words, parse(Int64, wd))
push!(counts, parse(Int64, cnt))
end
As Matt B. mentions, there's no need for a Regex here as the Julia lib split() can use an array of chars.
However - when there is a need for Regex - the same split() function just works, similar to what others suggest here:
line = "186 0:1 12:1 15:2 3:1 4:1"
s = split(line, r":| ")
words = s[2:2:end]
counts = s[3:2:end]
I've recently had to do exactly that in some Unicode processing code (where the split chars were "combined characters", and thus not something that can fit in Julia's 'single quotes'), meaning:
split_chars = ["bunch","of","random","delims"]
line = "line_with_these_delims_in_the_middle"
r_split = Regex( join(split_chars, "|") )
split( line, r_split )
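For comparison with the eachmatch approach, the same "iterate over every wordID:count pair" idea can be sketched in JavaScript with matchAll. This is an illustration in a different language, not Julia code; parseLdaLine is a hypothetical name, and the sketch assumes the leading document field contains no colon.

```javascript
// Hypothetical helper: collects the word ids and counts from one LDA-C style line.
// The leading field ("186") has no colon, so the pattern skips it.
function parseLdaLine(line) {
  const words = [];
  const counts = [];
  for (const m of line.matchAll(/(\d+):(\d+)/g)) {
    words.push(Number(m[1]));  // first capture group: word id
    counts.push(Number(m[2])); // second capture group: count
  }
  return { words, counts };
}

console.log(parseLdaLine('186 0:1 12:1 15:2 3:1 4:1'));
// { words: [ 0, 12, 15, 3, 4 ], counts: [ 1, 1, 2, 1, 1 ] }
```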

Dynamic regexprep in MATLAB

I have the following strings in a long string:
a=b=c=d;
a=b;
a=b=c=d=e=f;
I want to first search for above mentioned pattern (X=Y=...=Z) and then output like the following for each of the above mentioned strings:
a=d;
b=d;
c=d;
a=b;
a=f;
b=f;
c=f;
d=f;
e=f;
In general, I want all the variables to have an equal sign with the last variable on the extreme right of the string. Is there a way I can do it using regexprep in MATLAB? I am able to do it for a fixed-length string, but for variable length, I have no idea how to achieve this. Any help is appreciated.
My attempt for the case of two equal signs is as follows:
funstr = regexprep(funstr, '([^;])+\s*=\s*+(\w+)+\s*=\s*([^;])+;', '$1 = $3; \n $2 = $3;\n');
Not a regexp, but if you stick to MATLAB you can make use of the cellfun function to avoid a loop:
str = 'a=b=c=d=e=f;' ; %// input string
list = strsplit(str,'=') ;
strout = cellfun( @(a) [a,'=',list{end}] , list(1:end-1), 'uni', 0).' %// Horchler's simplification of the previous solution below
%// this does the same as above, but is more convoluted
%// strout = cellfun( @(a,b) cat(2,a,'=',b) , list(1:end-1) , repmat(list(end),1,length(list)-1) , 'uni',0 ).'
Will give you:
strout =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
Note: As Horchler rightly pointed out in a comment, although the cellfun instruction lets you compact your code, it is just a disguised loop. Moreover, since it runs on cells, it is notoriously slow. You won't see the difference on such simple inputs, but reserve this approach for cases where performance is not a major concern.
Now, if you like regex, you must like black magic code. If all your strings are in a cell array from the start, there is a way to (over)abuse the cellfun capabilities to obscure your code and do it all in one line.
Consider:
strlist = {
'a=b=c=d;'
'a=b;'
'a=b=c=d=e=f;'
};
Then you can get all your substrings with:
strout = cellfun( @(s)cellfun(@(a,b)cat(2,a,'=',b),s(1:end-1),repmat(s(end),1,length(s)-1),'uni',0).' , cellfun(@(s) strsplit(s,'=') , strlist , 'uni',0 ) ,'uni',0)
>> strout{:}
ans =
'a=d;'
'b=d;'
'c=d;'
ans =
'a=b;'
ans =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
This gives you a 3x1 cell array, with one cell for each group of substrings. Each cell holds a column, so if you want to concatenate them all, stack them vertically: strall = cat(1,strout{:});
I haven't had much experience w/ Matlab; but your problem can be solved by a simple string split function.
[parts, m] = strsplit( funstr, {' ', '='}, 'CollapseDelimiters', true )
Now take the last element of parts, and iterate over the elements before it:
len = length( parts );
for i = 1:len-1
disp( [parts{i}, ' = ', parts{len}] )
end
I am not sure which output function is idiomatic in MATLAB; disp is used here, and fprintf would work as well.
There isn't a single regex that you can write that will cover all the cases, as explained in this answer:
https://stackoverflow.com/a/5019658/3393095
However, you have a few alternatives to achieve your final result:
1. You can get all the values in the line with regexp, pick the last value, then use a for loop over the other values to generate the output. The regex to get the values would be this:
matchStr = regexp(str,'([^=;\s]*)','match')
2. If you want to use regexprep at all costs, you should write a pattern generator and a replace-expression generator, based on the number of '=' in the input string, and pass these as parameters to regexprep.
3. You can forget about regex and split the input, then loop over the values to generate the output (similar to alternative #1).
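The split-and-loop alternative is language independent, so here is a minimal JavaScript sketch of it. It is an illustration only, not MATLAB code; expandChain is a hypothetical name, and the sketch assumes each statement is a single chain of names terminated by one semicolon.

```javascript
// Hypothetical helper: turns "a=b=c=d;" into ["a=d;", "b=d;", "c=d;"].
function expandChain(statement) {
  const names = statement
    .replace(/;\s*$/, '')   // drop the trailing semicolon
    .split('=')
    .map(name => name.trim());
  const last = names[names.length - 1]; // right-most variable
  // Pair every earlier variable with the last one.
  return names.slice(0, -1).map(name => `${name}=${last};`);
}

console.log(expandChain('a=b=c=d;')); // [ 'a=d;', 'b=d;', 'c=d;' ]
console.log(expandChain('a=b;'));     // [ 'a=b;' ]
```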

Split a word with regexp in matlab; startIndex for 'split'?

My aim is to generate the phonetic transcription for any word according to a set of rules.
First, I want to split words into their syllables. For example, I want an algorithm to find 'ch' in a word and then separate it like shown below:
Input: 'aachbutcher'
Output: 'a' 'a' 'ch' 'b' 'u' 't' 'ch' 'e' 'r'
I have come so far:
check=regexp('aachbutcher','ch');
if (isempty(check{1,1})==0) % Returns 0, when 'ch' was found.
[match split startIndex endIndex] = regexp('aachbutcher','ch','match','split')
%Now I split the 'aa', 'but' and 'er' into single characters:
for i = 1:length(split)
SingleLetters{i} = regexp(split{1,i},'.','match');
end
end
My problem is: How do I put the cells together, such that they are formatted like the desired output? I only have the starting indexes for the match parts ('ch') but not for the split parts ('aa', 'but','er').
Any ideas?
You don't need to work with the indices or length. Simple logic: process the first element from split, then the first from match, then the second from split, and so on.
[match,split,startIndex,endIndex] = regexp('aachbutcher','ch','match','split');
%Now I split the 'aa', 'but' and 'er' into single characters:
SingleLetters=regexp(split{1,1},'.','match');
for i = 2:length(split)
SingleLetters=[SingleLetters,match{i-1},regexp(split{1,i},'.','match')];
end
So, you know the length of 'ch', it's 2. You know where you found it from regex, as those indices are stored in startIndex. I'm assuming (Please, correct me if I'm wrong) that you want to split all other letters of the word into single-letter cells, like in your output above. So, you can just use the startIndex data to construct your output, using conditionals, like this:
word = 'aachbutcher';
startIndex = regexp(word,'ch'); % start positions of every 'ch'
output = {};
i = 1;
% A while loop is used because changing the loop variable inside a for loop has no effect in MATLAB.
while i <= length(word)
    if ismember(i, startIndex)
        output{end+1} = 'ch';
        i = i + 2; % skip both characters of 'ch'
    else
        output{end+1} = word(i);
        i = i + 1;
    end
end
I don't have MATLAB right now, so I can't test it. I hope it works for you! If not, let me know and I'll take another shot at it.
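Another way to look at the problem is to let the regex itself do the interleaving: an alternation that tries 'ch' first and falls back to any single character returns the pieces already in order. Below is a minimal JavaScript sketch of that idea. It is an assumption-laden illustration rather than MATLAB code: splitGraphemes is a hypothetical name, and the digraph list is assumed to contain no regex special characters.

```javascript
// Hypothetical helper: splits a word into single letters, keeping listed digraphs together.
function splitGraphemes(word, digraphs = ['ch']) {
  // The alternation tries each digraph first, then falls back to any single character.
  const pattern = new RegExp(digraphs.join('|') + '|.', 'g');
  return word.match(pattern) || [];
}

console.log(splitGraphemes('aachbutcher'));
// [ 'a', 'a', 'ch', 'b', 'u', 't', 'ch', 'e', 'r' ]
```

A pattern of the same shape ('ch|.') with the 'match' option may well work in MATLAB's regexp too, but that has not been tested here.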