Matlab: locate a string in txt and read it into a number - regex

I have an input file like this:
number of elements = 4
number of nodes = 6
number of fixed points = 2
number of forces = 1
young = 2.0E8
poiss = 0.2
thickness = 0.002
node group
1 2 6
2 3 4
2 4 5
2 5 6
And I use this to read the file
fid = fopen(input_file);
tline = fgetl(fid);
line_number = 1;
while ischar(tline)
# this will locate the string, and find the number
if ~isempty(strfind(tline,'number of elements'))
NELEM = str2double(regexp(tline, '\d+', 'match'));
end
if ~isempty(strfind(tline,'young'))
YOUNG = str2double(regexp(tline, '\d+', 'match'));
end
line_number=line_number+1;
tline = fgetl(fid);
end
fclose(fid);
The first works fine, however, for the second, YOUNG, the output is actually [2 0 8](original number is 2e8) The regexp turns the string into an array.
And for poiss, it read as [0,2].
How can I turn the string into the original number?

Your regular expression needs to match floating point numbers with exponents, try changing '\d+' to
'[0-9]*\.?[0-9]+([eE][0-9]+)?'
This then matches numbers with an optional decimal point and exponent. For example:
str2double(regexp('young = 2.0E8', '[0-9]*\.?[0-9]+([eE][0-9]+)?', 'match'))
gives 200000000.

Related

I need to create a loop that prints a graph where each number in the list is shown by a number of characters. Here is an example:

numbers = [1,2,3,4]
results in
1: i
2: ii
3: iii
4: iiii
This is my code so far and I'm not sure where to go.
numbers = [1,2,3,4]
c = 0
for i in numbers:
count += 1
print(len(numbers))
Instead of 'i' you can put any character you like enclosed in single inverted commas.
numbers = [1,2,3,4]
output = []
for number in numbers:
output.append(number*'i')
print(output)

Parsing periods in a column dataframe

I have a csv with one of the columns that contains periods:
timespan (string): PnYnMnD, where P is a literal value that starts the expression, nY is the number of years followed by a literal Y, nM is the number of months followed by a literal M, nD is the number of days followed by a literal D, where any of these numbers and corresponding designators may be absent if they are equal to 0, and a minus sign may appear before the P to specify a negative duration.
I want to return a data frame that contains all the data in the csv with parsed timespan column.
So far I have a code that parses periods:
import re
timespan_regex = re.compile(r'P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?')
def parse_timespan(timespan):
# check if the input is a valid timespan
if not timespan or 'P' not in timespan:
return None
# check if timespan is negative and skip initial 'P' literal
curr_idx = 0
is_negative = timespan.startswith('-')
if is_negative:
curr_idx = 1
# extract years, months and days with the regex
match = timespan_regex.match(timespan[curr_idx:])
years = int(match.group(1) or 0)
months = int(match.group(2) or 0)
days = int(match.group(3) or 0)
timespan_days = years * 365 + months * 30 + days
return timespan_days if not is_negative else -timespan_days
print(parse_timespan(''))
print(parse_timespan('P2Y11M20D'))
print(parse_timespan('-P2Y11M20D'))
print(parse_timespan('P2Y'))
print(parse_timespan('P0Y'))
print(parse_timespan('P2Y4M'))
print(parse_timespan('P16D'))
Output:
None
1080
-1080
730
0
850
16
How do I apply this code to the whole csv column while running the function processing csv?
def do_process_citation_data(f_path):
global my_ocan
my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
parse_dates=['creation', 'timespan'])
my_ocan = my_ocan.iloc[1:] # to remove the first row
my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
my_ocan['timespan'] = parse_timespan(my_ocan['timespan']) #I tried like this, but sure it is not working :)
return my_ocan
Thank you and have a lovely day :)
Like with Python's builtin map, Pandas also has that method. You can check its documentation here. Since you already have your function ready which takes a single parameter and returns a value, you just need this:
my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan) #This will take each value in the column "timespan", pass it to your function 'parse_timespan', and update the specific row with the returned value
And here is a generic demo:
import pandas as pd
def demo_func(x):
#Takes an int or string, prefixes with 'A' and returns a string.
return "A" + str(x)
df = pd.DataFrame({"Column_1": [1, 2, 3, 4], "Column_2": [10, 9, 8, 7]})
print(df)
df['Column_1'] = df['Column_1'].map(demo_func)
print("After mapping:\n{}".format(df))
Output:
Column_1 Column_2
0 1 10
1 2 9
2 3 8
3 4 7
After mapping:
Column_1 Column_2
0 A1 10
1 A2 9
2 A3 8
3 A4 7

Find starting and ending index of each unique charcters in a string in python

I have a string with characters repeated. My Job is to find starting Index and ending index of each unique characters in that string. Below is my code.
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
mo = re.search(item,x)
flag = item
m = mo.start()
n = mo.end()
print(flag,m,n)
Output :
a 0 1
b 3 4
c 7 8
Here the end index of the characters are not correct. I understand why it's happening but how can I pass the character to be matched dynamically to the regex search function. For instance if I hardcode the character in the search function it provides the desired output
x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+",x)
flag = item
m = mo.start()
n = mo.end()
print(flag,m,n)
output:
b 2 5
The above function is providing correct result but here I can't pass the characters to be matched dynamically.
It will be really a help if someone can let me know how to achieve this any hint will also do. Thanks in advance
String literal formatting to the rescue:
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
# for patterns better use raw strings - and format the letter into it
mo = re.search(fr"{item}+",x) # fr and rf work both :) its a raw formatted literal
flag = item
m = mo.start()
n = mo.end()
print(flag,m,n) # fix upper limit by n-1
Output:
a 0 3 # you do see that the upper limit is off by 1?
b 3 7 # see above for fix
c 7 9
Your pattern does not need the [] around the letter - you are matching just one anyhow.
Without regex1:
x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx,ch in enumerate(x[1:],1):
if last_ch == ch:
continue
else:
print(last_ch,start_idx, idx-1)
last_ch = ch
start_idx = idx
print(ch,start_idx,idx)
output:
a 0 2 # not off by 1
b 3 6
c 7 8
1RegEx: And now you have 2 problems...
Looking at the output, I'm guessing that another option would be,
import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
end = start + len(item[0])
output += (f"{item[1]} {start} {end}\n")
start = end
print(output)
Output
a 0 3
b 3 7
c 7 9
I think it'll be in the Order of N, you can likely benchmark it though, if you like.
import re, time
timer_on = time.time()
for i in range(10000000):
x = "aabbbbccc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
end = start + len(item[0])
output += (f"{item[1]} {start} {end}\n")
start = end
timer_off = time.time()
timer_total = timer_off - timer_on
print(timer_total)

Find position of first non-zero decimal

Suppose I have the following local macro:
loc a = 12.000923
I would like to get the decimal position of the first non-zero decimal (4 in this example).
There are many ways to achieve this. One is to treat a as a string and to find the position of .:
loc a = 12.000923
loc b = strpos(string(`a'), ".")
di "`b'"
From here one could further loop through the decimals and count since I get the first non-zero element. Of course this doesn't seem to be a very elegant approach.
Can you suggest a better way to deal with this? Regular expressions perhaps?
Well, I don't know Stata, but according to the documentation, \.(0+)? is suported and it shouldn't be hard to convert this 2 lines JavaScript function in Stata.
It returns the position of the first nonzero decimal or -1 if there is no decimal.
function getNonZeroDecimalPosition(v) {
var v2 = v.replace(/\.(0+)?/, "")
return v2.length !== v.length ? v.length - v2.length : -1
}
Explanation
We remove from input string a dot followed by optional consecutive zeros.
The difference between the lengths of original input string and this new string gives the position of the first nonzero decimal
Demo
Sample Snippet
function getNonZeroDecimalPosition(v) {
var v2 = v.replace(/\.(0+)?/, "")
return v2.length !== v.length ? v.length - v2.length : -1
}
var samples = [
"loc a = 12.00012",
"loc b = 12",
"loc c = 12.012",
"loc d = 1.000012",
"loc e = -10.00012",
"loc f = -10.05012",
"loc g = 0.0012"
]
samples.forEach(function(sample) {
console.log(getNonZeroDecimalPosition(sample))
})
You can do this in mata in one line and without using regular expressions:
foreach x in 124.000923 65.020923 1.000022030 0.0090843 .00000425 {
mata: selectindex(tokens(tokens(st_local("x"), ".")[selectindex(tokens(st_local("x"), ".") :== ".") + 1], "0") :!= "0")[1]
}
4
2
5
3
6
Below, you can see the steps in detail:
. local x = 124.000823
. mata:
: /* Step 1: break Stata's local macro x in tokens using . as a parsing char */
: a = tokens(st_local("x"), ".")
: a
1 2 3
+----------------------------+
1 | 124 . 000823 |
+----------------------------+
: /* Step 2: tokenize the string in a[1,3] using 0 as a parsing char */
: b = tokens(a[3], "0")
: b
1 2 3 4
+-------------------------+
1 | 0 0 0 823 |
+-------------------------+
: /* Step 3: find which values are different from zero */
: c = b :!= "0"
: c
1 2 3 4
+-----------------+
1 | 0 0 0 1 |
+-----------------+
: /* Step 4: find the first index position where this is true */
: d = selectindex(c :!= 0)[1]
: d
4
: end
You can also find the position of the string of interest in Step 2 using the
same logic.
This is the index value after the one for .:
. mata:
: k = selectindex(a :== ".") + 1
: k
3
: end
In which case, Step 2 becomes:
. mata:
:
: b = tokens(a[k], "0")
: b
1 2 3 4
+-------------------------+
1 | 0 0 0 823 |
+-------------------------+
: end
For unexpected cases without decimal:
foreach x in 124.000923 65.020923 1.000022030 12 0.0090843 .00000425 {
if strmatch("`x'", "*.*") mata: selectindex(tokens(tokens(st_local("x"), ".")[selectindex(tokens(st_local("x"), ".") :== ".") + 1], "0") :!= "0")[1]
else display " 0"
}
4
2
5
0
3
6
A straighforward answer uses regular expressions and commands to work with strings.
One can select all decimals, find the first non 0 decimal, and finally find its position:
loc v = "123.000923"
loc v2 = regexr("`v'", "^[0-9]*[/.]", "") // 000923
loc v3 = regexr("`v'", "^[0-9]*[/.][0]*", "") // 923
loc first = substr("`v3'", 1, 1) // 9
loc first_pos = strpos("`v2'", "`first'") // 4: position of 9 in 000923
di "`v2'"
di "`v3'"
di "`first'"
di "`first_pos'"
Which in one step is equivalent to:
loc first_pos2 = strpos(regexr("`v'", "^[0-9]*[/.]", ""), substr(regexr("`v'", "^[0-9]*[/.][0]*", ""), 1, 1))
di "`first_pos2'"
An alternative suggested in another answer is to compare the lenght of the decimals block cleaned from the 0s with that not cleaned.
In one step this is:
loc first_pos3 = strlen(regexr("`v'", "^[0-9]*[/.]", "")) - strlen(regexr("`v'", "^[0-9]*[/.][0]*", "")) + 1
di "`first_pos3'"
Not using regex but log10 instead (which treats a number like a number), this function will:
For numbers >= 1 or numbers <= -1, return with a positive number the number of digits to the left of the decimal.
Or (and more specifically to what you were asking), for numbers between 1 and -1, return with a negative number the number of digits to the right of the decimal where the first non-zero number occurs.
digitsFromDecimal = (n) => {
dFD = Math.log10(Math.abs(n)) | 0;
if (n >= 1 || n <= -1) { dFD++; }
return dFD;
}
var x = [118.8161330, 11.10501660, 9.254180571, -1.245501523, 1, 0, 0.864931613, 0.097007836, -0.010880074, 0.009066729];
x.forEach(element => {
console.log(`${element}, Digits from Decimal: ${digitsFromDecimal(element)}`);
});
// Output
// 118.816133, Digits from Decimal: 3
// 11.1050166, Digits from Decimal: 2
// 9.254180571, Digits from Decimal: 1
// -1.245501523, Digits from Decimal: 1
// 1, Digits from Decimal: 1
// 0, Digits from Decimal: 0
// 0.864931613, Digits from Decimal: 0
// 0.097007836, Digits from Decimal: -1
// -0.010880074, Digits from Decimal: -1
// 0.009066729, Digits from Decimal: -2
Mata solution of Pearly is very likable, but notice should be paid for "unexpected" cases of "no decimal at all".
Besides, the regular expression is not a too bad choice when it could be made in a memorable 1-line.
loc v = "123.000923"
capture local x = regexm("`v'","(\.0*)")*length(regexs(0))
Below code tests with more values of v.
foreach v in 124.000923 605.20923 1.10022030 0.0090843 .00000425 12 .000125 {
capture local x = regexm("`v'","(\.0*)")*length(regexs(0))
di "`v': The wanted number = `x'"
}

Stata: Counting number of consecutive occurrences of a pre-defined length

Observations in my data set contain the history of moves for each player. I would like to count the number of consecutive series of moves of some pre-defined length (2, 3 and more than 3 moves) in the first and the second halves of the game. The sequences cannot overlap, i.e. the sequence 1111 should be considered as a sequence of the length 4, not 2 sequences of length 2. That is, for an observation like this:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 1 | 1 | . | . | 1 | 1 |
+-------+-------+-------+-------+-------+-------+-------+-------+
…the following variables should be generated:
Number of sequences of 2 in the first half =0
Number of sequences of 2 in the second half =1
Number of sequences of 3 in the first half =0
Number of sequences of 3 in the second half =0
Number of sequences of >3 in the first half =1
Number of sequences of >3 in the second half = 0
I have two potential options of how to proceed with this task but neither of those leads to the final solution:
Option 1: Elaborating on Nick’s tactical suggestion to use strings (Stata: Maximum number of consecutive occurrences of the same value across variables), I have concatenated all “move*” variables and tried to identify the starting position of a substring:
egen test1 = concat(move*)
gen test2 = subinstr(test1,"11","X",.) // find all consecutive series of length 2
There are several problems with Option 1:
(1) it does not account for cases with overlapping sequences (“1111” is recognized as 2 sequences of 2)
(2) it shortens the resulting string test2 so that positions of X no longer correspond to the starting positions in test1
(3) it does not account for variable length of substring if I need to check for sequences of the length greater than 3.
Option 2: Create an auxiliary set of variables to identify the starting positions of the consecutive set (sets) of the 1s of some fixed predefined length. Building on the earlier example, in order to count sequences of length 2, what I am trying to get is an auxiliary set of variables that will be equal to 1 if the sequence of started at a given move, and zero otherwise:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
My code looks as follows but it breaks when I am trying to restart counting consecutive occurrences:
quietly forval i = 1/42 {
gen temprow`i' =.
egen rowsum = rownonmiss(seq1-seq`i') //count number of occurrences
replace temprow`i'=rowsum
mvdecode seq1-seq`i',mv(1) if rowsum==2
drop rowsum
}
Does anyone know a way of solving the task?
Assume a string variable concatenating all moves all (the name test1 is hardly evocative).
FIRST TRY: TAKING YOUR EXAMPLE LITERALLY
From your example with 8 moves, the first half of the game is moves 1-4 and the second half moves 5-8. Thus there is for each half only one way to have >3 moves, namely that there are 4 moves. In that case each substring will be "1111" and counting reduces to testing for the one possibility:
gen count_1_4 = substr(all, 1, 4) == "1111"
gen count_2_4 = substr(all, 5, 4) == "1111"
Extending this approach, there are only two ways to have 3 moves in sequence:
gen count_1_3 = inlist(substr(all, 1, 4), "111.", ".111")
gen count_2_3 = inlist(substr(all, 5, 4), "111.", ".111")
In similar style, there can't be two instances of 2 moves in sequence in each half of the game as that would qualify as 4 moves. So, at most there is one instance of 2 moves in sequence in each half. That instance must match either of two patterns, "11." or ".11". ".11." is allowed, so either includes both. We must also exclude any false match with a sequence of 3 moves, as just mentioned.
gen count_1_2 = (strpos(substr(all, 1, 4), "11.") | strpos(substr(all, 1, 4), ".11") ) & !count_1_3
gen count_2_2 = (strpos(substr(all, 5, 4), "11.") | strpos(substr(all, 5, 4), ".11") ) & !count_2_3
The result of each strpos() evaluation will be positive if a match is found and (arg1 | arg2) will be true (1) if either argument is positive. (For Stata, non-zero is true in logical evaluations.)
That's very much tailored to your particular problem, but not much worse for that.
P.S. I didn't try hard to understand your code. You seem to be confusing subinstr() with strpos(). If you want to know positions, subinstr() cannot help.
SECOND TRY
Your last code segment implies that your example is quite misleading: if there can be 42 moves, the approach above can not be extended without pain. You need a different approach.
Let's suppose that the string variable all can be 42 characters long. I will set aside the distinction between first and second halves, which can be tackled by modifying this approach. At its simplest, just split the history into two variables, one for the first half and one for the second and repeat the approach twice.
You can clone the history by
clonevar work = all
gen length1 = .
gen length2 = .
and set up your count variables. Here count_4 will hold counts of 4 or more.
gen count_4 = 0
gen count_3 = 0
gen count_2 = 0
First we look for move sequences of length 42, ..., 2. Every time we find one, we blank it out and bump up the count.
qui forval j = 42(-1)2 {
replace length1 = length(work)
local pattern : di _dup(`j') "1"
replace work = subinstr(work, "`pattern'", "", .)
replace length2 = length(work)
if `j' >= 4 {
replace count4 = count4 + (length1 - length2) / `j'
}
else if `j' == 3 {
replace count3 = count3 + (length1 - length2) / 3
}
else if `j' == 2 {
replace count2 = count2 + (length1 - length2) / 2
}
}
The important details here are
If we delete (repeated instances of) a pattern and measure the change in length, we have just deleted (change in length) / (length of pattern) instances of that pattern. So, if I look for "11" and found that the length decreased by 4, I just found two instances.
Working downwards and deleting what we found ensures that we don't find false positives, e.g. if "1111111" is deleted, we don't find later "111111", "11111", ..., "11" which are included within it.
Deletion implies that we should work on a clone in order not to destroy what is of interest.