Expected a right parenthesis in expression at (1) - fortran

I'm trying to compile the following code in gfortran:
INTEGER F(10),G(14),LUN(5)
DIMENSION MESSG(NMESSG)
DATA F(1),F(2),F(3),F(4),F(5),F(6),F(7),F(8),F(9),F(10)
1 / 1H( ,1H1 ,1HX ,1H, ,1H ,1H ,1HA ,1H ,1H ,1H) /
DATA G(1),G(2),G(3),G(4),G(5),G(6),G(7),G(8),G(9),G(10)
1 / 1H( ,1H1 ,1HX ,1H ,1H ,1H ,1H ,1H ,1H ,1H /
DATA G(11),G(12),G(13),G(14)
1 / 1H ,1H ,1H ,1H) /
DATA LA/1HA/,LCOM/1H,/,LBLANK/1H /
I get the following error:
Error: Expected a right parenthesis in expression at (1):
1 / 1H( ,1H1 ,1HX ,1H, ,1H ,1H ,1HA ,1H ,1H ,1H) /
1
The line referenced in the error is the fourth line of the code snippet. Anyone know what my problem is? I know this is really old code, but I inherited it in a scientific application and need to make it work.

Your old code is using the Hollerith H to place character values into integer variables. Somewhat more modern Fortran could be:
character*10 F
DATA F / "(1X, A )" /
More modern:
character (len=10) :: F = "(1X, A )"
but these are changing the type of F, which would impact the rest of the code. So using these changes will require other rewrites, but if it were my code, I'd consider it, considering how archaic and unreadable the problem code section is. If that's too much work, I get the following to compile with gfortran:
program f2
INTEGER F(10),G(14),LUN(5)
DATA F(1),F(2),F(3),F(4),F(5),F(6),F(7),F(8),F(9),F(10)
$ / 1H( ,1H1 ,1HX ,1H, ,1H ,1H ,1HA ,1H ,1H ,1H) /
DATA G(1),G(2),G(3),G(4),G(5),G(6),G(7),G(8),G(9),G(10), G(11),G(12),G(13),G(14)
$ / 1H( ,1H1 ,1HX ,1H ,1H ,1H ,1H ,1H ,1H ,1H ,1H ,1H ,1H ,1H) /
DATA LA/1HA/,LCOM/1H,/,LBLANK/1H /
end
with
gfortran-mp-4.7 -O3 -ffixed-form -ffixed-line-length-none -std=legacy f2.f

The 1H things are known as Hollerith constants from the Fortran 66 era, where the 1 indicates the length of the constant in terms of source code characters and then the actual value of the constant is the encoded form of that number of characters (including blanks) following the H. They are notoriously error prone.
Your code is using fixed form source. That is also notoriously error prone.
That said, after adjusting for the loss of the leading fixed form source columns your code compiles here with recent versions of gfortran. With Hollerith constants - all it takes is a blank out of position though - and things can break badly.
(Ironically, with fixed form source you can generally scatter blanks around with willful abandon - making the intersection of those two language features rather special...)
For what its worth, the encoded data for F and G looks like simple format specifiers. In "modern" Fortran (well, since 1977!) you would simply use character variables to hold the formats (1X,A) and (1X) (blanks elided) for F and G respectively. The other variables look like simple character constants (A, comma and blank) - perhaps for use in building up a larger format specifier.
Bringing the code into at least the era of flares and strange colours would be pretty high on my list of priorities if I had to maintain it.

Related

Multinomial Logit Fixed Effects: Stata and R

I am trying to run a multinomial logit with year fixed effects in mlogit in Stata (panel data: year-country), but I do not get standard errors for some of the models. When I run the same model using multinom in R I get both coefficients and standard errors.
I do not use Stata frequently, so I may be missing something or I may be running different models in Stata and R and should not be comparing them in the first place. What may be happening?
So a few details about the simple version of the model of interest:
I created a data example to show what the problem is
Dependent variable (will call it DV1) with 3 categories of -1, 0, 1 (unordered and 0 as reference)
Independent variables: 2 continuous variables, 3 binary variables, interaction of 2 of the 3 binary variables
Years: 1995-2003
Number of observations in the model: 900
In R I run the code and get coefficients and standard errors as below.
R version of code creating data and running the model:
## Fabricate example data
library(fabricatr)
data <- fabricate(
N = 900,
id = rep(1:900, 1),
IV1 = draw_binary(0.5, N = N),
IV2 = draw_binary(0.5, N = N),
IV3 = draw_binary(0.5, N = N),
IV4 = draw_normal_icc(mean = 3, N = N, clusters = id, ICC = 0.99),
IV5 = draw_normal_icc(mean = 6, N = N, clusters = id, ICC = 0.99))
library(AlgDesign)
DV = gen.factorial(c(3), 1, center=TRUE, varNames=c("DV"))
year = gen.factorial(c(9), 1, center=TRUE, varNames=c("year"))
DV = do.call("rbind", replicate(300, DV, simplify = FALSE))
year = do.call("rbind", replicate(100, year, simplify = FALSE))
year[year==-4]= 1995
year[year==-3]= 1996
year[year==-2]= 1997
year[year==-1]= 1998
year[year==0]= 1999
year[year==1]= 2000
year[year==2]= 2001
year[year==3]= 2002
year[year==4]= 2003
data1=cbind(data, DV, year)
data1$DV1 = relevel(factor(data1$DV), ref = "0")
## Save data as csv file (to use in Stata)
library(foreign)
write.csv(data1, "datafile.csv", row.names=FALSE)
## Run multinom
library(nnet)
model1 <- multinom(DV1 ~ IV1 + IV2 + IV3 + IV4 + IV5 + IV1*IV2 + as.factor(year), data = data1)
Results from R
When I run the model using mlogit (without fixed effects) in Stata I get both coefficients and standard errors.
So I tried including year fixed effects in the model using Stata three different ways and none worked:
femlogit
factor-variable and time-series operators not allowed
depvar and indepvars may not contain factor variables or time-series operators
mlogit
fe option: fe not allowed
used i.year: omits certain variables and/or does not give me standard errors and only shows coefficients (example in code below)
* Read file
import delimited using "datafile.csv", clear case(preserve)
* Run regression
mlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2 i.year, base(0) iterate(1000)
Stata results
xtmlogit
error - does not run
error message: total number of permutations is 2,389,461,218; this many permutations require a considerable amount of memory and can result in long run times; use option force to proceed anyway, or consider using option rsample()
Fixed effects and non-linear models (such as logits) are an awkward combination. In a linear model you can simply add dummies/demean to get rid of a group-specific intercept, but in a non-linear model none of that works. I mean you could do it technically (which I think is what the R code is doing) but conceptually it is very unclear what that actually does.
Econometricians have spent a lot of time working on this, which has led to some work-arounds, usually referred to as conditional logit. IIRC this is what's implemented in femlogit. I think the mistake in your code is that you tried to include the fixed effects through a dummy specification (i.year). Instead, you should xtset your data and then run femlogit without the dummies.
xtset year
femlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2
Note that these conditional logit models can be very slow. Personally, I'm more a fan of running two one-vs-all linear regressions (1=1 and 0/-1 set to zero, then -1=1 and 0/1 set to zero). However, opinions are divided (Wooldridge appears to be a fan too, many others very much not so).

ODEINT with multiple parameters (time-dependent)

I'm trying to solve a single first-order ODE using ODEINT. Following is the code. I expect to get 3 values of y for 3 time-points. The issue I'm struggling with is ability to pass nth value of mt and nt to calculate dydt. I think the ODEINT passes all 3 values of mt and nt, instead just 0th, 1st or 2nd, depending on the iteration. Because of this, I get this error:
RuntimeError: The size of the array returned by func (4) does not match the size of y0 (1).
Interestingly, if I replace the initial condition, which is (and should be) a single value as: a0= [2]*4, the code works, but gives me a 4X4 matrix as solution, which seems incorrect.
mt = np.array([3,7,4,2]) # Array of constants
nt = np.array([5,1,9,3]) # Array of constants
c1,c2,c3 = [-0.3,1.4,-0.5] # co-efficients
para = [mt,nt] # Packing parameters
#Test ODE function
def test (y,t,extra):
m,n = extra
dydt = c1*c2*m - c1*y - c3*n
return dydt
a0= [2] # Initial Condition
tspan = range(len(mt)) # Define tspan
#Solving the ODE
yt= odeint(test, a0,tspan,args=(para,))
#Plotting the ODE
plt.plot(tspan,yt,'g')
plt.title('Multiple Parameters Test')
plt.xlabel('Time')
plt.ylabel('Magnitude')
The first order differential equation is:
dy/dt = c1*(c2*mt-y(t)) - c3*nt
This equation represents a part of murine endocrine system, which I am trying to model. The system is analogous to a two-tank system, where the first tank receives a specific hormone [at an unknown rate] but our sensor will detect that level (mt) at specific time intervals (1 second). This tank then feeds into the second tank, where the level of this hormone (y) is detected by another sensor. I labeled the levels using separate variables because the sensors that detect the levels are independent of each other and are not calibrated to each other. 'c2' may be considered as the co-efficient that shows the correlation between the two levels. Also, the transfer of this hormone from tank 1 to tank 2 is diffusion-driven. This hormone is further consumed by a biochemical process (similar to a drain valve for the second tank). At the moment, it is unclear which parameters affect the consumption; however, another sensor can detect the amount of hormone (nt) being consumed at a specific time interval (1 second, in this case too).
Thus, mt and nt are the concentrations/levels of the hormone at specific time points. although only 4-element in length in the code, these arrays are much longer in my study. All sensors report the concentrations at 1 second interval - hence tspan consists of time points separated by 1 second.
The objective is to determine the concentration of this hormone in the second tank (y) mathematically and then optimize the values of these coefficients based on the experimental data. I was able to pass these arrays mt and nt to the defined ODE and solve using ODE45 in MATLAB with no issue. I've been running into this RunTimeError, while trying to replicate the code in Python.
As I mentioned in a comment, if you want to model this system using an ordinary differential equation, you have to make an assumption about the values of m and n between sample times. One possible model is to use linear interpolation. Here's a script that uses scipy.interpolate.interp1d to create the functions mfunc(t) and nfunc(t) based on the samples mt and nt.
import numpy as np
from scipy.integrate import odeint
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
mt = np.array([3,7,4,2]) # Array of constants
nt = np.array([5,1,9,3]) # Array of constants
c1, c2, c3 = [-0.3, 1.4, -0.5] # co-efficients
# Create linear interpolators for m(t) and n(t).
sample_times = np.arange(len(mt))
mfunc = interp1d(sample_times, mt, bounds_error=False, fill_value="extrapolate")
nfunc = interp1d(sample_times, nt, bounds_error=False, fill_value="extrapolate")
# Test ODE function
def test (y, t):
dydt = c1*c2*mfunc(t) - c1*y - c3*nfunc(t)
return dydt
a0 = [2] # Initial Condition
tspan = np.linspace(0, sample_times.max(), 8*len(sample_times)+1)
#tspan = sample_times
# Solving the ODE
yt = odeint(test, a0, tspan)
# Plotting the ODE
plt.plot(tspan, yt, 'g')
plt.title('Multiple Parameters Test')
plt.xlabel('Time')
plt.ylabel('Magnitude')
plt.show()
Here is the plot created by the script:
Note that instead of generating the solution only at sample_times (i.e. at times 0, 1, 2, and 3), I set tspan to a denser set of points. This shows the behavior of the model between sample times.

AssertError when fitting a model

I have a small data set, it has less than 2000 rows. I am trying to fit a LinearRegressionModel using ML, well the data set has only one feature (which I already normalized), after the model was fitted, I evaluated it using a RegressionEvaluator and measuring metrics R2 and RMSE. Then I noticed the error was high, and hence decided to create more artificial features, in order to describe better the phenomena. To achieve this I created the following UDF (notice I check it works).
numberFeatures = 12
def addFeatures(value):
v = value.toArray()[0]
return Vectors.dense([v ** (1.0 / x) for x in xrange(2, 10)] +
[v ** x for x in xrange(1, numberFeatures)])
addFeaturesUDF = udf(addFeatures, VectorUDT())
# Here I test it
print(addFeatures(Vectors.dense(2)))
# [1.0,0.666666666667,0.5,0.4,0.333333333333,0.285714285714,0.25,0.222222222222,2.0,4.0,8.0,16.0,32.0,64.0,128.0,256.0,512.0,1024.0,2048.0]
After this I modify my DataFrame to add more features, using addFeaturesUDF, I can show it bellow.
dtBoosted = dt.withColumn("features", addFeaturesUDF(col("features")))
dtBoosted.show(5)
#+--------+-----+----------+--------------------+
#| date|price| feature| features|
#+--------+-----+----------+--------------------+
#|733946.0| 9.92|[733946.0]|[0.0,0.0,0.0,0.0,...|
#|733948.0| 8.05|[733948.0]|[4.88997555012224...|
#|733949.0| 8.05|[733949.0]|[7.33496332518337...|
#|733950.0| 7.91|[733950.0]|[9.77995110024449...|
#|733951.0| 7.91|[733951.0]|[0.00122249388753...|
#+--------+-----+----------+--------------------+
# only showing top 5 rows
And works, but when I attempt to fit the model it shows an AssertError.
dtTrain, dtValidation = dtBoosted.randomSplit([0.75, 0.25], seed=107)
lr = LinearRegression(maxIter=100, labelCol="price", featuresCol="features")
lrm = lr.fit(dtTrain)
What is the problem? What am I doing wrong? It worked with one feature and some other features!

Speedy test on R data frame to see if row values in one column are inside another column in the data frame

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.
Variable
FO.variable
Here's a link with the dput output of a sample of the dataframe: http://dpaste.com/2SJ6DPX
Please let me know if there's a better way of sharing this data.
All I want to do is create an additional binary keep column which should be:
1 if FO.variable is inside Variable
0 if FO.Variable is not inside Variable
Seems like a simple thing...in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past hours trying to get this and R and failing.
Here's what I've tried:
Using grepl for pattern matching. I've used grepl before but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to force grepl and ifelse resulting in grepl using the first value in the column instead of the entire thing.
My next attempt was to use transform and grep based off another post on SO. I didn't think this would give me my exact answer but I figured it would get me close enough for me to figure it out from there...the code ran for a while than errored because invalid subscript.
transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])
My next attempt was to use str_detect, but I don't think this is the right approach because I want the row level value and I think 'any' will literally use any value in the vector?
kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))
EDIT: Just tried a for loop. I would prefer a vectorized approach but I'm pretty desperate at this point. I haven't used for-loops before as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right not sure if I screwed up the syntax:
for(i in 1:nrow(dd)){
if(dd[i,4] %in% dd[i,2])
dd$test[i] <- 1
}
As I mentioned, my ideal output is an additional column with 1 or 0 if FO.variable was inside variable. For example, the first three records in the sample data would be 1 and the 4th record would be zero since "Direct/Unknown" is not within "Organic Search, System Email".
A bonus would be if a solution could run fast. The apply options were taking a long, long time perhaps because they were looping over every iteration across both columns?
This turned out to not nearly be as simple as I would of thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how to best approach this.
I read the data
df = dget("http://dpaste.com/2SJ6DPX.txt")
then split the 'variable' column into its parts and figured out the lengths of each entry
v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v) ## sapply(v, length) in R-3.1.3
Then I unlisted v and created an index that maps the unlisted v to the row from which it came from
uv = unlist(v)
idx = rep(seq_along(v), len)
Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable
test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))
f0 = function(dd) {
v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
len = lengths(v)
uv = unlist(v)
idx = rep(seq_along(v), len)
keep = logical(nrow(dd))
keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
keep
}
(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)
f1 = function(dd)
mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)
f1a = function(dd)
mapply(grepl, as.character(dd$FO.variable),
as.character(dd$variable), fixed=TRUE)
f2 = function(dd)
apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))
with
> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
expr min lq mean median uq max neval
f0(df) 57.559 64.6940 70.26804 69.4455 74.1035 98.322 100
f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183 100
f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115 100
f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704 100
Two subtle but important additions during the development of the timings were to use fixed=TRUE in the regular expression, and to coerce the factors to character.
I would go with a simple mapply in your case, as you correctly said, by row operations will be very slow. Also, (as suggested by Martin) setting fixed = TRUE and apriori converting to character will significantly improve performance.
transform(dd, Keep = mapply(grepl,
as.character(FO.variable),
as.character(variable),
fixed = TRUE))
# VisitorIDTrue variable value FO.variable FO.value Keep
# 22 44888657 Direct / Unknown,Organic Search 1 Direct / Unknown 1 TRUE
# 2 44888657 Direct / Unknown,System Email 1 Direct / Unknown 1 TRUE
# 6 44888657 Direct / Unknown,TV 1 Direct / Unknown 1 TRUE
# 10 44888657 Organic Search,System Email 1 Direct / Unknown 1 FALSE
# 18 44888657 Organic Search,TV 1 Direct / Unknown 1 FALSE
# 14 44888657 System Email,TV 1 Direct / Unknown 1 FALSE
# 24 44888657 Direct / Unknown,Organic Search 1 Organic Search 1 TRUE
# 4 44888657 Direct / Unknown,System Email 1 Organic Search 1 FALSE
...
Here is a data.table approach that I think is very similar in spirit to Martin's:
require(data.table)
dt <- data.table(df)
dt[,`:=`(
fch = as.character(FO.variable),
rn = 1:.N
)]
dt[,keep:=FALSE]
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
dt[dtvars,keep:=TRUE]
dt[,c("fch","rn"):=NULL]
The idea is to
identify all pairs of rn & variable (saved in dtvars) and
see which of these pairs match with rn & F0.variable pairs (in the original table, dt).

Fortran95 -- Reading from a formatted text file

I need to read some values from a table. These are the first five rows, to give you some idea of what it should look like:
1 + 3 98 96 1
2 + 337 2799 2463 1
3 + 2801 3733 933 1
4 + 3734 5020 1287 1
5 + 5234 5530 297 1
My interest is in the first four columns of each row. I need to read these into arrays. I used the following code:
program ----
implicit none
integer, parameter :: totbases = 4639675, totgenes = 4395
integer :: codtot, ks
integer, dimension(totgenes) :: ngene, lend, rend
character :: genome*4639675, sign*4
open(1,file='e_coli_g_info')
open(2,file='e_coli_g_str')
do ks = 1, totgenes
read(1,100) ngene(ks),sign(ks:ks),lend(ks), rend(ks)
end do
100 format(1x,i4,8x,a1, 2(5x,i7), 22x)
do ks = 1, 100
write(*,*) ngene(ks), sign(ks:ks),lend(ks), rend(ks)
end do
end program
The loop at the end of the program is to print the first hundred entries to test that they are being read correctly. The problem is that I am getting this garbage (the fourth row is the problem):
1 + 3 757934891
2 + 337 724249387
3 + 2801 757803819
4 + 3734 757803819
5 + 5234 757935405
Clearly, the fourth column is way off. In fact, I cannot find these values anywhere in the file that I am reading from. I am using the gfortran compiler for Ubuntu 12.04. I would greatly appreciate if somebody would point me in the right direction. I'm sure it's likely that I'm missing something very obvious because I'm new at Fortran.
Fortran formats are (traditionally, there's some newer stuff that I won't go into here) fixed format, that is, they are best suited for file formats with fixed columns. I.e. column N always starts at character position M, no ifs or buts. If your file format is more "free format"-like, that is, columns are separated by whitespace, it's often easier and more robust to read data using list formatting. That is, try to do your read loop as
do ks = 1, totgenes
read(1, *) ngene(ks), sign(ks:ks), lend(ks), rend(ks)
end do
Also, as a general advice, when opening your own files, start from unit 10 and go upwards from there. Fortran implementations typically use some of the low-numbered units for standard input, output, and error (a common choice is units 1, 5, and 6). You probably don't want to redirect those.
PS 2: I haven't tried your code, but it seems that you have a bounds overflow in the sign variable. It's declared of length 4, but then you assign to index ks which goes all the way up to totgenes. As you're using gfortran on Ubuntu 12.04 (that is, gfortran 4.6), when developing compile with options "-O1 -Wall -g -fcheck=all"