How to organize data from different datafiles in Fortran

I use Fortran 95, and now I'm facing the following problem:
I have 8 data files with 4 columns each; they are generated by another program (each file contains the solutions of differential equations for a different set of initial conditions).
The 4th column is my x variable and the 2nd column is my f(x).
All I want is to create a new file with 9 columns (with x in the first column and the f(x) from each file in the other columns).
However, each file has different values of x (and of the corresponding f), like 1.10, 1.30 and 1.40 in one and 1.15, 1.25 and 1.42 in another.
So it's OK for me to take a "band" in x, like [1.00; 1.20], write its average value as x in my new file, and then list the f(x) values falling in that band under it.
But I couldn't manage to do it.

I would try plotting the files with a smooth csplines option into a temporary file:
set format x "%10.3f"
set format y "%10.3f"
set xrange [...]
set samples ...
set table "temp1.dat"
plot 'file1.dat' using 4:2 smooth csplines
unset table
This works if you can live with the spline interpolation. There is no way to print linearly interpolated points in csv format. You might want to learn a bit of Fortran (ask whether you will need it for your further research) to do the linear interpolation. Or any other programming language.
To plot all files with one command check for example the answers on
Loop structure inside gnuplot?
Then, on linux, you can combine the generated data using colrm and paste.
cat temp1.dat | colrm 11 > x
cat temp1.dat | colrm 1 11 | colrm 12 > y1
cat temp2.dat | colrm 1 11 | colrm 12 > y2
...
paste x y1 y2 ... > combined.dat
Adjust the constants as needed.
Again, learning a programming language might also help.
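For what it's worth, the same band/interpolation idea only takes a few lines of Python with numpy (a minimal sketch; the file names, the column layout and the common x grid are assumptions you would adapt to your data):
import numpy as np

files = ["file1.dat", "file2.dat", "file3.dat", "file4.dat",
         "file5.dat", "file6.dat", "file7.dat", "file8.dat"]

# common x grid covering the band of interest, e.g. [1.00; 1.20]
x_common = np.linspace(1.00, 1.20, 21)

columns = [x_common]
for name in files:
    data = np.loadtxt(name)          # 4 columns per file
    x, f = data[:, 3], data[:, 1]    # 4th column is x, 2nd column is f(x)
    order = np.argsort(x)            # np.interp needs increasing x
    columns.append(np.interp(x_common, x[order], f[order]))

# 9 columns: the common x first, then one (linearly interpolated) f(x) per file
np.savetxt("combined.dat", np.column_stack(columns), fmt="%10.3f")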


How to get y axis range in Stata

Suppose I am using some twoway graph command in Stata. Without any action on my part, Stata will choose some reasonable values for the ranges of both the y and x axes, based both on the minimum and maximum y and x values in my data and on some algorithm that decides when it would be prettier for the range to extend to a number like '0' instead of '0.0139'. Wonderful! Great.
Now suppose that after (or while) I draw my graph, I want to slap some very important text onto it, and I want to be choosy about precisely where the text appears. Having the minimum and maximum values of the displayed axes would be useful: how can I get these min and max numbers? (Either before or while calling the graph command.)
NB: I am not asking how to set the y or x axis ranges.
Since this issue has been a bit of a headache for me for quite some time, and I believe there is no good solution out there yet, I wanted to write up two ways in which I was able to solve a similar problem to the one described in the post. Specifically, I was able to solve the issue of gray shading for part of the graph using these.
Define a global macro in the code generating the axis labels
This is the less elegant way to do it, but it works well. Locate the tickset_g.class file in your ado path. The graph twoway command uses this to draw the axes of any graph. There, I defined a global macro in the draw program that takes the value of the omin and omax locals after they have been set to the minimum of the axis range and the data range (the command that does this is local omin = min(.scale.min,omin), and analogously for the max), since the latter sometimes exceeds the former. You could also define the global further up in that code block to get only the axis extent. You can then access the axis range using the globals after the graph command (and use something like addplot to add to the previously drawn graph).
Two caveats for this approach: using global macros is, as far as I understand, bad practice and can be dangerous, so I used names I was sure wouldn't be included in any program, with the prefix userwritten. Also, you may not have the administrator privileges needed to alter this file under your organization's policies. However, it is the simpler way. If you prefer a more elegant approach along the lines of what Nick Cox suggested, then you can:
Use the undocumented gdi natscale command to define your own axis labels
The gdi commands are the internal commands used to generate what you see as graph output (cf. https://www.stata.com/meeting/dcconf09/dc09_radyakin.pdf). The tickset_g.class file uses the gdi natscale command to generate the nice numbers of the axes. Basic documentation is available with help _natscale: basically, you enter the minimum and maximum (e.g. from a summarize return) and a suggested number of steps, and the command returns a min, max, and delta to be used in the x|ylabel option (there are several possible ways to do that, all rather straightforward once you have those numbers, so I won't spell them out for brevity). You'd have to adjust this approach if you use some scale transformation.
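For reference, the idea behind such "nice number" routines can be sketched in a few lines of Python. This is only an illustration of the concept, not Stata's actual _natscale algorithm:
import math

def natscale(lo, hi, steps=5):
    # pick a 'nice' step size (1, 2, 2.5 or 5 times a power of ten)
    raw = (hi - lo) / steps
    mag = 10 ** math.floor(math.log10(raw))
    for nice in (1, 2, 2.5, 5, 10):
        if raw <= nice * mag:
            delta = nice * mag
            break
    # round the limits outwards to multiples of the step
    new_lo = math.floor(lo / delta) * delta
    new_hi = math.ceil(hi / delta) * delta
    return new_lo, new_hi, delta

print(natscale(0.0139, 0.97))   # -> (0.0, 1.0, 0.2), i.e. labels 0, .2, .4, ..., 1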
Hope this helps!
I like Nick's suggestion, but if you're really determined, it seems that you can find these values by inspecting the output after you set trace on. Here's some inefficient code that seems to do exactly what you want. Three notes:
when I import the log file I get this message:
Note: Unmatched quote while processing row XXXX; this can be due to a formatting problem in the file or because a quoted data element spans multiple lines. You should carefully inspect your data after importing. Consider using option bindquote(strict) if quoted data spans multiple lines or option bindquote(nobind) if quotes are not used for binding data.
Sometimes the data fall outside of the min and max range values that are chosen for the graph's axis labels (but you can easily test for this).
The log linesize is actually important to my code below because the key values must fall on the same line as the strings that I use to identify the helpful rows.
* start a log (critical step for my solution)
cap log close _all
set linesize 255
log using "log", replace text
* make up some data:
clear
set obs 3
gen xvar = rnormal(0,10)
gen yvar = rnormal(0,.01)
* turn trace on, run the -twoway- call, and then turn trace off
set trace on
twoway scatter yvar xvar
set trace off
cap log close _all
* now read the log file in and find the desired info
import delimited "log.log", clear
egen my_string = concat(v*)
keep if regexm(my_string,"forvalues yf") | regexm(my_string,"forvalues xf")
drop if regexm(my_string,"delta")
split my_string, parse("=") gen(new)
gen axis = "vertical" if regexm(my_string,"yf")
replace axis = "horizontal" if regexm(my_string,"xf")
keep axis new*
duplicates drop
loc my_regex = "(.*[0-9]+)\((.*[0-9]+)\)(.*[0-9]+)"
gen min = regexs(1) if regexm(new3,"`my_regex'")
gen delta = regexs(2) if regexm(new3,"`my_regex'")
gen max_temp= regexs(3) if regexm(new3,"`my_regex'")
destring min max_temp delta, replace
gen max = min + delta* int((max_temp-min)/delta)
*here is the info you want:
list axis min delta max

How to prepare the multilevel multivalued training dataset in python

I am a beginner in machine learning. My academic project involves detecting human posture from accelerometer and gyro data. I am stuck at the very beginning. My accelerometer data has x, y, z values and the gyro also has x, y, z values, stored in the files acc.csv and gyro.csv. I want to classify the 'standing', 'sitting', 'walking' and 'lying' postures. The idea is to train the machine using some (supervised) ML algorithm and then throw a new acc + gyro dataset at it to identify what this new dataset predicts (what the subject is doing at present). I am facing the following problems:
Constructing a training dataset -- I think my activities will be the dependent variable, and the acc & gyro axis readings will be independent. So if I want to combine them in a single matrix where each element of the matrix again has its own set of acc and gyro values [something like a main matrix of sub-matrices], how can I do that? Or is there an alternative way to achieve the same thing?
How can I put the data of multiple activities, each with multiple readings, into a single training matrix?
I mean 10 walking records each with its own acc (xyz) and gyro (xyz) + 10 standing records each with its own acc (xyz) and gyro (xyz) + 10 sitting records each with its own acc (xyz) and gyro (xyz), and so on.
Each data file has a different number of records and timestamps; how do I bring them onto a common platform?
I know I am asking very basic things, but these are the confusing parts nobody has clearly explained to me. I feel like I am standing in front of a big closed door, inside which very interesting things are happening that I cannot take part in at this moment with my limited knowledge. My mathematical background is high-school level only. Please help.
I have gone through some projects on activity recognition on GitHub, but they are way too complicated for a beginner like me.
import pandas as pd
import os
import warnings
from sklearn.utils import shuffle
warnings.filterwarnings('ignore')
os.listdir('../input/testtraindata/')
base_train_dir = '../input/testtraindata/Train_Set/'
#Train Data
train_data = pd.DataFrame(columns = ['activity','ax','ay','az','gx','gy','gz'])
train_folders = os.listdir(base_train_dir)
for tf in train_folders:
    files = os.listdir(base_train_dir + tf)
    for f in files:
        df = pd.read_csv(base_train_dir + tf + '/' + f)
        train_data = pd.concat([train_data, df], axis = 0)
train_data = shuffle(train_data)
train_data.reset_index(drop = True, inplace = True)
train_data.head()
[Screenshot: the data set]
[Screenshot: the problem in Train_set]
Surprisingly, if I remove the last 'gz' from
train_data = pd.DataFrame(columns = ['activity','ax','ay','az','gx','gy','gz'])
everything works fine.
Do you have the data labeled? That is, does each x, y, z reading map to a posture?
I have no clue about the values (I have not seen the dataset and know little about the postures, acc or gyro), but I'm guessing you should have a dataset in a matrix with x, y, z as feature columns and a target column "position".
If you need all 6 (3 from one csv and 3 from the other) to define the positions, you can make 6 feature columns + the position.
Something like: x_1, y_1, z_1, x_2, y_2 and z_2 + a position label (the "position" column).
You can also make each position its own column with 0/1 as true/false:
"sitting", "walking" etc., with 0 and 1 as the values in the columns.
Is the timestamp of any importance to the position? If it is not an important feature, I would just drop it. If it is important in some way, you might want to bin the timestamps.
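To make that concrete, here is a rough pandas sketch of the idea (the 'timestamp' column, the 'ax'..'gz' names and the 'walking' label are assumptions about your CSV layout; adjust them to your files):
import pandas as pd

# read both sensor files; each is assumed to have a 'timestamp' column
acc = pd.read_csv('acc.csv').sort_values('timestamp')    # timestamp, ax, ay, az
gyro = pd.read_csv('gyro.csv').sort_values('timestamp')  # timestamp, gx, gy, gz

# align the two sensors on the nearest timestamp, one row per sample
sample = pd.merge_asof(acc, gyro, on='timestamp', direction='nearest')
sample['activity'] = 'walking'   # the label for this recording session

# stack many labelled recordings the same way, then split features and target
data = pd.concat([sample], ignore_index=True)   # add more labelled recordings to this list
X = data[['ax', 'ay', 'az', 'gx', 'gy', 'gz']]  # feature matrix
y = data['activity']                            # target labels
y_onehot = pd.get_dummies(y)                    # optional one-hot encoding: one 0/1 column per posture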
Here is a beginner's guide from Medium in which you can see a bit of how to preprocess your data. It also shows one-hot encoding :)
https://medium.com/hugo-ferreiras-blog/dealing-with-categorical-features-in-machine-learning-1bb70f07262d
Also try googling "preprocessing your data"; then you will probably find the right recipe.

Calculating p-value by hand from Stata table

I want to ask how to compute the p-value without a t-stat table, just by looking at the regression output, like the table on the first page of the PDF at the following link: http://faculty.arts.ubc.ca/dwhistler/326UBC/stataHILL.pdf . For example, if I didn't know the value 0.062, how could I work out that it is 0.062 from the other information in the table?
You need to use the ttail() function, which returns the reverse cumulative Student's t distribution, aka the probability T > t:
display ttail(38,abs(_b[_cons]/_se[_cons]))*2
The first argument, 38, is the degrees of freedom (sample size less number of parameters), while the second, 1.92, is the absolute value of the coefficient of interest divided by its standard error, or the t-stat. The factor of two comes from the fact that Stata is doing a two-tailed test. You can also use the stored DoF with
display ttail(e(df_r),abs(_b[_cons]/_se[_cons]))*2
You can also do the integration of the t density by "hand" using Adrian Mander's integrate:
ssc install integrate
integrate, f(tden(38,x)) l(-1.92) u(1.92)
This gives you 0.93761229, but you want Pr(T>|t|), which is 1-0.93761229=0.06238771.
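If you want to sanity-check these numbers outside Stata, the same two-tailed p-value is a one-liner with SciPy (assuming SciPy is available; 1.92 and 38 are the t-stat and degrees of freedom from the example above):
from scipy import stats

t_stat, df = 1.92, 38
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-tailed Pr(|T| > t)
print(round(p_value, 4))                    # ~0.0624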
If you look at many statistics textbooks, you will find a table called the Z-table which will give you the probability that Z is beyond your test statistic. The table is actually a cumulative distribution function of the normal curve.
When people went to school with 4-function calculators, one or more of the questions on the statistics test would include a copy of this Z-table, and the dear students would have to interpolate between columns of numbers to find the p-value. In your example, you would find the probability for your test statistic falling between .06 and .07, and those fingers would tap out that it was closer to .06 and do a linear interpolation to come up with .062.
Today, the p-value is something that Stata or SAS will calculate for you.
Here is another SO question that may be of interest: How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)?
Here is a basic page on how to determine p-value "by hand": http://www.dummies.com/how-to/content/how-to-determine-a-pvalue-when-testing-a-null-hypo.html
Here is how you can determine p-value using Excel: http://ms-office.wonderhowto.com/how-to/find-p-value-with-excel-346366/
===EDIT===
My Stata text ("Microeconometrics using Stata", Revised Ed, Cameron & Trivedi) says the following on p. 402.
* p-values for t(30), F(1,30), Z, and chi(1) at y=2
. scalar y=2
. scalar p_t30 = 2 * ttail(30,y)
. scalar p_f1and30 = Ftail(1,30,y^2)
. scalar p_z = 2 * (1 - normal(y))
. scalar p_chi1 = chi2tail(1,y^2)
. display "p-values" " t(30)=" %7.4f p_t30
p-values t(30) = 0.0546

gnuplot 2-D plotting from arrays - realtime

I have to do realtime plotting of the scan values of a sensor. I am using gnuplot for this purpose. So far, I am able to communicate with gnuplot from my C++ program. I tried some sample plots using a .DAT file and it is working. Now my requirement is to plot the last 5 sensor scans in a single plot for comparison (that means I need to store 10 arrays of data; 1 scan has two arrays, X and Y).
What I am trying to do is to store the last 5 scan values in column format in a .DAT file like the one below, where x, y are my two arrays for each scan, and then use the gnuplot commands "plot 'filename.dat' using 1:2", "plot 'filename.dat' using 3:4", etc. Then I have to rewrite the file after every 5 scans.
X1 Y1 X2 Y2 X3 Y3 X4 Y4 X5 Y5
2.3 3.4 6.6 3.6 5.5 6.5 8.5 5.5 4.5 6.6
4.3 4.5 6.2 7.7 4.3 9.2 1.4 6.9 2.4 7.8
I just want to confirm before proceeding whether this is efficient enough for real-time processing. Also, is there any command in gnuplot to plot directly from two arrays without the use of .dat files? I did not find one in my search.
Any suggestions would be helpful.
Presumably, you are communicating with gnuplot via pipes. Since gnuplot is a separate process, it does not have access to your program's memory space and therefore it cannot plot your data without you sending it somehow. The most straightforward way is what you mentioned (create a temporary file, then send a command to gnuplot to read/plot the temporary file). Another straightforward way is to use gnuplot's inline data. It works like:
plot '-' using ... with ...
x1 y1
x2 y2
x3 y3
...
e
In this case, the data are written directly to the gnuplot pipe with no need for a temporary file. (For more about the pseudo-file '-', see help datafile special-filenames in the gnuplot documentation.)
As far as this approach being useful in realtime -- as long as the gnuplot rendering speed is fast compared to the time between re-rendering, it should work fine. (I guess there are some memory issues too if your arrays are HUGE, but I doubt that would limit any real application with only 10 1-D arrays -- and if the arrays are that big, you probably shouldn't be sending the whole thing to gnuplot anyway)
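For example, from a script the inline-data approach looks roughly like this (sketched here in Python rather than C++, with made-up data; the same sequence of writes works over a popen()'d pipe in C++):
import math
import subprocess

# open a pipe to gnuplot
gp = subprocess.Popen(['gnuplot', '-persist'], stdin=subprocess.PIPE, text=True)

xs = [i * 0.1 for i in range(100)]
ys = [math.sin(x) for x in xs]          # stand-in for one sensor scan

gp.stdin.write("plot '-' using 1:2 with lines title 'scan'\n")
for x, y in zip(xs, ys):
    gp.stdin.write(f"{x} {y}\n")        # the data follow the plot command directly
gp.stdin.write("e\n")                   # 'e' terminates the inline data block
gp.stdin.flush()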
Take a look at this: https://github.com/dkogan/feedgnuplot
It's a general-purpose tool to plot standard input. It is able, among other things, to make realtime plots of data, as it comes in. If you have data in a format not directly supported, preprocess your stream with something like awk or perl.

Geo question: How to generate a .wld file given some ground control points?

OK, so I have a jpeg and a bunch of ground control points (x/y pixels and corresponding lat/lon references).
I'd like to generate a .wld world file to accompany the jpeg, from the command line. My coordinate system is Google Maps, i.e. EPSG:900913.
I know that I can use gdal_translate to generate a .vrt given the GCPs, but what I need is a .wld file. (Not really clear on the difference, but that's definitely what I need!)
Anyone have any idea how to do this?
Thanks
Richard
A world file is basically a 6 line ascii text file determining your georeferencing. If you have a set of GCPs, you'll need to map them (using some tool like gdal) to a single affine transform.
I don't believe the gdal command line utilities give you the ability to just directly make a world file, although some drivers within GDAL will do this for you when you write an image if you set WORLDFILE=yes in the driver. You'll have to check the driver for your specific format to see if it supports this.
If it does not, though, you can do this easily by hand. Just make the .VRT file using the GCPs, and look at it in a text editor. It will have a section like this:
<GeoTransform>440720.0, 60, 0.0, 3751320.0, 0.0, -60.0</GeoTransform>
This "GeoTransform" is the affine transform used by a world file. All you need to do is make an ascii file that puts that with one value per line, like so:
60
0.0
0.0
-60.0
440720.0
3751320.0
That will be a valid .WLD file for your application.
FYI - the 6 numbers in the world file are the x pixel size, the y shift per x pixel, the x shift per y pixel, the y pixel size (normally negative), the x origin, and the y origin. (The shift terms provide rotation/shearing capability in the affine transform. Typically they'll be 0/0, since you normally want orthorectified imagery.)
For details, see Wikipedia's entry on Worldfiles.
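If you'd rather script this than copy numbers by hand, the GDAL Python bindings can do the GCP fit and the world-file reordering in a few lines (a sketch; 'with_gcps.vrt' and 'image.wld' are placeholder names):
from osgeo import gdal

ds = gdal.Open('with_gcps.vrt')                 # the dataset carrying your GCPs
gt = gdal.GCPsToGeoTransform(ds.GetGCPs())      # first-order affine fit to the GCPs
# gt is (x_origin, x_size, x_rot, y_origin, y_rot, y_size)

with open('image.wld', 'w') as wld:
    # world-file order: x size, y rotation, x rotation, y size, x origin, y origin
    for value in (gt[1], gt[4], gt[2], gt[5], gt[0], gt[3]):
        wld.write(f"{value}\n")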
There is a Python script, gcps2wld.py, in the GDAL repository which translates the set of GCPs on a file into a first-order approximation in world-file format.
I would add to Reed's answer that it's possible to give GCPs as command line parameters to gdal_translate:
gdal_translate -gcp 1 2 3 4 -gcp 6 7 8 9 [-gcp ...] -of GTiff inp.img out.tif