using data not sorted (nearmrg) - stata

I am trying to use nearmrg on my data files and I keep getting the same error:
using data not sorted
To narrow down the problem, I used very simple test data instead of my real data, and the error message still shows up. Now I have the following:
Master.dta:
Group Date
A 15.01.2012
A 15.02.2012
B 15.01.2012
B 15.02.2012
C 15.01.2012
C 15.02.2012
Using.dta:
Group Date SVarOfInterest1 SVarOfInterest2
A 01.01.2012 1 201
A 15.01.2012 2 202
A 03.02.2012 3 203
A 23.02.2012 4 204
B 03.01.2012 11 211
B 19.01.2012 12 212
B 03.02.2012 13 213
C 20.01.2012 21 221
C 25.01.2012 22 222
C 04.02.2012 23 223
C 03.01.2012 24 224
This is the code:
nearmrg Group using Using.dta, nearvar(Date) genmatch(SourceDate) lower
using data not sorted
r(5);

It appears that Stata thinks that your using data is not sorted. Even if it looks sorted to you, run the sort command on each data file prior to running nearmrg.
tempfile myTemp
<read in Using file>
sort Group
* save temporary file
save "`myTemp'"
<read in master file>
sort Group
nearmrg Group using `myTemp', nearvar(Date) genmatch(SourceDate) lower
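As a concrete sketch with the two test files from the question (the file locations, and the extra sort on Date, are assumptions; the essential point is that both files get sorted before nearmrg runs):
tempfile myTemp
use Using.dta, clear
* Date should be a real Stata date (e.g. built with date()) so that sorting and nearest-matching behave sensibly
sort Group Date
save "`myTemp'"
use Master.dta, clear
sort Group Date
nearmrg Group using "`myTemp'", nearvar(Date) genmatch(SourceDate) lower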
As a side note, nearmrg is not part of base Stata. It is helpful if you mention that it is a user-written package in your question.

Related

Is there a way to put a section of a line at the start of every subsequent line using regular expressions?

I have a text file in which there is a line with the category and then all items of that category in lines below it. This is followed by 2 empty lines and then the title of the next category and more items in the category. I want to know how I could use regular expressions (specifically with Notepad++) in order to put the category at the start of each of the item's lines so I can save the file as a CSV or TAB file.
I started by isolating one of the categories as such:
Городищенский поссовет 1541
Арабовщина 535
Болтичи 11
Бриксичи 59
Великое Село 160
Гарановичи 34
Грибовщина 3
Душковцы 5
Зеленая 182
Кисели 97
Колдычево 145
Конюшовщина 16
Микуличи 31
Мостытычи 18
Насейки 5
Новоселки 45
Омневичи 53
Поручин 43
Пруды 24
Станкевичи 42
Ясенец 33
I then got as far as searching for
(.+)(поссовет)(\t\d{4}\r\n)(^.*$\r\n)
and replacing with
$1$2\t$4
which makes the first line
Арабовщина 535
turn into
Городищенский поссовет Арабовщина 535
which is what I want to happen to the rest of the lines but I couldn't get any farther.
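If the Notepad++ regex keeps stopping after the first item line, one hedged alternative is a small stand-alone Python sketch of the same transformation (file names are placeholders, and the layout is assumed to be exactly as described above, with the name and count tab-separated):
import io

with io.open("input.txt", encoding="utf-8") as fin, io.open("output.txt", "w", encoding="utf-8") as fout:
    category = None
    for line in fin:
        line = line.rstrip("\r\n")
        if not line:                              # a blank line ends the current block
            category = None
            continue
        if category is None:                      # the first non-blank line of a block is the category
            category = line.rsplit("\t", 1)[0]    # keep the name, drop the trailing count
            continue
        fout.write(category + "\t" + line + "\n")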

csv parsing and manipulation using python

I have a CSV file which I need to parse using Python.
triggerid,timestamp,hw0,hw1,hw2,hw3
1,234,343,434,78,56
2,454,22,90,44,76
I need to read the file line by line and slice the triggerid, timestamp and hw3 columns from these. But the column sequence may change from run to run, so I need to match the field name, count the column and then print out the output file as:
triggerid,timestamp,hw3
1,234,56
2,454,76
Also, is there a way to generate a hash table (like we have in Perl) such that I can store the entire column for hw0 (hw0 as key and the values in the column as values) for other modifications?
I'm unsure what you mean by "count the column".
An easy way to read the data in would use pandas, which was designed for just this sort of manipulation. This creates a pandas DataFrame from your data using the first row as titles.
In [374]: import pandas as pd
In [375]: d = pd.read_csv("30735293.csv")
In [376]: d
Out[376]:
triggerid timestamp hw0 hw1 hw2 hw3
0 1 234 343 434 78 56
1 2 454 22 90 44 76
You can select one of the columns using a single column name, and multiple columns using a list of names:
In [377]: d[["triggerid", "timestamp", "hw3"]]
Out[377]:
triggerid timestamp hw3
0 1 234 56
1 2 454 76
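If you then need to write that reduced selection back out as a file, as the question describes, DataFrame.to_csv will do it (the output name here is only a placeholder):
d[["triggerid", "timestamp", "hw3"]].to_csv("sliced.csv", index=False)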
You can also adjust the indexing so that one or more of the data columns are used as index values:
In [378]: d1 = d.set_index("hw0"); d1
Out[378]:
triggerid timestamp hw1 hw2 hw3
hw0
343 1 234 434 78 56
22 2 454 90 44 76
Using the .loc attribute you can retrieve a series for any indexed row:
In [390]: d1.loc[343]
Out[390]:
triggerid 1
timestamp 234
hw1 434
hw2 78
hw3 56
Name: 343, dtype: int64
You can use the column names to retrieve the individual column values from that one-row series:
In [393]: d1.loc[343]["triggerid"]
Out[393]: 1
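For the hash-table part of the question, the same DataFrame can also be turned into a plain dict that maps each column name to its list of values:
ht = d.to_dict("list")    # {'triggerid': [1, 2], 'timestamp': [234, 454], 'hw0': [343, 22], ...}
ht["hw0"]                 # [343, 22]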
Since you already have a solution for the slices here's something for the hash table part of the question:
import csv

with open('/path/to/file.csv', 'rb') as fin:
    ht = {}
    cr = csv.reader(fin)
    k = cr.next()[2]           # third header field (hw0 in the example) becomes the key
    ht[k] = list()
    for line in cr:
        ht[k].append(line[2])  # collect that column's values
I used a different approach (using the .index function):
bpt_mode = ["bpt_mode_64", "bpt_mode_128"]
with open('StripValues.csv') as file:
    # keep the header row so column positions can be looked up by name
    stats = next(file).strip().split(",")
    for line in file:
        stat_values = line.split(",")
        print stat_values[stats.index('trigger_id')], ',',
        for j in range(len(bpt_mode)):
            print stat_values[stats.index('hw.gpu.s0.ss0.dg.' + bpt_mode[j])], ',',
        print
# the with block closes the file automatically, so no explicit file.close() is needed
@holdenweb Though I am unable to figure out how to print the output to a file; currently I am redirecting while running the script.
Can you provide a solution for writing to a file? There will be multiple writes to a single file.
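As a sketch of that follow-up (the file and column names are only placeholders, taken from the question), the csv module can write the selected fields to an output file instead of printing them:
import csv

wanted = ['triggerid', 'timestamp', 'hw3']             # columns to keep, matched by header name

with open('input.csv', 'rb') as fin, open('sliced.csv', 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    header = next(reader)
    cols = [header.index(name) for name in wanted]     # positions survive column reordering
    writer.writerow(wanted)
    for row in reader:                                 # every input row adds one write to the same file
        writer.writerow([row[i] for i in cols])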

Plot sub-sections of Pandas Dataframes in Python - all legend entries in one column

I have the following Pandas DataFrame in Python 2.7.
Name Date Val_Celsius Rm_Log
Lite 2012-07-17 77 12
Lite 2012-12-01 76 -21
Lite 2012-09-01 79 73
Lite 2013-12-01 78 945
Staed 2012-07-17 105 36
Staed 2012-12-01 104 19
Staed 2012-09-01 102 107
Staed 2013-12-01 104 11
ArtYr 2012-07-17 -11 100
ArtYr 2012-12-01 -14 21
ArtYr 2012-09-01 -10 68
ArtYr 2013-12-01 -12 83
I need to plot the Rm_Log numbers as the y-variable and I need to plot the Date as the x-variable.
However, I need there to be 3 separate overlapping plots on the same figure - 1st plot for Lite, 2nd for Staed and 3rd for ArtYr. I need the legend for the figure to show 3 entries, Lite, Staed and ArtYr.
I have never done a plot like this before. Usually, I have separate columns but here the numbers are arranged differently.
If I create 3 separate DataFrames for each Name then it is possible to plot. However, the Name column typically has a lot more entries than just the 3 that I have shown, so this method is very time consuming. Also, the number of entries is not known ahead of time, i.e. here I have shown 3 entries, Lite, Staed and ArtYr, but there may be 50 or 100. I cannot create 50-100 DataFrames each time I need to generate one figure.
How can I show overlapping plots of Rm_Log vs Date, one for each Name value, on the same figure? Is it possible to display the dates vertically on the x-axis?
EDIT:
Error I get when using ax.set_ticks(df.index):
File "C:\Python27\lib\site-packages\matplotlib\axes\_base.py", line 2602, in set_xticks
return self.xaxis.set_ticks(ticks, minor=minor)
File "C:\Python27\lib\site-packages\matplotlib\axis.py", line 1574, in set_ticks
self.set_view_interval(min(ticks), max(ticks))
File "C:\Python27\lib\site-packages\matplotlib\axis.py", line 1885, in set_view_interval
max(vmin, vmax, Vmax))
File "C:\Python27\lib\site-packages\matplotlib\transforms.py", line 973, in _set_intervalx
self._points[:, 0] = interval
ValueError: invalid literal for float(): 2012-07-17
If you don't want to use anything besides native pandas, you can still do this pretty easily:
df.reset_index().set_index(["Name", "Date"]).unstack("Name")["Rm_Log"].plot(rot=90)
First, you set a MultiIndex on Name and Date, then you unstack it so that each entry in the Name column becomes its own column. Then you select the Rm_Log column and plot it. The argument rot=90 rotates the xticks. You could also separate this into several lines, but I thought I'd keep it as one to show how it could be done without modifying the DataFrame.
This is where ggplot, inspired by R, is great for its simplicity: you do not have to modify your DataFrame.
from ggplot import *
ggplot(df, aes(x='Date', y='Rm_Log', color='Name')) + geom_line()
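Another option, sketched here with a groupby loop and plain matplotlib (assuming df is the DataFrame shown above, with Date still stored as strings), scales to any number of names and keeps the date labels vertical:
import pandas as pd
import matplotlib.pyplot as plt

df["Date"] = pd.to_datetime(df["Date"])    # real dates avoid the "invalid literal for float()" error
fig, ax = plt.subplots()
for name, grp in df.groupby("Name"):       # one line per Name, all on the same axes
    ax.plot(grp["Date"], grp["Rm_Log"], label=name)
ax.legend()
plt.xticks(rotation=90)                    # vertical dates on the x-axis
plt.show()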

Working with files I/O for beginners [closed]

Hi all, I am working on a beginner's school project using file I/O in C++.
This program consists of two parts:
1) reading and processing a student data file, and writing the results to a student report file
2) modifying part 1 to calculate some statistics and writing them to another file.
For this assignment, you will be reading one input file and writing out two other files.
Your program will be run using the referenced student data file.
Part 1 Detail
Read in the student data file. This 50 record file consists of a (8-digit numeric) student id, 8 assignment's points, midterm points, final points and lab exercise points. You must again follow the syllabus specifications for the determination of the letter grade, this time, processing 50 student grades. Extra credit points are not applicable for this assignment. You will write the input student data and the results of the processing into a student report file that looks like the output shown below. In addition to the input student data, the report should contain the "total" of the assignment grades, the total and percent of all points achieved, and the letter grade. You may assume that the input data file does not contain any errant data.
The file looks like the one below:
The file that we need to read from is hyperlinked here
The student report output file should look like this:
The Student Report Output File
Student --- Asignment Grades -- Ass Mid Fin LEx Total Pct Gr
-------- -- -- -- -- -- -- -- -- --- --- --- --- ----- --- --
56049257 16 16 20 16 12 15 12 20 115 58 123 59 355 89 B+
97201934 19 15 13 19 16 12 13 18 113 72 101 55 341 85 B
93589574 13 16 19 19 18 12 6 14 111 58 108 50 327 82 B
85404010 17 19 19 19 19 10 17 19 129 70 102 58 359 90 A-
99608681 11 15 19 19 17 10 16 19 116 42 117 57 332 83 B
84918110 11 20 18 17 12 8 12 19 109 46 122 31 308 77 C
89307179 16 16 19 18 14 17 15 19 120 56 117 52 345 86 B
09250373 15 15 18 18 11 18 17 19 120 44 106 51 321 80 B-
91909583 12 14 16 19 20 11 20 16 117 66 92 50 325 81 B-
...
Part 2 Detail
Write a summary report file that contains the average total points and average percent for all students. Also, display the number of A's, B's, C's, D's and F's for the students. Your summary output file should look something like this:
The average total points = ???
The average percent total = ??
The number of A's = ??
The number of B's = ??
The number of C's = ??
The number of D's = ??
The number of F's = ??
Additional requirements
All files must be checked for a successful open. They should also be closed when you are finished with them.
Make sure you write the student id with a leading 0, if appropriate (i.e. the 8th id).
Add headings to your output report file. They should be aligned and correctly identify the column data.
Do not use global variables, except for constants, in your solution.
For part 1, how do I copy the input file into the new report file, formatting it so that the headings appear above the data and the computed grades appear at the end of each line?
Any help in this matter would be appreciated.
Thanks in advance.
Engineering is all about converting a large, complex problem into many smaller, easy-to-solve problems.
Here is how I would start.
1.) Open input file.
2.) Read one line from input file.
3.) Break the input string from one line into values.
4.) Close input file.
5.) Open output file.
6.) Write results to output file.
References:
1.) File I/O
2.) std::string
3.) File I/O (C)
Now you're pretty much there. Take it one step at a time.
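A minimal sketch of those steps (the file names and record layout here are assumptions, not part of the assignment spec) might look like this:
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream in("students.dat");              // 1.) open the input file and check it
    if (!in) { std::cerr << "could not open input file\n"; return 1; }
    std::ofstream report("report.txt");            // 5.) open the output file and check it
    if (!report) { std::cerr << "could not open output file\n"; return 1; }

    std::string line;
    while (std::getline(in, line)) {               // 2.) read one line at a time
        std::istringstream fields(line);           // 3.) break the line into values
        std::string id;                            //     id kept as a string so a leading 0 survives
        fields >> id;
        int points = 0, total = 0;
        while (fields >> points) total += points;  //     sum whatever scores follow the id
        report << id << " total = " << total << '\n';   // 6.) write the result for this student
    }
    return 0;                                      // 4.) both files close automatically here
}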

How to export SAS dataset to a .dat

I have a SAS dataset chapter5.pressure and I verified that it is fine by printing it with PROC PRINT:
Obs SBPbefore SBPafter
1 120 128
2 124 131
3 130 131
4 118 127
5 140 132
6 128 125
7 140 141
8 135 137
9 126 118
10 130 132
11 126 129
12 127 135
So, I want to export it to a .dat file, but the following method does not work:
libname chapter5 'c:\users\owner\desktop\sas\chapter5';
data _null_;
  set chapter5.pressure;
  file 'c:\users\owner\desktop\sas\chapter5\xxx.dat';
  put a b;
run;
The resulting file has all missing values. Why?
Try using the variable names instead of "a" and "b".
data _null_;
  set chapter5.pressure;
  file 'c:\users\owner\desktop\sas\chapter5\xxx.dat';
  put SBPbefore SBPafter;
run;
Instead of using a put statement, you can also use PROC EXPORT to create a delimited file from a SAS dataset:
PROC EXPORT
  DATA=chapter5.pressure
  OUTFILE='c:\users\owner\desktop\sas\chapter5\xxx.dat'
  DBMS=DLM
  REPLACE;
RUN;
The default delimiter is a blank, which should match what you are trying to do. To create a tab or comma-delimited file, change the DBMS option value to TAB or CSV respectively. This will create a header row in the external file. Here is a link to the SAS 9.2 documentation. Check the SAS support site if you are using a different version.
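For example, the comma-delimited variant changes only the DBMS value and, typically, the file extension:
PROC EXPORT
  DATA=chapter5.pressure
  OUTFILE='c:\users\owner\desktop\sas\chapter5\xxx.csv'
  DBMS=CSV
  REPLACE;
RUN;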