Search and match data from two files in Python

Search and match data from two files in Python - python-2.7

I have two different files; A reference file and a data sets with different length.
A reference file ("location.dat") contains:
40505 5.0666667 102.2166667
40517 5.6833333 101.8500000
40586 5.7666667 102.2000000
40587 5.8166667 102.0500000
40663 6.0333333 102.1166667
41525 5.5500000 100.4833333
41529 5.3500000 100.4000000
...............
...............
A data sets ("input.dat") contains:
40517 2014 12 18 0 17.4
40586 2014 12 18 0 9.9
40587 2014 12 18 0 15.5
40663 2014 12 18 0 30.9
41525 2014 12 18 0 0
41529 2014 12 18 0 0
41540 2014 12 18 0 0
41543 2014 12 18 0 0
41548 2014 12 18 0 0
41549 2014 12 18 0 0
41551 2014 12 18 0 0
41610 2014 12 18 0 0
Question:
How to search and match data set so that the output file will combine certain selected values from both files like this:
output.dat
40517 5.6833333 101.8500000 17.4
40586 5.7666667 102.2000000 9.9
40587 5.8166667 102.0500000 15.5
............
...........
The current script is:
data1=np.loadtxt('location.dat')
lats1=data1[:,1]
lons1=data1[:,2]
code1=data1[:,0]
data2=np.loadtxt('input.dat')
rain=data2[:,5]
code2=data2[:,0]
ind=[]
for i in range(len(data1)):
dist=code1[i]
ind.append(np.where(dist==np.int(dist))[0][0])
rain2=rain[ind]
data3=np.array([code1,lats1,lons1,rain2])
data3=np.transpose(data3)
np.savetxt('output.dat',data3,fmt='%9.3f')
The current result
40517.000 5.683 101.850 0.000
40586.000 5.767 102.200 0.000
40587.000 5.817 102.050 0.000
40663.000 6.033 102.117 0.000
41525.000 5.550 100.483 0.000
41529.000 5.350 100.400 0.000
41540.000 5.383 100.550 0.000
The rain2 values did not properly append from the input file. How to convert the first column output to be integer?.Any ideas of what went wrong??.TQ

The line
ind.append(np.where(dist==np.int(dist))[0][0])
does not make sense in your code. This will always append 0 if distis an integer (as dist==np.int(dist) is simply the array [True])
A better way to solve your problem is to create a look-up table from the data in location.dat
data1=np.loadtxt('location.dat')
lookup = {int(round(id_)):(lat,long) for id_, lat, long in data1}
Please note that the best way to convert a float to an int in python is to use int(round(i))
You can then iterate over the data in your other file and create the proper line
data3 = []
for line in data2:
ind = int(round(line[0]))
data3.append([ind, lookup[ind][0], lookup[ind][1], line[5]])
To save the data, you may want to format and write the line one after the other or use savetxt.

Related

Search for a value in multiple columns and return value from Column A

Sheet Example:
A
B
C
D
E
F
Jonas
1
6
11
16
21
Joaquin
2
7
12
17
22
William
3
8
13
18
23
Mark
4
9
14
19
24
Stuart
5
10
15
20
25
Search value example:
19
Expected Return:
Mark
Formula indicated:
https://stackoverflow.com/a/55119579/11462274
=QUERY(Clients!A1:F, "select A where B="&B1&"
or C="&B1&"
or D="&B1&"
or E="&B1&"
or F="&B1&"", 1)
But the result is:
Jonas
Stuart
Why is Jonas returning when there is no value 19 in row 1?
An additional info:
If I have Columns from B to CC with values, is this still the indicated method? I ask because of the immense amount of lines I would have to write one by one for each of these columns.

try:
=INDEX(TEXTJOIN(", ", 1, IF(B1:F5=I1, A1:A5, )))

Try using an INDEX(MATCH()) formula.
=INDEX(A1:A5,MATCH(19,B4:F4,0))

Matching observations based on a single variable or multiple variables on a single data set - stata

I need to match observations based on an index variable that measures home conditions, personal variables such as age, gender, education, etc. and year. My home index variable is numerical (from 0 to 103) and the personal characteristics are either dummies or categorical variables. For my analysis I need to match the most similar observations based on these variables. It is sort of a nearest neighbor match but without having a control or treatment group.
The dataset looks something like this.
indice_hogar anio mes directorio orden mujer nivel__educativo_cat trabaja
0 2018 08 4700731 1 1 4 1
0 2018 08 4700731 2 0 5 1
0 2018 11 4777752 1 0 5 1
37 2018 04 4605803 1 0 3 1
42 2011 07 2735691 1 1 4 1
42 2018 02 4545459 1 0 3 1
43 2018 12 4803694 1 0 5 1
44 2018 10 4747974 1 0 5 1
46 2018 05 4610096 1 0 3 1
47 2018 04 4598828 1 1 1 0
47 2018 08 4687722 1 0 1 0
48 2018 04 4592941 1 0 5 0
48 2018 06 4636177 1 0 3 1
50 2018 06 4645892 1 0 1 1
50 2018 06 4645892 2 1 4 1
For better understanding, I am using an IV that is the ability of the most similar person according to the index and to personal characteristics. That means I need to find the most similar observation to, for example, person A and then be able to take the match's abilities and use it for a regression.
I have not been able to create a code.

Duplicate your dataset, and match the 1st copy to the 2nd using nnmatch.
* Duplicate the data set
gen byte treat = 1
gen nobs = _N
save temp, replace
replace treat = 0
append using temp
* Make a fake outcome variable to keep nnmatch happy
gen byte outcome = runiform()<.5
* nnmatch performs a nearest neighbor match, return the id of the matched cases as nnid
teffects nnmatch (outcome indice_hogar nivel_educativo_cat trabaja) (treat), gen(nnid)
* Unduplicate the data set
keep if treat == 0
* change nnid to point to the 1st copy of the data set, not the 2nd
replace nnid = nnid - nobs

How to convert Object to Float in Python

I have the following dataframe :
Daily_KWH_System year month day hour minute second
0 4136.900384 2016 9 7 0 0 0
1 3061.657187 2016 9 8 0 0 0
2 4099.614033 2016 9 9 0 0 0
3 3922.490275 2016 9 10 0 0 0
4 3957.128982 2016 9 11 0 0 0
5 4177.014316 2016 9 12 0 0 0
6 3077.103445 2016 9 13 0 0 0
7 4123.103795 2016 9 14 0 0 0
.. ... ... ... ... ... ... ...
551 NaN 2016 11 23 0 0 0
552 NaN 2016 11 24 0 0 0
553 NaN 2016 11 25 0 0 0
.. ... ... ... ... ... ... ...
579 NaN 2016 11 27 0 0 0
580 NaN 2016 11 28 0 0 0
The variables type is as follows:
print(df.dtypes)
Daily_KWH_System object
year int32
month int32
day int32
hour int32
minute int32
second int32
I need to convert "Daily_KWH_System" to Float, so that I use in Linear Regression model.
I tried the below code, which worked fine.
df['Daily_KWH_System'] = pd.to_numeric(df['Daily_KWH_System'], errors='coerce')
Then I replaced the NaN's to Blank space, to use in my model. And I used the following code
df = df.replace(np.nan,' ', regex=True)
But, again the variable " Daily_KWH_System" is getting converted to Object as soon as i replace NaN'.
Please let me know how to go about it

performing chi squared test in SAS using PROC FREQ

Our university is forcing us to perform the old school chi square test using PROC FREQ (I am aware of the options with proc univariate).
I have generated one theoretical exponential distribution with Beta=15 (and written down the values laboriously), and I've generated 10000 random variables which have an exponential distribution, with beta=15.
I try to first enter the frequencies of my random variables (in each interval) via the datalines command:
data expofaktiska;
input number count;
datalines;
1 2910
2 2040
3 1400
4 1020
5 732
6 531
7 377
8 305
9 210
10 144
11 106
12 66
13 40
14 45
15 29
16 16
17 12
18 8
19 8
20 3
21 2
22 0
23 1
24 2
25 0
26 2
;
run;
This seems to work.
I then try to compare these values to the theoretical values, using the chi square test in proc freq (the one we are supposed to use)
As follows:
proc freq data=expofaktiska;
weight count;
tables number / testp=(0.28347 0.20311 0.14554 0.10428 0.07472 0.05354 0.03837 0.02749 0.01969 0.01412 0.01011 0.00724 0.0052 0.00372 0.00266 0.00191 0.00137 0.00098 0.00070 0.00051 0.00036 0.00026 0.00018 0.00013 0.00010 0.00007) chisq;
run;
I get the following error:
ERROR: The number of TESTP values does not equal the number of levels. For the table of number,
there are 24 levels and 26 TESTP values.
This may be because two intervals contain 0 obervations. I don't really see a way around this.
Also, I don't get the chi square test in the results viewer, nor the "tes probability", I only the frequency/cumulative frequency of the random variables.
What am I doing wrong? Do both theoretical/actual distributions need to have the same form (probability/frequencies?)
We are using SAS 9.4
Thanks in advance!
/Magnus

You need ZEROS options on the WEIGHT statement.
data expofaktiska;
input number count;
datalines;
1 2910
2 2040
3 1400
4 1020
5 732
6 531
7 377
8 305
9 210
10 144
11 106
12 66
13 40
14 45
15 29
16 16
17 12
18 8
19 8
20 3
21 2
22 0
23 1
24 2
25 0
26 2
;
run;
proc freq data=expofaktiska;
weight count / zeros;
tables number / testp=(0.28347 0.20311 0.14554 0.10428 0.07472 0.05354 0.03837 0.02749 0.01969 0.01412 0.01011 0.00724 0.0052 0.00372 0.00266 0.00191 0.00137 0.00098 0.00070 0.00051 0.00036 0.00026 0.00018 0.00013 0.00010 0.00007) chisq;
run;

Assigning Variables from CSV files (or another format) in C++

Hello Stack Overflow world :3 My name is Chris, I have a slight issue.. So I am going to present the issue in this format..
Part 1
I will present the materials & code snippets I am currently working with that IS working..
Part 2
I will explain in my best ability my desired new way of achieving my goal.
Part 3
So you guys think I am not having you do all the work, I will go ahead and present my attempts at said goal, as well as possibly ways research has dug up that I did not fully understand.
Part 1
mobDB.csv Example:
ID Sprite kName iName LV HP SP EXP JEXP Range1 ATK1 ATK2 DEF MDEF STR AGI VIT INT DEX LUK Range2 Range3 Scale Race Element Mode Speed aDelay aMotion dMotion MEXP ExpPer MVP1id MVP1per MVP2id MVP2per MVP3id MVP3per Drop1id Drop1per Drop2id Drop2per Drop3id Drop3per Drop4id Drop4per Drop5id Drop5per Drop6id Drop6per Drop7id Drop7per Drop8id Drop8per Drop9id Drop9per DropCardid DropCardper
1001 SCORPION Scorpion Scorpion 24 1109 0 287 176 1 80 135 30 0 1 24 24 5 52 5 10 12 0 4 23 12693 200 1564 864 576 0 0 0 0 0 0 0 0 990 70 904 5500 757 57 943 210 7041 100 508 200 625 20 0 0 0 0 4068 1
1002 PORING Poring Poring 1 50 0 2 1 1 7 10 0 5 1 1 1 0 6 30 10 12 1 3 21 131 400 1872 672 480 0 0 0 0 0 0 0 0 909 7000 1202 100 938 400 512 1000 713 1500 512 150 619 20 0 0 0 0 4001 1
1004 HORNET Hornet Hornet 8 169 0 19 15 1 22 27 5 5 6 20 8 10 17 5 10 12 0 4 24 4489 150 1292 792 216 0 0 0 0 0 0 0 0 992 80 939 9000 909 3500 1208 15 511 350 518 150 0 0 0 0 0 0 4019 1
1005 FARMILIAR Familiar Familiar 8 155 0 28 15 1 20 28 0 0 1 12 8 5 28 0 10 12 0 2 27 14469 150 1276 576 384 0 0 0 0 0 0 0 0 913 5500 1105 20 2209 15 601 50 514 100 507 700 645 50 0 0 0 0 4020 1
1007 FABRE Fabre Fabre 2 63 0 3 2 1 8 11 0 0 1 2 4 0 7 5 10 12 0 4 22 385 400 1672 672 480 0 0 0 0 0 0 0 0 914 6500 949 500 1502 80 721 5 511 700 705 1000 1501 200 0 0 0 0 4002 1
1008 PUPA Pupa Pupa 2 427 0 2 4 0 1 2 0 20 1 1 1 0 1 20 10 12 0 4 22 256 1000 1001 1 1 0 0 0 0 0 0 0 0 1010 80 915 5500 938 600 2102 2 935 1000 938 600 1002 200 0 0 0 0 4003 1
1009 CONDOR Condor Condor 5 92 0 6 5 1 11 14 0 0 1 13 5 0 13 10 10 12 1 2 24 4233 150 1148 648 480 0 0 0 0 0 0 0 0 917 9000 1702 150 715 80 1750 5500 517 400 916 2000 582 600 0 0 0 0 4015 1
1010 WILOW Willow Willow 4 95 0 5 4 1 9 12 5 15 1 4 8 30 9 10 10 12 1 3 22 129 200 1672 672 432 0 0 0 0 0 0 0 0 902 9000 1019 100 907 1500 516 700 1068 3500 1067 2000 1066 1000 0 0 0 0 4010 1
1011 CHONCHON Chonchon Chonchon 4 67 0 5 4 1 10 13 10 0 1 10 4 5 12 2 10 12 0 4 24 385 200 1076 576 480 0 0 0 0 0 0 0 0 998 50 935 6500 909 1500 1205 55 601 100 742 5 1002 150 0 0 0 0 4009 1
So this is an example of the Spreadsheet I have.. This is what I wish to be using in my ideal goal. Not what I am using right now.. It was done in MS Excel 2010, using Columns A-BF and Row 1-993
Currently my format for working code, I am using manually implemented Arrays.. For example for the iName I have:
char iName[16][25] = {"Scorpion", "Poring", "Hornet", "Familiar", "null", "null", "null", "null", "null", "null", "null", "null", "null", "null", "null", "null"};
Defined in a header file (bSystem.h) now to apply, lets say their health variable? I have to have another array in the same Header with corresponding order, like so:
int HP[16] = {1109, 50, 169, 155, 95, 95, 118, 118, 142, 142, 167, 167, 193, 193, 220, 220};
The issue is, there is a large amount of data to hard code into the various file I need for Monsters, Items, Spells, Skills, ect.. On the original small scale to get certain system made it was fine.. I have been using various Voids in header files to transfer data from file to file when it's called.. But when I am dealing with 1,000+ Monsters and having to use all these variables.. Manually putting them in is kinda.. Ridiculous? Lol...
Part 2
Now my ideal system for this, is to be able to use the .CSV Files to load the data.. I have hit a decent amount of various issues in this task.. Such as, converting the data pulled from Names to a Char array, actually pulling the data from the CSV file and assigning specific sections to certain arrays... The main idea I have in mind, that I can not seem to get to is this;
I would like to be able to find a way to just read these various variables from the CSV file... So when I call upon the variables like:
cout << name << "(" << health << " health) VS. " << iName[enemy] << "(" << HP[enemy] << " health)";
where [enemy] is, it would be the ID.. the enemy encounter is in another header (lSystem.h) where it basically goes like;
case 0:
enemy = 0;
Where 0 would be the first data in the Arrays involving Monsters.. I hate that it has to be order specific.. I would want to be able to say enemy = 1002; so when the combat systems start it can just pull the variables it needs from the enemy with the ID 1002..
I always hit a few different issues, I can't get it to pull the data from the file to the program.. When I can, I can only get it to store int values to int arrays, I have issues getting it to convert the strings to char arrays.. Then the next issue I am presented with is recalling it and the actual saving part... Which is where part 3 comes in :3
Part 3
I have attempted a few different things so far and have done research on how to achieve this.. What I have came across so far is..
I can write a function to read the data from let's say mobDB, record it into arrays, then output it to a .dat? So when I need to recall variables I can do some from the .dat instead of a modifiable CSV.. I was presented with the same issues as far as reading and converting..
I can go the SQL route, but I have had a ton of issues understanding how to pull the data from the SQL? I have a PowerEdge 2003 Server box in my house which I store data on, it does have NavicatSQL Premium set up, so I guess my main 2 questions about the SQL route is, is it possible to hook right into the SQLServer and as I update the Data Base, when the client runs it would just pull the variables and data from the DB? Or would I be stuck compiling SQL files... When it is an online game, I know I will have to use something to transfer from Server to Client, which is why I am trying to set up this early in dev so I have more to build off of, I am sure I can use SQL servers for that? If anyone has a good grasp on how this works I would very much like to take the SQL route..
Attempts I have made are using like, Boost to Parse the data from the CSV instead of standard libs.. Same issues were presented.. I did read up on converting a string to a char.. But the issue lied in once I pulled the data, I couldn't convert it?..
I've also tried the ADO C++ route.. Dead end there..
All in all I have spent the last week or so on this.. I would very much like to set up the SQL server to actually update the variables... but I am open to any working ideas that presents ease of editing and implementing large amounts of data..
I appreciate any and all help.. If anyone does attempt to help get a working code for this, if it's not too much trouble to add comments to parts you feel you should explain? I don't want someone to just give me a quick fix.. I actually want to learn and understand what I am using. Thank you all very much :)
-Chris

Let's see if I understand your problem correctly: You are writing a game and currently all the stats for your game actors are hardcoded. You already have an Excel spreadsheet with this data and you just want to use this instead of the hardcoded header files, so that you can tweak the stats without waiting for a long recompilation. You are currently storing the stats in your code in a column-store fashion, i.e. one array per attribute. The CSV file stores stuff in a row-wise fashion. Correct so far?
Now my understanding of your problem becomes a little blurry. But let's try. If I understand you correctly, you want to completely remove the arrays from your code and directly access the CSV file when you need the stats for some creature? If yes, then this is already the problem. File I/O is incredibly slow, you need to keep this data in main memory. Just keep the arrays, but instead of manually assigning the values in the headers, you have a load function that reads the CSV file when you start the game and loads its contents into the array. You can keep the rest of your code unchanged.
Example:
void load (std::ifstream &csv)
{
readFirstLineAndCheckThatItIsCorrect (csv);
while (!csv.eof())
{
int id;
std::string spriteName;
csv >> id;
csv >> spriteName >> kName[id] >> iName[id] >> LV[id] >> HP[id] >> SP[id] >> ...
Sprite[id] = getSpriteForName (spriteName);
}
}
Using a database system is completely out of scope here. All you need to do is load some data into some arrays. If you want to be able to change the stats without restarting the program, add some hotkey for reloading the CSV file.
If you plan to write an online game, then you still have a long way ahead of you. Even then, SQL is a very bad idea for exchanging data between server and clients because a) it just introduces way too much overhead and b) it is an open invitation for cheaters and hackers because if clients have direct access to your database, you can no longer validate their inputs. See http://forums.somethingawful.com/showthread.php?noseen=0&pagenumber=258&threadid=2803713 for an actual example.
If you really want this to be an online game, you need to design your own communication protocol. But maybe you should read some books about that first, because it really is a complex issue. For instance, you need to hide the latency from the user by guessing on the client side what the server and the other players will most likely do next, and gracefully correct your guesses if they were wrong, all without the player noticing (Dead Reckoning).
Still, good luck on your game and I hope to play it some day. :-)

IMO, the simplest thing to do would be to first create a struct that holds all the data for a monster. Here's a reduced version because I don't feel like typing all those variables.
struct Mob
{
std::string SPRITE, kName, iName;
int ID, LV, HP, SP, EXP;
};
The loading code for your particular format is then fairly simple:
bool ParseMob(const std::string & str, Mob & m)
{
std::stringstream iss(str);
Mob tmp;
if (iss >> tmp.ID >> tmp.SPRITE >> tmp.kName >> tmp.iName
>> tmp.LV >> tmp.HP >> tmp.SP >> tmp.EXP)
{
m = tmp;
return true;
}
return false;
}
std::vector<Mob> LoadMobs()
{
std::vector<Mob> mobs;
Mob tmp;
std::ifstream fin("mobDB.csv");
for (std::string line; std::getline(fin, line); )
{
if (ParseMob(line,tmp))
mobs.emplace_back(std::move(tmp));
}
return mobs;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Search and match data from two files in Python - python-2.7

Related

Search for a value in multiple columns and return value from Column A

Matching observations based on a single variable or multiple variables on a single data set - stata

How to convert Object to Float in Python

performing chi squared test in SAS using PROC FREQ

Assigning Variables from CSV files (or another format) in C++

Categories

Resources