I have a DataFrame with a column of coordinate lists that looks like the following:
I want to turn this DataFrame into a GeoPandas GeoDataFrame with a geometry column. One way to do this is to create two lists for latitude and longitude, storing the first and second element of each coors entry respectively, and then use gpd.points_from_xy to build the geometry column. But that approach adds extra steps. My question is how to build the geometry directly from the coors list.
Here is some test data:
import pandas as pd
import geopandas as gpd
data = {'id':[0,1,2,3], 'coors':[[41,-80],[40,-76],[35,-70],[35,-87]]}
df = pd.DataFrame.from_dict(data)
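For reference, a sketch of the two-step approach described above (assuming the first element of each coors list is latitude and the second is longitude):
latitude = [c[0] for c in df.coors]
longitude = [c[1] for c in df.coors]
# points_from_xy takes x (longitude) first, then y (latitude)
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(longitude, latitude))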
You can just apply Point to the 'coors' column to generate point geometry.
from shapely.geometry import Point
df['geometry'] = df.coors.apply(Point)
gdf = gpd.GeoDataFrame(df) # you should also specify CRS
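Note that shapely's Point takes (x, y), i.e. (longitude, latitude). Since the coors lists above are [latitude, longitude], you may want to reverse the order and pass the CRS directly (a sketch, assuming WGS84):
# reverse [lat, lon] to (lon, lat) before building each Point
df['geometry'] = df.coors.apply(lambda c: Point(c[1], c[0]))
gdf = gpd.GeoDataFrame(df, crs='EPSG:4326')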
If you have latitude and longitude in separate columns, you can do this:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
def create_geometry(row):
    # Point takes (x, y), i.e. (longitude, latitude)
    row["geometry"] = Point(row["longitude"], row["latitude"])
    return row
df = pd.read_[parquet|csv|etc]("file")
df = df.apply(create_geometry, axis=1)
gdf = gpd.GeoDataFrame(df, crs=4326)
Then you can verify it by doing:
gdf.geometry.head(1)
Output:
16    POINT (-76.63130 7.88507)
Name: geometry, dtype: geometry
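Alternatively, gpd.points_from_xy builds the geometry column without a row-wise apply (a sketch, assuming the same latitude/longitude columns):
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.longitude, df.latitude),
    crs=4326,
)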
I have data which is loaded into a dataframe. This dataframe then needs to be saved to a Django model. The major problem is that some data which should go into an IntegerField or FloatField arrives as an empty string "", while some data which should be saved into a CharField is represented as np.nan. This leads to the following errors:
ValueError: Field 'position_lat' expected a number but got nan.
If I replace the np.nan with an empty string, using data[database]["df"].replace(np.nan, "", regex=True, inplace=True), I end up with the following error:
ValueError: Field 'position_lat' expected a number but got ''.
So what I would like to do is to check in the model whether a FloatField or IntegerField receives either np.nan or an empty string and replace it with an empty value (None). Similarly, a CharField should convert integers (if applicable) to strings and np.nan to an empty string.
How could this be implemented? Using a ModelManager or customized fields? Or is there a better approach? Sorting out the CSV files beforehand is not an option.
import pandas as pd
import numpy as np
from .models import Record
my_dataframe = pd.read_csv("data.csv")
record = Record
entries = []
for e in my_dataframe.T.to_dict().values():
    entries.append(record(**e))
record.objects.bulk_create(entries)
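One way to avoid both errors (a sketch, not from the original post) is to normalize empty strings and NaN to None before building the model instances; None is stored as NULL on nullable fields:
import numpy as np
# turn empty strings into NaN, then every NaN into None
cleaned = my_dataframe.replace("", np.nan)
cleaned = cleaned.where(pd.notna(cleaned), None)
entries = [Record(**row) for row in cleaned.to_dict(orient="records")]
Record.objects.bulk_create(entries)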
Maybe the problem was not clear; nevertheless, I would like to post my solution. I create a new dict which only contains keys whose values are actually present:
entries = []
for e in my_dataframe.T.to_dict().values():
    # 'if v' alone would not filter np.nan (NaN is truthy), so check explicitly
    e = {k: v for k, v in e.items() if pd.notna(v) and v != ""}
    entries.append(record(**e))
record.objects.bulk_create(entries)
I need to run a query like this:
historic_data.objects.raw("select * from company_historic_data")
This returns a RawQuerySet. I have to convert the values from this to a dataframe. The usual .values() method does not work with raw queries. Can someone suggest a solution?
Try the code below:
import pandas as pd
res = model.objects.raw('select * from some_table;')
df = pd.DataFrame([item.__dict__ for item in res])
Note that there is a _state column in the returned dataframe.
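A small follow-up (not part of the original answer): _state holds Django's internal model state and can simply be dropped. Alternatively, pandas can usually read the raw SQL straight into a dataframe through Django's database connection:
df = df.drop(columns=["_state"])
# or read the query directly
from django.db import connection
df = pd.read_sql("select * from company_historic_data", connection)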
I need to create random dates in the format 2017-03-29 12:10+0200, 2017-03-29 14:08-0400. The generated dates have to fall between a start date and an end date.
How can I do this in Python 2.7?
You can do this by converting each date to an integer timestamp, picking a random integer between those two values (which is a timestamp in between), and then converting that timestamp back to a datetime object:
from dateutil.parser import parse
from datetime import datetime
from random import randint
import calendar
# timegm + utctimetuple keeps the UTC offsets intact; time.mktime would
# interpret the tuple as local time and silently drop the offsets
timestamp_one = calendar.timegm(parse("2017-03-29 12:10+0200").utctimetuple())
timestamp_two = calendar.timegm(parse("2017-03-29 14:08-0400").utctimetuple())
timestamp = randint(timestamp_one, timestamp_two)
result = datetime.utcfromtimestamp(timestamp)
print result
In Python 3 you can also call .timestamp() on the datetime object directly.
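A sketch of the same idea in Python 3, calling .timestamp() on the parsed (timezone-aware) datetimes:
from dateutil.parser import parse
from datetime import datetime, timezone
from random import randint
start = parse("2017-03-29 12:10+0200")
end = parse("2017-03-29 14:08-0400")
# .timestamp() respects the UTC offsets of aware datetimes
ts = randint(int(start.timestamp()), int(end.timestamp()))
print(datetime.fromtimestamp(ts, tz=timezone.utc))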
I have written a udf in pyspark like below:
df1 = df.where(point_inside_polygon(latitude, longitude, polygonArr))
df1 and df are Spark dataframes.
The function is given below:
import math
import shapely as sh
import shapely.geometry  # so that sh.geometry resolves
from shapely.geometry import MultiPoint

def point_inside_polygon(x, y, poly):
    latt = float(x)
    long = float(y)
    # only test points with valid coordinates (the original condition was inverted)
    if not (math.isnan(latt) or math.isnan(long)):
        point = sh.geometry.Point(latt, long)
        polygon = MultiPoint(poly).convex_hull
        return polygon.contains(point)
    return False
But when I tried checking the data type of latitude and longitude, each one is a Column, not a float.
Is there a way to iterate through each row and use the values, instead of receiving Column objects?
I don't want to use a for loop because I have a huge recordset and that defeats the purpose of using Spark.
Is there a way to pass the column values as floats, or to convert them inside the function?
Wrap it using udf:
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
point_inside_polygon_ = udf(point_inside_polygon, BooleanType())
df1 = df.where(point_inside_polygon_(latitude, longitude, polygonArr))
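Note that every argument in a udf call must be a Column. If polygonArr is a plain Python list, one option (a sketch, not from the original answer) is to close over it so that only the coordinate columns go through the udf:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
# the polygon is captured in the closure instead of being passed as a column
inside = udf(lambda lat, lon: point_inside_polygon(lat, lon, polygonArr),
             BooleanType())
df1 = df.where(inside(col("latitude"), col("longitude")))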
I am pretty new to machine learning in general and scikit-learn in particular.
I am trying to use the example given on the site http://scikit-learn.org/stable/tutorial/basic/tutorial.html
For practice, I am using my own data set, which is divided into two different CSV files:
Train_data.csv (contains 32 columns; the last column is the output value).
Test_data.csv (contains 31 columns; the output column is missing, which should be the case, no?)
So the test data has one column less than the training data.
I am using the following code to learn (using training data) and then predict (using test data).
The issue I am facing is the error:
ValueError: X.shape[1] = 31 should be equal to 29, the number of features at training time
Here is my code (sorry if it looks completely wrong :( )
import pandas as pd #import the library
from sklearn import svm
mydata = pd.read_csv("Train - Copy.csv") #I read my training data set
target = mydata["Desired"] #my csv has header row, and the output label column is named "Desired"
data = mydata.iloc[:,:-3] #select all but the last column as data
clf = svm.SVC(gamma=0.001, C=100.) #Code from the URL above
clf.fit(data,target) #Code from the URL above
test_data = pd.read_csv("test.csv") #I read my test data set. Without the output column
clf.predict(test_data[-1:]) #Code from the URL above
The training data csv labels looks something like this:
Value1,Value2,Value3,Value4,Output
The test data csv labels looks something like this:
Value1,Value2,Value3,Value4.
Thanks :)
Your problem is a supervised learning problem: you have data in the form (input, output).
The input is the set of features describing your example, and the output is the prediction your model should produce given that input.
In your training data you have one extra attribute in the CSV file because, in order to train your model, you need to give it the output.
The general workflow in sklearn for a supervised problem looks like this:
X, Y = read_data(data)
n = int(len(X) * 0.8)  # use 80% of the data for training
X_train, X_test = X[:n], X[n:]
Y_train, Y_test = Y[:n], Y[n:]
model.fit(X_train, Y_train)
model.score(X_test, Y_test)
To split your data you can use train_test_split, and you can use several metrics to judge your model's performance.
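For example (a sketch, assuming X, Y, and a model as above):
from sklearn.model_selection import train_test_split
# hold out 20% of the data for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model.fit(X_train, Y_train)
print(model.score(X_test, Y_test))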
You should check the shape of your data:
data.shape
It seems you're dropping the last 3 columns instead of only the last one. Try instead:
data = mydata.iloc[:, :-1]
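With that change the feature counts line up (a quick sanity check; the column counts come from the error message above):
data = mydata.iloc[:, :-1]
print(data.shape)       # expect (n_train, 31): 32 columns minus the Desired label
print(test_data.shape)  # expect (n_test, 31)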