Shapefile dBase Data (naturalearthdata.com) - shapefile

I'm researching using Natural Earth for a software project, so I pulled down a sample data file and had a peek into its dBase file, namely ne_50m_admin_0_countries.dbf
Here is a sample row from said file:
ScaleRank = 1
LabelRank = 1
FeatureCla = Admin-0 countries
SOVEREIGNT = South Africa
SOV_A3 = ZAF
ADM0_DIF = 0.00
LEVEL = 2.00
TYPE = Sovereign country
ADMIN = South Africa
ADM0_A3 = ZAF
GEOU_DIF = 0.00
GEOUNIT = South Africa
GU_A3 = ZAF
SU_DIF = 0.00
SUBUNIT = South Africa
SU_A3 = ZAF
NAME = South Africa
ABBREV = S.Af.
POSTAL = ZA
NAME_FORMA = Republic of South Africa
TERR_ =
NAME_SORT = South Africa
MAP_COLOR = 2.00
POP_EST = 49052489.00
GDP_MD_EST = 491000.00
FIPS_10_ = 0.00
ISO_A2 = ZA
ISO_A3 = ZAF
ISO_N3 = 710.00
Now, what the heck is all of this stuff? I can guess what fields like "SOVEREIGNT" and "NAME" are, but what the heck is "ISO_A3", or "MAP_COLOR," or "ScaleRank"?
I tried to look for documentation both on naturalearthdata.com and elsewhere, but I can't seem to find any. How exactly am I supposed to go about making sense of all these fields?

The ISO fields (A2, A3, N3) are all associated with ISO 3166 country codes. Take a look at
http://www.unc.edu/~rowlett/units/codes/country.htm
Not sure about MAP_COLOR or ScaleRank though.

The .dbf file that comes with a shapefile stores the attributes of each feature (a polygon, a line, a point, etc.), and those attributes are entirely up to the person who created the shapefile. So if you cannot guess what a field means, you have to ask the people who created it.
I went to the website, and I guess you got this data from Admin 0 – Countries. If what's written there is not enough, you have to ask the maintainers, as the data won't describe itself. Sometimes a shapefile is accompanied by an XML file documenting what the data is, but for this particular dataset I saw only .shp (defines the polygons), .dbf (attributes of each polygon), .prj (map projection being used) and .shx (I don't know what it is... indexing?).
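If it helps, the attribute table can also be inspected programmatically with any DBF-capable reader; here is a quick sketch using GeoPandas (my choice of library, not something the data mandates):
import geopandas as gpd

# Load the shapefile; the .dbf attribute fields become DataFrame columns.
countries = gpd.read_file("ne_50m_admin_0_countries.shp")
print(countries.columns)   # field names: SOVEREIGNT, ISO_A3, MAP_COLOR, ...
print(countries.iloc[0])   # one feature's full attribute row, e.g. South Africa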

How can I calculate a new date conditionally based on other information?

I have a Google Sheet which is being populated by a Google Form. I am using Google Apps Script to add some extra functionality. Please feel free to access and modify these as needed in order to help.
Based on answers from the Form, I need to return a new date that factors in the time stamp at form submission.
This is a dumbed down example of what I need to do, but let's think of it like ordering a new car and its color determines how long it is going to take.
Car      Color
Toyota   Red
Honda    Blue
Tesla    Green
I need to write a conditional IF statement that determines how many weeks it will take to get the car based on the ordered color.
         Red   Blue   Green
Toyota    1     3      5
Honda     2     4      6
Tesla     1     1      1
So if you order a Toyota in Red, it will take one week. If you order a Toyota in Green, it will take 5 weeks. If you order a Tesla, it will really be one week no matter what color. Etc.
I started by writing a formula in Sheets to take the timestamp, which is in column A, and add the appropriate amount of time to it:
=IFS(AND(B2 = "Toyota",C2 = "Red"),A2 + 7,AND(B2="Toyota",C2="Blue"), A2 + 21,AND(B2="Toyota",C2="Green"), A2 + 35,AND(B2 = "Honda",C2 = "Red"),A2 + 14,AND(B2="Honda",C2="Blue"), A2 + 28,AND(B2="Honda",C2="Green"), A2 + 42,AND(B2 = "Tesla"),A2 + 7)
And then I dragged that down the length of the entire column so that it would fill in as submissions came in.
However, when someone fills in the Google Form, it overwrites that entire row, blowing out what I had in that column.
Now I realized that the code needs to be written in Google Apps Script and returned as a value.
What kinds of modifications need to be made to my IFS statement in order to make it compatible with Google Apps Script?
For an easier approach, QUERY would actually solve your issue without a script, as Broly mentioned in the comments. Create a new sheet, then have that sheet contain this formula in A1:
Formula (A1):
=query('Form Responses 1'!A:C)
This will mirror the A:C range from the form responses; then copy/paste your formula for the Date Needed column into column D.
Notes:
Since only A:C is mirrored, form submissions won't affect the column D formula.
A:C in the new sheet will update automatically, and the formula you inserted in D will recalculate once those rows are populated.
Wrap your column D formula in IFNA so it doesn't show #N/A while A:C is still blank.
Formula (D2):
=IFNA(IFS(AND(B2 = "Toyota",C2 = "Red"),A2 + 7,AND(B2="Toyota",C2="Blue"), A2 + 21,AND(B2="Toyota",C2="Green"), A2 + 35,AND(B2 = "Honda",C2 = "Red"),A2 + 14,AND(B2="Honda",C2="Blue"), A2 + 28,AND(B2="Honda",C2="Green"), A2 + 42,AND(B2 = "Tesla"),A2 + 7), "")
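A hedged alternative (my addition, not part of the original answer): the nested IFS can be collapsed into a single lookup against the weeks matrix above, which is easier to extend when new cars or colors are added. Assuming the same A/B/C column layout:
Formula (D2, alternative):
=IFNA(A2 + 7 * VLOOKUP(B2, {"Toyota",1,3,5; "Honda",2,4,6; "Tesla",1,1,1}, MATCH(C2, {"Red","Blue","Green"}, 0) + 1, FALSE), "")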

Imputing Missing Value

I am looking to filter the talentpool_subset dataframe to capture only the city and state from the location column (it currently contains strings like this: "Software Developer in London, United Kingdom"). I've tried replacing NaN values with 0, and confirmed that I've done this by subsetting the dataframe to return only NaN values, which returned an empty dataframe as expected. But every time I run the final statement, I get this error: "ValueError: cannot mask with array containing NA / NaN values"
Why is this happening?
talentpool_subset = talentpool_df[['name', 'profile', 'location','skills']]
talentpool_subset
talentpool_subset['location'].fillna(0, inplace=True)
location = talentpool_subset['location'].isna()
talentpool_subset[location]
talentpool_subset[talentpool_subset['location'].str.contains(r'(?<=in).*')]
name profile url source github location skills tags_strong tags_expert is_available description
0 Hugo L. Samayoa DevOps Developer https://www.toptal.com/resume/hugo-l-samayoa toptal NaN DevOps Developer in Long Beach, CA, United States {"Paradigms":["Agile Software Development","Sc... NaN ["Linux System Administration","VMware ESXi","... available "DevOps before DevOps" is a term mostly associ...
1 Stepan Yakovenko Software Developer https://www.toptal.com/resume/stepan-yakovenko toptal stiv-yakovenko Software Developer in Novosibirsk, Novosibirsk... {"Platforms":["Debian Linux","Windows","Linux"... ["Linux","C++","AngularJS"] ["Java","HTML5","CSS","JavaScript","MySQL","Hi... available Stepan is an experienced software developer wi...
2 Slobodan Gajic Software Developer https://www.toptal.com/resume/slobodan-gajic toptal bobangajicsm Software Developer in Sremska Mitrovica, Vojvo... {"Platforms":["Firebase","XAMPP"],"Storage":["... ["Firebase","Karma"] ["jQuery","HTML5","CSS3","Git","JavaScript","S... available Slobodan is a front-end developer with a Bache...
4 Jennifer Aquino Query Optimization Developer https://www.toptal.com/resume/jennifer-aquino toptal BlueCamelArt Query Optimization Developer in West Ryde, New... {"Paradigms":["Automation","ETL Implementation... ["Data Warehouse","Unix","Oracle 10g","Automat... ["SQL","SQL Server Integration Services (SSIS)... available Jennifer has five years of professional experi...
Assuming here that the objective is to get the location and it is not required to use a mask, the code below uses .str.extract() to keep only the city and state in the location column.
For example: Long Beach, CA, United States from DevOps Developer in Long Beach, CA, United States.
# Import libraries
import pandas as pd
import numpy as np
# Create list using text from question
name = ['Hugo L. Samayoa','Stepan Yakovenko','Slobodan Gajic','Bruno Furtado Montes Oliveira','Jennifer Aquino']
profile = ['DevOps Developer','Software Developer','Software Developer','Visual Studio Team Services (VSTS) Developer','Query Optimization Developer']
url = ['https://www.toptal.com/resume/hugo-l-samayoa','https://www.toptal.com/resume/stepan-yakovenko','https://www.toptal.com/resume/slobodan-gajic','https://www.toptal.com/resume/bruno-furtado-mo...','https://www.toptal.com/resume/jennifer-aquino']
source = ['toptal','toptal','toptal','toptal','toptal']
github = [np.nan, 'stiv-yakovenko','bobangajicsm','brunofurmon','BlueCamelArt']
location = ['DevOps Developer in Long Beach, CA, United States', 'Software Developer in Novosibirsk, Novosibirsk','Software Developer in Sremska Mitrovica, Vojvo','Visual Studio Team Services (VSTS) Developer in New York','Query Optimization Developer in West Ryde, New York']
skills = ['{"Paradigms":["Agile Software Development","Sc...', '{"Platforms":["Debian Linux","Windows","Linux"...','{"Platforms":["Firebase","XAMPP"],"Storage":["...','{"Paradigms":["Agile","CQRS","Azure DevOps"],"...','{"Paradigms":["Automation","ETL Implementation...']
# Create DataFrame using the lists above
talentpool_df = pd.DataFrame({
    'name': name,
    'profile': profile,
    'url': url,
    'source': source,
    'github': github,
    'location': location,
    'skills': skills
})
# Add an all-NaN row to the DataFrame
talentpool_df.loc[6, :] = np.nan
# Subset the DataFrame to the columns of interest
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
talentpool_subset = talentpool_df[['name', 'profile', 'location', 'skills']].copy()
# Use .str.extract() to keep only the text after 'in' in the 'location' column
talentpool_subset['location'] = talentpool_subset['location'].str.extract(r'((?<=in).*)')
Output
talentpool_subset
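As for the original ValueError: .str.contains() returns NaN for elements that are not strings (including the 0s left behind by fillna(0)), and pandas refuses to build a boolean mask containing NaN. If a mask is still wanted rather than .extract(), passing na=False treats those entries as non-matches; a minimal sketch under that assumption:
# Build a NaN-free boolean mask: missing/non-string locations count as False
mask = talentpool_subset['location'].str.contains(r'(?<=in).*', na=False)
talentpool_subset[mask]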

PVLIB - DC Power From Irradiation - Simple Calculation

Dear pvlib users and devels.
I'm a researcher in computer science, not particularly expert in the simulation or modelling of solar panels. I'm interested in using pvlib, since we are trying to simulate the workings of a small solar panel used for IoT applications. In particular, the panel specs are the following:
12.8% max efficiency, Vmp = 5.82 V, size = 225 × 155 × 17 mm.
Before using pvlib, one of my collaborators wrote code that computes the irradiation directly from average monthly values calculated with PVWatts. I was not really satisfied, so we are starting to use pvlib.
In the old code, we have the power and current of the panel calculated as:
W = Irradiation * PanelSize(m^2) * Efficiency
A = W / Vmp
The irradiation in Madrid was obtained with PVWatts, and this is what my collaborator used:
DIrradiance = (2030.0,2960.0,4290.0,5110.0,5950.0,7090.0,7200.0,6340.0,4870.0,3130.0,2130.0,1700.0)
I'm trying to understand whether pvlib computes values similar to the ones above, as averages over a day for each month, and what the production curve over a day looks like.
I wrote this to compare pvlib with our old model:
import math
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
import pvlib
from pvlib.location import Location
def irradiance(day, m):
    DIrradiance = (2030.0, 2960.0, 4290.0, 5110.0, 5950.0,
                   7090.0, 7200.0, 6340.0, 4870.0, 3130.0,
                   2130.0, 1700.0)
    madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
    times = pd.date_range(start=dt.datetime(2015, m, day, 0, 0),
                          end=dt.datetime(2015, m, day, 23, 59),
                          freq='60min')
    spaout = pvlib.solarposition.spa_python(times, madrid.latitude, madrid.longitude)
    spaout = spaout.assign(cosz=pd.Series(np.cos(np.deg2rad(spaout['zenith']))))
    z = np.array(spaout['cosz'])
    return z.clip(0) * DIrradiance[m-1]

madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015, 8, 15, 0, 0),
                      end=dt.datetime(2015, 8, 15, 23, 59),
                      freq='60min')
old = irradiance(15, 8)            # old model
new = madrid.get_clearsky(times)   # pvlib irradiance
plt.plot(old, 'r-')                # compare them
plt.plot(old/6.0, 'y-')            # old seems 6 times larger... I do not know why
plt.plot(new['ghi'].values, 'b-')
plt.show()
The code above computes the old irradiance using the zenith angle, and computes the GHI values using the clear-sky model. I do not understand whether the values in ghi must also be multiplied by the cosine of the zenith or not. In any case, they are smaller by a factor of 6. What I'd like to have at the end is the power and current output from the panel (DC) without any inverter. We are not really interested in modelling it exactly, but we would at least like a reasonable curve. We are able to capture the amperes produced by the panel, and we want to compare the measurements taken with the panel on the rooftop against the values calculated by pvlib.
Any help on this would be really appreciated. Thanks
Sorry Will, I do not care a lot about my previous model, since I'd like to move all the code to pvlib. I followed your suggestion and I'm using irradiance.total_irrad; the code now looks like this:
from pvlib import atmosphere, irradiance

madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015, 1, 1, 0, 0),
                      end=dt.datetime(2015, 1, 1, 23, 59),
                      freq='60min')
ephem_data = pvlib.solarposition.spa_python(times, madrid.latitude,
                                            madrid.longitude)
irrad_data = madrid.get_clearsky(times)
AM = atmosphere.relativeairmass(ephem_data['apparent_zenith'])
total = irradiance.total_irrad(40, 180,
                               ephem_data['apparent_zenith'], ephem_data['azimuth'],
                               dni=irrad_data['dni'], ghi=irrad_data['ghi'],
                               dhi=irrad_data['dhi'], airmass=AM,
                               surface_type='urban')
poa = total['poa_global'].values
Now that I know the irradiance on the POA, I want to compute the output in amperes. Is it just
(poa * PANEL_EFFICIENCY * AREA) / VOLT_OUTPUT?
It's not clear to me how you arrived at your values for DIrradiance or what the units are, so I can't comment much on the discrepancies between the values. I'm guessing it's some kind of monthly data, since there are 12 values. If so, you'd need to calculate ~hourly pvlib irradiance data and then integrate it to check for consistency.
If your module will be tilted, you'll need to convert your ~hourly GHI, DNI, and DHI irradiance values to plane-of-array (POA) irradiance using a transposition model. The irradiance.total_irrad function is the easiest way to do that.
The next steps depend on the IV characteristics of your module, the rest of the circuit, and how accurate you need the model to be.
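For the last step, if the simple efficiency model from the question is acceptable (rather than a full IV-curve simulation), the conversion from POA irradiance to DC power and current is just arithmetic. A sketch using the panel spec quoted in the question (the constants come from there, not from pvlib):
PANEL_EFFICIENCY = 0.128        # 12.8% max efficiency
AREA = 0.225 * 0.155            # module area in m^2 (225 mm x 155 mm)
VMP = 5.82                      # voltage at maximum power point (V)

dc_power = poa * PANEL_EFFICIENCY * AREA   # W for each hourly POA value
dc_current = dc_power / VMP                # A, assuming operation at Vmp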

Storing a curve in a django model as manytomany?

I have a list of products (say diodes) which have a curve associated with them.
For example,
Diode 1: curve 1: [(0,1),(1,3),(2,10), ...., (100,0.5)]
Diode 2: curve 2: [(0,2),(1,4),(2.1,19), ..., (100,0)]
So for each product there is a curve (with the same x-axis values, range(1,100)) but different y-axis values.
My question is: what is the best practice for storing such data (using Django + PostgreSQL), given that I want to calculate things with it later in the views (say the area under the curve, or that curve times another one, etc.)? I will also be charting it, so the view will have to pull the values.
My first attempts have had various limitations:
Naive attempt 1
# model.py
attrs = {}
for i in range(101):
    name_sects = ["x", str(i+1)]
    attrs["".join(name_sects)] = models.DecimalField(_("".join([str(i+1), ' A'])), max_digits=6)
attrs['intensity'] = models.DecimalField(_('Diode Intensity'))
Diode = type('Diode', (models.Model,), attrs)
OK, that creates a field for each "x" (x1, x2, ... etc.), and I can fill in each "y" in the admin... but it's not obvious how to manipulate it in the view or the template (and it's a pain to fill in, obviously).
Naive attempt 2
#model.py
class Curve(models.Model):
    x_axis = models.PositiveIntegerField( ...)
    y_axis = models.DecimalField( ...)

class Diode(models.Model):
    name = blah, blah
    intensity = models.DecimalField(_('Diode Intensity'), blah, blah)
    characteristic_curve = models.ManyToManyField(Curve)
Is ManyToMany the way forward, even if each diode corresponds to one single curve (but many points, with possibly two diodes sharing a point)?
Any advice, tips or links to tools for it are very appreciated.
If you want to improve speed (because 100 entries for each product is really a lot, and it would be slow if you have to fetch 100 products and their points), I would use the pickle module and store your list of tuples in a TextField (or maybe a CharField if the length of the string doesn't change).
>>> a = [(1,2),(3,4),(5,6),(7,8)]
>>> pickle.dumps(a)
'(lp0\n(I1\nI2\ntp1\na(I3\nI4\ntp2\na(I5\nI6\ntp3\na(I7\nI8\ntp4\na.'
>>> b = pickle.dumps(a)
>>> pickle.loads(b)
[(1, 2), (3, 4), (5, 6), (7, 8)]
Just store b in your TextField and you can get back your list really easily.
And even better, as Robert Smith says, use http://pypi.python.org/pypi/django-picklefield
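For completeness, a minimal sketch of what the django-picklefield route could look like (the field names here are my own illustration):
from django.db import models
from picklefield.fields import PickledObjectField

class Diode(models.Model):
    name = models.CharField(max_length=100)
    intensity = models.DecimalField(max_digits=6, decimal_places=2)
    # Stores the list of (x, y) tuples; pickling/unpickling is transparent.
    curve = PickledObjectField()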
I like your second approach but just a minor suggestion.
class Plot(models.Model):
    x_axis = models.PositiveIntegerField( ...)
    y_axis = models.DecimalField( ...)

class Curve(models.Model):
    plots = models.ManyToManyField(Plot)

class Diode(models.Model):
    name = blah, blah
    intensity = models.DecimalField(_('Diode Intensity'), blah, blah)
    curve = models.ForeignKey(Curve)
Just a minor suggestion for flexibility
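To illustrate the kind of view-side computation the question mentions, here is a sketch of an area-under-the-curve helper against the models suggested above (trapezoidal rule; it assumes the curve/plots naming from this answer):
def curve_area(diode):
    # Points sorted by x; values_list gives (x, y) tuples.
    points = list(diode.curve.plots.order_by('x_axis')
                  .values_list('x_axis', 'y_axis'))
    # Trapezoidal rule over consecutive points.
    return sum((float(x1) - float(x0)) * (float(y0) + float(y1)) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))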

Vim: Parsing address fields from all around the globe

Intro
This post is long, but I consider it thorough. I hope it might be helpful to others dealing with addresses while teaching complex Vim regexes. Thank you for your time.
Worldwide addresses:
Users from America, Canada and a few other countries are offered 5 fields on a form, which are then displayed in a comma-delimited format that I need to further dissect. Ideally, the comma-separated content looks like:
Some Really Nice Place, 111 Street, Beautiful Town, StateOrProvince, zip
where zip can be either a series of just numbers (US) or numbers and letters (Canada).
Invariably, people throw an extra comma into their text box field input and that adds some complexity to the parsing of this data. For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, StateOrProvince, zip
Further complicating this parse is that the data from non-US and non-Canadian countries contains an extra comma-delimited field that was somehow provided to them, adding a place for them to enter their country. (No, US and Canadian users are not given a country field, so for the others it's in addition to the original 5 comma-delimited fields.) Such as:
Foreign Name of Building, A street name, A City, ,zip, Country
The ",," is usually empty as non-US countries do are not segmented into states. And, yes, the same "additional commas" as described above happens here too.
Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country
Parsing Strategy:
A country name will never include a digit, whereas a US or Canadian zip will always have at least some digits. If you go backwards using this assumption about the contents of the last field, then you should be able to place the country, zip, state (if not empty ",,"), city and street into their respective positions, which are the most important fields to get right. Anything beyond those sections could be lumped together in the first one or two lines as descriptions of the address (i.e. building, name, suite, cross streets, etc.). For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, Lovely State, Digits&Letters
Last section has a digit (therefore a US or Canadian address)
There are a total of 6 sections, so that's one more than the original 5
Knowing that sections 5-2 are zip, state, town, address...
6 minus 5 (original) = add an extra Address (Address2) field and leave the first section as the header, resulting in:
Header: Some Really Nice Place, Address1: 111 Street, Address2: Suite 101, Town: Beautiful Town, State/Province: Lovely State, Zip: Digits&Letters
While there might be a discrepancy about where "111 Street" or "Suite 101" goes (Address1 or Address2), this at least gets the zip, state, city and address(es) placed correctly and leaves the first section as the "Header" for data-entry purposes.
Under this approach, foreign addresses get parsed like:
Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country
Last section has no digit, so it must be a Country
That means, moving right to left, the second section is the zip
So now (foreign) you have an "original 6 sections" to subtract from the total of 7 in the example
7th section = country, 6th = zip, 5th = state (mostly blank on foreign address), 4th = City, 3rd = address1, 2nd = address2, 1st = header
We knew to use two address fields because the example had 7 sections and foreign addresses have a base of 6 sections. Any sections above the base are added to the Address2 field; if there are 3 sections above the base count, they are appended to each other inside the Address2 field.
Coding
Using this approach in Vim, how would I initially read the number of comma-delimited sections (after I've captured the entire address in a register)? And how do I do submatch(es) on a series of comma-delimited sections when I am not sure how many sections exist?
Example Addresses
Here are some practice addresses (US and foreign) if you are so inclined to help:
City Gas & Electric - Bldg 4, 222 Middle Park Ct, CP4120F, Dallas, Texas, 44984
MHG Engineering, Inc. Suite 200, 9899 Balboa Ave, San Diego, California, 92123-1502
SolarWind Turbines, 2nd Floor Conference Room, 2300 Ruffin Road, Seattle, Washington, 84444
123 Aeronautics, 2239 Industry Parkway, Salt Lake City, Utah, 55344
Ongwanda Gov't Resources, 6000 Portsmouth Avenue, Ottawa, Ontario, K7M 8A6
Graylang Seray Center, 6600 Haig Rd, Singapore, , 437848, Singapore
Lot 459, Block 14, Jalan Sultan Tengah, Petra Jaya, Kuching, , 93050, Malaysia
Virtual Steel, 1 Umgazi Rd Aspec Park, Pretoria, , 0075, South Africa
Idiom Towers South, Fifth Floor, Jasmen Conference Room, 1500 Freedom Street, Pretoria, , 0002, South Africa
The following code is a draft-quality Vim script (hopefully) implementing the
address parsing routine described in the question.
function! ParseAddress(line)
    " Split on commas; keepempty=1 so ',,' yields an empty state field.
    let r = split(a:line, ',\s*', 1)
    " A last field without digits is taken to be a country name.
    let hadcountry = r[-1] !~ '\d'
    let a = {}
    let a.country = hadcountry ? r[-1] : ''
    " Drop the country field, if any, then read fixed fields from the right.
    let r = r[:-1-hadcountry]
    let a.zip = r[-1]
    let a.state = r[-2]
    let a.city = r[-3]
    let a.header = r[0]
    let nleft = len(r) - 4
    if hadcountry
        let a.address1 = r[-4]
        let a.address2 = join(r[1:nleft-1], ', ')
    else
        let a.address1 = r[1]
        let a.address2 = join(r[2:nleft], ', ')
    endif
    return a
endfunction

function! FormatAddress(a)
    " Render non-empty fields as 'Label: value' pairs, joined by semicolons.
    let t = map([
    \ ['Header', 'header'],
    \ ['Address 1', 'address1'],
    \ ['Address 2', 'address2'],
    \ ['Town', 'city'],
    \ ['State/Province', 'state'],
    \ ['Country', 'country'],
    \ ['Zip', 'zip']],
    \ 'has_key(a:a, v:val[1]) && !empty(a:a[v:val[1]])' .
    \ '? v:val[0] . ": " . a:a[v:val[1]] : ""')
    return join(filter(t, '!empty(v:val)'), '; ')
endfunction
The command below can be used to test the above parsing routines.
:g/\w/call setline(line('.'), FormatAddress(ParseAddress(getline('.'))))
(One can provide a range to the :global command to run it through a smaller
number of test address lines.)
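For instance, running the command over the fourth practice address above should produce something like (my own trace of the code, worth double-checking):
123 Aeronautics, 2239 Industry Parkway, Salt Lake City, Utah, 55344
becomes
Header: 123 Aeronautics; Address 1: 2239 Industry Parkway; Town: Salt Lake City; State/Province: Utah; Zip: 55344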
Maybe you should review some of the other questions about addresses around the world. The USA and Canada are extraordinarily systematic with their formats; most other countries are a lot less rigorous about approved formats. Anything you devise for the USA and Canada will run into issues almost immediately when you deal with other addresses.
Best practices for storing postal addresses in a database
Is there a common street address database design for all addresses of the world
How many address fields would you use for a UK address
ISO Standard Street Addresses
There are probably other related questions: see the tag street-address for some of them.