Rails 4: import unknown fields into a serialized hash

I have a model that consists of several fields (e.g. Name, Description) and a properties column, which is a serialized hash of custom attributes.
I am trying to write an import script that loads a CSV/XLS file into the database for this model, and I am struggling to capture all of the columns that are not database fields and serialize them, keyed by their headers, into the properties column as a hash.
For example, given the following table:
| Name | Description | Field1 | Field2 |
|------|-------------|--------|--------|
| Row1 | This is     | Row1F1 | Row1F2 |
| Row2 | This is     | Row2F1 | Row2F2 |
I would like it to be imported into the database as:
Testmodel.create(:name => "Row1", :description => "This is", :properties => { :field1 => "Row1F1", :field2 => "Row1F2" })
My current simple import, which I use for my other simple tables, is the following (I am using Roo):
def self.import(file)
  spreadsheet = open_spreadsheet(file)
  header = spreadsheet.row(1)
  (2..spreadsheet.last_row).each do |i|
    row = Hash[[header, spreadsheet.row(i)].transpose]
    chemuse = find_by_id(row["id"]) || new
    chemuse.attributes = row.to_hash
    chemuse.save!
  end
end
My goal is for the import to catch unknown column names so that I can collect them into a hash and assign them to the :properties column.

I found the following code that works, though it doesn't seem very efficient; I wanted to post it as an answer anyway.
def self.import(file)
  spreadsheet = open_spreadsheet(file)
  cn = column_names
  header = spreadsheet.row(1)
  (2..spreadsheet.last_row).each do |i|
    rawrow = Hash[[header, spreadsheet.row(i)].transpose]
    cleansedrow = { :properties => {} }
    rawrow.each do |key, value|
      if cn.include?(key)
        # known database column: assign directly
        cleansedrow[key] = value
      else
        # unknown column: stash it in the serialized properties hash
        cleansedrow[:properties][key] = value
      end
    end
    record = find_by_id(cleansedrow["id"]) || new
    record.attributes = cleansedrow.to_hash
    record.save!
  end
end

Related

Compare fields within relationship on Django ORM

I have two models, Route and Stop.
A route can have several stops; each stop has a name and a number. On the same route, stop numbers are unique.
The problem:
I need to find which routes contain two particular stops where one stop's number is less than the other's.
Consider the following models:
class Route(models.Model):
    name = models.CharField(max_length=20)

class Stop(models.Model):
    route = models.ForeignKey(Route)
    number = models.PositiveSmallIntegerField()
    location = models.CharField(max_length=45)
And the following data:
Stop table
| id | route_id | number | location |
|----|----------|--------|----------|
| 1 | 1 | 1 | 'A' |
| 2 | 1 | 2 | 'B' |
| 3 | 1 | 3 | 'C' |
| 4 | 2 | 1 | 'C' |
| 5 | 2 | 2 | 'B' |
| 6 | 2 | 3 | 'A' |
For example:
Given two locations 'A' and 'B', find which routes contain both locations and where A's number is less than B's number.
With the data above, it should match route id 1 but not route id 2.
In raw SQL, this works with a single query:
SELECT
    `route`.id
FROM
    `route`
    LEFT JOIN `stop` stop_from ON stop_from.`route_id` = `route`.`id`
    LEFT JOIN `stop` stop_to ON stop_to.`route_id` = `route`.`id`
WHERE
    stop_from.`stop_location_id` = 'A'
    AND stop_to.`stop_location_id` = 'B'
    AND stop_from.stop_number < stop_to.stop_number
Is it possible to do this with one single query in the Django ORM as well?
Generally, ORM frameworks like the Django ORM, SQLAlchemy, and even Hibernate are not designed to generate the most efficient query. There is a way to write this query using only model objects; however, since I had a similar issue, I would suggest using a raw query for more complex queries. Here is a link to the Django raw SQL documentation:
https://docs.djangoproject.com/en/1.11/topics/db/sql/
You can write your query in many ways, but something like the following could help.
from django.db import connection

def my_custom_sql(self):
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                `route`.id
            FROM
                `route`
                LEFT JOIN `stop` stop_from ON stop_from.`route_id` = `route`.`id`
                LEFT JOIN `stop` stop_to ON stop_to.`route_id` = `route`.`id`
            WHERE
                stop_from.`stop_location_id` = %s
                AND stop_to.`stop_location_id` = %s
                AND stop_from.stop_number < stop_to.stop_number
        """, ['A', 'B'])
        row = cursor.fetchone()
    return row
Hope this helps.
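For completeness, on newer Django versions (2.0+, where aggregate functions accept a filter argument) the same result can be expressed as a single ORM query using filtered aggregates. This is only a sketch, assuming the default reverse accessor `stop` for the ForeignKey shown above:
from django.db.models import F, Min, Q

# Annotate each route with the smallest stop number at each location,
# then keep routes where the 'A' stop comes before the 'B' stop.
routes = (
    Route.objects
    .annotate(a_number=Min('stop__number', filter=Q(stop__location='A')))
    .annotate(b_number=Min('stop__number', filter=Q(stop__location='B')))
    .filter(a_number__lt=F('b_number'))
)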

I want to import a CSV file into two different models; is this possible?

I have 5 fields, of which 4 should go into one table and the last column should be imported into another table.
def self.import(file)
  CSV.foreach(file.path, headers: true) do |row|
    lock_store_hash = row.to_hash
    lock_store = LockStore.where(id: lock_store_hash["id"])
    if lock_store.count == 1
      lock_store.first.update_attributes(lock_store_hash.except(params[:file][:taxon_id]))
    else
      LockStore.create!(lock_store_hash)
    end
  end
end
Yes, you can. In your case you could do something like this (I'm assuming you have a belongs_to/has_many association between your models):
...
if lock_store.count == 1
  lock_store.first.update_attributes(lock_store_hash.except(params[:file][:taxon_id]))
  # the update code I will leave to you, because it is difficult to guess based on the information provided
else
  lockstore = LockStore.create!(lock_store_hash)
  lockstore.another_models.create(attribute_name: lock_store_hash[:last_column])
end

Python replace string function throws asterisk wildcard error

When I use * I receive the error:
raise error, v # invalid expression
error: nothing to repeat
Other wildcard characters such as ^ work fine.
The line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
I am using pandas and Python.
Edit:
When I try using / to escape, the wildcard does not work as I intend:
In [44]: df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
In [45]: df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []
In [46]: df.columns.str.replace('/*agriculture*', 'agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object').
Edit:
I am currently using hierarchical columns and would like to replace 'agriculture' with 'agri' only at a specific level (level = 2).
Original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
Desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns
and that author says it will get easier in 0.15.0, so I am hoping there are more recent solutions.
You need to put the asterisk * at the end so that it matches the preceding character 0 or more times; see the docs:
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
EDIT
Based on your new, actual requirements, you can use str.contains to find matches, use this to build a dict mapping the old names to new names, and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []
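Since your real columns are hierarchical, another option is to rename only the relevant level of the MultiIndex. This is just a sketch, assuming a reasonably recent pandas and a three-level column index like the one you describe:
import pandas as pd

# Build a small MultiIndex frame matching the example in the question.
cols = pd.MultiIndex.from_tuples([('grand total', '2005', 'agriculture'),
                                  ('grand total', '2005', 'other')])
df = pd.DataFrame(columns=cols)

# rename() accepts a mapper and a level, so only the level-2 labels are touched.
df = df.rename(columns=lambda c: 'agri' if 'agriculture' in c else c, level=2)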

Reference a table column by its column header in Python

Is there a Pythonic way to refer to columns of 2D lists by name?
I import a lot of tables from the web, so I made a general-purpose function that creates 2-dimensional lists out of various HTML tables. So far so good. But the next step is often to parse the table row by row.
# Sample table.
# In real life I would do something like: table = HTML_table('url', 'table id')
table = [
    ['Column A', 'Column B', 'Column C'],
    ['One', 'Two', 3],
    ['Four', 'Five', 6]
]

# Current code:
iA = table[0].index('Column A')
iC = table[0].index('Column C')
for row in table[1:]:
    process_row(row[iA], row[iC])

# Desired code:
for row in table[1:]:
    process_row(row['Column A'], row['Column C'])
I think you'll really like the pandas module! http://pandas.pydata.org/
Put your list into a DataFrame
This could also be done directly from html, csv, etc.
df = pd.DataFrame(table[1:], columns=table[0]).astype(str)
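For instance, reading directly from a web page is possible with pd.read_html; this is only a sketch, assuming the page contains a plain HTML <table> and that an optional parser dependency (lxml or html5lib) is installed. The URL is a placeholder:
import pandas as pd

# read_html returns a list of DataFrames, one per <table> element on the page
tables = pd.read_html('http://example.com/page-with-a-table.html', header=0)
df = tables[0]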
Access columns
df['Column A']
Access first row by index
df.iloc[0]
Process row by row
df.apply(lambda x: '_'.join(x), axis=1)
for index, row in df.iterrows():
    process_row(row['Column A'], row['Column C'])
Process a column
df['Column C'].astype(int).sum()
Wouldn't an OrderedDict with column names as keys and lists of row values as values be a better approach for your problem? I would go with something like:
table = {
    'Column A': [1, 4],
    'Column B': [2, 5],
    'Column C': [3, 6]
}

# And you would parse column by column...
for col, rows in table.iteritems():
    # do something
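To connect this with the list-of-lists table from the question, here is a small sketch of building such a column dict and still processing row by row (process_row is the function from the question, and raw_table stands in for its sample data):
from collections import OrderedDict

# raw_table is the header-plus-rows list from the question
raw_table = [
    ['Column A', 'Column B', 'Column C'],
    ['One', 'Two', 3],
    ['Four', 'Five', 6],
]
header, rows = raw_table[0], raw_table[1:]

# one list of values per column, keyed by the header name
columns = OrderedDict((name, [row[i] for row in rows]) for i, name in enumerate(header))

# row-wise processing is still possible by zipping the columns of interest
for a, c in zip(columns['Column A'], columns['Column C']):
    process_row(a, c)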
My QueryList is simple to use.
ql.filter(portfolio='123')
ql.group_by(['portfolio', 'ticker'])
from collections import defaultdict, OrderedDict


class QueryList(list):
    """Filter and/or group_by a list of objects."""

    def group_by(self, attrs) -> dict:
        """Like a database group_by function.

        args:
            attrs: str or list.

        Returns:
            {value_of_the_group: list_of_matching_objects, ...}
            When attrs is a list, each key is a tuple.
            Ex:
                {'AMZN': QueryList(),
                 'MSFT': QueryList(),
                 ...
                }
                -- or --
                {('Momentum', 'FB'): QueryList(),
                 ...,
                }
        """
        result = defaultdict(QueryList)
        if isinstance(attrs, str):
            for item in self:
                result[getattr(item, attrs)].append(item)
        else:
            for item in self:
                result[tuple(getattr(item, x) for x in attrs)].append(item)
        return result

    def filter(self, **kwargs):
        """Return the subset of this QueryList that has matching attributes.

        args:
            kwargs: Attribute name/value pairs.

        Example:
            foo.filter(portfolio='123', account='ABC')
        """
        ordered_kwargs = OrderedDict(kwargs)
        match = tuple(ordered_kwargs.values())

        def is_match(item):
            return tuple(getattr(item, y) for y in ordered_kwargs.keys()) == match

        return QueryList(x for x in self if is_match(x))

    def scalar(self, default=None, attr=None):
        """Return the first item in this QueryList.

        args:
            default: The value to return if there is less than one item,
                or if the attr is not found.
            attr: Returns getattr(item, attr) if not None.
        """
        item, = self[0:1] or [default]
        if attr is None:
            result = item
        else:
            result = getattr(item, attr, default)
        return result
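A minimal usage sketch (the Position namedtuple is just a hypothetical stand-in for the custom classes mentioned below, not part of the original post):
from collections import namedtuple

Position = namedtuple('Position', 'portfolio ticker qty')

ql = QueryList([
    Position('123', 'MSFT', 10),
    Position('123', 'AMZN', 5),
    Position('456', 'MSFT', 7),
])

ql.filter(portfolio='123', ticker='MSFT').scalar(attr='qty')  # -> 10
ql.group_by('ticker')  # {'MSFT': QueryList([...]), 'AMZN': QueryList([...]), ...}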
I tried pandas. I wanted to like it, I really did. But ultimately it is too complicated for my needs.
For example:
df[(df['portfolio'] == '123') & (df['ticker'] == 'MSFT')]
is not as simple as
ql.filter(portfolio='123', ticker='MSFT')
Furthermore, creating a QueryList is simpler than creating a df.
That's because you tend to use custom classes with a QueryList. The data conversion code would naturally be placed into the custom class, which keeps it separate from the rest of the logic, whereas data conversion for a df would normally be done inline with the rest of the code.

Attribute Error for strings created from lists

I'm trying to create a data-scraping file for a class, and the data I have to scrape requires that I use while loops to get the right data into separate arrays (i.e., for states, SAT averages, etc.).
However, once I set up the while loops, the regex that cleared the majority of the HTML tags from the data broke, and I am getting an error that reads:
AttributeError: 'NoneType' object has no attribute 'groups'
My code is:
import re, util
from BeautifulSoup import BeautifulStoneSoup

# create a comma-delineated file
delim = ", "

# base url for sat data
base = "http://www.usatoday.com/news/education/2007-08-28-sat-table_N.htm"

# get webpage object for site
soup = util.mysoupopen(base)

# get column headings
colCols = soup.findAll("td", {"class": "vaTextBold"})

# get data
dataCols = soup.findAll("td", {"class": "vaText"})

# append data to cols
for i in range(len(dataCols)):
    colCols.append(dataCols[i])

# open a csv file to write the data to
fob = open("sat.csv", 'a')

# initiate the 5 arrays
states = []
participate = []
math = []
read = []
write = []

# split into 5 lists for each row
for i in range(len(colCols)):
    if i % 5 == 0:
        states.append(colCols[i])

i = 1
while i <= 250:
    participate.append(colCols[i])
    i = i + 5

i = 2
while i <= 250:
    math.append(colCols[i])
    i = i + 5

i = 3
while i <= 250:
    read.append(colCols[i])
    i = i + 5

i = 4
while i <= 250:
    write.append(colCols[i])
    i = i + 5

# write data to the file
for i in range(len(states)):
    states = str(states[i])
    participate = str(participate[i])
    math = str(math[i])
    read = str(read[i])
    write = str(write[i])
    # regex to remove html from data scraped
    # remove <td> tags
    line = re.search(">(.*)<", states).groups()[0] + delim + re.search(">(.*)<", participate).groups()[0] + delim + re.search(">(.*)<", math).groups()[0] + delim + re.search(">(.*)<", read).groups()[0] + delim + re.search(">(.*)<", write).groups()[0]
    # append data point to the file
    fob.write(line)
Any ideas regarding why this error suddenly appeared? The regex was working fine until I tried to split the data into different lists. I have already tried printing the various strings inside the final "for" loop to see if any of them were None for the first i value (0), but they were all the strings they were supposed to be.
Any help would be greatly appreciated!
It looks like the regex search is failing on (one of) the strings, so it returns None instead of a MatchObject.
Try the following instead of the very long #remove <td> tags line:
import sys

out_list = []
for item in (states, participate, math, read, write):
    try:
        out_list.append(re.search(">(.*)<", item).groups()[0])
    except AttributeError:
        print "Regex match failed on", item
        sys.exit()
line = delim.join(out_list)
That way, you can find out where your regex is failing.
Also, I suggest you use .group(1) instead of .groups()[0]. The former is more explicit.
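As a small illustration of that last point, here is a sketch of the more explicit form, with an explicit None check instead of letting the AttributeError propagate:
match = re.search(">(.*)<", states)
if match:
    value = match.group(1)  # the same text that .groups()[0] would return
else:
    value = None  # no match; handle however is appropriate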