I have some data which includes sizes, much like the model below.
class Product(models.Model):
    width = models.CharField(max_length=200)
    height = models.CharField(max_length=200)
    length = models.CharField(max_length=200)
Through annotation we have a field called at_size which produces data like:
[None, None, None]
['200', '000', '210']
['180', None, None]
This was accomplished like so (thanks to https://stackoverflow.com/a/70266320/5731101):
from django.contrib.postgres.fields import ArrayField
from django.db.models import CharField, F, Func

class Array(Func):
    template = '%(function)s[%(expressions)s]'
    function = 'ARRAY'

out_format = ArrayField(CharField(max_length=200))

annotated_qs = Product.objects.all().annotate(
    at_size=Array(F('width'), F('height'), F('length'),
                  output_field=out_format)
)
I'm trying to get this to convert into:
''
'200 x 000 x 210'
'180'
In code, this could look a bit like ' x '.join([i for i in data if i]). But as I need to accomplish this with database functions, it's a bit more challenging.
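For illustration, a plain-Python sketch of that transformation on the sample rows above:

rows = [
    [None, None, None],
    ['200', '000', '210'],
    ['180', None, None],
]
for row in rows:
    print(' x '.join(i for i in row if i))
# prints an empty line, then:
# 200 x 000 x 210
# 180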
I've been playing with StringAgg, but I keep getting:
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
It looks like I need to make sure the None values are excluded from the initial Array-func to begin with. But I'm not sure where to get started here.
How can I accomplish this?
Turns out the problem was two-fold.
Cleaning out the null values could be done using array_remove.
Gluing the strings together with a delimiter through StringAgg only works if the inputs are strings. But since we use an array, that wasn't the way to go. Instead, use array_to_string.
The final result looks like:
from django.contrib.postgres.fields import ArrayField
from django.db.models import CharField, F, Func, Value

class Array(Func):
    # https://www.postgresql.org/docs/9.6/functions-array.html
    template = '%(function)s[%(expressions)s]'
    function = 'ARRAY'

class ArrayRemove(Func):
    # https://www.postgresql.org/docs/9.6/functions-array.html
    function = 'array_remove'

class ArrayToString(Func):
    # https://stackoverflow.com/a/57873772/5731101
    function = 'array_to_string'

out_format = ArrayField(CharField(max_length=200))

annotated_qs = Product.objects.annotate(
    at_size=ArrayToString(
        ArrayRemove(
            Array(F('width'), F('height'), F('length'), output_field=out_format),
            None,  # remove None values from the array with array_remove
        ),
        Value(' x '),  # delimiter
        Value(''),     # fallback: replace any remaining null values with this
        output_field=CharField(max_length=200),
    )
)
This produces the desired format:
for product in annotated_qs:
    print(product.at_size)
180 x 000 x 200
180 x 026 x 200
180 x 7 x 200
180 x 000 x 200
200 x 000 x 220
180 x 000 x 200
175 x 230 x 033
160 x 000 x 200
60 x 220
Product.objects.annotate(display_format=Concat(F('width'), Value('×'), F('height'), Value('×'), F('length')))
Should do the trick, no?
No need to overcomplicate this; let's keep it nice and simple and use the database to concatenate the 3 strings and separate them with the multiplication symbol (obviously replace it if you prefer another character).
Take a look at the docs over here:
https://docs.djangoproject.com/en/4.1/ref/models/database-functions/#concat
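A runnable sketch of that suggestion with the needed imports (assuming the Product model from the question; note the separators are wrapped in Value(), since bare strings inside Concat are treated as field names):

from django.db.models import F, Value
from django.db.models.functions import Concat

qs = Product.objects.annotate(
    display_format=Concat(F('width'), Value('×'), F('height'), Value('×'), F('length')),
)
for p in qs:
    print(p.display_format)  # e.g. 200×000×210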
How can I filter 12 random objects from a model in Django?
I tried to do this, but it does not work; it just returned me 1 object.
max = product.objects.aggregate(id=Max('id'))
max_p = int(max['id'])
l = []
for s in range(1, 13):
    l.append(random.randint(1, max_p))
for i in l:
    great_proposal = product.objects.filter(id=i)
Hi. It worked with this code!
products = product.objects.all().order_by('-id')[:50]
great_proposal1 = random.sample(list(products), 12)
Try this:
product.objects.order_by('?')[:12]
The '?' will "sort" randomly and "[:12]" will get only 12 objects.
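If order_by('?') turns out to be slow on a large table (the database has to shuffle every row), a common alternative is to sample primary keys in Python first. A sketch, using the model name from the question; random_products is just an illustrative name:

import random

ids = list(product.objects.values_list('id', flat=True))
random_products = product.objects.filter(id__in=random.sample(ids, 12))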
I'm pretty sure the code is correct, but maybe you did not realize that you're just using great_proposal as the variable to save the output; it is overwritten on every iteration, so you only ever end up with one result.
Try:
result_array = []
for i in l:
    result_array.append(product.objects.filter(id=i))
I need to figure out how to read in the data from the file 'berlin52.tsp'.
This is the format I'm using
NAME: berlin52
TYPE: TSP
COMMENT: 52 locations in Berlin (Groetschel)
DIMENSION : 52
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 565.0 575.0
2 25.0 185.0
3 345.0 750.0
4 945.0 685.0
5 845.0 655.0
6 880.0 660.0
7 25.0 230.0
8 525.0 1000.0
9 580.0 1175.0
10 650.0 1130.0
And this is my current code
# Open input file
infile = open('berlin52.tsp', 'r')

# Read instance header
Name = infile.readline().strip().split()[1]            # NAME
FileType = infile.readline().strip().split()[1]        # TYPE
Comment = infile.readline().strip().split()[1]         # COMMENT
Dimension = infile.readline().strip().split()[1]       # DIMENSION
EdgeWeightType = infile.readline().strip().split()[1]  # EDGE_WEIGHT_TYPE
infile.readline()

# Read node list
nodelist = []
N = int(intDimension)
for i in range(0, int(intDimension)):
    x, y = infile.readline().strip().split()[1:]
    nodelist.append([int(x), int(y)])

# Close input file
infile.close()
The code should read in the file and output a list of tour nodes with the values "1, 2, 3..." while the x and y values are stored to be used for distance calculations. It can collect the headers, at least. The problem arises when creating the list of nodes.
This is the error I get though
ValueError: invalid literal for int() with base 10: '565.0'
What am I doing wrong here?
This is a file in TSPLIB format. To load it in Python, take a look at the Python package tsplib95, available through PyPI or on GitHub.
Documentation is available at https://tsplib95.readthedocs.io/
You can convert the TSPLIB file to a networkx graph and retrieve the necessary information from there.
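A minimal sketch of that approach (assuming tsplib95 is installed; load() is the entry point in recent versions of the package):

import tsplib95

problem = tsplib95.load('berlin52.tsp')
print(problem.name, problem.dimension)  # berlin52 52
print(problem.node_coords[1])           # [565.0, 575.0]
G = problem.get_graph()                 # networkx graph with EUC_2D edge weights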
You are feeding the string "565.0" into nodelist.append([int(x), int(y)]).
It is telling you it doesn't like that because that string is not an integer. The .0 at the end makes it a float.
So if you change that to nodelist.append([float(x), float(y)]), as just one possible solution, then you'll see that your problem goes away.
Alternatively, you can try removing or separating the '.0' from your string input.
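For example, either variant parses the sample line from the file (a standalone sketch):

line = '1 565.0 575.0'
x, y = line.strip().split()[1:]
print([float(x), float(y)])        # [565.0, 575.0]
print([int(x[:-2]), int(y[:-2])])  # [565, 575], after stripping the '.0'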
There are two problems with the code above. I have run the code and found them in the lines below:
Dimension = infile.readline().strip().split()[1]
This line should be
Dimension = infile.readline().strip().split()[2]
Instead of 1 it must be 2, because the line "DIMENSION : 52" splits into ['DIMENSION', ':', '52']: at index 1 you get ':' and at index 2 you get '52'. Both are of string type.
The second problem is with the line
N = int(intDimension)
It should be
N = int(Dimension)
And lastly, in the line
for i in range(0, int(intDimension)):
just simply use
for i in range(0, N):
Now everything should be alright, I think.
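To see why index 2 is needed for that line (the NAME, TYPE, and COMMENT lines have no space before the colon, so index 1 still works for them), try it standalone:

line = 'DIMENSION : 52'
print(line.strip().split())  # ['DIMENSION', ':', '52']
N = int(line.strip().split()[2])
print(N)                     # 52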
nodelist.append([int(x), int(y)])
The call int(x) can't convert x (the string '565.0') to an int because of the '.'.
Add
x = x[:-2]
y = y[:-2]
before the append to remove the trailing '.0'.
I have three lists that look like this:
age = ['51+', '21-30', '41-50', '31-40', '<21']
cluster = ['notarget', 'cluster3', 'allclusters', 'cluster1', 'cluster2']
device = ['htc_one_2gb','iphone_6/6+_at&t','iphone_6/6+_vzn','iphone_6/6+_all_other_devices','htc_one_2gb_limited_time_offer','nokia_lumia_v3','iphone5s','htc_one_1gb','nokia_lumia_v3_more_everything']
I also have a column in a df that looks like this:
campaign_name
0 notarget_<21_nokia_lumia_v3
1 htc_one_1gb_21-30_notarget
2 41-50_htc_one_2gb_cluster3
3 <21_htc_one_2gb_limited_time_offer_notarget
4 51+_cluster3_iphone_6/6+_all_other_devices
I want to split the column into three separate columns based on the values in the above lists. Like so:
age cluster device
0 <21 notarget nokia_lumia_v3
1 21-30 notarget htc_one_1gb
2 41-50 cluster3 htc_one_2gb
3 <21 notarget htc_one_2gb_limited_time_offer
4 51+ cluster3 iphone_6/6+_all_other_devices
First thought was to do a simple test like this:
ages_list = []
for i in age:
    if i in df['campaign_name'][0]:
        ages_list.append(i)
print(ages_list)
>>> ['<21']
I was then going to convert ages_list to a Series and combine it with the remaining two to get the end result above, but I assume there is a more Pythonic way of doing it?
The idea behind this is that you'll create a regular expression based on the values you already have. For example, if you want to build a regular expression that captures any value from your age list, you can do '|'.join(age), and likewise for the other lists you already have, cluster and device.
The device list is a special case because it contains the + sign, which conflicts with the regex (+ means 'one or more' in a regex). We can fix this issue by replacing every + with \+, which means 'capture a literal +'.
import re
import pandas as pd

df = pd.DataFrame({'campaign_name': ['notarget_<21_nokia_lumia_v3',
                                     'htc_one_1gb_21-30_notarget',
                                     '41-50_htc_one_2gb_cluster3',
                                     '<21_htc_one_2gb_limited_time_offer_notarget',
                                     '51+_cluster3_iphone_6/6+_all_other_devices']})

def split_df(row):
    campaign_name = row['campaign_name']
    row['age'] = re.findall('|'.join(age), campaign_name)[0]
    row['cluster'] = re.findall('|'.join(cluster), campaign_name)[0]
    row['device'] = re.findall('|'.join(x.replace('+', r'\+') for x in device), campaign_name)[0]
    return row
df = df.apply(split_df, axis=1)
If you want to drop the original column, you can do:
df = df.apply(split_df, axis=1).drop('campaign_name', axis=1)
Here I'm assuming that a value must be matched by the regex, but if this is not the case you can add your own checks; you get the idea.
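A vectorized alternative in the same spirit uses pandas' str.extract with re.escape, which handles the + sign automatically; sorting the alternatives longest-first makes sure e.g. htc_one_2gb_limited_time_offer wins over its prefix htc_one_2gb. A sketch, assuming every row contains one match per list:

import re

def alternation(values):
    # longest-first, so a longer name is tried before its prefix
    return '|'.join(re.escape(v) for v in sorted(values, key=len, reverse=True))

df['age'] = df['campaign_name'].str.extract('(' + alternation(age) + ')', expand=False)
df['cluster'] = df['campaign_name'].str.extract('(' + alternation(cluster) + ')', expand=False)
df['device'] = df['campaign_name'].str.extract('(' + alternation(device) + ')', expand=False)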
def my_method(options = {})
  # ...
end
# => Syntax error in ./src/auto_harvest.cr:17: for empty hashes use '{} of KeyType => ValueType'
While this is valid Ruby, it seems not to be valid Crystal; my suspicion is that this is because of typing. How do I tell the compiler I want to default to an empty hash?
Use a default argument (like in Ruby):
def my_method(x = 1, y = 2)
  x + y
end

my_method x: 10, y: 20 #=> 30
my_method x: 10        #=> 12
my_method y: 20        #=> 21
Usage of hashes for default/named arguments is totally discouraged in Crystal.
It seems the error had all the information I needed: I have to specify the types of the Hash's keys and values.
def my_method(options = {} of Symbol => String)
  # ...
end
It is quite clearly in the docs too.
I've been using gsub extensively lately, and I noticed that short patterns run faster than long ones, which is not surprising. Here's a fully reproducible example:
library(microbenchmark)

set.seed(12345)
n = 0
rpt = seq(20, 1461, 20)
msecFF = numeric(length(rpt))
msecFT = numeric(length(rpt))
inp = rep("aaaaaaaaaa", 15000)

for (i in rpt) {
  n = n + 1
  print(n)
  patt = paste(rep("a", rpt[n]), collapse = "")
  # time = microbenchmark(func(count[1:10000, 12], patt, "b"), times = 10)
  timeFF = microbenchmark(gsub(patt, "b", inp, fixed = FALSE), times = 10)
  msecFF[n] = mean(timeFF$time) / 1000000.
  timeFT = microbenchmark(gsub(patt, "b", inp, fixed = TRUE), times = 10)
  msecFT[n] = mean(timeFT$time) / 1000000.
}
library(ggplot2)
library(grid)
library(gridExtra)

p1 = qplot(rpt, msecFT, xlab = "pattern length, characters", ylab = "time, msec", main = "fixed = TRUE")
p2 = qplot(rpt, msecFF, xlab = "pattern length, characters", ylab = "time, msec", main = "fixed = FALSE")
grid.arrange(p1, p2, nrow = 2)
As you see, I'm looking for a pattern that contains "a" replicated rpt[n] times. The slope is positive, as expected. However, I noticed a kink at 300 characters with fixed = TRUE and at 600 characters with fixed = FALSE, after which the slope seems to be approximately the same as before (see plot below).
I suppose it is due to memory, object size, etc. I also noticed that the longest allowed pattern is 1463 symbols, with an object size of 1552 bytes.
Can someone explain the kink better and why at 300 and 600 characters?
Added: it is worth mentioning that most of my patterns are 5-10 characters long, which gives me the following timings on my real data (not the mock-up inp in the example above):
gsub, fixed = TRUE: ~50 msec per one pattern
gsub, fixed = FALSE: ~190 msec per one pattern
stringi, fixed = FALSE: ~55 msec per one pattern
gsub, fixed = FALSE, perl = TRUE: ~95 msec per one pattern
(I have 4k patterns, so the total timing of my module is roughly 200 sec, which is exactly 0.05 × 4000 with gsub and fixed = TRUE. It is the fastest method for my data and patterns.)
The kinks might be related to the number of bits required to hold patterns of that length.
There is another solution that scales much better: use the repetition operator {} to specify how many repeats you want to find. To find more than 255 repeats (the 8-bit integer max), you'll have to specify perl = TRUE.
patt2 <- paste0('a{', rpt[n], '}')
timeRF <- microbenchmark(gsub(patt2, "b", inp, perl = TRUE), times = 10)
I get speeds of around 2.1 ms per search with no penalty for pattern length. That's about 8x faster than fixed = FALSE for small pattern lengths and about 60x faster for large pattern lengths.