np.where and if statement conditions - python-2.7

I am currently having trouble understanding np.where in relation to if statements (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html). I have heard that it is more efficient, but I haven't found any examples that make the function clear to me. Below is the if statement I really want to convert; can anyone assist? When you convert it, could you give some more conditional examples as well? And if np.where isn't a suitable solution, would some other NumPy conditional construct work better?
For the sake of a better explanation: true_list and flat_list aren't really lists in the real problem, they're arrays I'm appending to, so I'm going to rename them true_arr and flat_arr.
Here's more code:
combo = list(element)
flat_arr = np.concatenate(combo)  # changes array dimensions to what I need
sum_flat_arr = flat_arr.sum(axis=0)
salary = sum_flat_arr[2]
values = sum_flat_arr[3]
if salary <= 5000 and values > 150:
    true_arr = true_arr + flat_arr
true_arr is just an empty NumPy array (I'm not sure of the best way to handle it: prefill it with the right number of empty rows and columns, or leave it completely blank and just append to it).
flat_arr is just one whole array; it looks like:
Out:
[['Johnny Tsunami' 'Driver' 1000 39]
['Snow White' 'Pistol' 2000 40]
['Michael B. Jackson' 'Pistol' 2500 46]
['Greg Ritcher' 'Lookout' 200 25]]
Essentially Name, Job, Salary and Value. Instead of dataframes I'm trying to do everything in NumPy for speed. The reason I'm not using np.concatenate is that I hear it's slower than appending to a list. If I'm wrong, please explain.
It's just appending to a list. If it can't be done this way and it needs to be a function call, np.append or np.concatenate would be fine.
Be-all and end-all: if none of this applies and I have been thinking about it totally the wrong way, I'm simply looking for a NumPy way to do if statements more efficiently (faster).
Can someone point me in the right direction?
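For what it's worth, the usual NumPy way to express this kind of if is a boolean mask over a 2-D array rather than np.where (np.where is mainly for choosing between two values element-wise, or for getting the matching indices). A minimal sketch, with a hypothetical numeric array standing in for the stacked per-combo sums:

import numpy as np

# Hypothetical stand-in: one row of summed stats per candidate combo,
# columns = [salary_sum, value_sum]
sums = np.array([
    [5700., 150.],
    [4900., 160.],
    [3000., 151.],
])

# The mask tests both conditions for every row at once
mask = (sums[:, 0] <= 5000) & (sums[:, 1] > 150)
true_arr = sums[mask]    # keeps only the rows where the condition holds
idx = np.where(mask)[0]  # or: the indices of those rows
print(true_arr, idx)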

Related

Getting the error: "missing 1 required positional argument: 'row'" when using Dataframe.apply()

I am trying to improve the performance of my stock order placer algorithm (thousands of lines) by switching from using iterrows() to using apply(), but I am getting an error:
TypeError: ("place_orders() missing 1 required positional argument: 'row'", 'occurred at index 2008-01-14 00:00:00')
Below is an example of the orders file I am reading in (short list for simplicity):
Next, below is my code: both my attempt at implementing apply() and the slower iterrows() version.
I apologize if this is a newbie question, but I need to use the index and the rows inside the function, as the index is a bunch of dates.
Update: Below is an example of my prices_table.
When switching from iterrows to apply you need to change your mindset a little. Instead of looping over the dataframe and taking every row from top to bottom, you just specify what should happen in every row; mostly, just let go of row numbers.
So when using apply it's usually a good idea to let go of row numbers (in your case i). Try using a function like this in your apply:
orders_df.apply(lambda row: place_orders(row), axis=1)
I realize that inside your place_orders function you are using specific (sets of) rows of the prices_table. To get around this, you might want to merge the dataframes before calling apply, since apply is not really intended to work on multiple dataframes at once.
This forces you to rewrite some of your code, but in my experience the performance increase you gain from not using iterrows is always worth it.
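For illustration, a minimal sketch (column names and values here are hypothetical) of the apply pattern, including how to reach the date index from inside the function via row.name:

import pandas as pd

# Hypothetical stand-ins for the poster's orders data
orders_df = pd.DataFrame(
    {'Symbol': ['AAPL', 'IBM'], 'Order': ['BUY', 'SELL'], 'Shares': [100, 50]},
    index=pd.to_datetime(['2008-01-14', '2008-01-15']),
)

def place_orders(row):
    # With axis=1, row.name carries the index label, i.e. the order date
    return '%s: %s %d %s' % (row.name.date(), row['Order'], row['Shares'], row['Symbol'])

print(orders_df.apply(place_orders, axis=1))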

Get highest value from a list with a lot of useless characters

I am trying to get a value from a cell in Google Sheets which contains a list of values separated by commas.
Example:
UC133 - 2019/01/10 2019/01/30, UC99 - 2018/11/29 2018/12/19, UC134 - 2019/06/01 2019/06/19, UC132 - 2018/12/20 2019/01/09
I would like to be able to get an output in a cell of "UC134", because 134 is "bigger" than UC99, UC132 and UC133.
I tried a lot of different functions and formulas but I am unable to get anything to work. I also really tried to fix the original data I get this from, but it seems that is not an option.
Any help is appreciated, ideally without resorting to Apps Script functions.
Thank you very much for your time and let me know if you have any questions.
=ARRAYFORMULA("UC"&MAX(REGEXEXTRACT(SPLIT(A1, ","), "UC(\d+)\s")*1))
shorter: =ARRAYFORMULA("UC"&MAX(LEFT(SPLIT(A1, "UC"), 3)*1))
longer: =ARRAYFORMULA("UC"&MAX(INDEX(SPLIT(TRANSPOSE(SPLIT(A1, "UC")), " ")),,1))
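For comparison, the same idea outside Sheets: a minimal Python sketch that, like the first formula, extracts every UC number with a regex and keeps the largest one, compared numerically rather than as text:

import re

cell = ("UC133 - 2019/01/10 2019/01/30, UC99 - 2018/11/29 2018/12/19, "
        "UC134 - 2019/06/01 2019/06/19, UC132 - 2018/12/20 2019/01/09")

# key=int compares the numbers numerically, so UC99 < UC134
print("UC" + max(re.findall(r"UC(\d+)", cell), key=int))  # -> UC134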

Ordering by sum of difference

I have a model that has one attribute with a list of floats:
values = ArrayField(models.FloatField(default=0), default=list, size=64, verbose_name=_('Values'))
Currently, I'm getting my entries and order them according to the sum of all diffs with another list:
def diff(l1, l2):
    return sum(abs(v1 - v2) for v1, v2 in zip(l1, l2))

list2 = [0.3, 0, 1, 0.5]
entries = list(Model.objects.all())
entries.sort(key=lambda t: diff(t.values, list2))
This works fast if my number of entries is small. But I'm afraid that with a large number of entries, comparing and sorting them all will get slow, since they have to be loaded from the database. Is there a way to make this more efficient?
The best way is to write it yourself; right now you are iterating over the list more than four times! Although this approach looks pretty, it's not good.
One thing that you can do is:
have a variable called last_diff and set it to 0;
iterate through all entries;
iterate through each entry.values: from i = 0 to the end of the list, calculate abs(entry.values[i] - list2[i]);
sum these values into a variable called new_diff;
if new_diff > last_diff, break from the inner loop and push the entry into its right place (this is called insertion sort, check it out!).
This way, in the average scenario, the time complexity is much lower than what you are doing now! A sketch of the early-exit idea follows.
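A minimal sketch of that early-exit helper (the cutoff argument is hypothetical; in the insertion-sort loop it would be the diff of the entry this one is being compared against):

def diff_with_cutoff(values, ref, cutoff):
    # Accumulate sum(|v - r|) but stop as soon as the partial sum
    # exceeds the cutoff; at that point the final diff is guaranteed
    # to be larger, so the exact value no longer matters.
    total = 0.0
    for v, r in zip(values, ref):
        total += abs(v - r)
        if total > cutoff:
            return None  # early exit: already worse than the cutoff
    return total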
And maybe you have to get creative too. I'm going to share an idea; check it for yourself to make sure it holds.
Assuming that:
the values list elements are always positive floats, and
list2 is always the same for all entries,
then you may be able to say: the bigger the sum over the elements in values, the bigger the diff value is going to be, no matter what the elements of list2 are. Then you might be able to just forget about the whole diff function. (Test this!)
The only way to make this really go faster is to move as much work as possible to the database, i.e. the calculations and the sorting. It wasn't easy, but with the help of this answer I managed to actually write a query for that in almost pure Django:
from django.contrib.postgres.fields import ArrayField
from django.db import models

class Unnest(models.Func):
    function = 'UNNEST'

class Abs(models.Func):
    function = 'ABS'

class SubquerySum(models.Subquery):
    template = '(SELECT sum(%(field)s) FROM (%(subquery)s) _sum)'

x = [0.3, 0, 1, 0.5]
pairdiffs = Model.objects.filter(pk=models.OuterRef('pk')).annotate(
    pairdiff=Abs(Unnest('values') - Unnest(models.Value(x, ArrayField(models.FloatField())))),
).values('pairdiff')
entries = Model.objects.all().annotate(
    diff=SubquerySum(pairdiffs, field='pairdiff')
).order_by('diff')
The unnest function turns each element of values into its own row. Here that happens twice, but the two resulting columns are instantly subtracted and made positive. Still, there are as many rows per pk as there are values. These need to be summed, but that's not as easy as it sounds: the column can't simply be aggregated. This was by far the trickiest part; even after fiddling with it for so long, I still don't quite understand why Postgres needs this indirection. Of the few options there are to make it work, I believe a subquery is the only one expressible in Django (and only as of 1.11).
Note that the above behaves exactly the same as zip, i.e. when one array is longer than the other, the remainder is ignored.
Further improvements
While it will already be a lot faster when you don't have to retrieve all rows and loop over them in Python, it still results in a full table scan: all rows have to be processed, every single time. You can do better, though. Have a look into the cube extension. Use it to calculate the L1 distance (at least, that seems to be what you're calculating) directly with the <#> operator. That will require the use of RawSQL or a custom Expression. Then add a GiST index on the SQL expression cube("values"), or directly on the field if you're able to change its type from float[] to cube. In the latter case, you might have to implement your own CubeField too; I haven't found any package yet that provides it. In any case, with all that in place, top-N queries on the lowest distance will be fully indexed and hence blazing fast.
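A rough sketch of the RawSQL variant, assuming the cube extension is installed (cube() and the <#> taxicab-distance operator come from that extension; everything else is standard Django):

from django.db import models
from django.db.models.expressions import RawSQL

x = [0.3, 0, 1, 0.5]

# cube("values") <#> cube(%s) computes the L1 (taxicab) distance between
# the row's float[] column and the parameter array, entirely in Postgres.
entries = Model.objects.annotate(
    l1=RawSQL('cube("values") <#> cube(%s)', (x,), output_field=models.FloatField())
).order_by('l1')[:10]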

Some confusion over Numpy + Scipy + matplotlib Spectrum Analyzer code

I've been attempting to understand the code at the bottom of http://www.frank-zalkow.de/en/code-snippets/create-audio-spectrograms-with-python.html, though sadly I haven't been getting anywhere with it. I don't think I'm expected to understand most of the code, as I have limited experience with FFTs, but unfortunately I'm also having trouble understanding how the graph is generated. I'm also getting very limited progress from a trial-and-error approach, due to the fact that my computer lags heavily and because of the relatively long time it takes for a graph to be generated.
With that being said, I need a way to scale the graph so that it only displays values up to 5000 Hz, though still on a logarithmic scale. I'd also like to understand how the wav file is sampled, and what values I can edit in order to take more samples per second. Can somebody explain how both of these points work, and how I can edit the code in order to fulfill these requirements?
Hm, this code is by me, so I'll gladly help you understand it. It's maybe not best practice and there may be several ways to improve it (suggestions are welcome), but at least it worked for me.
The function stft does a standard short-time Fourier transform of an audio signal with the help of NumPy strides. The function logscale_spec takes an STFT and scales it logarithmically. This is maybe a bit dirty and there must be a better way to do it, but it worked for me. plotstft is the function that finally reads a wave file via scipy.io.wavfile, combines the prior two functions and makes a plot with matplotlib's imshow. If you have a mono wave file you should be able to just call plotstft("/path/to/mono.wav").
That was an overview; if I should explain some things in more detail, just say so.
To your questions. To leave out some frequency values: you can get the frequency values of the FFT with np.fft.fftfreq(binsize, 1./sr). You just have to find the index of your cutoff value and drop everything above it from the STFT.
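A minimal sketch of that cutoff, with binsize and sr as hypothetical stand-ins for the values the linked code uses:

import numpy as np

binsize = 2 ** 10  # FFT window size (hypothetical, a typical default)
sr = 44100         # sample rate, as returned by scipy.io.wavfile.read

# Positive frequency bins of the FFT, in ascending order
freqs = np.fft.fftfreq(binsize, 1. / sr)[:binsize // 2]
cutoff = np.searchsorted(freqs, 5000)  # index of the first bin >= 5000 Hz

# spec = stft(samples, binsize)  # from the linked code
# spec = spec[:, :cutoff]        # keep only the bins below 5000 Hz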
I don't understand your second question... You can have a look at all the samples of your wave file like this (the first element of the result is the sample rate, the second the sample array):
>>> import scipy.io.wavfile as wav
>>> x = wav.read("/path/to/file.wav")
>>> x
(44100, array([4554752, 4848551, 3981874, ..., 2384923, 2040309, 294912], dtype=int32))
>>> x[1]
array([4554752, 4848551, 3981874, ..., 2384923, 2040309, 294912], dtype=int32)

Xlrd list index out of range

I'm just starting to explore xlrd, and to be honest I am pretty new to programming altogether. I have been working through some of their simple examples and can't get this simple code to work:
import xlrd
book = xlrd.open_workbook('C:\\Users\\M\\Documents\\trial.xlsx')
sheet = book.sheet_by_index(1)
cell = sheet.cell(0, 0)
print cell
I get an error: list index out of range (referring to the 2nd to last bit of code)
I cut and pasted most of the code from the PDF... any help?
You say:
I get an error: list index out of range (referring to the 2nd to last
bit of code)
I doubt it. How many sheets are there in the file? I suspect that there is only one sheet. Indexing in Python starts from 0, not 1. Please edit your question to show the full traceback and the full error message. I suspect that it will show that the IndexError occurs in the 3rd-last line:
sheet=book.sheet_by_index(1)
I would play around with it in the console. Execute each statement one at a time and view the result of each. The sheet indexes count from 0, so if you only have one worksheet, asking for sheet_by_index(1) requests the second one, and that will give you a list index out of range error.
Another thing you might be missing is that not all cells exist if they don't have data in them. Some do, but some don't. Basically, the cells that exist from xlrd's standpoint are the ones in the nrows x ncols matrix.
Another thing: if you actually want the values out of the cells, use the cell_value method. That will return either a string or a float.
Side note: you could write your path as 'C:/Users/M/Documents/trial.xlsx'. Python handles the / vs \ difference on Windows for you, and you won't have to mess around with escape characters.
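Putting those points together, a small corrected sketch (assuming the workbook has only one sheet):

import xlrd

book = xlrd.open_workbook('C:/Users/M/Documents/trial.xlsx')
print book.nsheets               # check how many sheets there really are
sheet = book.sheet_by_index(0)   # the first sheet is index 0, not 1
print sheet.nrows, sheet.ncols   # only cells inside this matrix exist
print sheet.cell_value(0, 0)     # the bare value: a string or a float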