Comparing local CSVs to write updates, list(set()) issue

I have a funny problem. I can get the unique rows in the two comparison CSV files by using:
update = list(set(compareNew) - set(compareOld))
However, when I include this line in my function I get TypeError: 'list' object is not callable.
I need to do this for a few hundred CSVs, so I am calling the above line in a loop. Does that change anything for the function?
Using Python 3.4

I was able to fix this by subtracting the two sets, then appending the difference to a list in a subloop.
It makes it a two-step process, but it appears to work.
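For what it's worth, this particular error usually means the built-in name list has been rebound somewhere in the function, so the original one-liner should work again once the shadowing variable is renamed. A minimal sketch with invented data:

# If somewhere in the function a variable is named `list`, e.g.
#     list = csv.reader(f)
# then list(...) is no longer callable. After renaming that variable:
compareOld = [("a", 1), ("b", 2)]
compareNew = [("a", 1), ("c", 3)]
update = list(set(compareNew) - set(compareOld))
print(update)  # [('c', 3)]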

Getting the error: "missing 1 required positional argument: 'row'" when using DataFrame.apply()

I am trying to improve the performance of my stock order placer algorithm (thousands of lines) by switching from using iterrows() to using apply(), but I am getting this error:
TypeError: ("place_orders() missing 1 required positional argument: 'row'", 'occurred at index 2008-01-14 00:00:00')
Below is an example of the orders file I am reading in (a short list for simplicity):
Next, below is my code: both my attempt at implementing apply() and the slower iterrows() version.
I apologize if this is a newbie question, but I need to use the index and the rows inside the function, as the index is a bunch of dates.
Update: Below is an example of my prices_table.
When switching from iterrows to apply you need to change your mindset a little. Instead of looping over the dataframe and taking every row from top to bottom, you just specify what should happen in every row; for the most part you let go of row numbers.
So when using apply it's usually a good idea to let go of row numbers (in your case i). Try a call like this:
orders_df.apply(lambda row: place_orders(row), axis=1)
I realize that inside your place_orders function you are using specific (sets of) rows of the prices_table. To get around this you might want to merge the dataframes before calling apply, since apply is not really intended to work on multiple dataframes at once.
This forces you to rewrite some of your code, but in my experience the performance increase you gain from not using iterrows is always worth it.
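As a rough illustration of the merge-then-apply idea, here is a minimal sketch; the frames and the column names (Symbol, Order, Shares, Price) are invented stand-ins for the real orders_df and prices_table:

import pandas as pd

# Invented stand-ins for the real data
orders_df = pd.DataFrame(
    {"Symbol": ["AAPL", "IBM"], "Order": ["BUY", "SELL"], "Shares": [100, 50]},
    index=pd.to_datetime(["2008-01-14", "2008-01-15"]),
)
prices_table = pd.DataFrame({"Symbol": ["AAPL", "IBM"], "Price": [178.0, 97.0]})

# Merge once so every row already carries the price it needs,
# and turn the date index into an ordinary column.
merged = orders_df.reset_index().rename(columns={"index": "Date"}).merge(
    prices_table, on="Symbol")

def place_orders(row):
    # row is a pandas Series; the date is just another field after the merge
    sign = 1 if row["Order"] == "BUY" else -1
    return sign * row["Shares"] * row["Price"]

merged["Value"] = merged.apply(place_orders, axis=1)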

How to correctly use the Table.Repeat function?

I'm having some trouble getting the Table.Repeat function to work properly... I'm very new to Power Query/BI, so I am just about getting my head wrapped around all the coding.
Following the syntax, everything would appear to be correct, given that the addition of columns is optional.
What I am aiming to achieve is to have the entire table repeated a specific number of times, and the Table.Repeat function described here sounds like it fits the bill. But when I have attempted to implement it, it results in an error.
I was previously using the Append function; however, as I'm trying to append the query several thousand times, the query crashes Excel and has become uneditable after the initial setup.
I've tried implementing the Repeat code halfway through the query, where it's needed, and on a new sheet. Halfway through gave me an error stating that it could not find a Value and found a Table instead.
When I tried it on a new sheet, I did not get an error, but the applied steps disappeared and the data wasn't repeated. I also tried repeating the table far fewer times than I needed, just to test it out, but this still produced an error.
#"Repeat" = Table.Repeat(#"8-1", 2)
Essentially, I want the entire table repeated X number of times.

Stata: generate/replace alternatives?

I have been using Stata for several years now, along with other languages like R.
Stata is great, but there is one thing that annoys me: the generate/replace behaviour, and especially the "... already defined" error.
It means that if we want to run a piece of code twice, and that piece of code contains the definition of a variable, the definition needs two lines:
capture drop foo
generate foo = ...
While it takes just one line in other languages such as R.
So is there another way to define variables that combines "generate" and "replace" in one command?
I am unaware of any way to do this directly. Further, as @Roberto's comment implies, there are reasons simply issuing a generate command will not overwrite (see: replace) the contents of a variable.
To be able to do this while maintaining data integrity, you would need to issue two separate commands, as your question points out (explicitly dropping the existing variable before generating the new one). I see this as a method by which Stata forces the user to be clear about his/her intentions.
It might be noted that Stata is not alone in this regard. SQL Server, for example, requires the user to drop an existing table before creating a table with the same name (in the same database), does not allow multiple columns with the same name in a table, and so on, all for good reason.
However, if you are really set on being able to issue a one-liner in Stata to do what you desire, you could write a very simple program. The following should get you started:
program mkvar
    version 13
    // parse: mkvar newvar = expression [if] [in]
    syntax anything=exp [if] [in]
    // if the variable already exists, drop it silently
    capture confirm variable `anything'
    if !_rc {
        drop `anything'
    }
    generate `anything' `exp' `if' `in'
end
You would then naturally save the program to mkvar.ado in a directory that Stata will find (e.g., C:\ado\personal\ on Windows; if you are unsure, type sysdir), and call it using:
mkvar newvar=expression [if] [in]
Now, I haven't tested the above code much, so you may have to do a bit of debugging, but it has worked fine in the examples I've tried.
On a closing note, I'd advise you to exercise caution when doing this: certainly you will want to be vigilant about altering your data, retain a copy of your raw data while a do-file manipulates the data in memory, and so on.

Comparing two documents

I have two very large lists. They were both originally in Excel, but the larger one is a list of emails (about 160,000 of them) with other information like name and address, and the smaller one is a list of just 18,000 emails.
My question is: what would be the easiest way to get rid of the 18,000 rows in the first document that contain the email addresses from the second?
I was thinking regex, or maybe there is another application I can use? I have tried searching online, but there doesn't seem to be much specific to this. I also tried Notepad++, but it freezes when I try to compare these large files.
Thank you in advance!
Good question. One way I would tackle this is by making a C++ program [you could extrapolate the idea to the language of your choice; you never mentioned which languages you were proficient in] that reads each item of the smaller file into a vector of strings. First, of course, use Excel to save the files as CSV instead of XLS or XLSX, which will comma-separate the values so you can work with them more easily. For the larger list, "Save As" a copy of just the email addresses, deleting the other columns for now.
Then you could open the larger list and use a nested loop to check whether you should output to an output file. Something like:
#include <string>
#include <vector>

// LargeListVector, SmallListVector, OutputVector: std::vector<std::string>
bool foundMatch = false;
for (std::size_t y = 0; y < LargeListVector.size(); y++) {
    for (std::size_t x = 0; x < SmallListVector.size(); x++) {
        if (SmallListVector[x] == LargeListVector[y]) {
            foundMatch = true;
            break;  // no need to keep scanning the small list
        }
    }
    if (!foundMatch) OutputVector.push_back(LargeListVector[y]);
    foundMatch = false;
}
That might be partially pseudo-code, but do you get the idea?
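If Python is an option, the same filtering is only a few lines with a set, which also avoids the quadratic nested loop. A minimal sketch; the file names small.csv and large.csv are invented, and the email is assumed to be in the first column:

import csv

# Load the 18,000 addresses to remove into a set for O(1) membership tests
with open("small.csv", newline="") as f:
    remove = {row[0].strip().lower() for row in csv.reader(f) if row}

# Keep only the large-list rows whose email is not in the removal set
with open("large.csv", newline="") as f, open("kept.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        if row and row[0].strip().lower() not in remove:
            writer.writerow(row)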
So I read a forum post at: Here
=MATCH(B1,$A$1:$A$3,0)>0
Column B would be the large list with the 160,000 inputs, and column A was my list of the 18,000 things I needed to delete.
I used this to match everything, and pasted the formula into a separate column. It would print out either an error or TRUE; if the data was in both columns, it printed TRUE.
Then, because I suck with Excel, I threw the text into Notepad++ and searched for all lines that contained TRUE (match case, because in my case some of the data had the word "true" in it without caps). I marked those lines, then under Search > Bookmark I removed all the bookmarked lines. Pasted that back into Excel, and voilà.
I would like to thank you guys for helping and pointing me in the right direction :)

use uno (openoffice api) to open spreadsheet *without* recalculation

I'm using pyuno to read an Excel spreadsheet (running on Linux). Many cells have formulas referring to add-ins that are, obviously, not available. However, the cell values are what I want.
But when I load and read the sheet, it seems those formulas are being evaluated, and thus the values are being overwritten with errors.
I've tried several things, none of which have worked:
setting the flags AutomaticCalculation=False and MacroExecutionMode=NEVER_EXECUTE in the call to desktop.loadComponentFromURL
calling document.enableAutomaticCalculation(False) on the loaded document
Any suggestions?
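For concreteness, a minimal pyuno sketch of the kind of load being described; the file path and port are invented, and it assumes an soffice instance is already listening on that port:

import uno
from com.sun.star.beans import PropertyValue

def prop(name, value):
    p = PropertyValue()
    p.Name = name
    p.Value = value
    return p

# Connect to a running soffice, e.g. started with:
#   soffice --headless --accept="socket,host=localhost,port=2002;urp;"
local_ctx = uno.getComponentContext()
resolver = local_ctx.ServiceManager.createInstanceWithContext(
    "com.sun.star.bridge.UnoUrlResolver", local_ctx)
ctx = resolver.resolve(
    "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
desktop = ctx.ServiceManager.createInstanceWithContext(
    "com.sun.star.frame.Desktop", ctx)

never = uno.getConstantByName(
    "com.sun.star.document.MacroExecMode.NEVER_EXECUTE")
props = (prop("MacroExecutionMode", never), prop("Hidden", True))
doc = desktop.loadComponentFromURL(
    "file:///path/to/sheet.xls", "_blank", 0, props)
doc.enableAutomaticCalculation(False)  # note: the load may already have recalculated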
If formulas don't matter, you might circumvent the problem by processing a copy of your spreadsheet in which only the values (not the formulas) are present.
To achieve this quickly, select the whole sheet content, copy, then use Paste Special and keep only "value". Save to a new file (make sure you don't overwrite the original file, or every formula will be lost!). Your script should then be able to process this file.
This is an ugly solution, as there must be a way to do it programmatically.
Calc does not yet support using the cached results after loading the document. LibreOffice Calc does now use cached results for xls documents. The results are also stored in ods files, but they are ignored while loading, and the formula result is evaluated by compiling and interpreting the saved formula.
There are plans to add this for ods and xlsx too, but there are many ods producers out there writing incorrect results into the file. So for now the only solution is to keep a second version of the document saving only the results (or to implement this inside Calc).