Do I need to load the dataset every time I shut down the kernel and start it again the next day?
I keep getting a NameError saying that the train dataset has not been defined.
Is there a better way than loading the dataset and running the same commands again?
There are a couple of things to try. You could create and load a pickle object. I've done this for large text-based datasets.
Have a look at the %%cache magic at https://github.com/rossant/ipycache
This can be used to save your data in a pickle-like object between sessions.
Search Stack Overflow for 'how to cache ipython notebook'.
I'd have added this just as a comment as I am not supplying code examples, but I don't have the necessary reputation.
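A rough illustration of the pickle approach, assuming the dataset is a pandas DataFrame named train read from a placeholder file train.csv:

    import os
    import pandas as pd

    CACHE = 'train.pkl'   # placeholder path for the cached dataset

    if os.path.exists(CACHE):
        # Reload the previously pickled DataFrame instead of re-reading the raw data.
        train = pd.read_pickle(CACHE)
    else:
        train = pd.read_csv('train.csv')   # slow one-time load
        train.to_pickle(CACHE)             # cache it for the next session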
I am trying to collect SNMP data from printers for later analysis, using a prediction algorithm to foretell impending faults in printers before they actually occur. I am seeking advice on how best to collect the data and prepare it in a dataset format such as .csv so I can feed it into my classifier.
I would really appreciate any help.
Cheers!
My approach might not be the most efficient one, but it is something you can start with and improve later.
What I would do in your case is the following:
1) Create a Python script that polls every printer you need to poll, using PySNMP.
2) I'm not sure where you want to store your data, but you can import csv in your poller script and write a CSV file if that is what you want. Or, if you want the data inserted into an SQL database such as MySQL, you can push it there from your script as well.
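A minimal polling sketch using pysnmp's hlapi; the printer addresses, the community string, and the page-counter OID below are assumptions to replace with your own:

    import csv
    from datetime import datetime
    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    PRINTERS = ['10.0.0.11', '10.0.0.12']            # assumed printer IPs
    OID_PAGE_COUNT = '1.3.6.1.2.1.43.10.2.1.4.1.1'   # prtMarkerLifeCount (lifetime page counter)

    def poll(ip, oid):
        """Fetch one OID from one printer; return its value or None on error."""
        error_indication, error_status, _, var_binds = next(
            getCmd(SnmpEngine(),
                   CommunityData('public'),                      # assumed community string
                   UdpTransportTarget((ip, 161), timeout=2, retries=1),
                   ContextData(),
                   ObjectType(ObjectIdentity(oid))))
        if error_indication or error_status:
            return None
        return var_binds[0][1].prettyPrint()

    # Append one timestamped row per printer each time the script runs.
    with open('printer_metrics.csv', 'a', newline='') as f:
        writer = csv.writer(f)
        for ip in PRINTERS:
            writer.writerow([datetime.now().isoformat(), ip, poll(ip, OID_PAGE_COUNT)])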
Hope this helps:)
I am a transmission planning engineer trying to automate the execution of PSSE 100 times or more in one go through Python code. I can already run PSSE, change loads, rerun it, and write a bus-based summary report to a *.csv file. What I really want to do is select the first active power load variable of a PSSE case and increase it by 1 MW, then run PSSE and write the results to a CSV file, change the selected load back to its original value, and move on to the next active load, repeating this until I have done the same for all load buses.
This will help me calculate transmission loss factors for the entire network in one go.
Thanks
#dsmtlk, if you're experienced in Python, you can readily find the information you need in the PSSE API Manual located in your PSSE program folder (mine is in C:\Program Files (x86)\PTI\PSSE33\DOCS). The API routines for getting bus data are in section 8.6. The routine for changing load data, viz. psspy.load_data_4(), is in section 2.21.
If you're new to Python, here are a couple links I found helpful when I first started:
https://docs.python.org/2/tutorial/
http://www.tutorialspoint.com/python/
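A rough sketch of the loop described in the question, assuming PSSE 33's psspy module; the case path, the output columns, and the exact argument conventions should be checked against the API manual:

    # Sketch only -- verify each psspy call against the PSSE 33 API manual.
    import csv
    import psspy

    psspy.psseinit(10000)
    psspy.case(r'C:\studies\mycase.sav')   # assumed saved case

    _i = psspy.getdefaultint()
    _f = psspy.getdefaultreal()

    # Collect every in-service load: bus number, load id, and current P+jQ (MVA).
    ierr, (buses,) = psspy.aloadint(-1, 1, 'NUMBER')
    ierr, (ids,) = psspy.aloadchar(-1, 1, 'ID')
    ierr, (pq,) = psspy.aloadcplx(-1, 1, 'MVAACT')

    with open('loss_factors.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['bus', 'id', 'base_P_MW', 'system_loss_MW'])
        for bus, ld_id, s in zip(buses, ids, pq):
            p0 = s.real
            # Bump this load's active power by 1 MW; default values leave the other fields unchanged.
            psspy.load_data_4(bus, ld_id, [_i]*6, [p0 + 1.0, _f, _f, _f, _f, _f])
            psspy.fnsl()                        # re-solve the case
            ierr, loss = psspy.systot('LOSS')   # total system losses (complex MVA)
            writer.writerow([bus, ld_id, p0, loss.real])
            # Restore the original load before moving on to the next one.
            psspy.load_data_4(bus, ld_id, [_i]*6, [p0, _f, _f, _f, _f, _f])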
So I have two 200 MB JSON files. The first one takes 1.5 hours to load, and the second (which creates a bunch of many-to-many relationship models with the first) took 24+ hours; since there were no updates in the console, I had no clue whether it was still going or had frozen, so I stopped it.
Since loaddata wasn't working that well, I wrote my own script that loaded the data while also printing what had most recently been saved to the db, but I noticed the script's speed (along with my computer's) decayed the longer it ran. So I had to stop the script, restart my computer, and resume at the section of data where I left off, which was faster than letting the script run straight through. This was a tedious process, since it took roughly 18 hours, with me restarting the computer every 4 hours, to get all the data fully loaded.
I'm wondering if there is a better solution for loading in large amounts of data?
EDIT: I realized there's an option to load in raw SQL, so I may try that, although I need to brush up on my SQL.
When you're loading large amounts of data, writing your own custom script is generally the fastest approach. Once you've got the data loaded once, you can use your database's import/export options, which will generally be very fast (e.g., pg_dump).
When you are writing your own script, though, there are two things that will drastically speed it up:
Load the data inside a transaction. By default the database is likely in autocommit mode, which causes an expensive commit after each insert. Instead, make sure you begin a transaction before you insert anything, then commit it afterwards (importantly, though, don't forget to commit; nothing sucks like spending three hours importing data only to realize you forgot to commit it).
Bypass the Django ORM and use raw INSERT statements. There is some computational overhead to the ORM, and bypassing it will make things faster.
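A minimal sketch combining both ideas, assuming a table named myapp_item with a single name column and a JSON file containing an array of objects (all names here are placeholders):

    import json
    from django.db import connection, transaction

    # Assumed input: a JSON array of objects, each with a "name" field.
    with open('items.json') as f:
        records = json.load(f)

    # One transaction around the whole load avoids a commit per row,
    # and executemany with a raw INSERT bypasses the ORM overhead.
    with transaction.atomic():
        cursor = connection.cursor()
        cursor.executemany(
            "INSERT INTO myapp_item (name) VALUES (%s)",
            [(rec['name'],) for rec in records],
        )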
I am beginning to experiment with MapBox and TileMill, and what I would like to do is map 400,000 addresses in a CSV file that have been pre-geocoded. When I try to add this 100 MB CSV file as a layer in MapBox, I receive an error telling me that the CSV file is greater than 20 MB, and apparently this is a problem.
Can someone point me in the right direction as to the best way to get these 400k records into TileMill? Eventually, I want to publish the map to the web, and I was planning to do that using MapBox. I saw a program for converting CSV to a shapefile, and I am wondering whether that is the best approach.
Hundreds of thousands of markers is a lot. In the free tier of Mapbox, there is a limit of two thousand features. Such a limit would not stop you from displaying them in TileMill, but it would stop you from uploading them to mapbox.com.
For discussion of that limit, see here.
A simple strategy for reducing the markers is to restrict to the subset of features that lies within a smaller bounding box.
I don't think it will matter whether your features are expressed as GeoJSON, shapefiles, CSV, or another format. The number of features is what's stopping you.
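For example, a minimal way to keep only the rows that fall inside a smaller bounding box, assuming the CSV has lat and lon columns (the column names, file names, and coordinates are placeholders):

    import csv

    # Assumed bounding box: (west, south, east, north)
    WEST, SOUTH, EAST, NORTH = -74.3, 40.5, -73.7, 40.9

    with open('addresses.csv', newline='') as src, \
         open('addresses_subset.csv', 'w', newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            lon, lat = float(row['lon']), float(row['lat'])
            if WEST <= lon <= EAST and SOUTH <= lat <= NORTH:
                writer.writerow(row)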
I have the same problem. I had to import a 22 MB CSV file into TileMill and got the same error.
Although I don't have a working answer for you, I would try one of the following:
Convert the CSV to SQLite export files (see the sketch below): http://www.mapbox.com/tilemill/docs/tutorials/sqlite-work/
Configure the buffer for TileMill (however, I doubt this is the best option, because my TileMill already uses around 5 GB of memory when rendering points/markers, and increasing the buffer would make things worse).
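A rough sketch of the first option using Python's built-in csv and sqlite3 modules (the file names and table name are assumptions); TileMill can then read the resulting .sqlite file as a layer source:

    import csv
    import sqlite3

    conn = sqlite3.connect('addresses.sqlite')    # assumed output file
    with open('addresses.csv', newline='') as f:  # assumed input file
        reader = csv.reader(f)
        header = next(reader)
        cols = ', '.join('"%s"' % c for c in header)
        placeholders = ', '.join('?' for _ in header)
        conn.execute('CREATE TABLE IF NOT EXISTS addresses (%s)' % cols)
        conn.executemany(
            'INSERT INTO addresses (%s) VALUES (%s)' % (cols, placeholders),
            reader,
        )
    conn.commit()
    conn.close()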
I will keep experimenting with these ideas and will update this thread as soon as I find something. I am also looking forward to the TileMill pros out here providing the best working answer!
Best
Can someone suggest the best way to overcome this situation? I am using Kettle 4.1.0 Community Edition. When I preview the data in Spoon for a transformation's Table Output step, clicking the preview data option causes the data to be written directly to the database, even though I have not run the transformation. How can I overcome this problem?
regards
kiran kumar.g
That's just how it works. Perhaps "preview" is a poor name for it.
There are a couple of ways around it. Preview the step before the Table Output, and disable the hop so no data goes to the Table Output. If the Table Output step is collecting several inputs, then put in a "Dummy" step, make that do the collecting, and preview that instead.
Or change your DB connection to a local DB (via properties, JNDI, or even a different connection on the step); then you won't care if the data is generated.