Cycling through URLs to download csv files - web-services

I have a list of URLs which access a web service. The web service returns .csv files. I want to be able to cycle through them using a date field which appears in a specific format in the URL itself, thereby downloading the data day by day. Access seems fairly slow, as even a manual URL request can take a couple of minutes to complete, and I suspect the issue is at the web service's end.
The URL is in the format:
http://web.service.com/ws/XYZ/data/?key=mysecretkeyf&field1=X&start=YYYY-MM-DD 00:00&end=YYYY-MM-DD 00:00&field=Y&format=csv
So the way I envision it (and I am keen to take advice) is to use variables for the start year, month and day, moving on to the next URL as soon as the previous .csv has downloaded, and ending when the current date is reached.
Any ideas most welcome.

The coding is really straightforward, which makes me wonder whether you are looking to code this yourself or asking if there is a service out there that would do this for you. If you are coding, I'd choose a language that works well for you. @Vivek mentioned Python, which is what I would choose as well.
If you do not want to take the coding approach, you could check out DownThemAll. I have used this utility for batch downloads where you have to increment numbers in parts of the URL; it may be a good non-programming solution.
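If you do go the Python route, here is a minimal sketch of the date-cycling idea, assuming the requests library is installed; the base URL, key and field values are placeholders copied from the question rather than details of a real service:

    import datetime
    import requests  # third-party: pip install requests

    BASE_URL = "http://web.service.com/ws/XYZ/data/"   # placeholder from the question
    API_KEY = "mysecretkey"                            # placeholder: use your real key

    day = datetime.date(2015, 1, 1)                    # assumed first day of data
    today = datetime.date.today()

    while day < today:
        next_day = day + datetime.timedelta(days=1)
        params = {
            "key": API_KEY,
            "field1": "X",
            "start": day.strftime("%Y-%m-%d 00:00"),
            "end": next_day.strftime("%Y-%m-%d 00:00"),
            "field": "Y",
            "format": "csv",
        }
        # the service is slow, so allow a generous timeout per request
        response = requests.get(BASE_URL, params=params, timeout=300)
        response.raise_for_status()
        with open("data_%s.csv" % day.isoformat(), "wb") as f:
            f.write(response.content)
        day = next_day

Each iteration downloads one day's file, and the loop stops once the current date is reached, which matches the behaviour described in the question.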

Related

Django Upload File, Analyse Contents and write to DB or update form

I'm pretty new to Django, trying to get to grips with it and expand what I think it's capable of doing, and maybe one of you more knowledgeable people here can point me in the right direction.
I'm basically trying to build something like a so-called "asset management" system to track the software version of a product. When an engineer updates the software version, they run a script which gathers all the information (version, install date, hardware, etc.) and stores it in a .txt file. The engineer then comes back to the website and uploads this .txt file for that customer, and it automatically updates the fields in the form, or writes directly to the database.
While I've searched a bit on here for concepts, I haven't been able to find anything similar (maybe my search terms aren't correct?), and wanted to ask whether what I'm trying to do is even feasible, or am I lost down the rabbit hole of limitations :) Maybe it's not doable within Django. Any suggestions on how I should approach such a problem would be greatly appreciated.
What you're asking is doable, either within the form itself or after the form/file upload is submitted.
In the form approach you'll want to live-update the form when the data comes from a .txt file; that can be done with JavaScript. The data is read from the text file and filled into the form fields in the manner you define, which also means form validation will work as you expect.
Another option is to require a .txt file in a specified format and parse it in the view (form_valid() for class-based views, request.FILES for function-based views), then perform all the required validation and save the values to the database as a model instance.
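As a rough illustration of that second approach, here is a minimal sketch of a function-based view; the form, the Asset model, the "report" field name and the "key: value" layout of the .txt file are all assumptions made for the example:

    # views.py - sketch of parsing an uploaded .txt file (assumed "key: value" lines)
    from django.shortcuts import redirect, render

    from .forms import AssetUploadForm   # hypothetical form with a single FileField named "report"
    from .models import Asset            # hypothetical model with version/install_date/hardware fields

    def upload_asset(request):
        if request.method == "POST":
            form = AssetUploadForm(request.POST, request.FILES)
            if form.is_valid():
                data = {}
                for line in request.FILES["report"].read().decode("utf-8").splitlines():
                    if ":" in line:
                        key, value = line.split(":", 1)
                        data[key.strip().lower()] = value.strip()
                # save the parsed values as a new model instance
                Asset.objects.create(
                    version=data.get("version", ""),
                    install_date=data.get("install date") or None,
                    hardware=data.get("hardware", ""),
                )
                return redirect("asset-list")   # hypothetical URL name
        else:
            form = AssetUploadForm()
        return render(request, "upload_asset.html", {"form": form})

The same parsing could instead feed initial data back into a bound form if you want the engineer to review the values before saving.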

Generating WagtailCMS Document Previews

Is there a way to extend Wagtail's documents to show file previews? There's a service (I have not tested it) which looks cool, but the jump in cost from the free plan to the Pro plan is huge. I am hoping someone has figured this out already and can point me to the solution. Thank you.
FilePreviews.io's 100 documents a month on the free plan seems pretty generous to me. You could try to build something similar, e.g. using ImageMagick to create PDF thumbnails:
http://duncanlock.net/blog/2013/11/18/how-to-create-thumbnails-for-pdfs-with-imagemagick-on-linux/
and use a service like Aspose to convert common file formats to PDFs:
https://www.aspose.com
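If you try the ImageMagick route, a minimal sketch (assuming ImageMagick and Ghostscript are installed, and using hypothetical file paths) could look like this:

    import subprocess

    def make_pdf_thumbnail(pdf_path, thumb_path, width=200):
        """Render the first page of a PDF as a thumbnail using ImageMagick's convert."""
        subprocess.check_call([
            "convert",
            "-density", "150",          # rasterise the PDF at a reasonable resolution
            pdf_path + "[0]",           # [0] selects the first page only
            "-thumbnail", str(width),   # scale down to the requested width
            thumb_path,
        ])

    # usage sketch
    # make_pdf_thumbnail("report.pdf", "report_thumb.png")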
Or you could do it manually, by adding a thumbnail field to your document model, and telling users to provide their own thumbnails.
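For the manual route, Wagtail lets you swap in a custom document model, so adding a thumbnail field is a small change. This is only a sketch: the model and app names are made up, and the import path may differ between Wagtail versions (older releases use wagtail.wagtaildocs.models):

    # models.py - sketch of a document model with a manually supplied thumbnail
    from django.db import models
    from wagtail.documents.models import AbstractDocument, Document

    class ThumbnailedDocument(AbstractDocument):
        # requires Pillow, since ImageField validates uploaded images
        thumbnail = models.ImageField(upload_to="doc_thumbnails/", blank=True, null=True)

        admin_form_fields = Document.admin_form_fields + ("thumbnail",)

    # settings.py
    # WAGTAILDOCS_DOCUMENT_MODEL = "myapp.ThumbnailedDocument"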
But if you or your client are uploading more than 100 documents a month, and you want reliable thumbnail generation for multiple document types, $49 a month may be better value than working it all out yourself.

Automating Monthly Small Business Task (VBA, VC++, Excel, Access, Quickbooks, etc)

Let me start by giving a quick background on myself (please forgive me). I have an intense interest in programming and computers/technical things in general. I took a year of C/C++ in college and a semester of assembly. I have messed around with Visual BASIC. So, almost all of my programming knowledge is limited to these three languages in order of proficiency:
C/C++
Assembly
Visual BASIC
I have a job at a small business that can't justify hiring a trained/"certified" programmer, and I have tasked myself with automating a process that must be completed on a monthly basis. It involves:
Sending faxes that are to be filled out with numbers
Receiving those faxes that are returned (all incoming faxes go to network folder as PDF)
Collecting the numbers from received faxes and entering these numbers into Excel (some are Word format for some reason) and then into QuickBooks after calculations
Sending emails
Receiving replies to these emails that contain numbers
Manually entering these numbers into Excel and then QuickBooks after calculations
Collecting numbers from a website written in JavaScript. Numbers from the website can be exported to a *.csv file.
Finally, printing invoices out from QuickBooks using the calculated numbers that have been entered.
My goal is to automate this entire process. As of now, everything is done manually. Emails and faxes are sent one at a time. Numbers from website are read and entered into Excel one at a time. Numbers are put into QB and invoices are printed one at a time.
So far I have added an email scheduling add-on to Outlook that automatically sends the emails every month. I am working on setting up faxes to be sent automatically (the only thing I can think of off the top of my head is manipulating Windows Scan/Fax with an API library in either VB or VC++).
Also, I am automating the calculations that must be performed in order to prep the collected numbers for entry into QB using VBA/Excel and, potentially, Access.
Right now I'm brainstorming a way to automatically collect the numbers (along with the customer name) from the returned faxes. My idea was to create a new fax sheet that forces the customer to "bubble in" the numbers like a ScanTron sheet. That way I could write a program (perhaps in C++) to parse the PDF, looking for a certain colored pixel in a specific spot in order to piece together each number (I wonder if I could automatically OCR the PDFs and collect the customer name simply by extracting text from each PDF?). The results could then be sent to a database or perhaps directly to an Excel sheet (the Excel sheets have to stay so that hard copies of the data can be printed, though I suppose this could be accomplished without Excel).
And lastly, since some customers refuse to use any of the methods available to them, we have to call some of them manually. Once I am finished with all of the aforementioned work, I would like to develop a way to allow customers to call a specific phone number and key in the information via voice prompt, which would then deposit the information in a database somewhere. This will be complicated and require special equipment, so it will be last and lowest priority. Not worried about this right now.
Since my experience with programming is only moderate (though I'm sure my working knowledge will expand quickly once I get started, since a lot of it is already in my brain somewhere), I wanted to give myself the best advantage and tools possible to tackle this project before I get so far into it that changing my methods would waste a lot of time/work. To sum up, I need to make an outline of exactly what I need to do/learn and what techniques/applications to use.
This is the site I always come to when searching for my programming questions and I have come to the conclusion that the people here are generally extremely knowledgeable, patient and helpful. I will appreciate any contribution of information, advice and/or insights no matter how small. I realize that in this situation I am the "beggar" and thus will be grateful for whatever I get.
Thanks in advance.
P.S. Before anyone says anything: I have "UTFSE" extensively and have assimilated lots of info from it. However, we all know that there's no equal to a human's problem solving capabilities--especially when proficient in the specific field.
Nice work! You are definitely on the right track. That was a lot of information so I apologize if I repeat anything you already know.
1) Faxes - Microsoft has an excellent resource for learning how to send faxes (they even provide the code). Check this out: http://msdn.microsoft.com/en-us/library/windows/desktop/ms693482(v=vs.85).aspx
2) You will have to OCR the PDFs (as you mentioned), and then you can extract the information. But (as you seem to understand) you cannot modify a PDF with plain C++ out of the box.
3) C++ does let you write a file that Excel can open. However, the native Excel format is very complicated and will probably cause some problems; one of them is that naively written data tends to end up in a single cell. A way to get around this is to do your Excel I/O with .csv files: a comma separates the columns and a new line the rows. For example,
A1, B1, C1
A2, B2, C2
A3, B3, C3
Excel will open and read these files correctly. However, you won't be able to format fonts, borders, etc. automatically.
This is the extent of my knowledge, I have never worked with emails or Quickbooks. Hope it helps!

How do you open a file in C++ from HTTP where the URL is NOT the file location

I'm a first-year comp sci student with moderate knowledge of C++, and for a job I'm trying to put together a utility using a new U.S. Census Bureau API. It takes ID codes for things like state/county/census tract and the desired table, and spits back the desired table for the desired location.
Here's an example of a query for population stats for California and New York.
More examples can be found here: http://www.census.gov/developers/
My snag is that I've never worked with files over HTTP, and I'm also not sure how to handle a URL that outputs plain text but doesn't actually point to a file location. Would it be possible to just use stdin? I don't understand how to handle the output given by one of the census query URLs.
Right now I'm using infile, which I know isn't correct, but I'm not sure what a correct solution is either.
Thanks
The fact that the data you're receiving is (apparently) generated on the fly rather than coming from a file doesn't really make any difference to you -- you get the same stream of bytes either way.
My immediate advice would be to use cURL for the job. Most of your work is constructing a correct URL and retrieving the data from it, which is what cURL specializes in; it makes it pretty easy to grab the data. From there, you can use any of quite a few JSON parser libraries (e.g., yajl), or you can parse it on your own (JSON is simple enough to make that fairly practical). A quick Google indicates that a fair number of people have already done this and have various blog posts and such giving information about how to do it (though I suspect most of that is probably unnecessary).

How would I get a subset of Wikipedia's pages?

How would I get a subset (say 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML, but it's more like 1 or 2 gigs; I don't need that much.
I want to experiment with implementing a map-reduce algorithm.
Having said that, if I could just find 100 megs' worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.
Edit: Any that aren't torrents? I can't get those at work.
The Stack Overflow database is available for download.
Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
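A minimal sketch of that approach, assuming the requests library and writing each page to its own file, might look like this:

    import time
    import requests  # third-party: pip install requests

    TARGET_BYTES = 100 * 1024 * 1024   # stop after roughly 100MB of HTML
    RANDOM_URL = "http://en.wikipedia.org/wiki/Special:Random"

    seen = set()
    total = 0
    while total < TARGET_BYTES:
        response = requests.get(RANDOM_URL)   # follows the redirect to a random article
        if response.url in seen:
            continue                          # discard duplicates
        seen.add(response.url)
        html = response.text
        total += len(html.encode("utf-8"))
        with open("page_%05d.html" % len(seen), "w", encoding="utf-8") as f:
            f.write(html)
        time.sleep(1)                         # be polite: limit the request rate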
One option is to download the entire Wikipedia dump and then use only part of it. You can either decompress the entire thing and then use a simple script to split the file into smaller files (e.g. here), or, if you are worried about disk space, you can write a script that decompresses and splits on the fly, stopping the decompression at any stage you want. Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly, if you're comfortable with Python (look at mparser.py).
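For the decompress-and-split-on-the-fly idea, here is a minimal sketch using the standard bz2 module; the dump filename and the 100MB cut-off are assumptions:

    import bz2

    DUMP_FILE = "enwiki-latest-pages-articles.xml.bz2"   # assumed local dump filename
    LIMIT = 100 * 1024 * 1024                            # stop after ~100MB of decompressed XML

    written = 0
    with bz2.open(DUMP_FILE, "rt", encoding="utf-8") as dump, \
            open("wikipedia-sample.xml", "w", encoding="utf-8") as out:
        for line in dump:          # decompresses lazily, line by line
            out.write(line)
            written += len(line.encode("utf-8"))
            if written >= LIMIT:
                break              # stop without decompressing the rest of the dump

Note the result is XML truncated mid-document, which is fine for feeding a map-reduce experiment but will upset a strict XML parser.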
If you don't want to download the entire thing, you're left with the option of scraping. The Export feature might be helpful for this, and the wikipediabot was also suggested in this context.
If you wanted to get a copy of the Stack Overflow database, you could do that from the Creative Commons data dump.
Out of curiosity, what are you using all this data for?
You could use a web crawler and scrape 100MB of data?
There are a lot of Wikipedia dumps available. Why do you want to choose the biggest (the English wiki)? The Wikinews archives are much smaller.
One smaller subset of Wikipedia articles comprises the 'meta' wiki articles. These are in the same XML format as the full article dataset, but smaller (around 400MB as of March 2019), so the set can be used for software validation (for example, testing Gensim scripts).
https://dumps.wikimedia.org/metawiki/latest/
You want to look for any files with the -articles.xml.bz2 suffix.
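For example, fetching the current meta wiki articles dump could be as simple as the following; the exact filename is an assumption based on the usual <wiki>-latest-pages-articles.xml.bz2 naming convention:

    import urllib.request

    # assumed filename, following the usual <wiki>-latest-pages-articles.xml.bz2 convention
    URL = ("https://dumps.wikimedia.org/metawiki/latest/"
           "metawiki-latest-pages-articles.xml.bz2")

    urllib.request.urlretrieve(URL, "metawiki-latest-pages-articles.xml.bz2")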