We have built a spreadsheet-parsing app that lets users easily import large amounts of data into our application.
We have noticed that some clients sometimes need to import 10,000 to 100,000+ rows of spreadsheet data into the application.
Is there a standard practice other ColdFusion developers use to process large amounts of spreadsheet data?
Our standard workaround has been to ask users to break their spreadsheets apart into smaller sub-spreadsheets so the import stays manageable, but I'm hoping there is a better solution out there.
Thanks in advance.
Do not use a ColdFusion loop to parse and insert large files; instead, use native SQL commands such as BULK INSERT (in SQL Server) and LOAD DATA (in MySQL).
We have processes in place that take in and parse large files (more than half a million records) and import the data into databases without any issue.
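As a rough illustration of the LOAD DATA route (Python and pymysql here purely for brevity; the table name, file path, columns, and connection details are all made up), the whole file is handed to the database in a single statement rather than looped over row by row:

```python
# Minimal sketch: bulk-load a CSV with MySQL's LOAD DATA instead of looping.
# Requires LOCAL INFILE to be enabled on both client and server.
# SQL Server users would issue a BULK INSERT statement instead.
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="imports", local_infile=True)
try:
    with conn.cursor() as cur:
        cur.execute("""
            LOAD DATA LOCAL INFILE '/tmp/upload.csv'
            INTO TABLE staging_rows
            FIELDS TERMINATED BY ',' ENCLOSED BY '"'
            LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
        """)
    conn.commit()
finally:
    conn.close()
```

The same statement can be issued from ColdFusion (e.g. via cfquery), driver settings permitting; the key point is that the database, not application code, does the per-row work.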
Is it possible to concatenate 1,000 CSV files that each have a header row into one file with no duplicated headers, directly in Google Cloud Storage? I could easily do this by downloading the files to my local hard drive, but I would prefer to do it natively in Cloud Storage.
They all have the same columns, and each has a header row.
I wrote an article about handling CSV files with BigQuery. To avoid ending up with several files, and provided the volume is less than 1 GB, the recommended approach is the following:
Load all your CSVs into a temporary table in BigQuery.
Export that table with the Export API (not the export function); a sketch follows below.
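A minimal sketch of those two steps with the google-cloud-bigquery client (bucket, dataset, and table names are placeholders):

```python
# Sketch: merge many headered CSVs in GCS by loading them into a temporary
# BigQuery table, then exporting the table back to GCS as a single CSV.
from google.cloud import bigquery

client = bigquery.Client()
tmp_table = "my-project.tmp_dataset.merged_csv"

load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # drop each input file's header row
    autodetect=True,
)
client.load_table_from_uri(
    "gs://my-bucket/input/*.csv", tmp_table, job_config=load_config
).result()

extract_config = bigquery.ExtractJobConfig(destination_format="CSV",
                                            print_header=True)
client.extract_table(
    tmp_table, "gs://my-bucket/output/merged.csv", job_config=extract_config
).result()
```

Note that if the exported result exceeds about 1 GB, BigQuery requires a wildcard destination URI and splits the export into multiple files, which is why the size caveat above matters.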
Let me know if you need more guidance.
The problem with most solutions is that you still end up with a large number of split files from which you then have to strip the headers, join the pieces, and so on.
Any method of avoiding multiple files also tends to be quite a lot of extra work.
It gets to be quite a hassle, especially when BigQuery spits out 3,500 split gzipped CSV files.
I needed a simple method for doing this that could be automated from a batch file.
I therefore wrote a CSV Merge utility (sorry, Windows only) to solve exactly this problem.
https://github.com/tcwicks/DataUtilities
Download the latest release, unzip, and use.
I also wrote an article with scenario and usage examples:
https://medium.com/#TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826
Hope it is of use to someone.
P.S. I recommend tab-delimited over CSV as it tends to have fewer data issues.
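If a dependency-free script is an acceptable alternative to a separate tool, the core of the merge (keep the first file's header, skip the header of every following file) is quite small; the output path and glob pattern below are assumptions:

```python
# Sketch: concatenate split CSV exports into one file, keeping only the
# first header row. Paths and glob pattern are hypothetical.
import glob

out_path = "merged.csv"
parts = sorted(glob.glob("export-*.csv"))

with open(out_path, "w", newline="") as out:
    for i, part in enumerate(parts):
        with open(part, newline="") as f:
            header = f.readline()
            if i == 0:
                out.write(header)   # keep the header once
            out.write(f.read())     # append the remaining rows
```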
In my C++ project I want to open an Excel file, query its data, and identify the rows that match specific column values (more than one column). What is the best methodology for connecting to Excel and querying the worksheets?
The Excel file might contain several thousand records, so it is important that the search completes and the results are shown quickly, with optimal performance.
Please suggest more than one option and indicate which is the best of them.
See here for a library (under an open CPOL license) that another Stack Overflow user recommends:
https://stackoverflow.com/a/2879322/444255
I have a Django application that deals with large text files of up to roughly 50,000,000 characters. For a variety of reasons it is desirable to store them as a model field.
We are using SQLite for development and Postgres for production.
Users do not need to enter the data via any UI.
The field does not need to be visible in the admin or elsewhere to the user.
Several questions:
Is it practicable to store this much text in a textarea field?
What, if any, performance issues will this likely create?
Would using a binary field improve performance?
Any guidance would be greatly appreciated.
Another consideration is that when you are querying that model, make sure you use defer() on your querysets, so you aren't transferring 50 MB of data down the pipe every time you want to retrieve an object from the database.
I highly recommend storing those files on disk or S3 or equivalent in a FileField though. You won't really be able to query on the contents of those files efficiently.
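A small sketch of what that looks like in practice (the model and field names are made up):

```python
# Sketch: keep the huge text out of routine queries with defer().
# Model and field names are hypothetical.
from django.db import models

class Document(models.Model):
    title = models.CharField(max_length=200)
    body = models.TextField()   # the ~50,000,000-character payload

# Fetch objects without dragging the large column across the wire;
# Django loads `body` lazily only if it is actually accessed.
docs = Document.objects.defer("body").filter(title__startswith="Report")
```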
This is more related to the database you use. You are using SQLite, so look at the limits of SQLite:
"The maximum number of bytes in a string or BLOB in SQLite is defined by the preprocessor macro SQLITE_MAX_LENGTH. The default value of this macro is 1 billion (1 thousand million or 1,000,000,000)."
http://www.sqlite.org/limits.html
Besides that, it's probably better to use a TextField in Django.
A binary field wouldn't improve performance. Binary fields are meant for binary data, and you are storing text.
After some experimentation we decided to use a Django FileField and not store the file contents in PostgreSQL. Performance was the primary decision driver. With a FileField we are able to query very quickly to get the underlying file, which in turn can be accessed directly at the OS level with much higher performance than is available when the data is stored in a PostgreSQL table.
Thanks for the input. It was a big help.
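For reference, a sketch of that arrangement, continuing the hypothetical Document model from above but switched to a FileField; the database then stores only a path and the bytes are read straight from the filesystem:

```python
# Sketch: the same hypothetical Document model, with the payload moved to a
# FileField so the database stores only a path.
from django.db import models

class Document(models.Model):
    title = models.CharField(max_length=200)
    content = models.FileField(upload_to="documents/")

doc = Document.objects.get(pk=1)   # fast query: only metadata and the path
path = doc.content.path            # direct OS-level access with local storage
with open(path, "rb") as f:
    data = f.read()
```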
I have a server-client application where clients are able to edit data in a file stored on the server side. The problem is that the file is too large to load into memory (8 GB+). There could be around 50 string replacements per second invoked by the connected clients, so copying the whole file and replacing the specified string on every edit is out of the question.
I was thinking about caching all changes on the server side and performing the replacements only once a certain amount of data has accumulated. At that point I would perform the update by copying the file in small chunks, replacing the specified parts as I go.
This is the only idea I have come up with, but I was wondering whether there might be another way, or what problems I could run into with this method.
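For concreteness, a minimal sketch of the batched copy-and-replace idea described in the question, assuming the replaced strings never span a line boundary (paths, chunk size, and the replacement map are all made up):

```python
# Sketch of the batched copy-and-replace idea: stream the file in chunks,
# apply the queued replacements, and write a new copy. Assumes no replaced
# string spans a line boundary; paths and chunk size are hypothetical.
def apply_replacements(src_path, dst_path, replacements, chunk_size=4 * 1024 * 1024):
    carry = b""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            buffer = carry + chunk
            if chunk:
                # Hold back the trailing partial line so a match is never split.
                cut = buffer.rfind(b"\n") + 1
                buffer, carry = buffer[:cut], buffer[cut:]
            else:
                carry = b""
            for old, new in replacements.items():
                buffer = buffer.replace(old, new)
            dst.write(buffer)
            if not chunk and not carry:
                break
```

The server would accumulate the pending edits in `replacements` and run this pass only once enough have queued up, as described above.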
When you have more than 8 GB of data that is edited by many users simultaneously, you are far beyond what can be handled with a flat file.
You seriously need to move this data to a database. Regarding your comment that "the file content is no fit for a database": sorry, but I don't believe you. Especially regarding your remark that "many people can edit it" - that's one more reason to use a database. On a filesystem, only one user at a time can have write access to a file, but a database allows concurrent write access for multiple users.
We could help you come up with a database schema if you open a new question telling us exactly how your data is structured and what your use cases are.
You could use some form of indexing of your data (in a separate file) to allow quick access to the relevant parts of this gigantic file. We have been doing this successfully with large files (~200-400 GB), but as Phillipp mentioned, you should move that data to a database, especially for the read/write access. Some frameworks (like OSG) already come with a database back-end for 3D terrain data, so you can peek at how they do it.
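As a rough sketch of that separate-file indexing idea, assuming line-oriented records whose key is, say, the first comma-separated field (everything here is hypothetical):

```python
# Sketch: build a side index of key -> byte offset for a line-oriented file,
# then seek directly to a record instead of scanning the whole file.
import json

def build_index(data_path, index_path):
    """Map each record's key to its byte offset in the data file."""
    index = {}
    offset = 0
    with open(data_path, "rb") as f:
        for line in f:
            key = line.split(b",", 1)[0].decode()  # hypothetical key = first field
            index[key] = offset
            offset += len(line)
    with open(index_path, "w") as out:
        json.dump(index, out)

def read_record(data_path, index, key):
    """Jump straight to one record using the saved offset."""
    with open(data_path, "rb") as f:
        f.seek(index[key])
        return f.readline()
```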
I have seen programs exporting to Excel in two different ways.
Opening Excel and entering data cell by cell (while it is running it looks like a macro at work)
Creating an Excel file on disk and writing the data to the file (like the Export feature in MS Access)
Number 1 is terribly slow, and to me it is just plain awful.
Number 2 is what I need to do. I'm guessing I need some sort of SDK so that I can create Excel files in C++.
Do I need different SDKs for .xls and .xlsx?
Where do I obtain these? (I've tried Googling it, but the SDKs I've found look like they do things other than providing an interface to create Excel files.)
When it comes to the runtime, is MS Office a requirement on the PC that needs to create Excel files or do you get a redistributable DLL that you can deploy with your executable?
You can easily do that by means of the XML Excel format. See the Wikipedia article about it:
http://en.wikipedia.org/wiki/Microsoft_Excel#XML_Spreadsheet
This format was introduced in Excel 2002, and it is an easy way to generate an XLS file.
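For illustration, here is roughly what generating such a file looks like (shown as a short Python script purely for compactness; the data is made up, and the same string building is straightforward to port to C++):

```python
# Sketch: write a minimal "XML Spreadsheet 2003" (.xml) file that Excel can
# open, built with plain string formatting. Data is hypothetical.
from xml.sax.saxutils import escape

rows = [("Name", "Score"), ("Alice", 97), ("Bob", 85)]

xml_rows = []
for row in rows:
    cells = "".join(
        '<Cell><Data ss:Type="{}">{}</Data></Cell>'.format(
            "Number" if isinstance(v, (int, float)) else "String", escape(str(v))
        )
        for v in row
    )
    xml_rows.append("<Row>{}</Row>".format(cells))

workbook = (
    '<?xml version="1.0"?>\n'
    '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"\n'
    '          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">\n'
    ' <Worksheet ss:Name="Sheet1"><Table>{}</Table></Worksheet>\n'
    '</Workbook>\n'
).format("".join(xml_rows))

with open("report.xml", "w", encoding="utf-8") as f:
    f.write(workbook)
```

Because it is plain text, the same file can be produced from C++ with ordinary string formatting and file output, with no SDK or Office installation required.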
You can also try working with XLS/XLSX files over ODBC or ADO drivers, just like databases, albeit with limited capabilities. You can use templates if you need formatting, or create the files from scratch. Of course, you are limited to setting field values that way; for styling and the like you will need an Excel API such as Microsoft's.
I'm doing this via the Wt library's WTemplate.
In short, I created the Excel document I wanted in OpenOffice and saved it as Excel 2003 (.xml) format.
I then loaded that in Google Chrome to make it look pretty and copied it to the clipboard.
Now I'm painstakingly breaking it out into templates so that Wt can render a new file each time.