How to write a Go test that needs a massive number of test cases - unit-testing

Let's say I have a list of about 210,000 English words.
I need to use all 210,000 words as test cases.
I need to make sure every word in that list is covered every time I run my test.
The question is: what is the best practice for storing these words in my test?
Should I save all these words in a slice (will the slice be too large?), or should I save them in an external file (like words.txt) and load the file line by line when needed?

Test data is usually stored in a directory named testdata to keep it separate from the other source code or data files (see the docs from the command go help test). The go tool ignores stuff inside that directory.
210,000 words should take up only single digit megabytes of RAM anyway, which isn't much. Just have a helper function that reads the words from the file before each test (perhaps caching them), or define a TestMain() function which reads them once and stores them in a global variable for access by tests that are subsequently run.
Edit: Regarding best practices, it's sometimes nicer to store test data in testdata even if the data isn't large. For example, I sometimes need to use multiple short JSON snippets in test cases, and perhaps use them more than once. Storing them in appropriately named files under a subdirectory of testdata can be more readable than littering Go code with a bunch of JSON snippets.
The slight loss of performance is generally not an issue in tests. Whichever method makes the code easier to understand could be the 'best practice'.


In a text file with linked structures, how do I quickly follow those links in C++, without running through the file multiple times?

I'm about to start a project that requires me to load specific information from an IFC file into classes or structs. I'm using C++, but it's been some years since I last used it so I'm a bit rusty.
The IFC file has a linked structure, where an element in a line might refer to a different line, which in turn links to another. I've included a short example where the initial "#xxx" is the line index and any other "#xxx" in the line is a link to a different line.
#172=IFCBUILDINGSTOREY("GlobalId", #41, "Name", "Description", "ObjectType", #171"...);
In this example I would need to search for "IFCBUILDINGSTOREY", and then follow the links backwards through the file, jumping around and storing the important bits of information I need.
The main problem is that my test file has 273480 lines (18MB), and links can jump from one end of the file to the other - and I'll likely have to handle larger files than this.
In this file I need to populate about 500 objects, so that's a lot of jumping around the file to grab the relevant information.
What's a performance-friendly method of jumping around a file like that?
(Disclosure - I help out with a .NET IFC implementation)
I'd question what it is you're doing that means you can't use one of the many existing implementations of the IFC schema. Parsing the IFC models is generally the simple part of the problem. If you want to visualise the geometry or take measurements from the geometry primitives there's a whole other level of complexity... E.g. just one particular geometry type out of dozens:
If you go to BuildingSmart's software implementations list and search for 'development' you'll get a good list of them for various technologies/languages.
If you're sure you want to implement yourself, the typical approaches are to build some kind of dictionary/map holding the entities based on their key. Naively you can run an initial pass through with a Lexer, and build the map in memory. But as IFC models can be over a GB, you may need a more sophisticated approach where you build some kind of persisted index - and maybe even put it into some kind of database with indexes (maybe some flavour of a document database). This is going to be more important if you want to support 'random access' to the data over multiple sessions.
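The naive single-pass map described above could look roughly like this. It is only a sketch under stated assumptions: every entity sits on one line of the form `#<id>=<body>`, and the names `EntityIndex`, `buildIndex` and `references` are made up for illustration. A real STEP/IFC parser also has to cope with records that span lines and with `#` characters inside string literals.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Maps an entity id (the number after the leading '#') to the text after '='.
// Assumption: one entity per line, written as "#<id>=<body>".
using EntityIndex = std::unordered_map<std::uint64_t, std::string>;

// Reads the stream once and indexes every line that looks like an entity.
EntityIndex buildIndex(std::istream& in) {
    EntityIndex index;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] != '#') continue;   // header or blank line
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos || eq < 2) continue;
        std::uint64_t id = 0;
        bool digitsOnly = true;
        for (std::size_t i = 1; i < eq; ++i) {
            const unsigned char ch = static_cast<unsigned char>(line[i]);
            if (!std::isdigit(ch)) { digitsOnly = false; break; }
            id = id * 10 + static_cast<std::uint64_t>(ch - '0');
        }
        if (digitsOnly) index[id] = line.substr(eq + 1);
    }
    return index;
}

// Collects every "#<number>" found in an entity body, in order of appearance.
// Naive: it does not skip '#' characters that occur inside quoted strings.
std::vector<std::uint64_t> references(const std::string& body) {
    std::vector<std::uint64_t> ids;
    for (std::size_t i = 0; i < body.size(); ++i) {
        if (body[i] != '#') continue;
        std::size_t j = i + 1;
        std::uint64_t id = 0;
        while (j < body.size() &&
               std::isdigit(static_cast<unsigned char>(body[j]))) {
            id = id * 10 + static_cast<std::uint64_t>(body[j] - '0');
            ++j;
        }
        if (j > i + 1) ids.push_back(id);   // at least one digit followed '#'
        i = j - 1;                          // resume scanning after the number
    }
    return ids;
}
```

Following a link is then a hash lookup (`index.at(41)`) instead of another scan of the file. For files too large to hold in memory, the map could store file offsets rather than the line text, which is a step towards the persisted index mentioned above.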

Is it considered good practice to store unit test inputs/expected outputs in a flat file?

I'm finding myself writing a lot of boilerplate for my unit tests. I could cut down on that boilerplate significantly if I stored my unit test inputs along with the expected outputs in a CSV file and directed my test suite to read the inputs from that file, pass them to the function being tested, and then compare its output with the values in the file's expected output column.
Is this considered good practice?
Instead of storing this in a separate file, I would recommend storing it in some kind of table (probably an array) inside your test code and iterating over that table. Most testing frameworks have specific support for this: in JUnit the feature is called parameterized tests. Then you don't even have to implement the iteration over that set of inputs and expected outputs yourself.
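As a rough illustration of the in-code table, here is a minimal sketch in C++ that uses a hand-written loop rather than any particular test framework. The function under test, `square`, and the names `Case`, `kCases` and `runSquareCases` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical function under test.
int square(int x) { return x * x; }

// One row of the table: an input and the output expected for it.
struct Case {
    int input;
    int expected;
};

// The table lives next to the test code instead of in a separate CSV file.
const std::vector<Case> kCases = {
    {0, 0},
    {3, 9},
    {-4, 16},
};

// Runs every row and returns how many rows failed; 0 means all passed.
std::size_t runSquareCases() {
    std::size_t failures = 0;
    for (const Case& c : kCases) {
        if (square(c.input) != c.expected) ++failures;
    }
    return failures;
}
```

Adding a test case is then a one-line change to the table, and a framework with parameterized tests can replace the loop in `runSquareCases`.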

Suggestion about file reader tool implementation

Say I need to implement a tool that always reads and updates a file.
The tool is text based and takes the commands to perform from the command line.
If the tool always reads and writes data to and from the DB (the DB is just files in this case), should I keep any data structures in main memory to make it easier?
I thought about just making an interpreter that reads the command, parses it and performs it. Also, when there is a request for data, the tool just goes over the file and grabs the required data (without saving any of it in a data structure).
Keep in mind that the tool always updates the DB whenever required, so I'll also have to update the data structures every time it updates his DB.
Bottom line: is it a good idea to go over the file and grab the information every time it is required, or should I keep data structures within the program to make it faster and easier to work with the data?
The interpreter class (struct in this case) is something like:
struct Interpreter {
    virtual ~Interpreter() = default;  // lets concrete interpreters be deleted through the base
    virtual void interpret(const std::string& cmd) = 0;
};
The concrete interpreter inherits from it.
Would love to hear some suggestions.
Use boost:
- program options to read the command line
- filesystem to read the file path
As for whether you read the file or cache it? You say: "have to update the data structures every time it updates his DB". I read from this that you must read the files every time, to be consistent with his DB.
Even if you find that you need to cache the files to improve performance, remember the old adage: "it's easier to make working code fast, than to make fast code to work!"

c++ Reading big text file to string (bigger than string::max_size)

I have a huge text file (~5GB) which is the database for my program. At run time this database is read completely many times with string functions like string::find(), string::at(), string::substr()...
The problem is that this text file cannot be loaded in one string, because string::max_size is definitely too small.
How would you implement this? I had the idea of loading a part to string->reading->closing->loading another part to same string->reading->closing->...
Is there a better/more efficient way?
How would you implement this?
With a real database, for instance SQLite. The performance improvement from having indexes will more than make up for the time you spend learning another API.
Since this is a database, I'm assuming it has many records. To me that implies the best idea would be to implement a data class for each record and populate a list/vector/etc., depending on how you plan to use it. I'd also look into a persistent cache, as the file is big.
Within your container class of all records, you could implement search and similar functions as you see fit. But as suggested, for a db of this size you're probably best off using a database.
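If a database really is not an option, the chunked reading idea from the question could be sketched like this. It is a minimal sketch, not a tuned implementation: the function name `findAll` and its parameters are made up for illustration. The one subtle point is that the last `needle.size() - 1` bytes of each chunk are carried over, so a match that straddles a chunk boundary is still found.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns the absolute offset of every occurrence of `needle` in `in`,
// reading at most `chunkSize` new bytes at a time so the whole file never
// has to fit into one std::string.
std::vector<std::uint64_t> findAll(std::istream& in,
                                   const std::string& needle,
                                   std::size_t chunkSize) {
    std::vector<std::uint64_t> hits;
    if (needle.empty() || chunkSize == 0) return hits;

    std::string window;             // carried-over tail plus the newest chunk
    std::uint64_t windowStart = 0;  // absolute offset of window[0]
    std::vector<char> buf(chunkSize);

    // read() fails on the final short chunk, so also check gcount().
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) ||
           in.gcount() > 0) {
        window.append(buf.data(), static_cast<std::size_t>(in.gcount()));

        for (std::size_t pos = window.find(needle);
             pos != std::string::npos;
             pos = window.find(needle, pos + 1)) {
            hits.push_back(windowStart + pos);
        }

        // Keep only the bytes that could still begin a match that
        // continues into the next chunk.
        const std::size_t keep = needle.size() - 1;
        if (window.size() > keep) {
            windowStart += window.size() - keep;
            window.erase(0, window.size() - keep);
        }
    }
    return hits;
}
```

For example, opening the file with `std::ifstream in(path, std::ios::binary)` and calling `findAll(in, "key", 1 << 20)` scans it one megabyte at a time. Every query still reads the whole file, which is exactly the cost an index in a real database avoids.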

Writing to the middle of the file (without overwriting data)

On Windows, is it possible through an API to write to the middle of a file without overwriting any data and without having to rewrite everything after that?
If it's possible then I believe it will obviously fragment the file; how many times can I do it before it becomes a serious problem?
If it's not possible, what approach/workaround is usually taken? Rewriting everything after the insertion point becomes prohibitive really quickly with big (i.e., gigabytes) files.
Note: I can't avoid having to write to the middle. Think of the application as a text editor for huge files where the user types stuff and then saves. I also can't split the files in several smaller ones.
I'm unaware of any way to do this if the interim result you need is a flat file that can be used by applications other than the editor. If you want a flat file to be produced, you will have to update it from the change point to the end of file, since it's really just a sequential file.
But the italics are there for good reason. If you can control the file format, you have some options. Some versions of MS Word had a quick-save feature where they didn't rewrite the entire document, rather they appended a delta record to the end of the file. Then, when re-reading the file, it applied all the deltas in order so that what you ended up with was the right file. This obviously won't work if the saved file has to be usable immediately to another application that doesn't understand the file format.
What I'm proposing there is to not store the file as text. Use an intermediate form that you can efficiently edit and save, then have a step which converts that to a usable text file infrequently (e.g., on editor exit). That way, the user can save as much as they want but the time-expensive operation won't have as much of an impact.
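The delta-record idea can be sketched as follows. This is only an illustration of replaying edits in order; the `Delta` layout and the function name `replay` are hypothetical and are not the format any version of MS Word actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One edit recorded at save time (hypothetical layout).
struct Delta {
    std::size_t offset;      // position in the document as it was before this delta
    std::size_t eraseCount;  // number of bytes removed at that position
    std::string inserted;    // bytes inserted at that position
};

// Rebuilds the current document by applying the deltas in the order in
// which they were saved.
std::string replay(std::string base, const std::vector<Delta>& deltas) {
    for (const Delta& d : deltas) {
        assert(d.offset <= base.size());  // a delta must point inside the document
        base.replace(d.offset, d.eraseCount, d.inserted);
    }
    return base;
}
```

A quick save then only appends one small `Delta` to the end of the file, and the full rewrite to plain text (the result of `replay`) can be deferred to a less frequent step such as editor exit.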
Beyond that, there are some other possibilities.
Memory-mapping (rather than loading) the file may provide efficiencies which would speed things up. You'd probably still have to rewrite to the end of the file but it would be happening at a lower level in the OS.
If the primary reason you want fast save is to start letting the user keep working (rather than having the file available to another application), you could farm the save operation out to a separate thread and return control to the user immediately. Then you would need synchronisation between the two threads to prevent the user modifying data yet to be saved to disk.
The realistic answer is no. Your only real choices are to rewrite from the point of the modification, or build a more complex format that uses something like an index to tell how to arrange records into their intended order.
From a purely theoretical viewpoint, you could sort of do it under just the right circumstances. Using FAT (for example, but most other file systems have at least some degree of similarity) you could go in and directly manipulate the FAT. The FAT is basically a linked list of clusters that make up a file. You could modify that linked list to add a new cluster in the middle of a file, and then write your new data to that cluster you added.
Please note that I said purely theoretical. Doing this kind of manipulation under a completely unprotected system like MS-DOS would have been difficult but bordering on reasonable. With most newer systems, doing the modification at all would generally be pretty difficult. Most modern file systems are also (considerably) more complex than FAT, which would add further difficulty to the implementation. In theory it's still possible -- in fact, it's now thoroughly insane to even contemplate, where it was once almost reasonable.
I'm not sure about the format of your file but you could make it 'record' based.
Write your data in chunks and give each chunk an id.
Id could be data offset in file.
At the start of the file you could have a header with a list of ids so that you can read records in order.
At the end of the 'list of ids' you could point to another location in the file (an id/offset) that stores another list of ids.
Something similar to a filesystem.
To add new data you append them at the end and update index (add id to the list).
You have to figure out how to handle delete record and update.
If records are of the same size then to delete you can just mark it empty and next time reuse it with appropriate updates to index table.
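The record scheme above could be modelled roughly like this. It is an in-memory sketch only: a `std::string` stands in for the on-disk data area and a `std::vector` for the header's list of ids, and the class name `RecordFile` and its methods are made up for illustration. The point is that inserting in the middle appends bytes at the end and touches only the index.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class RecordFile {
public:
    // Appends `payload` to the end of the data area and links it into the
    // logical order at `position`. Returns the record's id (its data offset).
    std::size_t insert(std::size_t position, const std::string& payload) {
        assert(position <= order_.size());
        const Entry e{data_.size(), payload.size()};
        data_ += payload;  // existing bytes are never moved
        order_.insert(order_.begin() + static_cast<std::ptrdiff_t>(position), e);
        return e.offset;
    }

    // Unlinks the record at `position`; its bytes stay in the data area
    // until some later compaction or reuse step.
    void erase(std::size_t position) {
        assert(position < order_.size());
        order_.erase(order_.begin() + static_cast<std::ptrdiff_t>(position));
    }

    // Reassembles the logical content by walking the index in order.
    std::string contents() const {
        std::string out;
        for (const Entry& e : order_) out += data_.substr(e.offset, e.length);
        return out;
    }

    // The raw, append-only data area, i.e. what would physically be on disk.
    const std::string& rawData() const { return data_; }

private:
    struct Entry {
        std::size_t offset;  // doubles as the record id
        std::size_t length;
    };
    std::string data_;          // stands in for the data area of the file
    std::vector<Entry> order_;  // stands in for the header's list of ids
};
```

On disk, the index would be the header (or chain of headers) described above, and only that part is rewritten on insert. Other applications would still need an export step to see a flat file.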
Probably the most efficient way to do this (if you really want to do it) is to call ReadFileScatter() to read the chunks before and after the insertion point, insert the new data in the middle of the FILE_SEGMENT_ELEMENT[3] list, and call WriteFileGather(). Yes, this involves moving bytes on disk. But you leave the hard parts to the OS.
If using .NET 4, try a memory-mapped file if you have an editor-like application; it might just be the ticket. Something like this (I didn't type it into VS so not sure if I got the syntax right):
// Map the file with a 1 MB capacity; a null map name means the mapping is not shared across processes.
MemoryMappedFile bigFile = MemoryMappedFile.CreateFromFile(
    @"C:\bigfile.dat", FileMode.Create, null, 1024 * 1024);
MemoryMappedViewAccessor view = bigFile.CreateViewAccessor();
long offset = 1000; // must lie inside the mapped capacity
view.Write<ObjectType>(offset, ref MyObject); // ObjectType must be a struct
I noted both paxdiablo's answer on dealing with other applications, and Matteo Italia's comment on Installable File Systems. That made me realize there's another non-trivial solution.
Using reparse points, you can create a "virtual" file from a base file plus deltas. Any application unaware of this method will see a continuous range of bytes, as the deltas are applied on the fly by a file system filter. For small deltas (total <16 KB), the delta information can be stored in the reparse point itself; larger deltas can be placed in an alternate data stream. Non-trivial of course.
I know that this question is marked "Windows", but I'll still add my $0.05 and say that on Linux it is possible to either insert or remove a lump of data in the middle of a file without leaving a hole or copying the second half forward/backward:
fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)
fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len)
Again, I know that this probably won't help the OP, but I personally landed here searching for a Linux-specific answer. (There is no "Windows" word in the question, so the web search engine saw no problem with sending me here.)