Keeping file name information with Cascalog Tuples - clojure

I'm looking for a way to keep the filename associated with the tuples/data that originate from that particular file. I've searched around and found that hfs-wholefile works really well for getting filenames, but it returns the file contents as one large chunk of binary data. Is it possible to take this binary data and turn it back into tuples that I can then process as if I had gotten them from hfs-textline?
(defn file-name-with-data
  "Process a file and associate a filename with it"
  [file]
  (<- [?file-name ?data1 ?data2 ?data3 ?data4]
      ((hfs-wholefile file) ?file-name ?binary-data)
      (function-that-im-looking-for ?binary-data :> ?data1 ?data2 ?data3 ?data4)))
The example above is ideally what I would like to use to process this information. In Cascalog/Cascading is there a way to turn the bytes into regular variables I can use in queries?

Related

How to read Text file and returns additional input field using TextIO?

I have a PCollection of KV where key is filename and value is some additional info of the files (e.g., the "Source" systems that generated the files). E.g.,
KV("gs://bucket1/dir1/X1.dat", "SourceX"),
KV("gs://bucket1/dir2/Y1.dat", "SourceY")
I need to read all lines from the files and pair each line with its "Source" field, returning the result as a KV PCollection:
KV(line1 from X1.dat, "SourceX")
KV(line2 from X1.dat, "SourceX")
...
KV(line1 from Y1.dat, "SourceY")
I was able to achieve this by calling FileIO.match() followed by a DoFn in which I read the file sequentially and append the source (retrieved from a map passed in as a side input).
To get the benefit of parallel reading, could I use TextIO.readAll() to achieve this? TextIO.read() returns a PCollection, without filename info. How can I join it back to the filename-to-source mapping? I tried the WithKeys transform, but it did not work ...
Currently using FileIO.match() as you are doing is the best way to accomplish this, but once https://github.com/apache/beam/pull/12645 is merged you'll be able to use the new ContextualTextIO transforms.
Note that computing line numbers in a distributed manner is inherently expensive; you might want to see if you can use offsets (much easier to compute, and ordered the same as line numbers) instead.
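To illustrate why offsets are cheaper than line numbers, here is a minimal sketch in plain Python (not Beam code; the function names are made up for illustration). A reader only needs to know the byte position where it starts, whereas a line number requires counting every line before it.

```python
import os
import tempfile


def lines_with_offsets(path):
    """Yield (byte_offset, line) pairs for a text file read in binary mode."""
    offset = 0
    with open(path, "rb") as handle:
        for raw in handle:
            yield offset, raw.rstrip(b"\r\n").decode("utf-8")
            offset += len(raw)  # the next line starts right after this one


def demo_offsets():
    """Write a small sample file and return its (offset, line) pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "X1.dat")
        with open(path, "wb") as handle:
            handle.write(b"alpha\nbeta\ngamma\n")
        return list(lines_with_offsets(path))
```

Sorting records by offset gives the same order as sorting them by line number, which is usually all that is needed.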
If I understand correctly, you want to read the file in parallel? Unfortunately, TextIO.readAll does not have this feature. You will have to use FileIO.match, and then write your DoFn to read the file in the custom way that you want.
This is because you will not be able to do a random seek into a file and preserve the count of line numbers.
Is reading files serially a bottleneck for your pipeline?
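The approach described above (match the files, then read each one in a custom function and attach the source) can be sketched in plain Python. This is not Beam API code: read_with_source stands in for the body of the DoFn, and source_by_file stands in for the side-input map; both names are assumptions made for this example.

```python
import os
import tempfile


def read_with_source(path, source_by_file):
    """Yield (line, source) for every line of one file.

    source_by_file plays the role of the side-input map from the question.
    """
    source = source_by_file[path]
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n"), source


def read_all(source_by_file):
    """Process each file independently of the others."""
    for path in source_by_file:
        yield from read_with_source(path, source_by_file)


def demo_read_all():
    """Create two small files and read them back tagged with their source."""
    with tempfile.TemporaryDirectory() as tmp:
        x1 = os.path.join(tmp, "X1.dat")
        y1 = os.path.join(tmp, "Y1.dat")
        with open(x1, "w", encoding="utf-8") as handle:
            handle.write("x-line1\nx-line2\n")
        with open(y1, "w", encoding="utf-8") as handle:
            handle.write("y-line1\n")
        return list(read_all({x1: "SourceX", y1: "SourceY"}))
```

Each file is still read serially, but because files do not depend on each other, a runner can process different files on different workers.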

How to store graph in files

I want to store the following information in a file.
My program consists of a set of strings that are connected, forming a graph.
I call each single string a "tag".
Let's say we have 3 main tags: $Mohammed, $car, $color.
Each main tag contains sub tags, and each sub tag has a value, another sub tag, or a set of sub tags.
$Mohammed:
    $Age: "18"
    $color: $red
    $kind_of: $human
    $car:
        $type: $toyota
        $color: $blue
        $doors:
            $number: "3"
$car:
    $made_of: $metal
    $used_for: $transporting
    $types: {$mercedes,$toyota,$nissan}
    $best_color: $red
$color:
    $usedto: $coloring_things
    $example: {$red,$green,$blue,...}
But this is not the only thing: there is a connection between tags of the same name, so $Mohammed->$car->$color must be connected with the main tag $color, and $Mohammed->$color:$red, $car->$best_color:$red, $color->$best_color:$red and the main tag $red must all be connected to each other.
By connected I mean stored in a way that lets me fetch the connected tags at once, just like computer memory: when it fetches something from memory, it also fetches the information before and after the requested item.
When I first looked at my situation, I thought that XML would solve it, but then I realized that XML can't represent a graph.
I don't want to use a database for this; I want to keep a database as my last resort.
Any ideas or suggestions on how I can store, connect, and recall this information from my program?
Thanks in advance.
You actually could use XML, but I would recommend JSON or Yaml.
Your example format is already very close to Yaml.
Take a look at Boost's property_tree.
It contains a nice C++ way to represent your graph, and lets you very easily decide what kind of file representation you want, be that XML, JSON, or INFO.
Also, I don't see why your graph can't be represented by xml, as it supports named nodes.
Although property_tree also supports the ini format, that actually can't represent your >2 level deep tree.
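As a rough illustration of the JSON suggestion (sketched in Python rather than with property_tree), one way to get a graph out of a tree format is to store every main tag once at the top level and write links to other tags as references by name. The {"ref": ...} convention and the function names below are assumptions made for this example, not part of any library, and the convention assumes no sub tag is itself called "ref".

```python
import json

# Every tag is a top-level entry; values are either literal strings or
# references to other tags, written here as {"ref": "<tag name>"}.
GRAPH = {
    "Mohammed": {
        "Age": "18",
        "color": {"ref": "red"},
        "kind_of": {"ref": "human"},
        "car": {"type": {"ref": "toyota"}, "color": {"ref": "blue"}},
    },
    "car": {"made_of": {"ref": "metal"}, "best_color": {"ref": "red"}},
    "color": {"example": [{"ref": "red"}, {"ref": "green"}, {"ref": "blue"}]},
}


def find_references(node, target, path=()):
    """Return the paths of every place that points at the tag `target`."""
    hits = []
    if isinstance(node, dict):
        if node.get("ref") == target:
            hits.append("->".join(path))
        for key, child in node.items():
            if key != "ref":
                hits.extend(find_references(child, target, path + (key,)))
    elif isinstance(node, list):
        for child in node:
            hits.extend(find_references(child, target, path))
    return hits


def round_trip(graph):
    """Serialize to JSON text and back, as a file write and read would."""
    return json.loads(json.dumps(graph, indent=2))
```

Calling find_references(GRAPH, "red") collects every place connected to the tag red in one pass, which is the kind of "fetch the connected tags at once" lookup the question asks for.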

storing urls to a file so they can be reachable quickly

I have a file and plenty of URLs. The URLs are written to the file all with the same structure: a URL checksum of type int followed by the URL. stackoverflow.com is written as:
12534214214 http://stackoverflow.com
Now every time I want to put a URL into the file, I need to check that the URL doesn't already exist; only then can I add it.
But this takes too much time with 1 000 000 URLs:
//list of urls
list<string> urls;
size_t hashUrl(string argUrl); //this function will hash the url and return an int
file.open("anchors");
//search for the int 12534214214 if it isn't found then write 12534214214 http://stackoverflow.com
file.close();
Question 1: how can I search in a file using the checksum so that the search takes a few ms?
Question 2: is there another way of storing these URLs so that they can be reached quickly?
Thanks, and sorry for my bad English.
There is (likely [1]) no way to search a million URLs in a plain text file in "a few milliseconds". You need to either load the entire file into memory (and when you do, you may just as well load it into some reasonable data structure, for example a std::map or std::unordered_map), or use some sort of index for the file, e.g. a smaller file holding just the checksums and the positions in the main file where the entries are stored.
The problem with a plain textfile is that there is no way to know where anything is. One line can be 10 bytes, another 10000 bytes. This means that you literally have to read every byte up to the point you are interested in.
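A minimal sketch of the load-into-memory option, written in Python for brevity (a set here plays the role that std::unordered_map or std::unordered_set would play in C++). The function names are hypothetical, and the file format is the "checksum url" layout from the question.

```python
import os
import tempfile


def load_urls(path):
    """Read 'checksum url' lines into a set of URLs (one linear pass)."""
    known = set()
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                parts = line.split(None, 1)
                if len(parts) == 2:
                    known.add(parts[1].strip())
    return known


def add_url(path, known, url, checksum):
    """Append the URL to the file only if it is not already known."""
    if url in known:  # average O(1) hash lookup, no file scan
        return False
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"{checksum} {url}\n")
    known.add(url)
    return True


def demo_add_urls():
    """Add the same URL twice; only the first attempt should be written."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "anchors")
        known = load_urls(path)
        first = add_url(path, known, "http://stackoverflow.com", 12534214214)
        second = add_url(path, known, "http://stackoverflow.com", 12534214214)
        return first, second, load_urls(path)
```

The file is scanned once at startup; after that every existence check is answered from memory. The set hashes the URL itself, so the checksum stored in the file is only kept for compatibility with the existing format.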
Of course, the other option is to use a database library such as SQLite (or a proper database server, such as MySQL) that allows the data to be stored and retrieved based on a "query". This hides all the index generation and other such problems, and is already optimised both when it comes to search algorithms and in having clever caching and optimised code for reading/writing data to disk, etc.
[1] If all the URLs are short, it's perhaps possible that the file is small enough to cache well, and code can be written to be fast enough to linearly scan through the entire file in a few milliseconds. But a file of a million URLs with, say, an average of 50 bytes for each URL will be 50MB. If each byte takes 10 clock cycles to process, that is 500 million cycles, or about 130ms on a CPU running at just under 4 GHz, even if the file is directly available in memory.
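For the database option, a small sketch using SQLite through Python's standard sqlite3 module might look like the following. The table name, column name, and function names are assumptions for illustration; making the URL the primary key lets SQLite maintain the index and reject duplicates itself.

```python
import sqlite3


def open_store(db_path=":memory:"):
    """Create (if needed) a table whose primary key is the URL itself."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")
    return conn


def store_url(conn, url):
    """Insert the URL; return True if it was new, False if already stored."""
    cursor = conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    conn.commit()
    return cursor.rowcount == 1
```

Usage: conn = open_store("anchors.db") followed by store_url(conn, "http://stackoverflow.com"). The existence check and the insert happen in a single statement, so no separate search step is needed.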

VAX Fortran Keyed indexed file - sequential access

Okay, I know I'm going back a few years, but maybe I'll run across some graybeards (like mine) :).
I have an indexed data file with a key field. It's opened like so in the application:
      OPEN (FILE='DATA.MAS',STATUS='OLD',
     1      ORGANIZATION='INDEXED',ACCESS='KEYED',
     1      RECL=28,UNIT=LUNTM,SHARED,
     1      KEY=(1:49:CHARACTER),
     1      IOSTAT=IOS,ERR=9999)
I need to be able to scan the content of this file sequentially. However, every combination of organization and access options in the open, followed by reads always results in an error, either on the open or the read. Is it even possible to get the nth record of a keyed file?
Okay, found the solution after reading the doc for the umpteenth time. I changed the OPEN statement to SEQUENTIAL access and INDEXED organization. What I missed was that when you do this, FORTRAN interprets the file as FORMATTED. Adding FORM='UNFORMATTED' and adjusting the record size yields happiness and yuletide greetings.

C++ inserting a line into a file at a specific line number

I want to be able to read from an unsorted source text file (one record in each line), and insert the line/record into a destination text file by specifying the line number where it should be inserted.
Where to insert the line/record into the destination file will be determined by comparing the incoming line from the incoming file to the already ordered list in the destination file. (The destination file will start as an empty file and the data will be sorted and inserted into it one line at a time as the program iterates over the incoming file lines.)
Incoming File Example:
1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data
Desired Destination File Example:
2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data
I could do this by performing the sort in memory via a linked list or similar, but I want to allow this to scale to very large files. (And I am having fun trying to solve this problem as I am a C++ newbie :).)
One of the ways to do this may be to open 2 file streams with fstream (1 in and 1 out, or just 1 in/out stream), but then I run into the difficulty that it's difficult to find and search the file position because it seems to depend on absolute position from the start of the file rather than line numbers :).
I'm sure problems like this have been tackled before, and I would appreciate advice on how to proceed in a manner that is good practice.
I'm using Visual Studio 2008 Pro C++, and I'm just learning C++.
The basic problem is that under common OSs, files are just streams of bytes. There is no concept of lines at the filesystem level. Those semantics have to be added as an additional layer on top of the OS-provided facilities. Although I have never used it, I believe that VMS has a record-oriented filesystem that would make what you want to do easier. But under Linux or Windows, you can't insert into the middle of a file without rewriting the rest of the file. It is similar to memory: at the lowest level, it's just a sequence of bytes, and if you want something more complex, like a linked list, it has to be added on top.
If the file is just a plain text file, then I'm afraid the only way to find a particular numbered line is to walk the file counting lines as you go.
The usual 'non-memory' way of doing what you're trying to do is to copy the file from the original to a temporary file, inserting the data at the right point, and then do a rename/replace of the original file.
Obviously, once you've done your insertion, you can copy the rest of the file in one big lump, because you don't care about counting lines any more.
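The copy-to-temporary-file approach can be sketched as follows, in Python for brevity; the same structure carries over to C++ streams. The function name insert_line is made up for this example, and the sketch assumes every existing line ends with a newline.

```python
import os
import shutil
import tempfile


def insert_line(path, line_number, text):
    """Insert `text` so that it becomes line `line_number` (1-based)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out, \
                open(path, "r", encoding="utf-8") as src:
            # Copy the lines before the insertion point, counting as we go.
            for _ in range(line_number - 1):
                line = src.readline()
                if not line:  # file is shorter: append at the end
                    break
                out.write(line)
            out.write(text + "\n")
            # No more counting needed: copy the rest in one big lump.
            shutil.copyfileobj(src, out)
        os.replace(tmp_path, path)  # swap the new file in for the original
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def demo_insert():
    """Insert the 10/15 record between the two existing records."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dest.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("2 11/01/2008 line2data\n1 10/01/2008 line1data\n")
        insert_line(path, 2, "3 10/15/2008 line3data")
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read().splitlines()
```

Creating the temporary file in the same directory as the original keeps the final rename on one filesystem. Note that each insertion rewrites the whole file, so inserting n lines one at a time costs on the order of n squared line copies.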
A [distinctly-no-c++] solution would be to use the *nix sort tool, sorting on the second column of data. It might look something like this:
cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>
It's not exactly in-place, and it fails the request of using C++, but it does work :)
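Redirecting the output straight back onto the input file will not work:
cat <file> | sort -k 2,2 > <file>
The shell truncates <file> before it is read, so the data is lost. The -o option of sort can safely write to the same file it reads from:
sort -k 2,2 -o <file> <file>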
* http://www.ss64.com/bash/sort.html - sort man page
One way to do this is not to keep the file sorted, but to use a separate index, using Berkeley DB. Each record in the db has the sort keys and the offset into the main file. The advantage of this is that you can have multiple ways of sorting without duplicating the text file. You can also change lines without rewriting the file, by appending the changed line at the end and updating the index to ignore the old line and point to the new one. We used this successfully for multi-GB text files that we had to make many small changes to.
Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files under source/IO.
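To make the separate-index idea concrete, here is a small Python sketch in which an in-memory sorted list stands in for the Berkeley DB index (this is not the author's package, and all names are hypothetical). Each index entry holds a sort key and the byte offset of its line in the unsorted main file.

```python
import os
import tempfile


def build_index(path, key_of):
    """Return a sorted list of (key, byte_offset) pairs, one per line."""
    index = []
    offset = 0
    with open(path, "rb") as handle:
        for raw in handle:
            index.append((key_of(raw.decode("utf-8")), offset))
            offset += len(raw)
    index.sort()
    return index


def read_in_order(path, index):
    """Yield the lines of `path` in index order by seeking to each offset."""
    with open(path, "rb") as handle:
        for _key, offset in index:
            handle.seek(offset)
            yield handle.readline().decode("utf-8").rstrip("\r\n")


def date_key(line):
    """Turn the mm/dd/yyyy second column into a sortable (yyyy, mm, dd)."""
    month, day, year = line.split()[1].split("/")
    return (int(year), int(month), int(day))


def demo_index():
    """Read the question's sample file newest-first without rewriting it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "incoming.txt")
        with open(path, "w", encoding="utf-8", newline="\n") as handle:
            handle.write("1 10/01/2008 line1data\n"
                         "2 11/01/2008 line2data\n"
                         "3 10/15/2008 line3data\n")
        index = build_index(path, date_key)
        return list(read_in_order(path, reversed(index)))
```

The main file is never reordered; only the index is. A second index with a different key function would give another sort order over the same data.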
Try a modified Bucket Sort. Assuming the id values lend themselves well to it, you'll get a much more efficient sorting algorithm. You may be able to enhance I/O efficiency by actually writing out the buckets (use small ones) as you scan, thus potentially reducing the amount of randomized file I/O you need. Or not.
Hopefully, there are some good code examples on how to insert a record based on line number into the destination file.
You can't insert content into the middle of a file (i.e., without overwriting what was previously there); I'm not aware of production-level filesystems that support it.
I think the question is more about implementation rather than specific algorithms, specifically, handling very large datasets.
Suppose the source file has 2^32 lines of data. What would be an efficient way to sort the data?
Here's how I'd do it:
1. Parse the source file and extract the following information: sort key, offset of line in file, length of line. This information is written to another file. This produces a dataset of fixed-size elements that is easy to index; call it the index file.
2. Use a modified merge sort. Recursively divide the index file until the number of elements to sort has reached some minimum amount. True merge sort recurses to 1 or 0 elements; I suggest stopping at 1024 or something, which will need fine tuning. Load the block of data from the index file into memory, perform a quicksort on it, and then write the data back to disk.
3. Perform the merge on the index file. This is tricky, but can be done like this: load a block of data from each source (1024 entries, say). Merge into a temporary output file and write. When a block is emptied, refill it. When no more source data is found, read the temporary file from the start and overwrite the two parts being merged; they should be adjacent. Obviously, the final merge doesn't need to copy the data (or even create a temporary file). Thinking about this step, it is probably possible to set up a naming convention for the merged index files so that the data doesn't need to overwrite the unmerged data (if you see what I mean).
4. Read the sorted index file, pull the corresponding line of data out of the source file, and write it to the result file.
It certainly won't be quick with all that file reading and writing, but it should be quite efficient - the real killer is the random seeking of the source file in the final step. Up to that point, the disk access is usually linear and should therefore be reasonably efficient.
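The steps above can be sketched in Python as follows. This is a simplified variant, not a faithful implementation: sorted chunks of the index are written to separate run files and combined with a single k-way merge (heapq.merge) instead of the recursive pairwise merge described, and index records are pickled rather than stored as fixed-size elements. All function names are made up for the example.

```python
import heapq
import os
import pickle
import tempfile


def index_records(path, key_of):
    """Step 1: yield (key, offset, length) for each line of the source."""
    offset = 0
    with open(path, "rb") as handle:
        for raw in handle:
            yield key_of(raw.decode("utf-8")), offset, len(raw)
            offset += len(raw)


def write_sorted_runs(records, chunk_size, tmp_dir):
    """Step 2: sort the index in memory-sized chunks, one file per chunk."""
    run_paths, chunk = [], []

    def flush():
        chunk.sort()
        fd, run_path = tempfile.mkstemp(dir=tmp_dir)
        with os.fdopen(fd, "wb") as out:
            for record in chunk:
                pickle.dump(record, out)
        run_paths.append(run_path)
        chunk.clear()

    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()
    return run_paths


def read_run(run_path):
    """Stream the records of one sorted run back from disk."""
    with open(run_path, "rb") as handle:
        while True:
            try:
                yield pickle.load(handle)
            except EOFError:
                return


def date_key(line):
    """Turn the mm/dd/yyyy second column into a sortable (yyyy, mm, dd)."""
    month, day, year = line.split()[1].split("/")
    return (int(year), int(month), int(day))


def external_sort(src_path, dst_path, key_of, chunk_size=1024):
    """Steps 3 and 4: merge the runs, then copy lines out in sorted order."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        runs = write_sorted_runs(index_records(src_path, key_of),
                                 chunk_size, tmp_dir)
        merged = heapq.merge(*(read_run(run) for run in runs))
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            for _key, offset, length in merged:
                src.seek(offset)  # random seek: the slow step
                dst.write(src.read(length))


def demo_external_sort():
    """Sort the question's sample file by date using two-record chunks."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "incoming.txt")
        dst = os.path.join(tmp, "sorted.txt")
        with open(src, "wb") as handle:
            handle.write(b"1 10/01/2008 line1data\n"
                         b"2 11/01/2008 line2data\n"
                         b"3 10/15/2008 line3data\n")
        external_sort(src, dst, date_key, chunk_size=2)
        with open(dst, "rb") as handle:
            return handle.read().decode("utf-8").splitlines()
```

Only chunk_size index records are held in memory at once during the sort, and the merge reads each run sequentially; the seek in the final loop is the random access the answer warns about.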