How to pass multiple files as input to MapReduce?

I want to use two files as input to a MapReduce program, but using * as a filename pattern doesn't work.

I would expect that passing the directory input/ should do the trick. To get started, try running the Wordcount example: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
At the end of this tutorial they explain how to run the job (they run it on multiple dictionary files which reside in an input directory).
EDIT: Also check this tutorial on using the distributed file system; you usually need your input files in the DFS.

It works, and it should work on your machine as well. Are you sure about the path you are giving? Is it input/190*.txt or /input/190*.txt? Mind the leading "/". A path without a leading / is resolved relative to your HDFS home directory, /user/<username>, whereas a path with a leading / is resolved from the root directory.
The pattern also works with mv (or any other HDFS shell command, for that matter).
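
To make the difference between the two paths concrete, here is a minimal sketch in Python. It is not Hadoop code: the helper names, the user name alice, and the file list are all made up for illustration, and it only handles a wildcard in the last path component.

```python
import fnmatch
import posixpath


def resolve_hdfs_path(path, user):
    """Mimic HDFS path resolution: relative paths start at /user/<user>."""
    if path.startswith("/"):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join("/user", user, path))


def expand_glob(pattern, user, existing_files):
    """Return the entries of existing_files matched by the resolved pattern.

    Simplification: the wildcard may only appear in the file name,
    not in a directory component.
    """
    directory, name_pattern = posixpath.split(resolve_hdfs_path(pattern, user))
    return sorted(
        f
        for f in existing_files
        if posixpath.dirname(f) == directory
        and fnmatch.fnmatchcase(posixpath.basename(f), name_pattern)
    )


if __name__ == "__main__":
    # Hypothetical cluster contents.
    files = [
        "/user/alice/input/1901.txt",
        "/user/alice/input/1902.txt",
        "/user/alice/input/2000.txt",
        "/input/1901.txt",
    ]
    print(expand_glob("input/190*.txt", "alice", files))
    print(expand_glob("/input/190*.txt", "alice", files))
```

If the job sees no input, the pattern may simply be resolving to the other of these two directories.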

Related

How to serialize a diff of two folders optimally in C++

I'm trying to develop a file diff format for multiple files recursively in folders. Consider a source directory containing patched files and a destination directory containing original files. Write a size minimal diff file which expresses the difference between all files in the source and destination directory which can be applied to the original files in order to transform the original files into the patched files.
For this purpose I found the dtl library. Which algorithm or feature of the library should I use to write a file diff to the disk which I can then later read back and apply in order to patch the file? Any example code for this? I tried writing the result of the shortest edit script (SES) to the disk but I realized that I needed to specify the character and operation for every single byte. This of course makes the output file bigger than the entire comparison file, making this diff format entirely redundant since storing the entire target file instead would've saved more storage.
As another reference, this is very similar to how version control systems like git or svn operate but I don't want to use those since I'm mainly dealing with binary files and the simple requirement of creating and applying patches.

After some more searching, I found the HDiffPatch project.
It apparently works fine, but it seems to take a long time on bigger folder comparisons:
diff usage: hdiffz [options] oldPath newPath outDiffFile
patch usage: hpatchz [options] oldPath diffFile outNewPath
EDIT:
Another good option is open-vcdiff, but it only supports individual files.
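
For a sense of why a patch does not have to record an operation for every single byte, here is a rough sketch in Python using the standard difflib module. It is not how HDiffPatch or open-vcdiff work internally; the function names and the ("copy", ...) / ("data", ...) tuples are invented for this example. Unchanged runs are stored as a short reference into the old file, and only new bytes are stored literally.

```python
from difflib import SequenceMatcher


def make_patch(old: bytes, new: bytes):
    """Encode new as a list of run operations against old."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: store only where it lives in the old data.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # New bytes have to be stored literally.
            ops.append(("data", new[j1:j2]))
        # "delete" needs no entry: those old bytes are simply never copied.
    return ops


def apply_patch(old: bytes, ops) -> bytes:
    """Rebuild the new data from the old data and the operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += old[start:start + length]
        else:
            out += op[1]
    return bytes(out)


if __name__ == "__main__":
    old = b"header-" * 20 + b"hello" + b"-footer" * 20
    new = b"header-" * 20 + b"HELLO!" + b"-footer" * 20
    patch = make_patch(old, new)
    assert apply_patch(old, patch) == new
    print(len(patch), "operations for", len(new), "bytes")
```

SequenceMatcher is slow on large inputs, so this is only meant to show the shape of the format, not something to run on big binary files.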

Use HDiffPatch: you can run hdiffz with "-s-48" to speed it up,
or try "-s-32", "-s-1k", "-s-128k" ...

Get modified files after a given timestamp in the Windows file system in C++ code

Is there any way to get the files/folders modified after a given timestamp in the Windows file system? I don't want to traverse the entire file system and check in my code which files/folders were modified. Does Windows provide any API which returns the files/folders modified after a given timestamp?

No, there is no direct WinAPI to accomplish this.
I'd suggest traversing only certain folders and excluding folders like Windows and ProgramData. Traverse only the folders that make sense, for example Users.
Why? Because the system files in Windows and similar folders are accessed very frequently and are modified by system updates. Unless you're keen to see when the system files were modified, I'd say that data is going to be irrelevant.
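
The traversal suggested above could look roughly like the following. This is a sketch in Python rather than C++, only to show the logic; the function name and the exclusion list are made up, and the comparison uses the last-modified time reported by the file system.

```python
import os


def modified_since(root, since_epoch, excluded_dirs=()):
    """Return files under root whose modification time is after since_epoch.

    Directories whose name appears in excluded_dirs (case-insensitive)
    are skipped entirely, including everything below them.
    """
    excluded = {name.lower() for name in excluded_dirs}
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into excluded folders.
        dirnames[:] = [d for d in dirnames if d.lower() not in excluded]
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(full_path) > since_epoch:
                    hits.append(full_path)
            except OSError:
                # File vanished or is not accessible; skip it.
                continue
    return sorted(hits)


if __name__ == "__main__":
    import time

    one_day_ago = time.time() - 24 * 60 * 60
    for path in modified_since(".", one_day_ago, ("Windows", "ProgramData")):
        print(path)
```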

When a file changes, I'd like to modify one or more different files

I've been scouring the web for hours looking for an approach to solving this problem, and I just can't find one. Hopefully someone can fast-track me. I'd like to cause the following behaviour:
When running ember s and a file of a certain extension is changed, I'd like to analyze the contents of that file and write to several other files in the same directory.
To give a specific example, let's assume I have a file called app/dashboard/dashboard.ember. dashboard.ember consists of 3 concatenated files: app/dashboard/controller.js, .../route.js, and .../template.hbs, with a reasonable delimiter between the files. When dashboard.ember is saved, I'd like to call a function (inside an addon, I assume) that reads the file, splits it at the delimiter, and writes the corresponding split files. ember-cli should then pick up the changed source files (.js, .hbs, etc.) that it knows how to handle, ignoring the .ember file.
I could write this as a standalone application, of course, but I feel it should be integrated with the ember-cli build environment, and I can't figure out what concoction of hooks and tools I should use to achieve this.

Linked directory not found

I have the following scenario:
The main software I wrote uses a database created by a simulator. This database is around 10 GB at the moment, so I want to keep only one copy of that data per system.
Assume I have the following projects:
Main Software using the data, located at /SimData
DLL using the data for debugging, searching for data at /SimData
Debugging tool to parse the image database, searching for the data at /SimData
I do not want all those programs to have their own copy of SimData (not only to save space, but also to ensure that the simulation data is always up to date for all programs).
So for the DLL and the debugging utility I created a link named SimData pointing to MainSoftware/SimData, but when opening a file with "SimData\MyFile.data" it cannot be found; only the MainSoftware with the ACTUAL SimData folder can find it.
How can I use the MainSoftware/SimData folder without setting absolute paths?
This is on Windows 7 x64.

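I agree with Peter about adding the DB location as a configurable parameter. A common place to store that is the registry.
However, if you want to create links that will be recognized by your software, try hard links; fsutil should do the trick, as described here. Note that NTFS hard links apply only to files; for a whole directory such as SimData, the equivalent is a junction or a directory symbolic link (mklink /J or mklink /D).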

You need a way to configure the database location. You could use an INI or other configuration file, a registry setting, a command-line input, or an environment variable. Or you could write your program to search a directory hierarchy: for example, if the various modules are usually siblings of each other in your directory tree, you could search for SimData/MyFile.data, ../SimData/MyFile.data, ../../MainSoftware/SimData/MyFile.data, and use the first one found.
Which answer is the "right one" depends on your situation.
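
A minimal sketch of the "search a few places and use the first one found" idea, combined with an environment-variable override, might look like this. It is written in Python for brevity; the function name and the SIMDATA_DIR variable are hypothetical, not something your tools already define.

```python
import os


def find_data_file(filename, candidate_dirs, env_var="SIMDATA_DIR"):
    """Return the absolute path of the first existing copy of filename.

    The directory named by the environment variable env_var (if set)
    is tried first, then each entry of candidate_dirs in order.
    Returns None when the file is not found anywhere.
    """
    override = os.environ.get(env_var)
    search_dirs = ([override] if override else []) + list(candidate_dirs)
    for directory in search_dirs:
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            return os.path.abspath(path)
    return None


if __name__ == "__main__":
    candidates = ["SimData", "../SimData", "../../MainSoftware/SimData"]
    print(find_data_file("MyFile.data", candidates))
```

The override keeps the relative search as a convenient default while still letting a particular machine point somewhere else.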

Differing paths for lua script and app

My problem is that I'm having trouble specifying paths for Lua to look in.
For example, in my script I have a require("someScript") line that works perfectly (it is able to use functions from someScript) when the script is run standalone.
However, when I run my app, the script fails. I believe this is because Lua is looking in a location relative to the application rather than relative to the script.
Hardcoding the entire path down to the drive isn't an option since people can download the game wherever they like so the highest I can go is the root folder for the game.
We have XML files to load in information on objects. In them, when we specify the script the object uses, we only have to do something like Content/Core/Scripts/someScript.lua where Content is in the same directory as Debug and the app is located inside Debug. If I try putting that (the Content/Core...) in Lua's package.path I get errors when I try to run the script standalone.
I'm really stuck, and am not sure how to solve this. Any help is appreciated. Thanks.
P.S. When I print out the default package.path in the app, I see syntax like ;.\?.lua in a sequence like
;.\?.lua;c:...(long file path)\Debug\?.lua;
I assume the ; marks the end of a path, but I have no idea what .\?.lua means. Any Lua file in the directory?

You can customize the way require loads modules by putting your own loader into the package.loaders table. See here:
http://www.lua.org/manual/5.1/manual.html#pdf-package.loaders
If you want to be sure that things are nicely sandboxed, you'll probably want to remove all the default loaders and replace them with one that does exactly what you want and nothing more. (It will probably be somewhat similar to one of the existing ones, so you can use those as a guide.)
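
As for what .\?.lua means: package.path is a list of templates separated by ;, and for each require call Lua replaces every ? in a template with the module name (dots in the name become directory separators) and tries the resulting file names in order. The . is the current working directory of the process, which is why the result differs between running the script standalone and running it from the app. The expansion can be mimicked with a small Python sketch; the function name and the C:\game path below are made up for illustration.

```python
def candidate_paths(package_path, module_name, dir_sep="\\"):
    """List the file names a package.path-style search would try, in order."""
    # Dots in the module name stand for directory separators.
    file_part = module_name.replace(".", dir_sep)
    return [
        template.replace("?", file_part)
        for template in package_path.split(";")
        if template  # skip empty entries such as a leading or trailing ';'
    ]


if __name__ == "__main__":
    # Hypothetical search path; the real one is printed by the app.
    search_path = ";.\\?.lua;C:\\game\\Debug\\?.lua;"
    for path in candidate_paths(search_path, "someScript"):
        print(path)
```

So a template like Content\Core\Scripts\?.lua only helps if the process is started from the directory that contains Content, or if the template is made absolute at startup from the game's root folder.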
