Split large files - C++

I am developing a distributed system in which a server distributes a huge task to clients, which process it and return the result.
The server has to accept huge files, on the order of 20 GB.
The server has to split each file into smaller pieces and send the paths to the clients, which in turn scp the pieces and process them.
I am using read and write to perform the splitting, and it is ridiculously slow.
Code
//fildes - Source file handle
//offset - The point from which the split is to be made
//buffersize - How much to split
//This function is called in a for loop
void chunkFile(int fildes, char* filePath, int client_id, unsigned long long* offset, int buffersize)
{
    unsigned char* buffer = (unsigned char*) malloc( buffersize );
    char* clientFileName = (char*) malloc( 1024 );
    /* prepare client file name */
    sprintf( clientFileName, "%s%d.txt", filePath, client_id );

    ssize_t readcount = 0;
    if( (readcount = pread64( fildes, buffer, buffersize, *offset )) < 0 )
    {
        /* error reading file */
        printf("error reading file\n");
    }
    else
    {
        *offset = *offset + readcount;
        //printf("Read %zd bytes\nAnd offset becomes %llu\n", readcount, *offset);
        int clnfildes = open( clientFileName, O_CREAT | O_TRUNC | O_WRONLY, 0777 );
        if( clnfildes < 0 )
        {
            /* error opening client file */
        }
        else
        {
            if( write( clnfildes, buffer, readcount ) != readcount )
            {
                /* error writing client file */
            }
            close( clnfildes );
        }
    }
    free( clientFileName );
    free( buffer );
    return;
}
Is there any faster way to split files?
Is there any way a client can access its chunk of the file without using scp (read without transfer)?
I am using C++. I am ready to use other languages if they can perform faster.

You can place the file within reach of a web server and then use curl from the clients:
curl --range 10000-20000 http://the.server.ip/file.dat > result
would fetch bytes 10000 through 20000 (the HTTP range is inclusive, so 10001 bytes).
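If the client itself is written in C++, the same range request can be made with libcurl. A minimal, untested sketch; the fetchRange helper, URL, range string, and output path are all illustrative:
#include <curl/curl.h>
#include <stdio.h>

// Append whatever the server sends to the output file.
static size_t writeToFile(char* ptr, size_t size, size_t nmemb, void* userdata)
{
    return fwrite(ptr, size, nmemb, static_cast<FILE*>(userdata));
}

// Download the byte range "start-end" of url into outPath.
bool fetchRange(const char* url, const char* range, const char* outPath)
{
    FILE* out = fopen(outPath, "wb");
    if (!out) return false;

    CURL* curl = curl_easy_init();
    if (!curl) { fclose(out); return false; }

    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_RANGE, range);           // e.g. "10000-20000"
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeToFile);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);

    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    fclose(out);
    return res == CURLE_OK;
}
Note that the web server has to honour range requests for that file, and curl_global_init() should be called once at program start.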
If the file is highly redundant and the network is slow, compression could help speed up the transfer a lot. For example, executing
nc -l -p 12345 | gunzip > chunk
on the client and then executing
dd skip=10000 count=10000 if=bigfile bs=1 | gzip | nc client.ip.address 12345
on the server, you can transfer a section with gzip compression on the fly, without creating intermediate files.
EDIT
A single command to get a section of a file from a server using compression over the network is
ssh server 'dd skip=10000 count=10000 bs=1 if=bigfile | gzip' | gunzip > chunk

Is rsync over SSH with --partial an option?
Then you might not need to split the files either since you can just continue if the transfer is interrupted.
Are the file split sizes known in advance or are they split along some marker in the file?

You can deposit the file on an NFS share, and the clients can mount that share read-only. Each client can then open the file and use mmap() or pread() to read its slice (its piece of the file). This way, only the needed part of the file is transferred to each client.
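A minimal sketch of the client side under that approach, assuming the share is already mounted read-only and the server has told each client its offset and length; the readSlice helper and its error handling are illustrative:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

// Read this client's slice [offset, offset + length) of the shared file.
// Returns a malloc'ed buffer that the caller frees, or nullptr on error.
char* readSlice(const char* path, off_t offset, size_t length)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return nullptr; }

    char* slice = static_cast<char*>(malloc(length));
    size_t done = 0;
    while (done < length)
    {
        // pread does not move the file offset, so concurrent readers are fine.
        ssize_t n = pread(fd, slice + done, length - done, offset + done);
        if (n <= 0) { perror("pread"); free(slice); close(fd); return nullptr; }
        done += static_cast<size_t>(n);
    }
    close(fd);
    return slice;
}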

Related

Mounting memory buffer as a file without writing to disk

I have a server that needs to feed data from clients to a library; however, the library only supports reading from files (it uses open to access them).
Since the data can get pretty big, I'd rather not write it out to a temporary file, read it in with the library, and then delete it afterwards. Instead I would like something similar to a ramdisk: a file whose contents actually live in memory.
However, since there can be multiple clients sending large data, I don't think constantly calling mount and umount to create a ramdisk for each client is efficient. Is there a way to expose an existing memory buffer as a file without writing to disk?
The library does not accept a file descriptor or FILE*. It will only accept a path, which it feeds directly to open.
I do have the library's source code and attempted to add a function that uses fmemopen; however, fmemopen returns a FILE* with no underlying file descriptor. The internals of the library work only with file descriptors, and it is too complex to change or add support for FILE*.
I looked at mmap, but it appears to be no different from writing the data out to a file.
Using mount requires root access, and I prefer not to run the application with sudo.
bool IS_EXITING = false;

ssize_t getDataSize( int clientFD ) { /* ... */ }

void handleClient( int clientFD ) {
    // Read in messages to get actual data size
    ssize_t dataSize = getDataSize( clientFD );
    auto* buffer = new char[ dataSize ];

    // Read in all the data from the client
    ssize_t bytesRead = 0;
    while( bytesRead < dataSize ) {
        ssize_t numRead = read( clientFD, buffer + bytesRead, dataSize - bytesRead );
        // Error handle if numRead is <= 0
        if ( numRead <= 0 ) { /* ... */ }
        bytesRead += numRead;
    }

    // Mount the buffer and get a file path... How to do this?
    std::string filePath = mountBuffer( buffer );

    // Library call to read the data
    readData( filePath );

    delete[] buffer;
}

void runServer( int socket ) {
    while( !IS_EXITING ) {
        auto clientFD = accept( socket, nullptr, nullptr );
        // Error handle if clientFD <= 0
        if ( clientFD <= 0 ) { /* ... */ }

        std::thread clientThread( handleClient, clientFD );
        clientThread.detach( );
    }
}
Use /dev/fd. Get the file descriptor of the socket, and append that to /dev/fd/ to get the filename.
If the data is in a memory buffer, you could create a thread that writes to a pipe. Use the file descriptor of the read end of the pipe with /dev/fd.
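A minimal sketch of that pipe idea, assuming readData() from the question is the library entry point that takes a path (its signature here is an assumption, and error handling is kept minimal):
#include <cstdio>
#include <string>
#include <thread>
#include <unistd.h>

void readData(const std::string& filePath);   // the library call from the question (assumed signature)

void feedBufferToLibrary(const char* buffer, size_t size)
{
    int fds[2];
    if (pipe(fds) != 0) { perror("pipe"); return; }

    // Writer thread: the library reads the other end concurrently,
    // so the pipe never has to hold the whole buffer at once.
    std::thread writer([&]() {
        size_t written = 0;
        while (written < size) {
            ssize_t n = write(fds[1], buffer + written, size - written);
            if (n <= 0) break;
            written += static_cast<size_t>(n);
        }
        close(fds[1]);   // EOF for the reader
    });

    // The library opens this path and ends up reading from the pipe.
    std::string filePath = "/dev/fd/" + std::to_string(fds[0]);
    readData(filePath);

    writer.join();
    close(fds[0]);
}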

Writing binary data to a file using C++, after receiving from socket

I am trying to write C++ code that reads from a file (of any type) and writes the file data (binary data) to a socket, so that the receiver can take this data and recreate the file. I should end up with the same data in the same format, but the problem is that the data received over the network and written to the file does not reproduce the original file.
If I test the code without sending over the network, it works well.
Any ideas?
Thanks in advance.
Note: I am using Ubuntu 11.10, in case it affects this issue.
Here is the code, on the client side:
filer=fopen("a.doc","rb");
fseek (filer , 0 , SEEK_END);
long size;
size = ftell (filer);
rewind (filer);
buffer = (char*) malloc (sizeof(char)*size);
numr=fread(buffer,1,size,filer);
fclose(filer); //some socket code
char buffer2[size];
strcpy(buffer2 , buffer);
n = write(sockfd,buffer2,size);
and for the server side :
n = read(sock,buffer,length);
FILE * filew;
int numw;
filew=fopen("acopy.doc","wb");
numw=fwrite(buffer,1,len,filew);
fclose(filew);
The first thing is that you'll need to loop. The calls to read and write will not always transfer the full buffer. Disclaimer: I couldn't test this here.
Ex:
numr=fread(buffer,1,size,filer);
fclose(filer); //some socket code
char buffer2[size];
strcpy(buffer2 , buffer);
n = write(sockfd,buffer2,size);
to
char buffer2[size];
while ((numr = fread(buffer, 1, size, filer)) != 0)
{
    // memcpy, not strcpy: the data is binary and may contain '\0' bytes
    memcpy(buffer2, buffer, numr);
    // write may send less than requested, so loop until the whole chunk is out
    int sent = 0;
    while (sent < (int)numr)
    {
        n = write(sockfd, buffer2 + sent, numr - sent);
        if (n <= 0)
            break;              // handle error / closed socket here
        sent += n;
    }
}
fclose(filer); //some socket code
filer = NULL;
Likewise on the server side
n = read(sock,buffer,length);
FILE * filew;
int numw;
filew=fopen("acopy.doc","wb");
numw=fwrite(buffer,1,len,filew);
fclose(filew);
to
FILE * filew;
filew = fopen("acopy.doc", "wb");
int numw = 0;
while ((n = read(sock, buffer, length)) > 0)
{
    // fwrite may also write less than requested, so loop until
    // everything read from the socket has been written out
    int written = 0;
    while (written < n)
    {
        numw = fwrite(buffer + written, 1, n - written, filew);
        if (numw == 0)
            break;              // handle write error here
        written += numw;
    }
}
fclose(filew);

C++ using 7zip.dll

I'm developing an app that will need to work with different types of archives; supporting as many archive types as possible is good. I have chosen 7zip.dll as the archive-handling engine. But there is a problem: does anyone know how to uncompress a file from an archive into a memory buffer? As far as I can see, 7zip.dll only supports uncompressing to the hard disk. Also, it would be nice to load the archive from a memory buffer. Has anyone tried to do something like that?
Not sure if I completely understand your needs (for example, don't you need the decompressed file on disk?).
I was looking at LZMA SDK 9.20 and its lzma.txt readme file, and there are plenty of hints that decompression to memory is possible - you may just need to use the C API rather than the C++ interface. Check out, for example, the section called Single-call Decompressing:
When to use: RAM->RAM decompressing
Compile files: LzmaDec.h + LzmaDec.c + Types.h
Compile defines: no defines
Memory Requirements:
- Input buffer: compressed size
- Output buffer: uncompressed size
- LZMA Internal Structures: state_size (16 KB for default settings)
Also, there is this function:
SRes LzmaDec_DecodeToBuf(CLzmaDec *p, Byte *dest, SizeT *destLen,
const Byte *src, SizeT *srcLen, ELzmaFinishMode finishMode, ELzmaStatus *status);
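To illustrate the RAM->RAM path, here is a rough, untested sketch using LzmaDecode(), the single-call helper declared in LzmaDec.h. It assumes the input buffer holds a raw .lzma stream (5 property bytes, an 8-byte little-endian uncompressed size, then the compressed data) and that the size field holds the real uncompressed size; the function name and buffer handling are illustrative:
/* Decompress srcLen bytes from src into a freshly allocated buffer.
 * On success returns the buffer (caller frees) and stores its size in *outLen. */
#include <stdlib.h>
#include "LzmaDec.h"

static void *SzAlloc(void *p, size_t size) { (void)p; return malloc(size); }
static void SzFree(void *p, void *address) { (void)p; free(address); }
static ISzAlloc g_Alloc = { SzAlloc, SzFree };

Byte *decompressToMemory(const Byte *src, SizeT srcLen, SizeT *outLen)
{
    if (srcLen < LZMA_PROPS_SIZE + 8)
        return NULL;

    /* The uncompressed size is stored after the 5 property bytes. */
    UInt64 unpackSize = 0;
    for (int i = 0; i < 8; i++)
        unpackSize |= (UInt64)src[LZMA_PROPS_SIZE + i] << (8 * i);

    Byte *dest = (Byte *)malloc((size_t)unpackSize);
    if (!dest)
        return NULL;

    SizeT destLen = (SizeT)unpackSize;
    SizeT inLen = srcLen - LZMA_PROPS_SIZE - 8;
    ELzmaStatus status;

    /* Single-call decode: properties first, then the compressed payload. */
    SRes res = LzmaDecode(dest, &destLen,
                          src + LZMA_PROPS_SIZE + 8, &inLen,
                          src, LZMA_PROPS_SIZE,
                          LZMA_FINISH_END, &status, &g_Alloc);
    if (res != SZ_OK)
    {
        free(dest);
        return NULL;
    }
    *outLen = destLen;
    return dest;
}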
You can utilize these by memory-mapping the archive file. To the best of my knowledge, if your process creates a memory-mapped file with exclusive access (so no other process can access it) and does no explicit flushing, all changes to the file will be kept in memory until the mapping is destroyed or the file closed. Alternatively, you could just load the archive contents in memory.
For the sake of completeness, I hacked together several examples into a demo of using memory mapping in Windows.
#include <stdio.h>
#include <time.h>
#include <Windows.h>
#include <WinNT.h>

// This demo will limit the file to 4KiB
#define FILE_SIZE_MAX_LOWER_DW 4096
#define FILE_SIZE_MAX_UPPER_DW 0
#define MAP_OFFSET_LOWER_DW 0
#define MAP_OFFSET_UPPER_DW 0
#define TEST_ITERATIONS 1000
#define INT16_SIZE 2

typedef short int int16;

// NOTE: This will not work for Windows versions before XP or 2003 Server!
int main()
{
    HANDLE hFile, hFileMapping;
    PBYTE mapViewStartAddress;

    // Note: with no explicit security attributes, the process needs to have
    // the necessary rights (e.g. read, write) to this location.
    LPCSTR path = "C:\\Users\\mcmlxxxvi\\Desktop\\test.dat";

    // First, open a file handle.
    hFile = CreateFile(path,
                       GENERIC_READ | GENERIC_WRITE,  // The file is created with Read/Write permissions
                       FILE_SHARE_READ,               // Set this to 0 for exclusive access
                       NULL,                          // Optional security attributes
                       CREATE_ALWAYS,                 // File is created if not found, overwritten otherwise
                       FILE_ATTRIBUTE_TEMPORARY,      // This affects the caching behaviour
                       0);                            // Attributes template, can be left NULL
    if (hFile == INVALID_HANDLE_VALUE)
    {
        fprintf(stderr, "Unable to open file");
        return 1;
    }

    // Then, create a memory mapping for the opened file.
    hFileMapping = CreateFileMapping(hFile,                  // Handle for an opened file
                                     NULL,                   // Optional security attributes
                                     PAGE_READWRITE,         // File can be mapped for Read/Write access
                                     FILE_SIZE_MAX_UPPER_DW, // Maximum file size split in DWORDs:
                                     FILE_SIZE_MAX_LOWER_DW, // high DWORD first, then low DWORD.
                                     NULL);                  // Optional name
    if (hFileMapping == 0)
    {
        CloseHandle(hFile);
        fprintf(stderr, "Unable to open file for mapping.");
        return 1;
    }

    // Next, map a view (a continuous portion of the file) to a memory region.
    // The view must start at an offset that is a multiple of
    // the allocation granularity (roughly speaking, the machine page size).
    mapViewStartAddress = (PBYTE)MapViewOfFile(hFileMapping,                   // Handle to a memory-mapped file
                                               FILE_MAP_READ | FILE_MAP_WRITE, // Maps the view for Read/Write access
                                               MAP_OFFSET_UPPER_DW,            // Offset in the file from which
                                               MAP_OFFSET_LOWER_DW,            // the view starts, split in DWORDs.
                                               FILE_SIZE_MAX_LOWER_DW);        // Size of the view (here, entire file)
    if (mapViewStartAddress == 0)
    {
        CloseHandle(hFileMapping);
        CloseHandle(hFile);
        fprintf(stderr, "Couldn't map a view of the file.");
        return 1;
    }

    // This is where actual business stuff belongs.
    // This example application does iterations of reading and writing
    // random numbers for the entire length of the file.
    int16 value;
    errno_t result = 0;
    srand((int)time(NULL));
    for (int i = 0; i < TEST_ITERATIONS; i++)
    {
        // Write
        for (int j = 0; j < FILE_SIZE_MAX_LOWER_DW / INT16_SIZE; j++)
        {
            value = rand();
            result = memcpy_s(mapViewStartAddress + j * INT16_SIZE, INT16_SIZE, &value, INT16_SIZE);
            if (result != 0)
            {
                UnmapViewOfFile(mapViewStartAddress);
                CloseHandle(hFileMapping);
                CloseHandle(hFile);
                fprintf(stderr, "File write error during iteration #%d, error %d", i, result);
                return 1;
            }
        }
        // Read (no file pointer to reset: the view is just memory)
        for (int j = 0; j < FILE_SIZE_MAX_LOWER_DW / INT16_SIZE; j++)
        {
            result = memcpy_s(&value, INT16_SIZE, mapViewStartAddress + j * INT16_SIZE, INT16_SIZE);
            if (result != 0)
            {
                UnmapViewOfFile(mapViewStartAddress);
                CloseHandle(hFileMapping);
                CloseHandle(hFile);
                fprintf(stderr, "File read error during iteration #%d, error %d", i, result);
                return 1;
            }
        }
    }
    // End business stuff

    UnmapViewOfFile(mapViewStartAddress);
    CloseHandle(hFileMapping);
    CloseHandle(hFile);
    return 0;
}

zlib's uncompress() strangely returning Z_BUF_ERROR

I'm writing a Qt-based client application. It connects to a remote server using QTcpSocket. Before sending any actual data it needs to send login info, which is zlib-compressed JSON.
As far as I can tell from the server sources, to make everything work I need to send 4 bytes containing the uncompressed data length, followed by X bytes of compressed data.
Uncompressing on the server side looks like this:
/* look at first 32 bits of buffer, which contains uncompressed len */
unc_len = le32toh(*((uint32_t *)buf));
if (unc_len > CLI_MAX_MSG)
    return NULL;

/* alloc buffer for uncompressed data */
obj_unc = malloc(unc_len + 1);
if (!obj_unc)
    return NULL;

/* decompress buffer (excluding first 32 bits) */
comp_p = buf + 4;
if (uncompress(obj_unc, &dest_len, comp_p, buflen - 4) != Z_OK)
    goto out;
if (dest_len != unc_len)
    goto out;

memcpy(obj_unc + unc_len, &zero, 1); /* null terminate */
I'm compressing the JSON using Qt's built-in zlib (I just downloaded the headers and placed them in MinGW's include folder):
char json[] = "{\"version\":1,\"user\":\"test\"}";
char pass[] = "test";
std::auto_ptr<Bytef> message(new Bytef[ // allocate memory for:
sizeof(ubbp_header) // + msg header
+ sizeof(uLongf) // + uncompressed data size
+ strlen(json) // + compressed data itself
+ 64 // + reserve (if compressed size > uncompressed size)
+ SHA256_DIGEST_LENGTH]);//+ SHA256 digest
uLongf unc_len = strlen(json);
uLongf enc_len = strlen(json) + 64;
// header goes first, so server will determine that we want to login
Bytef* pHdr = message.get();
// after that: uncompressed data length and data itself
Bytef* pLen = pHdr + sizeof(ubbp_header);
Bytef* pDat = pLen + sizeof(uLongf);
// hash of compressed message updated with user pass
Bytef* pSha;
if (Z_OK != compress(pLen, &enc_len, (Bytef*)json, unc_len))
{
    qDebug("Compression failed.");
    return false;
}
Complete function code here: http://pastebin.com/hMY2C4n5
Even though the server correctly receives the uncompressed length, uncompress() keeps returning Z_BUF_ERROR.
P.S.: I'm actually writing a pushpool client to figure out how its binary protocol works. I've asked this question on the official Bitcoin forum, but no luck there: http://forum.bitcoin.org/index.php?topic=24257.0
Turns out it was a server-side bug. More details in the Bitcoin forum thread.

Monitoring file using inotify

I am using inotify to monitor a local file, for example "/root/temp" using
inotify_add_watch(fd, "/root/temp", mask).
When this file is deleted, the program blocks in the read(fd, buf, bufSize) call. Even if I create a new "/root/temp" file, the program is still blocked in read. I am wondering whether inotify can detect that the monitored file has been recreated, so that read gets something from fd and does not block forever.
Here is my code:
uint32_t mask = IN_ALL_EVENTS;
int fd = inotify_init();
int wd = inotify_add_watch(fd, "/root/temp", mask);
char *buf = new char[1000];
int nbytes = read(fd, buf, 500);
I monitored all events.
The problem is that read is a blocking operation by default.
If you don't want it to block, use select or poll before read. For example:
struct pollfd pfd = { fd, POLLIN, 0 };
int ret = poll(&pfd, 1, 50);  // timeout of 50ms
if (ret < 0) {
    fprintf(stderr, "poll failed: %s\n", strerror(errno));
} else if (ret == 0) {
    // Timeout with no events, move on.
} else {
    // Process the new event.
    struct inotify_event event;
    int nbytes = read(fd, &event, sizeof(event));
    // Do what you need...
}
Note: untested code.
In order to see a new file created, you need to watch the directory, not the file. Watching a file should see when it is deleted (IN_DELETE_SELF) but may not spot if a new file is created with the same name.
You should probably watch the directory for IN_CREATE | IN_MOVED_TO to see newly created files (or files moved in from another place).
Some editors and other tools (e.g. rsync) may create a file under a different name, then rename it.
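For example, a minimal (untested here) sketch that watches the directory and reacts when an entry named temp appears or disappears; the /root path and the file name come from the question, everything else is illustrative:
#include <cstdio>
#include <cstring>
#include <sys/inotify.h>
#include <unistd.h>

int main()
{
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    // Watch the containing directory, not the file itself.
    int wd = inotify_add_watch(fd, "/root", IN_CREATE | IN_MOVED_TO | IN_DELETE);
    if (wd < 0) { perror("inotify_add_watch"); return 1; }

    // Directory events carry a variable-length name, so read a batch of
    // events into a byte buffer. A real program would do this in a loop.
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    ssize_t len = read(fd, buf, sizeof(buf));
    for (char* p = buf; p < buf + len; )
    {
        struct inotify_event* ev = (struct inotify_event*)p;
        if (ev->len > 0 && strcmp(ev->name, "temp") == 0)
        {
            if (ev->mask & (IN_CREATE | IN_MOVED_TO))
                printf("/root/temp was (re)created\n");
            if (ev->mask & IN_DELETE)
                printf("/root/temp was deleted\n");
        }
        p += sizeof(struct inotify_event) + ev->len;
    }
    close(fd);
    return 0;
}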