Get file string in compile time [duplicate] - c++

Is there a way to include an entire text file as a string in a C program at compile-time?
something like:
file.txt:
This is
a little
text file
main.c:
#include <stdio.h>
int main(void) {
#blackmagicinclude("file.txt", content)
/*
equiv: char content[] = "This is\na little\ntext file";
*/
printf("%s", content);
}
obtaining a little program that prints on stdout "This is
a little
text file"
At the moment I use a hackish Python script, but it's butt-ugly and limited to only one variable name. Can you tell me another way to do it?

I'd suggest using the Unix utility xxd for this.
You can use it like so:
$ echo hello world > a
$ xxd -i a
outputs:
unsigned char a[] = {
0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x0a
};
unsigned int a_len = 12;
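As a rough sketch of how the generated array might be consumed from C++: the xxd output above is pasted inline here, whereas a real build would more likely #include the generated file. The helper name embedded_text is made up for this illustration. Note that the array xxd emits is not null-terminated, so the length variable is needed.

```cpp
#include <string>

// Output of `xxd -i a`, pasted inline for the example.
unsigned char a[] = {
  0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x0a
};
unsigned int a_len = 12;

// Hypothetical helper: copy the embedded bytes into a std::string.
// The array has no terminating '\0', so the explicit length is required.
std::string embedded_text() {
    return std::string(reinterpret_cast<const char*>(a), a_len);
}
```

Calling embedded_text() then gives the file contents, including the trailing newline that echo wrote.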

The question was about C, but in case someone wants to do this with C++11, it can be done with only small changes to the included text file, thanks to the new raw string literals:
In C++ do this:
const char *s =
#include "test.txt"
;
In the text file do this:
R"(Line 1
Line 2
Line 3
Line 4
Line 5
Line 6)"
So there must only be a prefix at the top of the file and a suffix at the end of it. In between you can write whatever you want; no special escaping is necessary as long as you don't need the character sequence )". Even that can work if you specify your own custom delimiter:
R"=====(Line 1
Line 2
Line 3
Now you can use "( and )" in the text file, too.
Line 5
Line 6)====="
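To make the delimiter rule concrete, here is a small sketch with the raw strings written inline rather than pulled in through #include. The variable names plain and tricky are only for illustration.

```cpp
#include <string>

// Default delimiter: backslashes and double quotes are taken literally.
const std::string plain = R"(C:\temp\"quoted")";

// Custom delimiter "=====": the text may now contain )" without
// ending the literal, because only )=====" closes it.
const std::string tricky = R"=====(call f("x)") here)=====";
```

The second literal would end too early with the default delimiter, since its text contains the sequence )".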

I like kayahr's answer. If you don't want to touch the input files, however, and you are using CMake, you can have CMake add the delimiter character sequences to a copy of the file. The following CMake code, for instance, copies an input file and wraps its content accordingly:
function(make_includable input_file output_file)
file(READ ${input_file} content)
set(delim "for_c++_include")
set(content "R\"${delim}(\n${content})${delim}\"")
file(WRITE ${output_file} "${content}")
endfunction(make_includable)
# Use like
make_includable(external/shaders/cool.frag generated/cool.frag)
Then include in c++ like this:
constexpr const char *test =
#include "generated/cool.frag"
;

You have two possibilities:
Make use of compiler/linker extensions to convert a file into a binary file, with proper symbols pointing to the begin and end of the binary data. See this answer: Include binary file with GNU ld linker script.
Convert your file into a sequence of character constants that can initialize an array. Note you can't just do "" and span multiple lines. You would need a line continuation character (\), escape " characters and others to make that work. Easier to just write a little program to convert the bytes into a sequence like '\xFF', '\xAB', ...., '\0' (or use the unix tool xxd described by another answer, if you have it available!):
Code:
#include <stdio.h>
int main() {
int c;
while((c = fgetc(stdin)) != EOF) {
printf("'\\x%X',", (unsigned)c);
}
printf("'\\0'"); // put terminating zero
}
(not tested). Then do:
char my_file[] = {
#include "data.h"
};
Where data.h is generated by
cat file.bin | ./bin2c > data.h

You can do this using objcopy:
objcopy -I binary -O elf64-x86-64 myfile.txt myfile.o
Now you have an object file you can link into your executable, which contains symbols for the beginning, end, and size of the content from myfile.txt (named _binary_myfile_txt_start, _binary_myfile_txt_end, and _binary_myfile_txt_size).

OK, inspired by Daemin's post, I tested the following simple example:
a.data:
"this is test\n file\n"
test.c:
int main(void)
{
char *test =
#include "a.data"
;
return 0;
}
gcc -E test.c output:
# 1 "test.c"
# 1 "<built-in>"
# 1 "<command line>"
# 1 "test.c"
int main(void)
{
char *test =
# 1 "a.data" 1
"this is test\n file\n"
# 6 "test.c" 2
;
return 0;
}
So it works, but it requires the data to be surrounded by quotation marks.

If you're willing to resort to some dirty tricks you can get creative with raw string literals and #include for certain types of files.
For example, say I want to include some SQL scripts for SQLite in my project and I want to get syntax highlighting but don't want any special build infrastructure. I can have this file test.sql which is valid SQL for SQLite where -- starts a comment:
--x, R"(--
SELECT * from TestTable
WHERE field = 5
--)"
And then in my C++ code I can have:
#include <iostream>
using namespace std;
int main()
{
auto x = 0;
const char* mysql = (
#include "test.sql"
);
cout << mysql << endl;
}
The output is:
--
SELECT * from TestTable
WHERE field = 5
--
Or to include some Python code from a file test.py which is a valid Python script (because # starts a comment in Python and pass is a no-op):
#define pass R"(
pass
def myfunc():
    print("Some Python code")
myfunc()
#undef pass
#define pass )"
pass
And then in the C++ code:
#include <iostream>
using namespace std;
int main()
{
const char* mypython = (
#include "test.py"
);
cout << mypython << endl;
}
Which will output:
pass
def myfunc():
    print("Some Python code")
myfunc()
#undef pass
#define pass
It should be possible to play similar tricks for various other types of code you might want to include as a string. Whether or not it is a good idea I'm not sure. It's kind of a neat hack but probably not something you'd want in real production code. Might be ok for a weekend hack project though.

You need my xtr utility, but you can do the rest with a bash script. This is a script I call bin2inc. The first parameter is the name of the resulting char[] variable. The second parameter is the name of the file. The output is a C include file with the file content encoded (in lowercase hex) under the given variable name. The char array is zero-terminated, and the length of the data is stored in $variableName_length.
#!/bin/bash
fileSize ()
{
[ -e "$1" ] && {
set -- `ls -l "$1"`;
echo $5;
}
}
echo unsigned char $1'[] = {'
./xtr -fhex -p 0x -s ', ' < "$2";
echo '0x00'
echo '};';
echo '';
echo unsigned long int ${1}_length = $(fileSize "$2")';'
You can get xtr here: xtr (character eXTRapolator) is GPLv3.

Why not link the text into the program and use it as a global variable! Here is an example. I'm considering using this to include Open GL shader files within an executable since GL shaders need to be compiled for the GPU at runtime.

I reimplemented xxd in python3, fixing all of xxd's annoyances:
Const correctness
string length datatype: int → size_t
Null termination (in case you might want that)
C string compatible: Drop unsigned on the array.
Smaller, readable output, as you would have written it: Printable ascii is output as-is; other bytes are hex-encoded.
Here is the script, filtered by itself, so you can see what it does:
pyxxd.c
#include <stddef.h>
extern const char pyxxd[];
extern const size_t pyxxd_len;
const char pyxxd[] =
"#!/usr/bin/env python3\n"
"\n"
"import sys\n"
"import re\n"
"\n"
"def is_printable_ascii(byte):\n"
"    return byte >= ord(' ') and byte <= ord('~')\n"
"\n"
"def needs_escaping(byte):\n"
"    return byte == ord('\\\"') or byte == ord('\\\\')\n"
"\n"
"def stringify_nibble(nibble):\n"
"    if nibble < 10:\n"
"        return chr(nibble + ord('0'))\n"
"    return chr(nibble - 10 + ord('a'))\n"
"\n"
"def write_byte(of, byte):\n"
"    if is_printable_ascii(byte):\n"
"        if needs_escaping(byte):\n"
"            of.write('\\\\')\n"
"        of.write(chr(byte))\n"
"    elif byte == ord('\\n'):\n"
"        of.write('\\\\n\"\\n\"')\n"
"    else:\n"
"        of.write('\\\\x')\n"
"        of.write(stringify_nibble(byte >> 4))\n"
"        of.write(stringify_nibble(byte & 0xf))\n"
"\n"
"def mk_valid_identifier(s):\n"
"    s = re.sub('^[^_a-z]', '_', s)\n"
"    s = re.sub('[^_a-z0-9]', '_', s)\n"
"    return s\n"
"\n"
"def main():\n"
"    # `xxd -i` compatibility\n"
"    if len(sys.argv) != 4 or sys.argv[1] != \"-i\":\n"
"        print(\"Usage: xxd -i infile outfile\")\n"
"        exit(2)\n"
"\n"
"    with open(sys.argv[2], \"rb\") as infile:\n"
"        with open(sys.argv[3], \"w\") as outfile:\n"
"\n"
"            identifier = mk_valid_identifier(sys.argv[2]);\n"
"            outfile.write('#include <stddef.h>\\n\\n');\n"
"            outfile.write('extern const char {}[];\\n'.format(identifier));\n"
"            outfile.write('extern const size_t {}_len;\\n\\n'.format(identifier));\n"
"            outfile.write('const char {}[] =\\n\"'.format(identifier));\n"
"\n"
"            while True:\n"
"                byte = infile.read(1)\n"
"                if byte == b\"\":\n"
"                    break\n"
"                write_byte(outfile, ord(byte))\n"
"\n"
"            outfile.write('\";\\n\\n');\n"
"            outfile.write('const size_t {}_len = sizeof({}) - 1;\\n'.format(identifier, identifier));\n"
"\n"
"if __name__ == '__main__':\n"
"    main()\n"
"";
const size_t pyxxd_len = sizeof(pyxxd) - 1;
Usage (this extracts the script):
#include <stdio.h>
extern const char pyxxd[];
extern const size_t pyxxd_len;
int main()
{
fwrite(pyxxd, 1, pyxxd_len, stdout);
}

Here's a hack I use for Visual C++. I add the following Pre-Build Event (where file.txt is the input and file_txt.h is the output):
#(
echo const char text[] = R"***(
type file.txt
echo ^^^)***";
) > file_txt.h
I then include file_txt.h where I need it.
This isn't perfect, as it adds \n at the start and \n^ at the end, but that's not a problem to handle and I like the simplicity of this solution. If anyone can refine it to get rid of the extra chars, that would be nice.

You can use assembly for this:
asm("fileData: .incbin \"filename.ext\"");
asm("fileDataEnd: .byte 0x00");
extern char fileData[];
extern char fileDataEnd[];
const int fileDataSize = fileDataEnd - fileData + 1;

Even if it can be done at compile time (I don't think it can in general), the text would likely be the preprocessed header rather than the file's contents verbatim. I expect you'll have to load the text from the file at runtime or do a nasty cut-and-paste job.

Hasturkun's answer using the xxd -i option is excellent. If you want to incorporate the conversion process (text -> hex include file) directly into your build the hexdump.c tool/library recently added a capability similar to xxd's -i option (it doesn't give you the full header - you need to provide the char array definition - but that has the advantage of letting you pick the name of the char array):
http://25thandclement.com/~william/projects/hexdump.c.html
Its license is a lot more "standard" than xxd's and is very liberal. An example of using it to embed an init file in a program can be seen in the CMakeLists.txt and scheme.c files here:
https://github.com/starseeker/tinyscheme-cmake
There are pros and cons both to including generated files in source trees and bundling utilities - how to handle it will depend on the specific goals and needs of your project. hexdump.c opens up the bundling option for this application.

I think it is not possible with the compiler and preprocessor alone. gcc allows this:
#define _STRGF(x) # x
#define STRGF(x) _STRGF(x)
printk ( MODULE_NAME " built " __DATE__ " at " __TIME__ " on host "
STRGF(
# define hostname my_dear_hostname
hostname
)
"\n" );
But unfortunately not this:
#define _STRGF(x) # x
#define STRGF(x) _STRGF(x)
printk ( MODULE_NAME " built " __DATE__ " at " __TIME__ " on host "
STRGF(
# include "/etc/hostname"
)
"\n" );
The error is:
/etc/hostname: In function ‘init_module’:
/etc/hostname:1:0: error: unterminated argument list invoking macro "STRGF"

I had similar issues, and for small files the aforementioned solution of Johannes Schaub worked like a charm for me.
However, for files that are a bit larger, it ran into issues with the character array limit of the compiler. Therefore, I wrote a small encoder application that converts file content into a 2D character array of equally sized chunks (and possibly padding zeros). It produces output textfiles with 2D array data like this:
const char main_js_file_data[8][4]= {
{'\x69','\x73','\x20','\0'},
{'\x69','\x73','\x20','\0'},
{'\x61','\x20','\x74','\0'},
{'\x65','\x73','\x74','\0'},
{'\x20','\x66','\x6f','\0'},
{'\x72','\x20','\x79','\0'},
{'\x6f','\x75','\xd','\0'},
{'\xa','\0','\0','\0'}};
where 4 is actually a variable MAX_CHARS_PER_ARRAY in the encoder. The file with the resulting C code, called, for example "main_js_file_data.h" can then easily be inlined into the C++ application, for example like this:
#include "main_js_file_data.h"
Here is the source code of the encoder:
#include <fstream>
#include <iterator>
#include <vector>
#include <algorithm>
#include <cmath> // std::ceil
#define MAX_CHARS_PER_ARRAY 2048
int main(int argc, char * argv[])
{
// three parameters: input filename, output filename, variable name
if (argc < 4)
{
return 1;
}
// buffer data, packaged into chunks
std::vector<char> bufferedData;
// open input file, in binary mode
{
std::ifstream fStr(argv[1], std::ios::binary);
if (!fStr.is_open())
{
return 1;
}
bufferedData.assign(std::istreambuf_iterator<char>(fStr),
std::istreambuf_iterator<char>() );
}
// write output text file, containing a variable declaration,
// which will be a fixed-size two-dimensional plain array
{
std::ofstream fStr(argv[2]);
if (!fStr.is_open())
{
return 1;
}
const std::size_t numChunks = std::size_t(std::ceil(double(bufferedData.size()) / (MAX_CHARS_PER_ARRAY - 1)));
fStr << "const char " << argv[3] << "[" << numChunks << "]" <<
"[" << MAX_CHARS_PER_ARRAY << "]= {" << std::endl;
std::size_t count = 0;
fStr << std::hex;
while (count < bufferedData.size())
{
std::size_t n = 0;
fStr << "{";
for (; n < MAX_CHARS_PER_ARRAY - 1 && count < bufferedData.size(); ++n)
{
fStr << "'\\x" << int(static_cast<unsigned char>(bufferedData[count++])) << "',";
}
// fill missing part to reach fixed chunk size with zero entries
for (std::size_t j = 0; j < (MAX_CHARS_PER_ARRAY - 1) - n; ++j)
{
fStr << "'\\0',";
}
fStr << "'\\0'}";
if (count < bufferedData.size())
{
fStr << ",\n";
}
}
fStr << "};\n";
}
return 0;
}
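On the consuming side, the chunks have to be joined again. The following is only a sketch under the assumption that the embedded file is text without zero bytes; the array demo_file_data stands in for a generated header, and join_chunks is a hypothetical helper, not part of the encoder above.

```cpp
#include <cstddef>
#include <string>

// Stand-in for a generated header: 3 chunks of 4 chars, each chunk
// holding up to 3 data bytes followed by zero padding/termination.
const char demo_file_data[3][4] = {
    {'\x68', '\x65', '\x6c', '\0'},   // "hel"
    {'\x6c', '\x6f', '\x20', '\0'},   // "lo "
    {'\x79', '\x6f', '\0',   '\0'}};  // "yo" plus padding

// Hypothetical helper: concatenate the chunks back into one string.
// Each row is treated as a C string, so this only suits text data
// that contains no embedded zero bytes.
template <std::size_t Rows, std::size_t Cols>
std::string join_chunks(const char (&chunks)[Rows][Cols]) {
    std::string out;
    out.reserve(Rows * (Cols - 1));
    for (std::size_t i = 0; i < Rows; ++i) {
        out += chunks[i];  // stops at the first '\0' in the row
    }
    return out;
}
```

For binary data the padding cannot be told apart from real zero bytes this way, so the original length would have to be stored separately.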

This problem was irritating me and xxd doesn't work for my use case because it made the variable called something like __home_myname_build_prog_cmakelists_src_autogen when I tried to script it in, so I made a utility to solve this exact problem:
https://github.com/Exaeta/brcc
It generates a source and header file and allows you to explicitly set the name of each variable so then you can use them via std::begin(arrayname) and std::end(arrayname).
I incorporated it into my cmake project like so:
add_custom_command(
OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/binary_resources.hpp ${CMAKE_CURRENT_BINARY_DIR}/binary_resources.cpp
COMMAND brcc ${CMAKE_CURRENT_BINARY_DIR}/binary_resources RGAME_BINARY_RESOURCES_HH txt_vertex_shader ${CMAKE_CURRENT_BINARY_DIR}/src/vertex_shader1.glsl
DEPENDS src/vertex_shader1.glsl)
With small tweaks I suppose it could be made to work for C as well.

If you are using CMake, you may be interested in writing a CMake preprocessing script like the following:
cmake/ConvertLayout.cmake
function(convert_layout file include_dir)
get_filename_component(name ${file} NAME_WE)
get_filename_component(directory ${file} DIRECTORY)
get_filename_component(directory ${directory} NAME)
string(TOUPPER ${name} NAME)
string(TOUPPER ${directory} DIRECTORY)
set(new_file ${include_dir}/${directory}/${name}.h)
if (${file} IS_NEWER_THAN ${new_file})
file(READ ${file} content)
string(REGEX REPLACE "\"" "\\\\\"" content "${content}")
string(REGEX REPLACE "[\r\n]" "\\\\n\"\\\\\n\"" content "${content}")
set(content "\"${content}\"")
set(content "#ifndef ${DIRECTORY}_${NAME}\n#define ${DIRECTORY}_${NAME} ${content} \n#endif")
message(STATUS "${content}")
file(WRITE ${new_file} "${content}")
message(STATUS "Generated layout include file ${new_file} from ${file}")
endif()
endfunction()
function(convert_layout_directory layout_dir include_dir)
file(GLOB layouts ${layout_dir}/*)
foreach(layout ${layouts})
convert_layout(${layout} ${include_dir})
endforeach()
endfunction()
your CMakeLists.txt
include(cmake/ConvertLayout.cmake)
convert_layout_directory(layout ${CMAKE_BINARY_DIR}/include)
include_directories(${CMAKE_BINARY_DIR}/include)
somewhere in c++
#include "layout/menu.h"
Glib::ustring ui_info = LAYOUT_MENU;

I like @Martin R.'s answer because, as it says, it doesn't touch the input file and automates the process. To improve on this, I added the capability to automatically split up large files that exceed compiler limits. The output file is written as an array of smaller strings which can then be reassembled in code. The resulting script, based on @Martin R.'s version, and an example are included here:
https://github.com/skillcheck/cmaketools.git
The relevant CMake setup is:
make_includable( LargeFile.h
${CMAKE_CURRENT_BINARY_DIR}/generated/LargeFile.h
"c++-include" "L" LINE_COUNT FILE_SIZE
)
The source code is then:
static std::vector<std::wstring> const chunks = {
#include "generated/LargeFile.h"
};
std::wstring contents =
std::accumulate( chunks.begin(), chunks.end(), std::wstring() );

in x.h
"this is a "
"buncha text"
in main.c
#include <stdio.h>
int main(void)
{
char *textFileContents =
#include "x.h"
;
printf("%s\n", textFileContents);
return 0;
}
ought to do the job.

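You might be tempted to try something like the following, although as written it will not compile, because a string literal cannot span source lines and #include is not processed inside one: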
int main()
{
const char* text = "
#include "file.txt"
";
printf("%s", text);
return 0;
}
Of course you'll have to be careful with what is actually in the file, making sure there are no double quotes, that all appropriate characters are escaped, etc.
Therefore it might be easier if you just load the text from a file at runtime, or embed the text directly into the code.
If you still wanted the text in another file you could have it in there, but it would have to be represented there as a string. You would use the code as above but without the double quotes in it. For example:
file.txt
"Something evil\n"\
"this way comes!"
main.cpp
int main()
{
const char* text =
#include "file.txt"
;
printf("%s", text);
return 0;
}
So basically having a C or C++ style string in a text file that you include. It would make the code neater because there isn't this huge lot of text at the start of the file.
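Producing such a quoted file by hand is error-prone, so a small converter can help. This is a minimal sketch, assuming plain ASCII text; to_c_literal is a hypothetical helper that escapes quotes, backslashes, tabs, and line breaks and emits one string literal per input line, relying on the compiler to concatenate adjacent literals. Other control characters and trigraphs are not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn raw text into C source consisting of
// adjacent string literals, one per input line.
std::string to_c_literal(const std::string& text) {
    std::string out = "\"";
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char ch = text[i];
        switch (ch) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\t': out += "\\t";  break;
            case '\r': out += "\\r";  break;
            case '\n':
                out += "\\n\"";  // close the literal for this line
                if (i + 1 == text.size()) return out;  // text ends here
                out += "\n\"";   // open a new literal on the next line
                break;
            default:   out += ch;     break;
        }
    }
    out += "\"";
    return out;
}
```

The result can be written to a file and included in place of the hand-quoted file.txt shown above.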

Related

Properly handle escape sequences in strings from argv in C++

I'm writing a larger program that takes arguments from the command line after the executable. Some of the arguments are expected to be passed after the equals sign of an option. For instance, the output to the log is a comma separated vector by default, but if the user wants to change the separator to a period or something else instead of a comma, they might give the argument as:
./main --separator="."
This works fine, but if a user wants the delimiter to be a special character (for example: tab), they might expect to pass the escape sequence in one of the following ways:
./main --separator="\t"
./main --separator='\t'
./main --separator=\t
It doesn't behave the way I want it to (to interpret \t as a tab) and instead prints out the string as written (sans quotes, and with no quotes it just prints 't'). I've tried using double slashes, but I think I might just be approaching this incorrectly and I'm not sure how to even ask the question properly (I tried searching).
I've recreated the issue in a dummy example here:
#include <string>
#include <iostream>
#include <cstdlib> // for exit()
// Pull the string value after the equals sign
std::string get_option( std::string input );
// Verify that the input is a valid option
bool is_valid_option( std::string input );
int main ( int argc, char** argv )
{
if ( argc != 2 )
{
std::cerr << "Takes exactly two arguments. You gave " << argc << "." << std::endl;
exit( -1 );
}
// Convert from char* to string
std::string arg ( argv[1] );
if ( !is_valid_option( arg ) )
{
std::cerr << "Argument " << arg << " is not a valid option of the form --<argument>=<option>." << std::endl;
exit( -2 );
}
std::cout << "You entered: " << arg << std::endl;
std::cout << "The option you wanted to use is: " << get_option( arg ) << "." << std::endl;
return 0;
}
std::string get_option( std::string input )
{
std::size_t index = input.find( '=' );
std::string opt = input.substr( index + 1 ); // We want everything after the '='
return opt;
}
bool is_valid_option( std::string input )
{
std::size_t equals_index = input.find('=');
return ( equals_index != std::string::npos && equals_index < input.length() - 1 );
}
I compile like this:
g++ -std=c++11 dummy.cpp -o dummy
With the following commands, it produces the following outputs.
With double quotes:
./dummy --option="\t"
You entered: --option=\t
The option you wanted to use is: \t.
With single quotes:
./dummy --option='\t'
You entered: --option=\t
The option you wanted to use is: \t.
With no quotes:
./dummy --option=\t
You entered: --option=t
The option you wanted to use is: t.
My question is: Is there a way to specify that it should interpret the substring \t as a tab character (or other escape sequences) rather than the string literal "\t"? I could parse it manually, but I'm trying to avoid re-inventing the wheel when I might just be missing something small.
Thank you very much for your time and answers. This is something so simple that it's been driving me crazy that I'm not sure how to fix it quickly and simply.
The command line is processed by the shell you use before your program starts, and the result is passed to your program in the argv array.
As you noticed, only the quoted versions let you detect that a "\\t" string (a backslash followed by t) was passed to your main(); without quotes the shell consumes the backslash.
Since most shells treat a real TAB character as whitespace between arguments, you'll never see it in your command line arguments unless it is quoted.
But as mentioned, this is mainly a question of how the shell interprets the command line and what it leaves for your program's arguments, rather than of how to handle it in C++ or C.
My question is: Is there a way to specify that it should interpret the substring \t as a tab character (or other escape sequences) rather than the string literal "\t"? I could parse it manually, but I'm trying to avoid re-inventing the wheel when I might just be missing something small.
You actually need to scan for the two-character sequence that is written in C++ as the string literal
"\\t"
and replace it with a tab yourself within the C++ code.
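A minimal sketch of such a scan, assuming only a handful of escapes need to be supported. The function name unescape and the chosen set of escapes are illustrative; this is not a standard library facility.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replace a few backslash escapes (\t, \n, \r, \\)
// in a command-line value with the characters they denote.
// Unknown escapes and a trailing lone backslash are kept verbatim.
std::string unescape(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\' || i + 1 == in.size()) {
            out += in[i];
            continue;
        }
        const char next = in[++i];
        switch (next) {
            case 't':  out += '\t'; break;
            case 'n':  out += '\n'; break;
            case 'r':  out += '\r'; break;
            case '\\': out += '\\'; break;
            default:   out += '\\'; out += next; break;
        }
    }
    return out;
}
```

In the question's program this could be applied as unescape(get_option(arg)), so that --option='\t' yields a real tab.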

devc++ input from file does not work

I'm trying to redirect a .txt content to .exe
program.exe < file.txt
and contents of file.txt are
35345345345
34543534562
23435635432
35683045342
69849593458
95238942394
28934928341
but the first index in array is the file path and the file contents is not displayed.
int main(int argc, char *args[])
{
for(int c = 0; c<argc; c++){
cout << "Param " << c << ": " << args[c] << "\n";
}
system("PAUSE");
return EXIT_SUCCESS;
}
Desired output:
Param0: 35345345345
Param1: 34543534562
Param2: 23435635432
Param3: 35683045342
Param4: 69849593458
Param5: 95238942394
Param6: 28934928341
The myapp < file.txt syntax passes to stdin (or cin if you prefer), not the arguments.
You have misunderstood what argc and argv are for. They contain the command line arguments to your program. If, for example, you ran:
program.exe something 123
The null terminated strings pointed to by argv will be program.exe, something, and 123.
You are attempting to redirect the contents of a file to program.exe using < file.txt. This is not a command line argument. It simply redirects the contents of the file to the standard input of your program. To get those contents you will need to extract from std::cin.
When you say "but the first index in array is the file path and the file contents is not displayed." it sounds like you're trying to read input from argv and argc. The angle bracket shell operator does not work that way. Instead, stdin (what cin and several C functions read from) has the contents of that file. So, to read from the file in the case above, you'd use cin.
If you instead really wanted to have a file automatically inserted into the argument list, I can't help you with the windows shell. However, if you have the option of using bash, the following will work:
program.exe `cat file.txt`
The backtick operator expands into the result of the command contained within, and so the contents are then passed as arguments to program.exe (again, under the bash shell and not the windows shell)
This code does what I was expecting the other one to do. Thanks to everybody who helped.
#include <iostream>
#include <string>
using namespace std;
int main()
{
string line;
while (getline(cin, line))
cout << "line: " << line << '\n';
}

C++ stream get unget not a nop?

I have the following C++ program and ran it using Visual Studio 2008 on Windows 7. I get and then unget a character. After doing so, the file position is different. Why? How do I get around this problem?
test.txt (download link below if you want)
/* Comment 1 */
/* Comment 2 */
#include <fstream>
int main (int argc, char ** argv) {
char const * file = "test.txt";
std::fstream fs(file, std::ios::in);
std::streampos const before = fs.tellg();
// replacing the following two lines with
// char c = fs.peek(); results in the same problem
char const c = fs.get();
fs.unget();
std::streampos const after = fs.tellg();
fs.seekg(after);
char const c2 = fs.get();
fs.close();
return 0;
}
c: 47 '/' char
c2: -1 'ÿ' char
before: {_Myoff=0 _Fpos=0 _Mystate=0 } std::fpos<int>
after: {_Myoff=0 _Fpos=-3 _Mystate=0 } std::fpos<int>
Adding | std::fstream::binary to the constructor seems to solve the problem. Perhaps it has to do with newlines in the file? If so, why does it affect code that doesn't even get close to reading a newline?
Updated with a seeking to the after position and getting another character.
It seems that saving via Notepad vs. Vim makes a difference. Saving via Notepad makes the stream work okay.
I have uploaded the file to google docs if you want to dl it:
https://docs.google.com/leaf?id=0B8Ufd7Rk6dvHZmYyZjgwYmItMTI3MC00MDljLWJjYTctMWMxYWM0ODk1MTE2&hl=en_US
Ok using your input file I see the same behavior you do. After some experimentation, it looks like the file was in Unix format, then had the ^M characters edited out (at least that's how I was able to reproduce it).
To fix it, I edited the file in Vim, executed ":set ff=dos", then added and deleted a character to dirty the file, then saved it.
The file position behaves as expected:
// unget.cpp
#include <fstream>
#include <iostream>
int main ()
{
char const * file = "test.txt";
std::fstream fs(file, std::fstream::in);
std::cout << fs.tellg() << std::endl; // 0
char c = fs.get();
std::cout << fs.tellg() << std::endl; // 1
fs.unget();
std::cout << fs.tellg() << std::endl; // 0
fs.close();
return 0;
}
Build and run:
$ clang++ unget.cpp
$ ./a.out
0
1
0
Otherwise, I don't understand where the problem is.
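The question notes that opening the stream in binary mode makes the problem go away. The sketch below checks that behaviour on a file with Unix line endings; the function name unget_restores_position and the file it writes are made up for this example.

```cpp
#include <fstream>
#include <string>

// Hypothetical check: write a small file with Unix line endings, then
// report whether get() followed by unget() leaves tellg() unchanged
// when the stream is opened in binary mode.
bool unget_restores_position(const std::string& path) {
    {
        std::ofstream out(path.c_str(), std::ios::out | std::ios::binary);
        out << "/* Comment 1 */\n/* Comment 2 */\n";
    }
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in) return false;
    const std::streampos before = in.tellg();
    in.get();
    in.unget();
    const std::streampos after = in.tellg();
    return before == after && after == std::streampos(0);
}
```

In binary mode no newline translation takes place, so the reported position is a plain byte offset into the file.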

Boost: How to locate the position of the iterator inside a huge text file?

I'm working on a program that uses boost::regex to match some patterns inside a huge text file (greater than 200 MB). The matches are working fine, but to build the output file I need to order the matches (just 2, but over the whole text) in the sequence they are found in the text.
Well, when in debug mode, during the cout procedure I can see inside the iterator it1 an m_base attribute that shows an address that is increased each step of the loop and I think this m_base address is the address of the matched pattern in the text, but I could not certify it and I could not find a way to access this attribute to store the address.
I don't know if there is any way to retrieve the address of each matched pattern in the text, but I really need to get this information.
#define FILENAME "File.txt"
int main() {
int length;
char * cMainBuf;
ifstream is;
is.open (FILENAME, ios::binary );
is.seekg(0, ios::end);
length = is.tellg();
is.seekg (0, ios::beg);
cMainBuf = new char[length+1];
memset(cMainBuf, '\0',length+1);
is.read(cMainBuf,length);
is.close();
string str=cMainBuf;
regex reg("^(\\d{1,3}\\s[A-F]{99})");
regex rReg(reg);
int const sub_matches[] = { 1 };
boost::sregex_token_iterator it1(str.begin() ,str.end() ,rReg ,sub_matches ), it2;
while(it1!=it2)
{
cout<<"#"<<sz++<<"- "<< *(it1++) << endl;
}
return 0;
}
@sln
Hi sln,
I'll answer your questions:
1. I removed all code that is not part of this issue, so some libraries remaining there;
2. Same as 1;
3. Because the file is not a simple text file in fact, it can have any symbol and it may affect the reading procedure, as I could realize in the past;
4. Zero buffer was necessary during the tests period, since I could not store more than 1MB in the buffer;
5. The iterator doesn't allow using char* to set the beginning and the end of the file, so it was necessary to change it to string;
6. The incoming RegEx will not be declared static, this is just a draft to show the problem and the anchor act to find the line start, not only the string start;
7. sub_matches was part of the test to see where the iterator was for regex with 2 or more groups inside it;
8. sz is just a counter;
9. There is no cast possible from const std::_String_const_iterator<_Elem,_Traits,_Alloc> to long.
In fact all the code works fine and I can identify any pattern inside the text, but what I really need to know is the memory address of each matched pattern (in this case, the address of the iterator for each iteration). I can see that m_base holds this address, but so far I have not been able to retrieve it.
I'll continue the analysis; if I find a solution to this problem I'll post it here.
Edit @Tchesko, I am deleting my original answer. I've loaded boost::regex and tried it out with regex_search(). It's not the it1 method you are using, but I think it comes down to just getting the results from the boost::smatch class, which is really boost::match_results.
It has member functions to get the position and length of the match and sub-matches. So it's really all you need to find the offset into your big string. The reason you can't get to m_base is that it is a private member variable.
Use the methods position() and length(). See the sample below, which I ran, debugged, and tested. I'm getting back up to speed with VS-2005 again. But Boost does seem a little quirky. If I am going to use it, I want it to do Unicode, and that means I have to compile ICU. The Boost binaries I'm using are the downloaded 1.44 ones. The latest is 1.46.1, so I might build it with VC++ 8 after I assess its viability with ICU.
Hey, let me know how it turns out. Good luck!
#include <boost/regex.hpp>
#include <locale>
#include <iostream>
#include <cstdio> // printf
using namespace std;
int main()
{
std::locale::global(std::locale("German"));
std::string s = " Boris Schäling ";
boost::regex expr("(\\w+)\\s*(\\w+)");
boost::smatch what;
if (boost::regex_search(s, what, expr))
{
// These are from boost::match_results() class ..
int Smpos0 = what.position();
int Smlen0 = what.length();
int Smpos1 = what.position(1);
int Smlen1 = what.length(1);
int Smpos2 = what.position(2);
int Smlen2 = what.length(2);
printf ("Match Results\n--------------\n");
printf ("match start/end = %d - %d, length = %d\n", Smpos0, Smpos0 + Smlen0, Smlen0);
std::cout << " '" << what[0] << "'\n" << std::endl;
printf ("group1 start/end = %d - %d, length = %d\n", Smpos1, Smpos1 + Smlen1, Smlen1);
std::cout << " '" << what[1] << "'\n" << std::endl;
printf ("group2 start/end = %d - %d, length = %d\n", Smpos2, Smpos2 + Smlen2, Smlen2);
std::cout << " '" << what[2] << "'\n" << std::endl;
/*
This is the hard way, still m_base is a private member variable.
Without m_base, you can't get the root address of the buffer.
long Match_start = (long)(what[0].first._Myptr);
long Match_end = (long)(what[0].second._Myptr);
long Grp1_start = (long)(what[1].first._Myptr);
long Grp1_end = (long)(what[1].second._Myptr);
*/
}
}
/* Output:
Match Results
--------------
match start/end = 2 - 17, length = 15
'Boris Schäling'
group1 start/end = 2 - 7, length = 5
'Boris'
group2 start/end = 9 - 17, length = 8
'Schäling'
*/
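For the original goal of ordering every match by where it occurs, the same position() and length() members can be read from each result of a regex iterator. The sketch below uses std::regex from the standard library as a stand-in so it builds without Boost; Boost.Regex provides a similarly named boost::sregex_iterator. The helper match_offsets is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: return (offset, length) of every match of `pattern`
// in `text`, in the order the matches occur.
std::vector<std::pair<std::size_t, std::size_t>>
match_offsets(const std::string& text, const std::string& pattern) {
    std::vector<std::pair<std::size_t, std::size_t>> result;
    const std::regex re(pattern);
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        // position() is the offset from text.begin(); length() the match size.
        result.emplace_back(static_cast<std::size_t>(it->position()),
                            static_cast<std::size_t>(it->length()));
    }
    return result;
}
```

Offsets collected this way for two different patterns can be merged and sorted to recover the order in which the matches appear in the text.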

C++ - string.compare issues when output to text file is different to console output?

I'm trying to find out if two strings I have are the same, for the purpose of unit testing. The first is a predefined string, hard-coded into the program. The second is a read in from a text file with an ifstream using std::getline(), and then taken as a substring. Both values are stored as C++ strings.
When I output both of the strings to the console using cout for testing, they both appear to be identical:
ThisIsATestStringOutputtedToAFile
ThisIsATestStringOutputtedToAFile
However, the string.compare returns stating they are not equal. When outputting to a text file, the two strings appear as follows:
ThisIsATestStringOutputtedToAFile
T^#h^#i^#s^#I^#s^#A^#T^#e^#s^#t^#S^#t^#r^#i^#n^#g^#O^#u^#t^#p^#u^#t^#
t^#e^#d^#T^#o^#A^#F^#i^#l^#e
I'm guessing this is some kind of encoding problem, and if I was in my native language (good old C#), I wouldn't have too many problems. As it is I'm with C/C++ and Vi, and frankly don't really know where to go from here! I've tried looking at maybe converting to/from ansi/unicode, and also removing the odd characters, but I'm not even sure if they really exist or not..
Thanks in advance for any suggestions.
EDIT
Apologies, this is my first time posting here. The code below is how I'm going through the process:
ifstream myInput;
ofstream myOutput;
myInput.open(fileLocation.c_str());
myOutput.open("test.txt");
TEST_ASSERT(myInput.is_open() == 1);
string compare1 = "ThisIsATestStringOutputtedToAFile";
string fileBuffer;
std::getline(myInput, fileBuffer);
string compare2 = fileBuffer.substr(400,100);
cout << compare1 + "\n";
cout << compare2 + "\n";
myOutput << compare1 + "\n";
myOutput << compare2 + "\n";
cin.get();
myInput.close();
myOutput.close();
TEST_ASSERT(compare1.compare(compare2) == 0);
How did you create the content of myInput? I would guess that this file is created in two-byte encoding. You can use hex-dump to verify this theory, or use a different editor to create this file.
The simplest way would be to launch cmd.exe and type
echo "ThisIsATestStringOutputtedToAFile" > test.txt
UPDATE:
If you cannot change the encoding of the myInput file, you can try to use wide-chars in your program. I.e. use wstring instead of string, wifstream instead of ifstream, wofstream, wcout, etc.
The following works for me and writes the text pasted below into the file. Note the '\0' character embedded into the string.
#include <iostream>
#include <fstream>
#include <sstream>
int main()
{
std::istringstream myInput("0123456789ThisIsATestStringOutputtedToAFile\x0 12ou 9 21 3r8f8 reohb jfbhv jshdbv coerbgf vibdfjchbv jdfhbv jdfhbvg jhbdfejh vbfjdsb vjdfvb jfvfdhjs jfhbsd jkefhsv gjhvbdfsjh jdsfhb vjhdfbs vjhdsfg kbhjsadlj bckslASB VBAK VKLFB VLHBFDSL VHBDFSLHVGFDJSHBVG LFS1BDV LH1BJDFLV HBDSH VBLDFSHB VGLDFKHB KAPBLKFBSV LFHBV YBlkjb dflkvb sfvbsljbv sldb fvlfs1hbd vljkh1ykcvb skdfbv nkldsbf vsgdb lkjhbsgd lkdcfb vlkbsdc xlkvbxkclbklxcbv");
std::ofstream myOutput("test.txt");
//std::ostringstream myOutput;
std::string str1 = "ThisIsATestStringOutputtedToAFile";
std::string fileBuffer;
std::getline(myInput, fileBuffer);
std::string str2 = fileBuffer.substr(10,100);
std::cout << str1 + "\n";
std::cout << str2 + "\n";
myOutput << str1 + "\n";
myOutput << str2 + "\n";
std::cout << str1.compare(str2) << '\n';
//std::cout << myOutput.str() << '\n';
return 0;
}
Output:
ThisIsATestStringOutputtedToAFile
ThisIsATestStringOutputtedToAFile
It turns out that the problem was that the file encoding of myInput was UTF-16, whereas the comparison string was UTF-8. Given the limitations I had for this project (Linux, C/C++ code), the way to convert between them was to use iconv. To keep compatibility with the C++ strings I'd been using, I ended up saving the string to a new text file, then running iconv through the system() command.
system("iconv -f UTF-16 -t UTF-8 subStr.txt -o convertedSubStr.txt");
Reading the outputted string back in then gave me the string in the format I needed for the comparison to work properly.
NOTE
I'm aware that this is not the most efficient way to do this. If I'd had the luxury of a Windows environment and the windows.h libraries, things would have been a lot easier. In this case, though, the code was in some rarely used unit tests and didn't need to be highly optimized, so the creation, destruction, and I/O of some text files wasn't an issue.
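For comparison, the conversion can also be done in-process without temporary files. This is only a sketch, not what the poster used: it assumes the input bytes are little-endian UTF-16, and the helper names append_utf8 and utf16le_to_utf8 are invented for the example. Malformed input is skipped rather than reported.

```cpp
#include <cstddef>
#include <string>

// Append one Unicode code point to `out` as UTF-8.
static void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Read the 16-bit little-endian code unit starting at byte index i.
static unsigned long unit_at(const std::string& bytes, std::size_t i) {
    return static_cast<unsigned char>(bytes[i]) |
           (static_cast<unsigned long>(
                static_cast<unsigned char>(bytes[i + 1])) << 8);
}

// Hypothetical helper: decode little-endian UTF-16 bytes into UTF-8.
// A leading byte-order mark is dropped; a trailing odd byte and
// unpaired surrogates are skipped.
std::string utf16le_to_utf8(const std::string& bytes) {
    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned long unit = unit_at(bytes, i);
        if (i == 0 && unit == 0xFEFF) continue;       // byte-order mark
        if (unit >= 0xD800 && unit <= 0xDBFF) {       // high surrogate
            if (i + 3 >= bytes.size()) break;         // truncated pair
            const unsigned long low = unit_at(bytes, i + 2);
            if (low < 0xDC00 || low > 0xDFFF) continue;  // unpaired: skip
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            i += 2;                                   // consume low surrogate
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            continue;                                 // stray low surrogate
        }
        append_utf8(out, unit);
    }
    return out;
}
```

The converted string can then be compared directly with the hard-coded UTF-8 string using compare().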