How do I detect plaintext in a MIME file? - c++

I have a large set of MIME files, which contain multiple parts. Many of the files contain parts labelled with the following headers:
Content-Type: application/octet-stream
Content-Transfer-Encoding: Binary
However, sometimes the contents of these parts are some form of binary code, and sometimes they are plaintext.
Is there a clever way in either C++, Bash or Ruby to detect whether the contents of a MIME part labelled as application/octet-stream are binary data or plaintext?

The -I option of grep will treat binary files as files without a match. Combined with the -q option grep will return a nonzero exit status if a file is binary.
if grep -qI -e '' <file>
then
# plaintext
else
# binary
fi

The simplest method is to split the MIME file into multiple files, each containing one of the component parts. We can then run grep and similar tools on each part to ascertain whether it is text; see the sketch below.
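A minimal Bash sketch, assuming the parts have already been extracted into a directory (the parts/ path and the extraction step are assumptions, not something given in the question):
#!/usr/bin/env bash
# Classify each extracted MIME part as plaintext or binary.
# The parts/ directory is hypothetical; populate it with your own splitter.
# Note: an empty part will be reported as binary by this test.
for part in parts/*; do
    if grep -qI -e '' "$part"; then
        echo "$part: plaintext"
    else
        echo "$part: binary"
    fi
done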

Related

convert all BMP files recursively to JPG handling paths with spaces and getting the file extension right under Linux

I have files with beautiful, glob-friendly pathnames such as:
/New XXXX_Condition-selected FINAL/677193 2018-06-08 Xxxx Event-Exchange_FINAL/Xxxxx Dome Yyyy Side/Xxxx_General016 #07-08.BMP
(the Xxx...Yyyy strings are for privacy reasons). Of course the format is not fixed: the depth of the folder hierarchy can vary, but spaces, letters and symbols such as _, - and # can all appear, either as part of the path or part of the filename, or both.
My goal is to recurse all subfolders, find the .BMP files and convert them to JPG files, without having "double" extensions such as .BMP.JPG: in other words, the above filename must become
/New XXXX_Condition-selected FINAL/677193 2018-06-08 Xxxx Event-Exchange_FINAL/Xxxxx Dome Yyyy Side/Xxxx_General016 #07-08.JPG
I can use either bash shell tools or Python. Can you help me?
PS I have no need for the original files, so they can be overwritten. Of course a solution which doesn't overwrite them is also fine - I'll just follow up with a find . -name "*.BMP" -type f -delete command.
Would you please try:
find . -type f -iname "*.BMP" -exec mogrify -format JPG '{}' +
The mogrify command is part of the ImageMagick suite, and mogrify -format JPG file.BMP is equivalent to convert file.BMP file.JPG.
You can add the same options that are accepted by convert, such as -quality.
The benefit of mogrify is that it can perform the same conversion on multiple files at once without specifying the output (converted) filenames.
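For example, a quality setting can be forwarded unchanged (the value 90 here is just an illustrative choice):
# Same recursive conversion, with an explicit JPEG quality option passed to mogrify.
find . -type f -iname "*.BMP" -exec mogrify -format JPG -quality 90 '{}' +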
If the command issues a warning such as mogrify-im6.q16: length and filesize do not match, it means the image size stored in the BMP header does not match the actual size of the image data block.
If JPG files are properly produced, you may ignore the warnings. Otherwise you will need to repair the BMP files which cause the warnings.
If the input files and the output files have the same extension (for example, a JPG-to-JPG conversion with a resize), the original files are overwritten. If they have different extensions, as in this case, the original BMP files are not removed; you can remove them with find as well.
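For instance, once the JPG output has been verified, the leftover originals could be removed along the lines of the find command already mentioned in the question (case-insensitive here, so .bmp files are caught too):
# Delete the original bitmap files; -iname matches .BMP and .bmp alike.
find . -type f -iname "*.BMP" -delete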

replace part of file name with wrong encoding

Need some guidance on how to solve this one. I have tens of thousands of files in multiple subfolders where the encoding got screwed up. Via the ls command I see a filename displayed like this: 'F'$'\366''ljesedel.pdf', which includes the ' at the beginning and end. That's just one example where the Swedish characters åäö got mangled; in this case the name should have been 'Följesedel.pdf'. If I run
#>find .
Then I see a list of files like this:
./F?ljesedel.pdf
Not the same encoding. How on earth do I solve this one? The most obvious attempts:
myvar='$'\366''
char="ö"
find . -name *$myvar* -exec rename 's/$myvar/ö' {} \;
and other possible approaches fail, since
find . -name cannot find it because of the ? shown in place of the "real" character '$'\366''.
Any suggestions or guidance would be very much appreciated.
The first question is what encoding your terminal expects. Make sure that is UTF-8.
Then you need to find out what bytes the actual filename contains, not just what something might display it as. You can do this with a Perl one-liner like the following, run in the directory containing the file:
perl -E'opendir my $dh, "."; printf "%s: %vX\n", $_, $_ for grep { m/jesedel\.pdf/ } readdir $dh'
This will output the filename interpreted as UTF-8 bytes (if you've set your terminal to that) followed by the hex bytes it actually contains.
Using that you can determine what your search pattern should be. Your replacement must be the UTF-8 encoded representation of ö, which it will be by default as part of the command arguments if your terminal is set to that.
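For example, once the byte is confirmed to be 0xF6 (Latin-1 ö), a rename along these lines might work. This is only a sketch, assuming the Perl-based rename utility and a UTF-8 terminal:
# $'\366' is the raw Latin-1 byte for ö in the glob; \xF6 matches that same byte in the
# rename expression, and the replacement ö is typed as UTF-8 at the terminal.
find . -type f -name '*'$'\366''*' -exec rename 's/\xF6/ö/g' {} \;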
I'm not an expert, but the problem might not be with the file name itself (which seems to hold the correct Unicode name) but rather with the way ls (and many other utilities) display the name on the terminal.
I was able to show the correct name by setting the terminal character encoding to Unicode. Also I've noticed the GUI programs (file manager, etc), were able to show the correct file name.
Gnome Terminal: Terminal → Set Character Encoding → Unicode (UTF-8)
It is still a challenge to select those files with many utilities (e.g. regexes, wildcards). In a few cases you will have to select those characters using a '*' pattern. If this is a major issue, consider using ASCII only - maybe use 'o' instead of 'ö'. Not sure if this is acceptable.
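For a single file, selecting it with a wildcard around the problematic character is enough to rename it by hand (the target name is the one from the question):
# The glob sidesteps how the terminal renders the mangled byte; quotes protect the UTF-8 target.
mv ./F*ljesedel.pdf 'Följesedel.pdf'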

wrong text encoding on linux

I downloaded a source code .rar file from the internet to my Linux server. Then I extracted all source files into a local directory. When I use the "cat" command to see the content of each file, the text is shown with the wrong encoding on my terminal (there are some Chinese characters in the source files).
I use
file -bi testapi.cpp
which shows:
text/plain; charset=iso-8859-1
I tried to convert that file to UTF-8 encoding with the following command:
iconv -f ISO88591 -t UTF8 testapi.cpp > new.cpp
But it doesn't work.
I set my .vimrc file with following two lines:
set encoding=utf-8
set fileencoding=utf-8
After this, when I open testapi.cpp in vim, the Chinese characters are displayed normally. But cat testapi.cpp still doesn't work.
When I compile and run the program, the printf statements with Chinese characters print wrong characters like ????
What should I do to display the correct Chinese characters when I run the program?
TLDR Quickest Solution: Copy/Paste the Visible Text to a Brand-New, Confirmed UTF-8 File
Your file is marked as latin1, but the data is stored as utf8.
When you use set enc=utf8 or set fileencoding=utf-8 in Vim, you're not changing the data or converting it. You're looking at the exact same data, but interpreting it as if it were the utf8 charset. So, good news: your data is good. No conversion or changing necessary.
You just need to put the same exact data into a file already marked as UTF-8 encoded. That can be done easily by making a brand new file in vim, using set enc=utf8, and then copy-pasting your old data into the new file. You can test this out by making a test file containing only the text "汉语" ("Chinese language"), setting enc, saving, closing, reopening, and seeing that the text didn't get corrupted. You can also test with file -bi $pathtofile, though that is not super reliable.
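As a quick shell check (the new file's path is a placeholder), the detected charset can be compared before and after:
# file(1) guesses the charset from the content, so treat the result as a heuristic.
file -bi testapi.cpp          # reported as text/plain; charset=iso-8859-1 in the question
file -bi /path/to/newfile.cpp # should now report charset=utf-8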
Anyway, TLDR: Make a brand new UTF-8 file, confirm that it's utf-8, make your data visible, and then copy/paste and/or transfer it to the new UTF-8 file, without doing any conversion.
Also, theoretically, I considered that iconv -f utf8 -t utf8 would work, since all I wanted to do was make utf-8-encoded data be marked as utf-8-encoded, without changing it. But this gave me an error that indicated it was still trying to do a data conversion.

Grep returning regex results in recursive search

I've constructed a grep command that I use to search recursively through a directory of files for a pattern within them. The problem is that grep only returns the names of the files the pattern occurs in, not the exact match of the pattern. How do I get the actual match back?
Example:
File somefile.bin contains somestring0987654321�123�45� in a directory with one million other files
Command:
$ grep -EsniR -A 1 -B 1 '([a-zA-Z0-9]+)\x00([0-9]+)\x00([0-9]+)\x00' *
Current result:
Binary file somefile.bin matches
The desired result (or close to it):
Binary file somefile.bin matches
<line above match>
somestring0987654321�123�45�
<line below match>
You can try the -a option:
File and Directory Selection
-a, --text
Process a binary file as if it were text; this is equivalent to
the --binary-files=text option.
--binary-files=TYPE
If the first few bytes of a file indicate that the file contains
binary data, assume that the file is of type TYPE. By default,
TYPE is binary, and grep normally outputs either a one-line
message saying that a binary file matches, or no message if
there is no match. If TYPE is without-match, grep assumes that
a binary file does not match; this is equivalent to the -I
option. If TYPE is text, grep processes a binary file as if it
were text; this is equivalent to the -a option. Warning: grep
--binary-files=text might output binary garbage, which can have
nasty side effects if the output is a terminal and if the
terminal driver interprets some of it as commands.
But the problem is that in binary files there are no lines, so I'm not sure what you'd want the output to look like. You'll see random garbage, maybe the whole file, and some special characters that mess with your terminal may be printed.
If you want to restrict the output to the match itself, consider the -o option:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
The context control is limited to adding a certain number of lines before or after the match, which will probably not work well here. So if you want a context of a certain number of bytes, you'll have to change the pattern itself.
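For example, adding -a and -o to the command from the question (assuming the pattern already matches as intended there) restricts the output to the matched bytes themselves, without the line-based context options:
# -a: process the binary file as text; -o: print only the matching part of each "line".
grep -EsnaioR '([a-zA-Z0-9]+)\x00([0-9]+)\x00([0-9]+)\x00' *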
Try...
grep -rnw "<regex>" <folder>
Much easier. More examples here --> https://computingbro.com/2020/05/10/word-search-in-linux-unix-filesystem/

Regex compare strings from multiple files

I have multiple XML files that contain various strings. I also have a text file of strings, some of which are contained within the XML files.
XML:
text="$$sRegister $$s is stuck at One. (VDB-5014)" uid="5014"/>
String File:
is stuck at one
I would like to print the strings that appear in both my string file and my XML files. This way I can set the correct message type in the XML file. Given the high volume of messages, I've been attempting to automate this process. Thoughts?
You can use grep -f:
grep -f stringFile xmlFile
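Note that the example string "is stuck at one" differs in case from the XML text ("stuck at One"), so a case-insensitive, fixed-string variant may be closer to what's wanted; a sketch using the same filenames:
# -i: ignore case; -F: treat each line of stringFile as a literal string rather than a regex.
grep -iFf stringFile xmlFile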