Find specific hexadecimals in file

I am reading a file of hexadecimal data and looking for a specific sequence of hexadecimal bytes; once I find that sequence, I want to start storing the information that follows it.
The sequence I am trying to find is "AA 44 12 1C 1F 01". It is the identifier of the message I want; the file also contains other messages that I don't want. I would like to find this sequence, read the information that follows it (which I know is 74 more characters long), and then repeat the search-and-read process.
I know there is a built-in std::fstream::peek(), but can I use it in a way similar to std::string::find_first_of(" ")?
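peek() only looks at the single next character, so on its own it is not a search. One possible approach, sketched below, is to read the file into memory and use std::search. This is a minimal sketch under two assumptions: the file fits in memory, and "74 more characters" means 74 raw bytes after the identifier. The names kHeader, kPayloadLen, extract_messages and read_file are made up for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Identifier bytes taken from the question: AA 44 12 1C 1F 01.
const std::array<std::uint8_t, 6> kHeader = {0xAA, 0x44, 0x12, 0x1C, 0x1F, 0x01};

// Assumption: "74 more characters" means 74 raw bytes after the identifier.
const std::size_t kPayloadLen = 74;

// Returns the payload that follows every occurrence of kHeader in `data`.
// A trailing message with fewer than kPayloadLen bytes left is ignored.
std::vector<std::vector<std::uint8_t>> extract_messages(
        const std::vector<std::uint8_t>& data) {
    std::vector<std::vector<std::uint8_t>> messages;
    auto pos = data.begin();
    while (true) {
        pos = std::search(pos, data.end(), kHeader.begin(), kHeader.end());
        if (pos == data.end()) break;                 // no further identifier
        auto payload_begin = pos + kHeader.size();
        if (static_cast<std::size_t>(data.end() - payload_begin) < kPayloadLen) break;
        messages.emplace_back(payload_begin, payload_begin + kPayloadLen);
        pos = payload_begin + kPayloadLen;            // resume after this message
    }
    return messages;
}

// Hypothetical helper: reads a whole file in binary mode.
// Returns an empty vector if the file cannot be opened.
std::vector<std::uint8_t> read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<std::uint8_t>(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>());
}
```

A caller would do something like extract_messages(read_file("log.bin")), where the file name is only a placeholder. For files too large to load at once, the same search would have to run over overlapping chunks so that an identifier split across a chunk boundary is not missed.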

Related

What regex strings can distinguish files containing "PE null null L" from "PE null null d"

I need a quick and easy way to know how many dlls are 32-bit and how many are 64-bit in a given directory. I was about to write a PowerShell script when I thought of a much simpler solution. I've shown below that my idea can work but I need a little regex help to make it work properly.
It has been demonstrated that a dll file can be opened in Notepad to reveal its bitness (32 or 64) simply by checking the character after the first "PE". The letters "L" and "d" imply 32-bit and 64-bit respectively (reference). Notepad++ or a hex editor shows more accurately that there are actually 2 null characters between the "PE" and the other character, as shown in the image below, copied from Notepad++.
Unfortunately, some of my directories contain hundreds of dlls, so it's not practical to open them one at a time with Notepad or any other utility. There are, however, powerful "grep-capable" file search utilities that can search a directory for files containing a specified search string. Moreover, some of these can do regular expression (regex) searches. Since I know the unique strings that differentiate 32-bit and 64-bit dlls (shown above), such a file search utility should be able to quickly inventory the types of dlls in any directory. The best such file search utility, in my opinion, is grepWin, which can be downloaded and installed for free.
My first attempt was the regex search string ".PE("\x00")*" which can be broken down as follows.
The image below shows results of a search done using grepWin and the search string ".PE("\x00")*" for a specified directory that had 276 dll files in it. It shows that 276 of the 276 dlls found contained "PE" followed by multiple null characters. It also shows that actually thousands of matches were found. This is because the regex search continued after the first match and found many more matches in larger files that inevitably appear "randomly".
The table below shows search results from regex strings "PE.{2}L" and "PE.{2}d" proposed by O-O-O. These search strings find all the files but unfortunately some of the dll files are being counted twice because the sum of the 32-bit and 64-bit dlls exceeds the total number of dll files in the directory.
The screen shots below of the search results using "PE.{2}L" and "PE.{2}d" show that the matches exceed the number of files found meaning that the regex searches are going beyond the first match.
So I only need to know how to modify these regex search strings to stop searching 3 characters after the first "PE" is found. I know this can be done using the ".*?" modifiers but I haven't been able to get it to work. Here is my question.
• How can these search strings be modified to stop reading 3 characters after the first "PE" is found?
Any regex search strings can be verified by searching any directory of dlls with grepWin. To be correct, the search strings must produce an equal number of matches as files unlike the examples shown above. This will verify that the search stopped after the first match was found.

This can't be true:
The regex .PE("\x00")* would search for:
any character (why at all? To avoid a match right at the start of the file?)
the character P
the character E
the group of:
the character "
the character corresponding to the byte value 00
the character "
...repeated, as per the *, anywhere from zero to unlimited times (why not exactly 2?)
Wouldn't it be better to search for PE\x00\x00? Unless grepWin comes with its own flavor of regular expressions in which quotation marks inside groups have a special meaning, but I highly doubt that.
The regexes PE.{2}L and PE.{2}d are like phrases that nobody would use. Why not write PE..L straight away?
From a technical point of view
We can further restrict the regular expression so that it neither matches too many false positives nor ignores things we should also check (it helps to know what a Portable Executable's layout looks like):
Each executable starts with a DOS header, which is always 64 bytes long and almost always starts with MZ (in rare/historical cases also ZM or NE, but not for our case).
The NT header always starts with PE\0\0 (or in hexadecimal 50 45 00 00, or in regex PE\x00\x00), which is then followed by either \x64\x86 (for 64 bit) or \x4c\x01 (for 32 bit). This header can start much later, but we can safely assume to find it within the first 2048 bytes of the file (most likely after 240 bytes already).
Also 18 bytes later we have most likely the bytes \x0b\x01 or \x0b\x02 (or in rare cases \x07\x01).
The better regex
For x64 (64 bit) search for ^MZ.{62,2046}PE\x00\x00\x64\x86.{18}\x0b[\x01\x02] and
for x86 (32 bit) search for ^MZ.{62,2046}PE\x00\x00\x4c\x01.{18}\x0b[\x01\x02].
If your target software crashes (even though it advertises regex support, like grepWin), then
either omit matching the DOS header entirely (removing ^MZ.{62,2046})
or try reducing the repetition to a smaller one, e.g. {62,280}.
Explanation:
starting at the beginning of the file (actually only the start of a "line")
characters M and Z (Mark Zbikowski)
any character at least 62 times, but at most 2046 times (a text editor like Notepad++ might complain that our regex is too complex, which is why we also define a maximum)
characters P and E (Portable Executable)
bytes 00 00
the CPU architecture:
bytes \x64\x86 for 64 bit (AMD), or
bytes \x4c\x01 for 32 bit (Intel 386 or later).
Don't rely on the visible characters alone (d and L), because then you ignore half of the value and risk more false positives.
any character for exactly 18 times
byte 0b
either byte 01 or 02
Successfully tested
with Notepad++ 8.4.8 x64 (make sure to tick that . matches newline)
on C:\Windows\System32\quartz.dll
using Windows 7 x64 (so the DLL should be 64 bit):
The big advantage here is that this regex most likely matches only once instead of multiple times, especially in DLLs. However, since executables have no "end" mark, they can carry data in any format afterwards. Regardless of the intention (good = self-extracting archives, bad = viruses), there's hardly a way to exclude those; if we're lucky, our ^ helps us.
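Outside a grep-style tool, the same header facts can be checked directly instead of through a regex. The sketch below is one way to do that, not part of the tested regex above. It relies on one detail of the PE format that this answer does not mention: the 4-byte little-endian value at offset 0x3C of the DOS header (e_lfanew) gives the file offset of the NT header. The names Bitness and pe_bitness are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Bitness { Unknown, X86, X64 };

// Classifies a PE image held in memory by following the header layout.
// Returns Unknown for anything that does not look as expected.
Bitness pe_bitness(const std::vector<std::uint8_t>& file) {
    // DOS header: 64 bytes long, starts with "MZ".
    if (file.size() < 64 || file[0] != 'M' || file[1] != 'Z') return Bitness::Unknown;

    // e_lfanew at offset 0x3C: little-endian offset of the NT header
    // (taken from the PE format, not from the answer above).
    const std::uint32_t nt = static_cast<std::uint32_t>(file[0x3C]) |
                             (static_cast<std::uint32_t>(file[0x3D]) << 8) |
                             (static_cast<std::uint32_t>(file[0x3E]) << 16) |
                             (static_cast<std::uint32_t>(file[0x3F]) << 24);

    // We need the 4 signature bytes plus the 2 machine bytes.
    if (static_cast<std::uint64_t>(nt) + 6 > file.size()) return Bitness::Unknown;

    // NT header signature: PE\0\0.
    if (file[nt] != 'P' || file[nt + 1] != 'E' || file[nt + 2] != 0 || file[nt + 3] != 0)
        return Bitness::Unknown;

    // CPU architecture, as listed above: 64 86 for 64 bit, 4c 01 for 32 bit.
    if (file[nt + 4] == 0x64 && file[nt + 5] == 0x86) return Bitness::X64;
    if (file[nt + 4] == 0x4C && file[nt + 5] == 0x01) return Bitness::X86;
    return Bitness::Unknown;
}
```

Because this reads one fixed location rather than scanning, it cannot match twice in the same file, which is the problem the question ran into.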

keyboard scan codes in c linux and windows

Okay, so I have a program I am writing, and basically I am going to be taking input from keyboard keys such as left arrow, right arrow, up, down, etc. My question is: what is the best way to scan these keys so that my program runs on both Linux and Windows?
And what am I scanning, exactly? Am I supposed to scan the ASCII values and store them in ints? chars? Or is there another way to do this? I have searched the internet and found that the hex values for the keyboard scan codes are e0 4b, e0 4d, e0 48, e0 50.
But when I actually scan the values using getchar() and store them in ints, I get 4 values for each key pressed, for example 27 91 67 10 and 27 91 68 10.
I understand that each key has press, release, and other values attached to it, so should I be scanning for the 67, 68, etc. range?
Or is there another way to do this?
I am writing the program in C.

In Linux, it seems like you're seeing ANSI escape sequences. They are used by text terminals, and start with the Escape character, which is '\x1b' (decimal 27).
This is probably not what you want: if you want to make something keyboard-controllable in a direct, game-like manner, you need to use "raw" input. There are plenty of references for that; look at ncurses, for instance.
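To make the numbers from the question concrete: 27 91 67 is ESC, '[', 'C', and the trailing 10 is the newline from pressing Enter. The sketch below decodes such a sequence, assuming the terminal sends the common xterm-style sequences ESC [ A/B/C/D for the arrow keys; other terminals or modes may send something different. It is written in C++ (the question itself uses C), and Key and decode_arrow are names made up for illustration.

```cpp
#include <string>

enum class Key { Up, Down, Right, Left, Other };

// Maps a sequence starting with ESC '[' and a final letter to an arrow key.
// Anything after the third byte (such as a trailing newline) is ignored.
Key decode_arrow(const std::string& seq) {
    if (seq.size() < 3 || seq[0] != '\x1b' || seq[1] != '[') return Key::Other;
    switch (seq[2]) {
        case 'A': return Key::Up;
        case 'B': return Key::Down;
        case 'C': return Key::Right;  // decimal 67, as seen in the question
        case 'D': return Key::Left;   // decimal 68, as seen in the question
        default:  return Key::Other;
    }
}
```

This only interprets bytes that have already been read; it does not put the terminal into raw mode, which is what a library such as ncurses takes care of.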

Open a terminal and use the command xev. You can then press any key you want and see its corresponding codes. You can also move and click the mouse to see what happens there.

decipher a file format called .EWB

I have a file which I know contains a bunch of compressed files with some kind of header.
Can anyone tell me how to unpack it?
The file format is .EWB, which stands for EasyWorshipBible.
I know it's possible, as I've seen it done, but they didn't tell me how.
I tried using hex editors and WinRAR, but none of them seem to get the files out correctly.

In an example I found, each entry begins with hex 51 4b 03 04, followed by six more bytes of information, followed by a zlib stream. When the zlib streams are decompressed, they have the format "1:1 text line ...", blank line, "1:2 text line ...", etc. However the text does not seem to match the extraction I found along with the example, so I suspect that the text is encoded or encrypted somehow.
That should be enough to get you started.

MPEG ADTS format identification

I need to detect whether a file is an MPEG ADTS file. I've searched around, but whether I'm searching badly or for some other reason, I can't find a signature that would let me say for sure that a given file is in MPEG ADTS format.
E.g. we can say for sure that a file is MP4 if it begins with a signature such as 00 00 00 nn 66 74 79 70 6D 70 34.
How can it be done with MPEG ADTS?
Thanks in advance for any help!

The ADTS header is typically used in standalone AAC and MPEG-TS files (the streaming scenario).
ADIF is used mainly in MP4 files.
An ADTS file header starts with a 12-bit "syncword", which is always all ones (111111111111).
The next 1 bit is the ID.
The next 2 bits are always 0.
http://developer.longtailvideo.com/trac/browser/providers/adaptive/doc/adts.pdf?rev=1460 (provides the full header)
So your algorithm to detect it would be:
search for the 12-bit syncword
validate that the following fields contain valid values
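The first checks from the steps above could be sketched as follows. This is only a heuristic sketch covering the fields named here (syncword, ID, the two zero bits); a more robust detector would validate the remaining header fields from the linked document and confirm that several consecutive frames line up. The function name looks_like_adts is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if the bytes at `p` could be the start of an ADTS header:
// 12 syncword bits all set, 1 ID bit (either value), then 2 bits that are 0.
bool looks_like_adts(const std::uint8_t* p, std::size_t len) {
    if (len < 2) return false;
    // Syncword: all 8 bits of byte 0 plus the top 4 bits of byte 1.
    const bool sync = p[0] == 0xFF && (p[1] & 0xF0) == 0xF0;
    // Bit 3 of byte 1 is the ID; bits 2..1 are the two bits that must be 0.
    const bool zero_bits = (p[1] & 0x06) == 0x00;
    return sync && zero_bits;
}
```

To scan a whole file, this check would be applied at every byte offset until it succeeds, since the stream does not have to begin exactly at a frame boundary.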

C++ - Detect whether a file is PNG or JPEG

Is there any fast way to determine if some arbitrary image file is a png file or a jpeg file or none of them?
I'm pretty sure there is some way and these files probably have some sort of their own signatures and there should be some way to distinguish them.
If possible, could you also provide the names of the corresponding routines in libpng / libjpeg / boost::gil::io.

Look at the magic number at the beginning of the file. From the Wikipedia page:
JPEG image files begin with FF D8 and end with FF D9. JPEG/JFIF files contain the ASCII code for "JFIF" (4A 46 49 46) as a null terminated string. JPEG/Exif files contain the ASCII code for "Exif" (45 78 69 66) also as a null terminated string, followed by more metadata about the file.
PNG image files begin with an 8-byte signature which identifies the file as a PNG file and allows detection of common file transfer problems: \211 P N G \r \n \032 \n
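A minimal sketch of that check, comparing the first bytes of a buffer against the two signatures quoted above (\211 and \032 are octal for 0x89 and 0x1A). It only looks at the start of the data, so it says nothing about whether the rest of the file is valid. ImageType and sniff_image are names made up for illustration and are not libpng, libjpeg or Boost.GIL routines.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class ImageType { Png, Jpeg, Unknown };

// Guesses the image type from the leading bytes of a file.
ImageType sniff_image(const std::uint8_t* data, std::size_t len) {
    // 8-byte PNG signature: \211 P N G \r \n \032 \n
    static const std::uint8_t kPng[8] = {0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
    if (len >= 8 && std::memcmp(data, kPng, 8) == 0) return ImageType::Png;

    // JPEG files begin with FF D8.
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xD8) return ImageType::Jpeg;

    return ImageType::Unknown;
}
```

Reading the first 8 bytes of the file into a buffer and passing them to sniff_image is enough for both formats.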

Apart from Tim Yates' suggestion of reading the magic number "by hand", the Boost GIL documentation says:
png_read_image throws std::ios_base::failure if the file is not a valid PNG file.
jpeg_read_image throws std::ios_base::failure if the file is not a valid JPEG file.
Similarly for other Boost GIL routines. If you only want the type, you might want to try reading only the dimensions, rather than loading the entire file.

The question is essentially answered by the above replies, but I thought I'd add the following: If you ever need to determine file types beyond just "JPEG, PNG, other", there's always libmagic. This is what powers the Unix utility file, which is pretty magical indeed, on many of the modern operating systems.

Image file types like PNG and JPG have well-defined file formats that include signatures identifying them. All you have to do is read enough of the file to read that signature.
The signatures you need are well documented:
http://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header
http://en.wikipedia.org/wiki/JPEG#Syntax_and_structure
