I want to search for all lines that match this regex
^([0-9IVX]\.)*.*\R
and report with the page number they are at. The output would be something like:
1. Heading/page number
1.1 Subheading/page number
1.1.1. Subsubheading/page number
Is this possible to do in PDF? I suppose that would require Ghostscript, but searching the How to Use Ghostscript page for regex I find nothing.
I can't think why you would expect Ghostscript to do search for you.
I'm not sure if you are hoping to get the data type 'heading, page number' etc from the PDF file, or if you are going to work that out yourself based on the data you find.
If it's the former then the first problem is that, in general, PDF files don't have the kind of structure information you are looking for. There is nothing in most PDF files which says 'this is a heading', 'this is a page number' etc.
There are such things as 'tagged PDF' which adds non-printing elements to a PDF file which do carry that kind of data around with them. This is an entirely optional feature, the vast majority of PDF files don't contain it, and Ghostscript completely ignores it.
Since most PDF files don't have that information, you can't rely on it, unless you are in the happy position of knowing where your PDF files are being generated and that they contain this kind of information. In which case there are numerous tools around which will extract it for you, or enable you to write code to do so.
The problem with just searching for the text is that firstly the text need not be written as a contiguous stream. So if you are looking for '1.1' that might be written as:
(1.1) Tj
(1) Tj
(.) Tj
(1) Tj
[(1) -0.1 (.) 0.1 (1)] TJ
or any combination of those. The individual character codes need not even appear in order or in the same content stream.
Secondly the character code in a PDF content stream need not be (and often is not) a Unicode code point. Or ASCII, or any other standard coding scheme, it can be totally arbitrary.
Some PDF files carry a ToUnicode CMap around which maps the character codes to Unicode code points, but not all do. Some fonts may use a standard (that's PDF standard) Encoding, in which case it's possible to infer the Unicode code points. Some Encodings may contain glyph names, from which it's again possible to infer Unicode code points.
In the end though, some PDF files are simply impossible to extract text from without using OCR.
Your best bet is probably to write code to extract text, and Ghostscript will do that. It even goes through the heirarchy of fallbacks listed above to try and find a Unicode code point. If all else fails it just uses the character code and hopes that's good enough.
If you use Ghostscript's txtwrite device it will produce either a faked up text page (the default) which attempts, as far as possible, to mimic the text layout in the original PDF file, including merging bits of text that aren't contiguous in the PDF file but are next to each other on the page. Or an 'XML-like' output which will tell you which Unicode code points, or character codes, were encountered and what their position is on the original page. If you don't like txtwrite's attempts to figure out which text goes with what, then you can use this to write your own.
I suspect the text page is probably good enough for your purposes. You can have the txtwrite device produce one file per page, so you can get the page number from the filename. Then you can write your own regex expression(s) to search the files and find your matches.
I'm trying to do something which seems like it should be very simple. I'm trying to see if a specific string e.g. 'out of stock' is found within a page's source code. However, I don't care if the string is contained within an html comment or javascript. So prior to doing my search, I'd like to remove both of these elements using regular expressions. This is the code I'm using.
urls.each do |url|
response = HTTP.get(url)
if response.status.success?
source_code = response.to_s
# Remove comments
source_code = source_code.gsub(/<!--(.*?)-->/su, '')
# Remove scripts
source_code = source_code.gsub(/<script(.*?)<\/script>/msu, '')
if source_code.match(/out of stock/i)
# Flag URL for further processing
end
end
end
end
This works for 99% of all the urls I tried it with, but certain urls have become problematic. When I try to use these regular expressions on the source code returned for the url "https://www.sunski.com" I get the following error message:
Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string))
The page is definitely UTF-8 encoded, so I don't really understand the error message. A few people on stack overflow recommended using the # encoding: UTF-8 comment at the top of the file, but this didn't work.
If anyone could help with this it would be hugely appreciated. Thank you!
The Net::HTTP standard library only returns binary (ASCII-8BIT) strings. See the long-standing feature request: Feature #2567: Net::HTTP does not handle encoding correctly. So if you want UTF-8 strings you have to manually set their encoding to UTF-8 with String#force_encoding:
source_code.force_encoding(Encoding::UTF_8)
If the website's character encoding isn't UTF-8 you have to implement a heuristic based on the Content-Type header or <meta>'s charset attribute but even then it might not be the correct encoding. You can validate a string's encoding with String#valid_encoding? if you need to deal with such cases. Thankfully most websites use UTF-8 nowadays.
Also as #WiktorStribiżew already wrote in the comments, the regexp encoding specifiers s (Windows-31J) and u (UTF-8) modifiers aren't necessary here and only very rarely are. Especially the latter one since modern Ruby defaults to UTF-8 (or, if sufficient, its subset US-ASCII) anyway. In other programming languages they may have a different meaning, e.g. in Perl s means single line.
I am having trouble converting '\xc3\xd8\xe8\xa7\xc3\xb4\xd' (which is a Thai text) to a readable format. I get this value from a smart card, and it basically was working for Windows but not in Linux.
If I print in my Python console, I get:
����ô
I tried to follow some google hints but I am unable to accomplish my goal.
Any suggestion is appreciated.
Your text does not seem to be a Unicode text. Instead, it looks like it is in one of Thai encodings. Hence, you must know the encoding before printing the text.
For example, if we assume your data is encoded in TIS-620 (and the last character is \xd2 instead of \xd) then it will be "รุ่งรดา".
To work with the non-Unicode strings in Python, you may try: myString.decode("tis-620") or even sys.setdefaultencoding("tis-620")
I'm trying to find some very specific multibyte characters in PostgreSQL using Regex. I know I have the option to make a long CASE WHEN but i decided to check if there is a different way to finding these.
My current Regex looks like this E'\xf0\x9f\x98\x83'
This works pretty well, except that I would need to find all from \xf0\x9f\x98\x80 to \xf0\x9f\x98\x99.
In JS I would just be able to write something like \xf0\x9f\x98[\x80-\x89] but for whatever reason this returns an error in PGSQL. Is there a shortcut like this, or am I doomed to writing 20 CASE WHEN-s?
I have realized my mistake. PGSQL Error was caused because I'm looking for 4 byte characters and I just wanted to mess with the last byte. I realized I'd have to write it like this: E'[\xf0\x9f\x98\x80-\xf0\x9f\x98\x90]'
Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.
I have a Base64 decoder that can not fully rely on the input data to follow the RFC specs. So, the issues I face are issues like perhaps Base64 data that may not be broken up into 78 (I think it's 78, I'd have to double check the RFC, so don't ding me if the exact number is wrong) character lines, or that the lines may not end in CRLF; in that it may have only a CR, or LF, or maybe neither.
So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.
Content-Transfer-Encoding: base64
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
Ok, so parsing that is no problem, and is exactly the result we would expect. And in 99% of the cases, using any code to at least verify that each char in the buffer is a valid base64 char, works perfectly. But, the next example throws a wrench into the mix.
Content-Transfer-Encoding: base64
http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
This a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers desire to parse mime at all costs, versus ones that go strictly by the book, or rather RFC; if you will.
My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!
[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8
Anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?
UPDATE:
Do to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But to anyone using C#, and looking for a very quick way to at least detect whether a string, or byte[] contains valid Base64 data or not, I've found the following to work very well for me.
[^-A-Za-z0-9+/=]|=[^=]|={3,}$
And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above RegEx. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML Encoding, etc.), then it is highly recommend that you read RFC4648 that Gumbo mentioned in his answer as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.
From the RFC 4648:
Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.
So it depends on the purpose of usage of the encoded data if the data should be considered as dangerous.
But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
This one is good, but will match an empty String
This one does not match empty string :
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$
The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://www.rfc-editor.org/rfc/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.
Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$
The cited RFC considers the empty string as valid (see https://www.rfc-editor.org/rfc/rfc4648#section-10) therefore the above regex also does.
The equivalent regular expression for base64url (again, refer to the above RFC) is:
^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$
Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like
my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;
say decode_base64($sanitized_str);
might be what you want. It produces
This is simple ASCII Base64 for StackOverflow exmaple.
The best regexp which I could find up till now is in here
https://www.npmjs.com/package/base64-regex
which is in the current version looks like:
module.exports = function (opts) {
opts = opts || {};
var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';
return opts.exact ? new RegExp('(?:^' + regex + '$)') :
new RegExp('(?:^|\\s)' + regex, 'g');
};
Here's an alternative regular expression:
^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$
It satisfies the following conditions:
The string length must be a multiple of four - (?=^(.{4})*$)
The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
It can have up to two padding (=) characters on the end - ={0,2}
It accepts empty strings
To validate base64 image we can use this regex
/^data:image/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}
private validBase64Image(base64Image: string): boolean {
const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/;
return base64Image && regex.test(base64Image);
}
The shortest regex to check RFC-4648 compiliance enforcing canonical encoding (i.e. all pad bits set to 0):
^(?=(.{4})*$)[A-Za-z0-9+/]*([AQgw]==|[AEIMQUYcgkosw048]=)?$
Actually this is the mix of this and that answers.
I found a solution that works very well
^(?:([a-z0-9A-Z+\/]){4})*(?1)(?:(?1)==|(?1){2}=|(?1){3})$
It will match the following strings
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
YW55IGNhcm5hbCBwbGVhcw==
YW55IGNhcm5hbCBwbGVhc3U=
YW55IGNhcm5hbCBwbGVhc3Vy
while it won't match any of those invalid
YW5#IGNhcm5hbCBwbGVhcw==
YW55IGNhc=5hbCBwbGVhcw==
YW55%%%%IGNhcm5hbCBwbGVhc3V
YW55IGNhcm5hbCBwbGVhc3
YW55IGNhcm5hbCBwbGVhc
YW***55IGNhcm5hbCBwbGVh=
YW55IGNhcm5hbCBwbGVhc==
YW55IGNhcm5hbCBwbGVhc===
My simplified version of Base64 regex:
^[A-Za-z0-9+/]*={0,2}$
Simplification is that it doesn't check that its length is a multiple of 4. If you need that - use other answers. Mine is focusing on simplicity.
To test it: https://regex101.com/r/zdtGSH/1