Regular Expression in Perl to check fit/match - regex

I'm in great trouble.
I must check if a string fits (matches) another string with RegEx.
For example, given the following string:
Apr 2 13:42:32 sandbox izxp[12000]: Received disconnect from 10.11.106.14: 10: disconnected by user
In the editable input field I give the program the following shortened string:
Received disconnect from 10.11.106.14: 10
If it fits the existing string (as you can see above), it is OK.
If any part of the new edited string doesn't fit the original string, I must warn the user with a message.
Could you help me solving this question with RegEx? Or another method?
I would appreciate it!

You must get the original string in a variable, let's call it $original (this is perl). Then you must get the input from the "editable input field", let's call it $input.
Then it is a simple
if ($original=~/$input/)
{
#Your code for a message to the user here
}

Your solution would be less regex and more escaping. Assuming you're going to use no regex patterns and just search for the input string literal, you should write your function so that it turns this
Received disconnect from 10.11.106.14: 10
into this
Received disconnect from 10\.11\.106\.14: 10
This can be achieved with many different libraries depending on which language you are using.
That will then allow you to check for a match.
Regular Expressions are more designed for common patterns in strings, rather than finding exact literals.

Related

Regular expression to extract either integer or string from JSON

I am working in an environment without a JSON parser, so I am using regular expressions to parse some JSON. The value I'm looking to isolate may be either a string or an integer.
For instance
Entry1
{"Product_ID":455233, "Product_Name":"Entry One"}
Entry2
{"Product_ID":"455233-5", "Product_Name":"Entry One"}
I have been attempting to create a single regex pattern to extract the Product_ID whether it is a string or an integer.
I can successfully extract both results with separate patterns using look around with either (?<=Product_ID":")(.*?)(?=") or (?<=Product_ID":)(.*?)(?=,)
however since I don't know which one I will need ahead of time I would like a one size fits all.
I have tried to use [^"] in the pattern however I just cant seem to piece it together
I expect to receive 455233-5 and 455233 but currently I receive "455233-5"
(?<="Product_ID"\s*:\s*"?)[^"]+(?="?\s*,)
, try it here.

Using Regex to validate the number of words in a text area

I am attempting to write a MVC model validation that verifies that there is 10 or more words in a string. The string is being populated correctly, so I did not include the HTML. I have done a fair bit of research, and it seems that something along the lines of what I have tries should work, but, for whatever reason, mine always seem to fail. Any ideas as to what I am doing wrong here?
(using System.ComponentModel.DataAnnotations, in a mvc 4 vb.net environment)
Have tried ([\w]+){10,}, ((\\S+)\s?){10,}, [\b]{20,}, [\w+\w?]{10,}, (\b(\w+?)\b){10,}, ([\w]+?\s){10}, ([\w]+?\s){9}[\w], ([\S]+\s){9}[\S], ([a-zA-Z0-9,.'":;$-]+\s+){10,} and several more varaiations on the same basic idea.
<Required(ErrorMessage:="The Description of Operations field is required"), RegularExpression("([\w]+){20,}", ErrorMessage:="ERROZ")>
Public Property DescOfOperations As String = String.Empty
Correct Solution was ([\S]+\s+){9}[\S\s]+
EDIT Moved accepted version to the top, removing unused versions. Unless I am wrong and the whole sequence needs to match, then something like (also accounting for double spaces):
([\S]+\s+){9}[\S\s]+
Or:
([\w]+?\s+){9}[\w]+
Give this a try:
([a-zA-Z0-9,.'":;$-]+\s){10,}

How to create a regex to check whether a set of words exists in a given string?

How can I write a regex to check if a set of words exist in a given string?
For example, I would like to check if a domain name contains "yahoo.com" at the end of it.
'answers.yahoo.com', would be valid.
'yahoo.com.answers', would be wrong. 'yahoo.com' must come in the end.
I got a hint from somewhere that it might be something like this.
"/^[^yahoo.com]$/"
But I am totally new to regex. So please help with this one, then I can learn further.
When asking regex questions, always specify the language or application, too!
From your history it looks like JavaScript / jQuery is most likely.
Anyway, to test that a string ends in "yahoo.com" use /.*yahoo\.com$/i
In JS code:
if (/.*yahoo\.com$/i.test (YOUR_STR) ) {
//-- It's good.
}
To test whether a set of words has at least one match, use:
/word_one|word_two|word_three/
To limit matches to just the most-common, legal sub-domains, ending with "yahoo.com", use:
/^(\w+\.)+yahoo\.com$/
(As a crude, first pass)
For other permutations, please clarify the question.

RegEx to parse or validate Base64 data

Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.
I have a Base64 decoder that can not fully rely on the input data to follow the RFC specs. So, the issues I face are issues like perhaps Base64 data that may not be broken up into 78 (I think it's 78, I'd have to double check the RFC, so don't ding me if the exact number is wrong) character lines, or that the lines may not end in CRLF; in that it may have only a CR, or LF, or maybe neither.
So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.
Content-Transfer-Encoding: base64
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
Ok, so parsing that is no problem, and is exactly the result we would expect. And in 99% of the cases, using any code to at least verify that each char in the buffer is a valid base64 char, works perfectly. But, the next example throws a wrench into the mix.
Content-Transfer-Encoding: base64
http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
This a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers desire to parse mime at all costs, versus ones that go strictly by the book, or rather RFC; if you will.
My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!
[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8
Anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?
UPDATE:
Do to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But to anyone using C#, and looking for a very quick way to at least detect whether a string, or byte[] contains valid Base64 data or not, I've found the following to work very well for me.
[^-A-Za-z0-9+/=]|=[^=]|={3,}$
And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above RegEx. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML Encoding, etc.), then it is highly recommend that you read RFC4648 that Gumbo mentioned in his answer as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.
From the RFC 4648:
Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.
So it depends on the purpose of usage of the encoded data if the data should be considered as dangerous.
But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
This one is good, but will match an empty String
This one does not match empty string :
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$
The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://www.rfc-editor.org/rfc/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.
Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$
The cited RFC considers the empty string as valid (see https://www.rfc-editor.org/rfc/rfc4648#section-10) therefore the above regex also does.
The equivalent regular expression for base64url (again, refer to the above RFC) is:
^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$
Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like
my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;
say decode_base64($sanitized_str);
might be what you want. It produces
This is simple ASCII Base64 for StackOverflow exmaple.
The best regexp which I could find up till now is in here
https://www.npmjs.com/package/base64-regex
which is in the current version looks like:
module.exports = function (opts) {
opts = opts || {};
var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';
return opts.exact ? new RegExp('(?:^' + regex + '$)') :
new RegExp('(?:^|\\s)' + regex, 'g');
};
Here's an alternative regular expression:
^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$
It satisfies the following conditions:
The string length must be a multiple of four - (?=^(.{4})*$)
The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
It can have up to two padding (=) characters on the end - ={0,2}
It accepts empty strings
To validate base64 image we can use this regex
/^data:image/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}
private validBase64Image(base64Image: string): boolean {
const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/;
return base64Image && regex.test(base64Image);
}
The shortest regex to check RFC-4648 compiliance enforcing canonical encoding (i.e. all pad bits set to 0):
^(?=(.{4})*$)[A-Za-z0-9+/]*([AQgw]==|[AEIMQUYcgkosw048]=)?$
Actually this is the mix of this and that answers.
I found a solution that works very well
^(?:([a-z0-9A-Z+\/]){4})*(?1)(?:(?1)==|(?1){2}=|(?1){3})$
It will match the following strings
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
YW55IGNhcm5hbCBwbGVhcw==
YW55IGNhcm5hbCBwbGVhc3U=
YW55IGNhcm5hbCBwbGVhc3Vy
while it won't match any of those invalid
YW5#IGNhcm5hbCBwbGVhcw==
YW55IGNhc=5hbCBwbGVhcw==
YW55%%%%IGNhcm5hbCBwbGVhc3V
YW55IGNhcm5hbCBwbGVhc3
YW55IGNhcm5hbCBwbGVhc
YW***55IGNhcm5hbCBwbGVh=
YW55IGNhcm5hbCBwbGVhc==
YW55IGNhcm5hbCBwbGVhc===
My simplified version of Base64 regex:
^[A-Za-z0-9+/]*={0,2}$
Simplification is that it doesn't check that its length is a multiple of 4. If you need that - use other answers. Mine is focusing on simplicity.
To test it: https://regex101.com/r/zdtGSH/1

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them all at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
Regex make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z_])+ .*?,?/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handy for me. I actually used it three or four times for similar field to method call conversions.
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or for displaying a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make a user has entered a correct email address (syntactically speaking) into a web form this is about the only way of checking it thoroughly.
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
The first thing I do with any editor is try to figure out it's Regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using a anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop record.
Now all that is needed to modify the next line is to repeat the macro.
I could live with out support for regex but could not live without anonymous keyboard macros.