How to split string into chunks using regular expressions while keeping URI coded special characters together - regex

Let's assume you have a string that you want to split into chunks having a maximum size of x characters. If you ignore new lines, a suitable regular expression would be .{1,x}
The problem I have is that I want to keep URI coded special characters like %20 together.
Example:
Hello%20world%20how%20are%20you%20today
Doing a "dumb" chunking with 5 character chunks, you end up with:
Hello
%20wo
rld%2
0how%
20are
%20yo
u%20t
oday
What I want to achieve is this:
Hello
%20wo
rld
%20ho
w%20a
re%20
you
%20to
day
Is this even possible with only regular expressions? I currently have a working solution with a loop that goes through each character and fills a bucket. If the bucket is full, it adds its content to an array of chunks and empties it. However, it also checks if the current character is a % and if the bucket would be able to hold 3 more characters (% plus the two hex digits). If it can, OK, otherwise it would push the content of the bucket in the chunks array and start with a fresh bucket.
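For reference, the bucket loop described above might look roughly like this. It is only a sketch in Python: the function name `chunk_encoded` and the `size` parameter are made up for illustration, and it assumes the chunk size is at least 3 so that a whole %XX sequence always fits into an empty bucket.

```python
def chunk_encoded(text, size=5):
    """Split text into chunks of at most `size` characters,
    never cutting through a %XX sequence (assumes size >= 3)."""
    chunks = []
    bucket = ""
    i = 0
    while i < len(text):
        # Treat '%' plus the next two characters as one indivisible token.
        token = text[i:i + 3] if text[i] == "%" else text[i]
        # If the token does not fit any more, flush the bucket first.
        if len(bucket) + len(token) > size:
            chunks.append(bucket)
            bucket = ""
        bucket += token
        i += len(token)
    if bucket:
        chunks.append(bucket)
    return chunks


result = chunk_encoded("Hello%20world%20how%20are%20you%20today")
print(result)
```

On the example string this produces the nine chunks listed in the desired output above.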

Keep it simple and stay with your working loop; it's probably faster and ten times more readable. http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

Try this regular expression to match all parts:
/(%[0-9A-F]{2}[^%]?[^%]?|[^%]%[0-9A-F]{2}[^%]?|[^%][^%]%[0-9A-F]{2}|[^%]{1,5})/
This basically lists all possible options to get at most five characters:
%[0-9A-F]{2}[^%]?[^%]? – a percent-encoded octet followed by at most two non-% characters
[^%]%[0-9A-F]{2}[^%]? – one non-% character, followed by a percent-encoded octet followed at most one non-% character
[^%][^%]%[0-9A-F]{2} – two non-% characters followed by a percent-encoded octet
[^%]{1,5} – one to five non-% characters
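A quick way to try the pattern is a global match over the example string. The sketch below uses Python's `re.findall` (the language choice is an assumption; any engine with a "find all matches" call should behave the same). The outer capture group is dropped because it is not needed here, and note that `[0-9A-F]` only accepts uppercase hex digits.

```python
import re

# The four alternatives from the answer, without the outer capture group.
CHUNK = re.compile(
    r"%[0-9A-F]{2}[^%]?[^%]?"   # %XX followed by up to two other characters
    r"|[^%]%[0-9A-F]{2}[^%]?"   # one character, %XX, up to one more character
    r"|[^%][^%]%[0-9A-F]{2}"    # two characters, then %XX
    r"|[^%]{1,5}"               # one to five characters without any %
)

text = "Hello%20world%20how%20are%20you%20today"
chunks = CHUNK.findall(text)
print(chunks)
```

The alternatives are tried in order at each position, so the chunks are valid (at most five characters, no split %XX) but not necessarily the fewest possible.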

Related

Can regex be used to find this pattern?

I need to parse a large amount of data in a log file. Ideally I would do this by splitting the file into a list where each item is an individual log entry.
Every time a log entry is made it is prefixed with a string following this pattern:
"4404: 21:42:07.433 - After this point there could be anything (including new line characters and such). However, as soon as the prefix repeats that indicates a new log entry."
4404 can be any number, but it is always followed by a :.
21:42:07.433 is 21 hours, 42 minutes, 7 seconds, 433 milliseconds.
I don't know much about regex, but is it possible to identify this pattern using it?
I figured something like this would work...
"*: [0-24]:[0:60]:[0:60].[0-1000] - *"
However, it just throws an exception and I fear I'm not on the right track at all.
List<string> split_content = Regex.Matches(file_content, @"*: [0-24]:[0:60]:[0:60].[0-1000] - *").Cast<Match>().Select(m => m.Value).ToList();
The following expression would split a string according to your pattern:
\d+: \d{2}:\d{2}:\d{2}\.\d{3}
Add a ^ at the beginning if your delimiting string always starts a line (and use the m flag for the regex). Capturing the log chunks with a regex would be more elaborate; I'd suggest just splitting (with Regex.Split) if you have the whole log content in memory at once.
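As an illustration of the splitting idea, here is a small Python sketch (not the C# from the question). It finds every position where the prefix starts a line and slices the text between those positions, so each entry keeps its own prefix. The helper name `split_log` and the sample log text are made up, and any text before the first prefix is silently dropped.

```python
import re

# Prefix such as "4404: 21:42:07.433 - " at the start of a line.
PREFIX = re.compile(r"^\d+: \d{2}:\d{2}:\d{2}\.\d{3} - ", re.MULTILINE)


def split_log(content):
    """Return one string per log entry, each starting with its prefix."""
    starts = [m.start() for m in PREFIX.finditer(content)]
    ends = starts[1:] + [len(content)]
    return [content[a:b] for a, b in zip(starts, ends)]


# Made-up sample: the first entry spans two lines.
log = (
    "4404: 21:42:07.433 - first entry\n"
    "with a second line\n"
    "4405: 21:42:08.001 - second entry\n"
)
entries = split_log(log)
print(len(entries))
```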

regex search window size

I have a large document that I am running a regex on. Below is an example of a similar expression:
(?=( aExample| bExample)(?=.*(XX))(?=.*(P1)))
This works most of the time, but sometimes, because of other text in the document, the condition is met by looking across the entire document; e.g., there might be 10 characters between "aExample" and "XX", but 1,000 characters between "XX" and "P1". I would like to restrict the expression to N characters (let's say 50 for the sake of the example) so that the regex is a little more conservative. How can I reduce the size of the regex's window to N characters instead of the entire string/document? Any help is appreciated. Thanks!
(?=( aExample| bExample)(?=.{1,50}(XX))(?=.{1,50}(P1)))
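You want to limit how many characters each . may consume, so you can replace * with a bounded quantifier in braces. Note that both lookaheads start right after the matched word, so XX and P1 must each begin at most 50 characters after it.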
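To see the effect of the bounded quantifier, here is a small check in Python (the language and the two sample strings are assumptions made up for illustration). The second string puts "P1" more than 1,000 characters away, so the windowed pattern should no longer match it.

```python
import re

WINDOW = re.compile(r"(?=( aExample| bExample)(?=.{1,50}(XX))(?=.{1,50}(P1)))")

# Made-up inputs: one with everything close together, one with P1 far away.
near = " aExample then XX and P1 close by"
far = " aExample then XX" + "-" * 1000 + "P1"

near_match = WINDOW.search(near)
far_match = WINDOW.search(far)
print(near_match is not None, far_match is not None)
```

Keep in mind that . does not cross line breaks unless the engine's "dot matches newline" option is enabled.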

Regular expression string division, prioritize the part lengths

I have this string
0Sc-a+nn1.ed_AI&AO1301#89
That has to be split in three parts
0Sc-a+nn1.ed_AI&AO
1301
89
I am using this RE (?P<prefix>[a-z\.\_\-\+(\&)]+\W?)(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?)) in python, but for now, testing in https://regex101.com/.
I am having trouble identifying the first part. If I try "Sc-a+nn.ed_AI&AO1301#89" it works fine, but once digits are added to the first part, as in the example, it doesn't.
How can I prioritize the second and third parts so that they take the maximum length allowed around the #, while the first part allows digits at the beginning and in the middle (never at the end, because those belong to part two)? The ? is there because sometimes the preceding element doesn't exist.
Use [a-zA-Z]{2} to capture the two letters after the &. If the numeric parts have a fixed length, you can also specify it explicitly, e.g. \d{4}:
(?P<prefix>[A-Za-z0-9._\-+&;]+[a-zA-Z]{2}?)(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?))
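Since the question mentions Python, here is a short sketch that runs the pattern above against the example string and reads the named groups. It relies on the prefix ending in two letters directly before the digits, which holds for the example but is an assumption about the real data.

```python
import re

PART = re.compile(
    r"(?P<prefix>[A-Za-z0-9._\-+&;]+[a-zA-Z]{2}?)"
    r"(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?))"
)

m = PART.match("0Sc-a+nn1.ed_AI&AO1301#89")
parts = (m.group("prefix"), m.group("ref_num"), m.group("subpart_num"))
print(parts)

# Without the "#..." part, subpart_num is simply None.
m2 = PART.match("Sc-a+nn.ed_AI&AO1301")
parts2 = (m2.group("prefix"), m2.group("ref_num"), m2.group("subpart_num"))
print(parts2)
```

The greedy character class first swallows the digits too, then backtracks until two letters followed by digits are found, which is what hands the full number to ref_num.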

How to get a count of the word sizes in a large amount of text?

I have a large amount of text - roughly 7000 words.
I would like to get a count of the word sizes, e.g. the count of 4-letter words and 6-letter words, using regex.
I am unsure how to go about this - my thought process so far would be to split the sentence into a String array, which would allow me to count each individual element's size. Is there an easier way to go about this using a regex? I am using Groovy for this task.
EDIT: So I did get this working using a normal array, but it was slightly messy. The final solution simply used Groovy's countBy() method coupled with a small amount of logic, for anyone who might come across a similar problem.
Don't forget the word boundary token \b. If you don't put it at both ends of a \w{n} token, then all words longer than n characters are also found. For a four-character word use \b\w{4}\b; for a six-character word use \b\w{6}\b.
Java implementation:
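import java.util.regex.Matcher;
import java.util.regex.Pattern;

String dummy = ".....";
// \b on both sides so that only words of exactly six characters match
Pattern pattern = Pattern.compile("\\b\\w{6}\\b");
Matcher matcher = pattern.matcher(dummy);
int count = 0;
while (matcher.find()) {
    count++;
}
System.out.println(count);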
Read the file using any stream word by word and calculate their length. Store counters in an array and increment values after reading each word.
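A minimal sketch of that counting idea, in Python rather than Groovy (an assumption for illustration; the function name `word_length_counts` and the sample sentence are made up). It pulls out each word with a regex and tallies the lengths in one pass, much like Groovy's countBy() mentioned in the question's edit.

```python
import re
from collections import Counter


def word_length_counts(text):
    """Map each word length to the number of words of that length."""
    # \w+ between word boundaries; note \w also covers digits and '_'.
    return Counter(len(word) for word in re.findall(r"\b\w+\b", text))


counts = word_length_counts("the quick brown fox jumps")
print(counts[3], counts[5])
```

This avoids running a separate regex per length: one pass gives the counts for every size at once.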
You could generate a regex for each size you want.
\b\w{6}\b would match each word with exactly 6 letters
\b\w{7}\b would match each word with exactly 7 letters
and so on...
So you could run one of these regexes on the text with the global flag enabled (finding every instance in the whole string). This will give you an array of every match, and the length of that array is the count you are after.

Regex for UK registration number

I've been playing with creating a regular expression for UK registration numbers but have hit a wall when it comes to restricting overall length of the string in question. I currently have the following:
^(([a-zA-Z]?){1,3}(\d){1,3}([a-zA-Z]?){1,3})
This allows for an optional string (lower or upper case) of between 1 and 3 characters, followed by a mandatory numeric of between 1 and 3 characters and finally, a mandatory string (lower or upper case) of between 1 and 3 characters.
This works fine but I then want to apply a max length of 7 characters to the entire string but this is where I'm failing. I tried adding a 1,7 restriction to the end of the regex but the three 1,3 checks are superseding it and therefore allowing a max length of 9 characters.
Examples of registration numbers that need to pass are as follows:
A1
AAA111
AA11AAA
A1AAA
A11AAA
A111AAA
In the examples above, the A's represent any letter, upper or lower case, and the 1's represent any digit. The max length is the only restriction that appears not to be working. I disable the entry of spaces, so they can be assumed never to be present in the string.
If you know what lengths you are after, I'd recommend you use the .length property (or its equivalent) which most languages expose for string length. If this is not an option, you could add a lookahead that checks the overall length; the $ inside the lookahead is what ties the 1 to 7 count to the whole string: ^(?=.{1,7}$)(([a-zA-Z]?){1,3}(\d){1,3}([a-zA-Z]?){1,3})$
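Here is a small Python sketch of the lookahead approach (the language and the helper name `is_valid_registration` are assumptions for illustration). The letter groups are written as [a-zA-Z]{0,3}, which accepts the same strings as ([a-zA-Z]?){1,3}, and the lookahead contains a $ so that the 1 to 7 limit applies to the whole string and not just to its first characters.

```python
import re

# Lookahead caps the total length at 7; the rest describes the shape:
# up to 3 letters, 1 to 3 digits, up to 3 letters.
REGISTRATION = re.compile(r"(?=.{1,7}$)[a-zA-Z]{0,3}\d{1,3}[a-zA-Z]{0,3}")


def is_valid_registration(value):
    return REGISTRATION.fullmatch(value) is not None


samples = ["A1", "AAA111", "AA11AAA", "A1AAA", "A11AAA", "A111AAA"]
results = [is_valid_registration(s) for s in samples]
print(results)
print(is_valid_registration("AAA111AAA"))  # 9 characters, too long
```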