REGEX - Insert space after every 4 characters, and a line break after every 40 characters - regex

I have a huge string (22000+ characters) of encoded text. The code is consisted of digits [0-9] and lower case letters [a-z]. I need a regular expression to insert a space after every 4 characters, and one to insert a line break [\n] after every fourty characters. Any ideas?

Many people would prefer to do this with a for loop and string concatenation, but I hate those substring calls. I am really against using regexes when they aren't the right tool for the job (parsing HTML), but I think it'd pretty easy to work with in this case.
JSFiddle Example
Let's say you have the string
var str = "aaaabbbbccccddddeeeeffffgggghhhhiiiijjjjkkkkllllmmmmnnnnoooo";
And you want to insert a space after every four characters, and a newline after 40 characters, you could use the following code
str.replace(/.{4}g/, function (value, index){
return value + (index % 40 == 36? '\n' : ' ');
});
Note that this wouldn't work if the newline(40) index wasn't a multiple of the space index(4)
I abstracted this in a project, here's a simple way to do it
/**
* Adds padding and newlines into a string without whitespace
* #param {str} str The str to be modified (any whitespace will be stripped)
* #param {int} spaceEvery number of characters before inserting a space
* #param {int} wrapeEvery number of spaces before using a newline instead
* return {string} The replaced string
*/
function addPadding(str, spaceEvery, wrapEvery) {
var regex = new RegExp(".{"+spaceEvery+"}", "g");
// Add space every {spaceEvery} chars, newline after {wrapEvery} spaces
return str.replace(/[\n\s]/g, '').replace(regex, function(value, index) {
// The index is the group that just finished
var newlineIndex = spaceEvery * (wrapEvery - 1);
return value + ((index % (spaceEvery * wrapEvery) === newlineIndex) ? '\n' : ' ');
});
}

Well, a regexp in itself doesn't insert a space, so I'll assume you have some command in whatever language you're using that inserts based on finding a regexp.
So, finding 4 characters and finding 40 characters: that's not pretty in general regular expressions (unless your particular implementation has nice ways to express numbers). For finding 4 characters, use
....
Because typical regexp finders use maximal munch, then from the end of one regexp, search forward and maximally munch again, that'll chunk your string into 4 character pieces. The ugly part is that in standard regular expressions, you'll have to use
........................................
to find chuncks of 40 characters, although I'll note that if you run your 4 character one first, you'll have to run
..................................................
or
.... .... .... .... .... .... .... .... .... ....
to account for the spaces you've already put in.
The period finds any characters, but given that you're only using [0-9|a-z], you could use that regexp in place of each period if you need to ensure nothing else slipped in, I was just avoiding making it even more gross.
As you may be noting, regexp have some limitations. Take a look at the Chomsky hierarchy to really get into their theoretical limitations.

Related

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
and so, I want to retrieved, 45.72643,4.91203 and also hereanotherdata
As they are both between characters : and /.
I tried with this syntax in a easier string where there is only 1 time the pattern,
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern happens once. If I use it in string containing multiples times the pattern, I get all the string between the first : and the last /.
How can I mention that the pattern will occur multiple time, and how can I retrieve it?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should will break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array index (as it contains text before a ":"), loop the rest of the array and regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then Ignore 1, and extract everything before the "/" in 2 and 3.
As John Mawer and Adriaan mentioned, strsplit is a good place to start with. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you do it with strsplit twice, you can know where the ':' starts :
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(#(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now B has cells that start with ':', and has two cells in each cell that contain '/' also. You can extract it with checking where B has more than one cell, and take the first of each of them:
C=cellfun(#(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in 16b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Regex mix of at least 2 characters (alphabetic, numeric, punctuation, special character)

I am trying to create a regular expression for a password field that checks to see if the input contains a mix of at least two characters sets (alphabetic, numeric, punctuation, special character). In addition, the first and last character cannot be numeric and the length must be at least 8 characters long.
I have never dealt with conditional logic for regular expressions, so it's probably why I'm having such a hard time. So far, this (but it's not working as intended):
(?=.{8,})(\d.*[a-zA-Z])|(?=.{8,})([a-zA-Z].*\d)|(?=.{8,})(\W.*\d)|(?=.{8,})(\d.*\W)|(?=.{8,})(\W.*[a-zA-Z])|(?=.{8,})([a-zA-Z].*\W)|(?=.{8,})([a-z].*[A-Z])|(?=.{8,})([A-Z].*[a-z])
Personally, I wouldn't do this with a single regex. Why not run it through a set of simpler ones, just to save yourself the inevitable maintenance headaches down the line? Something like (in order, in pseudocode):
// First and last are non-numeric and length check
if (!regex_check(pass, /^[^0-9].{6}.*[^0-9]$/)) return false
regexes = {/[a-zA-Z]/, /[0-9]/, /\p{P}|\p{Sc}|\^/} // Different character categories
numCategories = 0
for r in regex
if (regex_check(pass, r)) numCategories += 1
if numCategories >= 2 return true
return false

Spotfire: count the number of a certain character in a string

I am trying to add a new calculated column that counts the number of semi colons in a string and adds one to it. So the column i have contains a bunch of aliases and I need to know how many for each row.
For example,
A; B; C; D
So basically this means there are 4 aliases (3 semi colons + 1)
Need to do this for over 2 million rows. Help please!
Basic idea is to subtract length of your string without ; characters from it's original length:
len([columnName])-len(Substitute([columnName],";",""))+1
Here it is with a regular expression:
Len(RXReplace([Column 1], "(?!;).", "", "gis"))+1
RXReplace takes as arguments:
The string you are wanting to work on (in this case it is on Column 1)
The regular expression you want to use (here it is (?!;). )
What you want to replace matches with (blank in this situation so
that everything that matches the regex is removed)
Finally a parameter saying how you want it to work (we are passing
in gis which means replace all matches not just the first, ignore case, replace newlines)
We wrap this in a Len which gives us the amount of semicolons since that is all that is left and finally we add 1 to it to get the final result.
You can read more about the regular expression here: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx but in a nutshell it says match everything that isn't a semi colon.
You can read more about RXReplace and Len here: https://docs.tibco.com/pub/spotfire/6.0.0-november-2013/userguide-webhelp/ncfe/ncfe_text_functions.htm

Strategy to replace spaces in string

I need to store a string replacing its spaces with some character. When I retrieve it back I need to replace the character with spaces again. I have thought of this strategy while storing I will replace (space with _a) and (_a with _aa) and while retrieving will replace (_a with space) and (_aa with _a). i.e even if the user enters _a in the string it will be handled. But I dont think this is a good strategy. Please let me know if anyone has a better one?
Replacing spaces with something is a problem when something is already in the string. Why don't you simply encode the string - there are many ways to do that, one is to convert all characters to hexadecimal.
For instance
Hello world!
is encoded as
48656c6c6f20776f726c6421
The space is 0x20. Then you simply decode back (hex to ascii) the string.
This way there are no space in the encoded string.
-- Edit - optimization --
You replace all % and all spaces in the string with %xx where xx is the hex code of the character.
For instance
Wine having 12% alcohol
becomes
Wine%20having%2012%25%20alcohol
%20 is space
%25 is the % character
This way, neither % nor (space) are a problem anymore - Decoding is easy.
Encoding algorithm
- replace all `%` with `%25`
- replace all ` ` with `%20`
Decoding algorithm
- replace all `%xx` with the character having `xx` as hex code
(You may even optimize more since you need to encode only two characters: use %1 for % and %2 for , but I recommend the %xx solution as it is more portable - and may be utilized later on if you need to code more characters)
I'm not sure your solution will work. When reading, how would you
distinguish between strings that were orginally " a" and strings that
were originally "_a": if I understand correctly, both will end up
"_aa".
In general, given a situation were a specific set of characters cannot
appear as such, but must be encoded, the solution is to choose one of
allowed characters as an "escape" character, remove it from the set of
allowed characters, and encode all of the forbidden characters
(including the escape character) as a two (or more) character sequence
starting with the escape character. In C++, for example, a new line is
not allowed in a string or character literal. The escape character is
\; because of that, it must be encoded as an escape sequence as well.
So we have "\n" for a new line (the choice of n is arbitrary), and
"\\" for a \. (The choice of \ for the second character is also
arbitrary, but it is fairly usual to use the escape character, escaped,
to represent itself.) In your case, if you want to use _ as the
escape character, and "_a" to represent a space, the logical choice
would be "__" to represent a _ (but I'd suggest something a little
more visually suggestive—maybe ^ as the escape, with "^_" for
a space and "^^" for a ^). When reading, anytime you see the escape
character, the following character must be mapped (and if it isn't one
of the predefined mappings, the input text is in error). This is simple
to implement, and very reliable; about the only disadvantage is that in
an extreme case, it can double the size of your string.
You want to implement this using C/C++? I think you should split your string into multiple part, separated by space.
If your string is like this : "a__b" (multiple space continuous), it will be splited into:
sub[0] = "a";
sub[1] = "";
sub[2] = "b";
Hope this will help!
With a normal string, using X characters, you cannot write or encode a string with x-1 using only 1 character/input character.
You can use a combination of 2 chars to replace a given character (this is exactly what you are trying in your example).
To do this, loop through your string to count the appearances of a space combined with its length, make a new character array and replace these spaces with "//" this is just an example though. The problem with this approach is that you cannot have "//" in your input string.
Another approach would be to use a rarely used char, for example "^" to replace the spaces.
The last approach, popular in a combination of these two approaches. It is used in unix, and php to have syntax character as a literal in a string. If you want to have a " " ", you simply write it as \" etc.
Why don't you use Replace function
String* stringWithoutSpace= stringWithSpace->Replace(S" ", S"replacementCharOrText");
So now stringWithoutSpace contains no spaces. When you want to put those spaces back,
String* stringWithSpacesBack= stringWithoutSpace ->Replace(S"replacementCharOrText", S" ");
I think just coding to ascii hexadecimal is a neat idea, but of course doubles the amount of storage needed.
If you want to do this using less memory, then you will need two-letter sequences, and have to be careful that you can go back easily.
You could e.g. replace blank by _a, but you also need to take care of your escape character _. To do this, replace every _ by __ (two underscores). You need to scan through the string once and do both replacements simultaneously.
This way, in the resulting text all original underscores will be doubled, and the only other occurence of an underscore will be in the combination _a. You can safely translate this back. Whenever you see an underscore, you need a lookahed of 1 and see what follows. If an a follows, then this was a blank before. If _ follows, then it was an underscore before.
Note that the point is to replace your escape character (_) in the original string, and not the character sequence to which you map the blank. Your idea with replacing _a breaks. as you do not know if _aa was originally _a or a (blank followed by a).
I'm guessing that there is more to this question than appears; for example, that you the strings you are storing must not only be free of spaces, but they must also look like words or some such. You should be clear about your requirements (and you might consider satisfying the curiosity of the spectators by explaining why you need to do such things.)
Edit: As JamesKanze points out in a comment, the following won't work in the case where you can have more than one consecutive space. But I'll leave it here anyway, for historical reference. (I modified it to compress consecutive spaces, so it at least produces unambiguous output.)
std::string out;
char prev = 0;
for (char ch : in) {
if (ch == ' ') {
if (prev != ' ') out.push_back('_');
} else {
if (prev == '_' && ch != '_') out.push_back('_');
out.push_back(ch);
}
prev = ch;
}
if (prev == '_') out.push_back('_');

How to replace a string between two substrings in a string in VC++/MFC?

Say I have a CString object strMain="AAAABBCCCCCCDDBBCCCCCCDDDAA";
I also have two smaller strings, say strSmall1="BB";
strSmall2="DD";
Now, I want to replace all occurence of strings which occur between strSmall1("BB") and strSmall2("DD") in strMain, with say "KKKKKKK"
Is there a way to do it without Regex. I cannot use regex as adding another file to the project is prohibited.
Is there a way in VC++/MFC to do it? Or any easy algorithm you can point me to?
int length = strMain.GetLength();
int begin = strMain.Find(strSmall1, 0) + strSmall1.GetLength();
int end = strMain.Find(strSmall2, 0);
CStringT left = strMain.Left(begin);
CStringT right = strMain.Right(length - end);
strMain = left + "KKKKKKK" + right
The easiest way is probably to handle the replacement recursively. Search for the starting delimiter and the ending delimiter. If you find them, put together a new string consisting of the string up to the starting delimiter, followed by the replacement string, followed by the return from recursively doing the replacement in the remainder of the string following the ending delimiter.
That, of course, assumes you want to replace all the occurrences in the main string -- if you only want to replace the first one, John Weldon's solution (for one example) will work quite nicely.
psudocode:
loop over string
if curlocation matches string strsmall1 save index break
loop over remaining string
replace till curlocation matches string strsmall2
Extra credit:
What will the next assignment be?
My answer:
Speed it up by jumping the length of strsmall1 and strsmall2 in loop iterations