Parsing special character pattern using sscanf in C - regex

I am working on a command parser that is supposed to accept a command line terminated with \r\n and extract its parameters.
The command line structure is as follows:
All the parameters inside () are mandatory and the arguments inside [] are optional; spc
stands for a blank space, and \t stands for a tab.
AP is a decimal integer in the range 1...4
RT and WL are unsigned decimal integers
= is the equals sign
% is the percent sign
The following is the accepted command structure:
[spc] MYCMD [spc] (\t) [spc] (AP) [spc] (:) (WL)(=)(RT)spcspc(\n)
For example, the following commands are correct (the whole command is case insensitive):
MYCMD \t 1 : 540 = 21% \r\n
MYCMD \t 2 : 712= 25 % \r\n
MYCMD\t 3 : 200 =17%\r\n
and ...
Following commands are incorrect:
MYCMD \t 5: 540 = 21% \r\n ---> 5 is not in range 1..4
MYCMD \t 2 : 712 25% \r\n ---> There is no equal symbol
MYCMD 3 200 =17\r\n --->there is no : between 3 and 200, no percentage symbol
MYCMD 3 100 =1 ,,.\n ----> there are extra symbols after 1 and \r does not exist
MYCMD 2: 130 =17.1\r\n ----> the sscanf parser must not convert the float 17.1 to the integer 17
I have written an sscanf format string, but it does not parse correctly!
int n_parsed=sscanf(cmd_str,"%*sMYCMD[*^\t]%*s%[1234]:%u%*s%[=]%u\r\n",&int_ap,&uint_wl,&uint_rt);
But this does not work for the correct commands (n_parsed never reaches 3).
Any hints or comments on fixing the parsing issue would be appreciated.
Thanks

This cannot be done solely with sscanf().
A key problem is that " ", "\r" and "\n" in the format string (other than inside "[ ]") each match any number of whitespace characters, including none, while OP has very specific requirements. Allowing optional spaces ' ' but no other whitespace is difficult to do in sscanf().
Another problem is that %d and the other numeric specifiers skip optional leading whitespace, which we must either prevent or accept.
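Both whitespace behaviors are easy to check. The following is a small sketch in C-style C++ that uses only the C standard library; the helper name scan_pair is made up for illustration.

```cpp
#include <cstdio>

// Hypothetical helper: scans two integers with a single space between
// the conversions and returns sscanf's conversion count.
int scan_pair(const char *text, int *first, int *second)
{
    // The " " directive matches zero or more whitespace characters of
    // any kind, and %d skips leading whitespace on its own as well.
    return std::sscanf(text, "%d %d", first, second);
}
```

Called with "1\t\r\n 2", scan_pair() converts both numbers just as it does for "1 2", so the return value alone cannot tell a tab or a line break from a space.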
There is a discrepancy between the format and the examples in the location of the "%". I assume the example is correct.
There is a discrepancy between the format and the examples in the end-of-line \r\n versus \n. I assume trailing spaces are allowed before a final \r\n.
There is a discrepancy between the format and the examples in that spaces are allowed before the numbers. I assume spaces are OK.
The more I look at it I see lots of discrepancies between the stated format and the correct examples. I'll go for whatever is easiest to pass the examples in those cases.
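int sep[4] = { 0 };
int int_ap;
unsigned uint_wl, uint_rt;
// [spc] MYCMD [spc] (\t) [spc] (AP) [spc] (:) (WL)(=)(RT)(%)spcspc(\r\n)
// " %%" matches the percent sign that follows RT in the examples.
// Note: the literal MYCMD is matched case-sensitively here.
const char *format = " MYCMD%n %n%1d :%u =%u %%%n %n";
int n_parsed = sscanf(cmd_str, format,
    &sep[0], &sep[1], &int_ap, &uint_wl, &uint_rt, &sep[2], &sep[3]);
// sep[3] is only written when every earlier directive matched
if (sep[3] == 0) DidNotReadEnd();
if ((int_ap < 1) || (int_ap > 4)) RangeError();
// Exactly one tab must appear between MYCMD and AP
unsigned TabCount = 0;
int n;
for (n = sep[0]; n < sep[1]; n++) {
    if (cmd_str[n] == '\t') TabCount++;
}
if (TabCount != 1) WrongTabCount();
// Only spaces may appear between the % and the final \r\n
for (n = sep[2]; n < sep[3]; n++) {
    if (cmd_str[n] != ' ') break;
}
if (strcmp(&cmd_str[n], "\r\n") != 0) EOLError();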
Note: int_ap could be scanned with %1[1-4] into a string and then converted to an int.
I fully expect a claim that this can all be done with only a sscanf() format. I am confident such an approach can be broken.
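For comparison, here is a minimal sketch of a hand-written scanner that avoids sscanf() entirely. It is written in C-style C++ and makes the same assumptions as above: spaces are allowed around each token, exactly one tab follows MYCMD, a % follows RT, and the line ends with \r\n. The names parse_mycmd, skip_spaces and read_number are hypothetical, and overflow of WL and RT is not checked.

```cpp
#include <cctype>
#include <cstddef>

// Skip spaces only (not tabs, CR or LF).
static const char *skip_spaces(const char *p)
{
    while (*p == ' ') ++p;
    return p;
}

// Read an unsigned decimal number with at least one digit; no sign and
// no leading whitespace are accepted. Returns nullptr on failure.
static const char *read_number(const char *p, unsigned *out)
{
    if (!std::isdigit(static_cast<unsigned char>(*p))) return nullptr;
    unsigned value = 0;
    while (std::isdigit(static_cast<unsigned char>(*p))) {
        value = value * 10 + static_cast<unsigned>(*p - '0');
        ++p;
    }
    *out = value;
    return p;
}

// Hypothetical parser: returns true and fills the outputs only when the
// whole line matches the assumed grammar.
bool parse_mycmd(const char *s, int *ap, unsigned *wl, unsigned *rt)
{
    static const char keyword[] = "MYCMD";
    const char *p = skip_spaces(s);

    // Case-insensitive match of the keyword.
    for (std::size_t i = 0; keyword[i] != '\0'; ++i, ++p) {
        if (std::toupper(static_cast<unsigned char>(*p)) != keyword[i])
            return false;
    }
    p = skip_spaces(p);
    if (*p++ != '\t') return false;          // exactly one tab
    p = skip_spaces(p);

    if (*p < '1' || *p > '4') return false;  // AP: one digit, 1..4
    int ap_value = *p++ - '0';
    p = skip_spaces(p);
    if (*p++ != ':') return false;
    p = skip_spaces(p);

    unsigned wl_value = 0, rt_value = 0;
    if ((p = read_number(p, &wl_value)) == nullptr) return false;
    p = skip_spaces(p);
    if (*p++ != '=') return false;
    p = skip_spaces(p);
    if ((p = read_number(p, &rt_value)) == nullptr) return false;
    p = skip_spaces(p);
    if (*p++ != '%') return false;
    p = skip_spaces(p);

    // The line must end with CR LF and nothing after it.
    if (p[0] != '\r' || p[1] != '\n' || p[2] != '\0') return false;

    *ap = ap_value;
    *wl = wl_value;
    *rt = rt_value;
    return true;
}
```

With this sketch, parse_mycmd("MYCMD \t 1 : 540 = 21% \r\n", &ap, &wl, &rt) should return true with ap == 1, wl == 540 and rt == 21, while a line with AP 5, a missing =, or 17.1 in place of RT is rejected.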

Related

How do I mimic a Unicode JS regular expression in Lucee

I am trying to write a regular expression in Lucee to mimic the JS on the front end. Since Lucee's regex doesn't seem to support Unicode, how do I do it?
This is the JS
function charTest(k){
var regexp = /^[\u00C0-\u00ff\s -\~]+$/;
return regexp.test(k)
}
if(!charTest(thisKey)){
alert("Please Use Latin Characters Only");
return false;
}
This is what I have tried in Lucee
regexp = '[\u00C0-\u00ff\s -\~]+/';
writeDump(reFind(regexp,"测"));
writeDump(reFind(regexp,"test"));
I have also tried
regexp = "[\\p{L}]";
but the dump is always 0
EDIT: Give me one second. I think I interpreted your initial JS regex incorrectly. Fixing it.
EDIT 2: It was more than a second. Your original JS regex was:
"/^[\u00C0-\u00ff\s -\~]+$/". This is:
Basic parts of regex:
"/..../" == signifies the start and stop of the Regex.
"^" == anchors the match at the start of the string
"[...]" == a character class: matches any one character listed in it
(a "^" placed just inside the brackets, "[^...]", would negate the class)
"+" == signifies at least one of the previous
"$" == signifies the end of the string
Identifiers in the regex:
"\u00c0-\u00ff" == Unicode character range of Character 192 (À)
to Character 255 (ÿ). This is the upper part of the
Latin-1 Supplement block of the Unicode character set.
"\s" == signifies any whitespace character (space, tab, newline, ...)
" -\~" == signifies the range from the space character to the
(escaped) tilde character (~). This is ASCII 32-126, the
printable characters of ASCII (the DEL character, 127, is
excluded). This includes alphanumerics and most punctuation.
I missed the second half of your printable Latin basic character set. I've updated my regex and tests to include it. There are ways to shorthand some of these identifiers, but I wanted it to be explicit.
You can try this:
<cfscript>
//http://www.asciitable.com/
//https://en.wikipedia.org/wiki/List_of_Unicode_characters
//https://en.wikipedia.org/wiki/Latin_script_in_Unicode
function charTest(k) {
return
REfind("[^"
& chr(32) & "-" & chr(126)
& chr(192) & "-" & chr(255)
& "]",arguments.k)
? "Please Use Latin Characters Only"
: ""
;
}
// TESTS
writeDump(charTest("测")); // Not Latin
writeDump(charTest("test")); // All characters between 32 & 126
writeDump(charTest("À")); // Character 192 (in range)
writeDump(charTest("À ")); // Character 192 and Space
writeDump(charTest(" ")); // Space Characters
writeDump(charTest("12345")); // Digits ( character 48-57 )
writeDump(charTest("ð")); // Character 240 (in range)
writeDump(charTest("ℿ")); // Character 8511 (outside range)
writeDump(charTest(chr(199))); // CF Character (in range)
writeDump(charTest(chr(10))); // CF Line Feed Character (outside range)
writeDump(charTest(chr(1000))); // CF Character (outside range)
writeDump(charTest("
")); // CRLF (outside range)
writeDump(charTest(URLDecode("%00", "utf-8"))); // CF Null character (outside range)
//writeDump(asc("测"));
//writeDump(asc("test"));
//writeDump(asc("À"));
//writeDump(asc("ð"));
//writeDump(asc("ℿ"));
</cfscript>
https://trycf.com/gist/05d27baaed2b8fc269f90c7c80a1aa82/lucee5?theme=monokai
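All the regex does is look at your input string for any character outside the ranges chr(32)-chr(126) and chr(192)-chr(255). If it finds one, the function returns your chosen string; otherwise it returns an empty string. Unlike the JS \s, tabs and line breaks are rejected here, as the tests show.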
I think you can access the UNICODE characters below 255 directly. I'll have to test it.
Do you need this function to raise an alert, like the JavaScript does? If you need to, you can just output a 1 or 0 to indicate whether the function actually found a character outside the allowed ranges.

RegExp other patterns not working

I continue trying to perform string format matching using RegExp in VBScript & VB6. I am now trying to match a short, single-line string formatted as:
(1) Seven characters:
a. Six alphanumeric plus one "-" OR
b. Five alphanumeric plus two "-"
(2) Three numbers
(3) Two letters
(4) Literal "65"
(5) A two-digit hex number.
Examples include 123456-789LM65F2, 4EF789-012XY65A5, A2345--789AB65D0 & 23456--890JK65D0.
The RegExp pattern ([A-Z0-9\-]{12})([65][A-F0-9]{2}) lumps (1) - (3) together and finds these OK.
However, if I try to:
c) Break (3) out w/ pattern ([A-Z0-9\-]{10})([A-Z]{2})([65][A-F0-9]{2}),
d) Break out both (2) & (3) w/ pattern ([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}), or
e) Tighten up (1) with alternation pattern ([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})
it refuses to find any of them.
What am I doing wrong? Following is a VBScript that runs and checks these.
' VB Script
Main()
Function Main() ' RegEx_Format_sample.vbs
'Uses two patterns, TestPttn for full format accuracy check & SplitPttn
'to separate the two desired pieces
Dim reSet, EtchTemp, arrSplit, sTemp
Dim sBoule, sSlice, idx, TestPttn, SplitPttn, arrMatch
Dim arrPttn(3), arrItems(3), idxItem, idxPttn, Msgtemp
Set reSet = New RegExp
' reSet.IgnoreCase = True ' Not using
' reSet.Global = True ' Not using
' load test case formats to check & split
arrItems(0) = "0,6 nums + 1 '-',123456-789LM65F2"
arrItems(1) = "1,6 chars + 1 '-',4EF789-012XY65A5"
arrItems(2) = "2,5 chars + 2 '-',A2345--789AB65D0"
arrItems(3) = "3,5 nums + 2 '-',23456--890JK65D0"
SplitPttn = "([A-Z0-9]{5,6})[-]{1,2}([A-Z0-9]{9})" ' split pattern has never failed to work
' load the patterns to try
arrPttn(0) = "([A-Z0-9\-]{12})([65][A-F0-9]{2})"
arrPttn(1) = "([A-Z0-9\-]{10}[A-Z]{2})([65][A-F0-9]{2})"
arrPttn(2) = "([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})"
arrPttn(3) = "([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})"
For idxPttn = 0 To 3 ' select Test pattern
TestPttn = arrPttn(idxPttn)
TestPttn = TestPttn & "[%]" ' append % "ender" char
SplitPttn = SplitPttn & "[%]" ' append % "ender" char
For idxItem = 0 To 3
reSet.Pattern = TestPttn ' set to Test pattern
sTemp = arrItems(idxItem )
arrSplit = Split(sTemp, ",") ' arrSplit is Split array
EtchTemp = arrSplit(2) & "%" ' append % "ender" char to Item sub (2) as the "phrase" under test
If reSet.Test(EtchTemp) = False Then
MsgBox("RegEx " & TestPttn & " false for " & EtchTemp & " as " & arrSplit(1) )
Else ' test OK; now switch to SplitPttn
reSet.Pattern = SplitPttn
Set arrMatch = reSet.Execute(EtchTemp) ' run Pttn as Exec this time
If arrMatch.Count > 0 then ' If test OK then Count s/b > 0
Msgtemp = ""
Msgtemp = "RegEx " & TestPttn & " TRUE for " & EtchTemp & " as " & arrSplit(1)
For idx = 0 To arrMatch.Item(0).Submatches.Count - 1
Msgtemp = Msgtemp & Chr(13) & Chr(10) & "Split segment " & idx & " as " & arrMatch.Item(0).submatches.Item(idx)
Next
MsgBox(Msgtemp)
End If ' Count OK
End If ' test OK
Next ' idxItem
Next ' idxPttn
End Function
Try this Regex:
(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}
Explanation:
(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--) - matches either 6 Alphanumeric characters followed by a - or 5 Alphanumeric characters followed by a --
[0-9]{3} - matches 3 Digits
[A-Z]{2} - matches 2 Letters
65 - matches 65 literally
[0-9A-F]{2} - matches 2 HEX symbols
You can get some idea from the following code:
VBScript Code:
Option Explicit
Dim objReg, strTest
strTest = "123456-789LM65F2" 'Change the value as per your requirements. You can also store a list of values in an array and run the code in loop
set objReg = new RegExp
objReg.Global = True
objReg.IgnoreCase = True
objReg.Pattern = "(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}"
if objReg.test(strTest) then
msgbox strTest&" matches with the Pattern"
else
msgbox strTest&" does not match with the Pattern"
end if
set objReg = Nothing
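Your patterns do not work because [65] is a character class that matches one character, either 6 or 5, not the two-character text 65. After [65] consumes the 6, [A-F0-9]{2} consumes the 5 and the first hex digit, so the last hex digit is left over in front of the % ender. (The first pattern only succeeds because the unanchored search can start at the second character.) Each pattern reads as follows: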
([A-Z0-9\-]{12})([65][A-F0-9]{2}) - matches 12 occurrences of either an AlphaNumeric character or - followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9\-]{10}[A-Z]{2})([65][A-F0-9]{2}) - matches 10 occurrences of either an AlphaNumeric character or - followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}) - matches 7 occurrences of either an AlphaNumeric character or - followed by 3 digits followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}) - matches either 5 occurrences of an AlphaNumeric character followed by -- or 6 occurrences of an Alphanumeric followed by a -. This is then followed by 3 digits followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
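The effect of [65] can be reproduced outside VBScript. The sketch below uses C++ std::regex, whose default ECMAScript grammar is assumed here to behave like VBScript's RegExp for these constructs; the helper found() and the two pattern names are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` is found anywhere in `text`
// (an unanchored search).
bool found(const std::string &text, const std::string &pattern)
{
    return std::regex_search(text, std::regex(pattern));
}

// Alternation pattern from the question, with the "%" ender appended.
// [65] matches ONE character that is either '6' or '5'.
const std::string with_class =
    "([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})"
    "([65][A-F0-9]{2})[%]";

// Same pattern with the literal text 65 instead of the class [65].
const std::string with_literal =
    "([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})"
    "(65[A-F0-9]{2})[%]";
```

Under these assumptions, found("123456-789LM65F2%", with_class) is false and found("123456-789LM65F2%", with_literal) is true.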
Try this pattern :
(([A-Z0-9]{5}--)|([A-Z0-9]{6}-))[0-9]{3}[A-Z]{2}65[0-9A-F]{2}
Or, if the last part doesn't like the [A-F]
(([A-Z0-9]{5}--)|([A-Z0-9]{6}-))[0-9]{3}[A-Z]{2}65[0-9ABCDEF]{2}
All, tanx again for your help!!
trincot, everything in each arrItems() between the commas, incl. the "plus", is merely part of a shorthand description of each item's characteristics, such as "5 characters plus 2 dashes".
Gurman, your pttn breakdowns were helpful, but, if I read it right, the addition of the ? prefix is a "Match zero or one occurrences" and this must match exactly one occurrence. Also, my 1st pattern (matches 12) actually DID work for all my test cases.
jNevill, & JMichelB your suggestions are very close to what I ended up with.
I was "over-classing". After some tinkering, I was able to get the Test Pttn to successfully recognize these test cases by taking the [65] out of the [] in my original Alternation pattern. That is I went from ([65]) to (65) and Zammo! it worked.
Orig pattern:
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})
Wkg pattern:
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})(65)([A-F0-9]{2})
Oh, and I moved the
SplitPttn = SplitPttn & "[%]" ' append % "ender" char
stmt up out of the For...Next loop. That helped w/ the splitting.
T-Bone

How to find the character "\" in a string?

I am trying to manipulate a string by finding the \ character in the string Find\inHere. However, I can't put that as an input in test.find('\', 0). It won't work and gives me the error "missing terminating character." Is there a way to fix test.find('\', 0)?
string test = "Find\inHere";
int x = test.find('\', 0); // error on this line
cout << x; // x should equal 4
\ is the character used to introduce escape sequences, for example \n for newline or \xDB for the character with hexadecimal code DB.
So, in order to search for this special character, you have to escape it by adding another \. Use:
test.find("\\",0);
EDIT: Also, your first string literal does not contain "Find\inHere": \i is not a valid escape sequence, so the compiler rejects it or warns about it, and no backslash ends up in the string. To get a real backslash into the string, escape it the same way and write "Find\\inHere".
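Putting both fixes together, a minimal sketch could look like this. The helper name find_backslash is hypothetical; it uses the char overload of std::string::find and returns the library's own index type rather than int.

```cpp
#include <string>

// Hypothetical helper: index of the first backslash in `text`,
// or std::string::npos when there is none.
std::string::size_type find_backslash(const std::string &text)
{
    // '\\' is a single char: the escape sequence for one backslash.
    return text.find('\\');
}
```

For the literal "Find\\inHere" this returns 4, the value the question expects for x.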

ofstream not translating "\r\n" to new line character

I have written C++ code for changing file formats. Part of the functionality is to add a configured line-end character. For one of the file conversions, the required line end is "\r\n", i.e. CR+NL.
My code basically reads the configured value from the DB and appends it to the end of each record. Something along the lines of
//read DB and store line end char in a string, let's say lineEnd.
//code snippet for file writing
string record = "this is a record";
ofstream outFileStream;
string outputFileName = "myfile.txt";
outFileStream.open (outputFileName.c_str());
outFileStream<<record;
outFileStream<<lineEnd; // here line end contains "\r\n"
But this prints record followed by \r\n as it is, no translation to CR+NL takes place.
this is a record\r\n
While the following works (prints CR+LF in output file)
outFileStream<<record;
outFileStream<<"\r\n";
this is a record
But I can not hard code it. I am facing similar issues with "\n" also.
Any suggestions on how to do it.
The translation of \r into the ASCII character CR and of \n into the ASCII character LF is done by the compiler when parsing your source code, and in literals only. That is, the string literal "A\n" will be a 3-character array with values 65 10 0.
The output streams do not interpret escape sequences in any way. If you ask an output stream to write the characters \ and r after each other, it will do so (write characters with ASCII value 92 and 114). If you ask it to write the character CR (ASCII code 13), it will do so.
The reason std::cout << "\r"; writes the CR character is that the string literal already contains the character 13. So if your database includes the string \r\n (4 characters: \, r, \, n, ASCII 92 114 92 110), that is also the string you will get on output. If it contained the string with ASCII 13 10, that's what you'd get.
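The difference between the two strings shows up in their lengths. A small sketch, with variable names chosen only for illustration:

```cpp
#include <string>

// Two control characters: CR (13) and LF (10).
const std::string real_line_end = "\r\n";

// Four ordinary characters: backslash, 'r', backslash, 'n'.
// This is what a database field holding the text \r\n contains.
const std::string escaped_text = "\\r\\n";
```

Here real_line_end.size() is 2 and escaped_text.size() is 4; writing either one to a stream outputs exactly those characters.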
Of course, if it's impractical for you to store 13 10 in the database, nothing prevents you from storing 92 114 92 110 (the string "\r\n") in there, and translating it at runtime. Something like this:
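#include <cstddef>
#include <string>

// Replace every occurrence of `from` in `str` with `to`.
void translate(std::string &str, const std::string &from, const std::string &to)
{
    std::size_t at = 0;
    for (;;) {
        at = str.find(from, at);
        if (at == str.npos)
            break;
        str.replace(at, from.size(), to);
        at += to.size(); // continue searching after the replacement
    }
}

std::string lineEnd = getFromDatabase();
translate(lineEnd, "\\r", "\r");
translate(lineEnd, "\\n", "\n");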
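If the stored value may also need an escaped backslash, repeated find-and-replace passes become unreliable, because a later pass can reinterpret text produced by an earlier one. One alternative is a single left-to-right pass. The sketch below is one way to write it; the name unescape and the set of supported sequences (\r, \n, \t and \\) are assumptions for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of `in` in which the two-character
// sequences \r, \n, \t and \\ are replaced by the character they name.
// Any other backslash sequence is copied through unchanged.
std::string unescape(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            char mapped = 0;
            switch (in[i + 1]) {
                case 'r':  mapped = '\r'; break;
                case 'n':  mapped = '\n'; break;
                case 't':  mapped = '\t'; break;
                case '\\': mapped = '\\'; break;
            }
            if (mapped != 0) {
                out += mapped;
                ++i;            // skip the second character of the pair
                continue;
            }
        }
        out += in[i];
    }
    return out;
}
```

With this sketch, the four-character text \r\n becomes CR LF, and an escaped backslash followed by n stays a backslash and an n instead of turning into a line feed.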

Reading characters from a File with fscanf

I have a problem using the fscanf function :(
I need to read a sequence of characters from a file like "a b c d" (the characters are separated by spaces),
but it doesn't work :(
How do I have to read them?
I tried to print it and the result is incorrect. I think it's because of the spaces. I really don't know why it doesn't work.
Tell me please, what is wrong with the array access?
From cplusplus.com:
The function will read and ignore any whitespace characters encountered before the next non-whitespace character (whitespace characters include spaces, newline and tab characters -- see isspace). A single whitespace in the format string validates any quantity of whitespace characters extracted from the stream (including none).
Then if your code is:
while ( fscanf(fin,"%c", &array[i++]) == 1 );
and your file is like this:
h e l l o
Your array will be:
[h][ ][e][ ][l][ ][l][ ][o]
If you change your code into:
while ( fscanf(fin," %c", &array[i++]) == 1 );
with the same file your array will be:
[h][e][l][l][o]
In any case the code works: it depends on what you want.
Anyway, you should think about starting to use fgets() + sscanf(), for example (this stores the first character of each line):
char buff[NUM];
while ( fgets(buff, sizeof buff, fin) )
sscanf(buff,"%c", &array[i++]);
With fscanf() alone, the lack of buffer management can turn into buffer overflow problems.
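To pull every non-whitespace character out of a line that fgets() has read, sscanf() can be called repeatedly with %n to track how far it got. The following is a sketch in C-style C++; the helper name collect_chars and its capacity parameter are assumptions, added so the output array cannot overflow.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: copies every non-whitespace character of `line`
// into `out` (at most `cap` of them) and returns how many were stored.
std::size_t collect_chars(const char *line, char *out, std::size_t cap)
{
    std::size_t count = 0;
    int used = 0;
    char c;
    // " %c" skips whitespace and then reads one character; %n records
    // how many input characters this call has consumed.
    while (count < cap && std::sscanf(line, " %c%n", &c, &used) == 1) {
        out[count++] = c;
        line += used;   // continue right after the character just read
    }
    return count;
}
```

Called on the line "h e l l o\n" with room for at least five characters, it stores h, e, l, l, o and returns 5.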
Add white space before %c =>
while (fscanf(pFile," %c", &alpArr[i++]) == 1);
It should work.