How to match using Regex till the invalid response (C#) - regex

I need to write a regex that matches the following string till E 1 ERRORWARNING SET \n, (till the end of invalid response). M 1 CSD ... are valid response strings.
Scenario #1
"M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n" +
"M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n" +
"R OK \n" + // This could be any string not matching the pattern M 1 CSD ...
"E 1 ERRORWARNING SET \n" + // This could be any string not matching the pattern M 1 CSD ...
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n" +
Scenario #2
"R OK \n" + // This could be any string not matching the pattern M 1 CSD ...
"E 1 ERRORWARNING SET \n" + // This could be any string not matching the pattern M 1 CSD ...
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n" +
I know I can write something like (M 1 CSD (?:.{3}) (?:.{2}\s)+\n)* to match the M 1 CSD pattern but not sure how to match the invalid response. The best I am able to do is
(M 1 CSD (?:.{3}) (?:.{2}\s)+\r\n)*([^M].*\r\n)*. But what happens if the invalid response starts with M?
Of course, it is possible that there is no invalid response; in that case the regex needs to match to the end, i.e. up to M 1 CSD 382 02 30 33 22 7D 7D \n
"M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n"
"M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n"
"M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n"
"M 1 CSD 382 00 22 69 64 22 3A 30 2C \n"
"M 1 CSD 382 00 22 72 65 73 75 6C 74 \n"
"M 1 CSD 382 00 22 3A 7B 22 53 65 72 \n"
"M 1 CSD 382 00 69 61 6C 4E 75 6D 62 \n"
"M 1 CSD 382 00 65 72 22 3A 22 32 32 \n"
"M 1 CSD 382 00 32 30 31 31 34 32 35 \n"
"M 1 CSD 382 02 30 33 22 7D 7D \n"

You can repeatedly match every line that is not the ERRORWARNING SET line, which also works when the invalid response starts with M:
^(?![\w ]* ERRORWARNING SET \r?\n).+(?:\r?\n(?![\w ]* ERRORWARNING SET \r?\n).+)*
The pattern matches:
^ Start of string
(?![\w ]* ERRORWARNING SET \r?\n) Assert that the string does not start with ERRORWARNING SET preceded by optional word chars and spaces
.+ Match the whole line with at least a single char
(?: Non capture group
\r?\n Match a newline
(?![\w ]* ERRORWARNING SET \r?\n) Assert that the next line does not start with ERRORWARNING SET preceded by optional word chars and spaces
.+ Match the whole line with at least a single char
)* Close non capture group and optionally repeat
Or a bit more strict to test that the string does not start with a single char A-Z followed by 1 and then ERRORWARNING SET
^(?![A-Z] 1 ERRORWARNING SET \r?\n).+(?:\r?\n(?![A-Z] 1 ERRORWARNING SET \r?\n).+)*
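As a rough illustration, the stricter pattern can be tried in Python, whose re module accepts the same negative lookahead syntax. This is only a sketch: the sample text is Scenario #1 from the question, and the variable names are made up here.

```python
import re

# Stricter pattern from above: stop before a line such as "E 1 ERRORWARNING SET ".
pattern = re.compile(
    r"^(?![A-Z] 1 ERRORWARNING SET \r?\n).+"
    r"(?:\r?\n(?![A-Z] 1 ERRORWARNING SET \r?\n).+)*"
)

# Scenario #1 from the question.
response = (
    "M 1 CSD 382 01 44 2B 54 36 7B 22 6A \n"
    "M 1 CSD 382 00 73 6F 6E 72 70 63 22 \n"
    "R OK \n"
    "E 1 ERRORWARNING SET \n"
    "M 1 CSD 382 00 3A 22 32 2E 30 22 2C \n"
)

match = pattern.match(response)
matched_lines = match.group(0).split("\n") if match else []
for line in matched_lines:
    print(repr(line))
```

With this input the match covers the first three lines and stops before the E 1 ERRORWARNING SET line, so the error line itself is not part of the match.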


Regex to match part of a hex

so I need to use regex to match a part of a hexadecimal string, but that part is random. Let me try to explain more:
So I have this hexa data:
70 75 62 71 00 7e 00 01 4c 00 06 72 61 6e 64 6f 6d 74 00 1c 4c 6a 2f 73 2f 6e 64 6f 6d 3b 78 70 77 25 00 00 00 20 f2 90 c2 91 c4 c4 ca 91 c0 c0 ca 91 94 cb c5 97 90 c5 90 c2 90 96 c7 ca 91 91 93 94 c6 c5 c6 cb c0 78
I need to match only the f2 in that case. But that is not always the case. Each data will be different. The only thing that is always the same is the '00 00 00' part and the '78' at the end. All the rest is random.
I managed to make the following regex:
/(?=00 00 00).+?(?=78)/
The output is:
00 00 00 20 f2 90 c2 91 c4 c4 ca 91 c0 c0 ca 91 94 cb c5 97 90 c5 90 c2 90 96 c7 ca 91 91 93 94 c6 c5 c6 cb c0
But I don't know how to build a regex that takes only the 'f2' (reminder: it is not always going to be f2).
Any thoughts?
Given the explanation in this comment, the regex that you need is:
(?<=00 00 00 [0-9a-f]{2} )[0-9a-f]{2}
Providing the first input string from the question, this regex matches f2 (no spaces around it).
How it works:
(?<= # start of a positive lookbehind
00 00 00 # match the exact string ("00 00 00 ")
[0-9a-f] # match one hex digit (lowercase only)
{2} # match the previous twice (i.e. two hex digits)
# there is a space after ")"
) # end of the lookbehind
[0-9a-f]{2} # match two hex digits
The positive lookbehind works like a non-capturing group but it is not part of the match. Basically it says that the matching part ([0-9a-f]{2}) matches only if it is preceded by a match of the lookbehind expression.
The matching part of the expression is [0-9a-f]{2} (i.e. two hex digits).
You need to add the i flag, or whatever flag your regex engine uses for case-insensitive matching, so that the a-f part of the regex also matches A-F. If you cannot (or do not want to) provide this flag, you can use [0-9A-Fa-f] everywhere instead.
If your regex engine does not support lookbehind you can get the same result using capturing groups:
00 00 00 [0-9a-f]{2} ([0-9a-f]{2})
Applied on the same input, this regex matches 00 00 00 20 f2 and its first (and only) capturing group matches f2.
Update
If it is important that the input string contains 78 somewhere after the matching part, then add (?=(?: [0-9a-f]{2})* 78) to the first regex:
(?<=00 00 00 [0-9a-f]{2} )[0-9a-f]{2}(?=(?: [0-9a-f]{2})* 78)
(?= introduces a positive lookahead. It behaves like a lookbehind, but it must come after the matching part of the regex and it is verified against the part of the string located after the match.
(?: starts a non-capturing group.
The [0-9a-f]{2} followed or preceded by a space in the lookahead and lookbehind ensures that the entire matching string is composed only of 2-digit hex numbers separated by spaces. You can use .* instead, but that will match anything, even text that does not follow the format of 2-digit hex numbers.
For the version without lookaheads or lookbehinds, add (?: [0-9a-f]{2})* 78 at the end of the regex:
00 00 00 [0-9a-f]{2} ([0-9a-f]{2})(?: [0-9a-f]{2})* 78
The regex matches the entire string starting with 00 00 00 and ending with 78, and the first capturing group matches the second number after 00 00 00 (your target).
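For a quick check, the lookbehind/lookahead variant can be sketched in Python. The lookbehind has a fixed width, which Python's re module requires. The sketch uses [0-9a-f] for every hex pair plus the IGNORECASE flag, and the variable names are only illustrative.

```python
import re

# Lookbehind: "00 00 00", one more byte, then a space; lookahead: a later " 78".
pattern = re.compile(
    r"(?<=00 00 00 [0-9a-f]{2} )[0-9a-f]{2}(?=(?: [0-9a-f]{2})* 78)",
    re.IGNORECASE,
)

# The hex data from the question, split over several literals for readability.
data = ("70 75 62 71 00 7e 00 01 4c 00 06 72 61 6e 64 6f 6d 74 00 1c "
        "4c 6a 2f 73 2f 6e 64 6f 6d 3b 78 70 77 25 00 00 00 20 f2 90 "
        "c2 91 c4 c4 ca 91 c0 c0 ca 91 94 cb c5 97 90 c5 90 c2 90 96 "
        "c7 ca 91 91 93 94 c6 c5 c6 cb c0 78")

match = pattern.search(data)
target = match.group(0) if match else None
print(target)
```

Only the two hex digits are part of the match; the surrounding context is checked by the lookarounds but not consumed.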
Is the f2 surrounded by asterisks?
Without asterisks:
00 00 00 [a-f0-9]+ (?<hexits>[a-f0-9]+).+78
With asterisks:
\*(?<hexits>[a-f0-9]+)\*
You can use the following regex to match the second hexadecimal value after "00 00 00": /00 00 00 [0-9A-Fa-f]{2} ([0-9A-Fa-f]{2})/. The value you want is in the first capturing group.
Here is a demo:
import re
s = '70 75 62 71 00 7e 00 01 4c 00 06 72 61 6e 64 6f 6d 74 00 1c 4c 6a 2f 73 2f 6e 64 6f 6d 3b 78 70 77 25 00 00 00 20 f2 90 c2 91 c4 c4 ca 91 c0 c0 ca 91 94 cb c5 97 90 c5 90 c2 90 96 c7 ca 91 91 93 94 c6 c5 c6 cb c0 78'
# Skip the byte right after "00 00 00" (20 here) and capture the next one.
match = re.search(r'00 00 00 [0-9A-Fa-f]{2} ([0-9A-Fa-f]{2})', s)
if match:
    print(match.group(1))
The output will be:
f2
You don't really need a regex for that. Find the offset of three zero bytes in a row and take the byte 4 positions after that offset (the second byte after the zeros):
s = '70 75 62 71 00 7e 00 01 4c 00 06 72 61 6e 64 6f 6d 74 00 1c 4c 6a 2f 73 2f 6e 64 6f 6d 3b 78 70 77 25 00 00 00 20 f2 90 c2 91 c4 c4 ca 91 c0 c0 ca 91 94 cb c5 97 90 c5 90 c2 90 96 c7 ca 91 91 93 94 c6 c5 c6 cb c0 78'
s2 = '01 02 03 00 00 00 05 06 07'
def locate(s):
    data = bytes.fromhex(s)  # fromhex skips the spaces between byte pairs
    offset = data.find(bytes([0, 0, 0]))
    if offset == -1:
        raise ValueError('no 00 00 00 marker found')
    return data[offset + 4]
print(f'{locate(s):02X}')
print(f'{locate(s2):02X}')
Output:
F2
06
You could also extract the "f2" string directly from the string:
offset = s.index('00 00 00')
print(s[offset + 12 : offset + 14]) # 'f2'

how to read and parse text files with istringstream?

I am trying to read a .txt file line by line:
vector<vector<int>> iVecGrid;
vector<int> iVecRow;
ifstream text;
text.open("grid.txt", ios::in);
if (!text)
{
    cout << "Error!\n";
    return EXIT_FAILURE;
}
istringstream isLine;
string sLine;
while (getline(text, sLine))
{
    isLine.str(sLine);
    int iNumber;
    while (isLine >> iNumber)
    {
        cout << iNumber << " ";
        iVecRow.push_back(iNumber);
    }
    cout << endl;
    iVecGrid.push_back(iVecRow);
    iVecRow.clear();
}
When I put isLine inside the while (getline(text, sLine)) loop, it works fine.
But when I put it outside, it bugged out for some reason.
.txt file:
08 02 22 97 38 15 00 40 00 75 04 05 07 78 52 12 50 77 91 08
49 49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 04 56 62 00
81 49 31 73 55 79 14 29 93 71 40 67 53 88 30 03 49 13 36 65
52 70 95 23 04 60 11 42 69 24 68 56 01 32 56 71 37 02 36 91
22 31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80
24 47 32 60 99 03 45 02 44 75 33 53 78 36 84 20 35 17 12 50
32 98 81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70
67 26 20 68 02 62 12 20 95 63 94 39 63 08 40 91 66 49 94 21
24 55 58 05 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72
21 36 23 09 75 00 76 44 20 45 35 14 00 61 33 97 34 31 33 95
78 17 53 28 22 75 31 67 15 94 03 80 04 62 16 14 09 53 56 92
16 39 05 42 96 35 31 47 55 58 88 24 00 17 54 24 36 29 85 57
86 56 00 48 35 71 89 07 05 44 44 37 44 60 21 58 51 54 17 58
19 80 81 68 05 94 47 69 28 73 92 13 86 52 17 77 04 89 55 40
04 52 08 83 97 35 99 16 07 97 57 32 16 26 26 79 33 27 98 66
88 36 68 87 57 62 20 72 03 46 33 67 46 55 12 32 63 93 53 69
04 42 16 73 38 25 39 11 24 94 72 18 08 46 29 32 40 62 76 36
20 69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 04 36 16
20 73 35 29 78 31 90 01 74 31 49 71 48 86 81 16 23 57 05 54
01 70 54 71 83 51 54 69 16 92 33 48 61 43 52 01 89 19 67 48
When you are done with the following loop
while (isLine >> iNumber)
{
    cout << iNumber << " ";
    iVecRow.push_back(iNumber);
}
isLine is in an error state. The error state needs to be cleared. Add the following line after the loop.
isLine.clear();
Also, the position of isLine needs to be reset to point to the start of the string. You can use the following line for that.
isLine.seekg(0);
Both of these problems are avoided by moving the scope of isLine to inside the while block.
The following should work.
while (getline(text, sLine))
{
    isLine.str(sLine);
    isLine.seekg(0); //ADD. Start reading from position 0.
    int iNumber;
    while (isLine >> iNumber)
    {
        cout << iNumber << " ";
        iVecRow.push_back(iNumber);
    }
    isLine.clear(); //ADD. Clear the error state.
    cout << endl;
    iVecGrid.push_back(iVecRow);
    iVecRow.clear();
}
However, I would recommend sticking to your first approach. It removes the unnecessary clutter from your code.
while (getline(text, sLine))
{
    std::istringstream isLine(sLine);
    int iNumber;
    while (isLine >> iNumber)
    {
        cout << iNumber << " ";
        iVecRow.push_back(iNumber);
    }
    cout << endl;
    iVecGrid.push_back(iVecRow);
    iVecRow.clear();
}
When you do
while (isLine >> iNumber)
{
    cout << iNumber << " ";
    iVecRow.push_back(iNumber);
}
the loop runs until isLine enters a failed state. Once it enters that failed state, you can no longer read from it until you call the clear member function to clear those errors. This means that when isLine is declared outside of the while loop, the first iteration of the loop puts it into an error state, and it stays that way for each subsequent iteration since you do not manually clear the errors.
On the other hand, when isLine is declared inside the while loop, it is destroyed at the end of each iteration and created again at the start of the next one. This gives you a new stream that is not in an error state, so you can use it as expected.

How do I extract data as a data frame from a text file in R? The data has names in it and the middle names are messing with my method

I have a text file where strings are separated by whitespaces. I can easily extract these into R as a data frame, by first using the scan command and then seeing that each record has 15 strings in them.
So data[1:15] is one row, data[16:30] is the next row, and so on. In each of these records, the name is composed of two strings, say FOO and BAR. But some records have names such as FOO BOR BAR or even FOO BOR BOO BAR. This obviously breaks my 15-string assumption. How can I easily extract the data into a data frame?
So my data is in my working directory called results.txt.
I use this to scan my data:
mech <- scan("results.txt", "")
Then I can make the data frames like this:
d1 <- t(data.frame(mech[1:15]))
d2 <- t(data.frame(mech[16:30]))
d3 <- t(data.frame(mech[31:45]))
My plan was to iterate this in a for loop and rbind the data into one consolidated data frame.
d1 results in something like
1 FOO BAR 2K12/ME/01 96 86 86 92 73 86 72 168 82 30 84.93
d2 results in
2 FOO2 BAR2 2K12/ME/02 72 83 61 75 44 88 75 165 91 30 72.60
Here, FOO and BAR are first and last names, respectively. Most records are like this. But d3:
3 FOO3 BOR BAR3 2K12/ME/03 72 83 61 75 44 88 75 165 91 30
Because of the extra middle name, I lose the final string of the text, the part right after 30. This then spills over to the next record. So row 46:60, instead of starting with 4, begins with the omitted data from the previous record.
How can I extract the data by treating the names as a single string?
EDIT: Stupid of me for not providing the data frame itself. Here is a sample.
1 FOO BAR 2K12/ME/01 96 86 86 92 73 86 72 168 82 30 84.93
2 FOO2 BAR2 2K12/ME/02 72 83 61 75 44 88 75 165 91 30 72.60
3 FOO3 BOR BAR3 2K12/ME/03 63 84 62 62 50 79 74 157 85 30 69.13
4 FOO4 BOR BAR4 2K12/ME/04 89 88 74 79 77 83 68 182 82 30 81.93
s1 <- "1 FOO BAR 2K12/ME/01 96 86 86 92 73 86 72 168 82 30 84.93
2 FOO2 BAR2 2K12/ME/02 72 83 61 75 44 88 75 165 91 30 72.60
3 FOO3 BOR BAR3 2K12/ME/03 63 84 62 62 50 79 74 157 85 30 69.13
4 FOO4 BOR BAR4 2K12/ME/04 89 88 74 79 77 83 68 182 82 30 81.93"
s2 <- readLines(textConnection(s1)) #read from your file here
s2 <- strsplit(s2, "\\s+") #splits by white space
s3 <- lapply(s2, function(s) {
  n <- length(s)
  s[2] <- paste(s[2:(2 + (n - 14))], collapse = " ")
  s[-(3:(2 + (n - 14)))]
})
DF <- do.call(rbind, s3)
DF <- as.data.frame(DF, stringsAsFactors = FALSE)
DF[] <- lapply(DF, type.convert, as.is = TRUE)
str(DF)
#'data.frame': 4 obs. of 14 variables:
# $ V1 : int 1 2 3 4
# $ V2 : chr "FOO BAR" "FOO2 BAR2" "FOO3 BOR BAR3" "FOO4 BOR BAR4"
# $ V3 : chr "2K12/ME/01" "2K12/ME/02" "2K12/ME/03" "2K12/ME/04"
# $ V4 : int 96 72 63 89
# $ V5 : int 86 83 84 88
# $ V6 : int 86 61 62 74
# $ V7 : int 92 75 62 79
# $ V8 : int 73 44 50 77
# $ V9 : int 86 88 79 83
# $ V10: int 72 75 74 68
# $ V11: int 168 165 157 182
# $ V12: int 82 91 85 82
# $ V13: int 30 30 30 30
# $ V14: num 84.9 72.6 69.1 81.9
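The token-merging idea is not tied to R. As a language-neutral illustration, here is a minimal Python sketch of the same logic. It assumes every record should end up with 14 fields and that the name is the only field that can span several tokens. parse_record and n_fields are hypothetical names introduced for this example.

```python
def parse_record(line, n_fields=14):
    """Split a record on whitespace and merge the name tokens into one field."""
    tokens = line.split()
    extra = len(tokens) - n_fields          # name tokens beyond the first one
    name = " ".join(tokens[1:2 + extra])    # tokens 1 .. 1+extra form the name
    return [tokens[0], name] + tokens[2 + extra:]

# Two records from the question: a two-part name and a three-part name.
sample = """1 FOO BAR 2K12/ME/01 96 86 86 92 73 86 72 168 82 30 84.93
3 FOO3 BOR BAR3 2K12/ME/03 63 84 62 62 50 79 74 157 85 30 69.13"""

rows = [parse_record(line) for line in sample.splitlines()]
for row in rows:
    print(row)
```

Each row comes back with the full name in the second position, so the rows can be stacked into a table regardless of how many name parts a record has.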
One approach is to use regex to enclose the names in quotes and then a simple read table. This approach has the advantage of allowing for cases with any number of names.
s1 <- "1 FOO BAR 2K12/ME/01 96 86 86 92 73 86 72 168 82 30 84.93
2 FOO2 BAR2 2K12/ME/02 72 83 61 75 44 88 75 165 91 30 72.60
3 FOO3 BOR BAR3 2K12/ME/03 63 84 62 62 50 79 74 157 85 30 69.13
4 FOO4 BOR BAR4 2K12/ME/04 89 88 74 79 77 83 68 182 82 30 81.93"
s2 <- gsub("^ *|(?<= ) | *$", "", s1, perl = T)
read.table(text=gsub("(?<=[[:digit:]] )(.*)(?= 2K12)", "'\\1'", s2, perl = T), header = F)
Which gives:
  V1            V2         V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13   V14
1  1       FOO BAR 2K12/ME/01 96 86 86 92 73 86  72 168  82  30 84.93
2  2     FOO2 BAR2 2K12/ME/02 72 83 61 75 44 88  75 165  91  30 72.60
3  3 FOO3 BOR BAR3 2K12/ME/03 63 84 62 62 50 79  74 157  85  30 69.13
4  4 FOO4 BOR BAR4 2K12/ME/04 89 88 74 79 77 83  68 182  82  30 81.93

Parsing .csv files with CR LF EOL structure

I'm trying to parse a CSV file and getline() is reading the entire file as one line. On the assumption that getline() wasn't getting what it expected, I tried \r, \n, \n\r, \r\n, and \0 as arguments with no luck.
I took a look at the EOL characters and am seeing CR and then LF. Is getline() just ignoring this or am I missing something? Also, what's the fix here?
The goal of this function is a general purpose CSV parsing function that stores the data as a 2d vector of strings. Although advice on that front is welcome, I'm only looking for a way to fix this issue.
vector<vector<string>> Parse::parseCSV(string file)
{
    // input fstream instance
    ifstream inFile;
    inFile.open(file);
    // check for error
    if (inFile.fail()) { cerr << "Cannot open file" << endl; exit(1); }
    vector<vector<string>> data;
    string line;
    while (getline(inFile, line))
    {
        stringstream inputLine(line);
        char delimeter = ',';
        string word;
        vector<string> brokenLine;
        while (getline(inputLine, word, delimeter)) {
            word.erase(remove(word.begin(), word.end(), ' '), word.end()); // remove all white spaces
            brokenLine.push_back(word);
        }
        data.push_back(brokenLine);
    }
    inFile.close();
    return data;
}
Here's the hexdump. I'm not sure what exactly this is showing.
0000000 55 4e 49 58 20 54 49 4d 45 2c 54 49 4d 45 2c 4c
0000010 41 54 2c 4c 4f 4e 47 2c 41 4c 54 2c 44 49 53 54
0000020 2c 48 52 2c 43 41 44 2c 54 45 4d 50 2c 50 4f 57
0000030 45 52 0d 31 34 32 34 31 30 35 38 30 38 2c 32 30
0000040 31 35 2d 30 32 2d 31 36 54 31 36 3a 35 36 3a 34
0000050 38 5a 2c 34 33 2e 38 39 36 34 2c 31 30 2e 32 32
0000060 34 34 34 2c 30 2e 38 37 2c 30 2c 30 2c 30 2c 4e
0000070 6f 20 44 61 74 61 2c 4e 6f 20 44 61 74 61 0d 31
0000080 34 32 34 31 30 35 38 38 35 2c 32 30 31 35 2d 30
0000090 32 2d 31 36 54 31 36 3a 35 38 3a 30 35 5a 2c 34
00000a0 33 2e 39 30 31 33 35 2c 31 30 2e 32 32 30 34 31
00000b0 2c 31 2e 30 32 2c 30 2e 36 33 39 2c 30 2c 30 2c
00000c0 4e 6f 20 44 61 74 61 2c 4e 6f 20 44 61 74 61 0d
00000d0 31 34 32 34 31 30 35 38 38 38 2c 32 30 31 35 2d
00000e0 30 32 2d 31 36 54 31 36 3a 35 38 3a 30 38 5a 2c
00000f0 34 33 2e 39 30 31 34 38 2c 31 30 2e 32 32 30 31
0000100
The first two lines of the file
UNIX TIME,TIME,LAT,LONG,ALT,DIST,HR,CAD,TEMP,POWER
1424105808,2015-02-16T16:56:48Z,43.8964,10.22444,0.87,0,0,0,No Data,No Data
UPDATE: Looks like it was \r. I'm not sure why it didn't work earlier, but I learned a few things while exploring. Thanks for the help, guys.
A simple fix would be to write your own getline, for example one that skips any combination of \n and \r at the beginning of a line and breaks on either of them. That will work on any platform, but it won't preserve empty lines.
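To show the splitting rule itself, independent of C++ streams, here is a small Python sketch that breaks on any run of \r and \n characters and drops empty lines, as described above. split_lines is a hypothetical helper, and the sample string is made up to mix CR-only, CRLF and LF endings.

```python
import re

def split_lines(data):
    """Split on any run of CR/LF characters; empty lines are dropped."""
    return [line for line in re.split(r"[\r\n]+", data) if line]

# Made-up sample mixing CR-only, CRLF and LF line endings.
sample = (
    "UNIX TIME,TIME,LAT\r"
    "1424105808,2015-02-16T16:56:48Z,43.8964\r\n\n"
    "1424105885,2015-02-16T16:58:05Z,43.90135\n"
)

records = [line.split(",") for line in split_lines(sample)]
for record in records:
    print(record)
```

The trade-off is the one already mentioned: consecutive terminators collapse into one, so blank lines in the input are lost.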
After looking at the hex-dump, the delimiter is 0d (\r)
Did you try to switch the order of the \r\n to \n\r?

c++ XOR string key hex

I am trying to XOR some already encrypted files.
I know that the XOR key is 0x14 or dec(20).
My code works except for one thing: all the '4's are gone.
Here is my function for the XOR:
void xor(string &nString) // Time to undo what we did from above :D
{
    const int KEY = 0x14;
    int strLen = (nString.length());
    char *cString = (char*)(nString.c_str());
    for (int i = 0; i < strLen; i++)
    {
        *(cString+i) = (*(cString+i) ^ KEY);
    }
}
Here is part of my main:
ifstream inFile;
inFile.open("ExpTable.bin");
if (!inFile) {
    cout << "Unable to open file";
}
string data;
while (inFile >> data) {
    xor(data);
    cout << data << endl;
}
inFile.close();
Here is a part of the encrypted file:
$y{bq //0 move
%c|{ //1 who
&c|qfq //2 where
'saufp //3 guard
x{wu`}{z //4 location
But the line  x{wu`}{z is returning //location. It's not displaying the 4.
Note the space in front of the x. That's supposed to be decoded to 4.
What am I missing? Why is it not showing all the 4s? <space> = 4 // 4 = <space>
UPDATE
This is the list of all the specific conversions:
HEX(enc) ASCII(dec)
20 4
21 5
22 6
23 7
24 0
25 1
26 2
27 3
28 <
29 =
2a >
2b ?
2c 8
2d 9
2e :
2f ;
30 $
31 %
32 &
33 '
34
35 !
36 "
37 #
38 ,
39 -
3a .
3b /
3c (
3d )
3e *
3f +
40 T
41 U
42 V
43 W
44 P
45 Q
46 R
47 S
48 \
49 ]
4a ^
4b _
4c X
4d Y
4e Z
4f [
50 D
51 E
52 F
53 G
54 @
55 A
56 B
57 C
58 L
59 M
5a N
5b O
5c H
5d I
5e J
5f K
60 t
61 u
62 v
63 w
64 p
65 q
66 r
67 s
68 |
69 }
6a
6b
6c x
6d y
6e z
6f {
70 d
71 e
72 f
73 g
75 a
76 b
77 c
78 l
79 m
7a n
7b o
7c h
7d i
7e j
7f k
1d /tab
1e /newline
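Get rid of all casts.
Don't use >> for input: operator>> skips whitespace, and in this file the byte 0x20 (a space) is the encoded form of '4', so those bytes are discarded before they ever reach your XOR function.
That should fix your problems.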
Edit:
// got bored, wrote some (untested) code
// needs <fstream>, <iostream>, <cstring>, <cerrno> and <cstdlib>
ifstream inFile;
inFile.open("ExpTable.bin", ios::in | ios::binary);
if (!inFile) {
    cerr << "Unable to open ExpTable.bin: " << strerror(errno) << "\n";
    exit(EXIT_FAILURE);
}
char c;
while (inFile.get(c)) {
    cout.put(c ^ '\x14');
}
inFile.close();
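The effect of whitespace skipping is easy to reproduce outside C++. The following Python sketch XORs the encoded line from the question byte by byte, once with the leading space kept and once with it stripped. xor_bytes is a hypothetical helper, and the byte string is typed in by hand rather than read from ExpTable.bin.

```python
KEY = 0x14

def xor_bytes(data, key=KEY):
    """XOR every byte with the key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)

# Encoded line from the question: a leading space (0x20) followed by x{wu`}{z
encoded = b" x{wu`}{z"

decoded = xor_bytes(encoded)
print(decoded.decode("ascii"))

# Stripping leading whitespace first drops the 0x20 byte, and with it the '4'.
skipped = xor_bytes(encoded.lstrip())
print(skipped.decode("ascii"))
```

Reading the file in binary mode one byte at a time, as in the C++ snippet above, avoids this loss because no byte is treated as a separator.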
Are you sure that it is printing '//location'? I think it would print '// location' -- note the space after the double-slash. You are XORing 0x34 with 0x14. The result is 0x20, which is a space character. Why would you want to xor everything with 0x14 anyway?
** edit ** ignore the above; I missed part of your question. The real answer:
Are you entirely sure that the character before the x is a 0x20? Perhaps it's some unprintable character that looks like a space? I would check the hex value.