Matching algorithm or regular expression? - regex

I have a huge log file with different types of string rows, and I need to extract data in a "smart" way from these.
Sample snippet:
2011-03-05 node32_three INFO stack trace, at empty string asfa 11120023
--- - MON 23 02 2011 ERROR stack trace NONE      
For instance, what is the best way to extract the date from each row, independent of date format?

You could make a regex for different formats like so:
(fmt1)|(fmt2)|....
where fmt1, fmt2, etc. are the individual regexes. For your example:
(20\d\d-[01]\d-[0123]\d)|((?:MON|TUE|WED|THU|FRI|SAT|SUN) [0123]\d [01]\d 20\d\d)
Note that, to reduce the chance of matching arbitrary numbers, I restricted the year, month and day digits accordingly. For example, a day number cannot start with 4, and a month number cannot start with 2.
This gives the following code (Java):
// Remember that you need to double each backslash when writing the
// pattern in string form.
Pattern p = Pattern.compile("..."); // "..." is the pattern above; compile once and for all
for (String s : lines) {            // lines: the input lines, however you read them
    Matcher m = p.matcher(s);
    if (m.find()) {
        String d = m.group();       // d is the string that matched
        // ....
    }
}
Each individual date pattern is written in () to make it possible to find out what format we had, like so:
int fmt = 0;
// each (fmt) is a group, numbered starting with 1 from left to right
for (int i = 1; fmt == 0 && i <= m.groupCount(); i++)
    if (m.group(i) != null) fmt = i;
For this to work, inner (regex) groups must be written (?:regex) so that they do not count as capture groups; see the updated example.
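Putting the pieces of this answer together, a minimal self-contained sketch might look like the following. The class name LogDateExtractor and the method extract are made up for illustration; the pattern is the two-format alternation from above, and only those two formats are recognised.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogDateExtractor {
    // One capturing group per date format; inner groups are non-capturing (?:...).
    static final Pattern DATE = Pattern.compile(
        "(20\\d\\d-[01]\\d-[0123]\\d)"
        + "|((?:MON|TUE|WED|THU|FRI|SAT|SUN) [0123]\\d [01]\\d 20\\d\\d)");

    /** Returns {formatNumber, matchedText}, or null if the line has no date. */
    static String[] extract(String line) {
        Matcher m = DATE.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the top-level groups is non-null after a successful find().
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {
                return new String[] { String.valueOf(i), m.group(i) };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "2011-03-05 node32_three INFO stack trace, at empty string asfa 11120023",
            "--- - MON 23 02 2011 ERROR stack trace NONE"
        };
        for (String line : lines) {
            String[] r = extract(line);
            System.out.println(r == null ? "no date" : "format " + r[0] + ": " + r[1]);
        }
    }
}
```

The format number tells you which parser to apply to the matched text afterwards.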

If you use Java, you may want to have a look at Joda time. Also, read this question and related answers. I think Joda DateTimeFormat should give you all the flexibility that you need to parse the various date/time format of your log file.
A quick example:
String dateString = "2011-04-18 10:41:33";
DateTimeFormatter formatter =
DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");
DateTime dateTime = formatter.parseDateTime(dateString);
Just define a String[] holding the formats of your dates/times, and pass each element to DateTimeFormat to get the corresponding DateTimeFormatter. You can use a regex just to separate the date strings from the other content in the log lines, and then try each DateTimeFormatter in turn to parse them.
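The "array of formats, try each in turn" idea could be sketched as below. To keep the example free of external dependencies it uses the JDK's java.time classes instead of Joda-Time, so it is a swapped-in equivalent rather than the Joda API itself. The class name, the two format strings, and the assumption that the weekday prefix has already been stripped by a regex are all illustrative.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

class MultiFormatDateParser {
    // Candidate formats, tried in order; extend this as new log formats turn up.
    static final String[] FORMATS = { "yyyy-MM-dd", "dd MM yyyy" };

    static final List<DateTimeFormatter> FORMATTERS = new ArrayList<>();
    static {
        for (String f : FORMATS) {
            FORMATTERS.add(DateTimeFormatter.ofPattern(f));
        }
    }

    /** Returns the first successful parse, or null if no format fits. */
    static LocalDate parse(String dateText) {
        for (DateTimeFormatter fmt : FORMATTERS) {
            try {
                return LocalDate.parse(dateText, fmt);
            } catch (DateTimeParseException e) {
                // Not this format; fall through and try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse("2011-03-05"));
        System.out.println(parse("23 02 2011"));
        System.out.println(parse("not a date"));
    }
}
```

Returning null for an unparseable string is a design choice for this sketch; throwing or returning an Optional would work just as well.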

Related

How can I get several characters from a std::string

I want to name a file according to the date and the time it was created. I'm using this code to get the date and time:
auto time = std::chrono::system_clock::now();
std::time_t end_time = std::chrono::system_clock::to_time_t(time);
std::string finaltime = std::ctime(&end_time);
I now want to remove all spaces and other information that I don't need from the finaltime string. To do this, I worked out which characters I need: all except 11, 14, 17 (counting from 0 for the first character).
In Python there is a very simple way to do something like that: if you need the characters from 2 to 5, you can write mystring[2:5]. Is there something similar in C++, or is there another way to delete the characters I don't need?
Use the substr(pos, len) function. Note that the second argument is a length, not an end index:
std::string str2 = finaltime.substr(11, 5); // hours and minutes, e.g. 12:00

Regex date formats

I need help with three regular expressions for date validation. The date formats to validate against should be:
- MMyy
- ddMMyy
- ddMMyyyy
Further:
I want the regular expressions to match the exact number of digits in the formats above. For instance, January should be 01, NOT 1:
060117 // ddMMyy format: Ok
06117 // ddMMyy format: NOT Ok
Hyphens and slashes are NOT allowed, like: 06-01-17, or 06/01/17.
Below are the regexes that I use. I cannot get them quite right, though.
string regex_MMyy = @"^(1[0-2]|0[1-9]|\d)(\d{2})$";
string regex_ddMMyy = @"^(0[1-9]|[12]\d|3[01])(1[0-2]|0[1-9]|\d)(\d{2})$";
string regex_ddMMyyyy = @"^(0[1-9]|[12]\d|3[01])(1[0-2]|0[1-9]|\d)(\d{4})$";
var test_MMyy_1 = Regex.IsMatch("0617", regex_MMyy); // Pass
var test_MMyy_2 = Regex.IsMatch("617", regex_MMyy); // Pass, do NOT want this to pass.
var test_ddMMyy_1 = Regex.IsMatch("060117", regex_ddMMyy); // Pass
var test_ddMMyy_2 = Regex.IsMatch("06117", regex_ddMMyy); // Pass, do NOT want this to pass.
var test_ddMMyyyy_1 = Regex.IsMatch("06012017", regex_ddMMyyyy); // Pass
var test_ddMMyyyy_2 = Regex.IsMatch("0612017", regex_ddMMyyyy); // Pass, do NOT want this to pass.
(If anyone could take allowed days for each month, and leap years into account, that would be a huge bonus :)).
Thanks,
Best Regards
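One likely cause of the unwanted matches is the trailing |\d alternative in the month group, which lets a single digit stand in for the month. A minimal sketch without that alternative is shown below, written in Java rather than C#; the class and method names are made up for illustration. These patterns use only basic constructs, so they should carry over to .NET verbatim strings, but that has not been checked here. The sketch validates digit counts and ranges only, not days per month or leap years.

```java
import java.util.regex.Pattern;

class CompactDateFormats {
    // Two-digit day 01..31 and two-digit month 01..12; no single-digit fallback.
    static final String DD = "(0[1-9]|[12]\\d|3[01])";
    static final String MM = "(0[1-9]|1[0-2])";

    static final Pattern MMYY = Pattern.compile("^" + MM + "(\\d{2})$");
    static final Pattern DDMMYY = Pattern.compile("^" + DD + MM + "(\\d{2})$");
    static final Pattern DDMMYYYY = Pattern.compile("^" + DD + MM + "(\\d{4})$");

    static boolean isMMyy(String s) { return MMYY.matcher(s).matches(); }
    static boolean isDdMMyy(String s) { return DDMMYY.matcher(s).matches(); }
    static boolean isDdMMyyyy(String s) { return DDMMYYYY.matcher(s).matches(); }

    public static void main(String[] args) {
        String[] samples = { "0617", "617", "060117", "06117", "06012017", "0612017" };
        for (String s : samples) {
            System.out.println(s + " MMyy=" + isMMyy(s)
                + " ddMMyy=" + isDdMMyy(s) + " ddMMyyyy=" + isDdMMyyyy(s));
        }
    }
}
```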

How can I extract a file name based on number string?

I have a list of filenames in a struct array, example:
4x1 struct array with fields:
name
date
bytes
isdir
datenum
where files.name
ans =
ts.01094000.crest.csv
ans =
ts.01100600.crest.csv
etc.
I have another list of numbers (say, 1094000), and I want to find the corresponding file name in the struct.
Please note that 1094000 doesn't have the leading 0. Often there might be other numbers. So I want to search for '1094000' and find that name.
I know I can do it using a regex, but I have never used one before, and I am finding it difficult to write one for numbers instead of text using strfind. Any suggestion or another method is welcome.
What I have tried:
regexp(files.name,'ts.(\d*)1094000.crest.csv','match');
I think the regular expression you'd want is more like
filenames = {'ts.01100600.crest.csv','ts.01094000.crest.csv'};
matches = regexp(filenames, ['ts\.0*' num2str(1094000) '\.crest\.csv']);
matches = ~cellfun('isempty', matches);
filenames(matches)
For a solution with strfind...
Pre-16b:
match = ~cellfun('isempty', strfind({files.name}, num2str(1094000)),'UniformOutput',true)
files(match)
16b+:
match = contains({files.name}, string(1094000))
files(match)
However, the strfind way might have issues if the number you are looking for exists in unexpected places such as looking for 10 in ["01000" "00101"].
If your filenames match the pattern ts.NUMBER.crest.csv, then in 16b+ you could do:
str = {files.name};
str = extractBetween(str,4,'.');
str = strip(str,'left','0');
matches = str == string(1094000);
files(matches)

How to separate a line of input into multiple variables?

I have a file that contains rows and columns of information like:
104857 Big Screen TV 567.95
573823 Blender 45.25
I need to parse this information into three separate items, a string containing the identification number on the left, a string containing the item name, and a double variable containing the price. The information is always found in the same columns, i.e. in the same order.
I am having trouble accomplishing this. Even when not reading from the file and just using a sample string, my attempt just outputs a jumbled mess:
string input = "104857 Big Screen TV 567.95";
string tempone = "";
string temptwo = input.substr(0,1);
tempone += temptwo;
for(int i=1 ; temptwo != " " && i < input.length() ; i++)
{
temptwo = input.substr(i,i);
tempone += temptwo;
}
cout << tempone;
I've tried tweaking the above code for quite some time, but no luck, and I can't think of any other way to do it at the moment.
You can find the first space and the last space using std::string::find_first_of and std::string::find_last_of. You can use these to split the string into three parts: the first space comes after the first variable, the last space comes before the third variable, and everything in between is the second variable.
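The first-space/last-space idea can be sketched as follows, in Java rather than C++ (indexOf and lastIndexOf play the role of find_first_of and find_last_of). The ItemLine class and its field names are made up for illustration, and the sketch assumes the id and the price themselves contain no spaces.

```java
class ItemLine {
    final String id;
    final String name;
    final double price;

    ItemLine(String id, String name, double price) {
        this.id = id;
        this.name = name;
        this.price = price;
    }

    /** Splits "id name with spaces price" at the first and the last space. */
    static ItemLine parse(String line) {
        String s = line.trim();
        int first = s.indexOf(' ');
        int last = s.lastIndexOf(' ');
        if (first < 0 || first == last) {
            throw new IllegalArgumentException("expected at least three fields: " + line);
        }
        String id = s.substring(0, first);
        String name = s.substring(first + 1, last).trim();
        double price = Double.parseDouble(s.substring(last + 1));
        return new ItemLine(id, name, price);
    }

    public static void main(String[] args) {
        ItemLine item = parse("104857 Big Screen TV 567.95");
        System.out.println(item.id + " | " + item.name + " | " + item.price);
    }
}
```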
How about following pseudocode:
string input = "104857 Big Screen TV 567.95";
string[] parsed_output = input.split(" "); // split input string with 'space' as delimiter
// parsed_output[0] = 104857
// parsed_output[1] = Big
// parsed_output[2] = Screen
// parsed_output[3] = TV
// parsed_output[4] = 567.95
int id = stringToInt(parsed_output[0]);
string product = concat(parsed_output[1], parsed_output[2], ... ,parsed_output[length-2]);
double price = stringToDouble(parsed_output[length-1]);
I hope, that's clear.
Well, try breaking down the file's components:
You know a number always comes first, and a number has no whitespace.
The string following the number CAN have whitespace, but won't contain any digits (I would assume).
After this title, you're going to have another number (with no whitespace).
From these components, you can deduce:
Grabbing the first number is as simple as reading it in with the file stream's >> operator.
Getting the string requires you to grab one character at a time and append it to a string until you reach a digit. The last number is read just like the first, using the file stream's >> operator.
This seems like homework, so I'll let you put the rest together.
I would try a regular expression, something along these lines:
^([0-9]+)\s+(.+)\s+([0-9]+\.[0-9]+)$
I am not very good at regex syntax, but ([0-9]+) corresponds to a sequence of digits (this is the id), ([0-9]+\.[0-9]+) is the floating-point number (the price), and (.+) is the string that is separated from the two numbers by sequences of "space" characters: \s+.
The next step would be to check if you need this to work with prices like ".50" or "10".
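A minimal Java sketch of this regex approach follows. STRICT is the pattern from the answer; RELAXED is an assumed variant that also accepts prices such as ".50" or "10". The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ItemLineRegex {
    // Pattern from the answer: id, name, price with a mandatory decimal part.
    static final Pattern STRICT =
        Pattern.compile("^([0-9]+)\\s+(.+)\\s+([0-9]+\\.[0-9]+)$");
    // Assumed variant: the integer part and the decimal point are optional.
    static final Pattern RELAXED =
        Pattern.compile("^([0-9]+)\\s+(.+)\\s+([0-9]*\\.?[0-9]+)$");

    /** Returns {id, name, price} as strings, or null if the line does not match. */
    static String[] split(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = split(STRICT, "104857 Big Screen TV 567.95");
        double price = Double.parseDouble(parts[2]);
        System.out.println(parts[0] + " | " + parts[1] + " | " + price);
    }
}
```

Because (.+) is greedy, the name takes everything up to the last run of whitespace that is still followed by a valid price.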

VB.Net Matching and replacing the contents of multiple overlapping sets of brackets in a string

I am using vb.net to parse my own basic scripting language, sample below. I am a bit stuck trying to deal with the 2 separate types of nested brackets.
Assuming name = Sam
Assuming timeFormat = hh:mm:ss
Assuming time() is a function that takes a format string but
has a default value and returns a string.
Hello [[name]], the time is [[time(hh:mm:ss)]].
Result: Hello Sam, the time is 19:54:32.
The full time is [[time()]].
Result: The full time is 05/06/2011 19:54:32.
The time in the format of your choice is [[time([[timeFormat]])]].
Result: The time in the format of your choice is 19:54:32.
I could in theory change the syntax of the script completely but I would rather not. It is designed like this to enable strings without quotes because it will be included in an XML file and quotes in that context were getting messy and very prone to errors and readability issues. If this fails I could redesign using something other than quotes to mark out strings but I would rather use this method.
Preferably, unless there is some other way I am not aware of, I would like to do this using regex. I am aware that the standard regex is not really capable of this but I believe this is possible using MatchEvaluators in vb.net and some form of recursion based replacing. However I have not been able to get my head around it for the last day or so, possibly because it is hugely difficult, possibly because I am ill, or possibly because I am plain thick.
I do have the following regex for parts of it.
Detecting the parentheses: (\w*?)\((.*?)\)(?=[^\(+\)]*(\(|$))
Detecting the square brackets: \[\[(.*?)\]\](?=[^\[+\]]*(\[\[|$))
I would really appreciate some help with this as it is holding the rest of my project back at the moment. And sorry if I have babbled on too much or not put enough detail, this is my first question on here.
Here's a little sample which might help you iterate through several matches/groups/captures. I realize that I am posting C# code, but it would be easy for you to convert that into VB.Net
//these two may be passed in as parameters:
string tosearch;//the string you are searching through
string regex;//your pattern to match
//...
Match m;
CaptureCollection cc;
GroupCollection gc;
Regex r = new Regex(regex, RegexOptions.IgnoreCase);
m = r.Match(tosearch);
gc = m.Groups;
Debug.WriteLine("Number of groups found = " + gc.Count.ToString());
// Loop through each group.
for (int i = 0; i < gc.Count; i++)
{
cc = gc[i].Captures;
int counter = cc.Count;
int grpnum = i + 1;
Debug.WriteLine("Scanning group: " + grpnum.ToString() );
// Print number of captures in this group.
Debug.WriteLine(" Captures count = " + counter.ToString());
if (cc.Count >= 1)
{
foreach (Capture cap in cc)
{
Debug.WriteLine(string.Format(" Capture found: {0}", cap.ToString()));
}
}
}
Here is a slightly simplified version of the code I wrote for this. Thanks for the help everyone and sorry I forgot to post this before. If you have any questions or anything feel free to ask.
Function processString(ByVal scriptString As String)
' Functions
Dim pattern As String = "\[\[((\w+?)\((.*?)\))(?=[^\(+\)]*(\(|$))\]\]"
scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processFunction(match)))
' Variables
pattern = "\[\[([A-Za-z0-9+_]+)\]\]"
scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processVariable(match)))
Return scriptString
End Function
Function processFunction(ByVal match As Match)
Dim nameString As String = match.Groups(2).Value
Dim paramString As String = match.Groups(3).Value
paramString = processString(paramString)
Select Case nameString
Case "time"
Return getLocalValueTime(paramString)
Case "math"
Return getLocalValueMath(paramString)
End Select
Return ""
End Function
Function processVariable(ByVal match As Match)
Try
Return moduleDictionary("properties")("vars")(match.Groups(1).Value)
Catch ex As Exception
End Try
End Function
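For comparison, here is a different way to handle the nesting, sketched in Java rather than VB.Net. Instead of recursing from the outside in, it repeatedly replaces the innermost [[...]] tokens (those whose body contains no square brackets) until none remain. This is not the code above, just an alternative under stated assumptions: the ScriptExpander class, the vars and funcs maps, and the stub "time" function used in main are all hypothetical, and unknown names expand to an empty string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptExpander {
    // A [[...]] token whose body contains no square brackets, i.e. an innermost token.
    static final Pattern INNERMOST = Pattern.compile("\\[\\[([^\\[\\]]*)\\]\\]");
    // A token body of the form name(args).
    static final Pattern CALL = Pattern.compile("(\\w+)\\((.*)\\)");

    final Map<String, String> vars = new HashMap<>();
    final Map<String, Function<String, String>> funcs = new HashMap<>();

    /** Expands tokens from the inside out until no [[...]] token is left. */
    String expand(String script) {
        String current = script;
        for (int pass = 0; pass < 100; pass++) { // guard against runaway expansion
            Matcher m = INNERMOST.matcher(current);
            StringBuffer out = new StringBuffer();
            boolean changed = false;
            while (m.find()) {
                changed = true;
                // quoteReplacement keeps '$' and '\' in the value literal.
                m.appendReplacement(out, Matcher.quoteReplacement(evaluate(m.group(1))));
            }
            m.appendTail(out);
            if (!changed) {
                break;
            }
            current = out.toString();
        }
        return current;
    }

    /** Evaluates one token body: a function call or a variable name. */
    String evaluate(String body) {
        Matcher call = CALL.matcher(body);
        if (call.matches()) {
            Function<String, String> f = funcs.get(call.group(1));
            return f == null ? "" : f.apply(call.group(2));
        }
        return vars.getOrDefault(body, "");
    }

    public static void main(String[] args) {
        ScriptExpander e = new ScriptExpander();
        e.vars.put("name", "Sam");
        e.vars.put("timeFormat", "hh:mm:ss");
        // Stub: a real implementation would format the current time.
        e.funcs.put("time", fmt -> "<time:" + fmt + ">");
        System.out.println(e.expand("Hello [[name]], it is [[time([[timeFormat]])]]."));
    }
}
```

Because inner tokens are resolved first, a function always receives its argument fully expanded, so no recursion is needed.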