I'm facing a problem grepping the right date within a letter as a document.
The goal is to grep the date of document creation and not any further date within the text.
Usually the document holds information about the company, my address, customer number, bill number, ...
and the date when it was created.
Then maybe a greeting and/or body text, which may contain dates again.
Often the date at the beginning of the document has a different format than the following ones,
December 1999 instead of 3.12.1999, for example.
If I grep for the date with the pattern
'(([0-9][0-9]{,1}\.)\s+('Januar'|'Februar'|'März'|'April'|'Mai'|'Juni'|'Juli'|'August'|'September'|'Oktober'|'November'|'Dezember')\s+([1-9][0-9][0-9][0-9]{1,}))'
I sometimes get the wrong date as the creation date. The reason is the different ways dates are written across the documents.
Example 1 is what I usually get, and it works fine: I find the creation date with the pattern above.
Example 2 is the problem: I do get a date, but it's NOT the creation date, which would be the first date in the document. Instead I get another date from the body text that happens to match the pattern.
Example 1
Example 2
I could use the different pattern '(([0-9][0-9]{,1}\.)([0-9][0-9]{,1}\.)([1-9][0-9][0-9][0-9]{1,}))' to grep the correct date in example 2, but then I would have the same issue with example 1.
My idea was to search the first n lines only: if the pattern matches there, take that date; otherwise use the other pattern.
However, I can't find an option for pdfgrep to search the first n lines only, which would give me the possibility to switch patterns.
Does anybody have an idea how to fix this?
Cheers, bdream
With GNU grep:
-m NUM: Stop reading a file after NUM matching lines.
Alternatively to GNU grep, learn to use GNU gawk, which is specifically designed for such tasks.
Consider also learning Python or GNU Guile (then read SICP).
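A minimal sketch of the two-pattern idea, assuming pdftotext is available and that the creation date sits in the first 20 lines (the file name and the line count are placeholders):

#!/bin/sh
# Extract the text, keep only the first 20 lines, and try the verbose
# date format first; fall back to the numeric format if nothing matched.
first_lines=$(pdftotext document.pdf - | head -n 20)

date=$(printf '%s\n' "$first_lines" | grep -E -o -m 1 \
  '[0-9]{1,2}\. (Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember) [1-9][0-9]{3}')

if [ -z "$date" ]; then
  date=$(printf '%s\n' "$first_lines" | grep -E -o -m 1 '[0-9]{1,2}\.[0-9]{1,2}\.[1-9][0-9]{3}')
fi

echo "$date"

Here -m 1 makes grep stop after the first matching line, so within the window you always get the earliest date rather than a later one from the body text.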
Friends:
I have to process a CSV file using Perl and produce an Excel file as output, using the Excel::Writer::XLSX module. This is not homework but a real-life problem, where I cannot download whatever Perl version I'd like (actually, I need to use Perl 5.6) or whatever Perl modules I'd like (I have a limited set of them). My OS is UNIX. I can also use (embedded in Perl) ksh and csh (with some limitations, as I have found so far). Please limit your answers to the tools I have available. Thanks in advance!
Even though I am not a Perl developer but come from other languages, I have already done most of the work. However, the customer is asking for extra processing, and that is where I am getting stuck.
1) The obstacles I found come from two sides: Perl's and Excel's particular ways of processing data. I already found a workaround to handle the Excel side, but, as mentioned in the subject, I have difficulties processing zeroes found in the CSV input file. To handle the Excel side, I am using the '0 form, which is the final representation Excel seems to use for such data with the # formatting style.
2) Scenario:
I need to catch standalone zeroes which might be present in whichever line / column / cell of the CSV input file and put them as such (as zeroes) in the Excel output file.
I will go directly to the point of my question to avoid losing your valuable time; I am providing more details after the question.
Research and question:
I tried to use a Perl regex to find standalone "0" fields and replace them with some string, planning to change them back to "0" at the end of processing.
perl -p -i -e 's/\b0\b/string/g' myfile.csv
and
perl -i -ple 's/\b0\b/string/g' myfile.csv
Both are working, but only from the command line. They don't work when I call them from my Perl script as follows:
system("perl -i -ple 's/\b0\b/string/g' myfile.csv")
I do not know why... I have already tried exec and eval instead of system, with the same results.
Note that I have a ton of regexes with the same structure that work perfectly, such as the following:
system("perl -i -ple 's/input/output/g' myfile.csv")
I have also tried backticks and qx//, without success. Note that qx// and backticks do not behave the same here, since qx// complains about the \b boundaries because of the forward slash.
I have tried sed -i, but my system rejects -i as an invalid flag (I do not know whether this happens on every UNIX, but it happens at least on the one at work; perl -i, however, is accepted).
I have tried embedding awk (which works from the command line), in this way:
system `awk -F ',' -v OFS=',' '$1 == \"0\" { $1 = "string" }1' myfile.csv > myfile_copy.csv
But this works only for the first column (on the command line) and, besides the disadvantage of needing an extra copy of the file, Perl complains about the > redirection, taking it as "greater than"...
system(q#awk 'BEGIN{FS=OFS=",";split("1 2 3 4 5",A," ") } { for(i in A)sub(0,"string",$A[i] ) }1' myfile.csv#);
This awk works from the command line, but only for 5 columns, and it does not work from Perl using the # delimiters.
All combinations of exec and eval have also been tested, without success.
I have also tried passing each of the awk components to system as separate arguments, separated by commas, but did not find any valid way to pass the redirection (>), since Perl rejects it for the reason mentioned above.
Using another approach, I noticed that the "standalone zeroes" seem to be "swallowed" by the Text::CSV module, so I got rid of it and went back to traditional looping over the CSV line by line with a split on commas, which preserves the zeroes.

However, I then found the "mystery" of isdual in Perl, and because of my limited set of modules I cannot use Dumper. I also explored the guts of Perl's internals and tried $x ^ $x, which was deprecated in version 5.22 but is valid up to that version (mine is 5.6, as I said). This is useful to tell numbers from strings. However, while if( $x ^ $x ) returns TRUE for strings, if( !( $x ^ $x ) ) does not return TRUE when $x = 0.

[UPDATE: I tried this in a dedicated Perl script, written just for this purpose, and there it works. I believe my probably wrong conclusion ("not returning TRUE") was reached before I realized that Text::CSV was swallowing my zeroes. Doing new tests...]
I will very much appreciate your help!
MORE DETAILS ON MY REQUIREMENTS:
1) This is a dynamic report coming from a database, which is handed over to me and which I pick up programmatically from a folder. Dynamic means it may have any number of tables, any number of columns in each table, any names as column headers, and any number of rows in each table.
2) I do not know, and cannot know, the column names, because they vary from report to report. So I cannot be guided by column names.
A sample input:
Alfa,Alfa1,Beta,Gamma,Delta,Delta1,Epsilon,Dseta,Heta,Zeta,Iota,Kappa
0,J5,alfa,0,111.33,124.45,0,0,456.85,234.56,798.43,330000.00
M1,0,X888,ZZ,222.44,111.33,12.24,45.67,0,234.56,0,975.33
3) Input Explanation
a) This is an example of a random report with 12 columns and 3 rows. The first row is the header.
b) I call "standalone zeroes" those "clean" zeroes in the CSV file, from the second row onwards, sitting between commas: like 0, at the first position of a row, or like ,0, in subsequent positions.
c) In the second row of the example you can read, from the beginning of the row, 0,J5,alfa,0, which in this particular case are "words" or "strings": 4 names (note that two of them are zeroes, which need to be treated as strings). Thus we have a 4-name-column example (Alfa,Alfa1,Beta,Gamma are the headers of those columns, but only in this scenario). From that point onwards in the second row you see floating-point (*.00) numbers and, among them, 2 zeroes, which are numbers. Finally, in the third row you can read M1,0,X888,ZZ, which are the names of the first 4 columns. Note, please, that the 4th column of the second row has 0 as its name, while the 4th column of the third row has ZZ as its name.
Summary: as a general picture, I have a table report divided into 2 parts, from left to right: 4 columns for names and 8 columns for numbers.
The first M columns are always names and the last N columns are always numbers.
- It is unknown which number M is: how many columns devoted to words/strings I will receive.
- It is unknown which number N is: how many columns devoted to numbers I will receive.
- It is KNOWN that the N number columns always start right after the M name columns end, and that this is constant for all rows.
I have done some quick research on Perl's regex boundary ( \b ), and I have not found any relevant information on whether it applies in Perl 5.6 or not.
However, since you are using an old Perl version, try the traditional UNIX/Linux style (I mean, what Perl inherits from the shell), like this:
system("perl -i -ple 's/^0/string/g' myfile.csv");
The previous regex should do the job, making the change at the start of each line of your CSV file, when it matches.
Or, maybe better (if you have those "standalone" zeroes and want to avoid any unwanted change to some string with leading zeroes):
system("perl -i -ple 's/^0,/string,/g' myfile.csv");
[Note that I have added the comma after the zero and, of course, after the replacement string.]
Note that the first regex should work; the second one is just a precaution, to be extra cautious.
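Another thing worth checking, although this is an assumption on my side and not something you have confirmed: inside a double-quoted Perl string, \b is interpolated as a backspace character before system() ever runs, so the child perl never sees a word boundary at all. Escaping the backslashes, or switching to a non-interpolating quote, may be all your original one-liner needed:

# \b inside "..." becomes a backspace; double the backslash so the
# child perl actually receives \b as a word boundary.
system("perl -i -ple 's/\\b0\\b/string/g' myfile.csv");

# Or avoid interpolation entirely with a single-quoted form:
system(q{perl -i -ple 's/\b0\b/string/g' myfile.csv});

That would also explain why your other one-liners, like s/input/output/g, work fine through system(): they contain no backslash sequences to mangle.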
I've been trying to develop a program to be used for DMing in an MMORPG, but I'm having trouble working out the actual pattern I need.
To quote myself from another thread on a less active forum:
I've officially taken over the DiceRoller addon from years and years ago; I've reworked it a lot since taking it over and done a lot of testing in game. While I haven't uploaded anything yet, I've been struggling with a piece of pattern matching that is currently crucial to the design of the addon.
Some background: the newest iteration of the DiceRoller addon lets you type "!XdY" (where X is the number of dice, Y is the dice value) into raid chat; the DM who has the addon runs it through some logic in the addon (Lua's random number facilities) and then spits out the result after adding up the dice.
It is as follows:
local count, size = string.match(message, "^!(%d+)[dD](%d+)$")
Now I need it to parse both "!XdY" AND "!XdY+Z", but it seems I can't get close to "!XdY+Z" no matter which pattern I use, since it has to handle both forms. I can provide more source code context if necessary.
This is the closest I've ever gotten:
http://i.imgur.com/eMhPHQB.png
and this is with the pattern:
local count, size, modifier = string.match(message, "^!(%d+)[dD](%d+)+?(%d+)$")
As you can see, with the modifier present it works just fine. However, remove the modifier and the pattern still treats the input as "XdY+Z": with "1d20" it thinks it is "1d2+0", it thinks 1d200 is "1d20+0", etc. I've tried moving the optional quantifier "?" around, but that just causes the pattern not to match at all: with !1d2 it doesn't work. It's almost as if the optional part NEEDS to be there?
Thanks for the help ahead of time, I've always struggled with regex.
local function dice(input)
  -- "%+?" is an optional literal plus; "(%d*)" may capture an empty string.
  local count, size, modifier = input:match"^!(%d+)[dD](%d+)%+?(%d*)$"
  if count then
    -- "0"..modifier turns an empty modifier into "0" (tonumber("") is nil).
    return tonumber(count), tonumber(size), tonumber("0"..modifier)
  end
end

for _, input in ipairs{"!1d6", "!1d24", "!1d200", "!1d2+4", "!1d20+24"} do
  print(input, dice(input))
end
Output:
!1d6 1 6 0
!1d24 1 24 0
!1d200 1 200 0
!1d2+4 1 2 4
!1d20+24 1 20 24
Lua patterns are very limited. With a full regex engine you would use something like ^!(\d+)[dD](\d+)(?:\+(\d+))?$, but this isn't supported here because (?:\+(\d+))? uses a non-capturing group and a quantifier on a group, neither of which Lua patterns support.
Consider using a regex library like this one, which lets you use PCRE (the engine PHP uses), one of the most complete engines around. But that would be overkill if you only need it for this one pattern. You could instead do it in plain code; it wouldn't be hard for a simple task like this.
While Lua patterns are not powerful enough to parse this with one expression (they don't support optional groups), there is an easy option: handle it with two expressions:
-- Check the longer expression first. The unescaped "+" after the second
-- capture is a literal plus sign: quantifiers only apply to character classes.
local count, size, modifier = string.match(message, "^!(%d+)[dD](%d+)+(%d+)$")
if not count then
  count, size = string.match(message, "^!(%d+)[dD](%d+)$")
end
I am using fluentd, Elasticsearch and Kibana to organize logs. Unfortunately, these logs are not written in any standard format like Apache's, so I had to come up with the regex for the format myself. I used this site to verify that it works: http://fluentular.herokuapp.com/ .
The logs have roughly this format:
DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.
The format regex I am using is as follows:
format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/
Now, judging by that website, which is supposed to test exactly fluentd's behaviour with regexes, the output SHOULD be this:
Record
Key Value
pri DEBUG
date 24.04.2014
subject SingleActivityStrategy
msg Start Activitiy 'barbecue' zu verabeiten.
Instead, though, I get this bug(?): pri is always shortened to DEBU. The same goes for ERROR, which becomes ERRO; only INFO stays INFO. I am not very experienced with regular expressions, and I find it hard to believe this is a bug, but it confuses me, and any help is greatly appreciated.
I'm not sure I can post the complete config file, because I don't personally own these log files and I am trying to keep this on a level where my boss won't get mad at me for posting sensitive information; but should it definitely be needed, I will post it later on, after asking him how much I can reveal.
In general, the logs always look roughly like this:
First the priority, which is either DEBUG, ERROR or INFO; next the date; next what we call the subject, which is always written in [ ]; and finally just a message.
Here is a link to Fluentular with the format I am using and a test string that produces the right result in Fluentular, but not in my config file:
Fluentular
Sorry I couldn't make it work as a regular clickable link.
Another link to test out regexes with my format and test string is this one:
http://rubular.com/r/dfXOkQYNXP
tl;dr version:
my td-agent format regex cuts off the last letter of the priority, although Fluentular says it shouldn't. My fault or a bug?
Here is how the regex would look if you're trying to match the data specifically:
(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)
In your format string you were using . and ... where your spaces and the colon should be. I'm not too sure why this works in Fluentular, but you should match the : explicitly, and each space between the values. (Most likely that is also where the truncation comes from: [DEBUG] is a character class matching any one of the letters D, E, B, U, G, so ([INFO]|[DEBUG]|[ERROR])+ matches any run of those letters; and since . matches anything, the engine can backtrack, hand the final G to the following ..., and leave pri as DEBU.)
So you'd be looking at the following regular expression with the Fluentd fields (which are named capture groups):
(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))
Meaning your td-agent.conf should look like:
<source>
type tail
path /var/log/foo/bar.log
pos_file /var/log/td-agent/foo-bar.log.pos
tag foo.bar
format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
</source>
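If you want to sanity-check the expression outside of td-agent first, a quick Ruby snippet helps (fluentd's format regexes are Ruby regexes; the sample line is the one from your question):

# Verify that the corrected pattern captures the full priority.
line = "DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten."
re = /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
m = line.match(re)
puts m[:pri]      # DEBUG -- no longer truncated to DEBU
puts m[:date]     # 24.04.2014
puts m[:subject]  # SingleActivityStrategy
puts m[:msg]      # Start Activitiy 'barbecue' zu verabeiten.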
I would also take a look at comparing Logstash with Fluentd. I like Logstash far more, because you create Grok filters to match the kind of data you want, and it makes formatting your fields much easier since you work against an abstraction layer; but you essentially get the same data.
And I would watch out when using sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like RegExr, which gives immediate feedback and also lets you set global and multiline matching.
I get travel confirmations that look like this:
"SQ 966 E 27JUL SINCGK"
= "Airline Space Flight Space BookingClass Space Date_with_Month_as_name Space 3LetterFrom 2LetterTo".
I can chop all this into pieces using a regex to submit it to a website. But the site would expect instead of 27JUL 27/07/2009 or at least 27/07. Is there a way to transform a regex result based on a piece in the input. Jan -> 01, Feb -> 02 ... Dec -> 12.
(Regex flavour is Java)
DateFormat is a more appropriate class:
DateFormat output = new SimpleDateFormat("dd/MM", Locale.US);
DateFormat input = new SimpleDateFormat("dd MMM", Locale.US);
System.out.println(output.format(input.parse("24 Dec")));
output:
24/12
In Perl syntax (s{pattern}{replacement}):
s{([0-9][0-9])JAN}{\1/01}
s{([0-9][0-9])FEB}{\1/02}
s{([0-9][0-9])MAR}{\1/03}
s{([0-9][0-9])APR}{\1/04}
s{([0-9][0-9])MAY}{\1/05}
s{([0-9][0-9])JUN}{\1/06}
s{([0-9][0-9])JUL}{\1/07}
s{([0-9][0-9])AUG}{\1/08}
s{([0-9][0-9])SEP}{\1/09}
s{([0-9][0-9])OCT}{\1/10}
s{([0-9][0-9])NOV}{\1/11}
s{([0-9][0-9])DEC}{\1/12}
(Yes this is long and ugly, but it would probably work).
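A more compact equivalent, if a lookup table is acceptable (still Perl; an untested sketch of the same idea):

# Map the month abbreviation through a hash instead of twelve substitutions.
my %mon = (JAN => '01', FEB => '02', MAR => '03', APR => '04',
           MAY => '05', JUN => '06', JUL => '07', AUG => '08',
           SEP => '09', OCT => '10', NOV => '11', DEC => '12');
s{([0-9][0-9])(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)}{$1/$mon{$2}};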
I would be very careful doing this with regular expressions, as they don't tell you how the conversion went.
Extract every bit of information manually, sanity-check everything, and then use the SimpleDateFormat parser to get a Date object you can use from there on.
It isn't a regex solution, but you could use SimpleDateFormat to help with your final formatting. Note from the JavaDoc that it is not a thread-safe option out of the box.
Alternatively, you could use DateFormatSymbols.getShortMonths() and iterate over the months to identify the index* and format your string manually.
*don't forget to add 1 ;)
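A sketch of that second approach; the way "27JUL" is split into day and month here is my assumption about your token format:

import java.text.DateFormatSymbols;
import java.util.Locale;

public class MonthToken {
    // Convert "27JUL" to "27/07" by looking the month up in the
    // locale's short month names ("Jan", "Feb", ...).
    static String toNumeric(String token) {
        String day = token.substring(0, 2);
        String mon = token.substring(2);
        String[] months = new DateFormatSymbols(Locale.US).getShortMonths();
        for (int i = 0; i < months.length; i++) {
            if (months[i].equalsIgnoreCase(mon)) {
                return String.format("%s/%02d", day, i + 1); // the +1 from above
            }
        }
        throw new IllegalArgumentException("unknown month: " + mon);
    }

    public static void main(String[] args) {
        System.out.println(toNumeric("27JUL")); // prints 27/07
    }
}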
edit:
I am not sure what you are looking for is possible in Java regex without making code changes. The conditional constructs that Perl supports are not supported by Java, because Java provides if-then-else support as a language feature.