Value matching in regex and OpenRefine - regex

I am trying to use the value.match command in OpenRefine 2.6 for splitting two columns based on a 4 number date.
A sample of the text is:
"first sentence, second sentence, third sentences, 2009"
I go to "Add column based on this column" and enter
value.match(\d{4})
but I get the error
Parsing error at offset 12: Missing number, string, identifier, regex,
or parenthesized expression
Any idea what the solution might be?

You need to fix 3 things to get this working:
1) As Wiktor says, you need to start and end the regular expression with a forward slash /
2) The 'match' function requires the regular expression to match the whole string in the cell, not just the fragment you need
3) To extract part of a string with 'match' you need capture groups in your regular expression, that is, ( ) around the part of the regular expression you want to extract. The captured values are returned in an array, and you need to take the string out of the array to store it in a cell
So you'll need something like:
value.match(/.*(\d{4})/)[0]
to get the four-digit year from the end of the string.
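For readers more used to Python than GREL, the same two ideas (match the whole string, pull the year out with a capture group) can be sketched with re.fullmatch. This is only an analogy, not OpenRefine code; the helper name extract_year is made up for the illustration.

```python
import re

cell = "first sentence, second sentence, third sentences, 2009"

def extract_year(text):
    # fullmatch must cover the entire string, so ".*" absorbs everything
    # before the final four digits; the parentheses capture the year.
    m = re.fullmatch(r".*(\d{4})", text)
    return m.group(1) if m else None

print(extract_year(cell))              # 2009
# Without the leading ".*" a whole-string match fails:
print(re.fullmatch(r"\d{4}", cell))    # None
```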


Regex to parse json formatted string

I have a JSON formatted string which I am trying to parse using a regex. I would like to parse each key value pair for later use in grafana (the regex itself is used in logstash).
The test string looks like this:
{
"version":"1.1",
"nameId":"test",
"productId":"B2",
"total customers":99,
"full_description":"asdf"
}
I am using the following regex, but if the value is a number (without " "), the comma is included in the captured value. For example, the group value for the key "total customers" is "99," and not just "99".
(?i)["'](?<key>[^"]*)["'](?:\:)["'\{\[]?([\r\n]?\t+\")?(?<value>\w(?:\s[a-zA-Z0-9_=]\.?)+\w+#(?:(?:\w[a-z\d\-]+\w)\.)+[a-z]{2,10}|true|false|[\w+-.$\s=-]*)(",[\r\n])?(?2)?(?J)(?<value>(?&value))?
What do I have to add to the regex expression in order to parse JSON-values which are numbers?
The part of the pattern [\w+-.$\s=-] contains the range +-. rather than the three literal characters +, - and .
That range covers ASCII code points 43 to 46, and code point 44 is the unwanted comma.
Since the character class already matches - at the end, you can omit the middle -.
The pattern also contains some superfluous escapes and capture groups and seems a bit complicated. An updated pattern with just 2 capture groups could look like:
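A quick way to see the accidental range is to test just that character class in isolation. The sketch below uses Python's re module rather than the Logstash engine, and the helper name leading_value is only for illustration, but the range behaviour inside a character class is the same.

```python
import re

buggy = re.compile(r"[\w+-.$\s=-]*")   # "+-." is parsed as the range '+' to '.'
fixed = re.compile(r"[\w+.$\s=-]*")    # '+' and '.' as literals, '-' only at the end

def leading_value(pattern, text):
    # Return whatever the character class consumes from the start of text.
    return pattern.match(text).group(0)

# The range '+' .. '.' spans code points 43 to 46, which includes the comma.
print([chr(c) for c in range(ord("+"), ord(".") + 1)])  # ['+', ',', '-', '.']
print(leading_value(buggy, "99,"))  # 99,
print(leading_value(fixed, "99,"))  # 99
```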
(?i)["'](?<key>[^"]*)["']:["'{\[]?(?:[\r\n]?\t+")?(?<value>\w(?:\s[a-zA-Z\d_=]\.?)+\w+#(?:\w[a-z\d-]+\w\.)+[a-z]{2,10}|true|false|[\w+.$\s=-]*)(?:",[\r\n])?(?2)?(?J)(?<value>(?&value))?

Regular Expression - joining two lines, but first number of joined 2nd line is deleted

I have some sample data (simplified extract below - the real file contains 52,000 lines, with pairs of lines, the 2nd line of each pair is always a date field, and there are always 2 blank lines between each data pair):
The colour of money 20170233434
10-DEC-2015
SOME TEST DATA 32423412123
19-OCT-2015
I want to join each line up, using a Regular Expression (I am using TextPad, but I think the RegEx syntax is generic).
I am doing a replace search, and want to end up with this:
The colour of money 20170233434 10-DEC-2015
SOME TEST DATA 32423412123 19-OCT-2015
I am using this in the "Find what" field:
\n^[0|1|2|3|4|5|6|7|8|9]
And replacing with NULL.
The end result I am getting is almost there:
The colour of money 20170233434 0-DEC-2015
SOME TEST DATA 32423412123 9-OCT-2015
But not quite, because the first digit of each date value is being stripped out.
How would I modify the RegEx to not delete the first number of the 2nd line? I tried to replace with [0|1|2|3|4|5|6|7|8|9] but that just put that entire string in front of each date field, and still stripped out the first number of the date.
Just search for this
\r?\n(\d{1,2}\-)
And replace it with $1.
If you want to replace it with null, you can also use a lookahead:
\r?\n(?=\d{1,2}\-)
And replace it with null.
Those regular expressions match only a newline (\n on UNIX or \r\n on Windows) that is followed by one or two digits and then a dash. If you want to be more specific, you could also use this regular expression:
\r?\n(\d{1,2}\-[A-Z]{3}\-\d{4})
Or with a lookahead respectively:
\r?\n(?=\d{1,2}\-[A-Z]{3}\-\d{4})
You could even check for the double line breaks after the date:
\r?\n(\d{1,2}\-[A-Z]{3}\-\d{4}(?:\r?\n){2})
Or with a lookahead respectively:
\r?\n(?=\d{1,2}\-[A-Z]{3}\-\d{4}(?:\r?\n){2})
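If you would rather script this than run it in an editor, here is a minimal Python sketch of the lookahead variant. It assumes the lines carry no trailing space, so it substitutes a single space instead of nothing; the function name join_date_lines and the sample text are illustrative only.

```python
import re

raw = (
    "The colour of money 20170233434\n"
    "10-DEC-2015\n"
    "\n\n"
    "SOME TEST DATA 32423412123\n"
    "19-OCT-2015\n"
)

def join_date_lines(text):
    # The lookahead only checks that a date follows the newline; the date
    # itself is not consumed, so none of its digits are lost.
    return re.sub(r"\r?\n(?=\d{1,2}-[A-Z]{3}-\d{4})", " ", text)

print(join_date_lines(raw))
```

The blank lines between the pairs are left as they are, because the newline in front of them is not followed by a date.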

Remove set of string from a string, multiple occurences

I want to completely remove any part of my string that has
\"AddedDate\":\"\\/Date(1480542000000-0600)\\/\"
The 1480526460000-0600 is not hardcoded, it could be any set of numbers (JSON dates).
Try this regex \"AddedDate\":\"\\\/Date\(\d+(?:-\d+)?\)\\\/\" and replace with empty string. If the regex engine doesn't support \d, replace it with [0-9]. This will match a date in the format x or x-x, x being any number of digits.
If you want to match exactly 13 digits in the first part of the date and 4 in the second, use \"AddedDate\":\"\\\/Date\(\d{13}(?:-\d{4})?\)\\\/\"
EDIT: For new format use \\\"AddedDate\\\":\\\"\\\\\/Date\(\d+(?:-\d+)?\)\\\\\/\\\" it should work.
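As a rough illustration in Python, assuming the text actually contains a single backslash before each slash (i.e. the unescaped form "AddedDate":"\/Date(...)\/"), the removal could look like this. The name strip_added_date and the sample string are made up for the example.

```python
import re

# In a raw string, \\ matches one literal backslash and / matches itself.
ADDED_DATE = re.compile(r'"AddedDate":"\\/Date\(\d+(?:-\d+)?\)\\/"')

def strip_added_date(text):
    # Remove every occurrence, whatever digits the date contains.
    return ADDED_DATE.sub("", text)

sample = r'a "AddedDate":"\/Date(1480542000000-0600)\/" b "AddedDate":"\/Date(1480526460000)\/" c'
print(strip_added_date(sample))  # a  b  c
```

Note that this only removes the key/value text itself; any comma separating it from neighbouring JSON fields would need to be handled separately.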

R digit-expression and unlist doesn't work

So I've bought a book on R and automated data collection, and one of the first examples is leaving me baffled.
I have a table with a date-column consisting of numbers looking like this "2001-". According to the tutorial, the line below will remove the "-" from the dates by singling out the first four digits:
yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]4$"))
When I run this command, "yend_clean" is simply set to "character (empty)".
If I remove the "4$", I get all of the dates split into atoms so that the list that originally looked like this "1992", "2003" now looks like this "1", "9" etc.
So I suspect that something around the "4$" is the problem. I can't find any documentation on this that helps me figure out the correct solution.
Was hoping someone in here could point me in the right direction.
This is a regular expression question. Your regular expression is wrong. Use:
unlist(str_extract_all("2003-", "^[[:digit:]]{4}"))
or equivalently
sub("^(\\d{4}).*", "\\1", "2003-")
or, if all you really want is to remove the "-"
sub("-", "", "2003-")
Repetition in regular expressions is controlled by quantifiers such as {4}. You were missing the braces. Additionally, $ matches the end of the string, so your expression translates as:
match any single digit, followed by a 4, followed by the end of the string
When you remove the "4", then the pattern becomes "match any single digit", which is exactly what happens (i.e. you get each digit matched separately).
The pattern I propose says instead:
match the beginning of the string (^), followed by a digit repeated four times.
The sub variation is a very common technique where we create a pattern that matches what we want to keep in parentheses, and then everything else outside of the parentheses (.* matches anything, any number of times). We then replace the entire match with just the piece in the parens (\\1 means the first sub-expression in parentheses). \\d is equivalent to [[:digit:]].
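The same behaviour is easy to reproduce outside R. Here is a small Python sketch of the three cases discussed above (the broken pattern, the pattern without "4$", and the corrected one); first_four_digits is just an illustrative helper name.

```python
import re

def first_four_digits(s):
    # re.match anchors at the start of the string, like ^ in the R pattern.
    m = re.match(r"[0-9]{4}", s)
    return m.group(0) if m else None

# "one digit, then a literal 4, then end of string": nothing in "2001-" fits.
print(re.findall(r"[0-9]4$", "2001-"))          # []
# A bare digit class matches every digit separately.
print(re.findall(r"[0-9]", "1992"))             # ['1', '9', '9', '2']
# The quantifier {4} asks for four digits in a row.
print(first_four_digits("2003-"))               # 2003
# Capture what you want to keep, then replace the whole match with it.
print(re.sub(r"^([0-9]{4}).*", r"\1", "2003-")) # 2003
```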
If you mean the book Automated Data Collection with R, the code could be like this:
yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]{4}[-]$"))
yend_clean <- unlist(str_extract_all(yend_clean, "^[[:digit:]]{4}"))
This assumes that you have a string such as "1993–2007, 2010-" and want the last given year, "2010". The first line, which matches four digits followed by a dash at the end of the string, returns "2010-", and the second line returns "2010".
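The two-step idea (first grab the trailing "year plus dash", then keep only its digits) can be sketched in Python as follows. The function name last_open_year is hypothetical and only serves the example.

```python
import re

def last_open_year(s):
    # Step 1: four digits followed by a hyphen at the very end of the string.
    tail = re.search(r"[0-9]{4}-$", s)
    if tail is None:
        return None
    # Step 2: keep only the four leading digits of that tail.
    return re.match(r"[0-9]{4}", tail.group(0)).group(0)

print(last_open_year("1993–2007, 2010-"))  # 2010
print(last_open_year("1992"))              # None
```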

Regular expression string followed by numbers

I am writing a regular expression to extract phrases like #Question1# or #Question125# from html string like
Patitent name #Question1#, Patient was suffering from #Question2#, Patient's gender is #Question3#, patient has #Question4# drinking for the last month. His DOB is #Question5#
The first half of the expression is simple just #Question, but I also need to match for a series of digits with unspecified length, and the whole string ends with #.
Once I find the matching phrase, how do I extract only the digits from the string? For example, from #Question312# I just want to get 312.
Any suggestion?
The regexp you are looking for is
/#Question[0-9]+#/
If you need to extract the number you can just wrap the [0-9]+ part in parentheses
/#Question([0-9]+)#/
making it a group. How you use a captured group depends on the specific regexp implementation (e.g. python, perl, javascript ...). For example in python you can replace all those questions with corresponding answers from a list with
import re

answers = ["Andrea", "Griffini"]
text = "My first name is #Question1# and my last name is #Question2#"
# Replace each #QuestionN# with the N-th answer (question numbers are 1-based)
print(re.sub(r"#Question([0-9]+)#",
             lambda m: answers[int(m.group(1)) - 1],
             text))
I think what you are looking for is:
#Question[0-9]+#
#Question
Any character in this class: [0-9], one or more repetitions
#
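If the goal is only to pull the numbers out rather than substitute anything, a capture group combined with re.findall is enough. A minimal Python sketch, with question_numbers as an illustrative name and a shortened sample string:

```python
import re

html = "Patient name #Question1#, gender is #Question3#, see also #Question312#"

def question_numbers(text):
    # With one capture group, findall returns only the captured digits.
    return [int(n) for n in re.findall(r"#Question([0-9]+)#", text)]

print(question_numbers(html))  # [1, 3, 312]
```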