parse csv using lua script - regex

I have a csv file that has data like this:
+12345678901,08:00:00,12:00:00,1111100,35703,test.domain.net
+12345678901,,,0000000,212,test.domain.net
I'm trying to write lua code that will loop through each line, and create an array of values like this:
local mylist = {}
for line in io.lines("data/dd.csv") do
    local id, start, finish, dow, int, domain = line:match("(+%d+),(%d*:*),(%d*:*),(%d*),(%d*),(%a*.*)")
    mylist[#mylist + 1] = { id = id, start = start, finish = finish, dow = dow, int = int, domain = domain}
    print(mylist[#mylist]['id'])
end
The problem is that when the code hits a line that has empty values for start and finish, the regex fails and all fields are nil.
I thought using the * meant 0 or more...
I can't seem to find my error / typo.
Thanks.

This pattern works for me:
"(.-),(.-),(.-),(.-),(.-),(.-)$"

It seems that you just need to group the digits and : inside a [...]:
match("(%+%d+),([%d:]*),([%d:]*),(%d*),(%d*),(.*)")
        ^       ^^^^^^   ^^^^^^
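Now, [%d:]* matches zero or more characters that are each either a digit or a : symbol. Your pattern did not find a match because %d*:* matches zero or more digits followed by zero or more : symbols, so it cannot span a value such as 08:00:00, where that digits-then-colons sequence occurs more than once.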
Also, you need to escape the first + to make sure it matches a literal +.
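For comparison, here is a minimal sketch of the same character-class idea written for Python's re module rather than Lua (an assumption made purely for illustration; the question itself is about Lua patterns). The name parse_row is hypothetical, and \d and \+ stand in for Lua's %d and %+.

```python
import re

# Python translation of the Lua pattern above: [\d:]* lets digits and
# colons mix freely, so both "08:00:00" and an empty field match.
ROW = re.compile(r"(\+\d+),([\d:]*),([\d:]*),(\d*),(\d*),(.*)")

def parse_row(line):
    """Return a dict of the six fields, or None if the line does not match."""
    m = ROW.fullmatch(line)
    if m is None:
        return None
    keys = ("id", "start", "finish", "dow", "int", "domain")
    return dict(zip(keys, m.groups()))

rows = [
    parse_row("+12345678901,08:00:00,12:00:00,1111100,35703,test.domain.net"),
    parse_row("+12345678901,,,0000000,212,test.domain.net"),
]
for row in rows:
    print(row["id"], repr(row["start"]), repr(row["finish"]))
```

Empty start and finish fields come back as empty strings instead of making the whole match fail.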

The regex in string.format of LUA

I use string.find(str, pattern) in Lua to fetch some keywords.
local RICH_TAGS = {
    "texture",
    "img",
}
--\[((img)|(texture))=
local START_OF_PATTER = "\\[("
for index = 1, #RICH_TAGS - 1 do
    START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index]..")|"
end
START_OF_PATTER = START_OF_PATTER .. "("..RICH_TAGS[#RICH_TAGS].."))"
function RichTextDecoder.decodeRich(str)
    local result = {}
    print(str, START_OF_PATTER)
    dump({string.find(str, START_OF_PATTER)})
end
Output:
hello[img=123] \[((texture)|(img))
dump from: [string "utils/RichTextDecoder.lua"]:21: in function 'decodeRich'
"<var>" = {
}
The output means:
str = hello[img=123]
START_OF_PATTER = \[((texture)|(img))
This regex works well in some online regex tools, but it finds nothing in Lua.
Am I doing something wrong in my code?
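Lua's standard string library does not support regular expressions. string.find uses Lua string patterns, which have no alternation operator (|) and use % rather than \ as the escape character.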
See How to write this regular expression in Lua?
Try dump({str:find("\\%[%(")})
Also note that this loop:
for index = 1, #RICH_TAGS - 1 do
START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index]..")|"
end
will leave out the last element of RICH_TAGS; in your code that element is only added by the separate concatenation after the loop.
Edit:
But what I want is to fetch several specific words. For example, the
pattern should fetch any one of "[img=", "[texture=" or "[font=". With
the regex string I wrote in my question, a regex engine can do the work.
But with Lua, the way to do the job is to write code like string.find(str,
"[img=") and string.find(str, "[texture=") and string.find(str,
"[font="). I wonder whether there is a way to do the job with a single
pattern string. I tried a pattern string like "%[%a*=", but obviously it
will match a lot more strings than I need.
You cannot match several alternative words with a single Lua pattern, because Lua patterns have no alternation; one pattern only works if the words appear in the string in a fixed order. The only thing you could do is put all the characters that make up those words into a character class, but then you risk matching any word that can be built from those letters.
Usually you would match each word with a separate pattern or you match any word and check if the match is one of your words using a look up table for example.
So basically you do what a regex library would do in a few lines of Lua.
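The "match any word, then check a lookup table" approach can be sketched as follows. This is an illustrative Python version, not Lua: Python's re does support alternation, so the sketch deliberately avoids it and uses only a generic tag pattern plus a set, which is the shape the Lua code would take with a pattern like "%[(%a+)=" and a table. The names TAG_START and find_rich_tag are hypothetical.

```python
import re

# Allowed tag names; in Lua this would be a table used as a set.
RICH_TAGS = {"texture", "img"}

# Generic "[word=" matcher with no alternation, mirroring what a single
# Lua pattern can express.
TAG_START = re.compile(r"\[(\w+)=")

def find_rich_tag(text):
    """Return (start, end, tag) for the first known tag, or None."""
    for m in TAG_START.finditer(text):
        if m.group(1) in RICH_TAGS:
            return m.start(), m.end(), m.group(1)
    return None

print(find_rich_tag("hello[img=123]"))
print(find_rich_tag("hello[font=123]"))
```

Unknown tags such as [font= are matched by the generic pattern but rejected by the set lookup, so adding a new tag only means adding one entry to RICH_TAGS.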

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are new line characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the actor's name, there is either the word as or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
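As a quick check, the regex can be wrapped in a small helper and run against both examples from the question. The helper name actor_names and the variable names are hypothetical; the pattern is the one given above.

```python
import re

# Same pattern as above; re.MULTILINE makes ^ and $ match at line boundaries.
ACTOR = re.compile(r"^(?:[\w ]+?)(?:(?= as )|$)", re.MULTILINE)

def actor_names(cast):
    """Return the actor name at the start of each line of cast."""
    return ACTOR.findall(cast)

example_1 = "Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze"
example_2 = "Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik"

print(actor_names(example_1))
print(actor_names(example_2))
```

Because the pattern contains no capturing groups, findall returns the full text of each match.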
You can also do this without a regular expression.
Here is the code (splitting on ' as ' with both spaces, so that a surname beginning with "as" is not cut off):
output = [x.split(' as ')[0] for x in input.split('\n')]
I guess you can combine the values obtained from two alternative regex matches:
re.findall('(?:\\n)?(.+)(?:\W[a][s].*?)|(?:\\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

regex replace : if not followed by letter or number

Okay, so I wanted a regex to parse uncontracted (if that's what it is called) IPv6 addresses.
Example IPv6 address: 1050:::600:5:1000::
What I want returned: 1050:0000:0000:600:5:1000:0000:0000
My try at this:
ip:gsub("%:([^0-9a-zA-Z])", ":0000")
The first problem with this: It replaces the first and second :
So :: gets replaced with :0000
Replacing it with :0000: wouldn't work because then the result would end with a :. It also would not parse the newly added :, resulting in: 1050:0000::600:5:1000:0000:
So what would I need this regex to do?
Replace every : by :0000 if it isn't followed by a number or letter
Main problem: :: gets replaced instead of 1 :
gsub and other functions from Lua's string library use Lua Patterns which are much simpler than regex. Using the pattern more than once will handle the cases where the pattern overlaps the replacement text. The pattern only needs to be applied twice since the first time will catch even pairings and the second will catch the odd/new pairings of colons. The trailing and leading colons can be handled separately with their own patterns.
ip = "1050:::600:5:1000::"
ip = ip:gsub("^:", "0000:"):gsub(":$", ":0000")
ip = ip:gsub("::", ":0000:"):gsub("::", ":0000:")
print(ip) -- 1050:0000:0000:600:5:1000:0000:0000
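To see why two passes are needed, here is a small Python sketch of the same steps that keeps the intermediate result (Python's str.replace, like gsub, replaces non-overlapping occurrences from left to right). The function name expand_in_two_passes is hypothetical.

```python
def expand_in_two_passes(ip):
    """Return the string after one and after two '::' replacement passes."""
    # Handle a leading or trailing colon first, as in the Lua version.
    if ip.startswith(":"):
        ip = "0000" + ip
    if ip.endswith(":"):
        ip = ip + "0000"
    # Each pass only sees non-overlapping "::" pairs, so a run of three
    # colons still contains one "::" after the first pass.
    once = ip.replace("::", ":0000:")
    twice = once.replace("::", ":0000:")
    return once, twice

once, twice = expand_in_two_passes("1050:::600:5:1000::")
print(once)
print(twice)
```

After the first pass the run of three colons has become ":0000::", and the second pass fills the remaining empty group.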
There is no single statement pattern to do this but you can use a function to do this for any possible input:
function fill_ip(s)
    local ans = {}
    for s in (s..':'):gmatch('(%x*):') do
        if s == '' then s = '0000' end
        ans[ #ans+1 ] = s
    end
    return table.concat(ans,':')
end
--examples:
print(fill_ip('1050:::600:5:1000::'))
print(fill_ip(':1050:::600:5:1000:'))
print(fill_ip('1050::::600:5:1000:1'))
print(fill_ip(':::::::'))
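The same split-and-fill idea, sketched in Python for comparison. This is only an illustration of the logic: unlike the Lua pattern %x*, it does not check that each group consists of hex digits, and it makes no attempt to validate the address.

```python
def fill_ip(s):
    """Replace every empty colon-separated group with '0000'."""
    # split(":") yields an empty string for each empty group, including
    # those before a leading colon and after a trailing colon.
    return ":".join(group or "0000" for group in s.split(":"))

print(fill_ip("1050:::600:5:1000::"))
print(fill_ip(":1050:::600:5:1000:"))
print(fill_ip("1050::::600:5:1000:1"))
print(fill_ip(":::::::"))
```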

CF Regex REFind() substring without quotes

My CF backend has to read through a CFM file as if it was a TEXT file to extract the names and values of different parameters, the data looks like this :
request.config.MY_PARAM_1 = 'ABCDEFGHI';
request.config.MY_PARAM_2 = "BlaBlaBla";
request.config.MY_PARAM_3 = TRUE;
request.config.MY_PARAM_4 = 'true';
request.config.MY_PARAM_5 = "1337";
request.config.MY_PARAM_6 = 1337;
As you can see, I can have STRINGS which can be SINGLE or DOUBLE quoted.
I also have BOOLEANS and NUMBERS, which are usually unquoted but can also be quoted (single or double).
I am "parsing" the file and extracting the values, I want to find a pattern that would return the matches like this :
request.config.MY_PARAM_2 = "BlaBlaBla";
I am VERY close to succeeding, but unfortunately the following expression cannot get rid of the closing quote.
<cfset match = REFind("^request\.config\.(\S+) = ['|""]?(.*)['|""]?;$", str, 1, "Yes")>
<cfset paramVal = Mid( str, match.pos[3], match.len[3] ) >
<cfdump var=#paramVal# >
For example, it returns BlaBlaBla": it has successfully omitted the opening quote but not the closing one. What am I doing wrong?
From your comments, it sounds like you're saying that you want to parse two ARBITRARY lines. This will do it:
^(?:[^\n]*\n){1}request\.config\.(\w+)\s*=\s*(['"]?)(\w+)\2;(?:[^\n]*\n){4}request\.config\.(\w+)\s*=\s*(['"]?)(\w+)\5;
In your code, just change the two numbers in the quantifiers: {1} and {4} as they specify how many lines to skip at the top and in the middle. For line 1, for instance you would have {0} in the first quantifier.
The data you want is in Groups 1, 3, 4 and 6 (Groups 2 and 5 capture the optional quote character so that the backreferences \2 and \5 can require the same closing quote).
I am sure you will have no trouble building the regex in code by concatenating the pieces:
method Parse(x,y)
Build the regex by concatenating
^(?:[^\n]*\n){
With
x-1
With
}request\.config\.(\w+)\s*=\s*(['"]?)(\w+)\2;(?:[^\n]*\n){
With
y-x
With
}request\.config\.(\w+)\s*=\s*(['"]?)(\w+)\5;
Then match and retrieve Groups 1, 3, 4 and 6
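A minimal sketch of that concatenation, written in Python rather than CFML as an illustration. The names SKIP, PARAM and parse_two_lines are hypothetical, and the sample text is the data from the question. With this layout the names and values land in groups 1, 3, 4 and 6, while groups 2 and 5 hold the optional quote character.

```python
import re

# Skips a given number of whole lines.
SKIP = r"(?:[^\n]*\n){%d}"
# One "request.config.NAME = VALUE;" line; %d is the number of the quote
# group, so the backreference demands the same closing quote (or none).
PARAM = r"request\.config\.(\w+)\s*=\s*(['\"]?)(\w+)\%d;"

def parse_two_lines(text, x, y):
    """Return (name_x, value_x, name_y, value_y) for 1-based lines x < y."""
    pattern = "^" + SKIP % (x - 1) + PARAM % 2 + SKIP % (y - x) + PARAM % 5
    m = re.match(pattern, text)
    if m is None:
        return None
    return m.group(1), m.group(3), m.group(4), m.group(6)

CONFIG = "\n".join([
    "request.config.MY_PARAM_1 = 'ABCDEFGHI';",
    'request.config.MY_PARAM_2 = "BlaBlaBla";',
    "request.config.MY_PARAM_3 = TRUE;",
    "request.config.MY_PARAM_4 = 'true';",
    'request.config.MY_PARAM_5 = "1337";',
    "request.config.MY_PARAM_6 = 1337;",
])

print(parse_two_lines(CONFIG, 2, 6))
```

The backreference is also what keeps the closing quote out of the value: the quote is consumed by \2 or \5 instead of by the value group.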

Replace using RegEx outside of text markers

I have the following sample text and I want to replace '[core].' with something else but I only want to replace it when it is not between text markers ' (SQL):
PRINT 'The result of [core].[dbo].[FunctionX]' + [core].[dbo].[FunctionX] + '.'
EXECUTE [core].[dbo].[FunctionX]
The result should be:
PRINT 'The result of [core].[dbo].[FunctionX]' + [extended].[dbo].[FunctionX] + '.'
EXECUTE [extended].[dbo].[FunctionX]
I hope someone can understand this. Can this be solved by a regular expression?
With RegLove
Kevin
Not in a single step, and not in an ordinary text editor. If your SQL is syntactically valid, you can do something like this:
First, you remove every string literal from the SQL and replace it with a placeholder. Then you replace [core] with something else. Finally, you restore the original text of each placeholder from step one:
Replace every occurrence of '(?:''|[^'])+' with 'n', where n is an index number (the number of the match). Store each match in an array at index n. This removes all SQL strings from the input and exchanges them for harmless replacements without invalidating the SQL itself.
Do your replace of [core]. No regex required, normal search-and-replace is enough here.
Iterate the array, replacing the placeholder '1' with the first array item, '2' with the second, up to n. Now you have restored the original strings.
The regex, explained:
' # a single quote
(?: # begin non-capturing group
''|[^'] # either two single quotes, or anything but a single quote
)+ # end group, repeat at least once
' # a single quote
In JavaScript this would look something like this:
var sql = 'your long SQL code';
var str = [];
// step 1 - replace everything that looks like an SQL string with a numbered placeholder
var newSql = sql.replace(/'(?:''|[^'])+'/g, function (m) {
  str.push(m);
  return "'" + (str.length - 1) + "'";
});
// step 2 - actual replacement (a global regex is needed to replace every occurrence)
newSql = newSql.replace(/\[core\]/g, "[extended]");
// step 3 - restore all original strings in a single pass; using a callback
// keeps "$" in the stored strings from being treated as a replacement pattern
newSql = newSql.replace(/'(\d+)'/g, function (m, i) {
  return str[i];
});
// done.
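The three steps can also be sketched in Python. This is an illustrative version under stated assumptions, not the answer's own code: the function name replace_outside_strings is hypothetical, and the string-literal regex uses * instead of + so that an empty literal '' is also stashed (a deliberate deviation from the regex explained above).

```python
import re

# An SQL string literal: a quote, then doubled quotes or non-quote
# characters, then a closing quote. Assumption: * (not +) so that the
# empty literal '' is treated as a string too.
STRING_LITERAL = re.compile(r"'(?:''|[^'])*'")

def replace_outside_strings(sql, old, new):
    """Replace old with new everywhere except inside SQL string literals."""
    saved = []

    def stash(match):
        # Step 1: swap each literal for a numbered placeholder such as '0'.
        saved.append(match.group(0))
        return "'%d'" % (len(saved) - 1)

    masked = STRING_LITERAL.sub(stash, sql)
    # Step 2: plain search-and-replace on the masked text.
    masked = masked.replace(old, new)
    # Step 3: restore every placeholder in a single pass.
    return re.sub(r"'(\d+)'", lambda m: saved[int(m.group(1))], masked)

sql = (
    "PRINT 'The result of [core].[dbo].[FunctionX]' + [core].[dbo].[FunctionX] + '.'\n"
    "EXECUTE [core].[dbo].[FunctionX]"
)
print(replace_outside_strings(sql, "[core].", "[extended]."))
```

Restoring with one regex pass and a callback avoids re-scanning text that has already been restored, which matters if a stored literal happens to contain something that looks like a placeholder.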
Here is a solution (javascript):
str.replace(/('[^']*'.*)*\[core\]/g, "$1[extended]");