Case Insensitive Regex expression for getting file - kettle

I have a scenario where I am picking up files from a folder for data loading, and the files follow the naming convention .Customer_..txt. I would also like to make this expression case-insensitive, so that if a file named CUSTOMER_1234 arrives, it is accepted and processed as well.

Try the regex below:
(?i)customer(?-i).*\.txt
Put it in the wildcard section of the "get files" step, or of any other regex step you are using. This will select the files whose names start with "customer" in any letter case, for example "customer" or "CUSTOMER".
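To see how this pattern behaves, here is a minimal Python sketch. It is not Kettle code: Kettle evaluates the wildcard as a Java regular expression, and Python does not accept a bare (?-i), so the sketch uses the scoped form (?i:customer) instead. It also assumes the wildcard has to match the whole file name. The file names and the helper name are made up for the example.

```python
import re

# Python adaptation of (?i)customer(?-i).*\.txt: only "customer" is
# case-insensitive, the ".txt" extension stays case-sensitive.
CUSTOMER_FILE = re.compile(r"(?i:customer).*\.txt")


def select_customer_files(names):
    """Return the names that match the whole pattern (hypothetical helper)."""
    return [name for name in names if CUSTOMER_FILE.fullmatch(name)]


if __name__ == "__main__":
    sample = ["customer_1.txt", "CUSTOMER_1234.txt", "Customer_x.TXT", "orders_1.txt"]
    print(select_customer_files(sample))
```

In this sketch "Customer_x.TXT" is rejected because the case-insensitive part ends before the extension.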
Hope this helps :)
Modifying my previous answer based on the comment below:
If you want to match the pattern "customer_" regardless of case, you can easily do it with the JavaScript "match" function. You just need to pass the file name in upper case and match it against the upper-cased pattern. Check the JS snippet below:
var pattern = "customer_"; // the word pattern you want to match
// files is the file name read from the directory listing
var match_files = upper(files).match(upper(pattern));
if (match_files != null) {
    // match() returned a result: set the flag to 'match'
} else {
    // match() returned null: set the flag to 'not match'
}
If you need to use a regular expression only, try the regex below:
.*(?i)(customer|CUSTOMER).*(?-i)\.txt
This also works for names such as "_123_Customer_1vasd.txt".
Hope this helps :)

Related

Match Regex CR-00000 pattern

I need a regex which would parse my text pattern: CR-000000. The pattern may be surrounded by other text and sometimes occurs twice; I need to extract only the parts matching the pattern.
I have created the following pattern, but it still doesn't work:
[CR-]{6}[0-9]
From the following example: The change Request has been created for the location below. CR-0001083 Click this link to access the Change Request Change Request ID :  CR-0001086 Property ID:  CK1014 - the output would be CR-0001083 CR-0001086
Thanks, CR-[0-9]{7} resolves the thing!
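For reference, [CR-]{6}[0-9] fails because [CR-] is a character class: it asks for six characters that are each "C", "R" or "-", followed by a single digit. Below is a small Python sketch of the working pattern applied to the sample text from the question; the function name is only for illustration.

```python
import re

# "CR-" followed by exactly seven digits, as in the pattern that resolved it.
CR_PATTERN = re.compile(r"CR-[0-9]{7}")


def find_change_requests(text):
    """Return every CR-xxxxxxx identifier found in the text, in order."""
    return CR_PATTERN.findall(text)


if __name__ == "__main__":
    message = (
        "The change Request has been created for the location below. "
        "CR-0001083 Click this link to access the Change Request "
        "Change Request ID :  CR-0001086 Property ID:  CK1014"
    )
    print(find_change_requests(message))
```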

Regex to match string containing two (or more) words in any order and case not sensitive

Hope you guys can help me.
I need to make a string that alerts me when the following conditions are met:
1. Two (or more) words are identified in a message.
2. It should not look only at whole words but also at words that contain the text I am searching for. For instance, if I search for "error", it should also alert me when it finds "errors" or "errorless".
3. It should not be case sensitive.
4. It has to look at word1 and word2 but also vice versa; in other words, it has to find them irrespective of their order.
I have played a while with regex101 but I have not been able to meet all conditions (condition #4 is still missing).
You can find what I have made so far at the following link:
https://regex101.com/r/Z4cE9A/5
Please note that I need an expression with the following characteristics:
Flavor: golang / Flag: single line
Important note: I cannot use the character "|" as it does not work properly on the system where I am going to use this string.
Any help would be more than appreciated. Thanks in advance for your support.
EDIT: I got confused. The non-functioning character is "|". However, if possible, it is better to also avoid "/", as I am not sure whether it works. If you want, you can provide me with two strings, one with the symbol "/" and one without, in case it does not work.
This should do what you want:
(?i:(http)|(error))
You can replace http and error with any other keywords that you are searching for.
To do that in Golang:
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	keywords := []string{
		"error",
		"http",
	}
	p := "(?i:(" + strings.Join(keywords, ")|(") + "))"
	text := `
Gran Turismo Sport
Shipment Error
Attempt
https://
`
	re := regexp.MustCompile(p)
	fmt.Println(re.MatchString(text))
}
You can test that in Golang Playground:
https://play.golang.org/p/XOhNVBCh8Pt
EDIT:
Based on the new limitation of not being able to use the | char, I would suggest that you search using this:
(?i:(error)?(http)?)
This will always return true (or a list of empty strings in find all) but the good thing is you can filter out all the empty strings and you will end up with the result that you want.
This is an example of this working in Golang Playground:
https://play.golang.org/p/miVC0hdLtQc
EDIT 2:
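In case you want to make sure ALL the keywords are in the text, change the ? in the regex to {1,}. You also don't need the loop any more. Note that, as written, this pattern only matches when the keywords appear directly one after the other and in this order (for example "errorhttp"); it does not match when other text separates them or when their order is reversed.
(?i:(error){1,}(http){1,})
This is an example working in Golang Playground: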
https://play.golang.org/p/f9eFcvObDsA
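Because a single pattern without "|" struggles with "all keywords, in any order", one possible workaround is to run one case-insensitive search per keyword and require all of them to succeed. The sketch below shows the idea in Python, not in Go; the function name is hypothetical. The per-keyword patterns it builds, such as (?i)error, use only syntax that Go's regexp package also accepts.

```python
import re


def contains_all(text, keywords):
    """True if every keyword occurs somewhere in text, ignoring case.

    One search per keyword, so neither "|" nor "/" is needed and the
    order of the keywords in the text does not matter.
    """
    return all(
        re.search("(?i)" + re.escape(word), text) is not None
        for word in keywords
    )


if __name__ == "__main__":
    print(contains_all("Shipment Errors over https://", ["error", "http"]))
    print(contains_all("https first, then an ERRORLESS run", ["error", "http"]))
    print(contains_all("only an error here", ["error", "http"]))
```

Substrings count as hits here, so "errors" and "errorless" both satisfy the keyword "error".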

How to extract file name from URL?

I have file names in a URL and want to strip out the preceding URL and filepath as well as the version that appears after the ?
Sample URL
I am trying to use a regex to pull out CaptialForecasting_Datasheet.pdf.
The REGEXP_EXTRACT in Google Data Studio seems unique. I tried the suggestion but kept getting a "could not parse" error. I was able to strip out the first part of the URL with the following. Event Label is where I store the URL of the downloaded PDF.
The URL:
https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
REGEXP_EXTRACT( Event Label , 'Documents/([^&]+)' )
The result:
HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
Now I am trying to work out how to drop the ? and everything after it, where the version data is, so as to extract just the Filename.pdf.
You could try:
[^\/]+(?=\?[^\/]*$)
This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:
https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver
Assuming that the name appears right after the last / and ends with the ?, the regular expression below will leave the name in group 1 where you can get it with \1 or whatever the tool that you are using supports.
.*\/(.*)\?
It basically says: get everything in between the last / and the first ? after, and put it in group 1.
Another regular expression that only matches the file name that you want but is more complex is:
(?<=\/)[^\/]*(?=\?)
It matches all non-/ characters, [^\/], immediately preceded by /, (?<=\/), and immediately followed by ?, (?=\?). The first parenthesized expression is a positive lookbehind, and the second is a positive lookahead.
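Both expressions from this answer can be tried on the URL from the question. The following is a Python sketch only, with made-up function names; REGEXP_EXTRACT in Data Studio may behave differently, and lookarounds are not available in every engine (RE2-based engines such as Go's regexp reject them).

```python
import re

URL = (
    "https://www.dudesolutions.com/Portals/0/Documents/"
    "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
)

# Capture-group version: the file name ends up in group 1.
GROUP_PATTERN = re.compile(r".*/(.*)\?")
# Lookaround version: the whole match is the file name.
LOOKAROUND_PATTERN = re.compile(r"(?<=/)[^/]*(?=\?)")


def name_via_group(url):
    match = GROUP_PATTERN.search(url)
    return match.group(1) if match else None


def name_via_lookaround(url):
    match = LOOKAROUND_PATTERN.search(url)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(name_via_group(URL))
    print(name_via_lookaround(URL))
```

Both helpers return None when the URL has no "?" at all, so a caller would need a fallback for URLs without a version suffix.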
This REGEXP_EXTRACT formula captures the characters a-zA-Z0-9_. between / and ?
REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")
Google Data Studio Report to demonstrate.
Please try the following regex:
[A-Za-z_]*\.pdf
I have tried it online at https://regexr.com/.
Please note that this only works for .pdf files.
The following regex will extract a file name with the .pdf extension:
(?:[^\/][\d\w\.]+)(?<=(?:\.pdf))
You can add more extensions like this:
(?:[^\/][\d\w\.]+)(?<=(?:\.pdf)|(?:\.jpg))

How to include regular expression in logstash file input path

I am using logstash to convert tomcat access logs into JSON format. The access log names are in the format below:
abcd_access_log.2016-03-15.log
efgh_access_log.2016-02-16.log
The input filter is:
input {
  file {
    path => "C:\tools\apache-tomcat-8.0.32\logs\*_access_log.*.log"
    start_position => beginning
  }
}
It is not showing logs with the regex used. What regex should I use here to select only these files?
Simple and fast way
If your log file is always after logs\, you can use the following regex:
logs\\(.*?\.log)
This captures everything (.*?) that follows logs\ (please note the double \\ in the regex). Your file name also ends with .log, so don't forget to escape the dot: \.log.
Detailed description is at Regex101.
Bullet-proof and slower way
If the key characters logs\ are missing, you can use the following regex, which captures the last part of the path using a negative lookahead:
\\((?:.(?!\\))+\.log)
Here is the detailed description again: Regex101
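To check what the two expressions capture, here is a small Python sketch run against a full path built from the question. It only exercises the regexes on a string; it does not show how Logstash itself interprets the path setting. The second path and the helper name are invented for the example.

```python
import re

# "Simple and fast": everything after logs\ up to the first ".log".
AFTER_LOGS = re.compile(r"logs\\(.*?\.log)")
# "Bullet-proof": the last path segment, found with a negative lookahead
# that refuses any character followed by a backslash.
LAST_SEGMENT = re.compile(r"\\((?:.(?!\\))+\.log)")


def captured(pattern, path):
    """Return group 1 of the first match, or None (hypothetical helper)."""
    match = pattern.search(path)
    return match.group(1) if match else None


if __name__ == "__main__":
    tomcat = r"C:\tools\apache-tomcat-8.0.32\logs\abcd_access_log.2016-03-15.log"
    other = r"D:\data\efgh_access_log.2016-02-16.log"
    print(captured(AFTER_LOGS, tomcat))
    print(captured(LAST_SEGMENT, tomcat))
    print(captured(AFTER_LOGS, other))
    print(captured(LAST_SEGMENT, other))
```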

Regular Expression to extract src attribute from img tag

I am trying to write a pattern for extracting the path for files found in img tags in HTML.
String string = "<img src=\"file:/C:/Documents and Settings/elundqvist/My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";
My Pattern:
src\\s*=\\s*\"(.+)\"
The problem is that my pattern also includes the border="0" part of the img tag.
What pattern would match the URI path for this file without including the border="0" part?
Your pattern should be (unescaped):
src\s*=\s*"(.+?)"
The important part is the added question mark, which makes the quantifier lazy so that the group matches as few characters as possible.
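The difference between the greedy and the lazy group can be seen on the string from the question. This is a Python sketch rather than Java, so the pattern is written unescaped in a raw string; the helper name is made up.

```python
import re

TAG = (
    '<img src="file:/C:/Documents and Settings/elundqvist/My Documents/'
    'My Pictures/import dialog step 1.JPG" border="0" />'
)

GREEDY = re.compile(r'src\s*=\s*"(.+)"')  # runs on to the last quote
LAZY = re.compile(r'src\s*=\s*"(.+?)"')   # stops at the first closing quote


def first_group(pattern, text):
    """Return group 1 of the first match, or None (hypothetical helper)."""
    match = pattern.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(first_group(GREEDY, TAG))
    print(first_group(LAZY, TAG))
```

The greedy version keeps going until the last double quote in the tag, which is why the border attribute ends up inside the group.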
This one grabs the src only if it's inside an <img> tag, not when it is written anywhere else as plain text. It also allows for other attributes before or after the src attribute.
It also handles both single (') and double (") quotes.
\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>
So for PHP you would do:
preg_match("/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/", $string, $matches);
echo "$matches[1]";
For JavaScript you would do:
var match = text.match(/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/)
alert(match[1]);
Hopefully that helps.
Try this expression:
src\s*=\s*"([^"]+)"
I solved it by using this regex.
/<img.*?src="(.*?)"/g
Validated in https://regex101.com/r/aVBUOo/1
You want to play with the non-greedy form of the capture group. Something like
src\\s*=\\s*\"(.+?)\"
By default the regex will try to match as much as possible; the ? after the + makes it match as little as possible.
> I am trying to write a pattern for extracting the path for files found in img tags in HTML.
Can we have an autoresponder for "Don't use regex to parse [X]HTML"?
> Problem is that my pattern will also include the 'border="0" part of the img tag.
Not to mention any time 'src="' appears in plain text!
If you know in advance the exact format of the HTML you're going to be parsing (eg. because you generated it yourself), you can get away with it. But otherwise, regex is entirely the wrong tool for the job.
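For comparison, here is a minimal sketch of the parser route. It swaps the regex for Python's standard html.parser module, which is not what the question uses (the question is in Java), so treat it as an illustration of the idea only. The class and function names are invented for the example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src value of every img tag it is fed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; quotes are already removed.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


if __name__ == "__main__":
    page = '<p>src="not-an-image"</p><img border="0" src=\'a b.png\' /><IMG SRC=c.gif>'
    print(img_sources(page))
```

Text such as src="..." outside a tag is ignored, and single-quoted, double-quoted and unquoted attribute values are all handled by the parser rather than by the pattern.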
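I'd like to expand on this topic, as the src attribute often comes unquoted. A regex that handles both the quoted and the unquoted src attribute is:
src\s*=\s*"?(.+?)["\s]
Note that it stops at the first whitespace, so it cuts off quoted values that contain spaces.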