Grok custom pattern for space delimited file - amazon-athena

I'm trying to load a file into a structured table in Athena. I am using a Grok pattern to load it into the table, but I can't find the correct pattern. The file format is as below:
L1127 ACTUALS 214171 ON 27649075 -00000000000000000409618.02 601 MBS DAILY VISION - CAN OS
L1127 ACTUALS 412821 ON 27649075 002060 -00000000000000000002657.33 521 MBS DAILY VISION - CAN OS
GROK pattern I'm using:
(?<BusinessUnit>.{5})%{SPACE}(?<Type>.{7})%{SPACE}(?<PSGLAccountNumber>.{6})%{SPACE}(?<Province>.{2})%{SPACE}(?<DepartmentId>.{8})%{SPACE}(?<ProductId>.{6})%{SPACE}(?<Amount>.{27})%{SPACE}(?<TransCode>.{3})%{SPACE}(?<Feed>.{35})
I'm having trouble when the ProductId has no value.
Any help would be appreciated.

(?<ProductId>.{6})%{SPACE} means that you expect the ProductId field to be exactly six characters followed by any number of spaces. From the data you posted it seems to me that what should happen is that in the first row ProductId would end up as six spaces.
If the problem is that it becomes six spaces and you want it to be an empty string, you could for example use (?<ProductId>\S*)%{SPACE} (\S* matches zero or more non-space characters).
If this does not solve your problem, perhaps you could describe in some more detail what trouble you are having, and what you want to happen?
Update: in a comment you indicated that the problem with this solution is that the ProductId column becomes "-00000". The reason is that the %{SPACE} pattern before (?<ProductId>… consumes all the spaces between the DepartmentId and Amount fields. To solve this, you could limit the number of spaces that can appear between the DepartmentId and ProductId fields. In the example data you posted there are two spaces, and since the fields are fixed-width I assume this is always the case. Using a pattern like …(?<DepartmentId>.{8})\s{2}(?<ProductId>\S*)%{SPACE}(?<Amount>.{27})… should fix the problem.
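To make the effect of the \s{2} change concrete, here is a minimal sketch in Python rather than Grok. It assumes that (?P<name>…) and \s* behave like Grok's (?<name>…) and %{SPACE}, that exactly two spaces separate DepartmentId from ProductId, and that the other fields are separated by one space. The make_row helper is hypothetical and only rebuilds rows shaped like the sample data, and Feed is loosened to .* so the sample does not need padding.

```python
import re

# Python spelling of the Grok pattern: (?P<name>...) replaces (?<name>...),
# and \s* stands in for %{SPACE}.
LINE_RE = re.compile(
    r"(?P<BusinessUnit>.{5})\s*"
    r"(?P<Type>.{7})\s*"
    r"(?P<PSGLAccountNumber>.{6})\s*"
    r"(?P<Province>.{2})\s*"
    r"(?P<DepartmentId>.{8})\s{2}"  # exactly two separator spaces (assumption)
    r"(?P<ProductId>\S*)\s*"        # empty string when the column is blank
    r"(?P<Amount>.{27})\s*"
    r"(?P<TransCode>.{3})\s*"
    r"(?P<Feed>.*)"                 # simplified; the original uses .{35}
)


def make_row(account, product_id, amount, trans_code):
    """Hypothetical helper: rebuild a fixed-width row like the sample data."""
    padded_amount = "-" + amount.rjust(26, "0")  # 27 characters in total
    return (
        f"L1127 ACTUALS {account} ON 27649075  {product_id:<6} "
        f"{padded_amount} {trans_code} MBS DAILY VISION - CAN OS"
    )


def parse(line):
    """Return the named groups as a dict, or None when the line does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


blank = parse(make_row("214171", "", "409618.02", "601"))
filled = parse(make_row("412821", "002060", "2657.33", "521"))
print(repr(blank["ProductId"]), blank["Amount"])
print(repr(filled["ProductId"]), filled["Amount"])
```

Because \s{2} cannot swallow the blank ProductId column, \S* matches the empty string on the first row and the Amount group still starts at the minus sign.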

I was able to make it work using the pattern below:
%{WORD:BusinessUnit}%{SPACE}%{WORD:Type}%{SPACE}%{POSINT:PSGLAccountNumber}%{SPACE}%{WORD:Province}%{SPACE}%{POSINT:DepartmentId}%{SPACE}%{custompat:ProductId}%{SPACE}%{NUMBER:Amount}%{SPACE}%{NUMBER:TransCode}%{SPACE}(?<Feed>[A-Za-z0-9\-\s]{26})
And using custom pattern:
custompat ([0-9]{6}|\s{6})

Related

How to extract sub-directories from the URL using 'REGEXP_EXTRACT' in Data Studio

I'm trying to extract the product name from the URL between the 2 slashes using REGEXP_EXTRACT. For example, I want to extract ace-5 from the URLs below:
www.abc.com/products/phones/ace-5/
www.abc.com/products/phones/ace-5/?cid=dm66363&bid
www.abc.com/products/phones/ace-5/?fbclid=iwar30dpnmmpwppnla7
www.abc.com/products/phones/ace-5/?et_cid=em_367029&et_rid=130
I have a RegEx to extract the Domain Name but it is not something I'm actually looking for. Below is the RegEx:
REGEXP_EXTRACT(page,'^[^.]+.([^.]+)')
It gives the following result: abc
Assuming that the product name would always be the fixed fourth path element, we can try:
REGEXP_EXTRACT(page, '(?:[^\/]+\/){3}([^\/]+).*')
or, if the above would not work:
REGEXP_EXTRACT(page, '[^\/]+\/[^\/]+\/[^\/]+\/([^\/]+).*')
I do not have the same pages in my GDS, so I tried to recreate the case with my own data source, i.e. pages from Google Analytics.
You may use the formula below, which returns the text after the second slash, as per your requirement.
REGEXP_EXTRACT(Page,'[^/]+/[^/]+/([^/]+)')
You need to create a calculated column with this formula. Once you have created it, you might need to add a filter to remove the rows with a null value.
example Page: "/products/phones/ace-5/"
The Calculated Column value will be "ace-5"
Note that this regex only gives you the word after phones/; if a record has nothing after that, it returns null.
The REGEXP_EXTRACT Calculated Field below does the trick, extracting all characters after the 3rd / till the next instance of /:
REGEXP_EXTRACT(Page, "^(?:[^/]+/){3}([^/]+)")
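The two styles of pattern suggested above behave differently depending on whether the field contains the host name. The sketch below uses Python's re module as a stand-in for REGEXP_EXTRACT (an assumption: Data Studio uses RE2, which agrees with Python on these simple patterns). The regexp_extract helper is hypothetical and returns the first capture group, or None when there is no match.

```python
import re


def regexp_extract(value, pattern):
    """Hypothetical stand-in for REGEXP_EXTRACT: first capture group or None."""
    match = re.search(pattern, value)
    return match.group(1) if match else None


ANCHORED = r"^(?:[^/]+/){3}([^/]+)"    # skips three leading segments
UNANCHORED = r"[^/]+/[^/]+/([^/]+)"    # third segment wherever it first fits

full_urls = [
    "www.abc.com/products/phones/ace-5/",
    "www.abc.com/products/phones/ace-5/?cid=dm66363&bid",
    "www.abc.com/products/phones/ace-5/?fbclid=iwar30dpnmmpwppnla7",
    "www.abc.com/products/phones/ace-5/?et_cid=em_367029&et_rid=130",
]
path_only = "/products/phones/ace-5/"

for url in full_urls:
    print(regexp_extract(url, ANCHORED))        # product name for each URL

print(regexp_extract(path_only, UNANCHORED))    # path without the host name
print(regexp_extract(full_urls[0], UNANCHORED)) # host name shifts the result
print(regexp_extract(path_only, ANCHORED))      # leading slash blocks the match
```

So the anchored three-segment version fits a field that includes the host name, while the two-segment version fits a Page field that starts with a slash.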

How to extract the year from a URL path using REGEXP_EXTRACT in Google Data Studio?

I'm building out a Google Data Studio dashboard and I need to create a calculated field for the year a post was published. The year is in the URI path, but I'm not sure how to extract it using REGEXP_EXTRACT. I've tried a number of solutions proposed on here but none of them seem to work on Data Studio.
In short, I have a URI like this: /theme/2019/jan/blog-post-2019/
How do I use the REGEXP_EXTRACT function to get the first 2019 after theme/ and before /jan?
Try this:
REGEXP_EXTRACT(Page, 'theme\/([0-9]{4})\/[a-z]{3}\/')
where:
theme\/ means literally "theme/";
([0-9]{4}) is a capturing group containing 4 characters from 0 to 9 (i.e. four digits);
\/[a-z]{3}\/ means a slash, followed by 3 lowercase letters (supposing that you want the regex to match all the months), followed by another slash. If you want something more restrictive, try with \/(?:jan|feb|mar|...)\/ for the last part.
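The breakdown above can be checked quickly outside Data Studio. This is a small sketch using Python's re module as a stand-in for REGEXP_EXTRACT (an assumption; the slashes are left unescaped because Python does not need the backslashes). The second URI is a made-up example that shows only the year directly after theme/ is captured.

```python
import re

# Same pattern as above, without the optional backslashes before "/".
YEAR_RE = re.compile(r"theme/([0-9]{4})/[a-z]{3}/")


def extract_year(uri):
    """Return the four-digit year after 'theme/', or None if absent."""
    match = YEAR_RE.search(uri)
    return match.group(1) if match else None


print(extract_year("/theme/2019/jan/blog-post-2019/"))   # URI from the question
print(extract_year("/theme/2018/dec/review-of-2019/"))   # hypothetical URI
print(extract_year("/theme/about/"))                     # no year in the path
```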
As you mentioned, I think you only want to extract the year from the string. The following will achieve that for you.
Adapt the query to your needs:
SELECT *
FROM Sample_table
WHERE REGEXP_EXTRACT(url, "(?<=\/theme\/)(?<year>\d{4})(?=\/[a-zA-Z]{3})")

How do I use regex to return text following specific prefixes?

I'm using an application called Firemon which uses regex to pull text out of various fields. I'm unsure what specific version of regex it uses, I can't find a reference to this in the documentation.
My raw text will always be in the following format:
CM: 12345
APP: App Name
BZU: Dept Name
REQ: First Last
JST: Text text text text.
CM will always be an integer, JST will be sentence that may span multiple lines, and the other fields will be strings that consist of 1-2 words - and there's always a return after each section.
The application, Firemon, has me create a regex entry for each field. Something simple that looks for each prefix and then a return should work, because I return after each value. I've tried several variations, such as "BZU:\s*(.*)", but can't seem to find something that works.
EDIT: To be clear I'm trying to get the value after each prefix. Firemon has a section for each field. "APP" for example is a field. I need a regex example to find "APP:" and return the text after it. So something as simple as regex that identifies "APP:", and grabs everything after the : and before the return would probably work.
You can use (?=\w+ )(.*)
Positive lookahead will remove prefix and space character from match groups and you will in each match get text after space.
I am a little late to the game, but maybe this is still an issue.
In the more recent versions of FireMon, sample regexes are provided. For instance:
jst:\s*([^;]*?)\s*;
will match on:
jst:anything in here;
and result in
anything in here
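For the format in the question, where each value ends at a line break rather than at a semicolon, a per-field pattern could look like the sketch below. It is written for Python's re module, since FireMon's regex flavor is not known here; the function names and the second JST line are made up for illustration. [ \t]* is used instead of \s* so an empty value cannot spill onto the next line.

```python
import re

SAMPLE = """CM: 12345
APP: App Name
BZU: Dept Name
REQ: First Last
JST: Text text text text.
Text continues on a second line."""


def extract_field(prefix, text):
    """Value after 'PREFIX:' up to the end of that line, or None."""
    pattern = rf"^{re.escape(prefix)}:[ \t]*(.*)$"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else None


def extract_trailing_field(prefix, text):
    """Value after 'PREFIX:' up to the end of the text (for multi-line JST)."""
    pattern = rf"^{re.escape(prefix)}:[ \t]*(.*)"
    match = re.search(pattern, text, re.MULTILINE | re.DOTALL)
    return match.group(1).strip() if match else None


for field in ("CM", "APP", "BZU", "REQ"):
    print(field, "->", extract_field(field, SAMPLE))
print("JST ->", extract_trailing_field("JST", SAMPLE))
```

Whether the equivalent of the multi-line and dot-all flags is available in FireMon would need to be checked in the application itself.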

Google Analytics filters, only two countries

I want to create a filter that includes only two countries, for example the United Kingdom and Russia.
I have two filters. The first excludes all countries: it is a filter with the regex pattern '.'. The next filter includes only these two countries, with the pattern United Kingdom|Russia.
But now I don't get any results. What's wrong with my regex?
Your regex is fine. You need to include the variable against which the regex filter is going to be executed. In your case, type Country (Pays in French)
PS: I have tested it on my G account.
Edit:
As per your comment below, the . would only match one character (there are no countries with a one-character name). If you want to match all countries, then your regex pattern would look like .+, yet that leaves this question: if you want to match all countries, why use a filter in the first place?
If you set a filter to exclude all countries then there will be nothing in your reports, it does not matter what any other filters do because they cannot cancel each other out.
You only need the include filter, since it already works as "include only".
The RegEx you have already seems to be working on both filters, once again the problem is that your exclude filter is excluding everything.

How to use regex for querying in Solr 4

I've reached the point of desperation, so I'm asking for help. I'm trying to query results from a Solr 4 engine using regex.
Let's assume the document I want to query is:
<str name="text">description: best company; name: roca mola</str>
And I want to query using this regex:
description:(.*)?company(.*)?;
I read in some forums that using regex in Solr 4 was as easy as adding slashes, like:
localhost:8080/solr/q=text:/description\:(.*)?company(.*)?;/
but it isn't working. And this one doesn't work either:
localhost:8080/solr/q=text:/description(.*)?company(.*)?;/
I don't want a simple query like:
localhost:8080/solr/q=text:*company*
Since that would mismatch documents like:
<str name="text">description: my home; name: mother company"</str>
If I'm not clear please let me know.
Cheers from Chile :D
NOTE: I was using text_general fields in my schema. As @arun pointed out, string fields can handle the type of regex I'm using.
Instead of trying regex search on text field type, try it on a string field type, since your regex is spanning more than one word. (If your regex needs to match a single word, then you can use a text field.)
Also percent-encode the special characters, just to make sure they are not the cause of the mismatches.
q=strfield:/description%3A(.*?)company(.*?)%3B.*/
Update:
Just tried it on a string field. The above regex works. It also works without the percent encoding, i.e.
q=strfield:/description:.*?company.*?;.*/
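As a rough offline check of the two points above, the sketch below percent-encodes the pattern with Python's urllib and tests it against the two documents from the question. It rests on two assumptions: re.fullmatch approximates how a regex query has to match an entire string-field value, and Lucene's regex syntax agrees with Python's for this particular pattern. The strfield name comes from the answer, and build_query is a hypothetical helper.

```python
import re
from urllib.parse import quote

PATTERN = "description:.*?company.*?;.*"


def build_query(field, pattern):
    """Hypothetical helper: regex query parameter with every reserved character encoded."""
    return f"q={field}:/{quote(pattern, safe='')}/"


def matches_whole_value(pattern, value):
    """Approximate a regex query on a string field: the whole value must match."""
    return re.fullmatch(pattern, value) is not None


wanted = "description: best company; name: roca mola"
unwanted = "description: my home; name: mother company"

print(build_query("strfield", PATTERN))
print(matches_whole_value(PATTERN, wanted))     # "company" precedes the ";"
print(matches_whole_value(PATTERN, unwanted))   # "company" only after the ";"
```

Note that quote with safe='' also encodes * and ?, which is stricter than the hand-encoded query above; both forms decode to the same pattern.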