Azure blob storage regex pattern for user input validation - regex

I’ve developed a function in PowerShell (.NET Framework) to retrieve data from any given Azure Blob Storage: so far, so good.
When it comes to validating user input, unfortunately, I cannot rely on external modules or libraries such as the NameValidator class of the Azure SDK for .NET.
Nevertheless, the article Naming and Referencing Containers, Blobs, and Metadata goes into the details of naming rules and thus regex patterns might come to the rescue.
For container names I’ve come up with this, and it seems to fit:
(?=^.{3,63}$)(?!.*--)[^-][a-z0-9-]*[^-]
Container Names
A container name must be a valid DNS name, conforming to the following
naming rules:
Container names must start or end with a letter or number, and can contain only letters, numbers, and the dash (-) character.
Every dash (-) character must be immediately preceded and followed by a letter or number; consecutive dashes are not permitted in
container names.
All letters in a container name must be lowercase.
Container names must be from 3 through 63 characters long.
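A minimal sketch of the container rules quoted above, written in Python purely for illustration (the function name is hypothetical). It spells out the allowed character set explicitly, because a negated class such as [^-] also accepts uppercase letters and punctuation. The pattern itself uses only a lookahead and plain character classes, so it should carry over to other regex engines, but that is an assumption to verify in your own environment.

```python
import re

# 3-63 characters overall; lowercase letters and digits, with single dashes
# allowed only between two alphanumeric runs (no leading, trailing or
# consecutive dashes).
CONTAINER_NAME = re.compile(r"(?=.{3,63}$)[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_container_name(name):
    """Return True if name satisfies the quoted container naming rules."""
    return CONTAINER_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["my-container", "my--container", "-abc", "Abc", "ab"]:
        print(candidate, is_valid_container_name(candidate))
```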
For blob names, however, I can’t work out how to count the path segments:
(?=^.{1,1024}$)(?<=^|\/)(\S*?)[^\.] (?=\/|$)
NB: the Azure Storage emulator has been deprecated and is therefore out of scope.
Blob Names
A blob name must conform to the following naming rules:
A blob name can contain any combination of characters.
A blob name must be at least one character long and cannot be more than 1,024 characters long, for blobs in Azure Storage.
The Azure Storage emulator supports blob names up to 256 characters long. For more information, see Use the Azure storage emulator for
development and testing.
Blob names are case-sensitive.
Reserved URL characters must be properly escaped.
The number of path segments comprising the blob name cannot exceed 254. A path segment is the string between consecutive delimiter characters (e.g., the forward slash '/') that corresponds to the name
of a virtual directory.
Note Avoid blob names that end with a dot (.), a forward slash (/), or a sequence or combination of the two. No path segments should end
with a dot (.).
The Blob service is based on a flat storage scheme, not a hierarchical scheme. However, you may specify a character or string delimiter within a blob name to create a virtual hierarchy. For example, the following list shows valid and unique blob names. Notice that a string can be valid as both a blob name and as a virtual directory name in the same container:
/a
/a.txt
/a/b
/a/b.txt
You can take advantage of the delimiter character when enumerating blobs.
NB: Just before asking this question, I found these, which either answer what I’ve already solved on my own or use the aforementioned class:
Azure Container Name RegEx
How to validate Azure storage blob names
By the way, does anybody know which flavor of regex is used by PowerShell?

You can use either of the following patterns:
^(?!.{1025})/?[^/]*[^/.](?:/[^/]*[^/.]){0,253}$
^(?=.{1,1024}$)/?[^/]*[^/.](?:/[^/]*[^/.]){0,253}$
See the regex demo.
Details:
^ - start of string
(?=.{1,1024}$) - the string should contain from 1 to 1024 chars
(?!.{1025}) - the string cannot contain 1025 or more chars (i.e. at most 1024)
/? - an optional /
[^/]*[^/.] - zero or more chars other than / and then a char other than / and .
(?:/[^/]*[^/.]){0,253} - zero to 253 occurrences of / followed by zero or more chars other than / and then a char other than / and .
$ - end of string.
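As a quick check, here is a minimal sketch that runs the second pattern above against a few names. It is written in Python for illustration only; PowerShell uses the .NET regex engine, and this pattern relies only on a lookahead and character classes, so it is expected (not guaranteed) to behave the same there. The function name is hypothetical.

```python
import re

# Second pattern from the answer: 1-1024 characters, at most 254 path
# segments, and no segment ending in a dot or followed by a trailing slash.
BLOB_NAME = re.compile(r"^(?=.{1,1024}$)/?[^/]*[^/.](?:/[^/]*[^/.]){0,253}$")


def is_valid_blob_name(name):
    """Return True if name passes the pattern above."""
    return BLOB_NAME.match(name) is not None


if __name__ == "__main__":
    samples = ["/a", "/a.txt", "/a/b", "/a/b.txt", "a.", "a/", "a./b"]
    for sample in samples:
        print(repr(sample), is_valid_blob_name(sample))
    # Segment limit: 254 single-character segments pass, 255 do not.
    print(is_valid_blob_name("/".join(["a"] * 254)))
    print(is_valid_blob_name("/".join(["a"] * 255)))
```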

Related

How do I extract the numbers after a phrase (xvth) and replace the phrase with "Group-"?

I have data being exported from BigQuery into Google Data Studio. One field contains a username like the following.
xvth20-00-tt-wr
xvth27-00-pt-px
The first 4 characters (xvth) are always the same and the numbers that follow (xvth) correspond to a group. Multiple usernames will contain the same numbers after those characters but the rest of the string from 00- and on will be different.
What I'm trying to do is extract the numbers that follow the 4 characters and create a new field that looks like the following.
Group-20
Group-27
I've tried REPLACE(SUBSTR(Users,1, 6), 'xvth20', 'Group-20'), but I would have to create one of these for every group, which seems like too much. The data will also keep growing, so I don't want to keep going back in and adding another function.
Is there an easier way to do this?
Either of the REGEXP_REPLACE calculated fields below replaces xvth with Group-, immediately followed by the captured digits. Calculated field #1 uses a raw literal, indicated by the letter r, so a single \ is enough to escape special regex characters; calculated field #2 does not use a raw literal, so each \ must be doubled (\\):
1) With r (Raw Literal)
REGEXP_REPLACE(Users, r"^xvth(\d+).*", r"Group-\1")
2) Without r (Raw Literal)
REGEXP_REPLACE(Users, "^xvth(\\d+).*", "Group-\\1")
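To see what the pattern does outside Data Studio, here is a minimal Python sketch of the same substitution. Python's re module is not the engine Data Studio uses, so treat this as an illustration of the capture-and-replace idea rather than a guarantee of identical behavior; the function name is hypothetical.

```python
import re


def group_label(username):
    """Replace 'xvth<digits><rest>' with 'Group-<digits>'.

    Usernames that do not start with 'xvth' followed by digits are
    returned unchanged, because the pattern does not match them.
    """
    return re.sub(r"^xvth(\d+).*", r"Group-\1", username)


if __name__ == "__main__":
    for user in ["xvth20-00-tt-wr", "xvth27-00-pt-px"]:
        print(user, "->", group_label(user))
```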

Regex to match only last segment of a folder structure

I have a recursive list of folders that I need to find characters in, but I do not want subfolders included in the result. I need to find many different characters that will be an issue when migrating data, including asterisks, double periods, etc.
For this example I will use a double period (..). I only need the first, fourth, and seventh lines:
/System/Modules/Aspect/dmc_attachments_aspect..J5_D65
/System/Modules/Aspect/dmc_attachments_aspect..J5_D65/External Interfaces
/System/Modules/Aspect/dmc_attachments_aspect..J5_D65/Miscellaneous
/System/Modules/Collaboration/com.documentum.services.collaboration.IAttachmentsManager..J5_D65
/System/Modules/Collaboration/com.documentum.services.collaboration.IAttachmentsManager..J5_D65/External Interfaces
/System/Modules/Collaboration/com.documentum.services.collaboration.IAttachmentsManager..J5_D65/Miscellaneous
/System/Modules/TBO/dm_message_archive..J5_D65
/System/Modules/TBO/dm_message_archive..J5_D65/External Interfaces
Another example would be an asterisk -- I only need the first, fourth, and seventh lines.
/Public/Test/*Training
/Public/Test/*Training*/Documentation
/Public/Test/*Training*/SOPs
/Public/Test/Project**Tracking
/Public/Test/Project**Tracking/01
/Public/Test/Project**Tracking/02
/Public/Home*
/Public/Home*/Test
Is there a regex I could use to meet this? I am happy running multiple queries/reports and updating the main character (.. or *)
I wanted to give some clarity on the issue so I can avoid the XY problem.
We are migrating data from Documentum to SharePoint, and Documentum does not have the same file and folder name restrictions, so we will have to address those ahead of the migration or on the fly. I have a big text file (950k lines) containing all of the folders currently in Documentum, and I am attempting to find all folders that will not migrate due to containing these characters.
The issue is that doing a basic egrep '\*' will give not just the top level folder containing this character but all subfolders, which will throw off counts.
Let's say you were looking for the double period:
.*\.\.[^/]*$
would match two periods followed by any number of non-slash characters up to the end of the string, so it only matches when the two periods are in the last path segment. In general, replace \.\. with whatever you are looking for.
Check it out at regex101.com. (Asterisk version here).
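A minimal sketch of the same idea as a reusable filter, in Python (the function name is hypothetical). It escapes the search string so that characters such as * and . are taken literally, then requires that only non-slash characters follow it up to the end of the line.

```python
import re


def folders_with_char_in_last_segment(paths, needle):
    """Return the paths whose final segment contains needle."""
    # re.escape makes '..' or '*' match literally; [^/]*$ forbids any
    # further slash, which pins the match to the last path segment.
    pattern = re.compile(re.escape(needle) + r"[^/]*$")
    return [path for path in paths if pattern.search(path)]


if __name__ == "__main__":
    paths = [
        "/Public/Test/*Training",
        "/Public/Test/*Training*/Documentation",
        "/Public/Test/Project**Tracking",
        "/Public/Test/Project**Tracking/01",
        "/Public/Home*",
        "/Public/Home*/Test",
    ]
    for hit in folders_with_char_in_last_segment(paths, "*"):
        print(hit)
```

For a 950k-line file, the same function can be fed the file's lines with the trailing newline stripped.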

CloudSearch wildcard query not working with 2013 API after migration from 2011 API

I've recently upgraded a CloudSearch instance from the 2011 to the 2013 API. Both instances have a field called sid, which is a text field containing a two-letter code followed by some digits e.g. LC12345. With the 2011 API, if I run a search like this:
q=12345*&return-fields=sid,name,desc
...I get back 1 result, which is great. But the sid of the result is LC12345 and that's the way it was indexed. The number 12345 does not appear anywhere else in any of the resulting document fields. I don't understand why it works. I can only assume that this type of query is looking for any terms in any fields that even contain the number 12345.
The reason I'm asking is because this functionality is now broken when I query using the 2013 API. I need to use the structured query parser, but even a comparable wildcard query using the simple parser is not working e.g.
q.parser=simple&q=12345*&return=sid,name,desc
...returns nothing, although the document is definitely there i.e. if I query for LC12345* it finds the document.
If I could figure out how to get the simple query working like it was before, that would at least get me started on how to do the same with the structured syntax.
Why it's not working
CloudSearch v1 (2011) had a different way of tokenizing mixed alpha+numeric strings. Here's the logic as described in the archived docs (emphasis mine).
If a string contains both alphabetic and numeric characters and is at
least three and no more than nine characters long, the alphabetic and
numeric portions of the string are treated as separate tokens. For
example, the string DOC298 is tokenized into two terms: doc 298
CloudSearch v2 (2013) text processing follows Unicode Text Segmentation, which does not specify that behavior:
Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).
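The difference can be illustrated with a toy model in Python. This is only a sketch of the two rules as quoted above, not CloudSearch's actual tokenizer; the function names are hypothetical and lowercasing is assumed from the "doc 298" example.

```python
import re


def tokenize_2011_style(term):
    """Toy model of the quoted 2011 rule: split mixed alphanumeric strings
    of 3 to 9 characters into alphabetic and numeric tokens."""
    has_alpha = any(c.isalpha() for c in term)
    has_digit = any(c.isdigit() for c in term)
    if has_alpha and has_digit and 3 <= len(term) <= 9:
        return [t.lower() for t in re.findall(r"[A-Za-z]+|[0-9]+", term)]
    return [term.lower()]


def tokenize_2013_style(term):
    """Toy model of the quoted 2013 rule: digits adjacent to letters are
    not split, so the string stays one token."""
    return [term.lower()]


def prefix_matches(tokens, prefix):
    """A query like '12345*' only matches tokens starting with the prefix."""
    return any(token.startswith(prefix.lower()) for token in tokens)


if __name__ == "__main__":
    print(tokenize_2011_style("LC12345"), tokenize_2013_style("LC12345"))
    print(prefix_matches(tokenize_2011_style("LC12345"), "12345"))
    print(prefix_matches(tokenize_2013_style("LC12345"), "12345"))
```

Under the first rule 12345 is a token of its own, so the prefix query finds it; under the second the only token is lc12345, which does not start with 12345.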
Solution
You should just be able to search *12345 to get back results with any prefix. There may be some edge cases like getting back results you don't want (things with more preceding digits like AB99912345); I don't know enough about your data to say whether those are real concerns.
Another option would be to index the numeric suffix separately from the alphabetical prefix, but that's additional work that may be unnecessary.
I'm guessing you are using CloudSearch in English, so maybe this isn't your specific problem, but also watch out for stopwords in your search queries:
https://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-analysis-schemes.html#stopwords
For example, "jo" is a stop word in Danish and some other languages, and every supported language has its own dictionary of very common stop words. If you don't specify a language for your text field, it defaults to English. You can see them here: https://docs.aws.amazon.com/cloudsearch/latest/developerguide/text-processing.html#text-processing-settings

What's the format of a CUID in SAP BI/BO?

I'm interfacing with an SAP BI/BO server, and some web services require an input ID called a "CUID" (Cluster Unique ID). For example, there's a web service getObjectById which requires a CUID as input.
I'm trying to make my code more robust by checking whether the CUID entered by a user makes sense, but I can't find a regular expression that properly describes what a CUID looks like. There is a lot of documentation for GUIDs, but they're not the same. Below are some examples of CUIDs found in our system; they look well-formed, but I'm not sure:
AQA9CNo0cXNLt6sZp5Uc5P0
AXiYjXk_6cFEo.esdGgGy_w
AZKmxuHgAgRJiducy2fqmv0
ASSn7jfNPCFDm12sv3muJwU
AUmKm2AjdPRMl.b8rf5ILww
AaratKz7EDFIgZEeI06o8Fc
ATjdf_MjcR9Anm6DgSJzxJ8
AaYbXdzZ.8FGh5Lr1R1TRVM
Afda1n_SWgxKkvU8wl3mEBw
AaZBfzy_S8FBvQKY4h9Pj64
AcfqoHIzrSFCnhDLMH854Qc
AZkMAQWkGkZDoDrKhKH9pDU
AaVI1zfn8gRJqFUHCa64cjg
My guess would be: start with a capital A, then add 22 random characters in the range [0-9A-Za-z_.]. But perhaps the A means something else, and after a while it would switch to B...
Is anyone familiar with this type of ID and how it is formatted?
(quick side question: do I need to escape the "dot" in the square brackets like this \. to get the actual dot character?)
The definition of the different ID types and their purpose is described in the SAP KB note 1285103: What are the different types of IDs used in the BusinessObjects Enterprise repository?
However, I couldn't find any description of the format of the CUID. I wouldn't make any assumptions about it though, other than the fact that it's alphanumeric.
I did a quick query on a repository and found CUIDs consisting of up to 35 characters and beginning with the letters A, B, C, F, k and M.
If you look at the repository database, more specifically the table CMS_INFOOBJECTS7, you'll notice that the column SI_CUID is defined as a VARCHAR2, 56 bytes in size (Oracle RDBMS).
Thus, a valid regular expression to match these would be [a-zA-Z0-9\._]+.
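On the side question: inside square brackets a dot is already literal, so [a-zA-Z0-9._] and [a-zA-Z0-9\._] match the same characters. A minimal Python sketch of such a check follows; the function name is hypothetical, and the 56-character upper bound is an assumption borrowed from the SI_CUID column size mentioned above, not a documented CUID rule.

```python
import re

# Assumption: the character set comes from the sample CUIDs, and the
# 56-character cap mirrors the SI_CUID column size; neither is an
# official specification of the CUID format.
CUID_PATTERN = re.compile(r"[A-Za-z0-9._]{1,56}")


def looks_like_cuid(value):
    """Return True if the whole value consists of plausible CUID characters."""
    return CUID_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for value in ["AQA9CNo0cXNLt6sZp5Uc5P0", "AXiYjXk_6cFEo.esdGgGy_w", "bad id!"]:
        print(value, looks_like_cuid(value))
```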

c++ - escape special characters

I need to escape all special characters and replace national characters and get "plain text" for a tablename.
string getTableName(string name)
My string could be "šárka65_%&." and I want to get string I can use in my database as a tablename.
Which DBMS?
In standard SQL, a name enclosed in double quotes is a delimited identifier and may contain any characters.
In MS SQL Server, a name enclosed in square brackets is a delimited identifier.
In MySQL, a name enclosed in back-ticks is a delimited identifier.
You could simply choose to enclose the name in the appropriate markers.
I had a feeling that wasn't what you wanted...
What codeset is your string in? It seems to be UTF-8 by the time it gets to my browser. Do you need to be able to invert the mapping unambiguously? That is harder.
You can use many schemes to map the information:
One simple-minded scheme is to hex-encode everything, using a marker (X) to protect against leading digits:
XC5A1C3A1726B6136355F25262E
A slightly less simple-minded scheme is to hex-encode only what is not already an ASCII alphanumeric or underscore:
XC5A1C3A1rka65_25262E
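The two hex schemes above can be sketched as follows. This is an illustration in Python rather than C++, assuming the input is encoded as UTF-8 (which is what the example outputs correspond to); the function names are hypothetical.

```python
def hex_encode_all(name, marker="X"):
    """Hex-encode every UTF-8 byte, prefixed with a marker so the result
    never starts with a digit."""
    return marker + name.encode("utf-8").hex().upper()


def hex_encode_unsafe(name, marker="X"):
    """Keep ASCII letters, digits and underscores; hex-encode the UTF-8
    bytes of everything else."""
    parts = [marker]
    for ch in name:
        if ch.isascii() and (ch.isalnum() or ch == "_"):
            parts.append(ch)
        else:
            parts.append(ch.encode("utf-8").hex().upper())
    return "".join(parts)


if __name__ == "__main__":
    # The question's sample string, written with escapes so the
    # precomposed forms of the accented letters are unambiguous.
    sample = "\u0161\u00e1rka65_%&."
    print(hex_encode_all(sample))
    print(hex_encode_unsafe(sample))
```

Note that only the first scheme can be inverted unambiguously; in the second, a literal "25" in the input is indistinguishable from an encoded "%".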
Or, as a comment suggests, you can devise a mapping table for accented Latin letters - indeed, a mapping table appropriately initialized will be the fastest approach. The input is the character in the source string; the output is the desired mapped character or characters. If you use an 8-bit character set, this is entirely manageable. If you use full Unicode, it is a lot less manageable (not least, how do you map all the Han syllabary to ASCII?).
Or ...