YAML parsing error. Expected <block end>, but found '-' - build

I have the following config.yml:
- persist_to_workspace:
root: ~/project
paths: *build_cache_paths
# for integration tests:
- /home/circleci/cache/Cypress
I'm trying to persist_to_workspace /home/circleci/cache/Cypress. What's wrong with my syntax?

Your paths key has the value *build_cache_paths which is an alias. That means the value of paths is a reference to the node with the anchor &build_cache_paths (assuming it exists).
Two lines below, you start a sequence with -. Generally, a sequence at this level would be the value of a previous implicit key. But in this case it can't be, since the key paths already has a value. Hence the error.
If your goal is to merge the sequence behind *build_cache_paths with the sequence you give below: That is not possible with YAML. YAML is a serialization language, it doesn't implement operations on data (apart from the non-standard merge key << that is supported by some implementations but only works on mappings, not on sequences).

Related

URI encoding in C++ REST SDK ("Casablanca")

I'm using the http listener of the C++ REST SDK 2.8 and noticed the following. If I send the following URL to this listener:
http://my_server/my%2fpath?key=xxx%26yyy%3Dzzz
and I do:
auto uri = request.relative_uri();
auto v_path_components = web::uri::split_path(web::uri::decode(uri.path()));
auto m_query_components = web::uri::split_query(web::uri::decode(uri.query()));
then I find that v_path_components contains 2 elements ["my", "path"], and m_query_components contains 2 pairs [("key","xxx"), ("yyy","zzz")].
What I want and would have expected is v_path_components to contain 1 element ["my/path"], and m_query_components to contain 1 pair [("key","xxx&yyy=zzz")].
In order for the latter to achieve, relative_uri shouldn't decode/encode the uri, as that looses information. In addition, web::uri::decode() should be executed on the split results rather than before splitting. But, as the REST SDK itself as well as many samples shipped with it uses this in the above way, it leads me to believe that I might be wrong.
Could anyone confirm my findings or explain why I'm on the wrong track?
Your findings make sense.
Since you are decoding first, then the encoded ampersand (%3D) becomes a key/value pair separator. Same for the path components. The slash (%2f) becomes a path separator, and is parsed as such.

Yaml-cpp parsing doesn't work space is missing after colon

I have encountered problem in yaml-cpp parser. When I try to load following definition:
DsUniversity:
university_typ: {type: enum, values:[Fachhochschule, Universitat, Berufsakademie]}
students_at_university: {type: string(50)}
I'm getting following error:
Error: yaml-cpp: error at line 2, column 39: end of map flow not found
I tried to verify yaml validity on http://yaml-online-parser.appspot.com/ and http://yamllint.com/ and both services reports yaml as valid.
Problem is caused by missing space after "values:" definition. When yaml is updated to following format:
DsUniversity:
university_typ: {type: enum, values: [Fachhochschule, Universitat, Berufsakademie]}
students_at_university: {type: string(50)}
everything works as expected.
Is there any way how to configure/update/fix yaml-cpp parser to proceed also yamls with missing space after colon?
Added:
It seems that problem is caused by requirement for empty char as separator. When I simplified testing snippet to
DsUniversity:[Fachhochschule, Universitat, Berufsakademie]
yaml-cpp parser reads it as one scalar value "DsUniversity:[Fachhochschule, Universitat, Berufsakademie]". When empty char is added after colon, yaml-cpp correctly loads element with sequence.
yaml-cpp is correct here, and those online validators are incorrect. From the YAML 1.2 spec:
7.4.2. Flow Mappings
Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space. This allows for unquoted URLs and timestamps. It is also a potential source for confusion as “a:1” is a plain scalar and not a key: value pair.
...
To ensure JSON compatibility, if a key inside a flow mapping is JSON-like, YAML allows the following value to be specified adjacent to the “:”. This causes no ambiguity, as all JSON-like keys are surrounded by indicators. However, as this greatly reduces readability, YAML processors should separate the value from the “:” on output, even in this case.
In your example, you're in a flow mapping (meaning a map surrounded by {}), but your key is not JSON-like: you just have a plain scalar (values is unquoted). To be JSON-like, the key needs to be either single- or double-quoted, or it can be a nested flow sequence or map itself.
In your simplified example,
DsUniversity:[Fachhochschule, Universitat, Berufsakademie]
both yaml-cpp and the online validators parse this correctly as a single scalar - in order to be a map, as you intend, you're required a space after the :.
Why does YAML require that space?
In the simple plain scalar case:
a:b
could be ambiguous: it could be read as either a scalar a:b, or a map {a: b}. YAML chooses to read this as a scalar so that URLs can be easily embedded in YAML without quoting:
http://stackoverflow.com
is a scalar (like you'd expect), not a map {http: //stackoverflow.com}!
In a flow context, there's one case where this isn't ambiguous: when the key is quoted, e.g.:
{"a":b}
This is called JSON-like because it's similar to JSON, which requires quotes around all scalars. In this case, YAML knows that the key ends at the end-quote, and so it can be sure that the value starts immediately.
This behavior is explicitly allowed because JSON itself allows things like
{"a":"b"}
Since YAML 1.2 is a strict superset of JSON, this must be legal in YAML.
I think it would be beneficial to parse scalar/keys differently immediately inside a flow map{, if you agree, vote here please.
https://github.com/yaml/yaml-spec/issues/267

Path and string chopping in Powershell

I'm doing some work involving some automated file moving, and these files contain relative paths that must be maintained. Unfortunately, I'm finding the facilities offered by System.IO.Path, System.String, and Powershell's operators to be a little ill-equipped to handle my work gracefully.
One function that would be very useful to me is the notion of a subtraction of paths, that would work in theory like subtracting vectors. Conceptually, A - B gets you a path from B to A. In the application to paths, D:\A\B\C\D - D:\A\B\ = \C\D. Likewise, D:\A\B\ - D:\A\B\C\D = \..\.. in this case. I can accept, for now, that this only makes sense when one path is wholly contained in the other.
This seems to consist of two steps: 1) determine containment of one path in the other. 2) remove the contained path from the containing path. 3) Optionally, replace folder names with the parent .. symbol based on the sidedness of the operation.
As I am concerned with NTFS, I need both containment and replacement operations to be case-insensitive. For containment, I can use select-string since it is case-insensitive, and allows the -simple switch which allows me to use a path without hacking it apart to escape them for regex.
Removing the string from the other is a little more annoying though. System.IO.Path has nothing for this, System.String's pertinent methods are all case-sensitive, and powershell's operators all require massaging so that the regex will match things.
All this seems like more work than it should be--are there any tools I'm missing that would better handle this?
Determine containment - convert your paths to absolute paths (if not already). You can use Resolve-Path for this. Then you can use $path1.StartsWith($path2, 'OrdinalIgnoreCase') to test for containment.
Remove contained path - $path1.Substring($path2.length)
Replace parent folder names with ... - although I don't have the regex off the top of my head, I'm pretty sure you could do this with a regular expression search/replace using PowerShell's -replace operator
filedirectorypath, on CodePlex, may offer what you need
It's not a PowerShell specific API, but that's no reason not to use it from PowerShell.
Benefits of the NDepend.Helpers.FilePathDirectory over the .NET Framework class System.IO.Path include:
Strongly typed File/Directory path.
Relative / absolute path conversion.
Path normalization API
Path validity check API
Path comparison API
Path browsing API.
Path rebasing API
List of path operations (TryGetCommonRootDirectory, GetListOfUniqueDirsAndUniqueFileNames, list equality…)

Fast string matching algorithm with simple wildcards support

I need to match input strings (URLs) against a large set (anywhere from 1k-250k) of string rules with simple wildcard support.
Requirements for wildcard support are as follows:
Wildcard (*) can only substitute a "part" of a URL. That is fragments of a domain, path, and parameters. For example, "*.part.part/*/part?part=part&part=*". The only exception to this rule is in the path area where "/*" should match anything after the slash.
Examples:
*.site.com/* -- should match sub.site.com/home.html, sub2.site.com/path/home.html
sub.site.*/path/* -- should match sub.site.com/path/home.html, sub.site.net/path/home.html, but not sub.site.com/home.html
Additional requirements:
Fast lookup (I realize "fast" is a relative term. Given the max 250k rules, still fall within < 1.5s if possible.)
Work within the scope of a modern desktop (e.g. not a server implementation)
Ability to return 0:n matches given a input string
Matches will have rule data attached to them
What is the best system/algorithm for such as task? I will be developing the solution in C++ with the rules themselves stored in a SQLite database.
First of all, one of the worst performing searches you can do is with a wildcard at both ends of the string ".domain.com/path" -- and I think you're going to hit this case a lot. So my first recommendation is to reverse the order of the domains as they're stored in your DB: com.domain.example/path1/path2/page.html. That will allow you to keep things much more tidy and only use wildcards in "one direction" on the string, which will provide MUCH faster lookups.
I think John mentions some good points about how to do this all within your DB. If that doesn't work I would use a regex library in C++ against the list. I bet you'll get the best performance and most general regex syntax that way.
If I'm not mistaken, you can take string rule and break it up into domain, path, and query pieces, just like it's a URL. Then you can apply a standard wildcard matching algorithm with each of those pieces against the corresponding pieces from the URLs you want to test against. If all of the pieces match, the rule is a match.
Example
Rule: *.site.com/*
domain => *.site.com
path => /*
query => [empty]
URL: sub.site.com/path/home.html
domain => sub.site.com
path => /path/home.html
query => [empty]
Matching process:
domain => *.site.com matches sub.site.com? YES
path => /* matches /path/home.html? YES
query => [empty] matches [empty] YES
Result: MATCH
As you are storing the rules in a database I would store them already broken into those three pieces. And if you want uber-speed you could convert the *'s to %'s and then use the database's native LIKE operation to do the matching for you. Then you'd just have a query like
SELECT *
FROM ruleTable
WHERE #urlDomain LIKE ruleDomain
AND #urlPath LIKE rulePath
AND #urlQuery LIKE ruleQuery
where #urlDomain, #urlPath, and #urlQuery are variables in a prepared statement. The query would return the rules that match a URL, or an empty result set if nothing matches.

Tokenize the text depending on some specific rules. Algorithm in C++

I am writing a program which will tokenize the input text depending upon some specific rules. I am using C++ for this.
Rules
Letter 'a' should be converted to token 'V-A'
Letter 'p' should be converted to token 'C-PA'
Letter 'pp' should be converted to token 'C-PPA'
Letter 'u' should be converted to token 'V-U'
This is just a sample and in real time I have around 500+ rules like this. If I am providing input as 'appu', it should tokenize like 'V-A + C-PPA + V-U'. I have implemented an algorithm for doing this and wanted to make sure that I am doing the right thing.
Algorithm
All rules will be kept in a XML file with the corresponding mapping to the token. Something like
<rules>
<rule pattern="a" token="V-A" />
<rule pattern="p" token="C-PA" />
<rule pattern="pp" token="C-PPA" />
<rule pattern="u" token="V-U" />
</rules>
1 - When the application starts, read this xml file and keep the values in a 'std::map'. This will be available until the end of the application(singleton pattern implementation).
2 - Iterate the input text characters. For each character, look for a match. If found, become more greedy and look for more matches by taking the next characters from the input text. Do this until we are getting a no match. So for the input text 'appu', first look for a match for 'a'. If found, try to get more match by taking the next character from the input text. So it will try to match 'ap' and found no matches. So it just returns.
3 - Replace the letter 'a' from input text as we got a token for it.
4 - Repeat step 2 and 3 with the remaining characters in the input text.
Here is a more simple explanation of the steps
input-text = 'appu'
tokens-generated=''
// First iteration
character-to-match = 'a'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'ap'
pattern-found = false
tokens-generated = 'V-A'
// since no match found for 'ap', taking the first success and replacing it from input text
input-text = 'ppu'
// second iteration
character-to-match = 'p'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'pp'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'ppu'
pattern-found = false
tokens-generated = 'V-A + C-PPA'
// since no match found for 'ppu', taking the first success and replacing it from input text
input-text = 'u'
// third iteration
character-to-match = 'u'
pattern-found = true
tokens-generated = 'V-A + C-PPA + V-U' // we'r done!
Questions
1 - Is this algorithm looks fine for this problem or is there a better way to address this problem?
2 - If this is the right method, std::map is a good choice here? Or do I need to create my own key/value container?
3 - Is there a library available which can tokenize string like the above?
Any help would be appreciated
:)
So you're going through all of the tokens in your map looking for matches? You might as well use a list or array, there; it's going to be an inefficient search regardless.
A much more efficient way of finding just the tokens suitable for starting or continuing a match would be to store them as a trie. A lookup of a letter there would give you a sub-trie which contains only the tokens which have that letter as the first letter, and then you just continue searching downward as far as you can go.
Edit: let me explain this a little further.
First, I should explain that I'm not familiar with these the C++ std::map, beyond the name, which makes this a perfect example of why one learns the theory of this stuff as well as than details of particular libraries in particular programming languages: unless that library is badly misusing the name "map" (which is rather unlikely), the name itself tells me a lot about the characteristics of the data structure. I know, for example, that there's going to be a function that, given a single key and the map, will very efficiently search for and return the value associated with that key, and that there's also likely a function that will give you a list/array/whatever of all of the keys, which you could search yourself using your own code.
My interpretation of your data structure is that you have a map where the keys are what you call a pattern, those being a list (or array, or something of that nature) of characters, and the values are tokens. Thus, you can, given a full pattern, quickly find the token associated with it.
Unfortunately, while such a map is a good match to converting your XML input format to a internal data structure, it's not a good match to the searches you need to do. Note that you're not looking up entire patterns, but the first character of a pattern, producing a set of possible tokens, followed by a lookup of the second character of a pattern from within the set of patterns produced by that first lookup, and so on.
So what you really need is not a single map, but maps of maps of maps, each keyed by a single character. A lookup of "p" on the top level should give you a new map, with two keys: p, producing the C-PPA token, and "anything else", producing the C-PA token. This is effectively a trie data structure.
Does this make sense?
It may help if you start out by writing the parsing code first, in this manner: imagine someone else will write the functions to do the lookups you need, and he's a really good programmer and can do pretty much any magic that you want. Writing the parsing code, concentrate on making that as simple and clean as possible, creating whatever interface using these arbitrary functions you need (while not getting trivial and replacing the whole thing with one function!). Now you can look at the lookup functions you ended up with, and that tells you how you need to access your data structure, which will lead you to the type of data structure you need. Once you've figured that out, you can then work out how to load it up.
This method will work - I'm not sure that it is efficient, but it should work.
I would use the standard std::map rather than your own system.
There are tools like lex (or flex) that can be used for this. The issue would be whether you can regenerate the lexical analyzer that it would construct when the XML specification changes. If the XML specification does not change often, you may be able to use tools such as lex to do the scanning and mapping more easily. If the XML specification can change at the whim of those using the program, then lex is probably less appropriate.
There are some caveats - notably that both lex and flex generate C code, rather than C++.
I would also consider looking at pattern matching technology - the sort of stuff that egrep in particular uses. This has the merit of being something that can be handled at runtime (because egrep does it all the time). Or you could go for a scripting language - Perl, Python, ... Or you could consider something like PCRE (Perl Compatible Regular Expressions) library.
Better yet, if you're going to use the boost library, there's always the Boost tokenizer library -> http://www.boost.org/doc/libs/1_39_0/libs/tokenizer/index.html
You could use a regex (perhaps the boost::regex library). If all of the patterns are just strings of letters, a regex like "(a|p|pp|u)" would find a greedy match. So:
Run a regex_search using the above pattern to locate the next match
Plug the match-text into your std::map to get the replace-text.
Print the non-matched consumed input and replace-text to your output, then repeat 1 on the remaining input.
And done.
It may seem a bit complicated, but the most efficient way to do that is to use a graph to represent a state-chart. At first, i thought boost.statechart would help, but i figured it wasn't really appropriate. This method can be more efficient that using a simple std::map IF there are many rules, the number of possible characters is limited and the length of the text to read is quite high.
So anyway, using a simple graph :
0) create graph with "start" vertex
1) read xml configuration file and create vertices when needed (transition from one "set of characters" (eg "pp") to an additional one (eg "ppa")). Inside each vertex, store a transition table to the next vertices. If "key text" is complete, mark vertex as final and store the resulting text
2) now read text and interpret it using the graph. Start at the "start" vertex. ( * ) Use table to interpret one character and to jump to new vertex. If no new vertex has been selected, an error can be issued. Otherwise, if new vertex is final, print the resulting text and jump back to start vertex. Go back to (*) until there is no more text to interpret.
You could use boost.graph to represent the graph, but i think it is overly complex for what you need. Make your own custom representation.