Extending Regex Replace
Recently I was working with the STL regular expression library, found in <regex>. I was building a simple utility to identify tokens in a string and replace them with pre-defined values. I had two requirements. First, all replacements had to occur in a single pass. Second, each token had to be matched with its paired replacement value. As an example, let’s say I am using tokens to define specific directory paths. These are the tokens and their matching values:
Token | Value |
---|---|
{MyApp} | brainstem_breakfast |
{InstallDir} | /home/brian/apps/ |
I want to do a regex_replace on the string: {InstallDir}{MyApp}
regex_replace is unable to perform this job because it only takes a string as a formatter. The fmt string can refer to capture groups using the $n syntax, but it can’t choose a different replacement value for each match dynamically. I am no regular expression wizard, but I think regex_replace will not be able to fulfill my needs here.
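To make the limitation concrete, here is a minimal sketch. The function name replace_with_fixed_format and the pattern are my own choices for illustration. The format string can echo the captured token name through $1, but the same template is applied to every match, so there is no place to look the name up in a table.

```cpp
#include <regex>
#include <string>

// Every match is rewritten with the same format string. "$1" echoes the
// captured token name, but it cannot be mapped to a per-token value.
std::string replace_with_fixed_format(const std::string& input) {
    static const std::regex token(R"(\{(\w+)\})");
    return std::regex_replace(input, token, "<$1>");
}
```

Calling it on {InstallDir}{MyApp} gives <InstallDir><MyApp>: the names are preserved, but not replaced by their values.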
Regex Iterators
Luckily, I can construct a regex_replace alternative using regex_token_iterator, a ForwardIterator that tokenizes both the matched and the unmatched parts of a string. With regex_token_iterator it is possible to iterate over unmatched text, whole matches, and submatches. What the tokenizer iterates over is controlled through the int submatch or array submatches parameter of the constructor: -1 selects the unmatched portions, 0 selects the entire match, and values 1 and greater select the corresponding capture groups (submatches).
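The effect of the three kinds of selector can be seen with a small helper. This is only a sketch; the name tokenize is hypothetical, and it uses the std::string specialization std::sregex_token_iterator.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect whatever std::sregex_token_iterator yields for a single submatch
// selector: -1 (unmatched text), 0 (whole match), or N >= 1 (capture group N).
std::vector<std::string> tokenize(const std::string& input,
                                  const std::regex& re,
                                  int submatch) {
    std::vector<std::string> out;
    std::sregex_token_iterator it(input.begin(), input.end(), re, submatch);
    const std::sregex_token_iterator end;  // default-constructed = end of sequence
    for (; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

With the pattern \{(\w+)\} and the input a{X}b{Y}c, selector -1 yields a, b, c; selector 0 yields {X}, {Y}; and selector 1 yields X, Y.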
Extending regex_replace
Using regex_token_iterator, it is possible to construct a new regex_replace function. The new function, rather than taking a std::basic_string or CharT * as its formatter, will take a std::function<std::basic_string (int submatch, std::basic_string)>. In fact, the Boost regex_replace, upon which the STL version was designed, provides similar functionality. In Boost, fmt is defined as a Formatter, which can be a C-style string, a container of CharT (like std::basic_string, although I suppose std::vector<CharT> would also work), or a “unary, binary or ternary functor that computes the replacement string from a function call”.
The Solution
The basic idea is to iterate over both the unmatched text and the submatches. Unmatched text is passed straight through to the output string. For a submatch, the token and its submatch index are passed to the fmt function, which determines the replacement value and returns it to be inserted into the output string.
Here I have created a string to show off the capabilities of the new regex_replace_ext, as well as a regular expression to match the tokens in the string.
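The original listing is not reproduced here, so the following is only a stand-in. The sample text, the [User] token, and the names sample_text and token_pattern are assumptions made for illustration. The pattern uses one capture group per token style, which is what lets the submatch index identify the style later.

```cpp
#include <regex>
#include <string>

// Hypothetical sample input mixing "{...}" tokens, "[...]" tokens and plain text.
const std::string sample_text = "Install {MyApp} to {InstallDir} for [User].";

// Group 1 captures the name inside braces, group 2 the name inside brackets.
const std::regex token_pattern(R"(\{(\w+)\}|\[(\w+)\])");
```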
Additionally, I need a mapping between tokens and values. I have separated the “{ }” and “[ ]” tokens into their own dictionaries to highlight the flexibility of the new regex_replace_ext. I have also created the function argument for the fmt parameter. It matches the submatch index to the correct dictionary and then looks for the token in that dictionary.
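A sketch of what the dictionaries and the fmt callable could look like follows. The names Dictionary and make_formatter, the [User] entry, and its value are hypothetical. I also assume the capture groups exclude the surrounding braces or brackets, so the keys are bare names, and that submatch 1 is the “{ }” group and submatch 2 the “[ ]” group.

```cpp
#include <functional>
#include <map>
#include <string>

using Dictionary = std::map<std::string, std::string>;

// Build a formatter that picks a dictionary by submatch index and then looks
// the token up in it. Unknown (or empty) tokens map to an empty string.
std::function<std::string(int, const std::string&)> make_formatter() {
    const Dictionary braces = {
        {"MyApp", "brainstem_breakfast"},
        {"InstallDir", "/home/brian/apps/"},
    };
    const Dictionary brackets = {
        {"User", "brian"},  // hypothetical entry, not from the original post
    };
    return [braces, brackets](int submatch, const std::string& token) {
        const Dictionary& dict = (submatch == 1) ? braces : brackets;
        const auto found = dict.find(token);
        return found != dict.end() ? found->second : std::string();
    };
}
```

Returning an empty string for an unknown token is one possible policy; leaving the token text in place would be another.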
With all the arguments ready, it is time to put the new regex_replace_ext into action.
Let’s break down the functionality into easy-to-understand parts. First, regex_token_iterator takes a single submatch value or a vector of values, and I want to include the unmatched parts of the string. So I populate the vector with -1 (the unmatched text), skip 0 (the whole match), and then add increasing values for as many submatches as exist in the regular expression.
Constructing the regular expression iterator.
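As a rough sketch of this step, the selector list can be built from std::regex::mark_count(), which reports the number of capture groups. The helper names submatch_indices and all_tokens are mine, and all_tokens exists only to make the order of the yielded tokens visible.

```cpp
#include <regex>
#include <string>
#include <vector>

// Build the selector list: -1 for unmatched text, then 1..N for each capture
// group. 0 (the whole match) is deliberately left out.
std::vector<int> submatch_indices(const std::regex& re) {
    std::vector<int> indices{-1};
    for (unsigned group = 1; group <= re.mark_count(); ++group) {
        indices.push_back(static_cast<int>(group));
    }
    return indices;
}

// Gather every token the iterator yields, in order.
std::vector<std::string> all_tokens(const std::string& input, const std::regex& re) {
    const std::vector<int> indices = submatch_indices(re);
    std::vector<std::string> tokens;
    const std::sregex_token_iterator end;
    for (std::sregex_token_iterator it(input.begin(), input.end(), re, indices);
         it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

For each match the iterator yields one token per selector, in selector order, and a group that did not take part in the match shows up as an empty string.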
Finally, allowing the regex_token_iterator to do most of the work, I simply iterate over the tokens. The regex_token_iterator yields the submatch tokens in the order they were passed, and if a group did not take part in a match, the value for its token is an empty string. It is necessary to keep track of which submatch the iterator is currently on; this is possible simply by incrementing a counter modulo the total number of submatch values passed to the iterator. When count == 0, the iterator is on the unmatched text, so we simply append that value to the return string. Otherwise, count and token are sent to the fmt function defined earlier, which supplies the replacement value.
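Putting the pieces together, a minimal sketch of regex_replace_ext for std::string might look like the following. It is a reconstruction from the description above, not the original listing. The Formatter alias is my own name, and skipping groups that did not participate in the match (instead of passing an empty token to fmt) is a small addition of mine.

```cpp
#include <cstddef>
#include <functional>
#include <regex>
#include <string>
#include <vector>

using Formatter = std::function<std::string(int, const std::string&)>;

std::string regex_replace_ext(const std::string& input,
                              const std::regex& re,
                              const Formatter& fmt) {
    // -1 selects unmatched text; 1..mark_count() select the capture groups.
    std::vector<int> submatches{-1};
    for (unsigned group = 1; group <= re.mark_count(); ++group) {
        submatches.push_back(static_cast<int>(group));
    }

    std::string result;
    std::size_t count = 0;  // position in `submatches` of the current token
    const std::sregex_token_iterator end;
    for (std::sregex_token_iterator it(input.begin(), input.end(), re, submatches);
         it != end; ++it) {
        if (count == 0) {
            result += it->str();  // unmatched text is copied through
        } else if (it->matched) {
            // count equals the capture group number here.
            result += fmt(static_cast<int>(count), it->str());
        }
        count = (count + 1) % submatches.size();
    }
    return result;
}
```

With a formatter that maps InstallDir and MyApp in group 1 to the values from the table at the top, calling this on {InstallDir}{MyApp} with the pattern \{(\w+)\}|\[(\w+)\] should produce /home/brian/apps/brainstem_breakfast in a single pass.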
Thanks for reading! Brian Rackle