Here’s an interesting problem I had recently. I had a body of text that I was doing a number of fairly complex regular expression substitutions on. After encountering some errors in the results I was getting, I realised that I needed to make further substitutions within regex capturing groups. Figuring out how to do this took a bit of digging and the use of String#gsub’s block form. For the sake of example, let’s make some changes to the first paragraph of Lewis Carroll’s Alice’s Adventures in Wonderland.
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'
Let’s say we want to change the single quotes enclosing Alice’s thoughts to LaTeX-style double quotes, i.e. `` and ''. We can’t just blanket replace ' with another character or characters, because the replacement depends on whether the ' opens or closes a thought. We could probably mess around checking for spaces, but we want to do this in one regex. We also want to do something with the thoughts themselves later, so we’re going to try to match each thought exclusive of its delimiters.
Our regex for capturing thoughts is '([^']*)' (match a ', then zero or more characters that aren’t ', then another ', capturing everything between the two as the first group), which would produce two overall matches, each with one inner captured group:
| Overall match | Group 1 |
|---|---|
| 'and what is the use of a book,' | and what is the use of a book, |
| 'without pictures or conversations?' | without pictures or conversations? |
Now we’re capturing thoughts. To make our desired quotes substitution, we can use a single gsub as below:
dbquotes = alice.gsub(/'([^']*)'/,'``\1\'\'')
In a gsub replacement string, \1 tells Ruby “insert the first capture group here”. \2 will insert the second group, \3 the third, and so on. You can also give groups names if you prefer those to numbers. The new variable dbquotes will therefore contain the following string:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ``and what is the use of a book,'' thought Alice ``without pictures or conversations?''
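As a quick illustration of the named-group option, here is a minimal sketch. It assumes a shortened excerpt of the passage (the variable name excerpt and the group name thought are mine, chosen only for this example) and uses Ruby's \k<name> syntax in the replacement string:

```ruby
# A shortened excerpt of the passage, used here only to keep the example small.
excerpt = "it had no pictures or conversations in it, " \
          "'and what is the use of a book,' thought Alice " \
          "'without pictures or conversations?'"

# (?<thought>...) names the capture group; \k<thought> inserts it.
# The group name "thought" is an arbitrary choice for this sketch.
named = excerpt.gsub(/'(?<thought>[^']*)'/, '``\k<thought>\'\'')

puts named
```

The result should be the same as with \1; the name just documents what the group holds.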
Now comes the nesting. Let’s say we want to replace all the space characters in Alice’s thoughts with |s for some reason. We can’t do this with dbquotes.gsub(' ','|') because that will replace every space in the passage. With enough time and effort, maybe you could write a very ugly regex with a lot of lookahead and lookbehind assertions to get the result you want, but let’s not. Besides, we’re already capturing Alice’s thoughts, so we should just be able to gsub them individually. But how?
The most common and straightforward version of String#gsub takes two arguments: a string/regex to match and a string to substitute in its place. As we’ve seen, capture groups can be used to include the matched text in the substitutions, but because of the order of operations, there’s very little we can do with \1 itself.
failure = alice.gsub(/'([^']*)'/,'``\1\'\''.gsub(' ','|'))
The naive approach above will fail: Ruby evaluates the inner gsub first, on the replacement string with its literal \1 rather than on the expansion. It finds no spaces there and does nothing, so the spaces in Alice’s thoughts come out untouched.
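To make the order of operations concrete, here is a small sketch, again assuming a shortened excerpt of the passage (the variable names are mine). It runs the inner gsub on its own and then compares the naive result with the plain quote substitution:

```ruby
# A shortened excerpt of the passage, used here only to keep the example small.
excerpt = "it had no pictures or conversations in it, " \
          "'and what is the use of a book,' thought Alice " \
          "'without pictures or conversations?'"

replacement = '``\1\'\''

# Ruby evaluates the argument first, so the inner gsub only ever sees
# the replacement template, which contains no spaces.
template_after_gsub = replacement.gsub(' ', '|')
puts template_after_gsub == replacement   # true

# The "naive" call is therefore identical to the plain substitution.
naive = excerpt.gsub(/'([^']*)'/, template_after_gsub)
plain = excerpt.gsub(/'([^']*)'/, replacement)
puts naive == plain                       # true
```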
Luckily for us, there’s another form of gsub that takes a match regex/string and a substitution block. This form of gsub calls the block once for each match and uses the block’s return value as the replacement for that match. We can’t use \1 anymore, but Ruby makes the captured groups available inside the block in variables that look like $1. So to do what we were doing before, we’d write:
dbquotes = alice.gsub(/'([^']*)'/) do
"``#{$1}''"
end
This would produce the exact same output as before. But now we have the breathing room of a block and actual variables to work with, making our | substitution simple.
dbquotes = alice.gsub(/'([^']*)'/) do
"``#{$1.gsub(' ','|')}''"
end
"Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ``and|what|is|the|use|of|a|book,'' thought Alice ``without|pictures|or|conversations?''"
From here, we can make whatever further alterations we want, and we’re not limited to mere substitutions.
dbquotes = alice.gsub(/'([^']*)'/) do
"``#{$1.reverse}''"
end
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ``,koob a fo esu eht si tahw dna'' thought Alice ``?snoitasrevnoc ro serutcip tuohtiw''
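If it helps to see what the block form is doing, here is a rough sketch that records each call to the block. It assumes a shortened excerpt of the passage; the names excerpt, seen and whole are mine. The block also receives the overall match as its argument, and Regexp.last_match(1) is a more verbose equivalent of $1:

```ruby
# A shortened excerpt of the passage, used here only to keep the example small.
excerpt = "it had no pictures or conversations in it, " \
          "'and what is the use of a book,' thought Alice " \
          "'without pictures or conversations?'"

seen = []
result = excerpt.gsub(/'([^']*)'/) do |whole|
  seen << whole                      # the overall match, delimiters included
  "``#{Regexp.last_match(1)}''"      # same value as $1 at this point
end

puts seen.length   # 2, one block call per thought
puts result
```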
We can also use multiple groups if, say, we want to do something special with the first word of each thought.
dbquotes = alice.gsub(/'([^ ]*)([^']*)'/) do
"``#{$1.upcase}:-#{$2.gsub(' ','|')}''"
end
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ``AND:-|what|is|the|use|of|a|book,'' thought Alice ``WITHOUT:-|pictures|or|conversations?''
And it goes without saying that you can nest further inner substitutions to your heart’s content. If Alice’s thoughts contained (bracketed asides), we could use a second, inner String#gsub() {} to pull those out and Title Case them, for example. Endless possibilities!
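Here is one way that nested version might look. This is only a sketch: the sample sentence is made up (the real passage has no bracketed asides), and the variable names are mine. One thing to watch is that the inner gsub overwrites $1, so the outer capture is saved to a local variable first:

```ruby
# A made-up sentence with a bracketed aside inside a quoted thought.
sample = "She sighed, 'what a day (and what a very long book) it has been,' and yawned."

result = sample.gsub(/'([^']*)'/) do
  thought = $1   # save the outer capture before the inner gsub replaces $1

  # Inner gsub: Title Case whatever sits between parentheses.
  inner = thought.gsub(/\(([^)]*)\)/) do
    "(" + $1.split.map(&:capitalize).join(" ") + ")"
  end

  "``#{inner}''"
end

puts result
# She sighed, ``what a day (And What A Very Long Book) it has been,'' and yawned.
```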