Substrings in Ruby

June 3, 2007 at 12:33

Filed under: Computing — Pistos @ 12:33

Most Ruby programmers know how to get substrings out of strings using the ways described in the Ruby core documentation for the String class. For reference, I quote:

   a = "hello there"
   a[1]                   #=> 101
   a[1,3]                 #=> "ell"
   a[1..3]                #=> "ell"
   a[-3,2]                #=> "er"
   a[-4..-2]              #=> "her"
   a[12..-1]              #=> nil
   a[-2..-4]              #=> ""
   a[/[aeiou](.)\1/]      #=> "ell"
   a[/[aeiou](.)\1/, 0]   #=> "ell"
   a[/[aeiou](.)\1/, 1]   #=> "l"
   a[/[aeiou](.)\1/, 2]   #=> nil
   a["lo"]                #=> "lo"
   a["bye"]               #=> nil
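A note for readers on newer interpreters: the `a[1] #=> 101` line in the quoted documentation reflects Ruby 1.8, where indexing with a single integer returned a character code. On Ruby 1.9 and later the same expression returns a one-character string. A minimal sketch, assuming Ruby 1.9 or later (the variable names are only for illustration):

```ruby
a = "hello there"

# Ruby 1.9+: a single integer index yields a one-character string.
single = a[1]                  # "e"

# The character code that Ruby 1.8 returned is still available via String#ord.
code = a[1].ord                # 101

# The regexp-with-group form from the quoted documentation is unchanged.
group = a[/[aeiou](.)\1/, 1]   # "l"
```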

I almost always use the string[ /regexp/, 1 ] method, myself. However, all of the above only let you extract one substring. We can do multi-variable assignment in Ruby:

a, b, c = 1, 2, 3
  # a == 1; b == 2; c == 3
a, b = b, a
  # a == 2; b == 1

So why can’t we do multi-variable substring extraction? Ah, but we can! As described in the MatchData documentation:

multi-variable substring extraction

all, first, second = *(
    /(\w+) +(\w+)/.match "Here is a sentence."
)
  # all == "Here is"
  # first == "Here"
  # second == "is"

Get the same result the other way around, if this ordering makes more sense to you:

all, first, second = *(
    "Here is a sentence.".match( /(\w+) +(\w+)/ )
)
  # all == "Here is"
  # first == "Here"
  # second == "is"

Now, what we really need is a way to do it without having to give a container for the whole match (“all” in the above code), preferably something with more concise syntax. Wouldn’t this be neat:

# Wishful syntax! This parses, but first gets the whole match and second gets nil.
first, second =
  "Here is a sentence."[ /(\w+) +(\w+)/ ]
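For comparison, here is a minimal sketch of something that does run today: `MatchData#captures` returns only the groups, so no placeholder for the whole match is needed. The `&.` safe-navigation operator is an assumption on my part (it needs Ruby 2.3 or later); it keeps a failed match from raising and leaves both variables nil. The names `sentence`, `x` and `y` are only for illustration.

```ruby
sentence = "Here is a sentence."

# MatchData#captures drops the whole-match element and keeps only the groups.
first, second = sentence.match(/(\w+) +(\w+)/).captures

# When the match may fail, &. returns nil instead of raising NoMethodError,
# and multiple assignment from nil leaves every target nil.
x, y = "no-digits".match(/(\d+) +(\d+)/)&.captures
```

It is wordier than the wished-for bracket syntax, but it avoids the throwaway variable.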

7 Comments »

  • manveru says:

    first, second = 'this is a'.scan(/w+/)[0,2]

    ["this", "is"]

  • Pistos says:

    manveru:

    Your backslash got swallowed up by my Markdown plugin:

    'this is a'.scan(/\w+/)[0,2]

    But anyway: Thanks for pointing out this technique. However, I’m not sure it would let us grab arbitrary regexp groups, such as:

    weekday, month, monthday, year =
    "Monday, June 4, 2007"[
      /^(\w+), (\w+) (\d+), (\d+)/
    ]
  • manveru says:

    Yeah, I noticed that but didn’t want to double-post.
    To parse something like this I would use:

      require 'date'  # Date.strptime lives in the date standard library

      string = 'Monday, June 4, 2007'
      format = '%A, %B %d, %Y'
      date = Date.strptime(string, format)
      puts date
      # => 2007-06-04
    

    But I get your point: groups would be quite handy at times, instead of:

      'Monday, June 4, 2007'.match(/^(\w+), (\w+) (\d+), (\d+)/).captures
      # => ["Monday", "June", "4", "2007"]
    

    This time I even previewed the post ;)
    One thing I see, though, is that Markdown doesn’t work like the docs say: four spaces of indentation should result in a code block.

  • Pistos says:

    Hrm:

      test code
      right here.
    

    Seems to work for me? The way I’m getting the highlighted code blocks is actually via the syntax highlight WordPress plugin. Add code blocks on my blog with <pre lang="ruby">.

  • manveru says:

    Yes, but it still tries to remove the \

  • Pistos says:

    Oh, I see. Well… I guess we need to use <pre> and preview. :)

  • Pistos says:

    Ruby 1.9 has named captures. We can name the capture groups in a regexp, and when that regexp is a literal on the left-hand side of =~, Ruby assigns the captured strings to local variables of the same names.

    s = "a123 bxyz"
    /a(?<my_a>\S+) +b(?<my_b>\S+)/ =~ s
    my_a #=> "123"
    my_b #=> "xyz"
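
The date example from earlier in the thread can be written with named groups too. A minimal sketch, assuming Ruby 1.9 or later; the group names (weekday, month, monthday, year) are borrowed from that earlier comment, while `date_parts`, `m` and `month_name` are made up for illustration:

```ruby
# Named groups become local variables when a regexp literal is on the left of =~.
if /^(?<weekday>\w+), (?<month>\w+) (?<monthday>\d+), (?<year>\d+)/ =~ "Monday, June 4, 2007"
  date_parts = [weekday, month, monthday, year]
end

# With String#match, the same groups are read from the MatchData by name.
m = "Monday, June 4, 2007".match(/^(?<weekday>\w+), (?<month>\w+) (?<monthday>\d+), (?<year>\d+)/)
month_name = m[:month]
```

The first form is the closest thing to the multi-variable extraction the post asks for; the second is handy when the regexp is stored in a variable, since the automatic assignment does not apply there.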
