Friday, December 30, 2005

Matching ISO-8859-1 strings with Ruby

While refactoring a braindead legacy application, I needed to separate employees' last names and first names, which have historically been stored in a single field in the employees table. A piece of cake, even though Spanish names can be a bit more complicated than their English counterparts. Here is the regular expression:

# matches 'Gomez', 'de la Cruz', 'de los Santos', etc.
apellido = '((?:(?:de|del|la|las|los|y|san)\s+)*(?:\w|\#)+)\s+'

# matches the rest of the string after the last names
nombres = "(.*)"

# the complete regular expression
re = /#{apellido}#{apellido}#{nombres}/i

It's not perfect, but it covers most of our cases, with only one failure.
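To see what ends up in each of the three capture groups, here is a small self-contained sketch. The helper name split_name and the sample names are made up for illustration; the patterns are the same as above, repeated so the snippet runs on its own.

```ruby
# Same patterns as in the post, stored in constants for this sketch.
APELLIDO = '((?:(?:de|del|la|las|los|y|san)\s+)*(?:\w|\#)+)\s+'
NOMBRES  = '(.*)'
NAME_RE  = /#{APELLIDO}#{APELLIDO}#{NOMBRES}/i

# Hypothetical helper: returns [first last name, second last name, given names],
# or nil when the string does not contain two last names followed by the rest.
def split_name(fullname)
  m = NAME_RE.match(fullname)
  m && m.captures
end

p split_name('Gomez de la Cruz Juan Carlos')
# => ["Gomez", "de la Cruz", "Juan Carlos"]
p split_name('Nu#ez de los Santos Maria de Jesus')
# => ["Nu#ez", "de los Santos", "Maria de Jesus"]
p split_name('Gomez')
# => nil
```

Particles such as 'de' or 'los' are glued to the word that follows them, which is why 'de la Cruz' stays together as a single last name.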

These names were written in US-ASCII and thus without accented letters or 'Ñ' (did you notice the '#' in the regular expression? It stands in for the 'Ñ'). But what if those characters were used?

Ruby supports some encodings, including UTF-8, which would be enough for matching those characters. Unfortunately, the database was created with the ISO-8859-1 encoding, and converting it to UTF-8 was not an option because many programs and (very old) printers depend on ISO-8859-1.

Ruby supports ISO-8859-1 with its new regular expression engine code-named Oniguruma, but only in the development branch (1.9). Oniguruma will be included in Ruby 2.0.

There was one option left: converting the string from ISO-8859-1 to UTF-8 before passing it through the regular expression. This is done with the interface to the iconv library.

require 'iconv'

# We want to convert from ISO-8859-1 to UTF-8
c = Iconv.new('UTF-8', 'ISO-8859-1')

# This is an ISO-8859-1 string
fullname = "Núñez de los Santos María de Jesús"

# Converting
utf_fullname = c.iconv(fullname)

# We can test it, splitting the name into words:
utf_fullname.scan(/\w+/u)

Since \w now matches accented letters and 'ñ', the previous code splits the converted name into words.

Notice the extra 'u' after the regular expression. It's an option telling Ruby to treat the string as UTF-8 encoded (the 'e' option would select EUC-JP instead).
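On Ruby 1.9 and later, strings carry their own encoding and the iconv binding is no longer part of the standard library (it was dropped in 2.0), so the conversion above would look different there. A minimal sketch, assuming the bytes arrive untagged from the database; the sample bytes and variable names are only illustrative:

```ruby
# Bytes as an ISO-8859-1 database would hand them over (assumed sample data):
# \xFA is 'ú', \xF1 is 'ñ', \xED is 'í'.
raw = "N\xFA\xF1ez de los Santos Mar\xEDa de Jes\xFAs".b

# Tell Ruby what the bytes mean, then transcode them to UTF-8.
utf_fullname = raw.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# \p{L} matches any Unicode letter; on these Ruby versions \w alone is ASCII-only.
words = utf_fullname.scan(/\p{L}+/)
p words
```

Here words holds the seven parts of the name, from 'Núñez' to 'Jesús', with the accented letters kept inside their words. Swapping \w for \p{L} is a choice made for this sketch, not part of the original approach.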

3 comments:

Jackson Sejoon Park said...

I'm running into a similar problem right now. I get a list of names, and some of them may have accented letters. I have a database where the names are all stored in ASCII. What I would like to do is convert the characters, such as í into i and ú into u, in Ruby. Any ideas or thoughts on how to do this?

Gerardo said...

Hi Sparky,

You can use String#tr for that if you're using ISO-8859-1, or any other one-byte encoding:

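# Works byte for byte in ISO-8859-1 or any other one-byte encoding
src = "Gómez Rodríguez"
accented = "áéíóú"
notaccented = "aeiou"

# Replace each accented vowel with the plain vowel at the same position
dst = src.tr(accented, notaccented)
puts src, dst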

Demon said...

Regular expressions are really wonderful for parsing HTML or matching patterns. I use them a lot when I code. Actually, when I learn any new language, the first thing I check is whether it supports regex. I feel at ease when I find that it does.

http://icfun.blogspot.com/2008/04/ruby-regular-expression-handling.html

Here is a post about Ruby regex. I wrote it when I was first learning Ruby regex, so it should be helpful for new coders.