Express Yourself – String Comparison in Oracle using REGEXP_REPLACE

As a programmer, one of the absolute pleasures of Unix ( especially when compared to Windows) is the flexibility offerred by the use of Regular Expressions. Not only do tools such as Sed and Awk offer almost unlimited options for manipulating text files, but Regular Expression syntax offers an unparalleled opportunity to exercise the top row of your keyboard.
Regular Expressions have been available in Oracle for a while now, but it’s one of those incredibly useful features that doesn’t seem to be widely used.
This is no doubt due, in part, to the fairly involved syntax, but help is at hand. There’s a particularly good explaination of most of the Oracle Regular Expression functions ( and syntax) on the excellent PSOUG reference site ( formerly known as Morgan’s Library).

I want to concentrate here on the REGEXP_REPLACE function and how you can use it to help when comparing strings in the database that have come from different sources and are, as a result, in different formats.

Being Improperly Addressed

Perhaps a common example of when this is likely to come in handy is when dealing with postal address data.
What tends to happen, particularly on OLTP systems is that the initial free-form entry of addresses leads to differences in formatting. Over time, this leads to massive duplication where the same address is entered many times in a slightly different format.

The result is a massive addresses table, which is very slow to search, which results in users entering an address again because it’s quicker than searching through all the muck that’s now sloshing around in the addresses table.

Then, someone gets the bright idea of storing the address information with the record that they want to display. This is usually done without referring back to a parent record in the addresses table.
All of which leads to a great big sticky mess.

No RI, no comment

Consider the following scenario, which I’ve come across more than once in my time.
The database has a SUPPLIERS table which, at some point over the years has had address data added to it.
A CONTRACTS table has had the same treatment at a different point in time, but with the same result.

The formatting of address data is not standard across the system. Whilst the address data in the contracts table is entered in uppercase, the same cannot be said for the suppliers address data.
At some point, the contracts table has been the subject of a data load, which didn’t really work as expected.
The resulting carnage could be re-created like this

CREATE TABLE suppliers ( 
    supp_id NUMBER PRIMARY KEY, 
    supplier_name VARCHAR2(100), 
    supplier_addr VARCHAR2(100)) 
/ 

INSERT INTO suppliers( supp_id, supplier_name, supplier_addr) 
VALUES( 1, 'Evil Giant Corp Inc.', '1 World Domination Towers, Aylesbury, EV1 L') 
/ 

INSERT INTO suppliers( supp_id, supplier_name, supplier_addr) 
VALUES( 2, 'Not as big as Evil but Still pretty mean Co', '2 High Street, Stony Stratford, Milton Keynes') 
/ 

INSERT INTO suppliers( supp_id, supplier_name, supplier_addr) 
VALUES( 3, 'Honest Mikes Software Emporium and Farm Shop', '3 Dunghill Mansions, Putney') 
/ 

INSERT INTO suppliers( supp_id, supplier_name, supplier_addr) 
VALUES( 4, 'Big Pink', '22 Acacia Avenue Tunbridge Wells, Kent') 
/ 

CREATE TABLE contracts ( 
    con_id NUMBER, 
    supp_id NUMBER REFERENCES suppliers(supp_id), 
    contractor_addr VARCHAR2(100)) 
/ 


INSERT INTO contracts( con_id, supp_id, contractor_addr) 
VALUES(1,1, '1 WORLD DOMINATION     TOWERS, AYLESBURY EV1 L') 
/ 

INSERT INTO contracts( con_id, supp_id, contractor_addr) 
VALUES( 2,2, '2 HIGH STREET# STONY STRATFORD# MILTON KEYNES# ') 
/ 

INSERT INTO contracts( con_id, supp_id, contractor_addr) 
VALUES( 3, 3,'3 DUNGHILL MANSIONS, PUTNEY') 
/ 

INSERT INTO CONTRACTS( con_id, supp_id, contractor_addr) 
VALUES( 4, 4, '149A NELSON MANDELA HOUSE, PECKHAM') 
/

Eventually, it comes to light that different parts of the application are returning different address information when ostensibly querying the same data :

SELECT supp.supplier_name, con.contractor_addr 
FROM suppliers supp, contracts con 
WHERE con.supp_id = supp.supp_id; 

SUPPLIER_NAME                                              CONTRACTOR_ADDR
--------------------------------------------------        --------------------------------------------------
Evil Giant Corp Inc.				             1 WORLD DOMINATION	  TOWERS, AYLESBURY EV1 L
Not as big as Evil but Still pretty mean Co            2 HIGH STREET# STONY STRATFORD# MILTON KEYNES#
Honest Mikes Software Emporium and Farm Shop  3 DUNGHILL MANSIONS, PUTNEY
Big Pink					                     149A NELSON MANDELA HOUSE, PECKHAM

As opposed to :

SELECT supp.supplier_name, supp.supplier_addr
FROM suppliers supp, contracts con
WHERE con.supp_id = supp.supp_id;
SUPPLIER_NAME                                                    SUPPLIER_ADDR
--------------------------------------------------              --------------------------------------------------
Evil Giant Corp Inc.                                                 1 World Domination Towers, Aylesbury, EV1 L
Not as big as Evil but Still pretty mean Co                   2 High Street, Stony Stratford, Milton Keynes
Honest Mikes Software Emporium and Farm Shop         3 Dunghill Mansions, Putney
Big Pink                                                                 22 Acacia Avenue Tunbridge Wells, Kent

Incidentally, apologies for the “eccentric” formatting of the result sets in the above listings. It all looks fine when I’m drafting the post and then goes all wacky when I preview the post. Lots of messing around with formats has so far not resulted in a solution.

Now, which lucky person has been “volunteered” to try and identify the records where the address information itself ( as opposed to the format) doesn’t match ? You really should by a lottery ticket this week.

Boring old string manipulation

Looking at the data, we know that we need to filter out differences due to case, spacing and any weird non-alphanumeric characters. As the data isn’t particularly trustworthy, we want to apply the same transformations to both tables so we don’t get any false positives.
If we’re sifting through a large number of records, we might want to start narrowing down the differences a step at a time. So, our first attempt might look something like this :

SELECT supp.supplier_name 
FROM suppliers supp, contracts con 
WHERE supp.supp_id = con.supp_id 
AND UPPER(supp.supplier_addr) != UPPER(con.contractor_addr); 

SUPPLIER_NAME
--------------------------------------------------
Evil Giant Corp Inc.
Not as big as Evil but Still pretty mean Co
Big Pink

Well, that did help a bit, but we still have almost all of the records coming up as different and we can tell by looking that some of these differences are simply down to formatting.
For our second attempt, we’ll try to disregard all of the spaces :

SELECT supp.supplier_name 
FROM suppliers supp, contracts con 
WHERE supp.supp_id = con.supp_id 
AND REPLACE( UPPER( supp.supplier_addr),' ') != REPLACE( UPPER(con.contractor_addr), ' '); 

SUPPLIER_NAME
--------------------------------------------------
Evil Giant Corp Inc.
Not as big as Evil but Still pretty mean Co
Big Pink

That hasn’t really helped much. In fact, it hasn’t helped at all. Because some of the addresses have whitespace characters other than spaces ( in this case, tabs), stripping out the spaces to do the comparison hasn’t filtered them out.
We’re not really getting very far, and we still need to sort out those annoying # characters. Desparate times call for desparate measures…

The Regular Expression Solution

Where Regular Expressions really have an edge in this situation is in the fact that you can identify whole classes of characters in a compact manner :
[[:class:]]

This being Regular Expression syntax, you do have to use unusual non-alphanumeric character combinations. I think there must be a law somewhere that mandates this.

Anyway, we can now apply this to our existing problem :

SELECT supp.supplier_name 
FROM suppliers supp, contracts con 
WHERE supp.supp_id = con.supp_id 
AND REGEXP_REPLACE( UPPER( supp.supplier_addr), '[[:space:]]|[[:punct:]]') != 
  REGEXP_REPLACE( UPPER(con.contractor_addr), '[[:space:]]|[[:punct:]]'); 

SUPPLIER_NAME
--------------------------------------------------
Big Pink

A couple of points to note here.
First, the name of the class must have a “:” before and after. I mention this only because I have a bit of a mental block and keep missing the trailing “:” with, as they say, hilarious results.
The second is that the classes specified are separated by a “|” and that the whole expression – i.e. ‘[[:space:]]|[[:punct:]]’ is enclosed in one set of single quotes.

Hmmm, looks like Big Pink may have moved at some point.

Incidentally, this company employed a PR firm to create a corporate image that would invoke the reliability and stability associated with IBM, whilst making them stand out as supplier of Oracle related software.

After extensive research and an eye-watering fee, the PR firm came back with the reasoning that IBM is known as Big Blue. Oracle’s corporate logo is predominantly red. Oracle is known in some circles as Big Red. Pink is nearly the same as red. Hence “Big Pink”.

One by-product of the name, hitherto unappreciated by the PR firm, is the opportunity for humorous innuendo and mockery when the company’s products don’t quite come up to the mark.
You can do your own gags here.

Author: mikesmithers

Back in 1993, I discovered that I could get paid money for doing fun stuff with computers. Over the years, I've specialised in Oracle Databases as a developer, a DBA and sometimes, an architect. It's my evil alter-ego - The Antikyte - who writes a blog about my various technical adventures. Yes, that is his Death Star parked in the Disabled Bay. I currently live in the South-West of England with Deb, my long-suffering wife. View all posts by mikesmithers

5 thoughts on “Express Yourself – String Comparison in Oracle using REGEXP_REPLACE”

Pingback: Modulesoft Offers Cheap Orissa Hosting, Oracel10g Hosting & Free Hosting | Host Rage
metallicatony says:

March 31, 2010 at 2:23 pm

Nice one Mike! Your blog is really interesting!

Why are you including the [[:punct:]] class in the REGEXP_REPLACE function? Is that for # and , characters?

LikeLike

1. mikesmithers says:
  
  March 31, 2010 at 6:27 pm
  
  Spot on – I included the [[:punct:]] to strip out the # and , characters, as well as any other punctuation characters that might have been input.
  
  LikeLike
  
  1. metallicatony says:
    
    April 5, 2010 at 9:52 pm
    
    Cool! Thanx! I started liking your blog so much that I placed a pointer from my blog to yours! Please keep writing! I love every post of yours with all those coding snippets! 🙂
    
    LikeLike
Pingback: Perl – Regular Expression? | perlscriptsearch.com