A Performance Comparison between Oracle’s REGEXP_SUBSTR and SUBSTR functions

Oracle SQL offers support for POSIX regular expressions by means of a suite of native functions.

Darth Sidious, light-saber in hand, imparting ancient Sith wisdom regarding regular expressions.
Regular Expressions are a pathway to many abilities that some consider…unnatural.

Whilst providing the ability for complex search conditions that regular expressions offer, REGEXP_SUBSTR has acquired a reputation for being a fair bit slower when compared to the standard SUBSTR function.

What I’m going to demonstrate here is :

  • how SUBSTR seems generally to be faster than REGEXP_SUBSTR
  • how the performance of REGEXP_SUBSTR can improve dramatically when used with INSTR
  • REGEXP_SUBSTR performs better when it matches the start of a string

To start with though, well discover why you’ll never see a Sith Lord on Sesame Street ( hint : it’s the way they count in a Galaxy Far, Far Away)…

A Test Table

We have a two-column table that looks like this :

create table star_wars_numbers
(
    id number generated always as identity,
    episode varchar2(4000)
)
/

insert into star_wars_numbers( episode)
values('Four');

insert into star_wars_numbers( episode)
values('Four|Five');

insert into star_wars_numbers( episode)
values('Four|Five|Six');

insert into star_wars_numbers( episode)
values('Four|Five|Six|One');

insert into star_wars_numbers( episode)
values('Four|Five|Six|One|Two');

insert into star_wars_numbers( episode)
values('Four|Five|Six|One|Two|Three');

insert into star_wars_numbers( episode)
values('Four|Five|Six|One|Two|Three|Seven');

insert into star_wars_numbers( episode)
values('Four|Five|Six|One|Two|Three|Seven|Three-and-a-bit');

insert into star_wars_numbers( episode)
values('Four|Five|Six|One|Two|Three|Seven|Three-and-a-bit|Three and a bit less of a bit');

insert into star_wars_numbers( episode)
values('Four|Five|Six|One|Two|Three|Seven|Three-and-a-bit|Three and a bit less of a bit|Eight');

insert into star_wars_numbers( episode)
values('Four|Five|Six|One|Two|Three|Seven|Three-and-a-bit|Three and a bit less of a bit|Eight|Nine');

commit;

Whilst the contents of the EPISODES column may be the result of arcane Sith ritual, Moof-Milker activity in the design process cannot be ruled out.

Either way, the EPISODES column contains up to 11 values, with multiple columns being separated by a ‘|’.

Extracting each value in SQL

Using SUBSTR and INSTR to separate out the values in EPISODES is, as ever, quite a bit of typing…

select 
    id, 
    episode, 
    substr
    (
        episode, 
        1, 
        case 
            when instr(episode, '|') > 0 then instr(episode, '|') -1 
            else length(episode) 
        end
    )  as "A New Hope",
    case when instr(episode, '|', 1, 1) > 0 then
        substr
        (
            episode, 
            instr(episode, '|', 1, 1) + 1, -- start
            case 
                when instr(episode, '|', 1, 2) > 0 then instr(episode, '|', 1,2) -1 
                else length(episode) 
            end
            -
            instr(episode, '|', 1, 1)   
        ) 
    end as "The Empire Strikes Back",
    case when instr(episode, '|', 1, 2) > 0 then
        substr( episode, instr(episode, '|', 1, 2) + 1,
            case when instr(episode, '|', 1,3) > 0 then instr(episode, '|',1,3) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 2))
    end as "Return of the Jedi",    
    case when instr(episode, '|', 1, 3) > 0 then
        substr( episode, instr(episode, '|', 1, 3) + 1,
            case when instr(episode, '|', 1,4) > 0 then instr(episode, '|',1,4) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 3))
    end as "Phantom Menace",
    case when instr(episode, '|', 1, 4) > 0 then
        substr( episode, instr(episode, '|', 1, 4) + 1,
            case when instr(episode, '|', 1,5) > 0 then instr(episode, '|',1,5) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 4))
    end as "Attack of the Clones",
    case when instr(episode, '|', 1, 5) > 0 then
        substr( episode, instr(episode, '|', 1, 5) + 1,
            case when instr(episode, '|', 1,6) > 0 then instr(episode, '|',1,6) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 5))
    end as "Revenge of the Sith",
    case when instr(episode, '|', 1, 6) > 0 then
        substr( episode, instr(episode, '|', 1, 6) + 1,
            case when instr(episode, '|', 1,7) > 0 then instr(episode, '|',1,7) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 6))
    end as "The Force Awakens",
    case when instr(episode, '|', 1, 7) > 0 then
        substr( episode, instr(episode, '|', 1, 7) + 1,
            case when instr(episode, '|', 1,8) > 0 then instr(episode, '|',1,8) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 7))
    end as "Rogue One",        
    case when instr(episode, '|', 1, 8) > 0 then
        substr( episode, instr(episode, '|', 1, 8) + 1,
            case when instr(episode, '|', 1,9) > 0 then instr(episode, '|',1,9) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 8))
    end as "Solo",        
    case when instr(episode, '|', 1, 9) > 0 then
        substr( episode, instr(episode, '|', 1, 9) + 1,
            case when instr(episode, '|', 1, 10) > 0 then instr(episode, '|',1, 10) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 9))
    end as "The Last Jedi",        
    case when instr(episode, '|', 1, 10) > 0 then
        substr( episode, instr(episode, '|', 1, 10) + 1,
            case when instr(episode, '|', 1, 11) > 0 then instr(episode, '|',1, 11) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 10))
    end as "The Rise of Skywalker"                
from star_wars_numbers
/

…although these days ( version 12c and later), we can save ourselves a bit of effort by using an inline function…

with function extract_episode( i_string in varchar2, i_pos in pls_integer)
    return varchar2
is
begin
    if i_pos = 1 then
        return substr(i_string, 1, case when instr(i_string, '|') > 0 then instr(i_string, '|') - 1 else length(i_string) end);
    end if;    
    return
        case when instr(i_string, '|', 1, i_pos -1 ) > 0 then
            substr( i_string, instr(i_string, '|', 1, i_pos -1) + 1,
                case when instr(i_string, '|', 1, i_pos) > 0 then instr(i_string, '|',1, i_pos) - 1 else length(i_string) end 
                -
                instr( i_string, '|', 1, i_pos - 1))
        end;
end;
select 
id, episode, 
    extract_episode( episode, 1) as "A New Hope",
    extract_episode( episode, 2) as "Empire Strikes Back",
    extract_episode( episode, 3) as "Return of the Jedi",
    extract_episode( episode, 4) as "The Phantom Menace",
    extract_episode( episode, 5) as "Attack of the Clones",
    extract_episode( episode, 6) as "Revenge of the Sith",
    extract_episode( episode, 7) as "The Force Awakens",
    extract_episode( episode, 8) as "Rogue One",
    extract_episode( episode, 9) as "Solo",
    extract_episode( episode, 10) as "The Last Jedi",
    extract_episode( episode, 11) as "The Rise of Skywalker"
from star_wars_numbers
/
    
    

Whether you find the equivalent regexp query more elegant or just horrible is somewhat subjective :

select 
    regexp_substr(episode, '[^|]+') as "A New Hope",
    regexp_substr(episode, '[^|]+',1, 2) as "The Empire Strikes Back",
    regexp_substr(episode, '[^|]+',1, 3) as "Return of the Jedi",
    regexp_substr(episode, '[^|]+',1, 4) as "The Phantom Menace",
    regexp_substr(episode, '[^|]+',1, 5) as "Attack of the Clones",
    regexp_substr(episode, '[^|]+',1, 6) as "Revenge of the Sith",
    regexp_substr(episode, '[^|]+',1, 7) as "The Force Awakens",
    regexp_substr(episode, '[^|]+',1, 8) as "Rogue One",
    regexp_substr(episode, '[^|]+',1, 9) as "Solo",
    regexp_substr(episode, '[^|]+',1, 10) as "The Last Jedi",
    regexp_substr(episode, '[^|]+',1, 11) as "The Rise of Skywalker"
from star_wars_numbers
/

Irrespective of aesthetic considerations, I should explain the regexp in use here, if only so I can understand it when I read this a few months from now :

[^|] - match any character that isn't pipe ("|")
+ - match one or more times

Next, we need to find out how the regexp function stacks up when it comes to performance, and we’re not going to find that with a mere 11 rows of data…

Performance Test Setup

Let’s make some use of all that space in the Galaxy Far, Far Away…

create table galaxy (
    id number generated always as identity,
    episode varchar2(4000))
/

begin
    for i in 1..100000 loop
        insert into galaxy(episode)
        select episode from star_wars_numbers;
        
        commit;
    end loop;
end;
/

exec dbms_stats.gather_table_stats(user, 'galaxy');

All of the following tests were run on Oracle 19c Enterprise Edition running on Oracle Linux.

Everything was run in a SQL*Plus session from the command line on the database server.

The queries were run in SQL*Plus with the following settings :

set lines 130
clear screen
set autotrace trace
set timing on

Each query was executed at least twice consecutively to ensure that results were not skewed by the state of the buffer cache.

It’s also worth noting that, I found no low-level explanation for the performance discrepancies between the two functions when trawling through trace files. Therefore, I’ve concentrated on elapsed time as the main performance metric in these tests.

Test 1 – All Fields extracted in the Select Clause

Let’s start with the SUBSTR function ( referred to as “Standard” henceforth) :

select 
    id, 
    substr
    (
        episode,  -- input string
        1,  -- start position
        case 
            when instr(episode, '|') > 0 then instr(episode, '|') -1 
            else length(episode) 
        end -- number of characters to extract
    )  as "A New Hope",
    case when instr(episode, '|', 1, 1) > 0 then
        substr
        (
            episode, 
            instr(episode, '|', 1, 1) + 1, -- start
            case 
                when instr(episode, '|', 1, 2) > 0 then instr(episode, '|', 1,2) -1 
                else length(episode) 
            end
            -
            instr(episode, '|', 1, 1)   
        ) 
    end as "The Empire Strikes Back",
    case when instr(episode, '|', 1, 2) > 0 then
        substr( episode, instr(episode, '|', 1, 2) + 1,
            case when instr(episode, '|', 1,3) > 0 then instr(episode, '|',1,3) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 2))
    end as "Return of the Jedi",    
    case when instr(episode, '|', 1, 3) > 0 then
        substr( episode, instr(episode, '|', 1, 3) + 1,
            case when instr(episode, '|', 1,4) > 0 then instr(episode, '|',1,4) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 3))
    end as "Phantom Menace",
    case when instr(episode, '|', 1, 4) > 0 then
        substr( episode, instr(episode, '|', 1, 4) + 1,
            case when instr(episode, '|', 1,5) > 0 then instr(episode, '|',1,5) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 4))
    end as "Attack of the Clones",
    case when instr(episode, '|', 1, 5) > 0 then
        substr( episode, instr(episode, '|', 1, 5) + 1,
            case when instr(episode, '|', 1,6) > 0 then instr(episode, '|',1,6) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 5))
    end as "Revenge of the Sith",
    case when instr(episode, '|', 1, 6) > 0 then
        substr( episode, instr(episode, '|', 1, 6) + 1,
            case when instr(episode, '|', 1,7) > 0 then instr(episode, '|',1,7) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 6))
    end as "The Force Awakens",
    case when instr(episode, '|', 1, 7) > 0 then
        substr( episode, instr(episode, '|', 1, 7) + 1,
            case when instr(episode, '|', 1,8) > 0 then instr(episode, '|',1,8) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 7))
    end as "Rogue One",        
    case when instr(episode, '|', 1, 8) > 0 then
        substr( episode, instr(episode, '|', 1, 8) + 1,
            case when instr(episode, '|', 1,9) > 0 then instr(episode, '|',1,9) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 8))
    end as "Solo",        
    case when instr(episode, '|', 1, 9) > 0 then
        substr( episode, instr(episode, '|', 1, 9) + 1,
            case when instr(episode, '|', 1, 10) > 0 then instr(episode, '|',1, 10) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 9))
    end as "The Last Jedi",        
    case when instr(episode, '|', 1, 10) > 0 then
        substr( episode, instr(episode, '|', 1, 10) + 1,
            case when instr(episode, '|', 1, 11) > 0 then instr(episode, '|',1, 11) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 10))
    end as "The Rise of Skywalker"                
from galaxy
/

Runnnig this query, we get :

1100000 rows selected.

Elapsed: 00:00:20.32

Execution Plan
----------------------------------------------------------                                                                        
Plan hash value: 3574103611                                                                                                       
                                                                                                                                  
----------------------------------------------------------------------------                                                      
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |                                                      
----------------------------------------------------------------------------                                                      
|   0 | SELECT STATEMENT  |        |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
|   1 |  TABLE ACCESS FULL| GALAXY |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
----------------------------------------------------------------------------                                                      


Statistics
----------------------------------------------------------                                                                        
          0  recursive calls                                                                                                      
          0  db block gets                                                                                                        
      80459  consistent gets                                                                                                      
       7616  physical reads                                                                                                       
          0  redo size                                                                                                            
   35857676  bytes sent via SQL*Net to client                                                                                     
     811886  bytes received via SQL*Net from client                                                                               
      73335  SQL*Net roundtrips to/from client                                                                                    
          0  sorts (memory)                                                                                                       
          0  sorts (disk)                                                                                                         
    1100000  rows processed                                                                                                       

Before we look at the REGEXP_SUBSTR equivalent, it’s probably worth considering the more streamlined in-line function version of this query :

with function extract_episode( i_string in varchar2, i_pos in pls_integer)
    return varchar2
is
begin
    return
        case when instr(i_string, '|', 1, i_pos -1 ) > 0 then
            substr( i_string, instr(i_string, '|', 1, i_pos -1) + 1,
                case when instr(i_string, '|', 1, i_pos) > 0 then instr(i_string, '|',1, i_pos) - 1 else length(i_string) end 
                -
                instr( i_string, '|', 1, i_pos - 1))
        end;
end;
select 
    id, 
    substr
    (
        episode, 
        1, 
        case 
            when instr(episode, '|') > 0 then instr(episode, '|') -1 
            else length(episode) 
        end
    ) as "A New Hope",
    extract_episode( episode, 2) as "Empire Strikes Back",
    extract_episode( episode, 3) as "Return of the Jedi",
    extract_episode( episode, 4) as "The Phantom Menace",
    extract_episode( episode, 5) as "Attack of the Clones",
    extract_episode( episode, 6) as "Revenge of the Sith",
    extract_episode( episode, 7) as "The Force Awakens",
    extract_episode( episode, 8) as "Rogue One",
    extract_episode( episode, 9) as "Solo",
    extract_episode( episode, 10) as "The Last Jedi",
    extract_episode( episode, 11) as "The Rise of Skywalker"
from galaxy
/

Whilst it’s a bit more readable, it’s also a lot slower :

1100000 rows selected.

Elapsed: 00:00:41.76

Execution Plan
----------------------------------------------------------                                                                        
Plan hash value: 3574103611                                                                                                       
                                                                                                                                  
----------------------------------------------------------------------------                                                      
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |                                                      
----------------------------------------------------------------------------                                                      
|   0 | SELECT STATEMENT  |        |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
|   1 |  TABLE ACCESS FULL| GALAXY |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
----------------------------------------------------------------------------                                                      


Statistics
----------------------------------------------------------                                                                        
          0  recursive calls                                                                                                      
          0  db block gets                                                                                                        
      80459  consistent gets                                                                                                      
       7616  physical reads                                                                                                       
          0  redo size                                                                                                            
   35857676  bytes sent via SQL*Net to client                                                                                     
     810042  bytes received via SQL*Net from client                                                                               
      73335  SQL*Net roundtrips to/from client                                                                                    
          0  sorts (memory)                                                                                                       
          0  sorts (disk)                                                                                                         
    1100000  rows processed                                                                                                       

REGEXP_SUBSTR however, takes slow to a new level…

select id, 
    regexp_substr(episode, '[^|]+') as "A New Hope",
    regexp_substr(episode, '[^|]+',1, 2) as "The Empire Strikes Back",
    regexp_substr(episode, '[^|]+',1, 3) as "Return of the Jedi",
    regexp_substr(episode, '[^|]+',1, 4) as "The Phantom Menace",
    regexp_substr(episode, '[^|]+',1, 5) as "Attack of the Clones",
    regexp_substr(episode, '[^|]+',1, 6) as "Revenge of the Sith",
    regexp_substr(episode, '[^|]+',1, 7) as "The Force Awakens",
    regexp_substr(episode, '[^|]+',1, 8) as "Rogue One",
    regexp_substr(episode, '[^|]+',1, 9) as "Solo",
    regexp_substr(episode, '[^|]+',1, 10) as "The Last Jedi",
    regexp_substr(episode, '[^|]+',1, 11) as "The Rise of Skywalker"
from galaxy
/
1100000 rows selected.

Elapsed: 00:01:27.25

Execution Plan
----------------------------------------------------------                                                                        
Plan hash value: 3574103611                                                                                                       
                                                                                                                                  
----------------------------------------------------------------------------                                                      
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |                                                      
----------------------------------------------------------------------------                                                      
|   0 | SELECT STATEMENT  |        |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
|   1 |  TABLE ACCESS FULL| GALAXY |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
----------------------------------------------------------------------------                                                      


Statistics
----------------------------------------------------------                                                                        
          0  recursive calls                                                                                                      
          0  db block gets                                                                                                        
      80459  consistent gets                                                                                                      
       7616  physical reads                                                                                                       
          0  redo size                                                                                                            
   35857680  bytes sent via SQL*Net to client                                                                                     
     809519  bytes received via SQL*Net from client                                                                               
      73335  SQL*Net roundtrips to/from client                                                                                    
          0  sorts (memory)                                                                                                       
          0  sorts (disk)                                                                                                         
    1100000  rows processed                                                                                                       

Now see what happens when we give REGEXP_SUBSTR a little help from the humble INSTR (the “hybrid” query) :

select id, 
    regexp_substr(episode, '[^|]+') as "A New Hope",
    decode( instr(episode, '|'), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,1) + 1)) as "The Empire Strikes Back",
    decode( instr(episode, '|',1, 2), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,2) + 1)) as "Return of the Jedi",
    decode( instr(episode, '|',1, 3), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,3) + 1)) as "The Phantom Menace",
    decode( instr(episode, '|',1, 4), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,4) + 1)) as "Attack of the Clones",
    decode( instr(episode, '|',1, 5), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,5) + 1)) as "Revenge of the Sith",
    decode( instr(episode, '|',1, 6), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,6) + 1)) as "The Force Awakens",
    decode( instr(episode, '|',1, 7), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,7) + 1)) as "Rogue One",
    decode( instr(episode, '|',1, 8), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,8) + 1)) as "Solo",
    decode( instr(episode, '|',1, 9), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,9) + 1)) as "The Last Jedi",
    decode( instr(episode, '|',1, 10), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,10) + 1)) as "The Rise of Skywalker"
from galaxy
/

Yes, I have cheated a bit on the aesthetics here by using the more compact DECODE instead of CASE.
However, this does not affect the runtime of the query, which is a bit faster than the pure REGEXP_SUBSTR equivalent :

1100000 rows selected.

Elapsed: 00:00:30.83

Execution Plan
----------------------------------------------------------                                                                        
Plan hash value: 3574103611                                                                                                       
                                                                                                                                  
----------------------------------------------------------------------------                                                      
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |                                                      
----------------------------------------------------------------------------                                                      
|   0 | SELECT STATEMENT  |        |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
|   1 |  TABLE ACCESS FULL| GALAXY |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
----------------------------------------------------------------------------                                                      


Statistics
----------------------------------------------------------                                                                        
          0  recursive calls                                                                                                      
          0  db block gets                                                                                                        
      80459  consistent gets                                                                                                      
       7616  physical reads                                                                                                       
          0  redo size                                                                                                            
   35857680  bytes sent via SQL*Net to client                                                                                     
     810158  bytes received via SQL*Net from client                                                                               
      73335  SQL*Net roundtrips to/from client                                                                                    
          0  sorts (memory)                                                                                                       
          0  sorts (disk)                                                                                                         
    1100000  rows processed                                                                                                       


Test 1 Results Summary

The performance of the Hybrid approach does raise the question of how the REGEXP_SUBSTR compares when we’re just looking to extract a single field from a string, rather than all of them…

Test 2 – Extract the first field only

In this instance we’re just looking for the first field in EPISODES.

In this context, the hybrid approach doesn’t apply because we’re always starting our search at the start of the input string.

Starting, once again with the Standard approach :

select id, 
    substr
    (
        episode, 1, 
        case 
            when instr(episode, '|', 1,1) > 0 then instr(episode, '|', 1,1) -1 
            else length(episode) 
        end 
    ) as "A New Hope"
from galaxy
/
1100000 rows selected.

Elapsed: 00:00:05.33

Execution Plan
----------------------------------------------------------                                                                        
Plan hash value: 3574103611                                                                                                       
                                                                                                                                  
----------------------------------------------------------------------------                                                      
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |                                                      
----------------------------------------------------------------------------                                                      
|   0 | SELECT STATEMENT  |        |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
|   1 |  TABLE ACCESS FULL| GALAXY |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
----------------------------------------------------------------------------                                                      


Statistics
----------------------------------------------------------                                                                        
          0  recursive calls                                                                                                      
          0  db block gets                                                                                                        
      80459  consistent gets                                                                                                      
       7616  physical reads                                                                                                       
          0  redo size                                                                                                            
   25086067  bytes sent via SQL*Net to client                                                                                     
     808790  bytes received via SQL*Net from client                                                                               
      73335  SQL*Net roundtrips to/from client                                                                                    
          0  sorts (memory)                                                                                                       
          0  sorts (disk)                                                                                                         
    1100000  rows processed                                                                                                       

Now for REGEXP_SUBSTR :

select id,
    regexp_substr( episode, '[^|]+') as "A New Hope"
from galaxy
/

1100000 rows selected.

Elapsed: 00:00:06.38

Execution Plan
----------------------------------------------------------                                                                        
Plan hash value: 3574103611                                                                                                       
                                                                                                                                  
----------------------------------------------------------------------------                                                      
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |                                                      
----------------------------------------------------------------------------                                                      
|   0 | SELECT STATEMENT  |        |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
|   1 |  TABLE ACCESS FULL| GALAXY |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
----------------------------------------------------------------------------                                                      


Statistics
----------------------------------------------------------                                                                        
          0  recursive calls                                                                                                      
          0  db block gets                                                                                                        
      80459  consistent gets                                                                                                      
       7616  physical reads                                                                                                       
          0  redo size                                                                                                            
   25086067  bytes sent via SQL*Net to client                                                                                     
     808868  bytes received via SQL*Net from client                                                                               
      73335  SQL*Net roundtrips to/from client                                                                                    
          0  sorts (memory)                                                                                                       
          0  sorts (disk)                                                                                                         
    1100000  rows processed                                                                                                       

Interestingly, whilst it’s still not as fast, the Regexp runtime is only 20% longer than the Standard.

This contrasts markedly with the 430% longer that the Regexp took for our first test.

Test 2 Results Summary

Does this relative performance hold true for any single field in the input string ?

Test 3 – Extract the last field only

Starting with the Standard query :

select 
    id, 
    case when instr(episode, '|', 1, 10) > 0 then
        substr( episode, instr(episode, '|', 1, 10) + 1,
            case when instr(episode, '|', 1, 11) > 0 then instr(episode, '|',1, 11) - 1 else length(episode) end 
            -
            instr(episode, '|', 1, 10))
    end as "The Rise of Skywalker"                
from galaxy
/
1100000 rows selected.

Elapsed: 00:00:05.44

Execution Plan
----------------------------------------------------------                                                                        
Plan hash value: 3574103611                                                                                                       
                                                                                                                                  
----------------------------------------------------------------------------                                                      
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |                                                      
----------------------------------------------------------------------------                                                      
|   0 | SELECT STATEMENT  |        |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
|   1 |  TABLE ACCESS FULL| GALAXY |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
----------------------------------------------------------------------------                                                      


Statistics
----------------------------------------------------------                                                                        
          0  recursive calls                                                                                                      
          0  db block gets                                                                                                        
      80459  consistent gets                                                                                                      
       7616  physical reads                                                                                                       
          0  redo size                                                                                                            
   25686033  bytes sent via SQL*Net to client                                                                                     
     808915  bytes received via SQL*Net from client                                                                               
      73335  SQL*Net roundtrips to/from client                                                                                    
          0  sorts (memory)                                                                                                       
          0  sorts (disk)                                                                                                         
    1100000  rows processed                                                                                                       

Once again, the Regexp Query is much slower…

select 
    id,
    regexp_substr(episode, '[^|]+',1, 11) as "The Rise of Skywalker"
from galaxy 
/

1100000 rows selected.

Elapsed: 00:00:16.16

Execution Plan
----------------------------------------------------------                                                                        
Plan hash value: 3574103611                                                                                                       
                                                                                                                                  
----------------------------------------------------------------------------                                                      
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |                                                      
----------------------------------------------------------------------------                                                      
|   0 | SELECT STATEMENT  |        |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
|   1 |  TABLE ACCESS FULL| GALAXY |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
----------------------------------------------------------------------------                                                      


Statistics
----------------------------------------------------------                                                                        
          0  recursive calls                                                                                                      
          0  db block gets                                                                                                        
      80459  consistent gets                                                                                                      
       7616  physical reads                                                                                                       
          0  redo size                                                                                                            
   25686033  bytes sent via SQL*Net to client                                                                                     
     808888  bytes received via SQL*Net from client                                                                               
      73335  SQL*Net roundtrips to/from client                                                                                    
          0  sorts (memory)                                                                                                       
          0  sorts (disk)                                                                                                         
    1100000  rows processed                                                                                                       

…unless we use the hybrid approach…

select 
    id,
    decode( instr(episode, '|',1, 10), 0, null, regexp_substr(episode, '[^|]+', instr(episode, '|',1,10) + 1)) as "The Rise of Skywalker"
from galaxy 
/
1100000 rows selected.

Elapsed: 00:00:05.60

Execution Plan
----------------------------------------------------------                                                                        
Plan hash value: 3574103611                                                                                                       
                                                                                                                                  
----------------------------------------------------------------------------                                                      
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |                                                      
----------------------------------------------------------------------------                                                      
|   0 | SELECT STATEMENT  |        |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
|   1 |  TABLE ACCESS FULL| GALAXY |  1100K|    47M|  2089   (1)| 00:00:01 |                                                      
----------------------------------------------------------------------------                                                      


Statistics
----------------------------------------------------------                                                                        
          0  recursive calls                                                                                                      
          0  db block gets                                                                                                        
      80459  consistent gets                                                                                                      
       7616  physical reads                                                                                                       
          0  redo size                                                                                                            
   25686033  bytes sent via SQL*Net to client                                                                                     
     808736  bytes received via SQL*Net from client                                                                               
      73335  SQL*Net roundtrips to/from client                                                                                    
          0  sorts (memory)                                                                                                       
          0  sorts (disk)                                                                                                         
    1100000  rows processed                                                                                                       

Test 3 Results Summary

Conclusions

These tests seem to bear out the fact that, in general, SUBSTR offers better performance than REGEXP_SUBSTR.
However, the performance of REGEXP_SUBSTR can be greatly improved if used in conjunction with INSTR.
As always, the decision as to which function to use will depend on the specific circumstances you find yourself in.

Author: mikesmithers

Back in 1993, I discovered that I could get paid money for doing fun stuff with computers. Over the years, I've specialised in Oracle Databases as a developer, a DBA and sometimes, an architect. It's my evil alter-ego - The Antikyte - who writes a blog about my various technical adventures. Yes, that is his Death Star parked in the Disabled Bay. I currently live in the South-West of England with Deb, my long-suffering wife.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.