Searching strings in SQL with LOCATE_IN_STRING

I’ve been somewhat remiss in maintaining this blog over the past few months, but the LOCATE_IN_STRING scalar function really is worth mentioning.

The LOCATE_IN_STRING function returns the starting position of a string (called the search-string) within another string (called the source-string). If the search-string is not found and neither argument is null, the result is zero. If the search-string is found, the result is a number from 1 to the actual length of the source-string.

The format of the function is LOCATE_IN_STRING(source_string, search_string, start, instance)

If the optional start is specified, it indicates the character position in the source-string at which the search is to begin. If start is specified, an optional instance number can also be specified. The instance argument is used to determine the specific occurrence of search-string within source-string. Each unique instance can include any of the characters in a previous instance, but not all characters in a previous instance.

What this means is that not only can you find strings containing a search text, but you can also determine exactly where in the source string that text is.

INSTR can be used as a synonym for LOCATE_IN_STRING. I have used this shorter form in the following example.

select 
    INSTR('TEST TEXT IS A TEST', 'TEST') a, 
    INSTR('TEST TEXT IS A TEST', 'TEST', 1, 2) b,
    INSTR('TEST TEXT IS A TEST', 'TEST', -1) c,
    INSTR('TEST TEXT IS A TEST', 'TEST', -1, 2) d,
    INSTR('TEST TEXT IS A TEST', 'TEST', 5) e, 
    INSTR('TEST TEXT IS A TEST', 'TEST', -5) f 
from SYSIBM.SYSDUMMY1                       

Returns:

A               B               C               D               E               F 
1              16              16               1              16               1 

Column A looks for the first instance of ‘TEST’ in the source string and finds it in position 1. Column B is looking for the second instance of the same string and finds it in position 16. All pretty straightforward so far.

With column C, things get a bit more interesting. Starting at position -1 tells the function to search from right to left, starting at the rightmost position in the source string. And in column D, I’m looking for the second instance of the search string, starting from the right.

And finally, columns E and F both start the search five positions in: column E searches forwards from position 5 (finding the match at position 16), while column F starts five positions from the end and searches backwards (finding the match at position 1).
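Knowing the position becomes genuinely useful when you feed it into another function. As a quick sketch (the ORDERS table and its NOTES column here are purely hypothetical), this would extract everything after the first colon in a free-text field:

select substr(NOTES, instr(NOTES, ':') + 1)
from ORDERS
where instr(NOTES, ':') > 0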

LOCATE_IN_STRING was introduced in 7.1 and provides an effective method of searching within strings without having to resort to regular expressions.

LTRIM, RTRIM and Trim Expressions

I have to admit that, when trimming strings with SQL, I tend to just use the TRIM function to strip off all leading and trailing spaces. But the LTRIM and RTRIM functions really do deserve more attention because they have recently been enhanced to support trim expressions. That is, instead of just blanks, you can specify the full set of characters you want trimmed.

Here’s an example. Imagine you have a text representation of a value and, for some reason, some of these values have Pound, Dollar or Euro signs at either the start or the end of the string. Here’s our table:

CURRENCYVALUE
 $400
 £250
 300€
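If you want to follow along, a minimal version of this table can be knocked together like this (I'm assuming a simple character column here):

create table QTEMP.TEST
    (CURRENCYVALUE char(10));

insert into QTEMP.TEST
values('$400'), ('£250'), ('300€');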

So to trim the left hand currency symbols, all we need to do is this:

select ltrim(CURRENCYVALUE, '$£')
from QTEMP.TEST

And to remove that right hand Euro symbol (along with any trailing spaces), we can do this:

select rtrim(CURRENCYVALUE, '€ ')
from QTEMP.TEST

Of course, this only becomes really useful when we put it all together so that this:

select rtrim(ltrim(CURRENCYVALUE, '$£'), '€ ') as NUMBERS
from QTEMP.TEST

… returns this:

NUMBERS  
400      
250      
300      

Admittedly this example is a bit of an artificial one, but being able to easily strip multiple characters from the beginning and end of a string does simplify the process of cleaning up data received from external sources.

Handling divide by zero in SQL

Division by zero is something that, sooner or later, we all have to handle. For the rest of this post, I am using a small temporary table that I have imaginatively named DIV0. It has two columns, COLUMN1 and COLUMN2, and it looks like this:

COLUMN1   COLUMN2
      5         0
      6         2

Until recently, I have tended to use a CASE expression to check the column values before attempting to divide. It looks something like this:

select
    COLUMN1, COLUMN2,  
    case when COLUMN2 = 0 then null else COLUMN1/COLUMN2 end as DIVIDED
from div0

… and the output will look something like this:

COLUMN1   COLUMN2                             DIVIDED 
      5         0                                    -
      6         2        3.00000000000000000000000000 

This is fine as far as it goes, but when COLUMN2 is replaced with a calculation, it can all get very messy very quickly.

It turns out that the NULLIF function provides a better approach.

The NULLIF function returns a null value if the arguments compare equal, otherwise it returns the value of the first argument.

What this means is that I can simplify my earlier query to look like this:

select                                     
    COLUMN1, COLUMN2,                      
    COLUMN1 / NULLIF(COLUMN2, 0) as DIVIDED
from div0                                  

And the output remains the same.
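And if you would rather see a zero than a null, the NULLIF approach combines neatly with COALESCE. As a sketch of the same query:

select
    COLUMN1, COLUMN2,
    coalesce(COLUMN1 / nullif(COLUMN2, 0), 0) as DIVIDED
from div0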

Combining Counts for multiple tables

It’s surprising how many times you need to know whether a record exists in any one of two or more tables. Traditionally, you would have to check each table individually until you either find a match or have checked every table to confirm that no match exists.

Finding myself faced with just such a question, I started wondering whether I could check all of the tables in question in one go. And I can.

So I have an environment where records are written first to a transactions file and then, once the transactions are processed, to an archive. I need to know whether a transaction has been written to either table and, if it hasn't, I can do an insert.

My first check looked like this:

select
(select count(*) from TRANSACTIONS where TRANSID = 677447208) as count1,
(select count(*) from ARCHIVE where TRANSID = 677447208) as count2
from SYSIBM.SYSDUMMY1

This gives me a single row with a separate column for each count, which is a reasonable enough start and may prove to be useful at some point. What I really want right now, though, is a simple count of all records across all tables.

For this I have to adjust the query very slightly indeed:

select
(select count(*) from TRANSACTIONS where TRANSID = 677447208) +
(select count(*) from ARCHIVE where TRANSID = 677447208)
from SYSIBM.SYSDUMMY1

… to get a single count of all occurrences of the transaction across as many tables as I need to check.
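And that combined count can drive the insert directly. As a rough sketch (the AMOUNT column is invented for the example), the insert only happens when the transaction exists in neither table:

insert into TRANSACTIONS (TRANSID, AMOUNT)
select 677447208, 100.00
from SYSIBM.SYSDUMMY1
where (select count(*) from TRANSACTIONS where TRANSID = 677447208)
    + (select count(*) from ARCHIVE where TRANSID = 677447208) = 0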

Using the SQL Translate Function

The SQL TRANSLATE function is worth knowing about if you ever find yourself needing to (for example) fix data from an external feed before you attempt to do anything with it.

The function returns a value in which one or more characters in the expression may have been converted into other characters. You can pass it up to four parameters: the expression (obviously), the to-string, the from-string and a pad character.

You can use the function with only the expression, in which case the characters are simply converted to uppercase, which is a bit pointless. Things become useful when you specify the to-string and from-string:

If more than one argument is specified, the result string is built character by character from expression, converting characters in from-string to the corresponding character in to-string. For each character in expression, the same character is searched for in from-string. If the character is found to be the nth character in from-string, the resulting string will contain the nth character from to-string. If to-string is less than n characters long, the resulting string will contain the pad character. If the character is not found in from-string, it is moved to the result string unconverted.

Here’s an example.

I’ve knocked together a sample table with three columns — an index, a code and a description — to represent a data feed. The description can contain accented characters, but we don’t want to see these in the target system.

The unadjusted table looks like this:

IDX   CODE    DESCRIPTION  
  1   FAC01   Façile       
  2   UML01   Umlaüt       

To get rid of those accents in the description, this query…

select                                                          
 IDX, CODE, translate(DESCRIPTION, 'eeaaocuouee', 'éèâàôçûöüëê')
from FEED                                                       

… returns this result:

IDX   CODE    TRANSLATE 
  1   FAC01   Facile    
  2   UML01   Umlaut    

If the from-string is longer than the to-string, the pad character is used to pad the to-string so that the lengths match. So, continuing the example above, this query:

select translate(CODE, 'xx', '01') from FEED

… is equivalent to this one…

select translate(CODE, 'x', '01', 'x') from FEED

I’m not entirely convinced of the value of this argument. It is probably both safer and clearer to simply ensure your to-string and from-string are the same length.

Interpreting the DSPPGMREF file usage

I mentioned the DSPPGMREF command a couple of weeks ago. This command allows you to generate a list of system objects used by specified programs and is particularly handy when you want to see which programs are doing what to a specific file. The only quirk is that the file usage field is a two-digit field that needs to be interpreted.

This field is a decimal representation of a three-bit binary value, in which the rightmost bit (1) indicates the file is used for input, the middle bit (2) indicates the file is used for output and the leftmost bit (4) indicates the file is used for update. Adding these together gives you a single decimal digit that fully describes how the file is used. For example, a usage of 3 (1 + 2) means input and output, while 5 (1 + 4) means input and update.

Interpreting this field is simple enough, but I do have to keep looking up what the values mean. Since I am directing the output to a file and using SQL to query this file, I am also letting the SQL bitwise functions interpret the file usage for me.
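For reference, the outfile being queried below was generated with something along these lines (the library and file names are the ones used in my queries):

DSPPGMREF PGM(LSCLIB/*ALL) OUTPUT(*OUTFILE) OUTFILE(LSCLIB/CROSSREF)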

At its simplest, this query:

select WHLIB, WHPNAM, WHTEXT, WHFNAM, WHLNAM, WHSNAM, WHFUSG,
       bitand(WHFUSG, 1) as Input,                           
       bitand(WHFUSG, 2) as Output,                          
       bitand(WHFUSG, 4) as Update                           
FROM LSCLIB/CROSSREF WHERE WHSNAM like '%LSCTST%'              

… gives me a result that looks like this:

Library     Program     Object       Library      Source       File Usage  INPUT  OUTPUT  UPDATE
                        Referenced                File Name                                     
LSCLIB      LSCT100     LSCTSTP      LSCLIB       LSCTSTP           3        1       2       0  
LSCLIB      LSCT100N    LSCTSTP      LSCLIB       LSCTSTP           3        1       2       0  
LSCLIB      LSCT110N    LSCTSTL1     LSCLIB       LSCTSTL1          1        1       0       0  
LSCLIB      LSCT120     LSCTSTL1     LSCLIB       LSCTSTL1          5        1       0       4  
LSCLIB      LSCT120N    LSCTSTL1     LSCLIB       LSCTSTL1          5        1       0       4  

If the file is used for input, the INPUT column contains a 1. If the file is used for output, the OUTPUT column contains a 2. And, if the file is updated, the UPDATE column contains a 4.

This is immediately more readable than just looking at the raw File Usage field, but it can be simplified further by using some CASE expressions:

select WHLIB, WHPNAM, WHFNAM, WHLNAM, WHSNAM, WHFUSG,              
  case bitand(WHFUSG, 1) when 1 then 'Yes' else 'No' end as Input, 
  case bitand(WHFUSG, 2) when 2 then 'Yes' else 'No' end as Output,
  case bitand(WHFUSG, 4) when 4 then 'Yes' else 'No' end as Update 
FROM LSCLIB/CROSSREF WHERE WHSNAM like '%LSCTST%'                    

… gives me a result that looks like this:

Library     Program     Object       Library      Source       File Usage  INPUT  OUTPUT  UPDATE
                        Referenced                File Name                                     
LSCLIB      LSCT100     LSCTSTP      LSCLIB       LSCTSTP           3       Yes    Yes     No   
LSCLIB      LSCT100N    LSCTSTP      LSCLIB       LSCTSTP           3       Yes    Yes     No   
LSCLIB      LSCT110N    LSCTSTL1     LSCLIB       LSCTSTL1          1       Yes    No      No   
LSCLIB      LSCT120     LSCTSTL1     LSCLIB       LSCTSTL1          5       Yes    No      Yes  
LSCLIB      LSCT120N    LSCTSTL1     LSCLIB       LSCTSTL1          5       Yes    No      Yes  

Read and Update with SQL

One of the advantages of SQL over native IO is that it allows you to handle sets of data rather than having to read and update each row individually. However, there can be cases where you want to handle your data line by line. To do this with SQL, you need to use a cursor.

I have a couple of examples below, both of which are written in fully free RPGLE. Going forward, all examples will be fully free because, as developers, we should all be putting the past behind us.

I am basing this on the previously mentioned Employees file, to which I have added an extra field named EMPVALID. This is numeric — a zero indicates an invalid employee number and a one indicates a valid employee number. Any number of validations or codes could be added, of course, but I want to keep things simple.
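If you want to add a similar flag to your own test file, a sketch along these lines would do it (I'm assuming a one-digit numeric column that defaults to invalid):

alter table EMPLOYEES
    add column EMPVALID numeric(1, 0) not null default 0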

So first of all, here’s a simple program that will read and display each invalid employee.

 
      **Free
        
        // Employee records data structure
        dcl-ds EMPDS extname('EMPLOYEES')
        end-ds;

        // Declare the cursor first: the declaration must physically
        // precede any other statement that references the cursor
        exec SQL declare C1 cursor for
            select EMPNUMBER, EMPNAME, EMPVALID
            from EMPLOYEES
            where EMPVALID = 0
            for fetch only;

        // Ensure the cursor is not already open before opening it
        // This isn't really necessary here, but it is good practice
        exec SQL close C1;
        exec SQL open C1;

        // Read the file
        dou SQLCOD < 0 or SQLCOD = 100;
            exec SQL fetch C1 into :EMPDS;
            if SQLCOD >= 0 and SQLCOD <> 100;
                dsply (EMPNAME);
            endif;
        enddo;

        // Close the cursor and exit
        exec SQL close C1;
        *inlr = *on;

This is fine as far as it goes, but what we really want to do is check whether the employee numbers are valid and update EMPVALID accordingly. For this, I will use the %check BIF.


      **Free

        // Employee records data structure
        dcl-ds EMPDS extname('EMPLOYEES')
        end-ds;

        // A display field to get around the limitations of the DSPLY OpCode
        dcl-s DISPLAY CHAR(52);

        // And we need a constant, digits, containing the valid characters
        dcl-c DIGITS '0123456789';

        // Declare the cursor first, as before
        // Note that the last clause is now a 'for update of...'
        exec SQL declare C1 cursor for
            select EMPNUMBER, EMPNAME, EMPVALID
            from EMPLOYEES
            where EMPVALID = 0
            for update of EMPVALID;

        // Ensure the cursor is not already open before opening it
        // This isn't really necessary here, but it is good practice
        exec SQL close C1;
        exec SQL open C1;

        // Read the file
        dou SQLCOD < 0 or SQLCOD = 100;
            exec SQL fetch C1 into :EMPDS;
            if SQLCOD >= 0 and SQLCOD <> 100;
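                // %check returns the position of the first character in
                // EMPNUMBER that is not in DIGITS, or zero if all ten
                // characters are digits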
                if %check(DIGITS: EMPNUMBER) > 0;
                    DISPLAY = %trim(EMPNAME) + ' is valid';
                    dsply (DISPLAY);

                    // And this is where we set EMPVALID = 1
                    // for the current record
                    exec SQL
                        update EMPLOYEES
                        set EMPVALID = 1
                        where current of C1;
                endif;
            endif;
        enddo;

        // Close the cursor and exit
        exec SQL close C1;
        *inlr = *on;

And this is the output:

DSPLY  Bugs Bunny is valid
DSPLY  Elmer Fudd is valid
DSPLY  Speedy Gonzales is valid
DSPLY  Sylvester is valid

And looking at the file, I can see everything is updated, as below:

EMPNUMBER   EMPNAME                                                  EMPVALID
00001       Bugs Bunny                                                      1
OOOO2       Elmer Fudd                                                      1
0000000003  Daffy Duck                                                      0
-4          Speedy Gonzales                                                 1
00005       Sylvester                                                       1

I told you that Elmer Fudd was tricksy.

As a rule, updating sets of data is much more efficient and should certainly be preferred for any new development. However, there are plenty of programs out there for which the business rules are tucked away in some validation subroutine which has been written to process one line at a time. In these cases, being able to process line by line with an SQL cursor gives you a handy first step towards modernising the application.

I have also found cursors to be useful when building subfiles, but that’s a whole other post.

Using regular expressions in DB2 for i

Regular expressions have been supported in DB2 for a while now and, while they are not something I often need, there are times when they come in very handy indeed. A simple example involves cleaning up some data so that it can be exported from an old (poorly maintained) database to a new, shiny one that (I’m promised) absolutely will be used.

The database in question is an employee master and, in order to protect the innocent, I have both simplified it and changed the names for the purpose of this example. There are only two fields to worry about, empnumber and empname, which represent the employee number and name respectively. There are a couple of issues to address. The first one is that the empnumber is ten characters in the old database but five numeric in the new, so I need to ensure that all the numbers fit.

The second issue is that this database has been very poorly maintained. If you allow someone to enter a character in a numeric field, they will. And they did.

Here’s an idea of how the database looked:

EMPNUMBER   EMPNAME
00001       Bugs Bunny
OOOO2       Elmer Fudd
0000000003  Daffy Duck
-4          Speedy Gonzales
00005       Sylvester

As you can see, there are lots of ways in which the employee number can be invalid, and these have all been very well explored. And watch out for Elmer Fudd — he’s a bit confused about the difference between letters and numbers.

Fortunately REGEXP_INSTR is your friend:

The REGEXP_INSTR function returns the starting position or the position after the end of the matched substring, depending on the value of the return_option argument.

So all I need is a regular expression that identifies which employee numbers are five digits followed by spaces, and which aren’t. And here it is:

select * from employees
where regexp_instr(empnumber, '^\d{5}\s+$') != 1

There are several optional parameters but I only need to use the source string (empnumber) and the pattern expression. The expression I am using looks for digits in the first five characters followed by spaces to the end of the field. And this is the result:

EMPNUMBER   EMPNAME
OOOO2       Elmer Fudd
0000000003  Daffy Duck
-4          Speedy Gonzales

I told you that Elmer Fudd was tricksy.

It should also be possible to use REGEXP_REPLACE to automate some of the cleanup, but you really do need your business users to buy in before you go down that particular rabbit hole.
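For what it's worth, a first pass at that cleanup might look something like this sketch, which simply strips out any non-digit characters:

select empnumber,
       regexp_replace(empnumber, '[^0-9]', '') as cleaned
from employees
where regexp_instr(empnumber, '^\d{5}\s+$') != 1

Note that this would quietly turn Elmer's 'OOOO2' into '2', which is exactly why you want that business buy-in before automating anything.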

Merging data with SQL MERGE

There have been times, many times, when I have needed to update one or more columns in a table with values retrieved from another. A traditional approach is to use an UPDATE with subselects, but there is a better way:

The MERGE statement updates a target (a table or view) using data from a source (result of a table reference). Rows in the target that match the input data may be updated or deleted as specified, and rows that do not exist in the target may be inserted as specified. Updating, deleting, or inserting a row in a view updates, deletes, or inserts the row into the tables on which the view is based if no INSTEAD OF trigger is defined on the view.

In order to play around with this, I built and populated a couple of test tables (imaginatively named TEST01 and TEST02) using the scripts below:

create table test01
 (key_field char(1) not null,
  data01 char(20),
  data02 char(30),
  data03 char(5),
  primary key (key_field));

insert into test01
 (Key_field, data01)
values('A', 'TEXT One');

insert into test01
 (Key_field, data01)
values('B', 'TEXT Two');

insert into test01
 (Key_field, data01)
values('D', 'TEXT Four');

create table test02
 (key_field char(1) not null,
  data01 char(20),
  data02 char(30),
  data03 char(5),
  primary key (key_field));

insert into test02
 (Key_field, data02)
values('A', 'More text one');

insert into test02
 (Key_field, data02, data03)
values('C', 'more text two', 'CODE');

insert into test02
 (Key_field, data01, data02)
values('E', 'TEXT five', 'More text five');

This gives me two tables with one matched row (key_field=’A’). To update the first table with the values in the second is really rather simple:

MERGE INTO test01 a
  USING (SELECT key_field, data01, data02, data03 from test02) b
  ON b.key_field = a.key_field
WHEN MATCHED THEN
   UPDATE SET data02 = b.data02, data03 = b.data03;

And the one matched row is updated in table TEST01 with the data from table TEST02.

If this was all I could do, I would be happy to have replaced a rather clunky approach with a more readable solution. But now the MERGE statement really starts to make a difference:

MERGE INTO test01 a
  USING (SELECT key_field, data01, data02, data03 from test02) b
  ON b.key_field = a.key_field
WHEN MATCHED THEN
   UPDATE SET data02 = b.data02, data03 = b.data03
WHEN NOT MATCHED THEN
   INSERT (key_field, data01, data02, data03)
   VALUES (b.key_field, b.data01, b.data02, b.data03)
NOT ATOMIC
  CONTINUE ON SQLEXCEPTION;

The WHEN NOT MATCHED clause means that not only can I update the matched rows, I can also do something about the rows that exist in table TEST02 but have no matching record in table TEST01. In this example, I have inserted the rows from table TEST02; other options allow me to either delete the row from table TEST01 or raise a SIGNAL statement.

And that NOT ATOMIC CONTINUE ON SQLEXCEPTION clause specifies that if an error is encountered on the insert, the MERGE will continue. Bad data happens, and this saves me the burden of constantly having to refine my selection criteria to avoid the crap.

And a final note on the subject of selection criteria: the WHEN MATCHED and WHEN NOT MATCHED clauses can both be extended with an AND to refine the selection to exactly what you want, as the sketch below shows.
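Here is the same merge again, with an invented rule that matched rows carrying a data03 of 'CODE' should be deleted rather than updated (reusing the same test tables):

MERGE INTO test01 a
  USING (SELECT key_field, data01, data02, data03 from test02) b
  ON b.key_field = a.key_field
WHEN MATCHED AND b.data03 = 'CODE' THEN
   DELETE
WHEN MATCHED THEN
   UPDATE SET data02 = b.data02, data03 = b.data03;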

Do go and read up on the MERGE statement. It provides a powerful and flexible approach to updating tables from source data and one that I will be using in the very near future.

Better date conversion with timestamp_format

Handling dates and times on the IBM i can often be a bit of a challenge. Obviously, if you are working with a modern application that uses the Date, Time and Timestamp data types everything is nice and straightforward. However, there are plenty of old applications that use eight, seven or six digit numeric fields to represent dates and, sooner or later, you will need to extract these in a manner that looks sane in some other environment.

Timestamp_format is your friend.

The TIMESTAMP_FORMAT function returns a timestamp that is based on the interpretation of the input string using the specified format.

It’s that “interpretation of the input string” that makes this so handy. Put simply: you tell the function how to interpret the string and it will do the rest.

Here’s an example. I have a contracts table (CONTRACTS) with three columns: Contract Number (CTNR), Start Date (STDT) and End Date (ENDT). And the start and end date are both seven digit numbers. I need to list all of the currently active contracts (Start Date is before Today and End Date is after Today).

And the SQL looks like this:

select CTNR,
       date(timestamp_format(substr(digits(STDT), 2, 6), 'YYMMDD')),
       date(timestamp_format(substr(digits(ENDT), 2, 6), 'YYMMDD'))
from CONTRACTS
where date(timestamp_format(substr(digits(STDT), 2, 6), 'YYMMDD')) <= current_date
and date(timestamp_format(substr(digits(ENDT), 2, 6), 'YYMMDD')) >= current_date

Current_date is a special register that returns the current date, but that’s not important right now.

The date conversion part involves several nested functions to get to a final date, so here’s the breakdown using a date of 1st September 2015 (1150901):

  • digits(STDT) converts the 7 digit number into a 7 character string. So 1150901 is converted to ‘1150901’.
  • The IBM i can figure out which century it’s in by looking at the year, so I don’t need the century flag, and substr(digits(STDT), 2, 6) strips it out to give me a value of ‘150901’.
  • The timestamp_format function takes the date string and uses the format string of ‘YYMMDD’ to generate a timestamp of 2015-09-01-00.00.00.000000. Since I only passed it a date, the hours, minutes and seconds are all zeroes.
  • And finally, I can use the date function to retrieve the date from the generated timestamp.

And here’s the result:

CTNR      DATE      DATE
C1000001  01/09/15  31/08/16
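If you want to experiment with the conversion in isolation, the whole chain can be run against a literal (cast to DEC(7, 0) to match the seven-digit date fields, since an integer literal would pick up extra leading zeroes from DIGITS):

select date(timestamp_format(substr(digits(dec(1150901, 7, 0)), 2, 6), 'YYMMDD'))
from SYSIBM.SYSDUMMY1

… which gives you 2015-09-01, matching the breakdown above.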

Pretty, and portable.