T-SQL routine to REPLACE all occurrences of text matching multiple patterns


This is just a snippet of SQL code I was working on. I was hoping to use it in this script to exclude keywords inside of comments and quotes, but it is just too bulky and unwieldly in its current form. Perhaps I can come back to it and streamline it a bit in the future.

In the meantime, I’m putting it here so I don’t lose it. And, as always, perhaps someone out there can make use of it, or even improve it (let me know if you do).

--V1.01.05 2021-08-04 copyright C. Pottinger
DECLARE @text VARCHAR(MAX) = 'okay, testing, testing, 1, 2, 3. ' +
                             'tthis is a test. This is only a test. ' +
                             'what about tarts? does this handle tarts?';
--set @text = 'abcdefghijklmnopqrstuvwxyz';

DECLARE @matchStart INT;
DECLARE @matchLen   INT;
DECLARE @pattern    VARCHAR(50);
DECLARE @replText   VARCHAR(50);

--declare @matched varchar(max); -- debugging line
--declare @forpat  varchar(50);  -- debugging line

DECLARE PatternCursor CURSOR FOR
  SELECT Pattern, Replacement
    FROM (VALUES ('%c%f%', '**'), -- these are the patterns to look for and their replacement strings
                 ('%t_%t%', 'call') 
               , ('%is%', 'are')
         )
      AS X (Pattern, Replacement);
OPEN PatternCursor;
FETCH NEXT FROM PatternCursor INTO @pattern, @replText;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @matchStart = 1;
    WHILE @matchStart > 0
    BEGIN
        SET @matchStart = 0;
        WITH cte_getMatch
        AS (SELECT PATINDEX(@pattern, @text)                               AS [matchPoint]
                 , 0                                                       AS [matchLen]
                 , 0                                                       AS [lowEnd]
                 , LEN(@text)                                              AS [highEnd] 
                 , LEN(@text) / 2                                          AS [newEnd]
                 , PATINDEX(@pattern, SUBSTRING(@text, 1, LEN(@text) / 2)) AS [foundInTestArea]
             UNION ALL
            SELECT [matchPoint]                                                     AS [matchPoint]
                 , CAST(IIF([foundInTestArea] = 0                                   
                            , [matchLen]                                            
                            , [newEnd] - [matchPoint] + 1                           
                           ) AS INT                                                 
                       )                                                            AS [matchLen]
                 , CAST(IIF([foundInTestArea] = 0                                   
                            , IIF([matchPoint] > [newEnd], [matchPoint], [newEnd])                                                  
                            , IIF([matchPoint] > [lowEnd], [matchPoint], [lowEnd])  
                           ) AS INT                                                 
                       )                                                            AS [lowEnd]
                 , IIF([foundInTestArea] = 0, [highEnd], [newEnd] )                 AS [highEnd]
                 , ([newEnd] + IIF([foundInTestArea] = 0, [highEnd], [lowEnd])) / 2 AS [newEnd]
                 , PATINDEX(@pattern
                            , SUBSTRING(@text
                                        , 1
                                        , ([newEnd] + IIF([foundInTestArea] = 0
                                                          , [highEnd]
                                                          , [lowEnd]
                                                         )
                                          ) / 2
                                       )
                           )                                                        AS [foundInTestArea]
              FROM cte_getMatch
             WHERE [matchPoint] > 0
                   AND ABS([highEnd] - [lowEnd]) > 1                  
           )
        --select *, substring(@text, [gm].[matchPoint], [gm].[matchLen]) as [Matched] -- debugging line
        --  from cte_getMatch as [gm];                                                -- debugging line
        SELECT @matchStart = [gm].[matchPoint]
             , @text       = IIF([gm].[matchPoint] > 0 AND [gm].[matchLen] > 0
                                 , REPLACE(@text, SUBSTRING(@text, [gm].[matchPoint], [gm].[matchLen]), @replText)
                                 , @text
                                )
             --, @matched = SUBSTRING(@text, [gm].[matchPoint], [gm].[matchLen]) -- debugging line
             --, @forpat  = @pattern                                             -- debugging line
             , @matchLen   = [gm].[matchLen]
          FROM (SELECT TOP 1 * 
                  FROM cte_getMatch
                 WHERE [matchLen] > 0
                       AND ABS([highEnd] - [lowEnd]) < 2
               ) AS [gm]; 
        --select @matchStart [@matchStart], @matchLen [@matchLen], -- debugging line
        --       @matched [matched], @forpat [for pattern];        -- debugging line
    END;

    FETCH NEXT FROM PatternCursor INTO @pattern, @replText;
END;
CLOSE PatternCursor;
DEALLOCATE PatternCursor;

SELECT @text;

Edit: this is now small enough to include in the script to find stored procedures and functions, so I have updated that script here. (It actually is smaller once you remove all the newlines I put in to avoid line wrapping on the page).

This next part is an explanation for myself – because I know I’ll come back to this in a year or two and won’t be able to remember how CTEs work.

The purpose of cte_getMatch is provide the one piece of information that PATINDEX does not give: the length of the matching pattern. It does so by recursively looking for longer and shorter substrings of the text that will contain a match for pattern until it finds the shortest substring.

The first SELECT in the CTE is the anchor. It is called once and results from it are used to feed the first call to the UNIONed SELECT. The UNIONed SELECT is then repeatedly called using the values from the previous call, until the WHERE clause fails. The result of every call is returned as a set.

In the anchor, we determine where the pattern match is first found ([matchPoint]), and initialize the unknown length of the match to 0 ([matchLen]). [lowEnd] and [highEnd] represent the endpoints of the shortest and longest substrings that have been tried so far . They are initialized to low and max values. [newEnd] is the length of the substring that is being checked and [foundInTestArea] will indicate if the pattern match was found in that substring. I begin by using the first half of the text as the test area.

In the recursive SELECT, the goal will be to alter the values of [lowEnd] and [highEnd] so that [lowEnd] will be an increasing index value that is always too small to allow the pattern to be found, and [highEnd] will be a decreasing index value large enough to allow the pattern to be found. The loop will move [lowEnd] and [highEnd] toward each other until they are only 1 character apart, at which point we have found the smallest substring that contains the pattern

I start by including [matchPoint] in the results. For the remaining values, there are two possibilities:

  • The pattern was found in the last search:
    • [matchLen] is set based on the test area used in the last search.
    • [lowEnd] remains at its current value, unless it was less than [matchPoint] (occurs if the anchor’s test area did not include the match), in which case it is set to [matchPoint]. Why not just set it in the anchor? I wanted to keep the number of calls to PATINDEX to a minimum.
    • [highEnd] is lowered to equal endpoint of the last test area.
    • [newEnd] is set to indicate that the endpoint of the new test area to be searched will be a point halfway between the [lowEnd] and the endpoint of the last test area.
    • [foundInTestArea] indicates if the pattern is in the new test area.
  • The pattern was not found in the last search:
    • [matchLen] remains at its current value (the length of the last successful search).
    • [lowEnd] is raised to equal the index of the end of the unsuccessful search ([newEnd]), unless it was less than [matchPoint] (same reasoning as above).
    • [highEnd] remains at its previous value (index of the end of the last successful search)
    • [newEnd] is set to indicate that the endpoint of the new test area to be searched will be a point halfway between the endpoint of the previous search and the endpoint of the last successful search ([highEnd]).
    • [foundInTestArea] indicates if the pattern is in the new test area.

Once the length of the matching string is determined, then a REPLACE is performed, and the search is repeated until no matches for the current pattern are found. At that point the code loops to the next supplied pattern/replacement string pair and starts the processing again.

Note: This replacement is semi-recursive.

By “-recursive’, I mean that if the replacement string causes the new text to contain the match pattern, then the code will attempt to replace at least a portion of the newly inserted string.

  • Replacing '%abc%' with 'xxa' in '123abcbca' gives '123xxabcc' which is then set to '123xxxxac'.
  • Replacing '%a_c%‘ with 'zaxb' in '123abcbca' gives '123zaxcbca' which is then set to '123zzaxcbca', leading to an infinite loop.

By “semi-“, I mean that each pattern/replacement pair is processed separately, and replacements made by subsequent pairs will not cause a previous pair to re-execute.

  • Replacing '%abc%' with 'xxa', and '%xxxxa%' with 'ab' in '123abcbca' results in '123xxxxac' after the first pair is processed, and '123abc' after the second pair is processed. But this will not cause the first pair to reprocess and turn the string into '123xxa'.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s