T-SQL routine to REPLACE all occurrences of text matching multiple patterns

This is just a snippet of SQL code I was working on. I was hoping to use it in this script to exclude keywords inside of comments and quotes, but it is just too bulky and unwieldly in its current form. Perhaps I can come back to it and streamline it a bit in the future.

In the meantime, I’m putting it here so I don’t lose it. And, as always, perhaps someone out there can make use of it, or even improve it (let me know if you do).

--V1.01.06 2022-01-25 copyright C. Pottinger
DECLARE @text VARCHAR(MAX) = 'okay, testing, testing, 1, 2, 3. ' +
                             'tthis is a test. This is only a test. ' +
                             'what about tarts? does this handle tarts?';
        -- 'test','tthis is a call. T','t about','tart' => 'call': 'okay, calling, calling, 1, 2, 3. callhis is only a call. whacall calls? does this handle calls?'
        -- 'are' => 'is': okay, calling, calling, 1, 2, 3. callhare are only a call. whacall calls? does thare handle calls?'
--set @text = 'abcdefghijklmnopqrstuvwxyz'; -- 'cdef' => '**': 'ab**ghijklmnopqrstuvwxyz'

DECLARE @matchStart INT;
DECLARE @matchLen   INT;
DECLARE @pattern    VARCHAR(50);
DECLARE @replText   VARCHAR(50);
--DECLARE @repl       VARCHAR(50); -- uncomment references to @repl to see the transformation

DECLARE PatternCursor CURSOR FOR
  SELECT Pattern, Replacement
    FROM (VALUES ('%c%f%', '**'), -- these are the patterns to look for and their replacement strings
                 ('%t_%t%', 'call') 
               , ('%is%', 'are')
         )
      AS X (Pattern, Replacement);
OPEN PatternCursor;
FETCH NEXT FROM PatternCursor INTO @pattern, @replText;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @matchLen = 1;
    WHILE @matchLen > 0
    BEGIN
        SET @matchStart = 0;
        WITH cte_getMatch
        AS (SELECT PATINDEX(@pattern, @text)                               AS [matchPoint]
                 , 0                                                       AS [matchLen]
                 , 0                                                       AS [lowEnd]
                 , LEN(@text)                                              AS [highEnd] 
                 , LEN(@text) / 2                                          AS [newEnd]
                 , PATINDEX(@pattern, SUBSTRING(@text, 1, LEN(@text) / 2)) AS [foundInTestArea]
             UNION ALL
            SELECT [matchPoint]                                                                                              AS [matchPoint]
                 , CAST(IIF([foundInTestArea] = 0, [matchLen], [newEnd] - [matchPoint] + 1) AS INT)                          AS [matchLen]
                 , CAST(IIF([foundInTestArea] = 0, [newEnd], IIF([matchPoint] > [lowEnd], [matchPoint], [lowEnd])) AS INT)   AS [lowEnd]
                 , IIF([foundInTestArea] = 0, [highEnd], [newEnd])                                                           AS [highEnd]
                 , ([newEnd] + IIF([foundInTestArea] = 0, [highEnd], [lowEnd])) / 2                                          AS [newEnd]
                 , PATINDEX(@pattern, SUBSTRING(@text, 1, ([newEnd] + IIF([foundInTestArea] = 0, [highEnd], [lowEnd])) / 2)) AS [foundInTestArea]
              FROM cte_getMatch
             WHERE ABS([highEnd] - [lowEnd]) > 1
           )
        SELECT @matchStart = [gm].[matchPoint]
           --, @repl = SUBSTRING(@text, [gm].[matchPoint], [gm].[matchLen])
             , @text = IIF([gm].[matchPoint] > 0 AND [gm].[matchLen] > 0, REPLACE(@text, SUBSTRING(@text, [gm].[matchPoint], [gm].[matchLen]), @replText), @text)
             , @matchLen = [gm].[matchLen]
          FROM (SELECT TOP 1 * 
                  FROM cte_getMatch
                 WHERE ABS([highEnd] - [lowEnd]) < 2
               ) AS [gm];
      --SELECT @replText, @repl, @text;
    END;

    FETCH NEXT FROM PatternCursor INTO @pattern, @replText;
END;
CLOSE PatternCursor;
DEALLOCATE PatternCursor;

SELECT @text;

Edit: this is now small enough to include in the script to find stored procedures and functions, so I have updated that script here.

This next part is an explanation for myself – because I know I’ll come back to this in a year or two and won’t be able to remember how CTEs work.

The purpose of cte_getMatch is provide the one piece of information that PATINDEX does not give: the length of the matching pattern. It does so by recursively looking for longer and shorter substrings of the text that will contain a match for pattern until it finds the shortest substring.

The first SELECT in the CTE is the anchor. It is called once and results from it are used to feed the first call to the UNIONed SELECT. The UNIONed SELECT is then repeatedly called using the values from the previous call, until the WHERE clause fails. The result of every call is returned as a set.

In the anchor, we determine where the pattern match is first found ([matchPoint]), and initialize the unknown length of the match to 0 ([matchLen]). [lowEnd] and [highEnd] represent the endpoints of the shortest and longest substrings that have been tried so far . They are initialized to low and max values. [newEnd] is the length of the substring that is being checked and [foundInTestArea] will indicate if the pattern match was found in that substring. I begin by using the first half of the text as the test area.

In the recursive SELECT, the goal will be to alter the values of [lowEnd] and [highEnd] so that [lowEnd] will be an increasing index value that is always too small to allow the pattern to be found, and [highEnd] will be a decreasing index value large enough to allow the pattern to be found. The loop will move [lowEnd] and [highEnd] toward each other until they are only 1 character apart, at which point we have found the smallest substring that contains the pattern.

Note: it helps to understand that I am not searching for the start of matching pattern (I have that from PATINDEX) – [lowEnd], [highEnd], and [newEnd] all refer to the end of matching pattern. All searches are performed from the start of the string to [newEnd] to see if substring is long enough to include the pattern.

I start by including [matchPoint] in the results. For the remaining values, there are two possibilities:

The pattern was found in the last search:
- [matchLen] is set based on the test area used in the last search.
- [lowEnd] remains at its current value, unless it was less than [matchPoint] (occurs if the anchor’s test area did not include the match), in which case it is set to [matchPoint]. Why not just set it in the anchor? I wanted to keep the number of calls to PATINDEX to a minimum.
- [highEnd] is lowered to equal the endpoint of the last test area.
- [newEnd] is set to indicate that the endpoint of the new test area to be searched will be a point halfway between the [lowEnd] and the endpoint of the last test area.
- [foundInTestArea] indicates if the pattern is in the new test area.
The pattern was not found in the last search:
- [matchLen] remains at its current value (the length of the last successful search).
- [lowEnd] is raised to equal the index of the end of the unsuccessful search ([newEnd]), unless it was less than [matchPoint] (same reasoning as above).
- [highEnd] remains at its previous value (index of the end of the last successful search)
- [newEnd] is set to indicate that the endpoint of the new test area to be searched will be a point halfway between the endpoint of the previous search and the endpoint of the last successful search ([highEnd]).
- [foundInTestArea] indicates if the pattern is in the new test area.

Once the length of the matching string is determined, then a REPLACE is performed, and the search is repeated until no matches for the current pattern are found. At that point the code loops to the next supplied pattern/replacement string pair and starts the processing again.

Note: This replacement is semi-recursive.

By “-recursive’, I mean that if the replacement string causes the new text to contain the match pattern, then the code will attempt to replace at least a portion of the newly inserted string.

Replacing '%abc%' with 'xxa' in '123abcbca' gives '123xxabcc', which contains the search pattern. So in the next iteration the text is set to '123xxxxac'.
Replacing '%a_c%‘ with 'zaxc' in '123abcbca' gives '123zaxcbca', which is then set to '123zzaxcbca', leading to an infinite loop.

By “semi-“, I mean that each pattern/replacement pair is processed separately, and replacements made by subsequent pairs will not cause a previous pair to re-execute.

Replacing '%abc%' with 'xxa', and '%xxxxa%' with 'ab' in '123abcbca' results in '123xxxxac' after the first pair is processed, and '123abc' after the second pair is processed. But this will not cause the first pair to reprocess and turn the string into '123xxa'.

S	M	T	W	T	F	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

T-SQL routine to REPLACE all occurrences of text matching multiple patterns

Related

Leave a comment Cancel reply