This is just a snippet of SQL code I was working on. I was hoping to use it in this script to exclude keywords inside of comments and quotes, but it is just too bulky and unwieldly in its current form. Perhaps I can come back to it and streamline it a bit in the future.
In the meantime, I’m putting it here so I don’t lose it. And, as always, perhaps someone out there can make use of it, or even improve it (let me know if you do).
--V1.01.05 2021-08-04 copyright C. Pottinger DECLARE @text VARCHAR(MAX) = 'okay, testing, testing, 1, 2, 3. ' + 'tthis is a test. This is only a test. ' + 'what about tarts? does this handle tarts?'; --set @text = 'abcdefghijklmnopqrstuvwxyz'; DECLARE @matchStart INT; DECLARE @matchLen INT; DECLARE @pattern VARCHAR(50); DECLARE @replText VARCHAR(50); --declare @matched varchar(max); -- debugging line --declare @forpat varchar(50); -- debugging line DECLARE PatternCursor CURSOR FOR SELECT Pattern, Replacement FROM (VALUES ('%c%f%', '**'), -- these are the patterns to look for and their replacement strings ('%t_%t%', 'call') , ('%is%', 'are') ) AS X (Pattern, Replacement); OPEN PatternCursor; FETCH NEXT FROM PatternCursor INTO @pattern, @replText; WHILE @@FETCH_STATUS = 0 BEGIN SET @matchStart = 1; WHILE @matchStart > 0 BEGIN SET @matchStart = 0; WITH cte_getMatch AS (SELECT PATINDEX(@pattern, @text) AS [matchPoint] , 0 AS [matchLen] , 0 AS [lowEnd] , LEN(@text) AS [highEnd] , LEN(@text) / 2 AS [newEnd] , PATINDEX(@pattern, SUBSTRING(@text, 1, LEN(@text) / 2)) AS [foundInTestArea] UNION ALL SELECT [matchPoint] AS [matchPoint] , CAST(IIF([foundInTestArea] = 0 , [matchLen] , [newEnd] - [matchPoint] + 1 ) AS INT ) AS [matchLen] , CAST(IIF([foundInTestArea] = 0 , IIF([matchPoint] > [newEnd], [matchPoint], [newEnd]) , IIF([matchPoint] > [lowEnd], [matchPoint], [lowEnd]) ) AS INT ) AS [lowEnd] , IIF([foundInTestArea] = 0, [highEnd], [newEnd] ) AS [highEnd] , ([newEnd] + IIF([foundInTestArea] = 0, [highEnd], [lowEnd])) / 2 AS [newEnd] , PATINDEX(@pattern , SUBSTRING(@text , 1 , ([newEnd] + IIF([foundInTestArea] = 0 , [highEnd] , [lowEnd] ) ) / 2 ) ) AS [foundInTestArea] FROM cte_getMatch WHERE [matchPoint] > 0 AND ABS([highEnd] - [lowEnd]) > 1 ) --select *, substring(@text, [gm].[matchPoint], [gm].[matchLen]) as [Matched] -- debugging line -- from cte_getMatch as [gm]; -- debugging line SELECT @matchStart = [gm].[matchPoint] , @text = IIF([gm].[matchPoint] > 0 AND [gm].[matchLen] > 0 , REPLACE(@text, SUBSTRING(@text, [gm].[matchPoint], [gm].[matchLen]), @replText) , @text ) --, @matched = SUBSTRING(@text, [gm].[matchPoint], [gm].[matchLen]) -- debugging line --, @forpat = @pattern -- debugging line , @matchLen = [gm].[matchLen] FROM (SELECT TOP 1 * FROM cte_getMatch WHERE [matchLen] > 0 AND ABS([highEnd] - [lowEnd]) < 2 ) AS [gm]; --select @matchStart [@matchStart], @matchLen [@matchLen], -- debugging line -- @matched [matched], @forpat [for pattern]; -- debugging line END; FETCH NEXT FROM PatternCursor INTO @pattern, @replText; END; CLOSE PatternCursor; DEALLOCATE PatternCursor; SELECT @text;
Edit: this is now small enough to include in the script to find stored procedures and functions, so I have updated that script here. (It actually is smaller once you remove all the newlines I put in to avoid line wrapping on the page).
This next part is an explanation for myself – because I know I’ll come back to this in a year or two and won’t be able to remember how CTEs work.
The purpose of
cte_getMatch is provide the one piece of information that
PATINDEX does not give: the length of the matching pattern. It does so by recursively looking for longer and shorter substrings of the text that will contain a match for
pattern until it finds the shortest substring.
SELECT in the CTE is the anchor. It is called once and results from it are used to feed the first call to the
SELECT is then repeatedly called using the values from the previous call, until the
WHERE clause fails. The result of every call is returned as a set.
In the anchor, we determine where the pattern match is first found (
[matchPoint]), and initialize the unknown length of the match to 0 (
[highEnd] represent the endpoints of the shortest and longest substrings that have been tried so far . They are initialized to low and max values.
[newEnd] is the length of the substring that is being checked and
[foundInTestArea] will indicate if the pattern match was found in that substring. I begin by using the first half of the text as the test area.
In the recursive SELECT, the goal will be to alter the values of
[highEnd] so that
[lowEnd] will be an increasing index value that is always too small to allow the pattern to be found, and
[highEnd] will be a decreasing index value large enough to allow the pattern to be found. The loop will move
[highEnd] toward each other until they are only 1 character apart, at which point we have found the smallest substring that contains the pattern
I start by including
[matchPoint] in the results. For the remaining values, there are two possibilities:
- The pattern was found in the last search:
[matchLen]is set based on the test area used in the last search.
[lowEnd]remains at its current value, unless it was less than
[matchPoint](occurs if the anchor’s test area did not include the match), in which case it is set to
[matchPoint]. Why not just set it in the anchor? I wanted to keep the number of calls to
PATINDEXto a minimum.
[highEnd]is lowered to equal endpoint of the last test area.
[newEnd]is set to indicate that the endpoint of the new test area to be searched will be a point halfway between the
[lowEnd]and the endpoint of the last test area.
[foundInTestArea]indicates if the pattern is in the new test area.
- The pattern was not found in the last search:
[matchLen]remains at its current value (the length of the last successful search).
[lowEnd]is raised to equal the index of the end of the unsuccessful search (
[newEnd]), unless it was less than
[matchPoint](same reasoning as above).
[highEnd]remains at its previous value (index of the end of the last successful search)
[newEnd]is set to indicate that the endpoint of the new test area to be searched will be a point halfway between the endpoint of the previous search and the endpoint of the last successful search (
[foundInTestArea]indicates if the pattern is in the new test area.
Once the length of the matching string is determined, then a
REPLACE is performed, and the search is repeated until no matches for the current pattern are found. At that point the code loops to the next supplied pattern/replacement string pair and starts the processing again.
Note: This replacement is semi-recursive.
By “-recursive’, I mean that if the replacement string causes the new text to contain the match pattern, then the code will attempt to replace at least a portion of the newly inserted string.
'123xxabcc'which is then set to
'123zaxcbca'which is then set to
'123zzaxcbca', leading to an infinite loop.
By “semi-“, I mean that each pattern/replacement pair is processed separately, and replacements made by subsequent pairs will not cause a previous pair to re-execute.
'123xxxxac'after the first pair is processed, and
'123abc'after the second pair is processed. But this will not cause the first pair to reprocess and turn the string into