This is a continuation of string manipulation techniques. If you are interested, you can read the other articles,
though you do not need to read it before this one. These are mostly beginning
programming articles, but advanced T-SQL programmers may still find something useful here.
Continuing on with taming strings.
In Part 1, I worked with phone numbers to find the area codes in phone number data. This data was in a variety of formats and frankly, quite a mess. Since we allowed users to enter data with very little validation, we got all sorts of invalid data as well as typos. Well, the time finally came for us to clean and validate as much phone data as possible, and so I decided to write about my little adventure.
Dealing with phone number data is more than a trivial exercise, both from the technical as well a business viewpoints. If you limit yourself to US phone numbers, then the problem is simplified, but still requires some work. In today's Internet, 24x7 uptime requirements, and international applications, I have seen this become less and less of an option, so I plan on designing the db to hold all kinds of information. Another article will tackle the database design side of storing phone data; this article looks at getting all the data in a standard format.
We had initially designed a single phone number field on forms as a way
to handle all types of data in an ergonomic fashion. There was no
validation of the data being entered so as to make the data entry as
smooth as possible. The result of this was phone number data that looked
1-(520)-555-5821 (904)555-1877 2711 5555458 610-555-3723 000 6195557693 11111111111111111 (111) 555-0986 181-5554761 31 471 55505 31 22555599 1-800-555-9495 415 555 3851 +11455561022 1603555775 ++492055599971 407.555.4770 555-6609 202/555-2228 248-555-8300*8263 812-555-8444x312 9999999 1800555SOFT ext. 132 512/555-2000x2563 (800) 555-2588 XT 7 none 800-555-0963 X90454As you can see, there are a variety of formats in which data was entered (BTW, these are not real phone numbers, I scrambled quite a few of the numbers and made the exchanges 555. If any of them actually work, I apologize in advance). Users have entered parenthesis, dashes, periods, +, etc. as delimiters for various sections of the phone numbers. The spacing, abbreviations, and formats vary from company to company (much to the chagrin of developers) and cry out for some standardization. However, forcing standardization during the data entry annoys and angers ( and loses) users. My company finally came up with a standard data entry solution with some validation of data, but I needed to then standardize the existing data.
The technical solution to storing the data was to standardize the formatting and then include parsing routines in the application logic to handle the display and data entry. We decided a single field will contain the user entered data as numerics only with all other characters stripped out with one exception. For users that have a phone extension, we would include an "x" between the phone number and the extension. By standardizing on this format, all developers can expect a certain format and apply simpler parsing rules to format the data for display.
What String Manipulation?
Here is a problem that needs to be solved, what does this have to do with string manipulation? After all, shouldn't the validation be occuring in the presentation or business layers? Couldn't I use a rule or user defined datatype to standardize the system?
Patience, patience, all will be clear soon.
In order to clean this data, I will be using string manipulation to reformat all the existing phone data in the standard format. Regardless of future validation and schema changes, I would still have to get the existing data into a standard format to ensure the application works.
Now it would be nice to write one SQL statement that would rip through the data and reformat all the data as expected. A couple problems with this; one, it would be difficult to explain in an article. Two, it would probably take more time to decipher than most of you would want to spend. Three, I can't come up with one. Since reason three will tend to override reasons one and two in anything I write, here is what I came up with.
Since I try to avoid row processing which is slow and the kind of thing I usually chastise junior DBAs for implementing, I decided to look for patterns which would enable me to develop an algorithm to clean up this data. I made notes as I conducted the analysis so that I could present readers with some insight into how I go about solving a problem or developing a solution.
The first thing I do when solving most problems is look for a pattern.
After all, computers are very good at performing repetitive actions
and SQL is especially well suited to working with a batch of data.
In my examination of the data, I found that most data was basically in one
of a few formats:
999 999-9999 (999) 999-9999 999-999-9999and the extensions tended to look like:
x999 x.999 ext 999 ext.999The rest of the data is pretty close to these formats with some variations (more or less spaces, +xx for international numbers, some people entered slashes ("/") or periods (".") or some other character, but since I may end up handling these individually or as "one-offs" (things that do not fit a pattern or are an exception), I will ignore these for now.
The format that I am trying to standardize all data into is the following:
9999999999x9999where all non-numeric characters and spaces are removed. Actually spaces are not important for the parsing routines we employ, so I will ignore embedded spaces, but I think you will be able to adapt my solution easily to remove them.
Back to the analysis...
There are multiple places in the field where each non-numeric character can occur. I could write a series of substring updates (like in Part 1), but this would be quite a bit of work, lots and lots of statements, and probably not of interest to any of you. As I examine the data, what I really want to do is remove all the dashes ( or other characters) from the field.
So how can we remove the parenthesis, dashes, and periods from the
phone numbers? One way that I think should work is a simple CONVERT or
CAST of the field to an integer value. If I run the following,
select cast( '(555) 555-5555' as int)I would expect to get:
5555555555but instead, I get:
Server: Msg 245, Level 16, State 1, Line 1
Syntax error converting the varchar value '(555) 555-55555' to a column of data type int.
Why this occurs, I am not sure (if you are on the SQL Server team, let me know), but it does not. SUBSTRING is not a great option, so I decided to search BOL a little and see what my options were.
The first thing I saw in BOL which looked promising was PATINDEX which lets you find the location of a particular string within another string. This along with CHARINDEX (see Part 2) is useful for finding the location of a pattern within a string. I suppose I could have used this and then written a bunch of substrings that would concatenate the before and after portions of the string, but this did not seem very efficient either.
It was around this time that one of our senior developers was doing some data cleaning and sent me a script to execute on the production database. This script contained a series of SQL statements in a batch that ran REPLACE to change the wording in a number of rows in a particular table. The idea then came to me that I could use this function to quickly remove all the non-numeric characters in my data!
The REPLACE function is very powerful and is used to do exactly what you would expect it to do.
It searches a string for some other string that you supply (much like CHARINDEX and PATINDEX), but
replaces the second string with a third string that you supply. So if you ran the following code:
declare @a char( 30) select @a = 'Bob and Jim played catch.' select @a, replace( @a, 'Jim', 'Jeff')You get
--------------------------- --------------------------- Bob and Jim played catch. Bob and Jeff played catch.As you can see, the first string, @a, is searched for any occurence of the second string, "Jim". If an occurence is found, then it is replaced with the third string, "Jeff".
To apply this to my phone number problem, I decided to write a series of statements
that would replace the characters I wanted removed. The three obvious characters I
wanted to remove were the dashes, paranthesis, and periods. Actually these are four characters
and I ended up with the following script:
update UserInfo set phone = replace( phone, '-', '') update UserInfo set phone = replace( phone, '(', '') update UserInfo set phone = replace( phone, ')', '') update UserInfo set phone = replace( phone, '.', '')One thing that many people do not know about many functions (at least I am surprised how many people do not know) is that you can often use an empty string wherever a character parameter is called for. In this case, substituting an empty string as the replacement character allows me to remove offending characters.
There are more possible solutions for solving this problem, but some are better left to another article. I hope this has helped some of you understand how to manipulate and clean data a little better as well as sparked some ideas for even better solutions. My solution worked well and was set based so it was somewhat efficient as well.
For those of you interested in further information about REPLACE, here is the URL for the online reference for REPLACE
As always, I welcome feedback and please rate this article below (and any you read on Swynk). It helps to motivate and assist us authors in writing better columns.Steve Jones