dcsimg
 

Oracle's ASCIISTR() and Unicode Characters

Thursday Jul 12th 2018 by David Fitzjarrell

Although documented, the ASCIISTR() function may return 'interesting' results depending on the characters translated.  Read on to see what problems this can cause and a possible fix.

Oracle provides an interesting function, ASCIISTR(), to return ASCII strings from a VARCHAR2 or CLOB column, and in general it does an admirable job. There are non-printing characters however, that 'put a spanner in the works', returning HEX strings instead of characters. Since each HEX string is five bytes long, such strings can significantly increase the overall length of the text so that it won't 'fit' in the allocated buffer space. Let's look at one of those 'annoying' characters, look at its source and provide a possible solution.

Word-processing programs (which will remain unnamed), spreadsheets, PDFs and HTML documents are great tools, that is until text from any of those sources is inserted into a text or CLOB column in an Oracle database using the popular 'copy and paste' procedure. Many times, those programs use what is called a 'non-breaking space' to separate words in a line of text, and if the Oracle database in question is configured to use UTF-8 character encoding those non-breaking spaces get inserted into the column without error or translation. A simple select of the column data won't reveal these insidious pests, but ASCIISTR() will. This is mentioned in the Oracle online documentation; the description from that document is reproduced below:


ASCIISTR takes as its argument a string, or an expression that resolves to a string, in any character set and returns an ASCII
version of the string in the database character set. Non-ASCII characters are converted to the form \xxxx, where xxxx represents
a UTF-16 code unit.

So what is a 'non-breaking space'? According to Wikipedia:


In word processing and digital typesetting, a non-breaking space (" "), also called no-break space, non-breakable space (NBSP),
hard space, or fixed space,[1] is a space character that prevents an automatic line break at its position. In some formats,
including HTML, it also prevents consecutive whitespace characters from collapsing into a single space.

In the database world the ASCIISTR() Function can't translate it into an ASCII character so it returns a text string of the HEX representation of it, which is '\00A0'. Concatenate enough of those spaces into a single line and the line length expands, sometimes dramatically. When there are many of these in a line, paragraph, or block of text Oracle can throw the following errors from ASCIISTR():


ORA-24920: column size too large for client. 
ORA-22835: Buffer too small for CLOB to CHAR or BLOB to RAW conversion (actual: %s, maximum: %s)

Reports can fail because of those errors, and to the user, the cause is unknown because these 'non-breaking spaces' look like regular spaces.

Can this issue be fixed? Yes, but it may not be a really simple fix, depending upon the configured database block size. For block sizes of 8192 or less the string will need to be 'walked' in 'chunks' to update it completely. If the block size is 16K, successful single-statement updates are possible. Looking at an example that searches the first 2400 characters of a string for '\00A0' the PL/SQL block might look like this:


declare
        v_chunk         number:=2400;
        v_start_pos     number:=1;
        v_next_pos      number:=0;
        v_passes        number:=0;
begin
        --
	-- Set the number of passes through the loop
        --
        select round(length(col)/v_chunk,0) into v_passes from mytable;
        --
	-- Execute the loop, walking the string in pre-defined blocks
        --
        for i in 1..v_passes loop
                v_next_pos := v_start_pos + v_chunk;
                update mytable
                set col = replace(asciistr(substr(col, v_start_pos, v_chunk)), 'Z','')
                where id in
                        (select id from mytable where asciistr(substr(col, v_start_pos, v_chunk)) like '%Z%');
        --
        -- Set the new starting position
        --

                v_start_pos := v_next_pos;
        end loop;
        --
        -- Commit the changes
        --
        commit;
end;
/

Feel free to use this code, making any changes to table and column names or chunk size.

The above script works because it calculates the number of 'steps' it will take to update the entire string, based on the string length and the 'chunk' size. This determines the number of steps required to 'walk' the entire string. The loop itself calculates the starting position and positions the 'pointer' where it needs to be to pick up from the previous update. Eventually the entire string is 'walked' and any errant characters replaced.

Of course, not everyone will have need of this script; this problem arises when the copy and paste operations described above are in use at insurance companies, medical offices, law offices and others who regularly insert text into a VARCHAR2 or CLOB column that is found online or in word processing/spreadsheet/PDF files.

Although this may seem like a remote issue, to some it can be a problem on a daily, weekly, or monthly basis depending upon when reports where this character is liberally sprinkled are generated. It's nice to know, though, that if this problem ever does appear there's a possible solution handy.

See all articles by David Fitzjarrell

Home
Mobile Site | Full Site
Copyright 2018 © QuinStreet Inc. All Rights Reserved