This document describes the steps to be taken in order to migrate an Oracle database to UTF8. It is based on Oracle 9i using the WE8ISO8859P15 (Latin 15) character set, but in princip should apply to other configurations as well.
- Data Cleaning
- Data Integrity
- Character Set Cleaning
- Preparing the database structures
- Data transfer: Different strategies compared
Before we go for the migration, we have to clean our database. This is being done mainly for the following reasons:
- Less data result in a faster data transfer, minimizing the required downtime
- Less data mean less source for possible character set conflicts
- Less data will improve performance due to shorter access times
- Less data requires less storage (really!)
- Less data boosts backup and restore, saving time in emergency cases
This is just to mention a few items and shows, that besides being helpful for the migration process, data cleaning will also have positive aspect on the every day maintenance and work. Thus, if there are data in our database which cannot be used or will not be used anymore: This is a good time to get rid of it. Even if you don't plan to migrate at all …
Identify objects to be deleted
Your database probably contains tables which are no longer used and thus could be deleted after an optional export. These tables can be grouped into the categories
- Tables temporarily created for a one-time task, but forgotten to be dropped afterwards
- Obsolete tables already marked for deletion but not yet dropped
- Tables used in a former version of the application, but not used anymore
Moreover there may be complete schematas which are not being used anymore, and can be grouped by the same categories.
Once identified confidentally, these objects should be backed up (e.g. using Oracles EXP utility) – just in case they may be needed again later on - and then finally dropped.
Data no more needed
In a next step we are looking for obsolete/outaged data. These may e.g. include
- Application users who have not logged in for a long time (maybe employees who already left the company
- Statistical data not accessed for long time and probably not needed for the next years
As with the objects above, these data should be backed up and deleted from the database.
Verifying the database schema logic, you may find out that many "foreign key constraints" are missing and where obviously more or less handled from the application only. Checking these may bring up a list of orphans, where the application failed to resolve the referenced data, leaving alone the bones we just can delete (due to the missing reference, they cannot be of any use).
To find all those "bones", you need to know your database model well. Another indicator may be constraints which are either not enabled at all (check if these make sense in your logic – if so, they should not be disabled; if not, they should be dropped), or enabled but not validated. The latter ones are pretty good indicators: You should try to validate them, and check all violations – which will most likely be some of those "bones".
To prevent these orphans in the future, the missing "foreign key constraints" should be established. Before doing so on the database, the applications need to be checked for possible side-effects (e.g. if the app tries to do the logic, it might run into some constraint violations when the foreign keys are established). The required application changes then need to be published at the same time as the constraints are established, to avoid problems.
In parallel to the missing foreign key constraints discussed above, there are a couple of more objects we should check for to avoid possible problems. These include:
- Disabled triggers
- Disabled and/or not validated constraints
- Invalid indices
- Other invalid/broken objects (like functions, views, packages, etc.)
- Objects with no or with outdated statistics
Below you will find an SQL script to list up all these possible troublemakers. For some of these objects this is an important job, since e.g. triggers may have been disabled for a good reason (and should not be "accidentally" enabled after the migration), others may just indicate problems in the database or the application. So do a careful investigation on them.
-- -- Discover possible problematic objects -- -- Author: AI Rehberg -- $LastChangedDate: 2007-08-31 15:04:37 +0200 (Fr, 31 Aug 2007) $ -- disabled triggers select * from dba_triggers where status != 'ENABLED' and owner not in 'SYS' order by owner -- Invalid objects (packages,views,...) select owner,object_name "Object",object_type "OType",status,temporary "Temp" from dba_objects where status != 'VALID' and owner not in ('SYS') order by owner -- Possible broken packages/procedures/views/... -- (may give detail information on above selected broken objects select distinct owner,name,type,line,position,text from dba_errors where text not like '%Statement ignored%' order by owner,name -- Disabled constraints select owner,constraint_name,constraint_type "CType",table_name,status,validated,last_change from dba_constraints where 1=1 and status !='ENABLED' and owner not in ('SYS','SYSTEM','WMSYS') order by owner -- Not validated constraints select owner,constraint_name,constraint_type "CType",table_name,status,validated,last_change from dba_constraints where 1=1 and validated!='VALIDATED' and owner not in ('SYS','SYSTEM','WMSYS') order by owner -- Indices with outdated or no statistics select owner,index_name,index_type "IType",table_name,uniqueness,tablespace_name,logging "Log",status,last_analyzed,temporary "Temp" from dba_indexes where 1=1 and owner not in ('SYS','SYSTEM','WMSYS','CSMIG','CTXSYS','MDSYS','ORDSYS','OUTLN','TOAD','XDB') and (last_analyzed is null or last_analyzed < add_months(sysdate,-2)) order by owner -- Invalid indices select owner,index_name,index_type "IType",table_name,uniqueness,tablespace_name,logging "Log",status,last_analyzed,temporary "Temp" from dba_indexes where 1=1 and owner not in ('SYS','SYSTEM','WMSYS','CSMIG','CTXSYS','MDSYS','ORDSYS','OUTLN','TOAD','XDB') and status != 'VALID' order by owner
Character Set Cleaning
Mainly two different problems are addressed by this topic, resulting in a double-check on the database plus maybe changes in the environment:
- Wrong characters in the database
- Characters to be converted
The first issue, where applicable, was most likely introduced due to wrong
environment setups. The
NLS_LANG variable should reflect the client's system
character set. Instead, on some machines it may reflect the internal character
set of the database (here: '
WE8ISO8859P15'). Although this may sometimes
match, it will cause a problem on the Windows clients, which use the
Windows-1252) character set. As a result, the
Win1252-characters from their input will be introduced to the database without
the necessary character set translation - so we have Win1252 characters in our
Latin-15 database. Since in the other direction (the client is reading from the
database) no conversion took place either, these characters will still be
displayed correctly there, while garbage will be shown to clients with correct
setup. Converting the database without correcting these characters first would
result in a data loss, since the characters in question would not be converted
correctly (but instead all show up as question marks – real ones or
The second issue is just the result of our goal: Since we want to migrate to a
different character set (i.e. '
AL32UTF8'), which luckily is a
super-set of the character set used right now, for some characters the
"character code" (i.e. the position/number in the character table) will change,
requiring a migration of existing content.
How to solve the wrong characters
As stated above, without fixing this problem some data would not be converted correctly: The Win1252 characters in the database would just be replaced by a "dummy placeholder". The solution consists of three steps:
- Identify the data in question
- Replace the wrong characters
- Correct the environment settings (i.e. set
NLS_LANG=WE8MSWIN1252on the windows clients)
The first step is done using Oracles Character Set Scanner (CSScan). To
identify our problem candidates, we pretend that we want to convert to
WE8ISO8859P15, using this for both the "from" and "to" character set – this
way CSScan will report all data which are not in Latin1, in our case the
Win1252 data. There's no way to automatically replace the wrong characters
since Latin1 is not a super-set of the Win1252 character set.
This means rows found "not convertible" need to be manually corrected. Before this is done, we cannot continue with the next step.
Characters to be converted
When all "wrong characters" are fixed (see previous paragraph), i.e. when the
database contains only characters according to its specified character set –
and not before that, we can check for its convertibility. This process is
similar to the previous one, except that we set the target character set to
AL32UTF8. As a result, our data will be divided into up to 4 categories:
- Changeless: Nothing will change here
- Convertible: The necessary conversions will be done automatically
- Truncation: The necessary conversions could be done automatically, but the resulting string would be too long to fit into the column
- Lossy: Due to "wrong characters" (see above), data loss is imminent
Hopefully, most data belongs to category 1 and 2 and only few (or better none) to category 3. Candidates for category 4 should have been eliminated with the previous step – if there are any of these reported here, return to there.
The data for the 3rd category need special attention. If any data
falls into this category, we need to check our table definitions and adjust the
affected column size accordingly – otherwise the data listed in this category
will be truncated on conversion, causing a data loss. If that's not possible
(i.e. the column size is already the maximum supported by Oracle), we need to
truncate these data manually or automatically. This applies only to those
columns which – as a result of the conversion – would have to hold more than
4.000 bytes – so don't worry about the e.g.
VARCHAR2(50) fields reported:
The CSScan tool just sees the current state, which says "50 (bytes)". We will
later change the semantics for all these fields from
the CSScan tool cannot know at this stage. So you only need to be worried
about fields longer than 2.000: If those are completely filled with characters
needing 2 bytes in UTF8, then you will have a problem. We come back to this
Preparing the database structures
In most cases, the migration process will require setting up a new database
already using the
AL32UTF8 character set – the exceptions where you simply
ALTER DATABASE CHARACTER SET AL32UTF8 are very rare. So what you will
have to do is:
- Take the scripts used to create your current database, make a copy of them, and edit the copy concerning the character set to use. Also take care for the database name - you cannot use the same name if that instance will be running on the same machine like the original database.
- Create the new database. Make sure you set
NLS_LENGTH_SEMANTICS=CHARin the new databases environment before.
- Do a full export from the originating database with the
ROWS=Nparameter (this will simply export all objects like tables, triggers and stored procedures - but no data)
- Take that export file, and import the schemata you want to migrate into
the new UTF8 database (do not import the
SYSTEM, etc. schematas which ship with Oracle, they are already there!)
Since all these 4 steps belong to the DBAs daily duties, they won't be explained in depth at this place – I rather focus on the real issues ;)
NLS Length Semantics
By default, the
NLS_LENGTH_SEMANTICS is set to
BYTE. This means, if you
declare a column
VARCHAR2(50), Oracle implicitly makes that
BYTES). What does that mean? Nothing to your old Latin-15 database, since here
each character uses exactly one byte. But with your UTF8 database, it can mean
a lot – if you are not only store US7ASCII characters (why would you want an
UTF8 database for that?). As for those, UTF8 also uses 1 byte per character.
But e.g. for the accentuated French/Spanish/… characters, or the German
Umlauts, UTF8 will use 2 bytes. So with the default settings, you are unable to
say how much characters fit into that column.
To avoid this problem, we need to set the
Oracle interprets our above column implicitely to
VARCHAR2(50 CHARS), and
everything is fine. Hopefully. There still remains one limit for the VARCHAR2
columns, which explicitely states "4.000 bytes". So you of course can define a
VARCHAR2(4000 CHARS) – but with UTF8 this does not necessarily mean
that always 4.000 chars fit in there – see above...
So you set up the
NLS_LENGTH_SEMANTICS correctly, created the UTF8 database
and imported the structures. You thing why not directly import the data along,
right? Maybe you already guessed it: Your imported structures kept the state
they have been created with, i.e. they are still using byte semantics. This was
explicitely written into the export dump file, so whom to blame …
This leaves us with the next task: We have to convert all this to use character semantics. Luckily, we can use some script to do the work for us, and don't have to check everything manually. But Oracle doesn't deliver such a script. Time for me to mention my IZ_MIGRATION package the first time, which contains the necessary procedures. The package is downloadable for free at IzzySoft.DE.
So you can simply use the
iz_migration.semantics2char procedure to do the job
for you (for details, please see the documentation of that package). You may
also consider to convert all your national character fields, such as
NVARCHAR to their "normal pendants" – the package also contains procedures
While talking about the character set cleaning and the Oracle provided CSScan
tool, we mentioned rows that would be truncated and the need to manually check
those. If you want to postpone this job (or automate it and don't care about
the cut-off), you can use the
iz_migration.truncate_long_fields procedure for
that. It will only cut-off what really has to be cut-off, while storing the
original content in a special tables CLOB field together with the object name
and row id, so it is easily identifyable for later review. If you don't
truncate the fields beforehand, you will run into problems on data import,
since these rows would be skipped with an error message. This way you at least
can keep what is possible, and fix the details later.
Data transfer: Different strategies compared
There are different ways to migrate the data themselves. First, a table will give a brief overview – later the strategies will be discussed into more detail:
||Fast and safe||Only applyable in very rare cases|
|EXP/IMP||Safe and supported by Oracle||Slow, requires long downtime|
|DBLink Copy||Minimal downtime||Not supported by Oracle, not so error-proof|
ALTER DATABASE method
If this method can be applied, it is for sure first choice: From all the methods described here, it is not only the fastest (requires almost no downtime), but it also is supported by Oracle. Sounds very good so far - so there must be a rub in it. Right, there is: It won't work if you start off from a Latin-15 database. So it's out of discussion here.
The Export/Import method
So what is Oracle recommending us in our case? Right: Make a complete, consistent export and import it into your new UTF8 database (basically). Details you will find on all Oracle pages, so no need to discuss that here. And I guess you already see the hitch: To make a complete and consistent export, you have to close your database for a while. Same for the import of that dump, for even a longer time. And you probably don't want anybody change the data in the original database while you're importing into the new one, since you want be up-to-date when opening the new instance - so add both times together. You will easily come up with one day off for everybody who depends on that database, since you need 10 hours or more for the process …
The DBLink method
Not always it is affordable to close the database for an entire workday. Not even if you could provide a read-only access for that time in some way. So we have been looking for some alternative. Surprisingly, copying an entire table via a database link turned out to be much faster than an export/import, so I investigated closer on that – the result was the IZ_MIGRATION package.
The idea behind it is simple: Since we already imported the entire structure
into the new database, we can simply walk through the data dictionary and do a
"for each owner – for each table –
insert into owner.table select * from
owner.table@dblink". There are some more things to take care for in this
context (all explained in the packages documentation), including:
- first disable all triggers and constraints to avoid integrity violation and to speed up the process. After copying, enable them again
- drop all indexes first (except for those caused by constraints like
UNIQUEand thus cannot be so easily dropped) and recreate them later, again to speed up the process and end up with clean, balanced indexes
- cater for the sequences
For the constraint and trigger issue, the package contains appropriate scripts
to disable and enable them. As for the speed, constraints we will only
NOVALIDATE – assuming the data have been validated before, we just need to
make sure that no new "violating" entries can be inserted. An additional
procedure to validate them can be run later at any time.
Dropping the indexes also can be done with a procedure of the package. Before doing so, you will need to extract a script to re-create them later – you will find that script at the end of this page. The sequences, again, will be catered for by a procedure from the IZ_MIGRATION package.
Warning concerning the DBLink Copy Method
As already stated above, the DBLink copy method is not supported nor even recommended by Oracle. Even worse, it is not that much tested yet. I ran it already successfully on a few migrations, and thereby fixed the one or the other issue – but that does not mean it can be recommended to migrate any production database. The instances I migrated up to now using the package, all had very much in common – and since I created the package I exactly know what it is doing, and thus can solve potential trouble somehow. So if you really want to use this method, you should either hire me along for that task, or at least run everything thoroughly with a test system (which is as close as possible to your production database) to see that the result is really what it should be! Best, of course, is to combine both ;)
Script to generate the
CREATE INDEX statements
-- Retrieving all index information (create-script) for non-pk and non-unique indexes SET SERVEROUTPUT ON FEEDBACK OFF SPOOL create_indexes.sql DECLARE town VARCHAR2(40) := '0'; CURSOR cur IS SELECT i.owner,i.index_name FROM dba_indexes i,dba_constraints c WHERE i.owner = c.owner(+) AND i.index_name = c.index_name(+) AND c.index_name IS NULL AND i.index_type NOT LIKE 'IOT%' AND i.index_type != 'LOB' AND i.owner NOT IN ('SYS','SYSTEM','CSMIG','CTXSYS','DBSNMP','DMSYS','EXFSYS', 'HR','IX','MDSYS','OE','OLAPSYS','ORDSYS','OUTLN', 'SH','SYSMAN','WKSYS','WK_TEST','WMSYS','XDB') ORDER BY i.owner,i.index_name; PROCEDURE print(line IN VARCHAR2) IS -- wrapper to dbms_output.put_line - work around the 255 characters limit pos NUMBER; FUNCTION strpos (str IN VARCHAR2,needle IN VARCHAR2,startpos NUMBER) RETURN NUMBER IS -- split string helper to procedure print pos NUMBER; strsub VARCHAR2(255); BEGIN strsub := SUBSTR(str,1,255); pos := INSTR(strsub,needle,startpos); return pos; END; BEGIN dbms_output.put_line(line); EXCEPTION WHEN OTHERS THEN IF SQLERRM LIKE '%ORU-10028%' THEN pos := strpos(line,' ',-1); print(SUBSTR(line,1,pos)); pos := pos +1; print(SUBSTR(line,pos)); ELSE dbms_output.put_line('-- *!* Problem in print() *!*'); END IF; END; BEGIN dbms_output.enable(1000000); FOR rec IN cur LOOP IF town != rec.owner THEN town := rec.owner; print('-- ===============================[ Schema '||town||' ]==='); END IF; print(dbms_metadata.get_ddl('INDEX',rec.index_name,rec.owner)||';'); END LOOP; END; / SPOOL OFF
There are different strategies to migrate a database to UTF8 – from "very safe" to "safe but faster", we discussed them a little bit in detail. Many steps in the preparation phase are common to all strategies.
Whatever strategy you chose, the IZ_MIGRATION package can assist you and make some otherwise a bit difficult tasks a lot easier. You may want to check the IzzySoft.DE website for more details on this and other helpful Oracle packages and software.