Installing Oracle 10g GridControl
Monitoring databases and servers is an important part of the administrator. Especially for Oracle databases and their internals the question comes up, Why not to take a tool from the vendor itself? And if so, DBControl is shipped with the database – do we really need the Enterprise Manager Grid control? We will deal with these questions here as well as with the Installation itself.
What's this article about
As title and introduction already tell, this article is about Oracle 10g Grid Control, the "why" (in short) and the installation process. The latter will use an already existing database for the repository rather than a new one created by the GridControl installer (the "why so" will also be discussed). We again use a Linux server (RHEL / CentOS).
What this article is NOT about
Yeah, again the same disclaimer: We won't talk about Windoze specific stuff here talking about RedHAT, not RedMOND). Moreover, we will not talk about specifics concerning the "Installing GridControl with a new database" option (you will see why).
Contents
- Why GridControl?
- Prerequisites
- Installing GridControl Software
- Applying the Patches
- Postprocessing and Remarks
- Troubleshooting
- Installation and Konfiguration
- Alerts and Policies
- Grid reports the Listener to be down - but it is running
- PLS-00201: identifier 'MGMT_CNTR_TT.CLEAN_UP_OLD_TICKET_RECS' must be declared
- Alerts: Warning and errors because a Standby Database is not open
- Trouble with the Open Ports policy
- Target Alert: Error - Missing Properties
- Massive collection errors with ORA-00600
- Alerts: "User SYS logged on from …"
Why GridControl?
Oracle 10g databases ship with DBConsole – a tool somehow similiar to GridControl, which does not necessarily require an Enterprise license. So next to the license question (you may decide for the Enterprise Edition for other reasons, as partitioning or other Enterprise features you need), why not use DBControl?
There may be many reasons – we only want to pick two of them here. First: With DBControl, you have one (Web) interface per database - and have to check them all separately. With GridControl, you have one (Web) interface to monitor them all together – including a summary page for all services configured, not only databases but other components as well. Second, GridControl has more restrict security policies than DBControl, including file permissions the latter one does not monitor (if you don't want these, you can change or turn off what you don't like). Having more than just two or three databases, the first point alone may already make the decision – an Enterprise license provided.
First critics on GridControl
Downloading GridControl from the Oracle server issues the first critic: The latest version at the time of this article is 10.2.0.4 – the corresponding archives take 1GB to download, and the archive only contains patches and the hint, you need a previous version to apply them to. Downloading the previous version, 10.2.0.3, ends at the same point: 1GB and only patches. 10.2.0.2 finally contains the installer. While the idea behind it may be providing one solution for all (if you already installed a previous version, you may patch it instead of re-installing), this is frustrating for new installations plus, in this case, leads to additional problems that could have been avoided. And in time pressure, this may end with a 10.2.0.2 GridControl installed since the patches were postponed until "sometimes in the future" (and nobody knows when this will be), maybe leaving even security holes. I cannot call this "customer friendly" – for a software with prices like that, I'ld rather expect a full installer for the latest version.
Second – and here comes the answer on the question Why not using the default installation with a new database? – the whole thing ships with a database release 1 (!), i.e. 10.1.0.4! This version is more than 2 years old, today we already have 10.2.0.4 available. Why should one install some old version like that?
By the way, the database makes up about one third of the package size of 1.5GB in the distribution archive for GridControl 10.2.0.2. Assumed the same for the patches to 10.2.0.4, this is almost 1GB for nothing: If I don't have any database, I probably don't need the GridControl, do I? And if I have one, I probably use some more up-to-date version. Third option, if I want to monitor my Oracle 9i farm, I either can use a 9i instance for the repository (they explicitely allow 9.2.0.6 or above) or, if I want to use a 10g and not yet have it, I won't install some outdated version. Hm – so where's the option justifying the 10gR1 database in this package?
Caution message
GridControl is a powerful tool for the administrator. But while alerts put
their fingers on possibel security lacks, they are not always clear and
sometimes even ambiguous: If you receive an alert about an open port 1521, it
is quite obvious that you should not close that port but rather define an
exception to that rule, since you need that listener to contact the database.
But if you receive an alert about certain "Privileges Granted to Public", it
may look clear to revoke some of them – but the impact is not that clear,
as you will find out later (this one is discussed later in this
article). In this example, you could render your database at least partly
unusable, since other Oracle packages blindly trust on those permissions as
usually being granted, and do not grant them explicitly during their install
(in this case, Oracle Text could and should have granted those to the
CTXSYS
user). So here is some potential left to Oracle for improvements:
First to not take things granted they alert about in other places, and second
to make the alerts (and/or their help texts) more discriptive and
distinguishable.
Prerequisites
General
As already said on the first page, we assume you already have installed a 10g database. While the repository database may be on a different host (which is, by the way, recommended by Oracle), you may also have it running on the same host. However: The host for the GridControl installation must meet the same requirements as for the database itself, plus some more. Differences/additions will be described below:
Additional hardware requirements
- hard disk: add about 2.5GB for Grid software
- RAM: At least 2GB
Additional Software requirements
Additionally to the database requirements, add the packages
sysstat
wget
Remark: Oracle also mentiones the control-center package as a must-have requirement. This would resolve in half of Gnome being installed, including Evolution packages and all that stuff. I felt this contra-productive and left that requirement alone – with no consequences so far. Which is not a guarantee, however! So to be on the "safe side", you may better follow the Oracle advice. Since the installer did not feel anything missing (and they make a lot of checks), I personally feel on the safe side without it.
Environment
Additionally to the database requirements, GridControl needs the operating
system user nobody – which is usually already there. you can check that by
locating the user in /etc/passwd
, or simply run id nobody
which should
return the user_id.
As usual, there will be one or more script to be run as root
during the
installation (for our article, two scripts). So make sure you have root access
on the machine you want to install GridControl to.
Database
Permissions
The curious things first, I dropped into with a failing install: Though
DBControl before alerted that you should never grant public execution to the
utl_*
packages, the installation will only be successful if you did – and
will alert the same again once it's running. Don't ask me what idea stands
behind that, but before installing GridControl, you have to
sqlplus> GRANT EXECUTE ON utl_smtp TO public;
sqlplus> GRANT EXECUTE ON utl_tcp TO public;
sqlplus> GRANT EXECUTE ON utl_file TO public
Once you have GridControl up and running, don't forget to revoke those again –
and grant the same rights to sysman
instead.
Configuration
Your redo logs should have a size of 100M. Also, you should have the package
dbms_shared_pool
installed – if not, do so by executing
sqlplus> @?/rdbms/admin/dbmspool.sql
Furthermore, you should compare your init.ora
settings to meet the following
settings:
job_queue_processes >= 10
db_block_size = 8192
timed_statistics = TRUE
statistics_level = TYPICAL
open_cursors >= 300
session_cached_cursors >= 200
aq_tm_processes >= 1
undo_management = AUTO
undo_retention >= 10800
processes >= 150
log_buffer = 1 M
pga_aggregate_target >= 256 M
sga_target >= 512 M
_b_tree_bitmap_plans = FALSE
Miscellaneous
- DBControl must not be configured for the repository database.
GridControl and DBControl cannot go together. If you had DBControl
configured, make sure you remove it again. You can do so by executing the
following statement as user
oracle
, having set the correctORACLE_SID
:
$ORACLE_HOME/bin/emca -deconfig dbcontrol db -repos drop
- To prevent errors during installation, you should add the following line
to your
$TNS_ADMIN/listener.ora
(if it is not there already):
SUBSCRIBE_FOR_NODE_DOWN_EVENT_<listener-name>=OFF
(replace the phrase<listener-name>
with the name of your listener – default would beLISTENER
). Otherwise the listener claims a port needed by GridControl, and the installation fails. Leave this setting there also after the installation finished successfully – for the same reason of port conflicts. Restart the listener afterwards (lsnrctl stop && lsnrctl start
). - If you want to, you may edit the default ports used for the services of
GridControl, by editing the file
response/staticports.ini
in your GridControl installation directory.
Installing GridControl Software
Call the installer
To invoke the installer, simply change into the directory where you unpacked
the downloaded archive, and execute ./runInstaller
. Follow the steps through
the installer as described e.g.
here
(very detailed, including all the pre-requisites etc.). You do not need to read
that page – the link is just mentioned "in case of". The process is quite
intuitive.
Installation process
On the first page of the installer, chose "Enterprise Manager 10g Grid Control Using an Existing Database", click the "Next" button, and specify the database details on the second page. Also on the second page, click the "Prefill Tablespace Locations" button once – this will read the correct location of your databases data files plus, as a side-effect, makes sure your entered credentials are correct (since otherwise the installer cannot read these data out of the target database). Check the database file names, and click the "Next" button again.
On the next page, you can configure an EMail address (where GridControl sends its alert messages to) plus SMTP server – you should enter these data. Everything else on this page is optional: If you want GridControl to "phone home", you can enter your MetaLink credentials here plus, if necessary, the proxy settings. And then click "Next" again.
Now come the passwords. With GridControl, this is the first time the stronger policies take effect: The given password must be at least 5 characters, starting with a letter and including at least one digit.
Next comes a summary screen, and the the installation itself. Hopefully our preparations where good, and everything runs smoothely. I assume so here – well, leaving you alone in case of additional problems (two of them we already got out of the way with our preparations), sorry for that.
Applying the Patches
Preparations
Download and unpack
No need to say you first must download the patches from Oracle - you probably
already have. Unpack the archive into an empty directory, and delete all but
the p3731593_10204_LINUX.zip
– you won't ever need those files: The
373596.zip
and 3822442.zip
are for already deployed agents (and we did not
yet deploy any), and the p4329444*.zip
files are for patching the included
10.1 database we did not use – so you see, again more than 50% waste …
Now unpack the p3731593_10204_LINUX.zip
, and it will create another directory
called 3731593
holding the patch installation files.
Stop all EM services
Next we need to stop all Enterprise Manager services. This can be done using the following commands:
$EM_HOME/bin/emctl stop oms
$EM_HOME/opmn/bin/opmnctl stopall
$AGENT_HOME/bin/emctl stop agent
where $EM_HOME
is the ORACLE_HOME
of the Oracle Management Services (a
directory usually called "oms10g" somewhere below your ORACLE_BASE
– parallel
to your usual ORACLE_HOME
or, as in our case, parallel to that in
em102/oms10g
), and $AGENT_HOME
at the same place, but usually called
agent10g
.
Apply the patches
Change to the directory created by the last unpack operation above, and
run the installer: ./runInstaller
. Now first select the Oracle Home of the
OMS and click "Next". Check that it selected the correct database (where your
repository resides), and enter the password for the SYS
user. Approve all
questions to your requirements, and wait the patch to finish (this may take a
while – some 15 minutes or the like). Click "Finish".
Start the installer again, and this time select the Oracle Home of the agent. Less questions this time (no database connection required), but almost the same procedure. Click "Finish" once the patches have been applied, and you are done.
Restart the services
Now it's time to start the services again:
$EM_HOME/bin/emctl start oms
$EM_HOME/opmn/bin/opmnctl startproc ias-component=WebCache
$AGENT_HOME/bin/emctl start agent
Instead of the start
command, you may want to issue status
– most probably
these services are already started by the patch utility. Except for the
WebCache, which simply gives an no processes matched this request
error – no
idea why, since these startup informations are taken directly from the Grid
manual.
Postprocessing and Remarks
OMS
Services
Right after the installation (and the patch process), the installer starts up a
bunch of processes – some of them you will normally not need. For sure you will
need the OMS itself and the agent – but not necessary all of the IAS
components. Running <OMS HOME>/bin/opmnctl status -l
gives you a list of all
IAS components and their current state. The following components are the basic
requirements:
- OC4J_EM
- OC4J_EMPROV
- HTTP_SERVER
These processes you should make sure to have started when your server boots – all others are optional. Also make sure you keep above order:
$OMS_HOME/bin/emctl start oms
$OMS_HOME/opmn/bin/opmnctl startproc process-type=OC4J_EM
$OMS_HOME/opmn/bin/opmnctl startproc process-type=OC4J_EMPROV
$OMS_HOME/opmn/bin/opmnctl startproc process-type=HTTP_Server
$AGENT_HOME/bin/emctl start agent
In case you wonder where the OC4J_EM and HTTP_SERVER are: They already
implicitly got started by emctl start oms
. Now, stopping these services goes
as follows:
$OMS_HOME/bin/emctl stop oms
$OMS_HOME/opmn/bin/opmnctl stopall
$AGENT_HOME/bin/emctl stop agent
Port security
While with the above steps we already closed some ports unnecessarily open,
there are some more we may or may not want to be used: The unsecure HTTP ports.
We cannot close them directly – but we can make them "unusable" by forbidding
their use. This can be done in the HTTP servers configuration, i.e.
$OMS_HOME/sysman/config/http_em.conf
(resp. http_em.conf.template
, which
will overwrite the first every time we secure/lock/unlock the OMS, see
Metalink Note
452290.1). First check which are the non-secure HTTP ports used (<OMS
HOME>/bin/opmnctl status -l
), and then find the right sections in these two
files. Here comes one example:
<VirtualHost *:4889>
...
<Location /em/console>
order deny,allow
Deny from all
<Location>
As you may already have guessed, the keywords are Deny from all – meaning that nobody is allowed to access this host anymore. But be carefully: this also disables the agents possibility to communicate via unsecured HTTP. So check here first and decide then whether you can do so.
Remote Agents
Probably you will have at least one database on a remote host you want to monitor using GridControl. Or another machine where you want to monitor the host itself. Or both, or more. However: You will need an Agent to be installed there (and running). Besides copying the installation files there manually, and click through the installer again, you have two alternatives:
Before we roll out
Alright, you installed everything, patched everything, configured everything - what more? Don't ask me why, but if you want to install the agents on remote hosts as described below, at this point there would only be the 10.2.0.1 agent available! Although you patched OMS as well as the agent, this created the 10.2.0.4 directory, but only to add the script file and a plain text file there, missing all the bunch of files needed for the installation on the remote host.
What does it mean – but an additional download of another 340MB for the
Complete 10.2.0.4 agent install
files
from Oracle? Once you downloaded these files, unzip them into the <OEM
Home>/sysman/agent_download/10.2.0.4.0
directory (at least the linux
directory and its contents you need there). After that, the install will still
roll out the 10.2.0.1 agent, but together with its patches, and will apply
those automatically on the rollout. Why they don't simply put the latest
version there, is another question to the Oracle …
Push Install
The Push Install requires a working SSH setup: The Management server must be
able to connect to the hosts where the agents should be installed, plus to
transfer and install the software there and finally start it. If you have not
yet set up SSH for this, you can do so using the <OMS
ORACLE_HOME>/sysman/prov/resources/scripts/sshUserSetup.sh
script. It further
requires a username and password (I did not check whether this should be for
the OS – if you did not make the oracle
user a login user, you probably
cannot use the Push Install).
If these conditions are met, you can open GridControl in your web browser (e.g.
https://<servername>:1159/em
), click the Deployments tab, hit the Install
Agents link there, and follow the instructions. This way you can deploy the
agents to multiple machines at once – provided they share the same credentials.
Pull Install
Other than the Push Install, this does not require the SSH setup -
but wget
must be available. Logged in as the oracle
user on the machine the
agent should be installed, issue the following commands:
oracle@remotehost> wget http://<emhost>:4889/agent_download/10.2.0.4.0/linux/agentDownload.linux
oracle@remotehost> chmod +x agentDownload.linux
oracle@remotehost> ./agentDownload.linux -m <emhost> -b <ORACLE_BASE> -r 4889
ORACLE_BASE
specifies the directory below which the agent home will be
created. So if you e.g. specify /home/oracle/em102
, it will create
/home/local/em102/agent102
as the AGENT_HOME
directory and install the
agent software here. The agent will be started right after the installation
process completed. With the 10.2.0.4 version of the script, you must provide
the port with the -r
option, or the script will simply break with some error
messages.
Ah, well: <emhost>
is the host running the management server, i.e. the one we
just made the big installation on the pages before.
Securing the agents
The agents transport a lot of secure and sensitive information via the net, as e.g. passwords. So you probably don't want them to use an unsecured protocol such as HTTP, do you? At least not as long there is an alternative – and there is: HTTPS. So first let's check what they are using:
$AGENT_HOME/bin/emctl status agent
Watch the output for the Agent URL and Repository URL lines. If the values here already start with "https://", everything is fine (usually they do so at least after a Pull Install). Otherwise, better secure the agent:
$AGENT_HOME/bin/emctl secure agent
Once you have done so for all agents, you may optionally tell the OMS not to permit any unsecure connections anymore:
$OMS_HOME/bin/emctl secure lock
At this stage the agents communication is completely secured – no passwords will be transfered by them anymore uncrypted. But what about the users of the GridControl – they may still use the unsecured HTTP (accidentally, of course)? This now can be suppressed as described above.
Unavailable Services
Having done this all, it's time to check our console and see if everything
works. Open your web browser and point it to the management server, e.g. at
https://<emhost>:1159/em
and have a look.
First thing you will notice (only if you decided for the minimal services described above, and you already stopped and restarted the services as described above) are 3 services marked down. A problem? Not really. The agent of course discovered all services installed – and now expects them to run. But we decided we don't need them all. The agent does not yet know that – so we need to tell him.
In order to do so, first click on the "Targets" tab. Then, in the line just below the tab bar, select "Application Servers". Expand the tree displayed – you will find our 3 services marked down here. Do not remove them all! This would have some implications, i.e. other services we want to monitor would be removed as well. Tell it a "counting problem": 1+1=3 here (2 services are down – yes, right. That means, not all services in the group EnterpriseManager are available – so the group is marked down and added to the list – makes 3). So what we are doing now is: Selecting the WebCache by checking the radio box next to it, click on "Remove", and confirm it. After returning to the list, we do the same with the EnterpriseManager*_home item (only remove the leafs here, never a tree).
Going back to the main page (click the "Home" tab), GridControl now still reports one service being down. Don't mind, give it a few minutes to update. Continue with the next steps below – and when you are finished, it should have updated and listing that service as being "Up" again.
Discover the Databases
Now we check for the other things. Yeah, it already discovered a lot – but
probably no database! For each of the databases, it simply says they are in
"unknown state" because there have been some "metric errors". A quite
misleading information: "Credential errors" would match better. The problem is
that we did not yet configure a database user for the console – so it simply
assumes the dbsnmp
user (which is somehow correct) without any password
(which should be not correct – that is, if your databases are shown fine,
you have a security problem!).
Normal Databases
To fix this, we go to the databases home page (e.g. select "Database-Instance" in the search forms select box, send the search request, and then click on the link with the database name). Once there, click on the "Monitoring Configuration" link (usually at the bottom of the page in the "Related Links" section), fill in the correct password. Click on the "Test Connection" button to make sure you put the right credentials, and click the "Next" button to apply the changes.
DataGuard Databases
If the database is part of a DataGuard configuration (i.e. a standby database
or the corresponding primary), you must configure a SYSDBA
user here, e.g.
SYS
– since otherwise a login attempt will simply result in an ORA-01033:
insufficient privileges
error (the database is not open but mounted when in
recovery state). Though this would not apply to the primary instance, on a
switchover the primary becomes the new standby, so you should be prepared. To
do so, first open the drop-down box stating "NORMAL", and select "SYSDBA". Now
you can edit the user field, and put e.g. "SYS" here. Add the correct password
to the password field, then click the "Test connection" and finally the "Next"
button as above.
Policies
When a policy is violated, GridControl will issue an alert. Well, that's what we expect. But what we certainly not expect are e.g. the following alerts:
- Open ports: 1521 (Oracle Listener), 3872 (Oracle Agent), 1159 (GridControl secure HTTP)
- Remote password file and external authenticated accounts in a DataGuard configuration (this is a requirement there!)
Just to mention some of them. So you probably want either to define some exceptions to those rules, or simply disable them completely. Both is possible with a few mouse clicks – you can do this either for each database separately, or by using templates. Here it is described on a per-host base:
On the GridControl home page you click on the link (number) following "critical" or "warning" in the "All targets - Alerts" section. On the following page, click the link (number) in the column "Number of Violations next to the policy you want to edit. On the next page, you'll find a single link in the section "Related Links" to edit that policy. To disable it completely, simply use the first drop-down box, set the value to "Disabled", and submit. For exception rules, click the "Add Object" button in the "Excluded Objects" section.
Candidates to turn off completely
Here are a few example candidates you may wish to turn off completely:
Policy | Description |
---|---|
Permissions for files in the Oracle Home directory | This will always find something – including the tnsnames.ora . It requires all directories and files in the $ORACLE_HOME to belong to the OS-user oracle and one of the oracle OS-groups, and (expect for the $ORACLE_HOME/bin directory) to have absolutely no permissions to somebody else (others). As for the tnsnames.ora this could be crucial for some applications. |
Use of SQL92 Security features | You rarely will need that. What stands behind it is simply to make queries like DELETE FROM table WHERE … fail when the executor has no SELECT permission. Hell, in that case it is very unlikely that the same executor should have DELETE permission! And if so, the DBA should ask what that should be for … |
Use of Non-Standard Initialization Parameters | On the repository database, this will also warn on the _b_tree_bitmap_plans=FALSE parameter, which Oracle explicitly lists as requirement. Recommendation: Make an exclude rule for this one. |
Installation von JAccelerator (NCOMP) | Not very descriptive help in no Oracle documentation I found; they simply argue "it's c00l and makes Java faster" (what we probably already guessed from the name) – but nowhere I found what that has to do with the database, whether it is a requirement or a nice-to-have, and to what Java code it applies after all. The Oracle Application Server Release Notes state, JAccelerator is not needed (literally: "installed") with OracleAS, and "You can ignore this error message." (i.e. the ORA-29558: JAccelerator (NCOMP) not installed.). Concluding from this, the database itself (which in this case holds the repository) does not need it. Googling for more, there are statements in the DBForum.COM that the JAccelerator is only needed to compile YOUR OWN STUFF, since all Java shipped with the database is already compiled natively. On Forums.Oracle.COM we find a short hint that it is "speeding up the Java execution in the database". Except for the first statement from the AS Release Notes, we cannot say anything about "official statements". But the conclusion seems to be: As long as YOU don't use YOUR OWN Java code in the database, you will not need JAccelerator and thus can simply disable this policy. |
Public Execute Privileges | There are several policies concerning execute privileges for the PUBLIC user: Execute Privileges on %1 To Public (with %1 being one of UTL_FILE , DBMS_LOB and DBMS_JOB ), and Restricted Privilege to Execute %1 (with %1 being one of UTL_TCP , UTL_SMTP and UTL_HTTP ). These alerts will come up if you did not change the defaults concerning these public privileges. While there is some sense behind these, you should not blindly trust those alerts and revoke the permissions – this may lead you in trouble (at least if you have Oracle Text installed, for example, which relies on some of these implicit grants and will be broken when you revoke them). If you are concerned about these permissions and think about revoking them, read Metalink note 247093.1 first and think twice. In many cases, you may simply prefer to disable these policies. |
OC4J Password indirection | If you don't know what OC4J needs passwords for, you probably can turn this policy off. If you are not sure, check those files mentioned (and Metalink note 312648.1" to know where to look). Probably this file still uses our well-known "scott/tiger" account we never created – so turn the policy off. But if you need it, follow the advices in the Metalink note just mentioned. |
Ports to exclude
For the Open Ports rule, you may want to define the following exceptions:
Port | Description |
---|---|
22 | SSH |
1521 | Oracle Listener |
3872 | EM Agent |
1159 | GridControl secure HTTP |
Well Known Accounts
Probably this comes up with the OUTLN
account. So all you have to do here is
to log in to the database as SYSDBA
user, and execute:
sqlplus> ALTER USER outln PASSWORD EXPIRE;
sqlplus> ALTER USER outln ACCOUNT LOCK;
Miscellaneous Rules and Metrics
There are some more metrics and rules you may wish to adjust. We cannot mention them all here (since at least some of them are very subjective), but a few common things should be listed at least:
Policy/Metric | Description |
---|---|
Current Open Cursors Count | The default of 1.200 set here will not be enough in most cases, so you probably always will have an alert for this. Do not confuse this value with the open cursors per session – we talk about open cursors in the entire database (a sum of all sessions). Depending on the session count, this number may be much higher. No idea how Oracle calculates the default value of 1.200 (simple calculation: The default "small" database is configured with 50 processes, default for open_cursors is 50 – which results in a maximum of 2.750 possible open cursors (processes * 1.1 * open_cursors) – so the alert is already triggered at <50% of a small databases maximum). In a Metalink forum I read a value of 6.000 is not unusual. So if you encounter alerts for this, changing the metrics treshold up to this should be no problem (probably even more is not, so you could take above calculation with your settings, and set the "warning" to 80% of that value). |
Tablespace Space Used (%) | Generally, the values here are OK – but if you are using the automatic undo management, having your undo tablespace properly sized but its datafiles with autoextend off , you will be annoyed by many alerts for it – since Oracle always keeps as much undo as possible. So better add a separate rule for your undo tablespace, setting its tresholds to 99%. |
Database Time Spent Waiting (%) | Here are some useless alerts for Wait Class Other: these are in 99% simple idle waits, nothing to worry about. So to prevent those "false alerts", since you cannot simply turn this section off, either increase the Warning Treshold from 30 to something >70, the Number of Occurences to something >10, or both. |
Permissions
One thing you will note very soon is a contradiction between the catalog scripts you run on installation and upgrade of a database – and the policies GridControl ships with: The catalog scripts will grant a bunch of permissions to public, which GridControl then will alert for. If you simply revoke them, this will break GridControl itself – amongst others: They rely on PUBLIC being granted those permissions. This is a known "feature" of Oracle for years. Since a solution is not to be expected that soon (there is a "feature request" pending for this, which may make it into one of the next releases), we have to find a work-around meanwhile.
The easiest one is to turn these policies off. But if we'd do that in all cases, we can save us from installing GridControl: There wouldn't be that much things left. So we better figure out how to solve the grants. Before touching these, make sure to read the Metalink article 247093.1, since this may have some bad side-effects!
So first we check who is relying on the grants alerted:
-- some (optional) output formatting COLUMN package FOR a10 COLUMN owner FOR a10 COLUMN objects FOR 999 COLUMN references FOR a10 COLUMN type FOR a10 SET feedback off SET verify off SELECT referenced_name package, owner, count(name) objects,referenced_owner references, referenced_type type FROM dba_dependencies WHERE referenced_owner IN ('SYS','PUBLIC') AND referenced_type IN ('PACKAGE','SYNONYM') AND referenced_name IN ('UTL_FILE','UTL_SMTP','UTL_TCP','UTL_HTTP','DBMS_JOB','DBMS_LOB') AND owner <> 'SYS' AND owner <> 'PUBLIC' GROUP BY referenced_name, owner, referenced_owner, referenced_type ORDER BY referenced_name, owner, referenced_owner, referenced_type; SET feedback on SET verify on |
The result probably looks like:
DBMS_JOB DBSNMP 1 PUBLIC SYNONYM
DBMS_JOB SYSMAN 5 PUBLIC SYNONYM
UTL_FILE ORACLE_OCM 1 PUBLIC SYNONYM
UTL_FILE SYSMAN 1 PUBLIC SYNONYM
Which means: We have to grant the execute permissions on DBMS_JOB to DBSNMP and SYSMAN, plus execute on UTL_FILE to SYSMAN and ORACLE_OCM before revoking them from PUBLIC (please adjust this according to your real output!). And it means, the other packages are not affected, and we can safely revoke the execute permissions on them from PUBLIC:
-- Grant necessary permissions explicitly GRANT EXECUTE ON dbms_job TO DBSNMP; GRANT EXECUTE ON dbms_job TO SYSMAN; GRANT EXECUTE ON utl_file TO ORACLE_OCM; GRANT EXECUTE ON utl_file TO SYSMAN; -- Remove unsafe permissions from public REVOKE EXECUTE ON dbms_job FROM PUBLIC; REVOKE EXECUTE ON dbms_lob FROM PUBLIC; REVOKE EXECUTE ON utl_tcp FROM PUBLIC; REVOKE EXECUTE ON utl_smtp FROM PUBLIC; REVOKE EXECUTE ON utl_http FROM PUBLIC; REVOKE EXECUTE ON utl_file FROM PUBLIC; -- "Repair" packages we possibly broke temporarily @?/rdbms/admin/utlrp.sql |
The last command executes a script checking for invalid objects allover the database and trying to "repair" them. Maybe you have to run this script again a while later …
Warning: Due to complex nestings, revoking above
mentioned privileges may have serious side-effects. In some cases, even many
objects of the SYS
user have been invalidated on the following day (again and
again, even after fixing this with utlrp.sql
) – though the objects we revoked
PUBLIC access from belong to this user himself! Another troublemaker is
CTXSYS
: Here we had cases where some packages did not even re-compile after
granting all the permissions again. This may force you to re-run the entire
catalog process in order to re-validate the objects.
So if following this procedure, you find some objects popping up as invalid
(most likely package bodies), and cannot validate them again using the
@?/rdbms/admin/utlrp.sql
script, and even an explicit ALTER PACKAGE
<package_name> compile
fails (most likely with a message like "malformed or
corrupted wrapped unit"), here's what you can do: Check for the broken objects
create script (in most cases this is a *.plb
file) and run this in the
users context to re-create the object. This solved the issue here in most
cases.
And here comes an example:
COLUMN owner FOR a15 COLUMN object_name FOR a30 COLUMN status FOR a10 COLUMN object_type FOR a15 SELECT owner,object_name,status,object_type FROM dba_objects WHERE status<>'VALID'; |
This could show up:
CTXSYS DRVDOC INVALID PACKAGE BODY
Usually, you would try to ALTER PACKAGE ctxsys.drvdoc COMPILE BODY
– but in
our case, this failed ("compilation errors"). The same for
@?/rdbms/admin/utlrp.sql
– the object remains unimpressed. By the "owner" we
know this object belongs to CTX – so we will find the associated CREATE
Statement in $ORACLE_HOME/ctx/admin/*
: grep -i "package body drvdoc"
$ORACLE_HOME/ctx/admin/*
points us to the file drvdoc.plb
. So at the SQL
prompt, we bravely type:
ALTER SESSION SET current_schema ctx;
@?/ctx/admin/drvdoc.plb
And we will see the object valid again. At least in most cases.
Policy Templates
It might be quite annoying to repeat each of this step for each database/host again and again – but we don't have to. Once we fugured out which rules we need to adjust, we can create a template for this to apply to a list of targets. To do so, first click on the Setup link on the top of the tabs in the upper left of the page. On the next page, select Monitoring Templates from the link-list on the left of the page. The table in the center of the next page is probably still empty – so click on the Create button to create a new template. Now select a target you already configured accordingly, and you can chose the policies which need to be modified to be part of the template. You also can modify the rules in this template to be more generic, or whatever you need.
Once you've setup one or more templates, you can apply them to one or multiple targets using the Apply button and selecting the target(s).
Troubleshooting
There are several common problems arising during or shortly after the installation of GridControl. Here we will list some of them, and how they may be solved.
Installation and Konfiguration
Unknown target: generic_mom_managed_host
Problem: The management home page lists an unknown target called generic_mom_managed_host, the status is pending.
Cause: Well, they really install everything … MOM stands for Microsoft Operations Management – and you probably don't use that stuff. This connector enables OEM to receive alerts from MOM. I.e., if you don't have MOM running somewhere, the connector cannot connect.
Solution: What you don't need you can simply kick off:
- Click on the "unknown" link
- Click on "generic_mom_managed_host"
- Click on "Connector Setup"
- Select "Microsoft Operations Manager Connector"
- Click on "Delete"
- Click on "Yes"
Error on pulling the 10.2.0.4 agent
Problem: You want to install the agent on some remote host using the
Pull method. After downloading the agentDownload.*
script
and executing it, you simply get an error 404: Not Found.
Cause: Looking into the <OMS Home>/sysman/agent_download/10.2.0.4
directory you will notice that the files there are indeed missing – so the
script is not able to download them. Though you applied the patches to both OMS
and the agent, it only added an empty directory here with the script plus one
text file.
Solution: If you ran into this problem, you missed the paragraph on an additional download here, so please go back and check ;)
It looks like Oracle wants us to roll out the old 10.2.0.2
agent and then patch it on every machine - not a very practical way! Or, as the
doc/ReleaseNotes_EMGC_10.2.0.4.html
file mentions in the chapter
Post Installation Tasks, you overlooked to download the Clone
Support Files from Metalink. Nice that they mention this – but they don't tell
where to find them. Searching for a place on Metalink to download them from
only returns suitable results, if you know exactly where and how to search:
Tab "Patches and Updates", follow the link *"Advanced Search" here. Chose Enterprise Manager Grid Control (emgrid) as "Product", set "Patch Type" to Any, enter CLONE SUPPORT FILES FOR% in the "Description" field. Now also set "Priority" to Any, and submit the form – to find out that the only available patch for this issue applies to the cloning of Oracle 9i databases only – so if you (no longer) have such, you don't need this patch …
Re-Install an Agent
Problem: You have an agent installed on some machine, but want to replace it.
Cause: Yes, there could be some: e.g. you want to replace an older version of the agent with a newer one, but don't want to patch it, or the prior installation broke or failed, or you put it into the wrong directory (according to the local policies) …
Solution: Do NOT try to simply install the new agent on top or next to the existing one!!! This will either fail if using the same directory, or you will run into trouble with alerts like Number of Duplicate Targets exceeded the critical threshold. Instead, follow below rules:
- Make sure you removed all targets (which watched by the affected agent) from the OMS repository
- Make sure the (old) agent is stopped
- If the agent is still reported "Down" in GridControl even after waiting
for a reasonal time (i.e. up to 15 minutes), and you find no way
to remove it there: Login to the repository database as
SYSMAN
, and runexec mgmt_admin.cleanup_agent('<myserver>:<port>')
(see Metalink Note 454081.1). Admitted: This is the "Die Hard" way finally getting rid of it ;) - On the target host, login as
oracle
, change to the directory$AGENT_HOME/oui/bin
and execute therunInstaller
script. Delete the current agent using the installer. - Now you can install the new version, as described before
- Grrr … and now we have to re-configure the "new" targets again! With some luck, you created templates before, which you now only have to apply …
Alerts and Policies
Grid reports the Listener to be down - but it is running
Problem: Sometimes GridControl reports your Listener is down. Shocked as
you are, you log in to the machine and check with lsnrctl status
–
and this command reports: Everything fine, I'm up and running, and I even know
all databases etc. I should know.
Cause: Who knows? It was reported OK by GridControl immediately after installation. Then you possibly restarted either the listener, the OMS (e.g. for patching) or both - and the problem was there.
Solution: There may be some. If you search on Metalink, you'll find some SRs on this issue – but none solved. Check Google, and you'll find some more. Advices vary and even include manual changes to configuration files. Before you fiddle around with those, try a simple approach:
In GridControl, select "Agents" from the drop-down search list and submit. Then click on the associated agent and list its targets. Search for the listener, and remove it. Manually add it again with the correct data – and it will probably work.
PLS-00201: identifier 'MGMT_CNTR_TT.CLEAN_UP_OLD_TICKET_RECS' must be declared
Problem: Soon after applying the patch, above error message appears in the alert log – execution of some job failed due to this.
Cause: The patch submitted the job using the wrong user (SYS
instead of
SYSMAN
). But even with the right owner, it raises another error (see
Metalink Note
416352.1" for details).
Solution: Above note writes about a couple of bugs filed for this issue with GridControl 10.2.0.3 already – obviously they are still not fixed with 10.2.0.4. The workaround recommended by above note is to simply remove the job using
SQL> exec dbms_job.remove(<job_id>)
Alright: a non-existing job can no longer fail …
Alerts: Warning and errors because a Standby Database is not open
Problem: There is a bunch of warnings and errors for a standby database, tracing back to (or even explicitely mentioning) an ORA-01219: database not open: queries allowed on fixed tables/views only error.
Cause: Bug in the agent (see e.g. MetaLink note 403223.1), which should have been fixed for 10.2.0.4 according to that note – but still is not.
Solution: Good question. There is a patch available on MetaLink (see bug 6442193 and/or search for the patch with the same number) to be applied on top of the 10.2.0.4 agent. But even after doing so (plus removing and re-discovering the standby database as mentioned in above note), the errors persist.
After not receiving any usable answer from Metalink for about 2 months, I found the solution as a side-effect from another task: If you enabled the Data Guard Broker, these errors disappear! To do so:
- Open the GridControl web page
- Click on the "Targets" tab
- In the sub-navigation, select "Databases"
- Click on the primary instance belonging to the standby in question
- In the section "High Availability", click on the link Primary
- If asked for login credentials, provide the necessary data for a
SYSDBA
account (e.g.SYS
) - In the "Overview" section, click on the link Use Data Guard Broker
- Follow the instructions to setup Data Guard
As a side-effect, doing a SwitchOver or FailOver now is only a matter of a few mouse clicks, since the Data Guard Broker cares for everything else …
Even though Metalink claims they have a patch solving this issue (Patch #6442193 for agent 10.2.0.4 - once applied, you should remove and rediscover the standby database in Grid Control in order to get the existing alerts cleared), this patch did not do anything for me – I even got more ORA-01219 errors reported. As far as I can tell, configuring the Data Guard Broker seems to be the only almost-solution for now. It still leaves the two warnings about the database being MOUNTED (instead of open) – and all the other alerts maybe re-appear after a restart …
Trouble with the Open Ports policy
Problem: Initially there where 5 open ports reported. Since those should be open (usually, it's the Oracle ports like listener, agent, etc. which are needed to be open), they were added to the exclusion list. Now there are no more open ports reported – but there are for sure more ports open on that host!
Cause: By default, this policy has set a maximum of 5 entries, including the exception list – so it will never report more than 5 ports. If you already have 5 entries in the exclusion list, that's it – you will never get an open ports alert anymore, regardless what.
Solution: If you need this policy to report more, you must increase the maximum allowed entries for it. This can be done as follows:
Click on the "Jobs" tab. From the "Create Job" drop-down, select Security Policy Configuration. Click the "Go" button. Now give the job some name (and optionally description). Leave "Target Type" to Host, and use the "Add" button to add the target hosts where you want to change the policy. Click the "Parameters" tab then. In the "Security Policy" drop-down, select Open Ports, and in the "Max Row Count" input put the number of ports you want to have reported (put All to remove the limit completely). Next, select the "Credentials" tab. If you did not yet setup preferred credentials for the hosts you want to change that policy on, check the "Override Preferred Credentials" radio button, and enter a valid user/password combination. Note: It will NOT work to use the root login and sudo to oracle (the job will fail, claiming there's no sudo available even if it is) – but the job will succeed with the root user alone (without sudo to anything). Finally, press the "Submit" button.
Once the job was successfully executed on the target hosts, you will get more open ports alerted as soon as the policy is evaluated next on these.
Target Alert: Error - Missing Properties
Problem: In the target alerts, we find one or more errors reported with the message Missing Properties: […]. The error type is Collection Failure.
Cause: The dynamic properties are not being computed correctly (see Metalink Bug 5283003).
Solution: Metalink offers a solution in note 422847.1. In short: On the target host, issue the command
$AGENT_HOME/bin/emctl reload agent dynamicproperties <target>:<type>
$AGENT_HOME/bin/emctl clearstate agent
$AGENT_HOME/bin/emctl upload
(<target>
is the name of the affected target, as reported in the target
column of the error report; <type>
is the target type, e.g. host for a
host, or oracle_database for an oracle database). According to the note, this
should clear the alert. Then give it 5-10 minutes, and check the alert list –
these alerts should have been cleared. For us, it helped on the second try; so
you may need to give it another chance …
Massive collection errors with ORA-00600
Problem: In the target alerts, you suddenly find a bunch of collection errors, the details show some ORA-00600 [17059] on the repository database. About one per minute, throwing partially huge trace files.
Cause: Your repository database is probably Oracle 10.2.0.3 on Windows or Linux, and you hit the bug 5705795.
Solution: Congratulations! You just won the lottery for another patch set. Please confirm with Metalink support first (file a Service Request) just to make sure that this fix applies to your environment. If so, they will tell you to download patch #5705795 and apply it against your repository database (yepp – this time we do not patch the Grid software, but the database software).
Remark: Despite of the patch set #5705795 being applied, the error may still persist. In this case, it should help to upgrade the database to version 10.2.0.4 – but if that's the permanent solution, I cannot (yet) tell …
Alerts: "User SYS logged on from …"
Problem: If you want to monitor a dataguard environment, you probably
configured these databases to use the SYS
user. Now you have a permanent
alert in warnings for these database, telling you that SYS
connected from the
database host – no wonder, because this is the agent itself! So what to do
about this?
- You will need a user with
SYSDBA
privileges to monitor a standby database - Granting
SYSDBA
,SELECT ANY DICTIONARY
andOEM_MONITOR
toSYSTEM
still gives the error message "User system does not have privileges to monitor this database." - Metalink note 311684.1 mentions you have to grant the
OEM_MONITOR
role via the EM database configuration tool, not via SQL*Plus – but doing so changes nothing - Disabling the audit solves the issue with the alert – but certainly is not the preferred solution.
I opened a Service Request at Metalink
for this issue. They point to the $ORACLE_HOME/rdbms/admin/catsnmp.sql
file to check the permissions given to the DBSNMP
monitor user (see also
Metalink Note 315340.1), which is basically what is described above
(plus GRANT EXECUTE ON DBMS_SERVER_ALERT
directly to the user – which
does not change anything).
Cause: As you may already have guessed, the agent simply reads all sessions
from v$sessions
. Since it usually holds more than one connection to
the database, it does not even try to exclude itself from the result set.
Solution: There will probably never be a fix from Oracle for this issue. Quoting the Metalink co-ordinator for the responsible team: "This is a configuration issue. no statement need to be changed in the xml/xmlp file. Just disable this policy on the databases that you would like to monitor under the SYS account." So we have to care ourselves: The following work-around solves the issue – but note that this is not officially supported by Oracle:
- on the affected machine, open the file
$AGENT_HOME/sysman/admin/metadata/instance.xmlp
with your favorite editor - search for the term "v$session", which should bring you to a line reading
from v$session where type <> 'BACKGROUND' and username is not null )
(scrolling a few lines up, you will see this is "category: User Audit", exactly what we need to fix) - remove the closing
)
from that line, and insert a new line immediately following this line:
and not (program like 'emagent@%' and machine=(select host_name from v$instance)) )
- in Grid Control, edit the metric for the user audit, and remove/replace the "SYS" account so the event gets cleared
- restart the agent on the affected machine (
$AGENT_HOME/bin/emctl stop agent && $AGENT_HOME/bin/emctl start agent
) - edit the metric again, and place the SYS user back
Repeat these steps for all affected hosts, and the agents should no longer alert themselves.