ORACLE RAC QUESTIONS

What is a Cluster?
In the database storage sense, a cluster is an optional method of storing table data: a group of tables that share the same data blocks, grouped together because they share common columns and are often used together.
In the hardware sense (the one relevant to RAC), a cluster is two or more computers that share resources and work together to form a larger logical computing unit. RAC and Oracle Parallel Server can be used to access Oracle from multiple nodes of a clustered system.
What is Oracle Clusterware?
Oracle Clusterware is software that enables servers to operate together as if they were one server. Each server looks like any standalone server; however, each server runs additional processes that communicate with each other, so the separate servers appear as one server to applications and end users.
Why RAC? What are the benefits of using RAC?
The benefits of Real Application Clusters:
- Ability to spread CPU load across multiple servers
- Continuous Availability / High Availability (HA)
  - Protection from single instance failures
  - Protection from single server failures
- RAC can take advantage of larger aggregate SGA sizes than can be accommodated by a single-instance commodity server.
What is the startup sequence in RAC?
init spawns init.ohasd (with respawn), which in turn starts OHASD (the Oracle High Availability Services Daemon). This daemon spawns 4 agent processes. The entire Oracle Clusterware stack and the services registered on the cluster come up automatically when a node reboots or when the cluster stack is started manually. The startup process is segregated into five (05) levels; at each level, different processes are started in sequence.
From 11g onwards the voting disk and OCR can be stored in ASM. At first sight this looks like a chicken-and-egg problem: the voting disk and OCR are the primary components required to start the clusterware, the clusterware then starts resources such as ASM, the listener and the database, yet the voting disk and OCR themselves live inside ASM and clusterware startup requires access to them. So what starts first, ASM or the clusterware? The sequence below resolves this.
1.
When a node of an Oracle Clusterware cluster start/restarts, OHASD is started by
platform-specific means. OHASD is
the root for bringing up Oracle Clusterware. OHASD has access to the OLR (Oracle Local Registry) stored on the local file
system. OLR provides
needed data to complete OHASD initialization. 2. OHASD brings up GPNPD and CSSD. CSSD has access to the GPNP Profile stored on the local file system. This profile
contains the following vital bootstrap data;
a. ASM Diskgroup Discovery String
b. ASM SPFILE location
(Diskgroup name)
c. Name of the ASM Diskgroup containing
the Voting Files
3. The Voting Files locations on ASM Disks are accessed by CSSD with well-known pointers in
the ASM Disk headers
and CSSD is able to
complete initialization and start or join an existing cluster.
4. OHASD starts
an ASM instance
and ASM can now
operate with CSSD initialized
and operating. The ASM instance uses special code to locate the contents of the
ASM SPFILE, assuming it is
stored in a Diskgroup.
5. With an ASM instance operating and its
Diskgroups mounted, access to Clusterware’s OCR is available to CRSD.
6. OHASD starts CRSD with access to the OCR in an ASM Diskgroup.
7. Clusterware completes
initialization and brings up other services under its control.
When the Clusterware starts, three files are involved:
1. OLR - The first file to be read and opened. It is local to the node and contains information about where the voting disk is stored and the information needed to start ASM (e.g. the ASM discovery string).
2. VOTING DISK - The second file to be opened and read; this depends only on the OLR being accessible. ASM starts after CSSD; ASM does not start if CSSD is offline (i.e. the voting file is missing).
How are voting disks stored in ASM? Voting disks are placed directly on ASM disks. Oracle Clusterware stores the voting files on the disks of the disk group chosen to hold them, but it does not rely on a running ASM instance (or a mounted diskgroup) to read and write them; it accesses the ASM disks directly. You can check for the existence of voting files on an ASM disk using the VOTING_FILE column of V$ASM_DISK. The fact that voting files do not depend on a diskgroup to be accessed does not mean the diskgroup is not needed; the diskgroup and the voting files are linked by their configuration.
3. OCR - Finally, the ASM instance starts and mounts all diskgroups, then the Clusterware daemon (CRSD) opens and reads the OCR, which is stored in a diskgroup. Once ASM is started, ASM itself does not depend on the OCR or OLR being online; ASM depends on CSSD (the voting disk) being online. There is an exclusive mode to start ASM without CSSD, but it is only for restoring the OCR or voting disk.
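A quick way to observe this layered startup on a running node is with the standard clusterware utilities (the output will of course vary by environment):
$ crsctl check crs --> status of OHASD, CRSD, CSSD and EVMD
$ crsctl stat res -t -init --> lower-stack resources such as ora.asm, ora.crsd, ora.cssd, ora.ctssd
$ cat /etc/oracle/olr.loc --> location of the OLR read by OHASD
$ crsctl query css votedisk --> voting file locations inside ASM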
What is VIP in RAC and what is VIP used for?
A virtual IP address, or VIP, is an alternate IP address that client connections use instead of the standard public IP address. To configure a VIP address, we need to reserve a spare IP address for each node, and the IP addresses must use the same subnet as the public network. Every node in an Oracle RAC cluster has an IP address and hostname managed by the operating system of that node. From Oracle 10g onwards, the listener is configured on the virtual IP. Using a virtual IP saves us from TCP/IP timeout problems, because Oracle Notification Service (ONS) maintains communication between the nodes and listeners. Once ONS finds any listener or node down, it notifies the other nodes and listeners. While a new connection is trying to reach the failed node or listener, the virtual IP of the failed node is automatically moved to a surviving node, and the session is established on another surviving node. This process does not wait for a TCP/IP timeout, so new connections get faster session establishment on the surviving nodes/listeners.
$ srvctl config vip -node NODE_NAME
# srvctl config nodeapps -n node_name
How to remove the VIP:
srvctl remove vip -i "vip_name_list" [-f] [-y] [-v]
How to start/stop/check the status of the VIP:
srvctl start/stop/status vip {-n node_name | -i vip_name} [-v]
How to enable/disable/show the configuration of the VIP:
srvctl enable/disable vip -i vip_name [-v]
srvctl config vip {-n node_name | -i vip_name}
Sometimes the VIP may not fail back to its original node. In that case we can use the command below to fail back the failed-over VIP (run on the destination node):
./crs_relocate vip_resource_name (the VIP will now go back to where it is configured to be)
What is SCAN
and SCAN listener ? Single Client Access Name (SCAN)
is a feature used in Oracle Real Application Cluster environments that provides
a single name for clients to access any Oracle Database running in a cluster.
You can think of SCAN as a cluster alias for the databases in the cluster. The benefit is that the client's connect information does not need to change if you add or remove nodes or databases in the cluster. During
Oracle Grid Infrastructure installation, SCAN listeners are created for as many
IP addresses as there are SCAN VIP addresses assigned to resolve to the SCAN.Oracle
recommends that the SCAN resolves to three VIP addresses, to provide high
availability and scalability. If the SCAN resolves to three addresses, then
three SCAN VIPs and three SCAN listeners are created. Each SCAN listener
depends on its corresponding SCAN VIP. The SCAN listeners cannot start until
the SCAN VIP is available on a node.
The addresses for the SCAN listeners
resolve either through an external domain name service (DNS), or through the Grid
Naming Service (GNS) within the cluster. SCAN listeners and SCAN VIPs can run
on any node in the cluster. If a node where a SCAN VIP is running fails, then
the SCAN VIP and its associated listener fails over to another node in the
cluster. If the number of available nodes within the cluster falls to less than
three, then one server hosts two SCAN VIPs and SCAN listeners. The SCAN
listener also supports HTTP protocol for communication with Oracle XML Database
(XDB). SCAN VIP is one of the resources you find
in the output of “crsctl status resource –t” command. Number of SCAN VIP’s you
notice will be the same as the number of SCAN LISTENERS in the setup.
SCAN VIP’s are
physical IP addresses that you allocate to SCAN listeners. In the example that
I use later in this blog, 192.168.122.5, 192.168.122.6, 192.168.122.7 are SCAN
VIP’s. If you identify that SCAN VIP’s are online in the output of “crsctl
status resource –t” command then IP addresses are online on the physical
network ports. Only when SCAN VIP’s are online we can start the SCAN listeners.
SCAN Listener is
the oracle component which starts running a service on the port (by default its
1521) using the SCAN VIP (IP address). So SCAN listener doesn’t start if SCAN
VIP is not online. This is the major difference between a SCAN listener and
SCAN VIP. The number of SCAN listeners you notice in the output will be the same as the number of SCAN VIPs that are ONLINE. The name given to the SCAN listener is referred to as the SCAN NAME and it is registered in the DNS server. In the example that follows, the SCAN name is "SCAN_LISTENER".
0 sec: When User1 tries to establish a session on the database with connection request C1, it hits the DNS server first. The DNS server resolves the name "SCAN_LISTENER" to the first IP, 192.168.122.5.
1. The C1 request reaches the first SCAN listener, SCAN1 (by default named "LISTENER_SCAN1"), which is running on the 192.168.122.5 SCAN VIP.
2. SCAN1, using details from the load balancing advisory (LBA), identifies the load on each node in the setup and routes the C1 request to the node with the least load.
3. In this case that happened to be NODE 2, and the C1 request is handled by the local listener on that node, which helps C1 establish a session on the instance on NODE 2.
5th sec: When User2 tries to establish a session on the database with connection request C2, it hits the DNS server first.
4. The DNS server now uses its round-robin algorithm and resolves the name "SCAN_LISTENER" to the second IP, 192.168.122.6.
5. The C2 request reaches the second SCAN listener, SCAN2 (by default named "LISTENER_SCAN2"), which is running on the 192.168.122.6 SCAN VIP.
6. SCAN2, using details from the LBA, identifies the load on each node in the setup and routes the C2 request to the node with the least load.
7. In this case that happened to be NODE 1, and the C2 request is handled by the local listener on that node, which helps C2 establish a session on the instance on NODE 1.
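To see how SCAN is actually configured in a given cluster, the standard srvctl commands below can be used; the tnsnames.ora entry is only an illustrative sketch (the SCAN name, port and service name are assumptions, not taken from a real setup):
$ srvctl config scan
$ srvctl config scan_listener
$ srvctl status scan_listener
# illustrative client tnsnames.ora entry pointing at the SCAN
PROD =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = SCAN_LISTENER)(PORT = 1521))
    (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = PROD))
  )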
What is HAIP in RAC ?
Oracle 11gR2 introduced the RAC Highly Available IP (HAIP) for the cluster interconnect to help eliminate a single point of failure. If a node in the cluster has only one network adapter for the private network and that adapter fails, the node will no longer be able to participate in cluster operations; it will not be able to perform its heartbeat with the cluster, and eventually the other nodes will evict it from the cluster. Similarly, if the cluster has only a single network switch for the interconnect and the switch fails, the entire cluster is compromised. The purpose of HAIP is to perform load balancing across all active interconnect interfaces and to fail over non-responsive interfaces to the available ones. HAIP can activate a maximum of four private interconnect connections. These private network adapters can be configured during the installation of Oracle Grid Infrastructure or after the installation using the oifcfg utility.
[oracle@host01
bin]$ ./oifcfg getif
eth0 192.168.56.0
global public
eth1 192.168.10.0
global cluster_interconnect
[oracle@host01
bin]$ ./crsctl stat res -t -init
ora.asm
ora.cluster_interconnect.haip
oracle@host01
bin]$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 08:00:27:98:EA:FE
inet addr:192.168.56.71 Bcast:192.168.56.255 Mask:255.255.255.0
eth1 Link encap:Ethernet HWaddr 08:00:27:54:73:8F
inet addr:192.168.10.1 Bcast:192.168.10.255 Mask:255.255.255.0
inet6 addr:
fe80::a00:27ff:fe54:738f/64 Scope:Link
eth1:1 Link encap:Ethernet HWaddr 08:00:27:54:73:8F
inet addr:169.254.225.190 Bcast:169.254.255.255 Mask:255.255.0.0
The entry for eth1 with IP address
192.168.10.1 is the way the NIC was configured on this system for the private
network. Notice the device listed as eth1:1 in the output above. It has been
given the 169.254.225.190 IP address.
Device
eth1:1 is RAC HAIP in action even though only one private network adapter
exists. HAIP uses the 169.254.*.* subnet. As such, no other network devices in
the cluster should be configured for the same subnet.
When
Grid Infrastructure is stopped, the ifconfig command will no longer show the
eth1:1 device. The gv$cluster_interconnects view shows the HAIP subnets for
each instance.
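As mentioned above, the HAIP addresses can be seen per instance from SQL; this is a plain query against the documented view:
SQL> select inst_id, name, ip_address, is_public from gv$cluster_interconnects order by inst_id;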
Why node eviction? OR Why split brain syndrome?
Node eviction/reboot is used for I/O fencing, to ensure that writes from I/O-capable clients can be cleared, avoiding potential corruption scenarios in the event of a network split, node hang, or some other fatal event in a clustered environment.
By definition, I/O fencing (a cluster industry technique) is the isolation of a malfunctioning node from a cluster's shared storage to protect the integrity of the data.
Who evicts/reboots the node?
The daemons for Oracle Clusterware (CRS) are started by init when the machine boots, viz. CRSD, OCSSD, EVMD, OPROCD (when vendor clusterware is absent), and OCLSOMON.
There are three fatal processes, i.e. processes whose abnormal halt or kill will provoke a node reboot:
1. ocssd.bin (runs as oracle)
2. oclsomon.bin (monitors OCSSD; runs as root)
3. oprocd.bin (I/O fencing in non-vendor clusterware environments; runs as root)
Other non-CRS processes capable of evicting:
- OCFS2 (if used)
- Vendor clusterware (if used)
- Operating system (panic)
4 Reasons for Node Reboot or Node Eviction in a Real Application Clusters (RAC) Environment
1. High load on the database server: In my experience, in roughly 70 to 80 out of 100 cases, high load on the system was the reason for the node eviction. One common scenario is that, under high load, the RAM and swap space of the DB node get exhausted, the system stops responding and finally reboots. So, every time you see a node eviction, start the investigation with /var/log/messages and analyze the OS Watcher logs (a few helper commands are listed after this section).
2. Voting disk not reachable: Another reason for a node reboot is that the clusterware is not able to access a minimum number of the voting files. When the node aborts for this reason, the node alert log will show the CRS-1606 error.
3. Missed network connection between nodes: In technical terms this is called a missed network heartbeat (NHB). Whenever there is a communication gap or no communication between nodes on the private network (interconnect), due to a network outage or some other reason, a node aborts itself to avoid a "split brain" situation. The most common (but not exclusive) cause of missed NHBs is network problems communicating over the private interconnect.
4. Database or ASM instance hang: Sometimes a database or ASM instance hang can cause a node reboot. In these cases the database instance hangs and is terminated afterwards, which causes either a cluster reboot or a node eviction. The DBA should check the alert logs of the database and ASM instances for any hang situation that might cause this issue.
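As referenced in reason 1 above, a few standard checks help when investigating an eviction (the log paths shown are the typical 11gR2 locations and may differ in your environment):
$ crsctl get css misscount --> network heartbeat timeout, in seconds
$ crsctl get css disktimeout --> voting disk I/O timeout, in seconds
Also review /var/log/messages, $GRID_HOME/log/<hostname>/alert<hostname>.log and $GRID_HOME/log/<hostname>/cssd/ocssd.log around the eviction time.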
SPLIT BRAIN SYNDROME
In an Oracle RAC environment all the instances/servers communicate with each other using high-speed interconnects on the private network. This private network interface, or interconnect, is redundant and is used only for inter-instance Oracle data block transfers. In the context of Oracle RAC, split brain occurs when the instance members in a RAC fail to ping/connect to each other via this private interconnect, but the servers are all physically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. Because of the lack of communication, each instance thinks that the other instance it cannot reach is down, and it needs to do something about the situation. The problem is that if we leave these instances running, the same block might get read and updated in these individual instances and there would be a data integrity issue, because the blocks changed in one instance would not be locked and could be overwritten by another instance.
Oracle has implemented checks for the split brain syndrome.
In RAC, if any node becomes inactive, or if other nodes are unable to ping/connect to a node in the RAC, then the node which first detects that one of the nodes is not accessible will evict that node from the RAC group. For example, if there are 4 nodes in a RAC cluster and node 3 becomes unavailable, and node 1 tries to connect to node 3 and finds it not responding, then node 1 will evict node 3 out of the RAC group and will leave only node 1, node 2 and node 4 in the RAC group to continue functioning. The split brain handling can become more complicated in large RAC setups. For example, suppose there are 10 RAC nodes in a cluster and 4 nodes are not able to communicate with the other 6, so 2 groups are formed in this 10-node RAC cluster (one group of 4 nodes and the other of 6 nodes). The nodes will quickly try to affirm their membership by locking the control file; the node that locks the control file will check the votes of the other nodes. The group with the larger number of active nodes gets the preference and the others are evicted. In practice I have usually seen this eviction issue with only one node getting evicted while the rest continue to function fine, so I cannot fully confirm the above from experience, but this is the theory behind it.
When we see that the node is evicted,
usually oracle rac will reboot that node and try to do a cluster
reconfiguration to include back the evicted node.You will see oracle error:
ORA-29740, when there is a node eviction in RAC. There are many reasons for a
node eviction like heart beat not received by the controlfile, unable to
communicate with the clusterware etc.
Why is an odd number of voting disks recommended for the cluster?
An odd number of voting disks is required for a proper clusterware configuration. A node must be able to access strictly more than half of the voting disks at any time. So, in order to tolerate the failure of n voting disks, there must be at least 2n+1 configured (n=1 means 3 voting disks). You can configure up to 31 voting disks, providing protection against 15 simultaneous disk failures. If you lose half or more of all of your voting disks, then nodes get evicted from the cluster, or nodes kick themselves out of the cluster; it does not threaten database corruption. Alternatively, you can use external redundancy, which means you are providing redundancy at the storage level using RAID.
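To list the voting files currently configured and confirm that the count is odd:
$ crsctl query css votedisk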
How to identify the master node for the cluster?
There are three ways we can find out the master node for the cluster:
1. Check on which node the OCR backups are happening.
2. Scan the ocssd logs on all the nodes.
3. Scan the crsd logs on all the nodes.
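For example (the exact log strings differ between versions, so treat the grep patterns as approximations rather than guaranteed output):
# ocrconfig -showbackup --> the node shown for the automatic backups is the current master
$ grep -i "master node" $GRID_HOME/log/<hostname>/cssd/ocssd.log
$ grep -i "ocr master" $GRID_HOME/log/<hostname>/crsd/crsd.log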
What are the RAC database background processes?
1) LMS: This background process copies read-consistent blocks from the holding instance's buffer cache to the requesting instance. LMSn also performs rollback on uncommitted transactions for blocks that are being requested for consistent read by another instance. This is the Cache Fusion part and the most active process; it handles the consistent copies of blocks that are transferred between instances. It receives requests from LMD to perform lock requests. There can be up to ten LMS processes running, and they can be started dynamically if demand requires it. They manage lock manager service requests for GCS resources and send them to a service queue to be handled by the LMSn process. LMS also handles global deadlock detection and monitors for lock conversion timeouts; as a performance gain you can increase this process's priority to make sure CPU starvation does not occur. This background process is also called Global Cache Services (GCS), which is the name you often see in wait events. The LMS process maintains records of the data file statuses and of each cached block by recording information in the Global Resource Directory (GRD). The LMS process also controls the flow of messages to remote instances, manages global data block access, and transmits block images between the buffer caches of different instances. This processing is part of the Cache Fusion feature. By default 2 LMS background processes are started.
2) LMON: LMON is responsible for monitoring all instances in a cluster to detect failed instances. The LMON process monitors global enqueues and resources across the cluster and performs global enqueue recovery operations. It manages instance deaths and performs recovery for the Global Cache Service (LMS). Joining and leaving instances are managed by LMON. LMON also manages all the global resources in the RAC database and registers the instance/database with the node-monitoring part of the cluster (CSSD). This background process is also called the Global Enqueue Monitor; the services LMON provides are also referred to as Cluster Group Services (CGS).
3) LCK: The Lock Process (LCK) manages local non-cache requests (row cache requests, library cache locks) and also manages shared resource requests across instances. It keeps a list of invalid and valid lock elements and, if needed, passes information to the GCS. LCK manages non-Cache Fusion resource requests such as library and row cache requests and lock requests that are local to the server, as well as instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery. Because the LMS process handles the primary function of lock management, only a single LCK process exists in each instance; there is only one LCK process per instance in RAC.
4) DIAG-- It regularly monitors the health of the instance.
Checks for instance hangs and deadlocks.
It captures diagnostic data for instance and process failures.
5) LMD: This background process manages access to blocks and global enqueues. Global deadlock detection and remote resource requests are also handled by LMD. LMD also manages lock requests for GCS/LMS. This background process is also called the Global Enqueue Service Daemon; in wait events you will see GES.
6) RMSn: These are the Oracle RAC Management processes. They perform manageability tasks for Oracle RAC, including the creation of resources related to Oracle RAC when new instances are added to the cluster.
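These processes are visible at the OS level; the instance name PROD1 below is just a placeholder:
$ ps -ef | egrep "ora_(lmon|lms|lmd|lck|diag)" | grep PROD1
Expect process names such as ora_lmon_PROD1, ora_lms0_PROD1, ora_lmd0_PROD1, ora_lck0_PROD1 and ora_diag_PROD1.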
Q. What is the private interconnect?
The private interconnect is the physical construct that allows inter-node communication. It can be a simple crossover cable with UDP, or it can be a proprietary interconnect with a specialized proprietary communications protocol. When setting up more than 2 nodes, a switch is usually needed. It provides the maximum performance for RAC, which relies on inter-process communication between the instances for the Cache Fusion implementation.
Using the dynamic view gv$cluster_interconnects:
SQL> select * from gv$cluster_interconnects;
Using the clusterware command oifcfg:
$ oifcfg getif
What is Cache Fusion in RAC?
With Cache Fusion, Oracle RAC transfers data blocks from the buffer cache of one instance to the buffer cache of another instance using the cluster's high-speed interconnect.
CR image: A consistent read (CR) block represents a consistent snapshot of the data from a previous point in time. Applying undo/rollback segment information produces consistent read versions.
Past image: A past image (PI) is created from an exclusive current block when another request comes in for an exclusive lock on the same block. Past images are not written to disk; after the latest version of the block is written to disk, all past images are discarded.
Example 1 [write-read]: Instance A is holding a data block in exclusive mode.
1. Instance B tries to access the same block for read purposes.
2. If the transaction has not yet been committed by Instance A, Instance A cannot send the current block to the requesting instance since the data is not yet committed, so it creates a consistent read version of that data block by applying undo to the block and sends it to the requesting instance.
3. Creating a CR image in RAC can come with some I/O overhead, because the undo data could be spread across instances; to build a CR copy of the block, the instance might have to visit undo segments on other instances and hence perform some extra I/O.
Now what actually happens inside: when any instance accesses a data block, GCS keeps track of it and records in the GRD that the latest copy of the block is with that instance, in our case Instance A. So when another instance (Instance B) asks for the same block, it can easily find that the block is with Instance A. The GRD also records that the block is being accessed in exclusive mode. So when another instance asks for a shared lock on it, GCS checks whether the transaction is committed or not; if not, Instance A creates a read-consistent image of that data block and sends it to the requesting instance. After shipping the block to the requesting instance, it also records in the GRD that the CR image is with Instance B, which holds a shared lock, while Instance A still holds the exclusive lock.
Example 2 [write-write]: In the case of write-write operations, the past image comes into the picture. Instance A is holding a data block in exclusive mode and Instance B is trying to access the same data block in exclusive mode too. Here Instance B needs the actual (current) block and not a consistent read version of the block. In this scenario the holding instance sends the actual block, but it is obliged to keep a past image of the block in its cache until that data block has been written to disk. In case of node failure or node crash, GCS is able to rebuild that data block using the PI images across the cluster. Once the data block is written to disk, all PIs can be discarded because they will not be needed for recovery after a crash.
Past image (PI) and consistent read (CR) image in the above examples:
The CR image is used in read-write contention: one instance holds the block in exclusive mode and a second instance requests read access to that block, so it does not need an exclusive lock; a consistent read image, consistent with the requesting query's SCN, is sufficient. When the first instance has acquired the exclusive lock and the second instance also wants the same block in exclusive mode, it is write-write contention. In this case there are two possibilities: either Instance A releases the lock on that block (if it no longer needs it) and lets Instance B read the block from disk, or Instance A creates a PI image of the block in its own buffer cache, makes the redo entries, and ships the block to the requesting instance via the interconnect.
Another way to state the difference: the CR image is shipped to the requesting instance, whereas the PI has to be kept by the holding instance after shipping the actual block. In order to facilitate Cache Fusion, we still need the buffer cache, the shared pool, and the undo tablespace just like a single-instance database; however, for Oracle RAC we need the buffer caches on all instances to appear to be global across the cluster.
To do this, we need the GRD (Global Resource Directory) to keep track of the resources in the cluster. There is no true concept of a master node in Oracle RAC, but each node that belongs to the cluster becomes the resource master for a subset of resources. The GCS processes (LMS) are responsible for shipping blocks through the interconnect.
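Cache Fusion activity shows up as the global cache ('gc') wait events; a simple way to see how much of it an instance is doing is a query like the following against the standard gv$system_event view:
SQL> select inst_id, event, total_waits, time_waited from gv$system_event where event like 'gc%' order by time_waited desc;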
Q. What is GRD in RAC?
The GRD (Global Resource Directory) is the internal database that records and stores the current status of the data blocks. It is maintained by the Global Cache Service (GCS) and the Global Enqueue Service (GES).
Global Enqueue Service Daemon (LMD):
- It holds the information on the locks on the buffers.
- The lock info is available in V$LOCK_ELEMENT and V$BH.LOCK_ELEMENT.
Global Cache Service Processes (LMSn):
- They provide the buffer from one instance to another instance.
- They do not know who has what type of buffer lock.
Whenever a data block is transferred out of a local cache to another instance's cache, the GRD is updated. The GRD resides in memory and is distributed throughout the cluster. It lists the master instance of all the buffers.
The GRD holds information such as:
1. SCN (system change number)
2. DBI (data block identifier)
3. Location of the block
4. Mode of the block:
   - null (N): there are no access rights on the block
   - shared (S): the block is shared across all instances
   - exclusive (X): access rights only for a particular instance
5. Role of the block:
   - local: the data block image is present in only one node
   - global: the data block image is present in multiple nodes
Types of data block image:
1. Current image: the updated data block value
2. Consistent image: a previous data block value
3. Past image: a GRD-tracked image that is used to rebuild the current image when an instance crashes
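The block images described above can be observed through the STATUS column of V$BH (xcur = exclusive current, scur = shared current, cr = consistent read, pi = past image). A simple illustrative query:
SQL> select status, count(*) from v$bh group by status;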
How to check the location of the OCR and voting disk in RAC?
Voting disk: It manages information about node membership. Each voting disk must be accessible by all nodes in the cluster. If any node is not passing its heartbeat to the other nodes or to the voting disk, then that node will be evicted.
To check the voting disk location: crsctl query css votedisk
OCR: It is created at the time of Grid installation. It stores the information needed to manage Oracle Clusterware and its components such as the RAC database, listener, VIP, SCAN IP and services.
To check the OCR location: ocrcheck
cat /etc/oracle/ocr.loc also gives you the location of the OCR.
What is I/O fencing?
I/O fencing is a mechanism to prevent uncoordinated access to the shared storage. This feature works even in the case of faulty cluster communications causing a split-brain condition. To provide high availability, the cluster must be capable of taking corrective action when a node fails. If a system in a two-node cluster fails, it stops sending heartbeats over the private interconnects and the remaining node takes corrective action. However, the failure of the private interconnects (instead of the actual nodes) would present identical symptoms and cause each node to determine that its peer has departed. This situation typically results in data corruption because both nodes attempt to take control of the data storage in an uncoordinated manner.
I/O fencing allows write access for members of the active cluster and blocks access to storage from non-members; even a node that is alive is unable to cause damage. Fencing is an important operation that protects processes from other nodes modifying the resources during node failures. When a node fails, it needs to be isolated from the other active nodes. Fencing is required because it is impossible to distinguish between a real failure and a temporary hang. Therefore, we assume the worst and always fence. (If the node is really down, it cannot do any damage; in theory, nothing is required, and we could just bring it back into the cluster with the usual join process.) Fencing, in general, ensures that I/O can no longer occur from the failed node. Some clusters use a fencing method called STONITH (Shoot The Other Node In The Head), which automatically powers off the failing server; this simply means the healthy nodes kill the sick node. Oracle Clusterware does not do this; instead, it simply tells the sick node to reboot. The node bounces itself and rejoins the cluster.
In versions before 11.2.0.2, Oracle Clusterware tried to prevent a split brain with a fast reboot (better: reset) of the server(s), without waiting for ongoing I/O operations or synchronization of the file systems. This mechanism was changed in version 11.2.0.2 (the first 11g Release 2 patch set). After deciding which node to evict, the Clusterware:
- attempts to shut down all Oracle resources/processes on the server
- stops itself on the node
- afterwards the Oracle High Availability Services Daemon (OHASD) will try to start the Cluster Ready Services (CRS) stack again; once the cluster interconnect is back online, all relevant cluster resources on that node will start automatically
- kills the node if stopping the resources or processes generating I/O is not possible (hanging in kernel mode, in the I/O path, etc.)
Generally, Oracle Clusterware uses two rules to choose which nodes should leave the cluster to assure cluster integrity:
- In configurations with two nodes, the node with the lowest node ID survives (the first node that joined the cluster); the other one is asked to leave the cluster.
- With more cluster nodes, the Clusterware tries to keep the largest sub-cluster running.
Why are raw devices faster than block devices?
Years ago raw devices were a lot faster than file systems; nowadays the difference has become much smaller. "Veritas file systems for Oracle" (I don't recall the exact name) is supposed to offer the same speed as raw devices. File systems, however, are a lot easier to administer than raw devices and give you more freedom.
Raw devices in conjunction with database applications can in some cases give more performance, because the operating system does only a minimum of work in deciding how data is written to and read from blocks, and the RDBMS takes all the responsibility for managing the space and dealing with the data as a character stream.
A raw partition is accessed in character mode, so I/O is faster than on a file system partition, which is accessed in block mode. With raw partitions you can do bulk I/Os.
What is the UDP protocol and why use it for the private interconnect?
UDP (User Datagram Protocol) is different from TCP/IP in that it does not have a built-in handshake dialogue (TCP/IP first performs connection setup, after which it starts transferring data; this is called handshaking). This means that UDP does not have the same data integrity, reliability and serialization guarantees as TCP/IP:
- UDP is unreliable; it does not guarantee data delivery to the destination side like TCP/IP does.
- UDP does not provide sequencing of data during transmission.
- UDP is far faster than TCP/IP, primarily because there is no overhead in establishing a handshake connection.
- The UDP protocol is used for high-impact communication areas such as domain name servers (DNS).
In RAC it is mainly used for Cache Fusion traffic.
To enable UDP on AIX for Oracle, you set the following UDP parameters:
udp_sendspace: set the udp_sendspace parameter to [(DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT) + 4096], but not less than 65536.
udp_recvspace: set the value of the udp_recvspace parameter to >= 4 * udp_sendspace.
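As a worked example, with an 8 KB block size and DB_FILE_MULTIBLOCK_READ_COUNT = 16: udp_sendspace = (8192 * 16) + 4096 = 135168 (above the 65536 floor), and udp_recvspace >= 4 * 135168 = 540672. On AIX these are typically set with the no command; the syntax below is a sketch and should be verified against your AIX release:
# no -p -o udp_sendspace=135168
# no -p -o udp_recvspace=540672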
Why was it not possible to keep the OCR and voting disk on ASM in 10g?
In Oracle 11g there is a great feature whereby we can put the OCR and voting disk files on ASM, but in earlier Oracle versions it was not possible. The reason is that in 11g the Clusterware can access the OCR and voting disk files even if the ASM instance is down, and it can then start the CRS and CSS services. In Oracle 10g, however, the ASM instance could not be brought up first, because on startup it would throw the error "ORA-29701: unable to connect to Cluster Manager": to bring the ASM instance up you first need to start the CRS services. So if the OCR and voting disk files resided on ASM, those services would have to be up already, which is not possible because ASM is not yet up and ASM itself depends on CRS.
What is OLR in RAC?
In 11gR2, in addition to the OCR, we have another component called the OLR (Oracle Local Registry) installed on each node in the cluster. It is a local registry for node-specific purposes. The OLR is not shared by other nodes in the cluster; it is installed and configured when the clusterware is installed.
Why is the OLR used and why was it introduced? In 10g we cannot store the OCR in ASM, and the clusterware uses the OCR to start up. But what happens when the OCR is stored in ASM, as in 11g? The OCR must be accessible to find out which resources need to be started, but if the OCR is on ASM it can't be read until ASM (which is itself a resource of the node, and that information is stored in the OCR) is up. To answer this, Oracle introduced the OLR:
- It is the first file used to start up the clusterware when the OCR is stored on ASM.
- Information about the resources that need to be started on a node is stored in an OS file called the Oracle Local Registry (OLR).
- Since the OLR is an OS file, it can be accessed by various processes on the node for read/write irrespective of the status of the cluster (up/down).
- When a node joins the cluster, the OLR on that node is read and the various resources, including ASM, are started on the node.
- Once ASM is up, the OCR is accessible and is used henceforth to manage all the cluster nodes. If the OLR is missing or corrupted, the clusterware can't be started on that node.
Where is the OLR located?
It is located at $GRID_HOME/cdata/<hostname>.olr. The location of the OLR is stored in /etc/oracle/olr.loc and used by OHASD.
What does the OLR contain?
The OLR stores:
- Clusterware version info
- Clusterware configuration
- Configuration of the various resources which need to be started on the node
The OLR stores data about ORA_CRS_HOME, the local host version, the active version, GPnP details, the latest OCR backup time and location, information about the OCR daily and weekly backup locations, the node name, etc. This information stored in the OLR is needed by OHASD to start or join a cluster.
Checking the status of the OLR file on each node:
$ ocrcheck -local
OCRDUMP is used to dump the contents of the OLR to the text terminal:
$ ocrdump -local -stdout
We can export and import the OLR file using OCRCONFIG:
$ ocrconfig -local -export <export_file_name>
$ ocrconfig -local -import <file_name>
We can even repair the OLR file if it is corrupted:
$ ocrconfig -local -repair -olr <filename>
The OLR is backed up at the end of the installation or an upgrade. After that, we need to back up the OLR manually; automatic backups are not supported for the OLR.
$ ocrconfig -local -manualbackup
To change the OLR backup location:
$ ocrconfig -local -backuploc <new_backup_location>
To restore the OLR:
$ crsctl stop crs
$ ocrconfig -local -restore <file_name>
$ ocrcheck -local
$ crsctl start crs
$ cluvfy comp olr --> to check the integrity of the restored OLR file
How to back up the OCR and voting disk in RAC
a) Oracle Clusterware (CRSD) automatically creates OCR backups every 4 hours.
b) A backup is created for each full day.
c) A backup is created at the end of each week.
d) Oracle Database retains the last three copies of the OCR.
To add a voting disk: # crsctl add css votedisk path_to_voting_disk
To remove a voting disk: # crsctl delete css votedisk path_to_voting_disk
Voting disks
In 11g Release 2 you no longer have to back up the voting disks. In fact, according to the Oracle documentation, restoration of voting disks that were copied using the "dd" or "cp" command may prevent your clusterware from starting up. The voting disk data is automatically backed up in the OCR whenever there is a configuration change, and the data is automatically restored to any voting disk that is added. There is no need to back up the voting disk every day, because the node membership information does not usually change.
The following is a guideline for when to back up the voting disk (in releases where this still applies):
• After installation
• After adding nodes to or deleting nodes from the cluster
• After performing voting disk add or delete operations
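In addition to the automatic backups, the OCR can be backed up and exported on demand with standard ocrconfig options (the export path is just an example):
# ocrconfig -showbackup --> lists automatic and manual backups and the node that took them
# ocrconfig -manualbackup --> takes an on-demand physical backup of the OCR
# ocrconfig -export /backup/ocr_exp.dmp --> logical export of the OCR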
How to check the cluster name in RAC
A cluster comprises multiple coordinated computers or servers that appear as if they are one server to end users and applications. Oracle RAC enables you to cluster Oracle databases. Oracle RAC uses Oracle Clusterware as the infrastructure to bind multiple servers so they operate as a single system. Oracle Clusterware is a portable cluster management solution that is integrated with Oracle Database, and it is a required component for using Oracle RAC. In addition, Oracle Clusterware enables both single-instance Oracle databases and Oracle RAC databases to use the Oracle high-availability infrastructure, and it enables you to create a clustered pool of storage to be used by any combination of single-instance and Oracle RAC databases. If you want to find the cluster name from an existing RAC setup, use the commands below.
1. cd $GRID_HOME/bin
   ./olsnodes -c
2. cd $GRID_HOME/bin
   ./cemutlo -n
What is instance recovery in RAC?
1. All nodes are available.
2. One or more RAC instances fail.
3. The node failure is detected by any one of the remaining instances.
4. The Global Resource Directory (GRD) is reconfigured and distributed among the surviving nodes.
5. The instance which first detected the failed instance reads the failed instance's redo logs to determine which logs need to be recovered. This task is done by the SMON process of the instance that detected the failure.
6. Until this time database activity is frozen. SMON issues recovery requests for all the blocks that are needed for recovery. Once all those blocks are available, the other blocks, which are not needed for recovery, become available for normal processing.
7. Oracle performs a roll-forward operation against the blocks that were modified by the failed instance but were not written to disk, using the transactions recorded in the redo logs.
8. Once the redo logs are applied, uncommitted transactions are rolled back using the undo tablespace.
9. The database on the RAC is now fully available.
What is the DNS server's role in RAC?
SCAN is a domain name registered to at least one and up to three IP addresses, either in the domain name service (DNS) or in the Grid Naming Service (GNS). When a client wants a connection to the database, unlike in previous releases, the client uses the SCAN as specified in tnsnames.ora. The DNS server returns the three IP addresses for the SCAN, and the client tries each IP address given by the DNS server until the connection is made. So with 11gR2 the client initiates the connection to the SCAN listener, which forwards the connection request to the least-loaded node within the cluster.
The flow is: Client connection request --> SCAN listener --> Node listener (running on the virtual IP)
Why does SCAN need DNS?
You must have a DNS server set up if you want to use SCAN. The reason is that if you use the /etc/hosts file for the SCAN, all requests for the SCAN will be forwarded to the first SCAN address specified in /etc/hosts, because /etc/hosts does not have round-robin name resolution capability. If you use a DNS server, the SCAN can take advantage of DNS's round-robin name resolution feature.
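The round-robin behaviour can be verified from any client; the SCAN name below is hypothetical:
$ nslookup rac-scan.example.com
Running it several times should show the three SCAN addresses returned in rotating order.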
What is passwordless access and why is it required in RAC?
During node addition, cluster or RDBMS upgrades, or cluster/RDBMS installation, if you want to check any prerequisites using the runcluvfy.sh or cluvfy script, passwordless (SSH user equivalence) connectivity is required between the RAC nodes for the same OS user.
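User equivalence can be verified with cluvfy before an install or upgrade; the node names are placeholders:
$ cluvfy comp admprv -n node1,node2 -o user_equiv -verbose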
What is ASMLib and what is it used for?
ASMLib is an optional support library for the Automatic Storage Management feature of the Oracle Database. Automatic Storage Management (ASM) simplifies database administration and greatly reduces kernel resource usage (e.g. the number of open file descriptors). It eliminates the need for the DBA to directly manage potentially thousands of Oracle database files, requiring only the management of groups of disks allocated to the Oracle Database. ASMLib allows an Oracle Database using ASM more efficient and capable access to the disk groups it is using. Oracle ASM (Automatic Storage Management) is a data volume manager for Oracle databases, and ASMLib is an optional utility that can be used on Linux systems to manage Oracle ASM devices. ASM assists users in disk management by keeping track of the storage devices dedicated to Oracle databases and allocating space on those devices according to the requests from Oracle database instances.
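Typical ASMLib administration commands on Linux look like the following (the disk label and device name are examples):
# oracleasm createdisk DATA1 /dev/sdb1
# oracleasm listdisks
# oracleasm scandisks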
What is GNS?
When we use Oracle RAC, all clients must be able to reach the database. All public addresses, VIP addresses, and SCAN addresses of the cluster must be resolvable by the clients. GNS (Grid Naming Service) helps us solve this problem. GNS is linked to the domain name server (DNS) so that clients can resolve these dynamic addresses and transparently connect to the cluster and the databases. Activating GNS in a cluster requires a DHCP service on the public network. Grid Naming Service uses one static address and dynamically allocates VIP addresses using the Dynamic Host Configuration Protocol (DHCP), which must be running on the network. Grid Naming Service runs as gnsd (the Grid Naming Service daemon).
Background processes related to GNS:
mDNS (Multicast Domain Name Service): allows DNS requests.
GNS (Oracle Grid Naming Service): a gateway between the cluster mDNS and external DNS servers. The GNS process performs name resolution within the cluster. The DNS delegates queries to the GNS virtual IP address, and the GNS daemon responds to incoming name resolution requests at that address. Within the subdomain, GNS uses multicast Domain Name Service (mDNS), included with Oracle Clusterware, to enable the cluster to map host names and IP addresses dynamically as nodes are added to and removed from the cluster, without requiring additional host configuration in the DNS.
$ srvctl config gns -a
What is the difference between CRSCTL and SRVCTL?
The crsctl command is used to manage the elements of the clusterware (CRS, CSSD, OCR, voting disk, etc.), while srvctl is used to manage the elements of the cluster (databases, instances, listeners, services, etc.). For example, with crsctl you can tune the heartbeat of the cluster, while with srvctl you set up load balancing at the service level. Both commands were introduced with Oracle 10g and have been improved since. There is sometimes some confusion among DBAs because both commands can be used to start the database: crsctl starts the whole clusterware stack plus the cluster resources, while srvctl starts the other elements, such as the database, listener and services, but not the clusterware. So, use SRVCTL to manage Oracle-supplied resources such as listeners, instances, disk groups and networks, and use CRSCTL for managing Oracle Clusterware and its resources. Oracle strongly discourages directly manipulating Oracle-supplied resources (resources whose names begin with ora) using CRSCTL.
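A few contrasting examples (the database name PROD and instance PROD1 are placeholders):
# crsctl check cluster -all --> clusterware stack status on all nodes
# crsctl stop crs --> stop the whole clusterware stack on one node
$ srvctl status database -d PROD --> status of the PROD database resource
$ srvctl stop instance -d PROD -i PROD1 --> stop a single instance of PROD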
What is rebootless fencing in 11gR2 RAC?
Oracle Grid Infrastructure 11.2.0.2 has many features, including cluster node membership, cluster resource management, and cluster resource monitoring. One of the key areas where a DBA needs expert knowledge is how cluster node membership works and how the cluster decides to take a node out should there be a heartbeat network, voting disk, or node-specific issue. Oracle 11.2.0.2 brings many new features, and one of them is rebootless fencing. When sub-components of Oracle RAC like the private interconnect or voting disk fail, Oracle Clusterware tries to prevent a split brain; before 11.2.0.2 it did this with a fast reboot of the node, without waiting for I/O operations or synchronization of the file systems.
Oracle uses algorithms common to STONITH (Shoot The Other Node In The Head) implementations to determine which nodes need to be fenced. When a node is alerted that it is being "fenced", it uses suicide to carry out the order. STONITH automatically powers down a node that is not working correctly; an administrator might employ STONITH if one of the nodes in a cluster cannot be reached by the other node(s) in the cluster.
After 11.2.0.2 the mechanism changed. Oracle improved node fencing in Oracle 11g Release 2 (11.2.0.2) by killing the processes on the failed node that are capable of performing I/O and then stopping the Clusterware on the failed node, rather than simply rebooting the failed node. Whenever sub-components of Oracle RAC like the private interconnect, voting disk, etc. fail, Oracle Clusterware first decides which node to evict, then:
1. The Clusterware attempts to shut down all Oracle resources and processes on that node, especially those processes which generate I/O.
2. The Clusterware stops the cluster services on that node.
3. Then OHASD (Oracle High Availability Services Daemon) will try to start the CRS (Cluster Ready Services) stack again, and once the interconnect is back online, all cluster resources on that node will automatically be started.
4. If it is not possible to stop the resources or the processes generating I/O, the Clusterware will kill (reboot) the node.
If any one of the nodes cannot communicate with the other nodes, there is a potential that the node could corrupt the data by not coordinating its writes with the other nodes. Should that situation arise, that node needs to be taken out of the cluster to protect the integrity of the cluster and its data. This is called "split brain" in the cluster, which means two different sub-clusters could be functioning against the same set of data, writing independently, causing data integrity issues and corruption. Any clustering solution needs to address this issue, and so does Oracle Grid Infrastructure Clusterware.
What are rolling upgrades and rolling patch application in RAC?
A rolling upgrade allows one node to be upgraded while the other node is running, so there is no downtime at all, since at least one node is running at any one time. The term rolling upgrade refers to upgrading different databases or different instances of the same database (in a Real Application Clusters environment) one at a time, without stopping the database. The advantage of a RAC rolling upgrade is that it enables at least some instances of the RAC installation to be available during the scheduled outage required for patch upgrades. Only the RAC instance that is currently being patched needs to be brought down; the other instances can continue to remain available. This means that the impact on application downtime required for such scheduled outages is further minimized. Oracle's opatch utility enables the user to apply the patch successively to the different instances of the RAC installation. Rolling upgrade is available only for patches that have been certified by Oracle as eligible for rolling upgrades. Typically, patches that can be installed in a rolling upgrade include:
• Patches that do not affect the contents of the database, such as the data dictionary
• Patches not related to RAC inter-node communication
• Patches related to client-side tools such as SQL*Plus, Oracle utilities, development libraries, and Oracle Net
• Patches that do not change shared database resources such as datafile headers, control files, and common header definitions of kernel modules
Rolling upgrade of patches is currently available for one-off patches only; it is not available for patch sets.
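Whether a given one-off patch is rolling-capable can be checked with opatch before applying it; the patch location below is a placeholder and the exact output wording varies by opatch version:
$ opatch query -all /stage/patches/1234567 | grep -i rolling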
What is One Node RAC concept
Oracle introduced a new
option called RAC One Node with the release of 11gR2 in late 2009. This option
is available with Enterprise edition only. Basically, it provides a cold
failover solution for Oracle databases. It’s a single instance of Oracle RAC
running on one node of the cluster while the 2nd node is in a cold standby
mode. If the instance fails for some reason, then RAC One Node detects it and
first tries to restart the instance on the same node. The instance is relocated
to the 2nd node in case there is a failure or fault in 1st node and the
instance cannot be restarted on the same node. The benefit of this feature is
that it automates the instance relocation without any downtime and does not
need a manual intervention. It uses a technology called Omotion, which
facilitates the instance migration/relocation.
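From 11.2.0.2 onwards the relocation can also be triggered manually as an online relocation; the database name, node name and timeout below are placeholders:
$ srvctl relocate database -d PRODONE -n node2 -w 30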
What are some of the RAC-specific parameters?
- active_instance_count: Designates one instance in a two-instance cluster as the primary instance and the other as the secondary instance. This parameter has no functionality in a cluster with more than two instances.
- archive_lag_target: Specifies a log switch after a user-specified time period elapses.
- cluster_database: Specifies whether or not Oracle Database 10g RAC is enabled.
- cluster_database_instances: Equal to the number of instances. Oracle uses the value of this parameter to compute the default value of the large_pool_size parameter when the parallel_automatic_tuning parameter is set to true.
- cluster_interconnects: Specifies the additional cluster interconnects available for use in the RAC environment. Oracle uses information from this parameter to distribute traffic among the various interfaces.
- compatible: Specifies the release with which the Oracle server must maintain compatibility.
- control_files: Specifies one or more names of control files.
- db_block_size: Specifies the size (in bytes) of Oracle database blocks.
- db_domain: In a distributed database system, db_domain specifies the logical location of the database within the network structure.
How to put RAC database in archivelog mode ?
From 11g onwards, you no longer need to reset the CLUSTER_DATABASE parameter during this process.
Step 1. Make sure the db_recovery_file_dest_size and db_recovery_file_dest parameters are set; if they are not, set an archive destination explicitly, for example:
ALTER SYSTEM SET log_archive_dest_1='location=+ORADATA' SCOPE=spfile;
Step 2. Stop the
Database
From the command line we can stop the entire clustered database using
the following.
srvctl stop database -d PROD
Step 3. Now start the
instance from one node only
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE ARCHIVELOG; SQL>
SHUTDOWN IMMEDIATE;
Step 4. Start the
database
srvctl start database -d PROD
SQL> select
name,open_mode,LOG_MODE from v$database;
What are the background processes that exist in 11gR2 and their functionality?
crsd: The CRS daemon (crsd) manages cluster resources based on configuration information that is stored in Oracle Cluster Registry (OCR) for each resource. This includes start, stop, monitor, and failover operations. The crsd process generates events when the status of a resource changes.
cssd: Cluster Synchronization Service (CSS). Manages the cluster configuration by controlling which nodes are members of the cluster and by notifying members when a node joins or leaves the cluster. If you are using certified third-party clusterware, then CSS processes interface with your clusterware to manage node membership information. CSS has three separate processes: the CSS daemon (ocssd), the CSS Agent (cssdagent), and the CSS Monitor (cssdmonitor). The cssdagent process monitors the cluster and provides input/output fencing. This service was formerly provided by the Oracle Process Monitor daemon (oprocd), also known as OraFenceService on Windows. A cssdagent failure results in Oracle Clusterware restarting the node.
diskmon: Disk Monitor daemon. Monitors and performs input/output fencing for Oracle Exadata Storage Server. As Exadata storage can be added to any Oracle RAC node at any point in time, the diskmon daemon is always started when ocssd is started.
evmd: Event Manager (EVM). A background process that publishes Oracle Clusterware events.
mdnsd: Multicast domain name service (mDNS). Allows DNS requests. The mDNS process is a background process on Linux and UNIX, and a service on Windows.
gnsd: Oracle Grid Naming Service (GNS). A gateway between the cluster mDNS and external DNS servers. The GNS process performs name resolution within the cluster.
ons: Oracle Notification Service (ONS). A publish-and-subscribe service for communicating Fast Application Notification (FAN) events.
oraagent: Extends the clusterware to support Oracle-specific requirements and complex resources. It runs server callout scripts when FAN events occur. This process was known as RACG in Oracle Clusterware 11g Release 1 (11.1).
orarootagent: Oracle root agent. A specialized oraagent process that helps CRSD manage resources owned by root, such as the network and the Grid virtual IP address.
oclskd: Cluster kill daemon. Handles instance/node eviction requests that have been escalated to CSS.
gipcd: Grid IPC daemon. A helper daemon for the communications infrastructure.
ctssd: Cluster time synchronization daemon. Manages time synchronization between nodes, rather than depending on NTP.
What is inittab?
The inittab entry is similar to an oratab entry. In a RAC environment, inittab is what starts the Clusterware services at boot: the line below is responsible for starting them by respawning init.ohasd.
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
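A quick way to confirm that the entry has actually taken effect (a sketch for Linux; paths and process names may vary slightly by platform and release):
ps -ef | grep init.ohasd | grep -v grep     # the respawned wrapper script
ps -ef | grep ohasd.bin | grep -v grep      # the OHASD executable itself, running as root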
Useful Commands:
1. crsctl enable has –> Enable automatic start of Oracle High Availability Services after reboot
2. crsctl disable has –> Disable automatic start of Oracle High Availability Services after reboot
What is OHASD?
OHASD stands for Oracle High Availability Services Daemon. OHASD spawns three levels of agents at the cluster level:
Level 1: cssdagent
Level 2: orarootagent (respawns cssd, crsd, ctssd, diskmon, acfs)
Level 3: oraagent (respawns mdnsd, gipcd, gpnpd, evmd, asm), cssdmonitor
Useful Commands:
1. crsctl enable has –> Start HAS services automatically after reboot
2. crsctl disable has –> HAS services should not start after reboot
3. crsctl config has –> Check whether autostart is enabled or not
4. cat /etc/oracle/scls_scr/<Node_name>/root/ohasdstr –> Check whether autostart is enabled or not
5. cat /etc/oracle/scls_scr/<Node_name>/root/ohasdrun –> Check whether restart is enabled if the node fails
What is OCR? How and why is OLR used? Where are OCR & OLR located?
OCR stands for Oracle Cluster Registry. It holds information such as node membership (which nodes are part of this cluster), the software version, the location of the voting disk, and the status of RAC databases, listeners, instances and services. OCR is placed in ASM or OCFS.
ASM can be brought up only if we have access to the OCR. But the OCR is accessible only after ASM is up. In this case, how will the CRS services come up?
For this the OLR (Oracle Local Registry) exists. It is a registry similar to the OCR but placed on the local file system of each node. The OLR holds information such as CRS_HOME, GPnP details, active version, localhost version, the latest OCR backup (with time and location), and the node name.
Location of OCR & OLR:
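On Linux the pointers to both registries live in small text files under /etc/oracle (the paths below are typical Linux locations and the values shown are only illustrative), so a quick way to see where the OCR and OLR actually are is:
cat /etc/oracle/ocr.loc      # e.g. ocrconfig_loc=+DATA
cat /etc/oracle/olr.loc      # e.g. olrconfig_loc=/u01/app/11.2.0/grid/cdata/node1.olr
ocrcheck                     # reports the OCR device/file name, used space and integrity
ocrcheck -local              # the same details for the OLR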
NOTE: Some commands, such as restore, require bouncing the services. Please verify before taking any action.
1. ocrconfig -showbackup –> Show the OCR file backup location
2. ocrconfig -export <File_Name_with_Full_Location.ocr> –> Take an OCR backup (export)
3. ocrconfig -restore <File_Name_with_Full_Location.ocr> –> Restore OCR
4. ocrconfig -import <File_Name_With_Full_Location.dmp> –> Import metadata specifically for OCR
5. ocrcheck -details –> Gives the OCR info in detail
6. ocrcheck -local –> Gives the OLR info in detail
7. ocrdump -local <File_Name_with_Full_Location.olr> –> Take a dump of the OLR
8. ocrdump <File_Name_with_Full_Location.ocr> –> Take a dump of the OCR
What is the Voting Disk and how is it used?
The voting disk comes into the picture when a node joins the cluster, when a node fails (and may be evicted), and when VIPs need to be assigned in case GNS is configured. The voting disk records which nodes are, or were, members of the cluster. While starting the CRS services, with the help of the OCR, each node votes in the voting disk (essentially marking its attendance in the cluster). We need not back up the voting disk periodically like a cron job; a backup is needed only in certain cases (for example, after adding or deleting a node).
The voting disk holds two kinds of data:
1. Dynamic – heartbeat information
2. Static – node information in the cluster
Useful Commands:
1. dd if=Name_Of_Voting_Disk of=Name_Of_Voting_Disk_Backup –> Take a backup of the voting disk
2. crsctl query css votedisk –> Check voting disk details
3. crsctl add css votedisk path_to_voting_disk –> Add a voting disk
4. crsctl add css votedisk -force –> If the cluster is down
5. crsctl delete css votedisk <path_to_voting_disk_or_GUID> –> Delete a voting disk
6. crsctl delete css votedisk -force –> If the cluster is down
7. crsctl replace votedisk <+ASM_Disk_Group> –> Replace the voting disk (move it into an ASM disk group)
What is CRS?
CRSD stands for Cluster Ready Services Daemon. It is the process responsible for monitoring, stopping, starting and failing over the resources. This process maintains the OCR and is responsible for restarting a resource when a failover takes place.
Useful Commands:
1. crs_stat -t -v –> Check crs resources
2. crsctl stat res -t –> Check resources in a more detailed view. BEST ONE.
3. crsctl enable crs –> Enable automatic start of services after reboot
4. crsctl check crs –> Check crs services
5. crsctl disable crs –> Disable automatic start of CRS services after reboot
6. crsctl stop crs –> Stop the crs services on the node where it is executed
7. crsctl stop crs -f –> Stop the crs services forcefully
8. crsctl start crs –> Start the crs services on the respective node
9. crsctl start crs -excl –> Start the crs services in exclusive mode when you have lost the voting disk; you need to replace the voting disk after you start CSS
10. crsctl stop cluster -all –> Stop the crs services on all the cluster nodes
11. crsctl start cluster -all –> Start the crs services on all the cluster nodes
12. olsnodes –> List all the nodes belonging to the cluster
13. oclumon manage -get master –> Get the master node information (for CHM)
14. cat $CRS_HOME/crs/init/<node_name>.pid –> Find the PID under which crs is running
6. What is CSSD?
CSSD stands for Cluster Synchronization Services Daemon. It is responsible for inter-node communication and monitors the heartbeat messages from all the nodes.
Example: We have a 2-node RAC cluster. Until an hour ago, CSSD was monitoring both nodes and they were able to communicate with each other. Now, if one of the nodes goes down, CRS should know that the node is down; this information is provided by the CSSD process.
Simple scenario: Both nodes are up and running, but due to a failure of the communication channel the CSSD process gets the information that the other node is down. In this case, new transactions cannot be assigned to that node, node eviction is performed, and the surviving node takes ownership as the master node.
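A quick health check for CSS on a node (standard commands, shown as a sketch):
crsctl check css                                              # reports whether Cluster Synchronization Services is online
ps -ef | egrep 'ocssd|cssdagent|cssdmonitor' | grep -v grep   # confirm the three CSS processes are running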
7. What is CTSSD?
CTSSD stands for Cluster Time Synchronization Service Daemon. By default this service runs in observer mode; if there is a time difference between nodes, it does not take any action. To run this service in active mode, we need to disable all other time synchronization services such as NTP (Network Time Protocol). It is generally recommended to keep this service in observer mode, because if the service is in active mode and the time synchronization difference is huge, the ctssd process may terminate, and sometimes crsd fails to start up due to the time difference.
Useful Commands:
1. cluvfy comp clocksync -n all -verbose –> Check clock synchronization across all the nodes
2. crsctl check ctss –> Check the service status & time offset in msecs
8. What is VIP?
VIP stands for Virtual IP Address. Oracle uses the VIP for database-level access: when a connection comes from the application side, it connects using this address. If the physical IP of a node is down, the client would normally have to wait for the TCP timeout (on the order of 90 seconds) before getting a response. This is where the VIP comes into the picture: if a node fails, its VIP fails over to a surviving node, where it returns an immediate error instead of leaving clients waiting, so connections are routed only to the active nodes. The VIP must be on the same subnet as the public IP address. The VIP is used for RAC failover and RAC management.
Useful Commands:
1. srvctl start vip -n <node_name> -i <VIP_Name> –> Start a VIP
2. srvctl stop vip -n <node_name> -i <VIP_Name> –> Stop a VIP
3. srvctl enable vip -i <VIP_Name> –> Enable the VIP
4. srvctl disable vip -i <VIP_Name> –> Disable the VIP
5. srvctl status nodeapps -n <node_name> –> Status of nodeapps
6. srvctl status vip -n <node_name> –> Status of the VIP on a node
What is ologgerd?
Ologgerd is the Cluster Logger Service daemon, part of the Cluster Health Monitor. The logger service writes its data on the master node and chooses another node as standby. If a network issue occurs between the nodes and a node is unable to contact the master, that node takes over ownership and chooses another node as standby. The master manages the operating system metric database in the CHM repository.
Useful Commands:
1. oclumon manage -get master –> Find which node is the master
2. oclumon manage -get reppath –> Get the path of the repository logs
3. oclumon manage -get repsize –> Show the limits on the repository size
4. oclumon showobjects –> Find which nodes are connected to the logger daemon
5. oclumon dumpnodeview –> Gives a detailed view including system, topconsumers, processes, devices, nics, filesystems status and protocol errors
6. oclumon dumpnodeview -n <node_1 node_2 node_3> -last "HH:MM:SS" –> View all the details for specific nodes starting from the time you mention
7. oclumon dumpnodeview allnodes -last "HH:MM:SS" –> Use this if we need info from all the nodes
What is sysmond?
This process is responsible for collecting information on the local node. A sysmond process runs on every node, and the collected data is sent to the master ologgerd. It sends information such as CPU, memory usage, OS-level info, disk info, process and file system info.
11. What is evmd?
EVMD stands for Event Manager Daemon. It handles event messaging for the processes: it sends and receives actions regarding resource state changes to and from all other nodes in a cluster. It works together with ONS (Oracle Notification Services).
Useful Commands:
1. evmwatch -A -t "@timestamp @@" –> Watch events generated by evmd
2. evmpost -u "<Message here>" -h <node_name> –> Post a message into the evmd log on the mentioned node
13. What is mdnsd?
MDNSD stands for Multicast Domain Name Service daemon. This process is used by gpnpd to locate profiles in the cluster, as well as by GNS to perform name resolution. Mdnsd updates the pid file in the init directory.
What is ONS?
ONS stands for Oracle Notification Service. ONS allows users to send SMS, e-mails, voice messages and fax messages in an easy way. ONS sends the state of the database and instances; this state information is used for load balancing. ONS also communicates with daemons on other nodes to inform them of the state of the database.
ONS is started as part of CRS within nodeapps. ONS runs as a node application; every node has its own ONS configured.
Useful Commands:
1. srvctl status nodeapps –> Status of
nodeapps
2. cat $ORACLE_HOME/opmn/conf/ons.config –> Check
ons configuration.
3. $ORACLE_HOME/opmn/logs –> ONS logs
will be in this location.
15. What is OPROCD?
OPROCD stands for Oracle Process Monitor Daemon. Oprocd monitors the system state of cluster nodes and performs fencing via STONITH, which is nothing but power-cycling the node (powering the server off and on, for example with a reboot). From 11gR2 the oprocd functionality is provided by the cssdagent.
Useful Commands:
CRS_HOME/oprocd stop –> Stop the process on a single node (pre-11gR2)
16. What is FAN?
FAN stands for Fast Application Notification. If any state change occurs in the cluster, instance or node, an event is triggered by the event manager and is propagated by ONS. The event is known as a FAN event. This feature was introduced in Oracle 10g for immediate notification. FAN uses ONS for notifying.
Useful Commands:
1. onsctl ping –> Check whether ons is running or not
2. onsctl debug –> Get a detailed view of ons
3. onsctl start –> Start the daemon
4. onsctl stop –> Stop the daemon
17. What is TAF?
TAF stands for Transparent Application Failover. When a RAC node goes down, SELECT statements can fail over to an active node. INSERT, DELETE, UPDATE and ALTER SESSION statements are not supported by TAF; temporary objects and PL/SQL package states are lost during the failover.
There are two types of failover methods used in TAF:
1. Basic failover: connects to a single node. No extra overhead is incurred up front, but the end user experiences a delay while the new connection is established.
2. Preconnect failover: connects to the primary and the backup node at the same time. This offers faster failover, at the cost of the overhead of keeping a backup connection ready so the transaction can complete with minimal delay.
Useful Commands:
1. Add a service:
srvctl add service -d <database_name> -s <service_name> -r <instance_names> -P <TAF_policy>
(TAF policy specification – NONE, BASIC or PRECONNECT)
2. Check TAF status:
SELECT machine, failover_type, failover_method, failed_over, COUNT(*) FROM gv$session GROUP BY machine, failover_type, failover_method, failed_over;
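For server-side TAF the failover attributes can also be attached to the service itself. A minimal 11.2-style sketch with hypothetical database, service and instance names (-e sets the failover type, -m the method, -z the retries and -w the delay):
srvctl add service -d PROD -s oltp_srv -r PROD1,PROD2 -P BASIC -e SELECT -m BASIC -z 20 -w 5
srvctl start service -d PROD -s oltp_srv
srvctl config service -d PROD -s oltp_srv     # verify the TAF settings stored on the service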
18. What is FCF?
FCF stands for Fast Connection Failover. It is an application-level failover process: the connection pool automatically subscribes to FAN events, which allows an immediate reaction to up and down events from the database cluster. All failed connections are cleaned up immediately so that the application receives a failure message, and after cleanup, new connection requests are load-balanced to an active node. As this is an application-level process, it is not discussed further here.
19. What is GCS (LMSn)?
GCS stands for Global Cache Service. GCS tracks the location, mode and access privileges of data blocks across the various instances. Integrity is maintained through this global view, and GCS is responsible for transferring blocks from one instance to another when needed.
Clear understanding: Blocks of table "A" were retrieved through a connection to the second node. Now, if the first node requests blocks from this table, the blocks need not be read again from the datafiles; they can be shipped from the other instance's cache. This is the main use of GCS.
19. What is GES (LMD)?
GES stands for Global Enqueue Service. GES controls the library and dictionary caches on all the nodes. GES manages transaction locks, table locks, library cache locks, dictionary cache locks, and the database mount lock.
21. What is GRD?
GRD stands for Global Resource Directory. It records information about resources and enqueues: data block identifiers, the mode in which each data block is held (shared, exclusive, null), and which buffer caches have access to the blocks.
22. What is GPNPD?
GPNPD stands for Grid Plug aNd Play Daemon. A file located at CRS_HOME/gpnp/<node_name>/profiles/peer/profile.xml is known as the GPnP profile, and it contains the cluster name, hostname, network profiles with IP addresses, and OCR information. If we make any modification to the voting disk, the profile is updated.
Useful Commands:
1. gpnptool ver –> Check the version of the tool
2. gpnptool lfind –> Find the local gpnpd server (i.e. check the daemon is running on the local node)
3. gpnptool get –> Read the profile
4. gpnptool check -p=CRS_HOME/gpnp/<node_name>/profiles/peer/profile.xml –> Check whether the configuration is valid
23. What is Diskmon?
The disk monitor daemon runs continuously once ocssd starts, and it monitors and performs I/O fencing for Exadata storage servers (a storage server is termed a "cell" in Exadata). This process runs from the time ocssd starts because an Exadata cell can be added to any cluster at any time.
Useful Commands:
1. crsctl stat res ora.diskmon -init –> Check diskmon status
What is lower stack and
higher stack in RAC
The Lower Stack – Managed by OHASD
The 11gR2 Grid Infrastructure consists of a
set of daemon processes which execute on each cluster node; the voting
and OCR files, and protocols used to communicate across
the interconnect. Prior to 11gR2, there were various scripts run by the init process
to start and monitor the health of the clusterware daemons. From 11gR2,
the Oracle High Availability Services Daemon (OHASD)
replaces these. The OHASD starts, stops and
checks the status of all the other daemon processes that are part of the
clusterware using new agent processes listed here:
- CSSDAGENT –
used to start,stop and check status of the CSSD resource
- ORAROOTAGENT –
used to start “Lower Stack”
daemons that must run as root: ora.crsd, ora.ctssd,
ora.diskmon, ora.drivers.acfs, ora.crf
- ORAAGENT –
used to start “Lower Stack”
daemons that run as the grid owner: ora.asm, ora.evmd,
ora.gipcd, ora.gpnpd, ora.mdnsd
- CSSDMONITOR – used to monitor the CSSDAGENT
The OHASD is
essentially a daemon which starts and monitors the clusterware daemons
themselves. It is started by init using the /etc/init.d/ohasd script
and starts the ohasd.bin executable as
root. The Oracle documentation lists the “Lower Stack” daemons where
they are referred to as the “The Oracle High Availability Services Stack”
and notes which agent is responsible for starting and monitoring each specific
daemon. It also explains the purpose of each of the stack components. (Discussions
of some of these components will feature in future blog posts.) If the grid
infrastructure is enabled on a node, then OHASD starts
the “Lower Stack” on that node at boot time. If disabled, then
the “Lower Stack” is started manually. The following commands
are used for these operations:
- crsctl enable crs – enables autostart at boot time
- crsctl disable crs – disables autostart at boot time
- crsctl start crs – manually starts crs on the local node
The “Lower Stack” consists of
daemons which communicate with their counterparts on other cluster nodes. These
daemons must be started in the correct sequence, as some of them depend on
others. For example, the Cluster Ready Services Daemon (CRSD), may
depend on ASM being available if the OCR file
is stored in ASM. Clustered ASM in turn,
depends on the Cluster Synchronisation Services Daemon(CSSD),
as the CSSD must be started in order for
clustered ASM to start up. This dependency tree is similar to
that which already existed for the resources managed by the CRSD itself,
known as the “Upper Stack“, which will be discussed later in this
post.To define the dependency tree for the “Lower Stack“, a
local repository called the OLR is used. This
contains the metadata required by OHASD to join the
cluster and configuration details for the local software. As a
result, OHASD can start the “Lower Stack”
daemons without reference to the OCR. To examine the OLR use
the following command, and then examine the dump file produced:
- ocrdump -local <FILENAME>
Another benefit of the OHASD, is
that there is a daemon running on each cluster node whether or not the “Lower Stack”
is started. As long as the OHASD daemon is running,
then the following commands may be used in 11gR2:
- crsctl check has – check the status of the OHASD
- crsctl check crs – check the status of the OHASD, CRSD, CSSD and EVMD
- crsctl check cluster -all – checks the "Lower Stack" on all the nodes
- crsctl start cluster -all – attempts to start the "Lower Stack" on all the nodes
- crsctl stop cluster -all – attempts to stop the "Lower Stack" on all the nodes
- crsctl stat res -init -t – lists the status of the "Lower Stack" resources on the local node
Starting the CSSD daemon requires access to the Voting Files, which may be stored in ASM. But a clustered ASM instance may not start until the node has joined the cluster, which requires that CSSD be up. To get around this problem, ASM Diskgroups are flagged to indicate that they contain Voting Files. The ASM discovery string (available from the GPnP profile) is used to scan the ASM Disks when CSSD starts. The scan locates the flags indicating the presence of Voting Files, which are stored at a fixed location in the ASM Disks. This process does not require the ASM instance to be up. Once the Voting Files are found by this scanning process, CSSD can access them, join the cluster, and then the ORAAGENT can start the clustered ASM instance.
The Upper Stack – Managed by CRSD
The “Upper Stack” consists of the daemons and resources
managed by the Grid Infrastructure, once it is up and running. It uses the same
architecture as OHASD, but CRSD uses
its own threads of the agents to start up, stop
and check the status of the daemons and resources as follows:
- ORAROOTAGENT –
used to start “Upper Stack”
daemons that must run as root: GNS, VIP, SCAN VIP and
network resources
- ORAAGENT –
used to start “Upper Stack”
daemons that run as grid owner: ora.asm,
ora.eons, ora.LISTENER.lsnr, SCAN listeners, ora.ons, ASM
Diskgroups, Database Instances, Database Services. It
is also used to publish High Availability events
to interested clients and manages Cluster Ready
Service changes of state.
The resources managed by the CRSD for the “Upper Stack” are also listed in the Oracle Documentation where they
are referred to as “The Cluster Ready Services Stack”
and consist of familiar resources such as Database Instances, Database
Services and NodeApps such as Node Listeners.
There are also some new resources such as the Single Client Access
Name (SCAN), SCAN Vips, Grid Naming Service (GNS), GNS Vips and Network
Resources. Some of these will be the subject of future
Blog posts.
The resources managed by the “Upper Stack”
are in the OCR file which may be stored in ASM.
Since the Clustered ASM Instance is started by OHASD after CSSD is
started but before CRSD is started, access
to the OCR by CRSD is done as a
normal client of ASM. The OCR file
may be seen as a file in ASM, unlike the Voting Files which
are not “visible” when looking at the ASM directory
contents using either Enterprise Manager or the ASMCMD utility.
To check the location of the OCR do
the following:
# cat /etc/oracle/ocr.loc
ocrconfig_loc=+DATA
local_only=FALSE
CRSD Resource Categories
CRSD resources are categorised as "Local Resources" or "Cluster Resources". Local Resources are activated on a specific node and never fail over to another node. For example, an ASM instance exists on each node, so if a node fails, the ASM instance that was on that node will not fail over to a surviving node. Likewise, a Node Listener exists for each node and does not fail over. These two resource types are therefore "Local Resources". SCAN Listeners, however, may fail over, as may Database Instances or the GNS (if used), so these are "Cluster Resources".
Finally to check the status of
the “Upper Stack” resources and daemons, do the following:
# ./crsctl status resource -t
What is GPNP profile?
The GPnP profile is a small XML file located in GRID_HOME/gpnp/<hostname>/profiles/peer under the name profile.xml. It is used to establish the correct global personality of a node. Each node maintains a local copy of the GPnP profile, and the profile is maintained by the GPnP daemon (GPnPD).
WHAT DOES GPNP PROFILE CONTAIN?
GPnP Profile is used to store necessary information required for the startup of Oracle Clusterware, like the SPFILE location, ASM DiskString, etc.
It contains various attributes defining node
personality.
- Cluster name
- Network classifications
(Public/Private)
- Storage to be used for CSS
- Storage to be used for ASM
: SPFILE location,ASM DiskString etc
WHO UPDATES GPNP PROFILE?
GPnPd daemon replicates changes to the profile during:
- installation
- system boot
- when updated
The profile is updated whenever changes are made to a cluster with configuration tools like:
- oifcfg (change network)
- crsctl (change location of voting disk)
- asmcmd (change ASM_DISKSTRING, SPFILE location) etc.
HOW IS GPNP PROFILE USED BY CLUSTERWARE?
When a node of an Oracle Clusterware cluster restarts, OHASD is started by platform-specific means. OHASD has access to the OLR (Oracle Local Registry) stored on the local file system, and the OLR provides the data needed to complete OHASD initialization. OHASD then brings up the GPnP daemon and the CSS daemon. The CSS daemon has access to the GPnP profile stored on the local file system, and the information about where the voting disk resides inside ASM is read from that GPnP profile.
We can even read the voting disk by using the kfed utility, even if ASM is not up.
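A sketch of that kfed check, assuming a candidate ASM disk at /dev/oracleasm/disks/DATA1; the kfdhdb.vfstart/kfdhdb.vfend fields in the disk header are non-zero when the disk carries voting files:
kfed read /dev/oracleasm/disks/DATA1 | grep -E 'vfstart|vfend'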
In the next step, the clusterware checks whether all the nodes have the updated GPnP profile, and the node joins the cluster based on the GPnP configuration. Whenever a node is started or added to the cluster, the clusterware software on the starting node starts a GPnP agent, which performs the following tasks:
1. If
the node is already part of the cluster, the GPnP agent reads the existing
profile on that node.
2. If
the node is being added to the cluster, GPnP agent locates agent on another
existing node using multicast protocol (provided by mDNS) and gets the profile
from other node’s GPnP agent.
The
Voting Files locations on ASM Disks are accessed by CSSD with well-known
pointers in the ASM Disk headers and CSSD is able to complete initialization
and start or join an existing cluster.
Now
OHASD starts an ASM instance and ASM can now operate with initialized and
operating CSSD.
With an ASM instance running and its Diskgroups mounted, access to Clusterware's OCR is available to CRSD (CRSD needs to read the OCR to start up the various resources on the node and to update it as the status of resources changes). Now OHASD starts CRSD with access to the OCR in an ASM Diskgroup, and thus Clusterware completes initialization and brings up the other services under its control.
The
ASM instance uses special code to locate the contents of the ASM SPFILE , which
is stored in a Diskgroup.
Next, since the OCR is also in ASM, the location of the ASM SPFILE must be known. The order of searching for the ASM SPFILE is:
- GPnP profile
- ORACLE_HOME/dbs/spfile
- ORACLE_HOME/dbs/init
The ASM SPFILE is stored in ASM, but to start ASM we need the SPFILE. Oracle knows the SPFILE location from the GPnP profile; it reads the SPFILE flag from the underlying disk(s) and then starts ASM.
Thus the GPnP profile stores several pieces of information. The GPnP profile, together with the information in the OLR, contains enough data to automate several tasks (or at least ease them for administrators), and the dependency on the OCR is gradually reduced, but not eliminated.
What is the major difference between 10g and 11g RAC?
Oracle 10g RAC:
- ASM introduced
- Concept of Services expanded
- ocrcheck introduced
- ocrdump introduced
- AWR was instance specific
- CRS was renamed as Clusterware
- asmcmd introduced
- CLUVFY introduced
- OCR and Voting disks can be mirrored
- Can use FAN/FCF with TAF for OCI and ODP.NET
Oracle 11g R1 RAC:
- Oracle 11g RAC parallel upgrades – Oracle 11g has rolling upgrade features whereby a RAC database can be upgraded without any downtime.
- Hot patching – zero downtime patch application.
- Oracle RAC load balancing advisor – starting from 10g R2 we have the RAC load balancing advisor utility. The 11g RAC load balancing advisor is only available with clients who use .NET, ODBC, or the Oracle Call Interface (OCI).
- ADDM for RAC – Oracle has incorporated RAC into the Automatic Database Diagnostic Monitor, for cross-node advisories. The addmrpt.sql script gives a report for a single instance and will not report on all instances in RAC; this is known as instance ADDM. Using the new package DBMS_ADDM, we can generate a report for all instances of the RAC; this is known as database ADDM.
- Optimized RAC cache fusion protocols – moves on from the general cache fusion protocols in 10g to deal with specific scenarios where the protocols could be further optimized.
Oracle 11g R2 RAC:
- We can store everything on ASM: OCR and voting files can also be stored on ASM.
- ASMCA
- Single Client Access Name
(SCAN) - eliminates the need to change tns entry when
nodes are added to or removed from the Cluster. RAC instances register to
SCAN listeners as remote listeners. SCAN is fully qualified name. Oracle
recommends assigning 3 addresses to SCAN, which create three SCAN
listeners.
- Clusterware
components: crfmond, crflogd, GIPCD.
- AWR
is consolidated for the database.
- By
default, LOAD_BALANCE is ON.
- GSD (Global Services Daemon), gsdctl introduced.
- GPnP
profile.
- Cluster
information in an XML profile.
- Oracle RAC OneNode is
a new option that makes it easier to consolidate databases that aren’t
mission critical, but need redundancy.
- raconeinit -
to convert database to RacOneNode.
- raconefix
- to fix RacOneNode database in case of failure.
- racone2rac
- to convert RacOneNode back to RAC.
- Oracle
Restart - the feature of Oracle Grid Infrastructure's High Availability Services (HAS)
to manage associated listeners, ASM instances and Oracle instances.
- Cluster Time Synchronization Service (CTSS) is a new feature in Oracle 11g R2 RAC, which is used to synchronize time across the nodes of the cluster. CTSS can act as a replacement for NTP.
- Grid Naming Service
(GNS) is a new service introduced in Oracle RAC 11g R2.
With GNS, Oracle Clusterware (CRS) can manage Dynamic Host
Configuration Protocol (DHCP) and DNS services for the dynamic node
registration and configuration.
- Cluster
interconnect: Used for data blocks, locks, messages, and SCN numbers.
- Oracle Local Registry (OLR)
- From Oracle 11gR2 "Oracle
Local Registry (OLR)" something new as part of Oracle Clusterware.
OLR is node’s local repository, similar to OCR (but local) and is managed
by OHASD. It pertains data of local node only and is not shared among
other nodes.
- Multicasting
is introduced in 11gR2 for private interconnect traffic.
- I/O
fencing prevents updates by failed instances, and detecting failure and
preventing split brain in cluster. When a cluster node fails, the failed
node needs to be fenced off from all the shared disk devices or
diskgroups. This methodology is called I/O Fencing, sometimes called Disk
Fencing or failure fencing.
- Re-bootless node fencing
(restart) - instead of fast re-booting the node, a graceful shutdown of
the stack is attempted.
- Clusterware
log directories: acfs*
- HAIP (Highly Available IP for the cluster interconnect).
What are nodeapps services in RAC?
Nodeapps are standard set of oracle application services
which are started automatically for RAC.
Node apps Include:
1) VIP.
2) Oracle Net listener.
3) Global Service Daemon.
4) Oracle Notification Service.
Nodeapp services run on each node of the cluster and will be switched over to other nodes through the VIP during a failover.
Useful commands to maintain nodeapps services:
srvctl stop nodeapps -n NODE1
[ STOP NODEAPPS on NODE 1 ]
srvctl stop nodeapps -n NODE2
[ STOP NODEAPPS on NODE 2 ]
srvctl start nodeapps -n NODE1
[ START NODEAPPS on NODE1 ]
srvctl start nodeapps -n NODE2
[ START NODEAPPS ON NODE2 ]
srvctl status nodeapps
Shutdown and Start sequence of Oracle RAC
components?
Stop Oracle RAC (11g, 12c)
1. emctl stop dbconsole (11g only; in 12c, EM Database Express replaces dbconsole and does not have to be stopped)
2. srvctl stop listener [-listener listener_name] [-node node_name] [-force] (stops all listener services)
3. srvctl stop database -db db_unique_name [-stopoption stop_options] [-eval(12c only)] [-force] [-verbose]
4. srvctl stop asm [-proxy] [-node node_name] [-stopoption stop_options] [-force]
5. srvctl stop nodeapps [-node node_name] [-gsdonly] [-adminhelper] [-force] [-relocate] [-verbose]
6. crsctl stop crs
Start Oracle RAC (11g, 12c)
1. crsctl start crs
2. crsctl start res ora.crsd -init
3. srvctl start nodeapps [-node node_name] [-gsdonly] [-adminhelper] [-verbose]
4. srvctl start asm [-proxy] [-node node_name [-startoption start_options]]
5. srvctl start database -db db_unique_name [-eval(12c only)]] [-startoption start_options] [-node node_name]
6. srvctl start listener [-node node_name] [-listener listener_name] (start all listener services)
7. emctl start dbconsole (11g only)
To start any resources of your HA environment that are still down (e.g. ora.ons, listener):
crsctl start resource -all
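Put together for one node, a hedged 11.2-style example (database name PROD and node name rac1 are assumptions; 11.2 uses the short -d/-n/-o flags rather than the 12c long options shown above):
# stop, run as the grid/oracle owner except crsctl, which runs as root
srvctl stop database -d PROD -o immediate
srvctl stop nodeapps -n rac1 -f
crsctl stop crs
# start
crsctl start crs
srvctl start database -d PROD
srvctl status database -d PROD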
1) TAF – a feature of Oracle Net Services for OCI8 clients. TAF is Transparent Application Failover, which will move a session to a backup connection if the session fails. With Oracle 10g Release 2, you can define the TAF policy on the service using the dbms_service package. It only works with OCI clients. It only moves the session and, if the parameter is set, it fails over the SELECT statement. For insert, update or delete transactions, the application must be TAF aware and roll back the transaction. Yes, you should enable FCF on your OCI client when you use TAF; it will make the failover faster.
Note: TAF will not work with the JDBC thin driver.
2) FAN
FAN is a feature of Oracle RAC which stands for Fast Application Notification. This allows the database to notify the client of any change (Node up/down, instance up/down, database up/down). For integrated clients, inflight transactions are interrupted and an error message is returned. Inactive connections are terminated.
FCF is the client feature for Oracle Clients that have integrated with FAN to provide fast failover for connections. Oracle JDBC Implicit Connection Cache, Oracle Data Provider for .NET (ODP.NET) and Oracle Call Interface are all integrated clients which provide the Fast Connection Failover feature.
3) FCF ---FCF is a feature of Oracle
clients that are integrated to receive FAN events and abort inflight
transactions, clean up connections when a down event is received as well as
create new connections when a up event is received. Tomcat or JBOSS can take
advantage of FCF if the Oracle connection pool is used underneath. This can be
either UCP (Universal Connection Pool for JAVA) or ICC (JDBC Implicit
Connection Cache). UCP is recommended as ICC will be deprecated in a future
release.
What is TAF?
The Oracle Transparent Application Failover (TAF) feature allows
application users to reconnect to surviving database instances if an existing
connection fails. When such a failure happens, all uncommitted transactions
will be rolled back and an identical connection will be established. The
uncommitted transactions have to be resubmitted after reconnection. The TAF
reconnect occurs automatically from within the OCI library. To use all features of TAF, the
application code may have to be modified. When your application is query-only,
TAF can be used without any code changes. In general, TAF works well for
reporting.
Server-Side vs. Client-Side TAF – TAF can be implemented either client-side or
server-side. Service attributes are used server-side to hold the TAF
configuration; client-side the TNS connect string must be changed to enable
TAF. Settings configured server-side supersede their client-side counterparts
if both methods are used. Server-side configuration of TAF is the preferred
method. You can configure TAF in two different failover modes. In the first
mode, Select Failover, SELECT statements that are in progress during the
failure are resumed over the new connection. In the second mode, Session
Failover, lost connections and sessions are re-created.
- Select – SELECT statements will resume on the new connection
- Session – when a connection is lost, a new connection is created automatically
When TAF is set up client-side, it can be configured to establish
from the beginning a second connection to another (backup) instance. This
eliminates the reconnection penalty but requires that the backup instance
support all connections from all nodes set up this way.
- Basic – establishes connections only when failover occurs
- Preconnect – pre-establishes connections to the backup server
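For client-side TAF the FAILOVER_MODE clause goes inside the CONNECT_DATA section of the TNS alias. A sketch with hypothetical host and service names:
PRODTAF =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = prod-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = oltp_srv)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
    )
  )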
What
is FAN and ONS ?
The Oracle RAC Fast Application Notification (FAN) feature provides a simplified API for accessing FAN events through a callback mechanism. This mechanism enables third-party drivers, connection pools, and containers to subscribe to, receive and process FAN events. These APIs are referred to as the Oracle RAC FAN APIs. Put simply, FAN is the mechanism that Oracle RAC uses to notify ONS about service status changes, such as UP and DOWN events, instance state changes, and so on. Oracle RAC publishes FAN events the minute any change has occurred, so instead of the application polling individual nodes to detect an anomaly, applications are notified by FAN events and can react immediately. Any change that occurs in the cluster configuration is notified by Fast Application Notification to ONS.
We can also use server callout scripts to catch FAN events.
FAN also publishes load balancing advisory
(LBA) events. Applications are in a position to take full advantage of
the LBA FAN events to facilitate a smooth transition of connections to
healthier nodes in a cluster. One can take advantage of FAN is
the following ways:
1. When using an integrated Oracle client, applications can use FAN with no programming whatsoever. Oracle JDBC, ODP.NET, and OCI are considered components of the integrated clients.
2. Programmatic changes in ONS API make
it possible for applications to still subscribe to the FAN events
and can execute the event handling actions appropriately.
3. We can also use server callouts scripts to
catch FAN events at a database level.
For a DBA, if a database is up and running, everything seems beautiful. But once a state change happens in a database, we don't know whether it will take two minutes to recover or eat up an indefinite amount of time. When we talk about Oracle RAC, we know there are multiple resources available to give high availability and load balancing. And when we have multiple resources available – multiple instances, multiple services and multiple listeners to serve us – a state change can cause a performance problem.
Let us take an example of a node failure. When a node fails without closing its sockets, all sessions that are blocked in an IO [read or write] wait for the TCP keepalive. This wait status is the typical condition for an application using the database. Sessions processing the last result are even worse off, not receiving an interrupt until the next data is requested. Here we can take advantage of FAN [Fast Application Notification]:
- FAN eliminates application waits on TCP timeouts, as Oracle RAC publishes FAN events the minute any change has occurred and we can handle those events.
- It eliminates time wasted
processing the last result at the client after a failure has occurred.
- It eliminates time wasted
executing work on slow, hung or dead nodes.
We
can take advantage of server-side callouts FAN and do following things.
- Whenever FAN events
occur we can log that so that helps us for administration in future.
- We can use paging or SMS DBA
to open tickets when any resource fails to restart.
- Change resource plans or shut down services when the number of available instances decreases, thus preventing further load on the cluster and keeping the RAC running until another healthy node is added to the cluster.
- We can automate the
fail service back to the PREFERRED instances
when required.
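A minimal sketch of such a server-side callout: Clusterware executes every executable placed in the Grid home's racg/usrco directory and passes the FAN event text as command-line arguments (the file name and log path below are assumptions):
#!/bin/sh
# hypothetical callout: $GRID_HOME/racg/usrco/fan_logger.sh (must be executable)
# append every FAN event published on this node to a local log with a timestamp
echo "`date '+%Y-%m-%d %H:%M:%S'` FAN event: $*" >> /tmp/fan_events.log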
WHAT IS ONS?
ONS allows
users to send SMS messages, e-mails, voice notifications, and fax messages in
an easy-to-access manner. Oracle Clusterware uses ONS to send notifications
about the state of the database instances to midtier applications that use this
information for load-balancing and for fast failure detection. ONS is a daemon
process that communicates with other ONS daemons
on other nodes which inform each other of the current state of the database components on the database server.
To add additional members or nodes that
should receive notifications, the hostname or IP address of the node
should be added to the ons.config file.
The ONS configuration file is located in the $ORACLE_HOME/opmn/conf directory and has the following format:
[oracle@oradb4 oracle]$ more $ORACLE_HOME/opmn/conf/ons.config
localport=6101
remoteport=6201
loglevel=3
useocr=on
nodes=oradb4.sumsky.net:6101,oradb2.sumsky.net:6201,
oradb1.sumsky.net:6201,oradb3.sumsky.net:6201,
onsclient1.sumsky.net:6200,onsclient2.sumsky.net:6200
The localport is the port that ONS binds to
on the local host interface to talk to local clients. The remoteport is
the port that ONS binds to on all interfaces to talk to other ONS daemons. The
loglevel indicates the amount of logging that should be generated. Oracle
supports logging levels from 1 through 9. ONS logs are generated in the
$ORACLE_HOME/opmn/logs directory on the respective instances
The useocr parameter (valid values are on/off) indicates whether ONS should use the OCR to determine which instances and nodes are participating in the cluster. The nodes listed in the nodes line are all nodes in the network that will need to receive or send event notifications. This includes client machines where ONS is also running to receive FAN events for applications.
ONS configuration
==>>
ONS is installed and configured as part of the Oracle Clusterware installation.
Execution of the root.sh file on Unix and Linux-based systems, during
the Oracle Clusterware installation will create and start the ONS on all
nodes participating in the cluster. This can be verified using the crs_stat
utility provided by Oracle.
Configuration of ONS involves registering all nodes and servers that will communicate with the ONS daemon on the database server.
During Oracle Clusterware installation, all nodes participating in the cluster are automatically registered with the ONS.
Subsequently, during restart of the clusterware, ONS will register all nodes with the respective ONS processes
on other nodes in the cluster.
What is the rconfig utility and its usage?
Use the steps below to convert an existing database running on one of the cluster nodes from a single-instance database to a cluster database.
Step 1> Copy ConvertToRAC_AdminManaged.xml to another file, convert.xml:
node1$ cd $ORACLE_HOME/assistants/rconfig/sampleXMLs
node1$ cp ConvertToRAC_AdminManaged.xml convert.xml
Step 2> Edit convert.xml and make the following changes:
* Specify the current OracleHome of the non-RAC database as SourceDBHome.
* Specify the OracleHome where the RAC database should be configured. It can be the same as SourceDBHome.
* Specify the SID of the non-RAC database and credentials. A user with the sysdba role is required to perform the conversion.
* Specify the list of nodes that should have RAC instances running for the admin-managed cluster database. LocalNode should be the first node in this node list.
* The Instance Prefix tag is optional starting with 11.2. If left empty, it is derived from db_unique_name.
* Specify the type of storage to be used by the RAC database. Allowable values are CFS|ASM.
* Specify the Database Area Location to be configured for the RAC database. Leave it blank if you want to use the existing location of the database files, but the location must be accessible from all the cluster nodes.
* Specify the Flash Recovery Area to be configured for the RAC database. Leave it blank if you want to use the existing location as the flash recovery area.
Step 3> Run rconfig to convert the database from a single-instance database to a cluster database:
node1$ rconfig convert.xml
Check the rconfig log file while the conversion is going on:
oracle@node1$ ls -lrt $ORACLE_BASE/cfgtoollogs/rconfig/*.log
Check that the database has been converted successfully:
node1$ srvctl status database -d orcl
What is CHM ( cluster health Monitor ) ?
The
Oracle Grid Cluster Health Monitor (CHM) stores operating system metrics in the
CHM repository for all nodes in a RAC cluster. It stores information on CPU,
memory, process, network and other OS data. This information can later be
retrieved and used to troubleshoot and identify any cluster related issues. It
is a default component of the 11gr2 grid install. The data is stored in the
master repository and also replicated to a standby repository on a different
node. The Cluster Health Monitor (CHM), formerly known as the Instantaneous Problem Detector for Clusters (IPD/OS), is designed to detect and analyze operating system and cluster resource related degradation and failures, in order to bring more explanatory power to many issues that occur in clusters in which Oracle Clusterware and/or Oracle RAC are used, e.g. node evictions. It is independent of Oracle Clusterware and Oracle RAC in the current release.
What are the processes and
components for the Cluster Health Monitor?
Cluster Logger Service (ologgerd) – there is a master ologgerd that receives the data from the other nodes and saves it in the repository (a Berkeley DB database). It compresses the data before persisting it to save disk space. In an environment with multiple nodes, a replica ologgerd is also started on a node where the master ologgerd is not running. The master ologgerd syncs the data with the replica ologgerd by sending the data to it. The replica ologgerd takes over if the master ologgerd dies, and a new replica ologgerd starts when the replica ologgerd dies. There is only one master ologgerd and one replica ologgerd per cluster.
System Monitor Service (sysmond) – the sysmond process collects the system statistics of the local node and sends the data to the master ologgerd. A sysmond process runs on every node and collects system statistics including CPU, memory usage, platform info, disk info, NIC info, process info, and filesystem info.
Locate CHM log directory
Check CHM resource status and locate Master Node
[grid@grac41 ~] $ $GRID_HOME/bin/crsctl status res ora.crf -init
NAME=ora.crf
TYPE=ora.crf.type
TARGET=ONLINE
STATE=ONLINE on grac41
[grid@grac41 ~]$ oclumon manage -get MASTER
Master = grac43
Login into grac43 and located CHM log directory ( ologgerd process )
[root@grac43 ~]# ps -elf |grep ologgerd | grep -v grep
.... /u01/app/11204/grid/bin/ologgerd -M -d /u01/app/11204/grid/crf/db/grac43
What is ora.crf?
ora.crf is the Cluster Health Monitor resource name on Oracle Clusterware 11gR2 (new in 11.2.0.2); it is managed by ohasd.
$ crsctl stat res ora.crf -init
You can check its processes as follows:
$ ps -aef | grep osysmond
$ ps -aef | grep ologgerd
Crash Recovery Vs Instance Recovery
When an instance fails suddenly, due to a power outage or a shutdown abort, the instance requires recovery during the next startup. Oracle performs crash recovery upon restarting the database.
Crash recovery involves two steps: cache recovery and transaction recovery.
Cache recovery (roll forward): the committed and uncommitted data from the online redo log files are applied to the datafiles.
Transaction recovery (roll back): the uncommitted data is rolled back from the datafiles.
In a RAC environment, one of the surviving instances performs the crash recovery of the failed instance; this is known as instance recovery. In a single-instance database, crash recovery and instance recovery are synonymous.
What methods are available to keep the time synchronized on all nodes in the cluster?
In a RAC environment it is essential that all nodes are in sync in time. It is also recommended to keep the time in sync between primary and standby. There are three options to ensure that the time across nodes is in sync:
1) Windows Time Service
2) Network Time Protocol (NTP)
3) Oracle Cluster Time Synchronization Service (CTSS)
If one of the first two is available on a RAC node, CTSS starts in observer mode.
If neither is found, CTSS starts in active mode and synchronizes time across cluster nodes without an external server.
To check whether NTP is running on a server, use the ps command:
oracle:dev1:/home/oracle$ ps -ef | grep ntp
ntp 39013 1 0 Apr05 ? 00:03:34 ntpd -u ntp:ntp -p /var/run/ntpd.pid -x
oracle 93293 93226 0 12:54 pts/4 00:00:00 grep ntp
For RAC, NTP has to run with the -x option. This option means that time corrections are applied gradually in small steps; this is also called slewing.
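On systems where CTSS is being used instead of NTP, its mode can be confirmed with crsctl; a quick check:
$ crsctl check ctss    # reports whether CTSS is in Active or Observer mode (and, in active mode, the time offset)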
Q1. What is Oracle Real Application
Clusters?
Oracle RAC enables you to cluster Oracle databases. Oracle RAC uses Oracle Clusterware for the infrastructure to bind multiple servers so they operate as a single system. Oracle Clusterware is a portable cluster management solution that is integrated with Oracle Database.
Q2. What are the file storage options
provided by Oracle Database for Oracle RAC?
The file storage options provided by Oracle Database for Oracle RAC are:
· Automatic Storage Management (ASM)
· OCFS2 and Oracle Cluster File System (OCFS)
· A network file system
· Raw devices
Q3. What is a CFS?
A cluster File System (CFS) is a file system that may be accessed (read and write) by all members in a cluster at the same time. This implies that all members of a cluster have the same view.
Q4. What is cache fusion?
In a RAC environment, it is the combining of data blocks, which are shipped across the interconnect from remote database caches (SGA) to the local node, in order to fulfill the requirements for a transaction (DML, Query of Data Dictionary).
Q5. What is split brain?
When database nodes in a cluster are unable to communicate with each other, they may continue to process and modify the data blocks independently. If the
same block is modified by more than one instance, synchronization/locking of the data blocks does not take place and blocks may be overwritten by others in the cluster. This state is called split brain.
Q6. What methods are available to keep the
time synchronized on all nodes in the cluster?
Either the Network Time Protocol(NTP) can be configured or in 11gr2, Cluster Time Synchronization Service (CTSS) can be used.
Q7. Where are the Clusterware files stored on
a RAC environment?
The Clusterware binaries are installed on each node (in the Clusterware/Grid home), and the Clusterware files reside on the shared disks (the voting disks and the OCR).
Q8. What command would you use to check the
availability of the RAC system?
crs_stat -t -v (-t -v are optional)
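Note that crs_stat is deprecated from 11gR2 onward; as a sketch, the equivalent checks with the newer syntax are:
$ crsctl stat res -t          # resource status across the cluster
$ crsctl check cluster -all   # clusterware stack health on all nodes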
Q9. What is the minimum number of
instances you need to have in order to create a RAC?
You can create a RAC with just one server.
Q10. Name two specific RAC
background processes
RAC processes are: LMON, LMDx, LMSn, LKCx and DIAG.
Q11. What files components in RAC must reside
on shared storage?
Spfiles, ControlFiles, Datafiles and Redolog files should be created on shared storage.
Q12. Where does the Clusterware write when
there is a network or Storage missed heartbeat?
Missed network and disk heartbeats are logged by CSSD in the ocssd log and the Clusterware alert log under $CRS_HOME/log.
Q13. How do you find out what OCR backups are
available?
The ocrconfig -showbackup can be run to find out the automatic and manually run backups.
Q14. What is the interconnect used for?
It is a private network which is used to ship data blocks from one instance to another for cache fusion. The physical data blocks as well as data dictionary blocks are shared across this interconnect.
Q15. How do you determine what protocol is
being used for Interconnect traffic?
One of the ways is to look at the database alert log for the time period when the database was started up.
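A related check is to query the interconnect configuration from the database; a quick sketch (this shows which interfaces are in use, while the protocol itself is reported in the alert log at startup):
SQL> select name, ip_address, is_public, source from gv$cluster_interconnects;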
Q16. If your OCR is corrupted what options do
have to resolve this?
You can use either the logical or the physical OCR backup copy to restore the Repository.
Q17. What is hangcheck timer used for ?
The hangcheck timer regularly checks the health of the system. If the system hangs or stops, the node is restarted automatically.
There are 2 key parameters for this module:
· hangcheck-tick: this parameter defines the period of time between checks of system health. The default value is 60 seconds; Oracle recommends setting it to 30 seconds.
· hangcheck-margin: this defines the maximum hang delay that should be tolerated before hangcheck-timer resets the RAC node.
Q18. What is the difference between Crash
recovery and Instance recovery?
When an instance crashes in a single-node database, a crash recovery takes place on startup. In a RAC environment the same recovery for a failed instance is performed by a surviving node; this is called instance recovery.
Q19. How do we know which database instances
are part of a RAC cluster?
You can query the V$ACTIVE_INSTANCES view to determine the member instances of the RAC cluster.
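A quick sketch of such a query (INST_NUMBER and INST_NAME are the documented columns of the view):
SQL> select inst_number, inst_name from v$active_instances;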
Q20. What it the ASM POWER_LIMIT?
This is the parameter which controls the number of allocation units the ASM instance will rebalance at any given time (the rebalance power). The default value is 1; the maximum value was 11 in releases before 11.2.0.2 and was raised to 1024 from 11.2.0.2 onward.
Q21. What is a rolling upgrade?
A patch is considered rolling if it can be applied to the cluster binaries without having to shut down the whole database in a RAC environment. All nodes in the cluster are patched one by one, with only the node being patched unavailable while all other instances remain open.
Q22. What is the default memory allocation
for ASM?
In 10g the default ASM SGA size is 1G, in 11g it is 256M, and in 12c it is set back to 1G.
Q23. How do you find out what object has its
blocks being shipped across the instance the most?
You can use the DBA_HIST_SEG_STAT view (AWR segment statistics).
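As a live (non-AWR) alternative, the global segment statistics view can be queried directly; a minimal sketch, assuming the interest is in CR blocks received over the interconnect:
SQL> select owner, object_name, inst_id, value
       from gv$segment_statistics
      where statistic_name = 'gc cr blocks received'
      order by value desc;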
Q24. What is a VIP in RAC use for?
The VIP is an alternate Virtual IP address assigned to each node in a cluster. During a node failure the VIP of the failed node moves to the surviving node and relays to the application that the node has gone down. Without VIP, the application will wait for TCP timeout and then find out that the session is no longer live due to the failure.
Q25. What components of the Grid should I
back up?
The backups should include OLR, OCR and ASM Metadata.
Q26. Is there an easy way to verify the
inventory for all remote nodes?
You can run the opatch lsinventory -all_nodes command from a single node to look at the inventory details for all nodes in the cluster.
Q27. How do you backup ASM Metadata?
You can use md_backup to back up the ASM diskgroup metadata and md_restore to recreate the diskgroup configuration in case of ASM diskgroup storage loss.
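A minimal sketch with asmcmd (the diskgroup name and backup file path are examples, not from the original text):
ASMCMD> md_backup /tmp/dg_data_backup -G DATA
ASMCMD> md_restore /tmp/dg_data_backup --full -G DATA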
Q28. What files can be stored in the ASM
diskgroup?
In 11g the following files can be stored in ASM diskgroups.
· Datafiles
· Redo logfiles
· Spfiles
In 12c the password file can also be stored in an ASM diskgroup.
Q29. What is OCLUMON used for in a cluster
environment?
The Cluster Health Monitor (CHM) stores operating system metrics in the CHM repository for all nodes in a RAC cluster. It stores information on CPU, memory, process, network and other OS data. This information can later be retrieved and used to troubleshoot and identify cluster related issues. It is a default component of the 11gR2 Grid install. The data is stored in the master repository and replicated to a standby repository on a different node.
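CHM data is retrieved with the oclumon utility; a minimal sketch (the time window is an arbitrary example):
$ oclumon dumpnodeview -allnodes -last "00:15:00"   # node metrics for the last 15 minutes
$ oclumon manage -get REPPATH                       # location of the CHM repository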
Q30. What would be the possible performance
impact in a cluster if a less powerful node (e.g. slower CPU’s) is added to the
cluster?
All processing will slow down to the speed of the slowest server.
Q31. What is the purpose of OLR?
The Oracle Local Registry (OLR) contains the information that allows the cluster processes to be started up when the OCR is kept in ASM storage. Since ASM is unavailable until the Grid processes are started, a local copy of the startup-relevant contents of the OCR is required, and this is stored in the OLR.
Q32. What are some of the RAC specific
parameters?
Some of the RAC parameters are:
· CLUSTER_DATABASE
· CLUSTER_DATABASE_INSTANCES
· INSTANCE_TYPE (RDBMS or ASM)
· ACTIVE_INSTANCE_COUNT
· UNDO_MANAGEMENT
Q33. What is the future of the Oracle Grid?
The Grid software is becoming more and more capable of not just supporting HA for Oracle Databases but also other applications including Oracle’s applications. With 12c there are more features and functionality built-in and it is easier to deploy these pre-built solutions, available for common Oracle applications.
OCR Backup and Recovery in Oracle RAC
OCR stands for Oracle Cluster Registry. It stores cluster configuration information. It is a shared-disk component and must be accessible by all nodes in the cluster. It also keeps track of which database instance runs on which node and which service runs on which database. The CRSd daemon manages the configuration information in the OCR and maintains the changes to the cluster in the registry.
Automatic Backup of OCR:
Automatic backup of the OCR is done by the CRSD process every 4 hours; daily and weekly backups are retained as well. The default location is CRS_home/cdata/cluster_name, but this default backup location can be changed. We can check the existing backups and their location using the following command:
$ ocrconfig -showbackup
We can change the default location of the physical OCR copies using the following command:
$ ocrconfig -backuploc <new_backup_directory>
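A manual physical backup can also be taken on demand; a sketch, with an example backup directory:
# ocrconfig -backuploc /u02/crs_backup/ocr   # example location; choose storage that is itself backed up
# ocrconfig -manualbackup                    # take an on-demand physical backup
# ocrconfig -showbackup manual               # list manually taken backups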
How to take PHYSICAL Backup of OCR?
First check the exact location of the automatic OCR backups using the "ocrconfig -showbackup" command; these automatic backups are physical copies of the Oracle Cluster Registry. There is no need to bring the cluster down to take a physical backup of the OCR. Use a simple operating system copy command to copy the physical OCR copies to the backup destination, as shown below.
$ cp -p -R /u01/app/crs/cdata /u02/crs_backup/ocrbackup/RACNODE1
How to take MANUAL EXPORT Backup of OCR?
We can take an export backup of the OCR (Oracle Cluster Registry) online as well; there is no need to bring the cluster down. A manual export backup can be taken using the "ocrconfig -export" command as follows.
$ ocrconfig -export /u04/crs_backup/ocrbackup/exports/OCRFile_expBackup.dmp
How to Recover OCR from PHYSICAL Backup?
Recovering the OCR from an automatic physical backup requires that the clusterware, the RAC instances and the RAC database are all brought down before performing the recovery. A command reference for recovery of the OCR from a physical backup copy:
$ ocrconfig -showbackup
$ srvctl stop database -d RACDB (shut down all RAC instances and the RAC database)
$ crsctl stop crs (shut down the clusterware)
# rm -f /u01/oradata/racdb/OCRFile
# cp /dev/null /u01/oradata/racdb/OCRFile
# chown root /u01/oradata/racdb/OCRFile
# chgrp oinstall /u01/oradata/racdb/OCRFile
# chmod 640 /u01/oradata/racdb/OCRFile
# ocrconfig -restore /u02/apps/crs/cdata/crs/backup00.ocr
$ crsctl start crs (after starting the cluster, check its status using 'crs_stat -t')
$ srvctl start database -d RACDB (start the Oracle RAC database and RAC instances)
How to Recover OCR from EXPORT Backup?
We can import the OCR metadata from an export dump. Before importing, stop the Oracle RAC database, the RAC instances and the clusterware. Remove the OCR partition as well as the OCR mirror partition, recreate them using the 'dd' command, and then import the OCR metadata from the export dump file. Example commands:
$ srvctl stop database -d RACDB (shut down all RAC instances and the RAC database)
$ crsctl stop crs (shut down the clusterware)
# rm -f /u01/oradata/racdb/OCRFile
# dd if=/dev/zero of=/u01/oradata/racdb/OCRFile bs=4096 count=65587
# chown root /u01/oradata/racdb/OCRFile
# chgrp oinstall /u01/oradata/racdb/OCRFile
# chmod 640 /u01/oradata/racdb/OCRFile
The SAME steps need to be repeated for the OCR mirror.
# ocrconfig -import /u04/crs_backup/ocrbackup/exports/OCRFile_exp_Backup.dmp (import the OCR metadata)
$ crsctl start crs (after starting the cluster, check its status using 'crs_stat -t')
$ srvctl start database -d RACDB (start the Oracle RAC database and RAC instances)
Remember the following important things:
· Oracle takes physical backups of the OCR automatically.
· No cluster or RAC database downtime is required for a PHYSICAL backup of the OCR.
· No cluster or RAC database downtime is required for a MANUAL export backup of the OCR.
· Recovery of the OCR from any of the above backups requires bringing everything down.
· All of the above procedures require ROOT login.
OCR backups need to be monitored constantly, because the OCR is a critical part of Oracle RAC; make OCR backups part of routine Oracle RAC database administration.
How to change the hostname in RAC after installation?
Suppose you want to change the hostname, IP address or DNS configuration on a server where Oracle is running on ASM. The steps below were written for an installation that splits the ownership of the Grid Infrastructure and the database between a user named ORAGRID and a user named ORADB respectively. Make sure you run the commands below as the right user.
Step 1:
Check the existing configured resources (run from the Grid home environment):
[oragrid@litms#### ~]$ crs_stat -t
Step 2:
Before starting the Oracle Restart process you must stop the listener:
[oragrid@litms#### ~]$ lsnrctl stop listener
Step 3:
Confirm that the listener is stopped:
[oragrid@litms#### ~]$ crs_stat -t
Step 4:
Log in as ROOT and set ORACLE_HOME to the Grid home.
Execute the command below to remove the existing Oracle Grid Infrastructure configuration:
[root@litms#### ~]# $ORACLE_HOME/perl/bin/perl -I $ORACLE_HOME/perl/lib -I $ORACLE_HOME/crs/install
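The script name appears to be truncated in the command above. For an Oracle Restart (single-node Grid Infrastructure) home, the script normally invoked at this point is roothas.pl; a hedged sketch, assuming that setup:
# $ORACLE_HOME/perl/bin/perl -I $ORACLE_HOME/perl/lib -I $ORACLE_HOME/crs/install $ORACLE_HOME/crs/install/roothas.pl -deconfig -force   # assumption: Oracle Restart deconfiguration (a full cluster would use rootcrs.pl instead)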
Step 5:
After removing the Oracle configuration you can change the hostname of your server:
Edit the /etc/sysconfig/network file
Edit the /etc/hosts file
[root@### ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
Step 6:
Edit the listener.ora file with the new hostname
Step 7:
Log in as ROOT and set ORACLE_HOME to the Grid home.
Execute the command below to recreate the Grid Infrastructure configuration:
[root@litms####~]# $ORACLE_HOME/perl/bin/perl -I $ORACLE_HOME/perl/lib -I $ORACLE_HOME/crs/install
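As in Step 4, the script name appears truncated; assuming an Oracle Restart home, the reconfiguration is normally done by running roothas.pl without the deconfig options (a hedged sketch):
# $ORACLE_HOME/perl/bin/perl -I $ORACLE_HOME/perl/lib -I $ORACLE_HOME/crs/install $ORACLE_HOME/crs/install/roothas.pl   # assumption: Oracle Restart (re)configuration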
Step 8:
Add the listener and start it:
[oragrid@litms### ~]$ srvctl add listener
[oragrid@litms#### ~]$ srvctl start listener
Step 9:
Create ASM, add the disks, and mount all diskgroups manually:
[oragrid@litms#### disks]$ srvctl add asm -d '/dev/oracleasm/disks/*'
[oragrid@litms#### disks]$ srvctl start asm
[oragrid@litmsj614 disks]$ sqlplus / as sysasm
[oragrid@litms#### disks]$ srvctl status diskgroup -g rpst02data
Step 10:
Configure all databases with SRVCTL (Oracle Restart); configure and start your database:
[oragrid@litms#### disks]$ srvctl add database -d RPST02 -o $ORACLE_HOME -n RPST02 -p
Change Public/SCAN/Virtual IP/Name in 11g/12c RAC
When working with a Real Application Clusters database, changing the infrastructure properties is a bit tricky, if not difficult. There is a dependency chain across several network and name properties, and the dependent components on the Oracle side also need to be modified.
I recently undertook this exercise for one of our RAC clusters, which resulted in this post. The use cases I have covered are as follows.
Case I. Changing Public Host-name
The public hostname is recorded in the OCR; it is entered during the installation phase and cannot be modified after the installation. The only way to modify the public hostname is to delete the node and then add it back with a new hostname, or to reinstall the clusterware.
Case II. Changing Public IP Only Without
Changing Interface, Subnet or Netmask
If the change is only the public IP address and the new addresses are still in the same subnet, nothing needs to be done at the clusterware layer; all changes are done at the OS layer to reflect the change.
1. Shutdown Oracle Clusterware stack
2. Modify the IP address at network layer, DNS and /etc/hosts file to reflect the change
3. Restart Oracle Clusterware stack
The above change can be done in a rolling fashion, e.g. one node at a time.
Case III. Changing SCAN / SCAN IP
SCAN is used to access the cluster as a whole from Oracle database clients and can redirect a connection request to any available node in the cluster where the requested service is running. It is a cluster resource and can fail over to any other node if the node where it is running fails. The SCAN entry is stored in the OCR and its IP addresses are configured at DNS level.
So to change the SCAN IP/name, one first has to populate the changes in DNS. Once the changes are in effect, the SCAN resource in the OCR can be modified as follows. Remember that SCAN acts as the cluster entry point and handles connection load balancing, so restarting SCAN requires a brief outage; existing connections, however, are not impacted.
[oracle@dbrac2 ~]$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node dbrac1
[oracle@dbrac2 ~]$ srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is running on node dbrac1
[oracle@dbrac2 ~]$ srvctl config scan
SCAN name: dbrac-scan.localdomain, Network: 1
Subnet IPv4: 192.168.2.0/255.255.255.0/eth1, static
Subnet IPv6:
SCAN 0 IPv4 VIP: 192.168.2.110
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes:
SCAN VIP is individually disabled on nodes:
[oracle@dbrac2 ~]$ srvctl stop scan_listener
[oracle@dbrac2 ~]$ srvctl stop scan
[oracle@dbrac2 ~]$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is not running
[oracle@dbrac2 ~]$ srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is not running
-- MODIFY THE SCAN IP AT OS LEVEL
[root@dbrac2 ~]$ srvctl modify scan -scanname dbrac-scan.localdomain
[oracle@dbrac2 ~]$ srvctl config scan
SCAN name: dbrac-scan.localdomain, Network: 1
Subnet IPv4: 192.168.2.0/255.255.255.0/eth1, static
Subnet IPv6:
SCAN 0 IPv4 VIP: 192.168.2.120
SCAN VIP is enabled.
[oracle@dbrac2 ~]$ srvctl start scan_listener
[oracle@dbrac2 ~]$ srvctl start scan
Since this is my test cluster, I have configured only one SCAN VIP; regardless, the process remains the same for a SCAN with 3 IPs.
Case IV. Changing Virtual IP / Virtual Hostname
-- CHANGING NODE VIP FROM 192.168.2.103 TO 192.168.2.203 ON DBRAC1
-- SINCE NODE VIP IS PART OF NODE APPS ONE NEEDS TO MODIFY THE IP ADDRESS ON OS LEVEL AND THEN USE SRVCTL TO MODIFY NODEAPPS
[oracle@dbrac1 automation]$ srvctl config vip -node dbrac1
VIP exists: network number 1, hosting node dbrac1
VIP Name: dbrac1-vip.localdomain
VIP IPv4 Address: 192.168.2.103
VIP IPv6 Address:
VIP is enabled.
VIP is individually enabled on nodes:
VIP is individually disabled on nodes:
[oracle@dbrac1 automation]$ srvctl stop vip -node dbrac1
PRCR-1065 : Failed to stop resource ora.dbrac1.vip
CRS-2529: Unable to act on 'ora.dbrac1.vip' because that would require stopping or relocating 'ora.LISTENER.lsnr', but the force option was not specified
[oracle@dbrac1 automation]$ srvctl stop vip -node dbrac1 -force
[oracle@dbrac1 automation]$ srvctl status vip -node dbrac1
VIP dbrac1-vip.localdomain is enabled
VIP dbrac1-vip.localdomain is not running
-- NOW MODIFY THE ADDRESS OF NODE VIP ON OS LEVEL USING EITHER /etc/hosts OR DNS.
-- Once done, use SRVCTL to modify OCR resource.
-- Here I am not changing the name, but only IP
[oracle@dbrac1 automation]$ srvctl modify nodeapps -node dbrac1 -address dbrac1-vip.localdomain/255.255.255.0/eth1
[oracle@dbrac1 automation]$ srvctl config vip -node dbrac1
VIP exists: network number 1, hosting node dbrac1
VIP Name: dbrac1-vip.localdomain
VIP IPv4 Address: 192.168.2.203
VIP IPv6 Address:
VIP is enabled.
VIP is individually enabled on nodes:
VIP is individually disabled on nodes:
[root@dbrac2 ~]# srvctl config nodeapps
[oracle@dbrac1 automation]$ srvctl start vip -node dbrac1
[root@dbrac2 ~]# srvctl status nodeapps
-- ON NODE2
[oracle@dbrac2 ~]$ srvctl stop vip -node dbrac2 -force
[oracle@dbrac2 ~]$ srvctl status vip -node dbrac2
VIP dbrac2-vip.localdomain is enabled
VIP dbrac2-vip.localdomain is not running
[oracle@dbrac2 ~]$ srvctl modify nodeapps -node dbrac2 -address dbrac2-vip.localdomain/255.255.255.0/eth1
[oracle@dbrac2 ~]$ srvctl config vip -n dbrac2
VIP exists: network number 1, hosting node dbrac2
VIP Name: dbrac2-vip.localdomain
VIP IPv4 Address: 192.168.2.204
VIP IPv6 Address:
VIP is enabled.
VIP is individually enabled on nodes:
VIP is individually disabled on nodes:
[oracle@dbrac2 ~]$ srvctl start vip -node dbrac2
[oracle@dbrac2]$ srvctl status nodeapps
VIP dbrac1-vip.localdomain is enabled
VIP dbrac1-vip.localdomain is running on node: dbrac1
VIP dbrac2-vip.localdomain is enabled
VIP dbrac2-vip.localdomain is running on node: dbrac2
Network is enabled
Network is running on node: dbrac1
Network is running on node: dbrac2
ONS is enabled
ONS daemon is running on node: dbrac1
ONS daemon is running on node: dbrac2
[oracle@dbrac1 automation]$ crs_stat -t | grep vip
ora.dbrac1.vip ora....t1.type ONLINE ONLINE dbrac1
ora.dbrac2.vip ora....t1.type ONLINE ONLINE dbrac2
ora.scan1.vip ora....ip.type ONLINE ONLINE dbrac1
A special case for 11gR2 VIP Name Change -
modifying the VIP hostname only without changing the IP address.
For example: only VIP hostname changes from dbrac1-vip to dbrac1-nvip, IP and other attributes remain the same.
If the IP address is not changed, the above modify command will not change the USR_ORA_VIP value in the 'crsctl stat res ora.dbrac1.vip -p' output. Use the following command instead:
# crsctl modify res ora.dbrac1.vip -attr USR_ORA_VIP=ora.dbrac1.nvip
Verify the changes for USR_ORA_VIP field:
# crsctl stat res ora.dbrac1.vip -p |grep USR_ORA_VIP
Three important flags for the crsctl stat res command are as follows.
-p Print static configuration
-v Print runtime configuration
-f Print full configuration
Did you ever lose your Grid Infrastructure diskgroup? Or: what if the disk is lost from the diskgroup which stores the OCR and voting files?
1. Status
The cluster stack does not come up since there are no voting disks anymore.
[root@oel6u4 ~]# crsctl stat res -t -init
2. Recreate the OCR diskgroup
First I made my disk available again using ASMLib:
[root@oel6u4 ~]# oracleasm listdisks
DATA
[root@oel6u4 ~]# oracleasm createdisk OCR /dev/sdc1
Writing disk header: done
Instantiating disk: done
[root@oel6u4 ~]# oracleasm listdisks
DATA
OCR
Now I need to restore my ASM diskgroup, but that requires a running ASM instance. So I stop the cluster stack and start it again in exclusive mode. By the way, "crsctl stop crs -f" did not finish, so I disabled the cluster stack by issuing "crsctl disable has" and rebooted.
[root@oel6u4 ~]# crsctl enable has
As you can see, the startup fails since "ora.storage" is not able to locate the OCR diskgroup. That means there is a timeframe of about 10 minutes to create the diskgroup during the startup of "ora.storage".
If I had made a backup of my ASM diskgroup metadata I could have used that, but I have not, so I create my OCR diskgroup from scratch. Start the CRS again and then do the following from a second session:
[root@oel6u4 ~]# cat ocr.dg
[root@oel6u4 ~]# asmcmd mkdg ~/ocr.dg
[root@oel6u4 ~]# asmcmd lsdg
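The contents of ocr.dg are not shown above. Purely as an illustration of the XML format that asmcmd mkdg expects, a minimal diskgroup definition could look like the following (the diskgroup name, redundancy and disk path are assumptions, not the author's actual file):
<dg name="OCR" redundancy="external">
  <!-- hypothetical disk path; use the device that was actually stamped with oracleasm createdisk -->
  <dsk string="/dev/oracleasm/disks/OCR"/>
</dg>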
3. Restore OCR
The next step is restoring the OCR from backup. Fortunately the clusterware creates backups of the OCR by itself right from the beginning.
[root@oel6u4 ~]# ocrconfig -showbackup
Just choose the most recent backup and use it to
restore the contents of OCR.
[root@oel6u4 ~]# ocrconfig -restore
/u01/app/grid/12.1.0.2/cdata/mycluster/backup00.ocr
[root@oel6u4 ~]# ocrcheck
4. Restore Voting Disk
Since the voting files are placed in ASM together with the OCR, the OCR backup contains a copy of the voting file as well. All I need to do is start CRSD and recreate the voting file.
[root@oel6u4 ~]# crsctl start res ora.crsd -init
[root@oel6u4 ~]# crsctl replace votedisk +OCR
Not good. The reason for that is that ASM does not have ASM_DISKSTRING configured. Actually, ASM has not a single parameter configured, because it was using an spfile stored in the OCR diskgroup as well, which means there is no spfile anymore. As a quick solution I set the parameter in memory.
[oracle@oel6u4 ~]$ sqlplus / as sysasm
alter system set asm_diskstring='/dev/oracleasm/disks/*' scope=memory;
With this small change I am now able to recreate
the voting file.
[root@oel6u4 ~]# crsctl replace votedisk +OCR
[root@oel6u4 ~]# crsctl query css votedisk
5. Restore ASM spfile
This is easy: I don't have a backup of my ASM spfile, so I recreate it from memory.
[oracle@oel6u4 ~]$ sqlplus / as sysasm
SQL> create spfile='+OCR' from memory;
The GPnP profile also gets updated by doing so.
6. Restart Grid Infrastructure
I have restored everything that I need to start the clusterware in normal operation mode. Let's see:
[root@oel6u4 ~]# crsctl stop crs -f
[root@oel6u4 ~]# crsctl start has -wait
You see, the GIMR (MGMTDB) is gone too. I will talk about that soon. First, let's see if all the other resources are running properly.
[root@oel6u4 ~]# crsctl stat res -t
7. Restore ASM password file
Since 12c the password file for ASM is stored inside ASM. Again, I have no backup, so I need to create it from scratch.
[oracle@oel6u4 ~]$ orapwd file=/tmp/orapwasm password=Oracle-1 force=y
[oracle@oel6u4 ~]$ asmcmd pwcopy --asm /tmp/orapwasm +OCR/pwdasm
The pwcopy command updates the GPnP profile to reflect this.
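To verify where the ASM password file now lives, asmcmd offers pwget; a quick check (the location shown should match the pwcopy target above):
ASMCMD> pwget --asm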
What is a Master Node in RAC
The master node in Oracle RAC is the node which is responsible for initiating the OCR backup.
The node-id of the master node is the lowest node-id among the nodes in the cluster; node-ids are assigned to the nodes in the order in which they join the cluster, and therefore the node which joins the cluster first is designated as the master node.
Tasks of the Master Node
· The crsd process of the master node is responsible for initiating the OCR backup.
· The master node is responsible for syncing the OCR local cache across the nodes.
· Only the crsd process on the master node updates the OCR on disk.
· In case of node eviction, if the cluster is divided into two equal sub-clusters, the sub-cluster containing the master node survives and the other sub-cluster is evicted.
How to identify the Master Node in RAC
1> Identify the node which performs the backup of the OCR:
$ ocrconfig -showbackup
2> Check the crsd logs from the various nodes:
cat crsd.trc | grep MASTER
3> Check the ocssd logs from the various nodes:
cat ocssd.trc | grep MASTER
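The exact trace file locations vary by release; on a 12c system with the default ADR layout, the checks could look like this (the paths are assumptions):
$ grep -i "OCR MASTER" $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/crsd*.trc
$ grep -i "master"     $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/ocssd*.trc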
What will happen if the Master Node is down?
A very obvious question: if the master node is down, will the OCR not be backed up?
When the OCR master (the crsd.bin process) stops or restarts for whatever reason, the crsd.bin on the surviving node with the lowest node number becomes the new OCR master.
Just to prove this, I restarted Node1, which was my current master node, checked the log file and did a manual OCR backup; you can see the result below:
2017-06-18 00:17:15.398 : CRSPE:1121937152: {2:46838:561} PE Role|State Update: old role [SLAVE] new [MASTER]; old state [Running] new [Configuring]
2017-06-18 00:17:15.398 : CRSPE:1121937152: {2:46838:561} PE MASTER NAME: node2
2017-06-18 00:17:15.403 : OCRMAS:1748141824: th_master:13: I AM THE NEW OCR MASTER at incar 7. Node Number 2
What is cluvfy?
The Cluster Verification Utility (CVU) performs system checks in
preparation for installation, patch updates, or other system changes. Using CVU
ensures that you have completed the required system configuration and
preinstallation steps so that your Oracle Grid Infrastructure or Oracle Real
Application Clusters (Oracle RAC) installation, update, or patch operation
completes successfully.
With Oracle Clusterware 11g release 2 (11.2), Oracle
Universal Installer is fully integrated with CVU, automating many CVU
prerequisite checks. Oracle Universal Installer runs all prerequisite checks
and associated fixup scripts when you run the installer. CVU can verify the primary cluster components during an operational
phase or stage. A component can be basic, such as free disk space, or it can be
complex, such as checking Oracle Clusterware integrity. For example, CVU can
verify multiple Oracle Clusterware subcomponents across Oracle Clusterware
layers. Additionally, CVU can check disk space, memory, processes, and other
important cluster components. A stage could be, for example, database
installation, for which CVU can verify whether your system meets the criteria
for an Oracle Real Application Clusters (Oracle RAC) installation. Other stages
include the initial hardware setup and the establishing of system requirements
through the fully operational cluster setup.
cluvfy stage {-pre|-post} stage_name stage_specific_options
[-verbose]
Valid stage options and stage names are:
-post hwos : post-check for hardware and operating system
-pre cfs : pre-check for CFS setup
-post cfs : post-check for CFS setup
-pre crsinst : pre-check for CRS installation
-post crsinst : post-check for CRS installation
-pre hacfg : pre-check for HA configuration
-post hacfg : post-check for HA configuration
-pre dbinst : pre-check for database installation
-pre acfscfg : pre-check for ACFS Configuration.
-post acfscfg : post-check for ACFS Configuration.
-pre dbcfg : pre-check for database configuration
-pre nodeadd : pre-check for node addition.
-post nodeadd : post-check for node addition.
-post nodedel : post-check for node deletion.
Example: post-installation checks for the hwos stage (hardware and operating system setup):
cluvfy stage -post hwos -n node_list [-verbose]
./runcluvfy.sh stage -post hwos -n node1,node2 -verbose
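Individual components can also be checked with the comp keyword; two commonly used checks as a sketch (the node names are examples):
cluvfy comp nodecon -n node1,node2 -verbose   # node connectivity over public and private networks
cluvfy comp ocr -n all -verbose               # OCR integrity on all nodes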
What is shared everything and
shared nothing storage architecture
A shared-everything architecture means that any given service knows everything and will fail if it doesn't. If you have a central database (or other similar service), then you likely have a shared-everything architecture. Shared-nothing means that no state is shared between services: if one service fails, nothing happens to the other services. Most applications generally start as a shared-everything architecture, but they don't have to; that's just been my experience. When you get to global scale, you want to be able to have random services fail, and while the application may run in a degraded state until the failed service comes back online, the rest of the application will continue to run.
A shared-nothing architecture is resilient when done properly. Everything it needs to know from outside the service is sent to the service in its "work request", which is probably signed or encrypted with a trusted key.
For example, suppose a user requests a resource from the resource-service. The resource-service will either have its own database of which users are allowed to request that specific resource, or the request will include an encrypted token providing the service with authentication and/or authorization. The resource-service doesn't have to call an authorization service (or look in a shared DB) and ask if the user is allowed to access the resource. This means that if the auth service were down and the user held a still-valid token issued before the auth service went down, the user could still retrieve the resource until the token expired.
When to use them?
Use a shared-everything architecture when you need to be highly consistent at the cost of resilience. In the example above, if the auth-service decided that the token was incorrect and revoked it, the resource-service couldn't know about it until it was informed by the auth-service; in that period of time, the user could have "illegally" retrieved the resource. This is akin to taking out several loans before they get the chance to all appear on your credit report and affect your ability to get a loan. Use shared-nothing when resilience is more important than consistency; if you can guarantee some kind of eventual consistency, then this might be the most scalable solution.
Difference between Admin-managed and Policy-managed databases?
Before discussing admin-managed and policy-managed databases, a short description of server pools.
Server pool: A server pool divides the cluster logically. Server pools are used when the cluster/database is policy managed. The database is assigned to a server pool and Oracle takes control of managing the instances: Oracle decides which instance runs on which node. There is no hard-coded rule to run an instance on a particular node as there is in administrator-managed databases.
Types of server pool: By default Oracle creates two server pools, i.e. the Free pool and the Generic pool. These pools are internally managed and have limitations on changing their attributes.
1. Free pool: When the cluster is created, all the nodes are assigned to the Free pool. Nodes from this pool are assigned to other pools as they are defined.
2. Generic pool: When we upgrade a cluster, all the nodes are assigned to the Generic pool. You cannot query the details of this pool. When an admin-managed database is created, a server pool is created with its name and assigned as a child of the Generic pool.
Here we will see the two types of configuration of the RAC database:
1. ADMINISTRATOR MANAGED Database
2. POLICY MANAGED Database
1. ADMINISTRATOR MANAGED Database: In an admin-managed database we hard-couple each instance of the database to a node, e.g. instance 1 will run on node 1 and instance 2 will run on node 2. Here the DBA is responsible for managing the instances of the database. When an admin-managed database is created, a server pool is created with its name and assigned as a child of the Generic pool.
This method of database management is generally used when there is a small number of nodes. If the number of nodes grows high, say beyond 20, the policy-managed approach should be used.
2. POLICY MANAGED Database: In a policy-managed database, the database is assigned to a server pool and Oracle takes control of managing the instances. The number of instances started for the database will be equal to the number of nodes currently assigned to the server pool. Any instance can run on any node, e.g. instance 1 can run on node 2 and instance 2 can run on node 1; there is no hard coupling of the instances to the nodes. A sketch of creating a server pool and a policy-managed database follows.
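A minimal sketch of setting up a policy-managed database with srvctl, using 12c-style options (the pool name, sizes and database name are assumptions; 11.2 uses the older -g/-l/-u option style instead):
$ srvctl add srvpool -serverpool mypool -min 2 -max 4 -importance 10   # define the server pool
$ srvctl add database -db orcl -oraclehome $ORACLE_HOME -serverpool mypool -dbtype RAC
$ srvctl start database -db orcl
$ srvctl config srvpool -serverpool mypool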