ORACLE RAC QUESTIONS

What is a Cluster?
In the database storage sense, a cluster is an optional method of storing table data: a group of tables that share the same data blocks, grouped together because they share common columns and are often used together.
In the hardware sense (the one relevant to RAC), a cluster is two or more computers that share resources and work together to form a larger logical computing unit. RAC and Oracle Parallel Server can be used to access Oracle from multiple nodes of a clustered system.
What is Oracle Clusterware?
Oracle Clusterware is software that enables servers to operate together as if they were one server. Each server looks like any standalone server; however, each server runs additional processes that communicate with each other, so the separate servers appear as one server to applications and end users.
Why RAC? What are the benefits of using RAC?
The benefits of Real Application Clusters:
- Ability to spread CPU load across multiple servers
- Continuous Availability / High Availability (HA)
  - Protection from single instance failures
  - Protection from single server failures
- RAC can take advantage of larger aggregate SGA sizes than can be accommodated by a single-instance commodity server.
What is the startup sequence in RAC?
init spawns init.ohasd (with respawn), which in turn starts OHASD (the Oracle High Availability Services Daemon). This daemon spawns 4 agent processes. The entire Oracle Clusterware stack and the services registered on the cluster come up automatically when a node reboots or when the cluster stack is started manually. The startup process is segregated into five (05) levels; at each level, different processes are started in sequence.
From 11g onwards the voting disk and OCR can be stored in ASM. At first sight this looks like a chicken-and-egg problem: the voting disk and OCR are the primary components required to start the clusterware, the clusterware then starts resources such as ASM, the listener and the database, yet the voting disk and OCR themselves live inside ASM and clusterware startup requires access to them. So what starts first, ASM or the clusterware? The sequence below resolves this.
1.
When a node of an Oracle Clusterware cluster start/restarts, OHASD is started by
platform-specific means. OHASD is
the root for bringing up Oracle Clusterware. OHASD has access to the OLR (Oracle Local Registry) stored on the local file
system. OLR provides
needed data to complete OHASD initialization. 2. OHASD brings up GPNPD and CSSD. CSSD has access to the GPNP Profile stored on the local file system. This profile
contains the following vital bootstrap data;
a. ASM Diskgroup Discovery String
b. ASM SPFILE location
(Diskgroup name)
c. Name of the ASM Diskgroup containing
the Voting Files
3. The Voting Files locations on ASM Disks are accessed by CSSD with well-known pointers in
the ASM Disk headers
and CSSD is able to
complete initialization and start or join an existing cluster.
4. OHASD starts
an ASM instance
and ASM can now
operate with CSSD initialized
and operating. The ASM instance uses special code to locate the contents of the
ASM SPFILE, assuming it is
stored in a Diskgroup.
5. With an ASM instance operating and its
Diskgroups mounted, access to Clusterware’s OCR is available to CRSD.
6. OHASD starts CRSD with access to the OCR in an ASM Diskgroup.
7. Clusterware completes
initialization and brings up other services under its control.
When the Clusterware starts, three files are involved:
1. OLR - The first file to be read and opened. It is local to the node and contains information about where the voting disk is stored and the information needed to start ASM (e.g. the ASM discovery string).
2. VOTING DISK - The second file to be opened and read; this depends only on the OLR being accessible. ASM starts after CSSD; ASM does not start if CSSD is offline (i.e. the voting file is missing).
How are voting disks stored in ASM? Voting disks are placed directly on ASM disks. Oracle Clusterware stores the voting files on the disks of the disk group chosen to hold them, but it does not rely on a running ASM instance (or a mounted diskgroup) to read and write them; it accesses the ASM disks directly. You can check for the existence of voting files on an ASM disk using the VOTING_FILE column of V$ASM_DISK. The fact that voting files do not depend on a diskgroup to be accessed does not mean the diskgroup is not needed; the diskgroup and the voting files are linked by their configuration.
3. OCR - Finally, the ASM instance starts and mounts all diskgroups, then the Clusterware daemon (CRSD) opens and reads the OCR, which is stored in a diskgroup. Once ASM is started, ASM itself does not depend on the OCR or OLR being online; ASM depends on CSSD (the voting disk) being online. There is an exclusive mode to start ASM without CSSD, but it is only for restoring the OCR or voting disk.
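A quick way to observe this layered startup on a running node is with the standard clusterware utilities (the output will of course vary by environment):
$ crsctl check crs --> status of OHASD, CRSD, CSSD and EVMD
$ crsctl stat res -t -init --> lower-stack resources such as ora.asm, ora.crsd, ora.cssd, ora.ctssd
$ cat /etc/oracle/olr.loc --> location of the OLR read by OHASD
$ crsctl query css votedisk --> voting file locations inside ASM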
What is VIP in RAC and what is VIP used for?
A virtual IP address, or VIP, is an alternate IP address that client connections use instead of the standard public IP address. To configure a VIP address, we need to reserve a spare IP address for each node, and the IP addresses must use the same subnet as the public network. Every node in an Oracle RAC cluster has an IP address and hostname managed by the operating system of that node. From Oracle 10g onwards, the listener is configured on the virtual IP. Using a virtual IP saves us from TCP/IP timeout problems, because Oracle Notification Service (ONS) maintains communication between the nodes and listeners. Once ONS finds any listener or node down, it notifies the other nodes and listeners. While a new connection is trying to reach the failed node or listener, the virtual IP of the failed node is automatically moved to a surviving node, and the session is established on another surviving node. This process does not wait for a TCP/IP timeout, so new connections get faster session establishment on the surviving nodes/listeners.
$ srvctl config vip -node NODE_NAME
# srvctl config nodeapps -n node_name
How to remove the VIP:
srvctl remove vip -i "vip_name_list" [-f] [-y] [-v]
How to start/stop/check the status of the VIP:
srvctl start/stop/status vip {-n node_name | -i vip_name} [-v]
How to enable/disable/show the configuration of the VIP:
srvctl enable/disable vip -i vip_name [-v]
srvctl config vip {-n node_name | -i vip_name}
Sometimes the VIP may not fail back to its original node. In that case we can use the command below to fail back the failed-over VIP (run on the destination node):
./crs_relocate vip_resource_name (the VIP will now go back to where it is configured to be)
What is SCAN
and SCAN listener ? Single Client Access Name (SCAN)
is a feature used in Oracle Real Application Cluster environments that provides
a single name for clients to access any Oracle Database running in a cluster.
You can think of SCAN as a cluster alias for the databases in the cluster. The benefit is that the client's connect information does not need to change if you add or remove nodes or databases in the cluster. During
Oracle Grid Infrastructure installation, SCAN listeners are created for as many
IP addresses as there are SCAN VIP addresses assigned to resolve to the SCAN.Oracle
recommends that the SCAN resolves to three VIP addresses, to provide high
availability and scalability. If the SCAN resolves to three addresses, then
three SCAN VIPs and three SCAN listeners are created. Each SCAN listener
depends on its corresponding SCAN VIP. The SCAN listeners cannot start until
the SCAN VIP is available on a node.
The addresses for the SCAN listeners
resolve either through an external domain name service (DNS), or through the Grid
Naming Service (GNS) within the cluster. SCAN listeners and SCAN VIPs can run
on any node in the cluster. If a node where a SCAN VIP is running fails, then
the SCAN VIP and its associated listener fails over to another node in the
cluster. If the number of available nodes within the cluster falls to less than
three, then one server hosts two SCAN VIPs and SCAN listeners. The SCAN
listener also supports HTTP protocol for communication with Oracle XML Database
(XDB). SCAN VIP is one of the resources you find
in the output of “crsctl status resource –t” command. Number of SCAN VIP’s you
notice will be the same as the number of SCAN LISTENERS in the setup.
SCAN VIP’s are
physical IP addresses that you allocate to SCAN listeners. In the example that
I use later in this blog, 192.168.122.5, 192.168.122.6, 192.168.122.7 are SCAN
VIP’s. If you identify that SCAN VIP’s are online in the output of “crsctl
status resource –t” command then IP addresses are online on the physical
network ports. Only when SCAN VIP’s are online we can start the SCAN listeners.
SCAN Listener is
the oracle component which starts running a service on the port (by default its
1521) using the SCAN VIP (IP address). So SCAN listener doesn’t start if SCAN
VIP is not online. This is the major difference between a SCAN listener and
SCAN VIP. The number of SCAN listeners you notice in the output will be the same as the number of SCAN VIPs that are ONLINE. The name given to the SCAN listener is referred to as the SCAN NAME and it is registered in the DNS server. In the example that follows, the SCAN name is "SCAN_LISTENER".
0 sec: When User1 tries to establish a session on the database with connection request C1, it hits the DNS server first. The DNS server resolves the name "SCAN_LISTENER" to the first IP, 192.168.122.5.
1. The C1 request reaches the first SCAN listener, SCAN1 (by default named "LISTENER_SCAN1"), which is running on the 192.168.122.5 SCAN VIP.
2. SCAN1, using details from the load balancing advisory (LBA), identifies the load on each node in the setup and routes the C1 request to the node with the least load.
3. In this case that happened to be NODE 2, and the C1 request is handled by the local listener on that node, which helps C1 establish a session on the instance on NODE 2.
5th sec: When User2 tries to establish a session on the database with connection request C2, it hits the DNS server first.
4. The DNS server now uses its round-robin algorithm and resolves the name "SCAN_LISTENER" to the second IP, 192.168.122.6.
5. The C2 request reaches the second SCAN listener, SCAN2 (by default named "LISTENER_SCAN2"), which is running on the 192.168.122.6 SCAN VIP.
6. SCAN2, using details from the LBA, identifies the load on each node in the setup and routes the C2 request to the node with the least load.
7. In this case that happened to be NODE 1, and the C2 request is handled by the local listener on that node, which helps C2 establish a session on the instance on NODE 1.
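To see how SCAN is actually configured in a given cluster, the standard srvctl commands below can be used; the tnsnames.ora entry is only an illustrative sketch (the SCAN name, port and service name are assumptions, not taken from a real setup):
$ srvctl config scan
$ srvctl config scan_listener
$ srvctl status scan_listener
# illustrative client tnsnames.ora entry pointing at the SCAN
PROD =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = SCAN_LISTENER)(PORT = 1521))
    (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = PROD))
  )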
What is HAIP in RAC ?
Oracle 11gR2 introduced the RAC Highly Available IP (HAIP) for the cluster interconnect to help eliminate a single point of failure. If a node in the cluster has only one network adapter for the private network and that adapter fails, the node will no longer be able to participate in cluster operations; it will not be able to perform its heartbeat with the cluster, and eventually the other nodes will evict it from the cluster. Similarly, if the cluster has only a single network switch for the interconnect and the switch fails, the entire cluster is compromised. The purpose of HAIP is to perform load balancing across all active interconnect interfaces and to fail over non-responsive interfaces to the available ones. HAIP can activate a maximum of four private interconnect connections. These private network adapters can be configured during the installation of Oracle Grid Infrastructure or after the installation using the oifcfg utility.
[oracle@host01
bin]$ ./oifcfg getif
eth0 192.168.56.0
global public
eth1 192.168.10.0
global cluster_interconnect
[oracle@host01
bin]$ ./crsctl stat res -t -init
ora.asm
ora.cluster_interconnect.haip
oracle@host01
bin]$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 08:00:27:98:EA:FE
inet addr:192.168.56.71 Bcast:192.168.56.255 Mask:255.255.255.0
eth1 Link encap:Ethernet HWaddr 08:00:27:54:73:8F
inet addr:192.168.10.1 Bcast:192.168.10.255 Mask:255.255.255.0
inet6 addr:
fe80::a00:27ff:fe54:738f/64 Scope:Link
eth1:1 Link encap:Ethernet HWaddr 08:00:27:54:73:8F
inet addr:169.254.225.190 Bcast:169.254.255.255 Mask:255.255.0.0
The entry for eth1 with IP address
192.168.10.1 is the way the NIC was configured on this system for the private
network. Notice the device listed as eth1:1 in the output above. It has been
given the 169.254.225.190 IP address.
Device
eth1:1 is RAC HAIP in action even though only one private network adapter
exists. HAIP uses the 169.254.*.* subnet. As such, no other network devices in
the cluster should be configured for the same subnet.
When
Grid Infrastructure is stopped, the ifconfig command will no longer show the
eth1:1 device. The gv$cluster_interconnects view shows the HAIP subnets for
each instance.
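As mentioned above, the HAIP addresses can be seen per instance from SQL; this is a plain query against the documented view:
SQL> select inst_id, name, ip_address, is_public from gv$cluster_interconnects order by inst_id;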
Why node eviction? OR Why split brain syndrome?
Node eviction/reboot is used for I/O fencing, to ensure that writes from I/O-capable clients can be cleared, avoiding potential corruption scenarios in the event of a network split, node hang, or some other fatal event in a clustered environment.
By definition, I/O fencing (a cluster industry technique) is the isolation of a malfunctioning node from a cluster's shared storage to protect the integrity of the data.
Who evicts/reboots the node?
The daemons for Oracle Clusterware (CRS) are started by init when the machine boots, viz. CRSD, OCSSD, EVMD, OPROCD (when vendor clusterware is absent), and OCLSOMON.
There are three fatal processes, i.e. processes whose abnormal halt or kill will provoke a node reboot:
1. ocssd.bin (runs as oracle)
2. oclsomon.bin (monitors OCSSD; runs as root)
3. oprocd.bin (I/O fencing in non-vendor clusterware environments; runs as root)
Other non-CRS processes capable of evicting:
- OCFS2 (if used)
- Vendor clusterware (if used)
- Operating system (panic)
4 Reasons for Node Reboot or Node Eviction in a Real Application Clusters (RAC) Environment
1. High load on the database server: In my experience, in roughly 70 to 80 out of 100 cases, high load on the system was the reason for the node eviction. One common scenario is that, under high load, the RAM and swap space of the DB node get exhausted, the system stops responding and finally reboots. So, every time you see a node eviction, start the investigation with /var/log/messages and analyze the OS Watcher logs (a few helper commands are listed after this section).
2. Voting disk not reachable: Another reason for a node reboot is that the clusterware is not able to access a minimum number of the voting files. When the node aborts for this reason, the node alert log will show the CRS-1606 error.
3. Missed network connection between nodes: In technical terms this is called a missed network heartbeat (NHB). Whenever there is a communication gap or no communication between nodes on the private network (interconnect), due to a network outage or some other reason, a node aborts itself to avoid a "split brain" situation. The most common (but not exclusive) cause of missed NHBs is network problems communicating over the private interconnect.
4. Database or ASM instance hang: Sometimes a database or ASM instance hang can cause a node reboot. In these cases the database instance hangs and is terminated afterwards, which causes either a cluster reboot or a node eviction. The DBA should check the alert logs of the database and ASM instances for any hang situation that might cause this issue.
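As referenced in reason 1 above, a few standard checks help when investigating an eviction (the log paths shown are the typical 11gR2 locations and may differ in your environment):
$ crsctl get css misscount --> network heartbeat timeout, in seconds
$ crsctl get css disktimeout --> voting disk I/O timeout, in seconds
Also review /var/log/messages, $GRID_HOME/log/<hostname>/alert<hostname>.log and $GRID_HOME/log/<hostname>/cssd/ocssd.log around the eviction time.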
SPLIT BRAIN SYNDROME
In an Oracle RAC environment all the instances/servers communicate with each other using high-speed interconnects on the private network. This private network interface, or interconnect, is redundant and is used only for inter-instance Oracle data block transfers. In the context of Oracle RAC, split brain occurs when the instance members in a RAC fail to ping/connect to each other via this private interconnect, but the servers are all physically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. Because of the lack of communication, each instance thinks that the other instance it cannot reach is down, and it needs to do something about the situation. The problem is that if we leave these instances running, the same block might get read and updated in these individual instances and there would be a data integrity issue, because the blocks changed in one instance would not be locked and could be overwritten by another instance.
Oracle has implemented checks for the split brain syndrome.
In RAC, if any node becomes inactive, or if other nodes are unable to ping/connect to a node in the RAC, then the node which first detects that one of the nodes is not accessible will evict that node from the RAC group. For example, if there are 4 nodes in a RAC cluster and node 3 becomes unavailable, and node 1 tries to connect to node 3 and finds it not responding, then node 1 will evict node 3 out of the RAC group and will leave only node 1, node 2 and node 4 in the RAC group to continue functioning. The split brain handling can become more complicated in large RAC setups. For example, suppose there are 10 RAC nodes in a cluster and 4 nodes are not able to communicate with the other 6, so 2 groups are formed in this 10-node RAC cluster (one group of 4 nodes and the other of 6 nodes). The nodes will quickly try to affirm their membership by locking the control file; the node that locks the control file will check the votes of the other nodes. The group with the larger number of active nodes gets the preference and the others are evicted. In practice I have usually seen this eviction issue with only one node getting evicted while the rest continue to function fine, so I cannot fully confirm the above from experience, but this is the theory behind it.
When we see that the node is evicted,
usually oracle rac will reboot that node and try to do a cluster
reconfiguration to include back the evicted node.You will see oracle error:
ORA-29740, when there is a node eviction in RAC. There are many reasons for a
node eviction like heart beat not received by the controlfile, unable to
communicate with the clusterware etc.
Why is an odd number of voting disks recommended for the cluster?
An odd number of voting disks is required for a proper clusterware configuration. A node must be able to access strictly more than half of the voting disks at any time. So, in order to tolerate the failure of n voting disks, there must be at least 2n+1 configured (n=1 means 3 voting disks). You can configure up to 31 voting disks, providing protection against 15 simultaneous disk failures. If you lose half or more of all of your voting disks, then nodes get evicted from the cluster, or nodes kick themselves out of the cluster; it does not threaten database corruption. Alternatively, you can use external redundancy, which means you are providing redundancy at the storage level using RAID.
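To list the voting files currently configured and confirm that the count is odd:
$ crsctl query css votedisk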
How to identify the master node for the cluster?
There are three ways we can find out the master node for the cluster:
1. Check on which node the OCR backups are happening.
2. Scan the ocssd logs on all the nodes.
3. Scan the crsd logs on all the nodes.
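For example (the exact log strings differ between versions, so treat the grep patterns as approximations rather than guaranteed output):
# ocrconfig -showbackup --> the node shown for the automatic backups is the current master
$ grep -i "master node" $GRID_HOME/log/<hostname>/cssd/ocssd.log
$ grep -i "ocr master" $GRID_HOME/log/<hostname>/crsd/crsd.log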
What are the RAC database background processes?
1) LMS: This background process copies read-consistent blocks from the holding instance's buffer cache to the requesting instance. LMSn also performs rollback on uncommitted transactions for blocks that are being requested for consistent read by another instance. This is the Cache Fusion part and the most active process; it handles the consistent copies of blocks that are transferred between instances. It receives requests from LMD to perform lock requests. There can be up to ten LMS processes running, and they can be started dynamically if demand requires it. They manage lock manager service requests for GCS resources and send them to a service queue to be handled by the LMSn process. LMS also handles global deadlock detection and monitors for lock conversion timeouts; as a performance gain you can increase this process's priority to make sure CPU starvation does not occur. This background process is also called Global Cache Services (GCS), which is the name you often see in wait events. The LMS process maintains records of the data file statuses and of each cached block by recording information in the Global Resource Directory (GRD). The LMS process also controls the flow of messages to remote instances, manages global data block access, and transmits block images between the buffer caches of different instances. This processing is part of the Cache Fusion feature. By default 2 LMS background processes are started.
2) LMON: LMON is responsible for monitoring all instances in a cluster to detect failed instances. The LMON process monitors global enqueues and resources across the cluster and performs global enqueue recovery operations. It manages instance deaths and performs recovery for the Global Cache Service (LMS). Joining and leaving instances are managed by LMON. LMON also manages all the global resources in the RAC database and registers the instance/database with the node-monitoring part of the cluster (CSSD). This background process is also called the Global Enqueue Monitor; the services LMON provides are also referred to as Cluster Group Services (CGS).
3) LCK: The Lock Process (LCK) manages local non-cache requests (row cache requests, library cache locks) and also manages shared resource requests across instances. It keeps a list of invalid and valid lock elements and, if needed, passes information to the GCS. LCK manages non-Cache Fusion resource requests such as library and row cache requests and lock requests that are local to the server, as well as instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery. Because the LMS process handles the primary function of lock management, only a single LCK process exists in each instance; there is only one LCK process per instance in RAC.
4) DIAG-- It regularly monitors the health of the instance.
Checks for instance hangs and deadlocks.
It captures diagnostic data for instance and process failures.
5) LMD: This background process manages access to blocks and global enqueues. Global deadlock detection and remote resource requests are also handled by LMD. LMD also manages lock requests for GCS/LMS. This background process is also called the Global Enqueue Service Daemon; in wait events you will see GES.
6) RMSn: These are the Oracle RAC Management processes. They perform manageability tasks for Oracle RAC, including the creation of resources related to Oracle RAC when new instances are added to the cluster.
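These processes are visible at the OS level; the instance name PROD1 below is just a placeholder:
$ ps -ef | egrep "ora_(lmon|lms|lmd|lck|diag)" | grep PROD1
Expect process names such as ora_lmon_PROD1, ora_lms0_PROD1, ora_lmd0_PROD1, ora_lck0_PROD1 and ora_diag_PROD1.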
Q. What is the private interconnect?
The private interconnect is the physical construct that allows inter-node communication. It can be a simple crossover cable with UDP, or it can be a proprietary interconnect with a specialized proprietary communications protocol. When setting up more than 2 nodes, a switch is usually needed. It provides the maximum performance for RAC, which relies on inter-process communication between the instances for the Cache Fusion implementation.
Using the dynamic view gv$cluster_interconnects:
SQL> select * from gv$cluster_interconnects;
Using the clusterware command oifcfg:
$ oifcfg getif
What is Cache Fusion in RAC?
With Cache Fusion, Oracle RAC transfers data blocks from the buffer cache of one instance to the buffer cache of another instance using the cluster's high-speed interconnect.
CR image: A consistent read (CR) block represents a consistent snapshot of the data from a previous point in time. Applying undo/rollback segment information produces consistent read versions.
Past image: A past image (PI) is created from an exclusive current block when another request comes in for an exclusive lock on the same block. Past images are not written to disk; after the latest version of the block is written to disk, all past images are discarded.
Example 1 [write-read]: Instance A is holding a data block in exclusive mode.
1. Instance B tries to access the same block for read purposes.
2. If the transaction has not yet been committed by Instance A, Instance A cannot send the current block to the requesting instance since the data is not yet committed, so it creates a consistent read version of that data block by applying undo to the block and sends it to the requesting instance.
3. Creating a CR image in RAC can come with some I/O overhead, because the undo data could be spread across instances; to build a CR copy of the block, the instance might have to visit undo segments on other instances and hence perform some extra I/O.
Now what actually happens inside: when any instance accesses a data block, GCS keeps track of it and records in the GRD that the latest copy of the block is with that instance, in our case Instance A. So when another instance (Instance B) asks for the same block, it can easily find that the block is with Instance A. The GRD also records that the block is being accessed in exclusive mode. So when another instance asks for a shared lock on it, GCS checks whether the transaction is committed or not; if not, Instance A creates a read-consistent image of that data block and sends it to the requesting instance. After shipping the block to the requesting instance, it also records in the GRD that the CR image is with Instance B, which holds a shared lock, while Instance A still holds the exclusive lock.
Example 2 [write-write]: In the case of write-write operations, the past image comes into the picture. Instance A is holding a data block in exclusive mode and Instance B is trying to access the same data block in exclusive mode too. Here Instance B needs the actual (current) block and not a consistent read version of the block. In this scenario the holding instance sends the actual block, but it is obliged to keep a past image of the block in its cache until that data block has been written to disk. In case of node failure or node crash, GCS is able to rebuild that data block using the PI images across the cluster. Once the data block is written to disk, all PIs can be discarded because they will not be needed for recovery after a crash.
Past image (PI) and consistent read (CR) image in the above examples:
The CR image is used in read-write contention: one instance holds the block in exclusive mode and a second instance requests read access to that block, so it does not need an exclusive lock; a consistent read image, consistent with the requesting query's SCN, is sufficient. When the first instance has acquired the exclusive lock and the second instance also wants the same block in exclusive mode, it is write-write contention. In this case there are two possibilities: either Instance A releases the lock on that block (if it no longer needs it) and lets Instance B read the block from disk, or Instance A creates a PI image of the block in its own buffer cache, makes the redo entries, and ships the block to the requesting instance via the interconnect.
Another way to state the difference: the CR image is shipped to the requesting instance, whereas the PI has to be kept by the holding instance after shipping the actual block. In order to facilitate Cache Fusion, we still need the buffer cache, the shared pool, and the undo tablespace just like a single-instance database; however, for Oracle RAC we need the buffer caches on all instances to appear to be global across the cluster.
To do this, we need the GRD (Global Resource Directory) to keep track of the resources in the cluster. There is no true concept of a master node in Oracle RAC, but each node that belongs to the cluster becomes the resource master for a subset of resources. The GCS processes (LMS) are responsible for shipping blocks through the interconnect.
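Cache Fusion activity shows up as the global cache ('gc') wait events; a simple way to see how much of it an instance is doing is a query like the following against the standard gv$system_event view:
SQL> select inst_id, event, total_waits, time_waited from gv$system_event where event like 'gc%' order by time_waited desc;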
Q. What is GRD in RAC?
The GRD (Global Resource Directory) is the internal database that records and stores the current status of the data blocks. It is maintained by the Global Cache Service (GCS) and the Global Enqueue Service (GES).
Global Enqueue Service Daemon (LMD):
- It holds the information on the locks on the buffers.
- The lock info is available in V$LOCK_ELEMENT and V$BH.LOCK_ELEMENT.
Global Cache Service Processes (LMSn):
- They provide the buffer from one instance to another instance.
- They do not know who has what type of buffer lock.
Whenever a data block is transferred out of a local cache to another instance's cache, the GRD is updated. The GRD resides in memory and is distributed throughout the cluster. It lists the master instance of all the buffers.
The GRD holds information such as:
1. SCN (system change number)
2. DBI (data block identifier)
3. Location of the block
4. Mode of the block:
   - null (N): there are no access rights on the block
   - shared (S): the block is shared across all instances
   - exclusive (X): access rights only for a particular instance
5. Role of the block:
   - local: the data block image is present in only one node
   - global: the data block image is present in multiple nodes
Types of data block image:
1. Current image: the updated data block value
2. Consistent image: a previous data block value
3. Past image: a GRD-tracked image that is used to rebuild the current image when an instance crashes
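The block images described above can be observed through the STATUS column of V$BH (xcur = exclusive current, scur = shared current, cr = consistent read, pi = past image). A simple illustrative query:
SQL> select status, count(*) from v$bh group by status;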
How to check the location of the OCR and voting disk in RAC?
Voting disk: It manages information about node membership. Each voting disk must be accessible by all nodes in the cluster. If any node is not passing its heartbeat to the other nodes or to the voting disk, then that node will be evicted.
To check the voting disk location: crsctl query css votedisk
OCR: It is created at the time of Grid installation. It stores the information needed to manage Oracle Clusterware and its components such as the RAC database, listener, VIP, SCAN IP and services.
To check the OCR location: ocrcheck
cat /etc/oracle/ocr.loc also gives you the location of the OCR.
What is I/O fencing?
I/O fencing is a mechanism to prevent uncoordinated access to the shared storage. This feature works even in the case of faulty cluster communications causing a split-brain condition. To provide high availability, the cluster must be capable of taking corrective action when a node fails. If a system in a two-node cluster fails, it stops sending heartbeats over the private interconnects and the remaining node takes corrective action. However, the failure of the private interconnects (instead of the actual nodes) would present identical symptoms and cause each node to determine that its peer has departed. This situation typically results in data corruption because both nodes attempt to take control of the data storage in an uncoordinated manner.
I/O fencing allows write access for members of the active cluster and blocks access to storage from non-members; even a node that is alive is unable to cause damage. Fencing is an important operation that protects processes from other nodes modifying the resources during node failures. When a node fails, it needs to be isolated from the other active nodes. Fencing is required because it is impossible to distinguish between a real failure and a temporary hang. Therefore, we assume the worst and always fence. (If the node is really down, it cannot do any damage; in theory, nothing is required, and we could just bring it back into the cluster with the usual join process.) Fencing, in general, ensures that I/O can no longer occur from the failed node. Some clusters use a fencing method called STONITH (Shoot The Other Node In The Head), which automatically powers off the failing server; this simply means the healthy nodes kill the sick node. Oracle Clusterware does not do this; instead, it simply tells the sick node to reboot. The node bounces itself and rejoins the cluster.
In versions before 11.2.0.2, Oracle Clusterware tried to prevent a split brain with a fast reboot (better: reset) of the server(s), without waiting for ongoing I/O operations or synchronization of the file systems. This mechanism was changed in version 11.2.0.2 (the first 11g Release 2 patch set). After deciding which node to evict, the Clusterware:
- attempts to shut down all Oracle resources/processes on the server
- stops itself on the node
- afterwards the Oracle High Availability Services Daemon (OHASD) will try to start the Cluster Ready Services (CRS) stack again; once the cluster interconnect is back online, all relevant cluster resources on that node will start automatically
- kills the node if stopping the resources or processes generating I/O is not possible (hanging in kernel mode, in the I/O path, etc.)
Generally, Oracle Clusterware uses two rules to choose which nodes should leave the cluster to assure cluster integrity:
- In configurations with two nodes, the node with the lowest node ID survives (the first node that joined the cluster); the other one is asked to leave the cluster.
- With more cluster nodes, the Clusterware tries to keep the largest sub-cluster running.
Why are raw devices faster than block devices?
Years ago raw devices were a lot faster than file systems; nowadays the difference has become much smaller. "Veritas file systems for Oracle" (I don't recall the exact name) is supposed to offer the same speed as raw devices. File systems, however, are a lot easier to administer than raw devices and give you more freedom.
Raw devices in conjunction with database applications can in some cases give more performance, because the operating system does only a minimum of work in deciding how data is written to and read from blocks, and the RDBMS takes all the responsibility for managing the space and dealing with the data as a character stream.
A raw partition is accessed in character mode, so I/O is faster than on a file system partition, which is accessed in block mode. With raw partitions you can do bulk I/Os.
What is the UDP protocol and why use it for the private interconnect?
UDP (User Datagram Protocol) is different from TCP/IP in that it does not have a built-in handshake dialogue (TCP/IP first performs connection setup, after which it starts transferring data; this is called handshaking). This means that UDP does not have the same data integrity, reliability and serialization guarantees as TCP/IP:
- UDP is unreliable; it does not guarantee data delivery to the destination side like TCP/IP does.
- UDP does not provide sequencing of data during transmission.
- UDP is far faster than TCP/IP, primarily because there is no overhead in establishing a handshake connection.
- The UDP protocol is used for high-impact communication areas such as domain name servers (DNS).
In RAC it is mainly used for Cache Fusion traffic.
To enable UDP on AIX for Oracle, you set the following UDP parameters:
udp_sendspace: set the udp_sendspace parameter to [(DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT) + 4096], but not less than 65536.
udp_recvspace: set the value of the udp_recvspace parameter to >= 4 * udp_sendspace.
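As a worked example, with an 8 KB block size and DB_FILE_MULTIBLOCK_READ_COUNT = 16: udp_sendspace = (8192 * 16) + 4096 = 135168 (above the 65536 floor), and udp_recvspace >= 4 * 135168 = 540672. On AIX these are typically set with the no command; the syntax below is a sketch and should be verified against your AIX release:
# no -p -o udp_sendspace=135168
# no -p -o udp_recvspace=540672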
Why was it not possible to keep the OCR and voting disk on ASM in 10g?
In Oracle 11g there is a great feature whereby we can put the OCR and voting disk files on ASM, but in earlier Oracle versions it was not possible. The reason is that in 11g the Clusterware can access the OCR and voting disk files even if the ASM instance is down, and it can then start the CRS and CSS services. In Oracle 10g, however, the ASM instance could not be brought up first, because on startup it would throw the error "ORA-29701: unable to connect to Cluster Manager": to bring the ASM instance up you first need to start the CRS services. So if the OCR and voting disk files resided on ASM, those services would have to be up already, which is not possible because ASM is not yet up and ASM itself depends on CRS.
What is OLR in RAC?
In 11gR2, in addition to the OCR, we have another component called the OLR (Oracle Local Registry) installed on each node in the cluster. It is a local registry for node-specific purposes. The OLR is not shared by other nodes in the cluster; it is installed and configured when the clusterware is installed.
Why is the OLR used and why was it introduced? In 10g we cannot store the OCR in ASM, and the clusterware uses the OCR to start up. But what happens when the OCR is stored in ASM, as in 11g? The OCR must be accessible to find out which resources need to be started, but if the OCR is on ASM it can't be read until ASM (which is itself a resource of the node, and that information is stored in the OCR) is up. To answer this, Oracle introduced the OLR:
- It is the first file used to start up the clusterware when the OCR is stored on ASM.
- Information about the resources that need to be started on a node is stored in an OS file called the Oracle Local Registry (OLR).
- Since the OLR is an OS file, it can be accessed by various processes on the node for read/write irrespective of the status of the cluster (up/down).
- When a node joins the cluster, the OLR on that node is read and the various resources, including ASM, are started on the node.
- Once ASM is up, the OCR is accessible and is used henceforth to manage all the cluster nodes. If the OLR is missing or corrupted, the clusterware can't be started on that node.
Where is the OLR located?
It is located at $GRID_HOME/cdata/<hostname>.olr. The location of the OLR is stored in /etc/oracle/olr.loc and used by OHASD.
What does the OLR contain?
The OLR stores:
- Clusterware version info
- Clusterware configuration
- Configuration of the various resources which need to be started on the node
The OLR stores data about ORA_CRS_HOME, the local host version, the active version, GPnP details, the latest OCR backup time and location, information about the OCR daily and weekly backup locations, the node name, etc. This information stored in the OLR is needed by OHASD to start or join a cluster.
Checking the status of the OLR file on each node:
$ ocrcheck -local
OCRDUMP is used to dump the contents of the OLR to the text terminal:
$ ocrdump -local -stdout
We can export and import the OLR file using OCRCONFIG:
$ ocrconfig -local -export <export_file_name>
$ ocrconfig -local -import <file_name>
We can even repair the OLR file if it is corrupted:
$ ocrconfig -local -repair -olr <filename>
The OLR is backed up at the end of the installation or an upgrade. After that, we need to back up the OLR manually; automatic backups are not supported for the OLR.
$ ocrconfig -local -manualbackup
To change the OLR backup location:
$ ocrconfig -local -backuploc <new_backup_location>
To restore the OLR:
$ crsctl stop crs
$ ocrconfig -local -restore <file_name>
$ ocrcheck -local
$ crsctl start crs
$ cluvfy comp olr --> to check the integrity of the restored OLR file
How to back up the OCR and voting disk in RAC
a) Oracle Clusterware (CRSD) automatically creates OCR backups every 4 hours.
b) A backup is created for each full day.
c) A backup is created at the end of each week.
d) Oracle Database retains the last three copies of the OCR.
To add a voting disk: # crsctl add css votedisk path_to_voting_disk
To remove a voting disk: # crsctl delete css votedisk path_to_voting_disk
Voting disks
In 11g Release 2 you no longer have to back up the voting disks. In fact, according to the Oracle documentation, restoration of voting disks that were copied using the "dd" or "cp" command may prevent your clusterware from starting up. The voting disk data is automatically backed up in the OCR whenever there is a configuration change, and the data is automatically restored to any voting disk that is added. There is no need to back up the voting disk every day, because the node membership information does not usually change.
The following is a guideline for when to back up the voting disk (in releases where this still applies):
• After installation
• After adding nodes to or deleting nodes from the cluster
• After performing voting disk add or delete operations
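In addition to the automatic backups, the OCR can be backed up and exported on demand with standard ocrconfig options (the export path is just an example):
# ocrconfig -showbackup --> lists automatic and manual backups and the node that took them
# ocrconfig -manualbackup --> takes an on-demand physical backup of the OCR
# ocrconfig -export /backup/ocr_exp.dmp --> logical export of the OCR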
How to check the cluster name in RAC
A cluster comprises multiple coordinated computers or servers that appear as if they are one server to end users and applications. Oracle RAC enables you to cluster Oracle databases. Oracle RAC uses Oracle Clusterware as the infrastructure to bind multiple servers so they operate as a single system. Oracle Clusterware is a portable cluster management solution that is integrated with Oracle Database, and it is a required component for using Oracle RAC. In addition, Oracle Clusterware enables both single-instance Oracle databases and Oracle RAC databases to use the Oracle high-availability infrastructure, and it enables you to create a clustered pool of storage to be used by any combination of single-instance and Oracle RAC databases. If you want to find the cluster name from an existing RAC setup, use the commands below.
1. cd $GRID_HOME/bin
   ./olsnodes -c
2. cd $GRID_HOME/bin
   ./cemutlo -n
What is instance recovery in RAC?
1. All nodes are available.
2. One or more RAC instances fail.
3. The node failure is detected by any one of the remaining instances.
4. The Global Resource Directory (GRD) is reconfigured and distributed among the surviving nodes.
5. The instance which first detected the failed instance reads the failed instance's redo logs to determine which logs need to be recovered. This task is done by the SMON process of the instance that detected the failure.
6. Until this time database activity is frozen. SMON issues recovery requests for all the blocks that are needed for recovery. Once all those blocks are available, the other blocks, which are not needed for recovery, become available for normal processing.
7. Oracle performs a roll-forward operation against the blocks that were modified by the failed instance but were not written to disk, using the transactions recorded in the redo logs.
8. Once the redo logs are applied, uncommitted transactions are rolled back using the undo tablespace.
9. The database on the RAC is now fully available.
What is the DNS server's role in RAC?
SCAN is a domain name registered to at least one and up to three IP addresses, either in the domain name service (DNS) or in the Grid Naming Service (GNS). When a client wants a connection to the database, unlike in previous releases, the client uses the SCAN as specified in tnsnames.ora. The DNS server returns the three IP addresses for the SCAN, and the client tries each IP address given by the DNS server until the connection is made. So with 11gR2 the client initiates the connection to the SCAN listener, which forwards the connection request to the least-loaded node within the cluster.
The flow is: Client connection request --> SCAN listener --> Node listener (running on the virtual IP)
Why does SCAN need DNS?
You must have a DNS server set up if you want to use SCAN. The reason is that if you use the /etc/hosts file for the SCAN, all requests for the SCAN will be forwarded to the first SCAN address specified in /etc/hosts, because /etc/hosts does not have round-robin name resolution capability. If you use a DNS server, the SCAN can take advantage of DNS's round-robin name resolution feature.
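The round-robin behaviour can be verified from any client; the SCAN name below is hypothetical:
$ nslookup rac-scan.example.com
Running it several times should show the three SCAN addresses returned in rotating order.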
What is passwordless access and why is it required in RAC?
During node addition, cluster or RDBMS upgrades, or cluster/RDBMS installation, if you want to check any prerequisites using the runcluvfy.sh or cluvfy script, passwordless (SSH user equivalence) connectivity is required between the RAC nodes for the same OS user.
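User equivalence can be verified with cluvfy before an install or upgrade; the node names are placeholders:
$ cluvfy comp admprv -n node1,node2 -o user_equiv -verbose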
What is ASMLib and what is it used for?
ASMLib is an optional support library for the Automatic Storage Management feature of the Oracle Database. Automatic Storage Management (ASM) simplifies database administration and greatly reduces kernel resource usage (e.g. the number of open file descriptors). It eliminates the need for the DBA to directly manage potentially thousands of Oracle database files, requiring only the management of groups of disks allocated to the Oracle Database. ASMLib allows an Oracle Database using ASM more efficient and capable access to the disk groups it is using. Oracle ASM (Automatic Storage Management) is a data volume manager for Oracle databases, and ASMLib is an optional utility that can be used on Linux systems to manage Oracle ASM devices. ASM assists users in disk management by keeping track of the storage devices dedicated to Oracle databases and allocating space on those devices according to the requests from Oracle database instances.
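Typical ASMLib administration commands on Linux look like the following (the disk label and device name are examples):
# oracleasm createdisk DATA1 /dev/sdb1
# oracleasm listdisks
# oracleasm scandisks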
What is GNS?
When we use Oracle RAC, all clients must be able to reach the database. All public addresses, VIP addresses, and SCAN addresses of the cluster must be resolvable by the clients. GNS (Grid Naming Service) helps us solve this problem. GNS is linked to the domain name server (DNS) so that clients can resolve these dynamic addresses and transparently connect to the cluster and the databases. Activating GNS in a cluster requires a DHCP service on the public network. Grid Naming Service uses one static address and dynamically allocates VIP addresses using the Dynamic Host Configuration Protocol (DHCP), which must be running on the network. Grid Naming Service runs as gnsd (the Grid Naming Service daemon).
Background processes related to GNS:
mDNS (Multicast Domain Name Service): allows DNS requests.
GNS (Oracle Grid Naming Service): a gateway between the cluster mDNS and external DNS servers. The GNS process performs name resolution within the cluster. The DNS delegates queries to the GNS virtual IP address, and the GNS daemon responds to incoming name resolution requests at that address. Within the subdomain, GNS uses multicast Domain Name Service (mDNS), included with Oracle Clusterware, to enable the cluster to map host names and IP addresses dynamically as nodes are added to and removed from the cluster, without requiring additional host configuration in the DNS.
$ srvctl config gns -a
What is the difference between CRSCTL and SRVCTL?
The crsctl command is used to manage the elements of the clusterware (CRS, CSSD, OCR, voting disk, etc.), while srvctl is used to manage the elements of the cluster (databases, instances, listeners, services, etc.). For example, with crsctl you can tune the heartbeat of the cluster, while with srvctl you set up load balancing at the service level. Both commands were introduced with Oracle 10g and have been improved since. There is sometimes some confusion among DBAs because both commands can be used to start the database: crsctl starts the whole clusterware stack plus the cluster resources, while srvctl starts the other elements, such as the database, listener and services, but not the clusterware. So, use SRVCTL to manage Oracle-supplied resources such as listeners, instances, disk groups and networks, and use CRSCTL for managing Oracle Clusterware and its resources. Oracle strongly discourages directly manipulating Oracle-supplied resources (resources whose names begin with ora) using CRSCTL.
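A few contrasting examples (the database name PROD and instance PROD1 are placeholders):
# crsctl check cluster -all --> clusterware stack status on all nodes
# crsctl stop crs --> stop the whole clusterware stack on one node
$ srvctl status database -d PROD --> status of the PROD database resource
$ srvctl stop instance -d PROD -i PROD1 --> stop a single instance of PROD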
What is rebootless fencing in 11gR2 RAC?
Oracle Grid Infrastructure 11.2.0.2 has many features, including cluster node membership, cluster resource management, and cluster resource monitoring. One of the key areas where a DBA needs expert knowledge is how cluster node membership works and how the cluster decides to take a node out should there be a heartbeat network, voting disk, or node-specific issue. Oracle 11.2.0.2 brings many new features, and one of them is rebootless fencing. When sub-components of Oracle RAC like the private interconnect or voting disk fail, Oracle Clusterware tries to prevent a split brain; before 11.2.0.2 it did this with a fast reboot of the node, without waiting for I/O operations or synchronization of the file systems.
Oracle uses algorithms common to STONITH (Shoot The Other Node In The Head) implementations to determine which nodes need to be fenced. When a node is alerted that it is being "fenced", it uses suicide to carry out the order. STONITH automatically powers down a node that is not working correctly; an administrator might employ STONITH if one of the nodes in a cluster cannot be reached by the other node(s) in the cluster.
After 11.2.0.2 the mechanism changed. Oracle improved node fencing in Oracle 11g Release 2 (11.2.0.2) by killing the processes on the failed node that are capable of performing I/O and then stopping the Clusterware on the failed node, rather than simply rebooting the failed node. Whenever sub-components of Oracle RAC like the private interconnect, voting disk, etc. fail, Oracle Clusterware first decides which node to evict, then:
1. The Clusterware attempts to shut down all Oracle resources and processes on that node, especially those processes which generate I/O.
2. The Clusterware stops the cluster services on that node.
3. Then OHASD (Oracle High Availability Services Daemon) will try to start the CRS (Cluster Ready Services) stack again, and once the interconnect is back online, all cluster resources on that node will automatically be started.
4. If it is not possible to stop the resources or the processes generating I/O, the Clusterware will kill (reboot) the node.
If any one of the nodes cannot communicate with the other nodes, there is a potential that the node could corrupt the data by not coordinating its writes with the other nodes. Should that situation arise, that node needs to be taken out of the cluster to protect the integrity of the cluster and its data. This is called "split brain" in the cluster, which means two different sub-clusters could be functioning against the same set of data, writing independently, causing data integrity issues and corruption. Any clustering solution needs to address this issue, and so does Oracle Grid Infrastructure Clusterware.
What are rolling upgrades and rolling patch application in RAC?
A rolling upgrade allows one node to be upgraded while the other node is running, so there is no downtime at all, since at least one node is running at any one time. The term rolling upgrade refers to upgrading different databases or different instances of the same database (in a Real Application Clusters environment) one at a time, without stopping the database. The advantage of a RAC rolling upgrade is that it enables at least some instances of the RAC installation to be available during the scheduled outage required for patch upgrades. Only the RAC instance that is currently being patched needs to be brought down; the other instances can continue to remain available. This means that the impact on application downtime required for such scheduled outages is further minimized. Oracle's opatch utility enables the user to apply the patch successively to the different instances of the RAC installation. Rolling upgrade is available only for patches that have been certified by Oracle as eligible for rolling upgrades. Typically, patches that can be installed in a rolling upgrade include:
• Patches that do not affect the contents of the database, such as the data dictionary
• Patches not related to RAC inter-node communication
• Patches related to client-side tools such as SQL*Plus, Oracle utilities, development libraries, and Oracle Net
• Patches that do not change shared database resources such as datafile headers, control files, and common header definitions of kernel modules
Rolling upgrade of patches is currently available for one-off patches only; it is not available for patch sets.
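Whether a given one-off patch is rolling-capable can be checked with opatch before applying it; the patch location below is a placeholder and the exact output wording varies by opatch version:
$ opatch query -all /stage/patches/1234567 | grep -i rolling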
What is One Node RAC concept
Oracle introduced a new
option called RAC One Node with the release of 11gR2 in late 2009. This option
is available with Enterprise edition only. Basically, it provides a cold
failover solution for Oracle databases. It’s a single instance of Oracle RAC
running on one node of the cluster while the 2nd node is in a cold standby
mode. If the instance fails for some reason, then RAC One Node detects it and
first tries to restart the instance on the same node. The instance is relocated
to the 2nd node in case there is a failure or fault in 1st node and the
instance cannot be restarted on the same node. The benefit of this feature is
that it automates the instance relocation without any downtime and does not
need a manual intervention. It uses a technology called Omotion, which
facilitates the instance migration/relocation.
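From 11.2.0.2 onwards the relocation can also be triggered manually as an online relocation; the database name, node name and timeout below are placeholders:
$ srvctl relocate database -d PRODONE -n node2 -w 30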
What are some of the RAC-specific parameters?
- active_instance_count: Designates one instance in a two-instance cluster as the primary instance and the other as the secondary instance. This parameter has no functionality in a cluster with more than two instances.
- archive_lag_target: Specifies a log switch after a user-specified time period elapses.
- cluster_database: Specifies whether or not Oracle Database 10g RAC is enabled.
- cluster_database_instances: Equal to the number of instances. Oracle uses the value of this parameter to compute the default value of the large_pool_size parameter when the parallel_automatic_tuning parameter is set to true.
- cluster_interconnects: Specifies the additional cluster interconnects available for use in the RAC environment. Oracle uses information from this parameter to distribute traffic among the various interfaces.
- compatible: Specifies the release with which the Oracle server must maintain compatibility.
- control_files: Specifies one or more names of control files.
- db_block_size: Specifies the size (in bytes) of Oracle database blocks.
- db_domain: In a distributed database system, db_domain specifies the logical location of the database within the network structure.
How to put RAC database in archivelog mode ?
From 11g onwards, you no longer need to reset the CLUSTER_DATABASE parameter during this process.
Step 1. Make sure the db_recovery_file_dest_size and db_recovery_file_dest parameters are set; if they are not, set an archive destination explicitly, for example:
ALTER SYSTEM SET log_archive_dest_1='location=+ORADATA' SCOPE=spfile;
Step 2. Stop the
Database
From the command line we can stop the entire clustered database using
the following.
srvctl stop database -d PROD
Step 3. Now start the
instance from one node only
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE ARCHIVELOG; SQL>
SHUTDOWN IMMEDIATE;
Step 4. Start the
database
srvctl start database -d PROD
SQL> select
name,open_mode,LOG_MODE from v$database;
What are the background processes that exist in 11gR2 and their functionality?
crsd: The CRS daemon (crsd) manages cluster resources based on configuration information that is stored in Oracle Cluster Registry (OCR) for each resource. This includes start, stop, monitor, and failover operations. The crsd process generates events when the status of a resource changes.
cssd: Cluster Synchronization Service (CSS). Manages the cluster configuration by controlling which nodes are members of the cluster and by notifying members when a node joins or leaves the cluster. If you are using certified third-party clusterware, then CSS processes interface with your clusterware to manage node membership information. CSS has three separate processes: the CSS daemon (ocssd), the CSS Agent (cssdagent), and the CSS Monitor (cssdmonitor). The cssdagent process monitors the cluster and provides input/output fencing. This service was formerly provided by the Oracle Process Monitor daemon (oprocd), also known as OraFenceService on Windows. A cssdagent failure results in Oracle Clusterware restarting the node.
diskmon: Disk Monitor daemon. Monitors and performs input/output fencing for Oracle Exadata Storage Server. As Exadata storage can be added to any Oracle RAC node at any point in time, the diskmon daemon is always started when ocssd is started.
evmd: Event Manager (EVM). A background process that publishes Oracle Clusterware events.
mdnsd: Multicast domain name service (mDNS). Allows DNS requests. The mDNS process is a background process on Linux and UNIX, and a service on Windows.
gnsd: Oracle Grid Naming Service (GNS). A gateway between the cluster mDNS and external DNS servers. The GNS process performs name resolution within the cluster.
ons: Oracle Notification Service (ONS). A publish-and-subscribe service for communicating Fast Application Notification (FAN) events.
oraagent: Extends the clusterware to support Oracle-specific requirements and complex resources. It runs server callout scripts when FAN events occur. This process was known as RACG in Oracle Clusterware 11g Release 1 (11.1).
orarootagent: Oracle root agent. A specialized oraagent process that helps CRSD manage resources owned by root, such as the network and the Grid virtual IP address.
oclskd: Cluster kill daemon. Handles instance/node eviction requests that have been escalated to CSS.
gipcd: Grid IPC daemon. A helper daemon for the communications infrastructure.
ctssd: Cluster time synchronization daemon. Manages time synchronization between nodes, rather than depending on NTP.
What is inittab?
The inittab entry is similar to an oratab entry. In a RAC environment, inittab is what starts the Clusterware services at boot: the line below is responsible for starting them by respawning init.ohasd.
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
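A quick way to confirm that the entry has actually taken effect (a sketch for Linux; paths and process names may vary slightly by platform and release):
ps -ef | grep init.ohasd | grep -v grep     # the respawned wrapper script
ps -ef | grep ohasd.bin | grep -v grep      # the OHASD executable itself, running as root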
Useful Commands:
1. crsctl enable has –> Enable automatic start of Oracle High Availability Services after reboot
2. crsctl disable has –> Disable automatic start of Oracle High Availability Services after reboot
What is OHASD?
OHASD stands for Oracle High Availability Services Daemon. OHASD spawns three levels of agents at the cluster level:
Level 1: cssdagent
Level 2: orarootagent (respawns cssd, crsd, ctssd, diskmon, acfs)
Level 3: oraagent (respawns mdnsd, gipcd, gpnpd, evmd, asm), cssdmonitor
Useful Commands:
1. crsctl enable has –> Start HAS services automatically after reboot
2. crsctl disable has –> HAS services should not start after reboot
3. crsctl config has –> Check whether autostart is enabled or not
4. cat /etc/oracle/scls_scr/<Node_name>/root/ohasdstr –> Check whether autostart is enabled or not
5. cat /etc/oracle/scls_scr/<Node_name>/root/ohasdrun –> Check whether restart is enabled if the node fails
What is OCR? How and why is OLR used? Where are OCR & OLR located?
OCR stands for Oracle Cluster Registry. It holds information such as node membership (which nodes are part of this cluster), the software version, the location of the voting disk, and the status of RAC databases, listeners, instances and services. OCR is placed in ASM or OCFS.
ASM can be brought up only if we have access to the OCR. But the OCR is accessible only after ASM is up. In this case, how will the CRS services come up?
For this the OLR (Oracle Local Registry) exists. It is a registry similar to the OCR but placed on the local file system of each node. The OLR holds information such as CRS_HOME, GPnP details, active version, localhost version, the latest OCR backup (with time and location), and the node name.
Location of OCR & OLR:
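On Linux the pointers to both registries live in small text files under /etc/oracle (the paths below are typical Linux locations and the values shown are only illustrative), so a quick way to see where the OCR and OLR actually are is:
cat /etc/oracle/ocr.loc      # e.g. ocrconfig_loc=+DATA
cat /etc/oracle/olr.loc      # e.g. olrconfig_loc=/u01/app/11.2.0/grid/cdata/node1.olr
ocrcheck                     # reports the OCR device/file name, used space and integrity
ocrcheck -local              # the same details for the OLR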
NOTE: Some commands, such as restore, require bouncing the services. Please verify before taking any action.
1. ocrconfig -showbackup –> Show the OCR file backup location
2. ocrconfig -export <File_Name_with_Full_Location.ocr> –> Take an OCR backup (export)
3. ocrconfig -restore <File_Name_with_Full_Location.ocr> –> Restore OCR
4. ocrconfig -import <File_Name_With_Full_Location.dmp> –> Import metadata specifically for OCR
5. ocrcheck -details –> Gives the OCR info in detail
6. ocrcheck -local –> Gives the OLR info in detail
7. ocrdump -local <File_Name_with_Full_Location.olr> –> Take a dump of the OLR
8. ocrdump <File_Name_with_Full_Location.ocr> –> Take a dump of the OCR
What is the Voting Disk and how is it used?
The voting disk comes into the picture when a node joins the cluster, when a node fails (and may be evicted), and when VIPs need to be assigned in case GNS is configured. The voting disk records which nodes are, or were, members of the cluster. While starting the CRS services, with the help of the OCR, each node votes in the voting disk (essentially marking its attendance in the cluster). We need not back up the voting disk periodically like a cron job; a backup is needed only in certain cases (for example, after adding or deleting a node).
The voting disk holds two kinds of data:
1. Dynamic – heartbeat information
2. Static – node information in the cluster
Useful Commands:
1. dd if=Name_Of_Voting_Disk of=Name_Of_Voting_Disk_Backup –> Take a backup of the voting disk
2. crsctl query css votedisk –> Check voting disk details
3. crsctl add css votedisk path_to_voting_disk –> Add a voting disk
4. crsctl add css votedisk -force –> If the cluster is down
5. crsctl delete css votedisk <path_to_voting_disk_or_GUID> –> Delete a voting disk
6. crsctl delete css votedisk -force –> If the cluster is down
7. crsctl replace votedisk <+ASM_Disk_Group> –> Replace the voting disk (move it into an ASM disk group)
What is CRS?
CRSD stands for Cluster Ready Services Daemon. It is the process responsible for monitoring, stopping, starting and failing over the resources. This process maintains the OCR and is responsible for restarting a resource when a failover takes place.
Useful Commands:
1. crs_stat -t -v –> Check crs resources
2. crsctl stat res -t –> Check resources in a more detailed view. BEST ONE.
3. crsctl enable crs –> Enable automatic start of services after reboot
4. crsctl check crs –> Check crs services
5. crsctl disable crs –> Disable automatic start of CRS services after reboot
6. crsctl stop crs –> Stop the crs services on the node where it is executed
7. crsctl stop crs -f –> Stop the crs services forcefully
8. crsctl start crs –> Start the crs services on the respective node
9. crsctl start crs -excl –> Start the crs services in exclusive mode when you have lost the voting disk; you need to replace the voting disk after you start CSS
10. crsctl stop cluster -all –> Stop the crs services on all the cluster nodes
11. crsctl start cluster -all –> Start the crs services on all the cluster nodes
12. olsnodes –> List all the nodes belonging to the cluster
13. oclumon manage -get master –> Get the master node information (for CHM)
14. cat $CRS_HOME/crs/init/<node_name>.pid –> Find the PID under which crs is running
6. What is CSSD?
CSSD stands for Cluster Synchronization Services Daemon. It is responsible for inter-node communication and monitors the heartbeat messages from all the nodes.
Example: We have a 2-node RAC cluster. Until an hour ago, CSSD was monitoring both nodes and they were able to communicate with each other. Now, if one of the nodes goes down, CRS should know that the node is down; this information is provided by the CSSD process.
Simple scenario: Both nodes are up and running, but due to a failure of the communication channel the CSSD process gets the information that the other node is down. In this case, new transactions cannot be assigned to that node, node eviction is performed, and the surviving node takes ownership as the master node.
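A quick health check for CSS on a node (standard commands, shown as a sketch):
crsctl check css                                              # reports whether Cluster Synchronization Services is online
ps -ef | egrep 'ocssd|cssdagent|cssdmonitor' | grep -v grep   # confirm the three CSS processes are running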
7. What is CTSSD?
CTSSD stands for Cluster Time Synchronization Service Daemon. By default this service runs in observer mode; if there is a time difference between nodes, it does not take any action. To run this service in active mode, we need to disable all other time synchronization services such as NTP (Network Time Protocol). It is generally recommended to keep this service in observer mode, because if the service is in active mode and the time synchronization difference is huge, the ctssd process may terminate, and sometimes crsd fails to start up due to the time difference.
Useful Commands:
1. cluvfy comp clocksync -n all -verbose –> Check clock synchronization across all the nodes
2. crsctl check ctss –> Check the service status & time offset in msecs
8. What is VIP?
VIP stands for Virtual IP Address. Oracle uses the VIP for database-level access: when a connection comes from the application side, it connects using this address. If the physical IP of a node is down, the client would normally have to wait for the TCP timeout (on the order of 90 seconds) before getting a response. This is where the VIP comes into the picture: if a node fails, its VIP fails over to a surviving node, where it returns an immediate error instead of leaving clients waiting, so connections are routed only to the active nodes. The VIP must be on the same subnet as the public IP address. The VIP is used for RAC failover and RAC management.
Useful Commands:
1. srvctl start vip -n <node_name> -i <VIP_Name> –> Start a VIP
2. srvctl stop vip -n <node_name> -i <VIP_Name> –> Stop a VIP
3. srvctl enable vip -i <VIP_Name> –> Enable the VIP
4. srvctl disable vip -i <VIP_Name> –> Disable the VIP
5. srvctl status nodeapps -n <node_name> –> Status of nodeapps
6. srvctl status vip -n <node_name> –> Status of the VIP on a node
What is ologgerd?
Ologgerd is the Cluster Logger Service daemon, part of the Cluster Health Monitor. The logger service writes its data on the master node and chooses another node as standby. If a network issue occurs between the nodes and a node is unable to contact the master, that node takes over ownership and chooses another node as standby. The master manages the operating system metric database in the CHM repository.
Useful Commands:
1. oclumon manage -get master –> Find which node is the master
2. oclumon manage -get reppath –> Get the path of the repository logs
3. oclumon manage -get repsize –> Show the limits on the repository size
4. oclumon showobjects –> Find which nodes are connected to the logger daemon
5. oclumon dumpnodeview –> Gives a detailed view including system, topconsumers, processes, devices, nics, filesystems status and protocol errors
6. oclumon dumpnodeview -n <node_1 node_2 node_3> -last "HH:MM:SS" –> View all the details for specific nodes starting from the time you mention
7. oclumon dumpnodeview allnodes -last "HH:MM:SS" –> Use this if we need info from all the nodes
What is sysmond?
This process is responsible for collecting information on the local node. A sysmond process runs on every node, and the collected data is sent to the master ologgerd. It sends information such as CPU, memory usage, OS-level info, disk info, process and file system info.
11. What is evmd?
EVMD stands for Event Manager Daemon. It handles event messaging for the processes: it sends and receives actions regarding resource state changes to and from all other nodes in a cluster. It works together with ONS (Oracle Notification Services).
Useful Commands:
1. evmwatch -A -t "@timestamp @@" –> Watch events generated by evmd
2. evmpost -u "<Message here>" -h <node_name> –> Post a message into the evmd log on the mentioned node
13. What is mdnsd?
MDNSD stands for Multicast Domain Name Service daemon. This process is used by gpnpd to locate profiles in the cluster, as well as by GNS to perform name resolution. Mdnsd updates the pid file in the init directory.
What is ONS?
ONS stands for Oracle Notification Service. ONS allows users to send SMS, e-mails, voice messages and fax messages in an easy way. ONS sends the state of the database and instances; this state information is used for load balancing. ONS also communicates with daemons on other nodes to inform them of the state of the database.
ONS is started as part of CRS within nodeapps. ONS runs as a node application; every node has its own ONS configured.
Useful Commands:
1. srvctl status nodeapps –> Status of
nodeapps
2. cat $ORACLE_HOME/opmn/conf/ons.config –> Check
ons configuration.
3. $ORACLE_HOME/opmn/logs –> ONS logs
will be in this location.
15. What is OPROCD?
OPROCD stands for Oracle Process Monitor Daemon. Oprocd monitors the system state of cluster nodes and performs fencing via STONITH, which is nothing but power-cycling the node (powering the server off and on, for example with a reboot). From 11gR2 the oprocd functionality is provided by the cssdagent.
Useful Commands:
CRS_HOME/oprocd stop –> Stop the process on a single node (pre-11gR2)
16. What is FAN?
FAN stands for Fast Application Notification. If any state change occurs in the cluster, instance or node, an event is triggered by the event manager and is propagated by ONS. The event is known as a FAN event. This feature was introduced in Oracle 10g for immediate notification. FAN uses ONS for notifying.
Useful Commands:
1. onsctl ping –> Check whether ons is running or not
2. onsctl debug –> Get a detailed view of ons
3. onsctl start –> Start the daemon
4. onsctl stop –> Stop the daemon
17. What is TAF?
TAF stands for Transparent Application Failover. When a RAC node goes down, SELECT statements can fail over to an active node. INSERT, DELETE, UPDATE and ALTER SESSION statements are not supported by TAF; temporary objects and PL/SQL package states are lost during the failover.
There are two types of failover methods used in TAF:
1. Basic failover: connects to a single node. No extra overhead is incurred up front, but the end user experiences a delay while the new connection is established.
2. Preconnect failover: connects to the primary and the backup node at the same time. This offers faster failover, at the cost of the overhead of keeping a backup connection ready so the transaction can complete with minimal delay.
Useful Commands:
1. Add a service:
srvctl add service -d <database_name> -s <service_name> -r <instance_names> -P <TAF_policy>
(TAF policy specification – NONE, BASIC or PRECONNECT)
2. Check TAF status:
SELECT machine, failover_type, failover_method, failed_over, COUNT(*) FROM gv$session GROUP BY machine, failover_type, failover_method, failed_over;
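For server-side TAF the failover attributes can also be attached to the service itself. A minimal 11.2-style sketch with hypothetical database, service and instance names (-e sets the failover type, -m the method, -z the retries and -w the delay):
srvctl add service -d PROD -s oltp_srv -r PROD1,PROD2 -P BASIC -e SELECT -m BASIC -z 20 -w 5
srvctl start service -d PROD -s oltp_srv
srvctl config service -d PROD -s oltp_srv     # verify the TAF settings stored on the service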
18. What is FCF?
FCF stands for Fast Connection Failover. It is an application-level failover process: the connection pool automatically subscribes to FAN events, which allows an immediate reaction to up and down events from the database cluster. All failed connections are cleaned up immediately so that the application receives a failure message, and after cleanup, new connection requests are load-balanced to an active node. As this is an application-level process, it is not discussed further here.
19. What is GCS (LMSn)?
GCS stands for Global Cache Service. GCS tracks the location, mode and access privileges of data blocks across the various instances. Integrity is maintained through this global view, and GCS is responsible for transferring blocks from one instance to another when needed.
Clear understanding: Blocks of table "A" were retrieved through a connection to the second node. Now, if the first node requests blocks from this table, the blocks need not be read again from the datafiles; they can be shipped from the other instance's cache. This is the main use of GCS.
19. What is GES (LMD)?
GES stands for Global Enqueue Service. GES controls the library and dictionary caches on all the nodes. GES manages transaction locks, table locks, library cache locks, dictionary cache locks, and the database mount lock.
21. What is GRD?
GRD stands for Global Resource Directory. It records information about resources and enqueues: data block identifiers, the mode in which each data block is held (shared, exclusive, null), and which buffer caches have access to the blocks.
22. What is GPNPD?
GPNPD stands for Grid Plug aNd Play Daemon. A file located at CRS_HOME/gpnp/<node_name>/profiles/peer/profile.xml is known as the GPnP profile, and it contains the cluster name, hostname, network profiles with IP addresses, and OCR information. If we make any modification to the voting disk, the profile is updated.
Useful Commands:
1. gpnptool ver –> Check the version of the tool
2. gpnptool lfind –> Find the local gpnpd server (i.e. check the daemon is running on the local node)
3. gpnptool get –> Read the profile
4. gpnptool check -p=CRS_HOME/gpnp/<node_name>/profiles/peer/profile.xml –> Check whether the configuration is valid
23. What is Diskmon?
The disk monitor daemon runs continuously once ocssd starts, and it monitors and performs I/O fencing for Exadata storage servers (a storage server is termed a "cell" in Exadata). This process runs from the time ocssd starts because an Exadata cell can be added to any cluster at any time.
Useful Commands:
1. crsctl stat res ora.diskmon -init –> Check diskmon status
What is lower stack and
higher stack in RAC
The Lower Stack – Managed by OHASD
The 11gR2 Grid Infrastructure consists of a
set of daemon processes which execute on each cluster node; the voting
and OCR files, and protocols used to communicate across
the interconnect. Prior to 11gR2, there were various scripts run by the init process
to start and monitor the health of the clusterware daemons. From 11gR2,
the Oracle High Availability Services Daemon (OHASD)
replaces these. The OHASD starts, stops and
checks the status of all the other daemon processes that are part of the
clusterware using new agent processes listed here:
- CSSDAGENT –
used to start,stop and check status of the CSSD resource
- ORAROOTAGENT –
used to start “Lower Stack”
daemons that must run as root: ora.crsd, ora.ctssd,
ora.diskmon, ora.drivers.acfs, ora.crf
- ORAAGENT –
used to start “Lower Stack”
daemons that run as the grid owner: ora.asm, ora.evmd,
ora.gipcd, ora.gpnpd, ora.mdnsd
- CSSDMONITOR – used to monitor the CSSDAGENT
The OHASD is
essentially a daemon which starts and monitors the clusterware daemons
themselves. It is started by init using the /etc/init.d/ohasd script
and starts the ohasd.bin executable as
root. The Oracle documentation lists the “Lower Stack” daemons where
they are referred to as the “The Oracle High Availability Services Stack”
and notes which agent is responsible for starting and monitoring each specific
daemon. It also explains the purpose of each of the stack components. (Discussions
of some of these components will feature in future blog posts.) If the grid
infrastructure is enabled on a node, then OHASD starts
the “Lower Stack” on that node at boot time. If disabled, then
the “Lower Stack” is started manually. The following commands
are used for these operations:
- crsctl enable crs – enables autostart at boot time
- crsctl disable crs – disables autostart at boot time
- crsctl start crs – manually starts crs on the local node
The “Lower Stack” consists of
daemons which communicate with their counterparts on other cluster nodes. These
daemons must be started in the correct sequence, as some of them depend on
others. For example, the Cluster Ready Services Daemon (CRSD), may
depend on ASM being available if the OCR file
is stored in ASM. Clustered ASM in turn,
depends on the Cluster Synchronisation Services Daemon(CSSD),
as the CSSD must be started in order for
clustered ASM to start up. This dependency tree is similar to
that which already existed for the resources managed by the CRSD itself,
known as the “Upper Stack“, which will be discussed later in this
post.To define the dependency tree for the “Lower Stack“, a
local repository called the OLR is used. This
contains the metadata required by OHASD to join the
cluster and configuration details for the local software. As a
result, OHASD can start the “Lower Stack”
daemons without reference to the OCR. To examine the OLR use
the following command, and then examine the dump file produced:
- ocrdump -local <FILENAME>
Another benefit of the OHASD, is
that there is a daemon running on each cluster node whether or not the “Lower Stack”
is started. As long as the OHASD daemon is running,
then the following commands may be used in 11gR2:
- crsctl check has – check the status of the OHASD
- crsctl check crs – check the status of the OHASD, CRSD, CSSD and EVMD
- crsctl check cluster -all – checks the "Lower Stack" on all the nodes
- crsctl start cluster -all – attempts to start the "Lower Stack" on all the nodes
- crsctl stop cluster -all – attempts to stop the "Lower Stack" on all the nodes
- crsctl stat res -init -t – lists the status of the "Lower Stack" resources on the local node
Starting the CSSD daemon requires access to the Voting Files, which may be stored in ASM. But a clustered ASM instance may not start until the node has joined the cluster, which requires that CSSD be up. To get around this problem, ASM Diskgroups are flagged to indicate that they contain Voting Files. The ASM discovery string (available from the GPnP profile) is used to scan the ASM Disks when CSSD starts. The scan locates the flags indicating the presence of Voting Files, which are stored at a fixed location in the ASM Disks. This process does not require the ASM instance to be up. Once the Voting Files are found by this scanning process, CSSD can access them, join the cluster, and then the ORAAGENT can start the clustered ASM instance.
The Upper Stack – Managed by CRSD
The “Upper Stack” consists of the daemons and resources
managed by the Grid Infrastructure, once it is up and running. It uses the same
architecture as OHASD, but CRSD uses
its own threads of the agents to start up, stop
and check the status of the daemons and resources as follows:
- ORAROOTAGENT –
used to start “Upper Stack”
daemons that must run as root: GNS, VIP, SCAN VIP and
network resources
- ORAAGENT –
used to start “Upper Stack”
daemons that run as grid owner: ora.asm,
ora.eons, ora.LISTENER.lsnr, SCAN listeners, ora.ons, ASM
Diskgroups, Database Instances, Database Services. It
is also used to publish High Availability events
to interested clients and manages Cluster Ready
Service changes of state.
The resources managed by the CRSD for the “Upper Stack” are also listed in the Oracle Documentation where they
are referred to as “The Cluster Ready Services Stack”
and consist of familiar resources such as Database Instances, Database
Services and NodeApps such as Node Listeners.
There are also some new resources such as the Single Client Access
Name (SCAN), SCAN Vips, Grid Naming Service (GNS), GNS Vips and Network
Resources. Some of these will be the subject of future
Blog posts.
The resources managed by the “Upper Stack”
are in the OCR file which may be stored in ASM.
Since the Clustered ASM Instance is started by OHASD after CSSD is
started but before CRSD is started, access
to the OCR by CRSD is done as a
normal client of ASM. The OCR file
may be seen as a file in ASM, unlike the Voting Files which
are not “visible” when looking at the ASM directory
contents using either Enterprise Manager or the ASMCMD utility.
To check the location of the OCR do
the following:
# cat /etc/oracle/ocr.loc
ocrconfig_loc=+DATA
local_only=FALSE
CRSD Resource Categories
CRSD resources are categorised as "Local Resources" or "Cluster Resources". Local Resources are activated on a specific node and never fail over to another node. For example, an ASM instance exists on each node, so if a node fails, the ASM instance that was on that node will not fail over to a surviving node. Likewise, a Node Listener exists for each node and does not fail over. These two resource types are therefore "Local Resources". SCAN Listeners, however, may fail over, as may Database Instances or the GNS (if used), so these are "Cluster Resources".
Finally to check the status of
the “Upper Stack” resources and daemons, do the following:
# ./crsctl status resource -t
What is GPNP profile?
The GPnP profile is a small XML file located in GRID_HOME/gpnp/<hostname>/profiles/peer under the name profile.xml. It is used to establish the correct global personality of a node. Each node maintains a local copy of the GPnP profile, and the profile is maintained by the GPnP daemon (GPnPD).
WHAT DOES GPNP PROFILE CONTAIN?
GPnP Profile is used to store necessary information required for the startup of Oracle Clusterware, like the SPFILE location, ASM DiskString, etc.
It contains various attributes defining node
personality.
- Cluster name
- Network classifications
(Public/Private)
- Storage to be used for CSS
- Storage to be used for ASM
: SPFILE location,ASM DiskString etc
WHO UPDATES GPNP PROFILE?
GPnPd daemon replicates changes to the profile during:
- installation
- system boot
- when updated
The profile is updated whenever changes are made to a cluster with configuration tools like:
- oifcfg (change network)
- crsctl (change location of voting disk)
- asmcmd (change ASM_DISKSTRING, SPFILE location) etc.
HOW IS GPNP PROFILE USED BY CLUSTERWARE?
When a node of an Oracle Clusterware cluster restarts, OHASD is started by platform-specific means. OHASD has access to the OLR (Oracle Local Registry) stored on the local file system, and the OLR provides the data needed to complete OHASD initialization. OHASD then brings up the GPnP daemon and the CSS daemon. The CSS daemon has access to the GPnP profile stored on the local file system, and the information about where the voting disk resides inside ASM is read from that GPnP profile.
We can even read the voting disk by using the kfed utility, even if ASM is not up.
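A sketch of that kfed check, assuming a candidate ASM disk at /dev/oracleasm/disks/DATA1; the kfdhdb.vfstart/kfdhdb.vfend fields in the disk header are non-zero when the disk carries voting files:
kfed read /dev/oracleasm/disks/DATA1 | grep -E 'vfstart|vfend'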
In the next step, the clusterware checks whether all the nodes have the updated GPnP profile, and the node joins the cluster based on the GPnP configuration. Whenever a node is started or added to the cluster, the clusterware software on the starting node starts a GPnP agent, which performs the following tasks:
1. If
the node is already part of the cluster, the GPnP agent reads the existing
profile on that node.
2. If
the node is being added to the cluster, GPnP agent locates agent on another
existing node using multicast protocol (provided by mDNS) and gets the profile
from other node’s GPnP agent.
The
Voting Files locations on ASM Disks are accessed by CSSD with well-known
pointers in the ASM Disk headers and CSSD is able to complete initialization
and start or join an existing cluster.
Now
OHASD starts an ASM instance and ASM can now operate with initialized and
operating CSSD.
With an ASM instance running and its Diskgroups mounted, access to Clusterware's OCR is available to CRSD (CRSD needs to read the OCR to start up the various resources on the node and to update it as the status of resources changes). Now OHASD starts CRSD with access to the OCR in an ASM Diskgroup, and thus Clusterware completes initialization and brings up the other services under its control.
The
ASM instance uses special code to locate the contents of the ASM SPFILE , which
is stored in a Diskgroup.
Next, since the OCR is also in ASM, the location of the ASM SPFILE must be known. The order of searching for the ASM SPFILE is:
- GPnP profile
- ORACLE_HOME/dbs/spfile
- ORACLE_HOME/dbs/init
The ASM SPFILE is stored in ASM, but to start ASM we need the SPFILE. Oracle knows the SPFILE location from the GPnP profile; it reads the SPFILE flag from the underlying disk(s) and then starts ASM.
Thus the GPnP profile stores several pieces of information. The GPnP profile, together with the information in the OLR, contains enough data to automate several tasks (or at least ease them for administrators), and the dependency on the OCR is gradually reduced, but not eliminated.
What is the major difference between 10g and 11g RAC?
Oracle 10g RAC:
- ASM introduced
- Concept of Services expanded
- ocrcheck introduced
- ocrdump introduced
- AWR was instance specific
- CRS was renamed as Clusterware
- asmcmd introduced
- CLUVFY introduced
- OCR and Voting disks can be mirrored
- Can use FAN/FCF with TAF for OCI and ODP.NET
Oracle 11g R1 RAC:
- Oracle 11g RAC parallel upgrades – Oracle 11g has rolling upgrade features whereby a RAC database can be upgraded without any downtime.
- Hot patching – zero downtime patch application.
- Oracle RAC load balancing advisor – starting from 10g R2 we have the RAC load balancing advisor utility. The 11g RAC load balancing advisor is only available with clients who use .NET, ODBC, or the Oracle Call Interface (OCI).
- ADDM for RAC – Oracle has incorporated RAC into the Automatic Database Diagnostic Monitor, for cross-node advisories. The addmrpt.sql script gives a report for a single instance and will not report on all instances in RAC; this is known as instance ADDM. Using the new package DBMS_ADDM, we can generate a report for all instances of the RAC; this is known as database ADDM.
- Optimized RAC cache fusion protocols – moves on from the general cache fusion protocols in 10g to deal with specific scenarios where the protocols could be further optimized.
Oracle 11g R2 RAC:
- We can store everything on ASM: OCR and voting files can also be stored on ASM.
- ASMCA
- Single Client Access Name
(SCAN) - eliminates the need to change tns entry when
nodes are added to or removed from the Cluster. RAC instances register to
SCAN listeners as remote listeners. SCAN is fully qualified name. Oracle
recommends assigning 3 addresses to SCAN, which create three SCAN
listeners.
- Clusterware
components: crfmond, crflogd, GIPCD.
- AWR
is consolidated for the database.
- By
default, LOAD_BALANCE is ON.
- GSD (Global Services Daemon), gsdctl introduced.
- GPnP
profile.
- Cluster
information in an XML profile.
- Oracle RAC OneNode is
a new option that makes it easier to consolidate databases that aren’t
mission critical, but need redundancy.
- raconeinit -
to convert database to RacOneNode.
- raconefix
- to fix RacOneNode database in case of failure.
- racone2rac
- to convert RacOneNode back to RAC.
- Oracle
Restart - the feature of Oracle Grid Infrastructure's High Availability Services (HAS)
to manage associated listeners, ASM instances and Oracle instances.
- Cluster Time Synchronization Service (CTSS) is a new feature in Oracle 11g R2 RAC, which is used to synchronize time across the nodes of the cluster. CTSS can act as a replacement for NTP.
- Grid Naming Service
(GNS) is a new service introduced in Oracle RAC 11g R2.
With GNS, Oracle Clusterware (CRS) can manage Dynamic Host
Configuration Protocol (DHCP) and DNS services for the dynamic node
registration and configuration.
- Cluster
interconnect: Used for data blocks, locks, messages, and SCN numbers.
- Oracle Local Registry (OLR)
- From Oracle 11gR2 "Oracle
Local Registry (OLR)" something new as part of Oracle Clusterware.
OLR is node’s local repository, similar to OCR (but local) and is managed
by OHASD. It pertains data of local node only and is not shared among
other nodes.
- Multicasting
is introduced in 11gR2 for private interconnect traffic.
- I/O
fencing prevents updates by failed instances, and detecting failure and
preventing split brain in cluster. When a cluster node fails, the failed
node needs to be fenced off from all the shared disk devices or
diskgroups. This methodology is called I/O Fencing, sometimes called Disk
Fencing or failure fencing.
- Re-bootless node fencing
(restart) - instead of fast re-booting the node, a graceful shutdown of
the stack is attempted.
- Clusterware
log directories: acfs*
- HAIP (Highly Available IP for the cluster interconnect).
What are nodeapps services in RAC?
Nodeapps are standard set of oracle application services
which are started automatically for RAC.
Node apps Include:
1) VIP.
2) Oracle Net listener.
3) Global Service Daemon.
4) Oracle Notification Service.
Nodeapp services run on each node of the cluster and will be switched over to other nodes through the VIP during a failover.
Useful commands to maintain nodeapps services:
srvctl stop nodeapps -n NODE1
[ STOP NODEAPPS on NODE 1 ]
srvctl stop nodeapps -n NODE2
[ STOP NODEAPPS on NODE 2 ]
srvctl start nodeapps -n NODE1
[ START NODEAPPS on NODE1 ]
srvctl start nodeapps -n NODE2
[ START NODEAPPS ON NODE2 ]
srvctl status nodeapps
Shutdown and Start sequence of Oracle RAC
components?
Stop Oracle RAC (11g, 12c)
1. emctl stop dbconsole (11g only; in 12c, EM Database Express replaces dbconsole and does not have to be stopped)
2. srvctl stop listener [-listener listener_name] [-node node_name] [-force] (stops all listener services)
3. srvctl stop database -db db_unique_name [-stopoption stop_options] [-eval(12c only)] [-force] [-verbose]
4. srvctl stop asm [-proxy] [-node node_name] [-stopoption stop_options] [-force]
5. srvctl stop nodeapps [-node node_name] [-gsdonly] [-adminhelper] [-force] [-relocate] [-verbose]
6. crsctl stop crs
Start Oracle RAC (11g, 12c)
1. crsctl start crs
2. crsctl start res ora.crsd -init
3. srvctl start nodeapps [-node node_name] [-gsdonly] [-adminhelper] [-verbose]
4. srvctl start asm [-proxy] [-node node_name [-startoption start_options]]
5. srvctl start database -db db_unique_name [-eval(12c only)]] [-startoption start_options] [-node node_name]
6. srvctl start listener [-node node_name] [-listener listener_name] (start all listener services)
7. emctl start dbconsole (11g only)
To start any resources of your HA environment that are still down (e.g. ora.ons, listener):
crsctl start resource -all
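Put together for one node, a hedged 11.2-style example (database name PROD and node name rac1 are assumptions; 11.2 uses the short -d/-n/-o flags rather than the 12c long options shown above):
# stop, run as the grid/oracle owner except crsctl, which runs as root
srvctl stop database -d PROD -o immediate
srvctl stop nodeapps -n rac1 -f
crsctl stop crs
# start
crsctl start crs
srvctl start database -d PROD
srvctl status database -d PROD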
1) TAF – a feature of Oracle Net Services for OCI8 clients. TAF is Transparent Application Failover, which will move a session to a backup connection if the session fails. With Oracle 10g Release 2, you can define the TAF policy on the service using the dbms_service package. It only works with OCI clients. It only moves the session and, if the parameter is set, it fails over the SELECT statement. For insert, update or delete transactions, the application must be TAF aware and roll back the transaction. Yes, you should enable FCF on your OCI client when you use TAF; it will make the failover faster.
Note: TAF will not work with the JDBC thin driver.
2) FAN
FAN is a feature of Oracle RAC which stands for Fast Application Notification. This allows the database to notify the client of any change (Node up/down, instance up/down, database up/down). For integrated clients, inflight transactions are interrupted and an error message is returned. Inactive connections are terminated.
FCF is the client feature for Oracle Clients that have integrated with FAN to provide fast failover for connections. Oracle JDBC Implicit Connection Cache, Oracle Data Provider for .NET (ODP.NET) and Oracle Call Interface are all integrated clients which provide the Fast Connection Failover feature.
3) FCF ---FCF is a feature of Oracle
clients that are integrated to receive FAN events and abort inflight
transactions, clean up connections when a down event is received as well as
create new connections when a up event is received. Tomcat or JBOSS can take
advantage of FCF if the Oracle connection pool is used underneath. This can be
either UCP (Universal Connection Pool for JAVA) or ICC (JDBC Implicit
Connection Cache). UCP is recommended as ICC will be deprecated in a future
release.
What is TAF?
The Oracle Transparent Application Failover (TAF) feature allows
application users to reconnect to surviving database instances if an existing
connection fails. When such a failure happens, all uncommitted transactions
will be rolled back and an identical connection will be established. The
uncommitted transactions have to be resubmitted after reconnection. The TAF
reconnect occurs automatically from within the OCI library. To use all features of TAF, the
application code may have to be modified. When your application is query-only,
TAF can be used without any code changes. In general, TAF works well for
reporting.
Server-Side vs. Client-Side TAF – TAF can be implemented either client-side or
server-side. Service attributes are used server-side to hold the TAF
configuration; client-side the TNS connect string must be changed to enable
TAF. Settings configured server-side supersede their client-side counterparts
if both methods are used. Server-side configuration of TAF is the preferred
method. You can configure TAF in two different failover modes. In the first
mode, Select Failover, SELECT statements that are in progress during the
failure are resumed over the new connection. In the second mode, Session
Failover, lost connections and sessions are re-created.
- Select – SELECT statements will resume on the new connection
- Session – when a connection is lost, a new connection is created automatically
When TAF is set up client-side, it can be configured to establish
from the beginning a second connection to another (backup) instance. This
eliminates the reconnection penalty but requires that the backup instance
support all connections from all nodes set up this way.
- Basic – establishes connections only when failover occurs
- Preconnect – pre-establishes connections to the backup server
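For client-side TAF the FAILOVER_MODE clause goes inside the CONNECT_DATA section of the TNS alias. A sketch with hypothetical host and service names:
PRODTAF =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = prod-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = oltp_srv)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
    )
  )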
What
is FAN and ONS ?
The Oracle RAC Fast Application Notification (FAN) feature provides a simplified API for accessing FAN events through a callback mechanism. This mechanism enables third-party drivers, connection pools, and containers to subscribe to, receive and process FAN events. These APIs are referred to as the Oracle RAC FAN APIs. Put simply, FAN is the mechanism that Oracle RAC uses to notify ONS about service status changes, such as UP and DOWN events, instance state changes, and so on. Oracle RAC publishes FAN events the minute any change has occurred, so instead of the application polling individual nodes to detect an anomaly, applications are notified by FAN events and can react immediately. Any change that occurs in the cluster configuration is notified by Fast Application Notification to ONS.
We can also use server callout scripts to catch FAN events.
FAN also publishes load balancing advisory
(LBA) events. Applications are in a position to take full advantage of
the LBA FAN events to facilitate a smooth transition of connections to
healthier nodes in a cluster. One can take advantage of FAN is
the following ways:
1. When using an integrated Oracle client, applications can use FAN with no programming whatsoever. Oracle JDBC, ODP.NET, and OCI are considered components of the integrated clients.
2. Programmatic changes in ONS API make
it possible for applications to still subscribe to the FAN events
and can execute the event handling actions appropriately.
3. We can also use server callouts scripts to
catch FAN events at a database level.
For a DBA, if a database is up and running, everything seems beautiful. But once a state change happens in a database, we don't know whether it will take two minutes to recover or eat up an indefinite amount of time. When we talk about Oracle RAC, we know there are multiple resources available to give high availability and load balancing. And when we have multiple resources available – multiple instances, multiple services and multiple listeners to serve us – a state change can cause a performance problem.
Let us take an example of a node failure. When a node fails without closing its sockets, all sessions that are blocked in an IO [read or write] wait for the TCP keepalive. This wait status is the typical condition for an application using the database. Sessions processing the last result are even worse off, not receiving an interrupt until the next data is requested. Here we can take advantage of FAN [Fast Application Notification]:
- FAN eliminates application waits on TCP timeouts, as Oracle RAC publishes FAN events the minute any change has occurred and we can handle those events.
- It eliminates time wasted
processing the last result at the client after a failure has occurred.
- It eliminates time wasted
executing work on slow, hung or dead nodes.
We
can take advantage of server-side callouts FAN and do following things.
- Whenever FAN events
occur we can log that so that helps us for administration in future.
- We can use paging or SMS DBA
to open tickets when any resource fails to restart.
- Change resource plans or shut down services when the number of available instances decreases, thus preventing further load on the cluster and keeping the RAC running until another healthy node is added to the cluster.
- We can automate the
fail service back to the PREFERRED instances
when required.
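A minimal sketch of such a server-side callout: Clusterware executes every executable placed in the Grid home's racg/usrco directory and passes the FAN event text as command-line arguments (the file name and log path below are assumptions):
#!/bin/sh
# hypothetical callout: $GRID_HOME/racg/usrco/fan_logger.sh (must be executable)
# append every FAN event published on this node to a local log with a timestamp
echo "`date '+%Y-%m-%d %H:%M:%S'` FAN event: $*" >> /tmp/fan_events.log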
WHAT IS ONS?
ONS allows
users to send SMS messages, e-mails, voice notifications, and fax messages in
an easy-to-access manner. Oracle Clusterware uses ONS to send notifications
about the state of the database instances to midtier applications that use this
information for load-balancing and for fast failure detection. ONS is a daemon
process that communicates with other ONS daemons
on other nodes which inform each other of the current state of the database components on the database server.
To add additional members or nodes that
should receive notifications, the hostname or IP address of the node
should be added to the ons.config file.
The ONS configuration file is located in the $ORACLE_HOME/opmn/conf directory and has the following format:
[oracle@oradb4 oracle]$ more $ORACLE_HOME/opmn/conf/ons.config
localport=6101
remoteport=6201
loglevel=3
useocr=on
nodes=oradb4.sumsky.net:6101,oradb2.sumsky.net:6201,
oradb1.sumsky.net:6201,oradb3.sumsky.net:6201,
onsclient1.sumsky.net:6200,onsclient2.sumsky.net:6200
The localport is the port that ONS binds to
on the local host interface to talk to local clients. The remoteport is
the port that ONS binds to on all interfaces to talk to other ONS daemons. The
loglevel indicates the amount of logging that should be generated. Oracle
supports logging levels from 1 through 9. ONS logs are generated in the
$ORACLE_HOME/opmn/logs directory on the respective instances
The useocr parameter (valid values are on/off) indicates whether ONS should use the OCR to determine which instances and nodes are participating in the cluster. The nodes listed in the nodes line are all nodes in the network that will need to receive or send event notifications. This includes client machines where ONS is also running to receive FAN events for applications.
ONS configuration
==>>
ONS is installed and configured as part of the Oracle Clusterware installation.
Execution of the root.sh file on Unix and Linux-based systems, during
the Oracle Clusterware installation will create and start the ONS on all
nodes participating in the cluster. This can be verified using the crs_stat
utility provided by Oracle.
Configuration of ONS involves registering all nodes and servers that will communicate with the ONS daemon on the database server.
During Oracle Clusterware installation, all nodes participating in the cluster are automatically registered with the ONS.
Subsequently, during restart of the clusterware, ONS will register all nodes with the respective ONS processes
on other nodes in the cluster.
What is the rconfig utility and its usage?
Use the steps below to convert an existing database running on one of the cluster nodes from a single-instance database to a cluster database.
Step 1> Copy ConvertToRAC_AdminManaged.xml to another file, convert.xml:
node1$ cd $ORACLE_HOME/assistants/rconfig/sampleXMLs
node1$ cp ConvertToRAC_AdminManaged.xml convert.xml
Step 2> Edit convert.xml and make the following changes:
* Specify the current OracleHome of the non-RAC database as SourceDBHome.
* Specify the OracleHome where the RAC database should be configured. It can be the same as SourceDBHome.
* Specify the SID of the non-RAC database and credentials. A user with the sysdba role is required to perform the conversion.
* Specify the list of nodes that should have RAC instances running for the admin-managed cluster database. LocalNode should be the first node in this node list.
* The Instance Prefix tag is optional starting with 11.2. If left empty, it is derived from db_unique_name.
* Specify the type of storage to be used by the RAC database. Allowable values are CFS|ASM.
* Specify the Database Area Location to be configured for the RAC database. Leave it blank if you want to use the existing location of the database files, but the location must be accessible from all the cluster nodes.
* Specify the Flash Recovery Area to be configured for the RAC database. Leave it blank if you want to use the existing location as the flash recovery area.
Step 3> Run rconfig to convert the database from a single-instance database to a cluster database:
node1$ rconfig convert.xml
Check the rconfig log file while the conversion is going on:
oracle@node1$ ls -lrt $ORACLE_BASE/cfgtoollogs/rconfig/*.log
Check that the database has been converted successfully:
node1$ srvctl status database -d orcl
What is CHM ( cluster health Monitor ) ?
The
Oracle Grid Cluster Health Monitor (CHM) stores operating system metrics in the
CHM repository for all nodes in a RAC cluster. It stores information on CPU,
memory, process, network and other OS data. This information can later be
retrieved and used to troubleshoot and identify any cluster related issues. It
is a default component of the 11gr2 grid install. The data is stored in the
master repository and also replicated to a standby repository on a different
node. The Cluster Health Monitor (CHM), formerly known as the Instantaneous Problem Detector for Clusters (IPD/OS), is designed to detect and analyze operating system and cluster resource related degradation and failures, in order to bring more explanatory power to many issues that occur in clusters in which Oracle Clusterware and/or Oracle RAC are used, e.g. node evictions. It is independent of Oracle Clusterware and Oracle RAC in the current release.
What are the processes and
components for the Cluster Health Monitor?
Cluster Logger Service (ologgerd) – there is a master ologgerd that receives the data from the other nodes and saves it in the repository (a Berkeley DB database). It compresses the data before persisting it to save disk space. In an environment with multiple nodes, a replica ologgerd is also started on a node where the master ologgerd is not running. The master ologgerd syncs the data with the replica ologgerd by sending the data to it. The replica ologgerd takes over if the master ologgerd dies, and a new replica ologgerd starts when the replica ologgerd dies. There is only one master ologgerd and one replica ologgerd per cluster.
System Monitor Service (sysmond) – the sysmond process collects the system statistics of the local node and sends the data to the master ologgerd. A sysmond process runs on every node and collects system statistics including CPU, memory usage, platform info, disk info, NIC info, process info, and filesystem info.
Locate CHM log directory
Check CHM resource status and locate Master Node
[grid@grac41 ~] $ $GRID_HOME/bin/crsctl status res ora.crf -init
NAME=ora.crf
TYPE=ora.crf.type
TARGET=ONLINE
STATE=ONLINE on grac41
[grid@grac41 ~]$ oclumon manage -get MASTER
Master = grac43
Login into grac43 and located CHM log directory ( ologgerd process )
[root@grac43 ~]# ps -elf |grep ologgerd | grep -v grep
.... /u01/app/11204/grid/bin/ologgerd -M -d /u01/app/11204/grid/crf/db/grac43
What is ora.crf?
ora.crf is the Cluster Health Monitor resource name on Oracle Clusterware 11gR2 (new in 11.2.0.2); it is managed by ohasd.
$ crsctl stat res ora.crf -init
You can check its processes as follows:
$ ps -aef | grep osysmond
$ ps -aef | grep ologgerd
Crash Recovery Vs Instance Recovery
When an instance fails suddenly, due to a power outage or a shutdown abort, the instance requires recovery during the next startup. Oracle performs crash recovery upon restarting the database.
Crash recovery involves two steps: cache recovery and transaction recovery.
Cache recovery (roll forward): the committed and uncommitted data from the online redo log files are applied to the datafiles.
Transaction recovery (roll back): the uncommitted data is rolled back from the datafiles.
In a RAC environment, one of the surviving instances performs the crash recovery of the failed instance; this is known as instance recovery. In a single-instance database, crash recovery and instance recovery are synonymous.
What methods are available to keep the time synchronized on all nodes in the cluster?
In a RAC environment it is essential that all nodes are in sync in time. It is also recommended to keep the time in sync between primary and standby. There are three options to ensure that the time across nodes is in sync:
1) Windows Time Service
2) Network Time Protocol (NTP)
3) Oracle Cluster Time Synchronization Service (CTSS)
If one of the first two is available on a RAC node, CTSS starts in observer mode.
If neither is found, CTSS starts in active mode and synchronizes time across cluster nodes without an external server.
To check whether NTP is running on a server, use the ps command:
oracle:dev1:/home/oracle$ ps -ef | grep ntp
ntp 39013 1 0 Apr05 ? 00:03:34 ntpd -u ntp:ntp -p /var/run/ntpd.pid -x
oracle 93293 93226 0 12:54 pts/4 00:00:00 grep ntp
For RAC, NTP has to run with the -x option. This option means that time corrections are applied gradually in small steps; this is also called slewing.
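On systems where CTSS is being used instead of NTP, its mode can be confirmed with crsctl; a quick check:
$ crsctl check ctss    # reports whether CTSS is in Active or Observer mode (and, in active mode, the time offset)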
Q1. What is Oracle Real Application
Clusters?
Oracle RAC enables you to cluster Oracle databases. Oracle RAC uses Oracle Clusterware for the infrastructure to bind multiple servers so they operate as a single system. Oracle Clusterware is a portable cluster management solution that is integrated with Oracle Database.
Q2. What are the file storage options
provided by Oracle Database for Oracle RAC?
The file storage options provided by Oracle Database for Oracle RAC are:
· Automatic Storage Management (ASM)
· OCFS2 and Oracle Cluster File System (OCFS)
· A network file system
· Raw devices
Q3. What is a CFS?
A cluster File System (CFS) is a file system that may be accessed (read and write) by all members in a cluster at the same time. This implies that all members of a cluster have the same view.
Q4. What is cache fusion?
In a RAC environment, it is the combining of data blocks, which are shipped across the interconnect from remote database caches (SGA) to the local node, in order to fulfill the requirements for a transaction (DML, Query of Data Dictionary).
Q5. What is split brain?
When database nodes in a cluster are unable to communicate with each other, they may continue to process and modify the data blocks independently. If the
same block is modified by more than one instance, synchronization/locking of the data blocks does not take place and blocks may be overwritten by others in the cluster. This state is called split brain.
Q6. What methods are available to keep the
time synchronized on all nodes in the cluster?
Either the Network Time Protocol(NTP) can be configured or in 11gr2, Cluster Time Synchronization Service (CTSS) can be used.
Q7. Where are the Clusterware files stored on
a RAC environment?
The Clusterware binaries are installed on each node (in the Clusterware/Grid home), and the Clusterware files reside on the shared disks (the voting disks and the OCR).
Q8. What command would you use to check the
availability of the RAC system?
crs_stat -t -v (-t -v are optional)
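Note that crs_stat is deprecated from 11gR2 onward; as a sketch, the equivalent checks with the newer syntax are:
$ crsctl stat res -t          # resource status across the cluster
$ crsctl check cluster -all   # clusterware stack health on all nodes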
Q9. What is the minimum number of
instances you need to have in order to create a RAC?
You can create a RAC with just one server.
Q10. Name two specific RAC
background processes
RAC processes are: LMON, LMDx, LMSn, LKCx and DIAG.
Q11. What files components in RAC must reside
on shared storage?
Spfiles, ControlFiles, Datafiles and Redolog files should be created on shared storage.
Q12. Where does the Clusterware write when
there is a network or Storage missed heartbeat?
Missed network and disk heartbeats are logged by CSSD in the ocssd log and the Clusterware alert log under $CRS_HOME/log.
Q13. How do you find out what OCR backups are
available?
The ocrconfig -showbackup can be run to find out the automatic and manually run backups.
Q14. What is the interconnect used for?
It is a private network which is used to ship data blocks from one instance to another for cache fusion. The physical data blocks as well as data dictionary blocks are shared across this interconnect.
Q15. How do you determine what protocol is
being used for Interconnect traffic?
One of the ways is to look at the database alert log for the time period when the database was started up.
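A related check is to query the interconnect configuration from the database; a quick sketch (this shows which interfaces are in use, while the protocol itself is reported in the alert log at startup):
SQL> select name, ip_address, is_public, source from gv$cluster_interconnects;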
Q16. If your OCR is corrupted what options do
have to resolve this?
You can use either the logical or the physical OCR backup copy to restore the Repository.
Q17. What is hangcheck timer used for ?
The hangcheck timer regularly checks the health of the system. If the system hangs or stops, the node is restarted automatically.
There are 2 key parameters for this module:
· hangcheck-tick: this parameter defines the period of time between checks of system health. The default value is 60 seconds; Oracle recommends setting it to 30 seconds.
· hangcheck-margin: this defines the maximum hang delay that should be tolerated before hangcheck-timer resets the RAC node.
Q18. What is the difference between Crash
recovery and Instance recovery?
When an instance crashes in a single-node database, a crash recovery takes place on startup. In a RAC environment the same recovery for a failed instance is performed by a surviving node; this is called instance recovery.
Q19. How do we know which database instances
are part of a RAC cluster?
You can query the V$ACTIVE_INSTANCES view to determine the member instances of the RAC cluster.
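A quick sketch of such a query (INST_NUMBER and INST_NAME are the documented columns of the view):
SQL> select inst_number, inst_name from v$active_instances;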
Q20. What it the ASM POWER_LIMIT?
This is the parameter which controls the number of allocation units the ASM instance will rebalance at any given time (the rebalance power). The default value is 1; the maximum value was 11 in releases before 11.2.0.2 and was raised to 1024 from 11.2.0.2 onward.
Q21. What is a rolling upgrade?
A patch is considered rolling if it can be applied to the cluster binaries without having to shut down the whole database in a RAC environment. All nodes in the cluster are patched one by one, with only the node being patched unavailable while all other instances remain open.
Q22. What is the default memory allocation
for ASM?
In 10g the default ASM SGA size is 1G, in 11g it is 256M, and in 12c it is set back to 1G.
Q23. How do you find out what object has its
blocks being shipped across the instance the most?
You can use the DBA_HIST_SEG_STAT view (AWR segment statistics).
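As a live (non-AWR) alternative, the global segment statistics view can be queried directly; a minimal sketch, assuming the interest is in CR blocks received over the interconnect:
SQL> select owner, object_name, inst_id, value
       from gv$segment_statistics
      where statistic_name = 'gc cr blocks received'
      order by value desc;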
Q24. What is a VIP in RAC use for?
The VIP is an alternate Virtual IP address assigned to each node in a cluster. During a node failure the VIP of the failed node moves to the surviving node and relays to the application that the node has gone down. Without VIP, the application will wait for TCP timeout and then find out that the session is no longer live due to the failure.
Q25. What components of the Grid should I
back up?
The backups should include OLR, OCR and ASM Metadata.
Q26. Is there an easy way to verify the
inventory for all remote nodes?
You can run the opatch lsinventory -all_nodes command from a single node to look at the inventory details for all nodes in the cluster.
Q27. How do you backup ASM Metadata?
You can use md_backup to back up the ASM diskgroup metadata and md_restore to recreate the diskgroup configuration in case of ASM diskgroup storage loss.
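A minimal sketch with asmcmd (the diskgroup name and backup file path are examples, not from the original text):
ASMCMD> md_backup /tmp/dg_data_backup -G DATA
ASMCMD> md_restore /tmp/dg_data_backup --full -G DATA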
Q28. What files can be stored in the ASM
diskgroup?
In 11g the following files can be stored in ASM diskgroups.
· Datafiles
· Redo logfiles
· Spfiles
In 12c the password file can also be stored in an ASM diskgroup.
Q29. What is OCLUMON used for in a cluster
environment?
The Cluster Health Monitor (CHM) stores operating system metrics in the CHM repository for all nodes in a RAC cluster. It stores information on CPU, memory, process, network and other OS data. This information can later be retrieved and used to troubleshoot and identify cluster related issues. It is a default component of the 11gR2 Grid install. The data is stored in the master repository and replicated to a standby repository on a different node.
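CHM data is retrieved with the oclumon utility; a minimal sketch (the time window is an arbitrary example):
$ oclumon dumpnodeview -allnodes -last "00:15:00"   # node metrics for the last 15 minutes
$ oclumon manage -get REPPATH                       # location of the CHM repository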
Q30. What would be the possible performance
impact in a cluster if a less powerful node (e.g. slower CPU’s) is added to the
cluster?
All processing will slow down to the speed of the slowest server.
Q31. What is the purpose of OLR?
The Oracle Local Registry (OLR) contains the information that allows the cluster processes to be started up when the OCR is kept in ASM storage. Since ASM is unavailable until the Grid processes are started, a local copy of the startup-relevant contents of the OCR is required, and this is stored in the OLR.
Q32. What are some of the RAC specific
parameters?
Some of the RAC parameters are:
· CLUSTER_DATABASE
· CLUSTER_DATABASE_INSTANCES
· INSTANCE_TYPE (RDBMS or ASM)
· ACTIVE_INSTANCE_COUNT
· UNDO_MANAGEMENT
Q33. What is the future of the Oracle Grid?
The Grid software is becoming more and more capable of not just supporting HA for Oracle Databases but also other applications including Oracle’s applications. With 12c there are more features and functionality built-in and it is easier to deploy these pre-built solutions, available for common Oracle applications.
OCR Backup and Recovery in Oracle RAC
OCR stands for Oracle Cluster Registry. It stores cluster configuration information. It is a shared-disk component and must be accessible by all nodes in the cluster. It also keeps track of which database instance runs on which node and which service runs on which database. The CRSd daemon manages the configuration information in the OCR and maintains the changes to the cluster in the registry.
Automatic Backup of OCR:
Automatic backup of the OCR is done by the CRSD process every 4 hours; daily and weekly backups are retained as well. The default location is CRS_home/cdata/cluster_name, but this default backup location can be changed. We can check the existing backups and their location using the following command:
$ ocrconfig -showbackup
We can change the default location of the physical OCR copies using the following command:
$ ocrconfig -backuploc <new_backup_directory>
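A manual physical backup can also be taken on demand; a sketch, with an example backup directory:
# ocrconfig -backuploc /u02/crs_backup/ocr   # example location; choose storage that is itself backed up
# ocrconfig -manualbackup                    # take an on-demand physical backup
# ocrconfig -showbackup manual               # list manually taken backups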
How to take PHYSICAL Backup of OCR?
First check the exact location of the automatic OCR backups using the "ocrconfig -showbackup" command; these automatic backups are physical copies of the Oracle Cluster Registry. There is no need to bring the cluster down to take a physical backup of the OCR. Use a simple operating system copy command to copy the physical OCR copies to the backup destination, as shown below.
$ cp -p -R /u01/app/crs/cdata /u02/crs_backup/ocrbackup/RACNODE1
How to take MANUAL EXPORT Backup of OCR?
We can take an export backup of the OCR (Oracle Cluster Registry) online as well; there is no need to bring the cluster down. A manual export backup can be taken using the "ocrconfig -export" command as follows.
$ ocrconfig -export /u04/crs_backup/ocrbackup/exports/OCRFile_expBackup.dmp
How to Recover OCR from PHYSICAL Backup?
Recovering the OCR from an automatic physical backup requires that the clusterware, the RAC instances and the RAC database are all brought down before performing the recovery. A command reference for recovery of the OCR from a physical backup copy:
$ ocrconfig -showbackup
$ srvctl stop database -d RACDB (shut down all RAC instances and the RAC database)
$ crsctl stop crs (shut down the clusterware)
# rm -f /u01/oradata/racdb/OCRFile
# cp /dev/null /u01/oradata/racdb/OCRFile
# chown root /u01/oradata/racdb/OCRFile
# chgrp oinstall /u01/oradata/racdb/OCRFile
# chmod 640 /u01/oradata/racdb/OCRFile
# ocrconfig -restore /u02/apps/crs/cdata/crs/backup00.ocr
$ crsctl start crs (after starting the cluster, check its status using 'crs_stat -t')
$ srvctl start database -d RACDB (start the Oracle RAC database and RAC instances)
How to Recover OCR from EXPORT Backup?
We can import the OCR metadata from an export dump. Before importing, stop the Oracle RAC database, the RAC instances and the clusterware. Remove the OCR partition as well as the OCR mirror partition, recreate them using the 'dd' command, and then import the OCR metadata from the export dump file. Example commands:
$ srvctl stop database -d RACDB (shut down all RAC instances and the RAC database)
$ crsctl stop crs (shut down the clusterware)
# rm -f /u01/oradata/racdb/OCRFile
# dd if=/dev/zero of=/u01/oradata/racdb/OCRFile bs=4096 count=65587
# chown root /u01/oradata/racdb/OCRFile
# chgrp oinstall /u01/oradata/racdb/OCRFile
# chmod 640 /u01/oradata/racdb/OCRFile
The SAME steps need to be repeated for the OCR mirror.
# ocrconfig -import /u04/crs_backup/ocrbackup/exports/OCRFile_exp_Backup.dmp (import the OCR metadata)
$ crsctl start crs (after starting the cluster, check its status using 'crs_stat -t')
$ srvctl start database -d RACDB (start the Oracle RAC database and RAC instances)
Remember the following important things:
· Oracle takes physical backups of the OCR automatically.
· No cluster or RAC database downtime is required for a PHYSICAL backup of the OCR.
· No cluster or RAC database downtime is required for a MANUAL export backup of the OCR.
· Recovery of the OCR from any of the above backups requires bringing everything down.
· All of the above procedures require ROOT login.
OCR backups need to be monitored constantly, because the OCR is a critical part of Oracle RAC; make OCR backups part of routine Oracle RAC database administration.
How to change the hostname in RAC after installation?
Suppose you want to change the hostname, IP address or DNS configuration on a server where Oracle is running on ASM. The steps below were written for an installation that splits the ownership of the Grid Infrastructure and the database between a user named ORAGRID and a user named ORADB respectively. Make sure you run the commands below as the right user.
Step 1:
Check the existing configured resources (run from the Grid home environment):
[oragrid@litms#### ~]$ crs_stat -t
Step 2:
Before starting the Oracle Restart process you must stop the listener:
[oragrid@litms#### ~]$ lsnrctl stop listener
Step 3:
Confirm that the listener is stopped:
[oragrid@litms#### ~]$ crs_stat -t
Step 4:
Log in as ROOT and set ORACLE_HOME to the Grid home.
Execute the command below to remove the existing Oracle Grid Infrastructure configuration:
[root@litms#### ~]# $ORACLE_HOME/perl/bin/perl -I $ORACLE_HOME/perl/lib -I $ORACLE_HOME/crs/install
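The script name appears to be truncated in the command above. For an Oracle Restart (single-node Grid Infrastructure) home, the script normally invoked at this point is roothas.pl; a hedged sketch, assuming that setup:
# $ORACLE_HOME/perl/bin/perl -I $ORACLE_HOME/perl/lib -I $ORACLE_HOME/crs/install $ORACLE_HOME/crs/install/roothas.pl -deconfig -force   # assumption: Oracle Restart deconfiguration (a full cluster would use rootcrs.pl instead)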
Step 5:
After removing the Oracle configuration you can change the hostname of your server:
Edit the /etc/sysconfig/network file
Edit the /etc/hosts file
[root@### ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
Step 6:
Edit the listener.ora file with the new hostname
Step 7:
Log in as ROOT and set ORACLE_HOME to the Grid home.
Execute the command below to recreate the Grid Infrastructure configuration:
[root@litms####~]# $ORACLE_HOME/perl/bin/perl -I $ORACLE_HOME/perl/lib -I $ORACLE_HOME/crs/install
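As in Step 4, the script name appears truncated; assuming an Oracle Restart home, the reconfiguration is normally done by running roothas.pl without the deconfig options (a hedged sketch):
# $ORACLE_HOME/perl/bin/perl -I $ORACLE_HOME/perl/lib -I $ORACLE_HOME/crs/install $ORACLE_HOME/crs/install/roothas.pl   # assumption: Oracle Restart (re)configuration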
Step 8:
Add the listener and start it:
[oragrid@litms### ~]$ srvctl add listener
[oragrid@litms#### ~]$ srvctl start listener
Step 9:
Create ASM, add the disks, and mount all diskgroups manually:
[oragrid@litms#### disks]$ srvctl add asm -d '/dev/oracleasm/disks/*'
[oragrid@litms#### disks]$ srvctl start asm
[oragrid@litmsj614 disks]$ sqlplus / as sysasm
[oragrid@litms#### disks]$ srvctl status diskgroup -g rpst02data
Step 10:
Configure all databases with SRVCTL (Oracle Restart); configure and start your database:
[oragrid@litms#### disks]$ srvctl add database -d RPST02 -o $ORACLE_HOME -n RPST02 -p
Change Public/SCAN/Virtual IP/Name in 11g/12c RAC
When working with a Real Application Clusters database, changing the infrastructure properties is a bit tricky, if not difficult. There is a dependency chain across several network and name properties, and the dependent components on the Oracle side also need to be modified.
I recently undertook this exercise for one of our RAC clusters, which resulted in this post. The use cases I have covered are as follows.
Case I. Changing Public Host-name
The public hostname is recorded in the OCR; it is entered during the installation phase and cannot be modified after the installation. The only way to modify the public hostname is to delete the node and then add it back with a new hostname, or to reinstall the clusterware.
Case II. Changing Public IP Only Without
Changing Interface, Subnet or Netmask
If the change is only the public IP address and the new addresses are still in the same subnet, nothing needs to be done at the clusterware layer; all changes are done at the OS layer to reflect the change.
1. Shutdown Oracle Clusterware stack
2. Modify the IP address at network layer, DNS and /etc/hosts file to reflect the change
3. Restart Oracle Clusterware stack
The above change can be done in a rolling fashion, e.g. one node at a time.
Case III. Changing SCAN / SCAN IP
SCAN is used to access the cluster as a whole from Oracle database clients and can redirect a connection request to any available node in the cluster where the requested service is running. It is a cluster resource and can fail over to any other node if the node where it is running fails. The SCAN entry is stored in the OCR and its IP addresses are configured at DNS level.
So to change the SCAN IP/name, one first has to populate the changes in DNS. Once the changes are in effect, the SCAN resource in the OCR can be modified as follows. Remember that SCAN acts as the cluster entry point and handles connection load balancing, so restarting SCAN requires a brief outage; existing connections, however, are not impacted.
[oracle@dbrac2 ~]$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node dbrac1
[oracle@dbrac2 ~]$ srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is running on node dbrac1
[oracle@dbrac2 ~]$ srvctl config scan
SCAN name: dbrac-scan.localdomain, Network: 1
Subnet IPv4: 192.168.2.0/255.255.255.0/eth1, static
Subnet IPv6:
SCAN 0 IPv4 VIP: 192.168.2.110
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes:
SCAN VIP is individually disabled on nodes:
[oracle@dbrac2 ~]$ srvctl stop scan_listener
[oracle@dbrac2 ~]$ srvctl stop scan
[oracle@dbrac2 ~]$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is not running
[oracle@dbrac2 ~]$ srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is not running
-- MODIFY THE SCAN IP AT OS LEVEL
[root@dbrac2 ~]$ srvctl modify scan -scanname dbrac-scan.localdomain
[oracle@dbrac2 ~]$ srvctl config scan
SCAN name: dbrac-scan.localdomain, Network: 1
Subnet IPv4: 192.168.2.0/255.255.255.0/eth1, static
Subnet IPv6:
SCAN 0 IPv4 VIP: 192.168.2.120
SCAN VIP is enabled.
[oracle@dbrac2 ~]$ srvctl start scan_listener
[oracle@dbrac2 ~]$ srvctl start scan
Since this is my test cluster, I have configured only one SCAN VIP; regardless, the process remains the same for a SCAN with 3 IPs.
Case IV. Changing Virtual IP / Virtual Hostname
-- CHANGING NODE VIP FROM 192.168.2.103 TO 192.168.2.203 ON DBRAC1
-- SINCE NODE VIP IS PART OF NODE APPS ONE NEEDS TO MODIFY THE IP ADDRESS ON OS LEVEL AND THEN USE SRVCTL TO MODIFY NODEAPPS
[oracle@dbrac1 automation]$ srvctl config vip -node dbrac1
VIP exists: network number 1, hosting node dbrac1
VIP Name: dbrac1-vip.localdomain
VIP IPv4 Address: 192.168.2.103
VIP IPv6 Address:
VIP is enabled.
VIP is individually enabled on nodes:
VIP is individually disabled on nodes:
[oracle@dbrac1 automation]$ srvctl stop vip -node dbrac1
PRCR-1065 : Failed to stop resource ora.dbrac1.vip
CRS-2529: Unable to act on 'ora.dbrac1.vip' because that would require stopping or relocating 'ora.LISTENER.lsnr', but the force option was not specified
[oracle@dbrac1 automation]$ srvctl stop vip -node dbrac1 -force
[oracle@dbrac1 automation]$ srvctl status vip -node dbrac1
VIP dbrac1-vip.localdomain is enabled
VIP dbrac1-vip.localdomain is not running
-- NOW MODIFY THE ADDRESS OF NODE VIP ON OS LEVEL USING EITHER /etc/hosts OR DNS.
-- Once done, use SRVCTL to modify OCR resource.
-- Here I am not changing the name, but only IP
[oracle@dbrac1 automation]$ srvctl modify nodeapps -node dbrac1 -address dbrac1-vip.localdomain/255.255.255.0/eth1
[oracle@dbrac1 automation]$ srvctl config vip -node dbrac1
VIP exists: network number 1, hosting node dbrac1
VIP Name: dbrac1-vip.localdomain
VIP IPv4 Address: 192.168.2.203
VIP IPv6 Address:
VIP is enabled.
VIP is individually enabled on nodes:
VIP is individually disabled on nodes:
[root@dbrac2 ~]# srvctl config nodeapps
[oracle@dbrac1 automation]$ srvctl start vip -node dbrac1
[root@dbrac2 ~]# srvctl status nodeapps
-- ON NODE2
[oracle@dbrac2 ~]$ srvctl stop vip -node dbrac2 -force
[oracle@dbrac2 ~]$ srvctl status vip -node dbrac2
VIP dbrac2-vip.localdomain is enabled
VIP dbrac2-vip.localdomain is not running
[oracle@dbrac2 ~]$ srvctl modify nodeapps -node dbrac2 -address dbrac2-vip.localdomain/255.255.255.0/eth1
[oracle@dbrac2 ~]$ srvctl config vip -n dbrac2
VIP exists: network number 1, hosting node dbrac2
VIP Name: dbrac2-vip.localdomain
VIP IPv4 Address: 192.168.2.204
VIP IPv6 Address:
VIP is enabled.
VIP is individually enabled on nodes:
VIP is individually disabled on nodes:
[oracle@dbrac2 ~]$ srvctl start vip -node dbrac2
[oracle@dbrac2]$ srvctl status nodeapps
VIP dbrac1-vip.localdomain is enabled
VIP dbrac1-vip.localdomain is running on node: dbrac1
VIP dbrac2-vip.localdomain is enabled
VIP dbrac2-vip.localdomain is running on node: dbrac2
Network is enabled
Network is running on node: dbrac1
Network is running on node: dbrac2
ONS is enabled
ONS daemon is running on node: dbrac1
ONS daemon is running on node: dbrac2
[oracle@dbrac1 automation]$ crs_stat -t | grep vip
ora.dbrac1.vip ora....t1.type ONLINE ONLINE dbrac1
ora.dbrac2.vip ora....t1.type ONLINE ONLINE dbrac2
ora.scan1.vip ora....ip.type ONLINE ONLINE dbrac1
A special case for 11gR2 VIP Name Change -
modifying the VIP hostname only without changing the IP address.
For example: only VIP hostname changes from dbrac1-vip to dbrac1-nvip, IP and other attributes remain the same.
If the IP address is not changed, the above modify command will not change the USR_ORA_VIP value in the 'crsctl stat res ora.dbrac1.vip -p' output. Use the following command instead:
# crsctl modify res ora.dbrac1.vip -attr USR_ORA_VIP=ora.dbrac1.nvip
Verify the changes for USR_ORA_VIP field:
# crsctl stat res ora.dbrac1.vip -p |grep USR_ORA_VIP
Three important flags for the crsctl stat res command are as follows.
-p Print static configuration
-v Print runtime configuration
-f Print full configuration
Did you ever lose your Grid Infrastructure diskgroup? Or: what if the disk is lost from the diskgroup which stores the OCR and voting files?
1. Status
The cluster stack does not come up since there are no voting disks anymore.
[root@oel6u4 ~]# crsctl stat res -t -init
2. Recreate the OCR diskgroup
First I made my disk available again using ASMLib:
[root@oel6u4 ~]# oracleasm listdisks
DATA
[root@oel6u4 ~]# oracleasm createdisk OCR /dev/sdc1
Writing disk header: done
Instantiating disk: done
[root@oel6u4 ~]# oracleasm listdisks
DATA
OCR
Now I need to restore my ASM diskgroup, but that requires a running ASM instance. So I stop the cluster stack and start it again in exclusive mode. By the way, "crsctl stop crs -f" did not finish, so I disabled the cluster stack by issuing "crsctl disable has" and rebooted.
[root@oel6u4 ~]# crsctl enable has
As you can see, the startup fails since "ora.storage" is not able to locate the OCR diskgroup. That means there is a timeframe of about 10 minutes to create the diskgroup during the startup of "ora.storage".
If I had made a backup of my ASM diskgroup metadata I could have used that, but I have not, so I create my OCR diskgroup from scratch. Start the CRS again and then do the following from a second session:
[root@oel6u4 ~]# cat ocr.dg
[root@oel6u4 ~]# asmcmd mkdg ~/ocr.dg
[root@oel6u4 ~]# asmcmd lsdg
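The contents of ocr.dg are not shown above. Purely as an illustration of the XML format that asmcmd mkdg expects, a minimal diskgroup definition could look like the following (the diskgroup name, redundancy and disk path are assumptions, not the author's actual file):
<dg name="OCR" redundancy="external">
  <!-- hypothetical disk path; use the device that was actually stamped with oracleasm createdisk -->
  <dsk string="/dev/oracleasm/disks/OCR"/>
</dg>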
3. Restore OCR
The next step is restoring the OCR from backup. Fortunately the clusterware creates backups of the OCR by itself right from the beginning.
[root@oel6u4 ~]# ocrconfig -showbackup
Just choose the most recent backup and use it to
restore the contents of OCR.
[root@oel6u4 ~]# ocrconfig -restore
/u01/app/grid/12.1.0.2/cdata/mycluster/backup00.ocr
[root@oel6u4 ~]# ocrcheck
4. Restore Voting Disk
Since the voting files are placed in ASM together with the OCR, the OCR backup contains a copy of the voting file as well. All I need to do is start CRSD and recreate the voting file.
[root@oel6u4 ~]# crsctl start res ora.crsd -init
[root@oel6u4 ~]# crsctl replace votedisk +OCR
Not good. The reason for that is that ASM does not have ASM_DISKSTRING configured. Actually, ASM has not a single parameter configured, because it was using an spfile stored in the OCR diskgroup as well, which means there is no spfile anymore. As a quick solution I set the parameter in memory.
[oracle@oel6u4 ~]$ sqlplus / as sysasm
alter system set asm_diskstring='/dev/oracleasm/disks/*' scope=memory;
With this small change I am now able to recreate
the voting file.
[root@oel6u4 ~]# crsctl replace votedisk +OCR
[root@oel6u4 ~]# crsctl query css votedisk
5. Restore ASM spfile
This is easy: I don't have a backup of my ASM spfile, so I recreate it from memory.
[oracle@oel6u4 ~]$ sqlplus / as sysasm
SQL> create spfile='+OCR' from memory;
The GPnP profile also gets updated by doing so.
6. Restart Grid Infrastructure
I have restored everything that I need to start the clusterware in normal operation mode. Let's see:
[root@oel6u4 ~]# crsctl stop crs -f
[root@oel6u4 ~]# crsctl start has -wait
You see, the GIMR (MGMTDB) is gone too. I will talk about that soon. First, let's see if all the other resources are running properly.
[root@oel6u4 ~]# crsctl stat res -t
7. Restore ASM password file
Since 12c the password file for ASM is stored inside ASM. Again, I have no backup, so I need to create it from scratch.
[oracle@oel6u4 ~]$ orapwd file=/tmp/orapwasm password=Oracle-1 force=y
[oracle@oel6u4 ~]$ asmcmd pwcopy --asm /tmp/orapwasm +OCR/pwdasm
The pwcopy command updates the GPnP profile to reflect this.
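To verify where the ASM password file now lives, asmcmd offers pwget; a quick check (the location shown should match the pwcopy target above):
ASMCMD> pwget --asm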
What is a Master Node in RAC
The master node in Oracle RAC is the node which is responsible for initiating the OCR backup.
The node-id of the master node is the lowest node-id among the nodes in the cluster; node-ids are assigned to the nodes in the order in which they join the cluster, and therefore the node which joins the cluster first is designated as the master node.
Tasks of the Master Node
· The crsd process of the master node is responsible for initiating the OCR backup.
· The master node is responsible for syncing the OCR local cache across the nodes.
· Only the crsd process on the master node updates the OCR on disk.
· In case of node eviction, if the cluster is divided into two equal sub-clusters, the sub-cluster containing the master node survives and the other sub-cluster is evicted.
How to identify the Master Node in RAC
1> Identify the node which performs the backup of the OCR:
$ ocrconfig -showbackup
2> Check the crsd logs from the various nodes:
cat crsd.trc | grep MASTER
3> Check the ocssd logs from the various nodes:
cat ocssd.trc | grep MASTER
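The exact trace file locations vary by release; on a 12c system with the default ADR layout, the checks could look like this (the paths are assumptions):
$ grep -i "OCR MASTER" $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/crsd*.trc
$ grep -i "master"     $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/ocssd*.trc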
What will happen if the Master Node is down?
A very obvious question: if the master node is down, will the OCR not be backed up?
When the OCR master (the crsd.bin process) stops or restarts for whatever reason, the crsd.bin on the surviving node with the lowest node number becomes the new OCR master.
Just to prove this, I restarted Node1, which was my current master node, checked the log file and did a manual OCR backup; you can see the result below:
2017-06-18 00:17:15.398 : CRSPE:1121937152: {2:46838:561} PE Role|State Update: old role [SLAVE] new [MASTER]; old state [Running] new [Configuring]
2017-06-18 00:17:15.398 : CRSPE:1121937152: {2:46838:561} PE MASTER NAME: node2
2017-06-18 00:17:15.403 : OCRMAS:1748141824: th_master:13: I AM THE NEW OCR MASTER at incar 7. Node Number 2
What is cluvfy?
The Cluster Verification Utility (CVU) performs system checks in
preparation for installation, patch updates, or other system changes. Using CVU
ensures that you have completed the required system configuration and
preinstallation steps so that your Oracle Grid Infrastructure or Oracle Real
Application Clusters (Oracle RAC) installation, update, or patch operation
completes successfully.
With Oracle Clusterware 11g release 2 (11.2), Oracle
Universal Installer is fully integrated with CVU, automating many CVU
prerequisite checks. Oracle Universal Installer runs all prerequisite checks
and associated fixup scripts when you run the installer. CVU can verify the primary cluster components during an operational
phase or stage. A component can be basic, such as free disk space, or it can be
complex, such as checking Oracle Clusterware integrity. For example, CVU can
verify multiple Oracle Clusterware subcomponents across Oracle Clusterware
layers. Additionally, CVU can check disk space, memory, processes, and other
important cluster components. A stage could be, for example, database
installation, for which CVU can verify whether your system meets the criteria
for an Oracle Real Application Clusters (Oracle RAC) installation. Other stages
include the initial hardware setup and the establishing of system requirements
through the fully operational cluster setup.
cluvfy stage {-pre|-post} stage_name stage_specific_options
[-verbose]
Valid stage options and stage names are:
-post hwos : post-check for hardware and operating system
-pre cfs : pre-check for CFS setup
-post cfs : post-check for CFS setup
-pre crsinst : pre-check for CRS installation
-post crsinst : post-check for CRS installation
-pre hacfg : pre-check for HA configuration
-post hacfg : post-check for HA configuration
-pre dbinst : pre-check for database installation
-pre acfscfg : pre-check for ACFS Configuration.
-post acfscfg : post-check for ACFS Configuration.
-pre dbcfg : pre-check for database configuration
-pre nodeadd : pre-check for node addition.
-post nodeadd : post-check for node addition.
-post nodedel : post-check for node deletion.
Example: post-installation checks for the hwos stage (hardware and operating system setup):
cluvfy stage -post hwos -n node_list [-verbose]
./runcluvfy.sh stage -post hwos -n node1,node2 -verbose
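Individual components can also be checked with the comp keyword; two commonly used checks as a sketch (the node names are examples):
cluvfy comp nodecon -n node1,node2 -verbose   # node connectivity over public and private networks
cluvfy comp ocr -n all -verbose               # OCR integrity on all nodes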
What is shared everything and
shared nothing storage architecture
A shared-everything architecture means that any given service knows everything and will fail if it doesn't. If you have a central database (or other similar service), then you likely have a shared-everything architecture. Shared-nothing means that no state is shared between services: if one service fails, nothing happens to the other services. Most applications generally start as a shared-everything architecture, but they don't have to; that's just been my experience. When you get to global scale, you want to be able to have random services fail, and while the application may run in a degraded state until the failed service comes back online, the rest of the application will continue to run.
A shared-nothing architecture is resilient when done properly. Everything it needs to know from outside the service is sent to the service in its "work request", which is probably signed or encrypted with a trusted key.
For example, suppose a user requests a resource from the resource-service. The resource-service will either have its own database of which users are allowed to request that specific resource, or the request will include an encrypted token providing the service with authentication and/or authorization. The resource-service doesn't have to call an authorization service (or look in a shared DB) and ask if the user is allowed to access the resource. This means that if the auth service were down and the user held a still-valid token issued before the auth service went down, the user could still retrieve the resource until the token expired.
When to use them?
Use a shared-everything architecture when you need to be highly consistent at the cost of resilience. In the example above, if the auth-service decided that the token was incorrect and revoked it, the resource-service couldn't know about it until it was informed by the auth-service; in that period of time, the user could have "illegally" retrieved the resource. This is akin to taking out several loans before they get the chance to all appear on your credit report and affect your ability to get a loan. Use shared-nothing when resilience is more important than consistency; if you can guarantee some kind of eventual consistency, then this might be the most scalable solution.
Difference between Admin-managed and Policy-managed databases?
Before discussing admin-managed and policy-managed databases, a short description of server pools.
Server pool: A server pool divides the cluster logically. Server pools are used when the cluster/database is policy managed. The database is assigned to a server pool and Oracle takes control of managing the instances: Oracle decides which instance runs on which node. There is no hard-coded rule to run an instance on a particular node as there is in administrator-managed databases.
Types of server pool: By default Oracle creates two server pools, i.e. the Free pool and the Generic pool. These pools are internally managed and have limitations on changing their attributes.
1. Free pool: When the cluster is created, all the nodes are assigned to the Free pool. Nodes from this pool are assigned to other pools as they are defined.
2. Generic pool: When we upgrade a cluster, all the nodes are assigned to the Generic pool. You cannot query the details of this pool. When an admin-managed database is created, a server pool is created with its name and assigned as a child of the Generic pool.
Here we will see the two types of configuration of the RAC database:
1. ADMINISTRATOR MANAGED Database
2. POLICY MANAGED Database
1. ADMINISTRATOR MANAGED Database: In an admin-managed database we hard-couple each instance of the database to a node, e.g. instance 1 will run on node 1 and instance 2 will run on node 2. Here the DBA is responsible for managing the instances of the database. When an admin-managed database is created, a server pool is created with its name and assigned as a child of the Generic pool.
This method of database management is generally used when there is a small number of nodes. If the number of nodes grows high, say beyond 20, the policy-managed approach should be used.
2. POLICY MANAGED Database: In a policy-managed database, the database is assigned to a server pool and Oracle takes control of managing the instances. The number of instances started for the database will be equal to the number of nodes currently assigned to the server pool. Any instance can run on any node, e.g. instance 1 can run on node 2 and instance 2 can run on node 1; there is no hard coupling of the instances to the nodes. A sketch of creating a server pool and a policy-managed database follows.
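A minimal sketch of setting up a policy-managed database with srvctl, using 12c-style options (the pool name, sizes and database name are assumptions; 11.2 uses the older -g/-l/-u option style instead):
$ srvctl add srvpool -serverpool mypool -min 2 -max 4 -importance 10   # define the server pool
$ srvctl add database -db orcl -oraclehome $ORACLE_HOME -serverpool mypool -dbtype RAC
$ srvctl start database -db orcl
$ srvctl config srvpool -serverpool mypool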