FAULT-TOLERANT ROUTING METHODS IN NETWORK ON CHIP

Network-on-Chip (NoC) communication remains a very efficient communication method for Systems on Chip (SoC). Different types of faults affect this communication, so fault tolerance is an essential factor with a direct impact on the reliability of the system. Many techniques have been developed to improve the fault-tolerance capability of NoCs, which in turn increases the reliability of the system. This paper discusses such methods and compares them in tabular form.


I. INTRODUCTION
A system on chip (SoC) is an integrated circuit (also known as a "chip") that integrates many components, including a central processing unit (CPU), memory, input/output ports and secondary storage [23]. Many-core SoCs are widely used in high-performance computing, embedded systems, and other fields. Traditionally, SoC components are interconnected through a shared bus. However, this communication method cannot satisfy increasing scalability requirements, because bus utilization is poor: only one master can use the shared bus at any given time. This drawback has led to a new paradigm in which a new communication infrastructure is applied. This paradigm, known as packet-based interconnection networks or Network-on-Chip (NoC) architecture, replaces the shared bus with a communication network consisting of Processing Elements (PEs), routers, links and Network Interfaces (NIs). The network interfaces connect the processing elements to the network [17] [18]. They are responsible for managing the connection between different system components and for creating and unpacking the packets sent over this connection. The Network Interface Controller (NIC) divides variable-sized data packets into smaller units called flits, and the routers then deliver the flits from source to destination. The routing information is carried in the first flit of the packet, which is called the packet header or header flit [24], [25], [26].
The fundamental components of a NoC are network adapters, routing nodes and links [18]. Network adapters implement the interface through which cores connect to the NoC, decoupling computation from communication. Routing nodes route the flits. Links connect the nodes and provide the raw bandwidth; they may consist of one or more logical or physical channels. When designing a NoC, major design decisions have to be made, such as the topology that connects the communicating routers. The most common NoC topologies include ring, mesh and torus. Because of its simple physical layout and short wire lengths, the mesh topology is widely used in actual implementations; it is discussed in Section 2. In this paper, we focus on mesh topology-based routing.
A fault can be defined as "a physical defect, imperfection, or flaw that occurs within some hardware or software component" [12]. In general, faults can be divided into permanent faults and transient faults. A permanent fault disrupts the affected component permanently; that is, unwanted behavior is produced every time the faulty component is used. A transient fault, on the other hand, has a temporary effect and is caused by unusual environmental conditions. Transient faults are harder to detect because they occur randomly; however, they are the most common failures in nano-scale circuits. Faults can appear in cores, routers, links and other components. In this paper, we consider faults in routers, where failed routers are usually handled by fault-tolerant routing. The objective of fault-tolerant routing is to maximize the ability of the good nodes in a direct network to communicate with each other in the presence of faulty nodes or routers.
The paper is organized as follows. Section 2 describes the 2D mesh network-on-chip. Section 3 surveys fault-tolerant routing methods for network-on-chip. Section 4 analyzes these methods, and Section 5 concludes the paper.

II. 2D MESH NETWORK ON CHIP
One of the most commonly used NoC topologies is the 2D mesh. In general, it consists of M rows and N columns, with the routers arranged along these rows and columns, so the total number of routers in the 2D mesh is M*N. Fig. 1 shows an example of the architecture of a 2D mesh NoC. Since it consists of 4 rows and 4 columns, there are 16 routers in total. In a mesh, all routers except those on the boundary have four adjacent neighbors. Besides the topology, the main design decisions are the routing protocol, which is responsible for delivering packets from source to destination, the routers' microarchitecture, and the flow control scheme, which defines how network resources are allocated during routing [19][20][21] [22].
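The mesh structure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name mesh_neighbors and the convention that router coordinates are (x, y) pairs starting at (0, 0) are assumptions made here, not notation taken from the surveyed papers.

```python
def mesh_neighbors(x, y, cols, rows):
    """Return the routers adjacent to (x, y) in a 2D mesh with `cols` columns and `rows` rows."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    # Keep only the candidates that fall inside the mesh boundary.
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < cols and 0 <= cy < rows]


rows, cols = 4, 4  # the 4x4 example mesh
routers = [(x, y) for y in range(rows) for x in range(cols)]

print(len(routers))                            # 16 routers in total (M*N)
print(len(mesh_neighbors(1, 1, cols, rows)))   # 4: an interior router
print(len(mesh_neighbors(0, 0, cols, rows)))   # 2: a corner router
```

Interior routers have four neighbors, while routers on an edge or in a corner have three or two, which is why boundary routers are treated as a special case.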

Fault tolerant routing based on RSD
The RSD (Rectangle defined by Source node and Destination node) is a special rectangle introduced in [5] to cover all possible minimal paths from a source node to a destination node. A mesh network contains many RSDs, one for each pair of source and destination nodes. All nodes included in an RSD are called "RSD nodes". All possible minimal paths from (Xs, Ys) to (Xd, Yd) lie within the range of the RSD, and only RSD nodes can make up a minimal path. The RSD fault block model is constructed for highly efficient fault-tolerant Manhattan routing algorithms in 2D mesh networks. The RSD fault block construction algorithm depends not on the scale of the mesh but on the range of the RSD. No matter how large the mesh is, there are only two pairs of source and destination nodes whose RSDs are equal to the whole mesh network; all other RSDs are smaller than the whole mesh. That is, an RSD is unlikely to extend to the whole mesh network.
An RSD-based general fault-tolerant minimal routing method for mesh architectures is proposed in [1]. It labels all enhanced unreachable nodes ("ER nodes": a node each of whose previous-hop nodes is a faulty node, a prohibited previous-hop node or an ER node), all enhanced useless nodes ("ES nodes": a node each of whose next-hop nodes is a faulty node, a prohibited next-hop node or an ES node), all prohibited nodes and all faulty nodes with low time complexity by computing every node's F-APC (forward allowed-path counter) and R-APC (reverse allowed-path counter) values. As an example, consider the RSD made up of source node (5, 3) and destination node (12, 10). Because (12, 10) is to the northeast of (5, 3), all E(ast)N(orth) turns at nodes located in an even column are prohibited according to the odd-even turn model [6]. Many static turn models are available, such as odd-even [6], negative-first [13] and 4P-first [14].
Node (12, 3) is an ER node because its unique previous-hop node is a prohibited previous-hop node. Nodes (12, 4), (12, 5), (12, 6), (12, 7), (12, 8) and (12, 9) are all ER nodes because their two previous-hop nodes are a prohibited previous-hop node and an ER node, respectively. Because both of its previous-hop nodes are faulty, node (11, 7) is both an unreachable node and an ER node. Node (8, 4) is also an ER node because each of its two previous-hop nodes is a prohibited previous-hop node or a faulty node. Because each of their next-hop nodes is a prohibited next-hop node or a faulty node, nodes (6, 4), (10, 6) and (11, 5) are all ES nodes. Because each of their next-hop nodes is an ES node or a prohibited next-hop node, nodes (10, 5), (11, 4), (10, 4), (11, 3) and (10, 3) become ES nodes in turn.
[Fig. 3]
Given an intermediate node on a fault-tolerant minimal path from (Xs, Ys) to (Xd, Yd), its R-APC is defined as the total number of fault-tolerant allowed minimal paths from this intermediate node to (Xd, Yd), as shown in Fig. 4. The basic idea of APC-based fault-tolerant minimal routing algorithms is as follows. If at least one fault-tolerant allowed minimal path exists, (Xs, Ys) is set as the first current node. The following step is then repeated until (Xd, Yd) is reached: find one of the current node's allowed next-hop nodes whose F-APC and R-APC values are both greater than 0, and set it as the new current node. The sequence of current nodes makes up a fault-tolerant allowed minimal path.
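The counting idea behind APC-based routing can be illustrated with a simplified Python sketch. It is not the algorithm of [1]: it assumes the destination lies to the northeast of the source (so minimal paths use only east and north hops), it considers faulty nodes only and ignores the turn-model prohibitions, and it takes F-APC to be the forward counterpart of R-APC (the number of fault-free minimal paths from the source to a node). The function names r_apc, f_apc and route are hypothetical.

```python
def r_apc(src, dst, faulty):
    """Count fault-free minimal paths from every RSD node to dst (east/north hops only)."""
    (xs, ys), (xd, yd) = src, dst
    count = {}
    for x in range(xd, xs - 1, -1):          # sweep from the destination back to the source
        for y in range(yd, ys - 1, -1):
            if (x, y) in faulty:
                count[(x, y)] = 0
            elif (x, y) == dst:
                count[(x, y)] = 1
            else:
                count[(x, y)] = count.get((x + 1, y), 0) + count.get((x, y + 1), 0)
    return count


def f_apc(src, dst, faulty):
    """Count fault-free minimal paths from src to every RSD node (assumed forward counter)."""
    (xs, ys), (xd, yd) = src, dst
    count = {}
    for x in range(xs, xd + 1):              # sweep from the source towards the destination
        for y in range(ys, yd + 1):
            if (x, y) in faulty:
                count[(x, y)] = 0
            elif (x, y) == src:
                count[(x, y)] = 1
            else:
                count[(x, y)] = count.get((x - 1, y), 0) + count.get((x, y - 1), 0)
    return count


def route(src, dst, faulty):
    """Follow next hops whose F-APC and R-APC are both positive; return None if no path exists."""
    fwd, rev = f_apc(src, dst, faulty), r_apc(src, dst, faulty)
    if rev.get(src, 0) == 0:
        return None                          # no fault-free minimal path in this RSD
    path, cur = [src], src
    while cur != dst:
        x, y = cur
        for nxt in ((x + 1, y), (x, y + 1)):
            if fwd.get(nxt, 0) > 0 and rev.get(nxt, 0) > 0:
                cur = nxt
                break
        path.append(cur)
    return path


print(route((0, 0), (2, 2), {(1, 0), (1, 1)}))
# [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
```

Both counters are computed with a single sweep over the RSD, so the cost of this sketch grows with the size of the RSD rather than with the size of the whole mesh.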

Fault tolerant routing based on region
Region-based fault-tolerant routing methods create rectangular Faulty Blocks (FBs), which include both faulty and non-faulty nodes. A detour path is defined around each FB so that packets avoid it during routing, as shown in Fig. 5. FBs are created by applying the following rules: first, if a non-faulty node has two or more faulty or unused neighbor nodes, the node is changed to an unused node; second, this rule is applied repeatedly until no new unused node is generated. Following these rules, many rectangular FBs are created in the 2D mesh network. Three types of detour paths are defined for FBs according to their locations: Fault Ring (FR), Fault Chain (FC), and South Chain (SC). FR and FC are the detour paths for FBs inside the network and on the west side, and SC is the detour path for FBs on the south side of the network. Because the routing rules in region-based fault-tolerant routing methods are strictly defined according to the type of detour path and the address of the destination node, packet delivery is deadlock free.
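The two FB creation rules amount to a fixed-point iteration, which the following Python sketch illustrates. It follows the rule exactly as worded above (two or more faulty or unused neighbors); the function name form_faulty_blocks and the (x, y) coordinate convention are assumptions made for illustration.

```python
def form_faulty_blocks(cols, rows, faulty):
    """Return the set of non-faulty nodes that become unused when faulty blocks are formed."""
    unused = set()
    changed = True
    while changed:                            # rule 2: repeat until no new unused node appears
        changed = False
        for x in range(cols):
            for y in range(rows):
                node = (x, y)
                if node in faulty or node in unused:
                    continue
                neighbors = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
                bad = sum(1 for n in neighbors if n in faulty or n in unused)
                if bad >= 2:                  # rule 1: two or more faulty/unused neighbors
                    unused.add(node)
                    changed = True
    return unused


print(sorted(form_faulty_blocks(4, 4, {(1, 1), (2, 2)})))
# [(1, 2), (2, 1)]
```

In this example two diagonal faults at (1, 1) and (2, 2) grow into a 2x2 rectangular block, which shows how healthy nodes are sacrificed to keep the block rectangular.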
A deadlock-free fault-tolerant routing algorithm that works with small faulty blocks and simple routing control is proposed in [2]; it is a modification of Message-Route [7,8]. The method introduces a new "router node" function to form minimal rectangular faulty blocks, which significantly reduces the number of nodes to be deactivated. Node deactivation is defined as follows: "A node is deactivated by itself if there is at least one faulty or deactivated neighbor node in both its row and its column." After node deactivation is complete, fault rings are constructed around each faulty block, as in Message-Route. Note that because up to four fault rings may overlap on a single healthy node under the new node deactivation method, Message-Route cannot provide deadlock-free routing control for it. To provide deadlock-free routing control, the proposed algorithm therefore introduces a key node function called the router node, in which a node deactivates its own processing element and works only as a router. The router node is defined as follows: "A node becomes a router node if the node is on both the east border and the west border of two f-strings, and the reference node of the eastward f-string is to the north of the westward f-string".
In the proposed algorithm, to enhance node availability, deactivated nodes can be reactivated as unsafe nodes by healthy neighbor nodes. The proposed algorithm is called Position-Route. Notably, Position-Route does not require complex message and ring information propagation functions. To enable simple and efficient ring selection, Position-Route uses simple ring information units (RIUs) and a related new ring management technique.

Fault tolerant routing based on turn model
The basic idea behind the turn model is to prohibit a minimum number of turns, so that no cyclic dependency can form, and hence to increase routing adaptivity. Generally, only one turn is prohibited in each cycle [15].
A novel low-overhead neighbor-aware, turn model-based fault-tolerant routing scheme (NARCO) for NoCs is proposed in [3]. It incorporates threshold-based replication in the network interfaces, a parameterizable region-based neighbor awareness in all routers, and the odd-even and inverted odd-even turn models. Threshold-based replication replicates packets to bypass permanent faults on routes while balancing the energy dissipation this requires, and the parameterizable neighbor awareness in routers provides a region-based awareness of faults that can improve routing decisions. Finally, a combination of odd-even (OE) and inverted odd-even (IOE) turn models ensures deadlock-free and balanced packet delivery on routes that can pass around faults on the communication links.
The OE turn model restricts the locations at which certain turns can be taken, so that a circular wait cannot occur. In OE turn model-based routing, the columns of a 2-D mesh are alternately designated as odd (O) and even (E), as shown in Fig. 6, which depicts a 5x5 2-D mesh. Fig. 6 shows the restricted turns of the OE turn model with solid arrows and those of the IOE turn model with dashed arrows. The following two rules ensure deadlock-free routing in the OE turn model: 1) a packet is not allowed to take an ES or EN turn at any node located in an even column; and 2) a packet is not allowed to take an SW or NW turn at any node located in an odd column.
As a general rule, system reliability improves as the level of redundancy increases. However, redundancy also affects other design objectives such as energy and performance. In practice, it is therefore important to limit redundancy to achieve a reasonable tradeoff between reliability, energy, and performance. NARCO transmits only one redundant packet for each transmitted packet, and only if the fault rate is above a replication threshold σ. The original packet is sent using the OE turn model, while the redundant packet is propagated using the IOE turn model. Two virtual channels (VCs), one for OE packets and the other for IOE packets, ensure deadlock freedom. If the fault rate is below the threshold, replication is not used, which saves energy. The proposed routing algorithm gives priority to minimal paths that have a higher chance of reaching the destination. The routers detect which of their adjacent (neighbor) links or nodes are faulty based on control signals.
The routing decision checks each candidate output direction as follows. First, if the link attached to the output port in that direction is faulty, the direction is invalid. Next, the restricted-turn rules of the turn model are checked based on the router location (in an odd or even column), the input port of the packet, and the output port direction; if the packet is attempting a forbidden turn, the direction is invalid. If the direction does not violate the basic OE routing rules and has no adjacent faults, the router then checks whether it would lead to a turn-rule violation downstream, based on the location of the destination. Finally, the router checks for a back turn, which is not allowed. If all these checks pass, the direction is valid for packet transfer.
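The per-direction checks can be sketched in Python as follows. This is a hedged illustration, not NARCO's implementation: the names turn_allowed and direction_valid are hypothetical, a turn is written as the packet's current heading followed by the new direction (so "EN" is an east-to-north turn), the IOE model is assumed to be the OE model with the roles of odd and even columns swapped, and the downstream turn-violation check based on the destination is omitted.

```python
OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def turn_allowed(column, heading, out_dir, inverted=False):
    """Apply the two OE rules; `inverted=True` swaps odd/even columns (assumed IOE behaviour)."""
    even = (column % 2 == 0)
    if inverted:
        even = not even
    turn = heading + out_dir
    if even and turn in ("EN", "ES"):        # rule 1: no EN or ES turn in an even column
        return False
    if not even and turn in ("NW", "SW"):    # rule 2: no NW or SW turn in an odd column
        return False
    return True


def direction_valid(node, heading, out_dir, faulty_links, inverted=False):
    """Check one candidate output direction; `heading` is None for a packet at its source."""
    x, y = node
    dx, dy = STEP[out_dir]
    neighbor = (x + dx, y + dy)
    if (node, neighbor) in faulty_links:     # the adjacent link in this direction is faulty
        return False
    if heading is not None:
        if out_dir == OPPOSITE[heading]:     # back turns are never allowed
            return False
        if not turn_allowed(x, heading, out_dir, inverted):
            return False
    return True


print(direction_valid((2, 2), "E", "N", set()))                  # False: EN turn in an even column
print(direction_valid((2, 2), "E", "N", set(), inverted=True))   # True under the assumed IOE rules
```

Because the OE and the assumed IOE rules forbid complementary turns at a given column, a packet and its replica can take different routes around the same fault.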

Fault tolerant routing based on XY routing
Unlike almost all conventional methods, in which packets always detour around faulty nodes [9,10,11], the method proposed in [4] allows packets to pass through faulty nodes with the help of additional hardware. Fig. 5 shows the architecture of the proposed method. Four electrical switches and links are added around each node. Each switch has three states, as shown in Fig. 7. The switch states can be determined easily once the node has been tested and judged faulty or not. In other words, the switch states are set according to the fault flag of the node, so that packets can pass through a faulty node both vertically and horizontally without being sent to the node itself.
Consider the case where a faulty node is on the south boundary of the network. In this case, packets cannot detour around the faulty node through the south side. The algorithm handles this case with two definitions. First, a faulty node on the south boundary of the mesh network is defined as a South Faulty (SF) node. If a faulty node (i, j) has any SF node (i', j') among its 8 neighbor nodes, where i − 1 ≤ i' ≤ i + 1 and j − 1 ≤ j' ≤ j + 1, the node is also changed to an SF node. Second, the SF area is defined as the area consisting of all nodes (i, j) that have the same y-coordinate as the northmost SF node (i', j'), where i = i'. All faulty nodes in the SF area are changed to SF nodes.
The proposed routing method always allows packets to pass through faulty and SF nodes when moving northward or southward. When moving eastward or westward, packets basically detour around faulty and SF nodes; however, if the y-coordinates of the destination and source nodes are equal, the packet can pass through faulty and SF nodes. Fig. 8 shows routing examples for the proposed method.
II. Node Availability: Node availability is the occupation rate of healthy nodes, where healthy nodes are the nodes that participate in the routing process.
[Table 1 fragment: Not considered | 0 to 200 | Not considered | 0 to 1 | 10x10, 20x20]

IV. ANALYSIS
III. Average Latency: Latency is defined as the total number of cycles a packet takes to reach the destination node from the source node. Average latency is used as the timing parameter.
IV. Packet Injection Rate: The packet injection rate is the number of packets injected into the system per unit of time.
V. Packet Arrival Rate: The packet arrival rate is the number of packets that arrive in the system per unit of time.
VI. Network Size: Network size refers to the total size of the system, which is proportional to the number of nodes in the NoC. For the mesh topology, the network size is determined by the number of rows and columns of the mesh.
VII. Time Complexity: Time complexity is a concept in computer science that quantifies the amount of time taken by a piece of code or an algorithm to run, as a function of the size of its input [22]. In a NoC, it quantifies the time taken by the system to transmit packets from source to destination.
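The latency and rate parameters can be made concrete with a small Python sketch. The record format (an injection cycle and an arrival cycle per packet, with None for packets that never arrive) and the function names are assumptions made for illustration; the surveyed papers may measure these quantities differently, for example per node or per flit.

```python
def average_latency(packets):
    """Mean number of cycles from injection to arrival over the delivered packets."""
    delivered = [(inj, arr) for inj, arr in packets if arr is not None]
    if not delivered:
        return None                           # nothing was delivered, latency is undefined
    return sum(arr - inj for inj, arr in delivered) / len(delivered)


def injection_rate(packets, total_cycles):
    """Packets injected into the system per cycle."""
    return len(packets) / total_cycles


def arrival_rate(packets, total_cycles):
    """Packets that arrived per cycle."""
    return sum(1 for _, arr in packets if arr is not None) / total_cycles


# Each record is (injection cycle, arrival cycle); None marks a packet that never arrived.
packets = [(0, 10), (5, 25), (8, None)]
print(average_latency(packets))               # 15.0
```

Dividing the arrival rate by the injection rate gives the fraction of packets delivered, which corresponds to a successful packet arrival rate.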
Table 1 compares the fault-tolerant methods based on these parameters. The RSD-based method is implemented on an 8x8 mesh, and its time complexity is proportional to the size of the RSD. The packet injection rate normally ranges from 0 to 1. The introduction of the APC method gives RSD-based fault-tolerant routing its universality, so no other parameters are used to evaluate this method.
[Table 2 fragment: Additional hardware | Algorithm is not working where a faulty node is present on the south boundary of the network.]
In the region-based method, the failure rate is considered the important parameter. Average latency and packet injection rate are plotted under different fault rates to analyze the delay and throughput of packets and flits. Compared with Message-Route, Position-Route achieves lower latency while keeping almost the same throughput. Node availability is also significantly improved in Position-Route: under the random fault model it has fewer deactivated nodes and more unsafe nodes than Message-Route, because it does not create large faulty blocks as Message-Route does. The average delay is also compared on 15x15 and 20x20 2D NoCs, where Position-Route achieves much lower latency than Message-Route. Position-Route is therefore more scalable.
In the turn model-based method, the successful packet arrival rate of various fault-tolerant routing schemes is compared under different fault rates and different traffic patterns, such as uniform random, hotspot and transpose. Because of its higher neighbor awareness, NARCO has a much higher successful packet arrival rate than the other schemes in high-fault-rate environments. The implementation uses a 9x9 mesh.
In the XY-based method, average latency and packet injection rate are plotted under different failure rates. The average latency of this method and that of its predecessor are almost the same when the packet generation rate is relatively low, but the difference becomes significant when it is high. The latency is reduced under higher fault rates.
Each of these methods has advantages and disadvantages, shown in Table 2. All of them provide deadlock freedom. The unique benefit of the RSD-based method is that it provides universality for designing fault-tolerant minimal routing algorithms. The allowed-path-counter method does not use any fault block model, so no available nodes are sacrificed to fault blocks. Fault-tolerant minimal paths are found with low time complexity, and the F-APC and R-APC values determine whether any fault-tolerant minimal path exists. If the source and destination lie in the same row or column, this region-based method does not work, and scalability is also a problem.
The unique benefit of the region-based method is that it minimizes the chance of faulty nodes participating in the routing process by creating detour paths. In Position-Route, node availability is significantly improved: it has fewer deactivated nodes and more unsafe nodes, which indicates that its node deactivation method does not create large faulty blocks. However, creating FBs decreases the utilization of non-faulty nodes, and the detour paths lead to high communication latency. Another disadvantage is the underutilization of the unused nodes included in FBs.
In the turn model-based method, replication is based on a threshold that balances energy dissipation, and the neighbor awareness in routers is parameterizable, which ensures a region-based awareness of faults that can improve routing decisions. The extremely low energy consumption of the NARCO configurations at low fault rates is a strong motivation for the replication threshold parameter. There are two input FIFO buffers, each dedicated to a VC, with one VC for OE-routed packets and the other for IOE-routed packets, so the additional hardware is a disadvantage.
In the XY routing-based method, additional switches and links allow packets to pass through a faulty node both vertically and horizontally without being sent to the faulty node. For the clockwise direction, SW and NE turns never overlap: in a non-SF area the SW turn occurs but the NE turn never occurs, because the NE turn occurs only in an SF area. Consequently, circular waits never occur in this case, and the method is deadlock free. The disadvantage of this method is that the routing algorithm assumes there is a faulty node on the south boundary of the network; if there is none, the algorithm does not work.

V. CONCLUSION
This paper introduced different fault-tolerant routing methods for the mesh topology: RSD-based, region-based, turn model-based and XY routing-based methods. Each has its own unique features. The evaluation parameters of each method were discussed. Apart from the RSD-based method, all methods treat the fault rate as an important parameter, and different mesh sizes are considered for each method. The advantages and disadvantages of each method were also discussed. The occurrence of faults is a critical factor that affects routing. All of these methods tolerate faults in their own way, thereby minimizing the chances of deadlock and livelock.