Packet corruption loss is a serious problem in datacenter networks. A large-scale study by Microsoft reported that the number of packets lost due to corruption is comparable to the number lost due to congestion. Previous attempts to mitigate the impact of packet corruption loss avoid faulty links by routing around them, at the cost of reduced link capacities and disruption to the rest of the network.
In this paper, we investigate the feasibility and tradeoffs of the classical loss recovery strategy of link-local retransmission in the context of datacenter networks. We present the design and implementation of LinkGuardian, a dataplane-based protocol that detects packets lost due to corruption and simply retransmits them out of order. Our preliminary results show that a naïve out-of-order retransmission strategy is effective in mitigating the impact of packet corruption loss for both throughput-sensitive and latency-sensitive flows. Our long-term goal is to extend LinkGuardian so that end hosts can be made completely oblivious to packet corruption losses in the network.
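
To make the idea of link-local, out-of-order retransmission concrete, the following is a minimal sketch in Python. It is not LinkGuardian's actual dataplane implementation. The class names, the per-link sequence numbers, the gap-based loss detection, and the buffer size are all assumptions chosen for illustration. The sketch also assumes that corrupted frames are silently dropped at the receiving end of the link, that retransmitted frames are never corrupted themselves, and that a loss is only noticed once a later frame arrives.

```python
from collections import OrderedDict


class LinkSender:
    """Hypothetical sending side of one link: numbers frames and keeps copies."""

    def __init__(self, buffer_size=64):
        self.next_seq = 0
        self.buffer = OrderedDict()  # seq -> payload, held for retransmission
        self.buffer_size = buffer_size  # assumed bound on buffered copies

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.buffer[seq] = payload
        if len(self.buffer) > self.buffer_size:
            self.buffer.popitem(last=False)  # evict the oldest copy
        return (seq, payload)

    def retransmit(self, missing):
        # Resend only the frames that are still buffered.
        return [(seq, self.buffer[seq]) for seq in missing if seq in self.buffer]


class LinkReceiver:
    """Hypothetical receiving side: detects gaps, forwards frames as they arrive."""

    def __init__(self):
        self.expected = 0
        self.delivered = []  # payloads in the order they were forwarded

    def receive(self, frame):
        seq, payload = frame
        # A jump past the expected number means the frames in between were lost.
        missing = list(range(self.expected, seq))
        self.delivered.append(payload)  # no reordering buffer: forward at once
        if seq >= self.expected:
            self.expected = seq + 1
        return missing


def simulate(payloads, corrupted_seqs):
    """Send payloads over one link, dropping frames whose seq is corrupted."""
    sender, receiver = LinkSender(), LinkReceiver()
    for payload in payloads:
        frame = sender.send(payload)
        if frame[0] in corrupted_seqs:
            continue  # corrupted frame is dropped on the link
        missing = receiver.receive(frame)
        for retx in sender.retransmit(missing):
            receiver.receive(retx)  # recovered frame arrives out of order
    return receiver.delivered


if __name__ == "__main__":
    print(simulate(["a", "b", "c", "d"], {1}))
```

In this toy model the lost frame is recovered on the link where it was lost, but it is forwarded after its successor, so the next hop sees the packets out of order. A loss of the final frame goes undetected here because no later frame reveals the gap; a real protocol would need an additional mechanism for that case.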