The Inherent Queuing Delay of Parallel Packet Switches

The *parallel packet switch (PPS)* extends the *inverse multiplexing* architecture and is widely used as the core of contemporary commercial switches. This paper investigates the inherent queuing delay introduced by the PPS's demultiplexing algorithm, which dispatches cells to the middle-stage switches, relative to an optimal work-conserving switch.

We first consider an *N×N* PPS without buffers in its input-ports, operating at external rate *R*, internal rate *r<R*, and speedup (or over-capacity) *S*. We show that the inherent queuing delay of a symmetric and fault-tolerant PPS, in which every demultiplexor may dispatch cells to all middle-stage switches, is *Ω(NR⁄r)* if no information is shared between the input-ports. Sharing information between the input-ports significantly reduces this lower bound, even if the information is outdated.

These lower bounds indicate that employing algorithms using slightly out-of-date information may greatly improve the PPS performance.

When the PPS has buffers in its input-ports, an *Ω(N⁄S)* lower bound holds if the demultiplexing algorithm uses only local information, or if the input buffers are small relative to the time an input-port needs to learn the global state of the switch.

Joint work with Hagit Attiya. [pdf]

Randomization does not Reduce the Queuing Delay of Parallel Packet Switches

Switching cells in parallel is a common approach to building switches with very high external line rates and a large number of ports. A prime example is the *parallel packet switch* (PPS), in which a demultiplexing algorithm sends cells, arriving at rate *R* on *N* input-ports, through one of *K* intermediate slower switches operating at rate *r<R*. To utilize the parallelism of the PPS, a key issue is to balance the load among the planes; since *randomization* is known as a successful paradigm for load-balancing problems, it is tempting to design randomized demultiplexing algorithms that balance the load on average.

This paper presents lower bounds on the average queuing delay introduced by the PPS, relative to an optimal work-conserving first-come first-served (FCFS) switch, for *randomized* demultiplexing algorithms that do not have full and immediate information about the switch status. These lower bounds are shown to be asymptotically optimal through a methodology for analyzing the *maximal* relative queuing delay by measuring the imbalance between the middle-stage switches; clearly, this also bounds the average relative queuing delay from above.

The methodology is used to devise new algorithms that rely on slightly outdated global information about the switch status. It is also used to provide, for the first time, a complete proof of the maximum relative queuing delay incurred by the *fractional traffic dispatch* algorithm [S. Iyer and N. McKeown, in *Proceedings of IEEE INFOCOM*, IEEE Communications Society, New York, NY, 2001, pp. 1680–1687; D. Khotimsky and S. Krishnan, in *Proceedings of the IEEE International Conference on Communications*, IEEE Communications Society, New York, NY, 2001, pp. 100–106]. These optimal algorithms are deterministic, proving that randomization does not reduce the relative queuing delay of the PPS.

Joint work with Hagit Attiya. [pdf]

Jitter Regulations for Multiple Streams

For widely-used interactive communication, it is essential that traffic is kept as smooth as possible; the smoothness of the traffic is typically captured by its *delay jitter*, i.e., the difference between the maximal and minimal end-to-end delays. The task of minimizing the jitter is done by jitter regulators that use a limited-size buffer in order to shape the traffic. In many real-life situations regulators must handle multiple streams simultaneously and provide low jitter on each of them separately. Moreover, communication links have limited capacity, and these may pose further restrictions on the choices made by the regulator. This paper investigates the problem of minimizing jitter in such an environment, using a fixed-size buffer.

We show that the offline version of the problem can be solved in polynomial time, by introducing an efficient offline algorithm that finds a release schedule with optimal jitter. When regulating *M* streams in the online setting, we take a *competitive analysis* point of view and note that, in the uncapacitated case, previous results in [Mansour and Patt-Shamir, 2001] can be extended to an online algorithm that uses a buffer of size *2MB* and obtains the optimal jitter possible with a buffer of size *B* (and an offline algorithm). The question arises whether such resource augmentation is essential. We answer this question in the affirmative, by proving a lower bound that is tight up to a factor of 2, thus showing that jitter regulation does not scale well as the number of streams increases, unless the buffer is sized up proportionally.
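To make the delay-jitter notion concrete, here is a minimal Python sketch (our illustration, not the paper's algorithm; all names are ours). Packets sent at regular intervals arrive with variable network delay; a regulator that buffers each packet up to a target delay equalizes the per-packet delays and thus removes the jitter.

```python
def jitter(send_times, release_times):
    # delay jitter: difference between the maximal and minimal
    # end-to-end delay experienced by the packets
    delays = [r - s for s, r in zip(send_times, release_times)]
    return max(delays) - min(delays)

def regulate(send_times, arrivals, target_delay):
    # toy regulator: hold each packet in the buffer and release it
    # `target_delay` after it was sent; a packet that arrives later
    # than that is released immediately on arrival
    return [max(a, s + target_delay) for s, a in zip(send_times, arrivals)]

sends = [0, 10, 20, 30]          # packets sent every 10 time units
arrivals = [2, 19, 24, 33]       # network delays vary between 2 and 9
# unregulated jitter is 9 - 2 = 7; buffering to a uniform delay of 9
# (the worst network delay) brings the jitter down to 0
```

The buffer size needed by such a regulator grows with the target delay, which is exactly the buffer/jitter trade-off the paper quantifies for multiple streams.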

Joint work with Gabriel Scalosub. [pdf]

Packet-Mode Emulation of Output-Queued Switches

Most common network protocols transmit variable size *packets*, whereas contemporary switches still operate with fixed size *cells*, which are easier to transmit and buffer. This necessitates packet segmentation and reassembly modules, resulting in significant computation and communication overhead that might be too costly as switches become faster and bigger. It is therefore imperative to investigate an alternative mode of scheduling, in which packets are scheduled contiguously over the switch fabric.

This paper investigates the cost of *packet-mode scheduling* for the *combined input output queued (CIOQ)* switch architecture. We devise frame-based schedulers that allow a packet-mode CIOQ switch with small speedup to *mimic* an ideal output-queued switch, with bounded relative queuing delay. The schedulers are pipelined and are based on matrix decomposition.

Our schedulers demonstrate a trade-off between the switch speedup and the relative queuing delay incurred while mimicking an output-queued switch. When the switch is allowed to incur high relative queuing delay, a speedup arbitrarily close to 2 suffices to mimic an ideal output-queued switch. This implies that packet-mode scheduling does not require higher speedup than a cell-based scheduler. The relative queuing delay can be significantly reduced with just a doubling of the speedup. We further show that it is impossible to achieve zero relative queuing delay (that is, a perfect emulation), regardless of the switch speedup.

In addition, significantly simpler algorithms can mimic an output-queued switch with a bounded buffer size, using speedup arbitrarily close to 1.

Simulations confirm that packet-mode emulation with reasonable relative queuing delay can be achieved with moderate speedup. Furthermore, a simple and practical heuristic is shown by simulations to also provide effective packet-mode emulation.

Joint work with Hagit Attiya and Isaac Keslassy. [pdf (conference version)]

PEDS: Parallel Error Detection Scheme for TCAM Devices

Ternary content-addressable memory (TCAM) devices are increasingly used for performing high-speed packet classification. A TCAM consists of an associative memory that compares a search key in parallel against all entries. TCAMs may suffer from error events that cause ternary cells to change their value to any symbol in the ternary alphabet {“0”,“1”,“*”}. Due to their parallel access feature, standard error detection schemes are not directly applicable to TCAMs; an additional difficulty is posed by the special semantics of the “*” symbol.

This paper introduces PEDS, a novel parallel error detection scheme that locates the erroneous entries in a TCAM device. PEDS is based on applying an error-detection code to each TCAM entry and utilizing the parallel capabilities of the TCAM by simultaneously checking the correctness of multiple TCAM entries. A key feature of PEDS is that the number of TCAM lookup operations required to locate all errors depends on the number of symbols per entry rather than on the (orders-of-magnitude larger) number of TCAM entries. For large TCAM devices, a specific instance of PEDS requires only 200 lookups for 100-symbol entries, while a naive approach may need hundreds of thousands of lookups. PEDS allows flexible and dynamic selection of trade-off points between robustness, space complexity, and number of lookups.

Joint work with Anat Bremler-Barr, Danny Hendler, and Ron M. Roth. [pdf]

The Crosspoint-Queued Switch

This paper calls for rethinking packet-switch architectures, by cutting all dependencies between the switch fabric and the linecards. Most single-stage packet-switch architectures rely on an instantaneous communication between the switch fabric and the linecards. Today, however, this assumption is breaking down, because effective propagation times are too high and keep increasing with the line rates.

In this paper, we argue for a self-sufficient switch fabric, by moving all the buffering from the linecards to the switch fabric. We introduce the *crosspoint-queued (CQ)* switch, a new buffered-crossbar switch architecture with large crosspoint buffers and no input queues. We study the performance of the switch and show that it behaves nearly like an ideal output-queued switch as the buffer size increases. Further, with a small crosspoint buffer size, we provide a closed-form throughput formula for all work-conserving schedules under uniform Bernoulli i.i.d. arrivals. Finally, we show how the CQ switch can be practically implemented in a single SRAM-based chip.
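The CQ switch's behavior is easy to explore by simulation. The following Python sketch (our illustration; parameters and scheduling policy are our assumptions, not the paper's exact model) simulates an N×N crossbar with finite crosspoint buffers, uniform Bernoulli i.i.d. arrivals, and a work-conserving longest-queue-first service at each output:

```python
import random

def cq_throughput(n=8, buf=4, load=0.9, ticks=20000, seed=1):
    # q[i][j] = occupancy of the crosspoint buffer between input i
    # and output j (bounded by `buf`; no input queues at all)
    rng = random.Random(seed)
    q = [[0] * n for _ in range(n)]
    served = 0
    for _ in range(ticks):
        # Bernoulli i.i.d. arrivals, uniform over outputs; a cell is
        # dropped if its crosspoint buffer is full
        for i in range(n):
            if rng.random() < load:
                j = rng.randrange(n)
                if q[i][j] < buf:
                    q[i][j] += 1
        # each output serves one cell per slot from a non-empty
        # crosspoint in its column (here: the longest queue)
        for j in range(n):
            i = max(range(n), key=lambda i: q[i][j])
            if q[i][j] > 0:
                q[i][j] -= 1
                served += 1
    return served / ticks / n   # cells per output per time slot
```

Increasing `buf` pushes the measured throughput toward the offered load, mirroring the paper's observation that the CQ switch approaches output-queued behavior as crosspoint buffers grow.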

Joint work with Yossi Kanizo and Isaac Keslassy. [pdf]

Optimal Fast Hashing

This paper is about designing *optimal high-throughput hashing schemes* that minimize the total number of memory accesses needed to build and access a hash table. Recent schemes often promote the use of multiple-choice hashing. However, such a choice also implies a significant increase in the number of memory accesses to the hash table, which translates into higher power consumption and lower throughput.

In this paper, we propose to only use choice when needed. Given some target hash-table overflow rate, we provide a lower bound on the total number of needed memory accesses. Then, we design and analyze schemes that provably achieve this lower bound over a large range of target overflow values. Further, for the Multi-level Hash Table (MHT) scheme, we prove that the optimum occurs when its table sizes decrease in a geometric way, thus formally confirming a heuristic rule-of-thumb.
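A minimal Python sketch of a Multi-level Hash Table with geometrically decreasing subtable sizes (our illustration; function names, the ratio-1/2 sizing, and the hash construction are our assumptions): each item is placed in the first level whose hashed bucket is still empty, and items that find no room at any level overflow.

```python
import random

def mht_insert_all(items, sizes, seed=0):
    # one hash function (salt) per subtable; an item goes to the
    # first subtable whose hashed bucket is still empty
    rng = random.Random(seed)
    salts = [rng.random() for _ in sizes]
    tables = [[None] * s for s in sizes]
    overflow = []
    for x in items:
        for tab, salt in zip(tables, salts):
            b = hash((salt, x)) % len(tab)
            if tab[b] is None:
                tab[b] = x
                break
        else:
            overflow.append(x)   # no level had room for this item
    return tables, overflow

# geometric subtable sizes (ratio 1/2), as in the rule-of-thumb the
# paper proves optimal; 700 items into 960 slots overflow only rarely
tables, overflow = mht_insert_all(range(700), [512, 256, 128, 64])
```

Each successful insertion probes one bucket per level until it finds room, so deeper levels cost more accesses; sizing the levels geometrically keeps both the overflow rate and the expected number of accesses low.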

Joint work with Yossi Kanizo and Isaac Keslassy. [pdf]

Layered Interval Codes for TCAM-Based Classification

Ternary content-addressable memories (TCAMs) are increasingly used for high-speed packet classification. TCAMs compare packet headers against all rules in a classification database in parallel and thus provide high throughput.

TCAMs are not well-suited, however, for representing rules that contain range fields, and prior-art algorithms typically represent each such rule by multiple TCAM entries. The resulting *range expansion* can dramatically reduce TCAM utilization because it introduces a large number of redundant TCAM entries. This redundancy can be mitigated by making use of extra bits available in each TCAM entry.

We present a novel scheme for constructing efficient representations of range rules, based on the simple observation that sets of disjoint ranges may be encoded much more efficiently than sets of overlapping ranges. Since the ranges in real-world classification databases are, in general, non-disjoint, our technique splits the ranges between multiple *layers* each of which consists of mutually disjoint ranges. Each layer is then coded independently and assigned its own set of extra bits.
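The layering step can be sketched with a greedy interval-graph coloring (a minimal illustration of the idea, not the paper's approximation algorithms; all names are ours): sweep the ranges by left endpoint and put each one into the first layer it does not overlap, opening a new layer only when necessary.

```python
def layer_ranges(ranges):
    # split possibly-overlapping [lo, hi] ranges into layers of
    # mutually disjoint ranges; greedy coloring in left-endpoint
    # order reuses the lowest-numbered compatible layer
    layers = []   # layers[k] = list of ranges assigned to layer k
    ends = []     # ends[k] = right end of the last range in layer k
    for lo, hi in sorted(ranges):
        for k, e in enumerate(ends):
            if lo > e:               # disjoint from all of layer k
                layers[k].append((lo, hi))
                ends[k] = hi
                break
        else:
            layers.append([(lo, hi)])   # open a new layer
            ends.append(hi)
    return layers

# four overlapping ranges fit into two disjoint layers
layers = layer_ranges([(0, 10), (5, 15), (12, 20), (16, 30)])
```

Each resulting layer contains only disjoint ranges, so it can be encoded compactly and assigned its own set of extra bits, as described above.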

Our layering algorithms are based on approximations for specific variants of interval-graph coloring. We evaluate these algorithms by performing an extensive comparative analysis on real-life classification databases. Our analysis establishes that our algorithms reduce the number of redundant TCAM entries caused by range rules by more than 60%, compared with the best range-encoding prior art.

Joint work with Anat Bremler-Barr, Danny Hendler, and Boris Farber. [pdf]

Optimal Routing and Scheduling for Deterministic Delay Tolerant Networks

Delay Tolerant Networks (DTNs), in which contacts between nodes come and go over time, are a promising approach to modeling communications in mobile ad-hoc networks, where network partitioning and node disconnections are likely to happen. One of the most important challenges in such networks is how to route information and schedule transmissions while coping with the continuously changing network topology.

In this paper, we focus on a deterministic and centralized DTN in which the contact times between nodes are known in advance or can be predicted; this model is applicable for various real-life scenarios.

We provide a general framework for devising optimal routing algorithms for such networks under different objective functions and various real-life constraints (such as the available buffer and energy). The key insight is to model the DTN as an equivalent time-independent graph; this allows the usage of well-known algorithms and techniques to achieve optimal results. These algorithms can also be used as approximations for less certain settings or as benchmarks to evaluate other routing algorithms. In addition, we extend our framework to deal with long-lived DTNs in which contacts are periodic.
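The time-expanded idea can be sketched in a few lines of Python (our simplification with instantaneous contacts and no buffer/energy constraints; all names are ours): sweeping the known contacts in time order computes the earliest time the data can reach every node, i.e., a shortest path in the equivalent time-independent graph.

```python
def earliest_delivery(contacts, src, dst, t0=0):
    # contacts: list of (time, u, v) meaning u can forward a copy to v
    # at that instant; nodes store data for free between contacts.
    # Processing contacts in time order yields the earliest arrival
    # time at each node, as in a time-expanded shortest-path search.
    best = {src: t0}                 # earliest time the data is at a node
    for t, u, v in sorted(contacts):
        if u in best and best[u] <= t and t < best.get(v, float("inf")):
            best[v] = t
    return best.get(dst)             # None if dst is never reached

# toy example: the relay path s -> a -> d opens before the direct
# contact s -> d, so relaying delivers earlier
contacts = [(1, "s", "a"), (2, "a", "d"), (3, "s", "d")]
```

Under this deterministic contact model, the same graph supports other objectives (maximum throughput, minimum energy) via standard flow and shortest-path algorithms, which is the spirit of the framework above.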

Our algorithms are demonstrated by simulations, based directly on real-life traces, showing capacity-delay tradeoffs and the influence of the constraints and periodicity on the achievable throughput of the network.

Joint work with Paolo Giaccone. [pdf]

Optimal Clock Synchronization under Energy Constraints in Wireless Ad-hoc Networks

Clock synchronization is a crucial service in many distributed systems, including wireless ad-hoc networks. This paper studies *external* clock synchronization, in which nodes should bring their clocks close to the value of some external reference time, which is provided in the system by one or more *source* clocks.

Reference broadcast synchronization (RBS) is a known approach that exploits the broadcast nature of wireless networks for a single hop. However, when networks are large in physical extent, additional mechanisms must be employed.

Using multi-hop algorithms that re-broadcast time information to short distances reduces the energy consumed for clock synchronization. The reason is that energy costs grow more than linearly with the broadcast distance. On the other hand, the quality of the clock synchronization, as measured in the closeness of the clocks, deteriorates as the number of hops increases.

This paper shows how to balance these two contradictory goals, achieving optimal clock synchronization while adhering to an energy budget at each node. In particular, a distributed algorithm is presented that uses multi-hop broadcasting over a shallow infrastructure to synchronize the clocks. The closeness of clock synchronization achieved by the algorithm is proved to be optimal for the given energy constraints.

Joint work with Hagit Attiya and Jennifer L. Welch. [pdf]

Crosstalk-Preventing Scheduling in AWG-Based Cell Switches

AWG-based optical switching fabrics are affected by coherent crosstalk, which can significantly impair system operation when the same wavelength is used simultaneously on several input ports to forward data to output ports. To permit large port counts in an N×N AWG, the scheduling of transmissions across the AWG must therefore prevent switch configurations that generate large crosstalk. We study the properties and the existence conditions of switch configurations able to control coherent crosstalk. Our results show that it is possible to keep an AWG-based switch with large port counts in the feasible operational region without significant performance degradation, provided that a proper scheduling algorithm is used.
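Under the usual cyclic-routing model of an N×N AWG, a cell from input *i* to output *j* is carried on wavelength (i+j) mod N, so crosstalk builds up when one wavelength is active on several inputs in the same time slot. The following Python sketch (our illustration, not the paper's scheduler; the cyclic wavelength assignment is the stated assumption) checks a candidate switch configuration for wavelength reuse:

```python
from collections import Counter

def max_wavelength_reuse(config, n):
    # config: set of (input, output) pairs active in one time slot;
    # in a cyclic n x n AWG the pair (i, j) uses wavelength (i+j) mod n
    used = Counter((i + j) % n for i, j in config)
    return max(used.values(), default=0)

def crosstalk_free(config, n, limit=1):
    # a slot keeps coherent crosstalk bounded if no wavelength is
    # reused more than `limit` times simultaneously
    return max_wavelength_reuse(config, n) <= limit

# the identity permutation reuses no wavelength when n is odd
# (wavelengths 2i mod 5 are all distinct), but reuses each
# wavelength twice when n = 4
```

A crosstalk-preventing scheduler, as studied above, restricts itself to configurations passing such a test, which is why the existence conditions of these configurations matter.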

Joint work with Andrea Bianco and Fabio Neri. [pdf (conference version), pdf (extended version with more results)]

CompactDFA: Generic State Machine Compression for Scalable Pattern Matching

Pattern matching algorithms lie at the core of all contemporary Intrusion Detection Systems (IDS), making it crucial to reduce their memory requirements and increase their speed. This paper focuses on the most popular class of pattern-matching algorithms, the Aho-Corasick-like algorithms, which are based on constructing and traversing a *Deterministic Finite Automaton (DFA)* representing the patterns. While this approach ensures deterministic time guarantees, modern IDSs need to deal with hundreds of patterns, thus requiring the storage of very large DFAs, which usually do not fit in fast memory. This results in a major bottleneck on the throughput of the IDS, as well as on its power consumption and cost.

We propose a novel method to compress DFAs by observing that the names of the states are meaningless. While regular DFAs store each transition between two states separately, we use this degree of freedom and encode states in such a way that all transitions to a specific state can be represented by a single prefix that defines a set of current states. Our technique applies to a large class of automata, which can be categorized by simple properties. The problem of pattern matching is then reduced to the well-studied problem of *Longest Prefix Matching (LPM)*, which can be solved either in TCAM, in commercially available IP-lookup chips, or in software. Specifically, we show that with a TCAM our scheme can reach a throughput of 10 Gbps with low power consumption.
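The reduction target, Longest Prefix Matching, is easy to illustrate in software (a minimal sketch with hypothetical state names; a TCAM resolves the same query in a single parallel lookup, which is where the 10 Gbps figure comes from):

```python
def lpm_lookup(prefix_table, key):
    # longest-prefix match: among all prefixes of `key` present in
    # the table, return the value of the longest one; this emulates
    # one TCAM lookup, where entries are ordered by prefix length
    for length in range(len(key), -1, -1):
        p = key[:length]
        if p in prefix_table:
            return prefix_table[p]
    return None

# hypothetical encoded next-state table: after CompactDFA encoding,
# all transitions into a state share one bit-string prefix
table = {"00": "s0", "01": "s1", "0110": "s2"}
```

Each processed input symbol is appended to the encoded current state and one LPM query yields the encoded next state, so the traversal cost per symbol is one lookup regardless of the number of patterns.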

Joint work with Anat Bremler-Barr and Yaron Koral. [pdf]

Optimal Resource Allocation for Disaster Recovery

Two key elements in disaster recovery are backing up data at remote sites and providing service reliability through virtualization. To support operational services and to speed up the recovery process, resources should be distributed fairly and efficiently. Thus, we discuss resource allocation algorithms to support remote data storage and live virtual machine (VM) migration. We identify two opposing forces: on the one hand, backup data should be stored as close as possible to the original site, to guarantee high access speed and to minimize network load. On the other hand, upon a site failure, VM migration should minimize the impact of resuming VMs on other sites, to protect application performance and to reduce VM restoring time. We present optimal algorithms trading off these two contrasting goals, and we compare their performance for different network topologies and resource distributions among sites.

Joint work with Andrea Bianco and Luca Giraudo. [pdf]

Maximum Bipartite Matching Size And Application to Cuckoo Hashing

Cuckoo hashing with a stash is a robust high-performance hashing scheme that can be used in many real-life applications. It complements cuckoo hashing by adding a small stash storing the elements that cannot fit into the main hash table due to collisions. However, the exact required size of the stash and the tradeoff between its size and the memory over-provisioning of the hash table are still unknown.

We settle this question by investigating the equivalent maximum matching size of a random bipartite graph, with a constant left-side vertex degree d = 2. Specifically, we provide an exact expression for the expected maximum matching size and show that its actual size is close to its mean, with high probability. This result relies on decomposing the bipartite graph into connected components, and then separately evaluating the distribution of the matching size in each of these components. In particular, we provide an exact expression for any finite bipartite graph size and also deduce asymptotic results as the number of vertices goes to infinity.

We also extend our analysis to cases where only part of the left-side vertices have a degree of 2, as well as to the case where the set of right-side vertices is partitioned into two subsets and each left-side vertex has exactly one edge to each of these subsets. Finally, in the case where the constant left-side degree satisfies d ≥ 3, we show how our method can be used to bound the expected maximum matching size from above.

Our results improve upon previous results in two ways. First, we give an exact expression of the expected size of the maximum matching and not only the threshold for achieving a perfect matching with high probability (namely, we show the trade-off between the size of the stash and the memory over-provisioning of the table, and not only the over-provisioning threshold beyond which the stash can be eliminated). Second, our results hold for any finite graph size and are not only asymptotic.
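For concreteness, here is a minimal Python sketch of cuckoo hashing with a stash (our illustration of the scheme being analyzed; parameter values and names are ours). Each item has two candidate buckets (left-side degree d = 2); an item that cannot be placed after a bounded chain of evictions goes to the stash, whose occupancy is exactly the gap between the number of items and the maximum matching size studied above.

```python
import random

def cuckoo_insert_all(items, m, max_kicks=50, seed=7):
    # cuckoo hashing with two hash functions and one slot per bucket;
    # items left homeless after `max_kicks` evictions go to the stash
    rng = random.Random(seed)
    salts = (rng.random(), rng.random())
    loc = lambda x, s: hash((s, x)) % m
    table, stash = [None] * m, []
    for x in items:
        cur = x
        for _ in range(max_kicks):
            b1, b2 = loc(cur, salts[0]), loc(cur, salts[1])
            if table[b1] is None:
                table[b1], cur = cur, None
                break
            if table[b2] is None:
                table[b2], cur = cur, None
                break
            table[b1], cur = cur, table[b1]   # evict and retry with victim
        if cur is not None:
            stash.append(cur)                 # overflow into the stash
    return table, stash

# 40 items into 100 buckets: well below the d = 2 threshold, so the
# stash stays very small with high probability
table, stash = cuckoo_insert_all(range(40), 100)
```

Varying the ratio of items to buckets in this sketch reproduces, empirically, the stash-size/over-provisioning trade-off that the paper characterizes exactly.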

Joint work with Yossi Kanizo and Isaac Keslassy. [pdf]

Network Vulnerability to Single, Multiple, and Probabilistic Physical Attacks

Telecommunications networks rely heavily on their physical infrastructure and are therefore vulnerable to natural disasters, such as earthquakes or floods, as well as to physical attacks, such as an Electromagnetic Pulse (EMP) attack. Large-scale disasters are likely to destroy network equipment and to severely affect interdependent systems such as the power-grid. In turn, long-term outage of the power-grid might cause additional failures to the telecommunication network.

In this paper, we model an attack as a disk around its epicenter, and provide efficient algorithms to find vulnerable points within the network under various metrics. In addition, we consider the case in which multiple disasters happen simultaneously, and provide an approximation algorithm to find the points that cause the most significant destruction. Finally, since a network element does not always fail, even when it is close to the attack's epicenter, we consider a simple probabilistic model in which the probability of a network element failure is given. Under this model, we tackle the cases of single and multiple attacks and develop algorithms that identify potential points where an attack is likely to cause significant damage.
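A brute-force Python sketch of the single-disk model (our simplification: the paper searches the whole plane with computational-geometry tools, while this toy only tries disks centered at network nodes and counts destroyed nodes as the damage metric):

```python
def worst_disk_attack(nodes, r):
    # nodes: list of (x, y) locations of network elements;
    # an attack is a disk of radius r that destroys every node inside.
    # Brute force over candidate epicenters placed at the nodes
    # themselves (a simplification of the geometric search).
    def hits(cx, cy):
        return sum((x - cx) ** 2 + (y - cy) ** 2 <= r * r
                   for x, y in nodes)
    return max(hits(x, y) for x, y in nodes)

# three clustered nodes and one remote node: the worst radius-1.5
# disk destroys the cluster of three
nodes = [(0, 0), (1, 0), (0, 1), (5, 5)]
```

The probabilistic variant described above replaces the 0/1 membership test with a failure probability that decays with distance from the epicenter, and maximizes the expected damage instead.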

Joint work with Pankaj Agarwal, Alon Efrat, Shashidhara Ganjugunte, Swaminathan Sankararaman, and Gill Zussman. [pdf]

Timely Data Delivery in a Realistic Bus Network

WiFi-enabled buses and stops may form the backbone of a metropolitan delay tolerant network that exploits nearby communications, temporary storage at stops, and predictable bus mobility to deliver non-real-time information.

This paper studies the problem of how to route data from its source to its destination in order to maximize the delivery probability by a given deadline. We assume that the bus schedule is known, but we take into account that randomness, due to road traffic conditions or passenger boarding and alighting, affects bus mobility. In this sense, this paper is one of the first to tackle quasi-deterministic mobility scenarios.

We propose a simple stochastic model for bus arrivals at stops, supported by a study of real-life traces collected in a large urban network with 250 bus lines and about 7500 bus-stops. A succinct graph representation of this model allows us to devise an optimal (under our model) single-copy routing algorithm and then extend it to cases where several copies of the same data are permitted.

Through an extensive simulation study, we compare the optimal routing algorithm with three other approaches: minimizing the expected traversal time over our graph, maximizing the delivery probability over an infinite time-horizon, and a recently-proposed heuristic based on bus frequencies. We show that, in general, our optimal algorithm outperforms the other three, but it essentially reduces to minimizing the expected traversal time when transmissions are always successful, or to maximizing the delivery probability over an infinite time-horizon when transmissions fail frequently. For reliable transmissions and reasonable values of deadlines, the multi-copy extension requires only 10 copies to nearly match the performance of costly flooding approaches.

Joint work with Utku Acer, Paolo Giaccone, Giovanni Neglia, and Saed Tarapiah. [pdf]

The Resilience of WDM Networks to Probabilistic Geographical Failures

Telecommunications networks, and in particular optical WDM networks, are vulnerable to large-scale failures of their infrastructure, resulting from physical attacks (such as an Electromagnetic Pulse attack) or natural disasters (such as earthquakes or floods). Such events happen in specific geographical locations and disrupt specific parts of the network, but their effects are not deterministic. Therefore, we provide a unified framework to model the network vulnerability when the event has a probabilistic nature, defined by an arbitrary probability density function.

Our framework captures scenarios with a number of simultaneous attacks, in which network components consist of several dependent sub-components, and in which either a 1+1 or a 1:1 protection plan is in place. We use computational geometric tools to provide efficient algorithms to identify vulnerable points within the network under various metrics. Then, we obtain numerical results for specific backbone networks, thereby demonstrating the applicability of our algorithms to real-world scenarios. Our novel approach would allow identifying locations which require additional protection efforts (e.g., equipment shielding). Overall, the paper demonstrates that using computational geometric techniques can significantly contribute to our understanding of network resilience.

Joint work with Pankaj Agarwal, Alon Efrat, Shashidhara Ganjugunte, Swaminathan Sankararaman, and Gil Zussman. [pdf]

Hash Tables With Finite Buckets Are Less Resistant To Deletions

We show that when memory is bounded, i.e., when buckets are finite, dynamic hash tables that allow insertions and deletions behave significantly worse than their static counterparts that only allow insertions. This behavior differs from previous results showing that, when memory is unbounded, the two models behave similarly.

We show the decrease in performance of dynamic hash tables using several hash-table schemes. We also provide tight upper and lower bounds on the achievable overflow fractions in these schemes. Finally, we propose an architecture with a content-addressable memory (CAM), which mitigates this decrease in performance.

Joint work with Yossi Kanizo and Isaac Keslassy. [pdf]

Space-Time Tradeoffs in Software-Based Deep Packet Inspection

Deep Packet Inspection (DPI) lies at the core of contemporary Network Intrusion Detection/Prevention Systems and Web Application Firewalls. DPI aims to identify various malware (including spam and viruses) by inspecting both the header and the payload of each packet and comparing it to a known set of patterns. DPI is often performed on the critical path of the packet processing, thus the overall performance of the security tools is dominated by the speed of DPI.

The seminal algorithm of Aho-Corasick (AC) is the de facto standard for pattern matching in network intrusion detection systems (NIDS). Basically, the AC algorithm constructs a Deterministic Finite Automaton (DFA) for detecting all occurrences of a given set of patterns by processing the input in a single pass. The input is inspected symbol by symbol (usually each symbol is a byte), such that each symbol results in a state transition. Thus, in principle, the AC algorithm has deterministic performance, which does not depend on the specific input and is therefore not vulnerable to algorithmic complexity attacks, making it very attractive for NIDS.

In this paper we show that, when implementing the AC algorithm in software, this property does not hold, because contemporary pattern sets induce very large DFAs that cannot be stored entirely in cache. We then propose a novel technique to compress the representation of the Aho-Corasick automaton so that it fits in a modern cache. We compare both the performance and the memory footprint of our technique to previously proposed implementations, under various settings and pattern sets. Our results reveal the space-time tradeoffs of DPI. Specifically, we show that our compression technique reduces the memory footprint of the best prior-art algorithm by approximately 60%, while achieving comparable throughput.

Joint work with Anat Bremler-Barr and Yotam Harchol. [pdf]

Decompression-Free Inspection: DPI for Shared Dictionary Compression over HTTP

Deep Packet Inspection (DPI) is the most time- and resource-consuming procedure in contemporary security tools such as Network Intrusion Detection/Prevention Systems (NIDS/IPS), Web Application Firewalls (WAF), or Content Filtering Proxies. DPI consists of inspecting both the packet header and payload and alerting when signatures of malicious software appear in the traffic. These signatures are identified through pattern matching algorithms. Moreover, DPI and its corresponding pattern matching algorithms are also a crucial building block for other networking applications such as monitoring and HTTP load balancing.

The portion of compressed traffic in overall Internet traffic is constantly increasing. This paper focuses on traffic compressed using Shared Dictionary Compression over HTTP (SDCH), a compression algorithm proposed by Google Inc. in 2008. Unlike traditional compression algorithms, SDCH takes advantage of inter-response redundancy (e.g., almost the same data being sent over and over again), as is common in today's dynamic data. Currently, 22% of HTTP clients support SDCH, and this number is expected to grow substantially. SDCH works well with other compression algorithms (such as GZip), making it even more appealing. On the other hand, performing DPI on any compressed traffic is considered hard; therefore, today's security tools either do not inspect compressed data, alter HTTP headers to avoid compression, or decompress the traffic before inspecting it.

We present a novel pattern matching algorithm that inspects SDCH-compressed traffic without decompressing it first. Our algorithm relies on inspecting the shared dictionary, which is common to all compressed traffic, offline, and marking auxiliary information on it to speed up the online DPI inspection. We show that our algorithm works near the rate of the compressed traffic, implying a speed gain of SDCH's compression ratio (which is around 40%). We also discuss how to implement our algorithm in an entire system, and show how the NIDS should obtain the shared dictionary, how to deal with SDCH compression over GZip compression, and how to support regular expression matching.

Joint work with Anat Bremler-Barr, Shimrit Tzur David, and Yaron Koral. [pdf]

Access-Efficient Balanced Bloom Filters

An ideal implementation of Bloom Filters in network devices has both a low memory access rate and a low false positive rate. However, since Bloom Filters hash elements to arbitrary memory blocks, they require high memory access rates. Blocked Bloom Filters, on the other hand, first hash each element to a single memory block, where they maintain a local Bloom Filter. Therefore, they access only one memory block per element, resulting in better memory-access efficiency. Unfortunately, they suffer from poor load balancing and high false positive rates. In this paper, we propose to implement load-balancing schemes for the choice of the memory block, along with an optional overflow list, resulting in improved false positive rates while keeping a high memory-access efficiency. To study this problem, we define, analyze, and solve a fundamental *access-constrained balancing problem*, in which incoming elements need to be optimally balanced across resources while satisfying average and instantaneous constraints on the *number of memory accesses* associated with checking the current load of the resources. We then build on this problem to suggest a new load-balanced Blocked Bloom Filter scheme. Finally, we show that this scheme can reduce the false positive rate by up to two orders of magnitude, at the cost of 1.2 memory accesses per element and an overflow list size of 0.5% of the elements.

Joint work with Yossi Kanizo and Isaac Keslassy. [pdf]

MCA^2: Multi-Core Architecture for Mitigating Complexity Attacks

This paper takes advantage of the emerging multi-core
computer architecture to design a general framework for mitigating
network-based complexity attacks. In complexity attacks, an
attacker carefully crafts ``*heavy*'' messages (or packets) such that each heavy message consumes
substantially more resources than a normal message. It then sends enough heavy messages to bring the system, at best, to a crawl.
In our architecture, called MCA^2---Multi-Core Architecture for Mitigating
Complexity Attacks---cores quickly identify such suspicious messages
and divert them to a fraction of the cores that are dedicated
to handle all the heavy messages. This keeps the rest of the cores
relatively unaffected and free to give the legitimate traffic
the same quality of service as if no attack were taking place.
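The diversion step can be sketched as follows (a toy round-robin dispatcher; the `is_heavy` classifier and the dedicated fraction are illustrative assumptions, and `num_cores` is assumed to exceed the number of heavy cores):

```python
def dispatch(messages, num_cores, is_heavy, heavy_fraction=0.25):
    # Toy sketch: a dedicated fraction of cores absorbs all suspicious
    # ("heavy") messages; the remaining cores serve only normal traffic.
    heavy_cores = max(1, int(num_cores * heavy_fraction))
    queues = {c: [] for c in range(num_cores)}
    h = n = 0
    for m in messages:
        if is_heavy(m):
            queues[h % heavy_cores].append(m)          # heavy pool
            h += 1
        else:
            queues[heavy_cores + n % (num_cores - heavy_cores)].append(m)
            n += 1
    return queues
```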

We demonstrate the effectiveness of our architecture by examining cache-miss
complexity attacks against Deep Packet Inspection (DPI)
engines. For example, for the Snort DPI engine, an attack in which 30% of the packets are
malicious degrades the system throughput by over 50%,
while with MCA^2 the throughput drops by only 20% when no packets
are dropped, or by just 10% when dropping heavy packets is allowed.
At 60% malicious packets, the corresponding numbers are 70%, 40%, and 23%.

Joint work with Yehuda Afek, Anat Bremler-Barr, Yotam Harchol, and Yaron Koral. [pdf]

Power Grid Vulnerability to Geographically Correlated Failures - Analysis and Control Implications

We consider line
outages in the transmission network of the power grid, and
specifically those caused by natural disasters or large-scale
physical attacks. In such networks, an outage of a line may lead to
overload on other lines, thereby leading to their outage. *Such
a cascade may have devastating effects not only on the power grid but
also on the interconnected communication networks*. We study a
model of such failures and show that it differs from other models
used to analyze cascades (e.g., epidemic/percolation-based
models). Inspired by methods developed for network-survivability
analysis, we show how to identify the most vulnerable locations in
the network. We also perform extensive numerical experiments with
real grid data to estimate the effects of geographically correlated
outages and briefly discuss mitigation methods. The developed
techniques can indicate potential locations for grid monitoring, and
hence, will have an impact on the deployment of the smart-grid
networking infrastructure.
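As a deliberately crude illustration of how overload cascades differ from one-shot failures (this even split of flow over surviving lines is an assumption for exposition, not the paper's power-flow model):

```python
def cascade(total_flow, capacities):
    # Toy cascade: total_flow is split evenly among surviving parallel
    # lines; any line pushed over its capacity fails, and the flow then
    # redistributes over the remaining lines, possibly failing more of them.
    alive = dict(capacities)
    while alive:
        share = total_flow / len(alive)
        failed = [line for line, cap in alive.items() if share > cap]
        if not failed:
            return alive
        for line in failed:
            del alive[line]
    return alive
```

Note how removing one weak line can push the survivors over their own capacities, wiping out the whole system.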

Joint work with Andrey Bernstein, Daniel Bienstock, Meric Uzunoglu, and Gil Zussman. [pdf]

Mitigating Cascading Failures in Interdependent Power Grids and Communication Networks

We study the interdependency between the power
grid and the communication network used to control the grid. A
communication node depends on the power grid in order to receive
power for operation, and a power node depends on the communication
network in order to receive control signals. We demonstrate that these
dependencies can lead to cascading failures, and it is essential to
consider the power flow equations for studying the behavior of
such interdependent networks. We propose a two-phase control policy
to mitigate the cascade of failures. In the first phase, our control
policy finds the unavoidable failures that occur due to physical
disconnection. In the second phase, our algorithm redistributes the
power so that all the connected communication nodes have enough power
for operation and no power lines overload. We perform a sensitivity
analysis to evaluate the performance of our control policy, and
show that our control policy achieves close to optimal yield for many
scenarios. This analysis can help design robust interdependent grids and
associated control policies.
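The first phase can be illustrated by a simple fixed-point computation (a sketch under the assumption that each node depends on a single node in the other network):

```python
def propagate(initial_failed, depends_on):
    # Phase-one sketch: a node fails once the node it depends on has
    # failed; iterate to a fixed point to find the unavoidable failures.
    failed = set(initial_failed)
    changed = True
    while changed:
        changed = False
        for node, req in depends_on.items():
            if node not in failed and req in failed:
                failed.add(node)
                changed = True
    return failed
```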

Joint work with Marzieh Parandehgheibi and Eytan Modiano. [pdf]

DPI as a Service

Middleboxes play a major role in contemporary networks, as forwarding packets is often not enough to meet operator demands, and other functionalities (such as security, QoS/QoE provisioning, and load balancing) are required. Traffic is usually routed through a
sequence of such middleboxes, which either reside across the network or in a single, consolidated location. Although middleboxes
provide a vast range of different capabilities, there are components
that are shared among many of them.

A task common to almost all middleboxes that deal with L7 protocols is Deep Packet Inspection (DPI). Today, traffic is inspected
from scratch by all the middleboxes on its route. In this paper, we
propose to treat DPI as a service to the middleboxes, implying that
traffic should be scanned only once, but against the data of all middleboxes that use the service. The DPI service then passes the scan
results to the appropriate middleboxes. Having DPI as a service has
significant advantages in performance, scalability, robustness, and
as a catalyst for innovation in the middlebox domain. Moreover,
technologies and solutions for current Software Defined Networks
(SDN) (e.g., SIMPLE [41]) make it feasible to implement such a
service and route traffic to and from its instances.
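A toy sketch of the scan-once idea (plain substring matching stands in for a real DPI engine): the service keeps one combined pattern set, remembers which middlebox registered each pattern, and dispatches the results back to the owners:

```python
class DPIService:
    # Illustrative sketch: one combined scan serves all middleboxes.
    def __init__(self):
        self.owners = {}          # pattern -> set of registered middleboxes

    def register(self, mbox, patterns):
        for p in patterns:
            self.owners.setdefault(p, set()).add(mbox)

    def scan(self, payload):
        # One pass over the combined pattern set; each middlebox then
        # receives only the matches for patterns it registered.
        results = {}
        for p, mboxes in self.owners.items():
            if p in payload:
                for m in mboxes:
                    results.setdefault(m, set()).add(p)
        return results
```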

Joint work with Anat Bremler-Barr, Yotam Harchol, and Yaron Koral. [pdf]

Leveraging Traffic Repetitions for High-Speed Deep Packet Inspection
Deep Packet Inspection (DPI) plays a major role
in contemporary networks. Specifically, in the datacenters of
content providers, the scanned data may be highly repetitive. Most
DPI engines are based on identifying signatures in packet
payload. This pattern matching process is expensive both in
memory and CPU resources, and therefore, often becomes the
bottleneck of the entire application.

This paper shows how DPI can be accelerated by leveraging
repetitions in the inspected traffic. We first show that such
repetitions exist in many traffic types and present a mechanism
that allows skipping repeated data instead of scanning it again.
In its slow path, frequently repeated strings are identified and
stored in a dictionary along with some succinct information for
accelerating the DPI process. In the mechanism's data path,
each time the scanning algorithm encounters a string from the
dictionary, it skips it and recovers the correct state, exactly as if this
string had been scanned byte by byte.

Our solution achieves a significant performance boost, especially
when the data comes from the same content source (e.g., the same website).
Our experiments show that for such cases, our solution achieves
a throughput gain of 1.25-2.5 times the original throughput when
implemented in software.
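The skip-and-recover step can be sketched with a single-pattern DFA (a simplification: real DPI engines use multi-pattern Aho-Corasick automata, and the token format below is an illustrative assumption). For every frequent string, the slow path precomputes, per DFA state, where scanning that string would end up and whether it contains a match; the data path then handles the whole string with one table lookup:

```python
def build_dfa(pattern, alphabet):
    # Classic KMP-style DFA for one pattern; state = matched prefix length.
    m = len(pattern)
    dfa = [dict() for _ in range(m + 1)]
    for c in alphabet:
        dfa[0][c] = 1 if c == pattern[0] else 0
    x = 0
    for j in range(1, m + 1):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]
        if j < m:
            dfa[j][pattern[j]] = j + 1
            x = dfa[x][pattern[j]]
    return dfa

def skip_table(dfa, accept, word):
    # Slow path: for every state, precompute the landing state after
    # scanning `word`, and whether the pattern is matched along the way.
    table = {}
    for s in range(len(dfa)):
        cur, hit = s, False
        for ch in word:
            cur = dfa[cur][ch]
            hit = hit or cur == accept
        table[s] = (cur, hit)
    return table

def scan(tokens, dfa, accept, tables):
    # Data path: plain characters drive the DFA one step at a time; a
    # ('skip', word) token is resolved with a single lookup, recovering
    # the exact state as if the word had been scanned byte by byte.
    state, matched = 0, False
    for t in tokens:
        if isinstance(t, tuple):
            state, hit = tables[t[1]][state]
            matched = matched or hit
        else:
            state = dfa[state][t]
            matched = matched or state == accept
    return matched
```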

Joint work with Anat Bremler-Barr, Shimrit Tzur David, and Yotam Harchol. [pdf]

Capturing Resource Tradeoffs in Fair Multi-Resource Allocation

Cloud computing platforms provide
computational resources (CPU, storage, etc.) for running users'
applications.
Often, the same application can be implemented in various ways, each
with different resource requirements. Taking advantage of this
flexibility when allocating resources to users can both greatly
benefit users and lead to much better global resource
utilization. We develop a framework for fair resource allocation
that captures such implementation tradeoffs by allowing users to
submit multiple “resource demands”. We present and analyze two
mechanisms for fairly allocating resources in such environments: the
Lexicographically-Max-Min-Fair (LMMF) mechanism and the
Nash-Bargaining (NB) mechanism. We prove that NB has many desirable
properties, including Pareto optimality and envy freeness, in a
broad variety of environments whereas the seemingly less appealing
LMMF fares better, and is even immune to manipulations, in
restricted settings of interest.
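A brute-force toy of why implementation tradeoffs matter (an illustration only, not the paper's LMMF or NB mechanism: here every user receives the same scale on its chosen demand vector, and we pick the combination of implementations that maximizes that common scale):

```python
from itertools import product as cartesian

def best_choice(users, capacity):
    # users: per user, a list of alternative demand vectors (dicts
    # mapping resource -> amount), one per possible implementation.
    # Pick one vector per user so the common feasible scale alpha
    # (hence the symmetric Nash product alpha^n) is maximized.
    best = (0.0, None)
    for combo in cartesian(*users):
        totals = {r: sum(d.get(r, 0) for d in combo) for r in capacity}
        alpha = min(capacity[r] / totals[r]
                    for r in capacity if totals[r] > 0)
        if alpha > best[0]:
            best = (alpha, combo)
    return best
```

Choosing complementary implementations (one CPU-heavy, one memory-heavy) lets both users scale further than any single-demand formulation would allow.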

Joint work with Doron Zarchy and Michael Schapira. [pdf]

Accelerating Regular Expression Matching Over Compressed HTTP

This paper focuses on regular expression
matching over compressed traffic. The need for such matching
arises from two independent trends. First, the volume and share of
compressed HTTP traffic is constantly increasing, mostly with GZIP
that uses back-references to repeated strings. Second, due to
their superior expressibility, current Deep Packet Inspection
engines use regular expressions more and more frequently.

We present an algorithmic framework to accelerate such matching, taking advantage of information gathered when the traffic was initially compressed. Our algorithm is based on calculating (for every byte) the minimum number of (previous) bytes that can be part of a future regular expression match. When inspecting a back-reference, only these bytes should be taken into account, thus enabling one to skip repeated strings almost entirely without missing a match. We show that our generic framework works with either NFA-based or DFA-based implementations and gains a performance boost of more than 70%. Moreover, it can be readily adapted to most existing regular expression matching algorithms, which are usually based on NFAs, DFAs, or combinations of the two. Finally, we discuss other applications in which calculating the number of relevant bytes comes in handy, even when the traffic is not compressed.
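A simplified sketch of the skipping idea, using exact strings rather than regular expressions (an illustrative reduction: matches wholly inside a back-referenced span are copied over from the source span, and only a max-pattern-length window around each part boundary is rescanned; overlapping references are not handled):

```python
def find_matches(parts, patterns):
    # parts: ('lit', text) or ('ref', distance, length), as in LZ77/GZIP.
    w = max(len(p) for p in patterns) - 1   # boundary window width
    out, matches = "", set()                # matches: (pattern, position)
    for part in parts:
        start = len(out)
        if part[0] == 'lit':
            out += part[1]
            lo, hi = max(0, start - w), len(out)
        else:
            _, dist, length = part
            src = start - dist
            out += out[src:src + length]    # assumes no self-overlap
            # Copy matches that lie wholly inside the referenced span.
            for p, i in list(matches):
                if src <= i and i + len(p) <= src + length:
                    matches.add((p, i - src + start))
            # Rescan only the window around the left boundary.
            lo, hi = max(0, start - w), min(len(out), start + w + 1)
        for p in patterns:
            i = out.find(p, lo)
            while i != -1 and i + len(p) <= hi:
                matches.add((p, i))
                i = out.find(p, i + 1)
    return {p for p, _ in matches}
```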

Joint work with Michela Becchi, Anat Bremler-Barr, Omer Kochba, and Yaron Koral. [pdf]

Ultra-Fast Similarity Search Using Ternary Content Addressable Memory

Similarity search, and specifically the nearest-neighbor (NN) search problem, is widely used in many fields of computer science, such as machine learning, computer vision, and databases. However, in many settings such searches are known to suffer from the notorious curse of dimensionality, where the running time grows exponentially with the dimension d. This causes severe performance degradation when working in high-dimensional spaces. Approximate techniques such as locality-sensitive hashing [2] improve the performance of the search, but are still computationally intensive.

In this paper we propose a new way to solve this problem using a special hardware device called ternary content addressable memory (TCAM). TCAM is an associative memory, a special type of computer memory that is widely used in switches and routers for very high speed search applications. We show that the TCAM computational model can be leveraged and adjusted to solve NN search problems in a single TCAM lookup cycle, and with linear space. This approach does not suffer from the curse of dimensionality and is shown to improve upon the best known approaches for NN by more than four orders of magnitude. Simulation results demonstrate dramatic improvement over the best known approaches for NN, and suggest that TCAM devices may play a critical role in future large-scale databases and cloud applications.
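One way to see how a single ternary lookup can answer a proximity query (a toy unary encoding simulated in software; the paper's actual encoding and guarantees differ) is to expand each database point into ternary entries of growing radius, ordered so that the first matching entry is the closest point:

```python
def unary(v, width):
    # Thermometer code: bit i is 1 iff i < v.
    return "1" * v + "0" * (width - v)

def tcam_entry(point, radius, width):
    # Ternary pattern matching every query within L-infinity `radius`
    # of `point`: unary bits inside the uncertainty band are don't-cares.
    s = ""
    for v in point:
        lo, hi = v - radius, v + radius
        for i in range(width):
            s += "1" if i < lo else "0" if i >= hi else "*"
    return s

def tcam_lookup(entries, key):
    # TCAM semantics: entries are checked in order; the first ternary
    # pattern matching the key wins (one hardware lookup cycle).
    for label, pat in entries:
        if all(p in ("*", k) for p, k in zip(pat, key)):
            return label
    return None
```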

Joint work with Anat Bremler-Barr, Yotam Harchol, and Yacov Hel-Or. [pdf]

OpenBox: Enabling Innovation in Middlebox Applications

Contemporary networks contain many different kinds of middleboxes that perform a variety of advanced network functions. Currently, a special box is tailored to provide each such function. These boxes are usually proprietary, and operators' control over them is limited to the set of capabilities defined by the provider of each box. Nonetheless, many middleboxes perform very similar tasks.

In this paper we present *OpenBox*: a logically-centralized framework that makes advanced packet processing and monitoring easier, faster, more scalable, flexible, and innovative. OpenBox decouples the control plane of middleboxes from their data plane, and unifies the data plane of multiple middlebox applications using entities called service instances. On top of the centralized control plane, everyone can develop OpenBox applications. An OpenBox application, formerly implemented as a separate middlebox, instructs the data plane how to process packets in order to achieve its intended function. OpenBox service instances reside in the data plane and process packets according to policies defined by the control plane. They can be implemented in software or use specialized hardware.

Joint work with Anat Bremler-Barr and Yotam Harchol. [pdf]

Making DPI Engines Resilient to Algorithmic Complexity Attacks

This paper starts by demonstrating the vulnerability of Deep Packet Inspection (DPI) mechanisms, which are at the core of security devices, to algorithmic complexity denial-of-service attacks, thus exposing a weakness in the first line of defense of enterprise networks and clouds. A system and a multi-core architecture to defend against these algorithmic complexity attacks are presented in the second part of the paper. The integration of this system with two different DPI engines is demonstrated and discussed. The vulnerability is exposed by showing how a simple low-bandwidth cache-miss attack takes down the Aho-Corasick (AC) pattern matching algorithm that lies at the heart of most DPI engines. As a first step in the mitigation of the attack, we have developed a compressed variant of the AC algorithm that improves the worst-case performance (under an attack). Still, under normal traffic its running time is worse than that of classical AC implementations. To overcome this problem, we introduce MCA^2---Multi-Core Architecture for Mitigating Complexity Attacks---which dynamically combines the classical AC algorithm with our compressed implementation to provide a robust solution that mitigates this cache-miss attack. We demonstrate the effectiveness of our architecture by examining cache-miss algorithmic complexity attacks against DPI engines and show a goodput boost of up to 73%. Finally, we show that our architecture may be generalized to provide a principled solution to a wide variety of algorithmic complexity attacks.

Joint work with Yehuda Afek, Anat Bremler-Barr, Yotam Harchol, and Yaron Koral. [pdf]

Enhancing Network Robustness via Shielding

We consider shielding critical links to guarantee network connectivity under geographical and general failure models. We develop a mixed integer linear program (MILP) to obtain the minimum-cost shielding that guarantees the connectivity of a single source-destination (SD) pair under a general failure model, and exploit geometric properties to decompose the shielding problem under a geographical failure model. We extend our MILP formulation to guarantee the connectivity of the entire network, and use Benders decomposition to significantly reduce the running time by exploiting its partially separable structure. We also apply simulated annealing to obtain near-optimal solutions for larger networks in much shorter time. Finally, we extend the algorithms to guarantee partial network connectivity, and observe significant reductions in shielding cost, especially when the failure region is small. For example, when the failure region radius is 60 miles, we observe as much as a 75% reduction in shielding cost by relaxing the connectivity requirement to 95% on a major US infrastructure network.
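For intuition, the single-pair version can be solved by brute force on a toy instance (the paper uses an MILP; this exponential enumeration is only illustrative):

```python
from itertools import combinations

def min_shield(edges, cost, scenarios, s, t):
    # Find the cheapest edge set to shield so that s and t stay connected
    # in every failure scenario. A scenario destroys its listed edges
    # unless they are shielded.
    def connected(avail):
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for a, b in avail:
                v = b if a == u else a if b == u else None
                if v is not None and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return t in seen

    best_cost, best_set = float("inf"), None
    for k in range(len(edges) + 1):
        for shield in combinations(edges, k):
            c = sum(cost[e] for e in shield)
            if c < best_cost and all(
                    connected([e for e in edges
                               if e not in scen or e in shield])
                    for scen in scenarios):
                best_cost, best_set = c, set(shield)
    return best_cost, best_set
```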

Joint work with Jianan Zhang and Eytan Modiano. [pdf]

Ownership-Aware Software-Defined Backhauls in Next-Generation Cellular Networks

Future cellular networks will be owned by multiple parties, e.g., two mobile operators, each of which controls some elements of the access and backhaul infrastructure. In this context, it is important that as much traffic as possible is processed by the same party that generates it, i.e., that the coupling between traffic and network ownership is maximized. Software-defined backhaul networks can attain this goal; however, existing management schemes ignore ownership altogether. We fill this gap by presenting an ownership-aware network management scheme, maximizing the amount of traffic that is processed by the same party it belongs to.

Joint work with Francesco Malandrino. [pdf]

Coopetition Between Network Operators and Content Providers in SDN/NFV Core Networks

It is widely expected that the core of next-generation cellular networks will be (i) based on software-defined networking (SDN) and network function virtualization (NFV), and (ii) shared between multiple parties, including traditional mobile operators and content providers. Such parties normally compete with each other; however, they can obtain significant mutual benefits even from limited and well-defined cooperation. We study how such a coopetition relationship influences the decisions of (i) how to place virtual network functions on physical hardware and (ii) how to steer traffic between them. We present an efficient online algorithm to make such placement and steering decisions, and study its performance in a realistic scenario. We find that our algorithm allows mobile operators and content providers to reduce their reliance on third-party vendors by 60%.
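A greedy sketch of ownership-aware placement (an illustration only, not the paper's online algorithm; the server and VNF formats are assumptions): prefer servers owned by the VNF's own party, and fall back to other hardware only when capacity runs out:

```python
def place(vnfs, servers):
    # vnfs: (name, owner, demand) triples; servers: dicts with
    # "name", "owner", and remaining "free" capacity.
    placement = {}
    for name, owner, demand in vnfs:
        # Same-owner servers first; within a group, most free capacity first.
        for srv in sorted(servers,
                          key=lambda s: (s["owner"] != owner, -s["free"])):
            if srv["free"] >= demand:
                srv["free"] -= demand
                placement[name] = srv["name"]
                break
    return placement
```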

Joint work with Nir Gazit and Francesco Malandrino. [pdf]

Scalable URL Matching with Small Memory Footprint

URL matching lies at the core of many networking applications and Information-Centric Networking architectures. For example, URL matching is extensively used by Layer 7 switches, ICN/NDN routers, load balancers, and security devices. Modern URL matching is done by maintaining a rich database that consists of tens of millions of URLs, classified into dozens of categories (or egress ports). In real time, any input URL has to be searched in this database to find the corresponding category. In this paper, we introduce a generic framework for accurate URL matching (namely, with no false positives or miscategorizations) that aims to reduce the overall memory footprint while still keeping matching latency low. We introduce a dictionary-based compression method that compresses the database by 60%, while incurring only a slight overhead in time. Our framework is very flexible: it allows hot updates, supports cloud-based deployments, and can deal with strings that are not URLs.
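A toy version of dictionary-based URL compression (the component-frequency dictionary and the two-byte escape code are illustrative assumptions, not the paper's method):

```python
from collections import Counter

def build_dictionary(urls, min_len=4, max_entries=128):
    # Toy dictionary builder: frequent long path components become entries.
    counts = Counter()
    for u in urls:
        for part in u.split("/"):
            if len(part) >= min_len:
                counts[part] += 1
    return [w for w, c in counts.most_common(max_entries) if c > 1]

def compress(url, dictionary):
    # Replace each dictionary word with a 2-byte escape: '\x00' + index.
    for i, w in enumerate(dictionary):
        url = url.replace(w, "\x00" + chr(i))
    return url

def decompress(url, dictionary):
    out, i = "", 0
    while i < len(url):
        if url[i] == "\x00":
            out += dictionary[ord(url[i + 1])]
            i += 2
        else:
            out += url[i]
            i += 1
    return out
```

In a real deployment the dictionary would be chosen to maximize the overall database-size reduction; here it is just the repeated path components.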

Joint work with Anat Bremler-Barr, Daniel Krauthgamer, Shimrit Tzur-David. [pdf]