Tuesday, November 25, 2008

Reading for Lecture Nov 25

A Policy-aware Switching Layer for Data Centers

In today’s data centers, a variety of middleboxes are deployed to protect, manage and improve the performance of the applications and services they run. Administrators often have to overload layer-2 path selection mechanisms to coerce traffic through the desired sequence of middleboxes placed on the network path. This process is inflexible and prone to misconfiguration.

PLayer is a policy-aware switching layer consisting of interconnected policy-aware switches, or pswitches. Based on policies specified by administrators at a centralized controller, pswitches explicitly forward different types of traffic through different sequences of middleboxes.

A pswitch consists of two independent parts: the switch core and the policy core. The switch core provides the forwarding functionality of a standard Ethernet switch. The policy core redirects frames to the middleboxes dictated by policy; it does so by encapsulating each frame in a new Ethernet-II frame, identified by a new EtherType code.
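The redirection can be sketched as a simple wrap/unwrap of the original frame. In the sketch below, the EtherType value and MAC addresses are placeholders, not the actual code points PLayer uses:

```python
import struct

# Placeholder EtherType (an experimental IEEE code point); PLayer's
# real value is not specified here.
PLAYER_ETHERTYPE = 0x88B5

def encapsulate(frame: bytes, middlebox_mac: bytes, pswitch_mac: bytes) -> bytes:
    """Wrap an original Ethernet frame in a new Ethernet-II frame
    addressed to the next middlebox on the policy-specified path."""
    header = middlebox_mac + pswitch_mac + struct.pack("!H", PLAYER_ETHERTYPE)
    return header + frame

def decapsulate(frame: bytes) -> bytes:
    """Strip the 14-byte outer header to recover the original frame."""
    assert struct.unpack("!H", frame[12:14])[0] == PLAYER_ETHERTYPE
    return frame[14:]
```

Switches that understand the new EtherType can treat the outer frame like any other Ethernet frame, which is why unconverted switches only need a configuration change.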

Only the switches into which external networks requiring middlebox traversal guarantees are plugged need to be converted to pswitches. Other switches need not be converted if they can be configured or modified to treat encapsulated frames with the new EtherType as regular Ethernet frames.

Improving MapReduce Performance In Heterogeneous Environments

Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop’s performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and that tasks make progress linearly.

Based on these assumptions, Hadoop runs a speculative copy of a poorly performing task on another machine so that the task can finish faster. Hadoop’s scheduler starts a speculative task based on a simple heuristic comparing each task’s progress to the average progress. This heuristic works well in homogeneous environments but can lead to severe performance degradation in heterogeneous ones.

The LATE scheduler estimates the progress rate of each task as ProgressScore/T, where T is the amount of time the task has been running. The finish-time estimate is then (1 – ProgressScore)/ProgressRate. This assumes that tasks make progress at a roughly constant rate.
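A minimal sketch of this estimate, assuming ProgressScore is a fraction in [0, 1]:

```python
def estimated_time_left(progress_score: float, elapsed: float) -> float:
    """LATE's heuristic: assume the task keeps progressing at its
    observed average rate. progress_score is in [0, 1]."""
    progress_rate = progress_score / elapsed
    return (1 - progress_score) / progress_rate

# A task that is 25% done after 100s is estimated to need roughly 300s more.
```

LATE then speculates on the tasks with the largest estimated time left, since those hurt the job's response time the most.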

The advantage of LATE is that it is robust to node heterogeneity because it relaunches only the slowest tasks, and only a small number of them. LATE prioritizes among the slowest tasks based on how much they hurt job response time. It also caps the number of speculative tasks to limit contention for shared resources.

Thursday, November 20, 2008

Reading for Lecture Nov 20

A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

In IP multicast, data sources simply send to the group’s multicast address without any advance knowledge of the group’s membership. To receive data, receivers simply announce that they are interested. Each receiver joins and leaves the group individually, without affecting data transmission to any other member.

SRM maximizes information and data sharing among all the members, and strengthens the individuality of membership by making each member responsible for its own correct reception of all the data. The SRM design follows the core design principles of TCP/IP: it requires only the basic IP delivery model – best effort with possible duplication and reordering of packets – and builds reliability on an end-to-end basis. Furthermore, SRM algorithms dynamically adjust their control parameters based on the observed performance within a session.

A TCP-style approach to multicast has a number of problems: if all receivers send ACKs, the sender will be overwhelmed by these messages. It is also unclear how a sender could continuously track the set of active receivers and the reception rate of each. Round-trip time is also very difficult to estimate because of differences in propagation time to different receivers.

In the wb framework, each member is identified by a globally unique identifier, the Source-ID, and each page is identified by the Source-ID of its initiator and a page number local to that initiator. The wb multicast design assumes: 1) All data has a unique name. 2) A name always refers to the same data. 3) Source-IDs are persistent. 4) IP datagram delivery is available. 5) All participants join the same multicast group; there is no distinction between senders and receivers.
When a receiver detects missing data, it sends a repair request after a random delay.
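The random delay is scaled by the receiver's estimated distance to the source, so nearby receivers tend to time out first and their request suppresses the duplicates others would have sent. A sketch with illustrative constants:

```python
import random

def repair_request_delay(dist: float, c1: float = 1.0, c2: float = 2.0) -> float:
    """SRM-style randomized backoff: on detecting a loss, a receiver
    waits a delay drawn uniformly from [c1*d, (c1+c2)*d], where d is
    its estimated one-way distance to the source. If it hears another
    member's request for the same data first, it suppresses its own."""
    return random.uniform(c1 * dist, (c1 + c2) * dist)
```

The constants c1 and c2 are tunable; the paper adjusts such parameters dynamically based on observed session performance.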

Scalable Application Layer Multicast

This system is based on a hierarchical clustering of the application-layer multicast peers and can support a number of different data delivery trees with desirable properties. The application-layer multicast protocols can be evaluated along three dimensions: 1) Quality of data delivery path. 2) Robustness of the overlay. 3) Control overhead.

The project aims to develop an efficient, scalable, distributed tree-building protocol that does not require any underlying topology information. It reduces the worst-case state and control overhead at any member to O(log N). An average member maintains state for a constant number of other members and incurs constant control overhead for topology creation and maintenance.

Hosts are distributed into different layers with the following properties: 1) A host belongs to only a single cluster at any layer. 2) If a host is present in some cluster in layer L_i, it must occur in one cluster in each of the layers L_0, …, L_{i-1}; in fact, it is the cluster leader in each of these lower layers. 3) If a host is not present in layer L_i, it cannot be present in any layer L_j, where j > i. 4) Each cluster has its size bounded between k and 3k-1. 5) There are at most log_k N layers, and the highest layer has only a single member.
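Property 5 follows because each cluster of at least k members promotes exactly one leader to the layer above, so each layer has at most 1/k as many hosts as the one below it. A small sketch of the resulting bound:

```python
import math

def max_layers(n: int, k: int) -> int:
    """Layer i has at most n / k^i hosts, since every cluster of at
    least k members contributes one leader upward; the hierarchy
    therefore has O(log_k n) layers above layer 0."""
    return math.ceil(math.log(n, k))
```

With k around 3 (the smallest allowed cluster size), even large groups need only a handful of layers.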

Tuesday, November 18, 2008

Reading for Lecture Nov 18

An Architecture for Internet Data Transfer

This architecture separates content negotiation from data transfer: applications determine what data they need to send and then use the new transfer service to send it. Historically, data transfers have been tightly linked with content negotiation, both because TCP and the socket API provide a good-enough mechanism and because of the challenge of naming objects.

The benefit of this mechanism is that developers can reuse transfer techniques instead of re-implementing them. Furthermore, applications that use the transfer service can immediately begin using new transfer techniques without modification. The three main challenges of moving data transfer into a new service are: 1) Providing a convenient and standard API for data transfer applications. 2) Allowing easy development and deployment of new transfer mechanisms. 3) Supporting applications with diverse negotiation and naming conventions.

The DOT service provides three properties: 1) Surviving transient disconnection and mobility. 2) Transferring via portable storage. 3) Using caching and multiple sources. The GTC splits objects into a series of chunks. Each chunk is named with a descriptor containing the hash of the chunk, its length, and its offset into the object. In DOT, hints are used to inform the receiver’s GTC of possible data locations.
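A sketch of the chunk naming just described; the chunk size and hash function here are illustrative, and the real GTC may use different values:

```python
import hashlib

CHUNK_SIZE = 16 * 1024  # illustrative; DOT's actual chunk size may differ

def describe(obj: bytes):
    """Split an object into chunks and name each one with a
    descriptor: (hash of the chunk, its length, its offset)."""
    descriptors = []
    for offset in range(0, len(obj), CHUNK_SIZE):
        chunk = obj[offset:offset + CHUNK_SIZE]
        descriptors.append({
            "hash": hashlib.sha1(chunk).hexdigest(),
            "length": len(chunk),
            "offset": offset,
        })
    return descriptors
```

Because chunks are named by content hash, a receiver can fetch them from a cache or any alternate source suggested by a hint and still verify them.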

A Delay-Tolerant Network Architecture for Challenged Internets

Today’s Internet may operate poorly in environments characterized by very long delay paths and frequent network partitions. In mobile and extreme environments lacking continuous connectivity, many networks have their own specialized protocols and do not use IP.

The DTN architecture is proposed to provide interoperability between them. It operates as an overlay above the transport layers of the networks it interconnects. DTN is based on an abstraction of message switching, and it provides a flexible naming scheme in the form of late binding.

Challenged internetworks are characterized by latency, bandwidth limitations, error probability, node longevity, or path stability substantially worse than today’s Internet.

Thursday, November 13, 2008

Reading for Lecture Nov 13

X-trace: A Pervasive Network Tracing Framework

When complex systems misbehave, it is quite difficult to diagnose the source of the problem. X-trace is a framework that provides a comprehensive view of systems that adopt it. A user or operator invokes X-trace when initiating an application task by inserting X-trace metadata carrying a task identifier. X-trace is comprehensive in the sense that it tags all network operations resulting from a particular task with the same task identifier; this set of operations is called a task tree.

Diagnosing problems often requires tracing a task across different administrative domains (ADs). Since ADs may not want to reveal information to each other, X-trace provides a clean separation between the client that invokes X-trace and the recipient of the trace information.
The design principles of X-trace are: 1) The trace request should be sent in-band, rather than in a separate probe message. 2) The collected trace data should be sent out-of-band, decoupled from the original datapath. 3) The entity that requests tracing is decoupled from the entity that receives the trace report.

The task identifier within the X-trace metadata uniquely identifies each network task. The pushDown() and pushNext() operations propagate X-trace metadata along the datapath. The reports from each X-trace-aware node are used to reconstruct the datapath.
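A simplified sketch of how the two propagation operations preserve the task identifier; the real metadata also records parent-operation edges for tree reconstruction, which is elided here:

```python
import os

class XTraceMetadata:
    """Sketch of X-trace metadata: the task identifier shared by every
    operation in the task tree, plus an ID for the current operation."""
    def __init__(self, task_id: str, op_id: str):
        self.task_id = task_id
        self.op_id = op_id

    def push_next(self) -> "XTraceMetadata":
        # Next hop at the same layer: keep the task ID, mint a new op ID.
        return XTraceMetadata(self.task_id, os.urandom(4).hex())

    def push_down(self) -> "XTraceMetadata":
        # Copy to the layer below: likewise keeps the task ID, so every
        # lower-layer operation remains tagged with the same task.
        return XTraceMetadata(self.task_id, os.urandom(4).hex())
```

Because both operations carry the task ID forward, every report emitted along the datapath can later be grouped into one task tree.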

End-to-End Internet Packet Dynamics

Each 100 Kbyte transfer is traced at both the sender and the receiver to distinguish between end-to-end behaviors in the two directions of an Internet path, which often exhibit asymmetries. The paper investigates various issues such as out-of-order packet delivery, packet corruption, bottleneck bandwidth, and patterns of packet loss.

Two measurement runs are used to observe how packet dynamics changed over the course of 1995. Many different behaviors are observed, discussed, and inferred, but it is not clear to me what the main points are.

Thursday, November 6, 2008

Reading for Lecture Nov 6

Internet Indirection Infrastructure

The i3 architecture aims to ease the provision of services like multicast, anycast, and mobility. It offers a rendezvous-based communication abstraction. i3’s level of indirection, which decouples the act of sending a packet from the act of receiving it, allows i3 to efficiently support a wide variety of fundamental communication services.

In i3, sources send packets to a logical identifier, and receivers express interest in packets sent to an identifier. Delivery is still best-effort, as in today’s Internet. i3’s join operation is the insertion of a trigger. This operation is more flexible than IP multicast since it allows receivers to control the routing of the packet.

Each packet in i3 is a pair (id, data). Receivers use triggers to express their interest in packets. A trigger is of the form (id, addr), indicating that all packets with identifier id should be forwarded by i3 to the node identified by addr.

Each identifier is mapped to a unique i3 node. When a trigger is inserted, it is stored at that node. When a packet is sent to an id, it is routed to the node responsible for that id; there, it is matched against triggers, and the packet is forwarded to all the hosts interested in it.
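A sketch of trigger insertion and matching at the responsible node. This uses exact matching only; i3 also supports inexact, longest-prefix matching of identifiers, which is omitted here:

```python
class I3Node:
    """Sketch of trigger storage and matching at the i3 node
    responsible for an identifier."""
    def __init__(self):
        self.triggers = {}  # identifier -> set of receiver addresses

    def insert_trigger(self, ident, addr):
        """A receiver's join: (id, addr) says 'send packets for id to addr'."""
        self.triggers.setdefault(ident, set()).add(addr)

    def deliver(self, ident, data, send):
        """Match a packet (id, data) against stored triggers and forward
        it to every interested host; multicast falls out naturally when
        several receivers registered triggers for the same id."""
        for addr in self.triggers.get(ident, ()):
            send(addr, data)
```

Mobility is equally simple in this model: a receiver that moves just re-inserts its trigger with a new address.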

Middleboxes No Longer Considered Harmful

Middleboxes like NATs, firewalls, and transparent caches have become an important part of today’s Internet. The DOA architecture aims to facilitate the deployment of middleboxes while eliminating their dangerous side effects.

The DOA architecture is based on two main ideas: 1) All entities have a globally unique identifier in a flat namespace, and packets carry these identifiers. 2) DOA allows senders and receivers to express that one or more intermediaries should process packets en route to a destination.

Each host has an unambiguous endpoint identifier (EID) picked from a large namespace. This identifier is required to be independent of network topology, and it can also carry cryptographic meaning. To communicate with an end host, a prospective peer uses the mapping service to resolve the EID to an IP address.

Thus an EID can also resolve to another EID, allowing an end host to map its EID to a delegate’s identity. DOA obeys the Internet’s two tenets, but it satisfies them through an explicit delegation mechanism: no middlebox implicitly interferes with other Internet elements’ information.

Tuesday, November 4, 2008

Reading for Lecture Nov 4

Development of the Domain Name System

The large distribution cost of the HOSTS.TXT system prompted the genesis of DNS. The design assumptions for DNS were: 1) Provide at least all the same information as HOSTS.TXT. 2) Allow the database to be maintained in a distributed manner. 3) Have no obvious size limits for names, name components, or the data associated with a name. 4) Interoperate across the DARPA Internet and in as many other environments as possible. 5) Provide tolerable performance.

DNS has two major types of active components: name servers and resolvers. Name servers are repositories of information, answering queries using whatever information they possess. Resolvers interface to client programs.

The DNS internal name space is a variable-depth tree where each node has an associated label. The domain name of a node is the concatenation of all labels on the path from the node to the root of the tree. Name space searching is done in a case-insensitive manner.

An organization can get control of a zone of the name space by persuading a parent organization to delegate a subzone consisting of a single node. Caching for later queries is controlled by the TTL field attached to each RR.
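The TTL mechanism can be sketched as a small cache that serves a record until its TTL expires and then forces a fresh query:

```python
import time

class RRCache:
    """Sketch of TTL-controlled caching of resource records."""
    def __init__(self):
        self.entries = {}  # name -> (record, expiry time)

    def put(self, name, record, ttl, now=None):
        now = time.time() if now is None else now
        self.entries[name] = (record, now + ttl)

    def get(self, name, now=None):
        """Return the cached record, or None on a miss or after the
        TTL has expired, in which case a name server must be queried."""
        now = time.time() if now is None else now
        entry = self.entries.get(name)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]
```

The zone owner picks the TTL, trading cache effectiveness against how quickly changes to the zone propagate.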

The DNS search algorithm is a downward search from domains that the resolver can already access. Resolvers are typically configured with hints pointing at servers for the root node and for the top of the local domain.

DNS Performance and the Effectiveness of Caching

The two factors widely believed to contribute most to the scalability of DNS are its hierarchical design and the aggressive use of caching. Both seek to reduce the load on the root servers at the top of the name space hierarchy, while successful caching helps limit client-perceived delays and wide-area network usage.

The major questions are: 1) What performance, in terms of latency and failures, do DNS clients experience? 2) How does varying the TTL and the degree of cache sharing impact caching effectiveness? These questions are answered by analyzing traces of TCP traffic along with the related DNS traffic in three network traces.

One surprising result is that over a third of all lookups are not successfully answered. DNS servers also appear to retransmit overly aggressively. DNS usage patterns and performance are also observed to change over time. The captured TCP traffic enables trace-driven simulations that investigate two important factors affecting caching effectiveness.

The root of the hierarchy is centrally administered and served from thirteen root servers. Sub-domains are delegated to other servers that are authoritative for their portion of the name space. To perform the analysis, both DNS traffic and its driving workload are collected.

Thursday, October 30, 2008

Reading for Lecture Oct 30

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

Chord provides support for just one operation: given a key, it maps the key onto a node. Chord uses a variant of consistent hashing to assign keys to Chord nodes. This tends to balance load, since each node receives roughly the same number of keys. The major features of Chord are simplicity, provable correctness, and provable performance.

Each Chord node needs routing information about O(log N) other nodes, and a node resolves the hash function by communicating with other nodes. It resolves all lookups via O(log N) messages to other nodes. Chord maintains its routing information as nodes join and leave the system; with high probability, each such event results in no more than O(log^2 N) messages.

The consistent hash function assigns each node and each key an m-bit identifier using a base hash function such as SHA-1. A node’s identifier is chosen by hashing the node’s IP address, while a key identifier is produced by hashing the key. Identifiers are ordered on an identifier circle modulo 2^m. Key k is assigned to the first node whose identifier is equal to or follows k in the identifier space. If identifiers are represented as a circle of numbers from 0 to 2^m – 1, then successor(k) is the first node clockwise from k.
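A small sketch of the successor rule on a tiny circle; m is deliberately small here, whereas Chord uses a much larger identifier space:

```python
import hashlib

M = 8  # identifier bits; tiny for illustration

def ident(key: str) -> int:
    """m-bit identifier via a base hash function (SHA-1 here)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big") % (2 ** M)

def successor(k: int, node_ids) -> int:
    """First node whose identifier equals or follows k clockwise."""
    for n in sorted(node_ids):
        if n >= k:
            return n
    return min(node_ids)  # wrapped past 2^M - 1 back to the start
```

The wrap-around case is what makes the identifier space a circle rather than a line: a key greater than every node ID belongs to the lowest-numbered node.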

Each node n maintains a routing table with at most m entries, called the finger table. The i-th entry in the table at node n contains the identity of node s = successor(n + 2^(i-1)), where 1 <= i <= m. Thus each node stores information about only a small number of nodes and knows more about nodes closely following it on the identifier circle than about nodes farther away. When a node n does not know the successor of a key k, it can find a node whose ID is closer than its own to k; that node will know more about the identifier circle in the region of k than n does.
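A sketch of building the finger table from the successor rule described above (node IDs and m here are illustrative):

```python
def successor(k, node_ids):
    """First node whose ID equals or follows k on the circle."""
    ids = sorted(node_ids)
    return next((n for n in ids if n >= k), ids[0])

def finger_table(n, node_ids, m):
    """finger[i] = successor((n + 2^(i-1)) mod 2^m) for 1 <= i <= m.
    The exponentially spaced targets are why a node knows much more
    about the region just after it than about distant regions."""
    return [successor((n + 2 ** (i - 1)) % (2 ** m), node_ids)
            for i in range(1, m + 1)]
```

Each lookup hop can jump at least halfway around the remaining distance to the key, giving the O(log N) lookup bound.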

I do not really understand the “with high probability” phrase.

Looking Up Data in P2P Systems

This paper describes data lookup mechanisms in P2P systems. Such a mechanism needs to be scalable and distributed.

In tree-like routing, each node records, for each prefix, the location of some node with that prefix. Pastry, an instance of this approach, gives each node a randomly chosen ID indicating its position on an identifier circle. It routes messages with a key to the live node with a node ID numerically closest to the key.

Each node n maintains a leaf set L: the |L|/2 nodes closest to n with larger IDs and the |L|/2 nodes closest to n with smaller IDs. This leaf set is the only requirement for correctness; forwarding is always correct unless |L|/2 nodes with adjacent IDs fail simultaneously.

When the sought key falls within node n’s leaf set, the query is forwarded directly to the responsible node. Otherwise, n tries to forward the query to a node whose ID shares a longer prefix with the key than n’s own ID does. This continues until the query reaches the destination.
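A simplified sketch of the prefix-routing step, with IDs written as same-length digit strings; real Pastry also falls back to a numerically closer node when no longer prefix match exists, which is elided here:

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two same-length ID strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(my_id: str, key: str, known_ids):
    """Forward toward a known node whose ID shares a strictly longer
    prefix with the key than our own ID does; None means no such node
    is known and (in this sketch) the query stops here."""
    best, best_len = None, shared_prefix_len(my_id, key)
    for nid in known_ids:
        l = shared_prefix_len(nid, key)
        if l > best_len:
            best, best_len = nid, l
    return best
```

Since each hop lengthens the matched prefix by at least one digit, a query takes at most as many hops as there are digits in an ID, i.e. O(log N).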