Tuesday, November 25, 2008

Reading for Lecture Nov 25

A Policy-aware Switching Layer for Data Centers

In today’s data centers, a variety of middleboxes are deployed to protect, manage and improve the performance of the applications and services they run. Administrators often have to overload layer-2 path selection mechanisms to coerce traffic through the desired sequence of middleboxes placed on the network path. This process is inflexible and prone to misconfiguration.

PLayer is a policy-aware switching layer consisting of inter-connected policy-aware switches, or pswitches. Based on policies specified by administrators at a centralized controller, pswitches explicitly forward different types of traffic through different sequences of middleboxes.

A pswitch consists of two independent parts: the switch core and the policy core. The switch core provides the forwarding functionality of a standard Ethernet switch. The policy core redirects frames to the middleboxes dictated by policy. It redirects a frame by encapsulating it in a new Ethernet-II frame, identified by a new EtherType code.

Only the switches into which external networks requiring middlebox traversal guarantees are plugged need to be converted to pswitches. Other switches need not be converted if they can be configured or modified to treat encapsulated frames with the new EtherType as regular Ethernet frames.
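
To make the policy core concrete, here is a minimal sketch (not the paper's implementation) of how a centralized policy might map a traffic class to a middlebox sequence; the traffic classes and middlebox names are hypothetical:

    # Hypothetical policy table: (source zone, traffic type) -> the ordered
    # middlebox sequence the frames must traverse.
    POLICIES = {
        ("external", "web"):  ["firewall", "load_balancer"],
        ("external", "mail"): ["firewall", "spam_filter"],
        ("internal", "web"):  ["load_balancer"],
    }

    def next_hop(src_zone, traffic_type, hops_done):
        """Return the next middlebox a frame must visit, or None once the
        policy-mandated sequence is complete."""
        sequence = POLICIES.get((src_zone, traffic_type), [])
        return sequence[hops_done] if hops_done < len(sequence) else None

    # A pswitch would encapsulate the frame toward the returned middlebox,
    # and forward it normally once next_hop() returns None.
    print(next_hop("external", "web", 0))  # -> "firewall"
    print(next_hop("external", "web", 2))  # -> None: deliver to the server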

Improving MapReduce Performance In Heterogeneous Environments

Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly.

Hadoop uses these assumptions to decide when to run a speculative copy of a poorly performing task on another machine so that the task can finish faster. Hadoop's scheduler starts a speculative task based on a simple heuristic comparing each task's progress to the average progress. This heuristic works well in homogeneous environments but can lead to severe performance degradation in heterogeneous ones.

The LATE scheduler estimates the progress rate of each task as ProgressRate = ProgressScore/T, where T is the amount of time the task has been running. The finish time estimate is then (1 − ProgressScore)/ProgressRate. This assumes that tasks make progress at a roughly constant rate.
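
A minimal sketch of this estimate (the task list below is made up for illustration):

    def time_to_completion(progress_score, elapsed_seconds):
        """LATE's estimate: the progress rate is assumed roughly constant."""
        progress_rate = progress_score / elapsed_seconds
        return (1.0 - progress_score) / progress_rate

    # Speculate on the task estimated to finish farthest in the future.
    tasks = {"t1": (0.9, 100), "t2": (0.2, 100), "t3": (0.5, 50)}
    candidate = max(tasks, key=lambda t: time_to_completion(*tasks[t]))
    print(candidate)  # -> "t2": rate 0.002/s, about 400 s remaining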

The advantage of LATE is that it is robust to node heterogeneity, because it relaunches only the slowest tasks, and only a small number of them. LATE prioritizes among the slowest tasks based on how much they hurt job response time. It also caps the number of speculative tasks to limit contention for shared resources.

Thursday, November 20, 2008

Reading for Lecture Nov 20

A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

In IP multicast, data sources simply send to the group's multicast address without any advance knowledge of the group's membership. To receive data, receivers simply announce that they are interested. Each receiver joins and leaves the group individually, without affecting the data transmission to any other member.

SRM maximizes information and data sharing among all the members, and strengthens the individuality of membership by making each member responsible for its own correct reception of all the data. The SRM design follows the core design principles of TCP/IP: it requires only the basic IP delivery model, best effort with possible duplication and reordering of packets, and builds reliability on an end-to-end basis. Furthermore, SRM algorithms dynamically adjust their control parameters based on the observed performance within a session.

A TCP-style approach to multicast has a number of problems: if all receivers send ACKs, the sender is overwhelmed by these messages. Also, how does a sender continuously track the set of active receivers and the reception rate of each? Round-trip time is also very difficult to estimate, because propagation times differ across receivers.

In the wb framework, each member is identified by a globally unique Source-ID, and each page is identified by the Source-ID of its initiator and a page number local to that initiator. The wb multicast design assumes: 1) All data has a unique name. 2) A name always refers to the same data. 3) Source-IDs are persistent. 4) IP datagram delivery is available. 5) All participants join the same multicast group; there is no distinction between senders and receivers.
When a receiver detects missing data, it schedules a repair request at a random time; the randomized delay lets a member suppress its own request when it hears another member ask for the same repair.
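
A minimal sketch of that suppression timer, with made-up delay bounds (SRM actually scales the delay by the estimated distance to the data's source; a uniform delay is a simplification):

    import random

    class RepairRequester:
        """On a detected loss, wait a random delay before multicasting a
        repair request; cancel if another member requests it first."""
        def __init__(self, min_delay=0.5, max_delay=2.0):
            self.min_delay, self.max_delay = min_delay, max_delay
            self.pending = {}  # data name -> time the request fires

        def on_loss_detected(self, name, now):
            delay = random.uniform(self.min_delay, self.max_delay)
            self.pending.setdefault(name, now + delay)

        def on_request_heard(self, name):
            # Another member asked first: suppress our duplicate request.
            self.pending.pop(name, None)

        def fire_due(self, now):
            for name in [n for n, t in self.pending.items() if t <= now]:
                del self.pending[name]
                print("multicast REPAIR-REQUEST for", name)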

Scalable Application Layer Multicast

This system is based on a hierarchical clustering of the application-layer multicast peers and can support a number of different data delivery trees with desirable properties. The application-layer multicast protocols can be evaluated along three dimensions: 1) Quality of data delivery path. 2) Robustness of the overlay. 3) Control overhead.

The project aims to develop an efficient, scalable, distributed tree-building protocol that does not require any underlying topology information. It reduces the worst-case state and control overhead at any member to O(log N). An average member maintains state for a constant number of other members, and incurs constant control overhead for topology creation and maintenance.

Hosts are distributed in layers with the following properties: 1) A host belongs to only a single cluster at any layer. 2) If a host is present in some cluster in layer L_i, it must occur in one cluster in each of the layers L_0, …, L_{i-1}; in fact, it is the cluster leader in each of these lower layers. 3) If a host is not present in layer L_i, it cannot be present in any layer L_j with j > i. 4) Each cluster has its size bounded between k and 3k−1. 5) There are at most log_k N layers, and the highest layer has only a single member.
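
A minimal sketch that checks properties 1, 2 and 4 on a hypothetical layered clustering (the layout is illustrative, not the protocol itself; the top layer is exempted from the size bound):

    def check_layers(layers, k):
        """layers[i] is a list of clusters (sets of hosts) at layer L_i."""
        for i, clusters in enumerate(layers):
            seen = set()
            for cluster in clusters:
                # Property 4: size between k and 3k-1 (top layer exempt).
                assert k <= len(cluster) <= 3 * k - 1 or i == len(layers) - 1
                # Property 1: a host is in at most one cluster per layer.
                assert not (cluster & seen)
                seen |= cluster
            # Property 2: every host at layer i also appears at layer i-1.
            if i > 0:
                below = set().union(*layers[i - 1])
                assert seen <= below

    layers = [[{"a", "b", "c"}, {"d", "e", "f"}],  # L_0
              [{"a", "d"}],                        # L_1: the two leaders
              [{"a"}]]                             # L_2: single member
    check_layers(layers, k=2)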

Tuesday, November 18, 2008

Reading for Lecture Nov 18

An Architecture for Internet Data Transfer

This architecture separates content negotiation from data transfer. Applications determine what data they need to send and then use the new transfer service to send it. Historically, data transfer has been tightly linked with content negotiation, both because TCP and the socket API provide a good-enough mechanism and because naming objects is challenging.

The benefit of this separation is that developers can reuse transfer techniques instead of re-implementing them. Furthermore, applications that use the transfer service can immediately begin using new transfer techniques without modification. The three main challenges of moving data transfer into a new service are: 1) Providing a convenient and standard API for data transfer applications. 2) Allowing easy development and deployment of new transfer mechanisms. 3) Supporting applications with diverse negotiation and naming conventions.

The DOT service provides three properties: 1) Surviving transient disconnection and mobility. 2) Transferring via portable storage. 3) Using caching and multiple sources. The GTC splits objects into a series of chunks. Each chunk is named with a descriptor containing the hash of the chunk, its length and its offset into the object. In DOT, hints are used to inform the receiver's GTC of possible data locations.
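
A minimal sketch of such content-named chunk descriptors, assuming fixed-size chunks and SHA-1 naming (the chunk size here is arbitrary):

    import hashlib

    CHUNK_SIZE = 16 * 1024  # arbitrary fixed chunk size, for illustration

    def describe(data: bytes):
        """Split an object into chunks, each named by (hash, length, offset)."""
        descriptors = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            descriptors.append({
                "hash": hashlib.sha1(chunk).hexdigest(),
                "length": len(chunk),
                "offset": offset,
            })
        return descriptors

    # A receiver can then fetch chunks from any source (cache, peer,
    # portable storage) and verify each against its hash before reassembly.
    print(describe(b"x" * 40000))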

A Delay-Tolerant Network Architecture for Challenged Internets

Today's Internet may operate poorly in environments characterized by very long delay paths and frequent network partitions. Many networks in mobile and extreme environments lacking continuous connectivity have their own specialized protocols and do not use IP.

The DTN architecture is proposed to provide interoperability between them. It operates as an overlay above the transport layers of the networks it interconnects. DTN is based on an abstraction of message switching, and it provides a flexible naming scheme in the form of late binding.

Challenged internetworks are characterized by latency, bandwidth limitations, error probability, node longevity, or path stability that are substantially worse than in today's Internet.

Thursday, November 13, 2008

Reading for Lecture Nov 13

X-trace: A Pervasive Network Tracing Framework

When complex systems misbehave, it is quite difficult to diagnose the source of the problem. X-trace is a framework that provides a comprehensive view of the systems that adopt it. A user or operator invokes X-trace when initiating an application task by inserting X-trace metadata with a task identifier. X-trace is comprehensive in the sense that it tags all network operations resulting from a particular task with the same task identifier; this set of operations is called a task tree.

Diagnosing problems often requires tracing a task across different administrative domains (ADs). Since ADs may not want to reveal information to each other, X-trace provides a clean separation between the client that invokes X-trace and the recipient of the trace information.
The design principles of X-trace are: 1) The trace request should be sent in-band, rather than in a separate probe message. 2) The collected trace data should be sent out-of-band, decoupled from the original datapath. 3) The entity that requests tracing is decoupled from the entity that receives the trace report.

The task identifier within the X-trace metadata uniquely identifies each network task. The pushDown() and pushNext() operations propagate X-trace metadata along the datapath, and the reports emitted by each X-trace-aware node are used to reconstruct the datapath.
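
A minimal sketch of how the metadata might propagate (the field layout is simplified; real X-trace metadata also carries tree-edge information):

    import itertools

    _ids = itertools.count(1)

    class XTraceMetadata:
        def __init__(self, task_id, op_id):
            self.task_id = task_id  # constant across the whole task tree
            self.op_id = op_id      # identifies this node's operation

        def push_next(self):
            """Propagate to the next hop in the same layer."""
            return XTraceMetadata(self.task_id, next(_ids))

        def push_down(self):
            """Propagate to the layer below (e.g. HTTP -> TCP)."""
            return XTraceMetadata(self.task_id, next(_ids))

    def report(node, md):
        # Sent out-of-band to a report collector, keyed by task_id.
        print(f"report: node={node} task={md.task_id} op={md.op_id}")

    md = XTraceMetadata(task_id=42, op_id=0)
    report("http_proxy", md)
    report("tcp_layer", md.push_down())    # same task, lower layer
    report("next_router", md.push_next())  # same task, next hop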

End-to-End Internet Packet Dynamics

Each 100 KByte transfer is traced at both the sender and the receiver in order to distinguish between end-to-end behaviors in the two directions of an Internet path, which often exhibit asymmetries. The paper investigates various issues such as out-of-order packet delivery, packet corruption, bottleneck bandwidth and patterns of packet loss.

Two measurement runs are used to see how packet dynamics changed over the course of 1995. Many different behaviors are observed, discussed and inferred, but it is not clear to me what the main points are.

Thursday, November 6, 2008

Reading for Lecture Nov 6

Internet Indirection Infrastructure

The i3 architecture aims to ease the process of providing services like multicast, anycast and mobility. It offers a rendezvous-based communication abstraction. i3's level of indirection, which decouples the act of sending a packet from the act of receiving it, allows i3 to efficiently support a wide variety of fundamental communication services.

In i3, sources send packets to a logical identifier, and receivers express interest in packets sent to an identifier. Delivery is still best-effort, like today's Internet. i3's join operation is the insertion of a trigger. This operation is more flexible than IP multicast, since it allows receivers to control the routing of packets.

Each packet in i3 is a pair of (id, data). Receivers use triggers to express their interest in packets. A trigger is of the form (id, addr) indicating that all packets with an identifier id should be forwarded by i3 to the node identified by addr.

Each identifier is mapped to a unique i3 node. When a trigger is inserted, it is stored at that node. When a packet is sent to an id, it is routed to the node responsible for that id, where it is matched against triggers and forwarded to all the hosts interested in that packet.
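
A minimal sketch of the rendezvous abstraction on a single i3 node, ignoring the overlay routing that locates the responsible node:

    from collections import defaultdict

    class I3Node:
        """One i3 node: stores triggers and matches packets against them."""
        def __init__(self):
            self.triggers = defaultdict(set)  # id -> set of receiver addrs

        def insert_trigger(self, ident, addr):
            self.triggers[ident].add(addr)    # receiver expresses interest

        def remove_trigger(self, ident, addr):
            self.triggers[ident].discard(addr)

        def send(self, ident, data):
            # Best-effort: deliver to every matching trigger, none if empty.
            for addr in self.triggers[ident]:
                print(f"forward {data!r} to {addr}")

    node = I3Node()
    node.insert_trigger("id1", "10.0.0.5")  # multicast falls out naturally:
    node.insert_trigger("id1", "10.0.0.9")  # two triggers on the same id
    node.send("id1", b"hello")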

Middleboxes No Longer Considered Harmful

Middleboxes like NATs, firewalls, and transparent caches have become an important part of today's Internet. The DOA architecture aims to facilitate the deployment of middleboxes while eliminating their dangerous side effects.

The DOA architecture is based on two main ideas: 1) All entities have a globally unique identifier in a flat namespace, and packets carry these identifiers. 2) DOA allows senders and receivers to express that one or more intermediaries should process packets en route to a destination.

Each host has an unambiguous endpoint identifier (EID) picked from a large namespace. This identifier is required to be independent of network topology, and it can also carry cryptographic meaning. To communicate with an end host, a prospective peer uses the mapping service to resolve the EID to an IP address.

An EID can also resolve to another EID, allowing an end host to map its EID to a delegate's identity. DOA thus obeys the two Internet tenets, but it satisfies them through an explicit delegation mechanism: no middlebox implicitly interferes with other Internet elements' information.
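
A minimal sketch of such delegation-aware resolution, assuming a simple key-value mapping service (the names and addresses are invented):

    # Hypothetical mapping service: EID -> either an IP address or
    # another EID (a delegate, e.g. a firewall processing our traffic).
    MAPPING = {
        "eid:host-a": "eid:firewall-1",  # host-a delegates to a firewall
        "eid:firewall-1": "192.0.2.10",  # the firewall's actual address
        "eid:host-b": "198.51.100.7",    # host-b has no delegate
    }

    def resolve(eid, max_hops=8):
        """Follow EID -> EID delegation until an IP address is reached."""
        for _ in range(max_hops):
            target = MAPPING[eid]
            if not target.startswith("eid:"):
                return target            # reached an IP address
            eid = target                 # explicit delegation: keep going
        raise RuntimeError("delegation chain too long")

    print(resolve("eid:host-a"))  # -> 192.0.2.10 (via the firewall)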

Tuesday, November 4, 2008

Reading for Lecture Nov 4

Development of the Domain Name System

The large distribution cost of the HOSTS.TXT system prompted the genesis of DNS. The design assumptions for DNS were: 1) Provide at least all the same information as HOSTS.TXT. 2) Allow the database to be maintained in a distributed manner. 3) Have no obvious size limits for names, name components, or the data associated with a name. 4) Interoperate across the DARPA Internet and in as many other environments as possible. 5) Provide tolerable performance.

DNS has two major types of active components: name servers and resolvers. Name servers are repositories of information that answer queries using whatever information they possess. Resolvers interface to client programs.

The DNS internal name space is a variable-depth tree where each node in the tree has an associated label. The domain name of a node is the concatenation of all labels on the path from the node to the root of the tree. The name space searching operation is done in a case-insensitive manner.

An organization can get control of a zone of the name space by persuading a parent organization to delegate a subzone consisting of a single node. A caching mechanism for later queries is controlled by the TTL field attached to each RR.

The DNS search algorithm is a downward search from domains that the resolver can access already. Resolvers are typically configured with hints pointing at servers for the root node and for the top of the local domain.
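
A minimal sketch of that downward search, with a made-up referral table standing in for the real server infrastructure:

    # Hypothetical referral data: each server either answers a name
    # authoritatively or refers the resolver one level down the tree.
    ZONES = {
        ".":               {"edu.": "ns.edu"},
        "ns.edu":          {"berkeley.edu.": "ns.berkeley.edu"},
        "ns.berkeley.edu": {"www.berkeley.edu.": "198.51.100.42"},
    }

    def resolve(name, server="."):
        """Walk down from a hinted root server, following referrals."""
        while True:
            answers = ZONES[server]
            for suffix, result in answers.items():
                if name.endswith(suffix):
                    if result in ZONES:      # a referral to a lower server
                        server = result
                        break
                    return result            # an authoritative answer
            else:
                raise KeyError(name)

    print(resolve("www.berkeley.edu."))  # -> 198.51.100.42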

DNS Performance and the Effectiveness of Caching

The two factors widely believed to contribute most to the scalability of DNS are its hierarchical design and its aggressive use of caching. Both factors seek to reduce the load on the root servers at the top of the name space hierarchy, while successful caching also helps limit client-perceived delays and wide-area network usage.

The major questions are: 1) What performance, in terms of latency and failures, do DNS clients experience? 2) How do varying the TTL and the degree of cache sharing impact caching effectiveness? These questions are answered by analyzing the TCP traffic, along with the related DNS traffic, in three network traces.

One surprising result is that over a third of all lookups are not successfully answered. DNS servers also appear to retransmit overly aggressively. DNS usage patterns and performance are also observed to change over time. Capturing the TCP traffic makes it possible to run trace-driven simulations investigating two important factors that affect caching effectiveness.

The root of the hierarchy is centrally administered and served from thirteen root servers. Sub-domains are delegated to other servers that are authoritative for their portion of the name space. To perform the analysis, both the DNS traffic and its driving workload are collected.

Thursday, October 30, 2008

Reading for Lecture Oct 30

Chord: A Scalable Peer-to-peer Lookup Service for Internet Application

Chord provides support for just one operation: given a key, it maps the key onto a node. Chord uses a variant of consistent hashing to assign keys to Chord nodes. This tends to balance load, since each node receives roughly the same number of keys. The major features of Chord are simplicity, provable correctness and provable performance.

A Chord node needs routing information about only O(log N) other nodes, and a node resolves the hash function by communicating with other nodes. It resolves all lookups via O(log N) messages. Chord maintains its routing information as nodes join and leave the system; with high probability, each such event results in no more than O(log^2 N) messages.

The consistent hash function assigns each node and key an m-bit identifier using a base hash function such as SHA-1. A node's identifier is chosen by hashing the node's IP address, while a key identifier is produced by hashing the key. Identifiers are ordered on an identifier circle modulo 2^m. Key k is assigned to the first node whose identifier is equal to or follows k in the identifier space. If identifiers are represented as a circle of numbers from 0 to 2^m − 1, then successor(k) is the first node clockwise from k.

Each node n maintains a routing table with at most m entries, called the finger table. The i-th entry in the table at node n contains the identity of the node s = successor(n + 2^(i−1)), where 1 ≤ i ≤ m. Thus each node stores information about only a small number of nodes, and knows more about nodes closely following it on the identifier circle than about nodes farther away. When a node n does not know the successor of a key k, it can find a node whose ID is closer than its own to k; that node will know more about the identifier circle in the region of k than n does.
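
A minimal, non-distributed sketch of the finger table over a sorted list of node IDs (real Chord runs the lookup as RPCs between nodes; the identifier size here is tiny):

    import bisect, hashlib

    M = 16                       # identifier bits, small for illustration
    RING = 1 << M

    def ident(text):
        return int(hashlib.sha1(text.encode()).hexdigest(), 16) % RING

    def successor(nodes, k):
        """First node ID clockwise from k on the circle (nodes sorted)."""
        i = bisect.bisect_left(nodes, k)
        return nodes[i % len(nodes)]

    def finger_table(nodes, n):
        """Entry i points at successor(n + 2^(i-1)) for 1 <= i <= M."""
        return [successor(nodes, (n + (1 << (i - 1))) % RING)
                for i in range(1, M + 1)]

    nodes = sorted(ident(f"node{j}") for j in range(8))
    print(finger_table(nodes, nodes[0]))  # mostly nearby IDs, a few far ones
    print(successor(nodes, ident("some-key")))  # the node owning the key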

I do not really understand the “with high probability” phrase.

Looking Up Data in P2P Systems

This paper describes data lookup mechanisms in P2P systems. Such a mechanism needs to be scalable and fully distributed.

In tree-like routing, each node records, for each prefix, the location of some node with that prefix. Pastry, an instance of the algorithm, gives each node a randomly chosen ID indicating its position on an identifier circle. It routes messages with a key to the live node with a node ID numerically closest to the key.

Each node n maintains a leaf set L: the |L|/2 nodes closest to n with larger IDs and the |L|/2 nodes closest to n with smaller IDs. Maintaining this leaf set is the only requirement for correctness; forwarding is always correct unless |L|/2 nodes with adjacent IDs fail simultaneously.

When the sought key falls within a node's leaf set, the query is forwarded directly to the responsible node. Otherwise, the node forwards the query to a node from its routing table whose ID shares a longer prefix with the key. This continues until the query reaches the destination.
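
A minimal sketch of prefix-based forwarding with hexadecimal IDs (the node set is invented, and real Pastry also applies a numerically-closest rule when no longer prefix exists):

    def shared_prefix_len(a, b):
        n = 0
        while n < len(a) and a[n] == b[n]:
            n += 1
        return n

    def next_hop(my_id, key, known_nodes):
        """Forward to a known node sharing a strictly longer prefix with
        the key than we do; None means we are the best node we know."""
        best, best_len = None, shared_prefix_len(my_id, key)
        for node in known_nodes:
            p = shared_prefix_len(node, key)
            if p > best_len:
                best, best_len = node, p
        return best

    known = ["a1f0", "37bc", "3f21", "3fa9"]
    print(next_hop("a9e2", "3fa4", known))  # -> "3fa9" (3-digit prefix)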

Tuesday, October 28, 2008

Reading for Lecture Oct 28

Resilient Overlay Networks
RON is an architecture that allows distributed Internet applications to detect and recover from path outages and periods of degraded performance within several seconds. It is an application-layer overlay on top of the existing Internet routing substrate.

The main goal of RON is to enable a group of nodes to communicate with each other in the face of problems with the underlying Internet paths connecting them. It aggressively probes and monitors the paths connecting its nodes to detect link failures and find the best paths. The information exchanged between nodes consists of path metrics including latency, packet loss rate, and available throughput.

RON takes advantage of underlying Internet path redundancy on time scales of a few seconds, reacting responsively to path outages and performance failures. It is closely tied to the applications using it, so it readily integrates application-specific path metrics and path-selection policies. It demonstrates fast recovery from failure and improved latency and loss rates even over short time scales.

For each virtual link, the latency estimate is updated as an EWMA: lat ← α·lat + (1−α)·new_sample. The overall latency of a path is the sum of the individual virtual-link latencies: lat_path = Σ lat_link. The loss-rate metric is multiplicative along a path. The throughput score is sqrt(1.5)/(rtt·sqrt(p)), following the TCP throughput equation; the smoothing of the underlying estimates prevents large route oscillations from single packet losses.
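
A minimal sketch of those estimators (α is a made-up smoothing weight; the paper tunes its own parameters):

    import math

    ALPHA = 0.9  # illustrative smoothing weight for the EWMAs

    class LinkEstimator:
        def __init__(self):
            self.latency = None   # seconds
            self.loss = 0.0       # smoothed loss probability

        def on_sample(self, latency_sample, lost):
            if self.latency is None:
                self.latency = latency_sample
            else:
                self.latency = ALPHA * self.latency + (1 - ALPHA) * latency_sample
            self.loss = ALPHA * self.loss + (1 - ALPHA) * (1.0 if lost else 0.0)

    def path_metrics(links):
        """Latency adds along a path; delivery probability multiplies."""
        rtt = sum(l.latency for l in links)
        p = 1.0 - math.prod(1.0 - l.loss for l in links)
        return rtt, p

    def throughput_score(rtt, p):
        # TCP-equation-style score: higher is better; guard p == 0.
        return math.sqrt(1.5) / (rtt * math.sqrt(max(p, 1e-6)))

    links = [LinkEstimator(), LinkEstimator()]
    for l in links:
        l.on_sample(0.020, lost=False)
        l.on_sample(0.030, lost=True)
    print(throughput_score(*path_metrics(links)))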

Active network vision and reality lessons from a capsule-based system


Active networks aim to speed up the deployment of new services by enabling programmability at routers in the network. Some active networks provide extensibility on individual devices; these systems are well suited to imposing policy or functionality at particular network locations, such as boundary devices. Another style of system provides programmability across multiple nodes for control tasks rather than for new data transfer services.

ANTS does not a priori restrict who can program which nodes. It aims to allow each user to construct new data transfer services across the wide area. ANTS is based on an aggressive "capsule" approach in which code is associated with packets and runs at selected, extensible IP routers.

In ANTS, applications obtain customized network services by sending and receiving special packets called capsules via programmable routers referred to as active nodes. Each capsule carries a type field that determines the service, and hence which code to run. If a node does not have that code, it requests it from the previous node. ANTS places a limit of 16KB on the code for each service to bound the impact of code transfers on the network.
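
A minimal sketch of the code-on-demand step, with an invented cache and capsule layout (real ANTS capsules carry a hash of the code group, not a name):

    class ActiveNode:
        """An active node caches service code by capsule type and fetches
        missing code from the previous hop on demand."""
        MAX_CODE = 16 * 1024  # ANTS's 16KB per-service code limit

        def __init__(self):
            self.code_cache = {}  # type id -> callable forwarding routine

        def receive(self, capsule, prev_node):
            code = self.code_cache.get(capsule["type"])
            if code is None:
                # Fetch the service code from the previous hop.
                code = prev_node.code_cache[capsule["type"]]
                assert len(capsule.get("code_bytes", b"")) <= self.MAX_CODE
                self.code_cache[capsule["type"]] = code
            code(self, capsule)   # run the service's routine at this node

    def flood_service(node, capsule):
        print("custom forwarding for", capsule["payload"])

    a, b = ActiveNode(), ActiveNode()
    a.code_cache["flood"] = flood_service
    b.receive({"type": "flood", "payload": b"hi"}, prev_node=a)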

Does processing capsules affect the performance at each node? With 16KB of code, what services can be supported? Do the functionalities also depend on the infrastructure at each node? Do active networks violate the end-to-end argument?

Tuesday, October 21, 2008

On Power-Law Relationships of the Internet Topology

The power laws for the topology of the Internet are derived from snapshots of the topology collected during 1998. A power law is an expression of the form y ∝ x^a, where a is a constant, x and y are the measures of interest, and ∝ stands for "proportional to".

The sample space during that time is rather limited, so even though the size of the network increased substantially (45%), any generalization could be premature. However, power laws may describe the Internet's growth the same way that power laws describe various natural networks, such as the human respiratory system and automobile networks.

The topology of the Internet can be examined at two different granularities: the router level and the inter-domain level. The metrics used to describe the graphs are mainly node outdegree and the distances between nodes. Power laws had previously been used to describe traffic in communication networks, but not their topology.

The first power law is called the rank exponent: the outdegree of a node is proportional to its rank to the power of a constant. The second power law, the outdegree exponent, states that the frequency of an outdegree is proportional to the outdegree to the power of a constant. The third power law, the eigen exponent, states that the eigenvalues of the graph are proportional to their order to the power of a constant.
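
A minimal sketch of how such an exponent can be estimated: fit a line to log(outdegree) versus log(rank) by least squares (the degree list is fabricated to follow a known exponent):

    import math

    def rank_exponent(outdegrees):
        """Slope of log(outdegree) vs log(rank) via a least-squares fit."""
        degs = sorted(outdegrees, reverse=True)
        xs = [math.log(rank) for rank in range(1, len(degs) + 1)]
        ys = [math.log(d) for d in degs]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = sum((x - mx) ** 2 for x in xs)
        return num / den

    # Fabricated degrees following d(rank) ~ rank^-0.8: slope comes
    # out near -0.8.
    degrees = [round(100 * r ** -0.8) or 1 for r in range(1, 200)]
    print(rank_exponent(degrees))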

These power laws can be useful for predicting the future Internet based on its similarity to other networks.

A First-Principles Approach to Understanding the Internet’s Router-level Topology

Degree distributions and other large-scale statistics are popular metrics for evaluating how representative a given topology is, and proposed topology generators are often evaluated on whether they can reproduce the same types of macroscopic statistics, especially power-law degree distributions.

However, this perspective is incomplete and in need of correction. There exist many different graphs with the same node-degree distribution, some of which may be considered opposites from the viewpoint of network engineering. The paper instead takes a first-principles approach to understanding the Internet topology at the router level.

Each of the topology generators uses one of three probabilistic generation methods: 1) Preferential attachment, in which the graph grows and each newly added node connects preferentially to existing nodes, being more likely to connect to a node that already has many connections. 2) A general model of random graphs with a given expected degree sequence, where construction proceeds by first assigning each node its degree and then probabilistically inserting edges between nodes. 3) The power-law random graph, which forms a set L of nodes containing as many distinct copies of a given vertex as that vertex's degree, chooses a random matching of the elements of L, and maps the matching onto an appropriate graph.

The commonly used metrics are the node degree distribution; expansion, resilience and distortion; and hierarchy. Given that the purpose of building a network is to carry a projected overall traffic demand effectively, other performance-related metrics are also considered: throughput, router utilization, and the end-user bandwidth distribution. The likelihood metric is used to differentiate between graphs having the same vertex set and the same degree distribution.

Thursday, October 9, 2008

Reading for Lecture Oct 9

XORs in The Air: Practical Wireless Network Coding

The paper shows that intelligently mixing packets increases network throughput. The COPE architecture exploits the broadcast nature of the wireless channel and the theory of network coding. It inserts a coding shim between the IP and MAC layers which identifies coding opportunities and benefits from them by forwarding multiple packets in a single transmission.

The basic idea of COPE: when two nodes each send a packet to the other through a common router, the router XORs the two packets together and broadcasts the XORed packet to both nodes; each node XORs it with its own packet to recover the packet meant for it. The exchange thus needs only three transmissions instead of four. The COPE architecture incorporates three main techniques: 1) opportunistic listening; 2) opportunistic coding; and 3) learning neighbor state.
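
A minimal sketch of the three-transmission exchange (padding and header details omitted; COPE pads packets to equal sizes):

    def xor(a: bytes, b: bytes) -> bytes:
        """XOR two equal-length packets."""
        return bytes(x ^ y for x, y in zip(a, b))

    pkt_a = b"alice->bob:hello"   # transmission 1: Alice -> router
    pkt_b = b"bob->alice:howdy"   # transmission 2: Bob -> router

    coded = xor(pkt_a, pkt_b)     # transmission 3: router broadcasts once

    # Each endpoint XORs the coded packet with the packet it sent.
    assert xor(coded, pkt_a) == pkt_b  # Alice recovers Bob's packet
    assert xor(coded, pkt_b) == pkt_a  # Bob recovers Alice's packet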

Opportunistic listening is possible because wireless is a broadcast medium, so nodes equipped with omni-directional antennae have many opportunities to overhear packets. COPE nodes run in promiscuous mode, snoop on all communications over the wireless medium, and store overheard packets for a limited period of time. In addition, each node broadcasts reception reports telling its neighbors which packets it has overheard.

Opportunistic coding aims to maximize the number of native packets delivered in a single transmission, while ensuring that each intended next hop has enough information to decode its native packet.

Since reception reports can be lost during congestion, a node sometimes has to guess which packets its neighbors have. It leverages the ETX probes that are already broadcast to all nodes, estimating the probability that a particular neighbor has a packet as the delivery probability of the link between the packet's previous hop and that neighbor.

COPE actually has a beneficial side effect on the MAC: it reduces the mismatch between the edge nodes and the relaying routers, resulting in fewer dropped packets and thus better gain.
The paper does not address packet reordering, since the authors prefer to code together packets of the same size. When finding the maximum number of packets to encode, they do not account for wireless errors and implicitly assume the links are error-free. This is probably not true: when n is too large, the probability that at least one packet is corrupted becomes considerable, and if a node cannot receive one packet out of the n, it cannot decode its expected packet, so a retransmission is likely for each node.

ExOR: Opportunistic Multi-Hop Routing for Wireless Networks

ExOR chooses each hop of a packet's route after the transmission for that hop, so that the choice can reflect which intermediate nodes actually received the transmission. This deferred choice gives each transmission multiple opportunities to make progress by always choosing the best currently available route.

The basic idea of this routing is that when a source wants to send a packet, it broadcasts the packet and some subset of nodes receives it. All nodes run a protocol to discover and agree on which nodes are in that subset. The node in the subset closest to the destination broadcasts the packet next. Again, the nodes receiving the second transmission agree on the closest receiver, which then broadcasts the packet. This process continues until the destination has received the packet.

The agreement protocol must have low enough overhead that it does not overwhelm the potential throughput gain, and it must be robust in the face of packet loss, so that disagreement, and thus duplicate forwarding, is rare. ExOR also needs a metric reflecting the cost of moving a packet from any node to the destination, which it uses to choose the closest node.

ExOR must choose the most useful nodes as participants and avoid simultaneous transmissions by different nodes to minimize collisions. ExOR operates on batches of packets and includes a list of prioritized candidate forwarders in each packet. Receiving nodes buffer successfully received packets and wait until the end of the batch. The forwarders then broadcast the packets in their buffers in priority order. The batch map included in each packet indicates the highest-priority node known to have received a copy of that packet.
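
A minimal sketch of how a batch map might decide what a forwarder still needs to send (the priorities and reception sets are invented):

    def packets_to_forward(my_priority, my_received, batch_map, priority):
        """Send only packets we hold whose best-known holder has lower
        priority than us (lower number = higher priority, closer to dst)."""
        todo = []
        for pkt in sorted(my_received):
            best_holder = batch_map.get(pkt)
            if best_holder is None or priority[best_holder] > my_priority:
                todo.append(pkt)
        return todo

    priority = {"dst": 0, "relay1": 1, "relay2": 2, "src": 3}
    batch_map = {0: "dst", 1: "relay2", 2: "src"}  # pkt -> best holder seen
    print(packets_to_forward(1, {0, 1, 2}, batch_map, priority))
    # relay1 skips pkt 0 (dst already has it) and sends pkts 1 and 2.
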
Since ExOR operates on batches of packets, packets are likely to suffer large delays.

Tuesday, October 7, 2008

Reading for Lecture Oct 7

A High Throughput Path Metric for Multi-Hop Wireless Routing

The expected transmission count metric (ETX) minimizes the expected number of packet transmissions (including retransmissions) required to successfully deliver a packet to its ultimate destination, and thereby finds high-throughput paths in multi-hop wireless networks. ETX is evaluated as a metric for the DSDV and DSR protocols.

The metric commonly used by existing ad hoc routing protocols, minimum hop count, implicitly assumes that links either work well or don't work at all, which is true for wired networks but not for wireless ones. The 802.11b ACK mechanism resends lost packets locally, making even the worst 802.11b links appear loss-free; however, the retransmissions reduce path throughput and interfere with other traffic. ETX finds paths with the fewest expected transmissions.

The best path between each pair of nodes was found by sending data along the ten potentially best paths, one at a time, and selecting the path with the highest throughput. An offline routing algorithm finds these candidate paths using measured per-link loss ratios as input, with a penalty reflecting the throughput reduction caused by interference between successive hops of a multi-hop path.

The ETX metric must account for the following issues: 1) the wide range of link loss ratios; 2) the existence of links with asymmetric loss ratios; 3) the interference between successive hops of multi-hop paths. The ETX of a link is calculated from the forward delivery ratio df, the probability that a data packet successfully arrives at the recipient, and the reverse delivery ratio dr, the probability that the ACK packet is successfully received. The probability that a transmission succeeds is thus df × dr, so ETX = 1/(df × dr).

These delivery ratios are measured using dedicated link probe packets. Each node broadcasts link probes of a fixed size at an average period τ, with jitter. Each node remembers the probes it receives during the last w seconds, allowing it to calculate the delivery ratio from a sender at any time t as r(t) = count(t − w, t)/(w/τ).
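
A minimal sketch of these calculations (the window and period values are arbitrary):

    def delivery_ratio(probe_times, t, w=10.0, tau=1.0):
        """r(t) = probes heard in (t - w, t] divided by the w/tau expected."""
        heard = sum(1 for ts in probe_times if t - w < ts <= t)
        return heard / (w / tau)

    def link_etx(df, dr):
        """Expected transmissions for one link: ETX = 1 / (df * dr)."""
        return float("inf") if df * dr == 0 else 1.0 / (df * dr)

    def path_etx(links):
        """The ETX of a route is the sum of its links' ETX values."""
        return sum(link_etx(df, dr) for df, dr in links)

    # Two decent hops can be cheaper than one very lossy hop:
    print(path_etx([(0.9, 0.9), (0.9, 0.9)]))  # ~2.47 expected transmissions
    print(path_etx([(0.4, 0.9)]))              # ~2.78 for the single bad link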

ETX only makes sense for networks with link-layer retransmission, and it assumes that radios have a fixed transmit power level. ETX loss measurements do not reflect how busy a link is. Moreover, 802.11b can be unfair under high load: a node might never be able to send its probes, causing its neighbors to believe that its reverse delivery ratio has dropped to zero.

A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols

In a wireless mobile ad hoc network, each mobile node operates not only as a host but also as a router, forwarding packets for other mobile nodes that may not be in direct wireless transmission range of each other. The ns-2 network simulator is extended to include node mobility, a realistic physical layer, radio network interfaces and the IEEE 802.11 MAC protocol using the DCF.

The DSDV protocol requires each node to periodically broadcast routing updates. Each DSDV node maintains a routing table listing the next hop for each reachable destination. DSDV tags each route with a sequence number and considers a route R more favorable than R’ if R has a greater sequence number, or if the two routes have equal sequence numbers but R has a lower metric.

The TORA protocol is designed to discover routes on demand, provide multiple routes to a destination, establish routes quickly, and minimize communication overhead by localizing the algorithmic reaction to topological changes when possible. The mechanism of TORA resembles water flowing downhill towards a destination node through a network of tubes that models the routing state of the real network.

The DSR protocol consists of two mechanisms: route discovery and route maintenance. Route discovery is the mechanism by which a node S wishing to send a packet to node D obtains a route to D. Route maintenance is the mechanism by which S detects that the network topology has changed such that it can no longer use its route to D, because two nodes along the route have moved out of each other's range and the route has broken. AODV is essentially a combination of DSR and DSDV.

The overall goal of the experiments is to measure the ability of the routing protocols to react to network topology changes while continuing to deliver data packets to their destinations. The simulation models 50 wireless nodes forming an ad hoc network, moving about over a rectangular 1500 m × 300 m flat space for 900 seconds. The sources send data at a constant bit rate. The metrics for comparing the protocols are packet delivery ratio, routing overhead and path optimality.
The experiments use constant-bit-rate sources, which is often unrealistic. Non-constant-bit-rate connections can cause more problems in a wireless environment, since the network cannot frequently check the status of its routes or discover the new network topology.

Thursday, October 2, 2008

Readings for Lecture Oct 2, 2008

Architecture and Evaluation of an Unplanned 802.11b Mesh Network

The Roofnet wireless mesh architecture features unplanned node placement, omni-directional antennas, multi-hop routing, and routing optimization in a slowly changing network. The design goals are to operate without extensive planning or central management while still providing wide coverage and acceptable performance. This architecture carries the risk that the lack of planning might render the network's performance unusably low.

Roofnet is deployed over an area of about four square kilometers. Most nodes are located in three- or four-storey buildings, and line-of-sight signal propagation between nodes is often obstructed. Each node is hosted by a volunteer user.

Each Roofnet node consists of a PC, an 802.11b card and a roof-mounted omni-directional antenna. The 802.11b cards operate with RTS/CTS disabled and all share the same 802.11b channel. The cards use a non-standard "pseudo-IBSS" mode, similar to standard 802.11b IBSS mode, which omits 802.11b's beacons and BSSID (network ID).

Each Roofnet node runs identical turn-key software consisting of Linux, routing software implemented in Click, a DHCP server and a web server. The nodes are completely self-configuring, so the software must automatically solve problems like allocating addresses, finding a gateway between Roofnet and the Internet, and choosing a good multi-hop route to that gateway.

Roofnet carries IP packets inside its own header format and routing protocol. Each node automatically assigns itself an address based on its Ethernet address and an unused class-A IP address block. Only a small fraction of Roofnet users have wired Internet access links; the nodes connected to the Internet advertise themselves to Roofnet as Internet gateways and act as NATs for connections. Each of the other nodes selects the gateway to which it has the best route metric.

This architecture favors ease of deployment through omni-directional antennas, self-configuring software and link-quality-aware routing. The average throughput between nodes is 627 kbits/second. Since each internal node can choose among multiple gateways and switch to the currently best one, Roofnet's multi-hop mesh increases both connectivity and throughput.

Modeling Wireless Links for Transport Protocols

Wireless links have intrinsic characteristics that affect the performance of transport protocols: variable bandwidth, corruption, channel allocation delays, and asymmetry. Congestion control in today's Internet assumes that almost all packet losses result from congestion; packet losses on wireless links, however, often result from corruption rather than congestion, violating this assumption.

The three main types of wireless links are cellular, WLAN and satellite. Topology plays an important role: the number and location of wireless links in a path can affect the performance of transport protocols. The latencies and error loss rates of multiple wireless links add up, making loss recovery and congestion control more difficult.

Traffic models have significant effects on simulation results. Typically, mobile users transfer more data downlink than uplink. The performance metrics for evaluation include throughput, delay, fairness and dynamics. The first modeling problem is the use of unrealistic models. The second occurs when the models are realistic but explore only a small corner of the parameter space.

Error losses can be modeled by dropping packets according to a per-packet, per-bit or time-based loss probability. A sudden increase in delay can have a negative impact on transport protocols, e.g. causing unnecessary retransmissions and spurious congestion control responses. Wireless links can introduce packet reordering during link-level error recovery by retransmission. They also often allocate channels based on the availability of user traffic. Bandwidth variation on wireless links can cause spurious timeouts, and asymmetry in bandwidth or latency can cause congestion for TCP ACKs. Mobility, an intrinsic property of most wireless links, presents a major challenge to transport protocols through the packet losses and delays introduced by handover.
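
A minimal sketch of the first two loss models mentioned above (the probabilities are placeholders):

    import random

    def drop_per_packet(p_loss=0.01):
        """Per-packet model: each packet is lost with a fixed probability."""
        return random.random() < p_loss

    def drop_per_bit(packet_len_bytes, ber=1e-6):
        """Per-bit model: a packet survives only if every bit survives,
        so longer packets are more likely to be corrupted."""
        bits = packet_len_bytes * 8
        p_packet_loss = 1.0 - (1.0 - ber) ** bits
        return random.random() < p_packet_loss

    # With a bit error rate of 1e-6, a 1500-byte packet is lost ~1.2% of
    # the time, while a 40-byte ACK is lost only ~0.03% of the time.
    print(1.0 - (1.0 - 1e-6) ** (1500 * 8))
    print(1.0 - (1.0 - 1e-6) ** (40 * 8))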

In some cases, wireless link-level technologies have been designed to take into account the performance requirements of the current TCP standard. In other cases, the discussion is about adapting transport protocols to the needs of wireless links. The paper considers the different models and parameters to be used in answering these and other questions about transport protocol performance over wireless links.

Tuesday, September 30, 2008

Readings for Lecture Sep 30, 2008

MACAW: A Media Access Protocol for Wireless LAN’s

The emergence of a wide variety of mobile computing devices like palmtops, personal digital assistants, and portable computers, together with the demand for network connectivity, has motivated a new generation of wireless network technology, and the WLAN is a crucial enabling technology. The MAC layer of a WLAN can be very different from that of a traditional wired LAN because the medium is a shared, and scarce, resource.

The WLAN infrastructure consists of base stations and pads, which are custom-built portable computing devices. All wireless communication is between a pad and a base station. The range of transmission is 3 to 4 meters, and the near-field signal strength decays very rapidly. Thus each base station covers a very small cell with sharply defined boundaries, called a "nanocell".

A collision occurs when a receiver is in the reception range of two transmitting stations and is unable to cleanly receive the signal from either. Capture occurs when a receiver is in the reception range of two transmitting stations but is able to cleanly receive the signal from the closer one. Interference occurs when a receiver is in range of one transmitting station and slightly out of range of another, but is unable to cleanly receive the closer station's signal due to the interfering presence of the other.

The common wireless multiple-access algorithm is CSMA, in which every station senses the carrier before transmitting. This algorithm has a problem at the receiver: senders cannot sense the carrier at the receiver, so senders may be hidden from each other. The MACA algorithm introduces a protocol with an RTS-CTS exchange to initiate a transmission. MACA uses binary exponential backoff (BEB) to determine the retransmission time.

The ACK is implemented at the link layer to enable fast error recovery, since packets are frequently lost to collisions and corruption by noise. To further improve the RTS-CTS model, a Data-Sending (DS) packet is sent by the sender to let other stations know that the RTS-CTS exchange has succeeded. The RRTS is added to help a receiver tell a sender that it is now available. To implement the backoff algorithm efficiently, each station should maintain a separate backoff counter for each stream.

A Comparison of Mechanisms for Improving TCP Performance over Wireless Links
TCP responds to all losses by invoking congestion control and avoidance algorithms, resulting in degraded end-to-end performance over wireless and lossy links. The paper classifies the proposed schemes into three broad categories: end-to-end protocols, where loss recovery is performed by the sender; link-layer protocols, which provide local reliability; and split-connection protocols, which break the end-to-end connection into two parts at the base station.

End-to-end throughput and goodput are used as performance metrics to answer several specific questions: What combination of mechanisms results in the best performance for each protocol class? How important is it for link-layer schemes to be aware of TCP's algorithms to achieve high end-to-end throughput? How useful are selective acknowledgements in dealing with lossy links, especially in the presence of burst losses? Is it important to split the end-to-end connection in order to effectively shield the sender from wireless losses and obtain the best performance? The goodput for any path (or link) is defined as the ratio of the actual transfer size to the total number of bytes transmitted over that path.

The link-layer protocols use two main classes of techniques: error correction, using techniques such as forward error correction (FEC), and retransmission of lost packets in response to automatic repeat request (ARQ) messages.

The split-connection protocols split each TCP connection between a sender and a receiver into two separate connections at the base station. Over the wireless hop, a specialized protocol tuned to the wireless environment may be used. One proposal offers two variants: one in which the wireless hop uses TCP, and another in which the wireless hop uses a selective repeat protocol (SRP) on top of UDP.

The snoop protocol introduces a module, called the snoop agent, at the base station. The agent monitors every packet that passes through the TCP connection in both directions and maintains a cache of TCP segments sent across the link that have not yet been acknowledged by the receiver. A packet loss is detected by a duplicate acknowledgement or a local timeout. The snoop agent retransmits the lost packet if it has it cached, and suppresses the duplicate acknowledgements.
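
A minimal sketch of the agent's two paths, data from the wired sender and ACKs from the wireless receiver (the send hooks and the simplified ACK semantics, where an ACK carries the next expected sequence number, are assumptions):

    class SnoopAgent:
        """Cache unacked TCP segments at the base station; recover losses
        locally and hide duplicate ACKs from the fixed-network sender."""
        def __init__(self, to_receiver, to_sender):
            self.to_receiver = to_receiver  # send over the wireless hop
            self.to_sender = to_sender      # send over the wired hop
            self.cache = {}                 # seq -> segment
            self.last_ack = -1

        def on_data_from_sender(self, seq, segment):
            self.cache[seq] = segment       # keep until acknowledged
            self.to_receiver(("DATA", seq, segment))

        def on_ack_from_receiver(self, ack):   # ack = next expected seq
            if ack > self.last_ack:
                for seq in [s for s in self.cache if s < ack]:
                    del self.cache[seq]        # acknowledged: drop from cache
                self.last_ack = ack
                self.to_sender(("ACK", ack))   # genuine ACK goes upstream
            elif ack in self.cache:
                # Duplicate ACK: the receiver is missing segment `ack`;
                # retransmit locally and do NOT forward the dup ACK.
                self.to_receiver(("DATA", ack, self.cache[ack]))

    agent = SnoopAgent(to_receiver=print, to_sender=print)
    agent.on_data_from_sender(1, b"a")
    agent.on_data_from_sender(2, b"b")
    agent.on_ack_from_receiver(2)   # new ACK: forwarded to the sender
    agent.on_ack_from_receiver(2)   # dup ACK: seg 2 resent locally, ACK hidden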

Selective acknowledgements contain information about up to three non-contiguous blocks of data that have been received successfully by the receiver. An alternate proposal, SMART, uses acknowledgements that contain the cumulative acknowledgement and the sequence number of the packet that caused the receiver to generate the acknowledgement.
Link-layer protocols that use knowledge of TCP improve throughput by 10-30%. The SMART-based selective acknowledgement scheme also yields good throughput.

Tuesday, September 23, 2008

Readings for Lecture Sep 23, 2008

Scaling Internet Routers Using Optics

The Internet is widely reported to have a glut of capacity, with average link utilization below 10% and a large fraction of installed but unused link capacity. Still, the capacity of individual routers must grow: the demand for network capacity continues to double every year, while doubling the number of routers each year is impractical.

Current multi-rack systems suffer from unpredictable performance and poor scalability. This causes problems for operators: not knowing what utilization to assume for their routers, they cannot fully exploit their long-haul links. The paper explains how optics, with almost zero power consumption, can be used to design architectures with predictable throughput and scalable capacity.

The load-balanced switch consists of a single buffer stage sandwiched between two identical switching stages. The buffer at each intermediate input is partitioned into N separate FIFO queues, one per output, called virtual output queues (VOQs); there are N × N such queues in total. Both switching stages walk through a fixed sequence of configurations.

The sequence of switch configurations is pre-determined, regardless of the traffic or the state of the queues. The load-balanced switch achieves 100% throughput even for non-uniform traffic. The two switch stages running at rate R/N can be replaced by a single switch running twice as fast. The Full Ordered Frames First (FOFF) scheme can be used to bound the number of mis-sequenced packets in the switch to at most N^2 + 2.
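
A minimal sketch of the two fixed-sequence stages around the VOQs (N is tiny, packets are labels, and reassembly and FOFF are ignored):

    from collections import deque

    N = 3  # ports, tiny for illustration

    # voq[i][o]: packets buffered at intermediate input i for output o.
    voq = [[deque() for _ in range(N)] for _ in range(N)]

    def stage1(t, arrivals):
        """First stage: spread arriving packets over the intermediate
        inputs in a fixed pattern, regardless of queue state."""
        for inp, pkt in arrivals.items():
            voq[(inp + t) % N][pkt["out"]].append(pkt)

    def stage2(t):
        """Second stage: the same fixed pattern connects intermediate
        input i to output (i + t) mod N; serve that VOQ if non-empty."""
        for i in range(N):
            out = (i + t) % N
            if voq[i][out]:
                print(f"t={t}: {voq[i][out].popleft()['id']} -> output {out}")

    # Packets can emerge out of order (p0, p2, p1 here), which is why a
    # scheme like FOFF is needed to bound mis-sequencing.
    for t in range(4):
        stage1(t, {0: {"id": f"p{t}", "out": 1}})
        stage2(t)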

This router architecture, which can guarantee throughput, could be useful for designing routers for real-time Internet protocols.

A Fast Switched Backplane for a Gigabit Switched Router

Router functions can be separated into two types: datapath functions, often implemented in hardware, and control functions, often implemented in software. There are three trends in designing high-performance routers: 1) The datapath is implemented more and more in hardware. 2) More parallelism is employed. 3) Most important is the move away from shared buses, since a bus shared between multiple functions can become congested.

Early routers, built around a conventional computer architecture with a shared central bus, a central CPU, memory and line cards, have the main limitation that the central CPU must process every packet, limiting the throughput of the system. An improved architecture gives each line card its own CPU making local forwarding decisions; it is still limited, however, because forwarding decisions made in software are outperformed by carefully designed ASICs for making forwarding decisions, managing queues and arbitrating access to the bus. The shared bus is a second limiting factor, since only one packet can traverse it at a time.

Replacing the local CPUs with hardware forwarding engines on the line cards and replacing the bus with a crossbar switch solves these problems. Performance is further boosted by using multiple crossbar switches, or slices, in parallel to form a switch core. The packet length is fixed for ease and efficiency of design. The head-of-line blocking problem is fixed by using virtual output queueing (VOQ).

The scheduling algorithm has to satisfy the following properties: high throughput, starvation freedom, speed, and simplicity of implementation; the iSLIP algorithm is chosen. A priority packet-scheduling scheme can make this architecture suitable for the RSVP protocol.

Thursday, September 18, 2008

Readings for Lecture Sep 18, 2008

Fundamental Design Issues for the Future Internet

The paper discusses fundamental architectural design issues for the "future" Internet. The Internet at the time of the paper offers only a single class of "best-effort" service and is neutral: there is no admission control, and the network makes no guarantees about when, or whether, packets will be delivered.

Emerging real-time applications like video and voice have characteristics and requirements different from data applications and could significantly alter the Internet architecture, since real-time applications do not perform adequately over the current Internet due to variations in transmission delay. In addition, real-time applications do not back off during congestion; they continue to contend for bandwidth with data applications, leaving the data applications short of bandwidth.

These issues can be partly addressed by using fair queueing and by modifying applications to adapt to delay variations. However, without admission control the network can run into situations where each flow's fair share of bandwidth is too small. Thus, it is necessary to extend the current Internet service model from one best-effort class to a variety of service classes.

The goal of network design is then to maximize the performance of the resident applications, the efficacy V ≡ Σ_i U_i(s_i). V can be improved, without supplying more bandwidth, by providing a wider variety of services rather than a single class of best-effort service.

To determine the service for a flow, there are two possibilities. First, the network can supply it implicitly, meaning the application does not tell the network its service requirements; however, this requires embedding application information in the network, which violates an Internet tenet. Second, applications can explicitly request their desired service class. This also has disadvantages, such as the incentive problem for users and a loss of flexibility in the service model.
The paper does not address the problem of delay bounds, and it assumes that bandwidth will become extremely inexpensive.

Supporting Real-time Applications in an Integrated Services Packet Network Architecture and Mechanism

The paper proposes an Integrated Services Packet Network (ISPN) architecture that supports two distinct kinds of real-time service: guaranteed service, which involves pre-computed worst-case delay bounds, and predicted service, which uses the measured performance of the network in computing delay bounds.

The characteristics of a class of real-time applications dubbed playback applications are discussed to identify requirements for the design. The playback class has four important properties: 1) it prefers low delay; 2) the applications need to know an absolute or statistical bound on delay; 3) data need only arrive before its playback point; 4) playback applications can tolerate the loss of a certain fraction of packets.

The guaranteed service commitment is: if the network hardware is functioning and the client conforms to its traffic characterization, then the service commitment will be met (it does not depend on other network clients conforming). The predicted service commitment is: if the past is a guide to the future, then the network will meet its service characterization.

The weighted fair queueing (WFQ) algorithm is suitable for guaranteed service; this scheduling does not favor bursty sources. The FIFO algorithm is better for packet scheduling in predicted service, providing better network utilization but no isolation. A unified algorithm handles both types of service by using the WFQ scheme as a framework into which other scheduling algorithms can be fitted: all predicted service traffic is assigned to a pseudo-WFQ flow with a constant data rate.

The paper assumes that most future real-time applications will fit the playback model. Other types of real-time applications, such as real-time control systems, which cannot tolerate data loss, are not discussed.

Tuesday, September 16, 2008

Readings for Lecture Sep 16, 2008

Random Early Detection (RED) Gateways for Congestion Avoidance

Unlike the DECbit congestion avoidance scheme, RED gateways detect incipient congestion by computing the average queue size. The gateway can then implicitly notify the sources of congested connections, either by dropping packets at the gateway or by setting a bit in packet headers; when the destination returns an ACK to the source, it includes this bit.

The other component of RED is its packet-dropping mechanism. When the average queue length exceeds a minimum threshold, the gateway drops or marks each arriving packet with a probability that grows with the average queue length. If the average queue length exceeds a maximum threshold, every arriving packet is marked.

For each queue, the average queue length avg is estimated as avg = (1 − w) · avg + w · currentQueueLen, where w is a weight. If min_th ≤ avg < max_th, then the marking probability is p_b = max_p · (avg − min_th)/(max_th − min_th), adjusted to p_a = p_b/(1 − count · p_b), where count is the number of packets since the last mark. If avg ≥ max_th, all arriving packets are marked.
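
A minimal sketch of the marking decision (the threshold and weight values are illustrative, not the paper's recommendations):

    import random

    class RedQueue:
        def __init__(self, w=0.002, min_th=5, max_th=15, max_p=0.02):
            self.w, self.min_th = w, min_th
            self.max_th, self.max_p = max_th, max_p
            self.avg = 0.0
            self.count = 0          # packets since the last mark

        def on_arrival(self, queue_len):
            """Return True if the arriving packet should be marked."""
            self.avg = (1 - self.w) * self.avg + self.w * queue_len
            if self.avg >= self.max_th:
                self.count = 0
                return True         # past max threshold: mark everything
            if self.avg < self.min_th:
                self.count = 0
                return False        # below min threshold: never mark
            pb = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
            pa = pb / max(1e-9, 1 - self.count * pb)
            self.count += 1
            if random.random() < pa:
                self.count = 0
                return True
            return False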

The value min_th reflects the desired utilization of the link: if the typical traffic is rather bursty, min_th has to be set correspondingly large to keep the link at acceptably high utilization. max_th depends in part on the maximum average delay allowed at the gateway. The RED gateway functions most effectively when max_th − min_th is larger than the typical increase in the calculated average queue size over one round-trip time.

The mechanism presented in this paper is particularly useful in networks where it is undesirable to drop packets, since RED gateways predict and control the average queue size before congestion happens. However, the paper does not specify whether the queue size should be measured in bytes or in packets.

Congestion Control for High Bandwidth-Delay Product Networks

Explicit Control Protocol (XCP) generalizes explicit congestion notification. In XCP, efficiency control and fairness control are separated. XCP senders maintain their congestion window cwnd and round-trip time rtt and communicate these to the routers via a congestion header in every packet. Based on the difference between the link bandwidth and its input traffic rate, a router tells the flows sharing that link to increase or decrease their congestion windows.

Each XCP packet carries a congestion header, which is used to communicate a flow's state to the routers and feedback from the routers back to the sources. H_cwnd is the sender's current congestion window, and H_rtt is the sender's current RTT estimate. H_feedback is initialized by the sender and modified by the routers along the path to directly control the congestion windows of the sources.

Whenever a new acknowledgment arrives, positive feedback increases the sender's cwnd and negative feedback reduces it: cwnd = max(cwnd + H_feedback, s), where s is the packet size. Whenever the receiver acknowledges a packet, it copies the congestion header from the data packet into the acknowledgment.

The efficiency controller's (EC) purpose is to maximize link utilization while minimizing the drop rate and persistent queues. Because the EC is separated from fairness control, an aggregate feedback φ is computed once per control interval: φ = α · avgRTT · SpareBandwidth - β · PersistentQueueSize. The fairness controller then apportions φ among the flows following the additive-increase, multiplicative-decrease principle: positive feedback is divided equally among flows, while negative feedback is allocated in proportion to each flow's current throughput.
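
To make the control laws concrete, here is a rough Python sketch of the efficiency controller and the sender's window update; α = 0.4 and β = 0.226 are the stability constants from the paper, while the equal per-flow split shown here is a simplification of the full fairness controller described above.

    def aggregate_feedback(capacity, input_rate, queue_bytes, avg_rtt,
                           alpha=0.4, beta=0.226):
        """Efficiency controller: phi = alpha * avgRTT * SpareBandwidth - beta * Queue.

        Rates are in bytes/second, the queue in bytes, the RTT in seconds.
        """
        spare_bandwidth = capacity - input_rate
        return alpha * avg_rtt * spare_bandwidth - beta * queue_bytes

    def equal_split(phi, num_flows):
        """Simplified fairness step: divide phi equally (positive-feedback case)."""
        return phi / num_flows

    def on_ack(cwnd, h_feedback, packet_size):
        """Sender-side update on each ACK: cwnd = max(cwnd + H_feedback, s)."""
        return max(cwnd + h_feedback, packet_size)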

Delay is not really evaluated in this paper. An open question is whether the protocol can be extended to guarantee that the delivery delay of a packet is bounded.

Thursday, September 11, 2008

Readings for Lecture Sep 11, 2008

Analysis and Simulation of a Fair Queueing Algorithm

Queueing algorithms allocate three nearly independent quantities: bandwidth, promptness, and buffer space. First-come-first-served (FCFS) queueing is inadequate because it cannot prevent an ill-behaved source from sending data to a gateway fast enough to capture all, or a large fraction, of the bandwidth. The fair queueing (FQ) algorithm proposed by Nagle handles this by maintaining a separate queue for packets from each individual source, so that one source cannot interfere with the others.

However, Nagle's proposal does not truly provide fair bandwidth allocation, since a source using long packets gets more bandwidth than a source using short packets; additionally, static round-robin ordering violates the continuity requirement.

Max-min fairness states that an allocation is fair if: no user gets more than it requests; users whose demand is below the fair share get exactly their demand and are removed from consideration; and the remaining users, whose demand exceeds the fair share, get equal shares of the remaining resource.
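
The rule above amounts to progressive filling, which a short Python sketch with hypothetical demands can illustrate.

    def max_min_allocation(demands, capacity):
        """Progressive filling: users below the fair share get their demand;
        the rest split the remaining capacity equally."""
        allocation = {}
        remaining = capacity
        unsatisfied = dict(demands)
        while unsatisfied:
            fair_share = remaining / len(unsatisfied)
            satisfied = {u: d for u, d in unsatisfied.items() if d <= fair_share}
            if not satisfied:
                # Everyone left demands more than the fair share: equal split.
                for u in unsatisfied:
                    allocation[u] = fair_share
                return allocation
            for u, d in satisfied.items():
                allocation[u] = d
                remaining -= d
                del unsatisfied[u]
        return allocation

    # Example: demands {a: 2, b: 8, c: 10}, capacity 15 -> {a: 2, b: 6.5, c: 6.5}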

The straightforward bit-by-bit round-robin algorithm, in which successive time slots are each given to one bit from one source in turn, is not realistic. A practical alternative emulates it at packet granularity: each packet is assigned a finish number F, the number of bits the conversation has already had transmitted plus the number of bits in the packet, and the gateway transmits the packet with the smallest F. In the preemptive version, a newly arrived packet with a smaller F preempts the packet currently being transmitted.

A non-negative delay parameter δ is added to give packets from inactive conversations faster service. The sending order is now determined by the bid B_i = P_i + max(F_{i-1}, R(t_i) - δ) rather than by the finish number alone, where P_i is the packet length, F_{i-1} is the finish number of the conversation's previous packet, and R(t_i) is the round number when the packet arrives. For an active conversation R(t_i) ≤ F_{i-1}, so δ has no effect; for an inactive conversation R(t_i) > F_{i-1}, and the source's previous history is taken into account.
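
A small Python sketch of this bookkeeping follows, under the simplifying assumption that the round number R(t) is supplied by the caller; computing R(t) as the set of active conversations changes is where the real implementation effort lies.

    class Conversation:
        """Per-source state for fair queueing's finish numbers and bids."""

        def __init__(self):
            self.last_finish = 0.0   # F of this conversation's previous packet

        def on_arrival(self, packet_len, round_number, delta=0.0):
            """Return (F_i, B_i) for a packet of packet_len bits arriving
            when the round number R(t_i) equals round_number."""
            start = max(self.last_finish, round_number)       # S_i
            finish = start + packet_len                       # F_i = S_i + P_i
            bid = packet_len + max(self.last_finish, round_number - delta)
            self.last_finish = finish
            return finish, bid

The gateway then transmits (or, in the preemptive variant, switches to) the packet with the smallest bid.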

This algorithm is complex in that it must keep state for each flow through a gateway, which is hard when the number of flows is large. Another issue is how a gateway knows when a flow has terminated so that it can clear that flow's state.

Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks

To reduce the complexity of the fair queueing (FQ) algorithm, an architecture called core-stateless fair queueing (CSFQ) can be used. In this architecture, only edge routers maintain per-flow state: they estimate the incoming rate of each flow and insert a label into each packet header based on this estimate. Core routers maintain no per-flow state; instead they use FIFO scheduling augmented by a probabilistic dropping algorithm that uses the packet labels and an estimate of the aggregate traffic at the router.

In the idealized bufferless fluid model, the arrival rate r_i(t) of each flow is assumed to be known exactly, and flows are modeled as continuous streams of bits. The max-min bandwidth for flow i is min(r_i(t), α(t)), where α(t) is the fair share rate of the server. If the total arrival rate A(t) = Σ r_i(t) is less than the output link speed C, then no bit is dropped; otherwise, bits from a flow i with r_i(t) > α(t) are dropped with probability max(0, 1 - α(t)/r_i(t)).

The packet algorithm extends this fluid model with packetized transmission and a buffering mechanism. Flow arrival rates are not known exactly; instead, the rate of each flow is estimated by exponential averaging: on each packet arrival, r_new = (1 - e^(-T/K)) * (l/T) + e^(-T/K) * r_old, where l is the packet length, T is the time since the flow's previous packet, and K is an averaging constant. The aggregate arrival rate A is estimated in the same exponentially averaged form. CSFQ can further be extended to support flows with different weights.
The exponential form of the estimator helps the estimated rate converge to the real rate and allows the excess service a flow can receive to be bounded. The size of each island (a contiguous region of CSFQ routers) also needs more detailed consideration, since it can affect the latency of the system.
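
A compact Python sketch of the two halves follows, treating the averaging constant K and the fair rate α as given inputs; the paper additionally estimates α at the core router itself from the aggregate acceptance rate.

    import math
    import random

    def update_rate(old_rate, pkt_len, inter_arrival, K=0.1):
        """Edge-router estimator: r_new = (1 - e^(-T/K)) * (l/T) + e^(-T/K) * r_old,
        where T > 0 is the time since the flow's previous packet."""
        w = math.exp(-inter_arrival / K)
        return (1.0 - w) * (pkt_len / inter_arrival) + w * old_rate

    def core_drop(labelled_rate, fair_rate):
        """Core-router test: drop with probability max(0, 1 - alpha / r_i)."""
        p_drop = max(0.0, 1.0 - fair_rate / labelled_rate)
        return random.random() < p_drop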

The paper does not discuss the latency of the approach, although this can be a major issue for real-time distributed systems. Future research could extend the architecture to handle real-time communication: edge routers would not only estimate flow rates but also classify packets according to patterns, while still preserving the order of packets.

Tuesday, September 9, 2008

Readings for Lecture Sep 9, 2008

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

The co-existence of new technologies such as Local Area Networks (LANs) and fiber-optic LANs with older, low-bandwidth technologies such as twisted pair produces a discrepancy between arrival and service rates at intermediate nodes in the network, causing growing queues and congestion.

To allow networks to operate in the optimal region of low delay and high throughput, congestion avoidance is needed to keep networks from becoming congested. As the load increases, the throughput increases; once the load reaches the network capacity, the throughput stops increasing; and if the load increases further, queues build up and packets may be dropped. One congestion avoidance scheme, the "binary feedback scheme" with increase/decrease algorithms, can be used to address the problem.

The criteria for selecting a control are efficiency, fairness, distributedness, and convergence. Efficiency is the closeness of the total load on the resource to its knee, X_knee = Σ x_i(t). Users in the same equivalence class should be given the same amount of resource; fairness can be measured with a fairness index such as F(x) = (Σ x_i)² / (n · Σ x_i²). The control scheme should be distributed rather than centralized. Finally, the system should be relatively stable: the control should converge, not necessarily to a steady state, but to an "equilibrium" around the optimal point.

The linear controls are: x_i(t+1) = a_I + b_I · x_i(t) if y(t) = 0 (increase), and x_i(t+1) = a_D + b_D · x_i(t) if y(t) = 1 (decrease). Different assignments of a_I, b_I, a_D, b_D give different control functions, and the selection criteria impose constraints on these coefficients, establishing the feasible control region. The time to converge trades off against the size of the oscillations around the efficiency goal. For optimal convergence, the policy should be additive increase, multiplicative decrease: a_I > 0, b_I = 1, a_D = 0, and 0 < b_D < 1.
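
A toy synchronized-feedback simulation in Python (constants hypothetical) shows the convergence-to-fairness behaviour: the gap between the two users shrinks by half on every decrease while the total oscillates around the capacity.

    def aimd_step(loads, capacity, a_i=1.0, b_d=0.5):
        """One synchronized round: y(t) = 1 when the total load exceeds capacity."""
        if sum(loads) > capacity:
            return [b_d * x for x in loads]   # multiplicative decrease
        return [x + a_i for x in loads]       # additive increase

    loads = [1.0, 10.0]                       # deliberately unfair start
    for _ in range(100):
        loads = aimd_step(loads, capacity=20.0)
    print(loads)   # the two allocations end up nearly equal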

The problem in the paper is highly abstracted. To apply the results in practice, many other issues must be taken into consideration: for instance, resource allocations must be integral, so simple rounding to the nearest integer may violate the convergence criteria. Additionally, the scales and parameters in the paper are not easily gathered automatically and often require human configuration, making the algorithm more vulnerable to human error.
The paper does not discuss the effect of delayed feedback on the control, seemingly assuming that communication time is zero. Extending the "binary" feedback to multi-valued feedback would help describe the exact state of the resource. The paper also assumes that the number of users stays constant over time and that all users start from the same initial state and follow the same synchronized phases of computation and feedback.

Congestion Avoidance and Control

Prompted by a sudden factor-of-thousand drop in bandwidth, the authors investigate why things had gotten so bad. The principle of "conservation of packets" is that, for a connection in equilibrium, a new packet is not put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion.

However, observation of the Internet showed that it was not particularly robust. The reasons for this failure are that a connection may not reach equilibrium, that a sender may inject a new packet before an old one has exited, or that equilibrium cannot be reached because of resource limits along the path.

To help a connection reach equilibrium, a slow-start scheme is used. Each connection keeps a congestion window, cwnd. Whenever starting or restarting after a loss, set cwnd to one packet; on each acknowledgment for new data, increase cwnd by one packet; when sending, send the minimum of the receiver's advertised window and cwnd.

The second problem originates from neglecting the variation of the round-trip time R, since R increases quickly with load. The retransmit timeout interval, RTO, therefore has to be estimated accurately from both the mean and the variation of the measured R.
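
A sketch of the kind of mean-plus-deviation estimator the paper proposes, with the commonly used gains g = 1/8 and h = 1/4 (treat the exact constants as assumptions):

    class RttEstimator:
        """Track smoothed RTT and mean deviation; RTO = srtt + 4 * rttvar."""

        def __init__(self, first_sample):
            self.srtt = first_sample
            self.rttvar = first_sample / 2.0

        def update(self, sample, g=0.125, h=0.25):
            err = sample - self.srtt
            self.srtt += g * err                         # smoothed mean of R
            self.rttvar += h * (abs(err) - self.rttvar)  # smoothed deviation of R
            return self.srtt + 4.0 * self.rttvar         # retransmit timeout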

A congestion avoidance scheme is used to adapt the connection to the path. The strategy employs an additive-increase/multiplicative-decrease policy: on any timeout, set cwnd to half the current window size; on each acknowledgment, increase cwnd by 1/cwnd; when sending, send the minimum of the receiver's advertised window and cwnd. This strategy is easily confused with slow start; in fact the two cooperate, with congestion avoidance setting the threshold at which slow start stops, since otherwise cwnd would grow without bound.
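
The interlocking of the two schemes through the slow-start threshold can be summarized in a short Python sketch, with the window counted in packets for simplicity.

    class TcpWindow:
        """Slow start plus congestion avoidance (simplified, packet units)."""

        def __init__(self, ssthresh=64.0):
            self.cwnd = 1.0            # start (or restart) at one packet
            self.ssthresh = ssthresh

        def on_ack(self):
            if self.cwnd < self.ssthresh:
                self.cwnd += 1.0               # slow start: doubles per RTT
            else:
                self.cwnd += 1.0 / self.cwnd   # congestion avoidance: +1 per RTT

        def on_timeout(self):
            self.ssthresh = max(self.cwnd / 2.0, 2.0)  # remember half the window
            self.cwnd = 1.0                            # then slow-start again

        def send_window(self, advertised_window):
            return min(advertised_window, self.cwnd)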

The paper deals only with algorithms at the endpoints, which may not be sufficient on their own. The algorithms should be complemented by congestion avoidance at the gateways, which could be difficult since each gateway often carries many different flows.

Thursday, September 4, 2008

Readings for Lecture Sep 4, 2008

Interdomain Internet Routing

The wide-area routing architecture is divided into autonomous systems (ASes) that exchange reachability information. Each AS is owned and administered by a single commercial entity and implements a set of policies in deciding how to route its packets to the rest of the Internet, as well as how to export its routes to other ASes. A unique 16-bit number identifies each AS.

Within each AS, routing protocols called Interior Gateway Protocols (IGPs) operate. Between ASes, interdomain routing protocols such as the Border Gateway Protocol (BGP), collectively called Exterior Gateway Protocols (EGPs), are used. BGP was designed around three important needs: scalability, policy, and cooperation under competitive circumstances.

There are two prevalent forms of AS-AS interconnection: provider-customer transit, and peering, in which two ASes (typically ISPs) provide mutual access to a subset of each other's routing tables. Each ISP exports a part of its routing table to other ISPs, filtering according to its policy so that it benefits from the routes it chooses to advertise.

Routes are then imported by ASes in the following priority order: customer > peer > provider. This rule can be implemented in BGP using a special attribute maintained locally by the routers in an AS, the LOCAL_PREF attribute. There are two types of BGP sessions: eBGP sessions run between BGP-speaking routers in different ASes, while iBGP sessions run between BGP-speaking routers within the same AS. After obtaining routes to the external world over eBGP, a router disseminates them to the other BGP routers inside the AS via iBGP sessions. Note that iBGP is distinct from IGPs such as OSPF and RIP.
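
As a toy illustration, the import priority can be encoded as LOCAL_PREF values; the specific numbers below are hypothetical, since each AS chooses its own, and only the ordering matters.

    # Higher LOCAL_PREF = more preferred: customer > peer > provider.
    LOCAL_PREF = {"customer": 200, "peer": 100, "provider": 50}

    def best_route(routes):
        """Pick the route learned from the most preferred kind of neighbor."""
        return max(routes, key=lambda r: LOCAL_PREF[r["learned_from"]])

    routes = [
        {"prefix": "192.0.2.0/24", "learned_from": "provider", "as_path": [7, 8, 9]},
        {"prefix": "192.0.2.0/24", "learned_from": "customer", "as_path": [4, 9]},
    ]
    print(best_route(routes)["learned_from"])   # -> "customer"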

BGP is actually a rather simple protocol, but its configuration flexibility makes it complex in practice. Another issue is that BGP was not designed for rapid fault detection and recovery. Upon detecting a fault, a router sends a withdrawal message to its neighbors, and to avoid massive route oscillations the further propagation of such route announcements is damped; this damping can delay convergence.

One interesting policy feature of BGP is hot-potato routing. The paper does not discuss in detail how to enable cold-potato routing instead. Future research directions therefore include policy, failover, scalability, configuration, and correctness.

On Inferring Autonomous System Relationships in the Internet

A pair of ASes is interconnected via dedicated links and/or public network access points, and routing between ASes is determined by an interdomain routing protocol such as the Border Gateway Protocol (BGP). BGP is closely tied to the policies of the ASes (ISPs).

Thus, the relationships between ASes are an important aspect of the Internet architecture. Each AS sets up its own BGP export policies according to its relationships with other ASes. The paper proposes a heuristic algorithm to classify AS relationships based on BGP routing tables; this work helps in understanding structural properties of the Internet.

AS relationships are represented by an annotated AS graph, a partially directed graph whose nodes represent ASes and whose edges are classified into provider-to-customer, customer-to-provider, peer-to-peer, and sibling-to-sibling edges.

An AS selectively provides transit service for its neighboring ASes and sets up its BGP export policies according to a selective export rule. This rule ensures that BGP routing table entries contain only valley-free AS paths. The top provider of an AS path is defined as the AS with the highest degree among all ASes in the path's maximal uphill (or downhill) portion.

The algorithm goes through the AS path in each routing table entry, finds the highest-degree AS, and takes that AS to be the top provider of the path. Starting from the top provider, the algorithm then traces outward to infer the relationship between each pair of consecutive ASes in the path.
However, this basic algorithm does not identify peering relationships. A refinement uses the observation that an AS pair has a peering relationship only if the two ASes do not transit traffic for each other; this is the central idea of the heuristic algorithm.
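
A condensed Python sketch of the basic pass follows; the degree map and paths are hypothetical inputs, and the real heuristic adds refinements for sibling and peering edges.

    def infer_relationships(as_paths, degree):
        """Treat the highest-degree AS on each path as the top provider and
        classify edges before it as customer-to-provider (uphill) and edges
        after it as provider-to-customer (downhill)."""
        rel = {}
        for path in as_paths:
            top = max(range(len(path)), key=lambda i: degree[path[i]])
            for i in range(len(path) - 1):
                a, b = path[i], path[i + 1]
                rel[(a, b)] = "c2p" if i < top else "p2c"
        return rel

    # Example: infer_relationships([[1, 2, 3]], {1: 2, 2: 9, 3: 3})
    # -> {(1, 2): "c2p", (2, 3): "p2c"}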

This paper is interesting in the sense that ISPs or other companies can use AS relationship information to plan future contractual agreements.

Monday, September 1, 2008

Readings for Lecture Sep 2, 2008

End-to-end Arguments in System Design

The paper discusses a design principle, called the end-to-end argument, for choosing the proper boundaries between functions. The careful file transfer problem is used to articulate the argument: appealing to application requirements provides a rationale for moving function upward in a layered system, closer to the application that uses the function.

The paper lays out the specific steps of the problem and the possible threats that can cause failures when transferring a file from one host to another. It then considers solutions to those threats, namely checksums to detect failures and a retry/commit plan. Performance arguments for placing a function in a low-level subsystem are weighed against this tradeoff, and the paper concludes that the end-to-end check of the file transfer application must still be implemented no matter how reliable the communication system becomes.

By considering further end-to-end examples such as delivery guarantees, secure transmission of data, duplicate message suppression, guaranteed FIFO message delivery, and transaction management, the paper argues that the principle applies to a variety of functions beyond reliable data transmission: application designers need to understand their applications carefully, because a lower-level subsystem that supports a distributed application may be wasting its effort providing a function that must, by its nature, be implemented at the application level anyway.

By discussing voice connections between two digital telephone instruments, the paper demonstrates that the requirements of each application play an important role in identifying and analyzing application and protocol design choices; in other words, what matters is the most important criterion for that particular application.

This paper is interesting in the sense that it presents several design perspectives for readers and designers weighing the tradeoffs of implementing applications such as peer-to-peer systems. The advantages and disadvantages of each placement in different scenarios remain very helpful, since the possible threats are pointed out.
However, the paper does not address how to improve performance and efficiency when implementing such applications. Future research could address guaranteeing timing requirements for real-time distributed systems.

The Design Philosophy of the DARPA Internet Protocols

The paper aims to explain the reasoning behind the TCP/IP protocols. It begins with the primary goal of the design: an effective technique for multiplexed utilization of existing interconnected networks using packet switching. A set of second-level goals is then summarized, and these goals and their interrelations are discussed to motivate the solutions embodied in the protocols.

The paper then discusses how the architecture was shaped, implemented, and verified against this set of goals, and how performance was improved. The advantages of datagrams, the fundamental feature of the Internet architecture, are presented and explained. The paper closes with a brief review of TCP's history.

Packet switching is selected as the multiplexing technique for utilizing the existing interconnected networks, with processors called gateways implementing a store-and-forward packet forwarding algorithm. To survive failures of networks and gateways, the state information describing an on-going conversation must be protected, stored at the endpoints of the net, at the entity utilizing the network service, rather than at the intermediate packet-switching nodes.

TCP is not suited to real-time delivery of digitized speech for teleconferencing, since rather than reliable service, real-time digital speech prefers minimal packet delivery delay. More than one transport service was therefore required, which is why TCP and IP were separated into two layers and why UDP emerged on top of IP.

The paper shows the forces behind the formation of the TCP/IP protocols and briefly sketches their evolution, giving readers insight into the TCP/IP design rationale. However, it does not discuss the lower-priority goals, such as resource management, in any depth.

The datagram model does not seem to be a good fit for resource management, and future research should look into this further. For better resource management, gateways could keep information (state) about the flows passing through them to improve processing; this can be enabled by having the endpoints periodically send messages to ensure that the proper type of service stays associated with their flows.