CurrentArchitecture

Current CDN Architecture

Introduction

Before you think to yourself, "This is mighty ghetto! I don't want to do this!", please remember that the initial primary goal is to handle a useful quantity of live traffic, which can then be used to build tools and gather data. It is a lot easier to document what works and what doesn't when you have actual live traffic to test with!

Architecture Modules

BGP

For now, BGP peering data is mostly used to generate a BGP prefix -> origin AS table. The main purpose of keeping BGP peers up is to have the data available -when- it is time to start including it in the selection/redirection logic, rather than rushing out to do it later.

Each mirror location runs at least one iBGP node, with one central route reflector.

The route reflector takes a snapshot of the BGP table once an hour. The snapshot is processed into a simple prefix -> origin AS table, which is then sync'ed to each mirror node.
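The snapshot processing step can be sketched roughly as follows. This is a minimal illustration, not the actual Cacheboy code: it assumes a hypothetical dump format of one route per line ("prefix" followed by the AS path), and the function name `build_origin_table` is made up for the example. The origin AS is taken to be the last AS in the path.

```python
def build_origin_table(dump_lines):
    """Map each prefix to its origin AS (the last AS in the AS path).

    Assumed input format (hypothetical): "<prefix> <as> <as> ... <origin as>".
    """
    table = {}
    for line in dump_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        prefix, _, as_path = line.partition(" ")
        hops = as_path.split()
        if not hops:
            continue  # no AS path recorded, so no origin AS to store
        origin = hops[-1]
        if not origin.isdigit():
            continue  # e.g. an AS set like {64530,64531}: no single origin
        table[prefix] = int(origin)
    return table


# Example snapshot using documentation prefixes and private-use AS numbers.
snapshot = [
    "192.0.2.0/24 64500 64501 64502",
    "198.51.100.0/24 64500 64510",
    "2001:db8::/32 64500 64520",
    "203.0.113.0/24 64500 {64530,64531}",
]
origin_table = build_origin_table(snapshot)
```

The resulting table is small and flat, which is what makes it cheap to copy out to every mirror node.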

GeoIP

The GeoIP service uses the database from nerd.dk, sync'ed hourly.

A GeoIP daemon runs on each server wishing to perform GeoIP lookups. It takes a full IPv4 / IPv6 address and returns the geoip database entry for the matching network. The daemon stores the entire geoip database in memory (< 20 megabytes), and the currently single-threaded app can handle high concurrency and high enough request rates (~ 4000/sec) without any effort toward optimisation.
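The in-memory lookup could look something like the sketch below. It is an assumption about the general technique (longest-prefix match over a table keyed by masked address and prefix length), not a description of the real daemon; the class name `GeoTable` and the sample entries are hypothetical.

```python
import ipaddress


class GeoTable:
    """In-memory network -> country table with longest-prefix-match lookup."""

    def __init__(self, entries):
        # entries: iterable of (cidr, country) pairs, IPv4 and IPv6 mixed.
        self._nets = {4: {}, 6: {}}
        for cidr, country in entries:
            net = ipaddress.ip_network(cidr)
            key = (int(net.network_address), net.prefixlen)
            self._nets[net.version][key] = (str(net), country)

    def lookup(self, address):
        """Return (network, country) for the most specific match, or None."""
        addr = ipaddress.ip_address(address)
        bits = addr.max_prefixlen
        value = int(addr)
        table = self._nets[addr.version]
        # Try the longest prefix first and widen until something matches.
        for plen in range(bits, -1, -1):
            shift = bits - plen
            masked = (value >> shift) << shift
            hit = table.get((masked, plen))
            if hit is not None:
                return hit
        return None


geo = GeoTable([
    ("192.0.2.0/24", "AU"),
    ("192.0.2.128/25", "NZ"),
    ("2001:db8::/32", "DE"),
])
```

Each lookup costs at most 33 dictionary probes for IPv4 and 129 for IPv6, which is simple rather than fast; a radix tree would do the same job in a single walk.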

I'll have to scale the geoip daemon (like the BGP daemon I'm writing) to upwards of 10,000 lookups a second to handle _real_ traffic loads later, and this may not be easily possible when performing the lookups over connected TCP sockets. I'll investigate memcached a little more later on to see what can be pulled out to build a generic high performance "database" of "stuff".

An alternative to using a daemon w/ sockets for lookups may be required. Perhaps an on-disk radix tree..

GeoIP Host Maps

The Perl Cacheboy library includes a GeoIP Hostmap - a list of backend {hosts,weights} which map to a given GeoIP location. Each selection picks a weighted random host from the available list of backend hosts. This allows for some very functional but coarse-grained traffic distribution across the available backends.
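To illustrate the idea (in Python rather than the library's Perl), a hostmap and weighted random pick might be sketched like this. The hostnames, weights and the `pick_host` name are all made up for the example.

```python
import random

# Hypothetical hostmap: GeoIP country code -> list of (host, weight).
HOSTMAP = {
    "US": [("mirror-a.example.net", 3), ("mirror-b.example.net", 1)],
    "DE": [("mirror-c.example.net", 1)],
}


def pick_host(hostmap, country, rng=random):
    """Pick one backend for the country, with probability proportional to weight."""
    entries = hostmap.get(country)
    if not entries:
        return None  # no backends mapped to this location
    hosts = [host for host, _ in entries]
    weights = [weight for _, weight in entries]
    return rng.choices(hosts, weights=weights, k=1)[0]
```

With the weights above, roughly three out of four US lookups would land on mirror-a. Nothing is tracked between picks, so the split only evens out over many requests.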

Authoritative DNS Servers

Each mirror/CDN site simply has a hostname '<something>.cdn.cacheboy.net'. These are served as "magic" records.

The DNS software is PowerDNS. The pipe-backend module is used to provide a very rapid and flexible development environment. Yes, this has performance implications, but it is hoped that these will be addressed by stabilising the current feature set and then either recoding the module in C++ or migrating the Perl backend code to one of the available Perl DNS server frameworks.
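For readers unfamiliar with the pipe backend, the sketch below shows the general shape of such a coprocess, based on my understanding of version 1 of the pipe-backend protocol (tab-separated lines on stdin/stdout: a HELO handshake, then Q lines answered with DATA lines and a closing END). Check the PowerDNS documentation before relying on the details. The `resolve` callback, the banner text and the 60 second TTL are assumptions for illustration, and this is Python rather than the Perl actually used.

```python
def handle_line(line, resolve):
    """Return the reply lines for one tab-separated line from PowerDNS."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] == "HELO":
        # Handshake: only protocol version 1 is handled in this sketch.
        return ["OK\tsketch backend ready"] if fields[1:2] == ["1"] else ["FAIL"]
    if fields[0] == "Q" and len(fields) >= 6:
        _, qname, qclass, qtype, qid, remote_ip = fields[:6]
        replies = []
        if qtype in ("A", "ANY"):
            # resolve() is a hypothetical hook: (qname, client ip) -> IPv4 or None.
            address = resolve(qname, remote_ip)
            if address is not None:
                replies.append("\t".join(
                    ["DATA", qname, qclass, "A", "60", qid, address]))
        replies.append("END")
        return replies
    return ["FAIL"]


def serve(stdin, stdout, resolve):
    """Main loop: read queries line by line and flush after each answer."""
    for line in stdin:
        for reply in handle_line(line, resolve):
            stdout.write(reply + "\n")
        stdout.flush()
```

Because the whole exchange is lines of text over a pipe, the selection logic behind `resolve` can be changed and restarted without touching the DNS server itself, which is where the rapid development benefit comes from.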

The pipe backend performs GeoIP lookups against the local GeoIP daemon. It then uses this GeoIP information to select a "close enough" backend host from a static list of servers, with "down" servers filtered out. A (weighted) random host from the set of backend hosts serving the given GeoIP region is then returned. If the GeoIP lookup fails for whatever reason (the GeoIP daemon is down, the mapping file between GeoIP region and mirror nodes isn't available or is blank, etc), a host is randomly selected from a list of "last chance" servers instead.
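The selection flow described above might be sketched as follows. All names here are hypothetical, and one detail is an assumption on my part: the text does not say what happens when every host for a region is marked down, so this sketch treats that case like a failed lookup and falls back to the "last chance" list.

```python
import random


def select_backend(client_ip, geo_lookup, region_map, down, last_chance, rng=random):
    """Pick a backend host for a client, falling back to last-chance servers."""
    try:
        region = geo_lookup(client_ip)  # may raise if the GeoIP daemon is down
        candidates = region_map.get(region, []) if region else []
    except OSError:
        candidates = []
    # Filter out hosts that the health checks have marked as down.
    live = [(host, weight) for host, weight in candidates if host not in down]
    if not live:
        # Lookup failed, map missing/blank, or (assumed) all regional hosts down.
        return rng.choice(last_chance)
    hosts, weights = zip(*live)
    return rng.choices(hosts, weights=weights, k=1)[0]


def daemon_down(client_ip):
    """Stand-in for a GeoIP daemon that refuses connections."""
    raise ConnectionRefusedError("geoip daemon not reachable")


REGION_MAP = {"DE": [("mirror-a.example.net", 1), ("mirror-b.example.net", 1)]}
LAST_CHANCE = ["mirror-z.example.net"]

normal = select_backend("192.0.2.7", lambda ip: "DE", REGION_MAP,
                        {"mirror-a.example.net"}, LAST_CHANCE)
fallback = select_backend("192.0.2.7", daemon_down, REGION_MAP, set(), LAST_CHANCE)
```

In the first call only mirror-b is live, so it is returned; in the second the lookup fails and the single last-chance host is used.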

Health Checks

For now, the health checks are simple "Does this mirror node respond to an internal image request?" - if it can't do this in a short period (a few seconds), the host is marked as down and is subsequently not considered when building the GeoIP region -> backend server maps.
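A check of this kind could be as small as the sketch below. The three second timeout, the URL layout and the function names are assumptions for illustration; the `opener` parameter exists only so the check can be exercised without a network.

```python
import http.client
import urllib.request


def is_healthy(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return True if the node answers the image request with 200 in time."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, DNS failures and HTTP errors.
        return False


def live_hosts(check_urls, check=is_healthy):
    """Given {host: health check URL}, return the set of hosts that pass."""
    return {host for host, url in check_urls.items() if check(url)}
```

Hosts missing from the returned set would simply be left out the next time the GeoIP region -> backend server maps are built.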

More complicated health checks are desired but aren't yet needed just to "push bits".

Origin Servers

The Origin Servers (i.e., where the content actually lives) must answer to the CDN mirror node name '<something>.cdn.cacheboy.net'. To keep things simple, the CDN software currently doesn't rewrite URLs.

Mirror Nodes

The mirror nodes run either Lusca-HEAD or (for a few old ones) Cacheboy-1.6. These are Squid-2 derivatives. They are configured as reverse HTTP proxies. Using HTTP proxy caches has several benefits:

  • Only the frequently accessed content is cached, meaning the servers can get away with a smaller set of very fast storage, rather than trying to store terabytes of content;
  • They are optimised for serving content quickly (contrary to legacy belief, Squid-2 and my subsequent Squid fork will fill a few GigE links with medium to large sized file content);
  • They avoid the whole requirement of building a file mirroring system complete with updating and consistency checking; the HTTP/1.1 caching rules just "give" us that.

Statistics Gathering

Logging

Each request is logged. Log files are processed and then compressed each hour.

Processing

Three modules are available - the "site" overview module, the "geoip" module and the "BGP ASN" module. Data is aggregated per-site, per-geoip and per-ASN. The backend database on each server is simply "sqlite3" for the time being.
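As a rough sketch of what the three aggregations amount to, the example below loads a handful of log records into an in-memory sqlite3 database and groups them three ways. The table layout, column names and sample data are assumptions; the real schema is not described here.

```python
import sqlite3

# Hypothetical module -> grouping columns (every module also groups by hour).
MODULES = {
    "site": ("site",),
    "geoip": ("site", "country"),
    "asn": ("site", "asn"),
}


def aggregate(rows):
    """rows: (hour, site, country, asn, bytes) tuples from processed logs."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE requests "
               "(hour TEXT, site TEXT, country TEXT, asn INTEGER, bytes INTEGER)")
    db.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?)", rows)
    result = {}
    for name, columns in MODULES.items():
        keys = ", ".join(("hour",) + columns)
        # Request count and byte total per group, in a stable order.
        query = (f"SELECT {keys}, COUNT(*), SUM(bytes) FROM requests "
                 f"GROUP BY {keys} ORDER BY {keys}")
        result[name] = db.execute(query).fetchall()
    db.close()
    return result


sample = [
    ("2009-03-01T18", "mirror.example.net", "US", 64500, 1000),
    ("2009-03-01T18", "mirror.example.net", "US", 64500, 500),
    ("2009-03-01T18", "mirror.example.net", "DE", 64510, 200),
]
stats = aggregate(sample)
```

Each module's output is a small set of hourly rows, which keeps the per-server database tiny compared with the raw logs.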

Data is (manually) fetched centrally via an evil export/import HTTP API. This does mean that aggregated hourly statistics per-site/per-AS and per-site/per-GeoIP region are available. Some kind of distributed database (say, BigTable) would be nice for this kind of data.

Request Handling and Distribution

For now, requests are distributed purely using DNS. A backend proxy/cache server is returned to a DNS request for '<something>.cdn.cacheboy.net'.

Proxy/Cache nodes and origin servers do not do any redirection of their own to specific end servers.

The very first cut of this simply used the geoip module in PowerDNS. The current pipe-backend reimplements that (in Perl..) but is slowly growing to include other functionality (eg the weighted random host distribution) and will eventually begin to include BGP and further internet health/topology information.

Operational Feedback

So far, the GeoIP stuff works surprisingly well as a coarse-grained request distribution method. The top 5 countries by traffic are the US (~ 40%), Germany (7%), Canada (7%), Italy (5%) and the UK (5%). The weighted random selection stuff should be "good enough" to do very naive load balancing across the available bandwidth.

This obviously doesn't take into account -any- network topology or health information, and so scaling this out would require (more expensive) transit rather than being smart about using available peering links. Moving to a hybrid BGP and DNS setup will scale better once more nodes come online and more peering links are available. There is no point in putting the extra effort into that right now with only a few nodes and a gigabit or so of aggregate available transit.

Using DNS to control traffic flows has its problems but again, since all that is going on is GeoIP selection, trying to further optimise it is a bit pointless. Combining DNS with HTTP redirect and BGP anycast will make more sense once BGP and general network health information are combined in the selection and redirection processes.

There is still a 24 hour cycle to the CDN traffic given the current content being mirrored, peaking at around 6pm UTC each day. Traffic levels double between the lowest and highest points in the cycle.
