Keepalived: Difference between revisions

From Open-Xchange
 
= Keepalived Loadbalancer =
 
== Introduction ==


This page contains a basic description of how to set up an LVS (Linux Virtual Server) / <code>ipvsadm</code> / <code>keepalived</code> based loadbalancer for MySQL (Galera) loadbalancing.
 
While the setup is more involved than simple user-space daemons and comes with more constraints / requirements, the resulting solution is the cleanest in terms of high-level design, and the most robust and best performing MySQL loadbalancing solution we are aware of.
 
The instructions on this page have been worked out and tested on Debian (latest verified version: 8.9). It should be possible to transfer this information to other distributions / versions.
 
LVS is part of the Linux kernel and has been included in the mainline kernel since roughly the 2.4 series in 2003 (see http://www.linuxvirtualserver.org). Most of the available documentation seems very outdated; the code, however, is part of the standard upstream Linux kernel and as such is well maintained. It is just tricky to find recent reference documentation or howtos.
 
The project homepage http://www.linuxvirtualserver.org/ has some applicable information, in particular on the wiki. There also exists a HOWTO http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/index.html which has proven useful while writing this article. But above all, consult the manpages for <code>ipvsadm</code>, <code>keepalived</code> and the references therein; they are up to date and precise.
 
Some terminology:
 
* The <code>keepalived</code> node(s) are called ''keepalived node'' or ''loadbalancer node''.
* The nodes the loadbalancer node is loadbalancing for, for example OX nodes or Galera nodes, are called ''server nodes'' or ''database nodes''.
 
=== High Level Design ===
 
The solution consists of several components.
 
The main component is a set of kernel modules which implement the actual loadbalancing / forwarding functionality (<code>ip_vs</code>, <code>ip_vs_rr</code>, and some more).
 
There is a command line tool called <code>ipvsadm</code> to manage the loadbalancing configuration of the kernel.
 
It is possible to run an ipvsadm daemon which synchronizes connection states to a standby / slave ipvsadm/LVS instance, so that on failover "most" connections can be kept intact. This is out of scope for this document. It is mentioned here so that you are aware of it and do not confuse it with the <code>keepalived</code> daemon (see below).
 
An LVS/ipvsadm loadbalancer can run standalone, i.e. without further "management" software on top. This is helpful for setup and testing. For production, however, it lacks the functionality to health-check the loadbalancing targets (i.e. database servers) and adjust the loadbalancer tables accordingly. This requires a separate user-space daemon, and this is the functionality provided by keepalived.
 
=== Routing methods ===
 
LVS provides several modes of routing. We describe here the Direct Routing (<code>DR</code>) mode, the Tunneling (<code>TUN</code>) mode, and the ''Network Address Translation'' mode (<code>NAT</code>). There are more routing methods available which may be of interest in special cases, but they are not covered in this document.
 
==== Direct Routing ====
 
Direct Routing works by taking a packet addressed to the virtual / loadbalancer IP, replacing its target MAC with the MAC of the designated target server, and re-sending it.
 
This requires the servers to accept packets for the given IP, so they need to configure the corresponding IP on some local loopback / <code>dummy</code> device. It must be ensured that the servers do not answer ARP requests for the given IP. Otherwise there is a race condition as to which ARP response (server or loadbalancer) a client receives first, leading to unwanted results. This is called ''the ARP problem'' in the documentation, and many possible solutions are given there; with current kernels, however, the method explained below works reliably.
 
Response packets are sent directly from the server to the client. They thus do not go through the loadbalancer, but appear to come from a source whose IP does not match its MAC address.
 
In addition to the requirement to be able to configure additional "secondary" IPs on the involved machines, this method also requires that no involved networking component (routers, virtualization hypervisors, etc.) discards packets which seem "forged" (for example, because IPs do not match MACs). This is typically not a problem in "classical" networking infrastructures, but it is becoming more and more problematic in modern virtualized / cloud infrastructures.
 
==== Tunneling ====
 
The tunneling method works by the loadbalancer encapsulating the packet in an IPIP tunneling packet and sending it to the corresponding server.
 
It also requires that the servers have configured the virtual / loadbalancer IP locally, but here on a <code>tunl</code> device. We have to cover the same ''ARP Problem'' as explained in the Direct Routing section above, with the same solution. We also have the situation that answers are going directly from the servers to the clients, not passing through the loadbalancer.
 
The Tunneling method generally works better in modern virtualized / cloud environments.
 
==== NAT ====
 
Here LVS rewrites the target IP of incoming packets from its own (virtual / loadbalancer) IP to the IP of the designated server, and re-sends the packet.
 
If the target server is in a different network and the loadbalancer is actually acting as a gateway, there is nothing more to do. However, in the standard setup where the loadbalancer and servers are in the same network, some additional configuration is required in order to make the packet roundtrip work: the server must use the loadbalancer as a gateway (at least for the returning MySQL traffic), and the loadbalancer in turn must be configured to act as a gateway and to do IP masquerading for the servers.
 
This method is a fairly recent "discovery" and not yet extensively tested, but it promises to be the most robust method and can also run in general cloud environments. It needs no additional floating / service IPs, places no special requirements on the involved routers, has no ARP problems, etc.
 
== Software installation on the loadbalancer node ==
 
Packages are installed from standard repos using
 
# apt-get install keepalived
 
This will install the required dependencies like <code>ipvsadm</code> etc.
 
Contrary to earlier Debian distros, there is currently no requirement to configure any special service for loading kernel modules and such. In older Debian versions (like Squeeze), some <code>/etc/default/{ipvsadm,keepalived}</code> files needed some tweaking to trigger kernel module loading (which did not seem to happen automatically). This is no longer the case; if working on an old (historical!) Debian version, you may have to investigate here.
 
Configuring IPv4 forwarding is claimed to be required in some places, but it is not required with <code>DR</code> or <code>TUN</code>. It is required for <code>NAT</code> (see the loadbalancer configuration below), and it may become required when experimenting with other routing methods.
 
== Configuration ==
 
The configuration examples given below assume a setup like


10.0.0.1 database server / galera node 1
10.0.0.2 database server / galera node 2
10.0.0.3 database server / galera node 3
10.0.0.4 loadbalancer primary IP
10.0.0.5 database client, e.g. OX middleware node
10.0.0.10 only DR/TUN: loadbalancer virtual IP for reading (round-robin)
10.0.0.11 only DR/TUN: loadbalancer virtual IP for writing (persistent routing / dedicated write node)


Note: with DR and TUN, it is not possible to change the port numbers on routing; thus, for each loadbalancer endpoint, the loadbalancer needs an additional virtual IP. (It is not possible to configure them on different ports on the same (e.g. primary) IP of the loadbalancer.) For NAT, we can use one IP with different ports (5506 for reading, 5507 for writing).

For more information please see: [http://www.keepalived.org/documentation.html www.keepalived.org]


=== Manual configuration / testing ===

==== Networking adjustments on the server nodes ====

For DR and TUN, the server nodes need the loadbalancer virtual IP(s) configured on some network device so that the server processes are able to bind to them. For DR, it seems natural to configure a dummy device. For TUN, you need a tunl device.

For NAT, we need to set the default route for outgoing (or rather, responding) MySQL traffic. It seems to be sufficient to do so with some ip route / ip rule / iptables magic on packets with source port 3306.
 
For testing, you can do it manually on the server nodes:
 
# for TUN
ip link set up tunl0
ip addr add 10.0.0.10/32 brd 10.0.0.10 dev tunl0
ip addr add 10.0.0.11/32 brd 10.0.0.11 dev tunl0
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
 
# for DR
ip addr add 10.0.0.10/32 brd 10.0.0.10 dev dummy0
ip addr add 10.0.0.11/32 brd 10.0.0.11 dev dummy0
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
 
# for NAT
# default route for MySQL response traffic to the primary IP of the loadbalancer
ip route add 10.0.0.4/32 dev eth0 table 1
ip route add default via 10.0.0.4 dev eth0 table 1
ip rule add fwmark 1 table 1
iptables -t mangle -A OUTPUT -p tcp --sport 3306 -j MARK --set-mark 1
 
Note about NAT: If the depicted approach proves insufficient, a different idea would be to use network namespaces to give the mysql process tree a separate default gateway. However, it seems beneficial to affect as little traffic as possible by the "default route", which is why we propose this specific approach rather than a generic "default gateway" approach. The latter would look like <code>ip route delete default; ip route delete 10.0.0.0/24; ip route add 10.0.0.4/32 dev eth0; ip route add default via 10.0.0.4 dev eth0</code>. This also works in tests; however, it is inconvenient for other operations to have all outgoing traffic routed through the loadbalancer, and it is unclear and requires testing whether this would work for all use cases (like full SST, etc.).
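
The ARP settings for DR and TUN shown above are easy to lose during manual testing (for example after a reboot). The following is a minimal sketch of a check for them. The function name <code>check_arp_settings</code> and its optional directory argument are assumptions made for illustration; the argument exists only so that the check can be tried against a copy of the files.

```shell
#!/bin/sh
# Sketch: verify the ARP settings used for DR/TUN on a server node.
# check_arp_settings is a hypothetical helper name, not part of any package.
# $1 (optional): directory holding arp_ignore / arp_announce;
#                defaults to the live kernel settings under /proc.
check_arp_settings() {
    dir=${1:-/proc/sys/net/ipv4/conf/all}
    # Expected values as set in the manual steps: arp_ignore=1, arp_announce=2
    [ "$(cat "$dir/arp_ignore")" = "1" ] && [ "$(cat "$dir/arp_announce")" = "2" ]
}

# Usage on a server node:
#   check_arp_settings && echo "ARP settings OK" || echo "ARP settings missing"
```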
 
==== Loadbalancer ====
 
For <code>DR</code>/<code>TUN</code>, the loadbalancer also needs the virtual IPs configured as secondary IPs (not required for <code>NAT</code>):
 
# TUN and DR
ip addr add 10.0.0.10/32 dev eth0
ip addr add 10.0.0.11/32 dev eth0
 
For <code>NAT</code>, we need to allow IP forwarding and masquerading:
 
# For NAT
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -s 10.0.0.1 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 10.0.0.2 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 10.0.0.3 -j MASQUERADE
 
Then the loadbalancer endpoints themselves can be configured with <code>ipvsadm</code>:
 
# For TUN
# Round-Robin / read instance
/sbin/ipvsadm -A -t 10.0.0.10:3306 -s rr
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.1 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.2 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.3 -i -w 10
# Persistent / write instance
/sbin/ipvsadm -A -t 10.0.0.11:3306 -s rr -p 86400 -M 0.0.0.0
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.1 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.2 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.3 -i -w 10
 
# For DR
# Round-Robin / read instance
/sbin/ipvsadm -A -t 10.0.0.10:3306 -s rr
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.1 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.2 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.3 -g -w 10
# Persistent / write instance
/sbin/ipvsadm -A -t 10.0.0.11:3306 -s rr -p 86400 -M 0.0.0.0
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.1 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.2 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.3 -g -w 10
 
# For NAT
# Round-Robin / read instance
/sbin/ipvsadm -A -t 10.0.0.4:5506 -s rr
/sbin/ipvsadm -a -t 10.0.0.4:5506 -r 10.0.0.1:3306 -m -w 10
/sbin/ipvsadm -a -t 10.0.0.4:5506 -r 10.0.0.2:3306 -m -w 10
/sbin/ipvsadm -a -t 10.0.0.4:5506 -r 10.0.0.3:3306 -m -w 10
# Persistent / write instance
/sbin/ipvsadm -A -t 10.0.0.4:5507 -s rr -p 86400 -M 0.0.0.0
/sbin/ipvsadm -a -t 10.0.0.4:5507 -r 10.0.0.1:3306 -m -w 10
/sbin/ipvsadm -a -t 10.0.0.4:5507 -r 10.0.0.2:3306 -m -w 10
/sbin/ipvsadm -a -t 10.0.0.4:5507 -r 10.0.0.3:3306 -m -w 10
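
The three rule sets above differ only in the forwarding flag (<code>-i</code>, <code>-g</code>, <code>-m</code>), the service address, and the real server port. As a sketch of that pattern, the following hypothetical helper <code>lvs_rules</code> prints (but does not execute) the corresponding commands, so that the output can be reviewed before it is run. The function name and its argument order are assumptions for illustration; the server IPs are those of the example setup.

```shell
#!/bin/sh
# Sketch: print the ipvsadm commands for one virtual service (dry run only).
# lvs_rules is a hypothetical helper; server IPs are the example setup's.
# $1: routing mode (TUN, DR or NAT)
# $2: virtual service as IP:PORT, e.g. 10.0.0.10:3306
# $3: "yes" to add the persistence options of the write instance
# $4: (optional) real server port, needed for NAT where ports are remapped
lvs_rules() {
    case $1 in
        TUN) flag=-i ;;   # IPIP tunneling
        DR)  flag=-g ;;   # direct routing
        NAT) flag=-m ;;   # masquerading
        *)   echo "unknown mode: $1" >&2; return 1 ;;
    esac
    if [ "$3" = "yes" ]; then
        echo "/sbin/ipvsadm -A -t $2 -s rr -p 86400 -M 0.0.0.0"
    else
        echo "/sbin/ipvsadm -A -t $2 -s rr"
    fi
    for server in 10.0.0.1 10.0.0.2 10.0.0.3; do
        # ${4:+:$4} appends ":PORT" only when a real server port was given
        echo "/sbin/ipvsadm -a -t $2 -r $server${4:+:$4} $flag -w 10"
    done
}

# Print the TUN rules for the round-robin / read instance
lvs_rules TUN 10.0.0.10:3306 no
```

Printing instead of executing keeps the sketch safe to try on any machine; once the output looks right, it can be piped to <code>sh</code> on the loadbalancer.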
 
Note: For DR and TUN, you need to restart the MySQL service after the networking adjustments; otherwise, the MySQL daemon will not accept packets with the virtual IP as target IP. Overlooking this has cost quite a few people a lot of time.
 
Note: here in the manual setup / testing step, we let LVS pick the master node in a persistent way. With keepalived (see below) we can also choose to pick a master node based on its cluster index.
 
Note: if lazy, you can test with one server node, and extend the configuration later to all three nodes.
 
Note: to view the current LVS configuration, use
 
# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  10.0.0.10:mysql rr
  -> 10.0.0.1:mysql              Tunnel  10    0          0
  -> 10.0.0.2:mysql              Tunnel  10    0          0
  -> 10.0.0.3:mysql              Tunnel  10    0          0
TCP  10.0.0.11:mysql rr persistent 86400
  -> 10.0.0.1:mysql              Tunnel  10    0          0
  -> 10.0.0.2:mysql              Tunnel  10    0          0
  -> 10.0.0.3:mysql              Tunnel  10    0          0
 
Note: to stop / start over, use <code>ipvsadm -C</code>.
 
Note: you can use <code>ipvsadm -S</code> / <code>ipvsadm -R</code> for easier iterative testing (see manpage).
 
# ipvsadm -S
-A -t 10.0.0.10:mysql -s rr
-a -t 10.0.0.10:mysql -r 10.0.0.1:mysql -i -w 10
-a -t 10.0.0.10:mysql -r 10.0.0.2:mysql -i -w 10
-a -t 10.0.0.10:mysql -r 10.0.0.3:mysql -i -w 10
-A -t 10.0.0.11:mysql -s rr -p 86400
-a -t 10.0.0.11:mysql -r 10.0.0.1:mysql -i -w 10
-a -t 10.0.0.11:mysql -r 10.0.0.2:mysql -i -w 10
-a -t 10.0.0.11:mysql -r 10.0.0.3:mysql -i -w 10
# ipvsadm -S > ipvsadm.conf
# ipvsadm -R < ipvsadm.conf
 
==== Testing ====
 
You should then be able to verify functionality from the client / OX middleware node with something like the following (omitting authentication command line arguments for brevity):
 
# while true; do mysql -h10.0.0.10 -B -N -e "select @@hostname;"; sleep 1; done
db3
db2
db1
db3
db2
db1
[...]
^C
# while true; do mysql -h10.0.0.11 -B -N -e "select @@hostname;"; sleep 1; done
db3
db3
db3
db3
db3
db3
[...]
^C
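
When the loop runs for longer, counting the answers per host is easier than reading the raw output. A minimal sketch, assuming a hypothetical helper named <code>tally_hosts</code> that reads one hostname per line:

```shell
#!/bin/sh
# Sketch: count how often each database node answered.
# tally_hosts is a hypothetical helper; it reads one hostname per line on stdin
# and prints "hostname count" pairs, sorted by hostname.
tally_hosts() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage from the client node (authentication arguments omitted as above):
#   for i in 1 2 3 4 5 6; do
#       mysql -h10.0.0.10 -B -N -e "select @@hostname;"
#   done | tally_hosts
```

For the round-robin endpoint the counts should be roughly equal; for the persistent endpoint a single host should receive all connections.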
 
If it does not work:

* Remember that you need to restart the MySQL server after networking adjustments
* Try to use <code>tcpdump</code> to find out on which node (loadbalancer or server) your TCP packets actually arrive
* Use <code>arp -a</code> to verify that the server nodes did not advertise the virtual IP addresses with their MAC
* Verify that the usual candidates like <code>iptables</code> (off by default on Debian; may vary in your installation), <code>selinux/apparmor</code> (if using SLES or RHEL), or additional firewalls are not spoiling your testing
 
Please verify the manual setup before proceeding to the persistent / production configuration.
 
=== Persistent / production configuration ===
 
==== Networking adjustments on the server nodes ====
 
It is possible to attach the configuration to <code>/etc/network/interfaces</code>:
 
# TUN example
auto eth0
iface eth0 inet static
    address 10.0.0.XYZ
    netmask 255.255.255.0
    pre-up echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    pre-up echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
    post-up ip link set up tunl0
    post-up ip addr add 10.0.0.10/32 brd 10.0.0.10 dev tunl0
    post-up ip addr add 10.0.0.11/32 brd 10.0.0.11 dev tunl0
    pre-down ip addr del 10.0.0.11/32 dev tunl0
    pre-down ip addr del 10.0.0.10/32 dev tunl0
    pre-down ip link set down tunl0
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce
 
# DR example
auto eth0
iface eth0 inet static
    address 10.0.0.XYZ
    netmask 255.255.255.0
    pre-up echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    pre-up echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
    post-up ip addr add 10.0.0.10/32 brd 10.0.0.10 dev dummy0
    post-up ip addr add 10.0.0.11/32 brd 10.0.0.11 dev dummy0
    pre-down ip addr del 10.0.0.11/32 dev dummy0
    pre-down ip addr del 10.0.0.10/32 dev dummy0
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce
 
# NAT example
auto eth0
iface eth0 inet static
    address 10.0.0.XYZ
    netmask 255.255.255.0
    post-up ip route add 10.0.0.4/32 dev eth0 table 1
    post-up ip route add default via 10.0.0.4 dev eth0 table 1
    post-up ip rule add fwmark 1 table 1
    post-up iptables -t mangle -A OUTPUT -p tcp --sport 3306 -j MARK --set-mark 1
    pre-down iptables -t mangle -D OUTPUT -p tcp --sport 3306 -j MARK --set-mark 1
    pre-down ip rule del fwmark 1 table 1
    pre-down ip route flush table 1
 
==== Keepalived configuration (health checks skipped) ====
 
Note: keepalived will manage the secondary IPs, so there is no need to hard-wire them in <code>/etc/network/interfaces</code> or similar. Rather, deconfigure any secondary IPs that may still be configured manually from previous testing.
 
Create a config file <code>/etc/keepalived/keepalived.conf</code> for basic functionality testing like this:


  global_defs {
    # This should be unique.
    router_id galera-lb
  }
  
  vrrp_instance mysql_pool {
    # The interface we listen on.
    interface eth0
  
    # The default state, one should be MASTER, the others should be set to BACKUP.
    state MASTER
    priority 101
  
    # This should be the same on all participating load balancers.
    virtual_router_id 19
  
    # Set the interface whose status to track to trigger a failover.
    track_interface {
      eth0
    }
  
    # Password for the loadbalancers to share.
    authentication {
      auth_type PASS
      auth_pass Twagipmiv3
    }
  
    # This is the IP address that floats between the loadbalancers.
    virtual_ipaddress {
      10.0.0.10/32 dev eth0
      10.0.0.11/32 dev eth0
    }
  }
  
  # Here we add the virtual mysql read node
  virtual_server 10.0.0.10 3306 {
    delay_loop 6
    # Round robin, but you can use whatever fits your needs.
    lb_algo rr
  
    lb_kind TUN
    protocol TCP
  
    # For each server add the following.
    real_server 10.0.0.1 3306 {
      weight 10
    }
    real_server 10.0.0.2 3306 {
      weight 10
    }
    real_server 10.0.0.3 3306 {
      weight 10
    }
  }
  
  # Here we add the virtual mysql write node
  virtual_server 10.0.0.11 3306 {
    delay_loop 6
    # Round robin, but you can use whatever fits your needs.
    lb_algo rr
  
    lb_kind TUN
    protocol TCP
    # the following two options implement that active-passive behavior
    persistence_timeout 86400
    # make sure all OX nodes are included in that netmask
    persistence_granularity 0.0.0.0
  
    # For each server add the following.
    real_server 10.0.0.1 3306 {
      weight 10
    }
    real_server 10.0.0.2 3306 {
      weight 10
    }
    real_server 10.0.0.3 3306 {
      weight 10
    }
  }
 
The file should be self-explanatory if you followed the manual configuration explanations above. The only unexpected things are directives like <code>state</code> and <code>priority</code>, which are explained below for the multi-keepalived setup.
 
The example has been using <code>TUN</code>; for <code>DR</code>, just replace <code>TUN</code> by <code>DR</code> in the <code>virtual_server</code> definitions / <code>lb_kind</code> line.
 
For <code>NAT</code>, as this method does not use virtual IPs, you can disable/comment out the <code>virtual_ipaddress</code> section in the <code>vrrp_instance</code> section, and you change the virtual servers to run on the native IP on extra ports like 5506 and 5507. And of course you change the <code>lb_kind</code> to <code>NAT</code>.
 
After a <code>service keepalived restart</code> you should be able to execute the same client connectivity tests as shown above. (Remember to cleanly unconfigure your manual setup beforehand in order to not measure false success.)
 
Running in production without health checks is by no means recommended; use this configuration only as an intermediate step.
 
==== Keepalived configuration (with health checks) ====
 
We now configure health checks in <code>keepalived.conf</code>. The recommended way is to configure a custom [[Clustercheck|clustercheck]] service on the server nodes and configure it in <code>keepalived.conf</code> on the loadbalancer as follows. Each <code>real_server</code> gets a <code>HTTP_GET</code> child element like
 
  real_server 10.0.0.1 3306 {
    weight 10
    HTTP_GET {
      url {
        path /
        status_code 200
      }
      connect_port 9200
    }
  }


This way only cluster nodes which are <code>Synced</code> will be considered available.
 
Furthermore, in particular if you run a distributed, otherwise non-synchronized Keepalived cluster, you might want to use a method like MariaDB's MaxScale and choose as master only the server which has <code>wsrep_local_index==0</code>. This way it is impossible to end up in scenarios where different loadbalancers pick different master nodes. In that case you should probably also deconfigure the <code>persistence_timeout</code> and <code>persistence_granularity</code> options. You configure this using <code>path /master</code> lines instead of the <code>path /</code> ones.
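
The decision logic behind the two check paths can be sketched as follows. This is an illustration only, not the actual clustercheck script: the function name <code>check_status</code> and its arguments are assumptions, and a real service would read the values from the Galera status variables <code>wsrep_local_state_comment</code> and <code>wsrep_local_index</code> and wrap the result in an HTTP response on port 9200.

```shell
#!/bin/sh
# Sketch: decide which HTTP status a health-check endpoint would return.
# check_status is a hypothetical helper, not the real clustercheck script.
# $1: requested path ("/" or "/master")
# $2: value of wsrep_local_state_comment (e.g. "Synced")
# $3: value of wsrep_local_index
check_status() {
    # Nodes that are not Synced are never considered available.
    if [ "$2" != "Synced" ]; then
        echo 503
        return
    fi
    # On /master, only the node with cluster index 0 reports success.
    if [ "$1" = "/master" ] && [ "$3" != "0" ]; then
        echo 503
        return
    fi
    echo 200
}
```

Since <code>keepalived</code> is configured with <code>status_code 200</code>, any other status takes the node out of the pool.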
 
If you only have one (active) ''loadbalancer'' at each given time, you might want to stick with the standard way of letting Keepalived decide (persistently) for a master node. It might have faster failover/recovery times, and avoid (unnecessary) failbacks.
 
As usual, thorough testing after config changes is mandatory.
 
=== Adding a second Keepalived node for redundancy ===
 
With a single keepalived node we have a single point of failure. It is possible to add a second keepalived node which communicates with the first keepalived node and transitions from backup state to master state upon failure of the first node.
 
To set up a second keepalived node as described above, create a keepalived node identical to the first one, with the following changes to the configuration file <code>/etc/keepalived/keepalived.conf</code>:
 
* Change the <code>router_id</code> (to the hostname, for example)
* Change the <code>state</code> to <code>BACKUP</code>
* Change the <code>priority</code> to something lower than the master's priority (e.g. <code>100</code>)
 
Make sure the <code>virtual_router_id</code> and authentication information is the same on the backup keepalived node as on the master keepalived node.
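
The changes listed above are mechanical, so the backup configuration can be derived from the master's file. A minimal sketch, assuming a hypothetical helper <code>make_backup_conf</code> and the values used in the example configuration (<code>state MASTER</code>, <code>priority 101</code>). The new <code>router_id</code> must not contain <code>/</code> or <code>&</code>, since it is inserted into a <code>sed</code> expression.

```shell
#!/bin/sh
# Sketch: derive the backup node's keepalived.conf from the master's.
# make_backup_conf is a hypothetical helper. It reads the master
# configuration on stdin and writes the backup variant to stdout.
# $1: router_id for the backup node (e.g. its hostname)
make_backup_conf() {
    # The patterns are anchored at line start, so "virtual_router_id"
    # is left untouched by the router_id substitution.
    sed -e "s/^\([[:space:]]*router_id[[:space:]][[:space:]]*\).*/\1$1/" \
        -e 's/^\([[:space:]]*state[[:space:]][[:space:]]*\)MASTER/\1BACKUP/' \
        -e 's/^\([[:space:]]*priority[[:space:]][[:space:]]*\)101[[:space:]]*$/\1100/'
}

# Usage (file names are examples only):
#   make_backup_conf lb2 < keepalived.conf.master > keepalived.conf.backup
```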


Now the backup node will notice the master going down and take over. Automatic failback also happens.

Keepalived will automatically manage the secondary IPs, so there is no need for any additional clustering software like <code>corosync/pacemaker</code> etc.

== Keepalived monitoring ==

ipvsadm -Ln -t $LOADBALANCER_IP:$LOADBALANCER_PORT
ipvsadm -Ln -t $LOADBALANCER_IP:$LOADBALANCER_PORT --stats
ipvsadm -Ln -t $LOADBALANCER_IP:$LOADBALANCER_PORT --rate
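
For automated monitoring, the output of these commands can be parsed. The following sketch counts the real servers that currently have a non-zero weight. The helper name <code>count_active_servers</code> is an assumption, and the parsing relies on the column layout of the <code>ipvsadm -L</code> output shown earlier (Forward, Weight, ActiveConn, InActConn).

```shell
#!/bin/sh
# Sketch: count real servers with a non-zero weight.
# count_active_servers is a hypothetical helper; it reads the output of
# "ipvsadm -Ln" on stdin and prints the number of matching real servers.
count_active_servers() {
    # Real server lines start with "->" followed by a numeric address;
    # the column header line ("-> RemoteAddress:Port ...") is skipped.
    # Field 4 is the Weight column.
    awk '$1 == "->" && $2 ~ /^[0-9]/ && $4 > 0 { n++ } END { print n + 0 }'
}

# Usage on the loadbalancer node:
#   ipvsadm -Ln -t $LOADBALANCER_IP:$LOADBALANCER_PORT | count_active_servers
```

A monitoring check could alert when the printed number drops below the expected number of database nodes.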

Latest revision as of 06:44, 26 September 2017

Keepalived Loadbalancer

Introduction

This page contains a basic description about how to set up a LVS (Linux Virtual Server) / ipvsadm / keepalived based loadbalancer for MySQL (Galera) loadbalancing.

While the setup is more involved that simple user-space daemons and suffers from more constraints / requirements, the resulting solution is the cleanest with regards to high level design, most robust and best performing MySQL loadbalancing solution we are aware of.

The instructions on this page have been worked out and tested on Debian (latest verified version: 8.9). It should be possible to transfer this information to other distributions / versions.

LVS is a linux kernel module and has been included the mainline kernel since roughly 2.4.something in 2003 (see http://www.linuxvirtualserver.org). Most documentation available seems very outdated, however this code is part of the standard upstream linux kernel and as such perfectly maintained. It is, however, tricky to find recent reference documentation or howtos.

The project homepage http://www.linuxvirtualserver.org/ has some applicable information, in particular on the wiki. There also exists a HOWTO http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/index.html which has proven useful while writing this article. But above all, consult the manpages for ipvsadm, keepalived and the references therein; they are up to date and precise.

Some terminology:

  • The keepalived node(s) are called keepalived node or loadbalancer node.
  • The nodes the loadbalancer node is loadbalancing for, for example OX nodes or Galera nodes, are called server nodes or database nodes.

High Level Design

The solution consists of several components.

Main component is some kernel modules which implements the real loadbalancing / forwarding functionality (ip_vs, ip_vs_rr, and some more).

There is a command line tool to manage the loadbalancing konfiguration of the kernel called ipvsadm.

It is possible to run an ipvsadm daemon which allows synchronization of connection states to a standby / slave ipvsadm/LVS instance, so that on failover "most" connections can keep intact. This is out of scope of this document. It is mentioned here to be aware of it and to not confuse it with the keepalived daemon (see below).

A LVS/ipvsadm loadbalancer can run standalone, i.e. without further "management" software ontop. This is helpful in setup and testing. However for production it lacks the functionality to health-check the loadbalancing targets (i.e. database servers) and adjust the loadbalancer tables accordingly. To do this, a separate user-space instance / daemon is required, and this is the functionality provided by keepalived.

Routing methods

LVS provides several modes of routing. We will describe here the Direct Routing (DR) mode, the Tunneling (TUN) mode, and the Network Address Translation mode (NAT). There are more routing methods available which might come interesting in special cases, but not covered in this document.

Direct Routing

Direct Routing works by replacing the target MAC in a package addressed to the loadbalancer to its virtual / loadbalancer IP with the MAC of the designated target server and re-sending it.

This requires the servers to accept packages for the given IP, so they need to configure the corresponding IP on some local looopback / dummy device. It must be ensured the servers do not answer ARP requests for the given IP. Otherwise there is a race condition on which server / loadbalancer ARP response will be first received by a client, leading to unwanted results. This is called the ARP problem in the documentation and there are given many possible solutions; however with current kernels the method explained below works reliably.

Response packages are sent directly from the server to the client, thus they don't go through the loadbalancer, but appear to come from a source where the source IP does not match the MAC address.

In addition to the requirement to be able to configure addtional "secondary" IPs on the involved machines, this method also requires that no involved networking component (routers, virtualization hypervisors, etc) discard packages which seem "forged" (like, IPs do not match MACs, etc). This is typically not a problem in "classical" networking infrastructures, but getting more and more problematic in modern virtualized / cloud infrastructures.

Tunneling

The tunneling method works by the loadbalancer encapsulating the package in an IPIP tunneling package and sending it to the corresponding server.

It also requires that the servers have configured the virtual / loadbalancer IP locally, but here on a tunl device. We have to cover the same ARP Problem as explained in the Direct Routing section above, with the same solution. We also have the situation that answers are going directly from the servers to the clients, not passing through the loadbalancer.

The Tunneling method generally works better in modern virtualized / cloud environments.

NAT

Here LVS rewrites the target IP of incoming packets from its own (virtual / loadbalancer) IP to the IP of the designated server, and re-sends the packet.

If the target server is in a different network and the loadbalancer is actually acting as a gateway, there is nothing more to do. However, in the standard setup where the loadbalancer and servers are in the same network, there is some additional configuration required in order to make the packet roundtrip work: the server must use the loadbalancer as a gateway (at least for the returning MySQL traffic), and the loadbalancer in turn must be configured to act as a gateway and to do IP masquerading for the servers.

This method is a fairly recent "discovery" and not yet extensively tested, but it promises to be the most robust method and allows for running also in general cloud environments. It needs no additional floating / service IPs, no special requirements on the involved routers, no ARP problems, etc.

Software installation on the loadbalancer node

Packages are installed from standard repos using

# apt-get install keepalived 

This will install the required dependencies like ipvsadm etc.

Contrary to earlier Debian releases, there is currently no requirement to configure any special service for loading kernel modules and such. In older Debian versions (like Squeeze) some /etc/default/{ipvsadm,keepalived} files needed some tweaking to get the kernel modules loaded (automatic loading seemed to fail). This is no longer the case; if working on an old (historical!) Debian version, you may have to investigate here.

Some sources also claim that IPv4 forwarding must be configured. It is not required with DR or TUN; if experimenting with other routing methods, it may become required.

Configuration

The configuration examples given below assume a setup like

10.0.0.1 database server / galera node 1
10.0.0.2 database server / galera node 2
10.0.0.3 database server / galera node 3
10.0.0.4 loadbalancer primary IP
10.0.0.5 database client, e.g. OX middleware node
10.0.0.10 only DR/TUN: loadbalancer virtual IP for reading (round-robin)
10.0.0.11 only DR/TUN: loadbalancer virtual IP for writing (persistent routing / dedicated write node)

Note: with DR and TUN, it is not possible to change the port numbers on routing; thus, for each loadbalancer endpoint, the loadbalancer needs an additional virtual IP. (It is not possible to configure them on different ports on the same (e.g. primary) IP of the loadbalancer.) For NAT, we can use one IP with different ports (5506 for reading, 5507 for writing).

Manual configuration / testing

Networking adjustments on the server nodes

For DR and TUN, the server nodes need the loadbalancer virtual IP(s) configured on some network device in order for the server processes to be able to bind on this device. For DR, it seems natural to configure a dummy device. For TUN, you need a tunl device.

For NAT, we need to set the default route for outgoing (or rather, responding) MySQL traffic. It seems sufficient to do so with some ip route / ip rule / iptables magic on packets with source port 3306.

For testing, you can do it manually on the server nodes:

# for TUN
ip link set up tunl0
ip addr add 10.0.0.10/32 brd 10.0.0.10 dev tunl0
ip addr add 10.0.0.11/32 brd 10.0.0.11 dev tunl0
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
# for DR
ip addr add 10.0.0.10/32 brd 10.0.0.10 dev dummy0
ip addr add 10.0.0.11/32 brd 10.0.0.11 dev dummy0
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
# for NAT
# default route for MySQL response traffic to the primary IP of the loadbalancer
ip route add 10.0.0.4/32 dev eth0 table 1
ip route add default via 10.0.0.4 dev eth0 table 1
ip rule add fwmark 1 table 1
iptables -t mangle -A OUTPUT -p tcp --sport 3306 -j MARK --set-mark 1
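For DR and TUN it can be handy to confirm on a server node that the two ARP settings above are actually in effect before testing further. The following is a minimal sketch, not part of the original setup; the function name check_arp_sysctls and its optional root-directory argument are assumptions made for illustration (the argument only exists so the function can be tried against a scratch directory).

```shell
# Check that the ARP-related sysctls have the values used in this article
# (arp_ignore=1, arp_announce=2 on "all").
# $1 (optional): sysctl root directory, defaults to /proc/sys.
check_arp_sysctls() {
    root="${1:-/proc/sys}"
    ignore=$(cat "$root/net/ipv4/conf/all/arp_ignore" 2>/dev/null)
    announce=$(cat "$root/net/ipv4/conf/all/arp_announce" 2>/dev/null)
    if [ "$ignore" = "1" ] && [ "$announce" = "2" ]; then
        echo "OK: arp_ignore=$ignore arp_announce=$announce"
        return 0
    fi
    echo "WARN: arp_ignore=${ignore:-?} arp_announce=${announce:-?} (expected 1 and 2)"
    return 1
}
```

Calling check_arp_sysctls without arguments on a server node reads the live values; a non-zero return code indicates that the node may still answer ARP requests for the virtual IPs.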

Note about NAT: If the depicted approach proves insufficient, a different idea would be to use network namespaces to give the mysql process tree a separate default gateway. However, it seems beneficial to affect as little traffic as possible with the "default route", which is why we propose this specific approach rather than a generic "default gateway" approach. The latter would look like ip route delete default; ip route delete 10.0.0.0/24; ip route add 10.0.0.4/32 dev eth0; ip route add default via 10.0.0.4 dev eth0. This also works in tests; however, it is inconvenient for other operations to have all outgoing traffic routed through the loadbalancer, and it remains to be tested whether this would work for all use cases (like full SST, etc).

Loadbalancer

For DR/TUN, the loadbalancer also needs the virtual IPs configured as secondary IPs (not required for NAT):

# TUN and DR
ip addr add 10.0.0.10/32 dev eth0
ip addr add 10.0.0.11/32 dev eth0

For NAT, we need to allow IP forwarding and masquerading:

# For NAT
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -s 10.0.0.1 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 10.0.0.2 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 10.0.0.3 -j MASQUERADE

Then the loadbalancer endpoints themselves can be configured with ipvsadm:

# For TUN
# Round-Robin / read instance
/sbin/ipvsadm -A -t 10.0.0.10:3306 -s rr
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.1 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.2 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.3 -i -w 10
# Persistent / write instance
/sbin/ipvsadm -A -t 10.0.0.11:3306 -s rr -p 86400 -M 0.0.0.0
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.1 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.2 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.3 -i -w 10
# For DR
# Round-Robin / read instance
/sbin/ipvsadm -A -t 10.0.0.10:3306 -s rr
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.1 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.2 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.3 -g -w 10
# Persistent / write instance
/sbin/ipvsadm -A -t 10.0.0.11:3306 -s rr -p 86400 -M 0.0.0.0
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.1 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.2 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.3 -g -w 10
# For NAT
# Round-Robin / read instance
/sbin/ipvsadm -A -t 10.0.0.4:5506 -s rr
/sbin/ipvsadm -a -t 10.0.0.4:5506 -r 10.0.0.1:3306 -m -w 10
/sbin/ipvsadm -a -t 10.0.0.4:5506 -r 10.0.0.2:3306 -m -w 10
/sbin/ipvsadm -a -t 10.0.0.4:5506 -r 10.0.0.3:3306 -m -w 10
# Persistent / write instance
/sbin/ipvsadm -A -t 10.0.0.4:5507 -s rr -p 86400 -M 0.0.0.0
/sbin/ipvsadm -a -t 10.0.0.4:5507 -r 10.0.0.1:3306 -m -w 10
/sbin/ipvsadm -a -t 10.0.0.4:5507 -r 10.0.0.2:3306 -m -w 10
/sbin/ipvsadm -a -t 10.0.0.4:5507 -r 10.0.0.3:3306 -m -w 10

Note: For DR and TUN, you need to restart the MySQL service after the networking adjustments; otherwise, the MySQL daemon will not accept packets with the virtual IP as target IP. Forgetting this has cost quite a few people a lot of time.

Note: here in the manual setup / testing step, we let LVS pick the master node in a persistent way. With keepalived (see below) we can also choose to pick a master node based on its cluster index.

Note: if lazy, you can test with one server node, and extend the configuration later to all three nodes.

Note: to view the current LVS configuration, use

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.10:mysql rr
  -> 10.0.0.1:mysql               Tunnel  10     0          0
  -> 10.0.0.2:mysql               Tunnel  10     0          0
  -> 10.0.0.3:mysql               Tunnel  10     0          0
TCP  10.0.0.11:mysql rr persistent 86400
  -> 10.0.0.1:mysql               Tunnel  10     0          0
  -> 10.0.0.2:mysql               Tunnel  10     0          0
  -> 10.0.0.3:mysql               Tunnel  10     0          0

Note: to stop / start over, use ipvsadm -C.

Note: you can use ipvsadm -S / ipvsadm -R for easier iterative testing (see manpage).

# ipvsadm -S
-A -t 10.0.0.10:mysql -s rr
-a -t 10.0.0.10:mysql -r 10.0.0.1:mysql -i -w 10
-a -t 10.0.0.10:mysql -r 10.0.0.2:mysql -i -w 10
-a -t 10.0.0.10:mysql -r 10.0.0.3:mysql -i -w 10
-A -t 10.0.0.11:mysql -s rr -p 86400
-a -t 10.0.0.11:mysql -r 10.0.0.1:mysql -i -w 10
-a -t 10.0.0.11:mysql -r 10.0.0.2:mysql -i -w 10
-a -t 10.0.0.11:mysql -r 10.0.0.3:mysql -i -w 10
# ipvsadm -S > ipvsadm.conf
# ipvsadm -R < ipvsadm.conf
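For iterative testing it may be convenient to generate such a ruleset instead of typing it. The sketch below prints rules in the same format as the ipvsadm -S output above, using the scheduler, weight and persistence options from this article. The function name gen_ipvs_rules and its argument order are assumptions for illustration only.

```shell
# gen_ipvs_rules SERVICE FWD_FLAG PERSIST_SECONDS REAL_SERVER...
#   SERVICE          virtual service as IP:PORT
#   FWD_FLAG         -i (TUN), -g (DR) or -m (NAT)
#   PERSIST_SECONDS  0 for plain round-robin, otherwise the -p timeout
#   REAL_SERVER      one or more real servers as IP:PORT
# Prints one "-A" line for the service and one "-a" line per real server.
gen_ipvs_rules() {
    service="$1"; fwd="$2"; persist="$3"
    shift 3
    if [ "$persist" -gt 0 ]; then
        printf '%s\n' "-A -t $service -s rr -p $persist -M 0.0.0.0"
    else
        printf '%s\n' "-A -t $service -s rr"
    fi
    for rs in "$@"; do
        printf '%s\n' "-a -t $service -r $rs $fwd -w 10"
    done
}

# Possible usage on the loadbalancer (TUN, as root):
#   { gen_ipvs_rules 10.0.0.10:3306 -i 0     10.0.0.1:3306 10.0.0.2:3306 10.0.0.3:3306
#     gen_ipvs_rules 10.0.0.11:3306 -i 86400 10.0.0.1:3306 10.0.0.2:3306 10.0.0.3:3306
#   } > ipvsadm.conf
#   ipvsadm -C && ipvsadm -R < ipvsadm.conf
```

Switching between TUN, DR and NAT then only requires changing the forwarding flag (and, for NAT, the service address and ports).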

Testing

You should be able to verify functionality then from the client / OX middleware node with something like (omitting authentication command line arguments for brevity)

# while true; do mysql -h10.0.0.10 -B -N -e "select @@hostname;"; sleep 1; done
db3
db2
db1
db3
db2
db1
[...]
^C
# while true; do mysql -h10.0.0.11 -B -N -e "select @@hostname;"; sleep 1; done
db3
db3
db3
db3
db3
db3
[...]
^C
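Instead of eyeballing the loop output, the answers can be tallied per backend. This is a small sketch; the helper name summarize_backends is an assumption, and it only relies on sort, uniq and awk.

```shell
# Read one hostname per line on standard input and print "hostname: count"
# for each distinct hostname, sorted by hostname.
summarize_backends() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Possible usage from the client node (authentication arguments omitted):
#   for i in $(seq 1 30); do
#       mysql -h10.0.0.10 -B -N -e "select @@hostname;"
#   done | summarize_backends
```

For the round-robin endpoint one would expect roughly equal counts for all database nodes; for the persistent endpoint a single hostname should appear.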

If it does not work:

  • Remember you need to restart the MySQL server after networking adjustments
  • Try to use tcpdump to find out on which node (loadbalancer or server) your TCP packets actually arrive
  • Use arp -a to verify the server nodes did not advertise the virtual IP addresses with their MAC
  • Verify that the usual candidates like iptables (off by default on Debian; may vary in your installation), selinux/apparmor (if using SLES or RHEL), or additional firewalls are not spoiling your testing

Please verify the manual setup before proceeding to the persistent / production configuration.

Persistent / production configuration

Networking adjustments on the server nodes

It is possible to attach the configuration to /etc/network/interfaces:

# TUN example
auto eth0
iface eth0 inet static
    address 10.0.0.XYZ
    netmask 255.255.255.0
    pre-up echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    pre-up echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
    post-up ip link set up tunl0
    post-up ip addr add 10.0.0.10/32 brd 10.0.0.10 dev tunl0
    post-up ip addr add 10.0.0.11/32 brd 10.0.0.11 dev tunl0
    pre-down ip addr del 10.0.0.11/32 dev tunl0
    pre-down ip addr del 10.0.0.10/32 dev tunl0
    pre-down ip link set down tunl0
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce
# DR example
auto eth0
iface eth0 inet static
    address 10.0.0.XYZ
    netmask 255.255.255.0
    pre-up echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    pre-up echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
    post-up ip addr add 10.0.0.10/32 brd 10.0.0.10 dev dummy0
    post-up ip addr add 10.0.0.11/32 brd 10.0.0.11 dev dummy0
    pre-down ip addr del 10.0.0.11/32 dev dummy0
    pre-down ip addr del 10.0.0.10/32 dev dummy0
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce
# NAT example
auto eth0
iface eth0 inet static
    address 10.0.0.XYZ
    netmask 255.255.255.0
    post-up ip route add 10.0.0.4/32 dev eth0 table 1
    post-up ip route add default via 10.0.0.4 dev eth0 table 1
    post-up ip rule add fwmark 1 table 1
    post-up iptables -t mangle -A OUTPUT -p tcp --sport 3306 -j MARK --set-mark 1
    pre-down iptables -t mangle -D OUTPUT -p tcp --sport 3306 -j MARK --set-mark 1
    pre-down ip rule del fwmark 1 table 1
    pre-down ip route flush table 1

Keepalived configuration (health checks skipped)

Note: keepalived will manage the secondary IPs, so there is no need to hard-wire them in /etc/network/interfaces or the like. Rather, remove any secondary IPs that may still be configured from previous manual testing.

Create a config file /etc/keepalived/keepalived.conf for basic functionality testing like

global_defs {
  # This should be unique.
  router_id galera-lb
}

vrrp_instance mysql_pool {
  # The interface we listen on.
  interface eth0

  # The default state: one node should be MASTER, the others should be set to BACKUP.
  state MASTER
  priority 101

  # This should be the same on all participating load balancers.
  virtual_router_id 19

  # Set the interface whose status to track to trigger a failover.
  track_interface {
    eth0
  }

  # Password for the loadbalancers to share.
  authentication {
    auth_type PASS
    auth_pass Twagipmiv3
  }

  # This is the IP address that floats between the loadbalancers.
  virtual_ipaddress {
   10.0.0.10/32 dev eth0
   10.0.0.11/32 dev eth0
  }
}

# Here we add the virtual mysql read node
virtual_server 10.0.0.10 3306 {
  delay_loop 6
  # Round robin, but you can use whatever fits your needs.
  lb_algo rr

  lb_kind TUN
  protocol TCP

  # For each server add the following.
  real_server 10.0.0.1 3306 {
    weight 10
  }
  real_server 10.0.0.2 3306 {
    weight 10
  }
  real_server 10.0.0.3 3306 {
    weight 10
  }
}

# Here we add the virtual mysql write node
virtual_server 10.0.0.11 3306 {
  delay_loop 6
  # Round robin, but you can use whatever fits your needs.
  lb_algo rr

  lb_kind TUN
  protocol TCP

  # the following two options implement the active-passive behavior
  persistence_timeout 86400
  # make sure all OX nodes are included in that netmask
  persistence_granularity 0.0.0.0

  # For each server add the following.
  real_server 10.0.0.1 3306 {
    weight 10
  }
  real_server 10.0.0.2 3306 {
    weight 10
  }
  real_server 10.0.0.3 3306 {
    weight 10
  }
}

The file should be self-explanatory if you followed the manual configuration explanations above. The only unexpected things are directives like state and priority, which will be explained below for the multi-keepalived setup.

The example uses TUN; for DR, just replace TUN by DR in the lb_kind lines of the virtual_server definitions.

For NAT, as this method does not use virtual IPs, you can disable / comment out the virtual_ipaddress section in the vrrp_instance section, and change the virtual servers to run on the primary IP on extra ports like 5506 and 5507. And of course you change the lb_kind to NAT.

After a service keepalived restart you should be able to execute the same client connectivity tests as shown above. (Remember to cleanly unconfigure your manual setup first in order to not measure false success.)

Running in production without health checks is by no means recommended; treat this configuration as an intermediate step only.

Keepalived configuration (with health checks)

We now configure health checks in keepalived.conf. The recommended way is to configure a custom clustercheck service on the server nodes and configure it in keepalived.conf on the loadbalancer as follows. Each real_server gets a HTTP_GET child element like

  real_server 10.0.0.1 3306 {
    weight 10
    HTTP_GET {
      url {
        path /
        status_code 200
      }
      connect_port 9200
    }
  }

This way only cluster nodes which are Synced will be considered available.
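What such a clustercheck service has to answer can be sketched as follows. This is not the actual clustercheck script, only an illustration of the idea: it maps the Galera status variable wsrep_local_state (4 means Synced) to an HTTP response. The function name clustercheck_response is an assumption, and wiring the script to port 9200 (for example via an inetd-style service) is assumed and not shown.

```shell
# Print a complete HTTP response for the given wsrep_local_state value.
# State 4 ("Synced") yields 200; every other value yields 503.
clustercheck_response() {
    state="$1"
    if [ "$state" = "4" ]; then
        status="200 OK"
        body="Galera node is synced."
    else
        status="503 Service Unavailable"
        body="Galera node is not synced."
    fi
    printf 'HTTP/1.1 %s\r\nContent-Type: text/plain\r\nConnection: close\r\nContent-Length: %s\r\n\r\n%s' \
        "$status" "${#body}" "$body"
}

# On a server node the state could be obtained like this (credentials omitted):
#   state=$(mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state'" | awk '{ print $2 }')
#   clustercheck_response "$state"
```

Keepalived's HTTP_GET check above only looks at the status code, so the response body is purely informational.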

Furthermore, in particular if you run a distributed, otherwise non-synchronized Keepalived cluster, you might want to use a method similar to MariaDB's MaxScale and choose as master only the server which has wsrep_local_index==0. This way it is impossible to end up in scenarios where different loadbalancers pick different master nodes. In that case you should probably also remove the persistence_timeout and persistence_granularity options. You configure this using path /master lines instead of the path / ones.
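The decision behind a /master endpoint can be sketched like this. The function name master_status_code is hypothetical, and requiring the node to be Synced in addition to having index 0 is an assumption of this sketch, not something the clustercheck service is known to do.

```shell
# Print the HTTP status code a /master check might return.
# $1: wsrep_local_state (4 means Synced), $2: wsrep_local_index
master_status_code() {
    state="$1"; index="$2"
    if [ "$state" = "4" ] && [ "$index" = "0" ]; then
        echo 200
    else
        echo 503
    fi
}
```

Because every loadbalancer evaluates the same cluster-wide index, all of them arrive at the same write node without having to share state.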

If you only have one (active) loadbalancer at each given time, you might want to stick with the standard way of letting Keepalived decide (persistently) for a master node. It might have faster failover/recovery times, and avoid (unnecessary) failbacks.

As usual, thorough testing after config changes is mandatory.

Adding a second Keepalived node for redundancy

With a single keepalived node we have a single point of failure. It is possible to add a second keepalived node which communicates with the first keepalived node and transitions from backup state to master state upon failure of the first node.

To set up a second keepalived node as described above, create a keepalived node identical to the first one, with the following changes to the configuration file /etc/keepalived/keepalived.conf:

  • Change the router_id (to the hostname, for example)
  • Change the state to BACKUP
  • Change the priority to something lower than the master's priority (e.g. 100)

Make sure the virtual_router_id and authentication information is the same on the backup keepalived node as on the master keepalived node.
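The three changes listed above can also be applied mechanically to a copy of the master's configuration. The following sketch assumes the example file from this article (state MASTER, priority 101); the function name make_backup_conf is an assumption, and the new router_id must not contain characters that are special to sed (such as / or &).

```shell
# make_backup_conf MASTER_CONF ROUTER_ID
# Print a BACKUP variant of the given keepalived.conf on standard output:
# state MASTER -> BACKUP, priority 101 -> 100, router_id -> ROUTER_ID.
# All other lines (virtual_router_id, authentication, ...) are left untouched.
make_backup_conf() {
    sed -e 's/^\([[:space:]]*state[[:space:]][[:space:]]*\)MASTER/\1BACKUP/' \
        -e 's/^\([[:space:]]*priority[[:space:]][[:space:]]*\)101/\1100/' \
        -e "s/^\([[:space:]]*router_id[[:space:]][[:space:]]*\).*/\1$2/" \
        "$1"
}

# Possible usage, with the master's file copied to the backup node beforehand
# (the file name keepalived.conf.master and the router_id galera-lb2 are examples):
#   make_backup_conf keepalived.conf.master galera-lb2 > /etc/keepalived/keepalived.conf
```

Reviewing the result with diff against the master's file is a cheap way to confirm that only these three lines differ.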

Now the backup node will notice the master going down and take over. Automatic failback also happens.

Keepalived will automatically manage the secondary IPs, so no need for any additional clustering software like corosync/pacemaker etc.

Keepalived monitoring

ipvsadm -Ln -t $LOADBALANCER_IP:$LOADBALANCER_PORT
ipvsadm -Ln -t $LOADBALANCER_IP:$LOADBALANCER_PORT --stats
ipvsadm -Ln -t $LOADBALANCER_IP:$LOADBALANCER_PORT --rate
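For a simple monitoring probe, the number of usable real servers can be derived from the ipvsadm -Ln output. This is a sketch under the assumption that the output has the column layout shown earlier on this page; the function name count_active_real_servers is made up for illustration.

```shell
# Count real servers with a weight above zero in "ipvsadm -Ln" output read
# from standard input. Real server lines start with "->"; the column header
# line does too and is skipped by looking at its second field.
count_active_real_servers() {
    awk '$1 == "->" && $2 != "RemoteAddress:Port" && $4 > 0 { n++ }
         END { print n + 0 }'
}

# Possible usage on the loadbalancer (as root):
#   ipvsadm -Ln -t 10.0.0.10:3306 | count_active_real_servers
```

Counting only entries with a weight above zero covers both cases: failed servers that have been removed from the table and failed servers that are kept with weight 0.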