Debugging Talus
Talus has many components. This page attempts to give some insight into ways you might go about debugging each of them.
Master Daemon
The talus master daemon is an upstart job. The job configuration is found in /etc/init/talus_master.conf:
description "Talus Master Daemon"
author "Optiv Labs"
start on filesystem or runlevel [2345]
stop on shutdown
respawn
script
/home/talus/talus/src/master/bin/start_raw em1
end script
Logs for the master daemon can be found in /var/log/upstart/talus_master.log. These logs are created by upstart and rotated automatically.
Restarting
To restart the master daemon (say, after having made some code changes, to force it to reconnect to the AMQP server, etc.), run sudo stop talus_master. This should stop the master daemon. If after a few seconds the master daemon has not gracefully quit (confirm with ps aux | grep master), force-kill any running master daemons with a good ol' kill -KILL <pid>.
After the master daemon has been killed, start it again with sudo start talus_master.
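The whole dance looks something like this (the grep pattern and <pid> here are placeholders, not exact values):

```
$ sudo stop talus_master
$ ps aux | grep '[m]aster'     # confirm the daemon actually exited
$ sudo kill -KILL <pid>        # only if it is still hanging around
$ sudo start talus_master
```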
Slave Daemon
The talus slave daemon that is present on each of the slaves is an upstart job. The job configuration is found in /etc/init/talus_slave.conf:
description "Talus Slave Daemon"
author "Optiv Labs"
start on (started networking)
stop on shutdown
respawn
script
aa-complain /usr/sbin/libvirtd
/home/talus/talus/src/slave/bin/start_raw 1.1.1.3 10 em1 2>&1 >> /var/log/talus/slave.log
end script
The aa-complain call forces AppArmor to merely complain about libvirtd instead of enforcing any policies against it. libvirt runs extremely slowly if AppArmor is allowed to enforce policies on libvirtd. There might be a better way around this, but this works.
Restarting
To restart the slave daemon, run sudo restart talus_slave. The slave daemon will shut down gracefully, killing all running VMs before doing so. This can sometimes take up to a minute before the slave daemon has completely quit.
If you are paranoid that the slave daemon isn't going to restart cleanly, stop and start the daemon separately, checking in between to make sure that it has completely exited before starting it again. If it never fully quits, force-kill it with kill -KILL <pid>.
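If you do this often, the cautious stop/check/force-kill dance can be scripted. A minimal sketch (wait_or_kill is a hypothetical helper, not part of talus):

```shell
# Wait up to $2 seconds for process $1 (a pid) to exit, then SIGKILL it
# if it is still alive.
wait_or_kill() {
    pid=$1
    timeout=$2
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 just probes for existence; failure means a clean exit
        kill -0 "$pid" 2> /dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid"
}

# Usage against the slave daemon (run as root; pgrep pattern is a guess
# based on the upstart script above):
#   stop talus_slave
#   wait_or_kill "$(pgrep -f slave/bin/start_raw)" 60
#   start talus_slave
```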
Vagrant
Vagrant is a VM configuration utility (or that's how I think of it). It is intended to let developers easily share build/development/production environments with other developers by sharing only their Vagrantfile. The Vagrantfile is a ruby script that can configure a VM from a base image; much of the work that has gone into Vagrant is about being able to configure VMs from a Vagrantfile.
Talus uses Vagrant during image configuration to provide a way for the user to perform automatic VM updates (e.g. run a script after every MS update to create a new image with the latest patches, etc.).
Vagrant images (or boxes in Vagrant lingo) are stored in /root/.vagrant.d/boxes. When a box is started, the image in the boxes directory is uploaded to /var/lib/libvirt/images and then run.
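To see which boxes are on a machine and which images libvirt is currently working from (vagrant box list is a standard vagrant command; the two paths are the ones mentioned above):

```
$ vagrant box list
$ ls /root/.vagrant.d/boxes
$ ls /var/lib/libvirt/images
```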
Since we aren't using VMWare or VirtualBox (but libvirt instead), talus requires the vagrant-libvirt plugin to be added. During development of talus, several pull requests were submitted to this plugin to add the functionality we needed.
Libvirt
Libvirtd
Talus uses libvirt. Libvirt runs as a daemon (libvirtd) and accepts messages via a unix domain socket.
There have been major problems with networking issues amongst the VMs when using libvirt. Talus has resorted to using static MAC addresses that map to static IP addresses defined in the talus-network XML, as well as disabling MAC filtering with ebtables by setting mac_filter = 0 in /etc/libvirt/qemu.conf.
Another notable libvirt configuration setting is the VNC listen address: set it to 0.0.0.0 in /etc/libvirt/qemu.conf, otherwise you won't be able to remotely VNC to any running VMs.
Libvirt is restarted with /etc/init.d/libvirt-bin restart.
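A quick way to sanity-check both of those settings before restarting (a sketch; the option names vnc_listen and mac_filter are the ones used in libvirt's qemu.conf, but verify against your version):

```
$ grep -E '^(vnc_listen|mac_filter)' /etc/libvirt/qemu.conf
vnc_listen = "0.0.0.0"
mac_filter = 0
$ sudo /etc/init.d/libvirt-bin restart
```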
Logs for libvirtd are found in /var/log/libvirt/libvirtd.log, and logs for individual domains are found in /var/log/libvirt/qemu/<domain_name>.log (iirc).
Virsh
virsh is a command-line interface for sending messages to the libvirt daemon.
Common commands include:
- virsh list --all - list all of the defined/running domains (VMs)
- virsh destroy <domain_id_or_name> - forcefully destroy a domain
- virsh dumpxml <domain_id_or_name> - dump the xml that defines the domain
  - it may be useful to grep this for vnc to see which vnc port it's on
  - it may be useful to grep this for mac to see what the mac address is (you can correlate macs to ips with arp -an)
- virsh net-list - list defined networks. Talus uses its own defined network, talus-network
- virsh net-dumpxml <network-name> - dump the xml that defines a network
I commonly found myself doing something like:
for id in $(sudo virsh list --all | tail -n+3 | awk '{print $1}') ; do sudo virsh destroy $id ; done
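The vnc/mac greps mentioned in the list above look like this in practice. This sketch greps a small inline sample of dumpxml-style XML (hypothetical values); on a real slave you would pipe sudo virsh dumpxml <domain> into the same greps:

```shell
# Pull the VNC port and the MAC address out of a domain definition.
# Real dumpxml output contains much more than this sample.
xml='<graphics type="vnc" port="5901" autoport="yes" listen="0.0.0.0"/>
<mac address="52:54:00:aa:bb:01"/>'
echo "$xml" | grep vnc   # shows the graphics element with the VNC port
echo "$xml" | grep mac   # shows the interface MAC address
```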
Docker
Several talus components are containerized using Docker. Docker (essentially a wrapper around linux containers) makes it easy to configure environments for a service. It uses an incremental build process to build containers.
In the talus source tree, the web, amqp, and db directories contain scripts in their bin directories to build, start, and stop their respective docker containers.
Docker uses a Dockerfile to define the individual steps needed to build the container. Generally speaking, you either RUN a command inside the container or ADD files and directories to the container. A default entrypoint in the container specifies how the container should be started, unless an overriding --entrypoint parameter is passed with the docker run command.
Docker containers can be linked to other already-running docker containers. For example, the script to run the talus_web container links itself to the talus_db container (--link ...), exposes several ports so that it can accept remote connections (-p ...), and mounts several volumes inside the container (-v ...). The full script can be found at talus/src/web/bin/start in the source tree:
sudo docker run \
--rm \
--link talus_db:talus_db \
-p 80:80 \
-p 8001:8001 \
-v /var/lib/libvirt/images:/images:ro \
-v /var/log/talus:/logs \
-v /tmp/talus/tmp:/tmp \
-v /talus/install:/talus_install \
-v /talus/talus_code_cache:/code_cache \
--name talus_web \
$@ talus_web
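Standard docker commands are handy for poking at any of these containers (container names as used above; note that docker logs only shows output the container wrote to stdout/stderr):

```
$ sudo docker ps                        # which talus containers are running?
$ sudo docker logs talus_web            # stdout/stderr from the container
$ sudo docker exec -it talus_web bash   # shell inside the running container
```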
MongoDB
There is a specific order in which the docker containers must be started on the master. Most of the containers/services rely on the talus_db container being up and running. If the master needed to be rebooted and things start complaining about connections, try shutting them down and restarting them in this order:
- start talus_db
- start talus_amqp - this does not depend on talus_db, so it could also be started first if you wanted
- start talus_web
- start talus_master
- start talus_slave - if you also have a slave daemon running on the master server
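For example, to bounce everything on the master in one go (a sketch; assumes all five upstart jobs exist on this machine - drop talus_slave if no slave daemon runs on the master):

```
$ for svc in talus_slave talus_master talus_web talus_amqp talus_db; do sudo stop $svc; done
$ for svc in talus_db talus_amqp talus_web talus_master talus_slave; do sudo start $svc; done
```

Note that the stop loop runs in reverse dependency order so nothing complains about a vanished talus_db while shutting down.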
Mongodb logs are stored in /var/log/talus/mongodb/*. Mongodb data is stored in /talus/data/*.
Since the db is running in a container, you can't drop into a mongo shell on the master and connect to localhost (and actually, no mongo tools are required to be installed on the master, so you might not be able to do that out of the box anyways). You could either look up the connection info of the talus_db container (which port it's forwarded to locally), or you can start a temporary container that has all of the necessary mongodb tools and drops you into a mongo shell. I highly recommend the second approach.
Such a script exists in the source tree at talus/src/db/bin/shell. Run this script and you should be dropped into a mongo shell. You will have to tell it which database to use (the talus database), after which you can perform raw mongodb commands:
talus@:~$ talus/src/db/bin/shell
MongoDB shell version: 3.0.6
connecting to: talus_db:27017/test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
http://docs.mongodb.org/
Questions? Try the support group
http://groups.google.com/group/mongodb-user
Server has startup warnings:
2015-10-28T22:32:32.001+0000 I CONTROL [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2015-10-28T22:32:32.001+0000 I CONTROL [initandlisten]
2015-10-28T22:32:32.001+0000 I CONTROL [initandlisten]
2015-10-28T22:32:32.002+0000 I CONTROL [initandlisten] ** WARNING: You are running on a NUMA machine.
2015-10-28T22:32:32.002+0000 I CONTROL [initandlisten] ** We suggest launching mongod like this to avoid performance problems:
2015-10-28T22:32:32.002+0000 I CONTROL [initandlisten] ** numactl --interleave=all mongod [other options]
2015-10-28T22:32:32.002+0000 I CONTROL [initandlisten]
2015-10-28T22:32:32.002+0000 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2015-10-28T22:32:32.002+0000 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2015-10-28T22:32:32.002+0000 I CONTROL [initandlisten]
2015-10-28T22:32:32.002+0000 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2015-10-28T22:32:32.002+0000 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2015-10-28T22:32:32.002+0000 I CONTROL [initandlisten]
rs0:PRIMARY> use talus
switched to db talus
rs0:PRIMARY> show collections
code
file_set
fs.chunks
fs.files
image
job
master
o_s
result
slave
system.indexes
task
tmp_file
rs0:PRIMARY> db.image.find()
Notice how the prompt says rs0:PRIMARY. This is HUGELY important. Talus uses a single-host replica set with mongodb so that it can essentially have a cursor that will tail -f all of the changes that occur in the database. This works because, as a replica set, database changes are meant to be communicated to the other members of the replica set on different hosts. A special collection called the oplog is where all of these changes are stored.
Talus uses the oplog to be notified of changes in the database so it won't have to poll the database for changes.
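From the mongo shell described above you can look at the oplog directly; for replica sets it lives in the local database as the capped collection oplog.rs (standard mongodb layout):

```
rs0:PRIMARY> use local
rs0:PRIMARY> db.oplog.rs.find().sort({$natural: -1}).limit(3)
```

Seeing recent entries here is a quick way to confirm that change notifications are actually flowing.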
Back to the prompt and the rs0:PRIMARY. If the prompt DOES NOT say PRIMARY after rs0 (replica-set 0), then you'll have to run a few commands in a mongo shell.
In the talus/src/db/startup.sh script, a command is run that attempts to ensure that the sole replica-set member on talus is also the PRIMARY. Not being the primary (being a secondary, or slave) means that you cannot make changes to the data (iirc). The code the startup.sh script runs in a mongo shell is below:
cfg={"_id" :"rs0", "version": 1, "members": [{"_id": 0, "host": "talus_db:27017"}]}
rs.initiate(cfg)
rs.reconfig(cfg, {force:true})
rs.slaveOk()
If you notice that the shell is not PRIMARY, you would usually only have to run the rs.slaveOk() command from a mongo shell to get things back to normal. You might need the other commands if that alone fails to work.
AMQP
AMQP is also containerized with docker and is run as an upstart job. The upstart config for the talus_amqp job is found at /etc/init/talus_amqp.conf.
Logs for amqp should be found at /var/log/talus/rabbitmq/*.
This should rarely have to be debugged; since it is debugged so rarely, debugging-specific scripts were never added. However, if AMQP were suspected of being a problem, here's a few things I'd check out:
- restart amqp with sudo restart talus_amqp
- look in the logs at /var/log/talus/rabbitmq/*
- set up the RabbitMQ management console and expose its ports in the talus_amqp container so that you can access the management console remotely
- stop the talus_amqp container and run the container manually with the entrypoint set to bash so that you can do additional debugging: talus/src/amqp/bin/start --entrypoint bash
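For the management-console route, something like this is a starting point (a sketch: rabbitmq-plugins and rabbitmqctl are standard RabbitMQ tools, but you would still need to expose the console's port, 15672 by default, when the container is run):

```
$ sudo docker exec talus_amqp rabbitmq-plugins enable rabbitmq_management
$ sudo docker exec talus_amqp rabbitmqctl list_queues
```

rabbitmqctl list_queues alone is often enough to spot a backed-up queue without setting up the console at all.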
Webserver
Debugging the webserver should be fairly simple. The webserver is containerized using docker and is run as an upstart job. The upstart script is found in /etc/init/talus_web.conf.
Logs for the talus web services are found in /var/log/talus/apache2/*.log.
The dynamic portion of the web application is built with django. Debugging a django application is fairly straightforward, especially if you use pdb.
The start script (talus/src/web/bin/start) has some logic to check for a dev parameter. If present, it will mount the directories local to the start script inside the container so that you won't have to rebuild the container every time you need to make some code changes.
My usual workflow goes like this:
- Make sure talus_db is running
- Scp/rsync my code into the remote talus/src/web directory
- Start a dev talus_web container with bash as the new entrypoint:
talus:~$ talus/src/web/bin/start dev --entrypoint bash
Error response from daemon: Cannot kill container talus_web_dev: no such id: talus_web_dev
Error: failed to kill containers: [talus_web_dev]
Error response from daemon: no such id: talus_web_dev
Error: failed to remove containers: [talus_web_dev]
root@54f7352ff90b:/# cd web
root@54f7352ff90b:/web# ls
README api code_cache launch.sh manage.py passwords requirements talus_web
root@54f7352ff90b:/web# python manage.py runserver 0.0.0.0:8080
DEBUG IS TRUE
DEBUG IS TRUE
Performing system checks...
System check identified no issues (0 silenced).
October 30, 2015 - 21:20:21
Django version 1.8.1, using settings 'talus_web.settings'
Starting development server at http://0.0.0.0:8080/
Quit the server with CONTROL-C.
At this point you will be able to break into and step through the handling of any request (if you have added an import pdb ; pdb.set_trace() somewhere). Remember that port 8080 is exposed by default for the dev web container, so be sure to run manage.py with port 8080 on ip 0.0.0.0.