Distributed Computing with Custom Beowulf Cluster

In my last two blog posts, I discussed how I assembled and configured a distributed computing cluster using some old laptops and Raspberry Pis. In this post, I will discuss using this cluster to develop distributed computing tasks in C++ with MPI. The distributed application that I developed is a DNA k-mer counter.

Background on DNA sequence analysis

In biology, the analysis of DNA sequences is critical to understanding biological systems. Many DNA analysis algorithms focus on identifying genes (the functional units that DNA encodes); however, some focus on other features of DNA sequences. One alternative approach is analyzing the “k-mer” content of a DNA sequence. K-mers are short sub-sequences of a DNA sequence of length k. Many DNA analysis algorithms draw conclusions about biological systems based on the abundance of each k-mer in the DNA sequence. Other k-mer based metrics include the number of unique k-mers in a DNA sequence and the shape of the distribution of k-mer frequencies. In my undergraduate research, I used the frequencies of k-mers in DNA sequences to identify novel viruses.

One step in gathering the training data was calculating the k-mer content of 18GB of bacterial DNA sequences, which I later used as a feature set for machine learning. At the time, I counted the occurrences of each k-mer in each sequence using a single-threaded Python program, which took over 6 hours to churn through the data set. In this blog post I’ll describe how I took that Python code and turned it into a distributed, multi-threaded k-mer counter in C++ that could complete the same task in about 10 minutes.

K-mer counting algorithm

The k-mer counting algorithm that I used is very simple: for a given DNA sequence, use a sliding window of length k (base pairs) and slide it along the DNA sequence from beginning to end. For each position of the sliding window, increment a counter of how many times that particular k-mer has been seen in the sequence. Since DNA has four base pairs (i.e. A, T, G, C), there are 4^k distinct k-mers to count the occurrences of. To store these counts, create an array of integers of size 4^k and map each k-mer to an element of the array lexicographically (e.g. AAA → 0, AAT → 1, AAC → 2, … GGG → 4^k - 1). Thus, for each position of the sliding window, we simply calculate the lexicographic index of the k-mer and increment that element in the array.
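
To make the indexing concrete, here is a minimal sketch of the counting loop (not my original implementation, just the idea), using the base order A, T, C, G so that it matches the indexing example above:

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal sketch of the sliding-window k-mer counter. Unknown bases (e.g. 'N') are skipped.
std::vector<std::uint64_t> count_kmers(const std::string &sequence, unsigned k) {
    std::vector<std::uint64_t> counts(std::size_t(1) << (2 * k), 0); // 4^k slots, all zero

    auto base_index = [](char c) {
        switch (c) {
            case 'A': return 0;
            case 'T': return 1;
            case 'C': return 2;
            case 'G': return 3;
            default:  return -1; // unknown base
        }
    };

    for (std::size_t i = 0; i + k <= sequence.size(); i++) {
        std::size_t index = 0;
        bool valid = true;
        for (unsigned j = 0; j < k; j++) {            // lexicographic index of this window
            int b = base_index(sequence[i + j]);
            if (b < 0) { valid = false; break; }      // skip windows containing unknown bases
            index = index * 4 + b;
        }
        if (valid) counts[index]++;                   // increment the count for this k-mer
    }
    return counts;
}

In practice you would update the index incrementally as the window slides (drop the leading base, multiply by 4, add the new base), but the brute-force version makes the indexing clear.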

The DNA data is stored in a file format called FASTA, which basically just stores the ATGCs of the sequences as plain text. The program reads each sequence from the FASTA files, performs the counting algorithm described above in memory, and then writes the counts for each sequence to a file as plain text.
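
For reference, a FASTA file is just a header line starting with “>” followed by lines of sequence, along these lines (the names here are made up):

>sequence_1 example bacterial contig
ATGCGTACGTTAGCCTAGGATCCA
>sequence_2
GGCATTAGCAATGCCGTA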

Porting code to multi-threaded C++

I realized that if I was going to take advantage of the parallel capabilities of my computing cluster, I should first take advantage of parallelism on each individual node. To do so, I rewrote my original Python program in C++, scheduling each DNA sequence for processing on a thread pool from the Boost libraries.
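
The post doesn’t include the threading code itself, but the structure is roughly this, sketched here with boost::asio::thread_pool (which requires Boost 1.66 or newer; the original may have used a different pool implementation):

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

std::vector<std::uint64_t> count_kmers(const std::string &sequence, unsigned k); // counting routine sketched earlier

// Sketch: schedule one counting task per DNA sequence on a thread pool.
void count_all_sequences(const std::vector<std::string> &sequences, unsigned k) {
    boost::asio::thread_pool pool(std::thread::hardware_concurrency());

    for (const std::string &seq : sequences) {
        boost::asio::post(pool, [&seq, k]() {
            std::vector<std::uint64_t> counts = count_kmers(seq, k);
            // ... write this sequence's counts to the output file ...
            (void)counts;
        });
    }

    pool.join(); // wait for every sequence to be processed
}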

At this point, I benchmarked the C++ implementation against its Python predecessor, which took ~2 minutes to process a 150 MB file of viral genomes. My multi-threaded C++ program could process the same file in ~1.8 seconds on the same computer: a roughly 60x speedup.

An interesting discovery was that if I set up the program to read data from standard input (std::cin) instead of constructing an ifstream from the file path, it took nearly twice as long. Although some optimizations still remained, the task was now ready to be ported to a distributed implementation that shares the work across multiple computers.

Installing OpenMPI

A critical step in getting distributed tasks running is having some way to communicate between nodes. One option is passing shell commands around over SSH. I have used this approach before, but for this project I wanted a lower-level, more flexible interface, so I went with MPI. I chose OpenMPI simply because of its ease of installation.

One thing I had to do was build OpenMPI from source so that heterogeneous computing would be supported, since the computers in my cluster had different architectures. I followed the instructions on the main page but passed the --enable-heterogeneous flag to configure like so:

tar zxvf openmpi-3.0.0.tar.gz
cd openmpi-3.0.0
./configure --prefix=/opt/openmpi-3.0.0 --enable-heterogeneous
make all install

I ran this separately on all the machines in the cluster. I also added /opt/openmpi-3.0.0/bin to $PATH and /opt/openmpi-3.0.0/lib to $LD_LIBRARY_PATH so that I could compile and link against the MPI libraries in my code.
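
Concretely, that meant adding lines like these to each machine's shell profile (e.g. ~/.bashrc):

export PATH=/opt/openmpi-3.0.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-3.0.0/lib:$LD_LIBRARY_PATH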

Batch Processing Implementation with MPI

The basic design of the application is very similar to that of a so-called “thread pool”: a single node is designated the “head” node, which sets up a queue of work to be done and then delegates tasks from the queue to all of the worker nodes. The worker nodes simply notify the head node when they are ready to process something, receive and process some work (in this case a file), and then either return the result of the processing to the head node or persist their answer by writing to a file, database, etc.

Plugging my multi-threaded k-mer counter into this system design was theoretically very easy. The head node would search in a directory full of DNA sequence files and add each file to a queue, then send each file out to worker nodes. The worker nodes would use the (previously described) k-mer counting program to process the files and write out their answer to disk.

When programming with MPI, you use the basic message passing functionality of MPI_Send and MPI_Recv to pass messages of arbitrary size between nodes. I was helped by a great tutorial on basic MPI routines like sending, receiving, and probing messages, which I would recommend if you are interested in learning more.

In my program I defined a generic BatchProcessor class that contains a “master routine”, a “worker routine”, and functions for scheduling work. I call each unit of work a “key”, which in my application is a string holding the filename of a FASTA file to process. To use the BatchProcessor, a client program provides a function for scheduling keys and a function for processing keys. In my application these functions locate FASTA files and count the k-mers in their sequences, respectively.
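
The post doesn’t show the BatchProcessor’s public interface, so the snippet below is purely illustrative: the constructor and the process method are hypothetical names, but they show how the two client-supplied functions plug in.

// Hypothetical usage sketch: method and helper names here are made up for illustration.
BatchProcessor processor(&argc, &argv); // hypothetical: wraps MPI setup, determines head vs. worker

processor.process(
  [&](BatchProcessor &bp) { schedule_fasta_files(bp, input_directory); },  // schedules keys (head node)
  [&](const string &fasta_file) { count_kmers_in_file(fasta_file); });     // processes one key (workers)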

Here is the implementation of the “worker routine” that is performed by all nodes on the cluster except the head node:

void BatchProcessor::worker_routine(function<void(const string &)> processKey) {
  MPI_Status status; // To store status of messages
  char ready = BP_WORKER_READY; // Single byte to send to master node

  int messageSize; // To store how long the incoming messages are
  string nextKey; // Stores the next key

  while (true) { // Continue until head node says stop
    // Tell the head node we are ready to process something
    MPI_Send(&ready, sizeof(char), MPI_BYTE, BP_HEAD_NODE, BP_WORKER_READY_TAG, MPI_COMM_WORLD);

    // Get information about the next key to come
    MPI_Probe(BP_HEAD_NODE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    if (status.MPI_TAG == BP_WORKER_EXIT_TAG) break; // time to be done

    // Get the next key to process
    MPI_Get_count(&status, MPI_CHAR, &messageSize); // find key size (includes null terminator)
    auto key_cstr = (char *) malloc((size_t) messageSize); // Allocate memory for key
    MPI_Recv(key_cstr, messageSize, MPI_CHAR, BP_HEAD_NODE, BP_WORK_TAG, MPI_COMM_WORLD, &status);
    nextKey = key_cstr; // copy the next file to process
    free(key_cstr);
    processKey(nextKey); // <-- work is done here
  }
}

This “worker routine” is fairly straightforward: first the worker notifies the head node that it is ready to process some work with an MPI_Send. The BP_WORKER_READY_TAG is just a #define'd integer in my program used to indicate that a worker is ready for work. This call blocks until the message is sent to the head node. Once it is sent, the worker waits for a “key” from the head node with a blocking MPI_Recv call. The “keys” in this case are just file names (the DNA sequence files to be processed). The worker needs to use dynamically allocated memory (hence malloc) to store the incoming key (filename), since it can be any length. Once a filename is received, the worker's job is to process that key with the processKey function provided to it, which asynchronously counts k-mers in the specified file and writes the answer to disk. The worker continues this cycle until it is notified by the head node that there is no more work to complete and it can exit.

The head node's routine is a little more complicated since it needs to coordinate all of the workers. However, to increase performance and reduce complexity, I broke its routine up into two sub-tasks and scheduled each of these as tasks on the head node's thread pool. These tasks are:

  1. Search the user-specified directory for FASTA files, and add each to the queue of tasks.
  2. Wait for a worker to become available, then send it the next item from the queue.

The first of these tasks is the simpler one, but by running it asynchronously on the thread pool, work dispatching and processing can occur while scheduling is still happening, which allows for long-running scheduling tasks. The scheduling task looks like this:

pool.schedule([&](){
    schedule_keys(); // Schedule keys asynchronously
    lock_guard lg(scheduling_complete_mutex);
    scheduling_complete = true; // Indicate done scheduling
});

The schedule_keys function is a client-defined function passed to the BatchProcessor that searches through a directory for the FASTA files and schedules them on the batch processor's queue. Inside the definition of schedule_keys, the client should use the batch processor's schedule_key method, which queues the key in a thread-safe manner and notifies the work-dispatching thread on the master node:

void BatchProcessor::schedule_key(const string &key) {
  if (world_rank != BP_HEAD_NODE) return; // scheduling on worker nodes is forbidden
  lock_guard lg(queue_mutex); // Lock the queue of keys
  keys.push(key); // Add to the queue
  schedule_cv.notify_one(); // Notify potentially waiting thread of scheduling
}
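
For completeness, the client-side schedule_keys function boils down to a directory walk. The original isn't shown here, but a sketch using std::filesystem (the original may well have used Boost.Filesystem instead) looks something like this:

#include <filesystem>
#include <string>

// Sketch of a client-provided scheduling function: find FASTA files in a
// directory and hand each path to the batch processor as a "key".
void schedule_fasta_files(BatchProcessor &processor, const std::string &directory) {
  namespace fs = std::filesystem;
  for (const fs::directory_entry &entry : fs::directory_iterator(directory)) {
    if (entry.is_regular_file() && entry.path().extension() == ".fasta") {
      processor.schedule_key(entry.path().string()); // thread-safe enqueue + notify
    }
  }
}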

The work dispatching task performed by the head node is as follows:

pool.schedule([&](){
  MPI_Status status;
  char worker_ready;

  while (!work_completed()) {
    // Waiting for ready worker
    MPI_Probe(MPI_ANY_SOURCE, BP_WORKER_READY_TAG, MPI_COMM_WORLD, &status);
    if (status.MPI_ERROR) continue;

    // Found ready worker
    MPI_Recv(&worker_ready, sizeof(char), MPI_BYTE, status.MPI_SOURCE, BP_WORKER_READY_TAG, MPI_COMM_WORLD, &status);
    if (!worker_ready || status.MPI_ERROR) continue; // Error or incorrect signal?

    if (queue_empty()) { // no more keys to process
      if (scheduling_completed()) break;
      else { // Wait until something has been put on the queue
        unique_lock lock(schedule_mutex);
        schedule_cv.wait(lock, [this]() {
          return !queue_empty();
        });
        lock.unlock();
      }
    }

    // There is some work to do in the queue
    unique_lock lock(*worker_mutex_list[status.MPI_SOURCE]);
    worker_ready_list[status.MPI_SOURCE] = false; // Mark worker as busy
    lock.unlock();

    // Send the next work to the worker
    queue_mutex.lock();
    string next_key = keys.front();
    MPI_Send(next_key.c_str(), (int) next_key.size() + 1, MPI_CHAR, status.MPI_SOURCE, BP_WORK_TAG, MPI_COMM_WORLD);
    keys.pop(); // Remove the key
    queue_mutex.unlock();
  }

  // Need to indicate to each worker to exit once the work is done
  for (int i = 0; i < world_size; i++) {
    if (i == BP_HEAD_NODE) continue;
    send_exit_signal(i);
  }
});

This task is really broken up into three parts:

  1. Wait for work to be put on the queue
  2. Wait for a worker to signal it's ready, then send it the next item from the queue
  3. Tell all the workers to exit once the queue is empty and scheduling is complete

Compiling

CMake is great in that it makes it easy to produce portable makefiles. I used CMake to help with linking the MPI and Boost libraries. If you are interested, here is the general technique for linking the MPI libraries:

find_package(MPI REQUIRED)
include_directories(${MPI_INCLUDE_PATH})

set (MPI_TEST_SOURCE test/test-mpi.cpp)

add_executable (test-mpi ${MPI_TEST_SOURCE})
target_link_libraries (test-mpi ${MPI_LIBRARIES})

if (MPI_COMPILE_FLAGS)
    set_target_properties(test-mpi PROPERTIES COMPILE_FLAGS "${MPI_COMPILE_FLAGS}")
endif()

if (MPI_LINK_FLAGS)
    set_target_properties(test-mpi PROPERTIES LINK_FLAGS "${MPI_LINK_FLAGS}")
endif()
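
Linking Boost works the same way; something along these lines (the component list and the kmer-counter target name here are just examples, use whatever your project actually defines):

find_package(Boost REQUIRED COMPONENTS system thread)
include_directories(${Boost_INCLUDE_DIRS})
target_link_libraries(kmer-counter ${Boost_LIBRARIES})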

Troubleshooting running OpenMPI

I ran into problems even just getting OpenMPI to spawn tasks on other machines. While testing, I would compile a simple MPI version of hello world with mpicc mpi_hello.c -o hello and run it like so:

mpirun -n 1 -H node0 hello
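
(For reference, mpi_hello.c was nothing special, essentially the standard boilerplate, roughly the following.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}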

Every time, I received the following error:

ORTE was unable to reliably start one or more daemons.

with the suggested causes being:

  1. Not finding the required libraries and/or binaries
  2. Lack of authority to execute on one or more specified nodes.
  3. The inability to write startup files into /tmp
  4. Compilation of the orted with dynamic libraries when static are required
  5. An inability to create a connection back to mpirun due to a
    lack of common network interfaces

I quickly learned that this is a very common error and can be caused by any number of things going wrong. The lack of more specific error messages had me searching for hours before I came upon something from this post that saved me: for some reason, specifying the absolute path of the mpirun executable fixed the problem. (As I later understood it, invoking mpirun by its absolute path is equivalent to passing the matching --prefix, which helps the remote nodes locate the OpenMPI binaries and libraries.) Using

$(which mpirun) -n 1 -H node0 hello

worked perfectly and I didn’t have to remember the full path to mpirun.

Because I had a heterogeneous cluster, at this point it was also time to investigate whether I would be able to compile different versions of my distributed program and specify which executables to run on which machines. Luckily, Gilles Gouaillardet helped me on my Stack Overflow question about this: multiple compiled versions of the same program can be run with syntax like the following:

mpirun -n 2 -H node0,node1 prog_x86 : -n 2 -H node2,node3 prog_arm 

where prog_x86 and prog_arm are just example names for versions of the same program compiled for the different architectures in the cluster. This command would launch the version of the program compiled for x86 on node0 and node1 and the version of the program compiled for ARM on node2 and node3.

Testing

Once I had distributed tasks up and running, I was able to analyze how the distributed application performed. Unfortunately, when it came time to test, my personal computing cluster was in a state of disrepair, so I tested on a shared computing cluster at school instead. With four nodes I was able to churn through 1.9 GB of data in 1 minute. For reference, the original data set was 18 GB and took 6.5 hours to churn through. This suggests that I was able to achieve a roughly 40x speedup using parallelism and distributed computing. Under these assumptions, I would have been able to complete the 18 GB task in about 10 minutes.

Areas for improvement

You may have noticed that with version 2 of the program (the single-node, multi-threaded C++ code), I was able to get a 60x speedup, but the distributed program only achieved a 40x speedup. I believe there are a few reasons for this, not the least of which is that I was executing these tasks on shared computing resources.

Another reason is that the workers only ask for one task at a time, and once they complete that single task, they wait a relatively long time for the master node to send them another. The program could be modified so that worker nodes ask for several tasks at a time, so that they can keep processing data while waiting for the head node to dispatch more work.

Conclusion

In conclusion, this was a great project that taught me a bit about distributed computing and MPI, gave me some exposure to the Boost libraries, and honed my skills with C++. In the future I may make further optimizations to utilize the compute more efficiently, or test it on my own cluster once I get it back up and running. You can take a look at my code here.

Building a Beowulf Cluster from old MacBooks: Part 2

When we left off in part 1, we had Arch Linux installed on each of the computers in the cluster, and each was also running an SSH server so that it could be accessed remotely. The next step is to set up a shared file system that each node has access to. This will be very helpful later when coordinating distributed tasks, for instance when we are processing a bunch of files and need each node to have access to any of them.

There are several approaches to setting up a shared file system, including SSHFS (SSH File System), NFS (Network File System), and AFS (Andrew File System). I had some previous experience with SSHFS, a network file system built on SSH, but because SSHFS traffic is encrypted it is less performant than the other two options. I also contemplated using AFS for the shared file system (like the clusters at my school use); however, I wanted to use my 2TB Western Digital MyCloud to host the shared file system. Because the MyCloud already had NFS up and running, and it had little support for installing new software, I went with NFS for the shared file system on the cluster.

Configuring NFS

There is a great blog post by Benjamin Cane that goes over the basics of setting up NFS, so instead of reiterating it, I'll cover the extra steps and troubles that I ran into. One of the trickiest parts of setting up NFS was getting each node to have the proper permissions and ownership over the files it serves: even after mounting the share on each node, I could not access the files on it.

The core of this issue came down to two problems, the first of which is user/group ID mapping. In Linux, each user ID and group ID is associated with a name; for instance, user ID 0 is associated with the name root. Regular users are typically assigned user IDs of 1000 and above.

The problem is that a given user, say user1, may not have the same numeric user ID on the client as on the server hosting the NFS. As pointed out in a Server Fault thread, there are two ways of resolving mismatched user IDs for the same user on different systems. To summarize, you can either specify a user ID mapping in the NFS configuration file (i.e. /etc/exports) or (more hacky) change the user ID to the same number on all the systems. I opted for the former.

The second problem had to do with root permissions. For security reasons, it is generally advised (and set by default on the WD MyCloud) that files owned by root on the NFS server be “squashed” so that even root on the NFS client cannot access them. For me, this manifested itself as not being able to modify files owned by root on the NFS partition from any of the nodes. Fixing this was as simple as adding the no_root_squash option to the exported file systems in /etc/exports on the server.
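
For reference, the relevant line in /etc/exports on the server ends up looking something like this (the export path and subnet here are placeholders for your own setup):

/DataVolume/cluster  192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)

After editing /etc/exports, re-export the file systems with exportfs -ra (or restart the NFS server) for the change to take effect.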

An interesting quirk that I learned about the MyCloud is that it has a special partition for storing data and a separate partition for the Linux system that it runs. After originally exporting a directory on the (considerably smaller, 2GB) system partition, I quickly ran out of space. I was at first astounded by this because I thought the MyCloud had 2TB of storage. After realizing that I had exported the wrong part of the MyCloud file system, the issue was quickly fixed by redirecting the NFS export to a sub-directory of /DataVolume/, which is where the 1.8TB of free storage is.

Changing home directory

After mounting the NFS on each node of the cluster, this was a really good time to change the home directory of each node to be on the NFS. This way, any configuration that resides in the home directory (e.g. ~/.bashrc, ~/.ssh, etc.) is shared across all nodes in the cluster and thus only needs to be set up once. As I learned in a Stack Exchange thread, there are two ways to change the designated home directory of a user. If you aren't logged in as the user whose home directory you want to change, then this is easy: simply run

usermod -d /new/home userName

However, if you are logged in as the user, you’ll need to edit /etc/passwd and change the home directory for the designated user manually, and then log out and in again. After having changed the home directory of each node to be on the NFS mount, configuration became a lot easier.

Password-less SSH

Once all of the nodes had the same shared home directory, it was a great time to set up password-less SSH login, since all of the nodes now shared the same list of authorized keys stored in ~/.ssh/authorized_keys. If you've set up key-authenticated SSH before you'll know that it's super easy, but I included it in this post because I encountered some unique problems that you might not have seen before.

First, not only is it convenient to log in to the cluster with keyed authentication, but for distributed tasks all of the nodes need to be able to communicate with each other over SSH without password prompts, because many message passing interfaces and cluster monitoring tools use SSH for communication.

One interesting problem I encountered involved a potential security issue. In order to log in to every node in the cluster from my laptop, I generated an RSA key pair, added the public key to ~/.ssh/authorized_keys, and added the path to the private key to ~/.ssh/config under IdentityFile. At this point, in order to let each node log into every other node, I was tempted to copy the private key to ~/.ssh/sk.rsa and reference it in the SSH config file on the cluster. I realized that this would have been a security issue, however, because the home directory of each node was actually a mounted NFS. Since NFS communication is unencrypted, the private key would have been sent over the network in the clear every time a node read the file in order to authenticate itself to another node. That would have been bad, but it has an easy fix: simply place the key at (the same) path on the local storage of each node and point the (shared) ~/.ssh/config to that path.
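
Concretely, the shared ~/.ssh/config just points at a key that lives on each node's local disk; the host pattern and key path below are placeholders:

Host node*
    IdentityFile /etc/ssh/cluster_key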

When setting up password-less SSH on the WD MyCloud, I ran into another interesting issue that took a long time to diagnose. It turns out that if you store the authorized keys file in ~/.ssh/authorized_keys, the home directory ~ also needs sufficiently strict permissions (not writable by group or others, e.g. 755 or 700), as pointed out in an extremely helpful Stack Exchange post on debugging this problem.

Synchronizing time across the cluster

Another interesting issue that I ran into was discrepancies between each node's notion of the current time; some nodes differed by over 30 seconds from others. This manifested itself when running commands that manage files on the shared NFS volume, such as the following warnings that I got while running make:

make: Warning: File `main.c' has modification time 21s in the future
make: Warning: Clock skew detected. Your build may be incomplete.

To fix this, I needed to synchronize the clocks using NTP (Network Time Protocol). This works by running a daemon that periodically synchronizes the system's current time with that of an external trusted server. One way to synchronize clocks across a cluster is to have a head node (in our case the WD MyCloud) fetch time from the outside periodically and then have the other nodes synchronize with the head node.

To do this, on the WDMyCloud I configured /etc/ntp.conf to be

server time.apple.com

so that the system hosting the NFS would fetch time from Apple. I did this because I wanted to mount the NFS on my personal MacBook and I didn’t want there to be any difference between my laptop’s time and the NFS time. To make this take effect, I restarted the ntpdate daemon with:

sudo service ntpdate restart

Then on the Arch Linux nodes, I configured /etc/ntp.conf to be the same (i.e. server time.apple.com) and restarted the daemon on them as well with

sudo systemctl restart ntpdate

Interestingly, even once I did this, the clock continued to be skewed when I would log in days later. The only thing I could think of was quite the hack: I had the ntpdate daemon restart itself every two minutes via a cron job. I don't think this is the proper fix to the problem (isn't the ntpdate daemon supposed to fetch time periodically by itself?!) but it was easy and it fixed the problem for good. To do this I ran crontab -e and then inserted the line

*/2 * * * * sudo systemctl restart ntpdate

into the opened file.

NFS I/O optimization

One of the first things that I noticed about the NFS partition was that I/O was unbearably slow. I had a 1 Gbit (125 MB/s) ethernet switch, and each node (Raspberry Pis excluded) had a fast enough network interface to handle that speed. Despite this, I was seeing very slow write speeds to the NFS. When it took over 10 minutes to write 100MB to the NFS, I decided this was unacceptable. Tuning the NFS parameters to optimize I/O turned out to be one of the more interesting steps in setting up the system.

I followed the detailed discussion here on how to diagnose and optimize NFS. The first thing to do is quantitatively assess current performance so that future configurations can be benchmarked for effectiveness. The command dd, which transfers raw data from a source to a sink, is a great tool for this. Running

dd if=/dev/zero of=/mnt/nfs/testfile bs=16k count=16384

will simply copy null bytes from /dev/zero into a test file on the NFS, and then report the transfer speed. When I ran this command on the untuned NFS, I saw 22.5 MB/s write speed. Repeating the test but reading from the test file (into /dev/null) instead, I saw 45.1 MB/s read speed.
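
The read test is the same idea in reverse:

dd if=/mnt/nfs/testfile of=/dev/null bs=16k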

Just to benchmark against the disk speed of the head node, I ran the test again against a local directory and saw 563 MB/s write and 2.3 GB/s read speeds, confirming my suspicion that it was indeed the NFS that was performing poorly. To be sure that the bottleneck wasn't the WD MyCloud's disk speed, I also checked the local I/O on the WD MyCloud and found 124 MB/s write and 98.8 MB/s read: nope!

The first recommendation for tuning NFS is to edit the block size and other mount options in /etc/fstab. Playing around with rsize/wsize on the head node yielded the following:

rsize/wsize = 4k:   13.7 MB/s write, 17.7 MB/s read
rsize/wsize = 8k:   20.6 MB/s write, 31.9 MB/s read
rsize/wsize = 16k:  27.6 MB/s write, 50.8 MB/s read
rsize/wsize = 32k:  36.1 MB/s write, 70.2 MB/s read
rsize/wsize = 64k:  26.9 MB/s write, 67.8 MB/s read
rsize/wsize = 524k: 35.1 MB/s write, 71.6 MB/s read

Note that if you don't unmount the NFS partition between tests, the file will be cached in memory and dd will read from memory instead of over the network, reporting >2 GB/s transfer. These results were promising, but they only increased speed a little and were still well below the theoretical limit of 125 MB/s.
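
For reference, these options go on the NFS entry in /etc/fstab on each client; the server name, export path, and mount point below are placeholders:

wdmycloud:/DataVolume/cluster  /mnt/nfs  nfs  rw,rsize=32768,wsize=32768  0  0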

Investigating the tips and tricks for using NFS in the Arch Linux documentation, I found that the max_block_size of NFS might indeed be the issue, given that /proc/fs/nfsd/max_block_size held 32767 on the MyCloud. This explained why transfer speed plateaued after block sizes of 32k. To change this parameter I simply edited the file; however, since the NFS daemon locks this file, I first had to stop the NFS daemons on the MyCloud with service nfs-kernel-server stop and service nfs-common stop.

Then I edited the value with echo 524288 > /proc/fs/nfsd/max_block_size and restarted the NFS daemons. This resulted in a write speed of 48.5 MB/s (better!) and a read speed of 90.6 kB/s. Yeah, seriously, that's in kB/s. The system actually froze for a solid 10 minutes.

It was here that I realized that the MyCloud only has 256 MB of RAM. My first thought was: “is the 524 kB of buffer used for each block eating up all the RAM on the MyCloud, causing it to thrash?” I ran the I/O benchmark command again and checked top on the MyCloud. Surprisingly, no process was using any significant memory. What then? My head node has 4 GB of memory, so it's hard to believe that it was running out of memory either.

I found an interesting article about network socket buffer sizes, decreased rsize down to 131072, and got 101 MB/s read speed. Okay, it's not quite Gigabit speed, but it's quite an improvement, and since I couldn't think of anything else to try, I decided to live with it.

Conclusion

Finally, each node in the cluster was set up with a shared file system and authenticated communication. At this point I was ready to start writing distributed compute tasks. In my next blog post I will talk about how I used C++ and OpenMPI to coordinate distributed tasks across the cluster!

Building a Beowulf Cluster from old MacBooks: Part 1

My family uses their laptops a lot, and they tend to get them replaced after several years. A year ago, I scrounged around our house and rounded up every old computer that I could find with the hope of putting them together to make some kind of super computer.

Five old MacBook Pros in hand (some very old), I did some research and discovered that I could indeed put them to use in what is called a “Beowulf cluster”. A Beowulf cluster is a collection of computers connected through a (hopefully fast) network connection so that they can coordinate parallel and distributed computing tasks. In the next few blog posts, I'll go into detail about how I turned these discarded laptops, which would otherwise never have been used again, into a cluster that I use for distributed computing.

Hardware

The computers that I managed to get my hands on didn’t have very impressive specs, but hopefully the fact that they will be working together on distributed tasks will make up for that! Nevertheless, here is the hardware that I had to work with:

  1. 15″ MacBook Pro 2007 2.2 GHz 4G RAM single core
  2. 13″ MacBook Pro 2009 2.3 GHz 2G RAM dual core
  3. 13″ MacBook Pro 2009 2.3 GHz 2G RAM dual core (broken screen)
  4. Raspberry Pi 3
  5. Raspberry Pi 2
  6. WD MyCloud 2TB (more on this later)

I actually had two other laptops that were over 12 years old, an iBook G4 from ~2005 and a PowerBook from about 2006. Unfortunately, I didn’t end up using them because they wouldn’t even recognize and boot an Arch Linux USB boot drive.

Why Arch Linux?

Researching quickly led me to realize that the first thing to do was install a distribution of Linux on each computer. With so many Linux distributions to choose from, I went with Arch Linux for the following reasons. First, several of these computers were over 5 years old, and running a bare-bones Linux distribution without the overhead of the full OSX or a desktop environment would increase performance. Second, none of these laptops had identical hardware or identical OSX versions. To avoid problems with mismatched software versions, and the hassle of updating all the computers to the same OSX, installing the latest version of a single Linux distribution seemed the easiest solution. Third, I also had two Raspberry Pis lying around, and I wanted to include those in the cluster as well, so I needed an operating system that could be installed on both an Apple computer and a Raspberry Pi. Fourth, building servers on Linux is much more common than on Mac OSX, and for that reason the online resources for Linux server configuration are much more in-depth. Fifth, although other distributions have great online support, I find the documentation for Arch absolutely stellar. Having installed Arch before, I foresaw a plethora of troubleshooting, and I knew having great online resources was paramount.

Finally (and most importantly), I wanted this project to be as great a learning experience as possible! Prior to this project I felt much more comfortable in OSX than in Linux, and I was confident that setting up the system using just the Linux CLI would maximize my learning. For these reasons, Arch Linux was my operating system of choice.

Installing Arch Linux

Installing Arch on each computer was probably the most time-consuming and troublesome part of the project. There are a thousand installation guides online, not to mention the official one, so I won't go into details, but I will mention some of the peculiarities and problems that I ran into (of which there were many). Luckily, the Arch Linux documentation has some excellent resources for installation troubleshooting, in addition to an army of community members, one of whom has almost certainly already run into and solved whatever problem you have with Arch.

Installation Troubles

These computers were never going to be used for anything else again, so I did a full hard drive reformatting and partitioning (as opposed to a dual-boot). For the most part, installation went smoothly, but there were a few hiccups that left me staring at grub-rescue menus and hung boot processes, though nothing that digging through the Arch Linux forums couldn't fix. One interesting problem was that one of the 13″ MBPs had a completely broken screen.

This complicated the installation of Arch slightly. To avoid doing the entire installation procedure blindly, I started an SSH service as soon as the computer booted from the USB and proceeded with the installation remotely. To help with this, I wrote a bash script, run fresh on the USB boot media, that installs and starts an SSH daemon. This way, the only commands that needed to be run blindly were the copying and running of the start-SSH script. I ended up running this script on all of the other nodes as well so that I could do as much of the system configuration as possible remotely.
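
The script itself was nothing fancy; on the Arch live environment it boils down to something like this (a rough reconstruction, not the exact script):

#!/bin/bash
# Rough sketch: make the live USB environment reachable over SSH
echo "root:somepassword" | chpasswd   # set a root password so SSH login works (placeholder password)
systemctl start sshd                  # the Arch live ISO ships with OpenSSH
ip addr                               # show the IP address to connect to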

Installing Arch on the Raspberry Pis was very easy since (luckily!) the Arch community maintains an image that can simply be flashed onto a micro SD card, just like you would with Raspbian.

Assembling the cluster

Putting the computers together was theoretically quite simple: just give each computer power and connect them to the same local network. Providing power was a bit tricky, given that all of the MacBooks used the old-style charger, of which I had only a single functioning cable. I did have two frayed chargers which, with the help of a YouTube tutorial, I was able to solder together into one working charger.

Providing ethernet was much simpler, given that each computer had a functioning ethernet port. I bought some ethernet cables, plugged them into an 8-port ethernet switch, and connected that to my router. After that I simply set up the SSH daemon on each machine with a user to log in as, told each computer not to sleep when the clamshell was closed, and put them in the stack you see in the banner image.

Power Management

One of the first things that I noticed about the cluster was that the fans were running all of the time and the computers were getting very hot. It seems that the default power management that comes with Arch isn't very well suited to Apple hardware, but after digging through some Arch forums, I found a solution. I installed the following packages and power management daemons:

sudo pacman -S powertop
sudo systemctl enable powertop.service
sudo systemctl start powertop.service

sudo pacman -S gnome-power-manager

sudo pacman -Sy thermald
sudo systemctl enable thermald
sudo systemctl start thermald

sudo pacman -S cpupower
sudo systemctl enable cpupower
sudo systemctl start cpupower

and put the following

[Unit]
Description=Powertop tunings

[Service]
Type=oneshot
ExecStart=/usr/bin/powertop --auto-tune

[Install]
WantedBy=multi-user.target

into /etc/systemd/system/powertop.service, fixing the overheating problem.

Western Digital MyCloud

I also managed to scrounge up a 2TB Western Digital MyCloud (white tower-looking thing in the banner picture). This device is basically just a network storage unit for hosting “your own cloud”. I learned that it ran Linux and supported SSH connectivity out of the box, so I figured it would be a great device to host a network file system to share across the cluster. For now though, all I had to do was connect it to power and the ethernet switch, and then enable SSH access on the device. In my following blog post I will go into setting up the shared file system!

Conclusion

At this point, all of the computers were running Arch Linux, connected via ethernet, and running SSH. This was a good place to stop as the remaining steps were system configuration set up over SSH (of which there was a lot!).

The hardest part of these beginning steps was simply getting the old MacBooks to boot Arch. After this the real fun began: In my next blog post I will talk about setting up a shared file system, and making the nodes into a more cohesive unit. Then, in the post after that, I’ll go into using the cluster for some simple distributed computing tasks.