Posts
SC09 talk bits
I’ll be giving a talk at SC09 about the design and installation of our new siCluster (storage cluster) product. The talk is titled “Feeding the hungry Gopher” (more on that name in a moment). It runs Tuesday, November 17th, 3-3:30pm, at the Minnesota Supercomputing Institute booth. The gopher is the mascot of the University of Minnesota; the hungry gophers are, in this case, the 1000+ nodes of their new HP cluster, Itasca.
SC09 booth bits
Ok, here’s the scoop. We will be in the Intel Partner Pavilion, booth #3077, and we will be bringing a nice JackRabbit JR4 and a Pegasus-GPU unit. The JR4 will have a pair of fast Intel Nehalem CPUs in it (probably X5550’s), and the Pegasus a pair of W5580’s. Both units will have a fair bit of RAM; the JR4 will have 24x 500 GB disks, while the Pegasus will have, get this, 32x 500 GB disks.
The joy that is mmap ...
Mmap is a way to provide file IO in a nice, simple manner: map a file into a region of memory, and reads and writes to that buffer are reflected physically in the file itself. That is an oversimplification, but it is basically what happens. In most operating systems, mmap makes direct use of the paging paths in the kernel. Why am I writing about this? Because the paging paths are some of the slowest paths in modern kernels, typically doing IO a page at a time.
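Here is a minimal sketch of the idea in C, assuming a pre-existing file that is at least a few bytes long (the name data.bin is just an illustration): the file is mapped once, and ordinary memory stores become file content by way of the kernel's paging machinery.

/* Minimal sketch of file IO via mmap. The file name "data.bin" is a
 * placeholder; the file must already exist and be a few bytes long. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read/write. Stores into buf dirty pages that
     * the kernel writes back through its paging paths, a page at a time. */
    char *buf = mmap(NULL, sb.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(buf, "hello", 5);          /* this becomes file content */
    msync(buf, sb.st_size, MS_SYNC);  /* force the write-back now  */

    munmap(buf, sb.st_size);
    close(fd);
    return 0;
}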
SC09 prep
The next two weeks are going to be crazy: prepping 2 machines for SC09, getting demos onto them, and testing them out (bits don’t die in transit … oh no … never happens :( ). We are planning on a JR4 and a Pegasus, likely with a nice fast connection between the two. The Pegasus will be interesting in that it will have many 2.5" disks, very fast Intel Nehalem chips, lots of RAM, and some nice fast GPU cards.
Business brief
If we simply stopped working today and just finished up the orders in hand that we need to deliver, we would be reporting 50% growth in revenue over last year. That is, with only 10 months of work out of 12 … I won’t comment on our pipeline other than to say I like it. Did I mention I have been very busy?
Weblog awards nominations open up on 2-Nov
Folks, have a look at this. Please consider nominating and voting for your favorite tech/HPC blogs (no, not dropping any hints … nosiree … none whatsoever … nothing to see here folks, move along …).
Would you take operational/marketing advice from someone without such experience?
We are starting a process to get some additional capital into the company, apart from operations and profit generation. That is going well. One aspect of this is discussions with people over our strategy and other elements of the business. Some of these conversations are amusing. Some are annoying. Few are really helpful or insightful. That is, a great deal of time and effort is expended, with little in return for that time and effort.
Reducing risk: avoiding the bricking phenomenon
Something happened this week in a storage cluster we set up for a customer. You’ll hear more about the storage cluster at SC09, but that’s not what this is about. This is about risk, and how to reduce it. Risk is a complex thing to define in practice, but there are several … well … simple ways you can gauge relative risk. A motherboard and power supply blew in one of our nodes.
Updates: storage cluster, SC09, and other things
Been busy. Incredibly busy. Back with the storage cluster, fixing a blown KVM and, as I found this morning, a blown motherboard (I am hoping that is it, but we are preparing to replace all of the innards … just in case). The storage cluster hit our performance targets in testing, even with the IB running at 2/3 of its rated speed. We are working on finding out why it is at 2/3 speed rather than full speed.
Good performance numbers on the storage gluster
I can’t go into them in depth, but we exceeded the performance targets (the system was purposely designed to do this). The gluster team rocks! (I can’t emphasize this enough.) There is an odd performance issue with the Mellanox QDR; still trying to understand it, and hopefully we will be able to update to our later kernel with the 1.5 OFED. I can say that running 24 parallel independent writes to each RAID, without any parallel file system in there, gave us about 20 GB/s of sustained bandwidth to disk in aggregate, for writes far larger than system cache.
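For context, here is a minimal sketch in C of that kind of test (not our actual harness): a single streaming writer pushing far more data than the system cache can hold to one RAID target, using O_DIRECT so the page cache is bypassed. Run one copy per RAID in parallel and sum the per-writer rates to get the aggregate. The default path, block size, and total size below are assumptions for illustration only.

/* Sketch of a sustained streaming-write test. Path, block size, and
 * total size are illustrative only. Run one instance per RAID target
 * in parallel and add up the reported rates for the aggregate. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK (4UL * 1024 * 1024)            /* 4 MiB per write() call     */
#define TOTAL (64ULL * 1024 * 1024 * 1024)   /* 64 GiB, far above RAM size */

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/data/raid0/testfile"; /* hypothetical path */
    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK)) { perror("posix_memalign"); return 1; }
    memset(buf, 0xA5, BLOCK);

    /* O_DIRECT bypasses the page cache, so we measure the disks, not RAM. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long long done = 0; done < TOTAL; done += BLOCK)
        if (write(fd, buf, BLOCK) != (ssize_t)BLOCK) { perror("write"); return 1; }
    fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%s: %.2f GB/s sustained write\n", path, TOTAL / secs / 1e9);

    close(fd);
    free(buf);
    return 0;
}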