CALL GLOP()

Andy Malakov software blog

Wednesday, November 30, 2016

Wednesday, March 4, 2015

Connecting two CentOS computers using cheap Infiniband

This is a continuation of the previous post. This time I wanted to test direct Infiniband connection on Linux.

Setup is the same:

  • Two retired developer desktops (built in 2008): AMD Opteron 2216 @ 2.4 GHz, 8 GB DDR2.
  • A pair of Mellanox Infinihost III adapters (MHGA28-XTC)
  • CentOS Linux 6.3 (Minimal Install in my case)
  • MLNX_OFED_LINUX-1.5.3-4.0.42-rhel6.3-x86_64.iso OFED driver (Still available on mellanox.com)

Linux setup is pretty straightforward, but in my opinion more involved than on Windows. The main problem was the age of these cards. To avoid rebuilding the OFED drivers for them, I used an old version of CentOS (6.3). I tried OFED 2.x but got an MFE_OLD_DEVICE_TYPE error. Besides, I wanted to test SDP in Java 7, and this protocol seems to be no longer available in OFED 2.x and later.

Bottom line: for these old Infinihost III-family cards, use the older OFED driver (1.5.3). If you don't want to rebuild the driver, use one of the Linux distros/versions the driver was built for (there are quite a few).

I found the following two resources most useful for this project: A and B. There is no reason to repeat their steps here. Connection verification and testing with the OFED utilities is similar to the Windows version.

Configuring a simple Java Socket application to use SDP worked like a charm. See Oracle's tutorial.
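As a rough sketch of what this looks like: the application itself is ordinary java.net socket code, and SDP is switched on from outside with a rules file and a system property, which is the mechanism Oracle's tutorial describes. The class name, the rule addresses and the loopback demo below are assumptions for illustration, not the code I actually ran.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Plain echo pair; nothing in the code is SDP-specific. With JDK 7 on Linux
// and an OFED stack that still ships SDP, it is redirected to SDP by running:
//   java -Dcom.sun.sdp.conf=sdp.conf -Djava.net.preferIPv4Stack=true SdpEcho
// where sdp.conf holds rules such as (addresses are made up):
//   bind    192.168.10.1     *
//   connect 192.168.10.0/24  1024-*
class SdpEcho {

    // Accepts one connection and echoes a single line back.
    static void serveOne(ServerSocket server) throws IOException {
        try (Socket s = server.accept();
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream(), "UTF-8"));
             PrintWriter out = new PrintWriter(new OutputStreamWriter(s.getOutputStream(), "UTF-8"), true)) {
            out.println(in.readLine());
        }
    }

    // Sends one line to host:port and returns the echoed reply.
    static String roundTrip(String host, int port, String message) throws IOException {
        try (Socket s = new Socket(host, port);
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream(), "UTF-8"));
             PrintWriter out = new PrintWriter(new OutputStreamWriter(s.getOutputStream(), "UTF-8"), true)) {
            out.println(message);
            return in.readLine();
        }
    }

    // Runs server and client in one process over loopback (plain TCP unless
    // an SDP rule happens to match).
    static String loopbackDemo(String message) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread t = new Thread(() -> {
                try {
                    serveOne(server);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            t.setDaemon(true);
            t.start();
            return roundTrip(server.getInetAddress().getHostAddress(), server.getLocalPort(), message);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(loopbackDemo("hello"));
    }
}
```

For a real test, the server and client halves would run on the two machines and the rules in sdp.conf would name the IPoIB addresses.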

Connecting two Windows 7 computers with low-cost Infiniband

Previous-generation Infiniband cards are selling for a fraction of their original price on eBay. Developers are buying these setups to test and learn this technology. I followed this path and am posting my notes here. The setup wasn't easy, and I had to collect information from various sources.

Hardware

  • A pair of Mellanox Infinihost III adapters, MHGA28-XTC ($36)
  • A pair of Molex 4X Infiniband copper cables ($12.5)

Total price tag was $97 (including shipping).

These are old-generation dual-port InfiniBand adapter cards that fit into PCI Express x8 slots. Each card has two 20Gb/s ports. I used two retired Supermicro desktops (circa 2008) that match these cards in age. Each computer is running Windows 7 (x64). [Next post will explore the same hardware setup on Linux].

Infinihost allows direct connection between two computers (in point-to-point setup there is no need for Infiniband switch).

BIOS Update

Multiple sources recommend upgrading the cards' firmware before trying them with Windows.

There are several revisions of MHGA28-XTC cards; mine was A3 (check the sticker attached to the back of each card). Firmware can be downloaded from Mellanox here.

For firmware upgrades and basic status checks Mellanox provides the MFT utilities set. In my case the latest MFT version, 3.8, refused to work with these cards, claiming they are no longer supported. Luckily MFT version 2.7.2 is still available and works with these old Infinihost-family cards:

C:\Program Files\Mellanox\WinMFT>mst status
MST devices:
------------
  mt25218_pciconf0
  mt25218_pci_cr0

C:\Program Files\Mellanox\WinMFT>mlxburn -dev mt25218_pci_cr0 -image fw-25218-5_3_000-MHGA28-XTC_A3.bin

    Current FW version on flash:  5.2.916
    New FW version:               5.3.0

Read and verify Invariant Sector            - OK
Read and verify PPS/SPS on flash            - OK
Burning second FW image without signatures  - OK
Restoring second signature                  - OK
-I- Image burn completed successfully.

Windows Driver

Initially these cards showed up as "Infiniband controller" in Windows Device Manager:

Drivers for these cards are available from Mellanox and OpenFabrics.org (OFED). I believe both of these sources actually provide the same driver, maintained by OFED (sponsored by Mellanox).

Here I had the same story: the latest OFED driver version (3.2) simply didn't work with these cards. Setup ended with a "Possible NetworkDirect startup failure" warning. The installed driver would identify the card properly, but a yellow triangle indicated that the device was disabled due to errors. The Windows event log showed that some driver components failed to initialize.

After some trial and error I found that OFED driver version 2.3 was what I needed. It can be downloaded from the OpenFabrics archive.

As you can see below, in addition to the Infiniband card, Device Manager showed that I got two OpenFabrics IPoIB adapters (since each card has two ports):

Configuration

I repeated the above steps on both computers and connected the cards with cables.

The OFED software comes with a set of utilities, one of which (ibstat) can be used to check connectivity status:

C:\Windows\system32>ibstat
CA 'ibv_device0'
        CA type:
        Number of ports: 2
        Firmware version: 0x500030000
        Hardware version: 0x20
        Node GUID: 0x0002c9020023c250
        System image GUID: 0x0002c9020023c253
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 20
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x90580000
                Port GUID: 0x0002c9020023c251
                Link layer: IB
        Port 2:
                State: Initializing
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x90580000
                Port GUID: 0x0002c9020023c252
                Link layer: IB

Subnet Manager

When these cards are connected directly, we need to launch the Infiniband Subnet Manager (opensm). OFED installs it as a Windows service (disabled by default). In my case I launched opensm from the command line. You need to run this service on both computers.

C:\Windows\system32>opensm
-------------------------------------------------
OpenSM 3.3.6 UMAD
Command Line Arguments:
 Log File: %TEMP%\osm.log
-------------------------------------------------
OpenSM 3.3.6 UMAD

Entering DISCOVERING state

Using default GUID 0x2c9020023c251
Entering MASTER state

SUBNET UP

Entering STANDBY state

Each service can be configured to serve both ports (enter the GUID of each port into the opensm configuration file).

After this step Windows should show your network status as connected:

If you plan to keep this setup running, OpenSM can be launched automatically as Windows Service (disabled by default).

Connectivity test

OFED has a special ping utility (ibping) that can be used for a quick test.

Computer 1 (Here we print GUIDs of each port and launch ping server):

C:\Windows\system32>ibstat -p
0x0002c90200231745
0x0002c90200231746

C:\Windows\system32>ibping -S

Computer 2 (here we use GUID of the first computer's port):

C:\Windows\system32>ibping -G 0x0002c90200231745
Pong from ?hostname?.?domainname? (Lid 2): time 0.230 ms
Pong from ?hostname?.?domainname? (Lid 2): time 0.160 ms
Pong from ?hostname?.?domainname? (Lid 2): time 0.231 ms
Pong from ?hostname?.?domainname? (Lid 2): time 0.159 ms
Pong from ?hostname?.?domainname? (Lid 2): time 0.163 ms
Pong from ?hostname?.?domainname? (Lid 2): time 0.174 ms
(Never mind the weird host name.)

Latency test

Computer 1 (Launching test server):
C:\Windows\system32>ib_send_lat -a -c RC 
Computer 2 (test client):
C:\Windows\system32>ib_send_lat -a -c RC oldfaithful
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
test
  local address:  LID 0x100, QPN 0x6040200, PSN 0x265a0000, RKey 0x2c0010 VAddr 0x00000001170040
  remote address: LID 0x200, QPN 0x6040600, PSN 0xb64e0000, RKey 0x2c0030 VAddr 0x00000000fc0040
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           4.10        2295.99             4.27
      4        1000           3.75        1305.61             4.27
      8        1000           3.75         256.86             3.93
     16        1000           3.75         266.24             3.93
     32        1000           4.44        1021.62             4.61
     64        1000           4.44         329.22             4.61
    128        1000           4.61         303.79             4.78
    256        1000           4.95         529.41             5.12
    512        1000           5.46         300.89             5.63
   1024        1000           6.83         309.25             7.00
   2048        1000           9.22         327.17             9.39
   4096        1000          11.61         280.92            11.78
   8192        1000          16.90         306.86            17.07
  16384        1000          27.65         329.56            27.82
...
This hardware is 9 years old, so numbers are sub-optimal. Still much better than TCP even without any special tuning.
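For a rough TCP baseline to set against these numbers, a ping-pong test over plain sockets can be sketched in Java. This is not equivalent to ib_send_lat and is not the test I ran; the class name, message size and iteration count are assumptions. The sketch reports half of the best round-trip time in microseconds.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class TcpPingPong {

    // Echo side: reads fixed-size messages and sends each one straight back
    // until the client closes the connection.
    static void serve(ServerSocket server, int size) throws IOException {
        try (Socket s = server.accept()) {
            s.setTcpNoDelay(true); // do not batch small writes
            DataInputStream in = new DataInputStream(s.getInputStream());
            OutputStream out = s.getOutputStream();
            byte[] buf = new byte[size];
            while (true) {
                try {
                    in.readFully(buf);
                } catch (EOFException endOfStream) {
                    return;
                }
                out.write(buf);
                out.flush();
            }
        }
    }

    // Client side: returns the best observed (round trip / 2) in microseconds.
    static double minLatencyMicros(String host, int port, int size, int iterations) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.setTcpNoDelay(true);
            DataInputStream in = new DataInputStream(s.getInputStream());
            OutputStream out = s.getOutputStream();
            byte[] buf = new byte[size];
            long best = Long.MAX_VALUE;
            for (int i = 0; i < iterations; i++) {
                long start = System.nanoTime();
                out.write(buf);
                out.flush();
                in.readFully(buf);
                best = Math.min(best, System.nanoTime() - start);
            }
            return best / 2000.0; // ns round trip -> us one way
        }
    }

    // Runs both sides in one process over loopback, just to exercise the code.
    static double loopbackDemo(int size, int iterations) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread t = new Thread(() -> {
                try {
                    serve(server, size);
                } catch (IOException ignored) {
                    // connection torn down; nothing to report in a demo
                }
            });
            t.setDaemon(true);
            t.start();
            return minLatencyMicros(server.getInetAddress().getHostAddress(),
                    server.getLocalPort(), size, iterations);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.printf("64 bytes: %.2f usec%n", loopbackDemo(64, 1000));
    }
}
```

To compare with the table above, serve() would run on one machine and minLatencyMicros() on the other, first over the regular Ethernet interface and then over the IPoIB adapter.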

What's next?

We got a low-latency 20Gb/s connection between two Windows 7 machines using a pair of cheap Infiniband adapters.

In theory this setup can be used for ultra-fast file sharing etc. My primary interest was getting my hands on Infiniband and the OFED stack (and ultimately using it from Java). Unfortunately the Sockets Direct Protocol (SDP), available in Java since version 7, is a) deprecated in the latest version of OFED and b) seems to be unsupported by Java on Windows anyway. There are various libraries that provide RDMA to Java using JNI wrappers.

Monday, December 5, 2011

My CPU collection

Monday, October 11, 2010

ShouldNotReachHere

#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (classFileParser.cpp:3161), pid=3136, tid=4676
# Error: ShouldNotReachHere()
#

Sunday, October 3, 2010

My first Android app





I created my first Android app. It converts a date in the Gregorian calendar into the Maya "Long Count" calendar. Contrary to popular belief, the Mayan calendar doesn't end on December 21st, 2012. You can test this in my app :-).

Sources can be found here.
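The conversion itself is short. Here is a minimal sketch in plain Java (not the app's actual code): it assumes the commonly used GMT correlation constant 584283 for the Long Count epoch and relies on java.time (Java 8+). The class and method names are made up for this example.

```java
import java.time.LocalDate;

class LongCount {

    // Julian Day Number of 1970-01-01 is 2440588; the Long Count epoch
    // 0.0.0.0.0 is assumed to be JDN 584283 (GMT correlation).
    static final long EPOCH_OFFSET_DAYS = 2440588L - 584283L;

    // Converts a (proleptic) Gregorian date to baktun.katun.tun.uinal.kin.
    // Dates before the Long Count epoch are not handled.
    static String fromGregorian(LocalDate date) {
        long days = date.toEpochDay() + EPOCH_OFFSET_DAYS;
        long baktun = days / 144000; // 20 katun
        days %= 144000;
        long katun = days / 7200;    // 20 tun
        days %= 7200;
        long tun = days / 360;       // 18 uinal
        days %= 360;
        long uinal = days / 20;      // 20 kin
        long kin = days % 20;
        return baktun + "." + katun + "." + tun + "." + uinal + "." + kin;
    }

    public static void main(String[] args) {
        System.out.println(fromGregorian(LocalDate.of(2012, 12, 21))); // 13.0.0.0.0
        System.out.println(fromGregorian(LocalDate.of(2012, 12, 22))); // 13.0.0.0.1
    }
}
```

The day after December 21st, 2012 is simply 13.0.0.0.1: the count rolls over to a new baktun and keeps going.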

Sunday, June 6, 2010

Alternative to Thread.sleep()

Follow-up to my previous post. Here is an alternative to Thread.sleep() that uses spin-yield:

private static final long SLEEP_PRECISION = TimeUnit.MILLISECONDS.toNanos(2); // TODO: determine for the current machine

public static void sleepNanos(long nanoDuration) throws InterruptedException {
    final long end = System.nanoTime() + nanoDuration;
    long timeLeft = nanoDuration;
    do {
        if (timeLeft > SLEEP_PRECISION)
            Thread.sleep(1);
        else
            Thread.sleep(0); // Thread.yield();
        timeLeft = end - System.nanoTime();

        if (Thread.interrupted())
            throw new InterruptedException();
    } while (timeLeft > 0);
}



Test


I ran 4 threads, each requesting a 5 millisecond sleep 1000 times (on my dual-core CPU). The first chart shows sleepNanos(TimeUnit.MILLISECONDS.toNanos(5)):

Actual sleep time of sleepNanos(5000000)

The second chart shows Thread.sleep(5):

Actual sleep time of Thread.sleep(5)

As you can see, sleepNanos() is much more precise. I found that this approach was originally used by Ryan Geiss for a Winamp visualization plugin.
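A test along these lines can be sketched as follows. This is not the original harness that produced the charts: the class name and the overshoot metric are assumptions, and sleepNanos() is a copy of the method above so that the example is self-contained.

```java
import java.util.concurrent.TimeUnit;

class SleepAccuracy {

    static final long SLEEP_PRECISION = TimeUnit.MILLISECONDS.toNanos(2);

    // Copy of the spin-yield sleep shown above.
    static void sleepNanos(long nanoDuration) throws InterruptedException {
        final long end = System.nanoTime() + nanoDuration;
        long timeLeft = nanoDuration;
        do {
            if (timeLeft > SLEEP_PRECISION)
                Thread.sleep(1);
            else
                Thread.sleep(0);
            timeLeft = end - System.nanoTime();
            if (Thread.interrupted())
                throw new InterruptedException();
        } while (timeLeft > 0);
    }

    // Average (actual - requested) time in nanoseconds over a number of sleeps.
    static long averageOvershootNanos(int iterations, long nanos, boolean spinYield)
            throws InterruptedException {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            if (spinYield)
                sleepNanos(nanos);
            else
                Thread.sleep(TimeUnit.NANOSECONDS.toMillis(nanos));
            total += (System.nanoTime() - start) - nanos;
        }
        return total / iterations;
    }

    public static void main(String[] args) throws InterruptedException {
        final long fiveMillis = TimeUnit.MILLISECONDS.toNanos(5);
        for (boolean spinYield : new boolean[] {true, false}) {
            Thread[] threads = new Thread[4];
            for (int i = 0; i < threads.length; i++) {
                threads[i] = new Thread(() -> {
                    try {
                        long avg = averageOvershootNanos(1000, fiveMillis, spinYield);
                        System.out.printf("%s: average overshoot %d ns%n",
                                spinYield ? "sleepNanos" : "Thread.sleep", avg);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads)
                t.join();
        }
    }
}
```

The sketch prints only an average per thread; plotting every individual sample, as in the charts, would show the spread as well.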

UPDATE



Even better precision can be achieved if you are willing to burn more CPU by spin-waiting for the last part of the wait:

// SPIN_YIELD_PRECISION is a second, smaller threshold (in nanoseconds)
// that also has to be tuned for the machine.
public static void sleepNanos(long nanoDuration) throws InterruptedException {
    final long end = System.nanoTime() + nanoDuration;
    long timeLeft = nanoDuration;
    do {
        if (timeLeft > SLEEP_PRECISION)
            Thread.sleep(1);
        else if (timeLeft > SPIN_YIELD_PRECISION)
            Thread.sleep(0);
        // otherwise: busy-spin without yielding

        timeLeft = end - System.nanoTime();

        if (Thread.interrupted())
            throw new InterruptedException();
    } while (timeLeft > 0);
}
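SLEEP_PRECISION is still marked TODO in the code above. One possible way to estimate it, sketched here under my own assumptions (the class name, sample counts and the choice of the worst case are not from the original code), is to time Thread.sleep(1) repeatedly and take the longest observed duration.

```java
class SleepCalibration {

    // Longest observed duration of Thread.sleep(1), in nanoseconds.
    static long worstSleepOneMillisNanos(int samples) throws InterruptedException {
        long worst = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            Thread.sleep(1);
            worst = Math.max(worst, System.nanoTime() - start);
        }
        return worst;
    }

    public static void main(String[] args) throws InterruptedException {
        worstSleepOneMillisNanos(50); // warm-up, result ignored
        long precision = worstSleepOneMillisNanos(200);
        System.out.printf("Suggested SLEEP_PRECISION: %d ns%n", precision);
    }
}
```

Taking the worst case is deliberately conservative: sleepNanos() only calls Thread.sleep(1) while more than SLEEP_PRECISION remains, so a threshold at least as long as the slowest sleep(1) keeps that call from overshooting the deadline. A single outlier inflates the value, so a high percentile may be a better choice on a busy machine.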

Thursday, May 13, 2010

Measuring nanoseconds in Java / Windows

I was under the false impression that interval timers in Java allow sub-millisecond precision on Windows. I knew that Thread.sleep(millis, nanos) internally uses milliseconds, but for some reason I thought that methods like LockSupport.parkNanos() provide precise waits. Well, I was wrong. The smallest delay this method can realize is approximately 1.95 milliseconds on my Windows PC.



Back in 2006 David Holmes explained that Java timer intervals are based on the WaitForMultipleObjects() Windows API (which takes a dwMilliseconds timeout). They still are.
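The actual granularity is easy to check on any machine. Here is a minimal probe, with a class name and sample count chosen only for this example; it requests a 1 microsecond park and reports how long the call takes on average.

```java
import java.util.concurrent.locks.LockSupport;

class ParkNanosProbe {

    // Average wall-clock duration of LockSupport.parkNanos(requestedNanos), in ns.
    static long averageParkNanos(long requestedNanos, int samples) {
        long total = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            LockSupport.parkNanos(requestedNanos);
            total += System.nanoTime() - start;
        }
        return total / samples;
    }

    public static void main(String[] args) {
        averageParkNanos(1_000, 100); // warm-up, result ignored
        long avg = averageParkNanos(1_000, 1_000);
        System.out.printf("requested 1000 ns, got %d ns on average%n", avg);
    }
}
```

parkNanos() is allowed to return spuriously, so an occasional very short sample is possible; that is why the sketch averages instead of taking the minimum. The result depends on the OS and the JVM version.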



One simple (but not universal) workaround is a "spin-sleep":


private static void sleepNanos(long nanoDelay) {
    final long end = System.nanoTime() + nanoDelay;
    do {
        Thread.yield(); // Thread.sleep (0);
    } while (System.nanoTime() < end);
}


When running standalone, this method consumes all free resources of a single CPU core, but it will share the core with other threads that may be running. For large durations it can be enhanced to use Thread.sleep() for the bulk of the waiting.



P.S. Difference between Thread.yield() and Thread.sleep( 0 ) is explained here.

Friday, December 11, 2009

Old trap

I was puzzled why one of my objects had an incorrect toString():

class Foo {
    private long id;
    private String name;

    Foo(long id, String name) {
        this.id = id;
        this.name = name;
    }

    public String toString() {
        return id + ' ' + name;
    }
}


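Calling toString() on new Foo(123, "Hello") returns something like "155Hello" instead of "123 Hello". Of course the reason is pretty simple: id + ' ' is evaluated first, as numeric addition, with the char ' ' widened to its code 32. The method is effectively this: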

 
public String toString2() {
    return new StringBuilder().append(id + 32L).append(name.toString()).toString();
}
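The trap and one way out of it, side by side. The class name FixedFoo and the method buggyToString() are made up for this example; replacing the char literal with a String literal is only one of several possible fixes.

```java
class FixedFoo {
    private final long id;
    private final String name;

    FixedFoo(long id, String name) {
        this.id = id;
        this.name = name;
    }

    // The trap: long + char is arithmetic, so ' ' (code 32) is added to id
    // before any string concatenation happens.
    String buggyToString() {
        return id + ' ' + name;
    }

    // A String literal forces string concatenation from the first '+'.
    @Override
    public String toString() {
        return id + " " + name;
    }

    public static void main(String[] args) {
        FixedFoo foo = new FixedFoo(123, "Hello");
        System.out.println(foo.buggyToString()); // 155Hello
        System.out.println(foo);                 // 123 Hello
    }
}
```

Starting the expression with an empty string ("" + id + ' ' + name) works as well, because evaluation goes left to right and everything after the first String operand is concatenated.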

Wednesday, November 11, 2009

Stack allocation in Java is still a myth

There were rumors that Mustang would have on-stack allocation as part of the HotSpot optimizations.

Four+ years later I tried this:


public long encode(String input) {
    final byte[] buffer = new byte[8];

    // ... encode input into buffer

    // ... convert buffer into a long result

    return result;
}


I was hoping the JVM would allocate the buffer on the stack, based on the fact that it does not escape from this method. Running this test 10M times with -verbosegc shows extensive GC work (1.6.0_16-b01 64-bit server JVM with the -XX:+DoEscapeAnalysis option).
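To make the experiment reproducible, here is a self-contained sketch. The encoding (pack the low byte of up to 8 characters into a long, big-endian) is a stand-in I made up, since the real one is elided above, and counting collections through GarbageCollectorMXBean is an alternative to reading the -verbosegc output.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class EscapeProbe {

    // Hypothetical encoding: low byte of the first 8 chars, packed big-endian.
    static long encode(String input) {
        final byte[] buffer = new byte[8]; // never escapes this method
        int n = Math.min(input.length(), buffer.length);
        for (int i = 0; i < n; i++)
            buffer[i] = (byte) input.charAt(i);
        long result = 0;
        for (byte b : buffer)
            result = (result << 8) | (b & 0xFF);
        return result;
    }

    // Total number of collections reported by all collectors so far.
    static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
            count += Math.max(0, gc.getCollectionCount()); // -1 means "undefined"
        return count;
    }

    public static void main(String[] args) {
        long before = totalGcCount();
        long sink = 0;
        for (int i = 0; i < 10_000_000; i++)
            sink += encode("ABCDEFGH");
        System.out.println("checksum: " + sink);
        System.out.println("collections during the loop: " + (totalGcCount() - before));
    }
}
```

Escape analysis is enabled by default in later HotSpot releases, so the collection count on a current JVM may differ from what I saw in 2009.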



On the positive side, GC is very fast. Consider three functions:

  1. encodeNewBuffer () uses new byte array to encode input string.

  2. encodeSynchronizedField () uses private field, guarded by synchronized{} block

  3. encodeThreadLocalField () uses ThreadLocal cache to encode input string



Here is the code:

long encodeNewBuffer(String input) {
    final byte[] buffer = new byte[8];
    return f(buffer);
}

/////////////

private final byte[] buffer = new byte[8];

synchronized long encodeSynchronizedField(String input) {
    return f(buffer);
}

/////////////

// initialValue() gives every thread its own buffer; calling set() from an
// instance initializer only populates the thread that constructs the object.
private final ThreadLocal<byte[]> threadLocal = new ThreadLocal<byte[]>() {
    @Override
    protected byte[] initialValue() {
        return new byte[8];
    }
};

long encodeThreadLocalField(String input) {
    byte[] buffer = threadLocal.get();
    return f(buffer);
}


The GC-based method is the winner:


encodeNewBuffer():         4.108 sec.
encodeSynchronizedField(): 5.322 sec.
encodeThreadLocalField():  5.411 sec.
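A comparison like this can be reproduced with a small harness. The sketch below is not the code that produced the numbers above: encodeInto() is a hypothetical stand-in for f(), the inputs are made up, and a plain timing loop like this one is far less trustworthy than a proper benchmark tool such as JMH.

```java
import java.util.Arrays;
import java.util.function.ToLongFunction;

class BufferStrategies {

    private final byte[] sharedBuffer = new byte[8];
    private final ThreadLocal<byte[]> threadLocal = ThreadLocal.withInitial(() -> new byte[8]);

    // Hypothetical stand-in for f(): low byte of up to 8 chars, packed big-endian.
    static long encodeInto(byte[] buffer, String input) {
        Arrays.fill(buffer, (byte) 0); // reused buffers may hold old data
        int n = Math.min(input.length(), buffer.length);
        for (int i = 0; i < n; i++)
            buffer[i] = (byte) input.charAt(i);
        long result = 0;
        for (byte b : buffer)
            result = (result << 8) | (b & 0xFF);
        return result;
    }

    long encodeNewBuffer(String input) {
        return encodeInto(new byte[8], input);
    }

    synchronized long encodeSynchronizedField(String input) {
        return encodeInto(sharedBuffer, input);
    }

    long encodeThreadLocalField(String input) {
        return encodeInto(threadLocal.get(), input);
    }

    // Times `iterations` calls of one strategy and returns elapsed seconds.
    static double timeSeconds(ToLongFunction<String> encoder, int iterations) {
        String[] inputs = {"alpha", "beta", "gamma", "delta"};
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
            sink += encoder.applyAsLong(inputs[i & 3]);
        long elapsed = System.nanoTime() - start;
        if (sink == 42)
            System.out.println(sink); // keeps the JIT from discarding the loop
        return elapsed / 1e9;
    }

    public static void main(String[] args) {
        BufferStrategies s = new BufferStrategies();
        int n = 10_000_000;
        timeSeconds(s::encodeNewBuffer, n); // warm-up, result ignored
        System.out.printf("encodeNewBuffer():         %.3f sec%n", timeSeconds(s::encodeNewBuffer, n));
        System.out.printf("encodeSynchronizedField(): %.3f sec%n", timeSeconds(s::encodeSynchronizedField, n));
        System.out.printf("encodeThreadLocalField():  %.3f sec%n", timeSeconds(s::encodeThreadLocalField, n));
    }
}
```

This runs everything on a single thread, so the synchronized variant only pays for uncontended locking; with several threads the picture can change.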