CALL GLOP()
Andy Malakov software blog
Wednesday, November 30, 2016
Wednesday, March 4, 2015
Connecting two CentOS computers using cheap Infiniband
This is a continuation of the previous post. This time I wanted to test a direct Infiniband connection on Linux.
Setup is the same:
- Two retired developer desktops (built in 2008): AMD Opteron 2216 @2.4GHz, 8G DDR2.
- A pair of Mellanox Infinihost III adapter MHGA28-XTC
- CentOS Linux 6.3 (Minimal Install in my case)
- MLNX_OFED_LINUX-1.5.3-4.0.42-rhel6.3-x86_64.iso OFED driver (Still available on mellanox.com)
Linux setup is pretty straightforward, but in my opinion more involved than on Windows. The main problem was the age of these cards. To avoid rebuilding the OFED drivers for them I used an old version of CentOS (6.3). I tried OFED 2.x but got an MFE_OLD_DEVICE_TYPE error. Besides, I wanted to test SDP in Java 7, and this protocol seems to be no longer available in OFED 2.x and later.
Bottom line: for these old Infinihost III-family cards use the older OFED driver (1.5.3). If you don't want to rebuild the driver, use one of the Linux distros/versions the driver was built for (there are quite a few).
I found the following two resources most useful for this project: A and B. There is no reason to repeat their steps here. Connection verification and testing with the OFED utilities is similar to the Windows version.
Configuring simple Java Socket application to use SDP worked like a charm. See Oracle's tutorial.
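For readers who have not seen SDP in Java: the application code stays an ordinary java.net socket program. As Oracle's tutorial describes it, SDP is switched on from outside by pointing the JVM at a configuration file through the com.sun.sdp.conf system property. The sketch below is only an illustration of such a "simple Java Socket application"; the class and method names are made up for this example, and it runs over loopback so it can be tried without any Infiniband hardware.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical example class: a plain TCP echo round trip.
// Nothing in here is SDP-specific; the transport is chosen by JVM configuration.
class SdpEchoSketch {

    /** Sends one line to a one-shot echo server and returns the reply. */
    static String roundTrip(String message) throws IOException, InterruptedException {
        // Port 0 asks the OS for any free port; loopback keeps the sketch self-contained.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread echo = new Thread(() -> {
                try (Socket peer = server.accept();
                     BufferedReader in = new BufferedReader(new InputStreamReader(peer.getInputStream()));
                     PrintWriter out = new PrintWriter(peer.getOutputStream(), true)) {
                    out.println(in.readLine()); // echo the single line back
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            echo.start();

            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()))) {
                out.println(message);
                String reply = in.readLine();
                echo.join();
                return reply;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("ping"));
    }
}
```

On the Infiniband setup the server would bind to the IPoIB address instead of loopback, and the SDP configuration file would list that address in its bind/connect rules.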
Connecting two Windows 7 computers with low-cost Infiniband
Previous-generation Infiniband cards are selling for a fraction of their original price on eBay, and developers are buying these setups to test and learn the technology. I followed this path and am posting my notes here. The setup wasn't easy, and I had to collect information from various sources.
Hardware
- A pair of Mellanox Infinihost III adapter MHGA28-XTC ($36)
- A pair of Molex 4X Infiniband copper cables ($12.5)
Total price tag was $97 (including shipping).
These are old-generation dual-port InfiniBand adapter cards that fit into PCI Express x8 slots. Each card has two 20Gb/s ports. I used two retired Supermicro desktops (circa 2008) that match these cards in age. Each computer is running Windows 7 (x64). [Next post will explore the same hardware setup on Linux].
Infinihost allows direct connection between two computers (in point-to-point setup there is no need for Infiniband switch).
BIOS Update
Multiple sources recommend upgrading the cards' firmware before trying them with Windows.
There are several revisions of MHGA28-XTC cards; mine was A3 (check the sticker attached to the back of each card). Firmware can be downloaded from Mellanox here.
For firmware upgrades and basic status checks Mellanox provides the MFT utility set. In my case the latest MFT version (3.8) refused to work with these cards, claiming they are no longer supported. Luckily MFT version 2.7.2 is still available and works with these old Infinihost-family cards:
C:\Program Files\Mellanox\WinMFT>mst status
MST devices:
------------
  mt25218_pciconf0
  mt25218_pci_cr0

C:\Program Files\Mellanox\WinMFT>mlxburn -dev mt25218_pci_cr0 -image fw-25218-5_3_000-MHGA28-XTC_A3.bin
Current FW version on flash: 5.2.916
New FW version: 5.3.0
Read and verify Invariant Sector - OK
Read and verify PPS/SPS on flash - OK
Burning second FW image without signatures - OK
Restoring second signature - OK
-I- Image burn completed successfully.
Windows Driver
Initially these cards showed up as "Infiniband controller" in Windows Device Manager:
Drivers for these cards are available from Mellanox and OpenFabrics.org (OFED). I believe both of these sources actually provide the same driver maintained by OFED (sponsored by Mellanox).
Here I had the same story: the latest OFED driver version (3.2) simply didn't work with these cards. Setup ended with a "Possible NetworkDirect startup failure" warning. The installed driver would identify the card properly, but a yellow triangle indicated that the device was disabled due to errors. The Windows event log showed that some driver components failed to initialize.
After some trial and error I found that OFED driver version 2.3 was what I needed. It can be downloaded from the OpenFabrics archive.
As you can see below, in addition to the Infiniband card, Device Manager showed that I got two OpenFabrics IPoIB Adapters (since each card has two ports):
Configuration
I repeated the above steps on both computers and connected the cards with cables.
OFED software comes with a set of utilities, one of which (ibstat) can be used to check connectivity status:
C:\Windows\system32>ibstat
CA 'ibv_device0'
    CA type:
    Number of ports: 2
    Firmware version: 0x500030000
    Hardware version: 0x20
    Node GUID: 0x0002c9020023c250
    System image GUID: 0x0002c9020023c253
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 20
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x90580000
        Port GUID: 0x0002c9020023c251
        Link layer: IB
    Port 2:
        State: Initializing
        Physical state: LinkUp
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x90580000
        Port GUID: 0x0002c9020023c252
        Link layer: IB
Subnet Manager
When these cards are connected directly we need to launch the Infiniband Subnet Manager (opensm) ourselves. OFED installs it as a Windows Service (disabled by default). In my case I launched opensm from the command line. You need to run this service on both computers.
C:\Windows\system32>opensm
-------------------------------------------------
OpenSM 3.3.6 UMAD
Command Line Arguments:
 Log File: %TEMP%\osm.log
-------------------------------------------------
OpenSM 3.3.6 UMAD
Entering DISCOVERING state
Using default GUID 0x2c9020023c251
Entering MASTER state
SUBNET UP
Entering STANDBY state
Each service can be configured to serve both ports (enter the GUID of each port into the opensm configuration file).
After this step Windows should show your network status as connected:
Connectivity test
OFED has a special ping utility (ibping) that can be used for a quick test.
Computer 1 (Here we print GUIDs of each port and launch ping server):
C:\Windows\system32>ibstat -p
0x0002c90200231745
0x0002c90200231746

C:\Windows\system32>ibping -S
Computer 2 (here we use GUID of the first computer's port):
C:\Windows\system32>ibping -G 0x0002c90200231745
Pong from ?hostname?.?domainname? (Lid 2): time 0.230 ms
Pong from ?hostname?.?domainname? (Lid 2): time 0.160 ms
Pong from ?hostname?.?domainname? (Lid 2): time 0.231 ms
Pong from ?hostname?.?domainname? (Lid 2): time 0.159 ms
Pong from ?hostname?.?domainname? (Lid 2): time 0.163 ms
Pong from ?hostname?.?domainname? (Lid 2): time 0.174 ms

(Never mind the weird host name.)
Latency test
Computer 1 (launching test server):
C:\Windows\system32>ib_send_lat -a -c RC

Computer 2 (test client):
C:\Windows\system32>ib_send_lat -a -c RC oldfaithful
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
test
local address: LID 0x100, QPN 0x6040200, PSN 0x265a0000, RKey 0x2c0010 VAddr 0x00000001170040
remote address: LID 0x200, QPN 0x6040600, PSN 0xb64e0000, RKey 0x2c0030 VAddr 0x00000000fc0040
Mtu : 2048
------------------------------------------------------------------
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
     2         1000         4.10      2295.99             4.27
     4         1000         3.75      1305.61             4.27
     8         1000         3.75       256.86             3.93
    16         1000         3.75       266.24             3.93
    32         1000         4.44      1021.62             4.61
    64         1000         4.44       329.22             4.61
   128         1000         4.61       303.79             4.78
   256         1000         4.95       529.41             5.12
   512         1000         5.46       300.89             5.63
  1024         1000         6.83       309.25             7.00
  2048         1000         9.22       327.17             9.39
  4096         1000        11.61       280.92            11.78
  8192         1000        16.90       306.86            17.07
 16384         1000        27.65       329.56            27.82
...

This hardware is 9 years old, so the numbers are sub-optimal. Still, they are much better than TCP even without any special tuning.
What's next?
We got a low-latency 20Gb/s connection between two Windows 7 machines using a pair of cheap Infiniband adapters.
In theory this setup can be used for ultra-fast file sharing etc. My primary interest was getting my hands on Infiniband and the OFED stack (and ultimately using it from Java). Unfortunately the Sockets Direct Protocol (SDP), available in Java since version 7, is a) deprecated in the latest version of OFED and b) seems to be unsupported by Java on Windows anyway. There are various libraries that provide RDMA to Java using JNI wrappers.
Monday, December 5, 2011
Monday, October 11, 2010
ShouldNotReachHere
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (classFileParser.cpp:3161), pid=3136, tid=4676
# Error: ShouldNotReachHere()
#
Sunday, October 3, 2010
My first Android app
I created my first Android app. It converts a date in the Gregorian calendar into the Maya "Long Count" calendar. Contrary to popular belief, the Mayan calendar doesn't end on December 21st, 2012. You can test this in my app :-).
Sources can be found here.
Sunday, June 6, 2010
Alternative to Thread.sleep()
private static final long SLEEP_PRECISION = TimeUnit.MILLISECONDS.toNanos(2); //TODO: Determine for current machine

public static void sleepNanos (long nanoDuration) throws InterruptedException {
    final long end = System.nanoTime() + nanoDuration;
    long timeLeft = nanoDuration;
    do {
        if (timeLeft > SLEEP_PRECISION)
            Thread.sleep (1);
        else
            Thread.sleep (0); // Thread.yield();
        timeLeft = end - System.nanoTime();
        if (Thread.interrupted())
            throw new InterruptedException ();
    } while (timeLeft > 0);
}
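The TODO on SLEEP_PRECISION could be addressed by timing Thread.sleep(1) on the target machine. The following is a minimal sketch of that idea, not part of the original code: the class and method names are hypothetical, and taking the worst observed sample is just one conservative choice (a high percentile would be less sensitive to outliers).

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: estimates how long Thread.sleep(1) really takes here.
class SleepPrecisionProbe {

    /** Returns the longest observed duration of Thread.sleep(1), in nanoseconds. */
    static long measureSleepPrecision(int samples) throws InterruptedException {
        long worst = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            Thread.sleep(1);
            long elapsed = System.nanoTime() - start;
            worst = Math.max(worst, elapsed);
        }
        return worst;
    }

    public static void main(String[] args) throws InterruptedException {
        long nanos = measureSleepPrecision(200);
        System.out.println("Thread.sleep(1) took up to "
                + TimeUnit.NANOSECONDS.toMicros(nanos) + " microseconds");
    }
}
```

The measured value could then replace the hard-coded 2 ms, possibly with some safety margin added.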
Test
I ran 4 threads requesting a 5 millisecond sleep 1000 times each (on my dual-core CPU). The first chart shows sleepNanos(TimeUnit.MILLISECONDS.toNanos(5)):
The second chart shows Thread.sleep(5):
As you can see, sleepNanos() is much more precise. I found that this approach was originally used by Ryan Geiss for a Winamp visualization plugin.
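A comparison like this can be reproduced with a small harness. The sketch below is a simplified, single-threaded stand-in for the test described above (the original ran 4 threads and plotted the results); the class name, the Sleeper interface and averageOvershoot() are assumptions made for this example. It reports how much longer than requested each approach sleeps on average.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical harness comparing Thread.sleep(5) with the hybrid sleepNanos().
class SleepOvershootSketch {
    private static final long SLEEP_PRECISION = TimeUnit.MILLISECONDS.toNanos(2);

    // Same approach as sleepNanos() in this post; Thread.sleep() itself
    // throws InterruptedException if the thread is interrupted.
    static void sleepNanos(long nanoDuration) throws InterruptedException {
        final long end = System.nanoTime() + nanoDuration;
        long timeLeft = nanoDuration;
        do {
            if (timeLeft > SLEEP_PRECISION)
                Thread.sleep(1);
            else
                Thread.sleep(0);
            timeLeft = end - System.nanoTime();
        } while (timeLeft > 0);
    }

    interface Sleeper {
        void sleep() throws InterruptedException;
    }

    /** Average overshoot in nanoseconds: actual duration minus requested duration. */
    static long averageOvershoot(Sleeper sleeper, long requestedNanos, int iterations)
            throws InterruptedException {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            sleeper.sleep();
            total += (System.nanoTime() - start) - requestedNanos;
        }
        return total / iterations;
    }

    public static void main(String[] args) throws InterruptedException {
        final long fiveMs = TimeUnit.MILLISECONDS.toNanos(5);
        long plain = averageOvershoot(() -> Thread.sleep(5), fiveMs, 200);
        long hybrid = averageOvershoot(() -> sleepNanos(fiveMs), fiveMs, 200);
        System.out.println("Thread.sleep(5) average overshoot: " + plain + " ns");
        System.out.println("sleepNanos(5ms) average overshoot: " + hybrid + " ns");
    }
}
```

The actual numbers depend heavily on the OS timer resolution and on system load, so they are not shown here.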
UPDATE
Even better precision can be achieved if you are willing to consume more CPU power doing Spin-Wait for the last part of the wait:
public static void sleepNanos (long nanoDuration) throws InterruptedException {
    final long end = System.nanoTime() + nanoDuration;
    long timeLeft = nanoDuration;
    do {
        if (timeLeft > SLEEP_PRECISION)
            Thread.sleep (1);
        else if (timeLeft > SPIN_YIELD_PRECISION)
            Thread.sleep (0);
        // else: spin-wait. SPIN_YIELD_PRECISION is a second threshold in nanoseconds,
        // defined alongside SLEEP_PRECISION (its value is not shown here).
        timeLeft = end - System.nanoTime();
        if (Thread.interrupted())
            throw new InterruptedException ();
    } while (timeLeft > 0);
}
Thursday, May 13, 2010
Measuring nanoseconds in Java / Windows
I was under the false impression that interval timers in Java allow sub-millisecond precision on Windows. I knew that Thread.sleep(millis, nanos) internally uses milliseconds, but for some reason I thought that methods like LockSupport.parkNanos() provide precise waits. Well, I was wrong. The smallest delay this method can realize is approximately 1.95 milliseconds on my Windows PC.
Back in 2006 David Holmes explained that Java timer intervals are based on the WaitForMultipleObjects() Windows API (which takes its timeout as dwMilliseconds). They still are.
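The minimum delay mentioned above can be checked with a few lines of code. This is only a sketch with made-up class and method names; note that parkNanos() is allowed to return spuriously, so the shortest sample can occasionally understate the real timer granularity.

```java
import java.util.concurrent.locks.LockSupport;

// Hypothetical probe: how long does a very short parkNanos() really take?
class ParkNanosProbe {

    /** Shortest observed duration of parkNanos(requestedNanos), in nanoseconds. */
    static long shortestPark(long requestedNanos, int samples) {
        long shortest = Long.MAX_VALUE;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            LockSupport.parkNanos(requestedNanos);
            shortest = Math.min(shortest, System.nanoTime() - start);
        }
        return shortest;
    }

    public static void main(String[] args) {
        long nanos = shortestPark(1_000, 100); // request 1 microsecond
        System.out.printf("parkNanos(1000) took at least %.3f ms%n", nanos / 1e6);
    }
}
```

The result depends on the OS and JVM version, so a run on a different machine may look quite different from the figure quoted above.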
One simple (but not universal) workaround is a "spin-sleep":
private static void sleepNanos (long nanoDelay) {
    final long end = System.nanoTime() + nanoDelay;
    do {
        Thread.yield(); // Thread.sleep (0);
    } while (System.nanoTime() < end);
}
When running standalone this method consumes all free resources of a single CPU core, but it will share the core with other threads that may be running. For large durations it can be enhanced to use Thread.sleep() for the bulk of the waiting.
P.S. Difference between Thread.yield() and Thread.sleep( 0 ) is explained here.
Friday, December 11, 2009
Old trap
class Foo {
    private long id;
    private String name;

    Foo(long id, String name) {
        this.id = id;
        this.name = name;
    }

    public String toString () {
        return id + ' ' + name;
    }
}
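Printing new Foo(123, "Hello") produces something like "155Hello" instead of "123 Hello". Of course the reason is pretty simple: ' ' is a char, so id + ' ' is evaluated first as a numeric addition (the space has code 32), and only then is the result concatenated with name. In effect the compiler produces:
public String toString2() {
    return new StringBuilder().append(id + 32L).append(name.toString()).toString();
}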
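A minimal sketch of the fix, using a hypothetical FooFixed class: with a String literal " " instead of the char ' ', the + operator means concatenation from the first operand on. The trap() method keeps the original expression for comparison.

```java
// Hypothetical variant of Foo that avoids the char-arithmetic trap.
class FooFixed {
    private final long id;
    private final String name;

    FooFixed(long id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public String toString() {
        // " " is a String, so id is converted to text instead of being added to 32.
        return id + " " + name;
    }

    /** The original expression: long + char is numeric addition. */
    static String trap(long id, String name) {
        return id + ' ' + name;
    }

    public static void main(String[] args) {
        System.out.println(new FooFixed(123, "Hello")); // 123 Hello
        System.out.println(trap(123, "Hello"));         // 155Hello
    }
}
```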
Wednesday, November 11, 2009
Stack allocation in Java is still a myth
There were rumors that Mustang would have on-stack allocation as part of HotSpot optimization.
Four+ years later I tried this:
public long encode (String input) {
    final byte [] buffer = new byte [8];
    // ... encode input into buffer
    // ... convert buffer into long result
    return result;
}
I was hoping the JVM would allocate the buffer on the stack, given that it does not escape from this method. Running this test 10M times with -verbosegc shows extensive GC work (1.6.0_16-b01 64 bit server JVM with -XX:+DoEscapeAnalysis option).
On the positive side, GC is very fast. Consider three functions:
- encodeNewBuffer () uses new byte array to encode input string.
- encodeSynchronizedField () uses private field, guarded by synchronized{} block
- encodeThreadLocalField () uses ThreadLocal cache to encode input string
Here is the code:
long encodeNewBuffer (String input) {
    final byte [] buffer = new byte [8];
    return f (buffer);
}

/////////////

private final byte [] buffer = new byte [8];

synchronized long encodeSynchronizedField (String input) {
    return f (buffer);
}

/////////////

ThreadLocal<byte[]> threadLocal = new ThreadLocal<byte[]>();
{
    threadLocal.set(new byte [8]);
}

long encodeThreadLocalField (String input) {
    byte [] buffer = threadLocal.get();
    return f (buffer);
}
GC-based method is a winner:
encodeNewBuffer(): 4,108 sec.
encodeSynchronizedField(): 5,322 sec.
encodeThreadLocalField() : 5,411 sec
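A side note on encodeThreadLocalField(): the instance initializer calls threadLocal.set() only on the thread that constructs the object, so get() on any other thread would return null. That does not matter if the benchmark runs on the constructing thread (the post does not say), but for general use the buffer is usually created lazily per thread. Below is a minimal sketch of that variant; the class and method names are hypothetical.

```java
// Hypothetical holder showing a per-thread buffer that is created on demand.
class ThreadLocalBufferSketch {
    // initialValue() runs once per thread, on that thread's first get().
    private final ThreadLocal<byte[]> threadLocal = new ThreadLocal<byte[]>() {
        @Override
        protected byte[] initialValue() {
            return new byte[8];
        }
    };

    byte[] buffer() {
        return threadLocal.get();
    }

    /** Returns the buffer as seen from a freshly started thread. */
    static byte[] bufferSeenByNewThread(final ThreadLocalBufferSketch sketch)
            throws InterruptedException {
        final byte[][] seen = new byte[1][];
        Thread worker = new Thread(() -> seen[0] = sketch.buffer());
        worker.start();
        worker.join(); // join() makes the worker's write visible here
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadLocalBufferSketch sketch = new ThreadLocalBufferSketch();
        byte[] mine = sketch.buffer();
        byte[] theirs = bufferSeenByNewThread(sketch);
        System.out.println("other thread got its own buffer: " + (theirs != null && theirs != mine));
    }
}
```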