Saturday, March 25, 2017

AMDGPU-Pro 16.60 on Ubuntu kernel 4.10.5 with ROCM-smi


Although AMDGPU-Pro 16.40 with kernel 4.8 has been working fine for me, I decided to try 16.60 with kernel 4.10.  After my problems with 16.60 on 4.8, I read a few reports claiming it works well with kernel 4.10.

I started with a fresh Ubuntu desktop 16.04.2 install, and then installed 4.10.5 from the Ubuntu ppa.  Although the process is not very complicated, I wrote a small script which downloads the files and installs them.  After rebooting, I downloaded and installed the AMDGPU-Pro 16.60 drivers according to the instructions.  Finally, I installed ROC-smi, a utility which simplifies clock control using the sysfs interface.  To test the install, run "rocm-smi -a" which will show all info for any amdgpu cards installed.

Unfortunately, the new drivers no longer work with my ethminer fork, but sgminer-gm 5.5.5 works as well as it did with 4.8/16.40.  On GCN3 and newer cards like Tonga and Polaris, the optimal core clock for mining ETH is often between 55% and 56% of the memory clock.  On my Sapphire Rx470 I have the memory overclocked to 2100Mhz, so dpm 6 at 1169Mhz is a perfect fit:
./rocm-smi -d 0 --setsclk 6
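
As a rough check of the 55% to 56% rule of thumb, the target core clock window can be computed from the memory clock.  This is only a sketch: the function name is made up for illustration, and the bounds are simply the percentages quoted above.

```python
def core_clock_window(mem_mhz, low_pct=55, high_pct=56):
    """Return the (low, high) core clock range in MHz for a given memory clock."""
    # Integer percentages keep the arithmetic exact for whole-MHz clocks.
    return mem_mhz * low_pct / 100, mem_mhz * high_pct / 100

low, high = core_clock_window(2100)
print("target core clock: %.0f-%.0f MHz" % (low, high))
print("1169 MHz in range:", low <= 1169 <= high)
```

For a 2100Mhz memory clock the window is 1155 to 1176Mhz, which is why the 1169Mhz dpm state fits.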

Once sgminer had been running for a couple of minutes, the speed settled at about 29.1Mh/s.  Note that the clock setting is temporary: it only applies to the next OpenCL program that runs, so run the rocm-smi command again each time before starting the miner.
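
Since the clock has to be set again before every run, a small wrapper can do both steps.  The sketch below only uses the rocm-smi options shown above; the function names and the miner command passed in are placeholders, not part of any released tool.

```python
import subprocess

def setsclk_command(device=0, level=6, rocm_smi="./rocm-smi"):
    """Build the rocm-smi command line that selects a DPM core clock level."""
    return [rocm_smi, "-d", str(device), "--setsclk", str(level)]

def run_with_clock(miner_cmd, device=0, level=6):
    """Set the core clock level, then start the OpenCL program."""
    # check_call raises if rocm-smi fails, so the miner is not started
    # with the wrong clock.
    subprocess.check_call(setsclk_command(device, level))
    return subprocess.call(miner_cmd)

# Show the command that would be run for card 0, dpm level 6.
print(" ".join(setsclk_command()))
```

Calling run_with_clock() with your own miner command line as a list of arguments sets the clock and then launches the miner.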

Update 2017-04-08

4.10.9 was uploaded to the Ubuntu ppa today, so I would recommend it instead of 4.10.5.

Tuesday, March 14, 2017

Riser Recycling


If you build multi-GPU servers, you'll likely encounter flaky or bad risers.  I've had a bad riser where I could see a burned trace on the PCB, and I've had flaky risers that appeared to be caused by poor soldering of the ribbon cable.  While the problem risers may not work with a GPU, chances are the power connectors are still good.  The riser shown above has a 6-pin PCI-e and a 4-pin molex connector, both of which I tested for continuity with a multi-meter.  With some fresh flux I was able to desolder the ribbon cable, so I could re-use the riser as a PCI-e to molex power adapter.  If you are wondering what I would use it for, look at the photo below.

Heat has caused the yellow 12V line to turn brown.  The cable was plugged into the motherboard's supplemental PCI-e power which is used when more than two GPUs are plugged in.  Each GPU will usually draw between 50 and 75 watts over the PCI-e bus, which is pushing the 18AWG (or even 20AWG on some power supplies) cable well beyond its recommended rating.  By plugging the next molex connector in the chain into the riser, and by providing power to the 6-pin connector on the same riser, current will flow into the motherboard molex connector from both directions.

With the current through the brown wire cut in half, the power dissipated (and therefore the heat generated) is reduced by 75%, since P = I^2 * R.
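
The 75% figure follows directly from the formula, since the wire resistance R stays the same and only the current changes.  A minimal sketch of the arithmetic (the function name is just for illustration):

```python
def relative_power(current_fraction):
    """Power dissipated in a fixed resistance, relative to the original (P = I^2 * R)."""
    return current_fraction ** 2

remaining = relative_power(0.5)   # current cut in half
print("power remaining: %.0f%%" % (remaining * 100))
print("reduction: %.0f%%" % ((1 - remaining) * 100))
```

Half the current leaves one quarter of the original power in the wire, which is the 75% reduction.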

Supplemental mod

Bitcointalk user BChydro questioned the current-carrying ability of the riser PCB, which turns out to be rather poor for the 12V trace.  The solder mask over the 12V trace was starting to turn brown after only a couple days of use, and a thermal image shows the trace getting hot.

To solve the problem I added an 18AWG jumper wire between the 12V pins:

Sunday, March 5, 2017

AMDGPU-Pro on Ubuntu

It's been almost a year since the first AMDGPU-Pro driver release.  There are now two main release versions: 16.40 and 16.60.  Although both versions supposedly support Ubuntu 16.04, version 16.40 with Ubuntu Desktop 16.04.2 is the only combination that works without a kernel update.

Ubuntu 16.04.2 is the first 16.04 release to use kernel version 4.8 instead of version 4.4.  Using 16.40 with kernel version 4.4 would sometimes lead to problems such as kernel message log floods or powerplay problems.  The typical powerplay problem was that the card would not switch to the full system and memory clock when running OpenCL programs.

Before a fresh Ubuntu install, I suggest disabling UEFI Secure Boot, since the AMDGPU-Pro drivers are not signed and therefore do not work with Secure Boot enabled.  If Secure Boot is already set up on your system, the driver install script will prompt you to disable it.  Unlike the fglrx drivers, I have found the AMDGPU-Pro drivers will work along with the Intel i915 drivers.  In a multi-GPU system, I like to leave a monitor connected to the on-board video for a system console.  GPUs can easily be swapped in and out without having to move the monitor connection.

Before installing the driver, make sure your card is detected by running "lspci | grep VGA".  The installation instructions are straightforward, but don't forget to update the video group as mentioned at the end of the instructions.  Otherwise OpenCL programs will not detect the GPU.  Note that there is a bug in clinfo (/opt/amdgpu-pro/bin/clinfo) that causes it to display 14 for "Max compute units" instead of the actual number of GPU compute units.  This bug is fixed in 16.60, which requires kernel 4.10 to work properly.

To test your GPU and the driver, you could try my ethminer fork.  Although I built and tested it on Ubuntu 14.04/fglrx, it works perfectly on Ubuntu 16.04.2 with AMDGPU-Pro 16.40.  Once you've started ethminer (or any other OpenCL program), you can check the core and memory clocks with the following commands:
 cat /sys/class/drm/card0/device/pp_dpm_sclk
 cat /sys/class/drm/card0/device/pp_dpm_mclk
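
Each of these files lists one DPM state per line, with an asterisk marking the state currently in use.  The sketch below parses that text; the sample string is illustrative rather than captured from a real card, and the function name is my own.

```python
import re

def parse_dpm(text):
    """Parse pp_dpm_sclk/pp_dpm_mclk text into ({level: mhz}, active_level)."""
    levels, active = {}, None
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):\s*(\d+)\s*Mhz\s*(\*)?", line, re.IGNORECASE)
        if not m:
            continue  # skip anything that is not a "N: XXXMhz" line
        level = int(m.group(1))
        levels[level] = int(m.group(2))
        if m.group(3):
            active = level  # the '*' marks the state in use
    return levels, active

# Illustrative sample, not captured from a real card.
sample = "0: 300Mhz\n1: 608Mhz\n2: 1169Mhz *\n"
levels, active = parse_dpm(sample)
print("active level %d at %d MHz" % (active, levels[active]))
```

To use it on a live system, pass in the contents of /sys/class/drm/card0/device/pp_dpm_sclk instead of the sample string.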

The driver does not come with a tool like aticonfig for custom clock control.  The driver does expose ways of controlling the clocks and voltage, and some developers have written custom programs using information from the kernel headers.  Although nobody seems to have released a utility, the sgminer-gm sysfs code could likely be used as a template to create a stand-alone utility.

Monday, February 20, 2017

Inside AMD GCN code execution


AMD's Graphics Core Next architecture was introduced over five years ago.  Although there have been many documents written to help developers understand the architecture, and thereby write better code, I have yet to find one that is clear and concise.  AMD's best GCN documentation is often cluttered with unnecessary details on the old VLIW architecture, when the GCN architecture is already complicated enough on its own.  I intend to summarize my research on GCN, and what that means for OpenCL and GCN assembler kernel developers.

As shown in the top diagram (GCN Compute Unit), the GPU consists of groups of four compute units.  Each CU has four SIMD units, each of which can perform 16 simultaneous 32-bit operations.  Each of these 16 SIMD "lanes" is also called a shading unit, so the R9 380 with 28 CUs has 28 * 4 * 16 = 1792 shading units.
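
The shading unit count is just the product of compute units, SIMDs per CU, and lanes per SIMD.  A quick sketch, with constant and function names chosen for illustration:

```python
SIMDS_PER_CU = 4      # SIMD units in each compute unit
LANES_PER_SIMD = 16   # 32-bit lanes (shading units) in each SIMD

def shading_units(compute_units):
    """Total shading units for a GCN GPU with the given number of CUs."""
    return compute_units * SIMDS_PER_CU * LANES_PER_SIMD

print(shading_units(28))  # R9 380 with 28 CUs
```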

AMD's documentation makes frequent reference to "wavefronts".  A wavefront is a group of 64 operations that executes on a single SIMD.  The SIMD operations take a minimum of four clock cycles to complete; however, SIMD pipelines allow a new operation to be started every clock.  "The compute unit selects a single SIMD to decode and issue each cycle, using round-robin arbitration." (AMD GCN whitepaper pg 5, para 3).  So four cycles after SIMD0 has been issued an instruction, the CU is ready to issue it another.

In OpenCL, when the local work size is 64, the 64 work-items will be executed on a single SIMD.  Since a maximum of four SIMD units can access the same local memory (LDS), AMD GCN devices support a maximum local work size of 256.  When the local work size is 64, the OpenCL compiler can leave out barrier instructions, so performance will often (but not always) be better than using a local work size of 128, 192, or 256.
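
To make the relationship between local work size and wavefronts concrete, here is a small sketch.  The constants come from the paragraph above; the function name is hypothetical.

```python
WAVEFRONT_SIZE = 64    # work-items per wavefront
MAX_LOCAL_SIZE = 256   # four SIMDs sharing one LDS

def wavefronts_per_group(local_size):
    """Number of wavefronts needed for one work-group of the given size."""
    if not 1 <= local_size <= MAX_LOCAL_SIZE:
        raise ValueError("local work size must be between 1 and 256")
    return -(-local_size // WAVEFRONT_SIZE)  # ceiling division

for size in (64, 128, 192, 256):
    waves = wavefronts_per_group(size)
    note = "barriers can be dropped" if waves == 1 else "barriers must be kept"
    print("%3d work-items -> %d wavefront(s), %s" % (size, waves, note))
```

Only the single-wavefront case lets the compiler drop barriers, because all 64 work-items then execute together on one SIMD.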

The SIMD units only perform vector operations such as multiply, add, xor, etc.  Branching for loops or function calls is performed by the scalar unit, which is shared by all four SIMD units.  This means that when a kernel executes a branch instruction, it is executed by the scalar unit, leaving a SIMD unit available to perform a vector operation.  The two operations (scalar and vector) must come from different waves, so to ensure the SIMD units are fully utilized, the kernel must allow for 2 simultaneous wavefronts to execute.  For information on how resource usage such as registers and LDS impacts the number of simultaneous wavefronts that can execute, I suggest reading AMD's OpenCL Optimization Guide.  Note that some sources state that full SIMD occupancy requires four waves, when it is technically possible with just one wave using only vector instructions.  Most kernels will require some scalar instructions, so two waves is the practical minimum.

Monday, January 9, 2017

Hot Video Cards


When I read discussions about video card temperatures, the vast majority are about the GPU core temperature.  With older GPUs like the R9 290, temperature-based throttling when the GPU core temperature hits 94C can be a problem.  With newer GPUs like the R9 380 and especially with the Rx series cards, there are rarely issues with GPU core temperatures, even with low-end cooling systems.  While the GPU core is always cooled with a heatsink and fans, often the RAM is not.  The infrared image above shows how much of a difference that can make in RAM temperatures.

The image was taken of a 4GB MSI R9 380 card with the memory clocked at 1600Mhz while running ethminer-nr.  The memory chips above the GPU are connected to the heatsink through a thermal pad, but the chips to the left of the GPU are not.  Using an infrared thermometer I measured temperatures between 95 and 100C on the back side of the PCB from the RAM, so the RAM die temperatures are likely well in excess of 100C.

Keeping RAM cool can make a material difference in the clock speeds that can be achieved.  Instead of 1600Mhz, I have found that 1500Mhz-rated GDDR5 can reach stable speeds of 1700Mhz when connected to a basic heat spreader.  The brand of the memory, Elpida, Hynix, or Samsung, makes little difference in performance when compared to cooling.

While manufacturers will rarely provide enough detail in their specifications or product images to determine if the RAM is cooled, card tear-down reviews will often show the connection between the heatsink and RAM.  Of the cards I have used, only an MSI R9 380 Gaming card had all the RAM cooled.  Neither MSI Armor2X cards nor Gigabyte Windforce cards have all the RAM chips cooled with a heatsink or heat spreader.  I also own an Asus Rx 470 Strix card, and that also lacks active cooling for some of the memory chips.

Saturday, October 29, 2016

zcash mining


Zcash is the hottest coin this month, after going live on October 28th, following several months of testing.  Zcash promises private transactions, so that they cannot be viewed on the public blockchain like bitcoin or ethereum.

I did not expect zcash mining to be immediately profitable, since mining rewards are being ramped up over the first month.  However the first hour of trading on Poloniex saw zcash (ZEC) trading at insane values of over 1000 bitcoin per ZEC.  Even after 24 hours, 1 ZEC is trading for about 6 BTC, or US$4300.  Despite the low mining reward rate, mining pool problems, and buggy mining software, I was able to earn 0.005 ZEC in one day with a couple rigs.

Zcash has both private addresses starting with "z", and public or transparent addresses starting with "t".  A bug in the zcash network software has meant problems with private transfers, so it is recommended for miners to use only transparent wallet addresses until the bug is fixed.  Miners using the "z" address have apparently had problems receiving their zcash payouts from mining pools.

I have been using eXtremal's miner version 0.2.2, which uses OpenCL kernels from the zcash open-source miner competition.  Windows and Linux binaries can be downloaded from coinsforall.io, the pool the software is designed for.  I get the best performance with the silentarmy kernel, but only with a single instance, since running 2 instances results in a crash.  On Windows running driver version 16.10.1 I get about 26 solutions/s with an Rx 470.  Under Ubuntu with fglrx drivers I get about 11 solutions/s for both R7 370 and R9 380 cards.

I experimented with the worksize and threads values in config.txt, but was unable to improve performance compared to the default 256/8192.  Increasing the core clock on the R9 380 cards from 900Mhz to 1Ghz increased the performance by 3-4%.

Genoil has released a miner, but only Windows binaries with tromp's kernel at this time.  A version including silentarmy's kernel is in the works.

I was unable to find any zcash mining calculators, so I wrote a short python calculator.  Here's an example based on the network hashrate (in thousands) at block 1072, for a rig mining 140 solutions/s:
./zec.py 1072 1840 140
Daily ramped mining reward in blocks: 308
Your estimated earnings: 0.0234347826087
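
The last step of that estimate is a simple proportion: your share of the network solution rate multiplied by the daily reward.  The sketch below is not the zec.py script itself.  It takes the daily reward figure (308 in the output above) as an input, because the way the script derives that figure from the block height is not shown here.  The function and parameter names are my own.

```python
def estimated_daily_earnings(network_ksol, my_sol, daily_reward):
    """Share of the daily reward for a rig, assuming earnings scale with hashrate share.

    network_ksol -- network rate in thousands of solutions/s
    my_sol       -- rig rate in solutions/s
    daily_reward -- daily ramped reward figure, taken as given
    """
    network_sol = network_ksol * 1000.0
    return my_sol / network_sol * daily_reward

print(estimated_daily_earnings(1840, 140, 308))
```

With the inputs from the example run this gives the same estimate of roughly 0.0234.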

At the current price of 6BTC/ZEC, the earnings work out to about US$100.  Even if the price drops to 3BTC/ZEC, the daily earnings are still more than double what the same hardware could make mining ethereum.  Apparently many other ethereum miners have realized this, since the ethereum network hashrate has dropped by about 25% in less than 30 hours.  I expect this trend to continue in the coming days, and eventually reach an equilibrium as the ZEC price continues to drop until it is below parity with BTC.

2016-10-30 update

Coinsforall is still having stability problems, and now 1 ZEC is worth about 1.2 BTC.  Therefore I've switched back to eth mining for all my cards except one Rx 470.  With Genoil's ZECminer I'm getting about 26 sol/s.  I started using zcash.miningpoolhub.com, and after an hour of mining the pool has been stable.  Reported hashrate on the pool is about 12H/s, or half the solution rate as expected.

Sunday, September 18, 2016

Advanced Tonga BIOS editing


I recently decided to spend some time to figure out some of the low-level details of how the BIOS works on my R9 380 cards.  A few months ago I had found Tonga Bios Editor, but hadn't done anything more than modify the memory frequency table so the card would default to 1500Mhz instead of 1375.  My goal was to modify the memory timing and to reduce power usage.

The card I decided to test the memory timing mods on was a Club3D 4GB R9 380 with Elpida W4032BABG-60-F RAM.  Although the RAM is rated for 6Gbps/1.5Ghz, the default memory clock is 1475Mhz.  In my previous testing I found that the card was stable with the memory overclocked well above 1.5Ghz, but the mining performance was actually slower at 1.6Ghz compared to 1.5Ghz.  Unfortunately Tonga Bios Reader does not provide a way to edit the memory timings aka straps, so I'd have to use a hex editor.


I've highlighted the 1500Mhz memory timing in the screen shot above.  I found it by searching for the string F0 49 02, which you first have to convert from little-endian to get 249F0, and then from hex to get 150,000, which is expressed in increments of .01Mhz.  The timing for up to 1625Mhz (C4 7A 02) comes after it, and then 1750Mhz (98 AB 02).  The Club3D BIOS actually has 2 sets of timings, one for memory type 01 (the number after F0 49 02), and one for memory type 02 (not shown).  This is so the same BIOS can be used on a card that can be made with different memory.  Obviously one type of memory the BIOS supports is Elpida, and from comparing BIOS images from other cards, I determined that memory type 02 is for Hynix.
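
The byte-order and unit conversion can be checked with a few lines of Python.  This is only a sketch of the arithmetic described above; the function names are hypothetical and it does not read or modify a BIOS image.

```python
def strap_clock_mhz(hex_bytes):
    """Decode a 3-byte little-endian clock value stored in units of 0.01 MHz."""
    raw = int.from_bytes(bytes.fromhex(hex_bytes), "little")
    return raw / 100

def strap_search_bytes(mhz):
    """Inverse: the byte string to search for in a hex editor."""
    raw = int(round(mhz * 100))
    return " ".join("%02X" % b for b in raw.to_bytes(3, "little"))

for s in ("F0 49 02", "C4 7A 02", "98 AB 02"):
    print(s, "->", strap_clock_mhz(s), "MHz")
print(strap_search_bytes(1500))
```

The three strings decode to 1500, 1625, and 1750Mhz, and the inverse function gives the bytes to search for when looking for a different clock entry.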

To reduce the chance of bricking my card, the first time I modified only the 1625Mhz memory timing.  Since the default memory clock is 1475Mhz, my modified timing would only be used when overclocking the memory over 1500Mhz.  So if the card crashed on the 1625Mhz timing, it would be back to the safe 1500Mhz timing after a reboot.  To actually make the change I copied the 1500Mhz timing (starting with 77 71) to the 1625Mhz timing.  After the change, the BIOS checksum is invalid, so I simply loaded the BIOS in Tonga Bios Reader and re-saved it in order to update the checksum.

I used Atiflash 2.71 to flash the BIOS since I have found no DOS or Linux flash utilities for Tonga GPUs.  After flashing the updated BIOS, I overclocked the RAM to 1625Mhz, and my eth mining speed went from just under 21Mh/s to about 22.5Mh/s.  To get even faster timings, I copied the 1375Mhz timings from an MSI R9 380 with Elpida RAM to the Club3D 1625Mhz memory timing.  That boosted my mining speed at 1625Mhz to slightly over 23Mh/s.

I then tried a number of ways to improve the timing beyond 1625Mhz, but I found nothing that was both stable and faster at 1700Mhz.  Different cards may overclock better, depending on both the GPU asic and the memory.  Hynix memory seems to overclock a bit better than Elpida, while Samsung memory, which seems rather rare on R9 380 cards, tends to overclock the best.  The memory controller on the GPU also needs to be able to overclock from 1475Mhz.  Unlike the simple voltage modding of the Hawaii BIOS, there is no easy way to modify the memory controller voltage (VDDCI) on Tonga.  The ability to over-volt the memory controller would make it easier to overclock the memory speed beyond 1625Mhz.

Since the Club3D BIOS supports both Elpida and Hynix memory, I improved the timing for both memory types.  This allows me to use a single BIOS image for cards that have either Elpida or Hynix memory.  It's also dependent on the card having an NCP81022 voltage controller, but all my R9 380 cards have the same voltage controller.  I've shared it on my google drive as 380NR.ROM if you want to try it (at the possible risk of bricking your card).  Atiflash checks the subsystem ID of the target card against the BIOS to be flashed, so it is necessary to use the command-line version of atiflash with the "-fs" option:
atiflash -p 0 380NR.ROM -fs

In addition to improving memory speeds, I wanted to reduce power usage of my 380 cards.  On Windows it is possible to use a tool like MSI Afterburner to reduce the core voltage (VDDC), but on Linux there is no similar tool.  To reduce the voltage in the BIOS, modify value0 in Voltage Table2 for the different DPM states.  After a lot of experimenting, I made two different BIOSes with different voltage levels since some cards under-volt better than others.  The first one has 975, 1050, and 1100 mV for dpm 5, 6, & 7, while the other has 1025, 1100, & 1150 mV.  These are also shared on my google drive as 380NR1100.ROM and 380NR1150.ROM.

With the faster RAM timing and voltage modifications I've improved my eth mining hashrates by about 10%, without any material change in power use.  I've tried my custom ROM on four different cards.  Although two of them seem to be OK with 900/1650Mhz clocks, I'm playing it safe and running all four at 885/1625Mhz.  If you are lucky and have a card that is stable at 925/1700Mhz, you can mine eth at almost 25Mh/s.  With most cards you can expect to get between 23 and 24Mh/s.