Acceleware

Acceleware provides High Performance Computing software solutions to the Oil & Gas and Computer Engineering markets for multi-core CPUs and compute GPUs. Additionally, our HPC consulting services are utilized by enterprises needing to harness the power of parallel computing. At Acceleware the goal is always the same – Go Faster!

Acceleware develops and markets software acceleration products that bring unparalleled performance and speed to today's most challenging compute- and data-intensive workloads. With Acceleware solutions installed, data processing and simulation applications can benefit from speed-ups of 10x to 50x. Acceleware also applies its expertise to consulting services for companies and research institutions involved with High Performance Computing. These services include architecting multi-core processing solutions, code porting and training.

Our customers share a common and urgent need - the need for powerful and timely computer modeling and testing. They are pushing the boundaries of innovation and demand faster product-development cycles for more complex products. They want more effective tools to interpret vast amounts of data.

With Acceleware's products and services - you buy time.

Our target markets are served by a network of distribution partners and include: computer aided engineering, oil and gas, image reconstruction, pharmaceuticals, defense and other HPC related industries.



Contact

Address: 1600-37 Street SW
Zip, City: T3C 3P1 Calgary
Country: Canada
Name: Kirsten Spencer
Tel. No.: (403) 249-9099 Ext. 356
Email: kirsten.spencer@acceleware.com
URL: acceleware.com/


Latest Blog Posts

  • 26.03.2018: Grid Synchronization with Cooperative Groups

    Grid Synchronization with Cooperative Groups

    Cooperative Groups was introduced in CUDA 9 to provide a flexible model for synchronization and communication between groups of threads executing CUDA kernels.  Mark Harris and Kyrylo Perelygin have an excellent introduction to Cooperative Groups on the NVIDIA Developer Blog.

    CUDA has always supported barrier synchronization of all the threads in a thread block through the __syncthreads() intrinsic. Cooperative Groups adds the ability to synchronize across thread blocks, by synchronizing all the threads in a grid.  Grid synchronization is supported on Compute Capability 6.0 (Pascal) and higher GPUs.  If you are running in Windows, your GPU must additionally be running in Tesla Compute Cluster (TCC) driver mode.

    To demonstrate the utility of grid synchronization, I use it to normalize an input array of integers to the range [-1, 1] by scaling each element of the input by the maximum absolute value of the input array.  The following source code listing contains three different normalization implementations which we compare below.

     

    #include <algorithm>
    #include <cooperative_groups.h>
    #include <cuda_runtime.h>
    #include <cuda_profiler_api.h>
    #include <limits>
    #include <stdio.h>
    #include <time.h>
    
    #define AxCheckError(err) CheckError(err, __FUNCTION__, __LINE__)
    
    inline void CheckError(cudaError_t const err, char const* const fun, const int line)
    {
        if (err)
        {
            printf("CUDA Error Code[%d]: %s\n%s() Line:%d\n", err, cudaGetErrorString(err), fun, line);
            exit(1);
        }
    }
    
    using namespace cooperative_groups;
    
    void GenerateTestData(int const N, int* const input, float* const normalized,
        float* const ref);
    void CompareData(int const N, float const* const a, float const* const b);
    
    template<unsigned int tileSize>
    __device__ int ShflReduceMaxTiledPartition(thread_block_tile<tileSize> const& group, int value)
    {
        for (int s = group.size() / 2; s > 0; s /= 2)
            value = max(abs(value), group.shfl_down(value, s));
        return value;
    }
    
    __global__ void FindMaximum(int const* const input, int* const maximum, int const N)
    {
        thread_block          myBlock = this_thread_block();
        thread_block_tile<32> myWarp = tiled_partition<32>(myBlock);
    
        int idx = blockIdx.x * myBlock.size() + myBlock.thread_rank();
        idx = min(idx, N - 1);
        int value = abs(input[idx]);
    
        int maxValue = ShflReduceMaxTiledPartition(myWarp, value);
    
        if (myWarp.thread_rank() == 0)
            atomicMax(maximum, maxValue);
    }
    
    __global__ void Normalize(int const* const input, int const* const maximum,
                              float* const output, int const N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
        if (idx < N)
        {
            float const invMax = (*maximum == 0) ? 0 : 1.0f / *maximum;
            output[idx] = (*maximum == 0) ? 0 : input[idx] * invMax;
        }
    }
    
    __global__ void GridSynchronizationNormalize(int const* const input, int* const maximum,
                                                 float* const output, int const N)
    {
        grid_group            myGrid  = this_grid();
        thread_block          myBlock = this_thread_block();
        thread_block_tile<32> myWarp  = tiled_partition<32>(myBlock);
    
        int idx = myGrid.thread_rank();
        int myMax = abs(input[min(idx, N - 1)]);
    
        idx += myGrid.size();
    
        for (; idx < N; idx += myGrid.size())
        {
            int value = abs(input[idx]);
            myMax = max(myMax, value);
        }
    
        myMax = ShflReduceMaxTiledPartition(myWarp, myMax);
    
        if (myWarp.thread_rank() == 0)
            atomicMax(maximum, myMax);
    
        myGrid.sync();
    
        float const invGlobalMax = (*maximum == 0) ? 0 : 1.0f / *maximum;
    
        idx = myGrid.thread_rank();
    
        for (; idx < N; idx += myGrid.size())
        {
            output[idx] = input[idx] * invGlobalMax;
        }
    }
    
    
    __global__ void GridSynchronizationStatefulNormalize(int const* const input, int* const maximum,
                                                         float* const output, int const N)
    {
        extern __shared__ int cache[];
    
        grid_group myGrid = this_grid();
        thread_block myBlock = this_thread_block();
        thread_block_tile<32> myWarp = tiled_partition<32>(myBlock);
    
        int idx = myGrid.thread_rank();
        int cacheIdx = myBlock.thread_rank();
        cache[cacheIdx] = input[min(idx, N - 1)];
        int myMax = abs(cache[cacheIdx]);
    
        idx += myGrid.size();
        cacheIdx += myBlock.size();
    
        for (; idx < N; idx += myGrid.size(), cacheIdx += myBlock.size())
        {
            int value = input[idx];
            cache[cacheIdx] = value;
            myMax = max(myMax, abs(value));
        }
    
        myMax = ShflReduceMaxTiledPartition(myWarp, myMax);
    
        if (myWarp.thread_rank() == 0)
            atomicMax(maximum, myMax);
    
        myGrid.sync();
    
        float const invGlobalMax = (*maximum == 0) ? 0 : 1.0f / *maximum;
    
        idx = myGrid.thread_rank();
        cacheIdx = myBlock.thread_rank();
    
        for (; idx < N; idx += myGrid.size(), cacheIdx += myBlock.size())
        {
            int value = cache[cacheIdx];
            output[idx] = value * invGlobalMax;
        }
    }
    
    
    int main(int const argc, char const* argv[])
    {
        int *inputH, *inputD;
        float *normalizedH, *normalizedD, *refH;
        int   *maximumD;
    
        cudaDeviceProp prop;
        int device;
    
        AxCheckError(cudaGetDevice(&device));
        AxCheckError(cudaGetDeviceProperties(&prop, device));
    
        printf("Is Cooperative Launch supported: %d\n", prop.cooperativeLaunch);
    
        int    const N       = argc > 1 ? atoi(argv[1]) : 917504;
        size_t const N_BYTES = N * sizeof(float);
    
        printf("Normalizing %d elements\n", N);
    
        inputH      = (int*)malloc(N_BYTES);
        normalizedH = (float*)malloc(N_BYTES);
        refH        = (float*)malloc(N_BYTES);
    
        GenerateTestData(N, inputH, normalizedH, refH);
    
        AxCheckError(cudaMalloc((void**)&inputD, N_BYTES));
        AxCheckError(cudaMalloc((void**)&normalizedD, N_BYTES));
        AxCheckError(cudaMalloc((void**)&maximumD, sizeof(unsigned)));
    
        AxCheckError(cudaMemcpy(inputD, inputH, N_BYTES, cudaMemcpyHostToDevice));
    
        // The two pass approach
        {
            dim3 blockDim(512, 1, 1);
            dim3 gridDim;
            gridDim.x = ((N + blockDim.x - 1) / blockDim.x);
    
            AxCheckError(cudaMemset(maximumD, 0, sizeof(int)));
    
            FindMaximum<<<gridDim, blockDim>>>(inputD, maximumD, N);
            AxCheckError(cudaDeviceSynchronize());
            AxCheckError(cudaGetLastError());
            Normalize<<<gridDim, blockDim>>>(inputD, maximumD, normalizedD, N);
            AxCheckError(cudaDeviceSynchronize());
            AxCheckError(cudaGetLastError());
    
            AxCheckError(cudaMemcpy(normalizedH, normalizedD, N_BYTES, cudaMemcpyDeviceToHost));
            CompareData(N, normalizedH, refH);
        }
    
        // Grid Synchronization Normalization
        {
            dim3 blockDim(512, 1, 1);
            dim3 gridDim;
            int  numBlocks;
            size_t dynamicSmem = 0;
    
            AxCheckError(cudaMemset(maximumD, 0, sizeof(int)));
    
            AxCheckError(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, &GridSynchronizationNormalize, blockDim.x, dynamicSmem));
            gridDim.x = numBlocks * prop.multiProcessorCount;
    
            void* args[4] = { (void*)&inputD, (void*)&maximumD, (void*)&normalizedD, (void*)&N};
    
            cudaError_t x = cudaLaunchCooperativeKernel((void const*)&GridSynchronizationNormalize, gridDim, blockDim, args, dynamicSmem, 0);
            AxCheckError(x);
            AxCheckError(cudaDeviceSynchronize());
            AxCheckError(cudaGetLastError());
    
            AxCheckError(cudaMemcpy(normalizedH, normalizedD, N_BYTES, cudaMemcpyDeviceToHost));
            CompareData(N, normalizedH, refH);
        }
    
        // Stateful grid synchronization approach
        // To avoid reading the input array from global memory twice, we
        // cache the inputs in shared memory while finding the maximum, and then
        // normalize using the cached values.  This limits the size of the input
        // array we can handle, as the entire input array must fit within the aggregate
        // amount of shared memory
        {
            dim3 blockDim(512, 1, 1);
            dim3   gridDim;
            int    numBlocks;
            size_t dynamicSmem = 16384;
    
            AxCheckError(cudaMemset(maximumD, 0, sizeof(int)));
    
            AxCheckError(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, &GridSynchronizationStatefulNormalize, blockDim.x, dynamicSmem));
            gridDim.x = numBlocks * prop.multiProcessorCount;
    
            // Check that array size does not exceed aggregate shared memory capacity
            if (gridDim.x * dynamicSmem < N * sizeof(int))
            {
                printf("GridSynchronizationStatefulNormalization Cooperative Launch kernel cannot operate on arrays with more than %zu elements\n",
                        gridDim.x * dynamicSmem / sizeof(int));
            }
            else
            {
                void* args[4] = { (void*)&inputD, (void*)&maximumD, (void*)&normalizedD, (void*)&N};
    
                AxCheckError(cudaLaunchCooperativeKernel((void const*)&GridSynchronizationStatefulNormalize, gridDim, blockDim, args, dynamicSmem, 0));
                AxCheckError(cudaDeviceSynchronize());
                AxCheckError(cudaGetLastError());
    
                AxCheckError(cudaMemcpy(normalizedH, normalizedD, N_BYTES, cudaMemcpyDeviceToHost));
                CompareData(N, normalizedH, refH);
            }
        }
    
        AxCheckError(cudaFree(maximumD));
        AxCheckError(cudaFree(normalizedD));
        AxCheckError(cudaFree(inputD));
        AxCheckError(cudaDeviceReset());
    
        free(inputH); free(normalizedH); free(refH);
    
    
    
        return 0;
    }
    
    void GenerateTestData(int const N, int* const input, float* const normalized, float* const ref)
    {
        int i;
    
        srand((unsigned)time(NULL));
    
        for (i = 0; i < N; i++)
        {
            input[i] = rand() - RAND_MAX / 2;
            normalized[i] = 0.0f;
        }
    
        int maximum = abs(input[0]);
    
        for (i = 1; i < N; i++)
        {
            maximum = std::max(abs(input[i]), maximum);
        }
    
        float invMax = maximum == 0 ? 0 : 1.0f / maximum;
        for (i = 0; i < N; i++)
        {
            ref[i] = input[i] * invMax;
        }
    }
    
    int UlpDifference(float a, float b)
    {
        int iA, iB;
        iA = *((int*)(&a));
        iB = *((int*)(&b));
        return abs(iA - iB);
    }
    
    void CompareData(int const N, float const* const a, float const* const b)
    {
        int i;
        int different = 0;
    
        printf("Comparing data\n");
        for (i = 0; i < N; i++)
        {
            different = (UlpDifference(a[i], b[i]) > 5);
            if (different) break;
        }
    
        if (different)
        {
            printf("Arrays do not match.\n");
        }
        else
        {
            printf("Arrays match.\n");
        }
    }
    

     

    The code must be compiled to generate relocatable device code, and must target GPUs of Compute Capability 6.0 or higher, e.g.:

    nvcc -rdc=true -arch=sm_60 -o normalize normalize.cu
     

    Two-Pass Normalization

    Without grid synchronization, thread blocks are only synchronized upon completion of a kernel, necessitating a two kernel approach for normalization: one kernel to find the maximum absolute value, followed by a second kernel to multiply by the inverse of the maximum absolute value.

    The FindMaximum kernel requires one thread per input element, and demonstrates the use of Cooperative Groups' partitioning mechanism to perform a warp-level shuffle reduction that finds the maximum absolute value within a warp, followed by a single thread of each warp performing an atomicMax() operation to find the global maximum.
     

    One-Pass Normalization Via Grid Synchronization

    Using Cooperative Groups, we can merge the two steps in the normalization process into a single kernel by finding the maximum absolute value, synchronizing all the threads in the grid to avoid a potential race condition when accessing the maximum absolute value, and then scaling by the inverse of the maximum absolute value.

    Note that to enable grid synchronization, kernels must be launched via the host cudaLaunchCooperativeKernel launch API instead of the more familiar <<<…>>> execution configuration syntax.  Additionally, grids must be sized appropriately to guarantee that all the blocks in the grid are resident on the GPU's streaming multiprocessors (SMs).  An upper bound for the grid size can be determined by programmatically querying the number of blocks that can reside on a single SM via the cudaOccupancyMaxActiveBlocksPerMultiprocessor API call, and then multiplying by the number of SMs on the GPU.  Given the constraints on grid size, using the same one thread per input/output element strategy as the two-pass normalization approach would also limit the maximum size of the input array.  To maintain flexibility, instead of requiring one thread per input/output element, the GridSynchronizationNormalize kernel uses grid stride loops.

     

    Performance

    The table below shows the relative performance of the two approaches on a Tesla P100 GPU.

    Implementation                  Execution Time (µs)
    Two-pass                        46.8 (27.4 + 19.4)
    GridSynchronizationNormalize    34.4

     

    GridSynchronizationNormalize is faster than the two-pass approach.  However, it still does more work than necessary.  The advantage of using grid synchronization is that it allows stateful approaches where data persists in registers and/or shared memory after the synchronization, which isn't possible in the two-pass implementation.

    One-Pass Normalization Via Stateful Grid Synchronization

    The GridSynchronizationStatefulNormalize kernel uses shared memory to reduce global memory traffic.  The first part of the kernel finds the maximum absolute value while also explicitly caching the input elements in shared memory.  After the grid synchronization, the input values for the scaling operation are read directly from shared memory instead of being reread from global memory as in the other kernels.  The table below shows the performance improvement from the stateful approach:

    Implementation                          Execution Time (µs)
    Two-pass                                46.8 (27.4 + 19.4)
    GridSynchronizationNormalize            34.4
    GridSynchronizationStatefulNormalize    25.4

     

    The improved performance of the stateful approach comes at a price.  With the stateful approach, the input elements must fit within the GPU's aggregate shared memory capacity.  On a Tesla P100, the input array can contain no more than 56 SMs * 64 KB/SM / 4 bytes/element = 917504 elements.  On a Tesla V100, the largest array we can support increases to 80 SMs * 96 KB/SM / 4 bytes/element = 1966080 elements.  These size limitations make the GridSynchronizationStatefulNormalize kernel less scalable than the alternatives.  The stateful approach could be employed as part of an optimized path for smaller data sets, with automatic fallback to the stateless approach for larger data sets.

     

    Potential Mistakes When Developing Grid Synchronization Code

    One of the strengths of the CUDA C programming model compared to other heterogeneous models like OpenCL, Vulkan, or even the CUDA Driver API, is that host and device code are combined in a single source file. This advantage is especially evident when using CUDA's <<<…>>> kernel launch operator.  The kernel launch operator facilitates compile-time error checking to ensure that kernels are invoked with the expected number of arguments, and that each kernel argument is of the expected type. No compile-time error checking is performed when using the cudaLaunchCooperativeKernel API call to launch a kernel that uses grid synchronization.  Beware that passing an incorrect number of arguments, or arguments of incorrect type, through the void** args parameter results in undefined runtime behavior, not a compile-time error!  A quick experiment omitting the final argument to GridSynchronizationNormalize resulted in a segmentation fault.

    cudaLaunchCooperativeKernel does validate its gridDim, blockDim, and sharedMem arguments to ensure that all blocks of the grid can be resident on the GPU's SMs. In the event of an error it returns the error code cudaErrorCooperativeLaunchTooLarge (82) with the corresponding error string “too many blocks in cooperative launch”.


    Source: Acceleware Ltd. blogs

  • 13.02.2018: Using Shared Memory on NVIDIA GPUs

    Using Shared Memory on NVIDIA GPUs

    We received a good question on one of our webinars. 

    There are a couple of things that are happening with shared memory and its size.  You as the programmer declare the size of your shared memory allocation.  You can do this statically (at compile time) or dynamically (at run time).  If you statically declare a shared memory allocation that exceeds the limit per thread block (typically 48 KB), the NVIDIA compiler will generate an error and your program will not compile.  If you dynamically attempt to allocate more shared memory than is available, your GPU kernel will not launch and an error code will be returned to your program, although you have to check for error codes programmatically. You can also have a combination of multiple static allocations and one dynamic allocation.  If the sum of the allocations exceeds the shared memory limit per thread block, the GPU kernel will not launch.
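
    The following is a minimal sketch (not from the webinar) illustrating the two declaration styles and the run-time failure mode described above; the kernel names and sizes are hypothetical.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Static shared memory: the size is fixed at compile time, and exceeding
    // the per-block limit is reported when the kernel is compiled.
    __global__ void StaticSharedKernel()
    {
        __shared__ float staticBuffer[1024];            // 4 KB, known at compile time
        staticBuffer[threadIdx.x] = (float)threadIdx.x;
    }

    // Dynamic shared memory: the size is supplied as the third launch parameter.
    __global__ void DynamicSharedKernel()
    {
        extern __shared__ float dynamicBuffer[];        // size chosen at launch
        dynamicBuffer[threadIdx.x] = (float)threadIdx.x;
    }

    int main()
    {
        StaticSharedKernel<<<1, 256>>>();

        // Deliberately oversized dynamic request (1 MB) to show the failure mode:
        // the kernel does not launch, and the error must be checked explicitly.
        DynamicSharedKernel<<<1, 256, 1 << 20>>>();
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("Launch failed: %s\n", cudaGetErrorString(err));

        return (int)cudaDeviceSynchronize();
    }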

    Finally, shared memory is allocated exclusively to a thread block to facilitate zero-overhead context switching.  As a result, the total amount of shared memory on the SM limits the number of thread blocks that can be scheduled to run concurrently on the SM, which impacts occupancy, and potentially performance.  For more details on occupancy and performance, click here to watch our optimization webinar.

     


    Source: Acceleware Ltd. blogs

  • 19.01.2018: Changes in SEG-Y Revision 2

    Changes in SEG-Y Revision 2

    The SEG-Y file format has become a widely used industry standard for working with and sharing seismic data. Since the format's first publication in 1975, a lot has changed in the way we record, store, and process seismic data. Many proprietary extensions or modifications have been designed to work around the limitations of the file format, but such alternatives often make shared data sets difficult or impossible to interpret without additional development.

    In 2002, SEG-Y revision 1 was released adding support for IEEE floating point data, ASCII text, and a system for extending the format in text headers. The latest update, revision 2, released in 2017 takes more significant steps towards modernizing the format and eliminating the problem of custom work-arounds.

    File Format Overview

    The following diagram shows an overview of the updated SEG-Y file format.

    [Figure: overview of the SEG-Y revision 2 file structure]

    SEG-Y files start with an optional 128 byte SEG-Y tape label. The first piece of mandatory data is the 3200 byte text header that typically contains company and survey information in a human-readable format. The text header is followed by the 400 byte binary header which defines the size and type of data contained in the file.

    Next, the SEG-Y file may contain a number of optional extended text headers. These headers were introduced in revision 1, and expanded on in revision 2. Extended text headers allow users to define how the file should be interpreted using official extensions. Developers can also design and publish their own extensions to the format without altering the contents of any parts of the file defined by the standard.

    The bulk of the SEG-Y file is taken up by 1 or more traces. Each trace must include a 240 byte trace header, followed by an optional number of extended trace headers. Extended trace headers were introduced in revision 2 to provide space for new official and user-defined header parameters.

    Finally, revision 2 allows extra user-defined data to be placed at the end of the SEG-Y file. This data must be preceded by headers and divided into 3200 byte blocks, but its presence does prevent older code from accurately reading the file.

    New Features

    Revision 2 introduces many new features to the format. The following is an overview of some of the most significant changes that will make the SEG-Y format more precise and easier to work with.

    File Version and Backwards Compatibility

    A SEG-Y file's version is identified in 2 places: line 39 of the text header, and bytes 3501-3502 of the binary header. For a SEG-Y revision 2 file, line 39 of the text header should read “SEG-Y REV2.0”. Bytes 3501 and 3502 of the binary header are the major and minor revision numbers. For a SEG-Y revision 2 file they would be 0x02 and 0x00 respectively.
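
    To make the byte positions concrete, here is a minimal sketch (not from the standard) that reads the two revision bytes. It assumes the conventional numbering in which binary header bytes are counted from the start of the file (so bytes 3501-3502 sit at zero-based offsets 3500 and 3501), and that no optional 128 byte tape label precedes the text header.

    #include <cstdint>
    #include <cstdio>
    #include <fstream>

    int main(int argc, char* argv[])
    {
        if (argc < 2) { printf("usage: %s file.segy\n", argv[0]); return 1; }

        std::ifstream segy(argv[1], std::ios::binary);
        if (!segy) { printf("could not open %s\n", argv[1]); return 1; }

        // Bytes 3501-3502 (1-based from the start of the file) hold the major
        // and minor revision numbers as single unsigned bytes.
        uint8_t revision[2] = {0, 0};
        segy.seekg(3500);                      // zero-based offset of byte 3501
        segy.read(reinterpret_cast<char*>(revision), 2);

        printf("SEG-Y revision %u.%u\n", (unsigned)revision[0], (unsigned)revision[1]);
        return 0;
    }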

    The latest version of the SEG-Y file format is mostly backwards compatible with prior versions. A program written using the new standard will be able to read old SEG-Y files, and a program written using the old standard will be able to read new SEG-Y files that do not use any of the new extended data blocks. Extended text headers were introduced in revision 1, so they are not compatible with programs built to read revision 0. Similarly, extended trace headers were introduced in revision 2, so they are not compatible with older programs.

    In general, any parameters in a SEG-Y file with a value of 0 are considered to be not set. New features can therefore be ignored by setting their values to 0. Additionally, the update to the standard did not move or remove any parameters: new parameters were added in unassigned or reserved space so old SEG-Y files should have all new fields set to 0.

    Trace Header Extensions

    A common challenge in reading SEG-Y files is in interpreting the values of trace headers. Due to limitations in the original format, many users implement custom solutions that use fields for data outside of the specification. SEG-Y revision 2 solves this problem by adding extended trace headers that provide space for additional official and custom header parameters.

    Extended trace headers are a series of 240 byte blocks. The maximum number of additional trace headers in a file is specified in binary file header bytes 3507-3510. The actual number of extended trace headers may vary from one trace to the next, unless the file has fixed length traces. While the standard allows for variable length traces, fixed length traces are common to allow fast seeking. As of revision 1, a SEG-Y file can guarantee it has fixed length traces by setting field 3503-3504 in the binary header to 1. For variable length traces, the actual number of extended headers is indicated in bytes 157-158 of the first extended header.

    The final 8 bytes of all trace headers are now used to indicate the header's name. The standard trace header should be named “SEG00000”. If extended trace headers are used, the first one must be the new standard header named “SEG00001”. Further headers are fully customizable with the final 8 bytes reserved for a unique name. All header names of the format “SEG#####” are reserved for future revisions.

    Modern Data

    Revision 2 provides a number of changes that make SEG-Y able to handle much larger data sets. Files can contain more traces and traces can contain more samples. Files can now contain up to 2^64 - 1 traces and 2^32 - 1 traces per ensemble. This is accomplished by including several 64-bit unsigned integer fields in the extended trace header that override smaller data fields in the standard header. Bytes 1-8 of the extended header override the trace sequence number in a line (bytes 1-4 of the standard header) and bytes 9-16 override the trace sequence number in a file (bytes 5-8). Similarly, bytes 3261-3264 of the binary header override the number of traces per ensemble (bytes 3213-3214).

    Traces now also support more samples and new data formats. Bytes 3269-3272 of the binary file header override the number of samples per trace (bytes 3221-3222), allowing for up to 2^32 - 1 samples per trace. The trace data format is specified by setting bytes 3225-3226 of the binary file header to a particular code. Most notably, 4-byte IEEE floating point numbers were added in revision 1 and 8-byte IEEE floating point numbers were added in revision 2 (codes 5 and 6 respectively).

    This revision also provides a number of changes to make SEG-Y faster and easier to use on modern computers. SEG-Y files can now be written in little-endian format for faster I/O performance. Bytes 3297-3300 can be used to detect the correct byte order. A value of 16909060 (0x01020304) would indicate all bytes are in the correct order, while 67305985 (0x04030201) would indicate the bytes must be reversed.
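
    The following is a minimal sketch (not part of the standard) of how a reader might act on that byte-order field; the function names are hypothetical, and the fallback for an unset field is an assumption based on the traditional big-endian convention for SEG-Y data.

    #include <cstdint>
    #include <cstdio>

    // Interpret the 4-byte integer stored in binary header bytes 3297-3300,
    // passed in exactly as it was read from the file.
    //   16909060 == 0x01020304 : file byte order matches the reader, no swap needed
    //   67305985 == 0x04030201 : bytes must be reversed before interpreting values
    //   0                      : field not set; this sketch leaves the decision to
    //                            the caller, who typically assumes big-endian data
    bool NeedsByteSwap(uint32_t byteOrderField)
    {
        if (byteOrderField == 16909060u) return false;
        if (byteOrderField == 67305985u) return true;
        return false;   // unknown/unset: handle per your own policy
    }

    // Generic 32-bit byte swap used when NeedsByteSwap() returns true.
    uint32_t Swap32(uint32_t v)
    {
        return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
               ((v << 8) & 0x00FF0000u) | (v << 24);
    }

    int main()
    {
        uint32_t field = 67305985u;   // hypothetical value read from the header
        printf("needs swap: %d, swapped: 0x%08X\n", (int)NeedsByteSwap(field), Swap32(field));
        return 0;
    }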

    The new revision also provides a better way to specify many standard parameters with improved precision. The sampling interval can be specified as a 64-bit IEEE floating point number in bytes 3273-3280 of the binary file header (this overrides bytes 3217-3218). Source and receiver coordinates, depths, and elevations can also be set as doubles using the first extended trace header:

     

    Field                                Bytes      Overrides
    Receiver group elevation             33-40      41-44
    Receiver group depth                 41-48      -
    Source elevation                     49-56      45-48
    Source depth                         57-64      49-52
    Datum elevation at receiver group    65-72      53-56
    Datum elevation at source            73-80      57-60
    Source X                             97-104     73-76
    Source Y                             105-112    77-80
    Group X                              113-120    81-84
    Group Y                              121-128    85-88
    CDP X                                161-168    181-184
    CDP Y                                169-176    185-188

     

    These parameters eliminate the need for the various scaling factors that are present in the standard trace header.

    Finally, bytes 29-30 of the standard trace header contain an identification code for the type of data contained in the trace. Many new types of data have been added, including velocity, electro-magnetic, and rotational sensor data.

    Trace Header Mapping

    A common challenge with using the SEG-Y format has been interpreting trace header data. Due to limitations in the original format, various custom solutions have arisen that can make sharing files difficult. While the new standard extended trace headers should alleviate most of these problems, trace header mapping should eliminate the problem completely.

    The extended text header is a series of 3200 byte records. The number of records is provided in bytes 3505-3506 of the binary file header. These headers provide extra space for any data the user wishes to include in the file. Extended text headers are organized into sets of data; each set is referred to as a “stanza”. A stanza can include data in any format the user wishes; however, a number of standard stanzas have been published with the SEG-Y specification. Stanzas must start on a 3200 byte boundary and begin with a stanza header. The header includes the name of the stanza as well as the company that published it. For example, a trace header mapping would begin with “((SEG: Trace Header Mapping ver 1.0))”.

    Revision 2 provides a specification for a text header stanza that overrides standard trace header locations. Custom extensions to the SEG-Y file format can now be codified within the file itself for easier processing. User defined parameters can be mapped, and standard parameters can be remapped to new locations or data formats.

    Trace header mappings are specified in the extended text header field.

    The following is an excerpt of the example trace header mapping provided in the SEG-Y revision 2 standard[1]:

     

    ((SEG: Trace Header Mapping ver 1.0))
    Header name = SEG00000
    Field name 1 = tracl
    Byte position 1 = 1
    Value format code 1 = 2
    Field description 1 = Trace sequence number within line
    Field name 2 = tracr
    Byte position 2 = 5
    Value format code 2 = 2
    Field description 2 = Trace sequence number within SEG-Y file
    Field name 3 = fldr
    Byte position 3 = 9
    Value format code 3 = 2
    Field description 3 = Original field record number

     

    As seen in the example, a field is mapped by specifying its name, position, format, and description. Using standard names for fields allows developers to document their proprietary modifications to the SEG-Y format in a way that can still be read by anyone that the data is shared with.

    If extra data needs to be saved with each trace, new fields can also be mapped to custom headers. The “Header name” line indicates which trace header block is currently being mapped. The example above maps its fields to the standard trace header “SEG00000”.

    Conclusion

    SEG-Y revision 2 represents a significant step forward in modernizing the file format. By using the new extended trace headers for more precision, and trace header mappings to document any modifications or extra headers, the SEG-Y format will be much easier to use for processing and sharing seismic data.

    References

    [1] Hagelund, Rune, and Stewart A. Levin, editors. “SEG-Y_r2.0: SEG-Y Revision 2.0 Data Exchange Format.” Mar. 2017, seg.org/Portals/0/SEG/News%20and%20Resources/Technical%20Standards/seg_y_rev2_0-mar2017.pdf


    Source: Acceleware Ltd. blogs

  • 05.01.2018: GPU Hardware System Engineering – A Debugging Story

    GPU Hardware System Engineering – A Debugging Story

    The Performance Problem

    Earlier this year, a customer had a most unusual problem. They bought a workstation with a P100 GPU to run finite difference time domain (FDTD) simulations. Initially, the simulation ran at the expected speed, but over time the GPU would slow down to approximately 1/3 of the expected speed. From a support perspective, we had never seen an issue like this. Was it a software bug? Why would the simulation slow down? Setting a software bug aside for the moment, I focused my efforts on the system configuration.

    Preliminary Debugging

    The customer sent me the OS configuration, NVIDIA driver version, a sample simulation and software version information. Even though they were running a P100, I asked them to check the NVIDIA driver model, to determine whether the GPU was set to TCC or WDDM [http://www.acceleware.com/blog/Timeout-Detection-in-the-Windows-Display-...]. As expected, the GPU was running in TCC mode. I considered the fact that Windows 10 is continuously under development, so perhaps a new update, or a lack of updates, was causing this. After ensuring the OS was updated, I elevated the issue to our software development team. My best guess was that the customer must have found a bug in our code related exclusively to the Pascal architecture. As a final attempt, I asked the customer to upgrade to the latest BIOS. As expected, the BIOS update did nothing to resolve the problem.

    The Debugging Plot Thickens

    Our software development lead denied that this was a Pascal-related bug and believed it to be a hardware issue. Acceleware has a limited number of P100 GPUs, so hardware access is limited. Initially, I gave the software developer a K10 running Windows Server 2012 R2 to debug on. After making no progress, the software developer insisted that I get him access to the P100 in Windows 10, mirroring the customer's workstation as closely as possible. Our server is a two socket Dell R730 server with two P100s. The customer's system has only one P100. Since we didn't want to take one P100 out for testing, we used CUDA_VISIBLE_DEVICES to run on one GPU only [http://acceleware.com/blog/cudavisibledevices-masking-gpus]. Finally, we were able to debug on a setup similar to the customer's. During a short simulation run, the speed slowed down over time.  We were now able to reproduce the problem (although it was not as pronounced as in the customer's case). Also, the slowdowns were different on the two P100s we isolated with CUDA_VISIBLE_DEVICES.

    Bad Hardware or Simply a Software Bug?

    The performance drop made no sense to us; maybe we had a defective P100. We tried the second P100, using CUDA_VISIBLE_DEVICES to select the other GPU. This time the speed dropped off faster. Two defective P100s? That seemed unlikely. However, we observed that our speed drop-off wasn't quite as bad as the customer's; we went from full speed to 90% of full speed. We have an NVIDIA Titan X Pascal in house that is almost the same as a P100. Testing a similar setup with our Titan X Pascal didn't result in any performance issues. The software developer and I concluded that we must have a P100-related bug in our FDTD software. Debugging this would be a difficult and time consuming task as it is specific to the card. We relayed the bad news to our customer.

    A Breakthrough!

    We thought of one last idea before intensively debugging the software. We decided to run nvidia-smi to log the GPU clock speed. P100s and many other NVIDIA GPUs have variable clock speeds to conserve power and increase performance. (While the P100 has several clock domains, for simplicity we are graphing only the graphics clock.) The following command logs the GPU temperature and graphics clock every 5 seconds until the command is killed.

    C:\"Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,clocks.current.graphics, --format=csv -l 5

    Figure 1: nvidia-smi output

    Figure 2

    Initially, the P100 was running at 405 MHz, the idle speed. Once the CUDA program started, it ran at 1189 MHz. It stayed at that speed while the simulation was initialized on the GPU. Once the GPU began time-stepping through the simulation, the clock rate increased to 1328 MHz and the GPU temperature began rising. A little later, the GPU temperature reached 79C and the clock was throttled to keep the GPU from overheating, because the cooling was insufficient. As the simulation progressed, the clock throttling increased. The clock dropped as low as 898 MHz, almost a 1/3 drop in speed from the boost clock.

    Why did it drop? I decided to use the Dell iDRAC interface to increase the fan speed of the server to 100%. Running the simulation again, the GPU clock rate was consistent, as was the simulation speed. At that point it became obvious: the P100 was throttling its clock rate under load because the cooling was insufficient. This also explained the different simulation speeds of the two P100 cards: they had slightly different airflow, which caused different throttling rates. Figure 1 shows how one P100 is always running cooler than the other. I asked the customer to change the fan speed on their workstation. That solved the problem.

    Wrapping Up

    Why did this happen? Both our Dell server and the customer's workstation were on NVIDIA's certified servers list for P100 GPUs [http://www.nvidia.com/object/grid-certified-servers.html]. The latest BIOS was installed on both computers. The customer was running their workstation in an office environment and not in a dedicated server room. As a result, their GPU clock throttling was even more significant than in our tests. Their clock ran as low as 645 MHz, half the peak clock. The underlying reason is that NVIDIA Tesla GPUs are intended for large datacenters, which have a fixed cooling and power budget, often managed by a dedicated team. For those with just one or two P100s, it is important to understand the environment required to ensure the P100 can run at its maximum potential. In this case, our server fan speed needed adjustment, and the customer was running a P100 in a less than ideal environment. As for our server room, we later found out that one of our two AC units needed servicing. Adjusting the server fan speed solved the problem, but the real reason for the slowdown was an environmental one.

     

    Source: Acceleware Ltd. blogs

  • 09.11.2017: Choosing a Normalization Method

    Choosing a Normalization Method

    Normalization is an important method for data pre-processing, especially for machine learning. Why do we need normalization? Normalization may change the distribution of the feature vector data set and the distances among observation points. Additionally, it may accelerate convergence of deep learning algorithms¹. As another example, our AxHeat product, which simulates the RF heating of reservoirs, applies normalization to the water saturation to improve algorithm stability.

    In this blog post, we will discuss three fundamental types of normalization. The methods specific to deep learning, i.e. local response normalization, batch normalization and layer normalization, will be discussed in a future blog post².

    Zero-mean normalization

    Equation:

    x = ( x - mean ) / std

    In this case, zero-mean normalization is the same as standardization. Data processed with this method will have zero mean and unit standard deviation. For clustering algorithms that use distance to measure similarity, e.g. K-means, zero-mean normalization is a good choice³.

    Min-max normalization

    Equation:

    x = ( x - min ) / ( max - min )

    Min-max normalization is a linear normalization technique. It does not change the distribution of the data set. However, if the min and max are not stable among input data sets, it may cause instabilities in the results. Min-max normalization is the most common method in image processing, as most pixel values are in the range [0, 255].
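
    The following is a minimal C++ sketch (not from the original post) of the two linear methods above; the helper names and the toy data are hypothetical.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Zero-mean (standardization): x' = (x - mean) / std
    std::vector<double> ZeroMeanNormalize(std::vector<double> x)
    {
        double mean = std::accumulate(x.begin(), x.end(), 0.0) / x.size();
        double var  = 0.0;
        for (double v : x) var += (v - mean) * (v - mean);
        double stddev = std::sqrt(var / x.size());
        for (double& v : x) v = (stddev == 0.0) ? 0.0 : (v - mean) / stddev;
        return x;
    }

    // Min-max: x' = (x - min) / (max - min), mapping the data onto [0, 1]
    std::vector<double> MinMaxNormalize(std::vector<double> x)
    {
        double lo = *std::min_element(x.begin(), x.end());
        double hi = *std::max_element(x.begin(), x.end());
        double range = hi - lo;
        for (double& v : x) v = (range == 0.0) ? 0.0 : (v - lo) / range;
        return x;
    }

    int main()
    {
        std::vector<double> data = {-40.0, -3.5, 0.0, 7.2, 40.0};   // toy sample
        for (double v : ZeroMeanNormalize(data)) printf("%.3f ", v);
        printf("\n");
        for (double v : MinMaxNormalize(data)) printf("%.3f ", v);
        printf("\n");
        return 0;
    }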

    Non-linear normalizations

    The typical non-linear normalizations include logarithm, exponential functions, and inverse trigonometric functions.

    The choice of non-linear function depends on the distribution of your inputs and the expected distribution of the outputs. log() has better discernibility in the range [0, 1]. arctan() can take any real number as input and maps it to the range (-π/2, π/2).

     

    Let's have a look at the data distributions after applying zero-mean, min-max, log and arctan normalizations to a standard Cauchy distribution input:

    We generate 200 sample points randomly (shown as the blue points), which are in the range [-40.0, 40.0]. After normalization, the data is compressed into [-10.0, 15.0]. We do see different data distributions, and there is no absolute good or bad; it depends on your criteria. For instance, if your target is to minimize the distance among the points, min-max is a good choice. If you expect evenly distributed data with obvious differences, log may be a good idea.

    Lastly, Scikit-learn provides handy tools to compare and visualize normalized results against your inputs⁴. It is a good idea to run your candidate normalization algorithm on a sample data set before applying it to your real data set.

     

    References

    [1] https://www.coursera.org/learn/deep-neural-network/lecture/lXv6U/normalizing-inputs

    [2] http://yeephycho.github.io/2016/08/03/Normalizations-in-neural-networks  

    [3] http://ai.stanford.edu/~acoates/papers/coatesng_nntot2012.pdf

    [4] http://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#normalizer


    Source: Acceleware Ltd. blogs

  • 19.10.2017: Assessing Correctness in Scientific High-Performance Computing

    Assessing Correctness in Scientific High-Performance Computing

    High-Performance Computing as applied to numerical modelling is the art of optimizing numerical approximations to mathematical models on modern compute hardware for the interests of professionals in many fields of science and engineering. In general, the numerical approximations are of interest because realistic problems do not have an analytical solution that can be solved with pen and paper. Advances in computer hardware have led to stunningly realistic models that allow researchers to investigate phenomena such as how proteins fold, how the human heart beats, and how stars are formed. But how do we know that our numerical predictions are correct, and not simply an artifact from a bug in the code?

    First and foremost, it is important to have a few simple test cases, with a (pseudo-)analytical result that is accurate to the limitations of the numerical approximation used. This can be very difficult, especially because the test must be set up in a way that minimizes expected numerical errors. Instead, this post will focus on consistency tests that are much easier to design and automate. Although none of the following are always satisfied, scientific numerical models should

    1. Give the same result when run twice with an identical setup
    2. Give the same result when run on different hardware types
    3. Give the same result when split over multiple machines versus run on a single machine

    Unfortunately, “give the same result” does not always mean “expect a binary match”. In all cases, some level of understanding of the expected result will be required to decide what is “similar enough”. 

     

    Analytical Tests

    Automated analytical tests are difficult to set up and assess because numerical approximations do not typically exhibit an exact match with the analytical solution. Instead, they converge within a certain level of error. Many numerical methods have books dedicated to the expected errors, how to mitigate them, and how to determine upper bounds for the error. Testing at this level is often done by having an expert in the field manually assess the result; the result is then saved as-is and all future versions of the code are expected to reproduce it. I do not like this method because an improvement to an algorithm can result in unnecessarily failed tests. It is worth automating analytical tests by extracting the desired information from the results whenever possible (e.g. extract a single scalar of interest instead of saving a 3D example output as a reference). If that is not possible, then the method of assessing the correctness of the reference result should be well documented.

    A better class of analytical tests ensures that properties such as reciprocity, symmetry, and image theory are satisfied. One of my favorite tests is for perfectly planar reflecting boundary conditions in wave equation modelling. Analytically, an infinite planar Neumann or Dirichlet boundary can be replaced by a mirror image (with a possible minus sign) of the setup on one side of the plane. This holds no matter how complicated the model is. Therefore, two different tests can be set up that are expected to produce the same result, including numerical errors in the wave propagation. The only difference should be from the error in the implementation of the reflective boundary, which is prone to shifts in location if implemented incorrectly.

    In summary, analytical tests are essential, but should be implemented in a way that allows for improvements to numerical accuracy and minimizes the need for domain experts to repeatedly manually assess the result.

     

    Expected Result of Multiple Runs with an Identical Setup

    One of the simplest automated tests is to run the same numerical experiment twice, and assert that the same answer is reproduced. In many cases, a binary match is expected. It is worth putting in extra effort to ensure a binary match is achievable.

    It is common for parallel algorithms involving a reduction to give a different result depending on the order of operations because floating-point arithmetic is not associative and scheduling is non-deterministic. Note that if the order of operations remains fixed, a parallel implementation of a reduction should give a binary match when run multiple times.
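
    A tiny illustration of that non-associativity (the values are chosen only to make the rounding visible):

    #include <cstdio>

    int main()
    {
        // Floating-point addition is not associative: accumulating the same
        // values in a different order can give a different rounded result.
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

        float leftToRight = (a + b) + c;   // 0.0f + 1.0f = 1.0f
        float rightToLeft = a + (b + c);   // b + c rounds to -1.0e8f, so the sum is 0.0f

        printf("(a + b) + c = %.1f\n", leftToRight);
        printf("a + (b + c) = %.1f\n", rightToLeft);
        return 0;
    }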

    A second exception is algorithms that require pseudo-random number generation. In that case, I recommend consistently seeding the random number generation on every processing unit to maintain reproducible results.

    Finally, we have found that unaligned allocations on the CPU can have run-to-run variations if that memory is operated on within a vectorized loop. The size of the remainder loop (the leftover elements when the number of iterations is not a multiple of the SIMD vector length) changes depending on the alignment of the allocated memory, and the remainder loop produces a subtly different numerical result. Solutions include disabling vectorization for that loop (#pragma novector), vectorizing the remainder loop (#pragma vector vecremainder), or aligning the memory.
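
    As a minimal sketch of the last option (assuming a toolchain with C++17 std::aligned_alloc; _aligned_malloc or posix_memalign serve the same purpose elsewhere):

    #include <cstdio>
    #include <cstdlib>

    int main()
    {
        const size_t n = 1000003;            // deliberately not a multiple of the SIMD width
        const size_t alignment = 64;         // cache-line / AVX-512 friendly

        // std::aligned_alloc requires the size to be a multiple of the alignment,
        // so round the byte count up before allocating.
        size_t bytes = ((n * sizeof(float) + alignment - 1) / alignment) * alignment;
        float* data = static_cast<float*>(std::aligned_alloc(alignment, bytes));
        if (!data) return 1;

        // With a fixed alignment, the vectorized main loop and the remainder loop
        // always cover the same elements, so the rounded result is reproducible
        // from run to run.
        for (size_t i = 0; i < n; ++i) data[i] = 1.0f / (float)(i + 1);

        std::free(data);
        return 0;
    }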

    This type of test is commonly used during our optimization passes as an initial assessment because it is easily automated, requires no domain knowledge, and fails if a race condition or out-of-bounds access error is introduced. Our overnight test suite runs small test cases 3 or 4 times to help catch uncommon race conditions.

     

    Expected Result When Solving Problems with Different Compute Architectures

    Our applications are usually developed to run on x64 CPUs and NVIDIA GPUs. Much of the code is structured to keep high level portions of the algorithm in common base classes, but the core of the algorithm ends up being written twice. Furthermore, data access patterns and optimization details differ between the two platforms. Therefore, the likelihood of an identical bug being written in both versions of the code is greatly reduced. For consistency, we run automated tests to ensure both implementations of the code produce similar results.

    For different hardware types, a binary match is not expected. In fact, the differences between CPU and GPU can be substantial, and it is difficult to assess whether the two implementations are close enough. I find that acceptable differences should be relatively incoherent, and that coherent differences are usually caused by bugs. Architectures with and without support for fused multiply-add (FMA) tend to have the most differences. Obviously, production code should use FMAs, but the NVIDIA compiler flag controlling FMA contraction (--fmad=true/false) and the Intel compiler flag (-no-fma) are handy when investigating questionable cases.

    I strongly recommend against running hardware comparison tests on random data, or any data that is not designed to produce a meaningful result. Numerical algorithms are expected to have accuracy limitations and suffer from floating point errors, but nothing shows that better than taking a high-order derivative of meaningless data. Don’t waste your time trying to justify why different architectures do not match in these cases.

     

    Expected Result When Solving Problems on Differing Numbers of Processing Units

    Real world numerical models are often split over multiple processing units because they are too large to fit on a single one. In this case, I am referring to processing units in general as either individual cores on a CPU, NUMA nodes, discrete GPU chips, or whatever your base computation unit is. Assuming you are able to get an exact binary match for multiple runs on a single device, you should be able to get an exact binary match on multiple devices. The only caveat is that once again you need to be careful of how remainder loops change when the numerical model is split over multiple devices. The easiest solution is to ensure that optimized code is padded and aligned. Otherwise, the remainder loop should be vectorized with compiler directives.

     

    Final Remarks

    Consistency tests often elucidate the strangest bugs. The ones that cause the late-time NaNs, but only for specific sizes, and the ones that cause a segmentation fault on your customer's hardware, but work just fine for you. An example bug that we found this week was introduced during the optimization of GPU code for modelling waves in an anisotropic material. For anisotropic modelling, several terms in the underlying differential equation have no effect in isotropic regions, and a subtle effect in anisotropic ones. Data was improperly loaded for a subset of threads in a way that only affected one of those subtle terms in a small subset of the total volume. As such, the code passed analytical tests, but failed consistency tests. It took two developers more than a day to track down, but detecting and fixing peculiar bugs like that increases confidence in both the code and the test suite.

     


    Source: Acceleware Ltd. blogs

