NVIDIA GPU Technology Conference Presentations

GTC Europe 2016, September 28-29, 2016, Amsterdam

At the GPU Technology Conference (GTC), this GCoE presented:

  • Best Practices for OpenACC Optimizations in Large Scale Multi-Physics Applications
    • Vishal Mehta, Sr. Engineer and PhD Student. Learn best practices for finite element modelling in multi-physics applications: choosing the right approach to sparse matrix assembly for performance and portability, adding OpenACC kernels and parallel regions with the proper granularity, and combining large-scale MPI simulation with OpenACC. The workshop includes hands-on coding experience, starting with vectorization of the code, continuing with adding OpenACC pragmas with data dependences, and ending with scaling the code across MPI processes (a minimal sketch of this pattern appears after this list).
  • Analyzing the Effect of Last-Level Cache Sharing on Integrated Platforms with Fine-Grain CPU-GPU Collaboration
    • Victor Garcia, PhD Candidate, Antonio J. Peña, GCoE Acting Director, and Eduard Ayguadé, CS Department Associate Director, in collaboration with Juan Gómez-Luna, Thomas Grass, and Alejandro Rico. Although on-die GPU integration appears to be the trend among the major microprocessor manufacturers, many open questions remain regarding the architectural design of these systems. This poster is a step towards understanding the effect of on-chip resource sharing between GPU and CPU cores and, in particular, the impact of last-level cache (LLC) sharing in heterogeneous computations.
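
The OpenACC workflow outlined in the workshop abstract above can be pictured with a minimal sketch (our own illustration, not code from the workshop): each MPI rank binds to one of the node's GPUs and offloads a loop through an OpenACC data region and parallel loop. It assumes a C code base, illustrative function names, and an OpenACC-capable compiler such as nvc or pgcc.

    /* Minimal sketch: one MPI rank per GPU, a structured data region,
       and an offloaded parallel loop. Names are illustrative only. */
    #include <mpi.h>
    #include <openacc.h>

    void axpy(int n, double a, const double *x, double *y)
    {
        #pragma acc data copyin(x[0:n]) copy(y[0:n])   /* move x and y once */
        {
            #pragma acc parallel loop                  /* offload the loop */
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Bind each MPI rank to one of the GPUs on the node. */
        int ndev = acc_get_num_devices(acc_device_nvidia);
        if (ndev > 0)
            acc_set_device_num(rank % ndev, acc_device_nvidia);

        /* ... assemble the local sparse matrix, call kernels, exchange halos ... */

        MPI_Finalize();
        return 0;
    }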

GTC 2016, April 4-8, 2016, Silicon Valley

At the GPU Technology Conference (GTC), this GCoE presented:

  • ALYA Multi-Physics System on GPUs: Offloading Large-Scale Computational Mechanics Problems
    • Vishal Mehta, Sr. Engineer and PhD Student. Learn to interface CUDA kernels, the CUDA library API, and driver APIs with existing Fortran applications in HPC. This session introduces the Alya multi-physics code developed at the Barcelona Supercomputing Centre; the code is based on Fortran 95 and scales across thousands of cores. We describe in depth how to port computationally heavy modules from Fortran to CUDA, and how to use CUDA features such as dynamic parallelism, CUDA streams, unified memory, and error handling in Fortran applications built with the NVCC compiler (a minimal interfacing sketch appears after this list). We also discuss future directions using next-generation programming models such as OmpSs for hybrid CPU and GPU computing. The presentation includes various example codes for improving the programming skills of the scientific community.
  • HPC Application Porting to CUDA at BSC
    • Pau Farre, Jr. Engineer, and Mar Jorda, Jr. Engineer. In this session you will learn about the main challenges we have overcome at BSC to successfully accelerate two large applications using CUDA and NVIDIA GPUs: WARIS (a volcanic ash transport model) and PELE (a drug molecule interaction simulator). We show that leveraging asynchronous execution is key to achieving high utilization of the GPU resources (even for very small problem sizes) and to overlapping CPU and GPU execution. We also explain some techniques for introducing Unified Virtual Memory into your data structures for seamless CPU/GPU data sharing. Our results show an execution time improvement in WARIS of 8.6x for a 4-GPU node compared to a 16-core CPU node (using hand-written AVX vectorization and MPI). Preliminary experiments with PELE already show a 2x speedup.
  • Implementing Deep Learning for Video Analytics on Tegra X1
    • Carles Fernandez, Herta Security (a UPC startup). The performance of the Tegra X1 architecture opens the door to real-time evaluation and deployment of deep neural networks for video analytics applications. This session presents a highly optimized, low-latency pipeline to accelerate demographics estimation based on deep neural networks in video. The proposed techniques leverage the on-die hardware video decoding engine and the Maxwell GPU cores to conduct advanced video analytics such as gender or age estimation. Our results show that Tegra X1 is the right platform for developing embedded video analytics solutions.
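
The Fortran-to-CUDA interfacing described in the Alya session above can be sketched as follows (our own illustration, not code from the talk): the CUDA kernel sits behind an extern "C" wrapper, which a Fortran application can call through an ISO_C_BINDING interface block. The name scale_on_gpu is hypothetical.

    // Compiled with nvcc and linked into the Fortran application.
    #include <cuda_runtime.h>

    __global__ void scale_kernel(double *x, double a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    // On the Fortran side, a matching interface block would declare
    //   subroutine scale_on_gpu(x, a, n) bind(c, name="scale_on_gpu")
    // with the scalar arguments carrying the VALUE attribute.
    extern "C" void scale_on_gpu(double *x, double a, int n)
    {
        double *d_x = nullptr;
        cudaMalloc(&d_x, n * sizeof(double));
        cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);

        scale_kernel<<<(n + 255) / 256, 256>>>(d_x, a, n);

        cudaMemcpy(x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
    }

In work like WARIS and PELE described above, the same copies and launches would be issued with cudaMemcpyAsync and kernel launches on CUDA streams so that GPU work overlaps with CPU execution.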

GTC 2015, March 17-20, 2015, Silicon Valley

At the GPU Technology Conference (GTC), the Barcelona Supercomputing Center - CUDA Center of Excellence presented:

  • Exploiting CUDA Dynamic Parallelism for Low-Power ARM-Based Prototypes
    • Vishal Mehta, Engineer, and Paul Carpenter, Senior Researcher. Learn to exploit CUDA features to save energy, and thus money. This session describes the Pedraforca prototype developed at the Barcelona Supercomputing Centre under the Mont-Blanc project. The prototype is based on NVIDIA® Tegra® and NVIDIA® Tesla® platforms and aims at reducing the raw power footprint of HPC clusters. The session explains in depth how to exploit CUDA dynamic parallelism and CUDA streams when porting GPU applications to low-power ARM-based prototypes (a minimal dynamic-parallelism sketch appears after this list). It also includes an architectural description of the prototype, power budget comparisons, and various example codes for improving the programming skills of CUDA users.
  • OmpSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs
    • Hugo Pérez, PhD Student, Benjamin Hernandez, Researcher, and Isaac Rudomin, Senior Researcher. Industry trends in the race to exascale imply the availability of clusters with hundreds to thousands of cores per chip. Programming them is a challenge due to their heterogeneous architecture, so novel programming models that facilitate this process are necessary. In this talk we present the case of simulation and visualization of crowds. We analyze and compare the use of two programming models, OmpSs and CUDA, and show that OmpSs allows us to exploit all the resources, combining the use of CPU and GPU while taking care of memory management, scheduling, communication, and synchronization automatically. We present experimental results obtained on the Barcelona Supercomputing Center GPU cluster and describe several modes used for visualizing the results.
  • Accelerating Face Detection and Deep Learning on GPUs
    • UPC startup Herta Security is at the forefront of using GPUs to extract biometric data. In an era where security has become a major growth industry, technologies that refine facial recognition are in high demand. Herta Security is on the cutting edge of real-time face recognition and has developed a number of solutions, including Biosurveillance and BioFinder, high-performance video-surveillance products specifically designed to identify subjects simultaneously in crowded and changing environments. On the non-security side, Herta has developed BioMarketing, which can identify parameters such as gender, approximate age, use of glasses, and various facial expressions to enable advertisers to reach an identified audience with a specific message.
    • Javier Rodriguez Saeta, CEO of Herta Security, was selected to present this work at the Emerging Companies Summit “Show & Tell” event at GTC 2015.
  • CUDA Center of Excellence Achievement Session & Awards
    • In this session, GTC highlights and rewards excellent research taking place at institutions at the forefront of GPU computing teaching and research: the NVIDIA Centers of Excellence (CoE). An NVIDIA panel of GPU computing luminaries selected four exemplars from our twenty-two CoEs to represent the amazing GPU computing research being done. This year's winner is Harvard University, with “Nature's Lessons for Technology, Playing with Molecules and Light”.
    • GPU Centers of Excellence session video recording.
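
CUDA dynamic parallelism, mentioned in the Pedraforca session above, lets a kernel launch further kernels without a round trip through the (low-power) host CPU. A minimal sketch of the feature, not code from the session:

    // Build with: nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void child(int parent_id)
    {
        printf("child %d of parent %d\n", (int)threadIdx.x, parent_id);
    }

    __global__ void parent()
    {
        // Each parent thread launches a small child grid from device code.
        child<<<1, 4>>>(threadIdx.x);
    }

    int main()
    {
        parent<<<1, 2>>>();
        cudaDeviceSynchronize();   // waits for the parent and all nested child grids
        return 0;
    }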

GTC 2014

  • In the session Easy Multi-GPU Programming with CUDArrays, Javier Cabezas (Ph.D. Student, BSC) shows how to boost your productivity with a multi-dimensional array data type that can be used both in host and device code. In systems with several GPUs and P2P memory access support, CUDArrays transparently distributes the computation across the GPUs (see the peer-access sketch after this list).
  • In another session, Generation, Simulation and Rendering of Large Varied Animated Crowds, Isaac Rudomin (Senior Researcher, BSC) and Benjamin Hernandez (Postdoc, BSC) discuss several steps in the process of simulating and visualizing large and varied crowds, in real time, on consumer-level computers and graphics cards (GPUs).
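
CUDArrays' own API is not reproduced here; as background only, this sketch shows the standard CUDA peer-to-peer (P2P) mechanism that such transparent multi-GPU distribution relies on. Once peer access is enabled, a kernel running on one GPU can dereference memory allocated on the other.

    #include <cuda_runtime.h>

    // Enable P2P access in both directions between two devices, if supported.
    void enable_bidirectional_p2p(int dev0, int dev1)
    {
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, dev0, dev1);
        cudaDeviceCanAccessPeer(&can10, dev1, dev0);

        if (can01) { cudaSetDevice(dev0); cudaDeviceEnablePeerAccess(dev1, 0); }
        if (can10) { cudaSetDevice(dev1); cudaDeviceEnablePeerAccess(dev0, 0); }
    }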

GTC 2013

  • Pedraforca: A First ARM + GPU Cluster for HPC, Alex Ramirez (Heterogeneous Architectures Manager, Barcelona Supercomputing Center)
    • The HPC community is always on the lookout for increased performance and energy efficiency. Recently, this has led to a growing interest in GPU computing and in clusters built from low-power, energy-efficient parts from the embedded and mobile markets. See a first proof of concept for a hybrid compute platform that brings together an ARM multicore CPU for energy efficiency and a discrete GPU accelerator that provides the compute performance. This talk presents the architecture of the system, the system software stack, and preliminary performance and power measurements, and concludes with guidelines for future ARM+GPU platforms.
  • GPU Generation of Large Varied Animated Crowds, Isaac Rudomin (Computing Sciences, Barcelona Supercomputing Center)
    • We will discuss several steps in the process of simulating and visualizing large and varied crowds, in real time, on consumer-level computers and graphics cards (GPUs). Animating varied crowds using a diversity of models and animations (assets) is complex and costly: one needs models that are expensive to buy, take a long time to build, and consume too much memory and computing resources. We have developed methods for generating, simulating, and animating crowds of varied appearance and a diversity of behaviors. Efficient simulations run on low-cost systems because we use the power of modern programmable GPUs, and similar technology can be applied on GPU clusters and HPC systems for large-scale problems. Such systems scale up almost linearly when using multiple GPUs.
  • OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware, Xavier Martorell (Associate Professor / Programming Models Group Manager, Technical University of Catalonia / Barcelona Supercomputing Center)
    • OmpSs is a data-flow programming model based on code annotations that identify independent tasks. These annotations are interpreted by the Mercurium source-to-source compiler, which emits calls to the Nanos++ runtime system. Nanos++ uses the information provided by these user annotations to dynamically build a task dependency graph, which is used to schedule tasks in a data-flow way. This extended programming model directly supports tasks written in CUDA C or OpenCL C, freeing end users from writing all the boilerplate code required to explicitly schedule kernels and manage data transfers, especially on multi-accelerator and distributed systems (a rough sketch of the annotation style follows).
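
A rough sketch of the annotation style described above (our own example; clause spellings follow the OmpSs documentation and may vary between Mercurium/Nanos++ releases):

    /* The CUDA kernel body lives in a .cu file compiled by the OmpSs toolchain;
       the annotated prototype below lets Nanos++ schedule it on a GPU and move
       the listed data automatically. */
    #pragma omp target device(cuda) copy_deps ndrange(1, n, 128)
    #pragma omp task in(a[0;n], b[0;n]) out(c[0;n])
    __global__ void vec_add(float *c, const float *a, const float *b, int n);

    void compute(float *c, const float *a, const float *b, int n)
    {
        vec_add(c, a, b, n);   // invoking the kernel creates a task
        #pragma omp taskwait   // wait until the dependency graph drains
    }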

GTC 2012

  • GMAC-2: Easy and Efficient Programming for CUDA-Based Systems, Javier Cabezas (BSC/Multicoreware), Isaac Gelado (BSC/Multicoreware), Lluís Vilanova (BSC), Nacho Navarro (UPC/BSC), Wen-Mei Hwu (UIUC/Multicoreware), May 17th, 2012
    • In this talk we introduce GMAC-2, a framework that eases the development of CUDA applications and tools while achieving similar or better performance than hand-tuned code. The new features implemented in GMAC-2 allow programmers to further fine-tune their code and remove some limitations found in the original GMAC library. For example, memory objects can now be mapped arbitrarily on several devices without restrictions, and a host thread can launch kernels on any GPU in the system. Moreover, GMAC-2 transparently takes advantage of new hardware features such as GPUDirect 2 peer-to-peer communication.
    • Streaming: NVIDIA FLV, NVIDIA MP4

GTC 2010

  • GMAC: Global Memory For Accelerators, Isaac Gelado (UPC), John E. Stone (UIUC), Javier Cabezas (UPC), Nacho Navarro (UPC), Wen-mei W. Hwu (UIUC), September 23rd, 2010
    • Learn how to use GMAC, a novel run-time for CUDA GPUs. GMAC merges the host and device memories into a unified virtual address space, enabling the host code to directly access the device memory and removing the need for data transfers between host and device memories. Moreover, GMAC allows the same pointers to be used by both host and device code (a rough single-pointer sketch appears after this list). This session presents the GMAC run-time and shows how to use it in current applications, covering everything from the basics of GMAC to multi-threaded applications using POSIX threads, OpenMP, and MPI.
    • Streaming: NVIDIA FLV, NVIDIA MP4
  • Reverse Time Migration with GMAC, Javier Cabezas (BSC), Mauricio Araya (Repsol/BSC), Isaac Gelado (UPC/UIUC), Thomas Bradley (NVIDIA), Gladys González (Repsol), José María Cela (UPC/BSC), Nacho Navarro (UPC/BSC), September 22nd, 2010
    • Get a close look at implementing Reverse Time Migration (RTM) applications across multiple GPUs. We will focus on how RTM applications can be scaled using the GMAC asymmetric distributed shared memory (ADSM) library to break the problem into manageable chunks. We will provide an introduction to GMAC and discuss handling boundary conditions and using separate kernels to improve efficiency.
    • Streaming: NVIDIA FLV, NVIDIA MP4
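
GMAC's own API is not reproduced here. As a point of reference only, the single-pointer behaviour the abstract describes, one allocation visible to both host and device code with no explicit transfers, is what the later CUDA managed-memory runtime provides:

    #include <cuda_runtime.h>

    __global__ void increment(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main()
    {
        int n = 1 << 20;
        int *data = nullptr;
        cudaMallocManaged(&data, n * sizeof(int));  // one pointer for host and device

        for (int i = 0; i < n; ++i) data[i] = i;    // host writes directly
        increment<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();                    // host may read data[] directly afterwards

        cudaFree(data);
        return 0;
    }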