National Center for High-Performance Computing (NCHC) - Taiwan Computing Cloud (TWCC)

1. Case Summary

To address Taiwan's shortage of high-speed computing and storage resources in recent years, the National Center for High-Performance Computing (hereafter "NCHC"), with government support, built a national-level infrastructure for AI research, development, and cloud services. Known as "Taiwania 2" or "Taiwan Computing Cloud (TWCC)", it serves as a key foundation for AI development in Taiwan. The project covers three service areas: computing services (virtual machines, containers, and HPC), storage services (GPFS, block storage, and object storage), and network services (virtual network, load balancer, firewall, and VPN). Gemini Open Cloud integrates these computing, storage, and network services so that users can access their resources directly through APIs and run AI and big data services.

Industry

Public Sector / Government

Region

Taiwan

Use Case

  • Artificial Intelligence/Big Data
  • NVIDIA Tesla V100, OpenStack, Kubernetes, Slurm, Singularity, Ceph, GPFS
  • Dedicated resources (VM, container), jobs, and services (database, Hadoop, Kubernetes, etc.)
  • Virtual network, load balancer, firewall, VPN
  • Gemini API Gateway and Cloud Platform

2. Pain Points and Challenges

  • Public cloud usage scenarios must be supported with limited resources
  • Containerization technology for high-performance computing
  • Integration of multiple heterogeneous computing and storage resources
  • Complex authentication, authorization, and accounting
  • Remote sites must be painlessly integrated in the future

Because this platform operates as a public cloud, resource utilization must be measured and evaluated in detail. Each user's consumption across the different resources must also be aggregated and pushed to the corresponding NCHC system, which deducts fees and provides billing information to users. The architecture must also allow for remote resources, so that NCHC computing and storage resources at sites in other regions can later be made available to users on this platform.
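The metering requirement above can be illustrated with a small Python sketch that aggregates per-user usage across resource types into a single charge summary. The record fields, resource names, and unit rates are all hypothetical; the source does not describe the actual schema or pricing used by TWCC or iService.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    account: str     # account identifier (hypothetical format)
    resource: str    # e.g. "container", "vm", "hpc", "object_storage"
    quantity: float  # metered amount, in the unit the rate is quoted in


# Hypothetical unit prices per resource type; not actual TWCC rates.
RATES = {"container": 2.0, "vm": 1.0, "hpc": 3.0, "object_storage": 0.5}


def summarize_charges(records):
    """Aggregate usage records into per-account charges, broken down by resource."""
    totals = defaultdict(lambda: defaultdict(float))
    for record in records:
        totals[record.account][record.resource] += record.quantity * RATES[record.resource]
    return {
        account: {"by_resource": dict(per_resource), "total": sum(per_resource.values())}
        for account, per_resource in totals.items()
    }


records = [
    UsageRecord("alice", "container", 10.0),
    UsageRecord("alice", "object_storage", 4.0),
    UsageRecord("bob", "hpc", 2.0),
    UsageRecord("alice", "container", 5.0),
]
charges = summarize_charges(records)
```

In a real deployment the resulting summary would be what gets pushed to the accounting system; here it is simply a dictionary keyed by account.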

3. Architecture Design Features

This case combines a supercomputer for high-performance computing (HPC), container computing services (Kubernetes, Docker), virtual machine computing services (OpenStack), a very large object storage system (S3 storage), a parallel file system (GPFS), and the NCHC account service system (iService). These resources must be integrated around actual usage scenarios so that users can obtain them as easily as possible.

  • The API Gateway provides a single external entry point, and a single account can use all resources.
  • A single PaaS integrates diverse heterogeneous platforms, covering both computing and storage.
  • Software installation templates let administrators offer the software users need, and let users create the required service with one click.

Gemini Open Cloud provides two main software layers, Gemini Open Cloud Platform and Gemini API Gateway, and also supports Slurm and Singularity for HPC. It gives users direct access to resources through APIs or the CLI. The user portal is built on the same Gemini APIs, so users have a consistent experience whether they use the API or the UI.

In the Cloud Platform layer, the two major open source cloud computing projects, OpenStack and Kubernetes, are integrated to manage different types of virtualized resources, such as virtual machines and containers, and to provide peripheral AI services such as data preprocessing and inference. S3 storage is also integrated, and the relevant key information is provided per account. Through the platform integration technology of Gemini Open Cloud, big data and AI workloads can be combined into a complete end-to-end service.

The API Gateway connects to the Gemini Open Cloud (GOC) PaaS, the NCHC iService account service, and the Harbor registry for container images, allowing users to reach the full service through a single entry point. Users can also create their own sub-keys through the API Gateway, so that people without a platform account can also connect their own software through such a key.
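The sub-key idea can be sketched as a small registry that maps each issued key back to the account that created it. This is only an illustration of the concept; the class name, methods, and key format are hypothetical, and the source does not describe how Gemini API Gateway actually issues or validates sub-keys.

```python
import secrets


class SubKeyRegistry:
    """Toy in-memory registry: each sub-key resolves to the account that issued it."""

    def __init__(self):
        self._owners = {}  # sub-key -> (account, label)

    def issue(self, account, label):
        """Create a random sub-key on behalf of an account and remember its owner."""
        key = secrets.token_hex(16)  # 32 hexadecimal characters
        self._owners[key] = (account, label)
        return key

    def resolve(self, key):
        """Return the owning account for a sub-key, or None if unknown or revoked."""
        entry = self._owners.get(key)
        return entry[0] if entry else None

    def revoke(self, key):
        """Invalidate a sub-key; returns True if the key existed."""
        return self._owners.pop(key, None) is not None


registry = SubKeyRegistry()
partner_key = registry.issue("alice", "partner-app")
owner_before = registry.resolve(partner_key)
revoked = registry.revoke(partner_key)
owner_after = registry.resolve(partner_key)
```

A request carrying the sub-key would be attributed to the issuing account until the key is revoked, which lets an account holder share access without sharing the primary credential.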

To leave as much of the supercomputer's capacity as possible to users, the supporting software and services must be as lightweight as possible. The open source tools Slurm, for resource scheduling, and Singularity, for containerization, are therefore well suited to the supercomputer architecture.

Gemini Open Cloud provides API and CLI methods that let users schedule resources through Slurm and use Singularity to package the software frameworks required for AI computing. Users can thus tap the performance of the supercomputer directly, in the same environment they would have with Docker, which speeds up AI model training.
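As a rough illustration of how Slurm and Singularity fit together, the Python sketch below generates a Slurm batch script that runs a command inside a Singularity image with GPU support (`singularity exec --nv`). The function name, job name, image file, and training command are hypothetical, and this is not the script that TWCC or Gemini Open Cloud generates.

```python
import shlex


def build_sbatch_script(job_name, image, command, gpus=1, nodes=1, time_limit="01:00:00"):
    """Return the text of a Slurm batch script that runs `command` inside a Singularity image."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus}",   # GPUs requested per node
        f"#SBATCH --time={time_limit}",
        "",
        # --nv makes the host NVIDIA driver and devices available inside the container
        f"singularity exec --nv {shlex.quote(image)} {shlex.join(command)}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical image and command, for illustration only.
script = build_sbatch_script(
    "train-demo",
    "pytorch.sif",
    ["python", "train.py", "--epochs", "10"],
    gpus=8,
)
```

The resulting text could be written to a file and submitted with `sbatch`; a platform API or CLI would typically wrap this step so that users never edit the script by hand.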

Singularity, a lightweight container technology, helps users run AI workloads on the GPU resources of the supercomputer. The technology is used mostly in high-performance computing (HPC), and this deployment is one of the few successful cases in Taiwan.

4. Result

Business transformation

  • The high-speed network of the NCHC, coupled with world-class computing systems, provides the best big data and AI computing environment.
  • Industry, academia, and research institutions can obtain computing resources more efficiently without spending heavily to build their own computing environment.
  • Operating as a public cloud is expected to keep TWCC running sustainably, which will benefit technology development across industry, academia, and research.

NCHC built the TWCC (Taiwan Computing Cloud) platform and successfully delivers public cloud services, providing the best big data and AI model training environment for Taiwan's industry, academia, and research communities. In addition to computing, storage, and a secure network environment, the platform will also integrate AI models and tools developed by various sectors in Taiwan, as well as important domestic and international data sets, to promote the development of domestic AI and big data and to realize the vision of Smart Taiwan.

IT transformation

  • Developers can quickly deploy development environments, improving work efficiency compared with the past.
  • High-speed parallel computing across nodes improves performance compared with existing services.
  • Industry, academia, and research communities are gradually migrating AI model training to this platform, accelerating the development of related domestic technologies.

NCHC developed a dedicated CLI on top of the APIs provided by the Gemini Open Cloud team, allowing users to create and delete container services directly from the command line. Most functions of the web user interface are also built on APIs provided by Gemini Open Cloud, so users get the same information about virtual machines, containers, network services, storage volumes, and so on, whether they use the CLI, the web UI, or the API. The platform also provides a multi-tenant environment that isolates each tenant's virtual resources at the subnet level for stronger protection.
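Subnet-level tenant isolation can be pictured with a short Python sketch that carves non-overlapping subnets out of one address pool, one per tenant. The address pool, tenant names, and helper functions are hypothetical, and the source does not describe how TWCC actually allocates or enforces tenant networks.

```python
import ipaddress


def allocate_tenant_subnets(pool_cidr, tenants, prefix=24):
    """Assign each tenant its own non-overlapping subnet carved from one address pool."""
    pool = ipaddress.ip_network(pool_cidr)
    subnets = pool.subnets(new_prefix=prefix)  # generator of consecutive subnets
    allocation = {}
    for tenant in tenants:
        try:
            allocation[tenant] = next(subnets)
        except StopIteration:
            raise ValueError("address pool exhausted") from None
    return allocation


def same_tenant(allocation, ip_a, ip_b):
    """True only if both addresses fall inside the same tenant subnet."""
    a, b = ipaddress.ip_address(ip_a), ipaddress.ip_address(ip_b)
    return any(a in net and b in net for net in allocation.values())


# Hypothetical pool and tenant names, for illustration only.
allocation = allocate_tenant_subnets("10.0.0.0/16", ["tenant-a", "tenant-b"])
```

Because the subnets never overlap, a policy that only permits traffic within one subnet keeps the virtual resources of different tenants apart.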

The software and hardware teams involved in this case are large. Adopting Gemini Open Cloud drastically reduced the time that starting from scratch would have required, and services such as OpenStack and Kubernetes needed only slight adjustment to meet the requirements. Unlike other cases, this one provides many types of computing and storage services and must also connect to the NCHC account service (iService). The Gemini Open Cloud team therefore delivered a large amount of customization and provided complete professional services to NCHC, from requirements interviews and technical details to education and training.