New Cloud Solutions Using Groq’s Tensor Streaming Processor


  • Groq’s tensor streaming processor (TSP) silicon is now available to accelerate customers’ AI workloads in the cloud.
  • Recent results published by the company show that the chip can achieve up to 21,700 inferences per second.

Cloud service provider Nimbix now offers machine learning acceleration on Groq hardware as an on-demand service. While several startups are building AI silicon for the data centre, Groq and Graphcore are the only two with accelerators commercially available to customers, in both cases through a cloud service. Graphcore previously announced that its accelerators are available as part of Microsoft Azure.

Steve Hebert, CEO of Nimbix, said in a statement:

“Groq’s simplified processing architecture is unique, providing unprecedented, deterministic performance for compute-intensive workloads, and is an exciting addition to our cloud-based AI and Deep Learning platform.”

Groq’s PetaOp-capable architecture was used to create the Tensor Streaming Processor shown on this PCIe board, which is currently being tested by customers.

Performs up to 1,000 TOPS


According to the company, Groq’s TSP chip can perform up to 1,000 TOPS (1 peta operations per second). Recent results published by the company show that the chip can achieve up to 21,700 inferences per second on ResNet-50. According to Groq, this throughput is more than double that of today’s GPU-based systems. On the strength of these results, the company claims its architecture is one of the fastest commercially available neural network processors in the world.
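To put the quoted figures in perspective, the short sketch below converts them into common units. It only restates the numbers given above; the per-inference values are simple averages that assume inferences are processed back to back at the quoted rate, which is an illustrative assumption and not a measurement published by Groq.

```python
# Unit conversions for the figures quoted in the article.
TERA = 10**12
PETA = 10**15

peak_tops = 1_000                       # peak rate quoted by Groq, in TOPS
peak_ops_per_second = peak_tops * TERA  # 1,000 tera-ops = 1 peta-op per second

inferences_per_second = 21_700          # ResNet-50 throughput quoted by Groq

# Assumption: inferences run one after another at the quoted rate,
# so the average time budget per inference is simply the reciprocal.
microseconds_per_inference = 1_000_000 / inferences_per_second

# Upper bound on operations available per inference at the peak rate.
ops_budget_per_inference = peak_ops_per_second / inferences_per_second

print(f"Peak rate: {peak_ops_per_second / PETA:.0f} peta-ops per second")
print(f"Average time per inference: {microseconds_per_inference:.1f} microseconds")
print(f"Peak operation budget per inference: {ops_budget_per_inference:.2e}")
```

The operation budget is an upper bound derived from the peak rate, so it says nothing about how many operations a ResNet-50 inference actually uses on the chip.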

Jonathan Ross, Groq’s co-founder and CEO, said:

“These ResNet-50 results are a validation that Groq’s unique architecture and approach to machine learning acceleration delivers substantially faster inference performance than our competitors. These real-world proof points, based on industry-standard benchmarks and not simulations or hardware emulation, confirm the measurable performance gains for machine learning and artificial intelligence applications made possible by Groq’s technologies.”

Can achieve massive parallelism

According to the company, this architecture can achieve the massive parallelism required for deep learning acceleration without the synchronisation overhead of traditional CPU and GPU architectures. The TSP chip is said to offer a moderate 2.5x latency advantage over GPUs at large batch sizes.

Moves control from the silicon to the compiler

Unlike in traditional architectures, the control features have been removed from the silicon and handed to the compiler. This leads to predictable, deterministic operation orchestrated by the compiler, and it allows performance parameters to be fully understood at compile time.
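The idea of knowing performance at compile time can be illustrated with a toy static scheduler. This is a minimal sketch and not Groq’s compiler: the `Op` type, the operation names, and the cycle counts are all hypothetical. It only shows that, when every operation has a fixed cost and nothing is decided at run time, start times and total latency can be computed before the program ever runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A hypothetical operation with a fixed, known cost in cycles."""
    name: str
    cycles: int


def static_schedule(ops):
    """Assign each op a start cycle and return (schedule, total_cycles).

    Because every cost is fixed, the result is fully determined
    ahead of time, with no run-time arbitration involved.
    """
    schedule = []
    cycle = 0
    for op in ops:
        schedule.append((op.name, cycle))
        cycle += op.cycles
    return schedule, cycle


# Hypothetical program; names and cycle counts are illustrative only.
program = [
    Op("load_weights", 4),
    Op("matmul", 16),
    Op("relu", 2),
    Op("store", 4),
]

schedule, total_cycles = static_schedule(program)
for name, start in schedule:
    print(f"cycle {start:>3}: {name}")
print(f"total latency known ahead of time: {total_cycles} cycles")
```

A real compiler would also place operations on parallel functional units, but the same principle applies: fixed costs make the schedule, and therefore the latency, reproducible from run to run.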

Elimination of the batching process

Another key feature of the processor is that it eliminates the need for batching. Batching is a common technique in the data centre in which multiple data samples are processed at a time. It is used to improve the maximum throughput of the system.

According to Groq, its architecture can reach peak performance even at a batch size of one. Hence, it can fulfil the requirements of inference applications that work on a stream of data arriving in real time.
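The trade-off behind batching can be sketched with a simple cost model. The model and its numbers are assumptions chosen for illustration, not measurements of any GPU or of the TSP: each batch pays a fixed overhead plus a per-sample compute time. With a large overhead, throughput only approaches its peak at large batches, at the cost of higher latency; with no overhead, a batch of one already runs at peak throughput.

```python
def batch_stats(batch_size, overhead_ms, per_sample_ms):
    """Return (latency_ms, throughput_per_second) for one batch.

    Toy model: latency = fixed overhead + batch_size * per-sample time.
    """
    latency_ms = overhead_ms + batch_size * per_sample_ms
    throughput = batch_size / latency_ms * 1000.0
    return latency_ms, throughput


# Hypothetical device with a 2 ms fixed overhead per batch.
for batch_size in (1, 8, 64):
    latency, throughput = batch_stats(batch_size, overhead_ms=2.0, per_sample_ms=0.1)
    print(f"overhead 2 ms, batch {batch_size:>2}: "
          f"{latency:.1f} ms latency, {throughput:.0f} samples/s")

# Hypothetical device with no per-batch overhead: batch size one is enough.
latency, throughput = batch_stats(1, overhead_ms=0.0, per_sample_ms=0.1)
print(f"overhead 0 ms, batch  1: {latency:.1f} ms latency, {throughput:.0f} samples/s")
```

In this model the second device serves a real-time stream well, since each sample can be processed as soon as it arrives without waiting for a batch to fill.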



