How do I take advantage of AWS Inferentia’s NeuronCore Pipeline capability to lower latency in Amazon EC2?



Inf1 instances with multiple Inferentia chips, such as inf1.6xlarge or inf1.24xlarge, provide a fast chip-to-chip interconnect. Using the NeuronCore Pipeline capability, you can split your model and load it into local cache memory across multiple chips. The Neuron compiler uses an ahead-of-time (AOT) compilation technique: it analyzes the input model and compiles it to fit across the on-chip memory of one or more Inferentia chips. This gives the NeuronCores high-speed access to the model without round trips to off-chip memory, keeping latency bounded while increasing overall inference throughput.
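As a rough sketch of how you would size a pipeline in practice: the Neuron compiler accepts a `--neuroncore-pipeline-cores` flag, and the Neuron SDK documentation suggests a sizing heuristic of roughly `4 * round(weights / 2e7)` NeuronCores. The helper below builds the compiler arguments from that heuristic; treat the exact formula and flag spelling as assumptions to verify against your SDK version.

```python
# Sketch: estimate how many NeuronCores are needed to hold a model's
# weights on-chip, and build the matching neuron-cc compiler arguments.
# The heuristic (4 * round(weights / 2e7)) follows Neuron SDK guidance;
# verify it against your SDK version before relying on it.

def recommended_pipeline_cores(num_weights: int) -> int:
    """Estimate NeuronCores needed to cache a model's weights on-chip."""
    return max(1, 4 * round(num_weights / 2e7))

def pipeline_compiler_args(num_weights: int) -> list:
    """Build neuron-cc arguments enabling NeuronCore Pipeline."""
    cores = recommended_pipeline_cores(num_weights)
    return ["--neuroncore-pipeline-cores", str(cores)]

# Example: a ~100M-parameter model spans many NeuronCores
# (each Inferentia chip has 4 NeuronCores).
print(pipeline_compiler_args(100_000_000))
```

With the PyTorch-Neuron integration, these arguments would typically be passed through the trace call's `compiler_args` parameter (e.g. `torch.neuron.trace(model, example_inputs, compiler_args=args)`); that step requires the Neuron SDK and Inf1 hardware, so it is only noted here.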

