May 26, 2020 in Amazon Elastic Compute Cloud EC2
Q: How do I take advantage of AWS Inferentia’s NeuronCore Pipeline capability to lower latency in Amazon EC2?

1 Answer

0 votes
May 26, 2020

Inf1 instances with multiple Inferentia chips, such as inf1.6xlarge or inf1.24xlarge, support a fast chip-to-chip interconnect. Using the NeuronCore Pipeline capability, you can split your model and load it into local cache memory across multiple chips. The Neuron compiler uses an ahead-of-time (AOT) compilation technique to analyze the input model and partition it to fit across the on-chip memory of one or more Inferentia chips. This gives the NeuronCores high-speed access to the model without round trips to off-chip memory, keeping latency bounded while increasing overall inference throughput. See the sketch below for how the pipeline core count is passed to the compiler.
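As a minimal sketch, here is how this might look with the PyTorch-Neuron flow: the model is compiled ahead of time and the number of NeuronCores to pipeline across is passed as a compiler argument. This assumes the torch-neuron torch.neuron.trace API and the neuron-cc --neuroncore-pipeline-cores flag; the ResNet-50 model and the core count of 16 (inf1.6xlarge has 4 Inferentia chips with 4 NeuronCores each) are illustrative, and exact flag names can vary by Neuron SDK version.

```python
import torch
import torch_neuron  # registers torch.neuron (AWS Neuron SDK for PyTorch)
from torchvision import models

# Load a pretrained model and switch to inference mode
model = models.resnet50(pretrained=True)
model.eval()

# Example input used by the AOT compiler to trace the model
example = torch.rand(1, 3, 224, 224)

# Compile ahead of time, asking the Neuron compiler to partition the model
# across 16 NeuronCores (illustrative value for inf1.6xlarge: 4 chips x 4 cores)
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--neuroncore-pipeline-cores', '16'],
)

# Save the compiled artifact; load it on the Inf1 instance to run pipelined inference
model_neuron.save('resnet50_neuron_pipeline.pt')
```

At runtime the compiled model keeps its weights resident in the on-chip caches of the pipelined NeuronCores, which is what avoids the off-chip memory accesses described above.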
