Investment in Artificial Intelligence (AI) chips has heated up over the past few years, including from the big titans of the software world, after they found that computing hardware had become a bottleneck for the heavy lifting of Machine Learning. The reasons include:

  • Traditional processor architectures are inefficient for data-heavy workloads,
  • The slowing of Moore’s Law (which drove computing performance for decades but is no longer the sole decisive factor),
  • Different uses (inference, training, or real-time continuous learning),
  • Diversified and distributed AI computing (extremely low power for IoT edge computing or very high throughput for complex Deep Learning algorithms),
  • The need for coherent optimization across the full software and hardware stack.

There are literally hundreds, if not thousands, of fabless AI chip startups, and dozens of programs in established semiconductor companies to create AI chips or embed AI technology in other parts of the product line. (source: Cadence blog)

An array of innovations in AI chip design has been developed, and some have already shipped. For example, memory efficiency, or in-memory processing capability, has become an essential feature for handling data-heavy AI workloads. New-generation architectures aim to reduce the delay in getting data to the processing point, with strategies ranging from near-memory designs to processor-in-memory (PIM) designs.
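To see why data movement matters so much, consider a back-of-the-envelope model. The per-access energy numbers below are hypothetical placeholders (real values depend on the process node and memory technology); the point is only that fetching operands from off-chip DRAM can cost far more energy than the arithmetic itself, which is exactly what near-memory and PIM designs try to avoid.

```python
# Toy energy model for a single layer: how much energy goes to arithmetic
# versus moving operands, under hypothetical per-operation energy costs.
# The constants below are illustrative placeholders, not measured values.

MAC_ENERGY_PJ = 1.0        # energy per multiply-accumulate (assumed)
DRAM_ACCESS_PJ = 500.0     # energy per 4-byte off-chip DRAM fetch (assumed)
SRAM_ACCESS_PJ = 5.0       # energy per 4-byte near-memory SRAM fetch (assumed)

def layer_energy_uj(num_macs, operand_fetches, fetch_energy_pj):
    """Total energy in microjoules = compute energy + data-movement energy."""
    compute = num_macs * MAC_ENERGY_PJ
    movement = operand_fetches * fetch_energy_pj
    return (compute + movement) / 1e6  # pJ -> uJ

# A layer with 1M MACs that fetches two operands per MAC:
macs, fetches = 1_000_000, 2_000_000
print("All operands from DRAM   :", layer_energy_uj(macs, fetches, DRAM_ACCESS_PJ), "uJ")
print("Operands from near memory:", layer_energy_uj(macs, fetches, SRAM_ACCESS_PJ), "uJ")
```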

A popular benchmark for comparing their performance is TOPS (tera operations per second), a measure of ideal peak performance, and it has grown by orders of magnitude over the last couple of years. But TOPS is far from the whole story of real performance once these AI chips are deployed in actual applications. Considering the explosion of data, the huge energy consumption of AI computing raising environmental concerns, and the extremely low power budgets of edge AI, comparing the highest performance per watt is more meaningful.
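As a quick illustration of why peak TOPS alone can mislead, here is a minimal sketch that derives peak TOPS from MAC count and clock frequency (counting a multiply-accumulate as two operations, the usual convention) and then compares chips on TOPS per watt. The chip parameters are made up for the example.

```python
# Minimal sketch: peak TOPS vs. TOPS per watt for two hypothetical chips.

def peak_tops(num_macs: int, clock_ghz: float) -> float:
    # One MAC = 2 ops (multiply + add); TOPS = tera operations per second.
    return 2 * num_macs * clock_ghz * 1e9 / 1e12

chips = {
    # name: (MAC units, clock in GHz, typical power in watts) - assumed values
    "chip_a": (16_384, 1.2, 75.0),
    "chip_b": (4_096, 1.0, 5.0),
}

for name, (macs, ghz, watts) in chips.items():
    tops = peak_tops(macs, ghz)
    print(f"{name}: {tops:.1f} peak TOPS, {tops / watts:.2f} TOPS/W")
# chip_a wins on raw TOPS, but chip_b may be the better fit
# for a power-constrained edge device.
```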

There are many ways to improve performance per watt, and not just in hardware or software. Kunle Olukotun, Cadence Design Systems Professor of electrical engineering and computer science at Stanford University, said that relaxing precision, synchronization and cache coherence can reduce the amount of data that needs to be sent back and forth. That can be reduced even further by domain-specific languages, which do not require translation. (more discussion here)
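The precision point is easy to make concrete: storing or transferring weights in int8 instead of float32 cuts the bytes moved by 4x. The sketch below uses a simple symmetric quantization scheme purely as an illustration; production toolchains use more sophisticated calibration.

```python
import numpy as np

# Illustrative symmetric quantization of float32 weights to int8.
weights_fp32 = np.random.randn(1024, 1024).astype(np.float32)

scale = np.abs(weights_fp32).max() / 127.0   # one scale for the whole tensor
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

print("float32 bytes:", weights_fp32.nbytes)   # 4 MiB
print("int8 bytes   :", weights_int8.nbytes)   # 1 MiB -> 4x less data to move

# Dequantize to check the approximation error introduced by relaxed precision.
error = np.abs(weights_fp32 - weights_int8.astype(np.float32) * scale).max()
print("max abs error:", error)
```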

An important consideration for AI and other high-performance devices is that actual performance is not known until the end application is run. This raises questions for the many AI processor startups that insist they can build better hardware accelerators for matrix math and other AI algorithms than their competitors.

All inference accelerators today are programmable because customers believe their models will evolve over time. This programmability will allow them to take advantage of future enhancements, something that would not be possible with hard-wired accelerators. However, customers want this programmability while still getting the most throughput for a given cost and power budget. This means the hardware has to be used very efficiently, and the only way to do that is to design the software in parallel with the hardware so the two work together to achieve maximum throughput.

One of the biggest problems today is that companies find themselves with an inference chip that has lots of MACs (multiplier–accumulator units) and tons of memory, yet actual throughput on real-world models is lower than expected because much of the hardware sits idle. In almost every case, the problem is that the software work was done after the hardware was built. During the development phase, designers have to make many architectural tradeoffs, and they cannot make those tradeoffs without working with both the hardware and the software, early on. Chip designers then build a performance estimation model to determine how different amounts of memory, MACs, and DRAM would change throughput and die size, and how the compute units need to coordinate for different kinds of models.
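A hedged sketch of what such a performance estimation model can look like, reduced to a simple roofline-style calculation: peak compute is set by the MAC array and clock, the memory bound by DRAM bandwidth and the model's arithmetic intensity, and achieved throughput is the smaller of the two. All parameters here are illustrative assumptions, not a real design point.

```python
# Toy roofline-style estimator: is a hypothetical accelerator compute-bound
# or memory-bound on a given model, and what MAC utilization does that imply?

def estimate_throughput(num_macs, clock_ghz, dram_gb_s, ops_per_byte):
    """Return (achieved TOPS, MAC utilization) under a roofline assumption."""
    peak_tops = 2 * num_macs * clock_ghz * 1e9 / 1e12   # one MAC = 2 ops
    memory_bound_tops = dram_gb_s * ops_per_byte / 1e3   # GB/s * ops/byte -> GOPS -> TOPS
    achieved = min(peak_tops, memory_bound_tops)
    return achieved, achieved / peak_tops

# Hypothetical design point: 8K MACs at 1 GHz with 50 GB/s of DRAM bandwidth.
workloads = [("large batch, high data reuse", 400),
             ("batch-1 inference, low reuse", 40)]
for name, intensity in workloads:  # intensity = ops per byte fetched from DRAM
    tops, util = estimate_throughput(8_192, 1.0, 50, intensity)
    print(f"{name}: {tops:.1f} TOPS achieved, {util:.0%} of the MACs busy")
```

For the low-reuse workload the same silicon delivers only a fraction of its peak TOPS, which is exactly the "lots of MACs sitting idle" situation described above.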

Semiconductor companies and AI computing players need to coalesce their strategies around hardware innovations and software that reduces complexity and offers wide appeal to the developer ecosystem, which has proved crucial to the success of IP and chip vendors over decades of IC industry history.

To cultivate a user community and integrate well with the wider ecosystem, hardware players need not only to offer great interfaces and a software suite compatible with popular frameworks/libraries (such as TensorFlow and MXNet) but also to closely follow the evolution of models and application needs and respond accordingly. In addition, the programming development environment needs to support programming and balancing workloads across the different types of processing units suited to different algorithms (heavily scalar, vector, or matrix, etc.).
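As a rough illustration of that last point, the toy dispatcher below routes operations to scalar, vector, or matrix units based on operand shape. The unit names and rules are invented for the example; real toolchains make this decision in the compiler with far more detail (tiling, fusion, memory placement).

```python
# Toy dispatcher: pick a compute-unit class for an op based on operand rank.
# Unit names and rules are illustrative only.

from typing import Sequence, Tuple

def pick_unit(op_name: str, operand_shapes: Sequence[Tuple[int, ...]]) -> str:
    ranks = [len(shape) for shape in operand_shapes]
    if any(r >= 2 for r in ranks):
        return "matrix_unit"   # e.g. matmul, convolution (after lowering)
    if any(r == 1 for r in ranks):
        return "vector_unit"   # e.g. elementwise ops over a tensor
    return "scalar_core"       # e.g. control flow, scalar bookkeeping

graph = [
    ("matmul",   [(128, 512), (512, 256)]),
    ("relu",     [(256,)]),
    ("lr_decay", [()]),
]

for op, shapes in graph:
    print(f"{op:10s} -> {pick_unit(op, shapes)}")
```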

Note: If you need to learn about AI chip basics and the economics around them, a report published in April 2020 by the Center for Security and Emerging Technology is a good read: AI Chips: What They Are and Why They Matter.