TSMC: And Then There Was One
Why out-year AI CapEx projections are likely too low, and why Intel is unlikely to disrupt the TSMC monopoly
TSMC (2330 TW) looks very compelling at mid-teens NTM P/E ex. net cash.1 Management believes they can likely grow top-line at a 20% CAGR through 2029. This seems reasonable: industry grows high-single digits, logic grows a few points faster, and TSMC grows a few points faster with market share gains. Margins will expand and hence earnings will grow in excess of 20% CAGR. HPC, which includes AI-related revenue, is already 60% of revenue mix and growing, and is the major segment driving overall growth.
Three years out, this is potentially trading at roughly 7-8x 2028 earnings ex. net cash, assuming all cash generated is retained. Looking back, the shares have mostly traded within a 20-25x P/E range, suggesting a potential triple in three years if historical trading ranges hold.
The sell-side, however, is modeling a steep deceleration in revenue growth. Consensus is at 27%, 17%, and 18% revenue growth for 2025, 2026, and 2027 respectively, implying a “CapEx cliff” in 2026.
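As a rough sanity check on that math, here is a back-of-the-envelope sketch. The inputs are my illustrative assumptions (a 15x starting multiple for "mid-teens" and a 22% earnings CAGR for "in excess of 20%"), not management or consensus figures.

```python
# Back-of-the-envelope: what a re-rating to historical multiples could imply.
# All inputs are illustrative assumptions, not forecasts.
start_pe = 15.0   # assumed current NTM P/E ex. net cash ("mid-teens")
eps_cagr = 0.22   # assumed earnings CAGR, somewhat above ~20% top-line growth
years = 3

eps_growth = (1 + eps_cagr) ** years
implied_2028_pe = start_pe / eps_growth
print(f"Implied P/E on 2028 earnings: {implied_2028_pe:.1f}x")  # ~8x

for exit_pe in (20, 25):
    multiple_return = exit_pe / implied_2028_pe
    print(f"Re-rating to {exit_pe}x implies a ~{multiple_return:.1f}x return "
          f"(before retained cash)")
```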
I don’t know what the buy-side thinks, but what valuation multiple would you pay for an effective monopoly growing top-line nearing 20% and earnings faster than that, levered to probably one of the biggest paradigm shifts in computing? I would argue that it would be a multiple significantly higher than mid-teens NTM P/E ex. cash. And that delta (which I think is substantial) would implicitly reflect buy-side expectations.
But let's work with what we have - sell-side numbers. They are implicitly saying that AI CapEx growth is unsustainable at current rates and given the scale of the initial ramp, there should be a “digestion” period where AI CapEx growth takes a significant step down before accelerating slightly.
The main reasons are twofold - one shorter-term, one longer-term. The former is supply chain bottlenecks at the server makers, and the latter revolves around questionable ROI on AI CapEx. I think both concerns are largely overblown.
State of the AI cycle
GB200 assembly yields
Early 2025 brought justified concerns. Nvidia's GB200 NVL72 assembly at ODMs like Quanta, Wiwynn, Wistron, and Gigabyte was struggling, plagued by poor yields and delayed shipments. This wasn't entirely unexpected; these systems are exceptionally complex, integrating 36 CPUs and 72 GPUs per rack via NVLink. The delays led to inventory piling up at the ODMs, causing significant negative cash flow. If this continued, these manufacturers wouldn't be in a position to place further orders with Nvidia. However, assembly yields have likely seen steady improvement. This is evidenced by a substantial acceleration in monthly revenue at most of these ODMs as we moved from late Q1 2025 into early Q2 2025.
ROI on AI CapEx
a. GPU Useful Life
More broadly, there has been widespread skepticism regarding the ROI on AI CapEx in recent quarters. This has become increasingly difficult to ignore as growing capital intensity at the hyperscalers becomes a material drag on free cash flow.
The critical question rests on the useful life of datacenter GPUs and the economics they can generate across that useful life.
If the latest GPUs become obsolete and uneconomic within a year or two of introduction, ROI on AI CapEx would be hugely negative. And hence hyperscalers et al would substantially scale back AI CapEx growth, and consensus expectations would become reality.
In my view, the consensus is vastly underestimating the useful life of GPUs and thus their lifetime economics. I have previously shown that GPUs launched 5-8 years ago are still economically valuable even assuming original MSRPs. I will expand on this and argue that GPUs launched just a few years ago (e.g. A100s, H100s) or currently ramping (e.g. B100/B200) are likely to have an economic useful life significantly longer than the oft-cited 2 years.
The main driver of this extended useful life is the explosion in LLM inference demand.
LLM development has two stages: training and inference. Each possesses unique computational profiles, resource requirements, and economic implications.
Training is where a model acquires knowledge. This involves feeding a neural network massive datasets so it can learn statistical patterns and relationships within the data. Computationally, training is highly intensive. It requires not only a forward pass of data through the network but also a backward pass to calculate gradients and update the model's parameters. This is repeated iteratively. Consequently, training is a large-scale, CapEx-heavy event. It is a one-time (per model) cost, requiring large GPU clusters to complete in a reasonable timeframe. While throughput is paramount, latency is a secondary concern.
Inference is the application of said knowledge. It is the "real-world" deployment where the trained model makes predictions, generates text, etc. In contrast to the (mostly) one-off nature of training, inference is an ongoing process that scales directly with user engagement and app usage. Every prompt or API call triggers an inference task, making its cumulative computational cost over the life of an application potentially far greater than the initial training cost. This transforms the equation from one of CapEx to one of OpEx.
Essentially, the technical demands of training and inference differ significantly, specifically on compute, memory, and latency. The economic trade-off is thus between throughput (compute and memory), latency, and total cost of ownership.
Latency-sensitive workloads are typically real-time, interactive applications where user experience is paramount. Examples include customer-facing chatbots, AI copilots integrated into software, and real-time data analysis tools. For these use cases, low time-to-first-token and low time-per-output-token are non-negotiable, as users are sensitive to any latency. To achieve the lowest possible latency, these apps run with small batch sizes to avoid waiting for other requests to be processed. This minimal to no batching under-utilizes the GPUs’ parallel processing capabilities but minimizes wait time. Hence, these premium workloads demand the absolute best performance and are suited to the latest and greatest GPUs, such as the H100/H200, B100/B200, and their successors.
Throughput-sensitive workloads are enterprise AI workloads that are not interactive (or where immediate output is not expected, i.e. reasoning-intensive queries) and therefore much less sensitive to the latency of any single request. These include asynchronous tasks like batch processing of documents for summarization, offline analytics, deep research, report generation, content moderation, and internal data classification. For these apps, the key metric is not speed but throughput. To maximize throughput and drive down cost-per-token, providers use dynamic batching where multiple independent user requests are grouped together and fed to the GPU simultaneously. This allows the GPUs’ thousands of cores to be fully utilized, dramatically increasing the number of tokens processed per second. While the downside is higher latency, the upside is significantly greater overall system efficiency and lower cost per inference.
The existence of this large category of throughput-oriented workloads creates a structural demand for older, "good enough" hardware. An older, fully depreciated A100, while slower than a new B200 for a single, latency-sensitive query, can be highly cost-effective for throughput-sensitive workloads. When running large, batched workloads, the A100 can be driven to high utilization, delivering a lower TCO for that workload than a brand-new, expensive B200 that might be under-utilized.
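A toy model of that trade-off is sketched below. Every hourly cost, throughput figure, and utilization rate is a hypothetical placeholder I've chosen for illustration; none comes from a vendor or from the analysis above.

```python
# Toy cost-per-token comparison for a batched, throughput-oriented workload.
# All numbers are hypothetical placeholders, for illustration only.
def cost_per_million_tokens(hourly_cost_usd, tokens_per_sec, utilization):
    """Effective cost per 1M tokens given hourly cost, throughput, and utilization."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Hypothetical: fully depreciated A100 running near-saturated on large batches.
a100 = cost_per_million_tokens(hourly_cost_usd=0.8, tokens_per_sec=3_000, utilization=0.9)

# Hypothetical: new B200 with far higher peak throughput, but priced higher and
# under-utilized if reserved for small-batch, latency-sensitive traffic.
b200 = cost_per_million_tokens(hourly_cost_usd=6.0, tokens_per_sec=20_000, utilization=0.3)

print(f"A100 (batched, high utilization): ${a100:.3f} per 1M tokens")
print(f"B200 (under-utilized):            ${b200:.3f} per 1M tokens")
```

Under these assumed inputs the depreciated A100 delivers the cheaper token, which is the TCO point being made; with different assumptions the comparison could flip, which is exactly why operators tier their fleets.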
This creates a situation where hyperscalers and enterprises will deploy their newest, most powerful GPUs for latency-critical tasks, while repurposing prior-generation GPUs to serve the massive, cost-sensitive market for batch inference. This dynamic fundamentally alters the traditional IT depreciation curve, giving older hardware an economically valuable and extended useful life.
Essentially, an A100 purchased in 2021 for foundational model training can be strategically repurposed in 2024 for a premium, low-latency inference tier. By 2026, as even faster GPUs (i.e. B100/B200) take over that role, the same A100 can be shifted again to a bulk, low-cost, throughput-oriented inference tier. This deployment model extends the useful economic life of the asset from the oft-cited 2 years to a more favorable 6-7 years.
Real-world evidence supports this model of extended lifecycles. Azure's public hardware retirement policies provide a clear precedent. For example, Azure announced the retirement of its original NC, NCv2, and ND-series VMs (powered by Nvidia K80, P100, and P40 GPUs) for August/September 2023. Given these GPUs were launched between 2014 and 2016, this implies a useful service life of 7-9 years. More recently, the retirement of the NCv3-series (powered by Nvidia V100 GPUs) was announced for September 2025, approximately 7.5 years after the V100's launch. This demonstrates the viability of extracting value from GPUs over a much longer period than the consensus implies.
There are other reasons (e.g. software optimization improving performance of prior-gen GPUs beyond their initial capabilities, model architecture, etc) which further drive the viability of prior-generation GPUs over a longer useful life that I won’t get into. But the main point is that the rapid growth in inference demand has extended the useful life and lifetime economics of prior-generation GPUs and hence will drive healthy ROI on AI CapEx.
Other points of note: hyperscaler operating margins and ROIC are still expanding, and their balance sheets remain very cash-rich, implying there is still a lot of runway for AI CapEx to grow. Non-hyperscaler demand (e.g. other enterprises, neoclouds, sovereigns) is also diversifying and growing, as evidenced by Nvidia commentary.
b. Neocloud Pricing
While some may point to declining neocloud pricing for, say, H100/H200 rentals as evidence of quickly eroding GPU rental economics, this ignores the broader context. Neocloud pricing is much more commoditized because they (mostly) offer only GPU rentals and nothing else, whereas hyperscalers offer solutions (hence their much higher pricing for the same compute), so overly focusing on neocloud pricing misses the point.
Most smaller neoclouds will probably fail, but those that can differentiate beyond plain-vanilla GPU rentals (e.g. CoreWeave) are likely to be long-term winners. And the vast majority of AI CapEx demand comes from the hyperscalers, large tech companies (e.g. Oracle), the (likely) structural winners among neoclouds (e.g. CoreWeave), and sovereigns (e.g. the Middle East), so small neoclouds don’t matter much.
c. LLM TAM
Some critics point to OpenAI’s current revenue run-rate and compare that against hyperscaler CapEx and estimated GPU useful life to imply much of that CapEx is unsustainable. We’ve already established that GPU useful life is much longer than the conventional view so I’ll focus on OpenAI.
As of June 2025, ChatGPT is at a $10 billion ARR run-rate. Sell-side is mostly modeling 20 million subscribers, which is probably broadly reflective of reality as it would imply $42/month ARPU, which suggests the vast majority of subs skew towards Plus ($20/mo), with a meaningful minority on Pro ($200/mo). They have over 800 million weekly active users, which likely implies over 1 billion monthly active users. This suggests subscriber penetration on MAUs of 2% or less. This is a fraction of what other “killer” consumer apps such as Spotify, Netflix, etc are at.
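The arithmetic behind those figures, using only the numbers cited above, is straightforward:

```python
# Back-of-the-envelope on ChatGPT ARPU and subscriber penetration,
# using the figures cited in the text.
arr = 10e9            # ~$10B ARR run-rate (June 2025)
subscribers = 20e6    # sell-side modeled subscriber count
maus = 1e9            # implied monthly active users (from 800M+ WAUs)

arpu_monthly = arr / subscribers / 12
penetration = subscribers / maus

print(f"Implied ARPU: ${arpu_monthly:.0f}/month")             # ~$42/month
print(f"Subscriber penetration of MAUs: {penetration:.1%}")   # ~2%
```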
And LLMs are clearly a “killer” consumer app given ChatGPT scaled from inception to over 1 billion MAUs within three years. Given that most LLM usage, especially usage of a non-commercial nature, overlaps with Google Search, average user lifetime on LLMs is likely to be measured in decades, not a few years.
Now, I have no idea where sub penetration will end up, but I’m betting it's not going to stay at 2% or less. We can argue about whether OpenAI will be the ultimate winner, but it's clear the eventual TAM is much larger than OpenAI’s current $10 billion run-rate. It does not take super-aggressive assumptions to size a mature TAM in excess of $100 billion, and it would be easy to make the case for a meaningfully larger number, especially considering the pace of consumer adoption.
Compare that $100 billion+ mature TAM to $320 billion in hyperscaler CapEx spend this year on GPUs with useful lives likely exceeding 6-7 years, and AI CapEx doesn’t appear excessive.
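One way to frame that comparison is to annualize this year's spend over the assumed useful life; the figures below come from the paragraphs above (the 2-year case is the "oft-cited" bear assumption).

```python
# Annualizing this year's hyperscaler AI CapEx over different useful-life
# assumptions and comparing it to a conservative $100B+ mature LLM TAM.
capex = 320e9        # ~$320B hyperscaler CapEx this year (figure from the text)
mature_tam = 100e9   # conservative mature consumer LLM TAM (author's floor)

for useful_life_years in (2, 6, 7):
    annualized = capex / useful_life_years
    coverage = mature_tam / annualized
    print(f"{useful_life_years}-yr life: ~${annualized/1e9:.0f}B/yr of depreciation, "
          f"TAM covers ~{coverage:.1f}x of it")
```

On a 2-year life the TAM covers well under 1x of annualized depreciation; at 6-7 years it covers roughly 2x, before counting enterprise and API demand.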
Further, OpenAI has noted that it costs them 0.34 watt-hours to serve each query on average, and industrial electricity prices in the US are around $0.10/kWh. Given rate limits on the free, Plus, and Pro plans, the variable cost to serve each incremental query is very low; a Plus subscriber would have to run nearly 600,000 queries a month for OpenAI to have negative contribution margin on them, and rate limits obviously ensure no one gets anywhere near that amount. So this is likely a very profitable business at maturity, and it is the fixed cost of training, R&D, etc. that drives current unprofitability.
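A quick check of that claim, using the 0.34 Wh per query and $0.10/kWh figures above (and ignoring non-electricity variable costs, which this simple version does not model):

```python
# Variable electricity cost per query and implied break-even query volume
# for a $20/month Plus subscriber, using the figures cited in the text.
wh_per_query = 0.34
price_per_kwh = 0.10      # US industrial electricity, $/kWh
plus_price = 20.0         # $/month

cost_per_query = wh_per_query / 1000 * price_per_kwh
breakeven_queries = plus_price / cost_per_query

print(f"Electricity cost per query: ${cost_per_query:.6f}")                      # ~$0.000034
print(f"Queries/month before negative contribution: {breakeven_queries:,.0f}")   # ~588,000
```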
Cost of compute will continue to decrease, driving better models, and that would allow pricing of the best models to at least be maintained, while prior models (that are still very good; o3 is amazing but the average person on a free plan likely thinks 4o is great) are given away for free.
And this TAM covers just consumer apps; it doesn’t consider enterprise (e.g. Meta’s pivot from AR/VR to AI is likely a large part of why it was able to overcome Apple’s IDFA changes), vertical-specific LLMs for coding, and so on.
Even if model improvement becomes more incremental than step-function (i.e. scaling laws slow substantially), usage will likely be sticky; Google Search was probably only a bit better year by year for a decade plus, but usage still grew despite the improvement being more gradual than revolutionary.
Essentially, while AI CapEx could stumble, I don’t think we are near that point yet, and thus consensus out-year revenue growth expectations for TSMC seem too low; the deceleration is likely to be more measured than what the sell-side is currently modeling.
(Potential) Competition
Apart from the sustainability of AI CapEx, investors appear overly focused on gross margin dilution from the 2nm ramp (2-3 percentage points of dilution expected), overseas expansion (another 2-3 points), TWD appreciation against the USD (roughly a 4-5 point gross margin hit at current exchange rates), and tariffs (unknown).
Most of these lead to temporary under-earning, and the rest can be resolved by price. Essentially, my view is that TSMC has substantial pricing power as it has become the effective monopoly for leading edge nodes. They will thus be able to raise prices in excess of said margin dilution factors.
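To put rough numbers on that: under a simplified model where the dilution factors raise costs relative to revenue at constant prices, the price increase needed to restore the gross margin percentage is dilution / (1 - baseline margin). The baseline margin and dilution ranges below are my assumptions, not TSMC guidance.

```python
# Simplified model: price increase needed to restore gross margin after dilution.
# Baseline margin and dilution ranges are illustrative assumptions.
baseline_gm = 0.59   # assumed baseline gross margin (high-50s)

for dilution_pts in (0.04, 0.08, 0.10):   # 4, 8, 10 points of combined dilution
    # At constant price, cost rises so margin falls by `dilution_pts`.
    # Solve (P*(1+x) - C) / (P*(1+x)) = baseline_gm for the price increase x,
    # where C/P = cost_ratio.
    cost_ratio = 1 - (baseline_gm - dilution_pts)
    price_increase = cost_ratio / (1 - baseline_gm) - 1
    print(f"{dilution_pts*100:.0f} pts of dilution -> "
          f"~{price_increase:.0%} price increase to offset")
```

Spread over several years of node transitions and annual pricing negotiations, increases of this order are what the pricing-power argument rests on.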
HPC end-customers like the hyperscalers have huge gross margins and cash flow and can take price increases. TSMC is selling to customers who make 70%+ gross margins (e.g. Nvidia) who themselves will likely raise prices for the end-customer (e.g. a hyperscaler). The end-customer themselves also generally has 70%+ gross margins in their core businesses. Some of TSMC’s customers actually likely want TSMC to raise prices as it is strategically advantageous for them; Nvidia has been publicly supporting TSMC raising prices, likely in part because Jensen can more easily raise his prices than AMD can.
The real risk over the long-term (other than the one across the sea) is competition. The main competitive threat over the next few years is Intel (Samsung has been bleeding share to TSMC for years, SMIC faces geopolitical hurdles outside of China, and Rapidus hasn’t even shipped any wafers).2
If Intel steals significant market share from TSMC on the leading edge, TSMC’s ability to raise prices would be materially curtailed. Intel has suggested that they are well-positioned to achieve exactly this with their 18A process technology, which they believe is superior to TSMC’s N2. Both management teams have communicated that their respective nodes will ramp by end-2025.
Obviously, I do not have the engineering qualifications to assess these claims. However, I think insight on the validity of these claims can be ascertained in non-engineering ways.
I think Intel 18A parametric yields are worse than Intel bulls believe. Thus they will continue to struggle to gain significant external foundry revenue. Parametric yields are a closely-guarded secret and thus there is no way to get actual numbers. However, there is substantial evidence which suggests my assertion is highly likely to be true.
Parametric yields refer to the percentage of chips that meet a given customer’s specifications (power consumption, performance, range of operating temperatures, etc). It differs from functional yield which only measures the percentage of chips that can turn on. Chips that meet spec are “good dies.” Basically, you can have high functional yield but poor parametric yields and thus the amount of actual sellable chips would be low.
Even if you priced your wafers at a significant discount to the competition (like many suggest Intel is doing), your cost per good die would be much higher than the competition (because you have poor parametric yields and thus a low number of good dies per wafer) even though your cost per wafer might be lower.
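The relationship described here, with purely illustrative wafer prices and yields (not actual Intel or TSMC figures), looks like this:

```python
# Cost per good die = wafer price / (candidate dies per wafer * overall yield),
# where overall yield = functional yield * parametric yield.
# All prices and yields below are hypothetical, for illustration only.
def cost_per_good_die(wafer_price, dies_per_wafer, functional_yield, parametric_yield):
    good_dies = dies_per_wafer * functional_yield * parametric_yield
    return wafer_price / good_dies

# Hypothetical incumbent: higher wafer price, high parametric yield.
incumbent = cost_per_good_die(wafer_price=20_000, dies_per_wafer=600,
                              functional_yield=0.90, parametric_yield=0.85)

# Hypothetical challenger: discounted wafer price, poor parametric yield.
challenger = cost_per_good_die(wafer_price=14_000, dies_per_wafer=600,
                               functional_yield=0.90, parametric_yield=0.30)

print(f"Incumbent cost per good die:  ${incumbent:.2f}")   # cheaper despite pricier wafers
print(f"Challenger cost per good die: ${challenger:.2f}")
```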
The inability to achieve high parametric yields is also the reason why Samsung has bled enormous market share to TSMC over the last decade. At its peak, Samsung had roughly 20% market share in logic foundry, and now it is down to 6-7%. I don’t blame them - leading edge semiconductor manufacturing is very hard stuff - see the appendix for a slightly technical explanation.
So how does TSMC do it? I’d wager a large part of it comes from the fact they have the fattest market share (~67% as of 1Q25) and thus the most volume.3 Volume fuels yield learning, resulting in higher yields, which attracts customers, and leads to more volume. This is naturally self-reinforcing. And because they have the largest volume, they are able to go down the yield learning curve at a pace that smaller competitors are unable to match.
I believe this is an important reason why Intel and Samsung have persistently struggled. Their volume is sub-scale (compared to TSMC) and hence their pace of yield learning is much slower, and by the time they get to high parametric yields on the current process technology, the next generation is already here.
According to Intel bulls however, the company is supposedly resurgent. Intel management has publicly announced that defect density (D0) on 18A is <0.4/cm^2. They announced this in September 2024, which is roughly 4-5 quarters before mass production. For reference, sub-0.4 D0 is similar to TSMC’s prior (very successful) 7nm and 5nm nodes at similar time-to-mass-production.
But defect density only tells you about the functional yield, not the parametric yield. A sub-0.4 D0 4-5 quarters before mass production could plausibly be 0.1 at mass production, which would imply a ~91% functional yield assuming a chip area of 1cm^2 (100mm^2, roughly matching the die size of a mobile SoC like the Apple A18).4
But importantly, it says nothing about parametric yield. Even if you have 91% functional yield, if you have 20% parametric yield, your overall yield would be ~18%. Basically, you can have a lot of functional chips but it doesn’t mean all (or even most) of them meet customer specifications.
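The arithmetic in the last two paragraphs, using the Poisson yield model referenced in the footnotes (the 20% parametric yield is the hypothetical scenario above, not a disclosed figure):

```python
import math

# Poisson functional yield model: Y = exp(-D0 * A),
# with D0 = defect density (defects/cm^2) and A = die area (cm^2).
d0 = 0.1            # plausible defect density at mass production (scenario from the text)
area = 1.0          # cm^2, roughly a mobile SoC like the Apple A18

functional_yield = math.exp(-d0 * area)
parametric_yield = 0.20   # hypothetical scenario from the text

overall_yield = functional_yield * parametric_yield
print(f"Functional yield: {functional_yield:.1%}")   # ~90.5%
print(f"Overall yield:    {overall_yield:.1%}")      # ~18%
```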
I think this scenario is (broadly) what Intel is experiencing on 18A. Although (most of) their claims suggest otherwise, their actions strongly imply this is the case.
Triangulating the state of Intel 18A
Skipping 20A
Last September, Intel announced that it was skipping its 20A process technology and focusing on 18A instead. The purported explanation was that skipping it would save money (half a billion dollars) and that 18A yields were already in a “production-worthy” state. Focusing on 18A would allow them to accelerate their “five nodes in four years” plan while saving costs.
But skipping 20A meant outsourcing products like Arrow Lake and Lunar Lake to TSMC. This suggests 20A was likely not hitting yield, cost, or performance targets. Fabs have massive fixed costs (mostly due to expensive equipment depreciation) and thus require huge volume to drive high utilization and profitability. If Intel had strong external customer commitments for 20A, or if their own products still heavily relied on it, it would have been extremely difficult to justify skipping it for high-volume manufacturing simply because 18A was looking good. Foundries don't build multi-billion dollar facilities and develop processes only for them to serve as testbeds for a brief period if there were strong demand.
While Intel has announced some high-profile wins like Microsoft for 18A, the public statements around significant external customer commitments for advanced nodes have been notably sparse compared to TSMC's broad customer base. Almost every current- and next-gen chip your favorite hyperscaler or smartphone manufacturer or chip startup or ASIC provider uses is fabbed by TSMC.
This suggests that the pipeline of customers willing to jump onto a brand new Intel node, especially one that was meant to be a short-lived bridge, was likely insufficient to justify the enormous cost of bringing 20A to full high-volume manufacturing.
If there were strong demand, 20A would likely have been profitable in the multi-billion range, so it says a lot that Intel seemingly passed on making billions to save roughly half a billion.
Outsourcing to TSMC
Intel has been increasingly outsourcing to TSMC in recent years. The percentage of wafers outsourced to TSMC was likely in the 10% range prior to 2020. As Intel started to experience significant delays with its 7nm process (now Intel 4), it increased outsourcing to TSMC.
The percentage as of March 2025 is roughly 30% per Intel, with them targeting to bring it down to 15-20% long-term. Crucially, as this percentage grew, Intel started outsourcing high value tiles such as the graphics, SoC, and I/O extender tiles for Meteor Lake, the compute tile for Arrow Lake, and eventually all logic dies for Lunar Lake.
One argument Intel bulls make is that outsourcing to TSMC gives Intel strategic flexibility: it secures capacity for high-value tiles for Intel's own products while retaining internal capacity for external customers. But, as mentioned, the tiles being outsourced to TSMC are increasingly the high-value ones.
Additionally, Intel management has recently admitted that external foundry revenue would be “low to mid-single digit billions” by 2027, and not all of that would be 18A; a significant portion would come from advanced packaging and older nodes (e.g. their partnerships with UMC and Tower). They mentioned this in the context of achieving foundry breakeven in 2027 (foundry lost $13 billion in 2024).
Good for them, but this essentially means they have no large customers ramping even in 2027. This is despite announcing in early 2024 that they were developing a chip for Microsoft worth “$15 billion in lifetime revenue” and touting their partnership with Amazon on an AI chip (which is more of a networking chip than something like Google’s TPUs).
So if they have no major external customers, the outsourcing was clearly not about reserving capacity for them. The more likely explanation is that Intel needed to outsource to TSMC to preserve internal capacity for its own products, which implies Intel’s parametric yields are terrible (poor yields mean each product consumes far more wafers, stretching internal capacity). Intel also cut 2025 CapEx guidance by 10%, from $20 billion to $18 billion. Foundry CapEx today represents capacity and market share in the future, and cutting CapEx tells you their view of their future customer base.
Another related argument made by Intel bulls is that TSMC’s wafers are more expensive than Intel’s wafers and they are excited that Intel is targeting less outsourcing over the next few years as this should result in better gross margins.
But the very fact that Intel is outsourcing strongly suggests that while TSMC may be more expensive on a cost per wafer basis compared to Intel, it is much cheaper on a cost per good die basis, as TSMC has better parametric yield. It simply would not make sense for Intel to outsource if their own foundry had good parametric yield as they would give their own foundry more volume to further improve yield and also avoid paying for TSMC’s “expensive” wafers.
“Prioritizing”
Furthermore, Intel management has recently talked about prioritizing Panther Lake over Clearwater Forest. Intel blamed packaging challenges for the need to prioritize, but this is an insufficient explanation. If 18A were ramping with good parametric yields, Intel should be able to support both products without needing to explicitly prioritize one over the other. Panther Lake is also a client CPU product whereas Clearwater Forest is a server CPU. Client CPUs have smaller die sizes and less stringent reliability and performance requirements compared to server CPUs, making it easier to achieve relatively better parametric yields. Again, this points to the poor state of Intel’s 18A.
Oh, and Intel took almost $19 billion in impairment charges in 2024. About $10 billion of that came from recording valuation allowances against deferred tax assets (which tells you their accountants’ view of future profitability), and roughly $3 billion came from impairing equipment related to Intel 7 capacity. Yet now Intel is saying Intel 7 is “capacity constrained” due to an “unexpected surge in demand” for Raptor Lake and Sapphire Rapids.
An impairment followed by capacity constraints just a few quarters later is suspect: if newer nodes were highly compelling on cost/performance, demand would naturally shift there, negating the need for older products. It is more likely that demand for older products remains robust because the parametric yield of 18A isn't yet good enough. This probably also explains why Dell, historically an Intel-only OEM, finally partnered with AMD in the client market.
Essentially, Intel bulls are conflating having the process technology with being able to manufacture it economically in high-volume (i.e. good parametric yields), and there is enormous evidence of Intel struggling at the latter. Intel management essentially claims that 18A is going great, but their actions strongly suggest otherwise.
(Potential) Dual-sourcing
If my assertion about the state of 18A is largely true, it should come as no surprise that there has been limited customer activity on the node. Will this continue? I think Intel could win customers, but these would likely be minor wins (relative to TSMC’s top customers), and building that track record would be a process spanning many years, if not decades, similar to how TSMC did it.
This is because there are very large and real switching costs and potentially existential risks for customers to shift a huge amount of volume to a largely unproven foundry such as Intel.
Chip designers rely on TSMC's process design kits5 and collaborate intensely to optimize their designs for TSMC process nodes resulting in incredibly deep relationships. Their collaboration spans years and multiple product generations. Customers invest billions in IP blocks, design methodologies, and tools that are optimized for TSMC’s ecosystem. Switching would be extremely complex, time-consuming, and expensive as it would require significant re-tooling, re-design, re-simulation, and re-validation on another foundry’s PDK and ecosystem.
And no, Intel can’t just reverse engineer TSMC’s PDKs. Apart from the obvious IP infringement, a TSMC 3nm PDK is useless for an Intel 18A fab because their methodologies, processes, and procedures are all different. Chip designs optimized for TSMC’s PDK would not run on Intel’s fabs, and vice versa.
So a chip designer looking to dual-source would need to have two entirely different product roadmaps with separate R&D teams, which exposes them to being leapfrogged by competitors who can focus all of their R&D resources on a single product roadmap. They would also have to split their volume between two foundries, making the yield learning curve harder and costlier for both, which raises cost per good die and directly impacts the chip designers’ ability to competitively price their products.
Some of the large chip designers actually tried dual-sourcing a few years ago. Qualcomm famously dual-sourced its Snapdragon 8 Gen 1 (a flagship mobile SoC) with Samsung's 4nm process, but the Samsung-fabbed chips had many power-efficiency and overheating issues, leading Qualcomm to largely shift subsequent flagship Snapdragon chips back to TSMC. Nvidia used Samsung for some gaming GPU generations (e.g. the RTX 30 series on Samsung 8nm), which also had overheating issues, and Nvidia shifted all of its gaming GPU volume back to TSMC afterward.
This shifting of volume from Samsung’s foundry to TSMC was not cheap for the chip designers who previously dual-sourced. TSMC generally builds capacity based on, not ahead of, customer demand and is often capacity constrained, especially at leading edge nodes. Due to their prior dual-sourcing stance, Qualcomm and Nvidia had to make billions in prepayments to TSMC to shift their volume back. Nvidia in particular prepaid over $3 billion to TSMC around the 2021-2022 period to secure capacity. In contrast, sole-source customers such as AMD and MediaTek have historically not needed to prepay for large amounts of capacity at TSMC.
In my view, the biggest risk for chip designers dual-sourcing is competition. If a competitor leapfrogs them, the risk becomes almost existential. Intel’s strategy is reminiscent of Samsung’s: the latter was first to EUV insertion at 7nm and to GAA (a new transistor architecture) at 3nm but stumbled heavily in high-volume manufacturing due to poor parametric yields. Intel is attempting both GAA and backside power delivery (a new approach to power delivery) within a single node generation. Notwithstanding the complication of adding two technological advancements at once, such a strategy is rooted in a belief that conflates technological superiority with customer demand.
But history (not only the Nvidia and Qualcomm examples above, but also Apple, AMD, etc.) has made clear that high parametric yields at high volume and reliable time-to-market are valued far more than gains in technology. This is because a better-performing product at poor parametric yields, and thus low volume, means the chip designer is ceding enormous market share to competitors who may have slightly worse products but at high parametric yields and thus huge volume.
And if you don’t have market share for this generation of products, you can’t afford to fund R&D for the next. Dual-sourcing in a large way is thus a bet-the-company proposition where the risk/reward calculus does not skew favorably.
Moreover, given we are in the midst of AI penetration ramping, chip designers are likely focused on capturing market share rather than cost-optimizing their supply arrangements, which would probably come once the end-market has matured. Large Intel foundry external customers are thus likely years away.
Importantly, this competitive risk is diminished with each year Intel fails to get large external customers. Even if you give them credit for getting foundry to breakeven in 2027, they still would not have the money for R&D and CapEx on future nodes unless they get big customers soon.
Intel Products has been bleeding share to AMD and others for years, which means internal foundry volume is declining too. Intel Products’ cash flow was (mostly) what allowed Intel to fund its aggressive 5N4Y strategy, and with that share bleed continuing, continued funding of foundry R&D and CapEx is increasingly financially dangerous. Intel has also had to resort to selling equity in its new fabs (see Brookfield’s $15 billion deal for 49% of Arizona fabs 52 and 62 and Apollo’s $11 billion deal for 49% of Ireland fab 34) to fund operations.
Conclusion
TSMC presents a compelling investment opportunity, trading at a mid-teens NTM P/E ex. cash, with management projecting almost 20% CAGR in top-line growth over the next few years, mostly driven by HPC/AI. This is supported by overall industry expansion, logic segment acceleration, market share gains, and price increases, likely producing earnings growth in excess of 20% CAGR through margin expansion.
Consensus modeling of a steep deceleration in revenue growth and a "CapEx cliff" appears overly pessimistic. Concerns regarding supply chain bottlenecks and the ROI on AI CapEx are likely overblown, given the steady improvement in GB200 assembly yields, the significantly extended useful economic life of GPUs for inference workloads, and a huge TAM at maturity. Therefore, TSMC's out-year revenue growth deceleration is likely to be more measured than currently modeled by the sell-side.
Furthermore, margin dilution from 2nm ramp, overseas expansion, TWD appreciation, and potential tariffs are either temporary or should be offset by pricing power. The historical difficulties faced by competitors like Intel and Samsung in achieving high parametric yields at high volume manufacturing, despite technological advancements, underscore TSMC's enduring competitive advantage. Intel's actions, such as skipping 20A, increased outsourcing to TSMC, CapEx cuts, and limited external foundry revenue projections, strongly suggest ongoing struggles with 18A parametric yields, contradicting their other public (marketing) claims. The high switching costs and existential risks for chip designers to shift volume to unproven foundries, as evidenced by past dual-sourcing failures by Qualcomm and Nvidia, further solidify TSMC's dominant position. In a rapidly expanding AI market, chip designers prioritize market share capture over supply chain cost optimization. Hence, TSMC’s market dominance is likely to persist indefinitely.
Geopolitical risk is ever-present, naturally. I rarely find it fruitful to discuss this factor with others as everyone (including me) tends to have very strong views on the topic, so I won’t. But I will note that every company up and down the supply chain is subject to the same risk, and a tail risk outcome is easily (and cheaply) hedgeable if you are creative.
Appendix
A non-engineer’s understanding of semiconductor manufacturing
Why is getting high yields tough? Well, because high-volume chip manufacturing at the leading edge is basically the hardest thing to do in the world.
A single integrated circuit at the leading edge could contain hundreds of billions of transistors that need to be fabricated with atomic-level precision. These chips are composed of dozens of nano-scale layers, each of which goes through numerous steps, including deposition, coating/developing, doping, lithography, etching, and cleaning. Slight deviations at the atomic level can result in an entire section of a chip being non-functional. Tolerance for error is virtually zero.
Further, as transistors shrink and become more tightly packed together, the probability that a microscopic defect, say a single dust particle, an impurity in a chemical, or an imperfection in a deposited film, will disrupt an important circuit element is disproportionately higher. Defects that might be inconsequential on earlier (larger) process technologies can destroy leading edge chips.
Limiting process variation becomes increasingly paramount. At these extreme scales, inherent variations in manufacturing processes, which were tolerable at larger nodes, become significant enough to impact performance and chip functionality. These variations can manifest as slight differences in gate length, oxide thickness, or dopant concentration across a wafer, or even between nominally identical transistors on the same chip. While a transistor might theoretically be functional, these variations can push its electrical characteristics outside the specifications required for the chip to operate correctly or efficiently.
Advanced lithography adds to the complexity. EUV lithography is necessary for patterning these tiny features on leading edge chips. The masks used in EUV are incredibly complex and expensive, and any defect on the mask is directly transferred to the silicon. The very short wavelength of EUV light, while enabling smaller features, also makes the process highly sensitive to subtle imperfections and aberrations. Furthermore, multi-patterning (using multiple lithography steps to create a single complex feature) increases the number of opportunities for alignment errors and accumulated defects.
When a new process node is introduced, the initial yield is often significantly lower than target (TSMC often aims for 90% or more on mature leading-edge nodes). Reaching yield targets requires a painstaking, iterative process of identifying the root causes of defects and variations, often through advanced analytical techniques like electron microscopy and electrical testing, and then implementing subtle process adjustments.
Because the overall yield of a chip is the product of the yields of each individual processing step, potential losses in yield are multiplicative. Simply put, if you have a target of 80% overall yield on a chip with 100 mask layers (TSMC’s 3nm node family has 90-100 mask layers), a seemingly high yield of 99% on each layer results in an unacceptably low cumulative yield (0.99^100 ≈ 37%); you actually require a per-layer yield of roughly 99.78% to achieve a cumulative yield of 80%. The "yield ramp" period is hence characterized by intense R&D, significant wafer scrapping, and delayed revenue, making it a critical phase for profitability. The institutional knowledge and experience to rapidly identify and rectify these complex issues is vital.
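The per-layer math, as a quick check:

```python
# Cumulative yield across mask layers is multiplicative.
layers = 100                 # TSMC's 3nm family has roughly 90-100 mask layers
per_layer_yield = 0.99

cumulative = per_layer_yield ** layers
print(f"99% per layer over {layers} layers: {cumulative:.1%}")   # ~36.6%

# Per-layer yield required for an 80% cumulative yield target.
target = 0.80
required = target ** (1 / layers)
print(f"Required per-layer yield for {target:.0%} cumulative: {required:.2%}")  # ~99.78%
```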
In essence, achieving high yields at leading-edge nodes requires a mastery of materials science, quantum physics, chemical engineering, statistical process control, and sophisticated automation, all while operating at the limits of what is physically and technologically possible. And this is just achieving high functional yields. Achieving high parametric yields adds another level of difficulty as not only do the chips need to work, they need to also meet customer specifications.
1. I am referring to the Taiwanese shares, which trade at a meaningful discount to the ADRs and are available through Interactive Brokers.
2. Samsung used to have nearly 20% market share in logic foundry but has lost more than half of that, mostly to TSMC.
3. All references to market share reflect the industry definition. TSMC used to use the industry definition of foundry market share but recently included advanced packaging in its total addressable market calculation, hence the differing definitions compared to the industry.
4. A common industry method for modeling functional yield is the Poisson yield model, where functional yield = e^(-D*A), with D = defect density and A = chip area. In my example, e^(-0.1*1) ≈ 91%.
5. An oversimplified explanation of a PDK is that it's a bunch of extremely complex data files outlining things like design rules, tools, IP libraries, simulation models, documentation, etc., that encapsulate the foundry’s accumulated manufacturing expertise and IP. PDKs are used to model chip performance and behavior in the real world at various operating conditions (e.g. temperature, voltage). A really good PDK shows relatively minimal difference between simulated and real-world chip performance/behavior, and vice versa.
Disclaimer: The author's reports contain factual statements and opinions. The author derives factual statements from sources which he believes are accurate, but neither they nor the author represent that the facts presented are accurate or complete. Opinions are those of the author and are subject to change without notice. His reports are for informational purposes only and do not offer securities or solicit the offer of securities of any company. The author accepts no liability whatsoever for any direct or consequential loss or damage arising from any use of his reports or their content. The author advises readers to conduct their own due diligence before investing in any companies covered by him. He does not know of each individual's investment objectives, risk appetite, and time horizon. His reports do not constitute investment advice and are meant for general public consumption. Past performance is not indicative of future performance.