this post was submitted on 28 Jan 2025
322 points (96.3% liked)

[–] Australis13@fedia.io 38 points 2 days ago (3 children)

The big win I see here is the amount of optimisation they achieved by moving from high-level CUDA to lower-level PTX. This suggests that developing these models going forward can be made a lot more energy-efficient, something I hope can be extended to their execution as well. As it stands, "AI" (read: LLMs and image generation models) consumes way too many resources to be sustainable.
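
To make that concrete, dropping down a level looks roughly like this (a toy sketch of mine, nothing to do with DeepSeek's actual kernels): you embed hand-written PTX inside an otherwise normal CUDA kernel.

```
// Toy CUDA kernel with a hand-written PTX instruction inlined.
// Normally you'd just write `out[i] = a[i] + b[i];` and let nvcc
// pick the instructions; inline PTX gives you direct control instead.
__global__ void add_vectors(const int* a, const int* b, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int result;
        asm volatile("add.s32 %0, %1, %2;"   // explicit 32-bit integer add in PTX
                     : "=r"(result)
                     : "r"(a[i]), "r"(b[i]));
        out[i] = result;
    }
}
```

The gains obviously don't come from a trivial add like this; they come from hand-tuning the things the compiler is conservative about, like memory movement and overlapping communication with compute.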

[–] KingRandomGuy@lemmy.world 5 points 1 day ago* (last edited 1 day ago)

What I'm curious to see is how well these types of modifications scale with compute. DeepSeek is restricted to H800s instead of H100s or H200s. These are gimped cards to get around export controls, and accordingly they have lower memory bandwidth (~2 vs ~3 TB/s) and, most notably, much slower GPU-to-GPU communication (something like 400 GB/s vs 900 GB/s). The specific reason they used PTX in this application was to help alleviate some of the bottlenecks due to the limited inter-GPU bandwidth, so I wonder whether it would still improve performance on H100 and H200 GPUs, where bandwidth is much higher.
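
For reference, a rough way to see that inter-GPU gap on your own hardware is to time a direct device-to-device copy (sketch only: device IDs and buffer size are arbitrary, there's no warm-up run, and error checking is omitted):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;  // 1 GiB test buffer
    void *src, *dst;

    // Allocate a buffer on each GPU and enable direct peer access both ways.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    // Time a single GPU0 -> GPU1 copy with CUDA events.
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("~%.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));
    return 0;
}
```

On bandwidth-limited cards that number should come out well below what the full-bandwidth parts manage, which is exactly the bottleneck the PTX work is trying to get around.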

[–] Dkarma@lemmy.world 3 points 1 day ago

Yeah, I'd like to see size comparisons too. The CUDA stack is massive.

[–] Knock_Knock_Lemmy_In@lemmy.world -5 points 1 day ago (2 children)

PTX also removes NVIDIA lock-in.

[–] sunbeam60@lemmy.one 12 points 1 day ago (1 children)

Wtf, this is literally the opposite of true. PTX is NVIDIA-only.

[–] Knock_Knock_Lemmy_In@lemmy.world 4 points 1 day ago (1 children)

Google was giving me bad search results about PTX, so I just posted an opinion and hoped Cunningham's Law would work.

[–] mholiv@lemmy.world 16 points 1 day ago (2 children)

Kind of the opposite, actually. PTX is in essence NVIDIA-specific assembly, just like how ARM or x86_64 assembly is tied to ARM or x86_64.

At least with CUDA there are efforts like ZLUDA. CUDA is more like Objective-C was on the Mac: basically tied to the platform, but at least you could write a compiler for another target in theory.
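
For example, compiling even a trivial CUDA kernel with `nvcc --ptx` gives you something along these lines (paraphrased from memory, not exact compiler output):

```
// Trivial kernel, just to show what the NVIDIA-specific "assembly" layer looks like.
__global__ void scale(float* x, float s) {
    x[threadIdx.x] *= s;
}

// `nvcc --ptx` turns that into PTX roughly like this (register names and
// exact ordering will differ):
//
//   ld.param.u64       %rd1, [scale_param_0];
//   ld.param.f32       %f1,  [scale_param_1];
//   cvta.to.global.u64 %rd2, %rd1;
//   mov.u32            %r1, %tid.x;
//   mul.wide.u32       %rd3, %r1, 4;
//   add.s64            %rd4, %rd2, %rd3;
//   ld.global.f32      %f2, [%rd4];
//   mul.f32            %f3, %f2, %f1;
//   st.global.f32      [%rd4], %f3;
//
// None of that means anything to an AMD or Intel GPU, the same way
// x86_64 assembly means nothing to an ARM core.
```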

[–] KingRandomGuy@lemmy.world 3 points 1 day ago

IIRC ZLUDA does support compiling PTX. My understanding is that this is part of why Intel and AMD eventually didn't want to support it - it's not a great idea to tie yourself to someone else's architecture that you have no control over or license to.

OTOH, CUDA itself is just a set of APIs and their implementations on NVIDIA GPUs. Other companies can re-implement them. AMD has already done this with HIP.
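
To illustrate what "re-implement the API" means in practice, typical host code maps almost one-to-one between the two (sketch from memory; the HIP names in the comments are the real hipMalloc/hipMemcpy/hipFree calls, but double-check the exact signatures in AMD's docs):

```
#include <cuda_runtime.h>

// CUDA runtime calls, with their HIP counterparts noted in the comments.
// The shapes match so closely that tools can translate between the two
// almost mechanically.
int main() {
    float host[256] = {0};
    float* buf = nullptr;

    cudaMalloc(&buf, sizeof(host));                               // HIP: hipMalloc
    cudaMemcpy(buf, host, sizeof(host), cudaMemcpyHostToDevice);  // HIP: hipMemcpy(..., hipMemcpyHostToDevice)
    // ... launch kernels with the usual <<<grid, block>>> syntax ...
    cudaMemcpy(host, buf, sizeof(host), cudaMemcpyDeviceToHost);  // HIP: hipMemcpy(..., hipMemcpyDeviceToHost)
    cudaFree(buf);                                                // HIP: hipFree
    return 0;
}
```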

Ah, I hoped it was cross-platform, more like OpenCL. Thinking about it, a lower-level language would be more platform-specific.