Heykuki News

Top New Best Ask Show Jobs

skidrow

Born on July 02, 2024•389 Karma

About Submitted Comments Favorites

31.

Implementing a Fast Tensor Core Matmul on the Ada Architecture

11 months ago

2 points

32.

Creating custom kernels for the AMD MI300

11 months ago

1 point

33.

Implementing a Fast Tensor Core Matmul on the Ada Architecture

11 months ago

4 points

34.

Implementing a Fast Tensor Core Matmul on the Ada Architecture

11 months ago

2 points

35.

Compiler Explorer: An Essential Kernel Playground for CUDA Developers

11 months ago

2 points

36.

Creating custom kernels for the AMD MI300

11 months ago

1 point

37.

DeepSeek-R1 and FP8 Mixed-Precision Training

colfax-intl.com

on April 19, 2025

2 points

38.

How to Write a Fast Matrix Multiplication from Scratch with Tensor Cores (2024)

alexarmbr.github.io

on April 19, 2025

147 points

39.

DeepSeek-R1 and FP8 Mixed-Precision Training

colfax-intl.com

on April 18, 2025

2 points

40.

Implementing a Fast Tensor Core Matmul on the Ada Architecture

on April 18, 2025

1 point

41.

How to Write a Fast Matrix Multiplication from Scratch with Tensor Cores

alexarmbr.github.io

on April 18, 2025

2 points

42.

Understanding Peak, Max-Achievable and Delivered FLOPs

on April 1, 2025

1 point

43.

DeepSeek-R1 and FP8 Mixed-Precision Training

colfax-intl.com

on April 1, 2025

1 point

44.

Outperforming cuBLAS on H100: A Worklog

cudaforfun.substack.com

on April 1, 2025

3 points

45.

Optimizing Matrix Multiplication on RDNA3

seb-v.github.io

on March 25, 2025

118 points

46.

Outperforming cuBLAS on H100: A Worklog

cudaforfun.substack.com

on March 25, 2025

1 point

47.

Mastering LLM Techniques: Inference Optimization

on March 24, 2025

2 points

48.

Optimizing Matrix Multiplication on RDNA3

seb-v.github.io

on March 24, 2025

2 points

49.

Outperforming cuBLAS on H100: A Worklog

cudaforfun.substack.com

on March 24, 2025

4 points

50.

Understanding Latency Hiding on GPUs [pdf]

eecs.berkeley.edu

on March 17, 2025

2 points

51.

AMD Radeon RX 9070 Series Linux GPU Compute Performance

on March 17, 2025

2 points

52.

Outperforming cuBLAS on H100: A Worklog

cudaforfun.substack.com

on March 17, 2025

3 points

53.

on March 16, 2025

2 points

54.

Understanding Latency Hiding on GPUs [pdf]

eecs.berkeley.edu

on March 16, 2025

2 points

55.

A guide to LLM inference and performance

on Feb 16, 2025

1 point

56.

Mastering LLM Techniques: Inference Optimization

on Feb 16, 2025

2 points

57.

GPT from Scratch with MLX

on Feb 16, 2025

1 point

58.

Mastering LLM Techniques: Inference Optimization

on Feb 4, 2025

3 points

59.

Beating OpenBLAS in FP32 Matrix Multiplication

salykova.github.io

on Feb 4, 2025

4 points

60.

Beating OpenBLAS in FP32 Matrix Multiplication

salykova.github.io

on Jan 28, 2025

1 point