FlashInfer is a library and kernel generator for Large Language Models that provides high-performance implementation of LLM GPU kernels such as FlashAttention, SparseAttention, PageAttention, Sampling ...
Abstract: In this letter, a novel high throughput software polarization-adjusted convolutional (PAC) decoder based on the four-node fast list (FFL) decoding algorithm is proposed. To improve the ...
Abstract: In motion control, velocity is required to be estimated with less delay and less error, but the two are trade-offs because of an encoder's quantization noise, and they are difficult to make ...
Clean-room GPT-2/GPT-3 implementation: tokenizers, architecture blocks, training loop with AdamW + cosine decay, CLI scripts, inference tools, and pytest suite. Covers OpenWebText-10k & WikiText-103 ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results