Latest Research
Discover our cutting-edge research in efficient AI models and language processing
Spectra-1
ICLR 2025 Spotlight
Surprising Effectiveness of Pretraining Ternary Language Models at Scale
Spectra introduces the first open suite of low-bitwidth LLMs, including TriLMs, QuantLMs, and FloatLMs, from 99M to 3.9B parameters. TriLMs are pretrained ternary models that outperform traditional quantized and floating-point models at scale. The 3.9B TriLM matches the performance of its FloatLM counterpart with far fewer bits, enabling efficient inference. This work pushes the frontier of memory-efficient, scalable language models.
54 multilingual models from 99M to 3.9B parameters
TriLMs: compact, fast, high-performing.
Powerful AI on low-resource devices.
Researchers
Tejas Vaidhya, Ayush Kaushal, Arnab Kumar Mondal, Tejas Pandey, Aaryan Bhagat, Irina Rish
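For readers new to ternary models, the sketch below illustrates the basic idea behind ternary weights: each weight is constrained to {-1, 0, +1} with a per-matrix scale, so it costs roughly 1.58 bits instead of 16 or 32. This is a generic illustration, not the Spectra-1 training recipe; the function names are hypothetical.

```python
import numpy as np

def ternarize(W: np.ndarray):
    """Illustrative absmean-style ternarization (not the exact Spectra recipe).

    Maps a float weight matrix to values in {-1, 0, +1} with a single
    per-matrix scale, so each entry needs ~1.58 bits instead of 16 or 32.
    """
    scale = np.mean(np.abs(W)) + 1e-8                 # per-matrix scaling factor
    W_ternary = np.clip(np.round(W / scale), -1, 1)   # snap to {-1, 0, +1}
    return W_ternary.astype(np.int8), scale

def dequantize(W_ternary: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate float matrix for use in a matmul."""
    return W_ternary.astype(np.float32) * scale

# Example: a small random "layer" compressed to ternary form
W = np.random.randn(4, 4).astype(np.float32)
W_t, s = ternarize(W)
print(W_t)                  # entries are -1, 0, or +1
print(dequantize(W_t, s))   # coarse approximation of the original weights
```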
Hi-NOLIN
Bridging the English-Hindi Language Gap in Open-Source AI
Hi-NOLIN is the first open-source English-Hindi bilingual large language model (LLM), built on the Pythia architecture and expanded to 9B parameters. Trained on a 300B-token corpus spanning English, code, and Hindi, it uses continual pretraining to improve performance across multiple domains. Remarkably, Hi-NOLIN outperforms larger models such as Pythia 12B and the multilingual BLOOM on standard benchmarks.
The best open-source Hindi-English LLM of its size
Extends capability to a new language while boosting English and code performance.
Researchers
Tejas Vaidhya, Ayush Kaushal, Irina Rish
Spectra-1.1
ACL 2025
Scaling Laws and Efficient Inference for Ternary Language Models
This research demonstrates that ternary language models offer superior scaling behavior, providing valuable insights into efficient low-bitwidth language models. Spectra-1.1 introduces a suite of ternary language models (TriLMs) trained on up to 1.2 trillion tokens. These models use quantization-aware training and novel bit-packing schemes to dramatically cut memory use.
TriLMs trained on 1.2T tokens
Up to 5× faster inference with TriRun
Novel 1.6- and 2-bit packing schemes
Researchers
Tejas Vaidhya, Ayush Kaushal, Arnab Kumar Mondal, Tejas Pandey, Aaryan Bhagat, Irina Rish
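The sketch below gives a rough sense of what bit-packing buys: four ternary values fit in a single byte at 2 bits each. The paper's 1.6-bit and 2-bit schemes (and the TriRun kernels) are more sophisticated; the helpers here are hypothetical and cover only the simple 2-bit case.

```python
import numpy as np

def pack_2bit(ternary: np.ndarray) -> np.ndarray:
    """Pack ternary values {-1, 0, +1} into 2 bits each (4 values per byte).

    Values are shifted to {0, 1, 2} so each fits in an unsigned 2-bit field.
    A real inference kernel would unpack these on the fly inside the matmul.
    """
    flat = (ternary.reshape(-1) + 1).astype(np.uint8)          # -1/0/+1 -> 0/1/2
    pad = (-len(flat)) % 4                                     # pad to a multiple of 4
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.uint8)])
    groups = flat.reshape(-1, 4)
    return (groups[:, 0]
            | (groups[:, 1] << 2)
            | (groups[:, 2] << 4)
            | (groups[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_2bit; recovers the first n ternary values."""
    groups = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return groups.reshape(-1)[:n].astype(np.int8) - 1          # 0/1/2 -> -1/0/+1

w = np.random.randint(-1, 2, size=10).astype(np.int8)
assert np.array_equal(unpack_2bit(pack_2bit(w), len(w)), w)    # round-trips exactly
```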
More Publications
LoRD: Low Rank Decomposition of Monolingual Code LLMs for One-Shot Compression
September 2023
This paper demonstrates efficient LLM compression using Low Rank Decomposition, allowing code LLMs to be compressed by up to 39.58% with minimal performance loss. This method provides an effective approach to reducing model size while maintaining code generation capabilities.
Researchers
Vaidhya, Kaushal, Rish
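As a minimal sketch of the low-rank decomposition idea (illustrative only; the rank choice, layer selection, and calibration used in the paper are not reproduced here), a dense weight matrix can be replaced by two thin factors from a truncated SVD:

```python
import numpy as np

def low_rank_decompose(W: np.ndarray, rank: int):
    """Approximate W (d_out x d_in) as B @ A with B: d_out x r and A: r x d_in.

    Parameter count drops from d_out*d_in to r*(d_out + d_in), so a small
    enough rank gives a one-shot compression of the layer.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :rank] * S[:rank]        # fold singular values into the left factor
    A = Vt[:rank, :]
    return B, A

W = np.random.randn(1024, 1024).astype(np.float32)
B, A = low_rank_decompose(W, rank=256)
print(f"kept {(B.size + A.size) / W.size:.0%} of the parameters")        # ~50% here
print("relative error:", np.linalg.norm(W - B @ A) / np.linalg.norm(W))  # approximation cost
```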
Ternary LLMs are more Performant than Quantized FP16 LLMs
September 2023
We introduce TriLM, a family of pretrained ternary language models that are both compact and high-performing. TriLMs outperform their quantized counterparts and rival full-precision models at larger scales. Our findings show that ternary models not only offer superior efficiency in terms of bit-level size but also maintain strong performance on knowledge benchmarks—establishing TriLM as a compelling choice for efficient LLM deployment.
Researchers
Kaushal, Vaidhya, Pandey, Bhagat, Rish
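To make the bit-level size comparison concrete, here is the back-of-the-envelope arithmetic: a ternary weight carries log2(3) ≈ 1.58 bits versus 16 bits for FP16. The 3.9B parameter count below is illustrative, borrowed from the Spectra-1 suite above.

```python
import math

bits_ternary = math.log2(3)   # ~1.585 bits of information per ternary weight
bits_fp16 = 16.0              # bits per FP16 weight
params = 3.9e9                # illustrative parameter count

print(f"ternary: {params * bits_ternary / 8 / 1e9:.2f} GB")  # ~0.77 GB
print(f"fp16:    {params * bits_fp16 / 8 / 1e9:.2f} GB")     # ~7.80 GB
print(f"ratio:   {bits_fp16 / bits_ternary:.1f}x smaller")   # ~10.1x
```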
Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting
September 2024
Lag-Llama is a general-purpose foundation model for univariate probabilistic time series forecasting, built on a decoder-only transformer using lagged values as covariates. Pretrained on a diverse corpus of time series data, it shows strong zero-shot generalization and achieves state-of-the-art performance when fine-tuned on small amounts of unseen data. Lag-Llama sets a new benchmark for foundation models in time series forecasting.
Researchers
Rasul, Ashok, Williams, Ghonia, Bhagwatkar, Khorasani, Bayazi, Adamopoulos, Riachi, Hassen, Biloš, Garg, Schneider, Chapados, Drouin, Zantedeschi, Nevmyvaka, Rish
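As a sketch of what using lagged values as covariates means in practice (illustrative only; Lag-Llama's actual lag set and feature pipeline differ), each time step is paired with the values observed at a fixed set of past offsets:

```python
import numpy as np

def lag_features(series: np.ndarray, lags=(1, 2, 3, 7, 14)):
    """Build a matrix of lagged covariates for a univariate series.

    Row t holds series[t - lag] for each lag; a forecasting model then
    predicts a distribution over series[t] conditioned on these past values.
    """
    max_lag = max(lags)
    rows = [[series[t - lag] for lag in lags] for t in range(max_lag, len(series))]
    targets = series[max_lag:]
    return np.array(rows), targets

series = np.sin(np.arange(100) / 5.0)   # toy univariate series
X, y = lag_features(series)
print(X.shape, y.shape)                 # (86, 5) (86,)
```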
What do tokens know about their characters and how do they know it?
September 2023
This work investigates how pretrained language models (PLMs) encode character-level information despite using subword tokenization. By probing embeddings from models like GPT-J, BERT, and RoBERTa, the study finds that PLMs reliably capture whether specific characters appear in a token—even across non-Latin scripts. Larger models generally encode this information more robustly. The analysis suggests this ability arises from patterns in tokenization, character–POS correlations, and language variability.
Researchers
Kaushal, Mahowald
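A minimal sketch of this kind of probing setup, using a toy vocabulary and random vectors in place of real GPT-J, BERT, or RoBERTa embeddings: a linear classifier is trained to predict, from a token's embedding alone, whether the token's spelling contains a given character.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for a real vocabulary and its learned embedding matrix.
vocab = ["cat", "dog", "token", "char", "probe", "model", "bit", "pack"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 64))   # would be PLM token embeddings

# Binary label: does the token's string contain the character "a"?
labels = np.array([int("a" in tok) for tok in vocab])

# The probe itself is just a linear classifier over the embeddings.
probe = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print("train accuracy:", probe.score(embeddings, labels))
```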
Efficient Encoders for Streaming Sequence Tagging
2023
This work presents HEAR, a Hybrid Encoder with Adaptive Restart, designed for efficient and accurate streaming sequence tagging. Unlike naive bidirectional encoders, HEAR reduces redundant computation and label instability by reusing prior context and selectively restarting bidirectional layers. HEAR maintains strong offline performance while achieving up to 71.1% FLOP savings and +10% improvements in streaming exact match across four tasks.
Researchers
Kaushal, Gupta, Upadhyay, Faruqui
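A highly simplified sketch of the adaptive-restart idea (the stand-in encoders, restart policy, and names below are placeholders, not the HEAR implementation): new tokens get a cheap unidirectional label, and the expensive bidirectional pass over the prefix is re-run only when a restart policy fires.

```python
def stream_tag(tokens, unidirectional_tag, bidirectional_tag, should_restart):
    """Tag a token stream, restarting the expensive bidirectional pass only
    when should_restart(step, labels) says so (illustrative pseudologic)."""
    labels = []
    for t, _ in enumerate(tokens):
        # Cheap path: extend existing labels with a unidirectional prediction.
        labels.append(unidirectional_tag(tokens[: t + 1]))
        # Expensive path: occasionally re-tag the whole prefix bidirectionally,
        # which fixes labels the left-to-right pass got wrong.
        if should_restart(t, labels):
            labels = bidirectional_tag(tokens[: t + 1])
    return labels

# Toy usage with trivial stand-in "models":
tokens = "the quick brown fox".split()
uni = lambda prefix: f"UNI:{prefix[-1]}"
bi = lambda prefix: [f"BI:{tok}" for tok in prefix]
restart_every_2 = lambda t, labels: t % 2 == 1
print(stream_tag(tokens, uni, bi, restart_every_2))
```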
Efficient Encoders for Incremental Sequence Tagging
2023
This work addresses the inefficiency of re-running bidirectional models like BERT for every new token in streaming NLU settings. The proposed approach reduces FLOP count and improves generalization on partial inputs using a hybrid partially bidirectional encoder and an adaptive restart mechanism. It retains comparable performance on full sequences while improving efficiency and streaming accuracy across four sequence tagging datasets.
Researchers
Gupta, Kaushal, Faruqui, Upadhyay