Latest Research
Discover our cutting-edge research in efficient AI models and language processing
Spectra-1
ICLR 2025 Spotlight
Surprising Effectiveness of Pretraining Ternary Language Models at Scale
Spectra introduces the first open suite of low-bitwidth LLMs, including TriLMs, QuantLMs, and FloatLMs, from 99M to 3.9B parameters. TriLMs are pretrained ternary models that outperform traditional quantized and floating-point models at scale. The 3.9B TriLM matches the performance of its FloatLM counterpart with far fewer bits, enabling efficient inference. This work pushes the frontier of memory-efficient, scalable language models.
54 multilingual models from 99M to 3.9B parameters
TriLMs: compact, fast, high-performing.
Powerful AI on low-resource devices.
Researchers
Tejas Vaidhya, Ayush Kaushal, Arnab Kumar Mondal, Tejas Pandey, Aaryan Bhagat, Irina Rish
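For readers new to ternary models, the sketch below illustrates the basic idea behind ternary weights: each weight is constrained to {-1, 0, +1} with a per-matrix scale, so it costs roughly 1.58 bits instead of 16 or 32. This is a generic illustration, not the Spectra-1 training recipe; the function names are hypothetical.

```python
import numpy as np

def ternarize(W: np.ndarray):
    """Illustrative absmean-style ternarization (not the exact Spectra recipe).

    Maps a float weight matrix to values in {-1, 0, +1} with a single
    per-matrix scale, so each entry needs ~1.58 bits instead of 16 or 32.
    """
    scale = np.mean(np.abs(W)) + 1e-8                 # per-matrix scaling factor
    W_ternary = np.clip(np.round(W / scale), -1, 1)   # snap to {-1, 0, +1}
    return W_ternary.astype(np.int8), scale

def dequantize(W_ternary: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate float matrix for use in a matmul."""
    return W_ternary.astype(np.float32) * scale

# Example: a small random "layer" compressed to ternary form
W = np.random.randn(4, 4).astype(np.float32)
W_t, s = ternarize(W)
print(W_t)                  # entries are -1, 0, or +1
print(dequantize(W_t, s))   # coarse approximation of the original weights
```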
Hi-NOLIN
Bridging the English-Hindi Language Gap in Open-Source AI
Hi-NOLIN is the first open-source English-Hindi bilingual large language model (LLM), built on the Pythia architecture and expanded to 9B parameters. Trained on a 300B-token corpus spanning English, code, and Hindi, it uses continual pretraining to improve performance across multiple domains. Remarkably, Hi-NOLIN outperforms larger models such as Pythia 12B and the multilingual BLOOM on standard benchmarks.
The best open-source Hindi-English LLM of its size
Extends capability to a new language while boosting English and code performance.
Researchers
Tejas Vaidhya, Ayush Kaushal, Irina Rish
Spectra-1.1
ACL 2025
Scaling Laws and Efficient Inference for Ternary Language Models
This research demonstrates that ternary language models offer superior scaling behavior, providing valuable insights into efficient low-bitwidth language models. Spectra-1.1 introduces a suite of ternary language models (TriLMs) trained on up to 1.2 trillion tokens. These models use quantization-aware training and novel bit-packing schemes to dramatically cut memory use.
TriLMs trained on 1.2T tokens
Up to 5× faster inference with TriRun
Novel 1.6- and 2-bit packing schemes
Researchers
Tejas Vaidhya, Ayush Kaushal, Arnab Kumar Mondal, Tejas Pandey, Aaryan Bhagat, Irina Rish
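The sketch below gives a rough sense of what bit-packing buys: four ternary values fit in a single byte at 2 bits each. The paper's 1.6-bit and 2-bit schemes (and the TriRun kernels) are more sophisticated; the helpers here are hypothetical and cover only the simple 2-bit case.

```python
import numpy as np

def pack_2bit(ternary: np.ndarray) -> np.ndarray:
    """Pack ternary values {-1, 0, +1} into 2 bits each (4 values per byte).

    Values are shifted to {0, 1, 2} so each fits in an unsigned 2-bit field.
    A real inference kernel would unpack these on the fly inside the matmul.
    """
    flat = (ternary.reshape(-1) + 1).astype(np.uint8)          # -1/0/+1 -> 0/1/2
    pad = (-len(flat)) % 4                                     # pad to a multiple of 4
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.uint8)])
    groups = flat.reshape(-1, 4)
    return (groups[:, 0]
            | (groups[:, 1] << 2)
            | (groups[:, 2] << 4)
            | (groups[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_2bit; recovers the first n ternary values."""
    groups = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return groups.reshape(-1)[:n].astype(np.int8) - 1          # 0/1/2 -> -1/0/+1

w = np.random.randint(-1, 2, size=10).astype(np.int8)
assert np.array_equal(unpack_2bit(pack_2bit(w), len(w)), w)    # round-trips exactly
```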
More Publications
LoRD: Low Rank Decomposition of Monolingual Code LLMs for One-Shot Compression
September 2023
This paper demonstrates efficient LLM compression using Low Rank Decomposition, allowing code LLMs to be compressed by up to 39.58% with minimal performance loss. This method provides an effective approach to reducing model size while maintaining code generation capabilities.
Researchers
Vaidhya, Kaushal, Rish
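As a minimal sketch of the low-rank decomposition idea (illustrative only; the rank choice, layer selection, and calibration used in the paper are not reproduced here), a dense weight matrix can be replaced by two thin factors from a truncated SVD:

```python
import numpy as np

def low_rank_decompose(W: np.ndarray, rank: int):
    """Approximate W (d_out x d_in) as B @ A with B: d_out x r and A: r x d_in.

    Parameter count drops from d_out*d_in to r*(d_out + d_in), so a small
    enough rank gives a one-shot compression of the layer.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :rank] * S[:rank]        # fold singular values into the left factor
    A = Vt[:rank, :]
    return B, A

W = np.random.randn(1024, 1024).astype(np.float32)
B, A = low_rank_decompose(W, rank=256)
print(f"kept {(B.size + A.size) / W.size:.0%} of the parameters")        # ~50% here
print("relative error:", np.linalg.norm(W - B @ A) / np.linalg.norm(W))  # approximation cost
```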
Ternary LLMs are more Performant than Quantized FP16 LLMs
September 2023
We introduce TriLM, a family of pretrained ternary language models that are both compact and high-performing. TriLMs outperform their quantized counterparts and rival full-precision models at larger scales. Our findings show that ternary models not only offer superior efficiency in terms of bit-level size but also maintain strong performance on knowledge benchmarks—establishing TriLM as a compelling choice for efficient LLM deployment.
Researchers
Kaushal, Vaidhya, Pandey, Bhagat, Rish
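To make the bit-level size comparison concrete, here is the back-of-the-envelope arithmetic: a ternary weight carries log2(3) ≈ 1.58 bits versus 16 bits for FP16. The 3.9B parameter count below is illustrative, borrowed from the Spectra-1 suite above.

```python
import math

bits_ternary = math.log2(3)   # ~1.585 bits of information per ternary weight
bits_fp16 = 16.0              # bits per FP16 weight
params = 3.9e9                # illustrative parameter count

print(f"ternary: {params * bits_ternary / 8 / 1e9:.2f} GB")  # ~0.77 GB
print(f"fp16:    {params * bits_fp16 / 8 / 1e9:.2f} GB")     # ~7.80 GB
print(f"ratio:   {bits_fp16 / bits_ternary:.1f}x smaller")   # ~10.1x
```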
Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting
September 2024
Lag-Llama is a general-purpose foundation model for univariate probabilistic time series forecasting, built on a decoder-only transformer using lagged values as covariates. Pretrained on a diverse corpus of time series data, it shows strong zero-shot generalization and achieves state-of-the-art performance when fine-tuned on small amounts of unseen data. Lag-Llama sets a new benchmark for foundation models in time series forecasting.
Researchers
Rasul, Ashok, Williams, Ghonia, Bhagwatkar, Khorasani, Bayazi, Adamopoulos, Riachi, Hassen, Biloš, Garg, Schneider, Chapados, Drouin, Zantedeschi, Nevmyvaka, Rish
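As a sketch of what using lagged values as covariates means in practice (illustrative only; Lag-Llama's actual lag set and feature pipeline differ), each time step is paired with the values observed at a fixed set of past offsets:

```python
import numpy as np

def lag_features(series: np.ndarray, lags=(1, 2, 3, 7, 14)):
    """Build a matrix of lagged covariates for a univariate series.

    Row t holds series[t - lag] for each lag; a forecasting model then
    predicts a distribution over series[t] conditioned on these past values.
    """
    max_lag = max(lags)
    rows = [[series[t - lag] for lag in lags] for t in range(max_lag, len(series))]
    targets = series[max_lag:]
    return np.array(rows), targets

series = np.sin(np.arange(100) / 5.0)   # toy univariate series
X, y = lag_features(series)
print(X.shape, y.shape)                 # (86, 5) (86,)
```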
What do tokens know about their characters and how do they know it?
September 2023
This work investigates how pretrained language models (PLMs) encode character-level information despite using subword tokenization. By probing embeddings from models like GPT-J, BERT, and RoBERTa, the study finds that PLMs reliably capture whether specific characters appear in a token—even across non-Latin scripts. Larger models generally encode this information more robustly. The analysis suggests this ability arises from patterns in tokenization, character–POS correlations, and language variability.
Researchers
Kaushal, Mahowald
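A minimal sketch of this kind of probing setup, using a toy vocabulary and random vectors in place of real GPT-J, BERT, or RoBERTa embeddings: a linear classifier is trained to predict, from a token's embedding alone, whether the token's spelling contains a given character.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for a real vocabulary and its learned embedding matrix.
vocab = ["cat", "dog", "token", "char", "probe", "model", "bit", "pack"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 64))   # would be PLM token embeddings

# Binary label: does the token's string contain the character "a"?
labels = np.array([int("a" in tok) for tok in vocab])

# The probe itself is just a linear classifier over the embeddings.
probe = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print("train accuracy:", probe.score(embeddings, labels))
```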
Efficient Encoders for Streaming Sequence Tagging
2023
This work presents HEAR, a Hybrid Encoder with Adaptive Restart, designed for efficient and accurate streaming sequence tagging. Unlike naive bidirectional encoders, HEAR reduces redundant computation and label instability by reusing prior context and selectively restarting bidirectional layers. HEAR maintains strong offline performance while achieving up to 71.1% FLOP savings and +10% improvements in streaming exact match across four tasks.
Researchers
Kaushal, Gupta, Upadhyay, Faruqui
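A highly simplified sketch of the adaptive-restart idea (the stand-in encoders, restart policy, and names below are placeholders, not the HEAR implementation): new tokens get a cheap unidirectional label, and the expensive bidirectional pass over the prefix is re-run only when a restart policy fires.

```python
def stream_tag(tokens, unidirectional_tag, bidirectional_tag, should_restart):
    """Tag a token stream, restarting the expensive bidirectional pass only
    when should_restart(step, labels) says so (illustrative pseudologic)."""
    labels = []
    for t, _ in enumerate(tokens):
        # Cheap path: extend existing labels with a unidirectional prediction.
        labels.append(unidirectional_tag(tokens[: t + 1]))
        # Expensive path: occasionally re-tag the whole prefix bidirectionally,
        # which fixes labels the left-to-right pass got wrong.
        if should_restart(t, labels):
            labels = bidirectional_tag(tokens[: t + 1])
    return labels

# Toy usage with trivial stand-in "models":
tokens = "the quick brown fox".split()
uni = lambda prefix: f"UNI:{prefix[-1]}"
bi = lambda prefix: [f"BI:{tok}" for tok in prefix]
restart_every_2 = lambda t, labels: t % 2 == 1
print(stream_tag(tokens, uni, bi, restart_every_2))
```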
Efficient Encoders for Incremental Sequence Tagging
2023
This work addresses the inefficiency of re-running bidirectional models like BERT for every new token in streaming NLU settings. The proposed approach reduces FLOP count and improves generalization on partial inputs using a hybrid partially bidirectional encoder and an adaptive restart mechanism. It retains comparable performance on full sequences while improving efficiency and streaming accuracy across four sequence tagging datasets.
Researchers
Gupta, Kaushal, Faruqui, Upadhyay