Using AI and Nushell for simple data cleaning and validation

Submitted by Lennart on Sat, 21 Jun 2025 - 18:13

Imagine you have a page with some text and some links, and only some of the links are RSS feeds.

But you want those RSS feeds for a newsreader of some kind.

Just copy the whole page and feed it to an LLM with the right instructions.

In this case I use sigoden’s aichat with the Llama 4 Scout model hosted on Cerebras:

p | aichat --model cer:llama-4-scout-17b-16e-instruct just provide the rss url and feed urls and nothing else provide one url per line

p is just a command that outputs whatever I copied into my system clipboard.
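For readers who want to follow along, here is a minimal sketch of what such a command could look like in Nushell. The definition is an assumption, not the one used in this post; it simply wraps whichever clipboard tool your system provides.

```nushell
# Hypothetical definition of `p`: print the system clipboard.
# Keep the one line that matches your platform.
def p [] {
    ^pbpaste                               # macOS
    # ^wl-paste                            # Linux (Wayland)
    # ^xclip -selection clipboard -o       # Linux (X11)
    # ^powershell.exe -NoProfile -Command Get-Clipboard   # Windows
}
```

Putting the definition in your Nushell config file makes it available in every session.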

Here are more examples of similar approaches with other command line LLM tools:

Using OpenAI’s GPT with fabric:

p | fabric --pattern extract_rss_feeds

Using Ollama with a local model:

p | ollama run llama3.2:3b "Extract only RSS feed URLs from this content. Output one URL per line with no other text."

Let’s return to the initial example. Maybe we want some kind of validation that the URLs returned by the LLM are actually RSS feeds.

With Nushell that is easy to do:

p | aichat --model cer:llama-4-scout-17b-16e-instruct "extract rss urls only" | lines | where $it =~ 'feed|rss|xml'

This validation is far from perfect, and it is no guarantee that the feeds actually work. Verifying that would take more code, but it could certainly be done quickly in Nushell.
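As a rough sketch of what such a check might look like: fetch each URL and keep it only if the response body contains the root element of an RSS, Atom or RDF document. The command name looks-like-feed is made up for this illustration, and a substring test is only a heuristic, not a real feed parser.

```nushell
# Hypothetical helper: true if the URL answers with something that looks like a feed.
def looks-like-feed [url: string] {
    try {
        # --raw skips Nushell's automatic parsing so the body stays plain text
        let body = (http get --raw $url)
        ($body | str contains '<rss') or ($body | str contains '<feed') or ($body | str contains '<rdf:RDF')
    } catch {
        # unreachable host, HTTP error, non-text body and so on
        false
    }
}

p | aichat --model cer:llama-4-scout-17b-16e-instruct "extract rss urls only" | lines | str trim | where {|url| looks-like-feed $url }
```

This makes one request per URL, so it is slower than the pattern match above, but it catches dead links and pages that merely have "feed" in their address.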

There are plenty of other things LLMs can easily find and present for you that would otherwise be time-consuming to dig out of the surrounding text.

Let’s have a look at a few more examples:

Extracting Email Addresses

Suppose we have some pages with a lot of text and a few email addresses scattered around.

Feeding the content of these pages to an LLM:

p | aichat --model cer:llama-4-scout-17b-16e-instruct extract email addresses one per line

Would output something like:

support@example.com
admin@example.net
team@example.io

These results are even easier to validate:

p | aichat --model cer:llama-4-scout-17b-16e-instruct extract email addresses one per line | lines | where $it =~ '@'
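Checking for '@' alone would also let through a line such as "meet us @ the office". A slightly stricter sketch, assuming a deliberately simple pattern (something, an @, something, a dot, something, with no whitespace) rather than full address validation:

```nushell
# Keep only lines shaped like local@domain.tld, then drop duplicates.
p | aichat --model cer:llama-4-scout-17b-16e-instruct "extract email addresses one per line" | lines | str trim | where $it =~ '^[^@\s]+@[^@\s]+\.[^@\s]+$' | uniq
```

The pattern does not prove that an address exists; it only filters out lines that cannot be addresses.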

Extracting Names

LLMs are very useful for extracting data like names, since they have a reasonable sense of what a name looks like.

Suppose we have a page with some text and names:

Our team consists of John Smith, Jane Doe, and Bob Johnson.
You can reach out to John or Jane for more information.

Feeding this to an LLM:

p | aichat --model cer:llama-4-scout-17b-16e-instruct extract names one per line

Output:

John Smith
Jane Doe
Bob Johnson
John
Jane
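Names cannot be validated with a pattern, but one cheap check is still possible: every extracted name should appear verbatim in the text that was sent to the model. The following is a sketch under that assumption; it catches invented or altered names, not missed ones.

```nushell
# Keep the copied page in a variable so the answer can be checked against it.
let page = (p)

# Drop empty lines, keep only names that literally occur in the page, remove duplicates.
$page | aichat --model cer:llama-4-scout-17b-16e-instruct "extract names one per line" | lines | str trim | where $it != '' | where {|name| $page | str contains $name } | uniq
```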

One last example, which anyone working with a large corpus of text might find useful, is removing stopwords.

Removing Stopwords

Suppose we have a page with some text and we want to remove common stopwords like “the”, “and”, etc.:

This is an example sentence with common words like the and a.

Feeding this to an LLM:

p | aichat --model cer:llama-4-scout-17b-16e-instruct remove stopwords

Output:

example sentence common words like

There are certainly more efficient ways to solve these problems, not least the stopwords example.
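For comparison, here is a sketch of the stopwords case done without an LLM. The stopword list is a tiny illustrative one chosen for the example sentence; a real corpus would need a proper list.

```nushell
# Illustrative stopword list, not a complete one.
let stopwords = ["this" "is" "an" "with" "the" "and" "a"]
let text = "This is an example sentence with common words like the and a."

# Split into words, drop the stopwords (case-insensitively), join the rest again.
$text | split words | where {|w| ($w | str downcase) not-in $stopwords } | str join ' '
```

This runs instantly and gives the same result every time, but someone has to decide up front which words count as stopwords.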

But LLMs make it easy for everyone to clean data and extract the information they want, because the prompts can be written in plain everyday language.

And Nushell makes it easy to validate this extracted data and continue working with it right from the command line.
