Many real-world Excel files contain messy, incomplete, or inconsistent values. To simplify cleaning tasks, frosts provides a predicates submodule for common checks like blank cells or matching values, as well as built-in DataFrame methods for common cleaning patterns.
The predicates submodule, Boolean Predicates, and filtering
To reduce repetitive code and avoid overwhelming new programmers with custom predicates, frosts comes equipped with several out-of-the-box Boolean-returning predicates that can easily be used with df.filter() for data cleaning. All predicates can be called with:
fr.predicates.predicate_name // For predicates without inputs, like is_blank and is_nan
fr.predicates.predicate_name(value) // For predicates with inputs, like equal and includes
All predicates can also be negated with fr.not(). For example, to filter out blanks in a column you could use:
df.filter("Header", fr.not(fr.predicates.is_blank))
If these negation calls become too lengthy, consider aliasing fr.predicates and fr.not as follows:
// Aliasing commands
const pr = fr.predicates;
const not = fr.not;

// Simplified filter calls
df.filter("Header", not(pr.is_blank))
df.filter("Header", not(pr.includes("East Coast:")))
is_blank
Returns true if the value is a blank string ("").
df.filter("Status", fr.predicates.is_blank)
Note: this is a loose equality (==) check, not strict equality (===), so is_blank will also match 0s and nulls, but NOT NaNs in numeric columns.
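If you need a strict check instead, you can pass your own predicate function; a minimal sketch, assuming df.filter() accepts any (value) => boolean function (see the Filtering section for custom predicates):
df.filter("Status", (value) => value === "") // Assumed custom predicate: only true empty strings pass, not 0 or null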
is_nan
Returns true if the row value is NaN. Used instead of is_blank for numeric columns.
df.filter("Sales", fr.predicates.is_nan)
equal(value: CellValue)
Checks if row values equal the given value.
df.filter("Status", fr.predicates.equal("Active")) //Filters to only have rows with an "Active" Status.
includes(value: string)
Checks if row values include the given text.
let east_coast = df.filter("Region - Property", fr.predicates.includes("East"))
Since the .includes() method only works on strings, numeric and boolean values are coerced to strings for this check.
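For instance, a numeric column can still be matched on its string form (the "Year" column here is hypothetical):
df.filter("Year", fr.predicates.includes("202")) // Hypothetical numeric column: 2020, 2021, ... are stringified before the check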
starts_with(value: string)
Checks if row values start with the given text.
df.filter("Region", fr.predicates.starts_with("East"))
ends_with(value: string)
Checks if row values end with the given text.
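For example (column name and value illustrative):
df.filter("Region", fr.predicates.ends_with("Coast"))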
To learn more about custom predicates and advanced data filtering, see the Filtering section.
encode_headers()
encode_headers(
columnName: string,
isHeaderRow: (row: FrostRow) => boolean,
extractValue: (row: FrostRow) => CellValue,
keepHeaders: boolean = false
): DataFrame
Excel exports from property management systems, financial tools, or legacy systems often include “section headers” embedded in the rows — such as property names or account groupings — instead of having that value in a clean, dedicated column.
The encode_headers()
method lets you identify these headers, extract their values, and fill them down into a new column, making your dataset tabular and analysis-ready.
Parameters
Parameter | Type | Description |
---|---|---|
columnName | string | The name of the new column that will store the header values. |
isHeaderRow | (row: FrostRow) => boolean | A function that returns true if a given row is a header row. |
extractValue | (row: FrostRow) => CellValue | A function that extracts the header value from a header row. |
keepHeaders | boolean=false | Whether to keep the rows flagged as headers in the output (default false ) |
Returns a new DataFrame
with the extracted header values encoded in the specified column.
Suppose you have an Excel export in this form:
City | Sales | Variance to Budget |
---|---|---|
Sales Region: West (W001) | ||
Los Angeles | 5500.00 | 5.2% |
San Diego | 4200.00 | 4.8% |
Sales Region: Central (C002) | ||
Dallas | 6100.00 | 5.5% |
Houston | 5900.00 | 5.3% |
Sales Region: East (E003) | ||
New York | 8000.00 | 6.1% |
Boston | 7200.00 | 6.0% |
You can propagate the region code using:
df.encode_headers(
    "RegionCode",
    row => row.get_string("City").includes("Sales Region: "),
    row => row.get_string("City").split("(")[1]?.split(")")[0] // Get the text between the parentheses
)
With keepHeaders = false
this would return:
City | Sales | Variance to Budget | RegionCode |
---|---|---|---|
Los Angeles | 5500.00 | 5.2% | W001 |
San Diego | 4200.00 | 4.8% | W001 |
Dallas | 6100.00 | 5.5% | C002 |
Houston | 5900.00 | 5.3% | C002 |
New York | 8000.00 | 6.1% | E003 |
Boston | 7200.00 | 6.0% | E003 |
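To keep the section header rows themselves in the output, pass true as the fourth argument; a sketch of the same call with keepHeaders enabled:
df.encode_headers(
    "RegionCode",
    row => row.get_string("City").includes("Sales Region: "),
    row => row.get_string("City").split("(")[1]?.split(")")[0],
    true // keepHeaders
)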
And with keepHeaders = true
this would return
City | Sales | Variance to Budget | RegionCode |
---|---|---|---|
Sales Region: West (W001) | | | W001 |
Los Angeles | 5500.00 | 5.2% | W001 |
San Diego | 4200.00 | 4.8% | W001 |
Sales Region: Central (C002) | | | C002 |
Dallas | 6100.00 | 5.5% | C002 |
Houston | 5900.00 | 5.3% | C002 |
Sales Region: East (E003) | | | E003 |
New York | 8000.00 | 6.1% | E003 |
Boston | 7200.00 | 6.0% | E003 |
Tips:
- Rows before the first detected header will receive an empty string in the new column (see the sketch after this list for one way to handle them).
- You can chain this with .drop() if you want to remove the original header column.
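One way to handle rows that appear before the first header is to chain .fill_na() (documented below); a hedged sketch, with the "UNKNOWN" placeholder chosen purely for illustration:
df.encode_headers(
    "RegionCode",
    row => row.get_string("City").includes("Sales Region: "),
    row => row.get_string("City").split("(")[1]?.split(")")[0]
).fill_na("RegionCode", "value", "UNKNOWN") // Replace the leading blanks with an illustrative placeholder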
.fill_na(columnName: string | string[] | "ALL", method: "prev" | "next" | "value", value?: string | number | boolean): DataFrame
Missing values (null or "") are common in real-world Excel datasets, especially those exported from forms, sensors, or user input. These gaps can break analysis, distort averages, or even cause charts to fail.
The .fill_na()
method provides flexible ways to handle missing values without needing formulas or manual patching:
- 🔁 Forward-fill ("prev") — Repeats the last known value down the column. Useful for time series, log files, or cumulative data.
- 🔄 Backward-fill ("next") — Propagates the next known value up. Helpful when later values are known but earlier ones are missing.
- 🔢 Constant fill ("value") — Replaces all missing entries with a custom value like 0, "N/A", or false.
This is essential for:
- Cleaning messy reports with non-standard headers
- Making statistical operations like .mean() or .sum() reliable
- Replacing blanks that occur after encode_headers() or merges
Use it when your dataset has blanks that you want to automatically and consistently handle.
Parameters:
Parameter | Type | Description |
---|---|---|
columnName | string , string[] , or "ALL" | The name of the column (or columns) to fill. Use "ALL" to apply to every column. |
method | "prev" , "next" , or "value" | The strategy to use when filling missing values: - "prev" : Fill with the previous non-null value- "next" : Fill with the next non-null value- "value" : Fill with a user-provided constant |
value | string , number , or boolean | Required only if method is "value" .The constant value to use when replacing missing cells. Ignored for "prev" and "next" strategies. |
const df = new frosts.DataFrame([
["Day", "Temperature"],
["Mon", 72],
["Tue", null],
["Wed", 75],
["Thu", null],
["Fri", 78]
]);
// Fill nulls using previous non-null value
const forwardFill = df.fill_na("Temperature", "prev");
console.log(forwardFill.get_column("Temperature"));
// Output: [72, 72, 75, 75, 78]
// Fill nulls using next non-null value
const backwardFill = df.fill_na("Temperature", "next");
console.log(backwardFill.get_column("Temperature"));
// Output: [72, 75, 75, 78, 78]
// Fill nulls using a constant value
const filledWithZero = df.fill_na("Temperature", "value", 0);
console.log(filledWithZero.get_column("Temperature"));
// Output: [72, 0, 75, 0, 78]
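The first argument also accepts an array of column names or "ALL"; a brief sketch on the same DataFrame, per the parameter table above:
// Fill missing values in every column with a constant
const filledAll = df.fill_na("ALL", "value", 0);

// Or target several columns at once with forward-fill
const filledSome = df.fill_na(["Day", "Temperature"], "prev");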
melt(newColumnName: string, newValueName: string, ...columns: string[]): DataFrame
Many Excel datasets are structured in a wide format, where categories are stored as separate columns. This makes them difficult to analyze, chart, or aggregate in platforms like PowerBI.
The .melt()
method solves this by transforming your data into a long format (also called “tidy data”), where each row represents a single observation. This is essential for:
- 📊 Pivot tables and charts — Long format is easier to slice, aggregate, and visualize.
- 🧮 Statistical summaries — Grouping by a single column like "Subject" or "Attribute" becomes possible.
- 🔄 Consistency — Makes your data uniform and easier to work with in pipelines.
You can think of melt()
as the opposite of a pivot: it “unspreads” your data and turns columns into rows.
Parameters:
Parameter | Type | Description |
---|---|---|
newColumnName | string | The name of the new column that will contain the original column names (e.g. "Subject" ). |
newValueName | string | The name of the new column that will contain the values from the melted columns (e.g. "Score" ). |
...columns | string[] | One or more column names to be unpivoted into long format. These are the columns that will be “melted” into rows. |
Note:
- All columns not listed in ...columns will be treated as identifier columns and preserved in the output.
- If the melted columns contain mixed types (e.g., numbers and booleans), their values will be automatically converted to strings for consistency.
const df = new frosts.DataFrame([
["Name", "Math", "English", "Science"],
["Alice", 90, 85, 95],
["Bob", 78, 82, 88]
]);
// Convert subject scores into rows
const melted = df.melt("Subject", "Score", "Math", "English", "Science");
console.log(melted.to_array());
/*
Output:
[
["Name", "Subject", "Score"],
["Alice", "Math", 90],
["Alice", "English", 85],
["Alice", "Science", 95],
["Bob", "Math", 78],
["Bob", "English", 82],
["Bob", "Science", 88]
]
*/
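Columns you leave out of ...columns stay behind as identifiers; a quick sketch on the same DataFrame, melting only two of the three subjects:
// "Science" is not listed, so it is kept alongside "Name" on every output row
const partial = df.melt("Subject", "Score", "Math", "English");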
If the selected columns have mixed types, all values are converted to strings:
const df = new frosts.DataFrame([
["Name", "Age", "Score", "Enrolled"],
["Alice", 25, 90, true],
["Bob", 30, 78, false]
]);
// Melt the Age, Score, and Enrolled columns into rows
const melted = df.melt("Attribute", "Value", "Age", "Score", "Enrolled");
console.log(melted.to_array());
/*
Output:
[
["Name", "Attribute", "Value"],
["Alice", "Age", "25"],
["Alice", "Score", "90"],
["Alice", "Enrolled", "true"],
["Bob", "Age", "30"],
["Bob", "Score", "78"],
["Bob", "Enrolled", "false"]
]
*/
melt_except(newColumnName: string, newValueName: string, ...except: string[]): DataFrame
Like melt()
, this method reshapes your data from wide to long format — but instead of specifying which columns to melt, you specify which ones to exclude. This is especially useful when most of your dataset should be unpivoted, and you want to preserve just a few key identifiers like "Name"
or "Date"
.
Use melt_except()
when:
- You’re working with datasets where the majority of columns represent measured variables.
- It’s easier to list a few columns to keep than many to melt.
- You’re working with a dynamic import whose number of column attributes may grow/shrink in size.
- Example: A financial report with 1 column per location.
The excluded columns will remain untouched, and all others will be stacked into long format using the provided newColumnName
and newValueName
.
const df = new frosts.DataFrame([
["Name", "Math", "English", "Science"],
["Alice", 90, 85, 95],
["Bob", 78, 82, 88]
]);
// Melt all columns except "Name"
const melted = df.melt_except("Subject", "Score", "Name");
console.log(melted.to_array());
/*
Output:
[
["Name", "Subject", "Score"],
["Alice", "Math", 90],
["Alice", "English", 85],
["Alice", "Science", 95],
["Bob", "Math", 78],
["Bob", "English", 82],
["Bob", "Science", 88]
]
*/
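If your data has more than one identifier, list each column to exclude; a short sketch with a hypothetical "Term" column added:
const df2 = new frosts.DataFrame([
    ["Name", "Term", "Math", "English"],
    ["Alice", "Fall", 90, 85],
    ["Bob", "Fall", 78, 82]
]);

// Keep both "Name" and "Term"; melt every other column
const melted2 = df2.melt_except("Subject", "Score", "Name", "Term");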