ProbableOdyssey

Bug hunt: Rescaling DataFrames in Python

This is one of the tricky bugs I fixed relatively early in my career. This one took a solid few days to truly understand, but it boiled down to a relatively simple fix.

The problem

This bug affected a data export pipeline in a legacy codebase. We were converting raw time series values into a proprietary data format (with permission) so that the data could be inspected in another program by the technicians.

Some of the technicians filed a bug report with us:

Some of the files would crash the program when opened

Their workaround was to re-create the file from a subset of the data, which suggested that some bad values weren’t being scaled properly, probably leading to an “out-of-bounds” error.

The steps that are taken to convert the raw data:

The part of the pipeline concerned with rescaling values was:

import pandas as pd

def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        _data = data[column]

        # CLIPPING
        if abs_max > absolute_max:
            abs_max = 5 * (10**-3)
            _data = _data.clip(lower=-abs_max, upper=abs_max)

        # RESCALING
        _data = (_data / abs_max * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined".

        # TYPECASTING
        _data = _data.astype(int)

        # RECOMBINING COLUMN IN DF
        data[column] = _data

    return data

Can you spot the bug?

The solution

It turns out the issue was a shared variable being overwritten inside the loop:

    # CLIPPING
    if abs_max > absolute_max:
        abs_max = 5 * (10**-3)  # <--- ISSUE OCCURS HERE
        _data = _data.clip(lower=-abs_max, upper=abs_max)

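abs_max is computed once across all channels, but the first pass through the loop overwrites it with 5mV. From then on, abs_max > absolute_max is false. So in cases when 5mV < absmax(CH1) < absmax(CH2), the marked if statement is only entered once, for CH1. CH2 is never clipped, resulting in values in CH2 that can exceed +/- 512.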
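To make the failure concrete, here is a minimal reproduction sketch. The function is a condensed copy of the buggy one above; the name `rescale_channels_buggy`, the `LIMIT` constant, the defensive `copy()`, and the sample values are all assumptions made up for illustration, not taken from the real pipeline.

```python
import pandas as pd

LIMIT = 5e-3  # 5 mV clipping threshold (same value as absolute_max above)


def rescale_channels_buggy(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    """Condensed copy of the original function, bug included."""
    data = data.copy()  # assumption: copy so the demo input stays untouched
    abs_max = max(
        data[channel_headers].max().max(),
        abs(data[channel_headers].min().min()),
    )
    for column in channel_headers:
        _data = data[column]
        if abs_max > LIMIT:
            abs_max = LIMIT  # shared variable overwritten on the first pass
            _data = _data.clip(lower=-abs_max, upper=abs_max)
        data[column] = (_data / abs_max * 510).round().fillna(511).astype(int)
    return data


# Made-up sample: both channels exceed 5 mV, CH2 by more than CH1.
sample = pd.DataFrame(
    {
        "CH1": [0.001, 0.006, -0.002],
        "CH2": [0.002, -0.008, 0.004],
    }
)
result = rescale_channels_buggy(sample, ["CH1", "CH2"])
print(result)
# CH1 is clipped, so 0.006 becomes 510.
# CH2 is not clipped, so -0.008 becomes -0.008 / 0.005 * 510 = -816.
```

Swapping the column order in the sample would hide the problem in CH2 and move it to CH1, which is consistent with only some exported files being affected.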

The fix is thankfully quite simple:

# channel_headers = ["ECG1", "ECG2", "DIFF"]
def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        abs_max_col = abs_max  # <--- CREATE A TEMP ITERATION VARIABLE TO AVOID OVERWRITE
        _data = data[column]

        # CLIPPING
        if abs_max_col > absolute_max:
            abs_max_col = 5 * (10**-3)
            _data = _data.clip(lower=-abs_max_col, upper=abs_max_col)

        # RESCALING
        _data = (_data / abs_max_col * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined".

        # TYPECASTING
        _data = _data.astype(int)

        # RECOMBINING COLUMN IN DF
        data[column] = _data

    return data

Once this was implemented, I added a unit test to capture this case and confirmed with the technicians that files exported with the fix opened without crashing.
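The test itself isn't reproduced in this post, so the following is only a sketch of what such a regression test might look like, assuming a pytest-style test function. The fixed function is condensed and copies its input (an assumption, to keep the test free of side effects); the test name and sample values are hypothetical.

```python
import pandas as pd

LIMIT = 5e-3  # 5 mV clipping threshold (same value as absolute_max above)


def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    """Condensed copy of the fixed function."""
    data = data.copy()  # assumption: avoid mutating the caller's frame
    abs_max = max(
        data[channel_headers].max().max(),
        abs(data[channel_headers].min().min()),
    )
    for column in channel_headers:
        abs_max_col = abs_max  # per-iteration copy, never overwrites abs_max
        _data = data[column]
        if abs_max_col > LIMIT:
            abs_max_col = LIMIT
            _data = _data.clip(lower=-abs_max_col, upper=abs_max_col)
        data[column] = (_data / abs_max_col * 510).round().fillna(511).astype(int)
    return data


def test_every_channel_is_clipped():
    # Both channels exceed 5 mV, and the later one exceeds it by more:
    # the exact situation that slipped past the original code.
    sample = pd.DataFrame(
        {
            "CH1": [0.001, 0.006, -0.002],
            "CH2": [0.002, -0.008, 0.004],
        }
    )
    result = rescale_channels(sample, ["CH1", "CH2"])

    assert result["CH1"].tolist() == [102, 510, -204]
    assert result["CH2"].tolist() == [204, -510, 408]
    # No NaNs in the sample, so nothing should fall outside +/- 510.
    assert result.abs().max().max() <= 510
```

Putting the larger excursion in the second channel is the important design choice here: a test where only the first channel exceeds the limit would pass against the buggy version too.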

Lesson I learned from this:
