Bug hunt: Rescaling DataFrames in Python
This is one of the tricky bugs I fixed relatively early in my career. This one took a solid few days to truly understand, but it boiled down to a relatively simple fix.
The problem
This bug affected a data export pipeline in a legacy codebase. We were converting raw time series values into a proprietary data format (with permission) so that the data could be inspected in another program by the technicians.
Some of the technicians raised a bug report to us:
Some of the files would crash the program when opened
Their workaround was to re-create the file on a subset of the data, indicating there were potentially some bad values that weren’t being scaled properly, probably leading to an “out-of-bounds” error.
The steps that are taken to convert the raw data:
- Clip values to +/- 5 mV, which are the limits of what can be represented in the proprietary filetype
- Rescale the channels so that all values are within +/- 512, which is the range of integers that can be encoded in 10 bits
- Split up the study data into 1 hour frames
- Convert each scaled frame of data into binary and append to proprietary file.
The part of the pipeline concerned with rescaling values was:
import pandas as pd

def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        _data = data[column]

        # CLIPPING
        if abs_max > absolute_max:
            abs_max = 5 * (10**-3)
            _data = _data.clip(lower=-abs_max, upper=abs_max)

        # RESCALING
        _data = (_data / abs_max * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined".

        # TYPECASTING
        _data = _data.astype(int)

        # RECOMBINING COLUMN IN DF
        data[column] = _data

    return data
Can you spot the bug?
The solution
It turns out the issue was a variable shared by every loop iteration being overwritten inside the loop:
# CLIPPING
if abs_max > absolute_max:
    abs_max = 5 * (10**-3)  # <--- ISSUE OCCURS HERE
    _data = _data.clip(lower=-abs_max, upper=abs_max)
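abs_max is computed once across all channels, and the first iteration overwrites it with 5 mV. From the second channel onwards the condition abs_max > absolute_max is therefore false, so the highlighted if statement is only entered once. For example, when 5 mV < absmax(CH1) < absmax(CH2), CH1 is clipped but CH2 is not, resulting in values in CH2 that exceed +/- 512.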
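To see the failure concretely, here is a minimal reproduction. It is only a sketch: the function is a condensed copy of the buggy version above, while the channel names, the sample values and the `CLIP_LIMIT` constant are made up for illustration.

```python
import pandas as pd

CLIP_LIMIT = 5e-3  # 5 mV, the same limit the original function hard-codes


def rescale_channels_buggy(data: pd.DataFrame, channel_headers: list[str]) -> pd.DataFrame:
    """Condensed copy of the original function, bug included."""
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    for column in channel_headers:
        _data = data[column]
        if abs_max > CLIP_LIMIT:
            # Overwritten here, so this check is false for every later column.
            abs_max = CLIP_LIMIT
            _data = _data.clip(lower=-abs_max, upper=abs_max)
        _data = (_data / abs_max * 510).round()
        _data = _data.fillna(511)
        data[column] = _data.astype(int)
    return data


# Made-up sample: both channels contain a value above the 5 mV limit.
raw = pd.DataFrame({"CH1": [0.001, 0.006], "CH2": [0.002, 0.008]})
scaled = rescale_channels_buggy(raw.copy(), ["CH1", "CH2"])

print(scaled["CH1"].max())  # 510: CH1 was clipped before scaling
print(scaled["CH2"].max())  # 816: CH2 skipped the clip and overshoots the range
```

Swapping the column order moves the overshoot to the other channel, which is a quick way to confirm that the result depends on iteration order rather than on the data alone.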
The fix is thankfully quite simple:
import pandas as pd

# channel_headers = ["ECG1", "ECG2", "DIFF"]
def rescale_channels(
    data: pd.DataFrame,
    channel_headers: list[str],
) -> pd.DataFrame:
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        abs_max_col = abs_max  # <--- CREATE A TEMP ITERATION VARIABLE TO AVOID OVERWRITE
        _data = data[column]

        # CLIPPING
        if abs_max_col > absolute_max:
            abs_max_col = 5 * (10**-3)
            _data = _data.clip(lower=-abs_max_col, upper=abs_max_col)

        # RESCALING
        _data = (_data / abs_max_col * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined".

        # TYPECASTING
        _data = _data.astype(int)

        # RECOMBINING COLUMN IN DF
        data[column] = _data

    return data
Once this was implemented, I added a unit test to capture this case and confirmed that the technicians could open their files with the fix in place.
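The original test isn't shown here, but a regression test for this case could look roughly like the sketch below. It assumes pytest-style naming, includes a condensed copy of the fixed function so it runs on its own, and uses a hypothetical fixture in which both channels exceed the limit.

```python
import pandas as pd


def rescale_channels(data: pd.DataFrame, channel_headers: list[str]) -> pd.DataFrame:
    """Condensed copy of the fixed function from above."""
    maximum_point = data[channel_headers].max().max()
    minimum_point = data[channel_headers].min().min()
    abs_max = max(maximum_point, abs(minimum_point))

    absolute_max = 5 * (10**-3)
    for column in channel_headers:
        abs_max_col = abs_max  # fresh copy on every iteration
        _data = data[column]
        if abs_max_col > absolute_max:
            abs_max_col = absolute_max
            _data = _data.clip(lower=-abs_max_col, upper=abs_max_col)
        _data = (_data / abs_max_col * 510).round()
        _data = _data.fillna(511)  # 511 is "undefined"
        data[column] = _data.astype(int)
    return data


def test_every_channel_is_clipped_when_several_exceed_the_limit():
    # Hypothetical fixture: both channels go beyond +/- 5 mV.
    raw = pd.DataFrame({"CH1": [0.001, -0.006], "CH2": [0.002, 0.008]})

    scaled = rescale_channels(raw, ["CH1", "CH2"])

    # No channel may leave the +/- 510 band used for real samples.
    assert scaled.abs().max().max() <= 510
    # The out-of-range CH2 sample is pinned to the upper bound.
    assert scaled["CH2"].tolist() == [204, 510]
```

Run against the buggy version, the second channel would come out as 816 and both assertions would fail, which is the behaviour the test is meant to pin down.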
Lesson I learned from this:
- Don’t overwrite variables that are shared across loop iterations; this leads to tricky-to-spot bugs!
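One way to apply this lesson is to avoid reassigning anything inside the loop at all. The sketch below is a hypothetical alternative, not the fix that was shipped: it computes the scale once, never changes it, and lets pandas clip and rescale all channels together. The function name and constants are invented for illustration, and unlike the original it returns a copy instead of modifying `data` in place.

```python
import pandas as pd

CLIP_LIMIT = 5e-3  # 5 mV, the clipping limit from the original function
FULL_SCALE = 510   # largest integer used for real samples
UNDEFINED = 511    # marker for missing samples


def rescale_channels_vectorised(data: pd.DataFrame, channel_headers: list[str]) -> pd.DataFrame:
    """Hypothetical variant: the scale is computed once and never reassigned."""
    channels = data[channel_headers]
    # Largest magnitude across all channels, capped at the format limit.
    scale = min(channels.abs().max().max(), CLIP_LIMIT)

    # Clipping is a no-op when nothing exceeds the limit.
    scaled = (channels.clip(lower=-scale, upper=scale) / scale * FULL_SCALE).round()

    result = data.copy()
    result[channel_headers] = scaled.fillna(UNDEFINED).astype(int)
    return result


# Made-up sample values, both channels above the limit.
raw = pd.DataFrame({"CH1": [0.001, 0.006], "CH2": [0.002, 0.008]})
print(rescale_channels_vectorised(raw, ["CH1", "CH2"]))
```

Because `scale` is assigned exactly once, there is no state left over from one channel that could affect the next.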