Upcasting in Python
Upcasting
Have you ever set a column's type to int, only to see it come out as a float when you display it in a report or visualization? This happens because of something called upcasting. As the pandas documentation puts it: "Types can potentially be upcasted when combined with other types, meaning they are promoted from the current type (e.g. int to float)."
Let's start with a DataFrame of 1 column and 8 rows, where the values are random numbers drawn from the standard normal distribution.
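Before the longer walkthrough, here is a minimal sketch of that promotion. The Series name and its values are made up for illustration; the only point is that combining an integer column with a float gives a float result.

```python
import pandas as pd

# A hypothetical column that starts out as integers
ints = pd.Series([1, 2, 3], dtype="int64")

# Adding a float forces the result into a dtype that can hold both operands
result = ints + 0.5

print(ints.dtype)    # int64
print(result.dtype)  # float64
```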
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')
>>> df1
          A
0  0.406792
1  0.810450
2  1.161985
3 -1.402411
4  1.385434
5 -1.091746
6  0.018586
7 -0.606741
Now let's create a DataFrame with 8 rows and 3 columns, each column with a different dtype.
>>> df2 = pd.DataFrame(dict(
...     A=pd.Series(np.random.randn(8), dtype='float16'),
...     B=pd.Series(np.random.randn(8)),
...     C=pd.Series(np.array(np.random.randn(8), dtype='uint8')),
... ))
>>> df2
          A         B  C
0  0.198608 -0.117426  2
1 -0.751953 -1.399014  0
2  0.959961  1.436223  0
3 -0.013000 -0.000660  0
4  0.676270 -0.480145  0
5  0.329834  1.903244  0
6  0.228149  1.831972  0
7  2.392578 -0.443288  0
Now let's reindex DataFrame 1 so that its columns match those of DataFrame 2. (Columns missing from DataFrame 1 are filled with NaN.)
>>> df3 = df1.reindex_like(df2)
>>> df3
          A   B   C
0  0.406792 NaN NaN
1  0.810450 NaN NaN
2  1.161985 NaN NaN
3 -1.402411 NaN NaN
4  1.385434 NaN NaN
5 -1.091746 NaN NaN
6  0.018586 NaN NaN
7 -0.606741 NaN NaN
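The NaN fill is itself a common source of upcasting: NaN is a float, so an integer column that gains a missing value can no longer stay an integer. A small sketch with made-up data (the Series and its labels are hypothetical):

```python
import pandas as pd

# Hypothetical integer data with labels a, b, c
counts = pd.Series([10, 20, 30], index=["a", "b", "c"], dtype="int64")

# Label "d" has no value in counts, so reindex fills it with NaN
widened = counts.reindex(["a", "b", "c", "d"])

print(counts.dtype)   # int64
print(widened.dtype)  # float64, because NaN cannot be stored in an int64 column
```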
df3 now has NaN values, which we will fill with the float 0.0.
>>> df3 = df1.reindex_like(df2).fillna(value=0.0)
>>> df3
          A    B    C
0  0.406792  0.0  0.0
1  0.810450  0.0  0.0
2  1.161985  0.0  0.0
3 -1.402411  0.0  0.0
4  1.385434  0.0  0.0
5 -1.091746  0.0  0.0
6  0.018586  0.0  0.0
7 -0.606741  0.0  0.0
Now look at what happens when we add df2 to the filled DataFrame:
>>> df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
>>> print(df3)
          A         B    C
0  0.605401 -0.117426  2.0
1  0.058497 -1.399014  0.0
2  2.121946  1.436223  0.0
3 -1.415411 -0.000660  0.0
4  2.061704 -0.480145  0.0
5 -0.761912  1.903244  0.0
6  0.246735  1.831972  0.0
7  1.785838 -0.443288  0.0
>>> print(df3.dtypes)
A    float32
B    float64
C    float64
dtype: object
We got FLOATS when we added a float to an integer! But why?
"The values attribute on a DataFrame return the lower-common-denominator of the dtypes, meaning the dtype that can accommodate ALL of the types in the resulting homogeneous dtyped NumPy array. This can force some upcasting."
Okay, but why do we need a homogeneous dtyped NumPy array? I thought we only used NumPy for random number generation. The answer becomes clear when we look at the types involved.
>>> type(df3['C'])
<class 'pandas.core.series.Series'>
>>> type(df3['C'][0])
<class 'numpy.float64'>
A pandas Series stores its data in a NumPy array (in short, because NumPy arrays make row and column numerical operations very fast). Every element of a NumPy array must have the same type. From the NumPy docs: "An array object represents a multidimensional, homogeneous array of fixed-size items."
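To make the homogeneity rule concrete, here is a small sketch. The frame and its column names are invented for illustration. It shows NumPy choosing one dtype for a mixed list, and pandas doing the same when a DataFrame with differently typed columns is converted to a single array.

```python
import numpy as np
import pandas as pd

# NumPy picks one dtype that fits every element: the int 1 becomes 1.0
arr = np.array([1, 2.5])
print(arr.dtype)  # float64

# Hypothetical frame with one integer column and one float column
mixed = pd.DataFrame({
    "i": pd.Series([1, 2, 3], dtype="int64"),
    "f": pd.Series([0.5, 1.5, 2.5], dtype="float64"),
})

# Each column keeps its own dtype inside the DataFrame...
print(mixed.dtypes)

# ...but a single array needs one dtype, so the ints are upcast to float64
as_array = mixed.to_numpy()
print(as_array.dtype)  # float64

# Add a text column and the only dtype that fits everything is object
with_text = mixed.assign(s=["x", "y", "z"])
print(with_text.to_numpy().dtype)  # object
```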
See what happens when we change one of column C's values to a string.
>>> df4 = df3.copy()  # copy so that df3 itself is left unchanged
>>> df4.loc[1, 'C'] = 'd'
>>> df4
          A         B  C
0  0.605401 -0.117426  2
1  0.058497 -1.399014  d
2  2.121946  1.436223  0
3 -1.415411 -0.000660  0
4  2.061704 -0.480145  0
5 -0.761912  1.903244  0
6  0.246735  1.831972  0
7  1.785838 -0.443288  0
>>> df4.dtypes
A    float32
B    float64
C     object
dtype: object
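Now the column has the 'object' dtype, the only dtype that can hold both the numbers and the string. Upcasting in action! (Newer pandas releases flag this kind of dtype-changing assignment; pandas 2.1 added a FutureWarning for it.)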
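If the integer type matters for your report, one option is to cast the column back explicitly once the arithmetic is done. This is only a sketch with made-up values, and it assumes the column no longer contains NaN or fractional values, since those cannot be represented as integers.

```python
import pandas as pd

# Hypothetical uint8 column, similar in spirit to column C above
c = pd.Series([2, 0, 0], dtype="uint8")

# Adding a float64 Series upcasts the result to float64
summed = c + pd.Series([0.0, 0.0, 0.0])
print(summed.dtype)  # float64

# Cast back explicitly; safe here because every value is a whole number
restored = summed.astype("uint8")
print(restored.dtype)  # uint8
```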