Changing Column Types in Pandas

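Pandas offers four main ways to convert data types:

to_numeric(): safely converts non-numeric types (such as strings) to a suitable numeric type. The related functions to_datetime() and to_timedelta() do the same for datetimes and time deltas.
astype(): converts (almost) any type to (almost) any other type, including the categorical type.
infer_objects(): a utility method that converts object columns holding Python objects to a Pandas type where possible.
convert_dtypes(): converts DataFrame columns to the "best possible" dtype that supports pd.NA.

Each method is explained below with usage examples.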

1. `to_numeric()`

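The best way to convert one or more DataFrame columns to numeric values is the pandas.to_numeric() function. It tries to convert non-numeric objects (such as strings) to integers or floats as appropriate.

Basic usage

to_numeric() takes a Series or a single DataFrame column as input.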

import pandas as pd

s = pd.Series(["8", 6, "7.5", 3, "0.9"])  # mixed strings and numbers
print(s)
# Output:
# 0      8
# 1      6
# 2    7.5
# 3      3
# 4    0.9
# dtype: object

result = pd.to_numeric(s)  # convert to float values
print(result)
# Output:
# 0    8.0
# 1    6.0
# 2    7.5
# 3    3.0
# 4    0.9
# dtype: float64

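The call returns a new Series. Assign the result to a variable or a column name so that you can keep using it.

Error handling

When some values cannot be converted to a numeric type, the errors keyword argument of to_numeric() controls what happens: invalid values can be coerced to NaN, or the input can be returned unchanged. Note that errors='ignore' is deprecated as of pandas 2.2, where the recommended replacement is to catch the exception explicitly.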

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
print(s)
# Output:
# 0         1
# 1         2
# 2       4.7
# 3    pandas
# 4        10
# dtype: object

# The default behavior is to raise an error
try:
    pd.to_numeric(s)
except ValueError as e:
    print(f"ValueError: {e}")

# Coerce invalid values to NaN
result = pd.to_numeric(s, errors='coerce')
print(result)
# Output:
# 0     1.0
# 1     2.0
# 2     4.7
# 3     NaN
# 4    10.0
# dtype: float64

# Return the input unchanged if any value cannot be converted
result = pd.to_numeric(s, errors='ignore')
print(result)
# Output:
# 0         1
# 1         2
# 2       4.7
# 3    pandas
# 4        10
# dtype: object

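When you do not know which DataFrame columns can be reliably converted to a numeric type, applying to_numeric with the ignore option to the whole DataFrame, as in df.apply(pd.to_numeric, errors='ignore'), converts the columns that can be converted and leaves the others untouched.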
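On pandas versions where errors='ignore' is deprecated, a similar "convert what can be converted" result can be obtained by trying each column and keeping the original on failure. The sketch below is one possible way to do that; the helper name to_numeric_where_possible and the sample columns are hypothetical, not part of pandas.

```python
import pandas as pd

def to_numeric_where_possible(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every fully convertible column is numeric."""
    converted = {}
    for name in frame.columns:
        try:
            converted[name] = pd.to_numeric(frame[name])
        except (ValueError, TypeError):
            # At least one value could not be parsed: keep the column unchanged.
            converted[name] = frame[name]
    return pd.DataFrame(converted, index=frame.index)

# Hypothetical sample data: columns a and c are fully numeric, b is not.
df = pd.DataFrame({"a": ["1", "2", "3"], "b": ["x", "5", "6"], "c": ["1.5", "2", "3"]})
out = to_numeric_where_possible(df)
print(out.dtypes)
```

Unlike errors='coerce', this leaves a partly numeric column such as b exactly as it was, so no value is silently replaced by NaN.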

Downcasting

By default, to_numeric() produces int64 or float64 (the integer width can depend on the platform). To save memory with a more compact type, pass the downcast parameter with 'integer', 'signed', 'unsigned', or 'float'.

s = pd.Series([1, 2, -7])
print(s)
# Output:
# 0    1
# 1    2
# 2   -7
# dtype: int64

# Downcast to the smallest integer type that fits
result = pd.to_numeric(s, downcast='integer')
print(result)
# Output:
# 0    1
# 1    2
# 2   -7
# dtype: int8

# Downcast to a smaller float type
result = pd.to_numeric(s, downcast='float')
print(result)
# Output:
# 0    1.0
# 1    2.0
# 2   -7.0
# dtype: float32

2. `astype()`

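The astype() method lets you state explicitly which dtype a DataFrame or Series should have. It is very flexible and will attempt a conversion from one type to any other.

Basic usage

You can pass a NumPy dtype (such as np.int16), a Python type (such as bool), or a Pandas-specific type (such as the categorical dtype). Calling the method attempts the conversion.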

import numpy as np

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Convert the whole DataFrame to int64
df = df.astype(int)
print(df.dtypes)
# Output:
# a    int64
# b    int64
# dtype: object

# Convert column a to int64 and column b to complex
df = df.astype({'a': int, 'b': complex})
print(df.dtypes)
# Output:
# a           int64
# b    complex128
# dtype: object

# Convert a Series to float16
s = pd.Series([1, 2, 3])
s = s.astype(np.float16)
print(s.dtype)
# Output: float16

# Convert a Series to Python strings
s = pd.Series([1, 2, 3])
s = s.astype(str)
print(s.dtype)
# Output: object

# Convert a Series to the categorical dtype
s = pd.Series([1, 2, 3])
s = s.astype('category')
print(s.dtype)
# Output: category

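If astype() does not know how to convert a value in the Series or DataFrame, it raises an error. Since Pandas 0.20.0, passing errors='ignore' suppresses that error and returns the original object instead.

Caveats

astype() is powerful, but it can sometimes convert values "incorrectly".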

s = pd.Series([1, 2, -7])
print(s)
# Output:
# 0    1
# 1    2
# 2   -7
# dtype: int64

# Convert the Series to unsigned 8-bit integers
result = s.astype(np.uint8)
print(result)
# Output:
# 0      1
# 1      2
# 2    249
# dtype: uint8

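The conversion went through, but -7 wrapped around to 249 (that is, 256 - 7). To avoid this kind of error, use the downcast parameter of pd.to_numeric() instead.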
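As a brief illustration of that advice, the sketch below runs the same Series through pd.to_numeric() with two downcast settings. The variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, -7])

# -7 cannot be stored in an unsigned type, so to_numeric leaves the dtype alone
# instead of wrapping the value around.
safe = pd.to_numeric(s, downcast="unsigned")

# With a signed target, the smallest type that holds every value is chosen.
small = pd.to_numeric(s, downcast="signed")

print(safe.tolist(), safe.dtype)    # [1, 2, -7] int64
print(small.tolist(), small.dtype)  # [1, 2, -7] int8
```

The downcast is only applied when every value fits, so the values themselves never change.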

3. `infer_objects()`

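Pandas 0.21.0 introduced the infer_objects() method, which converts object-dtype DataFrame columns to a more specific type (a soft conversion). It only infers from the Python objects actually stored, so strings such as '3' are not parsed into numbers.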

df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')
print(df.dtypes)
# Output:
# a    object
# b    object
# dtype: object

# Try to infer a more suitable dtype
df = df.infer_objects()
print(df.dtypes)
# Output:
# a     int64
# b    object
# dtype: object

4. `convert_dtypes()`

Pandas 1.0 and later provide convert_dtypes(), which converts Series and DataFrame columns to the "best possible" dtype that supports the pd.NA missing value.

df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')
print(df.convert_dtypes().dtypes)
# Output:
# a     Int64
# b    string
# dtype: object

print(df.convert_dtypes(infer_objects=False).dtypes)
# Output:
# a    object
# b    string
# dtype: object

By default, the method infers the type from the object values in each column. Pass infer_objects=False to change this behavior.
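The pd.NA support matters most when a column has missing values. The following sketch, using a made-up two-column frame, shows an integer column that was forced to float64 by a missing value being converted to the nullable Int64 dtype.

```python
import pandas as pd

# Hypothetical data: the None turns column n into float64 on construction.
df = pd.DataFrame({
    "n": [1, None, 3],
    "s": ["x", None, "z"],
})
print(df["n"].dtype)  # float64

converted = df.convert_dtypes()
print(converted.dtypes)

# The whole-number floats become nullable integers and the gap becomes pd.NA.
print(converted["n"].dtype)  # Int64
print(converted["n"].isna().tolist())  # [False, True, False]
```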

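Other practical tips

Keeping integers as integers after conversion

pd.to_numeric(..., errors='coerce') turns integers into floats whenever it has to insert NaN, because NaN is a float. To keep integers, use a nullable integer dtype such as 'Int64', either by calling .convert_dtypes() or directly with .astype('Int64'). Starting with Pandas 2.0, the dtype_backend parameter does the conversion in a single function call.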

df = pd.DataFrame({'A': ['#DIV/0!', 3, 5]})
df['A'] = pd.to_numeric(df['A'], errors='coerce').convert_dtypes()
df['A'] = pd.to_numeric(df['A'], errors='coerce').astype('Int64')
df['A'] = pd.to_numeric(df['A'], errors='coerce', dtype_backend='numpy_nullable')

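Converting string representations of long floats to numbers

When a column holds string representations of long floating-point numbers that must be evaluated exactly, ordinary float conversion loses precision, and pd.to_numeric is even less accurate. In that case, use the Decimal type from Python's built-in decimal module. The column dtype is object, but decimal.Decimal supports all arithmetic operations, so operators and comparisons can still be applied to the whole column.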

from decimal import Decimal

df = pd.DataFrame({'long_float': ["0.1234567890123456789", "0.123456789012345678", "0.1234567890123456781"]})
df['w_float'] = df['long_float'].astype(float)
df['w_Decimal'] = df['long_float'].map(Decimal)

print(df['w_Decimal'] == Decimal(df.loc[1, 'long_float']))
# Output:
# 0    False
# 1     True
# 2    False
# Name: w_Decimal, dtype: bool
print(df['w_float'] == float(df.loc[1, 'long_float']))
# Output:
# 0     True
# 1     True
# 2     True
# Name: w_float, dtype: bool
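To illustrate the point about arithmetic on a Decimal column, here is a small sketch with made-up values. It compares an exact Decimal sum with the same sum done in binary floating point.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical values chosen because they are not exactly representable as floats.
prices = pd.Series(["0.1", "0.2", "0.3"]).map(Decimal)

doubled = prices * 2                        # element-wise; each result is a Decimal
total = sum(prices, Decimal(0))             # exact decimal addition
float_total = sum(float(p) for p in prices) # binary floating-point addition

print(total == Decimal("0.6"))  # True
print(float_total == 0.6)       # False
```

The trade-off is speed: an object column of Decimal values is processed element by element in Python rather than in compiled NumPy code.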

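Converting string representations of long integers to integers

On some platforms (for example, Windows with NumPy older than 2.0), astype(int) converts to int32, which raises an OverflowError when the numbers are very long (such as phone numbers). In that case, try 'int64' or float instead.

df['long_num'] = df['long_num'].astype('int64')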
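A self-contained version of that conversion might look like the sketch below. The column name long_num comes from the text above; the sample numbers are invented.

```python
import pandas as pd

# Invented phone-number-like strings, all larger than the int32 maximum.
df = pd.DataFrame({"long_num": ["13800138000", "4155550123", "442071234567"]})

# Request 64 bits explicitly so the result does not depend on the platform default.
df["long_num"] = df["long_num"].astype("int64")

print(df["long_num"].dtype)                   # int64
print((df["long_num"] > 2**31 - 1).all())     # True
```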

If you run into a SettingWithCopyWarning, you can turn on copy-on-write mode.

import pandas as pd

pd.set_option('mode.copy_on_write', True)

df[[ 'col1',  'col2']] = df[[ 'col1',  'col2']].astype(float)

df = df.assign(**df[[ 'col1',  'col2']].astype(float))

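Converting integers to time deltas

Long strings or integers may represent datetimes or time deltas. Use to_datetime or to_timedelta to convert them to the corresponding dtype. Integer input is interpreted as nanoseconds by default.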

df = pd.DataFrame({'long_int': ['1018880886000000000', '1590305014000000000', '1101470895000000000', '1586646272000000000', '1460958607000000000']})
df['datetime'] = pd.to_datetime(df['long_int'].astype('int64'))
df['timedelta'] = pd.to_timedelta(df['long_int'].astype('int64'))
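When the integers are not in nanoseconds, the unit parameter of to_datetime and to_timedelta states what they mean. The sketch below reuses the first value from the example above and shows that the nanosecond and second forms describe the same instant.

```python
import pandas as pd

ns = 1018880886000000000   # nanoseconds since the UNIX epoch
seconds = ns // 10**9      # the same instant expressed in seconds

from_ns = pd.to_datetime(ns)                # integers default to nanoseconds
from_s = pd.to_datetime(seconds, unit="s")  # state the unit explicitly

print(from_ns)            # 2002-04-15 14:28:06
print(from_ns == from_s)  # True

# to_timedelta accepts the same unit parameter.
print(pd.to_timedelta(90, unit="min") == pd.Timedelta(hours=1, minutes=30))  # True
```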

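Converting time deltas to numbers

To convert a datetime or time delta to a number, use view('int64'). This is useful when building machine-learning models that need time in numeric form. If the original data are strings, convert them to time deltas or datetimes first. Series.view() is deprecated as of pandas 2.2, where astype('int64') yields the same integers.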

df = pd.DataFrame({'Time diff': ['2 days 4:00:00', '3 days', '4 days', '5 days', '6 days']})
df['Time diff in nanoseconds'] = pd.to_timedelta(df['Time diff']).view('int64')
df['Time diff in seconds'] = pd.to_timedelta(df['Time diff']).view('int64') // 10**9
df['Time diff in hours'] = pd.to_timedelta(df['Time diff']).view('int64') // (3600*10**9)
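An alternative that avoids reinterpreting the underlying integers is to use the timedelta accessors or to divide by a unit-sized Timedelta. This is a sketch of that approach, not the method used above.

```python
import pandas as pd

td = pd.to_timedelta(pd.Series(["2 days 4:00:00", "3 days"]))

seconds = td.dt.total_seconds()            # float seconds
hours = td / pd.Timedelta(hours=1)         # float hours
whole_hours = td // pd.Timedelta(hours=1)  # integer hours (floor division)

print(seconds.tolist())      # [187200.0, 259200.0]
print(hours.tolist())        # [52.0, 72.0]
print(whole_hours.tolist())  # [52, 72]
```

Dividing by a Timedelta makes the target unit explicit in the code, so the result does not depend on the resolution the values are stored in.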

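Converting datetimes to numbers

Numerically, a datetime is the time delta between that datetime and the UNIX epoch (1970-01-01), stored in nanoseconds for the default datetime64[ns] dtype.

df = pd.DataFrame({'Date': ['2002-04-15', '2020-05-24', '2004-11-26', '2020-04-11', '2016-04-18']})
df['Time_since_unix_epoch'] = pd.to_datetime(df['Date'], format='%Y-%m-%d').view('int64')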
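The relationship to the UNIX epoch can also be written out explicitly by subtracting the epoch and dividing by the unit you want. The sketch below assumes whole seconds are wanted; the second date is added only to make the arithmetic easy to check.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2002-04-15", "1970-01-02"]), format="%Y-%m-%d")
epoch = pd.Timestamp("1970-01-01")

# Datetime minus datetime gives a time delta; floor-dividing by one second gives integers.
seconds_since_epoch = (dates - epoch) // pd.Timedelta(seconds=1)

print(seconds_since_epoch.tolist())  # [1018828800, 86400]
```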

`astype` is faster than `to_numeric`

import numpy as np

df = pd.DataFrame(np.random.default_rng().choice(1000, size=(10000, 50)).astype(str))
df = pd.concat([df, pd.DataFrame(np.random.rand(10000, 50).astype(str), columns=range(50, 100))], axis=1)

# %timeit is an IPython/Jupyter magic; the dict union operator | needs Python 3.9+
print('Time for astype:')
%timeit df.astype(dict.fromkeys(df.columns[:50], int) | dict.fromkeys(df.columns[50:], float))

print('Time for to_numeric:')
%timeit df.apply(pd.to_numeric)
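Outside IPython, a comparable measurement can be made with the standard-library timeit module. The sketch below uses a much smaller, integer-only frame so that it runs quickly; the sizes and repetition count are arbitrary choices, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(1000, 10)).astype(str))

# Both approaches should produce the same numbers.
via_astype = df.astype(int)
via_to_numeric = df.apply(pd.to_numeric)

t_astype = timeit.timeit(lambda: df.astype(int), number=5)
t_to_numeric = timeit.timeit(lambda: df.apply(pd.to_numeric), number=5)
print(f"astype: {t_astype:.4f}s  to_numeric: {t_to_numeric:.4f}s")
```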

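Creating and combining DataFrames with different dtypes

# DataFrame.append() was removed in pandas 2.0; pd.concat() is the replacement.
d1 = pd.DataFrame(columns=['float_column'], dtype=float)
d1 = pd.concat([d1, pd.DataFrame(columns=['string_column'], dtype=str)], axis=1)
print(d1.dtypes)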

Handling columns that contain units

nutrition = pd.read_csv('https://raw.githubusercontent.com/RubenGavidia/Pandas_Portfolio.py/main/Wes_Mckinney.py/nutrition.csv', index_col=[0])
nutrition.index = pd.RangeIndex(start=0, stop=8789, step=1)
nutrition.set_index('name', inplace=True)
nutrition.replace('[a-zA-Z]', '', regex=True, inplace=True)
nutrition = nutrition.astype(float)
print(nutrition.dtypes)

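# Alternative, starting again from the freshly loaded CSV: collect the units and append them to the column names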
nutrition.index = pd.RangeIndex(start=0, stop=8789, step=1)
nutrition.set_index('name', inplace=True)
units = nutrition.astype(str).replace('[^a-zA-Z]', '', regex=True)
units = units.mode()
units = units.replace('', np.nan).dropna(axis=1)
mapper = {k: k + "_" + units[k].at[0] for k in units}
nutrition.rename(columns=mapper, inplace=True)
nutrition.replace('[a-zA-Z]', '', regex=True, inplace=True)
nutrition = nutrition.astype(float)
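The same idea can be tried without downloading anything. The sketch below uses a tiny invented frame and str.extract to separate the number from the unit; the column names, values, and regular expression are assumptions for illustration only.

```python
import pandas as pd

# Invented data in the style of a nutrition table: every value carries a unit.
raw = pd.DataFrame({
    "protein": ["2.5g", "10g", "0.3g"],
    "sodium": ["5mg", "12mg", "140mg"],
})

clean = pd.DataFrame(index=raw.index)
for col in raw.columns:
    # Split each value into a numeric part and a trailing alphabetic unit.
    parts = raw[col].str.extract(r"^(?P<value>[\d.]+)\s*(?P<unit>[a-zA-Z]+)$")
    unit = parts["unit"].mode().iat[0]  # most common unit in the column
    clean[f"{col}_{unit}"] = parts["value"].astype(float)

print(clean.dtypes)
print(list(clean.columns))  # ['protein_g', 'sodium_mg']
```

Keeping the unit in the column name preserves the information that would otherwise be lost when the letters are stripped.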

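Removing the trailing `.0` from floats

firstCol = list(df.columns)[0]
# Anchor the pattern so that only a trailing '.0' is removed; a plain replace would also turn '10.05' into '105'
df[firstCol] = df[firstCol].fillna('').astype(str).str.replace(r'\.0$', '', regex=True)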
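The difference between an unanchored and an anchored replacement is easy to see on a small made-up Series. This sketch assumes the column holds floats with some missing values.

```python
import pandas as pd

s = pd.Series([1.0, 20.0, 10.05, None])
text = s.fillna("").astype(str)

naive = text.map(lambda x: x.replace(".0", ""))       # removes ".0" anywhere in the string
anchored = text.str.replace(r"\.0$", "", regex=True)  # removes only a trailing ".0"

print(naive.tolist())     # ['1', '20', '105', '']
print(anchored.tolist())  # ['1', '20', '10.05', '']
```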

Specifying dtypes when creating a DataFrame

Using `DataFrame.from_records`

import numpy as np
import pandas as pd

x = [['foo', '1.2', '70'], ['bar', '4.2', '5']]
df = pd.DataFrame.from_records(np.array(
    [tuple(row) for row in x],
    'object, float, int'
))
print(df.dtypes)
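A possible alternative, shown here only as a sketch, is to build the DataFrame from the strings first and then convert selected columns with astype and a dictionary. The column names are hypothetical.

```python
import pandas as pd

x = [["foo", "1.2", "70"], ["bar", "4.2", "5"]]

df = (
    pd.DataFrame(x, columns=["name", "score", "count"])  # hypothetical column names
      .astype({"score": float, "count": "int64"})        # convert only these two columns
)
print(df.dtypes)
```

This keeps the dtype choices next to the column names, at the cost of first materializing every column as strings.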

Using `read_csv`

import io

lines = '''
foo,biography,5
bar,crime,4
baz,fantasy,3
qux,history,2
quux,horror,1
'''
columns = ['name',


https://119291.xyz/posts/change-column-type-in-pandas/

Published on May 30, 2025
