A Complete Guide to Pandas

Contents: 1. Pandas overview; 2. Pandas data structures (2.1 Series, 2.2 DataFrame); 3. Mathematical and statistical computation; 4. DataFrame file operations (4.1 Reading files, 4.2 Writing files); 5. Data processing (5.1 Handling missing values, 5.2 Handling duplicate values)

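Pandas is another Python library for working with high-level data structures and data analysis. It is built on top of NumPy and incorporates a large number of libraries and some standard data models, which improves Python's performance when processing large datasets.

Features:

Pandas is widely used in commercial fields such as finance, economics, data analysis, and statistics, and it makes work easier for data practitioners in all of them.

Installing Pandas is similar to installing NumPy. If you have already installed Anaconda, you can import it directly; otherwise install it first.

Install command: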

pip install pandas

import pandas as pd

import pandas as pd

# Pass a list as the data to build a Series
s1 = pd.Series([1,2,3,4,5])
print('s1:{}'.format(s1))  # str.format() fills the {} placeholder

s1:0    1
1    2
2    3
3    4
4    5
dtype: int64

import pandas as pd
s2 = pd.Series([1,2,3,4,5],index = ['第一','第二','第三','第四','第五'])
print('s2:{}'.format(s2))

s2:第一    1
第二    2
第三    3
第四    4
第五    5
dtype: int64

import pandas as pd
s2 = pd.Series([1,2,3,4,5],index = ['第一','第二','第三','第四','第五'])
print('s2:{}'.format(s2))
print('s2 index: {}'.format(s2.index))
print('s2 values: {}'.format(s2.values))

s2:第一    1
第二    2
第三    3
第四    4
第五    5
dtype: int64
s2 index: Index(['第一', '第二', '第三', '第四', '第五'], dtype='object')
s2 values: [1 2 3 4 5]

print("Value at '第二' in s2: {}".format(s2['第二']))
s2['第二'] = 10
print("Value at '第二' in s2: {}".format(s2['第二']))

Value at '第二' in s2: 2
Value at '第二' in s2: 10

print("Values at '第二', '第四', '第五' in s2: {}".format(s2[['第二','第四','第五']]))

Values at '第二', '第四', '第五' in s2: 第二    10
第四     4
第五     5
dtype: int64

print("Values from '第二' through '第五' in s2: {}".format(s2['第二':'第五']))

Values from '第二' through '第五' in s2: 第二    10
第三     3
第四     4
第五     5
dtype: int64

s3_dic = {'First':1,'Second':2,'Third':3,'Fourth':4,'Fifth':5}
s3 = pd.Series(s3_dic)
print('s3: {}'.format(s3))

s3: First     1
Second    2
Third     3
Fourth    4
Fifth     5
dtype: int64

s4_dic = {'First':1,'Second':2,'Third':3,'Fourth':4,'Fifth':5}
s4 = pd.Series(s4_dic,index = ['First','Second','Third','Fourth','Fifth'])
print('s4:{}'.format(s4))

s4:First     1
Second    2
Third     3
Fourth    4
Fifth     5
dtype: int64

print("'sixth' in s4: {}".format('sixth' in s4))
print("'sixth' not in s4: {}".format('sixth' not in s4))

'sixth' in s4: False
'sixth' not in s4: True

s4_dic = {'First':1,'Second':2,'Third':3,'Fourth':4,'Fifth':5}
s4 = pd.Series(s4_dic,index = ['First','Second','Third','Fourth','Tenth'])
print('s4:{}'.format(s4))

s4:First     1.0
Second    2.0
Third     3.0
Fourth    4.0
Tenth     NaN
dtype: float64

print('Missing: {}'.format(s4.isnull()))
print('Not missing: {}'.format(s4.notnull()))

Missing: First     False
Second    False
Third     False
Fourth    False
Tenth      True
dtype: bool
Not missing: First      True
Second     True
Third      True
Fourth     True
Tenth     False
dtype: bool

print('s3 + s4: {}'.format(s3 + s4))

s3 + s4: Fifth     NaN
First     2.0
Fourth    8.0
Second    4.0
Tenth     NaN
Third     6.0
dtype: float64
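
The NaN entries above appear wherever a label exists in only one operand or its value is already missing. As a minimal sketch, assuming you would rather treat a label that is absent on one side as 0, Series.add accepts a fill_value argument. The names plain and filled are illustrative and not part of the original example.

```python
import pandas as pd

s3 = pd.Series({'First': 1, 'Second': 2, 'Third': 3, 'Fourth': 4, 'Fifth': 5})
# 'Tenth' is not a key of the dict, so it becomes NaN after reindexing
s4 = pd.Series({'First': 1, 'Second': 2, 'Third': 3, 'Fourth': 4, 'Fifth': 5},
               index=['First', 'Second', 'Third', 'Fourth', 'Tenth'])

# The + operator aligns on labels; a label missing on either side yields NaN
plain = s3 + s4

# add() with fill_value uses 0 when a value is missing on exactly one side
filled = s3.add(s4, fill_value=0)
print(filled)
```

'Fifth' now keeps its value from s3, while 'Tenth' stays NaN because it is missing on both sides.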

df_dic = {'color':['red','yellow','blue','purple','pink'],'size':['medium','small','big','medium','small'],'taste':['sweet','sour','salty','sweet','spicy']}
df = pd.DataFrame(df_dic)
print(df)

    color    size  taste
0     red  medium  sweet
1  yellow   small   sour
2    blue     big  salty
3  purple  medium  sweet
4    pink   small  spicy

df1 = pd.DataFrame(df_dic,columns = ['taste','color','size'])
print(df1)

   taste   color    size
0  sweet     red  medium
1   sour  yellow   small
2  salty    blue     big
3  sweet  purple  medium
4  spicy    pink   small

df2 = pd.DataFrame(df_dic,columns = ['taste','color','size','category'])
print(df2)

   taste   color    size category
0  sweet     red  medium      NaN
1   sour  yellow   small      NaN
2  salty    blue     big      NaN
3  sweet  purple  medium      NaN
4  spicy    pink   small      NaN

df1.index.name = 'sample'
df1.columns.name = 'feature'
print(df1)

feature  taste   color    size
sample                        
0        sweet     red  medium
1         sour  yellow   small
2        salty    blue     big
3        sweet  purple  medium
4        spicy    pink   small

print('Values of df1: {}'.format(df1.values))

Values of df1: [['sweet' 'red' 'medium']
 ['sour' 'yellow' 'small']
 ['salty' 'blue' 'big']
 ['sweet' 'purple' 'medium']
 ['spicy' 'pink' 'small']]

print('The color column of df1: {}'.format(df1['color']))
print('The color column of df1: {}'.format(df1.color))

The color column of df1: sample
0       red
1    yellow
2      blue
3    purple
4      pink
Name: color, dtype: object
The color column of df1: sample
0       red
1    yellow
2      blue
3    purple
4      pink
Name: color, dtype: object

print(df1.loc[3])  # .ix was removed in pandas 1.0; .loc selects by index label

feature
taste     sweet
color    purple
size     medium
Name: 3, dtype: object
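
Rows can be selected by label with .loc or by integer position with .iloc. The two coincide only when the index is the default 0..n-1 range. The sketch below uses a hypothetical frame named fruit, with string row labels, to show the difference.

```python
import pandas as pd

# Hypothetical frame with string row labels, so label and position differ
fruit = pd.DataFrame(
    {'color': ['red', 'yellow', 'blue'], 'taste': ['sweet', 'sour', 'salty']},
    index=['x', 'y', 'z'],
)

by_label = fruit.loc['y']        # row whose index label is 'y'
by_position = fruit.iloc[1]      # second row, whatever its label is
cell = fruit.loc['z', 'taste']   # single value: row label, then column name
print(by_label)
print(cell)
```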

import numpy as np
df1['category'] = np.arange(5)
print(df1)

feature  taste   color    size  category
sample                                  
0        sweet     red  medium         0
1         sour  yellow   small         1
2        salty    blue     big         2
3        sweet  purple  medium         3
4        spicy    pink   small         4

import numpy as np
df1['category'] = pd.Series([2,3,4],index = [0,2,4])
print(df1)

feature  taste   color    size  category
sample                                  
0        sweet     red  medium       2.0
1         sour  yellow   small       NaN
2        salty    blue     big       3.0
3        sweet  purple  medium       NaN
4        spicy    pink   small       4.0

df1['country'] = pd.Series(['China','UK','USA','Australia','Japan'])
print(df1)

feature  taste   color    size  category    country
sample                                             
0        sweet     red  medium       2.0      China
1         sour  yellow   small       NaN         UK
2        salty    blue     big       3.0        USA
3        sweet  purple  medium       NaN  Australia
4        spicy    pink   small       4.0      Japan

print(df1[df1['category'] < 3])

feature  taste color    size  category country
sample                                        
0        sweet   red  medium       2.0   China

df5 = pd.DataFrame([[3,2,3,1],[2,5,3,6],[3,4,5,2],[9,5,3,1]],index = ['a','b','c','d'],columns = ['one','two','three','four'])
print(df5)

   one  two  three  four
a    3    2      3     1
b    2    5      3     6
c    3    4      5     2
d    9    5      3     1

print('Column sums: {}'.format(df5.sum()))
print('Row sums: {}'.format(df5.sum(axis = 1)))

Column sums: one      17
two      16
three    14
four     10
dtype: int64
Row sums: a     9
b    16
c    14
d    18
dtype: int64

print('Cumulative sum, top to bottom: {}'.format(df5.cumsum()))
print('Cumulative sum, left to right: {}'.format(df5.cumsum(axis = 1)))

Cumulative sum, top to bottom:    one  two  three  four
a    3    2      3     1
b    5    7      6     7
c    8   11     11     9
d   17   16     14    10
Cumulative sum, left to right:    one  two  three  four
a    3    5      8     9
b    2    7     10    16
c    3    7     12    14
d    9   14     17    18

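Statistic function	Description
mean	Mean
median	Median
count	Number of non-missing values
min, max	Minimum and maximum
describe	Summary statistics
var	Variance
std	Standard deviation
skew	Skewness
kurt	Kurtosis
diff	First-order difference
cummin, cummax	Cumulative minimum, cumulative maximum
cumsum, cumprod	Cumulative sum, cumulative product
cov, corr	Covariance, correlation coefficient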
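
To make the table concrete, here is a short sketch that applies several of the listed functions to the df5 frame defined earlier. The result variable names are illustrative.

```python
import pandas as pd

df5 = pd.DataFrame([[3, 2, 3, 1], [2, 5, 3, 6], [3, 4, 5, 2], [9, 5, 3, 1]],
                   index=['a', 'b', 'c', 'd'],
                   columns=['one', 'two', 'three', 'four'])

col_mean = df5.mean()            # mean of each column
row_median = df5.median(axis=1)  # median of each row
running_max = df5.cummax()       # cumulative maximum down each column
step = df5.diff()                # first-order difference; the first row is NaN

print(col_mean)
print(df5.describe())            # count, mean, std, min, quartiles, max
print(df5.corr())                # pairwise correlation between columns
```

Like sum and cumsum above, most of these functions work column by column by default and accept axis = 1 to work row by row.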

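Read function	Description
pd.read_csv(filename)	Load data from a CSV file; the default delimiter is ","
pd.read_table(filename)	Load data from a delimited text file; the default delimiter is a tab
pd.read_excel(filename)	Load data from an Excel file
pd.read_sql(query,connection_object)	Load data from a SQL table or database
pd.read_json(json_string)	Load data from JSON
pd.read_html(url)	Parse a URL, string, or HTML file and extract its tables
pd.DataFrame(dict)	Build a DataFrame from a dictionary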

pd.read_csv('df.csv',encoding = 'utf-8')

Write function	Description
df.to_csv(filename)	Export data to a CSV file
df.to_excel(filename)	Export data to an Excel file
df.to_sql(table_name,connection_object)	Export data to a SQL table
df.to_json(filename)	Export data in JSON format
df.to_html(filename)	Export data as an HTML file
df.to_clipboard()	Copy data to the clipboard

df.to_csv('df.csv',sep = ',',header = True,index = True,encoding = 'utf-8')
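
As a minimal sketch of a write-then-read round trip, the example below writes to an in-memory io.StringIO buffer instead of a file on disk (an assumption made only so it runs without touching the file system). A path such as 'df.csv' works the same way.

```python
import io

import pandas as pd

df = pd.DataFrame({'color': ['red', 'yellow'], 'size': ['medium', 'small']})

# Write the frame, including header and index, as comma-separated text
buffer = io.StringIO()
df.to_csv(buffer, sep=',', header=True, index=True)

# Rewind and read it back; index_col=0 restores the first column as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without index_col = 0, the saved index would come back as an extra unnamed data column.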

import pandas as pd
import numpy as np
df6 = pd.DataFrame([[3,np.nan,3,1],[2,5,np.nan,6],[3,4,5,np.nan],[5,3,1,3]],index = ['a','b','c','d'],columns = ['one','two','three','four'])
print(df6.isnull())

     one    two  three   four
a  False   True  False  False
b  False  False   True  False
c  False  False  False   True
d  False  False  False  False

# Select every row that contains at least one missing value
print(df6[df6.isnull().any(axis = 1)])

   one  two  three  four
a    3  NaN    3.0   1.0
b    2  5.0    NaN   6.0
c    3  4.0    5.0   NaN

# Create a Series that contains a missing value
arr = pd.Series([1,2,3,np.nan,5,6])
print(arr)
print(arr.dropna())

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64
0    1.0
1    2.0
2    3.0
4    5.0
5    6.0
dtype: float64

dropna() returns a new Series rather than modifying arr, so printing arr again still shows the missing value:

print(arr)

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64

arr = arr.dropna()
print(arr)

print(df6.dropna())

   one  two  three  four
d    5  3.0    1.0   3.0

df6['fifth'] = np.nan
print(df6)
# With inplace = True, dropna modifies df6 and returns None, so it is not wrapped in print
df6.dropna(how = 'all',axis = 1,inplace = True)

   one  two  three  four  fifth
a    3  NaN    3.0   1.0    NaN
b    2  5.0    NaN   6.0    NaN
c    3  4.0    5.0   NaN    NaN
d    5  3.0    1.0   3.0    NaN

df6['fifth'] = np.nan
print(df6)
print(df6.fillna(0))

   one  two  three  four  fifth
a    3  NaN    3.0   1.0    NaN
b    2  5.0    NaN   6.0    NaN
c    3  4.0    5.0   NaN    NaN
d    5  3.0    1.0   3.0    NaN
   one  two  three  four  fifth
a    3  0.0    3.0   1.0    0.0
b    2  5.0    0.0   6.0    0.0
c    3  4.0    5.0   0.0    0.0
d    5  3.0    1.0   3.0    0.0

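# Drop the all-NaN 'fifth' column again, then fill each gap with its column median
df6 = df6.dropna(how = 'all',axis = 1)
print(df6)
print(df6.fillna(df6.median()))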

   one  two  three  four
a    3  NaN    3.0   1.0
b    2  5.0    NaN   6.0
c    3  4.0    5.0   NaN
d    5  3.0    1.0   3.0
   one  two  three  four
a    3  4.0    3.0   1.0
b    2  5.0    3.0   6.0
c    3  4.0    5.0   3.0
d    5  3.0    1.0   3.0

print(df6)
print(df6.ffill())  # forward fill: copy the previous row's value into each gap

   one  two  three  four
a    3  NaN    3.0   1.0
b    2  5.0    NaN   6.0
c    3  4.0    5.0   NaN
d    5  3.0    1.0   3.0
   one  two  three  four
a    3  NaN    3.0   1.0
b    2  5.0    3.0   6.0
c    3  4.0    5.0   6.0
d    5  3.0    1.0   3.0

print(df6)
print(df6.bfill())  # backward fill: copy the next row's value into each gap

   one  two  three  four
a    3  NaN    3.0   1.0
b    2  5.0    NaN   6.0
c    3  4.0    5.0   NaN
d    5  3.0    1.0   3.0
   one  two  three  four
a    3  5.0    3.0   1.0
b    2  5.0    5.0   6.0
c    3  4.0    5.0   3.0
d    5  3.0    1.0   3.0

df7 = pd.DataFrame([[3,5,3,1],[2,5,5,6],[3,4,5,3],[5,3,1,3],[3,4,5,3],[3,4,6,8]],index = ['a','b','c','d','e','f'],columns = ['one','two','three','four'])

print(df7[df7.duplicated()])
print(df7[df7.duplicated(subset = ['one','two'])])

   one  two  three  four
e    3    4      5     3
   one  two  three  four
e    3    4      5     3
f    3    4      6     8

print(df7.drop_duplicates(subset = ['one','two'],keep = 'first'))

   one  two  three  four
a    3    5      3     1
b    2    5      5     6
c    3    4      5     3
d    5    3      1     3
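
The keep argument decides which member of each duplicate group survives. As a sketch on the same df7 data, keep = 'last' retains the final occurrence and keep = False discards every row that has a duplicate. The result names are illustrative.

```python
import pandas as pd

df7 = pd.DataFrame([[3, 5, 3, 1], [2, 5, 5, 6], [3, 4, 5, 3],
                    [5, 3, 1, 3], [3, 4, 5, 3], [3, 4, 6, 8]],
                   index=['a', 'b', 'c', 'd', 'e', 'f'],
                   columns=['one', 'two', 'three', 'four'])

# Rows c, e and f share the same ('one', 'two') pair
keep_last = df7.drop_duplicates(subset=['one', 'two'], keep='last')
drop_all = df7.drop_duplicates(subset=['one', 'two'], keep=False)

print(keep_last)   # rows a, b, d, f
print(drop_all)    # rows a, b, d
```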

df8 = pd.DataFrame([[3,3,2,4],[5,4,3,3]],index = ['g','h'],columns = ['one','two','three','four'])
print(pd.concat([df8,df7]))  # DataFrame.append was removed in pandas 2.0

   one  two  three  four
g    3    3      2     4
h    5    4      3     3
a    3    5      3     1
b    2    5      5     6
c    3    4      5     3
d    5    3      1     3
e    3    4      5     3
f    3    4      6     8

# By default, concat stacks the frames vertically (axis = 0)
print(pd.concat([df7,df8]))

   one  two  three  four
a    3    5      3     1
b    2    5      5     6
c    3    4      5     3
d    5    3      1     3
e    3    4      5     3
f    3    4      6     8
g    3    3      2     4
h    5    4      3     3

# Join side by side (axis = 1), aligning rows on the index
print(pd.concat([df8,df7],axis = 1))

   one  two  three  four  one  two  three  four
a  NaN  NaN    NaN   NaN  3.0  5.0    3.0   1.0
b  NaN  NaN    NaN   NaN  2.0  5.0    5.0   6.0
c  NaN  NaN    NaN   NaN  3.0  4.0    5.0   3.0
d  NaN  NaN    NaN   NaN  5.0  3.0    1.0   3.0
e  NaN  NaN    NaN   NaN  3.0  4.0    5.0   3.0
f  NaN  NaN    NaN   NaN  3.0  4.0    6.0   8.0
g  3.0  3.0    2.0   4.0  NaN  NaN    NaN   NaN
h  5.0  4.0    3.0   3.0  NaN  NaN    NaN   NaN

df_dic11 = {'color':['red','yellow','blue','purple','pink'],'size':['medium','small','big','medium','small'],'taste':['sweet','sour','salty','sweet','spicy'],'category':[2,3,4,5,6]}
df9 = pd.DataFrame(df_dic11,columns = ['taste','color','size','category'])
print(df9)
df_dic12 = {'country':['China','UK','USA','Australia','Japan'],'quality':['good','normal','excellent','good','bad'],'category':[2,3,4,5,6]}
df10 = pd.DataFrame(df_dic12,columns = ['country','quality','category'])
print(df10)
print(pd.merge(df9,df10,left_on = 'category',right_on = 'category',how = 'left'))

   taste   color    size  category
0  sweet     red  medium         2
1   sour  yellow   small         3
2  salty    blue     big         4
3  sweet  purple  medium         5
4  spicy    pink   small         6
     country    quality  category
0      China       good         2
1         UK     normal         3
2        USA  excellent         4
3  Australia       good         5
4      Japan        bad         6
   taste   color    size  category    country    quality
0  sweet     red  medium         2      China       good
1   sour  yellow   small         3         UK     normal
2  salty    blue     big         4        USA  excellent
3  sweet  purple  medium         5  Australia       good
4  spicy    pink   small         6      Japan        bad
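
In the example above every category appears in both frames, so the choice of how makes no visible difference. The sketch below uses two small hypothetical frames whose keys only partly overlap to show how = 'inner' and how = 'outer'. The indicator = True option adds a _merge column recording where each row came from.

```python
import pandas as pd

# Hypothetical frames: category 2 exists only on the left, 5 only on the right
left = pd.DataFrame({'category': [2, 3, 4],
                     'taste': ['sweet', 'sour', 'salty']})
right = pd.DataFrame({'category': [3, 4, 5],
                      'quality': ['normal', 'excellent', 'good']})

# inner keeps only keys present in both frames
inner = pd.merge(left, right, on='category', how='inner')

# outer keeps every key; unmatched cells are filled with NaN
outer = pd.merge(left, right, on='category', how='outer', indicator=True)

print(inner)
print(outer)
```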