Pandas教程（非常详细）_代码007(未授权)

本文介绍: 文章目录教程特点阅读条件Pand a s是什么Pand a s主要特点Pand a s主要优势Pand a s内置数据结构Pandas库下载和安装Windows系统安装Linux 系统安装1) Ubuntu用户2) Fedora用户MacOSX系统安装Pandas Series 入门教程创建Series 对象1) 创建一个空Series 对象2) ndarray创建Series 对象3) dic t创建Series对象4) 标量创建Series对象访问Series 数据1) 位置索引访问2) 索引标签访问Series常用属性1) axe

转载于：
http://c.biancheng.net/pandas/

Pandas 库是一个免费、开源的第三方 Pyt h on 库，是 Pyt h on 数据分析必不可少的工具之一，它为 Pyt h on 数据分析提供了高性能，且易于使用的数据结构，即 Series 和 DataFrame。Pandas 自诞生后被应用于众多的领域，比如金融、统计学、社会科学、建筑工程等。

Pandas 库基于 Pyt h on NumPy 库开发而来，因此，它可以与 Pyth on 的科学计算库配合使用。Pandas 提供了两种数据结构，分别是 Series（一维数组结构）与 DataFrame（二维数组结构），这两种数据结构极大地增强的了 Pandas 的数据分析能力。在本套教程中，我们将学习 Pyth on Pandas 的各种方法、特性以及如何在实践中运用它们。

本套教程是为 Pandas 初学者打造的，学习完本套教程，您将在一定程度上掌握 Pandas 的基础知识，以及各种功能。如果您是从事数据分析的工作人员，那么这套教程会对您有所帮助。

本套教程对 Pyth on Pandas 库进行详细地讲解，包括文件读写、统计学函数、缺失值处理、以及数据可视化等重点知识。为了降低初学者的学习门槛，我们的教程尽量采用通俗易懂、深入浅出的语言风格，相信通过对本套教程的学习，您一定会收获颇丰。

数据结构	维度	说明
Series	1	该结构能够存储各种数据类型，比如字符数、整数、浮点数、Python 对象等，Series 用 name 和 index 属性来描述数据值。Series 是一维数据结构，因此其维数不可以改变。
DataFrame	2	DataFrame 是一种二维表格型数据的结构，既有行索引，也有列索引。行索引是 index，列索引是 columns。在创建该结构时，可以指定相应的索引值。

sudo apt-get install numpy scipy matplotlib pandas

sudo yum install numpy scipy matplotlib pandas

pip install pandas

import pandas as pd
s=pd.Series( data, index, dtype, copy)

参数名称	描述
data	输入的数据，可以是列表、常量、ndarray 数组等。
index	索引值必须是惟一的，如果没有传递索引，则默认为 np.ar range(n)。
dtype	dtype表示数据类型，如果没有提供，则会自动判断得出。
copy	表示对 data 进行拷贝，默认为 False。

import pandas as pd
#输出数据为空
s = pd.Series()
print(s)

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print (s)

0   a
1   b
2   c
3   d
dtype: object

上述示例中没有传递任何索引，所以索引默认从 0 开始分配，其索引范围为 0 到len(data)-1，即 0 到 3。这种设置方式被称为“隐式索引”。

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
#自定义索引标签（即显示索引）
s = pd.Series(data,index=[100,101,102,103])
print(s)

100  a
101  b
102  c
103  d
dtype: object

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

a 0.0
b 1.0
c 2.0
dtype: float64

示例 2，为index参数传递索引时：

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)

b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

0  5
1  5
2  5
3  5
dtype: int64

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s[0])  #位置下标
print(s['a']) #标签下标

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s[:3])

a  1
b  2
c  3
dtype: int64

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s[-3:])

c  3
d  4
e  5
dtype: int64

import pandas as pd
s = pd.Series([6,7,8,9,10],index = ['a','b','c','d','e'])
print(s['a']）

import pandas as pd
s = pd.Series([6,7,8,9,10],index = ['a','b','c','d','e'])
print(s[['a','c','d']])

a    6
c    8
d    9
dtype: int64

import pandas as pd
s = pd.Series([6,7,8,9,10],index = ['a','b','c','d','e'])
#不包含f值
print(s['f'])

......
KeyError: 'f'

名称	属性
axes	以列表的形式返回所有行索引标签。
dtype	返回对象的数据类型。
empty	返回一个空的 Series 对象。
ndim	返回输入数据的维数。
size	返回输入数据的元素数量。
values	以 ndarray 的形式返回 Series 对象。
index	返回一个RangeIndex对象，用来描述索引的取值范围。

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5))
print(s)

0    0.898097
1    0.730210
2    2.307401
3   -1.723065
4    0.346728
dtype: float64

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5))
print ("The axes are:")
print(s.axes)

The axes are:
[RangeIndex(start=0, stop=5, step=1)]

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5))
print ("The dtype is:")
print(s.dtype)

The dtype is:
float64

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5))
print("是否为空对象?")
print (s.empty)

是否为空对象?
False

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5))
print (s)
print (s.ndim)

0    0.311485
1    1.748860
2   -0.022721
3   -0.129223
4   -0.489824
dtype: float64
1

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(3))
print (s)
#series的长度大小
print(s.size)

0   -1.866261
1   -0.636726
2    0.586037
dtype: float64
3

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(6))
print(s)
print("输出series中数据")
print(s.values)

0   -0.502100
1    0.696194
2   -0.982063
3    0.416430
4   -1.384514
5    0.444303
dtype: float64
输出series中数据
[-0.50210028  0.69619407 -0.98206327  0.41642976 -1.38451433  0.44430257]

#显示索引
import pandas as pd
s=pd.Series([1,2,5,8],index=['a','b','c','d'])
print(s.index)
#隐式索引
s1=pd.Series([1,2,5,8])
print(s1.index)

隐式索引：
Index(['a', 'b', 'c', 'd'], dtype='object')
显示索引：
RangeIndex(start=0, stop=4, step=1)

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5))
print ("The original series is:")
print (s)
#返回前三行数据
print (s.head(3))

原系列输出结果:
0    1.249679
1    0.636487
2   -0.987621
3    0.999613
4    1.607751
head(3)输出：
dtype: float64
0    1.249679
1    0.636487
2   -0.987621
dtype: float64

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
#原series
print(s)
#输出后两行数据
print (s.tail(2))

原Series输出：
0    0.053340
1    2.165836
2   -0.719175
3   -0.035178
输出后两行数据：
dtype: float64
2   -0.719175
3   -0.035178
dtype: float64

import pandas as pd
#None代表缺失数据
s=pd.Series([1,2,5,None])
print(pd.isnull(s))  #是空值返回True
print(pd.notnull(s)) #空值返回False

0    False
1    False
2    False
3     True
dtype: bool

notnull():
0     True
1     True
2     True
3    False
dtype: bool

Column	Type
name	String
age	integer
gender	String
rating	Float

import pandas as pd
pd.DataFrame( data, index, columns, dtype, copy)

参数名称	说明
data	输入的数据，可以是 ndarray，series，list，dict，标量以及一个 DataFrame。
index	行标签，如果没有传递 index 值，则默认行标签是 np.arange(n)，n 代表 data 的元素个数。
columns	列标签，如果没有传递 columns 值，则默认列标签是 np.arange(n)。
dtype	dtype表示每一列的数据类型。
copy	默认为 False，表示复制数据 data。

import pandas as pd
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

      Name      Age
0     Alex      10
1     Bob       12
2     Clarke    13

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print(df)

      Name     Age
0     Alex     10.0
1     Bob      12.0
2     Clarke   13.0

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

      Age      Name
0     28        Tom
1     34       Jack
2     29      Steve
3     42      Ricky

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

         Age    Name
rank1    28      Tom
rank2    34     Jack
rank3    29    Steve
rank4    42    Ricky

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

    a    b      c
0   1   2     NaN
1   5   10   20.0

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

        a   b       c
first   1   2     NaN
second  5   10   20.0

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

#df2输出
         a  b
first    1  2
second   5  10

#df1输出
         a  b1
first    1  NaN
second   5  NaN

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)

      one    two
a     1.0    1
b     2.0    2
c     3.0    3
d     NaN    4

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df ['one'])

a     1.0
b     2.0
c     3.0
d     NaN
Name: one, dtype: float64

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
#使用df['列']=值，插入新的数据列
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)
#将已经存在的数据列做相加运算
df['four']=df['one']+df['three']
print(df)

使用列索引创建新数据列:
     one   two   three
a    1.0    1    10.0
b    2.0    2    20.0
c    3.0    3    30.0
d    NaN    4    NaN

已存在的数据列做算术运算：
      one   two   three    four
a     1.0    1    10.0     11.0
b     2.0    2    20.0     22.0
c     3.0    3    30.0     33.0
d     NaN    4     NaN     NaN

上述示例，我们初次使用了 DataFrame 的算术运算，这和 NumPy 非常相似。除了使用df[]=value的方式外，您还可以使用 insert() 方法插入新的列，示例如下：

import pandas as pd
info=[['Jack',18],['Helen',19],['John',17]]
df=pd.DataFrame(info,columns=['name','age'])
print(df)
#注意是column参数
#数值1代表插入到columns列表的索引位置
df.insert(1,column='score',value=[91,90,75])
print(df)

添加前：
    name  age
0   Jack   18
1  Helen   19
2   John   17

添加后：
    name  score  age
0   Jack     91   18
1  Helen     90   19
2   John     75   17

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
   'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print ("Our dataframe is:")
print(df)
#使用del删除
del df['one']
print(df)
#使用pop方法删除
df.pop('two')
print (df)

原DataFrame:
      one   three  two
a     1.0    10.0   1
b     2.0    20.0   2
c     3.0    30.0   3
d     NaN     NaN   4

使用del删除 first:
      three    two
a     10.0     1
b     20.0     2
c     30.0     3
d     NaN      4

使用 pop()删除:
   three
a  10.0
b  20.0
c  30.0
d  NaN

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])

one 2.0two 2.0Name: b, dtype: float64

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df.iloc[2]）

one   3.0
two   3.0
Name: c, dtype: float64

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
#左闭右开
print(df[2:4])

   one  two
c  3.0    3
d  NaN    4

import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
#在行末追加新数据行
df = df.append(df2)
print(df)

import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = df.append(df2)
print(df)
#注意此处调用了drop()方法
df = df.drop(0)
print (df)

执行drop(0)前：
   a  b
0  1  2
1  3  4
0  5  6
1  7  8

执行drop(0)后：
  a b
1 3 4
1 7 8

名称	属性&方法描述
T	行和列转置。
axes	返回一个仅以行轴标签和列轴标签为成员的列表。
dtypes	返回每列数据的数据类型。
empty	DataFrame中没有数据或者任意坐标轴的长度为0，则返回True。
ndim	轴的数量，也指数组的维数。
shape	返回一个元组，表示了 DataFrame 维度。
size	DataFrame中的元素数量。
values	使用 numpy 数组表示 DataFrame 中的元素值。
head()	返回前 n 行数据。
tail()	返回后 n 行数据。
shift()	将行或列移动指定的步幅长度

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#输出series
print(df)

输出 series 数据:
     Name  years  Rating
0  c语言中文网    5    4.23
1     编程帮     6    3.24
2      百度     15    3.98
3   360搜索     28    2.56
4      谷歌     3     3.20
5     微学苑    19    4.60
6  Bing搜索     23    3.80

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#输出DataFrame的转置
print(df.T)

Our data series is:
             0         1      2      3       4    5       6
Name    c语言中文网   编程帮    百度  360搜索   谷歌  微学苑  Bing搜索
years        5        6      15      28      3     19      23
Rating    4.23       3.24    3.98    2.56    3.2   4.6     3.8

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#输出行、列标签
print(df.axes)

[RangeIndex(start=0, stop=7, step=1), Index(['Name', 'years', 'Rating'], dtype='object')]

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#输出行、列标签
print(df.dtypes)

Name       object
years       int64
Rating     float64
dtype:     object

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#判断输入数据是否为空
print(df.empty)

判断输入对象是否为空：
False

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#DataFrame的维度
print(df.ndim)

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#DataFrame的形状
print(df.shape)
输出结果：

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#DataFrame的中元素个数
print(df.size)

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#DataFrame的数据
print(df.values)

[['c语言中文网' 5 4.23]
['编程帮' 6 3.24]
['百度' 15 3.98]
['360搜索' 28 2.56]
['谷歌' 3 3.2]
['微学苑' 19 4.6]
['Bing搜索' 23 3.8]]

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#获取前3行数据
print(df.head(3))

     Name       years   Rating
0   c语言中文网      5     4.23
1    编程帮         6     3.24
2    百度          15     3.98

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['c语言中文网','编程帮',"百度",'360搜索','谷歌','微学苑','Bing搜索']),
   'years':pd.Series([5,6,15,28,3,19,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
#构建DataFrame
df = pd.DataFrame(d)
#获取后2行数据
print(df.tail(2))

      Name     years   Rating
5     微学苑      19     4.6
6    Bing搜索     23     3.8

如果您想要移动 DataFrame 中的某一行/列，可以使用 shift() 函数实现。它提供了一个periods参数，该参数表示在特定的轴上移动指定的步幅。

参数名称	说明
peroids	类型为int，表示移动的幅度，可以是正数，也可以是负数，默认值为1。
freq	日期偏移量，默认值为None，适用于时间序。取值为符合时间规则的字符串。
axis	如果是 0 或者 “index” 表示上下移动，如果是 1 或者 “columns” 则会左右移动。
fill_value	该参数用来填充缺失值。

import pandas as pd 
info= pd.DataFrame({'a_data': [40, 28, 39, 32, 18], 
'b_data': [20, 37, 41, 35, 45], 
'c_data': [22, 17, 11, 25, 15]}) 
#移动幅度为3
info.shift(periods=3)

   a_data  b_data  c_data
0     NaN     NaN     NaN
1     NaN     NaN     NaN
2     NaN     NaN     NaN
3    40.0    20.0    22.0
4    28.0    37.0    17.0

import pandas as pd 
info= pd.DataFrame({'a_data': [40, 28, 39, 32, 18], 
'b_data': [20, 37, 41, 35, 45], 
'c_data': [22, 17, 11, 25, 15]}) 
#移动幅度为3
print(info.shift(periods=3))
#将缺失值和原数值替换为52
info.shift(periods=3,axis=1,fill_value= 52)

原输出结果：
   a_data  b_data  c_data
0     NaN     NaN     NaN
1     NaN     NaN     NaN
2     NaN     NaN     NaN
3    40.0    20.0    22.0
4    28.0    37.0    17.0

替换后输出：
   a_data  b_data  c_data
0      52      52      52
1      52      52      52
2      52      52      52
3      52      52      52
4      52      52      52

参数名称	描述说明
data	输入数据，可以是 ndarray，Series，列表，字典，或者 DataFrame。
items	axis=0
major_axis	axis=1
minor_axis	axis=2
dtype	每一列的数据类型。
copy	默认为 False，表示是否复制数据。

import pandas as pd
p = pd.Panel()
print(p)

<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None

import pandas as pd
import numpy as np
#返回均匀分布的随机样本值位于[0,1）之间
data = np.random.rand(2,4,5)
p = pd.Panel(data)
print (p)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4

import pandas as pd
import numpy as np
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
   'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)

Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

如果想要从 Panel 对象中选取数据，可以使用 Panel 的三个轴来实现，也就是items，major_axis，minor_axis。下面介绍其中一种，大家体验一下即可。

import pandas as pd
import numpy as np
data = {'Item1':pd.DataFrame(np.random.randn(4, 3)),
   'Item2':pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])

            0          1          2
0    0.488224  -0.128637   0.930817
1    0.417497   0.896681   0.576657
2   -2.775266   0.571668   0.290082
3   -0.400538  -0.144234   1.110535

函数名称	描述说明
count()	统计某个非空值的数量。
sum()	求和
mean()	求均值
median()	求中位数
mode()	求众数
std()	求标准差
min()	求最小值
max()	求最大值
abs()	求绝对值
prod()	求所有数值的乘积。
cumsum()	计算累计和，axis=0，按照行累加；axis=1，按照列累加。
cumprod()	计算累计积，axis=0，按照行累积；axis=1，按照列累积。
corr()	计算数列或变量之间的相关系数，取值-1到1，值越大表示关联性越强。

import pandas as pd
import numpy as np
#创建字典型series结构
d = {'Name':pd.Series(['小明','小亮','小红','小华','老赵','小曹','小陈',
   '老李','老王','小冯','小何','老张']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
print(df)

   Name  Age  Rating
0    小明   25    4.23
1    小亮   26    3.24
2    小红   25    3.98
3    小华   23    2.56
4    老赵   30    3.20
5    小曹   29    4.60
6    小陈   23    3.80
7    老李   34    3.78
8    老王   40    2.98
9    小冯   30    4.80
10   小何   51    4.10
11   老张   46    3.65

import pandas as pd
import numpy as np
#创建字典型series结构
d = {'Name':pd.Series(['小明','小亮','小红','小华','老赵','小曹','小陈',
   '老李','老王','小冯','小何','老张']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
#默认axis=0或者使用sum("index")
print(df.sum())

Name      小明小亮小红小华老赵小曹小陈老李老王小冯小何老张
Age                            382
Rating                       44.92
dtype: object

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['小明','小亮','小红','小华','老赵','小曹','小陈',
   '老李','老王','小冯','小何','老张']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
#也可使用sum("columns")或sum(1)
print(df.sum(axis=1))

0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['小明','小亮','小红','小华','老赵','小曹','小陈',
   '老李','老王','小冯','小何','老张']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
print(df.mean())

Age       31.833333
Rating     3.743333
dtype: float64

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['小明','小亮','小红','小华','老赵','小曹','小陈',
   '老李','老王','小冯','小何','老张']),
   'Age':pd.Series([25,26,25,23,59,19,23,44,40,30,51,54]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
print(df.std())

Age       13.976983
Rating     0.661628
dtype: float64

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['小明','小亮','小红','小华','老赵','小曹','小陈',
   '老李','老王','小冯','小何','老张']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#创建DataFrame对象
df = pd.DataFrame(d)
#求出数据的所有描述信息
print(df.describe())

             Age     Rating
count  12.000000  12.000000
mean   34.916667   3.743333
std    13.976983   0.661628
min    19.000000   2.560000
25%    24.500000   3.230000
50%    28.000000   3.790000
75%    45.750000   4.132500
max    59.000000   4.800000

describe() 函数输出了平均值、std 和 IQR 值(四分位距)等一系列统计信息。通过 describe() 提供的include能够筛选字符列或者数字列的摘要信息。

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['小明','小亮','小红','小华','老赵','小曹','小陈',
   '老李','老王','小冯','小何','老张']),
   'Age':pd.Series([25,26,25,23,59,19,23,44,40,30,51,54]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
print(df.describe(include=["object"]))

       Name
count    12
unique   12
top      小红
freq      1

最后使用all参数，看一下输出结果，如下所示：

import pandas as pd
import numpy as np
d = {'Name':pd.Series(['小明','小亮','小红','小华','老赵','小曹','小陈',
   '老李','老王','小冯','小何','老张']),
   'Age':pd.Series([25,26,25,23,59,19,23,44,40,30,51,54]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
print(df.describe(include="all"))

       Name        Age     Rating
count    12  12.000000  12.000000
unique   12        NaN        NaN
top      小红       NaN       NaN
freq      1        NaN        NaN
mean    NaN  34.916667   3.743333
std     NaN  13.976983   0.661628
min     NaN  19.000000   2.560000
25%     NaN  24.500000   3.230000
50%     NaN  28.000000   3.790000
75%     NaN  45.750000   4.132500
max     NaN  59.000000   4.800000

Pandas 在数据分析、数据可视化方面有着较为广泛的应用，Pandas 对 Matplotlib 绘图软件包的基础上单独封装了一个plot()接口，通过调用该接口可以实现常用的绘图操作。本节我们深入讲解一下 Pandas 的绘图操作。

import pandas as pd
import numpy as np
#创建包含时间序列的数据
df = pd.DataFrame(np.random.randn(8,4),index=pd.date_range('2/1/2020',periods=8), columns=list('ABCD'))
df.plot()

通过关键字参数kind可以把上述方法传递给 plot()。

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d','e'])
#或使用df.plot(kind="bar")
df.plot.bar()

通过设置参数stacked=True可以生成柱状堆叠图，示例如下：

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,5),columns=['a','b','c','d','e'])
df.plot(kind="bar",stacked=True)
#或者使用df.plot.bar(stacked="True")

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
print(df)
df.plot.barh(stacked=True)

import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.random.randn(100)+2,'B':np.random.randn(100),'C':
np.random.randn(100)-2}, columns=['A', 'B', 'C'])
print(df)
#指定箱数为15
df.plot.hist(bins=15)

import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.random.randn(100)+2,'B':np.random.randn(100),'C':
np.random.randn(100)-2,'D':np.random.randn(100)+3},columns=['A', 'B', 'C','D'])
#使用diff绘制
df.diff().hist(color="r",alpha=0.5,bins=15)

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])
df.plot.box()

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area()

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(30, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a',y='b')

import pandas as pd
import numpy as np
df = pd.DataFrame(3 * np.random.rand(4), index=['go', 'java', 'c++', 'c'], columns=['L'])
df.plot.pie(subplots=True)

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer',names=None, index_col=None, usecols=None)

ID,Name,Age,City,Salary
1,Jack,28,Beijing,22000
2,Lida,32,Shanghai,19000
3,John,43,Shenzhen,12000
4,Helen,38,Hengshui,3500

import pandas as pd
#需要注意文件的路径
df=pd.read_csv("C:/Users/Administrator/Desktop/person.csv")
print (df)

   ID   Name  Age      City  Salary
0   1   Jack   28   Beijing   22000
1   2   Lida   32  Shanghai   19000
2   3   John   43  Shenzhen   12000
3   4  Helen   38  Hengshui    3500

在 CSV 文件中指定了一个列，然后使用index_col可以实现自定义索引。

import pandas as pd
df=pd.read_csv("C:/Users/Administrator/Desktop/person.csv",index_col=['ID'])
print(df)

     Name  Age      City  Salary
ID                             
1    Jack   28   Beijing   22000
2    Lida   32  Shanghai   19000
3    John   43  Shenzhen   12000
4   Helen   38  Hengshui    3500

import pandas as pd
#转换salary为float类型
df=pd.read_csv("C:/Users/Administrator/Desktop/person.csv",dtype={'Salary':np.float64})
print(df.dtypes)

ID          int64
Name       object
Age         int64
City       object
Salary    float64
dtype: object

import pandas as pd
df=pd.read_csv("C:/Users/Administrator/Desktop/person.csv",names=['a','b','c','d','e'])
print(df)

    a      b    c         d       e
0  ID   Name  Age      City  Salary
1   1   Jack   28   Beijing   22000
2   2   Lida   32  Shanghai   19000
3   3   John   43  Shenzhen   12000
4   4  Helen   38  Hengshui    3500

注意：文件标头名是附加的自定义名称，但是您会发现，原来的标头名（列标签名）并没有被删除，此时您可以使用header参数来删除它。

import pandas as pd
df=pd.read_csv("C:/Users/Administrator/Desktop/person.csv",names=['a','b','c','d','e'],header=0)
print(df)

   a      b   c         d      e
0  1   Jack  28   Beijing  22000
1  2   Lida  32  Shanghai  19000
2  3   John  43  Shenzhen  12000
3  4  Helen  38  Hengshui   3500

skiprows参数表示跳过指定的行数。

import pandas as pd
df=pd.read_csv("C:/Users/Administrator/Desktop/person.csv",skiprows=2)
print(df)

   2   Lida  32  Shanghai  19000
0  3   John  43  Shenzhen  12000
1  4  Helen  38  Hengshui   3500

import pandas as pd 
data = {'Name': ['Smith', 'Parker'], 'ID': [101, 102], 'Language': ['Python', 'JavaScript']} 
info = pd.DataFrame(data) 
print('DataFrame Values:n', info) 
#转换为csv数据
csv_data = info.to_csv() 
print('nCSV String Values:n', csv_data)

DataFrame:
      Name   ID    Language
0   Smith  101      Python
1  Parker  102  JavaScript

csv数据:
,Name,ID,Language
0,Smith,101,Python
1,Parker,102,JavaScript

import pandas as pd
#注意：pd.NaT表示null缺失数据
data = {'Name': ['Smith', 'Parker'], 'ID': [101, pd.NaT], 'Language': ['Python', 'JavaScript']}
info = pd.DataFrame(data)
csv_data = info.to_csv("C:/Users/Administrator/Desktop/pandas.csv",sep='|')

如果想要把单个对象写入 Excel 文件，那么必须指定目标文件名；如果想要写入到多张工作表中，则需要创建一个带有目标文件名的ExcelWriter对象，并通过sheet_name参数依次指定工作表的名称。

DataFrame.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None, inf_rep='inf', verbose=True, freeze_panes=None)

参数名称	描述说明
excel_wirter	文件路径或者 ExcelWrite 对象。
sheet_name	指定要写入数据的工作表名称。
na_rep	缺失值的表示形式。
float_format	它是一个可选参数，用于格式化浮点数字符串。
columns	指要写入的列。
header	写出每一列的名称，如果给出的是字符串列表，则表示列的别名。
index	表示要写入的索引。
index_label	引用索引列的列标签。如果未指定，并且 hearder 和 index 均为为 True，则使用索引名称。如果 DataFrame 使用 MultiIndex，则需要给出一个序列。
start row	初始写入的行位置，默认值0。表示引用左上角的行单元格来储存 DataFrame。
startcol	初始写入的列位置，默认值0。表示引用左上角的列单元格来储存 DataFrame。
engine	它是一个可选参数，用于指定要使用的引擎，可以是 openpyxl 或 xlsx writer。

import pandas as pd
#创建DataFrame数据
info_website = pd.DataFrame({'name': ['编程帮', 'c语言中文网', '微学苑', '92python'],
     'rank': [1, 2, 3, 4],
     'language': ['PHP', 'C', 'PHP','Python' ],
     'url': ['www.bianchneg.com', 'c.bianchneg.net', 'www.weixueyuan.com','www.92python.com' ]})
#创建ExcelWrite对象
writer = pd.ExcelWriter('website.xlsx')
info_website.to_excel(writer)
writer.save()
print('输出成功')

pd.read_excel(io, sheet_name=0, header=0, names=None, index_col=None,
              usecols=None, squeeze=False,dtype=None, engine=None,
              converters=None, true_values=None, false_values=None,
              skiprows=None, nrows=None, na_values=None, parse_dates=False,
              date_parser=None, thousands=None, comment=None, skipfooter=0,
              convert_float=True, **kwds)

参数名称	说明
io	表示 Excel 文件的存储路径。
sheet_name	要读取的工作表名称。
header	指定作为列名的行，默认0，即取第一行的值为列名；若数据不包含列名，则设定 header = None。若将其设置为 header=2，则表示将前两行作为多重索引。
names	一般适用于Excel缺少列名，或者需要重新定义列名的情况；names的长度必须等于Excel表格列的长度，否则会报错。
index_col	用做行索引的列，可以是工作表的列名称，如 index_col = ‘列名’，也可以是整数或者列表。
usecols	int或list类型，默认为None，表示需要读取所有列。
squeeze	boolean，默认为False，如果解析的数据只包含一列，则返回一个Series。
converters	规定每一列的数据类型。
skiprows	接受一个列表，表示跳过指定行数的数据，从头部第一行开始。
nrows	需要读取的行数。
skipfooter	接受一个列表，省略指定行数的数据，从尾部最后一行开始。

import pandas as pd
#读取excel数据
df = pd.read_excel('website.xlsx',index_col='name',skiprows=[2])
#处理未命名列
df.columns = df.columns.str.replace('Unnamed.*', 'col_label')
print(df)

           col_label     rank    language          agelimit
name                                                 
编程帮           0         1        PHP       www.bianchneg.com
微学苑           2         3        PHP       www.weixueyuan.com
92python        3         4        Python    www.92python.com

import pandas as pd
#读取excel数据
#index_col选择前两列作为索引列
#选择前三列数据，name列作为行索引
df = pd.read_excel('website.xlsx',index_col='name',index_col=[0,1],usecols=[1,2,3])
#处理未命名列，固定用法
df.columns = df.columns.str.replace('Unnamed.*', 'col_label')
print(df)

                   language
name      rank        
编程帮       1          PHP
c语言中文网   2           C
微学苑       3          PHP
92python    4         Python

import numpy as np
arr = np.array([2, 4, 6, 8, 10, 12])
print(type(arr))
print ("打印新建数组: ",end="")
#使用for循环读取数据 
for l in range (0,5): 
    print (arr[l], end=" ")

<class 'numpy.ndarray'>
打印新建数组: 2 4 6 8 10

import array
#注意此处的 'l' 表示有符号int类型 
arr = array.array('l', [2, 4, 6, 8, 10, 12])
print(type(arr))
print ("新建数组: ",end="") 
for i in range (0,5): 
    print (arr[i], end=" ")

<class 'array.array'>
新建数组: 2 4 6 8 10

import pandas as pd 
dict = {'name':["Smith", "William", "Phill", "Parker"],  
        'age': ["28", "39", "34", "36"]}  
info = pd.DataFrame(dict, index = [True, True, False, True])  
print(info)

          name age
True     Smith  28
True   William  39
False    Phill  34
True    Parker  36

然后使用.loc访问索引为 True 的数据。示例如下：

import pandas as pd   
dict = {'name':["Smith", "William", "Phill", "Parker"],  
        'age': ["28", "39", "34", "36"]}  
info = pd.DataFrame(dict, index = [True, True, False, True])
#返回所有为 True的数据     
print(info.loc[True])

         name age
True    Smith  28
True  William  39
True   Parker  36

import numpy as np 
arr = np.arange(16) 
print("原数组: n", arr) 
arr = np.arange(16).reshape(2, 8) 
print("n变形后数组:n", arr) 
arr = np.arange(16).reshape(8 ,2) 
print("n变形后数组:n", arr)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]

变形后数组:
[[ 0  1  2  3  4  5  6  7]
[ 8  9 10 11 12 13 14 15]]

变形后数组:
[[ 0  1]
[ 2  3]
[ 4  5]
[ 6  7]
[ 8  9]
[10 11]
[12 13]
[14 15]]

比较项	Pandas	NumPy
适应性	Pandas主要用来处理类表格数据。	NumPy 主要用来处理数值数据。
工具	Pandas提供了Series和DataFrame数据结构。	NumPy 构建了 ndarray array来容纳数据。
性能	Pandas对于处理50万行以上的数据更具优势。	NumPy 则对于50万以下或者更少的数据，性能更佳。
内存利用率	与 NumPy相比，Pandas会消耗大量的内存。	NumPy 会消耗较少的内存。
对象	Pandas 提供了 DataFrame 2D数据表对象。	NumPy 则提供了一个多维数组 ndarray 对象

info = pd.DataFrame({"P": [2, 3], "Q": [4.0, 5.8]})
#给info添加R列 
info['R'] = pd.date_range('2020-12-23', periods=2)
print(info)
#将其转化为numpy数组 
n=info.to_numpy()
print(n)
print(type(n))

[[2 4.0 Timestamp('2020-12-23 00:00:00')]
[3 5.8 Timestamp('2020-12-24 00:00:00')]]

import pandas as pd 
#创建DataFrame对象
info = pd.DataFrame([[17, 62, 35],[25, 36, 54],[42, 20, 15],[48, 62, 76]], 
columns=['x', 'y', 'z']) 
print('DataFramen----------n', info) 
#转换DataFrame为数组array
arr = info.to_numpy() 
print('nNumpy Arrayn----------n', arr)

DataFrame
----------
     x   y   z
0  17  62  35
1  25  36  54
2  42  20  15
3  48  62  76

Numpy Array
----------
[[17 62 35]
[25 36 54]
[42 20 15]
[48 62 76]]

显示所有内容

声明：本站所有文章，如无特殊说明或标注，均为本站原创发布。任何个人或组织，在未征得本站同意时，禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益，可联系我们进行处理。

教程特点

阅读条件

Pandas是什么

Pandas主要特点

Pandas主要优势

Pandas内置数据结构

Pandas库下载和安装

Windows系统安装

Linux系统安装

1) Ubuntu用户

2) Fedora用户

MacOSX系统安装

Pandas Series入门教程

创建Series对象

1) 创建一个空Series对象

2) ndarray创建Series对象

3) dict创建Series对象

4) 标量创建Series对象

访问Series数据

1) 位置索引访问

2) 索引标签访问

Series常用属性

1) axes

2) dtype

3) empty

4) ndim

5) size

6) values

7) index

Series常用方法

1) head()&tail()查看数据

2) isnull()&nonull()检测缺失值

Pandas DataFrame入门教程（图解版）

认识DataFrame结构

创建DataFrame对象

1) 创建空的DataFrame对象

2) 列表创建DataFame对象

3) 字典嵌套列表创建

4) 列表嵌套字典创建DataFrame对象

5) Series创建DataFrame对象

列索引操作DataFrame

1) 列索引选取数据列

2) 列索引添加数据列

3) 列索引删除数据列

行索引操作DataFrame

1) 标签索引选取

2) 整数索引选取

3) 切片操作多行选取

4) 添加数据行

5) 删除数据行

常用属性和方法汇总

1) T（Transpose）转置

2) axes

3) dtypes

4) empty

5) ndim

6) shape

7) size

8) values

9) head()&tail()查看数据

10) shift()移动行或列

Pandas Panel三维数据结构

pandas.Panel()

创建Panel 对象

1) 创建一个空Panel

2) ndarray三维数组创建

3) DataFrame创建

Panel中选取数据

1) 使用 items选取数据

Python Pandas描述性统计

sum()求和

mean()求均值

std()求标准差

数据汇总描述

Python Pandas绘图教程（详解版）

柱状图

箱型图

区域图

散点图

饼状图

2) isnull()&no null()检测缺失值

发表回复取消回复