学习Pandas 二（Pandas缺失值处理、数据离散化、合并、交叉表与透视表、分组与聚合）

本文介绍: 连续属性的离散化就是将连续属性的值域上，将值域划分为若干个离散的区间，最后用不同的符号或整数值代表落在每个子区间中的属性值。离散化有很多种方法。这里使用一种最简单的方式去操作：原始的身高数据：165,174,160,180,159,163,192,184。假设按照身高分几个区间段：(150,165],(165,180],(180,195]。分组与聚合通常是分析数据的一种方式，通常与一些统计函数一起使用，查看数据的分组情况。一步一个脚印，lyy加油！

学习Pand as的基本操作：

学习Pandas 一（Pandas介绍、DataFrame结构、Series结构、Pandas基本数据操作、DataFrame运算、Pandas画图、文件读取与存储）

如何进行缺失值处理：
两种思路：
1、删除含有缺失值NaN的样本
2、替换/插补

判断数据是否存在NaN：
pd.isnull(df)
pd.notnull(df)

若存在缺失值：
1、删除存在缺失值的：dropna(axis=‘rows’, inplace=Ture/False)
inplace=True就地删除，False不会修改原数据，返回新的经过删除过缺失值的df，需要接受返回值

print(pd.isnull(movie))

print(np.any(pd.isnull(movie)))

print(pd.notnull(movie))

print(np.all(pd.notnull(movie)))

print(pd.isnull(movie).any())

print(pd.notnull(movie).all())

movie1 = movie.dropna()
print(pd.notnull(movie).all())
print(pd.notnull(movie1).all())

movie['Revenue (Millions)'].fillna(movie['Revenue (Millions)'].mean(), inplace=True)
movie['Metascore'].fillna(movie['Metascore'].mean(), inplace=True)
print(pd.notnull(movie).all()) # 缺失值处理完毕，已经不存在缺失值了

data_new = data.replace(to_replace='?', value=np.nan)
print(data_new[21:40])
data_new.dropna(inplace=True) # 原数据上删除含有缺失值的样本
print(data_new.isnull().any())

# 准备数据
data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184],
                 index=['No1:165 ', 'No2:174', 'No3:160', 'No4:180 ', 'No5:159', 'No6:163', 'No7:192 ', 'No8:184'])
print(data)
# 分组
# sr = pd.qcut(data, 3) # 自动分组
bins = [150, 165, 180, 195]
sr = pd.cut(data, bins) # 自定义分组
print(type(sr)) # <class 'pandas.core.series.Series'&gt;
print(sr)
print(sr.value_counts()) # 查看分组情况
# 转换成one-hot编码
print(pd.get_dummies(sr, prefix='身高', dtype=int))

stock = pd.read_csv('./file_csv/stock_day.csv')
print(stock.head())
p_change = stock['p_change'].head()
print(p_change)
print(pd.concat([stock, p_change], axis=1).head())

# 准备数据
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                        'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                        'key2': ['K0', 'K0', 'K0', 'K0'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
result = pd.merge(left, right, on=['key1', 'key2'], how='inner') # 内连接
# result = pd.merge(left, right, on=['key1', 'key2'], how='outer') # 外连接
# result = pd.merge(left, right, on=['key1', 'key2'], how='left') # 左连接
# result = pd.merge(left, right, on=['key1', 'key2'], how='right') # 右连接
print(result)

# 数据准备：准备两列数据，星期数据以及涨跌幅是好是坏数据，进行交叉计算
# pd.crosstab(星期数据列, 涨跌幅数据列)
stock = pd.read_csv('./file_csv/stock_day.csv')
# 1、准备星期数据列
# print(stock.index) # 转换为DatetimeIndex类型，方便用
# pandas日期类型
date = pd.to_datetime(stock.index)
print(date)
stock['week'] = date.weekday
print(stock.head())
# 2、准备涨跌幅数据列
stock['pona'] = np.where(stock['p_change'] > 0, 1, 0)
print(stock.head())
# 3、调用交叉表
data = pd.crosstab(stock['week'], stock['pona'])
print(data)
print(data.sum(axis=1))
print(data.div(data.sum(axis=1), axis=0))
data.div(data.sum(axis=1), axis=0).plot(kind='bar', stacked=True) # 柱状图
plt.show()

stock = pd.read_csv('./file_csv/stock_day.csv')
date = pd.to_datetime(stock.index)
stock['week'] = date.weekday
stock['pona'] = np.where(stock['p_change'] > 0, 1, 0)
print(stock.head())
print(stock.pivot_table(['pona'], index=['week']))

col =pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                   'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                   'price1':[5.56, 4.20, 1.30, 0.56, 2.75],
                   'price2':[4.75, 4.12, 1.60, 0.75, 3.15]})
print(col)
# 进行分组，对颜色分组，price1进行聚合
# 用dataframe的方法进行分组
print(col.groupby(by='color')['price1'].max()) # 按颜色进行分组，然后按price1进行聚合，求每个颜色的最大值
# 用series方法进行分组
print(col['price1'].groupby(col['color']).max())

# 从文件中读取星巴克店铺数据
starbucks = pd.read_csv('./file_csv/directory.csv')
# print(starbucks.head())

# 按照国家分组，求出每个国家的星巴克零售店数量
count = starbucks.groupby(by='Country').count()
# print(count)

starbucks_count = starbucks.groupby("Country").count()["Brand"].sort_values(ascending=False)[:10] # 分组聚合之后排序取前十行
starbucks_count.plot(kind="bar", figsize=(7, 4), fontsize=6)
plt.show()

# 准备数据
movie = pd.read_csv('./file_csv/IMDB-Movie-Data.csv')
# print(movie.head())
print(movie.shape)

# 1、想知道这些电影数据中评分的平均分，导演的人数等信息，我们应该怎么样获取？
print(movie['Rating'].mean()) # 评分的平均分
print(np.unique(movie['Director']).size) # 导演的人数

# 2、对于这一组电影数据，如果我们想看Rating、Runtime (Minutes)的分布情况，应该如何呈现数据？
movie['Rating'].plot(kind='hist',figsize=(8, 4))
plt.show()
movie['Runtime (Minutes)'].plot(kind='hist',figsize=(8, 4))
plt.show()

# 3、对于这一组电影数据，如果我们希望统计电影分类（Genre）的情况，应该如何处理数据？
# 先统计电影类别都有哪些
# print(movie['Genre'])
movie_genre = [i.split(',') for i in movie['Genre']] # 一部电影，有多个类别，先把每一部类型以列表显示
# print(movie_genre)
movie_class = np.unique([j for i in movie_genre for j in i]) # 把一个二维列表化成一维列表，再去重存储在movie_class
print(movie_class)
print(len(movie_class)) # 20
# 统计每个类别，有几个电影
count = pd.DataFrame(np.zeros(shape=[100, 20], dtype='int32'), columns=movie_class)
print(count)
# 计数填表
for i in range(1000): # movie的形状是(1000, 12)
    count.loc[i, movie_genre[i]] = 1
print(count)
count.sum(axis=0).sort_values(ascending=False).plot(kind="bar", figsize=(10, 5), fontsize=20, colormap="cool")
plt.show()