
pandas groupby grouping and agg aggregation: worked examples

The examples are as follows:

import pandas as pd
 
df = pd.DataFrame({'Country':['China','China', 'India', 'India', 'America', 'Japan', 'China', 'India'], 
     'Income':[10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
     'Age':[5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})

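The constructed data is shown below (older pandas versions sort the columns alphabetically; newer ones keep the order given in the dict):

    Age  Country  Income
0  5000    China   10000
1  4321    China   10000
2  1234    India    5000
3  4010    India    5002
4   250  America   40000
5   250    Japan   50000
6  4500    China    8000
7  4321    India    5000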

Grouping

Grouping by a single column

df_gb = df.groupby('Country')
for index, data in df_gb:
 print(index)
 print(data)

Output

America
 Age Country Income
4 250 America 40000
China
 Age Country Income
0 5000 China 10000
1 4321 China 10000
6 4500 China 8000
India
 Age Country Income
2 1234 India 5000
3 4010 India 5002
7 4321 India 5000
Japan
 Age Country Income
5 250 Japan 50000
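
Iterating is not the only way to look inside a grouped object. The sketch below rebuilds the same data so that it runs on its own, then uses the `groups` attribute and `get_group` to reach one group directly. The variable names `row_labels` and `china` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
                   'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
                   'Age': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})

df_gb = df.groupby('Country')

# Row labels that belong to each group, keyed by group name
row_labels = {name: list(idx) for name, idx in df_gb.groups.items()}
print(row_labels)

# Pull out a single group as an ordinary DataFrame
china = df_gb.get_group('China')
print(china)
```

This is handy when only one group is needed and a full loop would be wasteful.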

Grouping by multiple columns

df_gb = df.groupby(['Country', 'Income'])
for (index1, index2), data in df_gb:
 print((index1, index2))
 print(data)

Output

('America', 40000)
 Age Country Income
4 250 America 40000
('China', 8000)
 Age Country Income
6 4500 China 8000
('China', 10000)
 Age Country Income
0 5000 China 10000
1 4321 China 10000
('India', 5000)
 Age Country Income
2 1234 India 5000
7 4321 India 5000
('India', 5002)
 Age Country Income
3 4010 India 5002
('Japan', 50000)
 Age Country Income
5 250 Japan 50000

Aggregation

Aggregating the grouped data

By default, the aggregation is applied to every column other than the grouping column

df_agg = df.groupby('Country').agg(['min', 'mean', 'max'])
print(df_agg)

Output

          Age                     Income
          min         mean   max     min          mean    max
Country
America   250   250.000000   250   40000  40000.000000  40000
China    4321  4607.000000  5000    8000   9333.333333  10000
India    1234  3188.333333  4321    5000   5000.666667   5002
Japan     250   250.000000   250   50000  50000.000000  50000

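Aggregating only some of the columns

Sometimes only some of the columns need to be aggregated, each with its own set of operations. This can be specified with a dict that maps column names to aggregation functions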

num_agg = {'Age':['min', 'mean', 'max']}
print(df.groupby('Country').agg(num_agg))

Output

          Age
          min         mean   max
Country
America   250   250.000000   250
China    4321  4607.000000  5000
India    1234  3188.333333  4321
Japan     250   250.000000   250

num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
print(df.groupby('Country').agg(num_agg))

Output

          Age                     Income
          min         mean   max     min    max
Country
America   250   250.000000   250   40000  40000
China    4321  4607.000000  5000    8000  10000
India    1234  3188.333333  4321    5000   5002
Japan     250   250.000000   250   50000  50000
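
The dict form above produces two-level column labels. If flat column names are preferred, one option is named aggregation, available in pandas 0.25 and later. The following is a minimal sketch on the same data; the output names `age_min`, `age_mean` and `income_max` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
                   'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
                   'Age': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})

# Each keyword is an output column name; each value is (input column, function)
flat = df.groupby('Country').agg(
    age_min=('Age', 'min'),
    age_mean=('Age', 'mean'),
    income_max=('Income', 'max'),
)
print(flat)
```

The result has a single header row, so no flattening of a column MultiIndex is needed afterwards.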

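Supplement: pandas, a fuller look at groupby and agg for grouping and summarizing tabular data

This groupby write-up is not great; it is too complicated. In practice only a handful of operations are used regularly, and an example that lays out the common ones is easy to understand and reuse. I will write another post later.

What groupby does: grouping

groupby + agg (aggregation functions): after grouping, apply functions to each group, such as 'sum', 'mean', 'max', 'min', ...

By default groupby groups along the vertical direction (rows), axis=0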

DataFrame
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
     'key2':['one', 'two', 'one', 'two', 'one'],
     'data1':np.random.randn(5),
     'data2':np.random.randn(5)})
print(df)

  data1  data2 key1 key2
0 -0.410122 0.247895 a one
1 -0.627470 -0.989268 a two
2 0.179488 -0.054570 b one
3 -0.299878 -1.640494 b two
4 -0.297191 0.954447 a one

Grouping and iterating over the groups

list(df.groupby('key1'))  # list() yields: [(group1), (group2), ...]

[('a',  data1  data2 key1 key2
 0 -0.410122 0.247895 a one
 1 -0.627470 -0.989268 a two
 4 -0.297191 0.954447 a one), ('b',  data1  data2 key1 key2
 2 0.179488 -0.054570 b one
 3 -0.299878 -1.640494 b two)]

list() yields: [(group1), (group2), ...]

Each piece (group) has the form of a (name, group) tuple

1. Grouping by key1 (a single column), that is, by the values of key1

A groupby object supports iteration, yielding a sequence of 2-tuples: (group name, data chunk), (group name, data chunk), ...

for name,group in df.groupby('key1'):
 print(name)
 print(group)

a
  data1  data2 key1 key2
0 -0.410122 0.247895 a one
1 -0.627470 -0.989268 a two
4 -0.297191 0.954447 a one
b
  data1  data2 key1 key2
2 0.179488 -0.054570 b one
3 -0.299878 -1.640494 b two

2. Grouping by [key1, key2] (multiple columns)

With multiple keys, the 2-tuples have the form ((k1, k2), data chunk), ((k1, k2), data chunk), ...

The first element is a tuple made up of the key values

for name,group in df.groupby(['key1','key2']):
 print(name) #name=(k1,k2)
 print(group)

('a', 'one')
  data1  data2 key1 key2
0 -0.410122 0.247895 a one
4 -0.297191 0.954447 a one
('a', 'two')
  data1  data2 key1 key2
1 -0.62747 -0.989268 a two
('b', 'one')
  data1 data2 key1 key2
2 0.179488 -0.05457 b one
('b', 'two')
  data1  data2 key1 key2
3 -0.299878 -1.640494 b two

3. Grouping by a function

4. Grouping by a dict

5. Grouping by index level

6. Mixing functions with arrays, lists, dicts and Series is not a problem either, because everything is ultimately converted to arrays
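
Items 3 to 6 are listed without examples. The sketch below illustrates each of them on two small made-up frames (`people` and `multi` are hypothetical data, not from the original article). Functions and dicts are applied to the index labels, not to a column.

```python
import pandas as pd

people = pd.DataFrame(
    {'score': [1, 2, 3, 4, 5]},
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)

# 3. Function: called once per index label; here rows are grouped by name length
by_len = people.groupby(len)['score'].sum()

# 4. Dict: maps index labels to group names
mapping = {'Joe': 'red', 'Steve': 'blue', 'Wes': 'red', 'Jim': 'blue', 'Travis': 'red'}
by_dict = people.groupby(mapping)['score'].sum()

# 5. Index level: group by one level of a MultiIndex
multi = pd.DataFrame(
    {'score': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [('a', 'one'), ('a', 'two'), ('b', 'one'), ('b', 'two')],
        names=['key1', 'key2'],
    ),
)
by_level = multi.groupby(level='key1')['score'].sum()

# 6. Mixed keys: a function together with a plain list of the right length
by_mixed = people.groupby([len, ['x', 'x', 'y', 'y', 'y']])['score'].sum()

print(by_len, by_dict, by_level, by_mixed, sep='\n\n')
```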

Turning these pieces into a dict

dict(list(df.groupby('key1')))  # dict(list())

{'a':  data1  data2 key1 key2
 0 -0.410122 0.247895 a one
 1 -0.627470 -0.989268 a two
 4 -0.297191 0.954447 a one, 'b':  data1  data2 key1 key2
 2 0.179488 -0.054570 b one
 3 -0.299878 -1.640494 b two}
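
The same dict can also be built with a dict comprehension, which some readers find easier to follow than `dict(list(...))`. This is a small sketch using fixed numbers in place of the random data so that the result is reproducible; `pieces` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [0.5, 1.5, 2.5, 3.5, 4.5]})

# Each iteration step yields (group name, sub-DataFrame)
pieces = {name: group for name, group in df.groupby('key1')}
print(pieces['b'])
```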

Statistics and computations after grouping

1. After grouping, return a Series containing the size of each group

Grouping by key1

df.groupby(['key1']).size()

key1
a 3
b 2
dtype: int64

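# Aside: dict() accepts any iterable of 2-item sequences, which is why dict(list(...)) works on a groupby result
dict(['a1','x2','e3'])

{'a': '1', 'x': '2', 'e': '3'}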

Grouping by [key1, key2]

df.groupby(['key1','key2']).size()

key1 key2
a  one  2
  two  1
b  one  1
  two  1
dtype: int64
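
A related point worth knowing is that `size()` counts every row in a group, while `count()` skips missing values. The sketch below uses a small made-up frame with one NaN to show the difference.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, np.nan, 3.0, 4.0, 5.0]})

sizes = df.groupby('key1').size()             # rows per group, NaN included
counts = df.groupby('key1')['data1'].count()  # non-missing values per group

print(sizes)
print(counts)
```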

2. Group data1 by key1 and compute the mean of the data1 column

df['data1'].groupby(df['key1']).mean()
# groupby() on its own computes nothing; it only forms the groups (mean() does the computation)

key1
a -0.444928
b -0.060195
Name: data1, dtype: float64

df.groupby(['key1'])['data1'].mean()  # read as: group df by key1, then take the mean of df['data1'] within each group
# equivalent to: df.groupby(['key1']).data1.mean()

key1
a -0.444928
b -0.060195
Name: data1, dtype: float64

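Notes:

groupby itself does not compute anything. It only forms the groups.

The data (a Series) is aggregated according to the group key, producing a new Series whose index holds the unique values of the key1 column.

The object returned by this indexing operation is a grouped DataFrame (if a list or array of column names is passed) or a grouped Series (if a single column name is passed)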
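
The notes above can be checked by inspecting the types involved. The following sketch uses fixed numbers instead of random data; it only prints class names and then triggers the actual computation with `mean()`.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [0.5, 1.5, 2.5, 3.5, 4.5]})

grouped = df.groupby('key1')  # no computation has happened yet

whole = type(grouped).__name__                   # grouped DataFrame
single = type(grouped['data1']).__name__         # single column name -> grouped Series
as_list = type(grouped[['data1']]).__name__      # list of names -> grouped DataFrame
print(whole, single, as_list)

# The aggregation is computed only when a method such as mean() is called
means = grouped['data1'].mean()
print(means)
```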

df.groupby(['key1'])['data1'].size()

key1
a 3
b 2
Name: data1, dtype: int64

3. Group data1 by [key1, key2] and compute the mean of data1

df['data1'].groupby([df['key1'],df['key2']]).mean()

key1 key2
a  one -0.353657
  two -0.627470
b  one  0.179488
  two -0.299878
Name: data1, dtype: float64

df.groupby(['key1','key2'])['data1'].mean()
# equivalent to: df.groupby(['key1','key2']).data1.mean()

key1 key2
a  one -0.353657
  two -0.627470
b  one  0.179488
  two -0.299878
Name: data1, dtype: float64

Grouping the data by two keys yields a Series with a hierarchical index (made up of the unique key pairs), which unstack() can pivot into a table:

df.groupby(['key1','key2'])['data1'].mean().unstack()

key2	one	two
key1
a	-0.353657	-0.627470
b	0.179488	-0.299878

In the examples above the group keys are all Series or column names. In fact, a group key can be any array of the appropriate length, which makes groupby very flexible.
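
As a sketch of grouping by arbitrary arrays, the example below groups a column by two NumPy arrays that are not part of the DataFrame at all. The arrays `states` and `years` are invented for illustration; they only need to have one entry per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# External keys: one value per row of df
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

result = df['data1'].groupby([states, years]).mean()
print(result)
```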

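Along the horizontal direction (columns)

Grouping by the columns' data types (df.dtypes)

df has two dtypes, float64 and object, so it is split into two groups: (dtype('float64'), data chunk), (dtype('O'), data chunk)

list(df.groupby(df.dtypes, axis=1))  # note: axis=1 in groupby is deprecated since pandas 2.1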

[(dtype('float64'),  data1  data2
 0 -0.410122 0.247895
 1 -0.627470 -0.989268
 2 0.179488 -0.054570
 3 -0.299878 -1.640494
 4 -0.297191 0.954447), (dtype('O'), key1 key2
 0 a one
 1 a two
 2 b one
 3 b two
 4 a one)]
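
On pandas versions where `axis=1` is no longer available in groupby, a similar split can be done by hand. The sketch below is one possible approach, not the article's method: it collects the column names per dtype with a plain loop and then selects those columns. The names `columns_by_dtype` and `pieces` are illustrative, and the dtype is stored as a string key.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [0.5, 1.5, 2.5, 3.5, 4.5]})

# Collect column names under the string form of their dtype
columns_by_dtype = {}
for column, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# One sub-DataFrame per dtype
pieces = {dtype: df[columns] for dtype, columns in columns_by_dtype.items()}
for dtype, piece in pieces.items():
    print(dtype)
    print(piece)
```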

Using agg

groupby + agg lets you apply several functions to a groupby result at once

Parameters of the agg() method of SeriesGroupBy:

aggregate(self, func_or_funcs, *args, **kwargs)
func: function, string, dictionary, or list of string/functions

Returns: the aggregated Series

s= pd.Series([10,20,30,40])
s

0 10
1 20
2 30
3 40
dtype: int64

for n,g in s.groupby([1,1,2,2]):
 print(n)
 print(g)

1
0 10
1 20
dtype: int64
2
2 30
3 40
dtype: int64

s.groupby([1,1,2,2]).min()

1 10
2 30
dtype: int64

# equivalent to:
s.groupby([1,1,2,2]).agg('min')

1 10
2 30
dtype: int64

s.groupby([1,1,2,2]).agg(['min','max'])  # wrap the names in []: func is a single argument

	min	max
1	10	20
2	30	40
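
The list passed to agg may also contain your own functions alongside the built-in names. A minimal sketch, assuming a helper called `value_range` (a made-up name) that returns the spread of each group:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

def value_range(x):
    """Difference between the largest and smallest value in a group."""
    return x.max() - x.min()

# Strings name built-in aggregations; callables are applied to each group
result = s.groupby([1, 1, 2, 2]).agg(['min', 'max', value_range])
print(result)
```

The custom function's name becomes the column label, so a descriptive function name keeps the output readable.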

It is commonly used like this:

df

	data1	data2	key1	key2
0	-0.410122	0.247895	a	one
1	-0.627470	-0.989268	a	two
2	0.179488	-0.054570	b	one
3	-0.299878	-1.640494	b	two
4	-0.297191	0.954447	a	one

Compare the following calls to see what agg is good for:

df.groupby(['key1'])['data1'].min()

key1
a -0.627470
b -0.299878
Name: data1, dtype: float64

df.groupby(['key1'])['data1'].agg(['min'])  # a list rather than a set, so the column order is well defined

	min
key1
a	-0.627470
b	-0.299878

# recommended
df.groupby(['key1']).agg({'data1':'min'})  # for column data1, take the minimum of each group; the column is still named data1

	data1
key1
a	-0.627470
b	-0.299878

# After grouping by key1, aggregate the minimum and maximum of data1 for each group:
df.groupby(['key1'])['data1'].agg(['min','max'])

	min	max
key1
a	-0.627470	-0.297191
b	-0.299878	0.179488

# recommended
df.groupby(['key1']).agg({'data1':['min','max']})

	data1
	min	max
key1
a	-0.627470	-0.297191
b	-0.299878	0.179488

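You can also rename the columns of a groupby result this way (not recommended; it is better to rename the columns in a separate step afterwards). Passing a dict to agg on a grouped Series was deprecated, as the warning below shows, and pandas 1.0 and later raise SpecificationError instead.

# For data1, rename min to a and max to b
df.groupby(['key1'])['data1'].agg({'a':'min','b':'max'})  # 'min' and 'max' here are function names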

d:\python27\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version

	a	b
key1
a	-0.627470	-0.297191
b	-0.299878	0.179488
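
On pandas 0.25 and later, the same renaming can be written with keyword arguments (named aggregation), which does not trigger the deprecation. A minimal sketch with fixed numbers in place of the random data, keeping the output names a and b from the example above:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Keyword = output column name, value = aggregation to apply to data1
renamed = df.groupby('key1')['data1'].agg(a='min', b='max')
print(renamed)
```

Note that the column labels a and b here are unrelated to the group labels a and b in the index; more descriptive names would avoid that clash in real code.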

Important tip: calling .reset_index() right after the groupby aggregation gives a DataFrame without a multi-level index

The columns can then be renamed with df.rename(columns={'old_col1': 'new_col1', 'old_col2': 'new_col2', ...})

e.g.:

df1 = df.groupby(['date'])['price'].agg(['sum','count']).reset_index()
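
The line above refers to date and price columns that the earlier sample data does not have. The sketch below fills that gap with a small hypothetical `sales` frame and shows the full pattern: aggregate, flatten the index, then rename. The new column names `total_price` and `n_sales` are invented for the example.

```python
import pandas as pd

# Hypothetical data, only to make the pattern runnable
sales = pd.DataFrame({'date': ['2021-01-01', '2021-01-01', '2021-01-02'],
                      'price': [10.0, 20.0, 5.0]})

# Aggregate, then move the group key from the index back into a column
df1 = sales.groupby('date')['price'].agg(['sum', 'count']).reset_index()

# Rename the aggregated columns in a separate step
df1 = df1.rename(columns={'sum': 'total_price', 'count': 'n_sales'})
print(df1)
```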

The above is based on personal experience; I hope it is a useful reference. If anything is wrong or incomplete, corrections are welcome.
