[Python] DataFrame Data Transformation Methods

Aaron 2019. 2. 15. 15:45

Related posts:

[Python] Pandas - DataFrame

[Python] Pandas - DataFrame related methods

#. Splitting, joining, and stripping strings (.split, .join, .strip)

# Splitting strings: the split method

pro.EMAIL

0 captain@abc.net

1 sweety@abc.net

...

14 napeople@jass.com

15 silver-her@daum.net

Name: EMAIL, dtype: object

pro.EMAIL.map(lambda x : x.split('@')) # Python's str.split handles one string at a time, so apply it per element with map

0 [captain, abc.net]

1 [sweety, abc.net]

...

14 [napeople, jass.com]

15 [silver-her, daum.net]

Name: EMAIL, dtype: object
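
pandas also exposes string methods through the .str accessor, which applies them element by element without an explicit lambda. A minimal sketch, assuming a small hypothetical Series in place of pro.EMAIL:

```python
import pandas as pd

# Hypothetical sample addresses standing in for pro.EMAIL
email = pd.Series(['captain@abc.net', 'sweety@abc.net', 'napeople@jass.com'])

# Element-wise split: each element becomes a list, like the map/lambda version
parts = email.str.split('@')

# expand=True spreads the pieces into DataFrame columns instead of lists
table = email.str.split('@', expand=True)
user_ids = table[0]
domains = table[1]
```

The expand=True form is convenient when the pieces should become separate columns, for example an ID column and a domain column.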

# Joining strings: the + operator and the join method

n = ['one', 'two', 'three']

'one' + '/' + 'two' + '/' + 'three'

'/'.join(n)

'one/two/three'

# Removing whitespace: the strip method

a = ['apple ', 'banan ', ' lemon']

list(map(lambda x : x.strip(), a)) # for a list, use the built-in map function (the .map method is for pandas objects)

['apple', 'banan', 'lemon']
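
For reference, strip removes whitespace on both sides of a string, while lstrip and rstrip remove it on one side only. A short sketch using the same list:

```python
a = ['apple ', 'banan ', ' lemon']

# strip: remove leading and trailing whitespace
both_sides = [x.strip() for x in a]

# lstrip / rstrip: remove whitespace on the left or right side only
left_only = [x.lstrip() for x in a]
right_only = [x.rstrip() for x in a]
```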

#. Finding a substring (in operator, .find, .index)

a = 'apple,banana,lemon'

# in operator

'apple' in a # checks whether the substring is present

True

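# find method

a.find('ple') # returns the start position if the substring is found

a.find('pal') # returns -1 if the substring is not found

-1

# index method

a.index('ple') # returns the start position if the substring is found

a.index('pal') # raises an exception if the substring is not found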

ValueError: substring not found
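
Because find signals a miss with -1 while index raises ValueError, the caller handles the two differently. A small sketch; the helper name locate is hypothetical:

```python
a = 'apple,banana,lemon'

def locate(text, sub):
    """Return the start position of sub in text, or None if it is absent."""
    try:
        return text.index(sub)
    except ValueError:
        return None

pos_find = a.find('ple')             # position of the first match
pos_find_missing = a.find('pal')     # -1 signals "not found"
pos_index = locate(a, 'ple')         # same position, via index
pos_index_missing = locate(a, 'pal') # the ValueError is converted to None
```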

#. Counting occurrences of a substring (.count)

a = 'apple,banana,lemon'

a.count('a')

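#. Replacing patterns or strings (.replace) * Important

# 1. Substring replacement (Python's str.replace)

1) Cannot replace with NA

2) Not vectorized

3) Applies to individual strings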

s1 = Series(['1,100','2,200','3,300','4,400'])

s1.map(lambda x : x.replace(',',''))

0 1100

1 2200

2 3300

3 4400

dtype: object

s2 = Series(['a','b','-','-'])

s2.map(lambda x : x.replace('-',np.nan)) # cannot replace with NA

TypeError: replace() argument 2 must be str, not float

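# 2. Whole-value replacement (pandas replace)

1) Can replace with NA

2) Vectorized

3) Does not replace substrings by default

4) Replaces only values that match exactly

* When replace is called on a Series or DataFrame, the pandas replace method is used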

s1 = Series(['1,100','2,200','3,300','4,400'])

s1.replace(',','') # no substring replacement; only values that match exactly are replaced

0 1,100

1 2,200

2 3,300

3 4,400

dtype: object

s1.replace('1,100','0')

0 0

1 2,200

2 3,300

3 4,400

dtype: object

s2 = Series(['a','b','-','-'])

s2.replace('-',np.nan)

0 a

1 b

2 NaN

3 NaN

dtype: object
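
When substring replacement is needed on a whole Series, pandas offers two routes besides map: the .str.replace accessor method, and Series.replace with regex=True. A sketch reusing the s1 and s2 values from above:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['1,100', '2,200', '3,300', '4,400'])

# Element-wise substring replacement through the .str accessor
no_commas = s1.str.replace(',', '', regex=False)

# Series.replace also replaces substrings once regex=True is set
no_commas_regex = s1.replace(',', '', regex=True)

# The cleaned strings can then be converted to integers
as_int = no_commas.astype(int)

# Whole-value replacement with NA works as in the original example
s2 = pd.Series(['a', 'b', '-', '-'])
with_na = s2.replace('-', np.nan)
```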

#. Renaming axis labels (.rename)

- Useful when only some of the labels need to change

data.rename?

data.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)

fruits

fruits.rename(index={0:'one', 1:'two'}, # rename axis labels with dictionaries

columns={'price':'won'})
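
The fruits DataFrame is not shown in the post, so the sketch below assumes a small hypothetical one to make the rename call reproducible. It also shows that a function can be passed instead of a dictionary:

```python
import pandas as pd

# Hypothetical stand-in for the fruits DataFrame used above
fruits = pd.DataFrame({'name': ['apple', 'mango', 'banana'],
                       'price': [1000, 1500, 500]})

# Only the labels listed in the dictionaries change; the rest are kept
renamed = fruits.rename(index={0: 'one', 1: 'two'},
                        columns={'price': 'won'})

# A function is applied to every label; here all column names are upper-cased
upper = fruits.rename(columns=str.upper)
```

rename returns a new object by default, so fruits itself keeps its original labels.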

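#. Removing duplicates (.duplicated, .drop_duplicates)

- A row counts as a duplicate when its whole combination of column values repeats

df1.duplicated?

df1.duplicated(subset=None, keep='first')

# keep : 'first' (every occurrence except the first is flagged as a duplicate), 'last' (every occurrence except the last is flagged)

df1 = DataFrame({'a':[1,1,2,2], 'b':[1,2,3,3]})

df1.duplicated() # check which rows are duplicates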

0 False

1 False

2 False

3 True

dtype: bool

df1[df1.duplicated()] # select the duplicated rows

df1.drop_duplicates() # return the data with duplicates removed

df1[-df1.duplicated()]

df1[~df1.duplicated()]
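
The keep and subset parameters change which rows are flagged. A short sketch on the same df1:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 3]})

dup_first = df1.duplicated()             # default keep='first'
dup_last = df1.duplicated(keep='last')   # the last occurrence is the one kept
dup_all = df1.duplicated(keep=False)     # every member of a duplicate group is flagged

# subset restricts the comparison to the listed columns
dup_on_a = df1.duplicated(subset=['a'])
unique_on_a = df1.drop_duplicates(subset=['a'])
```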

#. Further uses of map * Important

- Translating each key into its corresponding value

emp = get_query('select * from emp')

# Method 1

np.where(emp['DEPTNO'] == 10, '인사부',

np.where(emp['DEPTNO'] == 20, '총무부','재무부'))

array(['총무부', '재무부', '재무부', '총무부', '재무부', '재무부', '인사부', '총무부', '인사부',

'재무부', '총무부', '재무부', '총무부', '인사부'], dtype='<U3')

# Method 2 (map with a dictionary)

dic_deptno = {10:'인사부', 20:'총무부', 30:'재무부'} # HR, General Affairs, Finance

emp['DEPTNO'].map(dic_deptno) # each element is looked up as a dictionary key and replaced by its value

0 총무부

1 재무부

2 재무부

...

11 재무부

12 총무부

13 인사부

Name: DEPTNO, dtype: object

# Method 3

emp['DEPTNO'].replace(dic_deptno) # the pandas replace method

0 총무부

1 재무부

2 재무부

...

11 재무부

12 총무부

13 인사부

Name: DEPTNO, dtype: object
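
Methods 2 and 3 give the same result here because every DEPTNO has a dictionary entry. They differ when a key is missing: map returns NaN, while replace leaves the original value. A sketch with a hypothetical Series that includes an unmapped department 40 and English labels:

```python
import pandas as pd

# Hypothetical department numbers; 40 has no entry in the dictionary
deptno = pd.Series([20, 30, 10, 40])
dic_deptno = {10: 'HR', 20: 'General Affairs', 30: 'Finance'}

mapped = deptno.map(dic_deptno)        # the unmapped key 40 becomes NaN
replaced = deptno.replace(dic_deptno)  # the unmapped key 40 is left as it is
```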

#. Binning data into categorical groups (pd.cut)

pd.cut?

pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')

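# x : array-like data

# bins : bin edges, or an integer number of equal-width bins

# right : whether each bin includes its right edge; True : (lower, upper], False : [lower, upper)

# labels : names for the groups

# precision : precision (number of decimal places) of the bin edges shown when no labels are given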

emp['SAL']

0 800.0

1 1600.0

2 1250.0

3 2975.0

4 1250.0

5 2850.0

6 2450.0

7 3000.0

8 5000.0

9 1500.0

10 1100.0

11 950.0

12 3000.0

13 1300.0

Name: SAL, dtype: float64

np.where(emp['SAL'] <= 1000, 'C', # approach using np.where

np.where(emp['SAL'] <= 2000, 'B', 'A'))

array(['C', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'C', 'A',

'B'], dtype='<U1')

pd.cut(emp['SAL'],[0,1000,2000,10000])

0 (0, 1000] # greater than 0, up to and including 1000

1 (1000, 2000]

2 (1000, 2000]

3 (2000, 10000]

4 (1000, 2000]

5 (2000, 10000]

6 (2000, 10000]

7 (2000, 10000]

8 (2000, 10000]

9 (1000, 2000]

10 (1000, 2000]

11 (0, 1000]

12 (2000, 10000]

13 (1000, 2000]

Name: SAL, dtype: category

Categories (3, interval[int64]): [(0, 1000] < (1000, 2000] < (2000, 10000]] # the three ranges defined by bins

pd.cut(emp['SAL'],[0,1000,2000,10000], right=False, labels=['C','B','A']) # labels are listed in bin order

0 C

1 B

2 B

3 A

4 B

5 A

6 A

7 A

8 A

9 B

10 B

11 C

12 A

13 B

Name: SAL, dtype: category

Categories (3, object): [C < B < A]
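
None of the SAL values above falls exactly on a bin edge, so right=True and right=False give the same grades for this data. The difference shows up only at the edges. A sketch with hypothetical salaries that include the edge values 1000 and 2000:

```python
import pandas as pd

# Hypothetical salaries chosen to include values exactly on the bin edges
sal = pd.Series([800, 1000, 1600, 2000, 5000])
bins = [0, 1000, 2000, 10000]
labels = ['C', 'B', 'A']

# right=True (default): bins are (0, 1000], (1000, 2000], (2000, 10000]
right_closed = pd.cut(sal, bins, labels=labels)

# right=False: bins are [0, 1000), [1000, 2000), [2000, 10000)
left_closed = pd.cut(sal, bins, labels=labels, right=False)
```

With right=True the value 1000 is graded C, matching the `<= 1000` test in the np.where version; with right=False it moves up to B.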

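# Attributes of the cut result (a Categorical; on a Series, go through the .cat accessor)

- codes : the integer position of each element's group

- categories : the group names

- value_counts() : the number of elements in each group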

sal = pd.cut(np.array(emp['SAL']),[0,1000,2000,10000],labels=['C','B','A'])

sal.codes

array([0, 1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 0, 2, 1], dtype=int8)

sal.categories

Index(['C', 'B', 'A'], dtype='object')

pd.value_counts(sal)

A 6

B 6

C 2

dtype: int64

sal2 = pd.cut(np.array(emp['SAL']),4) # an integer requests that many equal-width bins

pd.value_counts(sal2)

(795.8, 1850.0] 8

(2900.0, 3950.0] 3

(1850.0, 2900.0] 2

(3950.0, 5000.0] 1

dtype: int64
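
When cut is given a Series rather than an array, it returns a Series of category dtype, and the same attributes are reached through the .cat accessor. A sketch that rebuilds the SAL column from the values printed earlier:

```python
import pandas as pd

# SAL values copied from the emp['SAL'] output shown earlier
sal = pd.Series([800, 1600, 1250, 2975, 1250, 2850, 2450,
                 3000, 5000, 1500, 1100, 950, 3000, 1300], name='SAL')

grades = pd.cut(sal, [0, 1000, 2000, 10000], labels=['C', 'B', 'A'])

codes = grades.cat.codes            # integer group position for each row
categories = grades.cat.categories  # the group names, in bin order
counts = grades.value_counts()      # number of rows in each group
```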

#. Binning into equal-sized groups by quantile (pd.qcut)

- Mainly used to split data by quantiles or percentages (%)

pd.qcut?

pd.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')

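# x : array-like data

# q : number of quantiles, or a list of quantile edges

# labels : names for the groups

# precision : precision (number of decimal places) of the bin edges shown when no labels are given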

data = np.random.randn(100) # 100 draws from the standard normal distribution

array([ 0.76905033, 0.64619414, 2.14542454, -0.88857032, -1.163164 ,

0.0640485 , 0.13168462, 3.08208798, -1.43318155, -0.5459889 ,

...

0.0425782 , 0.10225214, -0.68731568, 0.88087666, -0.9302119 ,

0.89370151, -0.47331144, 0.25940389, 1.16488655, -1.78270453])

cats = pd.qcut(data , 4, precision=2) # quartiles

[(0.12, 0.77], (0.12, 0.77], (0.77, 3.09], (-1.79, -0.67], (-1.79, -0.67], ..., (0.77, 3.09], (-0.67, 0.12], (0.12, 0.77], (0.77, 3.09], (-1.79, -0.67]]

Length: 100

Categories (4, interval[float64]): [(-1.79, -0.67] < (-0.67, 0.12] < (0.12, 0.77] < (0.77, 3.09]]

pd.value_counts(cats)

(0.77, 3.09] 25

(0.12, 0.77] 25

(-0.67, 0.12] 25

(-1.79, -0.67] 25

dtype: int64
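
qcut also accepts a list of quantile edges, which allows groups of unequal size. A sketch assuming a seeded random generator so the run is repeatable; the tier labels are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed, an assumption for repeatability
data = rng.standard_normal(100)

# Four quantile-based groups: each holds a quarter of the data
quartiles = pd.qcut(data, 4)
quartile_counts = pd.Series(quartiles).value_counts()

# Explicit quantile edges: bottom 10%, middle 80%, top 10%
tiers = pd.qcut(data, [0, 0.1, 0.9, 1.0], labels=['low', 'mid', 'high'])
tier_counts = pd.Series(tiers).value_counts()
```

Unlike cut, which fixes the bin widths, qcut fixes the share of data in each bin and lets the edges follow the data.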

#. Finding and replacing outliers

- Regression analysis is very sensitive to outliers

- When a dataset contains outliers, they usually need to be replaced (with the mean, min, max, etc.)

data = DataFrame(np.random.rand(100,4))

data[3]

0 0.219537

1 0.842740

2 0.714011

...

98 0.692940

99 0.878966

Name: 3, Length: 100, dtype: float64

data.describe()

data[3][data[3]>0.99] # outliers in column 3

1 0.997332

28 0.996564

Name: 3, dtype: float64

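data[(data > 0.99).any(axis=1)] # rows that contain an outlier

data[data>0.99] = np.sign(data) * data[3].describe().loc['mean'] # replace outliers with the mean of column 3, keeping each value's sign

# np.sign : returns the sign of each element (-1, 0, or 1)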
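
A variation on the replacement above is to fill each outlier with the mean of its own column rather than the mean of column 3. This is a sketch of that swapped-in approach, not the post's method; the seed and the planted outlier are assumptions for the demonstration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)             # fixed seed, for repeatability
data = pd.DataFrame(rng.random((100, 4)))
data.iloc[1, 3] = 0.999                    # plant one known outlier for the demo

threshold = 0.99
mask = data > threshold                    # True where a value is an outlier

# Rows that contain at least one outlier
rows_with_outliers = data[mask.any(axis=1)]

# Blank out the outliers, then fill each gap with its own column's mean
col_means = data[~mask].mean()
cleaned = data[~mask].fillna(col_means)
```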

#. Data sampling (numpy.random.permutation)

- After sampling, reorder the rows with iloc or take

# Preparing the indices

np.random.permutation(20) # the integers 0-19 drawn without replacement (replace = F)

array([13, 9, 5, 12, 15, 1, 17, 2, 11, 0, 3, 6, 4, 19, 14, 8, 10, 16, 7, 18])

np.random.randint(0,20,20) # 20 integers from 0-19 drawn with replacement (replace = T)

array([ 6, 7, 18, 11, 12, 19, 3, 4, 6, 18, 2, 6, 13, 12, 9, 13, 16, 2, 3, 6])

# Indexing with the sampled positions

data.iloc[np.random.permutation(20)] # pass the sampled positions to iloc

data.take(np.random.permutation(20)) # pass the sampled positions to take

#. Q: Extract train data (70%) and test data (30%) from a dataset

std

n = len(std) # number of rows = 20

t = int(len(std) * 0.7) # size of the train set (70%) = 14

ind = np.random.permutation(n) # random index

std.iloc[ind].iloc[:t] # extract the train data (70%)

std.take(ind).iloc[:t]

std.iloc[ind].iloc[t:] # extract the test data (30%)

std.take(ind).iloc[t:]
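
The steps above can be wrapped in a small function. A sketch under stated assumptions: the function name train_test_split_df and the 20-row stand-in for std are hypothetical, and a seed is used so the split is repeatable:

```python
import numpy as np
import pandas as pd

def train_test_split_df(df, train_ratio=0.7, seed=None):
    """Shuffle the row positions once, then cut them into two disjoint parts."""
    rng = np.random.default_rng(seed)
    ind = rng.permutation(len(df))     # random order of row positions
    t = int(len(df) * train_ratio)     # number of rows in the train part
    return df.iloc[ind[:t]], df.iloc[ind[t:]]

# Hypothetical 20-row table standing in for std
std = pd.DataFrame({'STUDNO': range(1, 21)})
train, test = train_test_split_df(std, seed=0)
```

Using a single permutation for both parts is what guarantees that no row ends up in both the train and the test data.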

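#. Converting categories to dummy variables (pandas.get_dummies())

* Dummy variables

- An independent variable that takes the value 0 or 1 to indicate whether a feature is present

- Given categorical data with categories such as A, B, and C, convert it to an indicator matrix (a matrix of 0s and 1s) so that each category becomes its own variable

- For classification with a deep learning model, the Y values generally need to be converted to dummy variables

- pandas.get_dummies is one of several functions that do this conversion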

data['SCORE']

0 A

1 C

2 B

...

17 B

18 B

19 B

Name: SCORE, dtype: category

Categories (4, object): [F < C < B < A]

pd.get_dummies(data['SCORE'])
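
The output of the call is not shown above, so here is a minimal sketch of what the indicator matrix looks like, assuming a short hypothetical SCORE column with the same four ordered categories:

```python
import pandas as pd

# Hypothetical grades with the same ordered categories as data['SCORE']
grade_type = pd.CategoricalDtype(['F', 'C', 'B', 'A'], ordered=True)
score = pd.Series(['A', 'C', 'B', 'B'], name='SCORE').astype(grade_type)

# One column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(score, dtype=int)
```

Each row has exactly one 1, in the column of its category. Because the input is categorical, a column is created for F even though no row has that grade.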

#. Q1 (about cut)

### Load the gogak (customer) and gift tables and print the best gift each customer can claim

gogak = get_query('select * from gogak')

gift = get_query('select * from gift')

# 1) Using an ordinary user-defined function

f2 = lambda x : gift.loc[(gift['G_START'] < x) & (x < gift['G_END']),'GNAME'].values[0]

gogak['POINT'].map(f2)

0 양쪽문냉장고

1 참치세트

2 주방용품세트

...

17 노트북

18 벽걸이TV

19 벽걸이TV

Name: POINT, dtype: object

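# 2) Using cut

bins = gift['G_START'].copy() # copy so the gift table itself is left unchanged

bins[10] = 1000001 # add the upper edge; same as bins = pd.concat([bins, Series(1000001)], ignore_index=True)

# Series.append was removed in pandas 2.0, so use pd.concat rather than bins.append(...)

pd.cut(gogak['POINT'], bins, labels=gift['GNAME'], right=False) # right=False makes each bin [start, next start)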

0 양쪽문냉장고

1 참치세트

2 주방용품세트

3 참치세트

4 샴푸세트

5 샴푸세트

6 세차용품세트

7 주방용품세트

8 LCD모니터

9 세차용품세트

10 샴푸세트

11 참치세트

12 산악용자전거

13 세차용품세트

14 산악용자전거

15 LCD모니터

16 노트북

17 노트북

18 벽걸이TV

19 벽걸이TV

Name: POINT, dtype: category

Categories (10, object): [참치세트 < 샴푸세트 < 세차용품세트 < 주방용품세트 ... 노트북 < 벽걸이TV < 드럼세탁기 < 양쪽문냉장고]
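
The gift and gogak tables come from a database that is not included in the post, so the sketch below assumes a tiny hypothetical gift table with the same column names to show how the bin edges are assembled:

```python
import pandas as pd

# Hypothetical three-row gift table; the real one has ten rows
gift = pd.DataFrame({'GNAME': ['tuna set', 'shampoo set', 'notebook'],
                     'G_START': [1, 100001, 200001],
                     'G_END': [100000, 200000, 300000]})

# Hypothetical customer points, including values at the range boundaries
points = pd.Series([980, 150000, 200001, 300000])

# Edges: every start value, plus one past the last end value
bins = gift['G_START'].tolist() + [int(gift['G_END'].iloc[-1]) + 1]

# right=False gives [start, next start), so each start value belongs to its own gift
prize = pd.cut(points, bins, labels=gift['GNAME'].tolist(), right=False)
```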

#. Q2 (about cut)

# Join the student and exam_01 tables, then

std = get_query('select * from student')

exam = get_query('select * from exam_01')

1) Print each student's grade

data = pd.merge(std , exam , on = 'STUDNO')

data['SCORE'] = pd.cut(data['TOTAL'] , [0,70,80,90,101] , labels = ['F','C','B','A'] , right = False)

2) Count the students in each grade

data['SCORE'].value_counts().sort_index(ascending = False)

A 4

B 12

C 3

F 1

Name: SCORE, dtype: int64

data.pivot_table(index = 'SCORE' , aggfunc = len).loc[ : , 'STUDNO']

SCORE

F 1

C 3

B 12

A 4

Name: STUDNO, dtype: int64

data.pivot_table(index = 'SCORE' , columns = 'STUDNO' , values = 'NAME', aggfunc = len).sum(1)

SCORE

F 1.0

C 3.0

B 12.0

A 4.0

dtype: float64
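
Another way to get the per-grade counts is groupby with size, which avoids the pivot table. A sketch with a hypothetical five-student table, since the student and exam_01 tables are not included in the post:

```python
import pandas as pd

# Hypothetical merged table with the same STUDNO and TOTAL columns
data = pd.DataFrame({'STUDNO': [1, 2, 3, 4, 5],
                     'TOTAL': [95, 85, 72, 60, 88]})

data['SCORE'] = pd.cut(data['TOTAL'], [0, 70, 80, 90, 101],
                       labels=['F', 'C', 'B', 'A'], right=False)

# value_counts, then order the grades from A down to F
by_value_counts = data['SCORE'].value_counts().sort_index(ascending=False)

# groupby + size counts the rows in each grade, in category order F..A
by_group = data.groupby('SCORE', observed=False).size()
```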

Reference: KIC Campus, Machine Learning-Based Big Data Analysis training course
