Python Crawling Useful features Read Excel File & Show Progress bar & Make DataFrame import pandas as pd from tqdm import tqdm file_name = 'test_file' file_df = pd.read_excel('C:\\Users\\cristoval\\Desktop\\data\\' + file_name + '.xlsx') data = {'id': [], 'title': [], 'link' : []} result_df = pd.DataFrame(data=data) for idx, row in tqdm(file_df.iterrows()): # do something result_df = result_df.a..
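The truncated loop above can be sketched end to end. Two hedges: a small in-memory frame stands in for the post's `read_excel` call so the sketch is self-contained, and since `DataFrame.append` was removed in pandas 2.0, rows are collected in a list and built once at the end (the preview cuts off at `result_df.a..`, so the original method is not confirmed).

```python
import pandas as pd

try:
    from tqdm import tqdm               # progress bar, as in the post
except ImportError:                     # degrade gracefully if tqdm is absent
    def tqdm(iterable, **kwargs):
        return iterable

# In the post: file_df = pd.read_excel('C:\\Users\\cristoval\\Desktop\\data\\test_file.xlsx')
# A small in-memory frame stands in here so the sketch runs without the file.
file_df = pd.DataFrame({'id': [1, 2], 'title': ['a', 'b'], 'link': ['u1', 'u2']})

rows = []
for idx, row in tqdm(file_df.iterrows(), total=len(file_df)):
    # do something with each row, then collect the result
    rows.append({'id': row['id'], 'title': row['title'], 'link': row['link']})

# Build the result frame once at the end (DataFrame.append was removed in pandas 2.0)
result_df = pd.DataFrame(rows, columns=['id', 'title', 'link'])
print(len(result_df))
```

Collecting dicts and constructing the frame once is also much faster than growing a DataFrame row by row inside the loop.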
Wikipedia Korean/English data collection/analysis. Let's put Wikipedia data to use for natural language processing. Download Wiki dump file https://dumps.wikimedia.org/kowiki/latest/ https://dumps.wikimedia.org/kowiki/latest/kowiki-latest-pages-articles.xml.bz2 (1,208,126 articles as of 2021/07) https://dumps.wikimedia.org/enwiki/latest/ https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 (15,839,021 articles as of 2021/07) pages-articles.xm..
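The two dump URLs above follow one fixed pattern per language code, so a tiny helper (hypothetical name `dump_url`) can build the latest pages-articles URL for any wiki:

```python
def dump_url(lang: str) -> str:
    """Build the URL of the latest pages-articles dump for a wiki language code."""
    return (f"https://dumps.wikimedia.org/{lang}wiki/latest/"
            f"{lang}wiki-latest-pages-articles.xml.bz2")

# The two dumps used in the post
print(dump_url("ko"))
print(dump_url("en"))
```

The actual download is multi-gigabyte, so fetching it belongs in a separate step (e.g. `urllib.request.urlretrieve` or `wget`), not in a quick script.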
Text preprocessing. A brief summary (condensed for my own reference) of 유원준's book "딥 러닝을 이용한 자연어 처리 입문" (Introduction to Natural Language Processing Using Deep Learning). Table Of Contents: Tokenization; Word Tokenization; Sentence Tokenization; Korean tokenization; Part-of-speech tagging; Cleaning and Normalization; Lemmatization & Stemming; Stopword; Regular Expression; Splitting Data; Text Preprocessing Tools for Korean Text. Tokenization / Word Tokenization: pip install nltk Do, n't from nltk.tokenize import word_..
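The "Do", "n't" fragment above refers to how NLTK splits English contractions during word tokenization. A minimal sketch, using `TreebankWordTokenizer` (the word-level tokenizer behind `word_tokenize`) because it needs no extra model download, whereas `word_tokenize` also requires the `punkt` data:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# English contractions are split into their parts: "Don't" -> "Do" + "n't"
tokens = tokenizer.tokenize("Don't be fooled by the dark sounding name.")
print(tokens)
```

This is exactly the behavior the book's tokenization chapter highlights: a tokenizer is more than `str.split`, since punctuation and clitics like `n't` get their own tokens.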
OpenNMT-py Step 1: Prepare the data. The data to use is declared in a .yaml configuration file, toy_en_de.yaml: ## Where the samples will be written save_data: toy-ende/run/example ## Where the vocab(s) will be written src_vocab: toy-ende/run/example.vocab.src tgt_vocab: toy-ende/run/example.vocab.tgt # Prevent overwriting existing files overwrite: False # Corpus opts: data: corpus_1: path_src: toy-ende/src-train.txt path_tgt: toy-ende/tgt-train.txt valid: path_src: toy-ende/src-val.txt path_tgt: t..
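The flattened config above, restored to its yaml layout. The final `path_tgt` is truncated in the preview; `toy-ende/tgt-val.txt` is assumed here from OpenNMT-py's quickstart naming, not confirmed by this snippet:

```yaml
# toy_en_de.yaml
## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt   # assumed; truncated in the preview
```

With this file in place, vocabulary building is `onmt_build_vocab -config toy_en_de.yaml -n_sample 10000` per the OpenNMT-py quickstart.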
Python Setting For NLP Install Docker & Ubuntu SSH https://data-make.tistory.com/674 Install Python Install Python & PyDev in Eclipse Reference Install Python packages offline Reference ############################################################## ## 1. python install package (transformers, pytorch, OpenNMT-py) ## python -m pip --trusted-host pypi.org --trusted-host files.pythonhosted.org insta..
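The truncated pip command above is the `--trusted-host` pattern for installing behind a proxy/offline mirror. A sketch of the usual two-machine offline workflow (package names from the post; the `./pkgs` directory name is illustrative). The download/install commands need network or a wheel cache, so they are shown as comments; the runnable lines only verify that pip exposes both subcommands:

```shell
# On a machine WITH internet access: download wheels into ./pkgs
#   python -m pip download transformers OpenNMT-py -d ./pkgs \
#       --trusted-host pypi.org --trusted-host files.pythonhosted.org
#
# Copy ./pkgs to the offline machine, then install without touching PyPI:
#   python -m pip install --no-index --find-links ./pkgs transformers OpenNMT-py

# Offline sanity check: both subcommands used above exist in this pip
python -m pip download --help > /dev/null && echo "download ok"
python -m pip install --help > /dev/null && echo "install ok"
```

`pip download` resolves and fetches the full dependency tree (PyTorch included, via transformers' requirements), so the copied directory is self-sufficient for `--no-index` installs.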
1. Save each DataFrame to its own file: data1.to_excel('/Users/aaron/Desktop/data1.xlsx', sheet_name='Sheet1', index=False) data2.to_excel('/Users/aaron/Desktop/data2.xlsx', sheet_name='Sheet1', index=False) data3.to_excel('/Users/aaron/Desktop/data3.xlsx', sheet_name='Sheet1', index=False) 2. Save DataFrames as multiple sheets of one Excel file: from pandas import ExcelWriter def save_xls(list_dfs, xls_path): writer = E..
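The `save_xls` helper is cut off at `writer = E..`; a sketch of the likely completion, using `ExcelWriter` as a context manager so the file is saved and closed automatically (requires an Excel engine such as openpyxl; the temp path is illustrative):

```python
import os
import tempfile
import pandas as pd
from pandas import ExcelWriter

def save_xls(list_dfs, xls_path):
    """Write each DataFrame in list_dfs to its own sheet of one workbook."""
    with ExcelWriter(xls_path) as writer:
        for n, df in enumerate(list_dfs):
            df.to_excel(writer, sheet_name='Sheet%d' % (n + 1), index=False)

# Usage with two small frames and a temp file instead of the Desktop paths
data1 = pd.DataFrame({'a': [1, 2]})
data2 = pd.DataFrame({'b': [3, 4]})
path = os.path.join(tempfile.mkdtemp(), 'out.xlsx')
save_xls([data1, data2], path)

back = pd.read_excel(path, sheet_name='Sheet2')   # read one sheet back
print(back['b'].tolist())
```

The context manager replaces the older `writer.save()` call, which was removed in recent pandas versions.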
1. A specific sheet: import pandas as pd def readExel(xlse_path, sheetName): xls_file = pd.ExcelFile(xlse_path) data = xls_file.parse(sheetName) return data data = readExel('/Users/aaron/Desktop/test/testExcel.xlsx', 'Sheet') 2. All sheets: import pandas as pd xls = pd.ExcelFile('/Users/aaron/Desktop/testExcel.xlsx') sheets = xls.sheet_names sh1 = xls.parse(sheet_name..
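Both patterns above, made self-contained: a small two-sheet workbook is written to a temp path first (a stand-in for testExcel.xlsx) so the read-back runs end to end; requires an Excel engine such as openpyxl:

```python
import os
import tempfile
import pandas as pd

# Stand-in for testExcel.xlsx: a small two-sheet workbook in a temp dir
path = os.path.join(tempfile.mkdtemp(), 'testExcel.xlsx')
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({'x': [1]}).to_excel(writer, sheet_name='Sheet', index=False)
    pd.DataFrame({'y': [2]}).to_excel(writer, sheet_name='Sheet2', index=False)

# 1. A specific sheet (function name and signature as in the post)
def readExel(xlse_path, sheetName):
    xls_file = pd.ExcelFile(xlse_path)
    data = xls_file.parse(sheetName)
    return data

data = readExel(path, 'Sheet')

# 2. All sheets: list the names, then parse each one
xls = pd.ExcelFile(path)
sheets = xls.sheet_names
sh1 = xls.parse(sheet_name=sheets[0])
print(sheets)
```

`pd.read_excel(path, sheet_name=None)` is an equivalent one-liner that returns a dict of all sheets keyed by name.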
#. Download an image from a URL: urllib.request.urlretrieve(image_url, image_path) import urllib.request def downImage(img_url, img_name): dir = '/Users/aaron/Desktop/test/' urllib.request.urlretrieve(img_url, dir + img_name + '.jpg') downImage('http://blogfiles.naver.net/20120808_277/tpet2_1344432366023a5JkD_PNG/2.png','cat')
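`urlretrieve` accepts any scheme urllib supports, so the pattern above can be sketched offline by pointing it at a local file:// URL instead of the blog's http image (the temp paths and fake bytes are stand-ins):

```python
import os
import tempfile
import urllib.request
from pathlib import Path

def downImage(img_url, img_path):
    """Download the resource at img_url and save it to img_path."""
    urllib.request.urlretrieve(img_url, img_path)

# Stand-in for the http image URL in the post: a local file served via file://
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, 'src.png')
with open(src, 'wb') as f:
    f.write(b'fake png bytes')

dst = os.path.join(tmp, 'cat.jpg')
downImage(Path(src).as_uri(), dst)   # as_uri() -> 'file:///.../src.png'
print(os.path.exists(dst))
```

For real http downloads, note that `urlretrieve` sends no browser headers; some image hosts (naver included) may require a `Referer` or `User-Agent`, in which case `urllib.request.Request` with custom headers plus `urlopen` is the more robust route.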