python - Investigating different datatypes in Pandas DataFrame
I have 4 files that I want to read with Python / pandas. The files are at: https://github.com/kelsey9649/cs8370group/tree/master/tafengdataset. I stripped away the first row (column titles in Chinese) in all 4 files. Other than that, the 4 files are supposed to have the same format.
Now I want to read them and merge them into one big DataFrame. I tried using
import pandas as pd

pars = {'sep': ';',
        'header': None,
        'names': ['date', 'customer_id', 'age', 'area', 'prod_class',
                  'prod_id', 'amount', 'asset', 'price'],
        'parse_dates': [0]}
df = pd.DataFrame()
for i in ('01', '02', '12', '11'):
    df = df.append(pd.read_csv(cfg.abspath+'d'+i, **pars))
But file d11 gives me a different format for the individual columns and so cannot be merged properly. The file contains over 200k lines, so I cannot really look for the problem by hand. As mentioned above, I was assuming that it has the same format, but apparently there is some small difference in the format.
What's the easiest way of investigating the problem? Obviously, I cannot check every single line in the file...
When I read the 3 working files and merge them, and then read d11 independently, the line
a = pd.read_csv(cfg.abspath+'d11',**pars)
still gives me the following warning:
C:\Python27\lib\site-packages\pandas\io\parsers.py:1130: DtypeWarning: Columns (1,4,5,6,7,8) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
Calling the method .info() in pandas on df (the three working files) and on a (d11) yields:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 594119 entries, 0 to 178215
Data columns (total 9 columns):
date           594119 non-null datetime64[ns]
customer_id    594119 non-null int64
age            594119 non-null object
area           594119 non-null object
prod_class     594119 non-null int64
prod_id        594119 non-null int64
amount         594119 non-null int64
asset          594119 non-null int64
price          594119 non-null int64
dtypes: datetime64[ns](1), int64(6), object(2)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 223623 entries, 0 to 223622
Data columns (total 9 columns):
date           223623 non-null object
customer_id    223623 non-null object
age            223623 non-null object
area           223623 non-null object
prod_class     223623 non-null object
prod_id        223623 non-null object
amount         223623 non-null object
asset          223623 non-null object
price          223623 non-null object
Even if I use the dtype option on import, I am somehow still scared of wrong/bad results, because there might be a wrong casting of datatypes happening while importing!?
How can I overcome and solve this issue? Thanks a lot.
Whenever you have a problem that is too boring to be done by hand, the solution is to write a program:
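for col in ('age', 'area'):
    for i, val in enumerate(a[col]):
        try:
            int(val)
        except (TypeError, ValueError):
            # report the row position, the column and the value that is not an integer
            print('line {}: {} = {}'.format(i, col, val))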
This will show you the lines in the file that have non-integer values in the age and area columns. This is the first step in debugging the problem. Once you know what the problematic values are, you can better decide how to deal with them: maybe by pre-processing (cleaning) the data file, or by using pandas code to select and fix the problematic values.
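As a rough sketch of the "pandas code" route, one option is to read every field as a string (so nothing is cast silently on import) and then ask pandas which values fail to parse as numbers. The sample data, the helper name find_non_numeric and the list numeric_cols are assumptions made for illustration: the sample stands in for d11, whose real contents are not shown here, and numeric_cols simply follows the column positions named in the DtypeWarning (1, 4, 5, 6, 7, 8).

```python
import io
import pandas as pd

# Hypothetical stand-in for file d11; the real file is not reproduced here.
sample = io.StringIO(
    "2000-11-01;1001;A;E;100;200;1;10;20\n"
    "2000-11-02;1002;B;F;101;201;x;11;21\n"
    "2000-11-03;10o3;C;G;102;202;2;12;22\n"
)
names = ['date', 'customer_id', 'age', 'area', 'prod_class',
         'prod_id', 'amount', 'asset', 'price']

# dtype=str keeps every field as text, so no automatic casting takes place.
raw = pd.read_csv(sample, sep=';', header=None, names=names, dtype=str)


def find_non_numeric(frame, columns):
    """Map each column to the raw values in it that do not parse as numbers."""
    bad = {}
    for col in columns:
        # errors='coerce' turns unparseable values into NaN instead of raising
        converted = pd.to_numeric(frame[col], errors='coerce')
        # flag values that were present in the file but failed to convert
        mask = converted.isna() & frame[col].notna()
        if mask.any():
            bad[col] = frame.loc[mask, col]
    return bad


# Columns 1, 4, 5, 6, 7, 8 from the warning (an assumption about which to check)
numeric_cols = ['customer_id', 'prod_class', 'prod_id', 'amount', 'asset', 'price']
problems = find_non_numeric(raw, numeric_cols)
for col, values in problems.items():
    print(col)
    print(values)
```

The index of each reported value is the row position in the file (counting from 0, since the header row was removed), which should make it easy to look up the offending lines and decide whether to clean the file or fix the values in pandas.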