python - Investigating different datatypes in Pandas DataFrame -

- March 15, 2013

I have 4 files that I want to read with Python / pandas. The files are at: https://github.com/kelsey9649/cs8370group/tree/master/tafengdataset. I stripped away the first row (column titles in Chinese) in all 4 files. Other than that, the 4 files are supposed to have the same format.

Now I want to read them and merge them into one big DataFrame. I tried using

    pars = {'sep':          ';',
            'header':       None,
            'names':        ['date','customer_id','age','area','prod_class','prod_id','amount','asset','price'],
            'parse_dates':  [0]}

    df = pd.DataFrame()
    for i in ('01', '02', '12', '11'):
        df = df.append(pd.read_csv(cfg.abspath+'d'+i,**pars))

But: the file d11 gives me a different format for the individual columns and therefore cannot be merged properly. The file contains over 200k lines, so I cannot simply look up the problem in the file by hand. As mentioned above, I was assuming it has the same format, but apparently there is some small difference in the format.

What's the easiest way of investigating the problem? Obviously, I cannot check every single line in the file...

When I read the 3 working files and merge them, and then read d11 independently, the line

a = pd.read_csv(cfg.abspath+'d11',**pars)

still gives me the following warning:

    C:\Python27\lib\site-packages\pandas\io\parsers.py:1130: DtypeWarning: Columns (1,4,5,6,7,8) have mixed types. Specify dtype option on import or set low_memory=False.
      data = self._reader.read(nrows)

Using the method .info() in pandas (for a and df) yields:

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 594119 entries, 0 to 178215
    Data columns (total 9 columns):
    date           594119 non-null datetime64[ns]
    customer_id    594119 non-null int64
    age            594119 non-null object
    area           594119 non-null object
    prod_class     594119 non-null int64
    prod_id        594119 non-null int64
    amount         594119 non-null int64
    asset          594119 non-null int64
    price          594119 non-null int64
    dtypes: datetime64[ns](1), int64(6), object(2)

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 223623 entries, 0 to 223622
    Data columns (total 9 columns):
    date           223623 non-null object
    customer_id    223623 non-null object
    age            223623 non-null object
    area           223623 non-null object
    prod_class     223623 non-null object
    prod_id        223623 non-null object
    amount         223623 non-null object
    asset          223623 non-null object
    price          223623 non-null object

Even if I use the dtype option on import, I am somehow still scared of wrong/bad results, as there might be some wrong casting of datatypes while importing!?

How can I overcome and solve this issue? Thanks a lot

Whenever you have a problem that is too boring to be done by hand, the solution is to write a program:

    for col in ('age', 'area'):
        for i, val in enumerate(a[col]):
            try:
                int(val)
            except (ValueError, TypeError):
                # report the position and the value that is not an integer
                print('line {}: {} = {}'.format(i, col, val))

This will show you the lines in the file with non-integer values in the age and area columns. This is just a first step in debugging the problem. Once you know what the problematic values are, you can better decide how to deal with them: maybe by pre-processing (cleaning) the data file, or by using pandas code to select and fix the problematic values.
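The same check can also be done with pandas itself. The following is only a minimal sketch, not necessarily what the answer had in mind: it reads every column as text so that nothing is cast silently, then uses `pd.to_numeric(..., errors='coerce')` to find the cells that do not parse as numbers. The inline sample data is made up for illustration and stands in for file d11. The list of columns checked is taken from the DtypeWarning above (columns 1, 4, 5, 6, 7, 8).

```python
import io
import pandas as pd

# Hypothetical stand-in for file d11; the real file is not reproduced here.
sample = io.StringIO(
    "2000-11-01;1001;A;E;100205;4710011;1;20;30\n"
    "2000-11-01;1002;B;F;100205;47x0011;2;40;60\n"
    "2000-11-02;10o3;C;G;100301;4710022;1;15;25\n"
)

names = ['date', 'customer_id', 'age', 'area', 'prod_class',
         'prod_id', 'amount', 'asset', 'price']

# Read every column as text so that no value is cast on import.
raw = pd.read_csv(sample, sep=';', header=None, names=names, dtype=str)

# Columns flagged by the DtypeWarning (positions 1, 4, 5, 6, 7, 8).
numeric_cols = ['customer_id', 'prod_class', 'prod_id',
                'amount', 'asset', 'price']

# errors='coerce' turns every value that cannot be parsed into NaN.
converted = raw[numeric_cols].apply(pd.to_numeric, errors='coerce')

# A cell is problematic if it held text but failed to convert.
bad_mask = converted.isna() & raw[numeric_cols].notna()
bad_rows = raw[bad_mask.any(axis=1)]

print(bad_rows)
```

Once the offending rows are known, they can be corrected in the data file or dropped before converting the columns to integers.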
