multithreading - Java wordcount: a mediocre implementation -


i implemented wordcount program java. basically, program takes large file (in tests, used 10 gb data file contained numbers only), , counts number of times each 'word' appears - in case, number (23723 example might appear 243 times in file).

below implementation. seek improve it, performance in mind, few other things well, , looking guidance. here few of issues wish correct:

  1. currently, program threaded , works properly. however, pass chunk of memory (500mb/num_threads) each thread, , each thread proceeds wordcount. problem here have main thread wait threads complete before passing more data each thread. isn't of problem, there period of time few threads wait , nothing while. believe sort of worker pool or executor service solve problem (i have not learned syntax yet).

  2. the program work file contains integers. that's problem. struggled lot, didn't know how iterate through data without creating loads of unused variables (using string or stringbuilder had awful performance). currently, use fact know input integer, , store temporary variables int, no memory problems there. want able use sort of delimiter, whether delimiter space, or several characters.

  3. i using global concurrenthashmap story key value pairs. example, if thread finds number "24624", searches number in map. if exists, increase value of key one. value of keys @ end represent number of occurrences of key. proper design? gain in performance giving each thread it's own hashmap, , merging them @ end?

  4. is there other way of seeking through file offset without using class randomaccessmemory? class read byte array, have convert. haven't timed conversion, maybe faster use else.

i open other possibilities well, comes mind.

note: splitting file not option want explore, might deploying on server in should not creating own files, if performance boost, might listen.

other note: new java threading, new stackoverflow. gentle.

    public class bigcount2 {         public static void main(string[] args) throws ioexception, interruptedexception {             int num, counter;             long i, j;             string delimiterstring = " ";             arraylist<character> delim = new arraylist<character>();             (char c : delimiterstring.tochararray()) {                 delim.add(c);             }             int counter2 = 0;             num = integer.parseint(args[0]);             int bytestoread = 1024 * 1024 * 1024 / 2; //500 mb, size of loop             int remainder = bytestoread % num;             int k = 0;             bytestoread = bytestoread - remainder;             int byr = bytestoread / num;             string filepath = "c:/users/daniel/desktop/int-dataset-10g.dat";             randomaccessfile file = new randomaccessfile(filepath, "r");             thread[] t = new thread [num];//array of threads             concurrentmap<integer, integer> wordcountmap = new concurrenthashmap<integer, integer>(25000);             byte [] bytearray = new byte [byr]; //allocates 500mb 2d byte array             char[] newbyte;             (i = 0; < file.length(); += bytestoread) {                 counter = 0;                 (j = 0; j < bytestoread; j += byr) {                     file.seek(i + j);                     file.read(bytearray, 0, byr);                     newbyte = new string(bytearray).tochararray();                     t[counter] = new thread(                             new bigcountthread2(counter,                                  newbyte,                                  delim,                                  wordcountmap));//giving each thread t[i] different file filereader[i]                      t[counter].start();                     counter++;                     newbyte = null;                 }                 (k = 0; k < num; k++){                     t[k].join(); //main thread continues after threads have finished.                  }                 counter2++;                 system.gc();             }             file.close();             system.exit(0);         }     }     class bigcountthread2 implements runnable {     private final concurrentmap<integer, integer> wordcountmap;     char [] newbyte;     private arraylist<character> delim;     private int threadid; //use later     bigcountthread2(int tid,              char[] newbyte,              arraylist<character> delim,             concurrentmap<integer, integer> wordcountmap) {          this.delim = delim;         threadid = tid;         this.wordcountmap = wordcountmap;         this.newbyte = newbyte;     }     public void run() {         int intcheck = 0;         int counter = 0; int = 0; integer check;  int j =0; int temp = 0; int intbuilder = 0;         (i = 0; < newbyte.length; i++) {             intcheck = character.getnumericvalue(newbyte[i]);             if (newbyte[i] == ' ' || intcheck == -1) {    //once delimiter found, current temparray needs added map                 check = wordcountmap.putifabsent(intbuilder, 1);                 if (check != null) { //if returns null, first instance                     wordcountmap.put(intbuilder, wordcountmap.get(intbuilder) + 1);                 }                  intbuilder = 0;             }              else {                 intbuilder = (intbuilder * 10) + intcheck;                 counter++;             }          }     } } 

some thoughts on little of ..

.. believe sort of worker pool or executor service solve problem (i have not learned syntax yet).

if threads take same time process same amount of data, there isn't of "problem" here.

however, 1 nice thing thread pool allows 1 rather trivially adjust basic parameters such number of concurrent workers. furthermore, using executor service , futures can provide additional level of abstraction; in case handy if each thread returned map result.

the program work file contains integers. that's problem. struggled lot, didn't know how iterate through data without creating loads of unused variables (using string or stringbuilder had awful performance) ..

this sounds implementation issue. while first try streamtokenizer (because it's written), if doing manually, check out source - bit of can omitted when simplifying notion of "token". (it uses temporary array build token.)

i using global concurrenthashmap story key value pairs. .. proper design? gain in performance giving each thread it's own hashmap, , merging them @ end?

it reduce locking , may increase performance use separate map per thread , merge strategy. furthermore, current implementation broken wordcountmap.put(intbuilder, wordcountmap.get(intbuilder) + 1) not atomic , operation might under count. use separate map because reducing mutable shared state makes threaded program easier reason about.

is there other way of seeking through file offset without using class randomaccessmemory? class read byte array, have convert. haven't timed conversion, maybe faster use else.

consider using filereader (and bufferedreader) per thread on same file. avoid having first copy file array , slice out individual threads which, while same amount of total reading, avoids having soak memory. reading done not random access, merely sequential (with "skip") starting different offsets - each thread still works on mutually exclusive range.

also, original code slicing broken if integer value "cut" in half each of threads read half word. 1 work-about have each thread skip first word if continuation previous block (i.e. scan 1 byte sooner) , read-past end of it's range required complete last word.


Comments

Popular posts from this blog

C# random value from dictionary and tuple -

cgi - How do I interpret URLs without extension as files rather than missing directories in nginx? -

.htaccess - htaccess convert request to clean url and add slash at the end of the url -