First non-repeating word in a file

Find the first non-repeating word in a 100 GB file.

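  1. Divide the file into N chunks, splitting at word boundaries so that no word is cut in half.

  2. Read each chunk and use a map to record every word it contains.

  3. The key is the word (string); the value is [file_position, count], where file_position is the offset of the word's first occurrence.

  4. Use map-reduce to aggregate the maps from all chunks: for each word, sum the counts and keep the minimum file_position.

  5. Keep only the words whose count == 1.

  6. Sort those words by file_position; the first one is the first non-repeating word.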
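
The steps above can be sketched in Python as follows. This is a minimal, single-process illustration rather than a real map-reduce job: the function names (`map_chunk`, `reduce_stats`, `first_non_repeating`) are hypothetical, words are assumed to be whitespace-separated, and each chunk is assumed to arrive as text together with its starting offset in the file. Since only the earliest word is needed, the last step takes the minimum instead of doing a full sort.

```python
import re


def map_chunk(text, base_offset):
    """Map step: return {word: [first_position, count]} for one chunk.

    base_offset is the chunk's starting position in the whole file,
    so the recorded positions are global, not chunk-relative.
    """
    stats = {}
    for match in re.finditer(r"\S+", text):
        word = match.group()
        entry = stats.get(word)
        if entry is None:
            stats[word] = [base_offset + match.start(), 1]
        else:
            entry[1] += 1  # first_position stays at the earliest occurrence
    return stats


def reduce_stats(partials):
    """Reduce step: sum the counts and keep the minimum position per word."""
    merged = {}
    for stats in partials:
        for word, (pos, count) in stats.items():
            if word in merged:
                merged[word][0] = min(merged[word][0], pos)
                merged[word][1] += count
            else:
                merged[word] = [pos, count]
    return merged


def first_non_repeating(merged):
    """Return the word with count == 1 and the smallest position, or None."""
    unique = [(pos, word) for word, (pos, count) in merged.items() if count == 1]
    return min(unique)[1] if unique else None


# Hypothetical usage: three chunks of one text, already split at word boundaries.
chunks = [(0, "the cat saw "), (12, "the dog and "), (24, "the cat ran")]
merged = reduce_stats(map_chunk(text, offset) for offset, text in chunks)
result = first_non_repeating(merged)
```

For a file of this size the merged map may itself not fit in memory. One common option, an assumption beyond the steps above, is to partition words by hash so that each reducer only holds a subset of the distinct words.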
