First non-repeating word in a file

Find the first non-repeating word in a 100 GB file.

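Divide the file into N chunks, splitting on word boundaries so that no word is cut in half.
Read each chunk and use a map to store each word.
The key is the word (string); the value is [file_position, count], where file_position is the offset of the word's first occurrence in that chunk.
Use map-reduce to aggregate across chunks: for each word, sum the counts and keep the minimum file_position.
Then select all words whose count == 1.
Sort those words by file_position; the one with the smallest position is the first non-repeating word.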
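
The steps above can be sketched in Python as follows. This is a minimal in-memory illustration, not a real 100 GB pipeline: the function names (count_chunk, merge, first_non_repeating) are hypothetical, the input is assumed to be an already-tokenized list of words, and the word index stands in for file_position. In a real job each chunk would be read from disk by a separate worker, and the merged map would likely need to be partitioned (for example by hash of the word) if it does not fit in memory.

```python
def count_chunk(indexed_words):
    """Map step: build {word: [first_position, count]} for one chunk."""
    stats = {}
    for pos, word in indexed_words:
        if word in stats:
            stats[word][1] += 1
        else:
            # Positions increase within a chunk, so the first sighting is the minimum.
            stats[word] = [pos, 1]
    return stats


def merge(total, partial):
    """Reduce step: sum counts and keep the minimum position per word."""
    for word, (pos, count) in partial.items():
        if word in total:
            total[word][0] = min(total[word][0], pos)
            total[word][1] += count
        else:
            total[word] = [pos, count]
    return total


def first_non_repeating(words, n_chunks):
    """Return the first word that occurs exactly once, or None if there is none."""
    indexed = list(enumerate(words))  # word index stands in for file_position
    size = max(1, -(-len(indexed) // n_chunks))  # ceiling division
    chunks = [indexed[i:i + size] for i in range(0, len(indexed), size)]

    total = {}
    for chunk in chunks:
        merge(total, count_chunk(chunk))

    unique = [(pos, word) for word, (pos, count) in total.items() if count == 1]
    # Only the smallest position is needed, so min() replaces a full sort.
    return min(unique)[1] if unique else None


print(first_non_repeating("a b a c b d".split(), n_chunks=3))  # prints "c"
```

Taking the minimum instead of sorting gives the same answer as the last step above in a single pass over the count == 1 words.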

