1. Load the text file as an RDD:
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")
2. Define a function that breaks each line into words:
def toWords(line):
    return line.split()
3. Run the toWords function on each element of the RDD with the flatMap transformation:
words = lines.flatMap(toWords)
4. Convert each word into a (key, value) pair:
def toTuple(word):
    return (word, 1)
wordTuple = words.map(toTuple)
5. Sum the counts per word with the reduceByKey() transformation:
def add(x, y):
    return x + y
counts = wordTuple.reduceByKey(add)
6. Collect and print the result:
print(counts.collect())
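Putting the six steps together, here is a minimal end-to-end sketch (the SparkContext setup and the HDFS path are assumptions for illustration; any readable text file would work):

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")  # assumed setup; skip if a SparkContext already exists
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")
words = lines.flatMap(lambda line: line.split())        # break lines into words
wordTuple = words.map(lambda word: (word, 1))           # pair each word with a count of 1
counts = wordTuple.reduceByKey(lambda x, y: x + y)      # sum the counts per word
print(counts.collect())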
28. Suppose you have a huge text file. How will you check if a particular keyword exists using Spark?
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")

def isFound(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

foundBits = lines.map(isFound)
total = foundBits.reduce(lambda x, y: x + y)
if total > 0:
    print("Found")
else:
    print("Not Found")
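A shorter alternative (a sketch, not part of the original answer) is to filter the lines for the keyword and check whether the resulting RDD is empty; "my_keyword" here stands in for whatever term you are searching for:

lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")
matches = lines.filter(lambda line: "my_keyword" in line)   # keep only matching lines
print("Not Found" if matches.isEmpty() else "Found")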