PySpark supports custom profilers. A custom profiler lets a job use a profiler other than the default BasicProfiler, or write the profiling results in a different output format. A custom profiler has to define or inherit the following methods:
- stats: This is used to return collected stats of profiling.
- profile: This is used to produce a system profile of some sort.
- dump(id, path): This is used to dump the profile for the RDD with the given id to the specified path.
- add: This is used to add a profile to the existing accumulated profile.

The profiler class has to be selected when the SparkContext is created, through its profiler_cls argument.
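
To make the roles of these methods concrete, here is a minimal standalone sketch built only on Python's standard cProfile and pstats modules. It does not import PySpark, and it is not PySpark's own implementation. The class name SketchProfiler and the helper function work are hypothetical, chosen only for illustration. The file name pattern used in dump is also an assumption.

```python
import cProfile
import os
import pstats
import tempfile


class SketchProfiler:
    """Hypothetical stand-in that mirrors the methods listed above."""

    def __init__(self):
        # Accumulated pstats.Stats object, or None before the first profile.
        self._accumulated = None

    def profile(self, func, *args, **kwargs):
        """Run func under cProfile and fold its stats into the accumulated profile."""
        profiler = cProfile.Profile()
        result = profiler.runcall(func, *args, **kwargs)
        self.add(pstats.Stats(profiler))
        return result

    def add(self, new_stats):
        """Merge a new pstats.Stats object into the accumulated profile."""
        if self._accumulated is None:
            self._accumulated = new_stats
        else:
            self._accumulated.add(new_stats)

    def stats(self):
        """Return the collected stats (None if nothing has been profiled yet)."""
        return self._accumulated

    def dump(self, id, path):
        """Write the accumulated profile for the given id into the directory path."""
        os.makedirs(path, exist_ok=True)
        # Assumed naming scheme: one file per id inside the target directory.
        target = os.path.join(path, "rdd_%d.pstats" % id)
        if self._accumulated is not None:
            self._accumulated.dump_stats(target)
        return target


def work(n):
    # Hypothetical workload: sum of squares below n.
    return sum(i * i for i in range(n))


sketch = SketchProfiler()
first = sketch.profile(work, 1000)
second = sketch.profile(work, 2000)  # merged into the first profile via add()
out_file = sketch.dump(0, tempfile.mkdtemp())
```

In an actual PySpark job, a class like this would normally subclass BasicProfiler (or Profiler) from pyspark.profiler, be passed to SparkContext as profiler_cls, and take effect only when the spark.python.profile configuration is set to true.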