How to get the top X of words from a SparseVector to a string array with PySpark


I am currently clustering some text documents.
I am using K-means and processing my data with TF-IDF, using the methods PySpark provides.
Now I want to get the top 10 words for each cluster.



When I run:



getTopwords_udf = udf(lambda vector: [ countVectorizerModel.vocabulary[indice] for indice in vector.toArray().tolist().argsort()[-10:][::-1]], ArrayType(StringType()))

predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means"))
.withColumn("topWord", getTopwords_udf(col('means')))
.select("prediction", "topWord")
.show(2, truncate=100)


I get this error:



Could not serialize object: Py4JError: An error occurred while calling o225.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)


Traceback (most recent call last):
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 189, in wrapper
return self(*args)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 167, in __call__
judf = self._judf
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 151, in _judf
self._judf_placeholder = self._create_judf()
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 160, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 35, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/opt/bigpipe/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 597, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o225.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)


I thought the cause was the type conversion (from DoubleType to a NumPy float), so I also tried this to see what was happening:



vector_udf = udf(lambda vector: vector.toArray().tolist(), ArrayType(FloatType()))
vector2_udf = udf(lambda vector: vector.sort()[:10], ArrayType(FloatType()))

predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means"))
.withColumn("topWord", vector_udf(col('means')))
.withColumn("topWord2", vector2_udf(col('topWord')))
.select("prediction", "topWord", "topWord2")
.show(2, truncate=100)


But then I get this error: TypeError: 'NoneType' object is not subscriptable










  • 1





    One potential issue might be in vector2_udf, where you have vector.sort()[:10]: vector.sort() does not return a value.

    – ags29
    Mar 26 at 12:31











  • Yes indeed, the Python sort method returns nothing: it alters the list in place.

    – Kaharon
    Mar 29 at 8:09
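
As the comments above point out, list.sort() sorts in place and returns None, which is what leads to the TypeError once the result is sliced. A minimal sketch in plain Python of how it differs from sorted() (the sample values are made up for illustration and are not from the question):

```python
values = [0.1, 0.7, 0.3, 0.9]  # hypothetical sample data

# list.sort() mutates the list and returns None, so slicing its result fails
in_place_result = values.copy().sort()

# sorted() returns a new list, which can be sliced safely
top_two = sorted(values, reverse=True)[:2]

print(in_place_result)
print(top_two)
```

So a version of vector2_udf built on sorted(vector, reverse=True)[:10] would at least avoid the NoneType error.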

















python apache-spark pyspark






edited Mar 26 at 10:38







Kaharon

















asked Mar 26 at 9:19









Kaharon







1 Answer
I have figured out how to get the top X words from a SparseVector as a string array with PySpark.
Here is my solution for those who might be interested:



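from pyspark.ml.stat import Summarizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

def getTopWordContainer(v):
    # v is the vocabulary as a plain Python list
    def getTopWord(vector):
        vectorConverted = vector.toArray().tolist()
        # Indices of the 10 largest values, highest first
        listSortedDesc = [i[0] for i in sorted(enumerate(vectorConverted), key=lambda x: x[1])][-10:][::-1]
        return [v[j] for j in listSortedDesc]
    return getTopWord

getTopWordInit = getTopWordContainer(countVectorizerModel.vocabulary)
getTopWord_udf = udf(getTopWordInit, ArrayType(StringType()))

# Parentheses allow the method chain to span several lines
top = (predictions.groupBy("prediction").agg(Summarizer.mean(col("features")).alias("means"))
    .withColumn("topWord", getTopWord_udf(col('means')))
    .select("prediction", "topWord"))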


I am a beginner in Spark, so if you know how to improve it, let me know :)
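
One possible simplification, offered as a sketch rather than a definitive implementation: the ranking step could use heapq.nlargest from the standard library instead of sorting every index. The helper name top_k_words and the sample data below are hypothetical. Inside the UDF it would presumably be called with vector.toArray().tolist() and the plain list taken from countVectorizerModel.vocabulary.

```python
import heapq


def top_k_words(values, vocabulary, k=10):
    """Return the k vocabulary entries with the largest values, highest first."""
    # Rank the indices by the value stored at each index and keep the k largest
    top_indices = heapq.nlargest(k, range(len(values)), key=values.__getitem__)
    return [vocabulary[i] for i in top_indices]


# Hypothetical stand-ins for a cluster's mean TF-IDF vector and the vocabulary
means = [0.0, 2.5, 0.3, 1.7]
vocab = ["the", "spark", "of", "cluster"]
print(top_k_words(means, vocab, k=2))
```

Passing the vocabulary in as a plain list keeps the model object itself out of the function that Spark has to serialize, which appears to be what triggered the __getstate__ error in the question.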






        answered Mar 29 at 8:16









        Kaharon















