Skip to content

ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation#13687

Merged
pitrou merged 26 commits into
apache:masterfrom
vibhatha:arrow-17181
Sep 29, 2022
Merged

ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation#13687
pitrou merged 26 commits into
apache:masterfrom
vibhatha:arrow-17181

Conversation

@vibhatha

Copy link
Copy Markdown
Contributor

This PR contains Python Scalar UDF documentation as an experimental version of docs.
At the moment we only support Scalar UDFs and the code snippets include how to use
UDFs with PyArrow.

@vibhatha

Copy link
Copy Markdown
Contributor Author

cc @westonpace @lidavidm @pitrou

@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@vibhatha

Copy link
Copy Markdown
Contributor Author

cc @amol- @jorisvandenbossche

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start, but see comments below.
Also, can you add API docs for register_scalar_function and ScalarUdfContext?

Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is function_name here? Perhaps it would be better to spell it explicitly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also @jorisvandenbossche do you remember why Expression._call has a leading underscore?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function_name replaced with "affine"

Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
@vibhatha

Copy link
Copy Markdown
Contributor Author

Also, can you add API docs for register_scalar_function and ScalarUdfContext?

Do you mean include it in docs for Python? I am not sure how to link it properly. But the text is there for both ScalarUdfContext and register_scalar_function where they are defined.

@vibhatha

Copy link
Copy Markdown
Contributor Author

@pitrou could you please help me with adding API docs? or explain how to do that?

@lidavidm

Copy link
Copy Markdown
Member

@vibhatha add docstrings to the Python/Cython code: https://numpydoc.readthedocs.io/en/latest/format.html (see the example on the bottom, or look through the Arrow source)

@pitrou

pitrou commented Jul 22, 2022

Copy link
Copy Markdown
Member

@vibhatha add docstrings to the Python/Cython code: https://numpydoc.readthedocs.io/en/latest/format.html (see the example on the bottom, or look through the Arrow source)

And also reference the given symbols in https://github.com/apache/arrow/blob/master/docs/source/python/api/compute.rst

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vibhatha Can I ask you to re-read your changes before pushing them? This would help minimize the review cycles. There are several redundancies and oddities here.

Comment thread cpp/src/arrow/python/udf.h Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment on lines 457 to 461

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this taking m[0]? I don't understand what this example is supposed to show...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed you asked me to show the all scalar scenario in an example rather than wording. That’s why I added one to showcase that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I referred to this: #13687 (comment)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DId I misinterpret your statement?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the all-scalar scenario works with the original UDF? no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou should we keep this example or just keep a note/warning? I am inclined towards the example, though.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to keep this example then we should:

  • Change the above ..note:: so that it isn't a note but a dedicated subsection.
  • Move the paragraph starting with "More generally," to come after the example.
  • Add a paragraph motivating the example.

However, I think we should expand this subsection so it isn't "Here is a wierd case that happens when all inputs are scalar" to "Your UDF should generally be capable of handling different combinations of scalar and array shaped inputs"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-organized and included content. Please check it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but the example still makes no sense to me. Why not use the regular affine function here, instead of this scalar-specific definition?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is interesting, I think, that a person can define a UDF without using the Arrow compute functions at all, that is the most compelling point of the UDF feature in my mind since compositions of Arrow compute functions could already be done using expressions.

However, it is not clear from the description that this is the purpose of this example (is it?). It's also perhaps not the most motivating example since it can be expressed as an Arrow expression.

Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
@vibhatha

Copy link
Copy Markdown
Contributor Author

@vibhatha add docstrings to the Python/Cython code: https://numpydoc.readthedocs.io/en/latest/format.html (see the example on the bottom, or look through the Arrow source)

@lidavidm I have already included the docs in the Cython in the original PR, but that function is in compute.pyx and only referred as an import in the compute.py. I was not sure whether to mark the method in Cython underscore function and call it in the compute.py and add the docs there. That was the confusing part.

@vibhatha

Copy link
Copy Markdown
Contributor Author

@vibhatha Can I ask you to re-read your changes before pushing them? This would help minimize the review cycles. There are several redundancies and oddities here.

Of course, sorry about the issues. Will make sure to avoid this.

@westonpace westonpace self-requested a review July 25, 2022 16:54
@vibhatha vibhatha requested a review from pitrou July 25, 2022 16:58

@westonpace westonpace left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding ScalarUdfContext and register_scalar_function: Both seem to have the correct docstrings in the python code (to the best of my knowledge). I think the problem what @pitrou alluded to. You need to include both in some kind of /api/*.rst file (and /pi/compute.rst does make the most sense).

For example: 757ec32

These /api/*.rst files are an index into the API documentation. If a type or method is not referenced by one of these files then API documentation will not be autogenerated for it.

Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment on lines 457 to 461

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to keep this example then we should:

  • Change the above ..note:: so that it isn't a note but a dedicated subsection.
  • Move the paragraph starting with "More generally," to come after the example.
  • Add a paragraph motivating the example.

However, I think we should expand this subsection so it isn't "Here is a wierd case that happens when all inputs are scalar" to "Your UDF should generally be capable of handling different combinations of scalar and array shaped inputs"

Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
@vibhatha

Copy link
Copy Markdown
Contributor Author

@westonpace thanks for the detailed review. I will address these.

Comment thread docs/source/python/compute.rst Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "vivid" the right word here? @westonpace What do you think?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any would be better.

Comment thread docs/source/python/compute.rst Outdated
Comment on lines 457 to 461

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but the example still makes no sense to me. Why not use the regular affine function here, instead of this scalar-specific definition?

@vibhatha

vibhatha commented Aug 1, 2022

Copy link
Copy Markdown
Contributor Author

Re: #13687 (comment)

@pitrou we can use the other affine function
But it beats the purpose right?
I thought we decided to elaborate the a single sentence from an example. May be I misunderstood your point.

@vibhatha vibhatha requested review from pitrou and westonpace August 1, 2022 14:29
@pitrou

pitrou commented Aug 1, 2022

Copy link
Copy Markdown
Member

The purpose is to show how a regular scalar function can get executed on all-scalar inputs with help from the compute function execution layer.

@pitrou

pitrou commented Aug 1, 2022

Copy link
Copy Markdown
Member

We can also remove this entire example (the one showing execution on all scalars), because I'm not sure how useful it actually is.

@vibhatha

vibhatha commented Aug 1, 2022

Copy link
Copy Markdown
Contributor Author

The purpose is to show how a regular scalar function can get executed on all-scalar inputs with help from the compute function execution layer.

Yes @pitrou that's the most important part. But shouldn't we point out how it could be used with a third-party library function. Let's say the user has already written a custom function which is an old python script, it doesn't use PyArrow compute API. I thought about that angel, that's why added the note. Is this something we don't need to discuss now and may be leave it for a different venue to discuss? Please correct me if I am wrong.

cc @westonpace Any thoughts?

Comment thread docs/source/python/compute.rst Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any would be better.

Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment thread docs/source/python/compute.rst Outdated
Comment on lines 457 to 461

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is interesting, I think, that a person can define a UDF without using the Arrow compute functions at all, that is the most compelling point of the UDF feature in my mind since compositions of Arrow compute functions could already be done using expressions.

However, it is not clear from the description that this is the purpose of this example (is it?). It's also perhaps not the most motivating example since it can be expressed as an Arrow expression.

Comment thread docs/source/python/compute.rst Outdated
Comment on lines 474 to 479

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function correctly handles the all-scalar case but it does not handle other cases. Ideally, an example should demonstrate how to write a UDF that can handle all possible cases. For example:

print('10.0 * 5.0 + 1.0 should be 51.0')
print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.scalar(5.0), pa.scalar(1.0)])}')

print('[10.0, 10.0] * [5.0, 6.0] + [1.0, 1.0] should be [51.0, 61.0]')
print(f'Answer={pc.call_function(function_name, [pa.array([10.0, 10.0]), pa.array([5.0, 6.0]), pa.array([1.0, 1.0])])}')

print('10.0 * [5.0, 6.0] + 1.0 should be [51.0, 61.0]')
print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.array([5.0, 6.0]), pa.scalar(1.0)])}')

Right now, the function as it is designed, gives me this output:

10.0 * 5.0 + 1.0 should be 51.0
Answer=51.0
[10.0, 10.0] * [5.0, 6.0] + [1.0, 1.0] should be [51.0, 61.0]
Answer=[
  51
]
10.0 * [5.0, 6.0] + 1.0 should be [51.0, 61.0]
Traceback (most recent call last):
  File "/home/pace/experiments/arrow-17181/repr.py", line 39, in <module>
    print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.array([5.0, 6.0]), pa.scalar(1.0)])}')
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_compute.pyx", line 2506, in pyarrow._compute._scalar_udf_callback
  File "/home/pace/experiments/arrow-17181/repr.py", line 21, in affine_with_python
    m = m[0].as_py()
TypeError: 'pyarrow.lib.DoubleScalar' object is not subscriptable

Comment thread docs/source/python/compute.rst Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it an array? I see:

<pyarrow.DoubleScalar: 123.22>

It probably should be an array.

Comment thread docs/source/python/compute.rst Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you are trying to say with the sentence that starts "Depending the..."

Comment thread docs/source/python/compute.rst Outdated
Comment on lines 516 to 519

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You say "The passed arguments to this function call are expressions not scalar values".

However, pc.scalar(5.2) and pc.scalar(2.1) look like scalar values. I'm not sure a user will recognize the subtle difference between pc.scalar(5.2) and pa.scalar(5.2) without further explanation.

Comment thread docs/source/python/compute.rst Outdated
@westonpace

Copy link
Copy Markdown
Member

I think part of the challenge with this documentation is that implementing affine in pure-python is not a very compelling use case. I think the more interesting case for UDFs is when we want to use some other library that does efficient compute and is capable of working with Arrow data. For example, numpy. Here is an example that exposes numpy's gcd function (greatest common divisor) as an Arrow function:

import numpy as np

import pyarrow as pa
import pyarrow.compute as pc

function_name = "numpy_gcd"
function_docs = {
       "summary": "Calculates the greatest common divisor",
       "description":
           "Given 'x' and 'y' find the greatest number that divides\n"
           "evenly into both x and y."
}

input_types = {
   "x" : pa.int64(),
   "y" : pa.int64()
}

output_type = pa.int64()

def to_np(val):
    if isinstance(val, pa.Scalar):
        return val.as_py()
    else:
        return np.array(val)

def gcd_numpy(ctx, x, y):
    np_x = to_np(x)
    np_y = to_np(y)
    return pa.array(np.gcd(np_x, np_y))

pc.register_scalar_function(gcd_numpy,
                            function_name,
                            function_docs,
                            input_types,
                            output_type)

print('gcd(27, 63) should be 9')
print(f'Answer={pc.call_function(function_name, [pa.scalar(27), pa.scalar(63)])}')
print()
print('gcd([27, 18], [54, 63]) should be [27, 9]')
print(f'Answer={pc.call_function(function_name, [pa.array([27, 18]), pa.array([54, 63])])}')
print()
print('gcd(27, [54, 18]) should be [27, 9]')
print(f'Answer={pc.call_function(function_name, [pa.scalar(27), pa.array([54, 18])])}')

Notice the use of the helper function to_np to convert from inputs of different shapes to ensure that we get something that numpy can work with.

@vibhatha

Copy link
Copy Markdown
Contributor Author

I think part of the challenge with this documentation is that implementing affine in pure-python is not a very compelling use case. I think the more interesting case for UDFs is when we want to use some other library that does efficient compute and is capable of working with Arrow data. For example, numpy. Here is an example that exposes numpy's gcd function (greatest common divisor) as an Arrow function:

import numpy as np

import pyarrow as pa
import pyarrow.compute as pc

function_name = "numpy_gcd"
function_docs = {
       "summary": "Calculates the greatest common divisor",
       "description":
           "Given 'x' and 'y' find the greatest number that divides\n"
           "evenly into both x and y."
}

input_types = {
   "x" : pa.int64(),
   "y" : pa.int64()
}

output_type = pa.int64()

def to_np(val):
    if isinstance(val, pa.Scalar):
        return val.as_py()
    else:
        return np.array(val)

def gcd_numpy(ctx, x, y):
    np_x = to_np(x)
    np_y = to_np(y)
    return pa.array(np.gcd(np_x, np_y))

pc.register_scalar_function(gcd_numpy,
                            function_name,
                            function_docs,
                            input_types,
                            output_type)

print('gcd(27, 63) should be 9')
print(f'Answer={pc.call_function(function_name, [pa.scalar(27), pa.scalar(63)])}')
print()
print('gcd([27, 18], [54, 63]) should be [27, 9]')
print(f'Answer={pc.call_function(function_name, [pa.array([27, 18]), pa.array([54, 63])])}')
print()
print('gcd(27, [54, 18]) should be [27, 9]')
print(f'Answer={pc.call_function(function_name, [pa.scalar(27), pa.array([54, 18])])}')

Notice the use of the helper function to_np to convert from inputs of different shapes to ensure that we get something that numpy can work with.

I see your point. I will update the example to use this.

@vibhatha

Copy link
Copy Markdown
Contributor Author

@jorisvandenbossche WDYT about this: #13687 (comment)

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think this is good enough now :-)
Thank you @vibhatha !

@pitrou pitrou merged commit 902781d into apache:master Sep 29, 2022
@vibhatha

Copy link
Copy Markdown
Contributor Author

Thanks everyone for the reviews 🙂

@ursabot

ursabot commented Sep 29, 2022

Copy link
Copy Markdown

Benchmark runs are scheduled for baseline = 60c9383 and contender = 902781d. 902781d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.82% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.07%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 902781d1 ec2-t3-xlarge-us-east-2
[Finished] 902781d1 test-mac-arm
[Failed] 902781d1 ursa-i9-9960x
[Finished] 902781d1 ursa-thinkcentre-m75q
[Finished] 60c93833 ec2-t3-xlarge-us-east-2
[Failed] 60c93833 test-mac-arm
[Failed] 60c93833 ursa-i9-9960x
[Finished] 60c93833 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants