Profiling python
-
I have an in-house python data analysis package that uses pandas to crunch some numbers. It's currently taking almost 30 seconds. I would love to get it to under a second. What do you guys recommend I use for profiling? We can throw more hardware at it, set up a cluster, etc if we need to, but currently, most of the code is basically single threaded.
-
@dangeRuss I'm not sure about profiling tools. I've done these things manually in the past.
Don't mean to come off tech-support-ish, but have you used numpy everywhere you possibly could? No unwarranted list comprehensions? No list comprehension where a generator would do the job more efficiently?
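To illustrate the generator point with a hypothetical sketch (the numbers and operation here are made up, not from the original code): summing a million squares both ways.

```python
# A list comprehension materializes the entire list in memory first:
total = sum([x * x for x in range(1_000_000)])

# A generator expression feeds sum() one value at a time,
# so the intermediate list is never built:
total = sum(x * x for x in range(1_000_000))
```

Same result either way, but the generator version keeps memory flat; for reduction-style consumers like sum(), min(), or any(), the brackets are pure overhead.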
-
I wonder if VS2017's profiling tools work on Python?
-
@mott555 I bet he can find a solution faster than he can get VS to play well with python data analysis code.
-
@stillwater said in Profiling python:
@dangeRuss I'm not sure about profiling tools. I've done these things manually in the past.
Don't mean to come off tech-support-ish, but have you used numpy everywhere you possibly could? No unwarranted list comprehensions? No list comprehension where a generator would do the job more efficiently?
pandas uses numpy internally. TBH I'm having trouble wrapping my head around pandas stuff. Does anyone know of a good tutorial?
-
@dangeRuss The pandas docs. You'll thank me later. Pandas in ten mins and then working through the rest of the docs will make you better than 90% of people who fuck around with pandas for a living.
-
@dangeRuss said in Profiling python:
It's currently taking almost 30 seconds. I would love to get it to under a second.
Have you identified what is taking the time yet?
-
- If you already have an IDE installed anywhere (Visual Studio, Eclipse, NetBeans), it most likely has a Python module that probably supports profiling.
- Python has a built-in profiler called cProfile. Just run
  python -m cProfile script.py
  and it should give you an overview of function times (see the cProfile documentation).
- This isn't really a solution, but if you want efficiency, consider using other implementations like PyPy. CPython is pretty crap.
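To expand on the cProfile suggestion: the profiler can also be driven from inside a script and sorted with pstats, which is handy when you only want to profile the hot section. The crunch() function below is a stand-in, not the poster's actual code.

```python
import cProfile
import pstats

def crunch():
    # stand-in for the real analysis workload
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
crunch()
profiler.disable()

# Sort by cumulative time and show the ten most expensive calls
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```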
-
@dangeRuss said in Profiling python:
It's currently taking almost 30 seconds. I would love to get it to under a second.
Have you looked at algorithmic complexity yet? If you're doing something O(N²) that can be done in O(N log N), that's a huge win.
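As a generic illustration of the kind of asymptotic win that's meant (hypothetical code, not the poster's): a duplicate check done with a list scan is O(N²), while the same check against a set is roughly O(N).

```python
# O(N^2): checking each item against a list
def has_dupes_quadratic(items):
    seen = []
    for x in items:
        if x in seen:        # linear scan inside a loop
            return True
        seen.append(x)
    return False

# ~O(N): the same check against a set (hash lookup)
def has_dupes_linear(items):
    seen = set()
    for x in items:
        if x in seen:        # O(1) average lookup
            return True
        seen.add(x)
    return False
```

Both return the same answers; only the data structure changed, and with it the complexity class.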
-
@dangeRuss I dunno what panda is, but can you compile it with ironpython? That would probably get you visual studio profilers and .net performance.
-
@dangeRuss Where was the bottleneck? Did you find out?
-
@sockpuppet7 said in Profiling python:
I dunno what panda is, but can you compile it with ironpython?
It's a data analysis library, and it depends strongly on the numerics library numpy, which exists for IronPython but isn't trivially available. The complication is that a significant amount of numpy itself is written in languages other than Python (quite a bit of C and maybe some Fortran too), so porting it to a managed environment isn't trivial.
Other than that, a minimal installation looks like it shouldn't be too hard. Getting all the optional bits might be trickier. (That's quite a long list of dependencies at the bottom of the installation page…)
-
@dkf Also, it amuses me slightly that there is a Python package called bottleneck…
-
@dkf said in Profiling python:
porting that to a managed environment isn't trivial.
With .NET? Not really, P/Invoke still works in IronPython.
-
@dkf said in Profiling python:
@dangeRuss said in Profiling python:
It's currently taking almost 30 seconds. I would love to get it to under a second.
Have you identified what is taking the time yet?
I'm working on a bunch of competing priorities right now. I will try to hook it up to some kind of profiler, and then I really need to start understanding pandas code because this code is 90% pandas.
-
@dangeRuss said in Profiling python:
I really need to start understanding pandas code because this code is 90% pandas.
Stay away from writing code that looks clever using pandas. It's very tempting but will bite you in the ass cheeks down the line. Other than that pandas is not gonna be that much of a pain.
-
@dangeRuss said in Profiling python:
I'm working on a bunch of competing priorities right now. I will try to hook it up to some kind of profiler, and then I really need to start understanding pandas code because this code is 90% pandas.
I'm guessing that something you're doing is unnecessarily recomputing values, or is computing values immediately instead of postponing the computation to the point where you need it. It's pretty easy to create real problems for yourself that way with Python; list comprehensions versus arrays versus slices… that can really hurt. I tend to let my colleagues do most of the worrying about that (our code uses a lot of numpy, which is what pandas is built on) but the principle is simple enough: any calculation should be done at most once and should be hoisted into the numpy core if at all possible as that's orders of magnitude faster than ordinary Python. Almost everything is faster than ordinary Python (except perhaps Ruby) but numpy helps a lot…
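A minimal sketch of the "hoist into the numpy core" point, with made-up data: the same arithmetic done element-by-element in interpreted Python versus in one vectorized numpy call.

```python
import numpy as np

data = np.arange(1_000_000, dtype=float)

# Pure-Python loop: the interpreter handles every element individually
result_loop = sum(x * 2.0 for x in data)

# The same calculation hoisted into the numpy core, executed in C
result_np = (data * 2.0).sum()
```

On a million elements the vectorized version typically runs orders of magnitude faster, because the loop happens in compiled code rather than in the bytecode interpreter.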
-
@dkf Looks like there's a loop that iterates over every row and calculates values. The reason you can't calculate the whole dataframe at once is that the current row depends on the values of the previous row.
-
@dangeRuss Can anything that's not actually dependent on iterations be hoisted? Any I/O in the loop?
-
@dangeRuss https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html
Something like df['myColumn'] = df.mydata - df.mydata.shift(1)
Can you use shift somewhere like this in your calculation? I'm not sure how one would go about using custom functions with shift though. But even if the loops are optimized they're something you might have to clean up down the line.
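A toy version of that shift idea, with an invented mydata column (the real frame will obviously differ):

```python
import pandas as pd

df = pd.DataFrame({"mydata": [10, 13, 9, 15]})

# Row-over-row difference computed in one vectorized step;
# the first row has no predecessor, so it comes out NaN.
df["delta"] = df["mydata"] - df["mydata"].shift(1)
```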
-
@Gribnit said in Profiling python:
@dangeRuss Can anything that's not actually dependent on iterations be hoisted? Any I/O in the loop?
I'm sure there are other things that can be sped up, there's some data reading stuff that I'm going to start caching, but by far the bulk of the slowdown comes from this loop right now (about 50%).
-
@stillwater said in Profiling python:
@dangeRuss https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html
Something like df['myColumn'] = df.mydata - df.mydata.shift(1)
Can you use shift somewhere like this in your calculation? I'm not sure how one would go about using custom functions with shift though. But even if the loops are optimized they're something you might have to clean up down the line.
I will have to look into that. Here is roughly what the code looks like (anonymized).
for i in range(len(df)):
    condition = (df.period == i+1)
    df.loc[condition, 'A'] = df[condition]['F'] * df[condition]['E']
    df.loc[condition, 'B'] = df[condition]['D'] - df[condition]['A']
    df.loc[condition, 'C'] = df[condition]['E'] - df[condition]['B']
    if i < len(df):
        df.loc[df.period == i+2, 'E'] = list(df[df.period == i+1]['C'])
So really it's just the last line that updates E for the next period so that it could be used in the next iteration.
I asked my coworker for help (he doesn't know pandas very well, but he's familiar with optimizing for numpy) and he thinks he's got it. I guess we'll see what he comes up with on Monday.
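Since the dependency between periods is genuinely sequential, one hedged option (assuming one row per period, which the anonymized snippet doesn't guarantee, and with all values below invented) is to keep the loop but run it over plain numpy arrays: the repeated boolean-mask .loc indexing is usually what dominates, not the arithmetic itself.

```python
import numpy as np

# Hypothetical per-period inputs, one row per period
D = np.array([5.0, 4.0, 6.0, 3.0])
F = np.array([0.1, 0.2, 0.1, 0.3])
A = np.empty_like(D)
B = np.empty_like(D)
C = np.empty_like(D)
E = np.empty_like(D)

E[0] = 100.0                      # initial value for the first period
for i in range(len(D)):
    A[i] = F[i] * E[i]
    B[i] = D[i] - A[i]
    C[i] = E[i] - B[i]
    if i + 1 < len(D):
        E[i + 1] = C[i]           # next period's E is this period's C
```

The recurrence itself is unchanged; only the per-iteration pandas overhead goes away. The results can be written back into the dataframe in one assignment per column afterwards.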
-
@dangeRuss I just saw a post about a Python profiler on Hacker News:
https://news.ycombinator.com/item?id=17927200