Tiago Antao

since Oct 14, 2020
Salem, MA, US

Recent posts by Tiago Antao

That is a complex subject with Python. TLDR:

- You can write multi-threaded code in Python, but in the flagship implementation only one thread at a time can run Python code (so threads do not use multiple cores). Search for "Python GIL". This varies between Python implementations: for example, you can do parallel multi-threading with Jython.
- You can easily do parallel multi-processing with Python.
- Lower-level code linked to Python can release the GIL and be multi-threaded and parallel. For example, NumPy can be parallel and multi-threaded.
- Cython can be multi-threaded as long as you can release the GIL.

It's a fairly complicated topic due to the Python GIL. But in the end the solutions turn out to be quite straightforward.
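As a rough illustration of the multi-processing point above, here is a minimal sketch using only the standard library. The function names (`sum_of_squares`, `split_range`, `sum_of_squares_parallel`) and the chunking scheme are made up for this example; it is one possible approach, not the only one.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(start, stop):
    """CPU-bound work on one chunk: sum of i*i for start <= i < stop."""
    return sum(i * i for i in range(start, stop))


def split_range(total, parts):
    """Split range(total) into `parts` contiguous (start, stop) chunks."""
    step, extra = divmod(total, parts)
    chunks, start = [], 0
    for index in range(parts):
        stop = start + step + (1 if index < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def sum_of_squares_parallel(total, workers=4):
    """Run one coarse chunk per worker process; each process has its own GIL."""
    chunks = split_range(total, workers)
    starts = [start for start, _ in chunks]
    stops = [stop for _, stop in chunks]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, starts, stops))


# Example use. Keep the call under the __main__ guard so that worker
# processes can import this module without re-running it:
#
# if __name__ == "__main__":
#     print(sum_of_squares_parallel(10_000_000))
```

Each worker gets one large chunk and only sends back a single number, which keeps inter-process communication to a minimum.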

1 year ago
Depends on what you want to do. I would suggest having a look at dask, for example - though this has a data analytics flavor.

In general it is quite easy to use existing parallel platforms with Python: Spark, the cloud (e.g. AWS auto-scaling groups with SQS queues), academic clusters (e.g. SLURM).
There are also libraries for things like MPI.
1 year ago
Hi Gary,

It depends a lot on the details, but I am pretty sure you will be able to do this in the Python ecosystem, even if not in Python proper:

- Python might be enough
- If it is not enough, before you go ahead and implement stuff yourself, be sure that the functionality you need is not implemented in existing Python libraries. Or at least can be implemented on top of those libraries.
- You can easily link lower-level languages to Python to take care of the expensive inner loop. In the book I use Cython (because it's the easiest if you already know Python). But you can use other languages as well: C, C++, Rust, Fortran...

I would not be afraid of diving into Python for large datasets. But you will probably use something more than just pure Python.
1 year ago

There are plenty of other applications for the book. Games for sure (given that they are performance sensitive). For API construction, it is less obvious to me that the book would be useful.

1 year ago

The book doesn't cover framework-as-an-application.

Best practices for efficient data analytics are a fairly wide topic. I do have a few tips based on mistakes that I have seen in the past.

- For cloud users: think before using the mantra "just add more servers". Cloud cost compounds in many ways. If you have a performance problem, start by auditing your architecture and code.
- Code profiling is your friend: people are normally wrong in their intuitions about where code is inefficient. Use metrics, not gut feelings (I say this as a person who normally tends to trust intuition). Profile, measure.
- Again for the cloud: think twice before using a proprietary technology for performance gains. It locks you in and it tends to be expensive, even if apparently not so. Also, many times it is less performant than the marketing department of your cloud vendor tells you it is.
- If possible, try to make your solutions platform agnostic. For example, it tends to be moderately easy to abstract your parallelization platform.
- When using parallel processing, try to be as coarse grained as possible, and if you can avoid interprocess communication, all the better. Most problems that I have found actually fit this bill.
- Understand the computational cost complexity of basic data structures. I have lost count of the times where replacing a Python list with a Python set reduced the cost of an algorithm by 10x.
- Understand the computational cost complexity of storage mechanisms. The obvious one is SQL database indexing. But also things like file system access cost or S3 access cost.
- If possible, use de facto standard tools: there will be a community to answer your problems, and chances are that someone has suffered your pain in the past and is kind enough to help you.
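Two of the tips above (measure instead of guessing, and know the cost of your data structures) can be sketched with the standard library alone. The function names and data sizes below are made up for illustration, and the timings will vary by machine.

```python
import cProfile
import io
import pstats
import timeit


def count_hits_list(needles, haystack):
    """Membership test against a list: every lookup scans the list, O(n)."""
    return sum(1 for needle in needles if needle in haystack)


def count_hits_set(needles, haystack):
    """Same logic against a set: every lookup is a hash probe, O(1) on average."""
    haystack_set = set(haystack)
    return sum(1 for needle in needles if needle in haystack_set)


haystack = list(range(2_000))
needles = list(range(0, 4_000, 2))  # even numbers; half of them are in haystack

# Measure instead of guessing.
list_seconds = timeit.timeit(lambda: count_hits_list(needles, haystack), number=3)
set_seconds = timeit.timeit(lambda: count_hits_set(needles, haystack), number=3)
print(f"list: {list_seconds:.4f}s  set: {set_seconds:.4f}s")

# Profile the slow version to see where the time actually goes.
profiler = cProfile.Profile()
list_hits = profiler.runcall(count_hits_list, needles, haystack)
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The only change between the two functions is the container used for lookups; the gap grows as the list gets longer.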
1 year ago
No. If you are referring to this https://specs.frictionlessdata.io/data-package/ , I do not think it is commonly used in Python.
1 year ago
Let me start with the distribution question: as long as you produce a wheel you should be fine (see https://www.python.org/dev/peps/pep-0427/ ). But source-only distributions will be problematic, as the user will need to have a compiler on their machine.

Is Cython platform-independent? Good question and honestly, difficult to answer. In most cases saying yes is reasonable: if you just write Cython code without dependencies you will be mostly OK. But remember that Cython can be a link to the C world and if you depend on C libraries then you might have a problem. I know that the same can be said about Python, but linking Cython to C is more common than linking Python with C libraries.

If the "only" thing you do is add type annotations and Cython decorators, I think it is fair to say it is platform independent-ish.

As an aside, in the data analysis world the relative importance of Windows is reduced. Mac and, above all, Linux are substantially more used. I commonly see Mac for development and Linux for deployment (in my personal case, Linux for development as well).
1 year ago
The book is based on examples/use cases. But it doesn't cover data analytics concepts per se. The content is about how to implement fast Python for data analytics.

So, it discusses how to optimize Python libraries used for data analytics (NumPy or Pandas, for example) and storage related frameworks (Apache Parquet, for example).

It does have a chapter that is very data oriented, in the sense that it discusses optimization from the point of view of processing only part of the data (using the data's statistical properties to decide how to subset it without losing precision). The point being that if we are able to reduce the amount of data that we need to process, then computation becomes more efficient. But save for that case, it discusses the optimization of the Python ecosystem and not data analytics concepts.
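A minimal sketch of that subsetting idea, not taken from the book: estimate a mean from a random sample and use the standard error to judge how much precision the sample gives up. The synthetic data, the 1% sample size, and the variable names are all assumptions made for this example.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic "full dataset": 200,000 measurements (made up for this sketch).
full_data = [rng.gauss(100.0, 15.0) for _ in range(200_000)]

# Process only a 1% random sample instead of everything.
sample = rng.sample(full_data, 2_000)
sample_mean = statistics.fmean(sample)

# The standard error estimates how far the sample mean may be from the true mean.
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Computed here only to check the estimate; in practice this is the step to avoid.
full_mean = statistics.fmean(full_data)
print(f"sample mean: {sample_mean:.2f} +/- {standard_error:.2f}")
print(f"full mean:   {full_mean:.2f}")
```

If the standard error is small enough for the question at hand, the remaining 99% of the data may not need to be processed at all.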
1 year ago
What people call Python is normally two things:

- Python the language
- CPython - the most common implementation of the language. Do not confuse CPython - the standard Python interpreter - with Cython. There are some alternatives to CPython: for example, there is a Python interpreter for the JVM called Jython.

Python, being a dynamic language with lots of introspection features, tends to be slow (I do believe that there is such a thing as slow languages - especially if they have goodies like dynamic typing or garbage collection).

CPython as an implementation happens to be horrendously slow, further compounding the problem.

Python (from now on I will be assuming CPython when I say Python) can be made faster (*) by using libraries implemented in other languages. For instance, NumPy - Python's workhorse in data analytics - is mostly implemented in C and depends on very efficient external algebra libraries implemented in C or Fortran.

(*) Of course CPython code can be made faster by using best practices when coding in native Python - but that only gets you so far.


- Python is slooow.
- Its implementation is not really comparable to Java on the JVM. Python loses, badly.
- All this is circumvented by libraries implemented in lower level languages for the really computationally expensive stuff, so things end up OK.

Cython compiles Python (a super-set of Python - you really need to add type annotations and other stuff) to very efficient C. It's reasonably fair to compare it to C. It's quite easy to link Python with Cython and Cython with C libraries. Cython code can be really fast.

Cython also allows for parallel multi-threaded code. Did I mention that Python cannot have parallel threads because of the Global Interpreter Lock (GIL)?
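The effect of pushing the expensive loop into lower-level code can be seen even without third-party libraries. A minimal sketch, assuming CPython: the built-in `sum` runs its loop in C, while `python_loop_sum` (a name made up here) runs the same loop in the interpreter. Exact timings will vary by machine and Python version.

```python
import timeit


def python_loop_sum(values):
    """Add the values one by one in interpreted Python bytecode."""
    total = 0
    for value in values:
        total += value
    return total


values = list(range(1_000_000))

# Same result either way, but the built-in sum() iterates in C inside CPython.
loop_result = python_loop_sum(values)
builtin_result = sum(values)

loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_seconds = timeit.timeit(lambda: sum(values), number=5)
print(f"Python loop: {loop_seconds:.3f}s  built-in sum: {builtin_seconds:.3f}s")
```

Libraries such as NumPy apply the same principle on a much larger scale.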
1 year ago
In my previous book - in the field of Bioinformatics - I used Jupyter quite extensively. Most of the content of this book happens more at the library level - not at the analysis level - so I mostly assume the standard Python interpreter. I see environments like Jupyter as being great for exploratory analysis of data (data science). This book is more about the guts of processing. That being said, I do talk a bit about Jupyter (especially IPython magics that can be useful in many situations, e.g. for quick profiling or Cython development).

I discuss quite a bit of vanilla Python. Data structures and memory allocation. Multi-processing, ...
Then there is Cython, as you mention. Also a lot of stuff about NumPy (being a book targeted at data analytics). And Pandas. And Numba.
And then some more advanced topics on Python/NumPy optimizations for CPU caching, and GPU usage with Python. And also file systems and advanced storage formats for data analysis (Apache Parquet, for example).

A side comment about Jupyter: I tend to prefer notebooks formatted with Jupytext, as they allow you to use normal text editors: https://github.com/mwouts/jupytext
1 year ago
A general answer to that question is quite difficult. I can make a few general comments, but it would help if we could discuss some more concrete cases.

Some general comments:

Whatever your final platform is, you should be able to run at least part of your code independently of the platform. So, even if you end up on the cloud, you should be able to run part (if not all) of your code locally. Maybe not with a full production dataset, but with enough to do testing.

I would advise, as much as possible, to steer clear of very proprietary cloud technologies. Sure, you will have to use proprietary compute (e.g. EC2 on Amazon) and probably proprietary storage (e.g. S3 on Amazon), but think carefully before tying yourself to less common services. For example: do you really need DynamoDB (a large-scale key-value DB on AWS) or can you survive with just S3 or RDS? If you really require a key-value store, maybe use something non-proprietary?

If you end up with a solution that is as agnostic as possible with respect to the computing and storage platform, then you end up with more flexibility to change your decision in the future.

For example, if you need a lot of parallel processes, it's not very hard to abstract away the parallel infrastructure in a library that gives you flexibility in the future.
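One way to sketch such an abstraction, assuming the standard `concurrent.futures.Executor` interface is an acceptable common denominator: the processing code only sees "an executor", and the caller decides what sits behind it. `SerialExecutor`, `process_all`, and `word_count` are hypothetical names made up for this example.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor


class SerialExecutor(Executor):
    """Runs each task immediately in the calling thread; handy for local debugging."""

    def submit(self, fn, /, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


def process_all(task, items, executor):
    """Apply task to every item using whatever executor the caller provides."""
    return list(executor.map(task, items))


def word_count(line):
    """Toy task: count whitespace-separated words in a line."""
    return len(line.split())


lines = ["a b c", "d e", "f"]

# Same processing code, two different back ends.
with SerialExecutor() as executor:
    serial = process_all(word_count, lines, executor)
with ThreadPoolExecutor(max_workers=2) as executor:
    threaded = process_all(word_count, lines, executor)

print(serial, threaded)
```

Swapping in a process pool or a cluster-backed executor would then be a change at the call site, not in the processing code.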

So: try to make your software architecture, as much as possible, independent of any platform (making the question moot). Make sure your development is nimble, allowing you to work locally. If you really, really end up depending on a lot of proprietary technologies from a vendor, consider something like localstack https://github.com/localstack/localstack

Again: it's not easy to come up with very general guidelines. But if you have a more concrete example, I can comment a bit more in depth.
1 year ago
Hi all,

It's a pleasure to be here. Glad to answer any questions.
1 year ago
Hi all,

I am the author of Manning's forthcoming book "High Performance Python for Data Analytics" which will be discussed here in December.

If anyone is interested in the topic of High Performance Python - book or not - feel free to get in touch with me.

1 year ago