Christopher Webster

Ranch Hand
since Sep 13, 2020

Recent posts by Christopher Webster

It's a list comprehension, which is a nice Pythonic (and functional) way of processing the items in a list to produce another list, just as Ron says.  Using a print() call inside the comprehension is a bit of a hack, as usually you would use a comprehension to create a new list.  List comprehensions are very useful, as you can do a lot with them in very little code, e.g. filter items, transform items etc, and they do not modify the original list.

If you run the original comprehension in your Python shell, you get results like this:

The command has printed each element from the original list, and also returned a list containing three None items - the values returned by print() each time.  Because print() returns None, the comprehension returns a list of Nones.
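A minimal sketch reproducing this behaviour, followed by a more typical comprehension that filters and transforms without touching the original list:

```python
numbers = [1, 2, 3]

# Hack: print() inside a comprehension prints each item as a side effect,
# but print() returns None, so the resulting list is all Nones.
result = [print(n) for n in numbers]
print(result)  # [None, None, None]

# More typical: build a new list by filtering and transforming;
# the original list is left unchanged.
squares_of_evens = [n * n for n in numbers if n % 2 == 0]
print(squares_of_evens)  # [4]
```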

2 days ago
Think about your data:

  • It is JSON so it has a structure (fields), but the structure may vary between different records.
  • In your example, it looks like there is a common structure, with optional fields in the attributes list, so you could define a schema to describe this.
  • You want to check specific fields, so you are treating it as structured data.
  • If you simply wanted to store the JSON as a CLOB, then for your purposes it would be unstructured data.

  • I really wouldn't get hung up on these abstract "is it A or B?" questions.  Think about what you want to do, then proceed on that basis.
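As a rough illustration (the record layout and field names here are invented, not taken from your data), you could parse each record and handle the optional attributes explicitly:

```python
import json

# Hypothetical records: a common structure, with optional entries
# inside the "attributes" object.
records = [
    '{"id": 1, "attributes": {"colour": "red", "size": "L"}}',
    '{"id": 2, "attributes": {"colour": "blue"}}',
]

for raw in records:
    record = json.loads(raw)
    attrs = record.get("attributes", {})
    # Optional field: fall back to a default when it is absent.
    size = attrs.get("size", "unknown")
    print(record["id"], size)
```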
    2 days ago
AFAIK, Spark is still implemented in Scala, so the Scala APIs are usually delivered first and are most complete.  Spark SQL, DataFrames, Datasets etc have been in Spark for a couple of years now.  There is no reason to switch between languages for different Spark libraries, if the library you need is available in the language you are using.

    2 days ago
    Spark is a distributed processing engine.  If you run it on Hadoop you can tell it to use the HDFS cluster as its data store.  Or you can tell it to read/write data from other sources if these are available.

    If you run Spark stand-alone, it can read from your local file system, or another source if one is available.

    It will store temporary data where you tell it to e.g. on HDFS if you are on Hadoop, or on your local file system if you are running Spark on your laptop.
    1 week ago
    There is no magic here.  Hive jobs run on Hadoop and are executed in the same way as other Hadoop jobs.

  • When you run a process on Hadoop, where is the process executed?  (clue: it's a distributed processing platform)
  • What does Hive do?  (clue: translates SQL queries into Hadoop jobs)
  • Where do the jobs execute?  (see above)

  • Meanwhile, ETL is Extract (read), Transform (process), Load (write), so figure out where each of these operations would happen.  

    If you're working with Hive tables, then presumably the data will be read from/written to your Hive directories.

    Your Hive processing will almost certainly involve several shuffles, and when you write the data to your target table, it will need to be moved again, so you have to assume there will be a lot of data moving around the cluster at certain stages of your process.  

    You can run an EXPLAIN for your Hive queries.

    Like I say, no magic, and no free lunches.
    1 week ago
    Here are a few questions you could ask:

  • What kind of data is it e.g. binary, text, what encoding, etc?
  • Where does it come from e.g. files, streams, etc?
  • Does it have a known structure e.g. a schema?
  • If it has a schema, is that schema consistent, or does each record have a different structure?
  • What kind of structured data do you want to produce - CSV, SQL, JSON, Avro, XML etc?

  • And so on...
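A toy sketch of why those questions matter: the answers drive how you even attempt to parse the data.  This heuristic is purely illustrative, not something you'd use in production:

```python
import csv
import io
import json

def describe(raw: str) -> str:
    """Guess a rough classification for a chunk of text data (toy heuristic)."""
    try:
        json.loads(raw)
        return "json"
    except ValueError:
        pass
    # Fall back: see if it splits into CSV rows of equal, multi-column width.
    rows = list(csv.reader(io.StringIO(raw)))
    if rows and len({len(r) for r in rows}) == 1 and len(rows[0]) > 1:
        return "csv"
    return "unstructured"

print(describe('{"a": 1}'))         # json
print(describe("a,b\n1,2\n3,4"))    # csv
print(describe("just some text"))   # unstructured
```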
    1 week ago
    The difference between RDDs and DataFrames has nothing to do with the volume of data, but with what kind of data it is, and what you want to do with it.  You can use RDDs or Spark DataFrames to process a single record from a single file, or massive data-sources containing gigabytes of data.  You can run Spark on your laptop, in a cloud environment (Azure, AWS, Google Cloud Platform etc), or on a huge on-premises Hadoop cluster, and so on. It depends on what you want to do.  

    If you are working with structured data that can be represented as a table, then you would probably use DataFrames and the Spark SQL APIs.

    You can read data into DataFrames from pretty much any source that Spark can read, and you can write data via DataFrames to any sink that Spark can write to.

    Spark data sources

    This includes JDBC sources i.e. SQL databases.
    1 week ago
Probably the best place to start is with DataFrames and Spark SQL, as there have been a lot of really useful changes here since Spark 1.6.

    If you are working with structured data e.g. tables  or CSV files etc, then you rarely need to use raw RDDs.  Instead you can read your data via DataFrames or Datasets, which provide a lot more functionality and much better support for structure and data-types.

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine.

    Datasets seem to replace RDDs, effectively, but you will probably want to use DataFrames for this kind of data anyway:

    A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

    DFs allow you to use SQL or a SQL-like programming API to interact with your data, wherever it comes from.  So you could have DFs that are using data from different sources - HDFS, Hive, Parquet files, CSV files, SQL DBs, NoSQL DBs etc - and process all of it using SQL within Spark e.g. perform SQL joins between MongoDB and Hive and write the results to Cassandra, or whatever.

    Spark SQL gives you a common distributed programming platform for a huge range of data sources, which is really powerful and flexible.

    And there is now an optional free package for Spark called Delta Lake which uses Parquet as the basic storage format but adds a commit log and other features to support ACID transactions, in-place updates etc.  I'm just starting to look at Delta Lake now, but I think it looks really promising as a way to combine many of the useful features of conventional RDBMS platforms with the power and flexibility of distributed storage and processing.

    Going beyond DataFrames, Spark has also added a lot more functionality around Machine Learning and Streaming (Structured Streaming is like DataFrames for streams), so there are plenty of interesting topics for you to look at.  

    Finally, the basic execution model is still the same i.e. actions trigger data to be processed via transformations.  But the Spark SQL APIs mean you are less concerned with how the engine works and can focus instead on what you want it to do i.e. a declarative approach.  Of course, it's still useful to understand what's going on underneath the hood, so you can figure out when to perform particular operations that might trigger aggregation of data etc.
    1 week ago
    Essentially you are correct: your choice of NoSQL database should be influenced by your requirements for writing and querying your data.  Do you want really fast writes, really fast queries, or really flexible queries? You cannot always achieve all of these with a single approach.

    I haven't really worked with HBase, but I have done a bit of work with Cassandra, which is also based on the column-family model. Many of the basic principles are similar.  

    Cassandra is designed for very fast writes, because the physical location for the data is determined by its key (as in HBase), so the database can write the data to the correct node very quickly. There are lots of other internal optimisations to make writes as fast as possible (compaction etc).

    In Cassandra, the "primary key" consists of the partition key, which defines the physical location of the partition, and the clustering key, which defines how the data are ordered within the partition.

    Cassandra also has a SQL-like query language (CQL), which is very powerful, provided you understand the limitations of the data model.

    Query performance is based on how you use the key in your query.  

You have to provide the partition key in order to query data on Cassandra, as this tells Cassandra which node to look at for your data, so it's important to design your data model to reflect how you expect to query your data.  Sometimes, the easiest approach is to store your data in multiple tables with different keys, so you can query the data easily by different keys later on.  You can query by non-key fields in Cassandra, as long as you also supply the partition key: Cassandra will locate the data via the partition key, then filter the results based on the additional query criteria.

    You can also combine Cassandra as your fast data-store with ElasticSearch as a query indexing engine in Elassandra which tries to give you the best of both worlds.

    You can find out more about Cassandra data-modelling here if you're interested.

    Query performance on other databases depends on the basic data model, possible indexing mechanisms, and so on. For example, MongoDB offers indexes to help improve query performance, similarly to RDBMS indexes. But having lots of indexes means the indexes have to be updated when you write to a collection (MongoDB) or table (SQL), so there is always a trade off.
    2 weeks ago
MongoDB stores data as "documents" where a "document" is basically a JSON document (strictly speaking, it's BSON, which is Binary JSON and offers stricter typing and parsing features).  BSON also allows you to store binary objects, a bit like you can store BLOBs in a SQL database.  But mostly, you can think of your MongoDB documents as JSON, because that's mostly how you interact with them.  A MongoDB "document" has nothing to do with e.g. a PDF or Word document, unless you want to store the PDF as a binary object, I guess.

    MongoDB queries are based on a JSON query language, where you specify your query as a JSON document. This query language is pretty powerful and flexible, and there are also MongoDB query APIs for most programming languages.  The MongoDB Aggregation query framework is a more functional approach to query definition, where you define a query as a pipeline of operations, a bit like in some Big Data platforms such as Apache Spark.

    By default, MongoDB is schema-less: you store your documents in "collections", which are like RDBMS tables, but these do not enforce a schema.  This means you are responsible for ensuring that the documents (records) in a collection are consistent with each other.  So you could say that MongoDB stores semi-structured data i.e. JSON is structured with fields and values, but it isn't necessarily the same structure for every document (record) in your collection (table).

(Personally, I think this is one reason why MongoDB was so popular with web developers in its early days: web devs often don't like working with or thinking about RDBMS data models, so the NoSQL approach meant they didn't have to care about their data model, as they could just throw everything into the same MongoDB collection.  Kind of like that one drawer in the kitchen where all the random stuff ends up.  But then one day, you need to find specific items in your big collection, and then life becomes ... interesting.)

    Luckily, you can tell your query to look for documents where a given field is null, or where the field does not exist, so there are ways to deal with inconsistent JSON structures in your collection.
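To illustrate the null-versus-missing distinction in plain Python over dicts (these are not real MongoDB queries, just a sketch of the semantics, where matching a field against null also matches documents that lack the field, while an existence check does not):

```python
# Toy documents standing in for a MongoDB collection.
docs = [
    {"name": "a", "email": "a@example.com"},
    {"name": "b", "email": None},   # field present but null
    {"name": "c"},                  # field missing entirely
]

# Field is null OR missing - roughly what matching the field
# against null does in MongoDB.
null_or_missing = [d["name"] for d in docs if d.get("email") is None]

# Field missing entirely - roughly an "exists: false" style check.
missing_only = [d["name"] for d in docs if "email" not in d]

print(null_or_missing)  # ['b', 'c']
print(missing_only)     # ['c']
```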

    Nowadays, MongoDB offers more support for data modelling, including schema validation.  Eventually, even NoSQL databases rediscover SQL...  

    Here is a simple mapping between SQL and MongoDB concepts.

    Anyway, MongoDB University offers lots of free online courses in MongoDB, and these are a great way to learn the key features of the database and query language.

    2 weeks ago
    I used to use Eclipse, mainly for Java, but switched to IntelliJ once I started working with Scala, because IntelliJ's support for Scala is much better.  These days I use IntelliJ for my occasional programming tasks in Scala, Kotlin and Python, and I have no intention of returning to Eclipse.  My main problem with IntelliJ is that I started using it with a UK English keyboard, but now I work on a multi-lingual Swiss keyboard, so most of the shortcuts don't work any more!  

    But if you are comfortable with your IDE, there is no real reason to change, as they are all much the same and you will be more effective working with a tool you know well, than with a new tool that does the same thing but in unfamiliar ways. Meanwhile, make sure you know how to build and run your applications from the command-line, as everybody recommends.
    2 weeks ago

Campbell Ritchie wrote:Isn't that how they teach programming at MIT? Learn the principles in Scheme, and you can apply them in any language.

    I think they switched to Python for the introductory programming classes a few years ago.  Some of their courses are available online at the MIT OpenCourseWare site e.g.
    2 weeks ago
If you really want to practice recursion (without all the Java cruft), try The Little Schemer - lots of simple bite-sized exercises to get you thinking in a recursive and functional style, using the Scheme language (a variety of Lisp).  Because Scheme syntax is pretty minimal, it helps to reveal the underlying logical patterns in these exercises.  You'll probably never program in Scheme for real, but the principles are the same in any language.
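The Little Schemer's recursive pattern (handle the empty case, do something with the head, recurse on the rest) carries over directly to Python.  A couple of toy examples in that style (these re-implement built-ins purely as exercises):

```python
def my_len(lst):
    """Length of a list, Little Schemer style: empty case, then recurse on the rest."""
    if not lst:
        return 0
    return 1 + my_len(lst[1:])

def my_filter(pred, lst):
    """Keep the items satisfying pred, building the result recursively."""
    if not lst:
        return []
    head, rest = lst[0], lst[1:]
    if pred(head):
        return [head] + my_filter(pred, rest)
    return my_filter(pred, rest)

print(my_len([10, 20, 30]))                           # 3
print(my_filter(lambda n: n % 2 == 0, [1, 2, 3, 4]))  # [2, 4]
```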

    And if you really want to dig deep, try Structure and Interpretation of Computer Programs (SICP) which is available free online.
    3 weeks ago
    This is a classic question, but I often find it helps to think about practical examples.

  • A framework would be something like Java Enterprise Edition, or Spring, where pretty much all your code is tightly coupled to the framework, because you are coding to the framework's requirements and it calls your code (as Campbell points out).  You cannot easily swap one framework for another (e.g. from JEE to Spring) without a lot of re-writing.  And often you can't mix two different frameworks together without a world of pain.
  • A library is something that you call from your code e.g. a Maths or DateTime library or a JSON parser etc, where it's easy to swap one for another, and all you have to change is the specific call to the library function because the library does not affect the rest of your code.  You can also mix and match different libraries for different things.

  • I think JEE and Spring have both evolved over the years to allow greater flexibility in how you use their frameworks (indeed, Spring was invented as a deliberate effort to avoid the complexity and tight coupling of J2EE), but you still end up with a lot of coupling between your application and the framework, whichever one you use.

    Meanwhile, web development seems to be full of frameworks too, which seems like a nice money-making exercise for some folks...
    1 month ago
    Yeah, if I were interested in working with Java I'd probably have to actually go and RTFM on Streams and Lambdas and stuff!    

    I like the two sets solution - it's nice and clear what's happening and it does the job with a minimum of boilerplate.  
    1 month ago