|
|
12:46 |
|
|
2:04 |
Hello and welcome to MongoDB for Python developers.
Throughout this course, we're going to learn how to connect to, model data for, and build applications with MongoDB, and we're going to do this with Python.
We're going to look at the straightforward, lowest-level way of doing things with PyMongo, and we're going to look at mapping classes to MongoDB through what's called an ODM (think ORM for Mongo) with MongoEngine.
And these come together to make a great combination.
So let's begin by talking about document databases, how do they work?
Well, document databases are in some ways very much like standard relational databases. They have what you would think of as columns in the relational world: title, course id and duration in seconds here, for example. But they also have nested data. In a relational database, we might have a lectures table with some kind of foreign key constraint back to a chapter, in this example we have on the screen here. In a document database, we can instead embed those lectures inside of the chapter object. Why is this good?
Well, often we spend so much time and energy building up an object hierarchy in our application, then tearing it apart into a bunch of little pieces in what's called third normal form (basically normalizing our data in a relational database), and then building it back up and taking it back apart again. This object-relational impedance mismatch makes it hard for us to reason about our code, makes it a little less intuitive, and so on.
So with document databases, we can model our data the same way our application wants to use that data.
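To make that contrast concrete, here is a minimal sketch in plain Python, with no database involved. The field values are made up for illustration; only the field names (title, course id, duration in seconds, lectures) come from the slide. It shows the reassembly step a normalized layout needs and the embedded shape that avoids it.

```python
# Hypothetical data, loosely following the chapter/lectures example on the slide.

# Relational style: two flat "tables" linked by a foreign key (chapter_id).
chapter_rows = [
    {"id": 1001, "title": "Welcome", "course_id": "mongodb-for-python"},
]
lecture_rows = [
    {"id": 10101, "chapter_id": 1001, "title": "Introduction", "duration_in_sec": 124},
    {"id": 10102, "chapter_id": 1001, "title": "Course outline", "duration_in_sec": 307},
]


def assemble_chapter(chapter_id):
    """Rebuild the object the application wants from the normalized rows."""
    chapter = next(row for row in chapter_rows if row["id"] == chapter_id)
    lectures = [
        {key: value for key, value in row.items() if key != "chapter_id"}
        for row in lecture_rows
        if row["chapter_id"] == chapter_id
    ]
    return {**chapter, "lectures": lectures}


# Document style: the lectures are embedded, so the stored shape is already
# the shape the application uses.
chapter_document = {
    "id": 1001,
    "title": "Welcome",
    "course_id": "mongodb-for-python",
    "lectures": [
        {"id": 10101, "title": "Introduction", "duration_in_sec": 124},
        {"id": 10102, "title": "Course outline", "duration_in_sec": 307},
    ],
}

print(assemble_chapter(1001) == chapter_document)
```

The point of the sketch is only that `assemble_chapter` is work the application has to do on every read in the normalized layout, while the embedded document needs no such step.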
We also have more flexibility in the schema, which means that deploying a new version of our application often does not require some kind of migration script and downtime, no, we just deploy the new version and the document database adapts to it.
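As a rough illustration of what schema flexibility means for application code, here is a small sketch using plain dictionaries as stand-ins for stored documents. The `difficulty` field and the `describe` helper are hypothetical, invented for this example. The assumption is that older documents simply lack the new field and the newer code supplies a default when reading them.

```python
# Two documents in the same hypothetical collection. The second was written by
# a newer version of the app that added a "difficulty" field.
old_chapter = {"id": 1001, "title": "Welcome"}
new_chapter = {"id": 1002, "title": "Setup", "difficulty": "beginner"}


def describe(chapter):
    """Read a chapter, tolerating documents saved before 'difficulty' existed."""
    difficulty = chapter.get("difficulty", "unrated")  # default for older documents
    return f"{chapter['title']} ({difficulty})"


summaries = [describe(chapter) for chapter in (old_chapter, new_chapter)]
print(summaries)
```

No migration script touched the older document here; the trade-off is that the reading code has to cope with both shapes.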
So document databases, in my opinion, are the best way to build applications for the 80 percent case, right? Maybe you have some edge case where that doesn't make a lot of sense, but most apps really benefit from using document databases.
|
|
|
1:34 |
Which one happens to be the most popular, the most widely used?
Well, you can guess, it's probably MongoDB, given this course, right.
However, you probably didn't guess how much more popular MongoDB is relative to its other NoSQL friends.
So we've got CouchDB, this orange one, which is way down there and not even really trending; we've got RavenDB, which is basically not used; and Cassandra, which kind of peaked around 2016 and is heading down.
MongoDB is just much, much more popular than these. If you want to experiment or play around with this data yourself, you can check out the link at the bottom.
So MongoDB is really, really popular, it's by far the most popular, widely used document database.
It's also really loved. One of my favorite places to get insight into the developer community is Stack Overflow's yearly survey. On the 2017 survey, you can see MongoDB appears very high up the ranking for most loved technology, and if you look at the little description at the bottom, this represents the number of developers who are currently using the technology and really enjoy working with it.
So MongoDB is right up there, among those at the top.
Now, what about wanted?
In fact, MongoDB just dominates the wanted category. This is people who are not currently using MongoDB, or whatever technology is listed here, but wish they were, right.
So MongoDB is definitely highly, highly desired for the people who are not using it, and very much loved by the people who are.
|
|
|
5:07 |
So let's talk about what we're going to cover in this course.
We're going to start by getting your machine set up and ready to go.
It's important that you follow along in this course, that you do the code examples, that you play with the database, that's how we learn as developers.
So the first thing we are going to do is walk through how to set up your operating system with MongoDB and the various other tools that we'll talk about later.
Whether you're using MacOS, Windows or Linux, we have a video that shows you how to set this up correctly on your machine.
Next, we'll dig into what is NoSQL, why do we want to use it, what are document databases, I touched a little bit on this in the beginning, but we'll go deep into document databases, how they work, and the trade-offs and benefits that we get from them.
We'll then start working with MongoDB proper: we're going to fire up MongoDB, connect to it with its native shell, and understand its low-level native query syntax.
If this were a relational database, this would be kind of like studying SQL, say the T-SQL dialect, but MongoDB doesn't use SQL; it uses its own query language, which is easily understandable but is not the same.
Now, this is actually going to be in JavaScript for the most part; you might be thinking, this is a Python course, I don't want to learn JavaScript.
Well, for the most part we're going to write our code in Python and we're not going to do anything in JavaScript, but you need to understand how the query syntax of the various tools and libraries in Python ultimately maps down to what you could work with in the management tools, and in the management tools the native query syntax is JavaScript.
We'll make sure that we cover that really well here as a great foundation.
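To give a feel for how closely the two line up, here is a small sketch that needs no running database. It writes a query filter as a plain Python dictionary and prints the same filter in the JSON form you would type into the JavaScript-based shell. The collection name `lectures` and the field name `duration_in_sec` are assumptions for illustration; `$gt` is MongoDB's "greater than" query operator.

```python
import json

# A filter written as a plain Python dictionary, the form PyMongo accepts.
# The field name "duration_in_sec" is illustrative.
long_lectures_filter = {"duration_in_sec": {"$gt": 300}}

# Serialized as JSON, the same filter reads the way it would in the shell.
# The collection name "lectures" is hypothetical.
shell_form = json.dumps(long_lectures_filter)
print(f"db.lectures.find({shell_form})")
```

The dictionary and the shell document have the same structure, which is why learning the native syntax first pays off when we move to Python.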
Next up, we're going to talk about modeling data with documents rather than tables.
You may have heard of third normal form and modeling through normalizing data in a relational database, and to some degree that knowledge carries forward into document databases. But there are certainly many other trade-offs and different types of modeling scenarios that you want to follow. You certainly don't want to just normalize your document database; you'd be missing all of the benefits and potentially getting some of the drawbacks.
So we'll see that in document databases in general, in MongoDB in particular, you model your data a little bit differently and have different trade-offs and considerations, and we'll talk about that here.
At this point, it's time to start writing the code with Python.
So we will begin at the primary lowest level that we can work with MongoDB, and this is PyMongo.
So here we're going to work in a query language that is very, very similar to MongoDB's native JavaScript query language, but we're going to do it from Python. This works great: basically you're exchanging dictionaries, and it's very fast and efficient.
However, sometimes it makes a lot more sense to not just pass loosely typed dictionaries around but rather rich classes with lots of functionality and structure.
So we're also going to talk about MongoEngine, which is an ODM, an object document mapper, for MongoDB. Think of this as an ORM, but because there is no relational bit (it's not a relational database), we call it document.
MongoEngine is one of the best ODMs for MongoDB. It works really well in Python 3 and Python 2, and it has a whole bunch of features you can add to your application on top of what MongoDB the database itself provides, for example type checking, things like that. It's really, really nice, and you'll see that MongoEngine is a great addition to what you might be doing.
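Here is a rough sketch of the difference between passing loosely typed dictionaries around and working with a class that checks its fields. It deliberately uses a standard-library dataclass as a stand-in; this is not MongoEngine's API, and the `Chapter` class and its fields are hypothetical. It only illustrates the kind of type checking an ODM layer can add.

```python
from dataclasses import dataclass

# Loosely typed: nothing stops a wrong type from slipping into a dictionary.
raw_chapter = {"title": "Welcome", "duration_in_sec": "not a number"}


@dataclass
class Chapter:
    """Stand-in for an ODM document class; this is not MongoEngine's API."""

    title: str
    duration_in_sec: int

    def __post_init__(self):
        # Dataclasses do not validate types on their own, so check by hand.
        if not isinstance(self.title, str):
            raise TypeError("title must be a string")
        if not isinstance(self.duration_in_sec, int):
            raise TypeError("duration_in_sec must be an integer")


good = Chapter(title="Welcome", duration_in_sec=124)

try:
    Chapter(**raw_chapter)
    rejected = False
except TypeError:
    rejected = True

print(good, rejected)
```

The dictionary accepted the bad value silently, while the class refused it at construction time. That is the kind of structure the class-based approach buys you.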
Once we get the programming side in place, we want to take our database and add tons of data to it. So we're going to take the simple example that we were playing with before and add something that has effectively millions, or at least a million, records in it. Then we're going to start interacting with it from MongoEngine (it could just as well have been PyMongo, right) and see that it doesn't perform quite as well as we hoped, maybe not as well as all the hype around MongoDB being fast would make you expect.
So we're going to see that this server is kind of OK on its own if we left it alone, and then we'll make it like five hundred times faster for some totally reasonable operations. We're going to talk about the various knobs and tools we have to make MongoDB really fast.
They're not hard to do, but they are not automatic so you definitely want to learn about those.
And finally, we're going to take all of what we've learned and deploy it into a cloud, multi-server environment. We're going to create what would be a fake web app; we'll just have a little Python script that stands in for the web app, and put that on one server up in a cloud computing environment. On another one, we're going to set up a production-hardened MongoDB server, and we're going to make sure that MongoDB production server is totally locked down, running as safely as possible.
There are a lot of non obvious things about running MongoDB in production and we want to make sure that it's working really well for us, so we're going to go through this section and do five or six different things to get MongoDB ready to be our production database.
And that's it, this is what we cover in the course. I think it's a really comprehensive introduction to MongoDB from Python, and I really hope you enjoy it; I really enjoyed creating it for you.
|
|
|
1:08 |
So let's talk a little bit about the tools that we're going to use.
We talked about Python as the programming language, along with PyMongo and MongoEngine; obviously we're going to use MongoDB as the database, but we'll also learn a few other things during this course.
You'll learn about a management tool called Robomongo.
Robomongo is hands down the best way to work with MongoDB from the client side. It gives you all the power of the command line interface that comes with MongoDB, but with a great GUI wrapped around it. That's all I am going to say about it now, but it's really fabulous and I think you'll enjoy it.
We'll see that we can even use Robomongo to securely manage our production environment on a remote server.
We're going to be using PyCharm as the IDE for editing all of our Python code, so you'll learn a whole bunch of things about PyCharm. If you don't want to use PyCharm and you want to use Sublime Text, or Emacs, or whatever, that's totally fine, but we're going to be using PyCharm and it's really great. I'll show you many of the tricks and techniques for using it better.
And finally, when we get to our deployment step we're going to be working with Ubuntu, and you'll learn how to set up MongoDB properly in a production environment on Ubuntu.
|
|
|
0:48 |
You might be wondering, hey who is this guy talking to me anyway?
Here I am, hi, nice to meet you. My name is Michael, and you can find me on Twitter where I'm @mkennedy. Just a little bit of my background: I run the Talk Python To Me podcast, and I'm the founder of Talk Python Training, where I've written many of the courses, for example this one.
And I'm even in the MongoDB Masters program which is a group of about 30 people who are external advisers to MongoDB, who work pretty closely with the teams there and give them feedback on how MongoDB is working in the real world, in our environments.
So I hope to bundle up all of this experience and package it into this course, and really give you some great take aways.
I'm so happy you joined my course, I'm looking forward to teaching you a bunch of stuff— let's get started.
|
|
|
2:05 |
Welcome to your course. I want to take just a quick moment to take you on a tour of the video player and all of its features, so that you get the most out of this entire course and all the courses you take with us.
You'll start on your course page, of course, and you can see that it grays out and collapses the work you've already done. So let's go to the next video here. It opens up this separate player, and you can see the standard video player stuff: you can pause or play, you can skip back a few seconds or skip forward a few more, you can jump to the next or previous lecture, things like that.
It shows you which chapter and which lecture topic you're learning right now, and there's other cool stuff like take me to the course page, show me the full transcript dialog for this lecture, take me to the GitHub repo where the source code for this course lives, and even do full text search. When we have transcripts, that's searching every spoken word in the entire video, not just titles and descriptions and things like that. There's also some social media stuff up there as well.
For those of you who have a hard time hearing or don't speak English as your first language, we have subtitles from the transcripts, so if you turn on subtitles right here, you'll be able to follow along as the words are spoken on the screen.
I know that could be a big help to some of you.
Just because this is a web app doesn't mean you can't use your keyboard.
Do you want to pause and play?
Use your space bar to toggle that. You want to skip ahead or backwards? Left arrow, right arrow.
Next or previous lecture? Shift-left, shift-right. Want to toggle subtitles? Just hit S. And if you wonder what all the hotkeys are, click this little thing right here and it'll bring up a dialog with all the hotkey options.
Finally, you may be watching this on a tablet or even a phone, hopefully a big phone, but you might be watching this on some sort of touch screen device.
If that's true, you're probably holding it with your thumbs, so you click right here to seek back ten seconds, right there to seek ahead thirty, and, of course, click in the middle to toggle play or pause.
Now, on iOS, because of the way iOS works, they don't let you auto-start playing videos, so you may have to click right in the middle here to start each lecture on iOS.
That's the player. Now go enjoy that course.
|
|
|
23:53 |
|
|
2:34 |
Let's begin our exploration of MongoDB at a high level with the why and history of NoSQL in general.
So if you look around the software development landscape, you'll see that when people talk about how to design their application, they frequently say: well, let's assume that we have a relational database; now we can discuss what type of ORM we should use, or whether we should use an ORM at all, or whether we should use microservices, or things like this.
Basically, the fact that we're starting from a relational database is considered to be an axiom of software development, we have data, it goes into relational database now let's talk about the architecture, now let's talk about scaling, let's talk about performance and so on.
And just to drive home how strong of a statement that is, an axiom, recall the exact definition— a statement or proposition that is regarded as being self evidently true, it's just clear that you start from a relational database.
So one of the things I hope you take away from this course is that the database style, the database engine, is a choice. It's a really important choice that has far-reaching implications for your application.
So I want to sort of break this mold that starting from a relational database is an axiom.
Sometimes it makes sense, sometimes it's perfect, but a lot of times as you'll see throughout this course starting with a document database is actually a better choice.
Now, what is NoSQL?
Ask five people what NoSQL is, you'll probably get five different answers back.
Some people will say, well the 'no' stands for 'not only' so NoSQL is 'not only SQL', well that's a great open minded view of the world, but I'm sorry to say that's not what NoSQL is.
Maybe it means it doesn't have SQL, maybe it means the system operates without the SQL language, right, without select * from this etc, without that language.
If we look at the history, I think you'll see that this is also not the case.
Here's a toaster, this toaster operates without SQL, is it a NoSQL toaster— I don't think so!
And of course not, it's not a NoSQL toaster, it's just a toaster.
NoSQL doesn't mean it operates without the SQL query language. In fact, I believe that DocumentDB, Microsoft's document database that runs in Azure, actually more or less uses a flavor of the SQL query language to query it.
So no, it's not about excluding the SQL query language, it's something entirely different. Let's next look at the history, and I think you'll have a really good idea of what NoSQL is, and maybe we will come to a little bit closer agreement on the definition of NoSQL.
|
|
|
2:18 |
There were some much, much older systems, object databases, that don't really carry on through today; but what we think of when we talk about NoSQL today, the first record of modern-day NoSQL, really started back in 2009 in San Francisco.
So this guy Johan Oskarsson, who at the time was working at Last.fm, was putting together a big data / scaling databases type of meetup in San Francisco. The idea was: we're going to talk about open source databases, distributed databases (that is, databases that are easily horizontally scalable), and ones that might not be traditionally relational.
This description here on the right actually comes from Wikipedia. The origin of the name itself, the actual word NoSQL, I don't believe is here, but it was in a previous account; if I could find the reference, I should go back and edit Wikipedia. There's another guy named Eric Evans who was attending this meeting as well, and Johan said, hey, what are we going to call this meeting? Do we have a name for these types of groups, this type of thing that we're doing? Let's try to get something short, like a hashtag that we can use on Twitter to talk about it. So Eric Evans said, how about #NoSQL? And that is the origin of the modern-day term.
And the idea was, it was meant to describe this group of people, mostly running web apps with lots of data and high performance requirements, getting together to talk about how we can give up some of the features of relational databases to enable other types of things. So maybe we'll give up atomicity and the other ACID properties, maybe we'll give up joins, maybe we'll give up transactions, things like that. And if we do that, how do we structure our data differently, how do we structure our databases differently, to be better at basically being cluster friendly?
Alright, so to me, this is the idea of what a NoSQL database is, it's a database that gives up some of the relational database features or requirements or properties, so that it is more cluster friendly, it is more friendly to scaling and sharding and things like that.
|
|
|
4:34 |
So this historical description indicates that there are a variety of different types of NoSQL databases that we've arrived at, basically different levels or types of trade-offs that you might make to encourage this horizontally scalable, cluster-friendly type of data.
Now, if you go and look up the types of NoSQL systems, you'll see there are four distinct types, but I think that would be four types under the 'not only SQL' style of systems, and, as is my contention, that's not what NoSQL is.
So I think there are actually three types of NoSQL databases, and then there's another thing that is just not SQL; we'll talk about all four.
At the most basic level, the simplest ones are key value stores: I'm going to basically create a distributed dictionary with an id key that I can look up and some opaque blob that I can get back. So here is some thing (it could be text, it could be a binary object, a graph, it doesn't really matter); store it under this key, and if I ask for that key back, give me the thing. For example, I want to store a user and all the stuff that is associated with that user in a single blob in Redis, and when I ask for it back by user id, give that thing back.
That means frequently these databases, these key value stores don't let you query inside the opaque blob, even if it's text, the idea is you query by the id, you get the thing, but you can't easily ask it questions like what is the average price of the order of all the customers stored in your key value store.
Now, there are ways to add like additional metadata that you can stick on to them, so you can sort of fake it a little bit on some of them but in general, I can ask for the thing by a key and I get a blob that's opaque to the database back.
Now, these are not great as databases in terms of general purpose databases because you can't ask them a wide variety of interesting questions.
However, they are great in terms of this cluster ability, the distributed cache type of thing. They are super, super fast, because every single query is just: find me one item by primary key and give it back to me. So if you horizontally scale that out, you can find which key range maps to which server, go to that server, get it by id, and boom, super, super fast.
So these are nice, they are often used for caches and things like this.
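The "distributed dictionary" idea can be sketched in a few lines of Python. This is a toy, in-memory model and not how Redis or any particular product is implemented; the `ShardedKeyValueStore` class and the range boundaries are invented for illustration. It shows the two properties described above: the blob is opaque to the store, and a lookup only needs to work out which key range, and therefore which server, holds the key.

```python
import bisect


class ShardedKeyValueStore:
    """Toy key-value store split across several in-memory 'servers' by key range."""

    def __init__(self, upper_bounds):
        # upper_bounds[i] is the exclusive upper bound of the keys on server i;
        # one extra server takes every key at or above the last bound.
        self._upper_bounds = sorted(upper_bounds)
        self._servers = [{} for _ in range(len(self._upper_bounds) + 1)]

    def _server_for(self, key):
        # bisect_right finds the key's range without scanning any stored data.
        return bisect.bisect_right(self._upper_bounds, key)

    def put(self, key, blob):
        # The blob is stored as-is; the store never looks inside it.
        self._servers[self._server_for(key)][key] = blob

    def get(self, key):
        return self._servers[self._server_for(key)].get(key)


store = ShardedKeyValueStore(upper_bounds=[1000, 2000])
store.put(42, b"opaque user blob")
store.put(1500, b"another blob")
print(store.get(42), store.get(1500), store.get(7))
```

Notice there is no method for asking about the contents of the blobs. That is the limitation mentioned earlier: lookups by key are cheap, but questions like an average order price are not something this shape of store can answer.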
Next, we have what I think is the real sweet spot for NoSQL databases, the ones that potentially could replace traditional relational databases in your application, that is they are flexible and powerful enough to be general purpose databases and that's the document databases.
So this is MongoDB, CouchDB, OrientDB, DocumentDB from Microsoft Azure, things like that. This is really where we're going to focus for almost this entire class, so we'll come back to this.
We also have a more interesting type of database called a columnar database. Traditional relational database systems like MySQL, Microsoft SQL Server etc. store data in rows, but a column-oriented database is different in that it stores the data in its tables as columns instead of rows. So it's in a lot of ways really similar to relational databases, but you'll find that it's easier to associate what you might think of as pre-joined data. Maybe you have some orders, and the orders have order items, and there's a bunch of those, so you might have one order with multiple items; it's easier to group those together in a columnar database.
But they're kind of more or less like relational databases in a lot of ways.
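As a rough picture of "columns instead of rows", here is a plain Python sketch of the same three records in both layouts. The order data is invented for illustration, and real columnar engines are far more involved than this; the sketch only shows the change in layout.

```python
# The same three hypothetical order records in two layouts.
rows = [
    {"order_id": 1, "customer": "ann", "total": 30.0},
    {"order_id": 2, "customer": "bob", "total": 12.5},
    {"order_id": 3, "customer": "ann", "total": 7.5},
]

# Column-oriented: one list per field, with positions lining up across lists.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row layout walks every record to total one field.
total_from_rows = sum(row["total"] for row in rows)
# Column layout reads just the one list it needs.
total_from_columns = sum(columns["total"])

print(columns["customer"], total_from_rows, total_from_columns)
```

Both layouts hold the same information; what changes is which values sit next to each other, and so which kinds of reads are cheap.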
So these I believe are the three types of NoSQL databases.
There's a fourth that often gets grouped here, so let's talk about it— graph databases.
So graph databases let you model relationships on many, many small interconnected things. Think of social graphs: friends, the pictures, the people who have shared that picture, whether you've liked that picture, the friends of your friends who have liked that particular picture. You can traverse these relationships incredibly easily in graph databases because you can actually query directly on the relationships.
Show me all the things related in this way to this item, and so on. But that does not lead to this cluster-friendly sort of thing; in fact, this leads to the data being even more tightly connected and less easy to map across multiple horizontally scaled servers and things like that.
So in my mind, graph databases are super interesting, but they're not NoSQL, they're just not SQL.
So that leaves us with three types— key value stores, document databases and columnar databases.
So let's now continue on to talk about document databases.
|
|
|
5:12 |
So let's talk about how document databases work.
Here's a record from my actual online training platform chapter 1001 from 'The Python Jumpstart by Building Ten Apps' course and this is more or less exactly what came out of the database, with a few things taken away so it actually fits on the slide here.
Now, let's break this into two pieces here, this green piece and the blue piece.
First of all, you can see we have JSON. When you work with document databases, you'll frequently run into JSON as at least the visual representation of the record.
In fact, MongoDB doesn't really work in terms of JSON; it works in something called BSON, or binary JSON. This is a binary, tokenized, richly typed version of a sort of extended JSON, already tokenized and stored in binary form. This is what's transferred on the wire, and to some degree this is what's actually stored in the database. How it actually gets stored moves around, since the database storage engines are changing and are pluggable now in MongoDB, but more or less you can think of this becoming a binary thing and then being stored in the database.
When it comes over into, say, Python, we'll of course map this into something like a Python dictionary or a type that has these fields and so on.
So if you look at the green area, this is just the JSON-ified version of any other database record. Think of it as columns, if you will, for a minute: it would have columns here, like one would be the id, one would be the title, one might be course id, and each has a value. That's all well and good until we get to lectures, and here's where the power of document databases comes in. Lectures is not just a number, like there are seven lectures or whatever; no, lectures is a list, so multiple things, and each one of those things is an individual lecture with its own fields: id, title, video url, duration in seconds. Again, there's actually more to it, but this is trimmed so it fits on screen. With this document database, you can think of these things as kind of pre-computed joins, and this solves a ton of problems and makes the NoSQL aspect of document databases super powerful.
So it makes this chapter more self-contained. If I want to get this chapter back, instead of going to the chapter, then doing a join against the lectures and maybe some other type of join, getting a bunch of different pieces and pulling them back together, I can just do one query: find me the chapter with id 1001, and bam, it's back, I've got the whole thing. You can think of this as pre-joined data. If 80 or 90 percent of the time that I'm working with a chapter I care about the lecture data being there, why not store it in a way that it's already bound together, so I don't have to do that join, I don't have to do multiple queries or things like this?
Okay, so this is really powerful and we'll talk a lot about when this makes sense, when it does not make sense and so on, but this means that if I take the single record and I put it on some server, even if I've got like ten servers and some sort of horizontal scale situation and I do a query by chapter id, I don't then have to go back to the cluster find where all the lecture data lives or anything like that.
No, just bringing back that one record brings most of the data that I need when I'm working with a chapter right along with it, which is excellent.
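The "one record brings everything along" idea can be sketched with a plain dictionary standing in for a collection. This is an in-memory stand-in, not a real database call, and the titles, course id value and durations are placeholders I made up. Only the ids 1001 and 10106 and the field names come from the slide.

```python
# A hypothetical in-memory stand-in for a collection of chapter documents,
# keyed by chapter id. Titles and durations are illustrative placeholders.
chapters = {
    1001: {
        "_id": 1001,
        "title": "Welcome",
        "course_id": "jumpstart",
        "lectures": [
            {"id": 10104, "title": "Lecture A", "duration_in_sec": 64},
            {"id": 10105, "title": "Lecture B", "duration_in_sec": 210},
            {"id": 10106, "title": "Lecture C", "duration_in_sec": 95},
        ],
    },
}

# One lookup by chapter id returns the chapter and its lectures together.
chapter = chapters[1001]
lecture_titles = [lecture["title"] for lecture in chapter["lectures"]]
total_seconds = sum(lecture["duration_in_sec"] for lecture in chapter["lectures"])

print(lecture_titles, total_seconds)
```

Everything needed to show the chapter page arrived with that single lookup. No second trip was needed to collect the lectures.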
That's the benefit. The critical question, to decide whether this is going to work for our database system as a general thing, is: can I still ask the questions that I would have asked if lectures were a separate table, a flat table just like in relational databases?
So, what if I want to find this last lecture here, 10106? Will I be able to go to the database and say, hey document database, I would like to get lecture 10106, and I want to do that with an index, and I want to have it basically instantaneously? Even if there are a million records, I want to instantaneously get the record that, embedded deep down within it (it could be many, many levels, not just one, but in this case it's just one), matches the fact that the lecture id is 10106.
And the answer is for the document databases yes, so this makes them very, very different than the key value stores just doing a json blob because we can ask these very interesting questions, we can do map reduce or aggregation type things for big data, analysis and analytics, all sorts of stuff is possible, because you can actually query deeply down into these nested objects.
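To show what querying into the nested objects amounts to, here is a pure Python sketch that emulates the lookup by embedded lecture id, first by scanning and then through an index. It is an emulation under simplifying assumptions, not MongoDB's implementation, and the helper names are invented. With PyMongo, the corresponding filter would be a dictionary using dot notation, `{"lectures.id": 10106}`.

```python
# Hypothetical chapter documents; only the fields needed here are shown.
chapters = [
    {"_id": 1001, "lectures": [{"id": 10105}, {"id": 10106}]},
    {"_id": 1002, "lectures": [{"id": 10201}]},
]


def find_by_lecture_id_scan(lecture_id):
    """Without an index: look inside every document until one matches."""
    for chapter in chapters:
        if any(lecture["id"] == lecture_id for lecture in chapter["lectures"]):
            return chapter
    return None


# With an index: map each embedded lecture id to its parent chapter once,
# so later lookups avoid scanning the collection.
lecture_index = {
    lecture["id"]: chapter
    for chapter in chapters
    for lecture in chapter["lectures"]
}


def find_by_lecture_id_indexed(lecture_id):
    """Look up the parent chapter of a lecture through the prebuilt index."""
    return lecture_index.get(lecture_id)


print(find_by_lecture_id_scan(10106)["_id"], find_by_lecture_id_indexed(10106)["_id"])
```

Both functions return the whole chapter document that contains the lecture. The indexed version does its work up front, which is the trade that makes the lookup feel instantaneous as the collection grows.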
So that's how document databases work, and we'll explore the when, whys and hows of designing these documents when we get to the document design chapter.
|
|
|
3:22 |
Now that we've talked about what NoSQL is, where it came from and a little bit about document databases, let's move to focusing specifically on MongoDB.
First off, I want to point out that MongoDB is open source. If we come down here, you can see we've got github.com/mongodb/mongo. There are also all of what are called drivers (how you access MongoDB from Python, or C#, or Java, or whatever) and a bunch of other stuff out here on their GitHub, but mongo is the actual database server.
So it's cool that it's open source. Now, there are a lot of things on GitHub that are technically open source but not really active: oh, that was last changed four years ago, and it's got 50 pull requests that haven't even been addressed in the last six months.
That's not good; that's not the case with Mongo, obviously, eleven thousand stars, three thousand forks, when was the last check in— it was three hours ago, okay?
So that's really awesome. And they're fixing the build on 'Windos', so that must be a slightly different version of Windows. I'm just kidding, I'm sure they lost the w there, but you know, it runs on the major platforms: Windows, Linux and MacOS. If we look over at the pull requests, you can see these are only a few days old, and there are 1129 that are closed. These are all really good signs that MongoDB's open source repository is active and real; it's not just up here, it's a really active thing with a huge company and hundreds of engineers working on it.
If we look at what it's made of, it's basically a C++ app. There is a decent amount of JavaScript in there as well; as we'll see, JavaScript is fundamental in the raw query API and some other stuff.
Now we can get the source code here, but that's not how we get MongoDB.
So the way we get MongoDB is we go to mongodb.com and click download, and it takes us here. You can see there is the free community server version, and if we come over here we could get the OS X version with SSL. We could download it just as a binary, or we could install it with Homebrew, that's pretty cool.
If we want to install it on Linux, let's pick something, let's say Ubuntu; okay, here is how you install it with apt. On Windows you just get an MSI and install it directly that way.
But notice, they all have SSL support. I believe there was a time when SSL support was actually an enterprise feature, like a paid feature, but thankfully that went away, and the community edition has this right here.
So if you get a chance, I would recommend installing it, especially on Linux, from one of the package managers, like here with apt, because then the underlying system will know when there's an update for your database server. That's not a feature on Windows; there you just have to keep track yourself.
You can also go and get the enterprise server, which has an in-memory version, an encrypted-at-rest version, and sort of advanced identity control features within it. But if you just want a standard database without those things, and you don't care about, say, encryption at rest or integration with Active Directory or something like that, then the free version is totally good.
|
|
|
5:53 |
Now let's look a little bit at who is using MongoDB and how.
On one hand, it's not that important; it's not a popularity contest. Does it solve your problem? Good, use it.
On the other, MongoDB is different, right? It's not a relational database that people have been using for thirty years; recall that axiom conversation I had at the beginning. If you are the one adopting MongoDB, you have to take this idea and present it to the people that run the business, to your managers, to the tech team, and say, hey, this is a safe thing for us to do, this is a good thing for us to do.
And so looking at the other users of MongoDB, how they're using it, and how much data and traffic they are passing through it, can give you some really good support: hey look, it's working well for these companies, and they're way more risk averse than we are, so if they can use it, we can totally use it.
So, with that in mind, let's go look at who uses MongoDB. They have a whole 'who uses MongoDB' page right here; we can flip through, and there are a few major ones. We've got MetLife, obviously a large insurance company, and they're doing some pretty interesting things: they have a single view of a hundred million customers across 70 systems, and they built this whole thing on Mongo in 90 days, that's pretty cool.
Expedia uses it for millions of customers while they're looking for travel, that's great.
Now let's look at some more, you can see the scrollbar, this is actually huge, so let's scroll down to find some interesting ones.
So let's say Royal Bank of Scotland, this supports the bank's enterprise data services underpinning several core trading systems, okay, that's intense, right? If you're debating whether or not this can handle some part of your app, which for the majority of students is probably not super intense, well, if Royal Bank of Scotland is going to make this part of their core trading systems, that's really putting a lot of faith in it.
Biotech, they use this to accelerate their drug testing, Facebook, they have a whole bunch of interesting things that they're doing with Mongo, they ran like a backend as a service of Mongo when they acquired Parse, but they're not doing platform stuff like they used to.
Now let's flip around, let's have a look at say eBay, they're delivering all their media metadata with five nines of reliability; Barclays, a big bank, so they've replaced a whole bunch of relational systems there. Let's keep going, come down here to our friends in Germany, they built a pretty amazing internet of things platform on top of MongoDB; come down, look, the New York Times, they basically did all their social sharing activity on top of MongoDB; Business Insider, you've probably run across Business Insider the website, so they've been around since 2009, they launched in New York City, and their whole site runs on MongoDB, which is pretty awesome.
Speaking of business, let's look at Forbes, they rebuilt their whole CMS on top of MongoDB, resulting in a jump of 5 to 15 percent in mobile traffic overnight, that's really cool.
So Carfax, they sell cars online and in person so a ton of traffic happening there, that's really cool.
CERN, I love CERN, these guys at the Large Hadron Collider, they're using MongoDB to manage the data while they're searching for the Higgs boson, which I think probably needs updating because they have now found the Higgs boson and won the Nobel Prize as a part of that.
Another interesting, long-time user of MongoDB is Foursquare; so Foursquare is as far as I know more or less entirely powered by MongoDB, and here you can see it powers the processing and storage of all check-ins with hundreds of thousands of IOPS on MongoDB, that's hundreds of thousands of input/output operations per second on MongoDB, which is really, really cool.
Let's look at Sailthru, so Sailthru is like marketing email campaign company and they store 40 terabytes of data in MongoDB across a 120 nodes so remember we talked about document databases and NoSQL databases in general being good for horizontal scale and sharding and partitioning your data; 120 nodes in your cluster that's pretty intense.
All right, let's do one more, let's talk about Shutterfly, so Shutterfly is like a photo sharing site, pretty cool, you can put your pictures there, share them with people, you can get printed books, and they were doing that before some of the main companies like Google were and so on. So this is interesting in that they have a bunch of projects on Mongo storing over 20 terabytes of data.
Squarespace, Stripe and on and on it goes, right, all of these really cool companies are using MongoDB.
I guess one more, let's look at Under Armour here.
So Under Armour is interesting because I haven't seen any of the previous examples explicitly calling this out; so Under Armour is an athletic clothing company in the US and around the world, and their online shop is powered by MongoDB and it does over two billion dollars in sales, so that's pretty awesome.
All right, so why do we spend all this time talking about who uses MongoDB?
One, to show you there are a bunch of companies being really, really successful with MongoDB and that there are different use cases, different companies in different areas doing different things, we saw like biotech, we saw pharma, here is e-commerce, all sorts of things.
Oh, EA, I didn't pull up EA but they're using it to power, let's go up here to EA, so EA is using this to scale their online games to millions of players; so all sorts of really cool and interesting use cases that you can use to say hey, we should give this database a try because here's a bunch of other people being successful with it.
This also means that you can rely on Mongo because it's taken a serious beating from quite a few different angles and use cases, it's not some barely used database but it's highly, highly used actually.
|
|
|
29:11 |
|
show
|
4:37 |
In order for you to get the most out of this course you're going to need to follow along.
When we're talking about the Mongo shell, you should open it up and play around.
When we work with PyMongo, you should pip install it and write some code to talk to your local MongoDB server.
When we're doing MongoEngine or working with indexes, again, you should follow along and do these things.
In order to do that, you're going to need some software, you're going to need some starter code to get going and you're going to need basically to have MongoDB set up and configured correctly.
So in this part of the course, let's talk about getting your machine set up so you can follow along.
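As a taste of what "write some code to talk to your local MongoDB server" can look like, here is a minimal sketch. It assumes you have run `pip install pymongo` and that the server is on its default port, 27017; the function names `local_mongo_uri` and `can_reach_mongodb` are made up for this illustration and are not part of the course code.

```python
def local_mongo_uri(host="localhost", port=27017):
    """Build a connection string for a MongoDB server on this machine."""
    return "mongodb://{}:{}".format(host, port)


def can_reach_mongodb(uri, timeout_ms=2000):
    """Return True if a MongoDB server answers a ping at the given URI."""
    # Imported here so this file still loads before `pip install pymongo`.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    # Fail after timeout_ms instead of waiting the default 30 seconds.
    client = MongoClient(uri, serverSelectionTimeoutMS=timeout_ms)
    try:
        client.admin.command("ping")  # cheap round trip to the server
        return True
    except PyMongoError:
        return False
    finally:
        client.close()
```

Once the setup for your operating system is done, calling `can_reach_mongodb(local_mongo_uri())` should come back `True`; if it comes back `False`, the server is most likely not running yet.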
This is a course about MongoDB, so it shouldn't be terribly surprising that it's going to require MongoDB.
Now if you look across the bottom here you can see there is a version for Windows, Linux and MacOS.
So regardless of what operating system you are using you should be able to use MongoDB installed locally and work with it there.
There are hosted services, places like ObjectRocket and Mlab and if for some reason you can't install MongoDB and configure it, unlikely, but possible, you could actually connect to one of those services.
But we're going to assume that you can set it up locally and I will walk you through step by step how to do that for each and every operating system below, with the exclusion of Solaris of course.
Now, this is MongoDB for Python Developers so it shouldn't surprise you that hey we're going to need Python, and we're focusing on Python 3, so most new projects are created in Python 3, it's the future of Python, so we're definitely focusing on Python 3, that said, the things we're doing are not super specific to Python 3, it should pretty much work across all the versions in case you happen to be using a legacy Python.
So do you need to install Python 3?
Well that depends if you're working on Ubuntu, you probably already have at least Python 3.5 on your system.
If you're on MacOS, by default you have Python, but only legacy Python, only Python 2, not Python 3, so you'll need to install that.
And if you are on Windows, unless you've done something special there is no version of Python, so make sure you get Python 3 from Python.org, download and install it.
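If you are not sure which Python you ended up with, a quick check like the following can help. This is just a sketch; the helper name `is_python3` is made up here.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple is Python 3 or later."""
    return version_info[0] >= 3


# Report the interpreter that is actually running this script.
print("Running Python", sys.version.split()[0], "- Python 3:", is_python3())
```

Run it with whatever command you plan to use for the course (`python` or `python3`), since on MacOS and Linux those two can point at different versions.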
Now we're going to write a lot of code in here, that's good, I think that's the way coding course should be, and we're going to use the editor from Jetbrains called PyCharm.
In my opinion, this is the best tool for working with Python code and you'll even have plugins for MongoDB if you go and search their tool repository, so we're going to use PyCharm.
Now, PyCharm is available in two flavors, there is a community free open source edition, and there's the pro edition.
If you have the pro edition, feel free to use that, but if you don't, you can grab the community edition, it will do everything we need for this course.
If you want to use some other editor, that's totally fine, you can use whatever you like, but if you want to follow along exactly, I recommend you give PyCharm a shot.
There are a couple of ways we can work with MongoDB once we have it installed, we can use the CLI, the command line interface that comes with MongoDB itself, or we can use something called RoboMongo.
So RoboMongo in my opinion is the best way to work with MongoDB the idea is you can see a little dark gray area, that's basically the shell and you can type as if that was a command line interface.
However, it operates inside this gui so you could write a little bit of cli stuff and then go interact with the stuff visually, and this is a really, really nice balance of giving you the full power of MongoDB, but also a lot of visual support.
I think it's super productive and is great.
You can see there's screenshots for all the three major operating systems, so whatever operating system you use, RoboMongo is going to work great, it's also free and it's also open source, how about that.
Finally, when we write that code with PyCharm, you're going to want to be able to take it with you.
Sometimes you might want to grab the finished code that you saw me create in the video and run it; other times, we might have started not from a blank empty file but from some sort of starter code that got us further along in the demo from the beginning.
We also have a couple of large databases that you'll want to get access to for the performance section of the course. All those and more are contained in this GitHub repository here, github.com/mikeckennedy/mongodb-for-python-developers, so be sure right now to pause this video, go over here and star and maybe even fork this repository so you're sure to have it with you.
And also download or clone it to your local drive, because you're going to want to have this to work from as you go through the course.
So there you have it, that's the software, source code and tools that we're going to use.
What we're going to do next, I'm going to walk you through each operating system, Windows, MacOS, and Linux and show you how to set up the tools and how to configure MongoDB and get everything working just right.
If you're a Linux person, there is no reason to watch the say Windows version, so pick the video that matches your operating system, skip the others.
|
|
show
|
10:07 |
Here we are in Windows 10, so I think this might be the Anniversary Update, I'm not sure exactly which version, but it's not a fresh one, it's been out for quite a while now.
So what we're going to do is we're going to install MongoDB.
Let's go over here, check out mongodb.com, click on download, so we're going to go and get the msi, I want to get the 64-bit one with ssl, you put your name in here to get info from MongoDB if you want, you don't have to; okay, downloaded, let's run this. All right, so the current version of MongoDB at the time of this recording is 3.4.4, so we're going to install this, and I'll just open up custom, complete would be totally fine, just so you can see what's there: there's the server, which is mongod itself, there's the client tools, there are the monitoring and analysis tools, import and export for data backup and restore, the sharding server mongos, and then a few other utilities. Unless you have a good reason to not have these on your machine, I would just go for the complete install.
All right, it looks like MongoDB is installed let's try this— no Mongo, all right, the reason there's no Mongo is we've got to set up this thing in our path, so let's go over here to program files, mongodb, server, number, bin, so basically to do anything interesting from the command line and that's really where you want to be working with MongoDB you're going to have to come over here and put this into your path, so let's do that now.
You just go to properties, advanced, properties, environment variables, this is way deep down, go to the path and hit edit, and then hit new, this is very much nicer than the way it has been in the past, and it will just take that path and put it here; close everything off, ok, so now we should be able to run Mongo, hey look it's not going to work, but we can see it's this one that we just found and set up, so in order for Mongo to work, we can actually try to run Mongod and we're going to get a sad, sad message here, so if you look somewhere it's going to say this directory basically the database directory is not set up.
Well, there is a default place where you can put the data and it will actually pick that up; you see here are the startup settings that it's using.
So we don't want to do that, we want to make our own data folder, another one called logs and one called configs; you get to configure all of these however you like, but we should set up something like this. So let's go in here, now I'm going to copy the config files over so we have two, and notice I've named one command line and one service, let's just look at the command line one.
So notice, there's not a lot going on here, I think this directoryPerDB we could actually drop, this is not used in the new version, so we're basically saying the data goes in c:\mongodb\data, let's just double check that that does exist, it looks good up here, c:\mongodb\data, okay. Journaling enabled, you definitely want that on. And this is super important: listen on localhost only, only. This is a dev machine, there's no reason it should listen on the open internet in case your firewall is down or you're somewhere where people are scanning the ports on the local network, think hotel, something like that. We don't want any of that to happen, so we're going to listen on localhost only.
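To make the shape of those two files concrete, here is a small sketch that renders a config with the settings just described: a data path, journaling on, and localhost only. It assumes the YAML config format and option names of the MongoDB 3.4 era (`storage.dbPath`, `storage.journal.enabled`, `net.bindIp`, `systemLog`); the actual files shipped with the course may differ, and the function name and the log file name in the usage note are made up for illustration.

```python
def render_mongod_config(db_path, log_path=None, bind_ip="127.0.0.1", port=27017):
    """Return the text of a minimal mongod YAML config file."""
    lines = [
        "storage:",
        "  dbPath: {}".format(db_path),
        "  journal:",
        "    enabled: true",
        "net:",
        "  bindIp: {}".format(bind_ip),  # localhost only on a dev machine
        "  port: {}".format(port),
    ]
    # The service variant also needs somewhere to write its log.
    if log_path is not None:
        lines += [
            "systemLog:",
            "  destination: file",
            "  path: {}".format(log_path),
            "  logAppend: true",
        ]
    return "\n".join(lines) + "\n"


# Command line variant: no log section, so output goes to the console.
print(render_mongod_config(r"c:\mongodb\data"))
```

Passing a second argument such as `r"c:\mongodb\logs\mongod.log"` (a hypothetical file name) produces the service-style variant with a `systemLog` section.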
All right, so what we need to do now is we want to try to run MongoDB again, now with this, so let me go up here and copy the path, so we should be able to run MongoDB now, let's clear that off, so mongod and when we tried to run it before and it gave us this error, now we can say --config is that, and if we've got everything set up correctly this should work, there might be permissions on that mongo folder I created we're going to find out in a second.
It looks like things are going well, let's go over here and try to connect so we can type mongo and hey, we're in, I can do things like show dbs what's here, perfect, ok so it looks like this is working, it says now warning, you don't have access control like this is wide open to the internet and it's unrestricted read/ write, this is not the best, it's pretty much okay because we're listening it on the local host, still could be a problem, you might want to set up an account when we get to the deployment and production settings, this is, we're going to solve these problems, but for development this is probably good.
I had that one config, the one that worked, now let's check the other one out and make sure everything is ok with it as well.
So this service one is going to run when we install MongoDB as a Windows service so if we were running in like Windows virtual machine in aws, ec2 or in Azure something like that, this would be what we'd probably run, of course with credentials and things like that, we'll talk about it at the end; but if we're going to set this as a Windows service, this will only succeed if we set the logs, so that's why we created this logs folder and that's why this service one has a system log section.
So the next thing to do, now that we're over here, is first let's just test that, so let's test this service version; we won't see anything because of the log file, but if it just sits there, I guess we could go ahead and test that we can connect to it. Yeah, looks like that worked.
Okay so it looks like the service is working we'll just control c out of there.
Now the next thing that we need to do, this is optional, you don't have to do this, you could literally come and type this out every time, but let's go ahead and set this up as a Windows service, so you can tell it to auto start, delay start or just flip open to the services and click go whenever you want to use MongoDB, that's how I whenever I'm working on windows, how I use it.
So we can go to the services, and let's hit m to see if there is anything for Mongo, and there's nothing for MongoDB here, ok, so no MongoDB; and what we want to do is register MongoDB as a Windows service. Now there's something that's really, really important here. I can run MongoDB like this, --port equals whatever port, --ssl and whatever, all of the options go here, so --dbpath equals, and we could fill this out here. It turns out that, the way MongoDB registers itself, if I install it as a Windows service by explicitly passing the parameters, the only way to change those values, to change the way MongoDB works, is to actually go and edit the registry entry in Windows, not amazing.
So what we're going to do instead is what we already did, we basically want to say run with that config file.
Now, the other thing that I've seen can be really tricky is the Windows service path might not be the same as your path, so you need to use full path names everywhere. So we'll say where mongod, because we want to run this explicitly, that's what gets recorded in the Windows service; so we're going to use that full path instead of just mongod, we'll say --config and that was in c:\mongodb\configs\ this one, now we've got to use the service one that has the log. And then finally, here's the trick, this is the thing, although actually this is not going to work the first time, and I'll show you why.
So the trick is to add --install to say I would like to install this as a service. Because it's not going to work, I'm going to copy it so I don't have to type it again. Ready, enter. Now, no errors, but if I refresh, also no MongoDB.
What happened?
Well if you actually open up that log file in there it will say permission denied, could not install MongoDB, why— because this is not an administrator command prompt, not root or anything like that, this is just straight up, just whatever my account is, so I got to right click, you see these options, if you shift right click you say run as administrator, and then you run the exact same command and it does the exact same thing, except in the log file, there's now a different message if I refresh— ta-da, we have MongoDB.
So let's test this, if type mongo, trying to connect, trying to connect, it's going to time out, right, nothing.
Now if I go over here and I press start, do the same thing again, ta-da, now we have MongoDB set up as an auto start windows service.
That's pretty awesome right.
So if we reboot, MongoDB will run.
It might be the case that just sometimes you want to run Mongo and the other times you don't want to it say sucking down the battery on your laptop, you can set it to automatic delayed start, so your Windows boots faster, and you'll still have it, or you can just set up purely to manual in which case it's only going to run after reboot if you come over here and click go.
So that's depending on how you use Mongo whether you want it certainly in production if you're on a Windows server set that to start but maybe manual for development, I don't know, it depends how often you use MongoDB, if you use it all time that's all you work on is apps to talk to it, just set it to auto start or delayed or something.
Okay, so now this Windows machine is all configured to run MongoDB, how about PyCharm, and RoboMongo and so on?
Those all have straightforward installers, so just install Python 3, RoboMongo and PyCharm, whichever edition you want, and just next, next, next your way through. With the Python one, make sure that you check the box that says add Python to my path, that one is important.
Other than that, there's really nothing else to it, you'll have a machine that is set up and ready to do this MongoDB course on Windows.
|
|
show
|
2:25 |
Let's review installing MongoDB on Windows.
It just took a couple of steps, we started out by downloading and running the msi installer, and then we realized, well, we don't actually have access to mongo or mongod or any of the other tools from our command line, so what are we going to do? Well, it turns out that it got installed into this location, and notice that the number may vary based on the released version of MongoDB, so we went in there and we actually added that to our path; and then we could type mongo, mongod, mongorestore and all the various tools we're going to have to use throughout the course.
And then we said all right, well, this is not going to run on its own, the way it gets installed on Windows is it assumes that there is a c:\data\db, something like that, it'll tell you in the error, but I don't like to put stuff just in c:\data, I kind of want to group stuff a little bit more, so we made some mongodb folders, we made c:\mongodb\data, \logs and \configs; so with those three, obviously logs go in logs, data goes in data, and then we have those two config files that we can use to run mongodb with all the various settings set for us.
We copied the configs over, and I'll include those in the source control so you guys can grab the Windows setup configs and, you know, just change the path to wherever you put things; and then you want to test those configs, so we're going to test them by saying mongod --config and the full path to the config. The command line one is meant to have no log file, so basically it spits out the log information to the command shell, that way you can see what's happening.
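A quick way to confirm the path change took effect is to ask Python where it finds each tool. This sketch uses only the standard library; the helper name `find_on_path` is made up here.

```python
import shutil


def find_on_path(tool_names):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in tool_names}


# The three tools mentioned above; open a new command prompt after editing PATH.
for tool, location in find_on_path(["mongo", "mongod", "mongorestore"]).items():
    print(tool, "->", location or "not on PATH")
```

If any of them show up as not on PATH, the bin folder has probably not been added yet, or the command prompt was opened before the change was saved.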
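Creating that folder layout by hand is easy enough, but as a sketch, the same thing can be scripted. The function name `make_mongodb_folders` is made up for illustration, and the demo below deliberately works in a temporary directory rather than touching your real drive.

```python
import tempfile
from pathlib import Path


def make_mongodb_folders(root):
    """Create data, logs and configs folders under root and return their paths."""
    root = Path(root)
    folders = [root / name for name in ("data", "logs", "configs")]
    for folder in folders:
        # exist_ok makes this safe to run more than once.
        folder.mkdir(parents=True, exist_ok=True)
    return folders


# Demo in a scratch directory that is deleted again afterwards.
with tempfile.TemporaryDirectory() as scratch:
    for folder in make_mongodb_folders(scratch):
        print("created", folder.name)
```

On the Windows machine from the video you would call it as `make_mongodb_folders(r"c:\mongodb")` instead.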
Of course, in the service version, the service actually won't install if there's not somewhere for the logs to go to because it knows you're not going to see anything so it has to log it for it to work.
All right, so this is just if you want to run it on the command line; then we're going to install it as a Windows service, so this time we use the mongo-service config, which is basically the same, other than having a log file, and we added the --install, okay.
And it was really important that we run that command prompt as an administrator not as a regular user otherwise it doesn't have access to alter the service configuration in windows.
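Putting those pieces together, the install command has three parts: the full path to mongod, the service config, and `--install`. The sketch below only builds that argument list; the function name and the example config file name are hypothetical, and the `<version>` folder is left as a placeholder because it depends on the release you installed.

```python
def service_install_command(mongod_exe, config_file):
    """Build the argument list that registers mongod as a Windows service."""
    # Full paths matter: the service does not necessarily share your PATH.
    return [mongod_exe, "--config", config_file, "--install"]


command = service_install_command(
    r"C:\Program Files\MongoDB\Server\<version>\bin\mongod.exe",  # placeholder path
    r"c:\mongodb\configs\mongod-service.cfg",  # hypothetical file name
)
print(command)
```

A list like this can be handed to `subprocess.run(command, check=True)`, but only from a process started as administrator, for the reason given above.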
All right, once this was all done, we went to the service config, we pressed start and then we're good to go.
Final thing you might want to do, adjust the startup mode to manual, depending on whether you want mongodb to start every time you turn on windows or only when you click the button in the service panel.
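To check whether the service is really up after pressing start, one option is to see whether anything is accepting connections on MongoDB's default port, 27017, on localhost. This is a standard-library sketch, not a MongoDB-specific test: it only tells you that something is listening on that port. The name `is_port_open` is made up here.

```python
import socket


def is_port_open(host="127.0.0.1", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused as well as timeouts.
        return False
```

With the service stopped, `is_port_open()` should return `False`; after you press start in the services panel, it should return `True`.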
|
|
show
|
4:11 |
Here we are on a relatively new Mac, you can see that this is MacOS Sierra, so relatively new, and if I come over here and I type mongo, no, there is no MongoDB installed here.
So what are we going to do— we're going to install and set up MongoDB, so those of you who use MacOS, like I do, you'll see that this is actually the easiest of all the various operating systems to set up.
Let's do it.
So we're actually going to start out at Homebrew, now you may already have Homebrew installed, which is awesome, and you can just run brew update, but if you don't then you're going to need to run this command; so we're going to run a script and pass it off to Ruby, this is going to install Homebrew, and if I hit enter, it's going to go do this. This one time it needs my password to make changes to the system, but in general, you should not run Homebrew with sudo, it even warns you if you do that, I believe.
Okay, I get the little ding, it looks like everything is all set up.
So now if I type brew, yeey, stuff happens.
So the next thing that I want to do is actually install MongoDB, so brew install mongodb, and just like that, after 15 seconds, 20 seconds something to this effect, it says MongoDB is all set up.
Now before I clear this and just run it, notice there's a couple of things, it tells us right away, right here how we get started, we can either run MongoDB as a service, and if I do this without sudo, it's going to run this as a service any time I log in, if I do it with sudo, it's going to basically do this as part of the machine startup, shut down, or I can just run MongoDB like so.
So let's go ahead and set this as a service, it takes a moment and now it's up and running.
So how do we know whether it's running?
Well first of all, if I type mongo, it has something there, it has an app there right you can see 3.4.4 is the one at the time of this recording, and now it's connected and there's a few warnings here about access control, this is worth considering, if this was a production machine I would be quite concerned about this, it's my developer machine, so I'm not.
Let me show you why I'm not.
Okay, so if we exit out of here, the other way that we could run MongoDB, it's already running, so this isn't going to work again, but we could run it passing this config file. What's interesting is, check out this config file, so if we go look at that, it has just a few things for us, it tells us where the log file is going, good for system services, where the data is going, and most importantly, it's listening only on localhost, 127.0.0.1.
I don't know what my public ip address is or I have both the ipv6 and ipv4, MongoDB is not listening to either of them, moreover, I also have my firewall turned on as much as possible, believe it or not, it is not turned on MacOS by default that is super, super suspicious to me.
But anyway, firewall's on, and we're not even listening on the public network interface.
So do not change that, make sure that you are listening on only local host or that authentication warning, that's going to have a whole different meaning.
When we get to the actual deployment section, where we deploy to production we're going to set up users, we're going to set up ssl, we're going to set up authentication, all of those kinds of things, among others, but for now, for the devmachine, I think listening on local host is probably sufficient for what we're doing.
So this MacOS is ready to take the class.
The other things we would need to do, is install PyCharm, install RoboMongo, and do we have Python 3— we do, we have Python 3 installed, I believe I actually already installed that earlier on this virtual machine.
So you're going to want to make sure you have Python 3 installed, RoboMongo, and PyCharm, they all have super simple installers that's basically either drag and drop or double click and follow the next button and you'll be all done.
Then the machine will be completely ready to take this class.
|
|
show
|
1:17 |
We're going to use Homebrew, if you don't like Homebrew you can actually download a tarball and unzip it and set up all the stuff, but Homebrew is much better on a couple of levels, so make sure you have Homebrew and that you update it, and then to install, we just brew install mongodb and wait about 20 seconds for it to do its magic. Then we're going to start up MongoDB, there's two ways to do that: we could say brew services start mongodb and that will actually register it as a system service that starts every time you log in; if you don't want to start it this way, that's fine, you can say mongod --config and point at the config file.
If you want to make any changes, well, there's the config file, you can just edit that with whatever editor you like, and you can change the security, you can change the ports, whatever you want to change about MongoDB, just change this file, and stop and then start the service, or just cancel out of running this mongod command, and then run it again.
Final thing is, if you brew services start mongodb, it's going to continue to autostart. If for some reason you want it to stop, I believe it makes a file, a plist file, and it puts it in your home directory for your user account in ~/Library/LaunchAgents/, something involving mongodb, so you can just delete that file, I believe, and that will stop it.
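If you want to see whether such a plist file exists before deleting anything, a sketch like this lists the candidates. It assumes the file lives in `~/Library/LaunchAgents` and has "mongodb" somewhere in its name (for example `homebrew.mxcl.mongodb.plist`, which is an assumption about Homebrew's naming, not something shown in the video); the function name is made up.

```python
from pathlib import Path


def find_mongodb_launch_agents(agents_dir=None):
    """Return plist files in the LaunchAgents folder whose name mentions mongodb."""
    if agents_dir is None:
        agents_dir = Path.home() / "Library" / "LaunchAgents"
    # A missing folder simply yields no matches.
    return sorted(Path(agents_dir).glob("*mongodb*.plist"))


for plist in find_mongodb_launch_agents():
    print("autostart entry:", plist)
```

This only reports what it finds; running `brew services stop mongodb` is the gentler way to turn autostart off than deleting files by hand.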
|
|
show
|
5:09 |
Are you taking this class using your Linux desktop?
Well, let's get your machine all set up and configured with MongoDB.
So here we are just on the homepage mongodb.com, I am going to go click download, and it's totally possible to click and say download the tarball, but we could also say I would like to see how to set this up on let's say Ubuntu, as you change this, notice different things happen, so if we go to Amazon it says here are instructions for installing with yum, that's the package manager there, if I go to Ubuntu, so here's the instructions for installing with aptitude, so we're going to go over here to that one, make sure you pick the right distribution, do we want to read this— no.
So there's a couple of steps that we need to go through and they're pretty simple, most importantly they just walk us through it. So notice here that there is a package named mongodb-org, let's try to just install that, sudo apt install that, oh, it needs my password, okay, and nope, there's no mongodb-org, darn, doesn't it say right here, here's how you install all the pieces?
It is true, but we got to go through a few steps to get there.
So first thing that we got to do is basically add a source to aptitude, so we're going to go over here, and we're going to set a key, so you're just going to go through a few copy paste steps, so we're going to do our apt key here, it takes a moment, and all right, that's all good, next thing to do is we're going to create a list file, all right, great.
Be really careful about the version numbers here, later is probably better, pick the one that matches yours.
So the next thing we need to do is run sudo apt update, and you can do apt-get or just apt, whatever, but you need to tell it hey, go pull from all the sources, now with the new one included, and just have that list there.
We don't need to back up anything, so go away.
Alright, now everything is registered, we're pointing at the right package source, we've updated it, now we can go over here and do our thing that we tried to do originally.
So we wanted sudo apt install mongodb-org, and this time it knows what that means, hey look, that's mongodb, mongos, which is the sharding server, the mongo server, the mongo shell, the mongo tools, I am just going to install them all.
Perfect, okay, so we have MongoDB all set up and ready to go, and now we can just type mongo, and it tries to connect, we have mongo now but we really need to start it.
So we started up mongod, great, now we can connect to it.
Awesome so it has some warnings again about access control when we get to the deployment chapter, we're actually going to set up Ubuntu as a cloud server with some of these errors removed, we're going to set it up with access control, with authentication, with firewalls, all sorts of things, but for now, for the dev version, we're going to just use this, okay.
So it looks like it's up and running, that's cool, now the last thing is maybe we want to configure our server; so, we can come over here to /etc/mongod.conf and you can see we've got our storage path, like here's where our data files are going to go, change that if you like; journaling, you generally want that on; it's going to be running with WiredTiger, there is an older style of database storage engine called MMAPv1, that's how things used to work, they've switched the default to WiredTiger because it's faster, I believe it's much faster for inserts and a little faster for reads; and here's where the log file goes, if it's not working and you want to see what's going on.
So most importantly though is this bit right there, this bindIp.
So the bindIp is set to 127.0.0.1, we should have our firewall turned on anyway, we shouldn't be exposing this port, but we're only listening on localhost. I think this machine actually right now has two IP addresses, one public IPv6 and one NAT IPv4 address, but it's not listening on either of them because of this, right.
So it's super important that this is here, otherwise if someone can get to this port on your machine and you don't set up authentication, bad things happen.
All right, so make sure this is here, and only change that on your staging servers and other things where you control the network much more carefully.
Again, we'll talk way more about this in the deployment section but for now, this should do for our development environment.
The other thing we'll have to do is set up RoboMongo and PyCharm, and make sure that we have Python 3 on here. I believe we do; 3.5.3 is more than recent enough, so we don't need to install any Python, but be sure to get RoboMongo and PyCharm, whichever version you want, set up so that you can follow along.
1:25 |
Let's quickly review how we installed MongoDB on Linux.
So the easiest way to do this is just go to mongodb.com, go to the download button, and pick your distribution in the little drop-down; that way you can copy along.
Here's what we copied and ran. We visited the Ubuntu setup page at mongodb.com, that's what I just described, and then we ran the add key step, which says, look, we're going to trust this particular source that we're about to give you, and then we said I'd like to go to the repo at mongodb.org and set this up, basically pointing apt at this package store here.
And then we're going to update, don't forget this step, make sure you update your local listing by pulling from all of the sources, and then we're going to apt install mongodb-org, and that will install all of the tooling and servers and client stuff. If you only want one of them, just the sharding server or just the server component, you can absolutely just install that piece; we saw listed on that first page I described that mongodb-org is actually kind of a meta package of four other smaller packages.
And then to start it we just say service mongod start; if you want to change the configuration, it's at /etc/mongod.conf.
Change that, just restart the service and it should pick up the changes right away.
55:07 |
3:54 |
So we've talked a lot about NoSQL document databases and MongoDB.
Now it's time to actually start using MongoDB.
So what we're going to learn in this chapter is twofold: one, how do you connect to it and manage it, with the management tools if you will, that is more or less the shell, and some additional tools, but also how do you query it from that shell.
So maybe in Python with a traditional relational database you might be using, say, SQLAlchemy to talk to the database, so you wouldn't necessarily use SQL, the language, in Python; but if you want to connect to the database directly and work with it, then you need to use DDL and SQL and things like that. There is the same parallel here in that we're going to use the shell and we need to use MongoDB's native query syntax, which turns out to be very similar to Python's under certain circumstances, so it's going to be serving a dual purpose there.
So the primary MongoDB shell is a command line tool, right; we just type mongo, the name of the server, some connection string options, and you can see all that in the title here in this terminal.
And then we just issue commands. If I want to go and use the training database on the server, I'd say use training; and if I want to go to the courses and find the one with id 5 and display it, not minified but in a readable version, I would say db.courses.find and I'd give it the little json thing, id is 5, and I'd say pretty, that is, db.courses.find({_id: 5}).pretty(). So this is going to be entirely done in Javascript: the statements that you type here, although you don't see any semicolons, are either shell statements like use training or, otherwise, they're pure Javascript.
So what we're going to do is we're going to learn the Javascript api to talk to MongoDB, to query MongoDB, to do all the crud operations, there's a find, there's a delete, there's an insert, there's an update, of course there's sorts, there's upserts, there's all the things you would do in a standard database, the query syntax uses sort of a json model to help represent either operators or hierarchies and things like that.
Now, you may be thinking, Michael, I came to a Python course, I don't want to learn the Javascript api, I want to learn the Python api. You will, you will learn the Python api for sure, and luckily, it's really, really similar. It's not identical: they made the Python api Pythonic and the Javascript one follows the idioms of Javascript, but other than the slight variations in naming around those ideas, they're basically identical. In Python we would use {"_id": 5} as a dictionary, here we use {_id: 5} as a json object; so on one hand, learning the Javascript api is more or less learning the Python api.
But on the other, if you work with MongoDB, if this drives your application and you actually work with Mongo, in a real way, you will have to go into the shell, you will have to talk to the database directly, you have to maintain it, and manage it, and back it up, and do all those things; in order to do that, you need to know the Javascript capabilities, the way to do this in Javascript, as much as you do the Python way.
Ultimately, the end game is to use something like MongoEngine which is equivalent to SQLAlchemy, sort of analogous to SQLAlchemy, in that we won't even be speaking in this syntax, but still, you'll need to know how these translate down into these queries because you might want to say add an index to make your MongoEngine perform much, much faster, things like this.
So we're going to focus on Javascript now, and then for the rest of the class, we're going to basically be doing Python, but like I said, in order to actually use, manage, run, work with an application that lives on MongoDB, you have to be able to use the shell, and to use the shell you do Javascript.
So just like anybody who writes web apps, we're all Javascript developers, if we write any form of web app, similarly here, if you work with MongoDB, we're all Javascript developers and we got to do just a tiny bit, but you'll find it like I said, it's super, super similar to what we're going to do in Python.
5:18 |
So let's connect to MongoDB, I already have it running as a separate process hidden away, we'll talk about how to run MongoDB later, you should have seen in the setup how to get it started and then we'll talk about the deployment side of things later in the class.
So MongoDB is running, it's running on the local machine on the default port, no security, nothing like that for getting started. It's only listening on 127.0.0.1, so it's not listening on the public network on my machine, and for that reason, more or less, plus firewalls, the authentication part we're going to leave off for a little bit, just so we can start from the beginning. Okay, the other thing I have is I have set up MongoDB in my path, so I can ask which mongo, and it comes back with something. What I actually did is I went to MongoDB and I just downloaded the tarball, and I unzipped it, and I sort of changed the naming around, so it's in this path here, so here's the actual executable.
Mongo is the name of the shell, mongod is the name of the server, d for daemon. In order to connect to MongoDB there's a ton of options we could give it, and like I said, when we get to the deployment and production stuff at the end, we'll have to pass all sorts of things like authentication, SSL flags, server names and whatnot, but in the beginning we can just type mongo.
And you'll see, right here, we're running 3.4.4 and it's connected to localhost 27017, that's the default port for standalone servers; 27018, 27019 and 27020 are reserved, or typically the defaults, for other types of things.
So my system is not exactly set up right, but it's not a production machine it's just my dev machine, okay.
So now we're connected, what do you do?
Well, probably the first thing you want to do is focus on a particular database, so you can say show dbs and it will show you the various databases, how large they are things like that, so we're going to work with the bookstore for our examples in this chapter.
Later, we're going to work on something that maps over to a car dealership, so those are the two databases that we're going to be working with, you can see that I have got some for my various sites here and things like this, I have actually broken it apart so like Talk Python the core data it's not really zero gigs, it's just rounding down, it's like 20 MB or something, but the analytics is half a gig here, and it's actually much more if you export it.
So we may have more than one database for our app like I have on my podcast, or you might just have one for the training site, like we do here.
Great, so now I want to maybe find a book in the bookstore, so how do I do that? The first thing you have to do is activate the database. You're going to say db.command, whatever that is, and give it some command here, where db refers to one of these databases; so the way we do that is we say use bookstore, like this. Now it says great, we switched to bookstore, and then we could say db dot something.
But first of all, what is the equivalent of tables? In MongoDB these are called collections, because they're not tabular, so we can say show collections, and this is what is contained inside of bookstore: there's a Book, case sensitive, Publisher, Test and User, ok.
So if I wanted to find the books, let's say db.Book.find, and let's say just limit one, so it doesn't go crazy on our shell here. So basically, the way it works is we connect, we figure out what database we want to work with, we say use that database, and then we say db.collection name and we typically fire these commands at the collection.
Now, what's interesting is what's missing here: we never ran a create database or, inside of here, a create table or create collection command. Like Python in some ways, MongoDB is very, very dynamic, so if we wanted to create a table, let's go and just create a collection, we won't create a whole new database. So what do we have, we have bookstore and we have those four collections, Book, Publisher, Test and User. If I want to create one called logins, or let's say just Log for history, I could even issue a find command against that and there's just nothing, it's just empty.
If we go up here and we say what's here, there's no Log, but if I actually try to interact with this, we'll talk about inserts in a little bit, but let's just really quickly see how this works, I would just say let's say action is view, something like that. If I insert this, now, just crazily, this works and something was inserted; if we look there's now a Log, so db.Log, case sensitive, .find, there, and it inserted this thing, action with the value view, and it gave it the id, whatever it is; this is called an object id, we'll talk about that later.
Okay, so this shell is how we work with MongoDB, if I want to get rid of it, I could go here and say drop collection, just drop, right, and now log is gone again.
So this is your base level admin tool and it works everywhere, so we could ssh into our Linux server on Digital Ocean, or on aws or whatever, and we could do this, we could even sort of tunnel this through there, but we're going to see that there are actually some better options any time we're running somewhere where we can even just tunnel over to the server.
1:27 |
So let's review the main concepts around using the shell.
Remember, you just type mongo, enter, and it will connect to your local default, everything default: default port, default localhost, no account etc. Once we're connected, we'll be in here, and it'll say connected to the server, what version of the shell, what version of the server; 3.4.4 is the latest at the time of this recording, but maybe not at the time you are watching it, and like all things that are servers, newer is better.
Ok, so first thing that we might want to do is say what databases are here, and we do that with the show dbs command, we hit enter, and it shows you the various databases that are listed.
Then next we want to activate one, so that we can issue commands to it through the db.collection or other high level operations, so we'd say in this case let's work with talk_Python, so we'd use talk_Python.
And it'll say great, we switched to database talk_Python, and if you're wondering, you can always type, as you saw me do, db, enter, and it will say talk_Python, cool. And then we could say, well, what collections exist in talk_Python?
This is actually pretty straightforward, the document design I think is pretty interesting but there's not many collections, so we have episodes, guest, reviews and then while developing it, I turned on profiling to see where it was slow and where it was fast, where I need indexes, we'll talk more about that near the end of the course.
So we have these four collections, and now if we want to find an episode we'd say db.episodes.find and give it some search, or sort, or something to that effect.
So this is how we get started and connect with the shell.
10:13 |
Now let's see how we do probably the main thing that you do in databases and that is query.
So here we are in the Mongo shell still, and I'm using the bookstore database, so what I want to do is find some particular books. Remember, we have Book, Publisher, Test (we can really remove Test, it doesn't actually do anything) and then User, so those three are the ones actually used.
Let's go and remove test just so that it is gone.
Now we have the ones we're actually using.
Now, when we're getting started, it's probably worthwhile to just say db.Book.find as an empty query just like kind of select star if you will, you know show all of the things that are in there, there, that's totally obvious what that is, right, you see the structure, right if you can like kind of exist in the matrix you could entirely see the structure there, but let's do that better.
Notice a certain number of items, I don't know it's 20 or 50 were returned there's actually like a quarter million books, so we didn't get them all which is good, so if we want more, we just type "it" and it will actually get more and so on.
Okay, so this is not super helpful, let's make this more helpful; so here we can go over and say I want this to be like that pretty and in fact, if I just want one of them I could just say limit this to the first one, or let's just say limited to two so we see a couple of examples.
There, now we're starting to see the structure.
Let's go here, ok so now we've got a book, right here you can see the top level document, it doesn't put the results in arrays, like it doesn't print out an array it just prints a whole bunch of individual results in this case two, so here we have our id, there's always an underscore id in the database like this is the name of the primary key, you can have it look different in Python, you can say this thing maps actually to the primary key when you are modeling this with classes and so on, but down at the Javascript and the MongoDB level, this is always the name of the primary key, if you don't give it one when you insert the thing, it's auto generated, and so if you don't have a great reason to care about what id looks like probably using this object id is the best bet.
So our books have isbns, they have titles, they have authors, I kind of wish it was little more Pythonic with lower case ts and as, but this database came from somewhere else and it's like this so we're just going to roll with it.
Ok, so we've got dates notice, json doesn't support dates nor does it support object ids, but the results here do and so dates and object ids are sort of extensions that bson brings to json.
Alright, and then we have a list of these image url objects which have both the size and url, and so on, and then they also have ratings. This one has one rating, so not too many; let's look at the next one, it has a lot of ratings, right. So each rating has a user id, that is, a foreign key, not a constraint but a soft link that is not enforced by the database, over to the User collection, and then a value here. So this is what this database looks like: we have a title, we have an isbn, and these are like the flat things, and then most importantly we have the ratings, which we'll go play with a little bit. So let's start by asking this question about the books.
So the way it works is db.Book.find, put some space in here. The way MongoDB queries work, there isn't a where clause; basically what you put in here is the where clause, and the way we do it is we pass what I think of as a prototypical json object. So the json object that we're going to put here maybe would have something like this, let's say title, case sensitive remember, is "From the Corner of His Eye". If I put this in, here we go, "From the Corner of His Eye", now this is a book that should be in this database and we'll be able to do some queries for it. What this says to MongoDB is go to the Book collection and find every single document that has the title equal to "From the Corner of His Eye", and I think that there's more than one, let's see, yes, so we can come over here and we can do a .count, there's three. Alright, so this is nice, however, what you saw come back there, even if I did a pretty, was a lot, because we've got the ratings and the image URLs and this one has a crazy amount of ratings and so on, so we might want to get less. So with this find thing, let's put it here, this first part, where the title is this, that is the where clause, but in SQL you could also say select title, id, isbn from this table, and we can do that in MongoDB as well, we can do this like sub projection. So I can come down here and say I'm interested in title and anything that's truthy in Javascript, so I could put 'hi', I could put one, I could put true, I like to put one, I don't know why, and let's say we want the isbn, this is case sensitive as well, and watch what comes back now. Okay, so there's our three records; now interestingly, each one has three keys and we specified two.
So the way it works is Mongo is like, you're probably going to need that primary key, so unless you explicitly say you don't want it, you're getting it, right. So if we want to do this again, I could come over here and explicitly suppress _id by putting something falsy here like zero, and then I just get isbn and title, okay.
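To make that projection rule concrete, here is a minimal sketch in plain Javascript that mimics what we just saw: _id comes back unless you switch it off. This is not MongoDB's implementation, just an illustration you can run anywhere; the project helper, the sample document, its made-up ISBN, and the capitalized field names Title and ISBN are all assumptions for the example.

```javascript
// Plain-JavaScript sketch of find() projection rules; not MongoDB's own code.
// Shell equivalent (field names assumed):
//   db.Book.find({Title: "From the Corner of His Eye"}, {Title: 1, ISBN: 1, _id: 0})

// Hypothetical helper: keep only the requested fields of one document.
function project(doc, projection) {
  const result = {};
  // _id is included unless the projection explicitly turns it off.
  if (projection._id !== 0 && projection._id !== false && "_id" in doc) {
    result._id = doc._id;
  }
  for (const [field, include] of Object.entries(projection)) {
    // Any truthy value (1, true, "hi") means "include this field".
    if (field !== "_id" && include && field in doc) {
      result[field] = doc[field];
    }
  }
  return result;
}

// Sample document with made-up values.
const book = {
  _id: "made-up-id",
  Title: "From the Corner of His Eye",
  ISBN: "0000000041",
  Ratings: [],
};

const withId = project(book, { Title: 1, ISBN: 1 });
const withoutId = project(book, { Title: 1, ISBN: 1, _id: 0 });

console.log(Object.keys(withId));    // [ '_id', 'Title', 'ISBN' ]
console.log(Object.keys(withoutId)); // [ 'Title', 'ISBN' ]
```

Two fields requested, three keys returned, until _id is explicitly set to zero.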
So let's go back to this.
Now suppose I want to find the book with this title and this isbn, how do I do an and here?
Well, the way these queries work is that little query document must basically be a subset of the document it matches: every property in it has to match. So when I say title is "From the Corner of His Eye", that matches the title, but I could equally come up here and do this again and say oh, also that isbn. Actually I don't know what it's supposed to be, let me run this real quick; let's say we're looking for this one, the one that ends in 41. So now I could come over here and say that isbn; in the Javascript shell you don't technically need to put quotes around the name there, but this value is a string, so it goes like that, right, see it starts with zero, it wouldn't just be a number.
So now, if I run this, I just get the one, so this is the and clause, select star from book where title is this and isbn is that so you can create these documents to basically and together all the pieces that you need.
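Here is a small sketch of that and behavior in plain Javascript, so you can see the rule without a server: a document matches only if every field in the prototypical object matches. The matchesAll helper, the sample books, and their ISBNs are made up for illustration, and the field names Title and ISBN are assumed; the real matching happens inside MongoDB.

```javascript
// Plain-JavaScript sketch of "every field in the filter must match".
// Shell equivalent (field names assumed):
//   db.Book.find({Title: "From the Corner of His Eye", ISBN: "0000000041"})

// Hypothetical helper: true when doc has every key/value pair in filter.
function matchesAll(doc, filter) {
  return Object.entries(filter).every(([field, value]) => doc[field] === value);
}

// Made-up sample data: three editions of one title plus another book.
const books = [
  { Title: "From the Corner of His Eye", ISBN: "0000000041" },
  { Title: "From the Corner of His Eye", ISBN: "0000000042" },
  { Title: "From the Corner of His Eye", ISBN: "0000000043" },
  { Title: "Classical Mythology", ISBN: "0000000044" },
];

const byTitle = books.filter((b) =>
  matchesAll(b, { Title: "From the Corner of His Eye" })
);
const byTitleAndIsbn = books.filter((b) =>
  matchesAll(b, { Title: "From the Corner of His Eye", ISBN: "0000000041" })
);

console.log(byTitle.length);        // 3
console.log(byTitleAndIsbn.length); // 1
```

Adding a second field to the filter can only narrow the result, which is exactly what an and clause does.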
So this is all well and good, this looks a lot like a standard relational database type of thing, but remember when I talked about documents, I said their superpower is they get this nested thing. So let's go over here and just throw this back, we'll just get one of them so we can look at it again; let's get the next one, so for paging you would use skip and limit. Their superpower is that we can reach into, say, the ratings and say I'd like to find all of the books that have a rating of, let's say, eight, or all the books that have been rated by a particular person. Let's do this, I don't know how many books that person has rated but we can find out in a second. So I want to find all the books that have ratings where the user id was that particular id, right there, so how do we do that? Let's come up here again, we don't need this anymore. In here we kind of want to say something like this, like rating, and then, if this was an object, we would navigate it with dot syntax, but it's not going to work out so well here, so this would be user id like this. Let me just paste this in so I can get my little object id out; when you're querying by object id you just say ObjectId. The question is, is that dotted name valid Javascript, and the answer is no, it is not.
So any time you have this sort of hierarchy traversal you have to put quotes, right; if it's a single item, quoting is optional, but if you're doing something funky like an operator or something like this, then it's safest to do it like this.
So let's just show, let's select back here we're just going to say give me the title is one and I don't even care about the id; if I can write a query like this, go down into the ratings, and show me all the ones that have this user voted, that means even though I've kind of pre-joined and embedded this ratings concept, I can still query it as if it was a separate table, separate collection and that's the document databases superpower, let's see if I can get it to work now; apparently I did not get it to work what am I missing here?
Oh, notice I said rating and the actual schema is Ratings, plural; I think that's good, it's representing a pluralized thing down there. So the problem was I did this. Now notice MongoDB didn't crash, it didn't go oh, there's no such thing as a rating field on this, it just said no, nothing matches that. So it's really powerful, it means it's super easy to sort of evolve and work with the data and it doesn't break under the tiniest, lightest of schema changes, pretty good, but you just got to be careful. So let's try it again.
There we go, so apparently we could even ask because that was not all of them, there's a lot of books this person has rated so I think this data might be partly just generated okay, so here these are the books that that person rated, let's find another, let's try to do this again, come down here I will get this object id, we can say I want to find the books rated by that person how many are there— 107.
And if I actually wanted to see what they are, there's the titles of the first set of them, notice that's really, really fast, I think I have indexes set up right now we'll talk about indexes when we get to the performance part of this course, but we can do these queries down into the ratings embedded part the embedded documents into the books just as if they were their own table, I told you there's about a quarter million books, there's 1.25 million ratings so notice the response time here almost instant, in fact it's like milliseconds.
So not only can we do this query, we can do this query extraordinarily fast.
All right, so this is one of the things that makes document databases interesting and also challenging, how do you define the documents, should you embed them, should you not, we'll get to that in a whole different chapter, but for now, just know it does have this super power to reach down in here and do these queries.
2:33 |
We've explored the shell a little bit, we've done some querying, let's look at the concepts behind it, so you have them nice and concise, in case you want to come back for a reference.
So if we want to query, say, the Book collection in the bookstore database where the title is 'From the Corner of His Eye', we can type find and give it this little prototypical json object, hit enter, and boom, everything comes back that has the same title: different isbns, different primary keys and so on, but different releases, different versions, maybe one is paperback, one is Kindle, who knows. So the idea is we're going to come up with these prototypical json objects, here title: whatever the title is.
Now, if we want to do more than just what is the title here we want to say give me the book with the title this and the isbn that, given that the isbn is probably unique, we could maybe just search for it instead, but we want to demonstrate the and clause, right.
So here we'll give it this prototypical sub document with the title being the title we're looking for, and the isbn being this one.
And notice, now we only get one record back, so our prototypical document is basically an and clause, every field must match.
We also saw that one of the excellent ways to group related data, this would be what you might call an aggregate in domain driven design, is to embed items into the document. So here we have ratings, and ratings have little sub objects, sub documents, that have things like user ids and values. As I said at the very beginning, and as you saw in the example, the superpower of these document databases is that they can query into them, so I want to find all the books that have been rated by this highlighted user id; how do I do that?
So we just pretend we're traversing the objects, Ratings.UserId, so down here we'll say find Ratings.UserId and we give it the object id that we're looking for. Because Ratings.UserId is not a valid bare key or field name in a Javascript object, we have to put it in quotes, 'Ratings.UserId', but other than that, it's basically the same idea, and here we get back all the books that have been rated by this particular user.
So we just use this dotted notation to traverse the hierarchy. One other interesting point: maybe ratings just contained numbers, like it was 7, 5, ...;
then, if I want to find all the books that have a rating of seven, I could just say find ratings: 7, I don't have to do this dot notation or anything like that. But because I'm looking within that document inside ratings, regardless of whether it's an array or a single rating thing, you do it like this, with that dot notation.
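The dotted traversal can be sketched in plain Javascript like this. It's only a rough model of the idea: walk the path one key at a time, and when you hit an array, look inside each element. The helpers valuesAtPath and matchesPath are hypothetical, the string ids stand in for real object ids, and the Value field name is an assumption; MongoDB does the real work on the server.

```javascript
// Plain-JavaScript sketch of dotted-path matching into embedded documents.
// Shell equivalent: db.Book.find({"Ratings.UserId": ObjectId("...")})
// Note the quotes: {Ratings.UserId: x} is not valid JavaScript.

// Hypothetical helper: collect every value reachable at a dotted path,
// stepping into array elements along the way.
function valuesAtPath(doc, path) {
  let current = [doc];
  for (const key of path.split(".")) {
    const next = [];
    for (const item of current) {
      const candidates = Array.isArray(item) ? item : [item];
      for (const c of candidates) {
        if (c !== null && typeof c === "object" && key in c) {
          next.push(c[key]);
        }
      }
    }
    current = next;
  }
  // A final array of plain values (e.g. Ratings: [7, 5]) is searched too.
  return current.flat();
}

function matchesPath(doc, path, value) {
  return valuesAtPath(doc, path).includes(value);
}

// Made-up sample data; string ids stand in for ObjectIds.
const books = [
  { Title: "Book A", Ratings: [{ UserId: "u1", Value: 7 }, { UserId: "u2", Value: 9 }] },
  { Title: "Book B", Ratings: [{ UserId: "u2", Value: 5 }] },
  { Title: "Book C", Ratings: [] },
];

const ratedByU1 = books.filter((b) => matchesPath(b, "Ratings.UserId", "u1"));
const ratedByU2 = books.filter((b) => matchesPath(b, "Ratings.UserId", "u2"));

// The simpler case: ratings stored as bare numbers needs no dot at all.
const flatBook = { Title: "Book D", Ratings: [7, 5] };

console.log(ratedByU1.map((b) => b.Title));      // [ 'Book A' ]
console.log(ratedByU2.length);                   // 2
console.log(matchesPath(flatBook, "Ratings", 7)); // true
```

The same path string works whether Ratings holds one sub document or an array of them, which is the point of the dot notation.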
5:35 |
The shell is pretty nice and it's ubiquitous in that you can run it anywhere you ssh to and things like that, so that's really good, and this is more or less the tool that MongoDB ships. There may be something else coming along, but there's a really great, better shell, in my opinion much, much better, I really love it, it's called Robomongo. We talked about Robomongo in the setup, how we installed it and so on, so let's see how it works and how it compares to the shell here.
So here it is, you can see it hanging out down there, and we click start. Maybe it's empty, let's go ahead and start from scratch, so now if we open it up it's empty. Let's create a connection, I'll just call this local or whatever, and it's going to default to localhost 27017, all this stuff turned off, things like that, and we'll just say save and connect. Now you can see, let's put these a little more side by side, you can see over here we have our bookstore or charge watcher and so on.
And now we have the benefit that we can open this up we can look at the book, we could say explore the indexes we could even go over and say edit this index and make changes, make it unique, do some other things about sparseness and so on.
We'll talk more about that later.
Over here, we could say something like use bookstore and it switches there, the equivalent over here would be something like right click and say open shell, how interesting, so I know a lot of people prefer the command line interface but what's really awesome about Robomongo is you have the entire cli right here, so I could say something like db.Book, notice the auto completion, book, publisher, user, auth, etc, .what do you want to find, find and modify, find one, let's find one where, what did we have before, we had something with the title and let me go back and find the title we were using— so here we can say title like this, and now if I run it, I get a result down here and I can explore it, I can see the ratings and so on, and this, you know if we run this over here, I get I did the little projection, I could do that as well.
So I get this tree version and I actually don't really love this too much, so you can actually just switch it to the text version here as well, and you get color coding, highlighting, all sorts of stuff.
You also get this version which is kind of a flat version, I never use this but you can use it if you want.
What is really cool is I can come over here and say I want to maybe edit this document, if I come over and do a find, I think— here I get three, now if I do a straight find, not a find one, I can actually go and edit this, so if I wanted to change the date that this was done on, so let's say 2011, save, rerun this, this is one with so many ratings, here, this is the one I changed, number 2, now it's 2011.
So of course I could run an update command, but you can do all sorts of interesting sort of UI things so I really really like using Robomongo, because it's one hundred percent as capable as the shell so for example, I could come over here, this is like just typing Mongo you could create variables, I could say var page, let's do something with a paging here, so I come and say this now notice, this uses get collection and it doesn't use the .Book like this, I think it does that because it gets better intellisense or auto completion, not really sure, anyway, you can do it either way, they are equivalent.
Now, let's go over here and imagine we're going to do some paging, so first of all, let's just select the titles remember the thing I did with the projection, exactly the same thing here, there we go, I forgot to rerun it, okay.
So rerun it, now we get just the titles, there's "Classical Mythology", "Clara Callan" and "Decision in Normandy" and so on. So suppose we want to do paging; I basically want to show you that this is like a full Javascript shell plus kind of an editor, so watch this. If I put some semicolons in here, I can type, let's say, var page size is three, var page num, like what number are we on, let's say we're on page two; then down here I could say ok, this is what I want to do, and I could use skip and limit to actually do the paging. So I could say skip, and we'll do page num minus one, times page size, that's how many we want to skip, and then we want to limit it to page size, like this. So now I should get, let's see, go back to the beginning, three things per page, we're going to be on page two, so it should be the Flu, the Mummies and the Kitchen God's Wife, and that's it.
Oh, by the way, if you highlight something, it just runs that expression, which apparently evaluates to two. Run the whole thing, and notice Flu, Mummies, Kitchen. So we can do this basically as much as we want to type up here, but it's also a little editor; I mean, in almost every way this is better than the shell, and I could even use this to connect to my remote MongoDB server using ssh tunneling. Again, we'll talk about those kinds of things when we get to the deployment section, but for pretty much the rest of the course we're going to be using Robomongo, because it's just better in every way, in my opinion.
All right, and as you saw, Robomongo installs on Windows, Linux and macOS, so it's all good.
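The paging arithmetic is worth writing down once. This sketch runs in plain Javascript against an in-memory list, so it only models what skip and limit do; the titles are shortened the way I said them out loud, the seventh entry is a placeholder, and the variable names are just the ones I'd pick.

```javascript
// Plain-JavaScript sketch of skip/limit paging over an in-memory list.
// Shell equivalent (projection field name assumed):
//   db.getCollection("Book").find({}, {Title: 1, _id: 0})
//     .skip((pageNum - 1) * pageSize).limit(pageSize)

const pageSize = 3; // items per page
const pageNum = 2;  // 1-based page number

// How many items come before the requested page.
const skipCount = (pageNum - 1) * pageSize;

// Titles as spoken in the lecture (shortened); the last one is a placeholder.
const titles = [
  "Classical Mythology",
  "Clara Callan",
  "Decision in Normandy",
  "Flu",
  "Mummies",
  "Kitchen God's Wife",
  "Placeholder Title 7",
];

// slice(start, end) plays the role of skip(start).limit(end - start).
const page = titles.slice(skipCount, skipCount + pageSize);

console.log(skipCount); // 3
console.log(page);      // [ 'Flu', 'Mummies', "Kitchen God's Wife" ]
```

Page one skips zero items, page two skips one page's worth, and so on; that's why the formula subtracts one from the page number.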
3:44 |
Now, let's use Robomongo, our shiny new shell that I contend is better than just the cli one, let's use it to explore some more advanced query filtering and sorting options.
So here's just a blank find showing me all the records, how many are there in the database, there's 271 thousand books, so this is the same database we've been playing with for a while now.
So let's ask some questions about the ratings.
So we're going to go into the ratings array, which contains a bunch of objects, which have values, and I want to say how many of them have the value nine. What question is that actually answering? It's answering how many books have been rated at some point by somebody with a nine. How about with ten? A little bit more, so there are some books that were really, really popular, people loved them; this is a 1 to 10 type of scale, I think it might also include zero.
So that's great, this is our prototypical json object here.
However, what if I want to say show me all the books that have a moderately high rating? What does that mean? Let's say it has an eight, a nine or a ten as a rating. How do I express that as a prototype?
You can't do it, and that's why MongoDB has something slightly more complex and nuanced than just a straight comparison. What we had before is like an equality query, so instead of putting a value here we can put a little sub search document, and in it we can say I'd like to apply an operator instead of an exact match. Here the operator is greater than or equal to; the way you know it's an operator is the dollar sign, so $gte, and then we put the value of eight. So show me the books that have a rating of eight or above, and tell me how many there are, because we're doing a count. Let's run that; look at that, 98 thousand books have a rating of eight, nine or ten.
Does that mean their average rating is eight, nine or ten? No, it means somebody somewhere has rated it eight, nine or ten.
So we also have things like greater than, without the equal, just flat up greater than so that's nine or ten right there, so we have a number of these operators, greater than, greater than or equal to, and so on.
Another one that's really interesting is in, and this is super important for really powerful queries. When we have documents that contain sub arrays of other documents, you can think of those as basically being pre-joined, but when you normalize things so that they are not contained within each other, you need a way to go back and basically do the join, and this in operator is the key to making that happen. That's not really what's happening here, because this is a subdocument, but it's the same operator. So what we can do is say find me the books that have, let's say, prime numbers as ratings. It's kind of silly, but whatever, here we go: those are the prime numbers between one and ten, and we say I would like to find all the books where one of the rating values, remember they have multiple ratings, is in this set. The way this usually manifests is I go to the database and pull back some item, and it's got a sub array of, let's say, ids, and then I can go back to the database and say give me all the items in this other collection where the id is in that list of ids. An example might be the Talk Python Training site: remember the course contains all the chapter ids, and the single query that gives me all the chapters for a course uses this in operator. So let's try that.
So there we go, apparently 69 thousand have a prime rating at some point not that that means anything, but it shows you how these operators work.
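To make the "somebody somewhere rated it" meaning concrete, here is a small pure-Python sketch of how $gte and $in behave against an array of rating subdocuments. It is a simulation, not MongoDB itself: the field name `value`, the helper `any_rating_matches` and the three sample books are all assumptions for illustration.

```python
import operator

# A few of the comparison operators from the lecture, mapped to Python.
OPS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, allowed: value in allowed,
}


def any_rating_matches(book, condition):
    """Each operator in condition must be satisfied by at least one rating."""
    return all(
        any(OPS[op](rating["value"], arg) for rating in book["ratings"])
        for op, arg in condition.items()
    )


# Made-up sample data.
books = [
    {"title": "A", "ratings": [{"value": 9}, {"value": 3}]},
    {"title": "B", "ratings": [{"value": 4}, {"value": 6}]},
    {"title": "C", "ratings": [{"value": 10}]},
]

# The same dictionaries would be the right-hand side of a filter document.
high = {"$gte": 8}
primes = {"$in": [2, 3, 5, 7]}

high_titles = [b["title"] for b in books if any_rating_matches(b, high)]
prime_titles = [b["title"] for b in books if any_rating_matches(b, primes)]
print(high_titles, prime_titles)
```

Book A shows up in both lists because one person gave it a nine and another gave it a three; neither query says anything about the average.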
|
|
show
|
2:02 |
So here's the list of the querying operators, all the complex ones.
So we saw that normally we pass these prototypical json objects, say ratings.values is five, and that just does an exact match. But we saw that that doesn't really solve all our problems; often we want ranges or properties like I want all of the ratings that are greater than eight, things like that. So instead of putting a number into that prototypical json element, we're going to put an operator. We might say $eq for equality, which is kind of the same thing, but the others are not: $gt for greater than, $gte for greater than or equal to, $lt for less than, $lte for less than or equal to, and $ne for not equal to, so you could say I want to see all the ones where there's no rating that has a value of ten.
And we talked about the in operator; this is kind of your two step join process that we'll talk much more about when we get to the Python side of things. There's also the negation of that, $nin, not in the set.
So here's an example of how we might use the greater than or equal to operator to find all the books that have a rating of nine or ten, that are super highly rated by at least one person. Remember, this is not saying every single rating in there has to match, but that there exists a rating which is a nine or a ten.
We also have some joining or combining operators. We can say and pretty easily by just having all the properties we're looking for in a single document, but if for some reason these are coming from multiple places you can actually combine them with the and operator, so that's nice. What you really sometimes need is the or clause, I want this or that, and there's no way to do that with those prototypical json objects, but the or operator will let you do this.
You also have not and nor, so neither of these, sort of the negation of an or. Now I recommend you check out this link at the bottom for each one of them, for things like where the operator appears: does it appear on the right hand side of the property or field name or the left hand side? It kind of depends on the type of operator you're using, so you can just click on the or, the and and so on in the docs and it'll give you a little example.
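The combining operators sit on the left, as keys, with a list of sub-queries on the right. A rough pure-Python sketch of that shape follows; `matches` is a hypothetical evaluator that only understands top-level equality plus $and, $or and $nor, and the books are made up.

```python
def matches(doc, query):
    """Evaluate a tiny subset of the query syntax against one document."""
    for key, expected in query.items():
        if key == "$and":
            if not all(matches(doc, sub) for sub in expected):
                return False
        elif key == "$or":
            if not any(matches(doc, sub) for sub in expected):
                return False
        elif key == "$nor":
            if any(matches(doc, sub) for sub in expected):
                return False
        elif doc.get(key) != expected:
            # Plain field: exact match only in this sketch.
            return False
    return True


# Made-up sample data.
books = [
    {"title": "Old One", "published": 1993},
    {"title": "New One", "published": 2016},
    {"title": "Other", "published": 2001},
]

either = {"$or": [{"published": 1993}, {"title": "New One"}]}
neither = {"$nor": [{"published": 1993}, {"title": "New One"}]}

or_titles = [b["title"] for b in books if matches(b, either)]
nor_titles = [b["title"] for b in books if matches(b, neither)]
print(or_titles, nor_titles)
```

Listing several fields in one plain document already behaves like an and, which is why the explicit $and is needed less often than $or.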
|
|
show
|
1:22 |
Now sometimes you don't want all the data back. In the shell, where you're printing it out, it probably matters, but in practice, in your app, you rarely care from a display or interaction perspective whether some field or list that you are not using comes back with data or not. From a performance perspective, though, you very much may care.
Suppose that you have a document that's 50k in size and all you want back is the isbn and the title and those are 1k, and you're getting a bunch of them back, it turns out that that can make a really big difference in terms of performance.
So whether it's for display purposes or it's for performance network purposes using this second argument here we can say only return the isbn and the title, and don't give me all of the ratings, don't give me the images, everything else that might be in this book.
So we run this, and we get back these documents, and notice we have the isbn and the title like we asked for, but we also have the _id. Unless you explicitly forbid the id from coming back, the id always comes, and once you pass a document here for the projection, everything else defaults to not appearing unless you list it.
If for some reason you don't want the id to come back, just say _id: 0 or false, and then it will have exactly isbn and title.
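Here is a minimal Python sketch of that projection rule: listed fields come back, the _id comes back unless it is switched off, and everything else is dropped. The `project` function is a hypothetical simulation that handles inclusion projections only, and the sample book is made up.

```python
def project(doc, projection):
    """Apply an inclusion projection such as {"isbn": 1, "title": 1}."""
    include_id = bool(projection.get("_id", 1))  # _id is on by default
    wanted = {name for name, flag in projection.items()
              if flag and name != "_id"}
    return {
        name: value
        for name, value in doc.items()
        if name in wanted or (name == "_id" and include_id)
    }


# Made-up sample document.
book = {
    "_id": 42,
    "isbn": "0000000000",
    "title": "Some Title",
    "ratings": [{"value": 9}],
    "image_url": "http://example.com/cover.jpg",
}

with_id = project(book, {"isbn": 1, "title": 1})
without_id = project(book, {"isbn": 1, "title": 1, "_id": 0})
print(with_id)
print(without_id)
```

The ratings and image fields never leave the "server" side in this sketch, which is where the bandwidth saving comes from on large documents.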
|
|
show
|
2:03 |
So here's an interesting question— what if I want to find all the books where user 720 has rated that book exactly a nine.
You would think that this would do it, right, we're using both values in this prototypical object or this document here and it says that the book is going to have to have a rating of nine and user id 720 has rated it.
However, when we run this, you'll see we get mixed results.
The bottom one looks perfect: we got a book with the user id 720 and a value of nine in the ratings, great. But this other one, the red one, what's up with this?
Well, user 601 rated this as a nine, and user 720 actually hated the book, they gave it a one.
However, taken as a whole, does the book have a rating by user id 720— yes, does it have a rating of nine— yes, so it matches this and clause.
So if you're looking for this exact subdocument match and the thing you're looking in is an array, there's a catch. Ratings is an array of documents; if ratings were a single subdocument, this would work fine. But if it's an array and you want to say that one element in that array must itself match both the value and the user id as I've specified here, you need a different query operator, and that is dollar elemMatch, $elemMatch. So you can run this and it'll look down inside and find all the books with a rating where both the user id is 720 and the value is nine.
So this is a slightly more complex version that you have to use, because otherwise you run into that problem we had before, where somebody voted a nine and user 720 voted, but it was not user 720 who voted nine.
So this is a little bit different than if you were working in, say, SQL, a traditional tabular language, because there you don't ever have this kind of duplication within one result, so it would be a lot simpler. This is something you've kind of got to get your head around a little bit. Luckily you don't use it very often, and if you are using the higher level things like MongoEngine, you won't run into it, but down here at the shell or in PyMongo, you have to be really careful if this is actually the question you're trying to ask and answer.
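The difference between the two queries is easy to see in plain Python. The sketch below simulates both behaviors on two made-up books; the field names `user_id` and `value` and both function names are assumptions, and the comments show the shape of the filter each function stands in for.

```python
def dotted_match(book, user_id, value):
    """Like {"ratings.user_id": u, "ratings.value": v}:
    each condition may be satisfied by a different array element."""
    ratings = book["ratings"]
    return (any(r["user_id"] == user_id for r in ratings)
            and any(r["value"] == value for r in ratings))


def elem_match(book, user_id, value):
    """Like {"ratings": {"$elemMatch": {"user_id": u, "value": v}}}:
    a single array element must satisfy both conditions."""
    return any(r["user_id"] == user_id and r["value"] == value
               for r in book["ratings"])


# Made-up sample data mirroring the two results in the lecture.
books = [
    {"title": "Mixed", "ratings": [{"user_id": 601, "value": 9},
                                   {"user_id": 720, "value": 1}]},
    {"title": "Exact", "ratings": [{"user_id": 720, "value": 9}]},
]

dotted = [b["title"] for b in books if dotted_match(b, 720, 9)]
exact = [b["title"] for b in books if elem_match(b, 720, 9)]
print(dotted, exact)
```

The first query returns both books, including the one user 720 rated a one; only the element-level match answers the question that was actually asked.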
|
|
show
|
3:24 |
So, we're pretty good at finding and filtering down our result sets.
The other super important thing that databases do is sort results, put them in order. So I would like the best selling book, then the second best, and then the third best in this category; that's a perfect case for filter by category, order by best selling, right.
So how do we do that in Mongo?
Let's go over here, and it turns out that there's a sort that we can run, and the sort takes something, kind of like our projection does here. Before that, let me just show you that if I run this, it is not in order: here we have c, c, d, f, and then t, p, w, and eventually we're back to something before w. It is not sorted by title, and not sorted by published date either; these three seem to be descending but the next one is not, ok.
So it's not sorted at all, it's just however it comes back, probably by object id or something like this.
Anyway, let's go and sort it. Suppose I would like it sorted by title. Very much like our filter, or maybe even closer, like our projection here, I can say I would like to sort, and then this document goes here. This one is ascending, so something that is positive means ascending; if it were negative, it would mean go in reverse order.
So let's run this. Now you can see this is the beginning of the titles, this exclamation mark and then some other exclamation marks, and then let's get past the symbols, a lot of symbols. Anyway, you can see this is sorted by the title, not sorted by date: 1994, 1993, 1996. We can also sort by date. Let's comment this out and say .sort on published, and let's sort in reverse order, newest first, which was 2050. I think we might have been fooling around with that, or no, actually I don't know where those came from.
Anyway, 2050, 2038, 2037, 2030 and so on.
Obviously, sorted in reverse order.
What if I want to sort by the title and then any time the title matches I want to see the newest one of those.
We can do that as well, so very very similarly we can say sort and then we just give it one of these objects with multiple values, so you want to sort by title, there's your sort by title ascending and then after that, if any of the titles match, let's show the newest one first, so sort by title ascending and then published descending, let's try that.
Great, ok so here notice that these titles are the same, you might have noticed that before, but here's 1994 and here's 1993, so any time the title matches, we get the newest one first, I don't know if any others are in here with title matches.
This first one proves it, right; this is how it works: sort by that and then by this, and you can have as many then-bys as you like, and each can be either ascending or descending. So here we're sorting by title first and then by published.
The other thing that's important to notice is everything in MongoDB is case sensitive, when you're working with strings so that's probably going to play into this somewhere along the way.
All right, so sorting pretty straightforward, just use these field names and then the direction you want to sort.
The other thing that's worth paying attention to is you are going to want to make sure that you have an index so this sorting is actually fast, and we'll talk about that when we get to the performance section.
|
|
show
|
1:06 |
Let's review sorting as a concept.
So there's a sort function right on the result set on the cursor that comes back from find, and the way it works is we pass it some prototypical json document; but now instead of equality meaning matching, it means tell me the thing and the direction that you want to sort.
So here we want to sort all the books by published, descending: show me the most recently published down to the oldest, so show me the most recent books first, basically.
Now this works pretty well, we could put anything that has a direction like a minus one, or one, I think you could even put higher multiples like ten and 20, 50, -10, but use one and minus one, keep your sanity.
So this works well for one field, if we want to sort just by published, but if I want to sort by one thing, and then another, well we just put more into this document that we passed to sort, so we're going to say sort by title ascending and then sort by published descending, we run this, we saw that we get the results in our demo, first we sorted ascending by the title and any time they matched we sorted descending by the publish date.
So first the 1994, A Nutshell Handbook, and then the 1993 one.
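A quick way to convince yourself what "title ascending, then published descending" produces is to apply the same ordering to a small list in Python. This is a simulation on made-up books; `sort_docs` is a hypothetical helper, and the list of (field, direction) pairs is the form PyMongo accepts for a multi-field sort.

```python
def sort_docs(docs, spec):
    """Order docs by a list of (field, direction) pairs, 1 or -1."""
    result = list(docs)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last yields "sort by A, then by B" ordering.
    for field, direction in reversed(spec):
        result.sort(key=lambda d: d[field], reverse=(direction < 0))
    return result


# Made-up sample data: two books share a title.
books = [
    {"title": "B Title", "published": 1990},
    {"title": "A Title", "published": 1993},
    {"title": "A Title", "published": 1994},
]

ordered = sort_docs(books, [("title", 1), ("published", -1)])
for book in ordered:
    print(book["title"], book["published"])
```

The two matching titles come out newest first, and the other title follows, just like the 1994 and 1993 pair in the demo.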
|
|
show
|
1:26 |
Inserts are one of the simpler operations in MongoDB actually.
So we just go db, dot, the collection name, in this case db.book.insert, and we give it the thing to insert.
Now, we're not specifying an id here, an _id. Generally we want to let the database generate that, but it's not always true; we could have people, and their primary key, their id, could be their social security number, and that you would provide. In this case we're not going to provide an id; we're going to type in title and isbn and those kinds of things.
And then if we just do a find, that would come back and get the first one, maybe, say this is our first insert, and we'd get something back like this. Let's say we specified the isbn, the title, the author, the published and the publisher; this last one is a relationship over to the publisher table, which we haven't played with yet.
So those were all set by us; you can see "Winning With MongoDB" up here and down here we have "Winning With MongoDB", but the _id, because we didn't specify it, was auto generated as an object id.
So unless you have a good reason to pick another type of id, this is probably the best one for Mongo, but it could have been a string, like I said, it could have been a social security number, or it could be just numerical if you want to have a 1234. All of those kind of put the burden on you to manage the uniqueness of that id, and there is a uniqueness constraint on _id for every table or collection.
So that's how inserts work, you just give it this document and it stores it more or less directly in the database except for that it will generate this _id as an object id if needed.
|
|
show
|
2:02 |
If inserts are simple, updates maybe not so much.
In fact, there are two types of updates that we're going to look at; first, we're going to look at what is the conceptually more simple one, but also slightly more problematic.
So I'm going to call this the whole document update and the way you might use this is you might go to the database, do a query, get a document back, make a change to it and say here, push this whole document back over top the existing one in the database, kind of orm style.
The other one that we're not talking about here would be the in place updates, so you might say go increment the view count of this post without retrieving it, without changing the other parts, ok, so how does the whole document update work?
Well, first of all, we're going to do an update. If we come back and look at it, we'll see maybe we've changed the title here; the author is still the same, but we had to pass the author, we had to pass the published and the isbn back, and in fact also the id, so all that stuff we had to put back. Basically the way it works is we do a where clause here, so find it by the primary key, this great long object id, and then here is the entire document we want to replace that document with.
Now because of the way it's working here, there's a couple of features or settings you might want to control here, so you might need to set these, you might not depending on what you're doing, the default is if the where clause does not match, nothing will happen, there will be no kind of upsert, there will not be a new document added because we didn't find one, just nothing happens.
So if you say upsert is true and you run this update, it will say I didn't find this document, so let me create it for you, so you could control that here.
Similarly with multi: true. Unlike SQL update statements, update normally only updates the first item it finds; even if the where clause would match ten things, it only updates one of them.
So that's a little bit funky, but if you think about it, it's entirely replacing the record, so why would that whole record be duplicated ten times? I don't know, it's kind of weird. But if you do want to update multiple documents in this collection, be sure to set multi to true. Both of those orange values default to false.
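The replace-the-whole-thing behavior, and what upsert changes, can be sketched with a plain Python list standing in for the collection. Everything here is a simulation under stated assumptions: `replace_first` is a hypothetical helper, the where clause is equality only, and the documents are made up.

```python
def replace_first(collection, where, replacement, upsert=False):
    """Replace the first document matching where; optionally insert if none."""
    for index, doc in enumerate(collection):
        if all(doc.get(key) == value for key, value in where.items()):
            # The old document is discarded entirely, not merged.
            collection[index] = dict(replacement)
            return "replaced"
    if upsert:
        collection.append(dict(replacement))
        return "inserted"
    return "nothing happened"


# Made-up sample collection.
books = [{"_id": 1, "title": "Old Title", "author": "Someone", "isbn": "111"}]

# Forgetting to pass author back means the author is gone afterwards.
first = replace_first(books, {"_id": 1},
                      {"_id": 1, "title": "New Title", "isbn": "111"})

# No match and no upsert: the default is that nothing happens.
second = replace_first(books, {"_id": 2}, {"_id": 2, "title": "Other"})

# Same call with upsert switched on creates the document.
third = replace_first(books, {"_id": 2}, {"_id": 2, "title": "Other"},
                      upsert=True)
print(first, second, third, books)
```

This is why the lecture stresses passing every field back: whatever is missing from the replacement document is missing from the stored document afterwards.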
|
|
show
|
0:33 |
After you've inserted some documents and maybe updated a few, it might be time to get rid of the old ones, so let's talk about deleting them.
So again, it's db, dot, the collection name, and then we apply a delete operation.
And here we can say I'd like to delete one of them, delete one, or maybe I want to delete a whole set of them, delete many. With delete one we're passing in something that should be unique, like the primary key, and with delete many, maybe a bunch of them have the same title; maybe there are a couple of editions, like a kindle and a paperback version or something like that.
So just get rid of all of them with the title being some title.
So, delete one, delete many— pretty straightforward.
|
|
show
|
5:11 |
It's time to look at the atomic updates.
We already talked about the whole document updates and how they work, but sometimes it's not really what you want; in the beginning when we talked about NoSQL, we saw that the NoSQL databases gave up things that traditional relational databases embraced or considered sacred.
One of those was the ACID properties, or some part of the ACID properties. MongoDB does say, look, things like joins and transactions, transactions mainly being part of the ACID properties, are something that MongoDB doesn't promise. So these whole document updates really require an additional layer in the app tier called optimistic concurrency. Usually it's fine, sometimes it's not, and you can catch it and say, hey look, somebody saved this out from under you, do you want to keep your changes or their changes? There are things you can do about those types of situations, but in your app, not in the database.
On the other hand, MongoDB does support atomic, transactional behavior as long as it happens on a single document. So let's go ahead and create a whole new collection here called BookReads, notice it doesn't exist yet, and we're going to insert just an isbn and then how many times it's been read. Think of something like the Goodreads service, like I want to know how many of my friends read this book; we'll do a simple, simple version of that.
So let's go over here and notice we inserted one and if I refresh, we should now have that in here in our one record, like so.
So we could go and do this in the whole document style. I could say book, and of course we will very likely be doing this in Python, we're just about to leave the Javascript in the dust, so let's just print out the book that we got here; notice this has actually given us the same thing back. Then we could say the read count += 1, we could increment that, and then go over here to the same collection and say update; here's the where clause, so let's say _id: book._id, and the thing I want to update with is the book. So this should do that, and let's run one more query here at the end to get it back, to see it again.
Oh yes, find is not going to work; we need find one. We don't want to update a whole query, whatever that means, it doesn't make any sense, so let's get one of them back; we know this is really going to be unique. Then let's make this change. Ok, so notice now we've got a read count of one, and if we do this a few times, bam, bam, the read count is incrementing over and over down here, and we're updating one record. So this is cool, but this is not part of the ACID property guarantees, and it could be problematic in lots of ways. So what we're going to look at now are the operators that we can use to basically do almost transactional stuff, and do it in a much more high performance way.
So let's go over here again, and let me grab this little clause here. All right, so we got our document back again, and now let me just grab this collection bit and do our update. In fact the update is going to look almost the same, but instead of passing the whole document we're going to pass just an in place atomic operator. All right, so what are we going to do? Let's suppose we want to basically do the same thing, increment that. We're going to need something in our little where clause here; I guess we could just use isbn, that works as well.
Now by default, this is going to replace whatever's in there, and that's going to be bad. What we really want to do is increment that one value, so we can use another operator, $inc for increment. And what do I want to increment? Let's see, what is it called, ReadCount. So I want to increment ReadCount by one; I could increment it by negative one, I could increment it by ten.
So let's run this, now notice we updated one record and let's put this in a way that looks better, nine, ten, eleven, twelve— there we go, check that out, isn't that cool?
So what's happening here is we're actually saying to Mongo: go find the document, just change that number right there, just add one to it for me. You don't have to pull the whole thing back, make changes, try to put it back and possibly find someone else changed it, none of those things. This is entirely atomic and safe in a multi threaded, multi server environment, because MongoDB guarantees that individual updates to individual documents are atomic, and because we're not depending on the value we read, we're not reading it, changing it in our program's memory, not Mongo's, and putting it back, we're not going to have any problems.
There's a bunch of cool operators like this and we'll see that MongoEngine actually naturally behaves in this style not the document style, even though it's an object document mapper which is really really delightful.
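The problem with read, change, write back is easiest to see when two clients interleave. The sketch below plays that out deterministically against a tiny in-memory stand-in; `FakeCollection` and its methods are hypothetical, and `inc` only imitates what a server-side $inc does.

```python
class FakeCollection:
    """In-memory stand-in for a collection keyed by isbn (illustration only)."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc):
        self.docs[doc["isbn"]] = dict(doc)

    def find_one(self, isbn):
        # Return a copy, like a document shipped over to the client.
        return dict(self.docs[isbn])

    def replace(self, isbn, doc):
        self.docs[isbn] = dict(doc)

    def inc(self, isbn, field, amount):
        # Applied where the data lives; no client-side copy is involved.
        self.docs[isbn][field] += amount


# Whole-document style: both clients read before either writes.
reads = FakeCollection()
reads.insert({"isbn": "111", "ReadCount": 0})
client_a = reads.find_one("111")
client_b = reads.find_one("111")
client_a["ReadCount"] += 1
client_b["ReadCount"] += 1
reads.replace("111", client_a)
reads.replace("111", client_b)      # overwrites client A's change
lost_update_count = reads.find_one("111")["ReadCount"]

# In-place style: each client just asks for an increment.
reads.insert({"isbn": "222", "ReadCount": 0})
reads.inc("222", "ReadCount", 1)
reads.inc("222", "ReadCount", 1)
in_place_count = reads.find_one("222")["ReadCount"]

print(lost_update_count, in_place_count)
```

Two reads were recorded in both cases, but the whole-document version ends with a count of one because the second write was based on a stale copy.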
|
|
show
|
3:14 |
Despite the fact that MongoDB is a NoSQL database it does adhere to the acid properties under certain circumstances.
Primarily that means updates to individual documents are guaranteed to be atomic, and along with that we can get great performance as well as safety if we don't pull the document back from the database, make changes and push it back, hoping no one else has changed it during that intervening time. Instead we can go to the database and say make this change here; I don't care if it's a 100k document, don't pull anything back, just make this little change, and that happens atomically and safely.
So the operators that we have to work with are increment, multiply, rename a field, set on insert, set, unset, which basically deletes a field, and min and max. With min, I would like to set the value but only if the value I'm passing is lower than the one that's in the document; with max, only set the value if this new value is bigger than the existing one.
You can also use current date to basically grab the server date and save it there as well.
So these are the in place individual updates, and we can see how that works. We'll come over here and insert just a book, and this time our book has a view count; the view count is zero, and maybe every time somebody pulls up the book we want to increment that. So we can say test.update and give it the object id, which here is a real simple one so that it fits onto the screen, and basically you say $inc, increment view count by one. We do this a few times, so if we've done it three times it should go from zero to, well, you guessed it, three, and it all happened atomically in the database, without us ever pulling it back or worrying about any sort of concurrency whatsoever.
So this is great for working with individual fields, but sometimes we need to work with arrays; we saw, for example, our ratings array, and maybe we want to work with that atomically.
So MongoDB has operators for that as well. We have things like add to set: suppose the book has a votes list, the people who have voted on this book, not the values, just to keep it simple, just the ids of the users who have voted. You could say add to set the user id when they vote, and that would only add them if they're not already in that list. What's cool about that is if they push the little vote button twice, it doesn't count twice; either the person has now voted or they haven't.
Another good example is tags, like think stack overflow: I want to tag a post, so you could say add the tag Python, add the tag mongo, and if it's already there, it's just going to leave it alone; if it's not there, it will actually add the tag.
So add to set is really cool for kind of enforcing uniqueness on these subarrays.
We also have pop and pull for pulling things out, and pull all; say I want to remove all the votes by a particular user, things like that.
Also push; push is like add to set without the uniqueness constraint. And that's it. I definitely recommend you think about these atomic updates; they are not as simple, but they are better performing and they are definitely safer as well.
Like I said before, it's great that the odm, the object document mapper that we're going to look at, MongoEngine, automatically does this behind the scenes, so we don't ever have to even know how they work. But it's important that you know that they exist and why they're good for you when you look at the logs, look at the performance and think about things in that way.
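For the array operators, the behavior described above comes down to three small list operations. Here they are sketched in Python on a made-up document; the three function names are hypothetical and only imitate what $addToSet, $push and $pull do to an array field.

```python
def add_to_set(doc, field, value):
    """Like $addToSet: append only if the value is not already present."""
    items = doc.setdefault(field, [])
    if value not in items:
        items.append(value)


def push(doc, field, value):
    """Like $push: always append, duplicates allowed."""
    doc.setdefault(field, []).append(value)


def pull(doc, field, value):
    """Like $pull with a plain value: remove every matching element."""
    doc[field] = [item for item in doc.get(field, []) if item != value]


# Made-up sample document.
post = {"title": "Some Post", "tags": ["python"], "votes": []}

add_to_set(post, "tags", "python")   # already there, left alone
add_to_set(post, "tags", "mongo")    # new, so it is added

add_to_set(post, "votes", 720)
add_to_set(post, "votes", 720)       # double click on the vote button

push(post, "views", 720)
push(post, "views", 720)             # push keeps both entries
pull(post, "tags", "python")

print(post)
```

The votes list ends up with one entry while the views list has two, which is the whole difference between add to set and push.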
|
|
|
27:52 |
|
show
|
2:53 |
All right, the moment you've probably been waiting for is finally here, we're going to start moving away from Javascript and doing Python for the rest of this course to talk to MongoDB.
That doesn't mean we might not use the Javascript API in the shell, just a little bit more, but for the most part we're going to focus now on writing applications that talk to and work with MongoDB.
So we're going to look at what, in MongoDB's nomenclature, is called a driver. A driver is the underlying library or framework that you use to talk between your application and MongoDB.
So here we've got our web app and it's going to be using the database MongoDB here.
A request is going to come in, into our web app and it's going to use a particular package, right, this is not built into Python, this is something we have to go out and get.
So the package that we're going to work with is built and maintained by MongoDB themselves, and is called PyMongo.
So this is the core, lowest level access to the database server and it does a ton of things for us. In fact, if you look at many of the odms, the object document mappers, the NoSQL equivalent of an orm, they build upon PyMongo, so PyMongo is almost always involved when you're talking to MongoDB from Python.
It does many things for us. It connects to the database whether it's local, remote, over ssl, with authentication, with certificates, all that kind of stuff. It manages replica sets, so it knows how to find all the different servers participating in a replica set and handle fail over: if one fails, it knows how to go over to another one. It also knows how to deal with sharding, so maybe you have a cluster of ten MongoDB servers that each manage part of the data and participate as a group in the queries; PyMongo handles that for us. This is generally where you do the crud operations, the find, insert, update, delete, and those kinds of things. You do the other admin stuff as well, like drop tables or create indexes and so on, and it even does connection pooling. So really this does all the stuff that you need to talk to MongoDB, and the api is very, very similar to what we saw with the Javascript API, which is why I didn't skim over it. You really learned the Javascript api, so now you basically also know the PyMongo api: findOne with a capital O, no spaces, is now find_one, with an underscore and a lower case o, for example. There are a few variations for Pythonic naming, but other than that, PyMongo is going to look and feel very, very familiar to you at this point.
Like many things from MongoDB, PyMongo is open source, so you can come over here to github.com/mongodb/mongo-python-driver, and that is PyMongo.
So you'll see that you can go look around, you can see it's under active development and things like that, a lot of stars, so this is like I said, the official driver but you also have access to the source, right here.
So now that we know about PyMongo, I hope you're ready to go write some code.
|
|
show
|
8:02 |
So finally we're here in our github repository for our demos, we have something to share, so I have the source folder here and let's start with this play around PyMongo.
Now, throughout this course, we are going to build what I think is a pretty comprehensive demo. We're going to work on it for a few hours, it's going to have tons of data, and we're going to consider both the design and the performance of the database.
But for PyMongo, let's just sort of fool around a little bit here and then when we get to MongoEngine, we will take on our proper demo there.
So we'll begin by opening this in PyCharm, do that little drag and drop trick in MacOS, but on Windows and Linux you've got to say open folder.
All right, everything is loaded up, and I have created a virtual environment in here a Python 3.6 virtual environment, you can run wherever, but that's the one I'm using; now, let's start by adding a file here, so we'll just call this program, we won't do too much structuring and refactoring and organizing for this particular demo, we will of course for our proper demo.
So, before we can do anything, we just want to type import PyMongo, this is not going to turn out well for us, we'll go over here and try to run this, nope, there's no module named PyMongo, so let's go fix that.
If we open up the terminal in PyCharm, it's going to automatically find that virtual environment and activate it for us. You can see the prompt says .env, which means that we have our virtual environment active, so let's see what is here: not so much. Just to be safe, let's go ahead and upgrade setuptools. Why are we doing that? Because PyMongo actually uses C extensions, and depending on your system, setuptools sometimes has a little better chance of compiling those if you have the latest version.
It doesn't always work that way, and it has a way to fall back to just pure Python but the C extensions do make it faster, so that's worth checking out.
Alright, so we can pip install PyMongo, now things are looking good, let's try a program again, code zero, that means happy, zero is happy.
Alright, so we are able to create, or basically import the library, now the thing we've got to do is we could just go and create what's called a client and use all the default settings, but in a real app you're probably not going to talk to an unauthenticated local database server, you're probably talking to one on another machine, maybe there's security, maybe there's ssl, whatever.
So let's go ahead and set up the connection string; even if you have things like sharding or replication, all these things require a connection string.
So let's go over here and create a connection string and we'll just put the default values. They always start with the scheme, mongodb://, like so, and then localhost, and then 27017. So the scheme is always here, localhost says it's running locally, and 27017 is the default port.
We'll talk about how you can add things like authentication and ssl and what not there.
So the next thing we need to do is create what's called a mongo client.
You can work with connections directly from PyMongo, but you shouldn't. Why? Because PyMongo manages connection pooling for you and reconnects and all these different things, so if you work with a client it goes through the connection pooling and that kind of stuff; if you work with the connection directly, you're kind of locking yourself into that single connection, which is not the best.
So we're going to create a pymongo.MongoClient like this, and I want to give it the connection string like so. Now, the way this works, this is basically the equivalent of opening up the shell. The way it worked in Javascript was we said use and a database name; in Python it's a little bit different, in Python we say the database is client. and then make up a database name.
Literally I could put TheFunBookStore here and now this would actually start working with a database called exactly that; database names are case sensitive in MongoDB.
so let's just call this the_small_bookstore, okay because we're just going to poke around at it we're not going to work with that big set of data that we had before yet and we're also not going to work with our main demo.
So let's call it the_small_bookstore.
Now let's go over here and say insert some data it's not fun to have a database with no data, right, in fact, let's just really quickly have a glance over here if I connect, notice there is no the_small_bookstore, refresh, no, no small bookstore, okay, so this act here almost creates it, when you do a modifying statement against this thing you'll see that it does.
So let's go over here to books; let's make it a little more explicit, I'll call this db so it looks like the Javascript api.
So db.books is what we are going to call it, we'll say insert and what you want to insert, let's say title, now this is not Javascript, this is not json, this is Python dictionaries so you've got to make sure you have the quotes but otherwise it's really really simple.
The first book, and let's say it has an isbn, let's just put some numbers in there like that, and let's do another one: we'll say the second book, and it's going to have an entirely different isbn. While we're at it, let's print out the results, so we'll grab the return value as r and print out r.inserted_id; let's take a look at the whole thing, and we'll even print out the type of r. The thing we are usually interested in when you're doing an insert is this: remember the _id was generated, well, what was it? What if you want to say, somewhere in your app, I inserted it and here's the id of the thing I created for you? If we capture the response we can check out the inserted_id, ok, so let's go and run this real quick.
Oh whoops, no, this is actually just the id, sorry; if you do a bulk insert, I believe you get this. Or we can come over here and say insert_one, be a little more focused; now if we insert one we'll have our inserted id. Let's make this the third and the fourth book and make a little change here, there we go, one more time, perfect. Okay, so if you do an insert_one we get an InsertOneResult, which is in pymongo.results, and here you can see the inserted id. So we've inserted some stuff; let's go look back at our database here. If we refresh it we now have the_small_bookstore, if we go to the collections we have our books, and if we look in the books, that should not be super surprising, right, those are the things we just inserted. Okay, so now let's go over here and do a little test: we'll say if db.books.count is zero, we'll print inserting data and insert like this, else print books already inserted, skipping, and maybe even spell that right, huh?
Now we run it, nope, there's already books in here we're not going to insert duplicate books, so that's all well and good, so we've gone over here and we've connected to the database, we've created a client using the connection string and trust me this can get way more complicated to handle all the various complications and features of MongoDB, and once we have a client we say the database name here I've aliased it to db so it looks like the Javascript api or the shell api you're used to working with, and then we work with the collection and we issue commands like find and count and insert, insert one and so on.
So now we have some data, let's go maybe do a query against it, maybe make some in place updates things like that.
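The guard-and-insert flow from this demo could be sketched roughly as follows. This is a minimal sketch rather than the exact code from the video: `seed_books` is a hypothetical helper name, the ISBN values are made-up placeholders, and `books` is assumed to be a PyMongo collection such as `db.books`. The demo calls `db.books.count` and `insert`; those methods were removed in PyMongo 4, so the sketch assumes a newer driver and uses `count_documents` and `insert_one` instead.

```python
def seed_books(books):
    """Insert two sample books unless the collection already has data.

    `books` is assumed to be a PyMongo collection such as db.books.
    Returns the generated _id values (an empty list if nothing was inserted).
    """
    # An empty filter matches every document in the collection.
    if books.count_documents({}) > 0:
        print("Books already inserted, skipping")
        return []

    print("Inserting data")
    sample_books = [
        {"title": "The third book", "isbn": "0000000003"},   # placeholder ISBN
        {"title": "The fourth book", "isbn": "0000000004"},  # placeholder ISBN
    ]
    inserted_ids = []
    for doc in sample_books:
        result = books.insert_one(doc)           # returns an InsertOneResult
        inserted_ids.append(result.inserted_id)  # the _id generated for doc
    return inserted_ids
```

With a server running locally, this could be called as `seed_books(pymongo.MongoClient("mongodb://localhost:27017").the_small_bookstore.books)`; running it a second time should take the skipping branch.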
Let's look at how we can do some basic CRUD operations and connect to MongoDB with Python via PyMongo.
So if we're going to use PyMongo, let's start by importing PyMongo, and I'm going to not import the items or the classes out of this but actually just the module and use the name space style to make it really clear where this stuff comes from.
Actually I like to do this in a lot of my programs, even in production.
So we import PyMongo, and then we have to create a connection string and feed it off to the pymongo.MongoClient, right so this is a concrete class in PyMongo, and we can give it any sort of connection string, in fact if you give it no connection string, I think it'll use what I have written here basically, no auth, no ssl, local host 27017 which is the default standalone MongoDB port.
Alright, so this is cool, we've got our client here, and now then it gets a little bit trippy, a little bit dynamic here, which is kind of fun.
So the next thing we're going to do is go to the client and say client. followed by some database name; not table name, database name.
Now, this thing doesn't even have to exist at this point; this, as you saw in the demo, is actually how we created this database called the_small_bookstore. We just said db = client.the_small_bookstore, and by basically saying or implying that it exists, it's going to be created once we do some kind of write or modifying operation against it.
Ok, so just be aware that this is case sensitive, right, so capital T capital S capital B, would not be the same database as lower case t s b.
Right, so let's go, and now we're going to actually do a lot of things that look extremely similar to what we saw in the Javascript shell, that's why I spent so much time in that section it's because the apis are so, so similar at this level.
So now we can just operate on the database via collections; just like we said client. and the database name, we're going to say db. and the collection name, and those collections also don't necessarily have to exist. Even for queries, if they don't exist, you just get nothing back; that's not an error.
So for example, we can do a query against the books collection and ask how many there are, so db.books.count and that'll tell us how many books there are and like I said, even if the database doesn't exist, if the collection doesn't exist or both, it's still going to work, it will just return zero, because guess what, there are no books in the nonexistent database.
We could do a find_one, and this will pull back just one item by whatever default sort MongoDB happens to be using, and we can give find_one one of these prototypical, not json but Python dictionary, type of objects.
Now this find one is the first place where we're seeing the Python api ever so slightly vary from the Javascript api; in Javascript it's findOne, and in Python it's find_one and they've adapted the api to be Pythonic, right, it would look weird to say findOne, but just be aware that they're not identical, you kind of have to keep in mind which language you're working in, but other than that, what you feed to it and how they work it's more or less the same.
If we want to insert something we say db.books.insert_one and then we give it the document to insert and we get a result and we saw that the result actually comes back and has an inserted _id and the inserted _id is the generated id of the thing that was autogenerated in the database, notice we didn't pass _id, but if we care we can get it back for whatever purpose.
When working at higher levels with like MongoEngine, this will automatically just happen on the class and get set we won't have to worry about it.
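The find_one and insert_one calls reviewed here might be combined along these lines. It's a hedged sketch: `lookup_or_add` is a hypothetical helper name, and `db` is assumed to be a PyMongo database object such as `client.the_small_bookstore`.

```python
def lookup_or_add(db, title, isbn):
    """Return the _id of the book with this ISBN, inserting the book if missing.

    `db` is assumed to be a PyMongo database, e.g. client.the_small_bookstore.
    """
    # Attribute access names the collection; it does not have to exist yet.
    book = db.books.find_one({"isbn": isbn})  # a dict, or None when no match
    if book is not None:
        return book["_id"]

    # No _id is passed, so one is generated and reported on the result object.
    result = db.books.insert_one({"title": title, "isbn": isbn})
    return result.inserted_id
```

Calling it twice with the same ISBN should return the same id both times, since the second call finds the document the first call inserted.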
So in our example we saw that we pass a connection string to the Mongo client and it was super simple, it was just the MongoDB scheme and localhost and the default port; like I said, we could even omit the connection string, and I believe it would still work, picking all the defaults.
So let's look at some non default options.
So here, if I want to connect to a remote server and I've either put some kind of dns records somewhere or I've just hacked my local hosts file to say there's a thing called mongo_server which is maybe within a virtual private network or at least in the same data center zone, if I'm doing cloud hosting like a Digital Ocean or something like this, and if I want to connect it on the default port, which is still 27017, I could just say mongodb://mongo_server, and then we could connect that way.
Well, maybe you want to connect on an alternate port, so port 2000 instead of 27017. This is probably a good idea; there's a lot of people scanning the internet for open MongoDB ports, 27017, 27018, up to 27020 I believe, that's probably the range that they're looking at, because different services run on different ports, like replication versus sharding versus whatever.
So you probably don't want to run on that port, and when we get to the deployment section, we'll look at all the steps we need to take in order to make our server safe, so be sure you do not put MongoDB in production until you watch that chapter at the end of the course. But let's just assume that one of the things we might want to do is run on a non default port; like any web address type thing, we just say mongodb://mongo_server:2000. Okay great, so now we have a separate server on a non default port. We probably want to have authentication, so if we had a user name and password (again, we'll talk about this in the deployment section at the end) we would put jeff:supersecure@ in front of the server name, so user name jeff, ultra secure password supersecure, and then everything else follows as before.
And if we wanted to talk to a replica set, so this is a set of cooperating duplicated fail over MongoDB servers that can be working together so in case one of them goes down, or you have to take one offline for some reason, it will just switch over and a different server will become the primary and start to store the data.
This doesn't lead to eventual consistency and things like that; there still is one primary place things go to, but depending on the state of the cluster, it could be any one of the replicas in the replica set.
So here we would say server one port one, server two port two, server three port three. Well, the first two are actually both running on the same machine, in case a process dies, but we also have a separate server, Mongo server two, that is running on a different port as well. In fact, this might not be all the servers in the replica set; it might just be sufficiently many that once the client connects it finds all the others, and then it will start working with all of them.
And we also need to say replicaSet=prod or whatever we're calling a replica set.
So we have all these options in terms of connection strings and then once you have this, well you pretty much use it the same way, you create a client by passing the connection string off to it and it figures out all the details for you.
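The connection string variations above can be assembled with a small helper. This is a sketch under stated assumptions: `build_connection_string` is a hypothetical function, not part of PyMongo, and the host names used in the usage note (`mongo_server_1`, `mongo_server_2`) are illustrative stand-ins for the servers on the slide. The credentials are percent-encoded with the standard library's `quote_plus` so that characters like `:` or `@` in a password don't break the URI.

```python
from urllib.parse import quote_plus


def build_connection_string(hosts, username=None, password=None, replica_set=None):
    """Assemble a mongodb:// connection string.

    hosts: list of "host" or "host:port" strings (several for a replica set).
    """
    credentials = ""
    if username is not None and password is not None:
        # Percent-encode so special characters in the credentials stay valid.
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"

    conn_str = f"mongodb://{credentials}{','.join(hosts)}"

    if replica_set:
        # Options go after a slash as ?name=value pairs.
        conn_str += f"/?replicaSet={replica_set}"
    return conn_str
```

For example, `build_connection_string(["mongo_server:2000"], "jeff", "supersecure")` gives `mongodb://jeff:supersecure@mongo_server:2000`, and passing three hosts plus `replica_set="prod"` gives a comma-separated host list ending in `/?replicaSet=prod`. Either result is then handed to `pymongo.MongoClient` the same way as the simple local string.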
So back to our example, we've inserted some data and we have this little guard here to say don't insert duplicate data, things like that, so let's make some changes to our book here.
Let's first of all change the title of the third and fourth book; let's just mess with this book for example, let's change this to like this, third book like so, all right. So we have two ways to do this: one way would be to pull back the entire document, work on it and push it back, and this is what I think of as the orm style of working.
So we'll say book = db.books.find_one, let's do find_one here and we're just going to give it the isbn that we have there.
Let's just do a quick little print out of the book and just so you understand what we're working with we'll also print out the type, so if we run this, we obviously get the book back, super, and you can see it is a dictionary, cool; so, I said I want to change the name here, let's actually change something slightly different, so we can work with some more advanced features.
What I want to do is I want to add the ability to have a user like favorite this book and this might not be a good way to do it, I haven't really thought it through because it's just a toy example, but let's suppose we want to have the book store the ids of the people who have favorited it, in practice maybe it's better to have the user accounts store the ids of the books that they individually favorited, but the mechanics would be identical.
So how we're going to do that?
Well, to this book, I'm going to add something called favorited_by, and this is just going to be an empty list here.
Then any time we want to work with it, we can come over here and say .append to record that user 42 did this, and then we can say db.books.update and give it a little query here, so we would say '_id', which once we're in Python has got to be in quotes, and book.get('_id') is going to be the value there, and then what we're going to put back is just this book. And let's just one more time after this get it back and print out book; this should make sure that everything went sort of round trip just fine.
Ready?
All right, look, oh yeah look at that, we got a favorited_by right there, 42.
If we run it again, now we won't need to do this, we can run it again with 100, now we have two people, two user ids who have favorited this and so on.
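The pull-modify-push-back round trip could look something like this. It's a sketch, not the code from the video: `favorite_whole_document` is a hypothetical helper name and `books` is assumed to be a PyMongo collection. The demo uses `db.books.update(query, book)`, which in older PyMongo versions replaced the stored document when given a plain document; that method was removed in PyMongo 4, so the sketch assumes a newer driver and uses `replace_one`.

```python
def favorite_whole_document(books, isbn, user_id):
    """Load a book, change it in Python, and write the whole document back.

    `books` is assumed to be a PyMongo collection such as db.books.
    Returns the re-read document, or None if no book has this ISBN.
    """
    book = books.find_one({"isbn": isbn})  # a plain dict
    if book is None:
        return None

    # Create the list on first use, then record the user.
    book.setdefault("favorited_by", []).append(user_id)

    # Swap the stored document with this _id for our modified copy.
    books.replace_one({"_id": book["_id"]}, book)

    # Read it back to confirm the round trip.
    return books.find_one({"_id": book["_id"]})
```

Note that this style always appends, so calling it twice with the same user id stores that id twice.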
Okay, so this is all pretty well and good, but let's do something better, sometimes it makes sense to go and pull a whole document back, look at it, make changes to it and save it.
In fact, that's something you'll do quite often, but in this case, we just kind of want to say add this little id here to this list called favorited_by and maybe it doesn't even exist.
So let's do this, let's copy this again and change this, so now we're not going to use that, we'll use our isbn and let's modify book four here, so this does not even have a favorited_by yet.
Let's put this in here, so we're going to modify that and then let's actually also get it back and print it out at the end; there we go, so we're going to get the book back but we're not going to pass the whole book we're going to use one of these in place operators; remember add to set, so what we're going to do is we're going to use add to set.
So in Javascript we could type this really in the shell we can type this $addToSet but obviously, PyCharm is telling us not super good Python so what we got to do is put that in quotes, and then the value, we can have actually multiple stuff here, so we're going to say favorited_by, and then the thing let's add user id 101, now, this seems to be telling me I've got something a little bit off here, yes, so we need that to be the entire update document; ok, what we're going to do is we're going to say go find this document, this book with this id, which is notice, it ends in 73, this is going to be book four, actually let me comment this out really quick and we'll just print out, 73 rather, print out notice there's not even a favorited by yet.
So what we're going to do is we want to go add this id here so it should actually create this list and then put 101 in it let's see if that's going to work.
Boom, favorited_by 101, and this time we did not pull it back we used one of our cool operators.
Now, if this was just push, dollar push is another sort of equivalent, this would have more and more and more 101s, but add to set, I should be able to run this code over and over and over and 101 is already there so it's not going in, it's better if I say 120, now I run it, now we have those two right, so this add to set is super nice, I don't even need to go to the database and go well are they there, no they're not there, ok then I'm going to add them.
All right, so I don't even need to do that check, I can just use this cool little add to set operator, very very nice.
So here's how we use the in place operators, there's really not much difference other than we have to put more stuff in strings because it's not the shell, it doesn't have like the special understanding of what those mean and even over here, it's not Javascript, it's Python dictionaries, which those keys there need to be strings in this case.
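The in-place version might be sketched like this. As before it is an illustration under assumptions: `favorite_in_place` is a hypothetical helper name, `books` is assumed to be a PyMongo collection, and `update_one` is used as the current-driver way to send an operator-style update.

```python
def favorite_in_place(books, isbn, user_id):
    """Add user_id to favorited_by on the server without fetching the book.

    `books` is assumed to be a PyMongo collection such as db.books.
    Returns how many documents were actually changed (0 or 1).
    """
    result = books.update_one(
        {"isbn": isbn},                             # which document to update
        {"$addToSet": {"favorited_by": user_id}},   # operator name is a string key
    )
    # 0 means nothing changed: no match, or user_id was already in the list.
    return result.modified_count
```

The whole update document is an ordinary Python dictionary; the `$addToSet` operator and the field name are just string keys inside it.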
Let's review the ideas behind these in place updates.
So here we have more or less a complete MongoDB Python program using PyMongo here, so we're going to import PyMongo, connect the local database, all the default options, and we're going to either create or get access to the bookstore by saying client.bookstore, now we're going to insert an object that has no favorited by element, right no list, it just has a title and isbn, so after the insert, we're going to end up with an _id and a title and an isbn.
And then maybe we want to add this idea of favorited by; maybe you want to design it that way already and have an empty list there, but either way it would work more or less the same. So we can say I would like to go find the book; the first part of our update statement is the where clause, so find by primary key, and remember, when we call insert_one that's result.inserted_id, so that's going to find the one and only item. And then we're going to use the add to set operator, which we just pass as a string in PyMongo, and then we'll push on to favorited by such and such.
We could also use $set to set, say, the title: $set with title: the new book with updated title, or something like this, right, so you can use this all over the place. Now you may be thinking, oh, this api is kind of crazy, we've got these dollar operators and it's a lot to learn if you're totally new to it, I realize. But what's really cool is that when we get to MongoEngine, you'll see that MongoEngine does this transparently under the covers for us, so you won't necessarily have to remember all of these but you'll get all the benefits that we're describing here.
If you're using PyMongo, you have to know the api really intimately. So we're going to push this 1001 user id on to favorited_by, and maybe we'll push 1002 as well if people signed up at the same time, they saw the same book, they loved it. And let's go ahead and push this 1002 again (well, not the push operator, but the add to set operator). Because it's add to set, we're going to get a new document that has the new book title, the same isbn and two items in its favorited_by, 1001 and 1002, because add to set is idempotent: calling it once or calling it a hundred thousand times has the same result, other than it might take longer to call it a hundred thousand times, right.
So if it's already there it makes no difference, but if it's not there it pushes it in; a super cool operator, really taking advantage of the hierarchical nature of these documents.
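To make the idempotence point concrete without a running server, here is a pure-Python simulation of the two behaviors on a plain dictionary. It only mimics the semantics described above; it is not how MongoDB or PyMongo implement these operators, and the function names `add_to_set` and `push` and the document values are made up for the illustration.

```python
def add_to_set(document, field, value):
    """Mimic $addToSet: append value only if the list doesn't contain it."""
    values = document.setdefault(field, [])  # create the list if it's missing
    if value not in values:
        values.append(value)
    return document


def push(document, field, value):
    """Mimic $push: always append, even if the value is already present."""
    document.setdefault(field, []).append(value)
    return document


set_style = {"title": "The new book", "isbn": "0000000001"}   # made-up values
push_style = {"title": "The new book", "isbn": "0000000001"}

for user_id in (1001, 1002, 1002):  # 1002 is sent twice
    add_to_set(set_style, "favorited_by", user_id)
    push(push_style, "favorited_by", user_id)

print(set_style["favorited_by"])   # [1001, 1002]
print(push_style["favorited_by"])  # [1001, 1002, 1002]
```

Repeating the add-to-set call any number of times leaves the list unchanged, which is why no "is it already there?" query is needed first.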
Now, when you go to mongodb.com and you look through the documentation so docs.mongodb.com, you will find stuff about updates and inserts, and queries and aggregation and so on, and so on; all of these are going to be in the Javascript api, notice at the bottom of this web page here, db.collection.insertOne is new in version 3.2, so if you're trying to look up these operations you will most likely find them in the Javascript style, and the Javascript api, that's how MongoDB talks about it, you'll probably find them on Stack Overflow.
So, because that's the way the shell works, MongoDB has kind of standardized on this: here is how we're going to do our documentation, in Javascript. Once again, yet another reason we spent so much time on the Javascript api, even though none of us are necessarily Javascript developers.
So, here we have the crud operations, now we have the query and projection operators and things like that, so if you want to know how to map these over to PyMongo, then there's one page really that you need for most things, and that's the collection documentation.
So over here at api.mongodb.com/Python/current/api/pymongo/collection.html you can see right at the top, we've got all of the stuff you can do on the collection itself. So for example, we were passing one and minus one as the sorting operators in the shell; here you could say pymongo.ASCENDING, pymongo.DESCENDING, a little bit more explicit. But this is a really good place to go because you'll find things like insert_one and find_one and all the various ways in which you need to adapt the documentation you find in Javascript over to the PyMongo api; this is probably the biggest bang for the buck right here.
Okay, so if you want to write an app, PyMongo could totally be your data access layer, it would completely solve the problem, it's really great, it's what a lot of applications use to talk to MongoDB from Python.
We're going to talk about some additional things going forward but one of the bigger decisions you need to make is are you going to use an odm that maps classes to MongoDB, with additional features as we'll see in a lot of interesting ways, or are you going to work down at the dictionary level, it's very similar to say I'm going to work with say the DB api and sql strings, versus SQLAlchemy or Django orm or something like that, right.
So, you've kind of got the low level way to talk to MongoDB now; we're going to move on to talk about document design and mapping higher level objects like classes with MongoEngine later in the course.
We've come to a pretty exciting part in the course, we're going to talk about document design and modeling with document databases.
So let's take a step back and think about relational databases.
There are in fact a couple of really systematic, well known, widely taught ways of modeling with relational databases; there's still a bit of an art to it, but basically it comes down to third normal form, first normal form, some of these well known ways to take your data, break it apart, and generate the relationships between the pieces. So if we're going to model like a bookstore with publishers and users who buy books at the bookstore, and they rate books at the bookstore, it might look like this. We have a book, and the book would have a publisher, so there is a one to many relationship from publisher to books; you can see the one and the star on the little relationship line there, and we have some flat properties like title and published, and publisher id for that relationship. Similarly, we have a navigational relationship over to the ratings, so a book is rated, and the ratings would be almost a normalization table, or many to many table; it has the book id and the user id and then the value, and we just happen to have an auto increment id there. It's not necessarily the way we have to do it, we could have a composite key. Then we've got our user, and the user can navigate to the ratings, and things like that.
Now, of course, this is a very simplified model in a real bookstore with real ecommerce happening and all that and categories and pictures and all those things, this would be way more complicated, but the whole idea going forward is going to be pretty similar and I think keeping it simple enough that you quickly understand the model and don't get lost in the details, is the most important thing here.
So this more or less follows third normal form here, in terms of how we're modeling this in the relational database.
Could we move this to MongoDB, could we move this to a document database? Sure, we could have exactly this structure.
Now those relationships, those are not full on foreign key constraints, those would be loosely enforced, not enforced in the database but enforced in the app relationships between the what would be collections; but certainly, we could do this, is it the best way though?
The answer is usually not, maybe, but probably not.
So what we're going to focus on now is how do we take our traditional knowledge of modeling databases and relational databases and how does that change, what are the trade-offs we have to deal with when we get to a document database.
So the good news is, usually things get simpler in document databases in terms of the relationships, you might have what would have been four or five separate tables with relationships, it might get consumed into a single item, a single collection or single document really, so here this is how we're going to model our bookstore that we just looked at in third normal form, but now in a document database.
And really, the right choice here comes down to how is your app using this data, what type of questions do you usually ask, what's the performance implications, things like this.
So now we have books, we have a publisher and a user, and these have similar top level items, and we do have some traditional relationships.
So there's a one to many relationship between publisher and books theoretically we can embed the book into the publisher but there's many, many books for some publishers and that would be really a bad idea; so we have this traditional relationship, like you might have in a relational database.
Now again, not enforced by Mongo, but enforced by your app, so same basic idea.
Next up, we have the ratings, remember we have that like many to many table from users to book ratings, now that has actually moved and now we're storing these items in an embedded array of objects inside the book table, or the book collection.
So now each book has a ratings array, it has the number of ratings, those are just put right in there, so is this the right design— maybe, it's certainly a possible design, and it's the design that we're going to go with for our examples, but we'll talk about when it's actually the right design.
And I'll help you make those trade-offs next.
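As a rough picture of the embedded design, a single book document with its ratings inside it might look like the Python dictionary below. The field names follow the ones mentioned in this lecture (title, published, publisher id, and ratings with a user id and a value), but the exact spellings, the sample values, and the `rating_summary` helper are assumptions made for this sketch.

```python
# One book document with its ratings embedded as an array of sub-documents.
book = {
    "title": "The first book",
    "published": "2017-01-01",   # placeholder value
    "publisher_id": 7,           # refers to a publisher document; enforced by the app, not MongoDB
    "ratings": [
        {"user_id": 42, "value": 5},
        {"user_id": 100, "value": 3},
    ],
}


def rating_summary(book_doc):
    """Return (number of ratings, average rating) using only the book document."""
    ratings = book_doc.get("ratings", [])
    if not ratings:
        return 0, None
    average = sum(r["value"] for r in ratings) / len(ratings)
    return len(ratings), average


print(rating_summary(book))  # (2, 4.0)
```

Everything needed to show a book with its rating count and average arrives in one document, with no join or second query.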
When it comes down to modeling with document databases you apply a lot of the same thinking as you do with relational databases about what the entity should be, and so on.
However, there's one fundamental question that you often ask that really does take some thinking about maybe working through some of the guidelines, and that is to embed or not to embed related items.
So in our previous example, you saw that we had a book and the book had ratings embedded within it, but we could just as well have the ratings be a separate collection, or the ratings could have even gone into the user object with a reference back to the book, instead of the reverse.
So should we embed that ratings, and if we do, does it go in books, does it go in users, or does it not go there at all.
So what I'm going to do, is I'm going to give you some guidelines, these are soft rules, we don't have like a really prescriptive way of doing things like third normal form here, but some of the thinking there does help; so let's get into the rules.
First of all, the question you want to ask is: is that embedded data wanted eighty percent of the time that you get the original object? Do I usually want the rating information when I have the book?
If it would have resulted in me doing a join in a traditional database or going back and doing a second query to Mongo to pull that data out, it's very beneficial to have that rating data embedded in the book.
We designed it that way, so let's suppose most of our query patterns and most of the way our application works is that we want to list the number of ratings, the average rating, things like this; we want to surface that almost all the time, so we want that embedded data when we get a book.
So that would guide us to embed the data, if this is not true, if you only very rarely want that data, then you most likely will not want to embed it, there's a serious performance cost for what you might think of as dead weight, other embedded stuff that comes along with the object that you generally don't care about most of the time, you can do things like suppress those items coming back, so you can basically suppress the ratings object, but if you are doing that, it's probably a sign like hey maybe I shouldn't really be designing it this way.
A lot of considerations, but here's the first rule— do you want the embedded data most of the time?
Next, how often do you want the embedded data without the containing document?
The way our things are structured now is I cannot get the ratings without getting the books, I cannot get individual ratings without getting all of the ratings.
So if what I wanted to do was on the user profile page show here are all of my individual ratings as a user listed on my like favorites page, or things I've rated or something like this, that's actually a little bit challenging the way things are written.
We can definitely do it, and if there's just one query where we do it that way it's totally fine, but this is one of the tensions: you can't get the ratings without getting the books, and you can't get individual ratings without getting all the other ratings from that particular book. There's no way in MongoDB to actually suppress that, I don't think, the way you can suppress other fields using a projection; you get all the ratings or none of the ratings.
So how often is it necessary to get a rating without getting a book itself?
Right, if that's something you want to do often or it's a very very hot spot in your application maybe again you do not want to embed it, if you want the object without the containing document.
Another really important question to answer is is the embedded data a bounded set?
If it is just a single nested item, fine, that's no problem; if it's a list or an array, like we have in the context of ratings, how big could the ratings get, how many ratings might a book have, reasonably speaking? If there's ten ratings, it's probably totally fine to have the rating data embedded in the book; it's nice and self contained, you get a little atomicity and some nice features from having it embedded there.
If there's a hundred ratings, maybe it's good, if there's a thousand ratings, if there's an unbounded number of ratings you do not want to embed it, right so is it a bounded set, first of all and related to that, is the bounded set small, because every time you get the book back you're pulling all of that stuff off disk, possibly out of memory, over network for deserialization or serialization depending on the side that you're working with.
So that comes with a cost, and in fact, MongoDB puts a limit on the size of these documents: you're not allowed to have a document larger than 16 MB. If you try to take a document that's larger than 16 MB and save it into MongoDB, even if you just pull one back, add something that makes it a little bit bigger, and call save, it's going to totally fail and say no, no, no, this is over the limit.
So this should not be thought of as a safe upper bound, this should be thought of as the absolute limit. If you've got a document that's ten megabytes, it doesn't mean wow, we're only halfway there, this is amazing or great; no, that's a huge performance cost to pull 10 MB over every time you need a little bit of something out of there.
So really, you should aim for something much, much, much smaller than the upper limit of 16 MB, but the point here is there is actually a limit: if this embedded data outgrows that 16 MB, you just cannot save it back to the database. That's a "will no longer operate" problem, whereas "is the bound small" is more of a performance trade-off type of problem. You want to think about these very, very carefully; the average size of a document is definitely something worth keeping in mind.
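To make the size limit a little more concrete, here is a rough sketch of a check you might run before saving. It assumes JSON length as a stand-in for the real BSON size (the two are not identical), and the function names and the 1 MB warning threshold are made up for illustration; only the 16 MB cap comes from MongoDB.

```python
import json

MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # MongoDB's hard per-document limit
WARN_BYTES = 1 * 1024 * 1024           # hypothetical "this is getting big" threshold


def approx_size_bytes(document):
    """Rough estimate only: JSON length is a proxy for the real BSON size."""
    return len(json.dumps(document, default=str).encode("utf-8"))


def size_status(document):
    """Classify a document as 'ok', 'warn', or 'too large' by approximate size."""
    size = approx_size_bytes(document)
    if size >= MAX_DOCUMENT_BYTES:
        return "too large"
    if size >= WARN_BYTES:
        return "warn"
    return "ok"


# Hypothetical book with a small, bounded list of embedded ratings.
book = {
    "isbn": "isbn-placeholder",
    "title": "Example Title",
    "ratings": [{"user_id": n, "value": 5} for n in range(10)],
}
print(size_status(book))
```

A check like this is most useful as an early warning: long before a document hits the hard limit, the "warn" case tells you the embedded list is no longer small.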
How varied are your queries?
Do you have a web app that asks maybe ten really common questions, where you very much know the structure, like these are the types of queries my app asks, these are the really hot pages, and here's what I want to optimize for? Or is this more of a BI type thing, where analysts come along and can ask almost any sort of reporting question whatsoever? It turns out the more focused your queries are, the more likely you are to embed data in other things; if you know that you typically use these things together, then embedding them often makes a lot of sense.
If you're not really sure about the use case, it's hard to answer the above questions. Do you want the data eighty percent of the time? I have no idea, there's all sorts of queries, some of the time. So the more varied your queries, the more likely you are to tend towards normalized data, not embedded data.
And finally, related to this question of how varied your queries are: are you working with an integration database that lives at the center and is used almost for inter-process, inter-application communication, or is it a very focused application database?
We're going to dig into that idea next.
3:29 |
In order to answer this question about whether you have an integration database or an application database, let's do a quick compare and contrast. Especially in large enterprises, you'll see that they use databases almost as a means of inter-application communication. So maybe you have this huge relational database that lives in the center with many, many constraints, many, many stored procedures, lots and lots of structures and rules, and so on. Why? Well, because we have a bunch of different applications and they all need to access this data. Maybe the one in the top left here needs users, but so does the one on the right, and their idea of users is slightly different, so this user is not a real simple thing, it's really quite complex; it's kind of the thing that will solve the user problem for all of these apps, through the constraints and the way you use it.
This is a decent, well, it's typically a good role for relational databases. You're better off with other architectural patterns anyway, but relational databases are good at guarding the data in this kind of use case: they have a fixed schema, they have lots of constraints and relationships, and they are very good at enforcing them and kicking it back to the app and going no, you got it wrong, you messed up the data.
So they can be like this strong rock in the middle.
The problem with rocks is they're not very adaptable, they can't be massaged into new and interesting things; a rock is a rock, and it's extremely hard to change.
So that's partly why some of these major enterprises will have weekends where they deploy a new version of an app, like we're going to take it down and everybody's going to come in and we're going to release it; that is not a super place to be. It's also not a great use case for document databases, with their flexibility in schema design, their lighter enforcement at the database level and more enforcement inside the app, because how is the app on the left going to help enforce things for the app on the right? That's not great.
So, this is an integration database, and it's generally not a good use case for document databases. If you're still using document databases in this sort of style, it means your queries will be more varied and you probably need to model in a more relational style, less embedded style, just as a rule of thumb.
So what's the opposite?
Well, it might look like this: we have all of our little apps again, and instead of them all sharing a single massive database, you can maybe think of this as more like a microservice type of architecture. Each one of them is going to have its own database that it talks to, and when they need to exchange information they'll do that through some sort of web API, maybe with some kind of service broker where they negotiate and locate the other services. Maybe the one on the left is about orders, the one on the right is about users and accounts.
So what that means, though, is each one of these little apps is much simpler: it can have its own database with its own focused query patterns, which is easier to understand, and the application can enforce the structure and the integrity at its API level. So this is a much better use case for a document database when you're sharing data.
And in fact, this whole pattern means we don't have to make a NoSQL versus SQL choice: maybe three out of these six are using MongoDB, one is using a graph database, and two are using MySQL. It's up to the individual application to decide what the best database and underlying model is.
So when we have an application database like this, you are more likely to have slightly more embedded objects, because the query patterns are going to be simpler, more focused, and more constrained.
8:07 |
So let's look inside the application that you're using right now to take this course as an example.
So at the time of this recording, here's what the Talk Python training website database looks like for courses and users.
So, first let's focus on the course side of things; there's a couple of interesting ideas here. One, we have an id which is not an object id. Why is it not an object id? Well, this was actually migrated from a relational database; initially this was using SQLAlchemy, and it was easier to keep this id here as a number rather than switch to MongoDB's object id. It's also easier to refer to it in other areas: in the e-commerce system, I don't have very much space in the message that can go in based on their API, so a number like 1 is much easier than the 24 hex characters of an object id. So we're using a non-standard id which is generated in the app, but for these types of things that is really no big deal; for the users, I think we might be using object ids.
We have somewhat sort of flat things here, we have the url and the title and when it was published, things like that, so this is the Learn Python by Building Ten Apps Jumpstart Course and you can see a lot of the initial ideas here, and the initial pieces of data are totally straightforward and they would look exactly the same in a relational database.
However, there's two things that are very different that I want to pull your attention to. The first is not actually the embedded stuff, but this duration in seconds. When I created the MongoDB version of this web app, I realized one of the things I do all the time, on the home page, on the course listing page, and many, many places, is say how long the course is: this course is 6.5 hours, I think this one is 7.1 hours, or something to that effect.
Using quick math you can figure that out from the duration in seconds.
So there was actually a pretty serious bottleneck where I'd have to go and, in this case, pull back 12 chapters, and then from the chapters I could get the lectures, and from the lectures I could get how long each individual one was; I'd add that all up and then I could print out that number.
And then I would do that for say like on the course catalog page, there was like ten courses, I would have to go through so many of these chapters and then their subsequent lectures, and that was a huge huge bottleneck.
So what I decided to do was, in the application, any time I save or update the course, which is extremely rare, I'm going to compute this on save and then stash it here. So this is actually computed from the chapters, which are computed from the lectures themselves, and this is data duplication. But I find that in most apps there's usually one or two little pieces like this where a little bit of data duplication just unlocks a lot of performance, because actually computing this turns out to be really, really computationally expensive, but storing it here on this object made it super fast.
So this is one thing, this data duplication which I try to stay away from as much as I can but the trade-off here was so worth it.
Now, the other part we want to focus on is down here, we said I'd like to associate these chapter ids with a particular course, now if this was a relational database, I might have a course to chapter normalization table, right, it'd have the course id and the chapter id and I do some query some kind of join on that; you almost never ever, ever see that in MongoDB and document databases.
Usually, at least the ids are embedded on one side of that one-to-many relationship; so here we have the course, the course has some chapters, so we're just storing the ids here.
Now, we also have the chapters; you can see chapter 1001 goes right here, and this one is a little bit more interesting. We've got again our duration in seconds, which is another thing that's computed: if you look at the individual lectures, they've got a duration in seconds, and that's the real raw number.
So this is another duplication, because at many, many levels I need to show the time of a chapter, and that was turning out to be computationally expensive at many levels, so again, these two places, this is the one bit of duplicated data and you will see that this is more common in a document database than in a relational one.
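As a minimal sketch of that compute-on-save idea, assuming plain dictionaries and a hypothetical field name duration_in_seconds (the real site's field names and save code are not shown in this course), it might look something like this:

```python
def chapter_seconds(chapter):
    """Sum the raw per-lecture durations embedded in one chapter."""
    return sum(lecture["duration_in_seconds"] for lecture in chapter["lectures"])


def refresh_durations(course, chapters):
    """Recompute the duplicated totals; intended for the rare save/update path."""
    for chapter in chapters:
        chapter["duration_in_seconds"] = chapter_seconds(chapter)
    course["duration_in_seconds"] = sum(c["duration_in_seconds"] for c in chapters)


def hours_label(course):
    """Cheap read path: no chapters or lectures need to be loaded."""
    return f"{course['duration_in_seconds'] / 3600:.1f} hours"


# Hypothetical sample data: lectures are embedded in chapters,
# and the course only stores the chapter ids.
chapters = [
    {"_id": 1001, "course_id": 1, "lectures": [
        {"title": "Welcome", "duration_in_seconds": 202},
        {"title": "Setup", "duration_in_seconds": 1598},
    ]},
    {"_id": 1002, "course_id": 1, "lectures": [
        {"title": "First app", "duration_in_seconds": 1800},
    ]},
]
course = {"_id": 1, "title": "Example course", "chapter_ids": [1001, 1002]}

refresh_durations(course, chapters)
print(hours_label(course))
```

The trade-off is that every code path that changes a lecture has to remember to call the refresh step, which is why this kind of duplication is best kept to one or two fields.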
So here we've got our chapter, which has this soft relationship from the course over to its id; we also have the course id down there below it, so it's kind of a bidirectional relationship. Then we have lectures, and lectures are interesting in that almost every time we get a hold of a chapter we care about its lectures; we usually want to display them in a list. Any time I get a lecture, which is the thing you're watching right now, an individual video let's say, I almost always need the other ones, at least the ones before and after it. If you look in this particular player you'll see there are forward and backward buttons that skip ahead or skip back within the course; those are the other lectures. So what I find is that grouping the chapter along with the lectures into one blob makes it super fast: I almost always want the other lectures when I have one lecture, and if I have the lecture, I usually need to display the chapter title, and things like that.
Anyway, so these are really well suited to be put together in this embedded style, so I don't have a lectures table; I have courses and I have chapters, and the chapters embed the lectures, and we also saw that little bit of data duplication.
So you can see down here is an individual embedded lecture, here's one that talks about doing the exercises in this course and it's apparently 202 seconds, so I hope this look behind the scenes has helped you understand how you might model this stuff, you can look at the course page and the player and think about some of the trade-offs, I don't know that this is perfect, but it is absolutely working well for the web app.
Let's look at one more thing.
Down here we have the users, and there's a couple of items that we're going to focus on. I have blurred some things out; we're using an object id now for the user id, and I covered the password and things like that, but we've got some flat stuff like whether or not you're opting out of email, what your user name is, what your email address is, things like that.
And then, I have this concept of an origin, so if you come from some particular marketing source it might record that this person created their account and originally came from Facebook, this person originally came from the podcast, or something like that, so that's pretty interesting. We also have the courses that you are taking; this particular person is me, so I gave myself basically all the courses, and these are the ids of the courses that I am a student in. So again, there's not a users, a courses, and a user-courses sort of normalization thing; it's very common that when I as a user am loaded from the database, I very often need to know about the courses.
Now I can't easily embed the course into the user, right, that'd be insane levels of duplication, but the closest thing I can do is get this list and then go back and do another query that says give me all the courses where the course id is in this list of owned courses; so with basically two queries I have everything I need.
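Here is a small sketch of that two-query pattern. The field names (course_ids, _id) are assumptions, and find_in_memory is a toy stand-in for a real collection so the example runs without a database; with PyMongo the second step would be a find() call using the same filter dictionary.

```python
def owned_courses_query(user):
    """Filter for the second query: every course whose id is in the user's list."""
    return {"_id": {"$in": list(user["course_ids"])}}


def find_in_memory(documents, query):
    """Toy stand-in for collection.find(); only understands {"_id": {"$in": [...]}}."""
    wanted = set(query["_id"]["$in"])
    return [doc for doc in documents if doc["_id"] in wanted]


# Hypothetical data: the user stores only the ids of the courses they own.
courses = [
    {"_id": 1, "title": "Course one"},
    {"_id": 2, "title": "Course two"},
    {"_id": 3, "title": "Course three"},
]
user = {"name": "example_user", "course_ids": [1, 3]}

# Query 1 would load the user; query 2 loads all of their courses at once.
query = owned_courses_query(user)
print([course["title"] for course in find_in_memory(courses, query)])
```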
We also have the bundle id and some other things going on here.
So that embedded course id, that's actually a list. One more thing to look at down here is this preferences; so this is a short name, somewhat short name, for the preferences for your player. When you're in the video player, you can choose different qualities, you can turn captions on or off (subtitles, transcripts basically), and you can choose a playback speed, which could be like .75 up to two or three or something crazy like this.
One of the primary actions a user does on this site is to go through the course, each course might have 150 lectures so as a user, you come in you look round a little bit and then you go through 150 lectures, so this preferences thing needs to be pulled back frequently.
And so we got to get the user anyway and embedding them together means it's basically instant access any time I'm in the player to figure out how to preconfigure the player to render your video the way that you like it.
So this is an embedded item, but not an embedded list just an embedded preference object.
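To picture the embedded preferences object, here is a hedged sketch using a plain dictionary. The key names and default values are invented for illustration; the point is only that the preferences ride along with the user document, so configuring the player needs no extra query.

```python
# Hypothetical defaults for a player; the real site's keys may differ.
DEFAULT_PREFERENCES = {"quality": "auto", "captions": False, "playback_speed": 1.0}


def player_settings(user):
    """Merge the user's embedded preferences over the defaults."""
    return {**DEFAULT_PREFERENCES, **user.get("preferences", {})}


# The preferences object is a single embedded item, not a list.
user = {
    "name": "example_user",
    "course_ids": [1, 3],
    "preferences": {"captions": True, "playback_speed": 1.5},
}
print(player_settings(user))
```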
So there you have it, a look inside Talk Python Training at least as it was when we recorded this, so hopefully this helps you think through some of the challenges of building a more realistic app.
1:20 |
Let's close out this chapter with a few more sources where you can get some patterns. Recently I had Rick Copeland, who is in the MongoDB Masters program along with myself, on the podcast, on episode 109, to talk about applied MongoDB design patterns.
So this concept of embedding and modeling and data duplication and all these things, we certainly talked about on the podcast, and he talks about it in his book, but he also has a lot of really interesting use cases and actually some performance trade-offs, using some of the atomic update operators, one versus the other or not at all, just to see how that might work out.
So he's got a bunch of use cases, and you might flip through his book once you really get into things and ask, does one of the patterns he talks about really closely match what I'm doing? You might get a huge jumpstart on modeling your data, with actual performance numbers behind it.
So check out the podcast, it's free and check out his book if you find it to be helpful.
And a final thought on modeling with these document databases: there is no perfect answer. It's always this tension of I could model it this way and this part of my app gets better, or I could model it another way, and that part is not quite as good, but another part becomes more flexible or becomes better. So it's really about balancing the trade-offs, not right versus wrong.
1:28:03 |
4:15 |
Now we've got to a serious place in the course where we're going to write some real code and work with some realistic, complex demos, and we're going to do that using what's called an ODM, an object document mapper.
So an object document mapper is like an ORM, an object relational mapper, but in NoSQL we don't have relations, we have documents, so we're going to map documents to objects rather than navigate and traverse these relationships. The one we're going to focus on for this course, my favorite one, and I think one of the best, if not the best, is something called MongoEngine.
So before we get to MongoEngine, let's just look at the overall goal here, and the features of the odm vary whether you're talking MongoEngine or something different, but they generally follow the same principles.
So with PyMongo, this is the low-level API: we have our app and we have PyMongo, which talks to the database. When we write a query that talks to PyMongo, we work in the Python API and we send Python dictionaries to it, which either have the prototypical documents in them or the high-level operators, in-place update operators and things like that, like $addToSet. In order to do that, we basically write in the raw MongoDB API; as we've seen, the only real change in going from the raw shell API of JavaScript over to Python is that we're working with dictionaries and not JSON, and we're working with Pythonic names, so insert_one has the underscore and lower case in Python, unlike insertOne in JavaScript. But this means you're working at a super low level; for certain operations and some of the time this makes tons of sense, and it is one of the faster ways to work with MongoDB from Python.
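As a reminder of what that raw, dictionary-based style looks like, here is a small sketch that only builds the dictionaries. The field names and the helper functions are hypothetical, and no connection is opened; the PyMongo calls that would consume these dictionaries are shown in comments.

```python
def new_book_document(isbn, title):
    """A prototypical document, as it would be handed to insert_one()."""
    return {"isbn": isbn, "title": title, "ratings": []}


def add_rating_update(user_id, value):
    """An in-place update document using the $addToSet operator."""
    return {"$addToSet": {"ratings": {"user_id": user_id, "value": value}}}


book = new_book_document("isbn-placeholder", "Example Title")
book_filter = {"isbn": book["isbn"]}
update = add_rating_update(user_id=42, value=5)

# With a live collection (not created in this sketch) the calls would be:
#   db.books.insert_one(book)
#   db.books.update_one(book_filter, update)
print(book_filter)
print(update)
```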
However, most of the time, we much, much prefer having high level concepts in our application that enforce additional things that automatically use the best features of the PyMongo and MongoDB api without us having to think about how that happens.
So that's when we can bring in an odm, we have the same thing, we got our app, now we're going to have our odm plus PyMongo, we're going to issue a query, but this time we're not going to write in raw api code we're going to actually issue the queries in terms of classes, think SQLAlchemy, think Django orm type of queries here.
So we might have a book class given our previous example, so we'd go to the book and we'd say find based on the isbn equals this and so on, all right, so it's very similar to the Django orm and some of the other orms that you might be familiar with.
So we work in these high-level classes, and that's great, and it translates of course down to the PyMongo API. What's better though, what's really great, is it actually leverages the really advanced in-place operators. So, at least speaking of MongoEngine specifically now, if we pull back an instance of a class and we make a change, say for the book we change the title, and we call save, it's actually going to do a $set operation, an in-place update; it's not just going to push the whole book back into the database with all the optimistic concurrency issues you might run into, no, it's going to make the changes in the best way possible.
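To illustrate the idea of saving only what changed, here is a simplified sketch that compares two versions of a document and builds a $set update from the difference. This is not how MongoEngine tracks changes internally; the function name is made up, and it only handles top-level fields that were changed or added.

```python
_MISSING = object()  # sentinel so a stored value of None still compares correctly


def set_update_for(original, modified):
    """Build a {"$set": ...} update holding only the top-level fields that changed."""
    changed = {
        key: value
        for key, value in modified.items()
        if key != "_id" and original.get(key, _MISSING) != value
    }
    return {"$set": changed} if changed else {}


# Hypothetical book document before and after an in-memory edit.
original = {"_id": 1, "isbn": "isbn-placeholder", "title": "Old Title", "ratings": []}
modified = dict(original, title="New Title")

print(set_update_for(original, modified))
```

Sending only the changed field keeps the write small and avoids overwriting other fields that another process may have updated in the meantime.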
So we'll see that we'll be able to use these advanced operators without actually doing anything other than just working with classes and objects in memory, it's really really sweet; we'll also have additional features, it automatically works with indexes for us, it will automatically add type checking and other types of constraints that simply don't exist in the database, but can be managed at the class level in the object level and described there.
So here's the string field, here's an integer field, and the integer has to be greater than ten; all of that stuff can be done through MongoEngine in our application, but the concept of that doesn't even exist in MongoDB. So you get a lot more structure and a lot more safety by basically describing your schema in classes, and as long as you stick to one application or share that class definition across applications, you're in a much safer place than just randomly sending dictionaries at the database.
So this odm style of programming, I find it to be extremely productive, very clear and quite safe, neat, fast for most of what we got to do, that's really my favorite way to work with MongoDB, and I hope you'll see the power of it and enjoy it after we go through in this chapter.
1:54 |
So here we are at mongoengine.org, and MongoEngine is the document-object mapper; they say think ORM, but for document databases, just like we said. MongoEngine is a great ODM, which they maybe call a DOM, given their naming; I think ODM is slightly more popular. Anyway, it's a really great and flexible ODM, and it has a very clear way of describing your classes. If you use something like SQLAlchemy and you like the way it works, you'll really like this; if you like the Django ORM, it's very similar to that. It actually uses the active record style, which the Django ORM also uses, as does Ruby on Rails, as opposed to say SQLAlchemy, which uses the unit of work style.
It works well in Python 3, it also works in Python 2.
So if you go here you'll see there's actually additional things you can get, you can get a Flask plug in on top of this, you can get a Django plug in on top of this, and some extras as well, there's a couple of cool additions that you get, but we're just going to work with plain MongoEngine, that means we can use it in any application whether it's a web app or not and we can use it however we want in our web application.
Like pretty much everything in this course, MongoEngine is open source, so you can go here to GitHub, mongoengine/mongoengine, and you can see it's got almost two thousand stars, almost a thousand forks, and it was updated fourteen days ago; it's a very active and living project. One of the things I look for when I depend on some core part of my application is: is this thing being updated, is it alive, things like that. You don't want to take on something as critical as your object document mapper if no one is out there maintaining it. You probably don't want to be writing an ODM; you probably want to be using an ODM and building whatever it is you're trying to build, like a website or app or a service API, whatever. It's probably not an ODM you want to be building.
So you see, MongoEngine is quite active, and you can go fork it and keep a copy of it for yourself, but for this course, we're just going to pip install it.
1:39 |
So far, we've been kind of poking at MongoDB, playing around with some of maybe existing data or creating simple little databases with one or two records in it.
We're kind of done with that, we're ready to move on to be building the main application that we're going to build for this course.
So we're going to take this concept of a car dealership that does service for autos, sells cars, does service like engine repair, fixes flat tires and so on, for a Ferrari dealership, and that's going to be our demo for the rest of this course.
On this first go round, we're going to start out with an empty database, or a non-existent database; we're going to model it in MongoDB with MongoEngine, and then we're going to run that code and create a few simple cars, a couple of Ferraris, maybe associate the cars with some owners, and do some service on the cars: somebody over-revved the engine and has got to get a new engine, or got a flat tire, things like that. We'll see how it all works.
Later, when we get to the high performance section, instead of starting with an empty database we're going to start with one with like a quarter million cars and tons and tons of service records, and we will start asking really interesting questions and really focus on the performance side of things.
So we're going to use this for the rest of our time and I have been really waiting till we got to the MongoEngine section to create what I would think of as a somewhat realistic complex demo because with PyMongo it's fine, but you'll see the real power of modeling this in a full featured realistic production style way once we get to MongoEngine things like indexes and uniqueness, and constraints and types and lots of good stuff.
So I hope you're ready to learn MongoEngine, and put it to work building this cool Ferrari dealership.
3:32 |
Here we are in the GitHub repository for the course. Now notice I put the PyMongo play-around stuff that we did into a folder called dir 5 PyMongo; now we're over in our MongoEngine section, and there's actually two things here: there's a service central starter and then there's a service central. A lot of times people like to follow along with the code examples, which I totally encourage, and the starter one is saved in the repository exactly the way it is as we're about to get started.
The other one we're going to evolve throughout this demo until it becomes the final version. So I want to open this in PyCharm, and I want to use a virtual environment to do that, and there's a couple of cool tricks I can do to make life as easy as possible. So here I am in that service central place, and if I do an ls, even with hidden files, you see there's nothing other than this starter Python thing we'll talk about in a minute.
So the first thing I want to do is actually set up a virtual environment with Python 3's venv (python3 -m venv --copies .env): I'll pass --copies, and I'm going to call it .env. The name .env here is something that PyCharm will look for, so if I open this in PyCharm after doing this, it will automatically use this virtual environment; that's cool, and it'll save me a few clicks.
Let's go over here and throw this in PyCharm, now it's going to take it a second, it's sort of looking through that virtual directory, let me add the source control.
So here's a really simple starter application that we're going to talk about but first let's make sure that we have PyMongo installed.
So let's just do a quick list, and notice we're already automatically using our virtual environment; that's because it's at the top level of the project and it's named .env, so PyCharm said cool, we'll use that. I didn't have to do anything; that's why I did that first thing in the terminal before opening it here.
So notice we have basically nothing. It's probably worthwhile to upgrade setuptools; some of the things that depend on C compilation are sometimes a little nicer if I have that set up. OK, so now we can pip install mongoengine, and you'll see that it's also going to install PyMongo, since it depends on 2.7.1 or greater; it's thinking about PyMongo, thinking about MongoEngine, and then we'll be done.
Perfect, it also uses six for Python 2, Python 3 compatibility.
All right, so now we have our system all set up, we have PyMongo installed, and let me just show you this super simple little app; there's absolutely no MongoDB stuff going on yet. We have this main that is going to print this header, very cool; you can see we're going to call our app service central, and it's going to do this user loop. The user loop just says here's some actions that you can do: you can add a car, you can list cars, and if you look at the implementation, all of these are entirely empty; here is where the MongoDB stuff is going to be happening. So let's go and run this. Notice there's no run configuration over here, no green button to run, so I can right click on this and say run, and it runs; let's make this a little higher. Notice that it's running with the Python out of our virtual environment, Python 3, okay. Here's our little header, and then here's our user loop: it says you can add a car, you can list the cars, you can find a car, you can perform service. So we just basically have the structure in place, and we're going to use this for the rest of this demo; like I said, we're going to be building on this concept of what we create here.
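For readers following along without the repository open, a starter app with that shape could look roughly like the sketch below. The function names, menu keys, and messages are guesses rather than the course's actual file; the read parameter exists only so the loop can be driven by something other than the keyboard.

```python
def print_header():
    print("-" * 40)
    print("           SERVICE CENTRAL")
    print("-" * 40)


# Empty placeholders: this is where the MongoDB work will go later.
def add_car():
    print("TODO: add_car")


def list_cars():
    print("TODO: list_cars")


def find_car():
    print("TODO: find_car")


def service_car():
    print("TODO: service_car")


ACTIONS = {"a": add_car, "l": list_cars, "f": find_car, "p": service_car}


def user_loop(read=input):
    """Prompt for an action until the user asks to exit with 'x'."""
    while True:
        choice = read("[a]dd car, [l]ist cars, [f]ind car, [p]erform service, e[x]it: ")
        choice = choice.strip().lower()
        if choice == "x":
            break
        action = ACTIONS.get(choice)
        if action:
            action()
        else:
            print(f"Don't know what to do with {choice!r}.")


def main():
    print_header()
    user_loop()
```

Calling main() starts the interactive loop; it is left uncalled here so the sketch can be imported or pasted without blocking on input.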
4:42 |
Now let's begin by setting up MongoEngine, there's a few start of the app kind of configuration things we need to do in order to use MongoEngine, and then we just use the classes and types throughout the app.
So what I want to do is create a folder here, let's call it NoSQL; we're going to put a number of MongoEngine related things in here, and I don't want to call it MongoEngine because then it will conflict with the package name, so, lacking creativity, I'm calling it this. Now there's a couple of things we need to do: we need to set up the connections and then we need to define the classes. In this first part we're just going to set up the connection.
I'll create a module called mongo_setup, OK, and down here, let's define a function called global_init; we are going to call this function from outside.
Now, in real life later as we talk to like sort of the production stuff we're going to want to pass in like the user name, the password, the server name all sorts of stuff that you know maybe in a real app comes from like a config file or the environment in a production server, something like that, but for now we're just going to put this in here.
So to get started, we have to import MongoEngine, we don't need PyMongo but MongoEngine we need.
And then down here, it's really simple what we need to do, we're going to register a connection, so we're not actually going to open the connection here, this doesn't talk to the database, but it basically says look if you have a class that maps to a particular type or named part of our application use this database connection to do the backend work.
So we're going to come down and say mongoengine.register_connection, and you see it has alias, name, and then other; what comes with the other, the ... there, is the connection string information, like server name, port, host name, use SSL, replica set, all that kind of stuff.
Okay, so to make it really explicit here, we're going to say alias; I'm going to call this core, and I'll tell you what that means in a minute. And for the name, let's call this demo_dealership.
Now normally, I would probably just use dealership but I already have that for something else in a previous example, I kind of want to keep it around so we're going to say demo_dealership, there we go, and that's all we're going to need to do.
So the idea here is we could have multiple things: analytics could be here, and this could be visits or whatever, and it could be mapping to another database (assuming I spelled analytics correctly). So in our classes, we can say this class belongs in the core database, whatever that happens to be configured as, and this one over here happens to belong in the analytics database. I find that's really valuable if you've got some core data that is required to make your app run, and then huge amounts of extra analytical type data that, if you lost it, it's like oh well, I'd rather have that data, but it's not critical. Let's say you've got 5 GB of analytics data and 100 MB of core data: you could run backups on the core server much more frequently than the analytics one, and by partitioning them into different databases or even different servers you can do a lot of cool tricks like that.
Alright, all that said, we're not doing that; we're just going to have one database that we're calling core. So we're going to register this connection, and when we get to defining the classes you'll see a place where we refer to the core connection; that's what we've configured here, and it's going to default to localhost, the default port, everything like that.
Again, when we get to the using MongoDB in production, we're going to talk about how to pass all the extra information you need to use this for real, on another server, on another port, with authentication, everything, but for now, this is what we're going to do to set it up.
So let's go ahead and get started, using this, let's go down here, and we've got our print header, let's go ahead and do a config Mongo, so it's easy enough to import, let's go up here at the top, our module, so we'll just call it Mongo setup like this, and I'll just say global init do a little pep8 formatting, and we're good to go, and it thinks this is misspelled, no, just short alias for MongoDB.
Okay let's just run it to make sure everything is working, alright, there is no real way to test it yet, but in a moment, we will, so far everything worked, we configured our MongoDB connection, next up, it's to actually think about modelling these cars and owners and service, and all those kinds of things.
8:57 |
Alright, let's start defining our classes that we're going to map to the database.
And I guess the first place to begin would be to describe how I think we're going to store the data and what the data is; so we're going to have a car, a car is going to have an engine, with lots of details about the engine like its horsepower and so on, a car is going to have a service history and each entry in the service history is going to be some additional information, like what was the work performed, how much did it cost, when was it done, that kind of stuff.
There is going to be an owner who can own multiple cars and a car can be owned by multiple people, so there's a many to many relationship between owners and cars, and then owners have personal information like their address and stuff like that.
So really the idea is we have cars and owners, and then the cars have additional things like engines and then the thing you can do to the car that's really interesting is you can give it service right, change its tires, change its parts and plugs, give it a new engine and so on.
So we want to model those things, so let's start right at the heart of it, let's start with the car.
So over here, we're going to define another class, another Python file called car, and we'll go down here and we're just going to define a class called car, like this.
Now, we're going to need to work with MongoEngine, because the way it works is all the classes, all the entities we want to map to the database, are going to derive from mongoengine.Document. This allows us to load and save and query the documents; it also provides a field called id, which maps to underscore id in the database and by default is an object id type of thing. Okay, so we don't have to worry about this id, whether it's an object id or not; you can change it, you can put a different one and override it, but if you leave it alone this is what you get.
Okay, so the car now has an object id and we're going to give it a couple of pieces of information, like what model it is. If you've worked with ORMs before, this is very similar: we define the properties of the document as, basically, descriptors. So model is a mongoengine.StringField; you also have things like sorted list field, which is pretty sweet, but we're going to start with a string field, that's nice and easy. While we're at it, let's add a make, so a model might be F40 and the make would be Ferrari.
We're going to have a year, a mongoengine.IntField. Now notice, we have types here, we have strings and we have integers. In MongoDB, things have a type in BSON, they're strings or they're integers, but there is no way to enforce a type; there's no way to say the year must be an integer, you could easily make it a list if you wanted, make it anything you want. In MongoEngine, the field has a concrete type, which is actually really valuable.
Let's have a mileage, and let's say the mileage is going to be a float field, and then it's going to have a vin number, a vehicle identification number, and that is going to be a mongoengine.StringField because it might have alphanumeric characters in it, it might start with zero, things like that.
Okay, so this pretty much is what we got to do in order to map this to the database.
However, there's one more thing that you want to do, so we're going to define this meta dictionary and the dictionary is going to say the database alias we want to use is core, remember that from over here, we said this connection to this database with all the properties that we're not specifying because they're defaults but we could have a server name, port authentication, all that kind of stuff we're going to say go find this connection that we've registered here because the db alias we want to use is core; I find this a really nice way to partition our app up into like central parts and analytics and reporting and those kinds of things.
Then we can also control the name of the collection; we don't want it to be capital C Car, how about lowercase cars.
Alright that's more Pythonic for us, so we're going to call our collection cars, and in the shell we would say db.cars.find, alright, but here we're going to work with MongoEngine.
So this is not the end game, but this is definitely the beginning, let's go down here and write some throw away code just to see that we have everything hanging together.
So let's go down, hit command b, so go to add car, and let's see what do we need here, let's go and grab the stuff we're going to need, in fact, you'll see that some of these we're not going to have to set especially when we get to the final version of what car looks like, but let's say we want to get the model, it's going to be input what is the model I could almost just enter Ferrari because that's what it always is the make, so we have to ask the user some questions here and I'm going to assume this is going to work, assuming that we can parse that as an integer and here we'll say mileage, that's going to be a float and let's go and get a vin number.
Okay, so now we want to create a car, insert it into the database, and later maybe even do a query with it. So we'll say car = Car(), like this, and let's go ahead and import that at the top. I could use keyword syntax to set the values here, so I could say model equals such and such, year equals such and such.
Or I could say car.year = year, car.make = make, notice the autocomplete, which is very nice, car.model = model, and we'll just keep going like this.
And then, in order to insert it, all we have to do is go to the car and say save this is the active record style, in active record you work with a single document and you say load, save, query, things like that right, you save them individually, which maps really well to MongoDB because it doesn't have concepts like transactions.
So let me just put in something wrong here for the mileage, remember the mileage, if you look over here, has to be a float so let's try to put a string in there, all right so run my thing, I want to add a car, it is not going to make it through I believe, so let's say 1998, it's going to be abc— it's going to crash, and it says car validation error no, no, no, the mileage only accepts floats and integers so already in the simplest form of our car, I'm going to do a lot more to it it's already helping us out a lot here, so oh yeah yeah yeah that was supposed to be a float, do you know how easy it is to make that mistake when you're working in raw dictionaries and put in a string in the database when it should have been a float, and then how do you do a sort, how do you do a greater than— you're out of luck, right so we already get a huge value from like the simplest variation.
Okay, let's go and put this in for real now, add a car, it is a F40, it's a Ferrari, it was built in 2005, this time the mileage is 10 thousand miles, and the vin is af2.
There we go, ready?
Oh, it looks like I made a small mistake configuring MongoEngine here, let's go back really quick. That's unfortunate: over here, if you look, I quickly typed in the alias and then I said db, but no, no, the keyword for the name of the database is not db, it's name. Sorry about that, let's fix this here. All right, now let's do it again; now we should be able to add our car. Over here, we don't really need this anymore, so we are going to ask for the input from the user, create the car and save it.
Now that it actually knows what database to use, that should be pretty easy.
So add the car: F40, Ferrari, 2005, driven slightly further since we crashed it when we tried to add it, but here we go, and this would be a F20, boom, just like that, we've added it. Remember, this demo dealership database didn't even exist until we just hit enter there. Now let's see what we got: go back over to our favorite shell, Robomongo, and now we have dealership, which was already there, but now we also have demo_dealership, and check this out, we have cars. In here, if we look at it like this, there we have our model, our make, our year, our mileage, and notice, this is an integer, this is a float. Why? Because in the class that's how we defined it, and the other two obviously are strings. Here is the underscore id that was generated, or brought into existence, just by deriving from the mongoengine.Document class; we didn't have to do anything to make that work.
4:02 |
So we are able to create a car and this is not a great architecture to just jam the writing here, but for now we're just going to leave it right into our main application.
However, let's go look at the car definition again, there's a couple of things that would be nice, it's not part of MongoDB but it would be nice to require like we already have our type checking, it would be nice to say that you have to specify a model and a make and it would also be nice to say you have to specify a year, but maybe we could have the mileage default to zero, like 0.0 as a float for example and it would be even cool to generate the vin id number here for new cars, right.
Typically that comes with the car automatically and you don't have to worry about it, you have to know what your id is.
So it turns out that this is super easy and this only is available in MongoEngine, it is not available in PyMongo and it's not available in the database itself.
So we can come down here we can say this is required or must match a regular expression, or whatever, so we're going to say required is true, so you must specify a model and given this is a Ferrari dealership we could either say this is required or we could give it a default Ferrari, it's going to just make it required; the year also is going to be required so you have to type those three things in, but maybe over here we could have the default be zero.
New cars have 0.0 miles, that seems fair. How about auto generating the vin? Well, the default can really be a callable: we could actually just put float here, and it would call the constructor, the initializer, for float, or we can also put a value. So if we go up here, we can use uuid; if we import that and go have a quick look, you can use uuid4 as a string. If we say str of this, we have those dashes; I don't think vin numbers have dashes, so we could replace dash with nothing, like this. What do you think of that for a vin number? If we could get this little bit of code to run every time we insert a new thing, hey, that would be cool, right? And we can: we go over here and say default, I would like to call a function that returns that value, and the simplest way is to do a lambda that takes no parameters and returns that, ok.
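The expression being built up here can be tried on its own with just the standard library. In the sketch below, `make_vin` is a hypothetical name used only for illustration; in the class the same expression sits inside a lambda.

```python
import uuid


def make_vin():
    # uuid4() is random; str() gives the dashed text form.
    # Removing the dashes leaves 32 hexadecimal characters.
    return str(uuid.uuid4()).replace('-', '')


first = make_vin()
second = make_vin()
print(first)
print(second)  # a different value on every call
```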
So that's cool, let's actually wrap this around so it fits a little better on the big font, small resolution. So now we have a better version; let's go back here, and now we can take away some of these things that we can just let get generated, and we'll save it. Let's try this one more time: I'm going to go down here and say add a car, the model is F40 again, and Ferrari, the year is shiny new, 2017, boom. Notice it didn't ask me about the mileage or the vin number, but did that work? Let's go find out: open the shell, run it again and look at the bottom one, check out the vin number, how cool is that.
So we've got our vin number down here, right this is the one I said 2017 not 2005, this was generated by our default value, this was generated by our default value, and if I haven't done it yet, but if I for some reason omit setting the make let's see what happens if i don't set the make, remember it's required it doesn't matter what I put here, boom field is required, make, right, we can't save it without setting the make but we can save it without setting the vin number because that has a default.
Okay, so go back here, so we can use required and default value as well as other types of things like regular expression matches, these default values can be straight up values or they can actually be callables which result in returning the thing that is the default value in this case each time we add a new car, maybe I'll show you, we'll get a new one, it's going to be 308, and it will be a Ferrari, and it's going to be built in what is that, 1991, now if we go look one more time, there is a 308 again totally distinct number or vin number here, right because each time we insert it, it's calling that lambda, each time you call the lambda, you get a dramatically different uuid.
6:43 |
So we have the primary properties of our car modeled, they have their required fields, they have their default values, things like that.
We still maybe want to consider indexes but we're saving that for a performance area.
The next thing we want to look at is the engine, and the embedded elements.
We talked about at the beginning that the car is going to have an engine, and it's going to be equal to something, not a string or a float or something like that, but in fact, to an entire subclass, right, a class that represents engines in particular, so let's create that class and then we'll come back to the car.
So if we come over here, I'm going to create something called engine, and just like before, we are going to import MongoEngine (not the same engine, right), write class Engine, and this is going to derive from a MongoEngine document type. Now, before, I said Document has the id and is the top-level thing that allows saving and loading, so we don't use that type for embedded documents, for subdocuments, right. Subdocuments don't necessarily have ids, and you don't load, query and save them independently; you can only work with them through their parent document. So in this case, we derive from mongoengine.EmbeddedDocument, which says this is only allowed as an embedded document: it can't be queried or saved directly, but it can be used as a subelement of another type, like for example our car.
So let's go over here and give our engine a couple of properties. We're going to give it the horsepower, and the horsepower is going to be a mongoengine.IntField, so an int field. It is going to have liters, so the size of the engine, and this will be a mongoengine.FloatField because it could be, like, a 2.3 liter engine, something like that. We'll have the mpg, so miles per gallon, and this is going to be the same thing, a float. Finally, it'll have a serial number, and this is going to be a mongoengine.StringField.
And the serial number is kind of like the vin number, but in fact it's not exactly the same, it's going to be having dashes it will have a slightly different format, but let's go ahead and work on some of the default values in that, so we're going to import uuid, I am going to use just again uuid to actually generate this so quick review, default is a lambda, and the lambda is going to return a string representation of uuid4, I believe the dash is in there this time, just to make sure hey this is clearly not a vin number, it's a serial number.
And let's set these all to be required, so we can have in the subdocument itself these required values, except for the serial number, which is going to be autogenerated; all right, so let's go back now that we have this not a document but a subdocument, an embedded document, let's go back to the car so in the car what am I going to set this to, a MongoEngine something, right so this is not going to be a string or float or anything, it's going to be an embedded document field, right, so we have an embedded document list field or just a field so this is a single engine, not a list of them, so here it goes like that and then I need to tell it not just what goes in there, but the document type what is the subdocument, the subdocument is actually this, and then let's go ahead and say required = true so we could even say that this subdocument cannot be none or null in Javascript, it has to be set to a thing.
So, we're going to import the engine, so it knows the car, it knows about the engine and it can save it there, all right, let's try to work with our engine here and see what happens.
Okay, it runs, that's already pretty encouraging, so the model is going to be the Testarossa, I'm sure that's misspelled but we'll roll with it, Ferrari, it was built in 2010 and this is going to crash because the engine was not specified, how do we do that, well let's jump right where the problem is and find out.
So we need to set the engine. Now I'm just going to hard-code the engine settings so we don't have to ask for these, right. So we want to come over and allocate an engine; allocating it is just the same as for the top-level item. We'll say engine.horsepower is around six hundred horsepower, that's pretty insane, isn't it; we have the miles per gallon, I think that's around 20, not super high; liters, let's say 5.0, it's not exactly right, I'm sure, but close enough. And then we just say car.engine = engine, like this, so we create this object and we associate it, and then later we can say car.engine. and we get all the various things here.
Okay, so here we have our car, we set the engine and now let's do this again; in fact, let me just comment this out a little bit, yeah, we'll just ask those two questions, keep it little simpler.
Ok so we're going to add a car, the model is Testarossa, don't think we have one yet, let's open up our shell, we have the couple F40s and the 308, but no Testarossa, 2005, and boom, inserted, okay let's run this again.
Where did it go, oh there it is, check that out how awesome is that!
So we have our Testarossa 2008, the mileage defaulted to zero, the vin number was autogenerated, now here we have our engine and check this out, here is our subdocument curly brace means object subdocument in json so we have horse power 600, liters are 5, miles per gallon is 20, serial number is such and such, let's go and make a quick change here so we can ask some interesting questions, let me make this 22, this is a more efficient version and it only makes 590 horse power, okay let's just insert one more.
So we're going to add what model, so let's say 355 is that a thing, I'm not sure, 2000, like so.
So if we go over here and run this, now we have these two that have engines, this one has a 590 and so on, so we can actually go over here and ask interesting questions, like I want all the cars with an engine where, let's go for liters, and we can say something like $gt, what value, say 5.95, and of course, got to close everything off. What did I miss? I missed something, didn't I, because nothing came back: horsepower could be that much, but liters would be something much smaller. So here, horsepower this much, or for liters I could have done 4.5, there you go. So now you can see that we can query down into this subdocument, but it's going to get more interesting when we start doing queries with MongoEngine, because we want to get these rich objects back.
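The shell query above reaches into the embedded document with dot notation. As a rough illustration of what such a filter means, the sketch below writes the filter as a Python dictionary and applies it with a tiny, hypothetical `matches` helper. The helper is not how MongoDB evaluates queries; it only understands a dotted path with `$gt`, and the sample documents are made up from the values mentioned above.

```python
def get_path(doc, dotted_path):
    # Walk a dotted path such as 'engine.liters' through nested dicts.
    current = doc
    for key in dotted_path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(doc, query):
    # Supports only {'path': {'$gt': value}} clauses; all of them must hold.
    for path, condition in query.items():
        value = get_path(doc, path)
        if value is None or not value > condition['$gt']:
            return False
    return True


cars = [
    {'model': 'F40', 'make': 'Ferrari'},  # saved before engines existed
    {'model': 'Testarossa', 'engine': {'horsepower': 600, 'liters': 5.0}},
    {'model': '355', 'engine': {'horsepower': 590, 'liters': 5.0}},
]

too_strict = {'engine.liters': {'$gt': 5.95}}  # no engine is that large
reasonable = {'engine.liters': {'$gt': 4.5}}

strict_hits = [c['model'] for c in cars if matches(c, too_strict)]
hits = [c['model'] for c in cars if matches(c, reasonable)]
print(strict_hits)
print(hits)
```

The same dictionaries could be passed as the filter argument to a PyMongo collection's `find` method; the first one matches nothing, which is why the initial attempt came back empty.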
5:03 |
Now that we have our engine stored in our car over here as an embedded document, the next thing that we need to work on in our car model is how do we store the service histories, first of all, what kind of data are we going to have in the service history itself.
So let's go create a class that represents that and then we'll figure out what to do with it.
Again we're going to import MongoEngine, create a class called service history and we're going to postpone discussing what goes in there for a minute.
So this is going to have a date, where this has happened either like a create a date or the date when the service was done, so let's create a MongoEngine, a date time field and let's even set the default to now, so we'll go over here and say— so we want to set this to be a lambda, actually we don't need to set it to a lambda, we'll just set it to datetime.datetime.now without parenthesis, we don't want to call it, we want to just pass that function, so we're going to call the now function whenever something gets inserted so the date we could even call this like service date if we want, but I'm going to stick with date.
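The difference between passing `datetime.datetime.now` and calling it can be seen with the standard library alone. In this sketch, `resolve_default` is a hypothetical helper that mimics the idea of "call it if it is callable, otherwise use the value"; it is not MongoEngine's own code.

```python
import datetime


def resolve_default(default):
    # Call callables to get a fresh value; pass plain values through.
    return default() if callable(default) else default


frozen = datetime.datetime.now()   # called once, right here
deferred = datetime.datetime.now   # not called; the function itself

first = resolve_default(deferred)   # evaluated now
second = resolve_default(deferred)  # evaluated again, later
same = resolve_default(frozen)      # always the one captured moment
```

With the parentheses, every record would share the moment the class was defined; without them, each new record gets the time at which it is created.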
The next thing is let's have some kind of description here, just some text. We'll say description, it's going to be a mongoengine.StringField, and it is just going to be something like: changed the oil, or they had a flat tire, a nail was stuck in it, we patched the tire and everything was good. Something like that, right, super simple. We have a price, this is how much they paid for the service, so it will be a float field. And lastly, we care about our customers; we're primarily a service shop, and sometimes we sell our Ferraris and sometimes we just service them, but we want our customers to be happy, and how do you know whether they're happy? We better ask them. So let's ask about their customer rating, and this is going to be an int field; this number is going to go from, let's say, one to five, five being the most satisfied, one being the least satisfied.
Great, so now here's the deal, do we want to embed this into the car like we did the engine or do we need to come over here and say something like this car_id = mongoengine.ObjectIdField, like this, right so we're going to have a relationship over to the car or maybe the other way around, on the car, we could have some bunch of ids that represent the service history, or there's a bunch of other options.
So remember, when we're designing our documents, one of the primary questions is: in our application, do we want that embedded data with us most of the time? It turns out that almost all of our reasons to pull up one of these cars are to actually look at its history, so for that reason we decide that we do almost always want the service history associated with the car, and we don't usually need the service history without the car; we need details about the car, like the mileage, for example.
All of that means we probably want to embed the service history as an array in the car. The other thing we have to care about is: is that set bounded, and is that bound small?
You know, a car, how much could it possibly get worked on? Let's say it gets worked on once a month, every month, ongoing; very, very unlikely, but that would give us at most a hundred of these service histories. Let's say that upper bound is totally fine with us; it's certainly not crazy unbounded, where it's going to escape the 16 MB document size limit. I mean, how much could one of these service histories be, 2 K each? Not a huge deal.
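The back-of-the-envelope estimate above can be written out as plain arithmetic. The 2 KB per record and the 100 records are the rough guesses from the narration, not measurements.

```python
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # the 16 MB per-document limit
RECORD_BYTES = 2 * 1024                # roughly 2 KB per service record (a guess)
RECORDS = 100                          # the upper bound assumed above

history_bytes = RECORDS * RECORD_BYTES
fraction_used = history_bytes / MAX_DOCUMENT_BYTES
records_that_fit = MAX_DOCUMENT_BYTES // RECORD_BYTES

print(history_bytes)      # 204800 bytes, i.e. 200 KB
print(fraction_used)      # a little over 1% of the limit
print(records_that_fit)   # 8192 records of this size would fit
```

Even with generous guesses the embedded list stays far below the limit, which is the "bounded, and the bound is small" condition being checked.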
So for all those reasons we are deciding to embed the service history into the car. We want to come over here, just like we did with engine, and I'll say service history; could be plural, could be singular, let's go with singular. So I'm going to go mongoengine dot, and now it's not an embedded document field, because this is not a single document, this is a list; instead, we are going to have an embedded document list field. Over here in the engine field, we said what the type of the embedded document is; here, what we say is what type of things are contained in that list, so this will be a service history. We've got to import that, thank you PyCharm. Then we could come over here and say the default value is a new array or something, a new list, but in fact that's what it already is: the default value for an embedded document list is a new list for each entry, so we're just going to go with that, that seems totally good to us.
All right, so now we have this mapped, let's actually go back to our app and add the ability to create and insert some histories.
One thing that we almost forgot, since we decided we're going to embed this service history, that tells us how we need to treat the base class for the service history.
So recall for the embedded items, this is going to be a mongoengine.EmbeddedDocument, if it was going to be standalone and in its own collection it would just be a document.
There we go, now our service history is ready to go.
7:32 |
Now it's time to service the car.
So, we got a couple of options here, one possibility would just be let's get a service, like a random car we grab from the database, the other would be to go over here and implement list cars.
So let's actually go ahead and do this list cars here, so we could hit list cars and then we can ask for the id or something to that effect of the car.
Ok, so let's go over here and talk about listing the cars, we haven't done any queries yet, everything we've really done so far is inserts and that's really straightforward, right, we've seen that we create an object, we set some properties, they may be complicated properties but nonetheless, we just set the properties and we call save.
For listing or any sort of query, we're going to come over here and we're going to do something different, so what we want to query are the cars, so we want to say car.objects, now there's a couple of things we can do, we can come over here and we can say filter and set something or if you're doing a single filter, you could actually do it right here, I'll do it the long way then we'll tighten it up, I'm not sure there's really any benefit other than whatever you feel like typing, do you want to be more explicit or do you want to have code be more concise.
We could come over here and we could do a filter on the car, we could say I want to see only the cars that are a particular year or something like that, but I'm not sure we really need to filter at all; really all we want to do is sort them. So we say order by, and then we give it the thing that we want to order by, and for order by we use a string, so let's order by the year and let's just go with this for just a minute. So we'll say cars equals this, and then for car in cars we want to print out, let's say, the make, the model and the vin, something like this.
So here we'll say car.make, car.model, car.vi_number, okay, excellent, so that should print those out, and let's do maybe a little extra line at the end.
Let's go ahead and test this and see if everything is working. So let's list the cars; excellent, surprise, it's all Ferraris, remember, we are a Ferrari dealership. This is not super helpful because it doesn't show us the year; I want to show you that the ordering is working, and if you don't see the year you're not going to be able to verify whether this is working or not.
So let's do this, we'll come over here and we'll say car.year, do it one more time, now we list. All right, look at that: 1991, 2000, 2005, 2005 and 2017, perfect, it's working. But I'd kind of want to see the new ones first, although maybe in terms of service, seeing the old ones is what we want. Let's say we want to get them in reverse, so put a negative here; here we go, now we're sorting the cars newest to oldest, sorting by year, descending. Okay, so that's working really well. What we want to do is basically use this vin number to go find the car we want to service: now that we can see the cars, I'd like to say I want to service a car, come over here, hit s, and it is supposed to say okay, what car do you want to service; I give it one of these, we go to the database, find that car and then insert a service record into it.
With that in mind, let's go and think about this here, let's think about actually showing the service history of each car.
We'll print the number of service records, something like that with format, and what we want to print out is len of car.service_history. As far as we're concerned, in memory this is just a Python list, so we don't need any database magic; a straight-up len will tell us what we're looking for.
And then we can go and look at each one of these, so for s in car.service_history; this doesn't go back to the database a second time, that's part of the beauty of just pulling it back as part of the document. So here we'll print, we'll do like an indent, something like this, and let's say we're going to just show the name, let's do the price and the name, I think that might do it.
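Because the loaded history behaves like an ordinary list, the printing logic can be sketched and tried without a database at all. Below, `describe_car` is a hypothetical helper, and `SimpleNamespace` objects stand in for a loaded car and its service records; the output format is an assumption.

```python
from types import SimpleNamespace


def describe_car(car):
    # Works on any object exposing these attributes, including a loaded Car.
    lines = [f'{car.make} -- {car.model} ({car.year}) with vin {car.vi_number}']
    # len() and iteration on the embedded list need no extra query.
    lines.append(f'  * {len(car.service_history)} service records')
    for s in car.service_history:
        lines.append(f'    * ${s.price:,.0f} {s.description}')
    return lines


car = SimpleNamespace(
    make='Ferrari', model='Testarossa', year=2005, vi_number='abc123',
    service_history=[
        SimpleNamespace(price=15000.0, description='New engine'),
        SimpleNamespace(price=102.0, description='Flat tire'),
    ],
)
report = describe_car(car)
print('\n'.join(report))
```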
All right, so we're going to s.price, s.description, I believe is what we called it, let's go and run this, we should see just zero histories everywhere, zero service records so not super interesting, but it should become interesting as we write this next thing, Alright, so now that we can find the cars, the next thing to do is actually service them, let's go ahead and write one more query here, and then I'll carry on, okay, so first thing we want to do, is we're going to get the vin, I'll say input, now in reality, the car would be sitting in front of us, here we don't really know, so we need to find a way to get the car right, so let's say this, we want to go to the car and we'll say car.objects now here's where we really actually are doing the filter stuff so we'll say filter— and this is pretty cool, we're going to go to this and we're going to basically pretend this function has named parameters that in the simple version represents the prototypical json objects, we saw from the api, from the Javascript and PyMongo api.
So let's go to the car real quick, what is the thing we're looking for, vi number is what I called it, so we're going to say that equals to vin, if we wanted to also match the make = make, right model = model, whatever, if we had three things we want to do like an and on all three, this is how we would do it, but we don't, we just have the vin, now this will give us a list of cars that match this, in theory, this should be unique, we haven't got the indexes or any other way to make them unique yet, but we can get there, now we're expecting one back so we can come here and say first, and that should actually pull back the one car so instead of getting the cursor, we're going to get either a car or none.
If we had none, that means it wasn't found, so let's say if not car, print car with vin not found, okay so a little bit of error handling there, this doesn't throw an exception if it misses, it's just going to be none, all right, so then we'll say print we will service let's say car.make, let's see if this works, okay so list cars, let's go, let me put in service, say bad vin, car with bad vin not found, excellent let's try to service this one that has this vin that I copied, boom, we will service Ferrari excellent, excellent, so we'll go into the database and get it.
Now that doesn't really prove very much, does it, that this is the model, there's only one Testarossa, so if I do this, it should say we're going to service the Testarossa, perfect, all right, so now, we have everything working beautifully in terms of our filter here, now I told you there's two ways to do this, if we have multiple clauses and you want to do multiple filters and multiple ifs and conditionals, just keep piling on the filters, but if you want a simpler version you can actually do it like this, as well, and let's just verify.
Boom, see, we're going to still service that same car.
So now that we have it, how do we get this service record onto the car, you'll see there's actually two ways, and they're both pretty decent.
5:11 |
Okay, so we can list the cars, we can now go and find the individual cars using this query and a little bit of error handling here.
Now, let's go and actually add the service record so we are going to create a service = service record I think this is what we called it, no, service history, sorry, we're going to import that, so we're going to allocate one of these and we just need to set a few properties, the price is let's just straight up input this— what is the price right, maybe we would look up at the type of service and it would have an associated price but this is just a demo, right, just trying to get it to work; so this is the float, some kind of currency thing, I want to say service.date, we don't need to set the date, this should happen automatically, description is going to be input what type of service is this, and then the next one we want is going to be the customer rating, and this is an integer, how happy is our customer, I want to put a one to five, something like this to remind them that that's the scale they get to work with.
Okay, we're just assuming this is parsable by float, this can be an integer right, we're not dealing with bad input.
So that should work, let's just double check the service history these three we're setting, this one has a default, perfect.
Ok, so finally, we don't insert the service history directly, because the service history is not a top level object, it doesn't belong in its own collection. Instead, what we do is go to the car and say service_history, it's a list, so we're going to append to it like this. Once we've changed the car, we need to push that to the database with save, like this. So that's all we've got to do: we put the service history on the car and save.
Let's go and run this and see what we get. Once again we're going to list the cars, and notice we're going to work on the one that is really obvious, this 2005 Testarossa. Okay, so let's service it, here's our vin number, excellent, we're going to service it. What is the price of the service? This is expensive, 15 thousand, this is going to be a new engine, they blew the engine. The customer was very happy: normally a new engine is 20 thousand, so we got him a good deal. Bam, just like that, we've inserted it. Let's try to list it, I might still have that format wrong. No, I got it sort of right, so we definitely got that to work. Let me just change that format like I said, there we go. Actually let's just drop the decimal points, so here, 15 thousand dollars for our new engine, look at that, it's in there.
Let's go and actually do another service on that same car. The price of this one is 102 dollars, this is going to be a flat tire, and the customer was a little grumpy, felt like we took too long to fix the car, but they still like us, they give us a three. So now if we list it, you can see there are two service records on the Testarossa. Pretty cool, right? That's super easy, and we don't need to do a join to get this information, it comes built in, and even if we're doing horizontal scaling across ten shards, it's all there.
Let's go look at Robomongo. It's a little small, what do you think? It does that sometimes, I don't know why. Okay, here is our demo dealership, let's view the documents, and we can say vi_number (now of course we don't really need to do this, but we can, just to show you). We have tons of data; we look down here and now check this out, is that cool? So we've got our flat level things, here's the vin we just did a search by, we have this embedded engine we already saw, we have our service history, and moreover our service history is setting the date, right, so right now it's like eight o'clock, 04 minutes and 56 seconds. The price, all of this is beautiful, just beautiful. So now if we do a query against this and say show me the vehicle with this id number, we're going to automatically get the details about the engine and automatically get the details about its service history, without going back to the database or doing lazy loading or joins or any of those kinds of things, because it's a sweet document database.
Notice, also that some of these over here, some of these have a service history, some of them don't even have engine, this one doesn't have an engine or a service history, and that's probably ok, right, the schema is flexible in MongoDB, the schema is not flexible in our application, it's a class, it is what the shape of that class is, period, but in the sort of evolution of this as the data grows, you can think of this as like an older version of this record and here is a much newer one, it has an engine and a service history, right, but if we ask questions about show me the cars with the service history greater than this, you know, five or greater, these are not going to come up, they will just not appear in those results.
So it's probably okay. If we really need to upgrade them to make sure the shape matches exactly, we could just run a script. The script would be a JavaScript migration script rather than a SQL DDL-type migration script, but a script nonetheless.
So we are able to service these cars. However, there is something we might consider. It's probably fine for this type of application, but if there is contention around these documents, like multiple things trying to update the same record, you could run into some trouble here. We could have two separate threads run this part, they could each enter some stuff here (there's actually a super long delay here because of the input), and then we append it and save, and whichever one got saved last would replace the other one and just throw that data away.
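The lost update described here can be modeled without a database at all. The following is a simplified, pure-Python sketch (a dictionary stands in for the collection, and all names are made up for illustration): two clients read the same document, each appends, and the last save wins, whereas a push applied directly to the stored list keeps both records.

```python
import copy


def load(database, vin):
    """Return a private copy of the stored document, like reading it into an object."""
    return copy.deepcopy(database[vin])


def save(database, vin, document):
    """Overwrite the stored document with the caller's copy (last writer wins)."""
    database[vin] = copy.deepcopy(document)


def read_modify_write(database, vin):
    """Two clients both read before either one saves."""
    first = load(database, vin)
    second = load(database, vin)
    first["service_history"].append("new engine")
    second["service_history"].append("flat tire")
    save(database, vin, first)
    save(database, vin, second)  # replaces the first client's write
    return database[vin]["service_history"]


def server_side_push(database, vin, record):
    """Model of an atomic push: the append is applied to the stored list itself."""
    database[vin]["service_history"].append(record)
    return database[vin]["service_history"]


db = {"vin-1": {"model": "Testarossa", "service_history": []}}
print(read_modify_write(db, "vin-1"))  # ['flat tire'], 'new engine' was lost

db = {"vin-1": {"model": "Testarossa", "service_history": []}}
server_side_push(db, "vin-1", "new engine")
print(server_side_push(db, "vin-1", "flat tire"))  # ['new engine', 'flat tire']
```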
We could add optimistic concurrency by manually implementing it and that would solve that problem, but we could actually make this perform better as well as avoid that problem entirely. So let's come over here and duplicate this (command D in PyCharm, control D on the other OSs), and let's come down here and do this entirely differently. We're going to get rid of this part here, and since we're not pulling back the car, let's actually drop this bit here so you can compare them directly. So we're going to ask for the vin, we're going to create the service history, and now, instead of pulling the record back, making a change and pushing the entire document, which could be like 200 K, we just want to move this data over. Remember that in the raw api we talked about the operators, we had $addToSet and $push, so we want to use those operators and just say: here, take this service record and put it onto the list that is on that document under service_history. So that's what we're going to do now instead. We're going to go to Car and I need to find an individual car, so I am going to do a query here, vi_number = vin, just like we did in our find up here, but instead of pulling it back we're going to take this and say update_one. When we update one, I want to say something like service_history and we'll say service, but how do I tell it to use the operator? This is the first time we've seen this, but you'll see this pattern recur over and over in MongoEngine.
You'll see that there are a couple of different situations where MongoEngine uses double underscores to represent not just the value but the value with an operator. The first one we are going to see is push: we'll say push, double underscore, service_history (push__service_history), and what that means is we're going to push service onto service_history.
Up here before, if you gave us the wrong vehicle id number, we told you no, there's no car with this bad vehicle id number. So what do we do here? Well, this doesn't pull it back, it just updates it if it finds it, so how is this going to work?
We need to put our test over here and say if it wasn't updated. updated comes back as one if it worked, but if it comes back with zero, that is a problem, the exact same problem: we couldn't find the car with that vehicle id number. So now we don't need to append this and push the whole document back, it's all in one shot, atomically, on the server.
Beautiful, so this is a much higher performance and safer thing to do if we don't want any details about the car, we literally just know that this service goes on that service history for that thing, let's try again and see if it works.
So first list the cars, I'll grab the id of the Testarossa, and let's try to service it. The price is one two three, the type of service is waxing, so we decided to wash it for them, and our customer thought the gleam on that car was like nobody's business, very happy. Here's the moment of truth, ready? Boom, car with id not found. Oh, it totally worked, it totally worked, I just have the wrong error message, the wrong case here. Let's do a listing: now we have our waxing. What did I do wrong? There are a lot of ways we could check it, but "if updated" is exactly the one we don't want. I could say if not updated, or if updated equals zero, or if updated is less than or equal to zero, things like this. Let's just go with == 0. All right, let's try it again, so we'll list our cars, notice I still have this, so we'll go and service it. The price is twelve dollars, this is going to be just the check up, and the customer thought something was wrong, we couldn't find it, so they give us a three.
So, see, did it work? Notice there were no errors, and now there's the checkup, twelve dollars. Also notice the order here is literally the order that we're putting things onto that list, so very, very good. We'll do one more, I guess: I'll service something bad, and we get car with bad vin not found. All right, sorry about that, now we've got this working.
So we used push underscore underscore. Now the other one, remember I told you about Pythonic versus non-Pythonic: you could use add_to_set if you want to do sort of a uniqueness thing. For the service history there's no point in this, because of the date field (I'm not even sure about the embedded objects, but certainly the date field is always different), so just do a push. But if you are pushing tags or customer ids or something like that, where it can easily be determined whether there is a duplicate, then you could use add_to_set. Remember in JavaScript it was addToSet, like this; here it's the Pythonic variant, add_to_set, as it probably should be.
Okay, so depending on how you're working with your objects this might make sense or this might make sense, I guess I would tend to say prefer this one, this is safer and faster, so if you have no real reason to choose one or the other, why not safer why not faster.
So we've really explored a lot of MongoEngine and we've built upon the foundation that we laid with the Javascript api and transferring that over to the PyMongo api; so hopefully, nothing you've seen has surprised you in terms of the types of queries that we're doing, it's just learning how MongoEngine surfaces that and turns it into objects was really what we were looking for; now, there are a few other things that we need to talk about that really we haven't touched on yet, the operators, we talked about the atomic update operators but not things like the greater than, less than, exists, doesn't exist, in set and so on, so we want to look at that; we also want to look at querying into subdocuments so if we go back to our MongoEngine here, we'll run this one more time, see, maybe we want to ask questions like show me the cars that have had some either really good service or really bad service, so we want to query all the way down into service history, into customer rating, and do a question like show me the ratings that were 5, show me those the ones that were 4, show me the ones that are less than 3, things like that, so how do we do this in this format that MongoEngine uses?
So I've sketched out this little 'show poorly serviced cars' and it doesn't do anything, it just pulls back every car and prints it more or less like we had before, except for it shows the satisfaction in addition to the other stuff; so the question is how do I query it, let's just run it real quick, and I can say show me the poorly serviced cars, it doesn't matter what we put now, and it literally just lists all of them, and notice this one has a satisfaction of 3, 3, 5 and 3, so that we can do some queries, let's work on two other cars, let's work on the Ferrari 308 and this 2017 F40.
So let's perform some service on this one and let's say this one got some amazing service, the price was 12 dollars, and we have a let's say monthly check up again here spelled right even, and they were just thrilled, so let's do our list really quick, and now, notice this one had a very happy one in fact, if I say the poorly serviced cars for a moment it's going to show that this one had a satisfaction of 5, okay let's suppose the 308 is not having such a good day, let's service it, and let's say that its price was 10 thousand dollars, the type of service was fender dent repair, so maybe the family went out of town and the teenage son stayed home, the son took the Ferrari out, found the keys and crashed it, so you can't blame the guy for being unhappy, but you know, what are we supposed to do, he came in unhappy, we tried to make him happy, but he was just not having it, so he had a 1, and now let's look really quick, just list everything still, so you can see over here, this Ferrari has no records, this one, this F40 2017, was very satisfied, the 308 very unsatisfied, and this Testarossa has some that are satisfied.
Okay, so great, now we have the right variety of data, let's go over here and write the code that we were trying to write in the first place.
What I want to do is find the cars that had great service. We saw that we could do something simple like vi_number = 7, but remember what we want to do over here: we want to go into service_history, and down inside service_history we want to find customer_rating. How do we do that in this format?
Well, it starts with this, service history, and what's the thing called down here, just do a copy to be sure it's identical, because you don't get an error if you get it wrong, just no results.
So I told you that double underscore has special meaning. We used it for the push operator earlier, and we can also use it here to traverse the hierarchy, so service_history__customer_rating is going to go down into the list, and let's say this is going to match whatever level they passed in. All right, let's try this.
So I want to find the poorly serviced cars. Right now the name assumes that we're going to enter a low number, but let's just run with it for a minute. Let's say I want to find the ones with level 1: all right, so it was this Ferrari 308 here, and I think that's the only one that has level one. Let's go and run the poor one but ask for 5 (so like I said, bad name, it's really "serviced at a level" or whatever). Now we have two, right, we have this Ferrari F40 with this here, and we have the Testarossa, which some of the time at least had really good service, the person was super thrilled.
So that's how we search into those subarrays, we used the double underscore. We used it to push onto a thing, and we use it to navigate a hierarchy. The last thing that we really are looking for is to find the cars that, say, have below excellent service or something like that. So let's change this a little bit: what max level of satisfaction are we looking for? We could say 1, and that's a really bad one, or we could say 3 and intend that to be 1, 2 or 3 as the level. It's not going to work this way now, it's just going to be straight-up equality.
So, once again, how do we do it in the JavaScript api or PyMongo? We would use something like this: we would say service_history.customer_rating, and then here, instead of giving a number, we would give it one of those operators, we would say $lte (less than or equal to): level. So how do we do that here? Well, we want to use this operator and we're going to do that again with the double underscore, so we'll say __lte. But here's the thing: the query operators go on the end, the update operators go on the beginning, remember push was like this, so the order varies, for better or worse. I think it has to do with the fact that the query operators go to the right in the raw api and push goes to the left, so it's kind of trying to mirror that.
All right, let's run this again.
So let's see the poorly serviced cars, let's try again for 1, we should see just the 308 because that's the only one with that level, boom, there's the 308.
Let's look for it again, I want to find all the cars with 3 or below, remember, if I scroll this up a little bit, we're doing lte less than or equal to 3, bam, look at that, we got the 308 and we got the Testarossa, which some of the time did have this, all right, if I put 5, we would just get all of them.
So you can see that we can use the double underscore to traverse the hierarchy, we can use the double underscore for the operators, and in fact, we can use the double underscore for multiple meanings in the exact same thing, right here, traverse service_history.customer_rating and then apply the operator less than or equal to the value that we set.
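To see how the two uses of the double underscore line up with the raw api, here is a toy translator in plain Python. It is not how MongoEngine is implemented, just an illustration of the naming convention: query operators are read off the end of the keyword, update operators off the front. The operator tables are deliberately incomplete.

```python
QUERY_OPERATORS = {"lt", "lte", "gt", "gte", "ne", "in", "nin", "exists"}
UPDATE_OPERATORS = {"push": "$push", "add_to_set": "$addToSet", "set": "$set"}


def to_raw_query(**filters):
    """Turn field__subfield__op=value keywords into a raw query document."""
    raw = {}
    for key, value in filters.items():
        parts = key.split("__")
        operator = None
        if len(parts) > 1 and parts[-1] in QUERY_OPERATORS:
            operator = parts.pop()  # query operators sit at the end
        path = ".".join(parts)
        raw[path] = {"$" + operator: value} if operator else value
    return raw


def to_raw_update(**changes):
    """Turn op__field=value keywords into a raw update document."""
    raw = {}
    for key, value in changes.items():
        operator, _, path = key.partition("__")  # update operators lead
        raw.setdefault(UPDATE_OPERATORS[operator], {})[path.replace("__", ".")] = value
    return raw


print(to_raw_query(service_history__customer_rating__lte=3))
# {'service_history.customer_rating': {'$lte': 3}}
print(to_raw_update(push__service_history={"price": 12.0}))
# {'$push': {'service_history': {'price': 12.0}}}
```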
Let's look at one final thing, I think it is not beautiful but knowing about it and expecting it is really, really important, not in the beginning, but as you evolve your application, you'll end up with some funky complications.
I actually chose the cars that I wanted to update very, very carefully. Let me run this again. If I list the cars, notice that in particular I stayed far away from this Ferrari F40 from 2005. There's only one of those, right? Notice the id, it's d15 and ends in 7e. If I list the cars again, there's that one, but oh, it ends in ae. List the cars again, and now it's ending in a1. What is up with that?
Just to be clear, the other ones are not changing, like 0f, that's always the value for the first one, 0f. There's not a problem with Mongo or anything like that. What is going on here is that this car was inserted into the database when we had just a little bit of our class defined here. Remember, in the beginning we didn't have this default concept when I first introduced it, and somewhere along the way, after we had inserted a few cars, we added this. Let's look at Robomongo.
If you flip through here, you'll see almost all the cars have a vi number, vehicle id number, vehicle id number except for that F40 from 2005, right at the top— none.
Because when it was inserted, there was no definition for a vi_number. What the heck was that thing anyway? How was it supposed to know about something that was not there yet but would eventually exist? So we can do a couple of things. The reason you keep seeing these changing numbers is that there's a default value, and basically it gets created every time the document comes back from the serialization layer, but it doesn't get set from the database because there is nothing to set it to. So every time it comes back, it reruns that lambda and gets a new value, and we're not saving it.
So basically what we need to do is we need to upgrade our documents, now sometimes like I said, this doesn't matter, but this one where we kind of counted on a default value to be there, and then it wasn't, well that's unfortunate.
So here's something, there are a couple of things we could do. I could simply come up here and write the script in JavaScript and apply that, that's one option. Another option is, let's go here, and just below configure Mongo I'll say update_doc_versions, or something like this, and define that function here. What we can do is something like this. It's not exactly going to work, but it'll give you the idea of what we're after. So we'll say for car in Car.objects, so basically let's loop through everything in this collection, and let's save it back. I'm going to run this, and let's list the cars and see if we solved it. I really wish it wasn't at the top, but there it is: 19, 02, what's going on? Well, it turns out that unless we somehow forcibly change this object, it's like hey, this object is not dirty and we don't need to push it in there.
So we say mark as changed, vi_number, and let's try it again. Here we go. So we told it that it changed. Sadly, even though it generates a default value, it doesn't look back into the database to compare, which I guess would be super expensive. Nothing really triggered a change: the value came from the database as None and then the default set it, so it doesn't know to push that through. So you have to do this little bit of a trick here and say _mark_as_changed. PyCharm says you're not supposed to do that, so let's just tell it hey, don't make me look bad, we have to do this, you understand, right?
Very well, okay, so now we've got this saved back to the database. We only want to run this the one time, right, we don't want to run it all the time, this is just a one time upgrade of our documents. If you have a 100 thousand records, probably fine; if you have a billion records this is not how you want to do it, you want to do some kind of in place update or something better, write the script. So let's run this again, and now we should see our car here, this is the 2005 F40. You know what, it is time for new tires, let's service this puppy.
Now we come down here and say I don't know how much new tires are in Ferrari, but let's say they're 2000, new tires, the customer is pretty happy they had the low profile ones they were looking for but they could have been like a certain kind, who knows, whatever.
Perfect, that worked. Now if we list our cars again, you can see that this one, which we basically couldn't get to because its vin number kept changing, is now fixed, and that's because when we reran that, we said hey, force the default of the vin number in there. Notice that none of these other ones changed. It did write them back to the database, but it wrote them back in exactly the way they were before, so nothing changed there. I'll just run it one more time. This second one, I'll just copy this and we can go pull it back in a second. So if we run this one more time, and then we try to service this car, one dollar, test service, they were pretty thrilled with that, here you can see the test service, okay.
So the ids are not changing when we do this, it's just if they're not there they are created.
Alright, so if that seemed kind of annoying, I'm sure it was annoying, but let's think about how this would be in a relational database. What would have happened if this was, say, SQLAlchemy or something to that effect? We would have a lot more trouble evolving from one version to the next. We wouldn't have the problem of hey, here's a car that doesn't have a vin number, because we would have had to go and manually change the database every time we added something here: when we added this, we would have had to go back to the database and do a migration or data transformation SQL query to actually add this column, same thing for this. But none of that was required here, it was just this one case where we went back in time that we had to do a little bit of work.
So, sometimes you still have these scripts you've got to run, sometimes you still have these changes you've got to make and a version history to consider, but it's much, much less often than with relational databases, where every little change requires a script or it's just oops, things are out of sync, bam, we can no longer work. But I did want to point out to you that, look, you're going to have to be really careful some of the time as these things evolve: how are you going to deal with the fact that in the database there is this thing that has no vehicle id number?
If we were using PyMongo, it would have just come up as None or a KeyError or something like that. It would have been a little more obvious, but that's just one of the trade-offs you get with these ODMs.
One of the very first things if not the first thing that we need to do is register our connections.
So it's really straightforward: we just import mongoengine and then we call register_connection. You want to give this connection an alias or set it as the default, and then we're going to set the name equal to the name of the database.
Here we're calling this one core, I like to be very explicit and say everywhere that you are working with a connection or a database really you name that explicitly in your code as we'll see later when we get to the classes.
So we register connection, and we set the alias to core, and the db we're going to say name = dealership.
Now, this works well if you're just connecting on localhost with no authentication, default port, all that, right, we just let everything else fall through to the defaults.
When we get to the production deployment, well, that's not really going to fly anymore. We're going to need SSL, we're going to need to enable authentication and pass credentials and all that kind of stuff, so we can use a more complicated variation here, where we do basically the same thing but create this dictionary that has the additional elements. Now it doesn't have to be a separate dictionary, you could just set them explicitly, but it turns out that sometimes, if you want to put this into your log or things like this, it's kind of handy. So we're going to set a username, password, host and port. The authentication source is almost always admin, but not always: it's either the database you're working with, if it's a local user, or admin if it's a server-wide user. The authentication mechanism is SCRAM-SHA-1 (that's the default, or you can change it), and ssl is true. In this case we might be using a self-signed certificate, which is totally good for the encryption, but the connection is going to fail and say we don't trust the certificate. If we trust the server, we can go with ssl_cert_reqs set to none; or, if you want to make sure the server has a certificate you can trust, omit that last line.
And then we just use **data to pass those as keyword arguments to register_connection. Notice that at each step I'm saying get the user from config or environment, so this could be in a web app where these values are stored in the config file. You don't want to put them in source; specifically, you don't want them checked in anywhere, ever. You can put them in environment variables on your server and then grab them at runtime out of the environment and set them here.
The way we primarily work with MongoEngine is we create classes and we map those to collections.
So here we started out with a really simple car. We have a class called Car, and anything that maps to a collection, that is, a top level document, must derive from mongoengine.Document. Then we set up all of the fields. These could be simple or, as we saw, they could be nested rich objects; all the ones listed here are simple, so we have string, string, int, int and string.
So we just do that with mongoengine.StringField and so on.
So this worked pretty well, but we said it would be nice if we could express that some of these are required, that some of these have default values and things like that. So we can come in here and say the model, the make, and the year are all required, just say required=True and you must supply them. For mileage, we might be happy to go with zero as the default, these are new cars, so zero is a good default there. The vi_number, the vin number, is more interesting: we want to generate a large unique alphanumeric string automatically when a car is created, so we'll say default equals and give it some kind of callable, in this case a lambda that returns a string based on taking the uuid4, turning it into a string, dropping the dashes, things like that.
So this worked really well for generating our car and we didn't even have to set the vin number, that just got done automatically.
Finally, we said look, our cars also are going to contain an engine and I don't want to go and do a separate query to a separate table or separate collection specifically, to find out details about the engine and store like the car id in the engine, so instead, we're just going to embed it straight into the car, you have a car, you have the entire details, precisely.
So we did that by first creating an Engine class, and that Engine class has to derive from mongoengine.EmbeddedDocument, not Document, don't make that mistake. Then we set the type of it here in the car to be an EmbeddedDocumentField. The embedded document field takes two things: the type that you're going to put there, so the Engine class, and whether it's required. That second one is optional, but we're going to say, at least here, yes, the engine is required.
We also wanted to store the service history, a set of rich documents modeled by service records. So again here's a class derived from EmbeddedDocument, but this time it's not one thing, it's a list of them, so we have an EmbeddedDocumentListField. This basically starts out as an empty list, and then as we wish we can append these service records to it and then save them back.
So if we have our car model like this and we put one into the database it's going to come out looking like this, we'll have an id, we'll have a model, bunch of other flat elements up there, flat fields we have our vin number generated as 9501, from that lambda expression, the engine has four properties horse power, liters, miles per gallon, serial number, and that is modeled by that engine object, and notice the curly braces, this is an embedded sub document here and the service history, notice square bracket this is a list or an array in Javascript and it has a bunch of sub documents that are the service history.
So with our car modeled in Python on the left what we get here on the right is actually what you'll see in MongoDB.
We saw that inserting in MongoEngine is super, super straightforward, it's really delightful. So here we're going to create a car, but remember the car requires an engine: the engine is required, and the engine must be an instance of an Engine object.
So we're first going to create an engine and set things like the horsepower, the liters and the miles per gallon. Notice this is a Chevy Bolt, which is an electric car, so we just ramp the miles per gallon up super high, liters is zero, because how many liters does an electric engine have? None. And I'll say it's 120 horsepower, I really have no idea.
Then we're going to create the car, its model is a bolt, its make is a chevy and the year is 2017, and then we just pass the engine along, right engine = engine, one is a keyword value and one is just the name of the variable.
So then we have our car, and right now the id of the car is not stored in the database, so we hit save and boom, now we have like a car with its id and its default values set all of those things stored in the database.
So this is great for inserting one car but if you are going to insert a thousand or a hundred thousand or a million cars let me tell you, this is the slowness right here, you do not want to do this; there's a much better way, maybe you don't take a million inserts at once maybe you bulk it up and do like 50 at a time or a 100 at a time, but if you are going to do some kind of bulk insert how do you do that?
Also super easy. Let's suppose we have a list of cars that we want to insert. I'm not showing how you initialize the cars, but it's the same as above, basically, just skip the save step. So we're going to get car one, car two, and when we want to insert a bunch of them, we just go Car.objects.insert and give it that list, and boom, it does a bulk insert in MongoDB, which, if you're inserting many items, is much, much faster.
|
|
show
|
3:32 |
We saw querying in MongoEngine was really quite straightforward and mostly was done by way of interacting with the class and the arguments, not breaking down and working in the actual MongoDB or even PyMongo api; so let's look at a couple of options, a couple of ways in which we might do some queries.
So here we have a function called find_car_by_id, and notice it's taking a car id, where we're using type annotations, or type hints, to indicate that this is an ObjectId that comes in, and what we're returning is an individual, single car. Really, for the type hints to be entirely correct you should say Optional[Car], because it may be None, as we saw.
So we are going to say Car.objects and then filter(id=car_id), so what we're saying is we're looking for the car object that has id = car_id. Now this is by primary key, basically, so you expect it's one or zero; we do a first, so that actually returns either the object or None.
One thing to know is in the database it's _id, and in MongoEngine they are like, forget the underscore, it's just straight up id, so minor difference there. We also saw that if you have just one filter, like a really simple thing, you could just say Car.objects(id=car_id) and don't have to do the filter step, but I kind of like this explicit style.
So we're going to say we're looking for one or more fields we're only passing id but we could pass id and vin number and other types of things and we call first to get one or nothing at all, next, we might want to query by subdocuments or things contained in a list inside of that document, stuff like that; so we can also do this with MongoEngine.
Same type of thing, Car.objects.filter; however, what goes in here is no longer just the straight field name. In fact, we're going to use the double underscore to traverse that hierarchy, so we're going to go down to service_history, and then within service_history we're going to look at customer_rating.
Now, if we're going to return this, we maybe don't want to return an active cursor; we may either want to use a generator or, as here, actually create a list. We're saying list of cars, so that by the time this function is done executing it has entirely finished whatever it's doing with the database, and you'll basically be done with the database by the time you leave this function.
Notice we're also using type hints to say this returns a list of cars; this time we didn't call first, we're just converting all of the responses to that list.
All right, finally: this one that we just looked at was looking exactly at the customer rating of 4, but here we want to know, like, show me all the cars that were not rated with great service, right, that is 3, 2, 1 or 0, if that's a possibility. And in fact, we're not going to pull all the cars back; we want to know, as a percentage, how are we doing: how many cars do we have that had sub-amazing service, versus all of them. So we're using the count operator here: we can get all the cars by saying Car.objects.count, or we can run our query and say count and get just the ones that match the query.
So in this case, we're going to use the double underscore yet again, this time to apply the less-than operator. What we're saying in this case is, the bad cars, how many are there? Well, we're going to go to service_history.customer_rating, match all the ones that are less than 4, and count how many of those occur.
Right, so we'll just use the count operator instead of actually returning and deserializing the documents; this is much, much faster than saying give me all the bad cars and then doing a len on the result, unless you have really, really great service.
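To make the double-underscore convention concrete, here is a deliberately simplified sketch of how such keyword filters line up with MongoDB's dotted paths and `$` operators. It is not MongoEngine's implementation: `to_raw_query` and `OPERATORS` are hypothetical names, and only a handful of comparison suffixes are handled.

```python
from typing import Any, Dict

# Comparison suffixes handled by this sketch (MongoEngine itself supports more).
OPERATORS = {"lt": "$lt", "lte": "$lte", "gt": "$gt", "gte": "$gte", "ne": "$ne"}


def to_raw_query(**filters: Any) -> Dict[str, Any]:
    """Translate double-underscore keyword filters into a MongoDB-style query dict."""
    query: Dict[str, Any] = {}
    for key, value in filters.items():
        parts = key.split("__")
        operator = None
        # A trailing suffix such as "lt" is an operator, not a field name.
        if len(parts) > 1 and parts[-1] in OPERATORS:
            operator = OPERATORS[parts.pop()]
        path = ".".join(parts)  # nested fields become a dotted path
        query[path] = {operator: value} if operator else value
    return query


exact_match = to_raw_query(service_history__customer_rating=4)
less_than = to_raw_query(service_history__customer_rating__lt=4)

print(exact_match)  # {'service_history.customer_rating': 4}
print(less_than)    # {'service_history.customer_rating': {'$lt': 4}}
```

In the lecture's terms, the second filter is what sits behind counting the cars with a rating below 4, and dividing that count by Car.objects.count gives the percentage of cars with sub-amazing service.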
|
|
show
|
3:33 |
Finally, let's talk about updating documents.
It's actually really, really easy in MongoEngine to update a document once you get one back from the database. It could be that you've gotten one, or you've gotten a whole list and you just happen to be making a change to a particular one of them, it doesn't really matter. So in this case, see on line three, we're getting a car, we're finding it by id, and we're saying first. First of all, we verify that we got the car back; on line five, we're like, no car, error. But let's assume we got the car: we're going to create one of these service records, append it to the list that is the service history, and then we want to push that down into the database. To save it, all we have to do is call car.save and it will actually push that down.
And we saw that there was a possible conflict, a possible race condition at the database level: if this type of code, or some other operation, were changing other parts of the car with the same id at the same time, it's possible that we could overwrite those changes. Maybe, not for sure, it depends on how everybody changed different parts of the car, but there could be a problem with saving this here.
So you want to be careful when you're doing this, but this works totally fine most of the time. It depends on your situation: how actively you're changing things, how much contention there is for particular documents. But assuming the contention is low, we're going to be able to get the car, make the changes to it, and call save, and it'll push that right back to the database.
However, if the contention is high, you care about performance or you really just want to take most advantage of MongoDB both for safety and performance, you can use the in place updates.
Here you can see we have this owner object that we've introduced, and this is like the owner of the car. Maybe we want to record how many times this owner has been to our service shop; owners could own more than one car, and so maybe we want to know that this particular person has been in ten times, even though they have a new car that's only been in twice. So we're going to have this number_of_visits, which is an integer on the owner, and we can actually use the increment operator right on it, like this: we can say Owner.objects(id=customer_id), that's like the primary key, then update_one, then the increment operator, double underscore, name of the field, so inc__number_of_visits. You can increment it by whatever you want, even a negative number, which is really a decrement, but there's just the increment operator.
So basically add that number on the right, the one here, to the number of visits. So this is cool: for the individual value operators you can use set, you can use increment, some of those particular operators. But if we're going to work with a list, as in our case, where we just saw we are adding a service record to a car, we could do the same thing in place with the push operator, instead of pulling the document back, adding to it and saving it again. So we create a service record, and this time we say Car.objects again, where the query, the where clause if you will, is id = car_id, and then we say update_one and use the push operator, so push this subdocument onto the service history.
In this case, what we get back is the number of updated items; we said update_one, so you can bet it's one or zero, and if it's not one, something went wrong.
So it supports in-place updates with the individual value operators like increment and so on; it also supports things like push and add_to_set for working with sets or arrays in the documents.
This is both better in terms of safety, because you get this transactional behavior where the change to that one document is applied atomically on the server, and better in terms of performance, because you don't bring the document back, make changes and push it in; you just shove the delta down to the server.
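The difference between the two update styles can be sketched without a database at all. The snippet below is a plain-Python simulation, not MongoDB or MongoEngine code: a dict stands in for the stored car, and it models the worst case in which each client writes the whole document back. All function names are hypothetical.

```python
import copy


def fresh_store():
    """A dict standing in for one car document held by the server."""
    return {"_id": 1, "service_history": []}


def load(store):
    """A client reads its own private copy of the whole document."""
    return copy.deepcopy(store)


def save(store, local_copy):
    """A client writes the whole document back, replacing what was there."""
    store.clear()
    store.update(copy.deepcopy(local_copy))


def push(store, field, value):
    """Server-side delta: append one value to a list field in place."""
    store[field].append(value)


# Pattern 1: two clients each load, modify, and save the full document.
store = fresh_store()
client_a = load(store)
client_b = load(store)
client_a["service_history"].append("new tires")
client_b["service_history"].append("new wipers")
save(store, client_a)
save(store, client_b)  # overwrites client A's change
whole_document_result = list(store["service_history"])

# Pattern 2: both clients send only their delta.
store = fresh_store()
push(store, "service_history", "new tires")
push(store, "service_history", "new wipers")
in_place_result = list(store["service_history"])

print(whole_document_result)  # ['new wipers']
print(in_place_result)        # ['new tires', 'new wipers']
```

With the in-place style, neither client ever holds a stale copy of the list, so there is nothing to overwrite.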
|
|
|
1:01:24 |
|
show
|
3:10 |
Now that you know how to work MongoDB, you know how to work its shell, what the query syntax is, you've seen PyMongo as well as MongoEngine, it's time to turn our attention to tuning MongoDB to be the best database it can possibly be.
We're going to focus on how to make our regular MongoDB server a high performance MongoDB database and you'll see there's no magic here, a lot of the things that you can do are relatively straightforward, and there's a systematic way to go about it.
I want to start this section, this chapter, by putting a little perspective out there.
When people come to NoSQL and start looking for alternative databases, often the allure of these databases is their performance: you hear about things like sharding, horizontally scaling them, some incredible performance numbers, things like that.
That may be what you really need, that may be the most important thing and certainly if you don't have performance out of your database it's a big problem.
We're going to certainly figure out how to make our databases faster and the variety of techniques that we have available to us in MongoDB.
That said, your biggest problem probably isn't performance, you may have a big data problem, you may have terabytes or petabytes of data but most applications don't.
You may have a performance problem, it may be that you have so much data or you are asking such complex queries that it really does take very precise tuning and scaling to make it work.
So we're going to focus on some of these types of things.
That said, we all have a complexity problem with our applications. It's always a pain to maintain these databases, especially when we're working with relational databases: you hear about things like migrations and updating your schema, adding, removing, transforming columns. All of this stuff is really complex, and it even makes deployment really, really challenging: you want to release a new version of something based on SQLAlchemy, but you need to change the database schema before it will even run. Okay, that sounds like it could be a little bit of a problem.
What you'll see with MongoDB and these document databases is one of their biggest benefits is the simplicity that they bring.
The document structure means there are fewer tables and far fewer connections between those tables, so when you think about the trade-offs and performance and things like that, keep in mind that probably the biggest benefit you are going to get from MongoDB is a simpler versioning, evolution, maintainability, and development story.
I just want to put that out there, because I know sometimes people will say, well, I got MongoDB to perform at this speed, and I've got this other database, and if I tweak it like this and adapt it like that maybe I could get it to go a little faster, so maybe we should use that instead.
And maybe, I don't know, it depends on the situation, and this is very abstract, so it's hard to say, but keep in mind that one of the biggest things these document databases bring to the table is this simplicity.
It just so happens we can also make them really, really fast.
So simple and fast, sounds like a great combination, so let's get into this section where we are going to make MongoDB much faster.
|
|
show
|
6:57 |
You've heard MongoDB is fast, really fast, and you've gone through setting up your documents and modeling things, you inserted, you imported your data, and you're ready to go; and you run a query and it comes back, so okay, I want to find all the service histories that have a certain price, greater than such and such, how many are there— apparently there's 989, but it took almost a second to answer that question.
So this is a new version of the database, so we are going to talk about it shortly.
Instead of having just a handful of cars and service histories that we maybe entered in our little play-around app, it has a quarter million cars with a million service histories, something to that effect.
And the fact that we were able to answer this query of how many sort of nested documents had this property in less than a second, on one hand that's kind of impressive, but to be honest, it feels like MongoDB is just dragging, this is not very special, this is not great.
So this is what you get out of the box, if you just follow what we've done so far this is how MongoDB is going to perform.
However, in this chapter, we're going to make this better, a lot better.
How much— well, let's see, we're going to make it fast, here's that same query after applying just some of the techniques of this chapter.
Notice now it runs in one millisecond, not 706 milliseconds.
So we've made our MongoDB just take off, it's running over 700 times faster than what the default MongoDB does.
Well, how do we do it, how do we make this fast?
Let's have a look at the various knobs that we can turn to control MongoDB performance.
Some of which we're going to cover in this course, and some are well beyond the scope of what we're doing, but it's still great to know about them.
The first knob is indexes. It turns out that there are not many indexes added to MongoDB by default; in fact, the only index that gets set up is on _id, which is basically an index as well as a uniqueness constraint. Other than that, there are no indexes. And it might be a little unintuitive when you first hear about this, but manually tuning, tweaking and understanding the indexes in document databases is far more important than understanding indexes in a third-normal-form relational database.
So why would that be?
That seems really odd.
So think about a third normal form database, you've broken everything up into little tiny tables that link back to each other and they often have foreign key constraints traversing all of these relationships, well, those foreign key constraints go back to primary keys on the main tables, those are indexed, every time you have one of those relationships it usually at least on one end has an index on that thing.
In document databases, because we take some of those external tables and we embed them in documents, those subdocuments while they kind of logically play the same role there is no concept of an index being added to those.
So we have fewer tables, but we still have basically the same number of relationships, and because of the way documents work, we actually have fewer indexes than we would in, say, a relational database.
So we're going to see that working with, understanding, and basically exploring indexes is super, super important, and that's going to be the most important thing that we do.
In fact, one of the things the MongoDB folks do is sell services, consulting and whatnot, to help their customers, and you could hire them and say, hey, I've got this big cluster and it's slow, can you help me make it faster? The single most dramatic thing that they do, the thing that almost always is the problem, is fixing incorrect use of indexes.
So we're going to talk about how to use, discover and explore indexes for sure.
Next is document design, all that discussion about to embed or not to embed, how should you relate documents, this is sort of the beginning of this conversation, it turns out the document design has dramatic implications across the board and we did talk quite a bit about this, but we'll touch on it again in this chapter.
Query style, how are you writing your queries, is there a way that you could maybe restructure a query, or ask the question differently and end up with a more high performance query, maybe one example misses an index and the other particular example uses a better index or something to this effect.
Projections and subsets are also something that we can control. Remember when we talked about the JavaScript API, we saw that you could limit what comes back in your responses, and this can be super helpful for performance: you could write a query that returns 5 MB of data, but if you restrict that to just the few fields that you actually care about, maybe it's only kilobytes instead of 5 MB. It could be really dramatic, depending on how large and nested your documents might be.
We're going to talk about how we can do this, especially from MongoEngine.
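A small simulation shows why restricting the returned fields matters. This is plain Python standing in for what the server would do: `project` is a hypothetical helper, and the document is invented. In MongoEngine the analogous step is typically the query set's `only()` method, though the exact call used later in the course is not shown here.

```python
import json


def project(document: dict, fields: list) -> dict:
    """Keep only the requested top-level fields of a document."""
    return {name: document[name] for name in fields if name in document}


# Invented car document with a long embedded service history.
full = {
    "make": "Chevy",
    "model": "Bolt",
    "year": 2017,
    "service_history": [
        {"description": f"visit {n}", "price": 100.0 + n, "customer_rating": 4}
        for n in range(200)
    ],
}

slim = project(full, ["make", "model"])

# Serialized size is a rough stand-in for bytes sent over the network.
full_bytes = len(json.dumps(full))
slim_bytes = len(json.dumps(slim))

print(slim)  # {'make': 'Chevy', 'model': 'Bolt'}
print(f"{full_bytes} bytes vs {slim_bytes} bytes")
```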
These are the knobs that we're going to turn in this course; these are the things that will work even if you have a single individual database, so you should always think about them. Some of them happen on the database side, document design and indexes, and the other two, query style and projections, live in your application as it interacts with the database. But MongoDB, being a NoSQL database, allows for other types of interactions, other configurations and network topologies and so on.
So, one of the things that it supports is something called replication, now replication is largely responsible for redundancy and failover.
Instead of just having one server I could have three servers, and they could work in triplicate: one is what's called the primary, and you read and write from this database, and the other two are just there, ready to spring into action, always getting themselves in sync with the primary. If the primary goes down, one of the others will spring in to be the primary, and they will sort of fix themselves as what used to be the primary comes back.
There is no performance benefit from that at all.
However, there are ways to configure your connection to say allow me to read not just from the primary one, but also from the secondary, so you can configure a replication for a performance boost, but mostly this is a durability thing.
The other type of network configuration you can do is what's called sharding.
This is where, instead of putting all your data into one individual server, you spread it across 10 or 20 servers, one 20th on each, hopefully evenly balanced across all of them. Then when you issue a query, MongoDB can either figure out, if the query is based on the shard key, which server to point it at and let that one handle the query across its smaller set of data, or, if it's general, like show me all the things with a price greater than this, it might need to fan that out to all 20 servers, but it would run in parallel on 20 machines.
So sharding is all about speeding up performance, especially write performance, but also queries as well. You can get tons of scalability out of sharding, and you can even combine these: when I said there are 20 shards, each one of those could actually be a replica set. So there is a lot of stuff you could do with network topology and clustering and sharding and scaling and so on.
We're not turning those knobs in this course, I'll show you how to make individual pieces fast, the same idea applies to these replicas and shards, just on a much grander scale if you want to go look at them.
|
|
show
|
1:42 |
Let's return to our dealership.
This was the example we started back when we began the MongoEngine section, and it turns out the dealership is super popular now.
Before, we just had a couple of cars; now we have a quarter million cars in our database and 100 thousand owners. I don't believe we talked about owners before in terms of what that looks like in our code, but I've added this concept of owners so we can ask interesting cross-document questions, and we'll look at the details when we get to the code, in just a moment.
Each one of these owners, these 100 thousand owners, owns an average of 2.5 cars. These are kind of like collectors, right, not a standard person who drives to work or whatever; these are Ferraris. And each car has on average about 5 service records, which could be a new engine, a tire change, a new spark plug, whatever. In particular, there are about 1.25 million service histories, so when we ask questions about those nested documents that have to do with service histories, like customer ratings and price, you can see that is really quite impressive, I think: we've got the quarter million cars, and interspersed within those quarter million documents are 1.25 million service histories.
So our job is to make a lot of the typical things that we might ask this database, the queries we run to do so, come back in a couple of milliseconds, not in seconds; that's going to be the basic goal of this whole section.
Now, the other things you might want to know is we've got about 180 megs of data and on average each document of the various document kinds, all average together is about 500 bytes per document.
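Those figures hang together, as a quick back-of-the-envelope check shows. The only assumption added here is that the 500-byte average applies to the top-level owner and car documents, since the service records are embedded inside the cars.

```python
owners = 100_000
cars_per_owner = 2.5
services_per_car = 5
average_document_bytes = 500

cars = int(owners * cars_per_owner)        # quarter million cars
service_records = cars * services_per_car  # embedded inside the car documents
top_level_documents = owners + cars        # what the collections actually store
approx_megabytes = top_level_documents * average_document_bytes / 1_000_000

print(cars)              # 250000
print(service_records)   # 1250000
print(approx_megabytes)  # 175.0, in line with "about 180 megs"
```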
So let's return to our example, slightly transformed, and see how it's performing now, and let's make it fast.
|
|
show
|
2:49 |
Here we are in the github repository for this course and notice we have this data section and in here I have this thing called dealership db 250 K that is this data that I just talked about, with the 250 thousand cars, 100 thousand owners, that sort of thing.
So I'm going to put that over here on the desktop and unzip it and if we look in here, you'll see that there's a cars collection and an owners collection, and I don't believe we've spoken about how to get this data into MongoDB, so let's go over here and I'll use RoboMongo, notice we have these two dealership things that I have been playing with and I want to create one called like test dealership or something to that effect.
We're going to restore this. How do we do that? We'll say mongorestore, and this is the way that we get this exported data imported into MongoDB. Now, the first thing you have to ask yourself is, is this additive to the database: if data already exists, do you want to insert this alongside it, or do you want this to be the database and replace anything that exists? We want this one to replace existing data, so I'll say --drop. Then I need to tell it what database, so I'll say --db. What you should say is dealership, but just because I don't want to wipe away what I currently have, I'll say dealership example; the code that you're going to run expects the name of the database to be just dealership. And then I need to give it the folder that it's going to work from, so I am just going to give it this folder, like so, all right.
So mongorestore, --drop to replace the data, --db to name it, and then the location. We hit go, and it's going to go cranking away on this, and you can see it's inserting, inserting, and done; that was really fast for close to 1.5 million records.
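The same command can be assembled from Python, which keeps the examples in one language. This sketch only builds and prints the command line; `build_restore_command` is a hypothetical helper and the folder path is a placeholder for wherever you unzipped the data. Newer versions of the MongoDB tools may prefer other flags, so treat it as a starting point.

```python
import shlex
from typing import List


def build_restore_command(database: str, dump_folder: str, drop: bool = True) -> List[str]:
    """Assemble a mongorestore command line as an argument list."""
    command = ["mongorestore"]
    if drop:
        command.append("--drop")  # replace existing data instead of adding to it
    command += ["--db", database, dump_folder]
    return command


# Placeholder folder name; point this at the unzipped export.
command = build_restore_command("dealership", "./dealership_export_folder")
print(shlex.join(command))
# mongorestore --drop --db dealership ./dealership_export_folder

# To actually run it (needs the MongoDB tools installed and a running server):
#     import subprocess
#     subprocess.run(command, check=True)
```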
All right, so let's go over here and refresh and here's our example and we can see that we have our collection, here's our cars and we could just ask how many cars are there.
Notice there are that many, and we can change this to owners; remember you can also write it like this. Now notice, I think the restored data we got here has an extra index; you want to drop this index right here, so that, like mine, it only has the id indexes. OK, so that's the example I just restored; we're going to work with something you can imagine is exactly the same.
So we're going to work with this dealership code but the way it got there, I'll show you the app I used to originally create it, and then I just restored it using mongorestore just as I showed you up here.
So the way to generate the data that goes into mongorestore is you say mongodump.
|
|
show
|
5:36 |
Let's explore this slightly updated version of our code.
Here we are in the github repository, and I am in the source folder and I've added an 08_perf section, and we have the starter_big_dealership and we have the big_dealership it even has instructions here to tell you basically how to restore that database we did just in the previous video.
This one is going to be a snapshot of how this chapter starts, it's what we're starting from now and will remain that way; here we're going to take basically a copy of that one and evolve it into the fast high performance version, so let's go over here and see what we've got.
Now, we have a few things that are slightly different, the car is basically unchanged from before although I added a little comment about how do we get to the owners.
The one thing that is new here, in terms of the model, is this owner idea. Cars can now have an owner, and the way we know which cars are owned by this owner is we have a list of object ids; those are the object ids of the cars, so we're going to push the ids of the cars that are owned here. I guess we could run it as a many-to-many or one-to-many relationship, just depending on how we treat the owner, but theoretically we can have a single car that has multiple owners, and owners that own multiple cars, and we can manage it this way. You almost never see a car-to-owner intermediate table; you're almost always going to have those ids embedded either in the owner or in the car, or, under rare circumstances, both.
So here's how we refer back to the cars, then we have a few basic things like the name, when was this owner created, how many times have they visited and things like that.
We want to call it owners in the database and it's just this core collection, so other than that, there's not a whole lot going on here, let's look over here, we now have these services, I've taken all the car queries and moved them down here do you want to create a car, you call this function, do you want to record a customer visit, here we can go to the owner and we can use this increment operator to increment the number of visits in place.
Find cars by make, find owner by name and so on.
Number of cars with bad service, a lot of this stuff is what we wrote previously; there was the program thing that we ran over here that was interactive and I've replaced that with a few things, one is this db stats and you can run this and it will tell you like how many cars are there, how many owners are there, what's the average number of histories, this is basically those stats that I presented to you before, this takes a while to run on this database, I don't recommend you run it but if you want to just run it and see what you get you can.
The database was originally created using this script, I am using something interesting you may not have heard about, I am using this thing called Faker, so down here Faker lets you create this thing and I'm seeding it so it always generates exactly the same things, I'm seeding random and fake and you can see down here it's creating the owners and you can ask it for things like give me a fake name, give me a fake date between these two dates, things like that.
Similarly with cars, we're using random to get a hold of a lot of the numbers, then we can use fake for anything else we might need.
We ran this with the right amount of data and it built it all up for us. So if for some reason you need to recreate it, run this load data script; you can have it create a small database or a large one, depending on which settings you comment or uncomment.
Those are all good, this is like the foundation and this is where we are.
Next, we're going to ask interesting questions of this database and we want to know how long those questions take to answer, so I've written this super simple function called time you pass it a message and a function, it will time how long the function takes to run and then print out the message along with the time in terms of milliseconds.
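A minimal sketch of such a timing helper might look like the following. It is a guess at the shape rather than the course's exact code: it is named `time_call` here to avoid shadowing the standard `time` module, and returning the elapsed milliseconds is an addition for convenience.

```python
import time
from typing import Any, Callable


def time_call(msg: str, func: Callable[[], Any]) -> float:
    """Run func once, print msg with the elapsed milliseconds, and return them."""
    start = time.perf_counter()
    func()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{msg} - {elapsed_ms:,.3f} ms")
    return elapsed_ms


# Stand-in workload; in the course this would be a database query wrapped in a lambda.
elapsed = time_call("Sum of the first 100,000 integers?", lambda: sum(range(100_000)))
```

Passing the query as a lambda means the helper controls exactly when the work starts and stops, so only the query itself is timed.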
And then we're going to go through and ask interesting questions here, like how many owners, how many cars, who is the 10 thousandth owner; notice the slicing here to give us a slice of length one, and then we'll just access its one item. Then we can start asking interesting questions like how many cars are owned by the 10 thousandth owner, or, if we go down here, how many owners own the 10 thousandth car, so asking it in the reverse direction.
Here we want to find the 50 thousandth owner by name; so yes, technically we already have them, but the idea is we want to do a query based on the name field, and we originally won't have any performance tuning around these types of queries, so it should be slow.
This one, how many cars are there with expensive service, was the one with the snail; in one of the first videos in this chapter, I showed you that it takes 700 milliseconds to answer the question of how many cars have a service history with a price greater than 16800.
So we're going to be able to ask all of these questions, and this program will let us explore that. We'll see how to add indexes; I'll show you how to add them in the shell and how to add them in MongoEngine. And MongoEngine is really nice because as you evolve your indexes, as you add new ones, simply deploying your Python web app will upgrade whatever database it connects to with those indexes automatically, so it's really, really nice.
So here you can see we're going to run this code and ask a bunch of questions we could load the data from here, we could generate the data, but you're much better off importing the data from that zip file because this takes like half an hour to run, you saw that zip takes like five seconds.
|
|
show
|
10:15 |
Let's go ahead and run this code, you've seen the minor changes like the addition of this concept of an owner, and how we generated all this data, and how you can restore it.
Let's go ahead and run it, and see what's happening.
Let's look at this from two perspectives, let's begin over actually in Robomongo, so we're going to ask the question, basically how many owners own a certain car the idea is more or less we're going to call this function which goes right here, really what we're looking for is this query, find me all of the owners where this car id is in their car ids collection, just generate and deserialize that.
The other one that we're going to focus on is show me the cars with the expensive service history, how many cars or what cars had some kind of service that cost over 16800 dollars.
Let's begin by looking at those in Robomongo.
Here we have this concept, we could simplify this a little bit, but it doesn't matter, cars here's the service history, let's go to the price where that's greater than 16800, how many of them are there.
If I run this, notice, it took a while to come back, run it again, here's the speed right there, 0.724 sec, 0.731, 0.733, so it's pretty reliably taking around 700 milliseconds to answer that question.
We're going to come back to this.
Here's a more interesting example: go and grab a car somewhere deep in the list, in this case I put 61600, then find me all the owners where that car id appears in their id list, and then we'll just dump that out. Because we assigned it with var it doesn't print; if you just state the name, it will show up down here. So make sure to deselect and run this. This is actually surprisingly fast, given all the stuff that's going on here, but it's still taking about 75 to 80 milliseconds to run, which, I don't know, maybe in your database, going across 100 thousand records, 80 milliseconds seems decent. I can tell you in MongoDB 80 milliseconds is terrible; you should really think about making anything that takes 80 milliseconds faster. It's not always possible, but for most queries, as we'll see, it is.
Let's take this one and just try to understand what's happening here and then we're going to go look at it in Python, but let's just explore it here in the shell for just a moment.
Why is this taking 700 milliseconds?
MongoDB has this way to basically ask how are you running this query, and the way you do that is you say explain, like so, so I can say this query instead of giving me a result tell me how you're running it, if I unselect it, it just runs the selected stuff if there's something there, so we can go and look at it in this mode, so it says okay, here's what the query planner found for you, we've parsed this query, and this is something it's basically what went into the find, it also might have something to the effect of like a sword and other things that are happening, but this is a simple query.
Look down here, see this winning plan, stage column scan, that is bad, that is really, really bad.
Also notice the rejected plan, so if there are multiple indexes and other things that could have done it might have attempted a bunch of them and said no, no, no this is the best, let's see it doesn't seem to tell us any more about what it did there, like sometimes it'll tell you how many records it scanned and things like this, but it's just basically reading entirely in the forward direction over this and just doing a comparison.
So that's why this was taking 700 milliseconds: it was literally reading and comparing 100 thousand entries, or actually more; remember there are 1.2 million service histories across those 250 thousand cars, so not 100 thousand, 1.2 million records it scanned over. That's bad, you don't want that.
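To read the same explain output from Python, here is a minimal sketch. It assumes the plan is a nested dictionary shaped like the winning plan in the shell, where a stage may wrap a child under `inputStage`. The helper names `plan_stages` and `uses_collection_scan`, the sample plans, and the index name are made up for illustration. With PyMongo, a plan like this would typically come from something like `collection.find(query).explain()["queryPlanner"]["winningPlan"]`, though the exact layout can vary by server version.

```python
def plan_stages(plan):
    """Return the stage names in a plan, outermost stage first."""
    stages = []
    while plan is not None:
        stages.append(plan.get("stage", "UNKNOWN"))
        plan = plan.get("inputStage")  # None once we reach the innermost stage
    return stages


def uses_collection_scan(plan):
    """True if any stage reads the whole collection (COLLSCAN)."""
    return "COLLSCAN" in plan_stages(plan)


# Hypothetical plans, shaped like explain output before and after an index.
no_index_plan = {"stage": "COLLSCAN", "direction": "forward"}
indexed_plan = {
    "stage": "FETCH",
    "inputStage": {"stage": "IXSCAN", "indexName": "example_price_index"},
}

print(plan_stages(no_index_plan), uses_collection_scan(no_index_plan))
print(plan_stages(indexed_plan), uses_collection_scan(indexed_plan))
```

A check like this could be dropped into a test that fails whenever an important query falls back to a collection scan.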
So what we can do is actually add an index. Now there are two ways to add an index, but before I add the index, let me just say that explain is super, super valuable: any time something is slow, we're going to explain it. There's actually a way to turn on profiling and tell MongoDB to log all of the queries it sees that are slower than x, where you provide the threshold; say 10 milliseconds might be great, show me all the queries that take more than 10 milliseconds, and then you can drop them in here, put an explain on them, and start creating indexes to make them faster.
So just google mongodb profile enable slow queries or something like this, it's pretty straightforward.
Now let's run this code, we're asking a lot of questions what we want to run is q and a, so we go over here and just right click and say run, notice some of these things are taking time, the database might be cold, it might have not loaded that stuff, so let me run it one more time just to be fair, there's a few things that are already really fast, and that's cool, so let's go here and review, how many owners are there— well, I can tell you it doesn't show the answer it just sort of says this is the question I'm asking here is how long it takes.
Three milliseconds, that is solid, how many cars— half a millisecond.
That's pretty solid, I don't think we can improve the count on the entire collection but this one, find the 10 thousandth owner— not good, so let's see how many cars are owned by that person— this is pretty fast actually, this is surprisingly fast, how many owners this can have— 66 milliseconds that's the one we were looking at in there.
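The question-and-timing pattern in that script can be sketched roughly like this. This is not the course's actual q and a code; `time_query` is a hypothetical helper, and the lambda is a stand-in for a real database call so the sketch runs without a server.

```python
import time


def time_query(description, run_query):
    """Run a zero-argument callable, print how long it took, return both."""
    start = time.perf_counter()
    result = run_query()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{description}: {elapsed_ms:.3f} ms")
    return result, elapsed_ms


# Stand-in for a real query such as counting the documents in a collection.
result, elapsed_ms = time_query(
    "How many cars?", lambda: sum(1 for _ in range(250_000))
)
```

Running each question a couple of times, as we do here, helps separate a cold database from a genuinely slow query.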
I'm going to take these numbers and put them over here, let's say, this will be Without indexes we're going to get this, we don't really care about the exit code, do we?
With indexes, and we're going to kind of iterate on this a little bit so let's begin over here, and we're going to talk about how we can add an index in MongoDB and then for the most part do this in MongoEngine because it's really part of the way our application works, what the indexes are, and it's better to make that part of our document then kind of do a separate database setup step; we could create a script in Javascript and run it, it will do these things and that may be fine, but let's go over here and work on this.
Again we had the count, here's the almost 800 milliseconds, so let's go over here and I'll take this, I'll make a copy. Here is what we can do: instead of doing the find operation we can say create index, and then we have the thing that we're doing the query on. Most of the time this is one item, but you can have composite indexes; they are a little more nuanced so we'll talk about them later, but let's just do this one: we want to be able to query by service history's price. Here we can put one of two things, one or minus one: what do you want the default sort to be, ascending or descending?
A lot of times it doesn't really matter, it can read from the back or it can read from the front, whatever; you saw the forward direction on our COLLSCAN for example.
So over here we could say one, this creates an index, there's no count; the other thing we can do is we can give it a name so we can come over here and say name is search by service history price, so if we go look in this little indexes, we'll see the name here, we can also say run in the background, if I don't say that it's going to block the database until the index is generated, if you're doing this in production, and you have tons and tons of data maybe background is the way to go.
Okay, anyway let's go ahead and run this and see what happens.
Notice the pause: it's actually computing the index right now and the database is effectively down. Now it's back, so what do we get? Okay, created collection automatically: no, it already existed; the number of indexes before was one, now we have two, and everything was ok. So if I refresh, here's that index, and I can actually edit this over here in Robomongo: go to the advanced properties, and here is the create index with background, whether it's sparse, how long it lives, whether it's based on text search or whatever, but here's just the basic thing.
We've added this index; remember this took 800 milliseconds. Ask the same question now, boom, 8 milliseconds.
Ask it one more time, 2, here we go, 2, 2, 2, 3, 2, 2, right; the screen sharing is probably putting a pretty heavy load on the server, which is also the database server, but still, we're getting it 350, 400 times faster by adding that.
Now if I go back and I ask that question explain now we get something way better, winning plan is index scan index name search by service history price, that is really awesome; that means we're using our index which is so much faster.
There were no rejected plans, so it only found one index; it tried to use it, it found that it was awesome, it's very happy.
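Why is the index scan so much faster than the collection scan? As a rough analogy only (MongoDB's indexes are tree structures on the server, not Python lists), the sketch below compares counting matches by walking every value with counting them by a binary search over a sorted copy. The price data is synthetic and the function names are made up.

```python
from bisect import bisect_right

# Synthetic "prices": 100,000 values spread evenly over 0..19,999.
prices = [(i * 37) % 20_000 for i in range(100_000)]


def count_by_scan(values, threshold):
    """Like a collection scan: look at every value and compare."""
    return sum(1 for value in values if value > threshold)


# Like building an index: pay the sorting cost once, up front.
sorted_prices = sorted(prices)


def count_by_index(sorted_values, threshold):
    """Like an index scan: jump straight to the first value above threshold."""
    return len(sorted_values) - bisect_right(sorted_values, threshold)


print(count_by_scan(prices, 15_000), count_by_index(sorted_prices, 15_000))
```

Both functions give the same answer; the difference is that the second one only needs a handful of comparisons instead of one per record, which is the same kind of win we just saw in the shell.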
Go back to my count one more time, boom, 2 milliseconds, and that's a really good answer. Let's go run our Python code and see what answers we get now. That was already faster; let's go over here: load car name and ids with expensive prices and spark plugs, 20 milliseconds, this is actually a pretty complicated query we'll get into; cars with expensive service, 1.9 milliseconds.
This is exactly what we saw in Robomongo, so over here in MongoEngine, we're getting essentially the same results— how cool is that?
Very nice, we're going to go through and in Python from now on we're going to add the necessary index to start making these almost all of these run super fast, all of them run fast some of them we can get incredibly fast, like one millisecond, others not quite that fast, but we'll still do good on all of them.
|
|
show
|
20:04 |
Now that you've seen how to create indexes in the shell in Javascript effectively, let's go and see how to do this in MongoEngine.
I think it's preferable to do this in MongoEngine because that means simply pushing your code into production will ensure that the database has all the right indexes set up to operate correctly.
You theoretically could end up with too many, if you have one in code and then you take it out but you can always manage that from the shell, this way at least the indexes that are required will be there.
I dropped all the indexes again, let's go back through our questions here and see how we're doing.
It says how many owners, how many cars; this is just based on the natural sort, however it's stored in the database, so there's really nothing to do here. But this one, find the 10 thousandth owner, let's look at that; that is going to basically be this by name, we'll use test, it doesn't really matter what we put here. If we put explain, this should come back as COLLSCAN or something like that, yeah, no indexes, okay, so how long did it take to answer that question?
Find the 10 thousandth owner by name, it didn't say by name, I'll go and add by name, well that took 300 milliseconds, well that seems bad and look we're actually using sorting, we're actually using paging skip and limit those types of things here, but in order for that to mean anything, we have to sort it, it's really the sort that we're running into.
Maybe I should change this, like so, sort like so, we could just put one, I guess it's the way we're sorting it, so here you can see down there the sort pattern, name is one, and guess what, we're still doing a COLLSCAN.
Any time you want to do a filter by, a greater than, an equality, or you want to do a sort, you need an index.
Let's go over to the owner here, this is the owner class and let's add the ability to sort it by name or equivalently also do a filter like find exactly by name, so we're going to come down here we're going to add another thing to this meta section, and we're going to add indexes, and indexes are a list of indexes, now this is going to be simple strings or they can be complex subdictionaries, for composite indexes or uniqueness constraints, things like that, but for name all we need is name.
Let's run this, first of all, let's go over here and notice, if I go to owners and refresh, no name, let's run this code, find the 10 thousandth owner by name, 19 milliseconds, that's pretty good, let me run it one more time, 15 yeah okay, so that seems pretty stable, and let's go over here and do a refresh, hey look there's one by name; we can see it went from what was that, something like 300 milliseconds to 15 milliseconds, so that's good.
How many cars are owned by the 10 thousandth owner, so that's 3 milliseconds, but let's go ahead and have a look at this question anyway.
How many cars are owned by the 10 thousandth owner: here's this function right here that we're calling; it doesn't quite fit into a lambda expression, so we put it up here. We want to go and find the owner by id, and that should be indexed because it's the id, and the id always has an index; and now we are saying the id is in this set. So we're doing two queries, but both of them are hitting the id, so those should both be indexed, and 3 milliseconds really seems to indicate that that's the case.
How many owners own the 10 thousandth car, that is right here.
So we'll go find the car, ask how many owners own it.
Now this one is interesting, so remember we're basically doing this in query; let's do a quick print of car id here. If we go back over to this, we say let's go over to the owners save your documents, so this is going to be car ids, and it's going to have an object id of that. All right, so run this: zero records, apparently this person owns nothing, but notice it's taking 77 milliseconds. We could do our explain again here, and COLLSCAN yet again, not the most amazing.
So what we want is to have an index on car ids, right, because COLLSCAN, not good; I think it's not really telling us in our sort example, but for the find it definitely should be.
So we can come back to our owner over here, let's add also like an index on car_ids, If we'd run this once again, just the act of restarting it should regenerate the database, how long did it take over here— a little late now isn't it, because I did the explain, I can look at this one, how many cars, how many owners does the 10 thousandth car have, 66 milliseconds, if we look at it now— how many owners own the 10 thousandth car, 1.9 milliseconds, so 33 times faster by adding that index, excellent, find the 50 thousandth owner by name, that's already done.
Alright we already have an index on owners name so that goes nice and quick, and how is this doing, one millisecond perfect, this one is super bad, the cars with expensive service 712 milliseconds, alright so here, we're looking at service history and then we're navigating that .relationship, that hierarchy, with the double underscore, going to the price, greater than, less than, equal it doesn't matter, we're basically working with this value here, this subdocument.
Let's go over to the car and make that work. Now the car doesn't yet have any indexes but it will in a second, so what we want to do is represent that here, and in the raw way of discussing this with MongoDB we use . (dot), not double underscore, so . represents the hierarchy here.
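To make the double underscore versus dot correspondence concrete, here is a small sketch. It is a simplified, hypothetical translator, not how MongoEngine is implemented; the helper name `to_raw_filter` and the short operator list are assumptions for illustration, and the real library handles many more cases.

```python
# A few comparison suffixes, chosen for illustration only.
OPERATORS = {"gt", "gte", "lt", "lte", "ne", "in"}


def to_raw_filter(keyword, value):
    """Turn 'a__b__gt' style keywords into a raw {'a.b': {'$gt': value}} filter."""
    parts = keyword.split("__")
    operator = parts.pop() if parts[-1] in OPERATORS else None
    field_path = ".".join(parts)  # the hierarchy is written with dots
    if operator is None:
        return {field_path: value}
    return {field_path: {"$" + operator: value}}


print(to_raw_filter("service_history__price__gt", 15000))
# {'service_history.price': {'$gt': 15000}}
print(to_raw_filter("service_history__description", "Spark plugs"))
# {'service_history.description': 'Spark plugs'}
```

The dotted path on the left is the same string we write in the indexes list, which is why the index entries use dots even though the Python queries use double underscores.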
Let's run that again, notice expensive service, 712, cars with expensive service, instead of 712 we have 2.4 milliseconds. Now notice that the first time I ran it there was a pause, and the second time it was immediate; that's because it was recreating that index, and that pause was how long the index took to create.
So here we have cars with expensive service, now we're getting into something more interesting, look at this one with spark plugs, we're querying on two things, we're querying on the history and the service, let's actually put this over in the shell so we can look at it.
I've got to convert this over, do the dots there, this is going to be the dollar greater operator, colon, like so, all right, so we're comparing that service history.price and this one, again because you can't put dots in normal json, do the dot here and quotes, and this one is just spark plugs, alright, let's run this, okay 22 milliseconds, how long is it taking over here— 20 milliseconds, so that's actually pretty good and the reason I think it's pretty good is we already have an index on this half and so it has to just basically sort the result, let's find out.
Winning plan, index on this one, yes, exactly so this one is just going to be crank across there but we're going to use at least this index here, this by price so that gets part of the query there.
Now maybe we want to be able to do a query just based on the description, show me all the spark plugs; well, that's a COLLSCAN, so let's go back and add over here one for the description.
Now how do I know what goes in this part, see I have a service history here, if we actually look at the service record object it has a price and description, right so we know that that results in this hierarchy of service history.price, service history.description.
If we run this again, it will regenerate those; let's go over here and run this, and let's see, now we're doing an index scan on price. What else do we have, rejected plans, okay, so we got this and query, and it looks like we're still using the... yes, oh my goodness, how about that for a mistake: a missing comma. So what did that do? In Python you can wrap these lines, and without the comma the two strings were joined into this single entry, and obviously that's not what we want; that comma is super important there.
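That missing comma mistake is plain Python behavior and easy to reproduce on its own. In this sketch the list names are made up, and the two field paths are the ones we have been indexing:

```python
# What we meant: two separate index entries.
intended = [
    "service_history.price",
    "service_history.description",
]

# What we typed: no comma, so Python joins the adjacent string literals.
accidental = [
    "service_history.price"
    "service_history.description"
]

print(len(intended))    # 2
print(len(accidental))  # 1
print(accidental[0])    # service_history.priceservice_history.description
```

Python raises no error here, so the only symptom is one nonsense index showing up in the database instead of the two you expected.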
So let me go over here and drop this nonsense thing, try this again, I can see it's building index right now, okay, once again we can explain this, okay great, so now we're using price and actually we use the description this time and you can see the rejected plan is the one that would have used the price, so we're using description, not price, and how long does it take to run that query— 7.9 milliseconds, that's better but what would be even better still is if we could do the description and price as a single thing.
How do we do that?
This gets to be a little trickier, if we look at the query we're running, we're first asking for the price and then the description, so we can actually create a composite index here as well, and we do that by putting a little dictionary, saying fields and putting a list of the names of the fields and you can bet those go like this, now this turns out to be really important, the order that you put them here price and the description versus description price, for sorting, not so much for matching, run it one more time, alright, expensive cars with spark plugs, here we go, look at that, less than one millisecond, so we added one index, it took it from like 66 milliseconds down to 15, and then, we added the description one, it turns out that was a better index and it took it from 15 to 9, we added the composite index, and we took it from 9 to half a millisecond, a 0.6 milliseconds, that is really cool.
Notice over here, this got faster, let's go back and look at what that is.
Load cars, so this is the one we are optimizing and what are we doing here— let me wrap this so you can see, we're doing a count, okay, we're doing a count and so it's basically having the database do all the work but there's zero serialization.
Now in this one, we're actually calling list so we're deserializing, we're actually pulling all of those records back and let's just go over here and see how many there are, well that's not super interesting, to have just one, is it, alright, that's good, but let's actually make this just this, let's drop this spark plug thing and just see how many cars there are with this, okay there we go, now we have some data to work with, 65 thousand cars had 15 thousand dollar service or higher, after all, this is a Ferrari dealership, right.
Now, it turns out it's a really bad idea to pull back that many cars, let me stop this, let's limit that to just a thousand here as well.
Okay, so we're pulling back thousand cars because we're limited to this and we're pulling back a thousand cars here.
But notice, this is car name and id versus the entire car, so let's go over here: cars with expensive service, car name and id. Notice the time: to pull back and deserialize those thousand records took a while, basically one second, and if we don't ask for all the other pieces, if we just say give me just the make, the model and the id (here we're using the only keyword, which says don't pull back the other things, just give me these three fields when you create them), it makes it basically ten times faster. Let's turn this down to 100 and see, maybe get a little more realistic set of data.
Okay, there we go, a 100 milliseconds down to 14 milliseconds, so it turns out that the deserialization step in MongoEngine is a little bit expensive so if you like blast a million cars into that list, it's going to take a little bit.
If we can express that I only want to pull back these items, then it turns out to be quite a bit faster; in this case not quite ten times faster, but definitely faster.
Let's round this out here and finish this up.
Here we're asking for the highly rated, highly priced cars, we're asking like hey for all the people that come and spend a lot of money how did they feel about it?
And then also what cars had a low price and also a low rating, so maybe we could have just somehow changed our service for these sort of cheaper like oil change type people.
It turns out that that one is quite fast, this one we could do some work and fixing one will really fix the other so we have this customer rating thing, we probably want to have an index on, and we already have one on the price, so I think that that's why it's pretty quick actually.
Go over here, and we don't yet have one on the price, on the rating rather, so we can do that and see if things get better, not too much, it didn't really make too much of a difference, it's probably better to use the price than it is the rating, because we're kind of doing that together, so we're also going to go down here and have the price and customer rating, one of these composite indexes, once again, and maybe if we change price one more time, rating and price— it doesn't seem like we're getting much better, so down here this is about as fast as we can get, 16 milliseconds and this is less than one millisecond, so that's really good.
The final thing is, we are looking for high mileage cars, so let's go down here and say find where the mileage of the car is greater than 140 thousand miles, do we have an index on that, you can bet the answer is no.
Now we could go to the shell and see that, but no we don't have one, so let's go up here and add one more, and this is in fact the only index we have here in this thing that is on like just plain field, not one of these nested ones like this; so maybe we also want to be able to select by year, so we could have one for year as well.
I'm going to add those in.
Now this high mileage car goes from a hundred and something milliseconds down to six, maybe one more time just to make sure, yep, 5, 6, seems pretty stable around there.
So we've gone and we've added these indexes to our models, our MongoEngine documents by adding indexes and we can have flat ones like this, or we have these here, and we also can have composite ones or richer things, if we create a little dictionary and we have fields and things like that.
Similarly an owner, we didn't have as many things we were after but we did want to find them by their name and by car id, so we had those two indexes, honestly this is just a simpler document than the cars.
So with these things added here, we can run this one more time and see how we're doing that code all runs really quick, if we kind of scan through here, there's nothing that stands out like super bad, 5 milliseconds, half, 18, 6, half, 1, 3, 1, let's say, this one, I really wish we could do better, it just turns out there is like so many records there that if we run that here you can see that the whole thing runs in one millisecond, super, super fast, we can't make it any faster than that.
The slowness is basically the allocation, assignment, verification of 100 car objects.
I'd like to see a little better serialization time out of MongoEngine. If you have some part of your code that has to load tons of these things and it's super performance critical, you could drop down to PyMongo and talk to it directly, and probably in the case where you're doing that you don't need to pull back many, many objects. But you can also see that if we limit what we ask for down here, that goes back to 14 milliseconds, which is really great. Here we're looking at a lot of events, this is like 16 thousand, or no, 65 thousand, that's quite a bit; this one is really fast, this one is really fast, so I feel like from an index perspective we've done quite a good job. How do we know we're done?
I guess this is the final question, this has been a bit of a long— how do we know we're done with this performance bit?
We know we're done when all of these numbers come by and they're all within reason of what we're willing to take.
Here I have set this up as these are the explicit queries we're going to ask and then we'll just time them, like your real application does not work that way.
How do you know what questions your application is asking and how long they're taking?
So you want to set up profiling, and you can definitely google how to do profiling in MongoDB. We can come over here and say db.setProfilingLevel, and you can use this function to say I'm looking for slow queries, and to me slow means 10 milliseconds, 20 milliseconds, something like that. It will generate a collection called system.profile and you can just go look in there and see what queries are slow. Clear it out, run your app, see what shows up in there, add a bunch of indexes, make them fast, clear that collection, then turn around and run your app again, until stuff stops showing up in there. You can basically find the slowest one, make it faster, clear out the profile and just iterate on that process; that will effectively gather up all of the meaningful queries that your app is going to do, and then you can go through the same process here to figure out what indexes you need to create.
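If you would rather drive that loop from Python than from the shell, a minimal sketch might look like this. It only builds the `profile` command document and sorts some sample entries, so it runs without a server. The helper names, the database name in `ns`, and the sample entries are all made up; with PyMongo the command could be sent with something like `db.command(profile_command(10))` and the entries read back from `db["system.profile"]`.

```python
def profile_command(slow_ms=10):
    """Build the `profile` command: level 1 records only operations slower than slow_ms."""
    return {"profile": 1, "slowms": slow_ms}


def slowest_first(profile_entries):
    """Order profiler entries so the worst query is at the top."""
    return sorted(profile_entries, key=lambda entry: entry["millis"], reverse=True)


# Made-up entries, shaped like documents in system.profile.
sample_entries = [
    {"op": "query", "ns": "example_db.owners", "millis": 77},
    {"op": "query", "ns": "example_db.cars", "millis": 712},
    {"op": "query", "ns": "example_db.cars", "millis": 22},
]

print(profile_command(10))
for entry in slowest_first(sample_entries):
    print(entry["millis"], entry["ns"])
```

Fix the entry at the top, clear the profile, run the app again, and repeat until nothing over your threshold shows up.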
|
|
show
|
1:49 |
We've seen how powerful adding indexes to MongoDB is and I talked a little bit how the nested nature of these documents means there's naturally fewer primary keys, so there's fewer on average actual indexes that get created just as part of working with the database; so creating these indexes is even more important in document databases than it is in relational databases.
So here we are in the shell, this would be Robomongo or just the Mongo command line interface and we can create an index on a collection by saying db.collection name so here we have cars.createIndex and then we pass it two things, first one required, second one optional we pass it the actual fields we want to create the index on; so here we have service_history.customer_rating so we could traverse this hierarchy if necessary we just use that dot like we have been in the shell the whole time and then we say one or minus one, so do you want to sort ascending or descending.
And this mostly matters for either what you might consider the natural sort or if you're doing a composite key or a composite index and that composite index is being used for sorting on both fields and all the orders have to line up exactly for the sort to use that index.
Then we can pass additional information, here we have background as true and the name, I like to name my indexes if I'm doing this shell because then it's easier to see like okay why did I create this index here we want the customer ratings of service, so that's pretty nice, background true, that's not the default but that means it will run basically in the background without blocking the database operations, if you don't put that, when you hit go the database will stop doing any sort of database stuff until this index is generated so be aware.
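The same index can also be requested from Python. This sketch only builds the arguments, so it runs without a database; with a PyMongo collection they could be passed as `db.cars.create_index(keys, **options)`. The function name and the index name string are assumptions for illustration, and note that newer MongoDB servers (4.2 and later) ignore the `background` option.

```python
ASCENDING = 1    # same value as pymongo.ASCENDING
DESCENDING = -1  # same value as pymongo.DESCENDING


def rating_index_arguments():
    """Build the key list and options for an index on the nested customer rating."""
    keys = [("service_history.customer_rating", ASCENDING)]
    options = {
        "name": "customer_ratings_of_service",  # hypothetical index name
        "background": True,                     # do not block other operations
    }
    return keys, options


keys, options = rating_index_arguments()
print(keys)
print(options)
```

The dotted path works the same way here as it does in the shell: it walks from the car document into the embedded service records.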
|
|
show
|
1:52 |
Now if we're using MongoEngine, we don't have to go to the shell and manually type all the indexes; we basically go to each individual top level document, so all the things that derive from mongoengine.Document, not the embedded documents, and we go to the meta section and we add indexes, basically an array. Here, you can see the blue stuff that's highlighted: we want an index on make, we want an index on service history, and within service history (remember these are service records, shown at the bottom) we want an index on the description and price.
So for index that we put 'make', that's straightforward and then we have service_history.customer_rating so service history is the field name and then customer rating is the field name of service record and for some reason I don't have it blue, it's that last one down there but we also want this composite key so service_history.price and service_history.description we want to be able to find where both of those match and we're going to do that up by having a more complicated entry in the indexes bit here this is going to be a dictionary where the fields are set to be this array of strings and not just the flat string itself.
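Written out as data, the indexes entry described above might look like the sketch below. To keep it runnable without MongoEngine installed, it is shown as a plain dictionary; in the real model the same dictionary would be assigned to `meta` inside the class body of the document that derives from `mongoengine.Document`. The names `CAR_META` and `indexed_field_groups` are made up for illustration.

```python
# In the model this would be: meta = {...} inside the Car document class.
CAR_META = {
    "indexes": [
        "make",                             # flat field
        "service_history.customer_rating",  # field of the embedded service record
        {                                   # composite index: a dict with a fields list
            "fields": [
                "service_history.price",
                "service_history.description",
            ]
        },
    ]
}


def indexed_field_groups(meta):
    """List the fields covered by each index entry, one group per index."""
    groups = []
    for entry in meta["indexes"]:
        if isinstance(entry, str):
            groups.append([entry])
        else:
            groups.append(list(entry["fields"]))
    return groups


for group in indexed_field_groups(CAR_META):
    print(group)
```

Each string becomes its own single-field index, while the dictionary entry becomes one index spanning both fields.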
So once we add this, when we run our code, it's actually going to first time we work with that document ensure that all the indexes are there, and remember that like hung up our application for just a little bit, but the real benefit here is our app is always going to be in sync, we don't have to go oh oops, I forgot to add the index, that one particular index to say the staging server, or when I push to production are there new indexes, I got to go out on the database, now you don't worry about that, you just push your code, restart your web app or whatever kind of app it is, and then as part of interacting with it, it will make sure that those indexes are there.
If you don't want that pause to be there, just go and create the indexes you know the thing is going to create put them on the production server and then push the new version of code and it will just go great, these indexes exist.
|
|
show
|
5:04 |
One of the most important things you can do for performance in your database and these document databases is think about your document design, should you embed stuff, should you not, what embeds where, do you embed just ids, do you embed the whole thing; all of these are really important questions and it takes a little bit of experience to know what the right thing to do is.
It also really depends on your application's use case, so something that's really obviously a thing we should consider is this service history thing, this adds the most weight to these car objects, so we've got this embedded document list field so how often do we need these histories?
How many histories might a car have?
Should those maybe be in a separate collection where it has all the stuff that service record, the class has, plus car id, or something to that effect?
So this is a really important question, and it really depends on how we're using this car object, this car document. If almost all the time we want to work with the service history, it's probably good to go in and put it here, unless these can be really large or something to that effect. But if you don't need them often, you'll consider putting them in their own collection; there's just a tension between complexity and separation, safety and separation, and the speed of having them separate so you don't pull them back all the time. You can also consider using the only keyword or only operator in MongoEngine to say if I don't need it, exclude the service history. It adds a little bit of complexity because you often need to know, hey, is this a car that came with service history or is it a car where that was excluded, things like that, but you could use performance profiling and tuning to figure out where you might use only.
Let's look at one more thing around document design.
You want to consider the size of the document. Remember MongoDB has a limit on how large these documents can be, that's 16 MB per record. That doesn't mean you should think, oh, it's only 10 MB so everything is fine for my document design; that might be terrible. This is a hard upper bound, the database will not store a document over 16 MB, so you really want to think about what is the right size. Let's look at a couple of examples: we can go to any collection and say .stats and it will talk about the size of the documents and things like that. So here we ran db.cars.stats in the shell, and we see that the average object size is about 700 bytes; there is information about how many there are, and all that kind of stuff, but really the most interesting thing for this discussion is the average object size. 700 bytes seems like a pretty good size to me, it's not huge by any means, and this is the cars that contain those service histories, so this is probably fine for what we're doing.
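From Python you can pull out the same number. This sketch works on a plain dictionary shaped like the stats result, so it runs without a server; with PyMongo, a dictionary like this could be fetched with something like `db.command("collstats", "cars")`. The sample values echo the figures mentioned here, and the function names and `ns` value are made up.

```python
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # MongoDB's 16 MB per-document limit


def average_object_size(stats):
    """Average document size in bytes, as reported by the collection stats."""
    return stats["avgObjSize"]


def percent_of_limit(stats):
    """How much of the 16 MB limit the average document uses, in percent."""
    return 100 * average_object_size(stats) / MAX_DOCUMENT_BYTES


# Sample stats, using the rough numbers from this lecture.
sample_stats = {"ns": "example_db.cars", "count": 250_000, "avgObjSize": 700}

print(average_object_size(sample_stats), "bytes on average")
print(f"{percent_of_limit(sample_stats):.4f}% of the document size limit")
```

Being far below the hard limit is the easy part; the more useful question is whether the average size is reasonable for how often you load these documents.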
Let me give you a more realistic example.
Let's think about the Talk Python Training website, and the courses and chapters, we talked about them before, so here if we run that same thing, db.courses.stats you can see that the average object size is 900 bytes for a course, and remember the course has the description that shows on the page and that's probably most the size, it has a few other things as well, like student testimonials and whatnot, but basically it's the description and a few hyperlinks.
So I think this is again a totally good object, average object size.
Now one of the considerations was I could have taken the chapters which themselves contain all the lectures, and embedded those within the course, would that have been a good idea— I think I might have even had it created that way in the very beginning, and it was a lot slower than I was hoping for, so I redesigned the documents.
If we run this on this chapter section, you can see that the average object size is 2.3 KB. This is starting to get a little bit big; on its own it's fine, but think about the fact that a course on average has like 10 to 20 chapters. So if I embedded the chapters in the course instead of putting them in a separate document like I do (this is how it actually runs at the time of the recording), then these courses would be something like 24 up to maybe 50 KB of data per entry. Think about that: you go to the courses page and it shows you a big list of all the courses, and there might be 10 or later 20 courses, so we're pulling back and deserializing on the order of a megabyte of data to render a really, really common page. That is probably not ok, so this is why I did not embed the chapters and lectures inside the course. I just said okay, this is the breaking point; I looked at the object size, I looked at where the performance was, and I said you know what, really it's not that common that we actually want more than one chapter at a time, but it is common we want lectures, so it's probably the right partitioning. You build it one way, you try it, it doesn't work, you just redesign your class structure, recreate the database and try it again, but you do want to think about the average object size, and you can do it super easily with db.collection_name.stats().
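The back-of-the-envelope math behind that decision can be written down as a short sketch. It assumes 2.3 KB is about 2,300 bytes per chapter and ignores the course's own 900 bytes; the function names are made up, and these are rough estimates rather than measurements.

```python
AVG_CHAPTER_BYTES = 2_300  # roughly the 2.3 KB average chapter size


def embedded_course_bytes(chapters_per_course):
    """Estimated size of one course if its chapters were embedded in it."""
    return AVG_CHAPTER_BYTES * chapters_per_course


def listing_page_bytes(course_count, chapters_per_course):
    """Estimated data pulled back to render a page listing every course."""
    return course_count * embedded_course_bytes(chapters_per_course)


for chapters in (10, 20):
    print(chapters, "chapters:", embedded_course_bytes(chapters) / 1000, "KB per course")

for courses, chapters in ((10, 10), (20, 20)):
    print(courses, "courses:", listing_page_bytes(courses, chapters) / 1000, "KB per page load")
```

With 20 courses of 20 chapters each, that one page approaches a megabyte of chapter data it never displays, which is the cost that keeping chapters in their own collection avoids.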
|
|
show
|
2:06 |
One of the last simple tools you have in your tool belt when you're working with MongoEngine, or even with PyMongo (it's just a different API), is the ability to restrict the data returned from the document.
In our car object we've got the make, the model, the id, some other things, we've got the engine which is a subdocument or an embedded document there and then the biggest thing that contributes to the size is actually the service history which might be many service record entries.
If really all we care about is the make, the model and the id of a car, and we're going to create a list or something like that, we can use this .only operator here and dramatically reduce the amount of data returned from MongoDB. This is an operation that we saw when we first learned about the API, and it actually operates at the database level: you're able to restrict the elements returned from the queries. So when the result gets back to MongoEngine, it looks at what came back and says, alright, I need to create some cars and set their make to this, their model to that and their id to whatever comes back, and nothing else is transferred or deserialized.
So you can, if you don't need them, exclude the heavyweight things like the engine and the service histories for this particular use case.
So this is kind of like select make, model, id from table such and such in SQL, and it really can improve the performance especially when you have either large documents or many documents.
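As a sketch of the same idea at the PyMongo level: in MongoEngine the call would look something like Car.objects().only("make", "model", "id"), while in PyMongo you pass a projection document as the second argument to find(). The helper names below are hypothetical, and the collection passed in is assumed to be a PyMongo collection such as db.cars.

```python
from typing import Any, Dict, List


def build_projection(*fields: str) -> Dict[str, int]:
    """Build a PyMongo-style inclusion projection, e.g. {"make": 1, "model": 1}."""
    return {name: 1 for name in fields}


def car_summaries(cars_collection: Any) -> List[dict]:
    """Fetch only make and model for every car.

    `cars_collection` is assumed to be a pymongo Collection (for example
    db.cars). MongoDB returns _id by default with an inclusion projection,
    so the heavy fields (engine, service history) never leave the server.
    """
    projection = build_projection("make", "model")
    return list(cars_collection.find({}, projection))
```

Anything with a compatible find(filter, projection) method works here, which also makes the function easy to exercise without a running server.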
So you've seen a lot of different ways to turn the knobs of MongoDB to make it faster and to use MongoEngine to control those knobs.
Now this applies to a single individual database server and if you use this to tune your database, you can actually make the need for having a sharded cluster and all these scaling things possibly go away, but even if you do end up with one of these more interesting topologies, all of these techniques still apply and they'll make your cluster go faster, they'll make your replicas go faster, all of those things.
What you've learned here are really the foundational items of making MongoDB go fast.
|
|
|
1:13:39 |
|
show
|
4:37 |
You've learned almost everything you need to know about MongoDB to work with it, to make it fast, to access it from things like MongoEngine.
The last thing is to actually put it into production, to use MongoDB in production to set up our applications, to talk to a secured version of MongoDB, all of those things.
So we're going to focus on two main areas: one, deploying MongoDB on production servers, and two, doing that safely.
So far, we've been running our web app, or script, or little test app, whatever it is that we're building or just playing with (that's the blue thing here), and talking to MongoDB on our local machine, probably our laptop. We've been on the local loopback, 127.0.0.1, talking to MongoDB, and I have gone on and on about how you must not listen on another IP address on your local dev machine. Take that thing to a coffee shop or, worse, to a big hotel where there's a tech conference (god forbid, Black Hat or something) and that thing will be in some serious, serious trouble right away.
By the way, did you know on your Mac that the firewall is off by default?
That's right, off by default. That's crazy, but it is, so it's just another thing to consider: layers and layers and layers. Assuming we're only listening on local loopback, we're pretty safe like this, but we have been running without encryption and without authentication. MongoDB gave us a little warning when we connected with the shell, but you have to be connected that way to see it, and you have to care; if you connect with code and not with a shell, there is no warning, it just works. Now we're going to set up an entirely different thing. We're going to have a set of web front ends (fake web front ends; we're not really going to write a website, but something that stands in for our website) and our production MongoDB server, and these things are going to talk to each other over at least a cloud computing data center connection, potentially farther than that. So we're going to have to open this thing up, and that means we need to add encryption, authentication, firewall rules and things like that.
That's what we're going to talk about in this chapter.
This is a bit of a chore, right? Unfortunately this is not the default; this is not the falling-into-the-pit-of-success type of thing. You have to work to get this set up correctly, so let me give you some very recent warnings. This is not to tell you not to use MongoDB; I use MongoDB for my production stuff, I love MongoDB, but you must use it carefully. It's like a sharp knife.
What I am about to show you is not meant to dissuade you in any way, but just to make sure you really have it burned into your mind that you have to be careful when you're deploying stuff to production with MongoDB.
That said, check this out. Here are some recent headlines, and you want your company and your data to be nowhere near them. "MongoDB databases are being hacked for ransom using ransomware"; notice this is 2017. Here's a little example, look at this: show dbs, "please read"; use "please read"; show collections, "please read me"; do a little find on it and you get: your database has been hacked and encrypted, and you have to send a bitcoin here to get it back. Chances are they will just throw away your data and take your bitcoin, and yeah, good job. Okay, here's another one. How about this: "Massive ransomware attack takes out 27 thousand MongoDB servers." Terabytes and terabytes of data, even petabytes of data, were lost to the world. These are all not good things, right? You've lost your data. Here is one more: "Two million recordings of families imperiled by cloud-connected toys' crappy MongoDB." You don't want that anywhere near your business. This is a little bear with a light on it; it connects to the internet and to a service, and then the parent and the kid can talk to each other over the internet through the bear. Basically all the data in that exchange was stored on a wide open MongoDB database with no account at all: just, hey, here's the port, connect to it, go have fun.
All right, so the problem here is that MongoDB lets you listen on the wide open internet without a username and password and without encryption, and that's the default.
What we're going to do is add all those things, but I just wanted to be really clear that this is not an optional step. We really need to do this, unless, say, you're running a website and the database server is running on the same machine as the web server and they just talk over localhost. In any other scenario you need to be extremely careful, and we're going to go through all the steps necessary to get this working just right.
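From the application's side, "adding all those things" mostly shows up in the connection string. The sketch below builds a MongoDB URI that carries a username, a password and a request for an encrypted connection. The user name, password, host and port are placeholders, and the tls=true option is an assumption about the driver version (older PyMongo releases spell it ssl=true).

```python
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str, port: int,
                    auth_db: str = "admin", use_tls: bool = True) -> str:
    """Build a mongodb:// URI with credentials and (optionally) TLS.

    User name and password are percent-escaped so characters such as
    '@' or '/' in a password cannot break the URI.
    """
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    options = f"authSource={auth_db}"
    if use_tls:
        options += "&tls=true"  # assumed spelling; older drivers use ssl=true
    return f"mongodb://{credentials}@{host}:{port}/?{options}"


# Placeholder values, not the course's real server or account.
uri = build_mongo_uri("app_user", "p@ss/word", "db.example.internal", 27017)
print(uri)
# With PyMongo installed, this string would be passed to pymongo.MongoClient(uri).
```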
|
|
show
|
2:37 |
Let's go through the MongoDB security checklist.
Now, most of these come from MongoDB, but also from me personally, from my experience running the server.
I've run professional commercial websites using MongoDB for many years, 5 or 6 years and we've never had any problems, but you have to follow the rules.
Some of the rules include things like limit network exposure, so this is always a good idea for databases or anything else that listens on the internet, if something doesn't need to talk to it, don't allow it to have an open connection.
Enable access control, that means add users and require them to authenticate, this should really be the default, if I were king of MongoDB, and I'm not, but if I were a king of MongoDB, I would make a decree, a new rule that says MongoDB is not allowed to listen on anything other than local host, unless it has access control enabled, period.
That's not the way it works right now though; by default you can just say listen on 0.0.0.0 and it will, even if that's wide open, so that can be problematic, and we're going to change that. Encrypt the communication: certainly what goes on the wire should be SSL-style encrypted. There is also an option to encrypt the data at rest; I think you have to get the enterprise version of MongoDB, which is the paid version. This is not something I worry too much about, but if it's something that you care about, you can encrypt the data at rest using the WiredTiger engine.
You can audit system activity. This is easy enough: turn on what's called a capped collection and turn on auditing, so it will keep some standard amount of history, and obviously set up logging, all those types of things.
Back up. Obviously you want to back up your data; this is production data, so back up, back up, back up. Set up some way to back up, and we'll see that there are two options. We can run a tool called mongodump, which will just take a complete backup, and up to a certain size of data (which could be pretty large, actually) that works fine. At some point, if you truly are working with tremendous amounts of data, that doesn't work so well, so there are various ways to set up replicas that are delayed or right on time, things like this. But back up, back up, back up; it's an important thing to do.
So you can find all the details on how to do this at mongodb.com/manual/administration/security-checklist. You're welcome to go over there and check that out, and I encourage you to do so, but we're going to go through each of the yellow steps here. We're not going to talk about auditing, and we're also not going to do encryption at rest; pretty much everything else we're going to do as part of this chapter.
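For the backup item, a minimal sketch of wrapping mongodump from Python might look like the following. The output directory, host, port and user are placeholders; --host, --port, --out, --username and --authenticationDatabase are standard mongodump options, and when a user name is given without a password mongodump prompts for one.

```python
import subprocess
from typing import List, Optional


def mongodump_command(out_dir: str, host: str = "localhost", port: int = 27017,
                      username: Optional[str] = None,
                      auth_db: str = "admin") -> List[str]:
    """Build the argument list for a full mongodump backup into out_dir."""
    cmd = ["mongodump", "--host", host, "--port", str(port), "--out", out_dir]
    if username:
        # No --password here on purpose: mongodump will prompt for it.
        cmd += ["--username", username, "--authenticationDatabase", auth_db]
    return cmd


def run_backup(cmd: List[str]) -> None:
    """Run the backup; raises CalledProcessError if mongodump fails."""
    subprocess.run(cmd, check=True)


# Only prints the command; call run_backup(...) on a machine with mongodump.
print(" ".join(mongodump_command("/backups/nightly", username="backup_user")))
```

Building the command separately from running it keeps the script easy to check before it is scheduled with something like cron.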
|
|
show
|
8:30 |
To deploy our database and set up our production environment I'm going to use Digital Ocean, I've run web applications in MongoDB, in AWS, EC2, I've done it in Azure, and various other places I'll talk about some of those, and I found something like Digital Ocean really to be just such a nice service, simple, extremely fast, extremely affordable compared to the other options.
We're going to use Digital Ocean, but what I'm going to show you is, for the most part, not specific to Digital Ocean; you can use any other service that lets you set up VMs in a single data center.
We're going to use this, and if we come down here and look at the various options, we'll see that we can choose different machines. It turns out that for reasonable amounts of data (I'll describe what I think reasonable is), ten dollars a month is absolutely fine: we've got 30 gigs of disk space on an SSD, we've got tons of bandwidth, and I don't even know if within-data-center bandwidth counts against it. 1GB is not a ton of memory, but it is enough. The five dollar one is really nice and cheap, but it's going to put you up against memory limits pretty quickly if you have lots of data. So what do I mean by a lot? Right now I'm running most of my websites using a shared MongoDB server (separate databases, but a shared server) on one of these ten dollar machines. It's got something around six million documents in there, and it takes about 30 percent of the 1GB of memory, so about 300MB resident. If I had lots more than six million things, then I'd probably need to move up.
Also if I wanted to run a replica set on the same machine, all these kinds of things, but this is probably a totally decent starting point, unless you really have quite a bit of data.
Anyway, we'll get started with Digital Ocean.
The first thing we're going to do, what we're going to do in this video, in this lecture is we're just going to create two servers, one that's going to be our web server that's just the thing that's going to try to access MongoDB our app and the one that is the deployment production server that we've kind of locked down and hardened, so let's switch over here for now, and we're going to go and create a droplet.
I've done a tiny amount of work in advance, I've created a certificate that I'm going to use to ssh in, I'll show you where the step is and there's a button you can click and it basically says type this, put the contents here, you're good.
When we come in here there are a couple of options. The first thing is to choose an image, so we could choose all these different versions of Ubuntu. I'll just take the default; I'm tempted to take the new one, but I will take the long-term support one.
If you wanted to use another distribution, you totally could, also they've got this one click apps thing that is pretty interesting and I could come down here and even click MongoDB but I don't want to assume that using Digital Ocean you have this button I want to show you how to set up a fresh Linux machine running MongoDB in the end basically.
So I'm not going to click this, but this is a totally reasonable option to click this and it has ability to upgrade basically through apt update.
So for this, let's go with the ten dollar one. It's charged by the hour, so what we actually pay for this course is going to be quite, quite low; I'm not going to leave it running for months at a time.
We're going to do this: 10 dollars a month, standard Ubuntu, and I don't care about block storage. I'm on the West Coast of the US, so let's pick something somewhat nearby, but you can see there are other data centers. You probably want monitoring; this allows you to go back to a droplet and get graphs of CPU, disk and memory over time, which is kind of cool. Maybe private networking, but again we're not going to do that here. I have already set up the ssh key, so I'm going to pick this Digital Ocean course test key, which isn't used for anything but this test bit that I'm doing right here. You can also create a new ssh key; there's a little help button you can click, and it will show you how to create and store one of these.
Alright, so last thing we want to go over here, we got to give it a name, this Ubuntu name not the most amazing, we'll call it the mongo server, that seems decent right, it doesn't like this, so we'll just go like— so this is all looking good, we've got our ssh key we just need one of these types of things and we click go.
I'll let this go in real time, so not sure how long it's going to take today but I'm not going to speed up this part, you can see this is just a real time creation here.
And we're good, it says happy coding, I love it.
Alright, so let's copy this, let's go ahead, I think my ssh is already registered if not I might have to add that, let's go, so we're going to go here like this, and it says you've never connected to this server, no it's brand new.
Apparently I have not added that, so go down here and run ssh-add with the -K so the key is added to my keychain, like so. It's added now, so if I ssh again (not this one, how about the one we're actually working with).
Okay, look at that, we are connected, so I had generated my key but I hadn't added it to this user profile, so this is great, and it should also be somewhat concerning that there are 16 security updates right now, so first thing we are going to do, we're going to apt update, go refresh the possible updates and a real quick upgrade, and we'll be back in a minute.
Okay, everything is done, now let's exit out real quick and just come straight back and notice, there's no more updates, but a restart is required in order to make this basically finalize those changes, something deep down in the guts was updated, so we'll just do a quick reboot and just to show you the time in here I will not speed this part up either.
Usually it takes about ten seconds, but with that many updates it might take a little bit longer; let's be optimistic give it a shot, and we're back, so really quickly we updated our system, we rebooted so we've got Ubuntu 16.04.2 long term support, and it's all up to date.
This is great, this is our Mongo server, let's do this one more time let's go and do this for here, go back to the other stuff in a minute, let's do this for the fake web app that we're going to have talk to this.
We'll come down here and pick Ubuntu, five dollar one we don't need block storage, sfo 1, same data center as before that's very important for latency; go ahead and add monitoring, use this ssh key, and we'll call this the web server, and go— good, these are the same data center and we'll do the same thing, I'll ssh into here I'll do apt update, apt upgrade and give it a good reboot, and then we'll have two fresh up to date machines and we'll start configuring them afterwards, let's just double check this one, so it's alive, but make those a little bit quick, there we go, now it took a moment just to turn on, excellent, everything is good here let's say apt update, it says there's no packages but I'm not so sure, it's basically running that right now so let's come back in a second, oh look, there's a whole bunch of stuff that we got to do so apt upgrade and we'll do this, I'll kind of shorten the video here you've gone through this before, and we'll just let it do all the upgrades and then we'll come back and talk about installing MongoDB on the Mongo server.
|
|
show
|
9:35 |
It's time to install MongoDB on our cloud server.
One thing I'd like to point out is you don't have to necessarily go down this path to run your own MongoDB server, you maybe don't want to deal with it, maybe you don't have enough experience with it things like that, so a couple of options just that I want to point out, but I definitely want to show you how to run your own MongoDB server, how to do it safely so that you can because in a lot of cases it really is the best.
So MongoDB has this thing called MongoDB Atlas, which is database as a service. Basically what you do (I haven't really used this, but I have looked at it) is create an AWS account, that is, an Amazon Web Services EC2 account, and give them access to make machines and manage those machines on your behalf. They can create replica sets and things like that which are secure for you. This is like a service that will more or less manage your EC2 machines on your behalf, so this is decent.
Another one is you can go to something like mlab over here and you can check out the plans on pricing, they have a free, sandbox free half a gig, just cool when you do a shared one with light production work for up to eight gigs of data, but it isn't a replica set with fail over and things like that, so this is a pretty nice service, it's really expensive in my opinion but it is pretty much turnkey, push the button you get what you want, you get what you need.
I found it to be decent, but also it seems like it's added a lot of latency to some of the apps that we moved off of our own servers on to mlab, so I guess it probably depends, one on how much you pay, and two, how close your machine is to their machine, but they do claim to do a lot for half a million MongoDB deployments, on the major cloud providers.
I just want to point out that you can go and just get MongoDB as a service.
Now, if you're still with me, I'm assuming that you want to understand how to create your own MongoDB servers, so let's go over here to download, and we're going to go through a few interesting steps. I would like to do the Linux download; I know I'm on macOS, but I'm configuring the server over here. Now notice, I could hit the tarball and download it, install it and run it, but that wouldn't give me the ability to, say, automatically upgrade my server.
Right now it says instructions for installing with yum, but I want to do this on this x64 version of Linux of Ubuntu, 16.04 that's the one I got, I think you can just take the same instructions and apply them for 17.04 as well.
Now here's what we really want to go, we could click this, we could get the binary, but this is better, so we're going to come down here, and there's just going to be some copy paste action, now look, it says what we can do is you can just use aptitude to install this, so let's try that.
Before we actually go over here, tell me, which one of these is a MongoDB server, I don't know, I don't remember either, so let's take just a moment and step back and give these names, and I want to give them the exact same name as they believe their machine names are, so this one, the web server, it refers to itself as a web server, this one its local machine name is this, themongoserver so let me open this one up, now we should probably enable floating ips for real production, but this is not real production this is me playing around so I'm not going to mess with that.
We could also enable what are called cloud firewalls, but again, this is a Digital Ocean specific thing; you do this differently in EC2, differently in Azure and so on, so I wanted to show you how to just use the Linux tools to do that, though it may actually be better to do it this way. Here you can see some of the monitoring kicking in. I'd like to be able to say ssh root@themongoserver, and sadly, it doesn't work. So let's tell this machine about it: let's do a sudo and run Visual Studio Code again on /etc/hosts. Here you can see I have hacked a few things together already, and we're going to put themongoserver in here along with its IP address. Like I said, you could give it a floating IP and use that one, but we're going to go like this. If you want to give it a real domain name, feel free to go ahead and do so, but this will work. There's probably no reason to give your MongoDB server a public DNS name, so I'm going to suggest that maybe you don't do that.
Let's go here and get the web server, okay, so I save that, now let's try that again, we've never connected to a machine called themongoserver with this key so it's fine, and now we're back, so now we can say connect to root@themongoserver, and at thewebserver, that's what I called it right, thewebserver.
This will make things easier and you can see even on that machine, it believes it's called this, for some of the tricks we do later with tunneling ssl, it turns out that makes our life a little bit easier.
Okay, so we're on the mongo server, that was our goal maybe a little bit long to get here but that was our goal.
The next thing to do is go down this list that they gave us here, so we are going to ssh in here and play this. It says that to install MongoDB you can install this apt package, and you can also install smaller pieces, like just the server or maybe just the sharding daemon, things like that, but if you install this one you get all of it. And it's going to be amazing, except that it's not; it's not amazing at all, the package isn't there, because it comes from one of MongoDB's own apt repositories. So we've got to go down here and go through the steps. The first thing is that we have to trust MongoDB's key; we're going to install the software as root, so I guess we're going to have to trust it anyway, aren't we.
Then all seem to come out alright, be careful here, I always screw this up, even though I clicked on install for 16.04, it gives me all the options here so don't do that, that's 12, 14, 16, that's done.
Now the next thing to do is run apt update, I'm already in root, so I don't need sudo so we needed to do that basically to pull from that list we just added there, so now let's see what it's asking about, it's all good.
Alright, so now we can go do that apt install mongodb-org, and what happens? Magic is going to happen, that's what.
Notice when I said this one, it's like a metapackage, it's really installing those four, it just said hey here's an empty package with these four dependencies.
Alright, that was quite quick, and notice we have 3.4.5, so that is quite a recent one. It has even created a mongodb user for us; that is the one, I believe, that runs the daemon process, so it is not running as root. That's pretty awesome, and it's another thing you'd have to do yourself if you just downloaded the tarball and tried to set it up.
Very cool, and now next time I come over here and I run apt update and then upgrade that could potentially install 3.4.6 or whatever is next.
Do we have MongoDB— I could type Mongo and something happens except fail, no you cannot connect; why— because it's not running, it will run though if we just say service mongod start then we can ask status, and notice, it's running this process, it's running the server in quiet mode, especially important using this configuration file so we're going to be able to use that basically to configure or to adjust to the server all the changes we need to make to MongoDB we're just going to edit that file and it'll be golden, but for now, let's just try to see if we can connect it.
Wow, we can. We get a few warnings, like you really should not put this on the open internet, and some other stuff we might consider about changing our file system around. We can say things like show dbs and it just has the local stuff, but we're connected and it's all good.
So, that's installing MongoDB on our cloud server.
However, you want to be extremely careful about changing this we're not ready to open this up, not even close, we can see here's the log file if we want to go and get the log here's where the data is stored, you generally don't need to go in there and mess with the data you don't need to mess with the files directly, we'll use the tools or replication or something to back up and configure the data but that's where it lives and you can change where it is if you need to, right so those two things are interesting, we're going to go change this stuff but we need to make a few configuration changes before we go and do that.
|
|
show
|
1:11 |
We saw that we can basically leverage the package manager on Ubuntu to install MongoDB, but it won't work by default, we've got to add some things.
Now notice this link at the bottom, you shouldn't be typing in at least not the first two lines at all, you can just copy them from down there, right.
So go to the install-on-Ubuntu page, or if you have a different distribution pick that one, and it will most likely show you how to do it with your package manager. Here we add the MongoDB key so we trust the file list we're going to set up, then we set up this mongodb.org file list, and then, in order to actually have the packages in the list available, we run a quick apt update so it pulls them down. Then we can say apt install mongodb-org to install everything, or, as you saw, there are subpackages you can choose. We manage the server by editing /etc/mongod.conf: make changes, restart the service, and it will pick up those changes, which is really nice.
Of course, this doesn't mean it's running; it's just set to run the next time you reboot the server, so you just say service mongod start and you'll be golden.
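If you wanted to script the part of this that comes after the key and file list are in place, a small sketch could look like the following. The key and file-list commands are deliberately left out because they should be copied from MongoDB's install page for your distribution; the -y flag on apt install is an addition for unattended runs, and the function name is made up for illustration.

```python
import subprocess
from typing import List, Sequence

# Steps from the lecture that follow adding MongoDB's key and file list.
INSTALL_STEPS: List[List[str]] = [
    ["apt", "update"],                        # refresh from the newly added list
    ["apt", "install", "-y", "mongodb-org"],  # metapackage: installs all pieces
    ["service", "mongod", "start"],           # installing does not start it
    ["service", "mongod", "status"],          # confirm the daemon is running
]


def run_steps(steps: Sequence[Sequence[str]], dry_run: bool = True) -> List[str]:
    """Run each step in order as root; with dry_run=True only report them."""
    executed = []
    for step in steps:
        executed.append(" ".join(step))
        if not dry_run:
            subprocess.run(list(step), check=True)  # stop at the first failure
    return executed


for line in run_steps(INSTALL_STEPS):
    print(line)
```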
|
|
show
|
1:14 |
One of the most important things we can do to make our MongoDB server safe even if we screw up the configuration, the authentication, the encryption, all those things is to make sure nobody can talk to it.
So we're going to do two simple things right away to lock down our server.
Obviously our web app, or whatever app or service we're building that uses MongoDB, should be able to talk to it, and that's probably within a data center. We could possibly get to it from our local machines, but we'll do things like ssl tunnels and so on to do that, so we won't open up any extra ports for this.
However, there's always something out there lurking. I showed you that super scary warning at the beginning, and they're out there looking; they are saying, hey, I would love to talk to the server on port 27017, the default port, or maybe 27018, 27019 or 27020, depending on the service you're running.
So we want to block those guys, we want to block them with the firewall and a couple of other things.
That's what we're going to do next.
We're going to do this, like I said, in Linux itself, in Ubuntu itself. We could use the cloud computing stuff; Digital Ocean just announced this cloud firewall thing that is probably easier, and if you're using Digital Ocean have a look at that, but we'll do it here and it works just fine.
|
|
show
|
6:05 |
Alright, so on the left here we're logged into our MongoDB server and let's go to the web server, we're logged in here, now on the web server, just for now, I'm going to set up the Mongo shell so that we can sort of simulate talking to this from the web application, our little fake web application in Python which we haven't gotten to yet, but we'll do that later in this chapter.
And we already added the list here, so we're going to install, apt install this, ok so let's go Mongo, you're going to run something great, not the right one, okay, so before we do anything let's see if we can get to our Mongo server, and the answer will be no, so here this is the host name of the Mongo server, right now if I try to connect to it, it's going to say no, if I come over here and I type mongo it connects, what is going on?
Remember this, remember it's listening only on local host.
So we're going to want to change this, but not before we make it safe. We don't want to just tell it to listen on the open internet right away, so let's first block access to all of these ports and basically everything except, initially, ssh. What we're going to use is something built into Ubuntu called uncomplicated firewall.
The first thing that we're going to do is say ufw default deny incoming.
By default we're blocking all of the ports.
Now, we're going to say allow outgoing, so by default allow our server to get back out, that's cool.
The other thing that we want to allow, unless this is going to be the very last time we see the server, we're going to need to allow ssh back to this server.
Not default, just allow ssh.
Okay, great, it updated for ipv4 and ipv6, that's pretty sweet.
Now the last thing is a moment of truth, we're going to enable it, we could ask the status, it's not enabled, it says you know, if you are blocking ssh, you're going to be done for; we're not.
And let's just verify, just connect, reconnect, okay, we're good.
So at least now nothing can talk to any port except for 22 ssh, at all on this server.
The one final thing to do, let's go over here and say ping the web server, so this, that's the ip address of the web server, what I want is to allow the web server to get to the Mongo server, so one more thing I'll say ufw allow from here, so uncomplicated firewall allow from this to any port and we're going to give it a port here and normally you would type this, 27017, that's the default port, but the very next thing we are going to do is say running MongoDB on the default port probably is a stupid idea, everyone is scanning the wide open internet for 27017 and then seeing what kind of havoc they can wreak upon that.
So even though we think our firewalls are blocking the wide open internet for everything except ssh, let's go ahead and change the port. We're going to say 10001 is the port we're going to run Mongo on, so we're going to allow that thing to come back to 10001, where MongoDB is going to be listening.
Okay, rule added.
So it is running, it's listening on just that port.
The next thing to do is go and change the port in the configuration file here, like this: change this port to 10001.
Excellent. Okay, so for MongoDB we have to go do a service restart. Now if I type mongo, it fails, but if I say --port, like that, we're good.
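The port change itself is a one-line edit to the configuration file. As a sketch, here is that edit done in Python on the file's text. The sample text is an assumed typical layout of the net section in /etc/mongod.conf, not a copy of the file from the recording, and the helper name is hypothetical.

```python
import re

# Assumed typical layout of the net section in /etc/mongod.conf.
SAMPLE_CONF = """\
net:
  port: 27017
  bindIp: 127.0.0.1
"""


def set_port(conf_text: str, new_port: int) -> str:
    """Return conf_text with its single 'port:' line set to new_port."""
    updated, count = re.subn(r"(?m)^(\s*port:[ \t]*)\d+[ \t]*$",
                             rf"\g<1>{new_port}", conf_text)
    if count != 1:
        raise ValueError(f"expected exactly one port line, found {count}")
    return updated


print(set_port(SAMPLE_CONF, 10001))
```

After writing the result back to the file, the lecture's next steps apply unchanged: restart the service and connect with mongo --port 10001. Note that bindIp is left alone, so the server is still only listening on localhost.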
So it looks like everything is working over here.
It's still not going to listen to us, because we're still not listening on the public internet, we're just listening on local host.
Okay, but this is one step in the right direction; we've got basically the firewall here restricting access to everything, except for wide open ssh and MongoDB on a non-default port only from the web server.
Let's while we're over here go ahead and do this as well.
Just assuming that you're treating this as your web server, let's go ahead do the same thing.
So by default we're going to do deny incoming allow outgoing, allow ssh, and let's say allow 80 and 443 to simulate this being the web server, we're not actually going to run a website, like I said, but that is what I would do, and then we would do an enable.
It says are you sure you want to do this, we'll exit one more time, make sure we can get back, and we can, fabulous.
So now we've got that server sort of faux locked down just to play along, this one is actually locked down, and that one can talk to it, but this one is not listening yet.
I don't want to make that one listen, until we go through a few other steps, so you are going to have to hold off on having this whole connection thing working.
|
|
show
|
1:49 |
Let's review limiting network exposure as a concept, so what do we do?
First of all, I said listening on the default port is just crazy because people are going to be scanning that like nobody's business, they may scan every port on your machine, connect to it, somehow distinguish it's a MongoDB port, but chances are that's not going to happen, chances are people are just going to check the few ports and just move on to scanning millions or billions of other ip addresses, even if they do connect, we're going to have some additional layers of security there, hopefully the firewall makes all of this redundant.
But still, it's a good idea to just have layers of security so here we have a port that is non default, 10001.
Now, we're also going to turn on our firewall so in fact it's very unlikely anyone can get to that from outside of our data center other than the apps or the servers that we said explicitly they can get to it.
So by default, deny all incoming connections, allow all outgoing connections, and allow ssh so that we can get back in, or this is going to be the last time we ever see this server. Then we're going to enable it, that's the moment of truth; it says are you sure? I suggest doing this right away, before you have lots of important data on the server.
And then we're also going to allow from the ip address of the application that's built upon MongoDB to any port, this one here.
We've got our firewall set up, we've got MongoDB set up on a non-default port, and the firewall rules reflect that non-default port.
So this is the web app address, this is the configured MongoDB port; still, we're not ready for listening on the internet yet.
Two more things, encryption of the connection, which within the same data center may be it doesn't matter but we're going to add it anyway and authentication.
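The firewall steps above can be sketched as a small Python helper that just spells out the ufw commands in the order we ran them. This is only an illustration: the function name `ufw_commands` and the address `203.0.113.10` are placeholders made up for this sketch, so substitute your own web app address and port.

```python
from typing import List


def ufw_commands(web_app_ip: str, mongo_port: int = 10001) -> List[str]:
    """Return the ufw commands from the walkthrough, in order.

    web_app_ip is a placeholder for the address of the server that is
    allowed to reach MongoDB; mongo_port is the non-default port.
    """
    return [
        "ufw default deny incoming",   # block everything coming in
        "ufw default allow outgoing",  # let the server talk out
        "ufw allow ssh",               # or we never see this server again
        "ufw enable",                  # the moment of truth
        # only the web app may reach the MongoDB port
        f"ufw allow from {web_app_ip} to any port {mongo_port}",
    ]


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only address, not a real server
    for command in ufw_commands("203.0.113.10"):
        print(command)
```

Printing the commands rather than running them keeps this safe to try anywhere; on the real server you would run each line as root.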
|
|
show
|
5:08 |
For our MongoDB server we want to add communication level encryption, basically ssl.
Now we could go get a proper ssl certificate, we could even use let's encrypt, but because this is only talked to from our few servers we can actually just do an auto generated one, so a self signed certificate.
Let's go over here to /etc/ssl and see what's here; not so much, alright, so the next thing that we want to do is run openssl to generate this key.
Now, I'm going to give you a link that you can copy this from, so don't worry about trying to type this in; notice it's good for 365 days, we could put way more on here if we really wanted, to save yourself some trouble, and it's going to be output into these two files, a public and a private key.
Let's go.
Then you can basically put whatever you want; I'll go in and put some stuff here. Okay, so I entered some sort of, kind of accurate data, and now we have our two keys, our two MongoDB public and private keys. The next thing is to generate a pem file, which is really just the combination of the public and private key, and we could do that with a cat command like this; so we run this, and now we've got the private key and the certificate there, okay, great.
Now, the next thing to do is actually tell MongoDB, hey, I would like you to use encryption and I would like you to use this particular key; notice we're over here in /etc/ssl, and we're going to use that mongodb.pem we just created. So let's edit the config here, we'll go under security, oh actually sorry, it's not under security, not yet, we're going to be there in a minute; we want to go to network here, and we're going to say ssl, and say mode is requireSSL like so (not model, mode), and the PEM key file like this is going to be /etc/ssl/mongo.pem. Okay, so make sure we save that, and then we just have to restart mongo, so service mongod restart, and let's see if that went well.
It doesn't look so great, does it?
Well, why is that?
Let me grab our little log file here; there's our log file, ah, so it says here's the error, /etc/ssl/mongo.pem file not found. Now I could just edit this out of the video, right, and we would skip it, but I kind of want to show you, like, oh jeez, what do you do when something goes wrong?
Well, you go look at the log file; first of all you can quickly ask for the status and it'll say crashed, something bad, go look at the log file, and then go from there; maybe you want to tail it in a real production setting.
So we are just going to edit this again and say you know what, you're right, I believe that's mongodb.pem, so we'll restart it, ask for the status and hey, look, a running process, super, that is so much better.
Okay, so let's try to connect to this on the same machine here; we tried mongo, and it said no, no, no, you can't find it there, so we did the port 10001, and it said I can't connect to this. This is not so good, I'm not sure what this error message is, but we need to basically say one more thing: we need to say allow invalid ssl certificates, because it doesn't trust itself, and use ssl; there we go. So you can see this network error while attempting to run isMaster basically said I tried to run an unencrypted command on an encrypted connection and I got junk back; yeah, because it was encrypted.
Now we're kind of talking to the server on its non-default port using its non-valid ssl certificate; you can generate valid ones if you want, you can use other things like Let's Encrypt, you can buy them, whatever, but like I said it's probably fine to use this.
We're very close to coming over here, and coming down and changing this to 0.0.0.0, which will allow our web app to talk to it. So we have the encryption of the communication, that's good, but still, this is not good enough; what we need to be able to do is restrict this to only people with a username and password, and because we're doing this over ssl, that password exchange is relatively safe.
|
|
show
|
1:09 |
Let's review how we added encryption.
Somehow we got a hold of an ssl certificate and a private key; the way we did this is we just went into the location where we typically store those certificates and we ran openssl to generate a self signed certificate. The only change that I made here from the recommendation from MongoDB is I made it a ten year certificate, because look, as long as it's not trusted, let's not trust it for a long time.
Anyway, we did that and then we combined the private key and the certificate into this pem file, which we point MongoDB at. In the config, we added this ssl section, set the mode to requireSSL, and here's the PEM key file to use; then we restarted, and we were able to connect to MongoDB, but only if we say --sslAllowInvalidCertificates and --ssl. All of this is documented at that url below, manual/tutorial/configure-ssl, so you can check that out and, like I said, copy the details from there, not by typing them in from watching me do it.
Alright, so a really nice step and important step to enabling ssl and secure communication on our MongoDB server.
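The two steps from this review, generating the self signed certificate and combining the pieces into a pem file, might look like this from Python. Treat it as a rough sketch under assumptions: the file names are placeholders, `build_pem` is a helper name invented here to stand in for the `cat` command, and the `chmod` is an extra precaution that was not part of the demo.

```python
from pathlib import Path

# Arguments for the self signed certificate step, written as a list you could
# hand to subprocess.run. The file names are placeholders; -days 365 matches
# the demo, raise it if you want a longer-lived certificate.
OPENSSL_ARGS = [
    "openssl", "req", "-newkey", "rsa:2048", "-new", "-x509",
    "-days", "365", "-nodes",
    "-out", "mongodb-cert.crt",
    "-keyout", "mongodb-cert.key",
]


def build_pem(key_path, cert_path, pem_path):
    """Concatenate the private key and certificate into one pem file.

    This is the Python equivalent of `cat key crt > pem`.
    """
    key_text = Path(key_path).read_text()
    cert_text = Path(cert_path).read_text()
    pem = Path(pem_path)
    pem.write_text(key_text + cert_text)
    pem.chmod(0o600)  # assumption: restrict access, the file holds the private key
    return pem
```

Whatever path you write the pem file to is the path the MongoDB config has to point at, which is exactly the detail that tripped up the restart in the demo.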
|
|
show
|
5:20 |
So we've encrypted our MongoDB, we've got it hidden behind a firewall and listening on a non standard port, let's get into it.
Here we are connected to our Mongo, there is really nothing going on yet, it's just empty, we haven't added our data or anything like that, but nonetheless here it is, notice there was no user name or password required to get in, that's what we're going to fix next.
So the first thing to do is we're going to run this db.createUser command.
We want to create a user to admin the entire database server, like all of MongoDB, not just wherever we happen to be, which is called test, not an amazing name.
So we're going to say use admin and now you can see db is admin so we can show collections, see what's here and it's basically empty, but now we can run these db commands focused on creating users against admin which means kind of global.
So we're going to run this command here, paste it so I don't get it wrong because these roles have to be just so, it's very touchy, go with this for the db admin, that's probably fine, or mongodb admin, you can take your pick and the password is probably little wimpy, let's try to fix that.
Let's go over here and run ptpython and import uuid, okay, and then let's do something like this: print, and we'll call uuid.uuid4, call that, there we go; what do you think, is that a decent password?
I would say so, that's going to definitely slow down some dictionary attacks.
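Typed out, the password trick from the REPL is just a couple of lines. The wrapper function name `generate_password` is made up for this sketch.

```python
import uuid


def generate_password() -> str:
    """Return a random UUID4 rendered as text, as in the demo."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    # Prints a different 36-character value on every run
    print(generate_password())
```

This mirrors what was done in the video; the standard library's `secrets` module is another option that is designed specifically for generating passwords and tokens.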
Now over here, we've got to delete this, sadly you can't paste over a selection in macOS, alright, so we're going to run this user, this password, and now we have to specify the roles. We could create multiple users that have certain restricted access to different databases, and that's probably not a bad idea, but for this example we're just going to say this thing can admin, read and write any database, and admin clusters. By the way, just because you are an admin for a database does not mean you can read and write to it, you could just create users and things like that, so you need them all.
Let's try this, boom, successfully created.
Now, did magic happen when we did this?
Let me copy this real quick; if I exit and I go over here and I try to connect without any authentication, no, nothing happened. Why? If we come over here and we check out our config, down here at security, this puppy is wide open, so we need to go down and say authorization is enabled. Now, if we do that and we restart MongoDB, so service mongod restart, probably a good idea to ask for status, also not happy; again, what have we done, let's have a look.
I think it might help if I spelled this right, not very forgiving these computers are they, all right, everything is running that's all good, and if we try to connect to it again, now it's going to come over here and say hello you are connected right, db.version for example, like this, right so we're connected to it, we can sort of interact with it but watch this, show dbs, failure you cannot show dbs, in fact, you can't do anything other than basically log in.
So I can come over and say use admin, then db.auth, and I could set the username and password, so I could say user is this, password is whatever we want to put here (you have to forgive me if I don't want to type that again, I'll copy and paste that), pwd rather, not password. So we could log in this way, and as you'll see, now I can say show dbs, use test, show collections and so on, so I am basically authenticated at this point. But you probably don't want to do it that way; instead you probably want to say user is this, the password, I think it might be -p, is this, oh, one more thing I forgot: we have the username and the password, but we also have to add the authentication database being admin. There we go. Okay, notice our warning about running without authentication is gone and I can now show dbs straight away, I don't have to go switch to admin, do the auth and so on.
So this is really handy for our scripts here that we're going to use later.
|
|
show
|
1:01 |
Now let's review enabling access control: we're going to connect to the shell like we have been, again specifying the extra things like ssl, the port and so on, we're going to say use the admin database, and then we want to create a user; set the user and the pwd, and do not forget the roles, this is very important.
Once we've done this, it doesn't magically make authentication work; we have to go over to our Mongo config and say security authorization is enabled. Then when we want to talk to it, we can now pass the port, the ssl stuff, the user name with -u, the password with -p, and the authentication database being admin; don't forget that, it doesn't work without it.
At this point, you've basically set up your MongoDB in a safe way. The final thing that you might consider, and it depends on how you want to run your database and so on, is setting up a replica set to have failover and multi machine redundancy and things like that; that's certainly a next step that you could take, but it's beyond the scope of this course, so check out the docs.
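To keep all of those connection flags in one place, a small helper can assemble the shell command for you. This is a sketch, not the course's code: `mongo_shell_args` is a hypothetical name, the user name and password below are placeholders, and the flags are the ones for the legacy `mongo` shell used in this chapter.

```python
from typing import List, Optional


def mongo_shell_args(user: str, password: str, port: int = 10001,
                     host: Optional[str] = None) -> List[str]:
    """Build the argument list for connecting with the legacy mongo shell."""
    args = [
        "mongo",
        "--port", str(port),
        "--ssl", "--sslAllowInvalidCertificates",  # self signed certificate
        "-u", user,
        "-p", password,
        "--authenticationDatabase", "admin",  # it does not work without this
    ]
    if host:
        # only needed when connecting from another machine
        args += ["--host", host]
    return args


if __name__ == "__main__":
    # placeholder credentials, replace with the user you created
    print(" ".join(mongo_shell_args("pick-a-user", "the-generated-password")))
```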
|
|
show
|
3:35 |
I think our big moment has arrived.
We're finally ready to make this thing actually listen on the network and do what it is supposed to do.
So with everything set up, the ssl right here, security right here, also a non-default port right there, firewall in place, all these things, let's change that to the glorious 0.0.0.0. Now, let's restart the server, see if it liked it; excellent, it definitely liked it. See if we can still connect to it here on our local machine; we can, everything seems to be working well.
It's now listening at least on local host, let's go over here to this one, now here's all the connection info we got built up, port, certificates, username, password, auth db, let's go add one more thing, because obviously local host is not where this server is running.
So we are going to add host and put the ip address of the Mongo server, the moment of truth, and we're in, look at that, that's pretty cool, we could even do stuff on it. Let me go over here and copy this and let's try one more thing: maybe we've screwed up somehow, maybe something super bad is happening here and this is just wide open. Let's try to connect to it; notice I am on my MacBook, I'm not on the web server or the Mongo server, I'm on my MacBook, and timing, timing, it's timing out, ta-da, that's awesome, that's what you want, no connection possible. We couldn't connect to this because only that one server magically gets to connect to it, beautiful.
And of course, we saw that we have to pass this auth stuff, right; if for some reason we don't pass that,
we can still connect, as you saw, but we can't do anything whatsoever. I am not sure if I like that, I kind of would prefer that you can't even connect unless you go through the authentication step, but I guess it's more or less the same thing.
So exit out and now we're back, working, because I ran the one that passed username and password, so this configuration of this little fake web server and this not fake Mongo server is running.
Let's do one more thing, let's say pip install glances but we don't have pip, so apt install glances, let's skip that, like this, it's going to take a moment, so glances is pretty heavyweight, you may or may not want to install it but it's a really cool way to look at our server, so if we come over here and look around, it will show us things like how much memory our server is using, right now 15 percent.
If we want to know how much cpu it's using, not much at all, right now we're sorting by cpu and here you can see Mongo is here just kind of hanging out that's 0.6 cpu like it must be doing some maintenance on itself, you can sort by memory and it will almost always put MongoDB at the top so over here you can see it's using really in terms of resident memory only six percent, that's not much but it has no data in it.
So we'll come over here and we'll use this glances program to have a look at Mongo, maybe later we could load it up with this big dealership database that has the 1.5 million records or so in it.
|
|
show
|
10:26 |
Let's go back to our little play-around service app, I'll go and run this for you, it probably looks familiar, remember our service central version 2.0, this is demo edition let me actually change that little header to prod edition, not that it much matters, but we're going to set this up to run in our production environment.
If I try to do things like list the cars, it will show me my local cars because I'm running on my Mac; however, if I push this up to the server and I put it onto that fake web server, it's going to try to talk to localhost and have a big fail, right, that's not going to work well.
So instead, what we need to do is go and adjust our little connection bit here.
Now, let me go and actually add some parameters to this we're going to add a password, say user, a password, a port, a server, use ssl, and I think that'll do, okay.
So now I want to use those things to actually connect, so we're going to have like this dual mode thing going on, and I'll do it like this: we'll say if user or password, so if either of those are set, we're going to do something else; otherwise we're going to just fall back and do this little simple bit right here, and here I'll do a print, registering dev connection. So go like this, and it is not going to love it, so let's go over here and give these some defaults: None for all of these, the port defaults to 27017, server = localhost, and use_ssl = False; actually, let's go and default that to True.
Okay, so now I should be able to run this here and list cars, and actually up here we'll see registering dev connection; let's put a little kind of indicator, something to say hey, this is like an extra thing. So go over here and we'll say registering prod connection, and I want to give it some extra info, and let's hold off on what goes there just for a second. Okay, so we want to gather this up; we actually have to pass more information than this, and just to make recording how we connected a little bit easier, I'm going to create this dictionary where we set all these: username is this, password, server, port, authentication source is admin, authentication mechanism is SCRAM-SHA-1, ssl is use_ssl, and we have to say ignore the self signed certificate; if we don't do this, it will say your certificate is not valid.
Now PyCharm warns that this thing is basically missing from ssl but it's not, just ignore that.
So we're going to come over here, and we're going to do this as well; actually, let me change the order real quick. All of these are keyword arguments for this method, so we can just say **data, and that's going to basically set username = user, password = password and so on. Why did I put it like this? Because I'd like to capture all those details.
So let me just do really quick data of password equals this, and then I'll just print out this dictionary here so registering, production, connection with those details.
Okay, so if you pass a username or password in it's going to work differently, let's just make sure everything still runs can I list the cars, see the dev connection, yeah, excellent.
So things are still working well on the dev side of the story.
The next thing we've got to do is come over here where we're calling this, and let's just go ahead and pass in all the details here.
We wanted to use ssl that defaults to true, so that's all good.
Now if I run this, you're going to see not amazing stuff, so list is probably going to time out; it takes a while for it to time out actually, it tries to renegotiate the connection and it really doesn't want to crash, but eventually this is going to time out, we already saw we can't connect to the server from here.
So let me push this up to the git repository and then we'll get it on to the server and production and make sure everything works.
I pushed those changes to github and let's go over to the web server see I am already here, I'm just in my home directory/root so what I want to do is I want to go and get that code over here, so we're going to go and go to the github repository for our code here notice when I do a refresh, you should see that I just added now with production capabilities, so let's copy this, and let's say git clone this, its a public repository so I don't need any credentials or any of that business.
Okay, so things are good, we'll go to Mongo and notice there's a source and I have 09 deploy, so if we look in here, we've got service central deploy and service starter. The service starter is obviously what we started with, and service central deploy is the one that we just changed; so for example, if we look at this one you can see it's using this complicated version here, and if we look at this one, you can see we're setting up MongoDB just the way we like.
Okay, so now what we have to do is run it, and let's go over here, connect to the MongoDB server and say show dbs; hey, there's nothing here, so let's go and run this, so we've got our service deploy, so we'll say python3, we didn't use a … or change its execution state.
Now one thing we need is to install Mongoengine of course, so let's do this, we'll just let Python do it: we'll say python3 -m venv to create a virtual environment; here we need to apt install python3-venv, try again, so now we'll source the activate script and our prompt changes.
Okay good, so now we should be able to run our Python 3 thing again, oh yeah, well it's active, we still need to pip install Mongoengine and that'll take PyMongo along with it.
I believe that failed building a wheel because setuptools is out of date; anyway, it should still work.
Let's give this another shot, now we have Mongoengine registered in a virtual environment, a virtual environment is active, our code is here a lot of deployment stuff, let's go.
Oh, look at that, so now we're registering the production connection, I mean, you probably don't want to print this out all the time but notice the hosts, authentication, everything, it seemed to take it like the register worked we haven't tried to talk to the database yet, let's try to list the cars.
There are no cars, did that make a dent?
No, no dent yet.
Let's add a car this is going to be an F40, it's going to be built in 2010, that didn't crash, let's try to list the cars, look at that, let's add a service record, service that car.
The vin number is that, the price of the service is a thousand dollars and this is going to be new tires, the customer is extremely happy, loved it.
Now we've got our new tires, so look at this: show dbs, use demo dealership, show collections, db.cars.find().pretty(), bam, look at that, we were able to make our Python code on the right machine, with all the right settings and all the firewall rules and everything, go over and talk to the MongoDB server.
This is pretty excellent; we can go add another car, obviously at this point once you see it creating some documents and working to some degree, everything is going to work, right, there's really nothing to it. So this is excellent, let me just go create one more car so we have two things; this is going to be an Enzo and this was built very recently, let's list the cars and add a service record for it.
The Enzo needs some work, so for a 100 dollars that will be oil change, pretty happy, yeah, one more, the same car, this is going to be 250 tire rotation moderately happy, so let's go over here and do this again.
There we go, we've got our Enzo with two service histories our F40 with one service history and so on.
Okay, excellent, so it looks like this is working; I added this other record here so we have a little bit of data, because the next thing that we want to look at is how do we manage this, how do we connect our database management tools and backup things and whatnot to it.
As far as Python goes, this baby is rocking.
I guess maybe connect one more time, boom, list the cars, there they are, yeah looks good to me.
|
|
show
|
1:22 |
Let's review how we connected to our production server from Python and Mongoengine, here's how we connected before, we just said the alias is core and the name is dealership, and that was it, we just let everything else be the default and that worked fine when it was a wide open server on our local machine.
It didn't work so well for production, so we saw that we actually added this function that takes a whole bunch of different things: the password, the port, the server, whether or not to use ssl, the username, as well as the alias and db. I kind of broke this into two different functions that you can use for multiple connections in this example, but you could jam it all into one like I did in the project.
So I created this dictionary and we set additional things like the authentication source and mechanism, and told it to ignore the ssl certificate. I put it in the dictionary so it's easy to print out, like in my log: here is how I am connecting to the database server, so you know which machine you're talking to, how you're talking to it, what user you're talking as, things like that.
So if you want to just put all that data straight into register connection, fine, you could do that, but I find this to be valuable for sort of historical purposes. So here's how we connected, and in my real example I said we're going to use the old version if you don't pass in a user name or password or other things, but if you do, then I'll switch to this more advanced version here.
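Put together, the dual mode connection function described above might look roughly like this. It's a sketch under a few assumptions rather than the exact project code: the names `build_connection_settings` and `global_init` and the database name `demo_dealership` are stand-ins, and `ssl_cert_reqs=ssl.CERT_NONE` is the option for the PyMongo generation used in this course, so newer PyMongo releases may expect different TLS option names.

```python
import ssl


def build_connection_settings(user, password, port=27017,
                              server="localhost", use_ssl=True):
    """Gather the production connection details into one dictionary."""
    return dict(
        username=user,
        password=password,
        host=server,
        port=port,
        authentication_source="admin",
        authentication_mechanism="SCRAM-SHA-1",
        ssl=use_ssl,
        ssl_cert_reqs=ssl.CERT_NONE,  # ignore the self signed certificate
    )


def global_init(user=None, password=None, port=27017,
                server="localhost", use_ssl=True):
    """Register the 'core' connection in dev or prod mode."""
    # Imported here so the helper above can be used without mongoengine installed
    import mongoengine

    if user or password:
        data = build_connection_settings(user, password, port, server, use_ssl)
        mongoengine.register_connection(alias="core", name="demo_dealership", **data)
        data["password"] = "*************"  # don't log the real password
        print(" --> Registering prod connection: {}".format(data))
    else:
        print(" --> Registering dev connection")
        mongoengine.register_connection(alias="core", name="demo_dealership")
```

Calling `global_init()` with no arguments keeps the old local behavior, while passing a user, password, server and port switches to the production path.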
|
|
show
|
7:15 |
It's great that we have our MongoDB running in production we've got our web server and a MongoDB server and they're entirely locked down, we saw before if we try to connect to that Mongo server, even though it's on a different port with ssl and authentication, we can't talk to it because the Ubuntu firewall is blocking access from everywhere in the world except for that one fake web server thing.
So we can't talk to it, we can't for example manage it with Robomongo, which would be totally sweet, right, but we can't even connect to it via the shell, can we?
Well, we tried that and we saw it failed and if I do it again it will fail; but I can ssh into the Mongo server like this, we've seen that, so that's cool, now what can we do with this?
It turns out we can set up an ssh tunnel using that mechanism, so here we run this command: -f to run in the background, ssh over here, and map the local port, 10001, to localhost 10001 on the remote machine, like that.
So if we run this, it takes a second and it creates a background task of ssh tunneling one port locally over there. Now, what if we try this: we're going to run the same command we saw working in production, with authentication being this, here is the password, the admin and so on; notice there's no host up here, we just have the port 10001. What is the host if we don't put one? Localhost, but localhost 10001 really means the Mongo server 10001.
Let's do it.
Check that out, it's working okay, we can ask it how are you doing, how many things you got going on here, what is your host, this is what I was looking for, your host is the Mongo server, we're connected to the Mongo server, that's really cool.
Now we can say things like show dbs, we could come over here and say use that, we could even do our pretty find here, so db.cars.find().pretty(), and there's our data.
Okay, so we can access this just like we could in the shell on the server.
Well if we can get to it this way, maybe, just maybe something magical could happen with better tools.
And yes, yes it can; we'll create a connection, it's going to be a direct connection, I'll call this themongoserver, connect on localhost 10001, that part is good, authentication database is good, copy these over, paste that in, you can see this here, the mechanism is good, so this is all set. Come over and say use ssl, I've not tried to do anything else, let's test it; all right, under ssl we say use a self signed certificate, there we go, good. Alright, so we have themongoserver, I didn't test it, but let's go ahead and give it a shot anyway.
Authentication failure, okay let's go and edit that again, oh look at that, have a little space right there, how frustrating, couldn't somebody give me a trim, connecting, authorized, yes!
That is so awesome.
Okay, save the connection, now let's go over here, double click it's a little bit slow because hey, it's going over tunnels but look at that, if we go over here we got our cars, we can view the documents we have everything that you could have done before with Robomongo, you can do now, here's the two documents you saw me create in that Python section, oil change, tire rotation, Enzo Ferrari and so on.
And we can do things like, maybe we wanted service_history.price as an index; well, add an index, it's going to be service history price, and down here we'll say { 'service_history.price': 1 } like that, save, and now how about that. We could even do a little query, come down here and say service_history.price is, let's say, 100; this should return just one record, and it does, and if we say explain, all the stuff we were doing, does it work? You bet it does.
It's using that index that we just created remotely using Robomongo, so this is super cool, last thing let's see about doing a backup.
The next thing that I want to show you, which I don't think we've done before: let's go to our desktop here and make a directory called backtest, cd into it, notice it's there on the desktop and nothing is in it. So the last thing I want to do is show you how to use mongodump; you can go to help and see all of the things that this does, but we're going to use mongodump with basically all the same settings, down to here, we're going to go to demo dealership as we've named it, and the output is going to be into the working folder, which is this.
Because we're tunneled into the production machine we can go and grab that data from there and back it up locally, let's try.
Boom, we wrote two, a whopping two documents, but over here we have this; now the data comes out in this binary json that you can't really look at, we could technically look at it, but the point is this worked, we got our two documents. Now you might wonder, like, ok, that's cool for two documents, that kind of works, can you really do this for actual data? Yes, yes you can.
So I do something like this for Talk Python To Me and the training site, all these things, and I can back them all up in one giant script that does things along these lines, and it will back up about six million documents; I would say it probably takes less than a minute and a half over my pretty standard connection, and I'm on the West Coast of the US and that server is on the East Coast in Virginia, so it's not like I'm right next door and that's why it works.
So this actually works better than I expected it to work I guess, and it really is quite nice, so using this ssh tunnel means we never have to open up that port, but we can go through ssh and still work with our server, with all of our cool tools.
Over here, come back, which one do you want to work with— local or remote, remote one of course.
|
|
show
|
2:45 |
We've seen that we can use our ssh as a tunnel to give us access to our production MongoDB server without exposing too much of it.
So we can run this ssh command to the Mongo server and say map the local port 10001 over to the machine called localhost, from the Mongo server's point of view, on port 10001, which means we basically connect to our localhost and we're connected to the Mongo server on that side of things.
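As a rough sketch, here is the tunnel command assembled in Python. The helper name `ssh_tunnel_args` and the target `root@203.0.113.20` are placeholders for illustration, and the `-N` flag (do not run a remote command) is an assumption on my part, since the walkthrough only calls out `-f` and the port mapping.

```python
from typing import List


def ssh_tunnel_args(ssh_target: str, local_port: int = 10001,
                    remote_port: int = 10001) -> List[str]:
    """Build an ssh command that forwards a local port to the Mongo server.

    ssh_target is user@host for the Mongo server (a placeholder here).
    """
    return [
        "ssh",
        "-f",  # go to the background once connected
        ssh_target,
        # local_port on this machine -> localhost:remote_port as seen by the server
        "-L", f"{local_port}:localhost:{remote_port}",
        "-N",  # assumption: tunnel only, no remote command
    ]


if __name__ == "__main__":
    # 203.0.113.20 is a documentation-only address
    print(" ".join(ssh_tunnel_args("root@203.0.113.20")))
```

Once that tunnel is up, anything on your dev machine that connects to localhost on port 10001 is really talking to MongoDB on the server.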
Once we do that, we can go to things like Robomongo, and say I'd like to connect to here local host 10001, and the user name is whatever it is, the password is whatever it is, make sure you check perform authentication and also use ssl, check that, you want to use the self signed certificate if that's the way you did it you saw that it doesn't work, kind of blocked me when I said certificate because it is like no, no this is an invalid certificate we won't talk to the server, you decide how you do the certificate management but if you followed along exactly, you want to make sure you use self signed certificate and then ta-da, you are connected to the server just as if it was local there's a slight latency but like I said, once it spins up and starts moving it can actually ship a lot of data over that connection pretty quickly.
The other thing you might want to do is connect with the Mongo shell. We've already seen how to do that, and here's the command to do it once again. Remember, this is using port 10001 on my local dev machine, tunneling through the ssh tunnel back to the server; the same type of thing you'd type on the server basically goes right here, as long as you have that ssh tunnel running.
We can also use mongodump, which is one of the ways you can back up a database. It's the same command exactly, except we say --db dealership and the output is a local folder; that will dump out all that data, and like I said this actually works pretty well over that ssh tunnel, even for large amounts of data.
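Here is a hedged sketch of what that backup command could look like when built from a script. The helper name, the user name, the `admin` authentication database and the output folder are assumptions for illustration; `--db dealership` and port 10001 come from the discussion above. The `--ssl` and `--sslAllowInvalidCertificates` flags match the self signed certificate setup, but check them against the mongodump version you have installed.

```python
def mongodump_command(db_name, out_dir, user, port=10001):
    """Build the argv for a mongodump run through the local end of the tunnel."""
    return [
        "mongodump",
        "--host", "localhost",             # our end of the ssh tunnel
        "--port", str(port),
        "--username", user,                # no --password: mongodump prompts for it
        "--authenticationDatabase", "admin",  # assumption: the account lives in admin
        "--ssl",
        "--sslAllowInvalidCertificates",   # needed for a self signed certificate
        "--db", db_name,
        "--out", out_dir,
    ]


# Hypothetical user and folder names.
command = mongodump_command("dealership", "./backup", user="db_admin")
print(" ".join(command))
```

Running it is then a matter of `subprocess.run(command, check=True)` while the ssh tunnel is up.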
There you have it, we have our production MongoDB server up and running. We installed it through aptitude, which means if there's an update, security or otherwise, it will automatically, on apt upgrade, boom, take care of it for us, and that is really super nice. We've got over the wire encryption, we've got non default ports, we've got authentication, and we even saw how we can use ssh tunnels with our local database management tools, Robomongo, the Mongo shell and things like that, to get back into the production server and manage it from our local dev machine, without punching holes in the firewall for that purpose.
So all in all, I'd call this a pretty good setup.
|
|
|
19:12 |
|
show
|
1:08 |
There it is, the finish line!
That's right, you've made it all the way to the end of this course, I hope you found it super interesting and you've learned a lot, because I believe you now have enough to build production ready applications and deploy them based on MongoDB.
So really, the big question you need to be asking yourself is what are you going to build now, you have this amazing new power, this amazing new database, and way of writing data driven applications, what are you going to build?
I hope you take what you learned in this course, and you go build something amazing.
Now, before you do leave, and you go build that thing, let's talk about a few wrap up details; first of all, make sure you get the materials from the github repository, if you haven't already, go to github.com/mikeyckennedy/mongodb-for-Python-developers, the url is there at the bottom, and star this, and consider also forking it so you have a permanent version for yourself.
As far as I know, the git materials are entirely finished and published, there is a chance that somebody will find a small bug throughout the course and I'll amend that, so very likely what you see at this github repository is the final materials, it's certainly what you saw me create online during these videos.
|
|
show
|
1:01 |
Before we put the wraps on this course let's do a quick lightning review of each chapter that we've covered.
We're certainly not going to cover everything that we covered in the chapter, this is just a really quick review, but maybe the main takeaway from each chapter.
So we began the course by talking about what NoSQL is, and I think there's a little bit of a misunderstanding, or maybe multiple definitions, of what NoSQL means. Sometimes people say it's not only SQL, sometimes people say it means that there's no SQL, the language, involved in this.
Well, what we saw, looking at the history, is that back in 2009 this concept of NoSQL came about from a meeting of people working on horizontally scalable types of databases: what trade-offs do they make against relational databases so that they are more easily horizontally scalable, basically cluster friendly databases?
In that world it's not about whether or not there's SQL in the language, it's really about the style of databases and the trade-offs around how they work with that data.
|
|
show
|
1:46 |
The MongoDB shell and native query syntax; we saw that the MongoDB shell which you start by typing the word 'mongo' and it just runs the shell, tries to talk to the local one, there's all the different ways to get it to connect to different servers as we've seen.
So once it starts you get this little greater than prompt and you write Javascript. So we interact with MongoDB at the lowest level in Javascript, in a textual way, and actually this is converted to bson, a binary extended version of json.
So here we type something like db so this is the database we have active and book would be the collection name or table if you're still thinking relationally, but the collection name, and we say things like find or count or sort, or things like this and what we give it is this prototypical json object and what we get back are all the things that match the elements of that prototype.
So here you can see we got two records back and they both had the same title as the title we indicated here.
So it's very much about passing these prototypical json documents, however sometimes we have to do more than just say I want basically equality in my search, I would like to express things like greater than.
So this query here that we have written is actually doing a couple of very interesting things. Maybe the thing that stands out the most is this greater than operator: the dollar indicates an operator, and gte is the name of the greater than or equal to operator. So instead of just saying ratings.value is nine, we're saying I'd like all the ratings where the value is either equal to or greater than nine.
The other powerful and interesting thing here is we're actually traversing this hierarchy of the document we're going to find the ratings array which is a list of subdocuments which has a value as an integer, so we're actually reaching down inside that document and we're doing this query with this operator.
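To make that concrete, here is the same query document written as a Python dictionary, plus a tiny pure Python stand-in that mimics how the dotted path reaches into the array of subdocuments. The sample books and the helper names are invented for illustration, and the matcher is only a simplified emulation of the behavior described, not how MongoDB implements it.

```python
# Shell equivalent of the query document: {"ratings.value": {"$gte": 9}}
query = {"ratings.value": {"$gte": 9}}


def values_at(document, dotted_path):
    """Return every value reached by a dotted path, fanning out over lists."""
    current = [document]
    for key in dotted_path.split("."):
        next_level = []
        for item in current:
            # A list in the path means "any element", as with an array of subdocuments.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    return current


def matches_gte(document, dotted_path, minimum):
    """True if any value at the path is greater than or equal to minimum."""
    return any(value >= minimum for value in values_at(document, dotted_path))


# Made-up sample data.
books = [
    {"title": "Book A", "ratings": [{"value": 9}, {"value": 4}]},
    {"title": "Book B", "ratings": [{"value": 7}]},
    {"title": "Book C", "ratings": []},
]

(path, condition), = query.items()
hits = [b["title"] for b in books if matches_gte(b, path, condition["$gte"])]
print(hits)  # ['Book A']
```

Only one rating in the array has to satisfy the condition for the whole document to match, which is why Book A comes back even though it also has a rating of 4.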
|
|
show
|
3:26 |
Next up, we worked with PyMongo.
So we put our Javascript away, we said all right enough with the Javascript stuff, we're going to write in Python basically for the rest of this course.
So the lowest level way to talk to MongoDB from Python is with PyMongo.
So let's look at a couple of the crud operations here.
We'll start of course by importing the package, import pymongo, and if you don't have it just pip install it. Then we need to create a Mongo client by passing a connection string. I believe if you actually get a hold of the PyMongo connection you can use it directly, but you should not, because the Mongo client handles reconnects and connection pooling and stuff like that, whereas the connection itself wouldn't do those kinds of things.
Then if we want to work with the database, we have this sort of interesting, highly dynamic api: we go to the client and we just say . (dot) the name of the database, so we say client.the_small_bookstore, and we assign that to db so it looks like the rest of the shell stuff that we have been doing, but technically that's optional.
This database doesn't even have to exist, we could create the database in this style just by doing our first insert into it.
Whether or not it exists, we get hold of the database and now we can operate on the collections.
Let's imagine that in that database there's a collection called books and we want to know how many of them there are; we would just say db.books.count and that would actually go there and do this operation.
If it happens that either the database or the collection doesn't exist, it doesn't crash, you get zero.
We could also do a find_one. This line is notable because in the Javascript api it's findOne and they've made a Pythonic version here, find_one, so just be aware that it's not always a one to one, exact verbatim match from the native query syntax over to PyMongo.
We can also do an actual search. Before, we said find_one and basically got the first; here we're going to say I want to find a book by isbn and pass that over. Here we use Python dictionaries, which play the role of those prototypical json objects.
We also insert new data, so here we're going to say insert this thing which is a dictionary, it has a title called new book and an isbn of whatever is written there, and we get back this result. The result will have the object id in the field inserted_id, and we can go requery it and do all sorts of stuff with it.
Basically when we say insert one, we get this result which, if it succeeds has the inserted id.
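The operations above could be sketched roughly like this. The connection string and the helper function names are assumptions; `the_small_bookstore` and `books` come from the walkthrough. Note that current PyMongo releases spell the count as `count_documents`, since the older `count()` method was removed in PyMongo 4. The import lives inside `get_database` so the other helpers can be read and exercised on their own.

```python
def get_database(connection_string="mongodb://localhost:27017"):
    """Create a client and return the bookstore database (created lazily by MongoDB)."""
    import pymongo  # pip install pymongo

    client = pymongo.MongoClient(connection_string)
    return client.the_small_bookstore


def count_books(db):
    # An empty filter counts every document; a missing collection gives 0.
    return db.books.count_documents({})


def first_book(db):
    # find_one is the Pythonic spelling of the shell's findOne.
    return db.books.find_one()


def find_books_by_isbn(db, isbn):
    # A plain dict plays the role of the prototypical json document.
    return list(db.books.find({"isbn": isbn}))


def add_book(db, title, isbn):
    result = db.books.insert_one({"title": title, "isbn": isbn})
    # inserted_id holds the generated ObjectId of the new document.
    return result.inserted_id
```

With a server running you would call `db = get_database()` once and then pass `db` to the other helpers, for example `add_book(db, "New Book", "1234")` followed by `find_books_by_isbn(db, "1234")`.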
Now these are the straightforward crud operations, but we can also use our fancy in place operators. So here let's just insert this book, so we see what we get, and we grab a hold of the inserted id. Now suppose we want to add a field called favorited_by, and this is going to be a list, and we want the list to be basically distinct. We're adding the ids of the customers or people visiting our site who have favorited this book, and we'd like to put them in there, but there's no reason to have them in there twice; that can cause all sorts of problems.
We're going to use the dollar addToSet operator, so we run this, run it again for 1002, and hey, we could run it a second time for 1002, and what we'll end up with is an object that looks like this: the two things we inserted, the generated _id and this favorited_by list which has 1001 and 1002.
Definitely keep in mind these in place operators because they're very powerful and they leverage some of the special properties of the way MongoDB treats documents atomically.
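A minimal sketch of that in place update might look like the following. `favorite_book` assumes it is handed a PyMongo collection; `apply_add_to_set` is just a local, pure Python imitation of what the operator does to the document, included so the end result is easy to see without a server. Both helper names are made up for this example.

```python
def favorite_book(books, book_id, customer_id):
    """Record that a customer favorited a book, without duplicating the id."""
    return books.update_one(
        {"_id": book_id},
        {"$addToSet": {"favorited_by": customer_id}},
    )


def apply_add_to_set(document, field, value):
    """Local imitation of $addToSet: append only if the value is not there yet."""
    values = document.setdefault(field, [])
    if value not in values:
        values.append(value)
    return document


book = {"title": "New Book", "isbn": "1234"}
for customer_id in (1001, 1002, 1002):  # 1002 is applied twice on purpose
    apply_add_to_set(book, "favorited_by", customer_id)

print(book["favorited_by"])  # [1001, 1002]
```

The real operator does this check and update on the server as a single operation on the document, so the application never has to pull the list back, modify it and save it again.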
|
|
show
|
2:39 |
Next up was document design.
Some of the concepts and ideas of relational databases still apply here, you still are modeling data, you still put it into a database, but many of the techniques fall down, this whole concept of third normal form doesn't make nearly as much sense as it does in a relational database.
What we focus on more often is really how we make relationships, either between documents or within documents.
We saw the primary question, not the only one, but the most challenging one, the one you have to think most carefully about is to embed or not to embed, and I gave you a few rules or tips to help you guide this decision.
One: is the embedded data wanted and used 80 percent of the time or more, most of the time when you get that containing document?
If that's true, you probably want to embed, if that's false, maybe consider that as a warning sign not to.
How often do you want the embedded document without the outer containing document?
If what you often really want is to get access to these little inside pieces on their own, embedding adds a lot of overhead and really kind of complicates the way you access them through your application.
Is the embedded data a bounded set?
Remember, these documents can be at most sixteen megabytes, and that number is way higher than you really want them to be. If this is an unbounded set that you're going to continue to add to, it could very easily outgrow the size that you're allowed to store.
Really for performance reasons though, is it a bounded set and is that set small?
Because if you put huge amounts of data in there, you're going to really slow down your read time for these database operations that involve this document.
These are the four main rules here. You also want to consider how your application accesses this data. It might be really easy to answer these four questions because there's a very constrained and small set of queries you run against your database; or it could be that you ask all sorts of questions in highly varied ways, in which case it's harder to answer them. The more types of queries you have, the harder it is to know whether most of the time you want the embedded data, for example.
The more varied your queries are, the more you'll trend towards third normal form, relational style and less embedding.
One of the situations where you have lots of varied queries is if you have this thing called an integration database, which we talked about sort of sharing a database across different applications, versus having one dedicated to a particular application where you can understand these questions very clearly.
So when you're designing these documents you want to really think most carefully about do you want to embed this data or create a soft foreign key type of relationship.
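As a small illustration of the two choices, here are the same data modeled both ways as plain Python dictionaries. The field values and the `history_for` helper are invented for the example; the point is only the shape of the documents.

```python
# Option 1: embed. The service records live inside the car document.
embedded_car = {
    "_id": 1,
    "make": "Ferrari",
    "service_history": [
        {"description": "Tire rotation", "price": 40.0},
    ],
}

# Option 2: soft foreign key. Records live in their own collection
# and point back at the car by its _id.
referenced_car = {"_id": 1, "make": "Ferrari"}
service_records = [
    {"car_id": 1, "description": "Tire rotation", "price": 40.0},
    {"car_id": 2, "description": "Oil change", "price": 55.0},
]


def history_for(car, records):
    """With a soft foreign key, the application does the 'join' itself."""
    return [record for record in records if record["car_id"] == car["_id"]]


print(embedded_car["service_history"][0]["description"])               # Tire rotation
print(history_for(referenced_car, service_records)[0]["description"])  # Tire rotation
```

With embedding, one read gives you the car and its history together. With the soft foreign key, the history is a second query, but the records can be fetched on their own and the car document stays small.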
|
|
show
|
3:56 |
After we talked about document design and we talked about the raw access from PyMongo we said let's take this up a level of abstraction, let's actually build classes and map those over ORM style into MongoDB.
We saw a really nice way to do that is with the ODM called MongoEngine.
Let's review the main way that we sort of define classes and add constraints and things like that.
Over here we are going to create this car object, this is our dealership example and we are going to store the car in the database.
The way we create something that MongoEngine can manage in MongoDB as a top level document is to derive from mongoengine.Document.
And then every field is going to be one of these fundamental field types, like StringField, IntField, FloatField and so on.
And we can have some of them required, the first three are required, and we can have some of them with basic default values, like mileage defaults to zero. But we can also have interesting functions: for example the vin number is automatically generated, and we're basing this on the uuid4 random alphanumeric thing. What we have here so far is really sort of equivalent to what you might have in a traditional relational database: there's an entry and there is a flat set of what you would call columns. This is only part of the story; remember we can have nested documents, we can actually have a rich hierarchy of nested objects.
One thing we might want to store in the car is an engine, and the engine itself is a special type. Here in the car the field is going to be an embedded document field, and the engine derives from mongoengine.EmbeddedDocument, not Document, EmbeddedDocument.
These we're never going to directly insert into the database, in fact, we're going to always put them into a car, so this is like a strong relationship between a car and its engine, we can even mark it as required.
Now going a little further than that, our service history actually contains a list of subdocuments, each one modeled by the service record.
The service record has things like the customer satisfaction, what service was performed and so on.
Now if we take this, put some appropriate data into it and store it, we'll get something looking along the lines of this, in our document database in MongoDB, so here we have the first few elements that are just the flat fields and then we have the nested engine, one of them, we have the nested array of nested items for the service histories, and this really gets at the power of MongoDB, this nesting and these strong relationships where you get this aggregate object the car, that always contains everything we need to know about it.
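To picture the stored result, here is a plain dictionary stand-in for that document shape, built with the defaults described above (mileage of zero, a vin generated from uuid4). This is not MongoEngine code, just an approximation of the document it would save. The engine fields `horsepower` and `liters`, the choice of model, make and year as the three required fields, and both helper names are assumptions for illustration.

```python
import uuid


def new_car_document(model, make, year, horsepower, liters):
    """Build a dict shaped like the stored car: flat fields plus nested parts."""
    return {
        # Flat fields, like columns in a relational table.
        "model": model,
        "make": make,
        "year": year,
        "mileage": 0.0,                 # simple default value
        "vin_number": uuid.uuid4().hex, # generated default, random alphanumeric
        # One nested subdocument.
        "engine": {"horsepower": horsepower, "liters": liters},
        # An array of nested subdocuments, empty until the car is serviced.
        "service_history": [],
    }


def add_service(car, description, price, customer_rating):
    """Append one service record subdocument to the car."""
    car["service_history"].append(
        {"description": description, "price": price, "customer_rating": customer_rating}
    )


car = new_car_document("F40", "Ferrari", 1990, horsepower=471, liters=2.9)
add_service(car, "Oil change", 55.0, customer_rating=5)
print(sorted(car))
```

Everything about the car, including its engine and its whole service history, travels together as one aggregate.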
How about querying? We're not going to write in the low level api now, we're going to use basically the properties of these objects.
Here's the function that we wrote where we wanted to ask the question: what percentage of cars have a bad customer rating, that would be average or below? So we go to the car and we say objects. We could chain lots of these, objects.filter.filter.filter, but if you just have one query you can stick it right in objects. So we say objects with service_history; now we can't say dot here, because service_history.customer_rating would not be a valid variable name or parameter name in Python, so we traverse the hierarchy with a double underscore.
We also might want to apply one of the operators, in this case we're going to say less than 4, so we again use this double underscore, but in this case on the left is the name of the target and on the right is the operator we're going to apply to it.
You don't put the dollar in again, that wouldn't be valid in Python, just double underscore lt, __lt, and then we can ask things like count, or go and get the first one, or things like that.
We can even do paging by slicing on that result.
This syntax lets us use almost the entire spectrum of ways of querying MongoDB, really straightforwardly and in a way that ties back to the car object that we defined.
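To show how the double underscore form lines up with the native query syntax, here is a simplified translator written for this transcript. It is an illustration of the naming convention only, not MongoEngine's own code, and the operator list is a small hand picked subset.

```python
# In MongoEngine the query described above would read roughly:
#   Car.objects(service_history__customer_rating__lt=4).count()

# Small, hand picked subset of operators for the illustration.
OPERATORS = {"lt", "lte", "gt", "gte", "ne", "in", "nin"}


def to_raw_query(**filters):
    """Translate double underscore keywords into a native style query document."""
    raw = {}
    for keyword, value in filters.items():
        parts = keyword.split("__")
        if parts[-1] in OPERATORS:
            # The right-most piece is the operator; it gets its dollar back.
            operator = "$" + parts.pop()
            raw[".".join(parts)] = {operator: value}
        else:
            # No operator means plain equality.
            raw[".".join(parts)] = value
    return raw


print(to_raw_query(service_history__customer_rating__lt=4))
# {'service_history.customer_rating': {'$lt': 4}}
print(to_raw_query(make="Ferrari"))
# {'make': 'Ferrari'}
```

The double underscores become the dots that traverse the hierarchy, and a trailing operator name becomes the dollar operator from the shell syntax.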
|
|
show
|
2:34 |
At this point, we pretty much had MongoDB doing everything we needed it to do, and we'd heard MongoDB was fast, but it turned out it didn't really seem to be behaving as quickly as maybe we hoped, we put a ton of data from our dealership in there, and we were getting query times of like one second, 700 milliseconds, stuff like that.
It was okay, but really, we saw it can do much better.
What levers and knobs do we have to turn to make this faster?
The most important one, even more important than in relational databases, is indexes, and we saw that MongoEngine, PyMongo and the shell all have really good ways to deal with this.
Document design is really important, mostly around this embedding question but there are many ways to think about document design, there's a lot of really non intuitive and powerful patterns, design patterns you can apply here.
What is your query style? Maybe one query is better than another, and you can use projections to only pull back a subset of each response. Suppose we have a car that has a ton of those service histories and we don't care about them for a particular query: we could suppress returning those from the database, which saves us a lot of bandwidth on the network, disk reads on the database server and deserialization processing on our side.
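As a sketch of the projection idea with PyMongo, the second argument to `find` says which fields to leave out. The function name and the `make` filter are just for illustration; `cars` is assumed to be a PyMongo collection holding the dealership documents.

```python
def find_cars_without_history(cars, make):
    """Query cars by make, asking the server not to send service_history."""
    query = {"make": make}
    # 0 means "exclude this field"; everything else is returned as usual.
    projection = {"service_history": 0}
    return cars.find(query, projection)
```

On the MongoEngine side, I believe the query set's `only(...)` and `exclude(...)` methods cover the same ground, for example `Car.objects(make="Ferrari").exclude("service_history")`.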
We also saw there are some network topology things we can do, replication and sharding, and those are both interesting and powerful but not part of this course, so go check that out on your own if you're interested.
For indexes, we took an example like our car and said let's suppose we're interested in querying by make and by service history. If you look below, service history is defined as a list of service record objects, and they have a description, a customer rating, a price and things like this. Our goal is to query these things, the make, the service history and so on, quickly, and we saw that adding an index is a really powerful way to do that. All we've got to do is go to our meta element and say these are the indexes, as an array. These indexes can simply be the name of the thing, like make, that's super straightforward. They can traverse the hierarchy using the Javascript style, using the dot, so we say service_history.customer_rating, and that lets us do queries deep into these cars and find the ones with either good or low customer ratings. We can even do composite indexes: here we have a composite index on price and description within the service history. We do that with this fields dictionary thing, where fields is an array. So you can use the simple version, or if you need to, you can give a more complex definition of the index there.
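Written out, the index list described above would look something like this. In MongoEngine the `meta` dictionary sits inside the document class as a class attribute; here it is shown on its own so the shape is easy to see. The `index_fields` helper is made up for this example and just reads the two spec styles back.

```python
# Inside the Car class this would be the class attribute `meta`.
meta = {
    "indexes": [
        "make",                             # simple: just the field name
        "service_history.customer_rating",  # dotted path into the subdocuments
        {                                   # composite index via a fields list
            "fields": [
                "service_history.price",
                "service_history.description",
            ]
        },
    ]
}


def index_fields(spec):
    """Return the list of field paths that one index spec covers."""
    return [spec] if isinstance(spec, str) else list(spec["fields"])


for spec in meta["indexes"]:
    print(index_fields(spec))
```

Note that the index paths use dots, the native style, while queries in Python code use the double underscore form.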
|
|
show
|
2:20 |
After we had everything we needed for our database up and working, our code was working, we said time to put this puppy in the cloud and let people access it, so we talked about deployments.
Now, there's a couple of things we could do. If you go to the MongoDB website and you pick the Linux deployment and pick your distribution, it actually has a lot of really clear steps, like these are the steps it takes to use your package manager on Linux to get MongoDB installed. I recommend you use the package manager because then you get automatic updates and things like that, it's really nice.
However, we also talked about the ways in which MongoDB is maybe going to put you at risk, let's say if you don't know what you're doing about configuring it, so if you configure it to just listen on the open internet without say authentication, you are just asking for some sort of punishment, so there's a couple of things that we went through, a very detailed set of here is how you limit network access on Ubuntu, here is how you enable encryption, here's how you enable authentication, and so on, so the checklist we went through was, first thing to do is limit network exposure.
That was a couple of things. One, we set up the firewall on Ubuntu; if you want to use a cloud provider's firewall, that's fine as well. So we set up the firewall, we listened on a non default port, which we blocked with the firewall, and then we let the few servers in the world that needed to talk to it back in by explicitly allowing those ip addresses.
We enabled access control by creating an account and going into the configuration and enforcing authentication, saying it's required. We added encrypted communication by creating a self signed ssl certificate and adding that in there. You may consider adding encryption at rest as well, so the actual stuff on disk is encrypted; we didn't go into that, it wasn't really necessary for what we were doing.
You could audit what's happening on your server, we didn't talk about that but it's pretty straightforward, we also talked about how you can run backups, I mentioned that you can do replication and some of these live backups but you can also use Mongodump for reasonably small data, not terabytes type of data but gigabytes, and that works pretty well as well, we saw that we can even do that over our ssh, so back up, back up, back up.
Here's the whole security checklist that we talked about; you can go through and read all the ways to do it, or just go back and look at the various steps in the previous chapter's videos.
|
|
show
|
0:22 |
That's it!
I want to say thank you, thank you, thank you, I really appreciate you taking my course, I hope you learned a lot, and I hope you found it valuable and enjoyable, and generally just had a great time.
I also hope you go build something amazing with MongoDB, if you do, send me a message, either here on twitter at @mkennedy, or visit the website and send us an email and tell us all about it.
Until then, take care and thanks again.
|